Golden Pages Scraping: Create your First Web Scraper to Extract Data from a Web Page

Important Note: The tutorials you will find on this blog may become outdated with new versions of the program. We have now added a series of built-in tutorials in the application which are accessible from the Help menu.

You should run these to discover the Hub.

Find a simple but more up-to-date version of this tutorial here

This tutorial was created using version 0.8.2. The Scraper Editor interface changed a long time ago: many more features have been added and some controls now have new names. The following can still be a good complement for getting acquainted with scrapers. The Scraper Editor can now be found in the ‘Scrapers’ view instead of ‘Source’, but the principle remains fundamentally the same.

In many cases, the automatic data extraction functions (tables, lists, guess) will be enough, and you will manage to extract and export the data in just a few clicks.

If, however, the page is too complex, or if your needs are more specific, there is a way to extract data manually: create your own scraper.

Scrapers will be saved to your personal database and you will be able to re-apply them on the same URL or on other URLs starting, for instance, with the same domain name.

A scraper can even be applied to whole lists of URLs.

You can also export your scrapers and share them with other users.

Let’s get acquainted with this feature by creating a simple one.

1. Launch OutWit Hub

2. Choose the Web Page to Scrape

Let’s use this example of an HTML list: http://www.outwit.com/support/help/hub/tutorials/GrabDataExample1.html

Type the URL in the address bar.

In our present example, the data could be extracted simply by using the ‘Lists’ view in the data section.

    If you don’t see anything in the list view, reload the page.

    In the ‘Lists’ view, like in most other views, right-clicking on selected rows gives you access to a wealth of features to edit and clean the data.

If the data, as extracted in the list view, is not structured enough for your needs you will have to create a customized scraper for this page.

The Scraper Editor is on the right side of the ‘Source’ view, with the colorized HTML source of the page.

The text in black is the content actually displayed on the page. This colorization makes it very easy to identify the data you are interested in.

Building a scraper simply means telling the program what comes immediately before and after the data you want to extract, and/or what format that data has.
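To make the idea concrete outside the program, here is a minimal Python sketch of marker-based extraction. It is not how OutWit Hub is implemented; the function name extract_between and the sample HTML are invented for illustration.

```python
def extract_between(source: str, before: str, after: str) -> list:
    """Return every piece of text found between a 'before' and an 'after' marker."""
    results = []
    pos = 0
    while True:
        start = source.find(before, pos)
        if start == -1:
            break  # no more 'before' markers
        start += len(before)
        end = source.find(after, start)
        if end == -1:
            break  # 'before' marker without a matching 'after' marker
        results.append(source[start:end].strip())
        pos = end + len(after)
    return results


# Hypothetical two-line list, NOT the actual tutorial page.
sample = "<li>Paris (48.85, 2.35)</li>\n<li>Tokyo (35.68, 139.69)</li>"

cities = extract_between(sample, "<li>", "(")   # marker before: <li>, marker after: (
coords = extract_between(sample, "(", ")")      # marker before: (,    marker after: )
print(cities)
print(coords)
```

Each column of the result corresponds to one pair of markers, which is roughly what each row of the Scraper Editor describes.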

So let’s create a scraper for this list.

Click on ‘New,’ type in the URL of the page and a name for your new scraper.

Fill the cells with the most logical markers you find around the different pieces of data (don’t look below for the solution… your computer is watching and you would lose ten points).

Your first version should logically look like this:

Hit ‘Save,’ and that’s it! You are ready to run your first scraper.

If you now go to the ‘Scraper’ view and hit refresh, the results are there.

They are not bad… but not totally satisfying:

The first row contains text instead of the Coordinates, and the City is missing.

Another look at the source code explains it. The opening parenthesis (, which is used as the Marker Before Coordinates, also appears in a comment hidden in the source code.

You must, therefore, be a little more precise and define the format of the first character that must be found after the marker.

Here, a good way is to use the Regular Expression syntax in the Format field. RegExps can become pretty tricky if you need to find complex patterns, but here, what you want to say is simple: “a string that starts with a digit”.

For this, type \d.+ in the Format field: a digit (\d) followed by a series of one or more characters (.+).
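The effect of such a format can be tried out in Python, whose regular expression syntax also understands \d.+. The sketch below is only an approximation under assumptions: it treats the format as a filter applied to each candidate found between the markers, which may differ from what the Hub does internally, and the sample source (including the hidden comment) is made up.

```python
import re


def extract_between(source, before, after, fmt=None):
    """Collect text between two literal markers, optionally keeping only
    the values that entirely match the regular expression `fmt`."""
    pattern = re.compile(re.escape(before) + r"(.*?)" + re.escape(after), re.DOTALL)
    values = [m.group(1).strip() for m in pattern.finditer(source)]
    if fmt is not None:
        values = [v for v in values if re.fullmatch(fmt, v)]
    return values


# Invented source: the comment contains a parenthesis, like in the tutorial.
sample = (
    "<!-- (figures are approximate) -->\n"
    "<li>Paris (48.85, 2.35)</li>\n"
    "<li>Tokyo (35.68, 139.69)</li>"
)

without_format = extract_between(sample, "(", ")")
with_format = extract_between(sample, "(", ")", fmt=r"\d.+")
print(without_format)  # the comment text is picked up as a first row
print(with_format)     # only values starting with a digit remain
```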

Hit Save.

Back to the scraper view, the new result is pretty good.

    Reload to see the updates.

One last problem, though: the first city took its continent along with it…

Let’s have a look at the source code one last time.

<li>, our Marker Before City, also appears before the continent.

A simple fix here is to select all the characters between the beginning of the line and the city name, and copy them into the scraper editor. This makes the marker more specific, and it will keep working because all cities are at the same indentation level.
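Why the longer marker helps can be shown with a small Python sketch. The nested list below is a guess at the general shape of such a page, not a copy of the example page, and the six-space indentation is an assumption chosen for the illustration.

```python
import re


def extract_between(source, before, after):
    """Return the raw text found between two literal markers."""
    pattern = re.compile(re.escape(before) + r"(.*?)" + re.escape(after), re.DOTALL)
    return [m.group(1) for m in pattern.finditer(source)]


# Invented structure: a continent item that contains a nested list of cities.
sample = (
    "<ul>\n"
    "  <li>Europe\n"
    "    <ul>\n"
    "      <li>Paris (48.85, 2.35)</li>\n"
    "      <li>Rome (41.90, 12.50)</li>\n"
    "    </ul>\n"
    "  </li>\n"
    "</ul>"
)

# Marker '<li>' alone: the first match starts at the continent's <li>.
loose = extract_between(sample, "<li>", " (")

# Marker including the indentation: only the city items qualify.
strict = extract_between(sample, "      <li>", " (")

print(loose)
print(strict)
```

With the short marker, the first value begins at the continent and runs on to the first city; with the indented marker, each value is just the city name.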

Our final scraper looks like this:

    Don’t forget to hit ‘Save’. We did it!

OK, the present example is not all that exciting and the figures are already out of date. It would almost be faster to do the 15 rows manually.

But what if the data filled 20 pages and we decided to update the population figures tomorrow?

Better: what if the data was changing every morning, like job ads, sport results, or stock market indices?… No problem, you would simply re-apply your new scraper.
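The idea of re-applying one scraper to many pages can be sketched in Python as follows. Everything here is hypothetical: the SCRAPER dictionary is not the Hub's storage format, the example.com URLs are placeholders, and the page sources are in-memory strings standing in for pages that would normally be downloaded.

```python
import re

# Hypothetical scraper definition: field -> (marker before, marker after, format).
SCRAPER = {
    "City": ("<li>", " (", r".+"),
    "Coordinates": ("(", ")", r"\d.+"),
}


def apply_scraper(scraper, source):
    """Return {field: [values]} for one page source."""
    result = {}
    for field, (before, after, fmt) in scraper.items():
        pattern = re.compile(re.escape(before) + r"(.*?)" + re.escape(after), re.DOTALL)
        values = [m.group(1).strip() for m in pattern.finditer(source)]
        # Keep only the values that entirely match the field's format.
        result[field] = [v for v in values if re.fullmatch(fmt, v)]
    return result


# Invented stand-ins for the same page fetched on two different days.
pages = {
    "https://example.com/cities?day=1": "<li>Paris (48.85, 2.35)</li>",
    "https://example.com/cities?day=2": (
        "<li>Paris (48.85, 2.35)</li>\n<li>Tokyo (35.68, 139.69)</li>"
    ),
}

# The same definition is applied, unchanged, to every page.
results = {url: apply_scraper(SCRAPER, html) for url, html in pages.items()}
for url, data in results.items():
    print(url, data)
```

The scraper is written once; only the list of pages changes from one run to the next.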

Source: http://blog.outwit.com/?p=55

Friday, 13 December 2013
