Adding URLs to your extractor


In this section we will discuss the different ways of adding URLs to your extractor. Before we start, we need to make the distinction between adding training URLs and adding URLs to the extractor itself.

Training vs. regular URLs

You add training URLs from within the editor to teach the extractor what data you want to extract. Adding URLs to the extractor itself is different: you give the extractor the list of URLs of the webpages you want to extract data from. The workflow is:

  1. Create your extractor with a URL and select the data you want.
  2. Add a few more similar pages from the same website as training URLs in the editor to teach your extractor how to extract data from that website (between 2 and 5 is usually enough, depending on the website).
  3. Save your extractor and return to the dashboard to add more URLs to it.

Remember, even when pages within a website look similar, the underlying structure of the pages might be different. This is why adding training URLs will increase the success rate of your extractor.

Adding URLs manually

Adding URLs is as simple as copying and pasting them into the URL list. You can paste multiple URLs at the same time as long as they are on separate lines or separated by commas. Paste them into the URL list box in the dashboard.
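If you prepare your URL list in a script before pasting it, the separator rule above (new lines or commas) can be sketched as follows. This is only an illustration of the rule, not Import.io's own code; the function name split_pasted_urls is hypothetical, and the sketch assumes that none of your URLs contain a literal comma.

```python
import re

def split_pasted_urls(pasted: str) -> list[str]:
    """Split pasted text into URLs on commas or line breaks, dropping blanks.

    Hypothetical helper: assumes no URL contains a literal comma.
    """
    parts = re.split(r"[,\n]+", pasted)
    # strip() also removes stray spaces and carriage returns around each URL
    return [part.strip() for part in parts if part.strip()]

pasted = """https://www.kiva.org/lend/1
https://www.kiva.org/lend/2, https://www.kiva.org/lend/3"""

for url in split_pasted_urls(pasted):
    print(url)
```

Checking a list this way before pasting makes it easier to spot blank entries or two URLs that have run together.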

Adding URLs through another extractor

You can use list extractors to collect the link URLs on a page. These can then be used as the basis for your extraction. Take the BBC news extractor I created earlier: I can create another extractor that builds a list of the links on the current front page of the BBC website, and then instruct Import.io to use the URLs directly from that extractor.

To do this, click the box at the top of the settings page with the text "an explicit list of URLs" and select "URLs from another extractor".

You will then be asked for the name of the extractor you wish to take URLs from, and which column the links are in. Once you have entered these, you can run the extractor with the "run URLs" button. This setup is very useful: if I ran the BBC front page links extractor each day, it would generate a different set of URLs each time, letting me see what has been on the BBC front page each day without ever having to go to the site. For more information on how to link extractors, see the Linking extractors (chain) page.

Using the URL Generator

The URL Generator is the quickest way to generate multiple URLs from a pattern in the URL. For example, a site will often include a page number in its URLs, and with the generator you can vary that page number. To use the URL Generator, click "show URL generator" in the middle of the settings tab.

This brings up a new box where you enter the URL you want to change. In this example I am going to use https://www.kiva.org/lend/1. To use the generator, select the part of the URL you want to change. In this URL the 1 corresponds to a Kiva page, so to access other similar pages I need to change the 1, and so I highlight it.

The number 1 has now become parameter 1, so we can change it using the controls on the right. In this case I am going to get pages 1 to 100. To do this I simply change the box to 1-100 and the new URLs are generated. The step box sets how much to add to each number to create the next URL, so with a step of 5 the URL after www.kiva.org/lend/1 would be www.kiva.org/lend/6.
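The range and step behaviour described above can be sketched in a few lines of Python. This is a minimal illustration rather than how the URL Generator is implemented; generate_range_urls is a hypothetical name, and the sketch assumes the end of the range (100 here) is included.

```python
def generate_range_urls(template: str, start: int, end: int, step: int = 1) -> list[str]:
    """Fill the {} placeholder in template with start, start + step, ... up to end.

    Hypothetical helper: assumes the end value is inclusive.
    """
    return [template.format(n) for n in range(start, end + 1, step)]

# Pages 1 to 100, one URL per page
urls = generate_range_urls("https://www.kiva.org/lend/{}", 1, 100)

# Same range with a step of 5: lend/1, lend/6, lend/11, ...
stepped = generate_range_urls("https://www.kiva.org/lend/{}", 1, 100, step=5)

print(len(urls))
print(stepped[:3])
```

The {} placeholder plays the role of the highlighted part of the URL, and start, end and step correspond to the range and step boxes.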

You can also switch from a range of numbers to a list of values by clicking the "range of numbers" drop-down box. This lets you vary text in the URL. For example, in my Kiva example I could change the text from "lend" to "fund" by making "lend" the parameter and then adding "fund" to the list of values.

Note: the values in a list of values must be separated by commas. You can also use more than one parameter at the same time.
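To show how a list of values and a numeric range might work together as two parameters, here is a small sketch. It assumes that multiple parameters are combined in every possible combination, which the text above does not state, so check the generated URLs in the tool itself. The function name generate_urls and the placeholder names section and page are hypothetical.

```python
from itertools import product

def generate_urls(template: str, **params) -> list[str]:
    """Build one URL per combination of parameter values.

    Hypothetical helper: assumes parameters combine as a Cartesian product.
    """
    names = list(params)
    return [
        template.format(**dict(zip(names, combo)))
        for combo in product(*params.values())
    ]

# A list of values is entered as comma-separated text
sections = "lend,fund".split(",")

urls = generate_urls(
    "https://www.kiva.org/{section}/{page}",
    section=sections,
    page=range(1, 4),  # pages 1 to 3
)

for url in urls:
    print(url)
```

With two values and three pages this produces six URLs, so the number of generated URLs grows quickly as parameters are added.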

For more information on URL generation or linking extractors, visit the URL Generator or Linking extractors (chain) pages.
