Dealing with webpage differences
This section covers how to deal with differences between webpages. When you use the extractor, webpages on the same site sometimes have slightly different formats. The solutions below describe how to collect data from webpages with multiple formats.
Solutions for extracting from multiple pages
There is no single step that solves extraction from multiple pages. It usually takes some trial and error, because the right fix depends on how the pages differ. We therefore suggest several solutions, any of which may solve the problem.
Train additional pages
The first method for dealing with multiple pages is to train additional pages. By training on additional URLs, the extractor becomes better able to identify the differences between pages, which enables it to extract different types of pages.
Adding in new columns
Another method for dealing with multiple pages is to add a new column and train it on the webpages that aren't working correctly. For example, you might have two columns that both capture the date, each trained on a different page format.
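If you end up with two columns holding the same kind of data, you will probably want to merge them after export. The following is a minimal sketch of that clean-up step, assuming the extracted rows have been loaded as a list of dictionaries. The column names `date`, `date_alt` and `published` are hypothetical and not part of the extractor itself.

```python
def coalesce_columns(rows, primary, fallback, target):
    """Merge two columns into one, preferring the primary column.

    `rows` is a list of dicts (one per extracted page). The column
    names passed in are illustrative assumptions, not extractor output.
    """
    merged = []
    for row in rows:
        # Use the primary value if it is non-empty, otherwise the fallback.
        value = (row.get(primary) or "").strip() or (row.get(fallback) or "").strip()
        # Copy every other column unchanged.
        new_row = {k: v for k, v in row.items() if k not in (primary, fallback)}
        new_row[target] = value
        merged.append(new_row)
    return merged


rows = [
    {"url": "page-a", "date": "2015-01-02", "date_alt": ""},
    {"url": "page-b", "date": "", "date_alt": "3 March 2015"},
]
result = coalesce_columns(rows, "date", "date_alt", "published")
```

Here each row ends up with a single `published` column, filled from whichever of the two trained columns matched that page.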
Simplifying your data selection
When selecting data in the editor, try selecting a more general section rather than a very specific one. This can mean that extra information is included in your extraction, but it can often be separated out when you edit the data afterwards. One way to do so is the Text to Columns command in Excel. Another is to change the column's output using regular expressions.
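To illustrate the regular expression approach, here is a minimal sketch that splits one broadly selected cell into two fields outside the extractor. The "Posted by ... on ..." text format, the pattern and the field names `author` and `date` are assumptions made for the example; your own pages will need their own pattern.

```python
import re

# Hypothetical pattern for a cell such as "Posted by Jane Doe on 12 March 2015".
PATTERN = re.compile(r"Posted by (?P<author>.+?) on (?P<date>\d{1,2} \w+ \d{4})")


def split_cell(text):
    """Split a combined cell into author and date fields.

    Returns empty strings for both fields when the text does not match,
    so that unmatched pages are easy to spot and review.
    """
    match = PATTERN.search(text)
    if match is None:
        return {"author": "", "date": ""}
    return match.groupdict()


fields = split_cell("Posted by Jane Doe on 12 March 2015")
```

Returning empty fields for text that does not match, rather than raising an error, keeps the rest of the rows usable while you work out which page format was missed.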
Using more than one extractor
Sometimes the simplest way to deal with different pages is to create a new extractor and run those pages through it. This often works when similar pages consistently fail to extract: creating a dedicated extractor for those pages may solve the problem.
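When you use more than one extractor, it helps to sort your URLs first so that each list goes to the extractor trained for it. The sketch below groups URLs by path prefix. The prefixes and the extractor names (`blog_extractor`, `news_extractor`, `default_extractor`) are hypothetical labels for the example, not names the tool provides.

```python
from urllib.parse import urlparse

# Hypothetical mapping of URL path prefixes to extractor names.
ROUTES = [
    ("/blog/", "blog_extractor"),
    ("/news/", "news_extractor"),
]
DEFAULT = "default_extractor"


def group_urls(urls):
    """Group URLs by the extractor that should handle them."""
    groups = {}
    for url in urls:
        path = urlparse(url).path
        # Pick the first route whose prefix matches, else the default.
        name = next((n for prefix, n in ROUTES if path.startswith(prefix)), DEFAULT)
        groups.setdefault(name, []).append(url)
    return groups


groups = group_urls([
    "http://example.com/blog/first-post",
    "http://example.com/news/update",
    "http://example.com/about",
])
```

Each resulting list can then be run through its own extractor, with anything unrecognised falling back to the default one.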