Dealing with infinite scroll

What do we mean by "infinite scroll"?

On some websites, the only way to see more items in a list is to scroll (with your mouse or trackpad) to the bottom of the list. There are no links to the different pages; more items only appear when you scroll down the page.

What's the problem?

Infinite scroll can be tricky because the URL generally remains static (it doesn't change, even when you're on a different page). Websites also handle this in structurally different ways, so it's not always possible to get around it, but here are some tips and tricks for working with pages that use infinite scrolling.

The key is to find the underlying URL pattern for the different pages or pagination, even if it is not explicit in the URL.

Example 1

Let's take this page with infinite scroll for example: http://www.pinko.com/en-gb/catalog/index/springsummer.

Step 1. Clear the network tab

First, before you scroll down, right-click the page, choose Inspect and click on the Network tab. Then, clear any existing activity by hitting the clear button next to the red circle on the left-hand side.

Step 2. Scroll down and identify items with "xhr"

Now, scroll down on the page until more items appear and look for an action with the type "xhr".

Step 3. Click the xhr item and select the headers tab.

The Headers tab is on the right-hand side.

You can see that when you scroll down the page, the site makes a GET request to:

http://www.pinko.com/en-gb/catalog/index/springsummer?pg=4
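If there is a lot of activity in the network tab, it can help to save it as a HAR file and filter it with a short script. The sketch below is only an illustration: it works on a small hand-written sample rather than a real export, the image URL is made up, and it assumes each entry carries a "_resourceType" field, which is a browser-specific addition and may not be present in every export.

```python
def xhr_requests(har):
    """Return (method, url) pairs for the entries labelled as XHR."""
    found = []
    for entry in har["log"]["entries"]:
        # "_resourceType" is assumed here; it is not part of the core HAR format.
        if entry.get("_resourceType") == "xhr":
            request = entry["request"]
            found.append((request["method"], request["url"]))
    return found


# Hand-written sample shaped like a HAR export (not real captured data).
sample_har = {
    "log": {
        "entries": [
            {
                "_resourceType": "image",
                "request": {"method": "GET", "url": "http://example.com/item.jpg"},
            },
            {
                "_resourceType": "xhr",
                "request": {
                    "method": "GET",
                    "url": "http://www.pinko.com/en-gb/catalog/index/springsummer?pg=4",
                },
            },
        ]
    }
}

for method, url in xhr_requests(sample_har):
    print(method, url)
```

With a real export, loading the saved file with json.load should give a dictionary of the same general shape.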

Step 4. Identify the page component of the URL.

The ?pg=4 is the URL parameter that corresponds to the page number.

If you go directly to this URL it skips straight to those items. Now we know how the website really paginates!
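If the URL has several parameters and it is not obvious which one is the page number, a small script can list the ones with whole-number values. This is a minimal sketch using Python's standard library; the function name page_parameters is made up for this example, and a numeric value is only a hint that a parameter might be the page number.

```python
from urllib.parse import urlsplit, parse_qs


def page_parameters(url):
    """Return the query parameters whose first value is a whole number."""
    query = parse_qs(urlsplit(url).query)
    return {
        name: int(values[0])
        for name, values in query.items()
        if values[0].isdecimal()
    }


request_url = "http://www.pinko.com/en-gb/catalog/index/springsummer?pg=4"
print(page_parameters(request_url))  # {'pg': 4}
```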

So now, you have everything you need to create your extractor!

Step 5. Create your extractor

Create your extractor using this URL: http://www.pinko.com/en-gb/catalog/index/springsummer?pg=1

Step 6. Add the rest of the URLs

Add the rest of the URLs on the Settings tab of your extractor:

http://www.pinko.com/en-gb/catalog/index/springsummer?pg=2

http://www.pinko.com/en-gb/catalog/index/springsummer?pg=3

http://www.pinko.com/en-gb/catalog/index/springsummer?pg=4

http://www.pinko.com/en-gb/catalog/index/springsummer?pg=5

http://www.pinko.com/en-gb/catalog/index/springsummer?pg=6

etc.

Then save and run your extractor!
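Typing out a long list of URLs by hand gets tedious, so you may prefer to generate it. Here is a minimal sketch using Python's standard library. The function name paginated_urls is made up for this example, and the last page number (6) is an assumption; check how many pages the site really has.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def paginated_urls(base_url, param, first, last):
    """Build one URL per page number by setting `param` in the query string."""
    parts = urlsplit(base_url)
    # Keep any other query parameters, but drop an existing page value.
    other = [(key, value) for key, value in parse_qsl(parts.query) if key != param]
    urls = []
    for page in range(first, last + 1):
        query = urlencode(other + [(param, page)])
        urls.append(urlunsplit(parts._replace(query=query)))
    return urls


base = "http://www.pinko.com/en-gb/catalog/index/springsummer"
for url in paginated_urls(base, "pg", 1, 6):  # 6 is an assumed last page
    print(url)
```

The printed list can then be pasted into the extractor, one URL per line.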

Example 2

As we mentioned before, websites are built in many different ways. Here is another example.

https://shop.boggi.com/categoria-prodotto/pe16/giacche-classiche-pe16/

Step 1. Clear the network tab

Begin with the same methodology: right-click the page, choose Inspect and click on the Network tab.
Clear any existing activity by hitting the clear button next to the red circle on the left-hand side.

Step 2. Scroll down and identify items with "xhr"

Scroll down on the page until more items appear.
Look for an action with the type "xhr".

Step 3. Click the xhr item and select the headers tab.

This time the Request URL points to some kind of PHP script, which uses a form to make a POST request. It is not necessary to understand this, but it means the page number is not part of the URL, so we can't use this URL.

Step 4. Note that the page component is not in this URL and consider alternatives

Instead, you can look into the elements of the HTML and see if there is any reference to page numbers in there.

Step 5. Inspect the elements instead of the network tab.

Right-click on one of the items that appears after scrolling down and click Inspect, then look at the Elements tab rather than the Network tab.

This should take you to the location of that item in the HTML. You're looking for anything that says page/pagination/scroll/scrolling/lazyload etc. (Alternatively, you can use Command + F, or Ctrl + F on Windows, and search directly for these terms.)
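If you have saved a copy of the page's HTML, the same search can be done with a short script. This is a rough sketch only: the function name find_pagination_hints is made up, and the sample HTML is invented for illustration and is not taken from the example site.

```python
import re

# The terms suggested above; longer terms come first so they match in full.
KEYWORDS = ("pagination", "scrolling", "lazyload", "scroll", "page")


def find_pagination_hints(html):
    """Return (line number, line) pairs for lines mentioning a keyword."""
    pattern = re.compile("|".join(KEYWORDS), re.IGNORECASE)
    hints = []
    for number, line in enumerate(html.splitlines(), start=1):
        if pattern.search(line):
            hints.append((number, line.strip()))
    return hints


# Invented sample HTML, for illustration only.
sample_html = """<ul class="products">
  <li class="product">Jacket</li>
</ul>
<nav class="pagination">
  <a class="next" href="/category/page/2/">Next</a>
</nav>"""

for number, line in find_pagination_hints(sample_html):
    print(number, line)
```

In this sample, the matching lines point to a link ending in /page/2/, which is the kind of hidden URL pattern you are hoping to find.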