[Node.js] How to scrape all products from all pages (I followed the video tutorial)

tiagorvmartins · November 6, 2019, 1:20pm

I am using the library web-scraper-headless to scrape multiple products from a website that contains pagination.

I followed the pagination tutorial, which nests the pagination selector as a child of itself. Below are my current sitemap and its starting URL.

The problem is that the output data doesn't contain all the products I would expect. For example, on page 39 I only got 3 different products, although that page has 10. I checked the data preview in the Chrome extension and all 10 products appear there for page 39. Page 39 is just an example; I suppose there are more pages where not all products were scraped.
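One way to find which pages come up short is to count distinct products per listing page in the scraped output. The snippet below is only a sketch: the helper names are made up for illustration, and it assumes each scraped row carries the followed pagination URL in a `pagination-href` field and the product URL in a `product-link-href` field, which may not match the actual output columns.

```javascript
// Hypothetical helper: count distinct products per listing page.
// Assumption: rows expose `pagination-href` (page URL) and
// `product-link-href` (product URL); adjust the keys to the real output.
function productsPerPage(rows, pageKey = 'pagination-href', productKey = 'product-link-href') {
  const pages = new Map();
  for (const row of rows) {
    // Rows scraped from the start URL have no pagination link.
    const page = row[pageKey] || '(start page)';
    if (!pages.has(page)) pages.set(page, new Set());
    if (row[productKey]) pages.get(page).add(row[productKey]);
  }
  const counts = {};
  for (const [page, products] of pages) counts[page] = products.size;
  return counts;
}

// Hypothetical helper: list the pages with fewer distinct products than expected.
function pagesBelow(counts, expected) {
  return Object.keys(counts).filter((page) => counts[page] < expected);
}
```

Calling `pagesBelow(productsPerPage(rows), 10)` would then list the pages, such as page 39, that returned fewer than 10 products.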

You might suspect that the delay is too short for the embedded jsdom to process everything, but I am using 10 seconds for both the delay and the page load delay, as these settings show:

const scraperOpts = {
  delay: 10000,
  pageLoadDelay: 10000
};

I was expecting a total of 428 products, but from the scraping I am only getting 161 unique products. What is wrong here? Can someone please give me some guidance?

Url: https://www.cartridgesave.co.uk/printers.html?p=1

Sitemap:
{"_id":"printers","startUrl":["https://www.cartridgesave.co.uk/printers.html?p=1"],"selectors":[{"id":"pagination","type":"SelectorLink","parentSelectors":["_root","pagination"],"selector":".search div:nth-of-type(2) .pages-items a","multiple":true,"delay":0},{"id":"product-link","type":"SelectorLink","parentSelectors":["_root","pagination"],"selector":".product-item-inner a.product-item-link","multiple":true,"delay":0},{"id":"ManufacturerPartNo","type":"SelectorText","parentSelectors":["product-link"],"selector":"#information tr:contains('Manufacturer Part No.:') td","multiple":false,"regex":"","delay":0},{"id":"Brand","type":"SelectorText","parentSelectors":["product-link"],"selector":"#information tr:contains('Brand:') td","multiple":false,"regex":"","delay":0},{"id":"ProductType","type":"SelectorText","parentSelectors":["product-link"],"selector":"#information tr:contains('Product Type:') td","multiple":false,"regex":"","delay":0},{"id":"Connectivity","type":"SelectorText","parentSelectors":["product-link"],"selector":"#information tr:contains('Connectivity:') td","multiple":false,"regex":"","delay":0},{"id":"Height","type":"SelectorText","parentSelectors":["product-link"],"selector":"#information tr:contains('Height:') td","multiple":false,"regex":"","delay":0},{"id":"Width","type":"SelectorText","parentSelectors":["product-link"],"selector":"#information tr:contains('Width:') td","multiple":false,"regex":"","delay":0},{"id":"Depth","type":"SelectorText","parentSelectors":["product-link"],"selector":"#information tr:contains('Depth:') td","multiple":false,"regex":"","delay":0},{"id":"CartridgesLink","type":"SelectorLink","parentSelectors":["product-link"],"selector":"a.catridge_printer_link","multiple":false,"delay":0},{"id":"Catridges","type":"SelectorLink","parentSelectors":["CartridgesLink"],"selector":".product-item-inner a.product-item-link","multiple":true,"delay":0},{"id":"CatridgesModel","type":"SelectorText","parentSelectors":["Catridges"],"selector":"#information tr:contains('Manufacturer Part No.:') td","multiple":false,"regex":"","delay":0}]}
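As a possible alternative to the recursive pagination selector, the listing pages could be enumerated up front and passed as the startUrl array. This is only a sketch under assumptions: `buildPageUrls` is a hypothetical helper, the count of 43 pages is a guess derived from 428 expected products at 10 per page, and it has not been confirmed here that the library visits every entry of a multi-URL startUrl.

```javascript
// Hypothetical helper: build one listing URL per page by setting the `p`
// query parameter, which is how the start URL in the sitemap selects a page.
function buildPageUrls(baseUrl, lastPage) {
  const urls = [];
  for (let page = 1; page <= lastPage; page++) {
    const url = new URL(baseUrl);
    url.searchParams.set('p', String(page));
    urls.push(url.toString());
  }
  return urls;
}

// Assumption: 43 pages (428 products / 10 per page, rounded up).
const startUrl = buildPageUrls('https://www.cartridgesave.co.uk/printers.html', 43);
```

With every page listed explicitly, the pagination selector would no longer be needed and product-link could hang off _root alone, which removes one place where pages can be skipped.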

Thanks a lot!

leemeng · November 6, 2019, 2:25pm

I took a quick look at the website and your sitemap, and I believe the site is using lazy loading. So you would probably need a scroller to load the bottom of the page. You can confirm this by visiting a page you have not been to before, say page 42. But do not scroll down yet. Do a preview of your data scraper and see if the item count is correct. If it is lacking the bottom-of-page items then you know you need a scroller.

In your earlier data previews, you had probably scrolled down the page already, so that was why it seemed to work fine.

tiagorvmartins · November 6, 2019, 2:34pm

Thanks for your help leemeng!
I tested what you told me, but all elements are already loaded in the data preview before I scroll down the page. How would I set up a scroller anyway, so I can run the scraping again and test it? I have never used a scroller, thanks!

EDIT: I checked the website again, and you are right, at least for the images: they are lazy-loaded, because they only load when I scroll to them. But the data preview always shows all elements.

tiagorvmartins · November 7, 2019, 9:14am

Something is wrong with my sitemap for sure, or there is something I don't understand.
I ran the scraper again last night. Before, I got 1654 items (161 unique products), and now I got 1652 items (161 unique products). It seems there is some kind of limit that stops it from going beyond this number; memory limits, maybe? The same unique products! This is very weird.
I increased the delay to 20000 and left pageLoadDelay at 10000, which I thought would fix it, but no...
I also tried using Element Scroll Down, but that is not possible with the library: it throws an error saying window is not defined.
Can someone please check my sitemap, or if possible tell me what I am doing wrong? I believe it's not a bug in the extension.
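To check whether two runs really return the same set of products, rather than just the same count, the unique product URLs of each run can be compared. A minimal sketch, assuming each run is an array of row objects and that `product-link-href` holds the product URL (both the helper names and the field name are assumptions):

```javascript
// Hypothetical helper: collect the distinct non-empty values of one field.
function uniqueValues(rows, key) {
  return new Set(rows.map((row) => row[key]).filter(Boolean));
}

// Hypothetical helper: compare the unique products found by two runs.
// Assumption: `product-link-href` identifies a product.
function diffRuns(runA, runB, key = 'product-link-href') {
  const a = uniqueValues(runA, key);
  const b = uniqueValues(runB, key);
  return {
    uniqueA: a.size,
    uniqueB: b.size,
    onlyInA: [...a].filter((value) => !b.has(value)),
    onlyInB: [...b].filter((value) => !a.has(value)),
  };
}
```

If onlyInA and onlyInB are both empty across runs, the scrape is stopping at the same point every time, which points to something deterministic rather than a timing problem.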

tiagorvmartins · November 25, 2019, 3:34pm

I noticed that when scraping with the Chrome extension, the scrape ends after a few pages. I tried to access the last page and got a 400 Bad Request. Then I understood that the problem was the cookies being too large, because the browser keeps cookies for the recently viewed pages. I am now scraping in an incognito window of Chrome, and hopefully it will work.
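To get a feel for when accumulated cookies could trigger this kind of 400 response, the size of the Cookie request header can be estimated from the cookies held for the site. This is a rough sketch: the helper names are hypothetical, and the 8192-byte limit is an assumed value, since the actual limit depends on the server.

```javascript
// Hypothetical helper: size in bytes of the Cookie header that would be
// sent for a list of { name, value } cookies.
function cookieHeaderBytes(cookies) {
  const header = cookies.map(({ name, value }) => `${name}=${value}`).join('; ');
  return Buffer.byteLength(header, 'utf8');
}

// Assumption: the server rejects Cookie headers above `limit` bytes.
// 8192 is a placeholder; the real limit for this site is unknown.
function exceedsLimit(cookies, limit = 8192) {
  return cookieHeaderBytes(cookies) > limit;
}
```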

But how can I enable incognito mode using the web-scraper-headless npm package for Node (https://www.npmjs.com/package/web-scraper-headless)? I understand that it uses jsdom as the browser under the hood, but does it keep cookies? If it does, I will have the same issue when running this from my code.

Thanks again!