Not all data is being scraped - is it because of delay 0 on selectors?

tiagorvmartins · November 26, 2019, 3:44pm

NOTE: I have a separate topic, but that one is about the Node.js module; this one is about the extension itself, which is not working as I would expect.

I am trying to scrape printer and cartridge compatibility data from a specific website, but not all the printers/cartridges are scraped. Even after increasing the Request Interval to 20000ms and the Page Load Delay to 15000ms, I got 183 printers out of a total of 467. Some printers (and even cartridges) are being skipped, and I can't figure out why.

I started a scrape with this sitemap in the Chrome browser using the extension, and I now notice that it is skipping printers. It always starts on the last printer of each page, and on one run of the scraper it opened the last printer but then didn't continue through the cartridges link.

I believe the problem is in the SelectorLink, which is skipping some items. In the extension I used element preview and all of them are properly selected, so I still have no clue.

It goes through all the pages: I checked the results and I have printers from the last page (page 47). But for some reason there are printers missing, and I need this module to work flawlessly.

I have delay 0 on all selectors. Might this be the issue?

Url: https://www.cartridgesave.co.uk/printers.html?p=1

Sitemap:
{"_id":"printers-test-amount","startUrl":["https://www.cartridgesave.co.uk/printers.html?p=1"],"selectors":[{"id":"pagination","type":"SelectorLink","parentSelectors":["_root","pagination"],"selector":".search div:nth-of-type(2) a.next","multiple":true,"delay":0},{"id":"product-link","type":"SelectorLink","parentSelectors":["_root","pagination"],"selector":".product-item-inner a.product-item-link","multiple":true,"delay":0},{"id":"ManufacturerPartNo","type":"SelectorText","parentSelectors":["product-link"],"selector":"#information tr:contains('Manufacturer Part No.:') td","multiple":false,"regex":"","delay":0},{"id":"Brand","type":"SelectorText","parentSelectors":["product-link"],"selector":"#information tr:contains('Brand:') td","multiple":false,"regex":"","delay":0},{"id":"ProductType","type":"SelectorText","parentSelectors":["product-link"],"selector":"#information tr:contains('Product Type:') td","multiple":false,"regex":"","delay":0},{"id":"Connectivity","type":"SelectorText","parentSelectors":["product-link"],"selector":"#information tr:contains('Connectivity:') td","multiple":false,"regex":"","delay":0},{"id":"Height","type":"SelectorText","parentSelectors":["product-link"],"selector":"#information tr:contains('Height:') td","multiple":false,"regex":"","delay":0},{"id":"Width","type":"SelectorText","parentSelectors":["product-link"],"selector":"#information tr:contains('Width:') td","multiple":false,"regex":"","delay":0},{"id":"Depth","type":"SelectorText","parentSelectors":["product-link"],"selector":"#information tr:contains('Depth:') td","multiple":false,"regex":"","delay":0},{"id":"CartridgesLink","type":"SelectorLink","parentSelectors":["product-link"],"selector":"a.catridge_printer_link","multiple":false,"delay":0},{"id":"Catridges","type":"SelectorLink","parentSelectors":["CartridgesLink"],"selector":".product-item-inner a.product-item-link","multiple":true,"delay":0},{"id":"CatridgesModel","type":"SelectorText","parentSelectors":["Catridges"],"selector":"#information tr:contains('Manufacturer Part No.:') td","multiple":false,"regex":"","delay":0},{"id":"Title","type":"SelectorText","parentSelectors":["product-link"],"selector":"span[itemprop='name']","multiple":false,"regex":"","delay":0},{"id":"ShippingWeight","type":"SelectorText","parentSelectors":["product-link"],"selector":"#information tr:contains('Shipping weight:') td","multiple":false,"regex":"","delay":0},{"id":"Functionality","type":"SelectorText","parentSelectors":["product-link"],"selector":"#information td[data-th='Functionality']","multiple":false,"regex":"","delay":0},{"id":"ColourMono","type":"SelectorText","parentSelectors":["product-link"],"selector":"#information tr:contains('Colour or mono:') td","multiple":false,"regex":"","delay":0},{"id":"PaperSize","type":"SelectorText","parentSelectors":["product-link"],"selector":"#information tr:contains('Paper size:') td","multiple":false,"regex":"","delay":0},{"id":"StandardTrayMediaTypes","type":"SelectorText","parentSelectors":["product-link"],"selector":"#information tr:contains('Standard tray media types:') td","multiple":false,"regex":"","delay":0},{"id":"ISOASeriesSizes","type":"SelectorText","parentSelectors":["product-link"],"selector":"#information tr:contains('ISO A-series sizes (A0...A9):') td","multiple":false,"regex":"","delay":0},{"id":"ISOBSeriesSizes","type":"SelectorText","parentSelectors":["product-link"],"selector":"#information tr:contains('ISO B-series sizes (B0...B9):') td","multiple":false,"regex":"","delay":0},{"id":"NonISOSizes","type":"SelectorText","parentSelectors":["product-link"],"selector":"#information tr:contains('Non-ISO print media sizes:') td","multiple":false,"regex":"","delay":0}]}

Please can you give me some help? Thank you so much in advance!

webber · November 27, 2019, 10:36am

The delay for selectors only needs to be added when using the click or scroll selector; otherwise it is not needed. The sitemap is structured correctly and scrapes data as it is supposed to when I run it on my side.

I would look into a local issue, such as your PC, or maybe the site is blocking your IP after a while, resulting in incomplete data.

tiagorvmartins · November 27, 2019, 10:43am

Hi webber, I deeply appreciate your answer. The IP being blocked can't be the issue, because it scrapes through to the last page: I can see printers in my scrape from page 47 (which is the last; the next selector goes through each page sequentially). But for some reason, printers were skipped on previous pages.

I am happy it's not a problem with the sitemap itself. I will have to re-run the scrape multiple times until I have all the printers and then merge all the information. I can't figure out another possible solution at the moment.

Thank you for your time again.

webber · November 27, 2019, 10:49am

You can try scraping all of the product URLs first to make sure you have the correct count, and then use those URLs as multiple start URLs in the metadata to scrape the information within them.
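
As a rough illustration of that second step, the sketch below builds a details-only sitemap whose `startUrl` array holds the product URLs collected by a first, links-only scrape. The function name `build_detail_sitemap`, the `example.com` URLs, and the `printers-details` id are hypothetical; only the sitemap keys (`_id`, `startUrl`, `selectors`, `parentSelectors`) come from the sitemap posted above.

```python
import json


def build_detail_sitemap(sitemap_id, product_urls, detail_selectors, link_id="product-link"):
    """Build a second-step sitemap that starts directly on the product pages."""
    # Drop duplicate URLs while keeping first-seen order.
    unique_urls = list(dict.fromkeys(product_urls))
    selectors = []
    for sel in detail_selectors:
        # Selectors that used to hang off the product link now hang off the page root.
        parents = ["_root" if p == link_id else p for p in sel["parentSelectors"]]
        selectors.append(dict(sel, parentSelectors=parents))
    return {"_id": sitemap_id, "startUrl": unique_urls, "selectors": selectors}


# Hypothetical URLs, standing in for the output of a links-only first scrape.
urls = [
    "https://example.com/printers/printer-a.html",
    "https://example.com/printers/printer-b.html",
    "https://example.com/printers/printer-a.html",
]
brand = {
    "id": "Brand",
    "type": "SelectorText",
    "parentSelectors": ["product-link"],
    "selector": "#information tr:contains('Brand:') td",
    "multiple": False,
    "regex": "",
    "delay": 0,
}
sitemap = build_detail_sitemap("printers-details", urls, [brand])
print(json.dumps(sitemap))
```

Comparing `len(sitemap["startUrl"])` with the count shown on the site gives a quick check that no product was lost before the detail scrape starts.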

tiagorvmartins · November 27, 2019, 11:19am

That could work, but it would make the process more complex. I am currently building several configurations, one for each website I want to scrape (each with its own sitemap), plus a scheduler that runs each scrape configuration sequentially and, once the process is complete, inserts the data into a MongoDB database. I am using the web-scraper-headless module for Node.js.

Using the solution you mention, I would need some way to split the logic of a specific configuration into two steps, which is not advisable at the current state of my project.

It is a good suggestion for a simple scrape using the extension on its own, though. Thank you for your input!

leemeng · November 29, 2019, 2:40am

It would be better to first figure out exactly which data are not being scraped so you can diagnose properly. E.g. if you're mainly missing data that appears at the bottom of pages, that probably means you need to add a scroller.

If your issue can be reduced or solved by using longer delays, that probably means the web server is slow, your net connection is slow, or both.
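
One way to figure out exactly which items are missing is to compare the scrape result against a reference list of product URLs. The sketch below assumes such a list exists (for example from a links-only scrape) and maps each URL to the listing page it appeared on; the name `find_missing` and all sample URLs are hypothetical.

```python
from collections import Counter


def find_missing(expected_pages, scraped_urls):
    """Return the expected URLs absent from the scrape, plus a per-page count.

    expected_pages: dict mapping product URL -> listing page number.
    scraped_urls: iterable of product URLs found in the scrape result.
    """
    scraped = set(scraped_urls)
    missing = {url: page for url, page in expected_pages.items() if url not in scraped}
    per_page = Counter(missing.values())
    return missing, per_page


# Hypothetical reference list and scrape result.
expected_pages = {
    "https://example.com/printer-a.html": 1,
    "https://example.com/printer-b.html": 1,
    "https://example.com/printer-c.html": 2,
}
scraped_urls = ["https://example.com/printer-a.html"]

missing, per_page = find_missing(expected_pages, scraped_urls)
for page, count in sorted(per_page.items()):
    print(f"page {page}: {count} missing")
```

If the misses cluster on particular pages or positions, that points at the page; if they are spread evenly and change between runs, that points at timing or the network.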

tiagorvmartins · November 29, 2019, 9:08am

Hi @leemeng, thank you for your reply. I did two scrapes with different delays, and the odd thing is that the total number of printers is the same in both scrapes, but the missing printers were different in each. When I merged both scrapes I got 5 more printers than before. If I keep scraping, I will eventually reach a point where all printers have been scraped (I hope so).

The missing printers are random. I have seen it pick printers from the bottom of the page, so it can't be the scroller (also, with web-scraper-headless I can only use the scroller in headless mode (puppeteer), not with jsdom). Sometimes it even fails to pick printers that are at the top of the page, so it's really random and weird. Increasing the page load delay and the delay (on the second scrape) to 30s didn't help either.

I will carry on and accept this issue; I can't do anything at the moment. The only option would be to do what @webber suggested, a 2-step scrape, but that's not for now, maybe later.
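
The merge described here can be sketched as a union of rows keyed on whatever identifies a record. The field names `printer_url` and `cartridge_url` are placeholders rather than actual export column names, and `merge_runs` is a hypothetical helper, so adapt the key to your own data.

```python
def merge_runs(runs, key_fields):
    """Union the rows of several scrape runs, keeping the first row seen per key."""
    merged = {}
    for rows in runs:
        for row in rows:
            key = tuple(row.get(field) for field in key_fields)
            merged.setdefault(key, row)
    return list(merged.values())


# Two hypothetical runs that each miss a different printer.
run_a = [
    {"printer_url": "https://example.com/printer-a.html", "cartridge_url": "https://example.com/cart-1.html"},
    {"printer_url": "https://example.com/printer-b.html", "cartridge_url": "https://example.com/cart-2.html"},
]
run_b = [
    {"printer_url": "https://example.com/printer-b.html", "cartridge_url": "https://example.com/cart-2.html"},
    {"printer_url": "https://example.com/printer-c.html", "cartridge_url": "https://example.com/cart-3.html"},
]

rows = merge_runs([run_a, run_b], key_fields=("printer_url", "cartridge_url"))
printers = {row["printer_url"] for row in rows}
print(len(rows), "rows,", len(printers), "unique printers")
```

Tracking the number of unique printers after each additional run shows whether the merged set is actually converging on the expected total.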

Haswod · May 28, 2020, 1:04pm

Hello all, has this been sorted?

Shootz · June 4, 2020, 4:20am

I'm experiencing the same problem. Any idea how to solve it?