Scraping links inconsistent when scraping multiple URLs

thorjelly · March 22, 2024, 6:17pm

Web Scraper version: 1.75.7
Chrome version: 122.0.6261.111
OS: Arch Linux

Link to the site you were scraping: Batteries & Power Adapters Parts & Upgrades - XPS Laptops | Dell USA

Sitemap:

{"_id":"DellSearchList","startUrl":["https://www.dell.com/en-us/shop/pfydresults/278727?categoryId=8490&sid=52","https://www.dell.com/en-us/shop/pfydresults/269121?categoryId=8490&sid=52","https://www.dell.com/en-us/shop/pfydresults/273243?categoryId=8490&sid=52","https://www.dell.com/en-us/shop/pfydresults/275214?categoryId=8490&sid=52","https://www.dell.com/en-us/shop/pfydresults/233361?categoryId=8490&sid=52"],"selectors":[{"id":"paginator","parentSelectors":["_root","paginator"],"paginationType":"auto","type":"SelectorPagination","selector":"button.dds__pagination__next-page"},{"id":"product_page","parentSelectors":["_root","paginator"],"type":"SelectorLink","selector":".ps-title a","multiple":true,"linkType":"linkFromHref"},{"id":"parts","parentSelectors":["product_page"],"type":"SelectorText","selector":"div.ps-product-info","multiple":false,"regex":""}]}

The product lists for XPS L321X and XPS 17 (9720) scrape fine. XPS 17 (9710) lists 4 products, but only one gets scraped. For XPS 17 (9700), every product is skipped.

However, if I scrape pages individually, it works:

{"_id":"DellSearchSingle","startUrl":["https://www.dell.com/en-us/shop/pfydresults/273243?categoryId=8490&sid=52"],"selectors":[{"id":"paginator","parentSelectors":["_root","paginator"],"paginationType":"auto","type":"SelectorPagination","selector":"button.dds__pagination__next-page"},{"id":"product_page","parentSelectors":["_root","paginator"],"type":"SelectorLink","selector":".ps-title a","multiple":true,"linkType":"linkFromHref"},{"id":"parts","parentSelectors":["product_page"],"type":"SelectorText","selector":"div.ps-product-info","multiple":false,"regex":""}]}

In this case, I scrape XPS 17 (9710) with no other URLs, and all 4 products are scraped fine. Nothing about the scraper has changed except the list of start URLs.

This happens whether I use a list of start URLs or scrape the links to each accessory list on Batteries & Power Adapters Parts & Upgrades - XPS Laptops | Dell USA: either way, certain products get skipped.

From what I can tell, if I change the order of the start URLs, different products on different pages get skipped. So it appears to be a problem with how Web Scraper internally queues the links to load.

JanAp · March 26, 2024, 9:37am

Hi,

This happens because Web Scraper is designed to filter out duplicate listings.

If you only scrape the product URLs and don't open them (that is, the link selector has no child selectors), all records will be scraped. If the URLs are opened (because child selectors are defined for the link selector), the duplicates will be discarded.
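
To illustrate why the order of start URLs changes which products go missing, here is a minimal Python sketch of a crawl queue that opens each URL at most once. It is not Web Scraper's actual code; the `crawl` function, the `links_on_page` mapping, and the short page names are all made up for the example.

```python
from collections import deque


def crawl(start_urls, links_on_page):
    """Open each URL at most once; return the links opened per start URL.

    links_on_page is a hypothetical stand-in for the real site: it maps a
    page to the product links found on it.
    """
    seen = set(start_urls)
    queue = deque((url, url) for url in start_urls)  # (page, originating start URL)
    opened = {url: [] for url in start_urls}

    while queue:
        page, origin = queue.popleft()
        for link in links_on_page.get(page, []):
            if link in seen:
                # Duplicate: already queued from an earlier page, so it is dropped.
                continue
            seen.add(link)
            opened[origin].append(link)
            queue.append((link, origin))
    return opened


# Hypothetical data: three list pages that share some products.
links_on_page = {
    "list/9720": ["p/adapter", "p/battery", "p/dock"],
    "list/9710": ["p/adapter", "p/battery", "p/dock", "p/power-bank"],
    "list/9700": ["p/adapter", "p/battery"],
}

print(crawl(["list/9720", "list/9710", "list/9700"], links_on_page))
print(crawl(["list/9700", "list/9710", "list/9720"], links_on_page))
```

With the first ordering, `list/9710` contributes only `p/power-bank` and `list/9700` contributes nothing. With the reversed ordering, `list/9720` is the one that ends up empty. The same products are scraped either way, each one only once.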

thorjelly · March 28, 2024, 6:05pm

I see. I did not notice that these listings were duplicates. Thank you.
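
For anyone who wants to check for this kind of overlap, one possible approach is to scrape only the links (no child selectors) and then look for product URLs that appear on more than one list page. The Python sketch below assumes the scraped data has already been reduced to (list page, product URL) pairs; the `find_shared_links` helper and the sample rows are hypothetical.

```python
from collections import defaultdict


def find_shared_links(rows):
    """Return product URLs that appear on two or more list pages.

    rows is an iterable of (list_page, product_url) pairs, for example
    taken from a link-only scrape. The result maps each shared product
    URL to the sorted list pages it was found on.
    """
    pages_by_link = defaultdict(set)
    for list_page, product_url in rows:
        pages_by_link[product_url].add(list_page)
    return {
        link: sorted(pages)
        for link, pages in pages_by_link.items()
        if len(pages) > 1
    }


# Hypothetical rows, not real scraped data.
rows = [
    ("list/9720", "p/adapter"),
    ("list/9720", "p/dock"),
    ("list/9710", "p/adapter"),
    ("list/9710", "p/power-bank"),
    ("list/9700", "p/adapter"),
]

for link, pages in find_shared_links(rows).items():
    print(link, "appears on", pages)
```

Any URL this reports is one that would be opened only once when the link selector has child selectors, so it shows up under just one of the start URLs.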