Start Urls using range with variable upper limit

Aeolus · April 20, 2022, 8:48am

Hello fellow scrapers!

I am scraping data from shops that sell items, using start URLs with a range. Every now and then new items are added to the shops, which changes the number of pages. I am trying to figure out whether the range method on start URLs can take a variable upper limit that automatically matches the number of pages the site has, so that I don't have to check the site every other day and update my sitemaps by hand.

For example I want this:
.../catalog?page=[1-45]

to work like this:
.../catalog?page=[1-"last page"] , in case more items are added in the future and the pages become 46 or 50 or 150.

I am already aware of (and using) the Pagination Selector; however, due to some quirks of specific sites, this method can become very inconvenient.

In case the approach of having a variable upper limit on the range is impossible to implement, I am open to suggestions that would solve this problem.

Thanks in advance for your time!

Kind regards,

leemeng · April 20, 2022, 2:27pm

Most websites don't care if you have an invalid page number in the URL, so you could just figure out a suitable buffer for your URLs and add it to the range. For example, if your current start URL is /catalog?page=[1-45] and you think there might be 5 more pages in the future, you can just change it to /catalog?page=[1-50].
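As a rough illustration of the buffer idea, here is a minimal Python sketch that builds the padded range string. The function name `padded_start_url` and the `example.com` address are made up for this example; this is not something WS provides, and you would still paste the resulting string into the sitemap's start URL yourself.

```python
def padded_start_url(base_url: str, last_known_page: int, buffer_pages: int = 5) -> str:
    """Build a range-style start URL with extra pages added as a safety margin.

    base_url, last_known_page and buffer_pages are hypothetical inputs
    chosen for this sketch.
    """
    if last_known_page < 1 or buffer_pages < 0:
        raise ValueError("page counts must be positive")
    upper = last_known_page + buffer_pages
    # The [1-N] range syntax is the one used in the start URL above.
    return f"{base_url}?page=[1-{upper}]"


print(padded_start_url("https://example.com/catalog", 45))
# prints https://example.com/catalog?page=[1-50]
```

A bigger buffer means fewer sitemap edits later, at the cost of a few more empty page requests per run.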

If WS encounters an invalid URL (e.g. the actual pages stop at 47, so 48-50 would not be valid), it will just ignore it and you will have a blank result line. These are easy to spot and filter out.
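If you post-process the exported CSV, the blank lines could be dropped with something like the sketch below. It assumes the export has the `web-scraper-order` and `web-scraper-start-url` bookkeeping columns and treats a row as blank when every other column is empty; the `name` and `price` columns and the sample data are invented for the example, so adjust them to your own sitemap.

```python
import csv
import io

# Assumed bookkeeping columns; they are filled in even when a page yields no data.
META_COLUMNS = {"web-scraper-order", "web-scraper-start-url"}


def drop_blank_rows(rows, meta_columns=META_COLUMNS):
    """Keep only rows where at least one non-bookkeeping column has a value."""
    kept = []
    for row in rows:
        has_data = any(
            isinstance(value, str) and value.strip()
            for key, value in row.items()
            if key not in meta_columns
        )
        if has_data:
            kept.append(row)
    return kept


# Invented sample export: page 47 has an item, page 48 is past the last page.
sample = io.StringIO(
    "web-scraper-order,web-scraper-start-url,name,price\n"
    "1-1,https://example.com/catalog?page=47,Widget,9.99\n"
    "1-2,https://example.com/catalog?page=48,,\n"
)
rows = list(csv.DictReader(sample))
print(len(drop_blank_rows(rows)))  # prints 1
```

For a real export you would open the file with `open(path, newline="", encoding="utf-8")` instead of the in-memory sample.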

Ref: Specify multiple urls with ranges

Aeolus · April 20, 2022, 6:54pm

Thank you very much for your quick response. Truth is, I was already doing that to make sure I grab all the data, as well as changing the "products per page" value in the URL to something bigger, so that the scraping gets done a bit faster.
Thanks again for your insights though and have a nice day!