Pagination and redirected URL

geo · December 17, 2019, 2:27pm

I have a problem with pagination when using Web Scraper. The site I want to scrape repeatedly redirects the URL in the address bar to the first page of the results. So when I scrape …. p[2-5], I get a CSV file containing the same contents four times: those of the first page.

During the scraping process I see every page pass by, but after each one the browser is redirected to the first page, which is apparently what gets scraped. Does anyone have an idea how I can scrape every page with the right contents?

Thanks in advance!

Url: https://www.funda.nl/en/huur/heel-nederland/verhuurd/sorteer-afmelddatum-af/

Sitemap:
{"_id":"rented","startUrl":["https://www.funda.nl/huur/heel-nederland/verhuurd/sorteer-afmelddatum-af/p[2-5]"],"selectors":[{"id":"element","type":"SelectorElement","parentSelectors":["_root"],"selector":"div.search-result-content","multiple":true,"delay":0},{"id":"prijs","type":"SelectorText","parentSelectors":["element"],"selector":"[data-global-id='4979865'] span.search-result-price","multiple":false,"regex":"","delay":0},{"id":"straat","type":"SelectorText","parentSelectors":["element"],"selector":"[data-search-result-item-anchor='86788864'] h2","multiple":false,"regex":"","delay":0},{"id":"postcode","type":"SelectorText","parentSelectors":["element"],"selector":"small","multiple":false,"regex":"","delay":0},{"id":"oppervlak","type":"SelectorText","parentSelectors":["element"],"selector":"li span","multiple":false,"regex":"","delay":0},{"id":"prijs2","type":"SelectorText","parentSelectors":["element"],"selector":"span.search-result-price","multiple":false,"regex":"","delay":0}]}
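For reference, a range such as `p[2-5]` in a start URL stands for one URL per number. Below is a minimal Python sketch of that expansion, assuming the `[start-end]` and `[start-end:step]` range forms described in the Web Scraper documentation. `expand_range_url` is a hypothetical helper written for illustration, not part of Web Scraper.

```python
import re

# Matches the first [start-end] or [start-end:step] range in a URL.
RANGE_RE = re.compile(r"\[(\d+)-(\d+)(?::(\d+))?\]")


def expand_range_url(url):
    """Return the URLs that a range start URL stands for (hypothetical helper)."""
    match = RANGE_RE.search(url)
    if match is None:
        return [url]
    start_text, end_text, step_text = match.groups()
    step = int(step_text) if step_text else 1
    # Keep zero padding for ranges written like [001-100].
    width = len(start_text) if start_text.startswith("0") else 0
    urls = []
    for number in range(int(start_text), int(end_text) + 1, step):
        urls.append(url[:match.start()] + str(number).zfill(width) + url[match.end():])
    return urls


urls = expand_range_url(
    "https://www.funda.nl/huur/heel-nederland/verhuurd/sorteer-afmelddatum-af/p[2-5]"
)
for u in urls:
    print(u)
```

Each of these four URLs is opened separately, so if the site redirects every one of them to the first page, the first page is what ends up in the output four times, which matches the symptom described above.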

webber · December 18, 2019, 2:42pm

This is just an issue with the page itself. For some reason, it does not allow navigating through the results by opening new links directly.

You can use an Element Click selector to paginate through and hope you will not run out of RAM, as there are quite a lot of pages to iterate through.

If you do run out of RAM and the scraper crashes, you can try to filter the results down a bit and split the scrape into multiple sitemaps to increase the chances of getting through all of the pages.

{"_id":"rented","startUrl":["https://www.funda.nl/huur/heel-nederland/verhuurd/sorteer-afmelddatum-af/"],"selectors":[{"id":"element","type":"SelectorElementClick","parentSelectors":["_root"],"selector":"div.search-result-content","multiple":true,"delay":"1000","clickElementSelector":".pagination > a[rel=\"next\"]","clickType":"clickMore","discardInitialElements":"do-not-discard","clickElementUniquenessType":"uniqueCSSSelector"},{"id":"prijs","type":"SelectorText","parentSelectors":["element"],"selector":"[data-global-id='4979865'] span.search-result-price","multiple":false,"regex":"","delay":0},{"id":"straat","type":"SelectorText","parentSelectors":["element"],"selector":"[data-search-result-item-anchor='86788864'] h2","multiple":false,"regex":"","delay":0},{"id":"postcode","type":"SelectorText","parentSelectors":["element"],"selector":"small","multiple":false,"regex":"","delay":0},{"id":"oppervlak","type":"SelectorText","parentSelectors":["element"],"selector":"li span","multiple":false,"regex":"","delay":0},{"id":"prijs2","type":"SelectorText","parentSelectors":["element"],"selector":"span.search-result-price","multiple":false,"regex":"","delay":0}]}
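The suggestion above to split the scrape into multiple sitemaps could be sketched roughly as follows in Python. This is only an illustration: `split_sitemap` is a hypothetical helper, the `example.com` URLs are placeholders for whatever filtered result URLs the site produces, and the `_1`, `_2` suffix for the sitemap `_id` is an assumed naming scheme.

```python
import copy
import json


def split_sitemap(sitemap, start_urls):
    """Return one copy of `sitemap` per start URL, each with its own _id."""
    parts = []
    for index, url in enumerate(start_urls, start=1):
        part = copy.deepcopy(sitemap)
        part["_id"] = f"{sitemap['_id']}_{index}"  # assumed naming scheme
        part["startUrl"] = [url]
        parts.append(part)
    return parts


# Selectors are left empty for brevity; in practice, paste the full list
# from the sitemap above.
base = {
    "_id": "rented",
    "startUrl": ["https://www.funda.nl/huur/heel-nederland/verhuurd/sorteer-afmelddatum-af/"],
    "selectors": [],
}

# Placeholder URLs: substitute the filtered result URLs you actually use.
filtered_urls = [
    "https://example.com/filtered-results-a/",
    "https://example.com/filtered-results-b/",
]

parts = split_sitemap(base, filtered_urls)
for part in parts:
    print(json.dumps(part))  # one JSON document per smaller sitemap
```

Each printed JSON document can then be imported as its own sitemap and run separately, so a single run has fewer pages to click through.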

geo · December 19, 2019, 12:46pm

Thank you so much for your useful answer. What I did was:

Add an extra filter (resulting in a changed URL), so that I could check the output after a short while;
Change the element selector into the element click selector;
Address the page numbers that are involved.
And it works exactly as I wanted. Thanks again!
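The second step above (changing the Element selector into an Element Click selector) can be sketched in Python as an edit to the sitemap JSON. The field values are copied from webber's sitemap earlier in the thread. `to_element_click` is a hypothetical helper, and geo's final sitemap, with the extra filter, is not shown in the thread, so this reproduces only the change from webber's reply, on a shortened selector list.

```python
import copy
import json


def to_element_click(sitemap, selector_id, click_selector, start_url):
    """Return a copy of `sitemap` whose `selector_id` is an Element Click selector."""
    converted = copy.deepcopy(sitemap)
    converted["startUrl"] = [start_url]  # plain URL, no p[2-5] range
    for selector in converted["selectors"]:
        if selector["id"] == selector_id and selector["type"] == "SelectorElement":
            # Field values taken from webber's sitemap in this thread.
            selector.update({
                "type": "SelectorElementClick",
                "delay": "1000",
                "clickElementSelector": click_selector,
                "clickType": "clickMore",
                "discardInitialElements": "do-not-discard",
                "clickElementUniquenessType": "uniqueCSSSelector",
            })
    return converted


# Shortened version of the original sitemap (only two selectors kept).
original = {
    "_id": "rented",
    "startUrl": ["https://www.funda.nl/huur/heel-nederland/verhuurd/sorteer-afmelddatum-af/p[2-5]"],
    "selectors": [
        {"id": "element", "type": "SelectorElement", "parentSelectors": ["_root"],
         "selector": "div.search-result-content", "multiple": True, "delay": 0},
        {"id": "prijs2", "type": "SelectorText", "parentSelectors": ["element"],
         "selector": "span.search-result-price", "multiple": False, "regex": "", "delay": 0},
    ],
}

converted = to_element_click(
    original,
    "element",
    '.pagination > a[rel="next"]',
    "https://www.funda.nl/huur/heel-nederland/verhuurd/sorteer-afmelddatum-af/",
)
print(json.dumps(converted, indent=2))
```

The child selectors are left untouched. Only the wrapping element selector and the start URL change, so pagination happens by clicking the "next" link rather than by opening a new URL for each page.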