Scrape data from paginated detail page?

techhouse · April 26, 2021, 8:26am

I am having trouble scraping this website.
I seem to be able to follow the pagination, but the detail link request only selects the first entry on each page.
I am also unable to collect any data from the detail page.

The goal is to collect all the data from the detail page of every location on every list page.

What am I doing wrong?

Url: https://facilities.westkelownacity.ca/?CategoryIds=23&Page=[1-3]

Sitemap:
{"_id":"west_kelowna-playgrounds","startUrl":["https://facilities.westkelownacity.ca/?CategoryIds=23&Page=[1-3]"],"selectors":[{"id":"playground-links","type":"SelectorElementClick","parentSelectors":["playground-selector"],"selector":"h4","multiple":true,"delay":2000,"clickElementSelector":"h4","clickType":"clickMore","discardInitialElements":"do-not-discard","clickElementUniquenessType":"uniqueText"},{"id":"name","type":"SelectorText","parentSelectors":["playground-links"],"selector":"h2","multiple":false,"regex":"","delay":0},{"id":"address","type":"SelectorText","parentSelectors":["playground-links"],"selector":".sidebar-feed p:nth-of-type(1)","multiple":false,"regex":"","delay":0},{"id":"hours","type":"SelectorText","parentSelectors":["playground-links"],"selector":".sidebar-feed li","multiple":false,"regex":"","delay":0},{"id":"description","type":"SelectorText","parentSelectors":["playground-links"],"selector":"p:nth-of-type(3)","multiple":false,"regex":"","delay":0},{"id":"photo","type":"SelectorImage","parentSelectors":["playground-links"],"selector":"img.photoItem","multiple":false,"delay":0},{"id":"playground-selector","type":"SelectorElement","parentSelectors":["_root"],"selector":"a.sidebar-item","multiple":true,"delay":0},{"id":"","type":"SelectorText","parentSelectors":["playground-links"],"selector":"","multiple":false,"regex":"","delay":0}]}

ViestursWS · April 26, 2021, 11:25am

Hi @techhouse, after checking out this website, it seems there are no valid links leading to the individual detail pages; they appear to be generated by JavaScript. If you look at the data preview, there are no unique links, only links that lead to the main page. Clicking an entry also changes the URL, so the Element click selector cannot be used here either.

techhouse · April 26, 2021, 11:24pm

Thanks, @viesturs.
I guess that is why I only get one result from each page.
Is there another method that would make this scrape work?

leemeng · April 27, 2021, 3:03am

It is possible to scrape this site in two stages. In stage 1, you collect the "data-value" attribute from each row and build the detail-page URLs from those values. In stage 2, you use a different sitemap that takes all of the stage 1 URLs as its start URLs.

For the stage 1 scrape, you can use this selector (along with your paginator):

Type: Element attribute
Selector: ul > li div.l-item-container a:first-of-type
Multiple: Yes (checked)
Attribute name: data-value

This will yield a list of "data-value" values that look like:
be8a60a8-bc73-4c59-bd03-c47a85425252
58b944a0-1a4a-423c-8be9-e921d8f60796

You will then need to prefix each value with the detail-page URL so that they become:
https://facilities.westkelownacity.ca/Home/Detail?Id=be8a60a8-bc73-4c59-bd03-c47a85425252
https://facilities.westkelownacity.ca/Home/Detail?Id=58b944a0-1a4a-423c-8be9-e921d8f60796

These would be the direct links to detail pages, and you can use them as multiple URLs in a new sitemap.
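
If you would rather script the prefixing step than do it by hand, here is a minimal Python sketch. It assumes you have already exported the stage 1 "data-value" values as a list of strings. The function names and the stage 2 sitemap ID are hypothetical, and the sitemap skeleton only mirrors the "_id", "startUrl", and "selectors" fields visible in the sitemap posted above; you would still need to add your own selectors for the detail page.

```python
import json

# Prefix taken from the detail-page URLs shown above.
DETAIL_PREFIX = "https://facilities.westkelownacity.ca/Home/Detail?Id="


def build_detail_urls(data_values):
    """Turn scraped data-value IDs into detail-page URLs.

    Blank entries and duplicates are skipped; input order is preserved.
    """
    urls = []
    seen = set()
    for value in data_values:
        value = value.strip()
        if not value or value in seen:
            continue
        seen.add(value)
        urls.append(DETAIL_PREFIX + value)
    return urls


def build_stage2_sitemap(urls, sitemap_id="west_kelowna-playground-details"):
    """Return a sitemap skeleton with the URLs as start URLs.

    The sitemap ID is a made-up example and the selectors list is
    intentionally left empty.
    """
    return {"_id": sitemap_id, "startUrl": urls, "selectors": []}


if __name__ == "__main__":
    ids = [
        "be8a60a8-bc73-4c59-bd03-c47a85425252",
        "58b944a0-1a4a-423c-8be9-e921d8f60796",
    ]
    sitemap = build_stage2_sitemap(build_detail_urls(ids))
    print(json.dumps(sitemap, indent=2))
```

The printed JSON can serve as the starting point for the stage 2 sitemap once the detail-page selectors have been filled in.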

I have posted about adding suffix/prefix to URLs here:

techhouse · April 28, 2021, 5:23am

Thank you for this clever solution.
Using your method, I created the links in a spreadsheet, then cleaned up the formatting in Notepad.