I want to scrape data from a directory website.
The path is: homepage -> regions page -> city page -> list of companies -> company information.
My main problem is with the pagination on the company list page. The pagination links shown are 1/2/3/4/5/6/7/8/9/10/20.
On page 20, it shows only 1/20.
So to reach pages 11/12/13/14/15/16/17/18/19, I have to go to page 10 first and then on to 11, 12, etc.
The problem is that the scraper goes to 20 and then to 10, skipping everything in between.
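To make the behaviour concrete, here is a minimal Python sketch of the pagination as described above. It is only a model, not the site's real markup: `pager_links` is a hypothetical function, and the link window it returns for pages 10 to 19 is an assumption. It compares following the pager links once from page 1 with following them again on every page that is reached.

```python
from collections import deque

LAST_PAGE = 20  # from the description above: the last pager link is 20


def pager_links(page: int) -> list[int]:
    """Assumed model of which page numbers the pager links to from `page`."""
    if page == LAST_PAGE:
        # Described above: page 20 shows only 1/20.
        return [1]
    if page < 10:
        window = range(1, 11)           # 1..10, as seen on the first page
    else:
        window = range(10, LAST_PAGE)   # assumption: 10..19 once page 10 is reached
    links = {1, LAST_PAGE, *window}
    links.discard(page)                 # the current page does not link to itself
    return sorted(links)


def follow_once(start: int = 1) -> set[int]:
    """Pages reached when pager links are read from the start page only."""
    return {start, *pager_links(start)}


def follow_recursively(start: int = 1) -> set[int]:
    """Pages reached when pager links are read again on every visited page."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for nxt in pager_links(page):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


print(sorted(follow_once()))         # pages 1-10 and 20; 11-19 are missing
print(sorted(follow_recursively()))  # pages 1-20
```

Under this model, pages 11 to 19 only show up when the pager is read again on page 10, which matches the skipping described above.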
My second problem is less important, but still:
The scraper starts on the main page (pagination number 1), jumps to the end (20), then goes through 10/9/8/7/6/5/4/3/2, but never comes back to 1. I can get the data by making two selectors, one for page 1 and one for the rest, but then the result is 1/10/9/8/7/6/5/4/3/2, so the order is not respected. I want either 10/9/8/7/6/5/4/3/2/1 or 1/2/3/4/5/6/7/8/9/10, if possible.
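If the crawl order itself cannot be controlled, one possible workaround is to reorder the exported rows afterwards. The sketch below assumes a hypothetical URL pattern (`https://example.com/list?page=N`, with page 1 carrying no `page` parameter); the real site's URLs may look different, so `page_number` would need adapting.

```python
from urllib.parse import parse_qs, urlparse


def page_number(url: str) -> int:
    """Extract the page number from a URL; assumes a hypothetical `page` query parameter."""
    query = parse_qs(urlparse(url).query)
    # Assumption: a URL without `page` is the first page.
    return int(query.get("page", ["1"])[0])


# Scraped order as described above: 1, then 10 down to 2 (hypothetical URLs).
scraped = ["https://example.com/list"] + [
    f"https://example.com/list?page={n}" for n in range(10, 1, -1)
]

ordered = sorted(scraped, key=page_number)
print([page_number(u) for u in ordered])  # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```

Passing `reverse=True` to `sorted` would give the 10-to-1 order instead.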
URL: https://www.pagesjaunes.fr/
Sitemap:
{"_id":"pagejaune2","startUrl":["https://www.pagesjaunes.fr"],"selectors":[{"id":"regions","parentSelectors":["_root"],"type":"SelectorLink","selector":".region a","multiple":true,"linkType":"linkFromHref"},{"id":"villes","parentSelectors":["regions"],"type":"SelectorLink","selector":"p:nth-of-type(2) a","multiple":true,"linkType":"linkFromHref"},{"id":"voir tous les pros","parentSelectors":["villes"],"type":"SelectorLink","selector":"#plus-de-liens li:nth-of-type(1) a","multiple":false,"linkType":"linkFromHref"},{"id":"pagination","parentSelectors":["voir tous les pros"],"type":"SelectorLink","selector":".pagination a","multiple":true,"linkType":"linkFromHref"},{"id":"liste entreprises autres pages","parentSelectors":["pagination"],"type":"SelectorLink","selector":".col-xs-12 li a","multiple":true,"linkType":"linkFromHref"},{"id":"nom","parentSelectors":["liste entreprises autres pages"],"type":"SelectorText","selector":"h1","multiple":false,"regex":""},{"id":"Page 1","parentSelectors":["voir tous les pros"],"type":"SelectorLink","selector":".col-xs-12 li a","multiple":true,"linkType":"linkFromHref"},{"id":"sect","parentSelectors":["Page 1"],"type":"SelectorText","selector":"div.teaser-rub","multiple":true,"regex":""}]}
Thanks for your time and help.