Pagination Problem - nth-of-type

Hi

I'm having trouble scraping all pages of this site -

https://www.ggf.org.uk/members/

I can navigate into each service link from this page, and from there click into each company record and scrape the data I need. The problem is the pagination on the service pages, where navigation is handled by a series of link elements that don't have a class or id.

On the first page of results, picking the next-page ('>') link gives a:nth-of-type(4) as the selector. That works on page 1, but on page 2 a:nth-of-type(4) points to page 4, because the page adds a link for navigating backwards plus extra page-number links. To select '>' there I would need a:nth-of-type(6). As a result the scraper only loads some of the pages and doesn't pick up all the records.
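
To make the shift concrete, here is a minimal Python sketch. The two label lists are hypothetical layouts chosen only to match the positions described above (4th link on page 1, 6th on page 2); they are not copied from the site's markup, and the helper names are made up for illustration.

```python
# Hypothetical pagination link labels, in document order.
# Assumption: the current page number is not rendered as a link.
page1_links = ["2", "3", "4", ">"]
page2_links = ["<", "1", "3", "4", "5", ">"]


def nth_of_type(links, n):
    """Return the label at 1-based position n, mimicking a:nth-of-type(n)."""
    return links[n - 1] if 1 <= n <= len(links) else None


def position_of(links, label):
    """Return the 1-based position of the first link with this label."""
    return links.index(label) + 1


for name, links in (("page 1", page1_links), ("page 2", page2_links)):
    print(name, "a:nth-of-type(4) ->", nth_of_type(links, 4),
          "| '>' is at position", position_of(links, ">"))
```

Any selector that depends on position will move whenever links are added in front of '>', which is why a fixed index cannot follow the next-page link.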

Is there a way around this, so it always picks up the '>' selector?

Sitemap:

{"_id":"cpa_ggf","startUrl":["http://www.ggf.org.uk/members/"],"selectors":[{"id":"ServiceLink","type":"SelectorLink","parentSelectors":["_root"],"selector":"a.categories","multiple":true,"delay":0},{"id":"CompanyLink","type":"SelectorLink","parentSelectors":["_root","pagination"],"selector":"h2.geodir-entry-title a","multiple":true,"delay":0},{"id":"Name","type":"SelectorText","parentSelectors":["CompanyLink"],"selector":"h2.entry-title","multiple":false,"regex":"","delay":0},{"id":"Address","type":"SelectorText","parentSelectors":["CompanyLink"],"selector":"div.featured-overlay","multiple":false,"regex":"","delay":0},{"id":"Web","type":"SelectorText","parentSelectors":["CompanyLink"],"selector":"div.featured-overlay a","multiple":false,"regex":"","delay":0},{"id":"Info","type":"SelectorText","parentSelectors":["CompanyLink"],"selector":"li p","multiple":false,"regex":"","delay":0},{"id":"pagination","type":"SelectorLink","parentSelectors":["ServiceLink"],"selector":"a:nth-of-type(4)","multiple":false,"delay":0}]}

Thanks

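Yes, you can. Use the selector ".Navi a:contains('>')", which matches the '>' link by its text rather than its position, so it selects it on every page. Make sure you check the 'Multiple' box for the Link selector, otherwise it will stop after the first click. In the sitemap below I also added the pagination selector as a parent of itself and changed the CompanyLink parents to ServiceLink and pagination. I have updated your sitemap so that the pagination works: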

{"_id":"cpa_ggf","startUrl":["http://www.ggf.org.uk/members/"],"selectors":[{"id":"ServiceLink","type":"SelectorLink","parentSelectors":["_root"],"selector":"a.categories","multiple":true,"delay":0},{"id":"CompanyLink","type":"SelectorLink","parentSelectors":["ServiceLink","pagination"],"selector":"h2.geodir-entry-title a","multiple":true,"delay":0},{"id":"Name","type":"SelectorText","parentSelectors":["CompanyLink"],"selector":"h2.entry-title","multiple":false,"regex":"","delay":0},{"id":"Address","type":"SelectorText","parentSelectors":["CompanyLink"],"selector":"div.featured-overlay","multiple":false,"regex":"","delay":0},{"id":"Web","type":"SelectorText","parentSelectors":["CompanyLink"],"selector":"div.featured-overlay a","multiple":false,"regex":"","delay":0},{"id":"Info","type":"SelectorText","parentSelectors":["CompanyLink"],"selector":"li p","multiple":false,"regex":"","delay":0},{"id":"pagination","type":"SelectorLink","parentSelectors":["ServiceLink","pagination"],"selector":".Navi a:contains('>')","multiple":true,"delay":0}]}
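
If you want to check the idea outside the scraper, here is a rough Python sketch using only the standard library. It picks links by their text instead of their position. The HTML snippets, hrefs and the LinkTextFinder class are assumptions made up for illustration, and unlike the selector above the sketch does not restrict the search to the .Navi container.

```python
from html.parser import HTMLParser


class LinkTextFinder(HTMLParser):
    """Collect the href of every <a> whose text contains `needle`."""

    def __init__(self, needle):
        super().__init__()  # convert_charrefs=True, so &gt; arrives as '>'
        self.needle = needle
        self.matches = []
        self._in_link = False
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_link = True
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._in_link:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            if self.needle in "".join(self._text):
                self.matches.append(self._href)
            self._in_link = False


# Hypothetical markup for the first two result pages.
PAGE_1 = ('<div class="Navi"><strong>1</strong><a href="/p/2">2</a>'
          '<a href="/p/3">3</a><a href="/p/4">4</a><a href="/p/2">&gt;</a></div>')
PAGE_2 = ('<div class="Navi"><a href="/p/1">&lt;</a><a href="/p/1">1</a>'
          '<strong>2</strong><a href="/p/3">3</a><a href="/p/4">4</a>'
          '<a href="/p/5">5</a><a href="/p/3">&gt;</a></div>')


def next_links(html):
    """Return the hrefs of links whose text contains '>'."""
    finder = LinkTextFinder(">")
    finder.feed(html)
    finder.close()
    return finder.matches


print(next_links(PAGE_1))
print(next_links(PAGE_2))
```

One thing to watch for: a "contains" match is a substring match, so a pager that also has a '>>' (last page) link would match both. In that case the selector would need to be narrowed further.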


Brilliant - works perfectly! I've had the same problem before, so it's good to know about the 'contains' method for the future.

Thanks for your help!