Pagination exception

JeffJJones · July 9, 2018, 1:37am

Pagination doesn't work on the site I'm currently scraping. For example, Web Scraper generates a pagination link that looks like this:

https://www.linkedin.com/recruiter/projects/1074489146#page=2

from the following HTML code:

<a class="page-link -sitemap-select-item-selected" data-tracking-control-name="pagination-2" href="#page=2" title="Page 2" data-li-page="2"><span class="hide-a11y">Page </span>2</a>

But the URL the browser actually goes to when you click the Page 2 pagination link on the site is the following:

https://www.linkedin.com/recruiter/projects/1074489146#status/0/25

Page 3 URL looks like the following:

https://www.linkedin.com/recruiter/projects/1074489146#status/0/50

Evidently the site translates the page number into a result offset: the first 25 entries in the result set, then the next 25, and so on.
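The offset pattern described above can be sketched in a few lines of Python. This is only an illustration inferred from the two URLs quoted in this post: the page size of 25, the function name `status_url`, and the assumption that the pattern continues past page 3 are all guesses, not anything LinkedIn or Web Scraper documents.

```python
PAGE_SIZE = 25  # assumption: inferred from the #status/0/25 and #status/0/50 URLs


def status_url(project_url: str, page: int, page_size: int = PAGE_SIZE) -> str:
    """Build the '#status/0/<offset>' URL for a 1-based page number (hypothetical helper)."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    offset = (page - 1) * page_size  # page 2 -> 25, page 3 -> 50
    return f"{project_url}#status/0/{offset}"


base = "https://www.linkedin.com/recruiter/projects/1074489146"
print(status_url(base, 2))
print(status_url(base, 3))
```

If the pattern holds, a list of such URLs could be supplied as start URLs, although whether the site loads the right results from the fragment alone is untested here.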

I don't know whether this URL rewriting is meant to defeat web scraping, but it keeps Web Scraper from properly computing the correct next-page URL.

How do I get around this?

Thanks,
Jeff

bretfeig · July 9, 2018, 2:00am

I have a sitemap for LinkedIn projects. Let me dig it up and see how I handled it. I can't remember if I used this or dataminer to scrape.

bretfeig · July 9, 2018, 2:35am

Here is the sitemap that will paginate through LinkedIn Recruiter projects, scraping name, title, location and current pipeline status. You need to change the starting URL to match your project. I used (li.next a.page-link) as the link selector and made it a child of both root and itself.

I then created an element selector that identifies each row (it was also made a child of both root and the pagination selector).

{"_id":"project","startUrl":["https://www.linkedin.com/recruiter/projects/1080628893#status/0/25"],"selectors":[{"id":"page-change","type":"SelectorLink","selector":"li.next a.page-link","parentSelectors":["_root","page-change"],"multiple":false,"delay":0},{"id":"Element","type":"SelectorElement","selector":"div.row-inner","parentSelectors":["_root","page-change"],"multiple":true,"delay":0},{"id":"Name","type":"SelectorText","selector":"a.title","parentSelectors":["Element"],"multiple":false,"regex":"","delay":0},{"id":"title","type":"SelectorText","selector":"p.headline","parentSelectors":["Element"],"multiple":false,"regex":"","delay":0},{"id":"Location","type":"SelectorText","selector":"dd:nth-of-type(1)","parentSelectors":["Element"],"multiple":false,"regex":"","delay":0},{"id":"status","type":"SelectorText","selector":"span.status-text","parentSelectors":["Element"],"multiple":false,"regex":"","delay":0}]}

JeffJJones · July 11, 2018, 2:14am

Thank you 'bretfeig'.

However, this is not working.

'li.next a.page-link' produces the URL https://www.linkedin.com/recruiter/projects/1074489146#page=2, which is not the URL you are actually sent to when you click the NEXT button. You are instead sent to '...1074489146#status/0/50'.
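For anyone wondering where the #page=2 URL comes from: a link selector can only read the anchor's href, and a fragment-only href resolves against the current page URL by standard URL rules. The sketch below reproduces that resolution with Python's `urllib.parse.urljoin`. It illustrates the general rule only; it is an assumption that Web Scraper resolves links the same way, and the click-time navigation to #status/... is presumably done by the site's own JavaScript, which no href resolution can see.

```python
from urllib.parse import urljoin

# URL of the page being scraped and the href from the pagination anchor quoted earlier.
page_url = "https://www.linkedin.com/recruiter/projects/1074489146"
href = "#page=2"

# A fragment-only reference keeps the base path and swaps in the new fragment.
resolved = urljoin(page_url, href)
print(resolved)

# Starting from a URL that already has a fragment, the old fragment is replaced.
resolved_from_status = urljoin(page_url + "#status/0/25", "#page=3")
print(resolved_from_status)
```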

thanks

JeffJJones · July 12, 2018, 8:09pm

I think I've found the way to make this work: Web Scraper has an "Element Click Selector" that actually clicks the Next button. I have yet to figure out how to use it properly, though. I have reviewed the documentation for it but am still confused. Anyone?

JeffJJones · July 12, 2018, 8:54pm

I have a "Next" button on the list of results; it shows the next 25 results and so on until there are no more results.

I have added an "Element Click Selector" under root named "page".
Selector is set to "div.row"; this contains each candidate's information that I am scraping.
I set Click Selector to "li.next a.page-link".
Click Type is set to "Click More".
Click Element Uniqueness is set to "Unique Text".
Multiple is checked.
Discard is unchecked.
Delay is set to 3000ms.

I have an Element Selector named "Candidate", which is a child of both "root" and "page".

If I Scrape, it does go through every page, but only scrapes the data from the first and last page! What am I doing wrong?

{"_id":"linkedinv2","startUrl":["https://www.linkedin.com/recruiter/projects/1074489146"],"selectors":[{"id":"Candidate","type":"SelectorElement","selector":"div.row","parentSelectors":["_root","page"],"multiple":true,"delay":"500"},{"id":"First Name","type":"SelectorText","selector":"a.title","parentSelectors":["Candidate"],"multiple":false,"regex":".*(?=\\s)","delay":0},{"id":"Last Name","type":"SelectorText","selector":"a.title","parentSelectors":["Candidate"],"multiple":false,"regex":"(?<=\\s).*","delay":0},{"id":"Employer","type":"SelectorText","selector":"p.headline","parentSelectors":["Candidate"],"multiple":false,"regex":"(?<=\\sat\\s+).*","delay":0},{"id":"Location","type":"SelectorText","selector":"dd:nth-of-type(1)","parentSelectors":["Candidate"],"multiple":false,"regex":"","delay":0},{"id":"Title","type":"SelectorText","selector":"p.headline","parentSelectors":["Candidate"],"multiple":false,"regex":".*(?=\\sat\\s)","delay":0},{"id":"page","type":"SelectorElementClick","selector":"div.row","parentSelectors":["_root"],"multiple":true,"delay":"3000","clickElementSelector":"li.next a.page-link","clickType":"clickMore","discardInitialElements":false,"clickElementUniquenessType":"uniqueText"}]}

KristapsWS · July 13, 2018, 12:16pm

You don't need the "Candidate" selector. Make all of the "Candidate" child selectors children of the "page" selector instead, then delete the "Candidate" selector.
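As a rough illustration of that change, the sketch below applies it to a trimmed copy of the sitemap posted above, using plain Python. The helper name `merge_into_parent` is hypothetical and not part of Web Scraper; the field names (`selectors`, `id`, `parentSelectors`) are taken from the exported sitemap JSON in this thread, and only one text selector is kept to stay short. The same edit can of course be made by hand in the extension's selector tree.

```python
import json


def merge_into_parent(sitemap: dict, remove_id: str, new_parent_id: str) -> dict:
    """Return a copy of the sitemap with `remove_id` deleted and its
    children re-parented to `new_parent_id` (hypothetical helper)."""
    selectors = []
    for sel in sitemap["selectors"]:
        if sel["id"] == remove_id:
            continue  # drop the redundant selector itself
        parents = [new_parent_id if p == remove_id else p
                   for p in sel["parentSelectors"]]
        parents = list(dict.fromkeys(parents))  # de-duplicate, keep order
        selectors.append({**sel, "parentSelectors": parents})
    return {**sitemap, "selectors": selectors}


# Trimmed version of the sitemap from the post above.
sitemap = {
    "_id": "linkedinv2",
    "startUrl": ["https://www.linkedin.com/recruiter/projects/1074489146"],
    "selectors": [
        {"id": "Candidate", "type": "SelectorElement", "selector": "div.row",
         "parentSelectors": ["_root", "page"], "multiple": True, "delay": "500"},
        {"id": "Location", "type": "SelectorText", "selector": "dd:nth-of-type(1)",
         "parentSelectors": ["Candidate"], "multiple": False, "regex": "", "delay": 0},
        {"id": "page", "type": "SelectorElementClick", "selector": "div.row",
         "parentSelectors": ["_root"], "multiple": True, "delay": "3000",
         "clickElementSelector": "li.next a.page-link", "clickType": "clickMore",
         "discardInitialElements": False, "clickElementUniquenessType": "uniqueText"},
    ],
}

fixed = merge_into_parent(sitemap, "Candidate", "page")
print(json.dumps(fixed))  # paste the output into Web Scraper's "Import Sitemap"
```

After the change, "page" (the Element Click Selector on div.row) is the only selector that picks out rows, and each text selector reads its value from those rows directly.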

JeffJJones · July 13, 2018, 1:57pm

Thanks so much. In hindsight, that should have been obvious to me.

Jeff

bretfeig · July 13, 2018, 10:30pm

Hmm. That's odd; it worked fine for me.