Rows with different classes can't be scraped with pagination

Hi.
I'm trying to scrape this link:
https://www.linkedin.com/sales/search/company?companySize=D&geo=samerica%3A0%2Cbr%3A0%2Car%3A0%2Cco%3A0%2Cec%3A0%2Cpe%3A0%2Ccl%3A0%2Cpy%3A0%2Cuy%3A0%2Cbo%3A0%2Cve%3A0%2Cgy%3A0%2Csr%3A0&industry=3&page=2&searchSessionId=SAEt5TVHRL2vADSeLWDjBw%3D%3D

The first page is OK, but from the second page on it grabs only the first 10 companies.
It seems that, starting from the 11th company, the rows are nested under another class called "deferred-area".

This is my sitemap:

Sitemap:
{"_id":"selectall2","startUrl":["https://www.linkedin.com/sales/search/company?companySize=D&geo=samerica%3A0%2Cbr%3A0%2Car%3A0%2Ccl%3A0%2Cpe%3A0%2Cco%3A0%2Cve%3A0%2Cec%3A0%2Cuy%3A0%2Cbo%3A0%2Cpy%3A0%2Cgy%3A0%2Csr%3A0&industry=3&page=1&searchSessionId=SAEt5TVHRL2vADSeLWDjBw%3D%3D"],"selectors":[{"id":"pagination","type":"SelectorElementClick","parentSelectors":["_root"],"selector":"li.pv5, div.deferred-area li.pv5","multiple":true,"delay":"2000","clickElementSelector":"button.search-results__pagination-next-button span.v-align-middle","clickType":"clickMore","discardInitialElements":false,"clickElementUniquenessType":"uniqueText"},{"id":"company","type":"SelectorText","parentSelectors":["pagination"],"selector":"dt.result-lockup__name a.ember-view","multiple":false,"regex":"","delay":"2000"}]}

How can I fix that?
Thanks!

I don't have access to Sales Navigator. However, I would try an app called Instant Data Scraper (check the Chrome Web Store). It's a one-click, hit-or-miss type of scraper that happens to work on LinkedIn and LinkedIn Recruiter.

Also, be careful. LinkedIn deploys countermeasures and will eventually put you in "LinkedIn jail" by suspending your account.

I also cannot access LinkedIn Sales Navigator. But looking at your sitemap, let me make some guesses that will hopefully help.

(1) Rows with different classes may not be the problem. Usually, the "lowest common denominator" CSS selector should get the right element selected. That is your li.pv5.

(2) Try not to use an Element Click selector to cover both the items and the pagination. How about using a Link selector for pagination and an Element selector for items? Make pagination a child of root and of itself. Make items a child of root and pagination. Then, under items, add selectors for the details.

The reason I make this guess is that the URL looks like it contains explicit parameters of &page=1 and &searchSessionId= ... This could mean the URL changes on every page, and you should treat navigation links as simple links.
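To illustrate the point about the page parameter, here is a minimal Python sketch. It is not part of Web Scraper, and it only assumes (I can't verify this without access) that changing `page` in the URL is enough to reach each results page. The function name `page_urls` and the shortened URL are made up for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def page_urls(start_url, first, last):
    """Return copies of start_url with the `page` query parameter set to first..last."""
    parts = urlsplit(start_url)
    # Decode the query string into (key, value) pairs, keeping their order.
    params = parse_qsl(parts.query, keep_blank_values=True)
    urls = []
    for page in range(first, last + 1):
        # Replace only the value of `page`; every other parameter is kept.
        updated = [(k, str(page) if k == "page" else v) for k, v in params]
        urls.append(urlunsplit(parts._replace(query=urlencode(updated))))
    return urls


# Shortened, hypothetical version of the search URL from this thread.
start = "https://www.linkedin.com/sales/search/company?companySize=D&industry=3&page=1"
for url in page_urls(start, 1, 3):
    print(url)
```

If the pages really are addressable this way, each printed URL could be opened directly, which is why a plain Link selector seems worth trying for pagination.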

Just my guesses. Hope they help.

@bretfeig thanks, but that extension doesn't work with LinkedIn.
@jasond I will read what you posted again and give it a try :smiley:
Thanks, guys!

Sure it does. I've tested it with Recruiter and regular LinkedIn.

Screen clipping taken: 9/13/2018 7:35 AM

Oh man!
Yes, sorry, I tried again and it is working now.
Not sure why I got that message before.
BTW, as you can see, it is grabbing only the first 10 rows.

Hey @jasond
I think I got lost.
I tried this:
{"_id":"testlinkedinnewinterface3","startUrl":["https://www.linkedin.com/sales/search/company?companySize=D&geo=samerica%3A0%2Cbr%3A0%2Car%3A0%2Cpe%3A0%2Cco%3A0%2Ccl%3A0%2Cec%3A0%2Cuy%3A0%2Cve%3A0%2Csr%3A0%2Cbo%3A0%2Cgy%3A0%2Cpy%3A0&industry=25&page=1&searchSessionId=Q2DstbvoTn2uIA0jUm2wQg%3D%3D"],"selectors":[{"id":"pagination","type":"SelectorLink","parentSelectors":["_root","pagination"],"selector":".search-results__pagination-next-button","multiple":false,"delay":0},{"id":"rows","type":"SelectorElement","parentSelectors":["pagination"],"selector":"ol.search-results__result-list > li.pv5:nth-of-type(1)","multiple":true,"delay":0},{"id":"company","type":"SelectorText","parentSelectors":["rows"],"selector":"dt.result-lockup__name a.ember-view","multiple":true,"regex":"","delay":0}]}

But it's not working.
I'm not sure I understood what you suggested.

I'm only guessing. How about this sitemap?

{"_id":"a_testlinkedinnewinterface3","startUrl":["https://www.linkedin.com/sales/search/company?companySize=D&geo=samerica%3A0%2Cbr%3A0%2Car%3A0%2Cpe%3A0%2Cco%3A0%2Ccl%3A0%2Cec%3A0%2Cuy%3A0%2Cve%3A0%2Csr%3A0%2Cbo%3A0%2Cgy%3A0%2Cpy%3A0&industry=25&page=1&searchSessionId=Q2DstbvoTn2uIA0jUm2wQg%3D%3D"],"selectors":[{"id":"pagination","type":"SelectorLink","parentSelectors":["_root","pagination"],"selector":".search-results__pagination-next-button","multiple":false,"delay":0},{"id":"rows","type":"SelectorElement","parentSelectors":["_root","pagination"],"selector":"ol.search-results__result-list > li.pv5","multiple":true,"delay":0},{"id":"company","type":"SelectorText","parentSelectors":["rows"],"selector":"dt.result-lockup__name a.ember-view","multiple":true,"regex":"","delay":0}]}

The changes I made:

(1) Additionally make "root" also the parent of "rows". That is, "rows" has 2 parents: root and pagination.

This assumes that rows are already available for scraping on page 1.

(2) Removed ":nth-of-type(1)" from the "rows" selector.

This assumes the CSS selector "ol.search-results__result-list > li.pv5" can help you select all the rows, on page 1 or page 2, etc.

Experimenting with some delays may help if LinkedIn relies heavily on JS and AJAX to load data.
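Change (1) can be sanity-checked outside the browser. The sketch below is plain Python, not a Web Scraper feature. The helper name `runs_on_start_page` is hypothetical, and the two sitemaps are trimmed down to just the "rows" selector. It assumes that a selector is applied to the start URL only when "_root" is among its `parentSelectors`.

```python
import json


def runs_on_start_page(sitemap, selector_id):
    """Return True if the selector lists "_root" among its parents."""
    for sel in sitemap["selectors"]:
        if sel["id"] == selector_id:
            return "_root" in sel["parentSelectors"]
    raise KeyError(selector_id)


# "rows" as in the earlier sitemap: its only parent is pagination.
before = json.loads('{"selectors":[{"id":"rows","parentSelectors":["pagination"]}]}')
# "rows" after change (1): both root and pagination are parents.
after = json.loads('{"selectors":[{"id":"rows","parentSelectors":["_root","pagination"]}]}')

print(runs_on_start_page(before, "rows"))  # False
print(runs_on_start_page(after, "rows"))   # True
```

Under that assumption, the earlier sitemap would skip the rows on page 1 entirely, which is what change (1) is meant to address.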

Hey @jasond
Thanks for that.
But again, it is grabbing only the first 10 rows.
Also, pagination doesn't work at all.

I'm trying this sitemap now:

{"_id":"testlinkedinnewinterface2","startUrl":["https://www.linkedin.com/sales/search/company?companySize=D&geo=samerica%3A0%2Cbr%3A0%2Car%3A0%2Ccl%3A0%2Cco%3A0%2Cpe%3A0%2Cuy%3A0%2Cbo%3A0%2Cec%3A0%2Cpy%3A0%2Cve%3A0%2Cgy%3A0%2Csr%3A0&industry=24&numOfFollowers=NFR5%2CNFR4%2CNFR3%2CNFR2&page=3&searchSessionId=Q2DstbvoTn2uIA0jUm2wQg%3D%3D"],"selectors":[{"id":"pagination","type":"SelectorElementClick","parentSelectors":["_root"],"selector":"dl","multiple":true,"delay":0,"clickElementSelector":"button.search-results__pagination-next-button span.v-align-middle","clickType":"clickMore","discardInitialElements":false,"clickElementUniquenessType":"uniqueText"},{"id":"companylink","type":"SelectorLink","parentSelectors":["pagination"],"selector":"dt.result-lockup__name a.ember-view","multiple":false,"delay":0},{"id":"elements","type":"SelectorElement","parentSelectors":["companylink"],"selector":"div.entity-card.banner","multiple":true,"delay":0},{"id":"name","type":"SelectorText","parentSelectors":["elements"],"selector":"h1.title","multiple":false,"regex":"","delay":0},{"id":"employees","type":"SelectorText","parentSelectors":["elements"],"selector":"span.cta-link a","multiple":false,"regex":"","delay":0},{"id":"industry","type":"SelectorText","parentSelectors":["pagination"],"selector":"li.result-lockup__misc-item:nth-of-type(1)","multiple":false,"regex":"","delay":0},{"id":"size","type":"SelectorText","parentSelectors":["pagination"],"selector":"a.result-lockup__undecorated-link","multiple":false,"regex":"","delay":0},{"id":"country","type":"SelectorText","parentSelectors":["pagination"],"selector":"li.result-lockup__misc-item:nth-of-type(3)","multiple":false,"regex":"","delay":0},{"id":"website","type":"SelectorLink","parentSelectors":["elements"],"selector":"a.meta-link.website","multiple":false,"delay":0}]}

I'm using dl instead of li.pv5, but I still get only the first 10 rows :frowning:

Could it be something small like the following? Instead of

(a) ol.search-results__result-list > li.pv5

try just

(b) li.pv5

or something less specific than (a), unless that turns out to be too general.

On some Websites, I've experienced rows in page 1 and page 2 having different classes, because page 1 rows are "featured" or have some priority items.

I tried this:
{"_id":"testlinkedinnewinterface2","startUrl":["Sales Navigator span.v-align-middle","clickType":"clickMore","discardInitialElements":false,"clickElementUniquenessType":"uniqueText"},{"id":"companylink","type":"SelectorLink","parentSelectors":["pagination"],"selector":"dt.result-lockup__name a.ember-view","multiple":false,"delay":"2000"},{"id":"elements","type":"SelectorElement","parentSelectors":["companylink"],"selector":"div.entity-card.banner","multiple":true,"delay":"2000"},{"id":"name","type":"SelectorText","parentSelectors":["elements"],"selector":"h1.title","multiple":false,"regex":"","delay":"2000"},{"id":"employees","type":"SelectorText","parentSelectors":["elements"],"selector":"span.cta-link a","multiple":false,"regex":"","delay":"2000"},{"id":"industry","type":"SelectorText","parentSelectors":["pagination"],"selector":"li.result-lockup__misc-item:nth-of-type(1)","multiple":false,"regex":"","delay":"2000"},{"id":"size","type":"SelectorText","parentSelectors":["pagination"],"selector":"a.result-lockup__undecorated-link","multiple":false,"regex":"","delay":"2000"},{"id":"country","type":"SelectorText","parentSelectors":["pagination"],"selector":"li.result-lockup__misc-item:nth-of-type(3)","multiple":false,"regex":"","delay":"2000"},{"id":"website","type":"SelectorLink","parentSelectors":["elements"],"selector":"a.meta-link.website","multiple":false,"delay":"2000"}]}

But it scraped only 20 rows,
I think the first 10 rows from each of 2 pages,
even though the class seemed to be correct.