Rows with different classes can't be scraped with pagination

eldoland · September 12, 2018, 3:22pm

Hi.
I'm trying to scrape this link:
https://www.linkedin.com/sales/search/company?companySize=D&geo=samerica%3A0%2Cbr%3A0%2Car%3A0%2Cco%3A0%2Cec%3A0%2Cpe%3A0%2Ccl%3A0%2Cpy%3A0%2Cuy%3A0%2Cbo%3A0%2Cve%3A0%2Cgy%3A0%2Csr%3A0&industry=3&page=2&searchSessionId=SAEt5TVHRL2vADSeLWDjBw%3D%3D

The first page is OK, but from the second page on it grabs only the first 10 companies.
It seems that, starting from the 11th company, the rows are nested under another class called "deferred-area".

This is my sitemap:
{"_id":"selectall2","startUrl":["https://www.linkedin.com/sales/search/company?companySize=D&geo=samerica%3A0%2Cbr%3A0%2Car%3A0%2Ccl%3A0%2Cpe%3A0%2Cco%3A0%2Cve%3A0%2Cec%3A0%2Cuy%3A0%2Cbo%3A0%2Cpy%3A0%2Cgy%3A0%2Csr%3A0&industry=3&page=1&searchSessionId=SAEt5TVHRL2vADSeLWDjBw%3D%3D"],"selectors":[{"id":"pagination","type":"SelectorElementClick","parentSelectors":["_root"],"selector":"li.pv5, div.deferred-area li.pv5","multiple":true,"delay":"2000","clickElementSelector":"button.search-results__pagination-next-button span.v-align-middle","clickType":"clickMore","discardInitialElements":false,"clickElementUniquenessType":"uniqueText"},{"id":"company","type":"SelectorText","parentSelectors":["pagination"],"selector":"dt.result-lockup__name a.ember-view","multiple":false,"regex":"","delay":"2000"}]}

How can I fix that?
Thanks!

bretfeig · September 13, 2018, 9:27am

I don't have access to Sales Navigator. However, I would try an app called Instant Data Scraper (check the Chrome store). It's a one-click, hit-or-miss type of scraper that happens to work on LinkedIn and LinkedIn Recruiter.

Also, be careful. LinkedIn deploys countermeasures and will eventually put you in "LinkedIn jail" by suspending your account.

jasond · September 13, 2018, 10:56am

I also cannot access LinkedIn Sales Navigator. But looking at your sitemap, let me make some guesses that will hopefully help.

(1) Rows with different classes may not be the problem. Usually, the "lowest common denominator" CSS selector should get the right element selected. That is your li.pv5.

(2) Try not to use an Element Click selector to encompass both the items and the pagination. How about using a Link selector for pagination and an Element selector for items? Make pagination a child of root and of itself. Make items a child of root and pagination. Then, under items, add selectors for the details.

The reason I make this guess is that the URL looks like it contains explicit parameters of &page=1 and &searchSessionId= ... This could mean the URL changes on every page, and you should treat navigation links as simple links.
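
To illustrate the point about &page=, here is a minimal Python sketch. It assumes the results really are addressable by changing only the page query parameter, which I could not verify without a Sales Navigator account. The helper name page_urls and the shortened example URL are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def page_urls(start_url, first, last):
    """Return copies of start_url with the `page` query parameter set to first..last."""
    parts = urlsplit(start_url)
    # keep_blank_values so empty parameters survive the round trip
    query = parse_qsl(parts.query, keep_blank_values=True)
    urls = []
    for page in range(first, last + 1):
        # Drop any existing `page` value and append the new one at the end;
        # the other parameters keep their original order
        new_query = [(k, v) for k, v in query if k != "page"] + [("page", str(page))]
        urls.append(urlunsplit(parts._replace(query=urlencode(new_query))))
    return urls


# Shortened, hypothetical version of the search URL from the first post
example = "https://www.linkedin.com/sales/search/company?companySize=D&industry=3&page=1"
for url in page_urls(example, 1, 3):
    print(url)
```

If the pages do load this way, the generated URLs could be tried as separate entries in the sitemap's startUrl list, though the searchSessionId parameter may expire and make them stop working.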

Just my guesses. Hope they help.

eldoland · September 13, 2018, 11:17am

@bretfeig Thanks, but that extension doesn't work with LinkedIn.
@jasond I will read what you posted again and give it a try.
Thanks, guys!

bretfeig · September 13, 2018, 11:35am

Sure it does. I've tested it with Recruiter and regular LinkedIn.

[Screenshot attached; screen clipping taken 9/13/2018 7:35 AM]

bretfeig · September 13, 2018, 11:36am

eldoland · September 13, 2018, 11:55am

Oh man!
Yes, sorry, I tried again and it is working now.
Not sure why I got that message before.
By the way, as you can see, it is grabbing only the first 10 rows.

eldoland · September 13, 2018, 12:08pm

Hey @jasond
I think I got lost.
I tried this:
{"_id":"testlinkedinnewinterface3","startUrl":["https://www.linkedin.com/sales/search/company?companySize=D&geo=samerica%3A0%2Cbr%3A0%2Car%3A0%2Cpe%3A0%2Cco%3A0%2Ccl%3A0%2Cec%3A0%2Cuy%3A0%2Cve%3A0%2Csr%3A0%2Cbo%3A0%2Cgy%3A0%2Cpy%3A0&industry=25&page=1&searchSessionId=Q2DstbvoTn2uIA0jUm2wQg%3D%3D"],"selectors":[{"id":"pagination","type":"SelectorLink","parentSelectors":["_root","pagination"],"selector":".search-results__pagination-next-button","multiple":false,"delay":0},{"id":"rows","type":"SelectorElement","parentSelectors":["pagination"],"selector":"ol.search-results__result-list > li.pv5:nth-of-type(1)","multiple":true,"delay":0},{"id":"company","type":"SelectorText","parentSelectors":["rows"],"selector":"dt.result-lockup__name a.ember-view","multiple":true,"regex":"","delay":0}]}

But it's not working.
I'm not sure I understood what you suggested.

jasond · September 14, 2018, 5:33am

I'm only guessing. How about this sitemap?

{"_id":"a_testlinkedinnewinterface3","startUrl":["https://www.linkedin.com/sales/search/company?companySize=D&geo=samerica%3A0%2Cbr%3A0%2Car%3A0%2Cpe%3A0%2Cco%3A0%2Ccl%3A0%2Cec%3A0%2Cuy%3A0%2Cve%3A0%2Csr%3A0%2Cbo%3A0%2Cgy%3A0%2Cpy%3A0&industry=25&page=1&searchSessionId=Q2DstbvoTn2uIA0jUm2wQg%3D%3D"],"selectors":[{"id":"pagination","type":"SelectorLink","parentSelectors":["_root","pagination"],"selector":".search-results__pagination-next-button","multiple":false,"delay":0},{"id":"rows","type":"SelectorElement","parentSelectors":["_root","pagination"],"selector":"ol.search-results__result-list > li.pv5","multiple":true,"delay":0},{"id":"company","type":"SelectorText","parentSelectors":["rows"],"selector":"dt.result-lockup__name a.ember-view","multiple":true,"regex":"","delay":0}]}

The changes I made:

(1) Made "root" a parent of "rows" as well. That is, "rows" now has 2 parents: root and pagination.

This assumes that rows are already available for scraping on page 1.

(2) Removed ":nth-of-type(1)" from the "rows" selector.

This assumes the CSS selector "ol.search-results__result-list > li.pv5" can help you select all the rows, on page 1 or page 2, etc.

Experimenting with some delays may help if LinkedIn uses a lot of JavaScript and AJAX to load data.
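
The parent/child change in (1) can be checked offline. Below is a minimal Python sketch that reads a sitemap's selectors list and prints which selectors run under each parent. The function name children_by_parent is hypothetical, and the sitemap is trimmed to ids and parents only (start URL and CSS selectors omitted).

```python
import json
from collections import defaultdict


def children_by_parent(sitemap):
    """Map each parent id to the ids of the selectors that list it in parentSelectors."""
    tree = defaultdict(list)
    for selector in sitemap["selectors"]:
        for parent in selector["parentSelectors"]:
            tree[parent].append(selector["id"])
    return dict(tree)


# Trimmed copy of the sitemap above: only ids and parents are kept
sitemap = json.loads("""
{"_id": "a_testlinkedinnewinterface3",
 "selectors": [
   {"id": "pagination", "parentSelectors": ["_root", "pagination"]},
   {"id": "rows", "parentSelectors": ["_root", "pagination"]},
   {"id": "company", "parentSelectors": ["rows"]}
 ]}
""")

tree = children_by_parent(sitemap)
for parent, children in tree.items():
    print(parent, "->", children)

# If "rows" is not listed under "_root", rows would only be extracted from
# pages reached through the pagination link, not from the start page
print("rows attached to start page:", "rows" in tree.get("_root", []))
```

Running the same check on the earlier sitemap, where "rows" had only "pagination" as a parent, would show "rows" missing under "_root".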

eldoland · September 14, 2018, 10:24am

Hey @jasond
Thanks for that.
But again, it is grabbing only the first 10 rows.
Also, pagination doesn't work at all.

eldoland · September 14, 2018, 12:12pm

I'm trying this sitemap now:

{"_id":"testlinkedinnewinterface2","startUrl":["https://www.linkedin.com/sales/search/company?companySize=D&geo=samerica%3A0%2Cbr%3A0%2Car%3A0%2Ccl%3A0%2Cco%3A0%2Cpe%3A0%2Cuy%3A0%2Cbo%3A0%2Cec%3A0%2Cpy%3A0%2Cve%3A0%2Cgy%3A0%2Csr%3A0&industry=24&numOfFollowers=NFR5%2CNFR4%2CNFR3%2CNFR2&page=3&searchSessionId=Q2DstbvoTn2uIA0jUm2wQg%3D%3D"],"selectors":[{"id":"pagination","type":"SelectorElementClick","parentSelectors":["_root"],"selector":"dl","multiple":true,"delay":0,"clickElementSelector":"button.search-results__pagination-next-button span.v-align-middle","clickType":"clickMore","discardInitialElements":false,"clickElementUniquenessType":"uniqueText"},{"id":"companylink","type":"SelectorLink","parentSelectors":["pagination"],"selector":"dt.result-lockup__name a.ember-view","multiple":false,"delay":0},{"id":"elements","type":"SelectorElement","parentSelectors":["companylink"],"selector":"div.entity-card.banner","multiple":true,"delay":0},{"id":"name","type":"SelectorText","parentSelectors":["elements"],"selector":"h1.title","multiple":false,"regex":"","delay":0},{"id":"employees","type":"SelectorText","parentSelectors":["elements"],"selector":"span.cta-link a","multiple":false,"regex":"","delay":0},{"id":"industry","type":"SelectorText","parentSelectors":["pagination"],"selector":"li.result-lockup__misc-item:nth-of-type(1)","multiple":false,"regex":"","delay":0},{"id":"size","type":"SelectorText","parentSelectors":["pagination"],"selector":"a.result-lockup__undecorated-link","multiple":false,"regex":"","delay":0},{"id":"country","type":"SelectorText","parentSelectors":["pagination"],"selector":"li.result-lockup__misc-item:nth-of-type(3)","multiple":false,"regex":"","delay":0},{"id":"website","type":"SelectorLink","parentSelectors":["elements"],"selector":"a.meta-link.website","multiple":false,"delay":0}]}

I'm using dl instead of pv5, but I still get only the first 10 rows.

jasond · September 14, 2018, 1:41pm

Could it be something small like the following? How about, instead of

(a) ol.search-results__result-list > li.pv5

try just

(b) li.pv5

or something less specific than (a), unless this has become too general.

On some websites, I've experienced rows in page 1 and page 2 having different classes, because page 1 rows are "featured" or have some priority items.
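
To show why (a) and (b) can behave differently, here is a minimal Python sketch using only the standard library. The markup is hypothetical: I could not inspect the real page, so the "deferred-area" wrapper is only modeled on the description in the first post. The class name RowCounter is also made up for this example.

```python
from html.parser import HTMLParser


class RowCounter(HTMLParser):
    """Count li.pv5 elements, and how many are direct children of the result list."""

    def __init__(self):
        super().__init__()
        self.stack = []        # (tag, classes) for every currently open element
        self.any_rows = 0      # matches for (b): li.pv5
        self.direct_rows = 0   # matches for (a): ol.search-results__result-list > li.pv5

    def handle_starttag(self, tag, attrs):
        classes = set((dict(attrs).get("class") or "").split())
        if tag == "li" and "pv5" in classes:
            self.any_rows += 1
            if self.stack:
                parent_tag, parent_classes = self.stack[-1]
                if parent_tag == "ol" and "search-results__result-list" in parent_classes:
                    self.direct_rows += 1
        self.stack.append((tag, classes))

    def handle_endtag(self, tag):
        # The sample markup is well formed, so the innermost open tag is the one closing
        if self.stack and self.stack[-1][0] == tag:
            self.stack.pop()


# Hypothetical markup: later rows sit inside an extra wrapper element
SAMPLE = """
<ol class="search-results__result-list">
  <li class="pv5">Company 1</li>
  <li class="pv5">Company 2</li>
  <div class="deferred-area">
    <li class="pv5">Company 3</li>
    <li class="pv5">Company 4</li>
  </div>
</ol>
"""

counter = RowCounter()
counter.feed(SAMPLE)
print("(b) li.pv5:", counter.any_rows)
print("(a) ol.search-results__result-list > li.pv5:", counter.direct_rows)
```

With this sample, (b) matches all four rows while (a) matches only the two that are direct children of the list. Whether the real page is structured this way is only a guess.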

eldoland · September 14, 2018, 3:27pm

I tried this:
{"_id":"testlinkedinnewinterface2","startUrl":["Sales Navigator span.v-align-middle","clickType":"clickMore","discardInitialElements":false,"clickElementUniquenessType":"uniqueText"},{"id":"companylink","type":"SelectorLink","parentSelectors":["pagination"],"selector":"dt.result-lockup__name a.ember-view","multiple":false,"delay":"2000"},{"id":"elements","type":"SelectorElement","parentSelectors":["companylink"],"selector":"div.entity-card.banner","multiple":true,"delay":"2000"},{"id":"name","type":"SelectorText","parentSelectors":["elements"],"selector":"h1.title","multiple":false,"regex":"","delay":"2000"},{"id":"employees","type":"SelectorText","parentSelectors":["elements"],"selector":"span.cta-link a","multiple":false,"regex":"","delay":"2000"},{"id":"industry","type":"SelectorText","parentSelectors":["pagination"],"selector":"li.result-lockup__misc-item:nth-of-type(1)","multiple":false,"regex":"","delay":"2000"},{"id":"size","type":"SelectorText","parentSelectors":["pagination"],"selector":"a.result-lockup__undecorated-link","multiple":false,"regex":"","delay":"2000"},{"id":"country","type":"SelectorText","parentSelectors":["pagination"],"selector":"li.result-lockup__misc-item:nth-of-type(3)","multiple":false,"regex":"","delay":"2000"},{"id":"website","type":"SelectorLink","parentSelectors":["elements"],"selector":"a.meta-link.website","multiple":false,"delay":"2000"}]}

But it scraped only 20 rows.
I think those are the first 10 rows from each of 2 pages,
even though the class seemed to be correct.