Pagination and Scrolling required

I'm using LinkedIn.com Recruiter Search. It is paginated, each page displays 25 candidates, and the candidates only load once they are scrolled into view. I don't think giving you the URL will help, as you would need to log on to see the actual data.

I don't know how to use ElementScroll in conjunction with ElementClick in order to make this work.

The ElementClick clickElementSelector is ".mini-pagination__quick-link [type='chevron-right-icon']". And this works just fine.
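
As a quick sanity check (this assumes nothing beyond the selector above), the following DevTools console snippet confirms the chevron element is present and finds its clickable wrapper:

// Run in the Chrome DevTools console on a results page.
const chevron = document.querySelector(
  ".mini-pagination__quick-link [type='chevron-right-icon']"
);
if (chevron) {
  // The icon itself is usually wrapped in the actual clickable anchor/button.
  console.log("next-page control:", chevron.closest("a, button"));
} else {
  console.log("chevron not found - the selector may have changed");
}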

The candidates are inside an HTML ordered list ('ol').

The unloaded candidate li elements match the selector "li.profile-list__occlusion-area" and are empty. Once scrolled into view, the li contains a "div data-test-paginated-list-item", among many other elements.

Inside the candidate li items, for test purposes, I just want to grab the Name with selector "artdeco-entity-lockup-title a".
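
Based purely on the selectors described above (I'm assuming data-test-paginated-list-item is an attribute on that div - that part is my guess), this console snippet shows how many rows on the current page are still unloaded placeholders versus fully rendered:

// Counts placeholder vs. loaded candidate rows in the results list.
const rows = document.querySelectorAll("ol.profile-list > li");
const placeholders = document.querySelectorAll("li.profile-list__occlusion-area");
const loaded = document.querySelectorAll(
  "ol.profile-list > li div[data-test-paginated-list-item]"
);
console.log(`rows: ${rows.length}, placeholders: ${placeholders.length}, loaded: ${loaded.length}`);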

I have been trying many permutations of ElementClick, ElementScroll, and parenting without complete success. Can someone please help me understand how to combine ElementClick, ElementScroll, and SelectorText so that I can page through 6 pages of candidates under these circumstances? I don't know the correct parenting hierarchy, nor how the other parameters should be set for each selector.

For more info: if I remove the ElementScroll and use the following scraper definition, it grabs only the top 2-3 visible candidates per page.

{
  "_id": "do_it_all",
  "startUrl": [
    "https://www.linkedin.com/talent/hire/274604540/discover/recruiterSearch?searchContextId=8abbf515-984e-4f39-bd3d-0a5851aa4116&searchHistoryId=3726191516&searchRequestId=997e3381-5212-4abc-be66-a59a7b627d5c&start=0&uiOrigin=PAGINATION"
  ],
  "selectors": [
    {
      "id": "pager",
      "type": "SelectorElementClick",
      "parentSelectors": [
        "_root"
      ],
      "selector": "article.profile-list-item",
      "multiple": true,
      "delay": "2000",
      "clickElementSelector": ".mini-pagination__quick-link [type='chevron-right-icon']",
      "clickType": "clickMore",
      "discardInitialElements": "do-not-discard",
      "clickElementUniquenessType": "uniqueText"
    },
    {
      "id": "FullName",
      "type": "SelectorText",
      "parentSelectors": [
        "pager"
      ],
      "selector": "artdeco-entity-lockup-title a",
      "multiple": false,
      "regex": "",
      "delay": 0
    }
  ]
}

I am trying everything I can think of here to get this to work.

This combination of ElementClick, ElementScroll, and SelectorText returns NOTHING! I don't get it. It seems to go from page 1 to page 6, stopping to scroll and load the dynamic items on each page, but nothing comes out of the scrape.

{
  "_id": "do_it_all",
  "startUrl": [
    "https://www.linkedin.com/talent/hire/274604540/discover/recruiterSearch?searchContextId=8abbf515-984e-4f39-bd3d-0a5851aa4116&searchHistoryId=3726191516&searchRequestId=997e3381-5212-4abc-be66-a59a7b627d5c&start=0&uiOrigin=PAGINATION"
  ],
  "selectors": [
    {
      "id": "pager",
      "type": "SelectorElementClick",
      "parentSelectors": [
        "_root"
      ],
      "selector": "ol.profile-list",
      "multiple": true,
      "delay": "2000",
      "clickElementSelector": ".mini-pagination__quick-link [type='chevron-right-icon']",
      "clickType": "clickMore",
      "discardInitialElements": "do-not-discard",
      "clickElementUniquenessType": "uniqueText"
    },
    {
      "id": "scroller",
      "type": "SelectorElementScroll",
      "parentSelectors": [
        "pager"
      ],
      "selector": "ol.profile-list > li",
      "multiple": true,
      "delay": "500"
    },
    {
      "id": "FullName",
      "type": "SelectorText",
      "parentSelectors": [
        "scroller"
      ],
      "selector": "artdeco-entity-lockup-title a",
      "multiple": false,
      "regex": "",
      "delay": 0
    }
  ]
}

In the Chrome console, the following jQuery returns the name of each candidate:

$('ol.profile-list > li artdeco-entity-lockup-title a')[index-here].innerText

So I know that the selectors seem correct.
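
For completeness, here is a plain-JS version of that check (same selectors, no jQuery needed) that prints every candidate name currently loaded in the DOM:

// Lists every candidate name that has been rendered so far.
document
  .querySelectorAll("ol.profile-list > li artdeco-entity-lockup-title a")
  .forEach((a, i) => console.log(i, a.innerText.trim()));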

WebScraper just seems to be acting irrationally here, but I am sure it is my lack of understanding of how to combine Click and Scroll.

Just trying to provide as much info as possible so that someone may be able to help.

For these types of sites, the structure would usually be

root -> paginator -> scroller -> data scrapers

The paginator's main selector needs to be broad enough to cover the whole area of the scroller; this is usually the parent of the wrapper element for the data scrapers. The scroller's selector would then be that wrapper element. You also need to set the scroller as a child of _root, or it will not operate on the start page.

The example scraper below will both scroll and paginate. I intentionally made it stop at page 3. I used Page load delay: 5000.

{"_id":"lampesdirect_demo_pagination-scroll","startUrl":["https://www.lampesdirect.fr/catalogsearch/result/?q=LEDVANCE&p=1"],"selectors":[{"id":"Pagination","type":"SelectorElementClick","parentSelectors":["_root"],"selector":"div#instant-search-results-container","multiple":true,"delay":"4000","clickElementSelector":"li.ais-pagination--item__page:nth-of-type(-n+4) a","clickType":"clickOnce","discardInitialElements":"discard-when-click-element-exists","clickElementUniquenessType":"uniqueText"},{"id":"Scroller and data","type":"SelectorElementScroll","parentSelectors":["_root","Pagination"],"selector":"a[class^=\"result\"][title]","multiple":true,"delay":"3000"},{"id":"Prod name","type":"SelectorText","parentSelectors":["Scroller and data"],"selector":"h3","multiple":false,"regex":"","delay":0},{"id":"Price","type":"SelectorText","parentSelectors":["Scroller and data"],"selector":".price-excluding-tax span.price","multiple":false,"regex":"","delay":0}]}

OK - I got it to work with the following sitemap and timer settings, also using your 5-second page load delay on start. There is one small issue: I get the first page's candidates twice. Ideas?

{
  "_id": "lee_test_full_name",
  "startUrl": [
    "https://www.linkedin.com/talent/hire/274604540/discover/recruiterSearch?searchContextId=9c897058-cecb-4436-97f0-bb518d282cf6&searchHistoryId=3726191516&searchRequestId=9cfda557-dd8e-4381-8a24-b976baae18f7&start=0&uiOrigin=PROJECT_RESUME_SEARCH_HISTORY"
  ],
  "selectors": [
    {
      "id": "FullName",
      "type": "SelectorText",
      "parentSelectors": [
        "scroll_element"
      ],
      "selector": "a",
      "multiple": false,
      "regex": "",
      "delay": 0
    },
    {
      "id": "scroll_element",
      "type": "SelectorElementScroll",
      "parentSelectors": [
        "_root",
        "pager"
      ],
      "selector": "article.profile-list-item",
      "multiple": true,
      "delay": "2000"
    },
    {
      "id": "pager",
      "type": "SelectorElementClick",
      "parentSelectors": [
        "_root"
      ],
      "selector": "ol.profile-list",
      "multiple": false,
      "delay": "4000",
      "clickElementSelector": ".mini-pagination__quick-link [type='chevron-right-icon']",
      "clickType": "clickMore",
      "discardInitialElements": "do-not-discard",
      "clickElementUniquenessType": "uniqueText"
    }
  ]
}

This was 6 pages and 130 candidates. It only produced two warnings:

  1. Warning: Accessing PropTypes via the main React package is deprecated, and will be removed in React v16.0. Use the latest available v15.* prop-types package from npm instead.
  2. Warning: Accessing createClass via the main React package is deprecated, and will be removed in React v16.0. Use a plain JavaScript class instead. If you're not yet ready to migrate, create-react-class v15.* is available on npm as a temporary, drop-in replacement.

If I set the scroller element to only have a parent of 'pager', then I get rid of the duplicates, but it only gets the first 2 of 25 candidates from the first page. It seems to be an either/or situation: either I only get 2, or I get all of them twice. Any suggestions?

I tried to run this against 345 candidates across 14 pages and it crashed with the following two errors on the extensions page... A repeat run of the same scrape produced the same errors.

,"error":"timeout: Job execution timeout","stack":"Error: timeout: Job execution timeout\n at chrome-extension://jnhgnonknehpejjnehehllkliplmbmhn/background_script.js:541:27","timestamp":1572544537,"level_name":"ERROR","message":"Job execution failed"}

{"error":"{"message":"The message port closed before a response was received."}","method":"scrollDownBody","request":"{"method":"scrollDownBody","params":[528,"article.profile-list-item",false]}","stack":"Error\n at a.error (chrome-extension://jnhgnonknehpejjnehehllkliplmbmhn/background_script.js:468:35)\n at chrome-extension://jnhgnonknehpejjnehehllkliplmbmhn/background_script.js:27879:99","timestamp":1572544537,"level_name":"ERROR","message":"Failed to send message to chrome tab"}

I find that the scroller element does not seem to work properly if the delay is too short; try 5000 to see if that works.

Your pager delay also seems too short. As you're using the Type:Click method for pagination, its delay is the one used for subsequent pages; Page load delay is only used for the startUrl and for Type:Link pagination (your pager does not use this delay). Try doubling the pager's delay to 8000 to see if that works.

For dynamic websites like this, WS won't work properly if the pages aren't fully loaded. So it's better to start off with long delays and tweak as needed.

Hello Lee.

I changed the delays as you stated and it crashes after processing just a few pages out of 14.

{"url":"https://www.linkedin.com/talent/hire/274604540/discover/recruiterSearch?searchContextId=9c897058-cecb-4436-97f0-bb518d282cf6&searchHistoryId=3726191516&searchRequestId=9cfda557-dd8e-4381-8a24-b976baae18f7&start=0&uiOrigin=PROJECT_RESUME_SEARCH_HISTORY","parentSelector":"_root","sitemapName":"lee_test_full_name","driver":"chrometab","error":"timeout: Job execution timeout","stack":"Error: timeout: Job execution timeout\n at chrome-extension://jnhgnonknehpejjnehehllkliplmbmhn/background_script.js:541:27","timestamp":1572617634,"level_name":"ERROR","message":"Job execution failed"}

{
  "_id": "lee_test_full_name",
  "startUrl": [
    "https://www.linkedin.com/talent/hire/274604540/discover/recruiterSearch?searchContextId=9c897058-cecb-4436-97f0-bb518d282cf6&searchHistoryId=3726191516&searchRequestId=9cfda557-dd8e-4381-8a24-b976baae18f7&start=0&uiOrigin=PROJECT_RESUME_SEARCH_HISTORY"
  ],
  "selectors": [
    {
      "id": "FullName",
      "type": "SelectorText",
      "parentSelectors": [
        "scroll_element"
      ],
      "selector": "a",
      "multiple": false,
      "regex": "",
      "delay": 0
    },
    {
      "id": "scroll_element",
      "type": "SelectorElementScroll",
      "parentSelectors": [
        "_root",
        "pager"
      ],
      "selector": "article.profile-list-item",
      "multiple": true,
      "delay": "5000"
    },
    {
      "id": "pager",
      "type": "SelectorElementClick",
      "parentSelectors": [
        "_root"
      ],
      "selector": "ol.profile-list",
      "multiple": false,
      "delay": "8000",
      "clickElementSelector": ".mini-pagination__quick-link [type='chevron-right-icon']",
      "clickType": "clickMore",
      "discardInitialElements": "do-not-discard",
      "clickElementUniquenessType": "uniqueText"
    }
  ]
}

A second run crashed again about midway.

Lee? Any ideas on this?