Element scroll down selector doesn't work with React

monty.williams · April 2, 2018, 9:34pm

I can't get the element scroll down selector to load more than 9 elements of this 291-element page. It only loads them all if I manually click on the window and scroll down using my trackpad.

I've been able to scrape complex items from this site including the 'Read more' popups that occur on some descriptions. I've reduced my complex Sitemap to the simplest script that fails.

Web Scraper version: 0.3.7
Chrome version: Version 65.0.3325.181 (Official Build) (64-bit)
OS: macOS High Sierra 10.13.3 (17D102)

Sitemap:

{"_id":"scrolltest","startUrl":["https://www.britbox.com/us/programmes"],"selectors":[{"id":"loader","type":"SelectorElementScroll","selector":"div.program-item","parentSelectors":["_root"],"multiple":true,"delay":"4000"},{"id":"Title","type":"SelectorText","selector":"h3.program-item__program-title","parentSelectors":["loader"],"multiple":false,"regex":"","delay":0}]}

I don't know if this "error" is relevant, but just in case ...

Error Message:

background_script.js:1 {"url":"https://www.britbox.com/us/programmes","tabUrl":"chrome-extension://jnhgnonknehpejjnehehllkliplmbmhn/empty-page.html","status":"loading","timestamp":1522703625,"level_name":"ERROR","message":"chrome tab didn't start loading"}

martins · April 3, 2018, 10:59am

The scroll down selector scrolls the body element. In some cases the scrollable element is a different one.

The current Web Scraper version has no option for choosing which element should be scrolled, but we are working on this. For now, you can try adding the CSS selector directly in the sitemap code.

I don't have an account for this site, but on the login page the scrollbar belonged to the html element. Here is a version of your sitemap that scrolls the html element instead. Modify the scrollElementSelector attribute if needed.

{"_id":"scrolltest2","startUrl":["https://www.britbox.com/us/programmes"],"selectors":[{"id":"loader","type":"SelectorElementScroll","selector":"div.program-item","scrollElementSelector":"html","parentSelectors":["_root"],"multiple":true,"delay":"4000"},{"id":"Title","type":"SelectorText","selector":"h3.program-item__program-title","parentSelectors":["loader"],"multiple":false,"regex":"","delay":0}]}
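For a sitemap with several scroll selectors, the same hand edit can be scripted. The following is only a sketch in Python, assuming the sitemap has been exported as JSON and loaded into a dict; the helper name `add_scroll_element` is made up for illustration and is not part of Web Scraper.

```python
import json


def add_scroll_element(sitemap, scroll_selector="html"):
    """Return a copy of the sitemap with scrollElementSelector set on scroll selectors.

    Hypothetical helper for illustration; not part of Web Scraper.
    """
    patched = json.loads(json.dumps(sitemap))  # deep copy via a JSON round trip
    for selector in patched.get("selectors", []):
        if selector.get("type") == "SelectorElementScroll":
            # Keep any value that was already set by hand.
            selector.setdefault("scrollElementSelector", scroll_selector)
    return patched


# The "scrolltest" sitemap from the first post, written as a Python dict.
sitemap = {
    "_id": "scrolltest",
    "startUrl": ["https://www.britbox.com/us/programmes"],
    "selectors": [
        {
            "id": "loader",
            "type": "SelectorElementScroll",
            "selector": "div.program-item",
            "parentSelectors": ["_root"],
            "multiple": True,
            "delay": "4000",
        },
        {
            "id": "Title",
            "type": "SelectorText",
            "selector": "h3.program-item__program-title",
            "parentSelectors": ["loader"],
            "multiple": False,
            "regex": "",
            "delay": 0,
        },
    ],
}

patched = add_scroll_element(sitemap)
# Compact JSON that can be pasted back into the sitemap import dialog.
print(json.dumps(patched, separators=(",", ":")))
```

Only selectors of type SelectorElementScroll are touched, and the input dict is left unchanged.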

monty.williams · April 3, 2018, 10:23pm

Brilliant!

Works like a charm. I may need to tweak some delays as I got inconsistent results from two runs of my more complex sitemap. This is so much simpler than scrapy.
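One way to see how two runs differ before changing any delays is to compare their exports. A rough Python sketch, assuming each run was exported as CSV with a column named after the selector id (here `Title`); the function names and the sample rows are invented for illustration.

```python
import csv
import io


def titles_in(csv_text, column="Title"):
    """Collect the distinct non-empty values of one column from a CSV export."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[column] for row in reader if row.get(column)}


def compare_runs(first_csv, second_csv, column="Title"):
    """Return (values only in the first run, values only in the second run)."""
    first = titles_in(first_csv, column)
    second = titles_in(second_csv, column)
    return sorted(first - second), sorted(second - first)


# Hypothetical exports from two runs of the same sitemap.
run_a = "Title\nShow A\nShow B\nShow C\n"
run_b = "Title\nShow A\nShow C\n"

only_in_a, only_in_b = compare_runs(run_a, run_b)
print("only in first run:", only_in_a)
print("only in second run:", only_in_b)
```

If one run is consistently missing items from the end of the list, a longer delay on the scroll selector is the first thing to try.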

You don't need a login to grab info on the shows, only to actually watch one. I'll run this weekly so I learn what's new without a lot of clicking and scrolling on their website.

I built scrapers using curl and awk and/or Python for https://mhzchoice.vhx.tv/ and https://acorn.tv/ which I run with a cron job. I notice you have Jenkins tests for WebScraper. Is it possible to scrape using my WebScraper sitemaps from the command line? Or do I have to open Chrome, select the sitemap, and then click "scrape"?

monty.williams · April 3, 2018, 10:57pm

I hate it when websites are inconsistent at different levels. As an example, BritBox has a different layout if a show has one season or multiple seasons. e.g.

One Season

Two Seasons

When trying to list episodes from a single scrape, I get some duplicates. I think resolving this is impossible in WebScraper, because the logical parent page for single-season episodes is different from the one for multi-season episodes. It's quite easy to remove the duplicates in post-processing, however.

You can see what I mean from these two shows:

{
  "_id": "seasonTest",
  "startUrl": ["https://www.britbox.com/us/programmes"],
  "selectors": [
    {
      "id": "Pgm",
      "type": "SelectorElement",
      "selector": "div.program-item",
      "parentSelectors": ["_root"],
      "multiple": true,
      "delay": "2000"
    },
    {
      "id": "PgmURL",
      "type": "SelectorLink",
      "selector": "a[href*='/us/show/'].program-item__block",
      "parentSelectors": ["Pgm"],
      "multiple": false,
      "delay": "0"
    },
    {
      "id": "SsnURL",
      "type": "SelectorLink",
      "selector": "a[href*='/us/season/'].brand-season-item",
      "parentSelectors": ["PgmURL"],
      "multiple": true,
      "delay": ""
    },
    {
      "id": "EpiURL",
      "type": "SelectorLink",
      "selector": "a[href*='/us/episode/']",
      "parentSelectors": ["PgmURL"],
      "multiple": false,
      "delay": 0
    },
    {
      "id": "Program_Title",
      "type": "SelectorText",
      "selector": "h1.brand-hero-info__title",
      "parentSelectors": ["SsnURL", "EpiURL"],
      "multiple": false,
      "regex": "",
      "delay": ""
    }
  ]
}

There should only be two of A Bit of Fry and Laurie, not three.
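The post-processing step mentioned above could look something like this in Python. It is only a sketch: it assumes the scrape was exported as CSV, that link selectors produce a `-href` column, and that a row counts as a duplicate when the chosen key columns repeat. The URLs in the sample are placeholders, not real BritBox links.

```python
import csv
import io


def drop_duplicate_rows(csv_text, key_columns):
    """Keep the first row for each distinct combination of key column values."""
    reader = csv.DictReader(io.StringIO(csv_text))
    seen = set()
    unique_rows = []
    for row in reader:
        key = tuple(row[column] for column in key_columns)
        if key not in seen:
            seen.add(key)
            unique_rows.append(row)
    return unique_rows


# Hypothetical export: the same season shows up twice for one programme.
sample = (
    "Program_Title,SsnURL-href\n"
    "A Bit of Fry and Laurie,https://example.com/season/1\n"
    "A Bit of Fry and Laurie,https://example.com/season/2\n"
    "A Bit of Fry and Laurie,https://example.com/season/1\n"
)

rows = drop_duplicate_rows(sample, ["Program_Title", "SsnURL-href"])
for row in rows:
    print(row["Program_Title"], row["SsnURL-href"])
```

Which columns make up the key depends on the sitemap; for the layout problem described here, the programme title plus the season or episode link is a reasonable starting point.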

monty.williams · April 3, 2018, 10:59pm

The full sitemap shows more, but is slower because it scrapes some popups. I had fun figuring out the 'Read more' popups.

{
  "_id": "BritBoxSeasons",
  "startUrl": ["https://www.britbox.com/us/programmes"],
  "selectors": [
    {
      "id": "Pgm",
      "type": "SelectorElementScroll",
      "selector": "div.program-item",
      "scrollElementSelector": "html",
      "parentSelectors": ["_root"],
      "multiple": true,
      "delay": "8000"
    },
    {
      "id": "PgmURL",
      "type": "SelectorLink",
      "selector": "a[href*='/us/show/'].program-item__block",
      "parentSelectors": ["Pgm"],
      "multiple": false,
      "delay": "0"
    },
    {
      "id": "SsnURL",
      "type": "SelectorLink",
      "selector": "a[href*='/us/season/'].brand-season-item",
      "parentSelectors": ["PgmURL"],
      "multiple": true,
      "delay": ""
    },
    {
      "id": "EpiURL",
      "type": "SelectorLink",
      "selector": "a[href*='/us/episode/']",
      "parentSelectors": ["PgmURL"],
      "multiple": false,
      "delay": 0
    },
    {
      "id": "Program_Title",
      "type": "SelectorText",
      "selector": "h1.brand-hero-info__title",
      "parentSelectors": ["SsnURL", "EpiURL"],
      "multiple": false,
      "regex": "",
      "delay": ""
    },
    {
      "id": "Sn_Title",
      "type": "SelectorText",
      "selector": "a.brand-season-item.active h2.program-item__program-title",
      "parentSelectors": ["SsnURL"],
      "multiple": false,
      "regex": "",
      "delay": ""
    },
    {
      "id": "Sn_Years",
      "type": "SelectorText",
      "selector": "a.brand-season-item.active p.season-metadata.hero-text",
      "parentSelectors": ["SsnURL"],
      "multiple": false,
      "regex": "^\\d+",
      "delay": ""
    },
    {
      "id": "Sn_Epis",
      "type": "SelectorText",
      "selector": "p.season-metadata",
      "parentSelectors": ["SsnURL"],
      "multiple": false,
      "regex": "\\d+ Episodes?",
      "delay": ""
    },
    {
      "id": "Sn_Description",
      "type": "SelectorText",
      "selector": "p.brand-hero-info__description",
      "parentSelectors": ["SsnURL", "EpiURL"],
      "multiple": false,
      "regex": "",
      "delay": ""
    },
    {
      "id": "Read_More",
      "type": "SelectorElementClick",
      "selector": "div.item-details-overlay",
      "parentSelectors": ["SsnURL", "EpiURL"],
      "multiple": false,
      "delay": "3000",
      "clickElementSelector": "span.brand-hero-info__read-more",
      "clickType": "clickMore",
      "discardInitialElements": false,
      "clickElementUniquenessType": "uniqueText"
    },
    {
      "id": "More_Description",
      "type": "SelectorText",
      "selector": "p.item-details-modal__description",
      "parentSelectors": ["Read_More"],
      "multiple": false,
      "regex": "",
      "delay": 0
    }
  ]
}
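The two regex fields in this sitemap (Sn_Years and Sn_Epis) can be checked outside the browser. A small Python sketch, assuming the regex option keeps the first match of the pattern in the selected text; the sample metadata strings are invented, since the real page text is not shown in this thread.

```python
import re

# Patterns as they read after JSON unescaping ("^\\d+" in the sitemap is ^\d+).
YEARS_PATTERN = r"^\d+"
EPISODES_PATTERN = r"\d+ Episodes?"


def first_match(pattern, text):
    """Return the first match of pattern in text, or an empty string if none."""
    match = re.search(pattern, text)
    return match.group(0) if match else ""


# Hypothetical text of a season-metadata paragraph.
sample_metadata = "1989 - 1995 | 26 Episodes"

years = first_match(YEARS_PATTERN, sample_metadata)
episodes = first_match(EPISODES_PATTERN, sample_metadata)
single = first_match(EPISODES_PATTERN, "1 Episode")

print(years)
print(episodes)
print(single)
```

The optional `s` in the second pattern is what lets it cover both "1 Episode" and "26 Episodes".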

rweber · August 3, 2018, 7:14pm

@martins I'm having a similar issue with a React page, but it may require a different solution.
I'm trying to pull data from an authenticated, infinite-scrolling page where only the logged-in user can see the content: https://www.instagram.com/[username]/saved/
where [username] is the logged-in user's username.

The SelectorElementScroll seems to work in the preview, but it only returns the last 15 items instead of the full 200+ shown on my page.

{
   "_id":"insta-saves-8-3",
   "startUrl":[
      "https://www.instagram.com/[username]/saved/"
   ],
   "selectors":[
      {
         "id":"post-img",
         "type":"SelectorImage",
         "selector":".FFVAD",
         "parentSelectors":[
            "post"
         ],
         "multiple":false,
         "delay":0
      },
      {
         "id":"post-url",
         "type":"SelectorElementAttribute",
         "selector":".v1Nh3 a",
         "parentSelectors":[
            "post"
         ],
         "multiple":false,
         "extractAttribute":"href",
         "delay":0
      },
      {
         "id":"post",
         "type":"SelectorElementScroll",
         "selector":"div.Nnq7C",
         "parentSelectors":[
            "_root"
         ],
         "multiple":true,
         "delay":"5000"
      }
   ]
}