Pagination seems to work, but not getting all subsidiary data

Hello, I'm a webscraper.io newbie using the Firefox extension.

I am trying to scrape data for all locations listed at https://www.datacenters.com/locations (Data Center Locations: Top Cities, States, Countries and Regions).
If I watch the scraper run, I can see it go through all 79 pages of the website (the current count; it's dynamic) and then start drilling into each location to get additional info. There should currently be 3115 locations, but my output data only shows 2428. Is there a limit on how much data can be scraped with the webscraper.io extension?

TIA!!!!

Sitemap:
{"_id":"DataCenters","startUrl":["https://www.datacenters.com/locations"],"selectors":[{"id":"pagination","parentSelectors":["_root","pagination"],"paginationType":"clickMore","type":"SelectorPagination","selector":"button.Control__control__ijHLR:nth-of-type(n+2)"},{"id":"dc-element","parentSelectors":["pagination"],"type":"SelectorElement","selector":"div.LocationTile__location__tZKRS","multiple":true},{"id":"dc-owner","parentSelectors":["dc-element"],"type":"SelectorText","selector":"div.LocationTile__provider__BSecG","multiple":false,"regex":""},{"id":"dc-name","parentSelectors":["dc-element"],"type":"SelectorText","selector":"div.LocationTile__name__NrDKr","multiple":false,"regex":""},{"id":"dc-address","parentSelectors":["dc-element"],"type":"SelectorText","selector":"div.LocationTile__address__Utj30","multiple":false,"regex":""},{"id":"dc-element-detail-link","parentSelectors":["dc-element"],"type":"SelectorLink","selector":"a","multiple":false,"linkType":"linkFromHref"},{"id":"dc-phone","parentSelectors":["dc-element-detail-link"],"type":"SelectorText","selector":".LocationProviderDetail__phoneItemWrapper__PpePG span","multiple":false,"regex":""},{"id":"dc-totalspace","parentSelectors":["dc-element-detail-link"],"type":"SelectorText","selector":"div.LocationProviderDetail__providerInfoItem__kuPAs:nth-of-type(3) span","multiple":false,"regex":""},{"id":"dc-colocationspace","parentSelectors":["dc-element-detail-link"],"type":"SelectorText","selector":"div.LocationProviderDetail__providerInfoItem__kuPAs:nth-of-type(4) span","multiple":false,"regex":""},{"id":"dc-totalpower","parentSelectors":["dc-element-detail-link"],"type":"SelectorText","selector":"div.LocationProviderDetail__providerInfoItem__kuPAs:nth-of-type(5) span","multiple":false,"regex":""}]}

Many of your selectors contain randomly generated characters (for example, the tZKRS suffix in LocationTile__location__tZKRS), and these could change at any time. When that happens, your sitemap won't "see" any selector that has changed. This is typical for sites built with the React framework. You can try a partial match on the class, which ignores the random part. For example, the address selector can be:

div[class^='LocationTile__address']

Ref: CSS Selectors Reference
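
The partial-match idea can be applied to a whole sitemap at once. The sketch below is a minimal Python helper, not part of Web Scraper: stabilize_selector and stabilize_sitemap are hypothetical names, and the regular expression assumes the hashed classes always look like Block__element__ followed by a 5-character suffix, as they do in this thread.

```python
import json
import re

# Matches a hashed class such as ".LocationTile__address__Utj30".
# Assumption: the random suffix is exactly 5 alphanumeric characters.
HASHED_CLASS = re.compile(
    r"\.([A-Za-z][A-Za-z0-9]*__[A-Za-z][A-Za-z0-9]*)__[A-Za-z0-9]{5}\b"
)


def stabilize_selector(selector: str) -> str:
    """Replace each hashed class with a prefix match on the class attribute."""
    return HASHED_CLASS.sub(lambda m: f"[class^='{m.group(1)}']", selector)


def stabilize_sitemap(sitemap_json: str) -> str:
    """Rewrite every 'selector' value in an exported sitemap JSON string."""
    sitemap = json.loads(sitemap_json)
    for entry in sitemap.get("selectors", []):
        if "selector" in entry:
            entry["selector"] = stabilize_selector(entry["selector"])
    # Compact output, similar to the single-line export format.
    return json.dumps(sitemap, separators=(",", ":"))


if __name__ == "__main__":
    print(stabilize_selector("div.LocationTile__address__Utj30"))
```

Note that [class^='...'] only matches when the class attribute begins with that text. If an element carries other classes first, [class*='...'] (contains) is the looser alternative.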

Thanks for your help. I made those changes, but I still do not get the data for all of the datacenter locations. There are 3115 locations available, spread over 79 pages, and I only get 2428 of them from the scrape. I tried increasing the wait time from the default of 2000 ms to 5000 ms. No luck!

Here is the updated sitemap:

{"_id":"DataCenters","startUrl":["https://www.datacenters.com/locations"],"selectors":[{"id":"pagination","parentSelectors":["_root","pagination"],"paginationType":"clickMore","type":"SelectorPagination","selector":"button.Control__control__ijHLR:nth-of-type(n+2)"},{"id":"dc-element","parentSelectors":["pagination"],"type":"SelectorElement","selector":"div[class^='LocationTile__location']","multiple":true},{"id":"dc-owner","parentSelectors":["dc-element"],"type":"SelectorText","selector":"div[class^='LocationTile__provider']","multiple":false,"regex":""},{"id":"dc-name","parentSelectors":["dc-element"],"type":"SelectorText","selector":"div[class^='LocationTile__name']","multiple":false,"regex":""},{"id":"dc-address","parentSelectors":["dc-element"],"type":"SelectorText","selector":"div[class^='LocationTile__address']","multiple":false,"regex":""},{"id":"dc-element-detail-link","parentSelectors":["dc-element"],"type":"SelectorLink","selector":"a","multiple":false,"linkType":"linkFromHref"},{"id":"dc-phone","parentSelectors":["dc-element-detail-link"],"type":"SelectorText","selector":"span#sidebarPhone:nth-of-type(2)","multiple":false,"regex":""},{"id":"dc-totalspace","parentSelectors":["dc-element-detail-link"],"type":"SelectorText","selector":"#totalSpace strong","multiple":false,"regex":""},{"id":"dc-colocationspace","parentSelectors":["dc-element-detail-link"],"type":"SelectorText","selector":"#products strong","multiple":false,"regex":""},{"id":"dc-totalpower","parentSelectors":["dc-element-detail-link"],"type":"SelectorText","selector":"#power strong","multiple":false,"regex":""}]}
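
To narrow down a gap like 2428 of 3115, it can help to check whether the export contains duplicate rows and, if a reference list of location URLs is available, which ones never appear. The following is a rough Python sketch under assumptions: load_column and summarize are hypothetical helpers, and the column name passed to load_column (here "url") depends on how your export names the link column.

```python
import csv
import io
from collections import Counter


def load_column(csv_text: str, column: str) -> list[str]:
    """Return the non-empty values of one column from CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row[column].strip() for row in reader if row.get(column)]


def summarize(scraped: list[str], expected: list[str]) -> dict:
    """Count rows and unique URLs, and list duplicates and missing URLs."""
    counts = Counter(scraped)
    return {
        "rows": len(scraped),
        "unique": len(counts),
        "duplicates": sorted(u for u, n in counts.items() if n > 1),
        "missing": sorted(set(expected) - set(counts)),
    }


if __name__ == "__main__":
    # Tiny made-up sample; real data would come from the exported CSV file.
    sample = "url\nhttps://example.com/a\nhttps://example.com/a\nhttps://example.com/b\n"
    scraped = load_column(sample, "url")
    expected = ["https://example.com/a", "https://example.com/b", "https://example.com/c"]
    print(summarize(scraped, expected))
```

If "unique" is lower than "rows", some tiles were scraped more than once; a non-empty "missing" list shows which locations were skipped.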

I did some quick tests, and the scraper does seem to miss some data. As a workaround, I manually extracted the datacenter page URLs by sniffing around the Network tab (it's a bit complicated). You can find all 3161 links on the page below. You only need to adjust your sitemap to start from this page; no paginator is needed. The page expires in 2 months.

@leemeng: Thanks so much for your help. I changed my sitemap to start from your page of datacenter location links and was able to grab all the necessary data. My next step is to make this dynamic, so I'm thinking of using the site's sitemap.xml file, or finding something unique in the location pages and using that to drive the scrape. Here is my sitemap for anyone else who might be interested.

{
  "_id": "DataCenters-4",
  "startUrl": [
    "https://pastelink.net/kvjfxy0j"
  ],
  "selectors": [
    {
      "id": "location-links",
      "linkType": "linkFromHref",
      "multiple": true,
      "parentSelectors": [
        "_root"
      ],
      "selector": ".body-display a",
      "type": "SelectorLink"
    },
    {
      "id": "dc-owner-name",
      "multiple": false,
      "parentSelectors": [
        "location-links"
      ],
      "regex": "",
      "selector": "h1",
      "type": "SelectorText"
    },
    {
      "id": "dc-address",
      "multiple": false,
      "parentSelectors": [
        "location-links"
      ],
      "regex": "",
      "selector": "span[class*='LocationShowSidebar__sidebarAddress']",
      "type": "SelectorText"
    },
    {
      "id": "dc-telephone",
      "multiple": false,
      "parentSelectors": [
        "location-links"
      ],
      "regex": "",
      "selector": "span#sidebarPhone:nth-of-type(2)",
      "type": "SelectorText"
    },
    {
      "id": "dc-totalspace",
      "multiple": false,
      "parentSelectors": [
        "location-links"
      ],
      "regex": "",
      "selector": "#totalSpace strong",
      "type": "SelectorText"
    },
    {
      "id": "dc-totalpower",
      "multiple": false,
      "parentSelectors": [
        "location-links"
      ],
      "regex": "",
      "selector": "#power strong",
      "type": "SelectorText"
    }
  ]
}
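
For the "make this dynamic" idea, one possible approach is to pull the location page URLs out of the site's sitemap.xml rather than a hand-built link page. This is only a sketch in Python: location_urls is a hypothetical helper, and the "/locations/" marker used to recognize location pages is an assumption that should be checked against the real URLs. The XML namespace is the standard one from the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

# Standard namespace used by sitemap.xml files (sitemaps.org protocol).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def location_urls(sitemap_xml: str, marker: str = "/locations/") -> list[str]:
    """Return every <loc> URL in the sitemap that contains the marker.

    Assumption: location pages can be recognized by a path fragment
    such as "/locations/"; adjust the marker to the real URL pattern.
    """
    root = ET.fromstring(sitemap_xml)
    urls = [
        loc.text.strip()
        for loc in root.iterfind(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]
    return [url for url in urls if marker in url]


if __name__ == "__main__":
    # Made-up sample; a real run would read the downloaded sitemap.xml.
    sample = (
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        "<url><loc>https://example.com/locations/site-one</loc></url>"
        "<url><loc>https://example.com/about</loc></url>"
        "</urlset>"
    )
    for url in location_urls(sample):
        print(url)
```

The resulting list could then be published as a link page or used as start URLs, in the same way as the manually extracted list above.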