Recognizing and confirming what cannot be scraped with Webscraper.io

Hi,

I understand that some Ajax-based websites cannot be scraped with Webscraper.io. But how do I clearly identify which ones cannot be?

For example, if I press F12 and look under Developer Tools > Network, and I see a lot of XHR traffic, does that mean the website won't be scrapable?

Should I look further into whether these XHR requests are carrying the key data set? In general, are AngularJS websites not scrapable?

Sorry, I'm a bit fuzzy on these terms and just looking for a more definitive guide. What is a more definitive or standard way of describing this issue, so that I can more quickly conclude that certain sites cannot be scraped (and that it isn't due to my lack of knowledge about Webscraper.io)?

Is it true that sometimes I can build a sitemap whose selectors seem to run when tested INDIVIDUALLY, but that fails when run AS A WHOLE, due to this Ajax/XHR limitation?

Thanks!
Jason


Hi there!

AngularJS, ReactJS, and the rest are just JavaScript libraries/frameworks that help a website work dynamically (change things on the fly). You can scrape pretty much everything except built-in apps that draw their content on a canvas, which cannot be accessed through the Elements tree. So the easiest way to determine whether a website can be scraped is simply to check whether the elements you need can be accessed through the Elements tree.
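
For instance, you can check right in the browser console (just a sketch; the selector below is hypothetical, so swap in one that matches the data you want):

```js
// Run in the browser console (F12 > Console) after the page has fully loaded.
// '.product-title' is a hypothetical selector -- use one for your target data.
const el = document.querySelector('.product-title');
console.log(el ? el.textContent : 'not reachable in the Elements tree');
```

If this prints the text you're after, a Web Scraper selector can reach it too.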

Thanks for the reply.

(1) Basically, can I generally conclude that only Flash and the like CANNOT be scraped with WebScraper.io?

(2) What about the gray zone, where the element tree is available but the data are delivered via a JSON/AJAX/XHR request and then populated/rendered by JS? Can these websites also not be DIRECTLY scraped?

However, could that data be harvested by calling the API/URL for the JSON file (discovered in Developer Tools > Network) and then processing/parsing it with other tools (PHP, jQuery, NodeJS)?
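
Something like this NodeJS sketch is what I have in mind (the endpoint URL below is made up; the real one would come from watching the Network tab):

```js
// NodeJS sketch: fetch the JSON endpoint spotted in Developer Tools > Network.
// The URL is hypothetical -- substitute the actual XHR URL you discovered.
const https = require('https');

https.get('https://example.com/api/products?page=1', (res) => {
  let body = '';
  res.on('data', (chunk) => { body += chunk; });
  res.on('end', () => {
    const data = JSON.parse(body); // the payload the page renders from
    console.log(data);             // parse/store the fields as needed
  });
}).on('error', (err) => console.error(err));
```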


I guess my question can be rephrased to clarify (2) above: there will be websites that look deceptively like they should work, because all the element trees are available, but in the end the data turn out to be hidden inside JSON and rendered very dynamically. Hence, with Webscraper alone, we cannot complete the scrape?

I'm asking to clarify this again because I'd like to narrow down this "blind spot" area so I can quickly differentiate websites.

Also, this is related to a "new feature" I'm about to suggest: allowing "file://" and non-Top-Level-Domain URLs in the Start URL.

Thanks!