Hi,
(1) I currently find that under Edit metadata > Start URL, I can only input a "proper" URL, one that meets 2 conditions:
(a) a "http://" or "https://" beginning, and
(b) it ends in a valid-looking top-level domain (TLD), such as .com or .net (and even .local).
(2) Is it possible to allow:
(a) the "file://" protocol beginning, and
(b) any-format URLs, such as localhost, 127.0.0.1, or 192.168.1.100?
(3) The reason for this request is that it would open up a lot of possibilities:
(a) It would go a long way toward simplifying multiple-URL scraping. Currently we need to add Start URLs one by one with the + button, or use the serial-number range notation, such as [1-2000].
With such a new feature, we could build a larger HTML file containing a long list of varied URLs (generated in Excel or by other scripts), then point the Start URL at it with "file://..." or "http://localhost/urls.html".
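To illustrate the idea, here is a minimal sketch of generating such a listing page. The file name "urls.html" and the example URLs are my own illustrative assumptions, not anything Webscraper.io prescribes:

```python
# Hypothetical sketch: build a local "urls.html" page containing one link
# per target URL, which a scraper could then use as its single Start URL.
import html

def build_listing(urls):
    """Return an HTML page with one <a> link per URL in `urls`."""
    links = "\n".join(
        f'<a href="{html.escape(u, quote=True)}">{html.escape(u)}</a>'
        for u in urls
    )
    return f"<!DOCTYPE html>\n<html><body>\n{links}\n</body></html>"

if __name__ == "__main__":
    # Example URLs are assumptions for illustration only.
    urls = [
        "http://localhost/page1",
        "http://127.0.0.1:8080/item?id=2",
    ]
    with open("urls.html", "w", encoding="utf-8") as f:
        f.write(build_listing(urls))
```

The same list could just as easily be exported from an Excel column.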
Currently I can still do something like this, but I have to serve the listing file from a local (Apache) server and edit the "hosts" file to simulate a "http://myproject.local" URL, which contains the valid TLD ".local". This works pretty well, but I wonder whether it could become simpler.
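For completeness, the serving half of this workaround does not even need Apache; a sketch using Python's built-in static file server (port and hostname are my assumptions), paired with a hosts-file line like "127.0.0.1  myproject.local", would look like:

```python
# Hypothetical sketch: serve the directory containing urls.html on
# loopback, so http://myproject.local:8000/urls.html (via a hosts-file
# entry) passes the Start URL's TLD check.
from functools import partial
from http.server import HTTPServer, SimpleHTTPRequestHandler

def serve(directory, port=8000):
    """Serve `directory` as static files on 127.0.0.1:`port` (blocks)."""
    handler = partial(SimpleHTTPRequestHandler, directory=directory)
    HTTPServer(("127.0.0.1", port), handler).serve_forever()

# Usage (blocks until interrupted):
# serve(".")  # "." = directory containing urls.html
```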
(b) It would open the possibility of wrapping a JSON data file in our own parser script that generates an HTML-looking file locally, then asking Webscraper.io to scrape that local HTML file. That way, we might be able to scrape JSON-only data files "directly" (after predicting and organizing the JSON file API/URLs).
The above can still be done with our own local server, but the current restrictions make it more difficult by disallowing locally served JS- or jQuery-scripted files.
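As a sketch of the JSON-wrapping idea: read a JSON payload and emit an HTML table whose cells carry class names that CSS selectors can target. The field names ("name", "price") and the overall shape of the JSON are assumptions for illustration:

```python
# Hypothetical sketch: render a JSON array of objects as a flat HTML
# table, giving each cell a class equal to its field name so a
# selector-based scraper can pick the values out.
import html
import json

def json_to_html(json_text, fields):
    """Return an HTML table built from a JSON array of objects."""
    records = json.loads(json_text)
    head = "".join(f"<th>{html.escape(f)}</th>" for f in fields)
    rows = "".join(
        "<tr>"
        + "".join(
            f'<td class="{html.escape(f)}">{html.escape(str(r.get(f, "")))}</td>'
            for f in fields
        )
        + "</tr>"
        for r in records
    )
    return f"<table><tr>{head}</tr>{rows}</table>"

if __name__ == "__main__":
    # Illustrative JSON payload, not a real API response.
    data = '[{"name": "Widget", "price": 9.99}]'
    print(json_to_html(data, ["name", "price"]))
```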
Are there reasons for currently restricting or excluding the "file://" protocol and invalid TLDs? URL validation is a good point; could the validation result appear as a warning text only?
Thanks for the great software!