Extract data from mirrored website (Start URL not valid)

bret · December 21, 2018, 10:14am

Start URL not a valid URL

Url: file:///C:/My%20Web%20Sites/example/example.com/browse/index.html

When I open this URL in Chrome, it works. I mirrored the website with HTTrack and can browse it offline. Now I would like to run Web Scraper (webscraper.io) and extract specific data points from each page.

However, when I try to launch Web Scraper, it gives me the following error: "The start URL is not a valid URL".

Is there a way to scrape locally mirrored websites using Web Scraper?

bret · December 24, 2018, 8:37am

Does anyone know how to solve this issue? Thank you in advance!

youring · December 24, 2018, 9:41am

I think firing up a local WAMP server would help.

bret · December 24, 2018, 9:54am

It is not my website, so I can't download the database to run WAMP. With HTTrack I can download the HTML of every page, which contains the required information.

The reason I am not just using Web Scraper on the live website is that the server limits the number of pages one IP can visit in a certain time period. That is why I wanted to scrape the website locally after downloading it with HTTrack, to circumvent the IP block.

Thank you in advance

youring · December 24, 2018, 1:09pm

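I mean setting up http://localhost/ links so that Web Scraper recognizes the start URL as valid. No database is needed for that: a plain HTTP server pointed at the downloaded HTML files is enough, and most platforms come with one that you can simply start.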
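
As a rough illustration of that suggestion, here is a minimal sketch that serves a mirrored folder over HTTP using only Python's standard library. Python is an assumption here (nobody in the thread mentions it), and the function names, the port 8000, and the folder path in the note below are hypothetical choices, not anything Web Scraper or HTTrack requires.

```python
import functools
import threading
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


def make_server(directory, port=8000):
    """Build an HTTP server that serves static files from `directory`."""
    # Bind the stock static-file handler to the mirror folder (Python 3.7+).
    handler = functools.partial(SimpleHTTPRequestHandler, directory=directory)
    # Listen on 127.0.0.1 only, so the mirror is not exposed to the network.
    return ThreadingHTTPServer(("127.0.0.1", port), handler)


def serve_in_background(directory, port=8000):
    """Start the server on a daemon thread and return it.

    Call .shutdown() and .server_close() on the returned object to stop it.
    """
    server = make_server(directory, port)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Assuming the mirror lives in C:\My Web Sites\example, calling make_server(r"C:\My Web Sites\example").serve_forever() should make the page from the first post reachable at http://127.0.0.1:8000/example.com/browse/index.html, which can then be tried as the start URL.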

leemeng · August 28, 2019, 4:51pm

Web Scraper currently does not recognize file:///C:/-style links, even though Chrome does. You can either run your own web server on your own machine, or upload the HTML files to an external web host.

Previously I had used SimpleServer from AnalogX for local hosting.
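
Once the mirror is served over HTTP, the file:/// start URL has to be rewritten as an http:// one. The sketch below shows one way to do that mapping in Python. The function name, the mirror_root_url parameter, and the base address http://127.0.0.1:8000/ are assumptions made for illustration; use whatever address your local server actually listens on.

```python
from urllib.parse import quote, unquote, urlsplit


def file_url_to_http(file_url, mirror_root_url, base="http://127.0.0.1:8000/"):
    """Map a file:/// URL inside the mirror folder to a URL on a local server.

    mirror_root_url is the file:/// URL of the folder the server serves.
    """
    parts = urlsplit(file_url)
    if parts.scheme != "file":
        raise ValueError("expected a file:/// URL")
    # Decode %20 and similar escapes before comparing the two paths.
    path = unquote(parts.path)
    root = unquote(urlsplit(mirror_root_url).path).rstrip("/") + "/"
    if not path.startswith(root):
        raise ValueError("URL is not inside the mirror folder")
    relative = path[len(root):]
    # Re-encode the relative path so spaces stay valid in the HTTP URL.
    return base.rstrip("/") + "/" + quote(relative)


if __name__ == "__main__":
    print(file_url_to_http(
        "file:///C:/My%20Web%20Sites/example/example.com/browse/index.html",
        "file:///C:/My%20Web%20Sites/example/",
    ))
    # prints http://127.0.0.1:8000/example.com/browse/index.html
```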