Webscraper stopped before completion of the scraping

Creedy · July 23, 2020, 6:21pm

Hi guys!

I'm new to this forum and a beginner in programming and data scraping, so sorry if I ask some dumb questions. I have a real issue with the Web Scraper Chrome extension.

MY OBJECTIVE: I'm trying to scrape data for 24,000 companies, with start URL https://www.zonebourse.com/bourse/actions/.

THE ISSUE: I ran a small test on the Europe tab > France tab (850 companies) with start URL https://www.zonebourse.com/bourse/actions/Europe-3/France-51/ and it worked perfectly well. However, when I run the sitemap with start URL https://www.zonebourse.com/bourse/actions/ or even https://www.zonebourse.com/bourse/actions/Europe-3/, the scraping stops before completion and I can't get any data.

Do you have any idea what's happening and how it could be fixed?

Thanks in advance,

Web Scraper version: 0.4.2
Chrome version: Version 83.0.4103.116 (64 bits)
OS: 64 bits

Sitemap:

{"_id":"zoneboursescraping","startUrl":["https://www.zonebourse.com/bourse/actions/Europe-3/"],"selectors":[{"id":"page2","type":"SelectorElementClick","parentSelectors":["_root"],"selector":".tabBodyLV17 tr:nth-of-type(n+2) td:nth-of-type(n+2)","multiple":true,"delay":"2000","clickElementSelector":"a.nPageEndTab","clickType":"clickMore","discardInitialElements":"do-not-discard","clickElementUniquenessType":"uniqueHTMLText"},{"id":"Link","type":"SelectorLink","parentSelectors":["page2"],"selector":"a","multiple":true,"delay":0},{"id":"Sector","type":"SelectorText","parentSelectors":["Link"],"selector":"a.link1","multiple":false,"regex":"","delay":0},{"id":"MarketCap","type":"SelectorText","parentSelectors":["Link"],"selector":".std_txt tr:nth-of-type(2) td.fvtCorps1:nth-of-type(5)","multiple":false,"regex":"","delay":0},{"id":"Revenue 2020","type":"SelectorText","parentSelectors":["Link"],"selector":".std_txt td:nth-of-type(1) tr:nth-of-type(1) .dfCur b","multiple":false,"regex":"","delay":0},{"id":"Revenue 2021","type":"SelectorText","parentSelectors":["Link"],"selector":"td:nth-of-type(2) tr:nth-of-type(1) .dfCur b","multiple":false,"regex":"","delay":0},{"id":"Capitalization","type":"SelectorText","parentSelectors":["Link"],"selector":"td:nth-of-type(3) .dfCur b","multiple":false,"regex":"","delay":0},{"id":"VE/CA 2020","type":"SelectorText","parentSelectors":["Link"],"selector":".Bord tr:nth-of-type(2) td > b","multiple":false,"regex":"","delay":0},{"id":"VE/CA2021","type":"SelectorText","parentSelectors":["Link"],"selector":".std_txt td tr:nth-of-type(3) td > b","multiple":false,"regex":"","delay":0},{"id":"Currency","type":"SelectorText","parentSelectors":["Link"],"selector":"td.fvCur","multiple":false,"regex":"","delay":0},{"id":"Net Income 2020","type":"SelectorText","parentSelectors":["Link"],"selector":".std_txt td:nth-of-type(1) tr:nth-of-type(2) .dfCur b","multiple":false,"regex":"","delay":0},{"id":"Net Income 2021","type":"SelectorText","parentSelectors":["Link"],"selector":"td:nth-of-type(2) tr:nth-of-type(2) .dfCur b","multiple":false,"regex":"","delay":0}]}

Error Message:

background_script.js:465 {"error":"{\"message\":\"Could not establish connection. Receiving end does not exist.\"}","method":"getWrappedHTML","request":"{\"method\":\"getWrappedHTML\",\"params\":[7134]}","stack":"Error\n    at a.error (chrome-extension://jnhgnonknehpejjnehehllkliplmbmhn/background_script.js:468:35)\n    at chrome-extension://jnhgnonknehpejjnehehllkliplmbmhn/background_script.js:27879:99","timestamp":1595527447,"level_name":"ERROR","message":"Failed to send message to chrome tab"}
log @ background_script.js:465
background_script.js:465 {"error":"{\"message\":\"Could not establish connection. Receiving end does not exist.\"}","method":"getElements","request":"{\"method\":\"getElements\",\"params\":[\".tabBodyLV17 tr:nth-of-type(n+2) td:nth-of-type(n+2)\",0]}","stack":"Error\n    at a.error (chrome-extension://jnhgnonknehpejjnehehllkliplmbmhn/background_script.js:468:35)\n    at chrome-extension://jnhgnonknehpejjnehehllkliplmbmhn/background_script.js:27879:99","timestamp":1595527448,"level_name":"ERROR","message":"Failed to send message to chrome tab"}
log @ background_script.js:465
background_script.js:465 {"error":"{\"message\":\"Could not establish connection. Receiving end does not exist.\"}","method":"getRootElement","request":"{\"method\":\"getRootElement\",\"params\":[]}","stack":"Error\n    at a.error (chrome-extension://jnhgnonknehpejjnehehllkliplmbmhn/background_script.js:468:35)\n    at chrome-extension://jnhgnonknehpejjnehehllkliplmbmhn/background_script.js:27879:99","timestamp":1595527450,"level_name":"ERROR","message":"Failed to send message to chrome tab"}

leemeng · August 1, 2020, 10:10am

You're probably running into WS's limit of 10,000 lines of data per scrape (or 20,000 lines for the cloud scraper). There are also your browser's RAM limits. You can try splitting your scrape into smaller ones so that they don't hit the limits.

martins · August 4, 2020, 1:52pm

The 10k and 20k limits apply only to start URL import. A sitemap could discover millions of URLs and Web Scraper would just do its job.

The "Could not establish connection." error occurs after a page load, when Web Scraper tries to extract data from a specific URL and the page hasn't loaded. This can be caused by losing the network connection or by being blocked by the site's firewall. If the error happens in the browser extension, the page is skipped: you lose the data from that single page, not the entire job. Of course, if the site has blocked you, you will see a lot of these errors and miss a lot of data. If the error happens in Web Scraper Cloud, the page is retried later until the data is extracted.
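To judge whether you are seeing a few skipped pages or a wholesale block, one option is to tally the errors from the console log. Below is a rough Python sketch, assuming you have copied the console output into a string and that each error line contains one JSON object like the ones posted above. The function name `failed_methods` and the shortened sample log are made up for illustration; they are not part of Web Scraper.

```python
import json
from collections import Counter


def failed_methods(log_text):
    """Count ERROR entries per extension method in pasted console output."""
    counts = Counter()
    for line in log_text.splitlines():
        start = line.find("{")
        if start == -1:
            continue  # e.g. "log @ background_script.js:465" has no JSON payload
        try:
            entry = json.loads(line[start:])
        except json.JSONDecodeError:
            continue  # skip lines whose payload is not valid JSON
        if entry.get("level_name") == "ERROR":
            counts[entry.get("method", "unknown")] += 1
    return counts


# Shortened, hypothetical sample in the same shape as the errors above.
sample_log = """\
background_script.js:465 {"method":"getWrappedHTML","level_name":"ERROR","message":"Failed to send message to chrome tab"}
log @ background_script.js:465
background_script.js:465 {"method":"getElements","level_name":"ERROR","message":"Failed to send message to chrome tab"}
"""

counts = failed_methods(sample_log)
for method, n in counts.items():
    print(method, n)
```

A handful of entries probably means a few skipped pages; a long, steady stream of them is more consistent with being blocked or with the tab having been closed.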

Doing large-volume scraping in the browser is always a risk. Chrome tends to close tabs when it reaches memory limits. Try splitting the sitemap into multiple parts.
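One way to split it is to keep the selectors unchanged and give each part its own, narrower start URL (for example one country page per sitemap), then import and run the parts one at a time. Here is a minimal Python sketch of that idea, assuming the exported sitemap JSON has the `_id`, `startUrl` and `selectors` keys shown in the first post. The helper name `split_sitemap`, the `-partN` id suffix, and the second URL are placeholders, and it is an assumption that Web Scraper accepts ids in that form.

```python
import copy
import json


def split_sitemap(sitemap, start_urls, chunk_size=1):
    """Return one copy of the sitemap per chunk of start URLs."""
    parts = []
    for i in range(0, len(start_urls), chunk_size):
        part = copy.deepcopy(sitemap)  # leave the original sitemap untouched
        part["_id"] = f"{sitemap['_id']}-part{i // chunk_size + 1}"  # assumed id format
        part["startUrl"] = start_urls[i:i + chunk_size]
        parts.append(part)
    return parts


# Trimmed stand-in for the exported sitemap; paste the real selectors here.
sitemap = {
    "_id": "zoneboursescraping",
    "startUrl": ["https://www.zonebourse.com/bourse/actions/Europe-3/"],
    "selectors": [],
}

urls = [
    "https://www.zonebourse.com/bourse/actions/Europe-3/France-51/",
    # Placeholder: replace with the real URL of another country tab.
    "https://www.zonebourse.com/bourse/actions/Europe-3/ANOTHER-COUNTRY/",
]

parts = split_sitemap(sitemap, urls)
for part in parts:
    print(json.dumps(part))  # one importable sitemap per line
```

Each printed line can be pasted into the extension's sitemap import. Smaller parts also mean that a closed tab costs you one country rather than the whole job.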

Creedy · August 17, 2020, 11:02am

Thank you very much for your replies! @leemeng @martins

For the sitemap I am trying to run, would you have any advice on how I could split it up?