BUG: Crash with large data set

Scraper crashes with a large dataset

When Scraper starts up, it begins generating statistics, and when the underlying data is large enough it runs out of memory and crashes.

The solution would be to avoid trying to fit the entire database into memory.
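One possible direction, as a rough sketch only: compute the statistics page by page with PouchDB's allDocs pagination instead of fetching every row at once. The function name and page size below are made up for illustration and are not taken from the extension's code.

async function countRecordsPaged(db, pageSize = 1000) {
  // Walk the database in fixed-size pages so only one page is in memory at a time.
  let count = 0;
  let lastKey;
  while (true) {
    const opts = { limit: pageSize };
    if (lastKey !== undefined) {
      opts.startkey = lastKey;
      opts.skip = 1; // skip the boundary row that was already counted
    }
    const page = await db.allDocs(opts);
    count += page.rows.length;
    if (page.rows.length < pageSize) {
      break; // last page reached
    }
    lastKey = page.rows[page.rows.length - 1].key;
  }
  return count;
}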

This problem prevents Scraper from starting up, and you need to "hack" the scraper code with the dev tools to get it running.

Update 1: Screenshot & stack trace from the crash

This shows that the function reportStats (call stack level 5) is the top-level function for this crash.

Screenshot

Stack trace

gi (background_script.js:22284)
A.get.onsuccess (background_script.js:22568)
success (async)
IndexedDB (async)
(anonymous) (background_script.js:22563)
x (background_script.js:22570)
O (background_script.js:22577)
(anonymous) (background_script.js:22630)
IndexedDB (async)
(anonymous) (background_script.js:22627)
Ci (background_script.js:22628)
e._allDocs (background_script.js:22873)
(anonymous) (background_script.js:21416)
(anonymous) (background_script.js:20457)
(anonymous) (background_script.js:25761)
(anonymous) (background_script.js:20431)
(anonymous) (background_script.js:20425)
(anonymous) (background_script.js:25761)
(anonymous) (background_script.js:20459)
Qe.execute (background_script.js:21498)
Qe.ready (background_script.js:21502)
(anonymous) (background_script.js:21097)
(anonymous) (background_script.js:22714)
(anonymous) (background_script.js:22718)
(anonymous) (background_script.js:22975)
u (background_script.js:18396)
characterData (async)
i (background_script.js:18376)
e.exports (background_script.js:18402)
(anonymous) (background_script.js:22974)
(anonymous) (background_script.js:23089)
(anonymous) (background_script.js:22711)
wi (background_script.js:22635)
(anonymous) (background_script.js:22722)
Ti (background_script.js:22723)
Ke (background_script.js:21087)
getSitemapDataDb (background_script.js:20247)
(anonymous) (background_script.js:20308)
(anonymous) (background_script.js:20224)
n (background_script.js:20204)
getSitemapData (background_script.js:20307)
(anonymous) (background_script.js:20039)
o (background_script.js:19706)
Promise.then (async)
c (background_script.js:19721)
o (background_script.js:19706)
Promise.then (async)
c (background_script.js:19721)
o (background_script.js:19706)
Promise.then (async)
c (background_script.js:19721)
(anonymous) (background_script.js:19723)
n (background_script.js:19703)
getDatabaseStats (background_script.js:19878)
(anonymous) (background_script.js:20046)
(anonymous) (background_script.js:19723)
n (background_script.js:19703)
getStats (background_script.js:20044)
(anonymous) (background_script.js:20110)
(anonymous) (background_script.js:19723)
n (background_script.js:19703)
reportStats (background_script.js:20109)
(anonymous) (background_script.js:20095)
(anonymous) (background_script.js:19723)
n (background_script.js:19703)
(anonymous) (background_script.js:20092)

Hi UniqueUsername,

I had a similar issue with Google Chrome and therefore switched to Firefox, which was much more stable with the extension. However, after scraping, Firefox failed to create the CSV file for export; it probably corrupted the local storage files, and now it does not load any sitemap anymore.

The function where the problem occurs is within the PouchDB library. PouchDB is used to store scraped data in the browser's local storage or in a CouchDB database. Scraped data isn't stored in memory. Only the queue is stored in memory, but you would have to have millions of URLs in the queue for it to use more than 1 GB of memory.

This could be caused by a bug in the PouchDB library - too much memory used for some kind of lookup, or a simple memory leak. The simplest and perhaps best solution would be to remove the PouchDB wrapper and write to local storage directly. This would give more control over memory usage and would simplify the application. It would also mean that data export to CouchDB would be removed.
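As a rough illustration of what "writing to local storage directly" could look like, here is a minimal sketch that stores a scraped record straight into IndexedDB. The database and object store names are assumptions for the example, not the extension's real schema.

function saveRecord(record) {
  return new Promise((resolve, reject) => {
    // 'scraper-data' and 'records' are placeholder names for this sketch.
    const open = indexedDB.open('scraper-data', 1);
    open.onupgradeneeded = () => {
      // Create the object store on first use.
      open.result.createObjectStore('records', { autoIncrement: true });
    };
    open.onsuccess = () => {
      const db = open.result;
      const tx = db.transaction('records', 'readwrite');
      tx.objectStore('records').add(record);
      tx.oncomplete = () => { db.close(); resolve(); };
      tx.onerror = () => { db.close(); reject(tx.error); };
    };
    open.onerror = () => reject(open.error);
  });
}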

We have been thinking about how to simplify the extension's storage engine for some time. We have previously fixed data export bugs caused by overly large data files being exported. At the moment we don't have an ETA for a storage engine rewrite.

Right now the best thing you can do is split a scraping job into multiple sitemaps. Historically we have received various bug reports related to Chrome crashes. Mostly they are caused either by limited available memory on the host computer or by a bug in the Chrome browser that triggers a random crash. There isn't a good fix for these kinds of problems. To reduce them we added a start URL limit that keeps scraping job sizes down. Adding a queue size limit might be too restrictive, because different setups can run different-sized jobs.

If you really want to run big jobs, use Web Scraper Cloud. It has built-in fail-over and retry mechanisms, and the scrapers run in a controlled environment.


I used CouchDB as the database, but the crash keeps happening. It seems that Chrome uses more and more RAM, as if the extension keeps loading PouchDB with data despite using CouchDB.

It saddens me a bit that removing PouchDB would also remove export to CouchDB. There has to be some database option, because handling large datasets within the browser doesn't seem like a sane option.

CouchDB seems nice, or at least it's easy to work with. I wonder if there is any workaround for this, like posting a document to CouchDB and then deleting it from memory, some kind of loop that keeps memory usage at bay.
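Something along those lines could, in principle, look like the sketch below: push scraped documents to CouchDB in small batches through the standard _bulk_docs endpoint and let each batch go out of scope once it has been accepted. The URL, function name, and batch size are placeholders, not existing extension code.

async function flushToCouch(couchDbUrl, docs, batchSize = 100) {
  for (let i = 0; i < docs.length; i += batchSize) {
    const batch = docs.slice(i, i + batchSize);
    // POST one batch at a time to CouchDB's bulk document endpoint.
    const res = await fetch(couchDbUrl + '/_bulk_docs', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ docs: batch }),
    });
    if (!res.ok) {
      throw new Error('CouchDB rejected batch: ' + res.status);
    }
    // The batch goes out of scope here, so the browser can reclaim the memory.
  }
}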