Very slow when pulling over a million records

I'm pulling the detail information for more than a million listings. For each advertisement (for example, a house listing) I write its details into a database. There are about 35 detail fields per listing, but I can only pull around 1,000 advertisements with their detailed information at a time, and given how many advertisements there are, this takes a very long time. I added speed-ups such as lxml and cchardet, but performance is still poor. Unfortunately, I don't have time to rewrite everything with a new library like Scrapy. The PC has 8 cores and an additional 8 GB of RAM installed, and the internet connection is also fast. I would be very happy if you could share a suggestion to improve performance.
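For reference, this is roughly the kind of code I'm running, simplified; the URL pattern and CSS selectors below are placeholders rather than the real site's markup:

```python
# Simplified sketch of the setup: requests + BeautifulSoup with the lxml parser.
# The URL pattern and selectors are placeholders, not the real site's markup.
import requests
from bs4 import BeautifulSoup

session = requests.Session()  # reuse one connection pool instead of reconnecting per page

def fetch_details(url):
    """Download one advertisement page and extract its detail fields."""
    response = session.get(url, timeout=30)
    response.raise_for_status()
    # "lxml" selects the fast C-based parser; with cchardet installed,
    # BeautifulSoup also detects the encoding of the raw bytes faster.
    soup = BeautifulSoup(response.content, "lxml")
    return {
        row.select_one(".label").get_text(strip=True): row.select_one(".value").get_text(strip=True)
        for row in soup.select(".detail-row")  # ~35 detail fields per advertisement
    }

# Placeholder list of advertisement URLs collected in an earlier step.
urls = [f"https://example.com/ads/{i}" for i in range(1, 1001)]

for url in urls:
    details = fetch_details(url)
    # ... write `details` to the database here
```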

@Test Hello, the main reasons this issue can occur are that, in most cases, a scraping job launched locally is easily interrupted by the target website's security mechanisms (CAPTCHA prompts, IP bans, etc.), by incorrect sitemap navigation, or by background processes running on your computer. Browser memory usage also grows gradually over a long run, which can lead to a browser crash or a premature end of the extraction process if your system does not have sufficient computing resources.

Scraping jobs in Web Scraper Cloud are launched within a virtual environment (server-based) and do not depend on your own machine in any way, so that might be the right solution for you: Web Scraper - Pricing


Hello, thank you very much for your answer. Your observations are spot on. Yes, there is a CAPTCHA on the site, but I don't get any warnings that I'm stuck on it. I assume that if the scraper ran into it, it would stop me from pulling the data; if I'm wrong about that, please correct me. Would running on your servers speed things up significantly? If it would, I'll go with that option.

@Test Hi, I would recommend trying out the Web Scraper Cloud solution, which would also give you the ability to monitor any 'Failed' or 'Empty' pages that occur during the extraction process.

Web Scraper Cloud offers the following features: proxies, a scheduler, an API, a parser, data export, data quality control, and notifications.

You can learn more about that right here: Web Scraper Cloud | Web Scraper Documentation


Thank you so much for your help.