Can we scrape images inline or from cache?

Two questions:

  1. Is it possible to scrape the images while scraping the text from the page, rather than in a separate pass? Looking at the response headers for one of the images on the page, I see "Cache-Control: public,max-age=31536000,immutable", so the browser already has the files cached for a year. It seems silly to have to run a separate script outside of the browser to fetch all the images, since that generates a fresh round of requests to the server being scraped.
  2. Can the filename of each downloaded image be derived from a unique field in the data row it belongs to? For eBay, for example, it makes sense to name the image 1234567890.webp when the eBay item ID is 1234567890. Ideally, I'll write a custom script for my specific use case that takes the downloaded images and processes them based on the text data from that row, so having a unique reference field will be critical. A rough sketch of the naming I have in mind follows below.
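
Something like this, to pin down what I mean by question 2 (a minimal sketch; "item_id" and "image_url" are hypothetical column names from my own CSV export, not Web Scraper fields, and note it re-requests every file, which is exactly the duplication I'd like to avoid):

```python
# Rough sketch of the naming scheme I'm after (question 2).
# "item_id" and "image_url" are placeholder column names from my own CSV.
# Caveat: this re-downloads every image, which is the traffic I want to avoid.
import csv
import pathlib
import urllib.request

out_dir = pathlib.Path("images")
out_dir.mkdir(exist_ok=True)

with open("scrape_results.csv", newline="") as f:
    for row in csv.DictReader(f):
        url = row["image_url"]
        # Keep the original extension; fall back to .webp if there isn't one.
        ext = pathlib.PurePosixPath(url.split("?", 1)[0]).suffix or ".webp"
        urllib.request.urlretrieve(url, out_dir / f"{row['item_id']}{ext}")
```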

I searched and didn't immediately find any posts covering this. If this needs to move to Feature Requests, that's OK too, but I'd love to get this going soon; if there's some creative workaround, I'm happy to write whatever code it takes to make it work.

Url: http://ebay.com

Sitemap:
{id:"sitemap code"}

Thanks for the quick reply. Are you saying there's a way to get what I'm after with this extraAttribute selector? I don't see it among my selector types, but I'll look through the documentation again in case I missed something. I was unable to use the sitemap you provided, as it fails with this error:
* FAILED_TO_CONNECT_TO_CHROME_TAB {"message":"Could not establish connection. Receiving end does not exist."} at getRootElement with []

That gets the image URLs, but it does not seem to grab the actual images (download them), which is what I'm looking to do. Is this example supposed to fetch the actual image files?

To be clear, I have no problems getting the URLs for multiple images off a page. I need the actual image files.

So it's not possible to have them downloaded while the scrape is in progress? I was mainly trying to avoid re-downloading images that are already cached in the browser.

I'll take a look at the provided image download scripts and figure out what I need to do there.

I think it should be fine. Locally, I'll write a Python script to pull the images from the cache to save time, bandwidth, and requests. Once I move to the cloud I won't care as much, I think.

If it's excessive traffic you're worried about, you can disable images for a particular site via Chrome's settings. The option is buried in there somewhere; I just search for "images" in the settings search box.

I do this for large scrape jobs. It usually has no effect on the scraping, because all the selectors are still present, including the image URLs, and it has the added benefit of making pages load much faster. On the downside, some sites will detect this as bot behavior, though that's rare, and some sites' features won't work properly when images aren't loaded, e.g. page navigation.
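
If you'd rather script it than click through settings, you can also launch a throwaway Chrome with images turned off globally. A sketch, assuming the --blink-settings switch still works (it's an internal Chromium flag, not a documented stable one, so no guarantees it survives future versions):

```python
# Launch a separate Chrome profile with image loading disabled.
# --blink-settings is an internal Chromium switch; treat it as an
# assumption that may break in future Chrome versions.
import subprocess

subprocess.run([
    "google-chrome",                         # or "chromium", "chrome.exe", etc.
    "--user-data-dir=/tmp/scrape-profile",   # throwaway profile; normal browsing unaffected
    "--blink-settings=imagesEnabled=false",  # skip image loading entirely
    "https://www.ebay.com/",
])
```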

I opted for a Python script that goes through the CSV and looks up each image by URL in Firefox's cache. It's super fast, and I don't have to mess with the browser or any additional extensions. Here's the gist of it, in case it helps anyone else.
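
A minimal sketch, with some assumptions baked in: it relies on Firefox's cache2 layout, where each file under <profile>/cache2/entries/ stores the response body first, then a metadata block that embeds the original URL as plain text, with the final 4 bytes giving the big-endian offset of that metadata. The profile path and the "item_id"/"image_url" column names are placeholders for my setup.

```python
# Pull cached images out of Firefox's cache2 store and save each one
# under its row's unique id. Assumes the cache2 layout described above;
# the paths and CSV column names are placeholders for my own setup.
import csv
import pathlib
import struct

CACHE_DIR = pathlib.Path.home() / ".mozilla/firefox/xxxxxxxx.default/cache2/entries"
CSV_FILE = "scrape_results.csv"
OUT_DIR = pathlib.Path("images")


def load_entries(cache_dir):
    """Read every cache entry once, returning (metadata, body) pairs."""
    entries = []
    for path in cache_dir.iterdir():
        raw = path.read_bytes()
        if len(raw) < 4:
            continue
        # The final 4 bytes are a big-endian offset to the metadata block.
        (meta_off,) = struct.unpack(">I", raw[-4:])
        if meta_off >= len(raw):
            continue
        entries.append((raw[meta_off:], raw[:meta_off]))
    return entries


def find_body(entries, url):
    """Naive lookup: the metadata block embeds the request URL as plain text."""
    needle = url.encode()
    for meta, body in entries:
        if needle in meta:
            return body
    return None


def main():
    OUT_DIR.mkdir(exist_ok=True)
    entries = load_entries(CACHE_DIR)
    with open(CSV_FILE, newline="") as f:
        for row in csv.DictReader(f):
            url = row["image_url"]
            body = find_body(entries, url)
            if body is None:
                print("not in cache:", url)
                continue
            ext = pathlib.PurePosixPath(url.split("?", 1)[0]).suffix or ".webp"
            (OUT_DIR / f"{row['item_id']}{ext}").write_bytes(body)


if __name__ == "__main__":
    main()
```

Loading every entry into memory is fine for my cache size; for a bigger cache you'd want to build a URL-to-entry index instead of scanning per row.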