Unable to scrape image

bernardtapie · April 17, 2024, 7:03pm

Hi guys,

Usually I do not have any issues, but on this page, for example, I am not able to scrape the first image to obtain its source link: Chemise sur mesure homme en twill facile à repasser

If I select the image it identifies img.w-full.ath_lazy and returns data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7

However when I click to zoom it identifies .main img and returns data/lechemiseur/chemise-homme/fiche-produit-pleine-page/Chemise-homme-en-Twill-facile-a-repasser-vichy-bleu-LE-CHEMISEUR-CB18-carre.jpg which is what I need.

How am I supposed to do this, please? Is there a way to obtain it without zooming? Or how am I supposed to tell the scraper to zoom before looking for .main img?

Thank you very much for your help !

JanAp · April 18, 2024, 12:41pm

Hi,

In order to scrape the required image URL, I would copy the image link, then go to the Elements tab in Dev Tools and search for the image link to see which HTML element holds it.

In this case it is .zoom-container source

The srcset attribute can be scraped using a selector with the Element attribute type, see reference below:

{"_id":"lechemiseur","startUrl":["https://lechemiseur.fr/chemise-twill?tissu=CB18"],"selectors":[{"extractAttribute":"srcset","id":"image","multiple":false,"parentSelectors":["_root"],"selector":".zoom-container source","type":"SelectorElementAttribute"}]}

Some data post-processing will be required to filter out the initial image URL though.
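If the goal of that post-processing is to keep a single URL from the scraped srcset value, a minimal Python sketch could look like the one below. It assumes the value follows the usual srcset layout (candidates separated by a comma plus whitespace, each one a URL optionally followed by a width or density descriptor). The function name `first_srcset_url` and the sample value are hypothetical, not taken from the site.

```python
import re


def first_srcset_url(srcset: str) -> str:
    """Return the URL of the first candidate in a srcset string.

    Assumption: candidates are separated by a comma followed by whitespace,
    and each candidate is "URL [descriptor]" (for example "a.jpg 400w").
    """
    # Split into candidates and drop empty pieces.
    candidates = [c.strip() for c in re.split(r",\s+", srcset.strip()) if c.strip()]
    if not candidates:
        return ""
    # The URL is the first whitespace-delimited token of the candidate;
    # strip a trailing comma left over from a dangling separator.
    return candidates[0].split()[0].rstrip(",")


# Hypothetical srcset value, for illustration only.
sample = "https://example.com/a.jpg?w=400 400w, https://example.com/a.jpg?w=800 800w"
print(first_srcset_url(sample))
```

This can be run over the exported CSV column once the scrape has finished; an empty string is returned for rows where nothing was scraped.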

bernardtapie · April 18, 2024, 4:45pm

Hi JanAp,

Thank you very, very much for your help, it works like a charm!

I really need to spend some time in order to better understand HTML and how the tool works.

Wish you the best

bernardtapie · April 19, 2024, 9:28am

Hi JanAp,

How did you know that it was ".zoom-container source" and not just ".zoom-container"?

Moreover, if I want to scrape the other pictures, how can I do that, please? I checked "multiple" but it's not working.

Thank you !

JanAp · April 19, 2024, 10:52am

Hi,

You have to specify 'source' because it is the element that holds the 'srcset' attribute.

It appears that the site renders the images only after they are scrolled over, so you can add a basic scroll selector at the beginning of the sitemap.

Also, only the first element includes the domain (https://lechemiseur.imgix.net/) in the image URL. For the subsequent image URLs, it will have to be added in post-processing.

Reference sitemap:

{"_id":"lechemiseur","startUrl":["https://lechemiseur.fr/chemise-twill?tissu=CB18"],"selectors":[{"delay":0,"elementLimit":500,"id":"scroll","multiple":true,"parentSelectors":["_root"],"selector":"body","type":"SelectorElementScroll"},{"extractAttribute":"srcset","id":"image","multiple":true,"parentSelectors":["_root"],"selector":".zoom-container source","type":"SelectorElementAttribute"}]}