On aliexpress.com, only the thumbnail images can be scraped, but that's not enough

Dear all,
I've written the sitemap below to scrape all the thumbnail images of a product on aliexpress.com, but it can't seem to scrape the larger versions of those images (the ones shown when you zoom in on or mouse over the thumbnails).

The root cause appears to be that my sitemap never actually clicks the element I configured as the "Click selector". If the click went through, it might be able to scrape the larger images, or something better.

Can anyone take it further?

Thanks.

Sitemap:
{"_id":"aliexpress_images","startUrl":["https://www.aliexpress.com/item/New-Original-Converse-all-star-canvas-shoes-men-s-and-women-s-sneakers-low-classic-Skateboarding/32896359346.html"],"selectors":[{"id":"all","type":"SelectorElementClick","parentSelectors":["_root"],"selector":"div.detail-gallery-main","multiple":true,"delay":0,"clickElementSelector":"span.img-thumb-item","clickType":"clickMore","discardInitialElements":false,"clickElementUniquenessType":"uniqueCSSSelector"},{"id":"image","type":"SelectorImage","parentSelectors":["all"],"selector":"li span.img-thumb-item img","multiple":true,"delay":0}]}

This one is actually quite straightforward. The AliExpress thumbnails have the same URL as the full-size images, except that they have _50x50.jpg appended at the end. So you just need to grab each thumbnail URL and remove the _50x50.jpg part.

You could use Excel to do this post-scrape, or you can combine an HTML selector (SelectorHTML) with a regex, something like this:

{"_id":"aliexpress_images","startUrl":["https://www.aliexpress.com/item/32970692142.html"],"selectors":[{"id":"Item name","type":"SelectorText","parentSelectors":["_root"],"selector":"div[itemprop='name']","multiple":false,"regex":"","delay":0},{"id":"thumbs container","type":"SelectorElement","parentSelectors":["_root"],"selector":"ul.images-view-list","multiple":false,"delay":0},{"id":"Fullsize Urls","type":"SelectorHTML","parentSelectors":["thumbs container"],"selector":"li","multiple":true,"regex":"(?<=img src=\").+(?=_50x50.jpg)","delay":0}]}

Note: the regex, without the JSON escaping of the double quote, is
(?<=img src=").+(?=_50x50.jpg)
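If you would rather do the post-scrape cleanup in a script than in Excel, here is a minimal Python sketch of the same idea. It assumes the thumbnail URLs really do end in _50x50.jpg as described above. The function names and the example.com URL are made up for illustration and are not part of Web Scraper or AliExpress.

```python
import re

# Assumption: every thumbnail URL ends with this exact suffix.
THUMB_SUFFIX = "_50x50.jpg"

# Same pattern as in the sitemap, with the dot escaped so it only matches ".jpg".
HTML_PATTERN = re.compile(r'(?<=img src=").+(?=_50x50\.jpg)')


def fullsize_url(thumb_url):
    """Strip the thumbnail suffix from a URL; return it unchanged if absent."""
    if thumb_url.endswith(THUMB_SUFFIX):
        return thumb_url[: -len(THUMB_SUFFIX)]
    return thumb_url


def fullsize_from_html(li_html):
    """Pull the full-size URL out of a scraped <li> HTML snippet, or None."""
    match = HTML_PATTERN.search(li_html)
    return match.group(0) if match else None


if __name__ == "__main__":
    # Hypothetical URL, only to show the shape of the input.
    print(fullsize_url("https://example.com/kf/abc.jpg_50x50.jpg"))
```

The first function fits a CSV export that already contains the thumbnail URLs. The second mirrors what the SelectorHTML regex does inside the sitemap.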
