Scrape the images of each color option (loaded via JSON)

Hello everyone,

How can I scrape the images for each option? Each color option shows different photos of the T-shirt. The images are loaded as JSON:

https://www.toptex.com/variantgrid/color/loadpackshot/id/3294/

Should I use a 'Popup link' selector?

Thank you very much!


@mamescrap Hello, this can be done with an 'Element click' selector (click selector: span.color) and a 'Grouped' selector set as its child (selector: .slick-active img, .visible img; 'Attribute name': src).

Example:

{"_id":"toptex-com","startUrl":["https://www.toptex.com/men-s-organic-pique-short-sleeved-polo-shirt.html"],"selectors":[{"clickElementSelector":"span.color","clickElementUniquenessType":"uniqueCSSSelector","clickType":"clickOnce","delay":2000,"discardInitialElements":"do-not-discard","id":"color-click","multiple":true,"parentSelectors":["_root"],"selector":"html","type":"SelectorElementClick"},{"delay":0,"extractAttribute":"src","id":"images","parentSelectors":["color-click"],"selector":".slick-active img, .visible img","type":"SelectorGroup"}]}


Hi Viesturs,

I'm new to this community. I tried your solution to scrape images from the page below, making the required changes to the sitemap, but was not successful. Could you help me with this? What would the sitemap look like if I need to scrape the images of all variants of this product?

Thanks in advance.

Sincerely,
Faheem

Hello viesturs

A big thank you, this is great. I now understand a little better how the application works.

Hi @viesturs, @mamescrap, any idea how I can create this sitemap, please?

@Faheem Hi, these product variants are actually unique products with unique links. The 'Element click' selector won't work here because the page reloads each time you click a different option, and 'Element click' does not work across page reloads.

The products can be found on the product listing page, so there is no need for this selector anyway.

All of the images can be extracted using a 'Grouped' selector (selector: div.bottom-thumbnail-item img; 'Attribute name': src).

Example:

{"_id":"aosom-ca","startUrl":["https://www.aosom.ca/item/aosom-kids-electric-pedal-motorcycle-ride-on-toy-6v-battery-powered-for-3-8-years-old-370-104bu~370-104BU.html?recv=eyJwYWdldHlwZSI6Iml0bSIsInBhZ2VpZCI6IjM3MC0xMDRSRCJ9"],"selectors":[{"delay":0,"extractAttribute":"src","id":"images","parentSelectors":["_root"],"selector":"div.bottom-thumbnail-item img","type":"SelectorGroup"}]}

If necessary, you can afterwards do some additional data post-processing (for example, removing unnecessary symbols or rewriting the image URLs to a different size) with the parser feature in Web Scraper Cloud.



More info:

https://webscraper.io/documentation/web-scraper-cloud/parser
https://webscraper.io/documentation/web-scraper-cloud/parser/replace-text


Hello viesturs, Faheem

Thank you so much for your help. :+1::hugs:

On another note, I have a small problem retrieving hidden data, the name of a child category:

The name is "Vestes / bodies"

<li class="level1 nav-1-5 parent"><a href="https://www.toptex.fr/vetements/vestes-bodies.html?___SID=U" class="level1 has-children ">Vestes / bodies</a>

I can only get the URL of the link below it (level2). Why?

Here is my sitemap:

{"_id":"tooooptex","startUrl":["https://www.toptex.fr/veste-softshell-3-couches-a-capuche-avec-manches-amovibles-unisexe.html"],"selectors":[{"delay":0,"id":"hidden-childcategory","multiple":false,"parentSelectors":["_root"],"regex":"","selector":"li.level1, a.level1 has-children > span","type":"SelectorHTML"}]}

How do I isolate the right part of the code to retrieve the right value?

Thank you!

Thanks @viesturs, sure, I'll try this code.

@mamescrap Hi, did you try - li.level1.nav-1-5.parent a.level1.has-children ?
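
For reference, here is a minimal sketch of the sitemap posted above with only the selector swapped. It is untested against the live site. In the original selector, `a.level1 has-children` has a space where a dot was probably intended, so CSS reads `has-children` as a descendant tag name rather than a second class, and the link in the HTML snippet has no span child for `> span` to match.

```json
{
  "_id": "tooooptex",
  "startUrl": [
    "https://www.toptex.fr/veste-softshell-3-couches-a-capuche-avec-manches-amovibles-unisexe.html"
  ],
  "selectors": [
    {
      "delay": 0,
      "id": "hidden-childcategory",
      "multiple": false,
      "parentSelectors": ["_root"],
      "regex": "",
      "selector": "li.level1.nav-1-5.parent a.level1.has-children",
      "type": "SelectorHTML"
    }
  ]
}
```

Note that `.nav-1-5` pins the match to one specific menu entry, so this would be expected to return the same entry on every page where the menu is present.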

Hello viesturs!

In fact, it always gives me the same category for all products.

I think that some of the product pages (new products) listed on the general page here do not have a breadcrumb link:

https://www.toptex.fr/vetements.html

With a new sitemap that uses the mega menu, the breadcrumb still comes back empty, while the previews are OK:

{"_id":"toptexrootmenu","startUrl":["https://www.toptex.fr/vetements.html"],"selectors":[{"delay":0,"id":"menu","multiple":true,"parentSelectors":["_root"],"selector":".nav-1 a.level0.has-children, .nav-1-1 a.level1, .nav-1-1 li.level3:nth-of-type(n+2) a","type":"SelectorLink"},{"delay":0,"id":"bloc","multiple":true,"parentSelectors":["menu"],"selector":"a.product-image","type":"SelectorLink"},{"delay":0,"id":"breacrump","multiple":false,"parentSelectors":["bloc"],"regex":"","selector":"div.breadcrumbs","type":"SelectorText"},{"delay":0,"id":"name","multiple":false,"parentSelectors":["bloc"],"regex":"","selector":"h1","type":"SelectorText"}]}

Hello, does no one have any idea? Is it a bug? Thank you.

@mamescrap Hi, it appears that this data is located in the following scripts:

a#scrollToTop + script

script[type="application/ld+json"]:contains("category")

Hello, thanks for the answer. How do I extract it? Is it like a div? Do I have to use 'Element attribute'? Do I put a#scrollToTop in the selector? What do I put in 'Attribute name'? Thank you.

{"_id":"toptexrootmenu","startUrl":["https://www.toptex.fr/vetements.html"],"selectors":[{"delay":0,"id":"menu","multiple":true,"parentSelectors":["_root"],"selector":".nav-1 a.level0.has-children, .nav-1-1 a.level1, .nav-1-1 li.level3:nth-of-type(n+2) a","type":"SelectorLink"},{"delay":0,"id":"bloc","multiple":true,"parentSelectors":["menu"],"selector":"a.product-image","type":"SelectorLink"},{"delay":0,"extractAttribute":"script[type=\"application/ld+json\"]:contains(\"category\")","id":"breacrump","multiple":false,"parentSelectors":["bloc"],"selector":"a#scrollToTop","type":"SelectorElementAttribute"},{"delay":0,"id":"name","multiple":false,"parentSelectors":["bloc"],"regex":"","selector":"h1","type":"SelectorText"},{"delay":0,"extractAttribute":"application/ld+json","id":"category","multiple":false,"parentSelectors":["bloc"],"selector":"script","type":"SelectorElementAttribute"}]}

@mamescrap Hi, you have to target the script with a 'Text' selector. For example: script[type="application/ld+json"]:contains("category").
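
As a rough sketch of what that could look like in the sitemap above: the two 'Element attribute' entries ("breacrump" and "category") would be replaced by a single 'Text' entry like the one below. In the posted sitemap the CSS selector ended up in the attribute-name field, but 'Element attribute' reads an attribute of the matched element, while here the data is the text content of the script tag. The field names are taken from the other 'Text' selectors in the same sitemap; the id "category" is just a label, and this has not been tested against the live site.

```json
{
  "delay": 0,
  "id": "category",
  "multiple": false,
  "parentSelectors": ["bloc"],
  "regex": "",
  "selector": "script[type=\"application/ld+json\"]:contains(\"category\")",
  "type": "SelectorText"
}
```

This returns the whole JSON-LD block as text, so the category value still has to be picked out of it afterwards.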

Afterward, you can clean up the data using the parser feature within Web Scraper Cloud.

More information can be found here: https://webscraper.io/documentation/web-scraper-cloud/parser

See the screenshot examples.


Hello, does this require the Cloud version? Thank you.

@mamescrap If you want to perform data post-processing - yes, you will have to use the Cloud version.

Hello, I'm not sure. I just want to get the breadcrumb trail of each product. If the Cloud version is needed to achieve this, I want to be sure first. Thank you.