Scrape the image of each color option on JSON

mamescrap · January 10, 2022, 4:28pm

Hello everyone,

What is the method for scraping the images for each option? Each color option shows different photos of the T-shirt. The data is served as JSON:

https://www.toptex.com/variantgrid/color/loadpackshot/id/3294/

Should I use a Popup link selector?

Thank you very much!

ViestursWS · January 11, 2022, 3:02pm

@mamescrap Hello, this can be done with an 'Element click' selector (click selector: span.color) and a 'Grouped' selector set as its child (selector: .slick-active img, .visible img) with the 'Attribute name' set to src.

Example:

{"_id":"toptex-com","startUrl":["https://www.toptex.com/men-s-organic-pique-short-sleeved-polo-shirt.html"],"selectors":[{"clickElementSelector":"span.color","clickElementUniquenessType":"uniqueCSSSelector","clickType":"clickOnce","delay":2000,"discardInitialElements":"do-not-discard","id":"color-click","multiple":true,"parentSelectors":["_root"],"selector":"html","type":"SelectorElementClick"},{"delay":0,"extractAttribute":"src","id":"images","parentSelectors":["color-click"],"selector":".slick-active img, .visible img","type":"SelectorGroup"}]}
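To make the parent/child relationship in that sitemap easier to see, here is a minimal Python sketch that reads the same JSON and prints the selector hierarchy. The helper names `selector_tree` and `print_tree` are made up for illustration; this runs outside Web Scraper and only inspects the sitemap text.

```python
import json

# Sitemap copied from the post above.
SITEMAP = r'''{"_id":"toptex-com","startUrl":["https://www.toptex.com/men-s-organic-pique-short-sleeved-polo-shirt.html"],"selectors":[{"clickElementSelector":"span.color","clickElementUniquenessType":"uniqueCSSSelector","clickType":"clickOnce","delay":2000,"discardInitialElements":"do-not-discard","id":"color-click","multiple":true,"parentSelectors":["_root"],"selector":"html","type":"SelectorElementClick"},{"delay":0,"extractAttribute":"src","id":"images","parentSelectors":["color-click"],"selector":".slick-active img, .visible img","type":"SelectorGroup"}]}'''


def selector_tree(sitemap):
    """Map each parent id to the (id, type) pairs of its child selectors."""
    tree = {}
    for sel in sitemap["selectors"]:
        for parent in sel["parentSelectors"]:
            tree.setdefault(parent, []).append((sel["id"], sel["type"]))
    return tree


def print_tree(tree, parent="_root", depth=0):
    """Print the hierarchy, indenting each level by two spaces."""
    for sel_id, sel_type in tree.get(parent, []):
        print("  " * depth + f"{sel_id} ({sel_type})")
        print_tree(tree, sel_id, depth + 1)


tree = selector_tree(json.loads(SITEMAP))
print_tree(tree)
```

The Grouped selector `images` appears nested under `color-click`, which is why it is evaluated after each color click rather than once for the whole page.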

Faheem · January 13, 2022, 4:19am

Hi Viesturs,

I'm new to this community. I tried your solution to scrape images from the page below, making the required changes to the code, but was not successful. Could you help me with this? What would the code be if I need to scrape the images of all variants of this product?

Thanks in advance.

Sincerely,
Faheem

mamescrap · January 13, 2022, 11:12am

Hello Viesturs,

A big thank you, it's great. I now understand a little better how the application works.

Faheem · January 13, 2022, 5:00pm

Hi @ViestursWS, @mamescrap, any idea how I can create this code, please?

ViestursWS · January 13, 2022, 7:18pm

@Faheem Hi, these product variants are actually unique products with unique links, so the 'Element click' selector won't work here: each time you click a different option, the page is reloaded, and the Element click selector does not work across page reloads.

The products can be found in the product listing page, therefore there is no need to create this selector anyway.

All of the images can be extracted using the 'Grouped' selector - div.bottom-thumbnail-item img with an 'Attribute name' - src.

Example:

{"_id":"aosom-ca","startUrl":["https://www.aosom.ca/item/aosom-kids-electric-pedal-motorcycle-ride-on-toy-6v-battery-powered-for-3-8-years-old-370-104bu~370-104BU.html?recv=eyJwYWdldHlwZSI6Iml0bSIsInBhZ2VpZCI6IjM3MC0xMDRSRCJ9"],"selectors":[{"delay":0,"extractAttribute":"src","id":"images","parentSelectors":["_root"],"selector":"div.bottom-thumbnail-item img","type":"SelectorGroup"}]}

If necessary, you can afterwards do some additional data post-processing (for example, removing unnecessary symbols or transforming the images to a different size) with the parser feature in Web Scraper Cloud.

More info:

https://webscraper.io/documentation/web-scraper-cloud/parser
https://webscraper.io/documentation/web-scraper-cloud/parser/replace-text
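
For comparison, similar clean-up can be sketched locally on an exported file. The Python below is only an illustration under two assumptions: that each Grouped selector cell holds a JSON array of objects, and that "unnecessary symbols" means query strings and stray whitespace. The function names, host, and file names are hypothetical.

```python
import json
from urllib.parse import urlsplit, urlunsplit


def clean_image_url(url):
    """Trim whitespace and drop the query string and fragment."""
    parts = urlsplit(url.strip())
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def clean_grouped_cell(cell):
    """Parse one exported cell and return cleaned URLs without duplicates."""
    urls = []
    for record in json.loads(cell):  # assumed shape: a list of objects
        for value in record.values():
            cleaned = clean_image_url(value)
            if cleaned not in urls:
                urls.append(cleaned)
    return urls


# Hypothetical cell content; the host and file names are made up.
cell = (
    '[{"images": "https://img.example.com/a.jpg?w=100 "},'
    ' {"images": "https://img.example.com/a.jpg?w=800"},'
    ' {"images": "https://img.example.com/b.jpg"}]'
)
print(clean_grouped_cell(cell))
```

If the real export uses a different cell layout, only `clean_grouped_cell` would need to change.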

mamescrap · January 13, 2022, 9:32pm

Hello Viesturs, Faheem

Thank you so much for your help.

On another note, I have a small problem retrieving hidden data, the name of a child category:

The name is "Vestes / bodies"

<li class="level1 nav-1-5 parent"><a href="https://www.toptex.fr/vetements/vestes-bodies.html?___SID=U" class="level1 has-children ">Vestes / bodies</a>

I can only get the URL of the link below it (level2). Why?

Here is my sitemap:

{"_id":"tooooptex","startUrl":["https://www.toptex.fr/veste-softshell-3-couches-a-capuche-avec-manches-amovibles-unisexe.html"],"selectors":[{"delay":0,"id":"hidden-childcategory","multiple":false,"parentSelectors":["_root"],"regex":"","selector":"li.level1, a.level1 has-children > span","type":"SelectorHTML"}]}

How can I isolate the right part of the markup to retrieve the correct value?

Thank you!

Faheem · January 14, 2022, 1:44pm

Thanks @ViestursWS sure I'll try this code.

ViestursWS · January 14, 2022, 2:42pm

@mamescrap Hi, did you try - li.level1.nav-1-5.parent a.level1.has-children ?
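
A note on why the dots matter: in CSS, a space is the descendant combinator, so `a.level1 has-children` looks for a `<has-children>` element inside the link, whereas `a.level1.has-children` matches a single `<a>` carrying both classes. The Python sketch below mimics that compound-class match on the HTML snippet quoted earlier, using only the standard library. The class name `AnchorTextByClass` is made up, and the closing `</li>` is added so the sample is well-formed.

```python
from html.parser import HTMLParser


class AnchorTextByClass(HTMLParser):
    """Collect the text of <a> elements that carry every required class."""

    def __init__(self, required):
        super().__init__()
        self.required = set(required)
        self.inside = False
        self.results = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            classes = (dict(attrs).get("class") or "").split()
            if self.required <= set(classes):
                self.inside = True
                self.results.append("")

    def handle_data(self, data):
        if self.inside:
            self.results[-1] += data

    def handle_endtag(self, tag):
        if tag == "a":
            self.inside = False


HTML = (
    '<li class="level1 nav-1-5 parent">'
    '<a href="https://www.toptex.fr/vetements/vestes-bodies.html?___SID=U" '
    'class="level1 has-children ">Vestes / bodies</a></li>'
)

parser = AnchorTextByClass(["level1", "has-children"])
parser.feed(HTML)
parser.close()
names = [text.strip() for text in parser.results]
print(names)
```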

mamescrap · January 14, 2022, 3:48pm

Hello Viesturs!

In fact, it always gives me the same category for all products.

I think that some of the product pages (new products) listed on the general page here do not have a breadcrumb link:

https://www.toptex.fr/vetements.html

With a new sitemap that uses the megamenu, the breadcrumb still comes back empty, while the previews are OK:

{"_id":"toptexrootmenu","startUrl":["https://www.toptex.fr/vetements.html"],"selectors":[{"delay":0,"id":"menu","multiple":true,"parentSelectors":["_root"],"selector":".nav-1 a.level0.has-children, .nav-1-1 a.level1, .nav-1-1 li.level3:nth-of-type(n+2) a","type":"SelectorLink"},{"delay":0,"id":"bloc","multiple":true,"parentSelectors":["menu"],"selector":"a.product-image","type":"SelectorLink"},{"delay":0,"id":"breacrump","multiple":false,"parentSelectors":["bloc"],"regex":"","selector":"div.breadcrumbs","type":"SelectorText"},{"delay":0,"id":"name","multiple":false,"parentSelectors":["bloc"],"regex":"","selector":"h1","type":"SelectorText"}]}

mamescrap · January 24, 2022, 2:31pm

Hello, does no one have any idea? Is it a bug? Thank you.

ViestursWS · January 24, 2022, 3:45pm

@mamescrap Hi, it appears that this data is located in the following scripts:

a#scrollToTop + script

script[type="application/ld+json"]:contains("category")

mamescrap · February 1, 2022, 8:25am

Hello, thanks for the answer. How do I extract it? Is it like a div? Do I have to use "Element attribute"? Do I put a#scrollToTop in the selector? And what do I put in "Attribute name"? Thank you.

{"_id":"toptexrootmenu","startUrl":["https://www.toptex.fr/vetements.html"],"selectors":[{"delay":0,"id":"menu","multiple":true,"parentSelectors":["_root"],"selector":".nav-1 a.level0.has-children, .nav-1-1 a.level1, .nav-1-1 li.level3:nth-of-type(n+2) a","type":"SelectorLink"},{"delay":0,"id":"bloc","multiple":true,"parentSelectors":["menu"],"selector":"a.product-image","type":"SelectorLink"},{"delay":0,"extractAttribute":"script[type=\"application/ld+json\"]:contains(\"category\")","id":"breacrump","multiple":false,"parentSelectors":["bloc"],"selector":"a#scrollToTop","type":"SelectorElementAttribute"},{"delay":0,"id":"name","multiple":false,"parentSelectors":["bloc"],"regex":"","selector":"h1","type":"SelectorText"},{"delay":0,"extractAttribute":"application/ld+json","id":"category","multiple":false,"parentSelectors":["bloc"],"selector":"script","type":"SelectorElementAttribute"}]}

ViestursWS · February 4, 2022, 3:16pm

@mamescrap Hi, you have to specify the script type using a 'Text' selector. For example: script[type="application/ld+json"]:contains("category").
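
As a rough sketch of what that 'Text' selector entry could look like in the sitemap, the Python below builds the entry and prints it as JSON with the quotes escaped. The field names mirror the SelectorText entries already used in this thread, and the id `category` and parent `bloc` are taken from the sitemap above; treat it as an assumption to verify in the extension, not a confirmed configuration.

```python
import json

# Text selector entry modelled on the SelectorText entries in this thread.
category_selector = {
    "delay": 0,
    "id": "category",
    "multiple": False,
    "parentSelectors": ["bloc"],
    "regex": "",
    "selector": 'script[type="application/ld+json"]:contains("category")',
    "type": "SelectorText",
}

# Compact JSON, ready to paste into the "selectors" list of a sitemap.
entry_json = json.dumps(category_selector, separators=(",", ":"))
print(entry_json)
```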

Afterward, you can clean up the data using the parser feature within Web Scraper Cloud.

More information can be found here: https://webscraper.io/documentation/web-scraper-cloud/parser

See the screenshot examples.
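
To illustrate the kind of clean-up involved, here is a minimal Python sketch that pulls the category out of the script text once it has been exported. It assumes the scraped text is valid JSON-LD with a top-level "category" key; the sample string is hypothetical and not the actual content of the page, and `category_from_jsonld` is a made-up helper name.

```python
import json


def category_from_jsonld(raw):
    """Return the "category" value from a JSON-LD string, or None."""
    try:
        data = json.loads(raw)
    except (TypeError, ValueError):
        return None
    # JSON-LD may be a single object or a list of objects.
    items = data if isinstance(data, list) else [data]
    for item in items:
        if isinstance(item, dict) and "category" in item:
            return item["category"]
    return None


# Hypothetical scraped value; the real script content may differ.
sample = (
    '{"@context": "https://schema.org", "@type": "Product", '
    '"name": "Example jacket", "category": "Vestes / bodies"}'
)
print(category_from_jsonld(sample))
```

Rows where the script is missing or malformed come back as `None`, so they can be filtered out afterwards.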

mamescrap · February 4, 2022, 11:10pm

Hello, does this require the Cloud version? Thank you.

ViestursWS · February 5, 2022, 12:40pm

@mamescrap If you want to perform data post-processing - yes, you will have to use the Cloud version.

mamescrap · February 8, 2022, 7:40am

Hello, I'm not sure. I just want to get the breadcrumb trail of each product. If the Cloud version is needed to achieve this, I want to be sure before choosing it. Thank you.