Help: Scraping multiple images with and without image carousel

I'm scraping a website, and so far everything works except the images. Each product page has a main image and an image carousel with thumbnails, but some product pages have only the main image, and those don't get scraped. The number of pictures also varies from two to as many as seven per product page, and only four are displayed at a time; the rest are hidden until you press the next button. I've read the documentation, watched the tutorials and read every related question I could find, to the point that I'm more confused than when I started, and I can't figure out how to do it.

The site I'm trying to scrape sells "adult" products, so I'm not sure whether I can post my sitemap or a link to the site. If anyone has experience with this kind of situation, I would appreciate the help. Thanks.

Hi,

Please post the sitemap you have prepared so far.

{"_id":"Catalogo_Aizen","startUrl":["https://juguetesparaadultos.com.mx/tienda/page/[1-66]/"],"selectors":[{"id":"Product-Link","linkType":"linkFromHref","multiple":true,"parentSelectors":["_root"],"selector":".wd-entities-title a","type":"SelectorLink"},{"id":"Product Wraper","multiple":false,"parentSelectors":["Product-Link"],"selector":"div.summary-inner","type":"SelectorElement"},{"id":"Nombre","multiple":false,"parentSelectors":["Product Wraper"],"regex":"","selector":"h1","type":"SelectorText"},{"id":"SKU","multiple":false,"parentSelectors":["Product Wraper"],"regex":"","selector":"span.sku","type":"SelectorText"},{"id":"Categoría","multiple":false,"parentSelectors":["Product Wraper"],"regex":"","selector":".posted_in a","type":"SelectorText"},{"id":"Precio","multiple":false,"parentSelectors":["Product Wraper"],"regex":"","selector":"bdi","type":"SelectorText"},{"id":"Descuento","multiple":false,"parentSelectors":["Product Wraper"],"regex":"","selector":"ins bdi","type":"SelectorText"},{"id":"Description Wraper","multiple":false,"parentSelectors":["Product-Link"],"selector":"div.woocommerce-tabs","type":"SelectorElement"},{"id":"Descripción","multiple":false,"parentSelectors":["Description Wraper"],"regex":"","selector":"div.wc-tab-inner","type":"SelectorText"},{"extractAttribute":"src","id":"Tags","parentSelectors":["Description Wraper"],"selector":"img","type":"SelectorGroup"}]}

Sorry for the site content...

The best I've managed so far is to paginate on the next button and use an Image selector. It works and gives me all of the images in the highest resolution, but in separate rows, so the CSV file ends up with multiple rows that repeat the same product information with a different image each. I don't know whether a Group selector can be used in this case, or whether the output can be post-processed to merge all the links into a single cell and get rid of the duplicated information.

{"_id":"Prueba","startUrl":["https://juguetesparaadultos.com.mx/tienda/page/[1-66]/"],"selectors":[{"id":"Product-Link","linkType":"linkFromHref","multiple":true,"parentSelectors":["_root"],"selector":".wd-entities-title a","type":"SelectorLink"},{"id":"Product Wraper","multiple":false,"parentSelectors":["Product-Link"],"selector":"div.summary-inner","type":"SelectorElement"},{"id":"Nombre","multiple":false,"parentSelectors":["Product Wraper"],"regex":"","selector":"h1","type":"SelectorText"},{"id":"SKU","multiple":false,"parentSelectors":["Product Wraper"],"regex":"","selector":"span.sku","type":"SelectorText"},{"id":"Categoría","multiple":false,"parentSelectors":["Product Wraper"],"regex":"","selector":".posted_in a","type":"SelectorText"},{"id":"Precio","multiple":false,"parentSelectors":["Product Wraper"],"regex":"","selector":"bdi","type":"SelectorText"},{"id":"Descuento","multiple":false,"parentSelectors":["Product Wraper"],"regex":"","selector":"ins bdi","type":"SelectorText"},{"id":"Description Wraper","multiple":false,"parentSelectors":["Product-Link"],"selector":"div.woocommerce-tabs","type":"SelectorElement"},{"id":"Descripción","multiple":false,"parentSelectors":["Description Wraper"],"regex":"","selector":"div.wc-tab-inner","type":"SelectorText"},{"extractAttribute":"src","id":"Tags","parentSelectors":["Description Wraper"],"selector":"img","type":"SelectorGroup"},{"id":"Pagination","paginationType":"clickMore","parentSelectors":["Product-Link","Pagination"],"selector":".wd-hover-1.wd-custom-style .wd-next div","type":"SelectorPagination"},{"id":"Img","multiple":false,"parentSelectors":["Pagination"],"selector":".wd-active .woocommerce-product-gallery__image img","type":"SelectorImage"}]}

For anyone with the same problem of each picture landing in a separate row, which leaves the rest of the information duplicated, triplicated and so on, I found a very easy workaround with post-processing. Open your CSV in Google Sheets, install the Power Tools extension (which is free) and go to Extensions > Power Tools > Tools > Merge & Combine. A menu will open on the right side of the screen; press Combine Duplicate Rows. In the pop-up window, select your data range (usually the whole sheet), then select the columns that have duplicate records, then choose the columns you want to merge (e.g. the column with all your images). It will ask you for an action: pick Merge Values, choose how you want the values separated (space, line, comma), and then finish. There's a video tutorial from the app's developer if you still have doubts.
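
If you would rather not do this by hand, the same merge can be scripted. Below is a minimal Python sketch of the idea, not a tested recipe for this site: it assumes the export has a `SKU` column that identifies each product and an `Img-src` column holding one image URL per row. Both column names and the sample rows are hypothetical, so adjust them to whatever your CSV actually contains.

```python
import csv
import io


def merge_image_rows(rows, key_column, image_column, separator=" "):
    """Collapse rows that share key_column into one row per product.

    The first row seen for each key supplies the non-image fields;
    image URLs from every matching row are joined into a single cell.
    """
    merged = {}
    for row in rows:
        key = row[key_column]
        if key not in merged:
            merged[key] = dict(row)
            merged[key][image_column] = []
        url = row.get(image_column, "")
        # Skip empty cells and URLs already collected for this product.
        if url and url not in merged[key][image_column]:
            merged[key][image_column].append(url)

    result = []
    for row in merged.values():
        row[image_column] = separator.join(row[image_column])
        result.append(row)
    return result


# Hypothetical sample standing in for the exported CSV file.
SAMPLE_CSV = """SKU,Nombre,Img-src
A1,Producto uno,https://example.com/a1-1.jpg
A1,Producto uno,https://example.com/a1-2.jpg
B2,Producto dos,https://example.com/b2-1.jpg
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
merged = merge_image_rows(rows, key_column="SKU", image_column="Img-src")
for row in merged:
    print(row["SKU"], "->", row["Img-src"])
```

To run it on a real export, replace `io.StringIO(SAMPLE_CSV)` with an opened file and write `merged` back out with `csv.DictWriter`.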

Ok, it looks like you have found a solution :+1:

Anyway, below you can see a reference on how to extract the images in one cell with the Grouped selector:

{"_id":"Prueba","startUrl":["https://juguetesparaadultos.com.mx/tienda/page/[1-66]/"],"selectors":[{"id":"Product-Link","linkType":"linkFromHref","multiple":true,"parentSelectors":["_root"],"selector":".wd-entities-title a","type":"SelectorLink"},{"id":"Product Wraper","multiple":false,"parentSelectors":["Product-Link"],"selector":"div.summary-inner","type":"SelectorElement"},{"id":"Nombre","multiple":false,"parentSelectors":["Product Wraper"],"regex":"","selector":"h1","type":"SelectorText"},{"id":"SKU","multiple":false,"parentSelectors":["Product Wraper"],"regex":"","selector":"span.sku","type":"SelectorText"},{"id":"Categoría","multiple":false,"parentSelectors":["Product Wraper"],"regex":"","selector":".posted_in a","type":"SelectorText"},{"id":"Precio","multiple":false,"parentSelectors":["Product Wraper"],"regex":"","selector":"bdi","type":"SelectorText"},{"id":"Descuento","multiple":false,"parentSelectors":["Product Wraper"],"regex":"","selector":"ins bdi","type":"SelectorText"},{"id":"Description Wraper","multiple":false,"parentSelectors":["Product-Link"],"selector":"div.woocommerce-tabs","type":"SelectorElement"},{"id":"Descripción","multiple":false,"parentSelectors":["Description Wraper"],"regex":"","selector":"div.wc-tab-inner","type":"SelectorText"},{"extractAttribute":"src","id":"Tags","parentSelectors":["Description Wraper"],"selector":"img","type":"SelectorGroup"},{"id":"Pagination","paginationType":"clickMore","parentSelectors":["Product-Link","Pagination"],"selector":".wd-hover-1.wd-custom-style .wd-next div","type":"SelectorPagination"},{"id":"Img","multiple":false,"parentSelectors":["Pagination"],"selector":".wd-active .woocommerce-product-gallery__image img","type":"SelectorImage"},{"extractAttribute":"href","id":"images","parentSelectors":["Product-Link"],"selector":".wd-carousel-wrap .wd-carousel-item figure a","type":"SelectorGroup"}]}
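
The `images` cell from a Grouped selector holds all the links for a product in one value. If you need them as a plain delimited list afterwards, something like the Python sketch below might help. It assumes the cell is a JSON array of objects with one link per object; the `images-href` key in the sample is a guess at the exported name, which is why the function reads every string value instead of relying on a particular key. Check it against your own export before using it.

```python
import json


def grouped_cell_to_urls(cell):
    """Return every non-empty string value from a JSON array of objects."""
    if not cell:
        return []
    records = json.loads(cell)
    urls = []
    for record in records:
        for value in record.values():
            if isinstance(value, str) and value:
                urls.append(value)
    return urls


# Hypothetical cell contents; the key name is an assumption.
sample_cell = (
    '[{"images-href": "https://example.com/1.jpg"}, '
    '{"images-href": "https://example.com/2.jpg"}]'
)
urls = grouped_cell_to_urls(sample_cell)
print(", ".join(urls))
```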

Thank you very much for taking the time to do it, Jan. I'll give it a try too.

I've tried it already. Yours is the better choice, of course, because it makes the scrape faster by not having to navigate through the carousel. Thank you very much; that's what I was trying to do, but for some reason I could never select the carousel correctly.
