Element Click for "Load More" - Loads all Products But Will Not Select Products

Hello all,

I am trying to scrape from the below site and category:

Url: https://samkoandmikotoywarehouse.com/product-category/toys/

Sitemap:
{"_id":"samko_all","startUrl":["https://samkoandmikotoywarehouse.com/product-category/toys/"],"selectors":[{"id":"load_more","type":"SelectorElementClick","parentSelectors":["_root"],"selector":"li.col-md-5ths","multiple":true,"delay":"2000","clickElementSelector":"nav.woocommerce-pagination a.button","clickType":"clickMore","discardInitialElements":false,"clickElementUniquenessType":"uniqueText"},{"id":"product","type":"SelectorLink","parentSelectors":["load_more"],"selector":"a","multiple":false,"delay":"2000"},{"id":"upc","type":"SelectorText","parentSelectors":["product"],"selector":"div.product_meta","multiple":false,"regex":"","delay":0}]}

This page displays products, with more appearing each time "Load More" is clicked. I am able to use an Element Click selector to successfully click "Load More" until all products are shown, but my child link selector will not click on the products. I need to go into the product pages to get the UPC, which I'm also having trouble selecting on its own; for now I am grabbing additional data and cleaning it up in Excel.

I've seen similar example sites like this:

Url: https://www.cablesandsensors.com

That have seemingly similar sitemaps and needs (based on the selector graph) that work fine:

Sitemap:
{"_id":"cables_and_sensors","startUrl":"https://www.cablesandsensors.com","selectors":[{"id":"load","type":"SelectorElementClick","parentSelectors":["category"],"selector":"div.collection-item","multiple":true,"delay":"","clickElementSelector":"button.btn.btn-huge.btn-light","clickType":"clickMore","discardInitialElements":false,"clickElementUniquenessType":"uniqueText"},{"id":"ItemTitle","type":"SelectorText","parentSelectors":["item"],"selector":"div.section.section-header","multiple":false,"regex":"","delay":""},{"id":"Price","type":"SelectorText","parentSelectors":["item"],"selector":"div.col-xs-8 span.money","multiple":false,"regex":"","delay":""},{"id":"CandS_PN","type":"SelectorText","parentSelectors":["item"],"selector":"span.variant-sku","multiple":false,"regex":"","delay":""},{"id":"OEM Part Number","type":"SelectorTable","parentSelectors":["item"],"selector":"div.main-content div.col-xs-12 div table.table","multiple":true,"columns":[{"header":"Manufacturer","name":"Manufacturer","extract":true},{"header":"OEM Part #","name":"OEM Part #","extract":true}],"delay":"","tableDataRowSelector":"tr:nth-of-type(n+2)","tableHeaderRowSelector":"tr:nth-of-type(1)"},{"id":"Compatibility","type":"SelectorTable","parentSelectors":["item"],"selector":"table#compatibility.table","multiple":true,"columns":[{"header":"Manufacturer","name":"Manufacturer","extract":true},{"header":"Model","name":"Model","extract":true}],"delay":"","tableDataRowSelector":"tr:nth-of-type(n+2)","tableHeaderRowSelector":"tr.table-heading:contains('Manufacturer')"},{"id":"category","type":"SelectorLink","parentSelectors":["_root"],"selector":"a.visible-lg-block","multiple":true,"delay":""},{"id":"item","type":"SelectorLink","parentSelectors":["load"],"selector":"a","multiple":false,"delay":""}]}

I've been stuck on this for a while - any insight would be greatly appreciated!

Give this a try.

{"_id":"forum-cablesandsensors-fix1","startUrl":"https://www.cablesandsensors.com","selectors":[{"id":"load","type":"SelectorElementClick","parentSelectors":["category"],"selector":"div.collection-item","multiple":true,"delay":"","clickElementSelector":".btn-huge","clickType":"clickMore","discardInitialElements":false,"clickElementUniquenessType":"uniqueCSSSelector"},{"id":"category","type":"SelectorLink","parentSelectors":["_root"],"selector":"a.visible-lg-block","multiple":true,"delay":""},{"id":"item","type":"SelectorLink","parentSelectors":["load"],"selector":"a","multiple":false,"delay":""},{"id":"Part","type":"SelectorText","parentSelectors":["item"],"selector":"div.section.section-header","multiple":false,"regex":"","delay":0},{"id":"part number","type":"SelectorText","parentSelectors":["item"],"selector":"span.variant-sku","multiple":false,"regex":"","delay":0},{"id":"Price","type":"SelectorText","parentSelectors":["item"],"selector":"div.col-xs-8 span.money","multiple":false,"regex":"","delay":0},{"id":"Variant title","type":"SelectorText","parentSelectors":["item"],"selector":"div.variant-title","multiple":false,"regex":"","delay":0}]}

I changed the click element selector and also set the uniqueness type to unique CSS selector (not entirely sure that's necessary, but it seems to work).

Hey - thanks for answering so quickly!

Apologies, I realize my post may have been misleading - it's the top sitemap I need help with; I provided the lower sitemap as an example of a similarly structured site whose selector graph has a similar design.

Ok - I have a partial solution for you.

In order to grab the UPC you need REGEX to section it out (to the best of my knowledge)

UPC = div.product_meta REGEX = \d{11}
SKU = span.sku (no regex needed)
Categories = div.product_meta REGEX = [^\t].+?$
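
A quick way to sanity-check the UPC pattern outside the scraper is a small Python sketch. The sample strings are hypothetical stand-ins for the div.product_meta text, and Python's re module is used only for illustration; Web Scraper applies the regex in the browser, though \d{11} should behave the same way there.

```python
import re

# Same pattern as the "upc" selector: the first run of 11 digits.
UPC_PATTERN = re.compile(r"\d{11}")


def extract_upc(meta_text):
    """Return the first 11-digit run in meta_text, or None if there is none."""
    match = UPC_PATTERN.search(meta_text)
    return match.group(0) if match else None


# Hypothetical product_meta text; the real pages may be laid out differently.
with_upc = "SKU: AB-123 UPC: 01234567890 Categories: Toys"
without_upc = "SKU: AB-123 Categories: Toys"

print(extract_upc(with_upc))     # the 11-digit run
print(extract_upc(without_upc))  # None, since no UPC is present
```

A page with no 11-digit run simply yields nothing for that field, which would match the empty UPC cells in the results.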

I had to remove all the delays because it was taking forever, but it seems like it is finally working.

I got 900+ products: https://docs.google.com/spreadsheets/d/1u9eQacTLZYe_otGJlX8i_goQNmJg7Ft2G5PJenohv5Q/edit?usp=sharing

{"_id":"forum-fix-samkordik","startUrl":["https://samkoandmikotoywarehouse.com/product-category/toys/"],"selectors":[{"id":"load_more","type":"SelectorElementClick","parentSelectors":["_root"],"selector":"li.col-md-5ths","multiple":true,"delay":"","clickElementSelector":"nav.woocommerce-pagination a.button","clickType":"clickMore","discardInitialElements":false,"clickElementUniquenessType":"uniqueText"},{"id":"product","type":"SelectorLink","parentSelectors":["load_more"],"selector":"div.prodholderbox div a","multiple":false,"delay":""},{"id":"upc","type":"SelectorText","parentSelectors":["product"],"selector":"div.product_meta:contains(\"UPC\")","multiple":false,"regex":"\\d{11}","delay":0},{"id":"Sku","type":"SelectorText","parentSelectors":["product"],"selector":"span.sku","multiple":false,"regex":"","delay":0},{"id":"Catagories","type":"SelectorText","parentSelectors":["product"],"selector":"div.product_meta","multiple":false,"regex":"[^\\t].+?$","delay":0},{"id":"Price","type":"SelectorText","parentSelectors":["product"],"selector":"div.price span.amount","multiple":false,"regex":"","delay":0}]}

Hey again,

At first run it appears to work! It does miss a couple UPCs here and there, but overall working very well. Thanks so much!

Was it just the delays you needed to remove to have the product link selector work after clicking the "Load More" until all products were loaded? I'll compare sitemaps to view all of your edits more in-depth so I can take away some learning from this.

Again - thanks for taking the time to help me out! Much appreciated.

So, a few changes:

  • I changed the link selector you used to click into products.
  • The regex I used to extract the UPC looks for an 11-digit number (\d{11}). If the UPC format changes, it will miss it.
  • I noticed they stopped using UPC codes at some point, and that's why we didn't get a bunch.
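
To see how sensitive a fixed-length pattern is, here is a rough Python sketch comparing the \d{11} pattern from the sitemap with a looser one. The FLEXIBLE pattern and the sample strings are assumptions made for illustration; nothing in this thread confirms which UPC lengths the site actually uses.

```python
import re

# Pattern from the sitemap: exactly 11 digits, anywhere in the text.
FIXED = re.compile(r"\d{11}")
# Hypothetical alternative: a whole digit run of 11 to 13 characters.
FLEXIBLE = re.compile(r"\b\d{11,13}\b")


def first_match(pattern, text):
    """Return the first match of pattern in text, or None."""
    match = pattern.search(text)
    return match.group(0) if match else None


# Hypothetical meta strings with codes of different lengths.
eleven = "UPC: 01234567890"
twelve = "UPC: 012345678905"
seven = "UPC: 1234567"

for text in (eleven, twelve, seven):
    print(text, "|", first_match(FIXED, text), "|", first_match(FLEXIBLE, text))
```

Note that the fixed pattern does not skip a 12-digit code; it silently returns only the first 11 digits, so spot-checking a few rows against the product pages is worthwhile.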

Hey,

Regarding the UPC codes - you're right, I should have spot-checked them first. I see that they are simply omitted on some product pages.

I took a look at comparing our selector graphs. I'm going to look more into the regex to fully understand that. Regarding the link selector changes you made to click into products, I'm just wondering why we need to select the product holder box after using Selector Element Click, when the regular link selector on the product's title link works when not using the element click as a parent? I'm wondering if there is a principle at play here I can internalize for future scraping.

Cheers!

I have no idea. I'm very good at troubleshooting, but I'm not knowledgeable about why it worked. @iconoclast can help you there. From my basic understanding, the element selector (any of them) defines the metaframe/container, and depending on that, the link selector you choose is different.
Trial and error also helps.
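
One way to picture the "container" idea is the sketch below. It is only an illustration under stated assumptions: the markup is invented (including the extra "Quick view" link), and it has not been confirmed that this is what happens on the actual site. It shows that a child selector is evaluated inside each parent element, and that with "multiple" set to false only the first match is kept, so a bare "a" can land on a different link than a more specific path.

```python
import xml.etree.ElementTree as ET

# Hypothetical listing markup; the real page structure may differ.
HYPOTHETICAL_LISTING = """
<ul>
  <li class="col-md-5ths">
    <a href="#quick-view">Quick view</a>
    <div class="prodholderbox"><div><a href="/product/toy-1/">Toy 1</a></div></div>
  </li>
  <li class="col-md-5ths">
    <a href="#quick-view">Quick view</a>
    <div class="prodholderbox"><div><a href="/product/toy-2/">Toy 2</a></div></div>
  </li>
</ul>
""".strip()


def first_link(parent, path):
    """Return the href of the first element under parent matching path."""
    link = parent.find(path)
    return None if link is None else link.get("href")


root = ET.fromstring(HYPOTHETICAL_LISTING)
# Each <li> plays the role of one element matched by the Element Click selector.
parents = root.findall("./li[@class='col-md-5ths']")

# Broad child selector, like "a": the first link of any kind inside each parent.
broad = [first_link(li, ".//a") for li in parents]
# Narrow child selector, like "div.prodholderbox div a".
narrow = [first_link(li, ".//div[@class='prodholderbox']/div/a") for li in parents]

print(broad)
print(narrow)
```

In this made-up markup the broad selector picks up the same non-product link in every parent, while the narrow one reaches each product page.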

Regarding regex - it's quite difficult, and Google/Stack Overflow are your friends, but it's very useful for trimming the results down to the data you want.