Building link list with variables

infrasignal · November 1, 2022, 4:08pm

Hello there and thanks for creating an amazingly useful tool!
I've been able to scrape the main catalog listing page, but can't drill down to product detail pages where I need to grab SKU and inventory.
I've looked at other examples here, but got stuck implementing for my scenario.
It seems my list of links is JavaScript-based, as every anchor tag has a javascript:void(0) reference.

Start page: Home (required in order to load js product catalog)
Then Shop

After manually navigating to a product detail page, the resulting URL is formatted like:
https://my.tupperware.ca/chafran/ProductDetails?ID=10153818000_FRPCA1031

I determined that the ProductDetails ID value is tucked away in the img alt text, but it needs to be concatenated somehow to mirror the URL pattern above.
I'm not sure how to use this regex in the selector definition:
\d+_FRPCA1031

Any insights would be greatly appreciated.
Again, many thanks for maintaining this utility!

Sitemap:
{"_id":"pwsca","startUrl":["https://my.tupperware.ca/chafran/"],"selectors":[{"id":"click SHOP","multiple":false,"parentSelectors":["_root"],"selector":"a[data-megamenu='PWSOurProducts']","type":"SelectorLink"},{"delay":2000,"elementLimit":500,"id":"scroll down to get entire catalog","multiple":true,"parentSelectors":["click SHOP"],"selector":"li.show-titles","type":"SelectorElementScroll"},{"id":"title","multiple":true,"parentSelectors":["scroll down to get entire catalog"],"regex":"","selector":"a > div.product-title","type":"SelectorText"},{"id":"price","multiple":true,"parentSelectors":["scroll down to get entire catalog"],"regex":"","selector":"a > div.product-price","type":"SelectorText"},{"extractAttribute":"alt","id":"photolink","multiple":true,"parentSelectors":["scroll down to get entire catalog"],"selector":"img","type":"SelectorElementAttribute"}]}

ViestursWS · November 2, 2022, 1:44pm

@infrasignal Hi. The most viable way to achieve this appears to be extracting the image 'alt' attribute and then post-processing the extracted data: take the identifier that follows the first dash and prepend https://my.tupperware.ca/chafran/ProductDetails?ID= to each one, using the parser feature in Web Scraper Cloud. You can then download these URLs and use them as unique start URLs for a new sitemap via the 'Bulk Start URL Import' feature. Web Scraper Cloud handles up to 20,000 start URLs.

Learn more: Parser | Web Scraper Documentation
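
For anyone who would rather do the same post-processing offline on an exported result, here is a minimal Python sketch of the idea. It is not the Web Scraper Cloud parser itself. The thread does not show the actual alt text, so the sample format ("Example Bowl - 10153818000_FRPCA1031") is an assumption, and the function names detail_url_from_alt and build_start_urls are hypothetical. The regex is the one from the original post.

```python
import re

# Base URL taken from the thread; the identifier is appended to it.
BASE_URL = "https://my.tupperware.ca/chafran/ProductDetails?ID="

# Regex from the original post. It hardcodes the FRPCA1031 suffix,
# so adjust it if other suffixes turn up in the catalog.
ID_PATTERN = re.compile(r"\d+_FRPCA1031")


def detail_url_from_alt(alt_text):
    """Return the product detail URL for one alt text, or None.

    Assumes (hypothetically) that the alt text looks like
    "<product name> - <identifier>".
    """
    # Keep only what follows the first dash, as the reply suggests.
    _, dash, tail = alt_text.partition("-")
    if not dash:
        return None
    match = ID_PATTERN.search(tail)
    if match is None:
        return None
    return BASE_URL + match.group(0)


def build_start_urls(alt_texts):
    """Build a de-duplicated, order-preserving list of detail URLs."""
    seen = set()
    urls = []
    for alt_text in alt_texts:
        url = detail_url_from_alt(alt_text)
        if url is not None and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


if __name__ == "__main__":
    # Hypothetical sample values standing in for the scraped alt column.
    samples = [
        "Example Bowl - 10153818000_FRPCA1031",
        "Example Bowl - 10153818000_FRPCA1031",
        "Banner image",
    ]
    for url in build_start_urls(samples):
        print(url)
```

The resulting list can then be saved and used as the start URLs for a second sitemap. Alt texts without a dash or without a matching identifier are skipped rather than guessed at.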

infrasignal · November 3, 2022, 5:42pm

Thanks for your timely reply. In the end, I found the original JSON feed from which I can pull the catalog data without scraping.