Extracting multiple regex matches from HTML?

Jaffo_LG · February 16, 2021, 6:47pm

I want to extract part of a JavaScript block, specifically a set of images listed as an array within the script. I have the right regex, but the preview only shows one result, and I want every match of the regex in the script. Is this possible? I saw something like it explained under "text extractor", but it was not clear to me, and the regex documentation mentions that ticking Multiple should do the trick, but it does not.

Has anyone here done this? What am I missing?

leemeng · February 18, 2021, 12:08am

Hard to diagnose without a URL, sitemap or HTML code.

Jaffo_LG · February 18, 2021, 5:53am

Right, let me show you what I have. Basically, I want to build a sort of catalog of my Amazon purchases. This is the sitemap:

{"_id":"amazonimport","startUrl":["https://www.amazon.com.mx/gp/your-account/order-history/ref=ppx_yo_dt_b_pagination_1_2?ie=UTF8&orderFilter=months-3&search=&startIndex=[0-45]0"],"selectors":[{"id":"detalles","type":"SelectorLink","parentSelectors":["_root"],"selector":".a-row .a-vertical a:nth-of-type(1)","multiple":true,"delay":0},{"id":"tittle","type":"SelectorLink","parentSelectors":["detalles"],"selector":".a-fixed-left-grid-col .a-row > a","multiple":true,"delay":0},{"id":"cost","type":"SelectorText","parentSelectors":["detalles"],"selector":"span.a-size-small.a-color-price","multiple":true,"regex":"","delay":0},{"id":"count","type":"SelectorText","parentSelectors":["detalles"],"selector":"span.item-view-qty","multiple":true,"regex":"","delay":0},{"id":"order","type":"SelectorText","parentSelectors":["detalles"],"selector":"bdi","multiple":false,"regex":"","delay":0},{"id":"currentprice","type":"SelectorText","parentSelectors":["tittle"],"selector":"span#price_inside_buybox","multiple":false,"regex":"","delay":0},{"id":"image_link","type":"SelectorImage","parentSelectors":["tittle"],"selector":"img.a-dynamic-image","multiple":true,"delay":0},{"id":"ASIN","type":"SelectorText","parentSelectors":["tittle"],"selector":"tr:contains('ASIN') td","multiple":false,"regex":"","delay":0},{"id":"descripcion","type":"SelectorText","parentSelectors":["tittle"],"selector":"div#productDescription","multiple":false,"regex":"","delay":0},{"id":"features","type":"SelectorText","parentSelectors":["tittle"],"selector":"div#feature-bullets","multiple":false,"regex":"","delay":0},{"id":"images_all","type":"SelectorHTML","parentSelectors":["tittle"],"selector":"div#leftCol","multiple":true,"regex":"https://images-na.ssl-images-amazon.com/images/I/.{10,20}.AC.jpg","delay":0}]}

I noticed that the images are stored in a JavaScript block, so I extract the div that declares it and then use a regex to pull out every entry that matches. The problem is that it only returns the first result. Is this possible?

leemeng · February 19, 2021, 4:31am

I don't think you can get it to work that way, because the Multiple option applies to WS selectors, not to regex. One possible workaround is to create a fixed number of scrapers, up to the maximum number of images you need. This might be 15? 20?
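To illustrate the difference between "first match" and "all matches", here is a minimal TypeScript sketch that runs outside the extension. It is not Web Scraper code: the sample text, the `images.example` host, and the helper names `firstMatch` and `allMatches` are all made up for the illustration. It only shows that a regex applied once returns a single hit, which is consistent with the one result you see in the preview.

```typescript
// Hypothetical sample standing in for the HTML/JS text of the scraped element.
const scriptText: string =
  'var imgs = ["https://images.example/I/AAAAAAAAAAAA.AC.jpg",' +
  '"https://images.example/I/BBBBBBBBBBBB.AC.jpg"];';

// Same shape as the regex in the sitemap above, with the literal dots escaped
// and an invented host name.
const imagePattern: string = "https://images\\.example/I/.{10,20}\\.AC\\.jpg";

// Applying the regex once yields only the first hit (or null).
function firstMatch(text: string, pattern: string): string | null {
  const m = new RegExp(pattern).exec(text);
  return m ? m[0] : null;
}

// Collecting every hit needs the global flag and a loop over exec().
function allMatches(text: string, pattern: string): string[] {
  const re = new RegExp(pattern, "g");
  const out: string[] = [];
  let m: RegExpExecArray | null;
  while ((m = re.exec(text)) !== null) {
    out.push(m[0]);
    // Avoid an infinite loop if the pattern ever matches an empty string.
    if (m[0].length === 0) re.lastIndex++;
  }
  return out;
}

console.log(firstMatch(scriptText, imagePattern)); // the first URL only
console.log(allMatches(scriptText, imagePattern)); // both URLs
```

Something like `allMatches` could be used to post-process exported HTML, if pulling the images out after the scrape is an option for you.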

In the example below, I am scraping 4 image URLs from a public Amazon page. Here I am just using the same selector for all the image URL scrapers, along with different regex to get Image 1, Image 2, etc. See if you can adapt it for your purposes. The regex used is quite complex and it would take me a long time to explain it all. Basically I am using "positive lookbehind" to identify an image in the sequence, based on the text which is behind it (preceding it):

{"_id":"amazon-images-test","startUrl":["https://www.amazon.com/AmazonBasics-Home-Safe-1-20-Cubic/dp/B078K4W8N9/"],"selectors":[{"id":"Product","type":"SelectorText","parentSelectors":["_root"],"selector":"span.product-title-word-break","multiple":false,"regex":"" },{"id":"Image 1","type":"SelectorHTML","parentSelectors":["_root"],"selector":".imageBlockRearch div.a-fixed-left-grid-inner","multiple":false,"regex":"(?<=data-a-dynamic-image[^h]+)https://images[^&]+" },{"id":"Image 2","type":"SelectorHTML","parentSelectors":["_root"],"selector":".imageBlockRearch div.a-fixed-left-grid-inner","multiple":false,"regex":"(?<=data-a-dynamic-image[^h]+https://images.+?)https://images[^&]+" },{"id":"Image 3","type":"SelectorHTML","parentSelectors":["_root"],"selector":".imageBlockRearch div.a-fixed-left-grid-inner","multiple":false,"regex":"(?<=data-a-dynamic-image[^h]+(https://images.+?){2})https://images[^&]+" },{"id":"Image 4","type":"SelectorHTML","parentSelectors":["_root"],"selector":".imageBlockRearch div.a-fixed-left-grid-inner","multiple":false,"regex":"(?<=data-a-dynamic-image[^h]+(https://images.+?){3})https://images[^&]+" }]}
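If you need 15 or 20 of these, writing each selector by hand gets tedious, so here is a rough TypeScript sketch that generates them. It follows the same lookbehind pattern as the sitemap above, using `{n-1}` as the repeat count for Image n (so Image 2 gets `{1}`, which is equivalent to the unrepeated form I wrote by hand). The helper names `nthImageRegex`, `extractNth`, and `imageSelectors`, the sample HTML, and the `images.example` host are made up for the illustration. Note that this needs a JavaScript engine with lookbehind support, and that it will not port directly to regex flavors that reject variable-length lookbehind.

```typescript
// Build the regex source for the n-th image URL (n starts at 1).
// The lookbehind skips n-1 earlier "https://images..." URLs inside the
// data-a-dynamic-image attribute before matching the next one.
function nthImageRegex(n: number): string {
  return (
    "(?<=data-a-dynamic-image[^h]+(https://images.+?){" +
    (n - 1) +
    "})https://images[^&]+"
  );
}

// Apply the n-th image regex once and return the URL, or null if absent.
function extractNth(html: string, n: number): string | null {
  const m = new RegExp(nthImageRegex(n)).exec(html);
  return m ? m[0] : null;
}

// Generate `count` selector entries shaped like the ones in the sitemap above.
function imageSelectors(count: number) {
  const selectors = [];
  for (let n = 1; n <= count; n++) {
    selectors.push({
      id: "Image " + n,
      type: "SelectorHTML",
      parentSelectors: ["_root"],
      selector: ".imageBlockRearch div.a-fixed-left-grid-inner",
      multiple: false,
      regex: nthImageRegex(n),
    });
  }
  return selectors;
}

// Hypothetical HTML fragment with three image URLs in the attribute.
const sampleHtml: string =
  '<img data-a-dynamic-image="{&quot;https://images.example/I/one.jpg&quot;:[1,1],' +
  "&quot;https://images.example/I/two.jpg&quot;:[2,2]," +
  '&quot;https://images.example/I/three.jpg&quot;:[3,3]}">';

console.log(extractNth(sampleHtml, 1)); // https://images.example/I/one.jpg
console.log(extractNth(sampleHtml, 3)); // https://images.example/I/three.jpg
console.log(extractNth(sampleHtml, 4)); // null, there is no fourth image

// Paste the generated entries into the "selectors" array of the sitemap.
console.log(JSON.stringify(imageSelectors(4)));
```

Selectors for images that do not exist on a given product should simply come back empty, so it is fine to generate more than most pages need.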

Jaffo_LG · February 19, 2021, 5:07am

Thanks! That should do it!