Regex works on shorter expressions but returns null when I add words before the capture group

LiterallyPolice · January 10, 2023, 5:44am

I want to scrape the whole element containing all the data I need, because the order of the data is not uniform, so individual selectors are not an option for me. So I used a regex to get a match and show it in the data preview, with the aim of using the same regex for all products.

The sitemap is:

{"_id":"i9-10900X","startUrl":["https://ark.intel.com/content/www/us/en/ark/products/198019/intel-core-i910900x-xseries-processor-19-25m-cache-3-70-ghz.html"],"selectors":[{"id":"base-freq","multiple":false,"parentSelectors":["_root"],"regex":"Processor\\sBase\\sFrequency\\s(\\d+.\\d+\\s?[MGgm]Hz)","selector":"div.active","type":"SelectorText"}]}

and the regex is:

Processor Base Frequency\s(\d+.\d+\s?[MGgm]Hz)

The expected output once scraped is 3.70 GHz, and the regex works well in regex101 (link below).

The actual output is null.

If I remove the words Processor Base Frequency from the regex in Web Scraper, I get a valid result of 4.50 GHz, but the value I need is the Processor Base Frequency.
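
One possible explanation, which is an assumption and not confirmed anywhere in this thread, is that the text extracted from div.active has a line break plus indentation between the label and the value, so a single \s cannot bridge the gap. The sketch below reproduces that situation in Python with a made-up sample string (the real page text is not shown here) and compares the original pattern with a more tolerant one that uses \s+ and an escaped dot. The names sample, strict, tolerant and bare are illustrative only.

```python
import re

# Hypothetical text, standing in for whatever div.active really yields.
# The label and its value are separated by a newline plus indentation.
sample = (
    "Max Turbo Frequency\n    4.50 GHz\n"
    "Processor Base Frequency\n    3.70 GHz\n"
)

# Pattern from the sitemap: exactly one whitespace character between words
# and before the number, and an unescaped dot.
strict = r"Processor\sBase\sFrequency\s(\d+.\d+\s?[MGgm]Hz)"

# Tolerant variant: one or more whitespace characters, and a literal dot.
tolerant = r"Processor\s+Base\s+Frequency\s+(\d+\.\d+\s?[MGgm]Hz)"

# Pattern with the label removed: it takes the first frequency in the text.
bare = r"(\d+.\d+\s?[MGgm]Hz)"

strict_match = re.search(strict, sample)
tolerant_match = re.search(tolerant, sample)
bare_match = re.search(bare, sample)

print(strict_match)             # no match on this sample
print(tolerant_match.group(1))  # value that follows the label
print(bare_match.group(1))      # first frequency found anywhere
```

If the data preview also returns a value with the \s+ version, extra whitespace was probably the issue; if it still returns null, the cause lies elsewhere, for example in what text the selector actually extracts.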

leemeng · August 9, 2023, 10:56am

This seems like an unreliable way to scrape. You could use the :contains selector so the row position will not matter. E.g.

{"_id":"i9-10900X","startUrl":["https://ark.intel.com/content/www/us/en/ark/products/198019/intel-core-i910900x-xseries-processor-19-25m-cache-3-70-ghz.html"],"selectors":[{"id":"Total Cores","parentSelectors":["_root"],"type":"SelectorText","selector":"li:contains('Total Cores') span[data-key='CoreCount']","multiple":false,"regex":""},{"id":"Processor Base Frequency","parentSelectors":["_root"],"type":"SelectorText","selector":"li:contains('Processor Base Frequency') span[data-key='ClockSpeed']","multiple":false,"regex":""}]}