Regex parsing matching backwards

watilo · August 4, 2024, 11:01am

I'm trying to strip UTM parameters from links, so I'm using regex to remove the ? and everything after it from the URL.

My regex: \?.*

But instead of filtering out the string, the parser is only keeping things that match. How do I reverse it so it's filtering it the other way?

(In this example, there is only one string with UTM params, so it's only matching one record.)

don2010 · August 4, 2024, 2:42pm

Try it this way: .+?(?=/?)

watilo · August 5, 2024, 11:40am

Hmm, .+?(?=/?) just gives me the output of "h" on every row?
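
A note on why "h" comes back: in .+?(?=/?) the lookahead checks for an optional slash, which succeeds everywhere, so the lazy .+? stops after a single character. The pattern may have been meant as .+?(?=\?) with the backslash lost somewhere along the way, but that is only a guess. A minimal Python sketch of the difference (Python's re module stands in for whatever engine Web Scraper uses, and the sample URL is made up):

```python
import re

# Hypothetical sample URL, not taken from the thread.
url = "https://example.com/page?utm_source=newsletter"

# Posted pattern: lazy .+? takes one character, then (?=/?) looks for an
# optional slash. "Optional" means the lookahead can match nothing, so it
# always succeeds and the match ends after the first character.
posted = re.search(r".+?(?=/?)", url).group(0)

# Assumed intent: look ahead for a literal question mark instead.
escaped = re.search(r".+?(?=\?)", url).group(0)

# With no "?" in the input, the escaped version finds no match at all.
no_query = re.search(r".+?(?=\?)", "https://example.com/page")

print(posted)    # h
print(escaped)   # https://example.com/page
print(no_query)  # None
```

Even the escaped version returns nothing for URLs without a query string, so it would drop rows that have no UTM parameters.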

JanAp · August 5, 2024, 11:47am

Hi, you can try:

^[^?]*

watilo · August 5, 2024, 12:24pm

That one did it, thank you!

I don't have a huge handle on regex (and am mostly getting by with a little help from ChatGPT), but I understand what this is doing compared to mine. I guess my question going forward is... should I have expected my original syntax to work, or does Web Scraper just parse things in a different way?

JanAp · August 5, 2024, 12:45pm

Hey, no shame in using ChatGPT, that's how I did it.

My prompt: regex to match string until the first occurrence of ?

Your original regex matched the '?' and everything after it, and the Parser returns the matched part.
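
To make the "returns the matched part" point concrete, here is a rough Python sketch. The extract helper only imitates a parser that keeps the match and leaves the row empty when nothing matches; that behaviour is an assumption based on this thread, not Web Scraper's actual code, and the URLs are invented.

```python
import re

# Hypothetical rows: one with UTM parameters, one without.
urls = [
    "https://example.com/a?utm_source=x&utm_medium=y",
    "https://example.com/b",
]

def extract(pattern, text):
    """Return the matched text, or "" when nothing matches (assumed behaviour)."""
    m = re.search(pattern, text)
    return m.group(0) if m else ""

# Original pattern: keeps only the query string, and only where one exists.
original = [extract(r"\?.*", u) for u in urls]

# JanAp's pattern: keeps everything before the first "?".
keep_prefix = [extract(r"^[^?]*", u) for u in urls]

# For comparison, a replace operation with the original pattern removes the
# query string, which is what the original post was expecting.
removed = [re.sub(r"\?.*", "", u) for u in urls]

print(original)     # ['?utm_source=x&utm_medium=y', '']
print(keep_prefix)  # ['https://example.com/a', 'https://example.com/b']
print(removed)      # ['https://example.com/a', 'https://example.com/b']
```

So \?.* is a fine pattern for a find-and-replace step, but in a match-only field it selects exactly the part that was meant to be thrown away.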

leemeng · August 8, 2024, 3:53pm

This is the expected behavior: the regex here is a "match" or "find" operation, not a "replace". So you should think in terms of what you want to match, not what you need to remove. That is why JanAp's regex works. Personally, I would go with something like the one below, which only matches strings that start with http:

^http[^?]+
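
A quick sketch of how the two patterns differ on rows that are not URLs, again in Python as a stand-in and with made-up sample rows. The first_match helper is hypothetical and just returns the match or None.

```python
import re

# Hypothetical rows, including a non-URL value and an empty string.
rows = [
    "https://example.com/a?utm_source=x",
    "http://example.com/b",
    "n/a",
    "",
]

def first_match(pattern, text):
    """Return the matched text, or None when the pattern does not match."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

# ^[^?]* accepts any text up to the first "?", including non-URLs and "".
any_prefix = [first_match(r"^[^?]*", r) for r in rows]

# ^http[^?]+ requires the row to start with "http", so other rows get no match.
http_only = [first_match(r"^http[^?]+", r) for r in rows]

print(any_prefix)  # ['https://example.com/a', 'http://example.com/b', 'n/a', '']
print(http_only)   # ['https://example.com/a', 'http://example.com/b', None, None]
```

Which one is preferable depends on whether non-URL rows should pass through unchanged or come out empty.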