Valid Regex doesn't work

I'm scraping the "time" from a website.
The scraped data (time) is 09:00 - 10:00
So the start time is 9 and the end time is 10.
To get only the start time I use this regex successfully: "[0-9]+:[0-9]+"
To get only the end time I put a space in front of the regex: " [0-9]+:[0-9]+" OR "- [0-9]+:[0-9]+", but neither works in the scraper.
This regex is valid; I tested it on https://regex101.com/
Does anyone know a way to make this work?

Url: https://www.muziekgebouw.nl/agenda/8308/Mendelssohn_op_zijn_best/Nederlands_Kamerorkest/

Sitemap:
{"_id":"muziekgebouw","startUrl":["https://www.muziekgebouw.nl/agenda/?p=[1-20]"],"selectors":[{"id":"Evenement","type":"SelectorLink","parentSelectors":["_root"],"selector":"a.desc","multiple":true,"delay":0},{"id":"Titel","type":"SelectorText","parentSelectors":["Evenement"],"selector":"h1","multiple":false,"regex":"","delay":0},{"id":"Subtitel","type":"SelectorText","parentSelectors":["Evenement"],"selector":".descWrapper div.subtitle","multiple":false,"regex":"","delay":0},{"id":"Genre/label 1","type":"SelectorText","parentSelectors":["Evenement"],"selector":".genres ul li:nth-of-type(1) a","multiple":false,"regex":"","delay":0},{"id":"Genre/label 2","type":"SelectorText","parentSelectors":["Evenement"],"selector":"li:nth-of-type(2) a.genreBtn","multiple":false,"regex":"","delay":0},{"id":"Omschrijving 1","type":"SelectorText","parentSelectors":["Evenement"],"selector":".desc1 p","multiple":false,"regex":"","delay":0},{"id":"Omschrijving 2","type":"SelectorText","parentSelectors":["Evenement"],"selector":"Placeholder","multiple":false,"regex":"","delay":0},{"id":"Omschrijving 3","type":"SelectorText","parentSelectors":["Evenement"],"selector":"Placeholder","multiple":false,"regex":"","delay":0},{"id":"Omschrijving 4","type":"SelectorText","parentSelectors":["Evenement"],"selector":"Placeholder","multiple":false,"regex":"","delay":0},{"id":"Locatie 1","type":"SelectorText","parentSelectors":["Evenement"],"selector":"Placeholder","multiple":false,"regex":"","delay":0},{"id":"Locatie 2","type":"SelectorText","parentSelectors":["Evenement"],"selector":"Placeholder","multiple":false,"regex":"","delay":0},{"id":"Locatie 3","type":"SelectorText","parentSelectors":["Evenement"],"selector":"Placeholder","multiple":false,"regex":"","delay":0},{"id":"Datum","type":"SelectorText","parentSelectors":["Evenement"],"selector":".dateTime div.date","multiple":false,"regex":"","delay":0},{"id":"Tijd van","type":"SelectorText","parentSelectors":["Evenement"],"selector":".dateTime div.time","multiple":false,"regex":"[0-9]+:[0-9]+","delay":0},{"id":"Tijd tot","type":"SelectorText","parentSelectors":["Evenement"],"selector":".dateTime div.time","multiple":false,"regex":" [0-9]+:[0-9]+","delay":0}]}

What I discovered is that the regex doesn't seem to accept a space. When I use "[0-9]+:[0-9]+ " (space at the end), I get null in return. Without the space I get 09:00. Does anyone have a solution for this?

That is untrue. Regexes definitely do handle spaces. If you look at the page source you may notice the times are separated by tabs, not spaces. It is hard to see in the browser; I had to paste the element into Notepad++ to see the tabs. That is why your regex does not match. You can try:

\t\d+:\d+

\t is the tab character
\d means any digit, same as [0-9]
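
To illustrate the point about tabs, here is a minimal sketch in Python. The sample string is an assumption (tabs on both sides of the dash, based on the description above), and Python's `re` module stands in for whatever regex engine the scraper uses, so behaviour there may differ in details.

```python
import re

# Assumed sample text: tabs (not spaces) around the dash.
scraped = "09:00\t-\t10:00"

# A literal space in the pattern cannot match a tab, so this finds nothing.
space_match = re.search(r" [0-9]+:[0-9]+", scraped)

# \t matches the tab in front of the end time.
tab_match = re.search(r"\t\d+:\d+", scraped)

# \s matches any whitespace, so it works for both spaces and tabs.
any_ws_match = re.search(r"\s\d+:\d+", scraped)

# The match includes the leading tab; strip it to get the bare time.
end_time = tab_match.group().strip()

print(space_match)
print(repr(tab_match.group()))
print(end_time)
```

Using `\s` instead of `\t` is a slightly more forgiving choice if the page ever switches between spaces and tabs.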


I stand corrected. Thanks a lot for your feedback, you helped me out a lot!!

I do have another regex question. I'm trying to get the image URL from a website.

html= mobile-image=data-mobile-image="/media/3748/proefles-met-joël-delamar-theater.jpg?anchor=cente:

My regex is: media[a-zA-Z/0-9-._]+
Desired result: /media/3748/proefles-met-joël-delamar-theater.jpg
Actual result: media/3748/proefles-met-jo
Reason: the character class doesn't match the ë.
I got it working on regex101.com with the following: media.[0-9]+.[\p{L}-.]+
\p{L} = all Unicode letters. But this doesn't seem to work in webscraper.io

So my question is actually: how do I write a regex that matches the Unicode letter category?

Hope you can help me out.

Thanks in advance

{"_id":"delamar1","startUrl":["https://delamar.nl/voorstellingen/joel-broekaert/"],"selectors":[{"id":"foto","type":"SelectorHTML","parentSelectors":["_root"],"selector":"div.details-header","multiple":false,"regex":"media[a-zA-Z/0-9-._]+","delay":0}]}

Try:
media.+?\.jpg
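
A small sketch of why this works, again in Python as a stand-in for the scraper's regex engine. The `html` string is a trimmed copy of the snippet from the question; the variable names are only for illustration. Note that Python's `re` module does not support `\p{L}` at all, and in JavaScript regexes `\p{L}` is only recognised when the `u` flag is set, so if webscraper.io runs the pattern without that flag (an assumption, not something confirmed here) that could explain why it fails there.

```python
import re

# Sample attribute text, trimmed from the snippet in the question.
html = 'data-mobile-image="/media/3748/proefles-met-joël-delamar-theater.jpg?anchor='

# ASCII-only character class: stops at the first non-ASCII letter (ë).
ascii_only = re.search(r"media[a-zA-Z/0-9-._]+", html).group()

# Lazy match: take any characters up to the first ".jpg".
lazy = re.search(r"media.+?\.jpg", html).group()

# Same idea, but including the leading slash from the desired result.
with_slash = re.search(r"/media.+?\.jpg", html).group()

# Alternative: \w is Unicode-aware for Python str patterns, so it matches ë.
unicode_words = re.search(r"/media[\w/.-]+", html).group()

print(ascii_only)
print(lazy)
print(with_slash)
print(unicode_words)
```

The lazy pattern avoids naming the allowed characters at all, which sidesteps the Unicode question, but it only finds `.jpg` images; other extensions would need their own alternative in the pattern.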
