Regex noob and am baffled

chris · January 26, 2018, 5:22pm

I am scraping just fine (great tool), but I spend a lot of time afterwards sanitizing the data and throwing some of it away using find and replace in a spreadsheet or CSVed.

One field I collect information from contains many newlines (EOL, CR, etc.).

I only need the first 15 characters from that field anyway.

I have spent all afternoon trying different regular expressions to throw away or ignore everything from character 16 onwards, to ignore non-alphanumeric data such as carriage returns, or to select only letters and numbers, and I have failed miserably.

I've looked at dozens of regex examples for JavaScript, Perl, PHP etc., but they are often trying to do far more than I need and I can't get them to work. It feels like I'm just missing a quote or a bracket somewhere.

I'd be grateful for a few pointers.

KristapsWS · January 29, 2018, 2:22pm

Have you tried this: .{15} ?
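To illustrate what .{15} does, here is a rough Python sketch (Python is only used for illustration, and the sample field value and function names are made up). By default "." does not match a newline, so .{15} picks up the first run of 15 characters that sit on a single line. If the first line of the field is shorter than 15 characters, that run comes from a later line, so the second function, which collapses whitespace first, is an alternative to consider rather than a claim about how the scraper behaves.

```python
import re


def first_15_on_a_line(text: str) -> str:
    """Return the first run of 15 characters found on a single line, or ""."""
    # "." matches any character except "\n", so the match never crosses a newline.
    match = re.search(r".{15}", text)
    return match.group(0) if match else ""


def first_15_cleaned(text: str) -> str:
    """Collapse every whitespace run (spaces, CR, LF) to one space, then keep 15 characters."""
    return re.sub(r"\s+", " ", text).strip()[:15]


# Hypothetical field value with stray carriage returns and newlines.
field = "Acme Widget Deluxe\r\n\r\n  In stock\nShips today"

print(first_15_on_a_line(field))  # Acme Widget Del
print(first_15_cleaned(field))    # Acme Widget Del
```

The two functions agree when the first line is at least 15 characters long. When it is shorter, the first one skips ahead to a longer line, while the second keeps text from the very start of the field.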

chris · January 29, 2018, 3:57pm

Thanks for replying and LOL, no, I ended up doing this:
[a-zA-Z0-9]..........................

Approx 15 dots

I've been making noob mistakes with wildcards (I'm used to *).

The more examples I see, the better I understand it. I can follow your example above, and I'm going to change my regex to that.

The field I'm extracting data from typically has hundreds of rows with values like this:
Views: 17 (12 Unique)

The information that I really want is the two figures.

After downloading my CSV I do a few search-and-replace passes:

I change the ( to a comma,

and Unique) and Views: get deleted altogether (searched and replaced with nothing).

For the example above I end up with 17,12 - I can split that column easily in Google Sheets.

I'm sure there's probably a regular expression that would take care of most or all of that at the data collection stage.

Any pointers on that?

KristapsWS · January 30, 2018, 10:41am

Select each row if it is possible and try this regex: \s\d+\s\(\d+ .
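A rough breakdown of that pattern, sketched in Python purely for illustration. This assumes the trailing " ." above is sentence punctuation and not part of the regex, and the function name and the capture-group variant are my own additions, not something the scraper is confirmed to support.

```python
import re

row = "Views: 17 (12 Unique)"

# The suggested pattern, piece by piece:
#   \s    one whitespace character (the space after "Views:")
#   \d+   one or more digits ("17")
#   \s    one whitespace character
#   \(    a literal "(" (escaped, because "(" normally opens a group)
#   \d+   one or more digits ("12")
suggested = re.search(r"\s\d+\s\(\d+", row)
matched_text = suggested.group(0) if suggested else ""
print(repr(matched_text))  # ' 17 (12'


def views_and_unique(text: str) -> str:
    """Hypothetical helper: capture both numbers and join them with a comma."""
    # Parentheses without a backslash create capture groups around each number.
    m = re.search(r"(\d+)\s\((\d+)", text)
    return f"{m.group(1)},{m.group(2)}" if m else ""


print(views_and_unique(row))  # 17,12
```

The suggested pattern drops "Views:" and "Unique)" but still leaves the "(" between the two figures, so one search and replace remains. The capture-group version only helps if the tool lets you read individual groups, which is an assumption here.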

chris · January 30, 2018, 2:46pm

Thanks again. That works perfectly. Could you explain it or is that too much to write?