Scraped emails look funny

tope · March 21, 2021, 9:09pm

I am trying to scrape emails from a site, and have run into some problems.

I can copy and paste the emails and they look good. But when scraped, the site adds a lot of extra letters to the email address. Can I scrape the email as it looks on the site? Or do I need to use regex or something to decipher the content?

Big thanks for help!

ViestursWS · March 22, 2021, 6:55am

@tope Hi. Would you be able to show a screenshot or share the URL of this particular website?

tope · March 22, 2021, 7:12am

Hello viesturs, thanks for looking into this.

https://www.eniro.se/engelska+skolan+i+upplands+väsby+upplands+väsby/31401925/firma?page=1&query=förskolor'

You need to click the "Skicka mejl" link. It will show a popup that contains the email address at the top of the form.

ViestursWS · March 22, 2021, 1:18pm

@tope They look funny because the site doesn't want its information scraped.
That's why they have created 3 different classes with different algorithms to stop us from scraping.
For example, in this div the class is the one with z, and the letters of the e-mail are every 2nd letter of the text... When you reload the page, the algorithm changes and you have to predict it.
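
As a rough illustration of the "every 2nd letter" case only, here is a minimal Python sketch. The scrambled string, the filler letter and the starting offset are all made up for the example; the real site may start at a different position, and the scheme changes on reload, so this is not a general solution.

```python
def every_nth(scrambled: str, step: int = 2, offset: int = 0) -> str:
    """Keep every `step`-th character, starting at index `offset`."""
    return scrambled[offset::step]


# Hypothetical scrambled text: a filler letter "q" after each real character.
real = "info@example.com"
scrambled = "".join(ch + "q" for ch in real)

decoded = every_nth(scrambled)
print(decoded)
```

If the real letters sat in the odd positions instead, `every_nth(text, offset=1)` would be the call to try.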

leemeng · March 22, 2021, 3:22pm

There is definitely anti-scraper code on this site, but if you look at the source code (Ctrl-u), the unscrambled emails are buried within a <script> block. Try searching for "email": (with the quotes). In fact, all the other organizations and emails are already on the page too. It looks like JSON, so it would be very hard to extract with WS alone. Maybe with Python and the json module.

tope · March 22, 2021, 9:36pm

I used Outwit to catch them instead, as I'm more used to that scraper. But could you not make one text selector per class? Then you would get one column for each class and could use regex to clean each column.

leemeng · March 23, 2021, 3:15am

This will pick out the JSON block:

Type: HTML
Selector: head
Regex: (?<=PRELOADED_STATE__ \=).+}}}}

If you paste the block into a parser like https://jsonlint.com/ you'll see that it is a perfectly valid JSON block with a lot of info. Parsing JSON is a very common task in many programming languages like Python, C#, Java, JavaScript, etc. Many, many examples are available.
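
For the Python route, a minimal sketch using only the standard library could look like the one below. It assumes the page source assigns a JSON object right after `PRELOADED_STATE__ =`, as the regex above suggests. The sample HTML, the key layout and the helper names `extract_preloaded_state` and `find_values` are made up for illustration; the real structure on the site will differ.

```python
import json
import re


def extract_preloaded_state(html: str):
    """Decode the JSON value that follows 'PRELOADED_STATE__ =' in the page source."""
    match = re.search(r"PRELOADED_STATE__\s*=\s*", html)
    if match is None:
        raise ValueError("PRELOADED_STATE__ assignment not found")
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever follows it (e.g. the trailing ';</script>').
    state, _end = json.JSONDecoder().raw_decode(html, match.end())
    return state


def find_values(node, key):
    """Yield every value stored under `key`, at any depth of the parsed JSON."""
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                yield v
            yield from find_values(v, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_values(item, key)


# Hypothetical page source; in practice this would be the downloaded HTML.
sample_html = (
    '<head><script>window.__PRELOADED_STATE__ = {"companies": ['
    '{"name": "A", "email": "a@example.com"}, '
    '{"name": "B", "contact": {"email": "b@example.com"}}'
    ']};</script></head>'
)

state = extract_preloaded_state(sample_html)
emails = list(find_values(state, "email"))
print(emails)
```

Searching recursively for the "email" key avoids having to know the exact nesting of the block in advance.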