How to Extract Email Addresses from Images When They're Loaded by AJAX and JavaScript?

Alejopena0 · April 24, 2024, 3:53am

Hello everyone,

I've encountered a challenging scenario while attempting to scrape email addresses for a large dataset of over 15,000 companies. The emails are not listed in plaintext in the HTML source code. Instead, they are displayed as an image, and when clicked, a JavaScript function makes an AJAX request that, I believe, fetches the email address.

Here’s what happens step-by-step:

A click on the image seems to trigger a JavaScript function.
No mailto: or direct email link is present in the source code.
The actual email link is possibly being retrieved by an AJAX call to a PHP script.
The email address is likely returned from the server in response to the AJAX request, but it's not visible in the HTML or the page's source code.

I have tried using WebScraper.io to simulate the click and capture the resulting AJAX request, but the data preview showed nothing, suggesting that the email is fetched and rendered dynamically.

Url: Ninoshka, S.a. | DirectorioDeCarga.com

Sitemap:
{"_id":"directorio-de-carga","startUrl":[""],"selectors":[{"id":"pagination","parentselectors":["_root","pagination"],"paginationtype":"auto","type":"selectorpagination","selector":".pagination_fg en DirectorioDeCarga.com a"},{"id":"element-card","parentSelectors":["Pagination"],"type":"SelectorElement","selector":"div.content","multiple":true},{"id":"Nombre-de-la-empresa","parentSelectors":["element-card"],"type":"SelectorText","selector":"strong","multiple":false,"regex":""},{"id":"Dirección-de-la-empresa","parentSelectors":["element-card"],"type":"SelectorText","selector":"p","multiple":false,"regex":""},{"id":"País","parentSelectors":["element-card"],"type":"SelectorText","selector":"h6","multiple":false,"regex":""},{"id":"Link-del-directorio","parentSelectors":["element-card"],"type":"SelectorLink","selector":".title a","multiple":false,"linkType":"linkFromHref"},{"id":"Website","parentSelectors":["Link-del-directorio"],"type":"SelectorText","selector":"strong span","multiple":false,"regex":""},{"id":"Telefono","parentSelectors":["Link-del-directorio"],"type":"SelectorText","selector":"li:nth-of-type(6) strong","multiple":false,"regex":"Teléfono\(s\):\s*\+\d{3}\s\d{4}-\d{4}"},{"id":"emailClick","parentSelectors":["Link-del-directorio"],"type":"SelectorElementClick","clickActionType":"real","clickElementSelector":"li:nth-of-type(3) label a","clickElementUniquenessType":"uniqueText","clickType":"clickOnce","delay":2000,"discardInitialElements":"do-not-discard","multiple":true,"selector":"li:nth-of-type(3) label a"},{"id":"emailExtract","parentSelectors":["emailClick"],"type":"SelectorElementAttribute","selector":"a","multiple":false,"extractAttribute":"href"}]}

JanAp · April 24, 2024, 9:12am

Hi,

Please try the below setup and let me know if it scrapes the e-mail.

{"_id":"directoriodecarga","startUrl":["https://directoriodecarga.com/empresas/ninoshka-ciudad-de-guatemala"],"selectors":[{"id":"link","linkType":"linkFromRedirect","multiple":false,"parentSelectors":["_root"],"selector":"li:nth-of-type(3) label a","type":"SelectorLink"},{"extractAttribute":"data-hovercard-id","id":"e-mail","multiple":false,"parentSelectors":["link"],"selector":"[data-hovercard-id]","type":"SelectorElementAttribute"}]}

Note that a browser-based e-mail client (such as Gmail) has to be set as the default handler for e-mail links.

Alejopena0 · April 24, 2024, 3:50pm

Thank you for your help, but it gives me the link, not the email.

I don't know what else to do.

I think I'm going to write some Python code. This is what GPT is telling me: "With Python, you would use the requests library to send a GET request to each of these PHP URLs and then parse the response to extract the email address. If the email address is returned directly in the response, it could be straightforward. However, if the email is loaded through JavaScript on the page after the PHP script is called, you may need to simulate a browser session using a tool like Selenium."
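
A minimal sketch of that idea, for anyone who wants to try it. It uses the standard library's urllib.request in place of the requests library mentioned in the quote, and it assumes the response body contains the address after a mailto: prefix (for example inside a meta refresh tag). That assumption has to be checked against a real response. The sample HTML, the helper names and the User-Agent value are illustrative only.

```python
import re
import urllib.request

# Capture whatever follows "mailto:" up to a quote, query string or whitespace.
MAILTO_RE = re.compile(r"mailto:([^\"'?&<>\s]+)", re.IGNORECASE)


def find_mailto(html: str):
    """Return the first address that follows 'mailto:' in the markup, or None."""
    match = MAILTO_RE.search(html)
    return match.group(1) if match else None


def fetch_email(url: str, timeout: float = 10.0):
    """Download one link page and look for a mailto: address in its body."""
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = response.read().decode("utf-8", errors="replace")
    return find_mailto(body)


# Hypothetical response body, used only to show the parsing step.
sample = '<meta http-equiv="refresh" content="0;url=mailto:info@example.com">'
print(find_mailto(sample))
```

fetch_email is not called here because it needs a real link URL from a company page. If the server answers with an HTTP redirect to mailto: instead of an HTML body, this approach would not work as written.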

JanAp · April 25, 2024, 5:57am

Can you describe what happens when you click on the icon for the e-mail? How does your browser handle the request?

JanAp · April 25, 2024, 9:29am

It works just fine if Gmail is set as the handler for the mailto: protocol.

leemeng · April 25, 2024, 10:55am

If you open the mail link directly, the email address is in the page's HTML and can be extracted with an Element attribute or HTML selector.

However, your browser is probably configured to launch Gmail or an email app when it opens such a link; this can be disabled. Google "disable mailto links Chrome".

Alejopena0 · April 25, 2024, 11:31pm

Wow, it works! Thank you so much! I'm sorry to bother you, I'm quite new to web scraping. When I add your sitemap to mine, it opens Gmail but it doesn't extract the email. Could you please help me by looking at my whole sitemap?

{"_id":"directorio-de-cargaentregforum02","startUrl":["https://directoriodecarga.com/directorio-de-empresas/?pag=1&sector=&pais=&clave="],"selectors":[{"id":"pagination","paginationType":"auto","parentSelectors":["_root","pagination"],"selector":".pagination_fg a","type":"SelectorPagination"},{"id":"element-card","multiple":true,"parentSelectors":["pagination"],"selector":"div.content","type":"SelectorElement"},{"id":"Name-of-the-company","multiple":false,"parentSelectors":["element-card"],"regex":"","selector":"strong","type":"SelectorText"},{"id":"Adress-of-the-company","multiple":false,"parentSelectors":["element-card"],"regex":"","selector":"p","type":"SelectorText"},{"id":"Country","multiple":false,"parentSelectors":["element-card"],"regex":"","selector":"h6","type":"SelectorText"},{"id":"Directory-link","linkType":"linkFromHref","multiple":false,"parentSelectors":["element-card"],"selector":".title a","type":"SelectorLink"},{"id":"Website","multiple":false,"parentSelectors":["Directory-link"],"regex":"","selector":"strong span","type":"SelectorText"},{"id":"Phone","multiple":false,"parentSelectors":["Directory-link"],"regex":"","selector":"li:nth-of-type(6) strong","type":"SelectorText"},{"id":"Link","linkType":"linkFromRedirect","multiple":false,"parentSelectors":["Directory-link"],"selector":"li:nth-of-type(3) label a","type":"SelectorLink"},{"extractAttribute":"[data-hovercard-id]","id":"e-mail","multiple":false,"parentSelectors":["Link"],"selector":"[data-hovercard-id]","type":"SelectorElementAttribute"}]}

I don't know why it doesn't return the email when it finishes. I literally just copied your sitemap into mine, but it doesn't work.

JanAp · April 26, 2024, 12:36pm

Hi,

Please try this setup:

{"_id":"directorio-de-cargaentregforum02","startUrl":["https://directoriodecarga.com/directorio-de-empresas/?pag=1&sector=&pais=&clave="],"selectors":[{"id":"pagination","paginationType":"auto","parentSelectors":["_root","pagination"],"selector":"a.active","type":"SelectorPagination"},{"id":"element-card","multiple":true,"parentSelectors":["pagination"],"selector":"div.content","type":"SelectorElement"},{"id":"Name-of-the-company","multiple":false,"parentSelectors":["element-card"],"regex":"","selector":"strong","type":"SelectorText"},{"id":"Adress-of-the-company","multiple":false,"parentSelectors":["element-card"],"regex":"","selector":"p","type":"SelectorText"},{"id":"Country","multiple":false,"parentSelectors":["element-card"],"regex":"","selector":"h6","type":"SelectorText"},{"id":"Directory-link","linkType":"linkFromHref","multiple":false,"parentSelectors":["element-card"],"selector":".title a","type":"SelectorLink"},{"id":"Website","multiple":false,"parentSelectors":["Directory-link"],"regex":"","selector":"strong span","type":"SelectorText"},{"id":"Phone","multiple":false,"parentSelectors":["Directory-link"],"regex":"","selector":"li:nth-of-type(6) strong","type":"SelectorText"},{"id":"Link","linkType":"linkFromHref","multiple":false,"parentSelectors":["Directory-link"],"selector":"li:nth-of-type(3) label a","type":"SelectorLink"},{"extractAttribute":"data-hovercard-id","id":"e-mail","multiple":false,"parentSelectors":["Link"],"selector":"[data-hovercard-id]","type":"SelectorElementAttribute"}]}

I have limited the pagination to the first page only, just to test the e-mail scraping.
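
Compared with the previous sitemap, this one changes three things: the pagination selector (".pagination_fg a" becomes "a.active"), the linkType of the "Link" selector ("linkFromRedirect" becomes "linkFromHref"), and the extractAttribute of the "e-mail" selector ("[data-hovercard-id]" becomes "data-hovercard-id", an attribute name without brackets). A small sketch for spotting such differences between two exported sitemaps follows. The function name is hypothetical, and the two JSON strings are trimmed to the relevant selectors.

```python
import json


def selector_diff(old_sitemap: str, new_sitemap: str) -> dict:
    """Return {selector_id: {field: (old_value, new_value)}} for changed fields."""
    old = {s["id"]: s for s in json.loads(old_sitemap)["selectors"]}
    new = {s["id"]: s for s in json.loads(new_sitemap)["selectors"]}
    changes = {}
    # Only selectors present in both sitemaps are compared.
    for sel_id in old.keys() & new.keys():
        fields = old[sel_id].keys() | new[sel_id].keys()
        changed = {
            field: (old[sel_id].get(field), new[sel_id].get(field))
            for field in fields
            if old[sel_id].get(field) != new[sel_id].get(field)
        }
        if changed:
            changes[sel_id] = changed
    return changes


# Trimmed copies of the two sitemaps, keeping only two of the selectors.
old = '{"selectors":[{"id":"Link","linkType":"linkFromRedirect"},{"id":"e-mail","extractAttribute":"[data-hovercard-id]"}]}'
new = '{"selectors":[{"id":"Link","linkType":"linkFromHref"},{"id":"e-mail","extractAttribute":"data-hovercard-id"}]}'
for sel_id, changed in sorted(selector_diff(old, new).items()):
    print(sel_id, changed)
```

Pasting the full exported sitemaps in place of the trimmed strings should list every selector whose settings differ.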

don2010 · April 30, 2024, 12:31pm

Here is a sitemap that returns the emails plus some unnecessary characters in the last column. Extracting clean emails from it in Excel is no problem.

{"_id":"Directorio-de-cargaentregforum02","startUrl":["https://directoriodecarga.com/directorio-de-empresas/?pag=1&sector=&pais=&clave="],"selectors":[{"id":"pagination","paginationType":"auto","parentSelectors":["_root","pagination"],"selector":"a.active","type":"SelectorPagination"},{"id":"element-card","multiple":true,"parentSelectors":["pagination"],"selector":"div.content","type":"SelectorElement"},{"id":"Name-of-the-company","multiple":false,"parentSelectors":["element-card"],"regex":"","selector":"strong","type":"SelectorText"},{"id":"Adress-of-the-company","multiple":false,"parentSelectors":["element-card"],"regex":"","selector":"p","type":"SelectorText"},{"id":"Country","multiple":false,"parentSelectors":["element-card"],"regex":"","selector":"h6","type":"SelectorText"},{"id":"Directory-link","linkType":"linkFromHref","multiple":false,"parentSelectors":["element-card"],"selector":".title a","type":"SelectorLink"},{"id":"Website","multiple":false,"parentSelectors":["Directory-link"],"regex":"","selector":"strong span","type":"SelectorText"},{"id":"Phone","multiple":false,"parentSelectors":["Directory-link"],"regex":"","selector":"li:nth-of-type(6) strong","type":"SelectorText"},{"id":"Link","linkType":"linkFromHref","multiple":false,"parentSelectors":["Directory-link"],"selector":"li:nth-of-type(3) label a","type":"SelectorLink"},{"extractAttribute":"content","id":"e-mail","multiple":false,"parentSelectors":["Link"],"selector":"meta[http-equiv=\"refresh\"]","type":"SelectorElementAttribute"}]}

Alejopena0 · May 7, 2024, 11:35pm

Hi Jan, I've been trying this setup and it works perfectly when the pagination covers only one page. But when I add the pagination for all the pages, it only gives me the emails for some of the companies: out of around 9,000 companies, I get emails for only around 2,000. I don't know why.

Alejopena0 · May 8, 2024, 1:18am

While it's scraping, for most of the companies it only takes the text info and doesn't open Gmail to get the email.
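
To put numbers on this, the exported CSV can be checked for rows with an empty email field. This is a rough sketch: it assumes the export has a column named "e-mail" (the selector id used in the sitemap), and the sample data is made up.

```python
import csv
import io


def email_coverage(csv_text: str, column: str = "e-mail"):
    """Return (total_rows, rows_without_email) for an exported CSV."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    missing = [row for row in rows if not (row.get(column) or "").strip()]
    return len(rows), len(missing)


# Made-up export with one row that has no e-mail value.
sample = "Name-of-the-company,e-mail\nAcme,info@example.com\nBeta,\n"
total, missing = email_coverage(sample)
print(f"{missing} of {total} rows have no e-mail")
```

Opening a few of the rows reported as missing in the browser shows whether the listing really has no email or whether the scraper skipped it.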

JanAp · May 8, 2024, 12:29pm

Hi,

I have run a test with a page range in the start URL, covering pages 1-10:

{"_id":"directorio-de-cargaentregforum02","startUrl":["https://directoriodecarga.com/directorio-de-empresas/?pag=[1-10]&sector=&pais=&clave="],"selectors":[{"id":"element-card","multiple":true,"parentSelectors":["_root"],"selector":"div.content","type":"SelectorElement"},{"id":"Name-of-the-company","multiple":false,"parentSelectors":["element-card"],"regex":"","selector":"strong","type":"SelectorText"},{"id":"Adress-of-the-company","multiple":false,"parentSelectors":["element-card"],"regex":"","selector":"p","type":"SelectorText"},{"id":"Country","multiple":false,"parentSelectors":["element-card"],"regex":"","selector":"h6","type":"SelectorText"},{"id":"Directory-link","linkType":"linkFromHref","multiple":false,"parentSelectors":["element-card"],"selector":".title a","type":"SelectorLink"},{"id":"Website","multiple":false,"parentSelectors":["Directory-link"],"regex":"","selector":"strong span","type":"SelectorText"},{"id":"Phone","multiple":false,"parentSelectors":["Directory-link"],"regex":"","selector":"li:nth-of-type(6) strong","type":"SelectorText"},{"id":"Link","linkType":"linkFromHref","multiple":false,"parentSelectors":["Directory-link"],"selector":"li:nth-of-type(3) label a","type":"SelectorLink"},{"extractAttribute":"data-hovercard-id","id":"e-mail","multiple":false,"parentSelectors":["Link"],"selector":"[data-hovercard-id]","type":"SelectorElementAttribute"}]}

It appears that e-mails are scraped for all listings that have them. Some listings just don't have an e-mail, e.g. Consultores Logistico Aduanero S.a. | DirectorioDeCarga.com
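
For reference, the [1-10] in the start URL is a page range: the scraper visits one URL per number instead of following pagination links. The sketch below shows the equivalent expansion in plain Python. It is an illustration only, handles just the simple [start-end] form, and the function name is hypothetical.

```python
import re

RANGE_RE = re.compile(r"\[(\d+)-(\d+)\]")


def expand_range(url: str):
    """Expand the first [start-end] placeholder in a URL into concrete URLs."""
    match = RANGE_RE.search(url)
    if match is None:
        return [url]
    start, end = int(match.group(1)), int(match.group(2))
    prefix, suffix = url[:match.start()], url[match.end():]
    return [f"{prefix}{n}{suffix}" for n in range(start, end + 1)]


urls = expand_range(
    "https://directoriodecarga.com/directorio-de-empresas/?pag=[1-10]&sector=&pais=&clave="
)
print(len(urls))
print(urls[0])
print(urls[-1])
```

Widening the range in the start URL should cover the remaining pages the same way.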