Extract data from a poorly marked-up website

MaoMao · September 11, 2023, 10:22am

I want to extract data from this website: https://natexpo.com/

I need the company name, website, and so on, but even with Web Scraper and Data Miner I can't get it to work; my impression is that the groups of text are poorly marked up. I don't have much coding knowledge, but I want to learn because this problem may come up again. Thank you very much for your help!

PS: I am not looking for the CSV result of the scrape but for the solution, so I can learn how to do it in this case. Thank you.

leemeng · September 11, 2023, 11:19pm

This site is slightly tricky but can still be scraped. The main data is actually within an iframe which can be accessed directly at:
https://liste-exposants.hubj2c.com/natexpo23&lang=fr
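For anyone wondering how an embedded page like this can be spotted, here is a minimal Python sketch that lists the `src` of every iframe in a piece of HTML. The sample markup and the example.com address are made up for illustration and are not the real markup of the site. If the iframe is injected by JavaScript, the raw HTML will not contain it, and the Elements tab of the browser's developer tools is the place to look instead.

```python
from html.parser import HTMLParser


class IframeFinder(HTMLParser):
    """Collects the src attribute of every <iframe> tag it sees."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "iframe":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def find_iframe_sources(html):
    """Return the src values of all iframes found in the given HTML string."""
    finder = IframeFinder()
    finder.feed(html)
    return finder.sources


# Hypothetical page markup, for illustration only.
sample_html = """
<html><body>
  <h1>Exhibitors</h1>
  <iframe src="https://example.com/embedded-list"></iframe>
</body></html>
"""

print(find_iframe_sources(sample_html))
```

Opening the printed address directly in a browser tab shows the embedded page on its own, which is usually much easier to scrape than the outer page.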

Each company tile contains a company ID number which can be used to create the URL for the company's details page. The URLs all have the same format, e.g.:

https://liste-exposants.hubj2c.com/natexpo23/main?id=397512
https://liste-exposants.hubj2c.com/natexpo23/main?id=405426
https://liste-exposants.hubj2c.com/natexpo23/main?id=383540

So the first thing you can do is scrape all the company ID numbers with this sitemap:

{"_id":"liste-exposants-get-ids","startUrl":["https://liste-exposants.hubj2c.com/natexpo23&lang=fr"],"selectors":[{"extractAttribute":"id","id":"Company IDs","multiple":true,"parentSelectors":["_root"],"selector":"table#exposants > tbody > tr","type":"SelectorElementAttribute"}]}

Use a longer Page load delay, about 5000 ms.
From these results, you can create a list of company URLs from the ID numbers. There are a few ways to do so, and I have covered them in "Add suffix/prefix to URLs, or build URLs from scratch".
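One way to build the URLs outside of a spreadsheet is a few lines of Python. This is only a sketch: it assumes the scraped "Company IDs" column contains the bare ID numbers, and the function name is made up for this example. Blank and duplicate IDs are skipped.

```python
# URL prefix taken from the example company URLs above.
BASE_URL = "https://liste-exposants.hubj2c.com/natexpo23/main?id="


def build_company_urls(company_ids):
    """Turn scraped ID values into detail-page URLs, skipping blanks and repeats."""
    urls = []
    seen = set()
    for raw in company_ids:
        company_id = str(raw).strip()
        if not company_id or company_id in seen:
            continue
        seen.add(company_id)
        urls.append(BASE_URL + company_id)
    return urls


# Assumed input: the ID column exported from the first sitemap.
ids = ["397512", "405426", " 383540 ", "", "397512"]
for url in build_company_urls(ids):
    print(url)
```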

Then you can create a new sitemap which uses all those company URLs as StartURLs. It should be fairly straightforward and would not need a paginator.
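If editing the sitemap JSON by hand is error-prone, the sketch below generates it instead. Note that the key is `startUrl` and its value is a JSON list of quoted URLs, as in the sitemaps in this thread. The sitemap name `liste-exposants-companies` and the function name are made up for this example, and the single selector is copied from the test sitemap posted later in this thread; more selectors can be added to the list in the same way.

```python
import json


def build_sitemap(sitemap_id, start_urls, selectors):
    """Return Web Scraper sitemap JSON with the given start URLs and selectors."""
    sitemap = {
        "_id": sitemap_id,
        "startUrl": list(start_urls),
        "selectors": list(selectors),
    }
    # Compact separators give a single line that is easy to paste into "Import Sitemap".
    return json.dumps(sitemap, separators=(",", ":"))


# Selector copied from the test sitemap in this thread (company name only).
company_selector = {
    "id": "Company",
    "parentSelectors": ["_root"],
    "type": "SelectorText",
    "selector": "div.modal-content div > h5.modal-title",
    "multiple": False,
}

urls = [
    "https://liste-exposants.hubj2c.com/natexpo23/main?id=397512",
    "https://liste-exposants.hubj2c.com/natexpo23/main?id=405426",
]

print(build_sitemap("liste-exposants-companies", urls, [company_selector]))
```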

MaoMao · September 12, 2023, 8:30am

Thank you so much!

To be honest, I didn't manage to do everything. I managed to create all the URLs from each company's ID, but I'm stuck on creating the new sitemap, even when I replace "StartURL" with the links.

I have another question: how did you get the first 3 links from the iframe? It's really impressive! I'd really like to learn how to do it myself.

The ideal, my friend, would be a video tutorial on how to do it from A to Z (using 2 companies as an example). Tell me your price and how much time it would take, and we'll see if you can do it for me. I really want to learn.

Thank you,

leemeng · September 12, 2023, 2:36pm

Sorry, I don't have the time to create videos or provide lessons. This would be a challenging site to scrape for beginners. I initially thought it could not be done with Web Scraper, but then I remembered some old tricks from scraping a similar site a while ago.

Here's the sitemap that will get 800+ companies. It is too large to paste here; it can be downloaded at:

Plus, here is a test sitemap which is the same as the one above, except that it has only 3 companies, so you can do a quick test to see if it works. I suggest running this test sitemap first with a Page load delay of about 5000 ms.

{"_id":"liste-exposants-companies-test","startUrl":["https://liste-exposants.hubj2c.com/natexpo23/main?id=383540","https://liste-exposants.hubj2c.com/natexpo23/main?id=405426","https://liste-exposants.hubj2c.com/natexpo23/main?id=397512"],"selectors":[{"id":"Company","parentSelectors":["_root"],"type":"SelectorText","selector":"div.modal-content div > h5.modal-title","multiple":false,"regex":".+(?=\\(Stand)"},{"id":"Website","parentSelectors":["_root"],"type":"SelectorElementAttribute","selector":"div.modal-content div.fe-lien a","multiple":false,"extractAttribute":"href"},{"id":"Country","parentSelectors":["_root"],"type":"SelectorElementAttribute","selector":"div.modal-content img.fe-drapeau","multiple":false,"extractAttribute":"title"},{"id":"LinkedIn","parentSelectors":["_root"],"type":"SelectorElementAttribute","selector":"div.modal-content a.fe-social[href*='linkedin']","multiple":false,"extractAttribute":"href"},{"id":"Facebook","parentSelectors":["_root"],"type":"SelectorElementAttribute","selector":"div.modal-content a.fe-social[href*='facebook']","multiple":false,"extractAttribute":"href"},{"id":"Stand","parentSelectors":["_root"],"type":"SelectorText","selector":"div.modal-content div > h5.modal-title","multiple":false,"regex":"(?<=\\(Stand)[^\\)]+"}]}
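The "Company" and "Stand" selectors above read the same title element and use a regex to keep different parts of it. The Python sketch below shows what the two patterns match. The sample title is hypothetical (the real titles may be formatted differently), and the `.strip()` calls are an addition here to drop the spaces around the matched text.

```python
import re

# The same patterns as in the sitemap, without the extra JSON escaping.
COMPANY_RE = re.compile(r".+(?=\(Stand)")      # everything before "(Stand"
STAND_RE = re.compile(r"(?<=\(Stand)[^\)]+")   # everything after "(Stand" up to ")"


def split_title(title):
    """Return (company, stand) from a title, or None for a part that is not found."""
    company = COMPANY_RE.search(title)
    stand = STAND_RE.search(title)
    return (
        company.group(0).strip() if company else None,
        stand.group(0).strip() if stand else None,
    )


# Hypothetical title text, for illustration only.
print(split_title("Example Bio (Stand B12)"))
```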

MaoMao · September 12, 2023, 3:07pm

Thank you so much, you really helped me! Thank you for your time and patience, I really appreciate it.

MaoMao · December 13, 2023, 4:14pm

@leemeng Hi my friend, a new challenge for you. This one looks very easy, but I still can't figure it out. For every brand on this site (2300), I need the country, website URL, company name, and email. Everything is on the website, but I can't extract it with Web Scraper and I don't know why. Do you have a sitemap for this, mate?

Thanks a lot