Scraping a difficult website

jbbiolay · July 19, 2022, 1:23pm

Hello everyone

I have been trying desperately to scrape this website:
URL: Health Valley

I'm interested in getting the complete list of companies, with these fields: sector of activity, canton, year of creation, number of employees, website, name of the director and description.

I have two problems:

Not all of the fields are available for every company
You have to click on the company name each time to display the fields, which means that I can only scrape the first company in the list...

Can someone help me?

Thank you

ViestursWS · July 19, 2022, 1:43pm

@jbbiolay Hi, it looks like you should be able to achieve this with an 'Element click' selector; however, it will take a while. Please note that results will only start to appear once the Element click selector has finished executing.

Example:

{"_id":"republic-of-innovation-org","startUrl":["https://www.republic-of-innovation.org/HealthValley/"],"selectors":[{"clickElementSelector":"a.more","clickElementUniquenessType":"uniqueCSSSelector","clickType":"clickMore","delay":3000,"discardInitialElements":"discard-when-click-element-exists","id":"wrapper","multiple":true,"parentSelectors":["_root"],"selector":"div.company","type":"SelectorElementClick"},{"delay":0,"id":"sector","multiple":false,"parentSelectors":["wrapper"],"regex":"","selector":"p:contains(\"Secteur\") span","type":"SelectorText"},{"delay":0,"id":"year-founded","multiple":false,"parentSelectors":["wrapper"],"regex":"","selector":"p:contains(\"Année de fondation/implantation\") span","type":"SelectorText"},{"delay":0,"id":"canton","multiple":false,"parentSelectors":["wrapper"],"regex":"","selector":"p:contains(\"Canton\") span","type":"SelectorText"},{"delay":0,"id":"address","multiple":false,"parentSelectors":["wrapper"],"regex":"","selector":"p:contains(\"Adresse\") span","type":"SelectorText"},{"delay":0,"id":"director","multiple":false,"parentSelectors":["wrapper"],"regex":"","selector":"p:contains(\"Directeur\") span","type":"SelectorText"},{"delay":0,"id":"website","multiple":false,"parentSelectors":["wrapper"],"regex":"","selector":"p:contains(\"Site web\") span","type":"SelectorText"},{"delay":0,"id":"description","multiple":false,"parentSelectors":["wrapper"],"regex":"","selector":"div.description_html","type":"SelectorText"}]}

jbbiolay · July 20, 2022, 1:22pm

Hello,

Thank you for your answer. I tried scraping with your method, but even after it finishes I get no results...

I get the following message: No data scraped yet.

ViestursWS · July 20, 2022, 1:49pm

@jbbiolay Try limiting the number of clicks, for example to the first 10 companies: div.list_block li:nth-of-type(-n+10) a.more

Example:

{"_id":"republic-of-innovation-org","startUrl":["https://www.republic-of-innovation.org/HealthValley/"],"selectors":[{"clickElementSelector":"div.list_block li:nth-of-type(-n+10) a.more","clickElementUniquenessType":"uniqueCSSSelector","clickType":"clickMore","delay":3000,"discardInitialElements":"discard-when-click-element-exists","id":"wrapper","multiple":true,"parentSelectors":["_root"],"selector":"div.company","type":"SelectorElementClick"},{"delay":0,"id":"sector","multiple":false,"parentSelectors":["wrapper"],"regex":"","selector":"p:contains(\"Secteur\") span","type":"SelectorText"},{"delay":0,"id":"year-founded","multiple":false,"parentSelectors":["wrapper"],"regex":"","selector":"p:contains(\"Année de fondation/implantation\") span","type":"SelectorText"},{"delay":0,"id":"canton","multiple":false,"parentSelectors":["wrapper"],"regex":"","selector":"p:contains(\"Canton\") span","type":"SelectorText"},{"delay":0,"id":"address","multiple":false,"parentSelectors":["wrapper"],"regex":"","selector":"p:contains(\"Adresse\") span","type":"SelectorText"},{"delay":0,"id":"director","multiple":false,"parentSelectors":["wrapper"],"regex":"","selector":"p:contains(\"Directeur\") span","type":"SelectorText"},{"delay":0,"id":"website","multiple":false,"parentSelectors":["wrapper"],"regex":"","selector":"p:contains(\"Site web\") span","type":"SelectorText"},{"delay":0,"id":"description","multiple":false,"parentSelectors":["wrapper"],"regex":"","selector":"div.description_html","type":"SelectorText"}]}
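
As a side note on how the limit works: `:nth-of-type(-n+10)` matches the first 10 `li` siblings, and chaining it with a lower bound such as `:nth-of-type(n+11)` selects a window of items instead. The Python sketch below builds click selectors for consecutive batches. The function names and batch size are hypothetical, and it assumes all companies are `li` siblings in a single list under `div.list_block`, which has not been verified for this page.

```python
def click_selector(start, end, item="div.list_block li", link="a.more"):
    """Build a CSS selector for the 'more' links of list items start..end (1-based, inclusive)."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    # :nth-of-type(n+start) keeps items at position >= start,
    # :nth-of-type(-n+end) keeps items at position <= end.
    return f"{item}:nth-of-type(n+{start}):nth-of-type(-n+{end}) {link}"


def batch_selectors(total, batch_size):
    """Return one click selector per batch, covering items 1..total."""
    if total < 1 or batch_size < 1:
        raise ValueError("total and batch_size must be positive")
    return [
        click_selector(first, min(first + batch_size - 1, total))
        for first in range(1, total + 1, batch_size)
    ]


if __name__ == "__main__":
    # Example: 30 companies split into batches of 10.
    for selector in batch_selectors(30, 10):
        print(selector)
```

Each generated string could be used as the `clickElementSelector` value in a separate copy of the sitemap, so that every run only clicks a limited number of companies. Whether this avoids the slowdown on long lists is untested.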

jbbiolay · July 20, 2022, 10:15pm

Good evening,
Thank you! It does work for the first 10 entries, but for some reason it gets stuck as soon as I go above 100...