Scraping from subsubpage

DavidBremer · January 31, 2024, 10:25am

Hi JanAp,

Thank you for offering to help me!
I would like to scrape all 358 pages of dentist info from:

An example of the structure:
(first mainpage) 7146 tandartsen in Nederland

(first subpage) Tandarts Elst, M.

This first subpage has 2 addresses (subsubpages).
Both subsubpages contain address details:

(Not all subpages have 2 subsubpages; some have only 1, and some have more than 2.)

===============
I am very inexperienced with scripts and generated the script below using AI:

"
import requests
from bs4 import BeautifulSoup

def find_modal_content_values(url):
    response = requests.get(url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        modal_content = soup.find('div', class_='modal-content')

        if modal_content:
            naam_element = modal_content.find('h2', class_='mb-2')
            adres_element = modal_content.find('address', class_='flex-fill non-italic m-0')
            telefoon_element = modal_content.find('a', class_='underline')

            if naam_element:
                naam = naam_element.text.strip()
            else:
                naam = "Niet beschikbaar"

            if adres_element:
                adres = adres_element.text.strip()
            else:
                adres = "Niet beschikbaar"

            if telefoon_element:
                telefoon = telefoon_element.text.strip()
            else:
                telefoon = "Niet beschikbaar"

            print("Naam:", naam)
            print("Adres:", adres)
            print("Telefoon:", telefoon)

            website_element = modal_content.find('div', class_='flex-fill d-flex flex-column')
            if website_element:
                website_link = website_element.find('a')
                if website_link:
                    website = website_link['href']
                else:
                    website = "Niet beschikbaar"
                print("Website:", website)
            else:
                print("Website: Niet beschikbaar")
        else:
            print("Geen <div class='modal-content'> gevonden op de pagina.")
    else:
        print(f"Fout bij het ophalen van de pagina: {url}")

url = "7146 tandartsen in Nederland"
find_modal_content_values(url)

DavidBremer · January 31, 2024, 10:32am

Hi Emily,

Thank you so much for helping me.
Unfortunately I am too inexperienced to adjust my code with your information.

I also saw a reply from JanAp, to whom I sent my non-working script.

I would be very glad to have a working script.

Thank you again so much!
David

JanAp · January 31, 2024, 11:40am

Hi, you can import this sitemap into the webscraper extension:

{"_id":"zorgkaartnederland","startUrl":["https://www.zorgkaartnederland.nl/tandarts/pagina[1-358]"],"selectors":[{"id":"listing-link","linkType":"linkFromHref","multiple":true,"parentSelectors":["_root"],"selector":"a.filter-result__name","type":"SelectorLink"},{"id":"address-link","linkType":"linkFromHref","multiple":true,"parentSelectors":["listing-link"],"selector":"p:contains(\"is werkzaam bij:\") + [class=\"filter-results\"] .filter-result-content__body a","type":"SelectorLink"},{"id":"address","multiple":false,"parentSelectors":["address-link"],"regex":"","selector":"[class*=\"modal-address-toggle\"]","type":"SelectorText"}]}

Let me know if it works!
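
As a rough illustration of what the sitemap above encodes, the Python sketch below expands the `pagina[1-358]` range in `startUrl` into individual page URLs and walks the `parentSelectors` chain from `address` back to `_root`. The helper names `expand_range` and `selector_chain` are made up for this example, the embedded JSON is a trimmed copy of the sitemap with only the fields needed here, and only the plain `[start-end]` range form is handled. It is not how the extension itself is implemented.

```python
import json
import re

# Trimmed copy of the sitemap above: only the fields needed to show the hierarchy.
SITEMAP = json.loads("""
{"_id": "zorgkaartnederland",
 "startUrl": ["https://www.zorgkaartnederland.nl/tandarts/pagina[1-358]"],
 "selectors": [
   {"id": "listing-link", "parentSelectors": ["_root"], "type": "SelectorLink"},
   {"id": "address-link", "parentSelectors": ["listing-link"], "type": "SelectorLink"},
   {"id": "address", "parentSelectors": ["address-link"], "type": "SelectorText"}
 ]}
""")


def expand_range(url):
    """Expand a single [start-end] range in a start URL into concrete URLs."""
    match = re.search(r"\[(\d+)-(\d+)\]", url)
    if match is None:
        return [url]
    start, end = int(match.group(1)), int(match.group(2))
    prefix, suffix = url[:match.start()], url[match.end():]
    return [f"{prefix}{n}{suffix}" for n in range(start, end + 1)]


def selector_chain(sitemap, selector_id):
    """Follow parentSelectors upward from one selector until _root is reached."""
    by_id = {s["id"]: s for s in sitemap["selectors"]}
    chain = [selector_id]
    while chain[-1] != "_root":
        chain.append(by_id[chain[-1]]["parentSelectors"][0])
    return list(reversed(chain))


urls = expand_range(SITEMAP["startUrl"][0])
print(len(urls), urls[0], urls[-1])
print(" -> ".join(selector_chain(SITEMAP, "address")))
```

The printed chain mirrors the page structure described at the top of the thread: result page, then dentist page, then address page.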

DavidBremer · January 31, 2024, 12:04pm

Hi JanAp,

Thank you!

But I also do not know how or where to import it :(
Could you also help me with this?

Thank you!!

JanAp · January 31, 2024, 12:13pm

You have to install the Web Scraper Chrome extension, then you can check the How To videos to learn the basics.

DavidBremer · January 31, 2024, 12:15pm

Thank you, I will work on it tonight when I am at home.

I am very pleased with your help!!

DavidBremer · January 31, 2024, 7:26pm

Hi JanAp,
I managed to open https://webscraper.io/ and at the bottom I now have a toolbar, starting with:
Elements - Console - Sources - Network - Memory - etc... - Web Scraper (which I selected)

Where do I need to copy and paste the sitemap that you sent?
I do not understand where the Web Scraper extension is :(

Still my inexperience.

DavidBremer · January 31, 2024, 7:48pm

Maybe I can manage: I now understand that your long line describes the steps in Web Scraper.

DavidBremer · January 31, 2024, 8:38pm

From the 1st page I managed to get:

listing-link - listing-link-href
names - URLs

But which code do I need to import as a sitemap to also get the addresses from the sub-sub URL?

Am I doing something wrong in Web Scraper?

JanAp · February 1, 2024, 7:44am

Hi, all you need to do is click 'Create new sitemap' -> 'Import sitemap' and copy the code I posted earlier. When that is done, click on Sitemap -> Scrape.

DavidBremer · February 1, 2024, 12:43pm

YES!! It works perfectly!!

I forgot to ask whether the phone number and the website URL of the dentist can also be scraped at the same time. Would you mind adding these fields to your previous sitemap? I am VERY thankful.

DavidBremer · February 1, 2024, 12:48pm

And also the postcode!!

DavidBremer · February 1, 2024, 1:13pm

It seems that when you click on the address on the 2nd page (subpage), you get a popup with the whole address, including the postal code, the phone number and the website URL. I hope you can add it to the previous sitemap. Thank you so much in advance!!

DavidBremer · February 1, 2024, 4:47pm

Dear JanAp,

Can I do anything for you?
I am so pleased that you are willing to help. It is so important to me.

Please let me know.
Thank you!!
David

JanAp · February 2, 2024, 10:54am

Hi, sure, I can help you with that. To open the pop-up, I added the 'contact-information-click' selector to the sitemap:

{"_id":"zorgkaartnederland","startUrl":["https://www.zorgkaartnederland.nl/tandarts/pagina[1-358]"],"selectors":[{"id":"listing-link","linkType":"linkFromHref","multiple":true,"parentSelectors":["_root"],"selector":"a.filter-result__name","type":"SelectorLink"},{"id":"address-link","linkType":"linkFromHref","multiple":true,"parentSelectors":["listing-link"],"selector":"p:contains(\"is werkzaam bij:\") + [class=\"filter-results\"] .filter-result-content__body a","type":"SelectorLink"},{"clickActionType":"real","clickElementSelector":"[class*=\"modal-address-toggle\"]","clickElementUniquenessType":"uniqueCSSSelector","clickType":"clickOnce","delay":1000,"discardInitialElements":"do-not-discard","id":"contact-information-click","multiple":false,"parentSelectors":["address-link"],"selector":"_parent_","type":"SelectorElementClick"},{"id":"address","multiple":false,"parentSelectors":["address-link"],"regex":"","selector":"address","type":"SelectorText"},{"id":"phone","multiple":false,"parentSelectors":["address-link"],"regex":"","selector":".align-items-center a.underline","type":"SelectorText"},{"id":"website","multiple":false,"parentSelectors":["address-link"],"regex":"","selector":".flex-fill a.underline","type":"SelectorText"}]}
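
The `address` selector above appears to return the whole address as one text field, so the postcode asked about earlier would still need to be separated afterwards. Below is a minimal Python sketch of that post-processing step, assuming the scraped `address` text contains a standard Dutch postcode (four digits followed by two capital letters). The function name `extract_postcode` and the sample addresses are invented for illustration and are not real scraped data.

```python
import re

# Dutch postcodes are four digits followed by two capital letters, e.g. "1234 AB".
# The optional \s? also accepts the form without a space ("1234AB").
POSTCODE_RE = re.compile(r"\b(\d{4})\s?([A-Z]{2})\b")


def extract_postcode(address_text):
    """Return the first postcode found as '1234 AB', or None if there is none."""
    match = POSTCODE_RE.search(address_text)
    if match is None:
        return None
    return f"{match.group(1)} {match.group(2)}"


# Hypothetical address strings, made up for this example.
samples = [
    "Voorbeeldstraat 12 1234 AB Voorbeeldstad",
    "Voorbeeldstraat 12, 1234AB Voorbeeldstad",
    "No postcode in this text",
]
for text in samples:
    print(extract_postcode(text))
```

The same function could be applied to the `address` column of the exported results, for example in a spreadsheet import script.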

DavidBremer · February 2, 2024, 12:50pm

Hi JanAp,

Thank you so much!!

In the meantime I was busy trying to get the same result myself, because I do not always want to be dependent on others.
I managed it myself in another way, without opening the popup, but your result is much better.

I am really very thankful for your help.
Would you like something in return for your help? Please let me know.

Warm regards,
David

JanAp · February 2, 2024, 1:04pm

Hi, David! I am happy to help. If you have a minute, you are welcome to leave a review for the extension in the Chrome store!

DavidBremer · February 2, 2024, 1:05pm

Sure, but please tell me where and how.
I am inexperienced with this as well.

And I will do it.

JanAp · February 2, 2024, 1:14pm

There should be a 'Write a review' button where the reviews are listed.

DavidBremer · February 3, 2024, 7:41am

Hi JanAp, yesterday I clicked on that screen but could not find that button :(.

Can you help me a bit more with it? I would love to write something positive about your help.