Scraping from a sub-sub page

Hi all,

I am a beginner at web scraping.
What I need is some addresses from a public website.

This website has listing pages with 20 contacts each.

On a listing page you can click one of the links (one of the 20 contacts), which brings you to the details page (subpage) of that contact.

This details page in turn has one or more links (one or more addresses of that contact), each of which brings you to a sub-sub page.

On this sub-sub page you can find the address details I want.

Main page (20 contacts)
    Details page contact A, address 1
        Sub-sub page with address info A1
    Details page contact A, address 2
        Sub-sub page with address info A2

Main page (20 contacts)
    Details page contact B, address 1
        Sub-sub page with address info B1
    Details page contact B, address 2
        Sub-sub page with address info B2
etc.

I use Google Colaboratory.
I installed the correct version of the webdriver with the correct path.
I also use BeautifulSoup.

My result:
I get records with the correct names, but every address detail shows 'None'.

Who can and will help me?
Thank you in advance
David

To help you, I'll provide you with a basic example using Python, Selenium, and BeautifulSoup. Make sure you have the necessary libraries installed:
pip install selenium
pip install beautifulsoup4
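
Since you mentioned Colaboratory: there is no display there, so Chrome has to run headless. A minimal setup sketch (the exact options can vary by environment):

from selenium import webdriver

# Headless options, needed in display-less environments such as Colab
options = webdriver.ChromeOptions()
options.add_argument('--headless=new')          # no visible browser window
options.add_argument('--no-sandbox')            # often required in containers
options.add_argument('--disable-dev-shm-usage')
driver = webdriver.Chrome(options=options)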

Here's a basic script outline:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import time

# Function to extract address details from the sub-sub page
def extract_address_details(sub_sub_page_soup):
    # Your code to extract address details here; the selector below is a
    # placeholder for whatever element holds the address on your site
    address_element = sub_sub_page_soup.find('span', {'class': 'address'})
    return address_element.text if address_element else None

# Function to scrape data from a detail page and navigate to the sub-sub page
def scrape_detail_page(detail_page_url):
    driver.get(detail_page_url)

    # Wait for the page to load; you may need to adjust the sleep time
    time.sleep(2)

    # Your code to extract the contact name; the XPath is a placeholder
    contact_name = driver.find_element(By.XPATH, '//span[@class="contact-name"]').text

    # Click the link that leads to the sub-sub page (placeholder XPath)
    sub_sub_page_link = driver.find_element(By.XPATH, '//a[@class="sub-sub-page-link"]')
    sub_sub_page_link.click()

    # Wait for the sub-sub page to load
    time.sleep(2)

    # Get the page source and create a BeautifulSoup object
    sub_sub_page_soup = BeautifulSoup(driver.page_source, 'html.parser')

    # Extract the address details
    address_details = extract_address_details(sub_sub_page_soup)

    return contact_name, address_details

# Main scraping function
def scrape_main_pages(main_page_url):
    driver.get(main_page_url)

    # Wait for the page to load
    time.sleep(2)

    # Collect the contact URLs first: navigating away invalidates the elements
    contact_links = driver.find_elements(By.XPATH, '//a[@class="contact-link"]')
    contact_urls = [link.get_attribute('href') for link in contact_links]
    for contact_url in contact_urls:
        contact_name, address_details = scrape_detail_page(contact_url)
        print(f"Contact: {contact_name}, Address: {address_details}")

# Set up the webdriver (use webdriver.Firefox if using Firefox)
driver = webdriver.Chrome(service=Service('path/to/chromedriver'))

# URL of the main page
main_page_url = 'your_main_page_url_here'

# Call the main scraping function
scrape_main_pages(main_page_url)

# Close the webdriver when done
driver.quit()
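
One more note: fixed time.sleep() waits are a common reason fields come back as None, because the element may not be rendered yet when the page source is read. An explicit wait is more robust; a minimal sketch, where the span.address selector is again a placeholder:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for the address element to appear, then read it
address_element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'span.address'))
)
print(address_element.text)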

Hi, could you please post the website URL so I can inspect the code?

Hi JanAp,

Thank you for being willing to help me!
I would like to scrape all 358 pages of dentist info from:

An example of the structure:
(first main page) 7146 tandartsen in Nederland

(first subpage) Tandarts Elst, M.

On this first subpage there are 2 addresses (sub-sub pages).
Both sub-sub pages contain address details:

  1. Endogooi, praktijk voor endodontologie, Bussum
  2. Staas & Bergmans, locatie Schubertsingel, 's-Hertogenbosch

(Not all subpages have 2 sub-sub pages; sometimes there is only 1, and sometimes more than 2.)

===============
I am very inexperienced with scripts and found the script below using AI:

"
import requests
from bs4 import BeautifulSoup

def find_modal_content_values(url):
    response = requests.get(url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        modal_content = soup.find('div', class_='modal-content')

        if modal_content:
            naam_element = modal_content.find('h2', class_='mb-2')
            adres_element = modal_content.find('address', class_='flex-fill non-italic m-0')
            telefoon_element = modal_content.find('a', class_='underline')

            if naam_element:
                naam = naam_element.text.strip()
            else:
                naam = "Niet beschikbaar"

            if adres_element:
                adres = adres_element.text.strip()
            else:
                adres = "Niet beschikbaar"

            if telefoon_element:
                telefoon = telefoon_element.text.strip()
            else:
                telefoon = "Niet beschikbaar"

            print("Naam:", naam)
            print("Adres:", adres)
            print("Telefoon:", telefoon)

            website_element = modal_content.find('div', class_='flex-fill d-flex flex-column')
            if website_element:
                website_link = website_element.find('a')
                if website_link:
                    website = website_link['href']
                else:
                    website = "Niet beschikbaar"
                print("Website:", website)
            else:
                print("Website: Niet beschikbaar")
        else:
            print("Geen <div class='modal-content'> gevonden op de pagina.")
    else:
        print(f"Fout bij het ophalen van de pagina: {url}")

url = "https://www.zorgkaartnederland.nl/tandarts"  # first main listing page
find_modal_content_values(url)
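
For reference, the listing pages themselves can be fetched one by one with requests; a minimal sketch of the pagination loop, assuming the a.filter-result__name link class that the sitemaps later in this thread use:

import requests
from bs4 import BeautifulSoup

BASE_URL = "https://www.zorgkaartnederland.nl/tandarts/pagina{}"

for page in range(1, 359):  # listing pages 1..358
    response = requests.get(BASE_URL.format(page), timeout=30)
    if response.status_code != 200:
        print(f"Fout bij het ophalen van de pagina: {response.url}")
        continue
    soup = BeautifulSoup(response.content, "html.parser")
    # Each contact in the listing links to its details page (subpage)
    for link in soup.select("a.filter-result__name"):
        print(link.get_text(strip=True), link.get("href"))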

Hi Emily,

thank you so much for helping me.
Unfortunately I am too inexperienced to adjust my code based on your information.

I also saw a reply from JanAp, to whom I sent my non-working script.

I would be very glad to have a working script.

Thank you again so much!
David

Hi, you can import this sitemap into the Web Scraper extension. It opens listing pages 1-358, follows each contact link, then follows each address link and reads the address text:

{"_id":"zorgkaartnederland","startUrl":["https://www.zorgkaartnederland.nl/tandarts/pagina[1-358]"],"selectors":[{"id":"listing-link","linkType":"linkFromHref","multiple":true,"parentSelectors":["_root"],"selector":"a.filter-result__name","type":"SelectorLink"},{"id":"address-link","linkType":"linkFromHref","multiple":true,"parentSelectors":["listing-link"],"selector":"p:contains(\"is werkzaam bij:\") + [class=\"filter-results\"] .filter-result-content__body a","type":"SelectorLink"},{"id":"address","multiple":false,"parentSelectors":["address-link"],"regex":"","selector":"[class*=\"modal-address-toggle\"]","type":"SelectorText"}]}

Let me know if it works!

Hi JanAp,

Thank you!

But I also do not know how or where to import it. :frowning: :frowning:
Would you also help me with this?

Thank you !!

You have to install the Web Scraper Chrome extension, then you can check the How To videos to learn the basics.

Thank you, I will work on it tonight when I am at home.

I am very pleased with your help!!

Hi JanAp,
I managed to open webscraper.io, and below I have a toolbar starting with:
Elements - Console - Sources - Network - Memory - etc. - Web Scraper (which I selected)

Where do I need to copy and paste the sitemap you sent?
I do not understand where the Web Scraper extension is. :frowning:

Still my inexperience.

Maybe I can manage; I now understand that your long line is the sitemap describing the steps for Web Scraper.

I managed to get the following from the 1st page:

listing-link - listing-link-href
names - URLs

but which code do I need to import as a sitemap to also get the addresses from the sub-sub URLs?

Am I doing something wrong in Web Scraper?

Hi, all you need to do is click 'Create new sitemap' -> 'Import sitemap' and paste in the code I posted earlier. When that is done, click Sitemap -> Scrape.

YES!! It works perfectly!!

I forgot to ask whether the phone number and the website URL of the dentist can also be scraped. Would you mind adding these fields to your previous sitemap? I am VERY thankful!

And also the postal code!!

It seems that when you click the address on the 2nd page (subpage), you get a popup with the whole address, including the postal code, phone number, and their URL. I hope you can add these to the previous sitemap. Thank you so much in advance!!

Dear JanAp,

Can I do anything for you?
I am so pleased that you are willing to help. It is so important for me.

Please let me know.
Thank you !!
David

Hi, sure, I can help you with that. To open the pop-up, I added the 'contact-information-click' selector to the sitemap:

{"_id":"zorgkaartnederland","startUrl":["https://www.zorgkaartnederland.nl/tandarts/pagina[1-358]"],"selectors":[{"id":"listing-link","linkType":"linkFromHref","multiple":true,"parentSelectors":["_root"],"selector":"a.filter-result__name","type":"SelectorLink"},{"id":"address-link","linkType":"linkFromHref","multiple":true,"parentSelectors":["listing-link"],"selector":"p:contains(\"is werkzaam bij:\") + [class=\"filter-results\"] .filter-result-content__body a","type":"SelectorLink"},{"clickActionType":"real","clickElementSelector":"[class*=\"modal-address-toggle\"]","clickElementUniquenessType":"uniqueCSSSelector","clickType":"clickOnce","delay":1000,"discardInitialElements":"do-not-discard","id":"contact-information-click","multiple":false,"parentSelectors":["address-link"],"selector":"_parent_","type":"SelectorElementClick"},{"id":"address","multiple":false,"parentSelectors":["address-link"],"regex":"","selector":"address","type":"SelectorText"},{"id":"phone","multiple":false,"parentSelectors":["address-link"],"regex":"","selector":".align-items-center a.underline","type":"SelectorText"},{"id":"website","multiple":false,"parentSelectors":["address-link"],"regex":"","selector":".flex-fill a.underline","type":"SelectorText"}]}

Hi JanAp,

Thank you so much!!

In the meantime I was busy trying to achieve the same thing, because I do not always want to be dependent on others.
I managed it myself in another way, without opening the popup, but your result is much better.

I am really very thankful for your help.
Would you like to get something for your help? Please let me know.

Warm regards,
David

Hi, David! I am happy to help. If you have a minute, you are welcome to leave a review for the extension in the Chrome store!