I wrote a Python script to detect a website's internal URLs as soon as they go live. For example, news websites post regularly, and I want to detect new URLs within seconds. I could use any website's /feed RSS, but that has about 8 minutes of latency.
To make this clearer, let's take this forum URL as an example. The script below is Python using requests and BeautifulSoup; when I run it, the output is "https://forum.webscraper.io/".
import requests
from bs4 import BeautifulSoup

url = "https://forum.webscraper.io/"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
page_url = soup.find('meta', {'property': 'og:url'})['content']
print(page_url)
OUTPUT ""https://forum.webscraper.io/""
The script below uses hashlib to detect changes on the website's "urls", and whenever a new URL is detected that didn't exist before, it notifies me. But the output is not a URL, only:

"The website has been updated! A new internal link may have been added.
No new internal links found."

How do I fix this? I want the output to be the URL it detected.
Here is the script:
import requests
import hashlib
import time
from bs4 import BeautifulSoup

# Define the base URL and the hash of the current page content
base_url = "https://forum.webscraper.io/"
current_hash = None

# Define a list to store the internal links found on the page
internal_links = []
while True:
    # Make a request to the website and get the content
    response = requests.get(base_url)
    content = response.content

    # Compute the hash of the content
    new_hash = hashlib.md5(content).hexdigest()

    # If the hash of the content has changed, the website has been updated
    if new_hash != current_hash:
        print("The website has been updated! A new internal link may have been added.")

        # Update the current hash to the new hash
        current_hash = new_hash

        # Parse the HTML code of the website using BeautifulSoup
        soup = BeautifulSoup(content, "html.parser")

        # Find the canonical link element in the page head
        canonical_link = soup.find("link", {"rel": "canonical"})

        # If a canonical link element is found, extract the href attribute
        if canonical_link:
            internal_links.append(canonical_link["href"])

            # Scrape the webpage URL using BeautifulSoup
            page_url_response = requests.get(canonical_link["href"])
            page_url_content = page_url_response.content
            page_url_soup = BeautifulSoup(page_url_content, 'html.parser')
            page_url = page_url_soup.find('meta', {'property': 'og:url'})['content']
            print(f"New URL found: {page_url}")

        # If there are new internal links, print them
        if internal_links:
            print("New internal links found:")
            for link in internal_links:
                print(link)
        else:
            print("No new internal links found.")

    # Wait for 15 seconds before checking the website again
    time.sleep(15)
OUTPUT: "The website has been updated! A new internal link may have been added."
I'd really appreciate any opinion or help.
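For clarity, here is a minimal sketch (not my actual script) of what I think the detection step should do: take a snapshot of the page's internal links, take another one later, and print only the links in the new snapshot that weren't in the old one. The two HTML snippets and the /t/... topic URLs below are made up just to illustrate the idea:

```python
from bs4 import BeautifulSoup

BASE = "https://forum.webscraper.io/"  # the example site from above

def internal_links(html, base=BASE):
    """Return the set of internal links (absolute URLs under `base`) in `html`."""
    soup = BeautifulSoup(html, "html.parser")
    return {a["href"] for a in soup.find_all("a", href=True)
            if a["href"].startswith(base)}

# Simulated snapshots of the page before and after a new topic appears
# (hypothetical topic URLs, for illustration only).
old_html = '<a href="https://forum.webscraper.io/t/old-topic">old</a>'
new_html = ('<a href="https://forum.webscraper.io/t/old-topic">old</a>'
            '<a href="https://forum.webscraper.io/t/new-topic">new</a>')

# The new URLs are the set difference between the two snapshots.
new_links = internal_links(new_html) - internal_links(old_html)
for link in sorted(new_links):
    print(f"New URL found: {link}")
# → New URL found: https://forum.webscraper.io/t/new-topic
```

In the real script the two snapshots would come from requests.get(base_url).content on successive iterations of the loop, with a time.sleep between them.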