I wrote a Python script to detect a website's internal URLs as soon as they go live. For example, news websites post regularly, and I want to detect new URLs within seconds. I could use any website's /feed RSS, but that has about 8 minutes of latency.
To make this clearer, let's take this forum URL as an example. The script below is Python using requests and BeautifulSoup; when I run it, the output is "https://forum.webscraper.io/".
import requests
from bs4 import BeautifulSoup

url = "https://forum.webscraper.io/"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
page_url = soup.find('meta', {'property': 'og:url'})['content']
print(page_url)
OUTPUT ""https://forum.webscraper.io/""
The script below uses hashlib to detect changes on the website's "urls", and whenever a new URL is detected that didn't exist before, it notifies me. But the output is not a URL, only:

"The website has been updated! A new internal link may have been added.
No new internal links found."

How do I fix this? I want the output to be the URL it detected.
Here is the script:
import requests
import hashlib
import time
from bs4 import BeautifulSoup

# Define the base URL and the hash of the current page content
base_url = "https://forum.webscraper.io/"
current_hash = None

# Define a list to store the internal links found on the page
internal_links = []
while True:
    # Make a request to the website and get the content
    response = requests.get(base_url)
    content = response.content

    # Compute the hash of the content
    new_hash = hashlib.md5(content).hexdigest()

    # If the hash of the content has changed, the website has been updated
    if new_hash != current_hash:
        print("The website has been updated! A new internal link may have been added.")

        # Update the current hash to the new hash
        current_hash = new_hash

        # Parse the HTML code of the website using BeautifulSoup
        soup = BeautifulSoup(content, "html.parser")

        # Find the canonical link element in the page head
        canonical_link = soup.find("link", {"rel": "canonical"})

        # If a canonical link element is found, extract the href attribute
        if canonical_link:
            internal_links.append(canonical_link["href"])

            # Scrape the webpage URL using BeautifulSoup
            page_url_response = requests.get(canonical_link["href"])
            page_url_content = page_url_response.content
            page_url_soup = BeautifulSoup(page_url_content, 'html.parser')
            page_url = page_url_soup.find('meta', {'property': 'og:url'})['content']
            print(f"New URL found: {page_url}")

        # If there are new internal links, print them
        if internal_links:
            print("New internal links found:")
            for link in internal_links:
                print(link)
        else:
            print("No new internal links found.")

    # Wait for 15 seconds before checking the website again
    time.sleep(15)
OUTPUT: "The website has been updated! A new internal link may have been added."
I'd really appreciate any opinion or help.
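For clarity, here is a minimal sketch (not my actual script) of what I think the detection step should do: take a snapshot of the page's internal links, take another one later, and print only the links in the new snapshot that weren't in the old one. The two HTML snippets and the /t/... topic URLs below are made up just to illustrate the idea:

```python
from bs4 import BeautifulSoup

BASE = "https://forum.webscraper.io/"  # the example site from above

def internal_links(html, base=BASE):
    """Return the set of internal links (absolute URLs under `base`) in `html`."""
    soup = BeautifulSoup(html, "html.parser")
    return {a["href"] for a in soup.find_all("a", href=True)
            if a["href"].startswith(base)}

# Simulated snapshots of the page before and after a new topic appears
# (hypothetical topic URLs, for illustration only).
old_html = '<a href="https://forum.webscraper.io/t/old-topic">old</a>'
new_html = ('<a href="https://forum.webscraper.io/t/old-topic">old</a>'
            '<a href="https://forum.webscraper.io/t/new-topic">new</a>')

# The new URLs are the set difference between the two snapshots.
new_links = internal_links(new_html) - internal_links(old_html)
for link in sorted(new_links):
    print(f"New URL found: {link}")
# → New URL found: https://forum.webscraper.io/t/new-topic
```

In the real script the two snapshots would come from requests.get(base_url).content on successive iterations of the loop, with a time.sleep between them.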