Using AI to scrape

I don't know anything about web scraping, but I enlisted the help of an AI to write some scripts for me. We tried Python and Java scripts, and both times it had trouble finding elements on the page. I know they're populated by JavaScript (which is why I thought I'd try getting it to write a JavaScript script next). It has tried a few ways to locate the elements, CSS selectors and XPath, but they all end the same way. I just want it to pull a list of all the products on this page (https://www.jellycat.com/2023) and put them in an Excel file. This is what it's made so far:

const { Builder, By, Key, until } = require('selenium-webdriver');
const chrome = require('selenium-webdriver/chrome');
const excel = require('exceljs');

async function scrapeJellycat() {
  const driver = await new Builder().forBrowser('chrome').build();

  try {
    await driver.get('https://www.jellycat.com/2023/');
    await driver.wait(until.elementLocated(By.xpath('//*[@id="content"]/div[3]')), 10000);

    const products = await driver.findElements(By.xpath('//*[@id="content"]/div[3]/div'));

    const workbook = new excel.Workbook();
    const worksheet = workbook.addWorksheet('Retired Jellycats');

    worksheet.columns = [
      { header: 'Product Name', key: 'name', width: 30 },
      { header: 'Product Code', key: 'code', width: 15 },
      { header: 'Description', key: 'description', width: 50 },
    ];

    for (let i = 0; i < products.length; i++) {
      const name = await products[i].findElement(By.xpath('./div[2]/a')).getText();
      const code = await products[i].findElement(By.xpath('./div[2]/div[1]')).getText();
      const description = await products[i].findElement(By.xpath('./div[2]/div[2]')).getText();

      worksheet.addRow({ name, code, description });
    }

    await workbook.xlsx.writeFile('C:/*****************/Jellycat/Retired-Jellycats.xlsx');
    console.log('Scraping complete!');

  } catch (error) {
    console.log(error);
  } finally {
    await driver.quit();
  }
}

scrapeJellycat();
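
For reference, when a listing is rendered by JavaScript, a common adjustment is to wait for a content-specific CSS selector rather than a positional XPath. The sketch below shows that idea with the same selenium-webdriver setup; the selectors it uses (#productDataOnPagex, div.listing, .listing-details a) are borrowed from the sitemap suggested further down in this thread and are only assumptions about the live page markup, so they would need checking in the browser's developer tools.

// A minimal sketch only: the CSS selectors below are assumptions about the live
// page markup (taken from the sitemap example later in this thread) and should be
// verified in the browser's developer tools before relying on them.
const { Builder, By, until } = require('selenium-webdriver');

async function listProductNames() {
  const driver = await new Builder().forBrowser('chrome').build();
  try {
    await driver.get('https://www.jellycat.com/2023/');

    // Wait until the JavaScript-rendered product tiles actually exist in the DOM,
    // instead of waiting on a positional XPath that may never match.
    await driver.wait(
      until.elementsLocated(By.css('#productDataOnPagex div.listing')),
      20000
    );

    const tiles = await driver.findElements(By.css('#productDataOnPagex div.listing'));
    for (const tile of tiles) {
      // Assumes each tile holds its product name in a ".listing-details a" link.
      const name = await tile.findElement(By.css('.listing-details a')).getText();
      console.log(name);
    }
  } finally {
    await driver.quit();
  }
}

listProductNames();

If the wait still times out, the products may only appear as the page is scrolled, which the sitemap approach below handles with an element-scroll selector.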

@jay2jay99 Hello.

Web Scraper can be used to extract data from a wide variety of websites; however, a specific sitemap has to be created for each site. A sitemap tells the scraper which page elements to extract data from, which links to follow to other pages, and so on.

Please start by installing the free Web Scraper browser extension for Chrome and having a look at the video tutorials and documentation on the Web Scraper site to learn the basics of creating sitemaps.

Learning resources are available on the Web Scraper website here:

Tutorial videos: Web Scraper Tutorials
Documentation: Installation | Web Scraper Documentation
How-to videos: Web Scraper << How to >> video tutorials

If you have already created a sitemap using the browser extension, it can be imported to your Cloud account by selecting the "Export Sitemap" option from the extension's sitemap menu and copying the sitemap code into Cloud's Import Sitemap page.

Alternatively, you can search for a pre-made sitemap in the "Community sitemaps" section:
Community sitemaps: Web Scraper

Here's a sitemap example that should help you get started:

{"_id":"jellycat-com","startUrl":["https://www.jellycat.com/eu/2023/"],"selectors":[{"delay":2000,"elementLimit":500,"id":"wrapper","multiple":true,"parentSelectors":["_root"],"selector":"div#productDataOnPagex div.listing","type":"SelectorElementScroll"},{"id":"name","multiple":false,"parentSelectors":["wrapper"],"regex":"","selector":".listing-details a","type":"SelectorText"},{"id":"img","multiple":false,"parentSelectors":["wrapper"],"selector":"img","type":"SelectorImage"}]}

Thank you, I gave that a go and managed to create a sitemap of the page, and it did pull off the data, great. Is there a way to automate running that scrape so the data is saved on my computer? I understand there is a scheduler in the Cloud section, but without using that, can you have the Chrome extension scrape the site once a day?

@jay2jay99 Hi, no, the extension does not have such a feature; scheduling is available only on Cloud. Cloud also has an automatic data export option that can be configured to export data to Google Sheets, Dropbox, or an Amazon S3 bucket.