Scraping button links from User Panel

Rasul · November 9, 2021, 8:26am

Describe the problem.

I would like to get the link behind the "DOWNLOAD" button of each video. How can I do that?
I need the scraper to go through each course page and retrieve the link of the "Download" button on each one.

Url: Course Outline | Zero To Mastery Academy

Sitemap:
I am not sure about the sitemap

ViestursWS · November 9, 2021, 9:02am

@Rasul Hello. That is possible with an 'Element click' selector (selector: div.course-mainbar, click selector: a.item) set as the parent of a 'Link' selector (selector: a).

Practical example:

{"_id":"academy-zerotomastery-io","startUrl":["https://academy.zerotomastery.io/courses/deno-the-complete-guide-zero-to-mastery/lectures/17534833"],"selectors":[{"id":"clicks","parentSelectors":["_root"],"type":"SelectorElementClick","clickElementSelector":"a.item","clickElementUniquenessType":"uniqueCSSSelector","clickType":"clickOnce","delay":2000,"discardInitialElements":"do-not-discard","multiple":true,"selector":"div.course-mainbar"},{"id":"links","parentSelectors":["clicks"],"type":"SelectorLink","selector":"a","multiple":false,"delay":0}]}

Rasul · November 9, 2021, 9:31am

Thanks a lot, it worked. One last thing: web-scraper-start-url is the same for all button links. Is there a way to make the start URLs correct?

ViestursWS · November 9, 2021, 10:13am

@Rasul If you are looking to gather the 'Lecture' links as well, you can use a 'Link' selector instead of 'Element click', with an 'Element attribute' selector as its child to extract the 'Download' link.

Example:

{"_id":"academy-zerotomastery-io-edit","startUrl":["https://academy.zerotomastery.io/courses/deno-the-complete-guide-zero-to-mastery/lectures/17534833"],"selectors":[{"id":"course-links","parentSelectors":["_root"],"type":"SelectorLink","selector":"a.item","multiple":true,"delay":0},{"id":"links","parentSelectors":["course-links"],"type":"SelectorElementAttribute","selector":"a.download","multiple":false,"delay":0,"extractAttribute":"href"}]}

Rasul · November 10, 2021, 7:58am

@ViestursWS thank you very much, it works perfectly. I have one more problem: the order of the data is quite messy now. I think there are 2 ways to solve it:

1. If possible, get the data of each section separately (but then I would have to merge the CSV files of the sections into 1 file per course).
2. Can I get the data in the order that is shown on the webpage?

I think I know how to do 1, just by selecting the links of each section separately, but that seems tedious, so I need an easier and less time-consuming way. What can I do to achieve my goal?

ViestursWS · November 11, 2021, 10:32am

@Rasul Hi. The URLs are traversed in pseudo-random order to ensure the most recent data is scraped when crawling larger sites. However, you can sort the scraped data by the 'web-scraper-order' column.
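
For anyone who prefers to do the sorting outside a spreadsheet, here is a minimal Python sketch of sorting an exported CSV by the 'web-scraper-order' column. It assumes the values look like "1636450000-12" (numeric parts joined by a dash), so they are compared as numbers rather than as text. The function names and the sample rows are made up for illustration and are not part of Web Scraper.

```python
import csv
import io


def order_key(value):
    # Assumption: values look like "1636450000-12". Each numeric part is
    # compared as an integer; anything non-numeric sorts after the numbers.
    return tuple(int(p) if p.isdigit() else float("inf") for p in value.split("-"))


def sort_rows(rows, column="web-scraper-order"):
    # rows is a list of dicts, e.g. as produced by csv.DictReader.
    return sorted(rows, key=lambda row: order_key(row[column]))


def sort_csv_text(text):
    # Read CSV text, sort its rows, and return the sorted CSV as text.
    reader = csv.DictReader(io.StringIO(text))
    rows = sort_rows(list(reader))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


if __name__ == "__main__":
    # Made-up sample data standing in for an exported file.
    sample = (
        "web-scraper-order,links\n"
        "1636450000-10,c\n"
        "1636450000-2,b\n"
        "1636450000-1,a\n"
    )
    print(sort_csv_text(sample))
```

Comparing the parts as integers matters because a plain text sort would place "1636450000-10" before "1636450000-2".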

Rasul · November 30, 2021, 7:54am

Hi @ViestursWS, thank you for helping me.

I would greatly appreciate it if you could help me one last time. In addition to what your sitemap already captures, I also need the DETAILS of each lecture (video).
The site mentioned in the question has a strict HTML structure, which makes scraping easier.

Your sitemap captures 1) Title and 2) Link, but not 3) Details.
Details always exist as <p> (paragraph) elements inside a <div> with the class "lecture-text-container".
Usually a page has no such <div> and no paragraphs, and then the sitemap can ignore it.

But when a page does include the <div> and paragraphs, I want the sitemap to scrape all of those paragraphs.
Let me explain in pseudocode for the sake of clarity:

if <div class="lecture-text-container"> exists:
  save(all <p> paragraphs inside it)
else:
  do nothing (or continue)

So, the question is: how can I change your sitemap to capture ALL Details inside those <p> paragraphs?

Thank you in advance @ViestursWS

ViestursWS · November 30, 2021, 6:44pm

@Rasul Hi. Are these details only available after logging in?

In any case, you should be able to capture the necessary data with a 'Grouped' selector (selector: div.lecture-text-container p).

Practical example:

{"_id":"academy-zerotomastery-io-edit","startUrl":["https://academy.zerotomastery.io/courses/deno-the-complete-guide-zero-to-mastery/lectures/17534833"],"selectors":[{"delay":0,"id":"course-links","multiple":true,"parentSelectors":["_root"],"selector":"a.item","type":"SelectorLink"},{"delay":0,"extractAttribute":"href","id":"links","multiple":false,"parentSelectors":["course-links"],"selector":"a.download","type":"SelectorElementAttribute"},{"delay":0,"extractAttribute":"","id":"details","parentSelectors":["course-links"],"selector":"div.lecture-text-container p","type":"SelectorGroup"}]}
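
If the paragraphs need to be separated again after export, the Python sketch below may help. It assumes the 'details' column of the export holds a JSON array of objects keyed by the selector id, for example [{"details": "..."}], and that the cell is empty (or an empty array) on pages without the container. Please check this against your own export, since the exact shape is an assumption here. The function name is made up for illustration.

```python
import json


def details_paragraphs(cell, key="details"):
    """Return the paragraph texts stored in one exported 'details' cell."""
    # Pages without div.lecture-text-container are assumed to give an empty cell.
    if not cell or not cell.strip():
        return []
    try:
        items = json.loads(cell)
    except json.JSONDecodeError:
        # Not JSON after all: keep the raw text as a single paragraph.
        return [cell]
    if not isinstance(items, list):
        return [str(items)]
    paragraphs = []
    for item in items:
        # Assumption: each entry is an object keyed by the selector id.
        text = item.get(key, "") if isinstance(item, dict) else str(item)
        if text:
            paragraphs.append(text)
    return paragraphs


if __name__ == "__main__":
    # Made-up cell value standing in for one row of the export.
    cell = '[{"details": "First paragraph"}, {"details": "Second paragraph"}]'
    for paragraph in details_paragraphs(cell):
        print(paragraph)
```

Returning an empty list for pages without details matches the "do nothing" branch of the pseudocode in the question.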

Rasul · December 1, 2021, 5:37am

@ViestursWS your sitemap works great, god bless you brother