Scraping Popup Links

Hello all,

I am running into a problem on the website capterra.com. I am attempting to scrape the name and website URL of each of the LMS providers on this page: https://www.capterra.com/learning-management-system-software/

My current sitemap is set up like this: element selector > Company title (text selector) & popup link selector

The issue I am running into is that the scraper opens every link in a new tab as if it were extracting the URL from each, but at the end of the process no data has been scraped. I have tried extending the delay to 5000 ms, but that does not seem to help. Any idea how to proceed here?

sitemap:
{
  "_id": "capterralms",
  "startUrl": ["https://www.capterra.com/learning-management-system-software/"],
  "selectors": [
    {"id": "element", "type": "SelectorElement", "parentSelectors": ["_root"], "selector": "div.card", "multiple": true, "delay": 0},
    {"id": "Company name", "type": "SelectorText", "parentSelectors": ["element"], "selector": "a.external-product-link", "multiple": false, "regex": "", "delay": 0},
    {"id": "url", "type": "SelectorPopupLink", "parentSelectors": ["element"], "selector": "a.button", "multiple": false, "delay": "5000"}
  ]
}

Change the popup link selector to "Element attribute" and then, under attribute name, put "href".

That should get you the link associated with each of your elements.
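Outside the extension, the same idea can be sketched with Python's standard library: walk the page's HTML and collect the href attribute of each card's button link, which is exactly what the "Element attribute" selector with attribute name "href" pulls out. The markup below is illustrative, not Capterra's real HTML.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href of every <a class="button"> element."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "button" in attrs.get("class", "").split():
            self.hrefs.append(attrs.get("href"))

# Hypothetical stand-in for the listing page's card markup.
sample = """
<div class="card"><a class="button" href="/external_click?id=1">Visit Website</a></div>
<div class="card"><a class="button" href="/external_click?id=2">Visit Website</a></div>
"""

collector = LinkCollector()
collector.feed(sample)
print(collector.hrefs)  # ['/external_click?id=1', '/external_click?id=2']
```

Note this reads the raw attribute value, so tracking-redirect paths come back as-is rather than as the final destination.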


Hey @bretfeig,

Thank you very much for your help. Changing the popup link selector to an element attribute selector did get me some links and fill out the CSV export, but unfortunately the links appear to be redirects from the host site to each company's individual site rather than the companies' own URLs.

Any idea how I would extract the company URL from the "Visit Website" button listed in each company card? (Companies without a "Visit Website" button link to an on-site page instead, so the difference in link types may be causing an issue?)
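One generic way to recover a destination from a tracking redirect, when the target URL is carried in a query parameter, is to decode it offline. A minimal sketch, assuming a parameter named "url" (that name and the example link are hypothetical, not Capterra's actual redirect format):

```python
from urllib.parse import urlparse, parse_qs

def unwrap_redirect(link, param="url"):
    """Return the decoded target carried in the given query parameter,
    or None if the parameter is absent. parse_qs already handles the
    percent-decoding."""
    values = parse_qs(urlparse(link).query).get(param)
    return values[0] if values else None

# Hypothetical redirect link with a URL-encoded destination.
redirect = "https://example.com/external_click?url=https%3A%2F%2Fvendor.example%2F"
print(unwrap_redirect(redirect))  # https://vendor.example/
```

If the redirect does not expose the target in its query string, the only alternative is to follow the redirect and read the final URL, which runs into the robots.txt issue discussed below.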

Image for reference: (screenshot attachment omitted)

Did a little more digging: I checked the website's robots.txt file, and it appears they specifically block scraping of external link clicks, so I seem to be out of luck on this one. Thanks for your help, @bretfeig

Robots.txt is not about allowing scraping; it's about allowing indexing by bots (search engines).

You can still legally scrape any site regardless of its robots.txt. The whole thing is really just an advisory.

Correct me if I am wrong, but doesn't this part of the robots.txt block access to the /external_click paths I need to get these URLs?

User-agent: *
Disallow: /external_click
Disallow: /external_slp_click
Disallow: /external_click_sa
Disallow: /external_click_ga
Disallow: /sem-combo
Disallow: /search

I am relatively new to the scraping scene, so if I have the wrong idea about how the robots.txt file functions, I apologize.

I don't believe so. I believe that is asking the robot/crawler not to follow those external links.
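As a concrete check, Python's standard-library robots.txt parser can evaluate the rules quoted above. It shows how a compliant crawler reads them: the listing page itself is fetchable, while the /external_click paths are disallowed for every user agent (the agent name "webscraper" below is just an example).

```python
from urllib.robotparser import RobotFileParser

# The rules quoted from capterra.com's robots.txt above.
rules = """\
User-agent: *
Disallow: /external_click
Disallow: /external_slp_click
Disallow: /external_click_sa
Disallow: /external_click_ga
Disallow: /sem-combo
Disallow: /search
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# The product listing page matches no Disallow rule, so it is allowed.
print(rp.can_fetch("webscraper", "https://www.capterra.com/learning-management-system-software/"))  # True

# The external-click tracking path is disallowed for all user agents.
print(rp.can_fetch("webscraper", "https://www.capterra.com/external_click?x=1"))  # False
```

Whether a given scraper honors these rules is up to that tool; the file itself only advises, as noted above.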