Directory scrape strugglebus (courtesy of Angular?)

asteve · May 28, 2021, 6:52am

So, I'm new to Web Scraper, but have hacked around with CSS selectors in other tools that rely on them — and for the life of me can't figure out where I'm going wrong.

I'm trying to scrape a paginated directory, capturing businesses' name/description/location/website from each profile in the listing. Each item in the directory has a discrete URL, and I'm pretty sure that I've got the initial SelectorLink tree right, and also the pagination (&page=[1-252]).

The data preview looks the way it should, and when I execute the scrape, the preview window navigates through each list item and page as it should.

But always "No data scraped yet". Tested the same tree structure on another site, and it worked fine. Thinking it might have something to do with how the elements are generated in Angular? Am I out of luck?

Url: 1% for the Planet

Sitemap:
{"_id":"one-p-for-planet-directory-biz","startUrl":["https://directories.onepercentfortheplanet.org/?memberType=business&page=252"],"selectors":[{"id":"profile-link","type":"SelectorLink","parentSelectors":["_root"],"selector":"a.mat-list-item","multiple":true,"delay":0},{"id":"name","type":"SelectorText","parentSelectors":["profile-wrapper"],"selector":"div:nth-of-type(1) > section > div > h1","multiple":false,"regex":"","delay":0},{"id":"location","type":"SelectorText","parentSelectors":["profile-wrapper"],"selector":"div:nth-of-type(2) > div> dl > dt:contains('Location') + dd","multiple":false,"regex":"","delay":0},{"id":"website","type":"SelectorText","parentSelectors":["profile-wrapper"],"selector":"div:nth-of-type(2) > div > dl > dt:contains('Website') + dd","multiple":false,"regex":"","delay":0},{"id":"profile-wrapper","type":"SelectorElement","parentSelectors":["profile-link"],"selector":"app-member","multiple":true,"delay":0},{"id":"socials","type":"SelectorGroup","parentSelectors":["profile-wrapper"],"selector":"a.social-link","delay":0,"extractAttribute":"href"}]}

ViestursWS · May 28, 2021, 1:33pm

@asteve Hello.

If the sitemap has a page range defined within its start URL and there is a 'Link Selector' within the '_root' of the sitemap, the scraper will first iterate through all of the pages specified in the range to collect all of the available links, before moving to the next structure level of the sitemap. Example start URL containing a page range - Web Scraper Test Sites[1-20].
Records will only start to be returned once the scraper has reached the bottom level of the sitemap.
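
To make the page-range behaviour concrete, here is a rough Python sketch of what a range like [1-252] in a start URL expands to. This is only an illustration of the idea, not Web Scraper's actual implementation; the function name expand_page_range is made up for this example.

```python
import re


def expand_page_range(url_template: str) -> list:
    """Expand the first [start-end] range in a start URL into concrete URLs.

    Hypothetical helper for illustration only; Web Scraper does this internally.
    """
    match = re.search(r"\[(\d+)-(\d+)\]", url_template)
    if match is None:
        # No range present: the start URL is used as-is.
        return [url_template]
    start, end = int(match.group(1)), int(match.group(2))
    prefix = url_template[: match.start()]
    suffix = url_template[match.end():]
    return [f"{prefix}{page}{suffix}" for page in range(start, end + 1)]


urls = expand_page_range(
    "https://directories.onepercentfortheplanet.org/?memberType=business&page=[1-252]"
)
print(len(urls))  # 252
```

With a Link Selector under _root, every one of those listing pages is visited to collect profile links before any profile page is opened, which is why a long range can look like "no data" for a while.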

Asad · May 28, 2021, 1:44pm

I've tried to scrape a single page through the link, but it's still not working.
The best test is to try on one link.

asteve · May 28, 2021, 5:05pm

@ViestursWS - that's helpful to know, thanks! That said, I actually did perform one run where I waited for it to reach the bottom of the site map, and it still returned no results. Since then I've just been building/testing with just one page of the directory as the start URL until I can get that to work.

@Asad - that's a good call. And yeah, I'm similarly not having any luck with even a simplified scrape of just a one link/page in the directory. For example:

URL: https://directories.onepercentfortheplanet.org/profile/aubrey-hord-photography

Sitemap:
{"_id":"one-p-directory-single-item-test","startUrl":["https://directories.onepercentfortheplanet.org/profile/aubrey-hord-photography"],"selectors":[{"id":"profile-wrapper","type":"SelectorElement","parentSelectors":["_root"],"selector":"app-member","multiple":true,"delay":0},{"id":"name","type":"SelectorText","parentSelectors":["profile-wrapper"],"selector":"div:nth-of-type(1) > section > div > h1","multiple":false,"regex":"","delay":0},{"id":"location","type":"SelectorText","parentSelectors":["profile-wrapper"],"selector":"div:nth-of-type(2) > div> dl > dt:contains('Location') + dd","multiple":false,"regex":"","delay":0},{"id":"website","type":"SelectorText","parentSelectors":["profile-wrapper"],"selector":"div:nth-of-type(2) > div > dl > dt:contains('Website') + dd","multiple":false,"regex":"","delay":0}]}

Asad · May 28, 2021, 7:23pm

Hi @asteve
I have a better idea for you.
Just scrape the links' href values and put them in "startUrl":["https://google.com","https://yahoo.com","https://facebook.com"]
For example:
You have 3 links in Excel like

https://google.com
https://yahoo.com
https://facebook.com

Edit it in Notepad or wherever you like and make it like
"https://google.com","https://yahoo.com","https://facebook.com"
then just put that in [ ]
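
If the list of links is long, the Notepad step could be scripted instead. Below is a minimal Python sketch, assuming the links were pasted one per line; build_start_urls is a hypothetical helper name, and the output is just the JSON array to paste into the sitemap's "startUrl" field.

```python
import json


def build_start_urls(raw_links: str) -> str:
    """Turn links pasted one per line into a compact JSON array string."""
    # Drop blank lines and surrounding whitespace left over from copy/paste.
    urls = [line.strip() for line in raw_links.splitlines() if line.strip()]
    # Compact separators so the result matches the sitemap's own style.
    return json.dumps(urls, separators=(",", ":"))


links = """
https://google.com
https://yahoo.com
https://facebook.com
"""

print('"startUrl":' + build_start_urls(links))
```

Using json.dumps rather than joining strings by hand means any quotes or odd characters in a URL get escaped correctly.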

asteve · May 28, 2021, 9:24pm

Oh, interesting, didn't realize you could pass an array to startUrl! But I'm not sure how that would help in this situation.

It doesn't seem like Web Scraper is having trouble navigating to each profile link in the directory listing view. Even when I set the startUrl to just a single profile page, I can't seem to get the scraper to return any text fields from it at all. For example:

startUrl:

https://directories.onepercentfortheplanet.org/profile/cs-consulting

Site map:
{"_id":"one-p-directory-single-item-test","startUrl":["https://directories.onepercentfortheplanet.org/profile/cs-consulting"],"selectors":[{"id":"name","type":"SelectorText","parentSelectors":["_root"],"selector":"app-layout > div > > > div > section:nth-of-type(1)","multiple":false,"regex":"","delay":0},{"id":"website","type":"SelectorText","parentSelectors":["_root"],"selector":"dt:contains('Website') + dd","multiple":false,"regex":"","delay":0}]}

I run that scrape, bumping page load delay up to 10000 (just in case) ... no data gets returned.

asteve · May 28, 2021, 10:38pm

@Asad like, I'm wondering if it has something to do with how the HTML gets generated with all of those weird <app-... > tags and/or all of those seemingly auto-generated element attributes, e.g. _ngcontent-uba-c202 (I have no experience with Angular)

Then again, if I create a sitemap that simply scrapes Links from the parent directory list page, and I add no other child selectors below that, Web Scraper does successfully return results.

And that parent listing page is also built on Angular, so I'm not sure why the scraper would only fail when attempting to traverse similar markup on an individual profile page.

leemeng · May 29, 2021, 3:49am

Interesting problem. I don't have a solution but you can save yourself a lot of time in the future by testing a site for "scrapability" first with one or both of these methods:

1. Open an example page you want to scrape, then press Ctrl-U (view source). This is what WS will "see" too. If the data you want is not there, then WS probably can't scrape it.
2. For further confirmation, you can create an "HTML Dump" sitemap to check what WS is really scraping. The idea here is to use the URL you want, and dump the entire <body> or <html> tag to check if the onscreen data is present. You'd need to run an actual scrape, not just preview. If the data is not there, then WS probably can't scrape it.

Id: Let's dump the body
Type: HTML
Selector: body
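
As a rough sketch, the "HTML Dump" selector above could be written out as an importable sitemap like this. The field layout mirrors the sitemaps posted earlier in the thread; the type name "SelectorHTML", the selector id "body-dump", and the helper html_dump_sitemap are assumptions for illustration, so compare against a sitemap exported from your own extension before relying on it.

```python
import json


def html_dump_sitemap(sitemap_id: str, url: str) -> dict:
    """Build a one-selector sitemap that dumps the whole <body> of a page."""
    return {
        "_id": sitemap_id,
        "startUrl": [url],
        "selectors": [
            {
                "id": "body-dump",       # hypothetical selector id
                "type": "SelectorHTML",  # assumed type name for the HTML selector
                "parentSelectors": ["_root"],
                "selector": "body",
                "multiple": False,
                "regex": "",
                "delay": 0,
            }
        ],
    }


sitemap = html_dump_sitemap(
    "html-dump-test",
    "https://directories.onepercentfortheplanet.org/profile/cs-consulting",
)
print(json.dumps(sitemap, separators=(",", ":")))
```

The printed JSON is what you would paste into the import dialog; if the dumped body is empty after a real scrape, the problem is with what the scraper receives rather than with your selectors.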

KristapsWS · May 31, 2021, 6:40am

The first point is not true. Web Scraper will extract data that is generated by javascript as well.

ViestursWS · May 31, 2021, 7:46am

@asteve It seems that the website loads with a 404 error, and the Web Scraper extension perceives this as an empty page. After performing tests on Web Scraper Cloud, I managed to get results but, unfortunately, on the extension it won't be possible.
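
For anyone who wants to check for this kind of status-code problem on another site, here is a minimal Python sketch using only the standard library. It assumes a plain GET request gets the same status the browser does, which may not hold for every site; http_status and its opener parameter are hypothetical names for this example.

```python
import urllib.error
import urllib.request


def http_status(url: str, opener=urllib.request.urlopen) -> int:
    """Return the HTTP status code the server sends for a GET request.

    The opener parameter exists only so the function can be exercised
    without a network connection; by default it is urllib's urlopen.
    """
    try:
        with opener(url) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx responses; the code lives on the exception.
        return err.code
```

Calling http_status on one of the profile URLs and getting 404 back, even though the page renders fine in a browser, would be consistent with the behaviour described above.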

asteve · June 1, 2021, 4:23pm

@ViestursWS @leemeng got it, thank you so much for the help y'all!