Can't scrape data past the first page on a page that automatically loads more results

Hello,

So I've learnt from the tutorial how to scrape pages with a "Load more" button, but the website I am trying to scrape automatically loads more results when you scroll down. There's no "Load more" button.

My sitemap only scrapes the first page. How can I make it scrape all pages?

Best regards,

URL: https://cloud.withgoogle.com/partners/?products=GSUITE&regions=NA&sort-type=RELEVANCE

Sitemap:
{"_id":"partners","startUrl":["https://cloud.withgoogle.com/partners/?products=GSUITE&regions=NA&sort-type=RELEVANCE"],"selectors":[{"id":"partner","type":"SelectorLink","selector":"a.card__box","parentSelectors":["_root"],"multiple":true,"delay":"10000"},{"id":"item","type":"SelectorElement","selector":"div.TwVPnd div.v3oWjd","parentSelectors":["partner"],"multiple":false,"delay":0},{"id":"website","type":"SelectorLink","selector":"div.ZGlQsc:nth-of-type(1) a.dFnaUd","parentSelectors":["item"],"multiple":false,"delay":0},{"id":"email","type":"SelectorText","selector":"div.ZGlQsc:nth-of-type(2) a.dFnaUd","parentSelectors":["item"],"multiple":false,"regex":"","delay":0},{"id":"phone number","type":"SelectorText","selector":"div.ZGlQsc:nth-of-type(3) a.dFnaUd","parentSelectors":["item"],"multiple":false,"regex":"","delay":0},{"id":"address","type":"SelectorText","selector":"div.ZGlQsc:nth-of-type(4) a.dFnaUd","parentSelectors":["item"],"multiple":false,"regex":"","delay":0}]}

You have to use the Element Scroll selector. I set the delay to 2000; it might need to be longer. I didn't have time to troubleshoot a full scrape.

After that I used a regular Link selector leading into your item selectors. I hope it works. If not, there is always @iconoclast :slight_smile:

{"_id":"partners","startUrl":["https://cloud.withgoogle.com/partners/?products=GSUITE&regions=NA&sort-type=RELEVANCE"],"selectors":[{"id":"partner","type":"SelectorElementScroll","selector":"a.card__box","parentSelectors":["_root"],"multiple":true,"delay":"2000"},{"id":"item","type":"SelectorLink","selector":"parent","parentSelectors":["partner"],"multiple":false,"delay":0},{"id":"website","type":"SelectorLink","selector":"div.ZGlQsc:nth-of-type(1) a.dFnaUd","parentSelectors":["item"],"multiple":false,"delay":0},{"id":"email","type":"SelectorText","selector":"div.ZGlQsc:nth-of-type(2) a.dFnaUd","parentSelectors":["item"],"multiple":false,"regex":"","delay":0},{"id":"phone number","type":"SelectorText","selector":"div.ZGlQsc:nth-of-type(3) a.dFnaUd","parentSelectors":["item"],"multiple":false,"regex":"","delay":0},{"id":"address","type":"SelectorText","selector":"div.ZGlQsc:nth-of-type(4) a.dFnaUd","parentSelectors":["item"],"multiple":false,"regex":"","delay":0}]}

It's best to limit your search to a particular region to narrow down the results, as scraping 2K+ records will take a long time.

This website has a bug: once you set the region and another filter, the URL changes, but when you hit F5 to refresh the page, the filters get dropped even though the URL still contains them.

The easiest way to scrape it is simply to set the filters and scroll down manually until the results you want are shown, then, with only a Link selector for the pages and Text selectors inside it, press Preview Data.
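In sitemap terms that would be roughly the sketch below, reusing the selectors from your original sitemap but putting the detail selectors straight under the Link selector; the detail-page selectors are copied as-is and may need re-picking, since I haven't re-checked them against the live pages:
{"_id":"partners-manual-scroll","startUrl":["https://cloud.withgoogle.com/partners/?products=GSUITE&regions=NA&sort-type=RELEVANCE"],"selectors":[{"id":"partner","type":"SelectorLink","selector":"a.card__box","parentSelectors":["_root"],"multiple":true,"delay":0},{"id":"website","type":"SelectorLink","selector":"div.ZGlQsc:nth-of-type(1) a.dFnaUd","parentSelectors":["partner"],"multiple":false,"delay":0},{"id":"email","type":"SelectorText","selector":"div.ZGlQsc:nth-of-type(2) a.dFnaUd","parentSelectors":["partner"],"multiple":false,"regex":"","delay":0},{"id":"phone number","type":"SelectorText","selector":"div.ZGlQsc:nth-of-type(3) a.dFnaUd","parentSelectors":["partner"],"multiple":false,"regex":"","delay":0},{"id":"address","type":"SelectorText","selector":"div.ZGlQsc:nth-of-type(4) a.dFnaUd","parentSelectors":["partner"],"multiple":false,"regex":"","delay":0}]}
The scrolling itself is done by hand in the open tab, so there is no Scroll selector at all.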

The harder way to work around the bug is to have a Click selector that applies the filters for each scrape and only then scrapes the results.
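A rough sketch of that approach is below. The clickElementSelector (div.filters button) is a placeholder I have not verified against the live page, so the actual filter controls would need to be picked with the selector tool, and the Element Click options may differ slightly between extension versions:
{"_id":"partners-click-filters","startUrl":["https://cloud.withgoogle.com/partners/"],"selectors":[{"id":"cards-after-filters","type":"SelectorElementClick","selector":"a.card__box","parentSelectors":["_root"],"multiple":true,"delay":"2000","clickElementSelector":"div.filters button","clickType":"clickOnce","discardInitialElements":false,"clickElementUniquenessType":"uniqueText"},{"id":"item","type":"SelectorLink","selector":"_parent_","parentSelectors":["cards-after-filters"],"multiple":false,"delay":0}]}
The idea is that the Element Click selector clicks the filter controls first and only afterwards returns the result cards, so the filters are applied inside the scraping session instead of relying on the URL.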

Sorry for the late reply. I increased the delay from 2000 to 5000, and it managed to scrape the links for all 500+ results, but it couldn't scrape the Text selectors.

Sorry for the late reply; I only have 500+ records. I scrolled down manually and kept only the Link selector with the Text selectors inside it, but I still couldn't get anything after pressing Preview Data. I also tried adding an Element selector around the Text selectors, to no avail.

Hi Joel,

I'd totally forgotten about limiting the record count for the Scroll selector. It can be done with the :nth-of-type(-n+#) CSS selector (a negative child range), where # is the number of records you want to scrape.

Here's an example that scrapes exactly 150 records:
{"_id":"withgoogle","startUrl":["https://cloud.withgoogle.com/partners/?products=GSUITE&regions=NA&sort-type=RELEVANCE"],"selectors":[{"id":"scroll","type":"SelectorElementScroll","selector":"div.card:nth-of-type(-n+150) h3.h-u-font-weight-medium","parentSelectors":["_root"],"multiple":true,"delay":"2000"},{"id":"txt","type":"SelectorText","selector":"_parent_","parentSelectors":["scroll"],"multiple":false,"regex":"","delay":0}]}

Just replace the 150 in -n+150 with the number of records you want to scrape, and it should work.
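For example, to stop after 300 records the scroll selector would become:
div.card:nth-of-type(-n+300) h3.h-u-font-weight-medium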

@Joel - Try this sitemap. It will scrape name/link/phone/e-mail etc. Let me know if it doesn't work.

Also - check out the Chrome extension "Instant Data Scraper". It's not nearly as powerful as webscraper.io, but for quick and dirty scrapes where you don't need to click through into each link, it works instantaneously.

{"_id":"delete-me","startUrl":["https://cloud.withgoogle.com/partners/?products=GSUITE&regions=NA&sort-type=RELEVANCE"],"selectors":[{"id":"partner","type":"SelectorElementScroll","selector":"a.card__box","parentSelectors":["_root"],"multiple":true,"delay":"2000"},{"id":"item","type":"SelectorLink","selector":"parent","parentSelectors":["partner"],"multiple":false,"delay":0},{"id":"website","type":"SelectorLink","selector":"div.ZGlQsc:nth-of-type(1) a.dFnaUd","parentSelectors":["item"],"multiple":false,"delay":0},{"id":"email","type":"SelectorText","selector":"div.ZGlQsc:nth-of-type(2) a.dFnaUd","parentSelectors":["item"],"multiple":false,"regex":"","delay":0},{"id":"phone number","type":"SelectorText","selector":"div.ZGlQsc:nth-of-type(3) a.dFnaUd","parentSelectors":["item"],"multiple":false,"regex":"","delay":0},{"id":"address","type":"SelectorText","selector":"div.ZGlQsc:nth-of-type(4) a.dFnaUd","parentSelectors":["item"],"multiple":false,"regex":"","delay":0}]}

Hi Anton,

I still need it to click through to all 150 records so that I can scrape the website, phone number, and email.

This sitemap is just giving me the names.

Best regards,

Hi Bret,

Unfortunately, this sitemap still doesn't scrape the data. Let me check out "instant data scraper".

Best regards,

@Joel,

I've sent you an example sitemap that will scroll down until 150 records are shown, then scrape them.

I thought you would figure out how to change the Text selector into a Link selector, so that once 150 records are shown they all get scraped in one job.

I've changed it for you:
{"_id":"withgoogle2","startUrl":["https://cloud.withgoogle.com/partners/?products=GSUITE&regions=NA&sort-type=RELEVANCE"],"selectors":[{"id":"scroll","type":"SelectorElementScroll","selector":"div.card:nth-of-type(-n+150)","parentSelectors":["_root"],"multiple":true,"delay":"2000"},{"id":"lnk","type":"SelectorLink","selector":"a.card__box","parentSelectors":["scroll"],"multiple":true,"delay":"2000"},{"id":"txt2","type":"SelectorText","selector":"div.TwVPnd div.v3oWjd","parentSelectors":["lnk"],"multiple":false,"regex":"","delay":0}]}

Please add the remaining Text selectors you need (phone/website/etc.) inside the Link selector.
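For reference, the completed sitemap might look something like this; the extra detail selectors are copied from your original sitemap (and the _id is arbitrary), so they may need re-picking if the detail pages have changed:
{"_id":"withgoogle3","startUrl":["https://cloud.withgoogle.com/partners/?products=GSUITE&regions=NA&sort-type=RELEVANCE"],"selectors":[{"id":"scroll","type":"SelectorElementScroll","selector":"div.card:nth-of-type(-n+150)","parentSelectors":["_root"],"multiple":true,"delay":"2000"},{"id":"lnk","type":"SelectorLink","selector":"a.card__box","parentSelectors":["scroll"],"multiple":true,"delay":"2000"},{"id":"txt2","type":"SelectorText","selector":"div.TwVPnd div.v3oWjd","parentSelectors":["lnk"],"multiple":false,"regex":"","delay":0},{"id":"website","type":"SelectorLink","selector":"div.ZGlQsc:nth-of-type(1) a.dFnaUd","parentSelectors":["lnk"],"multiple":false,"delay":0},{"id":"email","type":"SelectorText","selector":"div.ZGlQsc:nth-of-type(2) a.dFnaUd","parentSelectors":["lnk"],"multiple":false,"regex":"","delay":0},{"id":"phone number","type":"SelectorText","selector":"div.ZGlQsc:nth-of-type(3) a.dFnaUd","parentSelectors":["lnk"],"multiple":false,"regex":"","delay":0},{"id":"address","type":"SelectorText","selector":"div.ZGlQsc:nth-of-type(4) a.dFnaUd","parentSelectors":["lnk"],"multiple":false,"regex":"","delay":0}]}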

This sitemap is working great. Much appreciated. I can see where I was going wrong.

Best regards,
