Page with load more pagination abruptly closes during scraping

desertsun · August 25, 2020, 8:26pm

Hello,

I am trying to scrape the headlines and time stamps for all articles listed by a news website's search results.

This is the link I would like to scrape: https://www.cbc.ca/search?q=racism&section=news&sortOrder=date&media=all

The page has a Load More button, which I have been able to get working with an Element Click selector. However, whenever I run the scraper, it ends abruptly after loading roughly 10% of the search results, without scraping any data.

{"_id":"cbc5","startUrl":["https://www.cbc.ca/search?q=racism&section=news&sortOrder=date&media=all"],"selectors":[{"id":"main1","type":"SelectorElementClick","parentSelectors":["_root"],"selector":"div.contentListCards","multiple":true,"delay":"3000","clickElementSelector":"div > button[class^='sclt-loadmore']","clickType":"clickMore","discardInitialElements":"do-not-discard","clickElementUniquenessType":"uniqueHTML"},{"id":"sub","type":"SelectorElement","parentSelectors":["main1"],"selector":"div.card-content","multiple":true,"delay":0},{"id":"info","type":"SelectorGroup","parentSelectors":["sub"],"selector":"h3, time","delay":0,"extractAttribute":""}]}

Could somebody please help me get this thing working? I have been stuck on this for quite some time now.

Thank you!

leemeng · October 3, 2020, 2:13pm

This one was interesting; I wanted to figure out a way to limit the Load More. Try the sitemap below, which will stop at 200 results. I recommend a Page load delay of at least 5000 ms.

This sitemap will click all the Load More buttons first, so it might look like nothing much is happening for a while. The results are actually loading below the screen, as indicated by the "Showing results 1 – XXX of" text, which will change every few seconds.

If you want more/fewer pages, you'll need to do some math to figure out which Load More to stop at, and then change the Load More selector:
div > button[class^='sclt-loadmore']:not([class*='loadmore20'])

Each Load More loads an additional 10 results. In this example, it will stop at loadmore20, so 20 x 10 = 200 results.
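
If you'd rather not do the math by hand, here is a minimal Python sketch of the same calculation. It assumes that each Load More adds 10 results and that the button classes keep the `loadmoreN` numbering described above; the function name `load_more_selector` is made up for illustration and is not part of Web Scraper.

```python
import math

# Assumption from the post above: each Load More click adds 10 results.
RESULTS_PER_CLICK = 10


def load_more_selector(target_results: int) -> str:
    """Build a click selector that stops at the Load More button for target_results.

    Hypothetical helper: it only rebuilds the selector string shown above,
    swapping in the number of the Load More button to stop at.
    """
    if target_results < RESULTS_PER_CLICK:
        raise ValueError("target_results must be at least one batch of results")
    # Round up so a target like 205 stops at loadmore21 rather than loadmore20.
    stop_at = math.ceil(target_results / RESULTS_PER_CLICK)
    return (
        "div > button[class^='sclt-loadmore']"
        f":not([class*='loadmore{stop_at}'])"
    )


# 200 results -> stop at loadmore20, which matches the selector in the sitemap.
print(load_more_selector(200))
```

Paste the printed string into the clickElementSelector field of the "Separate Load More" selector. If the site changes its class names or batch size, the numbers will need adjusting.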

Sitemap:
{"_id":"forum-cbc-load-more","startUrl":["https://www.cbc.ca/search?q=quebec%20tourism&section=all&sortOrder=relevance&media=all"],"selectors":[{"id":"Separate Load More","type":"SelectorElementClick","parentSelectors":["_root"],"selector":" div.contentListCards","multiple":false,"delay":"3700","clickElementSelector":"div > button[class^='sclt-loadmore']:not([class*='loadmore20'])","clickType":"clickMore","discardInitialElements":"do-not-discard","clickElementUniquenessType":"uniqueHTML"},{"id":"Row wrappers","type":"SelectorElement","parentSelectors":["_root"],"selector":"div.contentListCards a.card","multiple":true,"delay":0},{"id":"Title","type":"SelectorText","parentSelectors":["Row wrappers"],"selector":"h3","multiple":false,"regex":"","delay":0},{"id":"Time","type":"SelectorText","parentSelectors":["Row wrappers"],"selector":"time","multiple":false,"regex":"","delay":0},{"id":"Link","type":"SelectorLink","parentSelectors":["Row wrappers"],"selector":"_parent_","multiple":false,"delay":0}]}

valery · October 5, 2020, 11:29am

Hello!
I have a similar problem: why does parsing stop on page 3?
Sitemap:
{"_id":"donna","startUrl":["http://www.donnaflora.ru/?act=catalog_result&IDFamily=19"],"selectors":[{"id":"next","type":"SelectorElementClick","parentSelectors":["_root"],"selector":"td[height] td[valign='top']","multiple":false,"delay":2000,"clickElementSelector":"a:nth-of-type(12)","clickType":"clickMore","discardInitialElements":"do-not-discard","clickElementUniquenessType":"uniqueCSSSelector"},{"id":"name","type":"SelectorText","parentSelectors":["next"],"selector":"tr[valign]:nth-of-type(n+2) a b","multiple":false,"regex":"","delay":0}]}

I checked, and the error shows up on several sites. I wanted to buy the program, but with this error it is useless.

valery · October 6, 2020, 9:55am

I get this error in the Firefox console:

{"url":"http://www.donnaflora.ru/?act=catalog_result&IDFamily=19","parentSelector":"_root","sitemapName":"donna","driver":"chrometab","error":"PAGE_REDIRECTED_DURING_DATA_EXTRACTION_AFTER_RETRY","stack":"handle/<@moz-extension://ec68cf88-1ec9-40bb-b254-2cf76275a14e/background_script.js:20889:35\no@moz-extension://ec68cf88-1ec9-40bb-b254-2cf76275a14e/background_script.js:20847:25\n","domainName":"www.donnaflora.ru","timestamp":1601977737,"level_name":"NOTICE","message":"Job execution failed"} background_script.js:66:95

{"timestamp":1601977737,"level_name":"PROFILE","message":"11733 ms job execution"}

How can I fix it?
Firefox 81.0.1 (64-bit)