Limiting the number of pages scraped with unlimited pagination

It would be a good feature to add: limiting the number of pages to scrape when using the "Pagination" type.

For example, let's say you try to scrape Google search results... What if there are more than 100K results?
It would be better to be able to set the limit manually: 50 pages, 200 pages, etc.
I hope I've made myself understood.

It is possible to limit page navigation with either the :has or the :not selector. In the example below, I am scraping only the first 3 pages of a Google search for "web-scraping" (a total of 30 results):

{"_id":"google_search_example","startUrl":["https://www.google.com/search?q=web-scraping"],"selectors":[{"id":"Result rows","multiple":true,"parentSelectors":["_root","Click Next until page 3"],"selector":"div#search > div[data-hveid] > div[data-async-context] > div","type":"SelectorElement"},{"id":"Title n Link","linkType":"linkFromHref","multiple":false,"parentSelectors":["Result rows"],"selector":"a","type":"SelectorLink"},{"id":"Desc","multiple":false,"parentSelectors":["Result rows"],"regex":"","selector":"span:nth-of-type(2)","type":"SelectorText"},{"id":"Click Next until page 3","linkType":"linkFromHref","multiple":false,"parentSelectors":["_root","Click Next until page 3"],"selector":"div[role=\"navigation\"]:contains(\"Page navigation\"):has(a[aria-label=\"Page 3\"]) a#pnnext","type":"SelectorLink"}]}

Here, :has acts like an "if" condition: Web Scraper (WS) keeps clicking Next as long as the element a[aria-label="Page 3"] is present in the navigation bar. Once the browser reaches page 3, that element disappears, and WS stops scraping.
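If it helps to see that gating behaviour outside the extension, here is a minimal Python sketch. It assumes the bs4 package (whose .select_one() is backed by soupsieve, which supports :has()), and the HTML snippets are simplified mock-ups of a pagination bar, not Google's real markup:

from bs4 import BeautifulSoup

# Simplified mock-ups: on "page 2" the link to page 3 still exists; on "page 3" it does not
page_2_nav = '<div role="navigation"><a aria-label="Page 3">3</a> <a id="pnnext">Next</a></div>'
page_3_nav = '<div role="navigation"><a aria-label="Page 4">4</a> <a id="pnnext">Next</a></div>'

# Same idea as the sitemap selector, minus the :contains("Page navigation") part
selector = 'div[role="navigation"]:has(a[aria-label="Page 3"]) a#pnnext'

print(BeautifulSoup(page_2_nav, "html.parser").select_one(selector) is not None)  # True  -> Next is followed
print(BeautifulSoup(page_3_nav, "html.parser").select_one(selector) is not None)  # False -> pagination stops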


Thanks, will check it later...

Thanks... That worked for the English Google version; I changed it to fit my home language.
One more question... what about this URL?

How would I list only the first 3 pages, for example? Please write me an example pagination string.

I have posted a solution on the original page:

Here I am using the :not selector:

table-api:not(:contains('Showing 81 - 100 of')) ul.pagination > li:nth-child(13) > a > i

which means, "keep clicking Next as long as the text 'Showing 81 - 100 of' does NOT appear on the page".
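To make that logic concrete, here is a short Python sketch of the :not() gate. It assumes bs4/soupsieve, where :-soup-contains() is the equivalent of jQuery's :contains(), and the HTML is a heavily simplified mock-up of the page with made-up totals (the li:nth-child(13) > a > i tail is omitted):

from bs4 import BeautifulSoup

mid_page  = '<table-api>Showing 61 - 80 of 100 <ul class="pagination"></ul></table-api>'
last_page = '<table-api>Showing 81 - 100 of 100 <ul class="pagination"></ul></table-api>'

# Just the :not() condition from the paginator selector
selector = 'table-api:not(:-soup-contains("Showing 81 - 100 of"))'

print(BeautifulSoup(mid_page, "html.parser").select_one(selector) is not None)   # True  -> keep clicking Next
print(BeautifulSoup(last_page, "html.parser").select_one(selector) is not None)  # False -> stop scraping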


That worked great. I was just curious how to do it.
The problem was to create a pagination string that runs until an exact event occurs.
One more question: why was there a 1-second pause?

I find that the paginator sometimes works too quickly so I put in delays to ensure that pages fully load. You can remove them if not needed.

Hi! Could you please help me figure out how to make the scraper scrape only the first 100 pages of this site?
Here is my current sitemap:

{"_id":"covid_search_page","startUrl":["https://discover.abc.net.au/index.html?siteTitle=news#/?query=covid&refinementList%5Bsite.title%5D%5B0%5D=ABC%20News"],"selectors":[{"id":"news_links","parentSelectors":["_root","pagination"],"type":"SelectorLink","selector":"._content_52s80_22 a","multiple":true,"linkType":"linkFromHref"},{"id":"title","parentSelectors":["news_links"],"type":"SelectorText","selector":"h1","multiple":false,"regex":""},{"id":"date","parentSelectors":["news_links"],"type":"SelectorText","selector":".Headline_meta__XZ4ek div[data-component='Dateline']","multiple":false,"regex":""},{"id":"news_text","parentSelectors":["news_links"],"type":"SelectorGroup","selector":".LayoutContainer_container__O1X6V > div > div > p.paragraph_paragraph__3Hrfa, div:nth-of-type(n+2) p.paragraph_paragraph__3Hrfa","extractAttribute":""},{"id":"pagination","parentSelectors":["_root","pagination"],"paginationType":"auto","selector":"._arrow_1hrmo_49 svg[data-component='ChevronRight']","type":"SelectorPagination"}]}

I would really appreciate it!

Hi,
You don't need pagination... I used the range parameter [1-4] in the start URL to scrape pages 1 through 4. So if you need to scrape up to page 100, just change that parameter to [1-100] in "Edit metadata".

here is the sample:

{"_id":"covid_search_page","startUrl":["https://discover.abc.net.au/index.html?siteTitle=news#/?query=covid&refinementList%5Bsite.title%5D%5B0%5D=ABC%20News&page=[1-4]"],"selectors":[{"id":"news_links","linkType":"linkFromHref","multiple":true,"parentSelectors":["_root"],"selector":"div[data-component*=\"SearchHits\"] div[data-component*=\"CardLayout\"] a[data-component*=\"Link\"]:not(:has(img))","type":"SelectorLink"},{"id":"title","multiple":false,"parentSelectors":["news_links"],"regex":"","selector":"h1","type":"SelectorText"},{"id":"date","multiple":false,"parentSelectors":["news_links"],"regex":"","selector":"div[data-component='Dateline']","type":"SelectorText"},{"extractAttribute":"","id":"news_text","parentSelectors":["news_links"],"selector":".LayoutContainer_container__O1X6V > div > div > p.paragraph_paragraph__3Hrfa, div:nth-of-type(n+2) p.paragraph_paragraph__3Hrfa","type":"SelectorGroup"}]}

Hi,
Thanks for your reply!
I think setting the page parameter doesn't really work for this website: no matter which page you set as the start URL, the scraper scrapes page 1. I think this is a problem with the website itself. Below are some test screenshots.



The results (scraped data) are all from page one...

Your sitemap's basic structure looks OK, but most of your selectors will eventually fail because they rely on randomly generated class-name suffixes, e.g. paragraph__3Hrfa.
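To see why, here is a quick Python sketch (bs4 assumed; the second class name is made up to stand in for what a site redeploy might generate). The selector hard-coded to the hash stops matching, while a substring attribute selector keeps working:

from bs4 import BeautifulSoup

current_build = '<p class="paragraph_paragraph__3Hrfa">Some article text</p>'
future_build  = '<p class="paragraph_paragraph__9xQ2w">Some article text</p>'  # made-up new hash

brittle = 'p.paragraph_paragraph__3Hrfa'     # tied to the generated hash
robust  = "p[class*='paragraph_paragraph']"  # matches whatever the hash suffix becomes

for html in (current_build, future_build):
    soup = BeautifulSoup(html, "html.parser")
    print(soup.select_one(brittle) is not None, soup.select_one(robust) is not None)
# current build: True True
# future build:  False True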

Try the improved version below, which has more robust selectors and a paginator that stops at page 3. I suggest you run it once first with a low number of pages to check that it gets the results you need. I tested with a Page load delay of 4000.

This pagination selector is quite complex, but you only need to change the number within :contains('xx') to the page number you want to stop at.
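As a sanity check of that gate, here is a small Python sketch (bs4/soupsieve assumed, with :-soup-contains() replacing jQuery's :contains(), and heavily simplified mock markup): the Next button only matches while the current-page indicator is not "3".

from bs4 import BeautifulSoup

page_2 = ('<nav><div role="presentation"><ul>'
          '<li aria-current="page">2</li><li>3</li></ul>'
          '<button data-component="Pagination__Next">Next</button></div></nav>')
page_3 = ('<nav><div role="presentation"><ul>'
          '<li>2</li><li aria-current="page">3</li></ul>'
          '<button data-component="Pagination__Next">Next</button></div></nav>')

selector = ('div[role="presentation"]'
            ':not(:has(li[aria-current]:-soup-contains("3")))'
            ' button[data-component="Pagination__Next"]')

print(BeautifulSoup(page_2, "html.parser").select_one(selector) is not None)  # True  -> keep paginating
print(BeautifulSoup(page_3, "html.parser").select_one(selector) is not None)  # False -> stop at page 3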


Sitemap:

{"_id":"covid_search_page_v2","startUrl":["https://discover.abc.net.au/index.html?siteTitle=news#/?query=covid&refinementList%5Bsite.title%5D%5B0%5D=ABC%20News"],"selectors":[{"id":"Delay after paginate","parentSelectors":["pagination"],"type":"SelectorElementClick","clickActionType":"real","clickElementSelector":"head title","clickElementUniquenessType":"uniqueText","clickType":"clickOnce","delay":1500,"discardInitialElements":"do-not-discard","multiple":false,"selector":"head title"},{"id":"news_links","parentSelectors":["_root","pagination"],"type":"SelectorLink","selector":"div > div[class^='_content'] a[data-component]","multiple":true,"linkType":"linkFromHref"},{"id":"title","parentSelectors":["news_links"],"type":"SelectorText","selector":"h1","multiple":false,"regex":""},{"id":"date","parentSelectors":["news_links"],"type":"SelectorText","selector":"header div[data-component='Dateline']","multiple":false,"regex":""},{"id":"Result page number","parentSelectors":["_root","pagination"],"type":"SelectorText","selector":"nav div[role='presentation'] li[aria-current]","multiple":false,"regex":""},{"id":"pagination","parentSelectors":["_root","pagination"],"paginationType":"auto","selector":"nav div[role='presentation']:not(:has(li[aria-current]:contains('3'))) button[data-component='Pagination__Next']","type":"SelectorPagination"},{"id":"news_text_container","parentSelectors":["news_links"],"type":"SelectorElement","selector":"div[class*='LayoutContainer_container'] > div","multiple":false},{"id":"news_text","parentSelectors":["news_text_container"],"type":"SelectorGroup","selector":"div > p[class*='paragraph_paragraph'], div > h2","extractAttribute":""}]}

Hi! This works great! It must be a really brilliant solution, but since I'm not so familiar with CSS selectors, I can't really follow the improved selectors you have set. :slightly_frowning_face: I could only use the automatically detected selectors generated by the scraper.
Thank you so much!!
In addition, do you mind me citing the sitemap you created in my project report? :pray:

Glad to help; please go ahead and cite it, and all the best with the project. If you plan to do more web scraping in the future, it would be useful to learn more about CSS selectors. A good reference is W3Schools' CSS Selector Reference.


Thanks a lot!
I will definitely check out this reference and learn more about CSS selectors. Thanks for your suggestion!