Hello, I'm trying to scrape this entire profile: the title, size, and magnet link for each torrent. It has over 6000 pages, though, which I think crashes the process and prevents me from scraping it all, so nothing gets saved. Is there a technique or method that can scrape 500 or so pages at a time, save the data, scrape 500 more, append the additional data, and so on? Or is there a way to remove performance limits? I have a pretty decent amount of resources, so that would be unlikely to cause any issues.
I saw this post, but unfortunately it didn't get any responses.
Url: https://torrentgalaxy.to/profile/TGxTV/torrents/6562
I'm doing the pagination backwards (oldest to newest) for simplicity: going from recent to old pages causes the "Next" button to change position, so it was just simpler this way.
Sitemap:
{
  "_id": "torrentgalaxytgxtv",
  "startUrl": ["https://torrentgalaxy.to/profile/TGxTV/torrents/6562"],
  "selectors": [
    {
      "id": "pagination",
      "parentSelectors": ["_root", "pagination"],
      "paginationType": "clickMore",
      "selector": ".tab-pane > nav li:nth-of-type(1) a:contains(\"Previous\")",
      "type": "SelectorPagination"
    },
    {
      "id": "Title",
      "parentSelectors": ["Row"],
      "type": "SelectorText",
      "selector": "a b",
      "multiple": false,
      "regex": ""
    },
    {
      "id": "Magnet",
      "parentSelectors": ["Row"],
      "type": "SelectorElementAttribute",
      "selector": ".tgxtablecell a[role]",
      "multiple": false,
      "extractAttribute": "href"
    },
    {
      "id": "Size",
      "parentSelectors": ["Row"],
      "type": "SelectorText",
      "selector": "span.badge.txlight",
      "multiple": false,
      "regex": ""
    },
    {
      "id": "Row",
      "parentSelectors": ["pagination"],
      "type": "SelectorElement",
      "selector": "div.tgxtablerow:nth-of-type(n+2)",
      "multiple": true
    }
  ]
}
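In case a standalone script is an acceptable fallback, here is a rough sketch of the batch-and-checkpoint approach described above, in Python with `requests` and BeautifulSoup. Since the page number appears directly in the URL, each page can be fetched independently (no "Previous" clicking needed), 500 pages are scraped per batch, rows are appended to a CSV after each batch, and a checkpoint file records progress so an interrupted run can resume. The CSS selectors mirror the sitemap above, but the exact HTML structure of the listing pages (and whether they render without JavaScript) is an assumption; the file names are placeholders.

```python
# Batched scraping sketch: fetch BATCH pages, append rows to a CSV,
# write a checkpoint, repeat. Resumable after a crash or interruption.
import csv
import os
import time

import requests
from bs4 import BeautifulSoup

BASE = "https://torrentgalaxy.to/profile/TGxTV/torrents/{}"  # page number in URL
OUT = "tgxtv.csv"          # accumulated results (placeholder name)
CKPT = "tgxtv.checkpoint"  # last fully saved page (placeholder name)
BATCH = 500


def parse_rows(html):
    """Extract (title, magnet, size) tuples from one listing page.

    Selectors taken from the sitemap; HTML layout is an assumption.
    """
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for row in soup.select("div.tgxtablerow"):
        title = row.select_one("a b")
        magnet = row.select_one(".tgxtablecell a[role]")
        size = row.select_one("span.badge.txlight")
        if title and magnet and size:
            rows.append((title.get_text(strip=True),
                         magnet.get("href", ""),
                         size.get_text(strip=True)))
    return rows


def scrape(first=6562, last=1):
    # Resume from the checkpoint if a previous run was interrupted.
    page = int(open(CKPT).read()) if os.path.exists(CKPT) else first
    while page >= last:
        stop = max(last, page - BATCH + 1)
        with open(OUT, "a", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            for p in range(page, stop - 1, -1):  # backwards, like the sitemap
                resp = requests.get(BASE.format(p), timeout=30)
                resp.raise_for_status()
                writer.writerows(parse_rows(resp.text))
                time.sleep(1)  # be polite; avoid hammering the server
        page = stop - 1
        with open(CKPT, "w") as f:
            f.write(str(page))  # batch is on disk; safe to restart from here


if __name__ == "__main__":
    scrape()
```

Because each batch is flushed to disk before the checkpoint advances, a crash costs at most one batch of work rather than the whole 6000-page run.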