Forum Scrape Issue

Hi guys,

So I set off my first scraping job of a forum using the interactive selector, and it seems to be grabbing everything I want.

Basically:
Topic pages 1-n > then within each topic > reply pages 1-n

It captures around 179,000 of the 220,000 topics/replies, but I don't know where it's going wrong with the remaining missed posts. When I preview the data, it seems to highlight all the right stuff. I've pasted my sitemap and details below. I'm saving to CouchDB because it can't handle the job using CSV, and I can't figure out how to export from CouchDB, so I can't really see where it's gone wrong. That will most likely be another post!

Is there anything below that pops out to you as incorrect?

{"_id":"living3","startUrl":["http://sjogrensworld.org/forums/index.php?PHPSESSID=89f6f7a2480d9312151be0bdc2e3cb3c&board=1.0"],"selectors":[{"id":"paginationSubRootPages","type":"SelectorLink","parentSelectors":["_root","paginationSubRootPages"],"selector":".pagelinks a:nth-of-type(n+2)","multiple":true,"delay":0},{"id":"paginationThread","type":"SelectorLink","parentSelectors":["_root","paginationSubRootPages"],"selector":".windowbg2 span a","multiple":true,"delay":0},{"id":"paginationInThread","type":"SelectorLink","parentSelectors":["paginationThread","paginationInThread"],"selector":"a.navPages","multiple":true,"delay":0},{"id":"Replies","type":"SelectorText","parentSelectors":["paginationThread"],"selector":"div.inner","multiple":true,"regex":"","delay":0}]}

Any guidance appreciated.

Cheers,

Kris

Try this one:

{"_id":"living3","startUrl":["http://sjogrensworld.org/forums/index.php?PHPSESSID=89f6f7a2480d9312151be0bdc2e3cb3c&board=1.0"],"selectors":[{"id":"paginationSubRootPages","type":"SelectorLink","parentSelectors":["_root","paginationSubRootPages"],"selector":".pagelinks a:nth-of-type(n+2)","multiple":true,"delay":0},{"id":"paginationThread","type":"SelectorLink","parentSelectors":["_root","paginationSubRootPages"],"selector":".windowbg2 span a","multiple":true,"delay":0},{"id":"paginationInThread","type":"SelectorLink","parentSelectors":["paginationThread","paginationInThread"],"selector":"a.navPages","multiple":true,"delay":0},{"id":"Replies","type":"SelectorText","parentSelectors":["paginationThread","paginationInThread"],"selector":"div.inner","multiple":true,"regex":"","delay":0}]}

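It might be that the pagination within the thread is working, but the 'Replies' selector is not looped into it. In your sitemap, 'Replies' only has 'paginationThread' as a parent, so it only runs on the first page of each thread. If a thread has more than one page, the scraper still follows the 'a.navPages' links, but no 'Replies' selector runs on those pages, so it just iterates through them without collecting the replies. In the sitemap above I added 'paginationInThread' to the parentSelectors of 'Replies'.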
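If you want to check a sitemap for this kind of gap before running a long job, here is a rough Python sketch. It is not part of Web Scraper; the function name find_unlooped_selectors and the trimmed-down sitemaps (ids and parentSelectors only) are my own, for illustration. It assumes that a link selector listing itself in parentSelectors is a pagination loop, and that any selector sharing a parent with that loop should usually list the loop as a parent too.

```python
def find_unlooped_selectors(sitemap):
    """Return (selector_id, pagination_id) pairs where a selector shares a
    parent with a self-referencing pagination selector but does not list
    that pagination selector as one of its own parents."""
    missing = []
    selectors = sitemap["selectors"]
    for pager in selectors:
        parents = pager["parentSelectors"]
        # Assumption: a selector that is its own parent is a pagination loop.
        if pager["id"] not in parents:
            continue
        # Pages the loop starts from, excluding the loop itself.
        entry_parents = set(parents) - {pager["id"]}
        for sel in selectors:
            if sel["id"] == pager["id"]:
                continue
            sel_parents = set(sel["parentSelectors"])
            # Runs on the first page but not on the paginated pages.
            if sel_parents & entry_parents and pager["id"] not in sel_parents:
                missing.append((sel["id"], pager["id"]))
    return missing


def trimmed_sitemap(replies_parents):
    """Hypothetical helper: the sitemap from this thread, reduced to the
    fields the check needs."""
    return {"selectors": [
        {"id": "paginationSubRootPages",
         "parentSelectors": ["_root", "paginationSubRootPages"]},
        {"id": "paginationThread",
         "parentSelectors": ["_root", "paginationSubRootPages"]},
        {"id": "paginationInThread",
         "parentSelectors": ["paginationThread", "paginationInThread"]},
        {"id": "Replies", "parentSelectors": replies_parents},
    ]}


original = trimmed_sitemap(["paginationThread"])
fixed = trimmed_sitemap(["paginationThread", "paginationInThread"])

print(find_unlooped_selectors(original))  # [('Replies', 'paginationInThread')]
print(find_unlooped_selectors(fixed))     # []
```

To run it on a full sitemap, load the exported sitemap string with json.loads and pass the result to the function. Treat any hits as things to look at rather than definite errors, since some selectors are only meant to run on the first page.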

Let me know if it worked.

Hi Webber,

This worked great overall; it caught 98% of the posts!

Cheers!