How to scrape a wikia discussion forum

Describe the problem.

Url: https://harrypotter.wikia.com/d/f

Sitemap: wikiadiscussion
{id:"sitemap code"}

I'm trying to use Web Scraper to collect data from Wikia discussion forums for a school project. I managed to add the topic of each post to the sitemap, but when I try to add the content of the post it just crashes. I was wondering if anyone has a fix for this. For my project I need to collect the topic, content and timestamp of each post in the forum.

Here is the sitemap for you:

{"_id":"wikia-discussion-help","startUrl":["http://harrypotter.wikia.com/d/f?page=[1-1250]"],"selectors":[{"id":"items","type":"SelectorLink","parentSelectors":["_root"],"selector":"a.post-content__link","multiple":true,"delay":0},{"id":"title","type":"SelectorText","parentSelectors":["items"],"selector":"h1.post-content__title","multiple":false,"regex":"","delay":0},{"id":"by-","type":"SelectorText","parentSelectors":["items"],"selector":"a.user-avatar__username","multiple":false,"regex":"","delay":0},{"id":"details","type":"SelectorText","parentSelectors":["items"],"selector":"div.post-content","multiple":false,"regex":"","delay":0}]}

There are more than 1250 pages, so the easiest approach is to scrape them batch by batch. Go to Edit metadata and change the page number range in the start URL, for example to 1-200, and once that has finished, to 201-400, and so on. That way you are less likely to be blocked by the website.
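The batching described above can be sketched in Python. This is only an illustration: the function name `batched_start_urls` and the batch size of 200 are my own choices, and the output is simply the list of start URLs you would paste into the sitemap one at a time.

```python
def batched_start_urls(base_url, first_page, last_page, batch_size):
    """Return Web Scraper range URLs covering first_page..last_page without overlap."""
    urls = []
    for start in range(first_page, last_page + 1, batch_size):
        # The last batch may be shorter than batch_size.
        end = min(start + batch_size - 1, last_page)
        urls.append(f"{base_url}?page=[{start}-{end}]")
    return urls


if __name__ == "__main__":
    for url in batched_start_urls("http://harrypotter.wikia.com/d/f", 1, 1250, 200):
        print(url)
```

Each range starts one page after the previous one ends, so no page is scraped twice.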

Make minor tweaks depending on what text you need. Once the scraping has finished, split the text in the "details" column using Excel.

hope I was helpful

Hi there,

Thank you so much for the help. I'm still new to Web Scraper, so I'm a little confused about how to use the above sitemap. Do I import the sitemap and copy the code into the JSON field? Also, by changing the metadata, do you mean changing the page range inside the URL? Aside from that, do I need to do anything else to get it to work?

Really grateful for the assistance!

OK, no problem. I'm still a beginner as well. :)

Discard the first sitemap; below is the final version for you. It should work as is. These are the things it will scrape:
* topic
* content
* timestamp/date

After it collects those 3 items from each topic, it will then collect the replies for the respective topic.

sitemap final version:

{"_id":"wikia-discussion-final-sitemap","startUrl":["http://harrypotter.wikia.com/d/f?page=[1-1250]"],"selectors":[{"id":"items","type":"SelectorLink","parentSelectors":["_root"],"selector":"a.post-content__link","multiple":true,"delay":0},{"id":"topic","type":"SelectorText","parentSelectors":["items"],"selector":"h1.post-content__title","multiple":false,"regex":"","delay":0},{"id":"content","type":"SelectorText","parentSelectors":["items"],"selector":"div.post-content__body","multiple":false,"regex":"","delay":0},{"id":"timestamp","type":"SelectorText","parentSelectors":["items"],"selector":"div.post-card__body span.timestamp","multiple":false,"regex":"","delay":0},{"id":"post-author","type":"SelectorText","parentSelectors":["items"],"selector":"div.post-card__body a.user-avatar__username","multiple":false,"regex":"","delay":0},{"id":"replies-for-main-post-element","type":"SelectorElement","parentSelectors":["items"],"selector":"div.discussion-reply","multiple":true,"delay":0},{"id":"reply-by","type":"SelectorText","parentSelectors":["replies-for-main-post-element"],"selector":"a.user-avatar__username","multiple":false,"regex":"","delay":0},{"id":"reply-date","type":"SelectorText","parentSelectors":["replies-for-main-post-element"],"selector":"span.timestamp","multiple":false,"regex":"","delay":0},{"id":"reply-content","type":"SelectorText","parentSelectors":["replies-for-main-post-element"],"selector":"div.post-content","multiple":false,"regex":"","delay":0}]}

Copy the sitemap, then go to Web Scraper --> Create new sitemap --> Import sitemap --> paste it into "Sitemap JSON" --> save.
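If the import fails, it can help to check the pasted text before trying again. The sketch below is not part of Web Scraper; `check_sitemap` is a hypothetical helper that only verifies the text is valid JSON, has the three top-level keys used in the sitemaps in this thread, and that every parent selector refers to an existing id. The sample is a shortened copy of the final sitemap.

```python
import json

# Shortened copy of the final sitemap, for illustration only.
SAMPLE = (
    '{"_id":"wikia-discussion-final-sitemap",'
    '"startUrl":["http://harrypotter.wikia.com/d/f?page=[1-1250]"],'
    '"selectors":['
    '{"id":"items","type":"SelectorLink","parentSelectors":["_root"],'
    '"selector":"a.post-content__link","multiple":true,"delay":0},'
    '{"id":"topic","type":"SelectorText","parentSelectors":["items"],'
    '"selector":"h1.post-content__title","multiple":false,"regex":"","delay":0}]}'
)


def check_sitemap(sitemap_json):
    """Parse a sitemap string and return a list of problems (empty if none found)."""
    sitemap = json.loads(sitemap_json)  # raises ValueError if the paste is not valid JSON
    problems = []
    for key in ("_id", "startUrl", "selectors"):
        if key not in sitemap:
            problems.append(f"missing top-level key: {key}")
    selectors = sitemap.get("selectors", [])
    known_ids = {"_root"} | {s.get("id") for s in selectors}
    for selector in selectors:
        for parent in selector.get("parentSelectors", []):
            if parent not in known_ids:
                problems.append(f"{selector.get('id')}: unknown parent {parent}")
    return problems


if __name__ == "__main__":
    print(check_sitemap(SAMPLE) or "no problems found")
```

A common cause of a failed paste is a truncated copy, which `json.loads` reports as an error straight away.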

It will scrape data up to page 1250. Below are the sample records it scraped:

sample file:
https://docs.zoho.in/file/w6fy1f62f1475917845b29af9148f9eee9ff1
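Because the replies sit under an element selector with "multiple" enabled, I would expect the export to contain one row per reply, with the topic fields repeated on each row. Assuming that layout, and assuming the CSV columns are named after the selector ids (with the link selector adding an "items-href" column), the rows could be regrouped per post roughly like this. The sample rows are made up for illustration and are not taken from the sample file.

```python
import csv
import io

# Hypothetical export rows; real column names and values may differ.
SAMPLE_CSV = """items-href,topic,timestamp,reply-by,reply-content
/d/p/1,First topic,1d,alice,Reply one
/d/p/1,First topic,1d,bob,Reply two
/d/p/2,Second topic,2d,,
"""


def group_replies(csv_text):
    """Collapse one-row-per-reply data into one entry per post, keyed by post link."""
    posts = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        post = posts.setdefault(
            row["items-href"],
            {"topic": row["topic"], "timestamp": row["timestamp"], "replies": []},
        )
        # Posts without replies have empty reply columns; skip those.
        if row["reply-by"]:
            post["replies"].append((row["reply-by"], row["reply-content"]))
    return posts


if __name__ == "__main__":
    for link, post in group_replies(SAMPLE_CSV).items():
        print(link, post["topic"], len(post["replies"]))
```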

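If you don't want the replies for each topic to be included, just delete the element "replies-for-main-post-element". If you edit the JSON by hand, also delete the selectors that list it as their parent ("reply-by", "reply-date" and "reply-content").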
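For anyone who prefers to edit the JSON outside the extension, here is a minimal sketch of that removal. `drop_selector` is a hypothetical helper, not a Web Scraper function; it removes the named selector and then any selector left without a parent. The small sitemap in the example is a cut-down stand-in for the final one.

```python
import json


def drop_selector(sitemap, selector_id):
    """Return a copy of the sitemap without selector_id and any selector left with no parent."""
    removed = {selector_id}
    selectors = [s for s in sitemap["selectors"] if s["id"] != selector_id]
    changed = True
    while changed:  # repeat until no more orphaned selectors turn up
        changed = False
        kept = []
        for s in selectors:
            parents = [p for p in s["parentSelectors"] if p not in removed]
            if parents:
                kept.append({**s, "parentSelectors": parents})
            else:
                removed.add(s["id"])
                changed = True
        selectors = kept
    return {**sitemap, "selectors": selectors}


if __name__ == "__main__":
    # Cut-down stand-in for the final sitemap (selector details omitted).
    sitemap = {
        "_id": "wikia-discussion-final-sitemap",
        "startUrl": ["http://harrypotter.wikia.com/d/f?page=[1-1250]"],
        "selectors": [
            {"id": "items", "parentSelectors": ["_root"]},
            {"id": "topic", "parentSelectors": ["items"]},
            {"id": "replies-for-main-post-element", "parentSelectors": ["items"]},
            {"id": "reply-by", "parentSelectors": ["replies-for-main-post-element"]},
        ],
    }
    print(json.dumps(drop_selector(sitemap, "replies-for-main-post-element")))
```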