Describe the problem.
Hi all, I keep running into deduplication issues in my current scraping work. I'm trying to get Student Organizations hosted in CampusLabs (which 80% of colleges use), but several organizations are listed in multiple categories. When bulk scraping, it scrapes Org X from Cat X but won't re-scrape Org X from Cat Y, apparently because of the deduplication feature. Can anyone confirm whether this can actually be done, or is one sitemap per category the only solution?
What I tried was putting 4 or more category URLs in the sitemap metadata so that it visits each category individually, but if an organization is repeated (same URL) across categories, it is discarded and scraped only once. I need it repeated per category in my data.
Url: - My BC
Url per category examples:
- My BC
- My BC
Sitemap:
{"_id":"sitemaporgsgeneral","startUrl":["- My BC button span","type":"SelectorPagination"},{"id":"links","linkType":"linkFromHref","multiple":true,"parentSelectors":["click"],"selector":"li a","type":"SelectorLink","version":2},{"id":"name","multiple":false,"multipleType":"singleColumn","parentSelectors":["links"],"regex":"","selector":"h1","type":"SelectorText","version":2},{"id":"desc","multiple":false,"multipleType":"singleColumn","parentSelectors":["links"],"regex":"","selector":"div.bodyText-large","type":"SelectorText","version":2}]}
The sitemap above works perfectly, but I need it to do this across many categories at the same time in order to scrape the school faster.
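In case one sitemap per category really is the only way, here is a rough sketch of how I imagine generating them from a single base sitemap instead of building each by hand. This is only an assumption about a workaround, not something confirmed to get around the deduplication. The function name `sitemaps_per_category`, the `-catN` id suffix, and the `example.edu` URLs are all made-up placeholders, and the base sitemap here is a trimmed stand-in, not my real one.

```python
import copy
import json


def sitemaps_per_category(base_sitemap, category_urls):
    """Return one copy of base_sitemap per category URL.

    Each copy gets its own _id and a single start URL, so every
    category is scraped as a separate job (hypothetical workaround).
    """
    sitemaps = []
    for index, url in enumerate(category_urls, start=1):
        sitemap = copy.deepcopy(base_sitemap)  # leave the base untouched
        sitemap["_id"] = f"{base_sitemap['_id']}-cat{index}"  # assumed naming scheme
        sitemap["startUrl"] = [url]
        sitemaps.append(sitemap)
    return sitemaps


# Placeholder base sitemap and category URLs (not the real ones).
base = {
    "_id": "sitemaporgsgeneral",
    "startUrl": ["https://example.edu/organizations"],
    "selectors": [
        {
            "id": "name",
            "type": "SelectorText",
            "parentSelectors": ["_root"],
            "selector": "h1",
            "multiple": False,
        }
    ],
}
categories = [
    "https://example.edu/organizations?categories=1",
    "https://example.edu/organizations?categories=2",
]

generated = sitemaps_per_category(base, categories)
for sitemap in generated:
    # Each line is one sitemap's JSON, ready to paste into an import dialog.
    print(json.dumps(sitemap))
```

Each job's output would then need a category column added afterwards (for example from the start URL) before merging the results, so that a repeated organization shows up once per category.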