Limit Pagination Using Link Selector

yanda · November 9, 2019, 9:05pm

I am trying to scrape a site for crossword information. The site has decades of puzzles, but I only want a couple of years' worth. I want to limit pagination, but I don't think I can use the URL range method because the URL uses a mm/dd/yyyy format. I have set up pagination by clicking on the "previous puzzle" link and making that selector a child of itself.

URL: https://www.xwordinfo.com/Crossword?date=11/7/2019

Sitemap:

{"_id":"xwordinfo","startUrl":["https://www.xwordinfo.com/Crossword?date=11/9/2019"],"selectors":[{"id":"across","type":"SelectorText","parentSelectors":["_root"],"selector":"#ACluesPan .numclue div:nth-of-type(n)","multiple":true,"regex":"","delay":0},{"id":"down","type":"SelectorText","parentSelectors":["_root"],"selector":"#DCluesPan .numclue div:nth-of-type(n)","multiple":true,"regex":"","delay":0},{"id":"date","type":"SelectorText","parentSelectors":["_root"],"selector":"h1#PuzTitle","multiple":false,"regex":"","delay":0},{"id":"previous-puzzle","type":"SelectorLink","parentSelectors":["_root","previous-puzzle"],"selector":"a.cfNoLeft","multiple":false,"delay":0}]}

leemeng · November 10, 2019, 12:00pm

You can try the :not selector method. For example, if you want it to stop at 11/2/2019:

a.cfNoLeft:not([href='/Crossword?date=11/1/2019'])

This means keep clicking as long as the href is NOT '/Crossword?date=11/1/2019'. Specify the last date you want, minus one day.
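
For reference, here is roughly how the previous-puzzle entry from the sitemap above might look with that selector swapped in. This is an untested sketch: it assumes the link's href on the site is written exactly as '/Crossword?date=11/1/2019' (relative path, no zero-padded month or day), as in the example. If the site writes the href differently, the :not condition would never match and pagination would not stop.

```json
{
  "id": "previous-puzzle",
  "type": "SelectorLink",
  "parentSelectors": ["_root", "previous-puzzle"],
  "selector": "a.cfNoLeft:not([href='/Crossword?date=11/1/2019'])",
  "multiple": false,
  "delay": 0
}
```

If the href matches, the 11/2/2019 page should be the last one scraped, since its "previous puzzle" link is the one the selector excludes.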