I'm trying data scraping for the first time so I don't have to copy-paste a million rows to get all of the agenda items from the past 5 years into one spreadsheet. The agenda links are awkward to reach from the top-level site: https://www.worcesterma.gov/city-clerk/public-meetings/agendas-minutes To get to one, scroll down to the archive, expand the City Council section, pick a year, and then open the agenda link with the date selector. Because of this complexity, and because the archive has more years than I want to scrape, I figured I'd skip automating this part and manually combine the per-meeting outputs, but if automating it isn't super difficult, that would be better.
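A possible shortcut for the archive navigation, sketched in Python: the one journal URL I have (the startUrl in the sitemap at the end of this post) looks like it is built from the meeting date as year/YYYYMMDD.htm. Assuming that pattern holds for other meetings, which I have not verified, the URLs could be generated from a list of dates instead of clicking through the archive. The names BASE, agenda_url, and agenda_urls are made up for illustration, and the meeting dates themselves would still have to come from the archive page.

```python
from datetime import date

# Assumed pattern, inferred from the single start URL in the sitemap.
# Not verified against other meetings or other years.
BASE = "https://www.worcesterma.gov/agendas-minutes/city-council"


def agenda_url(meeting: date) -> str:
    """Build the journal URL for one meeting date (hypothetical helper)."""
    return f"{BASE}/{meeting.year}/{meeting:%Y%m%d}.htm"


def agenda_urls(meetings):
    """Build one URL per meeting date, keeping the input order."""
    return [agenda_url(d) for d in meetings]


if __name__ == "__main__":
    # The only date known from this post.
    print(agenda_url(date(2023, 10, 3)))
```

The resulting list could be pasted into the sitemap's startUrl array so one sitemap covers many meetings.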
My city council's agendas are formatted really poorly. I'm trying to gather the item number, item text, and attachment link from one row, and then the resulting vote from the next row down. I can't get those two rows to separate from the rest of the agenda items in that section because they're all in one table.
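To make the two-row structure concrete, here is a minimal Python sketch of the pairing I'm after. It assumes the rows have already been pulled out in page order as (number, text, link) tuples, that an item row has a non-empty number, and that the vote row directly below it has an empty number. Those are my guesses about the layout, the function name pair_items_with_votes is made up, and the sample rows are placeholders rather than real agenda data.

```python
def pair_items_with_votes(rows):
    """Attach each vote row to the item row directly above it.

    rows: iterable of (number, text, link) tuples in page order.
    Assumption: item rows carry a number, vote rows do not.
    """
    items = []
    for number, text, link in rows:
        if number.strip():
            # A numbered row starts a new agenda item.
            items.append({
                "number": number.strip(),
                "text": text.strip(),
                "link": link,
                "vote": "",
            })
        elif items and not items[-1]["vote"]:
            # First unnumbered row after an item is treated as its vote.
            items[-1]["vote"] = text.strip()
    return items


if __name__ == "__main__":
    # Placeholder rows, not taken from the real agenda.
    sample = [
        ("7a", "Sample item text", "sample-7a.pdf"),
        ("", "Sample vote result", None),
        ("7b", "Another sample item", None),
    ]
    for item in pair_items_with_votes(sample):
        print(item)
```

Each returned dict is one spreadsheet row, so the item and its vote end up side by side instead of on separate lines.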
Below is my current sitemap, but it only grabs the one row that I initially gave it in the selection. Additionally, some items sit at this TR level while others are nested further into the table. For example, agenda item 7a is only one level deep under item 7, but in item 8 the information I want to grab is under 8.1.a, which is a sub-portion of 8.1, which in turn is at the same level as 7a.
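For the nesting problem, one approach that might work is to ignore depth entirely and collect every TR that contains a number cell, wherever it sits. The sketch below does that with Python's standard html.parser. It assumes the number cell is always a td with width="35" (the same selector my sitemap uses) and that the page's tags are properly closed, which a real page may not honor. RowCollector, numbered_rows, and the SAMPLE markup are made up for illustration and are not the real agenda HTML.

```python
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collect every <tr>, at any nesting depth, in page order."""

    def __init__(self):
        super().__init__()
        self.rows = []         # every row seen, in document order
        self._open_rows = []   # stack of rows that are currently open
        self._open_cells = []  # stack of flags: is this <td> a number cell?

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "tr":
            row = {"number": [], "text": [], "links": []}
            self.rows.append(row)
            self._open_rows.append(row)
        elif tag == "td":
            # Assumption: number cells are marked with width="35".
            self._open_cells.append(attrs.get("width") == "35")
        elif tag == "a" and self._open_rows and attrs.get("href"):
            self._open_rows[-1]["links"].append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "tr" and self._open_rows:
            self._open_rows.pop()
        elif tag == "td" and self._open_cells:
            self._open_cells.pop()

    def handle_data(self, data):
        if not self._open_rows:
            return
        # Text belongs to the innermost open row only.
        in_number = bool(self._open_cells) and self._open_cells[-1]
        self._open_rows[-1]["number" if in_number else "text"].append(data)


def numbered_rows(html):
    """Return only the rows that have an item number, flattened."""
    parser = RowCollector()
    parser.feed(html)
    parser.close()
    result = []
    for row in parser.rows:
        number = "".join(row["number"]).strip()
        if number:
            text = " ".join(" ".join(row["text"]).split())
            result.append({"number": number, "text": text, "links": row["links"]})
    return result


# Made-up markup that mimics one shallow item and one nested item.
SAMPLE = """
<table>
  <tr><td width="35">7a</td><td><p>First sample item</p><a href="a.pdf">Attachment</a></td></tr>
  <tr><td></td><td>
    <table>
      <tr><td width="35">8.1.a</td><td><p>Nested sample item</p></td></tr>
    </table>
  </td></tr>
</table>
"""

if __name__ == "__main__":
    for row in numbered_rows(SAMPLE):
        print(row)
```

Rows without a number, which would include the vote rows, are filtered out by numbered_rows; the unfiltered parser.rows list keeps them in page order if they are needed. For messy real-world HTML, a more forgiving parser is probably a better fit than html.parser.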
If I need to pay someone to write this script I've got an extremely tiny budget for the whole project but I'm happy to toss a little your way if needed.
URL: City of Worcester Council Journal for 10/03/2023 (https://www.worcesterma.gov/agendas-minutes/city-council/2023/20231003.htm)
{"_id":"Agenda1","startUrl":["https://www.worcesterma.gov/agendas-minutes/city-council/2023/20231003.htm"],"selectors":[{"id":"agendaitem","parentSelectors":["_root"],"type":"SelectorElementScroll","selector":"table:nth-of-type(15) tr:nth-of-type(3)","multiple":true,"delay":2000,"elementLimit":500},{"id":"itemnumber","parentSelectors":["agendaitem"],"type":"SelectorText","selector":"td[width='35']","multiple":false,"regex":""},{"id":"summary","parentSelectors":["agendaitem"],"type":"SelectorText","selector":"p","multiple":false,"regex":""},{"id":"attached","parentSelectors":["agendaitem"],"type":"SelectorLink","selector":"a","multiple":false,"linkType":"linkFromHref"}]}