Combining multiple rows into one row - badly segmented HTML

I'm trying data scraping for the first time so that I don't have to copy and paste a million rows to get all of the agenda items from the past 5 years into one spreadsheet. I'm willing to combine outputs manually, since the agenda links are awkward to reach from the top-level site: https://www.worcesterma.gov/city-clerk/public-meetings/agendas-minutes Scroll down to the archive, expand the City Council section, pick a year, and then open the agenda link with the date selector. Because of this complexity, and because there are more years than I want to scrape, I figured I'd skip this part and combine the results manually, but if automating it isn't super difficult, that would be better.

My city council's agendas are formatted really poorly. I'm trying to gather the item number, item text, and attachment link, plus the resulting vote, which sits in the next row down. I can't separate those two rows from the rest of the agenda items in that section because they're all in one table.
The sitemap below is what I have right now, but it only grabs the one row that I initially selected. Additionally, some items are at this TR level and others are nested further into the table. For example, agenda item 7a is only one level deep under item 7, but in item 8 the information I want is under 8.1.a, a sub-item of 8.1, which is at the same level as 7a.

If I need to pay someone to write this script, I've got an extremely tiny budget for the whole project, but I'm happy to toss a little your way if needed.

URL: City of Worcester Council Journal for 10/03/2023

{"_id":"Agenda1","startUrl":["https://www.worcesterma.gov/agendas-minutes/city-council/2023/20231003.htm"],"selectors":[{"id":"agendaitem","parentSelectors":["_root"],"type":"SelectorElementScroll","selector":"table:nth-of-type(15) tr:nth-of-type(3)","multiple":true,"delay":2000,"elementLimit":500},{"id":"itemnumber","parentSelectors":["agendaitem"],"type":"SelectorText","selector":"td[width='35']","multiple":false,"regex":""},{"id":"summary","parentSelectors":["agendaitem"],"type":"SelectorText","selector":"p","multiple":false,"regex":""},{"id":"attached","parentSelectors":["agendaitem"],"type":"SelectorLink","selector":"a","multiple":false,"linkType":"linkFromHref"}]}
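To make the goal concrete outside of the extension, here is a minimal Python sketch of the pairing I'm after: flatten every row that contains no nested table, then attach the row that follows each item row as its result. This is not a Web Scraper sitemap. The sample HTML is a made-up stand-in (only the `td[width='35']` number cell mirrors my selector), and `LeafRowCollector`, `pair_rows`, and the item-number pattern are hypothetical names and assumptions, so the real page may need different rules.

```python
import re
from html.parser import HTMLParser

# Made-up stand-in for the agenda markup (an assumption, not the real page):
# an item row followed by a result row, once directly in the table and once
# inside a nested table.
SAMPLE_HTML = """
<table>
  <tr><td width="35">7a.</td><td><p>Petition about a crosswalk.</p></td></tr>
  <tr><td></td><td>Referred to committee</td></tr>
  <tr><td colspan="2">
    <table>
      <tr><td width="35">8.1a.</td><td><p>Report on street repairs.</p></td></tr>
      <tr><td></td><td>Placed on file</td></tr>
    </table>
  </td></tr>
</table>
"""

# Assumed item-number shape: 7a, 7a., 8.1a, 8.2b. ...
ITEM_RE = re.compile(r"^\d+(\.\d+)?[a-z]\.?$")


class LeafRowCollector(HTMLParser):
    """Collect the cell texts of every <tr> that holds no nested <table>."""

    def __init__(self):
        super().__init__()
        self.rows = []   # finished leaf rows, in document order
        self.stack = []  # one frame per currently open <tr>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.stack.append({"cells": [], "nested": False})
        elif tag == "td" and self.stack:
            self.stack[-1]["cells"].append("")
        elif tag == "table" and self.stack:
            # A table opened inside a row marks that row as a wrapper.
            self.stack[-1]["nested"] = True

    def handle_data(self, data):
        if self.stack and self.stack[-1]["cells"]:
            self.stack[-1]["cells"][-1] += data

    def handle_endtag(self, tag):
        if tag == "tr" and self.stack:
            frame = self.stack.pop()
            if not frame["nested"]:
                # Collapse runs of whitespace inside each cell.
                self.rows.append([" ".join(c.split()) for c in frame["cells"]])


def is_item(row):
    """True when the first cell looks like an item number."""
    return bool(row) and ITEM_RE.match(row[0]) is not None


def pair_rows(rows):
    """Merge each item row with the non-item row that follows it."""
    items = []
    i = 0
    while i < len(rows):
        row = rows[i]
        if is_item(row):
            result = ""
            if i + 1 < len(rows) and not is_item(rows[i + 1]):
                result = " ".join(filter(None, rows[i + 1]))
                i += 1  # the result row is consumed with its item
            items.append({
                "item": row[0].rstrip("."),
                "text": " ".join(filter(None, row[1:])),
                "result": result,
            })
        i += 1
    return items


collector = LeafRowCollector()
collector.feed(SAMPLE_HTML)
for entry in pair_rows(collector.rows):
    print(entry)
```

Because wrapper rows are skipped and only leaf rows are kept, the same pairing step handles 7a (one level deep) and 8.1a (nested twice) without caring how deep the table goes.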

Additional info from questions I have gotten:

Another example agenda: City of Worcester Council Journal for 10/03/2023. There, the items would be 7a, 7b, 7c, etc., but also 8.1a, 8.2a, 8.2b, etc. Some weeks don't have the double nesting.

I want a spreadsheet with columns for:
A: date
B: item (e.g. 7a or 8.2b)
C: text of item
D: results from item (the next tr down from the rest)
E: section it's under (e.g. "7. Petitions")
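As a rough illustration of columns A to E, the following Python sketch writes already-scraped items to CSV. It assumes each item is a dict with `item`, `text`, and `result` keys, and that section titles can be looked up by the leading number of the item. The names `to_csv` and `section_for`, the sample item, and the "8. Example Section" title are hypothetical placeholders; only "7. Petitions" comes from the agenda.

```python
import csv
import io

# Columns A-E from the list above.
COLUMNS = ["date", "item", "text", "result", "section"]


def section_for(item, sections):
    """Look up the section title by the item's leading number, e.g. '8.2b' -> '8'."""
    number = item.split(".")[0].rstrip("abcdefghijklmnopqrstuvwxyz")
    return sections.get(number, "")


def to_csv(date, items, sections):
    """Return CSV text with one line per agenda item."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    for it in items:
        writer.writerow({
            "date": date,
            "item": it["item"],
            "text": it["text"],
            "result": it["result"],
            "section": section_for(it["item"], sections),
        })
    return buf.getvalue()


# Hypothetical sample data for illustration only.
sections = {"7": "7. Petitions", "8": "8. Example Section"}
items = [
    {"item": "7a", "text": "Petition about a crosswalk.", "result": "Referred to committee"},
    {"item": "8.2b", "text": "Report on street repairs.", "result": "Placed on file"},
]
print(to_csv("10/03/2023", items, sections))
```

The remaining columns (F onward) would be filled in by hand afterward, so they are left out here.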

I plan to manually process more of the information but the other bits I plan to do will give me these columns:
F: yes vote count
G: no vote count
H: abstain vote count
I: not present count
J: action taken (held, tabled, sent to committee, sent to administration, passed, etc.)
K: who submitted the item
L: what committee/department it went to if any
M: what committee/department it came from if any
N: category the item falls under (housing, taxes, noise ordinance, ball park, etc.)
O: other times this item has appeared

I have the tool as an extension in Chrome on my Windows 11 computer.

From a visual standpoint, some parts of the agenda are only nested one <td> in.

Other parts have things nested twice: