Table with Two Header Rows

I want to update some information on Wikipedia and need to extract data from several video game pages on the same site so that I have everything in a single file instead of navigating through all the sources. Unfortunately, the data pages with the information I need have tables with two header rows, and I haven't been able to figure out how to merge both header rows into a single one.

Using "Table" (the second row gets ignored) or "Element" (the "Box" cell doesn't let me select anything else) as the selector doesn't help, and after a lot of experimenting I'm ready to give up.

Url: https://gamefaqs.gamespot.com/nes/525245-super-mario-bros-3/data

Sitemap:

I'm attaching several screenshots of what I've tried. The first two are with "Table" as a selector (notice how the "Region" data gets merged into the "Box" column). The 3rd and 4th are with "Element" as the selector (notice how I'm forced to include "Box" with my selection). The 5th is after I created a child for the "Element" selector in 4 (can't select anything else besides "Box"). The last screenshot is basically the only data I care about, but if I scrape it like that, the data will not match up.

Any help to solve this riddle, wrapped in a mystery, inside an enigma will be greatly appreciated.

@Frustrated Hello.

You did not attach your sitemap, so I still don't have a clear idea of what you were trying to get, but I assume it's every second table row, the one that contains the region, publisher, ID, etc.

Example:

{"_id":"gamefaqs","startUrl":["https://gamefaqs.gamespot.com/nes/525245-super-mario-bros-3/data"],"selectors":[{"id":"Table-Element","type":"SelectorElement","parentSelectors":["_root"],"selector":"tr:has(.cregion)","multiple":true,"delay":0},{"id":"region","type":"SelectorText","parentSelectors":["Table-Element"],"selector":"td.cregion","multiple":false,"regex":"","delay":0},{"id":"publisher","type":"SelectorText","parentSelectors":["Table-Element"],"selector":"td.datacompany","multiple":false,"regex":"","delay":0},{"id":"product-ID","type":"SelectorText","parentSelectors":["Table-Element"],"selector":"td:nth-of-type(3)","multiple":false,"regex":"","delay":0},{"id":"distribution/barcode","type":"SelectorText","parentSelectors":["Table-Element"],"selector":"td:nth-of-type(4)","multiple":false,"regex":"","delay":0},{"id":"release date","type":"SelectorText","parentSelectors":["Table-Element"],"selector":"td.cdate","multiple":false,"regex":"","delay":0},{"id":"rating","type":"SelectorText","parentSelectors":["Table-Element"],"selector":"td.datarating","multiple":false,"regex":"","delay":0}]}

It's not possible to merge different rows into unified columns unless you make 2 different sitemaps and merge them locally.
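
For the local merge, here is a minimal Python sketch rather than a definitive recipe. It assumes both sitemaps were exported as CSV, that each export has a "web-scraper-start-url" column identifying the page a row came from, and that the two exports list the rows of a given page in the same order and in equal numbers. The function names are made up for illustration; if the exports come out in a different order, sort them first (for example by the "web-scraper-order" column, if your export has one).

```python
import csv
from collections import defaultdict

# Assumed name of the column that identifies the source page in each export.
URL_COLUMN = "web-scraper-start-url"


def read_csv(path):
    """Read a CSV export into a list of dicts (utf-8-sig drops a BOM if present)."""
    with open(path, newline="", encoding="utf-8-sig") as handle:
        return list(csv.DictReader(handle))


def group_by_url(rows):
    """Group rows by source page, keeping their original order."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[URL_COLUMN]].append(row)
    return groups


def merge_exports(title_rows, detail_rows):
    """Pair the n-th title row of a page with the n-th detail row of that page."""
    titles = group_by_url(title_rows)
    details = group_by_url(detail_rows)
    merged = []
    for url, detail_group in details.items():
        title_group = titles.get(url, [])
        if len(title_group) != len(detail_group):
            # Refuse to guess: misaligned rows would silently mismatch the data.
            raise ValueError(f"row count mismatch for {url}")
        for title_row, detail_row in zip(title_group, detail_group):
            merged.append({**title_row, **detail_row})
    return merged


def write_csv(path, rows):
    """Write merged rows, using the union of all column names as the header."""
    fieldnames = []
    for row in rows:
        for name in row:
            if name not in fieldnames:
                fieldnames.append(name)
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```

With the two exports saved as, say, titles.csv and details.csv (hypothetical file names), calling write_csv("merged.csv", merge_exports(read_csv("titles.csv"), read_csv("details.csv"))) would produce the combined file. The row count check is deliberate: pairing by position only works when both sitemaps return one row per release.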

Sorry for not attaching my sitemap. I got so frustrated (hence my username) after hours of trying different approaches that didn't give me anything near to what I was trying to do. That's why I added so many screenshots with my ramblings.

And on that note, thank you for your suggestion and your later reply. Your solution is much simpler and more elegant than the mumbo jumbo I ended up with when I gave up yesterday.

I want a CSV file with the Title (in the SMB3 page it's the same for all releases, but it tends to change depending on the region so it's necessary for the editing I need to do) and all the information from the second header row. Now that you tell me that I need to make two different sitemaps and merge them locally, I can stop banging my head against the wall and start scraping.

Thank you very much! :+1: :+1:
