Scraping data tables without proper headers and with a mismatched number of body columns

Objective:
I want to do the following:

  1. Click each competition (e.g. 2020, Tokyo, JPN) to load the data tables.
  2. Then, click on each specific event (e.g. K1 men 200m) which will bring me to the specific data table.
  3. Scrape the event name, ranking, athlete(s), nationality and timings data from each table.

Describe the problem:
The data tables don't have a proper header row: the 'header row' has 2 columns while the 'body rows' have 4. I also can't seem to access the nationality and timings data. With the sitemap below, I've only managed to get the rankings (some come back as null) and the athletes. I would appreciate sitemap suggestions or alternative approaches to tackle this.

Url:
http://www.canoeresults.eu/view-results/sprint

Sitemap:
{"_id":"canoesprintEU","startUrl":["http://www.canoeresults.eu/view-results/sprint"],"selectors":[{"delay":0,"id":"competition_link","multiple":true,"parentSelectors":["_root"],"selector":"div.row:nth-of-type(n+2) a","type":"SelectorLink"},{"delay":0,"id":"event_link","multiple":true,"parentSelectors":["competition_link"],"selector":"#results div a","type":"SelectorLink"},{"columns":[{"extract":true,"header":"K2 men 1.000 m","name":"rank"},{"extract":true,"header":"sprint","name":"sprint"}],"delay":0,"id":"event_table","multiple":true,"parentSelectors":["event_link"],"selector":"table","tableDataRowSelector":"tbody tr","tableHeaderRowSelector":"thead tr","type":"SelectorTable"}]}

@Scraper-Man Hello. To extract the desired data from multiple tables, I would suggest using an 'Element' selector (tbody tr) with the 'Multiple' option checked, and setting it as the parent of the 'Rank', 'Name', etc. text selectors, each with the 'Multiple' option unchecked.

Example:

{"_id":"canoesprintEU","startUrl":["http://www.canoeresults.eu/view-results/sprint?eventid[]=4530#discipline67794"],"selectors":[{"delay":0,"id":"event_table","multiple":true,"parentSelectors":["_root"],"selector":"tbody tr","type":"SelectorElement"},{"delay":0,"id":"Rank","multiple":false,"parentSelectors":["event_table"],"regex":"","selector":"td.cl1","type":"SelectorText"},{"delay":0,"id":"Name","multiple":false,"parentSelectors":["event_table"],"regex":"","selector":"td.cl2","type":"SelectorText"},{"delay":0,"id":"Nationality","multiple":false,"parentSelectors":["event_table"],"regex":"","selector":"td.cl3","type":"SelectorText"},{"delay":0,"id":"Result","multiple":false,"parentSelectors":["event_table"],"regex":"","selector":"td.cl4","type":"SelectorText"}]}

Helpful resources:

https://webscraper.io/how-to-video/multiple-items
https://webscraper.io/documentation/selectors/element-selector

Thank you! That got me a huge step closer to what I needed, except for one issue.

Clicking the competition link loads the data tables. Clicking an event link then jumps to that event's table (e.g. K2 men 500m). I want to select the first table that is in view after clicking the event link. How do I do that? I used table:first-of-type, but it selects the first table on the entire page instead of the one that is scrolled into view after clicking the event link.

Sitemap:

{"_id":"canoesprinteu","startUrl":["http://www.canoeresults.eu/view-results/sprint"],"selectors":[{"delay":0,"id":"event_table","multiple":true,"parentSelectors":["select_firstTable"],"selector":"tr","type":"SelectorElement"},{"delay":0,"id":"Rank","multiple":false,"parentSelectors":["event_table"],"regex":"","selector":"td.cl1","type":"SelectorText"},{"delay":0,"id":"Name","multiple":false,"parentSelectors":["event_table"],"regex":"","selector":"td.cl2","type":"SelectorText"},{"delay":0,"id":"Nationality","multiple":false,"parentSelectors":["event_table"],"regex":"","selector":"td.cl3","type":"SelectorText"},{"delay":0,"id":"Result","multiple":false,"parentSelectors":["event_table"],"regex":"","selector":"td.cl4","type":"SelectorText"},{"delay":0,"id":"competition_link","multiple":true,"parentSelectors":["_root"],"selector":"div.row:nth-of-type(n+2) a","type":"SelectorLink"},{"delay":0,"id":"event_link","multiple":true,"parentSelectors":["competition_link"],"selector":"#results div a","type":"SelectorLink"},{"delay":0,"id":"select_firstTable","multiple":false,"parentSelectors":["event_link"],"selector":"table:first-of-type > tbody","type":"SelectorElement"}]}