Pages with missing sublinks

Dingo · February 11, 2024, 7:41pm

Describe the problem.
I have set up a sitemap that scrapes information from a school sports program website. Currently the sitemap works as follows:
select school link > select respective sport page link > scrape sport information; select coach contact information link > scrape coaches email.
The problem occurs on sport pages where the coach contact information link is not present, usually because the sports program has been discontinued. When this happens, Web Scraper gets stuck on the coachless page and then cycles through the rest of the listed sports without scraping anything.

I then end up with data that looks like this:

Url:https://www.psal.org/

This is a sample sitemap for just one school, Abraham Lincoln, since it is one of the school pages with the scraping problem described above.
Sitemap:
{"_id":"Coach_test","startUrl":["https://www.psal.org/"],"selectors":[{"id":"Abraham lincoln","linkType":"linkFromHref","multiple":false,"parentSelectors":["_root"],"selector":"li.ss:nth-of-type(3) a","type":"SelectorLink"},{"id":"lincoln_sports","linkType":"linkFromHref","multiple":true,"parentSelectors":["Abraham lincoln"],"selector":"tr:nth-of-type(n+2) td:nth-of-type(1) a","type":"SelectorLink"},{"id":"School","multiple":false,"parentSelectors":["lincoln_sports"],"regex":"","selector":"span#schoolName","type":"SelectorText"},{"id":"Sport","multiple":false,"parentSelectors":["lincoln_sports"],"regex":"","selector":"span#fullSpName","type":"SelectorText"},{"id":"Division","multiple":false,"parentSelectors":["lincoln_sports"],"regex":"","selector":"span#location","type":"SelectorText"},{"id":"Coach_FullName","multiple":false,"parentSelectors":["lincoln_sports"],"regex":"","selector":"#headCoach a","type":"SelectorText"},{"id":"Coach_Link","linkType":"linkFromHref","multiple":false,"parentSelectors":["lincoln_sports"],"selector":"#headCoach a","type":"SelectorLink"},{"id":"Email","multiple":false,"parentSelectors":["Coach_Link"],"regex":"","selector":"a#coachEmail","type":"SelectorText"}]}

JanAp · February 12, 2024, 2:39pm

Hi, this is interesting. I am not sure what is going on there. I will try to take a deeper look into this issue, but for now you can use a workaround to exclude the listings without a coach:

{"_id":"Coach_test","startUrl":["https://www.psal.org/"],"selectors":[{"id":"Abraham lincoln","linkType":"linkFromHref","multiple":false,"parentSelectors":["_root"],"selector":"li.ss:nth-of-type(3) a","type":"SelectorLink"},{"id":"lincoln_sports","linkType":"linkFromHref","multiple":false,"parentSelectors":["listing-wrapper"],"selector":"td:nth-of-type(1) a","type":"SelectorLink"},{"id":"School","multiple":false,"parentSelectors":["lincoln_sports"],"regex":"","selector":"span#schoolName","type":"SelectorText"},{"id":"Sport","multiple":false,"parentSelectors":["lincoln_sports"],"regex":"","selector":"span#fullSpName","type":"SelectorText"},{"id":"Division","multiple":false,"parentSelectors":["lincoln_sports"],"regex":"","selector":"span#location","type":"SelectorText"},{"id":"Coach_FullName","multiple":false,"parentSelectors":["lincoln_sports"],"regex":"","selector":"#headCoach a","type":"SelectorText"},{"id":"Coach_Link","linkType":"linkFromHref","multiple":false,"parentSelectors":["lincoln_sports"],"selector":"#headCoach a","type":"SelectorLink"},{"id":"Email","multiple":false,"parentSelectors":["Coach_Link"],"regex":"","selector":"a#coachEmail","type":"SelectorText"},{"id":"listing-wrapper","multiple":true,"parentSelectors":["Abraham lincoln"],"selector":"#tblSchoolSport tbody tr:has(a):has(td:nth-of-type(2):has(a))","type":"SelectorElement"}]}
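To spell out what the `listing-wrapper` selector in this workaround does: `#tblSchoolSport tbody tr:has(a):has(td:nth-of-type(2):has(a))` keeps only table rows that contain a link and whose second cell also contains a link. Below is a rough Python sketch of that rule on made-up rows. The `Row` type, the sample data, and the reading of the second column as the coach column are assumptions for illustration, not taken from the site.

```python
from dataclasses import dataclass


@dataclass
class Row:
    """Hypothetical stand-in for one <tr> in the school's sports table."""
    sport: str
    first_cell_has_link: bool   # td:nth-of-type(1) contains an <a>
    second_cell_has_link: bool  # td:nth-of-type(2) contains an <a> (assumed coach column)


def passes_filter(row: Row) -> bool:
    # tr:has(a): at least one link somewhere in the row
    row_has_link = row.first_cell_has_link or row.second_cell_has_link
    # :has(td:nth-of-type(2):has(a)): the second cell itself contains a link
    return row_has_link and row.second_cell_has_link


# Made-up rows covering: coach with link, coach without link, no coach listed
rows = [
    Row("Baseball", True, True),
    Row("Bowling", True, False),
    Row("Fencing", True, False),
]

kept = [r.sport for r in rows if passes_filter(r)]
print(kept)
```

Only rows whose second cell has a link survive, so a school page where no row has one yields nothing for `lincoln_sports` to follow.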

Dingo · February 12, 2024, 5:33pm

Thank you for the help! This solution worked for the Abraham Lincoln school page, but I run into problems in some other scenarios. I also wanted to mention that even when I only try to scrape the coach's name, without following the link to their coaching profile, the same thing happens as in my example: the scraper visits the other sports pages but only acknowledges the page with the missing coach data.
First, the check you put in skips sports that have a head coach listed on the school page but without a link. In these cases there is still a link to the coach's profile on the sport page, even though the link is not present on the school page. I haven't seen any sports without a head coach listed on the school page that then have a head coach listed on the sport page. Here is an example of a school page that has all three possibilities in the sports column (coach with link, coach with no link, and no coach listed): School Profile
Second, there are schools where none of the listed sports have a head coach. With no links to pass the check, the scraper gets stuck on these sites. This would probably also happen if a coach were listed without a link on the school page, but I haven't scrolled through enough pages to observe this. Here is an example of a school with an empty coaches column: School Profile

JanAp · February 14, 2024, 2:44pm

Unfortunately, it looks like this is a very specific edge case. The issue has to do with how the URLs are handled by the web browser. If you try to open two team profiles in a row, you will see that the page will not be updated even if the URL has changed in the address bar. To get to the second team profile, you have to click on the address bar and hit enter one more time.

The only reason the scraper works with teams that have a coach is that the URL constantly switches between 'team-profile.aspx' and 'coach-profile.aspx'.

If the coach is missing, the switch happens between two 'team-profile.aspx' URLs, which the browser cannot handle.
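To check which steps of a scrape would hit this case, one option is to look at the order in which URLs are visited and flag consecutive visits that share the same page path and differ only in the query string. This is only a sketch of that idea in Python: the function name and the `example.com` URLs with `id` parameters are hypothetical, and it does not reproduce the browser behaviour itself.

```python
from urllib.parse import urlsplit


def same_page_transitions(urls):
    """Return (previous, current) pairs that stay on the same path.

    Two consecutive URLs count as a same-page transition when scheme,
    host, and path match but the full URLs differ (for example, only
    the query string changes).
    """
    risky = []
    for prev, cur in zip(urls, urls[1:]):
        a, b = urlsplit(prev), urlsplit(cur)
        same_page = (a.scheme, a.netloc, a.path) == (b.scheme, b.netloc, b.path)
        if same_page and prev != cur:
            risky.append((prev, cur))
    return risky


# Hypothetical visit order: the second team has no coach link to follow,
# so the scraper goes straight from one team profile to the next.
visits = [
    "https://example.com/team-profile.aspx?id=1",
    "https://example.com/coach-profile.aspx?id=10",
    "https://example.com/team-profile.aspx?id=2",
    "https://example.com/team-profile.aspx?id=3",
]

risky = same_page_transitions(visits)
for prev, cur in risky:
    print(prev, "->", cur)
```

In this made-up sequence only the last step is flagged, which matches the pattern described above: the switch between a team profile and a coach profile is fine, while two team profiles in a row are not.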