Background
I am trying to scrape the people pages of a particular website. Each person's page has multiple tabs, as shown in the image below. Each tab follows a link like: https://www.example.com/people/person?tab=experience
When you click a tab, the page reloads and the content corresponding to that tab is displayed.
I have multiple SelectorLink selectors in my sitemap to extract the content from the tabs. The SelectorLinks are: awards-community, news, thought-leadership
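To make the tab URL pattern concrete, here is a minimal Python sketch of how each tab URL could be derived from a person URL. It is only an illustration of the `?tab=` pattern, not something the scraper itself runs. The helper name `tab_url` is hypothetical, and only the `experience` value is confirmed by the example URL above; the other tab values are assumptions based on the selector ids.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def tab_url(person_url: str, tab: str) -> str:
    """Return person_url with its `tab` query parameter set to `tab`."""
    parts = urlsplit(person_url)
    # Drop any existing `tab` parameter, keep every other query parameter.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "tab"
    ]
    query.append(("tab", tab))
    return urlunsplit(parts._replace(query=urlencode(query)))


# Only "experience" appears in the example URL; the rest are assumed values.
TABS = ["experience", "news", "thought-leadership", "awards-community"]

if __name__ == "__main__":
    for tab in TABS:
        print(tab_url("https://www.example.com/people/person", tab))
```

Since every tab is a separate page load, each of these URLs would have to be visited individually for its content to be extracted.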
The Problem
When I scrape the website, even though it detects the link for each tab (the links are returned in the data), it does not go through all the tabs. It just goes to one seemingly random tab.
I also watched the scraping process, and it did not visit the other tabs. This rules out the possibility that the text selector (in the tab) is incorrect.
Observations
- It always opens the last tab link, in the order the selectors appear in the sitemap.
Sitemap
{
  "_id": "people-pagination",
  "startUrl": [
    "https://www.example.com/people/"
  ],
  "selectors": [
    {
      "id": "people",
      "linkType": "linkFromHref",
      "multiple": true,
      "parentSelectors": [
        "_root"
      ],
      "selector": ".bbt-letter-grid a",
      "type": "SelectorLink"
    },
    {
      "id": "person",
      "linkType": "linkFromHref",
      "multiple": true,
      "parentSelectors": [
        "page"
      ],
      "selector": ".people-results .person-results-details a:nth-child(1):not(:contains(\"Email\"))",
      "type": "SelectorLink"
    },
    {
      "id": "person-name",
      "multiple": false,
      "parentSelectors": [
        "person"
      ],
      "regex": "",
      "selector": "h1",
      "type": "SelectorText"
    },
    {
      "id": "person-level",
      "multiple": false,
      "parentSelectors": [
        "person"
      ],
      "regex": "",
      "selector": "span.bio-card-info-level",
      "type": "SelectorText"
    },
    {
      "id": "person-phone",
      "multiple": false,
      "parentSelectors": [
        "person"
      ],
      "regex": "",
      "selector": "span[itemprop='telephone']",
      "type": "SelectorText"
    },
    {
      "extractAttribute": "",
      "id": "person-overview",
      "parentSelectors": [
        "person"
      ],
      "selector": ".grid-content-main p",
      "type": "SelectorGroup"
    },
    {
      "extractAttribute": "",
      "id": "person-practices",
      "parentSelectors": [
        "person"
      ],
      "selector": "div h3.h4-primary:contains(\"Practices\")~ul a",
      "type": "SelectorGroup"
    },
    {
      "extractAttribute": "",
      "id": "person-industry",
      "parentSelectors": [
        "person"
      ],
      "selector": "div.content-block:contains(\"Industries\")>~*",
      "type": "SelectorGroup"
    },
    {
      "extractAttribute": "",
      "id": "person-education",
      "parentSelectors": [
        "person"
      ],
      "selector": ".related-accordion-btn:contains(\"Education\"):parent~div p",
      "type": "SelectorGroup"
    },
    {
      "extractAttribute": "",
      "id": "person-affiliation",
      "parentSelectors": [
        "person"
      ],
      "selector": "div.related-accordion a:contains(\"Admission & Affiliations\"):parent~div p",
      "type": "SelectorGroup"
    },
    {
      "extractAttribute": "",
      "id": "person-featured",
      "parentSelectors": [
        "person"
      ],
      "selector": "h3.h4-primary:contains(\"Featured\")~ul a",
      "type": "SelectorGroup"
    },
    {
      "id": "person-image",
      "multiple": false,
      "parentSelectors": [
        "person"
      ],
      "selector": ".bio-card-info-image img",
      "type": "SelectorImage"
    },
    {
      "id": "page",
      "paginationType": "clickOnce",
      "parentSelectors": [
        "people",
        "page"
      ],
      "selector": ".pagination-controls span a",
      "type": "SelectorPagination"
    },
    {
      "id": "experience",
      "linkType": "linkFromHref",
      "multiple": false,
      "parentSelectors": [
        "person"
      ],
      "selector": "a.tabs-link:contains(\"Experience\")",
      "type": "SelectorLink"
    },
    {
      "id": "experience-content",
      "multiple": false,
      "parentSelectors": [
        "experience"
      ],
      "regex": "",
      "selector": "div.rich-text",
      "type": "SelectorText"
    },
    {
      "id": "thought-leadership",
      "linkType": "linkFromHref",
      "multiple": false,
      "parentSelectors": [
        "person"
      ],
      "selector": "a.tabs-link:contains(\"Thought Leadership\")",
      "type": "SelectorLink"
    },
    {
      "id": "news",
      "linkType": "linkFromHref",
      "multiple": false,
      "parentSelectors": [
        "person"
      ],
      "selector": "a.tabs-link:contains(\"News\")",
      "type": "SelectorLink"
    },
    {
      "extractAttribute": "",
      "id": "person-news",
      "parentSelectors": [
        "news"
      ],
      "selector": ".article-list article",
      "type": "SelectorGroup"
    },
    {
      "id": "awards-community",
      "linkType": "linkFromHref",
      "multiple": false,
      "parentSelectors": [
        "person"
      ],
      "selector": "a.tabs-link:contains(\"Awards and Community\")",
      "type": "SelectorLink"
    },
    {
      "extractAttribute": "",
      "id": "person-awards-community",
      "parentSelectors": [
        "awards-community"
      ],
      "selector": ".grid-content-main p",
      "type": "SelectorGroup"
    },
    {
      "extractAttribute": "",
      "id": "person-thought-leadership",
      "parentSelectors": [
        "thought-leadership"
      ],
      "selector": ".grid-content-main .article-list article",
      "type": "SelectorGroup"
    }
  ]
}
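To make the structure of the sitemap easier to see, here is a small Python sketch that groups the SelectorLink selectors by their parent. It is a diagnostic aid only, not part of the scraper. `SITEMAP_JSON` is a trimmed copy of the sitemap above that keeps only the `id`, `type`, and `parentSelectors` fields of the link selectors, and the function name `link_children` is hypothetical.

```python
import json
from collections import defaultdict

# Trimmed copy of the sitemap: link selectors only, in sitemap order.
SITEMAP_JSON = """
{
  "_id": "people-pagination",
  "selectors": [
    {"id": "people", "type": "SelectorLink", "parentSelectors": ["_root"]},
    {"id": "person", "type": "SelectorLink", "parentSelectors": ["page"]},
    {"id": "experience", "type": "SelectorLink", "parentSelectors": ["person"]},
    {"id": "thought-leadership", "type": "SelectorLink", "parentSelectors": ["person"]},
    {"id": "news", "type": "SelectorLink", "parentSelectors": ["person"]},
    {"id": "awards-community", "type": "SelectorLink", "parentSelectors": ["person"]}
  ]
}
"""


def link_children(sitemap: dict) -> dict:
    """Map each parent selector id to its SelectorLink children, in sitemap order."""
    children = defaultdict(list)
    for selector in sitemap["selectors"]:
        if selector["type"] != "SelectorLink":
            continue
        for parent in selector["parentSelectors"]:
            children[parent].append(selector["id"])
    return dict(children)


if __name__ == "__main__":
    for parent, ids in link_children(json.loads(SITEMAP_JSON)).items():
        print(f"{parent}: {', '.join(ids)}")
```

Under `person` this lists four sibling link selectors, with `awards-community` last in sitemap order, which is the tab the observation above would predict gets opened.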
P.S. I also tried making it an ElementSelector. It did go through the other tabs, but it did not return any data, probably because the page reloads.