Select multiple elements (and then links) on each page not working

AD00 · May 7, 2025, 4:01pm

Hello,

I am trying to do some web scraping on PubMed. I want to go through each results page using pagination, select the link to each article on the page, and then open each article to extract things like the title, author and abstract. The goal is an Excel file with the title, abstract, author etc. of all the articles PubMed returns for a search such as 'epidemiology and "squamous cell carcinoma" and cutaneous'. The pagination seems to work, but the scraper does not extract any of the links on a given page.
Here is the sitemap for reference:
{"_id":"pubmed_epidemiology","startUrl":["[Invalid form] - Search Results - PubMed h1","multiple":false,"regex":""},{"id":"abstract_text","parentSelectors":["article_link"],"type":"SelectorText","selector":".abstract-content p","multiple":false,"regex":""},{"id":"firstautor_text","parentSelectors":["article_link"],"type":"SelectorText","selector":".inline-authors span.authors-list-item:nth-of-type(1)","multiple":false,"regex":""},{"id":"PMID_text","parentSelectors":["article_link"],"type":"SelectorText","selector":"#full-view-identifiers strong","multiple":false,"regex":""},{"id":"journal_text","parentSelectors":["article_link"],"type":"SelectorText","selector":"button#full-view-journal-trigger","multiple":false,"regex":""},{"id":"year_text","parentSelectors":["article_link"],"type":"SelectorText","selector":".full-view span.cit","multiple":false,"regex":""},{"id":"doi_text","parentSelectors":["article_link"],"type":"SelectorText","selector":".full-view span.citation-doi","multiple":false,"regex":""},{"id":"openaccess_text","parentSelectors":["article_link"],"type":"SelectorText","selector":"#full-view-identifiers .pmc a","multiple":false,"regex":""}]}.
I think there is a problem in how I connect article_el to the pagination, but I am not 100% sure. Any suggestions would be appreciated.

Thanks!

JanAp · May 11, 2025, 3:43pm

Hi,

Please post the sitemap as Preformatted text; otherwise, the JSON is broken.

code

AD00 · May 12, 2025, 11:02am

{"_id":"pubmed_epidemiology","startUrl":["https://pubmed.ncbi.nlm.nih.gov/?term=epidemiology%20and%20%22squamous%20cell%20carcinoma%22%20and%20cutaneous&filter=simsearch1.fha&filter=years.2025-2025&page=[1-4]"],"selectors":[{"id":"article_el","parentSelectors":["_root"],"type":"SelectorElement","selector":"div.search-results","multiple":false,"scroll":false,"elementLimit":0},{"id":"article_link","parentSelectors":["article_el"],"type":"SelectorLink","selector":"a.docsum-title","multiple":true,"linkType":"linkFromHref"},{"id":"title_text","parentSelectors":["article_link"],"type":"SelectorText","selector":".full-view h1","multiple":false,"regex":""},{"id":"abstract_text","parentSelectors":["article_link"],"type":"SelectorText","selector":"div.abstract","multiple":false,"regex":""},{"id":"firstautor_text","parentSelectors":["article_link"],"type":"SelectorText","selector":".inline-authors span.authors-list-item:nth-of-type(1)","multiple":false,"regex":""},{"id":"PMID_text","parentSelectors":["article_link"],"type":"SelectorText","selector":"#full-view-identifiers strong","multiple":false,"regex":""},{"id":"journal_text","parentSelectors":["article_link"],"type":"SelectorText","selector":"button#full-view-journal-trigger","multiple":false,"regex":""},{"id":"year_text","parentSelectors":["article_link"],"type":"SelectorText","selector":".full-view span.cit","multiple":false,"regex":""},{"id":"doi_text","parentSelectors":["article_link"],"type":"SelectorText","selector":".full-view span.citation-doi","multiple":false,"regex":""},{"id":"openaccess_text","parentSelectors":["article_link"],"type":"SelectorText","selector":"#full-view-identifiers .pmc a","multiple":false,"regex":""}]}

Hi, I have re-uploaded the sitemap above, thanks.
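
For reference, the `[1-4]` at the end of the start URL above is a page range: the scraper visits one URL per number in the range. The Python below is only a rough sketch of that expansion, not Web Scraper's own code. The helper name `expand_page_range` is made up for illustration, and it assumes a single plain `[start-end]` range with no zero padding or step.

```python
import re

# Start URL copied from the sitemap above.
START_URL = (
    "https://pubmed.ncbi.nlm.nih.gov/?term=epidemiology%20and%20%22squamous"
    "%20cell%20carcinoma%22%20and%20cutaneous&filter=simsearch1.fha"
    "&filter=years.2025-2025&page=[1-4]"
)

def expand_page_range(url):
    """Hypothetical helper: expand one [start-end] range into concrete URLs."""
    match = re.search(r"\[(\d+)-(\d+)\]", url)
    if match is None:
        return [url]  # no range in the URL, nothing to expand
    start, end = int(match.group(1)), int(match.group(2))
    prefix, suffix = url[:match.start()], url[match.end():]
    # Both ends of the range are included.
    return [prefix + str(page) + suffix for page in range(start, end + 1)]

for page_url in expand_page_range(START_URL):
    print(page_url)
```

With this approach the number of pages has to be known in advance, which is what a Pagination selector avoids.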

JanAp · May 13, 2025, 1:52pm

Hi,

What is the issue again? The sitemap seems to work just fine.

AD00 · May 15, 2025, 11:25am

Hi,
Sorry, I pasted the wrong JSON above. I was trying to avoid using a page range such as [1-100] in the start URL, and to use the Pagination selector instead. Here is the correct JSON using the Pagination selector:

{"_id":"pubmed_epi_pagination","startUrl":["https://pubmed.ncbi.nlm.nih.gov/?term=epidemiology+and+%22squamous+cell+carcinoma%22+and+cutaneous&filter=simsearch1.fha&filter=years.2025-2025"],"selectors":[{"id":"Pagination","paginationType":"auto","parentSelectors":["_root","Pagination"],"selector":".button-wrapper.next-page-btn","type":"SelectorPagination"},{"elementLimit":0,"id":"article_el","multiple":true,"parentSelectors":["_root","Pagination"],"scroll":false,"selector":"div.docsum-wrap","type":"SelectorElement"},{"id":"article_link","linkType":"linkFromHref","multiple":false,"parentSelectors":["article_el"],"selector":"a","type":"SelectorLink"},{"id":"Tile_text","multiple":false,"parentSelectors":["article_link"],"regex":"","selector":".full-view h1","type":"SelectorText"},{"id":"Journal_text","multiple":false,"parentSelectors":["article_link"],"regex":"","selector":"button#full-view-journal-trigger","type":"SelectorText"},{"id":"DOI_text","multiple":false,"parentSelectors":["article_link"],"regex":"","selector":".full-view span.citation-doi","type":"SelectorText"},{"id":"Firstauthor_text","multiple":false,"parentSelectors":["article_link"],"regex":"","selector":".inline-authors span.authors-list-item:nth-of-type(1)","type":"SelectorText"},{"id":"abstract_text","multiple":false,"parentSelectors":["article_link"],"regex":"","selector":".abstract-content p","type":"SelectorText"}]}

In this format, no data is scraped by the end of the process.

JanAp · May 15, 2025, 11:38am

I have added :not([disabled]) to the Pagination selector so that it stops clicking the next-page button on the last page.

{"_id":"pubmed_epi_pagination","startUrl":["https://pubmed.ncbi.nlm.nih.gov/?term=epidemiology+and+%22squamous+cell+carcinoma%22+and+cutaneous&filter=simsearch1.fha&filter=years.2025-2025"],"selectors":[{"id":"Pagination","paginationType":"auto","parentSelectors":["_root","Pagination"],"selector":".button-wrapper.next-page-btn:not([disabled])","type":"SelectorPagination"},{"elementLimit":0,"id":"article_el","multiple":true,"parentSelectors":["_root","Pagination"],"scroll":false,"selector":"div.docsum-wrap","type":"SelectorElement"},{"id":"article_link","linkType":"linkFromHref","multiple":false,"parentSelectors":["article_el"],"selector":"a","type":"SelectorLink"},{"id":"Tile_text","multiple":false,"parentSelectors":["article_link"],"regex":"","selector":".full-view h1","type":"SelectorText"},{"id":"Journal_text","multiple":false,"parentSelectors":["article_link"],"regex":"","selector":"button#full-view-journal-trigger","type":"SelectorText"},{"id":"DOI_text","multiple":false,"parentSelectors":["article_link"],"regex":"","selector":".full-view span.citation-doi","type":"SelectorText"},{"id":"Firstauthor_text","multiple":false,"parentSelectors":["article_link"],"regex":"","selector":".inline-authors span.authors-list-item:nth-of-type(1)","type":"SelectorText"},{"id":"abstract_text","multiple":false,"parentSelectors":["article_link"],"regex":"","selector":".abstract-content p","type":"SelectorText"}]}

AD00 · May 15, 2025, 12:53pm

Thank you so much, it works now!

JanAp · May 15, 2025, 1:00pm

Glad I could help! If you have a moment, I'd appreciate you leaving a review on the Web Scraper extension page! Thanks