Select multiple elements (and then links) on each page not working

Hello,

I am trying to do some web scraping on PubMed. I want to go through each results page using pagination and, on each page, select the link of every article so that I can open each article and extract things like the title, author, abstract, etc. (I want to end up with an Excel file containing the title, abstract, author, etc. of all the articles shown on PubMed when I search for 'epidemiology and "squamous cell carcinoma" and cutaneous', for example.) The pagination seems to work, but the scraping does not extract any of the links on a given page.
Here is the sitemap for reference:
{"_id":"pubmed_epidemiology","startUrl":["[Invalid form] - Search Results - PubMed h1","multiple":false,"regex":""},{"id":"abstract_text","parentSelectors":["article_link"],"type":"SelectorText","selector":".abstract-content p","multiple":false,"regex":""},{"id":"firstautor_text","parentSelectors":["article_link"],"type":"SelectorText","selector":".inline-authors span.authors-list-item:nth-of-type(1)","multiple":false,"regex":""},{"id":"PMID_text","parentSelectors":["article_link"],"type":"SelectorText","selector":"#full-view-identifiers strong","multiple":false,"regex":""},{"id":"journal_text","parentSelectors":["article_link"],"type":"SelectorText","selector":"button#full-view-journal-trigger","multiple":false,"regex":""},{"id":"year_text","parentSelectors":["article_link"],"type":"SelectorText","selector":".full-view span.cit","multiple":false,"regex":""},{"id":"doi_text","parentSelectors":["article_link"],"type":"SelectorText","selector":".full-view span.citation-doi","multiple":false,"regex":""},{"id":"openaccess_text","parentSelectors":["article_link"],"type":"SelectorText","selector":"#full-view-identifiers .pmc a","multiple":false,"regex":""}]}.
I think there is a problem in how I connect article_el to the pagination, but I am not 100% sure. Any suggestions would be appreciated.

Thanks!

Hi,

Please post the sitemap as Preformatted text; otherwise, the JSON is broken.

code

{"_id":"pubmed_epidemiology","startUrl":["https://pubmed.ncbi.nlm.nih.gov/?term=epidemiology%20and%20%22squamous%20cell%20carcinoma%22%20and%20cutaneous&filter=simsearch1.fha&filter=years.2025-2025&page=[1-4]"],"selectors":[{"id":"article_el","parentSelectors":["_root"],"type":"SelectorElement","selector":"div.search-results","multiple":false,"scroll":false,"elementLimit":0},{"id":"article_link","parentSelectors":["article_el"],"type":"SelectorLink","selector":"a.docsum-title","multiple":true,"linkType":"linkFromHref"},{"id":"title_text","parentSelectors":["article_link"],"type":"SelectorText","selector":".full-view h1","multiple":false,"regex":""},{"id":"abstract_text","parentSelectors":["article_link"],"type":"SelectorText","selector":"div.abstract","multiple":false,"regex":""},{"id":"firstautor_text","parentSelectors":["article_link"],"type":"SelectorText","selector":".inline-authors span.authors-list-item:nth-of-type(1)","multiple":false,"regex":""},{"id":"PMID_text","parentSelectors":["article_link"],"type":"SelectorText","selector":"#full-view-identifiers strong","multiple":false,"regex":""},{"id":"journal_text","parentSelectors":["article_link"],"type":"SelectorText","selector":"button#full-view-journal-trigger","multiple":false,"regex":""},{"id":"year_text","parentSelectors":["article_link"],"type":"SelectorText","selector":".full-view span.cit","multiple":false,"regex":""},{"id":"doi_text","parentSelectors":["article_link"],"type":"SelectorText","selector":".full-view span.citation-doi","multiple":false,"regex":""},{"id":"openaccess_text","parentSelectors":["article_link"],"type":"SelectorText","selector":"#full-view-identifiers .pmc a","multiple":false,"regex":""}]}

Hi, I have re-uploaded the sitemap above, thanks.

Hi,

What is the issue again? The sitemap seems to work just fine.
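
For reference, the [1-4] at the end of the start URL is a range: the scraper visits one URL per page number. Here is a rough Python sketch of that expansion, just to illustrate the idea. It is not the extension's actual code, expand_range_url is a made-up helper name, and the URL is shortened for readability.

```python
import re


def expand_range_url(url: str) -> list[str]:
    """Expand the first [start-end] range in a URL into one URL per value."""
    match = re.search(r"\[(\d+)-(\d+)\]", url)
    if match is None:
        # No range present: the URL is used as-is.
        return [url]
    start, end = int(match.group(1)), int(match.group(2))
    prefix, suffix = url[:match.start()], url[match.end():]
    return [f"{prefix}{n}{suffix}" for n in range(start, end + 1)]


# Shortened, illustrative version of the start URL from the sitemap above.
urls = expand_range_url("https://pubmed.ncbi.nlm.nih.gov/?term=epidemiology&page=[1-4]")
for u in urls:
    print(u)
```

With a range like this the number of pages is fixed in advance, which is why it has to be updated by hand when the search returns more results.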

Hi,
Sorry, I pasted the wrong JSON above. I was trying to avoid using the [1-100] range for pagination and use the Pagination selector instead. Here is the correct JSON using the Pagination selector:

{"_id":"pubmed_epi_pagination","startUrl":["https://pubmed.ncbi.nlm.nih.gov/?term=epidemiology+and+%22squamous+cell+carcinoma%22+and+cutaneous&filter=simsearch1.fha&filter=years.2025-2025"],"selectors":[{"id":"Pagination","paginationType":"auto","parentSelectors":["_root","Pagination"],"selector":".button-wrapper.next-page-btn","type":"SelectorPagination"},{"elementLimit":0,"id":"article_el","multiple":true,"parentSelectors":["_root","Pagination"],"scroll":false,"selector":"div.docsum-wrap","type":"SelectorElement"},{"id":"article_link","linkType":"linkFromHref","multiple":false,"parentSelectors":["article_el"],"selector":"a","type":"SelectorLink"},{"id":"Tile_text","multiple":false,"parentSelectors":["article_link"],"regex":"","selector":".full-view h1","type":"SelectorText"},{"id":"Journal_text","multiple":false,"parentSelectors":["article_link"],"regex":"","selector":"button#full-view-journal-trigger","type":"SelectorText"},{"id":"DOI_text","multiple":false,"parentSelectors":["article_link"],"regex":"","selector":".full-view span.citation-doi","type":"SelectorText"},{"id":"Firstauthor_text","multiple":false,"parentSelectors":["article_link"],"regex":"","selector":".inline-authors span.authors-list-item:nth-of-type(1)","type":"SelectorText"},{"id":"abstract_text","multiple":false,"parentSelectors":["article_link"],"regex":"","selector":".abstract-content p","type":"SelectorText"}]}

In this format, no data is scraped by the end of the process.

I have added :not([disabled]) to the Pagination selector so that it stops clicking the next-page button on the last page, where the button is disabled.

{"_id":"pubmed_epi_pagination","startUrl":["https://pubmed.ncbi.nlm.nih.gov/?term=epidemiology+and+%22squamous+cell+carcinoma%22+and+cutaneous&filter=simsearch1.fha&filter=years.2025-2025"],"selectors":[{"id":"Pagination","paginationType":"auto","parentSelectors":["_root","Pagination"],"selector":".button-wrapper.next-page-btn:not([disabled])","type":"SelectorPagination"},{"elementLimit":0,"id":"article_el","multiple":true,"parentSelectors":["_root","Pagination"],"scroll":false,"selector":"div.docsum-wrap","type":"SelectorElement"},{"id":"article_link","linkType":"linkFromHref","multiple":false,"parentSelectors":["article_el"],"selector":"a","type":"SelectorLink"},{"id":"Tile_text","multiple":false,"parentSelectors":["article_link"],"regex":"","selector":".full-view h1","type":"SelectorText"},{"id":"Journal_text","multiple":false,"parentSelectors":["article_link"],"regex":"","selector":"button#full-view-journal-trigger","type":"SelectorText"},{"id":"DOI_text","multiple":false,"parentSelectors":["article_link"],"regex":"","selector":".full-view span.citation-doi","type":"SelectorText"},{"id":"Firstauthor_text","multiple":false,"parentSelectors":["article_link"],"regex":"","selector":".inline-authors span.authors-list-item:nth-of-type(1)","type":"SelectorText"},{"id":"abstract_text","multiple":false,"parentSelectors":["article_link"],"regex":"","selector":".abstract-content p","type":"SelectorText"}]}
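
If you have several exported sitemaps and would rather not edit each one by hand, the same change can be scripted. This is only a minimal Python sketch: guard_pagination is a hypothetical helper, and the example sitemap is a cut-down stand-in for the one above, keeping only the fields the script reads.

```python
import json


def guard_pagination(sitemap_json: str, suffix: str = ":not([disabled])") -> str:
    """Append a guard to every SelectorPagination selector in a sitemap."""
    sitemap = json.loads(sitemap_json)
    for sel in sitemap.get("selectors", []):
        is_pagination = sel.get("type") == "SelectorPagination"
        # Skip selectors that already carry the guard.
        if is_pagination and not sel["selector"].endswith(suffix):
            sel["selector"] += suffix
    # Compact output, similar to what the extension exports.
    return json.dumps(sitemap, separators=(",", ":"))


# Cut-down example sitemap (assumption: only the keys used above are needed).
example = json.dumps({
    "_id": "demo",
    "selectors": [
        {"id": "Pagination", "type": "SelectorPagination",
         "selector": ".button-wrapper.next-page-btn"},
        {"id": "article_el", "type": "SelectorElement",
         "selector": "div.docsum-wrap"},
    ],
})

patched = json.loads(guard_pagination(example))
print(patched["selectors"][0]["selector"])
```

Only the Pagination selector changes; the element, link and text selectors are left exactly as they were.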

Thank you so much, it works now!

Glad I could help! If you have a moment, I'd appreciate you leaving a review on the Web Scraper extension page! Thanks