Only scrape pages containing a specific term

Arbitrary example: Let's say I want to scrape a list of job openings. I want the scraper to visit each job page and grab the job title, description, address etc. However, I only want to save the jobs where the word "Java" was in the job description. Is there a way to do this?

Hi @matt_gr
You could create a rule like this, for example: body:has(div[class*="description"]:contains("Java"))

Thanks for replying @ViestursWS

How would I go about implementing this? I tried creating a selector with this rule as the first selector on the details page, and then nesting the selectors for title, content and so on inside it.

Here's one site I'm trying to scrape: https://itelligencegroup.com/dk/careers/jobs/all-jobs/

So far I've successfully made an element clicker that opens the accordion tabs, goes onto each page and grabs the job titles and description. But I want to only save the records that contain e.g. "SAP" in the description.

Here's the sitemap:
{"_id":"itelligence","startUrl":["https://itelligencegroup.com/dk/careers/jobs/all-jobs/"],"selectors":[{"id":"accordion-click","type":"SelectorElementClick","parentSelectors":["_root"],"selector":"body","multiple":true,"delay":"0","clickElementSelector":"a.training-title-click","clickType":"clickOnce","discardInitialElements":"do-not-discard","clickElementUniquenessType":"uniqueText"},{"id":"job-link","type":"SelectorPopupLink","parentSelectors":["accordion-click"],"selector":"a.side-navigation__toggle-link, .in a","multiple":true,"delay":0},{"id":"job-title","type":"SelectorText","parentSelectors":["job-link"],"selector":"h1","multiple":false,"regex":"","delay":0},{"id":"job-description","type":"SelectorText","parentSelectors":["job-link"],"selector":"div.content","multiple":false,"regex":"","delay":0}]}

@matt_gr
The idea is to define where the keyword must appear, so that the selector only matches the pages that contain it. I doubted that "SAP" would be distinctive enough on this site, so I chose one person's e-mail address instead: mette.larsson@itelligence.dk. As far as I could check, it returned 3 results, each with that e-mail in the description. I edited the sitemap slightly by adding an Element selector that checks for the keyword in the description field of each page, and made it the parent of the title and description selectors.
Take a look:
{"_id":"itelligence","startUrl":["https://itelligencegroup.com/dk/careers/jobs/all-jobs/"],"selectors":[{"id":"accordion-click","type":"SelectorElementClick","parentSelectors":["_root"],"selector":"body","multiple":true,"delay":"0","clickElementSelector":"a.training-title-click","clickType":"clickOnce","discardInitialElements":"do-not-discard","clickElementUniquenessType":"uniqueText"},{"id":"job-link","type":"SelectorPopupLink","parentSelectors":["accordion-click"],"selector":"a.side-navigation__toggle-link, .in a","multiple":true,"delay":0},{"id":"job-title","type":"SelectorText","parentSelectors":["card"],"selector":"h1","multiple":false,"regex":"","delay":0},{"id":"job-description","type":"SelectorText","parentSelectors":["card"],"selector":"div.content","multiple":false,"regex":"","delay":0},{"id":"card","type":"SelectorElement","parentSelectors":["job-link"],"selector":"body:has(div[class*=\"joqReqDescription\"]):contains(\"mette.larsson@itelligence.dk\")","multiple":true,"delay":0}]}
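
For the "SAP" keyword from the original question, the `card` selector could look roughly like the fragment below. This is only a sketch and has not been run against the live site. It assumes the description container's class still contains `joqReqDescription`, as in the sitemap above, and it moves `:contains` inside `:has` so that only the description text is searched rather than the whole page body. Keep in mind that jQuery-style `:contains` is case-sensitive and matches substrings, so "SAP" would also match a word like "SAPUI5" but not "sap".

```json
{
  "id": "card",
  "type": "SelectorElement",
  "parentSelectors": ["job-link"],
  "selector": "body:has(div[class*=\"joqReqDescription\"]:contains(\"SAP\"))",
  "multiple": true,
  "delay": 0
}
```

The `job-title` and `job-description` selectors keep `"parentSelectors":["card"]`, so pages where `card` matches nothing produce no record.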

Ah thank you! I see the key was to make a selector with that rule as the first one on the details page, and then inside that selector make more selectors to grab the info I wanted. Cheers.