Using the same selector name within different elements?

Hi there,

I'm enjoying learning how to use the tool for a variety of purposes. Just signed up for a subscription. Very useful!

I am stuck on one use case. I've been trying to tweak things and figure it out myself for a good few hours now, so I've finally given up and am turning to the experts for help!

A mock-up of the scenario is at the link below. There are multiple links to different job listings that are hosted on different domains. I'd like to be able to extract the job name, salary, location and description from each into a spreadsheet. As the code is slightly different on each website I'm hitting a roadblock, i.e. the title of the job uses an h1 on one site, but an h3 on another.

Below is my sitemap, just focussing on extracting the job title for now. As you can see, the job titles are split across two different columns, 'name1' and 'name2'. How can I set up the sitemap so they all appear under a single column? It doesn't appear that you can use the same selector name more than once across different elements.

Url: http://rssbuilder.nfshost.com/bba/joblistingexample.html

Sitemap:
{"_id":"joblinks","startUrl":["http://rssbuilder.nfshost.com/bba/joblistingexample.html"],"selectors":[{"id":"grablink","type":"SelectorLink","parentSelectors":["_root"],"selector":"a","multiple":true,"delay":0},{"id":"indeed","type":"SelectorElement","parentSelectors":["grablink"],"selector":"div.jobsearch-ViewJobLayout-mainContent","multiple":false,"delay":0},{"id":"reed","type":"SelectorElement","parentSelectors":["grablink"],"selector":"article","multiple":false,"delay":0},{"id":"name1","type":"SelectorText","parentSelectors":["indeed"],"selector":"h3","multiple":false,"regex":"","delay":0},{"id":"name2","type":"SelectorText","parentSelectors":["reed"],"selector":"h1","multiple":false,"regex":"","delay":0}]}

I thought one solution might be to export the above, edit it so both selectors were called 'name', then import it again. This results in one column, but only grabs half the names. This sitemap is below.

{"_id":"joblinks","startUrl":["http://rssbuilder.nfshost.com/bba/joblistingexample.html"],"selectors":[{"id":"grablink","type":"SelectorLink","parentSelectors":["_root"],"selector":"a","multiple":true,"delay":0},{"id":"indeed","type":"SelectorElement","parentSelectors":["grablink"],"selector":"div.jobsearch-ViewJobLayout-mainContent","multiple":false,"delay":0},{"id":"reed","type":"SelectorElement","parentSelectors":["grablink"],"selector":"article","multiple":false,"delay":0},{"id":"name","type":"SelectorText","parentSelectors":["indeed"],"selector":"h3","multiple":false,"regex":"","delay":0},{"id":"name","type":"SelectorText","parentSelectors":["reed"],"selector":"h1","multiple":false,"regex":"","delay":0}]}

Any help appreciated - I'm hoping it is possible and I'm just unable to figure it out!

You could just have separate sitemaps for Indeed links and Reed links, then merge the results later. As they are different sitemaps, you can use the same selector name and structure, and that would produce CSVs with similar columns. The example below will only click and scrape the Indeed links. With slight changes, you can create a similar one for Reed links:

{"_id":"forum-nfshost-test-indeed","startUrl":["http://rssbuilder.nfshost.com/bba/joblistingexample.html"],"selectors":[{"id":"Indeed links","type":"SelectorLink","parentSelectors":["_root"],"selector":"a[href*='indeed.co']","multiple":true,"delay":0},{"id":"Title","type":"SelectorText","parentSelectors":["Indeed links"],"selector":"div[class*='title'] > h3","multiple":false,"regex":"","delay":0},{"id":"Company","type":"SelectorText","parentSelectors":["Indeed links"],"selector":"div[class*='jobsearch-CompanyInfo'] > div","multiple":false,"regex":"","delay":0}]}
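For the "merge the results later" step, here is a minimal Python sketch. It assumes each sitemap's results were exported as a CSV file and that the two exports use the same column names (for example Title and Company). The function name and the file names in the usage comment are hypothetical, not something the tool provides.

```python
import csv


def merge_csv_exports(paths, out_path):
    """Concatenate several CSV exports into one file.

    The output header is the union of all input columns, in first-seen order.
    Cells for columns a given export lacks are left empty.
    Returns the number of data rows written.
    """
    rows = []
    fieldnames = []
    for path in paths:
        # utf-8-sig drops a byte-order mark if the export starts with one.
        with open(path, newline="", encoding="utf-8-sig") as f:
            reader = csv.DictReader(f)
            for name in reader.fieldnames or []:
                if name not in fieldnames:
                    fieldnames.append(name)
            rows.extend(reader)

    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)


# Hypothetical usage, with one export per sitemap:
# merge_csv_exports(["indeed.csv", "reed.csv"], "all_jobs.csv")
```

Because the header is built from the union of the columns, a column that exists in only one export still ends up in the merged file, just with blank cells for the other site's rows.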

Thanks @leemeng!

I had considered this as a last resort. Was hoping there was a solution that allows this to be achieved with one sitemap. Can others in the community confirm that what I'm trying to achieve is definitely not possible?

Thanks!

You can give a single selector multiple CSS selectors by separating them with commas. You don't need element selectors if you want the same selectors to run on every site. If you want site-specific selectors, set the element selectors to multiple, so that an element selector that matches nothing doesn't return an empty row. Here is an example sitemap with a comma-separated selector:

{"_id":"joblinks","startUrl":["http://rssbuilder.nfshost.com/bba/joblistingexample.html"],"selectors":[{"id":"grablink","type":"SelectorLink","parentSelectors":["_root"],"selector":"a","multiple":true,"delay":0},{"id":"name","type":"SelectorText","parentSelectors":["grablink"],"selector":"h3, h1","multiple":false,"regex":"","delay":0}]}
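If more fields are needed (salary, location, description), the same comma-separated pattern can be generated rather than hand-edited. This is only a sketch: `build_sitemap` is a hypothetical helper, not part of the tool, and it simply reproduces the JSON structure of the sitemap above. Only the `h3, h1` pair comes from this thread; selectors for any other field would have to be worked out per site.

```python
import json


def build_sitemap(sitemap_id, start_url, field_selectors):
    """Build a sitemap dict with one text selector per output column.

    field_selectors maps a column name to a list of CSS selectors, one per
    site layout; they are joined with commas into a single selector.
    """
    selectors = [{
        "id": "grablink",
        "type": "SelectorLink",
        "parentSelectors": ["_root"],
        "selector": "a",
        "multiple": True,
        "delay": 0,
    }]
    for field, css_options in field_selectors.items():
        selectors.append({
            "id": field,
            "type": "SelectorText",
            "parentSelectors": ["grablink"],
            "selector": ", ".join(css_options),  # e.g. "h3, h1"
            "multiple": False,
            "regex": "",
            "delay": 0,
        })
    return {"_id": sitemap_id, "startUrl": [start_url], "selectors": selectors}


sitemap = build_sitemap(
    "joblinks",
    "http://rssbuilder.nfshost.com/bba/joblistingexample.html",
    {"name": ["h3", "h1"]},  # add further columns here with per-site selectors
)
# Compact JSON that can be pasted into the sitemap import dialog.
print(json.dumps(sitemap, separators=(",", ":")))
```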

Thank you! This is what I was looking to achieve.