Extract link and text from pagination

I want to extract the text behind every link on every page of the results in a single scrape. Please guide me on how to cover all the pages in one run.

Here is the sitemap:
{"_id":"link1","startUrl":["https://www.myscience.org/jobs/search?ctrl=1&p=&d=Administration-Government&r=&t=&q="],"selectors":[{"id":"link","type":"SelectorLink","selector":"a.url","parentSelectors":["_root"],"multiple":true,"delay":0},{"id":"txt","type":"SelectorText","selector":"table","parentSelectors":["link"],"multiple":true,"regex":"","delay":0}]}

Hi,

This is a test that covers page 1 only (you will still have to take the pagination into account).

Be careful: the pages are not all structured the same way, so you will have to add more Text selectors to scrape all the data.

{"_id":"test2","startUrl":["https://www.myscience.org/jobs/search?ctrl=1&p=&d=Administration-Government&r=&t=&q="],"selectors":[{"id":"link","type":"SelectorLink","selector":"a.url","parentSelectors":["_root"],"multiple":true,"delay":"2000"},{"id":"title","type":"SelectorText","selector":"h1.hA","parentSelectors":["link"],"multiple":false,"regex":"","delay":0},{"id":"employer","type":"SelectorText","selector":"tr:nth-of-type(3) a","parentSelectors":["link"],"multiple":false,"regex":"","delay":0},{"id":"published","type":"SelectorText","selector":"tr:nth-of-type(6) time","parentSelectors":["link"],"multiple":false,"regex":"","delay":0},{"id":"closing date","type":"SelectorText","selector":"tr:nth-of-type(8) time","parentSelectors":["link"],"multiple":false,"regex":"","delay":0}]}

Thank you, but this is not the solution I need. I want the same pattern as below applied across all the paginated pages, and I don't know how to add all the pages. Please help.

{"_id":"link1","startUrl":["https://www.myscience.org/jobs/search?ctrl=1&p=&d=Administration-Government&r=&t=&q="],"selectors":[{"id":"link","type":"SelectorLink","selector":"a.url","parentSelectors":["_root"],"multiple":true,"delay":0},{"id":"txt","type":"SelectorText","selector":"table","parentSelectors":["link"],"multiple":true,"regex":"","delay":0}]}

To add pagination, follow the tutorials presented here: http://webscraper.io/tutorials

Could you please do it for me?

Hello!

You can build pagination with an Element Click selector, but that requires at least a general knowledge of CSS selectors. In this particular example, I think you need to pick the 'Next' button rather than a specific page number. If you select it on the first page and then switch to another page, though, you will notice that the selector now picks a page number instead of 'Next'. The right selector for the Next button is div.container a.nav:last-child.

Please keep in mind that the search results you linked contain more than 1000 records. My personal recommendation is to select and scrape only the region you are really interested in.

Here's an example sitemap I've made for you:

{"_id":"link1test","startUrl":["https://www.myscience.org/jobs/search?ctrl=1&p=&d=Administration-Government&r=New+South+Wales&t=&q="],"selectors":[{"id":"click_next","type":"SelectorElementClick","selector":"div.textbody > div.block","parentSelectors":["_root"],"multiple":true,"delay":"1500","clickElementSelector":"div.container a.nav:last-child","clickType":"clickMore","discardInitialElements":false,"clickElementUniquenessType":"uniqueText"},{"id":"Title","type":"SelectorLink","selector":"a.url","parentSelectors":["click_next"],"multiple":true,"delay":"1500"},{"id":"text_inside","type":"SelectorText","selector":"table","parentSelectors":["Title"],"multiple":false,"regex":"","delay":0}]}

It will go through all the pages in the pagination first, then through the individual job pages.

Please choose the right region beforehand and put its URL into the sitemap (Sitemap -> Edit Metadata).

Hmm, for some reason that didn't work on my side. It paginated through all 5 pages but only scraped one page of links before ending. I found a slightly different way to make it work without using the Element Click selector.

Any thoughts on why this method worked for me but the way you suggested didn't?

{"_id":"link1test","startUrl":["https://www.myscience.org/jobs/search?ctrl=1&p=&d=Administration-Government&r=New+South+Wales&t=&q="],"selectors":[{"id":"Pag","type":"SelectorLink","selector":"div.container a.nav:last-child","parentSelectors":["_root","Pag"],"multiple":false,"delay":0},{"id":"Link Selector","type":"SelectorLink","selector":"a.url","parentSelectors":["_root","Pag"],"multiple":true,"delay":0},{"id":"Body Text","type":"SelectorText","selector":"tbody","parentSelectors":["Link Selector"],"multiple":false,"regex":"","delay":0}]}
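
To illustrate why listing "Pag" as its own parent makes the Link selector walk every page, here is a minimal Python sketch. It is a simplified model, not Web Scraper's actual implementation: the PAGES dictionary and the crawl function are hypothetical stand-ins for the site and the scraper. Each page contributes its job links and, if it has a Next link, queues the following page so the same selectors are applied there too.

```python
from collections import deque

# Hypothetical miniature site: each page lists job links and, optionally,
# the page that its "Next" link points to.
PAGES = {
    "page1": {"jobs": ["job-a", "job-b"], "next": "page2"},
    "page2": {"jobs": ["job-c"], "next": "page3"},
    "page3": {"jobs": ["job-d", "job-e"], "next": None},
}


def crawl(start):
    """Collect job links from the start page and every page reached via 'Next'."""
    seen = set()
    queue = deque([start])
    jobs = []
    while queue:
        page = queue.popleft()
        if page in seen:
            continue  # guard against pagination loops
        seen.add(page)
        # Analogous to the "Link Selector" running under both _root and Pag.
        jobs.extend(PAGES[page]["jobs"])
        # Analogous to "Pag" having itself as a parent: follow Next again.
        next_page = PAGES[page]["next"]
        if next_page is not None:
            queue.append(next_page)
    return jobs


print(crawl("page1"))  # ['job-a', 'job-b', 'job-c', 'job-d', 'job-e']
```

If "Pag" were only a child of "_root", this model would stop after page2, since nothing would look for a Next link on the pages reached through pagination.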


I probably made a mistake somewhere. I can't put as much effort into it as I'd like because of my day job.

I'll try to look into it either this evening or over the weekend.

If the Link selector works as expected, that's great! The point was to find the right selector for the Next button, since you can use either a Link or an Element Click selector for pagination, depending on the buttons present on the website. Sometimes the buttons are not links but JavaScript-driven buttons that change the URL. You know this :)

P.S. You can also use a page range in square brackets [ ] in the start URL, as the URL has the page number coded into it.
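
As a rough illustration of what a bracketed range in a start URL stands for, the sketch below expands a range such as [1-3] into one URL per page. The expand_range_url helper, the example.com address, and the "page" parameter name are all hypothetical; which parameter carries the page number on the real site is not confirmed here.

```python
import re


def expand_range_url(url):
    """Expand the first [start-end] range in a URL into a list of URLs."""
    match = re.search(r"\[(\d+)-(\d+)\]", url)
    if match is None:
        return [url]  # no range present, nothing to expand
    first, last = int(match.group(1)), int(match.group(2))
    prefix, suffix = url[:match.start()], url[match.end():]
    return [f"{prefix}{n}{suffix}" for n in range(first, last + 1)]


# Hypothetical URL; the real parameter name may differ.
for page_url in expand_range_url("https://example.com/jobs?page=[1-3]&r=NSW"):
    print(page_url)
```

This prints the three URLs ending in page=1&r=NSW, page=2&r=NSW, and page=3&r=NSW, which is the list of pages the scraper would visit without needing a Next-button selector at all.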

Yep. Sometimes one way works, sometimes another. You seem to have a much better understanding of why.

Just a matter of practice :)

Hi iconoclast,

Could you explain to me how you found the second part of the selector, "a.nav:last-child"?

I tried inspecting the HTML of the page but can't find anything that corresponds to "last-child", so I'd like to understand where you found this string in the code.

Hi Chefas!

:last-child is a CSS selector (a pseudo-class), so it does not appear in the page's HTML.

Please refer to this page for a better understanding of CSS selectors: CSS Selector Reference

There's also a CSS Selector Tester. Try it out!

Yes, but I was curious how you knew to use that selector here.

Thanks.
I will read this tomorrow.

@bretfeig,

If you select the Next button using an nth-of-type selector, you are tied to a fixed position among the page links currently visible on the website. Once you click through to, for example, the second page, that position will land on a page number instead of the Next button you need. Knowing that the button is the last item in the pagination, it's not hard to at least try last-child.

I used to modify websites for my own needs with the Stylebot extension, and I have built websites in the past. That adds a lot to the knowledge, and again, practice :)
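
The difference can be sketched in Python with two made-up pagination bars. The bar contents below are assumptions for illustration, not the real markup of the site: the point is only that a fixed position shifts when the bar gains or loses items, while the last position does not.

```python
# Hypothetical pagination bars as lists of link labels.
bar_on_page_1 = ["1", "2", "3", "Next"]
bar_on_page_2 = ["Prev", "1", "2", "3", "Next"]


def pick_nth(bar, n):
    """Pick the n-th link (1-based), roughly like a:nth-of-type(n)."""
    return bar[n - 1]


def pick_last(bar):
    """Pick the final link, roughly like a:last-child."""
    return bar[-1]


# A fixed position matches "Next" on page 1 but a page number on page 2.
print(pick_nth(bar_on_page_1, 4))  # Next
print(pick_nth(bar_on_page_2, 4))  # 3

# The last position matches "Next" on both pages.
print(pick_last(bar_on_page_1))  # Next
print(pick_last(bar_on_page_2))  # Next
```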