Limit number of pages

Hi, I'd like to scrape the data for each restaurant in an Italian region. The problem is that a single region can have up to 90 pages, and I want to limit the scraping to the first 10 pages, but I don't know how to achieve this.

I tried negative :nth-child ranges, but that didn't work.

Furthermore, I can't retrieve a restaurant's coordinates from its detail page.

Could you help me?

Url: https://www.tripadvisor.it/Restaurants-g2440596-Province_of_Palermo_Sicily.html#EATERY_OVERVIEW_BOX

This is my sitemap:
{"_id":"prova_limite_pagine","startUrl":["https://www.tripadvisor.it/Restaurants-g2440596-Province_of_Palermo_Sicily.html#EATERY_OVERVIEW_BOX"],"selectors":[{"id":"pagina","type":"SelectorElement","parentSelectors":["_root","avanti"],"selector":"div.ui_columns.is-partitioned > div.ui_column.is-9","multiple":false,"delay":"1.5"},{"id":"apriLink","type":"SelectorLink","parentSelectors":["pagina"],"selector":"a.property_title","multiple":true,"delay":0},{"id":"nome","type":"SelectorText","parentSelectors":["apriLink"],"selector":"h1.heading_title","multiple":false,"regex":"","delay":0},{"id":"telefono","type":"SelectorText","parentSelectors":["apriLink"],"selector":"div.blEntry.phone span:nth-of-type(2)","multiple":false,"regex":"","delay":0},{"id":"avanti","type":"SelectorElementClick","parentSelectors":["pagina","avanti"],"selector":"div.pageNumbers","multiple":true,"delay":"1","clickElementSelector":"a.pageNum:nth-child(-n+3)","clickType":"clickOnce","discardInitialElements":false,"clickElementUniquenessType":"uniqueText"},{"id":"indirizzo","type":"SelectorText","parentSelectors":["apriLink"],"selector":"div.blEntry span.street-address","multiple":false,"regex":"","delay":0},{"id":"regione","type":"SelectorText","parentSelectors":["apriLink"],"selector":"div.blEntry span.locality","multiple":false,"regex":"","delay":0},{"id":"fascia_prezzo","type":"SelectorText","parentSelectors":["apriLink"],"selector":"span.header_tags","multiple":false,"regex":"","delay":0},{"id":"tipo_cucina1","type":"SelectorText","parentSelectors":["apriLink"],"selector":"span.header_links a:nth-of-type(1)","multiple":false,"regex":"","delay":0},{"id":"tipo_cucina2","type":"SelectorText","parentSelectors":["apriLink"],"selector":"span.header_links a:nth-of-type(2)","multiple":false,"regex":"","delay":0},{"id":"lat","type":"SelectorElementAttribute","parentSelectors":["apriLink"],"selector":"div.mapContainer","multiple":false,"extractAttribute":"data-lat","delay":"2"},{"id":"lon","type":"SelectorElementAttribute","parentSelectors":["apriLink"],"selector":"div.mapContainer","multiple":false,"extractAttribute":"data-lng","delay":"2"},{"id":"valutazione","type":"SelectorElementAttribute","parentSelectors":["apriLink"],"selector":"div.rs span.ui_bubble_rating","multiple":false,"extractAttribute":"content","delay":0},{"id":"recensioni","type":"SelectorHTML","parentSelectors":["apriLink"],"selector":"a.more span","multiple":false,"regex":"","delay":0}]}

Here's a simple trick: scroll down to where you can change pages, then right-click each page number to copy its link. For example, the link to the 5th page is the following:

https://www.tripadvisor.it/RestaurantSearch-g2440596-oa120-a_date.2018__2D__08__2D__26-a_people.2-a_time.20%3A00%3A00-a_zur.2018__5F__08__5F__26-Provin.html#EATERY_LIST_CONTENTS

So just copy the links of the first 10 pages and add them as start URLs in the scraper. Should be easy to do!
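If you need these lists for several regions, the start URLs can be generated instead of copied by hand. Here is a minimal sketch; the per-page count of 30, the `oa` offset pattern, and the Palermo URL slug are all assumptions inferred from the links in this thread, not verified against TripAdvisor:

```python
# Sketch: generate the start URLs for the first N list pages.
# Assumptions (inferred from the links in this thread): each page
# lists 30 restaurants, page n uses an "oa" offset of (n - 1) * 30,
# and page 1 has no "oa" segment at all.
BASE = ("https://www.tripadvisor.it/RestaurantSearch-g2440596-{oa}"
        "Province_of_Palermo_Sicily.html#EATERY_LIST_CONTENTS")

def first_n_page_urls(n=10, per_page=30):
    """Build the list-page URLs for pages 1..n."""
    urls = []
    for page in range(n):
        oa = "" if page == 0 else f"oa{page * per_page}-"
        urls.append(BASE.format(oa=oa))
    return urls

for url in first_n_page_urls():
    print(url)
```

Paste the printed URLs into the sitemap's startUrl array; for another region, swap the `g…` id and slug in `BASE`.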

Hi hossain007, thank you for the suggestion. It works, but I'd have to repeat the same process for 5 other regions, and it would be simpler if I could find a way to limit the number of pages and enter the start URL just once.
Anyway, the other problem is that I can't retrieve the coordinates. If I try the "data preview" I can see the coordinates, but when I scrape the restaurant's detail page the result is "null", and I don't understand why. Could you help me?

Do you mean the address when you say coordinates? If so, check your selector; it worked fine for me. Here's a test I did:

https://www.dropbox.com/s/znt22j6nme6b5t8/italy.xlsx?dl=0

Sitemap:

{"_id":"italy","startUrl":["https://www.tripadvisor.it/RestaurantSearch-g2440596-a_date.2018__2D__08__2D__26-a_people.2-a_time.20%3A00%3A00-a_zur.2018__5F__08__5F__26-Province_of_Palermo_Sicily.html#EATERY_LIST_CONTENTS","https://www.tripadvisor.it/RestaurantSearch-g2440596-oa30-a_date.2018__2D__08__2D__26-a_people.2-a_time.20%3A00%3A00-a_zur.2018__5F__08__5F__26-Provinc.html#EATERY_LIST_CONTENTS","https://www.tripadvisor.it/RestaurantSearch-g2440596-oa60-a_date.2018__2D__08__2D__26-a_people.2-a_time.20%3A00%3A00-a_zur.2018__5F__08__5F__26-Provinc.html#EATERY_LIST_CONTENTS","https://www.tripadvisor.it/RestaurantSearch-g2440596-oa90-a_date.2018__2D__08__2D__26-a_people.2-a_time.20%3A00%3A00-a_zur.2018__5F__08__5F__26-Provinc.html#EATERY_LIST_CONTENTS","https://www.tripadvisor.it/RestaurantSearch-g2440596-oa120-a_date.2018__2D__08__2D__26-a_people.2-a_time.20%3A00%3A00-a_zur.2018__5F__08__5F__26-Provin.html#EATERY_LIST_CONTENTS","https://www.tripadvisor.it/RestaurantSearch-g2440596-oa150-a_date.2018__2D__08__2D__26-a_people.2-a_time.20%3A00%3A00-a_zur.2018__5F__08__5F__26-Provin.html#EATERY_LIST_CONTENTS","https://www.tripadvisor.it/RestaurantSearch-g2440596-oa180-Province_of_Palermo_Sicily.html#EATERY_LIST_CONTENTS","https://www.tripadvisor.it/RestaurantSearch-g2440596-oa210-Province_of_Palermo_Sicily.html#EATERY_LIST_CONTENTS","https://www.tripadvisor.it/RestaurantSearch-g2440596-oa240-Province_of_Palermo_Sicily.html#EATERY_LIST_CONTENTS","https://www.tripadvisor.it/RestaurantSearch-g2440596-oa270-Province_of_Palermo_Sicily.html#EATERY_LIST_CONTENTS"],"selectors":[{"id":"parent","type":"SelectorLink","parentSelectors":["_root"],"selector":"a.property_title","multiple":true,"delay":0},{"id":"address","type":"SelectorText","parentSelectors":["parent"],"selector":"div.blEntry.address","multiple":false,"regex":"","delay":0},{"id":"phone","type":"SelectorText","parentSelectors":["parent"],"selector":"div.blEntry.phone span:nth-of-type(2)","multiple":false,"regex":"","delay":0},{"id":"type of food","type":"SelectorText","parentSelectors":["parent"],"selector":"span.header_links","multiple":false,"regex":"","delay":0},{"id":"ranking","type":"SelectorText","parentSelectors":["parent"],"selector":"span.header_popularity","multiple":false,"regex":"","delay":0},{"id":"number of reviews","type":"SelectorText","parentSelectors":["parent"],"selector":"a.more","multiple":false,"regex":"","delay":0}]}

I guess there's another way. I have never used CouchDB, but apparently if you use it to store the data it keeps the records in order, so you could just use it and stop once you reach 30 × 10 = 300 rows of data.

No, I mean the latitude and longitude of each restaurant. This data is on the detail page of each restaurant, in the HTML of the page.
In the sitemap I posted, the selector is: "selector":"div.mapContainer","multiple":false,"extractAttribute":"data-lat"
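For reference, here is what those two attribute selectors extract, sketched with Python's stdlib HTML parser on a made-up snippet (the coordinates and markup below are illustrative, not taken from a live page). A common cause of a null result with this kind of selector is that the map container and its data-lat/data-lng attributes are only inserted by JavaScript after the page loads, so a longer page-load delay on the selector may help:

```python
# Sketch: pull data-lat / data-lng off a div.mapContainer element,
# the same attributes the "lat"/"lon" selectors in the sitemap target.
from html.parser import HTMLParser

# Illustrative markup only -- coordinates are made up.
SAMPLE = '<div class="mapContainer" data-lat="38.1157" data-lng="13.3615"></div>'

class MapAttrParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.lat = None
        self.lng = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "div" and "mapContainer" in (a.get("class") or ""):
            self.lat = a.get("data-lat")
            self.lng = a.get("data-lng")

p = MapAttrParser()
p.feed(SAMPLE)
print(p.lat, p.lng)  # prints: 38.1157 13.3615
```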

I tried but had no luck; I hope @iconoclast or @bretfeig can help you out.

Thanks anyway for your kindness, you have been very helpful.


Hi!

What you're trying to achieve can be done with the :not() CSS selector.

Luckily, the Next button carries the number of the page it leads to:

So we can build an Element Click selector that keeps clicking the Next button until its data-page-number value is 11 (we want 10 pages, so we stop once the Next button points at page 11).
The correct click selector is: a.nav.next:not([data-page-number=11])
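To see why this stops after exactly 10 pages, here is a rough simulation of the click loop in plain Python (not Web Scraper code): the scraper keeps clicking as long as the click selector matches, and :not([data-page-number=11]) stops matching precisely when the Next button points at page 11.

```python
# Simulation of the clickMore stopping rule: keep "clicking" Next
# while a.nav.next:not([data-page-number=11]) still matches, i.e.
# while the Next button's data-page-number is anything other than 11.

def pages_visited(stop_page=11):
    visited = []
    current = 1
    while True:
        visited.append(current)
        next_page = current + 1      # the Next button points here
        if next_page == stop_page:   # :not([data-page-number=11]) no longer matches
            break
        current = next_page          # the "click" happens
    return visited

print(pages_visited())  # pages 1 through 10
```

To scrape a different number of pages, change the page number inside :not() accordingly.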

Here's an example sitemap:

{"_id":"tripadvisor_ita","startUrl":["https://www.tripadvisor.it/Restaurants-g2440596-Province_of_Palermo_Sicily.html#EATERY_OVERVIEW_BOX"],"selectors":[{"id":"clicker","type":"SelectorElementClick","selector":"div.hotels_lf_redesign","parentSelectors":["_root"],"multiple":true,"delay":"1000","clickElementSelector":"a.nav.next:not([data-page-number=11])","clickType":"clickMore","discardInitialElements":false,"clickElementUniquenessType":"uniqueText"},{"id":"page_number","type":"SelectorText","selector":"span.pageNum.current","parentSelectors":["clicker"],"multiple":false,"regex":"","delay":0}]}

Hi @iconoclast, thank you very much for your suggestion. I'll try your solution.