Only scrapes the first four pages

spalk · May 1, 2018, 10:16am

First of all, thank you for making this software available for free.
My problem is that I have set everything up with the start URL https://www.paginasamarillas.es/a/administrador-de-fincas/barcelona/[2-57], but it only scrapes pages 2 to 5 when it should continue through page 57. I have already tried many settings without success. Thank you in advance.

Url: https://www.paginasamarillas.es/a/administrador-de-fincas/barcelona/2

Sitemap:
{"_id":"administradorespa","startUrl":["https://www.paginasamarillas.es/a/administrador-de-fincas/barcelona/[2-57]"],"selectors":[{"id":"empresalink","type":"SelectorLink","selector":"div.cabecera:nth-of-type(2) a","parentSelectors":["_root"],"multiple":true,"delay":0},{"id":"nombre","type":"SelectorText","selector":"h1","parentSelectors":["empresalink"],"multiple":false,"regex":"","delay":0},{"id":"direccion","type":"SelectorText","selector":"div.bip-links2:nth-of-type(1)","parentSelectors":["empresalink"],"multiple":false,"regex":"","delay":0}]}
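For reference, the `[2-57]` range in the start URL is meant to stand for one start URL per page, 56 in total. The sketch below mimics that expansion in Python so the expected page list can be checked against what was actually scraped. `expand_range` is a hypothetical helper, not part of Web Scraper, and it assumes only the simple `[start-end]` form used in this thread.

```python
import re


def expand_range(url: str) -> list[str]:
    """Expand a single [start-end] range in a URL into one URL per number."""
    match = re.search(r"\[(\d+)-(\d+)\]", url)
    if match is None:
        # No range pattern: the URL is used as-is
        return [url]
    start, end = int(match.group(1)), int(match.group(2))
    prefix, suffix = url[:match.start()], url[match.end():]
    return [f"{prefix}{n}{suffix}" for n in range(start, end + 1)]


start_url = "https://www.paginasamarillas.es/a/administrador-de-fincas/barcelona/[2-57]"
urls = expand_range(start_url)

print(len(urls))   # 56
print(urls[0])     # .../barcelona/2
print(urls[-1])    # .../barcelona/57
```

If the export contains records from only four of these 56 URLs, the range itself is probably fine and the selectors or the site's responses are the more likely cause.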

chefas · May 1, 2018, 5:46pm

Hi
Try this sitemap. It extracts pages 10 to 19, 150 records in total:

{"_id":"test","startUrl":["https://www.paginasamarillas.es/a/administrador-de-fincas/barcelona/[10-19]"],"selectors":[{"id":"empresalink","type":"SelectorLink","selector":"div.col-xs-11 a","parentSelectors":["_root"],"multiple":true,"delay":0},{"id":"nombre","type":"SelectorText","selector":"h1","parentSelectors":["empresalink"],"multiple":false,"regex":"","delay":0},{"id":"direccion","type":"SelectorText","selector":"div.bip-links2:nth-of-type(1)","parentSelectors":["empresalink"],"multiple":false,"regex":"","delay":0}]}

spalk · May 2, 2018, 10:27am

Thank you very much for your help, it works without problems. The only issue I have seen is that it does not scrape all the results when I include all 57 pages of the website. I guess this is caused by problems with my Internet connection or by the site's anti-scraping mechanisms. It is also a slow process.
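One way to see which pages dropped out of an incomplete run is to count the exported rows per start URL. The sketch below does that in Python on a CSV export. It assumes the export has a `web-scraper-start-url` column holding the expanded page URL for each record; the helper names and the sample data are made up for illustration.

```python
import csv
import io
from collections import Counter


def rows_per_start_url(csv_file, column="web-scraper-start-url"):
    """Count exported rows for each start URL (column name is an assumption)."""
    reader = csv.DictReader(csv_file)
    return Counter(row[column] for row in reader)


def missing_pages(counts, base_url, first, last):
    """Return the page numbers in [first, last] that produced no rows."""
    return [n for n in range(first, last + 1)
            if counts.get(f"{base_url}{n}", 0) == 0]


# Tiny made-up sample standing in for a real export file
sample = io.StringIO(
    "web-scraper-order,web-scraper-start-url,nombre\n"
    "1-1,https://example.com/list/2,A\n"
    "1-2,https://example.com/list/2,B\n"
    "1-3,https://example.com/list/4,C\n"
)
counts = rows_per_start_url(sample)
print(missing_pages(counts, "https://example.com/list/", 2, 4))  # [3]
```

For a real export, the `io.StringIO` sample would be replaced by an opened file, for example `open("export.csv", newline="", encoding="utf-8-sig")`, where the file name is hypothetical. Pages that come back with no rows can then be scraped again in a smaller range.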

Because the full scrape is incomplete and slow, I have created a new sitemap that takes the data directly from the listing page, without opening each item separately. This should collect the data more quickly. The problem is that once the scrape finishes, the values come out as "null". I have checked the selectors with "Element preview" and they appear to be marked correctly, but I can't get it to work.

Here is the code:

{"_id":"administradores2","startUrl":["https://www.paginasamarillas.es/a/administrador-de-fincas/barcelona/[2-57]"],"selectors":[{"id":"administrador","type":"SelectorElement","selector":"div.listado-item:nth-of-type(n+3) div.box","parentSelectors":["_root"],"multiple":false,"delay":0},{"id":"nombre","type":"SelectorText","selector":"div.cabecera:nth-of-type(2) h2 span","parentSelectors":["administrador"],"multiple":false,"regex":"","delay":0},{"id":"direccion","type":"SelectorText","selector":"div.row:nth-of-type(3) div.col-xs-12 p.location","parentSelectors":["administrador"],"multiple":false,"regex":"","delay":0}]}

Thank you again

chefas · May 9, 2018, 9:02am

Hello,

This sitemap scrapes page 1 only:

{"_id":"test","startUrl":["https://www.paginasamarillas.es/a/administrador-de-fincas/barcelona/1"],"selectors":[{"id":"administrador","type":"SelectorElement","selector":"div.listado-item:nth-of-type(n+3) div.box","parentSelectors":["_root"],"multiple":true,"delay":"2000"},{"id":"nombre","type":"SelectorText","selector":"div.cabecera:nth-of-type(2) h2 span","parentSelectors":["administrador"],"multiple":false,"regex":"","delay":0},{"id":"direccion","type":"SelectorText","selector":"div.col-xs-8 a.como-ir span","parentSelectors":["administrador"],"multiple":false,"regex":"","delay":0}]}
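To make the changes in this sitemap easier to spot, the sketch below compares it field by field with the "administradores2" sitemap posted earlier in the thread. Both JSON strings are copied from the posts; `selector_changes` is a hypothetical helper written for this comparison, not part of Web Scraper.

```python
import json


def selector_changes(old: dict, new: dict) -> dict:
    """Map selector id -> {field: (old_value, new_value)} for fields that differ."""
    old_by_id = {s["id"]: s for s in old["selectors"]}
    changes = {}
    for sel in new["selectors"]:
        before = old_by_id.get(sel["id"], {})
        diff = {k: (before.get(k), v) for k, v in sel.items() if before.get(k) != v}
        if diff:
            changes[sel["id"]] = diff
    return changes


# Sitemap that returned "null" values
original = json.loads('{"_id":"administradores2","startUrl":["https://www.paginasamarillas.es/a/administrador-de-fincas/barcelona/[2-57]"],"selectors":[{"id":"administrador","type":"SelectorElement","selector":"div.listado-item:nth-of-type(n+3) div.box","parentSelectors":["_root"],"multiple":false,"delay":0},{"id":"nombre","type":"SelectorText","selector":"div.cabecera:nth-of-type(2) h2 span","parentSelectors":["administrador"],"multiple":false,"regex":"","delay":0},{"id":"direccion","type":"SelectorText","selector":"div.row:nth-of-type(3) div.col-xs-12 p.location","parentSelectors":["administrador"],"multiple":false,"regex":"","delay":0}]}')

# Sitemap for page 1 from the post above
page_one = json.loads('{"_id":"test","startUrl":["https://www.paginasamarillas.es/a/administrador-de-fincas/barcelona/1"],"selectors":[{"id":"administrador","type":"SelectorElement","selector":"div.listado-item:nth-of-type(n+3) div.box","parentSelectors":["_root"],"multiple":true,"delay":"2000"},{"id":"nombre","type":"SelectorText","selector":"div.cabecera:nth-of-type(2) h2 span","parentSelectors":["administrador"],"multiple":false,"regex":"","delay":0},{"id":"direccion","type":"SelectorText","selector":"div.col-xs-8 a.como-ir span","parentSelectors":["administrador"],"multiple":false,"regex":"","delay":0}]}')

for sel_id, diff in selector_changes(original, page_one).items():
    for field, (was, now) in diff.items():
        print(f"{sel_id}.{field}: {was!r} -> {now!r}")
```

The comparison reports three changes: the `administrador` element selector now has `"multiple": true` and a delay of `"2000"`, and `direccion` uses the selector `div.col-xs-8 a.como-ir span`. The `nombre` selector is unchanged. To cover every page, the `startUrl` could presumably be switched back to a range such as `[1-57]`, though that combination has not been tested in this thread.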

spalk · May 9, 2018, 12:07pm

Thank you very much for your help!!