Cannot loop paginations with detail pages

fishfree · May 12, 2021, 10:29am

I created a sitemap with https://***/page=[1-927] as the start URL. Each list page contains many news summaries. I created a selector "link" that targets the multiple news detail page links on the start page, then opened one of the news detail pages in the browser, created selectors for the title, post date and body of the news item, and set their parent selector to "link".
I expect to scrape every news item on all list pages, but my sitemap only scrapes the news items on the last list page (927), not on all list pages (1-927).
I cannot figure out why. Please help me.

Url: https://activo.eluniversal.com.mx/historico/search/index.php?q=China&anio=&seccion=&opinion=&tipo_contenido=&autor=&tipoedicion=&dia=&mes=&rango_Fechas=&k_rango_fechas=&fecha_ini=&fecha_fin=&editor=&start=20&page=[1-927]

Sitemap:
{"_id":"eluniversalpagination","startUrl":["https://activo.eluniversal.com.mx/historico/search/index.php?q=China&anio=&seccion=&opinion=&tipo_contenido=&autor=&tipoedicion=&dia=&mes=&rango_Fechas=&k_rango_fechas=&fecha_ini=&fecha_fin=&editor=&start=20&page=[1-927]"],"selectors":[{"id":"link","type":"SelectorLink","parentSelectors":["_root"],"selector":"div:nth-of-type(n+5) .HeadNota a","multiple":true,"delay":0},{"id":"title","type":"SelectorText","parentSelectors":["link"],"selector":"h2.h1","multiple":false,"regex":"","delay":0},{"id":"column","type":"SelectorText","parentSelectors":["link"],"selector":"a.ce12-DatosArticulo_TiempoRelojes","multiple":false,"regex":"","delay":0},{"id":"date","type":"SelectorText","parentSelectors":["link"],"selector":"span.ce12-DatosArticulo_ElementoFecha","multiple":false,"regex":"","delay":0},{"id":"body","type":"SelectorText","parentSelectors":["link"],"selector":"div.field","multiple":false,"regex":"","delay":0}]}
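
For anyone checking which pages a ranged start URL should cover, here is a minimal Python sketch that expands a `[1-927]` style placeholder into individual URLs. It is not Web Scraper's own code. The function name `expand_start_url` is made up for illustration, the `example.com` URL is a placeholder, and the handling of zero padding (`[001-100]`) and step (`[0-100:10]`) reflects my understanding of the range syntax, so treat both as assumptions.

```python
import re

# Matches a numeric range placeholder such as [1-927], [001-100] or [0-100:10].
RANGE_RE = re.compile(r"\[(\d+)-(\d+)(?::(\d+))?\]")


def expand_start_url(url: str) -> list[str]:
    """Expand the first [start-end] or [start-end:step] placeholder in url.

    Hypothetical helper for illustration only; returns [url] unchanged
    when no placeholder is present.
    """
    match = RANGE_RE.search(url)
    if match is None:
        return [url]
    first, last, step = match.group(1), match.group(2), match.group(3)
    # Assumption: a leading zero in the start value requests zero padding.
    width = len(first) if first.startswith("0") and len(first) > 1 else 0
    increment = int(step) if step else 1
    return [
        url[:match.start()] + str(n).zfill(width) + url[match.end():]
        for n in range(int(first), int(last) + 1, increment)
    ]


# Placeholder URL, shortened from the one in this post.
urls = expand_start_url("https://example.com/search?q=China&page=[1-927]")
print(len(urls))   # 927
print(urls[0])     # https://example.com/search?q=China&page=1
print(urls[-1])    # https://example.com/search?q=China&page=927
```

If the scraped data only contains items from page 927, comparing it against a list like this makes it easy to see which start URLs never produced any rows.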

ViestursWS · May 12, 2021, 6:15pm

Hello @fishfree

I'm not sure what you were trying to do, because your sitemap did not work.
Next time, please post it as preformatted text.

Here is my solution for the issue you described:

{"_id":"activo","startUrl":["https://activo.eluniversal.com.mx/historico/search/index.php?q=China&anio=&seccion=&opinion=&tipo_contenido=&autor=&tipoedicion=&dia=&mes=&editor=&start=100&page=[1-927]"],"selectors":[{"id":"wrapper","type":"SelectorElement","parentSelectors":["_root"],"selector":"div#IzqDisplayColumn > div:has(div.HeadNota)","multiple":true,"delay":0},{"id":"news-link","type":"SelectorElementAttribute","parentSelectors":["wrapper"],"selector":".HeadNota a","multiple":false,"extractAttribute":"href","delay":0},{"id":"title","type":"SelectorText","parentSelectors":["wrapper"],"selector":".HeadNota a","multiple":false,"regex":"","delay":0},{"id":"date","type":"SelectorText","parentSelectors":["wrapper"],"selector":"span.Fecha","multiple":false,"regex":"","delay":0}]}

fishfree · May 12, 2021, 9:11pm

@ViestursWS Thank you very much! I edited my question. Your sitemap only scrapes the news title, date and link. I also need to scrape the news body, which is only available on the news detail page, reached by opening the links in the news summaries on the list pages.

ViestursWS · May 13, 2021, 8:09am

@fishfree Try this:

{"_id":"activo","startUrl":["https://activo.eluniversal.com.mx/historico/search/index.php?q=China&anio=&seccion=&opinion=&tipo_contenido=&autor=&tipoedicion=&dia=&mes=&editor=&start=100&page=[1-927]"],"selectors":[{"id":"wrapper","type":"SelectorElement","parentSelectors":["_root"],"selector":"div#IzqDisplayColumn > div:has(div.HeadNota)","multiple":true,"delay":0},{"id":"news-link","type":"SelectorElementAttribute","parentSelectors":["wrapper"],"selector":".HeadNota a","multiple":false,"extractAttribute":"href","delay":0},{"id":"title","type":"SelectorText","parentSelectors":["wrapper"],"selector":".HeadNota a","multiple":false,"regex":"","delay":0},{"id":"date","type":"SelectorText","parentSelectors":["wrapper"],"selector":"span.Fecha","multiple":false,"regex":"","delay":0},{"id":"link","type":"SelectorLink","parentSelectors":["wrapper"],"selector":".HeadNota a","multiple":false,"delay":0},{"id":"card","type":"SelectorElement","parentSelectors":["link"],"selector":"body:has(div.texto-blanco:has(h1))","multiple":true,"delay":0},{"id":"news-content","type":"SelectorGroup","parentSelectors":["card"],"selector":".field p","delay":0,"extractAttribute":""}]}
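
Since the sitemap is hard to read on one line, here is a small Python sketch that prints its selector hierarchy from the `parentSelectors` fields. The JSON below is an abbreviated copy of the sitemap above (only `id` and `parentSelectors` are kept), and the helper names `selector_tree` and `render` are invented for this example.

```python
import json
from collections import defaultdict

# Abbreviated copy of the sitemap above: only id and parentSelectors are kept.
SITEMAP_JSON = """
{"_id": "activo", "selectors": [
  {"id": "wrapper", "parentSelectors": ["_root"]},
  {"id": "news-link", "parentSelectors": ["wrapper"]},
  {"id": "title", "parentSelectors": ["wrapper"]},
  {"id": "date", "parentSelectors": ["wrapper"]},
  {"id": "link", "parentSelectors": ["wrapper"]},
  {"id": "card", "parentSelectors": ["link"]},
  {"id": "news-content", "parentSelectors": ["card"]}
]}
"""


def selector_tree(sitemap: dict) -> dict:
    """Map each parent selector id to the ids of its direct children."""
    children = defaultdict(list)
    for selector in sitemap["selectors"]:
        for parent in selector["parentSelectors"]:
            children[parent].append(selector["id"])
    return dict(children)


def render(children: dict, node: str = "_root", depth: int = 0, path: tuple = ()) -> list:
    """Return the tree below node as indented lines, one selector per line."""
    lines = ["  " * depth + node]
    if node in path:
        # A selector that lists itself (or an ancestor) as parent: stop here.
        return lines
    for child in children.get(node, []):
        lines.extend(render(children, child, depth + 1, path + (node,)))
    return lines


tree = selector_tree(json.loads(SITEMAP_JSON))
print("\n".join(render(tree)))
```

The output shows the chain `_root` > `wrapper` > `link` > `card` > `news-content`, so the body text is only collected after the scraper follows `link` into each detail page.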

fishfree · May 13, 2021, 9:49am

Thank you, viesturs. Unfortunately it still does not work. I need to scrape the news body of every news item on all the pages.

ViestursWS · May 13, 2021, 12:25pm

@fishfree It is working, but the results will only start to appear after the scraper has gone through all of the pages, which might take several minutes.

fishfree · May 14, 2021, 4:59am

@ViestursWS Thank you! This time I scraped only 2 pages by setting the start URL to Buscador, but the result is still not as expected.

ViestursWS · May 14, 2021, 12:47pm

@fishfree Are you also following the links, or do you just need the link itself?

fishfree · May 15, 2021, 1:19am

@ViestursWS Proceeding into the links. Using your sitemap, I can only scrape 13 news items, and there are several news-content JSON objects in the news-content column. I think that is because the sitemap selects only the paragraph elements (.field p) for the news body; it should select their parent element instead.

Using my sitemap, I can only scrape the 20 news items on the last page: https://activo.eluniversal.com.mx/historico/search/index.php?q=China&anio=&seccion=&opinion=&tipo_contenido=&autor=&tipoedicion=&dia=&mes=&editor=&start=20&page=927
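
As a possible workaround for the multiple news-content JSON objects, the grouped column could be joined back into a single body text after export. This is a minimal Python sketch, assuming the news-content cell holds a JSON array of objects keyed by the selector id, such as `[{"news-content": "..."}]`. I have not verified the exact export format, and `join_grouped_text` and the sample cell are made up for illustration.

```python
import json


def join_grouped_text(cell: str, key: str = "news-content", sep: str = "\n\n") -> str:
    """Join the text of every object in a grouped-selector cell.

    Assumes `cell` is a JSON array of objects that each carry the
    paragraph text under `key`; empty cells yield an empty string.
    """
    if not cell:
        return ""
    items = json.loads(cell)
    parts = [
        (item.get(key) or "").strip()
        for item in items
        if isinstance(item, dict)
    ]
    return sep.join(part for part in parts if part)


# Invented sample cell, not real scraped data.
sample_cell = (
    '[{"news-content": "First paragraph."}, '
    '{"news-content": " Second paragraph. "}]'
)
print(join_grouped_text(sample_cell))
```

This only tidies up the exported column. It does not change how many news items the sitemap reaches.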