Scraping and sometimes having an issue with pagination

FuegoX · August 10, 2023, 9:39am

Hello, I have the following sitemap for scraping some links from Coches.net, a website that has some protection against scraping tools:

{"_id":"Cochesnet+30","startUrl":["https://www.coches.net/concesionario/jmcarsbenissa/"],"selectors":[{"id":"PaginationElement","parentSelectors":["_root"],"type":"SelectorElementClick","clickElementSelector":"li:nth-of-type(n+1) a.sui-AtomButton","clickElementUniquenessType":"uniqueText","clickType":"clickMore","delay":2000,"discardInitialElements":"discard-when-click-element-exists","multiple":true,"selector":"div.mt-AdvertisingRoadblock"},{"id":"Scrolling","parentSelectors":["PaginationElement"],"type":"SelectorElementScroll","selector":"div.mt-LayoutApp","multiple":true,"delay":100,"elementLimit":500},{"id":"Price","parentSelectors":["PaginationElement"],"type":"SelectorText","selector":"div.mt-CardAdPrice","multiple":true,"regex":""}]}

The main problem with this sitemap, which does not happen with every link, is that when the first link has pagination such as 1-2-3-4, the scraper goes from page 1 to page 2 and then to "null". For a reason I don't understand, it also extracts duplicated data while it is on "null", so this sitemap gives me 40 records when I should get only 35.

Does anyone know why? I've tried changing a few things and now I'm getting 40 instead of 60, when there are only 35 records to extract.
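As a stopgap while the sitemap itself is being fixed, the duplicated rows can be removed from the exported data afterwards. This is only a minimal Python sketch under assumptions: it assumes the export has a `Price` column (the selector id from the sitemap above) plus `web-scraper-order` and `web-scraper-start-url` columns, and the sample rows and the helper name `drop_duplicates` are made up for illustration.

```python
def drop_duplicates(rows, ignore=("web-scraper-order",)):
    """Return rows with exact repeats removed, keeping the first occurrence.

    Columns listed in `ignore` are left out of the comparison because they
    are assumed to differ on every row even when the scraped data is the same.
    """
    seen = set()
    unique = []
    for row in rows:
        # Build a hashable key from every column that is not ignored.
        key = tuple(sorted((k, v) for k, v in row.items() if k not in ignore))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Hypothetical sample rows, not real data from the site.
START_URL = "https://www.coches.net/concesionario/jmcarsbenissa/"
rows = [
    {"web-scraper-order": "1-1", "web-scraper-start-url": START_URL, "Price": "12.900 €"},
    {"web-scraper-order": "1-2", "web-scraper-start-url": START_URL, "Price": "8.500 €"},
    {"web-scraper-order": "1-3", "web-scraper-start-url": START_URL, "Price": "12.900 €"},
]
unique = drop_duplicates(rows)
print(len(rows), "rows in,", len(unique), "rows out")
```

With only a price column, two different cars at the same price would also be collapsed into one row, so adding a more distinctive field (for example the ad link) to the sitemap would make this comparison safer.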

Help would be appreciated!

Thanks a lot!

FuegoX · August 10, 2023, 1:41pm

Hello,
As I said previously, this site has some protections against scraping, but I could get around some of them with this tool, which I found 2 months ago and have been learning to use by reading this forum and your guides.

To avoid bot detection:

Clear the cache and cookies and use a fresh Google Chrome.
For scraping timing:
Request interval (ms) = 2500
Page load delay (ms) = 4000/5000

With these settings I can use it without trouble. The main issue is that, for some reason, pagination is not working properly for me and it sometimes duplicates data, and I don't know what the cause could be.

I've modified it to this new sitemap:

{"_id":"Cochesnet+30","startUrl":["https://www.coches.net/concesionario/jmcarsbenissa/"],"selectors":[{"id":"PaginationElement","parentSelectors":["_root"],"type":"SelectorElementClick","clickElementSelector":"li:nth-of-type(n+2) a.sui-AtomButton","clickElementUniquenessType":"uniqueText","clickType":"clickMore","delay":2500,"discardInitialElements":"discard","multiple":true,"selector":"div.mt-AdvertisingRoadblock"},{"id":"Scrolling","parentSelectors":["PaginationElement"],"type":"SelectorElementScroll","selector":"div.mt-LayoutApp","multiple":true,"delay":250,"elementLimit":500},{"id":"Price","parentSelectors":["PaginationElement"],"type":"SelectorText","selector":"div.mt-CardAdPrice","multiple":true,"regex":""}]}

I increased the Element scroll delay from 100 to 250.

P.S.: Element scroll is necessary because this site has a protection where it only gives you the first 4 of the 30 items if you don't scroll the page.

ViestursWS · August 10, 2023, 2:09pm

@3HAT0K You need a proxy based on a Spanish IP address to access it.

ViestursWS · August 10, 2023, 2:15pm

@FuegoX Hi, try launching the scraping jobs via Web Scraper Cloud, which has a built-in proxy feature. The trial is free for 7 days: Login | Web Scraper
Search results for proxy | Web Scraper Knowledge Base

FuegoX · August 10, 2023, 2:29pm

Hello guys,
Maybe I didn't explain myself well. I don't need a proxy since I'm based in Spain, so I'm not having any issues with that (don't worry).

My issues relate to how my sitemap is designed, probably because the pagination element is not working properly: it goes from page 1 to page 2, then to "null", then on to a new link, and sometimes the data is duplicated.

I've attached a screenshot to show you that the website is working for me.

@3HAT0K don't worry, thanks for your help.

Hope someone can help me a bit.

FuegoX · August 16, 2023, 1:35pm

Hello Mr,

Thanks for your fast explanation and new code. I've tried it and it's working.
I didn't really need the other things you added, like kms and other stuff, but this helped me use a similar structure for another page, similar to eBay, with items and cars advertised. Now I can extract all the data into a table with the correct elements, so I can filter it to keep only the info for cars. Thanks so much! I've learnt a lot from the changes you made.

Finally, I will take a look at your code to develop another one with auto-pagination, since I have around 2000 links that need to be scraped each month and I don't have time to go link by link setting pg=1-3-4, when a lot of them have a random number of items and not exactly 10-20-30.
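One way the per-link page numbers could be generated instead of typed by hand is sketched below in Python. It rests on assumptions: that the site paginates through a `pg` query parameter (as in the pg=1-3-4 above), that each page holds 30 items, and that the total item count per link is already known. The function name `page_urls` is made up for this example.

```python
from math import ceil
from urllib.parse import urlencode


def page_urls(base_url, total_items, page_size=30, param="pg"):
    """Build one URL per results page for a single listing link.

    `page_size` and `param` are assumptions about the site, not confirmed
    values, so both are kept as arguments that can be changed.
    """
    # Always produce at least one page, even for an empty listing.
    pages = max(1, ceil(total_items / page_size))
    return [f"{base_url}?{urlencode({param: n})}" for n in range(1, pages + 1)]


# 35 items at an assumed 30 per page gives 2 pages.
urls = page_urls("https://www.coches.net/concesionario/jmcarsbenissa/", 35)
for url in urls:
    print(url)
```

The resulting list could be pasted in as multiple start URLs. If I remember correctly, Web Scraper start URLs also accept a numeric range written as `[1-2]`, which would avoid listing each page separately, but the page count per link would still have to be known in advance.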

Btw, thanks so much bro and have a good day!