Scrape Paging Without Entering Elements, Stopped by CAPTCHA

Dukk · July 28, 2023, 2:33pm

Hi Everybody,

I am trying to scrape vehicle data from the following site.
The scraper needs to a) enter each vehicle (element) to collect the necessary data and b) repeat this across all 50 result pages.
Currently the scraper does b), paging through the results, but it never enters any individual element. Eventually the website throws a CAPTCHA and the scrape stops.

I would appreciate any help on how to set up the scraper correctly.

Url:
https://suchen.mobile.de/fahrzeuge/search.html?cn=DE&damageUnrepaired=NO_DAMAGE_UNREPAIRED&grossPrice=true&isSearchRequest=true&makeModelVariant1.makeId=3500&makeModelVariant1.modelGroupId=21&minFirstRegistrationDate=2020-01-01&minPrice=10000&pageNumber=[1-50]&ref=srpNextPage&scopeId=C&sortOption.sortBy=searchNetGrossPrice&sortOption.sortOrder=ASCENDING&refId=03b164cf-7cac-7e69-76b2-f1d4d2cef898

Sitemap:

{"_id":"Mobile3series","startUrl":["https://suchen.mobile.de/fahrzeuge/search.html?cn=DE&damageUnrepaired=NO_DAMAGE_UNREPAIRED&grossPrice=true&isSearchRequest=true&makeModelVariant1.makeId=3500&makeModelVariant1.modelGroupId=21&minFirstRegistrationDate=2020-01-01&minPrice=10000&pageNumber=[1-50]&ref=srpNextPage&scopeId=C&sortOption.sortBy=searchNetGrossPrice&sortOption.sortOrder=ASCENDING&refId=03b164cf-7cac-7e69-76b2-f1d4d2cef898"],"selectors":[{"id":"next-page","paginationType":"auto","parentSelectors":["_root","next-page"],"selector":".btn--primary i.gicon-next-white-s","type":"SelectorPagination"},{"id":"cars","multiple":true,"parentSelectors":["_root","next-page"],"selector":"div.cBox-body--resultitem","type":"SelectorElement"},{"id":"name","multiple":false,"parentSelectors":["cars"],"regex":"","selector":"div.listing-title","type":"SelectorText"},{"id":"milage","multiple":false,"parentSelectors":["cars"],"regex":"","selector":".key-feature--mileage div.key-feature__value","type":"SelectorText"},{"id":"first-reg","multiple":false,"parentSelectors":["cars"],"regex":"","selector":".key-feature--firstRegistration div.key-feature__value","type":"SelectorText"},{"id":"power","multiple":false,"parentSelectors":["cars"],"regex":"","selector":".key-feature--power div.key-feature__value","type":"SelectorText"},{"id":"prev-owners","multiple":false,"parentSelectors":["cars"],"regex":"","selector":".key-feature--numberOfPreviousOwners div.key-feature__value","type":"SelectorText"},{"id":"fuel","multiple":false,"parentSelectors":["cars"],"regex":"","selector":".key-feature--fuel div.key-feature__value","type":"SelectorText"},{"id":"type","multiple":false,"parentSelectors":["cars"],"regex":"","selector":"div#category-v","type":"SelectorText"},{"id":"vehicle-no","multiple":false,"parentSelectors":["cars"],"regex":"","selector":"div#sku-v","type":"SelectorText"},{"id":"cubic-cap","multiple":false,"parentSelectors":["cars"],"regex":"","selector":"div#cubicCapacity-v","type":"SelectorText"},{"id":"HU-date","multiple":false,"parentSelectors":["cars"],"regex":"","selector":"div#hu-v","type":"SelectorText"},{"id":"color_universal","multiple":false,"parentSelectors":["cars"],"regex":"","selector":"div#color-v","type":"SelectorText"},{"id":"color_individual","multiple":false,"parentSelectors":["cars"],"regex":"","selector":"div#manufacturerColorName-v","type":"SelectorText"},{"id":"features","multiple":false,"parentSelectors":["cars"],"regex":"","selector":"div#features","type":"SelectorText"},{"id":"description","multiple":false,"parentSelectors":["cars"],"regex":"","selector":"div.cBox-body--vehicledescription","type":"SelectorText"},{"id":"dealer_rating","multiple":false,"parentSelectors":["cars"],"regex":"","selector":"div.pageTitle__2nJo2","type":"SelectorText"},{"id":"dealer_location","multiple":false,"parentSelectors":["cars"],"regex":"","selector":"p.seller-address","type":"SelectorText"}]}

Dukk · August 3, 2023, 10:42am

Friendly bump: is anyone able to assist, please?

leemeng · August 8, 2023, 12:16am

Web Scraper (WS) by itself does not solve CAPTCHAs; they are designed to stop scrapers. There are paid CAPTCHA-solving extensions available, but I have not tried them. You can sometimes avoid triggering a CAPTCHA by scraping slowly, say 10 pages per minute (a 6-second delay), because some sites treat rapid page loading as scraping activity.
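
To make the arithmetic concrete: 10 pages per minute works out to one page load every 60 / 10 = 6 seconds. Outside of Web Scraper, the same throttling idea could be sketched in Python roughly as below. This is only an illustration under stated assumptions: `delay_for_rate`, `crawl_slowly` and the `fetch` callback are hypothetical names, not part of Web Scraper, and slowing down is not guaranteed to keep a CAPTCHA away.

```python
import time
from typing import Callable, Iterable, List


def delay_for_rate(pages_per_minute: float) -> float:
    """Return the number of seconds to wait between page loads."""
    if pages_per_minute <= 0:
        raise ValueError("pages_per_minute must be positive")
    return 60.0 / pages_per_minute


def crawl_slowly(
    urls: Iterable[str],
    fetch: Callable[[str], str],          # hypothetical: loads one page
    pages_per_minute: float = 10,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Call fetch(url) for each URL, pausing between consecutive loads."""
    delay = delay_for_rate(pages_per_minute)
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay)                  # wait before every load except the first
        results.append(fetch(url))
    return results


print(delay_for_rate(10))  # 6.0
```

The `sleep` parameter is injectable only so the pacing can be checked without actually waiting.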

Dukk · August 8, 2023, 12:22pm

Hi Lee,

Thank you, I have not seen the CAPTCHA again since slowing the scrape down to a 6-second delay!

However, the scrape still cycles through each results page first and does not enter any elements (which is where the detailed information about each car is found) before eventually crashing.
Is it normal behavior to cycle through the pages first?

I have tried to follow which pages it visits during the scrape. It seems to start at page 50, go back to pages 45-40, and then cycle between 45 and 40 before eventually crashing.

Any idea how to correct this?
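
One thing that may be worth checking here: the sitemap posted above contains only `SelectorPagination`, `SelectorElement` and `SelectorText` entries, while several of the text selectors (for example `div#sku-v`) look like they target the vehicle detail page. Web Scraper normally needs a Link selector (`SelectorLink`) to open a linked page, so a sitemap without one may have nothing that tells it to enter a vehicle. The Python sketch below is not part of Web Scraper; the function names are made up for illustration, and the sample is a trimmed copy of the sitemap with a placeholder start URL. It simply lists the selector types in an exported sitemap and reports whether a link selector is present.

```python
import json


def selector_types(sitemap_json: str) -> dict:
    """Map each selector id in an exported sitemap to its type."""
    sitemap = json.loads(sitemap_json)
    return {sel["id"]: sel["type"] for sel in sitemap.get("selectors", [])}


def has_link_selector(sitemap_json: str) -> bool:
    """Report whether any selector in the sitemap is a SelectorLink."""
    return "SelectorLink" in selector_types(sitemap_json).values()


# Trimmed sample: three selectors copied from the thread's sitemap.
# The start URL is a placeholder, not the real search URL.
SAMPLE_SITEMAP = json.dumps({
    "_id": "Mobile3series",
    "startUrl": ["https://example.com/search?pageNumber=[1-50]"],
    "selectors": [
        {"id": "next-page", "type": "SelectorPagination",
         "parentSelectors": ["_root", "next-page"],
         "selector": ".btn--primary i.gicon-next-white-s"},
        {"id": "cars", "type": "SelectorElement", "multiple": True,
         "parentSelectors": ["_root", "next-page"],
         "selector": "div.cBox-body--resultitem"},
        {"id": "name", "type": "SelectorText", "multiple": False,
         "parentSelectors": ["cars"], "selector": "div.listing-title"},
    ],
})

print(selector_types(SAMPLE_SITEMAP))
print(has_link_selector(SAMPLE_SITEMAP))  # False
```

If the check reports no link selector, one possible next step would be to add a Link selector under `cars` and move the detail-page text selectors beneath it; the CSS selector for the vehicle link would have to be picked on the live page.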