Did I edit the link correctly?

0101 · April 10, 2018, 6:35pm

Hi,

I want to practice web scraping and am trying to get some information from www.immobilienscout24.de, for example all flats in Schwerin (the website lists 314 flats in this city). The problem: the scraping ends with the standard message that signals the scraping process was successful. But I tried it several times in different cities and the tool never got 100% of the flats; approx. 25% are missing.
It would be great if you could help me. Do you think the root link is the reason for the problem?

Original-Link: https://www.immobilienscout24.de/Suche/S-T/Wohnung-Miete/Mecklenburg-Vorpommern/Schwerin?enteredFrom=one_step_search

Edited link: https://www.immobilienscout24.de/Suche/S-T/P-[1-16]/Wohnung-Miete/Mecklenburg-Vorpommern/Schwerin

I didn't add a pagination; I just edited the link in the part: P-[1-16]
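
For reference, here is a minimal Python sketch of what a range like P-[1-16] stands for, assuming the scraper expands it into one start URL per page number. The helper name expand_range_url is hypothetical and not part of Web Scraper; it only makes the intended list of pages visible so it can be checked by hand.

```python
import re


def expand_range_url(url: str) -> list[str]:
    """Expand the first [start-end] numeric range in a URL into one URL per number.

    Hypothetical helper for illustration only; the scraping tool does its own expansion.
    """
    match = re.search(r"\[(\d+)-(\d+)\]", url)
    if match is None:
        # No range present: the URL stands for itself.
        return [url]
    start, end = int(match.group(1)), int(match.group(2))
    prefix, suffix = url[:match.start()], url[match.end():]
    return [f"{prefix}{n}{suffix}" for n in range(start, end + 1)]


edited_link = (
    "https://www.immobilienscout24.de/Suche/S-T/P-[1-16]/"
    "Wohnung-Miete/Mecklenburg-Vorpommern/Schwerin"
)
pages = expand_range_url(edited_link)

print(len(pages))   # number of result pages the range covers
print(pages[0])     # first result page
print(pages[-1])    # last result page
```

If the number of URLs printed here matches the number of result pages on the site, the edited link itself covers every page, and the missing flats are more likely caused by something else (for example pages not being fully loaded when they are scraped).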

Thx for help!

chefas · April 11, 2018, 9:48am

Hello,

Try increasing the delay; perhaps you will get more records.
I made this test:

{"_id":"test_immo_schwerin","startUrl":["https://www.immobilienscout24.de/Suche/S-T/Wohnung-Miete/Mecklenburg-Vorpommern/Schwerin"],"selectors":[{"id":"link","type":"SelectorLink","selector":"a.result-list-entry__brand-title-container","parentSelectors":["pagination"],"multiple":true,"delay":"2000"},{"id":"title","type":"SelectorText","selector":"h1.font-semibold","parentSelectors":["link"],"multiple":false,"regex":"","delay":0},{"id":"price","type":"SelectorText","selector":"div.is24qa-kaltmiete","parentSelectors":["link"],"multiple":false,"regex":"","delay":0},{"id":"pagination","type":"SelectorLink","selector":"div.react div.grid.grid-align-center a","parentSelectors":["_root","pagination"],"multiple":true,"delay":0}]}

0101 · April 11, 2018, 5:37pm

Thanks for the help. I'll try it.
Btw: can you explain how to read that text? Sorry, but I can't decipher it.

chefas · April 11, 2018, 7:17pm

Hi
you just have to press F12 and, in Web Scraper:
Create new sitemap / Import Sitemap
Paste the code into the field "Sitemap JSON"
Give it a name of your choice
Click Import Sitemap
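
If you would rather read the sitemap than import it, the one-line text is plain JSON and can be pretty-printed. The sketch below uses only Python's standard json module; the sitemap string is copied from the post above and merely split over several lines, and the variable names are my own.

```python
import json

# The sitemap from the post above, unchanged apart from line breaks.
SITEMAP_JSON = (
    '{"_id":"test_immo_schwerin",'
    '"startUrl":["https://www.immobilienscout24.de/Suche/S-T/Wohnung-Miete/Mecklenburg-Vorpommern/Schwerin"],'
    '"selectors":['
    '{"id":"link","type":"SelectorLink","selector":"a.result-list-entry__brand-title-container",'
    '"parentSelectors":["pagination"],"multiple":true,"delay":"2000"},'
    '{"id":"title","type":"SelectorText","selector":"h1.font-semibold",'
    '"parentSelectors":["link"],"multiple":false,"regex":"","delay":0},'
    '{"id":"price","type":"SelectorText","selector":"div.is24qa-kaltmiete",'
    '"parentSelectors":["link"],"multiple":false,"regex":"","delay":0},'
    '{"id":"pagination","type":"SelectorLink","selector":"div.react div.grid.grid-align-center a",'
    '"parentSelectors":["_root","pagination"],"multiple":true,"delay":0}'
    ']}'
)

sitemap = json.loads(SITEMAP_JSON)

# Indented view of the whole sitemap.
print(json.dumps(sitemap, indent=2))

# Short summary: one line per selector, with the selectors it hangs under.
for selector in sitemap["selectors"]:
    print(selector["id"], selector["type"], "parents:", selector["parentSelectors"])
```

The summary lines show the tree: "pagination" hangs under "_root" and under itself, "link" hangs under "pagination", and "title" and "price" hang under "link".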

0101 · April 13, 2018, 9:40am

@chefas: Thanks for helping. It worked better than my version, but:

1. The first step of your version is "pagination". I watched the process, and the first thing the scraper does is open all the overview pages (page 1, 2, 3 ...) that list the flats. It doesn't scrape any information there, because it doesn't open a flat's page, so it's a slightly time-wasting step. After that it opens the flat pages and scrapes the information I want. That's no problem if I only want to scrape a few pages, but if I want to scrape all flats of a whole state it would cost a lot of time.
2. It didn't scrape the first page, so all flats on page 1 are missing. I guess the reason is that the process starts with pagination, so it starts scraping with page 2.

Can you help me again?

chefas · April 16, 2018, 5:25pm

Hello

point #1: Web Scraper operates like this; it's not possible to avoid these steps.
point #2: I don't have any idea why the first page is not scraped.

Sorry I can't help you more.

KristapsWS · April 17, 2018, 12:13pm

The first page is not scraped because the "link" selector isn't a child selector of "_root". The "link" selector has to have the same parent selectors as "pagination".
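
A minimal sketch of that change, assuming you prefer to patch the sitemap JSON from this thread in Python and re-import it rather than edit the selector in the UI. The variable names are my own; the only thing changed is the parentSelectors list of "link".

```python
import json

# Sitemap from earlier in the thread, unchanged apart from line breaks.
SITEMAP_JSON = (
    '{"_id":"test_immo_schwerin",'
    '"startUrl":["https://www.immobilienscout24.de/Suche/S-T/Wohnung-Miete/Mecklenburg-Vorpommern/Schwerin"],'
    '"selectors":['
    '{"id":"link","type":"SelectorLink","selector":"a.result-list-entry__brand-title-container",'
    '"parentSelectors":["pagination"],"multiple":true,"delay":"2000"},'
    '{"id":"title","type":"SelectorText","selector":"h1.font-semibold",'
    '"parentSelectors":["link"],"multiple":false,"regex":"","delay":0},'
    '{"id":"price","type":"SelectorText","selector":"div.is24qa-kaltmiete",'
    '"parentSelectors":["link"],"multiple":false,"regex":"","delay":0},'
    '{"id":"pagination","type":"SelectorLink","selector":"div.react div.grid.grid-align-center a",'
    '"parentSelectors":["_root","pagination"],"multiple":true,"delay":0}'
    ']}'
)

sitemap = json.loads(SITEMAP_JSON)
by_id = {selector["id"]: selector for selector in sitemap["selectors"]}

# Give "link" the same parents as "pagination", so that flats listed on the
# start page (reached via "_root") are followed as well.
by_id["link"]["parentSelectors"] = list(by_id["pagination"]["parentSelectors"])

# Compact one-line JSON, ready to paste into the "Sitemap JSON" import field.
patched_json = json.dumps(sitemap, separators=(",", ":"))
print(patched_json)
```

The printed line can be imported the same way as the original sitemap; it may need a different sitemap name if the old one still exists.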

chefas · April 18, 2018, 10:25am

Hi,
it's a bit surprising.

In the tutorial "Scrape an e-commerce with pagination", at the end of the video, the link is a child of pagination, and pagination is a child of subcategory.
It is said that this setup avoids getting duplicate data for the first page.

In your sitemaps, link and pagination are at the same level, both children of root.

So it would be good to explain to us why the parent/child relations need to be built differently.

Thanks a lot for your explanations.

KristapsWS · April 18, 2018, 12:39pm

In this case you are using a link selector next to pagination, so there won't be duplicates, because Web Scraper scrapes each link only once. In the video it is shown with pagination and an element selector, so the scraper would return duplicates unless there is a link selector as a child selector of the element selector.