How to select background images (without any img tag)?

Hello there,

I want to scrape galleries from an ad archive, but I'm having trouble finding the right selector. The Image selector doesn't work at all (well, only on a Google ad), because all images in the gallery are backgrounds of on-click items, like this:

<div onclick="location.href='/stajnie/galeria/3497/#1'" class="swiper-lazy swiper-lazy-loaded" style="background-image: url("/img/s/duze/5407.jpg");">

Here "/img/s/duze/5407.jpg" is one of the images in the gallery.

I am aware that the Image selector scrapes exactly that kind of URL, but how do I extract it in this case? Thanks in advance for any help, much appreciated.

URL: Ogłoszenia re-volta.pl, or for a particular page: Ogłoszenia re-volta.pl

Sitemap:
{"_id":"stajnie","startUrl":["https://ogloszenia.re-volta.pl/stajnie/"],"selectors":[{"id":"pagination","type":"SelectorLink","parentSelectors":["_root","pagination"],"selector":"a:nth-of-type(7)","multiple":true,"delay":0},{"id":"product-link","type":"SelectorLink","parentSelectors":["_root","pagination"],"selector":"div.informacje:nth-of-type(2) h3 a","multiple":true,"delay":0},{"id":"Stajnia","type":"SelectorText","parentSelectors":["product-link"],"selector":"h1","multiple":false,"regex":"","delay":0},{"id":"Infrastruktura jeździecka","type":"SelectorText","parentSelectors":["product-link"],"selector":"div.kolumna:nth-of-type(1) li.ok","multiple":true,"regex":"","delay":0},{"id":"Spędzanie czasu","type":"SelectorText","parentSelectors":["product-link"],"selector":"div.kolumna:nth-of-type(2) li.ok:nth-of-type(n+2)","multiple":true,"regex":"","delay":0},{"id":"Wyposażenie stajni","type":"SelectorText","parentSelectors":["product-link"],"selector":"div.kolumna:nth-of-type(3) li.ok","multiple":true,"regex":"","delay":0},{"id":"Adres","type":"SelectorText","parentSelectors":["product-link"],"selector":".pole .stajnia_detale div:nth-of-type(1)","multiple":false,"regex":"","delay":0},{"id":"obrazek","type":"SelectorImage","parentSelectors":["product-link"],"selector":".swiper-lazy swiper-lazy-loaded","multiple":false,"delay":0},{"id":"Kontakt","type":"SelectorText","parentSelectors":["product-link"],"selector":".pole .stajnia_detale div:nth-of-type(2)","multiple":false,"regex":"","delay":0},{"id":"Opis","type":"SelectorText","parentSelectors":["product-link"],"selector":"div[itemprop='description']","multiple":false,"regex":"","delay":0}]}

The HTML selector type combined with a regex works well for this. For the first image you can use:

Type: HTML
Selector: div.swiper-wrapper > div:nth-of-type(1)
Regex: (?<=quot;)[^&]+
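To see what that regex picks out, here is a minimal Python sketch. It assumes the HTML the scraper receives encodes the quotes inside the style attribute as &quot; (which is what the lookbehind on "quot;" relies on). The sample string is put together from the snippet in the question, not copied from the live page.

```python
import re

# Hypothetical element HTML, assuming the quotes inside the style
# attribute arrive encoded as &quot; (an assumption, not verified on the site).
sample_html = (
    '<div onclick="location.href=\'/stajnie/galeria/3497/#1\'" '
    'class="swiper-lazy swiper-lazy-loaded" '
    'style="background-image: url(&quot;/img/s/duze/5407.jpg&quot;);"></div>'
)

# Same regex as above: start right after "quot;" and read up to the next "&".
pattern = re.compile(r"(?<=quot;)[^&]+")

match = pattern.search(sample_html)
first_image = match.group(0) if match else None
print(first_image)
```

If the element has no &quot; in it at all, the search finds nothing and the result is None, which would show up as "null" in the scraped data.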

For the subsequent images, you only need to increment the number in nth-of-type(x): 2, 3, 4, and so on. You'll need to create as many selectors as the maximum number of images the site shows per page (5, 10, 15, or however many you need). The regex can remain the same for all of them.
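Rather than typing each selector by hand, the entries could be generated. This is only a rough sketch: the helper name make_image_selectors and the count of 5 are made up for illustration, the field names follow the sitemap JSON format used in this thread, and "product-link" is the parent selector from the sitemap in the question.

```python
import json

def make_image_selectors(count, regex=r"(?<=quot;)[^&]+", parent="product-link"):
    """Build `count` SelectorHTML entries, one per slide position (1-based)."""
    return [
        {
            "id": f"Img{n}",
            "type": "SelectorHTML",
            "parentSelectors": [parent],
            "selector": f"div.swiper-wrapper > div:nth-of-type({n})",
            "multiple": False,
            "regex": regex,
        }
        for n in range(1, count + 1)
    ]

# 5 is an arbitrary example; use the largest gallery size you expect.
selectors = make_image_selectors(5)
print(json.dumps(selectors, indent=2))
```

The printed entries can be pasted into the "selectors" array of the sitemap. The regex is a parameter, so it can be swapped if a given slide position turns out to need a different pattern.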

You'd still need to do a bit of post-processing, because this method only yields partial URLs like /img/s/duze/5407.jpg, so you'll have to prefix the site URL to get https://ogloszenia.re-volta.pl/img/s/duze/5407.jpg

This is a straightforward search and replace, which you can do even with Notepad. If you're using Cloud Scraper, it has a search and replace feature for post-processing.

Search: /img/
Replace: https://ogloszenia.re-volta.pl/img/
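If you'd rather do this step in a script than in a text editor, the same prefixing can be sketched in Python with the standard library's urljoin. The list of scraped values here is a stand-in for whatever your export contains; None represents a "null" result.

```python
from urllib.parse import urljoin

BASE_URL = "https://ogloszenia.re-volta.pl/"

def to_absolute(partial_url):
    """Turn a site-relative path like /img/... into a full URL."""
    return urljoin(BASE_URL, partial_url)

# Stand-in for one scraped row: one extracted path and one empty result.
scraped = ["/img/s/duze/5407.jpg", None]
absolute = [to_absolute(u) if u else None for u in scraped]
print(absolute)
```

Unlike a plain search and replace on "/img/", urljoin also leaves values that are already full URLs untouched.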

Thanks for the answer! I really appreciate it.

It almost works: the first image does load, but for

div.swiper-wrapper > div:nth-of-type(2)

or 3, 4, and so on, I get "null". I also tried n+1 and n+2, but got the same result. Do you have any idea why that could be?

Thank you very much for your time.

Turns out the subsequent images need a different regex:

(?<=background=")[^"]+

See this test sitemap:

{"_id":"forum-ogloszenia","startUrl":["https://ogloszenia.re-volta.pl/stajnia-w-zarebach/stajnia/3497/"],"selectors":[{"id":"Title","type":"SelectorText","parentSelectors":["_root"],"selector":"div > h1","multiple":false,"regex":""},{"id":"Img1","type":"SelectorHTML","parentSelectors":["_root"],"selector":"div.swiper-wrapper > div:nth-of-type(1)","multiple":false,"regex":"(?<=quot;)[^&]+"},{"id":"Img2","type":"SelectorHTML","parentSelectors":["_root"],"selector":"div.swiper-wrapper > div:nth-of-type(2)","multiple":false,"regex":"(?<=background=\")[^\"]+"},{"id":"Img3","type":"SelectorHTML","parentSelectors":["_root"],"selector":"div.swiper-wrapper > div:nth-of-type(3)","multiple":false,"regex":"(?<=background=\")[^\"]+"}]}
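For comparing the two regexes outside the scraper, here is a small Python sketch that tries them in order. The second regex matches any attribute whose name ends in background, for example data-background="...". That slides which have not loaded yet carry the image path in such an attribute is a guess, and the "pending" markup and its EXAMPLE.jpg path are invented for illustration.

```python
import re

# Patterns from this thread, tried in order.
PATTERNS = [
    re.compile(r"(?<=quot;)[^&]+"),         # url(&quot;...&quot;) inside style
    re.compile(r'(?<=background=")[^"]+'),  # ...background="..." attribute
]

def extract_image_path(element_html):
    """Return the first image path any pattern finds, or None."""
    for pattern in PATTERNS:
        match = pattern.search(element_html)
        if match:
            return match.group(0)
    return None

# Assumed markup of an already loaded slide (quotes encoded as &quot;).
loaded = (
    '<div class="swiper-lazy swiper-lazy-loaded" '
    'style="background-image: url(&quot;/img/s/duze/5407.jpg&quot;);"></div>'
)
# Hypothetical markup of a slide that has not been loaded yet.
pending = '<div class="swiper-lazy" data-background="/img/s/duze/EXAMPLE.jpg"></div>'

print(extract_image_path(loaded))
print(extract_image_path(pending))
```

Running the HTML of a slide that returns "null" through extract_image_path should show quickly whether either pattern applies to it.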

Thank you so much for your answer! I really appreciate you taking the time and effort to help out, it means the world to me. My family asked me to help with this project :)

So this new regex does help with image no. 2, but still no luck with images no. 3 and 4, even with your test sitemap. Any last ideas on how we can nail this down? :)

best regards