6+ hours trying to extract a single image

After over 6 hours trying to solve this and billions of forum posts, I am finally turning to the lovely people here for what will likely take someone more knowledgeable a few seconds to solve. Thank you so much if you can spare a quick moment to help out; it would really mean the world to me.

I successfully built a separate initial sitemap to extract all full-screen image URLs from a similar page's 'floating' carousel. It works fine but breaks when there is no 'next' icon for the Element Click selector to click, as on the page provided below. I figured I could just use the separate selector below to pull the first image and then deal with things in post.

I've tried every combination of options for each of the selector types but sadly to no avail.

Massive thanks once again if anyone can help out - I shall look forward to being able to help others out once I have learnt more.

Warmest regards,

Binvius

Url: https://www.atlasobscura.com/places/tallest-tree-in-europe-portugal

Sitemap:
{"_id":"image-single","startUrl":["https://www.atlasobscura.com/places/tallest-tree-in-europe-portugal"],"selectors":[{"id":"tit","multiple":false,"parentSelectors":["_root"],"regex":"","selector":"h1.DDPage__header-title","type":"SelectorText"},{"extractAttribute":"src","id":"image","multiple":false,"parentSelectors":["_root"],"selector":"img#lightbox-image","type":"SelectorElementAttribute"},{"id":"img","multiple":false,"parentSelectors":["image"],"selector":"img","type":"SelectorImage"}]}
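For anyone doing the same thing outside Web Scraper, the single-image grab that the sitemap's `SelectorElementAttribute` performs can be sketched in Python with only the standard library. This is a sketch, not Atlas Obscura's real markup: the inline HTML below just mimics the `img#lightbox-image` element the sitemap targets.

```python
from html.parser import HTMLParser

class LightboxImageFinder(HTMLParser):
    """Records the src of the first <img id="lightbox-image"> encountered."""
    def __init__(self):
        super().__init__()
        self.src = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "img" and a.get("id") == "lightbox-image" and self.src is None:
            self.src = a.get("src")

# Hypothetical sample markup; the live page's structure may differ.
sample = ('<div id="lightbox-content">'
          '<img id="lightbox-image" src="https://img.atlasobscura.com/example.jpg">'
          '</div>')
finder = LightboxImageFinder()
finder.feed(sample)
print(finder.src)  # -> https://img.atlasobscura.com/example.jpg
```

The same idea — match on the element's `id` and read its `src` attribute — is exactly what the `extractAttribute":"src"` entry in the sitemap asks the extension to do.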

Hi @3HAT0K

Thank you so so much!

I tested it and it appeared to work.

Just to be sure, I ran it on the full dataset and it has been running non-stop since your post.

Sadly it just crashed due to lack of hard-drive space. I freed up a load of space but figured it would probably be wise to optimise the sitemap anyway, as it seems pointless having a separate 'duplicate' row just for each image URL.

I have been trying for hours to work out how to get 'Grouped selector' to work in the full 'messy' sitemap included below, but I always learn best by seeing a working example.

I basically need it to scrape pages that only have one image like: https://www.atlasobscura.com/places/tallest-tree-in-europe-portugal

But then also work for pages with multiple images like: https://www.atlasobscura.com/places/tree-of-40-fruit

Basically just a single row for each page with headers along the lines of: title | location | foobar | foobar | img-1 | img-2 | img-3 | img-4 | img-5 | img-6 | img-etc...
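(In case it helps anyone reading along: that long-to-wide reshaping can also be done after the scrape. Below is a sketch, assuming a hypothetical CSV export with one row per scraped image — the column names `title`, `location`, and `img` are made up for the example.)

```python
import csv
import io

# Hypothetical long-format export: one row per image URL.
long_csv = """title,location,img
Tree of 40 Fruit,Syracuse,https://example.com/a.jpg
Tree of 40 Fruit,Syracuse,https://example.com/b.jpg
Tallest Tree,Portugal,https://example.com/c.jpg
"""

# Group image URLs by page (dicts keep insertion order in Python 3.7+).
pages = {}
for row in csv.DictReader(io.StringIO(long_csv)):
    key = (row["title"], row["location"])
    pages.setdefault(key, []).append(row["img"])

# Build one wide row per page: img-1, img-2, ... padded with blanks.
max_imgs = max(len(imgs) for imgs in pages.values())
header = ["title", "location"] + [f"img-{i}" for i in range(1, max_imgs + 1)]
wide_rows = [list(key) + imgs + [""] * (max_imgs - len(imgs))
             for key, imgs in pages.items()]

print(header)
for r in wide_rows:
    print(r)
```

Pages with fewer images than the widest page simply get empty trailing columns, which matches the `img-1 | img-2 | ... | img-etc` layout described above.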

I'm convinced in my heart that it is possible but would really appreciate a working example to learn from if you (or anyone else) can spare a few seconds.

I can't begin to express how appreciative I would be as it would have a massive impact!

Cheers if you can help a soul in need!

This is the full 'messy' sitemap I'm trying to get to work:

{"_id":"atlas-obscura","startUrl":["https://www.atlasobscura.com/places?page=1403&sort=likes_count"],"selectors":[{"id":"pagination","parentSelectors":["_root","pagination"],"paginationType":"linkFromHref","selector":".next a","type":"SelectorPagination"},{"id":"cards","parentSelectors":["pagination"],"type":"SelectorLink","selector":"a.content-card","multiple":true},{"id":"closed","parentSelectors":["cards"],"type":"SelectorText","selector":"b","multiple":false,"regex":""},{"id":"ao-edited","parentSelectors":["cards"],"type":"SelectorText","selector":"h2.DDPage__header-hat--aoedited","multiple":false,"regex":""},{"id":"title","parentSelectors":["cards"],"type":"SelectorText","selector":"h1.DDPage__header-title","multiple":false,"regex":""},{"id":"location","parentSelectors":["cards"],"type":"SelectorText","selector":".DDPage__header-place-location a","multiple":false,"regex":""},{"id":"address","parentSelectors":["cards"],"type":"SelectorText","selector":".DDPageSiderail__address > div","multiple":false,"regex":""},{"id":"coords","parentSelectors":["cards"],"type":"SelectorText","selector":"div.DDPageSiderail__coordinates","multiple":false,"regex":""},{"id":"been-here-count","parentSelectors":["cards"],"type":"SelectorText","selector":".hidden-print .js-been-to-top-wrap div.title-md","multiple":false,"regex":""},{"id":"want-to-visit-count","parentSelectors":["cards"],"type":"SelectorText","selector":".hidden-print .js-like-top-wrap div.title-md","multiple":false,"regex":""},{"id":"summary","parentSelectors":["cards"],"type":"SelectorText","selector":"h3.DDPage__header-dek","multiple":false,"regex":""},{"id":"details","parentSelectors":["cards"],"type":"SelectorText","selector":".DDP__body-copy > div","multiple":false,"regex":""},{"id":"before-you-go","parentSelectors":["cards"],"type":"SelectorText","selector":"div.DDP__direction-copy","multiple":false,"regex":""},{"id":"nearby-title-1","parentSelectors":["cards"],"type":"SelectorText","selector":"a:nth-of-type(1) div.DDPageSiderailRecirc__item-title","multiple":false,"regex":""},{"id":"nearby-distance-1","parentSelectors":["cards"],"type":"SelectorText","selector":"a:nth-of-type(1) div.DDPageSiderailRecirc__item-distance","multiple":false,"regex":""},{"id":"nearby-title-2","parentSelectors":["cards"],"type":"SelectorText","selector":"a:nth-of-type(2) div.DDPageSiderailRecirc__item-title","multiple":false,"regex":""},{"id":"nearby-distance-2","parentSelectors":["cards"],"type":"SelectorText","selector":"a:nth-of-type(2) div.DDPageSiderailRecirc__item-distance","multiple":false,"regex":""},{"id":"nearby-title-3","parentSelectors":["cards"],"type":"SelectorText","selector":"a:nth-of-type(3) div.DDPageSiderailRecirc__item-title","multiple":false,"regex":""},{"id":"nearby-distance-3","parentSelectors":["cards"],"type":"SelectorText","selector":"a:nth-of-type(3) div.DDPageSiderailRecirc__item-distance","multiple":false,"regex":""},{"id":"tag-1","parentSelectors":["cards"],"type":"SelectorText","selector":"span:nth-of-type(1) a.itemTags__link","multiple":false,"regex":""},{"id":"tag-2","parentSelectors":["cards"],"type":"SelectorText","selector":"span:nth-of-type(2) a","multiple":false,"regex":""},{"id":"tag-3","parentSelectors":["cards"],"type":"SelectorText","selector":"span:nth-of-type(3) a","multiple":false,"regex":""},{"id":"tag-4","parentSelectors":["cards"],"type":"SelectorText","selector":"span:nth-of-type(4) a","multiple":false,"regex":""},{"id":"tag-5","parentSelectors":["cards"],"type":"SelectorText","selector":"span:nth-of-type(5) a","multiple":false,"regex":""},{"id":"tag-6","parentSelectors":["cards"],"type":"SelectorText","selector":"span:nth-of-type(6) a","multiple":false,"regex":""},{"id":"tag-7","parentSelectors":["cards"],"type":"SelectorText","selector":"span:nth-of-type(7) a","multiple":false,"regex":""},{"id":"tag-8","parentSelectors":["cards"],"type":"SelectorText","selector":"span:nth-of-type(8) a","multiple":false,"regex":""},{"id":"tag-9","parentSelectors":["cards"],"type":"SelectorText","selector":"span:nth-of-type(9) a","multiple":false,"regex":""},{"id":"tag-10","parentSelectors":["cards"],"type":"SelectorText","selector":"span:nth-of-type(10) a","multiple":false,"regex":""},{"id":"tag-11","parentSelectors":["cards"],"type":"SelectorText","selector":"span:nth-of-type(11) a","multiple":false,"regex":""},{"id":"tag-12","parentSelectors":["cards"],"type":"SelectorText","selector":"span:nth-of-type(12) a","multiple":false,"regex":""},{"id":"tag-13","parentSelectors":["cards"],"type":"SelectorText","selector":"span:nth-of-type(13) a","multiple":false,"regex":""},{"id":"tag-14","parentSelectors":["cards"],"type":"SelectorText","selector":"span:nth-of-type(14) a","multiple":false,"regex":""},{"id":"tag-15","parentSelectors":["cards"],"type":"SelectorText","selector":"span:nth-of-type(15) a","multiple":false,"regex":""},{"id":"tag-16","parentSelectors":["cards"],"type":"SelectorText","selector":"span:nth-of-type(16) a","multiple":false,"regex":""},{"id":"tag-17","parentSelectors":["cards"],"type":"SelectorText","selector":"span:nth-of-type(17) a","multiple":false,"regex":""},{"id":"tag-18","parentSelectors":["cards"],"type":"SelectorText","selector":"span:nth-of-type(18) a","multiple":false,"regex":""},{"id":"tag-19","parentSelectors":["cards"],"type":"SelectorText","selector":"span:nth-of-type(19) a","multiple":false,"regex":""},{"id":"tag-20","parentSelectors":["cards"],"type":"SelectorText","selector":"span:nth-of-type(20) a","multiple":false,"regex":""},{"id":"image","parentSelectors":["cards"],"type":"SelectorElementAttribute","selector":"figure.js-item-image a","multiple":false,"extractAttribute":"data-lightbox-src"},{"id":"images","parentSelectors":["cards"],"type":"SelectorElementClick","clickElementSelector":".hidden-sm i.icon-expand_right","clickElementUniquenessType":"uniqueCSSSelector","clickType":"clickMore","delay":2000,"discardInitialElements":"do-not-discard","multiple":true,"selector":"div#lightbox-content"},{"id":"img","parentSelectors":["images"],"type":"SelectorImage","selector":"img","multiple":true},{"id":"grouped","parentSelectors":["cards"],"type":"SelectorGroup","selector":"img","extractAttribute":"src"}]}

Hey buddy,

When I saw your response, my heart started pounding, I felt nauseous and my lip trembled a little - that's how incredibly important this is to me so words will never suffice in thanking you enough.

I have been non-stop with this since your post dropped and am SO CLOSE!

(It took me a while to work out how to post-process to remove all the duplicate image URLs, as they are repeated many times on their respective rows. Not sure why; I'm guessing it's just a byproduct, but at least I can clean things up afterwards in the spreadsheet.)
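(For anyone else hitting the same duplicates, the spreadsheet clean-up can be sketched in one line of Python — `dict.fromkeys` drops repeats while keeping the first-seen order, which matters if the order ever becomes meaningful. The URLs below are made-up examples.)

```python
def dedupe_keep_order(urls):
    """Remove duplicate URLs while preserving first-seen order."""
    # dict keys are unique and keep insertion order (Python 3.7+).
    return list(dict.fromkeys(urls))

# Hypothetical row of scraped image URLs with repeats.
imgs = ["a.jpg", "b.jpg", "a.jpg", "c.jpg", "b.jpg"]
print(dedupe_keep_order(imgs))  # -> ['a.jpg', 'b.jpg', 'c.jpg']
```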

Whilst it would be ideal if they were just in the correct order, the dilemma I'm still facing is getting at least the first image on the webpage to be the first of my image columns. (Again, not sure why they seem to scrape in a random order, but I vaguely remember reading something about it recently.) I thought about a separate selector for the first image, but that would break again, as with the original issue, because some webpages have no multi-photo lightbox and only a single image. Perhaps there is a way to 'tag' the first image from the webpage with a unique string that I could pick up in post-processing, but that would require Web Scraper to know what the first image was; in which case, it doesn't make sense to me why it couldn't just scrape the image URLs in order. I'm really at a loss trying to work out a solution - my head hurts!
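(One order-preserving workaround, for anyone curious: parse the saved page yourself and collect the lightbox links in document order, since an HTML parser visits elements top to bottom. This is a sketch using only the standard library; the inline HTML just mimics the `figure.js-item-image a` elements with `data-lightbox-src` attributes that the sitemap above targets, and the real markup may differ.)

```python
from html.parser import HTMLParser

class LightboxLinks(HTMLParser):
    """Collects data-lightbox-src attribute values in document order."""
    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "a" and "data-lightbox-src" in a:
            self.urls.append(a["data-lightbox-src"])

# Hypothetical sample markup with two lightbox figures.
sample = ('<figure class="js-item-image"><a data-lightbox-src="1.jpg"></a></figure>'
          '<figure class="js-item-image"><a data-lightbox-src="2.jpg"></a></figure>')
p = LightboxLinks()
p.feed(sample)
print(p.urls)  # -> ['1.jpg', '2.jpg']  (document order preserved)
```

The first entry of `p.urls` is then, by construction, the first image on the page.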

You seem super wise with this, so if you had any ideas or solutions for getting the images in order (or at least the first image first), that would save me putting in dozens of hours trying to work it out with my current limited knowledge.

As mentioned, I love learning with the aim of being able to teach others, which I do with several other things that I have already mastered, so thanks again for being so awesome with me.

Cheers again!


No worries - you deserve it.

I have put in many more hours trying to work it out but am not having any luck, so I'm very much on that same path of "many hours of self-learning" myself, and you've been a great teacher in helping with that.

Thank you so much for having a think about it as it is going to have such a positive impact!

Cheers again buddy.


Hey dude!

So sorry for the delay - I received zero notifications of your reply so not sure what happened there.

You are awesome - I think it's all working!

I've been trying to learn more from your working examples and it might slowly be starting to sink in.

One thing that I've also learnt is that sometimes it is necessary to keep things simple (by creating a separate selector for each item, as opposed to trying to be too clever and recreating the same requirement in a single selector). Overcomplication is actually a life-long issue with everything in my life, so I'm not surprised it was the same in this case - ha!

I increased the photo count from 8 to 24 and also made some additions using your same 'single selector per item' approach. It was then that I realised how laborious it was, so I felt guilty that you were kind enough to put in the effort of doing 8 of them (when just a couple would have been fine for me to do the rest from). So, thank you for being so decent and going above and beyond.

I have only tested on a small sample set, so I have it running now on the larger set. I will let you know how that went once it completes in a few days, but I have huge faith.

Massive thanks again buddy and hope you're having a lovely weekend thus far.

Will catch up as soon as it completes.

Cheers!
