A truly difficult case

Hi. I am badly in need of help. For the last couple of weeks I have been trying to scrape a particular archive. My goal is to get all 12762 images from all pages.

I have tried every approach I could think of, but nothing works. I know I should use SelectorElementClick, but: 1) it does not go to the next page; 2) it does not save anything (in the example below I tried to add two other pieces of information besides the image).

Could you please help?

Url: http://docvirt.com/docreader.net/docreader.aspx?bib=Acervo_AAS&pasta=AAS%20mre%20d%201974.03.26

Sitemap:
{"_id":"fgvinformacoesaas","startUrl":["http://docvirt.com/docreader.net/docreader.aspx?bib=Acervo_AAS&pasta=AAS%20mre%20d%201974.03.26"],"selectors":[{"id":"nextpage","type":"SelectorElementClick","parentSelectors":["_root","nextpage"],"selector":"td.rspPaneHorizontal img, #TextoDigitadoTxt, #PastaTxt","multiple":true,"delay":"4000","clickElementSelector":"div#paginaposdiv input","clickType":"clickOnce","discardInitialElements":false,"clickElementUniquenessType":"uniqueHTMLText"}]}

In the following example (from the same case), I can scrape the image link, but only from the first page.

{"_id":"testebivi","startUrl":["http://docvirt.com/docreader.net/DocReader.aspx?bib=Acervo_AAS&PagFis=52570&Pesq="],"selectors":[{"id":"NextPage","type":"SelectorElementClick","parentSelectors":["_root"],"selector":"textarea","multiple":true,"delay":"2000","clickElementSelector":"div#paginaposdiv input","clickType":"clickOnce","discardInitialElements":false,"clickElementUniquenessType":"uniqueCSSSelector"},{"id":"Title Folder","type":"SelectorText","parentSelectors":["NextPage"],"selector":"#PastaTxt","multiple":false,"regex":"","delay":0},{"id":"Description","type":"SelectorText","parentSelectors":["NextPage"],"selector":"#TextoDigitadoTxt","multiple":false,"regex":"","delay":0},{"id":"Image atribute","type":"SelectorElementAttribute","parentSelectors":["NextPage"],"selector":"<img id="DocumentoImg" ondragstart="return false;" onmousedown="return false;" src="cache/4868305824091/I0052572-20Alt=001643Lar=001103LargOri=001696AltOri=002527.JPG">","multiple":false,"extractAttribute":"src","delay":0},{"id":"Element Image","type":"SelectorElement","parentSelectors":["NextPage"],"selector":"<img id="DocumentoImg" ondragstart="return false;" onmousedown="return false;" src="cache/4868305824091/I0052572-20Alt=001643Lar=001103LargOri=001696AltOri=002527.JPG">","multiple":false,"delay":0}]}

Hi!

In order for Element Click to keep clicking through to the next page, you have to set its click type to 'Click More'.

Your sitemap:
{"_id":"fgvinformacoesaas","startUrl":["http://docvirt.com/docreader.net/docreader.aspx?bib=Acervo_AAS&pasta=AAS%20mre%20d%201974.03.26"],"selectors":[{"id":"nextpage","type":"SelectorElementClick","selector":"form","parentSelectors":["_root"],"multiple":true,"delay":"3000","clickElementSelector":"div#pagposteriorgrandediv input","clickType":"clickMore","discardInitialElements":false,"clickElementUniquenessType":"uniqueText"},{"id":"Doc number","type":"SelectorText","selector":"div#pastadiv span","parentSelectors":["nextpage"],"multiple":false,"regex":"","delay":0},{"id":"Description","type":"SelectorText","selector":"textarea","parentSelectors":["nextpage"],"multiple":false,"regex":"","delay":0},{"id":"image","type":"SelectorElementAttribute","selector":"td.rspPaneHorizontal img","parentSelectors":["nextpage"],"multiple":false,"extractAttribute":"src","delay":0}]}

I'd delete the description though, as it's the same all across the document.
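
If it helps to double-check which click type a sitemap actually uses before importing it, here is a minimal Python sketch. The function name `click_types` is hypothetical, and the sample string is a trimmed-down version of the original sitemap (only the fields needed for the check), not the full export.

```python
import json

def click_types(sitemap_json):
    """Map each Element Click selector id to its clickType value."""
    sitemap = json.loads(sitemap_json)
    return {
        selector["id"]: selector.get("clickType")
        for selector in sitemap["selectors"]
        if selector["type"] == "SelectorElementClick"
    }

# Trimmed sample based on the first sitemap in this thread (assumption: other fields omitted).
original = (
    '{"_id":"fgvinformacoesaas","selectors":['
    '{"id":"nextpage","type":"SelectorElementClick","clickType":"clickOnce"}]}'
)

for selector_id, click_type in click_types(original).items():
    if click_type != "clickMore":
        print(f"{selector_id}: clickType is {click_type}, expected clickMore for paging")
```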

If you want to stop scraping and have the results saved, open the Developer Tools panel in the scrape window, use the Inspect tool (Ctrl + Shift + C) to manually select the ' > ' button (to the right of the image), and delete it.

Dear Iconoclast. You are a savior!! It was much easier than I thought! Thank you very much!

I just discovered that the window suddenly closes. I will deal with this later.

My biggest problem is that I am not able to download the images on either my Mac or my PC. On my PC I dragged and dropped after installing Python. It creates a folder and states "image download completed". But when I open the folder there is nothing there.

The same thing happens on my Mac.

I opened image-downloader.py and found a reference to -src on line 53. In my CSV I do not have any src reference. Everything looks like the string below.

cache/828407580531/I0001054-20Alt=001500Lar=001006LargOri=003412AltOri=005087.JPG

Does the name of the column that contains the images end with "-src"? If it doesn't, rename the column so that it has "-src" at the end.

Make sure that the image URLs are full URLs, not just paths to the images. If an image URL lacks the domain name, you will have to add that as well.
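
A quick way to check both points on an exported CSV is sketched below in Python. This is only an illustration: the function name `find_relative_urls` and the column name `image-src` are assumptions, so adjust them to whatever your export actually contains.

```python
import csv
import io
from urllib.parse import urlparse

def find_relative_urls(csv_text, suffix="-src"):
    """Return {column: [values that lack a scheme or host]} for columns ending in suffix."""
    reader = csv.DictReader(io.StringIO(csv_text))
    problems = {}
    for row in reader:
        for name, value in row.items():
            # Skip overflow cells, columns without the suffix, and empty cells.
            if name is None or not name.endswith(suffix) or not value:
                continue
            parsed = urlparse(value)
            if not (parsed.scheme and parsed.netloc):
                problems.setdefault(name, []).append(value)
    return problems

# One-row sample using the path quoted earlier in this thread.
sample = (
    "image-src\n"
    "cache/828407580531/I0001054-20Alt=001500Lar=001006LargOri=003412AltOri=005087.JPG\n"
)
print(find_relative_urls(sample))
```

An empty result means every "-src" value already has a scheme and a host. If no column ends with "-src" at all, the result is also empty, so check the header row first.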

The column ends with "-src".

All images are saved on the computer, so there is no URL.

I will add the full path for the .jpg file tonight and see if it works.

It is not stored in Chrome's cache. It is stored in the site's cache, and the full URL should look like this:
http://docvirt.com/docreader.net/cache/4087609790722/I0052568-20Alt=002359Lar=001583LargOri=001696AltOri=002527.JPG

So you will have to add http://docvirt.com/docreader.net/ before each image's path.
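
For a large export, the prefix can be added with a short script instead of by hand. The following is a rough Python sketch, not part of image-downloader.py: the function name `absolutize` and the `image-src` column in the sample are assumptions, while the base URL is the one given above.

```python
import csv
import io
from urllib.parse import urljoin

BASE_URL = "http://docvirt.com/docreader.net/"

def absolutize(csv_text, base_url=BASE_URL, suffix="-src"):
    """Return the CSV text with base_url prepended to relative paths in '-src' columns."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        for name in reader.fieldnames:
            if name.endswith(suffix) and row[name]:
                # urljoin leaves values that are already full URLs unchanged.
                row[name] = urljoin(base_url, row[name])
        writer.writerow(row)
    return out.getvalue()

sample = (
    "image-src\n"
    "cache/4087609790722/I0052568-20Alt=002359Lar=001583LargOri=001696AltOri=002527.JPG\n"
)
print(absolutize(sample))
```

To use it on a real file, read the exported CSV into a string, pass it to `absolutize`, and write the result to a new file that you then feed to the downloader.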

It worked like a charm! Thank you both.

Now I will try to find out why it stopped after scraping only 1/3.

Have you tried increasing page load delay?

Tomorrow I will try using 5000 instead of 3000.

Do you know how I could start from a particular page instead of beginning all over again each time? I did not find any clues in the documentation on how to enter a specific page.

You have to set Request Interval to as much as 10000 ms (10 seconds) so that you have time to set a different page number on the website before it starts to scrape.

Perfect!
It makes sense.