A truly difficult case

central11 · August 13, 2018, 2:44pm

Hi. I am deeply in need of help. For the last couple of weeks I have been trying to scrape a particular archive. My objective is to get all 12762 images from all pages.

I have tried every way I can think of, but nothing works. I know I should use SelectorElementClick, but: 1) it does not go to the next page; 2) it does not save anything (in the example below I tried to add two other pieces of information besides the image).

Could you please help?

Url: http://docvirt.com/docreader.net/docreader.aspx?bib=Acervo_AAS&pasta=AAS%20mre%20d%201974.03.26

Sitemap:
{"_id":"fgvinformacoesaas","startUrl":["http://docvirt.com/docreader.net/docreader.aspx?bib=Acervo_AAS&pasta=AAS%20mre%20d%201974.03.26"],"selectors":[{"id":"nextpage","type":"SelectorElementClick","parentSelectors":["_root","nextpage"],"selector":"td.rspPaneHorizontal img, #TextoDigitadoTxt, #PastaTxt","multiple":true,"delay":"4000","clickElementSelector":"div#paginaposdiv input","clickType":"clickOnce","discardInitialElements":false,"clickElementUniquenessType":"uniqueHTMLText"}]}

central11 · August 13, 2018, 3:36pm

In the example below (from the same case), I can scrape the image link, but only from the first page.

{"_id":"testebivi","startUrl":["http://docvirt.com/docreader.net/DocReader.aspx?bib=Acervo_AAS&PagFis=52570&Pesq="],"selectors":[{"id":"NextPage","type":"SelectorElementClick","parentSelectors":["_root"],"selector":"textarea","multiple":true,"delay":"2000","clickElementSelector":"div#paginaposdiv input","clickType":"clickOnce","discardInitialElements":false,"clickElementUniquenessType":"uniqueCSSSelector"},{"id":"Title Folder","type":"SelectorText","parentSelectors":["NextPage"],"selector":"#PastaTxt","multiple":false,"regex":"","delay":0},{"id":"Description","type":"SelectorText","parentSelectors":["NextPage"],"selector":"#TextoDigitadoTxt","multiple":false,"regex":"","delay":0},{"id":"Image atribute","type":"SelectorElementAttribute","parentSelectors":["NextPage"],"selector":"<img id="DocumentoImg" ondragstart="return false;" onmousedown="return false;" src="cache/4868305824091/I0052572-20Alt=001643Lar=001103LargOri=001696AltOri=002527.JPG">","multiple":false,"extractAttribute":"src","delay":0},{"id":"Element Image","type":"SelectorElement","parentSelectors":["NextPage"],"selector":"<img id="DocumentoImg" ondragstart="return false;" onmousedown="return false;" src="cache/4868305824091/I0052572-20Alt=001643Lar=001103LargOri=001696AltOri=002527.JPG">","multiple":false,"delay":0}]}

iconoclast · August 13, 2018, 9:01pm

Hi!

For Element Click to keep clicking through to the next page, you have to set its click type to 'Click More' ("clickType":"clickMore" in the sitemap).

Your sitemap:
{"_id":"fgvinformacoesaas","startUrl":["http://docvirt.com/docreader.net/docreader.aspx?bib=Acervo_AAS&pasta=AAS%20mre%20d%201974.03.26"],"selectors":[{"id":"nextpage","type":"SelectorElementClick","selector":"form","parentSelectors":["_root"],"multiple":true,"delay":"3000","clickElementSelector":"div#pagposteriorgrandediv input","clickType":"clickMore","discardInitialElements":false,"clickElementUniquenessType":"uniqueText"},{"id":"Doc number","type":"SelectorText","selector":"div#pastadiv span","parentSelectors":["nextpage"],"multiple":false,"regex":"","delay":0},{"id":"Description","type":"SelectorText","selector":"textarea","parentSelectors":["nextpage"],"multiple":false,"regex":"","delay":0},{"id":"image","type":"SelectorElementAttribute","selector":"td.rspPaneHorizontal img","parentSelectors":["nextpage"],"multiple":false,"extractAttribute":"src","delay":0}]}

I'd delete the description though, as it's the same throughout the document.

If you want to stop scraping and have the results saved, open the Developer panel in the scrape window, use the Inspect tool (Ctrl + Shift + C) to select the ' > ' button (to the right of the image), and delete it.

central11 · August 14, 2018, 1:04pm

Dear Iconoclast. You are a savior!! It was much easier than I thought! Thank you very much!

central11 · August 15, 2018, 1:55am

I just discovered that the window suddenly closes. I will deal with this later.

My biggest problem is that I am not able to download the images on either my Mac or my PC. On my PC I did the drag and drop after installing Python. It creates a folder and states "image download completed". But when I open the folder there is nothing there.

The same thing happens on my Mac.

I opened image-downloader.py and found a reference to -src on line 53. In my CSV I do not have any src reference. Every value looks like the string below.

cache/828407580531/I0001054-20Alt=001500Lar=001006LargOri=003412AltOri=005087.JPG

KristapsWS · August 15, 2018, 8:06am

Does the name of the column that contains the images end with "-src"? If it doesn't, rename the column so that it has "-src" at the end.

Make sure that the image URLs are full URLs, not just the path to the image. If an image URL lacks the domain name, you will have to add that as well.
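
Both points can be checked before running the downloader. This is only a minimal sketch: it assumes the export is a regular CSV with a header row, and the function name `check_image_columns`, the column name `image-src` and the sample data are made up for illustration.

```python
import csv
import io
from urllib.parse import urlparse


def check_image_columns(csv_text):
    """Return (src_columns, relative_values) for a CSV export.

    src_columns: header names that end with "-src".
    relative_values: values in those columns that lack a scheme or a host.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    src_columns = [name for name in (reader.fieldnames or []) if name.endswith("-src")]
    relative_values = []
    for row in reader:
        for name in src_columns:
            value = (row.get(name) or "").strip()
            parts = urlparse(value)
            # A full URL has both a scheme (http) and a host (docvirt.com).
            if value and not (parts.scheme and parts.netloc):
                relative_values.append(value)
    return src_columns, relative_values


# Hypothetical one-column export using the path format from this thread.
sample = (
    "image-src\n"
    "cache/828407580531/I0001054-20Alt=001500Lar=001006LargOri=003412AltOri=005087.JPG\n"
)
columns, relative = check_image_columns(sample)
print(columns)   # columns whose name ends with "-src"
print(relative)  # values that still need the domain added
```

If the first list is empty, the column needs renaming; if the second list is not empty, those values are paths rather than full URLs.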

central11 · August 15, 2018, 11:44am

The column name ends with "-src".

All images are saved on the computer, so there is no URL.

I will add the full path for the .jpg file tonight and see if it works.

KristapsWS · August 15, 2018, 1:03pm

The images are not stored in Chrome's cache. They are stored in the site's cache, so the full URL should look like this:
http://docvirt.com/docreader.net/cache/4087609790722/I0052568-20Alt=002359Lar=001583LargOri=001696AltOri=002527.JPG

So you will have to add http://docvirt.com/docreader.net/ before each image's path.
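
With thousands of rows this is easier to do with a script than by hand. Below is a rough sketch of one way to do it, not part of the image downloader itself: it assumes the export is a regular CSV with a header row, and the function name `absolutize_src_columns` and the column name `image-src` are made up for illustration.

```python
import csv
import io
from urllib.parse import urljoin

# Prefix taken from the full URL shown above.
BASE_URL = "http://docvirt.com/docreader.net/"


def absolutize_src_columns(csv_text, base_url=BASE_URL):
    """Return the CSV text with base_url prepended to relative "-src" values."""
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = reader.fieldnames or []
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=fieldnames, lineterminator="\n", extrasaction="ignore"
    )
    writer.writeheader()
    for row in reader:
        for name in fieldnames:
            if name.endswith("-src") and row[name]:
                # urljoin leaves values that are already full URLs unchanged.
                row[name] = urljoin(base_url, row[name])
        writer.writerow(row)
    return out.getvalue()


# Hypothetical one-column export using the path from the URL above.
sample = (
    "image-src\n"
    "cache/4087609790722/I0052568-20Alt=002359Lar=001583LargOri=001696AltOri=002527.JPG\n"
)
result = absolutize_src_columns(sample)
print(result)
```

For a real export, read the file with `open(path, newline="", encoding="utf-8").read()`, pass the text to the function, and write the returned text to a new file before handing it to the downloader.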

central11 · August 16, 2018, 11:14pm

It worked like a charm! Thank you both.

Now I will try to find out why it stopped after scraping only 1/3.

iconoclast · August 16, 2018, 11:25pm

Have you tried increasing page load delay?

central11 · August 17, 2018, 1:06am

I will try tomorrow to use 5000 instead of 3000.

Do you know how I could start from a particular page instead of beginning all over again each time? I did not find any clues in the documentation on how to enter a specific page.

iconoclast · August 17, 2018, 9:50am

Set the Request Interval to as much as 10000 ms (10 seconds), so that you have time to set a different page number on the website before it starts to scrape.

central11 · August 17, 2018, 11:03am

Perfect!
It makes sense.