Scrape a list of entities

Manecas1979 · December 7, 2018, 10:31am

I would like to scrape a database of companies, getting three fields from each one of them.

First I need to click a search button, then open each company's detail page, where I would like to get those three text fields.

The URLs are not dynamic.

Thanks for your advice

Url: http://www.impic.pt/impic/pt-pt/consultar/empresas-titulares-de-licenca-de-mediacao-imobiliaria

Sitemap:
{"_id":"impic","startUrl":["http://www.impic.pt/impic/pt-pt/consultar/empresas-titulares-de-licenca-de-mediacao-imobiliaria"],"selectors":[{"id":"next","type":"SelectorLink","parentSelectors":["start"],"selector":"div.col-sm-4.text-right a.btn","multiple":true,"delay":0},{"id":"element","type":"SelectorElementClick","parentSelectors":["next","start"],"selector":"div.block.impic-form","multiple":false,"delay":0,"clickElementSelector":"td.text-center:nth-of-type(3) a.btn-info","clickType":"clickOnce","discardInitialElements":false,"clickElementUniquenessType":"uniqueText"},{"id":"council","type":"SelectorText","parentSelectors":["element"],"selector":"div.information:nth-of-type(3) div.information-field:nth-of-type(3) span","multiple":false,"regex":"","delay":0},{"id":"district","type":"SelectorText","parentSelectors":["element"],"selector":"div.information:nth-of-type(3) div.information-field:nth-of-type(4) span","multiple":false,"regex":"","delay":0},{"id":"creation","type":"SelectorText","parentSelectors":["element"],"selector":"div.information:nth-of-type(3) div.information-field:nth-of-type(7) span","multiple":false,"regex":"","delay":0},{"id":"start","type":"SelectorElementClick","parentSelectors":["_root"],"selector":"a.btn.btn-search","multiple":false,"delay":0,"clickElementSelector":"a.btn.btn-search","clickType":"clickOnce","discardInitialElements":false,"clickElementUniquenessType":"uniqueText"}]}
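
To make the selector hierarchy in a sitemap like this easier to read, here is a minimal Python sketch that groups selectors by their `parentSelectors` and prints them as a tree. The embedded JSON is a trimmed copy of the sitemap above (only `id` and `parentSelectors` are kept), and the helper names `children_by_parent` and `format_tree` are made up for this illustration; they are not part of Web Scraper.

```python
import json
from collections import defaultdict

# Trimmed copy of the sitemap above: only "id" and "parentSelectors" are kept.
SITEMAP_JSON = """
{"_id": "impic",
 "selectors": [
  {"id": "next", "parentSelectors": ["start"]},
  {"id": "element", "parentSelectors": ["next", "start"]},
  {"id": "council", "parentSelectors": ["element"]},
  {"id": "district", "parentSelectors": ["element"]},
  {"id": "creation", "parentSelectors": ["element"]},
  {"id": "start", "parentSelectors": ["_root"]}
 ]}
"""


def children_by_parent(sitemap):
    """Map each parent selector id to the ids of its child selectors."""
    tree = defaultdict(list)
    for selector in sitemap["selectors"]:
        for parent in selector["parentSelectors"]:
            tree[parent].append(selector["id"])
    return dict(tree)


def format_tree(tree, node="_root", depth=0, ancestors=()):
    """Return indented lines for the hierarchy, skipping any cyclic reference."""
    lines = ["  " * depth + node]
    for child in tree.get(node, []):
        if child == node or child in ancestors:
            continue  # a selector that is its own ancestor would loop forever
        lines.extend(format_tree(tree, child, depth + 1, ancestors + (node,)))
    return lines


tree = children_by_parent(json.loads(SITEMAP_JSON))
lines = format_tree(tree)
print("\n".join(lines))
```

In the printed tree, `element` shows up twice, once under `next` and once directly under `start`, because it lists both as parents.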

iconoclast · December 9, 2018, 11:22am

Hi!

You can set Page Load Delay to 10000-15000 ms (10-15 seconds), so that before WebScraper starts the job you have time to enter all the data it needs to scrape through.

Manecas1979 · December 10, 2018, 2:30pm

Hi Iconoclast

Sorry, but I didn't understand how you are telling me to do this.

Thanks

iconoclast · December 11, 2018, 10:42am

You can set Request Interval and Page Load Delay prior to the actual scraping, before you hit the Start scraping button.

If you set Page Load Delay to, say, 10000 ms (10 seconds), WebScraper will open a window, but it won't start scraping until the 10 seconds you set have passed.

Manecas1979 · December 11, 2018, 11:53am

Hi Iconoclast

But I do not want to filter anything... I want to get their entire database with no filters...

The JSON that I posted doesn't work, even if I filter my results first.

Thanks

Manecas1979 · December 11, 2018, 12:22pm

Now I am trying to use this JSON:

{"_id":"impic","startUrl":["http://www.impic.pt/impic/pt-pt/consultar/empresas-titulares-de-licenca-de-mediacao-imobiliaria"],"selectors":[{"id":"page","type":"SelectorElementClick","parentSelectors":["_root"],"selector":"div.block.impic-form","multiple":true,"delay":"2500","clickElementSelector":"div.col-sm-4.text-right a.btn","clickType":"clickOnce","discardInitialElements":false,"clickElementUniquenessType":"uniqueText"},{"id":"data","type":"SelectorLink","parentSelectors":["page"],"selector":"tr.page-1 td.text-center:nth-of-type(1) a.btn-info","multiple":false,"delay":0},{"id":"ami","type":"SelectorText","parentSelectors":["data"],"selector":"div.information:nth-of-type(3) div.information-field:nth-of-type(5) span","multiple":false,"regex":"","delay":0},{"id":"council","type":"SelectorText","parentSelectors":["data"],"selector":"div.information:nth-of-type(3) div.information-field:nth-of-type(3) span","multiple":false,"regex":"","delay":0},{"id":"district","type":"SelectorText","parentSelectors":["data"],"selector":"div.information:nth-of-type(3) div.information-field:nth-of-type(4) span","multiple":false,"regex":"","delay":0},{"id":"date","type":"SelectorText","parentSelectors":["data"],"selector":"div.information:nth-of-type(3) div.information-field:nth-of-type(7) span","multiple":false,"regex":"","delay":0}]}

The info I want to get is the "Nº Licença", the "Concelho", the "Distrito" and the "Licença emitida em" from their entire database (311 pages and around 6200 companies).

Thanks for your help
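
For a sense of scale, here is a rough Python estimate of how long a full run could take with different Page Load Delay values. It is only a sketch under loud assumptions: every listing page and every company page counts as one page load, each load costs the Request Interval plus the Page Load Delay, and the 2000 ms Request Interval is an assumed value. The function name `estimated_hours` is made up for this example, and the real timing in WebScraper may differ.

```python
LISTING_PAGES = 311    # result pages, from the post above
COMPANY_PAGES = 6200   # approximate number of companies, from the post above

# Assumed value for illustration only; check the setting in your own job.
REQUEST_INTERVAL_MS = 2000


def estimated_hours(page_loads, request_interval_ms, page_load_delay_ms):
    """Crude estimate: each page load costs interval + delay, nothing else."""
    per_page_ms = request_interval_ms + page_load_delay_ms
    return page_loads * per_page_ms / 1000 / 3600


total_loads = LISTING_PAGES + COMPANY_PAGES

# 2500 ms is the delay used in the second sitemap; 10000 ms is the value
# suggested earlier in the thread.
for delay_ms in (2500, 10000):
    hours = estimated_hours(total_loads, REQUEST_INTERVAL_MS, delay_ms)
    print(f"Page Load Delay {delay_ms} ms: roughly {hours:.1f} hours")
```

Under these assumptions the job runs for many hours either way, so it is worth testing the sitemap on a handful of pages before starting the full scrape.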