The program does not scrape all data

tajana · May 12, 2018, 5:32pm

Hi!

The program scrapes only the first 10 links, but I need all 205. I am using a Link selector with Multiple enabled. http://www.simon.com/mall/the-mills-at-jersey-gardens/stores
Can you advise me, please, how to solve the issue?

Best regards,
Maria

chefas · May 12, 2018, 7:04pm

Maria,
could you post your sitemap here on the forum?
Thanks

tajana · May 13, 2018, 8:14am

Hi!
Do you need this information?

{"_id":"simon","startUrl":["http://www.simon.com/mall/the-mills-at-jersey-gardens/stores"],"selectors":[{"id":"list","type":"SelectorLink","selector":"div.col-lg-3 > div.card-secondary a.no-underline.cardImgLink, div.LazyLoad:nth-of-type(n+204) a.no-underline.cardImgLink","parentSelectors":["_root"],"multiple":true,"delay":0},{"id":"name","type":"SelectorText","selector":"h1.header-md","parentSelectors":["list"],"multiple":false,"regex":"","delay":0},{"id":"website","type":"SelectorLink","selector":"div.store-social-desktop a.nav-link","parentSelectors":["list"],"multiple":false,"delay":0}]}

chefas · May 13, 2018, 10:03am

Hello,

it seems that this site can't be scraped with this extension, even if you pick the appropriate selector (normally you would have to begin with an "Element scroll down" selector).

tajana · May 14, 2018, 9:16am

Hi!

Thank you very much for your help!
I chose "element scroll down" and correctly identified all the links.

Best regards,
Maria

chefas · May 14, 2018, 12:10pm

Hello,

could you post your sitemap here?

Thanks

tajana · May 15, 2018, 8:10am

Hi!

Is this the information you asked for?

{"_id":"simon2","startUrl":["http://www.simon.com/mall/the-mills-at-jersey-gardens/stores"],"selectors":[{"id":"list","type":"SelectorLink","selector":"div.col-lg-3 > div.card-secondary a.no-underline.cardImgLink, div.LazyLoad:nth-of-type(n+11) a.no-underline.cardImgLink","parentSelectors":["_root","element"],"multiple":true,"delay":0},{"id":"NAME","type":"SelectorText","selector":"h1.header-md","parentSelectors":["list"],"multiple":false,"regex":"","delay":0},{"id":"element","type":"SelectorElementScroll","selector":"div.col-lg-3 > div.card-secondary div.card-secondary-text a.no-underline:nth-of-type(1), div.col-lg-3:nth-of-type(2) h2.card-secondary-title, div.col-lg-3:nth-of-type(n+2) > div.card-secondary div.header-xs, div.LazyLoad:nth-of-type(n+11) div.card-secondary-text a.no-underline:nth-of-type(1), div.LazyLoad:nth-of-type(12) h2.card-secondary-title, div.LazyLoad:nth-of-type(12) div.header-xs","parentSelectors":["_root"],"multiple":true,"delay":0},{"id":"WEBSITE","type":"SelectorLink","selector":"div.store-social-desktop a.nav-link","parentSelectors":["list"],"multiple":false,"delay":0},{"id":"STORE HOURS","type":"SelectorText","selector":"div.store-hours div.store-hours","parentSelectors":["list"],"multiple":false,"regex":"","delay":0},{"id":"BEST ENTRANCE","type":"SelectorText","selector":"div.store-entrance p:nth-of-type(1)","parentSelectors":["list"],"multiple":false,"regex":"","delay":0},{"id":"LOCATION IN MALL","type":"SelectorText","selector":"p.no-margin","parentSelectors":["list"],"multiple":false,"regex":"","delay":0},{"id":"MORE INFO","type":"SelectorText","selector":"li.nav-item.hidden-sm-down:nth-of-type(2) a.nav-link","parentSelectors":["list"],"multiple":false,"regex":"","delay":0}]}

KristapsWS · May 15, 2018, 1:12pm

Set the delay on your Element scroll down selector to at least 3000 ms and change its selector to div.directory-store. Make the list selector a child of the Element scroll down selector only.
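For anyone who prefers to edit the exported sitemap JSON rather than click through the UI, here is a minimal Python sketch of the three changes described above. The function name `apply_scroll_advice` and its parameters are made up for illustration, and the sitemap is trimmed to the two relevant selectors from the "simon2" sitemap. Storing the delay as a plain number is an assumption based on how `"delay":0` appears in the exports in this thread. The link selector's own CSS is left untouched; it may need to be rewritten relative to each `div.directory-store` element once it becomes a child, which is not covered here.

```python
import copy
import json


def apply_scroll_advice(sitemap, scroll_id="element", list_id="list",
                        scroll_selector="div.directory-store", delay_ms=3000):
    """Return a copy of the sitemap with the suggested changes applied.

    Hypothetical helper: the ids and defaults mirror the "simon2" sitemap.
    """
    updated = copy.deepcopy(sitemap)  # leave the caller's sitemap unmodified
    for sel in updated["selectors"]:
        if sel["id"] == scroll_id:
            # Scroll over whole store blocks and wait for lazy loading.
            sel["selector"] = scroll_selector
            sel["delay"] = delay_ms
        elif sel["id"] == list_id:
            # The link selector becomes a child of the scroll selector only.
            sel["parentSelectors"] = [scroll_id]
    return updated


# Trimmed version of the "simon2" sitemap (only the two selectors involved).
sitemap = {
    "_id": "simon2",
    "startUrl": ["http://www.simon.com/mall/the-mills-at-jersey-gardens/stores"],
    "selectors": [
        {"id": "list", "type": "SelectorLink",
         "selector": "div.col-lg-3 > div.card-secondary a.no-underline.cardImgLink",
         "parentSelectors": ["_root", "element"], "multiple": True, "delay": 0},
        {"id": "element", "type": "SelectorElementScroll",
         "selector": "div.col-lg-3 > div.card-secondary div.card-secondary-text a.no-underline:nth-of-type(1)",
         "parentSelectors": ["_root"], "multiple": True, "delay": 0},
    ],
}

updated = apply_scroll_advice(sitemap)
# Compact one-line JSON, similar in shape to what the extension exports.
print(json.dumps(updated, separators=(",", ":")))
```

The printed JSON can then be pasted into the extension's import dialog and checked with the selector preview before running a full scrape.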

chefas · May 15, 2018, 1:18pm

Hello Tajana,

Yes, that is what I wanted.

Your scrape is interesting because it shows how difficult it can be to get all the results with this extension. I have simplified your sitemap so that it is easier to follow for anyone who wants to learn more about the extension.

This website makes a very interesting example. It needs an "Element scroll down" selector, because more stores are loaded as you drag the scroll bar down the page. The big difficulty is choosing what the "Element scroll down" selector should select: not the logos, not the whole blocks, not only the title, and not only the phone number. You must select the "title + opening hours" without including the phone number below it. In addition, you have to select at least 11 rows to be sure of getting a complete scrape of 202 records.

Here is my sitemap:

{"_id":"test2","startUrl":["http://www.simon.com/mall/the-mills-at-jersey-gardens/stores"],"selectors":[{"id":"element","type":"SelectorElementScroll","selector":"div.col-lg-3 > div.card-secondary div.card-secondary-text a.no-underline:nth-of-type(1), div.LazyLoad:nth-of-type(n+11) div.card-secondary-text a.no-underline:nth-of-type(1)","parentSelectors":["_root"],"multiple":true,"delay":0},{"id":"link","type":"SelectorLink","selector":"parent","parentSelectors":["element"],"multiple":true,"delay":0},{"id":"name","type":"SelectorText","selector":"h1.header-md","parentSelectors":["link"],"multiple":false,"regex":"","delay":0},{"id":"hours","type":"SelectorText","selector":"div.store-hours div.store-hours","parentSelectors":["link"],"multiple":false,"regex":"","delay":0},{"id":"location","type":"SelectorText","selector":"p.no-margin","parentSelectors":["link"],"multiple":false,"regex":"","delay":0},{"id":"more-info","type":"SelectorText","selector":"li.nav-item.hidden-sm-down:nth-of-type(2) a.nav-link","parentSelectors":["link"],"multiple":false,"regex":"","delay":0}]}

tajana · May 15, 2018, 3:10pm

Hi chefas!

Yes, it was difficult for me to select the correct links. Your version is simpler than mine. Thank you so much!
Can you advise me how to extract an e-mail address without the "mailto:" prefix? Example: mailto:info@advancedcryonyc.com

my sitemap:
{"_id":"nybeautysalons","startUrl":["https://www.yellowpages.com/search?search_terms=Beauty%20Salons&geo_location_terms=NoHo%2C%20New%20York%2C%20NY&refinements=headingtext%3AMedical%20Spas"],"selectors":[{"id":"list","type":"SelectorLink","selector":"div.result a.business-name","parentSelectors":["_root","pagination"],"multiple":true,"delay":0},{"id":"email","type":"SelectorLink","selector":"a.email-business","parentSelectors":["list"],"multiple":false,"delay":0},{"id":"address","type":"SelectorText","selector":"p.address span","parentSelectors":["list"],"multiple":false,"regex":"","delay":0},{"id":"phone","type":"SelectorText","selector":"p.phone","parentSelectors":["list"],"multiple":false,"regex":"","delay":0},{"id":"pagination","type":"SelectorLink","selector":"div.pagination a","parentSelectors":["_root"],"multiple":true,"delay":0}]}

tajana · May 15, 2018, 3:11pm

Thank you for your advice!

chefas · May 15, 2018, 4:04pm

Hi

Perhaps it is possible to extract the email without "mailto:" directly inside the Web Scraper extension, but I don't know how to do it.

Nevertheless, you can do it easily in Excel (on the email-link-href column).
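
If you would rather not use Excel, the same clean-up can be done on the exported CSV with a few lines of Python. This is only a sketch of post-processing outside the extension, not a feature of Web Scraper itself. The helper names `strip_mailto` and `clean_email_column` are made up for illustration, and the default column name `email-href` is an assumption; change it to whatever the link column is called in your export.

```python
import csv
import io
from urllib.parse import unquote


def strip_mailto(href):
    """Return the address part of a mailto: link; other values are only trimmed."""
    value = href.strip()
    if value.lower().startswith("mailto:"):
        value = value[len("mailto:"):]
        value = value.split("?", 1)[0]  # drop any ?subject=... parameters
        value = unquote(value)          # decode %40 and similar escapes
    return value


def clean_email_column(csv_text, column="email-href"):
    """Rewrite one column of a CSV export so it holds bare e-mail addresses."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        row[column] = strip_mailto(row[column] or "")
        writer.writerow(row)
    return out.getvalue()


# Tiny in-memory example; a real export would be read from a file instead.
sample = "name,email-href\nExample Spa,mailto:info@advancedcryonyc.com\n"
cleaned = clean_email_column(sample)
print(cleaned)
```

Rows whose value does not start with "mailto:" are passed through with only surrounding whitespace removed, so the function can be run over a column that mixes e-mail links and ordinary URLs.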