How do I include line breaks or bullets from a list?

mki · January 16, 2019, 8:30pm

I am trying to scrape a bulleted list, however the results don't capture any of the line breaks or bullets. Unfortunately, the list itself doesn't have any periods or semicolons, so I can't use an Excel formula to insert line breaks in place of periods/semicolons/bullets/etc. Thus, I am hoping that there might be a way to capture the line breaks (or bullets) when scraping the data.

https://www.virtuoso.com/hotels/6163838/the-plaza#.XD-Sb1xKhPY

The list of amenities appears in two places, but both result in the same issue.

Sitemap:

{"_id":"plaza-sample","startUrl":["https://www.virtuoso.com/hotels/6163838/the-plaza#.XD-Sb1xKhPY"],"selectors":[{"id":"top-list-without-year","type":"SelectorText","parentSelectors":["_root"],"selector":"div.product-header__amenities ul","multiple":false,"regex":"","delay":0},{"id":"top-list-with-year","type":"SelectorText","parentSelectors":["_root"],"selector":"div.product-header__amenities div.tab-content div","multiple":false,"regex":"","delay":0},{"id":"bottom-list-without-year","type":"SelectorText","parentSelectors":["_root"],"selector":"div.product-tabs div.tab-content.virtuoso-amenities ul","multiple":false,"regex":"","delay":0},{"id":"bottom-list-with-year","type":"SelectorText","parentSelectors":["_root"],"selector":"div.product-tabs div.tab-content.virtuoso-amenities div","multiple":false,"regex":"","delay":0}]}

toutopiawego · January 22, 2019, 8:47pm

I am having a similar problem, so I am also interested in this answer - if anyone has it. Right now when there is a line break or a bullet list, and I extract the text, it does not include any spaces between the lines. So in my CSV file I get a lot of "knowldgeThis" - words mushed together because the scraper is not accounting for line breaks. Thanks in advance for any help!

Bort · January 23, 2019, 1:52am

If there are never more than a handful of bullet points, you can hard-code each one: Amenity1, Amenity2, Amenity3, ...

Not ideal.

Sitemap:

{"_id":"testing","startUrl":["https://www.virtuoso.com/hotels/6163838/the-plaza#.XD-Sb1xKhPY"],"selectors":[{"id":"amenity1","type":"SelectorText","parentSelectors":["_root"],"selector":"div.product-header__amenities li:nth-of-type(1)","multiple":false,"regex":"","delay":0},{"id":"amenity2","type":"SelectorText","parentSelectors":["_root"],"selector":"div.product-header__amenities li:nth-of-type(2)","multiple":false,"regex":"","delay":0},{"id":"amenity3","type":"SelectorText","parentSelectors":["_root"],"selector":"div.product-header__amenities li:nth-of-type(3)","multiple":false,"regex":"","delay":0},{"id":"amenity4","type":"SelectorText","parentSelectors":["_root"],"selector":"div.product-header__amenities li:nth-of-type(4)","multiple":false,"regex":"","delay":0},{"id":"amenity5","type":"SelectorText","parentSelectors":["_root"],"selector":"div.product-header__amenities li:nth-of-type(5)","multiple":false,"regex":"","delay":0},{"id":"amenity6","type":"SelectorText","parentSelectors":["_root"],"selector":"div.product-header__amenities li:nth-of-type(6)","multiple":false,"regex":"","delay":0}]}
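Since the number of bullets can vary from page to page, writing each amenityN selector by hand gets tedious. A rough sketch of generating the same kind of sitemap with a script follows; the function name build_sitemap and the choice of 12 positions are illustrative assumptions, not part of Web Scraper itself.

```python
import json


def build_sitemap(max_items: int) -> dict:
    """Build a sitemap with one Text selector per bullet position (1..max_items)."""
    selectors = [
        {
            "id": f"amenity{i}",
            "type": "SelectorText",
            "parentSelectors": ["_root"],
            # Same CSS pattern as the hand-written sitemap above.
            "selector": f"div.product-header__amenities li:nth-of-type({i})",
            "multiple": False,
            "regex": "",
            "delay": 0,
        }
        for i in range(1, max_items + 1)
    ]
    return {
        "_id": "testing",
        "startUrl": [
            "https://www.virtuoso.com/hotels/6163838/the-plaza#.XD-Sb1xKhPY"
        ],
        "selectors": selectors,
    }


if __name__ == "__main__":
    # 12 is an arbitrary upper bound chosen for illustration.
    print(json.dumps(build_sitemap(12), separators=(",", ":")))
```

The printed JSON can be pasted into the sitemap import box. Positions that do not exist on a given page should simply come back empty.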

Bort · January 23, 2019, 2:03am

If you don't mind each bullet point going on a separate row in your output, you can scrape the bullet list with a Text selector that has "multiple" enabled.

(But for some reason this has double output? I'm a newb.)

Sitemap:

{"_id":"testing","startUrl":["https://www.virtuoso.com/hotels/6163838/the-plaza#.XD-Sb1xKhPY"],"selectors":[{"id":"amenities","type":"SelectorText","parentSelectors":["_root"],"selector":"div[id=\"amenities-content\"] > p + ul > li","multiple":true,"regex":"","delay":0}]}
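With a "multiple" Text selector, each bullet lands in its own row of the export. If one cell per page is wanted instead, the rows can be merged afterwards. The sketch below assumes the exported CSV has a web-scraper-start-url column and an amenities column (named after the selector id); the function name join_bullets is made up for illustration. Exact duplicates per page are dropped, which may also help with the double output mentioned above.

```python
import csv
import io
from collections import defaultdict


def join_bullets(csv_text, key_col="web-scraper-start-url", value_col="amenities"):
    """Merge one-row-per-bullet CSV text into one bulleted string per page."""
    grouped = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        text = (row.get(value_col) or "").strip()
        # Skip empty cells and exact repeats for the same page.
        if text and text not in grouped[row[key_col]]:
            grouped[row[key_col]].append(text)
    # One line per bullet, each prefixed with a bullet character.
    return {
        url: "\n".join(f"• {item}" for item in items)
        for url, items in grouped.items()
    }


if __name__ == "__main__":
    sample = (
        "web-scraper-start-url,amenities\n"
        "u1,Breakfast\n"
        "u1,Wi-Fi\n"
        "u1,Breakfast\n"
    )
    print(join_bullets(sample)["u1"])
```

If the merged strings are written back out with csv.writer, fields containing line breaks are quoted automatically, so a spreadsheet should be able to show them as line breaks inside a single cell.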

mki · January 23, 2019, 3:36am

Hi toutopiawego - Nice to know that I'm not alone in struggling to figure this one out.

Hi Bort - Thanks for the great feedback!! I had tried to grab the items on the list one-by-one, however the number of items on the lists can vary slightly. I have found an example that has quite a few bulleted items, and built the scraper around that. It may prove to get the job done.

I am really intrigued by your second post, however I can't replicate it. Are you taking the original sitemap and adding your sitemap to it? When I try creating a sitemap using just the text you provided, I get an error message.

Thanks again!

bretfeig · January 25, 2019, 12:46am

Let me look into this. I did discover something cool, though. Check out:
https://wapi.engage.co/api/v2/getUsers?apiKey=5e5287ce9c4979cd6acf742850fd21af&categorySlug=Virtuoso-USCAN&syndicationCode=virtuoso&sig=6d3ecae505aad4e5abc7385da7c29e3d1e9c4197

Copy the output and paste it into https://json-csv.com/

Looks like a list of their employees

bretfeig · January 25, 2019, 12:53am

Change the Text selector to an HTML selector. This will put the HTML tags (e.g. <li>) in the output instead of the bullets. You can swap them out in Excel later.
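If the HTML selector output keeps the list markup (for example <li> tags), the swap can also be done with a small script instead of in Excel. This is only a sketch: the function name html_list_to_text is hypothetical, and the regex approach assumes a simple, non-nested list.

```python
import html
import re


def html_list_to_text(fragment: str, bullet: str = "• ") -> str:
    """Turn the <li> items of a simple HTML list into bulleted lines of text."""
    # Capture the inner content of every <li>...</li> pair.
    items = re.findall(
        r"<li\b[^>]*>(.*?)</li>", fragment, flags=re.IGNORECASE | re.DOTALL
    )
    lines = []
    for item in items:
        text = re.sub(r"<[^>]+>", " ", item)  # drop any tags inside the item
        text = html.unescape(text)            # &amp; -> &, etc.
        text = " ".join(text.split())         # collapse runs of whitespace
        if text:
            lines.append(bullet + text)
    return "\n".join(lines)


if __name__ == "__main__":
    sample = "<ul><li>Daily breakfast</li><li> Early <b>check-in</b> &amp; late check-out</li></ul>"
    print(html_list_to_text(sample))
```

For nested lists or messier markup, a real HTML parser would be a safer choice than a regular expression.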

You can also use a Group selector and use an ASCII code for the name, like ■

{"_id":"plaza-sample","startUrl":["https://www.virtuoso.com/hotels/6163838/the-plaza#.XD-Sb1xKhPY"],"selectors":[{"id":"■  ","type":"SelectorGroup","parentSelectors":["_root"],"selector":"div.product-header__amenities li","delay":0,"extractAttribute":""},{"id":"top-list-with-year","type":"SelectorText","parentSelectors":["_root"],"selector":"div.product-header__amenities div.tab-content div","multiple":false,"regex":"","delay":0},{"id":"bottom-list-without-year","type":"SelectorText","parentSelectors":["_root"],"selector":"div.product-tabs div.tab-content.virtuoso-amenities ul","multiple":false,"regex":"","delay":0},{"id":"bottom-list-with-year","type":"SelectorText","parentSelectors":["_root"],"selector":"div.product-tabs div.tab-content.virtuoso-amenities div","multiple":false,"regex":"","delay":0}]}
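With the Group selector, all matched items end up in a single cell. The sketch below assumes that cell holds a JSON array of objects keyed by the selector id (here "■  "), which is how the export appears to be structured; please check a real export before relying on it. The function name group_cell_to_text is made up for illustration.

```python
import json


def group_cell_to_text(cell: str, sep: str = "\n") -> str:
    """Convert a Group selector cell (assumed to be a JSON array) into bulleted lines."""
    records = json.loads(cell)
    lines = []
    for record in records:
        for name, value in record.items():
            value = " ".join(str(value).split())  # tidy whitespace in the item
            if value:
                # The selector id doubles as the bullet character.
                lines.append(f"{name.strip()} {value}")
    return sep.join(lines)


if __name__ == "__main__":
    sample = '[{"■  ":"Daily breakfast"},{"■  ":"Upgrade on arrival"}]'
    print(group_cell_to_text(sample))
```

Passing a different sep (for example "; ") gives a single-line version that is easier to split in a spreadsheet formula.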

Shadowwzz · August 12, 2019, 12:56pm

Has anyone found a way to scrape data with bullet points or line breaks?

I tried all the methods discussed above but still can't make it work.