How to extract specific text after text

zis · January 7, 2020, 10:33am

Hi,
I need to extract the text that follows labels such as Produktnavn, Type, Printertype, etc., as shown in the attached file.

I can set it up and scrape with Web Scraper selectors, but the problem is that some pages contain a different number of lines.
As shown in the screenshot, page 2 is missing the line that starts with "Sider".
So when I scrape with my setup, some pages end up with the wrong text in a column.

I need a regex that scrapes the text next to a specific label (for example Produktnavn, Type, Printertype), so that each value is scraped into the correct column.

Example URLs for the attached screenshot:
Page 1: https://www.pricerunner.dk/pl/187-4534532/Blaek-og-toner/Brother-LC223BK-(Black)-Sammenlign-Priser
Page 2: https://www.pricerunner.dk/pl/187-4533750/Blaek-og-toner/HP-CH561EE-(Black)-Sammenlign-Priser

How can I write a regex for one line that I can then use as a template for the other lines?

Sitemap:
{id:"sitemap code"}

leemeng · January 8, 2020, 1:35am

There is no real need for regex in this case, as you can achieve results just by using better selectors. You are probably using the selectors from WS's element picker, which tend to look like div:nth-of-type(5) div._3_beFIN3FX.
These are position-dependent, so your scraper will produce wrong results if the rows are not always in the same position (row 5 in this case).

For "Kompatible printere", you can try this selector:
div > div > div:contains('Kompatible printere') > div[style^='width:120px']

and for "Sider", try this selector:
div > div > div:contains('Sider') > div[style^='width:120px']

If Sider is not found, WS will just return "null". These selectors pick out the divs based on text content and style, so they are not affected by position changes.
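
The difference between positional and label-based lookup can be sketched outside Web Scraper. The Python snippet below is only an illustration: it assumes each page has already been reduced to a list of (label, value) rows, and the values ("550", "Inkjet") are made up for the example rather than taken from the pricerunner pages. Note that :contains matches substrings, while this sketch compares the whole label.

```python
# Hypothetical rows for two product pages; page 2 has no "Sider" row.
page1 = [
    ("Produktnavn", "Brother LC223BK"),
    ("Type", "Blækpatroner"),
    ("Sider", "550"),
    ("Printertype", "Inkjet"),
]
page2 = [
    ("Produktnavn", "HP CH561EE"),
    ("Type", "Blækpatroner"),
    ("Printertype", "Inkjet"),
]


def by_position(rows, index):
    """Position-based lookup (similar in spirit to nth-of-type)."""
    return rows[index][1] if index < len(rows) else None


def by_label(rows, label):
    """Label-based lookup (similar in spirit to :contains('label'))."""
    for name, value in rows:
        if name == label:
            return value
    return None  # a missing row yields None instead of a shifted value


# Row index 2 is "Sider" on page 1 but "Printertype" on page 2.
print(by_position(page1, 2), by_position(page2, 2))
# Looking up by label keeps the columns aligned.
print(by_label(page1, "Sider"), by_label(page2, "Sider"))
```

With positional lookup, the "Sider" column for page 2 silently receives the Printertype value; with label lookup it is simply empty.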

zis · January 8, 2020, 3:53pm

Hi,
It works fine.

The only remaining problem is scraping text that is inside a link, such as the value next to "Type", which is "Blækpatroner" in this case.
How can I scrape this text from the link?

URL:

Mazea · October 30, 2020, 11:11pm

Hi Lee,

Maybe you could help me with the following issue, which I have been struggling with every day for a week:

I tried to scrape the phone number from the webpage, but it doesn't get scraped. I guess this is because the phone number appears as a link, although it is not actually one; it is just FaceTime recognizing the number so that I can dial it directly from my computer.
Even when I create a selector of type Link and select the phone number, it doesn't get scraped as text.

So how can I scrape the phone number in this case?
Thanks

meathead · March 23, 2021, 9:35am

Hi guys

Hopefully, you can help. I have the same issue, except with the UFC website. If a row is missing, for example Style, it messes up the rest of the data and even adds data to the wrong cell. Please help, I've been trying to figure this out forever.

Deiveson's data is correct because his page has all the columns, but Jessica's is all messed up because Style is missing. I'm trying to get the info from Hometown, Style, Age, Height, Weight, Reach, Leg Reach and Debut.

This is my sitemap:

{"_id":"scraper-example","startUrl":["https://www.ufc.com/athlete/deiveson-figueiredo","https://www.ufc.com/athlete/jessica-aguilar"],"selectors":[{"id":"Name","type":"SelectorText","parentSelectors":["_root"],"selector":"h1","multiple":false,"regex":"","delay":0},{"id":"Fight Name","type":"SelectorText","parentSelectors":["_root"],"selector":"div.field-name-nickname","multiple":false,"regex":"","delay":0},{"id":"Hometown","type":"SelectorText","parentSelectors":["_root"],"selector":".c-bio__row--1col div.c-bio__text","multiple":false,"regex":"","delay":0},{"id":"Style","type":"SelectorText","parentSelectors":["_root"],"selector":".c-bio__row--2col div.c-bio__text","multiple":false,"regex":"","delay":0},{"id":"Height","type":"SelectorText","parentSelectors":["_root"],"selector":"div.c-bio__row--3col:nth-of-type(3) div:nth-of-type(2) div.c-bio__text","multiple":false,"regex":"","delay":0},{"id":"Weight","type":"SelectorText","parentSelectors":["_root"],"selector":"div.c-bio__row--3col:nth-of-type(3) div:nth-of-type(3) div.c-bio__text","multiple":false,"regex":"","delay":0},{"id":"Reach","type":"SelectorText","parentSelectors":["_root"],"selector":"div.c-bio__row--3col:nth-of-type(4) div:nth-of-type(2) div.c-bio__text","multiple":false,"regex":"","delay":0},{"id":"Leg Reach","type":"SelectorText","parentSelectors":["_root"],"selector":"div.c-bio__row--3col:nth-of-type(4) div:nth-of-type(3) div.c-bio__text","multiple":false,"regex":"","delay":0},{"id":"Debute","type":"SelectorText","parentSelectors":["_root"],"selector":"div.c-bio__row--3col:nth-of-type(4) div:nth-of-type(1) div.c-bio__text","multiple":false,"regex":"","delay":0}]}

these are the websites

Any help would be awesome.
Thanks in advance.

leemeng · March 24, 2021, 12:17am

Your selectors are currently too reliant on nth-of-type; these only work if the elements are in the exact same position on every page. You will need better selectors. For Hometown, Fighting Style and Height, try the examples below:

div.c-bio__label:contains('Hometown') + div.c-bio__text

div.c-bio__label:contains('Fighting style') + div.c-bio__text

div.c-bio__label:contains('Height') + div.c-bio__text

Ref: CSS Selectors Reference
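
As a rough illustration of what the label-plus-sibling selectors do, here is a sketch in Python using only the standard library's html.parser. The HTML fragments are a hypothetical simplification of the athlete bio markup: only the class names c-bio__label and c-bio__text come from the selectors above, while c-bio__field and all values are made up. The parser pairs each label with the next text div that follows it, which approximates the `+` combinator for this simple markup.

```python
from html.parser import HTMLParser


class BioParser(HTMLParser):
    """Collects {label: value} from c-bio__label / c-bio__text div pairs."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._current = None        # "label" or "text" while inside such a div
        self._buffer = []
        self._pending_label = None  # last label still waiting for its value

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div" and "c-bio__label" in classes:
            self._current, self._buffer = "label", []
        elif tag == "div" and "c-bio__text" in classes:
            self._current, self._buffer = "text", []

    def handle_data(self, data):
        if self._current:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != "div" or self._current is None:
            return
        content = "".join(self._buffer).strip()
        if self._current == "label":
            self._pending_label = content
        elif self._pending_label is not None:
            self.fields[self._pending_label] = content
            self._pending_label = None
        self._current = None


def extract(html):
    parser = BioParser()
    parser.feed(html)
    parser.close()
    return parser.fields


def row(label, value):
    # Hypothetical row markup; the real page structure may differ.
    return (f'<div class="c-bio__field"><div class="c-bio__label">{label}</div>'
            f'<div class="c-bio__text">{value}</div></div>')


FULL = row("Hometown", "City A") + row("Fighting style", "Style A") + row("Height", "65.00")
MISSING_STYLE = row("Hometown", "City B") + row("Height", "70.00")

print(extract(FULL))
print(extract(MISSING_STYLE))
```

Because every value is keyed by its own label, the page without a "Fighting style" row still reports the correct Height, and the missing field is simply absent.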

meathead · March 24, 2021, 1:20am

Man, thank you so much. I never would have figured that out. You are a champion.

Thanks again.
meathead