Grabbing just the <p class> text from container

mysterdee888 · August 19, 2024, 10:22am

Hi all. I'm trying to scrape content from this website.

Currently I'm having to grab the RAW text from the DIV -

But I would prefer to grab the text just from the <p> tags.

As it is, I'm grabbing raw text and it's just messy, grabbing a lot of other junk.

So if anybody could help, how would I grab just the

or

text

Thank you.

Here's my current scrape JSON:

{"_id":"audioloveandgenreVstorrent","startUrl":["https://vstorrent.org/page/[0-1495]"],"selectors":[{"id":"linker","parentSelectors":["wrapper"],"type":"SelectorLink","selector":".entry-title a","multiple":false,"linkType":"linkFromHref"},{"id":"title","parentSelectors":["linker"],"type":"SelectorText","selector":"h1","multiple":false,"regex":""},{"id":"image","parentSelectors":["linker"],"type":"SelectorImage","selector":"img","multiple":false},{"id":"info","parentSelectors":["linker"],"type":"SelectorText","selector":"div.entry-content","multiple":false,"regex":""},{"id":"fileinfo","parentSelectors":["linker"],"type":"SelectorText","selector":"X","multiple":false,"regex":""},{"id":"catagory","parentSelectors":["linker"],"type":"SelectorText","selector":"div.categories","multiple":false,"regex":""},{"id":"dateadded","parentSelectors":["linker"],"type":"SelectorText","selector":"span.date","multiple":false,"regex":""},{"id":"video","parentSelectors":["linker"],"type":"SelectorHTML","selector":"div.articletxt","multiple":false,"regex":"YouTube[^" ]+"},{"id":"soundcloud","parentSelectors":["linker"],"type":"SelectorHTML","selector":".soundcloudclass","multiple":false,"regex":"https://w.soundcloud.com/player/[^" ]+"},{"id":"mp3","parentSelectors":["linker"],"type":"SelectorHTML","selector":".dleaudioplayer","multiple":false,"regex":"https://cdn-prd.sounds.com[^" ]+"},{"id":"download","parentSelectors":["linker"],"type":"SelectorElementAttribute","selector":".code-block a[data-wpel-link]","multiple":false,"extractAttribute":"href"},{"id":"wrapper","parentSelectors":["_root"],"type":"SelectorElement","selector":"article","multiple":true},{"id":"tags","parentSelectors":["linker"],"type":"SelectorText","selector":"div.tags","multiple":false,"regex":""},{"id":"image2","parentSelectors":["linker"],"type":"SelectorElementAttribute","selector":".entry-content p:nth-of-type(1) 
a","multiple":false,"extractAttribute":"href"},{"id":"image3","parentSelectors":["linker"],"type":"SelectorElementAttribute","selector":"p:nth-of-type(4) a","multiple":false,"extractAttribute":"href"},{"id":"image4","parentSelectors":["linker"],"type":"SelectorElementAttribute","selector":"p:nth-of-type(9) a","multiple":false,"extractAttribute":"href"},{"id":"image5","parentSelectors":["linker"],"type":"SelectorElementAttribute","selector":"p:nth-of-type(12) a","multiple":false,"extractAttribute":"href"},{"id":"image6","parentSelectors":["linker"],"type":"SelectorElementAttribute","selector":"p:nth-of-type(15) a","multiple":false,"extractAttribute":"href"},{"id":"image7","parentSelectors":["linker"],"type":"SelectorElementAttribute","selector":"p:nth-of-type(18) a","multiple":false,"extractAttribute":"href"}]}

don2010 · August 19, 2024, 3:12pm

Your JSON is invalid.
Can you send a screenshot pointing out exactly what you need scraped?
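
One way to see where a sitemap fails to parse is to run it through a JSON parser before importing it. The sketch below is in Python purely for illustration; `check_sitemap` is a hypothetical helper and both fragments are made up. The broken one mimics an unescaped quote inside a regex value, which is how the `"YouTube[^" ]+"` entry appears as pasted above. A raw line break inside a string value would be rejected in the same way.

```python
import json


def check_sitemap(text: str) -> str:
    """Return 'ok' if text parses as JSON, else describe the first parse error."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        return f"invalid at line {err.lineno}, column {err.colno}: {err.msg}"
    return "ok"


# Hypothetical fragments: the inner quote is escaped in `good`, unescaped in `bad`.
good = '{"id": "video", "regex": "YouTube[^\\" ]+"}'
bad = '{"id": "video", "regex": "YouTube[^" ]+"}'

print(check_sitemap(good))
print(check_sitemap(bad))
```

The reported line and column point at the first character the parser could not accept, which is usually just after the offending quote.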

mysterdee888 · August 19, 2024, 3:28pm

Sure, here's a screenshot.

As you can see, there are multiple P tags.

My scraping only manages to grab them all as a RAW text grab off the main container.

What I need, though, is to grab each paragraph and have them all in one place in my capture,
so I can use them with PHP, rather than using multiple grabs, one per paragraph.

If I try to grab the P tags, all I get is the first P tag's data. Grabbing them all and putting the output into one capture seems impossible. Complicating things further, I want to display the captured data with a PHP variable, $info. I also import the CSV into a database, where the captured text goes into one column called info.

I'm not sure how I can capture all paragraphs from the website, store them in my SQL database, and echo them out through PHP...

I'm not sure how I should be capturing the data in https://webscraper.io/ to do this.

So far I've opted for just grabbing the whole RAW text from the main container on the website.

It is OK, but I have to do a lot of cleaning of the text to present it on my webpage.

I am thinking maybe capture them as a group and explode them using PHP.

But when I capture the P tags as a group, I end up with these to deal with in my explode function: "},{", which is not appropriate. I would rather only have to deal with maybe a , between each paragraph, then I can explode them.

don2010 · August 19, 2024, 3:53pm

You can try this sitemap, but it's not ideal... the structure of the page is a bit abnormal: there are not only <p> tags but also other tags. You can also grab the parent <div>, which includes the whole description inside (div[id*="post"] div.entry-content).

{"_id":"vstorrent","startUrl":["https://vstorrent.org/page/[1-2]"],"selectors":[{"id":"link","linkType":"linkFromHref","multiple":true,"parentSelectors":["_root"],"selector":".entry-title a","type":"SelectorLink"},{"extractAttribute":"","id":"description","parentSelectors":["link"],"selector":"div[id*=\"post\"] div.entry-content p","type":"SelectorGroup"}]}
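
The `"},{"` fragments mentioned earlier suggest that a grouped selector exports its matches as a JSON array of objects in a single cell, so decoding that cell is likely easier than splitting it by hand. Below is a minimal sketch in Python, assuming the cell is shaped like `[{"description": "..."}, ...]` with the selector id as the key. The sample cell and the `paragraphs_from_group` helper are hypothetical, not actual scraper output.

```python
import json
from typing import List


def paragraphs_from_group(cell: str, key: str = "description") -> List[str]:
    """Decode a grouped-selector cell into a list of non-empty paragraph strings."""
    items = json.loads(cell)
    texts = (str(item.get(key, "")).strip() for item in items)
    return [t for t in texts if t]


# Hypothetical cell value; the real export may differ in shape.
cell = (
    '[{"description": "First paragraph."}, '
    '{"description": " "}, '
    '{"description": "Second paragraph."}]'
)

paragraphs = paragraphs_from_group(cell)
info = "\n\n".join(paragraphs)  # one string for the single `info` column

print(paragraphs)
print(info)
```

The same idea should carry over to PHP: `json_decode($cell, true)` gives an array of rows, `array_column` pulls out the `description` values, and `implode` joins them with whatever separator suits the `info` column.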

mysterdee888 · August 19, 2024, 4:05pm

I'll have a looksee. Thank you.