How to scrape source code with webscraper?

yopyop · September 8, 2019, 3:56pm

Hey foxes!
I need to scrape a page's source code. I tried entering view-source:https://www.mysite.com as the URL, but I cannot scrape the page: webscraper doesn't let me save the URL.

Any help?

THX a looot

webber · September 9, 2019, 9:09am

The scraper already has access to the page source that "view-source" shows, so you do not need to add this prefix in the metadata. The only catch is that you will not be able to use the point-and-click interface for it; you will have to build the scraper by creating the selectors manually and typing them out.

yopyop · September 9, 2019, 4:17pm

Ah OK, thanks! I tried a few times, but I don't see how to scrape it with webscraper and extract the "logging_page_id":"profilePage_6887304" value from this page: view-source:https://www.instagram.com/tf1/

Could you help me a bit?
Thanks a loooot

leemeng · September 10, 2019, 6:25am

This is an interesting one. I know what you're trying to extract: the Instagram user ID. You can get that info by using this link:
https://www.instagram.com/web/search/topsearch/?query=[exact username]

So for your example it would be:
https://www.instagram.com/web/search/topsearch/?query=tf1

The results page is a plain text file, so there is only one selector, pre. Grab the text from that, and you can then pick out the user ID with a regex. This assumes the first line contains the correct user ID. The regex I'm using is \b\d{6,10}\b, which picks out the first 6-10 digit number it finds. Because it stops at the first match, you'll need the exact username so that the right account comes up first. Sample sitemap:

{"_id":"instagram_get_user_id","startUrl":["https://www.instagram.com/web/search/topsearch/?query=tf1"],"selectors":[{"id":"user_id","type":"SelectorText","parentSelectors":["_root"],"selector":"pre","multiple":false,"regex":"\\b\\d{6,10}\\b","delay":0}]}
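
If you want to check the regex outside the extension first, here is a minimal Python sketch. The two sample strings are invented stand-ins, not real Instagram responses: the first only imitates the general shape of a search result, and the second reuses the "logging_page_id" fragment quoted earlier in the thread. The function names and the second pattern are my own additions for illustration.

```python
import re

# Regex from the sitemap above: the first standalone run of 6 to 10 digits.
USER_ID_RE = re.compile(r"\b\d{6,10}\b")

# Assumed pattern for the fragment quoted earlier in the thread.
PAGE_ID_RE = re.compile(r'"logging_page_id":"profilePage_(\d+)"')


def first_user_id(text):
    """Return the first 6-10 digit number found in text, or None."""
    match = USER_ID_RE.search(text)
    return match.group(0) if match else None


def page_id_from_source(source):
    """Return the digits after profilePage_ in a page source, or None."""
    match = PAGE_ID_RE.search(source)
    return match.group(1) if match else None


# Invented sample data (assumption), used only to exercise the patterns.
sample_response = '{"users": [{"position": 0, "user": {"pk": "6887304", "username": "tf1"}}]}'
sample_source = '<script>{"logging_page_id":"profilePage_6887304"}</script>'

print(first_user_id(sample_response))
print(page_id_from_source(sample_source))
```

Keep in mind that the first pattern returns whatever 6-10 digit number appears first, so any earlier number of that length in the text would be picked up instead.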

yopyop · September 10, 2019, 7:56am

you're the best!! Thanks a lot!