Is there any way to start scraping from a list of URLs instead of a JSON sitemap?
Alternatively, is there a free tool that converts a list of plain URLs into a JSON sitemap that works with Web Scraper?
Thanks for any help.
world33
Hi there!
It can be done with a macro in an advanced text editor that supports macros, such as Notepad++ or UltraEdit. Please let me know which text editor you're using and I'll try to make one for you.
Hi iconoclast,
Thank you for your reply and kind offer.
I use Notepad++
I appreciate it,
world33
Hello @world33, you first have to locate your shortcuts.xml file. It's in %appdata%\Notepad++ (you can copy and paste this path and it will open the right folder). Then open shortcuts.xml and add this inside the <Macros> section:
<Macro name="prepURLs" Ctrl="no" Alt="no" Shift="no" Key="120">
<Action type="0" message="2024" wParam="-1" lParam="0" sParam="" />
<Action type="3" message="1700" wParam="0" lParam="0" sParam="" />
<Action type="3" message="1601" wParam="0" lParam="0" sParam="http" />
<Action type="3" message="1625" wParam="0" lParam="1" sParam="" />
<Action type="3" message="1602" wParam="0" lParam="0" sParam='"http' />
<Action type="3" message="1702" wParam="0" lParam="1536" sParam="" />
<Action type="3" message="1701" wParam="0" lParam="1609" sParam="" />
<Action type="3" message="1700" wParam="0" lParam="0" sParam="" />
<Action type="3" message="1601" wParam="0" lParam="0" sParam="\r" />
<Action type="3" message="1625" wParam="0" lParam="1" sParam="" />
<Action type="3" message="1602" wParam="0" lParam="0" sParam='",' />
<Action type="3" message="1702" wParam="0" lParam="1536" sParam="" />
<Action type="3" message="1701" wParam="0" lParam="1609" sParam="" />
<Action type="3" message="1700" wParam="0" lParam="0" sParam="" />
<Action type="3" message="1601" wParam="0" lParam="0" sParam="\n" />
<Action type="3" message="1625" wParam="0" lParam="1" sParam="" />
<Action type="3" message="1602" wParam="0" lParam="0" sParam="" />
<Action type="3" message="1702" wParam="0" lParam="1536" sParam="" />
<Action type="3" message="1701" wParam="0" lParam="1609" sParam="" />
<Action type="0" message="2024" wParam="999999" lParam="0" sParam="" />
<Action type="0" message="2326" wParam="0" lParam="0" sParam="" />
</Macro>
P.S. It's assigned to the F9 key.
P.P.S. It will work if each URL you paste begins with 'http'. The links have to be pasted as a column, one per line (I update links on a daily basis by just copying them from Excel).
UPD: revised it, as I use UltraEdit and the macro was a little outdated.
Thanks iconoclast.
So this macro will help me start from a list of URLs like the following:
https://www.linkedin.com/search/results/schools/?keywords=1spbgmu.ru
https://www.linkedin.com/search/results/schools/?keywords=21.edu.ar
https://www.linkedin.com/search/results/schools/?keywords=22may-col.com
https://www.linkedin.com/search/results/schools/?keywords=29mayis.edu.tr
https://www.linkedin.com/search/results/schools/?keywords=3erdene.edu.mn
https://www.linkedin.com/search/results/schools/?keywords=3il-ingenieurs.fr
https://www.linkedin.com/search/results/schools/?keywords=aabfs.org
https://www.linkedin.com/search/results/schools/?keywords=aabu.edu.jo
https://www.linkedin.com/search/results/schools/?keywords=aalto.fi
https://www.linkedin.com/search/results/schools/?keywords=aamu.edu
https://www.linkedin.com/search/results/schools/?keywords=aarch.dk
https://www.linkedin.com/search/results/schools/?keywords=aasa.ac.jp
https://www.linkedin.com/search/results/schools/?keywords=aast.edu
https://www.linkedin.com/search/results/schools/?keywords=aau.ac.ae
https://www.linkedin.com/search/results/schools/?keywords=aau.ac.in
https://www.linkedin.com/search/results/schools/?keywords=aau.at
https://www.linkedin.com/search/results/schools/?keywords=aau.dk
https://www.linkedin.com/search/results/schools/?keywords=aau.edu.et
https://www.linkedin.com/search/results/schools/?keywords=aau.edu.jo
https://www.linkedin.com/search/results/schools/?keywords=aau.edu.sd
https://www.linkedin.com/search/results/schools/?keywords=aau.in
https://www.linkedin.com/search/results/schools/?keywords=aaua.edu.ng
https://www.linkedin.com/search/results/schools/?keywords=aauc.edu.jo
and generate a proper JSON sitemap which will work with Web Scraper.
Is that correct?
Thanks again
This macro will prepare the URLs to be placed within a sitemap, i.e. it will wrap each link in quotation marks and separate the links with commas. You will then be able to place them in your sitemap metadata. One of the easiest ways to do so is to create a sitemap, click Export Sitemap, find "startUrl":[" .. "] and place your URLs inside the brackets, then copy it all and import it.
A macro that builds a whole sitemap requires you to make a dummy sitemap beforehand; I can help you with that too.
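For anyone who prefers a script to an editor macro, the same edit to an exported sitemap can be sketched in Python. This is only an illustration under a few assumptions: the function name set_start_urls and the example.com placeholder are made up here, and the exported sitemap is assumed to be a single JSON object with a startUrl array, as in the export text described above.

```python
import json

def set_start_urls(exported_sitemap: str, urls_text: str) -> str:
    """Return the sitemap JSON with startUrl replaced by the pasted URLs."""
    sitemap = json.loads(exported_sitemap)
    # One URL per line; splitlines() copes with both \n and \r\n endings,
    # and blank lines are skipped.
    urls = [line.strip() for line in urls_text.splitlines() if line.strip()]
    sitemap["startUrl"] = urls
    # Compact separators produce a single-line string for pasting into Import.
    return json.dumps(sitemap, separators=(",", ":"))

# Hypothetical exported sitemap and a pasted column of URLs.
exported = '{"_id":"dummy","startUrl":["https://example.com"],"selectors":[]}'
pasted = """https://www.linkedin.com/search/results/schools/?keywords=aalto.fi
https://www.linkedin.com/search/results/schools/?keywords=aamu.edu
"""
print(set_start_urls(exported, pasted))
```

Because the quoting and commas are produced by json.dumps, there are no stray quotes or line-ending characters to clean up by hand; any existing selectors in the exported sitemap are left untouched.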
Ok cool, thank you for that.
I guess the CONCATENATE text function in Excel might produce the same pattern (quotation marks and then a comma) as the macro.
All I need then is a sitemap template, so I can substitute the URLs in the right place as you suggested.
Do you have a JSON sitemap template to share?
Thanks again!
I'd recommend Notepad++ rather than Excel for that, as Excel generates newline symbols and you will have to remove them afterwards anyway.
{"_id":"dummy","startUrl":[your urls here],"selectors":[]}
Hi Iconoclast,
Please could you explain this process in more depth? I'm struggling to use it as I don't have any experience with macros or Notepad++.
Thank you in advance.
Can you give me a step by step guide on how you did it?
I can’t seem to get it working as I don’t have any experience with coding.
Kind regards,
Hi Mrbeans,
I use a much faster method now. I copy the list of URLs and paste them into this online tool:
https://www.textfixer.com/html/convert-url-to-html-link.php (choose the option "Wrap all the links in a p with a br after each link")
to convert them into an HTML page. Then I upload the page online and use the URL of the uploaded page as the Start URL in Web Scraper.
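If the online tool is unavailable, a similar page of links can be produced locally. The following is a rough sketch, not the tool's actual output format: the function name urls_to_html and the exact markup (each link in a p with a br after it) are assumptions modelled on the option named above.

```python
import html

def urls_to_html(urls_text: str) -> str:
    """Build a minimal HTML page with one clickable link per input line."""
    urls = [line.strip() for line in urls_text.splitlines() if line.strip()]
    links = []
    for url in urls:
        # Escape & and quotes so query strings stay valid inside the attribute.
        safe = html.escape(url, quote=True)
        links.append(f'<p><a href="{safe}">{safe}</a><br></p>')
    body = "\n".join(links)
    return f"<!DOCTYPE html>\n<html>\n<body>\n{body}\n</body>\n</html>\n"

pasted = """https://example.com/page?a=1&b=2
https://example.com/other
"""
page = urls_to_html(pasted)
print(page)  # save this text as an .html file before uploading it
```

The uploaded page then serves as the single Start URL, presumably with a link selector in the sitemap that follows each entry on it.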
This sounds like exactly what I need.
As a start, I am trying to do that with just 2 links:
link 1 https://developers.google.com/speed/pagespeed/insights/?url=https%3A%2F%2Fwww.velcom.by%2Fru%2Fshop%2Fphones%2Fsmartphones%2FSamsung%2FSamsung-SM-G973F-DS%2Fgalaxy-s10-prism-white%2Fp%2F17.1010965&tab=desktop
link 2 https://developers.google.com/speed/pagespeed/insights/?url=https%3A%2F%2Flife.com.by%2Fstore%2Fsmartphones%2Fsamsunggalaxya10-blue&tab=desktop
I put them in the tool you mentioned and generated the HTML page, but then how do I upload the page online?
After that it seems to be quite easy, as I should just add that address as the Start URL of the sitemap.
You can use any free web hosting service such as the ones listed at
Hi there,
I'm having trouble re-importing my sitemap. I've added my list of URLs to the startUrl field, but when I re-import the JSON sitemap it deletes the startUrl field and all of the related URLs. When I go to Edit metadata, there are no URLs listed. Is there a typo in my code?
Thanks for your help!
{"_id”:”rescrape”,”startUrl":["https://www.linkedin.com/recruiter/profile/185737247,CIDy,CAP?searchController=smartSearch&searchId=2864477516&pos=330&total=7582&searchCacheKey=31d7b3f2-a0fc-4424-8279-7795e0b6dcf7%2CkltY&searchRequestId=a89da9c5-c399-4284-8766-2e4661631193%2C0E93&searchSessionId=2864477516&origin=SRFS&memberAuth=185737247%2CCIDy%2CCAP","https://www.linkedin.com/recruiter/profile/89328379,dFsQ,CAP?searchController=smartSearch&searchId=2863884746&pos=171&total=178&searchCacheKey=19b2a5b2-a8db-44e4-975a-d91948e174a0%2C1ePw&searchRequestId=cb1ae28d-6606-4801-ad2a-0e2956aec5bf%2C0atW&searchSessionId=2863884746&origin=SRFS&memberAuth=89328379%2CdFsQ%2CCAP"],"selectors":[{"id":"name","type":"SelectorText","parentSelectors":["_root"],"selector":"h1.searchable","multiple":false,"regex":"","delay":0},{"id":"currenttitle","type":"SelectorText","parentSelectors":["_root"],"selector":"li.title","multiple":false,"regex":"","delay":0},{"id":"currentlocation","type":"SelectorText","parentSelectors":["_root"],"selector":".location a","multiple":false,"regex":"","delay":0},{"id":"currentfield","type":"SelectorText","parentSelectors":["_root"],"selector":".industry a","multiple":false,"regex":"","delay":0}]}
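One possible cause, judging only by the text as it appears in the post above: some of the quotation marks around _id, rescrape and startUrl are typographic (curly) quotes rather than straight ones, which word processors and some web forms substitute automatically. JSON only treats the straight " as a string delimiter, so in Python's parser the text still loads, but everything from _id to startUrl becomes one long key and the startUrl field is effectively gone. The sketch below illustrates this with a shortened, hypothetical copy of the sitemap; the function names are made up for the example, and the list of required keys is taken from the dummy template earlier in the thread.

```python
import json

def straighten_quotes(text: str) -> str:
    """Replace typographic double quotes with plain ASCII ones."""
    return text.replace("\u201c", '"').replace("\u201d", '"')

def missing_keys(sitemap_text: str, required=("_id", "startUrl", "selectors")):
    """Return the required top-level keys that the parsed sitemap lacks."""
    data = json.loads(sitemap_text)
    return [key for key in required if key not in data]

# Shortened stand-in for the pasted sitemap; \u201d is the curly closing quote.
pasted = ('{"_id\u201d:\u201drescrape\u201d,\u201dstartUrl":'
          '["https://example.com/a"],"selectors":[]}')

print(missing_keys(pasted))                     # ['_id', 'startUrl']
print(missing_keys(straighten_quotes(pasted)))  # []
```

Note that straighten_quotes would also change any curly quotes that legitimately appear inside a URL, so it is worth checking the result before importing. Editing the sitemap in a plain-text editor rather than a word processor avoids the substitution in the first place.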
Ok, this really works. So for dummies like me here is what you do.
Thanks guys for scraping this together, helps me a lot!
Ya, the textfixer website works great. I also upload URL lists to https://pastelink.net/, which not only converts plain-text URLs to clickable links but also hosts the list like Pastebin. This site is not related to Pastebin, though.
Once you have uploaded your URLs to pastelink, they'll be converted to a webpage of links, and you'll get a unique URL which you can then use to generate a sitemap.
It is a good macro. Sometimes it leaves some stray quotes at the beginning or at the end, but that is not a major problem.
Could you post a macro that returns the list to its original state, i.e. does the reverse?
Thanks
<Macro name="jsonURLs2list" Ctrl="yes" Alt="no" Shift="yes" Key="120">
<Action type="3" message="1700" wParam="0" lParam="0" sParam="" />
<Action type="3" message="1601" wParam="0" lParam="0" sParam='","' />
<Action type="3" message="1625" wParam="0" lParam="1" sParam="" />
<Action type="3" message="1602" wParam="0" lParam="0" sParam="\r\n" />
<Action type="3" message="1702" wParam="0" lParam="1792" sParam="" />
<Action type="3" message="1701" wParam="0" lParam="1609" sParam="" />
</Macro>
To return to the plain list, press Ctrl+Shift+F9.
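The reverse step can also be done outside the editor by reading startUrl straight from an exported sitemap. This is a minimal sketch with a hypothetical function name; it assumes startUrl is normally an array and, as a precaution, also accepts a single string.

```python
import json

def sitemap_to_url_list(sitemap_text: str) -> str:
    """Return the sitemap's start URLs as plain text, one URL per line."""
    start = json.loads(sitemap_text)["startUrl"]
    # Assumption: some sitemaps may store a single URL as a bare string.
    if isinstance(start, str):
        start = [start]
    return "\n".join(start)

exported = ('{"_id":"dummy","startUrl":["https://example.com/a",'
            '"https://example.com/b"],"selectors":[]}')
print(sitemap_to_url_list(exported))
```

Unlike a find-and-replace on "," this does not leave a quote at the start or end of the list, because the JSON parser removes all the delimiters.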