Scraping from list of urls

Is there any way to start scraping from a list of URLs instead of a JSON sitemap?
Alternatively, is there a free tool that converts a list of plain URLs into a JSON sitemap that works with Web Scraper?

Thanks for any help.

world33

Hi there!

It can be done with a macro in any advanced text editor that supports macros, such as Notepad++ or UltraEdit. Please let me know which text editor you're using and I'll try to make you one :slight_smile:

Hi iconoclast,

Thank you for your reply and kind offer.
I use Notepad++

I appreciate it,

world33

Hello @world33, first you have to locate your shortcuts.xml file. It's in %appdata%\Notepad++ (you can copy and paste this path and it will open the right folder). Then open shortcuts.xml and add this inside the <Macros> section:

        <Macro name="prepURLs" Ctrl="no" Alt="no" Shift="no" Key="120">
            <Action type="0" message="2024" wParam="-1" lParam="0" sParam="" />
            <Action type="3" message="1700" wParam="0" lParam="0" sParam="" />
            <Action type="3" message="1601" wParam="0" lParam="0" sParam="http" />
            <Action type="3" message="1625" wParam="0" lParam="1" sParam="" />
            <Action type="3" message="1602" wParam="0" lParam="0" sParam='&quot;http' />
            <Action type="3" message="1702" wParam="0" lParam="1536" sParam="" />
            <Action type="3" message="1701" wParam="0" lParam="1609" sParam="" />
            <Action type="3" message="1700" wParam="0" lParam="0" sParam="" />
            <Action type="3" message="1601" wParam="0" lParam="0" sParam="\r" />
            <Action type="3" message="1625" wParam="0" lParam="1" sParam="" />
            <Action type="3" message="1602" wParam="0" lParam="0" sParam='&quot;,' />
            <Action type="3" message="1702" wParam="0" lParam="1536" sParam="" />
            <Action type="3" message="1701" wParam="0" lParam="1609" sParam="" />
            <Action type="3" message="1700" wParam="0" lParam="0" sParam="" />
            <Action type="3" message="1601" wParam="0" lParam="0" sParam="\n" />
            <Action type="3" message="1625" wParam="0" lParam="1" sParam="" />
            <Action type="3" message="1602" wParam="0" lParam="0" sParam="" />
            <Action type="3" message="1702" wParam="0" lParam="1536" sParam="" />
            <Action type="3" message="1701" wParam="0" lParam="1609" sParam="" />
            <Action type="0" message="2024" wParam="999999" lParam="0" sParam="" />
            <Action type="0" message="2326" wParam="0" lParam="0" sParam="" />
        </Macro>

P.S. It's assigned to the F9 key.

P.P.S. It will work if each link starts with 'http', and the links have to be pasted as a column, one per line (I update my links daily by copying them straight from Excel).

UPD: Revised it. Since I use UltraEdit, it was a little outdated.

Thanks iconoclast.
So this macro will help me start from a list of urls as follows:

https://www.linkedin.com/search/results/schools/?keywords=1spbgmu.ru
https://www.linkedin.com/search/results/schools/?keywords=21.edu.ar
https://www.linkedin.com/search/results/schools/?keywords=22may-col.com
https://www.linkedin.com/search/results/schools/?keywords=29mayis.edu.tr
https://www.linkedin.com/search/results/schools/?keywords=3erdene.edu.mn
https://www.linkedin.com/search/results/schools/?keywords=3il-ingenieurs.fr
https://www.linkedin.com/search/results/schools/?keywords=aabfs.org
https://www.linkedin.com/search/results/schools/?keywords=aabu.edu.jo
https://www.linkedin.com/search/results/schools/?keywords=aalto.fi
https://www.linkedin.com/search/results/schools/?keywords=aamu.edu
https://www.linkedin.com/search/results/schools/?keywords=aarch.dk
https://www.linkedin.com/search/results/schools/?keywords=aasa.ac.jp
https://www.linkedin.com/search/results/schools/?keywords=aast.edu
https://www.linkedin.com/search/results/schools/?keywords=aau.ac.ae
https://www.linkedin.com/search/results/schools/?keywords=aau.ac.in
https://www.linkedin.com/search/results/schools/?keywords=aau.at
https://www.linkedin.com/search/results/schools/?keywords=aau.dk
https://www.linkedin.com/search/results/schools/?keywords=aau.edu.et
https://www.linkedin.com/search/results/schools/?keywords=aau.edu.jo
https://www.linkedin.com/search/results/schools/?keywords=aau.edu.sd
https://www.linkedin.com/search/results/schools/?keywords=aau.in
https://www.linkedin.com/search/results/schools/?keywords=aaua.edu.ng
https://www.linkedin.com/search/results/schools/?keywords=aauc.edu.jo

and generate a proper JSON sitemap that will work with Web Scraper.
Is that correct?
Thanks again

This macro prepares the URLs to be placed within a sitemap, i.e. it wraps each link in quotation marks and separates the links with commas. You will then be able to place them within your sitemap metadata. One of the easiest ways to do so is to create a sitemap, click Export Sitemap, find "startUrl":["..."], place your URLs inside the square brackets, then copy the whole thing and import it.

A macro that builds the whole sitemap requires you to make a dummy sitemap beforehand. I can help you with that too.
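
For anyone without Notepad++, the preparation step described above (wrap each link in quotation marks, separate the links with commas) can be sketched in a few lines of Python. This is only an illustration and not the macro itself. The function name `urls_to_start_url_items` is made up for this example.

```python
import json


def urls_to_start_url_items(text: str) -> str:
    """Turn a column of URLs (one per line) into a comma-separated list of quoted URLs."""
    # splitlines() copes with both Windows and Unix line endings,
    # and blank lines are skipped.
    urls = [line.strip() for line in text.splitlines() if line.strip()]
    # json.dumps() adds the surrounding double quotes and escapes any
    # character that is not allowed inside a JSON string.
    return ",".join(json.dumps(url) for url in urls)


if __name__ == "__main__":
    pasted = (
        "https://www.linkedin.com/search/results/schools/?keywords=aalto.fi\r\n"
        "https://www.linkedin.com/search/results/schools/?keywords=aamu.edu\r\n"
    )
    print(urls_to_start_url_items(pasted))
```

The printed line is what goes between the square brackets of startUrl in the exported sitemap.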

Ok cool, thank you for that.

I guess the CONCATENATE text function in Excel might achieve the same pattern (quotation marks and then a comma) as the macro.
All I need then is a sitemap template in which to substitute the URLs at the right place, as you suggested.
Do you have a JSON sitemap template to share?

Thanks again!

I would recommend Notepad++ rather than Excel for that, as Excel generates newline characters and you will have to remove them afterwards anyway.

{"_id":"dummy","startUrl":[your urls here],"selectors":[]}
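
If you would rather not edit the template by hand, a small script can fill it in. The sketch below assumes only the three fields shown in the template above (`_id`, `startUrl`, `selectors`). The function name `build_sitemap` is hypothetical.

```python
import json


def build_sitemap(sitemap_id, urls, selectors=None):
    """Return a compact JSON string shaped like the dummy template above."""
    sitemap = {
        "_id": sitemap_id,
        "startUrl": list(urls),
        # An empty selector list mirrors the dummy template; add selectors later.
        "selectors": selectors if selectors is not None else [],
    }
    # Compact separators keep the output on one line without extra spaces.
    return json.dumps(sitemap, separators=(",", ":"))


if __name__ == "__main__":
    urls = [
        "https://www.linkedin.com/search/results/schools/?keywords=aalto.fi",
        "https://www.linkedin.com/search/results/schools/?keywords=aamu.edu",
    ]
    print(build_sitemap("dummy", urls))
```

Because json.dumps does the quoting, stray newlines or missing commas cannot creep in the way they can with manual editing.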

Hi Iconoclast,

Please could you explain this process in more depth? I'm struggling to use it as I don't have any experience with macros or Notepad++.

Thank you in advance.

Can you give me a step-by-step guide on how you did it?

I can't seem to get it working as I don't have any experience with coding.
Kind regards,

Hi Mrbeans,

I use a much faster method now. I copy the list of URLs and paste them into this online tool:
https://www.textfixer.com/html/convert-url-to-html-link.php (choose the option "Wrap all the links in a p with a br after each link")
to convert them into an HTML page. Then I upload the page online and use the URL of the uploaded page as the Start URL in Web Scraper.
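
The conversion itself can also be done locally. The sketch below is not the textfixer tool and does not claim to reproduce its exact markup. It simply writes one clickable link per URL, each followed by a br, inside a single p. The name `urls_to_html` and the output file name `links.html` are assumptions for this example.

```python
from html import escape


def urls_to_html(urls, title="links"):
    """Build a minimal HTML page containing one clickable link per URL."""
    lines = [
        "<!DOCTYPE html>",
        "<html>",
        "<head><meta charset='utf-8'><title>" + escape(title) + "</title></head>",
        "<body>",
        "<p>",
    ]
    for url in urls:
        # escape() turns & into &amp; so query strings stay valid HTML.
        safe = escape(url, quote=True)
        lines.append('<a href="' + safe + '">' + safe + "</a><br>")
    lines += ["</p>", "</body>", "</html>"]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    page = urls_to_html([
        "https://www.linkedin.com/search/results/schools/?keywords=aalto.fi",
        "https://www.linkedin.com/search/results/schools/?keywords=aamu.edu",
    ])
    with open("links.html", "w", encoding="utf-8") as handle:
        handle.write(page)
```

The resulting links.html is the file to upload. Its public address then becomes the Start URL, and a link selector on that page leads to each listed URL.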

This sounds like exactly what I need.
As a start, I am trying to do that with just 2 links:
link 1 https://developers.google.com/speed/pagespeed/insights/?url=https%3A%2F%2Fwww.velcom.by%2Fru%2Fshop%2Fphones%2Fsmartphones%2FSamsung%2FSamsung-SM-G973F-DS%2Fgalaxy-s10-prism-white%2Fp%2F17.1010965&tab=desktop
link 2 https://developers.google.com/speed/pagespeed/insights/?url=https%3A%2F%2Flife.com.by%2Fstore%2Fsmartphones%2Fsamsunggalaxya10-blue&tab=desktop

I put them in the tool you mentioned and generated the HTML page, but how do I then upload the page online?
After that it seems quite easy, as I should just add that address as the start URL of the sitemap.

You can use any free web hosting service, such as the ones listed at

Hi there,

I'm having trouble re-importing my sitemap. I've added my list of URLs to the startUrl field, but when I re-import the JSON sitemap it deletes the startUrl field and all of the related URLs. When I go to Edit metadata, there are no URLs listed. Is there a typo in my code?

Thanks for your help!

{"_id”:”rescrape”,”startUrl":["https://www.linkedin.com/recruiter/profile/185737247,CIDy,CAP?searchController=smartSearch&searchId=2864477516&pos=330&total=7582&searchCacheKey=31d7b3f2-a0fc-4424-8279-7795e0b6dcf7%2CkltY&searchRequestId=a89da9c5-c399-4284-8766-2e4661631193%2C0E93&searchSessionId=2864477516&origin=SRFS&memberAuth=185737247%2CCIDy%2CCAP","https://www.linkedin.com/recruiter/profile/89328379,dFsQ,CAP?searchController=smartSearch&searchId=2863884746&pos=171&total=178&searchCacheKey=19b2a5b2-a8db-44e4-975a-d91948e174a0%2C1ePw&searchRequestId=cb1ae28d-6606-4801-ad2a-0e2956aec5bf%2C0atW&searchSessionId=2863884746&origin=SRFS&memberAuth=89328379%2CdFsQ%2CCAP"],"selectors":[{"id":"name","type":"SelectorText","parentSelectors":["_root"],"selector":"h1.searchable","multiple":false,"regex":"","delay":0},{"id":"currenttitle","type":"SelectorText","parentSelectors":["_root"],"selector":"li.title","multiple":false,"regex":"","delay":0},{"id":"currentlocation","type":"SelectorText","parentSelectors":["_root"],"selector":".location a","multiple":false,"regex":"","delay":0},{"id":"currentfield","type":"SelectorText","parentSelectors":["_root"],"selector":".industry a","multiple":false,"regex":"","delay":0}]}
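
One thing stands out in the JSON as pasted above: the quotes after _id, around rescrape and before startUrl are typographic (”) rather than straight ("). They may have been introduced by the forum's formatting rather than by the original file, so treat this as a possible cause only. With those characters in place, a JSON parser still accepts the text, but it reads everything from _id up to startUrl as one long key, so no startUrl key exists. How Web Scraper reacts to that on import is not confirmed here. The Python sketch below demonstrates the parsing effect on a shortened copy of the sitemap. The helper name `straighten_quotes` is made up, and blindly replacing quotes would also alter any URL that legitimately contains them.

```python
import json

# Typographic quotes mapped to their straight equivalents.
CURLY_TO_STRAIGHT = {
    "\u201c": '"',
    "\u201d": '"',
    "\u2018": "'",
    "\u2019": "'",
}


def find_curly_quotes(text):
    """Return (position, character) for every typographic quote in text."""
    return [(i, ch) for i, ch in enumerate(text) if ch in CURLY_TO_STRAIGHT]


def straighten_quotes(text):
    """Replace typographic quotes with straight ones."""
    return text.translate(str.maketrans(CURLY_TO_STRAIGHT))


if __name__ == "__main__":
    # Shortened copy of the pasted sitemap, with the same four curly quotes.
    pasted = '{"_id\u201d:\u201drescrape\u201d,\u201dstartUrl":[],"selectors":[]}'

    broken = json.loads(pasted)           # parses without an error
    print(list(broken))                   # one merged key, then "selectors"
    print(len(find_curly_quotes(pasted))) # number of typographic quotes found

    fixed = json.loads(straighten_quotes(pasted))
    print(list(fixed))                    # "_id", "startUrl", "selectors"
```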

OK, this really works. So for dummies like me, here is what you do.

  1. Get a list of links in any way you need: scrape them from somewhere, take them from a price list, and so on. In my case, for example, I used http://site.com/?search=xxxxx and used Excel to concatenate the two strings together, where xxxx1 - xxxx10000 was the variable part
  2. Use the URL-to-HTML converter provided here: https://www.textfixer.com/html/convert-url-to-html-link.php
  3. Copy the generated links, paste them into any kind of notepad and save the file as links.html
  4. Upload it to your own or any other web server (links were provided above)
  5. Create a Web Scraper sitemap with the address of this uploaded page as the start URL, then go deeper and do a regular scrape
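
Step 1 above (gluing a fixed address to a changing search term) does not need Excel. A minimal Python sketch, assuming the same placeholder address http://site.com/?search= used in the list and a hypothetical function name `build_search_urls`:

```python
from urllib.parse import quote


def build_search_urls(base, terms):
    """Append each search term to the base address, percent-encoding it."""
    # safe="" also encodes characters such as "/" and "&" inside a term.
    return [base + quote(str(term), safe="") for term in terms]


if __name__ == "__main__":
    # The numbers 1 to 10000 stand in for the variable part of the search.
    urls = build_search_urls("http://site.com/?search=", range(1, 10001))
    with open("urls.txt", "w", encoding="utf-8") as handle:
        handle.write("\n".join(urls) + "\n")
    print(len(urls), "URLs written to urls.txt")
```

The resulting urls.txt holds one link per line, ready for step 2.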

Thanks guys for scraping this together, helps me a lot!

Ya, the textfixer website works great. I also upload URL lists to https://pastelink.net/, which not only converts plain-text URLs to clickable links but also hosts the list, like Pastebin. The site is not related to Pastebin, though.

Once you have uploaded your URLs to pastelink, they'll be converted to a webpage of links, and you'll get a unique URL which you can then use to generate a sitemap.

paste url with https://www.google.com/url?q=urlpasted...
:see_no_evil:

It is a good macro. Sometimes it leaves some quotes at the beginning or at the end, but that is not a major problem.

Could you post a macro that returns the list to its original state, i.e. does the reverse?

Thanks

        <Macro name="jsonURLs2list" Ctrl="yes" Alt="no" Shift="yes" Key="120">
            <Action type="3" message="1700" wParam="0" lParam="0" sParam="" />
            <Action type="3" message="1601" wParam="0" lParam="0" sParam='&quot;,&quot;' />
            <Action type="3" message="1625" wParam="0" lParam="1" sParam="" />
            <Action type="3" message="1602" wParam="0" lParam="0" sParam="\r\n" />
            <Action type="3" message="1702" wParam="0" lParam="1792" sParam="" />
            <Action type="3" message="1701" wParam="0" lParam="1609" sParam="" />
        </Macro>

To return to the plain list, press Ctrl+Shift+F9.
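
The reverse direction can also be sketched in Python for those not using Notepad++. This is an illustration rather than a port of the macro. It assumes the input is the quoted, comma-separated URL list, with or without the surrounding square brackets, and the function name `json_items_to_lines` is made up.

```python
import json


def json_items_to_lines(text):
    """Turn a quoted, comma-separated URL list back into one URL per line."""
    text = text.strip()
    # Accept both the bare item list and a full JSON array.
    if not text.startswith("["):
        text = "[" + text + "]"
    # json.loads() removes the quotes, including the first and last ones.
    return "\n".join(json.loads(text))


if __name__ == "__main__":
    items = (
        '"https://www.linkedin.com/search/results/schools/?keywords=aalto.fi",'
        '"https://www.linkedin.com/search/results/schools/?keywords=aamu.edu"'
    )
    print(json_items_to_lines(items))
```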