Scrape URLs whose pages are named with letters

A start URL with a number range, such as http://example.com/page/[1-3], scrapes fine.

My question is how to scrape pages that are named with letters instead. For example:

If I use http://example.com/page/[a-c], the program reports an error.

Thank you!

Hi!

To keep the pages arranged the way you want, you need to use CouchDB.
Information about the CouchDB server can be found here:
http://webscraper.io/documentation#storage-backends

Please keep in mind that a sitemap with multiple URLs works bottom-up.

P.S. If you meant how to add multiple URLs: open your sitemap's Metadata, then add URLs by pressing the [ + ] button on the right side.

Thank you for the reply. Can you give me a simple example, or some URLs where I can learn tips on using Web Scraper + CouchDB?

You can download the CouchDB installer directly from here: https://dl.bintray.com/apache/couchdb/win/2.1.2/couchdb-2.1.2.msi

Then install it. I would recommend installing it (if possible) on your second drive, in the root of the drive (like D:\CouchDB).

Next, right-click the Web Scraper icon in your browser, click Options, then select CouchDB from the list.
Then enter these two addresses in the matching fields:
(sitemap) http://127.0.0.1:5984/scraper-sitemaps
(data) http://127.0.0.1:5984/

And there you go.

You can access your CouchDB server instance using this URL: http://127.0.0.1:5984/_utils/
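
In case it helps, here is a minimal Python sketch for double-checking the setup outside the browser. It only uses the addresses from this post. The function names storage_urls and server_info are made up for this example, and I am assuming the server answers a plain GET on its root address with a small JSON document, which is how CouchDB normally behaves.

```python
import json
import urllib.request

# Address of the local CouchDB server, as given in the post above.
COUCHDB_BASE = "http://127.0.0.1:5984/"


def storage_urls(base=COUCHDB_BASE):
    """Return the addresses mentioned in the post (hypothetical helper)."""
    base = base.rstrip("/") + "/"
    return {
        "sitemap": base + "scraper-sitemaps",  # value for the sitemap field
        "data": base,                          # value for the data field
        "admin_ui": base + "_utils/",          # CouchDB web interface
    }


def server_info(base=COUCHDB_BASE, timeout=5):
    """Fetch the server's root document (hypothetical helper).

    Raises urllib.error.URLError if nothing is listening on the address.
    """
    with urllib.request.urlopen(base, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling server_info() while CouchDB is running should return a small dictionary describing the server; if it raises an error instead, the service is probably not started or is listening on a different port.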

I tried to create the following sitemap and viewed it at http://127.0.0.1:5984/_utils/#database/scraper-sitemaps/example

{
  "_id": "example",
  "_rev": "2-32bd47d5bbddc2eb23bc9e3ec7014772",
  "startUrl": [
    "http://example.com/page/a"
  ],
  "selectors": [
    {
      "id": "example",
      "type": "SelectorText",
      "selector": "h1",
      "parentSelectors": [
        "_root"
      ],
      "multiple": false,
      "regex": "",
      "delay": 0
    }
  ]
}

Can you teach me how to scrape a set of URLs whose last letter runs from a to z?

http://example.com/page/a
http://example.com/page/b
http://example.com/page/c
...
http://example.com/page/z

Thank you very much!

Hi!

Please add the URLs within Web Scraper itself, using Menu -> Your Sitemap Name -> (dropdown) -> Metadata.

Please add the URLs bottom-up (starting from Z and ending with A). You can add a new URL by pressing the [ + ] button on the right side of the URL list.
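
If clicking [ + ] 26 times is too tedious, the same list can be generated. The following is only a sketch: it rebuilds the sitemap posted earlier in this thread with all 26 start URLs in Z-to-A order and prints it as JSON. The helper name build_sitemap is made up for this example, and I am assuming the printed JSON can then be brought into Web Scraper through its sitemap import.

```python
import json
import string

# Base address taken from the question; the letter is appended to it.
BASE_URL = "http://example.com/page/"


def build_sitemap(sitemap_id="example", letters=string.ascii_lowercase, reverse=True):
    """Build a sitemap dict with one start URL per letter (hypothetical helper)."""
    ordered = sorted(letters, reverse=reverse)  # reverse=True gives z, y, ..., a
    return {
        "_id": sitemap_id,
        # "_rev" is left out on purpose: CouchDB assigns it when the document is stored.
        "startUrl": [BASE_URL + letter for letter in ordered],
        "selectors": [
            {
                "id": "example",
                "type": "SelectorText",
                "selector": "h1",
                "parentSelectors": ["_root"],
                "multiple": False,
                "regex": "",
                "delay": 0,
            }
        ],
    }


sitemap = build_sitemap()
print(json.dumps(sitemap, indent=2))
```

Pass reverse=False if the A-to-Z order turns out to be the one you need.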

In that case I would have to add 26 URLs manually. However, my original request is to scrape pages named with a 3-character string made of digits (0-9) and letters (a-z). Can you suggest a method to add the URLs automatically?

I add URLs automatically using a macro in UltraEdit (paid, more functionality) or Notepad++ (free, 'nuff functionality).

The range [#-#] method only works for numbers, though.
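
For those without a macro-capable editor, a short script can produce the same list. This is just a sketch under the assumptions from the question: every page name is exactly 3 characters long and uses only 0-9 and a-z. The names generate_urls, BASE_URL and ALPHABET are made up for this example.

```python
import itertools
import string

# Base address taken from the question; the 3-character name is appended to it.
BASE_URL = "http://example.com/page/"
# Allowed characters: digits first, then lowercase letters.
ALPHABET = string.digits + string.ascii_lowercase


def generate_urls(length=3, base_url=BASE_URL, alphabet=ALPHABET):
    """Yield one URL for every combination of `length` characters."""
    for combo in itertools.product(alphabet, repeat=length):
        yield base_url + "".join(combo)


urls = list(generate_urls())
print(len(urls), "URLs generated")
for url in urls[:3]:
    print(url)
```

Keep in mind that 36 characters in 3 positions give 36 x 36 x 36 = 46,656 URLs, so it may be more practical to split the list over several sitemaps than to paste everything into one.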

Hi, there is another way: just use URL encoding (percent-encoding) to turn your letters into numbers.

For instance, http://example.com/page/%61 is the same as http://example.com/page/a

%61 = a
%62 = b
%63 = c and so on (refer to chart)

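Then you can use the Web Scraper number ranges again. However, these are hexadecimal numbers, so a number range only covers the letters whose codes contain no hex letter: a - i (%61 - %69) and p - y (%70 - %79). The letters j - o (%6A - %6F) and z (%7A) still have to be added by hand.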
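
To see which letters a number range can reach, here is a small sketch that computes the percent-encoded form of each lowercase letter and separates the codes made only of decimal digits from the ones containing a hex letter. The function names percent_encode and decimal_only are made up for this example; whether Web Scraper accepts a range placed right after the % sign is an assumption based on the suggestion above, not something this script checks.

```python
import string
from urllib.parse import unquote


def percent_encode(letter):
    """Return the percent-encoded form of a single ASCII character."""
    return "%{:02X}".format(ord(letter))


def decimal_only(letter):
    """True if the hex code of the character uses only the digits 0-9."""
    return format(ord(letter), "02X").isdigit()


# Letters whose codes a decimal range can produce, and the ones it cannot.
encodable = [c for c in string.ascii_lowercase if decimal_only(c)]
manual = [c for c in string.ascii_lowercase if not decimal_only(c)]

for letter in string.ascii_lowercase:
    note = "range-friendly" if decimal_only(letter) else "add by hand"
    print(letter, percent_encode(letter), note)

# Decoding confirms that the encoded and the plain URL are the same address.
print(unquote("http://example.com/page/%61"))
```

The letters that land in the "add by hand" group are the ones whose code contains A to F, so they would need to be entered as separate start URLs.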

Fun test - where do you think https://forum.webscraper.%69%6F points to?
