How can I enter the search result for this particular website?

toiletchan · March 22, 2024, 9:23am

Describe the problem.
I want to import multiple search URLs from this site as the starting point for my scrape.
Each search usually returns 2-3 results. I want to open the page of the first result and then extract further information from it (e.g. book title, ratings, price).

Steps I want to do:

Go to search page
Create Element for first result
Create Link within the Element
Go to a specific book page to further extract information

Question 1: I managed to do this for Goodreads and other websites, but not on this one. I can't create an "element" selector that is not specific to one book's name. Please help.

Question 2: How can I scrape the link of the page I am scraping? I managed to scrape the "title", but not the "link". Thanks!

Url: 博客來 (books.com.tw) - Your current search keyword: 賺錢公司也會倒閉!讀財報最常犯的40個誤解

Sitemap:
{"_id":"Books","startUrl":["Error a","multiple":false,"linkType":"linkFromHref"},{"id":"Book-name","parentSelectors":["Book-link"],"type":"SelectorText","selector":"h1","multiple":false,"regex":""},{"id":"Foreign-name","parentSelectors":["Book-link"],"type":"SelectorText","selector":".mod h2 a","multiple":false,"regex":""},{"id":"Ratings","parentSelectors":["Book-link"],"type":"SelectorText","selector":"div.average","multiple":false,"regex":""},{"id":"Ratings-count","parentSelectors":["Book-link"],"type":"SelectorText","selector":"div.sum:nth-of-type(3)","multiple":false,"regex":""}]}

JanAp · March 22, 2024, 12:45pm

Hi,

Here is a reference sitemap on how to iterate through the listings:

{"_id":"books","startUrl":["https://search.books.com.tw/search/query/key/%E8%B3%BA%E9%8C%A2%E5%85%AC%E5%8F%B8%E4%B9%9F%E6%9C%83%E5%80%92%E9%96%89%EF%BC%81%E8%AE%80%E8%B2%A1%E5%A0%B1%E6%9C%80%E5%B8%B8%E7%8A%AF%E7%9A%8440%E5%80%8B%E8%AA%A4%E8%A7%A3"],"selectors":[{"id":"wrapper","multiple":true,"parentSelectors":["_root"],"selector":".table-td:has(h4)","type":"SelectorElement"},{"id":"link","linkType":"linkFromHref","multiple":false,"parentSelectors":["wrapper"],"selector":"h4 a","type":"SelectorLink"},{"id":"name","multiple":false,"parentSelectors":["link"],"regex":"","selector":"h1","type":"SelectorText"},{"extractAttribute":"href","id":"listing-link","multiple":false,"parentSelectors":["link"],"selector":"[rel=\"canonical\"]","type":"SelectorElementAttribute"}]}
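
For several search queries, the start URLs can be generated rather than typed by hand. The following is only a rough sketch, assuming every search URL follows the same pattern as the start URL in the sitemap above and that the startUrl array may hold more than one URL. The helper names (build_search_url, with_start_urls) and the second keyword are made up for illustration, and the selectors list is trimmed to the first two entries.

```python
import json
from urllib.parse import quote

# Assumption: every search URL uses the same prefix as the reference start URL.
SEARCH_BASE = "https://search.books.com.tw/search/query/key/"


def build_search_url(keyword: str) -> str:
    """Percent-encode the keyword (UTF-8) and append it to the search prefix."""
    # safe="" also encodes "/", so a keyword cannot add extra path segments.
    return SEARCH_BASE + quote(keyword, safe="")


def with_start_urls(sitemap: dict, keywords: list[str]) -> dict:
    """Return a copy of the sitemap with one start URL per keyword."""
    updated = dict(sitemap)
    updated["startUrl"] = [build_search_url(k) for k in keywords]
    return updated


# Trimmed copy of the reference sitemap (only the first two selectors).
reference = {
    "_id": "books",
    "startUrl": [],
    "selectors": [
        {"id": "wrapper", "multiple": True, "parentSelectors": ["_root"],
         "selector": ".table-td:has(h4)", "type": "SelectorElement"},
        {"id": "link", "linkType": "linkFromHref", "multiple": False,
         "parentSelectors": ["wrapper"], "selector": "h4 a",
         "type": "SelectorLink"},
    ],
}

# The second keyword is a hypothetical placeholder.
keywords = ["賺錢公司也會倒閉！讀財報最常犯的40個誤解", "Example Book Title"]
sitemap = with_start_urls(reference, keywords)

# Print the JSON so it can be pasted into the sitemap import dialog.
print(json.dumps(sitemap, ensure_ascii=False))
```

The printed JSON has the same shape as the reference sitemap, only with a longer startUrl list.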

I hope this helps!

toiletchan · March 23, 2024, 2:16am

Thanks a lot! It works!

One follow-up question: for the listing link, how can I select "[rel="canonical"]" using the mouse cursor? Or is it something I have to remember, and does it work on every site?

JanAp · March 24, 2024, 10:15am

Hi,

No, you find this selector by inspecting the page's HTML rather than by pointing and clicking. The same selector should work on most websites, as long as the element is present in the HTML.
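
To check whether a page exposes a canonical link at all, a small script can look for it in the raw HTML. This is a minimal sketch using only the Python standard library. The class and function names are hypothetical, and the sample HTML is invented for illustration, not taken from the site discussed here.

```python
from html.parser import HTMLParser
from typing import Optional


class CanonicalLinkParser(HTMLParser):
    """Remembers the href of the first <link> whose rel contains "canonical"."""

    def __init__(self) -> None:
        super().__init__()
        self.canonical: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # Only <link> tags matter, and only the first match is kept.
        if tag != "link" or self.canonical is not None:
            return
        attr_map = dict(attrs)
        # rel may hold several space-separated tokens; this token match is
        # slightly broader than the exact-match CSS selector [rel="canonical"].
        rel_tokens = (attr_map.get("rel") or "").lower().split()
        if "canonical" in rel_tokens:
            self.canonical = attr_map.get("href")


def find_canonical(html: str) -> Optional[str]:
    """Return the canonical URL declared in the HTML, or None if absent."""
    parser = CanonicalLinkParser()
    parser.feed(html)
    return parser.canonical


# Invented sample page; a real page would be fetched or saved from the browser.
sample = """<html><head>
<title>Example book</title>
<link rel="stylesheet" href="/style.css">
<link rel="canonical" href="https://example.com/products/123">
</head><body><h1>Example book</h1></body></html>"""

print(find_canonical(sample))
```

If the function returns None for a saved copy of a page, the [rel="canonical"] selector will most likely find nothing there either, and the link has to be taken from another element.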