How to scrape an unstructured menu: different categories have different depths?

Suppose a menu has the following structure:

  1. Category A
    a. subcategory x
    b. subcategory y
  2. Category B
    a. subcategory z
  3. Category C

Note that Category C has no subcategory, so if you set up a Category Link --> Subcategory Link scraping structure, the scraper will not find a subcategory link to follow for Category C. What happens to the items directly under C, which have no subcategory? Are they simply not scraped? How can I build a flexible solution that correctly navigates this kind of menu, in which different categories have different depths? The emphasis is on flexible: the menu's contents may change over time, and I want a solution that won't break when the menu is updated.

Also very interested in solving this task!

I'm also (very) interested in getting products from categories with different depths.

This is quite a common issue for ecommerce sites. The trick is to create a wrapper (container) for all the category results you want, then make that wrapper a child of all the categories you will navigate to, i.e. child of Category L1, Category L2, etc. You would then build your scrapers under this category results wrapper. As the same wrapper will handle all the categories, this will provide the needed flexibility to handle categories with different depths.
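To make the idea concrete outside of the extension, here is a minimal Python sketch of the same pattern. Everything in it is hypothetical: `SITE` is an in-memory stand-in for the menu from the question, and `scrape_results` plays the role of the shared results wrapper. It is not how Web Scraper works internally, and it recurses to any depth, whereas a sitemap lists each level (L1, L2, ...) explicitly.

```python
# Hypothetical in-memory stand-in for the site: every page has a list of
# items plus optional child category pages. The names mirror the question.
SITE = {
    "Category A": {
        "items": [],
        "children": {
            "subcategory x": {"items": ["x1", "x2"], "children": {}},
            "subcategory y": {"items": ["y1"], "children": {}},
        },
    },
    "Category B": {
        "items": [],
        "children": {
            "subcategory z": {"items": ["z1"], "children": {}},
        },
    },
    # Category C has no subcategories; its items sit on the category page.
    "Category C": {"items": ["c1", "c2"], "children": {}},
}


def scrape_results(path, page):
    """Shared 'results wrapper': the same extraction runs on every page."""
    for item in page["items"]:
        yield (" > ".join(path), item)


def crawl(pages, path=()):
    """Visit every category page at every depth and apply the wrapper."""
    for name, page in pages.items():
        here = path + (name,)
        # Scrape this page whether or not it has subcategories...
        yield from scrape_results(here, page)
        # ...then follow any subcategory links it happens to have.
        yield from crawl(page["children"], here)


rows = list(crawl(SITE))
for category, item in rows:
    print(category, "|", item)
```

Because the wrapper runs on every page that is visited, the items under Category C are picked up from the category page itself, and adding or removing subcategories later does not require changing the extraction code.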

This concept can be a bit hard to understand, so you can try importing the example sitemap below, which scrapes the webscraper test site. You can open Selector graph to see how it is structured:

{"_id":"category-scrape-demo","startUrl":[""],"selectors":[{"id":"Click L1 categories","type":"SelectorLink","parentSelectors":["_root"],"selector":"a.category-link","multiple":true },{"id":"Category results wrapper","type":"SelectorElement","parentSelectors":["Click L1 categories","Click L2 categories"],"selector":"div.row > div.col-md-9","multiple":false },{"id":"Click L2 categories","type":"SelectorLink","parentSelectors":["Click L1 categories"],"selector":"a.subcategory-link","multiple":true },{"id":"Category title","type":"SelectorText","parentSelectors":["Category results wrapper"],"selector":"h1","multiple":false,"regex":"" },{"id":"Item wrappers","type":"SelectorElement","parentSelectors":["Category results wrapper"],"selector":"div.thumbnail","multiple":true },{"id":"Product name","type":"SelectorText","parentSelectors":["Item wrappers"],"selector":"h4 > a","multiple":false,"regex":"" },{"id":"Desc","type":"SelectorText","parentSelectors":["Item wrappers"],"selector":"p.description","multiple":false,"regex":"" },{"id":"Price","type":"SelectorText","parentSelectors":["Item wrappers"],"selector":"h4.pull-right.price","multiple":false,"regex":"" },{"id":"Link","type":"SelectorLink","parentSelectors":["Item wrappers"],"selector":"h4 > a","multiple":false }]}
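If you would rather inspect the structure as text than in the Selector graph, the short Python sketch below prints the parent/child tree. It uses a trimmed copy of the sitemap above that keeps only the `id` and `parentSelectors` fields; the helper names `children_by_parent` and `print_tree` are made up for this illustration and are not part of Web Scraper.

```python
import json
from collections import defaultdict

# Trimmed copy of the sitemap above: only "id" and "parentSelectors" are kept.
SITEMAP_JSON = """
{"selectors": [
  {"id": "Click L1 categories", "parentSelectors": ["_root"]},
  {"id": "Category results wrapper",
   "parentSelectors": ["Click L1 categories", "Click L2 categories"]},
  {"id": "Click L2 categories", "parentSelectors": ["Click L1 categories"]},
  {"id": "Category title", "parentSelectors": ["Category results wrapper"]},
  {"id": "Item wrappers", "parentSelectors": ["Category results wrapper"]},
  {"id": "Product name", "parentSelectors": ["Item wrappers"]},
  {"id": "Desc", "parentSelectors": ["Item wrappers"]},
  {"id": "Price", "parentSelectors": ["Item wrappers"]},
  {"id": "Link", "parentSelectors": ["Item wrappers"]}
]}
"""


def children_by_parent(sitemap):
    """Map each parent selector id to the ids of its child selectors."""
    graph = defaultdict(list)
    for selector in sitemap["selectors"]:
        for parent in selector["parentSelectors"]:
            graph[parent].append(selector["id"])
    return dict(graph)


def print_tree(graph, node="_root", depth=0):
    """Print the selector tree, indenting each level by two spaces."""
    print("  " * depth + node)
    for child in graph.get(node, []):
        print_tree(graph, child, depth + 1)


graph = children_by_parent(json.loads(SITEMAP_JSON))
print_tree(graph)
```

In the printed tree, "Category results wrapper" appears twice: once directly under "Click L1 categories" and once under "Click L2 categories". That repetition is the point of the design, since the same scrapers run on a category page whether or not it has subcategories.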