I am scraping a site where part of the data I want sits behind several interactions. The elements involved have unique selectors on each page, and I can't get them selected reliably. Matching them with a regex wouldn't work either: the interactions select filters from a sidebar, and other sidebar elements with the same attribute/id/etc. structure get rotated in at random.
However, selecting the options I want produces a consistent URL structure.
What I would like to be able to do is reference that URL structure using a variable of the parent item, e.g.:
_root -> category_links -> transformed_links -> final_data
On the root page, it scrapes a list of categories (this currently works).
Next, for each category link, I want to append URL query parameters such as ?a=b&c=d
That takes me to the final category page I want to scrape data from (those selectors work).
I do not need to go any further than that (e.g. pagination). I just want to manually add query parameters to a URL that has already been scraped, so the scrape can continue from the resulting page.
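To make the transformation concrete, here is a minimal Python sketch of the "category link -> transformed link" step, run outside webscraper. The function name add_query_params and the example.com URL are hypothetical; a=b&c=d are the placeholder parameters from above.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def add_query_params(url, extra_params):
    """Return url with extra_params appended to its query string.

    Any parameters already present on the URL are kept, in order.
    """
    parts = urlsplit(url)
    # Keep existing parameters (including blank ones) as (key, value) pairs.
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.extend(extra_params.items())
    return urlunsplit(parts._replace(query=urlencode(query)))


# Hypothetical category link, as it might come out of the first scrape.
category_link = "https://example.com/category/shoes"
transformed_link = add_query_params(category_link, {"a": "b", "c": "d"})
print(transformed_link)
```

This only shows what I want each link turned into; it does not solve doing it inside a single webscraper sitemap.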
This part explains how I currently do it, which is a total mess. Hopefully you can save me from this madness:
I run webscraper to scrape the list of categories from the starting page and export the data
I take the column of results and put it into a text editor
I find/replace using regex to append the query parameters to the end of each URL
I generate a sitemap from the list of URLs using a free online tool
I do another regex find/replace to clean up some issues that tool has with generating sitemaps
I upload the sitemap to my site (which is not the site I am scraping) so that I can put a link into webscraper
I run webscraper again to scrape the data for each sitemap entry
I create a "helper" column with numbers incrementing from 1 to number all the rows
I sort all the rows from lowest to highest with that helper number because webscraper outputs from the sitemap in inverted order and I need to maintain the original list ordering.
I merge the data back into the original scrape list of categories
I repeat this for each sub-category set.
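For reference, the sitemap and re-sorting steps above could probably be scripted rather than done by hand. The following is only a rough sketch under assumptions: the function names are hypothetical, and the "url" column name is a placeholder for whichever column in the export holds the page URL.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Build a sitemap XML string with one <url><loc> entry per URL, in order."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        loc = ET.SubElement(ET.SubElement(urlset, "url"), "loc")
        # ElementTree escapes "&" as "&amp;" when serializing, so query
        # strings such as ?a=b&c=d stay valid inside the XML.
        loc.text = url
    return ET.tostring(urlset, encoding="unicode")


def restore_order(rows, ordered_urls, url_key="url"):
    """Sort scraped rows back into the order of the original URL list.

    url_key is an assumed column name; adjust it to match the real export.
    """
    position = {url: index for index, url in enumerate(ordered_urls)}
    return sorted(rows, key=lambda row: position[row[url_key]])


# Hypothetical transformed links, in the original category order.
links = [
    "https://example.com/category/shoes?a=b&c=d",
    "https://example.com/category/hats?a=b&c=d",
]
sitemap_xml = build_sitemap(links)

# Rows as they might come back in inverted order from the second scrape.
scraped = [{"url": links[1], "value": "hats"}, {"url": links[0], "value": "shoes"}]
ordered = restore_order(scraped, links)
```

Even with a script like this, I would still have to host the sitemap and run two separate scrapes, which is the part I am hoping to avoid.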
I really hope all of this can be avoided by having a %variable% I can insert into a step.