Seemingly Simple Scrape No Longer Possible

Hi everyone, I'm running into an unusual issue with what "should" be a simple text scrape. I want an organized listing of abstract titles from the website indicated below. In the past, I've been able to do this with relative ease, since the site doesn't use unusual pagination or lazy loading, as far as I can tell. This year, however, I get no output and no notification that the scrape has finished, no matter what page load and request intervals I specify (I've tried everything between 2,000 and 100,000 ms) in the test sitemap I've built out below.

A few details about the issue:

  1. If I preview data on the test page, things look fine. It IS just text, after all, and I've rendered the site. However, I get no output when scraping just this page. When the scraper is running, I do see the rendered text on the page, and my delays are long enough to sit there, watch, and verify that everything is loading correctly in the popup window. Other reports of the "data preview works, final output doesn't" problem seem to come down to lazy loading, and adjusting delays resolves it for those users. Unfortunately, I could not find an issue that mirrors mine closely enough to help me this time.

  2. After the specified delay, the popup window closes with no indication that the scrape finished. I don't see any errors either, but something clearly seems to be going wrong.

  3. Web Scraper IS working for me on similar tasks on other websites, so the problem doesn't appear to be inherent to Web Scraper itself.

Url: https://meetinglibrary.asco.org/session/13619

Sitemap:
{"_id":"test","startUrl":["https://meetinglibrary.asco.org/session/13619"],"selectors":[{"id":"Abstract Title","type":"SelectorText","parentSelectors":["_root"],"selector":"div.session-presentation:nth-of-type(n+3) .record__title span","multiple":true,"regex":"","delay":0},{"id":"Author","type":"SelectorText","parentSelectors":["_root"],"selector":"div:nth-of-type(n+4) .record__meta div.record__ellipsis","multiple":true,"regex":"","delay":0},{"id":"Abstract Number","type":"SelectorText","parentSelectors":["_root"],"selector":"div:nth-of-type(n+5) .record__meta > span","multiple":true,"regex":"","delay":0}]}

Thank you, everyone! I know this type of issue has been flagged previously, but the earlier reports always seem to involve other, more complicated scrapes. I haven't run into a site where I can't pull ANY text at all. I can't even scrape elements that are more fixed on the website, like the banner.

Any assistance would be greatly appreciated! Thank you very much!

Hi Leemeng, thank you for taking a look at this! If you have any insights at all (including perhaps something dumb I've missed), I would be very grateful. I have beaten my head against the wall on this, but to no avail so far.

According to the network log, the URL you are using produces a 404 error, so I think WS is halting when it sees that. There also seems to be some redirecting on the server side, so WS may have problems with that as well.

[screenshot: network log showing the 404 response]
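
If you want to confirm this outside of WS, here is a minimal sketch (Python with the requests library, which is an assumption on my part; WS itself doesn't use this) that reports the status code and any redirect hops for the start URL:

```python
import requests

URL = "https://meetinglibrary.asco.org/session/13619"

# Follow redirects and report each intermediate hop plus the final status.
resp = requests.get(URL, allow_redirects=True, timeout=30)

for hop in resp.history:
    print(f"Redirect: {hop.status_code} {hop.url} -> {hop.headers.get('Location')}")

print(f"Final: {resp.status_code} {resp.url}")
```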

Interesting. Thank you for the insight. That would explain why nothing is working at all. It is strange that the window still pops up and renders the page during the run.

That's a helpful lead! I'll see what I can do to get around this issue. Thank you again!

Hi everyone, just returning to close the loop. Lee's insight proved fruitful: knowing that I was hitting a redirect/404 issue at least gave me a clue about what to try. In this case, switching browsers allowed me to scrape one page at a time, with data output and everything, without incurring a 404 error.

Unfortunately, the site I'm trying to scrape was still behaving strangely when the scraper clicked links, so since I had a finite number of pages to pull text from (n = 225), I just used a different scraper (Data Miner in Chrome) to collect all of those links, used Excel to format the URLs as "https://website.com", and then imported a sitemap with all 225 pages as start URLs. It took about an hour of work, and now things are scraping as needed.
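
In case it helps anyone else, here is a rough sketch of how that multi-start-URL sitemap could be generated without the Excel step (Python; the filenames urls.txt and sitemap_225.json and the sitemap id are placeholders, and only the title selector is shown):

```python
import json

# Read the session URLs collected earlier, one per line (filename is a placeholder).
with open("urls.txt", encoding="utf-8") as f:
    urls = [line.strip() for line in f if line.strip()]

# Reuse the selectors from the original sitemap and list every page as a start URL.
sitemap = {
    "_id": "asco_sessions",
    "startUrl": urls,
    "selectors": [
        {"id": "Abstract Title", "type": "SelectorText", "parentSelectors": ["_root"],
         "selector": "div.session-presentation:nth-of-type(n+3) .record__title span",
         "multiple": True, "regex": "", "delay": 0},
        # ...add the Author and Abstract Number selectors here as well...
    ],
}

# Write the sitemap so it can be imported into Web Scraper.
with open("sitemap_225.json", "w", encoding="utf-8") as f:
    json.dump(sitemap, f, ensure_ascii=False)
```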

I don't think very many people are running into the same challenge that I am, but I wanted to make sure to report how I was able to work around the problem. Thanks again, Lee!