Unknown Content Type Loaded - Help!

educationresearch · April 16, 2019, 11:39am

Am I doing something wrong, or is the site blocking me in some way? I am trying to scrape text from the records of UK legislation on education. I have checked every step with data preview, and everything comes up fine, but then when I start scraping it opens the root page and then immediately finishes scraping, with no data scraped.

When I was making the sitemap I had to enter the CSS selectors manually, as for some reason the selector popup (the one with "Done selecting" on it) would not appear. Is that relevant to the problem? Thanks in advance for any help.

Url: http://www.legislation.gov.uk/uksi/education

Sitemap:
{"_id":"education1","startUrl":["http://www.legislation.gov.uk/uksi/education"],"selectors":[{"id":"Link","type":"SelectorLink","parentSelectors":["_root"],"selector":"#content > table > tbody > tr:nth-child(n+1) > td:nth-child(1) > a","multiple":true,"delay":0},{"id":"link2","type":"SelectorLink","parentSelectors":["Link"],"selector":"#viewLegSnippet > div > ol > li:nth-child(n+1) > li > p > span > a, #viewLegSnippet > div > ol > li:nth-child(n+1) > p > span.LegDS.LegContentsTitle > a","multiple":true,"delay":0},{"id":"AllText","type":"SelectorText","parentSelectors":["link2"],"selector":"#viewLegSnippet","multiple":false,"regex":"","delay":0}]}
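A note for readers following along: a one-line sitemap like the one above is easier to check once its selector hierarchy is printed as a tree. The following is only a minimal sketch in Python, relying on nothing beyond the fields visible in the sitemap (`id`, `type`, `parentSelectors`). The helper name `selector_tree` is made up for illustration and is not part of the extension.

```python
import json

# Sitemap copied from the post above (split across lines only for readability).
SITEMAP_JSON = (
    '{"_id":"education1","startUrl":["http://www.legislation.gov.uk/uksi/education"],'
    '"selectors":['
    '{"id":"Link","type":"SelectorLink","parentSelectors":["_root"],'
    '"selector":"#content > table > tbody > tr:nth-child(n+1) > td:nth-child(1) > a",'
    '"multiple":true,"delay":0},'
    '{"id":"link2","type":"SelectorLink","parentSelectors":["Link"],'
    '"selector":"#viewLegSnippet > div > ol > li:nth-child(n+1) > li > p > span > a, '
    '#viewLegSnippet > div > ol > li:nth-child(n+1) > p > span.LegDS.LegContentsTitle > a",'
    '"multiple":true,"delay":0},'
    '{"id":"AllText","type":"SelectorText","parentSelectors":["link2"],'
    '"selector":"#viewLegSnippet","multiple":false,"regex":"","delay":0}]}'
)


def selector_tree(sitemap, parent="_root", depth=0, seen=frozenset()):
    """Return indented lines, one per selector, nested under their parents.

    `seen` guards against a selector that (directly or indirectly) lists
    itself as a parent, which would otherwise recurse forever.
    """
    lines = []
    for sel in sitemap["selectors"]:
        if parent in sel["parentSelectors"] and sel["id"] not in seen:
            lines.append("  " * depth + f'{sel["id"]} ({sel["type"]})')
            lines.extend(
                selector_tree(sitemap, sel["id"], depth + 1, seen | {sel["id"]})
            )
    return lines


sitemap = json.loads(SITEMAP_JSON)
tree = selector_tree(sitemap)
print("\n".join(tree))
```

For this sitemap the tree is a single chain, Link, then link2, then AllText, which matches the intended crawl: list page, then contents page, then full text.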

EDIT

OK, I went for a very simple test and it failed on that too. I just tried to get two bits of text from the page:

{"_id":"education2","startUrl":["http://www.legislation.gov.uk/"],"selectors":[{"id":"Text","type":"SelectorText","parentSelectors":["_root"],"selector":"p","multiple":true,"regex":"","delay":"2000"}]}

Looking at the log (see below) I find the error "unknown content type loaded". Does anyone know what is going on? It is driving me mad!

{"url":"http://www.legislation.gov.uk/","timestamp":1555504760,"level_name":"INFO","message":"Job execution started"}
background_script.js:465 {"contentType":"application/xhtml+xml;charset=utf-8","timestamp":1555504760,"level_name":"NOTICE","message":"unknown content type loaded"}
background_script.js:465 {"url":"http://www.legislation.gov.uk/","parentSelector":"_root","sitemapName":"education2","driver":"chrometab","error":"PAGE_UNKNOWN_CONTENT_TYPE_ERROR","timestamp":1555504760,"level_name":"NOTICE","message":"Job execution failed"}
background_script.js:465 {"timestamp":1555504760,"level_name":"PROFILE","message":"157 ms job execution"}
background_script.js:465 {"url":"http://www.legislation.gov.uk/","timestamp":1555504760,"level_name":"INFO","message":"Syncing storage because a job failed"}
background_script.js:465 {"timestamp":1555504762,"level_name":"INFO","message":"Scraper execution is finished"}
background_script.js:465
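A note for anyone hitting a similar failure: the relevant entries can be pulled out of a pasted console log like the one above with a few lines of Python. This is a rough sketch that assumes each entry is a JSON object on one line, optionally preceded by the `background_script.js:465` source marker. The function name `parse_log` is hypothetical, and only a subset of the log lines is repeated here.

```python
import json

# A subset of the console output from the post above, copied verbatim.
LOG = """\
{"url":"http://www.legislation.gov.uk/","timestamp":1555504760,"level_name":"INFO","message":"Job execution started"}
background_script.js:465 {"contentType":"application/xhtml+xml;charset=utf-8","timestamp":1555504760,"level_name":"NOTICE","message":"unknown content type loaded"}
background_script.js:465 {"url":"http://www.legislation.gov.uk/","parentSelector":"_root","sitemapName":"education2","driver":"chrometab","error":"PAGE_UNKNOWN_CONTENT_TYPE_ERROR","timestamp":1555504760,"level_name":"NOTICE","message":"Job execution failed"}
background_script.js:465 {"timestamp":1555504762,"level_name":"INFO","message":"Scraper execution is finished"}
background_script.js:465
"""


def parse_log(text):
    """Return the JSON objects found in the log, one per line."""
    entries = []
    for line in text.splitlines():
        start = line.find("{")
        if start == -1:
            continue  # e.g. a bare "background_script.js:465" line
        # Everything from the first "{" onward is the JSON payload.
        entries.append(json.loads(line[start:]))
    return entries


entries = parse_log(LOG)
notices = [e for e in entries if e["level_name"] == "NOTICE"]
for e in notices:
    # Show the message plus whichever detail field the entry carries.
    print(e["message"], "->", e.get("contentType") or e.get("error"))
```

The two NOTICE entries are the ones that matter here: the first reports the content type the page was served with, and the second reports the resulting job failure.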

martins · April 17, 2019, 2:45pm

Hi!

We will release an update that fixes the problem. As you found out, the issue is related to the content type: the site serves its pages as application/xhtml+xml, which the extension did not recognise.
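To illustrate the kind of check involved: the extension's actual code is not shown in this thread, so the following is only a sketch, in Python, of how a content-type test might accept XHTML as well as plain HTML. The set `HTML_LIKE_TYPES` and both function names are assumptions made for this example.

```python
# Assumption: media types a scraper might treat as parseable markup.
# This set is illustrative and not taken from the extension's source.
HTML_LIKE_TYPES = {"text/html", "application/xhtml+xml"}


def media_type(content_type):
    """Drop parameters such as charset and normalise case."""
    return content_type.split(";", 1)[0].strip().lower()


def is_html_like(content_type):
    """True if the Content-Type value names an HTML or XHTML document."""
    return media_type(content_type) in HTML_LIKE_TYPES


print(is_html_like("application/xhtml+xml;charset=utf-8"))  # True
print(is_html_like("text/html; charset=UTF-8"))             # True
print(is_html_like("application/pdf"))                      # False
```

The first call uses the exact `contentType` value from the log above. A check that only accepted `text/html` would reject it, which is consistent with the "unknown content type loaded" notice.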

educationresearch · April 17, 2019, 3:36pm

Ah great, thanks martins. Do you know if there is any workaround I can do in the meantime? Trying to get this data out for a job....

martins · April 17, 2019, 4:12pm

This issue could be affecting others and the fix is really small, so we have already made the fix and published it in the Chrome Web Store. It should be available via updates in about an hour. You can force the update by enabling developer mode in Chrome's extension manager and pressing the "Update" button there.

Your sitemap works in the updated version.

Thank you for posting the detailed description of the problem. Without it we wouldn't have been able to make the fix.

educationresearch · April 18, 2019, 9:01am

Ah brilliant, thank you martins! Works like a dream.

One other thing. It doesn't bother me as I tend to copy and paste the CSS selector from the "Elements" panel in Chrome, but the automated selector still doesn't work on this particular site. You can select things, but the popup window that allows you to click "Done selecting" doesn't appear. Is this part of the same problem?

martins · April 18, 2019, 10:03am

This site has a pretty rare content type, and that content type prevents the toolbar from being created. There is a workaround for that, but to implement it we have to rewrite the toolbar's code completely.

I have added this to our to-do list. It will be fixed when we do an update for the toolbar.

Thank you for reporting this!