Web Scraper Chrome Extension release notes

1.95.8

  • [Feature] We launched a new feature idea gathering and voting portal at https://feedback.webscraper.io/
  • [Improvement] Reduced the extension's footprint on Chrome by loading the page content script only when the extension is used.
  • [Fix] Element selection when page is zoomed
  • [Fix] Multiple fixes for the sitemap XML selector where the selector incorrectly parsed the sitemap.xml file

1.87.6

  • [Feature] added Website state setup configuration to prepare the target site before scraping, for example to sign in, change currency, or change location. For example, depending on an element that indicates whether you are logged into the website, it can perform these actions:
    • Navigate to url (Login page)
    • type text into an input field
    • click a button
  • [Breaking] regex extraction has been deprecated. If already configured, it will continue to work, but there won't be an option to add a new regex filter.
  • [Fix] Improved performance in start url editing view when a lot of urls are added
  • [Fix] Excessive memory usage in Pagination selector when a large amount of content is accessed.

1.79.3

  • [Feature] improved link from any script data extraction to handle some edge cases
  • [Feature] added sitemap sync button within the sitemap editing view
  • [Fix] Fixed issues in page load detection where the scraper could think that page has loaded while it is still loading ajax data.
  • [Fix] Improved scraped data export. Should fix issues regarding slow data export.
  • [Refactor] We completed input validation refactoring. Congrats to our team! :pizza:

1.75.7

  • [Feature] a mis-configuration modal will show up when multiple selectors with the multiple option enabled have been created. The modal will offer to group the selectors under one element selector.
  • [Feature] new link type and pagination type - Link from any script. This type will extract links in scenarios when a link can only be determined after clicking on the target element. The extractor will perform the click and monitor network traffic to see where the page is navigating. Handles window.open() and window.location=
  • [Refactor] Improved scrolling. Scrolling will no longer skip frames when the target page is doing heavy rendering.

1.72.5

  • [Feature] confirmation popup will show up when deleting a selector.
  • [Feature] link selector now allows selecting elements that are not html links but contain links.
  • [Feature] when selecting, the html code of the hovered element will be shown.
  • [Fix] in some sites data extraction was slow due to Chrome background page throttling.
  • [Fix] scraper could get stuck on a bad page load.
  • [Refactor] We are refactoring UI code. Currently these changes should be invisible.

1.29.66

  • [Feature] Link selector can now extract links from other attributes, attributes with scripts and text
  • [Feature] When deleting a sitemap a confirmation popup will be shown
  • [Breaking] Popup link selector has been removed due to the Chrome MV3 JavaScript execution limitation. Use link selector with custom link type instead.
  • [Breaking] Removed popup link type from element click selector due to the Chrome MV3 JavaScript execution limitation
  • [Change] Improved mouse click simulation.
  • [Refactor] We are refactoring UI code. Currently these changes should be invisible.

1.29.58

  • We bumped extension version to our internal release number. This release also introduces some breaking changes. Major version bump would be needed anyways.
  • [Refactor] A lot of code was refactored to work with the new Chrome manifest V3 API
  • [Change] We completely removed integration with external CouchDB database.
  • [Change] Internal PouchDB database will be replaced with a simpler local storage database engine. All sitemaps will be migrated to new engine on first extension start.
  • [Change] This release includes an updated validation engine. Some validation rules will be stricter to prevent unexpected issues.
  • [Change] Previously, when a data element wasn't found during the data extraction process, a "null" value was stored. Now an empty value "" will be stored.
  • [Feature] Selectors now can be sorted.
  • [Feature] During the data extraction process some CSS selectors will be optimized for better performance.
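
The switch from "null" to an empty value affects anything that post-processes exported data. Below is a minimal sketch, assuming the records are already loaded as Python dictionaries; the `normalize_missing` helper and the field names are hypothetical and not part of Web Scraper. It treats both the old and the new marker as a missing value:

```python
def normalize_missing(record):
    """Map the old "null" marker and the new empty string to None.

    Assumption: a record is a dict of field name -> extracted value.
    """
    return {
        key: (None if value in (None, "", "null") else value)
        for key, value in record.items()
    }


# Hypothetical records scraped before and after this release.
old_style = {"title": "Laptop", "price": "null"}
new_style = {"title": "Laptop", "price": ""}

# Both normalize to the same dictionary.
print(normalize_missing(old_style) == normalize_missing(new_style))
```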

0.6.5

  • [Fix] Various edge case issues with page load detection
  • [Feature] Sitemap sync for Firefox
  • [Refactor] Completely reimplemented validation.
  • [Feature] Exported data will be sorted in the order it was scraped
  • [Feature] Added limit option for scroll down selector
  • [Fix] Element click selector data extraction within shadow root
  • [Refactor] Chrome is migrating to a new manifest version which changes internal APIs. We put a lot of work into this.
  • [Fix] Some special chars were incorrectly exported in XLSX exporter
  • [Fix] Data extractors will strip invalid utf-8 characters from text
  • [Change] Removed delay options from most of the selectors

0.6.4

  • [Feature] XLSX export
  • [Change] Data extraction will be limited to 120 min from a single URL.
  • [Fix] Ignore elements created by google translate

0.6.1

  • [Feature] image selector will extract image URL from srcset attribute
  • [Feature] sitemap synchronization with Web Scraper Cloud
  • [Feature] pagination selector
  • [Fix] element selection when page has zoomed elements
  • [Feature] element attribute selector now has attribute suggestion drop-down
  • [Feature] element preview shows found element count
  • [Feature] sitemap search in sitemap list
  • [Fix] lots of improvements in page load detection for edge cases
  • [Fix] overall lots of small fixes and updates
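
To illustrate what reading an image URL from a srcset attribute involves, here is a simplified sketch. It is not the extension's implementation: the `parse_srcset` helper is hypothetical, and it assumes candidates are separated by commas and that URLs contain neither commas nor whitespace. Which candidate the image selector actually picks is not stated in these notes.

```python
def parse_srcset(srcset):
    """Split a srcset attribute value into (url, descriptor) pairs.

    Simplified on purpose: URLs containing commas are not handled.
    """
    candidates = []
    for part in srcset.split(","):
        tokens = part.split()
        if not tokens:
            continue  # skip empty pieces such as a trailing comma
        url = tokens[0]
        descriptor = tokens[1] if len(tokens) > 1 else ""
        candidates.append((url, descriptor))
    return candidates


for url, descriptor in parse_srcset("small.jpg 480w, large.jpg 1080w"):
    print(url, descriptor)
```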

0.5.1

  • [Feature] new data selection UI engine. It is faster and more resilient to websites having CSS rules that break the UI.
  • [Change] Added a minimum Chrome version 60 requirement. This means that new releases won't be available on Windows XP.
  • [Feature] A new page load detection engine has been added. The new page load detection system will handle a lot of edge cases:
    • immediate redirect after page load
    • service workers
    • hash tag changes
    • quicker load when there is a slowly loading asset
    • won't fail on an error page if there is an immediate redirect to a successful page
    • data extraction will be retried if a redirect occurs during data extraction
    • improved content type checking
    • window.history.push changes
  • [Fix] errors when invalid URLs were extracted from a page
  • [Update] libraries that Web Scraper depends on have been updated
  • [Change] CouchDB has been deprecated. Users that were using it will be able to continue to use it, but new users won't be able to change data storage. We plan to replace the current data storage engine (PouchDB) with a simpler one to reduce problems with sitemap and data storage.
  • [Feature] privacy policy page and an option to opt-out of extension metric gathering via options page.
  • [Feature] new users will see a welcome page with a quick startup guide
  • [Change] reduced the extension's permission requirements.
  • [Fix] overall a lot of fixes for different types of edge cases.
  • Overall code quality improvements

0.4.1.2

  • [Refactor] Scroll down selector scroll system.
  • [Fix] Sometimes Web Scraper wasn't receiving a page load completed network event from chrome. Added a workaround so that page load doesn't fail
  • [Fix] Page load event was failing when page contents were loaded by a background worker

0.4.1.1

  • [Feature] A custom selector to extract data within shadow root elements has been added. Right now element selection within shadow root elements won't work but it is possible to extract data from them with a custom CSS selector. Use like this: .shadow-root-parent-element:shadow-root .selector-within-shadow-root
  • [Feature] Similarly, a selector to extract data within iframes was added. Use like this: iframe:iframe .selector-within-iframe

0.4.1

  • [Fix] web scraper didn't recognize an XHTML header and was ignoring pages with this header

0.4.0

  • New web scraper version released

0.3.8.9

  • [Feature] Added sitemap.xml Link selector. The selector takes all urls from sitemap.xml files. In cases when you want to scrape an entire website this selector simplifies navigation within the site. Instead of using link selectors for categories, subcategories and pagination, you can use one selector.
  • [Change] Removed delay option from selectors that don't need it. If you had delay set up in your sitemap, it will continue to work.
  • [Feature] CSS selector generator now generates selectors that use schema.org attributes, other element attributes, dd:contains() + dt, h1:contains() + *
  • [Change] CSS selector generator has been refactored to generate better selectors
  • [Change] Popup window has new style
  • [Change] Scraper will now try to extract data from a page that hasn't completely loaded. Previously page load timed out and no data was extracted
  • [Feature] Element click selector now has an option that returns initial element only when there is no button found in the page. Useful for product variations within a page (size, color switch)
  • [Change] Element click selector now triggers change event on <select> tags when it should click on an <option> element
  • [Fix] Fixed an issue where the scraper would continue to run after user has created extension window
  • [Fix] Validation sometimes didn't trigger when extension type was changed
  • [Fix] Empty urls won't be added to queue
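
The idea behind the sitemap.xml Link selector, taking every URL listed in a sitemap.xml file, can be sketched as follows. This is an illustrative example rather than the extension's code: the `extract_sitemap_urls` helper and the sample document are hypothetical, and the namespace is the one defined by the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

# Namespace used by sitemap.xml files that follow the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_sitemap_urls(xml_text):
    """Return the text of every <loc> element found in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


# Hypothetical sitemap used only for illustration.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/products/1</loc></url>
</urlset>"""

for url in extract_sitemap_urls(SAMPLE):
    print(url)
```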

0.3.8.5

  • [Change] Updated scroll down behavior one more time. Now it scrolls smoothly. It will scroll to the bottom element that it is looking for. If no such element is found, it will scroll to the bottom of the page. If still nothing is found, it will consider scrolling done. This should increase the success rate of the scroll down selector.

0.3.8.4

  • [Feature/WIP] Work has started on a pre-scrape website setup for authentication, language selection, currency change. UI not available ATM.
  • [Fix] Element selectors weren't waiting for the initial delay after the click-in-click feature was implemented.
  • [Fix] Some sites rebuild the entire DOM tree. This was breaking element click selector. This is now partly fixed
  • [Change] Element scroll down selector now scrolls to the bottommost element instead of the page bottom. This solves a problem when a site doesn't trigger data load because the footer is being displayed.

0.3.8.2

  • [Fix] Now it is possible to select elements within element selector that has CSS selector _parent_
  • [Feature] Allow IP addresses, host names as start urls
  • [Change] Refactored CSV export. Line break now is \r\n instead of \n. Now the CSV format should be as described in RFC 4180 - Common Format and MIME Type for Comma-Separated Values (CSV) Files. Tested edge cases where text contains quotes and backslashed quotes. Opening the CSV with MS Office 365 and LibreOffice worked as expected. Note that some software libraries use backslash \ as an escape character. This use is incorrect according to the RFC.
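
As an illustration of the RFC 4180 conventions mentioned above (\r\n line breaks, and quotes escaped by doubling rather than with a backslash), here is a small sketch using Python's standard csv module. It is not the extension's exporter; the `to_rfc4180_csv` helper and the sample rows are hypothetical.

```python
import csv
import io


def to_rfc4180_csv(rows):
    """Serialize rows with \\r\\n line breaks and doubled quotes."""
    buffer = io.StringIO(newline="")  # newline="" keeps \r\n untouched
    writer = csv.writer(
        buffer,
        lineterminator="\r\n",
        quoting=csv.QUOTE_MINIMAL,
        doublequote=True,  # a quote inside a field is written as ""
    )
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical rows covering quotes and backslashed quotes.
rows = [["name", "note"], ['Monitor 24"', 'said \\"ok\\"']]
text = to_rfc4180_csv(rows)
print(repr(text))

# Reading the text back with the same conventions restores the rows.
print(list(csv.reader(io.StringIO(text, newline=""))) == rows)
```

A consumer that treats backslash as the escape character would misread the doubled quotes, which is the incompatibility the note above warns about.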

0.3.8.1

  • [Change] Data extraction module has been rewritten completely. The only difference should be that an element selector with multiple not checked will return a record with null values when child selectors don't extract anything
  • [Feature] Element click selector can have child element click selectors. Useful when clicking through multiple variation selections in a product page
  • [Feature] With some minor changes we managed to port Web Scraper to Firefox
  • [Fix] Refactored page status code detection code to fix a race condition on Firefox

0.3.8

  • [Feature] Element click selector now also triggers touch events in case buttons are triggered by touch instead of click.
  • [Fix] Disallow "&" char in selector ids
  • [Fix] Sitemaps in sitemap list are now loaded one by one. This should resolve a problem when too big sitemaps are stored in chrome
  • [Feature] Web Scraper got a new logo
  • [Feature] In a recent release Chrome added lookbehind to regex engine. Now you can write regex like this (?<=sku: ).+. This will extract 12345 from sku: 12345
  • [Feature] Now you can detach devtools and run web scraper in a separate window.
  • [Fix] When a parent element selector selected HTML element data preview didn't work
  • [Fix] When a class name contained % char CSS selector couldn't be generated
  • [Fix] Escape special characters in a CSS selector ($, (, etc..)
  • [Fix] White space in links extracted by link selector is now removed or escaped
  • [Fix] Table selectors header and data row selector generator has been rewritten. Fixed an issue when table contained tables
  • [Fix] Added error handlers for errors that were happening in chrome API
  • [Fix] Web scraper won't stop if an url with invalid domain name is being scraped. It will continue
  • [Change] If a page loads with a 4xx or 5xx status code, data won't be extracted from this page.
  • [Feature] CSS selector can now generate CSS selectors that start with >
  • [Change] delay option is marked as deprecated in some selectors
  • [Fix] Image selector might have failed when a sitemap had image download enabled. Image download was disabled in a previous release.
  • [Fix] If the loaded url isn't an HTML document data won't be extracted from it. For example it might be an Image url.
  • [Change/Fix] We are limiting start url count to 10000 in a sitemap. The problem was that chrome has some internal storage limitations and a large sitemap could make all sitemaps inaccessible.
  • [Change] Scraper window now opens the first url that needs to be scraped instead of the "waiting" page
  • [Feature] When element click selector is used to click through a <select> tag, the selected <option> tag will have selected="selected" attribute.
  • [Feature] We added a survey system to better understand what we should focus on.
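
The lookbehind pattern from the regex note in this release, (?<=sku: ).+, can be tried outside the browser as well. The sketch below uses Python's re module purely for illustration; the release note itself concerns Chrome's JavaScript regex engine, and the `extract_after_prefix` helper is hypothetical.

```python
import re

# Lookbehind: match text that is preceded by "sku: " without including it.
SKU_PATTERN = re.compile(r"(?<=sku: ).+")


def extract_after_prefix(text):
    """Return the part of the text after "sku: ", or None if absent."""
    match = SKU_PATTERN.search(text)
    return match.group(0) if match else None


print(extract_after_prefix("sku: 12345"))  # prints 12345
```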

0.3.7

  • Fixed an issue where large csv files couldn't be downloaded.
  • Fixed scrolling with mouse middle button in some UI elements.
  • Fixed data preview which sometimes showed more data because of multiple linked link selectors
  • Updated test sites in https://webscraper.io. Now there are product pages, more pagination pages and a test site for popup link selector.

0.3.6

  • Refactored ajax wait functionality. It will wait only for xhr and script requests that are made to the domain or a subdomain of the currently open page.
  • Element click selector now will be able to click on <option> tags. Instead of clicking it will trigger a value change event.
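
The "domain or subdomain of the currently open page" rule for ajax waiting can be sketched as a host comparison. This is an assumption about how such a check might look, not the extension's code: the `is_same_domain_or_subdomain` helper is hypothetical, and it treats the page's host name itself as the domain.

```python
from urllib.parse import urlsplit


def is_same_domain_or_subdomain(request_url, page_url):
    """True if the request host equals the page host or is a subdomain of it."""
    request_host = (urlsplit(request_url).hostname or "").lower()
    page_host = (urlsplit(page_url).hostname or "").lower()
    if not request_host or not page_host:
        return False
    # The leading dot prevents "notexample.com" from matching "example.com".
    return request_host == page_host or request_host.endswith("." + page_host)


page = "https://example.com/products"
print(is_same_domain_or_subdomain("https://api.example.com/items.json", page))
print(is_same_domain_or_subdomain("https://tracker.example.net/t.js", page))
```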

0.3.5

0.3.2 - 0.3.4 versions didn't work on windows xp, windows vista. The problem was that google isn't releasing newer chrome versions for these operating systems and the extension was using JavaScript features which weren't available in older chrome versions. This release should fix problems in these operating systems.

  • Fixed an issue in URLSearchParams library which was incorrect in chrome 49
  • Disabled wait ajax functionality on chrome versions that don't support privilege request from devtools (chrome 49 couldn't)
  • Refactored page load delay waiter with 60s+30s delays. The scraper will also try multiple times to connect to scraper window during page load process if the content script isn't reachable. Previously during this check an error could happen on a slower computer.
  • Fixed element preview for element click selector
  • Added ajax wait to element click selector delay. Element click selector now should wait on requests that the page is making after clicking an element. This won't work in windows xp/vista though.

0.3.4

  • Increased page load timeouts which have caused problems when a page has a lot of content or when scraper is running on a slower computer
  • Reordered scraped data preview columns to match exported csv columns.

0.3.3

  • Fixed an issue where configuring request interval would make the scraper load only one page
  • Added tab refresh timeout. In case of a timeout scraper window will be recreated
  • Added page load checker. In case a page is stuck in loading process scraper window will be recreated

0.3.2

  • Added tab refresh
  • Added page load detection using network listeners. This detection feature will wait for dynamic data to load before starting data extraction
  • Page load detection feature should also increase page load speed.
  • Added a primitive adblocker. Right now it blocks a few analytics trackers in scraper window.
  • Removed Image download. (Use image download script instead)