
News

Posted over 5 years ago by The Rust Core Team
The Rust team is happy to announce a new version of Rust, 1.31.0, and "Rust 2018" as well. Rust is a programming language that empowers everyone to build reliable and efficient software. If you have a previous version of Rust installed via rustup, getting Rust 1.31.0 is as easy as:

    $ rustup update stable

If you don't have it already, you can get rustup from the appropriate page on our website, and check out the detailed release notes for 1.31.0 on GitHub.

What's in 1.31.0 stable

Rust 1.31 may be the most exciting release since Rust 1.0! Included in this release is the first iteration of "Rust 2018," but there's more than just that! This is going to be a long post, so here's a table of contents:

- Rust 2018
- Non-lexical lifetimes
- Module system changes
- More lifetime elision rules
- const fn
- New tools
- Tool lints
- Documentation
- Domain working groups
- New website
- Library stabilizations
- Cargo features
- Contributors

Rust 2018

We wrote about Rust 2018 first in March, and then in July. For some more background about the why of Rust 2018, please go read those posts; there's a lot to cover in the release announcement, and so we're going to focus on the what here. There's also a post on Mozilla Hacks as well!

Briefly, Rust 2018 is an opportunity to bring all of the work we've been doing over the past three years together, and create a cohesive package. This is more than just language features; it also includes:

- Tooling (IDE support, rustfmt, Clippy)
- Documentation
- Domain working groups work
- A new web site

We'll be covering all of this and more in this post. Let's create a new project with Cargo:

    $ cargo new foo

Here's the contents of Cargo.toml:

    [package]
    name = "foo"
    version = "0.1.0"
    authors = ["Your Name"]
    edition = "2018"

    [dependencies]

A new key has been added under [package]: edition. Note that it has been set to 2018. You can also set it to 2015, which is the default if the key does not exist.

By using Rust 2018, some new features are unlocked that are not allowed in Rust 2015. It is important to note that each package can be in either 2015 or 2018 mode, and they work seamlessly together. Your 2018 project can use 2015 dependencies, and a 2015 project can use 2018 dependencies. This ensures that we don't split the ecosystem, and all of these new things are opt-in, preserving compatibility for existing code. Furthermore, when you do choose to migrate Rust 2015 code to Rust 2018, the changes can be made automatically, via cargo fix.
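In practice, the migration workflow looks roughly like this sketch (based on the edition guide; exact output and lint names vary by project):

    $ cargo fix --edition     # rewrite code so it compiles under the new edition's rules
    # then add edition = "2018" to Cargo.toml and verify:
    $ cargo build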
What kind of new features, you may ask? Well, first, features get added to Rust 2015 unless they require some sort of incompatibility with 2015's features. As such, most of the language is available everywhere. You can check out the edition guide to check each feature's minimum rustc version as well as edition requirements. However, there are a few big-ticket features we'd like to mention here: non-lexical lifetimes, and some module system improvements.

Non-lexical lifetimes

If you've been following Rust's development over the past few years, you may have heard the term "NLL" or "non-lexical lifetimes" thrown around. This is jargon, but it has a straightforward translation into simpler terms: the borrow checker has gotten smarter, and now accepts some valid code that it previously rejected. Consider this example:

    fn main() {
        let mut x = 5;
        let y = &x;
        let z = &mut x;
    }

In older Rust, this is a compile-time error:

    error[E0502]: cannot borrow `x` as mutable because it is also borrowed as immutable
     --> src/main.rs:5:18
      |
    4 |     let y = &x;
      |              - immutable borrow occurs here
    5 |     let z = &mut x;
      |                  ^ mutable borrow occurs here
    6 | }
      | - immutable borrow ends here

This is because lifetimes follow "lexical scope"; that is, the borrow from y is considered to be held until y goes out of scope at the end of main, even though we never use y again. This code is fine, but the borrow checker could not handle it. Today, this code will compile just fine.

What if we did use y, like this for example:

    fn main() {
        let mut x = 5;
        let y = &x;
        let z = &mut x;
        println!("y: {}", y);
    }

Older Rust will give you this error:

    error[E0502]: cannot borrow `x` as mutable because it is also borrowed as immutable
     --> src/main.rs:5:18
      |
    4 |     let y = &x;
      |              - immutable borrow occurs here
    5 |     let z = &mut x;
      |                  ^ mutable borrow occurs here
    ...
    8 | }
      | - immutable borrow ends here

With Rust 2018, this error changes for the better:

    error[E0502]: cannot borrow `x` as mutable because it is also borrowed as immutable
     --> src/main.rs:5:13
      |
    4 |     let y = &x;
      |             -- immutable borrow occurs here
    5 |     let z = &mut x;
      |             ^^^^^^ mutable borrow occurs here
    6 |
    7 |     println!("y: {}", y);
      |                       - borrow later used here

Instead of pointing to where y goes out of scope, it shows you where the conflicting borrow occurs. This makes these sorts of errors far easier to debug. In Rust 1.31, this feature is exclusive to Rust 2018. We plan to backport it to Rust 2015 at a later date.

Module system changes

The module system can be a struggle for people first learning Rust. Everyone has their own things that take time to master, of course, but there's a root cause for why it's so confusing to many: while there are simple and consistent rules defining the module system, their consequences can feel inconsistent, counterintuitive and mysterious. As such, the 2018 edition of Rust introduces a few changes to how paths work, but they end up simplifying the module system, to make it more clear as to what is going on. Here's a brief summary:

- extern crate is no longer needed in almost all circumstances.
- You can import macros with use, rather than a #[macro_use] attribute.
- Absolute paths begin with a crate name, where the keyword crate refers to the current crate.
- A foo.rs and foo/ subdirectory may coexist; mod.rs is no longer needed when placing submodules in a subdirectory.

These may seem like arbitrary new rules when put this way, but the mental model is now significantly simplified overall. There are a lot of details here, so please read the edition guide for full details.
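To make the path changes concrete, here's a minimal 2018-edition sketch; the serde_json dependency and the module contents are illustrative, not from the release post:

    // Cargo.toml has edition = "2018" and serde_json = "1" under [dependencies].
    // No `extern crate serde_json;` line is needed, and the json! macro is
    // imported with a plain `use` instead of #[macro_use]:
    use serde_json::json;

    mod network {
        pub fn connect() {
            println!("connecting...");
        }
    }

    fn main() {
        // Absolute paths begin with `crate`, which names the current crate:
        crate::network::connect();
        println!("{}", json!({ "ok": true }));
    }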
More lifetime elision rules

Let's talk about a feature that's available in both editions: we've added some additional elision rules for impl blocks and function definitions. Code like this:

    impl<'a> Reader for BufReader<'a> {
        // methods go here
    }

can now be written like this:

    impl Reader for BufReader<'_> {
        // methods go here
    }

The '_ lifetime still shows that BufReader takes a parameter, but we don't need to create a name for it anymore.

Lifetimes are still required to be defined in structs. However, we no longer require as much boilerplate as before:

    // Rust 2015
    struct Ref<'a, T: 'a> {
        field: &'a T
    }

    // Rust 2018
    struct Ref<'a, T> {
        field: &'a T
    }

The : 'a is inferred. You can still be explicit if you prefer. We're considering some more options for elision here in the future, but have no concrete plans yet.

const fn

There are several ways to define a function in Rust: a regular function with fn, an unsafe function with unsafe fn, and an external function with extern fn. This release adds a new way to qualify a function: const fn. It looks like this:

    const fn foo(x: i32) -> i32 {
        x + 1
    }

A const fn can be called like a regular function, but it can also be used in any constant context. When it is, it is evaluated at compile time, rather than at run time. As an example:

    const SIX: i32 = foo(5);

This will execute foo at compile time, and set SIX to 6.

const fns cannot do everything that normal fns can do; they must have deterministic output. This is important for soundness reasons. Currently, const fns can do a minimal subset of operations. Here are some examples of what you can do:

- Arithmetic and comparison operators on integers
- All boolean operators except for && and ||
- Constructing arrays, structs, enums, and tuples
- Calls to other const fns
- Index expressions on arrays and slices
- Field accesses on structs and tuples
- Reading from constants (but not statics, not even taking a reference to a static)
- & and * of references
- Casts, except for raw pointer to integer casts

We'll be growing the abilities of const fn, but we've decided that this is enough useful stuff to start shipping the feature itself. For full details, please see the reference.
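Putting a couple of those allowed operations together, a small sketch (the function names are made up for illustration):

    // Arithmetic and calls to other const fns are within the 1.31 subset:
    const fn square(x: usize) -> usize {
        x * x
    }

    const fn cube(x: usize) -> usize {
        square(x) * x
    }

    // Both evaluated at compile time:
    const AREA: usize = square(6);
    const VOLUME: usize = cube(3);

    fn main() {
        // A const fn is also callable at run time like any other function:
        println!("{} {} {}", AREA, VOLUME, square(5)); // 36 27 25
    }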
New tools

The 2018 edition signals a new level of maturity for Rust's tools ecosystem. Cargo, Rustdoc, and Rustup have been crucial tools since 1.0; with the 2018 edition, there is a new generation of tools ready for all users: Clippy, Rustfmt, and IDE support.

Rust's linter, Clippy, is now available on stable Rust. You can install it via rustup component add clippy and run it with cargo clippy. Clippy is now considered 1.0, which carries the same lint stability guarantees as rustc. New lints may be added, and lints may be modified to add more functionality; however, lints may never be removed (only deprecated). This means that code that compiles under Clippy will continue to compile under Clippy (provided there are no lints set to error via deny), but may throw new warnings.

Rustfmt is a tool for formatting Rust code. Automatically formatting your code lets you save time and arguments by using the official Rust style. You can install it with rustup component add rustfmt and use it with cargo fmt. This release includes Rustfmt 1.0. From now on we guarantee backwards compatibility for Rustfmt: if you can format your code today, then the formatting will not change in the future (only with the default options). Backwards compatibility means that running Rustfmt on your CI is practical (use cargo fmt -- --check). Try that and 'format on save' in your editor to revolutionize your workflow.

IDE support is one of the most requested tooling features for Rust. There are now multiple, high quality options:

- Visual Studio Code
- IntelliJ
- Atom
- Sublime Text 3
- Eclipse

Work on IDE support is not finished; in particular, code completion is not up to scratch in the RLS-based editors. However, if you mainly want support for types, documentation, and 'go to def', etc., then you should be happy.

If you have problems installing any of the tools with Rustup, try running rustup self update, and then try again.

Tool lints

In Rust 1.30, we stabilized "tool attributes", like #[rustfmt::skip]. In Rust 1.31, we're stabilizing something similar: "tool lints," like #[allow(clippy::bool_comparison)]. These give a namespace to lints, so that it's more clear which tool they're coming from. If you previously used Clippy's lints, you can migrate like this:

    // old
    #![cfg_attr(feature = "cargo-clippy", allow(bool_comparison))]

    // new
    #![allow(clippy::bool_comparison)]

You don't need cfg_attr anymore! You'll also get warnings that can help you update to the new style.

Documentation

Rustdoc has seen a number of improvements this year, and we also shipped a complete re-write of "The Rust Programming Language." Additionally, you can buy a dead-tree copy from No Starch Press!

We had previously called this the "second edition" of the book, but since it's the first edition in print, that was confusing. We also want to periodically update the print edition as well. In the end, after many discussions with No Starch, we're going to be updating the book on the website with each release, and No Starch will periodically pull in our changes and print them. The book has been selling quite well so far, raising money for Black Girls Code. You can find the new TRPL here.

Domain working groups

We announced the formation of four working groups this year:

- Network services
- Command-line applications
- WebAssembly
- Embedded devices

Each of these groups has been working very hard on a number of things to make Rust awesome in each of these domains. Some highlights:

- Network services has been shaking out the Futures interface, and async/await on top of it. This hasn't shipped yet, but we're close!
- The CLI working group has been working on libraries and documentation for making awesome command-line applications.
- The WebAssembly group has been shipping a ton of world-class tooling for using Rust with wasm.
- Embedded devices has gotten ARM development working on stable Rust!

You can find out more about this work on the new website!

New website

Last week we announced a new iteration of the web site. It's now been promoted to rust-lang.org itself! There's still a ton of work to do, but we're proud of the year of work that it took by many people to get it shipped.

Library stabilizations

A bunch of From implementations have been added:

- u8 now implements From<NonZeroU8>, and likewise for the other numeric types and their NonZero equivalents
- Option<&T> implements From<&Option<T>>, and likewise for &mut

Additionally, these functions have been stabilized:

- slice::align_to and its mutable counterpart
- slice::chunks_exact, as well as its mutable and r counterparts (like slice::rchunks_exact_mut) in all combinations

See the detailed release notes for more.
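As a quick illustration of the chunks_exact family (the expected output in the comments is my own, not from the release post):

    fn main() {
        let data = [1u8, 2, 3, 4, 5, 6, 7];
        // chunks_exact yields only complete chunks of the requested size;
        // any leftover elements are exposed separately via remainder().
        let mut chunks = data.chunks_exact(3);
        for chunk in &mut chunks {
            println!("{:?}", chunk); // [1, 2, 3] then [4, 5, 6]
        }
        println!("{:?}", chunks.remainder()); // [7]
    }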
Cargo features

Cargo will now download packages in parallel using HTTP/2. Additionally, now that extern crate is not usually required, it would be jarring to do extern crate foo as bar; to rename a crate. As such, you can do so in your Cargo.toml, like this:

    [dependencies]
    baz = { version = "0.1", package = "foo" }

or, the equivalent:

    [dependencies.baz]
    version = "0.1"
    package = "foo"

Now, the foo package will be able to be used via baz in your code. See the detailed release notes for more.

Contributors to 1.31.0

At the end of release posts, we normally thank the people who contributed to this release. But for this release, more so than others, this list does not truly capture the amount of work and the number of people who have contributed. Each release is only six weeks, but this release is the culmination of three years of effort, in countless repositories, by numerous people. It's been a pleasure to work with you all, and we look forward to continuing to grow in the next three years.
Posted over 5 years ago
Introduction

We now have nightly releases of Servo for the Magic Leap One augmented reality headset. You can head over to https://download.servo.org/, install the application, and browse the web in a virtual browser. This is a developer preview release, designed as a testbed for future products, and as a venue for experimenting with UI design. What should the web look like in augmented reality? We hope to use Servo to find out!

We are providing these nightly snapshots to encourage other developers to experiment with AR web experiences. There are still many missing features, such as immersive or 3D content, many types of user input, media, or a stable embedding API. We hope you forgive the rough edges. This blog post will describe the experience of porting Servo to a new architecture, and is intended for system developers.

Magic Leap under the hood

The Magic Leap software development kit (SDK) is based on commonly-used open-source technologies. In particular, it uses the clang compiler and the gcc toolchain for support tools such as ld, objcopy, ranlib and friends. The architecture is 64-bit ARM, using the same application binary interface as Android. Together these give the target as being aarch64-linux-android, the same as for many 64-bit Android devices. Unlike Android, Magic Leap applications are native programs, and do not require a Java Native Interface (JNI) to the OS.

Magic Leap provides a lot of support for developing AR applications, in the form of the Lumin Runtime APIs, which include 3D scene descriptions, UI elements, input events including device placement and orientation in 3D space, and rendering to displays which provide users with 3D virtual visual and audio environments that interact with the world around them. The Magic Leap and Lumin Runtime SDKs are available from https://creator.magicleap.com/ for Mac and Windows platforms.

Building the Servo library

The Magic Leap library is built using ./mach build --magicleap, which under the hood calls cargo build --target=aarch64-linux-android. For most of the Servo library and its dependencies, this just works, but there are a couple of corner cases: C/C++ libraries, and crates with special treatment for Android.

Some of Servo's dependencies are crates which link against C/C++ libraries, notably openssl-sys and mozjs-sys. Each of these libraries uses slightly different build environments (such as Make, CMake or Autoconf, often with custom build scripts). The challenge for software like Servo that uses many such libraries is to find a configuration which will work for all the dependencies. This comes down to finding the right settings for environment variables such as $CFLAGS, and is complicated by cross-compiling the libraries, which often means ensuring that the Magic Leap libraries are included, not the host libraries.

The other main source of issues with the build is that since Magic Leap uses the same ABI as Android, its target is aarch64-linux-android, which is the same as for 64-bit ARM Android devices. As a result, many crates which need special treatment for Android (for example for JNI or to use libandroid) will treat the Magic Leap build as an Android build rather than a Linux build. Some care is needed to undo all of this special treatment. For example, the build scripts of Servo, SpiderMonkey and OpenSSL all contain code to guess the directory layout of the Android SDK, which needs to be undone when building for Magic Leap.
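To give a flavor of what such a cross-compilation setup involves, here is a rough sketch (the paths are hypothetical; Servo's mach wrapper arranges the real values):

    # .cargo/config — point Cargo at the SDK's cross toolchain
    [target.aarch64-linux-android]
    linker = "/path/to/mlsdk/toolchains/bin/aarch64-linux-android-clang"

together with environment variables along the lines of CC pointing at the SDK's clang and CFLAGS carrying a --sysroot for the device libraries, so that the C/C++ dependencies pick up the target headers and libraries rather than the host's.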
One thing that just worked was debugging Rust code on the Magic Leap device. Magic Leap supports the Visual Studio Code IDE, and remote debugging of code running natively. It was great to see the debugging working out of the box for Rust code as well as it did for C++.

Building the Magic Leap application

The first release of Servo for Magic Leap comes with a rudimentary application for browsing 2D web content. This is missing many features, such as immersive 3D content, audio or video media, or user input by anything other than the controller.

Magic Leap applications come in two flavors: universe applications, which are immersive experiences that have complete control over the device, and landscape applications, which co-exist and present the user with a blended experience where each application presents part of a virtual scene. Currently, Servo is a landscape application, though we expect to add a universe application for immersive web content.

Landscape applications can be designed using the Lumin Runtime Editor, which gives a visual presentation of the various UI components in the scene graph. The most important object in Servo's scene graph is the content node, since it is a Quad that can contain a 2D resource. One of the kinds of resource that a Quad can contain is an EGL context, which Servo uses to render web content. The runtime editor generates C++ code that can be included in an application to render and access the scene graph; Servo uses this to access the content node, and the EGL context it contains.

The other hooks that the Magic Leap Servo application uses are for events such as moving the laser pointer, which are mapped to mouse events; a heartbeat for animations or other effects which must be performed on the main thread; and a logger which bridges Rust's logging API to Lumin's.

The Magic Leap application is built each night by Servo's CI system, using the Mac builders since there is no Linux SDK for Magic Leap. This builds the Servo library, and packages it as a Magic Leap application, which is hosted on S3 and linked to from the Servo download page.

Summary

The pull request that added Magic Leap support to Servo is https://github.com/servo/servo/pull/21985, which adds about 1600 lines to Servo, mostly in the build scripts and the Magic Leap application. Work on the Magic Leap port of Servo started in early September 2018, and the pull request was merged at the end of October, so it took about two person-months.

Much of the port was straightforward, due to the maturity of the Rust cross-compilation and build tools, and the use of common open-source technologies in the Magic Leap platform. Lumin OS contains many innovative features in its treatment of blending physical and virtual 3D environments, but it is built on a solid open-source foundation, which makes porting a complex application like Servo relatively straightforward.

Servo is now making its first steps onto the Magic Leap One, and is available for download and experimentation. Come try it out, and help us design the immersive web!
Posted over 5 years ago by TWiR Contributors
Hello and welcome to another issue of This Week in Rust! Rust is a systems language pursuing the trifecta: safety, concurrency, and speed. This is a weekly summary of its progress and community. Want something mentioned? Tweet us at @ThisWeekInRust or send us a pull request. Want to get involved? We love contributions.

This Week in Rust is openly developed on GitHub. If you find any errors in this week's issue, please submit a PR.

Updates from Rust Community

News & Blog Posts

- A new look for rust-lang.org.
- Rust Quiz: 26 medium to hard Rust questions with complete explanations.
- Announcing RustaCUDA.
- Official Rust runtime for AWS Lambda.
- Creating my first AWS Lambda using Rust.
- How I wrote a modern C++ library in Rust.
- Using Passenger with Rust.
- wasm-bindgen — how does it work?
- Rust web survey 2018.
- This Week in Rust and WebAssembly 9.

Crate of the Week

This week's crate is cargo-call-stack, a cargo subcommand for whole-program call stack analysis. Thanks to Jorge Aparicio for the suggestion! Submit your suggestions and votes for next week!

Call for Participation

Always wanted to contribute to open-source projects but didn't know where to start? Every week we highlight some tasks from the Rust community for you to pick and get started! Some of these tasks may also have mentors available, visit the task page for more information.

- Rust Latam CFP is now open, deadline is December 31st.
- The imag project calls for contributors (2)

If you are a Rust project owner and are looking for contributors, please submit tasks here.

Updates from Rust Core

254 pull requests were merged in the last week:

- decouple proc_macro from the rest of the compiler
- implement chalk unification routines
- upgrade LLVM to trunk, still version 8
- another LLVM Update and Re-enable lldb
- use sort_by_cached_key when the key function is not trivial/free
- deduplicate literal → constant lowering
- use MaybeUninit instead of mem::uninitialized for Windows Mutex
- libcore: add VaList and variadic arg handling intrinsics
- arena: speed up TypedArena::clear and improve common patterns
- stabilize macro_at_most_once_rep
- stabilize dbg!(..)
- stabilize self_in_typedefs
- stabilize self_struct_ctor
- remove unsafe unsafe inner function
- add TryFrom<&[T]> for [T; $N] where T: Copy
- move VecDeque::resize_with out of the impl block
- use allow-dirty option in cargo package to skip vcs checks
- make ParseIntError and IntErrorKind fully public
- use MaybeUninit in libcore
- fix futures creating aliasing mutable and shared ref
- add libstd Cargo feature panic_immediate_abort
- cargo: ConflictStoreTrie: faster filtered search
- crates.io: email verification warning

Approved RFCs

Changes to Rust follow the Rust RFC (request for comments) process. These are the RFCs that were approved for implementation this week:

- RFC 2591: Stabilise exhaustive integer pattern matching.
- RFC 2500: Needle API (née Pattern API).

Final Comment Period

Every week the team announces the 'final comment period' for RFCs and key PRs which are reaching a decision. Express your opinions now.

RFCs

No RFCs are currently in final comment period.

Tracking Issues & PRs

- [disposition: merge] Stabilize memory-related std::arch::wasm32 intrinsics.
- [disposition: merge] Tracking issue for RFC 2300, "Self in type definitions".
- [disposition: merge] Tracking issue for str::split_ascii_whitespace.
- [disposition: merge] Tracking issue for Vec::resize_with and resize_default.
- [disposition: close] Tracking issue for feature extern_in_paths.

New RFCs

- impl trait expressions.

Upcoming Events

Online
- Dec 12. Rust Community Team Meeting in Discord.
- Dec 17. Rust Community Content Subteam Meeting on Discord.
- Dec 19. Rust Events Team Meeting on Telegram.

Asia Pacific

- Dec 6. Pune, IN - Rust workshop at Pune, India.
- Dec 12. Hangzhou, CN - Rust Hangzhou.
- Dec 16. Sydney, AU - Rust Sydney Meetup 15.

Europe

- Dec 10. Vienna, AT - Metalab - Rust Workshop.
- Dec 11. Zurich, CH - Rust Zurich - Rust Embedded Edition 2018.
- Dec 12. Berlin, DE - Berlin Rust Hack and Learn.
- Dec 12. Milano, IT - Milano - Hello Open Closed Principle.
- Dec 15 & 16. Moscow, RU - RustRush 2018.
- Dec 20. Cambridge, GB - The Last Cambridge Rust?
- Dec 20. Turin, IT - Gruppo di studio Rust.

North America

- Dec 6. Phoenix, US - Phoenix 2018 Edition Release Party.
- Dec 9. Mountain View, US - Rust Dev in Mountain View!
- Dec 10. Seattle, US - Seattle Rust Meetup.
- Dec 12. Vancouver, CA - Vancouver Rust meetup.
- Dec 12. Boulder, US - Rust Boulder/Denver Monthly Meeting.
- Dec 13. Arlington, US - Rust DC — Mid-month Rustful.
- Dec 13. Columbus, US - Columbus Rust Society - Monthly Meeting.
- Dec 13. Utah, US - Utah Rust monthly meetup.
- Dec 13. San Diego, US - San Diego Rust December Meetup - Rust 2018 Overview + Memory Allocator.
- Dec 16. Mountain View, US - Rust Dev in Mountain View!
- Dec 20. Chicago, US - Rust for the Holidays.

If you are running a Rust event please add it to the calendar to get it mentioned here. Please remember to add a link to the event too. Email the Rust Community Team for access.

Rust Jobs

- Software Infrastructure Engineer - Engines at Blue Origin, Kent, US.
- Rust Engineer at Commure, Inc. (San Francisco, Boston, Montreal).
- Intermediate Software Developer at Finhaven, Vancouver, CA.
- Tech Lead at Hashintel, London, GB.
- Embedded operating system developer, Karlsruhe, DE.
- Student research assistant (embedded), Karlsruhe, DE.

Tweet us at @ThisWeekInRust to get your job offers listed here!

Quote of the Week

"The bug I did not have" – /u/pacman82's reddit post title

Thanks to Felix for the suggestion! Please submit your quotes for next week!

This Week in Rust is edited by: nasa42, llogiq, and Flavsditz. Discuss on r/rust.
Posted over 5 years ago by ClassicHasClass
I used to think that WebKit would eat the world, but later on I realized it was Blink. In retrospect this should have been obvious when the mobile version of Microsoft Edge was announced to use Chromium (and not Microsoft's own rendering engine, EdgeHTML), but now rumour has it that Edge on its own home turf -- Windows 10 -- will be Chromium too. Microsoft engineers have already been spotted committing to the Chromium codebase, apparently for the ARM version. No word on whether this next browser, codenamed Anaheim, will still be called Edge.

In the sense that Anaheim won't (at least in name) be Google, just Chromium, there's reason to believe that it won't have the repeated privacy erosions that have characterized Google's recent moves with Chrome itself. But given how much DNA WebKit and Blink share, that means there are effectively two current major rendering engines left: Chromium and Gecko (Firefox). The little ones like NetSurf, bless its heart, don't have enough marketshare (or, currently, features) to rate; Trident in Internet Explorer 11 is intentionally obsolete; and the rest are too deficient to be anywhere near usable (Dillo, etc.).

So this means Chromium arrogates more browsershare to itself, and Firefox will continue to be the second-class citizen until it, too, has too small a marketshare to be relevant. Then Google has eaten the Web. And we are worse off for it. Bet Mozilla's reconsidering that stupid embedding decision now.
Posted over 5 years ago by Nick Cameron
In a few days the 2018 edition is going to roll out, and that will include some new framing around Rust's tooling. We've got a core set of developer tools which are stable and ready for widespread use. We're going to have a blog post all about that, but for now I wanted to address the status of the RLS, since when I last blogged about a 1.0 pre-release there was a significant sentiment that it was not ready (and given the expectations that a lot of people have, we agree).

The RLS has been in 0.x-stage development. We think it has reached a certain level of stability and usefulness. While it is not at the level of quality you might expect from a mature IDE, it is likely to be useful for a majority of users. The RLS is tightly coupled with the compiler, and as far as backwards compatibility is concerned, that is the important thing. So from the next release, the RLS will share a version number with the Rust distribution. We are not claiming this as a '1.0' release; work is certainly not finished, but we think it is worth taking the opportunity of the 2018 edition to highlight the RLS as a usable and useful tool.

In the rest of this blog post I'll go over how the RLS works in order to give you an idea of what works well and what does not, and where we are going (or might go) in the future.

Background

The RLS is a language server for Rust - it is meant to handle the 'language knowledge' part of an IDE (cf. editing, user interaction, etc.). The concept is that rather than having to develop Rust support from scratch in each editor or IDE, you can do it once in the language server and each editor can be a client. This is a recent approach to IDE development, in contrast to the approach of IntelliJ, Eclipse, and others, where the IDE is designed to make language support pluggable, but language support is closely tied to a specific IDE framework.

The RLS integrates with the Rust compiler, Cargo, and Racer to provide data. Cargo is used as a source of data for orchestrating builds. The compiler provides data for connecting references to definitions, and about types and docs (which is used for 'go to def', 'find all references', 'show type', etc.). Racer is used for code completion (and also to supply some docs). Racer can be thought of as a mini compiler which does as little as possible to provide code completion information as fast as possible.

The traditional approach to IDEs, and how Rust support in IntelliJ works, is to build a completely new compiler frontend, optimised for speed and incremental compilation. This compiler provides enough information to provide the IDE functionality, but usually doesn't do any code generation. This approach is much easier in fairly simple languages like Java, compared to Rust (macros, modules, and the trait system all make this a lot more complex).

There are trade-offs to the two approaches: using a separate compiler is fast, and functionality can be limited to ensure it is fast enough. However, there is a risk that the two compilers do not agree on how to compile a program; in particular, covering the whole of a language like Rust is difficult, and so completeness can be an issue. Maintaining a separate compiler also takes a lot of work.

In the future, we hope to further optimise the Rust compiler for IDE cases so that it is fast enough that the user never has to wait, and to use the compiler for code completion. We also want to work with Cargo a bit differently so that there is less duplication of logic between Cargo and the RLS.
Current status

For each feature of the RLS, I measure its success along two axes: is it fast enough, and is it complete (that is, does it work for all code). There are also non-functional issues of resource usage (how much battery and CPU the RLS is using), how often the RLS crashes, etc.

Go to definition

This is usually fast enough: if the RLS is ready, then it is pretty much instant. For large crates, it can take too long for the RLS to be ready, and thus we are not fast enough. However, usually using slightly stale data for 'go to def' is not a problem, so we're ok. It is fairly complete. There are some issues around macros - if a definition is created by a macro, then we often have trouble. 'Go to def' is not implemented for lifetimes, and there are some places we don't have coverage (inside where clauses was recently fixed).

Show type

Showing types and documentation on hover has almost the same characteristics as 'go to definition'.

Rename

Renaming is similar to 'find all references' (and 'go to def'), but since we are modifying the user's code, there are some more things that can go wrong, and we want to be extra conservative. It is therefore a bit less complete than 'go to def', but similarly fast.

Code completion

Code completion is generally pretty fast, but often incomplete. This is because method dispatch in Rust is really complicated! Eventually, we hope that using the compiler for code completion rather than Racer will solve this problem.

Resource usage

The RLS is typically pretty heavy on the CPU. That is because we prioritise having results quickly over minimising CPU usage. In the future, making the compiler more incremental should give big improvements here.

Crashes

The RLS usually only crashes when it disagrees with Cargo about how to build a project, or when it exercises a code path in the compiler which would not be used by a normal compile, and that code path has a bug. While crashes are more common than I'd like, they're a lot rarer than they used to be, and should not affect most users.

Project structure

There is a remarkable variety in the way a Rust project can be structured. Multiple crates can be arranged in many ways (using workspaces, or not), build scripts and procedural macros cause compile-time code execution, and there are Cargo features, different platforms, tests, examples, etc. This all interacts with code which is edited but not yet saved. Every different configuration can cause bugs. I think we are mostly doing well here; as far as I know there are no project structures to avoid (but this has been a big source of trouble in the past).

Overall

The RLS is clearly not done. It's not in the same league as IDE support for more mature languages. However, I think that it is at a stage where it is worth trying for many users. Stability is good enough - it's unlikely you'll have a bad experience. It does somewhat depend on how you use an IDE: if you rely heavily on code completion (in particular, if you use code completion as a learning tool), then the RLS is probably not ready. However, we think we should encourage new users to Rust to try it out.

So, while I agree that the RLS is not 'done', neither is it badly unstable, likely to disappear, or lacking in basic functionality. For better or worse, 1.0 releases seem to have special significance in the Rust community. I hope the version numbering decision sends the right message: we're ready for all Rust users to use the RLS, but we haven't reached 'mission accomplished' (well, maybe in a 'George W Bush' way).
More on that version number

The RLS will follow the Rust compiler's version number, i.e., the next release will be 1.31.0. From a strict semver point of view this makes sense, since the RLS is only compatible with its corresponding Rust version, so incrementing the minor version with each Rust release is the right thing to do. By starting at 1.31, we're deliberately avoiding the 1.0 label.

In terms of readiness, it's important to note that the RLS is not a user-facing piece of software. I believe the 1.x version number is appropriate in that context - if you want to build an IDE, then the RLS is stable enough to use as a library. However, it is lacking some user-facing completeness, and so an IDE built using the RLS should probably not use the 1.0 number (our VSCode extension will keep using 0.x).

The future

There's been some discussion about how best to improve the IDE experience in Rust. I believe the language server approach is the correct one, but there are several options to make progress:

- continue making incremental improvements to the compiler and RLS, moving towards compiler-driven code completion;
- use an alternate compiler frontend (such as Rust analyzer);
- improve Racer and continue to rely on it for code completion;
- some hybrid approach using more than one of these ideas.

When assessing these options, we need to take into account the likely outcome, the risk of something bad happening, the amount of work needed, and the long-term maintenance burden.

The main downside of the current path is the risk that the compiler will never get fast enough to support usable code completion. Implementation is also a lot of work; however, it would mostly help with compile time issues in general. With the other approaches there is a risk that we won't get the completeness needed for useful code completion. The implementation work is again significant, and depending on how things pan out, there is a risk of much costlier long-term maintenance.

I've been pondering the idea of a hybrid approach: using the compiler to provide information about definitions (and naming scopes), and either Racer or Rust Analyzer to do the 'last mile' work of turning that into code completion suggestions (and possibly resolving references too). That might mean getting the best of both worlds - the compiler can deal with a lot of complexity where speed is not as necessary, and the other tools get a helping hand with the stuff that has to be done quickly.

Orthogonally, there is also work planned to better integrate with Cargo and to support more features, as well as some 'technical debt' issues, such as better testing.
Posted over 5 years ago by Andre Vrignaud
Today, we’re making available an early developer preview of a browser for the Magic Leap One device. This browser is built on top of our Servo engine technology and shows off high quality 2D graphics and font rendering through our WebRender web rendering library, and more new features will soon follow.

While we only support basic 2D pages today and have not yet built the full Firefox Reality browser experience and published this into the Magic Leap store, we look forward to working alongside our partners and community to do that early in 2019! Please try out the builds, provide feedback, and get involved if you’re interested in the future of mixed reality on the web in a cutting-edge standalone headset. And for those looking at Magic Leap for the first time, we also have an article on how the work was done.
Posted over 5 years ago
encoding_rs is a high-decode-performance, low-legacy-encode-footprint and high-correctness implementation of the WHATWG Encoding Standard written in Rust. In Firefox 56, encoding_rs replaced uconv as the character encoding library used in Firefox. This wasn’t an addition of a component but an actual replacement: uconv was removed when encoding_rs landed. This writeup covers the motivation and design of encoding_rs, as well as some benchmark results.

Additionally, encoding_rs contains a submodule called encoding_rs::mem that’s meant for efficient encoding-related operations on UTF-16, UTF-8, and Latin1 in-memory strings, i.e., the kind of strings that are used in Gecko C++ code. This module is discussed separately after describing encoding_rs proper. The C++ integration of encoding_rs is not covered here and is covered in another write-up instead.

TL;DR

- Rust’s borrow checker is used with on-stack structs that get optimized away to enforce an “at most once” property that matches reads and writes to buffer space availability checks in legacy CJK converters. Legacy CJK converters are the most risky area in terms of memory-safety bugs in a C or C++ implementation.
- Decode is very fast relative to other libraries, with the exception of some single-byte encodings on ARMv7. Particular effort has gone into validating UTF-8 and converting UTF-8 to UTF-16 efficiently.
- ASCII runs are handled using SIMD when it makes sense. There is tension between making ASCII even faster vs. making transitions between ASCII and non-ASCII more expensive. This tension is the clearest when encoding from UTF-16, but it’s there when decoding, too.
- By default, there is no encode-specific data other than 32 bits per single-byte encoding. This makes legacy CJK encode extremely slow by default relative to other libraries, but still fast enough for the browser use cases. That is, the amount of text one could reasonably submit at a time in a form submission encodes so fast even on a Raspberry Pi 3 (standing in for a low-end phone) that the user will not notice.
- Even with only 32 bits of encode-oriented data, multiple single-byte encoders are competitive with ICU, though only windows-1252 applied to ASCII or almost-ASCII input is competitive with Windows system encoders.
- Faster CJK legacy encode is available as a compile-time option. But ideally, you should only be using UTF-8 for output anyway.

(If you just want to see the benchmarks and don’t have time for the discussion of the API and implementation internals, you can skip to the benchmarking section.)

Scope

Excluding the encoding_rs::mem submodule, which is discussed after encoding_rs proper, encoding_rs implements the character encoding conversions defined in the Encoding Standard as well as the mapping from labels (i.e. strings in protocol text that identify encodings) to encodings. Specifically, encoding_rs does the following:

- Decodes a stream of bytes in an Encoding Standard-defined character encoding into valid aligned native-endian in-RAM UTF-16 (units of u16).
- Encodes a stream of potentially-invalid aligned native-endian in-RAM UTF-16 (units of u16) into a sequence of bytes in an Encoding Standard-defined character encoding as if the lone surrogates had been replaced with the REPLACEMENT CHARACTER before performing the encode. (Gecko’s UTF-16 is potentially invalid.)
- Decodes a stream of bytes in an Encoding Standard-defined character encoding into valid UTF-8.
- Encodes a stream of valid UTF-8 into a sequence of bytes in an Encoding Standard-defined character encoding. (Rust’s UTF-8 is guaranteed-valid.)
- Does the above in streaming (input and output split across multiple buffers) and non-streaming (whole input in a single buffer and whole output in a single buffer) variants.
- Avoids copying (borrows) when possible in the non-streaming cases when decoding to or encoding from UTF-8.
- Resolves textual labels that identify character encodings in protocol text into type-safe objects representing those encodings conceptually.
- Maps the type-safe encoding objects onto strings suitable for returning from document.characterSet.
- Validates UTF-8 (in common instruction set scenarios a bit faster for Web workloads than the Rust standard library; hopefully will get upstreamed some day) and ASCII.

Notably, the JavaScript APIs defined in the Encoding Standard are not implemented by encoding_rs directly. Instead, they are implemented in Gecko as a thin C++ layer that calls into encoding_rs.
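Before going further, here is a minimal sketch of what label resolution and a non-streaming decode look like in practice (the Shift_JIS byte string and the expected output in the comments are my own illustration, not from this write-up):

    use encoding_rs::{Encoding, SHIFT_JIS};

    fn main() {
        // Resolve a label from protocol text per the Encoding Standard:
        let enc: &'static Encoding =
            Encoding::for_label(b"sjis").expect("label should resolve");
        assert_eq!(enc, SHIFT_JIS);

        // Non-streaming decode to UTF-8 via a convenience method on Encoding.
        // The return value is (decoded text as a copy-on-write string, the
        // encoding actually used after BOM sniffing, whether errors occurred).
        let (text, actual, had_errors) = enc.decode(b"\x83\x6E\x83\x8D\x81\x5B");
        assert_eq!(actual, SHIFT_JIS);
        assert!(!had_errors);
        println!("{}", text); // ハロー
    }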
Why is a Character Encoding Conversion Library Even Needed Anymore?

The Web is UTF-8 these days and Rust uses UTF-8 as the in-RAM Unicode representation, so why is a character encoding conversion library even needed anymore? The answer is, of course, “for legacy reasons”.

While the HTML spec requires the use of UTF-8 and the Web is over 90% UTF-8 (according to W3Techs, whose methodology is questionable considering that they report e.g. ISO-8859-1 separately from windows-1252 and GB2312 separately from GBK even though the Web Platform makes no such distinctions, but Google hasn’t published their numbers since 2012), users still need to access the part of the Web that has not migrated to UTF-8 yet. That part does not consist only of ancient static pages, either. For example, in Japan there are still news sites that publish new content every day in Shift_JIS. Over here in Finland, I do my banking using a Web UI that is still encoded in ISO-8859-15.

Another side of the legacy is inside the browser engine. Gecko, JavaScript and the DOM API originate from the 1990s, when the way to represent Unicode in RAM was in 16-bit units, as can also be seen in other software from that era, such as Windows NT, Java, Qt and ICU. (Unicode was formally extended beyond 16 bits in Unicode 2.0 in 1996, but non-Private Use Characters were not assigned outside the Basic Multilingual Plane until Unicode 3.1 in 2001.)

Why a Rewrite?

Regardless of the implementation language, the character encoding library in Gecko was in need of a rewrite for three reasons:

- The addition of Rust code in Firefox brought about the need to be able to convert to and from UTF-8 directly, and in terms of binary size, it didn’t make sense to have distinct libraries for converting to and from UTF-16 and for converting to and from UTF-8. Instead, a unified library using the same lookup tables for both was needed. The old code wasn’t designed to yield both UTF-16-targeting and UTF-8-targeting machine code from the same source. The addition of an efficient capability to decode to UTF-8 or to encode from UTF-8 would have involved a level of change comparable to a rewrite.
- The old library was crufty enough that it was easier to make correctness improvements by the means of a rewrite than by the means of incremental fixes. In Firefox 43, I had already rewritten the Big5 decoder and encoder in C++, because a rewrite was easier than modifying the old code. In that particular case, the old code used the Private Use Area (PUA) of the Basic Multilingual Plane (BMP) for Hong Kong Supplementary Character Set (HKSCS) characters. However, after the old code was written, HKSCS characters had been assigned proper code points in Unicode, but many of the assignments are on the Supplementary Ideographic Plane (Plane 2). When a fundamental assumption, such as all the characters in an encoding mapping to the BMP, no longer holds, a rewrite is easier than an incremental change. As another example (that showed up after the initial rewrite proposal but before the implementation got properly going), the ISO-2022-JP decoder had an XSS vulnerability that was difficult to fix without restructuring the existing code. I actually tried to write a patch for the old code and gave up. In general, the code structure of the old multi-byte decoders differed from the spec text so much that it would have been harder to try to figure out if the code does what the spec requires than to write new code according to the spec.
- The old code was written at a time when the exact set of behaviors that Web-exposed character encodings exhibit wasn’t fully understood. For this reason, the old code had generality that is no longer useful now that we know the full set of Web-exposed legacy encodings and can be confident that there will be no additional legacy encodings introduced with additional behaviors anymore. As the most notable example, the old code assumed that the lower half of single-byte encodings might not be ASCII. By the time of planning encoding_rs, single-byte encodings whose lower half wasn’t ASCII had already been removed as part of previous Encoding Standard-compliance efforts. Some of the multi-byte encoding handling code also had configurability for the single-byte mode that allowed for non-ASCII single-byte mode. However, some multi-byte encodings had already been migrated off the generic two-byte encoding handling code years ago. There had been generic two-byte encoding handling code, but it no longer made sense when only EUC-KR remained as an encoding exhibiting the generic characteristics. Big5 was able to decode to Plane 2, GBK had grown four-byte sequences as part of the evolution to GB18030, EUC-JP had grown support for three-byte sequences in order to support JIS X 0212, and Shift_JIS never had the EUC structure to begin with and had single-byte half-width katakana. Even EUC-KR itself had deviated from the original EUC structure by being extended to support all precomposed Hangul syllables (not just the ones in common use) in windows-949.

When a rewrite made sense in any case, it made sense to do the rewrite in Rust, because a rewrite of a clearly identifiable subsystem is exactly the kind of thing that is suitable for rewriting in Rust, and the problem domain could use memory-safety. The old library was created in early 1999, but it still had a buffer overrun discovered in it in 2016 (in code added in 2001 and 2002). This shows that the notion that code written in a memory-unsafe language becomes safe by being “battle-hardened” if it has been broadly deployed for an extended period of time is a myth. Memory-safety needs a systematic approach. Calendar time and broad deployment are not sufficient to turn unsafe code into safe code.

(The above-mentioned bug discovered in 2016 wasn’t the last uconv security bug to be fixed.
In 2018, a memory-safety-relevant integer overflow bug was discovered in uconv after uconv had already been replaced with encoding_rs in non-ESR Firefox, but uconv was still within security support in ESR. However, that bug was in the new Big5 code that I wrote for Firefox 43, so it can’t be held against the ancient uconv code. I had fixed the corresponding encoding_rs bug before encoding_rs landed in Firefox 56. The uconv bug was fixed in Firefox ESR 52.7.)

Why not ICU or rust-encoding?

As noted above, a key requirement was the ability to decode to and from both UTF-16 and UTF-8, but ICU supports only decoding to and from UTF-16 and rust-encoding supports only decoding to and from UTF-8. Perhaps one might argue that pivoting via another UTF would be fast enough, but experience indicated that pivoting via another UTF posed at least a mental barrier: even after the benefits of UTF-8 as an in-memory Unicode representation were known, Gecko subsystems had been written to use UTF-16, because that was what uconv decoded to.

A further problem with ICU is that it does not treat the Encoding Standard as its conformance target. Chrome patches ICU substantially for conformance. I didn’t want to maintain a similar patch set in the Gecko context and instead wanted a library that treats the Encoding Standard as its conformance target.

The invasiveness of the changes to rust-encoding that would have been needed to meet the API design, performance and UTF-16 targeting goals would have been large enough that it made sense to pursue them in a new project instead of trying to impose the requirements onto an existing project.

API Design Problems

In addition to internal problems, uconv also had a couple of API design problems.

First, the decoder API lacked the ability to signal the end of the stream. This meant that there was no way for the decoder to generate a REPLACEMENT CHARACTER when the input stream ended with an incomplete byte sequence. It was possible for the caller to determine from the status code if the last buffer passed to the decoder ended with an incomplete byte sequence, but then it was up to the caller to generate the REPLACEMENT CHARACTER in that situation, even though the decoder was generally expected to provide this service. As a result, only one caller in the code base, the TextDecoder implementation, did the right thing. Furthermore, even though the encoder side had an explicit way to signal the end of the stream, it was a separate method, leading to more complexity for callers than just being able to say that a buffer is the last buffer.

Additionally, the API contract was unclear on whether it was supposed to fill buffers exactly, potentially splitting a surrogate pair across buffer boundaries, or whether it was supposed to guarantee output validity on a per-method-call basis. In a situation where the input and output buffers were exhausted simultaneously, it was unspecified whether the converter should signal that the input was exhausted or that the output was exhausted.

In cases where it wasn’t the responsibility of the converter to handle the replacement of malformed byte sequences when decoding or unmappable characters when encoding, the API left needlessly much responsibility to the caller to advance over the faulty input and to figure out what the faulty input was in the case where that mattered, i.e. when encoding and producing numeric character references for unmappable characters.
Character encoding conversion APIs tend to exhibit common problems, so the above uconv issues didn’t make uconv particularly flawed compared to other character encoding conversion APIs out there. In fact, to uconv’s credit, at least in the form that it had evolved into by the time I got involved, given enough output space uconv always consumed all the input provided to it. This is very important from the perspective of API usability.

It’s all too common for character encoding conversion APIs to backtrack if the input buffer ends with an incomplete byte sequence and to report the incomplete byte sequence at the end of the input buffer as not consumed. This leaves it to the caller to take those unconsumed bytes and to copy them to the start of the next buffer so that they can be completed by the bytes that follow. Even worse, sometimes this behavior isn’t documented and is up to the caller of the API to discover by experimentation. This behavior also imposes a, typically undocumented, minimum input buffer size, because the input buffer has to be large enough for at least one complete byte sequence to fit. If the input trickles in byte by byte, it’s up to the caller to arrange them into chunks large enough to contain a complete byte sequence.

Sometimes, the API design problem described in the previous paragraph is conditional on requesting error reporting. When I was writing the Validator.nu HTML Parser, I discovered that the java.nio.charset character encoding conversion API was well-behaved when it was asked to handle errors on its own, but when the caller asked for the errors to be reported, the behavior undocumentedly changed to not consuming all the input offered even if there was enough output space. This was because the error reporting mechanism sought to designate the exact bytes in error by giving the caller the number of erroneous bytes corresponding to a single error. In order to make a single number make sense, the bytes always had to be counted backwards from the current position, which meant that the current position had to be placed such that it was at the end of the erroneous sequence; additionally, the API sought to make it so that the entire erroneous sequence was in the buffer provided and not partially in a past, already discarded buffer.

Additionally, as a more trivial to describe matter, but as a security-wise potentially very serious matter, some character encoding conversion APIs offer to provide a mode that ignores errors. Especially when decoding, and especially in the context of input such as HTML that has executable (JavaScript) and non-executable parts, silently dropping erroneous byte sequences instead of replacing them with the REPLACEMENT CHARACTER is a security problem. Therefore, it’s a bad idea for a character encoding conversion API to offer a mode where errors are neither signaled to the caller nor replaced with the REPLACEMENT CHARACTER.

Finally, some APIs fail to provide a high-performance streaming mode where the caller is responsible for output buffer allocation. (This means two potential failures: first, failure to provide a streaming mode and, second, providing a streaming mode but the converter seeks to control the output buffer allocation.)

In summary, in my experience, common character encoding conversion API design problems are the following:
- Failure to provide a streaming mode. E.g. the kernel32 conversion APIs.
- In streaming mode, failure to let the caller signal the end of the stream. E.g. the uconv decode API and Qt. (.NET can signal this, but the documentation says the converter ignores the invalid bytes at the end in that case! I hope the docs are wrong.)
- In streaming mode, having a separate API entry point for signaling the end of the stream (as opposed to being able to flag a buffer as the last buffer), resulting in two API entry points that can generate output. E.g. the uconv encode API.
- In streaming mode, given sufficient output space, failure to consume all provided input. E.g. java.nio.charset in error reporting mode, rust-encoding and iconv.
- In streaming mode, seeking to identify which bytes were in error but doing so with too simplistic a mechanism, leading to also having the problem from the previous item. E.g. java.nio.charset.
- In streaming mode, causing memory allocation when a conversion call is on the stack (as opposed to letting the caller be fully in charge of allocating buffers). E.g. Qt, WebKit and rust-encoding.
- In streaming mode, failure to guarantee that the exhaustion of the input buffer is the condition that is reported if both the input and output buffers are exhausted at the same time. E.g. uconv.
- In streaming mode, seeking to fill the output buffer fully (even if doing so e.g. splits a surrogate pair) instead of guaranteeing that the output is valid on a per-buffer basis. E.g. ICU by documentation; many others are silent on this matter in documentation, so who knows.
- Providing a mode that silently ignores erroneous input sequences. E.g. rust-encoding, java.nio.charset.

All but the last item are specific to a streaming mode. Streaming is hard.

Other Design Considerations

There are other API design considerations that would be unfair to label as “problems”, but that are still very relevant to designing a new API. These relate mainly to error handling and byte order mark (BOM) handling.

Replacement of Errors

It is typical for character encoding conversion APIs to treat error handling as a mode that is set on a converter object, as opposed to treating error handling as a different API entry point. API-wise it makes sense to have different entry points in order to have different return values for the two cases. Specifically, when the converter handles errors, the status of the conversion call cannot be that conversion stopped on an error for the caller to handle. Additionally, when the converter handles errors, it may make sense to provide a flag that indicates whether there were errors even though they were automatically handled.

Implementation-wise, experience suggests that baking error handling into each converter complicates code considerably and adds opportunities for bugs. Making the converter implementation always signal errors, and having an optional wrapper that deals with those errors so that the application developer doesn’t need to, leads to a much cleaner design. This design is a natural match for exposing different entry points: one entry point goes directly to the underlying converter and the other goes through the wrapper.

BOM Handling

BOM sniffing is subtle enough that it is a bad idea to leave it to the application. It’s more robust to bake it into the conversion library. In particular, getting BOM sniffing right when bytes arrive one at a time is not trivial for applications to handle. Like replacement of errors, different BOM handling modes can be implemented as wrappers around the underlying converters.
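Jumping ahead a little to the API described below, in encoding_rs these BOM handling modes surface as three separate decoder factories rather than a mode flag; a quick sketch using the crate's method names:

    use encoding_rs::UTF_8;

    fn main() {
        // Full BOM sniffing (the default): the decoder may morph into a
        // decoder for a different encoding if the input starts with a BOM.
        let _sniffing = UTF_8.new_decoder();
        // BOM removal: no morphing; only this encoding's own BOM is skipped.
        let _removing = UTF_8.new_decoder_with_bom_removal();
        // No BOM handling: a leading BOM is treated as ordinary data.
        let _verbatim = UTF_8.new_decoder_without_bom_handling();
    }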
Extensibility Especially in languages that provide a notion of inheritance, interfaces or traits, it is alluring for the API designer to seek to define an abstract conversion API that others can write more converters for. However, in the case of the Web, the set of encodings is closed and includes only those that are defined in the Encoding Standard. As far as the use cases in the Web context go, extensibility is not needed. On the contrary, especially in a code base that is also used in a non-Web context, like Gecko is used in Thunderbird in the email context, it is a feature that, on the Web side, we can be confident that a type representing an encoding defined in the Encoding Standard can't exhibit behaviors from outside the Encoding Standard. By design, encoding_rs is not extensible, so an encoding_rs Encoding does not represent any imaginable character encoding but instead represents a character encoding from the Encoding Standard. For example, we know from the type that we don't accidentally have a UTF-7 decoder in Gecko code that has Web expectations, even though Thunderbird contains a UTF-7 decoder in its codebase. (If you are interested in decoding email in Rust, there is a crate that wraps encoding_rs, adds UTF-7 decoding and maintains a type distinction between Web encodings and email encodings.) Additionally, in the context of Rust and its Foreign Function Interface (FFI), it helps that references are references to plain structs and not trait objects. Whereas C++ puts a vtable pointer on the objects, allowing pointers to polymorphic types to have the same size as C pointers, Rust's type erasure puts the vtable pointer in the reference. A Rust reference to a struct has the same machine representation as a plain (non-null) C pointer. A Rust reference to a trait-typed thing is actually two pointers: one to the instance and another to the vtable appropriate for the concrete type of the instance. Since interoperability with C++ is a core design goal for encoding_rs, using the kind of types whose references are the same as C pointers avoids the problem of losing the vtable pointer when crossing the FFI boundary. Iterators vs. Slices Conceptually, a character encoding is a mapping from a stream of bytes onto a stream of Unicode scalar values and, in most cases, vice versa. Therefore, it would seem that the right abstraction for a converter is an iterator adaptor that consumes an iterator over bytes and yields Unicode scalar values (or vice versa). There are two problems with modeling character encoding converters as iterator adaptors. First, it leaves optimization to the compiler, when manual optimizations across runs of code units are desirable. Specifically, it is a core goal for encoding_rs to make ASCII handling fast using SIMD, and the compiler does not have enough information about the data to know to produce ASCII-sequence-biased autovectorization. Second, Rust iterators are ill-suited for efficient and (from the C perspective) idiomatic exposure over the FFI. The API style of uconv, java.nio.charset, iconv, etc., of providing input and output buffers of several code units at a time to the converter is friendly both to SIMD and to FFI (Rust slices trivially decompose to pointer and length in C). While this isn't 100% rustic like iterators, slices still aren't unrustic. The API Design This finally brings us to the actual API. There are three public structs: Encoding, Decoder and Encoder.
From the point of view of the application developer, these act like traits (or interfaces or superclasses, to use concepts from other languages) even though they are structs. Instead of using language implementation-provided vtables for dynamic dispatch, they internally have an enum that wraps private structs that are conceptually like subclasses. The use of a private enum for dispatch avoids vtable pointers in FFI, makes the hierarchy intentionally non-extensible (see above) and allows BOM sniffing to change what encoding a Decoder is a decoder for. There is one statically allocated instance of Encoding for each encoding defined in the Encoding Standard. These instances have publicly visible names that allow application code to statically refer to a specific encoding (commonly, you want to do this with UTF-8, windows-1252, and the replacement encoding). To find an Encoding instance dynamically at runtime based on a label obtained from protocol text, there is a static method fn Encoding::for_label(label: &[u8]) -> Option<&'static Encoding>. The Encoding struct provides convenience methods for non-streaming conversions. These are "convenience" methods in the sense that they are implemented on top of Decoder and Encoder. An application that only uses non-streaming conversions only needs to deal with Encoding and doesn't need to use Decoder and Encoder at all. Streaming API Decoder and Encoder provide streaming conversions and are allocated at runtime, because they encapsulate state related to the streaming conversion. On the Encoder side, only ISO-2022-JP is actually stateful, so most of the discussion here will focus on Decoder. Internally, the encoding-specific structs wrapped by Decoder are macroized to generate decode to UTF-8 and decode to UTF-16 from the same source code (likewise for Encoder). Even though Rust applications are expected to use the UTF-8 case, I'm going to give examples using the UTF-16 case, because it doesn't involve the distinction between &str and &[u8], which would distract from the more important issues. The fundamental function that Decoder provides is: fn decode_to_utf16_without_replacement(&mut self, src: &[u8], dst: &mut [u16], last: bool) -> (DecoderResult, usize, usize) This function wraps BOM sniffing around an underlying encoding-specific implementation that takes the same arguments and has the same return value. The Decoder-provided wrapper first exposes the input to a BOM sniffing state machine and, once the state machine gets out of the way, delegates to the underlying implementation. Decoder instances can't be constructed by the application directly. Instead, they need to be obtained from factory functions on Encoding. The factory functions come in three flavors for three different BOM sniffing modes: full BOM sniffing (the default), which may cause the Decoder to morph into a decoder for a different encoding than initially (using an enum for dispatch shows its usefulness here!), BOM removal (no morphing, but the BOM for the encoding itself is skipped) and without BOM handling. The struct is the same in all cases, but the different factory methods initialize the state of the BOM sniffing state machine differently. The method takes an input buffer (src) and an output buffer (dst), both of which are caller-allocated. The method then decodes bytes from src into Unicode scalar values that are stored (as UTF-16) into dst until one of the following three things happens: A malformed byte sequence is encountered. All the input bytes have been processed.
The output buffer has been filled so near capacity that the decoder cannot be sure that processing an additional byte of input wouldn't cause so much output that the output buffer would overflow. The return value is a tuple of a status indicating which one of the three reasons to return happened, how many input bytes were read and how many output code units were written. The status is a DecoderResult enumeration (possibilities Malformed, InputEmpty and OutputFull, corresponding to the three cases listed above). The output written into dst is guaranteed to be valid UTF-16, and the output after each call is guaranteed to consist of complete characters. (I.e. the code unit sequence for the last character is guaranteed not to be split across output buffers.) This implies that the output buffer must be long enough for an astral character to fit (two UTF-16 code units) and that the output buffer might not be fully filled. While it may seem wasteful not to fill the last slot of the output buffer in the common case, this design significantly simplifies the implementation while also simplifying callers by guaranteeing to the caller that it won't have to deal with split surrogate pairs. The boolean argument last indicates that the end of the stream is reached when all the bytes in src have been consumed. A Decoder object can be used to incrementally decode a byte stream. During the processing of a single stream, the caller must call the method zero or more times with last set to false and then call decode_* at least once with last set to true. If the decode_* call with last set to true returns InputEmpty, the processing of the stream has ended. Otherwise, the caller must call decode_* again with last set to true (or treat a Malformed result as a fatal error). Once the stream has ended, the Decoder object must not be used anymore. That is, you need to create another one to process another stream. Unlike with some other libraries that encourage callers to recycle converters that are expensive to create, encoding_rs guarantees that converters are extremely cheap to create. (More on this later.) When the decoder returns OutputFull, or when the decoder returns Malformed and the caller does not wish to treat it as a fatal error, the input buffer src may not have been completely consumed. In that case, the caller must pass the unconsumed contents of src to the method again upon the next call. Typically the application doesn't wish to do its own error handling and just wants errors to be replaced with the REPLACEMENT CHARACTER. For this use case, there is another method that wraps the previous method and provides the replacement. The wrapper looks like this: fn decode_to_utf16(&mut self, src: &[u8], dst: &mut [u16], last: bool) -> (CoderResult, usize, usize, bool) Notably, the status enum is different, because the case of malformed sequences doesn't need to be communicated to the application. Also, the return tuple includes a boolean flag to indicate whether there were errors. Additionally, there is a method for querying the worst-case output size given the current state of the decoder and the length of an input buffer. If the length of the output buffer is at least the worst case, the decoder guarantees that it won't return OutputFull. Identifying Malformed Sequences Initially, the plan was simply not to support applications that need to identify which input bytes were in error, because I thought that it wasn't possible to do so without complicating the API for everyone else.
However, very early into the implementation phase, I realized that it is possible to identify which bytes are in error without burdening applications that don't care, if the applications that want to know are responsible for remembering the last N bytes decoded, where N is relatively small. It turns out that N is 6. For a malformed sequence that corresponds to a single decode error (i.e. a single REPLACEMENT CHARACTER), a DecoderResult::Malformed(u8, u8) is returned. The first wrapped integer indicates the length of the malformed byte sequence. The second wrapped integer indicates the number of bytes that were consumed after the malformed sequence. If the second integer is zero, the last byte that was consumed is the last byte of the malformed sequence. The malformed bytes may have been part of an earlier input buffer, which is why remembering the recently consumed bytes is the responsibility of an application that wants to identify the bytes that were in error. The first wrapped integer can have values 1, 2, 3 or 4. The second wrapped integer can have values 0, 1, 2 or 3. The worst-case sum of the two is 6, which happens with ISO-2022-JP. Identifying Unmappable Characters When encoding to an encoding other than UTF-8 (the Encoding Standard does not support encoding into UTF-16LE or UTF-16BE, and there is one Unicode scalar value that cannot be encoded into gb18030), it is possible that the encoding cannot represent a character that is being encoded. In this case, instead of returning backward-looking indices, EncoderResult::Unmappable(char) wraps the Unicode scalar value that needs to be replaced with a numeric character reference when performing replacement. In the case of ISO-2022-JP, this Unicode scalar value can be the REPLACEMENT CHARACTER instead of a value actually occurring in the input if the input contains U+000E, U+000F, or U+001B. This asymmetry between how errors are signaled in the decoder and encoder scenarios makes the signaling appropriate for each scenario instead of optimizing for consistency where consistency isn't needed. Non-Streaming API As noted earlier, Encoding provides non-streaming convenience methods built on top of the streaming functionality. Instead of being simply wrappers for the streaming conversion, the non-streaming methods first try to check if the input can be borrowed as output without conversion. For example, if the input is all ASCII and the encoding is ASCII-compatible, a Cow borrowing the input is returned. Likewise, the input is borrowed when the encoding is UTF-8 and the input is valid, or when the encoding is ISO-2022-JP and the input contains no escape sequences. Here's an example of a non-streaming conversion method: fn decode_with_bom_removal<'a>(&'static self, bytes: &'a [u8]) -> (Cow<'a, str>, bool) (Cow is a Rust standard library type that wraps either an owned type or a corresponding borrowed type, so a heap allocation and copy can be avoided if the caller only needs a borrow. E.g., Cow<'a, str> wraps either a heap-allocated string or a pointer and a length designating a string view into memory owned by someone else. The lifetime 'a indicates that the lifetime of borrowed output depends on the lifetime of the input.) Internals Internally, there are five guiding design principles. First, for the legacy CJK encodings, the conversions to and from UTF-8 and UTF-16 should come from the same source code instead of being implemented twice.
(For the UTFs and for single-byte encodings, there are enough optimization opportunities from having two implementations that it doesn't make sense to keep those unified for the sake of unification.) Second, since Web content is either markup, which is runs of ASCII mixed with runs of potentially non-ASCII, or CSS and JS, which are almost entirely ASCII, handling of the ASCII range should be very fast and use SIMD where possible. Third, small binary size matters more than the speed of encode into legacy encodings. Fourth, for performance, everything should be inlined into the conversion loop. (This rules out abstractions that would involve virtual calls from within the conversion loop.) Fifth, the instantiation of converters should be very efficient: just a matter of initializing a few machine words. The instantiation should not read from the file system (other than the system lazily paging in the binary for encoding_rs itself), run decompression algorithms, allocate memory on the heap or compute derived lookup tables from other lookup tables. Abstracting over UTF-8 and UTF-16 Even though in principle compile-time abstraction over UTF-8 and UTF-16 is a matter of monomorphizing over u8 and u16, handling the two cases using generics would be more complicated than handling them using macros. That's why it's handled using macros. The conversion algorithms are written as blocks of code that are inputs to macros that expand to provide the skeleton conversion loop and fill in the encoding-specific blocks of code. In the skeleton in the decode case, one instantiation uses a Utf8Destination struct and another uses a Utf16Destination struct, both of which provide the same API for writing into them. In the encode case, the source struct varies similarly. Using Rust Lifetimes to Match Buffer Accesses to Space Checks The old code in uconv was relatively ad hoc in how it accessed the input and output buffers. It maybe did stuff, advanced some pointers, checked if the pointers reached the end of the buffer and maybe even backed off a bit in some places. It didn't have an overarching pattern to how space availability was checked and matched to memory accesses so that no access could happen without a space check having happened first. For encoding_rs, I wanted to make sure that buffer access only goes forwards without backtracking more than the one byte that might get unread in error cases, that no read happens without checking that there is still data to be read, and that no write happens without checking that there is space in the output buffer. Rust's lifetimes can be used to enforce an "at most once" property. Immediately upon entering a conversion function, the input and output slices are wrapped in source and destination structs that maintain the current read or write position. I'll use the write case as the example, but the read case works analogously. A decoder that only ever produces characters in the Basic Multilingual Plane uses a BMP space checking method on the destination that takes the destination as a mutable reference (&mut self). If the destination is a UTF-8 destination, the method checks that there is space for at least three additional bytes. If the destination is a UTF-16 destination, the method checks that there is space for at least one additional code unit. If there is enough space, the caller receives a BMP handle whose lifetime is tied to the lifetime of the destination due to the handle containing the mutable reference to the destination. A mutable reference in Rust means exclusive access.
Since a mutable reference to the destination is hidden inside the handle, no other method can be called on the destination until the handle goes out of scope. The handle provides a method for writing one BMP scalar value. That method takes the handle's self by value, consuming the handle and preventing reuse. The general concept is that at the top of the loop, the conversion loop checks availability of data at the source and obtains a read handle (or returns from the conversion function with InputEmpty) and then checks availability of space at the destination and obtains a write handle (or returns from the conversion function with OutputFull). If neither check caused a return out of the conversion function, the conversion loop now hasn't read or written either buffer, but it can be fully confident that it can successfully read from the input at most once and write a predetermined amount of units to the output at most once during the loop body. The handles go out of scope at the end of the loop body, and once the loop starts again, it's time to check for input availability and output space availability again. As an added twist, the read operation yields not only a byte of input but also an unread handle for unreading it, because in various error cases the spec calls for prepending input that was already read back to the input stream. In practice, all the cases in the spec can be handled by being able to unread at most one unit of input, even though the spec text occasionally prepends more than one unit. Optimizing ASCII and Multibyte Sequences In practice, the ISO-2022-JP converters, which don't need to be fast for Web use cases, use the above concept in its general form. For the ASCII-compatible encodings that are actually performance-relevant for Web use cases, there are a couple of elaborations. First, the UTF-8 destination and the UTF-16 destination know how to copy ASCII from a byte source in an efficient way that handles more than one ASCII character per register (either a SIMD register or even an ALU register). So the main conversion loop starts with a call to a method that first tries to copy ASCII from the source to the destination and then returns a non-ASCII byte and a write handle if there's space left in the destination. Once a non-ASCII byte is found, another loop is entered into that actually works with the handles. Second, the loop that works with the handles doesn't have a single scope per loop body for multi-byte encodings. Once we're done copying ASCII, the non-ASCII byte that we found is always a lead byte of a multi-byte sequence unless there is an error, and we are optimizing for the case where there is neither an error nor a buffer boundary. Therefore, it makes sense to start another scope that does the handle-obtaining space check choreography again in the hope that the next byte will be a valid trail byte given the lead byte that we just saw. Then there is a third, innermost loop for reading the byte after that. If that byte is non-ASCII, we can continue the middle loop as if the non-ASCII byte had come from the end of the initial ASCII fast path. If the byte is ASCII punctuation, we can spin in the innermost loop without trying to handle a longer ASCII run using SIMD, which would likely fail within CJK plain text. However, if we see non-punctuation ASCII, we can continue the outermost loop and go back to the ASCII fast path.
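The nested loop structure is easier to see in code. The following is an illustrative, compilable skeleton for a hypothetical EUC-style two-byte encoding, not encoding_rs's actual code; the handle choreography, buffer-boundary handling, unreading and the real SIMD fast path are all omitted or simplified:

// Hypothetical pair lookup; a real decoder indexes the mapping table here.
fn map_pair(_lead: u8, _trail: u8) -> u16 {
    0x4E00 // placeholder
}

fn decode(src: &[u8], dst: &mut Vec<u16>) -> Result<(), usize> {
    let mut i = 0;
    'outer: loop {
        // Outermost loop: the ASCII fast path (SIMD in the real code).
        let mut b;
        loop {
            if i == src.len() { return Ok(()); }
            b = src[i];
            if b >= 0x80 { break; }
            dst.push(u16::from(b));
            i += 1;
        }
        // Middle loop: b is expected to be the lead byte of a two-byte sequence.
        'middle: loop {
            if !(0xA1..=0xFE).contains(&b) { return Err(i); } // malformed
            i += 1;
            if i == src.len() { return Err(i); } // boundary handling omitted
            let trail = src[i];
            if !(0xA1..=0xFE).contains(&trail) { return Err(i); } // malformed
            dst.push(map_pair(b, trail));
            i += 1;
            // Innermost loop: spin over ASCII space/punctuation/digits
            // (bytes below the less-than sign) without going back to SIMD.
            loop {
                if i == src.len() { return Ok(()); }
                b = src[i];
                if b >= 0x80 { continue 'middle; } // next lead, no SIMD detour
                dst.push(u16::from(b));
                i += 1;
                if b >= b'<' { continue 'outer; } // markup-ish ASCII: back to SIMD
            }
        }
    }
}

Note how even this skeleton distinguishes between expecting a lead byte and expecting a trail byte purely by position in the program, not by a state variable.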
Not matching on a state variable indicating whether we're expecting a lead or a trail byte on a per-byte basis, and instead using the program counter for the state distinguishing between lead and trail byte expectations, is good for performance. However, it poses a new problem: what if the input buffer ends in the middle of a multi-byte sequence? Since we are using the program counter for state, the code for handling the trail byte in a two-byte encoding is only reachable by first executing the code for handling the lead byte, and since Rust doesn't have goto or a way to store continuations, after a buffer boundary we can't just restore the local variables and jump directly to the trail byte handling. To deal with this, the macro structure that allows the reuse of code for decoding both to UTF-8 and to UTF-16 also duplicates the block for handling the trail byte, such that the same block occurs between the method entry and the conversion loop. If the previous buffer ended in the middle of a byte sequence, the next call to the conversion function handles the trail of that sequence before entering the actual conversion loop. Optimizing UTF-8 The UTF-8 decoder does not use the same structure as the other multi-byte decoders. Dealing with invalid byte sequences in the middle of the buffer or valid byte sequences that cross a buffer boundary is implemented naïvely from the spec in a way that is instantiated via macro from the same code both when converting to UTF-8 and when converting to UTF-16. However, once that outer tier of conversion gets to a state where it expects the next UTF-8 byte sequence, it calls into fast-track code that only deals with valid UTF-8, and that code returns back to the outer tier, which is capable of dealing with invalid UTF-8 or partial sequences, when it discovers an incomplete sequence at the end of the buffer or an invalid sequence in the middle. This inner fast track is implemented separately for decoding UTF-8 to UTF-8 and for decoding UTF-8 to UTF-16. The UTF-8 to UTF-16 case is close to what one might expect from the above description of the legacy multibyte encodings. At the top of the loop, there is the call to the ASCII fast path that zero-extends ASCII to UTF-16 Basic Latin, multiple code units at a time, and then byte sequences that start with a non-ASCII lead byte are handled as three cases: two-byte sequence, three-byte sequence or four-byte sequence. Lookup tables are used to check the validity of the combination of lead byte and second byte, as explained below. The sequence is considered consumed only if it's found to be valid. The corresponding UTF-16 code units are then written to the destination as normal u16 writes. The UTF-8 to UTF-8 case is different. The input is read twice, but the writing is maximally efficient. First, a UTF-8 validation function is run on the input. This function only reads and doesn't write, and it uses an ASCII validation fast path that checks more than one code unit at a time using SIMD or multiple code units per ALU word. The UTF-8 validation function is the UTF-8 to UTF-16 conversion function with all the writes removed. After the validation, the valid UTF-8 run is copied to the destination using std::ptr::copy_nonoverlapping(), which is the Rust interface to LLVM memcpy(). This way, the writing, which is generally less efficient than reading, can be done maximally efficiently instead of being done on a byte-by-byte basis for non-ASCII, as would result from a read-once implementation.
(Note that in the non-streaming case when the input is valid, both the second read and the writing are avoided. More on that later.) It is not totally clear if this kind of double-reading is smart, since it is a pessimization for the 100% ASCII case. Intuitively, it should help the non-ASCII case, since even the non-ASCII parts can be written using SIMD. However, the 100% ASCII UTF-8 to UTF-8 streaming case, which copies instead of borrowing, runs on Haswell at about two thirds of memcpy() speed, while the 100% ASCII windows-1252 to UTF-8 case (which writes the SIMD vectors right away without re-reading) runs at about memcpy() speed. The hard parts of looping over potentially-invalid UTF-8 are: minimizing the performance impact of deciding if the lead byte is valid, minimizing the performance impact of deciding if the second byte is valid considering that its valid range depends on the lead byte, and avoiding misprediction of the length of the byte sequence representing the next scalar value. encoding_rs combines the solution for the first two problems. Once it's known that the lead byte is not ASCII, the lead byte is used as an index to a lookup table that yields a byte whose lower two bits are always zero and that has exactly one of the other six bits set to represent the following cases: the byte is not a legal lead byte; the lead byte is associated with a normal-range second byte; the lead byte for a three-byte sequence requires a special lower bound for the second byte; the lead byte for a three-byte sequence requires a special upper bound for the second byte; the lead byte for a four-byte sequence requires a special lower bound for the second byte; or the lead byte for a four-byte sequence requires a special upper bound for the second byte. The second byte is used as an index to a lookup table yielding a byte whose low two bits are always zero, whose bit in the position corresponding to the lead being illegal is always one, and whose other five bits are zero if the second byte is legal given the type of lead the bit position represents and one otherwise. When the bytes from the two lookup tables are ANDed together, the result is zero if the combination of lead byte and second byte is legal and non-zero otherwise. When a trail byte is always known to have the normal range, as the third byte in a three-byte sequence is, we can check that its most significant bit is one and its second-most significant bit is zero. Note how the ANDing described in the above paragraph always leaves the two least-significant bits of the AND result as zeros. We shift the third byte of a three-byte sequence right by six and OR it with the AND result from the previous paragraph. Now the validity of the three-byte sequence can be decided in a single branch: if the result is 0x2, the sequence is valid; otherwise, it's invalid. In the case of four-byte sequences, the number computed per the above is extended to 16 bits and the two most-significant bits of the fourth byte are masked and shifted to bit positions 8 and 9. Now the validity of the four-byte sequence can be decided in a single branch: if the result is 0x202, the sequence is valid; otherwise, it's invalid. The fast path checks that there are at least 4 bytes of input on each iteration, so the bytes of any valid byte sequence for a single scalar value can be read without further bound checks. The code does use branches to decide whether to try to match the bytes as a two-byte, three-byte or four-byte sequence.
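Here is a reconstruction of that trick in Rust. The bit assignments and the match-based "tables" are illustrative (encoding_rs uses actual 256-entry tables generated ahead of time), but the AND/OR/compare structure follows the description:

// Class of a lead byte: exactly one of six bits set, low two bits zero.
fn lead_class(lead: u8) -> u8 {
    match lead {
        0xC2..=0xDF | 0xE1..=0xEC | 0xEE..=0xEF | 0xF1..=0xF3 => 1 << 3, // normal second byte
        0xE0 => 1 << 4, // three-byte, special lower bound on second byte
        0xED => 1 << 5, // three-byte, special upper bound on second byte
        0xF0 => 1 << 6, // four-byte, special lower bound on second byte
        0xF4 => 1 << 7, // four-byte, special upper bound on second byte
        _ => 1 << 2,    // not a legal lead byte
    }
}

// Class of a second byte: the "illegal lead" bit is always set; each other
// bit is set iff the second byte is NOT legal for that kind of lead.
fn second_class(second: u8) -> u8 {
    let mut bits = 1 << 2;
    if !(0x80..=0xBF).contains(&second) { bits |= 1 << 3; }
    if !(0xA0..=0xBF).contains(&second) { bits |= 1 << 4; }
    if !(0x80..=0x9F).contains(&second) { bits |= 1 << 5; }
    if !(0x90..=0xBF).contains(&second) { bits |= 1 << 6; }
    if !(0x80..=0x8F).contains(&second) { bits |= 1 << 7; }
    bits
}

// Three-byte sequence: the AND is zero iff lead and second are compatible;
// folding in the third byte's two top bits makes 0x2 the unique valid value.
fn three_byte_ok(lead: u8, second: u8, third: u8) -> bool {
    ((lead_class(lead) & second_class(second)) | (third >> 6)) == 0x2
}

// Four-byte sequence: widen to 16 bits and move the fourth byte's two top
// bits to positions 8 and 9; 0x202 is the unique valid value.
fn four_byte_ok(lead: u8, second: u8, third: u8, fourth: u8) -> bool {
    let lead_and_second = u16::from(lead_class(lead) & second_class(second));
    (lead_and_second | u16::from(third >> 6) | (u16::from(fourth & 0xC0) << 2)) == 0x202
}

A two-byte sequence needs only the AND result to be zero, and the same classes reject overlong encodings, surrogates and values above U+10FFFF (0xC0 0x80, 0xE0 0x80, 0xED 0xA0, 0xF4 0x90 and so on) without any extra branches.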
I tried to handle the distinction between two-byte sequences and three-byte sequences branchlessly when converting UTF-8 to UTF-16. In that version, the mask applied to the lead byte, the mask used to zero out the bits of the third byte, and the shift amount (6 or 0) were all taken from lookup tables, so that the two-byte case could run through the three-byte code path. The result was slower than just having a branch to distinguish between two-byte sequences and three-byte sequences. Now that there is branching to categorize the sequence length, it becomes of interest to avoid repeating that branching unnecessarily. It's also of interest to avoid going back to the SIMD ASCII fast path when the next lead is not ASCII. After a non-ASCII byte sequence, instead of looping back to the ASCII fast path, the next byte is read and checked. After a two-byte sequence, the next lead is checked for ASCIIness. If it's not ASCII, the code loops back to the point where the SIMD ASCII path has just exited. I.e. there's a non-ASCII byte, as when exiting the ASCII SIMD fast path, but its non-ASCIIness was decided without SIMD. If the byte is an ASCII byte, it is processed and then the code loops back to the ASCII SIMD fast path. Obviously, this is far from ideal. Avoiding an immediate return to the ASCII fast path after a two-byte character works within a non-Latin-script word, but it doesn't really help to let one ASCII character signal a return to SIMD when that one ASCII character is a single space between two non-Latin words. Unfortunately, trying to be smarter about avoiding too-early looping back to the SIMD fast path would mean more branching, which itself has a cost. In the two-byte case, if the next lead is non-ASCII, looping back to immediately after the exit from the ASCII fast path means that the next branch is anyway the branch to check if the lead is for a two-byte sequence, so this works out OK for words in non-Latin scripts in the two-byte-per-character part of the Basic Multilingual Plane. In the three-byte case, however, looping back to the point where the ASCII SIMD fast path ends would first run the check for a two-byte lead, even though after a three-byte sequence the next lead is more likely to be for another three-byte sequence. Therefore, after a three-byte sequence, the first check performed on the next lead is to see if it, too, is for a three-byte sequence, in which case the code loops back to the start of the three-byte sequence processing code. Optimizing UTF-16LE and UTF-16BE UTF-16LE and UTF-16BE are rare enough on the Web that a browser can well get away with a totally naïve and slow from-the-spec implementation. Indeed, that's what landed in Firefox 56. However, when talking about encoding_rs, it was annoying to always have the figurative asterisk next to UTF-16LE and UTF-16BE to disclose slowness when the rest was fast. To get rid of the figurative asterisk, UTF-16LE and UTF-16BE decode is now optimized, too. If you read The Unicode Standard, you might be left with the impression that the difference between UTF-16 as an in-memory Unicode representation and UTF-16 as an interchange format is byte order. This is not the full story. There are three additional concerns. First, there is the concern of memory alignment. In the case of UTF-16 as an in-memory Unicode representation, a buffer of UTF-16 code units is aligned to start at a memory address that is a multiple of the size of the code unit. That is, such a buffer always starts at an even address.
When UTF-16 as an interchange format is read using a byte-oriented I/O interface, it may happen that a buffer starts at an odd address. Even on CPU architectures that don't distinguish between aligned and unaligned 16-bit reads and writes on the ISA layer, merely reinterpreting a pointer to bytes starting at an odd address as a pointer pointing to 16-bit units and then accessing it as if it were a normal buffer of 16-bit units is Undefined Behavior in C, C++, and Rust (as can in practice be revealed by autovectorization performed on the assumption of correct alignment). Second, there is the concern of buffers being an odd number of bytes in length, so special logic is needed to handle a split UTF-16 code unit at the buffer boundary. Third, there is the concern of unpaired surrogates, so even when decoding to UTF-16, the input can't just be copied into right alignment, potentially with byte order swapping, without inspecting the data. The structure of the UTF-16LE and UTF-16BE decoders is modeled on the structure of the UTF-8 decoders: there's a naïve from-the-spec outer tier that deals with invalid and partial sequences and an inner fast path that only deals with valid sequences. At the core of the fast path is a struct called UnalignedU16Slice that wraps *const u8, i.e. a pointer that can point to either an even or an odd address, and a length in 16-bit units. It provides a way to make the unaligned slice one code unit shorter (to exclude a trailing high surrogate when needed), a way to take a tail subslice and ways to read a u16 or, if SIMD is enabled, a u16x8 in a way that assumes the slice might not be aligned. It also provides a way to copy, potentially with endianness swapping, Basic Multilingual Plane code units to a plain aligned &mut [u16] until the end of the buffer or a surrogate code unit is reached. If SIMD is enabled, both the endianness swapping and the surrogate check are SIMD-accelerated. When decoding to UTF-16, there's a loop that first tries to use the above-mentioned Basic Multilingual Plane fast path and, once a surrogate is found, handles the surrogates on a per-code-unit basis and returns back to the top of the loop if there was a valid pair. When decoding to UTF-8, code copied and pasted from the UTF-16 to UTF-8 encoder is used. The difference is that instead of using &[u16] as the source, the source is an UnalignedU16Slice and, additionally, reads are followed with potential endian swapping. Additionally, unpaired surrogates are reported as errors in decode, while UTF-16 to UTF-8 encode silently replaces unpaired surrogates with the REPLACEMENT CHARACTER. If SIMD is enabled, SIMD is used for the ASCII fast path. Both when decoding to UTF-8 and when decoding to UTF-16, endianness swapping is represented by a trait parameter, so the conversions are monomorphized into two copies: one that swaps endianness and one that doesn't. This results in four conversion functions: opposite-endian UTF-16 to UTF-8, same-endian UTF-16 to UTF-8, opposite-endian UTF-16 to UTF-16, and same-endian UTF-16 to UTF-16. All of these assume the worst for alignment. That is, the code isn't monomorphized for the aligned and unaligned cases.
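As a sketch of what such reads can look like (illustrative names; UnalignedU16Slice's real implementation differs), a byte-wise copy sidesteps the alignment Undefined Behavior, and a trait parameter yields the monomorphized endianness variants:

trait Endian {
    fn to_host(v: u16) -> u16;
}
struct SameEndian;
struct OppositeEndian;
impl Endian for SameEndian {
    fn to_host(v: u16) -> u16 { v }
}
impl Endian for OppositeEndian {
    fn to_host(v: u16) -> u16 { v.swap_bytes() }
}

// Read the i-th 16-bit unit from a byte buffer that may start at an odd
// address; copying via a byte array avoids a misaligned *const u16 read.
fn read_u16<E: Endian>(bytes: &[u8], i: usize) -> u16 {
    let mut buf = [0u8; 2];
    buf.copy_from_slice(&bytes[i * 2..i * 2 + 2]);
    E::to_host(u16::from_ne_bytes(buf))
}

Monomorphizing the conversion loops over two such endian types is what produces the four conversion functions mentioned above without a per-unit branch on endianness.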
Unaligned access is fast on aarch64 and on the several most recent x86_64 microarchitectures, so optimizing the performance of UTF-16LE and UTF-16BE in the aligned case for Core2 Duo-era x86_64 or for ARMv7 at the expense of binary size and source code complexity would be a bit too much, considering that UTF-16LE and UTF-16BE performance doesn't even really matter for Web use cases. Optimizing x-user-defined Unlike the other decoders, the x-user-defined decoder doesn't have an optimized ASCII fast path. This is because the main remaining use case for x-user-defined is loading binary data via XMLHttpRequest in code written before proper binary data support via ArrayBuffers was introduced to JavaScript. (Note that when HTML is declared as x-user-defined via the meta tag, the windows-1252 decoder is used in place of the x-user-defined decoder.) When decoding to UTF-8, the byte length of the output varies depending on content, so the operation is not suitable for SIMD. The loop simply works on a per-byte basis. However, when decoding to UTF-16 with SIMD enabled, each u8x16 vector is zero-extended into two u16x8 vectors. A mask is computed by a lane-wise greater-than comparison to see which lanes were not in the ASCII range. The mask is used to retain the corresponding lanes from a vector with all lanes set to 0xF700, and the result is added to the original u16x8 vector. Portable SIMD (Nightly) Rust provides access to portable SIMD, which closely maps to LLVM's notion of portable SIMD. There are portable types, such as the u8x16 and u16x8 types used by encoding_rs. These map to SSE registers on x86 & x86_64 and NEON registers on ARMv7 & aarch64, for example. The portable types provide lane-wise basic arithmetic, bitwise operations, and comparisons in a portable manner and with generally predictable performance characteristics. Additionally, there are portable shuffles where the shuffle pattern is constant at compile time. The performance characteristics of shuffles rely heavily on the quality of implementation of specific LLVM back ends, so with shuffles it's a good idea to inspect the generated assembly. The portable types can be zero-cost transmuted into vendor-specific types in order to perform operations using vendor-specific intrinsics. This means that SIMD code can generally be written in a portable way and specific operations can be made even faster using vendor-specific operations. For example, checking if a u8x16 contains only ASCII can be done very efficiently on SSE2 and aarch64, so the SIMD "is this u8x16 ASCII?" operation in encoding_rs has vendor-specific specializations for SSE2 and aarch64. This is an amazing improvement over C. With C, an entire function / algorithm that uses SIMD ends up being written separately for each instruction set, using vendor intrinsics for everything, even the basic operations that are supported by practically all vendors. It often happens that such vendor-specific code is written only for x86/x86_64, with ARMv7 or aarch64 left as a todo and POWER, etc., completely ignored. Despite Rust making SIMD portable, performance tuning for specific architectures using conditional compilation to turn alternative implementations on or off is still needed. For example, because NEON on ARMv7 lacks an efficient "is this u8x16 ASCII?" check, using NEON for processing the ASCII runs in UTF-8 validation turned out not to be an improvement over ALU-only code on ARMv7, even though using SIMD in UTF-8 validation makes sense on x86 and x86_64.
On the other hand, the difference between using aligned or unaligned SIMD loads and stores is negligible on aarch64 (tested on ThunderX), so on that architecture encoding_rs uses unaligned loads and stores unconditionally. However, especially on Core2 Duo-era x86_64, the difference between using aligned access and using unaligned loads and stores with addresses that are actually aligned is very significant, so in the SSE2 case encoding_rs checks for alignment first and has four-way specializations for the four combinations of the source and destination being aligned or unaligned. As of June 2018, 20% of the Firefox x86/x86_64 release population was still on the kind of x86/x86_64 CPU where there's a substantial performance disparity between aligned and unaligned SIMD loads and stores with actually aligned addresses. Punctuation Loops Using SIMD for ASCII poses the problem that many non-Latin scripts use ASCII spaces and punctuation. If we return directly to the SIMD path upon seeing a single ASCII byte after a sequence of non-ASCII, we may end up processing a SIMD vector only to find that it's not fully ASCII, because it just starts with an ASCII space or an ASCII punctuation character followed by an ASCII space, and then non-ASCII follows again. For non-Latin scripts that use ASCII spaces and punctuation, after non-ASCII it is useful to have a loop that keeps processing ASCII bytes using the ALU as long as the byte values are below the less-than sign. This way, ASCII spaces, punctuation and digits do not result in unhelpful use of SIMD, but HTML markup results in a jump back to the SIMD path. In the case of the legacy CJK encodings, it's easy to decide whether to have such a punctuation loop or not: Korean benefits from one, so EUC-KR gets such a loop. Chinese and Japanese don't benefit from such a loop, so the rest of the legacy CJK encodings don't get one. The decision is trickier for single-byte encodings and UTF-8. In the interest of code size, all the single-byte encodings (other than x-user-defined) are handled with the same code. For the Latin encodings, it would be beneficial not to have a punctuation loop. For Cyrillic, Greek, Arabic and Hebrew, it is beneficial to have the punctuation loop. Decoding the Latin single-byte encodings is faster anyway, so the punctuation loop is there for all single-byte encodings, for the benefit of the ones that are non-Latin but use ASCII spaces and punctuation. UTF-8 calls for a one-size-fits-all solution, and by the same logic one would expect the UTF-8 to UTF-16 decoder to have a punctuation loop. Yet it doesn't. I don't recall the details, but when I experimented with one, it didn't behave well. I didn't investigate exactly why; the conversion loop is pretty delicate even without a punctuation loop, so maybe there was some bad interaction in the optimizer. Rust has been through LLVM major version updates since I experimented with this code, so it might be worthwhile to experiment again. Fast Instantiation of Converters Character encoding conversion libraries typically reserve the right to perform expensive operations when a decoder or an encoder is instantiated. Expensive operations could include loading lookup tables from the file system, decompressing lookup tables or deriving encode-oriented lookup tables from decode-oriented lookup tables. This is problematic.
When the instantiation of a converter is potentially expensive, libraries end up recommending that callers hold onto converters and reset them between uses. Since encoding_rs builds BOM handling into the decoders, does so by varying the initial state of a state machine, and allows BOM sniffing to change what encoding the decoder is for, being able to reset a decoder would require storing a second copy of the initial state in the decoder. More importantly, though, the usage patterns for character encoding converters tend to be such (at least in a Web browser) that there isn't a natural way for callers to hold onto converters, and creating some kind of cache for recycled converters creates threading problems and shouldn't be the callers' responsibility anyway. Even a thread-safe once-per-process heap allocation on first use would be a problem. Firefox is both a multi-threaded and a multi-process application. E.g. generating a heap-allocated encode-optimized lookup table in a thread-safe way on first use would end up costing the footprint of the table in each process, even if sharing between threads appeared simple enough. To avoid these problems, encoding_rs guarantees that instantiating a converter is a very cheap operation: just a matter of loading some constants into a few machine words. No up-front computation on the data tables is performed during converter instantiation. The data tables are Plain Old Data arranged in the layout that the conversion algorithms access. Of course, if the relevant part of the program binary hasn't been paged in yet, accessing the data tables can result in the operating system paging them in. Single-Byte Lookup Table Layout The Encoding Standard gives the mapping tables for the legacy encodings as arrays indexed by what the spec calls the "pointer". For single-byte encodings, the pointer is simply the unsigned byte value minus 0x80. That is, the lower half passes through as ASCII and the higher half is used for a simple table lookup when decoding. Conceptually, the encoder side is a linear search through the mapping table. A linear search may seem inefficient and, of course, it is. Still, the encode operation with the legacy encodings is actually rather rare in the Web Platform. It is exposed in only two places: in the error handling for the query strings of URLs occurring as attribute values in HTML, and in HTML form submission. The former is error handling for the case where the query string hasn't been properly percent-escaped and, therefore, relatively rarely has to handle non-ASCII code points. The latter happens mainly in response to a user action that is followed by a network delay. An encoding library can get away with slowness in this case, since the slowness can get blamed on the network anyway. Furthermore, encoder speed that is shockingly slow percentage-wise compared to how fast it could be can still be fast in terms of human-perceivable timescales for the kind of input sizes that typically occur in the text fields of an HTML form. The design of encoding_rs took place in the context of the CLDR parts of ICU having been accepted as part of desktop Firefox but having been blocked from inclusion in Firefox for Android for an extended period of time out of concern over the impact on apk size. I wanted to make sure that encoding_rs could replace uconv without getting blocked on size concerns on Android.
Therefore, since there wasn't a pressing need for the encoders for legacy encodings to be fast and there was a binary size concern (and the performance concern about instantiating an encoder ruled out the option of spending time computing an encode-specific lookup table from the decode-oriented tables at the time of encoder instantiation), I made it a design principle that encoding_rs would have no encoder-specific data tables and instead the encoders would search the decode-oriented data tables, even if it meant linear search. As shipped in Firefox 56, the single-byte encoders in encoding_rs performed a forward linear search across each quadrant of the lookup table for the single-byte encoding, such that the fourth quadrant was searched first and the first quadrant was searched last. This search order makes the most sense for the single-byte encodings considered collectively, since most encodings have lower-case letters in the fourth quadrant and the first quadrant is either effectively unused or contains rare punctuation. In encoding_rs 0.8.11 (Firefox 65), though, as a companion change to the compile-time options to speed up legacy CJK encode (discussed below), I relaxed the principle of not having any encode-specific data a little, based on the observation that adding just 32 bits (not bytes!) of encoder-specific data per single-byte encoding could significantly accelerate the encoders for Latin1-like and non-Latin single-byte encodings while not making the performance of non-Latin1-like Latin encodings notably worse. By adding 8 bits for the offset in the lookup table to the start of a run of consecutive code points, 8 bits for the length of the run, and 16 bits for the Unicode code point at the start of the run, the common case (the code point to encode falling within the run) could be handled without a linear search. Unlike in the case of the CJK legacy encode compile-time options, the addition of 32 bits per single-byte encoding was small enough in added footprint that I thought it did not make sense to make it a compile-time option. Instead, the 32 bits per single-byte encoding are there unconditionally. Multi-Byte Lookup Table Layout For multi-byte legacy encodings, the pointer is computed from two or more bytes. In that case, the computation forms a linear offset into the array when not all values of the (typically) two bytes are valid or the valid values for the two bytes aren't contiguous. This is in contrast to some previous formulations where two bytes are interpreted as a 16-bit big-endian integer and then that integer is considered to map to Unicode. Since not all values of the two bytes are in use, simply interpreting the two bytes as a 16-bit big-endian integer would result in a needlessly sparse lookup table. (A sparse lookup table can have the benefit of being able to combine bits from the lead and trail byte without an actual multiplication instruction, which may have been important in the past. E.g. Big5 with a dense lookup table involves multiplying by 157, which compiles to an actual multiplication instruction.) Still, with the linearization math provided by the spec, the lookup tables provided by the spec are not fully dense. Since legacy encodings are not exercised by the modern most performance-sensitive sites and binary size on Android was a concern, I sought to make the lookup tables more compact, potentially trading off a bit of performance.
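For concreteness, here is the Big5 linearization written directly from the Encoding Standard (a sketch from the spec, not encoding_rs's code): 63 valid trail values in 0x40..=0x7E plus 94 in 0xA1..=0xFE give the 157 valid trail bytes per lead.

// Dense Big5 pointer: 157 valid trail byte values per lead byte.
fn big5_pointer(lead: u8, trail: u8) -> Option<usize> {
    if !(0x81..=0xFE).contains(&lead) {
        return None; // not a valid lead byte
    }
    let offset = if trail < 0x7F { 0x40 } else { 0x62 };
    if (0x40..=0x7E).contains(&trail) || (0xA1..=0xFE).contains(&trail) {
        Some((lead as usize - 0x81) * 157 + (trail as usize - offset))
    } else {
        None // not a valid trail byte
    }
}

Even with this dense linearization, not every resulting pointer value maps to a character, which is why the spec's tables still have the unused areas discussed next.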
Visualizing the lookup table for EUC-KR (warning: the link points to a page that may be too large for phones with little RAM) reveals that the lookup table has two unused vertical bands as well as an unused lower left quadrant. The Japanese lookup tables (JIS X 0208 with vendor extensions and JIS X 0212) also have unused ranges. The gbk lookup table has no unused parts, but in place of unused parts it has areas filled with consecutive Private Use Area code points. More generally, the lookup tables have ranges of pointers that map to consecutive Unicode code points. As the most obvious examples, the Hiragana and Katakana characters occur in the lookup tables in the same order as they appear in Unicode, therefore forming ranges of consecutive code points. The handling of such ranges can be performed by excluding them from the lookup table and instead writing a range check (and offset addition) in the decoder program code. (Aside: the visualizations were essential in order to gain understanding of the structure of the legacy CJK encodings. I developed the visualizations when working on encoding_rs and contributed them to the spec.) Furthermore, the way EUC-KR and gbk have been extended from their original designs has a relationship with Unicode. The original, smaller lookup table appears in the visualizations of the extended lookup tables on the lower right. In the case of EUC-KR, the original KS X 1001 lookup table contains the Hangul syllables in common use. In the case of gbk, the original GB2312 lookup table contains the most common (simplified) Hanzi ideographs. The extended lookup table for EUC-KR, at the top and on the left, contains, in the Unicode order, all the Hangul syllables from the Hangul Syllables Unicode block that weren't already included in the original KS X 1001 part on the lower right. Likewise, the extended lookup table for gbk, at the top and on the left, contains, in the Unicode order, all the ideographs from the CJK Unified Ideographs Unicode block that weren't already included in the original GB2312 part on the lower right. That is, after omitting the empty vertical bands in EUC-KR, in both EUC-KR and gbk the top part and the bottom left part form runs of consecutive code points such that the last code point in each run is less than the first code point in the next run. These are stored as tables (one for the top and another for the bottom left) that contain the (linearized) pointer for the start of each such run and tables of equal length that contain the first code point of each run. When decoding, a binary search with the linearized pointer can be performed to locate the start of the run that the pointer belongs to. The code point at the start of the run can then be obtained by reading the corresponding item from the table of the first code points of the runs. The correct code point within the run can be obtained by adding to the first code point the offset obtained by subtracting the pointer for the start of the run from the pointer being searched. On the encoder side, a linear search with the code point can be performed in the table containing the first code point of each run, after it has been established that the Hangul Syllable code point (in the EUC-KR case) or the CJK Unified Ideograph code point (in the gbk case) wasn't found in the lower right part of the lookup table. (This process could even be optimized further by arranging the tables in the Eytzinger order instead.) Adding more program code in order to make the lookup tables smaller worked in most cases.
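To make the run-compression scheme concrete, here is a sketch with tiny made-up tables; the real tables are generated from the Encoding Standard data and are much longer:

// Start of each run, as a linearized pointer and as the first code point.
const RUN_POINTERS: [u16; 3] = [0, 5, 12];
const RUN_FIRST_CODE_POINTS: [u16; 3] = [0xAC00, 0xAC10, 0xAC2B];

// Decode: binary search for the last run start at or below the pointer,
// then add the offset within the run.
fn pointer_to_code_point(pointer: u16) -> u16 {
    let idx = match RUN_POINTERS.binary_search(&pointer) {
        Ok(i) => i,
        Err(i) => i - 1, // assumes pointer >= RUN_POINTERS[0]
    };
    RUN_FIRST_CODE_POINTS[idx] + (pointer - RUN_POINTERS[idx])
}

// Encode: linear search over the run starts (validity checks omitted).
fn code_point_to_pointer(cp: u16) -> u16 {
    let mut idx = 0;
    while idx + 1 < RUN_FIRST_CODE_POINTS.len() && RUN_FIRST_CODE_POINTS[idx + 1] <= cp {
        idx += 1;
    }
    RUN_POINTERS[idx] + (cp - RUN_FIRST_CODE_POINTS[idx])
}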
Replacing ranges like Hiragana and Katakana with explicitly-programmed range checks and compressing the top and bottom left parts of EUC-KR and gbk as described above resulted in an overall binary size reduction, except for Big5. In the case of Big5, the added program code seemed to exceed the savings from a slightly smaller lookup table. That's why the above techniques were not applied to Big5 in released code after all. However, Big5 did provide the opportunity to separate the Unicode plane from the lower 16 bits instead of having to store 32-bit scalar values. The other lookup tables (excluding the non-gbk part of gb18030, which is totally different) only contain code points from the Basic Multilingual Plane, so the code points can be stored in 16 bits. The lookup table for Big5, however, contains code points from above the Basic Multilingual Plane. Still, the code points from above the Basic Multilingual Plane are not arbitrary code points. Instead, they are all from the Supplementary Ideographic Plane. Therefore, the main lookup table can contain the low 16 bits, and then there is a bitmap that indicates whether the code point is on the Basic Multilingual Plane or on the Supplementary Ideographic Plane. It is worth noting that while the attempts to make the tables smaller strictly add branching when decoding to UTF-16, in some cases when decoding to UTF-8 they merely move a branch to a different place. For example, when the code has a branch to handle e.g. Hiragana by offset mapping, it knows that a Hiragana character will be three bytes in UTF-8, so the branch to decide the UTF-8 sequence length based on the scalar value is avoided. (There are separate methods for writing output that is known to be three bytes in UTF-8, output that is known to be two bytes in UTF-8, and output that might be either two or three bytes in UTF-8. In the UTF-16 case, all these methods do the same thing and output a single UTF-16 code unit.) The effort to reduce the binary size was successful in the sense that the binary size of Firefox was reduced when encoding_rs replaced uconv, even though encoding_rs added new functionality to support decoding directly to UTF-8 and encoding directly from UTF-8. Optional Encode-Oriented Tables for Multi-Byte Encodings In the case of Hangul syllables, when encoding to EUC-KR even the original unextended KS X 1001 part of the mapping table is in the Unicode order, due to KS X 1001 and Unicode agreeing on how the syllables should be sorted. This enables the use of binary search when encoding Hangul into EUC-KR without encode-specific lookup tables. However, with the exception of the gbk extension part that was not in the original GB2312, the way the CJK Unified Ideographs have been laid out in the legacy standards has no obvious correspondence to Unicode order. As far as I'm aware, the options are doing a linear search over the decode-oriented data tables or introducing additional encode-oriented data tables. The relative performance difference between these two approaches is, obviously, dramatic. Even though testing indicated that linear search over the decode-oriented data tables yielded acceptable human-perceived performance for the browser-relevant use cases even on phone-like hardware, I wanted to have a backup plan in case my determination of the human-perceived performance was wrong and users ended up complaining.
Still, I tried to come up with a backup plan that would reach uconv performance (which already wasn't as fast as an implementation willing to spend memory on encode-specific tables could be) without having to add lookup tables as large as the obviously fast solution, a table large enough to index by the offset into the CJK Unified Ideographs block, would require. Ideographs appear to be practically unused in modern Korean online writing, so accelerating Hanja to EUC-KR encode wasn't important. On the other hand, GB2312, original Big5 (without the HKSCS parts) and JIS X 0208 all have the ideographs organized into two ranges: Level 1 and Level 2, where Level 1 contains the more frequently used ideographs. As the backup plan, I developed compile-time-optional encode acceleration for the Level 1 areas of these three mapping tables. Since this was a mere backup plan, instead of researching better data structures for the problem, I went with the most obvious one: for each of the three legacy standards, an array of the Level 1 Hanzi/Kanji sorted in the Unicode order and another array of the same length, sorted in the corresponding order, containing arrays of two bytes already encoded in the target encoding. In the case of JIS X 0208, there are three target encodings, so I used the most common one, Shift_JIS, for the bytes and added functions to transform the bytes to EUC-JP and ISO-2022-JP. This solution was enough to make encode to the legacy CJK encodings many times faster than uconv. The backup plan, however, didn't end up needing to ship in Firefox. Linear search seems to be fast enough, considering that users didn't complain. Indeed, a linear search-based Big5 encoder had already been shipped in Firefox 43 without complaints from users. (However, this wasn't a sufficient data point on its own, since, anecdotally, it seems that the migration from Big5 to UTF-8 on the Web is further along than the migration from Shift_JIS and gbk.) Even though impressive relative to uconv performance, accelerating Level 1 Hanzi/Kanji encode using binary search remained very slow relative to other encoding conversion libraries. In order to remove the perception that encoding_rs is very slow for some use cases, I implemented a compile-time option to use encode-only lookup tables that are large enough to index into directly by the offset into the Hangul Syllables or CJK Unified Ideographs Unicode blocks. With these options enabled, encoding_rs legacy CJK encoder performance is within an order of magnitude of ICU and kernel32.dll, though still generally not exceeding their performance for plain text (that doesn't have a lot of ASCII markup). Presumably, to match or exceed their performance, encoding_rs would need to use even larger lookup tables directly indexable by Basic Multilingual Plane code point and to have even fewer branches. It is worth noting, though, that while even larger lookup tables might win micro-benchmarks, they might have adverse effects on other code in real application workloads by causing more data to be evicted from caches during the encoding process.
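The compile-time-optional direct-index approach looks roughly like the following sketch; the table name and its placeholder contents are hypothetical, as the real tables are generated and hold actual pre-encoded byte pairs:

const HANGUL_FIRST: u32 = 0xAC00;
const HANGUL_COUNT: usize = 11_172; // size of the Hangul Syllables block

// One pre-encoded EUC-KR byte pair per Hangul syllable.
static HANGUL_TO_EUC_KR: [[u8; 2]; HANGUL_COUNT] = [[0; 2]; HANGUL_COUNT]; // placeholder

fn encode_hangul(c: char) -> Option<[u8; 2]> {
    let cp = c as u32;
    if cp >= HANGUL_FIRST && cp < HANGUL_FIRST + HANGUL_COUNT as u32 {
        // O(1): index directly by the offset into the Unicode block.
        Some(HANGUL_TO_EUC_KR[(cp - HANGUL_FIRST) as usize])
    } else {
        None // fall back to the slower decode-table search
    }
}

The cost is roughly 2 × 11,172 bytes of encode-only data for EUC-KR alone, which is exactly the binary size versus speed trade-off that the compile-time option exposes.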
In general, a library that seeks high encoder performance should probably take the advice given in the Unicode Standard and use an array of 256 pointers indexed by the high half of the Basic Multilingual Plane code point, where each pointer either points to an array of 256 pre-encoded byte pairs indexed by the lower half of the Basic Multilingual Plane code point or is a null pointer if all possible low bit combinations are unmapped. Still, considering that a Web browser gets away with relatively slow legacy encoders, chances are that many other applications do, too. In general, applications should use UTF-8 for interchange and, therefore, not use the legacy encoders except where truly needed for backward compatibility. Chances are that most applications won’t need to use the compile-time options to enhance encoder performance, and if they do, it’s probably more about getting performance to a level where untrusted input can’t exercise excessively slow code paths than about maximal imaginable performance being essential. At this point, it doesn’t make sense to introduce compile options that would deviate more from the Firefox-relevant code structure for the sake of winning legacy encoder benchmarks. Safety One generally expects Rust code to be safe. Rust code that doesn’t use unsafe is obviously safe. Rust code that uses unsafe is safe only if unsafe has been used correctly. Semi-alarmingly, encoding_rs uses unsafe quite a bit. Still, unsafe isn’t used in random ways. Instead, it’s used for certain things and only in certain source files. In particular, it is not used inside the source files that implement the logic for legacy CJK encodings, which in the C++ implementation would be the riskiest area in terms of memory-safety bugs. This is not to say that all the unsafe is appropriate. Some of it would be avoidable right now, but the better way either didn’t exist or didn’t exist outside nightly Rust when I wrote the code, and some will likely become avoidable in the future. Here’s an overview of the kinds of unsafe in encoding_rs (a sketch of the first two kinds follows this overview): Unchecked conversion of u32 to char A couple of internal APIs use char to signify Unicode scalar value. However, the scalar value gets computed in a way that first yields the value as u32. Since the value is in the right range by construction, it is reinterpreted as char without the cost of the range check. Some of this use of unsafe could be avoided by using u32 instead of char internally in some places. It’s unclear if the current usage is worthwhile. Writing to &mut str as &mut [u8] Since dealing with character encodings is the core competence of encoding_rs, it would be silly to run the standard library’s UTF-8 validation on encoding_rs’s UTF-8 output. Instead, encoding_rs uses unsafe to assert the validity of its UTF-8 output to the type system. It doesn’t make sense to try to get rid of this use of unsafe. It’s fundamental to the crate. Calling Intrinsics Rust makes intrinsics categorically unsafe even in cases where there isn’t actually anything that logically requires a given intrinsic to be unsafe. This results in the use of unsafe to call vendor-specific SIMD operations and to annotate if conditions for branch prediction using likely/unlikely. This kind of unsafe makes the code harder to read and scarier than it actually is, but it is easy to convince oneself that this kind of unsafe is not risky in terms of the overall safety of the crate.
SIMD Bitcasts When working with SIMD, it is necessary to convert between different lane configurations in a way that is just a type-system-level operation and on the machine level is nothing: the register is the same and the operations determine how the contents of the register are interpreted. As a consequence, reinterpreting a SIMD type of a given width in bits (always 128 bits in encoding_rs) as another SIMD type of the same width in bits should be OK if both types have integer lanes (i.e. all bit patterns are valid). I expect that in the future, Rust will gain safe wrappers for performing these reinterpretations. Such wrappers already exist behind a feature flag in the packed_simd crate. Re-Interpreting Slices as Sequences of Different Types The ASCII acceleration code reads and writes slices of u8 and u16 as usize (if SIMD isn’t enabled) or u8x16 and u16x8 (if SIMD is enabled). This is done by casting pointers and by dereferencing pointers. This, obviously, is not ideal in terms of confidence in the correctness of the code. Indeed, this kind of code in the mem module had a bug that made it into a crates.io release of encoding_rs, though I believe no one actually deployed that code to end users before the problem was remedied. While, based on fuzzing, I believe this code to be correct, potentially in the future it could be made more obviously correct by using align_to(_mut) on primitive slices (stabilized in Rust 1.30.0) and from_slice_aligned/from_slice_unaligned and, possibly, their _unchecked variants on SIMD types in the packed_simd crate. However, some of these, notably align_to, are themselves unsafe, even though align_to wouldn’t need to be unsafe when both slice item types allow all bit patterns as their value space, as primitive integers and integer-lane SIMD vectors do. Unaligned Memory Access Especially with SIMD but also with UTF-16LE and UTF-16BE, unaligned memory access is done with unsafe and std::ptr::copy_nonoverlapping, which LLVM optimizes the same way as C’s memcpy idioms. memcpy In some cases, data is copied from one slice to another using std::ptr::copy_nonoverlapping even when copy_from_slice on a primitive slice would do and the bound check wouldn’t be too much of a performance problem. Removing remaining cases like this would not remove the unsafe blocks they are in, because they are right next to setting the logical length of Vec in a way that exposes uninitialized memory. Since the length is set right there anyway, it doesn’t make much sense to worry about passing the wrong length to std::ptr::copy_nonoverlapping. Avoiding Bound Checks Perhaps the most frustrating use of unsafe is to omit bound checks on slice access that the compiler logically should be able to omit from safe code but doesn’t. I hope that in the future, LLVM gets taught more about optimizing away unnecessary bound checks in the kind of IR that rustc emits. At present, it might be possible to write the code differently without unsafe such that the resulting IR would match the kind of patterns that LLVM knows how to optimize. It is not a nice programming experience, though, to try different logically equivalent ways of expressing the code and see what kind of assembly comes out of the compiler. Additionally, there are cases where an array of 128 items is accessed with a byte minus 128 after the byte is known to have its highest bit set. This can’t be expected to be known to the optimizer in cases where the fact that the highest bit is set has been established using vendor-specific SIMD.
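To give a flavor of the first two kinds of unsafe in the overview, here is a hedged sketch (not the actual encoding_rs code; the invariants stated in the comments are what make these uses sound):

    // The scalar value is known to be a valid Unicode scalar value by
    // construction, so the range check of char::from_u32 is skipped.
    fn char_from_valid_scalar(scalar: u32) -> char {
        debug_assert!(std::char::from_u32(scalar).is_some());
        unsafe { std::char::from_u32_unchecked(scalar) }
    }

    // The converter guarantees that the bytes it writes are valid UTF-8,
    // so output is written through the byte view of an &mut str without
    // re-running UTF-8 validation.
    fn write_ascii_byte(dst: &mut str, byte: u8) {
        debug_assert!(byte < 0x80 && !dst.is_empty());
        // as_bytes_mut is unsafe: the caller must keep the str valid UTF-8.
        unsafe {
            dst.as_bytes_mut()[0] = byte;
        }
    }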
Testing In the opening paragraph, I claimed high correctness. encoding_rs has been tested in various ways. There are small manually-written tests in the Rust source files for edge cases that seemed interesting. Additionally, every index item for every lookup table-based encoding is tested by generating the expectations from the index data via code different from the main implementation. In the context of Firefox, encoding_rs is tested using the test cases in Web Platform Tests (WPT). All encoding tests in WPT pass, except tests for the new TextDecoderStream and TextEncoderStream JavaScript APIs. Additionally, encoding_rs is fuzzed using cargo-fuzz, which wraps LLVM’s coverage-guided libFuzzer for use on Rust code. Benchmarking Let’s finally take a look at how encoding_rs performs compared to other libraries. Workloads When decoding from UTF-8, the test case is the Wikipedia article for Mars, the planet, for the language in question in HTML. Reasons for choosing Wikipedia were: Wikipedia is an actual top site that’s relevant to users. Wikipedia has content in all the languages that were relevant for testing. Wikipedia content is human-authored (though I gather that the Simplified Chinese text is not directly human-authored but is programmatically derived from human-authored Traditional Chinese text). Wikipedia content is suitably licensed. The topic Mars, the planet, was chosen because it is the most-featured topic across the different-language Wikipedias and, indeed, had non-trivial articles in all the languages needed. Trying to choose a typical-length article for each language separately wasn’t feasible in the Wikidata data set. The languages were chosen to represent the languages that have Web-relevant legacy encodings. In the case of windows-1252, multiple languages with different non-ASCII frequencies were used. The main shortcoming of this kind of selection is that UTF-8 is not tested with a (South Asian) language that would use three bytes per character in UTF-8 with ASCII spaces and would have more characters per typical word than Korean has. When decoding from a non-UTF-8 encoding, the test case is synthesized from the UTF-8 test case by converting the Wikipedia article to the encoding in question and replacing unmappable characters with numeric character references (and in the case of Big5 removing a couple of characters that glibc couldn’t deal with). When testing x-user-defined decode, the test case is a JPEG image, because loading binary data over XHR is the main performance-sensitive use case for x-user-defined. The JavaScript case represents 100% ASCII and is a minified version of jQuery. (Wikipedia English isn’t 100% ASCII.) The numbers for uconv are missing, because the benchmark was added to the set after the builds made for uconv testing had rotted and were no longer linkable due to changes in the system C++ standard library. Vietnamese windows-1258 workloads are excluded, because windows-1258 uses combining characters in an unusual way, so a naïve synthesis of windows-1258 test data from precomposed UTF-8 content would not have represented a real workload. The encoder workloads use plain-text extracts from the decoder test cases in order to simulate form submission (textarea) workloads. That is, the encoder benchmarks do not test ASCII runs of HTML markup, because that scenario isn’t relevant to Web-exposed browser features. The other Web-relevant case for the encoders is the parsing of URL query strings.
In the absence of errors, the query strings are ASCII. Reference Libraries Obviously, uconv is benchmarked to understand performance relative to what Gecko had before. rust-encoding is benchmarked to understand performance relative to what was already available in the Rust ecosystem. ICU and WebKit are benchmarked to understand performance relative to other browsers. WebKit uses its own character encoding converters for UTF-8, UTF-16LE, UTF-16BE, x-user-defined, replacement, and windows-1252 and uses ICU for the others. Chrome inherits this approach from WebKit but has changed the error handling for UTF-8 to be spec-compliant and carries substantial patches to ICU for Encoding Standard compliance. WebKit internals were easier to make available to the benchmark harness, so only WebKit is benchmarked. WebKit’s windows-1252 is not benchmarked, because trying to use it segfaulted and it wasn’t worthwhile to debug the failure. WebKit on macOS is built with clang, of course, but hopefully building with GCC gives a general idea. ICU is benchmarked as shipped in Ubuntu, but hopefully that’s close enough performance-wise to the copies of ICU used by Safari and Chrome. kernel32.dll and glibc represent system APIs. I believe Core Foundation on Mac uses ICU internally, so in that sense ICU also represents a system API. I have no idea if the converters in kernel32.dll are performance-wise representative of what Edge and IE use. (kernel32.dll provides only a non-streaming API publicly while Edge and IE clearly need streaming converters.) Bob Steagall’s UTF-8 to UTF-16 decoder is benchmarked, because an entire talk claiming excellent results was recently dedicated to it at CppCon and it indeed turned out to be exceptional in its speed for non-ASCII input. Apples to Oranges Comparisons Some of the comparisons could be considered to compare things that aren’t commensurable. In particular: Except for kernel32, the measurements exclude the initialization and destruction of the converter. This is to the advantage of uconv, ICU and glibc, which perform more work during converter initialization than encoding_rs does. kernel32 does not expose converter initialization as a distinct operation, and it’s not clear if there is an initialization cost the first time a given converter is used or every time. When converting to and from UTF-8, in the comparison with rust-encoding, rust-encoding targets String and Vec<u8> while encoding_rs uses Cows. In this case, instead of trying to make the comparison fair by making encoding_rs make a useless copy, the comparison demonstrates the benefits of conditionally copy-free Rust API design. The WebKit API shows traces of Qt’s converter design. This includes always allocating a buffer on the heap for output. As a result, the WebKit numbers include the allocation and deallocation of the output buffer, but those numbers are compared with encoding_rs numbers that don’t include buffer allocation and deallocation. Since the reference libraries do not fully conform to the Encoding Standard, the work being performed isn’t exactly the same. Instead, the closest approximation of a given legacy encoding is used. Even the error handling can differ: WebKit’s UTF-16BE and UTF-16LE converters don’t check for unpaired surrogates and kernel32 shows unpolished behavior on errors. Arguably, UTF-8 isn’t the native application-side Unicode representation of glibc. However, since e.g.
glib (the infrastructure library used by GTK+) uses UTF-8 as its native application-side Unicode representation and wraps glibc for the conversions from external encodings, testing glibc’s performance to and from UTF-8 is relevant to how glibc is used even if arguably unfair. When encoding from UTF-8, encoding_rs and rust-encoding assume the input is valid, but glibc does not. Reading the Tables The columns are grouped into decode results and into encode results. Those groups, in turn, are grouped into using UTF-16 as the internal Unicode representation and into using UTF-8 as the internal Unicode representation. Both cases are supported by encoding_rs but the libraries being compared with support one or the other. Then there is a column for each library whose performance is being compared with. uconv is Gecko’s old encoding converter library with the numbers run in November 2016 on Ubuntu 16.04 with Ubuntu-provided GCC and before Spectre/Meltdown kernel mitigations. It would be fair to recompile with current clang, but I deemed it too much effort to get 2016 Gecko building on a 2018 system. ICU is ICU 60 as shipped on Ubuntu 18.04. kernel32 is kernel32.dll included in Windows 10 1803. WebKit is WebKitGTK+ 2.22.2 built with the default options (-O2) with GCC 7.3.0 on Ubuntu 18.04. kewb is Bob Steagall’s SSE2-accelerated converter presented at CppCon2018 built with clang at -O3. stdlib is Rust’s standard library. rust-encoding is rust-encoding 0.2.33. glibc is glibc’s iconv as shipped on Ubuntu 18.04. Each row names a language and an external encoding to convert from or to. The numbers are encoding_rs speed factors relative to the library named in the column. 2.0 means that encoding_rs is twice as fast as the reference library named in the column header. 0.5 means that the reference library named in the column header is twice as fast as encoding_rs. 0.00 means that encoding_rs is relatively very slow (still user-perceptibly fast enough for the form submission use case in a browser) and the non-zero decimals didn’t show up in the second decimal position. Benchmark Results encoding_rs and rust-encoding are built with Rust’s default optimization level opt_level=3 even though encoding_rs in Firefox is built at opt_level=2 for the time being. encoding_rs in Firefox is expected to switch to opt_level=3 soon. For these benchmarks, at least on x86_64 Haswell, there is no practical difference between opt_level=2 and opt_level=3 being applied to encoding_rs. However, previously there have been issues with opt_level=2 that I would rather not have investigated, so I am really looking forward to using opt_level=3 in the Gecko context. Also kewb is built at -O3. The Rust version was 1.32.0-nightly (9fefb6766 2018-11-13). In all cases, the default rustc optimization target for a given instruction set architecture was used. That is, e.g. the Haswell numbers mean running the code compiled for the generic x86_64 target on a Haswell chip and do not mean asking the compiler to optimize for Haswell specifically. x86_64 Intel Core i7-4770 @ 3.40 GHz (Haswell, desktop) encoding_rs uses SSE2 explicitly. Since SSE2 is part of the x86_64 baseline instruction set, other software is eligible for SSE2 autovectorization or to enable explicit SSE2 parts if they have them. At least uconv had an explicit SSE2 code path for ASCII in the UTF-8 to UTF-16 decoder. 
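As an illustration of what explicit SSE2 handling of ASCII looks like, here is a sketch (not the actual encoding_rs code; the tiering, alignment handling and loop unrolling are omitted) that finds the first non-ASCII byte using 16-byte unaligned loads:

    #[cfg(target_arch = "x86_64")]
    fn first_non_ascii(bytes: &[u8]) -> Option<usize> {
        use std::arch::x86_64::{__m128i, _mm_loadu_si128, _mm_movemask_epi8};
        let mut i = 0;
        while i + 16 <= bytes.len() {
            // movemask collects the high bit of each of the 16 bytes.
            let mask = unsafe {
                let v = _mm_loadu_si128(bytes.as_ptr().add(i) as *const __m128i);
                _mm_movemask_epi8(v)
            };
            if mask != 0 {
                return Some(i + mask.trailing_zeros() as usize);
            }
            i += 16;
        }
        bytes[i..].iter().position(|&b| b >= 0x80).map(|p| i + p)
    }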
Decode Encode UTF-16 UTF-8 UTF-16 UTF-8 uconv ICU kernel32 WebKit kewb stdlib rust-encoding glibc uconv ICU kernel32 WebKit rust-encoding glibc Arabic, UTF-8 2.47 2.68 1.26 1.77 0.98 1.37 4.68 5.73 0.85 0.85 0.75 1.15 4024.12 110.89 Czech, UTF-8 2.55 2.84 1.57 1.78 0.67 2.01 9.96 10.60 1.04 1.23 0.93 1.42 9055.00 104.12 German, UTF-8 3.36 5.95 2.77 2.90 1.03 2.14 22.60 19.19 3.43 4.14 1.71 5.10 3469.75 73.62 Greek, UTF-8 2.52 2.96 1.37 1.88 1.01 1.38 5.72 6.80 0.86 0.90 0.77 1.15 5492.50 105.05 English, UTF-8 2.79 8.57 3.65 3.66 1.14 1.82 61.74 31.99 7.46 11.07 3.76 14.20 632.38 69.89 JavaScript, UTF-8 11.42 4.77 0.81 1.05 1.58 30.02 45.80 13.84 5.20 17.84 682.12 63.83 French, UTF-8 2.82 4.20 2.06 2.16 0.77 1.80 14.54 13.80 1.25 1.54 0.87 1.84 14217.50 80.27 Hebrew, UTF-8 2.45 2.50 1.26 1.71 0.93 1.47 4.67 5.78 0.81 0.87 0.72 1.04 9654.38 113.19 Portuguese, UTF-8 2.94 4.91 2.33 2.44 0.86 1.85 17.65 15.90 1.89 2.30 1.06 2.77 5188.50 79.98 Russian, UTF-8 2.46 2.73 1.29 1.81 0.96 1.41 5.07 6.11 0.81 0.90 0.75 1.02 21188.00 109.55 Thai, UTF-8 3.11 3.99 1.67 2.06 1.18 1.59 10.15 10.38 1.09 1.47 1.06 1.41 16414.75 68.88 Turkish, UTF-8 2.47 2.53 1.47 1.70 0.67 2.04 8.93 9.74 1.01 1.19 0.89 1.35 10995.38 104.52 Vietnamese, UTF-8 2.37 2.31 1.31 1.63 0.78 1.90 6.62 7.58 0.90 1.01 0.84 1.08 27145.50 145.72 Simplified Chinese, UTF-8 3.02 3.40 1.67 1.96 1.06 1.90 8.93 9.49 1.15 1.58 1.03 1.55 3575.00 75.42 Traditional Chinese, UTF-8 3.05 3.42 1.68 1.96 1.07 1.90 8.98 9.54 1.15 1.58 1.03 1.55 3600.25 74.89 Japanese, UTF-8 3.26 3.47 1.66 1.99 1.15 1.94 8.40 9.20 1.14 1.60 1.07 1.56 2880.12 71.67 Korean, UTF-8 2.98 2.85 1.54 1.89 1.01 1.90 6.48 7.56 1.10 1.39 0.89 1.33 3929.12 108.69 Arabic, windows-1256 1.62 1.12 0.82 5.15 4.03 3.27 0.37 0.05 0.72 0.86 Czech, windows-1250 2.49 1.71 1.25 7.87 7.00 2.71 0.65 0.12 1.01 1.12 German, windows-1252 7.25 4.99 3.66 25.07 22.76 32.31 6.82 1.64 12.89 12.02 Greek, windows-1253 2.12 1.46 1.07 6.36 5.01 7.03 1.43 0.20 2.06 2.00 English, windows-1252 9.96 6.85 5.02 47.65 43.28 96.70 20.12 5.10 58.56 55.80 French, windows-1252 4.29 2.95 2.16 13.91 12.51 10.67 2.33 0.53 4.24 4.04 Hebrew, windows-1255 1.96 1.07 0.78 5.19 4.88 7.05 1.34 0.18 1.98 1.78 Portuguese, windows-1252 5.46 3.75 2.75 18.32 16.51 17.36 3.78 0.87 6.53 6.14 Russian, windows-1251 1.63 1.12 0.82 5.21 4.00 4.97 1.36 0.19 2.04 1.91 Thai, windows-874 3.36 2.31 1.69 5.83 4.70 3.99 0.59 0.10 1.18 1.03 Turkish, windows-1254 2.28 1.57 1.15 7.02 6.21 4.61 0.84 0.16 1.32 1.48 Simplified Chinese, gb18030 3.68 3.64 5.04 6.40 4.73 0.23 0.01 0.02 0.01 0.01 Traditional Chinese, Big5 3.24 3.08 1.87 6.13 4.36 1.29 0.01 0.00 0.01 0.02 Japanese, EUC-JP 2.85 2.79 1.69 5.17 3.78 1.26 0.02 0.01 0.03 0.17 Japanese, ISO-2022-JP 0.94 1.80 1.07 2.91 2.10 0.61 0.06 0.06 0.03 0.15 Japanese, Shift_JIS 1.72 2.35 1.42 4.66 3.41 0.62 0.01 0.01 0.03 0.03 Korean, EUC-KR 39.64 3.47 2.24 5.81 4.08 84.85 0.31 0.20 0.56 0.53 x-user-defined 12.87 25.29 3.03 Arabic, UTF-16LE 13.48 6.33 4.17 4.74 3.47 Czech, UTF-16LE 13.54 6.33 4.17 7.20 5.33 German, UTF-16LE 13.48 6.34 4.18 14.87 10.86 Greek, UTF-16LE 13.49 6.33 4.18 5.70 4.16 English, UTF-16LE 13.43 6.33 4.17 32.86 24.17 French, UTF-16LE 13.51 6.33 4.17 11.58 8.43 Hebrew, UTF-16LE 13.50 6.33 4.18 4.55 3.38 Portuguese, UTF-16LE 13.50 6.33 4.17 13.66 9.94 Russian, UTF-16LE 13.52 6.33 4.17 5.00 3.63 Thai, UTF-16LE 13.33 6.33 4.17 8.40 6.03 Turkish, UTF-16LE 13.42 6.33 4.17 6.47 4.83 Vietnamese, UTF-16LE 13.51 6.33 4.17 5.48 4.13 Simplified Chinese, UTF-16LE 13.52 6.33 8.38 7.60 5.59 Traditional 
Chinese, UTF-16LE 13.48 6.33 8.38 7.58 5.58 Japanese, UTF-16LE 13.54 6.33 4.18 6.69 4.90 Korean, UTF-16LE 13.84 6.49 4.29 5.49 4.14 Arabic, UTF-16BE 11.30 5.29 3.49 4.17 3.11 Czech, UTF-16BE 11.15 5.29 3.49 6.59 5.04 German, UTF-16BE 11.32 5.29 3.49 12.85 9.79 Greek, UTF-16BE 11.30 5.28 3.49 5.00 3.73 English, UTF-16BE 11.26 5.29 3.48 26.03 20.02 French, UTF-16BE 11.28 5.29 3.48 10.09 7.70 Hebrew, UTF-16BE 11.26 5.28 3.49 4.04 3.03 Portuguese, UTF-16BE 11.29 5.29 3.49 11.77 8.98 Russian, UTF-16BE 11.27 5.29 3.48 4.43 3.27 Thai, UTF-16BE 11.22 5.29 3.48 7.63 5.63 Turkish, UTF-16BE 11.31 5.29 3.49 5.97 4.60 Vietnamese, UTF-16BE 11.29 5.29 3.48 5.03 3.87 Simplified Chinese, UTF-16BE 11.27 5.29 7.00 6.85 5.17 Traditional Chinese, UTF-16BE 11.31 5.29 7.00 6.84 5.16 Japanese, UTF-16BE 11.31 5.29 3.49 6.08 4.55 Korean, UTF-16BE 11.44 5.36 3.54 4.90 3.77 The above table shows the results with SIMD enabled for encoding_rs but without encode-specific data tables (beyond 32 bits of encode-specific data for each single-byte encoding). With indexable lookup tables for the CJK Unified Ideographs and Hangul Syllables Unicode blocks, but otherwise retaining the same encoder structure, encoding_rs performs CJK legacy encode like this: Encode UTF-16 UTF-8 uconv ICU kernel32 rust-encoding glibc Simplified Chinese, gb18030 24.53 0.74 2.42 0.97 0.80 Traditional Chinese, Big5 160.73 0.68 0.31 1.30 2.07 Japanese, EUC-JP 47.25 0.68 0.21 0.93 5.29 Japanese, ISO-2022-JP 20.81 1.99 2.10 0.94 4.56 Japanese, Shift_JIS 29.04 0.59 0.26 1.41 1.01 Korean, EUC-KR 372.45 1.36 0.90 2.15 2.05 ARMv7+NEON Exynos 5 Windows 10 is not available, kewb is not optimized for ARM, and browsers are excluded due to compilation problems. encoding_rs and rust-encoding are compiled with NEON enabled. Only encoding_rs uses NEON explicitly. Notably, NEON is less suited for feeding back into control flow than SSE2, so NEON is not used for validating ASCII; the comparison with the Rust standard library therefore ends up being an ALU vs. ALU comparison.
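For reference, that kind of ALU-based ASCII check can be sketched like this (manual loop unrolling omitted; align_to is the post-Rust-1.30 way to do the word splitting):

    // View the middle of the slice as machine words and test the
    // per-byte high bits with a mask.
    fn is_ascii_alu(bytes: &[u8]) -> bool {
        const MASK: usize = 0x8080_8080_8080_8080u64 as usize; // truncated on 32-bit
        // align_to is itself unsafe, but u8 -> usize allows all bit patterns.
        let (head, words, tail) = unsafe { bytes.align_to::<usize>() };
        head.iter().all(|&b| b < 0x80)
            && words.iter().all(|&w| w & MASK == 0)
            && tail.iter().all(|&b| b < 0x80)
    }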
Decode Encode UTF-16 UTF-8 UTF-16 UTF-8 ICU stdlib rust-encoding glibc ICU rust-encoding glibc Arabic, UTF-8 2.15 1.21 2.71 5.28 0.93 5974.90 164.96 Czech, UTF-8 1.96 1.26 4.19 7.27 1.13 10653.25 75.24 German, UTF-8 2.89 1.20 7.13 11.32 2.54 5299.90 57.87 Greek, UTF-8 2.29 1.17 3.10 5.96 0.95 7891.35 159.49 English, UTF-8 4.25 1.07 13.82 15.11 4.66 2038.65 57.17 JavaScript, UTF-8 5.15 1.01 6.97 18.02 5.63 2120.60 57.11 French, UTF-8 2.73 1.22 7.95 9.88 1.61 16413.40 61.95 Hebrew, UTF-8 2.08 1.26 2.77 5.36 0.96 13160.95 93.50 Portuguese, UTF-8 2.80 1.22 8.66 10.39 1.87 6767.35 60.16 Russian, UTF-8 2.22 1.20 3.45 5.36 0.97 28588.75 98.30 Thai, UTF-8 3.32 1.41 6.11 9.33 1.84 28600.00 143.92 Turkish, UTF-8 1.84 1.25 3.78 6.74 1.13 12253.10 73.99 Vietnamese, UTF-8 1.76 1.32 4.06 6.11 1.06 29650.00 111.16 Simplified Chinese, UTF-8 2.46 1.43 4.09 7.94 1.82 5748.35 238.95 Traditional Chinese, UTF-8 2.46 1.43 4.16 8.01 1.82 5872.95 171.07 Japanese, UTF-8 2.48 1.45 3.79 8.92 1.88 5498.10 168.30 Korean, UTF-8 2.02 1.40 3.21 6.49 1.25 5938.90 198.42 Arabic, windows-1256 0.58 3.01 3.66 0.36 0.96 1.08 Czech, windows-1250 0.96 4.11 6.73 0.54 1.02 1.20 German, windows-1252 1.72 5.64 14.03 2.89 6.06 7.14 Greek, windows-1253 0.73 3.03 4.65 1.09 2.49 2.26 English, windows-1252 2.66 6.79 27.79 5.08 20.01 24.02 French, windows-1252 1.39 4.69 9.69 1.80 3.24 3.86 Hebrew, windows-1255 0.58 2.68 4.42 1.14 2.58 2.22 Portuguese, windows-1252 1.64 5.61 12.78 2.16 4.09 4.77 Russian, windows-1251 0.60 3.15 3.79 1.16 2.65 2.35 Thai, windows-874 0.98 3.64 5.88 0.60 1.87 2.46 Turkish, windows-1254 0.87 3.85 6.07 0.65 1.22 1.48 Simplified Chinese, gb18030 1.74 4.08 4.02 0.01 0.01 0.02 Traditional Chinese, Big5 1.73 4.57 4.40 0.01 0.02 0.04 Japanese, EUC-JP 1.61 3.91 4.26 0.03 0.04 0.22 Japanese, ISO-2022-JP 1.96 2.12 1.98 0.09 0.04 0.20 Japanese, Shift_JIS 1.41 3.46 3.77 0.02 0.04 0.06 Korean, EUC-KR 1.73 5.75 4.59 0.29 0.57 0.51 x-user-defined 2.44 Arabic, UTF-16LE 4.64 2.64 3.64 Czech, UTF-16LE 4.65 3.51 5.71 German, UTF-16LE 4.64 4.61 9.66 Greek, UTF-16LE 4.73 2.87 4.19 English, UTF-16LE 4.51 5.52 13.75 French, UTF-16LE 3.03 4.07 7.10 Hebrew, UTF-16LE 4.75 2.60 3.57 Portuguese, UTF-16LE 4.61 5.19 9.23 Russian, UTF-16LE 4.59 3.12 3.82 Thai, UTF-16LE 3.78 3.66 6.59 Turkish, UTF-16LE 4.61 3.33 5.22 Vietnamese, UTF-16LE 4.59 3.54 4.80 Simplified Chinese, UTF-16LE 4.61 3.32 5.88 Traditional Chinese, UTF-16LE 4.61 3.32 5.87 Japanese, UTF-16LE 4.74 3.02 5.35 Korean, UTF-16LE 4.73 4.24 4.59 Arabic, UTF-16BE 2.85 2.30 3.11 Czech, UTF-16BE 2.84 3.02 4.68 German, UTF-16BE 2.84 3.94 7.49 Greek, UTF-16BE 2.93 2.50 3.55 English, UTF-16BE 2.79 4.70 10.07 French, UTF-16BE 2.05 3.38 5.37 Hebrew, UTF-16BE 2.93 2.27 3.06 Portuguese, UTF-16BE 2.87 4.44 7.19 Russian, UTF-16BE 2.83 2.75 3.24 Thai, UTF-16BE 2.49 3.09 5.33 Turkish, UTF-16BE 2.83 2.85 4.30 Vietnamese, UTF-16BE 2.85 3.03 3.98 Simplified Chinese, UTF-16BE 2.83 2.85 4.84 Traditional Chinese, UTF-16BE 2.82 2.87 4.84 Japanese, UTF-16BE 2.94 2.62 4.52 Korean, UTF-16BE 2.93 3.51 3.88 aarch64 ThunderX I lack access to Windows 10 on aarch64, kewb is not optimized for aarch64, either, and browsers were excluded for compilation problems. As with x86_64, SIMD is part of the baseline compiler target instruction set on aarch64. While I was not paying attention, ALU code for ASCII validation has gained speed relative to SIMD-based ASCII validation. I suspect this might be due to LLVM updates since LLVM 4. 
For this reason, I have moved aarch64 to use ALU code for ASCII validation pending more investigation of how to fix the SIMD code. These numbers are from ThunderX, which is a server chip. Furthermore, this is the first-generation ThunderX, which is an in-order design. Benchmarking on phones does not make sense, because their clock speeds vary constantly due to thermal throttling, so benchmark results are not repeatable. Moreover, the thermal throttling may be rather fine-grained, so it is not feasible to identify throttling by looking at a clear 50% drop as is feasible e.g. with Raspberry Pi 3. The problem with ThunderX and Raspberry Pi 3 is that they use cores with in-order designs while high-end phones use more advanced out-of-order designs. It is quite frustrating that there is no good information about which non-phone computers with aarch64 chips might be able to hold a stable clock speed when running a compute benchmark for the purpose of testing small changes in implementation details. A stable clock speed is not a characteristic of an ARM hardware and kernel combination that gets advertised or talked about on forums. (In the ARMv7+NEON case, I just happened to discover that a piece of hardware, Samsung Chromebook 2 with Crouton, suited my needs.) Decode Encode UTF-16 UTF-8 UTF-16 UTF-8 ICU stdlib rust-encoding glibc ICU rust-encoding glibc Arabic, UTF-8 1.81 1.14 3.53 5.74 0.85 4358.21 43.56 Czech, UTF-8 1.63 1.19 5.72 7.97 1.00 7739.88 24.20 German, UTF-8 1.89 1.16 8.67 10.77 1.96 5448.46 20.42 Greek, UTF-8 1.91 1.16 4.16 6.55 0.88 6339.08 36.21 English, UTF-8 2.10 1.03 10.98 12.73 2.57 2585.88 19.28 JavaScript, UTF-8 2.49 1.01 9.24 17.81 4.18 3874.54 39.83 French, UTF-8 1.79 1.17 7.59 10.03 1.48 14883.29 21.76 Hebrew, UTF-8 1.77 1.16 3.56 5.82 0.86 10102.21 30.73 Portuguese, UTF-8 1.88 1.16 8.43 10.46 1.64 6257.67 20.99 Russian, UTF-8 1.90 1.17 3.73 6.08 0.91 22567.83 29.29 Thai, UTF-8 2.28 1.05 4.43 6.74 1.29 29472.83 24.38 Turkish, UTF-8 1.59 1.21 5.35 7.63 1.02 9224.92 23.80 Vietnamese, UTF-8 1.55 1.12 4.27 6.33 0.84 20106.71 26.71 Simplified Chinese, UTF-8 1.98 1.19 4.67 6.83 1.27 5704.42 35.09 Traditional Chinese, UTF-8 1.97 1.18 4.62 6.77 1.28 5706.46 35.22 Japanese, UTF-8 2.05 1.18 4.04 6.10 1.31 5963.42 76.90 Korean, UTF-8 1.81 1.18 3.89 5.89 0.93 4173.88 37.93 Arabic, windows-1256 1.35 4.26 3.29 0.44 0.86 1.00 Czech, windows-1250 1.72 6.75 5.68 0.62 1.07 1.12 German, windows-1252 2.12 9.87 8.33 3.17 6.46 6.71 Greek, windows-1253 1.50 4.93 3.87 1.26 1.74 1.56 English, windows-1252 2.32 12.30 10.39 4.28 11.15 11.59 French, windows-1252 1.98 8.72 7.36 2.38 4.23 4.36 Hebrew, windows-1255 1.34 4.32 4.09 1.26 1.73 1.36 Portuguese, windows-1252 2.08 9.60 8.04 2.68 4.98 5.15 Russian, windows-1251 1.37 4.37 3.43 1.27 1.76 1.52 Thai, windows-874 1.69 5.04 4.11 0.82 1.33 1.14 Turkish, windows-1254 1.65 6.34 5.30 0.80 1.37 1.45 Simplified Chinese, gb18030 1.93 6.94 3.22 0.01 0.01 0.01 Traditional Chinese, Big5 1.92 5.65 3.65 0.01 0.01 0.02 Japanese, EUC-JP 1.86 6.16 3.07 0.02 0.04 0.19 Japanese, ISO-2022-JP 1.88 2.95 1.60 0.05 0.04 0.21 Japanese, Shift_JIS 1.69 5.21 3.08 0.02 0.04 0.03 Korean, EUC-KR 1.98 6.07 3.36 0.30 0.59 0.46 x-user-defined Arabic, UTF-16LE 3.27 4.40 3.50 Czech, UTF-16LE 3.27 6.03 4.74 German, UTF-16LE 3.26 7.75 6.02 Greek, UTF-16LE 3.26 5.07 3.98 English, UTF-16LE 3.24 9.22 7.18 French, UTF-16LE 3.22 7.25 5.74 Hebrew, UTF-16LE 3.26 4.41 3.50 Portuguese, UTF-16LE 3.29 7.90 6.16 Russian, UTF-16LE 3.25 4.72 3.73 Thai, UTF-16LE 3.31 5.77 4.73 Turkish, UTF-16LE
3.27 5.75 4.55 Vietnamese, UTF-16LE 3.30 5.11 4.21 Simplified Chinese, UTF-16LE 3.26 5.79 4.59 Traditional Chinese, UTF-16LE 3.26 5.78 4.58 Japanese, UTF-16LE 3.26 5.34 4.24 Korean, UTF-16LE 3.28 4.90 3.89 Arabic, UTF-16BE 2.56 3.82 2.83 Czech, UTF-16BE 2.56 4.94 3.60 German, UTF-16BE 2.57 6.38 4.64 Greek, UTF-16BE 2.57 4.40 3.21 English, UTF-16BE 2.56 7.47 5.48 French, UTF-16BE 2.51 5.82 4.36 Hebrew, UTF-16BE 2.57 3.82 2.82 Portuguese, UTF-16BE 2.54 6.42 4.64 Russian, UTF-16BE 2.58 4.09 3.01 Thai, UTF-16BE 2.63 4.97 3.81 Turkish, UTF-16BE 2.56 4.71 3.44 Vietnamese, UTF-16BE 2.59 4.18 3.14 Simplified Chinese, UTF-16BE 2.56 4.92 3.64 Traditional Chinese, UTF-16BE 2.56 4.93 3.63 Japanese, UTF-16BE 2.57 4.61 3.43 Korean, UTF-16BE 2.57 4.17 3.06 Notable Observations Rather expectedly, for ASCII on x86_64, SIMD is a lot faster than not using SIMD, and encode to legacy encodings without encode-oriented data tables is relatively slow (but, again, still user-perceptibly fast enough even on low-end hardware for the form submission use case for legacy encoders in a Web browser). Also, the naïve code structure that remains in the ISO-2022-JP decoder is slower than the kind of code structure that uses the program counter as part of the two-byte state tracking, leading to more predictable branches. glibc Unlike the other libraries that convert to UTF-16 or UTF-8, glibc supports conversions from any encoding into any other by pivoting via UTF-32 on a per-scalar-value basis. This generality has a cost. I think the main take-away for application developers is that a standard library implementation covers a lot of functionality and not all those areas are optimized, so you should not assume that a library is fast at everything just because it is a core system library that has been around for a long time. As noted earlier in the “Apples to Oranges Comparisons” section, when encoding from UTF-8, glibc treats the input as potentially invalid, but encoding_rs assumes validity, so when encoding from UTF-8 to UTF-8, the encoding_rs numbers are basically for memcpy but glibc inspects everything. kernel32 In contrast, the Windows system converters have been seriously optimized for the encodings that are the default “ANSI code page” for some Windows locale. Notably, this benchmark tested gb18030 (not the default system code page for any locale) and not GBK (the default for Simplified Chinese), and gb18030 looks relatively slower than the code pages that are the default in some locale configuration of Windows. EUC-JP, however, looks well optimized in kernel32 despite it not being the default for any locale. On the decode side, kernel32 is faster than encoding_rs for single-byte encodings for non-Latin scripts that use ASCII punctuation and spaces. However, for Thai and Latin scripts, encoding_rs is faster than kernel32 for single-byte encodings. This shows the cost of ASCII-acceleration when bouncing back to ASCII only for one or two bytes at a time and shows the downside of trying to limit the code footprint of encoding_rs by using the same code for all single-byte encodings with only the lookup table as a configurable variable. On the encode side, kernel32 is extremely fast relative to other implementations for the encodings that are the default “ANSI code page” for some Windows locale (and for EUC-JP).
Windows is not Open Source, so I haven’t seen the code, but from the performance characteristics it looks like kernel32 has a lookup table that can be directly indexed by a 16-bit Basic Multilingual Plane code point and that yields a pair of bytes that can be copied directly to the output. In microbenchmarks that don’t involve SIMD-acceleratable ASCII runs, it’s basically impossible to do better. It is hard to know what the cache effects of a maximally large lookup table are outside microbenchmarks, but the lookup table footprint just for CJK Unified Ideographs or just for Hangul Syllables is a large number of cache lines anyway. Considering the use cases for the kernel32 converters, optimizing for extreme speed rather than small footprint makes sense. When pre-Unicode legacy apps are run on Windows, all calls to system APIs that involve strings convert between the application-side “ANSI code page” and the system-side UTF-16. Typically, all apps run with the same legacy “ANSI code page”, so only the lookup table for one encoding needs to be actively accessed. If the mission of the legacy encoders in encoding_rs were to provide maximally fast conversion to legacy encodings as opposed to providing correct conversion to legacy encodings with minimal footprint and just enough speed for the user not to complain about form submission, it would totally make sense to use tables directly indexable by 16-bit Basic Multilingual Plane code point. uconv Overall, performance-wise the rewrite was an improvement. (More about UTF-16 to UTF-8 encode below.) As far as I can tell, the EUC-KR results for uconv are not a benchmark environment glitch; the EUC-KR implementation in uconv was just remarkably inefficient. The Big5 results say nothing about the original design of uconv. The uconv Big5 implementation being compared with is the one I wrote for Firefox 43, and that implementation already did away with encode-oriented data tables. In encoding_rs, the ISO-2022-JP decoder uses a state variable while uconv was a bit faster thanks to using the program counter for state. rust-encoding As noted earlier in the “Apples to Oranges Comparisons” section, the numbers to and from UTF-8 show how much better borrowing is compared to copying when borrowing is possible. That is, encoding_rs borrows and rust-encoding copies. ICU ICU is an immensely useful and important library, but I am somewhat worried about the mentality that everyone should just standardize on ICU, and that no one can afford to rewrite ICU. In particular, I’m worried about the “just use ICU” approach entrenching UTF-16 as an in-memory representation of Unicode even more at a time when it’s increasingly clear that UTF-8 should be used not only as the interchange representation but also as the in-memory representation of Unicode. I hope the x86_64 and aarch64 results here encourage others to try to do better than ICU (piece-wise, as the Rust ecosystem is doing) instead of just settling on ICU. On ARMv7, encoding_rs performs worse than ICU for decoding non-windows-1252 single-byte encodings into UTF-16. This shows how heavily encoding_rs’s design relies on SIMD. ARMv7 has weaker SIMD functionality than x86, x86_64 or aarch64, so the split between ASCII and non-ASCII is a pessimization on ARMv7. In the x86_64 case, the benefits of SSE2 for markup offset the downsides of the ASCII/non-ASCII handling split for natural language in the Wikipedia case.
Fortunately, mobile browsing skews even more towards UTF-8 than the Web in general, migration from the affected encodings to UTF-8 is, anecdotally, even further along than migration to UTF-8 in general, and aarch64 is around the corner, so I think it isn’t worthwhile to invest effort or binary footprint into having a different design for ARMv7. Encode from UTF-16 to UTF-8 While encoding_rs is a lot faster than the other libraries when encoding ASCII or almost-ASCII from UTF-16 to UTF-8, encoding_rs does worse than uconv, kernel32 and ICU in cases where there are only short runs of ASCII, typically one ASCII space, mixed with non-ASCII. This is consistent for Arabic, Greek, Hebrew and Russian, but relative to kernel32 this shows up also for Korean and for the Latin script—not just for Vietnamese (with which the effect also shows up relative to uconv), Turkish and Czech, whose non-ASCII frequency is obviously high, but even for French. This shows that the cost of switching between the ASCII fast path and the non-ASCII mode is higher for UTF-16 input than for single-byte input, which makes sense, since checking whether a SIMD vector of 16-bit units is in the Basic Latin range requires more SSE2 operations than checking a vector of 8-bit units. Considering that the benefit of the ASCII fast path is so large in the ASCII case, I ended up keeping the ASCII fast path, despite it being a pessimization, though, fortunately, not a huge one, for many languages. Single-Byte Encode Arabic, Hebrew, Greek and Russian are all written in non-Latin scripts that use ASCII spaces and punctuation. Why does Arabic encode perform so much worse? The reason is that the trick of identifying a contiguous dominant range of code points that maps by offset is not as good a fit for windows-1256 as it is for windows-1251, windows-1252, windows-1253, and windows-1255. While there is a range of Arabic characters that is contiguous in both Unicode and in windows-1256, some characters are not in that range. In contrast, all Hebrew consonants (the test data is not vocalized) map by offset between Unicode and windows-1255. The Cyrillic letters needed for Russian are likewise mappable by offset between Unicode and windows-1251, as are Greek lower-case letters (and some upper case ones) in windows-1253. Of course, the bulk of windows-1252 maps by offset. The approach of offsetting one range does not work at all for windows-1250. Considering that even the very slow (relative to other libraries) legacy CJK encode is fast enough for Web browser use cases, non-ASCII single-byte encode is fast enough for those use cases even when the approach of offsetting a range does not work. The offset approach is just a very small-footprint tweak that is a nice bonus when it does work. The Rust Standard Library UTF-8 validation in the Rust standard library is very fast. It took quite a bit of effort to do better. (I hope that the code from encoding_rs gets upstreamed to the standard library eventually.) I managed to make encoding_rs faster than the standard library for input that’s not 100% ASCII first, but even when encoding_rs was faster than the standard library for English Wikipedia, the standard library was still faster for 100% ASCII. To make encoding_rs faster even in that case, it was necessary to introduce a two-tier approach even to the ASCII fast path. Assuming that the input is long enough to use SIMD at all, first the ASCII fast path processes 16 bytes as an unaligned SSE2 read.
If that finds non-ASCII, the cost of having bounced to the SIMD path is still modest. If the first 16 bytes are ASCII, the fast path enters an even faster path that uses aligned reads and unrolls the loop by two. The data cache footprint of the UTF-8 validation function in the Rust standard library is 256 bytes or four cache lines. The data cache footprint of encoding_rs’s UTF-8 validation function is 384 bytes or six cache lines, so 50% more. Using a lookup table to speed up a function that in principle should be doing just simple bit manipulation is a bit questionable, because benchmarks show behavior where the cost of bringing the lookup table to the cache is amortized across the benchmark iterations and the application-context cost of having to evict something else is not visible. For long inputs containing non-ASCII, using a lookup table is clearly justified. The effects on handling short strings as part of a larger system are unclear. As we’ve learned from Spectre, we shouldn’t assume that the 100% ASCII case avoids bringing the lookup table into the data cache. WebKit What bothers me the most about the benchmark results is that WebKit’s UTF-8 to UTF-16 decoder is faster than encoding_rs’s for the 100% ASCII case. That encoding_rs is faster for English Wikipedia content shows how specialized the WebKit win is. Closing the gap did not succeed using the same approach that worked in the case of closing the UTF-8 validation performance gap with the Rust standard library (which involved only reads, while decoding to UTF-16 involves writes, too). I don’t want to sacrifice encoding_rs’s performance in the case where the input isn’t 100% ASCII. The obvious solution would be to introduce very ASCII-biased prefix handling and to move to the current more balanced (between ASCII and non-ASCII) encoding_rs code when the first non-ASCII byte is seen. However, I don’t want to introduce a performance cliff like that. Consider a single copyright sign in a license header at the top of an otherwise ASCII file. For a long file, a good implementation should be able to climb back to the fast path after the copyright sign. As a consolation, the 100% ASCII case matters the most for CSS and JavaScript. In Gecko, the CSS case already uses UTF-8 validation instead of UTF-8 to UTF-16 conversion, and JavaScript is on track to move from UTF-8 to UTF-16 conversion to UTF-8 validation. Interestingly, WebKit’s ASCII fast path is written as ALU code. I didn’t bother trying to locate the right disassembly, but if the performance is any indication, GCC must be unrolling and autovectorizing WebKit’s ASCII fast path. kewb Bob Steagall’s UTF-8 to UTF-16 decoder that combines SSE2 with a Deterministic Finite Automaton (DFA) is remarkably fast. While encoding_rs is a bit faster for Latin script with very infrequent non-ASCII (the threshold is between German and Portuguese) and for writing that doesn’t use ASCII spaces (Thai, Chinese, and Japanese), the DFA is faster for everything that involves more frequent transitions between ASCII and non-ASCII. I haven’t studied properly how the implementation manages the transitions between SSE2 and the DFA, but the result is awesome. Compared to encoding_rs’s lookup table of 384 bytes or six cache lines, the DFA has a larger data cache footprint: the presentation slides say 896 bytes or 14 cache lines.
As noted earlier, in the benchmarks the cost of bringing the tables into the cache is amortized across benchmark iterations, and the cost of having to evict something else in a real-world application is not visible in a benchmark. Considering that encoding_rs::mem (discussed below) reuses encoding_rs’s UTF-8 to UTF-16 decoder for potentially short strings, I’m reluctant to adopt the DFA design that could have adverse cache effects in an application context. One More Thing: encoding_rs::mem The above discussion has been about encoding_rs in its role for converting between external encodings and the application-internal Unicode representation(s). That kind of usage calls for a well-designed streaming API when incremental processing of HTML (and XML) is one of the use cases. However, if an application has, for legacy reasons, multiple application-internal representations, converting between those generally calls less for streaming generality and more for API simplicity. A Rust application written from scratch could do well with just one application-internal Unicode representation: UTF-8. However, Gecko, JavaScript, and the DOM API were created at the time when it was believed that Unicode was a 16-bit code space and that the application-internal Unicode representation should consist of 16-bit units. In the same era, Java, Windows NT, and Qt, among others, committed to 16-bit units in their internal Unicode representations. With the benefit of hindsight, we can now say that it was a mistake to commit to 16-bit units in the application-internal Unicode representation. At the upper end of the code space, it became clear that 16 bits weren’t enough and Unicode was extended to 21 bits, so UTF-16 with surrogates was introduced, making a memory representation consisting of 16-bit units a variable-width representation anyway (even without considering grapheme clusters). At the lower end of the code space, it became clear that the ASCII range remains quite a bit more overrepresented than one might have expected by looking at the natural languages used around the world: Various textual computer syntaxes tend to use ASCII. In the context of Gecko, the syntax of HTML, XML, CSS and JavaScript is ASCII. To cope with these realities, Gecko now uses UTF-8 internally for some things and in some cases tries to store semantically UTF-16 data without the higher half of each code unit—i.e. storing data as Latin1 if possible. In Gecko, this approach is used for JavaScript strings and DOM text nodes. (This approach isn’t unique to Gecko. It is also used in V8, optionally in HotSpot and, with Latin1, UCS-2 and UTF-32 levels, in Python 3. Swift is moving away from a similar dual model to UTF-8.) When adding to the mix that Rust code is confident about UTF-8 validity but C++ isn’t, Gecko ends up with four kinds of internal text representations: UTF-16 whose validity cannot be trusted Latin1 that cannot be invalid UTF-8 whose validity cannot be fully trusted UTF-8 whose validity can be fully trusted encoding_rs::mem provides efficient conversions between these four cases as well as functionality for checking if UTF-16 or UTF-8 only contains code points in the Latin1 range. Furthermore, the module is also able to check whether text is guaranteed not to contain any right-to-left characters. While this check seems to be out of place in this module, it makes sense to combine this check with a Latin1ness check when creating DOM text nodes. Also, it makes sense to optimize the check using portable SIMD.
(In Gecko, documents start their life as left-to-right-only. As long as they stay that way, the Unicode Bidirectional Algorithm can be optimized out in layout. However, whenever text is added to the document, it needs to be inspected to see if it might contain right-to-left characters. Once at least one such character is encountered, the document transitions into the bidi mode and the Unicode Bidirectional Algorithm is used in layout from then on.) Notably, the use case of converting in-memory text is different from converting incrementally-parsed HTML or XML. Instead of providing streaming conversion, encoding_rs::mem provides conversions in a non-streaming manner, which enables a simpler API. In most cases, the caller is supposed to allocate the target buffer according to the maximum possible length requirement. As an exception, conversions to UTF-8 can be performed in multiple steps in order to avoid excessive allocation, considering that the maximum possible length requirement when converting from UTF-16 to UTF-8 is three times the minimum possible case. The general assumption is that when converting from UTF-16 to UTF-8, first the buffer is sized according to the minimum possible case and rounded up to the allocator bucket, and if the result doesn’t fit, then the maximum possible case is tried. When converting XPCOM strings, though, there’s an additional heuristic that looks at the first two cache lines of the UTF-16 buffer in order to guess whether the initial allocation should be larger than the minimum possible size. Since Gecko uses an allocator with power-of-two buckets, it is not worthwhile to compute the buffer size requirements precisely. Being a bit wrong still often ends up in the same allocator bucket. Indeed, the new code that makes guesses and occasionally reallocates is generally faster than the old code that tried to compute the buffer size requirements precisely and ended up doing UTF math twice in the process. The code for encoding_rs::mem looks rather unreviewable. It is that way for performance reasons. The messy look arises from SIMD with raw pointers, manual alignment handling and manual loop unrolling. To convince myself and others that the code does what it is supposed to do, I created another implementation of the same API in the simplest way possible using the Rust standard library facilities. Then I benchmarked the two to verify that my complicated code indeed was faster. Then I used cargo-fuzz to pass the same fuzzing input to both implementations and to check that their output agrees (and that there are no panics or Address Sanitizer-reported problems). This description of encoding_rs::mem looks thematically quite different from the earlier discussion of encoding_rs proper. Indeed, as far as API usage goes, encoding_rs::mem should be a separate crate. The only reason why it is a submodule is that the two share implementation details that don’t make sense to expose as a third crate with a public API. Users of encoding_rs that don’t need encoding_rs::mem should simply ignore the submodule and let link-time optimization discard it. The combination of encoding_rs’s faster converter internals with the new allocation strategy that is a better fit for Gecko’s memory allocator was a clear performance win.
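For illustration, the simplest non-streaming conversion with the module’s public API looks like the following sketch, which sizes the destination for the maximum possible case up front instead of using the bucket-guessing heuristic described above (treat the exact buffer-sizing contract as an assumption):

    use encoding_rs::mem::convert_utf16_to_utf8;

    // Worst case: each UTF-16 code unit becomes at most 3 UTF-8 bytes.
    fn utf16_to_utf8(src: &[u16]) -> Vec<u8> {
        let mut dst = vec![0u8; src.len() * 3];
        let written = convert_utf16_to_utf8(src, &mut dst);
        dst.truncate(written);
        dst
    }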
My hope is that going forward, conversion between UTF-8 and UTF-16 will be perceived as having an acceptable enough cost that Gecko developers will feel more comfortable with components that use UTF-8 internally even if it means that a conversion has to happen on a component boundary. On the other hand, I’m hoping to use this code to speed up a case where there already is a boundary even though the boundary is easy to forget: the WebIDL boundary between JavaScript and C++. Currently, when SpiderMonkey has a Latin1 string, it is expanded to UTF-16 at the DOM boundary, so e.g. using TextEncoder to encode an ASCII JavaScript string to UTF-8 involves expanding the string to UTF-16 and then encoding from UTF-16 to UTF-8 when just copying the bytes over should be logically possible.
Posted over 5 years ago
encoding_rs::mem is a Rust module for performing conversions between different in-RAM text representations that are relevant to Gecko. Specifically, it converts between potentially invalid UTF-16, Latin1 (in the sense that the unsigned byte value equals the Unicode scalar value), potentially invalid UTF-8, and guaranteed-valid UTF-8, and provides some operations on buffers in these encodings, such as checking if a UTF-16 or UTF-8 buffer only has code points in the ASCII range or only has code points in the Latin1 range. (You can read more about encoding_rs::mem in a write-up about encoding_rs as a whole.) The whole point of this module is to make things very fast using Rust’s (not-yet-stable) portable SIMD features. The code was written before slices in the standard library had the align_to method or the chunks_exact method. Moreover, to get speed competitive with the instruction set-specific and manually loop-unrolled C++ code that the Rust code replaced, some loop unrolling is necessary, but Rust does not yet support directives for the compiler that would allow the programmer to request specific loop unrolling from the compiler. As a result, the code is a relatively unreviewable combination of manual alignment calculations, manual loop unrolling and manual raw pointer handling. This indeed achieves high speed, but by looking at the code, it isn’t at all clear whether the code is actually safe or otherwise correct. To validate the correctness of the rather unreviewable code, I used model-based testing with cargo-fuzz. cargo-fuzz provides Rust integration for LLVM’s libFuzzer coverage-guided fuzzer. That is, the fuzzer varies the inputs it tries based on observing how the inputs affect the branches taken inside the code being fuzzed. The fuzzer runs with one of LLVM’s sanitizers enabled. By default, the Address Sanitizer (ASAN) is used. (Even though the sanitizers should never find bugs in safe Rust code, the sanitizers are relevant to bugs in Rust code that uses unsafe.) I wrote a second implementation (the “model”) of the same API in the most obvious way possible using Rust standard-library facilities and without unsafe, except where required to be able to write into an &mut str. I also used the second implementation to validate the speed of the complex implementation. Obviously, there’d be no point in having a complex implementation if it wasn’t faster than the simple and obvious one. (The complex implementation is, indeed, faster.) For example, the function for checking if a buffer of potentially invalid UTF-16 only contains characters in the Latin1 range is 8 lines (including the function name and the closing brace) in the safe version. In the fast version, it’s 3 lines that just call another function expanded from a macro, where the expansion is generated using either a 76-line SIMD-using macro or a 71-line ALU-using macro depending on whether the code was compiled with SIMD enabled. Of these macros, the SIMD one calls another (tiny) function that has a specialized implementation for aarch64 and a portable implementation. To use cargo-fuzz, you create a “fuzzer script”, which is a Rust function that gets a slice of bytes from the fuzzer and exercises the code being fuzzed. In the case of fuzzing encoding_rs::mem, the first byte is used to decide which function to exercise and the rest of the slice is used as the input to the function. When the function being called takes a slice of u16, a suitably aligned u16 subslice of the input is taken.
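Schematically, such a fuzzer script looks like the following sketch (fast_impl and simple_impl are hypothetical names standing in for the complex implementation and the model; the real script exercises many more functions):

    #![no_main]
    use libfuzzer_sys::fuzz_target;

    fuzz_target!(|data: &[u8]| {
        if let Some((&selector, rest)) = data.split_first() {
            if selector == 0 {
                // u16-taking functions get a suitably aligned u16 subslice.
                let (_, aligned, _) = unsafe { rest.align_to::<u16>() };
                assert_eq!(
                    fast_impl::is_utf16_latin1(aligned),
                    simple_impl::is_utf16_latin1(aligned)
                );
            } else {
                assert_eq!(
                    fast_impl::is_utf8_latin1(rest),
                    simple_impl::is_utf8_latin1(rest)
                );
            }
        }
    });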
For each function, the fuzzer script calls both the complex implementation and the corresponding simple implementation with the same input and checks that the outputs match. The fuzzer finds a bug if the outputs don’t match, if there is a panic, or if the LLVM Address Sanitizer notices bad memory access, which could arise from the use of unsafe. Once the fuzzer fails to find problems after having run for a few days, we can have high confidence that the complex implementation is correct in the sense that its observable behavior, ignoring speed, matches the observable behavior of the simple implementation. Therefore, a code review for the correctness of the simple implementation can, with high confidence, be considered to apply to the complex implementation as well.
Posted over 5 years ago
Since version 56, Firefox has had a new character encoding conversion library called encoding_rs. It is written in Rust and replaced the old C++ character encoding conversion library called uconv that dated from early 1999. Initially, all the callers of the character encoding conversion library were C++ code, so the new library, despite being written in Rust, needed to feel usable when used from C++ code. In fact, the library appears to C++ callers as a modern C++ library. Here are the patterns that I used to accomplish that. (There is another write-up about encoding_rs itself. I presented most of the content in this write-up in my talk at RustFest Paris: video, slides.) Modern C++ in What Way? By “modern” C++ I mean that the interface that C++ callers see conforms to the C++ Core Guidelines and uses certain new features: Heap allocations are managed by returning pointers to heap-allocated objects within std::unique_ptr / mozilla::UniquePtr. Caller-allocated buffers are represented using gsl::span / mozilla::Span instead of plain pointer and length. Multiple return values are represented using std::tuple / mozilla::Tuple instead of out params. Non-null plain pointers are annotated using gsl::not_null / mozilla::NotNull. gsl:: above refers to the Guidelines Support Library, which provides things that the Core Guidelines expect to have available but that are not (yet) in the C++ standard library. C++ Library in Rust? By writing a C++ library “in Rust” I mean that the bulk of the library is actually a library written in Rust, but the interface provided to C++ callers makes it look and feel like a real C++ library as far as the C++ callers can tell. Both C++ and Rust Have C Interop C++ has a very complex ABI, and the Rust ABI is not frozen. However, both C++ and Rust support functions that use the C ABI. Therefore, interoperability between C++ and Rust involves writing things in such a way that C++ sees Rust code as C code and Rust sees C++ code as C code. Simplifying Factors This write-up should not be considered a comprehensive guide to exposing Rust code to C++. The interface to encoding_rs is simple enough that it lacks some complexities that one could expect from the general case of interoperability between the two languages. However, the factors that simplify the C++ exposure of encoding_rs can be taken as a guide to simplifications that one should seek to achieve in the interest of easy cross-language interoperability when designing libraries. Specifically: encoding_rs never calls out to C++: The cross-language calls are unidirectional. encoding_rs does not hold references to C++ objects after a call returns: There is no need for Rust code to manage C++ memory. encoding_rs does not present an inheritance hierarchy either in Rust or in C++: There are no vtables on either side. The datatypes that encoding_rs operates on are very simple: Contiguous buffers of primitives (buffers of u8/uint8_t and u16/char16_t). Only the panic=abort configuration (i.e. a Rust panic terminates the program instead of unwinding the stack) is supported and the code presented here is only correct if that option is used. The code presented here does not try to prevent Rust panics from unwinding across the FFI, and letting a panic unwind across the FFI is Undefined Behavior. A Very Quick Look at the API To get an idea about the Rust API under discussion, let’s take a high-level look. The library has three public structs: Encoding, Decoder and Encoder.
From the point of view of the library user, these structs are used like traits, superclasses or interfaces in the sense that they provide a uniform interface to various concrete encodings, but technically they are indeed structs. Instances of Encoding are statically allocated. Decoder and Encoder encapsulate the state of a streaming conversion and are allocated at run-time. A reference to an Encoding, that is &'static Encoding, can be obtained either from a label (textual identification extracted from protocol text) or by a named static. The Encoding can then be used as a factory for a Decoder, which is stack-allocated.

let encoding: &'static Encoding = Encoding::for_label(
    byte_slice_from_protocol // by label
).unwrap_or(
    WINDOWS_1252 // by named static
);
let decoder: Decoder = encoding.new_decoder();

In the streaming case, a method for decoding from a caller-allocated slice into another caller-allocated slice is available on the Decoder. The decoder performs no heap allocations.

pub enum DecoderResult {
    InputEmpty,
    OutputFull,
    Malformed(u8, u8),
}

impl Decoder {
    pub fn decode_to_utf16_without_replacement(
        &mut self,
        src: &[u8],
        dst: &mut [u16],
        last: bool
    ) -> (DecoderResult, usize, usize)
}

In the non-streaming case, the caller does not need to deal with Decoder and Encoder at all. Instead, methods for handling an entire logical input stream in one buffer are provided on Encoding.

impl Encoding {
    pub fn decode_without_bom_handling_and_without_replacement<'a>(
        &'static self,
        bytes: &'a [u8],
    ) -> Option<Cow<'a, str>>
}

The Process

0. Designing for FFI-friendliness

Some of the simplifying factors arise from the problem domain itself. Others are a matter of choice. A character encoding library could reasonably present traits (similar to abstract superclasses with no fields in C++) for each of the concepts of an encoding, a decoder and an encoder. Instead, encoding_rs has structs for these that internally match on an enum for dispatch instead of relying on a vtable.

pub struct Decoder { // no vtable
    variant: VariantDecoder,
    // ...
}

enum VariantDecoder { // no extensibility
    SingleByte(SingleByteDecoder),
    Utf8(Utf8Decoder),
    Gb18030(Gb18030Decoder),
    // ...
}

The primary motivation for this wasn't so much eliminating vtables per se as making the hierarchy intentionally non-extensible. This reflects a philosophy that adding character encodings is not something that programmers should do. Instead, programs should use UTF-8 for interchange, and programs should support legacy encodings only to the extent necessary for compatibility with existing content. The non-extensibility of the hierarchy provides stronger type safety. If you have an Encoding from encoding_rs, you can trust that it doesn't exhibit characteristics that aren't exhibited by the encodings defined in the Encoding Standard. That is, you can trust that it won't behave like UTF-7 or EBCDIC. Additionally, by dispatching on an enum, a decoder for one encoding can internally morph into a decoder for another encoding in response to BOM sniffing. One might argue that the Rustic way to provide encoding converters would be to make them into iterator adaptors that consume an iterator of bytes and yield Unicode scalar values, or vice versa. In addition to iterators being more complex to expose across the FFI, iterators make it harder to perform tricks to accelerate ASCII processing.
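To make the enum-based dispatch concrete, here is a minimal sketch of the pattern. This is not encoding_rs's actual code; the variants, state fields and method are illustrative only:

// A vtable-free dispatch sketch: a match on the enum discriminant replaces
// a virtual call, so a reference to the struct stays a single plain pointer.
pub struct SketchDecoder {
    variant: SketchVariant,
}

enum SketchVariant {
    SingleByte(u8), // per-encoding state, simplified to one byte here
    Utf8(u32),      // per-encoding state, simplified to one word here
}

impl SketchDecoder {
    // Illustrative worst-case estimate, dispatched without any vtable.
    pub fn worst_case_utf16_len(&self, byte_length: usize) -> usize {
        match self.variant {
            SketchVariant::SingleByte(_) => byte_length,
            SketchVariant::Utf8(_) => byte_length + 1,
        }
    }
}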
Taking a slice to read from and a slice to write to not only makes it easier to represent things in a C API (in C terms, a Rust slice decomposes to an aligned non-null pointer and a length) but also enables ASCII acceleration by processing more than one code unit at a time, making use of the observation that multiple code units fit in a single register (either an ALU register or a SIMD register). If the Rust-native API deals only with primitives, slices and (non-trait-object) structs, it is easier to map to a C API than a Rust API that deals with fancier Rust features. (In Rust, you have a trait object when type erasure happens. That is, you have a trait-typed reference that does not say the concrete struct type of the referent that implements the trait.)

1. Creating the C API

When the types involved are simple enough, the main mismatches between C and Rust are the lack of methods and multiple return values in C and the inability to transfer non-C-like structs by value. Methods are wrapped by functions whose first argument is a pointer to the struct whose method is being wrapped. Slice arguments become two arguments: the pointer to the start of the slice and the length of the slice. One primitive value is returned as the function return value and the rest become out params. When the out params clearly relate to inputs of the same type, it makes sense to use in/out params. When a Rust method returns a struct by value, the wrapper function boxes it and returns a pointer, and the Rust side forgets about the struct. Additionally, a function for freeing a given struct type by pointer is added. Such a function simply turns the pointer back into a Box and drops the Box. The struct is opaque from the C point of view. As a special case, the method for getting the name of an encoding, which in Rust would return &'static str, is wrapped by a function that takes a pointer to a writable buffer whose length must be at least the length of the longest name. Enums signaling the exhaustion of the input buffer, the output buffer becoming full, or errors with detail about the error became uint32_t, with constants for "input empty" and "output full" and rules for how to interpret the other error details. This isn't ideal, but it works pretty well in this case. Overflow-checking length computations are presented as saturating instead. That is, the caller has to treat SIZE_MAX as a value signaling overflow.
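As a sketch of how these conventions combine, here is what a wrapper for the streaming decode method shown earlier could look like. It follows the conventions just described, but treat it as an illustration rather than the authoritative signature; the result-to-u32 packing it relies on is covered later:

// Sketch: &mut self becomes an explicit pointer-typed first argument, each
// slice becomes a pointer plus a length, the in/out length params report how
// much was read and written, and the result enum travels as a u32.
#[no_mangle]
pub unsafe extern "C" fn decoder_decode_to_utf16_without_replacement(
    decoder: *mut Decoder,
    src: *const u8,
    src_len: *mut usize,
    dst: *mut u16,
    dst_len: *mut usize,
    last: bool,
) -> u32 {
    let src_slice = ::std::slice::from_raw_parts(src, *src_len);
    let dst_slice = ::std::slice::from_raw_parts_mut(dst, *dst_len);
    let (result, read, written) = (*decoder)
        .decode_to_utf16_without_replacement(src_slice, dst_slice, last);
    *src_len = read;    // in: length of src; out: bytes read
    *dst_len = written; // in: length of dst; out: code units written
    decoder_result_to_u32(result) // packing shown in the Algebraic Types section
}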
2. Re-Creating the Rust API in C++ over the C API

Even an idiomatic C API doesn't make for a modern C++ API. Fortunately, Rustic concepts like multiple return values and slices can be represented in C++, and by reinterpreting pointers returned by the C API as pointers to C++ objects, it's possible to present the ergonomics of C++ methods. Most of the examples are from a version of the API that uses C++17 standard library types. In Gecko, we generally avoid the C++ standard library and use a version of the C++ API to encoding_rs that uses Gecko-specific types. I assume that the standard-library-type examples make more sense to a broader audience.

Method Ergonomics

For each opaque struct pointer type in C, a class is defined in C++, and the C header is tweaked such that the pointer types become pointers to instances of the C++ classes from the point of view of the C++ compiler. This amounts to a reinterpret_cast of the pointers without actually writing out the reinterpret_cast. Since the pointers don't truly point to instances of the classes that they appear to point to but to instances of Rust structs instead, it's a good idea to take some precautions. No fields are declared for the classes. The default no-argument and copy constructors are deleted, as is the default operator=. Additionally, there must be no virtual methods. (This last point is an important limitation that we will come back to later.)

class Encoding final {
    // ...
private:
    Encoding() = delete;
    Encoding(const Encoding&) = delete;
    Encoding& operator=(const Encoding&) = delete;
    ~Encoding() = delete;
};

In the case of Encoding, all of whose instances are static, the destructor is deleted as well. In the case of the dynamically-allocated Decoder and Encoder, both an empty destructor and a static void operator delete are added. (An example follows a bit later.) This enables the destruction of the fake C++ class to be routed to the right type-specific freeing function in the C API. With that foundation in place to materialize pointers that look like pointers to C++ class instances, it's possible to make method calls on these pointers work. (An example follows after introducing the next concept, too.)

Returning Dynamically-Allocated Objects

As noted earlier, the cases where the Rust API would return an Encoder or a Decoder by value, so that the caller can place them on the stack, are replaced by the FFI wrapper boxing the objects, so that the C API exposes only heap-allocated objects by pointer. The reinterpretation of these pointers as delete-able C++ object pointers was already covered. That still leaves making sure that delete is actually used at an appropriate time. In modern C++, when an object can have only one legitimate owner at a time, this is accomplished by wrapping the object pointer in std::unique_ptr or mozilla::UniquePtr. The old uconv converters supported reference counting, but all the actual uses in the Gecko code base involved only one owner for each converter. Since the usage patterns of encoders and decoders are such that there is only one legitimate owner at a time, using std::unique_ptr and mozilla::UniquePtr is what the two C++ wrappers for encoding_rs do. Let's take a look at a factory method on Encoding that returns a Decoder. In Rust, we have a method that takes a reference to self and returns a Decoder by value.

impl Encoding {
    pub fn new_decoder(&'static self) -> Decoder {
        // ...
    }
}

On the FFI layer, we have an explicit pointer-typed first argument that corresponds to Rust &self and C++ this (specifically, the const version of this). We allocate memory on the heap (Box::new()) and place the Decoder into the allocated memory. We then forget about the allocation (Box::into_raw) so that we can return the pointer to C without deallocating at the end of the scope. In order to be able to free the memory, we introduce a new function that puts the Box back together and assigns it into a variable that immediately goes out of scope, causing the heap allocation to be freed.
#[no_mangle]
pub unsafe extern "C" fn encoding_new_decoder(
    encoding: *const Encoding) -> *mut Decoder {
    Box::into_raw(Box::new((*encoding).new_decoder()))
}

#[no_mangle]
pub unsafe extern "C" fn decoder_free(decoder: *mut Decoder) {
    let _ = Box::from_raw(decoder);
}

In the C header, they look like this:

ENCODING_RS_DECODER* encoding_new_decoder(ENCODING_RS_ENCODING const* encoding);
void decoder_free(ENCODING_RS_DECODER* decoder);

ENCODING_RS_DECODER is a macro that is used for substituting the right C++ type when the C header is used in the C++ context instead of being used as a plain C API. On the C++ side, then, we use std::unique_ptr, which is the C++ analog of Rust's Box. They are indeed very similar:

let ptr: Box<Foo> ↔ std::unique_ptr<Foo> ptr
Box::new(Foo::new(a, b, c)) ↔ std::make_unique<Foo>(a, b, c)
Box::into_raw(ptr) ↔ ptr.release()
let ptr = Box::from_raw(raw_ptr); ↔ std::unique_ptr<Foo> ptr(raw_ptr);

We wrap the pointer obtained from the C API in a std::unique_ptr:

class Encoding final {
public:
    inline std::unique_ptr<Decoder> new_decoder() const {
        return std::unique_ptr<Decoder>(encoding_new_decoder(this));
    }
};

When the std::unique_ptr goes out of scope, the deletion is routed back to Rust via FFI thanks to declarations like this:

class Decoder final {
public:
    ~Decoder() {}
    static inline void operator delete(void* decoder) {
        decoder_free(reinterpret_cast<Decoder*>(decoder));
    }
private:
    Decoder() = delete;
    Decoder(const Decoder&) = delete;
    Decoder& operator=(const Decoder&) = delete;
};

How Can it Work?

In Rust, non-trait methods are just syntactic sugar:

impl Foo {
    pub fn get_val(&self) -> usize {
        self.val
    }
}

fn test(bar: Foo) {
    assert_eq!(bar.get_val(), Foo::get_val(&bar));
}

A method call on a non-trait-typed reference is just a plain function call with the reference to self as the first argument. On the C++ side, non-virtual method calls work the same way: a non-virtual C++ method call is really just a function call whose first argument is the this pointer. On the FFI/C layer, we can pass the same pointer as an explicit pointer-typed first argument. When calling ptr->Foo() where ptr is of type T*, the type of this is T* if the method is declared as void Foo() (which maps to &mut self in Rust) and const T* if the method is declared as void Foo() const (which maps to &self in Rust), so const-correctness is handled, too.

fn foo(&self, bar: usize) -> usize ↔ size_t foo(size_t bar) const
fn foo(&mut self, bar: usize) -> usize ↔ size_t foo(size_t bar)

The qualifications about "non-trait-typed" and "non-virtual" are important. For the above to work, we can't have vtables on either side. This means no Rust trait objects and no C++ inheritance. In Rust, trait objects, i.e. trait-typed references to any struct that implements the trait, are implemented as two pointers: one to the struct instance and another to the vtable appropriate for the concrete type of the data. We need to be able to pass the reference to self across the FFI as a single pointer, so there's no place for the vtable pointer when crossing the FFI. In order to keep pointers to C++ objects as C-compatible plain pointers, C++ puts the vtable pointer on the objects themselves. Since the pointers don't really point to C++ objects carrying vtable pointers but point to Rust objects, we must make sure not to make the C++ implementation expect to find a vtable pointer on the pointee.
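The two-pointer representation of trait objects is easy to observe directly; a quick self-contained sketch (illustrative names):

use std::mem::size_of;

trait Decode {}
struct ConcreteDecoder;
impl Decode for ConcreteDecoder {}

fn main() {
    // A trait-typed reference is two words: data pointer plus vtable pointer.
    assert_eq!(size_of::<&dyn Decode>(), 2 * size_of::<usize>());
    // A reference to a concrete struct is a single pointer, so it can cross
    // the FFI as a plain C pointer.
    assert_eq!(size_of::<&ConcreteDecoder>(), size_of::<usize>());
}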
As a consequence, the C++ reflector classes for the Rust structs cannot inherit from a common base class of a C++ framework. In the Gecko case, the reflector classes cannot inherit from nsISupports; e.g. in the context of Qt, the reflector classes wouldn't be able to inherit from QObject.

Non-Nullable Pointers

There are methods in the Rust API that return &'static Encoding. Rust references can never be null, and it would be nice to relay this piece of information in the C++ API. It turns out that there is a C++ idiom for this: gsl::not_null and mozilla::NotNull. Since gsl::not_null and mozilla::NotNull are just type-system-level annotations that don't change the machine representation of the underlying pointer, and since from the guarantees Rust makes we know which of the pointers we get from the FFI can never be null, it is tempting to apply the same trick of lying to the C++ compiler about types that we already use to reinterpret pointers returned by the FFI as pointers to field-less C++ objects with no virtual methods, and to claim in a header file that the FFI return values that we know not to be null are of the type gsl::not_null / mozilla::NotNull. Unfortunately, this doesn't actually work, because types involving templates are not allowed in the declarations of extern "C" functions in C++, so the C++ code ends up executing a branch for the null check when wrapping pointers received from the C API with gsl::not_null or mozilla::NotNull. However, there are also declarations of static pointers to the constant encoding objects (where the pointees are defined in Rust), and it happens that C++ does allow declaring those as gsl::not_null, so that is what is done. (Thanks to Masatoshi Kimura for pointing out that this is possible.) The statically-allocated instances of Encoding are declared in Rust like this:

pub static UTF_8_INIT: Encoding = Encoding {
    name: "UTF-8",
    variant: VariantEncoding::Utf8,
};

pub static UTF_8: &'static Encoding = &UTF_8_INIT;

In Rust, the general rule is that you use static for an unchanging memory location and const for an unchanging value. Therefore, UTF_8_INIT should be static and UTF_8 should be const: the value of the reference to the static instance is unchanging, but statically allocating a memory location for the reference is not logically necessary. Unfortunately, Rust has a rule that says that the right-hand side of const may not contain anything static, and this is applied so heavily as to prohibit even references to static, in order to ensure that the right-hand side of a const declaration can be statically checked to be suitable for use within any imaginable const declaration, even one that tried to dereference the reference at compile time. For FFI, though, we need to allocate an unchanging memory location for a pointer to UTF_8_INIT, because such a memory location works with C linkage and allows us to provide a pointer-typed named thing to C. The representation of UTF_8 above is already what we need, but for Rust ergonomics, we want UTF_8 to participate in Rust's crate namespacing, which means that from the C perspective the name gets mangled. We waste some space by statically allocating the pointers again, without name mangling, for C usage:

pub struct ConstEncoding(*const Encoding);

unsafe impl Sync for ConstEncoding {}

#[no_mangle]
pub static UTF_8_ENCODING: ConstEncoding = ConstEncoding(&UTF_8_INIT);

A pointer type is used to make it clear that C is supposed to see a pointer (even if a Rust reference type would have the same representation). However, the Rust compiler refuses to compile a program with a globally-visible pointer: since globals are reachable from different threads, multiple threads accessing the pointee might be a problem.
In this case, the pointee cannot be mutated, so global visibility is fine. To tell the compiler that this is fine, we need to implement the Sync marker trait for the pointer. However, traits cannot be implemented on pointer types. As a workaround, we create a newtype for *const Encoding. A newtype has the same representation as the type it wraps, but we can implement traits on the newtype. Implementing Sync is unsafe, because we are asserting to the compiler that something is OK when the compiler does not figure it out on its own. In C++, we can then write (what the macros expand to):

extern "C" {
    extern gsl::not_null<const encoding_rs::Encoding*> const UTF_8_ENCODING;
}

The pointers to the encoders and decoders are also known not to be null, since allocation failure would terminate the program, but std::unique_ptr / mozilla::UniquePtr and gsl::not_null / mozilla::NotNull cannot be combined.

Optional Values

In Rust, it's idiomatic to use Option to represent return values that might either have a value or not. C++ these days provides the same thing as std::optional. In Gecko, we instead have mozilla::Maybe. Rust's Option and C++'s std::optional are indeed basically the same thing:

return None; ↔ return std::nullopt;
return Some(foo); ↔ return foo;
is_some() ↔ operator bool() / has_value()
unwrap() ↔ value()
unwrap_or(bar) ↔ value_or(bar)

Unfortunately, though, C++ reverses the safety ergonomics. The most ergonomic way to extract the wrapped value from a std::optional is via operator*(), which is unchecked and, therefore, unsafe. 😭

Multiple Return Values

While C++ lacks language-level support for multiple return values, multiple return values are possible thanks to library-level support. In the case of the standard library, the relevant library pieces are std::tuple, std::make_tuple and std::tie. In the case of Gecko, the relevant library pieces are mozilla::Tuple, mozilla::MakeTuple and mozilla::Tie.

fn foo() -> (T, U, V) ↔ std::tuple<T, U, V> foo()
return (a, b, c); ↔ return {a, b, c};
let (a, b, c) = foo(); ↔ const auto [a, b, c] = foo();
let (mut a, mut b, mut c) = foo(); ↔ auto [a, b, c] = foo();

Slices

A Rust slice wraps a non-owning pointer and a length that identify a contiguous part of an array. In comparison to C:

src: &[u8] ↔ const uint8_t* src, size_t src_len
dst: &mut [u8] ↔ uint8_t* dst, size_t dst_len

There isn't a corresponding thing in the C++ standard library yet (except std::string_view for read-only string slices), but it's already part of the C++ Core Guidelines and is called a span there.

src: &[u8] ↔ gsl::span<const uint8_t> src
dst: &mut [u8] ↔ gsl::span<uint8_t> dst
&mut vec[..] ↔ gsl::make_span(vec)
std::slice::from_raw_parts(ptr, len) ↔ gsl::make_span(ptr, len)
for item in slice {} ↔ for (auto&& item : span) {}
slice[i] ↔ span[i]
slice.len() ↔ span.size()
slice.as_ptr() ↔ span.data()

GSL relies on C++14, but at the time encoding_rs landed, Gecko was stuck on C++11 thanks to Android. Since GSL could not be used as-is in Gecko, I backported gsl::span to C++11 as mozilla::Span. The porting process was mainly a matter of ripping out constexpr keywords and using mozilla:: types and type traits in addition to or instead of standard-library ones. After Gecko moved to C++14, some of the constexpr keywords have been restored. Once we had our own mozilla::Span anyway, it was possible to add Rust-like subspan ergonomics that are missing from gsl::span. For the case where you want a subspan from index i up to but not including index j, gsl::span has:
&slice[i..] ↔ span.subspan(i)
&slice[..i] ↔ span.subspan(0, i)
&slice[i..j] ↔ span.subspan(i, j - i) 😭

mozilla::Span instead has:

&slice[i..] ↔ span.From(i)
&slice[..i] ↔ span.To(i)
&slice[i..j] ↔ span.FromTo(i, j)

gsl::span and Rust slices have one crucial difference in how they decompose into a pointer and a length. For a zero-length gsl::span, it is possible for the pointer to be nullptr. In the case of Rust slices, the pointer must always be non-null and aligned, even for zero-length slices. This may look counter-intuitive at first: when the length is zero, the pointer never gets dereferenced, so why does it matter whether it is null or not? It turns out that it matters for optimizing out the enum discriminant in Option-like enums. None is represented by all-zero bits, so if wrapped in Some(), a slice with null as the pointer and zero as the length would accidentally have the same representation as None. By requiring the pointer to be a potentially bogus non-null pointer, a zero-length slice inside an Option can be represented distinctly from None without a discriminant. By requiring the pointer to be aligned, further uses of the low bits of the pointer are possible when the alignment of the slice element type is greater than one. After realizing that it's not okay to pass the pointer obtained from C++ gsl::span::data() to Rust std::slice::from_raw_parts() as-is, it was necessary to decide where to put the replacement of nullptr with reinterpret_cast<T*>(alignof(T)). There are two candidate locations when working with actual gsl::span: in the Rust code that provides the FFI or in the C++ code that calls the FFI. When working with mozilla::Span, the code of the span implementation itself could be changed, so there are two additional candidate locations for the check: the constructor of mozilla::Span and the getter for the pointer. Of these four candidate locations, the constructor of mozilla::Span seemed like the one where the compiler has the best opportunity to optimize out the check in some cases. That's why I chose to put the check there. This means that in the gsl::span scenario the check had to go in the code that calls the FFI. All pointers obtained from gsl::span have to be laundered through:

template <class T>
static inline T* null_to_bogus(T* ptr) {
    return ptr ? ptr : reinterpret_cast<T*>(alignof(T));
}

Additionally, this means that, since the check is not in the code that provides the FFI, the C API became slightly unidiomatic in the sense that it requires C callers to avoid passing in NULL even when the length is zero. However, the C API already has many caveats about things that are Undefined Behavior, and adding yet another thing that is documented to be Undefined Behavior does seem like an idiomatic thing to do with C.
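For contrast, here is a sketch of the other candidate location, doing the substitution in the Rust code that provides the FFI. This is hypothetical; as described above, encoding_rs puts the check on the C++ side instead, and its C API forbids NULL:

// Hypothetical alternative: tolerate NULL on the Rust side of the FFI by
// substituting a dangling but well-aligned pointer, which is exactly what
// an empty Rust slice requires.
use std::ptr::NonNull;

#[no_mangle]
pub unsafe extern "C" fn example_is_ascii(src: *const u8, src_len: usize) -> bool {
    let ptr = if src.is_null() {
        // The address equals the alignment (1 for u8), matching the
        // reinterpret_cast<T*>(alignof(T)) trick above.
        NonNull::<u8>::dangling().as_ptr() as *const u8
    } else {
        src
    };
    let slice = std::slice::from_raw_parts(ptr, src_len);
    slice.iter().all(|&b| b < 0x80)
}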
Putting it Together

Let's look at an example of how the above features combine. First, in Rust we have a method that takes a slice and returns an optional tuple:

impl Encoding {
    pub fn for_bom(buffer: &[u8]) -> Option<(&'static Encoding, usize)> {
        if buffer.starts_with(b"\xEF\xBB\xBF") {
            Some((UTF_8, 3))
        } else if buffer.starts_with(b"\xFF\xFE") {
            Some((UTF_16LE, 2))
        } else if buffer.starts_with(b"\xFE\xFF") {
            Some((UTF_16BE, 2))
        } else {
            None
        }
    }
}

Since this is a static method, there is no reference to self and no corresponding pointer in the FFI function. The slice decomposes into a pointer and a length. The length becomes an in/out param that communicates the length of the slice in and the length of the BOM subslice out. The encoding becomes the return value, and the encoding pointer being null communicates the Rust None case for the tuple.

#[no_mangle]
pub unsafe extern "C" fn encoding_for_bom(buffer: *const u8,
                                          buffer_len: *mut usize) -> *const Encoding {
    let buffer_slice = ::std::slice::from_raw_parts(buffer, *buffer_len);
    let (encoding, bom_length) = match Encoding::for_bom(buffer_slice) {
        Some((encoding, bom_length)) => (encoding as *const Encoding, bom_length),
        None => (::std::ptr::null(), 0),
    };
    *buffer_len = bom_length;
    encoding
}

In the C header, the signature looks like this:

ENCODING_RS_ENCODING const* encoding_for_bom(uint8_t const* buffer, size_t* buffer_len);

The C++ layer then rebuilds the analog of the Rust API on top of the C API:

class Encoding final {
public:
    static inline std::optional<std::tuple<gsl::not_null<const Encoding*>, size_t>>
    for_bom(gsl::span<const uint8_t> buffer) {
        size_t len = buffer.size();
        const Encoding* encoding =
            encoding_for_bom(null_to_bogus(buffer.data()), &len);
        if (encoding) {
            return std::make_tuple(gsl::not_null<const Encoding*>(encoding), len);
        }
        return std::nullopt;
    }
};

Here we have to explicitly use std::make_tuple, because the implicit constructor doesn't work when the std::tuple is nested inside std::optional.

Algebraic Types

Early on, we saw that the Rust-side streaming API can return this enum:

pub enum DecoderResult {
    InputEmpty,
    OutputFull,
    Malformed(u8, u8),
}

C++ now has an analog for Rust enum, sort of: std::variant. In practice, though, std::variant is so clunky that it does not make sense to use it when a Rust enum is supposed to act in a lightweight way from the point of view of ergonomics. First, the variants in std::variant aren't named; they are identified positionally or by type. Named variants were proposed as lvariant but did not get accepted. Second, even though duplicate types are permitted, working with them is not practical. Third, there is no language-level analog for Rust's match. A match-like mechanism was proposed as inspect() but was not accepted. On the FFI/C layer, the information from the above enum is packed into a u32. Instead of trying to expand it to something fancier on the C++ side, the C++ API uses the same uint32_t as the C API. If the caller actually cares about extracting the two small integers in the malformed case, it's up to the caller to do the bitwise ops to extract them from the uint32_t. The FFI code looks like this:

pub const INPUT_EMPTY: u32 = 0;
pub const OUTPUT_FULL: u32 = 0xFFFFFFFF;

fn decoder_result_to_u32(result: DecoderResult) -> u32 {
    match result {
        DecoderResult::InputEmpty => INPUT_EMPTY,
        DecoderResult::OutputFull => OUTPUT_FULL,
        DecoderResult::Malformed(bad, good) => ((good as u32) << 8) | (bad as u32),
    }
}

Using zero as the magic value for INPUT_EMPTY is a premature micro-optimization: on some architectures comparison with zero is cheaper than comparison with other constants, and the values representing the malformed case when decoding and the unmappable case when encoding are known not to overlap zero.
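Going the other way, a caller that does care about the details would unpack the u32 with bitwise ops. A sketch of a hypothetical helper (not part of the published API) mirroring the packing above:

// Hypothetical inverse of decoder_result_to_u32 above.
fn u32_to_decoder_result(packed: u32) -> DecoderResult {
    match packed {
        INPUT_EMPTY => DecoderResult::InputEmpty,
        OUTPUT_FULL => DecoderResult::OutputFull,
        _ => DecoderResult::Malformed(
            (packed & 0xFF) as u8,        // bad: the low byte
            ((packed >> 8) & 0xFF) as u8, // good: the next byte
        ),
    }
}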
Signaling Integer Overflow

Decoder and Encoder have methods for querying the worst-case output buffer size requirement. The caller provides the number of input code units, and the method returns the smallest output buffer length, in code units, that guarantees that the corresponding conversion method will not return OutputFull. E.g. when encoding from UTF-16 to UTF-8, calculating the worst case involves multiplication by three. Such a calculation can, at least in principle, result in integer overflow. In Rust, integer overflow is considered safe, because even if you allocate too short a buffer as a result of its length computation overflowing, actually accessing the buffer is bounds checked, so the overall result is safe. However, buffer access is not generally bounds checked in C or C++, so an integer overflow in Rust can result in memory unsafety in C or C++ if the result of the calculation that overflowed is used for deciding the size of buffers allocated and accessed by C or C++ code. In the case of encoding_rs, even when C or C++ allocates the buffer, the writing is supposed to be performed by Rust code, so it might be OK. However, to be sure, the worst-case calculations provided by encoding_rs use overflow-checking arithmetic. In Rust, the methods whose arithmetic is overflow-checked return Option<usize>. To keep the types of the C API simple, the C API returns size_t with SIZE_MAX signaling overflow. That is, the C API effectively appears as using saturating arithmetic. In the C++ API version that uses standard-library types, the return type is std::optional<size_t>. In Gecko, we have a wrapper for integer types that provides overflow-checking arithmetic and a validity flag. In the Gecko version of the C++ API, the return type is mozilla::CheckedInt<size_t>, so that dealing with overflow signaling is uniform with the rest of Gecko code. (Aside: I find it shocking and dangerous that the C++ standard library still does not provide a wrapper similar to mozilla::CheckedInt for doing overflow-checking integer math in a standard-supported, Undefined Behavior-avoiding way.)
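As a sketch of the shape of such a method and its C-facing wrapper (names illustrative; simplified from what encoding_rs actually provides):

// Overflow-checked worst case: UTF-16 to UTF-8 needs at most three output
// bytes per input code unit.
pub fn max_utf8_len_from_utf16(u16_len: usize) -> Option<usize> {
    u16_len.checked_mul(3)
}

// The C-facing wrapper saturates: SIZE_MAX (usize::MAX) signals overflow.
#[no_mangle]
pub extern "C" fn example_max_utf8_len_from_utf16(u16_len: usize) -> usize {
    max_utf8_len_from_utf16(u16_len).unwrap_or(usize::MAX)
}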
Recreating the Non-Streaming API

Let's look again at the example of a non-streaming API method on Encoding:

impl Encoding {
    pub fn decode_without_bom_handling_and_without_replacement<'a>(
        &'static self,
        bytes: &'a [u8],
    ) -> Option<Cow<'a, str>>
}

The type inside the Option in the return type is Cow<'a, str>, which is a type that holds either an owned String or a borrowed string slice (&'a str) whose data is owned by someone else. The lifetime 'a of the borrowed string slice is the lifetime of the input slice (bytes: &'a [u8]), because in the borrow case the output is actually borrowed from the input. Mapping this kind of return type to C poses problems. First of all, C does not provide a great way to say that we either have the owned case or the borrowed case. Second, C does not have a standard type for heap-allocated strings that know their length and capacity and that can reallocate their buffer when modified. Maybe this could be seen as an opportunity to create a new C type whose buffer is managed by a Rust String, but then such a type would not fit together with C++ strings. Third, a borrowed string slice in C would be a raw pointer and a length and some documentation that says that the pointer is valid only as long as the input pointer is valid; there would be no language-level safeguards against use-after-free. The solution is not to provide the non-streaming API on the C layer at all. On the Rust side, the non-streaming API is a convenience API built on top of the streaming API and some validation functions (ASCII validation, UTF-8 validation, ISO-2022-JP ASCII state validation). Instead of trying to provide FFI bindings for the non-streaming API in an inconvenient manner, a similar non-streaming API can be recreated in C++ on top of the streaming API and the validation functions that were suitable for FFI. While the C++ type system could represent the same kind of structure as Rust's Cow<'a, str>, e.g. as std::variant, such a C++ Cow would be unsafe, because the lifetime 'a would not be enforced by C++. While a std::string_view (or gsl::span) is (mostly) OK as an argument in C++, as a return type it's a use-after-free waiting to happen. As with C, at best there would be some documentation saying that the output std::string_view is valid for as long as the input gsl::span is valid. To avoid the use-after-free risk, in the C++ API version that uses C++17 standard-library types, I simply ended up making the C++ decode_without_bom_handling_and_without_replacement() always copy and return a std::optional<std::string>. In the case of Gecko, though, it's possible to do better while keeping things safe. Gecko uses XPCOM strings, which provide a variety of storage options, notably: dependent strings that (unsafely) borrow storage owned by someone else, auto strings that store short strings in an inline buffer, and shared strings that point to a heap-allocated reference-counted buffer. In the case where the buffer to decode is in an XPCOM string that points to a reference-counted heap-allocated buffer and we are decoding to UTF-8 (as opposed to UTF-16), in the cases where we'd borrow in Rust (except for BOM removal cases), we can instead make the output string point to the same reference-counted heap-allocated buffer that the input points to (and increment the reference count). This is indeed what the non-streaming API for mozilla::Encoding does. Compared to Rust, there is a limitation beyond the input string having to use reference-counted storage for the copy avoidance to work: the input must not have the UTF-8 BOM in the cases where the BOM is removed. While Rust can borrow a subslice of the input excluding the BOM, with XPCOM strings just incrementing a reference count only works if the byte content of the input and output is entirely the same. When the first three bytes need to be omitted, it's not entirely the same. While the C++ API version that uses C++17 standard-library types builds the non-streaming API on top of the streaming API in C++, for added safety, the non-streaming part of mozilla::Encoding is not actually built on the streaming C++ API in C++ but on top of the streaming Rust API in Rust. In Gecko, we have Rust bindings for XPCOM strings, so it's possible to manipulate XPCOM strings from Rust.

Epilog: Do We Really Need to Hold Decoder and Encoder by Pointer?

Apart from having to copy in the non-streaming API due to C++ not having a safe mechanism for borrows, it's a bit disappointing that instantiating Decoder and Encoder from C++ involves a heap allocation while Rust callers get to allocate these types on the stack. Can we get rid of the heap allocation for C++ users of the API? The answer is that we could, but to do it properly we'd end up with the complexity of making the C++ build system generate constants by querying them from rustc. We can't return a non-C-like struct over the FFI by value, but given a suitably-aligned pointer to enough memory, we can write a non-C-like struct to memory provided by the other side of the FFI.
In fact, the API supports this as an optimization of instantiating a new Decoder into a heap allocation made by Rust previously:

#[no_mangle]
pub unsafe extern "C" fn encoding_new_decoder_into(
    encoding: *const Encoding, decoder: *mut Decoder) {
    *decoder = (*encoding).new_decoder();
}

Even though the documentation says that encoding_new_decoder_into() should only be used with pointers to Decoder previously obtained from the API, in the case of Decoder, assigning with = would be OK even if the memory pointed to by the pointer was uninitialized, because Decoder does not implement Drop. That is, in C++ terms, Decoder in Rust does not have a destructor, so assignment with = does not do any clean-up that would assume the pointer points to a previous valid Decoder. When writing a Rust struct that implements Drop into uninitialized memory, std::ptr::write() should be used instead of =. std::ptr::write() "overwrites a memory location with the given value without reading or dropping the old value". Perhaps it would set a good example to use std::ptr::write() even in the above case, even though it's not strictly necessary. When working with a pointer previously obtained from a Rust Box, the pointer is aligned correctly and points to a sufficiently large piece of memory. If C++ is to allocate stack memory for Rust code to write into, we need to make the C++ code use the right size and alignment. The issue of communicating these two numbers from Rust to C++ is already where things start getting brittle. The C++ code needs to discover the right size and alignment for the struct. These cannot be discovered by calling FFI functions, because C++ needs to know them at compile time. Size and alignment aren't just constants that could be written manually in a header file once and forgotten. First of all, they change when the Rust structs change, so just writing them down has the risk of the written-down values getting out of sync with the real requirements as the Rust code changes. Second, the values differ on 32-bit architectures vs. 64-bit architectures. Third, and this is the worst, the alignment can differ from one 32-bit architecture to another. Specifically, the alignment of f64 is 8 on most targets, like ARM, MIPS and PowerPC, but the alignment of f64 is 4 on x86. If Rust gets an m68k port, even more variety in alignments across 32-bit platforms is to be expected. It seems that the only way to get this right is to get the size and alignment information from rustc as part of the build process before the C++ code is built, so that the numbers can be written in a generated C++ header file that the C++ code can then refer to. The simple way to do this would be to have the build system compile and run a tiny Rust program that prints out a C++ header with numbers obtained using std::mem::size_of and std::mem::align_of. This solution assumes that the build system runs on the architecture that the compilation is targeting, so this solution would break cross-compilation. That's not good.
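For concreteness, the tiny program could look something like the following sketch, which assumes the crate defining Decoder is available as a dependency and that stdout is redirected into a generated header:

// A sketch of the "tiny Rust program" approach (subject to the
// cross-compilation caveat just mentioned).
use encoding_rs::Decoder;
use std::mem::{align_of, size_of};

fn main() {
    println!("#define DECODER_SIZE {}", size_of::<Decoder>());
    println!("#define DECODER_ALIGNMENT {}", align_of::<Decoder>());
}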
We need to extract the target-specific size and alignment of a given struct from rustc, but without having to run a binary built for the target. It turns out that rustc has a command-line option, -Zprint-type-sizes, that prints out the size and alignment of types. Unfortunately, the feature is nightly-only… Anyway, the most correct way to go about this would be to have a build script controlling C++ compilation first invoke rustc with that option, parse out the sizes and alignments of interest, and generate a C++ header file with the numbers as constants. Or, since overaligning is permitted, we could trust that the struct will not have a SIMD member (alignment 16 for 128-bit vectors) and always align to 8. We could also check the size on 64-bit platforms, always use that, and hope for the best (especially hope that whenever the struct grows in Rust, someone remembers to update the C++-visible size). But hoping for the best in memory matters kind of defeats the point of using Rust. Anyway, assuming that we have constants DECODER_SIZE and DECODER_ALIGNMENT available to C++ somehow, we can do this:

class alignas(DECODER_ALIGNMENT) Decoder final {
    friend class Encoding;
public:
    ~Decoder() {}
    Decoder(Decoder&&) = default;
private:
    unsigned char storage[DECODER_SIZE];
    Decoder() = default;
    Decoder(const Decoder&) = delete;
    Decoder& operator=(const Decoder&) = delete;
    // ...
};

Notably:

Instead of the constructor Decoder() being marked delete, it is marked default but still private.
Encoding is declared as a friend to grant it access to the above-mentioned constructor.
A public default move constructor is added.
A single private field of type unsigned char[DECODER_SIZE] is added.
Decoder itself is declared with alignas(DECODER_ALIGNMENT).
operator delete is no longer overloaded.

Then new_decoder() on Encoding can be written like this (and be renamed make_decoder to avoid unidiomatic use of the word "new" in C++):

class Encoding final {
public:
    inline Decoder make_decoder() const {
        Decoder decoder;
        encoding_new_decoder_into(this, &decoder);
        return decoder;
    }
    // ...
};

And it can be used like this:

Decoder decoder = input_encoding->make_decoder();

Note that outside the implementation of Encoding, trying to just declare Decoder decoder; without initializing it right away is a compile-time error, because the constructor Decoder() is private. Let's unpack what's happening: The array of unsigned char provides storage for the Rust Decoder. The C++ Decoder has no base class, virtual methods, etc., so there are no implementation-supplied hidden members, and the address of a Decoder is the same as the address of its storage member, so we can simply pass the address of the Decoder itself to Rust. The alignment of unsigned char is 1, i.e. unrestricted, so alignas on the Decoder gets to determine the alignment. The default trivial move constructor memmoves the bytes of the Decoder, and the Rust Decoder is OK to move. The private default no-argument constructor makes it a compile error to try to declare a not-immediately-initialized instance of the C++ Decoder outside the implementation of Encoding. Encoding, however, can instantiate an uninitialized Decoder and pass a pointer to it to Rust, so that Rust code can write the Rust Decoder instance into the C++-provided memory via the pointer.
Posted over 5 years ago by Daniel.Pocock
My home automation plans have been progressing and I'd like to share some observations I've made about planning a project like this, especially for those with larger houses. With so many products and technologies, it can be hard to know where to start. Some things have become straightforward; for example, Domoticz can soon be installed from a package on some distributions. Yet this simply leaves people contemplating what to do next.

The quickstart

For a small home, like an apartment, you can simply buy something like the Zigate, a single motion and temperature sensor, a couple of smart bulbs, and expand from there. For a large home, you can also get your feet wet with exactly the same approach in a single room. Once you are familiar with the products, use a more structured approach to plan a complete solution for every other space. The Debian wiki has started gathering some notes on things that work easily on GNU/Linux systems like Debian as well as Fedora and others.

Prioritize

What is your first goal? For example, are you excited about having smart lights, or are you more concerned with improving your heating system's efficiency with zoned logic? Trying to do everything at once may be overwhelming. Make each of these things into a separate sub-project or milestone.

Technology choices

There are many technology choices:

Zigbee, Z-Wave or another protocol? I'm starting out with a preference for Zigbee but may try some Z-Wave devices along the way.

E27 or B22 (Bayonet) light bulbs? People in the UK and former colonies may have B22 light sockets and lamps. For new deployments, you may want to standardize on E27. Amongst other things, E27 is used by all the Ikea lamp stands, and if you want to be able to move your expensive new smart bulbs between different holders in your house at will, you may want to standardize on E27 for all of them and avoid buying any Bayonet / B22 products in future.

Wired or wireless? Whenever you take up floorboards, it is a good idea to add some new wiring. For example, CAT6 can carry both power and data for a diverse range of devices.

Battery or mains power? In an apartment with two rooms and fewer than five devices, batteries may be fine, but in a house you may end up with more than a hundred sensors, radiator valves, buttons and switches, and you may find yourself changing a battery in one of them every week. If you have lodgers or tenants and you are not there to change the batteries, this may cause further complications. Some of the sensors have a socket for an optional power supply; battery eliminators may also be an option.

Making an inventory

Creating a spreadsheet table is extremely useful. This helps estimate the correct quantity of sensors, bulbs, radiator valves and switches, and it also helps to budget. Simply print it out, leave it under the Christmas tree and hope Santa will do the rest for you. Looking at my own house, I made a first pass counting the rooms and the devices each might need. Don't forget to include all those unusual spaces like walk-in pantries, a large cupboard under the stairs, cellar, en-suite or enclosed porch. Each deserves a row in the table.

Sensors help make good decisions

Whatever the aim of the project, sensors are likely to help obtain useful data about the space, and this can help to choose and use other products more effectively. Therefore, it is often a good idea to choose and deploy sensors throughout the home before choosing other products like radiator valves and smart bulbs.
The smartest place to put those smart sensors

When placing motion sensors, it is important to avoid putting them too close to doorways, where they might detect motion in adjacent rooms or hallways. It is also a good idea to avoid putting the sensor too close to any light bulb: if the bulb attracts an insect, it will trigger the motion sensor repeatedly. Temperature sensors shouldn't be too close to heaters or potential draughts around doorways and windows. There is a range of all-in-one sensors available; some have up to six features in one device smaller than an apple. In some rooms this is a convenient solution, but in other rooms it may be desirable to have separate motion and temperature sensors in different locations. Consider the dining and sitting rooms in my own house, illustrated in the floorplan accompanying the original post. The sitting room is also a potential sixth bedroom or guest room with a sofa bed, with the downstairs shower room conveniently located across the hall. The dining room is joined to the sitting room by a sliding double door. When the sliding door is open, a 360 degree motion sensor in the ceiling of the sitting room may detect motion in the dining room and vice versa. It appears that 180 degree motion sensors located at the points "1" and "2" in the floorplan may be a better solution. These rooms have wall-mounted radiators and fireplaces. To avoid any of these potential heat sources, the temperature sensors should probably be in the middle of the room. A photo in the original post shows the proposed location for the 180 degree motion sensor "2" on the wall above the double door.

Summary

To summarize, buy a Zigate and a small number of products to start experimenting with. Make an inventory of all the products potentially needed for your home. Try to mark sensor locations on a floorplan, thinking about the type of sensor (or multiple sensors) you need for each space.