In 25 Words or Less

Not Enough Words!

Okay, I lied. It'll be more than 25 words. That's because Unicode is complex and 25 words just can't do it justice. To paraphrase an Author, “Unicode is big. Really Big. You just won't believe how vastly hugely mind-bogglingly big it is.”

But if “25 Words Or Less” is taken metaphorically to mean “Keep It Brief”, well, that just might be possible. Understanding what Unicode really is and what it implies for IT covers several important points, but I'll try to “Keep It Brief”.

That said…

Unicode seeks to map all glyphs of all languages to unique integer Codepoints along with various 32-bit, 16-bit and 8-bit physical encodings.


[TIP] Unicode Technical Report #17, Unicode Character Encoding Model, is an excellent explanation and provides much more detail.

Unicode in Many More Words

Actually, the 22 words do a fair job of laying out the landscape. The goal of Unicode is to define a unique Codepoint for every glyph of every language in the world. (Note that every language is defined sensibly, not literally.) The Codepoints fall into a non-negative integer domain [0, 1, 2, ...].

Most Codepoints in everyday use lie in the range 0–65,535 (64K values), which is expressible with 16-bit integers. This first “page” of points is called the Basic Multilingual Plane (BMP). All higher Codepoints fall into the supplementary planes (numbered 1 to 16, together holding over a million Codepoints).
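To make the plane idea concrete, here is a minimal Python sketch. The helper name plane_of is my own invention for illustration. It assumes only that each plane holds 65,536 Codepoints, so the plane number is simply the Codepoint with its low 16 bits shifted away.

```python
def plane_of(codepoint: int) -> int:
    """Return the plane number: 0 is the BMP, 1..16 are supplementary."""
    # plane_of is a hypothetical helper name, not part of any library.
    if not 0 <= codepoint <= 0x10FFFF:
        raise ValueError("outside the Unicode Codepoint range")
    # Each plane holds 0x10000 (65,536) Codepoints.
    return codepoint >> 16


print(plane_of(ord("A")))   # 0: Latin capital A is in the BMP
print(plane_of(0x20AC))     # 0: the euro sign is in the BMP
print(plane_of(0x1F600))    # 1: an emoji in the first supplementary plane
```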

[TIP] The important thing to understand is that Unicode is basically a 32-bit standard as far as numbering the world's language glyphs.

The actual range of Codepoints, based on the 17 code planes, is only a bit over a million total (17 × 65,536 = 1,114,112). This range requires 21 bits, so it doesn't fit in a 16-bit package. For most systems, the next available size is 32 bits (native machine sizes typically double). Thus, because Unicode fits “naturally” into a 32-bit package, it's rightfully called a 32-bit standard.
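The arithmetic behind those numbers is easy to check. The sketch below uses variable names of my own choosing and nothing beyond Python's built-in integers.

```python
PLANES = 17
POINTS_PER_PLANE = 0x10000  # 65,536 Codepoints per plane

total_codepoints = PLANES * POINTS_PER_PLANE
highest_codepoint = total_codepoints - 1

print(total_codepoints)                 # 1114112
print(hex(highest_codepoint))           # 0x10ffff
print(highest_codepoint.bit_length())   # 21, too wide for a 16-bit package
```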

Unicode Layers

Assigning integer values to language glyphs actually represents two of Unicode's four layers: the Abstract Character Repertoire and the Coded Character Set (CCS). The first defines the set of language glyphs to encode and the second defines integer values for those glyphs.

An important note is that the two top layers are machine-independent. The second layer does not limit the integer values to any physical width.

The third layer, Character Encoding Form (CEF), maps the CCS to physical machine widths. This layer defines how Unicode is represented in machine form. When discussing how Unicode is stored in a Database or in application code, the discussion usually involves an Encoding Form.

Finally, a Character Encoding Scheme (CES) maps the fixed-width values produced by an Encoding Form into sequences of 8-bit bytes. Encoding Schemes typically matter when considering file storage and network transport, both of which are typically byte-oriented.

[TIP] To illustrate the difference between Character Encoding Forms and Schemes, compare UTF-16 to UTF-16LE. UTF-16 is a CEF (Form) that maps Unicode Codepoints into 16-bit values. UTF-16LE uses the same mapping, but also specifies a “little-endian” ordering (least significant byte first) when each 16-bit value is split into 8-bit bytes.
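A quick way to see the Form/Scheme split is to encode the same character with both byte orders. This is only a sketch using Python's standard "utf-16-le" and "utf-16-be" codecs. The variable names are mine.

```python
import struct

text = "A"  # U+0041, a single 16-bit value in UTF-16

little = text.encode("utf-16-le")
big = text.encode("utf-16-be")

print(little.hex())  # 4100: least significant byte first
print(big.hex())     # 0041: most significant byte first

# Read each byte pair back as one 16-bit value using the matching byte order.
(unit_le,) = struct.unpack("<H", little)
(unit_be,) = struct.unpack(">H", big)
print(hex(unit_le), hex(unit_be))  # 0x41 0x41
```

The 16-bit value is the same in both cases. Only the byte layout, which is the Scheme's concern, differs.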

UTF-32, UTF-16 & UTF-8

UTF-32 is an Encoding Form that maps the CCS integer values directly to 32-bit equivalents. Because the 32-bit integer is a very common machine size, many systems represent Unicode as UTF-32—at least internally! Transferring Unicode into, and out of, the system often requires narrower encodings.
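As a small illustration of that direct mapping, the sketch below encodes three characters with Python's standard "utf-32-be" codec and reads each group of four bytes back as one integer. The sample characters and variable names are my own choices.

```python
text = "A\u20ac\U0001F600"  # U+0041, U+20AC, U+1F600

encoded = text.encode("utf-32-be")

# Split the bytes into 4-byte groups and read each as a big-endian integer.
units = [
    int.from_bytes(encoded[i:i + 4], "big")
    for i in range(0, len(encoded), 4)
]

print([hex(u) for u in units])             # ['0x41', '0x20ac', '0x1f600']
print(units == [ord(ch) for ch in text])   # True
```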

UTF-16 is an Encoding Form that maps the “lower” CCS integer value space (the BMP values that can be represented with 16 bits, half the 32-bit virtual width) directly into their 16-bit equivalents. Codepoints above that range are represented by pairs of 16-bit values called surrogates: a high surrogate (D800..DBFF) followed by a low surrogate (DC00..DFFF). The surrogate range D800..DFFF is reserved for this purpose.
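The surrogate arithmetic is short enough to sketch by hand. The function name to_surrogate_pair below is hypothetical. The arithmetic follows the standard UTF-16 rule: subtract 0x10000, then split the remaining 20 bits into two 10-bit halves.

```python
def to_surrogate_pair(codepoint: int) -> tuple:
    """Split a supplementary-plane Codepoint into (high, low) surrogates."""
    # to_surrogate_pair is a hypothetical helper name, not a library call.
    if not 0x10000 <= codepoint <= 0x10FFFF:
        raise ValueError("only Codepoints above the BMP need surrogates")
    offset = codepoint - 0x10000          # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)        # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)       # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1F600)
print(hex(high), hex(low))  # 0xd83d 0xde00

# Cross-check against Python's own UTF-16 codec.
print("\U0001F600".encode("utf-16-be").hex())  # d83dde00
```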

UTF-16LE and UTF-16BE are Encoding Schemes that specify the Endianness of byte representations of UTF-16.

UTF-8 is an Encoding Form that maps the CCS integer values to sequences of one to four 8-bit bytes. UTF-8 maps 7-bit ASCII directly to single-byte equivalents. Values higher than 7-bit ASCII become multi-byte sequences (every byte in a multi-byte sequence has its 8th bit set). See the Encoding Chart for specifics.
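To see the bit patterns, the sketch below prints the UTF-8 bytes of a few sample characters in binary using Python's standard "utf-8" codec. The helper name utf8_bits and the sample characters are my own choices.

```python
def utf8_bits(ch: str) -> list:
    """Return the UTF-8 bytes of one character as 8-digit binary strings."""
    # utf8_bits is a hypothetical helper name used only for this illustration.
    return [f"{b:08b}" for b in ch.encode("utf-8")]


for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    bits = utf8_bits(ch)
    print(f"U+{ord(ch):04X} -> {len(bits)} byte(s): {' '.join(bits)}")

# U+0041 -> 1 byte(s): 01000001
# U+00E9 -> 2 byte(s): 11000011 10101001
# U+20AC -> 3 byte(s): 11100010 10000010 10101100
# U+1F600 -> 4 byte(s): 11110000 10011111 10011000 10000000
```

Note that the ASCII byte starts with a 0 bit, while every byte of the longer sequences starts with a 1 bit.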