Unicode

Fundamentally, computers just deal with numbers. They store letters and other characters by assigning a number to each one. Before the Unicode standard was developed, there were many different systems, called character encodings, for assigning these numbers. These earlier character encodings were limited and did not cover characters for all the world’s languages. Even for a single language like English, no single encoding covered all the letters, punctuation, and technical symbols in common use. Languages written with very large character sets, such as Japanese, were especially challenging to support with these earlier encoding standards.

Early character encodings also conflicted with one another: two encodings could use the same number for two different characters, or different numbers for the same character. Any given computer might have to support many different encodings, and whenever data was passed between computers or between encodings, it risked being corrupted.
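For example, the following Python sketch (an illustration added here, not part of any standard) shows how one byte value maps to different characters under two legacy encodings, and how one character is given different byte values:

```python
# The single byte 0xA4 means different things under two legacy encodings:
# ISO-8859-1 (Latin-1) assigns it to the currency sign '¤',
# while ISO-8859-15 (Latin-9) reassigned it to the euro sign '€'.
raw = b"\xa4"
print(raw.decode("iso-8859-1"))   # ¤
print(raw.decode("iso-8859-15"))  # €

# Conversely, the same character can be given different numbers:
# '€' is 0xA4 in ISO-8859-15 but 0x80 in Windows-1252,
# and has no code at all in ISO-8859-1.
print("€".encode("iso-8859-15"))  # b'\xa4'
print("€".encode("cp1252"))       # b'\x80'
```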

Additionally, researchers in the 1980s faced a dilemma. On the one hand, it seemed necessary to add more bits to accommodate additional characters. On the other hand, for the users of the relatively small character set of the Latin alphabet (who still constituted the majority of computer users), those additional bits were a colossal waste of then-scarce and expensive computing resources, since they would always be zeroed out for such users.

The compromise solution that was eventually found and developed into Unicode was to break the assumption (dating back to telegraph codes) that each character should always directly correspond to a particular sequence of bits. Instead, each character would first be mapped to a universal intermediate representation in the form of abstract numbers called code points.
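As an illustration (a minimal Python sketch added here, not part of the standard itself), the built-in ord() and chr() functions expose this mapping between characters and their abstract code points:

```python
# Each character maps to an abstract number, its code point,
# conventionally written as U+ followed by hexadecimal digits.
for ch in ["A", "é", "€", "😀"]:
    cp = ord(ch)                      # character -> code point
    print(f"{ch!r} -> U+{cp:04X} (decimal {cp})")

# The mapping is reversible: a code point identifies exactly one character.
assert chr(0x20AC) == "€"
```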

The complete list of Unicode code points is published in the Unicode code charts; the basic Latin (ASCII) characters occupy the first 128 code points, U+0000 through U+007F.

Each code point would then be represented in a variety of ways, with various default numbers of bits per character (code units) depending on context. To encode code points that do not fit in a single code unit, such as those above 255 for 8-bit units, the solution was to implement variable-width encodings, in which an escape sequence or lead byte signals that subsequent bits should be parsed as part of a higher code point.
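A short Python sketch (illustrative only) makes this variable-width behaviour visible for UTF-8: low code points fit in a single 8-bit code unit, while higher code points spill into multi-byte sequences whose first byte signals how many units follow:

```python
# UTF-8 is a variable-width encoding: low code points use one 8-bit
# code unit, higher code points use two, three, or four. The high bits
# of the first byte announce the length of the sequence.
for ch in ["A", "é", "€", "😀"]:
    encoded = ch.encode("utf-8")
    units = " ".join(f"{b:08b}" for b in encoded)
    print(f"U+{ord(ch):04X} {ch!r}: {len(encoded)} code unit(s) -> {units}")

# 'A' (U+0041)   -> 1 unit:  0xxxxxxx
# 'é' (U+00E9)   -> 2 units: 110xxxxx 10xxxxxx
# '€' (U+20AC)   -> 3 units: 1110xxxx 10xxxxxx 10xxxxxx
# '😀' (U+1F600) -> 4 units: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
```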

The Unicode standard was created with an encoding space large enough to support the writing systems of all the world’s languages. Over the years the standard has been steadily expanded and now includes scripts such as Cherokee, Mongolian, and Egyptian hieroglyphs. Beyond simply providing a standardized system of character codes, the Unicode Consortium has expanded the scope of its efforts to include standard “locale” data, such as how a date is formatted in Arabic or Swahili, and code libraries that help programmers implement support for these standards.

Unicode is maintained by the Unicode Consortium, a non-profit organization based in Mountain View, California.

Figure: the same Unicode code point, represented in different encodings. Code units in the UTF-8 encoding are 8 bits wide (2⁸ = 16² possible values per unit) and in the UTF-16 encoding 16 bits wide (2¹⁶ = 16⁴ possible values per unit). In UTF-8, three 8-bit code units are needed to represent the € sign.
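These figures can be checked directly; the Python snippet below (an added illustration, not part of the original figure) encodes the same code point, U+20AC, in both forms:

```python
# The same code point, U+20AC (€), under two encodings.
# UTF-8 uses 8-bit code units: three are needed for this character.
# UTF-16 uses 16-bit code units: one is enough.
euro = "\u20ac"

utf8 = euro.encode("utf-8")       # b'\xe2\x82\xac' -> 3 code units
utf16 = euro.encode("utf-16-be")  # b'\x20\xac'     -> 1 code unit (big-endian, no BOM)

print(len(utf8), utf8.hex(" "))         # 3 e2 82 ac
print(len(utf16) // 2, utf16.hex(" "))  # 1 20 ac  (one 16-bit unit = 2 bytes)
```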