Many old programs assumed that every character in a string occupied 8 bits, e.g., a Java byte. The Java language assumes that every character in a string occupies 16 bits, e.g., a Java char. Unfortunately, the Java byte was not sufficient to hold all possible characters, and neither is a Java char. Many strings are stored on disk and in memory using an encoding such as UTF-8 that allows characters to have varying sizes.
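
As a rough illustration of these varying sizes, the sketch below counts char values, code points, and UTF-8 bytes for a few sample strings. The class name CharSizes and the helper sizes are hypothetical names chosen for this example; they do not come from the original text.

```java
import java.nio.charset.StandardCharsets;

class CharSizes {
    // Returns { number of char values, number of code points, number of UTF-8 bytes }.
    static int[] sizes(String s) {
        return new int[] {
            s.length(),
            s.codePointCount(0, s.length()),
            s.getBytes(StandardCharsets.UTF_8).length
        };
    }

    public static void main(String[] args) {
        String[] samples = {
            "A",                                     // U+0041
            "\u00E9",                                // U+00E9
            "\u20AC",                                // U+20AC
            new String(Character.toChars(0x1D11E))   // U+1D11E, above U+FFFF
        };
        for (String s : samples) {
            int[] n = sizes(s);
            System.out.printf("U+%04X chars=%d codePoints=%d utf8Bytes=%d%n",
                    s.codePointAt(0), n[0], n[1], n[2]);
        }
    }
}
```

In this sketch each sample is a single code point, yet the UTF-8 length ranges from one to four bytes, and the last sample needs two char values rather than one.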
...
The trailing byte ranges overlap the range of both the single byte and lead byte characters. When a multibyte character is separated across a buffer boundary, it can be interpreted differently than if it were not separated across the buffer boundary; this difference arises due to the ambiguity of its composing bytes [Phillips 2005].
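
The effect of a buffer boundary can be sketched as follows, assuming UTF-8 input that is consumed in fixed-size chunks. The class SplitDecode and its methods naive and streaming are hypothetical names for this illustration. The first method decodes every chunk on its own; the second keeps a single CharsetDecoder and carries undecoded trailing bytes over to the next chunk.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class SplitDecode {
    // Decodes each chunk independently; a character split across
    // two chunks is decoded as malformed input on both sides.
    static String naive(byte[] data, int chunk) {
        StringBuilder sb = new StringBuilder();
        for (int off = 0; off < data.length; off += chunk) {
            int len = Math.min(chunk, data.length - off);
            sb.append(new String(data, off, len, StandardCharsets.UTF_8));
        }
        return sb.toString();
    }

    // Uses one decoder for the whole input and keeps incomplete
    // trailing bytes in the input buffer until the next chunk arrives.
    static String streaming(byte[] data, int chunk) {
        if (data.length == 0) {
            return "";
        }
        CharsetDecoder dec = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        // Room for one chunk plus up to three carried-over bytes.
        ByteBuffer in = ByteBuffer.allocate(chunk + 4);
        CharBuffer out = CharBuffer.allocate(chunk + 4);
        StringBuilder sb = new StringBuilder();
        for (int off = 0; off < data.length; off += chunk) {
            int len = Math.min(chunk, data.length - off);
            boolean last = off + len >= data.length;
            in.put(data, off, len);
            in.flip();
            dec.decode(in, out, last);
            in.compact();          // preserve bytes of an incomplete character
            out.flip();
            sb.append(out);
            out.clear();
        }
        dec.flush(out);
        out.flip();
        sb.append(out);
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] data = "a\u20ACb".getBytes(StandardCharsets.UTF_8); // 5 bytes
        System.out.println(naive(data, 2).equals("a\u20ACb"));
        System.out.println(streaming(data, 2).equals("a\u20ACb"));
    }
}
```

With a chunk size of 2, the three-byte encoding of U+20AC straddles the first two chunks, so the chunk-by-chunk result differs from the original string while the stateful decoder reproduces it.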
...
Supplementary Characters
According to the Java API [API 2006], class Character documentation (Unicode Character Representations)
...
The size of the data byte buffer depends on the maximum number of bytes required to write an encoded character. For example, UTF-8 encoded data requires four bytes to represent any character above U+FFFF. Because Java uses the UTF-16 character encoding to represent char data, such sequences are split into two separate char values of two bytes each. Consequently, the buffer size should be four times the size of a typical byte sequence.
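
One possible way to size an output buffer, offered only as a sketch and not as the approach the original text prescribes, is to ask the encoder for its worst-case expansion through CharsetEncoder.maxBytesPerChar(). The class EncodeBufferSize and the helpers worstCaseBytes and encode are hypothetical names for this example.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

class EncodeBufferSize {
    // Upper bound on the bytes needed to encode charCount char values.
    static int worstCaseBytes(CharsetEncoder enc, int charCount) {
        return (int) Math.ceil(charCount * (double) enc.maxBytesPerChar());
    }

    // Encodes s as UTF-8 into a buffer sized from the worst-case bound.
    static byte[] encode(String s) {
        CharsetEncoder enc = StandardCharsets.UTF_8.newEncoder();
        ByteBuffer out = ByteBuffer.allocate(worstCaseBytes(enc, s.length()));
        CoderResult r = enc.encode(CharBuffer.wrap(s), out, true);
        if (!r.isUnderflow()) {
            // Overflow or malformed input such as an unpaired surrogate.
            throw new IllegalStateException("encoding failed: " + r);
        }
        r = enc.flush(out);
        if (!r.isUnderflow()) {
            throw new IllegalStateException("flush failed: " + r);
        }
        out.flip();
        byte[] bytes = new byte[out.remaining()];
        out.get(bytes);
        return bytes;
    }

    public static void main(String[] args) {
        String s = "a" + new String(Character.toChars(0x1D11E));
        System.out.println(s.length() + " chars -> " + encode(s).length + " bytes");
    }
}
```

The bound is expressed per char value rather than per code point, so a character above U+FFFF, which occupies two char values and four UTF-8 bytes, still fits in the allocated buffer.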
...