...
In the Java SE API documentation and in this coding standard, Unicode code point is used for character values in the range U+0000 to U+10FFFF, and Unicode code unit is used for 16-bit char values that are code units of the UTF-16 encoding.
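The distinction between code points and code units can be seen in a short sketch. The class and method names here are illustrative only and do not come from the Java SE documentation; the example string holds one ASCII letter followed by U+1F600, a supplementary character that occupies two code units.

```java
public class CodePointDemo {
    // Number of UTF-16 code units (Java char values) in the string.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // 'a' followed by U+1F600, written as its surrogate pair.
        String s = "a\uD83D\uDE00";
        System.out.println(codeUnits(s));                           // 3
        System.out.println(codePoints(s));                          // 2
        System.out.println(Integer.toHexString(s.codePointAt(1)));  // 1f600
    }
}
```

String.length() counts code units, so it reports 3 for a string that a reader would describe as two characters.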
Character information in Java SE 8 is based on the Unicode Standard, version 6.2.0 [Unicode 2012].
However, Java programs must often work with string data represented in various character sets. Java 7 introduced the StandardCharsets class, which specifies character sets that are guaranteed to be available on every implementation of the Java platform, including ISO Latin Alphabet No. 1 (ISO-8859-1), seven-bit ASCII (US-ASCII), UTF-8, and UTF-16.
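As a minimal sketch of how these constants might be used, the following example encodes the same one-character string with several of the guaranteed charsets and compares the resulting byte counts. The class name and the encodedLength helper are hypothetical and exist only for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class StandardCharsetsDemo {
    // Returns how many bytes the string occupies in the given charset.
    static int encodedLength(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    public static void main(String[] args) {
        String s = "\u00E9"; // LATIN SMALL LETTER E WITH ACUTE
        System.out.println(encodedLength(s, StandardCharsets.ISO_8859_1)); // 1
        System.out.println(encodedLength(s, StandardCharsets.UTF_8));      // 2
        System.out.println(encodedLength(s, StandardCharsets.UTF_16BE));   // 2
        // US-ASCII cannot represent U+00E9; getBytes substitutes the
        // charset's replacement byte, so one byte is still produced.
        System.out.println(encodedLength(s, StandardCharsets.US_ASCII));   // 1
    }
}
```

Passing a Charset constant rather than a charset name also avoids the checked UnsupportedEncodingException that the String-based overloads declare.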
The Java language assumes that every character in a string occupies 16 bits (a Java char). Unfortunately, neither the Java byte nor the Java char data type can represent all possible Unicode characters: a byte holds 8 bits and a char holds 16, whereas code points extend to U+10FFFF. Many strings are stored or communicated using encodings such as UTF-8 that support characters with varying sizes.
While Java strings are stored as an array of characters and can be represented as an array of bytes, a single character in the string might be represented by two or more consecutive elements of type byte or of type char. Splitting a char or byte array risks splitting a multibyte character.
Ignoring the possibility of supplementary characters and multibyte characters may allow the formation of incorrect strings.
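The following sketch illustrates both risks. It decodes a UTF-8 byte array in two pieces, and it shows one possible way to avoid cutting a surrogate pair when splitting a string by char index. The class name and both helpers (decodeInTwoParts, safeSplitIndex) are hypothetical and are not part of this guideline's compliant solutions.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class SplitDemo {
    // Decodes the two halves of a UTF-8 byte array separately and
    // concatenates the results, as a naive chunked reader might.
    static String decodeInTwoParts(byte[] utf8, int splitAt) {
        String head = new String(
            Arrays.copyOfRange(utf8, 0, splitAt), StandardCharsets.UTF_8);
        String tail = new String(
            Arrays.copyOfRange(utf8, splitAt, utf8.length), StandardCharsets.UTF_8);
        return head + tail;
    }

    // Hypothetical helper: moves a char index back by one when it would
    // otherwise fall between the two halves of a surrogate pair.
    static int safeSplitIndex(String s, int index) {
        if (index > 0 && index < s.length()
                && Character.isHighSurrogate(s.charAt(index - 1))
                && Character.isLowSurrogate(s.charAt(index))) {
            return index - 1;
        }
        return index;
    }

    public static void main(String[] args) {
        String original = "a\u20AC"; // 'a' plus EURO SIGN (3 bytes in UTF-8)
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        // Splitting on a character boundary round-trips correctly.
        System.out.println(original.equals(decodeInTwoParts(utf8, 1))); // true
        // Splitting inside the 3-byte sequence corrupts the string.
        System.out.println(original.equals(decodeInTwoParts(utf8, 2))); // false
        // Index 2 lies inside the surrogate pair, so it is moved back to 1.
        System.out.println(safeSplitIndex("a\uD83D\uDE00", 2));         // 1
    }
}
```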
Multibyte Characters
Multibyte encodings are used for character sets that require more than one byte to uniquely identify each constituent character. For example, the Japanese encoding Shift-JIS (shown below) supports multibyte encoding where the maximum character length is two bytes (one leading and one trailing byte).
Byte Type | Range |
---|---|
single-byte | |
lead-byte | |
trailing-byte | |
The trailing byte ranges overlap the range of both the single-byte and lead-byte characters. When a multibyte character is separated across a buffer boundary, it can be interpreted differently than if it were not separated across the buffer boundary; this difference arises because of the ambiguity of its composing bytes [Phillips 2005].
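A small sketch can make this ambiguity concrete. In Shift-JIS the katakana letter U+30A2 is encoded as the two bytes 0x83 0x41, and 0x41 on its own is the single-byte encoding of the letter A. The example assumes the runtime provides a charset named Shift_JIS, which is not one of the charsets guaranteed by StandardCharsets; the class name and the decode helper are hypothetical.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

public class ShiftJisBoundaryDemo {
    // Decodes bytes[from, to) as Shift-JIS.
    // Assumption: the "Shift_JIS" charset is available on this runtime.
    static String decode(byte[] bytes, int from, int to) {
        Charset sjis = Charset.forName("Shift_JIS");
        return new String(Arrays.copyOfRange(bytes, from, to), sjis);
    }

    public static void main(String[] args) {
        // Lead byte 0x83 followed by trailing byte 0x41.
        byte[] bytes = {(byte) 0x83, 0x41};
        // Decoded together, the two bytes form one katakana character.
        System.out.println(decode(bytes, 0, 2).equals("\u30A2")); // true
        // Decoded after a buffer boundary, the trailing byte alone is
        // indistinguishable from the single-byte character 'A'.
        System.out.println(decode(bytes, 1, 2));                  // A
    }
}
```

The second call produces a perfectly valid character, so nothing in the decoded output signals that the data was misread.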
Supplementary Characters
Noncompliant Code Example (Read)
...