...
The Java platform uses the UTF-16 representation in char arrays and in the String and StringBuffer classes. In this representation, supplementary characters are represented as a pair of char values, the first from the high-surrogates range (\uD800-\uDBFF), the second from the low-surrogates range (\uDC00-\uDFFF).
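The surrogate-pair encoding described above can be observed directly with the standard Character methods. The following is a minimal sketch, not part of the original text: the class name SurrogatePairDemo, its helper methods, and the sample code point U+1F600 are chosen purely for illustration.

```java
final class SurrogatePairDemo {
    // Encodes one code point as UTF-16 code units:
    // one char for a BMP code point, two chars for a supplementary one.
    static char[] toUtf16(int codePoint) {
        return Character.toChars(codePoint);
    }

    // Recombines a high/low surrogate pair into the code point it encodes.
    static int fromSurrogates(char high, char low) {
        if (!Character.isSurrogatePair(high, low)) {
            throw new IllegalArgumentException("not a valid surrogate pair");
        }
        return Character.toCodePoint(high, low);
    }

    public static void main(String[] args) {
        int codePoint = 0x1F600; // illustrative supplementary code point
        char[] units = toUtf16(codePoint);
        for (char unit : units) {
            System.out.printf("%04X high=%b low=%b%n", (int) unit,
                    Character.isHighSurrogate(unit),
                    Character.isLowSurrogate(unit));
        }
        // Round-trip the pair back to the original code point.
        System.out.printf("U+%X%n", fromSurrogates(units[0], units[1]));
    }
}
```

For U+1F600 the two code units are D83D (from the high-surrogates range) and DE00 (from the low-surrogates range); neither char is a complete character on its own.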
A char value, therefore, represents BMP code points, including the surrogate code points, or code units of the UTF-16 encoding. An int value represents all Unicode code points, including supplementary code points. The lower (least significant) 21 bits of int are used to represent Unicode code points, and the upper (most significant) 11 bits must be zero.
Similar to UTF-8 (see STR00-J. Don't form strings containing partial characters from variable-width encodings), UTF-16 is a variable-width encoding. Because the UTF-16 representation is also used in char arrays and in the String and StringBuffer classes, care must be taken when manipulating string data in Java. In particular, do not write code that assumes that a value of the primitive type char (or a Character object) fully represents a Unicode code point. Conformance with this requirement typically requires using methods that accept a Unicode code point as an int value and avoiding methods that accept a Unicode code unit as a char value because these latter methods cannot support supplementary characters.
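The difference between the char-based and int-based methods can be sketched as follows. This example is an illustration only: the class CodePointIteration and its two methods are hypothetical names, and the test string (the letter "a" followed by U+10400, a supplementary letter) is an assumed input rather than one taken from the text above.

```java
final class CodePointIteration {
    // Problematic sketch: tests each UTF-16 code unit separately.
    // A surrogate char is never a letter, so supplementary letters are rejected.
    static boolean allLettersByChar(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (!Character.isLetter(s.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    // Preferred sketch: reads a whole code point as an int and advances
    // by the number of chars that code point occupies (1 or 2).
    static boolean allLettersByCodePoint(String s) {
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (!Character.isLetter(cp)) {
                return false;
            }
            i += Character.charCount(cp);
        }
        return true;
    }

    public static void main(String[] args) {
        // "a" followed by U+10400, a letter outside the BMP.
        String s = "a" + new String(Character.toChars(0x10400));
        System.out.println(allLettersByChar(s));      // false
        System.out.println(allLettersByCodePoint(s)); // true
        // Three code units, but only two code points.
        System.out.println(s.length() + " " + s.codePointCount(0, s.length())); // 3 2
    }
}
```

The char-based loop fails here because it examines each surrogate in isolation, whereas the loop built on codePointAt and Character.isLetter(int) sees the supplementary character as a single unit.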
...