
...

In the Java SE API documentation and in this coding standard, the term Unicode code point refers to character values in the range U+0000 through U+10FFFF, and the term Unicode code unit refers to 16-bit char values that are code units of the UTF-16 encoding.

Character information in Java SE 8 is based on the Unicode Standard, version 6.2.0 [Unicode 2012].

However, Java programs must often work with string data represented in various character sets. Java 7 introduced the StandardCharsets class, which specifies character sets that are guaranteed to be available on every implementation of the Java platform, including ISO Latin Alphabet No. 1, seven-bit ASCII, UTF-8, and UTF-16.
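
As a minimal sketch of how the guaranteed character sets might be used, the following example encodes and decodes a string with explicitly named charsets rather than relying on the platform default. The class name `CharsetRoundTrip` and its helper methods are hypothetical names chosen for illustration.

```java
import java.nio.charset.StandardCharsets;

final class CharsetRoundTrip {
    // Encode with an explicitly named charset instead of the platform default.
    static byte[] encodeUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Decode with the same charset that was used for encoding.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String text = "caf\u00E9"; // four characters; the last one is U+00E9
        byte[] utf8 = encodeUtf8(text);
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);

        System.out.println(utf8.length);   // 5: U+00E9 takes two bytes in UTF-8
        System.out.println(latin1.length); // 4: one byte per character in ISO-8859-1
        System.out.println(decodeUtf8(utf8).equals(text)); // true
    }
}
```

The same four-character string has different byte lengths in the two encodings, which is why the charset should be stated wherever bytes and strings are converted.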

The Java language assumes that every character in a string occupies 16 bits (a Java char). Unfortunately, neither the Java byte nor the Java char data type can represent all possible Unicode characters. Many strings are stored or communicated using encodings such as UTF-8 that support characters with varying sizes.
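
The varying sizes described above can be observed directly. The sketch below reports, for a few sample code points, how many UTF-16 code units (Java char values) and how many UTF-8 bytes each one needs. The class `CodeUnitSizes` and its methods are hypothetical helpers; the code points were picked only as examples.

```java
import java.nio.charset.StandardCharsets;

final class CodeUnitSizes {
    // Number of Java char values (UTF-16 code units) needed for one code point.
    static int utf16Units(int codePoint) {
        return Character.charCount(codePoint);
    }

    // Number of bytes needed to encode one code point in UTF-8.
    static int utf8Bytes(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // U+0041, U+00E9, U+20AC, and the supplementary code point U+1F600
        int[] samples = {0x41, 0xE9, 0x20AC, 0x1F600};
        for (int cp : samples) {
            System.out.printf("U+%04X: %d char(s), %d UTF-8 byte(s)%n",
                    cp, utf16Units(cp), utf8Bytes(cp));
        }

        String s = "\uD83D\uDE00"; // U+1F600 written as a surrogate pair
        System.out.println(s.length());                      // 2 code units
        System.out.println(s.codePointCount(0, s.length())); // 1 code point
    }
}
```

Note that `String.length()` counts code units, so it overstates the number of characters whenever a supplementary character is present.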

While Java strings are stored as an array of characters and can be represented as an array of bytes, a single character in the string might be represented by two or more consecutive elements of type byte or of type char. Splitting a char or byte array risks splitting a multibyte character.

Ignoring the possibility of supplementary characters and multibyte characters may allow the formation of incorrect strings.
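
To illustrate how an incorrect string can be formed, the sketch below splits a string at a fixed char index that happens to fall inside a surrogate pair, then shows one possible way to move the split point back to a character boundary. `SafeSplit` and `safeSplitIndex` are hypothetical names, and this is only one approach under the assumption that the input contains well-formed surrogate pairs.

```java
final class SafeSplit {
    // Returns the largest index <= limit that does not fall between the
    // high and low halves of a surrogate pair.
    static int safeSplitIndex(String s, int limit) {
        if (limit <= 0) {
            return 0;
        }
        if (limit >= s.length()) {
            return s.length();
        }
        if (Character.isHighSurrogate(s.charAt(limit - 1))
                && Character.isLowSurrogate(s.charAt(limit))) {
            return limit - 1; // back up so the pair stays together
        }
        return limit;
    }

    public static void main(String[] args) {
        String s = "ab\uD83D\uDE00cd"; // 6 code units, 5 code points

        // Naive split: the prefix ends with an unpaired high surrogate.
        String naive = s.substring(0, 3);
        System.out.println(
                Character.isHighSurrogate(naive.charAt(naive.length() - 1))); // true

        // Boundary-aware split: the supplementary character is left intact.
        String safe = s.substring(0, safeSplitIndex(s, 3));
        System.out.println(safe); // ab
    }
}
```

The same concern applies to byte arrays: a split point chosen by byte count can land inside a multibyte sequence, so the boundary should be checked against the encoding in use.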

Multibyte Characters

Multibyte encodings are used for character sets that require more than one byte to uniquely identify each constituent character. For example, the Japanese encoding Shift-JIS (shown below) supports multibyte encoding where the maximum character length is two bytes (one leading and one trailing byte).

Byte Type	Range
single-byte	`0x00` through `0x7F` and `0xA0` through `0xDF`
lead-byte	`0x81` through `0x9F` and `0xE0` through `0xFC`
trailing-byte	`0x40` through `0x7E` and `0x80` through `0xFC`

The trailing byte ranges overlap the range of both the single-byte and lead-byte characters. When a multibyte character is separated across a buffer boundary, it can be interpreted differently than if it were not separated across the buffer boundary; this difference arises because of the ambiguity of its composing bytes [Phillips 2005].
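
The ambiguity can be sketched without any charset support by classifying bytes according to the ranges in the table above. The class `ShiftJisBytes` and its methods are hypothetical, and the scanner is deliberately simplified: it assumes each buffer starts on a character boundary and counts a dangling lead byte as one character.

```java
final class ShiftJisBytes {
    static boolean isSingleByte(int b) {
        return (b >= 0x00 && b <= 0x7F) || (b >= 0xA0 && b <= 0xDF);
    }

    static boolean isLeadByte(int b) {
        return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
    }

    static boolean isTrailByte(int b) {
        return (b >= 0x40 && b <= 0x7E) || (b >= 0x80 && b <= 0xFC);
    }

    // Counts characters in a buffer, assuming its first byte begins a character.
    static int countCharacters(byte[] buf) {
        int count = 0;
        int i = 0;
        while (i < buf.length) {
            int b = buf[i] & 0xFF; // treat the byte as unsigned
            boolean pair = isLeadByte(b)
                    && i + 1 < buf.length
                    && isTrailByte(buf[i + 1] & 0xFF);
            i += pair ? 2 : 1;
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // 0x41 is valid both as a single-byte character and as a trailing byte.
        System.out.println(isSingleByte(0x41) && isTrailByte(0x41)); // true

        // One two-byte character (0x83 0x41) followed by one single-byte character.
        byte[] whole = {(byte) 0x83, 0x41, 0x42};
        System.out.println(countCharacters(whole)); // 2

        // The same bytes split after the lead byte are interpreted differently.
        byte[] first = {(byte) 0x83};
        byte[] second = {0x41, 0x42};
        System.out.println(countCharacters(first) + countCharacters(second)); // 3
    }
}
```

Once the lead byte is cut off at the end of the first buffer, the second buffer's opening byte is read as a standalone single-byte character, so the two halves no longer describe the original text.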

Supplementary Characters

Noncompliant Code Example (Read)

...


