...
A char value, therefore, represents BMP code points, including the surrogate code points, or code units of the UTF-16 encoding. An int value represents all Unicode code points, including supplementary code points. The lower (least significant) 21 bits of int are used to represent Unicode code points and the upper (most significant) 11 bits must be zero.

Similar to UTF-8 (see STR00-J. Don't form strings containing partial characters from variable-width encodings), UTF-16 is a variable-width encoding. Because the UTF-16 representation is also used in char arrays and in the String and StringBuffer classes, care must also be taken when manipulating string data in Java. In particular, do not write code that assumes that a value of the primitive type char (or a Character object) fully represents a Unicode code point. Conformance with this requirement typically requires using methods that accept a Unicode code point as an int value and avoiding methods that accept a Unicode code unit as a char value, because these latter methods cannot support supplementary characters.
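The difference between code units and code points can be seen in a minimal sketch like the following. The class name CodePointDemo and the helper codePointLength are hypothetical names chosen for illustration; the library calls (String.length, String.codePointCount, String.charAt, String.codePointAt) are standard Java API.

```java
class CodePointDemo {
    // Hypothetical helper: counts Unicode code points, not UTF-16 code units.
    static int codePointLength(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+20000, a supplementary CJK ideograph, written as a surrogate pair.
        String s = "\uD840\uDC00";

        System.out.println(s.length());           // 2 (UTF-16 code units)
        System.out.println(codePointLength(s));   // 1 (Unicode code point)

        // charAt sees only the high surrogate; codePointAt decodes the pair.
        System.out.println(Integer.toHexString(s.charAt(0)));      // d840
        System.out.println(Integer.toHexString(s.codePointAt(0))); // 20000
    }
}
```

A single char here holds only half of the character, which is why char-based processing cannot be assumed to see whole code points.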
Noncompliant Code Example
This noncompliant code example attempts to trim leading letters from string.
...
They treat char values from the surrogate ranges as undefined characters. For example, Character.isLetter('\uD840') returns false, even though this specific value, if followed by any low-surrogate value in a string, would represent a letter.
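The behavior described above can be checked with a short sketch. The class name SurrogateLetterDemo and the helper isLetterAt are hypothetical; Character.isLetter(char), Character.isLetter(int), and Character.isHighSurrogate(char) are standard Java API.

```java
class SurrogateLetterDemo {
    // Hypothetical helper: classifies the whole code point starting at index.
    static boolean isLetterAt(String s, int index) {
        return Character.isLetter(s.codePointAt(index));
    }

    public static void main(String[] args) {
        // U+20000 encoded as the surrogate pair D840 DC00.
        String s = "\uD840\uDC00";

        // The char overload sees a lone high surrogate, which is not a letter.
        System.out.println(Character.isLetter(s.charAt(0)));        // false
        System.out.println(Character.isHighSurrogate(s.charAt(0))); // true

        // The int overload sees the full code point, which is a letter.
        System.out.println(isLetterAt(s, 0));                       // true
    }
}
```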
Compliant Solution
This compliant solution corrects the problem with supplementary characters by using the integer form of the Character.isLetter() method, which accepts a Unicode code point as an int argument. Java library methods that accept an int value support all Unicode characters, including supplementary characters.
```java
public static String trim(String string) {
  int ch;
  int i;
  // Advance by one or two code units, depending on the code point just read.
  for (i = 0; i < string.length(); i += Character.charCount(ch)) {
    ch = string.codePointAt(i);
    if (!Character.isLetter(ch)) {
      break;
    }
  }
  return string.substring(i);
}
```
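As a rough usage check, the sketch below restates the code-point-based trim method inside a hypothetical TrimDemo class so that it compiles on its own, then applies it to a few inputs. The sample strings are assumptions chosen for illustration.

```java
class TrimDemo {
    // Same logic as the compliant solution, repeated so this sketch is self-contained.
    static String trim(String string) {
        int ch;
        int i;
        for (i = 0; i < string.length(); i += Character.charCount(ch)) {
            ch = string.codePointAt(i);
            if (!Character.isLetter(ch)) {
                break;
            }
        }
        return string.substring(i);
    }

    public static void main(String[] args) {
        // Leading supplementary letter U+20000 followed by BMP letters.
        System.out.println("[" + trim("\uD840\uDC00ab 42") + "]"); // [ 42]

        // A string made only of letters is trimmed to the empty string.
        System.out.println("[" + trim("\uD840\uDC00") + "]");      // []

        // No leading letters: the input is returned unchanged.
        System.out.println("[" + trim("123abc") + "]");            // [123abc]
    }
}
```

Because the index advances by Character.charCount(ch), the substring boundary never falls between the two halves of a surrogate pair.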
Risk Assessment
Forming strings consisting of partial characters can result in unexpected behavior.
Rule | Severity | Likelihood | Remediation Cost | Priority | Level |
---|---|---|---|---|---|
STR01-J | low | unlikely | medium | P2 | L3 |