Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

A char value, therefore, represents BMP code points, including the surrogate code points, or code units of the UTF-16 encoding. An int value represents all Unicode code points, including supplementary code points. The lower (least significant) 21 bits of int are used to represent Unicode code points and the upper (most significant) 11 bits must be zero. Similar to UTF-8 (see STR00-J. Don't form strings containing partial characters from variable-width encodings), UTF-16 is a variable-width encoding. Because the UTF-16 representation is also used in char arrays and in the String and StringBuffer classes, care must also be taken when manipulating string data in Java. In particular, do not write code that assumes that a value of the primitive type char (or a Character object) fully represents a Unicode code point. Conformance with this requirement typically requires using methods that accept a Unicode code point as an int value and avoiding methods that accept a Unicode code unit as a char value as these latter methods cannot support supplementary characters.

 

Noncompliant Code Example

This noncompliant code example attempts to trim leading letters from string

...

They treat char values from the surrogate ranges as undefined characters. For example, Character.isLetter('\uD840') returns false, even though this specific value if followed by any low-surrogate value in a string would represent a letter.

Compliant Solution

This noncompliant code example corrects the problem with supplementary characters by using the integer form of Character.isLetter() method that accepts a Unicode code point as an int argument. Java library methods that accept an int value support all Unicode characters, including supplementary characters.  

Code Block
bgColor#ccccff
public static String trim(String string) {
  int ch;
  int i;
  for (i = 0; i < string.length(); i += Character.charCount(ch)) {
    ch = string.codePointAt(i);
    if (!Character.isLetter(ch)) {
      break;
    }
  } 
  return string.substring(i);
}

Risk Assessment

Forming strings consisting of partial characters can result in unexpected behavior.

Rule

Severity

Likelihood

Remediation Cost

Priority

Level

STR01-J

low

unlikely

medium

P2

L3

Bibliography

[API 2014]

Classes Character and BreakIterator

[Tutorials 2008]

Character Boundaries

 

      Image RemovedRule 00: Input Validation and Data Sanitization (IDS)