...

Ignoring the possibility of supplementary characters, multibyte characters, or combining characters (characters that modify other characters) may allow an attacker to bypass input validation checks.

Combining Characters

A combining character sequence is a base character followed by any number of combining characters. The combining character sequence forms a grapheme, which is a minimally distinctive unit of writing in the context of a particular writing system. For example, the grapheme ü can be composed by combining the base character \u0075 (u) with the combining diaeresis \u0308 (¨). It may also be represented by the single Unicode character \u00fc.
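
The two representations of ü can be compared directly. The following is a minimal sketch, not part of the guideline itself: the class name GraphemeForms and its members are hypothetical, and it assumes the combining diaeresis U+0308 as the combining character. It shows that the two forms differ as Java strings until both are normalized with java.text.Normalizer.

```java
import java.text.Normalizer;

// Hypothetical demo class; the names are illustrative only.
public class GraphemeForms {
  // Base character 'u' followed by the combining diaeresis U+0308
  public static final String DECOMPOSED = "u\u0308";
  // Single precomposed character U+00FC
  public static final String PRECOMPOSED = "\u00fc";

  // Returns true if the two strings are equal after NFC normalization
  public static boolean sameAfterNfc(String a, String b) {
    return Normalizer.normalize(a, Normalizer.Form.NFC)
        .equals(Normalizer.normalize(b, Normalizer.Form.NFC));
  }

  public static void main(String[] args) {
    System.out.println(DECOMPOSED.length());                   // 2
    System.out.println(PRECOMPOSED.length());                  // 1
    System.out.println(DECOMPOSED.equals(PRECOMPOSED));        // false
    System.out.println(sameAfterNfc(DECOMPOSED, PRECOMPOSED)); // true
  }
}
```

A validation check that compares raw strings may therefore treat the two forms as different input even though they render identically.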

Multibyte Characters

Multibyte encodings are used for character sets that require more than one byte to uniquely identify each constituent character. For example, the Japanese encoding Shift-JIS (shown below) supports multibyte encoding where the maximum character length is two bytes (one leading and one trailing byte).
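
The effect of a multibyte encoding can be observed by encoding a string and counting bytes. The sketch below is illustrative only: the class MultibyteDemo and its methods are hypothetical, and it uses UTF-8 (another multibyte encoding) as a stand-in, because UTF-8 is guaranteed to be available on every Java runtime while Shift-JIS support is optional.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are illustrative only.
public class MultibyteDemo {
  // Number of bytes needed to encode s in the given charset
  public static int encodedLength(String s, Charset cs) {
    return s.getBytes(cs).length;
  }

  // Decodes only the first n bytes of the encoded form, as a naive
  // fixed-size truncation of the byte sequence might do
  public static String decodePrefix(String s, Charset cs, int n) {
    byte[] bytes = s.getBytes(cs);
    return new String(bytes, 0, Math.min(n, bytes.length), cs);
  }

  public static void main(String[] args) {
    String hiraganaA = "\u3042"; // HIRAGANA LETTER A
    System.out.println(encodedLength("a", StandardCharsets.UTF_8));       // 1
    System.out.println(encodedLength(hiraganaA, StandardCharsets.UTF_8)); // 3
    // Cutting the byte sequence mid-character does not round-trip
    String cut = decodePrefix(hiraganaA, StandardCharsets.UTF_8, 2);
    System.out.println(hiraganaA.equals(cut));                            // false
  }
}
```

Where the runtime supports it, Charset.forName("Shift_JIS") can be passed to the same methods to inspect Shift-JIS lengths.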

...

This noncompliant code example attempts to trim leading letters from a string. However, it may fail because methods that accept only a char value cannot support supplementary characters. According to the Java API [API 2014] class Character documentation:

They treat char values from the surrogate ranges as undefined characters. For example, Character.isLetter('\uD840') returns false, even though this specific value if followed by any low-surrogate value in a string would represent a letter.

Because the method only examines one character at a time, it will also separate combining character sequences.

public static String trim(String string) {
  char ch;
  int i;
  for (i = 0; i < string.length(); i += 1) {
    ch = string.charAt(i);
    if (!Character.isLetter(ch)) {
      break;
    }
  }
  return string.substring(i);
}
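
The failure can be reproduced with a supplementary character. The following sketch is illustrative: the class CharTrimDemo and the method name trimByChar are hypothetical, and the method simply repeats the char-based logic above so that the example compiles on its own. It uses U+20000, which Java stores as the surrogate pair D840 DC00.

```java
// Hypothetical demo class; the names are illustrative only.
public class CharTrimDemo {
  // Same char-at-a-time logic as the noncompliant trim() method
  public static String trimByChar(String string) {
    int i;
    for (i = 0; i < string.length(); i += 1) {
      if (!Character.isLetter(string.charAt(i))) {
        break;
      }
    }
    return string.substring(i);
  }

  public static void main(String[] args) {
    // A supplementary letter followed by ASCII letters and digits
    String input = "\uD840\uDC00abc123";
    System.out.println(Character.isLetter('\uD840')); // false (char form)
    System.out.println(Character.isLetter(0x20000));  // true (int form)
    // The loop stops at the high surrogate, so nothing is trimmed
    System.out.println(trimByChar(input).equals(input)); // true
    System.out.println(trimByChar("abc123"));            // 123
  }
}
```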

...

Noncompliant Code Example (Combining Characters)

This noncompliant code example corrects the problem with supplementary characters by using the integer form of the Character.isLetter() method, which accepts a Unicode code point as an int argument. Java library methods that accept an int value support all Unicode characters, including supplementary characters. However, this method still fails to handle combining characters because it examines only one code point at a time.

// Fails for combining characters
public static String trim(String string) {
  int ch;
  int i;
  for (i = 0; i < string.length(); i += Character.charCount(ch)) {
    ch = string.codePointAt(i);
    if (!Character.isLetter(ch)) {
      break;
    }
  } 
  return string.substring(i);
}
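
The remaining weakness shows up when the input contains a combining character sequence. In the sketch below, the class CodePointTrimDemo and the method name trimByCodePoint are hypothetical, and the method restates the code-point logic above in a self-contained form. The combining diaeresis U+0308 is a nonspacing mark rather than a letter, so the loop stops between the base character and its mark.

```java
// Hypothetical demo class; the names are illustrative only.
public class CodePointTrimDemo {
  // Same code-point-at-a-time logic as the trim() method above
  public static String trimByCodePoint(String string) {
    int i = 0;
    while (i < string.length()) {
      int ch = string.codePointAt(i);
      if (!Character.isLetter(ch)) {
        break;
      }
      i += Character.charCount(ch);
    }
    return string.substring(i);
  }

  public static void main(String[] args) {
    // Supplementary characters are handled: only the digits remain
    System.out.println(trimByCodePoint("\uD840\uDC00abc123")); // 123
    // Combining sequences are not: the result begins with an orphaned U+0308
    String result = trimByCodePoint("u\u0308abc123");
    System.out.println(result.equals("\u0308abc123")); // true
  }
}
```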

Compliant Solution (Substring)

This compliant solution works both for supplementary and for combining characters [Tutorials 2008]. According to the Java API [API 2006] class java.text.BreakIterator documentation:

The BreakIterator class implements methods for finding the location of boundaries in text. Instances of BreakIterator maintain a current position and scan over text returning the index of characters where boundaries occur.

The boundaries returned may be those of supplementary characters, combining character sequences, or ligature clusters. For example, an accented character might be stored as a base character and a diacritical mark.

import java.text.BreakIterator;

public static String trim(String string) {
  BreakIterator iter = BreakIterator.getCharacterInstance();
  iter.setText(string);
  int i;
  // The last boundary equals string.length(), where no code point exists,
  // so stop before reading at that index
  for (i = iter.first();
       i != BreakIterator.DONE && i < string.length();
       i = iter.next()) {
    int ch = string.codePointAt(i);
    if (!Character.isLetter(ch)) {
      break;
    }
  }

  if (i == BreakIterator.DONE || i >= string.length()) {
    return ""; // The input was either empty or contained only letters
  } else {
    return string.substring(i);
  }
}
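
To see which indices BreakIterator reports, the boundaries can be collected into a list. This is a minimal sketch under the assumption that the default character instance is used; the class BoundaryDemo and the method boundaries are hypothetical names introduced for illustration.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; the names are illustrative only.
public class BoundaryDemo {
  // Collects every character boundary that BreakIterator reports for text
  public static List<Integer> boundaries(String text) {
    BreakIterator iter = BreakIterator.getCharacterInstance();
    iter.setText(text);
    List<Integer> result = new ArrayList<>();
    for (int i = iter.first(); i != BreakIterator.DONE; i = iter.next()) {
      result.add(i);
    }
    return result;
  }

  public static void main(String[] args) {
    // 'u' plus combining diaeresis U+0308 is treated as one unit
    System.out.println(boundaries("u\u0308x"));      // [0, 2, 3]
    // The surrogate pair for U+20000 is treated as one unit
    System.out.println(boundaries("\uD840\uDC00x")); // [0, 2, 3]
  }
}
```

Note that the final boundary equals the length of the string, so there is no code point to read at that index.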

...

Risk Assessment

Forming strings consisting of partial characters can result in unexpected behavior.

...
