Comment: added section on combining chars

...

Ignoring the possibility of supplementary characters, multibyte characters, or combining characters (characters that modify other characters) may allow an attacker to bypass input validation checks. Consequently, characters must not be split between two data structures.

Combining Characters

The Unicode specification recognizes that a single character as perceived by a user may be stored in a string as a sequence of several Unicode characters (a base character followed by one or more combining characters). The Java Tutorial provides an example:

A user character may be composed of more than one Unicode character. For example, the user character ü can be composed by combining the Unicode characters \u0075 (u) and \u00a8 (¨). ... The character ü may also be represented by the single Unicode character \u00fc.

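Do not split a string between a base character and the combining characters that modify it, or between two combining characters that belong to the same user-perceived character.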
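The following is a minimal sketch of the two representations described above. The class name `CombiningDemo` and its members are hypothetical. It assumes the combining mark U+0308 (COMBINING DIAERESIS) for the decomposed form, whereas the quoted passage writes \u00a8, and it uses `java.text.Normalizer` to show that the two-character sequence composes to the single character \u00fc under NFC normalization.

```java
import java.text.Normalizer;

public class CombiningDemo {
  // "u" followed by U+0308 COMBINING DIAERESIS (assumed combining mark):
  // two Unicode characters, one user-perceived character
  static final String DECOMPOSED = "u\u0308";

  // U+00FC LATIN SMALL LETTER U WITH DIAERESIS: a single Unicode character
  static final String PRECOMPOSED = "\u00fc";

  // Composes base + combining sequences into precomposed characters (NFC)
  static String compose(String s) {
    return Normalizer.normalize(s, Normalizer.Form.NFC);
  }

  public static void main(String[] args) {
    System.out.println("decomposed length:  " + DECOMPOSED.length());
    System.out.println("precomposed length: " + PRECOMPOSED.length());
    System.out.println("equal as stored:    " + DECOMPOSED.equals(PRECOMPOSED));
    System.out.println("equal after NFC:    " + compose(DECOMPOSED).equals(PRECOMPOSED));
    // Splitting after index 1 would separate the base letter from its mark
    System.out.println("tail starts with a combining mark: "
        + (Character.getType(DECOMPOSED.charAt(1)) == Character.NON_SPACING_MARK));
  }
}
```

Because the two forms compare unequal as stored, code that validates or truncates such text may need to normalize it first or operate on whole user-perceived characters.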

Multibyte Characters

Multibyte encodings are used for character sets that require more than one byte to uniquely identify each constituent character. For example, the Japanese encoding Shift-JIS (shown below) supports multibyte encoding where the maximum character length is two bytes (one leading and one trailing byte).
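As a rough illustration of why a multibyte character must not be split, the sketch below truncates an encoded byte sequence in the middle of a character and decodes the fragment. It deliberately uses UTF-8 rather than Shift-JIS, because UTF-8 is guaranteed to be available on every Java platform; the same hazard applies to any multibyte encoding. The class name `MultibyteSplitDemo` and the helper `decodePrefix` are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class MultibyteSplitDemo {
  // Encodes s as UTF-8, keeps only the first n bytes, and decodes them again
  static String decodePrefix(String s, int n) {
    byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
    return new String(Arrays.copyOf(bytes, n), StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    String s = "\u65e5"; // one Japanese character, three bytes in UTF-8
    int encodedLength = s.getBytes(StandardCharsets.UTF_8).length;
    System.out.println("encoded length: " + encodedLength);

    // Decoding all bytes round-trips the character
    System.out.println("whole: " + decodePrefix(s, encodedLength).equals(s));

    // Decoding a fragment yields the replacement character U+FFFD instead
    System.out.println("split: " + decodePrefix(s, 1).contains("\uFFFD"));
  }
}
```

The String constructor silently substitutes U+FFFD for the malformed fragment, so the corruption is not reported unless the code checks for it or uses a CharsetDecoder configured to report errors.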

...

// Fails for supplementary or combining characters
public static String trim_bad1(String string) {
  char ch;
  int i;
  for (i = 0; i < string.length(); i += 1) {
    ch = string.charAt(i);
    if (!Character.isLetter(ch)) {
      break;
    }
  }
  return string.substring(i);
}
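The failure for supplementary characters can be reproduced with a small sketch. `SupplementaryDemo` and `trimByChar` are hypothetical names; `trimByChar` is an assumed equivalent of the char-based loop above. The input starts with U+1D400 (MATHEMATICAL BOLD CAPITAL A), a letter that Java stores as a surrogate pair.

```java
public class SupplementaryDemo {
  // Assumed equivalent of the char-based trimming loop shown above
  static String trimByChar(String s) {
    int i;
    for (i = 0; i < s.length(); i++) {
      if (!Character.isLetter(s.charAt(i))) {
        break;
      }
    }
    return s.substring(i);
  }

  public static void main(String[] args) {
    // U+1D400 lies outside the Basic Multilingual Plane
    String boldA = new String(Character.toChars(0x1D400));
    String input = boldA + "bc1";

    // The code point is a letter, but its high surrogate alone is not
    System.out.println(Character.isLetter(0x1D400));
    System.out.println(Character.isLetter(input.charAt(0)));

    // The loop stops at index 0, so no leading letters are trimmed
    System.out.println(trimByChar(input).equals(input));
  }
}
```

Because `charAt()` returns a single surrogate rather than the full code point, the loop misclassifies the first character and returns the input unchanged.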

...

// Fails for combining characters
public static String trim_bad2(String string) {
  int ch;
  int i;
  for (i = 0; i < string.length(); i += Character.charCount(ch)) {
    ch = string.codePointAt(i);
    if (!Character.isLetter(ch)) {
      break;
    }
  } 
  return string.substring(i);
}
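A short sketch can show how iterating by code point still splits a combining sequence. `CombiningTrimDemo` and `trimByCodePoint` are hypothetical names; `trimByCodePoint` is an assumed equivalent of the code-point loop above, and the input uses U+0308 (COMBINING DIAERESIS) as the combining character.

```java
public class CombiningTrimDemo {
  // Assumed equivalent of the code-point-based trimming loop shown above
  static String trimByCodePoint(String s) {
    int i = 0;
    while (i < s.length()) {
      int cp = s.codePointAt(i);
      if (!Character.isLetter(cp)) {
        break;
      }
      i += Character.charCount(cp);
    }
    return s.substring(i);
  }

  public static void main(String[] args) {
    // "u" + U+0308 is one user-perceived character, followed by "x1"
    String input = "u\u0308x1";
    String result = trimByCodePoint(input);

    // The loop consumes "u" and stops at the mark, which is not a letter
    System.out.println(result.equals("\u0308x1"));

    // The result therefore begins with a combining mark that has lost its base
    System.out.println(
        Character.getType(result.codePointAt(0)) == Character.NON_SPACING_MARK);
  }
}
```

The returned string starts with a dangling combining mark, which would attach to whatever character precedes it when the string is later concatenated or displayed.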

...

import java.text.BreakIterator;

public static String trim_good(String string) {
  BreakIterator iter = BreakIterator.getCharacterInstance();
  iter.setText(string);
  int i;
  // Visit each character boundary, stopping before the final boundary
  // (string.length()), where codePointAt() would throw an exception
  for (i = iter.first();
       i != BreakIterator.DONE && i < string.length();
       i = iter.next()) {
    int ch = string.codePointAt(i);
    if (!Character.isLetter(ch)) {
      break;
    }
  }

  if (i == BreakIterator.DONE || i >= string.length()) {
    return ""; // The input was either empty or had only letters
  } else {
    return string.substring(i);
  }
}

...