...
Ignoring the possibility of supplementary characters, multibyte characters, or combining characters (characters that modify other characters) may allow an attacker to bypass input validation checks. Consequently, characters must not be split between two data structures.
Combining Characters
The Unicode specification recognizes that a character as perceived by a user may consist of multiple characters as stored in a string. The Java Tutorial provides an example:
A user character may be composed of more than one Unicode character. For example, the user character ü can be composed by combining the Unicode characters \u0075 (u) and \u00a8 (¨). ... The character ü may also be represented by the single Unicode character \u00fc.
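Do not split a string inside a combining character sequence, that is, between a base character and its combining characters or between two combining characters.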
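The following minimal sketch illustrates why such a split is harmful. It assumes the combining form of the diaeresis, U+0308 COMBINING DIAERESIS, rather than the spacing form quoted above, and the class and method names (`CombiningDemo`, `compose`) are hypothetical, chosen only for this illustration.

```java
import java.text.Normalizer;

class CombiningDemo {
    // "u" followed by U+0308 COMBINING DIAERESIS:
    // one user-perceived character stored as two char values.
    static final String DECOMPOSED = "u\u0308";

    // Composes base + combining mark into a single code point where one exists (NFC).
    static String compose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        System.out.println(DECOMPOSED.length());          // 2
        System.out.println(compose(DECOMPOSED).length()); // 1
        // Splitting at index 1 detaches the combining mark from its base character.
        String tail = DECOMPOSED.substring(1);
        System.out.println(
            Character.getType(tail.charAt(0)) == Character.NON_SPACING_MARK); // true
    }
}
```

Normalizing to NFC can reduce how often such sequences occur, but it does not remove them in general, because not every base-plus-mark combination has a precomposed code point.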
Multibyte Characters
Multibyte encodings are used for character sets that require more than one byte to uniquely identify each constituent character. For example, the Japanese encoding Shift-JIS (shown below) is a multibyte encoding in which a character occupies at most two bytes (one leading and one trailing byte).
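The same hazard can be sketched with any multibyte encoding. The example below uses UTF-8 instead of Shift-JIS, only because UTF-8 is guaranteed to be available on every Java platform; the class and method names (`MultibyteSplitDemo`, `decodePrefix`) are hypothetical. It decodes a byte array that was cut in the middle of a multibyte character.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class MultibyteSplitDemo {
    // Decodes only the first n bytes of the UTF-8 encoding of s.
    static String decodePrefix(String s, int n) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        return new String(Arrays.copyOf(bytes, n), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String hiraganaA = "\u3042"; // HIRAGANA LETTER A, three bytes in UTF-8
        // Decoding all three bytes round-trips the character.
        System.out.println(decodePrefix(hiraganaA, 3).equals(hiraganaA)); // true
        // Decoding a truncated sequence yields the replacement character U+FFFD
        // instead of the original character.
        System.out.println(decodePrefix(hiraganaA, 2).contains("\uFFFD")); // true
    }
}
```

A validation routine that inspects each half of such a split buffer separately never sees the original character, which is how a split can let input slip past a check.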
...
```java
// Fails for supplementary or combining characters
public static String trim_bad1(String string) {
  char ch;
  int i;
  for (i = 0; i < string.length(); i += 1) {
    ch = string.charAt(i);
    if (!Character.isLetter(ch)) {
      break;
    }
  }
  return string.substring(i);
}
```
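To see why a `char`-based scan misclassifies supplementary characters, the sketch below compares `Character.isLetter(char)` with `Character.isLetter(int)` on a letter outside the Basic Multilingual Plane. The sample character and the names `SurrogateDemo`, `firstCharIsLetter`, and `firstCodePointIsLetter` are chosen only for illustration.

```java
class SurrogateDemo {
    // U+1D400 MATHEMATICAL BOLD CAPITAL A, stored as the surrogate pair D835 DC00.
    static final String BOLD_A = "\uD835\uDC00";

    // Tests only the first char value, which here is a high surrogate.
    static boolean firstCharIsLetter(String s) {
        return Character.isLetter(s.charAt(0));
    }

    // Tests the full code point that starts at index 0.
    static boolean firstCodePointIsLetter(String s) {
        return Character.isLetter(s.codePointAt(0));
    }

    public static void main(String[] args) {
        System.out.println(BOLD_A.length());                           // 2
        System.out.println(BOLD_A.codePointCount(0, BOLD_A.length())); // 1
        System.out.println(firstCharIsLetter(BOLD_A));                 // false
        System.out.println(firstCodePointIsLetter(BOLD_A));            // true
    }
}
```

Because the lone surrogate is not a letter, a loop that calls `charAt` stops at index 0 and treats the supplementary letter as a non-letter.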
...
```java
// Fails for combining characters
public static String trim_bad2(String string) {
  int ch;
  int i;
  for (i = 0; i < string.length(); i += Character.charCount(ch)) {
    ch = string.codePointAt(i);
    if (!Character.isLetter(ch)) {
      break;
    }
  }
  return string.substring(i);
}
```
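Iterating by code point handles surrogate pairs but still treats a combining mark as a separate, non-letter unit. The sketch below, with the hypothetical names `CodePointScanDemo` and `firstNonLetterIndex`, shows the index at which such a scan stops for a decomposed and a precomposed ü.

```java
class CodePointScanDemo {
    // Returns the index of the first code point that is not a letter,
    // or s.length() if every code point is a letter.
    static int firstNonLetterIndex(String s) {
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            if (!Character.isLetter(cp)) {
                break;
            }
            i += Character.charCount(cp);
        }
        return i;
    }

    public static void main(String[] args) {
        // Decomposed ü: the scan stops between "u" and U+0308 COMBINING DIAERESIS.
        System.out.println(firstNonLetterIndex("u\u0308"));      // 1
        // Precomposed ü: a single letter, so the scan reaches the end.
        System.out.println(firstNonLetterIndex("\u00fc"));       // 1
        // Supplementary letter U+1D400: both surrogates are consumed together.
        System.out.println(firstNonLetterIndex("\uD835\uDC00")); // 2
    }
}
```

For the decomposed form, cutting the string at index 1 leaves a dangling combining mark at the start of the remainder, even though the user sees a single letter.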
...
```java
// Requires: import java.text.BreakIterator;
public static String trim_good(String string) {
  BreakIterator iter = BreakIterator.getCharacterInstance();
  iter.setText(string);
  int i;
  // Stop before the final boundary (string.length()), where codePointAt() would throw
  for (i = iter.first();
       i != BreakIterator.DONE && i < string.length();
       i = iter.next()) {
    int ch = string.codePointAt(i);
    if (!Character.isLetter(ch)) {
      break;
    }
  }
  // The input was either empty or consisted only of letters
  if (i == BreakIterator.DONE || i >= string.length()) {
    return "";
  }
  return string.substring(i);
}
```
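As a rough illustration of what the character-instance `BreakIterator` reports, the sketch below lists the boundaries of a short string. The names `BoundaryDemo` and `boundaries` are hypothetical, and the iterator is created for the default locale.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;

class BoundaryDemo {
    // Collects every user-perceived character boundary reported for s.
    static List<Integer> boundaries(String s) {
        BreakIterator iter = BreakIterator.getCharacterInstance();
        iter.setText(s);
        List<Integer> result = new ArrayList<>();
        for (int i = iter.first(); i != BreakIterator.DONE; i = iter.next()) {
            result.add(i);
        }
        return result;
    }

    public static void main(String[] args) {
        // "u" + U+0308 forms one user-perceived character, followed by "x".
        System.out.println(boundaries("u\u0308x"));       // [0, 2, 3]
        // A surrogate pair is likewise kept together.
        System.out.println(boundaries("\uD835\uDC00a"));  // [0, 2, 3]
    }
}
```

Note that the final boundary equals the length of the string, so a caller must not read a code point at that index.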
...