...
Ignoring the possibility of supplementary characters, multibyte characters, or combining characters (characters that modify other characters) may allow an attacker to bypass input validation checks.
Combining Characters
A combining character sequence is a base character followed by any number of combining characters. The combining character sequence forms a grapheme, which is a minimally distinctive unit of writing in the context of a particular writing system. For example, the grapheme ü can be composed by combining the base character \u0075 (u) with the combining diaeresis \u0308 (◌̈). It may also be represented by the single Unicode character \u00fc. Mishandling combining characters and multibyte characters may allow the formation of incorrect strings.
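The two representations of ü can be compared directly. The following is a minimal sketch, not part of the original guideline; the class name CombiningDemo and the helper compose() are hypothetical names chosen for illustration. It uses U+0308 (COMBINING DIAERESIS) as the combining mark and java.text.Normalizer to convert the decomposed form to the precomposed one.

```java
import java.text.Normalizer;

// Hypothetical demo class; the names here are illustrative only.
final class CombiningDemo {
  // Returns the canonically composed (NFC) form of the input.
  static String compose(String s) {
    return Normalizer.normalize(s, Normalizer.Form.NFC);
  }

  public static void main(String[] args) {
    String decomposed = "u\u0308";  // base letter u followed by COMBINING DIAERESIS
    String precomposed = "\u00fc";  // the same grapheme as a single code point

    System.out.println(decomposed.length());                      // 2
    System.out.println(precomposed.length());                     // 1
    System.out.println(decomposed.equals(precomposed));           // false
    System.out.println(compose(decomposed).equals(precomposed));  // true
  }
}
```

A validation check that compares strings char by char would treat these two inputs as different unless both are normalized to the same form first.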
Multibyte Characters
Multibyte encodings are used for character sets that require more than one byte to uniquely identify each constituent character. For example, the Japanese encoding Shift-JIS (shown below) supports multibyte encoding where the maximum character length is two bytes (one leading and one trailing byte).
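As a rough illustration of the one-byte and two-byte cases, the sketch below encodes an ASCII letter and a hiragana letter and reports the byte counts. The class name ShiftJisDemo and the helper encodedLength() are hypothetical. The sketch assumes the running JVM provides the Shift_JIS charset, which is not guaranteed on every installation, so it checks Charset.isSupported() first.

```java
import java.nio.charset.Charset;

// Hypothetical demo class; the names here are illustrative only.
final class ShiftJisDemo {
  // Number of bytes the text occupies when encoded with the named charset.
  static int encodedLength(String text, String charsetName) {
    return text.getBytes(Charset.forName(charsetName)).length;
  }

  public static void main(String[] args) {
    // Assumption: Shift_JIS is an optional charset and may be absent.
    if (!Charset.isSupported("Shift_JIS")) {
      System.out.println("Shift_JIS is not available on this JVM");
      return;
    }
    System.out.println(encodedLength("A", "Shift_JIS"));       // 1 (single byte)
    System.out.println(encodedLength("\u3042", "Shift_JIS"));  // 2 (lead byte + trail byte)
  }
}
```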
...
This noncompliant code example attempts to trim leading letters from string. However, this method may fail because methods that only accept a char value cannot support supplementary characters. According to the Java API [API 2014] class Character documentation:
They treat char values from the surrogate ranges as undefined characters. For example, Character.isLetter('\uD840') returns false, even though this specific value if followed by any low-surrogate value in a string would represent a letter.
Because the method only examines one character at a time, it will also separate combining character sequences.
Code Block | ||
---|---|---|
| ||
public static String trim(String string) {
  char ch;
  int i;
  for (i = 0; i < string.length(); i += 1) {
    ch = string.charAt(i);
    if (!Character.isLetter(ch)) {
      break;
    }
  }
  return string.substring(i);
}
|
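The difference between the char and int forms of Character.isLetter() can be observed on a single supplementary character. This is a small sketch under stated assumptions: the class name SurrogateDemo and both helper methods are hypothetical, and the sample character is U+20000, a CJK ideograph stored as the surrogate pair \uD840 \uDC00.

```java
// Hypothetical demo class; the names here are illustrative only.
final class SurrogateDemo {
  // Classifies the single char value at the given index.
  static boolean isLetterByChar(String s, int index) {
    return Character.isLetter(s.charAt(index));
  }

  // Classifies the full code point that starts at the given index.
  static boolean isLetterByCodePoint(String s, int index) {
    return Character.isLetter(s.codePointAt(index));
  }

  public static void main(String[] args) {
    String ideograph = "\uD840\uDC00";  // one code point, U+20000, stored as two chars

    System.out.println(ideograph.length());                              // 2
    System.out.println(ideograph.codePointCount(0, ideograph.length())); // 1
    System.out.println(isLetterByChar(ideograph, 0));                    // false
    System.out.println(isLetterByCodePoint(ideograph, 0));               // true
  }
}
```

With the char form, a trim loop like the one above would stop at the high surrogate and return the string unchanged, even though it begins with a letter.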
...
Unfortunately, the trim() method may fail because it uses the char form of the Character.isLetter() method. As the Character class documentation quoted above states, methods that only accept a char value treat values from the surrogate ranges as undefined characters, so they cannot support supplementary characters.
Noncompliant Code Example (Substring)
This noncompliant code example corrects the problem with supplementary characters by using the integer form of the Character.isLetter() method, which accepts a Unicode code point as an int argument. Java library methods that accept an int value support all Unicode characters, including supplementary characters. However, this method still fails to handle combining characters because it examines only one code point at a time.
Code Block | |||
---|---|---|---|
| |||
// Fails for combining characters
public static String trim(String string) {
int ch;
int i;
for (i = 0; i < string.length(); i += Character.charCount(ch)) {
ch = string.codePointAt(i);
if (!Character.isLetter(ch)) {
break;
}
}
return string.substring(i);
}
|
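The effect on a combining character sequence can be seen with a short input. The sketch below is illustrative only: the class name CodePointTrimDemo and the method trimByCodePoint() are hypothetical, and the method is a restatement of the code-point loop above using a while loop. The input is the letter u followed by U+0308 (COMBINING DIAERESIS) and an exclamation mark.

```java
// Hypothetical demo class; the names here are illustrative only.
final class CodePointTrimDemo {
  // Removes leading letters, examining one code point at a time.
  static String trimByCodePoint(String s) {
    int i = 0;
    while (i < s.length()) {
      int ch = s.codePointAt(i);
      if (!Character.isLetter(ch)) {
        break;
      }
      i += Character.charCount(ch);
    }
    return s.substring(i);
  }

  public static void main(String[] args) {
    String input = "u\u0308!";  // the grapheme u-with-diaeresis, then '!'
    String result = trimByCodePoint(input);

    // The combining mark is not a letter, so the loop stops between
    // the base letter and its mark, leaving an orphaned mark behind.
    System.out.println(result.length());                   // 2
    System.out.println(Integer.toHexString(result.charAt(0)));  // 308
  }
}
```

The returned string begins with a combining mark that no longer has a base character, which is the kind of partial character the guideline warns about.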
Compliant Solution (Substring)
This compliant solution works both for supplementary and for combining characters [Tutorials 2008]. According to the Java API [API 2006] class java.text.BreakIterator documentation:
The BreakIterator class implements methods for finding the location of boundaries in text. Instances of BreakIterator maintain a current position and scan over text returning the index of characters where boundaries occur.
The boundaries returned may be those of supplementary characters, combining character sequences, or ligature clusters. For example, an accented character might be stored as a base character and a diacritical mark.
Code Block | ||
---|---|---|
| ||
import java.text.BreakIterator;

public static String trim(String string) {
  BreakIterator iter = BreakIterator.getCharacterInstance();
  iter.setText(string);
  int i;
  // The last boundary equals string.length(); codePointAt() would throw
  // there, so the loop also stops when i reaches the end of the string
  for (i = iter.first();
       i != BreakIterator.DONE && i < string.length();
       i = iter.next()) {
    int ch = string.codePointAt(i);
    if (!Character.isLetter(ch)) {
      break;
    }
  }
  if (i == BreakIterator.DONE || i >= string.length()) { // Reached the last text boundary
    return ""; // The input was either empty or had only letters
  } else {
    return string.substring(i);
  }
}
|
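To see which indices a character BreakIterator reports, the boundaries of a short string can be collected into a list. This is a minimal sketch, assuming the JDK's default character break rules; the class name BoundaryDemo and the method characterBoundaries() are hypothetical. The sample string holds a combining sequence (u plus U+0308), a supplementary character (U+20000 as a surrogate pair), and the letter a.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical demo class; the names here are illustrative only.
final class BoundaryDemo {
  // Collects every character boundary index reported for the text.
  static List<Integer> characterBoundaries(String text) {
    BreakIterator iter = BreakIterator.getCharacterInstance(Locale.ROOT);
    iter.setText(text);
    List<Integer> boundaries = new ArrayList<>();
    for (int i = iter.first(); i != BreakIterator.DONE; i = iter.next()) {
      boundaries.add(i);
    }
    return boundaries;
  }

  public static void main(String[] args) {
    String text = "u\u0308\uD840\uDC00a";
    // Neither the combining sequence nor the surrogate pair is split.
    System.out.println(characterBoundaries(text));  // [0, 2, 4, 5]
  }
}
```

Note that the final boundary equals text.length(), so it is a valid argument to substring() but not to codePointAt(), which requires an index strictly less than the length.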
...
Risk Assessment
Forming strings consisting of partial characters can result in unexpected behavior.
...