In some versions of The Unicode Standard prior to Unicode version 5.2, conformance clause C7 allows the deletion of noncharacter code points. For example, conformance clause C7 from Unicode 5.1 states [Unicode 2007]:

...

Whenever a character is invisibly deleted (instead of replaced), as this older version of C7 permits, it may cause a security problem. The issue is the following: A gateway might be checking for a sensitive sequence of characters, say "delete". If what is passed in is "deXlete", where X is a noncharacter, the gateway lets it through because the sequence "deXlete" may be harmless in and of itself. However, suppose that later on, past the gateway, an internal process invisibly deletes the X. In that case, the sensitive sequence of characters is formed after the check has already passed, which can lead to a security breach.
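The gateway scenario above can be reproduced in a few lines. The following is only an illustrative sketch: the class `GatewayBypassDemo` and the methods `gatewayAllows`, `stripNoncharacters`, and `isNoncharacter` are hypothetical names introduced here, and the "gateway" is assumed to be a plain substring check. U+FDEF stands in for the noncharacter X.

```java
class GatewayBypassDemo {
    // Hypothetical gateway: rejects input that contains the sensitive word.
    static boolean gatewayAllows(String input) {
        return !input.contains("delete");
    }

    // Noncharacters are U+FDD0..U+FDEF plus the last two code points
    // of every plane (U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, and so on).
    static boolean isNoncharacter(int cp) {
        return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
    }

    // Hypothetical internal process that invisibly deletes noncharacters,
    // as the older wording of C7 allowed.
    static String stripNoncharacters(String input) {
        StringBuilder sb = new StringBuilder();
        input.codePoints()
             .filter(cp -> !isNoncharacter(cp))
             .forEach(sb::appendCodePoint);
        return sb.toString();
    }

    public static void main(String[] args) {
        String input = "de\uFDEFlete";               // "deXlete" with X = U+FDEF
        System.out.println(gatewayAllows(input));    // the raw input passes the check
        String processed = stripNoncharacters(input);
        System.out.println(processed.equals("delete")); // the sensitive word has formed
    }
}
```

Both calls print `true`: the check passes on the raw input, and the deletion that happens afterward produces exactly the sequence the gateway was meant to block.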

Consequently, any string modifications, including the removal or replacement of noncharacter code points, must be performed before the string is validated.

...

import java.text.Normalizer;
import java.text.Normalizer.Form;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// U+FDEF is a noncharacter code point
String s = "<scr" + "\uFDEF" + "ipt>";
s = Normalizer.normalize(s, Form.NFKC);
// Input validation (performed too early: the string is modified again below)
Pattern pattern = Pattern.compile("<script>");
Matcher matcher = pattern.matcher(s);
if (matcher.find()) {
  System.out.println("Found blacklisted tag");
} else {
  // ...
}

// Deletes all non-ASCII characters
s = s.replaceAll("[^\\p{ASCII}]", "");
// s now contains "<script>"

...

This compliant solution replaces the unknown or unrepresentable character with the Unicode replacement character \uFFFD, which is reserved to denote this condition. It also performs this replacement before any other sanitization, in particular before checking for <script>. This ensures that malicious input cannot bypass the filter.
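The ordering described here can be sketched as follows. This is a minimal illustration and not necessarily the rule's own listing: the class `ReplaceBeforeValidate` and the helpers `sanitize` and `containsScriptTag` are hypothetical names, and the sketch assumes that every non-ASCII character is treated as unrepresentable.

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;
import java.util.regex.Pattern;

class ReplaceBeforeValidate {
    private static final Pattern NON_ASCII = Pattern.compile("[^\\p{ASCII}]");
    private static final Pattern SCRIPT_TAG = Pattern.compile("<script>");

    // Hypothetical helper: normalize first, then replace (never delete)
    // every non-ASCII character with U+FFFD REPLACEMENT CHARACTER.
    static String sanitize(String input) {
        String s = Normalizer.normalize(input, Form.NFKC);
        return NON_ASCII.matcher(s).replaceAll("\uFFFD");
    }

    // Validation runs only on the already-modified string.
    static boolean containsScriptTag(String sanitized) {
        return SCRIPT_TAG.matcher(sanitized).find();
    }

    public static void main(String[] args) {
        String s = sanitize("<scr" + "\uFDEF" + "ipt>"); // U+FDEF is a noncharacter
        if (containsScriptTag(s)) {
            System.out.println("Found blacklisted tag");
        } else {
            // The replacement character keeps "<scr" and "ipt>" apart,
            // and no later step modifies the string again.
            System.out.println("No blacklisted tag");
        }
    }
}
```

Because the noncharacter is replaced rather than removed, the two halves of the tag never join, and because no modification follows the check, the validated string is the string that is actually used.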

...