Versions Compared

Comment: fixed the regexp, which had been broken since rev5.

...

Input validation is performed before non-ASCII characters are deleted. Consequently, an attacker can disguise a <script> tag by embedding a non-ASCII character in it: the tag passes the validation check, and the later deletion step reassembles it.

// "\uFEFF" (zero width no-break space) is a non-ASCII format character
String s = "<scr" + "\uFEFF" + "ipt>";
s = Normalizer.normalize(s, Form.NFKC);
// Input validation
Pattern pattern = Pattern.compile("<script>");
Matcher matcher = pattern.matcher(s);
if (matcher.find()) {
  System.out.println("Found blacklisted tag");
} else {
  // ... 
}

// Deletes all non-ASCII characters
s = s.replaceAll("[^\\p{ASCII}]", "");
// s now contains "<script>"
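
The bypass can be reproduced in a self-contained program. The following is a minimal sketch rather than part of the original rule: the class name ValidationOrderDemo and its helper methods are hypothetical, and the blacklist is reduced to the single <script> pattern used in the example.

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;
import java.util.regex.Pattern;

// Hypothetical demo class; names are illustrative, not taken from the rule.
class ValidationOrderDemo {
  private static final Pattern SCRIPT_TAG = Pattern.compile("<script>");

  // Returns true if the blacklisted tag occurs in s.
  static boolean containsScriptTag(String s) {
    return SCRIPT_TAG.matcher(s).find();
  }

  // Deletes every non-ASCII character, as the noncompliant example does.
  static String deleteNonAscii(String s) {
    return s.replaceAll("[^\\p{ASCII}]", "");
  }

  public static void main(String[] args) {
    String s = Normalizer.normalize("<scr" + "\uFEFF" + "ipt>", Form.NFKC);

    // Validation runs first and misses the disguised tag.
    System.out.println(containsScriptTag(s)); // prints false

    // Deletion runs second and reassembles the tag after it was checked.
    System.out.println(containsScriptTag(deleteNonAscii(s))); // prints true
  }
}
```

The second check is shown only to make the problem visible; in the noncompliant code no validation happens after the deletion, so the reassembled tag goes undetected.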

...

This compliant solution replaces each unknown or unrepresentable character with the Unicode replacement character \uFFFD, which is reserved to denote this condition. It performs the replacement before any other sanitization, in particular before checking for <script>. Because nothing is deleted after validation, malicious input cannot be reassembled into a tag that bypasses the filter.

String s = "<scr" + "\uFEFF" + "ipt>";

s = Normalizer.normalize(s, Form.NFKC);
// Replaces all non-ASCII characters with Unicode U+FFFD
s = s.replaceAll("[^\\p{ASCII}]", "\uFFFD");

Pattern pattern = Pattern.compile("<script>");
Matcher matcher = pattern.matcher(s);
if (matcher.find()) {
  System.out.println("Found blacklisted tag");
} else {
  // ... 
}
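
The ordering used by the compliant solution (normalize, replace, then validate) could be packaged as a reusable helper. This is a sketch under assumptions, not the rule's own API: the class ReplaceBeforeValidate and the methods sanitize and isRejected are hypothetical names, and the blacklist again contains only <script>.

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;
import java.util.regex.Pattern;

// Hypothetical helper class; names are illustrative, not taken from the rule.
class ReplaceBeforeValidate {
  private static final Pattern SCRIPT_TAG = Pattern.compile("<script>");

  // Normalizes the input, then replaces every non-ASCII character with U+FFFD.
  static String sanitize(String input) {
    String s = Normalizer.normalize(input, Form.NFKC);
    return s.replaceAll("[^\\p{ASCII}]", "\uFFFD");
  }

  // Validates only the sanitized form, so no later step can alter what was checked.
  static boolean isRejected(String input) {
    return SCRIPT_TAG.matcher(sanitize(input)).find();
  }

  public static void main(String[] args) {
    String disguised = "<scr" + "\uFEFF" + "ipt>";
    // The non-ASCII character is replaced, not deleted, so the tag stays broken.
    System.out.println(sanitize(disguised).equals("<scr\uFFFDipt>")); // prints true
    System.out.println(isRejected(disguised));  // prints false
    System.out.println(isRejected("<script>")); // prints true
  }
}
```

The disguised input is not rejected, but it is also no longer dangerous: the sanitized string carries U+FFFD where the hidden character was, so it never becomes a <script> tag.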

...

[API 2006]
[Davis 2008b]    3.5, Deletion of Noncharacters
[Weber 2009]     Handling the Unexpected: Character-deletion
[Unicode 2007]
[Unicode 2011]
