It is important not to modify a string after validation has occurred because doing so may allow an attacker to bypass validation. For example, a program may filter out the <script> tags from HTML input to avoid cross-site scripting (XSS) and other vulnerabilities. If exclamation marks (!) are deleted from the input following validation, an attacker may pass the string "<scr!ipt>" so that the validation check fails to detect the <script> tag, but the subsequent removal of the exclamation mark creates a <script> tag in the input.
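The exclamation-mark scenario can be reproduced in a few lines. The following is a minimal sketch rather than code from any real filter: the class name `ValidateThenModifyDemo` and its methods are hypothetical, chosen only to contrast the two orderings.

```java
import java.util.regex.Pattern;

final class ValidateThenModifyDemo {
    private static final Pattern SCRIPT_TAG = Pattern.compile("<script>");

    // Hypothetical validator: rejects any input that contains a <script> tag.
    static void validate(String s) {
        if (SCRIPT_TAG.matcher(s).find()) {
            throw new IllegalArgumentException("Invalid input");
        }
    }

    // Unsafe ordering: validate first, then delete '!' characters.
    static String validateThenStrip(String input) {
        validate(input);
        return input.replace("!", "");
    }

    // Safe ordering: delete '!' characters first, then validate the final string.
    static String stripThenValidate(String input) {
        String stripped = input.replace("!", "");
        validate(stripped);
        return stripped;
    }

    public static void main(String[] args) {
        String attack = "<scr!ipt>";
        // The tag is assembled only after the check has already passed.
        System.out.println(validateThenStrip(attack));
        try {
            stripThenValidate(attack);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

With the unsafe ordering the check sees "<scr!ipt>" and passes it. With the safe ordering the check sees the string exactly as it will be used.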
There are many different categories of characters that a programmer might decide to exclude. For example, The Unicode Standard [Unicode 2012] defines the following categories of characters, all of which can be matched using an appropriate regular expression:
Abbr | Long | Description |
---|---|---|
Cc | Control | a C0 or C1 control code |
Cf | Format | a format control character |
Cs | Surrogate | a surrogate code point |
Co | Private_Use | a private-use character |
Cn | Unassigned | a reserved unassigned code point or a noncharacter |
Other programs may remove or replace any character belonging to a uniquely defined set of characters. Any string modifications must be performed before the string is validated.
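As an illustration of matching these categories, the sketch below uses the `\p{...}` general-category syntax of `java.util.regex.Pattern` with the abbreviations from the table. The class `UnicodeCategoryDemo` and the helper `containsCategory` are hypothetical names, and the sample code points are assumptions picked for the demonstration.

```java
import java.util.regex.Pattern;

final class UnicodeCategoryDemo {
    // Returns true if s contains at least one code point whose Unicode
    // general category has the given abbreviation (for example "Cc" or "Cn").
    static boolean containsCategory(String abbr, String s) {
        return Pattern.compile("\\p{" + abbr + "}").matcher(s).find();
    }

    public static void main(String[] args) {
        // Sample code points (assumed for illustration):
        // U+0007 is a C0 control code, U+200D is a format character,
        // U+E000 is a private-use character, U+FDEF is a noncharacter.
        System.out.println(containsCategory("Cc", "a\u0007b"));
        System.out.println(containsCategory("Cf", "a\u200Db"));
        System.out.println(containsCategory("Co", "a\uE000b"));
        System.out.println(containsCategory("Cn", "a\uFDEFb"));
        System.out.println(containsCategory("Cn", "abc"));
    }
}
```

Whichever categories a program chooses to remove or replace, that step belongs before validation, so that the validated string is the one actually used.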
Noncompliant Code Example (Noncharacter Code Points)
In some versions of The Unicode Standard prior to version 5.2, conformance clause C7 allows the deletion of noncharacter code points. For example, conformance clause C7 from Unicode 5.1 states [Unicode 2007]:
...
Whenever a character is invisibly deleted (instead of replaced), such as in this older version of C7, it may cause a security problem. The issue is the following: A gateway might be checking for a sensitive sequence of characters, say "delete". If what is passed in is "deXlete", where X is a noncharacter, the gateway lets it through: the sequence "deXlete" may be in and of itself harmless. However, suppose that later on, past the gateway, an internal process invisibly deletes the X. In that case, the sensitive sequence of characters is formed, and can lead to a security breach.
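The difference between deleting and replacing can be seen in a small sketch of the gateway scenario. Everything here is assumed for illustration: the class `DeleteVersusReplaceDemo`, the method names, and the choice of U+FDEF as the noncharacter X.

```java
final class DeleteVersusReplaceDemo {
    static final String SENSITIVE = "delete";

    // Hypothetical gateway: lets a string through unless it contains the sensitive sequence.
    static boolean gatewayAllows(String s) {
        return !s.contains(SENSITIVE);
    }

    // Internal process that invisibly deletes unassigned and noncharacter code points (category Cn).
    static String deleteNoncharacters(String s) {
        return s.replaceAll("\\p{Cn}", "");
    }

    // Alternative internal process that replaces them with U+FFFD instead.
    static String replaceNoncharacters(String s) {
        return s.replaceAll("\\p{Cn}", "\uFFFD");
    }

    public static void main(String[] args) {
        String input = "de\uFDEFlete"; // "deXlete", with X = U+FDEF
        System.out.println("gateway allows input: " + gatewayAllows(input));
        System.out.println("after deletion contains sensitive sequence: "
                + deleteNoncharacters(input).contains(SENSITIVE));
        System.out.println("after replacement contains sensitive sequence: "
                + replaceNoncharacters(input).contains(SENSITIVE));
    }
}
```

Deletion joins the two halves of the sensitive sequence after the gateway has approved the input, whereas replacement leaves a visible marker between them.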
...
The filterString() method in this noncompliant code example normalizes the input string, validates that the input does not contain a <script> tag, and then removes any noncharacter code points from the input string. Because input validation is performed before the removal of noncharacter code points, an attacker can include noncharacter code points in the <script> tag to bypass the validation checks.
    import java.text.Normalizer;
    import java.text.Normalizer.Form;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class TagFilter {
      public static String filterString(String str) {
        String s = Normalizer.normalize(str, Form.NFKC);

        // Validate input
        Pattern pattern = Pattern.compile("<script>");
        Matcher matcher = pattern.matcher(s);
        if (matcher.find()) {
          throw new IllegalArgumentException("Invalid input");
        }

        // Deletes noncharacter code points (too late: validation has already run)
        s = s.replaceAll("[\\p{Cn}]", "");
        return s;
      }

      public static void main(String[] args) {
        // "\uFDEF" is a noncharacter code point
        String maliciousInput = "<scr" + "\uFDEF" + "ipt>";
        String sb = filterString(maliciousInput);
        // sb = "<script>"
      }
    }
Compliant Solution (Noncharacter Code Points)
This compliant solution replaces each noncharacter code point with the Unicode replacement character \uFFFD, which is reserved to denote an unknown or unrepresentable character. It performs this replacement before any other sanitization, in particular before checking for <script>. This ordering ensures that malicious input cannot use noncharacter code points to bypass the filter.
    import java.text.Normalizer;
    import java.text.Normalizer.Form;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class TagFilter {
      public static String filterString(String str) {
        String s = Normalizer.normalize(str, Form.NFKC);

        // Replaces all noncharacter code points with Unicode U+FFFD
        s = s.replaceAll("[\\p{Cn}]", "\uFFFD");

        // Validate input (runs on the final form of the string)
        Pattern pattern = Pattern.compile("<script>");
        Matcher matcher = pattern.matcher(s);
        if (matcher.find()) {
          throw new IllegalArgumentException("Invalid input");
        }
        return s;
      }

      public static void main(String[] args) {
        // "\uFDEF" is a noncharacter code point
        String maliciousInput = "<scr" + "\uFDEF" + "ipt>";
        String s = filterString(maliciousInput);
        // s = "<scr?ipt>", where ? stands for U+FFFD
      }
    }
...