It is important that a string not be modified after validation has occurred because doing so may allow an attacker to bypass validation. For example, a program may filter out the <script> tags from HTML input to avoid cross-site scripting (XSS) and other vulnerabilities. If exclamation marks (!) are deleted from the input following validation, an attacker may pass the string "<scr!ipt>" so that the validation check fails to detect the <script> tag, but the subsequent removal of the exclamation mark creates a <script> tag in the input.
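The ordering problem can be sketched in Java as follows. This is a minimal illustration, not a real XSS filter: the class ExclamationFilter and its methods are hypothetical names, and the literal substring check merely stands in for whatever validation a program actually performs.

```java
final class ExclamationFilter {
  // Hypothetical stand-in for a real validation step:
  // rejects any string containing a literal <script> tag.
  static void validate(String s) {
    if (s.contains("<script>")) {
      throw new IllegalArgumentException("Invalid input");
    }
  }

  // Wrong order: validation runs first, so "<scr!ipt>" passes,
  // and the later deletion of '!' produces "<script>".
  static String validateThenStrip(String input) {
    validate(input);
    return input.replace("!", "");
  }

  // Right order: '!' is deleted first, so the tag is visible to validation.
  static String stripThenValidate(String input) {
    String s = input.replace("!", "");
    validate(s);
    return s;
  }

  public static void main(String[] args) {
    System.out.println(validateThenStrip("<scr!ipt>")); // prints <script>
    try {
      stripThenValidate("<scr!ipt>");
    } catch (IllegalArgumentException e) {
      System.out.println("rejected: " + e.getMessage()); // prints rejected: Invalid input
    }
  }
}
```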
A programmer might decide to exclude many different categories of characters. For example, The Unicode Standard [Unicode 2012] defines the following categories of characters, all of which can be matched using an appropriate regular expression:
Abbr | Long | Description |
---|---|---|
Cc | Control | A C0 or C1 control code |
Cf | Format | A format control character |
Cs | Surrogate | A surrogate code point |
Co | Private_Use | A private-use character |
Cn | Unassigned | A reserved unassigned code point or a noncharacter |
Other programs may remove or replace any character belonging to a uniquely defined set of characters. Any string modifications must be performed before the string is validated.
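As a rough sketch of how these categories can be matched, java.util.regex.Pattern accepts the two-letter general-category abbreviations from the table in \p{...} classes. The class name CategoryMatcher and the helper containsAny are hypothetical and exist only for this illustration.

```java
import java.util.regex.Pattern;

final class CategoryMatcher {
  // One pattern per general category listed in the table.
  static final Pattern CONTROL = Pattern.compile("\\p{Cc}");
  static final Pattern FORMAT = Pattern.compile("\\p{Cf}");
  static final Pattern SURROGATE = Pattern.compile("\\p{Cs}");
  static final Pattern PRIVATE_USE = Pattern.compile("\\p{Co}");
  static final Pattern UNASSIGNED = Pattern.compile("\\p{Cn}");

  // Returns true if the string contains at least one code point
  // belonging to the given category.
  static boolean containsAny(Pattern category, String s) {
    return category.matcher(s).find();
  }

  public static void main(String[] args) {
    // U+0007 (BEL) is a C0 control code.
    System.out.println(containsAny(CONTROL, "a\u0007b"));
    // U+200D (ZERO WIDTH JOINER) is a format control character.
    System.out.println(containsAny(FORMAT, "a\u200Db"));
    // U+E000 is a private-use character.
    System.out.println(containsAny(PRIVATE_USE, "a\uE000b"));
    // U+FDD0 is a noncharacter, reported as unassigned.
    System.out.println(containsAny(UNASSIGNED, "a\uFDD0b"));
  }
}
```

Whichever categories a program chooses to exclude, the removal or replacement step belongs before validation, never after it.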
Noncompliant Code Example (Noncharacter Code Points)
In some versions of The Unicode Standard prior to version 5.2, conformance clause C7 allows the deletion of noncharacter code points. For example, conformance clause C7 from Unicode 5.1 states [Unicode 2007]:
C7. When a process purports not to modify the interpretation of a valid coded character sequence, it shall make no change to that coded character sequence other than the possible replacement of character sequences by their canonical-equivalent sequences or the deletion of noncharacter code points.
According to Unicode Technical Report #36, Unicode Security Considerations [Davis 2008b], Section 3.5, "Deletion of Noncharacters":
Whenever a character is invisibly deleted (instead of replaced), such as in this older version of C7, it may cause a security problem. The issue is the following: A gateway might be checking for a sensitive sequence of characters, say "delete". If what is passed in is "deXlete", where X is a noncharacter, the gateway lets it through: the sequence "deXlete" may be in and of itself harmless. However, suppose that later on, past the gateway, an internal process invisibly deletes the X. In that case, the sensitive sequence of characters is formed, and can lead to a security breach.
Any string modifications, including the removal or replacement of noncharacter code points, must be performed before any validation of the string is performed.
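The "deXlete" scenario from the quoted report might look like the following in Java. This is only a sketch under stated assumptions: GatewayDemo, gatewayAllows, and internalCleanup are hypothetical names, and U+FDD0 is used as the noncharacter X.

```java
final class GatewayDemo {
  // Hypothetical gateway: blocks the sensitive sequence "delete".
  static boolean gatewayAllows(String s) {
    return !s.contains("delete");
  }

  // Hypothetical internal process that invisibly deletes
  // unassigned and noncharacter code points (category Cn).
  static String internalCleanup(String s) {
    return s.replaceAll("\\p{Cn}", "");
  }

  public static void main(String[] args) {
    String input = "de" + "\uFDD0" + "lete";

    // The gateway sees "deXlete" and lets it through.
    System.out.println(gatewayAllows(input)); // prints true

    // Past the gateway, the deletion forms the sensitive sequence.
    System.out.println(internalCleanup(input)); // prints delete

    // Running the cleanup first exposes the sequence to the gateway.
    System.out.println(gatewayAllows(internalCleanup(input))); // prints false
  }
}
```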
The filterString() method in this noncompliant code example normalizes the input string, validates that the input does not contain a <script> tag, and then removes any noncharacter code points from the input string. Because input validation is performed before the removal of any noncharacter code points, an attacker can include noncharacter code points in the <script> tag to bypass the validation checks.
    import java.text.Normalizer;
    import java.text.Normalizer.Form;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class TagFilter {
      public static String filterString(String str) {
        String s = Normalizer.normalize(str, Form.NFKC);

        // Validate input
        Pattern pattern = Pattern.compile("<script>");
        Matcher matcher = pattern.matcher(s);
        if (matcher.find()) {
          throw new IllegalArgumentException("Invalid input");
        }

        // Deletes noncharacter code points (after validation: too late)
        s = s.replaceAll("[\\p{Cn}]", "");
        return s;
      }

      public static void main(String[] args) {
        // "\uFDEF" is a noncharacter code point
        String maliciousInput = "<scr" + "\uFDEF" + "ipt>";
        String sb = filterString(maliciousInput);
        // sb = "<script>"
      }
    }
Compliant Solution (Noncharacter Code Points)
This compliant solution replaces the unknown or unrepresentable character with the Unicode sequence \uFFFD, which is reserved to denote this condition. It also performs this replacement before doing any other sanitization, in particular, checking for <script>, to ensure that malicious input cannot bypass filters.
    import java.text.Normalizer;
    import java.text.Normalizer.Form;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class TagFilter {
      public static String filterString(String str) {
        String s = Normalizer.normalize(str, Form.NFKC);

        // Replaces all noncharacter code points with Unicode U+FFFD
        s = s.replaceAll("[\\p{Cn}]", "\uFFFD");

        // Validate input
        Pattern pattern = Pattern.compile("<script>");
        Matcher matcher = pattern.matcher(s);
        if (matcher.find()) {
          throw new IllegalArgumentException("Invalid input");
        }
        return s;
      }

      public static void main(String[] args) {
        // "\uFDEF" is a noncharacter code point
        String maliciousInput = "<scr" + "\uFDEF" + "ipt>";
        String s = filterString(maliciousInput);
        // s = <scr?ipt>
      }
    }
According to Unicode Technical Report #36, Unicode Security Considerations [Davis 2008b], "U+FFFD is usually unproblematic, because it is designed expressly for this kind of purpose. That is, because it doesn't have syntactic meaning in programming languages or structured data, it will typically just cause a failure in parsing. Where the output character set is not Unicode, though, this character may not be available."
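One way to check the last caveat, whether a given output character set can represent U+FFFD, is to ask the charset's encoder. The following is a minimal sketch; the class ReplacementCharCheck and its method are hypothetical names chosen for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ReplacementCharCheck {
  // Returns true if the charset can encode U+FFFD REPLACEMENT CHARACTER.
  // Assumes the charset supports encoding, as the standard charsets do.
  static boolean canRepresentReplacementChar(Charset cs) {
    return cs.newEncoder().canEncode('\uFFFD');
  }

  public static void main(String[] args) {
    // UTF-8 is a Unicode encoding, so U+FFFD is available.
    System.out.println(canRepresentReplacementChar(StandardCharsets.UTF_8)); // prints true
    // US-ASCII and ISO-8859-1 have no mapping for U+FFFD.
    System.out.println(canRepresentReplacementChar(StandardCharsets.US_ASCII)); // prints false
    System.out.println(canRepresentReplacementChar(StandardCharsets.ISO_8859_1)); // prints false
  }
}
```

When the check returns false, the program would need a different placeholder that is representable in the output character set; which one to choose depends on the application.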
Risk Assessment
Validating input before removing or modifying characters in the input string can allow malicious input to bypass validation checks.
Rule | Severity | Likelihood | Remediation Cost | Priority | Level |
---|---|---|---|---|---|
IDS11-J | High | Probable | Medium | P12 | L1 |
Automated Detection
Tool | Version | Checker | Description |
---|---|---|---|
The Checker Framework | | Tainting Checker | Trust and security errors (see Chapter 8) |
Parasoft Jtest | | CERT.IDS11.VPPD | Validate all dangerous data |
Related Guidelines
Bibliography
[API 2006] | |
[Davis 2008b] | Section 3.5, "Deletion of Noncharacters" |
[Seacord 2015] | "Handling the Unexpected: Character-deletion" (slides 72–74) |
...