Many applications that accept untrusted input strings employ input filtering and validation mechanisms based on the strings' character data. For example, an application's strategy for avoiding cross-site scripting (XSS) vulnerabilities may include forbidding <script> tags in inputs. Such blacklisting mechanisms are a useful part of a security strategy, even though they are insufficient for complete input validation and sanitization.
Character information in Java is based on the Unicode Standard. The following table shows the version of Unicode supported by the latest three releases of Java SE.
Java Version | Unicode Version |
---|---|
Java SE 6 | Unicode Standard, version 4.0 [Unicode 2003] |
Java SE 7 | Unicode Standard, version 6.0.0 [Unicode 2011] |
Java SE 8 | Unicode Standard, version 6.2.0 [Unicode 2012] |
Applications that accept untrusted input should normalize the input before validating it. Normalization is important because in Unicode, the same string can have many different representations. According to the Unicode Standard [Davis 2008], annex #15, Unicode Normalization Forms:
When implementations keep strings in a normalized form, they can be assured that equivalent strings have a unique binary representation.
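For example, the character é can be encoded either as the single code point U+00E9 or as the letter e followed by the combining acute accent U+0301; the two strings render identically but compare as unequal until they are normalized. A minimal sketch illustrating this (the class name is illustrative):

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;

public class EquivalentForms {
  public static void main(String[] args) {
    String composed = "\u00E9";     // é as a single precomposed code point
    String decomposed = "e\u0301";  // e followed by a combining acute accent

    System.out.println(composed.equals(decomposed)); // false: different code points
    System.out.println(Normalizer.normalize(composed, Form.NFC)
        .equals(Normalizer.normalize(decomposed, Form.NFC))); // true after normalization
  }
}
```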
Noncompliant Code Example
The Normalizer.normalize() method transforms Unicode text into the standard normalization forms described in Unicode Standard Annex #15, Unicode Normalization Forms. Frequently, the most suitable normalization form for performing input validation on arbitrarily encoded strings is KC (NFKC).
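For illustration, the small form variants U+FE64 and U+FE65 used in the code examples below carry compatibility decompositions to the ASCII angle brackets, so NFKC folds them to < and > while NFC leaves them unchanged. A brief sketch demonstrating this (the class name is illustrative):

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;

public class NfkcDemo {
  public static void main(String[] args) {
    String smallBrackets = "\uFE64script\uFE65"; // small form variants of < and >

    // NFC leaves the compatibility characters alone; NFKC folds them to ASCII
    System.out.println(Normalizer.normalize(smallBrackets, Form.NFC));  // ﹤script﹥
    System.out.println(Normalizer.normalize(smallBrackets, Form.NFKC)); // <script>
  }
}
```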
This noncompliant code example attempts to validate the String before performing normalization.
```java
// String s may be user controllable
// \uFE64 is normalized to < and \uFE65 is normalized to >
// using the NFKC normalization form
String s = "\uFE64" + "script" + "\uFE65";

// Validate
Pattern pattern = Pattern.compile("[<>]"); // Check for angle brackets
Matcher matcher = pattern.matcher(s);
if (matcher.find()) {
  // Found blacklisted tag
  throw new IllegalStateException();
} else {
  // ...
}

// Normalize
s = Normalizer.normalize(s, Form.NFKC);
```
The validation logic fails to detect the <script> tag because the input has not been normalized at the time the validation is performed. Consequently, the system accepts the invalid input.

Compliant Solution
This compliant solution normalizes the string before validating it. Alternative representations of the string are normalized to the canonical angle brackets. Consequently, input validation correctly detects the malicious input and throws an IllegalStateException.
```java
String s = "\uFE64" + "script" + "\uFE65";

// Normalize
s = Normalizer.normalize(s, Form.NFKC);

// Validate
Pattern pattern = Pattern.compile("[<>]");
Matcher matcher = pattern.matcher(s);
if (matcher.find()) {
  // Found blacklisted tag
  throw new IllegalStateException();
} else {
  // ...
}
```
Risk Assessment
Validating input before normalization affords attackers the opportunity to bypass filters and other security mechanisms. It can result in the execution of arbitrary code.
Rule | Severity | Likelihood | Remediation Cost | Priority | Level |
---|---|---|---|---|---|
IDS01-J | High | Probable | Medium | P12 | L1 |
Automated Detection
Tool | Version | Checker | Description |
---|---|---|---|
The Checker Framework | 2.1.3 | Tainting Checker | Trust and security errors (see Chapter 8) |
Fortify | 1.0 | Process_Control | Implemented |
Related Guidelines
Cross-site Scripting [XYT]
CWE-289, Authentication Bypass by Alternate Name
Android Implementation Details
Android apps can receive string data from external sources and should normalize it before validation.
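As a hedged sketch of how this might look, the activity below normalizes a string extra taken from an incoming Intent before validating it; the extra key, class name, and validation pattern are illustrative assumptions, not part of this rule:

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;
import java.util.regex.Pattern;

import android.app.Activity;
import android.os.Bundle;

public class NormalizeInputActivity extends Activity {
  // Hypothetical extra key; a real app would define its own
  private static final String EXTRA_QUERY = "com.example.EXTRA_QUERY";
  private static final Pattern ANGLE_BRACKETS = Pattern.compile("[<>]");

  @Override
  protected void onCreate(Bundle savedInstanceState) {
    super.onCreate(savedInstanceState);
    String query = getIntent().getStringExtra(EXTRA_QUERY);
    if (query != null) {
      // Normalize before validating, as in the compliant solution above
      query = Normalizer.normalize(query, Form.NFKC);
      if (ANGLE_BRACKETS.matcher(query).find()) {
        throw new IllegalStateException("Invalid characters in input");
      }
      // ... use the validated query
    }
  }
}
```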
Bibliography
8 Comments
Dhruv Mohindra
NFKC appears to be the best form for input validation. Any arguments against this?
David Svoboda
We still need to know what NFKC actually is... Is there a good definition of it that we can cite in this guideline?
A Bishop
This rule is very hard to read. It starts talking about KC and KD without first defining them. The first mention of KC and KD is that they should not be applied blindly. And by the end of the rule I still don't know what they are, other than some types of normalization.
A Bishop
Also, do we only care about this type of normalisation? What about escaped characters (HTML/XML/SQL, etc.)?
Josh Cain
I would also be interested in the reasons why NFKC is preferred.
David Svoboda
I've rewritten the intro and text around the code samples, which should make them clearer now.
This rule was built to focus on Unicode normalization, but we certainly care about other types, such as &gt;. Also path name normalization/canonicalization. See IDS02-J. Canonicalize path names before validating them for more info.
陈欢欢
I tested this with a simple Tomcat servlet and an HTML form. It works fine without the normalizer. How does the attack occur?
David Svoboda
Try just running the code examples. Tomcat may have added extra security to thwart the attack.