
It is important that a string not be modified after validation has occurred because doing so may allow an attacker to bypass validation. For example, a program may filter out the <script> tags from HTML input to avoid cross-site scripting (XSS) and other vulnerabilities. If exclamation marks (!) are deleted from the input following validation, an attacker may pass the string "<scr!ipt>" so that the validation check fails to detect the <script> tag, but the subsequent removal of the exclamation mark creates a <script> tag in the input. 
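The ordering problem described above can be sketched as follows. This is only a minimal illustration: the class name ExclamationFilterDemo and its method names are hypothetical, and a simple substring check stands in for real XSS validation.

```java
public class ExclamationFilterDemo {

  // Hypothetical stand-in for real validation: rejects input containing "<script>".
  public static void validate(String s) {
    if (s.contains("<script>")) {
      throw new IllegalArgumentException("Invalid input");
    }
  }

  // Unsafe order: validates first, then deletes exclamation marks.
  public static String validateThenModify(String input) {
    validate(input);
    return input.replace("!", "");
  }

  // Safe order: deletes exclamation marks first, then validates the final string.
  public static String modifyThenValidate(String input) {
    String s = input.replace("!", "");
    validate(s);
    return s;
  }

  public static void main(String[] args) {
    String attack = "<scr!ipt>";

    // The check sees "<scr!ipt>" and passes it; the deletion then forms the tag.
    System.out.println(validateThenModify(attack));

    try {
      modifyThenValidate(attack);
    } catch (IllegalArgumentException e) {
      System.out.println("Rejected: " + e.getMessage());
    }
  }
}
```

The only difference between the two methods is the order of the same two steps. Only the second method validates the string in the form that is passed on.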

A programmer might decide to exclude many different categories of characters. For example, The Unicode Standard [Unicode 2012] defines the following categories of characters, all of which can be matched using an appropriate regular expression:

| Abbr | Long | Description |
| --- | --- | --- |
| Cc | Control | A C0 or C1 control code |
| Cf | Format | A format control character |
| Cs | Surrogate | A surrogate code point |
| Co | Private_Use | A private-use character |
| Cn | Unassigned | A reserved unassigned code point or a noncharacter |

Other programs may remove or replace any character belonging to a uniquely defined set of characters. Any string modifications must be performed before the string is validated.
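As a rough sketch of how these categories can be matched in Java, java.util.regex.Pattern accepts the two-letter category abbreviations inside \p{...}. The class name CategoryMatchDemo and the method containsExcluded are hypothetical, and the set of excluded categories is an assumption chosen for illustration.

```java
import java.util.regex.Pattern;

public class CategoryMatchDemo {

  // Matches any code point in the general categories Cc, Cf, Cs, Co, or Cn.
  private static final Pattern EXCLUDED =
      Pattern.compile("[\\p{Cc}\\p{Cf}\\p{Cs}\\p{Co}\\p{Cn}]");

  // Returns true if the string contains at least one character from those categories.
  public static boolean containsExcluded(String s) {
    return EXCLUDED.matcher(s).find();
  }

  public static void main(String[] args) {
    System.out.println(containsExcluded("plain text"));      // letters and a space only
    System.out.println(containsExcluded("tab\there"));       // U+0009 is in Cc
    System.out.println(containsExcluded("zero\u200Bwidth")); // U+200B is in Cf
    System.out.println(containsExcluded("pua\uE000"));       // U+E000 is in Co
    System.out.println(containsExcluded("non\uFDEFchar"));   // U+FDEF is in Cn
  }
}
```

Whichever categories a program chooses to remove or replace, the resulting string is the one that must be validated.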

Noncompliant Code Example (Noncharacter Code Points)

In some versions of The Unicode Standard prior to version 5.2, conformance clause C7 allows the deletion of noncharacter code points. For example, conformance clause C7 from Unicode 5.1 [Unicode 2007] states:

C7. When a process purports not to modify the interpretation of a valid coded character sequence, it shall make no change to that coded character sequence other than the possible replacement of character sequences by their canonical-equivalent sequences or the deletion of noncharacter code points.

According to Unicode Technical Report #36, Unicode Security Considerations, Section 3.5, "Deletion of Code Points" [Davis 2008b]:

Whenever a character is invisibly deleted (instead of replaced), such as in this older version of C7, it may cause a security problem. The issue is the following: A gateway might be checking for a sensitive sequence of characters, say "delete". If what is passed in is "deXlete", where X is a noncharacter, the gateway lets it through: the sequence "deXlete" may be in and of itself harmless. However, suppose that later on, past the gateway, an internal process invisibly deletes the X. In that case, the sensitive sequence of characters is formed, and can lead to a security breach.
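The "deXlete" scenario in the quotation can be reproduced with a short sketch. The class NoncharacterDemo, the gateway check, and the deletion step are all hypothetical stand-ins. The helper isNoncharacter assumes the standard set of 66 noncharacters: U+FDD0 through U+FDEF, plus the last two code points of each plane.

```java
public class NoncharacterDemo {

  // True for noncharacter code points: U+FDD0..U+FDEF and every
  // code point whose low 16 bits are FFFE or FFFF.
  public static boolean isNoncharacter(int cp) {
    return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
  }

  // Hypothetical internal process that invisibly deletes noncharacters.
  public static String deleteNoncharacters(String s) {
    StringBuilder out = new StringBuilder(s.length());
    s.codePoints()
        .filter(cp -> !isNoncharacter(cp))
        .forEach(out::appendCodePoint);
    return out.toString();
  }

  // Hypothetical gateway that blocks the sensitive sequence "delete".
  public static boolean gatewayAllows(String s) {
    return !s.contains("delete");
  }

  public static void main(String[] args) {
    String input = "de\uFFFFlete"; // X is the noncharacter U+FFFF

    // The gateway inspects the string before the deletion happens.
    System.out.println("Gateway allows input: " + gatewayAllows(input));

    // Past the gateway, the deletion forms the sensitive sequence.
    String internal = deleteNoncharacters(input);
    System.out.println("Gateway would allow result: " + gatewayAllows(internal));
  }
}
```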

The filterString() method in this noncompliant code example normalizes the input string, validates that the input does not contain a <script> tag, and then removes any noncharacter code points from the input string. Because input validation is performed before the removal of any noncharacter code points, an attacker can include noncharacter code points in the <script> tag to bypass the validation checks.

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagFilter {
  public static String filterString(String str) {
    String s = Normalizer.normalize(str, Form.NFKC);

    // Validate input
    Pattern pattern = Pattern.compile("<script>");
    Matcher matcher = pattern.matcher(s);
    if (matcher.find()) {
      throw new IllegalArgumentException("Invalid input");
    }

    // Deletes noncharacter code points
    s = s.replaceAll("[\\p{Cn}]", "");
    return s;
  }

  public static void main(String[] args) {
    // "\uFDEF" is a noncharacter code point
    String maliciousInput = "<scr" + "\uFDEF" + "ipt>";
    String sb = filterString(maliciousInput);
    // sb = "<script>"
  }
}
```

Compliant Solution (Noncharacter Code Points)

This compliant solution replaces each noncharacter code point with the Unicode sequence \uFFFD, which is reserved to denote an unknown or unrepresentable character. It performs this replacement before any other sanitization, in particular before checking for <script>, so that malicious input cannot bypass the filter.

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagFilter {

  public static String filterString(String str) {
    String s = Normalizer.normalize(str, Form.NFKC);

    // Replaces all noncharacter code points with Unicode U+FFFD
    s = s.replaceAll("[\\p{Cn}]", "\uFFFD");

    // Validate input
    Pattern pattern = Pattern.compile("<script>");
    Matcher matcher = pattern.matcher(s);
    if (matcher.find()) {
      throw new IllegalArgumentException("Invalid input");
    }
    return s;
  }

  public static void main(String[] args) {
    // "\uFDEF" is a noncharacter code point
    String maliciousInput = "<scr" + "\uFDEF" + "ipt>";
    String s = filterString(maliciousInput);
    // s = "<scr\uFFFDipt>"
  }
}
```

According to Unicode Technical Report #36, Unicode Security Considerations [Davis 2008b], "U+FFFD is usually unproblematic, because it is designed expressly for this kind of purpose. That is, because it doesn't have syntactic meaning in programming languages or structured data, it will typically just cause a failure in parsing. Where the output character set is not Unicode, though, this character may not be available."
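The caveat about non-Unicode output character sets can be checked before output is written. The following sketch asks a charset's encoder whether it can represent U+FFFD. The class name ReplacementCharDemo and the method canRepresentReplacementChar are hypothetical, and the three charsets are only examples.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ReplacementCharDemo {

  // Reports whether U+FFFD can be encoded in the given output charset.
  public static boolean canRepresentReplacementChar(Charset cs) {
    return cs.newEncoder().canEncode('\uFFFD');
  }

  public static void main(String[] args) {
    Charset[] candidates = {
      StandardCharsets.UTF_8,
      StandardCharsets.ISO_8859_1,
      StandardCharsets.US_ASCII
    };
    for (Charset cs : candidates) {
      System.out.println(cs.name() + ": " + canRepresentReplacementChar(cs));
    }
  }
}
```

When the target charset cannot encode U+FFFD, a program needs a different substitute character or must reject the input outright. Whichever it chooses, it must still validate the string after the substitution.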

Risk Assessment

Validating input before removing or modifying characters in the input string can allow malicious input to bypass validation checks.

| Rule | Severity | Likelihood | Remediation Cost | Priority | Level |
| --- | --- | --- | --- | --- | --- |
| IDS11-J | High | Probable | Medium | P12 | L1 |

Automated Detection

| Tool | Version | Checker | Description |
| --- | --- | --- | --- |
| The Checker Framework | | Tainting Checker | Trust and security errors (see Chapter 8) |
| Parasoft Jtest | | CERT.IDS11.VPPD | Validate all dangerous data |

Related Guidelines

MITRE CWE: CWE-182, Collapse of Data into Unsafe Value

Bibliography

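[API 2006]
[Davis 2008b] Section 3.5, "Deletion of Noncharacters"
[Seacord 2015]
[Unicode 2007] The Unicode Consortium. The Unicode Standard, Version 5.1.0, defined by: The Unicode Standard, Version 5.0 (Boston, MA: Addison-Wesley, 2007. ISBN 0-321-48091-0), as amended by Unicode 5.1.0 (http://www.unicode.org/versions/Unicode5.1.0/).
[Unicode 2011] The Unicode Consortium. The Unicode Standard, Version 6.0.0 (Mountain View, CA: The Unicode Consortium, 2011. ISBN 978-1-936213-01-6). http://www.unicode.org/versions/Unicode6.0.0/
[Weber 2009] "Handling the Unexpected: Character-deletion" (slides 72–74)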

