In some versions of The Unicode Standard prior to version 5.2, conformance clause C7 allows the deletion of noncharacter code points. For example, conformance clause C7 from Unicode 5.1 states [Unicode 2007]:

C7. When a process purports not to modify the interpretation of a valid coded character sequence, it shall make no change to that coded character sequence other than the possible replacement of character sequences by their canonical-equivalent sequences or the deletion of noncharacter code points.

According to the Unicode Technical Report #36, Unicode Security Considerations [Davis 2008b], Section 3.5, "Deletion of Code Points":

Whenever a character is invisibly deleted (instead of replaced), such as in this older version of C7, it may cause a security problem. The issue is the following: A gateway might be checking for a sensitive sequence of characters, say "delete". If what is passed in is "deXlete", where X is a noncharacter, the gateway lets it through: the sequence "deXlete" may be in and of itself harmless. However, suppose that later on, past the gateway, an internal process invisibly deletes the X. In that case, the sensitive sequence of characters is formed, and can lead to a security breach.

Consequently, any string modification, including the removal or replacement of noncharacter code points, must be performed before the string is validated.
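The ordering problem described in the quoted passage can be sketched in a few lines. This is a minimal illustration, not code from the cited report: the class GatewayOrderDemo, its helper methods, and the choice of U+FDEF as the noncharacter "X" are all assumptions made for the example.

```java
final class GatewayOrderDemo {
  // Assumption: U+FDEF (a noncharacter) stands in for the "X" in "deXlete"
  static final String NONCHAR = "\uFDEF";

  // Hypothetical gateway check: reports whether the sensitive word is present
  static boolean containsSensitive(String s) {
    return s.contains("delete");
  }

  // Hypothetical internal step that silently drops the noncharacter
  static String stripNonchar(String s) {
    return s.replace(NONCHAR, "");
  }

  public static void main(String[] args) {
    String input = "de" + NONCHAR + "lete";

    // Wrong order: the gateway validates first, the deletion happens later
    boolean passedGateway = !containsSensitive(input);
    String afterDeletion = stripNonchar(input);
    System.out.println(passedGateway);                    // true
    System.out.println(containsSensitive(afterDeletion)); // true

    // Right order: modify first, then validate the final form of the string
    String cleaned = stripNonchar(input);
    System.out.println(containsSensitive(cleaned));       // true
  }
}
```

In the first ordering the gateway approves a string that later turns into the sensitive sequence; in the second, the check sees the same string that downstream code will see.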

Noncompliant Code Example

The filterString() method in this noncompliant code example normalizes the input string, validates that the input does not contain a <script> tag, and then removes any non-ASCII characters from the input string. Because input validation is performed before the non-ASCII characters are removed, an attacker can insert a noncharacter code point (or any other non-ASCII character) into the <script> tag and bypass the validation check.

import java.text.Normalizer;
import java.text.Normalizer.Form;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagFilter {
  public static String filterString(String str) {
    String s = Normalizer.normalize(str, Form.NFKC);
    // Validate input
    Pattern pattern = Pattern.compile("<script>");
    Matcher matcher = pattern.matcher(s);
    if (matcher.find()) {
      throw new IllegalArgumentException("Invalid input");
    }
    // Deletes all non-ASCII characters
    s = s.replaceAll("[^\\p{ASCII}]", "");
    return s;
  }
 
  public static void main(String[] args) {
    // U+FEFF (zero width no-break space) is not ASCII, so filterString() deletes it
    String maliciousInput = "<scr" + "\uFEFF" + "ipt>";
    String sb = filterString(maliciousInput);
    // sb = "<script>"
  }
}

Compliant Solution

This compliant solution replaces each unknown or unrepresentable character with the Unicode replacement character U+FFFD (\uFFFD), which is reserved to denote this condition. It performs this replacement before any other sanitization, in particular before checking for <script>, so that malicious input cannot bypass the filter.

import java.text.Normalizer;
import java.text.Normalizer.Form;
import java.util.regex.Matcher;
import java.util.regex.Pattern;


public class TagFilter {
 
  public static String filterString(String str) {
    // Normalize before any other processing
    String s = Normalizer.normalize(str, Form.NFKC);
    // Replace every non-ASCII character with U+FFFD
    s = s.replaceAll("[^\\p{ASCII}]", "\uFFFD");
    // Validate input
    Pattern pattern = Pattern.compile("<script>");
    Matcher matcher = pattern.matcher(s);
    if (matcher.find()) {
      throw new IllegalArgumentException("Invalid input");
    }
    return s;
  }

  public static void main(String[] args) {
    // U+FEFF (zero width no-break space) is not ASCII, so it is replaced with U+FFFD
    String maliciousInput = "<scr" + "\uFEFF" + "ipt>";
    String s = filterString(maliciousInput);
    // s = "<scr\uFFFDipt>"
  }
}
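
The solution above replaces every non-ASCII character. Where non-ASCII text must be preserved, one possible variation is to replace only the noncharacter code points before validating. The sketch below assumes the Unicode definition of noncharacters (U+FDD0 through U+FDEF, plus the last two code points of every plane). The class NoncharacterFilter and its methods are hypothetical names introduced for illustration, and the sketch does not address other invisible characters such as U+FEFF.

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;

final class NoncharacterFilter {
  // True for U+FDD0..U+FDEF and for any code point ending in FFFE or FFFF
  static boolean isNoncharacter(int cp) {
    return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
  }

  // Replaces each noncharacter with U+FFFD instead of deleting it
  static String replaceNoncharacters(String str) {
    StringBuilder sb = new StringBuilder(str.length());
    str.codePoints().forEach(cp ->
        sb.appendCodePoint(isNoncharacter(cp) ? 0xFFFD : cp));
    return sb.toString();
  }

  // Normalize, replace, and only then validate
  static String filterString(String str) {
    String s = Normalizer.normalize(str, Form.NFKC);
    s = replaceNoncharacters(s);
    if (s.contains("<script>")) {
      throw new IllegalArgumentException("Invalid input");
    }
    return s;
  }
}
```

Because the noncharacter is replaced rather than removed, "<scr" + "\uFDEF" + "ipt>" can never collapse into "<script>" at a later stage, provided no subsequent step deletes characters.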

According to the Unicode Technical Report #36, Unicode Security Considerations [Davis 2008b], "U+FFFD is usually unproblematic, because it is designed expressly for this kind of purpose. That is, because it doesn't have syntactic meaning in programming languages or structured data, it will typically just cause a failure in parsing. Where the output character set is not Unicode, though, this character may not be available."
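
The last sentence of the quotation points at a practical check: whether the output character set can represent U+FFFD at all. A minimal sketch using the standard java.nio.charset API follows. The class ReplacementCharCheck and its method names are hypothetical, and failing fast is only one possible policy when the character is unavailable.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ReplacementCharCheck {
  static final char REPLACEMENT = '\uFFFD';

  // Reports whether the given charset can encode U+FFFD
  static boolean canRepresentReplacement(Charset cs) {
    return cs.newEncoder().canEncode(REPLACEMENT);
  }

  // Assumed policy: refuse to proceed rather than pick a substitute silently
  static void requireReplacementSupport(Charset cs) {
    if (!canRepresentReplacement(cs)) {
      throw new IllegalStateException(
          "Output charset " + cs.name() + " cannot encode U+FFFD");
    }
  }

  public static void main(String[] args) {
    System.out.println(canRepresentReplacement(StandardCharsets.UTF_8));      // true
    System.out.println(canRepresentReplacement(StandardCharsets.US_ASCII));   // false
    System.out.println(canRepresentReplacement(StandardCharsets.ISO_8859_1)); // false
  }
}
```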

Risk Assessment

Validating input before eliminating noncharacter code points can allow malicious input to bypass validation checks.

Rule      Severity   Likelihood   Remediation Cost   Priority   Level
IDS11-J   high       probable     medium             P12        L1

Related Guidelines

MITRE CWE   CWE-182. Collapse of data into unsafe value

Bibliography

[API 2006]
[Davis 2008b]    3.5, Deletion of Noncharacters
[Weber 2009]     Handling the Unexpected: Character-deletion
[Unicode 2007]
[Unicode 2011]