IDS01-J. Sanitize untrusted input before processing or storing it

Many programs accept data from untrusted sources, and then pass the data (perhaps with minor modifications) to some subsystem. Often the subsystem accepts the data as a character string that follows some syntax, and both the program and subsystem must respect the syntax when managing the data.

Input sanitization refers to the elimination of unwanted characters from the input by means of removal, replacement, encoding or escaping the characters. Input must be sanitized, both because an application may be unprepared to handle the malformed input, and also because unsanitized input may conceal an attack vector.

Blacklisting

Blacklisting is the process of examining input data, looking for components that are known to be invalid. One advantage of this approach is that detection of known invalid input is often straightforward. A disadvantage is that the set of all possible invalid inputs may be unknown, or too large to enumerate fully. In such cases, consider the alternative of whitelisting known, valid inputs.

Depending on the language and subsystem in question, certain characters and character sequences are frequently considered to be invalid input when encountered in strings. A common set of such characters includes:

Character	Name
LF \r	Line Feed
CR \n	Carriage Return
CRLF \r\n	Line Feed + Carriage Return
" and '	Quotes
, and ;	Comma, semicolon, white space
/ and \	Forward and back slash
< and >	Angle brackets
&	Ampersand
%00	NULL
( and )	Parentheses
%	Percent

A blacklist of invalid inputs would forbid the appearance of any of these characters in their raw form. Note that determination of what constitutes invalid input can be difficult. For example, input validation of textual data using a black-listing approach requires enumerating not only the invalid characters shown above, but also the alternate Unicode representations of these characters in differing locales.

Whitelisting

The white-listing approach to input validation consists of building a list of valid input components (such as characters) and ensuring that untrusted input conforms to that list. Whitelisting is easier than blacklisting when it is easier to enumerate a set of valid input conditions than to detect and reject all instances of invalid input. But this advantage over blacklisting fails to apply when the set of valid inputs is difficult or impossible to enumerate.

Noncompliant Code Example (Blacklisting)

This noncompliant example must build a URI from untrusted input. It sanitizes the input by checking for angle brackets. However, the URI may consist of UTF-8 encoded character sequences. If the filter fails to forbid the % characters that comprise part of the UTF-8 encoding, it cannot achieve its purpose. For example, an attacker can bypass the filter by specifying the hexadecimal encoded form of the sequence <script> as %3C%73%63%72%69%70%74%3E.

String tainted = "%3C%73%63%72%69%70%74%3E"; // Hex encoded equivalent form of <script>

Pattern pattern = Pattern.compile("[<>]");
if (pattern.matcher(tainted).find()) {
  throw new ValidationException("Invalid Input");
}
URI uri = new URI("http://vulnerable.com/" + tainted);

Noncompliant Code Example

This noncompliant code example attempts to check for the hex-encoded form in addition to the canonical representation of the angle brackets. Note, however, that the program remains vulnerable when an alternative encoding, such as a modified Base64 URL encoding, is used farther along the chain.

String tainted = Base64.encode("%3C%73%63%72%69%70%74%3E".getBytes()); // <script>

Pattern pattern = Pattern.compile("(%3C|<)(.*)(%3E|>)");
if (pattern.matcher(tainted).find()) {
  throw new ValidationException("Invalid Input");
}
URI uri = new URI("http://vulnerable.com/" + tainted);

This approach also fails to prevent other forms of injection attacks that do not rely on angle brackets. Further, the infeasibility of exhaustive enumeration of all forms of blacklisted characters renders the use of methods such as String.replaceAll() ineffective for sanitizing untrusted user input.

Compliant Solution (Whitelisting)

This compliant solution validates the input based on a white-list. It permits the URL to contain only alphanumeric characters and the encoded forms of the space (" ") and period (".") characters; all other characters are treated as invalid and are rejected.

String tainted = "%3C%73%63%72%69%70%74%3E"; // Hex encoded equivalent form of <script>

Pattern pattern = Pattern.compile("[\\W&&[IDS01-J. Sanitize untrusted input before processing or storing it^\\s\\.]]");
if (pattern.matcher(tainted).find()) {
  throw new ValidationException( "Invalid Input");
}
URI uri = new URI("http://vulnerable.com/" + tainted);

Risk Assessment

Failure to sanitize user input before processing or storing it can lead to injection of arbitrary executable content.

Guideline	Severity	Likelihood	Remediation Cost	Priority	Level
IDS01-J	high	probable	medium	P12	L1

Related Vulnerabilities

CVE-2008-2370 describes a vulnerability in Apache Tomcat 4.1.0 through 4.1.37, 5.5.0 through 5.5.26, and 6.0.0 through 6.0.16. When a RequestDispatcher is used, Tomcat performs path normalization before removing the query string from the URI, which allows remote attackers to conduct directory traversal attacks and read arbitrary files via a .. (dot dot) in a request parameter.

Search for other vulnerabilities resulting from the violation of this guideline on the CERT website.