Input Validation and Data Sanitization
By definition, security vulnerabilities are flaws in software that can be exploited by an attacker. To succeed, the attacker must provide malicious inputs to a program or influence the program's environment to trigger the vulnerability. Consequently, preventing the introduction of malicious inputs into a program can eliminate the majority of vulnerabilities; this is the purpose of input validation and data sanitization.
Regardless of programming language, input validation requires several steps. The following steps are paraphrased from _Secure Coding in C and C++_ [Seacord 2005]:
- Identify all input sources. Input sources include the console, command-line arguments, files, network data, environment variables, and system properties.
- Specify and validate data. Data from untrusted sources must be fully specified and the data validated against these specifications. The system implementation must be designed to handle any range or combination of valid data. Valid data, in this sense, is data that is anticipated by the design and implementation of the system and therefore will not result in the system entering an indeterminate state. For example, if a system accepts two integers as input and multiplies those two values, the system must either (a) validate the input to ensure that an overflow or other exceptional condition cannot occur as a result of the operation or (b) be prepared to handle the result of the operation. The specifications must address limits, minimum and maximum values, minimum and maximum lengths, valid content, initialization and reinitialization requirements, and encryption requirements for storage and transmission.
- Ensure that all input meets specification. Input should be validated as soon as possible. Incorrect input is not always malicious; often it is accidental. Reporting the error as soon as possible often helps correct the problem. When an exception occurs deep in the code, it is not always apparent that the cause was an invalid input or which input was out of bounds. A data dictionary or similar mechanism can be used for specification of all program inputs. Input is usually stored in variables, and some input is eventually stored as persistent data. To validate input, specifications for what is valid input must be developed. A good practice is to define data and variable specifications, not just for all variables that hold user input, but also for all variables that hold data from a persistent store. The need to validate user input is obvious; the need to validate data being read from a persistent store is a defense against the possibility that the persistent store has been tampered with.
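The integer-multiplication example from the steps above can be sketched in Java as follows. This is a minimal illustration, not a prescribed implementation: the class name `ProductValidator`, the method names, and the range limits are hypothetical. It validates each input against a stated specification as early as possible, and additionally relies on `Math.multiplyExact` so that an overflow is reported rather than silently wrapping.

```java
final class ProductValidator {
    // Hypothetical specification: each factor must lie in [0, 10_000].
    static final int MIN_FACTOR = 0;
    static final int MAX_FACTOR = 10_000;

    /** Parses untrusted text and rejects anything outside the specification. */
    static int parseFactor(String raw) {
        if (raw == null) {
            throw new IllegalArgumentException("factor is missing");
        }
        final int value;
        try {
            value = Integer.parseInt(raw.trim());
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException("factor is not an integer: " + raw, e);
        }
        if (value < MIN_FACTOR || value > MAX_FACTOR) {
            throw new IllegalArgumentException("factor out of range: " + value);
        }
        return value;
    }

    /** Validates both inputs first, then multiplies them. */
    static int multiply(String rawA, String rawB) {
        int a = parseFactor(rawA);
        int b = parseFactor(rawB);
        // Defense in depth: multiplyExact throws ArithmeticException on overflow
        // instead of returning a wrapped result.
        return Math.multiplyExact(a, b);
    }
}
```

Rejecting bad input in `parseFactor` means the error message names the offending value at the point of entry, rather than surfacing later as an unexplained exception deep in the code.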
The Myth of Trust
...
Software programs often contain multiple components that act as subsystems, wherein each component operates in one or more trusted domains. For example, one component may have access to the file system but lack access to the network, while another component has access to the network but lacks access to the file system. _Distrustful decomposition_ and _privilege separation_ [Dougherty 2009] are examples of secure design patterns that reduce the amount of code that runs with special privileges by designing the system using mutually untrusting components.
When components with differing degrees of trust share data, the data are said to flow across a trust boundary. Because Java allows components under different trusted domains to communicate with each other, data can be transmitted across a trust boundary. Furthermore, a Java program can contain both internally developed and third-party code. Data that are transmitted to or accepted from third-party code also flow across a trust boundary.
Although software components can obey policies that allow them to transmit data across trust boundaries, they cannot specify the level of trust given to any component. The deployer of the application must define the trust boundaries with the help of a systemwide security policy. A security auditor can use that definition to determine whether the software adequately supports the security objectives of the application.
Java was designed to allow the execution of untrusted code; consequently, third-party code can operate in its own trusted domain. Any code that may be exported to a third party, such as a library, should be deployable in well-defined trusted domains. The public API of such third-party code can be considered to be a trust boundary. Data that crosses a trust boundary should be validated unless the code that produces this data provides guarantees of validity. A subscriber or client may omit validation when the data flowing into its trust boundary is appropriate for use as is. In all other cases, inbound data must be validated.
...
Data received by a component from a source outside the component's trust boundary can be malicious and can result in an injection attack, as shown in the scenario in Figure 1.
Figure 1. Injection attack
Programs must take steps to ensure that data received across a trust boundary is appropriate and not malicious. These steps can include the following:
Validation: Validation is the process of ensuring that input data falls within the expected domain of valid program input. This requires that inputs conform to the type and numeric range requirements of a method or subsystem as well as to input invariants for the class or subsystem.
Sanitization: In many cases, the data is passed directly to a component in a different trusted domain. Data sanitization is the process of ensuring that data conforms to the requirements of the subsystem to which it is passed. Sanitization also involves ensuring that data conforms to security-related requirements regarding leaking or exposure of sensitive data when output across a trust boundary. Sanitization may include the elimination of unwanted characters from the input by means of removing, replacing, encoding, or escaping the characters. Sanitization may occur following input (input sanitization) or before the data is passed across a trust boundary (output sanitization). Data sanitization and input validation may coexist and complement each other. Refer to the related guideline IDS01-J. Sanitize data passed across a trust boundary for more details on data sanitization. Many command interpreters and parsers provide their own sanitization and validation methods. When available, their use is preferred over custom sanitization techniques because custom-developed sanitization can often neglect special cases or hidden complexities in the parser. Another problem with custom sanitization code is that it may not be adequately maintained when new capabilities are added to the command interpreter or parser software.
Canonicalization and Normalization: Canonicalization is the process of lossless reduction of the input to its equivalent simplest known form. Normalization is the process of lossy conversion of input data to the simplest known (and anticipated) form. Canonicalization and normalization must occur before validation to prevent attackers from exploiting the validation routine to strip away invalid characters and, as a result, constructing an invalid (and potentially malicious) character sequence. See FIO16-J. Canonicalize path names before validating them for more information. Normalization should be performed only on fully assembled user input. Never normalize partial input or combine normalized input with nonnormalized input.
For example, POSIX file systems provide a syntax for expressing file names on the system using paths. A path is a string that indicates how to find a file by starting at a particular directory (usually the current working directory) and traversing down directories until the file is found. Canonical paths lack both symbolic links and special entries such as '.' or '..', which are handled specially on POSIX systems. Each file accessible from a directory has exactly one canonical path, along with many noncanonical paths.
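Canonicalizing a path before validating it might look like the following sketch. The class `CanonicalPathCheck` and the method `isInside` are hypothetical names chosen for illustration; the only library calls used are `File.getCanonicalFile()`, `File.toPath()`, and `Path.startsWith(Path)`. The check resolves '.' and '..' entries (and symbolic links that exist on disk) first, and only then tests whether the result lies under the permitted base directory.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Path;

final class CanonicalPathCheck {
    /**
     * Returns true only if the canonical form of the requested path
     * lies inside the canonical form of baseDir.
     */
    static boolean isInside(File baseDir, String requested) throws IOException {
        Path base = baseDir.getCanonicalFile().toPath();
        // Canonicalize BEFORE validating, so that "../" sequences cannot
        // slip past a check performed on the raw string.
        Path target = new File(baseDir, requested).getCanonicalFile().toPath();
        // Path.startsWith compares whole name elements, so "/data/base2"
        // is not mistaken for a child of "/data/base".
        return target.startsWith(base);
    }
}
```

Validating the raw string instead (for example, rejecting input that begins with "..") would miss equivalent noncanonical forms such as "a/../../outside.txt".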
In particular, complex subsystems that accept string data that specify commands or instructions are a special concern. String data passed to these components may contain special characters that can trigger commands or actions, resulting in a software vulnerability.
These are examples of components that can interpret commands or instructions:
- Operating system command interpreter (see IDS07-J. Sanitize untrusted data passed to the Runtime.exec() method)
- A data repository with a SQL-compliant interface (see IDS00-J. Prevent SQL Injection)
- XML parser (see IDS16-J. Prevent XML Injection and IDS17-J. XML External Entity Attacks)
- XPath evaluators
- Regular expression engines (see IDS08-J. Sanitize untrusted data included in a regular expression)
- Formatted output methods (see IDS06-J. Exclude unsanitized user input from format strings)
- A SAX (Simple API for XML) or a DOM (Document Object Model) parser
- Lightweight Directory Access Protocol (LDAP) directory service
- Script engines
Many rules address proper filtering of untrusted input, especially when such input is passed to a component that can interpret commands or instructions.
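As one illustration of filtering untrusted input before it reaches an interpreting component, the sketch below passes a user-supplied search term to the regular expression engine only after escaping it with the standard `Pattern.quote` method. The class `LogSearch` and the method `containsTerm` are hypothetical; this shows one sanitization option under the assumption that the term is meant to be matched literally, and it is not a substitute for the guidance in the rules cited above.

```java
import java.util.regex.Pattern;

final class LogSearch {
    /** Returns true if the line contains the untrusted term as literal text. */
    static boolean containsTerm(String line, String untrustedTerm) {
        // Pattern.quote turns the term into a literal pattern, so
        // metacharacters such as '.', '*', or '(' lose their special meaning.
        Pattern literal = Pattern.compile(Pattern.quote(untrustedTerm));
        return literal.matcher(line).find();
    }
}
```

Without the quoting step, a term such as ".*" would match every line, and a term such as "(" would cause `Pattern.compile` to throw a `PatternSyntaxException`.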
...
Bibliography
[Seacord 2015] Injection attacks LiveLesson