...
As a result, it is necessary to sanitize all string data passed to parsers or command interpreters so that the resulting string is innocuous in the context in which it will be parsed or interpreted.
Sanitization Techniques
Blacklisting
Blacklisting is the process of examining input data, looking for components that are known to be invalid. One advantage of this approach is that detection of known invalid input is often straightforward. A disadvantage is that the set of all possible invalid inputs may be unknown, or too large to enumerate fully.
Depending on the language and subsystem in question, certain characters and character sequences are frequently considered to be invalid input when encountered in strings. A common set of such characters includes:
Character | Name |
---|---|
LF \n | Line Feed |
CR \r | Carriage Return |
CRLF \r\n | Carriage Return + Line Feed |
" and ' | Quotes |
, ; and white space | Comma, semicolon, white space |
/ and \ | Forward and back slash |
< and > | Angle brackets |
& | Ampersand |
%00 | NUL (URL-encoded null byte) |
( and ) | Parentheses |
% | Percent |
A blacklist of invalid inputs would forbid the appearance of any of these characters in their raw form. Note that determining what constitutes invalid input can be difficult. For example, validating textual data with a blacklist requires enumerating not only the invalid characters shown above but also the alternate Unicode representations of those characters in differing locales.
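As a rough illustration of the blacklisting approach, the following Python sketch rejects any string that contains one of the characters from the table above. The name `contains_blacklisted` and the exact contents of `FORBIDDEN` are assumptions made for this example; a real blacklist depends on the parser or interpreter that will consume the string. The input is normalized with NFKC first so that compatibility forms, such as the fullwidth less-than sign U+FF1C, are folded to their ASCII equivalents before the check.

```python
import unicodedata

# Hypothetical blacklist taken from the table above, plus space and tab.
# The right set depends on the parser that will receive the string.
FORBIDDEN = set("\r\n\"',;/\\<>&()%\x00 \t")


def contains_blacklisted(text: str) -> bool:
    """Return True if text contains any blacklisted character."""
    # Fold compatibility variants (e.g. fullwidth '<') to their
    # canonical form before comparing against the blacklist.
    normalized = unicodedata.normalize("NFKC", text)
    return any(ch in FORBIDDEN for ch in normalized)
```

Normalization covers only the variants that Unicode itself defines as compatibility equivalents. Other encodings of the same characters, such as URL or HTML entity encoding, would need to be decoded separately, which is one reason a blacklist is hard to make complete.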
Whitelisting
The whitelisting approach to input validation consists of building a list of valid input elements (such as characters) and ensuring that all untrusted input elements appear on that list. Whitelisting is easier than blacklisting when it is easier to enumerate valid input elements than to detect and reject all instances of invalid input elements. This advantage disappears, however, when the set of valid input elements is difficult or impossible to enumerate and restricting input to a subset of the valid elements is not a viable solution.
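A minimal sketch of whitelisting in Python, assuming a hypothetical policy in which a user name may contain only ASCII letters, digits, and underscores and must be 1 to 32 characters long. The function name `is_valid_username` and the policy itself are illustrative only.

```python
import re

# Hypothetical whitelist: ASCII letters, digits, underscore; length 1 to 32.
_ALLOWED = re.compile(r"[A-Za-z0-9_]{1,32}")


def is_valid_username(text: str) -> bool:
    """Return True only if every character of text is on the whitelist."""
    # fullmatch requires the whole string to match, so a trailing
    # newline or any other unlisted character causes rejection.
    return _ALLOWED.fullmatch(text) is not None
```

Using `fullmatch` rather than `match` with a trailing `$` matters here, because `$` also matches just before a final newline and would let `"alice\n"` through.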
Component-based Sanitization
Many parsers and command interpreters provide their own sanitization and validation APIs. When available, their use is preferred over homegrown sanitization techniques, which often neglect special cases or hidden complexities in the parser.
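As one possible illustration, the Python standard library ships escaping helpers for two common interpreters: `html.escape` for text placed into HTML, and `shlex.quote` for a single argument placed into a POSIX shell command line. The wrapper names `for_html` and `for_posix_shell` below are hypothetical and exist only to label the two contexts.

```python
import html
import shlex


def for_html(untrusted: str) -> str:
    """Escape text for inclusion in HTML element content or attributes."""
    # Escapes &, <, > and, by default, both quote characters.
    return html.escape(untrusted)


def for_posix_shell(untrusted: str) -> str:
    """Quote text so a POSIX shell treats it as a single literal argument."""
    # shlex.quote targets POSIX shells only; it is not suitable
    # for other interpreters such as cmd.exe.
    return shlex.quote(untrusted)
```

For example, `for_html("<b>")` yields `&lt;b&gt;` and `for_posix_shell("a; b")` yields `'a; b'`. Each helper is correct only for its own context, so the choice of API has to follow the component that will finally parse the string.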
...