Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Comment: shuffled intro

Many applications that accept untrusted input strings employ input filtering and validation mechanisms based on the strings' character data. For example, an application's strategy for avoiding cross-site scripting (XSS) vulnerabilities may include forbidding <script> tags in inputs. Such blacklisting mechanisms are a useful part of a security strategy, even though they are insufficient for complete input validation and sanitization. When implemented, this form of validation must be performed only after normalizing the input.

Applications that accept untrusted input, especially Unicode-based input, should normalize the input before validating it.  Character Character information in Java SE 6 is based on the Unicode Standard, version 4.0 [Unicode 2003]. Character , while character information in Java SE 7 is based on the Unicode Standard, version 6.0.0 [Unicode 2011]. Normalization is important because in Unicode, the same string can have many different representations.  According to the Unicode Standard [Davis 2008], annex #15, Unicode Normalization Forms:

When implementations keep strings in a normalized form, they can be assured that equivalent strings have a unique binary representation.

AlsoUnicode provides several normalization forms. According to [Davis 2008]:

Normalization Forms KC and KD must not be blindly applied to arbitrary text. Because they erase many formatting distinctions, they will prevent round-trip conversion to and from many legacy character sets, and unless supplanted by formatting markup, they may remove distinctions that are important to the semantics of the text. It is best to think of these Normalization Forms as being like uppercase or lowercase mappings: useful in certain contexts for identifying core meanings, but also performing modifications to the text that may not always be appropriate. They can be applied more freely to domains with restricted character sets.

Frequently, the most suitable normalization form for performing input validation on arbitrarily encoded strings is KC (NFKC) because normalizing to KC transforms the input into an equivalent canonical form that can be safely compared with the required input form.

Noncompliant Code Example

...