Comment: changed description of combining characters

...

Ignoring the possibility of supplementary characters, multibyte characters, or combining characters (characters that modify other characters) may allow an attacker to bypass input validation checks. Consequently, characters must not be split between two data structures.
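As a minimal sketch of the splitting hazard for supplementary characters, the following example truncates a string without leaving half of a surrogate pair behind. The class name CodePointSafeTruncate and the method truncate are hypothetical, introduced here only for illustration; the approach (backing off by one code unit when the cut falls inside a surrogate pair) is one possible technique, not necessarily the one this guideline prescribes.

```java
public final class CodePointSafeTruncate {
    private CodePointSafeTruncate() {}

    /**
     * Returns a prefix of s that is at most maxChars UTF-16 code units long
     * and never ends in the middle of a surrogate pair.
     */
    public static String truncate(String s, int maxChars) {
        if (maxChars < 0) {
            throw new IllegalArgumentException("maxChars must be non-negative");
        }
        if (s.length() <= maxChars) {
            return s;
        }
        int end = maxChars;
        // If the cut would separate a high surrogate from its low surrogate,
        // move the cut back so the whole supplementary character is dropped.
        if (end > 0
                && Character.isHighSurrogate(s.charAt(end - 1))
                && Character.isLowSurrogate(s.charAt(end))) {
            end--;
        }
        return s.substring(0, end);
    }

    public static void main(String[] args) {
        // 'a', U+1F600 (stored as a surrogate pair), 'b'
        String s = "a\uD83D\uDE00b";
        System.out.println(s.length());                      // 4 code units
        System.out.println(s.codePointCount(0, s.length())); // 3 code points
        System.out.println(truncate(s, 2).length());         // 1, the pair is not split
    }
}
```

A naive s.substring(0, 2) on the same input would end with an unpaired high surrogate, which is the kind of split between data structures that the paragraph above warns against.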

Combining Characters

The Unicode specification recognizes that a character as perceived by a user may consist of multiple characters as stored in a string. The Java Tutorial provides an example:

...

A combining character sequence is a base character followed by any number of combining characters. The combining character sequence forms a grapheme, which is a minimally distinctive unit of writing in the context of a particular writing system. For example, the grapheme ü can be composed by combining the base character \u0075 (u) with the combining diacritical mark \u0308 (combining diaeresis). It may also be represented by the single Unicode character \u00fc.
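The two representations of ü can be compared in code. The sketch below assumes U+0308 (COMBINING DIAERESIS) as the combining mark and uses java.text.Normalizer to bring both strings to NFC before comparing; the class name CombiningDemo and the helper canonicallyEqual are hypothetical names chosen for this illustration.

```java
import java.text.Normalizer;

public final class CombiningDemo {
    private CombiningDemo() {}

    /** Compares two strings after normalizing both to NFC (canonical composition). */
    public static boolean canonicallyEqual(String a, String b) {
        String na = Normalizer.normalize(a, Normalizer.Form.NFC);
        String nb = Normalizer.normalize(b, Normalizer.Form.NFC);
        return na.equals(nb);
    }

    public static void main(String[] args) {
        String decomposed = "u\u0308";  // base character u + combining diaeresis
        String precomposed = "\u00fc";  // single precomposed character

        System.out.println(decomposed.length());                       // 2
        System.out.println(precomposed.length());                      // 1
        System.out.println(decomposed.equals(precomposed));            // false
        System.out.println(canonicallyEqual(decomposed, precomposed)); // true
    }
}
```

Because the raw strings differ, a validation check written against one form can miss the other unless the input is normalized first. Splitting the decomposed form after its first char also separates the base character from its mark.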

Multibyte Characters

Multibyte encodings are used for character sets that require more than one byte to uniquely identify each constituent character. For example, the Japanese encoding Shift-JIS (shown below) is a multibyte encoding in which a character occupies at most two bytes (one leading and one trailing byte).
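To make the variable width concrete, the following sketch encodes two characters and reports how many bytes each occupies. It assumes the runtime provides the Shift_JIS charset, which is an optional charset and may be absent on some Java runtimes, so the example checks for it first. The class name ShiftJisWidths and the helper encodedLength are hypothetical.

```java
import java.nio.charset.Charset;

public final class ShiftJisWidths {
    private ShiftJisWidths() {}

    /** Returns the number of bytes s occupies when encoded with the given charset. */
    public static int encodedLength(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    public static void main(String[] args) {
        // Assumption: Shift_JIS is an optional charset, so check before using it.
        if (!Charset.isSupported("Shift_JIS")) {
            System.out.println("Shift_JIS is not available on this runtime");
            return;
        }
        Charset sjis = Charset.forName("Shift_JIS");
        System.out.println(encodedLength("A", sjis));      // 1 (single-byte character)
        System.out.println(encodedLength("\u3042", sjis)); // 2 (hiragana A: lead + trail byte)
    }
}
```

A buffer boundary that falls between the lead byte and the trail byte leaves neither side decodable on its own, so byte-oriented reads should keep both bytes of a character together.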

...