In some versions prior to Unicode 5.2, conformance clause C7 allows the deletion of noncharacter code points. For example, conformance clause C7 from Unicode 5.1 states [...]:

> C7. When a process purports not to modify the interpretation of a valid coded character sequence, it shall make no change to that coded character sequence other than the possible replacement of character sequences by their canonical-equivalent sequences or the deletion of noncharacter code points.

According to the Unicode Technical Report #36, Unicode Security Considerations [Davis 2008b], Section 3.5, "Deletion of Noncharacters":

> Whenever a character is invisibly deleted (instead of replaced), such as in this older version of C7, it may cause a security problem. The issue is the following: A gateway might be checking for a sensitive sequence of characters, say "delete". If what is passed in is "deXlete", where X is a noncharacter, the gateway lets it through: the sequence "deXlete" may be in and of itself harmless. However, suppose that later on, past the gateway, an internal process invisibly deletes the X. In that case, the sensitive sequence of characters is formed, and can lead to a security breach.
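
To make the notion of a noncharacter in the gateway scenario concrete, here is a minimal sketch of how such code points could be detected in Java. The class and method names (`Noncharacters`, `isNoncharacter`, `containsNoncharacter`) are hypothetical and are not part of this rule or of the Java API; the ranges follow the Unicode definition of noncharacters (U+FDD0 through U+FDEF, plus the last two code points of each plane).

```java
public class Noncharacters {
  // Returns true if cp is one of the 66 Unicode noncharacter code points.
  static boolean isNoncharacter(int cp) {
    // Contiguous block of 32 noncharacters in the BMP
    if (cp >= 0xFDD0 && cp <= 0xFDEF) {
      return true;
    }
    // Last two code points of each plane: U+FFFE, U+FFFF, ..., U+10FFFE, U+10FFFF
    return Character.isValidCodePoint(cp) && (cp & 0xFFFE) == 0xFFFE;
  }

  // Returns true if any code point in s is a noncharacter.
  static boolean containsNoncharacter(String s) {
    return s.codePoints().anyMatch(Noncharacters::isNoncharacter);
  }

  public static void main(String[] args) {
    System.out.println(containsNoncharacter("de\uFDEFlete")); // true
    System.out.println(containsNoncharacter("delete"));        // false
    // U+FEFF (zero width no-break space) is a format character, not a noncharacter
    System.out.println(isNoncharacter(0xFEFF));                // false
  }
}
```

A gateway could use a check like this to reject or replace suspicious input up front, rather than letting a later process delete the code point silently.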

Any string modification, including the removal or replacement of noncharacter code points, must be performed before the string is validated.

Noncompliant Code Example
This noncompliant code example accepts only valid ASCII characters and deletes any non-ASCII characters. It also checks for the existence of a <script> tag. Input validation is being performed before the deletion of non-ASCII characters. Consequently, an attacker can disguise a <script> tag and bypass the validation checks.
```java
import java.text.Normalizer;
import java.text.Normalizer.Form;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// "\uFDEF" is a noncharacter code point
String s = "<scr" + "\uFDEF" + "ipt>";
s = Normalizer.normalize(s, Form.NFKC);

// Input validation
Pattern pattern = Pattern.compile("<script>");
Matcher matcher = pattern.matcher(s);
if (matcher.find()) {
  System.out.println("Found blacklisted tag");
} else {
  // ...
}

// Deletes all non-valid characters (anything outside ASCII)
s = s.replaceAll("[^\\p{ASCII}]", "");
// s now contains "<script>"
```
Compliant Solution
This compliant solution replaces the unknown or unrepresentable character with Unicode sequence \uFFFD, which is reserved to denote this condition. It also does this replacement before doing any other sanitization, in particular, checking for <script>. This ensures that malicious input cannot bypass filters.
```java
import java.text.Normalizer;
import java.text.Normalizer.Form;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// "\uFDEF" is a noncharacter code point
String s = "<scr" + "\uFDEF" + "ipt>";
s = Normalizer.normalize(s, Form.NFKC);

// Replaces all non-valid characters with Unicode U+FFFD
s = s.replaceAll("[^\\p{ASCII}]", "\uFFFD");

// Input validation now sees the string in its final form
Pattern pattern = Pattern.compile("<script>");
Matcher matcher = pattern.matcher(s);
if (matcher.find()) {
  System.out.println("Found blacklisted tag");
} else {
  // ...
}
```
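
The difference between the two orderings can be checked side by side. The following is a minimal sketch under stated assumptions, not part of the original rule: the class `ValidationOrderDemo`, its two methods, and the convention of returning `null` for rejected input are hypothetical and chosen only for illustration.

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;
import java.util.regex.Pattern;

public class ValidationOrderDemo {
  private static final Pattern SCRIPT_TAG = Pattern.compile("<script>");
  private static final String NON_ASCII = "[^\\p{ASCII}]";

  // Noncompliant order: validate first, then delete non-ASCII characters.
  // Returns the string passed downstream, or null if the input is rejected.
  static String validateThenDelete(String input) {
    String s = Normalizer.normalize(input, Form.NFKC);
    if (SCRIPT_TAG.matcher(s).find()) {
      return null;
    }
    return s.replaceAll(NON_ASCII, "");
  }

  // Compliant order: replace non-ASCII characters with U+FFFD, then validate.
  // Returns the string passed downstream, or null if the input is rejected.
  static String replaceThenValidate(String input) {
    String s = Normalizer.normalize(input, Form.NFKC);
    s = s.replaceAll(NON_ASCII, "\uFFFD");
    if (SCRIPT_TAG.matcher(s).find()) {
      return null;
    }
    return s;
  }

  public static void main(String[] args) {
    String disguised = "<scr" + "\uFDEF" + "ipt>";
    // prints "<script>": the tag is formed only after validation has passed
    System.out.println(validateThenDelete(disguised));
    // prints false: U+FFFD keeps the tag from being formed
    System.out.println(replaceThenValidate(disguised).contains("<script>"));
  }
}
```

In the first method the blacklisted tag only comes into existence after the check has already passed; in the second, the check operates on the final form of the string, so nothing later can reassemble the tag.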

According to the Unicode Technical Report #36, Unicode Security Considerations [...], "U+FFFD is usually unproblematic, because it is designed expressly for this kind of purpose. That is, because it doesn't have syntactic meaning in programming languages or structured data, it will typically just cause a failure in parsing. Where the output character set is not Unicode, though, this character may not be available."
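
The caveat about non-Unicode output character sets can be tested before relying on U+FFFD. The sketch below is one possible check, assuming the output encoding is known as a `java.nio.charset.Charset`; the class `ReplacementCharCheck` and the method `canEncodeReplacement` are hypothetical names introduced for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ReplacementCharCheck {
  static final char REPLACEMENT = '\uFFFD';

  // Returns true if the given output charset can represent U+FFFD.
  static boolean canEncodeReplacement(Charset charset) {
    return charset.canEncode() && charset.newEncoder().canEncode(REPLACEMENT);
  }

  public static void main(String[] args) {
    System.out.println(canEncodeReplacement(StandardCharsets.UTF_8));      // true
    System.out.println(canEncodeReplacement(StandardCharsets.US_ASCII));   // false
    System.out.println(canEncodeReplacement(StandardCharsets.ISO_8859_1)); // false
  }
}
```

When the check fails, the application would need another replacement strategy for that output channel, such as rejecting the input outright.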

Risk Assessment
Validating input before eliminating noncharacter code points can allow malicious input to bypass validation checks.

Rule | Severity | Likelihood | Remediation Cost | Priority | Level |
---|---|---|---|---|---|
IDS11-J | high | probable | medium | P12 | L1 |
Related Guidelines
Bibliography
[API 2006] |
[Davis 2008b] | 3.5, Deletion of Noncharacters
[...] | Handling the Unexpected: Character-deletion