In some versions prior to Unicode 5.2, conformance clause C7 allows the deletion of noncharacter code points. For example, conformance clause C7 from Unicode 5.1 states [Unicode 2007]:

C7. When a process purports not to modify the interpretation of a valid coded character sequence, it shall make no change to that coded character sequence other than the possible replacement of character sequences by their canonical-equivalent sequences or the deletion of noncharacter code points.
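For reference, the noncharacter code points that this clause refers to can be recognized programmatically. The following is a minimal sketch and is not part of the guideline: the names {{Noncharacters}} and {{isNoncharacter}} are hypothetical, and the check assumes the standard Unicode definition of the 66 noncharacters (U+FDD0 through U+FDEF, plus the last two code points of each of the 17 planes).

```java
final class Noncharacters {
  // Returns true if cp is one of the 66 Unicode noncharacter code points:
  // U+FDD0..U+FDEF, or any code point whose low 16 bits are FFFE or FFFF.
  static boolean isNoncharacter(int cp) {
    if (cp < 0 || cp > Character.MAX_CODE_POINT) {
      return false; // not a Unicode code point at all
    }
    return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
  }

  public static void main(String[] args) {
    System.out.println(isNoncharacter(0xFDEF));   // true
    System.out.println(isNoncharacter(0xFFFF));   // true
    System.out.println(isNoncharacter(0x10FFFE)); // true
    System.out.println(isNoncharacter(0xFEFF));   // false (ZERO WIDTH NO-BREAK SPACE)
    System.out.println(isNoncharacter(0xFFFD));   // false (REPLACEMENT CHARACTER)
  }
}
```

Note that U+FEFF and U+FFFD are ordinary assigned characters, not noncharacters, even though they sit close to the noncharacter range in the code space.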

According to the Unicode Technical Report #36, Unicode Security Considerations [Davis 2008b], Section 3.5, "Deletion of Noncharacters":

Whenever a character is invisibly deleted (instead of replaced), such as in this older version of C7, it may cause a security problem. The issue is the following: A gateway might be checking for a sensitive sequence of characters, say "delete". If what is passed in is "deXlete", where X is a noncharacter, the gateway lets it through: the sequence "deXlete" may be in and of itself harmless. However, suppose that later on, past the gateway, an internal process invisibly deletes the X. In that case, the sensitive sequence of characters is formed, and can lead to a security breach.

Any string modifications, including the removal or replacement of noncharacter code points, must be performed before any validation of the string is performed.
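The gateway scenario quoted above can be reproduced in a few lines. This is a minimal sketch under stated assumptions, not code from the guideline: the class {{GatewayOrderDemo}}, its helper methods, and the choice of U+FDEF as the noncharacter X are all hypothetical, and only the noncharacters in the Basic Multilingual Plane are handled.

```java
import java.util.regex.Pattern;

final class GatewayOrderDemo {
  // Hypothetical blacklist: the gateway rejects any input containing "delete".
  private static final Pattern SENSITIVE = Pattern.compile("delete");

  // Noncharacters in the Basic Multilingual Plane only (U+FDD0..U+FDEF,
  // U+FFFE, U+FFFF); supplementary planes are omitted to keep the sketch short.
  private static final Pattern BMP_NONCHARACTERS =
      Pattern.compile("[\\uFDD0-\\uFDEF\\uFFFE\\uFFFF]");

  static boolean containsSensitive(String s) {
    return SENSITIVE.matcher(s).find();
  }

  static String deleteNoncharacters(String s) {
    return BMP_NONCHARACTERS.matcher(s).replaceAll("");
  }

  public static void main(String[] args) {
    // "deXlete", where X is the noncharacter U+FDEF
    String input = "de" + "\uFDEF" + "lete";

    // Wrong order: the check sees "deXlete" and lets it through ...
    boolean blockedWhenValidatedFirst = containsSensitive(input);
    // ... and a later stage silently turns it into "delete".
    String downstream = deleteNoncharacters(input);

    // Right order: modify first, then validate what will actually be used.
    boolean blockedWhenModifiedFirst = containsSensitive(deleteNoncharacters(input));

    System.out.println(blockedWhenValidatedFirst); // false
    System.out.println(downstream);                // delete
    System.out.println(blockedWhenModifiedFirst);  // true
  }
}
```

The sketch deletes the noncharacter only to mirror the quoted scenario. The point is the ordering: the validated string should be the string that downstream code actually consumes.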

h2. Noncompliant Code Example

This noncompliant code example accepts only valid ASCII characters and deletes any non-ASCII characters. It also checks for the existence of a {{<script>}} tag. Because input validation is performed before the non-ASCII characters are deleted, an attacker can disguise a {{<script>}} tag and bypass the validation check.

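{code:bgColor=#FFcccc}
import java.text.Normalizer;
import java.text.Normalizer.Form;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NoncompliantExample {
  public static void main(String[] args) {
    // U+FEFF (ZERO WIDTH NO-BREAK SPACE) is a non-ASCII code point; strictly
    // speaking it is not a noncharacter, but the filter below deletes it the same way
    String s = "<scr" + "\uFEFF" + "ipt>";
    s = Normalizer.normalize(s, Form.NFKC);

    // Input validation runs first, so the disguised tag is not detected
    Pattern pattern = Pattern.compile("<script>");
    Matcher matcher = pattern.matcher(s);
    if (matcher.find()) {
      System.out.println("Found blacklisted tag");
    } else {
      // ...
    }

    // Deletes all non-ASCII characters
    s = s.replaceAll("[^\\p{ASCII}]", "");
    // s now contains "<script>"
  }
}
{code}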


h2. Compliant Solution

This compliant solution replaces each unknown or unrepresentable character with the Unicode replacement character {{\uFFFD}}, which is reserved to denote this condition. The replacement is performed before any other sanitization, in particular before checking for {{<script>}}, so malicious input cannot rely on a later deletion to slip past the filter.


{code:bgColor=#ccccff}
import java.text.Normalizer;
import java.text.Normalizer.Form;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CompliantExample {
  public static void main(String[] args) {
    String s = "<scr" + "\uFEFF" + "ipt>";
    s = Normalizer.normalize(s, Form.NFKC);

    // Replaces all non-ASCII characters with U+FFFD before validation
    s = s.replaceAll("[^\\p{ASCII}]", "\uFFFD");

    // Input validation now sees the string exactly as it will be used
    Pattern pattern = Pattern.compile("<script>");
    Matcher matcher = pattern.matcher(s);
    if (matcher.find()) {
      System.out.println("Found blacklisted tag");
    } else {
      // ...
    }
  }
}
{code}

According to the Unicode Technical Report #36, Unicode Security Considerations [Davis 2008b], "U+FFFD is usually unproblematic, because it is designed expressly for this kind of purpose. That is, because it doesn't have syntactic meaning in programming languages or structured data, it will typically just cause a failure in parsing. Where the output character set is not Unicode, though, this character may not be available."
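The last sentence of the quotation can be checked before choosing U+FFFD as the replacement. The following is a minimal sketch, assuming the output encoding is known as a {{java.nio.charset.Charset}}; the class and method names are hypothetical and not part of the guideline.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ReplacementCharAvailability {
  // Returns true if U+FFFD can be represented in the given output charset.
  static boolean canEncodeReplacementChar(Charset outputCharset) {
    return outputCharset.newEncoder().canEncode('\uFFFD');
  }

  public static void main(String[] args) {
    System.out.println(canEncodeReplacementChar(StandardCharsets.UTF_8));      // true
    System.out.println(canEncodeReplacementChar(StandardCharsets.US_ASCII));   // false
    System.out.println(canEncodeReplacementChar(StandardCharsets.ISO_8859_1)); // false
  }
}
```

When the check returns false, the application needs a different replacement policy for that output, for example rejecting the input outright. Which policy is appropriate depends on the application and is outside the scope of this sketch.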

h2. Risk Assessment

Validating input before eliminating noncharacter code points can allow malicious input to bypass validation checks.

||Rule||Severity||Likelihood||Remediation Cost||Priority||Level||
|IDS11-J|high|probable|medium|P12|L1|

h2. Related Guidelines

|MITRE CWE|CWE-182. Collapse of data into unsafe value|

h2. Bibliography

|[API 2006]| |
|[Davis 2008b]|Section 3.5, "Deletion of Noncharacters"|
|[Weber 2009]|"Handling the Unexpected: Character-deletion"|
|[Unicode 2007]| |
|[Unicode 2011]| |
