Comment: Migrated to Confluence 4.0

In some versions prior to Unicode 5.2, conformance clause C7 allows the deletion of noncharacter code points. For example, conformance clause C7 from Unicode 5.1 states [Unicode 2007]:

C7. When a process purports not to modify the interpretation of a valid coded character sequence, it shall make no change to that coded character sequence other than the possible replacement of character sequences by their canonical-equivalent sequences or the deletion of noncharacter code points.
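For reference, the Unicode Standard designates 66 code points as noncharacters: the range U+FDD0..U+FDEF and the last two code points of each of the 17 planes (U+FFFE, U+FFFF, U+1FFFE, and so on). The following is a minimal sketch of a check for them. The class and method names are hypothetical; they are not part of this guideline or of the Java API.

```java
final class Noncharacters {
    private Noncharacters() {}

    /** Returns true if cp is one of the 66 Unicode noncharacter code points. */
    static boolean isNoncharacter(int cp) {
        if (!Character.isValidCodePoint(cp)) {
            return false;
        }
        // Contiguous block of 32 noncharacters in the BMP
        if (cp >= 0xFDD0 && cp <= 0xFDEF) {
            return true;
        }
        // Last two code points of every plane: U+nFFFE and U+nFFFF
        return (cp & 0xFFFE) == 0xFFFE;
    }

    /** Returns true if any code point in s is a noncharacter. */
    static boolean containsNoncharacter(String s) {
        return s.codePoints().anyMatch(Noncharacters::isNoncharacter);
    }

    public static void main(String[] args) {
        System.out.println(isNoncharacter(0xFDEF));             // noncharacter
        System.out.println(isNoncharacter(0xFFFD));             // replacement character, not a noncharacter
        System.out.println(containsNoncharacter("de\uFDD0lete"));
    }
}
```

A check like this only identifies noncharacters; what to do with them (reject, replace, or delete) is a separate decision.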

According to the Unicode Technical Report #36, Unicode Security Considerations [Davis 2008b], Section 3.5, "Deletion of Noncharacters":

Whenever a character is invisibly deleted (instead of replaced), such as in this older version of C7, it may cause a security problem. The issue is the following: A gateway might be checking for a sensitive sequence of characters, say "delete". If what is passed in is "deXlete", where X is a noncharacter, the gateway lets it through: the sequence "deXlete" may be in and of itself harmless. However, suppose that later on, past the gateway, an internal process invisibly deletes the X. In that case, the sensitive sequence of characters is formed, and can lead to a security breach.
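The gateway scenario in the quotation can be sketched in Java as follows. This is an illustration under stated assumptions: gatewayAllows and deleteNoncharacters are hypothetical stand-ins for the gateway check and the later internal process, and U+FDD0 plays the role of X.

```java
final class GatewayDemo {
    private GatewayDemo() {}

    /** Hypothetical gateway: rejects any input that contains the sensitive word. */
    static boolean gatewayAllows(String input) {
        return !input.contains("delete");
    }

    /** True for U+FDD0..U+FDEF and for U+nFFFE / U+nFFFF in every plane. */
    static boolean isNoncharacter(int cp) {
        return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
    }

    /** Hypothetical internal process that silently drops noncharacters. */
    static String deleteNoncharacters(String input) {
        StringBuilder sb = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            if (!isNoncharacter(cp)) {
                sb.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String input = "de\uFDD0lete";                   // "deXlete"
        boolean passed = gatewayAllows(input);           // the check sees no "delete"
        String downstream = deleteNoncharacters(input);  // deletion happens after the check
        System.out.println("gateway passed: " + passed);
        System.out.println("sensitive word formed: " + downstream.equals("delete"));
    }
}
```

The input passes the gateway, and the sensitive word appears only after the deletion step, which is the ordering problem the quotation describes.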

Any string modifications, including the removal or replacement of noncharacter code points, must be performed before any validation of the string is performed.

Noncompliant Code Example

This noncompliant code example accepts only valid ASCII characters and deletes any non-ASCII characters. It also checks for the existence of a <script> tag. Input validation is being performed before the deletion of non-ASCII characters. Consequently, an attacker can disguise a <script> tag and bypass the validation checks.

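{code:bgColor=#FFcccc}
import java.text.Normalizer;
import java.text.Normalizer.Form;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagFilter {
  public static void main(String[] args) {
    // "\uFDEF" is a noncharacter code point
    String s = "<scr" + "\uFDEF" + "ipt>";
    s = Normalizer.normalize(s, Form.NFKC);

    // Input validation runs first and does not see the disguised tag
    Pattern pattern = Pattern.compile("<script>");
    Matcher matcher = pattern.matcher(s);
    if (matcher.find()) {
      System.out.println("Found blacklisted tag");
    } else {
      // ...
    }

    // Deletes all non-ASCII characters, but only after validation
    s = s.replaceAll("[^\\p{ASCII}]", "");
    // s now contains "<script>"
  }
}
{code}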


Compliant Solution

This compliant solution replaces each unknown or unrepresentable character with the Unicode replacement character U+FFFD (\uFFFD), which is reserved to denote this condition. The replacement is performed before any other sanitization, in particular before checking for <script>, so malicious input cannot bypass the filter.

{code:bgColor=#ccccff}
import java.text.Normalizer;
import java.text.Normalizer.Form;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagFilter {
  public static void main(String[] args) {
    String s = "<scr" + "\uFDEF" + "ipt>";
    s = Normalizer.normalize(s, Form.NFKC);

    // Replaces all non-ASCII characters with Unicode U+FFFD before validating
    s = s.replaceAll("[^\\p{ASCII}]", "\uFFFD");

    Pattern pattern = Pattern.compile("<script>");
    Matcher matcher = pattern.matcher(s);
    if (matcher.find()) {
      System.out.println("Found blacklisted tag");
    } else {
      // ...
    }
  }
}
{code}

According to the Unicode Technical Report #36, Unicode Security Considerations [Davis 2008b], "U+FFFD is usually unproblematic, because it is designed expressly for this kind of purpose. That is, because it doesn't have syntactic meaning in programming languages or structured data, it will typically just cause a failure in parsing. Where the output character set is not Unicode, though, this character may not be available."
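The last point in the quotation can be checked with the standard java.nio.charset API. The sketch below asks several encoders whether they can represent U+FFFD. The class and method names are hypothetical, and the choice of charsets is only an example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ReplacementCharSupport {
    private ReplacementCharSupport() {}

    /** Returns true if the given charset can encode U+FFFD REPLACEMENT CHARACTER. */
    static boolean canEncodeReplacementChar(Charset charset) {
        return charset.newEncoder().canEncode('\uFFFD');
    }

    public static void main(String[] args) {
        Charset[] charsets = {
            StandardCharsets.UTF_8,
            StandardCharsets.ISO_8859_1,
            StandardCharsets.US_ASCII
        };
        for (Charset cs : charsets) {
            System.out.println(cs.name() + " can encode U+FFFD: "
                + canEncodeReplacementChar(cs));
        }
    }
}
```

Of the three charsets listed, only UTF-8 can encode U+FFFD. When the output encoding cannot, a different substitute or outright rejection of the input has to be chosen before the data is written out.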

Risk Assessment

Validating input before eliminating noncharacter code points can allow malicious input to bypass validation checks.

|| Rule || Severity || Likelihood || Remediation Cost || Priority || Level ||
| IDS11-J | high | probable | medium | P12 | L1 |

Related Guidelines

| MITRE CWE | CWE-182. Collapse of data into unsafe value |

Bibliography

| [API 2006] | |
| [Davis 2008b] | 3.5, Deletion of Noncharacters |
| [Weber 2009] | Handling the Unexpected: Character-deletion |
| [Unicode 2007] | |
| [Unicode 2011] | |
