(THIS CODING RULE OR GUIDELINE IS UNDER CONSTRUCTION)
According to [JNI Tips], section "UTF-8 and UTF-16 Strings", Java uses UTF-16 strings that are not null-terminated. UTF-16 strings may contain \u0000 in the middle of the string, so it is necessary to know the length of the string when working on Java strings in native code.
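To illustrate why the length matters, here is a minimal C sketch. It is not JNI code: `uint16_t` stands in for JNI's `jchar`, and both function names are hypothetical. In real native code the length would typically be obtained from `GetStringLength()` rather than passed in by hand.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: uint16_t stands in for JNI's jchar (unsigned 16 bits). */

/* Naive scan that treats a 0 code unit as a terminator, the way a
   C string routine would. It stops early on an embedded U+0000. */
size_t utf16_len_to_nul(const uint16_t *s) {
    size_t n = 0;
    while (s[n] != 0) {
        n++;
    }
    return n;
}

/* Counts occurrences of a code unit using an explicit length, so an
   embedded U+0000 does not end the scan. */
size_t utf16_count_unit(const uint16_t *s, size_t len, uint16_t unit) {
    size_t count = 0;
    for (size_t i = 0; i < len; i++) {
        if (s[i] == unit) {
            count++;
        }
    }
    return count;
}
```

For the four code units 'a', U+0000, 'b', 'b', the terminator-based scan sees only the first code unit, whereas the length-based scan sees all four.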
JNI does provide functions that work with Modified UTF-8 (see [API 2013], Interface DataInput, section "Modified UTF-8"). The advantage of working with Modified UTF-8 is that it encodes \u0000 as 0xc0 0x80 instead of 0x00. This allows the use of C-style null-terminated strings that can be handled by C standard library string functions. However, arbitrary UTF-8 data cannot be expected to work correctly in JNI: besides the different encoding of \u0000, Modified UTF-8 represents supplementary characters as surrogate pairs, each surrogate encoded in three bytes, rather than as a single four-byte sequence. Data passed to the NewStringUTF() function must be in Modified UTF-8 format. Character data read from a file or stream cannot be passed to the NewStringUTF() function without first being filtered to convert it (for example, any high-ASCII characters) to Modified UTF-8. In other words, character data must be normalized to Modified UTF-8 before being passed to the NewStringUTF() function. (For more information about string normalization, see IDS01-J. Normalize strings before validating them. Note, however, that IDS01-J is mainly about UTF-16 normalization, whereas the concern here is Modified UTF-8 normalization.)
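As an illustration of the conversion step, the following C sketch converts standard UTF-8 to Modified UTF-8. It is only a sketch under stated assumptions, not part of JNI or of this rule: the function name `utf8_to_modified_utf8` and its helpers are hypothetical, the input is assumed to already be standard UTF-8 with a known length, and the validation is deliberately minimal.

```c
#include <stddef.h>
#include <stdint.h>

/* True if c is a UTF-8 continuation byte (10xxxxxx). */
static int is_cont(unsigned char c) { return (c & 0xC0) == 0x80; }

/* Writes one 16-bit value in the range 0x0800..0xFFFF as three bytes. */
static void put3(unsigned char *out, uint32_t u) {
    out[0] = (unsigned char)(0xE0 | (u >> 12));
    out[1] = (unsigned char)(0x80 | ((u >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (u & 0x3F));
}

/* Hypothetical helper: converts in_len bytes of standard UTF-8 into a
   NUL-terminated Modified UTF-8 string in out (capacity out_cap).
   Returns the number of bytes written, excluding the terminator, or -1
   on malformed input or insufficient space. */
long utf8_to_modified_utf8(const unsigned char *in, size_t in_len,
                           unsigned char *out, size_t out_cap) {
    size_t i = 0, o = 0;
    if (out_cap == 0) return -1;
    while (i < in_len) {
        unsigned char b = in[i];
        size_t avail_in = in_len - i;
        size_t avail_out = out_cap - o - 1; /* reserve the terminator */
        if (b == 0x00) {
            /* U+0000 becomes the two-byte form 0xC0 0x80. */
            if (avail_out < 2) return -1;
            out[o++] = 0xC0;
            out[o++] = 0x80;
            i += 1;
        } else if (b < 0x80) {
            /* Other ASCII bytes are copied unchanged. */
            if (avail_out < 1) return -1;
            out[o++] = b;
            i += 1;
        } else if (b >= 0xC2 && b <= 0xDF) {
            /* Two-byte sequences are identical in both encodings. */
            if (avail_in < 2 || !is_cont(in[i + 1])) return -1;
            if (avail_out < 2) return -1;
            out[o++] = b;
            out[o++] = in[i + 1];
            i += 2;
        } else if ((b & 0xF0) == 0xE0) {
            /* Three-byte sequences are copied after rejecting overlong
               forms and encoded surrogates, which are invalid UTF-8. */
            if (avail_in < 3 || !is_cont(in[i + 1]) || !is_cont(in[i + 2]))
                return -1;
            if (b == 0xE0 && in[i + 1] < 0xA0) return -1;
            if (b == 0xED && in[i + 1] >= 0xA0) return -1;
            if (avail_out < 3) return -1;
            out[o++] = b;
            out[o++] = in[i + 1];
            out[o++] = in[i + 2];
            i += 3;
        } else if (b >= 0xF0 && b <= 0xF4) {
            /* Four-byte sequences (supplementary characters) become a
               surrogate pair, each surrogate written as three bytes. */
            uint32_t cp, v;
            if (avail_in < 4 || !is_cont(in[i + 1]) ||
                !is_cont(in[i + 2]) || !is_cont(in[i + 3]))
                return -1;
            cp = ((uint32_t)(b & 0x07) << 18) |
                 ((uint32_t)(in[i + 1] & 0x3F) << 12) |
                 ((uint32_t)(in[i + 2] & 0x3F) << 6) |
                 (uint32_t)(in[i + 3] & 0x3F);
            if (cp < 0x10000 || cp > 0x10FFFF) return -1;
            if (avail_out < 6) return -1;
            v = cp - 0x10000;
            put3(out + o, 0xD800 | (v >> 10));
            put3(out + o + 3, 0xDC00 | (v & 0x3FF));
            o += 6;
            i += 4;
        } else {
            return -1; /* stray continuation byte or invalid lead byte */
        }
    }
    out[o] = 0;
    return (long)o;
}
```

On success the output buffer holds a null-terminated string with no embedded 0x00 bytes, which is the form that a call such as `(*env)->NewStringUTF(env, (const char *)out)` expects. Data in some other source encoding would first need to be decoded to Unicode, which this sketch does not attempt.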
Noncompliant Code Example
This noncompliant code example uses the wrong character encoding, which produces erroneous results.
Compliant Solution
In this compliant solution ...
Risk Assessment
If character data is not normalized before being passed to the NewStringUTF() function, erroneous results may be obtained.
Rule | Severity | Likelihood | Remediation Cost | Priority | Level |
---|---|---|---|---|---|
JNI04-J | Low | Probable | Medium | P4 | L3 |
Automated Detection
It may be possible to automatically detect whether character data from untrusted sources has been normalized before being passed to the NewStringUTF() function.
Bibliography
5 Comments
David Svoboda
Could JNI code use wchar_t, or is that also forbidden?
I disagree. While determining programmer intent is clearly not feasible, it is somewhat possible to determine if data is coming from an (untrusted) file and going directly to NewStringUTF(). If Java has an ASCII-to-Mod-UTF8 JNI function, you can also leverage that to know when strings are being properly sanitized. Which is the point: this is a sanitization problem of sorts. Well, actually a normalization problem, but it still can be handled by the same techniques that we discuss in the IDS section.
Fred Long
I'm not sure about using wchar_t; I'll check.
Thanks for the comment about sanitising/normalising. I'll think about rewriting this rule to take that into account.
Fred Long
Concerning wchar_t, that is an implementation-dependent type, whereas JNI's jchar is always unsigned 16 bits, so some type conversion may be necessary.
I have added some text about normalization and changed the Risk Assessment and Automated Detection sections.
I'm reluctant to add a link to IDS01-J. Normalize strings before validating them, because that is about UTF-16 normalization, whereas this rule is about Modified UTF-8 normalization, which is very different.
David Svoboda
The rule is looking better. But I do think you should mention IDS01-J. IDS01-J is about normalization in general, although it contains UTF-16-specific facts too. It's worth mentioning IDS01-J in this rule, although you should add the UTF-16/UTF-8 caveat you mentioned.
Fred Long
Done.