Character information in Java SE 8 is based on the Unicode Standard, version 6.2.0 [Unicode 2012]. A Unicode transformation format (UTF) is an algorithmic mapping from every Unicode code point (except surrogate code points) to a unique byte sequence. The Java platform uses the UTF-16 encoding in char arrays and in the String, StringBuilder, and StringBuffer classes. However, Java programs must often process character data in various character encodings. The java.io.InputStreamReader, java.io.OutputStreamWriter, and java.lang.String classes, and classes in the java.nio.charset package, can convert between UTF-16 and a number of other character encodings. The supported encodings vary among different implementations of the Java platform. The class description for java.nio.charset.Charset lists the encodings that every implementation of the Java platform is required to support. These include US-ASCII, ISO-8859-1, and UTF-8.
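For example, the following fragment (a minimal sketch added for illustration, not part of the original rule) round-trips a string through two of the required encodings using the java.nio.charset.StandardCharsets constants:

```java
import java.nio.charset.StandardCharsets;

String original = "caf\u00E9";  // "café"; stored internally as UTF-16
byte[] latin1 = original.getBytes(StandardCharsets.ISO_8859_1);  // 4 bytes
byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);         // 5 bytes: é needs 2
String decoded = new String(utf8, StandardCharsets.UTF_8);
System.out.println(decoded.equals(original));                    // true
```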
UTF-8 is an example of a variable-width encoding for Unicode. UTF-8 uses 1 to 4 bytes per character, depending on the code point. All 2²¹ possible Unicode code points can be encoded using UTF-8. The following table lists the well-formed UTF-8 byte sequences.
Bits of code point | First code point | Last code point | Bytes in sequence | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
---|---|---|---|---|---|---|---|
7 | U+0000 | U+007F | 1 | 0xxxxxxx | | | |
11 | U+0080 | U+07FF | 2 | 110xxxxx | 10xxxxxx | | |
16 | U+0800 | U+FFFF | 3 | 1110xxxx | 10xxxxxx | 10xxxxxx | |
21 | U+10000 | U+10FFFF | 4 | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |
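As an illustration (a short sketch added here, not part of the original rule), the following fragment encodes one code point from each row of the table and prints the number of UTF-8 bytes produced:

```java
import java.nio.charset.StandardCharsets;

System.out.println("A".getBytes(StandardCharsets.UTF_8).length);       // 1 byte  (U+0041)
System.out.println("\u00E9".getBytes(StandardCharsets.UTF_8).length);  // 2 bytes (U+00E9)
System.out.println("\u20AC".getBytes(StandardCharsets.UTF_8).length);  // 3 bytes (U+20AC)
System.out.println(new String(Character.toChars(0x1F600))
                       .getBytes(StandardCharsets.UTF_8).length);      // 4 bytes (U+1F600)
```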
UTF-8 is self-synchronizing: character boundaries are easily identified by scanning for well-defined bit patterns in either direction. If bytes are lost to error or corruption, the beginning of the next valid character can be located and processing resumed. Many variable-width encodings are harder to resynchronize. In some older variable-width encodings (such as Shift JIS), the end byte of one character and the first byte of the next character could look like another valid character [Phillips 2005].
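For instance, a resynchronization routine can back up from an arbitrary byte offset to the nearest character boundary simply by skipping continuation bytes. The helper below is a hypothetical sketch added for illustration, not part of the original rule:

```java
// Back up from index i to the start of the UTF-8 character containing it.
// Continuation bytes always match the bit pattern 10xxxxxx (0x80-0xBF).
static int startOfCharacter(byte[] buf, int i) {
  while (i > 0 && (buf[i] & 0xC0) == 0x80) {
    i--;  // skip continuation bytes until a lead byte (or single-byte character) is found
  }
  return i;
}
```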
Similar to UTF-8, UTF-16 is a variable-width encoding. Unicode code points between U+10000 and U+10FFFF are called supplementary code points, and Unicode-encoded characters having a supplementary code point are called supplementary characters. UTF-16 uses a single 16-bit code unit to encode the most common 63K characters and a pair of 16-bit code units, called surrogates, to encode supplementary characters. The first code unit of the pair is taken from the high-surrogates range (U+D800-U+DBFF), and the second is taken from the low-surrogates range (U+DC00-U+DFFF). Because the UTF-16 code unit ranges for high surrogates, low surrogates, and single units are all completely disjoint, there are no false matches, the location of a character boundary can be determined directly from each code unit value, and a dropped surrogate corrupts only a single character.
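The surrogate-pair behavior can be observed directly with the Character and String APIs; the following fragment is a sketch added for illustration:

```java
// U+1F600 (a supplementary character) requires a surrogate pair in UTF-16.
char[] units = Character.toChars(0x1F600);
System.out.println(units.length);                         // 2
System.out.println(Character.isHighSurrogate(units[0]));  // true (U+D83D)
System.out.println(Character.isLowSurrogate(units[1]));   // true (U+DE00)

String s = new String(units);
System.out.println(s.length());               // 2 code units
System.out.println(s.codePointCount(0, 2));   // 1 code point
```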
Programmers must not form strings containing partial characters, for example, when converting variable-width encoded character data to strings.
Noncompliant Code Example (Read)
This noncompliant code example attempts to read up to 1024 bytes from a socket and build a String from the data. It does so by reading the bytes in a while loop, as recommended by FIO10-J. Ensure the array is filled when using read() to fill an array. If the socket supplies more than 1024 bytes, the method throws an exception, which prevents untrusted input from potentially exhausting the program's memory.
```java
public final int MAX_SIZE = 1024;

public String readBytes(Socket socket) throws IOException {
  InputStream in = socket.getInputStream();
  byte[] data = new byte[MAX_SIZE + 1];
  int offset = 0;
  int bytesRead = 0;
  String str = new String();
  while ((bytesRead = in.read(data, offset, data.length - offset)) != -1) {
    str += new String(data, offset, bytesRead, "UTF-8");
    offset += bytesRead;
    if (offset >= data.length) {
      throw new IOException("Too much input");
    }
  }
  in.close();
  return str;
}
```
This code fails to account for the interaction between variable-width character encodings and the boundaries between loop iterations. If the last byte read from the data stream in one read() operation is the leading byte of a multibyte character, the trailing bytes are not encountered until the next iteration of the while loop. However, the variable-width encoding is decoded during construction of the new String inside the loop. Consequently, the variable-width encoding can be interpreted incorrectly. A similar problem can occur when constructing strings from UTF-16 data if the surrogate pair for a supplementary character is separated.
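The effect can be reproduced in isolation. The fragment below, a sketch added for illustration and not part of the original example, splits the 3-byte UTF-8 encoding of the euro sign across two String constructions, just as the loop above would if a read() boundary fell inside the character:

```java
import java.nio.charset.StandardCharsets;

byte[] euro = "\u20AC".getBytes(StandardCharsets.UTF_8);  // 3 bytes: E2 82 AC
// Decode the first byte and the remaining two bytes separately,
// as if they arrived in different read() calls.
String part1 = new String(euro, 0, 1, StandardCharsets.UTF_8);
String part2 = new String(euro, 1, 2, StandardCharsets.UTF_8);
String result = part1 + part2;
System.out.println(result.equals("\u20AC"));    // false
System.out.println(result.contains("\uFFFD"));  // true: replacement characters
```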
Compliant Solution (Read)
This compliant solution defers creation of the string until all the data is available:
```java
public final int MAX_SIZE = 1024;

public String readBytes(Socket socket) throws IOException {
  InputStream in = socket.getInputStream();
  byte[] data = new byte[MAX_SIZE + 1];
  int offset = 0;
  int bytesRead = 0;
  while ((bytesRead = in.read(data, offset, data.length - offset)) != -1) {
    offset += bytesRead;
    if (offset >= data.length) {
      throw new IOException("Too much input");
    }
  }
  String str = new String(data, 0, offset, "UTF-8");
  in.close();
  return str;
}
```
This code avoids splitting multibyte encoded characters across buffers by deferring construction of the result string until the data has been read in full.
Compliant Solution (Reader)
This compliant solution uses a Reader rather than an InputStream. The Reader class converts bytes into characters on the fly, so it avoids the hazard of splitting variable-width characters. This routine aborts if the socket provides more than 1024 characters, rather than 1024 bytes.
```java
public final int MAX_SIZE = 1024;

public String readBytes(Socket socket) throws IOException {
  InputStream in = socket.getInputStream();
  Reader r = new InputStreamReader(in, "UTF-8");
  char[] data = new char[MAX_SIZE + 1];
  int offset = 0;
  int charsRead = 0;
  String str = new String();
  while ((charsRead = r.read(data, offset, data.length - offset)) != -1) {
    str += new String(data, offset, charsRead);
    offset += charsRead;
    if (offset >= data.length) {
      throw new IOException("Too much input");
    }
  }
  in.close();
  return str;
}
```
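The following self-contained sketch is added for illustration; the OneBytePerRead wrapper is a hypothetical helper, not part of the original solution. It shows that an InputStreamReader buffers incomplete byte sequences internally, so even a stream that delivers one byte per read decodes multibyte characters correctly:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class ReaderDemo {
  // Delivers at most one byte per read() call, mimicking a slow socket.
  static class OneBytePerRead extends InputStream {
    private final InputStream in;
    OneBytePerRead(InputStream in) { this.in = in; }
    @Override public int read() throws IOException { return in.read(); }
    @Override public int read(byte[] b, int off, int len) throws IOException {
      if (len == 0) { return 0; }
      int x = in.read();
      if (x == -1) { return -1; }
      b[off] = (byte) x;
      return 1;
    }
  }

  public static void main(String[] args) throws IOException {
    byte[] utf8 = "\u20AC\uD83D\uDE00".getBytes(StandardCharsets.UTF_8);  // € and U+1F600
    Reader r = new InputStreamReader(
        new OneBytePerRead(new ByteArrayInputStream(utf8)), StandardCharsets.UTF_8);
    StringBuilder sb = new StringBuilder();
    int c;
    while ((c = r.read()) != -1) {
      sb.append((char) c);  // surrogate pairs arrive as two consecutive chars
    }
    System.out.println(sb.toString().equals("\u20AC\uD83D\uDE00"));  // true
  }
}
```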
Risk Assessment
Forming strings from character data containing partial characters can result in data corruption.
Rule | Severity | Likelihood | Remediation Cost | Priority | Level |
---|---|---|---|---|---|
STR00-J | Low | Unlikely | Medium | P2 | L3 |
Automated Detection
Tool | Version | Checker | Description |
---|---|---|---|
Parasoft Jtest | 2024.1 | CERT.STR00.COS | Do not use String concatenation in an Internationalized environment |
Bibliography
[API 2014] | Classes |
[Phillips 2005] | |
[Seacord 2015] | |
[Unicode 2012] |
11 Comments
Robert Seacord
I've changed the description in the front from Shift JIS to UTF-8 for a variety of reasons, but mainly because it is used in the examples. I removed the following statement that applies to Shift JIS but not UTF-8 and therefore did not apply to the examples:
The trailing byte ranges overlap the range of both the single-byte and lead-byte characters. When a multibyte character is separated across a buffer boundary, it can be interpreted differently than if it were not separated across the buffer boundary; this difference arises because of the ambiguity of its composing bytes [Phillips 2005].
Here is a description of some of the differences between these encodings:
Robert Seacord
There are a number of ways to refer to encodings like UTF-8 and Shift JIS including: multibyte, variable-width, variable-length, and byte encodings.
I've gone here with "variable-width". I don't like "multibyte" because it applies to an encoding like UTF-32 where each character uses four bytes.
I went with width over length just because of wide characters in C and other stuff I'm used to, I suppose. The term "variable-width" is also used on this page: http://www.unicode.org/faq/utf_bom.html
David Svoboda
A better table for illustrating how UTF-8 is encoded in 4 bytes
https://en.wikipedia.org/wiki/UTF-8#Description
This rule smells like a FIO rule right now, prob b/c of the examples.
I think this rule and STR03-J could be merged to make a stronger rule "don't build strings from byte arrays that are not designed to hold a complete String"
Fred Long
I think that STR is definitely the correct home for this rule.
Concerning your last point, I think that you mean STR01-J. There are certainly similarities, and the example here is almost the same as the first example in STR01-J. However, combining the two might make an overly complicated rule. Perhaps it would be better to insert a reference to STR01-J in this rule. (There is already a reference to this rule in STR01-J.)
Robert Seacord
Yes, I think I would like to keep them separate. The biggest difference to me is that for STR01-J you also need to be concerned about operations using the various string types because these are all UTF-16 encoded.
Jussi Auvinen
There seem to be 2 additional errors in the non-compliant code fragment.
str += new String(data, offset, bytesRead, "UTF-8");
David Svoboda
Thanks, I've fixed all the code samples to read text properly. The NCCE correctly reads text if it does not break up a variable-width character.
Sarah Olsen
The priority and level scores for STR00-J on the parent page (P12, L1) disagree with those on this page (P2, L3). Which is correct?
David Svoboda
The scores on this page (P2, L3) are correct; I've fixed the parent page.
Sarah Olsen
That was fast! Thanks!
Ehely Gergely
The last line of that UTF-8 table doesn't seem right.