...
While Java strings are stored as arrays of type `char`, and can be represented as an array of bytes, a single character in the string might be represented by two or more consecutive elements of type `byte` or of type `char`. Splitting a `char` or `byte` array risks splitting a multibyte character. Security vulnerabilities may arise when an application expects input in a form that an attacker is capable of bypassing. This can happen when an application disregards supplementary characters or multibyte characters, or when it fails to use combining characters appropriately. Combining characters are characters that modify other characters; refer to the Combining Diacritical Marks chart for more details on combining characters.
Consequently, programs must take multibyte characters into account when manipulating arrays of bytes or chars.
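The risk of splitting a `char` array can be sketched as follows. This is only an illustration: the class name `SurrogateSplitDemo`, the helper `splitsSurrogatePair`, and the sample string are assumptions introduced here, not part of the guideline. It shows that one character above `U+FFFF` occupies two `char` elements, and that a cut between them separates the pair.

```java
final class SurrogateSplitDemo {
  // Hypothetical sample: 'a' followed by U+1F600, which UTF-16 stores
  // as a surrogate pair (two char elements).
  static final String SAMPLE = "a" + new String(Character.toChars(0x1F600));

  // Returns true if cutting s just before index would separate a
  // high surrogate from the low surrogate that follows it.
  static boolean splitsSurrogatePair(String s, int index) {
    return index > 0 && index < s.length()
        && Character.isHighSurrogate(s.charAt(index - 1))
        && Character.isLowSurrogate(s.charAt(index));
  }

  public static void main(String[] args) {
    // Number of char elements versus number of characters (code points).
    System.out.println(SAMPLE.length());
    System.out.println(SAMPLE.codePointCount(0, SAMPLE.length()));
    // A cut at index 2 falls inside the surrogate pair.
    System.out.println(splitsSurrogatePair(SAMPLE, 2));
  }
}
```

A caller that must split a string could check each candidate index with such a helper and move the cut by one position when it returns true.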
Multibyte Characters
Multibyte encodings such as UTF-8 are used for character sets that require more than one byte to uniquely identify each constituent character. For example, the Japanese encoding Shift-JIS (shown below) supports multibyte encoding in which the maximum character length is two bytes (one leading and one trailing byte).
Byte Type | Range |
---|---|
single-byte | |
lead-byte | |
trailing-byte | |
The trailing byte ranges overlap the ranges of both the single-byte and lead-byte characters. When a multibyte character is separated across a buffer boundary, it can be interpreted differently than if it were not separated across the buffer boundary; this difference arises because of the ambiguity of its composing bytes [Phillips 2005].
...
This noncompliant code example reads bytes from a socket's `InputStream` into a 1024-byte buffer and appends each chunk to a `String`.
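```java
import java.io.IOException;
import java.io.InputStream;
import java.net.Socket;

public String readBytes(Socket socket) throws IOException {
  InputStream in = socket.getInputStream();
  String str = "";
  byte[] data = new byte[1024];
  while (in.read(data) > -1) {
    // Noncompliant: each chunk is decoded on its own, so a multibyte
    // character that straddles two reads is decoded incorrectly.
    str += new String(data, "UTF-8");
  }
  in.close();
  return str;
}
```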
This code fails to consider the interaction between characters represented with a multibyte encoding and the boundaries between the loop iterations. If the 1024th byte read from the data stream in one `read()` operation is the leading byte of a multibyte character, the trailing bytes are not encountered until the next iteration of the `while` loop. However, the multibyte encoding is resolved during construction of the new `String` within the loop. Consequently, the multibyte character is interpreted incorrectly.
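The effect can be reproduced without a socket. The following sketch is an illustration under stated assumptions: the class `ChunkedDecodeDemo`, the helper `decodePerChunk`, and the one-byte chunk size are hypothetical, chosen only to force a boundary inside a two-byte UTF-8 character. It decodes each chunk separately, as the loop above does, and compares the result with decoding the whole array once.

```java
import java.nio.charset.StandardCharsets;

final class ChunkedDecodeDemo {
  // Decodes every chunk independently and concatenates the results,
  // mirroring the per-iteration String construction described above.
  static String decodePerChunk(byte[] bytes, int chunkSize) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < bytes.length; i += chunkSize) {
      int len = Math.min(chunkSize, bytes.length - i);
      sb.append(new String(bytes, i, len, StandardCharsets.UTF_8));
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    String original = "\u00e9"; // e with acute accent: bytes 0xC3 0xA9 in UTF-8
    byte[] bytes = original.getBytes(StandardCharsets.UTF_8);

    // Decoding the complete array recovers the character.
    String whole = new String(bytes, StandardCharsets.UTF_8);
    // Decoding one byte at a time splits the character in two.
    String chunked = decodePerChunk(bytes, 1);

    System.out.println(whole.equals(original));
    System.out.println(chunked.equals(original));
  }
}
```

When the chunk boundary falls inside the character, each fragment is malformed on its own, and the `String` constructor substitutes replacement characters instead of reporting an error.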
Compliant Solution (`byte`)
This compliant solution reads all the desired bytes into its buffer and does not create a string until all the data is available. It assumes that the string to be read is 4096 bytes or less.
```java
import java.io.IOException;
import java.io.InputStream;
import java.net.Socket;

public String readBytes(Socket socket) throws IOException {
  InputStream in = socket.getInputStream();
  int offset = 0;
  int bytesRead = 0;
  byte[] data = new byte[4096];
  // Fill the buffer until it is full or the stream ends.
  while (offset < data.length) {
    bytesRead = in.read(data, offset, data.length - offset);
    if (bytesRead == -1) {
      break; // end of stream
    }
    offset += bytesRead;
  }
  in.close();
  // Decode once, using only the bytes that were actually read.
  return new String(data, 0, offset, "UTF-8");
}
```
This code avoids splitting multibyte encoded characters across buffers by deferring construction of the result string until the data has been read in full. It also facilitates portability across systems that use different default character encodings by specifying an explicit character encoding for the `String` constructor.
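Where the 4096-byte limit is too restrictive, the same idea of deferring decoding can be sketched with a growable buffer. This variant is an assumption added for illustration and is not one of the guideline's solutions; the class `ReadAllThenDecode` and the method `readAll` are hypothetical names. It collects every chunk in a `ByteArrayOutputStream` and decodes only after the stream ends.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

final class ReadAllThenDecode {
  // Accumulates raw bytes first and decodes them in a single step,
  // so no multibyte character is decoded across a chunk boundary.
  static String readAll(InputStream in) throws IOException {
    ByteArrayOutputStream collected = new ByteArrayOutputStream();
    byte[] chunk = new byte[1024];
    int n;
    while ((n = in.read(chunk)) != -1) {
      collected.write(chunk, 0, n); // copy only the bytes actually read
    }
    return new String(collected.toByteArray(), StandardCharsets.UTF_8);
  }

  public static void main(String[] args) throws IOException {
    // One ASCII byte followed by two-byte characters, so that a
    // 1024-byte chunk boundary falls inside a character.
    StringBuilder sb = new StringBuilder("a");
    for (int i = 0; i < 2000; i++) {
      sb.append('\u00e9');
    }
    String original = sb.toString();
    byte[] encoded = original.getBytes(StandardCharsets.UTF_8);
    String decoded = readAll(new ByteArrayInputStream(encoded));
    System.out.println(decoded.equals(original));
  }
}
```

This sketch trades the fixed size assumption for unbounded memory use, so a real caller would likely still want to cap the total number of bytes accepted from an untrusted peer.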
The size of the `data` byte buffer depends on the maximum number of bytes required to write an encoded character and on the number of characters. For example, UTF-8 encoded data requires four bytes to represent any character above `U+FFFF`. Because Java uses the UTF-16 character encoding to represent `char` data, such characters are split into two separate `char` values of two bytes each. Consequently, the buffer size should be four times the maximum number of characters.
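The arithmetic can be checked with a short sketch. The class `BufferSizing`, the helper `maxUtf8Bytes`, and the choice of `U+10400` as the sample character are assumptions made for illustration only.

```java
import java.nio.charset.StandardCharsets;

final class BufferSizing {
  // Worst case described above: four UTF-8 bytes per character.
  // multiplyExact throws ArithmeticException instead of overflowing.
  static int maxUtf8Bytes(int maxCharacters) {
    return Math.multiplyExact(4, maxCharacters);
  }

  public static void main(String[] args) {
    // A single character above U+FFFF.
    String supplementary = new String(Character.toChars(0x10400));

    // Bytes needed in UTF-8 and char values needed in UTF-16.
    System.out.println(supplementary.getBytes(StandardCharsets.UTF_8).length);
    System.out.println(supplementary.length());

    // Buffer size for at most 1024 characters.
    System.out.println(maxUtf8Bytes(1024));
  }
}
```

Under this rule, a 4096-byte buffer is only guaranteed to hold 1024 characters when every one of them needs the full four bytes.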
Compliant Solution (`byte`, `readFully()`)
...
[API 2006] | Classes |
[Hornig 2007] | Problem areas: Characters |
...