Legacy software frequently assumes that every character in a string occupies 8 bits (a Java `byte`). The Java language assumes that every character in a string occupies 16 bits (a Java `char`). Unfortunately, neither the Java `byte` nor the Java `char` data type can represent all possible Unicode characters. Many strings are stored or communicated using an encoding such as `UTF-8` that allows characters to have varying sizes.
While Java strings are stored as an array of characters of type `char` and can be represented as an array of bytes, a single character in the string might be represented by two or more consecutive elements of type `byte` or of type `char`. Splitting a `char` or `byte` array therefore risks splitting a multibyte character.
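The risk described above can be illustrated with a minimal sketch. The class name `SplitDemo` and the helper `decodePrefix` are hypothetical names chosen for this illustration; the sketch assumes the default behavior of the `String` constructors, which substitute the replacement character U+FFFD for malformed input.

```java
import java.nio.charset.StandardCharsets;

final class SplitDemo {
    // Hypothetical helper: decodes only the first n bytes of the UTF-8 encoding of s.
    static String decodePrefix(String s, int n) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, 0, n, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // 'a' takes one byte in UTF-8; U+00E9 (e with acute accent) takes two.
        String s = "a\u00e9";

        // Cutting the byte array after two bytes leaves half of the second character.
        String broken = decodePrefix(s, 2);
        System.out.println(broken.equals(s));               // false
        System.out.println(broken.indexOf('\uFFFD') >= 0);  // true: replacement character

        // The same hazard exists for char arrays: U+1F600 needs two char values.
        String supplementary = new String(Character.toChars(0x1F600));
        String half = supplementary.substring(0, 1);
        System.out.println(Character.isHighSurrogate(half.charAt(0))); // true: unpaired surrogate
    }
}
```

In both cases the fragment no longer represents the original character, and the damage cannot be repaired by decoding the remaining fragment separately.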
...
Code Block | ||
---|---|---|
| ||
public String readBytes(Socket socket) throws IOException { InputStream in = socket.getInputStream(); int offset = 0; int bytesRead = 0; byte[] data = new byte[4096]; while (true) { bytesRead = in.read(data, offset, data.length - offset); if (bytesRead == -1) { break; } offset += bytesRead; if (offset >= data.length) { break; } } in.close(); String str = new String(data, 0, offset, "UTF-8"); return str; } |
This code avoids splitting multibyte-encoded characters across buffers by deferring construction of the result string until the data has been read in full. If the stream holds more than 4096 bytes, reading stops when the buffer is full, so the code does assume that the 4096th byte in the stream is not in the middle of a multibyte character.
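The deferred-decoding idea can be exercised without a socket. The following is a minimal sketch, not the rule's reference implementation: the class `DeferredDecode`, the method `readUtf8`, and the test stream `oneByteAtATime` are hypothetical names, and the stream deliberately returns a single byte per read so that every multibyte character arrives in pieces.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

final class DeferredDecode {
    // Reads at most maxBytes from in, then decodes the bytes actually read in one step.
    static String readUtf8(InputStream in, int maxBytes) throws IOException {
        byte[] data = new byte[maxBytes];
        int offset = 0;
        while (offset < data.length) {
            int n = in.read(data, offset, data.length - offset);
            if (n == -1) {
                break; // end of stream
            }
            offset += n;
        }
        // Decode only the filled part of the buffer, never the unused tail.
        return new String(data, 0, offset, StandardCharsets.UTF_8);
    }

    // Test stream that hands out at most one byte per read call.
    static InputStream oneByteAtATime(byte[] bytes) {
        return new ByteArrayInputStream(bytes) {
            @Override
            public synchronized int read(byte[] b, int off, int len) {
                return super.read(b, off, Math.min(len, 1));
            }
        };
    }

    public static void main(String[] args) throws IOException {
        String original = "na\u00efve \u20ac"; // contains a 2-byte and a 3-byte character
        byte[] encoded = original.getBytes(StandardCharsets.UTF_8);
        String decoded = readUtf8(oneByteAtATime(encoded), 4096);
        System.out.println(decoded.equals(original)); // true
    }
}
```

Because the bytes are accumulated first and decoded once, it does not matter where the individual reads happen to end.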
The size of the `data` byte buffer depends on the maximum number of bytes required to write an encoded character and on the number of characters. For example, UTF-8 encoded data requires four bytes to represent any character above `U+FFFF`. Because Java uses the UTF-16 character encoding to represent `char` data, such characters are split into two separate `char` values of two bytes each. Consequently, the buffer size should be four times the maximum number of characters.
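The sizing arithmetic can be checked with a short sketch. The class `BufferSizing` and the method `maxUtf8Bytes` are hypothetical names, and the sketch assumes that "number of characters" means the number of Unicode code points.

```java
import java.nio.charset.StandardCharsets;

final class BufferSizing {
    // Worst-case UTF-8 size in bytes, assuming the argument counts Unicode code points.
    static int maxUtf8Bytes(int maxCodePoints) {
        return Math.multiplyExact(maxCodePoints, 4); // throws on int overflow
    }

    public static void main(String[] args) {
        // U+1F600 lies above U+FFFF, so it is one code point but two char values.
        String s = new String(Character.toChars(0x1F600));
        System.out.println(s.length());                                // 2
        System.out.println(s.codePointCount(0, s.length()));           // 1
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length); // 4

        System.out.println(maxUtf8Bytes(1024)); // 4096
    }
}
```

Under this assumption, a 4096-byte buffer is guaranteed to hold 1024 characters, whatever their individual encoded sizes.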
...
| [API 2006] | Classes |
| [Hornig 2007] | Problem areas: Characters |
...