Legacy software frequently assumes that every character in a string occupies 8 bits (a Java byte). The Java language assumes that every character in a string occupies 16 bits (a Java char). Unfortunately, neither the Java byte nor the Java char data type can represent all possible Unicode characters. Many strings are stored or communicated using an encoding such as UTF-8 that allows characters to have varying sizes.

While Java strings are stored as an array of characters and can be represented as an array of bytes, a single character in the string might be represented by two or more consecutive elements of type byte or of type char. Splitting a char or byte array risks splitting a multibyte character.
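The risk can be illustrated with a character above U+FFFF, which Java stores as two char values (a surrogate pair). The following is a minimal sketch, not part of the original rule; the class name SurrogateSplit and the helper safeSplitIndex are hypothetical names chosen for illustration.

```java
class SurrogateSplit {
  // Hypothetical helper: returns index, or index - 1 when cutting at index
  // would separate a high surrogate from the low surrogate that follows it.
  static int safeSplitIndex(String s, int index) {
    if (index > 0 && index < s.length()
        && Character.isHighSurrogate(s.charAt(index - 1))
        && Character.isLowSurrogate(s.charAt(index))) {
      return index - 1;
    }
    return index;
  }

  public static void main(String[] args) {
    // U+1D11E (MUSICAL SYMBOL G CLEF) lies above U+FFFF, so it needs two chars
    String clef = new String(Character.toChars(0x1D11E));
    String s = "a" + clef + "b";
    System.out.println(s.length());                       // number of char values
    System.out.println(s.codePointCount(0, s.length()));  // number of Unicode characters
    System.out.println(safeSplitIndex(s, 2));             // index 2 falls inside the pair
  }
}
```

Calling s.substring(0, 2) on this string would keep the high surrogate and drop its partner; backing the split point up by one char keeps the pair together.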

...

Code Block
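import java.io.IOException;
import java.io.InputStream;
import java.net.Socket;

public String readBytes(Socket socket) throws IOException {
  InputStream in = socket.getInputStream();
  int offset = 0;     // total number of bytes stored in data so far
  int bytesRead = 0;  // number of bytes returned by the most recent read
  byte[] data = new byte[4096];
  while (true) {
    bytesRead = in.read(data, offset, data.length - offset);
    if (bytesRead == -1) {  // end of stream
      break;
    }
    offset += bytesRead;
    if (offset >= data.length) {  // buffer is full
      break;
    }
  }
  in.close();
  // Decode only the bytes actually read, not the unused tail of the buffer
  String str = new String(data, 0, offset, "UTF-8");
  return str;
}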

This code avoids splitting multibyte encoded characters across buffers by deferring construction of the result string until the data has been read in full. It does, however, assume that the input is not cut off at the 4096-byte buffer limit in the middle of a multibyte character.
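One way to avoid depending on where the buffer boundary falls, sketched here as an alternative technique rather than as the rule's own solution, is to let an InputStreamReader decode the stream incrementally, since the reader carries an incomplete byte sequence over to the next read. The class name IncrementalDecode and the helpers readAll and decode are hypothetical, and the sketch takes a generic InputStream rather than a Socket so that it can be exercised with in-memory data.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class IncrementalDecode {
  // Hypothetical helper: decodes a UTF-8 stream of any length.
  // The reader keeps any incomplete byte sequence until the rest arrives.
  static String readAll(InputStream in) throws IOException {
    StringBuilder sb = new StringBuilder();
    try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
      char[] chars = new char[1024];
      int n;
      while ((n = reader.read(chars, 0, chars.length)) != -1) {
        sb.append(chars, 0, n);
      }
    }
    return sb.toString();
  }

  // Convenience wrapper for in-memory data; rethrows IOException unchecked.
  static String decode(byte[] utf8) {
    try {
      return readAll(new ByteArrayInputStream(utf8));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

With a socket, the same helper could be called as readAll(socket.getInputStream()). Unlike the fixed buffer above, this sketch places no upper bound on the amount of input, so a caller that needs a limit would have to enforce one separately.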

The size of the data byte buffer depends on the maximum number of bytes required to write an encoded character and the number of characters. For example, UTF-8 encoded data requires four bytes to represent any character above U+FFFF. Because Java uses the UTF-16 character encoding to represent char data, such characters are split into two separate char values (a surrogate pair) of two bytes each. Consequently, the buffer size should be four times the maximum number of characters.
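The sizing arithmetic can be checked with a short sketch. The class name Utf8BufferSize and the helpers worstCaseBytes and utf8Length are hypothetical names introduced for illustration; "characters" is taken here to mean Unicode code points, not Java char values.

```java
import java.nio.charset.StandardCharsets;

class Utf8BufferSize {
  // Largest number of bytes UTF-8 uses for a single Unicode code point
  static final int MAX_UTF8_BYTES_PER_CODE_POINT = 4;

  // Hypothetical helper: worst-case byte buffer size for maxCodePoints characters.
  // multiplyExact throws ArithmeticException instead of silently overflowing.
  static int worstCaseBytes(int maxCodePoints) {
    return Math.multiplyExact(maxCodePoints, MAX_UTF8_BYTES_PER_CODE_POINT);
  }

  // Number of bytes the string actually occupies when encoded as UTF-8
  static int utf8Length(String s) {
    return s.getBytes(StandardCharsets.UTF_8).length;
  }

  public static void main(String[] args) {
    // Three copies of U+1D11E: three code points, six chars
    String clefs = new String(new int[] {0x1D11E, 0x1D11E, 0x1D11E}, 0, 3);
    System.out.println(clefs.length());       // char values used by Java
    System.out.println(utf8Length(clefs));    // bytes used by UTF-8
    System.out.println(worstCaseBytes(3));    // buffer size for three characters
  }
}
```

Under this sizing, a 4096-byte buffer is guaranteed to hold 1024 characters, and more when the input consists mostly of shorter sequences such as ASCII.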

...

[API 2006] Classes Character and BreakIterator

[Hornig 2007] Problem areas: Characters

...