Character information in Java SE 8 is based on the Unicode Standard, version 6.2.0 [Unicode 2012]. A Unicode transformation format (UTF) is an algorithmic mapping from every Unicode code point (except surrogate code points) to a unique byte sequence. The Java platform uses the UTF-16 encoding in char arrays and in the String, StringBuilder, and StringBuffer classes. However, Java programs must often process character data in various character encodings. The java.io.InputStreamReader, java.io.OutputStreamWriter, and java.lang.String classes, and classes in the java.nio.charset package, can convert between UTF-16 and a number of other character encodings. The supported encodings vary among different implementations of the Java platform. The class description for java.nio.charset.Charset lists the encodings that every implementation of the Java platform is required to support. These include US-ASCII, ISO-8859-1, and UTF-8.
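For example, the following fragment (a minimal sketch added for illustration, not part of the original rule) round-trips a string through two of the required encodings using the java.nio.charset.StandardCharsets constants:

```java
import java.nio.charset.StandardCharsets;

String original = "caf\u00E9";  // "café"; stored internally as UTF-16
byte[] latin1 = original.getBytes(StandardCharsets.ISO_8859_1);  // 4 bytes
byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);         // 5 bytes: é needs 2
String decoded = new String(utf8, StandardCharsets.UTF_8);
System.out.println(decoded.equals(original));                    // true
```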
UTF-8 is an example of a variable-width encoding for Unicode. UTF-8 uses 1 to 4 bytes per character, depending on the code point. All 2²¹ possible Unicode code points can be encoded using UTF-8. The following table lists the well-formed UTF-8 byte sequences.
Bits of code point | First code point | Last code point | Bytes in sequence | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
---|---|---|---|---|---|---|---|
7 | U+0000 | U+007F | 1 | 0xxxxxxx | | | |
11 | U+0080 | U+07FF | 2 | 110xxxxx | 10xxxxxx | | |
16 | U+0800 | U+FFFF | 3 | 1110xxxx | 10xxxxxx | 10xxxxxx | |
21 | U+10000 | U+10FFFF | 4 | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |
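As an illustration (a short sketch added here, not part of the original rule), the following fragment encodes one code point from each row of the table and prints the number of UTF-8 bytes produced:

```java
import java.nio.charset.StandardCharsets;

System.out.println("A".getBytes(StandardCharsets.UTF_8).length);       // 1 byte  (U+0041)
System.out.println("\u00E9".getBytes(StandardCharsets.UTF_8).length);  // 2 bytes (U+00E9)
System.out.println("\u20AC".getBytes(StandardCharsets.UTF_8).length);  // 3 bytes (U+20AC)
System.out.println(new String(Character.toChars(0x1F600))
                       .getBytes(StandardCharsets.UTF_8).length);      // 4 bytes (U+1F600)
```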
UTF-8 is self-synchronizing: character boundaries are easily identified by scanning for well-defined bit patterns in either direction. If bytes are lost to error or corruption, the beginning of the next valid character can be located and processing resumed. Many variable-width encodings are harder to resynchronize. In some older variable-width encodings (such as Shift JIS), the end byte of one character and the first byte of the next character could look like another valid character [Phillips 2005].
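For instance, a resynchronization routine can back up from an arbitrary byte offset to the nearest character boundary simply by skipping continuation bytes. The helper below is a hypothetical sketch added for illustration, not part of the original rule:

```java
// Back up from index i to the start of the UTF-8 character containing it.
// Continuation bytes always match the bit pattern 10xxxxxx (0x80-0xBF).
static int startOfCharacter(byte[] buf, int i) {
  while (i > 0 && (buf[i] & 0xC0) == 0x80) {
    i--;  // skip continuation bytes until a lead byte (or single-byte character) is found
  }
  return i;
}
```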
Similar to UTF-8, UTF-16 is a variable-width encoding. Unicode code points between U+10000 and U+10FFFF are called supplementary code points, and Unicode-encoded characters having a supplementary code point are called supplementary characters. UTF-16 uses a single 16-bit code unit to encode the most common 63K characters and a pair of 16-bit code units, called surrogates, to encode supplementary characters. The first code unit of the pair is taken from the high-surrogates range (U+D800-U+DBFF), and the second is taken from the low-surrogates range (U+DC00-U+DFFF). Because the UTF-16 code unit ranges for high surrogates, low surrogates, and single units are all completely disjoint, there are no false matches, the location of a character boundary can be determined directly from each code unit value, and a dropped surrogate corrupts only a single character.
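The surrogate-pair behavior can be observed directly with the Character and String APIs; the following fragment is a sketch added for illustration:

```java
// U+1F600 (a supplementary character) requires a surrogate pair in UTF-16.
char[] units = Character.toChars(0x1F600);
System.out.println(units.length);                         // 2
System.out.println(Character.isHighSurrogate(units[0]));  // true (U+D83D)
System.out.println(Character.isLowSurrogate(units[1]));   // true (U+DE00)

String s = new String(units);
System.out.println(s.length());               // 2 code units
System.out.println(s.codePointCount(0, 2));   // 1 code point
```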
Programmers must not form strings containing partial characters, for example, when converting variable-width encoded character data to strings.
Noncompliant Code Example (Read)
This noncompliant code example attempts to read up to 1024 bytes from a socket and build a String from the data. It does so by reading the bytes in a while loop, as recommended by FIO10-J. Ensure the array is filled when using read() to fill an array. If the socket supplies more than 1024 bytes, the method throws an exception, which prevents untrusted input from potentially exhausting the program's memory.
```java
public final int MAX_SIZE = 1024;

public String readBytes(Socket socket) throws IOException {
  InputStream in = socket.getInputStream();
  byte[] data = new byte[MAX_SIZE + 1];
  int offset = 0;
  int bytesRead = 0;
  String str = new String();
  while ((bytesRead = in.read(data, offset, data.length - offset)) != -1) {
    str += new String(data, offset, bytesRead, "UTF-8");
    offset += bytesRead;
    if (offset >= data.length) {
      throw new IOException("Too much input");
    }
  }
  in.close();
  return str;
}
```
This code fails to account for the interaction between variable-width character encodings and the boundaries between loop iterations. If the last byte read from the data stream in one read() operation is the leading byte of a multibyte character, the trailing bytes are not encountered until the next iteration of the while loop. However, the variable-width encoding is decoded during construction of the new String inside the loop. Consequently, the variable-width encoding can be interpreted incorrectly. A similar problem can occur when constructing strings from UTF-16 data if the surrogate pair for a supplementary character is separated.
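The effect can be reproduced in isolation. The fragment below, a sketch added for illustration and not part of the original example, splits the 3-byte UTF-8 encoding of the euro sign across two String constructions, just as the loop above would if a read() boundary fell inside the character:

```java
import java.nio.charset.StandardCharsets;

byte[] euro = "\u20AC".getBytes(StandardCharsets.UTF_8);  // 3 bytes: E2 82 AC
// Decode the first byte and the remaining two bytes separately,
// as if they arrived in different read() calls.
String part1 = new String(euro, 0, 1, StandardCharsets.UTF_8);
String part2 = new String(euro, 1, 2, StandardCharsets.UTF_8);
String result = part1 + part2;
System.out.println(result.equals("\u20AC"));    // false
System.out.println(result.contains("\uFFFD"));  // true: replacement characters
```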
Compliant Solution (Read)
This compliant solution defers creation of the string until all the data is available:
```java
public final int MAX_SIZE = 1024;

public String readBytes(Socket socket) throws IOException {
  InputStream in = socket.getInputStream();
  byte[] data = new byte[MAX_SIZE + 1];
  int offset = 0;
  int bytesRead = 0;
  while ((bytesRead = in.read(data, offset, data.length - offset)) != -1) {
    offset += bytesRead;
    if (offset >= data.length) {
      throw new IOException("Too much input");
    }
  }
  String str = new String(data, 0, offset, "UTF-8");
  in.close();
  return str;
}
```
This code avoids splitting multibyte encoded characters across buffers by deferring construction of the result string until the data has been read in full.
Compliant Solution (Reader)
This compliant solution uses a Reader rather than an InputStream. The Reader class converts bytes into characters on the fly, so it avoids the hazard of splitting variable-width characters. This routine aborts if the socket provides more than 1024 characters, rather than 1024 bytes.
```java
public final int MAX_SIZE = 1024;

public String readBytes(Socket socket) throws IOException {
  InputStream in = socket.getInputStream();
  Reader r = new InputStreamReader(in, "UTF-8");
  char[] data = new char[MAX_SIZE + 1];
  int offset = 0;
  int charsRead = 0;
  String str = new String();
  while ((charsRead = r.read(data, offset, data.length - offset)) != -1) {
    str += new String(data, offset, charsRead);
    offset += charsRead;
    if (offset >= data.length) {
      throw new IOException("Too much input");
    }
  }
  in.close();
  return str;
}
```
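The following self-contained sketch is added for illustration; the OneBytePerRead wrapper is a hypothetical helper, not part of the original solution. It shows that an InputStreamReader buffers incomplete byte sequences internally, so even a stream that delivers one byte per read decodes multibyte characters correctly:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class ReaderDemo {
  // Delivers at most one byte per read() call, mimicking a slow socket.
  static class OneBytePerRead extends InputStream {
    private final InputStream in;
    OneBytePerRead(InputStream in) { this.in = in; }
    @Override public int read() throws IOException { return in.read(); }
    @Override public int read(byte[] b, int off, int len) throws IOException {
      if (len == 0) { return 0; }
      int x = in.read();
      if (x == -1) { return -1; }
      b[off] = (byte) x;
      return 1;
    }
  }

  public static void main(String[] args) throws IOException {
    byte[] utf8 = "\u20AC\uD83D\uDE00".getBytes(StandardCharsets.UTF_8);  // € and U+1F600
    Reader r = new InputStreamReader(
        new OneBytePerRead(new ByteArrayInputStream(utf8)), StandardCharsets.UTF_8);
    StringBuilder sb = new StringBuilder();
    int c;
    while ((c = r.read()) != -1) {
      sb.append((char) c);  // surrogate pairs arrive as two consecutive chars
    }
    System.out.println(sb.toString().equals("\u20AC\uD83D\uDE00"));  // true
  }
}
```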
Risk Assessment
Forming strings from character data containing partial characters can result in data corruption.
Rule | Severity | Likelihood | Remediation Cost | Priority | Level |
---|---|---|---|---|---|
STR00-J | Low | Unlikely | Medium | P2 | L3 |
Automated Detection
Tool | Version | Checker | Description |
---|---|---|---|
Parasoft Jtest | 2024.1 | CERT.STR00.COS | Do not use String concatenation in an Internationalized environment |
Bibliography
[API 2014] | Classes |
[Phillips 2005] | |
[Seacord 2015] | |
[Unicode 2012] |
11 Comments
Robert Seacord
I've changed the description in the front from Shift JIS to UTF-8 for a variety of reasons, but mainly because it is used in the examples. I removed the following statement that applies to Shift JIS but not UTF-8 and therefore did not apply to the examples:
The trailing byte ranges overlap the range of both the single-byte and lead-byte characters. When a multibyte character is separated across a buffer boundary, it can be interpreted differently than if it were not separated across the buffer boundary; this difference arises because of the ambiguity of its composing bytes [Phillips 2005].
Here is a description of some of the differences between these encodings:
Robert Seacord
There are a number of ways to refer to encodings like UTF-8 and Shift JIS including: multibyte, variable-width, variable-length, and byte encodings.
I've gone here with "variable-width". I don't like "multibyte" because it applies to an encoding like UTF-32 where each character uses four bytes.
I went with width over length just because of wide characters in C and other stuff I'm used to, I suppose. The term "variable-width" is also used on this page: http://www.unicode.org/faq/utf_bom.html
David Svoboda
A better table for illustrating how UTF-8 is encoded in 4 bytes
https://en.wikipedia.org/wiki/UTF-8#Description
This rule smells like a FIO rule right now, prob b/c of the examples.
I think this rule and STR03-J could be merged to make a stronger rule "don't build strings from byte arrays that are not designed to hold a complete String"
Fred Long
I think that STR is definitely the correct home for this rule.
Concerning your last point, I think that you mean STR01-J. There are certainly similarities, and the example here is almost the same as the first example in STR01-J. However, combining the two might make an overly complicated rule. Perhaps it would be better to insert a reference to STR01-J in this rule. (There is already a reference to this rule in STR01-J.)
Robert Seacord
Yes, I think I would like to keep them separate. The biggest difference to me is that for STR01-J you also need to be concerned about operations using the various string types because these are all UTF-16 encoded.
Jussi Auvinen
There seem to be 2 additional errors in the non-compliant code fragment.
str += new String(data, offset, bytesRead, "UTF-8");
David Svoboda
Thanks, I've fixed all the code samples to read text properly. The NCCE correctly reads text if it does not break up a variable-width character.
Sarah Olsen
The priority and level scores for STR00-J on the parent page (P12, L1) disagree with those on this page (P2, L3). Which is correct?
David Svoboda
The scores on this page (P2, L3) are correct; I've fixed the parent page.
Sarah Olsen
That was fast! Thanks!
Ehely Gergely
The last line of that UTF-8 table doesn't seem right.