Converting String objects between different character encodings, or to byte arrays, may result in loss of data.
String objects in Java are encoded in UTF-16. The Java Platform is required to support other character encodings, or charsets, such as US-ASCII, ISO-8859-1, and UTF-8. Errors may occur when converting between differently coded character data. There are two general types of encoding errors. If the byte sequence is not valid for the specified charset, then the input is considered malformed. If the byte sequence cannot be mapped to an equivalent character sequence, then an unmappable character has been encountered.
According to the Java API [API 2014] for the String constructors:
The behavior of this constructor when the given bytes are not valid in the given charset is unspecified.
Similarly, the String.getBytes(Charset) method documentation states:
This method always replaces malformed-input and unmappable-character sequences with this charset's default replacement byte array.
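The replacement behavior quoted above can be observed directly. The following is a minimal sketch, not part of the original rule: the class name `SilentReplacementDemo` and the helper `encodeLossy` are hypothetical, and US-ASCII is chosen only because its replacement byte (`?`) is easy to recognize.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class: shows that String.getBytes(Charset) does not report errors.
class SilentReplacementDemo {

  // Encodes with String.getBytes(Charset); unmappable characters are replaced, not reported.
  static byte[] encodeLossy(String text) {
    return text.getBytes(StandardCharsets.US_ASCII);
  }

  public static void main(String[] args) {
    String original = "caf\u00e9"; // U+00E9 has no mapping in US-ASCII
    byte[] bytes = encodeLossy(original);
    String roundTripped = new String(bytes, StandardCharsets.US_ASCII);

    System.out.println(Arrays.toString(bytes));        // [99, 97, 102, 63]
    System.out.println(original.equals(roundTripped)); // false
  }
}
```

No exception is thrown, so the caller has no indication that the last character was replaced by `?`.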
When a String must be converted to bytes, for example, for writing to a file, and the string might contain sequences of unmappable characters, proper character encoding must be performed.
...
The CharsetEncoder class is used to transform character data into a sequence of bytes in a specific charset. The input character sequence is provided in a character buffer or a series of such buffers. The output byte sequence is written to a byte buffer or a series of such buffers. The CharsetDecoder class reverses this process by transforming a sequence of bytes in a specific charset into character data. The input byte sequence is provided in a byte buffer or a series of such buffers, while the output character sequence is written to a character buffer or a series of such buffers.
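The buffer-based encoding and decoding described above might look like the following. This is a sketch under stated assumptions: the class `StrictCodec` and its methods are hypothetical names, and the error actions are set to `CodingErrorAction.REPORT` explicitly (this is already the default for a newly created encoder or decoder) so that both kinds of encoding error surface as exceptions.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.MalformedInputException;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnmappableCharacterException;

// Hypothetical helper: strict conversions that fail instead of substituting data.
class StrictCodec {

  // Character buffer in, byte buffer out; throws if a character cannot be encoded.
  static byte[] encode(String text, Charset charset) throws CharacterCodingException {
    ByteBuffer out = charset.newEncoder()
        .onMalformedInput(CodingErrorAction.REPORT)
        .onUnmappableCharacter(CodingErrorAction.REPORT)
        .encode(CharBuffer.wrap(text));
    byte[] bytes = new byte[out.remaining()];
    out.get(bytes);
    return bytes;
  }

  // Byte buffer in, character buffer out; throws if the bytes are not valid for the charset.
  static String decode(byte[] bytes, Charset charset) throws CharacterCodingException {
    return charset.newDecoder()
        .onMalformedInput(CodingErrorAction.REPORT)
        .onUnmappableCharacter(CodingErrorAction.REPORT)
        .decode(ByteBuffer.wrap(bytes))
        .toString();
  }

  public static void main(String[] args) throws CharacterCodingException {
    byte[] utf8 = encode("caf\u00e9", StandardCharsets.UTF_8);
    System.out.println(decode(utf8, StandardCharsets.UTF_8).equals("caf\u00e9")); // true

    try {
      encode("caf\u00e9", StandardCharsets.US_ASCII); // U+00E9 is unmappable in US-ASCII
    } catch (UnmappableCharacterException e) {
      System.out.println("unmappable character reported");
    }

    try {
      decode(new byte[] {(byte) 0x84}, StandardCharsets.UTF_8); // stray continuation byte
    } catch (MalformedInputException e) {
      System.out.println("malformed input reported");
    }
  }
}
```

Creating a new encoder or decoder for each call keeps the helper simple; encoders and decoders are stateful and are not safe for concurrent use.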
Special care should be taken when decoding untrusted byte data to ensure that malformed input or unmappable character errors do not result in defects and vulnerabilities. Encoding errors can also occur; for example, encoding a cryptographic key containing malformed input for transmission will result in an error. Encoding and decoding errors typically result in data corruption.
Noncompliant Code Example
This noncompliant code example is similar to the one used in STR03-J. Do not represent numeric data as strings, in that it attempts to convert a byte array containing the two's-complement representation of a BigInteger value to a String. Because some of the bytes do not denote valid characters, the byte array contains malformed-input sequences, and the behavior of the String constructor is unspecified. The resulting String representation can lose information, so converting it back to a BigInteger can produce a different value.
Code Block | ||
---|---|---|
| ||
import java.math.BigInteger;
import java.nio.CharBuffer;
public class CharsetConversion {
  public static void main(String[] args) {
    BigInteger x = new BigInteger("530500452766");
    byte[] byteArray = x.toByteArray();
    // Decodes with the platform default charset; behavior on malformed input is unspecified
    String s = new String(byteArray);
    System.out.println(s);
  }
} |
Compliant Solution
The java.nio.charset.CharsetEncoder and java.nio.charset.CharsetDecoder classes provide greater control over the process. In this compliant solution, the CharsetDecoder.decode() method is used to convert the byte array containing the two's-complement representation of this BigInteger value to a CharBuffer. Because the bytes do not represent valid UTF-16, the input is considered malformed, and a MalformedInputException is thrown.
Code Block | ||
---|---|---|
| ||
import java.math.BigInteger;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.MalformedInputException;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnmappableCharacterException;
public class CharsetConversion {
  public static void main(String[] args) {
    CharBuffer charBuffer;
    // A newly created decoder reports malformed input and unmappable characters
    CharsetDecoder decoder = StandardCharsets.UTF_16.newDecoder();
    BigInteger x = new BigInteger("530500452766");
    byte[] byteArray = x.toByteArray();
    ByteBuffer byteBuffer = ByteBuffer.wrap(byteArray);
    try {
      // Throws MalformedInputException because the bytes are not valid UTF-16
      charBuffer = decoder.decode(byteBuffer);
      String s = charBuffer.toString();
      System.out.println(s);
    } catch (IllegalStateException e) {
      e.printStackTrace();
    } catch (MalformedInputException e) {
      e.printStackTrace();
    } catch (UnmappableCharacterException e) {
      e.printStackTrace();
    } catch (CharacterCodingException e) {
      e.printStackTrace();
    }
  }
} |
Risk Assessment
Malformed input or unmappable character errors can result in a loss of data integrity.
When this program was run on a Linux platform where the default character encoding is US-ASCII, the string s got the value {?J??, because some of the characters were unprintable. When converted back to a BigInteger, x got the value 149830058370101340468658109.
Compliant Solution
This compliant solution first produces a String representation of the BigInteger object and then converts the String object to a byte array. This process is reversed on input. Because the textual representation in the String object was generated by the BigInteger class, it contains valid characters.
Do not try to convert the String object to a byte array to obtain the original BigInteger. Character-encoded data may yield a byte array that, when converted to a BigInteger, results in a completely different value.
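The round trip described above can be sketched as follows. This is an illustrative sketch rather than the rule's original listing: the class `BigIntegerText` and its methods are hypothetical, and fixing the charset to US-ASCII is an assumption made here so that the result does not depend on the platform default (the decimal digits and minus sign produced by `BigInteger.toString()` are all representable in US-ASCII).

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: moves a BigInteger through bytes via its textual form.
class BigIntegerText {

  // Output: BigInteger -> decimal String -> bytes.
  static byte[] toBytes(BigInteger value) {
    return value.toString().getBytes(StandardCharsets.US_ASCII);
  }

  // Input: bytes -> decimal String -> BigInteger (the reverse of toBytes).
  static BigInteger fromBytes(byte[] bytes) {
    return new BigInteger(new String(bytes, StandardCharsets.US_ASCII));
  }

  public static void main(String[] args) {
    BigInteger x = new BigInteger("530500452766");
    BigInteger y = fromBytes(toBytes(x));
    System.out.println(x.equals(y)); // true
  }
}
```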
Noncompliant Code Example
This noncompliant code example corrupts the data when string contains characters that are not representable in the specified charset.
Compliant Solution
The java.nio.charset.CharsetEncoder class can transform a sequence of 16-bit Unicode characters into a sequence of bytes in a specific charset, while the java.nio.charset.CharsetDecoder class can reverse the procedure [API 2006].
This compliant solution uses the CharsetEncoder and CharsetDecoder classes to handle encoding conversions.
Noncompliant Code Example
This noncompliant code example attempts to append a string to a text file in the specified encoding. This is erroneous because the String may contain unrepresentable characters.
Compliant Solution
This compliant solution uses the CharsetEncoder class to perform the required function.
Use the FileInputStream and InputStreamReader objects to read back the data from the file. InputStreamReader accepts an optional CharsetDecoder argument, which must be for the same charset as the encoder previously used for writing to the file.
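Writing with a CharsetEncoder and reading back with a matching CharsetDecoder might be sketched as follows. This is not the rule's original listing: the class `EncodedFileIo`, its method names, and the choice to pass a `Charset` and build the encoder and decoder internally are all assumptions made for illustration.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;

// Hypothetical helper: file I/O that reports encoding errors instead of hiding them.
class EncodedFileIo {

  // Appends text to the file; an unrepresentable character causes an IOException
  // (a CharacterCodingException) rather than a silent replacement.
  static void append(File file, String text, Charset charset) throws IOException {
    CharsetEncoder encoder = charset.newEncoder()
        .onMalformedInput(CodingErrorAction.REPORT)
        .onUnmappableCharacter(CodingErrorAction.REPORT);
    try (Writer writer = new OutputStreamWriter(new FileOutputStream(file, true), encoder)) {
      writer.write(text);
    }
  }

  // Reads the whole file back using a decoder for the same charset.
  static String readAll(File file, Charset charset) throws IOException {
    CharsetDecoder decoder = charset.newDecoder()
        .onMalformedInput(CodingErrorAction.REPORT)
        .onUnmappableCharacter(CodingErrorAction.REPORT);
    StringBuilder result = new StringBuilder();
    try (Reader reader = new InputStreamReader(new FileInputStream(file), decoder)) {
      char[] buffer = new char[1024];
      int count;
      while ((count = reader.read(buffer)) != -1) {
        result.append(buffer, 0, count);
      }
    }
    return result.toString();
  }
}
```

A caller would pass the same charset to both methods, for example `append(file, text, StandardCharsets.UTF_8)` followed by `readAll(file, StandardCharsets.UTF_8)`.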
Exceptions
STR03-EX0: Binary data that is expected to be a valid string may be read and converted to a string. How to perform this operation securely is explained in rule STR04-J. Use compatible character encodings on both sides of file or network IO. Also see rule STR01-J. Don't form strings containing partial characters.
Risk Assessment
Attempting to read a byte array containing binary data as if it were character data can produce erroneous results.
Rule | Severity | Likelihood | Remediation Cost | Priority | Level |
---|---|---|---|---|---|
STR05-J | low | unlikely | medium | P2 | L3 |
Related Guidelines
Bibliography