What is the UTF format that Serialization uses to write strings?

Tim Rohaly

The Java programming language was designed from the ground up to be compatible with multiple character sets from multiple languages. This entails using a non-ASCII character and string representation within the language itself.

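Internally, Java stores and manipulates all characters and strings as Unicode, using 16-bit char values (UTF-16 code units). Externally, in serialization and in class files, Java uses a slightly modified form of the UTF-8 encoding of the Unicode character set. Each 16-bit char is written as 1, 2, or 3 bytes: characters from \u0001 to \u007F (ASCII other than the null character) are encoded as 1 byte, the null character \u0000 and characters from \u0080 to \u07FF as 2 bytes, and characters from \u0800 to \uFFFF as 3 bytes. A character outside the 16-bit range is stored as a surrogate pair, and each half of the pair is encoded separately as 3 bytes.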
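The byte counts can be checked with DataOutputStream.writeUTF, which writes a 2-byte length followed by the string in Java's modified UTF-8 form. The following is a minimal sketch, not part of the original answer; the class name ModifiedUtf8Lengths and the helper encodedLength are hypothetical names chosen for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class ModifiedUtf8Lengths {

    // Hypothetical helper: returns how many bytes writeUTF emits for the
    // characters of s, not counting the 2-byte length prefix it writes first.
    static int encodedLength(String s) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.size() - 2;
    }

    public static void main(String[] args) {
        System.out.println(encodedLength("A"));            // ASCII letter: 1
        System.out.println(encodedLength("\u00e9"));       // e with acute accent: 2
        System.out.println(encodedLength("\u20ac"));       // euro sign: 3
        System.out.println(encodedLength("\u0000"));       // null character: 2
        System.out.println(encodedLength("\uD83D\uDE00")); // surrogate pair: 6
    }
}
```

The last two lines show where this form differs from standard UTF-8, which would use 1 byte for the null character and 4 bytes for a character outside the 16-bit range.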

UTF-8 has the property that the 7-bit ASCII characters encode to single bytes of the same value (in Java's modified form this holds for 0x01-0x7F, since the null character is written as two bytes). Therefore, pure ASCII text is byte-for-byte identical when written as UTF-8, and existing systems which use ASCII encoding can read it unchanged. This is one of the main reasons UTF-8 was chosen for Java's external character representation - it allows complete compatibility with ASCII as well as allowing for a wide range of international characters.
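The ASCII compatibility can be illustrated by comparing the bytes that DataOutputStream.writeUTF produces with the plain ASCII bytes of the same string. This is a sketch under stated assumptions; the class name AsciiCompatibility and the helpers writeUtfPayload and matchesAscii are hypothetical names used only for this example.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class AsciiCompatibility {

    // Hypothetical helper: the bytes writeUTF emits for s, with the
    // leading 2-byte length prefix stripped off.
    static byte[] writeUtfPayload(String s) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = buffer.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    // True when the writeUTF payload equals the US-ASCII encoding of s.
    static boolean matchesAscii(String s) {
        return Arrays.equals(writeUtfPayload(s),
                             s.getBytes(StandardCharsets.US_ASCII));
    }

    public static void main(String[] args) {
        System.out.println(matchesAscii("Hello, world!")); // true
        System.out.println(matchesAscii("caf\u00e9"));     // false
    }
}
```

The second string fails the comparison because its last character is outside ASCII: writeUTF emits two bytes for it, while the US-ASCII encoder substitutes a single replacement byte.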

For more information, refer to the FAQ What is Unicode?