Compact Strings In Java 9

Strings are used extensively in almost every Java application. I can’t remember a single application where I have not used Strings. Any optimization of String therefore affects nearly every Java application, so it is worth knowing what Java 9 brings in this area. Java 9 ships with JEP 254 (Compact Strings). In this article we discuss Compact Strings and how the feature can affect an application's memory footprint and performance.

Compact Strings

Up to Java 8, each String in Java is internally represented by two objects. The first is the String object itself. The second is the char array that holds the data contained in the String. The char type occupies 16 bits, or two bytes. If the data is a String in the English language, for instance, the leading 8 bits of each char will often be all zeroes, because the character can be represented using only one byte.

Studies on various applications say that Strings occupy a major portion of the JVM heap space in any application. Since strings are immutable, and string literals additionally reside in the string pool, you can imagine how much memory they can use up until garbage collection occurs. It thus makes sense to make strings more compact by discarding data that adds no value (the leading byte, or 8 bits of all zeroes, in the case of an English character).

A JDK Enhancement Proposal (JEP 254) was created to address the issue explained above. Note that this is only a change at the internal implementation level; no changes are proposed for existing public interfaces. A study of heap dumps of various Java applications revealed that most of the Strings in those applications contained only LATIN-1 characters, which can be represented using just 8 bits. Other characters needed all 16 bits, but they occurred far less frequently than LATIN-1 characters.

To understand the proposed changes in a better fashion, let us consider a String in Java containing the letters Hello. The following illustration shows how the data is saved internally.

 

[Figure: Compact Strings in Java 9]

Under each byte, we have written the hexadecimal representation according to UTF-16. This is how a String object is internally represented using a char array up to Java 8. Notice that the bytes in light grey are not really needed to represent the characters. The data that matters in each 16-bit unit representing an English letter is the trailing 8 bits. Thus, by omitting the leading bytes, it is possible to save space.
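The redundant leading bytes can be observed with a few lines of Java. The following is a minimal sketch: the class name HelloBytes and the helper hex are made up for this illustration, and big-endian UTF-16 is chosen only because it matches the usual written form (the JVM's in-memory byte order may differ by platform).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the JDK.
class HelloBytes {
    // Formats each byte as two upper-case hex digits, separated by spaces.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "Hello";
        // Two bytes per character: every leading byte is 00 for these letters.
        byte[] utf16 = s.getBytes(StandardCharsets.UTF_16BE);
        // One byte per character: only the trailing bytes remain.
        byte[] latin1 = s.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(hex(utf16));  // 00 48 00 65 00 6C 00 6C 00 6F
        System.out.println(hex(latin1)); // 48 65 6C 6C 6F
    }
}
```

For this five-letter word, the LATIN-1 form needs 5 bytes where UTF-16 needs 10.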

String Class Enhancements For Compact Strings

In the enhanced String class of Java 9, the string is compressed during construction: there is an optimistic attempt to store it using 1 byte per character (the ISO-8859-1 encoding, also known as LATIN-1, of which ASCII is a subset). If any character in the given string cannot be represented using only 8 bits, all characters are stored using two bytes each (UTF-16 representation).
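The optimistic strategy described above can be sketched as follows. This is an illustration only, not the JDK source: the class CompactStringSketch and its methods tryCompress and toUtf16 are hypothetical names, and the big-endian byte order in toUtf16 is an assumption made for readability.

```java
// Hypothetical model of a string that picks its encoding at construction.
final class CompactStringSketch {
    static final byte LATIN1 = 0; // one byte per character
    static final byte UTF16 = 1;  // two bytes per character

    final byte[] value;
    final byte coder;

    CompactStringSketch(char[] chars) {
        byte[] compressed = tryCompress(chars);
        if (compressed != null) {
            value = compressed;
            coder = LATIN1;
        } else {
            // At least one character needs 16 bits, so all of them get 16 bits.
            value = toUtf16(chars);
            coder = UTF16;
        }
    }

    // Returns one byte per char, or null as soon as a char above 0xFF is seen.
    static byte[] tryCompress(char[] chars) {
        byte[] out = new byte[chars.length];
        for (int i = 0; i < chars.length; i++) {
            if (chars[i] > 0xFF) {
                return null;
            }
            out[i] = (byte) chars[i];
        }
        return out;
    }

    // Stores every char as two bytes, high byte first (assumed order).
    static byte[] toUtf16(char[] chars) {
        byte[] out = new byte[chars.length * 2];
        for (int i = 0; i < chars.length; i++) {
            out[2 * i] = (byte) (chars[i] >> 8);
            out[2 * i + 1] = (byte) chars[i];
        }
        return out;
    }

    public static void main(String[] args) {
        CompactStringSketch latin = new CompactStringSketch("Hello".toCharArray());
        // The euro sign is U+20AC, which does not fit into 8 bits.
        CompactStringSketch wide = new CompactStringSketch("H\u20AC".toCharArray());
        System.out.println(latin.coder + " " + latin.value.length); // 0 5
        System.out.println(wide.coder + " " + wide.value.length);   // 1 4
    }
}
```

Note the cost in the failing case: the input is scanned once by tryCompress and then copied again by toUtf16, which is the overhead an application with mostly non-LATIN-1 text would pay.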

Certain changes were made to the internal implementation of the String class in order to distinguish between UTF-16 and LATIN-1 Strings. A final field named coder has been introduced, which forced one crucial internal change: how should the length of the string be calculated for each encoding? This is very important because widely used methods such as charAt(int index), which goes to the i-th position and returns the character there, depend on it. Unless the length is determined properly, methods like this can be error prone.

The length of the String is calculated internally by right-shifting the length of the underlying byte array by the value of coder.

If the String contains LATIN-1 characters only, coder is zero, so the length of the String is the length of the byte array. If the String contains characters that need UTF-16, coder is set to one. The right shift then means that the string length is half the size of the byte array that holds the UTF-16 encoded data.
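The effect of the shift on length() and charAt() can be sketched in a small model. The class LengthSketch is hypothetical and takes the byte array and coder as plain parameters; the big-endian layout of the UTF-16 bytes is an assumption of this sketch, not a statement about the JDK.

```java
// Hypothetical helper class modelling coder-aware length and charAt.
final class LengthSketch {
    static final byte LATIN1 = 0;
    static final byte UTF16 = 1;

    // Shifting right by coder divides by 1 (LATIN-1) or by 2 (UTF-16).
    static int length(byte[] value, byte coder) {
        return value.length >> coder;
    }

    static char charAt(byte[] value, byte coder, int index) {
        if (index < 0 || index >= length(value, coder)) {
            throw new StringIndexOutOfBoundsException(index);
        }
        if (coder == LATIN1) {
            // One byte per character; mask to avoid sign extension.
            return (char) (value[index] & 0xFF);
        }
        // Two bytes per character, high byte first (assumed order).
        int hi = value[2 * index] & 0xFF;
        int lo = value[2 * index + 1] & 0xFF;
        return (char) ((hi << 8) | lo);
    }

    public static void main(String[] args) {
        byte[] latin1 = {0x48, 0x65, 0x6C, 0x6C, 0x6F};  // "Hello"
        byte[] utf16 = {0x00, 0x48, 0x20, (byte) 0xAC};  // 'H' followed by U+20AC
        System.out.println(length(latin1, LATIN1)); // 5
        System.out.println(length(utf16, UTF16));   // 2
        System.out.println(charAt(latin1, LATIN1, 1)); // e
        System.out.println(charAt(utf16, UTF16, 1) == '\u20AC'); // true
    }
}
```

Without the coder-aware length, an index check on the UTF-16 array above would accept positions 2 and 3, which do not exist as characters.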

Similar to the length() method of String, the internal implementations of certain methods in StringBuffer and StringBuilder have also been changed in order to support string compaction fully in Java 9. However, the APIs that Java programmers code against remain the same.

Kill-Switch For Compact String Feature

The Compact Strings feature is enabled by default in Java 9. If we are sure that our application will, at runtime, mostly generate Strings that are representable only in UTF-16, we may want to disable the feature. Doing so avoids the overhead incurred during String construction when the optimistic conversion to the 1-byte (LATIN-1) representation is attempted and fails. To disable the feature, we can start the JVM with the switch -XX:-CompactStrings.

Impact Of Compact String During Runtime

During performance testing, the developers of this feature at Oracle found that Compact Strings showed a significant reduction in memory footprint and a performance gain when Strings containing only LATIN-1 characters were processed. There was a notable improvement in the performance of the garbage collector as well.

A feature named Compressed Strings was introduced in Java 6 with the same motive, but it was not effective. Compressed Strings were not enabled by default in JDK 6 and had to be switched on explicitly with the JVM option -XX:+UseCompressedStrings.

Compressed Strings maintained a completely distinct String implementation, located in alt-rt.jar, that focused on converting ASCII-encodable strings to byte arrays. A major problem at the time was that the String constructor took a char array. Many operations also depended on the char array representation rather than the byte array, so a lot of unpacking was needed, which resulted in performance problems. The feature was eventually dropped from JDK 7 and JDK 8.

Unlike Compressed Strings, Compact Strings do not require unpacking or repacking, and hence give better performance at runtime.

Summary

Compact Strings is going to be a very helpful feature for applications that use Strings extensively. It may lead to a much smaller memory requirement. We are looking forward to this feature.
