The Secret Life of a Java String

The design decisions behind Java's most-used type


String is the most-used type in any Java codebase, without much competition. And yet, most developers don’t know much about what actually happens under the hood.

At surface level, we all know what a String is and how to deal with it. Strings are immutable, so we should use StringBuilder for heavy concatenation. Never compare them with ==, use equals. Most of the time, that’s quite enough to write correct code.

But Java’s String features a genuinely interesting design. One that has changed significantly over the years, is still evolving, and that rewards the effort of understanding it.

When performance matters or bugs get weird, the difference between guessing and knowing often comes down to what you understand about the thing you use most.


CharSequence: What String Presents to the World

Regardless of the programming language used, the most common model many developers have of a String is an array of characters. The Java API seems to confirm it, with methods like charAt(int), length(), substring().

But that model is an incomplete view of what String really is.

String doesn’t just happen to support those operations, but the type explicitly declares itself to be a CharSequence.

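Setting aside a few stream and comparison helpers (chars(), codePoints(), and the static compare()), CharSequence is a simple five-method interface: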

  • int length()
  • char charAt(int index)
  • CharSequence subSequence(int start, int end)
  • String toString()
  • boolean isEmpty() (default method, Java 15+)

It’s a read-only view of a sequence of char values, nothing more. The mental model holds up so far.

This simple interface creates a common shape for text by making it indexable and readable character by character. String isn’t the only implementor: StringBuilder, StringBuffer, and CharBuffer from NIO implement it too.

How the underlying text is stored, and whether it’s mutable, is left entirely to the implementation.

This distinction matters for API design.

If a method only needs to read text, you probably don’t need to make the parameter a full-blown String. Using CharSequence instead means a caller can pass a StringBuilder mid-construction, a CharBuffer from an NIO channel, or a plain String, without any of them needing to call toString() first.

java
// Forces an allocation if I don't have a String already
public void log(String message) {
    // ...
}

// Accepts any readonly text, no toString() required
public void log(CharSequence message) {
    // ...
}

// Now this is valid:
StringBuilder sb = new StringBuilder("[")
    .append(level)
    .append("] ")
    .append(event);

log(sb); // no toString() call needed, no allocation

Even though CharSequence provides a read-only view of a text and opens up acceptable types for method parameters, there’s one deliberate sharp edge: it does not define equals() or hashCode().

This isn’t an oversight.

A String and a StringBuilder containing identical characters are not meaningfully equal under any reasonable contract, so no shared definition is provided.

The consequence of this decision is subtle. If we compare two CharSequence variables, it falls back to the implementation’s own equals(). In the case of StringBuilder, that means identity-based comparison, since it doesn’t override Object.equals().

java
CharSequence a = "hello, world";
CharSequence b = new StringBuilder("hello, world");

// String.equals() rejects non-String arguments
a.equals(b); // false

// StringBuilder inherits Object.equals(), identity only
b.equals(a); // false
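
When content comparison across implementations is what we actually want, the JDK offers two options: String.contentEquals(CharSequence) and the static CharSequence.compare() (Java 11+). The following is a minimal sketch. The class name ContentCompare and the helper sameText() are made up for illustration.

```java
final class ContentCompare {

    // Compares two CharSequences by content, ignoring their concrete types.
    // CharSequence.compare (Java 11+) returns 0 when both hold the same chars.
    static boolean sameText(CharSequence a, CharSequence b) {
        return CharSequence.compare(a, b) == 0;
    }

    public static void main(String[] args) {
        CharSequence a = "hello, world";
        CharSequence b = new StringBuilder("hello, world");

        System.out.println(a.equals(b));                     // false: String.equals() wants a String
        System.out.println("hello, world".contentEquals(b)); // true: compares the characters
        System.out.println(sameText(a, b));                  // true
    }
}
```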

CharSequence is also a small window into a broader Java design philosophy: defining interfaces around capabilities, not types.

“Can be read as a sequence of chars” is a capability.

Whether the backing storage is a byte[], a char[], or a memory-mapped buffer is a separate concern. That makes it a clean abstraction that keeps those layers from leaking into each other.


What’s Actually Inside a String

Before Java 9, the mental model of “String as an array of characters” was structurally accurate. Every String was backed by a char[], with each character encoded in UTF-16. That made it easy to reason with, but every character cost exactly 2 bytes, even for plain ASCII like "hello, world".

Most string content in practice is plain ASCII or Latin-1, so paying for UTF-16 on every character wastes half the memory. That’s why Java 9 changed this with Compact Strings (JEP 254).

Instead of a char[], the backing store became a byte[], paired with a single private final byte coder field: 0 for Latin-1, 1 for UTF-16. The reasoning for this change came from profiling real-world JVM heap dumps, as they showed that the vast majority of Strings in production applications contain only Latin-1 characters.

For those, each character can be stored in one byte instead of two, cutting their heap footprint in half. No public API changes, no work required from application code.

The Illusion of Transparency

Compact Strings are completely invisible at the API level.

String operations check the coder field and dispatch to one of two hidden String implementations: StringLatin1 or StringUTF16. Both are package-private classes, so we can’t reference them directly. But we don’t have to, as they’re merely an implementation detail that String presents its unified shape over.

Take charAt(i) as a concrete example.

For a Latin-1 string, it’s a direct byte[] lookup (plus a little validation):

java
final class StringLatin1 {

    public static char charAt(byte[] value, int index) {
        checkIndex(index, value.length);
        return (char)(value[index] & 0xff);
    }
}

For UTF-16, there’s a little more work involved, as it needs to read two bytes and reassemble them with a bit shift:

java
final class StringUTF16 {

    public static char charAt(byte[] value, int index) {
        checkIndex(index, value);
        return getChar(value, index);
    }

    @IntrinsicCandidate
    static char getChar(byte[] val, int index) {
        assert index >= 0 && index < length(val) : "Trusted caller missed bounds check";
        index <<= 1;
        return (char)(((val[index++] & 0xff) << HI_BYTE_SHIFT) |
                      ((val[index]   & 0xff) << LO_BYTE_SHIFT));
    }
}

Either way, String.charAt() returns a char.
The dispatch happens under the hood, and the caller sees no difference.
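
To make the coder idea tangible, here is a toy model of the same design. It is a sketch under stated assumptions, not the JDK implementation: the class CompactText, its methods, and the byte order chosen for the two-byte case are all illustrative.

```java
final class CompactText {
    private static final byte LATIN1 = 0;
    private static final byte UTF16 = 1;

    private final byte[] value; // one or two bytes per char, depending on coder
    private final byte coder;

    CompactText(String source) {
        // Use the compact form only if every char fits into a single byte
        boolean fitsLatin1 = source.chars().allMatch(c -> c <= 0xFF);
        coder = fitsLatin1 ? LATIN1 : UTF16;
        value = new byte[source.length() << coder];
        for (int i = 0; i < source.length(); i++) {
            char c = source.charAt(i);
            if (coder == LATIN1) {
                value[i] = (byte) c;
            } else {
                value[2 * i] = (byte) (c >> 8); // high byte first (arbitrary choice here)
                value[2 * i + 1] = (byte) c;    // low byte
            }
        }
    }

    int length() { return value.length >> coder; }

    int byteSize() { return value.length; }

    // Same public result for both encodings, different lookup underneath
    char charAt(int index) {
        if (index < 0 || index >= length()) {
            throw new StringIndexOutOfBoundsException(index);
        }
        if (coder == LATIN1) {
            return (char) (value[index] & 0xFF);
        }
        return (char) (((value[2 * index] & 0xFF) << 8) | (value[2 * index + 1] & 0xFF));
    }

    public static void main(String[] args) {
        System.out.println(new CompactText("hello").byteSize());      // 5
        System.out.println(new CompactText("Pawe\u0142").byteSize()); // 10
    }
}
```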

The Hidden Cost of UTF-16 Promotion

Okay, so we got two different String implementations that know how to handle themselves. But what if we concatenate those two?

The rules are simple:

  • If both instances are Latin-1, the result can be Latin-1, too, saving heap space.
  • If a Latin-1 instance is concatenated with a UTF-16 one, the result is silently promoted to UTF-16.

The UTF-16 part of the operation demands appropriate backing, so the Latin-1 parts are promoted, losing their memory benefits.

And such a promotion is permanent! A string never demotes back to Latin-1.

This is a consequence of immutability: once a String is created, its coder field is set and never changes. The JVM can’t inspect a string later and decide to repack it.

So any operation whose result contains at least one non-Latin-1 character produces a UTF-16 string, and every String built by appending to it is UTF-16 as well, even if all the new content is plain ASCII:

java
String greeting = "Hello, "; // Latin-1: 7 bytes
String name     = "Paweł";   // UTF-16: 10 bytes (ł is outside Latin-1)

String full = greeting + name;       // UTF-16: 24 bytes
String line = full + ". Welcome!";  // still UTF-16, even though ". Welcome!" is Latin-1-only

In practice, the trigger is often a single character: an emoji in a log message, a name containing an accented character outside Latin-1, or any Unicode symbol.

Note that Latin-1 covers ISO-8859-1, so common Western European characters like é, ü, ñ, and ø are fine. It’s characters like ł, €, or anything else outside that 256-character block that cause promotion.
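
Whether a piece of text is eligible for the compact form can be checked with public API only, by asking the ISO-8859-1 encoder. This is a small sketch: the class Latin1Check and the method fitsLatin1() are illustrative names. It predicts eligibility, not the actual coder of a given instance, which also depends on the JVM running with Compact Strings enabled.

```java
import java.nio.charset.StandardCharsets;

final class Latin1Check {

    // True when every char of s exists in ISO-8859-1 (U+0000 to U+00FF)
    static boolean fitsLatin1(String s) {
        return StandardCharsets.ISO_8859_1.newEncoder().canEncode(s);
    }

    public static void main(String[] args) {
        System.out.println(fitsLatin1("caf\u00E9"));    // true:  é is U+00E9
        System.out.println(fitsLatin1("Pawe\u0142"));   // false: ł is U+0142
        System.out.println(fitsLatin1("\uD83D\uDE00")); // false: emoji outside the BMP
    }
}
```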

This rarely matters in everyday code. But if you’re profiling unexpected allocation in a hot path, let’s say, a loop assembling strings from user-provided fields, and one of those fields occasionally contains a character outside Latin-1, every string produced from that point forward costs twice as much to allocate.

It’s the kind of thing that doesn’t show up in unit tests and takes a heap profiler to find.

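There’s no public API to check which encoding a String is using. If you need to verify it during debugging or profiling, you can use reflection to expose the coder field. On Java 16 and later, the JVM must be started with --add-opens java.base/java.lang=ALL-UNNAMED, otherwise setAccessible() throws an InaccessibleObjectException:

java
import java.lang.reflect.Field;

// Debug-only: relies on a private JDK field that may change between releases
Field coder = String.class.getDeclaredField("coder");
coder.setAccessible(true); // needs --add-opens java.base/java.lang=ALL-UNNAMED on Java 16+
System.out.println(coder.getByte(someString) == 0 ? "LATIN1" : "UTF16");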

Don’t ship this. But it’s useful when you suspect promotion is happening somewhere unexpected and want to confirm it.

Immutability as Structure, not just Contract

The other thing about Strings worth understanding is how immutability is enforced, not just promised.

The backing byte[] is stored in a private final field, preventing the reference from being reassigned. And since String is itself a final class, no subclass can add a setter or expose a backdoor.

That means there is no supported path to mutating the contents of a String.

Ever.

It’s what makes hashCode caching safe. Initially, the hash field is zero and is computed lazily on the first call. Because the content can never change, storing the result requires no synchronization. The worst case if two threads call hashCode() simultaneously is a redundant recalculation, not a corrupted value.

JVM trivia: a hash code can legitimately compute to 0, in which case it used to be recalculated on every call. To fix this, Java 13 quietly added a private boolean hashIsZero field to remember that a zero hash was already computed.

It also historically enabled substring() to return a new String that shared the same backing array as the original, with just a different offset and length. OpenJDK dropped this behaviour in Java 7 (update 6, JDK-4513622) because it caused memory leaks: a small substring could keep a large backing array alive indefinitely. The trade was more allocation in exchange for simpler, more predictable GC behavior.
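
The lazy caching pattern described above can be sketched on a small immutable wrapper. The class CachedHashText is hypothetical and only mirrors the idea: an unsynchronized cache field plus a flag for a hash that really is zero.

```java
final class CachedHashText {
    private final String text;  // immutable content, so the hash can never go stale
    private int hash;           // 0 means "not computed yet" (or genuinely zero)
    private boolean hashIsZero; // remembers a hash that really computed to zero

    CachedHashText(String text) {
        this.text = text;
    }

    @Override
    public int hashCode() {
        int h = hash;
        if (h == 0 && !hashIsZero) {
            h = compute();
            // Racing threads compute the same value, so the worst case is duplicate work
            if (h == 0) {
                hashIsZero = true;
            } else {
                hash = h;
            }
        }
        return h;
    }

    @Override
    public boolean equals(Object other) {
        return other instanceof CachedHashText
                && ((CachedHashText) other).text.equals(text);
    }

    // Same polynomial that String.hashCode() documents: s[0]*31^(n-1) + ... + s[n-1]
    private int compute() {
        int h = 0;
        for (int i = 0; i < text.length(); i++) {
            h = 31 * h + text.charAt(i);
        }
        return h;
    }
}
```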

String being final does one more thing: it’s a hint to the JIT. When a class is final, calls to its methods can be devirtualized, letting the compiler know that there are no subclasses to dispatch to, so it can inline the method body directly at the call site. Given how often string operations appear in hot paths, that matters.

JIT Intrinsics

final enables devirtualization, but the JVM goes further for String’s most critical methods. equals(), indexOf(), and hashCode() are JIT intrinsics on HotSpot: the JVM recognizes them by name and replaces their Java implementations with hand-tuned native code at runtime.

You can spot the annotation in the JDK source: @IntrinsicCandidate (named @HotSpotIntrinsicCandidate before Java 16) appears on many of the internal helpers behind these methods, such as StringLatin1.equals() and StringUTF16.getChar(), which you already saw in the charAt() example above. It’s a signal that HotSpot may substitute its own optimized version rather than compiling the Java body.

The practical consequence is counterintuitive: these methods are often faster than equivalent hand-rolled code, because the intrinsic implementations can use CPU-level instructions, like SIMD vector comparisons or hardware-accelerated hashing, that the JIT wouldn’t generate from plain Java.

If you’re ever tempted to replace equals() with a manual character loop for performance, benchmark first. The intrinsic almost certainly beats you.


The String Pool and Interning

Immutability gives the JVM a lot to work with.

The JVM maintains a global table of unique String instances called the String Pool, and every string literal in our code is automatically added to it. So when you write "hello, world!" in two different places, the JVM ensures both references point to the same object, saving us heap space.

Since Java 7 (JDK-6962931), the pool lives on the regular heap and is subject to garbage collection like any other object. Before that, it lived in PermGen, a fixed-size memory region that didn’t participate in normal GC, which made unbounded pool growth a real problem.

The String Pool saves us memory, but it’s also the culprit behind the == trap:

It’s a Trap

Two string literals with the same content are the same object, so == happens to return true. But == compares references, not content! The moment one of the Strings comes from anywhere other than a literal, the equality check fails:

java
String a = "hello";
String b = "hello";
a == b;         // true, as both are the same interned object

String c = new String("hello");
a == c;         // false, as c is a fresh object, not the pooled one
a.equals(c);    // true, same content, which is what you actually wanted

That’s why we must always use .equals() and never == for Strings! It may work in many cases, but it’s an implementation detail and not part of the public API or contract.

And it’s not a theoretical risk, it’s a bug that surfaces any time you compare a parsed, deserialized, or dynamically-built String against a literal. No compiler warning, no exception, just a comparison that silently returns false.

Everyone, Get Into The Pool

Imagine parsing a million log lines and storing the severity level as a String on each event object. Every line produces a fresh "INFO" or "WARN" from the parser, a new object each time, even though there are only five or six distinct values in the entire dataset. That’s a lot of identical strings sitting in memory for no reason.

You might hope the String Pool would help here. It does in general, but only for literals and string-valued constant expressions (JLS 3.10.5): "INFO" written in source code will always resolve to the same pooled object. But "INFO" produced at runtime by a parser, a split(), or a network read is a fresh allocation every time.

The pool doesn’t know about it.

That’s where intern() comes in. It checks the pool for an equivalent string and returns the canonical instance if one exists, or adds it. Call intern() on every parsed level, and those million objects collapse into five.

java
String level = parser.readLevel().intern(); // returns the pooled "INFO", always the same object
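
The effect is easy to observe in isolation. In this sketch, the helper fresh() is a hypothetical stand-in for any runtime-produced value such as a parser result:

```java
final class InternDemo {

    // Stand-in for a value built at runtime: always a new object
    static String fresh(String s) {
        return new String(s);
    }

    public static void main(String[] args) {
        String parsed = fresh("INFO");

        System.out.println(parsed == "INFO");          // false: distinct object
        System.out.println(parsed.equals("INFO"));     // true:  same content
        System.out.println(parsed.intern() == "INFO"); // true:  canonical pooled instance
    }
}
```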

Today, however, that use case is largely obsolete, thanks to G1 String Deduplication.

Starting with Java 8u20 (JEP 192), the G1 garbage collector can identify String instances with identical content during GC pauses and rewire them to share the same backing byte[]. The String objects themselves remain distinct (== will still return false), but the memory is shared. Enable it with -XX:+UseStringDeduplication and the log-level problem largely solves itself.

The main remaining use case for manual intern() is when you specifically want == to work for some obscure reason, or when you’re using a GC that doesn’t support deduplication.

The catch is that overusing intern() can do more harm than good when the strings aren’t actually repeated. The pool is a hash table with a fixed number of buckets: 65,536 by default, tunable via -XX:StringTableSize. Feed it a stream of arbitrary, distinct values (user-provided input, generated IDs, anything that doesn’t repeat), and the chains behind each bucket grow longer with every call. What should be an O(1) lookup quietly becomes O(n).

intern() is only safe for strings from a known, bounded set.
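
One alternative that is not part of the discussion above, but follows the same "known, bounded set" idea, is an application-level canonicalizer backed by a map. Unlike the JVM-wide pool, it can be scoped to a component and discarded with it. The class LevelCanonicalizer is a hypothetical sketch, and it has the same weakness as intern() if fed unbounded distinct values.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

final class LevelCanonicalizer {
    // Maps each distinct value to the first instance seen with that content
    private final ConcurrentMap<String, String> known = new ConcurrentHashMap<>();

    // Returns one shared instance per distinct content
    String canonical(String value) {
        return known.computeIfAbsent(value, key -> key);
    }

    int size() {
        return known.size();
    }

    public static void main(String[] args) {
        LevelCanonicalizer levels = new LevelCanonicalizer();
        String first = levels.canonical(new String("WARN"));
        String second = levels.canonical(new String("WARN"));
        System.out.println(first == second); // true: both refer to the first instance
        System.out.println(levels.size());   // 1
    }
}
```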


The Many Ways of String Concatenation

The + operator is how most string construction starts, and it looks completely trivial. But it’s also where a lot of outdated beliefs and advice still circulate: advice that made sense once, but has been stale for years once we know what the JVM is actually doing.

To understand where that advice came from, look at what javac used to do with + inside a loop.

Take these few lines of code, for example:

java
String result = "";
for (String word : words) {
    result = result + " " + word;
}

Before Java 9, the compiler replaced the + operator with a StringBuilder, generating roughly this:

java
String result = "";
for (String word : words) {
    result = new StringBuilder()
        .append(result)
        .append(" ")
        .append(word)
        .toString();
}

A fresh StringBuilder on every pass.

Each one copies the entire accumulated string so far, appends the new word, then creates a new String, turning the previous result into garbage on every iteration.

For 100 words, we copy roughly 1 + 2 + 3 + … + 100 = 5,050 word-lengths of text: O(n²) work for what should be O(n).

With a long enough input, this becomes a genuine problem, and it generates significant GC pressure.
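
The arithmetic can be checked with a deliberately simplified cost model. The class ConcatCost and both methods are illustrative. The model counts only the characters copied into each new result and ignores builder resizing and the final toString() copy.

```java
final class ConcatCost {

    // Every iteration rebuilds the whole result, so it copies everything so far
    static long rebuildEachTime(int pieces, int pieceLength) {
        long copied = 0;
        long length = 0;
        for (int i = 0; i < pieces; i++) {
            length += pieceLength; // result grows by one piece
            copied += length;      // and is copied in full
        }
        return copied;
    }

    // A single builder copies each piece exactly once
    static long appendInPlace(int pieces, int pieceLength) {
        return (long) pieces * pieceLength;
    }

    public static void main(String[] args) {
        System.out.println(rebuildEachTime(100, 1)); // 5050
        System.out.println(appendInPlace(100, 1));   // 100
    }
}
```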

That’s where “always use StringBuilder” advice came from, as it forced us to think about the concatenation, and not just let the compiler replace + operators directly:

java
StringBuilder sb = new StringBuilder();
for (String word : words) {
    sb.append(" ").append(word);
}
String result = sb.toString();

One builder, a handful of internal resizes as it grows, and one final copy.

The advice is correct, but we often applied it to scenarios where it wasn’t necessary, and a simple + would be easier to reason about.

Even when used correctly, since Java 9 (JEP 280), that advice is stale for many cases. Instead of relying on StringBuilder, the compiler now emits a single invokedynamic call, and a runtime class called StringConcatFactory decides the optimal strategy to build the String. That moves the decision of what to do from compile-time to runtime, where the JIT has far more context than javac ever had.

That means we can write expressions like "Hello, " + name + "!" and trust the compiler.
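
The bootstrap method behind that invokedynamic call is public API, so it can be driven by hand to see what the runtime builds. This is a sketch for exploration only. The class IndyConcatDemo is illustrative, and the recipe string uses \u0001 as the placeholder for a dynamic argument, as documented for StringConcatFactory.makeConcatWithConstants().

```java
import java.lang.invoke.CallSite;
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.lang.invoke.StringConcatFactory;

final class IndyConcatDemo {

    // Builds the equivalent of "Hello, " + name + "!" through the bootstrap method
    static String greet(String name) {
        try {
            CallSite site = StringConcatFactory.makeConcatWithConstants(
                    MethodHandles.lookup(),
                    "greet", // arbitrary name, not used for linkage
                    MethodType.methodType(String.class, String.class),
                    "Hello, \u0001!"); // \u0001 marks where the argument goes
            MethodHandle concat = site.getTarget();
            return (String) concat.invokeExact(name);
        } catch (Throwable t) {
            throw new IllegalStateException("concat linkage failed", t);
        }
    }

    public static void main(String[] args) {
        System.out.println(greet("World")); // Hello, World!
    }
}
```

Application code never needs to do this. javac emits the call site, and the JVM links it once on first use.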

This doesn’t mean every concatenation can be left to +. Where we do need to make a choice is when + might not see the full picture: loops, conditional appending, or building delimiter-separated output.

That’s where String-building types enter the picture.

StringBuffer, A Blast From The Past

The oldest one going back to the Java 1.0-era is StringBuffer.

Its methods are synchronized, which makes it thread-safe, but slow. Synchronization by default is unnecessary in most situations, and even if we need it there are alternatives. Modern code should use StringBuilder inside a ThreadLocal or coordinate access externally.

StringBuilder, The Better Alternative

The StringBuilder class is the right tool for iterative construction, which is why javac defaulted to it for so long. But it’s worth knowing what “iterative” actually means here.

For simple + expressions, we don’t need it and can trust the compiler. Where StringBuilder genuinely earns its place is anywhere the compiler can’t see the full structure at compile time: loops with unknown iteration counts, conditional appending, or assembling a String from many pieces across multiple branches.

java
// Let the compiler handle this, no StringBuilder needed
String label = "User #" + id + " (" + role + ")";

// Use StringBuilder here, as loop count isn't known at compile time
StringBuilder sb = new StringBuilder();
for (String chunk : chunks) {
    if (!chunk.isBlank()) {
        sb.append(chunk).append('\n');
    }
}

String result = sb.toString();

StringJoiner: The Underrated Specialist

StringBuilder is a general-purpose tool for concatenating arbitrary Strings. That’s good enough for many cases, but it quickly becomes burdensome if we want delimiter-separated output, like CSV.

We need to append an item, then check whether to append a comma, and so on… I know I’ve written way too much of such code in the past, and it’s ugly and noisy:

java
// The ceremony of not adding a trailing comma
StringBuilder sb = new StringBuilder();
sb.append("[");
for (String tag : tags) {
    if (sb.length() > 1) {
        sb.append(", ");
    }
    sb.append(tag);
}
sb.append("]");

String result = sb.toString(); // "[java, jvm, performance]"

Introduced in Java 8, StringJoiner was designed specifically to eliminate that boilerplate. Its constructor takes a delimiter, and optionally a prefix and suffix, simplifying the code immensely:

java
StringJoiner joiner = new StringJoiner(", ", "[", "]");
joiner.setEmptyValue("(none)");

for (String tag : tags) {
    joiner.add(tag);
}

String result = joiner.toString();
// tags = ["java", "jvm"] → "[java, jvm]"
// tags = []             → "(none)"

A few features worth knowing:

  • setEmptyValue() handles the zero-element edge case cleanly without additional guard checks after the fact.

  • merge(other) adds the content of other without prefix/suffix. That’s useful when we’re building parts of a String in separate branches and assembling them at the end.
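
Both features can be seen in a short sketch. The class JoinerMergeDemo and its method names are illustrative.

```java
import java.util.StringJoiner;

final class JoinerMergeDemo {

    // merge() adds the other joiner's content as one element,
    // keeping the other's delimiter but dropping its prefix and suffix
    static String mergeTags() {
        StringJoiner languages = new StringJoiner(", ", "[", "]");
        languages.add("java");

        StringJoiner runtime = new StringJoiner("; ", "{", "}");
        runtime.add("jvm").add("gc");

        languages.merge(runtime);
        return languages.toString();
    }

    // With no elements added, the empty value replaces prefix + suffix
    static String emptyCase() {
        return new StringJoiner(", ", "[", "]")
                .setEmptyValue("(none)")
                .toString();
    }

    public static void main(String[] args) {
        System.out.println(mergeTags()); // [java, jvm; gc]
        System.out.println(emptyCase()); // (none)
    }
}
```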

If we already have an Iterable or varargs, String#join() is a static shorthand that skips the StringJoiner entirely:

java
String csv = String.join(", ", "a", "b", "c"); // "a, b, c"

And if we’re already in a Stream pipeline, Collectors.joining(delimiter, prefix, suffix) uses StringJoiner under the hood, so we don’t need to concatenate/collect ourselves.
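
A minimal sketch of the Stream variant, with an illustrative class JoiningDemo and a filter step added just to show why a pipeline can be convenient here:

```java
import java.util.List;
import java.util.stream.Collectors;

final class JoiningDemo {

    // Skips blank tags, then joins the rest with delimiter, prefix, and suffix
    static String render(List<String> tags) {
        return tags.stream()
                .filter(tag -> !tag.isBlank())
                .collect(Collectors.joining(", ", "[", "]"));
    }

    public static void main(String[] args) {
        System.out.println(render(List.of("java", " ", "jvm"))); // [java, jvm]
        System.out.println(render(List.of()));                   // []
    }
}
```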


Modern String API Highlights

The String API has grown steadily in the last decade. Since developers rarely re-read the documentation for classes they use every day, several great additions easily slip under the radar.

Here are the ones actually worth knowing about.

strip(), isBlank(), and Unicode Whitespace

The classic, Java 1.0-era trim() method only removes characters with a code point of 32 (U+0020) or below. It completely ignores Unicode whitespace like the em space (U+2003), which is often problematic.

Java 11 introduced strip() alongside stripLeading() and stripTrailing(), all of which use Character.isWhitespace(), which is a Unicode-aware check. One caveat: Character.isWhitespace() deliberately excludes non-breaking spaces such as U+00A0, so strip() leaves those in place, just like trim():

java
String s = "\u2003hello\u2003"; // em spaces
s.trim();   // "\u2003hello\u2003" (em space, codepoint 8195, not removed)
s.strip();  // "hello"

The isBlank() method follows the same Unicode-awareness. It returns true if the String is empty or contains only Unicode whitespace, making it the Unicode-correct replacement for the combination of isEmpty() and trim().
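
The difference between the two checks, including the non-breaking-space exception of Character.isWhitespace(), can be sketched like this. The class WhitespaceDemo and the helper looksEmpty() are illustrative.

```java
final class WhitespaceDemo {

    // Null-safe "nothing useful in here" check based on isBlank() (Java 11+)
    static boolean looksEmpty(String s) {
        return s == null || s.isBlank();
    }

    public static void main(String[] args) {
        String emSpaces = "\u2003hi\u2003"; // em space on both sides
        String nbsp = "\u00A0hi";           // leading non-breaking space

        System.out.println(emSpaces.trim().length());  // 4: trim() leaves em spaces alone
        System.out.println(emSpaces.strip().length()); // 2: strip() removes them
        System.out.println(nbsp.strip().length());     // 3: NBSP is not whitespace for strip()
        System.out.println(looksEmpty("\u2003"));      // true
        System.out.println(looksEmpty("\u00A0"));      // false
    }
}
```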

Three Boilerplate Killers

The lines() method (Java 11) returns a lazy Stream<String> that splits on any standard line terminator (\n, \r, or \r\n), fixing the line-ending woes from the usual split("\n") without allocating all lines upfront.
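
A quick comparison with mixed line endings shows the difference. The class LinesDemo and the method splitLines() are illustrative names.

```java
import java.util.List;
import java.util.stream.Collectors;

final class LinesDemo {

    // Collects the lazy stream into a list for inspection
    static List<String> splitLines(String text) {
        return text.lines().collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String mixed = "a\r\nb\rc\n"; // Windows, old Mac, and Unix endings

        System.out.println(splitLines(mixed));         // [a, b, c]
        System.out.println(mixed.split("\n").length);  // 2: "a\r" and "b\rc"
    }
}
```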

The repeat(int count) method (Java 11) replaces clunky StringBuilder loops:

java
"ab".repeat(3)   // "ababab"
"-".repeat(40)   // a divider line

One of my favorites is the new formatted(Object...) method (Java 15), which is an instance-method alias for format(...):

java
// Before
String msg = String.format("User %s has %d points", name, points);

// After
String msg = "User %s has %d points".formatted(name, points);

The advantage isn’t just style: it reads left-to-right, with the format string where it naturally belongs. It also chains cleanly with text blocks.

Text Blocks

Text blocks (JEP 378, Java 15) are the most substantial addition in this list.

The """ delimiter lets you write multi-line strings without ugly explicit newlines or escape sequences, and has sensible indentation handling:

java
String json = """
        {
            "name": "%s",
            "active": true
        }
        """.formatted(name);

The key rule to internalize is how incidental whitespace is stripped.

The compiler takes the least-indented non-blank line in the block as its baseline, and the line holding the closing """ counts too. The indentation of that baseline is stripped from every line. The closing delimiter’s position is therefore a lever: push it left to retain more leading whitespace, align it with the content to strip it all.

java
String a = """
           hello
           """;   // closing """ at same indent -> "hello\n"

String b = """
           hello
        """;       // closing """ 3 spaces left of content -> "   hello\n"

Two escape sequences were added alongside text blocks:

  • \ at the end of a line suppresses the newline. This way we can break a long string across source lines without embedding a newline in the value, similar to line continuation in a shell script. It’s only valid inside text blocks.

  • \s translates to a single space and also works in regular string literals. In a text block it acts as a trailing-space anchor: escapes are processed after incidental whitespace stripping, so the \s and any spaces before it survive, where they would otherwise be eaten.
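
Both escapes in action, as a small sketch that needs Java 15 or later. The class TextBlockEscapes and its method names are illustrative.

```java
final class TextBlockEscapes {

    // The trailing backslash joins both source lines into one
    static String joined() {
        return """
                Hello, \
                world!""";
    }

    // \s becomes a space and protects the two spaces in front of it
    static String padded() {
        return """
                a  \s
                b
                """;
    }

    public static void main(String[] args) {
        System.out.println(joined());                           // Hello, world!
        System.out.println(padded().equals("a   \nb\n"));       // true: 'a' plus three spaces
    }
}
```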

String Templates: The Feature that didn’t make it

If you’ve been waiting for String Templates to allow inline expressions like STR."Hello, \{name}!", you can stop waiting.

After two preview rounds in Java 21 and 22 (JEP 430, JEP 459) and a proposed third (JEP 465), the feature was completely withdrawn in Java 23 due to design complexities, and the team went back to the drawing board.

As of early 2026, there’s sadly no replacement on the horizon. Use formatted() and text blocks to cover most of the practical ground in the meantime.


Encoding and Unicode: The Sharp Edges

A char in Java is a UTF-16 code unit, not a Unicode code point. For the Basic Multilingual Plane (most Latin, Greek, Cyrillic, CJK, and other common scripts) those are the same thing. But Unicode defines characters up to U+10FFFF, and anything above U+FFFF requires two code units: a surrogate pair. Emoji and some CJK extension characters fall into this range.

The consequence is that charAt() and length() can mislead us:

java
String s = "😀";
s.length();        // 2 -> two code units, not one character
s.charAt(0);       // '\uD83D' -> half a surrogate pair, meaningless alone
s.codePointAt(0);  // 128512 -> the actual code point

If your code processes user-provided text and does anything beyond simple ASCII, like counting characters, slicing, iterating, etc., use codePoints() instead of chars(), and codePointAt() instead of charAt().

The chars() and charAt() variants work at the code unit level and will silently produce wrong results on any string containing characters outside the BMP.
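
Counting and slicing by code point could look like the following sketch. The class CodePointDemo and both helpers are illustrative. Keep in mind that a code point is still not always one user-perceived character, since combining marks and emoji sequences span several code points.

```java
final class CodePointDemo {

    // Number of Unicode code points, not UTF-16 code units
    static int countCodePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Keeps the first n code points without cutting a surrogate pair in half
    static String firstCodePoints(String s, int n) {
        int end = s.offsetByCodePoints(0, Math.min(n, countCodePoints(s)));
        return s.substring(0, end);
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00b"; // 'a', grinning face emoji, 'b'

        System.out.println(s.length());                      // 4 code units
        System.out.println(countCodePoints(s));              // 3 code points
        System.out.println(firstCodePoints(s, 2).length());  // 3: 'a' plus both surrogates
    }
}
```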

Another sharp edge shows up when converting between String and byte[], as every conversion involves a Charset, whether we pass one or not.

Without a charset argument, getBytes() and new String(bytes) used to be a platform-dependent nightmare, because they relied on the platform’s default charset. That default finally became UTF-8 in Java 18 (JEP 400).

Always passing StandardCharsets.UTF_8 explicitly is still a great practice for readability and backward compatibility, but the risk of platform mismatch is largely gone in modern Java.
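
What a mismatch looks like can be reproduced deliberately. In this sketch, the class CharsetRoundTrip and the method roundTrip() are illustrative.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class CharsetRoundTrip {

    // Encodes with one charset and decodes with another
    static String roundTrip(String text, Charset encodeWith, Charset decodeWith) {
        byte[] bytes = text.getBytes(encodeWith);
        return new String(bytes, decodeWith);
    }

    public static void main(String[] args) {
        String name = "Pawe\u0142"; // ł takes two bytes in UTF-8

        System.out.println(name.getBytes(StandardCharsets.UTF_8).length); // 6

        String ok = roundTrip(name, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        System.out.println(ok.equals(name)); // true

        String broken = roundTrip(name, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        System.out.println(broken.equals(name)); // false: each byte became its own char
        System.out.println(broken.length());     // 6
    }
}
```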


Beyond the API

For the vast majority of our day-to-day work, Java’s String behaves exactly as we expect. The JVM does an incredible job of managing text efficiently, and the stability of the API over the last three decades is one of Java’s greatest design achievements. It allowed massive internal upgrades, like the shift to Compact Strings in Java 9 or invokedynamic concatenation, without breaking a single line of application code.

However, solid abstractions can still leak sometimes.

A single non-Latin-1 character like an emoji can quietly double the memory footprint. Re-parsing the same text from a network stream over and over circumvents the String pool entirely. And blindly leaning on intern() for dynamic inputs can actively degrade lookup performance across the whole application.

When a heap dump shows unexpected memory pressure, or when a heavily trafficked loop is suddenly thrashing the GC, having a solid mental model of Java’s most common type can take a lot of guesswork out of the equation.

Next time the profiler points to java.lang.String, you’ll know exactly what’s actually happening.

