Designing a Crash-Resilient Binary Format (Part 2)
Framing, Checksums, and Making Disk Writes Honest
In the previous article, we built a fast append-only log using Java’s Panama FFM API.
It was clean, efficient, and… fundamentally broken.
Storage engines live in a hostile environment: power loss, torn writes, kernel crashes, and silent data corruption are not edge cases; given enough time and enough hardware, they are practically inevitable.
It’s time to design a proper on-disk record format with explicit framing and CRC checksums, implement defensive read paths, and make durability a conscious contract instead of a hopeful side effect.
A Bag of Bytes Isn’t a Database
The fast but quite simplistic RawLog from the last article was just a featureless bag of bytes with three fatal flaws preventing it from being a real storage system:
The Framing Problem:
We have no way to distinguish where one record ends and the next begins.

The Integrity Problem (Torn Writes & Bit Rot):
If the power fails halfway through writing an 8-byte value, we get a "torn write": a partially written, corrupted record.

We want to write (8 bytes):
[ 0xDE 0xAD 0xBE 0xEF 0xDE 0xAD 0xBE 0xEF ]

Power fails after 4 bytes... What's on disk:
[ 0xDE 0xAD 0xBE 0xEF 0x00 0x00 0x00 0x00 ]   <-- CORRUPTED DATA
  ^---- Written ----^ ^-- From before --^

Without an integrity check, we would read this corrupted data and believe it's valid.

The Durability Problem:
Our calls to write into a MemorySegment return as soon as the data is in the OS page cache (RAM). If the system crashes now, that data is gone. We need a way to make sure the data is physically on disk.
In this article, we will address all three.
We’ll forge a proper storage primitive (the StorageSegment class) by designing a self-describing, verifiable binary record format and implementing a vigilant RecordCodec to enforce data integrity.
When we’re done, the disk will finally be a safer place to put our data for this simplified engine.
Designing a Self-Describing, Defensive Layout
To ensure our data is parsable and safe, each piece of data we write must be self-contained, self-describing, and independently verifiable.
The Anatomy of a Record
A well-designed binary record format includes the data itself plus everything needed to verify its correctness. That requires adding a fixed-size header containing a checksum and length fields:
+---------------+---------------+---------------+---------------+---------------+
| CRC32C (4B) | Key Len (4B) | Val Len (4B) | Key Bytes | Value Bytes |
| (Checksum) | (Integer) | (Integer) | (Variable) | (Variable) |
+---------------+---------------+---------------+---------------+---------------+

A quick note on the integer format, as mentioned in the previous article: we specify little-endian byte order for cross-platform consistency.
Also, because records can start at any byte offset, our memory accesses must be unaligned.
Panama’s ValueLayout.JAVA_INT_UNALIGNED.withOrder(ByteOrder.LITTLE_ENDIAN) handles this for us, preventing potential crashes on certain CPU architectures.
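As a quick standalone illustration (a sketch, not part of the codec), an unaligned little-endian layout lets us read and write a 32-bit value at any byte offset:
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.nio.ByteOrder;

class UnalignedAccessDemo {
    // Same kind of layout the codec will use: 32-bit int, little-endian, no alignment requirement.
    private static final ValueLayout.OfInt INT_LE =
            ValueLayout.JAVA_INT_UNALIGNED.withOrder(ByteOrder.LITTLE_ENDIAN);

    public static void main(String[] args) {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment segment = arena.allocate(64);
            // Offset 13 is deliberately odd; the aligned JAVA_INT layout would
            // typically reject it, while the unaligned variant works anywhere.
            segment.set(INT_LE, 13, 42);
            System.out.println(segment.get(INT_LE, 13)); // prints 42
        }
    }
}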
How it Solves Our Problems
Adding the 12-byte fixed header solves two of the three problems:
Framing:
The Key Len and Val Len fields tell us exactly how many bytes to read next.

Integrity:
The CRC32C checksum is our integrity check. It’s calculated over the lengths, key bytes, and value bytes. After reading a record, we recalculate the CRC and compare it with the stored value; any mismatch means corruption.
CRC32C was chosen because it’s frequently hardware-accelerated on modern Intel/AMD CPUs with dedicated instructions, making this critical integrity check extremely fast.
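To make the “detect corruption via mismatch” idea concrete, here is a tiny standalone sketch using java.util.zip.CRC32C: flip a single bit in a payload and the checksum no longer matches.
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32C;

class ChecksumDemo {
    public static void main(String[] args) {
        byte[] payload = "hello".getBytes(StandardCharsets.UTF_8);

        CRC32C crc = new CRC32C();
        crc.update(payload);
        long original = crc.getValue();   // checksum of the intact payload

        payload[0] ^= 0x01;               // simulate bit rot: flip one bit
        crc.reset();
        crc.update(payload);
        long afterFlip = crc.getValue();

        System.out.println(original == afterFlip); // false: the corruption is detected
    }
}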
The remaining problem, durability, can’t be solved by the record structure itself; we’ll address it in code later, in the StorageSegment class.
Implementation: The RecordCodec
We need a translator that sits between our Java objects and the on-disk byte format.
Let’s create a dedicated RecordCodec class to handle the serialization logic and, critically, the checksum validation, keeping our StorageSegment class focused purely on I/O.
The In-Memory Representation: DataRecord
First, let’s define our in-memory type and the constants for our on-disk layout.
A Java record is the perfect data carrier, and is as simple as it gets.
The custom toString helps to improve debuggability:
public record DataRecord(byte[] key, byte[] value) {
// Convenience creator
public static DataRecord of(String key, String value) {
return new DataRecord(key.getBytes(StandardCharsets.UTF_8),
value.getBytes(StandardCharsets.UTF_8));
}
@Override
public String toString() {
return new String(this.key, StandardCharsets.UTF_8)
+ " -> "
+ new String(this.value, StandardCharsets.UTF_8);
}
}

The data layout will be defined by constants in our codec, so we don’t have to deal with “magic numbers” all over the place:
public class RecordCodec {
public static final int HEADER_SIZE = 12;
// LAYOUT: [CRC (4b)] [KeyLen (4b)] [ValLen (4b)] [Key] [Value]
private static final long CRC_OFFSET = 0;
private static final long KEY_LEN_OFFSET = 4;
private static final long VAL_LEN_OFFSET = 8;
// Header bytes covered by the CRC: the two length fields (8 bytes)
private static final long CRC_COVERED_HEADER_SIZE = HEADER_SIZE - KEY_LEN_OFFSET;
// Maximum size for a single key or value (1MB for this simplified implementation)
private static final int MAX_RECORD_SIZE = 1_024 * 1_024;
}

The Scribe: “Write-Then-Backfill”
The most important part is writing a DataRecord to a MemorySegment.
The process might seem slightly counterintuitive at first, but it follows a specific three-step sequence designed for safety and efficiency.
Its job is to take a DataRecord and a targetSegment (a memory region to write into) and serialize the record according to our format.
We write the payload first, calculate the checksum from those written bytes, then backfill the CRC field. This ensures the checksum reflects the exact on-disk representation without requiring temporary buffers.
public class RecordCodec {
// ...
// Explicitly define little endian.
// This ensures our file format is portable across different CPU architectures.
private static final ValueLayout.OfInt INT_LAYOUT =
ValueLayout.JAVA_INT_UNALIGNED.withOrder(ByteOrder.LITTLE_ENDIAN);
public static int write(DataRecord record, MemorySegment targetSegment) {
int keyLen = record.key().length;
int valLen = record.value().length;
int totalSize = HEADER_SIZE + keyLen + valLen;
// Defensive Check: Ensure the record isn't absurdly large.
if (keyLen < 0 || valLen < 0 || keyLen > MAX_RECORD_SIZE || valLen > MAX_RECORD_SIZE) {
throw new IllegalArgumentException("Record is too large.");
}
// Defensive Check: Ensure the target segment is large enough.
if (targetSegment.byteSize() < totalSize) {
throw new IllegalArgumentException("Target segment is too small to fit the record.");
}
// STEP 1: Write lengths and data payload first.
// We use our predefined layout to keep it portable.
targetSegment.set(INT_LAYOUT, KEY_LEN_OFFSET, keyLen);
targetSegment.set(INT_LAYOUT, VAL_LEN_OFFSET, valLen);
// Copy key bytes
MemorySegment keyDst = targetSegment.asSlice(HEADER_SIZE, keyLen);
MemorySegment keySrc = MemorySegment.ofArray(record.key());
keyDst.copyFrom(keySrc);
// Copy value bytes
MemorySegment valDst = targetSegment.asSlice(HEADER_SIZE + keyLen, valLen);
MemorySegment valSrc = MemorySegment.ofArray(record.value());
valDst.copyFrom(valSrc);
// STEP 2: Calculate CRC over the data payload (lengths + key + value).
// We slice the segment to exclude the CRC field.
// Java's CRC32C accepts a ByteBuffer, and MemorySegment can expose one as a view.
// NOTE: the ByteBuffer view is a small per-call heap allocation, a reasonable
// trade-off for the article, though a hot production path might avoid it.
MemorySegment dataSlice = targetSegment.asSlice(KEY_LEN_OFFSET, totalSize - KEY_LEN_OFFSET);
CRC32C crc = new CRC32C();
crc.update(dataSlice.asByteBuffer());
// STEP 3: Backfill the CRC value
targetSegment.set(INT_LAYOUT, CRC_OFFSET, (int) crc.getValue());
return totalSize;
}
}

Writing is solved. Now we need to read safely!
The Guardian: Reading Records with “Verify-Before-Allocate”
When reading, we should be paranoid.
The Verify-Before-Allocate pattern protects us from OutOfMemoryError or junk data caused by corruption.
Reading the record back is where our design pays off:
Step 1: Read the Header
We optimistically read the entire fixed-size header (storedCrc, keyLen, and valLen) from the source segment.

Step 2: Integrity Check
Before we even allocate byte[] arrays for the key and value, we perform the integrity check: recalculate the CRC from the segment’s data portion and compare it with storedCrc.

Step 3: Fail Fast
Only after the data has been verified do we allocate any byte[]. If the checksums do not match, we immediately throw an IOException and do not proceed. This prevents corrupted data from ever entering our application.

Step 4: Read Payload
If the checksum is valid, we can now safely read the key and value data.
public class RecordCodec {
// ...
public static DataRecord read(MemorySegment source) throws IOException {
// STEP 1: Read the fixed-size header.
int storedCrc = source.get(INT_LAYOUT, CRC_OFFSET);
int keyLen = source.get(INT_LAYOUT, KEY_LEN_OFFSET);
int valLen = source.get(INT_LAYOUT, VAL_LEN_OFFSET);
// STEP 2 and 3: Verify the data we have at each step and fail fast.
if (keyLen < 0 || valLen < 0 || keyLen > MAX_RECORD_SIZE || valLen > MAX_RECORD_SIZE) {
throw new IOException("Corrupt Record: Invalid dimensions.");
}
// More defensive Checks.
// A corrupted length could point past the end of the segment.
// This prevents a crash/exception when we create the dataSlice.
long totalRecordSize = (long) HEADER_SIZE + keyLen + valLen;
if (totalRecordSize > source.byteSize()) {
throw new IOException("Corrupted record: size exceeds segment bounds.");
}
// Verify checksum before trusting the rest of the data
MemorySegment dataSlice = source.asSlice(KEY_LEN_OFFSET, CRC_COVERED_HEADER_SIZE + keyLen + valLen);
CRC32C crc = new CRC32C();
crc.update(dataSlice.asByteBuffer());
if (storedCrc != (int) crc.getValue()) {
throw new IOException("CRC mismatch: data is corrupted.");
}
// STEP 4: CRC was valid, it's safe to allocate and read data
byte[] key = new byte[keyLen]; // dangerous!
byte[] value = new byte[valLen];
// Copy out key bytes
MemorySegment.ofArray(key)
.copyFrom(source.asSlice(HEADER_SIZE, keyLen));
// Copy out value bytes
MemorySegment.ofArray(value)
.copyFrom(source.asSlice(HEADER_SIZE + keyLen, valLen));
return new DataRecord(key, value);
}
}

We safely read back the data and verify its integrity thanks to the checksum.
The only “dangerous” part left is that keyLen and valLen need validation in a production-grade system. We cap sizes with MAX_RECORD_SIZE, but if we ever lifted that guard or allowed unbounded sizes, a corrupt file could still trigger an OutOfMemoryError. For the article’s purpose, a strict upper bound keeps things honest and simple.
Let’s take a look at a common crash scenario and how the codec would handle a torn write.
Imagine the power fails while writing our key="a", value="bc" record, right after the key byte 0x61 is written.
Written on disk: [Header...][0x61][garbage...]
When we try to read this record, the following happens:
- The codec reads the header. It expects a key of length 1 and a value of length 2.
- The CRC is recalculated over the actual bytes on disk: [Header fields][0x61][garbage...].
- The checksum mismatch is detected immediately, and the read fails fast with an IOException.

The torn write is detected, and the corrupt data never enters our application. Looks good to me!
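The sketch below turns this scenario into a quick unit test (assuming JUnit 5 and the RecordCodec above; instead of actually cutting power, we zero the value bytes to simulate the torn tail):
@Test
void tornWriteIsDetectedByTheCodec() {
    try (Arena arena = Arena.ofConfined()) {
        MemorySegment segment = arena.allocate(64);
        RecordCodec.write(DataRecord.of("a", "bc"), segment);

        // Simulate the torn tail: the value bytes never made it to disk.
        // Header (12 bytes) + key (1 byte) = 13, so the value sits at offsets 13-14.
        segment.set(ValueLayout.JAVA_BYTE, 13, (byte) 0);
        segment.set(ValueLayout.JAVA_BYTE, 14, (byte) 0);

        // The recalculated CRC no longer matches the stored one.
        assertThrows(IOException.class, () -> RecordCodec.read(segment));
    }
}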
Framing & Integrity
With our RecordCodec complete, we have forged a critical link in our storage engine.
We have translated our abstract on-disk design into a safe, reliable contract.
We’ve Solved Framing:
The codec can unambiguously parse a stream of records.

We’ve Solved Integrity:
The read() method acts as a vigilant gatekeeper, ensuring that no corrupted data can ever pass into our application.
Our code now has a way to safely speak the language of the disk.
The next step is to put this translator to work and address our final problem: durability.
Durability: From RawLog to StorageSegment
It’s time to evolve our naïve RawLog class into a more capable StorageSegment class.
This class will use RecordCodec and add the final piece of the puzzle: durability.
Writing to a MemorySegment is fast because it only modifies the OS’s in-memory page cache.
To make data durable, we must command the OS to flush this cache to the physical disk.
The MemorySegment#force() method is our durability barrier: it asks the operating system to flush the modified pages of the mapped region to stable storage.
Think of it as fsync-like behavior for memory-mapped I/O, turning “it’s in RAM somewhere” into “the OS has been told to persist it.”
App Code -------> Manipulate MemorySegment
(Our Process) |
v
Page Cache
(Operating System Kernel RAM)
|
mappedSegment.force()
request flush
|
v
Physical Disk
(SSD / HDD - Non-Volatile)

As code, it looks like this:
// StorageSegment.java (replaces RawLog.java)
public class StorageSegment implements AutoCloseable {
private final Arena arena;
private final MemorySegment mappedSegment;
private final long maxSize;
private long writeOffset = 0;
public StorageSegment(Path path, long maxSize) throws IOException {
this.arena = Arena.ofConfined();
this.maxSize = maxSize;
try (FileChannel fc = FileChannel.open(path, CREATE, READ, WRITE)) {
fc.truncate(maxSize);
fc.position(maxSize - 1);
fc.write(java.nio.ByteBuffer.wrap(new byte[]{0}));
this.mappedSegment = fc.map(FileChannel.MapMode.READ_WRITE, 0, maxSize, arena);
}
}
/**
* @return the offset where the DataRecord was written
*/
public long append(DataRecord record) {
long currentOffset = this.writeOffset;
// Don't try to write beyond the mapping
long remaining = maxSize - currentOffset;
if (remaining <= 0) {
throw new IllegalStateException("StorageSegment full");
}
// Create a safety-bounded slice for the codec to write to
MemorySegment recordSlice = this.mappedSegment.asSlice(currentOffset, remaining);
int bytesWritten = RecordCodec.write(record, recordSlice);
// Check if the write made sense
if (bytesWritten < 0 || bytesWritten > remaining) {
throw new IllegalStateException("Invalid length written: " + bytesWritten);
}
this.writeOffset += bytesWritten;
return currentOffset;
}
public DataRecord read(long offset) throws IOException {
// Check we're actually trying to read within the segment
if (offset < 0 || offset >= this.maxSize) {
throw new IndexOutOfBoundsException();
}
// Create a safety-bounded slice for the codec to read from
long remaining = this.maxSize - offset;
MemorySegment recordSlice = this.mappedSegment.asSlice(offset, remaining);
return RecordCodec.read(recordSlice);
}
/**
* @return current size of data written to this segment
*/
public long size() {
return this.writeOffset;
}
/**
* @return maximum capacity of this segment
*/
public long capacity() {
return this.maxSize;
}
public void flush() {
this.mappedSegment.force();
}
@Override
public void close() {
try {
flush();
} finally {
this.arena.close();
}
}
}

By combining the raw I/O power of MemorySegment with the validation logic of our RecordCodec, we have forged our first truly robust storage primitive with explicit framing, integrity verification, and an explicit flush hook: StorageSegment.
One important nuance: force() doesn’t magically make this fully crash-proof by itself. What it gives us is a tool to define a durability contract. Calling it on every append maximizes safety but costs throughput; calling it in batches instead improves performance but risks losing the most recent batch on a crash.
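As a sketch of what a batched flush policy could look like (a hypothetical wrapper around StorageSegment, not part of it):
// Hypothetical batching wrapper: flush once every N appends instead of on every write.
// On a crash, up to (flushEvery - 1) of the most recent records may be lost.
// A real implementation would also flush any pending writes on close.
class BatchedWriter {
    private final StorageSegment segment;
    private final int flushEvery;
    private int pendingWrites = 0;

    BatchedWriter(StorageSegment segment, int flushEvery) {
        this.segment = segment;
        this.flushEvery = flushEvery;
    }

    long append(DataRecord record) {
        long offset = segment.append(record);
        if (++pendingWrites >= flushEvery) {
            segment.flush();   // durability barrier for the whole batch
            pendingWrites = 0;
        }
        return offset;
    }
}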
A complete engine also needs recovery rules (e.g., scan until the first invalid CRC and truncate) and, usually, a persisted commit marker. Both are straightforward extensions once the format and flush primitive exist.
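For instance, a recovery scan on top of the API we already have could look like this (a rough sketch, assuming the StorageSegment and RecordCodec above; a real engine would also truncate or ignore everything past the returned offset):
// Hypothetical recovery helper: walk records from the start of the segment and
// stop at the first one that fails validation. Everything before that offset is
// trusted; the torn or never-written tail begins there.
static long findValidEnd(StorageSegment segment) {
    long offset = 0;
    while (offset < segment.capacity()) {
        try {
            DataRecord record = segment.read(offset);
            offset += RecordCodec.HEADER_SIZE + record.key().length + record.value().length;
        } catch (IOException | IndexOutOfBoundsException e) {
            break; // CRC mismatch, bogus lengths, or zeroed space: stop here
        }
    }
    return offset;
}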
Recap: Our First Storage Primitive That’s Ready for Reality
We have systematically eliminated the three critical flaws:
Framing is solved:
By encoding key and value lengths in the header, records are unambiguously parsable.

Integrity is verifiable:
CRC32C turns corruption and torn writes into detectable failures. We no longer have to simply “trust” the hardware on read. A record is either valid or it’s rejected; no “garbage in, garbage out.”

Durability is explicit:
Mapped writes land in the OS page cache first. By calling mappedSegment.force(), we have an explicit flush barrier and make persistence a conscious contract rather than a hopeful side effect. When we call it (every write vs. batching vs. close) defines how much recent data we’re willing to risk on a crash.
Taken together, StorageSegment is no longer a bag of bytes.
It’s a storage primitive with structure, defensive reads, and a real durability lever.
Strong enough to build on, and honest about what still needs to be layered on top.
Putting It All to the Test
Let’s revisit our “HelloWorld” experiment with the new, robust StorageSegment, and sanity-check that it behaves as intended.
We’ll write two records, r1 and r2, then close the segment to ensure everything is flushed to disk.
When we reopen the file, we’ll read back the first record using its offset and verify the data is intact.
public class Main {
public static void main(String[] args) throws Exception {
Path segmentFile = Files.createTempFile("storage-segment-", ".log");
long offset1, offset2;
try (StorageSegment segment = new StorageSegment(segmentFile, 4_096)) {
System.out.println("Writing DataRecords...");
DataRecord r1 = DataRecord.of("Hello", "World");
offset1 = segment.append(r1);
System.out.println("Written: " + r1);
DataRecord r2 = DataRecord.of("Panama", "rocks!");
offset2 = segment.append(r2);
System.out.println("Written: " + r2);
}
// Re-open and read back to confirm it's persistent and parsable
System.out.println("\nRe-opening segment for reading...");
try (StorageSegment segment = new StorageSegment(segmentFile, 4_096)) {
DataRecord r = segment.read(offset1);
System.out.println("Read: " + r);
r = segment.read(offset2);
System.out.println("Read: " + r);
}
}
}

Running this shows we have solved framing and durability in this simplified design:
Writing DataRecords...
Written: Hello -> World
Written: Panama -> rocks!
Re-opening segment for reading...
Read: Hello -> World
Read: Panama -> rocks!But walking along the “happy path” is easy.
A storage system’s promises of integrity should be validated under adversarial conditions.
To validate our CRC protection, we need a test that doesn’t just verify the happy path, but actively attempts to break our assumptions. We’ll write a valid record, intentionally corrupt it on disk, and verify that our codec detects the tampering.
This kind of adversarial testing is critical for storage systems, as it shows our checksums behave as intended, not just that they compile.
class CorruptionTest {
@Test
void shouldThrowExceptionOnCorruptedRecord() throws IOException {
Path segmentFile = Files.createTempFile("corrupt-", ".log");
DataRecord originalRecord = new DataRecord("key".getBytes(), "value".getBytes());
// STEP 1: Write a valid record.
try (StorageSegment segment = new StorageSegment(segmentFile, 1_024)) {
segment.append(originalRecord);
}
// STEP 2: Manually corrupt the file.
// We open the file with standard I/O to tamper with its bytes.
try (FileChannel fc = FileChannel.open(segmentFile, StandardOpenOption.WRITE)) {
// Corrupt the last byte of the value: 'e' -> 'f'.
// Offset of that byte: header(12) + keyLen(3) + valLen(5) - 1 = 19
long corruptionOffset = RecordCodec.HEADER_SIZE + originalRecord.key().length + originalRecord.value().length - 1;
fc.write(java.nio.ByteBuffer.wrap(new byte[]{'f'}), corruptionOffset);
}
// STEP 3: Try to read the record and assert that it fails.
try (StorageSegment segment = new StorageSegment(segmentFile, 1024)) {
assertThrows(IOException.class, () -> segment.read(0));
}
}
}

The passing test is a strong signal: our StorageSegment is not just a passive container of bytes.
Our codec validation works as designed.
The Question of Deletion
We can append and read records, but attentive readers will notice we haven’t addressed deletion yet.
What happens when we need to remove a key?
In an append-only system, we can’t truly “remove” bytes from the middle of a file. That would require shifting gigabytes of subsequent data, destroying our performance goals and violating our immutability principle.
Instead, we use a pattern called tombstoning: to delete a key, we append a special marker record with an empty value.
private static final byte[] TOMBSTONE_VALUE = new byte[0];
public void delete(byte[] key) {
// Append a record with an empty value as a deletion marker
DataRecord tombstone = new DataRecord(key, TOMBSTONE_VALUE);
segment.append(tombstone);
// When we build an index in Article III, it will point to this tombstone
}

When someone tries to get() a deleted key, we read the tombstone and return Optional.empty().
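A minimal sketch of how a read path might interpret that marker (a hypothetical helper; the real index-backed get() arrives in Article III):
// Hypothetical helper: translate the stored record for a key into a lookup result.
// An empty value is our tombstone, so the key is reported as deleted/absent.
// Note: this simple convention means we cannot store legitimately empty values.
static Optional<byte[]> toLookupResult(DataRecord record) {
    if (record == null || record.value().length == 0) {
        return Optional.empty();
    }
    return Optional.of(record.value());
}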
This works elegantly, but it creates a new problem: the disk accumulates garbage. Every update appends a new value and leaves the old one orphaned, and every delete leaves a tombstone behind. Our log grows forever, consuming disk space with dead data.
We’ll tackle this space management problem in Article III with a technique called compaction—a garbage collector for our storage files.
The Foundation is Solid, But Speed is Calling
We have successfully addressed the three critical flaws from Article I. Our data now has Structure, Integrity, and Durability. The foundation is now rock solid, and the storage engine is no longer a toy or simple experiment.
But there’s still a fatal flaw lurking underneath it all: reads are slow.
How do we find the data for a specific key, like "Panama"?
Reading a record means knowing its exact byte offset.
Right now, the only way is to start at the very beginning, decode the first record, calculate its on-disk size, seek forward to the next record, decode it, and so on, until we find the key we’re looking for.
This is a full sequential scan, and a major performance problem.
It’s correct, and fine for a few kilobytes. But for a 1 GB segment file, it’s catastrophic.
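To make the cost concrete, here’s what such a lookup might look like (a hypothetical helper on top of StorageSegment; it assumes the segment was written in the current process, since this simplified design doesn’t yet recover the write offset on reopen):
// Hypothetical full-scan lookup: O(total bytes written) per key.
// The last matching record wins, so updates and tombstones override older values.
static Optional<DataRecord> findLatest(StorageSegment segment, byte[] key) throws IOException {
    DataRecord latest = null;
    long offset = 0;
    while (offset < segment.size()) {
        DataRecord record = segment.read(offset);
        if (Arrays.equals(record.key(), key)) {
            latest = record;
        }
        // Advance past this record: fixed header + key + value
        offset += RecordCodec.HEADER_SIZE + record.key().length + record.value().length;
    }
    return Optional.ofNullable(latest);
}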
In Part III, we’ll confront these problems head-on by building a GC-free off-heap hash index, and then use it to create a disk garbage collector through compaction.
That’s where the system becomes real, tackling the fundamental trade-offs found in every storage engine.
