Module: Storage Model | Duration: 20 min read | Lesson: 1 of 14

You've been using Kafka for a year. You know topics, partitions, producers, consumers. Then in a system-design interview, a staff engineer asks: "Why is Kafka fast?"

The textbook answers, "horizontal scaling, batching, page cache", are correct but unsatisfying. They describe consequences. The real answer is one idea so simple it sounds wrong: Kafka is fast because its storage primitive is an append-only file, and modern operating systems are absurdly fast at appending to files.

Every design choice in Kafka, the API, replication, retention, compaction, tiered storage, descends from that one decision. Get this right and Kafka stops being a magic black box. Get it wrong and you'll keep blaming "the JVM" or "the network" for problems that are actually about not understanding the log.

This course opens the hood on Kafka. We start here, with the storage model, because every other internal makes sense only after you've seen this one.

2. Concept Explanation

Why the Industry Believed Disk Was Slow

For decades, the consensus was: disk is slow, RAM is fast, keep things in RAM as much as possible. That's why traditional message brokers (ActiveMQ, RabbitMQ in default configs) tried to keep messages in memory and only spilled to disk when forced.

The consensus was wrong, or rather, half-right. Random disk I/O is slow. Sequential disk I/O on modern hardware is in the same league as RAM access. On a typical SATA SSD: ~500 MB/s sequential, ~50 MB/s random. On NVMe: 3-7 GB/s sequential. Spinning rust: ~150 MB/s sequential, but the random penalty is brutal (~1 MB/s).

Kafka's authors realized: if you confine yourself to sequential reads and writes, disk stops being the bottleneck. Network and CPU become the new ceilings, both of which scale horizontally far more easily than disk seeks ever did.

The Storage Model in One Sentence

Each partition is an append-only sequence of log segments on a single broker's local filesystem.

That is the entire storage architecture. No B-trees. No indexes that get rewritten. No locks held across many records. No tombstones-in-place. Just append.

A broker storing partition orders-7 has a directory:

/var/lib/kafka/data/orders-7/
  00000000000000000000.log         # earliest segment, oldest records
  00000000000000000000.index       # offset → file-position
  00000000000000000000.timeindex   # timestamp → offset
  00000000000000125000.log         # next segment, starts at offset 125000
  00000000000000125000.index
  00000000000000125000.timeindex
  00000000000000250000.log         # active segment (currently being written)
  00000000000000250000.index
  00000000000000250000.timeindex

The filename of each .log file is the base offset of that segment, the offset of its first record. Index files are keyed off the same base offset. The active segment is the only one Kafka writes to. Older segments are immutable until they're deleted (delete policy) or rewritten (compact policy).

What's in a Segment File

Each .log file is a sequence of record batches:

┌─────────────────────────────────────────┐
│  RecordBatch                            │
│   - baseOffset                          │
│   - lastOffsetDelta                     │
│   - producerId, producerEpoch, baseSeq  │  (for idempotence; Lesson 08)
│   - magic, attributes, crc              │
│   - records[]:                          │
│     - offsetDelta                       │
│     - timestampDelta                    │
│     - keyLen, key                       │
│     - valueLen, value                   │
│     - headers[]                         │
├─────────────────────────────────────────┤
│  RecordBatch                            │
│   ...                                   │
└─────────────────────────────────────────┘

Key points:

Records are stored inside batches. The protocol unit is the batch, not the individual record. This is fundamental, compression, idempotence, and transactions all operate at the batch level.
Offsets are deltas within a batch. A batch stores baseOffset once, then per-record offsetDelta. Saves space.
The on-disk format is identical to the wire format. Producers send a batch as bytes, the broker writes those bytes directly to the segment, consumers receive those bytes back. No transcoding. This enables zero-copy reads via sendfile().

Zero-Copy and Page Cache, The Two Speed Tricks

When a consumer fetches data, the broker conceptually does:

open segment file → read N bytes → send them over socket

The naive implementation copies bytes: disk → kernel → user-space (broker JVM) → kernel → network. That's 4 copies and 2 context switches.

Kafka uses the Linux sendfile(2) system call: tell the kernel "send bytes from file descriptor X to socket Y." The kernel copies bytes directly from the page cache to the NIC's DMA buffer. Zero copies in user space, zero context switches. This is the famous "zero-copy" optimization, and it's why a Kafka broker can saturate a 10 GbE NIC with one or two CPU cores worth of work.

For this to work, the on-disk bytes must be exactly what the wire expects. That's why Kafka's format is identical on disk and on the wire, any transformation would force user-space copies.

The second trick is the page cache. When records are produced, the broker write()s them to the segment file. The kernel buffers them in the page cache, eventually flushing to disk. When consumers fetch recent records, those same pages are still in the cache, RAM-speed reads, no disk hit. Kafka makes no attempt to cache records inside the JVM; the OS already does it, and far more efficiently. This is also why Kafka brokers prefer 64+ GB RAM machines, most of the RAM is for the page cache, not the JVM heap.

Why Reads and Writes Don't Interfere (Much)

Because the active segment is append-only, writes are pure sequential appends to one position. Reads happen at other positions in the file (consumers reading older data, or readers slightly behind the head). The OS prefetcher and the page cache make this work beautifully.

If consumers and producers are at the head of the log together (the "tail", recent data), they share the same hot pages. If consumers are far behind (an analytics job replaying yesterday), they pull cold pages off disk into the cache. As long as the working set fits in RAM, throughput is RAM-speed.

This is why a single slow consumer reading old data can occasionally cause measurable disk I/O, it's pulling cold pages, even though current producers and consumers are flying. Lesson: in a tiered-storage world (Lesson 11), historical reads are offloaded to S3 entirely.

Offsets Are File Positions in Disguise

A Kafka offset is conceptually a 64-bit integer counter, but operationally, it's the index into the log. Given an offset:

Binary-search the segment files (by base offset) to find the segment containing it.
Use the .index file to find the byte offset within that segment.
Seek there and read the batch.

The .index file is a sparse map: offset → file_position, written every index.interval.bytes (default 4 KB) of log written. So an exact-offset lookup is two file accesses: index seek, log seek. Effectively O(log N) segment lookup + O(1) in-segment seek.

Why Append-Only Means Everything Is Simpler

Databases need B-trees because they support arbitrary inserts, updates, and deletes, operations whose write patterns are random. Random writes require seeks, splits, locking, and concurrency control. The result is amazing for WHERE id = ? lookups and unworkable for high-throughput streaming.

Kafka explicitly chose: we don't support arbitrary inserts. We don't support updates. We support append. In return:

No locks across records.
No fragmentation.
No write amplification.
Compression amortizes across whole batches.
Replication is just "copy the bytes."
Retention is just "delete old files."

This is why a Kafka broker on commodity hardware can sustain millions of messages per second per node. The mechanism is fundamentally simpler than a database.

Trade-offs You're Implicitly Accepting

Nothing is free. By choosing the log as the primitive, Kafka trades away:

No query interface. You can't ask "give me all messages where customer_id = 42." You can only scan sequentially.
No in-place mutation. Want to fix a bad record? Publish a new one. The old one stays.
Approximate retention. Deletion is segment-granular, not record-granular.
No fine-grained ACLs on subsets of data. A consumer either reads the whole partition or not.

These are intentional limitations. Every "feature" you might want to add (in-place delete, secondary indexes) would compromise the storage primitive that makes Kafka fast.

3. Worked Example

Let's go on a literal disk dive on a running Kafka broker.

# In a running broker container
ls -la /var/lib/kafka/data/orders-7/

-rw-r--r-- 1 kafka kafka 10485760 May  9 09:00 00000000000000000000.index
-rw-r--r-- 1 kafka kafka  1073741824 May  9 09:00 00000000000000000000.log
-rw-r--r-- 1 kafka kafka 10485760 May  9 09:00 00000000000000000000.timeindex
-rw-r--r-- 1 kafka kafka 10485760 May  9 11:32 00000000000000125847.index
-rw-r--r-- 1 kafka kafka   652134912 May  9 11:32 00000000000000125847.log
-rw-r--r-- 1 kafka kafka 10485760 May  9 11:32 00000000000000125847.timeindex

Observations:

Two segments. The first is rolled (1 GB, the segment.bytes default). The second is active (652 MB, still growing).
Index files are pre-allocated at 10 MB and filled sparsely.
The base offset of the active segment is 125,847, meaning the first segment held 125,847 records.

Decode a batch with kafka-dump-log

Kafka ships a tool to inspect segment files:

kafka-dump-log --files /var/lib/kafka/data/orders-7/00000000000000125847.log \
  --print-data-log | head -40

Sample output:

baseOffset: 125847 lastOffset: 125898 count: 52 baseSequence: 0 lastSequence: 51 \
 producerId: 5021 producerEpoch: 14 partitionLeaderEpoch: 27 isTransactional: false \
 isControl: false position: 0 CreateTime: 1715248801200 size: 8421 magic: 2 compresscodec: zstd crc: ...
  offset: 125847 CreateTime: 1715248801200 keySize: 8 valueSize: 142 sequence: 0 ...
  offset: 125848 CreateTime: 1715248801202 keySize: 8 valueSize: 137 sequence: 1 ...
  ...

Things to notice:

52 records packed in one batch. The Kafka unit is the batch.
size: 8421 for 52 records of ~140 bytes each: compression ratio is roughly 5x (52 × 140 / 8421).
producerId/producerEpoch/sequence, idempotence machinery (Lesson 08).
partitionLeaderEpoch, the leader-election generation (Lesson 05).

This is what's actually on disk. This is what's sent over the wire. This is what's served to consumers. One representation, end to end.

Watching the page cache work

With vmtouch (a free tool) you can see exactly which pages of a file are in cache:

vmtouch -v /var/lib/kafka/data/orders-7/00000000000000125847.log

For an active topic, you'll see the tail of the file (recent records) fully cached and older parts cold. That visualization is what Kafka's read pattern looks like in practice.

Bring up the lab. Internals lessons need a real multi-broker cluster: one broker hides too much. Clone the lab repo (one time, shared across all Kafka courses) and start the 3-broker cluster:

git clone https://github.com/petascalelabs/petascalelabs-lab-setup.git
cd petascalelabs-lab-setup/ingestion-and-transport/kafka-fundamentals-to-internals/kafka-internals/
./scripts/setup.sh

Verify the cluster, controller, and tooling are reachable:

./scripts/verify.sh
# expected: "Kafka 3.7 ready (3 brokers, KRaft controller), JMX exporters on :7071-7073, kafka-ui on :8082"

You are helping me bring up a 3-broker local Kafka cluster for the
"Apache Kafka: Internals & Protocol Deep Dive" course.

Lab artifacts (already in `petascalelabs-lab-setup/ingestion-and-transport/kafka-fundamentals-to-internals/kafka-internals/`):
  - docker-compose.yml: 3 KRaft-mode Kafka brokers, JMX Prometheus exporters per broker, kafka-ui dashboard
  - scripts/setup.sh, scripts/verify.sh, scripts/teardown.sh
  - scripts/break/*.sh helpers that kill a single broker or partition the network (for the replication and rebalance lessons)

My machine:
  OS: <fill in>
  RAM: <fill in GB>
  Docker version: <fill in>

Please walk me through:
1. Confirming Docker has at least 8 GB allocated (this lab needs more headroom than the fundamentals lab because three brokers run concurrently).
2. Any OS-specific prerequisites (Rosetta, WSL2, virtualization flags) for my OS.
3. How to read the verify output to confirm all three brokers are in-sync and the controller is healthy.
4. The right teardown command if I need to stop and reclaim disk.

Do not assume my OS, ask if it's not clear.

Later lessons assume this cluster is running. Lessons that need broker-loss behavior call ./scripts/break/kill-broker.sh <N> and recover with ./scripts/break/start-broker.sh <N>.

Aha: Kafka's throughput isn't from clever indexes, it's from refusing to use them on write. An append-only log means every write is sequential. Sequential writes on a modern SSD hit the device's full bandwidth. Add a B-tree and you'd halve that overnight. Kafka's "primitive" storage is its competitive moat.

4. Your Turn

Tasks:

Explain in 3-4 sentences why Kafka's storage model is dramatically simpler than a database's, and what limitations that simplicity implies.
A junior engineer asks: "Why does Kafka need so much RAM if it stores everything on disk?" Explain page cache in 2-3 sentences.
Compute the worst-case offset lookup cost in a topic with 7 days of retention, 1 GB segments, and 1 MB/s ingest. (Hint: how many segments, and how does O(log segments) matter?)
Why can't Kafka use zero-copy (sendfile) for encrypted consumer connections (TLS)? What's the throughput implication?
The team is debating whether to put Kafka data on a 100 GB local SSD or a 1 TB networked EBS volume. List two factors in favor of each choice, focused on the storage primitive's needs.

5. Real-World Application

Jay Kreps' 2013 essay "The Log: What every software engineer should know about real-time data's unifying abstraction" is the philosophical foundation of Kafka. Reading it after this lesson is one of those before/after experiences. Kreps lays out why the append-only log is not just a storage trick but a fundamental abstraction for distributed systems.

Redpanda, a Kafka-API-compatible engine written in C++, rebuilt the broker from scratch to remove the JVM but kept the storage model identical. Same segments, same indices, same zero-copy story. They cite the storage primitive as the reason the rebuild was even possible, the JVM was incidental, the log was essential.

WarpStream and Apache Kafka tiered storage both extend the log model onto object storage (S3). The hot tail stays on local disk (zero-copy + page cache), but cold segments move to S3 transparently. The fact this works is a direct consequence of the storage model, segments are immutable and easy to migrate.

Pulsar (another streaming system) is sometimes pitched as "Kafka with better storage." Architecturally, Pulsar separates the serving layer from the storage layer (BookKeeper). The trade-off: more flexible storage, but no end-to-end zero-copy and a more complex operational model. Different choice, different costs, informed by understanding what Kafka's storage primitive is.

For job seekers: "why is Kafka fast?" is a classic interview question. The textbook answer is "sequential disk + zero-copy + page cache." The senior answer is "because its storage primitive is an append-only log, and every other design decision flows from that choice."

6. Recap + Bridge

Kafka's speed isn't magic, it's the inevitable consequence of choosing the append-only log as the storage primitive, which unlocks sequential disk I/O, page cache exploitation, and zero-copy reads, while accepting the trade-off of no in-place mutation or queryability.

In the next lesson, we zoom into log segments, indexes, and the page cache in finer detail, exactly how Kafka finds offset 12,398,712 in a topic that's been running for two years.