Module: Capacity & Performance | Duration: 20 min read | Lesson: 1 of 13
TheWorldShop's platform team gets a ticket: "We're launching live order tracking. Marketing expects 40,000 events per second at peak. How many Kafka brokers do we need?"
Dev, three weeks into owning the cluster, does what everyone does first. He guesses. "Six brokers, probably?" Six it is. Two weeks later the cluster falls over during a flash sale, not because CPU was pinned, but because the network cards on three brokers were saturated copying replicas to each other while producers were still pushing.
The painful part: the math was knowable up front. Capacity planning for Kafka isn't a dark art. It's arithmetic on four numbers (write throughput, replication factor, retention, and read fan-out) plus an honest look at which resource runs out first. This lesson covers that arithmetic, and the resource that, surprisingly often, turns out to be the bottleneck.
2. Concept Explanation
The Four Inputs
Every Kafka sizing exercise starts from four numbers. Get these from the team that owns the workload, not from a vendor spreadsheet.
- Write throughput in MB/s. Peak, not average. Take events/sec × average message size.
- Replication factor (RF). Almost always 3 in production. This multiplies your network and disk write load.
- Retention. How long bytes stay on disk before deletion. Drives total storage.
- Read fan-out. How many consumer groups read each message, and whether they read the tail (cheap, page cache) or replay history (expensive, cold disk).
Why Network Is Usually the First Wall
Here's the counterintuitive part. People size for disk and CPU. Kafka brokers usually run out of network first, because of replication.
When a producer writes 100 MB/s to a topic with RF=3 spread across the cluster, that 100 MB/s of ingest becomes:
- 100 MB/s of producer traffic landing on leader partitions.
- Each leader ships its data to 2 followers: 200 MB/s of replication traffic leaving leaders.
- Followers receive 200 MB/s.
So the cluster moves roughly 3x the raw ingest internally before a single consumer reads anything. Add consumers reading the tail, which costs another ingest × fan-out (by default consumers fetch from leaders, so RF does not multiply read traffic), and with just two consumer groups a "100 MB/s" workload is moving 500 MB/s across NICs. A 10 GbE card tops out near 1,250 MB/s in theory, far less under real packet sizes and bidirectional load. Replication alone can eat a NIC.
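The multiplier above can be sketched in a few lines of Python. This is a rough model, not a measurement: the function name and the choice of two tail-reading consumer groups are illustrative assumptions, and it assumes consumers fetch from leaders only.

```python
def cluster_traffic_mb_s(ingest_mb_s: float, rf: int, tail_groups: int) -> dict:
    """Break one ingest stream into the network flows it creates cluster-wide."""
    producer_in = ingest_mb_s                  # producers -> leaders
    replication = ingest_mb_s * (rf - 1)       # leaders -> followers
    consumer_out = ingest_mb_s * tail_groups   # leaders -> consumer groups (assumed leader-only fetch)
    return {
        "producer_in": producer_in,
        "replication": replication,
        "consumer_out": consumer_out,
        "total": producer_in + replication + consumer_out,
    }

# Hypothetical inputs: 100 MB/s ingest, RF=3, two consumer groups on the tail.
flows = cluster_traffic_mb_s(ingest_mb_s=100, rf=3, tail_groups=2)
print(flows)
```

With these inputs the total comes to 500 MB/s, of which only 100 MB/s is the ingest anyone asked about.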
The Sizing Formulas
Total stored bytes (the disk number):
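A minimal sketch of that calculation, assuming decimal units (1 TB = 1,000,000 MB) and ignoring compression and index overhead. The function name and the sample inputs are illustrative, not taken from the lesson's tooling.

```python
def total_stored_tb(ingest_mb_s: float, retention_hours: float, rf: int) -> float:
    """Bytes the cluster must hold: every retained byte exists RF times."""
    retention_seconds = retention_hours * 3600
    stored_mb = ingest_mb_s * retention_seconds * rf
    return stored_mb / 1_000_000   # MB -> TB, decimal units (assumption)

# Hypothetical workload: 100 MB/s, 24 hours of retention, RF=3.
example_tb = total_stored_tb(ingest_mb_s=100, retention_hours=24, rf=3)
print(f"{example_tb:.2f} TB")
```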
Then divide by per-broker usable disk, and pad. Never run Kafka disks above ~70% full: compaction, segment rolls, and recovery from a broker failure (when its replicas are reassigned onto survivors) all need headroom.
Network out of a leader broker (the NIC number):
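One way to write that estimate, assuming partition leaders are spread evenly across brokers and every consumer group reads from leaders. The function and parameter names are hypothetical.

```python
def leader_egress_mb_s(ingest_mb_s: float, rf: int,
                       consumer_groups: int, brokers: int) -> float:
    """Outbound MB/s from one broker, assuming leaders are spread evenly."""
    led_ingest = ingest_mb_s / brokers            # share of ingest this broker leads
    replication_out = led_ingest * (rf - 1)       # copies shipped to followers
    consumer_out = led_ingest * consumer_groups   # one copy per consumer group
    return replication_out + consumer_out

# Hypothetical: 100 MB/s ingest, RF=3, two consumer groups, five brokers.
per_broker = leader_egress_mb_s(ingest_mb_s=100, rf=3, consumer_groups=2, brokers=5)
print(f"{per_broker:.0f} MB/s out of each broker")
```

Skewed leadership breaks the even-spread assumption, so treat the result as a floor for the busiest broker rather than a ceiling.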
Broker count is the max of three independent constraints, not a single formula:
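A sketch of that max-of-three rule. All names and sample numbers are illustrative assumptions; the per-broker limits should already include whatever headroom you plan for, and the safety broker is added separately.

```python
import math

def brokers_needed(stored_tb, usable_tb_per_broker,
                   peak_nic_mb_s, usable_nic_mb_s_per_broker,
                   partition_replicas, replica_budget_per_broker):
    """Return the binding constraint, its broker count, and all three counts."""
    counts = {
        "disk": math.ceil(stored_tb / usable_tb_per_broker),
        "network": math.ceil(peak_nic_mb_s / usable_nic_mb_s_per_broker),
        "partitions": math.ceil(partition_replicas / replica_budget_per_broker),
    }
    binding = max(counts, key=counts.get)   # the resource that runs out first
    return binding, counts[binding], counts

# Hypothetical cluster inputs, before adding a safety broker.
binding, brokers, counts = brokers_needed(
    stored_tb=20, usable_tb_per_broker=2.8,
    peak_nic_mb_s=1500, usable_nic_mb_s_per_broker=700,
    partition_replicas=6000, replica_budget_per_broker=10000,
)
print(binding, brokers, counts)
```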
The cluster is sized by whichever runs out first. That's the whole discipline: compute all three, take the max, then add a safety broker so a single failure doesn't tip you over.
The Partition Budget Nobody Mentions
Brokers don't just hold bytes; they hold partition replicas, and each replica costs an open file handle, a chunk of page cache, and a slice of the controller's metadata. A modern KRaft broker is comfortable into the low tens of thousands of partition replicas; ZooKeeper-era clusters fell over far sooner. Past the budget, recovery time after a restart balloons (every partition must be re-opened and validated) and controller failover slows. Partition count is a real capacity dimension, not a free one.
Headroom and the N-1 Rule
Size so the cluster survives losing one broker at peak. When a broker dies, its leadership and replication load redistributes onto the survivors. If you were running 5 brokers at 80% NIC, losing one pushes the remaining four to 100% with no margin left, and the next burst starts a cascade. Plan each broker to sit near 60-70% of its limiting resource at peak so N-1 still fits under 100%.
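The N-1 check is a one-line calculation. The sketch below assumes the failed broker's load spreads evenly over the survivors, which is optimistic; the function name is illustrative.

```python
def utilization_after_failure(brokers: int, utilization: float) -> float:
    """Per-broker utilization once one broker's load lands on the survivors."""
    if brokers < 2:
        raise ValueError("need at least two brokers to survive a failure")
    return utilization * brokers / (brokers - 1)

# Five brokers at three different pre-failure utilization levels.
for util in (0.80, 0.70, 0.60):
    after = utilization_after_failure(brokers=5, utilization=util)
    print(f"{util:.0%} at N=5 -> {after:.1%} at N-1")
```

For five brokers, 80% becomes 100%, 70% becomes 87.5%, and 60% becomes 75%. Smaller clusters are hit harder because each survivor absorbs a larger share.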
Aha: "How many brokers?" is the wrong first question. The right one is "which resource runs out first?" For most streaming workloads it's the NIC, not the disk or CPU, because RF=3 means every byte produced is moved across the network three times before a consumer ever sees it. Size the network, and the broker count falls out.
3. Worked Example
TheWorldShop's order-tracking workload:
- 40,000 events/sec at peak, average 1.5 KB per event.
- RF = 3, retention = 3 days, two consumer groups both reading the tail.
Step 1, write throughput:
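Worked as a quick calculation, assuming decimal units (1 KB = 1,000 bytes, 1 MB = 1,000 KB); the variable names are illustrative:

```python
events_per_sec = 40_000
avg_event_kb = 1.5    # decimal KB, an assumption about how the size was measured

write_mb_s = events_per_sec * avg_event_kb / 1_000   # KB/s -> MB/s
print(f"peak write throughput: {write_mb_s:.0f} MB/s")
```

That gives 60 MB/s of peak ingest before replication.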
Step 2, storage:
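The same arithmetic in code, assuming the 60 MB/s peak is sustained for the whole retention window (a deliberately conservative assumption) and decimal terabytes:

```python
write_mb_s = 60        # peak ingest from Step 1
retention_days = 3
rf = 3

retention_seconds = retention_days * 24 * 3600            # 259,200 s
stored_tb = write_mb_s * retention_seconds * rf / 1_000_000
print(f"total stored: {stored_tb:.2f} TB")
```

The result is roughly 46.7 TB across the cluster, replicas included.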
At 70% max fill on 2 TB NVMe drives (1.4 TB usable each):
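A sketch of the division, assuming one 2 TB drive per broker (the lesson does not state the drive count) and about 46.656 TB to store:

```python
import math

stored_tb = 46.656                  # 60 MB/s x 3 days x RF 3
usable_tb_per_broker = 2.0 * 0.70   # one 2 TB drive at 70% max fill (assumption)

brokers_by_disk = math.ceil(stored_tb / usable_tb_per_broker)
print(f"brokers needed for disk: {brokers_by_disk}")
```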
Step 3, network egress per leader:
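A rough version of the egress estimate, assuming both consumer groups fetch from leaders and leadership is spread evenly. The broker counts in the loop are illustrative cluster sizes, not recommendations.

```python
write_mb_s = 60      # peak ingest from Step 1
rf = 3
tail_groups = 2      # both consumer groups read the tail

replication_out = write_mb_s * (rf - 1)    # leaders -> followers
consumer_out = write_mb_s * tail_groups    # leaders -> consumers
aggregate_egress = replication_out + consumer_out

print(f"aggregate leader egress: {aggregate_egress} MB/s")
for brokers in (3, 6, 34):   # illustrative cluster sizes
    print(f"{brokers:>2} brokers -> {aggregate_egress / brokers:.1f} MB/s per broker")
```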
Spread across brokers, this is tiny per node: a 10 GbE NIC (~1,000+ MB/s usable) is nowhere near saturated. Network is not the bottleneck here.
Step 4, take the max. Disk says 34 brokers. Network says ~3. Partition budget (say 200 partitions × RF 3 = 600 replicas) says ~1. Disk wins at 34.
That's a surprising answer, so challenge the inputs. Three days of retention at this volume is the real cost driver. If the lakehouse already persists these events and Kafka only needs them for 12 hours of replay buffer, retention drops 6x and so does broker count: ~6 brokers. The biggest capacity lever here is a retention conversation, not a hardware order.
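To see how strongly retention drives the answer, the disk constraint can be recomputed for a few windows. This is a sketch under the same assumptions as the worked example (60 MB/s sustained, RF=3, 1.4 TB usable per broker); the function name is illustrative, and the 24-hour row is added only for comparison.

```python
import math

def brokers_by_disk(write_mb_s, retention_hours, rf, usable_tb_per_broker):
    """Broker count implied by storage alone, in decimal TB."""
    stored_tb = write_mb_s * retention_hours * 3600 * rf / 1_000_000
    return math.ceil(stored_tb / usable_tb_per_broker)

for hours in (72, 24, 12):
    n = brokers_by_disk(write_mb_s=60, retention_hours=hours, rf=3,
                        usable_tb_per_broker=1.4)
    print(f"{hours:>2} h retention -> {n} brokers")
```

At 12 hours the disk-driven count is 6, which is before the safety broker is added.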
Verify the cluster and the load-generation tooling are reachable:
Run a quick throughput probe to see real numbers from your hardware:
Later lessons assume this cluster is running and reuse ./scripts/perf-test.sh.
4. Your Turn
Exercise: TheWorldShop adds a clickstream topic: 200,000 events/sec, 400 bytes each, RF=3, retention 24 hours, three consumer groups (two tail, one nightly full replay).
- Compute peak ingest in MB/s.
- Compute total cluster storage at 70% max disk fill on 4 TB drives (2.8 TB usable).
- Compute aggregate network egress, including the nightly replay group's full read.
- State which resource is the binding constraint and give the broker count with N-1 headroom.
- Propose one change to the inputs that would cut broker count the most, and justify it.
5. Real-World Application
Confluent's and AWS MSK's sizing guides both lead with network and storage, not CPU, for exactly the replication-multiplier reason above. MSK even publishes per-instance-type ingress/egress caps because customers kept saturating EBS and ENI bandwidth long before CPU.
LinkedIn, where Kafka was born, runs clusters sized by partition count and replication bandwidth, and famously moved heavy historical reads off the hot cluster precisely because cold replay reads pollute the page cache and steal NIC from live traffic.
The retention conversation is the single most common cost win consultants find. Teams set "7 days" as a reflex, triple their storage bill and broker count, and never replay more than the last hour. Tiered storage (Lesson 11) changes this math again by pushing cold bytes to S3.
For interviews: "size a Kafka cluster for X events/sec" is a classic system-design prompt. The senior move is to compute all three constraints, name network as the usual surprise, and then question the retention requirement instead of just multiplying.
6. Recap + Bridge
Kafka capacity planning is arithmetic on four inputs (throughput, RF, retention, fan-out) feeding three independent constraints (disk, network, partition budget). Broker count is the max of the three plus N-1 headroom, and the binding constraint is more often network or retention-driven storage than CPU.
Next, we tune the producer, the first place latency and throughput are won or lost, where batch.size, linger.ms, and compression turn that raw 60 MB/s into either a smooth stream or a thundering herd.