Capacity Planning: Sizing Brokers, Disks & Network

Module: Capacity & Performance | Duration: 20 min read | Lesson: 1 of 13


TheWorldShop's platform team gets a ticket: "We're launching live order tracking. Marketing expects 40,000 events per second at peak. How many Kafka brokers do we need?"

Dev, three weeks into owning the cluster, does what everyone does first. He guesses. "Six brokers, probably?" Six it is. Two weeks later the cluster falls over during a flash sale, not because CPU was pinned, but because the network cards on three brokers were saturated copying replicas to each other while producers were still pushing.

The painful part: the math was knowable up front. Capacity planning for Kafka isn't a dark art. It's arithmetic on four numbers (throughput, replication factor, retention, and message size) plus an honest look at which resource runs out first. This lesson is that arithmetic, and the resource you'll be surprised is usually the bottleneck.


2. Concept Explanation

The Four Inputs

Every Kafka sizing exercise starts from four numbers. Get these from the team that owns the workload, not from a vendor spreadsheet.

  1. Write throughput in MB/s. Peak, not average. Take events/sec × average message size.
  2. Replication factor (RF). Almost always 3 in production. This multiplies your network and disk write load.
  3. Retention. How long bytes stay on disk before deletion. Drives total storage.
  4. Read fan-out. How many consumer groups read each message, and whether they read the tail (cheap, page cache) or replay history (expensive, cold disk).

Why Network Is Usually the First Wall

Here's the counterintuitive part. People size for disk and CPU. Kafka brokers usually run out of network first, because of replication.

When a producer writes 100 MB/s to a topic with RF=3 spread across the cluster, that 100 MB/s of ingest becomes:

  • 100 MB/s of producer traffic landing on leader partitions.
  • Each leader ships its data to 2 followers: 200 MB/s of replication traffic leaving leaders.
  • Followers receive 200 MB/s.

So the cluster moves roughly 3x the raw ingest internally before a single consumer reads anything. Add consumers reading the tail at RF×fan-out and a "100 MB/s" workload is moving 500+ MB/s across NICs. A 10 GbE card tops out near 1,250 MB/s in theory, far less under real packet sizes and bidirectional load. Replication alone can eat a NIC.

The Sizing Formulas

Total stored bytes (the disk number):

storage_per_cluster = write_MBps × retention_seconds × replication_factor

Then divide by per-broker usable disk, and pad. You never run Kafka disks above ~70% full: compaction, segment rolls, and a single broker failure (whose partitions rebalance onto survivors) all need headroom.

Network out of a leader broker (the NIC number):

broker_egress = producer_in × (RF - 1)        # replication to followers
              + consumer_out                  # tail reads × fan-out

Broker count is max of three independent constraints, not a single formula:

brokers = max(
  ceil(total_storage / usable_disk_per_broker),
  ceil(total_network / usable_nic_per_broker),
  ceil(total_partitions / partitions_per_broker_budget)
)

The cluster is sized by whichever runs out first. That's the whole discipline: compute all three, take the max, then add a safety broker so a single failure doesn't tip you over.

The Partition Budget Nobody Mentions

Brokers don't just hold bytes, they hold partition replicas, and each replica costs an open file handle, a chunk of page cache, and a slice of the controller's metadata. A modern KRaft broker is comfortable into the low tens of thousands of partition replicas; ZooKeeper-era clusters fell over far sooner. Past the budget, recovery time after a restart balloons (every partition must be re-opened and validated) and controller failover slows. Partition count is a real capacity dimension, not free.

Headroom and the N-1 Rule

Size so the cluster survives losing one broker at peak. When a broker dies, its leadership and replication load redistributes onto the survivors. If you were running 5 brokers at 80% NIC, losing one pushes the rest past 100% and you cascade. Plan each broker to sit near 60-70% of its limiting resource at peak so N-1 still fits under 100%.

Aha: "How many brokers?" is the wrong first question. The right one is "which resource runs out first?" For most streaming workloads it's the NIC, not the disk or CPU, because RF=3 means every byte produced is moved across the network three times before a consumer ever sees it. Size the network, and the broker count falls out.


3. Worked Example

TheWorldShop's order-tracking workload:

  • 40,000 events/sec at peak, average 1.5 KB per event.
  • RF = 3, retention = 3 days, two consumer groups both reading the tail.

Step 1, write throughput:

40,000 events/s × 1.5 KB = 60,000 KB/s ≈ 60 MB/s ingest

Step 2, storage:

60 MB/s × (3 × 86,400 s) × RF 3
= 60 × 259,200 × 3
= 46,656,000 MB ≈ 46.7 TB across the cluster

At 70% max fill on 2 TB NVMe drives (1.4 TB usable each):

ceil(46,700 GB / 1,400 GB) = 34 brokers by disk

Step 3, network egress per leader:

replication out = 60 MB/s × (3 - 1) = 120 MB/s
consumer out    = 60 MB/s × 2 groups = 120 MB/s
total cluster egress ≈ 240 MB/s + 120 MB/s in = ~360 MB/s aggregate

Spread across brokers, this is tiny per node: a 10 GbE NIC (~1,000+ MB/s usable) is nowhere near saturated. Network is not the bottleneck here.

Step 4, take the max. Disk says 34 brokers. Network says ~3. Partition budget (say 200 partitions × RF 3 = 600 replicas) says ~1. Disk wins at 34.

That's a surprising answer, so challenge the inputs. Three days of retention at this volume is the real cost driver. If the lakehouse already persists these events and Kafka only needs them for 12 hours of replay buffer, retention drops 6x and so does broker count: ~6 brokers. The biggest capacity lever here is a retention conversation, not a hardware order.

# Bring up the lab. Operations lessons use the same 3-broker cluster as
# the internals course, plus tooling for load generation and metrics.
git clone https://github.com/petascalelabs/petascalelabs-lab-setup.git
cd petascalelabs-lab-setup/ingestion-and-transport/kafka-fundamentals-to-internals/kafka-operations/
./scripts/setup.sh

Verify the cluster and the load-generation tooling are reachable:

./scripts/verify.sh
# expected: "Kafka 3.7 ready (3 brokers, KRaft), JMX on :7071-7073, Grafana on :3000, kafka-producer-perf-test available"

Run a quick throughput probe to see real numbers from your hardware:

./scripts/perf-test.sh --records 1000000 --size 1500 --throughput -1
# reports MB/s and p99 latency your laptop's single-broker setup can sustain
You are helping me capacity-plan a Kafka cluster for the
"Apache Kafka: Operations, Performance & Reliability" course.

I want to compute broker count from first principles. Help me build a
small spreadsheet or script that takes these inputs and reports the
three independent constraints (disk, network, partition budget) plus
the recommended broker count with N-1 headroom:

  - peak events/sec
  - average message size (KB)
  - replication factor
  - retention (days)
  - number of consumer groups and whether they read the tail or replay
  - per-broker usable disk (GB) and NIC bandwidth (Gbit)

My environment:
  OS: <fill in>
  RAM: <fill in GB>

Walk me through the arithmetic for each constraint, then show which one
dominates and why. Do not assume my numbers, ask for the ones I leave blank.

Later lessons assume this cluster is running and reuse ./scripts/perf-test.sh.


4. Your Turn

Exercise: TheWorldShop adds a clickstream topic: 200,000 events/sec, 400 bytes each, RF=3, retention 24 hours, three consumer groups (two tail, one nightly full replay).

  1. Compute peak ingest in MB/s.
  2. Compute total cluster storage at 70% max disk fill on 4 TB drives (2.8 TB usable).
  3. Compute aggregate network egress, including the nightly replay group's full read.
  4. State which resource is the binding constraint and give the broker count with N-1 headroom.
  5. Propose one change to the inputs that would cut broker count the most, and justify it.

5. Real-World Application

Confluent's and AWS MSK's sizing guides both lead with network and storage, not CPU, for exactly the replication-multiplier reason above. MSK even publishes per-instance-type ingress/egress caps because customers kept saturating EBS and ENI bandwidth long before CPU.

LinkedIn, where Kafka was born, runs clusters sized by partition count and replication bandwidth, and famously moved heavy historical reads off the hot cluster precisely because cold replay reads pollute the page cache and steal NIC from live traffic.

The retention conversation is the single most common cost win consultants find. Teams set "7 days" as a reflex, triple their storage bill and broker count, and never replay more than the last hour. Tiered storage (Lesson 11) changes this math again by pushing cold bytes to S3.

For interviews: "size a Kafka cluster for X events/sec" is a classic system-design prompt. The senior move is to compute all three constraints, name network as the usual surprise, and then question the retention requirement instead of just multiplying.


6. Recap + Bridge

Kafka capacity planning is arithmetic on four inputs (throughput, RF, retention, fan-out) feeding three independent constraints (disk, network, partition budget). Broker count is the max of the three plus N-1 headroom, and the binding constraint is more often network or retention-driven storage than CPU.

Next, we tune the producer, the first place latency and throughput are won or lost, where batch.size, linger.ms, and compression turn that raw 60 MB/s into either a smooth stream or a thundering herd.