Module: Why Kafka | Duration: 20 min read | Lesson: 1 of 14

It's Black Friday. 11:14 AM. TheWorldShop's CTO, Aarav, is staring at a Slack channel that's lighting up red.

The fraud detection team is reporting that they're seeing fraudulent orders 45 minutes after they've shipped. The recommendation team says their "trending now" widget is showing yesterday's products. The warehouse team just got a midnight CSV dump of orders that they were expected to pick by 9 AM. And the analytics team is asking, for the third time, why their dashboard says revenue is $0 even though sales are clearly happening.

Aarav opens up the architecture diagram. Every team is pulling data the same way: a nightly batch job that snapshots the orders database, dumps it to S3, and notifies downstream consumers. It worked fine when TheWorldShop did 10,000 orders a day. Now they do 10,000 orders an hour. The batch model is no longer just slow, it's actively destroying the business.

Fraud needs to know about orders in seconds, not hours. Recommendations need clickstream data in minutes, not days. The warehouse needs orders as they happen, not as a midnight file drop. Analytics needs continuous totals, not a snapshot from 14 hours ago.

The problem isn't any single team's pipeline. The problem is that the entire company is built on request-response and batch thinking, when what the business actually needs is streams of events that flow continuously and let each team consume them at their own pace.

This is the problem Kafka was built to solve.

2. Concept Explanation

Two Models for Moving Data

There are essentially two ways to get data from System A to System B:

1. Request-Response (Pull): System B asks System A "give me your data," usually on a schedule. This is what every nightly ETL job, every cron-driven export, every "dump the table to CSV" pipeline does. It's simple. It's also fundamentally batched, by the time B reads, A has moved on. The freshest data B can have is exactly as fresh as the schedule allows.

2. Event Streaming (Push): A produces events as they happen, and any number of B's, C's, and D's subscribe to those events and react in near real time. A doesn't know or care who's listening. B doesn't have to ask, events arrive.

Both models have their place. But once your business depends on time-sensitive reactions (fraud, recommendations, alerting, IoT), or you have many consumers of the same data (analytics, warehousing, ML training, real-time dashboards), the request-response model collapses under its own coordination cost.

Why Point-to-Point Integration Breaks

TheWorldShop's current architecture is a point-to-point mess. Each team writes its own pipeline against the orders database:

             ┌─→ Fraud (queries DB every 10 min)
             ├─→ Recommendations (nightly batch)
Orders DB ───┼─→ Warehouse (midnight CSV)
             ├─→ Analytics (hourly snapshot)
             └─→ Email (cron-driven script)

For 5 consumers reading from 1 source, that's 5 different integrations. For N producers and M consumers, you end up with N × M integrations, every team becomes coupled to every other team's data format, schedule, and operational habits. Schema changes require coordinating across all consumers. Backfilling for one team means re-running every job. Adding the sixth consumer is harder than adding the first.

This architecture has a name: the integration spaghetti. Every mature company that didn't fix this early ends up with it.

The Log: One Unified Abstraction

The insight that gave us Kafka (originally built at LinkedIn around 2010, open-sourced via Apache in 2011) is deceptively simple:

What if every system wrote its events to a shared, durable, append-only log, and every other system just read from that log?

The log becomes the single source of truth. The orders service writes "order placed" events to the log. Fraud, recommendations, warehouse, analytics, and email all read from the same log, independently, at their own pace. Adding a sixth consumer doesn't require the orders service to know anything about it.

                  ┌─→ Fraud (reads in real time)
                  ├─→ Recommendations (reads continuously)
Orders service ───┤  LOG  ─→ Warehouse (reads in batches)
                  ├─→ Analytics (reads continuously)
                  └─→ Email (reads in real time)

This is the entire conceptual model of Kafka. Kafka is a durable, distributed, partitioned, append-only log. Everything else, producers, consumers, brokers, partitions, replication, is mechanism to make that log fast, fault-tolerant, and infinitely scalable.

Why "Append-Only Log" Is the Right Abstraction

Databases let you UPDATE and DELETE. Logs only let you APPEND. That sounds like a limitation, but it's actually the source of Kafka's power:

Order is preserved. Events appear in the log in the order they were written. Consumers see them in that order.
Replay is free. Want to reprocess the last 7 days for a new ML model? Just rewind your read position. The events are still there.
Many readers don't slow it down. Reading is a sequential scan from your last position. Adding consumers doesn't add write load.
It's the same data model as a database's own write-ahead log (WAL). Every relational database is, under the hood, a log followed by a materialized view. Kafka exposes the log directly.

Streams vs Queues, A Common Confusion

Message queues (RabbitMQ, ActiveMQ, SQS) have existed since the 1990s. They look superficially like Kafka, but they have a fundamentally different model:

	Traditional Queue	Kafka (Log)
Storage	Message deleted after consumption	Message retained for days/weeks/forever
Consumers	One consumer typically reads each message	Many independent consumers replay the same data
Replay	Impossible, message is gone	Trivial, rewind offset
Throughput	Tens of thousands/sec	Millions/sec
Mental model	Job queue	Source of truth

This matters because you'll often hear "Kafka is a message broker." Technically true, but underselling it. Kafka is more like a database whose query interface is tail -f. The retention is what changes everything.

Where Kafka Sits in the Modern Data Stack

In TheWorldShop's eventual architecture, Kafka becomes the central nervous system:

[ App services ] → Kafka → [ Stream processors: Flink / Kafka Streams ]
      │              │              ↓
      │              │       [ Real-time features ]
      │              │              │
      │              └─→ [ Connect → Iceberg / Delta / S3 ]
      │                                  │
      │                                  ↓
      │                          [ Spark / Trino / dbt ]
      │                                  │
      │                                  ↓
      └─→ CDC (Debezium) → Kafka  → [ Dashboards / ML ]

Every downstream system, your data lake, your warehouse, your real-time features, your ML training pipelines, drinks from the same Kafka log. This is sometimes called the Kappa Architecture (a play on the older Lambda Architecture), where streams are the source of truth and batch is just "a window over the stream."

What Kafka Is Not

It's just as important to know what Kafka is not, because beginners often try to use it as something it isn't:

Not a database with query support. You can't SELECT * WHERE customer_id = 42. You can only read sequentially from a position.
Not a transactional system. Kafka transactions exist (we'll cover them later) but they are for atomic publishing, not for OLTP workloads.
Not a small-data tool. If you have 10 events per minute and one consumer, Kafka is overkill, use a queue or a database.
Not magic. Kafka can lose data, deliver duplicates, get out of order, if you configure it wrong. Most of this course is about getting that configuration right.

3. Worked Example

Let's walk through what changes for TheWorldShop when they introduce Kafka.

Before, Batch / Point-to-Point

The orders-service writes new orders to a Postgres table. Every team writes their own integration:

# Fraud team, runs every 10 min
SELECT * FROM orders WHERE created_at > $last_run;
# Recommendations team, runs nightly at 2 AM
COPY (SELECT * FROM orders WHERE date = yesterday) TO 's3://...';
# Warehouse team, midnight CSV drop
mysqldump --where="date = today" > orders.csv;

Every one of those is a separate pipeline, with separate credentials, separate failure modes, separate operators on call. When marketing wants to add a new "order placed" email at 11 PM their time, they file a ticket against the orders team to expose another extract. Six weeks later, still no email.

After, Kafka-Centric

The orders-service publishes one event per order to a Kafka topic called orders:

{
  "event": "order_placed",
  "order_id": "O-87234",
  "customer_id": "C-1042",
  "items": [{"sku": "BOOK-9931", "qty": 2, "price": 19.99}],
  "total": 39.98,
  "ts": "2024-11-29T11:14:03Z"
}

Now:

Fraud subscribes to orders and scores each event in milliseconds.
Recommendations subscribes and updates a Redis cache of "recently bought."
Warehouse subscribes and writes orders to its inventory system as they arrive.
Analytics uses Kafka Connect to dump the same events into Iceberg, where dbt / Spark / Trino read them as a regular table.
Marketing spins up a new consumer in a day, no change to the orders service required.

The orders service writes once. The log becomes the contract. Everyone downstream is decoupled.

Note what's happening operationally:

Latency dropped from minutes/hours to seconds, for everyone simultaneously.
Coupling dropped because the orders team no longer maintains five separate extracts.
Failure isolation improved, if recommendations crashes, fraud doesn't notice.
Replay is now possible, when fraud rolls out a new model, they replay the last 7 days of orders and re-score everything.

This is the leverage that pushes companies to adopt Kafka.

Aha: Kafka isn't a faster message queue. A queue consumes a message (one reader, then it's gone). Kafka is an append-only log (many readers, replay anytime, retention by policy). Once you see that, every "can we add another consumer?" question answers itself: of course you can, the log doesn't care.

4. Your Turn

Scenario: TheWorldShop wants to add three new capabilities:

A real-time "someone just bought this" social-proof banner on product pages.
A nightly snapshot of orders into the analytics warehouse.
A loyalty service that gives points for every $50 spent, applied within 30 seconds.

Task: For each capability, answer in writing:

Is it a streaming use case, a batch use case, or both? Why?
If you used the old point-to-point architecture, what new integration would the orders team have to build for each capability?
With a Kafka orders topic in place, how does each new capability change the orders team's workload? Whose code has to change for capability 3?

Bonus question: Why might it be a bad idea to put a 50 events/day internal admin audit log on Kafka?

5. Real-World Application

LinkedIn built Kafka in 2010 to solve exactly TheWorldShop's problem at far larger scale. Their member activity stream (every profile view, every connection request, every job search) was being copied through dozens of point-to-point integrations. Today, LinkedIn runs trillions of messages a day through Kafka, every interaction on the site becomes an event, and dozens of downstream systems consume it independently. The original Kafka authors (Jay Kreps, Neha Narkhede, Jun Rao) later founded Confluent, the company that commercializes Kafka.

Netflix processes over 1 trillion events per day on Kafka. Every play, pause, seek, and rating becomes an event. Their recommendation algorithms, A/B test analytics, and operational telemetry all read from the same shared Kafka cluster.

Uber routes nearly every internal system through Kafka, driver location updates, surge pricing signals, ETA recalculations, fraud detection. Their internal stream processing platform is built on top of Kafka and Flink.

Goldman Sachs, JPMorgan, and most major banks use Kafka for market data distribution, trade event capture, and regulatory reporting. The append-only log model is also a perfect fit for audit and compliance: events are immutable, ordered, and timestamped.

For job seekers: roles titled Data Platform Engineer, Streaming Engineer, Site Reliability Engineer (Data), or Backend Engineer at any company doing more than a few million events a day will expect comfort with Kafka. Understanding why Kafka exists, the integration-spaghetti problem, the log abstraction, separates engineers who use Kafka well from those who treat it as a black box.

6. Recap + Bridge

Kafka exists because batch and point-to-point integrations break down once your business has many systems that need the same data quickly, and the entire system collapses into one elegant idea: a durable, append-only, replayable log that any number of producers and consumers can independently write to and read from.

In the next lesson, we'll open up that log and look at its three core primitives, topics, partitions, and offsets, which are the only nouns you really need to understand 90% of Kafka.