Module: Foundations | Duration: 20 min read | Lesson: 1 of 12
TheWorldShop's fraud team needs to react to suspicious activity within seconds, and to handle events that arrive minutes late (mobile clients with flaky connections) correctly, putting each event in the right time bucket regardless of when it showed up. Their Spark Structured Streaming job processes in micro-batches and their Kafka Streams app struggles with the complex multi-stream, large-state joins they need.
A staff engineer suggests Apache Flink. The fraud lead is skeptical: "Isn't that just another stream processor?" The answer is no, and the difference is foundational. Spark streams by chopping the stream into tiny batches (micro-batch). Flink processes each event as it arrives (true streaming), with a model of event time and late data built into its core, not bolted on.
This course is Flink in production depth. It opens with why Flink exists: the dataflow model and true streaming that make seconds-latency, correct-under-lateness processing its native mode rather than a configuration.
2. Concept Explanation
Micro-Batch vs True Streaming
Two architectures for "stream processing":
- Micro-batch (Spark Structured Streaming): buffer incoming records for a short interval, then process the whole batch. Latency is bounded below by the batch interval, you can't go faster than your batch size. Simple, reuses the batch engine, great throughput, but inherently a few-hundred-ms-to-seconds floor.
- True streaming (Flink): process each record (or small buffered group) as it flows through a long-running operator graph, no batch boundary. Latency can be milliseconds. The pipeline is always running, not waking up per batch.
For TheWorldShop's seconds-latency fraud need, true streaming's low latency floor matters. But the deeper reason for Flink is the next part.
The Dataflow Model
Flink implements the Dataflow model (from Google's Dataflow paper, the foundation of Apache Beam too). Its core insight: stream processing should be defined by answering four questions independently:
- What are you computing? (the transformation: count, join, aggregate)
- Where in event time? (windowing)
- When in processing time do you emit results? (triggers, watermarks)
- How do refinements relate? (how late data updates earlier results)
The power is that event time (when things happened) is separated from processing time (when you process them). A late-arriving event still belongs to its event-time window; the model handles "this event is for 10:00 even though it's now 10:05" as a first-class concept, not an afterthought. This is exactly TheWorldShop's flaky-mobile-client requirement, and it's native to Flink, where Kafka Streams expresses it more narrowly (grace periods, L4 of the previous course) and Spark via watermarks.
Streaming as the Primary, Batch as a Special Case
Flink's philosophy inverts the usual mental model: batch is a special case of streaming (a bounded stream), not the other way around. A bounded data set is just a stream that ends. The same engine, APIs, and operators handle both; you don't switch tools to do a backfill. (This unification matters for the backfill course in Track D.)
What Flink Brings That Pays Off Later
Flink's foundation enables capabilities this course will detail:
- Event time + watermarks (Lessons 3, 4): correct handling of out-of-order and late events.
- Large, RocksDB-backed managed state (Lesson 5): terabytes of state, far beyond casual.
- Checkpointing for exactly-once (Lesson 6): consistent snapshots of distributed state.
- Savepoints (Lesson 7): upgrade and rescale a job without losing state.
- A broad connector ecosystem and SQL (Lessons 10, 11): many sources/sinks, SQL and CDC.
When Flink Is the Right Choice (vs Kafka Streams / Spark)
Flink fits when you need: low-latency true streaming, sophisticated event-time/late-data handling, very large state, complex multi-stream topologies, a rich connector set beyond Kafka, or one engine for both streaming and batch. It costs more operationally, it's a cluster you run (JobManager + TaskManagers), unlike the Kafka Streams library model. The full three-way decision (Kafka Streams vs Flink vs Spark) is course C3; here, internalize that Flink trades operational simplicity for streaming power and correctness.
Aha: Flink isn't "faster Spark Streaming", the difference is architectural. Spark batches the stream (latency floored by batch size); Flink flows each event through always-running operators (millisecond latency). And the deeper distinction is the dataflow model's separation of event time from processing time: a late event still lands in the window of when it happened, treated as a first-class case, not patched on. That's why Flink is the default when "correct under late and out-of-order data" is a hard requirement.
3. Worked Example
See true streaming and event-time handling on the lab.
Bring up the lab. The Flink course needs a Flink cluster (JobManager + TaskManagers) plus Kafka. Clone the lab repo (shared) and start the stack:
Verify Flink and Kafka are reachable:
Submit a true-streaming job and watch per-event latency:
See event-time vs processing-time on a late event:
The same late event, correct under event time, wrong under processing time, and Flink makes event time the natural default.
4. Your Turn
Exercise: TheWorldShop must build a fraud pipeline that reacts within ~2 seconds and correctly time-buckets events from mobile clients that can arrive up to several minutes late. The team currently uses Spark Structured Streaming (micro-batch).
- Explain why micro-batch imposes a latency floor and how true streaming avoids it.
- Define event time vs processing time, and explain why the late-mobile-event requirement demands event time.
- Using the dataflow model's four questions, frame "count fraudulent-looking orders per 1-minute event-time window."
- Give two reasons (beyond latency) the team might choose Flink over Kafka Streams for this pipeline.
- What's the main operational cost of Flink compared to the Kafka Streams library model?
5. Real-World Application
Flink dominates low-latency, correctness-critical streaming, fraud detection, real-time pricing, alerting, and complex event processing at companies like Alibaba (which built much of Flink's scale), Uber, Netflix, and Stripe. When "seconds matter and late data must be correct," Flink is the common answer.
The dataflow model is an industry foundation, it underlies Apache Beam (Google Dataflow's OSS model) and influenced Spark and Kafka Streams' event-time handling. Learning Flink teaches a model that transfers across the streaming ecosystem.
Batch-as-bounded-stream is increasingly mainstream: Flink (and the lakehouse world) blur the batch/stream line, one engine, one set of semantics, for both real-time and backfill. This unification is a major theme of modern data architecture and reappears in the backfill course.
6. Recap + Bridge
Flink is a true streaming engine (process each event as it flows, millisecond latency) built on the dataflow model, which separates event time (when things happened) from processing time and treats late/out-of-order data as first-class. It treats batch as a bounded special case of streaming. It's the default when low latency and correctness-under-lateness are hard requirements, at the cost of operating a cluster.
Flink runs as a cluster of cooperating processes. Next we open it up: the Flink architecture, JobManager, TaskManagers, and slots, the runtime that executes your dataflow.