Why Flink: True Streaming & the Dataflow Model

Module: Foundations | Duration: 20 min read | Lesson: 1 of 12


TheWorldShop's fraud team needs to react to suspicious activity within seconds, and to handle events that arrive minutes late (mobile clients with flaky connections) correctly, putting each event in the right time bucket regardless of when it showed up. Their Spark Structured Streaming job processes in micro-batches and their Kafka Streams app struggles with the complex multi-stream, large-state joins they need.

A staff engineer suggests Apache Flink. The fraud lead is skeptical: "Isn't that just another stream processor?" The answer is no, and the difference is foundational. Spark streams by chopping the stream into tiny batches (micro-batch). Flink processes each event as it arrives (true streaming), with a model of event time and late data built into its core, not bolted on.

This course is Flink in production depth. It opens with why Flink exists: the dataflow model and true streaming that make seconds-latency, correct-under-lateness processing its native mode rather than a configuration.


2. Concept Explanation

Micro-Batch vs True Streaming

Two architectures for "stream processing":

  • Micro-batch (Spark Structured Streaming): buffer incoming records for a short interval, then process the whole batch. Latency is bounded below by the batch interval, you can't go faster than your batch size. Simple, reuses the batch engine, great throughput, but inherently a few-hundred-ms-to-seconds floor.
  • True streaming (Flink): process each record (or small buffered group) as it flows through a long-running operator graph, no batch boundary. Latency can be milliseconds. The pipeline is always running, not waking up per batch.

For TheWorldShop's seconds-latency fraud need, true streaming's low latency floor matters. But the deeper reason for Flink is the next part.

The Dataflow Model

Flink implements the Dataflow model (from Google's Dataflow paper, the foundation of Apache Beam too). Its core insight: stream processing should be defined by answering four questions independently:

  1. What are you computing? (the transformation: count, join, aggregate)
  2. Where in event time? (windowing)
  3. When in processing time do you emit results? (triggers, watermarks)
  4. How do refinements relate? (how late data updates earlier results)

The power is that event time (when things happened) is separated from processing time (when you process them). A late-arriving event still belongs to its event-time window; the model handles "this event is for 10:00 even though it's now 10:05" as a first-class concept, not an afterthought. This is exactly TheWorldShop's flaky-mobile-client requirement, and it's native to Flink, where Kafka Streams expresses it more narrowly (grace periods, L4 of the previous course) and Spark via watermarks.

Streaming as the Primary, Batch as a Special Case

Flink's philosophy inverts the usual mental model: batch is a special case of streaming (a bounded stream), not the other way around. A bounded data set is just a stream that ends. The same engine, APIs, and operators handle both; you don't switch tools to do a backfill. (This unification matters for the backfill course in Track D.)

What Flink Brings That Pays Off Later

Flink's foundation enables capabilities this course will detail:

  • Event time + watermarks (Lessons 3, 4): correct handling of out-of-order and late events.
  • Large, RocksDB-backed managed state (Lesson 5): terabytes of state, far beyond casual.
  • Checkpointing for exactly-once (Lesson 6): consistent snapshots of distributed state.
  • Savepoints (Lesson 7): upgrade and rescale a job without losing state.
  • A broad connector ecosystem and SQL (Lessons 10, 11): many sources/sinks, SQL and CDC.

When Flink Is the Right Choice (vs Kafka Streams / Spark)

Flink fits when you need: low-latency true streaming, sophisticated event-time/late-data handling, very large state, complex multi-stream topologies, a rich connector set beyond Kafka, or one engine for both streaming and batch. It costs more operationally, it's a cluster you run (JobManager + TaskManagers), unlike the Kafka Streams library model. The full three-way decision (Kafka Streams vs Flink vs Spark) is course C3; here, internalize that Flink trades operational simplicity for streaming power and correctness.

Aha: Flink isn't "faster Spark Streaming", the difference is architectural. Spark batches the stream (latency floored by batch size); Flink flows each event through always-running operators (millisecond latency). And the deeper distinction is the dataflow model's separation of event time from processing time: a late event still lands in the window of when it happened, treated as a first-class case, not patched on. That's why Flink is the default when "correct under late and out-of-order data" is a hard requirement.


3. Worked Example

See true streaming and event-time handling on the lab.

Bring up the lab. The Flink course needs a Flink cluster (JobManager + TaskManagers) plus Kafka. Clone the lab repo (shared) and start the stack:

git clone https://github.com/petascalelabs/petascalelabs-lab-setup.git
cd petascalelabs-lab-setup/ingestion-and-transport/stream-processing/flink/
./scripts/setup.sh

Verify Flink and Kafka are reachable:

./scripts/verify.sh
# expected: "Flink 1.18 ready (1 JobManager, 2 TaskManagers, 4 slots), Web UI on :8081, Kafka ready, theworldshop.orders seeded"
You are helping me run the lab for the "Flink for Stream Processing"
course. The lab is in
petascalelabs-lab-setup/ingestion-and-transport/stream-processing/flink/
and includes:
  - docker-compose.yml: Flink 1.18 (JobManager + 2 TaskManagers), Kafka (KRaft), a seeded theworldshop.orders topic
  - the Flink Web UI on :8081
  - scripts/setup.sh, scripts/verify.sh, scripts/teardown.sh, scripts/submit-job.sh

My environment:
  OS: <fill in>
  RAM: <fill in GB>  (Flink + Kafka want a few GB)

Walk me through:
1. Confirming Docker has enough memory for Flink + Kafka.
2. Bringing up the cluster and opening the Flink Web UI.
3. Submitting the sample job and watching it in the UI.
4. The teardown command.

Do not assume my OS; ask if unclear.

Submit a true-streaming job and watch per-event latency:

./scripts/submit-job.sh order-counter
./scripts/produce-orders.sh --rate 100
./scripts/latency-probe.sh order-counter
# per-event end-to-end latency in the low tens of ms — no batch interval floor

See event-time vs processing-time on a late event:

# An order that happened at 10:00:00 but arrives at 10:05:00 (late mobile client)
./scripts/produce-late-order.sh --event-time 10:00:00 --arrives 10:05:00
./scripts/submit-job.sh windowed-count --time event
./scripts/consume.sh order-counts --window 10:00
# the late order is counted in the 10:00 window (event time), not 10:05
./scripts/submit-job.sh windowed-count --time processing
./scripts/consume.sh order-counts --window 10:05
# with processing time, the same order lands in 10:05 — wrong bucket

The same late event, correct under event time, wrong under processing time, and Flink makes event time the natural default.


4. Your Turn

Exercise: TheWorldShop must build a fraud pipeline that reacts within ~2 seconds and correctly time-buckets events from mobile clients that can arrive up to several minutes late. The team currently uses Spark Structured Streaming (micro-batch).

  1. Explain why micro-batch imposes a latency floor and how true streaming avoids it.
  2. Define event time vs processing time, and explain why the late-mobile-event requirement demands event time.
  3. Using the dataflow model's four questions, frame "count fraudulent-looking orders per 1-minute event-time window."
  4. Give two reasons (beyond latency) the team might choose Flink over Kafka Streams for this pipeline.
  5. What's the main operational cost of Flink compared to the Kafka Streams library model?

5. Real-World Application

Flink dominates low-latency, correctness-critical streaming, fraud detection, real-time pricing, alerting, and complex event processing at companies like Alibaba (which built much of Flink's scale), Uber, Netflix, and Stripe. When "seconds matter and late data must be correct," Flink is the common answer.

The dataflow model is an industry foundation, it underlies Apache Beam (Google Dataflow's OSS model) and influenced Spark and Kafka Streams' event-time handling. Learning Flink teaches a model that transfers across the streaming ecosystem.

Batch-as-bounded-stream is increasingly mainstream: Flink (and the lakehouse world) blur the batch/stream line, one engine, one set of semantics, for both real-time and backfill. This unification is a major theme of modern data architecture and reappears in the backfill course.


6. Recap + Bridge

Flink is a true streaming engine (process each event as it flows, millisecond latency) built on the dataflow model, which separates event time (when things happened) from processing time and treats late/out-of-order data as first-class. It treats batch as a bounded special case of streaming. It's the default when low latency and correctness-under-lateness are hard requirements, at the cost of operating a cluster.

Flink runs as a cluster of cooperating processes. Next we open it up: the Flink architecture, JobManager, TaskManagers, and slots, the runtime that executes your dataflow.