Why Stream Processing?

Module: Streaming Foundations | Duration: 20 min read | Lesson: 1 of 20


TheWorldShop's inventory system runs a batch job at midnight to flag out-of-stock products. At 11 PM, a viral TikTok video features a TheWorldShop product. Within 90 minutes, 50,000 customers add it to their cart. The last unit sells at 11:03 PM. For the next 57 minutes, until the midnight job runs, the website still shows it as "In Stock." Thousands of customers complete checkout. TheWorldShop spends the next week issuing refunds and apologies.

The midnight batch job was fine when inventory changed slowly. It's catastrophic when demand spikes in minutes. The problem isn't the code; it's the latency inherent in batch processing. The solution isn't a faster batch job. It's a different computational model: stream processing.


2. Concept Explanation

Batch vs Stream Processing

Batch processing: Accumulate data, then process it all at once.

Data arrives: ●●●●●●●●●●●●●●●●●●●●
                                    ↓ (trigger: midnight)
Process:                         [JOB RUNS]
                                    ↓
Output:                          [RESULTS] (stale by up to 24 hours)

Stream processing: Process data continuously as it arrives.

Data arrives: ●  ●  ●  ●  ●  ●  ●  ●  ●
Process:      ↓  ↓  ↓  ↓  ↓  ↓  ↓  ↓  ↓
Output:       ↓  ↓  ↓  ↓  ↓  ↓  ↓  ↓  ↓ (near real-time)
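
The two diagrams above can be sketched in plain Scala, with no Spark involved. This is a minimal illustration under stated assumptions: `StockEvent`, the minute-based clock, and the sample data are hypothetical and exist only to show when each model notices a stock-out.

```scala
object BatchVsStream {
  // Hypothetical event type; minute is counted from 10 PM for readability.
  final case class StockEvent(minute: Int, productId: String, quantity: Int)

  // Batch: nothing is examined until the trigger fires, so a stock-out
  // becomes visible at the trigger time, however early it happened.
  def batchDetectedAt(events: Seq[StockEvent], triggerMinute: Int): Option[Int] =
    if (events.exists(e => e.minute <= triggerMinute && e.quantity == 0)) Some(triggerMinute)
    else None

  // Streaming: each event is examined on arrival, so a stock-out is
  // visible at the minute of the event that reported it.
  def streamDetectedAt(events: Seq[StockEvent]): Option[Int] =
    events.sortBy(_.minute).find(_.quantity == 0).map(_.minute)

  // Sample data: the last unit sells at 11:03 PM (minute 63);
  // the batch job triggers at midnight (minute 120).
  val events: Seq[StockEvent] = Seq(
    StockEvent(0, "p1", 40),
    StockEvent(60, "p1", 12),
    StockEvent(63, "p1", 0)
  )

  val batchAt: Option[Int]  = batchDetectedAt(events, triggerMinute = 120) // Some(120)
  val streamAt: Option[Int] = streamDetectedAt(events)                     // Some(63)
}
```

The 57-minute difference between the two results is the overselling window from the opening story.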

Three Processing Models

1. True streaming (record-at-a-time): Process each event individually as it arrives. Lowest latency (~milliseconds). Systems: Apache Flink, Apache Storm, Kafka Streams.

2. Micro-batch: Accumulate events for a short interval (1 second, 10 seconds), then process that mini-batch. Latency = batch interval (~seconds). Systems: Spark Structured Streaming (default), Spark DStream.

3. Batch: Process hours or days of accumulated data. Latency = hours. Systems: Spark batch, Hive, traditional ETL.

Latency:     ms          seconds         hours
             │               │               │
    True Streaming    Micro-batch           Batch
    (Flink/Storm)  (Spark Streaming)    (Spark/Hive)
         ↑               ↑                  ↑
    Higher cost      Sweet spot         Lower cost
    More complex   Most use cases       Simplest
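
To make the micro-batch model concrete, here is a small plain-Scala sketch of how events could be assigned to batch intervals. It is not how Spark schedules work internally; `Event`, `microBatches`, and `waitUntilBatchCloses` are hypothetical names used only for illustration.

```scala
object MicroBatchDemo {
  // Hypothetical event: arrival time in milliseconds plus a payload.
  final case class Event(arrivalMs: Long, payload: String)

  // Assigns each event to the interval it arrived in and returns the
  // batches in time order, keyed by the start of each interval.
  def microBatches(events: Seq[Event], intervalMs: Long): Seq[(Long, Seq[Event])] =
    events
      .groupBy(e => (e.arrivalMs / intervalMs) * intervalMs)
      .toSeq
      .sortBy(_._1)
      .map { case (start, batch) => (start, batch.sortBy(_.arrivalMs)) }

  // Time an event waits before its interval closes and processing can begin.
  // This wait is why micro-batch latency is roughly the batch interval.
  def waitUntilBatchCloses(e: Event, intervalMs: Long): Long =
    ((e.arrivalMs / intervalMs) + 1) * intervalMs - e.arrivalMs
}
```

With a 1-second interval, an event arriving at 100 ms waits 900 ms before its batch can start; a record-at-a-time engine would begin processing it immediately.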

When Do You Need Streaming?

Use streaming when the value of the result degrades faster than the batch interval.

Use Case                                Acceptable Latency   Streaming?
Fraud detection (block a transaction)   < 1 second           Yes (true streaming or micro-batch)
Inventory alert (restock product)       < 5 minutes          Yes (micro-batch)
Live dashboard (orders per minute)      < 30 seconds         Yes (micro-batch)
Daily revenue report                    24 hours             No (batch is fine)
Monthly finance reconciliation          Days                 No (batch is better)
ML model training                       Hours                No (batch is correct)
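
The rule of thumb can be written down as a small decision function. This is a sketch only: the 1-second and 1-hour thresholds are assumptions chosen to agree with the use cases listed above, not fixed industry cutoffs, and `ModelChooser` is a hypothetical name.

```scala
import scala.concurrent.duration._

object ModelChooser {
  sealed trait Model
  case object TrueStreaming extends Model
  case object MicroBatch extends Model
  case object Batch extends Model

  // Picks the simplest model whose typical latency still fits the budget.
  // Thresholds are illustrative assumptions, not values from any standard.
  def choose(acceptableLatency: FiniteDuration): Model =
    if (acceptableLatency < 1.second) TrueStreaming
    else if (acceptableLatency < 1.hour) MicroBatch
    else Batch
}
```

For example, `ModelChooser.choose(5.minutes)` returns `MicroBatch` and `ModelChooser.choose(24.hours)` returns `Batch`. A real decision would also weigh throughput and operational cost.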

TheWorldShop's streaming needs:

  1. Inventory alerts: restock notification when stock < 10 (within 2 minutes)
  2. Fraud detection: flag suspicious transactions before they clear (within 30 seconds)
  3. Live dashboard: orders per minute for the operations team (within 10 seconds)
  4. Session timeout: mark users as inactive after 30 minutes of no clicks

Spark's Two Streaming APIs

Spark has two streaming APIs:

DStream (Discretized Streams), Legacy, Spark 1.x/2.x:

  • Represents a stream as a sequence of RDDs (one per batch interval)
  • Built on RDD API, low-level, flexible
  • No event-time support, limited exactly-once semantics
  • Still widely used in existing systems

Structured Streaming, Modern, Spark 2.0+:

  • Represents a stream as an unbounded table, rows keep arriving
  • Built on DataFrame/Dataset API, full SQL support, Catalyst optimization
  • First-class event-time support, watermarks, stateful processing
  • Exactly-once end-to-end (with supported sources/sinks)
  • The recommended API for new development

DStream model:
  Stream = RDD[T] at t=0, RDD[T] at t=1, RDD[T] at t=2, ...

Structured Streaming model:
  Stream = one big table where new rows keep arriving
  Query = SELECT ... FROM stream WHERE ... GROUP BY ...
  Spark runs the query incrementally on each new batch of rows, updating the previous result
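
One way to see the unbounded-table idea is to apply a single query definition to both a bounded DataFrame and a streaming one. The sketch below assumes a local Spark installation and uses Spark's built-in `rate` test source, which emits `timestamp` and `value` columns; the object name, the `countPerBucket` helper, and the modulo-3 bucketing are hypothetical and chosen only for the demo.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object UnboundedTableSketch {
  // One query definition, usable on a bounded table or an unbounded one.
  def countPerBucket(rows: DataFrame): DataFrame =
    rows.groupBy((col("value") % 3).as("bucket")).count()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("Unbounded table sketch")
      .master("local[*]")
      .getOrCreate()

    // Bounded table: the query runs once over the rows that exist now.
    countPerBucket(spark.range(0, 9).toDF("value")).show()

    // Unbounded table: the rate source keeps appending rows, and the
    // same query result is updated after every micro-batch.
    val stream = spark.readStream
      .format("rate")
      .option("rowsPerSecond", "5")
      .load()

    val query = countPerBucket(stream).writeStream
      .format("console")
      .outputMode("complete")
      .start()

    query.awaitTermination(30000) // let the demo run for about 30 seconds
    query.stop()
    spark.stop()
  }
}
```

The batch call prints one final table; the streaming call prints a growing count per bucket on each trigger, which is the "rows keep arriving" behavior described above.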

The Streaming Triad: Throughput, Latency, Correctness

Every streaming system makes trade-offs across three dimensions:

Throughput:   How many events per second can the system process?
Latency:      How long after an event arrives until it appears in results?
Correctness:  Are results exactly right, or approximate?

Correctness is usually stated as a delivery guarantee:

At-most-once:   events may be lost, never processed twice. Fastest. Undercounts possible.
At-least-once:  events processed at least once, may be duplicated. Fast. Overcounts possible.
Exactly-once:   each event processed exactly once. Slower. Correct.

Spark Structured Streaming achieves end-to-end exactly-once semantics when a replayable source such as Kafka is combined with checkpointing and an idempotent sink. This is the strongest guarantee and the hardest to implement.
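
The role of the idempotent sink can be illustrated without Spark. The sketch below simulates at-least-once delivery, where a retry redelivers one event, and compares a naive sink with one keyed by event ID. It is a simplified model of the idea, not Spark's actual checkpointing mechanism; `OrderEvent` and both sink functions are hypothetical.

```scala
object DeliverySemantics {
  // Hypothetical order event; eventId is assumed unique per real-world event.
  final case class OrderEvent(eventId: String, amount: Int)

  // Naive sink: adds every delivery, so a redelivered event is counted twice.
  def naiveTotal(deliveries: Seq[OrderEvent]): Int =
    deliveries.map(_.amount).sum

  // Idempotent sink: writes are keyed by eventId, so a redelivery
  // overwrites the same entry instead of adding a new one.
  def idempotentTotal(deliveries: Seq[OrderEvent]): Int =
    deliveries
      .foldLeft(Map.empty[String, Int]) { (store, e) => store.updated(e.eventId, e.amount) }
      .values
      .sum

  // At-least-once delivery after a retry: order "o2" arrives twice.
  val deliveries: Seq[OrderEvent] = Seq(
    OrderEvent("o1", 10),
    OrderEvent("o2", 25),
    OrderEvent("o2", 25),
    OrderEvent("o3", 5)
  )

  val overcounted: Int = naiveTotal(deliveries)      // 65
  val correct: Int     = idempotentTotal(deliveries) // 40
}
```

At-least-once delivery plus an idempotent write gives the same result as exactly-once processing, which is why the sink matters as much as the engine.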

TheWorldShop's Streaming Architecture

Kafka Topics:
  ├── theworldshop.orders      (order events)
  ├── theworldshop.inventory   (stock updates)
  ├── theworldshop.clicks      (user click events)
  └── theworldshop.fraud-signals (payment events)

                    │
                    ▼
         Spark Structured Streaming
         (micro-batch, 10-second intervals)
                    │
          ┌─────────┼──────────┐
          ▼         ▼          ▼
     Inventory   Live        Fraud
      Alerts    Dashboard   Detection
          │         │          │
          ▼         ▼          ▼
       Delta      Redis      Alert
       Table    Dashboard    Service

3. Worked Example

Illustrating the batch vs streaming difference concretely:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder()
  .appName("Batch vs Stream Demo")
  .master("local[*]")
  .getOrCreate()

import spark.implicits._

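// === Batch approach (what TheWorldShop had) ===
// Runs at midnight, reads 24 hours of inventory events
// (assumes each row carries productId, quantity, and eventTime)
val batchInventory = spark.read.parquet("s3://theworldshop/events/inventory/date=2024-11-29/")
  .groupBy("productId")
  // last() depends on row order, which Spark does not guarantee after a shuffle.
  // The max of (eventTime, quantity) compares eventTime first, so the newest reading wins.
  .agg(max(struct($"eventTime", $"quantity")).as("latest"))
  .select($"productId", $"latest.quantity".as("currentStock"))
  .filter($"currentStock" < 10)

// Result is available at ~12:15 AM (15 min job run time)
// If the stock hit zero at 11 PM, that is 75 minutes of overselling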

// === Streaming approach (what TheWorldShop needs) ===
// Processes events as they arrive from Kafka
// (requires the spark-sql-kafka-0-10 connector on the classpath)
val streamingInventory = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "kafka:9092")
  .option("subscribe", "theworldshop.inventory")
  .load()
  .select(
    $"key".cast("string").as("productId"),
    from_json($"value".cast("string"),
      new org.apache.spark.sql.types.StructType()
        .add("quantity", "int")
        .add("eventTime", "timestamp")
    ).as("data")
  )
  .select($"productId", $"data.quantity", $"data.eventTime")
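  // Aggregate: latest quantity per product, chosen by event time.
  // last() is order-dependent; the max of (eventTime, quantity) compares
  // eventTime first, so the newest reading wins.
  .groupBy($"productId")
  .agg(max(struct($"eventTime", $"quantity")).as("latest"))
  .select($"productId", $"latest.quantity".as("currentStock"))
  .filter($"currentStock" < 10)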

// Write alerts to console (for demo):
val query = streamingInventory.writeStream
  .format("console")
  .outputMode("complete")
  .trigger(org.apache.spark.sql.streaming.Trigger.ProcessingTime("10 seconds"))
  .start()

query.awaitTermination()
// Now: low-stock alerts fire within about one 10-second trigger interval
// (plus processing time) of the stock update.
// The 75-minute gap drops to roughly 10 seconds.

Aha: Streaming isn't a faster batch job. The win is that the answer is ready before the question gets stale. If your batch interval is shorter than the value-decay of the result, batch already works. Reach for streaming only when waiting destroys the answer.


4. Your Turn

Map TheWorldShop's use cases to the right processing model:

TheWorldShop operations:
A. Generate monthly financial report (last month's revenue by category)
B. Alert if a seller's order cancellation rate > 20% in last 10 minutes
C. Update product search rankings (based on last 7 days of click data)
D. Detect if the same credit card is used for > 5 orders in 60 seconds
E. Recompute "customers also viewed" recommendations daily
F. Show live order count on the ops dashboard (updates every 30 seconds)

Challenge:

  1. For each operation (A-F), choose: batch (hourly/daily), micro-batch (seconds/minutes), or true streaming (milliseconds)
  2. What's the "cost of being wrong" for B vs C? Why does this drive the latency choice?
  3. For operation D (fraud detection): what does "at-least-once" mean here? Is it acceptable? What's the risk of "at-most-once"?
  4. Could you implement F (live dashboard) with batch processing? What batch interval would you need, and what's the infrastructure cost compared to streaming?

5. Real-World Application

  • Uber's surge pricing is computed from a stream of ride requests and driver availability events. A batch computation would show surge 5 minutes late, long enough for hundreds of mismatched requests. Their streaming system at peak processes 1M events/minute with sub-second latency.
  • LinkedIn's newsfeed uses stream processing (Apache Samza, which processes messages one at a time and is closer in model to Kafka Streams than to Spark's micro-batches) to update feed rankings as connections post new content, with 30-second latency from post to appearing in followers' feeds.
  • NYSE's trade surveillance system processes every trade event in real time to detect wash trading, spoofing, and front-running. At-most-once is legally unacceptable: missing a fraud event could mean SEC violations. They run exactly-once Flink pipelines with sub-100ms latency.

6. Recap + Bridge

Stream processing handles events continuously as they arrive, trading the simplicity of batch for lower latency. That trade becomes critical when the value of a result (stock alert, fraud flag, live count) degrades faster than any batch interval can deliver. Spark offers two streaming APIs: DStream (RDD-based, legacy) and Structured Streaming (DataFrame-based, modern). DStream always runs as micro-batches; Structured Streaming does so by default and also has an experimental continuous mode.

In the next lesson, we build our first DStream pipeline, starting with Spark's original streaming API to understand the micro-batch model, batch interval tuning, and receiver-based input before moving to the more powerful Structured Streaming model.