Module: Streaming Foundations | Duration: 20 min read | Lesson: 1 of 20
TheWorldShop's inventory system runs a batch job at midnight to flag out-of-stock products. At 11 PM, a viral TikTok video features a TheWorldShop product. Within 90 minutes, 50,000 customers add it to their cart. The last unit sold at 11:03 PM. For the next 57 minutes, the website still shows it as "In Stock." Thousands of customers complete checkout. TheWorldShop spends the next week issuing refunds and apologies.
The midnight batch job was fine when inventory changed slowly. It's catastrophic when demand spikes in minutes. The problem isn't the code, it's the fundamental latency of batch processing. The solution isn't a faster batch job. It's a different computational model: stream processing.
2. Concept Explanation
Batch vs Stream Processing
Batch processing: Accumulate data, then process it all at once.
Stream processing: Process data continuously as it arrives.
Three Processing Models
1. True streaming (record-at-a-time): Process each event individually as it arrives. Lowest latency (~milliseconds). Systems: Apache Flink, Apache Storm, Kafka Streams.
2. Micro-batch: Accumulate events for a short interval (1 second, 10 seconds), then process that mini-batch. Latency = batch interval (~seconds). Systems: Spark Structured Streaming (default), Spark DStream.
3. Batch: Process hours or days of accumulated data. Latency = hours. Systems: Spark batch, Hive, traditional ETL.
When Do You Need Streaming?
Use streaming when the value of the result degrades faster than the batch interval.
| Use Case | Acceptable Latency | Streaming? |
|---|---|---|
| Fraud detection (block a transaction) | < 1 second | Yes (true streaming or micro-batch) |
| Inventory alert (restock product) | < 5 minutes | Yes (micro-batch) |
| Live dashboard (orders per minute) | < 30 seconds | Yes (micro-batch) |
| Daily revenue report | 24 hours | No (batch is fine) |
| Monthly finance reconciliation | Days | No (batch is better) |
| ML model training | Hours | No (batch is correct) |
TheWorldShop's streaming needs:
- Inventory alerts: restock notification when stock < 10 (within 2 minutes)
- Fraud detection: flag suspicious transactions before they clear (within 30 seconds)
- Live dashboard: orders per minute for the operations team (within 10 seconds)
- Session timeout: mark users as inactive after 30 minutes of no clicks
Spark's Two Streaming APIs
Spark has two streaming APIs:
DStream (Discretized Streams), Legacy, Spark 1.x/2.x:
- Represents a stream as a sequence of RDDs (one per batch interval)
- Built on RDD API, low-level, flexible
- No event-time support, limited exactly-once semantics
- Still widely used in existing systems
Structured Streaming, Modern, Spark 2.0+:
- Represents a stream as an unbounded table, rows keep arriving
- Built on DataFrame/Dataset API, full SQL support, Catalyst optimization
- First-class event-time support, watermarks, stateful processing
- Exactly-once end-to-end (with supported sources/sinks)
- The recommended API for new development
The Streaming Triad: Throughput, Latency, Correctness
Every streaming system makes trade-offs across three dimensions:
Throughput: How many events per second can the system process? Latency: How long after an event arrives until it appears in results? Correctness: Are results exactly right, or approximate?
Spark Structured Streaming with Kafka sources and idempotent sinks achieves exactly-once semantics, the strongest guarantee and the hardest to implement.
TheWorldShop's Streaming Architecture
3. Worked Example
Illustrating the batch vs streaming difference concretely:
Aha: Streaming isn't a faster batch job. The win is that the answer is ready before the question gets stale. If your batch interval is shorter than the value-decay of the result, batch already works. Reach for streaming only when waiting destroys the answer.
4. Your Turn
Map TheWorldShop's use cases to the right processing model:
Challenge:
- For each operation (A-F), choose: batch (hourly/daily), micro-batch (seconds/minutes), or true streaming (milliseconds)
- What's the "cost of being wrong" for B vs C? Why does this drive the latency choice?
- For operation D (fraud detection): what does "at-least-once" mean here? Is it acceptable? What's the risk of "at-most-once"?
- Could you implement F (live dashboard) with batch processing? What batch interval would you need, and what's the infrastructure cost compared to streaming?
5. Real-World Application
- Uber's surge pricing is computed from a stream of ride requests and driver availability events. At-rest batch computation would show surge 5 minutes late, long enough for hundreds of mismatched requests. Their streaming system at peak processes 1M events/minute with sub-second latency.
- LinkedIn's newsfeed uses micro-batch streaming (Apache Samza, conceptually similar to Spark Structured Streaming) to update feed rankings as connections post new content, 30-second latency from post to appearing in followers' feeds.
- NYSE's trade surveillance system processes every trade event in real-time to detect wash trading, spoofing, and front-running. At-most-once is legally unacceptable, missing a detected fraud event could mean SEC violations. They run exactly-once Flink pipelines with sub-100ms latency.
6. Recap + Bridge
Stream processing processes events continuously as they arrive, trading the simplicity of batch for latency reduction that becomes critical when the value of a result (stock alert, fraud flag, live count) degrades faster than any batch interval can deliver; Spark offers DStream (RDD-based, legacy) and Structured Streaming (DataFrame-based, modern) as its two streaming APIs, both using micro-batch execution.
In the next lesson, we build our first DStream pipeline, starting with Spark's original streaming API to understand the micro-batch model, batch interval tuning, and receiver-based input before moving to the more powerful Structured Streaming model.