The Hot-Shard Spiral

P1 · hard · 10 min · Incident Response

It's 14:03 on Black Friday. MegaToys' flash sale went live at 13:30. Checkout is green, payments are clearing — but the warehouse queue is empty. orders_fact is 41 minutes stale and the gap is growing. At 13:00 the pipeline was sub-30s fresh. Fulfillment, fraud, and the live 'low stock' banners all read this table. Sale ends 15:00. You're on call.

Pipeline lag: 41 → 2 min behind
Rebalances: 9 → 0 /min

The incident

It's Black Friday and MegaToys' flash sale opened at 13:30. Orders are pouring in — checkout is green and payments are clearing, so to a customer everything looks fine. But the warehouse queue is empty: every system that reads orders_fact (fulfillment, fraud scoring, the live 'low stock' banners) is now working off data that's 41 minutes stale, and that gap is widening about 1.4 minutes every minute. At 13:00 this pipeline was fresh to within 30 seconds. The producers are healthy and Postgres is writing in milliseconds — so orders are being created and would land instantly if they arrived. They're getting stuck somewhere on the path between Kafka and Flink, and the stall is not clearing on its own.
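To get a feel for what "not clearing on its own" means by the end of the sale, the numbers above can be extrapolated. This is a rough sketch that assumes the 1.4 min/min growth stays constant until 15:00, which the scenario does not guarantee; the function name is illustrative and only the times and rates come from the incident description.

```python
from datetime import datetime

def projected_staleness_min(current_min, growth_per_min, now, until):
    """Linear extrapolation of staleness, assuming the growth rate holds."""
    minutes_left = (until - now).total_seconds() / 60
    return current_min + growth_per_min * minutes_left

# Times from the scenario: it is 14:03, the sale ends at 15:00.
now = datetime.strptime("14:03", "%H:%M")
sale_end = datetime.strptime("15:00", "%H:%M")

# 41 min stale today, widening by about 1.4 min per minute (assumed constant).
projected = projected_staleness_min(41, 1.4, now, sale_end)
print(f"{projected:.1f} min stale at sale end")
```

Under that assumption, orders_fact would be roughly two hours behind when the sale closes, against a 5-minute SLA.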

Symptoms on the table

  • orders_fact freshness 41 min (SLA 5 min) and climbing
  • order-indexer group lag 18.4M messages and rising
  • partition 7 alone holds 17.9M of the 18.4M lag (97%)
  • order-indexer rebalancing every ~7 seconds
  • checkout, payments, and producer throughput all nominal
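The arithmetic behind these symptoms is worth checking by hand before you start pulling levers. The sketch below only reworks the figures listed above; the helper names are illustrative, and the even split of the remaining lag across the other 11 partitions is an assumption made for comparison, not something the dashboard reports.

```python
def lag_share(partition_lag, total_lag):
    """Fraction of the group's total lag held by one partition."""
    return partition_lag / total_lag

def rebalances_per_minute(interval_s):
    """Convert 'one rebalance every N seconds' into a per-minute rate."""
    return 60 / interval_s

TOTAL_LAG = 18_400_000      # order-indexer group lag
HOT_LAG = 17_900_000        # partition 7 alone
PARTITIONS = 12             # orders.v2

share = lag_share(HOT_LAG, TOTAL_LAG)

# Assumption: the remaining lag is spread evenly over the other 11 partitions.
avg_other = (TOTAL_LAG - HOT_LAG) / (PARTITIONS - 1)
skew = HOT_LAG / avg_other

rate = rebalances_per_minute(7)

print(f"partition 7 share of lag: {share:.0%}")
print(f"average lag on other partitions: {avg_other:,.0f}")
print(f"partition 7 vs. average other partition: {skew:.0f}x")
print(f"rebalances per minute: {rate:.1f}")
```

Two things stand out: the backlog is concentrated on a single partition rather than spread across the topic, and a rebalance every ~7 seconds works out to roughly 9 per minute, which matches the Rebalances tile at the top of the page.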

Systems on the board

The real components in play for this incident — the surface you investigate when the clock starts.

  • Order Producer: checkout service
  • Kafka: orders.v2 · 12 partitions
  • Consumer Group: order-indexer · 12 members
  • KEDA Autoscaler: lag-based ScaledObject
  • Flink Enrichment: stateful join → sink
  • Postgres Sink: orders_fact
  • Ops Dashboard: Grafana freshness

What you'll practice

This is a timed, hands-on Incident Response challenge. You diagnose the symptom, trace it to a root cause across real components, and ship a fix before the clock runs out. It is the same loop you run on call, without the production blast radius.

