The Small-Files Tax
It's 06:47. Finance Slack pinged overnight: analytics GCS bill jumped 4× last week, breaking the quarterly budget. No outages, no paging. Dashboards still render. Board prep starts at 08:00. You're on call.
The incident
Analytics infrastructure cost has quadrupled in 7 days — finance flagged a $48k overrun. No outage, no SLA breach, no alert. The bill is real and it's compounding daily.
Symptoms on the table
- GCS class-B operation count up 38× week-over-week
- Trino coordinator metadata heap usage up from 4 GB to 14 GB
- Superset dashboard P95 load time up from 12 s to 47 s (dashboards still load)
- Iceberg metadata.json file grew from 6 MB to 412 MB on revenue.events
- No paging alerts fired — all queries still succeed
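To compare the symptoms above on one scale, it can help to turn each before/after pair into a growth factor. The following is a minimal sketch, not part of the challenge tooling: the function names `growth_factor` and `rank_by_growth` are hypothetical, and the only inputs are the figures quoted in the list.

```python
# Before/after readings copied from the symptom list (units noted in each label).
SYMPTOMS = {
    "Trino coordinator metadata heap (GB)": (4, 14),
    "Superset dashboard P95 (s)": (12, 47),
    "Iceberg metadata.json size (MB)": (6, 412),
}

# The GCS class-B figure is already reported as a ratio, so it is kept separately.
REPORTED_FACTORS = {"GCS class-B operations": 38.0}


def growth_factor(before, after):
    """Return how many times larger `after` is than `before`."""
    return after / before


def rank_by_growth(symptoms, reported=None):
    """Return (name, factor) pairs sorted from largest to smallest growth."""
    factors = {name: growth_factor(b, a) for name, (b, a) in symptoms.items()}
    factors.update(reported or {})
    return sorted(factors.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for name, factor in rank_by_growth(SYMPTOMS, REPORTED_FACTORS):
        print(f"{name}: {factor:.1f}x")
```

On these numbers, the Iceberg metadata file grew roughly 69×, and class-B operations 38×, while heap usage and dashboard latency grew only about 3.5× to 4×. That gap suggests starting the investigation with the metadata and object-operation signals rather than with query latency.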
Systems on the board
The real components in play for this incident — the surface you investigate when the clock starts.
What you'll practice
This is a timed, hands-on incident-response challenge. You diagnose the symptom, trace it to a root cause across real components, and ship a fix before the clock runs out: the same loop you run on call, without the production blast radius.