The Small-Files Tax

P2 · medium · 10 min · Incident Response

It's 06:47. Finance Slack pinged overnight: analytics GCS bill jumped 4× last week, breaking the quarterly budget. No outages, no paging. Dashboards still render. Board prep starts at 08:00. You're on call.

Class-B ops cost: 1,240 → 180 $/day
Small files: 4,200,000 → 38,000 files

The incident

Analytics infrastructure cost has quadrupled in 7 days — finance flagged a $48k overrun. No outage, no SLA breach, no alert. The bill is real and it's compounding daily.

Symptoms on the table

  • GCS class-B operation count up 38× week-over-week
  • Trino coordinator metadata heap usage grew from 4 GB to 14 GB
  • Superset dashboard P95 latency rose from 12 s to 47 s (dashboards still load)
  • Iceberg metadata.json file grew from 6 MB to 412 MB on revenue.events
  • No paging alerts fired — all queries still succeed
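A quick way to rank these symptoms is to turn each before/after pair into a growth factor. The sketch below does that arithmetic with the figures listed above; the dictionary keys and the `growth_factor` helper are illustrative names, not part of any tool in the incident.

```python
# Before/after figures copied from the symptom list above.
# Key names are hypothetical labels chosen for this sketch.
symptoms = {
    "trino_metadata_heap_gb": (4, 14),
    "superset_p95_seconds": (12, 47),
    "iceberg_metadata_json_mb": (6, 412),
}


def growth_factor(before: float, after: float) -> float:
    """Return how many times larger `after` is than `before`."""
    return after / before


# Round to one decimal place so the factors are easy to compare.
factors = {
    name: round(growth_factor(before, after), 1)
    for name, (before, after) in symptoms.items()
}

# Print the largest growth first.
for name, factor in sorted(factors.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {factor}x")
```

Read this way, the Iceberg metadata file grew by roughly 69×, far more than heap (3.5×) or dashboard latency (about 3.9×), and closer in scale to the 38× rise in class-B operations. One plausible reading is that file and metadata volume, rather than query load, is the place to start looking.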

Systems on the board

The real components in play for this incident — the surface you investigate when the clock starts.

  • Kafka: ingest topic
  • Spark Streaming: micro-batch writer
  • Parquet (raw): object storage layer
  • Iceberg: table format / metadata
  • Compaction Job: nightly Airflow DAG
  • Trino: query engine
  • Superset: BI dashboards
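To make the compaction component concrete, here is a minimal sketch of the general idea: packing many small files into groups close to a target size, so that each group can be rewritten as one larger file. It is a generic first-fit-decreasing grouping, not how Iceberg or the nightly DAG in this scenario actually plans rewrites. The `plan_compaction` function and the 128 MiB target are assumptions made for illustration.

```python
from typing import List

MIB = 1024 * 1024
TARGET_BYTES = 128 * MIB  # assumed target file size; not taken from the scenario


def plan_compaction(file_sizes: List[int], target_bytes: int = TARGET_BYTES) -> List[List[int]]:
    """Group small files into bins whose total size is at most target_bytes.

    Files already at or above the target are skipped. Uses first-fit
    decreasing: the largest small files are placed first, each into the
    first bin that still has room.
    """
    small = sorted((s for s in file_sizes if s < target_bytes), reverse=True)
    bins: List[List[int]] = []
    totals: List[int] = []
    for size in small:
        for i, total in enumerate(totals):
            if total + size <= target_bytes:
                bins[i].append(size)
                totals[i] += size
                break
        else:
            # No existing bin has room, so open a new one.
            bins.append([size])
            totals.append(size)
    return bins


# Hypothetical input: 1,000 files of 1 MiB each, as a micro-batch writer might produce.
groups = plan_compaction([MIB] * 1000)
print(f"{1000} small files -> {len(groups)} rewrite groups")
```

With these made-up inputs, 1,000 one-MiB files collapse into 8 groups. The same reasoning explains why file count, rather than total bytes, drives the per-object request and metadata overhead seen in the symptoms.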

What you'll practice

This is a timed, hands-on Incident Response challenge. You diagnose the symptom, trace it to a root cause across real components, and ship a fix before the clock runs out. It is the same loop you run on call, without the production blast radius.

Interactive challenge

Solve it in the Simulation Arcade.

The interactive workspace — live metrics, the component map, and the fix you ship — runs inside Petascale Labs. Sign in to start the clock.
