Module: Scheduler Deep Dive | Duration: ~14 min | Lesson: 1 of 10

Dev's first Airflow on-call rotation. He opens the metadata DB to poke around. He runs SELECT * FROM dag_run WHERE state='running' and sees 47 rows. He runs SELECT * FROM task_instance WHERE state='running' and sees 312. He runs SELECT * FROM task_instance WHERE state='queued' and sees 8,124.

Dev had assumed "the scheduler fires tasks." His mental model was cron: a clock ticks, a task fires. What he's looking at is a database state machine, with thousands of rows moving between states every minute. The scheduler isn't a clock. The scheduler is a database loop reconciling intent against state.

Understanding which rows move where, on each tick, is the difference between "Airflow is magic" and "Airflow is one implementation of a known idea."

2. Concept Explanation

Cron is the wrong mental model

People come to Airflow expecting cron with extra features. Cron is a timer that runs commands. Airflow is closer to a database-backed reconciliation loop: it compares "what should be happening" (DAG definitions plus schedule) against "what is happening" (rows in the metadata DB) and acts to close the gap.

The scheduler is one Python process (or a few, for HA) running this loop continuously. Each iteration is a tick. On each tick the scheduler does roughly these things:

1. Heartbeat. Write a row to the job table saying "I'm alive at timestamp T."
2. Parse DAG files (incrementally). Walk the DAGs folder; re-parse files that have changed; update in-memory DAG objects.
3. Create new DAG runs. For each DAG, look at the schedule and the last dag_run. If a new interval should fire, insert a row into dag_run with state queued or running.
4. Create task instances. For each running dag_run, generate rows in task_instance for tasks that don't yet have one.
5. Schedule task instances. Move task instances from none to scheduled when their dependencies are met (upstream succeeded, trigger rules satisfied), then from scheduled to queued when a pool slot is available.
6. Send to executor. For each queued task instance, hand off to the executor (Local, Celery, Kubernetes). The executor's worker moves the row to running when it starts the task.
7. Adopt running tasks. Reap orphans whose owner died, and advance the dependents of tasks that completed.

One pass through these steps is a scheduler tick. The loop repeats continuously; with Airflow 2.x defaults a tick lands roughly every 5 seconds.
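
The reconciliation idea can be sketched in a few lines of Python. This is a toy model, not Airflow's code: the TaskRow class, the tick function, and the in-memory rows list are hypothetical stand-ins for the task_instance table and one scheduler pass. It covers only the schedule and dispatch steps and ignores pools, retries, and parsing.

```python
from dataclasses import dataclass, field


@dataclass
class TaskRow:
    """Hypothetical stand-in for one row of task_instance."""
    task_id: str
    upstream: list = field(default_factory=list)  # task_ids this task waits on
    state: str = "none"


def tick(rows, free_slots):
    """One toy scheduler pass: compare intent (dependencies) with state, close the gap."""
    by_id = {row.task_id: row for row in rows}
    # Dispatch first, so a row promoted in this tick waits for the next one.
    for row in rows:
        if row.state == "scheduled" and free_slots > 0:
            row.state = "queued"
            free_slots -= 1
    # Then promote rows whose upstream tasks have all succeeded.
    for row in rows:
        if row.state == "none" and all(by_id[u].state == "success" for u in row.upstream):
            row.state = "scheduled"
    return rows


rows = [TaskRow("extract"), TaskRow("load", upstream=["extract"])]
tick(rows, free_slots=4)  # extract: none -> scheduled; load keeps waiting
tick(rows, free_slots=4)  # extract: scheduled -> queued; load keeps waiting
```

Nothing in tick looks at a clock. It only reads the current rows and moves the ones that are allowed to move, which is the sense in which the scheduler reconciles rather than fires.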

The states a task instance moves through

none -> scheduled -> queued -> running -> success
                                       -> failed (retries exhausted)
                                       -> up_for_retry -> scheduled -> ...
                                       -> skipped
none -> upstream_failed (an upstream task failed, so this one never runs)

Each transition is a row update in task_instance. The scheduler picks rows from scheduled, moves them to queued, gives them to the executor. The executor's worker moves them to running. The worker's exit code moves them to success or failed. The scheduler's next tick reads the new state and advances dependents.
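
To make "each transition is a row update" concrete, here is a minimal sketch using Python's built-in sqlite3 module. The two-column table and the transition helper are hypothetical simplifications; the real task_instance table has many more columns, and Airflow's own code path is considerably more involved. The guarded UPDATE only succeeds when the row is still in the state the caller expects.

```python
import sqlite3

# In-memory toy table standing in for task_instance.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE task_instance (task_id TEXT PRIMARY KEY, state TEXT)")
conn.execute("INSERT INTO task_instance VALUES ('load_orders', 'scheduled')")


def transition(conn, task_id, expected, new):
    """Move a row from `expected` to `new`; return False if it was not in `expected`."""
    cur = conn.execute(
        "UPDATE task_instance SET state = ? WHERE task_id = ? AND state = ?",
        (new, task_id, expected),
    )
    conn.commit()
    return cur.rowcount == 1


transition(conn, "load_orders", "scheduled", "queued")  # scheduler pass
transition(conn, "load_orders", "queued", "running")    # worker starts the task
transition(conn, "load_orders", "running", "success")   # worker exits cleanly
```

A second attempt at any of these transitions returns False, because the row has already moved on. That guard is what lets several processes share one table without stepping on each other.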

This is what people mean when they say "Airflow is database-driven." The Postgres task_instance table is the truth. Every UI screen, every API endpoint, every "is this task running?" question, ultimately resolves to a SELECT against that table.

Why this matters operationally

Most performance and correctness questions in Airflow come down to one of:

The scheduler can't get through its loop fast enough. Parse is slow, the DB is slow, or one of the steps is taking too long. New runs lag. (Lesson 9.)
The metadata DB is the bottleneck. Locks, slow queries, missing indexes. Every step of the loop suffers. (Lesson 4.)
Tasks are queued but not running. The scheduler is fine; the executor or workers are saturated. (Lessons 3, 5, 7, 8.)
Tasks are running but state isn't updating. The worker died, the executor lost track of the task, or the scheduler is busy adopting orphans. These are heartbeat problems.

Each of these has a different fix, and you can only pick the right one if you know which step of the scheduler loop is the bottleneck. "Airflow is slow" is too vague to act on. "The scheduler's parse phase is taking 40 seconds" is actionable.

The bottleneck is the loop, not the workers

A counterintuitive truth: in most Airflow shops with throughput problems, the scheduler loop is the bottleneck long before the workers are. Workers may be 20% busy. The scheduler loop is 100% busy because parse plus DB queries are saturating it. Tasks pile up in scheduled not because workers can't keep up, but because the scheduler can't move them to queued fast enough.

This is why "buy more workers" rarely fixes "Airflow is slow." The workers were never the constraint. Fix the loop first.
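
A toy throughput model shows why. All numbers here are made up for illustration, and the model assumes every task runs for exactly one tick; completed_after and its parameters are hypothetical names, not Airflow settings.

```python
def completed_after(ticks, backlog, dispatch_per_tick, worker_slots):
    """Toy model: every task runs for exactly one tick."""
    done = 0
    for _ in range(ticks):
        # The scheduler can hand off only so many tasks per tick,
        # and the workers can accept only as many as they have slots.
        started = min(backlog, dispatch_per_tick, worker_slots)
        backlog -= started
        done += started
    return done


baseline = completed_after(ticks=10, backlog=2000, dispatch_per_tick=20, worker_slots=128)
more_workers = completed_after(ticks=10, backlog=2000, dispatch_per_tick=20, worker_slots=256)
faster_loop = completed_after(ticks=10, backlog=2000, dispatch_per_tick=80, worker_slots=128)
print(baseline, more_workers, faster_loop)  # 200 200 800
```

Doubling the worker slots changes nothing because the min() is pinned by dispatch_per_tick. Only raising the dispatch rate moves the result.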

What you can poke

The loop is observable. Useful queries against the metadata DB:

-- How current is the scheduler heartbeat?
SELECT EXTRACT(EPOCH FROM (NOW() - latest_heartbeat)) AS seconds_since_heartbeat
FROM   job WHERE job_type='SchedulerJob' AND state='running';

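-- How many task instances are stuck in each state?
-- Filter through dag_run: task instances that have not started yet have a
-- NULL start_date, and the "none" state is stored as NULL.
-- (Assumes the Airflow 2.2+ schema, where task_instance carries run_id.)
SELECT COALESCE(ti.state, 'none') AS state, COUNT(*)
FROM   task_instance ti
JOIN   dag_run dr ON dr.dag_id = ti.dag_id AND dr.run_id = ti.run_id
WHERE  dr.queued_at > NOW() - INTERVAL '1 hour'
GROUP BY 1;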

-- What's the slowest part of the loop right now?
-- (Read scheduler logs for 'completed parsing' lines)

If seconds_since_heartbeat > 30, the scheduler is unhealthy. If scheduled counts are large and growing while queued and running counts stay flat and below parallelism, the scheduler can't dispatch fast enough. If queued counts grow while running counts are saturated against parallelism, the workers are the constraint and you do want to scale them.

The diagnostic flow is always: read the metadata DB, find the saturated stage, fix that stage.
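
That flow can be written down as a small triage function. This is a rough sketch over a single snapshot, so it cannot see whether a count is growing. The diagnose function, its waiting parameter (task instances created but not yet running), and the sample numbers are all hypothetical. The 30-second threshold is the rule of thumb used in this lesson.

```python
def diagnose(heartbeat_age_s, waiting, running, parallelism, max_heartbeat_age_s=30):
    """Single-snapshot triage; `waiting` counts task instances not yet running."""
    findings = []
    if heartbeat_age_s > max_heartbeat_age_s:
        findings.append("scheduler unhealthy")
    if running >= parallelism:
        # Every slot is busy, so the backlog is the workers' problem.
        findings.append("workers saturated")
    elif waiting > running:
        # Slots are free, yet more work is waiting than running.
        findings.append("dispatch lagging")
    return findings or ["no saturated stage found"]


# Hypothetical snapshot: 1,442 waiting, 52 running, parallelism 128.
print(diagnose(heartbeat_age_s=27, waiting=1442, running=52, parallelism=128))
# ['dispatch lagging']
```

In practice you would take two snapshots a few minutes apart and compare them, since a large but shrinking backlog is a different situation from a large and growing one.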

3. Worked Example

A morning where Dev observes the scheduler under load and learns to read the state machine.

At 02:00, the daily DAGs start firing.

Dev queries the metadata DB:

-- queued_at is when the run was created; execution_date marks the start of the data interval
SELECT state, COUNT(*) FROM dag_run WHERE queued_at > NOW() - INTERVAL '30 minutes' GROUP BY state;

state   | count
--------+-------
queued  | 34
running | 13

47 DAG runs were created in the last 30 minutes. 13 are running. 34 are in queued, waiting for the scheduler's next pass to move their tasks forward.

At 02:05, he checks task instances.

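-- join through dag_run: task instances that have not started yet have a NULL start_date
SELECT COALESCE(ti.state, 'none') AS state, COUNT(*) FROM task_instance ti JOIN dag_run dr ON dr.dag_id = ti.dag_id AND dr.run_id = ti.run_id WHERE dr.queued_at > NOW() - INTERVAL '30 minutes' GROUP BY 1;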

state          | count
---------------+-------
scheduled      | 1,254
queued         |   188
running        |    52
success        |    19
none           |   840

The numbers tell a story:

none (840): task instances exist but haven't been picked up yet. The scheduler hasn't reached step 5 for them.
scheduled (1,254): dependencies met, waiting for a pool slot or executor handoff. The scheduler will move these to queued over the next few ticks.
queued (188): handed to the executor, waiting for a worker to start them.
running (52): actually executing.
success (19): done.

The bottleneck is upstream of running. The workers are quiet (running=52 with parallelism=128). The scheduler is choking on its own queue.

At 02:08, Dev checks the heartbeat:

SELECT EXTRACT(EPOCH FROM (NOW() - latest_heartbeat)) AS seconds_since
FROM   job WHERE job_type='SchedulerJob' AND state='running';

seconds_since
-------------
27

The heartbeat is 27 seconds old. Below 30s, so technically still "alive," but the loop is slow. The scheduler is doing more work per tick than it should.

At 02:12, Dev looks at the scheduler logs:

[02:11:47] DAGFileProcessorManager: completed parsing of file /opt/airflow/dags/etl.py in 3.21s
[02:11:50] DAGFileProcessorManager: completed parsing of file /opt/airflow/dags/marts.py in 4.84s
[02:11:55] DAGFileProcessorManager: completed parsing of file /opt/airflow/dags/dim_loader.py in 5.02s

Three files taking 3-5 seconds each. With 200 DAG files, that's most of the loop's wall-clock budget being eaten by parse (Lesson 2). The fix is in those files, not in the scheduler config.
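
The arithmetic behind that estimate, as a rough sketch. It assumes every one of the 200 files parses about as slowly as the three sampled ones, which likely overstates the total, and it assumes two parser processes working in parallel. Both are assumptions for illustration, not measurements from Dev's cluster.

```python
# Parse durations (seconds) copied from the three log lines above.
sample_parse_times = [3.21, 4.84, 5.02]
dag_file_count = 200    # from the text
parsing_processes = 2   # assumption: number of parser processes running in parallel

mean_parse_s = sum(sample_parse_times) / len(sample_parse_times)
serial_pass_s = mean_parse_s * dag_file_count
parallel_pass_s = serial_pass_s / parsing_processes

print(f"mean parse time: {mean_parse_s:.2f}s")
print(f"one full pass, serial: {serial_pass_s:.0f}s")
print(f"one full pass, {parsing_processes} processes: {parallel_pass_s:.0f}s")
```

Under those assumptions a full pass over the DAGs folder takes minutes, not seconds, even with parallel parsing. Shaving a second off each slow file buys far more than any tuning of the tick interval.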

Aha: The Airflow scheduler isn't a clock. It's a database-backed reconciliation loop. Every tick reads rows from task_instance and writes new rows back. When someone says "Airflow is slow," the right next question is "which step of the loop is slow?" The metadata DB has the answer.

4. Real-World Application

The teams that maintain Airflow well treat the metadata DB as a first-class observability surface. Some shops build a "scheduler health" dashboard that reads:

Heartbeat age (seconds since latest scheduler heartbeat)
Parse times per DAG file (histogram, p50/p95/p99)
Task instance state counts (per-state, last hour)
Pool utilization (used vs total slots)
DAG run latency (creation time vs scheduled time)

The dashboard becomes the on-call's first stop for any Airflow-shaped page. Without it, every investigation starts with "let me ssh in and tail some logs," which is slow and tribal.
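
One possible shape for the data behind such a dashboard, as a sketch. The health_snapshot function, its field names, and the sample values are hypothetical; a real version would fill them from metadata DB queries and scheduler logs. DAG run latency is left out to keep the example short.

```python
from statistics import quantiles


def health_snapshot(heartbeat_age_s, parse_times_s, state_counts, pool_used, pool_total):
    """Assemble the dashboard metrics into one dict (hypothetical shape)."""
    # n=100 yields 99 cut points; index 49 is p50, 94 is p95, 98 is p99.
    cuts = quantiles(parse_times_s, n=100, method="inclusive")
    return {
        "heartbeat_age_s": heartbeat_age_s,
        "parse_p50_s": cuts[49],
        "parse_p95_s": cuts[94],
        "parse_p99_s": cuts[98],
        "state_counts": dict(state_counts),
        "pool_utilization": pool_used / pool_total if pool_total else 0.0,
    }


snapshot = health_snapshot(
    heartbeat_age_s=27,
    parse_times_s=[0.4, 0.6, 0.9, 3.21, 4.84, 5.02],  # made-up sample
    state_counts={"scheduled": 1254, "queued": 188, "running": 52},
    pool_used=52,
    pool_total=128,
)
```

Percentiles matter more than the mean here: a handful of slow files can dominate parse time while the median file looks healthy.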

The Airflow Web UI exposes some of this (the Stats page, the Pools page), but custom queries against the metadata DB are more flexible. Just don't run them in a transaction that holds locks; the scheduler's own queries will pile up behind you.

5. Your Turn

Exercise: An Airflow cluster reports the following at peak load:

seconds_since_heartbeat: 52
task_instance states: scheduled: 2,000, queued: 50, running: 30
parallelism config: 128
Worker CPU: 18% average across the fleet
Scheduler logs show ~6 seconds per DAG parse

Which step of the scheduler loop is the bottleneck? Justify from the numbers.
Why is buying more workers the wrong response?
Name two configuration changes (one per relevant layer) you'd make first. Don't guess; tie each to a specific symptom.

6. Recap + Bridge

The scheduler is a database-driven reconciliation loop, not a clock. Every tick it reads rows, transitions them, and writes back. "Slow Airflow" decomposes into "which step of the loop is slow?", and the metadata DB tells you. Next lesson digs into the most common bottleneck step in real clusters: DAG parsing, and why one file with heavy imports taxes every scheduler tick.