Migrating from Cron to Airflow Without an Outage

Module: Migrations and Deprecations | Duration: ~12 min | Lesson: 1 of 6


TheWorldShop runs 40 nightly jobs from a single crontab on one server. It's fragile (no retries, no visibility, no dependencies, just timing and hope), and the team is moving to Airflow. The plan, written in a kickoff doc, is "cut over the weekend of the 14th."

Dev pushes back. "Cut over to what, exactly? We don't actually know what half these cron jobs produce, whether the Airflow versions match, or what breaks downstream if one is subtly wrong. A big-bang weekend cutover means we find out all of that on Sunday night with everything already switched."

The crontab works, mostly. How do you replace it with something better without a weekend where you discover, live, that your reimplementation didn't match?


2. Concept Explanation

The big-bang cutover is the trap

The instinct is to reimplement everything in Airflow, pick a date, flip the switch, and decommission cron. This fails for a predictable reason: you can't know your reimplementation is correct until it runs against real data, and a big-bang cutover means the first time it runs against real data is also the moment cron is gone and there's no fallback. Every discrepancy becomes a live incident, all at once, with no baseline to compare against.

The deeper problem is unknown correctness. A crontab that's run for years has accreted subtle behavior: a job that depends on another finishing first (enforced only by lucky timing), an edge case someone patched in 2022, an output format a downstream consumer quietly relies on. You don't have a spec. The cron jobs are the spec. So "reimplement and cut over" is really "rewrite an unspecified system and hope the rewrite matches," and hope is not a migration strategy.

Run both in parallel and compare outputs

The safe migration is parallel run with output comparison. For a period (a week or more), Airflow runs the new version of each job alongside cron, both producing output, and you compare the two row-by-row. Cron stays the source of truth; Airflow's output is validated against it but not yet trusted by consumers.

This converts "discover discrepancies live on Sunday" into "discover discrepancies during a calm week, with both outputs in front of you." Each difference is a bug you fix while cron is still safely in charge. The migration's correctness is demonstrated, not assumed.
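A minimal sketch of that comparison in Python, assuming each system's output for one cycle has already been loaded into a mapping from row key to value. The function name diff_outputs and the sample figures are illustrative assumptions, not part of TheWorldShop's pipeline.

```python
from typing import Dict, List, Optional, Tuple

Mismatch = Tuple[str, Optional[float], Optional[float]]


def diff_outputs(cron: Dict[str, float], airflow: Dict[str, float]) -> List[Mismatch]:
    """Return (key, cron_value, airflow_value) for every key that disagrees.

    A key present on only one side is reported with None for the missing side.
    """
    mismatches: List[Mismatch] = []
    # Walk the union of keys so rows missing from either side are not skipped.
    for key in sorted(set(cron) | set(airflow)):
        cron_value = cron.get(key)
        airflow_value = airflow.get(key)
        if cron_value != airflow_value:
            mismatches.append((key, cron_value, airflow_value))
    return mismatches


# Hypothetical sample data: one value differs and each side has one extra key.
cron_rows = {"EU": 100.0, "US": 250.5, "APAC": 75.0}
airflow_rows = {"EU": 100.0, "US": 250.0, "LATAM": 10.0}

for mismatch in diff_outputs(cron_rows, airflow_rows):
    print(mismatch)
```

Iterating over the union of keys is the important design choice: a comparison driven only by cron's keys would never notice a row that Airflow produced and cron did not.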

The cutover is the day the diff is empty

This gives a crisp, non-political definition of "ready to cut over": the cutover happens when the diff between cron's output and Airflow's output is empty and stays empty, not when someone gets impatient or a date on a slide arrives. The empty diff is the evidence. Before it, you have discrepancies to resolve; once it has held across several consecutive runs, you have evidence on real data that the reimplementation matches. The date is an outcome, not an input.
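One way to make "empty and stays empty" checkable is sketched below. It assumes you record the number of differing rows after each cycle; the function name and the default threshold of three clean runs are illustrative assumptions, and the right threshold depends on the job.

```python
from typing import Sequence


def ready_to_cut_over(diff_row_counts: Sequence[int], required_clean_runs: int = 3) -> bool:
    """True when the most recent `required_clean_runs` diffs were all empty.

    diff_row_counts holds one entry per cycle, oldest first.
    """
    if required_clean_runs <= 0:
        raise ValueError("required_clean_runs must be positive")
    if len(diff_row_counts) < required_clean_runs:
        return False  # not enough history yet to call the diff stable
    recent = diff_row_counts[-required_clean_runs:]
    return all(count == 0 for count in recent)


# Hypothetical histories of differing-row counts per nightly run.
print(ready_to_cut_over([4, 1, 0, 0, 0]))
print(ready_to_cut_over([4, 0, 0, 1, 0]))
```

The second history touched zero twice but regressed in between, so only the most recent single run is clean and the check still says "not yet."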

Concretely, per job:

  1. Shadow phase: Airflow runs the job, writes to a parallel location (revenue_airflow next to cron's revenue). Consumers still read cron's output.
  2. Compare: an automated diff runs after each cycle. Differences are logged and triaged. The diff trend should go to zero as you fix bugs.
  3. Cut over (per job): once the diff is reliably empty, flip consumers to read Airflow's output. Cron's version becomes the shadow.
  4. Decommission: after the Airflow version has been authoritative and stable for a safety period, remove the cron job.
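
The steps above can be read as a small per-job state machine. The sketch below is one hypothetical way to encode it (the Phase names and the move function are assumptions for illustration): forward moves require an empty diff, while the rollback from authoritative to shadow is always allowed.

```python
from enum import Enum


class Phase(Enum):
    SHADOW = "shadow"                  # Airflow runs, cron is the source of truth
    AUTHORITATIVE = "authoritative"    # consumers read Airflow, cron is the shadow
    DECOMMISSIONED = "decommissioned"  # cron entry removed


# Legal transitions. AUTHORITATIVE -> SHADOW is the rollback path.
_ALLOWED = {
    Phase.SHADOW: {Phase.AUTHORITATIVE},
    Phase.AUTHORITATIVE: {Phase.SHADOW, Phase.DECOMMISSIONED},
    Phase.DECOMMISSIONED: set(),
}


def move(current: Phase, target: Phase, diff_is_empty: bool) -> Phase:
    """Return the new phase, or raise if the move is not allowed."""
    if target not in _ALLOWED[current]:
        raise ValueError(f"cannot go from {current.value} to {target.value}")
    is_rollback = current is Phase.AUTHORITATIVE and target is Phase.SHADOW
    if not is_rollback and not diff_is_empty:
        raise ValueError("forward moves require an empty diff")
    return target


phase = move(Phase.SHADOW, Phase.AUTHORITATIVE, diff_is_empty=True)
print(phase.value)
```

There is deliberately no transition out of DECOMMISSIONED and no shortcut from SHADOW straight to DECOMMISSIONED: the fallback can only be retired from a state where Airflow has already been authoritative.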

Migrate incrementally, job by job, not all at once

Parallel-run pairs naturally with incremental migration: cut over jobs one at a time (or in small dependency-respecting batches), not all 40 at once. Each job's cutover is small, reversible, and independently validated. If job 7's Airflow version has a subtle bug, only job 7 is affected, and cron job 7 is one flag away. A big-bang cutover, by contrast, makes all 40 risks land simultaneously with one shared rollback.

Order the increments by dependency and risk: migrate leaf jobs (nothing depends on them) and low-risk jobs first to build confidence and shake out the tooling, then work toward the high-blast-radius core jobs once the process is proven.
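
A sketch of how that ordering could be computed, assuming you have written down which jobs each job depends on and assigned each a rough risk score. The job names, the scores, and the function name are hypothetical. It uses the standard library's graphlib (Python 3.9+) with the dependency edges inverted, so a job is only scheduled for migration after everything that depends on it.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def migration_order(depends_on: Dict[str, Set[str]], risk: Dict[str, int]) -> List[str]:
    """Leaf-first order: a job comes after every job that depends on it.

    Among jobs that are eligible at the same time, lower risk goes first.
    """
    # Invert the edges: dependents[x] = jobs that read x's output.
    dependents: Dict[str, Set[str]] = {job: set() for job in depends_on}
    for job, upstreams in depends_on.items():
        for upstream in upstreams:
            dependents.setdefault(upstream, set()).add(job)

    sorter = TopologicalSorter(dependents)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    order: List[str] = []
    eligible: List[str] = []
    while sorter.is_active():
        eligible.extend(sorter.get_ready())
        eligible.sort(key=lambda job: (risk.get(job, 0), job))
        job = eligible.pop(0)  # lowest-risk eligible job goes next
        order.append(job)
        sorter.done(job)
    return order


# Hypothetical jobs: daily_report reads core_orders; tmp_cleanup stands alone.
deps = {"daily_report": {"core_orders"}, "core_orders": set(), "tmp_cleanup": set()}
risks = {"tmp_cleanup": 1, "daily_report": 2, "core_orders": 5}
print(migration_order(deps, risks))
```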

Keep a rollback until the safety period passes

Even after an empty diff and cutover, keep cron able to take over for a defined safety period. The shadow runs cheaply, and the day Airflow's version does something cron didn't, you flip back in seconds while you investigate. Decommissioning cron is the last step, taken only after the new version has been authoritative through a representative period (a month-end close, a Black Friday, whatever stresses the job). Removing the fallback early just recreates the big-bang risk with extra steps.
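
The decommission gate can be written down as a check rather than left to judgment in the moment. The sketch below assumes you list the dates of the stress events the job must survive; the function name, parameters, and the 14-day default buffer are illustrative assumptions.

```python
from datetime import date, timedelta
from typing import Iterable


def safe_to_decommission(
    cutover: date,
    today: date,
    stress_events: Iterable[date],
    diff_clean_since_cutover: bool,
    buffer: timedelta = timedelta(days=14),
) -> bool:
    """True once Airflow has been authoritative through a stress event plus a buffer."""
    if not diff_clean_since_cutover:
        return False
    # Only events that happened while Airflow was authoritative count.
    survived = [event for event in stress_events if event >= cutover]
    if not survived:
        return False
    return today >= min(survived) + buffer


# Hypothetical dates: cutover on the 10th, month-end close on the 31st.
cutover_day = date(2024, 1, 10)
month_end = [date(2024, 1, 31)]
print(safe_to_decommission(cutover_day, date(2024, 2, 10), month_end, True))
print(safe_to_decommission(cutover_day, date(2024, 2, 14), month_end, True))
```

A stress event that occurred before the cutover does not count, because cron was still the authoritative system when it happened.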


3. Worked Example

Migrating TheWorldShop's revenue cron job to Airflow, safely.

Step 1: shadow, write to a parallel location.

-- Airflow DAG writes alongside cron, not over it. Consumers untouched.
INSERT OVERWRITE TABLE revenue_airflow PARTITION (dt='{{ ds }}')   -- parallel
SELECT region, SUM(amount_usd) AS revenue
FROM orders_usd WHERE dt='{{ ds }}' GROUP BY region;
-- cron still writes the authoritative `revenue` table; consumers read it.

Step 2: diff the two outputs every cycle.

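-- Row-by-row comparison. Goal: zero rows out.
-- Filter each side to the partition BEFORE joining: a WHERE on c.dt after a
-- FULL OUTER JOIN would silently drop rows that exist only in Airflow's output.
SELECT COALESCE(c.region, a.region) AS region,
       c.revenue AS cron_rev, a.revenue AS airflow_rev
FROM (SELECT region, revenue FROM revenue WHERE dt = '{{ ds }}') c   -- cron (source of truth)
FULL OUTER JOIN
     (SELECT region, revenue FROM revenue_airflow WHERE dt = '{{ ds }}') a
  ON c.region = a.region
WHERE c.revenue IS DISTINCT FROM a.revenue
   OR c.region IS NULL OR a.region IS NULL;   -- any mismatch or missing key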

Step 3: triage the diff. Each row is a bug in the reimplementation.

Day 1 diff: 12 regions differ -> Airflow version missed a late-arriving-orders
            reprocess window cron did implicitly. Fix: add the trailing window.
Day 3 diff: 1 region differs   -> rounding: cron used ROUND(x,2), Airflow didn't.
Day 5 diff: 0 rows.
Day 6 diff: 0 rows.
Day 7 diff: 0 rows.            -> the diff is empty and stable.

Step 4: cut over this one job, keep the shadow.

# Flip consumers to read revenue_airflow (or swap which writes `revenue`).
# Cron's revenue job now runs as the SHADOW. Rollback = flip back, seconds.
set_consumer_source("revenue", "airflow")
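
The lesson does not define set_consumer_source, so the following is only a sketch of what such a flag could look like, assuming consumers resolve their table name through a shared lookup instead of hard-coding it. The TABLES mapping and the table_for helper are hypothetical.

```python
from typing import Dict

# Hypothetical mapping of each logical dataset to its physical table per system.
TABLES: Dict[str, Dict[str, str]] = {
    "revenue": {"cron": "revenue", "airflow": "revenue_airflow"},
}

# Which system is currently authoritative for each dataset. Starts on cron.
_active_source: Dict[str, str] = {"revenue": "cron"}


def set_consumer_source(dataset: str, source: str) -> None:
    """Point every consumer of `dataset` at the given system's output."""
    if dataset not in TABLES or source not in TABLES[dataset]:
        raise ValueError(f"unknown dataset/source: {dataset}/{source}")
    _active_source[dataset] = source


def table_for(dataset: str) -> str:
    """Table name consumers should read right now."""
    return TABLES[dataset][_active_source[dataset]]


set_consumer_source("revenue", "airflow")  # cut over
print(table_for("revenue"))
set_consumer_source("revenue", "cron")     # rollback is the same call
print(table_for("revenue"))
```

Because cutover and rollback are the same operation with a different argument, the rollback path is exercised by design rather than being a separate, untested procedure.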

Step 5: decommission cron's revenue job, only after a safety period.

Airflow version authoritative through month-end close + 2 weeks, diff stayed empty.
-> remove the cron entry. Fallback retired last, deliberately.

Repeat per job, leaf-first, until the crontab is empty. No weekend, no big bang, every cutover backed by an empty-diff proof and a one-flag rollback.

Aha: Run both systems in parallel and compare outputs row-by-row; the cutover is the day the diff is empty, not the day someone gets impatient. A crontab that's run for years is the spec, so "reimplement and cut over" is "rewrite an unspecified system and hope." Parallel run turns "discover the mismatch live on Sunday night" into "fix it during a calm week with both outputs in front of you."


4. Real-World Application

Parallel-run-and-compare is the standard playbook for any risky migration where the old system is the de facto spec, and it long predates data engineering: it's how banks migrate core systems and how teams apply the strangler-fig pattern in application code. In data specifically, it's the only defensible way to migrate from cron, hand-rolled scripts, or a legacy orchestrator, precisely because the old pipeline's exact behavior is undocumented and the diff is the only trustworthy spec.

The tooling supports it well. Writing Airflow output to a parallel table and running a reconciliation query is straightforward, and the comparison itself is just the reconciliation/contract-test machinery from earlier lessons pointed at "old output vs new output" instead of "data vs contract." dbt's audit_helper package provides macros for comparing two relations or queries, which is the kind of diff a migration needs. The empty-diff bar is objective and removes the politics: you cut over on evidence, not on a date a stakeholder wrote on a slide.

The judgment is in sequencing and in resisting impatience. Migrate leaf and low-risk jobs first to prove the process, save the high-blast-radius core for last, and keep the old system as a fallback through a representative stress period (a month-end, a peak day) before decommissioning. A common way this migration still goes wrong is removing the fallback too early: the moment the diff first hits zero, before the new version has survived the edge cases that only appear at close or at peak load. Patience on the decommission step is the cheapest insurance in the whole migration.


5. Your Turn

Exercise: TheWorldShop has 40 cron jobs. Among them: nightly_revenue (10 downstream consumers, feeds finance), cleanup_temp_files (no downstream, just housekeeping), and inventory_sync (depends on nightly_revenue finishing first, enforced today only by cron scheduling them 30 minutes apart).

  1. In what order would you migrate these three, and why?
  2. The inventory_sync dependency on nightly_revenue is implicit (just timing). How does moving to Airflow let you make it explicit, and why is parallel-run still needed even though Airflow models the dependency?
  3. The diff for nightly_revenue hits zero on day 3. A stakeholder says "great, kill the cron job." What's your response, and what specifically do you wait for?

6. Recap + Bridge

Never big-bang a migration off a system whose behavior is undocumented: the old system is the spec. Run both in parallel, write the new output to a shadow location, and diff it row-by-row against the old; the cutover is the day the diff is empty and stays empty, not a date on a slide. Migrate incrementally, leaf-first, and keep the old system as a one-flag rollback through a representative stress period before decommissioning. Next lesson: the opposite problem, getting rid of a pipeline. Deprecating a DAG nobody admits they own, where the screams are the audit trail.