Deploying DAGs Without Restarting Anything

Module: Airflow in Production | Duration: ~12 min | Lesson: 1 of 7


Dev's first Airflow deploy at TheWorldShop. He builds a Docker image that bakes the DAGs folder into it, pushes the new image to ECR, rolls the scheduler deployment. The scheduler restarts. Every running task gets killed in-flight. Two finance close jobs land at 09:14 instead of 06:00.

Dev didn't know Airflow had to do anything on a DAG deploy. He treated it like a regular code deploy. Other teams ship DAG changes 30+ times a day without restarting anything. The reason is one design decision they made and Dev didn't: don't bake DAGs into the image.


2. Concept Explanation

Why restarting hurts

Restarting the Airflow scheduler:

  • Kills any task running inside the scheduler's own process tree (LocalExecutor). With the Celery or Kubernetes executors the effect is less direct: running tasks survive, but scheduling decisions stall until the scheduler is back.
  • Interrupts the parse loop; new DAGs take 1-2 cycles to reappear.
  • Triggers metadata-DB reconciliation as the scheduler re-establishes state.
  • For Celery: workers keep running the tasks they already hold, but new tasks wait for the scheduler before they reach the queue.

The cost is small if it's planned (off-hours window, no critical tasks running). It's expensive if it happens unexpectedly during a busy period. Finance loses morning close. ML jobs lose hours of training. Vendor SLAs get missed.

The goal: never restart the scheduler for a DAG-content change. Restart only for infrastructure changes (config, image upgrade).

The two DAG-delivery patterns

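Two patterns cover most production clusters when it comes to getting DAG files into a running Airflow deployment (bucket-backed stores, covered under Real-World Application, behave like the first):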

1. Bind-mount + git-sync.

The DAGs folder is a shared volume mounted into both scheduler and worker pods. A sidecar container (git-sync is the common choice; it's part of the Airflow Helm chart) periodically pulls from a Git repo into the volume.

git repository (DAGs) 
  -> git-sync sidecar (pulls every N seconds)
    -> shared volume (PVC in K8s)
      -> mounted into scheduler pod
      -> mounted into worker pods

A new commit to the repo lands in the volume within ~30 seconds. The scheduler's next parse cycle (Lesson 1 of Course 3.1) picks it up. No restart. No image rebuild. No outage.

2. Image rebuild + rolling deploy.

The DAGs folder is baked into the Airflow Docker image. New DAGs require a new image build, push to registry, and a rolling deploy of the scheduler/workers.

This is what Dev did. It restarts everything every deploy. It's also slow (build + push takes 5-10 minutes; deploy adds more). It does have one advantage: the DAG files and the dependencies (Python packages) are versioned together. If a DAG needs a new library, you bump the image and the library lands at the same time.

The pattern that scales: bind-mount + git-sync for DAG changes, image rebuild only for dependency changes.

How git-sync works

git-sync is a small Go binary that runs in a sidecar container. Configuration (in the Helm chart values):

dags:
  gitSync:
    enabled: true
    repo: https://github.com/theworldshop/airflow-dags.git
    branch: main
    rev: HEAD
    depth: 1
    wait: 30          # poll every 30 seconds
    subPath: dags     # the DAGs folder within the repo

Every 30 seconds, git-sync fetches from the remote. If the tracked revision has changed, it checks the new commit out into a fresh directory and then repoints a symlink at it in a single atomic rename. The scheduler reads through that symlink, so it never sees a half-updated tree.
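The symlink-swap idea can be sketched in a few lines of Python. This is an illustration of the technique only, not git-sync's code (git-sync is written in Go); the function name publish, the rev-<id> directory naming, and the link name current are all assumptions made for the example.

```python
import os
import tempfile
from pathlib import Path


def publish(root: Path, rev: str, files: dict[str, str], link_name: str = "current") -> Path:
    """Write `files` into a fresh per-revision directory, then repoint the symlink."""
    rev_dir = root / f"rev-{rev}"
    rev_dir.mkdir(parents=True, exist_ok=True)
    for rel_path, content in files.items():
        target = rev_dir / rel_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(content)

    # Create the new symlink under a temporary name, then rename it over the
    # old one. rename() is atomic on POSIX filesystems, so a reader resolving
    # `current` sees either the old revision or the new one, never a mix.
    tmp_link = root / f".{link_name}.tmp"
    if tmp_link.is_symlink():
        tmp_link.unlink()
    os.symlink(rev_dir.name, tmp_link)  # relative target, resolved from `root`
    os.replace(tmp_link, root / link_name)
    return root / link_name


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        root = Path(d)
        publish(root, "aaa111", {"marts/daily_revenue.py": "# v1\n"})
        publish(root, "bbb222", {"marts/daily_revenue.py": "# v2\n"})
        print((root / "current" / "marts" / "daily_revenue.py").read_text().strip())
```

The old revision directory stays on disk after the swap, which is why a process that opened a file just before the swap can still finish reading it.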

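The scheduler's parse loop picks up changed files. With tuned intervals, new DAGs appear in the UI within 30-60 seconds of merge. On Airflow 2.x defaults, existing files are re-parsed every 30 seconds (min_file_process_interval), but brand-new files are only discovered every 300 seconds (dag_dir_list_interval), so lower that setting if new DAGs need to show up this quickly.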

The DAGs folder layout

Inside the repo, the DAGs folder is structured like a normal Python project:

dags/
  __init__.py
  common/
    utils.py
    schemas.py
  marts/
    daily_revenue.py
    monthly_close.py
  ingest/
    vendor_a.py
    vendor_b.py
  requirements.txt

The requirements.txt is not read by Airflow. It's documentation for what dependencies the image needs to include. Updating it means rebuilding the image (the second pattern).

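Tip: keep dags/ clean. The scheduler walks it recursively and considers every .py file for parsing (by default, any file whose text mentions both "airflow" and "dag"). Junk files (test scripts, debug code, old notebooks-as-py) get parsed too and add to the cycle time. Files that must live there but are not DAGs can be excluded with a .airflowignore file.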
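To audit what the scheduler is likely to pick up, a small script can walk the folder the same way. The sketch below approximates Airflow 2.x's default safe-mode heuristic (file text contains both "dag" and "airflow", case-insensitively); that approximation, the function name likely_parsed, and the fact that it ignores .airflowignore are all simplifying assumptions.

```python
from pathlib import Path


def likely_parsed(dags_dir: Path) -> list[Path]:
    """List .py files under dags_dir that mention both 'dag' and 'airflow'.

    Approximates the scheduler's default file filter; it does not read
    .airflowignore, so treat the result as an upper bound.
    """
    hits = []
    for path in sorted(dags_dir.rglob("*.py")):
        content = path.read_bytes().lower()
        if b"dag" in content and b"airflow" in content:
            hits.append(path)
    return hits
```

Running it against the repo's dags/ folder and reviewing any file that is not meant to be a DAG is a cheap way to spot leftover scripts.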

The image rebuild path

When you need a new Python dependency (e.g., a DAG that imports pandas for the first time), you have to update the image. The deploy flow:

  1. Update requirements.txt and Dockerfile.
  2. Build a new image.
  3. Push to ECR/GCR/whatever.
  4. Roll the scheduler deployment to use the new image.
  5. Roll the worker deployments to use the new image.

This path genuinely requires a restart. The fix is to make image rebuilds rare by keeping the image dependency-rich enough that most DAG additions don't trigger one.

A standard pattern: bake a "kitchen-sink" image with common libraries (pandas, requests, boto3, snowflake-connector-python, dbt-core, common dbt adapters). About 95% of new DAGs then need no new libraries; the 5% that do get an image bump during a planned window.

Avoiding the trap

Three deployment antipatterns to avoid:

  1. Hardcoding paths in DAGs. A DAG that does with open('/opt/airflow/dags/marts/config.yaml') breaks if the DAGs folder is mounted somewhere else. Build paths relative to the DAG file's own location (via __file__) or pass them in through an env var.
  2. Image-baked DAGs as the only path. Slow deploys mean engineers batch changes, which slows the team's iteration loop. Add git-sync.
  3. Git-sync without a deploy gate. Untested DAGs landing in production within 30 seconds of merge is a fast feedback loop and also a fast outage loop. Add CI tests that block merge until the DAG passes parse-time and dry-run checks.
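For the first antipattern, one way to avoid the hardcoded mount point is to resolve the file next to the DAG module, with an optional override. This is a minimal sketch; the helper name config_path and the MARTS_CONFIG_PATH environment variable are made up for illustration.

```python
import os
from pathlib import Path


def config_path(dag_file: str, filename: str = "config.yaml") -> Path:
    """Locate a config file that sits next to the DAG module.

    An environment variable (hypothetical name) overrides the location,
    which is handy in tests or when the file is mounted separately.
    """
    override = os.environ.get("MARTS_CONFIG_PATH")
    if override:
        return Path(override)
    return Path(dag_file).parent / filename


# Inside dags/marts/daily_revenue.py this would be called as:
#     CONFIG = config_path(__file__)
```

Because the path is derived from __file__, the DAG keeps working whether the folder is mounted at /opt/airflow/dags or somewhere else.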

The CI gate

Even with git-sync, every merge to the DAGs repo should pass:

  • Python parse check (python -c "import dag_file" passes).
  • Airflow DAG validation (airflow dags test ... executes one full DAG run locally; tasks really run, so point CI connections at test resources).
  • Import-time benchmark (parse time < 200ms per file, per Lesson 2 of Course 3.1).
  • Linter checks (no em dashes, no print statements, no CURRENT_DATE in SQL templates).

A merged PR that fails these is a merged PR that breaks production. Block merges on the checks, not on human review alone.
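The first and third checks can share one script. The sketch below imports every file under a DAGs folder and reports import failures and files that exceed a time budget. The function names and the 0.2-second default (taken from the 200ms figure above) are assumptions; in a real run it helps to import airflow once beforehand so its one-off import cost is not billed to the first DAG file.

```python
import importlib.util
import sys
import time
from pathlib import Path


def import_seconds(path: Path) -> float:
    """Execute one file as a module and return how long the import took."""
    spec = importlib.util.spec_from_file_location(f"_dagcheck_{path.stem}", path)
    module = importlib.util.module_from_spec(spec)
    start = time.perf_counter()
    spec.loader.exec_module(module)  # raises on syntax errors and failed imports
    return time.perf_counter() - start


def check_dags(dags_dir: Path, budget_seconds: float = 0.2) -> list[str]:
    """Return one message per failing file; an empty list means everything passed."""
    sys.path.insert(0, str(dags_dir))  # Airflow also puts the DAGs folder on sys.path
    failures = []
    try:
        for path in sorted(dags_dir.rglob("*.py")):
            try:
                elapsed = import_seconds(path)
            except Exception as exc:
                failures.append(f"{path}: import failed: {exc!r}")
                continue
            if elapsed > budget_seconds:
                failures.append(
                    f"{path}: import took {elapsed:.3f}s (budget {budget_seconds}s)"
                )
    finally:
        sys.path.remove(str(dags_dir))
    return failures
```

In CI, a wrapper would print the messages and exit non-zero when the list is not empty.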

The deploy SLA

A team that ships well has metrics:

  • Merge to running: time from PR merge to the new DAG appearing in the Airflow UI. Target: < 90 seconds.
  • Deploys per day: the team should be able to ship 10+ DAG changes per day without operational pain.
  • Restart events per week: should be < 1 (only for image rebuilds, infrastructure changes).

If your team is restarting Airflow once a day to "pick up DAG changes," you're paying a tax that git-sync removes.
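Whether the 90-second merge-to-running target is reachable depends mostly on two polling intervals. The arithmetic below is a rough worst-case model, not a measurement: it assumes a commit lands just after a git-sync poll and just after a scheduler pass, uses a placeholder of one second for parsing, and ignores parse-queue and UI refresh delays. The 30 and 300 second figures are the Airflow 2.x defaults for min_file_process_interval and dag_dir_list_interval.

```python
TARGET_SECONDS = 90  # "merge to running" target from the list above


def worst_case_seconds(git_sync_wait: float, scheduler_interval: float,
                       parse_seconds: float = 1.0) -> float:
    """Commit just missed a git-sync poll, then just missed a scheduler pass."""
    return git_sync_wait + scheduler_interval + parse_seconds


def meets_target(seconds: float, target: float = TARGET_SECONDS) -> bool:
    return seconds < target


# Edit to an existing DAG file: re-parsed every min_file_process_interval.
edit = worst_case_seconds(git_sync_wait=30, scheduler_interval=30)
# Brand-new DAG file: discovered every dag_dir_list_interval.
new_file = worst_case_seconds(git_sync_wait=30, scheduler_interval=300)

print(edit, meets_target(edit))          # 61.0 True
print(new_file, meets_target(new_file))  # 331.0 False
```

Under this model, edits to existing DAGs fit the target on defaults, while brand-new files only fit once the directory listing interval is lowered.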


3. Worked Example

TheWorldShop's deploy migration.

Before: image-baked DAGs. Every DAG change required:

  1. PR merge.
  2. CI builds a new image (~7 minutes).
  3. Engineer manually triggers a deploy (kubectl rollout restart deployment/airflow-scheduler).
  4. Scheduler restarts. Tasks queued during restart wait. Running tasks on the scheduler die.
  5. New DAG appears in UI within 1-2 minutes after restart.

Total: ~15 minutes per change. Engineers batch 5-10 changes per week and deploy on Friday afternoons. Mistakes survive until next Friday.

After: git-sync + rare image rebuilds.

Step 1: Configure git-sync in the Helm chart.

# values.yaml
dags:
  gitSync:
    enabled: true
    repo: https://github.com/theworldshop/airflow-dags.git
    branch: main
    wait: 30
    subPath: dags

Step 2: Move the DAGs out of the image. Update the Dockerfile to not copy the DAGs folder.

# Before
COPY dags/ /opt/airflow/dags/

# After
# (no DAG copy; git-sync handles it)

Step 3: Bake common Python dependencies into the image.

COPY requirements.txt /
RUN pip install -r /requirements.txt

requirements.txt includes pandas, requests, boto3, snowflake-connector-python, dbt-core, etc. About 95% of new DAGs need no new libraries.

Step 4: Roll the new image once. After this, no more rolling for DAG changes.

Result: PR merge to new DAG in the UI: ~90 seconds. Engineers ship 5-10 DAG changes per day. Restarts happen ~monthly when a new dependency lands in the image.

The CI gate, in practice

# .github/workflows/dag-pr.yml
name: DAG PR checks
on: pull_request

jobs:
  parse:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - run: pip install "apache-airflow==2.10.*" pandas requests boto3
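      - name: DAG parse check
        run: |
          shopt -s globstar        # without this, ** only matches one directory level
          export PYTHONPATH=dags   # Airflow puts the DAGs folder on sys.path; mirror that here
          for f in dags/**/*.py; do
            python -c "import sys, importlib.util; spec=importlib.util.spec_from_file_location('m', sys.argv[1]); m=importlib.util.module_from_spec(spec); spec.loader.exec_module(m)" "$f" || exit 1
          done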
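      - name: Parse time check
        run: |
          shopt -s globstar
          export PYTHONPATH=dags
          for f in dags/**/*.py; do
            # Import airflow before starting the timer so its one-off import cost is not billed to the file.
            t=$(python -c "import sys, time, importlib.util, airflow; t=time.perf_counter(); spec=importlib.util.spec_from_file_location('m', sys.argv[1]); m=importlib.util.module_from_spec(spec); spec.loader.exec_module(m); print(time.perf_counter()-t)" "$f") || exit 1
            python -c "import sys; sys.exit(1 if float(sys.argv[1]) > 0.3 else 0)" "$t" || { echo "$f takes $t seconds to import"; exit 1; }
          done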
      - name: Em dash check
        run: |
          shopt -s globstar   # make ** match files at every depth under dags/
          if grep -P '—' dags/**/*.py; then echo "Em dash found"; exit 1; fi
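      - name: Airflow DAG test
        run: |
          shopt -s globstar
          export AIRFLOW_HOME=$(pwd) PYTHONPATH=dags
          airflow db migrate   # replaces the deprecated "airflow db init" on Airflow 2.7+
          for f in dags/**/*.py; do
            # Print the dag_id of every DAG object in the file; helper modules print nothing and are skipped.
            dag_ids=$(python -c "import sys; from importlib.util import spec_from_file_location, module_from_spec; from airflow import DAG; spec=spec_from_file_location('m', sys.argv[1]); m=module_from_spec(spec); spec.loader.exec_module(m); print(' '.join(v.dag_id for v in vars(m).values() if isinstance(v, DAG)))" "$f") || exit 1
            for dag_id in $dag_ids; do
              airflow dags test "$dag_id" 2025-01-01 || exit 1
            done
          done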

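Once the job is a required status check on the branch, this blocks merge unless every DAG: (a) imports cleanly, (b) imports in < 300ms (a looser ceiling than the 200ms per-file target in the CI gate list), (c) has no em dashes, (d) completes an airflow dags test run.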
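The workflow lints only for em dashes. The other two linter rules from the CI gate list (no print statements, no CURRENT_DATE in SQL templates) could be covered by a small script like the one below. It is a sketch under simplifying assumptions: the function name lint is made up, and any string literal containing CURRENT_DATE is flagged, including docstrings, because the script cannot tell which strings are SQL.

```python
import ast
import re

CURRENT_DATE = re.compile(r"\bCURRENT_DATE\b", re.IGNORECASE)
EM_DASH = "\u2014"


def lint(source: str, filename: str = "<dag>") -> list[str]:
    """Return 'file:line: problem' strings for the three linter rules."""
    found = []  # (line number, message)

    for lineno, line in enumerate(source.splitlines(), start=1):
        if EM_DASH in line:
            found.append((lineno, "em dash"))

    for node in ast.walk(ast.parse(source, filename)):
        # A bare call to the built-in print().
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id == "print"):
            found.append((node.lineno, "print() call"))
        # Any string literal (including f-string fragments) mentioning CURRENT_DATE.
        elif (isinstance(node, ast.Constant)
                and isinstance(node.value, str)
                and CURRENT_DATE.search(node.value)):
            found.append((node.lineno, "CURRENT_DATE in a string literal"))

    return [f"{filename}:{lineno}: {msg}" for lineno, msg in sorted(found)]
```

A CI step would run it over each file under dags/ and fail when any file returns a non-empty list.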

Aha: Restarting Airflow for a DAG change is a self-inflicted outage. Bind-mount + git-sync delivers DAGs into a running cluster within 90 seconds, no restart needed. The image only needs to rebuild when Python dependencies change, which should be rare with a kitchen-sink base image. Stop deploying like it's 2017.


4. Real-World Application

The git-sync pattern is the standard in the Airflow Helm chart's reference deployment. Astronomer, MWAA, and Cloud Composer all default to it (or to S3/GCS-bucket-backed equivalents). The image-baked-DAGs pattern persists mostly in older self-hosted deployments and in shops where the platform team hasn't refactored.

Some shops use S3 or GCS as the DAG backing store (the bucket is mounted to all pods). Functionally similar to git-sync: a separate process writes new DAGs to the shared store, and the scheduler picks them up on its next parse. The git-sync version has the advantage of native Git semantics (atomic commits, rollback by git revert).

The teams that haven't moved to git-sync are usually the ones with the most restart-induced outages. The migration is a one-week project for the platform team. The recurring benefit is 5-10x faster iteration speed for every data engineer on the team.


5. Your Turn

Exercise: A team currently ships DAG changes by rebuilding their Airflow Docker image and rolling the deployment. They ship ~3 changes per week, each costing ~20 minutes of restart-related impact.

  1. Estimate the annual cost of the current pattern in "engineer hours wasted plus operational impact." Make defensible assumptions.
  2. Sketch the migration to git-sync. Specify what changes: in the image, in the Helm values, in CI, in the team's workflow.
  3. The team's lead asks "what's the risk of letting DAGs auto-deploy on merge without a manual gate?" Defend the auto-deploy choice.

6. Recap + Bridge

Restarting Airflow for DAG changes is a self-inflicted outage. git-sync delivers DAGs into a running cluster within 90 seconds, no restart needed. Image rebuilds happen only for dependency changes, made rare by a kitchen-sink base image. Add a CI gate to catch failures before merge. Next lesson: secrets, and the silent way Variable.get() leaks them through your airflow.log.