Spark Declarative Pipelines: What It Actually Does Under the Hood

Most Spark pipelines start clean and end up encrusted. The first version is a notebook: read a source, do a few transforms, write a table. Then the source turns into a stream, so you wire up a checkpoint. Then someone needs yesterday's data reprocessed, so you bolt on a backfill path. Then a second table depends on the first, and a third on the second, and now there's an ordering somewhere - maybe in an Airflow DAG, maybe in the order the cells happen to run - that has to stay correct by hand. Add a table in the wrong place and the whole thing reads stale data without telling you.

None of that is the interesting part of the job. It's glue: checkpoint management, incremental-read bookkeeping, dependency ordering, the batch-vs-stream fork you re-implement in every project. It's also where pipelines actually break in production, usually at 3am.

Spark Declarative Pipelines (SDP) is a framework for building data pipelines on Apache Spark where you declare the tables you want and let Spark build them. You describe each table by what it should contain, and SDP handles the rest: the execution order, the compute, the errors and retries, the work that makes a pipeline reliable and maintainable. It does this for both batch and streaming, and for the everyday ingestion cases, files from cloud storage (S3, ADLS, GCS) and message buses (Kafka, Kinesis, Pub/Sub, Event Hubs), with incremental transformations on top. You never write the reads, the writes, or the ordering yourself.

That's the whole pitch: SDP deletes the glue. It shipped in Apache Spark 4.1, and it's worth understanding precisely, because the abstraction is genuinely good and the places it leaks tell you a lot about how pipelines really work. (Apache's own Spark Declarative Pipelines programming guide is the reference if you want the full API surface alongside this walkthrough.)

Declarative vs imperative: what the word actually means

"Declarative" sounds like jargon, but the idea is simple. Imperative code is a list of steps: do this, then this, then that. You're responsible for the order and the mechanics. Declarative code states the goal and lets the system work out the steps. You say what you want, not how to get it.

You already use a declarative tool every day: SQL. When you write a SELECT, you don't tell the database how to scan the disk, where to shuffle data, or which join algorithm to use. You describe the result you want and the engine figures out the plan. That's why a one-line GROUP BY can quietly turn into a multi-stage distributed job you never had to think about. Spark works the same way for a single query: Catalyst, its optimizer, decides the physical plan.

SDP takes that same idea and lifts it one level up, from a single query to a whole pipeline of tables. Instead of declaring one result and letting Spark plan the execution, you declare many results, with dependencies between them, and let Spark plan the execution across all of them.

The contrast is sharp. The imperative version you already know looks like this:

bronze = spark.read.format("json").load("/data/orders")
silver = bronze.where("amount > 0").withColumn("day", to_date("ts"))
silver.write.mode("overwrite").saveAsTable("silver_orders")

gold = spark.table("silver_orders").groupBy("day").sum("amount")
gold.write.mode("overwrite").saveAsTable("daily_revenue")

You wrote the reads, the writes, the order, and the materialization. If silver_orders and daily_revenue move to different files, you keep the order straight. If orders become a stream, you rewrite all of it.

The declarative version never calls write. It declares two datasets and the query that defines each:

from pyspark import pipelines as dp
from pyspark.sql.functions import to_date

@dp.materialized_view
def silver_orders():
    return (
        spark.read.table("bronze_orders")
        .where("amount > 0")
        .withColumn("day", to_date("ts"))
    )

@dp.materialized_view
def daily_revenue():
    return spark.read.table("silver_orders").groupBy("day").sum("amount")

There's no saveAsTable, no mode, no explicit ordering. SDP sees that daily_revenue reads silver_orders, infers the edge, and runs them in the right order. That inference - reading your code to build a graph - is the core trick, and everything else hangs off it.

The lineage, and the naming mess

SDP didn't appear from nowhere. Databricks built this pattern years ago as Delta Live Tables (DLT), ran it at scale across thousands of customers, and learned where the declarative model wins and where people need an escape hatch. At Data + AI Summit in June 2025 they contributed the framework to Apache Spark as an open-source project (tracked as SPARK-51727), and it became generally available in Spark 4.1.

That history matters for one practical reason: the names are a swamp. The same thing has been called Delta Live Tables, then folded under the "Lakeflow" product umbrella, and is now "Spark Declarative Pipelines" in the open-source project. Databricks' billing logs and a lot of engineers still say "DLT." When you read docs or Stack Overflow answers, assume all four terms point at the same model and check the version. The open-source SDP in Spark 4.1 is the vendor-neutral core; Databricks layers proprietary extras on top (more on that later).

The important shift is that the framework is now Spark-native and runs anywhere Spark runs. You don't need Databricks to use it, and you can run a pipeline on your laptop.

The building blocks, precisely

SDP has a small vocabulary. Get these five terms exact and the rest of the system reads cleanly.

A flow is the atomic unit of processing. It reads from a source, applies your logic, and writes into one target dataset. A flow is either batch or streaming. Everything else is built out of flows.

A streaming table is a table plus one or more streaming flows that append into it. It processes data incrementally - each run picks up only what's new, tracked by a checkpoint. This is what you reach for at the ingestion edge, where data arrives continuously and you never want to reprocess the whole history.

@dp.table
def bronze_orders():
    return (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")
        .option("subscribe", "orders")
        .load()
    )

The SQL form marks the incremental read with the STREAM keyword, here reading from another table in the pipeline:

CREATE STREAMING TABLE silver_orders
AS SELECT * FROM STREAM bronze_orders;

A materialized view is a table defined by exactly one batch flow. It's recomputed to reflect its sources, and it's the right tool for aggregations and joins where you want a correct, complete result rather than an incremental append. daily_revenue above is a materialized view: every run, it reflects the full current state of silver_orders.

A temporary view is scoped to a single pipeline run. It isn't persisted; it exists so other flows in the same pipeline can reference it without writing an intermediate table to storage. Use it to name a reusable subquery, not to produce output.

A sink writes transformed data to an external destination - Kafka, an event bus, anything Spark Structured Streaming can write to. Sinks come with sharp edges: they're Python-only (no SQL form), streaming-only, and append-only.

dp.create_sink(
    name="alerts_topic",
    format="kafka",
    options={"kafka.bootstrap.servers": "localhost:9092", "topic": "alerts"},
)

@dp.append_flow(target="alerts_topic")
def route_alerts():
    return spark.readStream.table("silver_orders").where("amount > 10000")

Wrapping all of this is the pipeline: the unit you develop and run. A pipeline is a set of source files (Python, SQL, or both) plus a YAML spec that names it and points at the code:

name: orders_pipeline
libraries:
  - glob:
      include: transformations/**
storage: file:///var/lib/orders_pipeline/storage
catalog: my_catalog
database: my_db

storage is the directory where SDP keeps the checkpoints for the pipeline's streaming tables. catalog and database set the default target for your tables (schema works as an alias for database). The libraries glob is how SDP discovers the files that define your datasets - which is the first thing the engine does when it runs.

How the dataflow graph actually executes

Here's the part worth slowing down on, because it explains both the magic and the restrictions.

When a pipeline runs, SDP doesn't execute your functions like a normal script. It analyzes them first. It loads every file in libraries, evaluates each dataset definition to discover what it reads and what it writes, and builds a directed graph: nodes are datasets, edges are the spark.read.table(...) / spark.readStream.table(...) references between them. From that graph it derives a topological order, decides what can run in parallel, and only then executes - streaming flows and batch flows, in dependency order.

You never wrote that order. SDP read it out of your code. Add a new table between silver and gold and the graph re-derives itself; there's no DAG to hand-edit.

This is also why the analysis phase can catch a whole class of bugs before any data moves. The CLI has a dry-run that builds and validates the graph without reading or writing a byte - it surfaces syntax errors, analysis errors like a missing column or table, and structural problems like a cyclic dependency. That's a faster feedback loop than discovering a typo three tables deep at runtime.

There's a catch baked into this design, and it's the most common way people trip. Your dataset functions get evaluated multiple times, during planning and during execution. They are supposed to be pure descriptions of a DataFrame, nothing more. So SDP forbids the operations that force execution or side effects inside a definition: no collect(), count(), toPandas(), save(), saveAsTable(), .start(), .toTable(). A definition must return a DataFrame and do nothing else.

!Warning

If you put a count() or a write inside a dataset function "just to check something," you're not adding a debug line - you're triggering a real action every time SDP analyzes the graph. The ban isn't arbitrary. The functions are graph descriptions, and a description that has side effects runs them on every pass.

On the SQL side the restrictions are smaller but real: PIVOT isn't supported in SDP SQL, and sinks have no SQL form at all.

Streaming versus batch, and what "refresh" really means

The single most important choice in an SDP pipeline is whether a flow reads with STREAM / readStream or with a plain batch read, because that one decision sets the entire processing semantics.

A streaming read is incremental. Each run consumes only new input since the last run, and SDP tracks the position in a checkpoint under storage. This is cheap and continuous, and it's how you keep an ingestion table fresh without reprocessing terabytes every time. A batch read is complete: it recomputes against the full current state of the source, which is what you want for an aggregate that has to be correct, not just appended to.

The spark-pipelines CLI is how you drive all of this:

# scaffold a new project
spark-pipelines init --name orders_pipeline

# validate the graph without touching data
spark-pipelines dry-run

# normal incremental run
spark-pipelines run

# reprocess specific datasets from scratch
spark-pipelines run --full-refresh silver_orders,daily_revenue

# reprocess everything
spark-pipelines run --full-refresh-all

# run against a remote cluster over Spark Connect
spark-pipelines run --remote sc://my-cluster:15002

The word refresh hides an important distinction. A normal run is incremental: streaming tables pick up new data from their checkpoints, materialized views recompute. A --full-refresh is destructive in a specific way - it clears the data and the checkpoints for the named datasets and rebuilds them from scratch. That's exactly what you want after a logic change that invalidates old results, and exactly what you don't want to run by accident on a 500 GB table, because it throws away the incremental progress you'd otherwise keep.

✓Tip

--full-refresh and --refresh can be combined for different datasets in one run, but --full-refresh-all can't be mixed with either. When in doubt, name the specific datasets rather than refreshing the whole graph - it's the difference between a few minutes and a few hours of compute.

What a real pipeline looks like

Put the pieces together and the medallion pattern - the bronze/silver/gold shape most teams converge on - falls out naturally. The point isn't the layer names; it's that each layer maps to exactly one building block, and the right streaming-versus-batch choice at each step.

Bronze is raw ingestion. A streaming table reading Kafka, doing as little as possible, just landing the data incrementally so you never reprocess history:

@dp.table
def bronze_orders():
    return (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")
        .option("subscribe", "orders")
        .load()
    )

Silver is cleaned and conformed. Still a streaming table - you want it to stay incremental - but now with parsing, typing, and quality rules. SDP lets you attach expectations that drop or flag rows that violate a constraint, so bad records don't silently poison downstream tables:

from pyspark.sql.functions import from_json, col

@dp.table
@dp.expect_or_drop("valid_amount", "amount > 0")
def silver_orders():
    return (
        spark.readStream.table("bronze_orders")
        .select(from_json(col("value").cast("string"), order_schema).alias("o"))
        .select("o.*")
    )

Gold is the business-facing aggregate. Here you switch to a materialized view, because a daily revenue rollup needs to be complete and correct, not an append of deltas:

@dp.materialized_view
def daily_revenue():
    return (
        spark.read.table("silver_orders")
        .groupBy("day")
        .sum("amount")
    )

Three layers, three building blocks, one inferred graph. Notice what you didn't write: no checkpoint paths, no saveAsTable, no ordering, no separate streaming job for bronze/silver and batch job for gold. That's the glue SDP absorbed.

If the layers underneath this feel like a black box, they shouldn't stay one. SDP sits directly on top of Spark Structured Streaming's unbounded-table model

a streaming table is a Structured Streaming query with the boilerplate removed. The quality-gate idea has a longer history in pipeline quality versus data quality, and the Delta tables every SDP pipeline writes have their own transaction-log mechanics worth knowing when something goes wrong.

What engineers are still asking for

A good way to read a framework's maturity is to look at what the people running it in production keep asking for. The requests from teams using SDP and its Databricks predecessor cluster into a handful of themes, and those themes say more about the framework's current edges than any changelog.

Observability: knowing when something quietly stops working. Observability means being able to see what your pipeline is doing while it runs. The fear is the silent failure: a table stops getting new rows, or starts lagging hours behind, and nobody notices until someone downstream asks why a dashboard is stale. Engineers want SDP to tell them this out of the box, without writing their own monitoring: how many rows each table processed over time, how far behind real time each table is, and an alert when a table goes stale. Today the common hack is to add a current_timestamp() column at every step so you can later look at the data and work out where it got stuck. The fact that people resort to that is the tell: SDP already knows which tables feed which and how long each took, so the information exists. It just isn't surfaced yet.

Scheduling without a second tool. A lot of people are surprised that a pipeline can't carry its own schedule and instead needs an orchestrator to trigger it. The counter-argument is sound: orchestration is a separate concern, and a real scheduler can react to file arrivals and table updates, not just the clock. But for the simple "run this hourly" case, needing a whole orchestration layer feels like overhead, and the request keeps coming up.

Partial refresh: rebuilding only the part that changed. Recall the two ways to rerun a pipeline. A normal run is incremental: it picks up new data and leaves existing results alone. A full refresh throws everything away and rebuilds the table from scratch. The problem is that sometimes you need something in between. Say you fixed a bug that only affected last Tuesday's data, or one upstream column changed. Your only option today is a full refresh, which on a multi-terabyte table means hours of compute to rebuild data that was almost entirely fine. What people want is the ability to rebuild just a slice, one date range, one partition, or one section of the graph, without wiping the whole table and its saved progress.

Treating streaming tables like normal tables. A streaming table created by SDP isn't quite a free-standing table you can do whatever you want with; it's tied to the pipeline that owns it. That gets awkward as things grow. Teams want to move a table from one pipeline to another, split a pipeline that ballooned to hundreds of tables into smaller ones, or hand a streaming table to another team to read as if it were just an ordinary Delta table. All of that is harder than it should be right now. The underlying request is simple: let an SDP table behave like a regular table when you need it to, instead of staying locked inside its pipeline.

And then the perennial language question: Scala support. The honest answer from the maintainers is that it isn't a priority, because the data-engineering center of gravity has moved to Python and SQL. If your transformation logic is all Scala, SDP isn't for you yet, and probably won't be soon.

iNote

A few items on these wishlists already exist in the Databricks product even though they're not in open-source Spark 4.1 - things like the AUTO CDC flow for slowly-changing dimensions, the Enzyme incremental-recompute engine behind materialized views, and predictive optimization. That split is the thing to keep straight: the open-source SDP is the portable core, and Databricks sells performance and operational features layered on top.

Where it fits, and the honest take

SDP earns its place when the pain you have is glue: incremental ingestion you keep re-implementing, a dependency order you maintain by hand, the batch-versus-stream fork duplicated across jobs, checkpoints copied between projects. It collapses all of that into a graph the engine derives from your code, and the dry-run loop alone catches a category of bugs that used to surface only in production.

It's the wrong tool when you need fine-grained control the declarative model deliberately hides - custom stateful streaming with bespoke timeout logic, Scala-heavy transformation code, or orchestration patterns richer than "run this graph." For those, raw Spark plus an orchestrator is still the honest answer, and SDP doesn't pretend otherwise.

The deeper point is the one the abstraction can't change. SDP hides Structured Streaming semantics, checkpoint management, and Delta's transaction log - and the day a pipeline misbehaves, those are exactly the layers you'll be debugging. A streaming table that won't advance is a checkpoint problem. A materialized view that's slow is a recompute problem. A schema-evolution failure is a Delta problem. The framework writes the glue for you; it doesn't excuse you from understanding what the glue was holding together. That's not a knock on SDP. It's the reason the layers underneath are worth learning even as the tools on top get better at hiding them.

SDP is young in the open-source tree, the naming will keep settling, and the wishlist above is long. But the core bet - declaring tables instead of scripting steps - is the same bet Spark SQL made for queries a decade ago, and that one worked out. This is that idea, one level up.