Module: Query Engine Foundations | Duration: 18 min read | Lesson: 1 of 16

It's Maya's third week at Vellora, a mid-stage fintech. She just inherited a six-month-old analytics stack: Trino over Iceberg on S3, a ClickHouse cluster for product analytics dashboards, and a Spark + Airflow batch layer that produces the underlying tables.

Yesterday, a dashboard timed out. The senior engineer on call replied in Slack: "Just give the Spark job more executors and rerun." Maya nods, opens the YARN UI, and pauses. The dashboard isn't a Spark job. It's a SQL query hitting ClickHouse. There is no Spark in the request path.

She types back: "I don't think Spark is involved here." The engineer responds: "It's all the same — it's just distributed compute, right?"

Maya is about to discover the most useful distinction in the modern data stack: compute engines and query engines are not the same kind of system. Treating them as variations of one thing is the canonical sign of someone who hasn't yet seen the line.

This lesson draws that line.

2. Concept Explanation

Two systems that look alike from a distance

From far away, Spark and Trino look like cousins:

Both run on clusters.
Both read Parquet from object storage.
Both speak SQL.
Both move data between workers.

But the purpose of each — what they're optimized for, what fault model they assume, how they execute — diverges sharply.

Compute engines: throughput on arbitrary DAGs

Spark and Flink are general-purpose distributed execution frameworks. They will run anything you can express in their DSL — ETL pipelines, ML training, graph algorithms, streaming joins of unusual shape. The engine schedules whatever job you submit.

The architecture optimizes for long jobs that must survive worker failures:

BSP (Bulk Synchronous Parallel) execution with stage boundaries — a stage completes, its output is materialized (shuffled to disk for Spark), and the next stage starts.
Lineage and checkpointing so partial work survives node loss.
Iterator + codegen (Spark Tungsten) or operator chaining (Flink) as the per-row engine.
The goal is throughput on a job that may run for hours.

Query engines: latency on SELECT

Trino, ClickHouse, Druid, Pinot, and DuckDB are SQL serving systems. They don't run your ML pipeline. They serve a dashboard query in 200ms. They serve a hundred concurrent dashboard queries.

The architecture optimizes for interactive latency:

MPP with streaming exchange — data flows worker-to-worker through the network without disk barriers.
Vectorized execution — batches of columns through SIMD-friendly kernels.
Fail the query, don't checkpoint — an interactive user retries; the engine doesn't.
The goal is p99 latency on a query that may run for 200ms.

Why this matters: the architectural divergence is real

Spark's shuffle writes intermediate data to disk so the job survives node loss. Trino's exchange operator streams data between workers and dies if any node fails. These are opposite solutions to the same problem (move data between workers), driven by opposite tradeoffs (durability vs. latency).

ClickHouse historically punished joins because OLAP was assumed to be denormalized. Spark assumes joins everywhere. DuckDB runs in-process with no scheduler. Spark is a scheduler. These are not implementation details — they are the defining shape of each system.

The mental model

Compute engines run code on data. Query engines serve SQL.

It sounds glib, but it's the cleanest test:

If your workload is "run this code, then write the result somewhere" → compute engine.
If your workload is "answer this SELECT, return rows to a client, fast" → query engine.

Spark Structured Streaming queries that look like SQL are still compute jobs — they're long-running, fault-tolerant, write to a sink. DuckDB queries that look batch-ish are still query-engine queries — they answer SELECT, return rows.

3. Worked Example

Same logical query, two engines, two execution shapes.

SELECT category, SUM(revenue) FROM orders WHERE order_date >= '2026-01-01' GROUP BY category;

On Spark (compute engine, batch job):

Driver builds a DAG: scan → filter → partial aggregate → shuffle → final aggregate.
Tasks launch on executors. Each scans its partition of orders.
Partial aggregates compute per partition.
Shuffle writes intermediate aggregates to local disk. Reducers fetch them.
Final aggregate runs on reducers. Result written to a sink (file, table).
If a node dies during step 4, Spark recomputes from lineage. Job completes.
End-to-end: minutes. Designed for durability across long jobs.

On Trino (query engine, interactive query):

Coordinator parses, plans, optimizes, picks a plan with a cost-based optimizer.
Workers receive splits, start scanning partitions.
Filter + partial aggregate happen in vectorized batches as data arrives.
Streaming exchange ships partial aggregates between workers — no disk.
Final aggregate streams to the coordinator. Coordinator streams rows to the client.
If a worker dies, the whole query fails. Client retries.
End-to-end: hundreds of milliseconds. Designed for latency.

Same SQL. Same data. Two completely different shapes.

4. Your Turn

Exercise: Classify each workload as compute engine or query engine territory. Identify the deciding factor.

A nightly job that joins seven tables, computes ML features, writes Parquet to S3. Runs ~3 hours.
A dashboard that returns "active users by country, last 24h" — needs to load in under 1 second, refreshed by 200 concurrent analysts.
A backfill pipeline that recomputes one year of partitioned aggregates after a schema change.
An ad-hoc SQL query an analyst types into a notebook to debug a number on yesterday's dashboard.
A streaming pipeline that consumes Kafka, joins with a reference table, writes to Iceberg.
A "data app" backend that runs the same SQL with parameter substitutions thousands of times an hour.

5. Real-World Application

Most mid-stage data platforms run both. The pattern at Vellora is typical:

Spark + Airflow for nightly ETL into Iceberg tables (compute).
Trino for federated ad-hoc queries that span Iceberg + Postgres (query).
ClickHouse for low-latency product analytics dashboards (query).

The friction Maya hit is universal: engineers steeped in one mental model misapply it to the other. "Throw more executors at it" works for Spark and not for ClickHouse, because ClickHouse doesn't have executors — it has shards and threads, and the bottleneck is rarely raw compute.

The rest of this course is about giving you the query-engine mental model from first principles, so the next time someone says "it's all the same" you know exactly where the line is and what's on either side.

6. Recap + Bridge

You learned:

The functional split: compute engines run code on data; query engines serve SQL.
The architectural split: BSP + checkpoint vs MPP + streaming exchange.
The latency split: minutes-to-hours vs milliseconds-to-seconds.

Next lesson: we sharpen the definition of "the workload query engines exist for" — the OLAP workload profile. Latency, concurrency, cardinality, read-mostly. Knowing the workload's shape is the first step in choosing an engine that fits it.