Module: Trino Core | Duration: 12 min read | Lesson: 1 of 14

TheWorldShop's data is scattered. Orders live in Postgres. Clickstream sits in Iceberg on S3. A few reference tables are in a vendor's MySQL. The analytics team wants one question answered: "revenue per customer plan, joined across all three." Today that means three exports, a Python script, and a Tuesday lost to glue code.

Someone proposes Trino: point one SQL endpoint at all three systems and write the join as if they were one database. No copying and no ETL; you query the data where it lives. It sounds too good to be true, and the catch is real. This lesson covers Trino's bet (query anything, store nothing) and the price it pays for that bet.

2. Concept Explanation

Every engine you've met so far owns its storage. ClickHouse has MergeTree parts. Druid has segments. DuckDB has its file. Trino owns nothing. It's a pure query engine: it has a SQL parser, a distributed execution engine, and an optimizer, but no storage layer at all. The data lives in other systems, and Trino reads it through connectors.

The thesis: separation of compute from storage, taken all the way

Trino's bet is that source-system specialization beats engine-level storage ownership. Postgres is great at being Postgres. S3 + Iceberg is great at cheap durable columnar storage. Kafka is great at streams. Rather than copy all of that into yet another storage format, Trino queries each source in place and joins across them at query time. One SQL dialect, many backends.

This is federation: a single query spanning Postgres, Iceberg, and MySQL, with Trino as the brain that plans it, pushes work down to each source where it can, pulls back what it must, and joins the results.
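To make "pushes work down where it can, pulls back what it must" concrete, here is a toy Python model. It is not Trino code: the Source class, its supports_filter_pushdown flag, and the sample rows are hypothetical names and data invented for illustration. It only shows why the same filter moves far fewer rows when the source can apply it locally.

```python
from dataclasses import dataclass


@dataclass
class Source:
    """Hypothetical stand-in for a connector plus the system behind it."""
    name: str
    rows: list
    supports_filter_pushdown: bool

    def scan(self, predicate=None):
        """Return (rows sent to the engine, whether the engine must still filter)."""
        if predicate is not None and self.supports_filter_pushdown:
            # The source filters locally; only matching rows cross the network.
            return [r for r in self.rows if predicate(r)], False
        # The source cannot filter; every row crosses the network.
        return list(self.rows), predicate is not None


def engine_read(source, predicate):
    """Toy engine: push the filter down if possible, otherwise filter after transfer."""
    transferred, needs_filter = source.scan(predicate)
    result = [r for r in transferred if predicate(r)] if needs_filter else transferred
    return result, len(transferred)


# Made-up data: 1000 customers, every tenth one on the "pro" plan.
rows = [{"id": i, "plan": "pro" if i % 10 == 0 else "free"} for i in range(1000)]


def is_pro(row):
    return row["plan"] == "pro"


strong = Source("strong_connector", rows, supports_filter_pushdown=True)
weak = Source("weak_connector", rows, supports_filter_pushdown=False)

strong_result, strong_transferred = engine_read(strong, is_pro)
weak_result, weak_transferred = engine_read(weak, is_pro)

# Same answer either way; only the number of rows moved differs.
print(strong_transferred, weak_transferred)  # prints: 100 1000
```

Both reads return the same 100 rows. The difference is that the weak source ships all 1000 rows to the engine first, which is the "pull data and filter itself" case.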

The cost of owning no storage

Owning no storage buys flexibility and costs you the things storage ownership gives you:

No caching by default. Trino doesn't keep a hot copy of your data. Every query re-reads from the source. A second identical query doesn't get faster on its own (you add caching layers separately).
No indexes of its own. Trino relies on whatever the source can do (a Postgres index, Iceberg partition pruning). If the source can't filter efficiently, Trino has to pull data and filter itself.
Performance is only as good as the connector and the source. A strong connector (Iceberg) pushes filters and projections down and exposes table statistics to the optimizer. A weak one (some JDBC sources) makes Trino drag rows across the network and do the work itself.
Tail latency from the slowest source. A federated query is only as fast as its slowest participant. Join a snappy Iceberg table to an overloaded MySQL instance and MySQL sets the pace.
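The "slowest participant" point can be sketched as simple arithmetic. The model below is a deliberate simplification, not a description of Trino's scheduler: it assumes the sources are read in parallel and that the join cannot finish before the slowest read does. The function names and all latency figures are made up for illustration.

```python
def federated_latency_ms(source_latencies_ms, engine_overhead_ms=0.0):
    """Toy model: sources are read in parallel, so the slowest one gates the join."""
    return max(source_latencies_ms.values()) + engine_overhead_ms


def slowest_source(source_latencies_ms):
    """Name of the source that sets the pace in this toy model."""
    return max(source_latencies_ms, key=source_latencies_ms.get)


# Hypothetical numbers: a fast Iceberg scan and an overloaded MySQL replica.
latencies = {"iceberg": 300.0, "mysql": 4000.0}
before = federated_latency_ms(latencies, engine_overhead_ms=50.0)

# Making the already-fast source 10x faster changes nothing.
faster_iceberg = {**latencies, "iceberg": 30.0}
after_tuning_fast_source = federated_latency_ms(faster_iceberg, engine_overhead_ms=50.0)

# Making the slow source 10x faster changes almost everything.
faster_mysql = {**latencies, "mysql": 400.0}
after_fixing_slow_source = federated_latency_ms(faster_mysql, engine_overhead_ms=50.0)

print(slowest_source(latencies), before, after_tuning_fast_source, after_fixing_slow_source)
# prints: mysql 4050.0 4050.0 450.0
```

Under these assumptions, tuning effort spent anywhere other than the slowest source is wasted.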

Trino's other defining choice: built to fail fast

You met this in Query Engine Foundations: Trino, unlike Spark, was historically optimized for interactive queries that either finish fast or die. A worker failure killed the query, and you retried it. That's the opposite of Spark's "survive failures on multi-hour jobs." (Trino later added fault-tolerant execution for long queries, covered in Lesson 13, but the default DNA is interactive-and-fast.)
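A minimal sketch of the "fail fast, then retry" pattern, in plain Python. None of this is Trino's API: WorkerFailure, run_query_fail_fast, and run_with_client_retry are hypothetical names, and the flaky task is rigged to fail exactly once so the run is deterministic.

```python
class WorkerFailure(Exception):
    """Hypothetical error raised when a worker dies mid-query."""


def run_query_fail_fast(tasks):
    """Run every task; the first failure aborts the whole query, with no partial recovery."""
    results = []
    for task in tasks:
        results.append(task())  # any exception propagates and kills the query
    return results


def run_with_client_retry(make_tasks, max_attempts=3):
    """Client-side retry: on failure, start the entire query again from scratch."""
    for attempt in range(1, max_attempts + 1):
        try:
            return run_query_fail_fast(make_tasks()), attempt
        except WorkerFailure:
            if attempt == max_attempts:
                raise


calls = {"stable": 0, "flaky": 0}


def stable_task():
    calls["stable"] += 1
    return "part-a"


def flaky_task():
    calls["flaky"] += 1
    if calls["flaky"] == 1:
        raise WorkerFailure("worker lost")  # fails on the first attempt only
    return "part-b"


results, attempts = run_with_client_retry(lambda: [stable_task, flaky_task])
print(results, attempts, calls["stable"])  # prints: ['part-a', 'part-b'] 2 2
```

The stable task ran twice even though it never failed. That repeated work is cheap for a query that takes seconds and painful for one that takes hours, which is the trade-off the paragraph above describes.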

Aha: Trino's superpower and its weakness are the same fact, namely that it owns no storage. That's why it can query Postgres, Iceberg, Kafka, and MySQL in one SQL statement, and also why a second run of that query isn't faster, why a weak connector tanks performance, and why your federated query inherits the latency of the slowest system it touches. "Store nothing" isn't a limitation to work around; it's the whole identity.

3. Worked Example

Here is a simplified, two-catalog version of TheWorldShop's cross-system question, worked through conceptually before you run it in the lab next lesson.

The data:

orders: stored in object storage as an Iceberg table (or synthetic TPC-H data in the lab).
customers: stored in Postgres, including the plan column.

The Trino query, as if it were all one database:

SELECT c.plan, count(*) AS orders, sum(o.totalprice) AS revenue
FROM iceberg.tws.orders o
JOIN postgresql.public.customers c ON c.user_id = o.custkey
GROUP BY c.plan
ORDER BY revenue DESC;

What Trino does under the hood (the rest of this course unpacks each step):

Parses and plans the query across two catalogs.
Pushes down what it can: the scan and any filters on orders to the Iceberg connector, the customers read (and ideally a filter) to Postgres.
Pulls back the reduced data, joins it in its own distributed engine, aggregates, and returns the result.

No data was copied into Trino's storage, because Trino has none. The orders stayed in Iceberg, the customers stayed in Postgres, and the join happened in Trino's memory at query time. That's the thesis in one statement: the SQL is simple; the magic (and the cost) is in how Trino splits the work between itself and the sources.
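The last step (pull back, join, aggregate) can be imitated in a few lines of single-process Python. This is only a sketch of the idea: the rows are invented, user_id is assumed unique per customer, and a hash join is used because it is a common strategy, not because this is the plan Trino would necessarily choose for this query.

```python
from collections import defaultdict

# Hypothetical rows standing in for what each connector hands back.
orders = [  # from iceberg.tws.orders, only the columns the query needs
    {"custkey": 1, "totalprice": 120.0},
    {"custkey": 1, "totalprice": 80.0},
    {"custkey": 2, "totalprice": 50.0},
    {"custkey": 3, "totalprice": 200.0},
]
customers = [  # from postgresql.public.customers
    {"user_id": 1, "plan": "pro"},
    {"user_id": 2, "plan": "free"},
    {"user_id": 3, "plan": "pro"},
]

# Build side: hash the smaller table on the join key (user_id assumed unique).
plan_by_user = {c["user_id"]: c["plan"] for c in customers}

# Probe side plus GROUP BY: stream the orders and aggregate per plan.
totals = defaultdict(lambda: {"orders": 0, "revenue": 0.0})
for o in orders:
    plan = plan_by_user.get(o["custkey"])
    if plan is None:
        continue  # inner join: drop orders without a matching customer
    totals[plan]["orders"] += 1
    totals[plan]["revenue"] += o["totalprice"]

# ORDER BY revenue DESC
result = sorted(
    ((plan, t["orders"], t["revenue"]) for plan, t in totals.items()),
    key=lambda row: row[2],
    reverse=True,
)
print(result)  # prints: [('pro', 3, 400.0), ('free', 1, 50.0)]
```

Everything in this sketch lives in memory for the duration of one run and is gone afterwards, which mirrors the "join happened in Trino's memory at query time" point.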

4. Your Turn

Exercise: Reason about Trino's bet.

You run the same federated dashboard query 50 times an hour and it's slow every time. Why doesn't it speed up on its own, and what do you add?
A federated join touches a fast Iceberg table and a heavily-loaded MySQL replica. Where does the latency come from?
When is "query in place, don't copy" the right call, and when should you actually ETL the data into one store first?
Trino and Spark both run distributed SQL. Name the core philosophical difference in how they treat a worker failure.

5. Real-World Application

Trino (and its ancestor Presto) is widely used as a lakehouse and federation query layer at large scale: it serves as the interactive SQL engine over data lakes at companies like Netflix, LinkedIn, and many others, and as the "one SQL endpoint over all our systems" tool for ad hoc analytics. The thesis pays off exactly where it's supposed to: heterogeneous data, interactive exploration, and no desire to copy everything into one warehouse first.

It also fails where the thesis predicts. Teams that point Trino at a slow operational database for a high-frequency dashboard learn the "slowest source sets the pace" and "no caching by default" lessons the hard way. The mature move is knowing when to federate (ad hoc, exploratory) and when to materialize (hot, repeated), which is the exact judgment S6.7's federation lesson formalizes.
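One way to picture the "materialize what is hot and repeated" side of that judgment is a small result cache placed in front of the slow query. The sketch below is a hypothetical layer, not a Trino feature: ResultCache, its ttl_seconds parameter, the fake clock, and the 10-minute freshness window are all assumptions chosen for illustration.

```python
import time


class ResultCache:
    """Hypothetical caching layer in front of a slow federated query."""

    def __init__(self, run_query, ttl_seconds, clock=time.monotonic):
        self._run_query = run_query
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # sql text -> (time stored, result)
        self.source_runs = 0

    def get(self, sql):
        now = self._clock()
        entry = self._entries.get(sql)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # still fresh: the sources are not touched
        result = self._run_query(sql)  # stale or missing: hit the sources again
        self.source_runs += 1
        self._entries[sql] = (now, result)
        return result


# Fake clock so the example is deterministic and finishes instantly.
fake_now = {"t": 0.0}


def slow_federated_query(sql):
    """Stand-in for the real engine call; returns a canned result."""
    return [("pro", 400.0), ("free", 50.0)]


cache = ResultCache(slow_federated_query, ttl_seconds=600, clock=lambda: fake_now["t"])

# The dashboard asks the same question 50 times in one hour (every 72 seconds).
for i in range(50):
    fake_now["t"] = i * 72.0
    rows = cache.get("SELECT plan, revenue FROM dashboard_query")

print(cache.source_runs)  # prints: 6
```

With a 10-minute freshness window, 50 dashboard loads turn into 6 trips to the sources. The price is that results may be up to 10 minutes stale, which is acceptable for a hot dashboard and usually not for ad hoc exploration.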

6. Recap + Bridge

Trino owns no storage. It's a pure query engine that reads other systems through connectors and joins across them at query time; that is federation. The bet buys "query anything in one SQL dialect." It costs you built-in caching and native indexes, and it exposes every federated query to tail latency from the slowest source. Its default DNA is interactive: finish fast or fail fast.

To deliver on "store nothing, query everything," Trino splits into a planning brain and stateless muscle. Next: Trino Architecture, where the coordinator plans, the workers execute, and you bring up the lab.