Module: DuckDB Hands-On | Duration: 16 min read | Lesson: 1 of 14

Dev has a 4GB Parquet file of TheWorldShop orders sitting in S3. He needs one number: revenue by country for last quarter. The "correct" answer at his company is to spin up a Spark cluster, wait for it to provision, submit a job, and get the result in twenty minutes.

Instead he types this into a terminal:

SELECT country, SUM(amount)
FROM 's3://theworldshop/orders/2026-q1/*.parquet'
GROUP BY country;

Two seconds later, the answer is on his screen. No cluster. No driver. No connection string. The query engine ran inside the same process as his shell prompt.

That is the embedded pattern, and once you see it you start noticing how often the cluster was never needed. This lesson is about what "embedded" actually changes, and what it quietly takes away.

2. Concept Explanation

A library, not a server

Postgres is a server. You start a daemon, it listens on a port, your application opens a TCP connection, sends SQL over the wire, and reads rows back. There is a process boundary between you and the engine.

DuckDB has no daemon. It's a library you link into your own process, the same way you link SQLite. There is no port, no connection string, no network hop. Your Python (or Go, or R, or the CLI) calls a function, and the query executes on the call stack of your own program.

import duckdb
duckdb.sql("SELECT 42")   # runs right here, in this process

The mental shift: the database is not a thing you talk to. It's a thing you are.

What in-process buys you

Three things disappear, and their absence is the whole value proposition.

The network disappears. No serialization to send a query, no deserialization to read results. When DuckDB hands data to pandas, it can hand over the actual memory (you'll see this in the Arrow lesson). The fastest network call is the one you never make.

Auth and provisioning disappear. There's no user to create, no role to grant, no cluster to size. The query has exactly the permissions your process has. For a notebook, a CI job, or a CLI tool, that's precisely right.

The scheduler disappears. Spark decides which executor runs which task. DuckDB just runs, using the threads of the machine it's on. For sub-cluster data, the scheduling overhead was pure tax.

What in-process takes away

The same boundary that caused the overhead was also doing useful work, and removing it has a cost.

No isolation. A server protects itself from your bad query. If your SQL asks for more memory than exists, the server process dies, not your application. With DuckDB, the query OOMs your process. The dashboard, the API handler, the notebook kernel: gone together.

No independent scaling. A server can be sized separately from its clients. An embedded engine shares whatever the host process has. You cannot give "the database" more RAM without giving the whole process more RAM.

One process worth of work. A server can be a cluster behind the connection string. DuckDB is one process. Its ceiling is one machine. We'll spend a whole lesson on exactly where that ceiling sits.

SQLite for analytics, not a small Postgres

The right reference point is SQLite, not Postgres. SQLite is the most-deployed database on earth precisely because it's embedded: it's in your phone, your browser, your car. But SQLite is row-oriented and tuned for transactional point lookups. It falls over on a GROUP BY across a hundred million rows.

DuckDB took SQLite's deployment model (embed me, I'm a library) and married it to a columnar, vectorized OLAP engine. Same "no server" ergonomics. Opposite workload.

3. Worked Example

Let's run the engine in the place it shines: directly over files, no import step.

Bring up the lab. Clone the lab repo once (it's shared across all courses):

git clone https://github.com/petascalelabs/petascalelabs-lab-setup.git
cd petascalelabs-lab-setup/query-engines-and-olap/duckdb-and-embedded-olap/duckdb-hands-on/
./scripts/setup.sh

Verify DuckDB and the sample data are reachable:

./scripts/verify.sh
# expected: "DuckDB 1.x ready, sample orders.parquet mounted at ./data/orders.parquet"

Now open the DuckDB CLI and query the file directly:

-- no CREATE TABLE, no COPY, no import
SELECT country, COUNT(*) AS orders, ROUND(SUM(amount), 2) AS revenue
FROM './data/orders.parquet'
GROUP BY country
ORDER BY revenue DESC
LIMIT 5;

Notice what you did not do. You didn't define a schema. You didn't load the data into a table. You pointed the engine at a file and it read the Parquet footer, learned the columns and types, and ran a vectorized aggregation. The "database" held no state at all.

I'm working through the "DuckDB Hands-On" course on data-learning. I want to set up DuckDB locally and run my first query against a Parquet file.

My environment:
- OS: <fill in: macOS / Linux / Windows>
- RAM: <fill in, e.g. 16GB>
- I have <Docker / Homebrew / pip / nothing> available

Please give me the exact commands to:
1. Install the DuckDB CLI (or the Python package, whichever is simpler on my OS).
2. Confirm the install works with a "SELECT 42".
3. Query a local Parquet file directly with SELECT ... FROM 'file.parquet' and explain what each step does.

Keep it to my OS only. Do not give me commands for other platforms.

Aha: "Embedded" isn't a packaging detail, it's the feature. The reason SELECT FROM 'file.parquet' feels like magic is that there's no process to ship the query to and no process to ship the rows back from. The query runs where you're already standing. Every other DuckDB superpower (Arrow zero-copy, instant startup, no auth) is a downstream consequence of that one decision.

4. Your Turn

Exercise: Your team runs a Flask API. Someone proposes embedding DuckDB inside the API process so each request can run an analytical query directly, no separate warehouse.

Name one concrete benefit of embedding the engine in the API process.
Name the single scariest failure mode of this design.
A request runs SELECT * FROM huge_table ORDER BY x and the sort needs more RAM than the box has. In the embedded design, what dies?
Describe one change to the design that keeps most of the embedded benefit but contains that blast radius.

5. Real-World Application

The embedded pattern shows up everywhere the data is smaller than the team assumed:

Notebooks and ad-hoc analysis. A data scientist points DuckDB at a Parquet directory and never thinks about a warehouse. This is the single most common entry point.
CI test fixtures. A unit test needs to run a realistic aggregation. DuckDB runs it in-process in milliseconds, with no service to stand up in the CI runner.
The "you don't need Spark for this" PR. A nightly job processing 30GB gets ported from a Spark cluster to a single DuckDB process and runs faster and cheaper. We build exactly this in the capstone.
The embedded analytics layer of apps. BI tools and data apps embed DuckDB (and DuckDB-Wasm in the browser) to run SQL over user data with no backend query tier at all.

The common thread: someone reached for a distributed system out of habit, and the data fit on one machine the whole time.

6. Recap + Bridge

You learned:

DuckDB is a library you embed, not a server you dial into, the SQLite model applied to OLAP.
In-process removes the network, auth, and scheduler, which is why file-level queries feel instant.
The same missing boundary removes isolation and independent scaling: a query OOMs your process.
The honest framing is "SQLite for analytics," not "a smaller Postgres or a smaller Spark."

You now know why the engine feels the way it does from the outside. Next we go one level down: what's actually inside that single process. DuckDB Architecture, the vectorized engine, MVCC, and the single-file storage format that make it feel like SQLite for analytics.