Why ClickHouse Exists

Module: ClickHouse Core | Duration: 12 min read | Lesson: 1 of 16


Priya inherits a dashboard at TheWorldShop. It counts unique visitors per country, per hour, over the last 90 days. The backing store is Postgres. The query takes 40 seconds, and the product team refreshes it every time someone opens the page. On a busy morning the database falls over and orders stop flowing, because the same Postgres also takes checkouts.

Her first instinct is to add an index. It doesn't help. The query isn't slow because it can't find rows. It's slow because it has to read and aggregate hundreds of millions of them, every time, and Postgres was built to fetch a few rows fast, not to sweep a billion.

Someone on the team says "just put it in ClickHouse." Priya wants to know what that actually changes, and why an engine from a web-analytics company in 2009 is the answer to her 2026 problem.


2. Concept Explanation

ClickHouse came out of Yandex Metrica, a web-analytics product (think Google Analytics) that had to show site owners live reports over their raw click logs. The workload was specific and it shaped everything:

  • Billions of rows, growing every second.
  • Aggregations, not lookups. count, sum, uniq, group by country, hour. Nobody asks for "row number 4,812,003."
  • Sub-second answers expected, on a dashboard, while hundreds of other site owners run their own reports at the same time.
  • Append-mostly. Events arrive and are never updated. Corrections are rare.

A row-store like Postgres or MySQL stores each row's columns next to each other on disk. To SUM(revenue) over a billion rows, it reads a billion rows in full, every column, even though it only needed one. That's the wrong shape for this workload.

ClickHouse made the opposite bet, the same bet every OLAP engine makes: store each column together. To sum revenue, read only the revenue column. The other 30 columns never leave disk. On top of that columnar base, ClickHouse layered three choices that define it:

  1. Vectorized execution. Process columns in blocks of thousands of values, not one row at a time. (You met this idea in Query Engine Foundations.) Tight loops over arrays of one type let the CPU run at full speed.
  2. Sort-and-merge storage (MergeTree). Data is written in sorted, immutable chunks and merged in the background. Sorting is what makes the index and the compression work.
  3. No JVM, one C++ binary. ClickHouse is a single native process. No garbage-collection pauses, threads instead of processes, and tuning that lives in settings rather than heap flags.

The result is an engine that can scan and aggregate billions of rows per second per node, and serve many such queries at once. That's the Metrica workload. It's also, almost exactly, Priya's dashboard.

What ClickHouse is not good at

The same choices that make it fast at aggregation make it bad at other things, and knowing this now saves you a painful migration later:

  • Point lookups and OLTP. No fast "fetch one row by primary key with a transaction." It's not a replacement for Postgres-the-system-of-record.
  • Frequent single-row updates and deletes. Possible, but they fight the immutable-parts model. We'll see why in the MergeTree lessons.
  • Many small inserts. Insert a thousand rows a thousand times and you'll create a thousand tiny parts and a merge storm. ClickHouse wants big batches.

Aha: ClickHouse isn't "a faster Postgres." It's a different machine for a different job. The moment you try to make it do single-row updates and point lookups, every design choice that made it fast turns into a footgun. Pick it for the workload it was built for, not because it benchmarked well on someone's blog.


3. Worked Example

Picture Priya's "unique visitors per country per hour" query on both engines, conceptually.

Postgres (row store):

  • Walks the events table or an index, materializing whole rows.
  • Hashes country, hour and tracks distinct user_id per group.
  • Touches every column of every row it reads, even though it needs three.
  • 40 seconds, and it competes with checkout traffic for buffer cache.

ClickHouse (column store, MergeTree):

  • Reads only three columns: event_time, country, user_id.
  • Those columns are sorted and compressed, so it reads far fewer bytes off disk.
  • It aggregates in vectorized blocks across all CPU cores.
  • Sub-second, and you can run it on a separate machine so it never touches the order-taking database.

Same question. The difference isn't a cleverer query. It's that the storage and execution model match the shape of the question. The dashboard moves off the transactional database entirely, which fixes Priya's real production problem: the reports and the checkouts stop fighting.


4. Your Turn

Exercise: For each workload, decide whether ClickHouse is a good fit, and say why in one sentence.

  1. A leaderboard that updates a single player's score 50 times a second by primary key.
  2. "Revenue by product category by day for the last 2 years," refreshed by 300 analysts.
  3. The system of record for TheWorldShop orders, including the checkout transaction.
  4. Storing 5 billion ad-impression events per day and answering "impressions by campaign by hour."
  5. A user-profile service: given a user id, return that user's profile row in under 5ms.

5. Real-World Application

ClickHouse shows up wherever "billions of rows, aggregate them fast, many users at once" is the job:

  • Product analytics (PostHog is built on ClickHouse): events in, funnels and retention out.
  • Observability backends (logs, metrics, traces): Cloudflare, Sentry, and others store telemetry in ClickHouse because the query shape is "aggregate over a time window."
  • Real-time dashboards like Priya's: move the heavy read workload off the transactional database.

The honest framing for your team: ClickHouse is a specialized aggregation engine you put next to your systems of record, fed by a stream or a batch copy. It rarely replaces Postgres. It relieves it.


6. Recap + Bridge

You learned why ClickHouse exists: a 2009 web-analytics workload (billions of append-only rows, aggregation queries, high concurrency, sub-second answers) drove a columnar, vectorized, sorted-storage, single-binary design. Those same choices make it superb at aggregation and poor at point lookups and updates.

Next we open the engine up. ClickHouse Architecture: one binary, threads not processes, columnar storage, no JVM, and the lab where you'll run every internals lesson for the rest of this course.