Why Data Lakes Were Broken

Module: Beginners | Duration: ~20 min | Lesson: 1 of 7


It's 2017. You're a data engineer at a mid-size e-commerce company; let's call it TheWorldShop. You have 5 TB of sales data in Parquet files on S3. You run a query to pull yesterday's orders. It returns 12,000 rows.

Your colleague then updates a customer's address record to comply with a GDPR rectification request. She does it the only way she knows how: she rewrites the whole partition, 80 GB of files, overnight.

You pull yesterday's orders again the next morning. 9,000 rows.

3,000 rows just disappeared. There is no audit trail. There is no rollback. Two jobs ran concurrently, there was no coordination, and one silently overwrote the other.

This is the story of why Apache Iceberg was born.


2. Concept Explanation

The Original Data Lake Promise

The original pitch for data lakes in ~2012 was compelling: dump all your raw data into cheap object storage (S3, GCS, HDFS), use Hive or Spark to query it with SQL, and skip expensive proprietary data warehouses.

The underlying storage model was simple: data = directories of Parquet/ORC files on disk.

s3://theworldshop-data/sales/
  year=2024/month=01/day=15/part-00000.parquet
  year=2024/month=01/day=15/part-00001.parquet
  year=2024/month=01/day=16/part-00000.parquet

Hive tracked which folders existed via a Metastore. Spark could list directories and read files. It worked, until it didn't.
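To see how little "table" there is in this model, here is a minimal sketch of partition discovery done purely from path names. The key list and the function names are made up for illustration; real engines do the same thing against a directory listing or the Metastore.

```python
from collections import defaultdict

# Hypothetical object keys, mirroring the layout shown above.
keys = [
    "sales/year=2024/month=01/day=15/part-00000.parquet",
    "sales/year=2024/month=01/day=15/part-00001.parquet",
    "sales/year=2024/month=01/day=16/part-00000.parquet",
]

def partition_of(key):
    """Extract partition values from the name=value segments of a path."""
    parts = {}
    for segment in key.split("/"):
        if "=" in segment:
            name, value = segment.split("=", 1)
            parts[name] = value
    return parts

def discover(keys):
    """Group files by partition, exactly as the directory names imply."""
    table = defaultdict(list)
    for key in keys:
        p = partition_of(key)
        table[(p["year"], p["month"], p["day"])].append(key)
    return dict(table)

partitions = discover(keys)
for partition, files in sorted(partitions.items()):
    print(partition, len(files))
```

Notice that the only "schema" the reader has for partitions is the naming convention itself. Rename a directory and the table changes.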

The Five Fundamental Problems

Problem 1: No Atomicity

When you write new data, you write multiple files. If your Spark job crashes after writing 3 of 10 files, those 3 partial files are now "part of your table." There is no atomic commit. Readers will see partial results.

# You intend to write this atomically:
part-00000.parquet  ✅ written
part-00001.parquet  ✅ written  
part-00002.parquet  💥 job crashes

# But readers already see:
part-00000.parquet  ← valid file, but the batch is incomplete
part-00001.parquet  ← valid file, but the batch is incomplete
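You can reproduce this on a laptop. The sketch below is a toy, assuming a local temp directory stands in for the S3 prefix and plain text files stand in for Parquet; write_batch and visible_files are hypothetical helper names.

```python
import os
import tempfile

def write_batch(directory, num_files, crash_after=None):
    """Write part files one by one; optionally fail partway through."""
    for i in range(num_files):
        if crash_after is not None and i == crash_after:
            raise RuntimeError("simulated job crash")
        path = os.path.join(directory, f"part-{i:05d}.parquet")
        with open(path, "w") as f:
            f.write("rows")  # placeholder content, not real Parquet

def visible_files(directory):
    """A reader's view: whatever the directory listing returns right now."""
    return sorted(os.listdir(directory))

table_dir = tempfile.mkdtemp()
try:
    # The job intends to write 10 files but dies after the third one.
    write_batch(table_dir, num_files=10, crash_after=3)
except RuntimeError:
    pass  # nothing cleans up after the failed job

seen = visible_files(table_dir)
print(seen)
```

The three files that made it to disk are indistinguishable from a successful write. A reader has no way to tell that seven more were supposed to follow.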

Problem 2: No Isolation

Two writers can overwrite each other's data with no conflict detection. This is what happened in the TheWorldShop story above: the last writer wins, silently.
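Here is a small simulation of that lost update, assuming a Python dict stands in for the object store and each job does a plain read, modify, overwrite. The job roles, keys, and row shapes are invented for illustration.

```python
# A dict stands in for an object store: partition key -> list of rows.
store = {
    "sales/day=15": [
        {"order_id": n, "customer": f"c{n % 4}", "address": "old"}
        for n in range(12)
    ]
}

def read_partition(key):
    return list(store[key])  # each job gets its own copy

def overwrite_partition(key, rows):
    store[key] = rows  # no version check, no lock

key = "sales/day=15"

# Both jobs read the same starting state.
copy_a = read_partition(key)
copy_b = read_partition(key)

# Job A appends three late-arriving orders and writes the partition back.
late_orders = [
    {"order_id": n, "customer": "c9", "address": "old"} for n in (12, 13, 14)
]
overwrite_partition(key, copy_a + late_orders)

# Job B corrects one customer's address in its stale copy and writes back.
corrected = [
    {**row, "address": "new"} if row["customer"] == "c0" else row
    for row in copy_b
]
overwrite_partition(key, corrected)

final_ids = [row["order_id"] for row in store[key]]
print(len(final_ids), final_ids)
```

Both jobs "succeeded", yet the three late orders are gone. Nothing in the storage layer noticed that Job B wrote from a stale read.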

Problem 3: Partition Evolution is a Nightmare

Suppose you originally partitioned by month. After a year, you realize you need to partition by day for better query performance. Your options:

  • Option A: Rewrite all historical data (petabytes, expensive, risky)
  • Option B: Keep old data in month-partitions, new data in day-partitions (query engine must understand both, messy)

Neither option is good. And because the query engine discovers partitions by listing directory names, if the partition format changes, you need to update every query that filters by date.
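Option B can be sketched in a few lines. This is an illustrative toy: the cutover date and the function names are assumptions, and the point is only that every reader has to carry the layout history around.

```python
from datetime import date

# Hypothetical date on which the team switched from monthly to daily partitions.
CUTOVER = date(2024, 1, 1)

def prefix(d, layout):
    """Build the directory prefix that holds data for day d under a layout."""
    if layout == "month":
        return f"sales/year={d.year}/month={d.month:02d}/"
    if layout == "day":
        return f"sales/year={d.year}/month={d.month:02d}/day={d.day:02d}/"
    raise ValueError(f"unknown layout: {layout}")

def prefix_for(d):
    """Readers must know which layout was in force on any given date."""
    layout = "month" if d < CUTOVER else "day"
    return prefix(d, layout)

print(prefix_for(date(2023, 6, 15)))  # old data: whole month must be scanned
print(prefix_for(date(2024, 1, 15)))  # new data: a single day directory
```

The CUTOVER constant lives in reader code, not in the table. Every tool that touches the data needs its own copy of that knowledge.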

Problem 4: No Schema Safety

Adding a column in Hive means updating the Metastore. But old Parquet files don't have that column. When you read old + new files together, you get null for the new column in old rows only if schema merging is configured correctly; otherwise the read can fail or return inconsistent results. It gets worse when columns are matched by name: if someone drops a column and later adds a new one with the same name but a different type, old files still contain values under that name, and you get silent data corruption.
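The drop-and-re-add case is easier to see in a toy model. In the sketch below, each "file" stores values keyed by a numeric field ID plus the name-to-ID mapping it was written with. The structures and function names are invented for illustration and are not how Parquet or Iceberg lay out data on disk.

```python
# Old file: "discount" was a coupon code (field ID 2), later dropped.
old_file = {
    "schema": {"order_id": 1, "discount": 2},
    "rows": [{1: 100, 2: "SPRING10"}],
}
# New file: "discount" was re-added as a percentage (a NEW field, ID 3).
new_file = {
    "schema": {"order_id": 1, "discount": 3},
    "rows": [{1: 101, 3: 12.5}],
}
current_schema = {"order_id": 1, "discount": 3}

def read_by_name(file, columns):
    """Resolve each column through the file's own names (name matching)."""
    return [
        {name: row.get(file["schema"].get(name)) for name in columns}
        for row in file["rows"]
    ]

def read_by_id(file, schema):
    """Resolve each column through the table's current field IDs."""
    return [
        {name: row.get(field_id) for name, field_id in schema.items()}
        for row in file["rows"]
    ]

by_name = read_by_name(old_file, ["order_id", "discount"])
by_id = read_by_id(old_file, current_schema)
print(by_name)  # the old coupon code shows up as a "percentage"
print(by_id)    # the old row correctly has no value for the new column
```

Matching by name resurrects the dropped column's values. Matching by a stable ID does not, which is the idea behind the column IDs mentioned later in this lesson.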

Problem 5: Performance at Scale

To answer "which files contain data for January 2024?", the engine has to list the matching partition directories and every file inside them. With millions of small files, this listing takes minutes. Netflix famously had queries that spent 10+ minutes on partition discovery alone, before reading a single byte of actual data.
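A back-of-the-envelope estimate shows where the minutes go. S3's list API returns at most 1,000 keys per request; the file count and the per-request latency below are assumptions picked for illustration, not measurements.

```python
import math

MAX_KEYS_PER_LIST = 1000  # upper bound per S3 ListObjectsV2 request

def list_requests(num_files):
    """Number of list calls needed to enumerate num_files objects."""
    return math.ceil(num_files / MAX_KEYS_PER_LIST)

def listing_seconds(num_files, seconds_per_request):
    """Time to list everything if the requests run one after another."""
    return list_requests(num_files) * seconds_per_request

# Assumed workload: 5 million small files, 100 ms per list request.
requests_needed = list_requests(5_000_000)
estimate = listing_seconds(5_000_000, seconds_per_request=0.1)
print(requests_needed, "requests, about", round(estimate / 60, 1), "minutes")
```

Engines can parallelize listing across prefixes, so real numbers vary, but the cost still grows with the number of files rather than with the size of the answer.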

The Root Cause

All five problems have the same root cause: there is no table-level metadata layer.

The "table" is just a convention. It's a bunch of files in a directory with a naming scheme. There's no atomic unit, no versioning, no statistics, no column tracking.

This is what Apache Iceberg fixes.


3. Worked Example

Let's make the "no atomicity" problem concrete with Spark pseudocode:

# Dangerous: old-style Hive/Spark write
df.write \
  .mode("overwrite") \
  .partitionBy("date") \
  .parquet("s3://theworldshop-data/sales/")

# What actually happens on S3:
# 1. Spark deletes old partition files  ← visible to readers IMMEDIATELY
# 2. Spark writes new files             ← takes 2-3 minutes
# 3. Any reader during step 2 sees ZERO rows for that partition

During that 2-3 minute window, your dashboards show missing data. Your ML pipeline might train on incomplete data. Your SLA alert fires.

Now contrast this with an Iceberg write (we'll learn exactly how it works in Lesson 6):

# Safe: Iceberg write
df.writeTo("catalog.theworldshop.sales") \
  .option("write-format", "parquet") \
  .append()

# What actually happens:
# 1. Spark writes new data files that no snapshot references yet  ← invisible to readers
# 2. Iceberg atomically commits a new snapshot                    ← single metadata pointer swap
# 3. Readers that start after the commit see the new snapshot     ← no partial state possible

The difference: Iceberg's atomic commit means a write is either fully visible or completely invisible. No in-between state.
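The commit idea can be mimicked on a local filesystem. The sketch below is a toy under stated assumptions: a single JSON pointer file (the name current.json is made up) lists the files that belong to the table, and os.replace provides the atomic swap. Real Iceberg relies on its catalog for that swap and uses a richer metadata tree, covered in later lessons.

```python
import json
import os
import tempfile

table_dir = tempfile.mkdtemp()
pointer = os.path.join(table_dir, "current.json")  # hypothetical pointer file

def write_data_file(name):
    """Write a data file. It is on disk, but not yet part of the table."""
    with open(os.path.join(table_dir, name), "w") as f:
        f.write("rows")  # placeholder content, not real Parquet

def commit(version, data_files):
    """Publish a new snapshot by atomically replacing the pointer file."""
    tmp = os.path.join(table_dir, f"v{version}.json.tmp")
    with open(tmp, "w") as f:
        json.dump({"version": version, "files": data_files}, f)
    os.replace(tmp, pointer)  # readers see the old or the new pointer, never half

def read_snapshot():
    """A reader trusts only the pointer, never the directory listing."""
    with open(pointer) as f:
        return json.load(f)

commit(0, [])  # empty table

new_files = ["part-00000.parquet", "part-00001.parquet"]
for name in new_files:
    write_data_file(name)

during_write = read_snapshot()["files"]  # files exist on disk, not in the table
commit(1, new_files)
after_commit = read_snapshot()["files"]

print(during_write, after_commit)
```

If the writer crashes before commit(1, ...), the data files are orphaned garbage, but no reader ever sees them. That is the opposite of the directory-listing model, where existing on disk means being in the table.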


Aha: A Hive-style data lake doesn't have tables; it has directories that you've agreed to treat as tables. There's no atomic rename across files, no transactional metadata, no consistent view across concurrent writers. Every "why did my query return wrong data?" question on a Hive lake traces back to that one missing word: table.


4. Your Turn

Exercise: Consider a data pipeline that reads from Kafka and writes hourly batches to a Parquet data lake on S3.

  1. List three scenarios where the lack of atomicity could cause data quality issues.
  2. For each scenario, describe what a user querying the table would observe.
  3. Now consider a schema change: adding a discount_percentage column. How would a Hive-style data lake handle queries that read both old files (without the column) and new files (with it)?

5. Real-World Application

These exact problems led Netflix to start building Apache Iceberg in 2017. Their data pipelines were experiencing:

  • Silent data loss from concurrent writes
  • 10+ minute query planning times from slow partition listing
  • Schema drift bugs that were hard to detect and debug

Ryan Blue, then an engineer at Netflix and one of Iceberg's co-creators, described the problem in a 2018 talk: "We didn't have a table. We had a directory convention that we agreed to call a table."

Today, Netflix, Apple, LinkedIn, Airbnb, and Adobe all run Iceberg at petabyte scale in production.

Your job: If you're a data engineer, understanding this history is crucial. Every design decision in Iceberg (snapshots, manifests, column IDs, atomic commits) is a direct solution to one of these problems. Once you know the problems, the solutions make immediate sense.


6. Recap + Bridge

What we learned: The original "data lake = files in a directory" model had five fundamental flaws: no atomicity, no isolation, painful partition evolution, unsafe schema changes, and poor scan performance at scale.

Coming up next: Lesson 2 answers the question: so what exactly is Iceberg? It's not a file format (Parquet is still your file format). It's something different: an open table format, a metadata specification that sits above your files and addresses all five problems.