#data-engineering #career #lessons-learned #best-practices #production

I Asked 100 Data Engineers What They Wish They Knew Earlier

By Petascale Labs

I want to be honest about the title before we start: nobody handed out a clipboard. I didn't run a formal hundred-person survey. But I've read a lot of postmortems, sat in a lot of on-call channels, and watched a lot of "I wish someone had told me this earlier" threads go by - and after enough of them, the regrets of experienced data engineers start to rhyme. Put a hundred of them in a room and the same lines come up, almost word for word.

The surprising part isn't what they wish they'd learned. You'd expect the answer to be "more tools" - another orchestrator, another query engine, the streaming framework they never got to. It almost never is. The thing experienced engineers wish they'd understood earlier is depth in the layers they already touched: the place a model quietly lies, the retry that wasn't safe, the bytes on disk that turned out to be the real bill.

That maps cleanly onto how we think about the field on the Data Engineer Roadmap: you meet every area early, and seniority is how far into each one you go. The regrets below are exactly the points where the floor of "it runs" drops away and the depth begins. They cluster into a few themes - so let's walk the themes, and point at a free place to feel each one rather than just nod along.

1. "The model was lying, and nothing errored"

The single most common regret, by a distance: trusting a dimension table that silently overwrites its own history.

You model a customer dimension. A customer on the Pro plan places fifty orders in Q1. In Q2 they upgrade to Enterprise, and your pipeline does the obvious thing - UPDATE the row. Months later someone asks for "revenue by plan, last quarter." The query is flawless SQL. The number it returns is fiction.

Warning: Why it's fiction. The fifty Q1 orders join to the customer as the row looks today - Enterprise - because the old value was overwritten. All that historical revenue gets filed under a plan the customer wasn't on when they spent the money. Nothing throws. Nothing warns. The dashboard just confidently reports a past that never happened. This is Slowly Changing Dimension (SCD) Type 1, and it's the regret people phrase as "I didn't know the default was the dangerous one."

The lesson underneath isn't "memorize the SCD types." It's a single sentence that took some engineers years to internalize: a fact joins to a dimension as it was at the moment the fact happened, not as it is now. Type 1 throws away exactly the information you need to honor that, and it does it silently.
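
Here is a minimal sketch of that sentence, using Python's built-in sqlite3. The table names, column names, dates, and amounts are all hypothetical, invented for illustration; the only thing taken from the story is a Pro customer who upgrades to Enterprise in Q2. It joins the same facts to a Type 1 dimension and to a Type 2 dimension with validity intervals.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE fact_orders (order_id INTEGER, customer_id INTEGER, order_date TEXT, amount REAL);
-- Type 1: one row per customer, overwritten in place.
CREATE TABLE dim_customer_t1 (customer_id INTEGER PRIMARY KEY, plan TEXT);
-- Type 2: one row per version, valid over [valid_from, valid_to).
CREATE TABLE dim_customer_t2 (customer_id INTEGER, plan TEXT, valid_from TEXT, valid_to TEXT);

INSERT INTO fact_orders VALUES
  (1, 42, '2024-02-10', 100.0),   -- placed while on Pro
  (2, 42, '2024-03-05', 150.0),   -- placed while on Pro
  (3, 42, '2024-05-20', 400.0);   -- placed after the upgrade

-- Type 1 after the upgrade: the Pro value is gone.
INSERT INTO dim_customer_t1 VALUES (42, 'Enterprise');

-- Type 2 after the upgrade: the old version is closed, not overwritten.
INSERT INTO dim_customer_t2 VALUES
  (42, 'Pro',        '2023-01-01', '2024-04-01'),
  (42, 'Enterprise', '2024-04-01', '9999-12-31');
""")

# Type 1 join: every order sees the customer as they are today.
type1 = dict(con.execute("""
    SELECT d.plan, SUM(f.amount)
    FROM fact_orders f
    JOIN dim_customer_t1 d ON d.customer_id = f.customer_id
    GROUP BY d.plan
""").fetchall())

# Type 2 join: every order sees the version that was valid on its order date.
type2 = dict(con.execute("""
    SELECT d.plan, SUM(f.amount)
    FROM fact_orders f
    JOIN dim_customer_t2 d
      ON d.customer_id = f.customer_id
     AND f.order_date >= d.valid_from
     AND f.order_date <  d.valid_to
    GROUP BY d.plan
""").fetchall())

print("Type 1:", type1)
print("Type 2:", type2)
```

Both queries run without error. The Type 1 result files all 650.0 under Enterprise, while the Type 2 result splits it into 250.0 for Pro and 400.0 for Enterprise, which is what actually happened.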

We pulled this apart in Slowly Changing Dimensions, Actually Explained, and the fastest way to feel it is to stop reading and replay one: the free SCD Playground lets you push a change timeline through the types side by side and watch a fact land on the right - or wrong - version. The full treatment lives in the Dimensional Data Modeling track.

2. "I learned idempotency at 3am"

Almost everyone has a version of this story, and it's always told at the same hour. A task failed halfway. The scheduler retried it, the way it's supposed to. And the retry didn't resume the work - it repeated it. The batch got double-written. The reconciliation job paid out twice. The "we'll just re-run the backfill" sent the same forty thousand emails again.

What they wish they'd known is that idempotency is a property you have to reason about, not a flag you turn on. The happy path is the easy 80%. The real work of orchestration is the question every retry silently asks: if this runs twice, is the result the same as if it ran once? Upserts keyed on a natural key are idempotent; blind INSERTs are not. Deleting-then-writing a partition is idempotent; appending is not. "Exactly-once" almost always turns out to be "at-least-once plus a dedup you forgot to write."

Aha: The scheduler doesn't owe you a guarantee that a task runs exactly once - distributed systems can't cheaply promise that. It promises at least once. Idempotency is how you make "at least once" behave like "exactly once."
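
To make the "runs twice" question concrete, here is a small sketch in plain Python. The three write functions and the in-memory "tables" are hypothetical stand-ins for a warehouse, not any real library's API; the loop simulates a scheduler retrying a task that had already written its batch.

```python
# A batch the task writes. The order IDs and amounts are made up for illustration.
batch = [
    {"order_id": 1, "amount": 100.0},
    {"order_id": 2, "amount": 150.0},
]

def blind_append(table: list, rows: list) -> None:
    """Blind INSERT: every run adds the rows again."""
    table.extend(rows)

def upsert(table: dict, rows: list) -> None:
    """Upsert keyed on order_id: a second run overwrites the same keys."""
    for row in rows:
        table[row["order_id"]] = row

def overwrite_partition(table: dict, partition: str, rows: list) -> None:
    """Delete-then-write: the partition is replaced as a whole."""
    table[partition] = list(rows)

appended: list = []
upserted: dict = {}
partitioned: dict = {}

# Attempt 1 wrote the data but the task was reported as failed; attempt 2 is the retry.
for attempt in range(2):
    blind_append(appended, batch)
    upsert(upserted, batch)
    overwrite_partition(partitioned, "2024-05-20", batch)

total_appended = sum(r["amount"] for r in appended)
total_upserted = sum(r["amount"] for r in upserted.values())
total_partitioned = sum(r["amount"] for r in partitioned["2024-05-20"])

print(total_appended, total_upserted, total_partitioned)
```

After the retry, the appended table holds four rows and double the revenue, while the upserted and partition-overwritten tables look exactly as they would after a single run.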

This is the depth behind every DAG, and it's why we treat failure modes - not the happy path - as the real curriculum in the Orchestration & Pipelines track.

3. "The bytes on disk were the bill"

Here's a regret that sounds like a performance tip and is actually a cost lesson: how your data is laid out on disk is the query speed and the cloud bill, far more than the query text is.

The story is usually the small-file problem. A streaming job writes a file every few seconds. Six months later a table that holds a modest amount of data is spread across two million tiny Parquet files, and a simple SELECT takes minutes and a small fortune, because the engine spends all its time opening files and reading footers instead of reading data.

The other half of the regret is the inverse surprise: two files with identical rows can differ tenfold in scan cost. That's row-group sizing, encoding choices, and whether the min/max statistics let the engine skip pages it doesn't need. None of it is visible in the query - it's a property of the bytes.
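
A toy model of why identical rows can cost different amounts to scan. This is not a Parquet reader: it only imitates the idea of per-row-group min/max statistics in plain Python, and the row-group size, the value range, and the function names are assumptions chosen for illustration. The same 10,000 values are laid out once sorted and once shuffled, and we count how many row groups a range filter cannot skip.

```python
import random

ROW_GROUP_SIZE = 1_000  # hypothetical; real writers size row groups in bytes or rows

def row_group_stats(values, size=ROW_GROUP_SIZE):
    """Return one (min, max) pair per row group, like footer statistics."""
    return [
        (min(values[i:i + size]), max(values[i:i + size]))
        for i in range(0, len(values), size)
    ]

def row_groups_to_read(stats, lo, hi):
    """Count row groups whose [min, max] range overlaps the filter lo <= x <= hi."""
    return sum(1 for gmin, gmax in stats if gmin <= hi and gmax >= lo)

# Same rows, two layouts.
sorted_values = list(range(10_000))
shuffled_values = sorted_values[:]
random.Random(0).shuffle(shuffled_values)

lo, hi = 5_000, 5_999
scanned_sorted = row_groups_to_read(row_group_stats(sorted_values), lo, hi)
scanned_shuffled = row_groups_to_read(row_group_stats(shuffled_values), lo, hi)

print("row groups read (sorted layout):  ", scanned_sorted)
print("row groups read (shuffled layout):", scanned_shuffled)
```

In the sorted layout the filter touches a single row group. In the shuffled layout every row group's min/max range spans nearly the whole domain, so the statistics cannot rule anything out and the engine has to read all of them. The query text is identical in both cases.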

We opened the format up in What's Actually Inside Your Parquet File, and you can point the free Parquet Viewer at your own files, entirely in the browser, to see the row groups and statistics for yourself. The depth lives in the Storage & File Formats track.

4. "It worked until two writers committed at once"

This one only bites once you've graduated to a lakehouse, and it bites hard. You adopt an open table format such as Delta, you get ACID transactions and time travel, and for a while a table feels like a folder you can write to from anywhere. Then your nightly compaction job and your streaming ingest commit at the same moment, and one of them fails with a conflict you've never seen - or worse, doesn't fail, and quietly drops a snapshot.

What they wish they'd known: a table format is a distributed system, not a directory. The interesting behavior isn't the read - it's what happens when two writers race. Snapshot isolation, optimistic concurrency, conflict resolution, compaction competing with your ingest for the same files: this is distributed-systems reasoning wearing a familiar MERGE INTO costume. The engine makes you a guarantee about concurrent writers, and the guarantee has edges you need to know before you hit them at scale.
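
A deliberately tiny sketch of optimistic concurrency, to show where a conflict comes from. ToyTable, CommitConflict, and the conflict rule ("you may not remove a file that someone else removed after your snapshot") are all assumptions made up for this example; real table formats have their own, more detailed rules, and this is not any of their actual algorithms.

```python
from dataclasses import dataclass, field

class CommitConflict(Exception):
    """Raised when a writer's snapshot is too stale for its changes to apply."""

@dataclass
class ToyTable:
    files: set = field(default_factory=set)
    log: list = field(default_factory=list)  # one (added, removed) pair per version

    @property
    def version(self) -> int:
        return len(self.log)

    def commit(self, base_version: int, added: set, removed: set) -> int:
        # Collect files removed by writers that committed after our snapshot.
        removed_since = set()
        for _, other_removed in self.log[base_version:]:
            removed_since |= other_removed
        if removed & removed_since:
            raise CommitConflict(f"files already rewritten: {sorted(removed & removed_since)}")
        self.files = (self.files - removed) | added
        self.log.append((set(added), set(removed)))
        return self.version

table = ToyTable()
table.commit(0, added={"a", "b"}, removed=set())   # version 1

snapshot = table.version  # three writers all start from version 1

# Streaming ingest: a pure append, touches no existing file.
table.commit(snapshot, added={"c"}, removed=set())

# Compaction: rewrites a and b into one file. It started before the ingest
# committed, but the ingest removed nothing, so the commit still applies.
table.commit(snapshot, added={"ab"}, removed={"a", "b"})

# A delete job that rewrote file a, also from the stale snapshot.
conflict = False
try:
    table.commit(snapshot, added={"a2"}, removed={"a"})
except CommitConflict:
    conflict = True

print(sorted(table.files), conflict)
```

The append and the compaction coexist even though they raced. The third writer loses because the file it wants to replace no longer exists in the current version. Knowing which of your jobs fall into which bucket is the edge of the guarantee.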

That whole layer - the part that decides whether your "ACID table" is actually correct under concurrency - is the Open Table Formats track.

5. "My masking didn't mask"

The governance regret is the scariest, because it looks like it fails closed while it actually fails open. Someone is told to anonymize a column. They reach for the obvious thing. It produces output that looks scrambled. Everyone signs off. The data is still re-identifiable.

Warning: Two classics that pass review and fail in practice. An unsalted hash of an email is not anonymization - it's a deterministic lookup table. Anyone with a list of likely emails can hash them all and join straight back to your "anonymized" rows. And a "redacted" ZIP code that still keeps all five digits pins many people to a neighborhood; combined with date of birth and gender, the famous result (Latanya Sweeney's analysis of 1990 US census data) is that roughly 87% of the US population is uniquely identifiable. The masking ran. The guarantee didn't hold.
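
The lookup-table attack takes only a few lines with Python's standard hashlib. The emails, the rows, and the secret key below are invented for illustration. The keyed variant uses HMAC with a secret held outside the dataset, which is one common way to stop this particular attack; it is still deterministic pseudonymization, not anonymization, so treat it as a sketch of the idea and not a complete policy.

```python
import hashlib
import hmac

def unsalted(email: str) -> str:
    """The 'obvious thing': a plain SHA-256 of the email."""
    return hashlib.sha256(email.lower().encode()).hexdigest()

# The "anonymized" table that passed review.
masked_rows = [
    {"email_hash": unsalted("ana@example.com"), "plan": "Pro"},
    {"email_hash": unsalted("bob@example.com"), "plan": "Enterprise"},
]

# The attacker only needs a list of likely emails and the same hash function.
candidates = ["ana@example.com", "bob@example.com", "zed@example.com"]
lookup = {unsalted(e): e for e in candidates}

reidentified = [
    (lookup[r["email_hash"]], r["plan"])
    for r in masked_rows
    if r["email_hash"] in lookup
]

# Keyed variant: without the key, the attacker cannot rebuild the table.
SECRET_KEY = b"hypothetical-key-kept-outside-the-warehouse"

def keyed(email: str) -> str:
    return hmac.new(SECRET_KEY, email.lower().encode(), hashlib.sha256).hexdigest()

keyed_rows = [
    {"email_hash": keyed("ana@example.com"), "plan": "Pro"},
    {"email_hash": keyed("bob@example.com"), "plan": "Enterprise"},
]

# The same attack, run with the attacker's unkeyed lookup table.
reidentified_keyed = [
    (lookup[r["email_hash"]], r["plan"])
    for r in keyed_rows
    if r["email_hash"] in lookup
]

print(reidentified)
print(reidentified_keyed)
```

Every row of the unsalted table joins straight back to a real email. The keyed table yields nothing to an attacker who doesn't hold the key, which also means the key itself now needs the same protection the raw emails did.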

The lesson experienced engineers wish they'd had on day one: a masking policy is a promise, and the promise is easy to break by accident. Salting, k-anonymity on quasi-identifiers, tokenization with a guarded vault, and right-to-erasure that actually reaches into immutable snapshots - these are the difference between "looks masked" and "is masked."
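
For the quasi-identifier half, a k-anonymity check is just a group-by. The rows, column names, and generalization rules below are hypothetical; the point is only to show how you would measure whether a masked table still singles people out.

```python
from collections import Counter

def smallest_group(rows, quasi_identifiers):
    """Return k: the size of the smallest group sharing the same quasi-identifier values."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in rows)
    return min(counts.values())

def generalize(row):
    """Coarsen the quasi-identifiers: 3-digit ZIP prefix and birth decade."""
    return {
        "zip": row["zip"][:3] + "**",
        "birth_year": row["birth_year"] // 10 * 10,
        "gender": row["gender"],
    }

# Made-up rows with the direct identifiers already removed.
rows = [
    {"zip": "94110", "birth_year": 1984, "gender": "F"},
    {"zip": "94117", "birth_year": 1987, "gender": "F"},
    {"zip": "94122", "birth_year": 1981, "gender": "M"},
    {"zip": "94131", "birth_year": 1989, "gender": "M"},
]
quasi_identifiers = ["zip", "birth_year", "gender"]

k_raw = smallest_group(rows, quasi_identifiers)
k_generalized = smallest_group([generalize(r) for r in rows], quasi_identifiers)

print("k before generalization:", k_raw)
print("k after generalization: ", k_generalized)
```

Before generalization every row is unique (k = 1), so each person can be singled out even with no name or email present. After coarsening, the smallest group has two members. Whether k = 2 is anywhere near enough is a policy decision; the check only tells you what promise the data can actually keep.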

We walked through the guarantee-breaking gotchas in Stop Hand-Writing PII Masking Policies; the free PII Masking Generator produces the DDL with the safe defaults baked in; and the PII & Data Governance track covers the architecture behind it.

6. "AI raised the floor - the depth is the job now"

The forward-looking regret, the one that's only a year or two old: people spent their early energy getting fast at producing the code, and the code is now the cheap part. AI writes the query, the DAG, the PySpark job, the masking policy - and it writes them at 2am without complaining. What it can't do is reason about any of the depth in the five sections above. It won't warn you that Type 1 is about to corrupt your history, that the retry isn't idempotent, or that the hash is a lookup table, and it won't tell you why the scan cost what it cost or what happens when two writers commit.

So the thing to optimize for early - the regret-proofing move - is to invest in the reasoning AI can't reach, not the boilerplate it now hands you for free. We made the full case for that in The Data Engineer Roadmap for 2026: the layers are the same from junior to senior, and the deep end of each layer is exactly the part that stays yours. That deep end is what the curriculum and the Arcade exist to let you practice on real engines.

The one sentence to keep

If you distill a hundred of these regrets down to a single line, it isn't about a tool. It's this: the thing that hurts later is almost never the code you couldn't write - it's the property you couldn't see. The silent overwrite, the unsafe retry, the layout you couldn't read, the race you didn't model, the mask that didn't mask. Every one of them ran perfectly and was wrong.

You can't learn that from a checklist; you mostly learn it from the 3am page. The next best thing is to go find each property somewhere it's safe to break - replay a dimension, open a Parquet file, generate a masking policy - and feel the gap before it costs you. That's the whole point of the curriculum and the Arcade: meet the regret early, on purpose, instead of at the worst possible hour.
