The data engineer roadmap for what AI left behind.

Learn Python and SQL, build a pipeline - and you're a data engineer.

That was true. Then AI learned to do all three better.

So this is the roadmap for what's left: every area that still matters, and the depth in each one that an AI writing your code still can't reason about for you.

Free and complete. Junior and senior work the same areas - what changes is how deep you go.

The roadmap, milestone by milestone

Get started

  1. Foundations & SQL

    The language and tooling everything else assumes.

    SQL joins · Window functions · CTEs · Python (pandas, APIs) · Command line & Git · Warehouse basics · ETL vs ELT · Query optimization (EXPLAIN, indexes)

    AI writes most of this now - learn it well enough to catch when the generated query is quietly wrong.
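
As one illustration of "quietly wrong", here is a minimal sketch of join fan-out using Python's built-in sqlite3 module. The orders and shipments tables and their rows are invented for this example; they do not come from any real schema. Both queries run without error, but only the second matches the grain of the question "how much did shipped orders bring in?".

```python
import sqlite3

# Hypothetical toy data: order 1 shipped in two parcels, order 2 in one.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL);
    CREATE TABLE shipments (shipment_id INTEGER PRIMARY KEY, order_id INTEGER);
    INSERT INTO orders VALUES (1, 100.0), (2, 50.0);
    INSERT INTO shipments VALUES (10, 1), (11, 1), (12, 2);
    """
)

# Plausible-looking query: the join repeats order 1 once per shipment,
# so its amount is summed twice.
naive_total = conn.execute(
    "SELECT SUM(o.amount) "
    "FROM orders o JOIN shipments s ON s.order_id = o.order_id"
).fetchone()[0]

# Grain-aware query: each order contributes at most once.
correct_total = conn.execute(
    "SELECT SUM(amount) FROM orders "
    "WHERE order_id IN (SELECT order_id FROM shipments)"
).fetchone()[0]

print(naive_total, correct_total)
```

The first query reports 250.0 and the second 150.0. Nothing fails and nothing warns; the only defense is knowing the grain of each table before you join.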

  2. Data Modeling & Transformation

    Turning raw tables into something the business can trust.

    Dimensional modeling (Kimball) · Star & snowflake schemas · Fact vs dimension tables · Normalization vs denormalization · dbt models & tests · Slowly Changing Dimensions (SCD 1/2/3) · One Big Table (OBT) · Grain & conformed dimensions · Data Vault

    AI drafts the model; the judgement - grain, conformed dimensions, what “one customer” means - is yours.

    Free tool: SCD Playground
    Read: Slowly Changing Dimensions, actually explained
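
To make the SCD idea concrete, here is a minimal sketch of Type 2 versioning in plain Python rather than dbt or SQL. The CustomerVersion class, the apply_scd2 and city_as_of helpers, and the sample customer are all hypothetical names made up for illustration. The sketch assumes half-open validity intervals (valid_from inclusive, valid_to exclusive), which is one common convention, not the only one.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class CustomerVersion:
    customer_id: int
    city: str
    valid_from: date
    valid_to: Optional[date] = None  # None marks the current version


def apply_scd2(history: List[CustomerVersion], customer_id: int,
               new_city: str, change_date: date) -> None:
    """Close the current row and append a new one if the city changed."""
    current = next(
        (r for r in history
         if r.customer_id == customer_id and r.valid_to is None),
        None,
    )
    if current is not None and current.city == new_city:
        return  # nothing changed, so no new version
    if current is not None:
        current.valid_to = change_date  # close the old version
    history.append(CustomerVersion(customer_id, new_city, change_date))


def city_as_of(history: List[CustomerVersion], customer_id: int,
               as_of: date) -> Optional[str]:
    """Return the city that was valid on the given date, if any."""
    for r in history:
        if (r.customer_id == customer_id
                and r.valid_from <= as_of
                and (r.valid_to is None or as_of < r.valid_to)):
            return r.city
    return None


history: List[CustomerVersion] = []
apply_scd2(history, 1, "Berlin", date(2023, 1, 1))
apply_scd2(history, 1, "Lisbon", date(2024, 6, 1))
apply_scd2(history, 1, "Lisbon", date(2024, 7, 1))  # no change, no new row

print(city_as_of(history, 1, date(2023, 12, 31)))
print(city_as_of(history, 1, date(2024, 6, 1)))
```

A Type 1 dimension would simply overwrite the city and lose the fact that the customer ever lived in Berlin; deciding which attributes deserve history is the modeling judgement.
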

  3. Orchestration & Pipelines

    Getting work to run on time, in order, and survive failure.

    Apache Airflow (DAGs) · Scheduling & sensors · dbt scheduling · Backfills · Retries & idempotency · Scheduler & executor internals · Data-aware scheduling · Lineage & observability · Freshness SLAs & on-call

    AI generates the DAG. Knowing the failure modes behind a 3am page is the actual job.
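
Retries and backfills are only safe when a task is idempotent. Here is a minimal sketch of one common pattern, replacing a whole date partition inside a single transaction, using Python's built-in sqlite3 module. The daily_sales table and the load_partition helper are hypothetical; this is not an Airflow API, only the shape of a task body that can be rerun.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (ds TEXT, sku TEXT, units INTEGER)")


def load_partition(conn, ds, rows):
    """Replace every row for the date `ds` in one transaction."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute("DELETE FROM daily_sales WHERE ds = ?", (ds,))
        conn.executemany(
            "INSERT INTO daily_sales (ds, sku, units) VALUES (?, ?, ?)",
            [(ds, sku, units) for sku, units in rows],
        )


rows = [("A", 3), ("B", 5)]
load_partition(conn, "2024-01-01", rows)
load_partition(conn, "2024-01-01", rows)  # simulated retry of the same run

row_count = conn.execute("SELECT COUNT(*) FROM daily_sales").fetchone()[0]
total_units = conn.execute("SELECT SUM(units) FROM daily_sales").fetchone()[0]
print(row_count, total_units)
```

After the retry the table still holds 2 rows and 8 units. A plain append would have doubled both, which is the kind of failure that only shows up after the 3am rerun.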

  4. Storage & File Formats

    The bytes on disk that decide every query's cost.

    Parquet (columnar basics) · Row vs columnar storage · Compression (gzip, Zstd) · Object storage (S3 / GCS) · Partitioning basics · Row groups & page statistics · Predicate pushdown · Encoding (dictionary, RLE) · File sizing & the small-file problem · ORC, Avro, Arrow internals

    AI reads and writes the files. Why a scan costs what it costs is what it can't reason about for you.

    Free tool: Parquet Viewer
    Read: What's actually inside your Parquet file
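
To show why a scan costs what it costs, here is a toy model of row-group statistics and predicate pushdown in plain Python. It is a sketch of the idea only: the RowGroup class and the make_row_groups and scan_equals helpers are hypothetical and do not mirror the Parquet format or any library's API. Each row group keeps a min and max, and the scan skips any group whose range cannot contain the target.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RowGroup:
    min_value: int
    max_value: int
    rows: List[int]


def make_row_groups(values: List[int], size: int) -> List[RowGroup]:
    """Cut values into fixed-size groups and record min/max for each."""
    groups = []
    for i in range(0, len(values), size):
        chunk = values[i:i + size]
        groups.append(RowGroup(min(chunk), max(chunk), chunk))
    return groups


def scan_equals(groups: List[RowGroup], target: int) -> Tuple[List[int], int]:
    """Return matching rows and the number of row groups actually read."""
    groups_read = 0
    matches: List[int] = []
    for g in groups:
        if target < g.min_value or target > g.max_value:
            continue  # statistics prove there is no match, so skip the group
        groups_read += 1
        matches.extend(v for v in g.rows if v == target)
    return matches, groups_read


sorted_values = list(range(1000))
# Same 1000 values, but written in a scattered order.
scattered_values = [(i * 37) % 1000 for i in range(1000)]

sorted_result = scan_equals(make_row_groups(sorted_values, 100), 500)
scattered_result = scan_equals(make_row_groups(scattered_values, 100), 500)
print(sorted_result, scattered_result)
```

Both scans find the same single row, but the sorted layout reads 1 of 10 row groups while the scattered layout reads all 10. Same data, same query, ten times the I/O, decided entirely by how the file was written.
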

  5. Data Lakes & Table Formats

    ACID and time travel on top of cheap object storage.

    Lake vs warehouse vs lakehouse · What Iceberg / Delta are · Time travel (concept) · Schema evolution · ACID & snapshot isolation internals · Compaction & small files · Catalogs (Glue, Nessie, Unity) · Iceberg vs Delta vs Hudi tradeoffs

    AI scaffolds the table ops; what happens when two writers commit at once is the part that bites.
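
The two-writer case can be sketched as optimistic concurrency: each writer records the table version it started from, and a commit succeeds only if that version is still current. The ToyTable class below is a hypothetical, single-process model of that idea, not the Iceberg or Delta API. Real table formats apply finer-grained conflict rules than this all-or-nothing check.

```python
from typing import List


class CommitConflict(Exception):
    """Raised when a writer's base version is no longer the latest."""


class ToyTable:
    def __init__(self) -> None:
        self.version = 0
        self.files: List[str] = []

    def commit(self, base_version: int, new_files: List[str]) -> int:
        # Compare-and-swap: only advance if nobody committed in between.
        if base_version != self.version:
            raise CommitConflict(
                f"based on v{base_version}, table is at v{self.version}"
            )
        self.files = self.files + list(new_files)
        self.version += 1
        return self.version


table = ToyTable()

# Both writers start from the same snapshot.
base_a = table.version
base_b = table.version

table.commit(base_a, ["a.parquet"])  # writer A wins the race

conflict_seen = False
try:
    table.commit(base_b, ["b.parquet"])  # writer B is now stale
except CommitConflict:
    conflict_seen = True
    # Writer B retries against the latest version.
    table.commit(table.version, ["b.parquet"])

print(conflict_seen, table.version, table.files)
```

The retry is harmless here because writer B only appends. If B had been rewriting or deleting files that A also touched, blindly retrying would be wrong, and that distinction is where the real tradeoffs between formats live.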

  6. Ingestion & Streaming

    Getting data in reliably - batch and real-time.

    Batch ingestion (Airbyte / Fivetran) · Kafka basics (topics, partitions) · Producers & consumers · Event time vs processing time · ISR & exactly-once semantics · Consumer group rebalancing · Change Data Capture (Debezium) · Stream processing (Flink, Kafka Streams)

    AI writes the producer and consumer. It won't warn you that this is where data-quality bugs are actually born.
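
One place those bugs start is the gap between event time and processing time. Here is a minimal sketch in plain Python, assuming 60-second tumbling windows and a hand-made list of events; the timestamps, the window_start and count_by helpers, and the tuple layout are all invented for illustration and are not tied to Kafka or Flink.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

WINDOW_SECONDS = 60

# (event_time, processing_time, value), all in seconds.
# The third event happened in the first minute but arrived much later.
events: List[Tuple[int, int, int]] = [
    (5, 6, 1),
    (50, 52, 1),
    (58, 125, 1),
    (70, 71, 1),
]


def window_start(ts: int) -> int:
    """Start of the tumbling window that contains timestamp `ts`."""
    return ts - ts % WINDOW_SECONDS


def count_by(events: List[Tuple[int, int, int]], time_index: int) -> Dict[int, int]:
    """Sum values per window, keyed by the chosen timestamp field."""
    counts: Dict[int, int] = defaultdict(int)
    for event in events:
        counts[window_start(event[time_index])] += event[2]
    return dict(counts)


by_event_time = count_by(events, 0)
by_processing_time = count_by(events, 1)
print(by_event_time, by_processing_time)
```

Grouped by event time, the first minute holds 3 events; grouped by processing time, it holds 2 and the late event lands in a window of its own. Both results look reasonable on a dashboard, which is why the choice has to be made deliberately.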

  7. Distributed Compute

    Processing data that doesn't fit on one machine.

    Apache Spark (DataFrames) · Transformations vs actions · PySpark basics · Lazy evaluation · Shuffle & partitioning · Broadcast joins & data skew · Catalyst & codegen · Memory & fault tolerance · Apache Flink · DuckDB

    AI writes the transformation; tuning shuffle, skew and memory - and knowing why it spilled - is yours to own.

  8. Query Engines & OLAP

    Serving interactive queries to analysts and dashboards.

    What OLAP is · Warehouse vs query engine · ClickHouse basics · Trino / Presto (querying) · MergeTree & projections · Federation & pushdown · Execution models (Volcano, vectorized, MPP) · Cost-based optimization · Druid & Pinot (real-time OLAP) · EXPLAIN literacy

    AI writes the SQL. Why the dashboard query is slow - and how to fix it at the engine - is senior work.

  9. Semantic & Metrics Layer

    Where data finally becomes agreed-upon business language.

    Metrics & dashboards · dbt Semantic Layer (basics) · Data quality tests (Great Expectations, Soda) · Data contracts · Schema registries · Metric governance · Reverse ETL

    AI drafts a metric; making “revenue” mean exactly one thing across every team is a human contract.

  10. Governance, Quality & Cloud

    The concerns that run across every layer above.

    PII basics & GDPR / CCPA · Cloud (AWS / GCP / Azure) · CI/CD for data · Masking & tokenization · Access control (row / column) · Right-to-erasure across a lakehouse · Infrastructure as code (Terraform) · Data observability at scale

    AI flags the obvious PII; designing erasure across a lakehouse is architecture, not autocomplete.

    Free tool: PII Masking Generator
    Read: Stop hand-writing PII masking policies

Finish

On par with AI

Job-ready - you can do what AI now does for you.

Finish

You lead AI

Senior - fluent in the depth AI can't shortcut for you.

The map is free. The depth is where you practice.

This roadmap stays free to read. If you want to actually work through the depth in each area - on real engines, not slideware - that's what we built.

Common questions

How long does it take to become a data engineer?

Less than it used to. With AI handling the boilerplate, syntax and debugging, the surface skills come faster - figure 4–6 months of focused study to become job-ready for a first role, and 1–2 years to mid-level. Reaching senior still takes time, because the depth beneath the tools is the part AI can't shortcut: you touch every area early, and what stretches out is how far into each one you go.

Do you need a degree to become a data engineer?

No. A computer-science or related degree helps, but hiring is overwhelmingly skills-first: a working command of SQL and Python, a cloud warehouse, orchestration, and two or three real projects that prove you can ship end-to-end will get you further than a credential. What you do need is depth you can actually demonstrate.

Is data engineering hard to learn?

The surface isn't - SQL and Python are approachable, and AI now smooths the syntax and boilerplate. What's genuinely hard is the depth: reasoning about why a query is slow, why a pipeline failed at 3am, or what a Spark shuffle is doing under load. That's the part that takes real time, and the part this roadmap is built around.

Data engineer vs data analyst vs data scientist - what's the difference?

A data analyst answers questions from existing data; a data scientist builds models and statistical insight; a data engineer builds and runs the systems that move, store and serve the data both of them depend on. If analysts and scientists are the consumers, the data engineer owns the infrastructure underneath - storage, pipelines, warehouses, and the reliability of all of it.

Do you still need SQL and Python in an AI-native world?

Yes - but they're table stakes now, not a differentiator. AI can write most queries and glue code. The durable skill is knowing whether the output is right and why a system behaves the way it does: storage formats, table formats, compute internals, query engines.

Will AI replace data engineers?

No - but it raises the floor. AI now writes the queries, the glue code and the boilerplate pipelines, so those stop being a differentiator. What it can't do is reason about the system: why a scan costs what it costs, what happens when two writers commit at once, why a job spilled to disk. Engineers who understand that depth direct AI instead of competing with it - which is the whole premise of this roadmap.

What's the difference between a junior and a senior data engineer?

Not different areas - the same areas, at more depth. A junior knows what Parquet is and can partition a table; a senior reasons about row groups, predicate pushdown, encoding and the small-file problem. A junior writes a Spark job; a senior debugs its shuffle and skew. The roadmap covers both levels in every area.

Is this roadmap free?

Yes. The entire roadmap is free to read. Petascale Labs is where you can practice the depth hands-on, but you don't need an account to use the map.