Petascale Labs

Apache Spark: Fundamentals

Learn distributed computing from scratch: RDDs, DataFrames, Spark SQL, joins, window functions, and deploying production batch pipelines with Apache Spark.

Foundations · 20 chapters · 6h 40m · in Compute Engines

Course content

  1. Why Distributed Computing? (Free)
  2. Spark Architecture Deep Dive 🔒
  3. Your First Spark App 🔒
  4. RDDs: The Foundation 🔒
  5. Transformations vs Actions 🔒
  6. Key-Value RDDs: PairRDDs, Shuffles, and the groupByKey Trap 🔒
  7. RDD Persistence & Caching 🔒
  8. Broadcast Variables & Accumulators 🔒
  9. Enter DataFrames 🔒
  10. Spark SQL 🔒
  11. Data Sources: Read & Write 🔒
  12. DataFrame Transformations 🔒
  13. Joins: The Hard Part 🔒
  14. Window Functions 🔒
  15. User-Defined Functions (UDFs) 🔒
  16. Datasets: Type-Safe DataFrames 🔒
  17. Partitioning Strategy 🔒
  18. Deploying Spark Apps 🔒
  19. Monitoring & Debugging 🔒
  20. Capstone: ShopStream Batch Analytics Pipeline 🔒

What to learn next

Apache Spark: Advanced Internals (next)

Read the first chapter free

Start reading now. No account is required for the free chapters.

© 2026 Petascale Labs, Inc. All rights reserved.
