Petascale Labs
Topic

Apache Spark

Apache Spark shows up across 4 courses in 2 layers of the data platform stack. Here's where it's taught, a free way to practice it, and what to learn next.

Where it's taught

⚡ Compute Engines

Apache Spark: Fundamentals

RDDs, DataFrames, Spark SQL, joins, window functions, and production batch pipelines.

20 ch · 6h 40m

1 free

Apache Spark: Advanced Internals

DAG scheduler, shuffle mechanics, Tungsten, Catalyst, AQE, data skew, and Delta Lake.

20 ch · 6h 50m

1 free

🔐 PII & Data Governance

PySpark PII Detection

Detect PII at scale with PySpark regex and Microsoft Presidio: patterns, UDF performance, confidence scoring, and a real scan job.

7 ch · 2h 50m

1 free

Capstone: End-to-End PII Pipeline

Ship the full pipeline: raw -> detect -> mask -> govern -> store in Iceberg, then handle a complete GDPR erasure cycle end to end.

8 ch · 3h 35m

1 free

Related topics

PII detection, adaptive query execution, Apache Iceberg, broadcast variables, Catalyst optimizer, data governance, data masking, data scanning, data skew, DataFrames

Start learning Apache Spark free

The first chapter of every course is free to read — no account needed.

Start: Apache Spark: Fundamentals →
Email

hello@petascalelabs.com

Support

support@petascalelabs.com

Company

Petascale Labs, Inc.

© 2026 Petascale Labs, Inc. All rights reserved.