Module: Foundations | Duration: 20 min read | Lesson: 1 of 10

After three courses on Kafka and stream processing, TheWorldShop's team has streaming on the brain. So when they need to pull data from a partner's REST API, a third-party SaaS (the payment processor's dashboard export), and an internal Postgres for nightly analytics, an engineer proposes building a streaming CDC pipeline for each. Weeks of work, three new always-on pipelines, to move data that the business only looks at once a day.

A senior data engineer pushes back: "The partner API updates twice a day. The SaaS only offers a daily export. Analytics runs at 6am. Why are you building real-time pipelines for data nobody consumes in real time?" A scheduled batch job, a few hours of work, would cover all three sources, and it would be simpler, cheaper, and more robust to the partner's flaky API.

The streaming hype makes "batch" sound obsolete. It isn't. Batch ingestion is the right tool for a huge class of problems, and it is the often-skipped half of this stratum. This course covers batch done well, and it opens with when batch beats streaming.

2. Concept Explanation

Batch Is Not the Past

The streaming narrative frames batch as legacy. In reality, most data integration in production is still batch, and correctly so. Batch ingestion means: on a schedule (hourly, daily), extract data from a source, load it into a destination, done until the next run. No always-on pipeline, no per-event processing.

The question isn't "batch or streaming?" as a tribal identity; it's "what do this data's source and consumption pattern actually require?" Many sources and consumers are fundamentally batch-shaped, and forcing them into streaming adds cost and fragility for no benefit.

When Batch Is the Right Choice

Batch wins when:

The source is batch-shaped. A SaaS that only offers a daily CSV export, a partner API with rate limits, a database you can only query off-hours: these are batch sources. You can't stream what's only available in bulk on a schedule.
The consumer is batch-shaped. Nightly analytics, daily reports, weekly model retraining, monthly billing. If the result is consumed on a schedule, real-time ingestion buys nothing.
Latency tolerance is hours. If "data fresh as of this morning" is fine, a daily batch is simpler and cheaper than an always-on stream.
You want simplicity and robustness. A batch job that runs, either succeeds or fails as a unit, and can be re-run is operationally simpler than a 24/7 pipeline. It's easier to reason about, test, and recover from failure (just re-run it).
Cost matters and volume is periodic. A batch job uses resources only while running; a streaming pipeline runs always. For periodic data, batch is far cheaper.

When Streaming Is the Right Choice

Streaming wins when (the previous courses' territory):

The source is event-shaped (a database changelog, a message queue, clickstream).
The consumer needs low latency (fraud, real-time dashboards, alerting).
The data is continuous and high-volume and you want to process as it arrives.

The Honest Framing: Latency vs Simplicity

The core tradeoff is freshness against simplicity/cost:

Streaming buys freshness (seconds) at the cost of always-on complexity.
Batch accepts staleness (hours) for operational simplicity and lower cost.

There's no virtue in freshness the business doesn't use. Building a streaming pipeline for daily-consumed data is paying the always-on tax for latency nobody needs, the inverse of the "do we really need real-time?" interrogation from the stream-processor course. The senior move is to match ingestion cadence to consumption cadence.
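One way to make "match ingestion cadence to consumption cadence" concrete is to compare the worst-case staleness a schedule produces with what the consumer tolerates. The sketch below is illustrative only: the function names and the example numbers are assumptions, not part of the lab.

```python
from datetime import timedelta


def worst_case_staleness(sync_interval: timedelta, sync_duration: timedelta) -> timedelta:
    """Oldest the destination data can be just before the next sync lands.

    A row that changes right after an extract starts is missed by that run.
    It waits one full interval for the next extract, plus that run's
    duration, before it becomes visible in the destination.
    """
    return sync_interval + sync_duration


def schedule_is_sufficient(
    sync_interval: timedelta, sync_duration: timedelta, tolerance: timedelta
) -> bool:
    """True when the schedule keeps data within the consumer's tolerance."""
    return worst_case_staleness(sync_interval, sync_duration) <= tolerance


# Assumed numbers: a twice-daily sync that takes 10 minutes,
# feeding a consumer that is happy with data up to 24 hours old.
print(schedule_is_sufficient(
    timedelta(hours=12), timedelta(minutes=10), timedelta(hours=24)
))

# Assumed numbers: an hourly sync feeding a consumer that needs 5-second freshness.
print(schedule_is_sufficient(
    timedelta(hours=1), timedelta(minutes=5), timedelta(seconds=5)
))
```

The first check passes and the second fails. When the check passes with a scheduled job, tightening the cadence further buys freshness that no consumer reads.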

Batch and Streaming Converge (Foreshadowing)

The hard line between batch and streaming is blurring (Flink's "batch is a bounded stream," lakehouse tables that serve both). Modern architectures often run both: streaming for the fresh tail, batch for bulk loads and backfills, reconciled into the same tables. This course's second half (backfills) and Track E (the platform) are about making them coexist. For now: batch is a first-class, often-better choice, not a fallback.
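As a rough illustration of "reconciled into the same tables", one common approach is to upsert rows by key and keep whichever version is newest, regardless of which path delivered it. The row shape, the `updated_at` field, and the `reconcile` helper are hypothetical, chosen only for this sketch; later lessons cover the real mechanics.

```python
def reconcile(table: dict, rows: list) -> dict:
    """Upsert rows by id; the version with the newest updated_at wins."""
    for row in rows:
        current = table.get(row["id"])
        if current is None or row["updated_at"] > current["updated_at"]:
            table[row["id"]] = row
    return table


# Hypothetical data: a bulk batch load and a fresher streaming tail.
batch_load = [
    {"id": 1, "status": "placed", "updated_at": 100},
    {"id": 2, "status": "placed", "updated_at": 105},
]
stream_tail = [
    {"id": 1, "status": "shipped", "updated_at": 130},
]

table = {}
reconcile(table, batch_load)
reconcile(table, stream_tail)

# Applying the two paths in the opposite order gives the same table,
# which is what lets a late backfill run without clobbering fresher rows.
reverse = {}
reconcile(reverse, stream_tail)
reconcile(reverse, batch_load)
```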

Aha: There's no virtue in freshness the business doesn't consume. Building a streaming pipeline for data that's read once a day pays the always-on complexity-and-cost tax for latency nobody uses, the exact inverse of over-claiming "real-time." Batch isn't legacy; it's the right tool when the source or the consumer is batch-shaped, and most data integration genuinely is. Match ingestion cadence to consumption cadence, and you'll reach for batch far more often than the streaming hype suggests.

3. Worked Example

Decide batch vs streaming for TheWorldShop's three sources.

Bring up the lab. This course uses a batch-ingestion harness (a connector framework, sources, and a warehouse destination). Clone the lab repo (shared) and start it:

git clone https://github.com/petascalelabs/petascalelabs-lab-setup.git
cd petascalelabs-lab-setup/ingestion-and-transport/batch-ingestion-and-backfills/batch-ingestion-patterns/
./scripts/setup.sh

Verify the sources, the connector runtime, and the destination warehouse are reachable:

./scripts/verify.sh
# expected: "Postgres source seeded, mock partner REST API on :9100, sample SaaS CSV export, Airbyte-style runner ready, warehouse (DuckDB/Postgres) ready"

If you want help bringing up the lab, you can give an AI assistant a prompt like this one:

You are helping me run the lab for the "Batch Ingestion Patterns" course.
The lab is in
petascalelabs-lab-setup/ingestion-and-transport/batch-ingestion-and-backfills/batch-ingestion-patterns/
and includes:
  - docker-compose.yml: a Postgres source, a mock partner REST API, a sample SaaS CSV export, a connector runner (Airbyte/Singer-style), and a destination warehouse
  - scripts/setup.sh, scripts/verify.sh, scripts/teardown.sh, scripts/run-sync.sh

My environment:
  OS: <fill in>
  RAM: <fill in GB>

Walk me through:
1. Bringing up the sources, runner, and warehouse.
2. Running a sample sync from the Postgres source to the warehouse.
3. Where to see the synced data and the connector's state.
4. The teardown command.

Do not assume my OS; ask if unclear.

Step 1, the decision worksheet:

./scripts/decide-batch-vs-stream.sh

source             shape           consumer           latency need  -> choice
partner REST API   batch (2x/day)  nightly analytics  hours         BATCH
SaaS CSV export    batch (daily)   daily report       hours         BATCH (no other option)
internal Postgres  event (WAL)     real-time fraud    seconds       STREAMING (CDC)
internal Postgres  event (WAL)     nightly warehouse  hours         BATCH (don't need CDC here)

The same Postgres feeds both: streaming CDC for fraud, batch for the nightly warehouse. Cadence matched to consumption.
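The rule behind the worksheet can be written down in a few lines. This is a sketch of the reasoning, not the contents of `decide-batch-vs-stream.sh`; the `Need` class and the `choose` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Need:
    source: str
    source_shape: str   # "batch" or "event"
    consumer: str
    latency_need: str   # "seconds" or "hours"


def choose(need: Need) -> str:
    """Stream only when the source emits events AND the consumer needs seconds.

    A batch-shaped source stays BATCH whatever the consumer wants:
    data that only arrives in bulk on a schedule cannot be streamed.
    """
    if need.source_shape == "event" and need.latency_need == "seconds":
        return "STREAMING"
    return "BATCH"


# The four rows of the worksheet above.
WORKSHEET = [
    Need("partner REST API", "batch", "nightly analytics", "hours"),
    Need("SaaS CSV export", "batch", "daily report", "hours"),
    Need("internal Postgres", "event", "real-time fraud", "seconds"),
    Need("internal Postgres", "event", "nightly warehouse", "hours"),
]

for need in WORKSHEET:
    print(f"{need.source:<18} {need.consumer:<18} -> {choose(need)}")
```

Note that the decision is made per consumer, not per source, which is why the same Postgres database appears twice with different answers.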

Step 2, the cost contrast:

./scripts/cost-compare.sh --source partner-api
# streaming pipeline: always-on runner + Kafka + 24/7 ops  ~$X/mo
# batch job (2x/day):  runs minutes, twice daily            ~$X/20 /mo
# for 2x/day data, streaming pays the always-on tax for unused freshness
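The "always-on tax" is easiest to see as compute hours. The arithmetic below assumes a 30-day month and 10-minute batch runs, both made up for illustration. It counts compute hours only, so its ratio is not meant to match the script's dollar figures, which also fold in Kafka and 24/7 ops.

```python
HOURS_PER_MONTH = 24 * 30  # 720, assuming a 30-day month


def monthly_hours_always_on() -> float:
    """A streaming runner is up every hour of the month."""
    return float(HOURS_PER_MONTH)


def monthly_hours_batch(runs_per_day: int, minutes_per_run: float) -> float:
    """A batch job consumes compute only while a run is in progress."""
    return runs_per_day * 30 * minutes_per_run / 60


always_on = monthly_hours_always_on()
batch = monthly_hours_batch(runs_per_day=2, minutes_per_run=10)  # assumed run length

print(f"always-on: {always_on:.0f} h/mo, batch: {batch:.0f} h/mo, "
      f"ratio: {always_on / batch:.0f}x")
# prints: always-on: 720 h/mo, batch: 10 h/mo, ratio: 72x
```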

Step 3, run a batch sync (the simple, robust path):

./scripts/run-sync.sh --source partner-api --dest warehouse.partner_orders
# extracts, loads, exits. Re-runnable. Fails atomically. No always-on pipeline to babysit.
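"Re-runnable" and "fails atomically" are properties a batch load has to be designed for. A minimal sketch of one way to get both, using Python's built-in `sqlite3` as a stand-in for the warehouse: replace one day's rows inside a single transaction. The column names and the `load_batch` helper are assumptions, and this is not what `run-sync.sh` does internally.

```python
import sqlite3


def load_batch(conn: sqlite3.Connection, batch_date: str, rows: list) -> None:
    """Replace one day's rows in a single transaction.

    `with conn:` commits if the block succeeds and rolls back if anything
    raises, so a failed run leaves the previous contents untouched.
    Deleting before inserting makes a re-run converge to the same state.
    """
    with conn:
        conn.execute("DELETE FROM partner_orders WHERE batch_date = ?", (batch_date,))
        conn.executemany(
            "INSERT INTO partner_orders (order_id, batch_date, amount) VALUES (?, ?, ?)",
            [(order_id, batch_date, amount) for order_id, amount in rows],
        )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE partner_orders (order_id TEXT PRIMARY KEY, batch_date TEXT, amount REAL)"
)

rows = [("o-1", 10.0), ("o-2", 25.5)]  # hypothetical extracted rows
load_batch(conn, "2024-01-01", rows)
load_batch(conn, "2024-01-01", rows)  # re-run: same end state, no duplicates

count = conn.execute("SELECT COUNT(*) FROM partner_orders").fetchone()[0]
print(count)
```

Running the load twice leaves two rows, not four. A run that fails partway through rolls back the delete as well, so the table never ends up half-loaded.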

Step 4, robustness, just re-run:

./scripts/break/partner-api-flaky.sh   # the partner API errors intermittently
./scripts/run-sync.sh --source partner-api --retry
# the batch job retries / can be safely re-run; contrast a 24/7 stream wrestling a flaky source
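What the `--retry` flag does internally isn't shown here. One common shape for retrying a batch run against a flaky source is exponential backoff, sketched below; `run_with_retry` and `flaky_sync` are hypothetical names, and the flaky API is simulated.

```python
import time


def run_with_retry(sync, attempts: int = 4, base_delay: float = 1.0, sleep=time.sleep):
    """Call sync(); on ConnectionError wait base_delay * 2**n seconds and retry."""
    for attempt in range(attempts):
        try:
            return sync()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure to the scheduler
            sleep(base_delay * 2 ** attempt)


# Stand-in for the flaky partner API: fails twice, then succeeds.
calls = {"n": 0}


def flaky_sync():
    calls["n"] += 1
    if calls["n"] <= 2:
        raise ConnectionError("partner API unavailable")
    return "synced"


delays = []
result = run_with_retry(flaky_sync, sleep=delays.append)  # record delays instead of sleeping
print(result, delays)
# prints: synced [1.0, 2.0]
```

Retrying the whole run is only safe because the run itself is re-runnable. A bounded job can start over from a known point, while an always-on stream has to recover in place while new data keeps arriving.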

4. Your Turn

Exercise: TheWorldShop has four data needs: (a) a marketing SaaS that only offers a daily CSV export, fed to a weekly report; (b) the orders database, feeding both real-time fraud and a nightly warehouse load; (c) a partner inventory API, rate-limited, consulted hourly by a pricing job; (d) clickstream for a live dashboard.

For each, choose batch or streaming and justify by source shape and consumer cadence.
For (b), explain why the same source can correctly use both batch and streaming.
The team wants to build streaming CDC for the marketing SaaS (a). Explain why that's wrong on source-shape grounds alone.
State the core tradeoff batch accepts and what it buys in return.
Give one robustness advantage a re-runnable batch job has over an always-on streaming pipeline against a flaky source.

5. Real-World Application

Most production data integration is still batch. The ELT connector tools that dominate the modern data stack (Fivetran, Airbyte, Stitch) are fundamentally batch, built around scheduled syncs, because the bulk of analytical data is consumed on schedules and many sources are bulk-export-only.

The "stream everything" over-engineering trap is common: teams fresh off streaming projects build always-on pipelines for daily-consumed data, paying the operational tax for unused freshness. Matching ingestion cadence to consumption cadence is the discipline that prevents it.

Batch and streaming coexist in mature platforms. The orders database feeding both real-time CDC (fraud) and nightly batch (warehouse) is a textbook pattern. The architecture isn't batch or streaming; it's the right cadence per consumer, which Track E's platform formalizes.

6. Recap + Bridge

Batch ingestion (scheduled extract-and-load, no always-on pipeline) is the right tool, not legacy, for a huge class of problems: batch-shaped sources (bulk exports, rate-limited APIs), batch-shaped consumers (nightly analytics, reports), and hours-latency tolerance, where it's simpler, cheaper, and more robust than streaming. Match ingestion cadence to consumption cadence; there's no virtue in unused freshness. Batch and streaming coexist, often on the same source.

Batch ingestion is dominated by a connector model you don't build from scratch. Next: the EL(T) connector model (Airbyte, Singer, Meltano, and Fivetran) and how these tools standardize moving data between hundreds of sources and destinations.