Stop copying data nightly between S3 and Snowflake. Write Iceberg from Glue, read it from Snowflake, and let the warehouse and lake share one storage layer.
Message a mentor about fit, prerequisites, or where to start. Replies come on WhatsApp, usually within a day.
Build a full lakehouse on AWS: Lambda extracts a weather API to S3, Glue PySpark writes Apache Iceberg tables in the Glue Data Catalog, Snowflake reads them as external Iceberg tables, Airflow orchestrates the daily refresh, and Terraform provisions everything. The lakehouse stops being an architecture diagram and starts being your runtime.
Lambda to S3 to Glue PySpark to Iceberg in the Glue Data Catalog, queried by Snowflake as external tables. Airflow orchestrates. Terraform provisions. CodeBuild ships.
What you'll ship
What you'll learn
Curriculum
The lakehouse mental model
Map the architecture choice, defend Iceberg over plain Parquet, and trace exactly how Snowflake reads from the Glue Data Catalog without copying data.
Lambda extract to S3 raw
Pull data from a real API into S3 with retries, structured errors, and a raw layout that downstream Glue jobs will not fight you over.
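A minimal sketch of that handler, assuming placeholder names for the API URL, bucket, and environment variables (the course wires in its own):

```python
import os
from datetime import datetime, timezone

import boto3
import urllib3

# Placeholder names: the real API URL and bucket arrive through
# environment variables set by Terraform.
API_URL = os.environ.get("WEATHER_API_URL", "https://api.example.com/weather")
BUCKET = os.environ.get("RAW_BUCKET", "my-lakehouse-raw")

# Retries with backoff handled at the HTTP layer.
http = urllib3.PoolManager(retries=urllib3.Retry(total=3, backoff_factor=1))
s3 = boto3.client("s3")


def handler(event, context):
    resp = http.request("GET", API_URL)
    if resp.status != 200:
        # Raise a structured error so the orchestrator sees a clean failure.
        raise RuntimeError(f"extract failed: HTTP {resp.status} from {API_URL}")

    now = datetime.now(timezone.utc)
    # Date-partitioned raw layout: downstream Glue jobs prune by ingest date.
    key = f"raw/weather/dt={now:%Y-%m-%d}/{now:%H%M%S}.json"
    s3.put_object(Bucket=BUCKET, Key=key, Body=resp.data,
                  ContentType="application/json")
    return {"status": "ok", "s3_key": key}
```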
Glue PySpark to Iceberg
Configure the Iceberg catalog in Spark, write tables that Snowflake can read, partition for query patterns, and handle schema evolution and small files.
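The Spark-side catalog wiring looks roughly like this; the catalog name, warehouse bucket, table name, and `obs_date` column are illustrative stand-ins:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Catalog name, warehouse bucket, and column names are illustrative.
spark = (
    SparkSession.builder
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.glue", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue.catalog-impl",
            "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue.io-impl",
            "org.apache.iceberg.aws.s3.S3FileIO")
    .config("spark.sql.catalog.glue.warehouse", "s3://my-lakehouse-warehouse/")
    .getOrCreate()
)

# Raw JSON written by the Lambda extract; assumes an obs_date column exists.
df = spark.read.json("s3://my-lakehouse-raw/raw/weather/")

# Partition to match query patterns so both engines prune files.
(df.writeTo("glue.analytics.weather")
   .partitionedBy(col("obs_date"))
   .createOrReplace())
```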
Snowflake external Iceberg tables
Wire Catalog Integration plus External Volume in Snowflake, register the tables, and verify both engines see consistent rows.
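A sketch of that wire-up through the Python connector; every identifier, ARN, and credential below is a placeholder you replace with your own:

```python
import snowflake.connector

# Every identifier, ARN, and credential here is a placeholder.
conn = snowflake.connector.connect(
    account="myorg-myaccount", user="deployer", password="...",
    role="SYSADMIN", warehouse="COMPUTE_WH",
    database="ANALYTICS_DB", schema="PUBLIC",
)
cur = conn.cursor()

# Catalog integration: Snowflake reads table metadata from the Glue Data Catalog.
cur.execute("""
CREATE CATALOG INTEGRATION IF NOT EXISTS glue_iceberg_int
  CATALOG_SOURCE = GLUE
  CATALOG_NAMESPACE = 'analytics'
  TABLE_FORMAT = ICEBERG
  GLUE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/snowflake-glue-access'
  GLUE_CATALOG_ID = '123456789012'
  ENABLED = TRUE
""")

# External volume: read access to the Iceberg data files in S3.
cur.execute("""
CREATE EXTERNAL VOLUME IF NOT EXISTS lakehouse_vol
  STORAGE_LOCATIONS = ((
    NAME = 'warehouse'
    STORAGE_PROVIDER = 'S3'
    STORAGE_BASE_URL = 's3://my-lakehouse-warehouse/'
    STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/snowflake-s3-access'
  ))
""")

# Register the Glue-managed table; metadata and data stay in place, nothing copies.
cur.execute("""
CREATE ICEBERG TABLE IF NOT EXISTS weather
  EXTERNAL_VOLUME = 'lakehouse_vol'
  CATALOG = 'glue_iceberg_int'
  CATALOG_TABLE_NAME = 'weather'
""")
```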
Airflow orchestration
Run the Lambda extract, fan out to multiple Glue runs, force a Snowflake refresh, and handle failures gracefully.
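A sketch of the DAG shape using the standard Amazon and SQL provider operators; the DAG id, function and job names, and connection id are assumptions:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.glue import GlueJobOperator
from airflow.providers.amazon.aws.operators.lambda_function import (
    LambdaInvokeFunctionOperator,
)
from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator

# DAG id, function and job names, and the connection id are illustrative.
with DAG("weather_lakehouse", start_date=datetime(2024, 1, 1),
         schedule="@daily", catchup=False) as dag:

    extract = LambdaInvokeFunctionOperator(
        task_id="extract_weather",
        function_name="weather-extract",
    )

    # Fan out: one Glue run per dataset, in parallel.
    transforms = [
        GlueJobOperator(task_id=f"glue_{name}", job_name=f"iceberg-{name}")
        for name in ("observations", "forecasts")
    ]

    # Point Snowflake at the newest Iceberg snapshot.
    refresh = SQLExecuteQueryOperator(
        task_id="refresh_snowflake",
        conn_id="snowflake_default",
        sql="ALTER ICEBERG TABLE weather REFRESH",
    )

    extract >> transforms >> refresh
```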
Terraform end to end
Provision S3, IAM, Glue, CodeBuild, and the Snowflake-AWS trust policy as code. Reproduce the whole stack with one command per environment.
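At the center of that trust wiring sits one small JSON document. A sketch of its shape rendered from Python, with placeholders standing in for the values Snowflake's DESC commands report:

```python
import json

# Both values come from DESC CATALOG INTEGRATION / DESC EXTERNAL VOLUME
# output in Snowflake; these are placeholders.
SNOWFLAKE_IAM_USER_ARN = "arn:aws:iam::987654321098:user/abc1-s"
SNOWFLAKE_EXTERNAL_ID = "MYORG_SFCRole=2_AbCdEf=="

trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        # Only Snowflake's account-scoped IAM user may assume the role.
        "Principal": {"AWS": SNOWFLAKE_IAM_USER_ARN},
        "Action": "sts:AssumeRole",
        # The external id pins the trust to this one integration.
        "Condition": {"StringEquals": {"sts:ExternalId": SNOWFLAKE_EXTERNAL_ID}},
    }],
}

# Feed this into the role's assume_role_policy in Terraform.
print(json.dumps(trust_policy, indent=2))
```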
Who it's for
tasked with building a "lakehouse" without a clear story for how Snowflake and Glue cooperate without doubling storage cost
who already use Snowflake and want their dbt models to read straight from raw Iceberg without nightly copy jobs
wiring Iceberg for the first time and discovering Glue, Snowflake, IAM, and Catalog Integration each have surprising defaults
asked to own analytics infra and needing a reference pattern that holds up to a real production deploy
FAQ
Plain Parquet is files in a folder. Iceberg adds a manifest layer that tracks snapshots, schema evolution, and partition pruning. Two engines reading the same Iceberg tables see consistent data even during writes. Plain Parquet does not give you that, which is why every modern lakehouse uses a table format on top.
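You can poke at that manifest layer directly. A quick sketch, reusing the Spark session and table names from the Glue example above; the snapshot id is a placeholder:

```python
# Iceberg exposes its snapshot history as a queryable metadata table.
spark.sql("""
    SELECT snapshot_id, committed_at, operation
    FROM glue.analytics.weather.snapshots
    ORDER BY committed_at
""").show(truncate=False)

# Time travel: read the table exactly as it was at an earlier snapshot
# (the snapshot id below is a placeholder).
old = spark.read.option("snapshot-id", 1234567890).table("glue.analytics.weather")
```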
Not for the local part. The smoke test, transform validation, and Terraform validate all run with zero credentials. The deployment chapters show the AWS Console and Snowflake worksheets so you understand the wire-up, but the workshop is fully runnable on a laptop until you choose to deploy.
Yes. The Iceberg layer is identical. The Glue PySpark job swaps for a Databricks notebook or job. The Snowflake side does not change. The course flags every Glue-specific detail so the port is mechanical.
They are alternative catalog implementations to the Glue Data Catalog. The course uses Glue because it is the AWS-native default and sets up cleanly with Snowflake. Polaris and Unity can each stand in as drop-in replacements, with their own IAM and integration setup.
Pricing
One subscription unlocks every paid course and workshop replay. Pick yearly or monthly.
Unlock with Pro
You save 47% with regional pricing
Billed annually. Cancel anytime.
Still deciding? Ask Param a question
AWS lakehouse with Apache Iceberg, Glue, and Snowflake
From $16/mo with Pro