Stop copying data nightly between S3 and Snowflake. Write Iceberg from Glue, read it from Snowflake, and let the warehouse and lake share one storage layer.
Message a mentor about fit, prerequisites, or where to start. Replies come on WhatsApp, usually within a day.
Build a full lakehouse on AWS: Lambda extracts a weather API to S3, Glue PySpark writes Apache Iceberg tables in the Glue Data Catalog, Snowflake reads them as external Iceberg tables, Airflow orchestrates the daily refresh, and Terraform provisions everything. The lakehouse stops being an architecture diagram and starts being your runtime.
Lambda to S3 to Glue PySpark to Iceberg in the Glue Data Catalog, queried by Snowflake as external tables. Airflow orchestrates. Terraform provisions. CodeBuild ships.
What you'll ship
What you'll learn
Curriculum
The lakehouse mental model
Map the architecture choice, defend Iceberg over plain Parquet, and trace exactly how Snowflake reads from the Glue Data Catalog without copying data.
Lambda extract to S3 raw
Pull data from a real API into S3 with retries, structured errors, and a raw layout that downstream Glue jobs will not fight you over.
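A minimal sketch of that handler, assuming placeholder names for the API URL, bucket, and environment variables (the course wires in its own):

```python
import os
from datetime import datetime, timezone

import boto3
import urllib3

# Placeholder names: the real API URL and bucket arrive through
# environment variables set by Terraform.
API_URL = os.environ.get("WEATHER_API_URL", "https://api.example.com/weather")
BUCKET = os.environ.get("RAW_BUCKET", "my-lakehouse-raw")

# Retries with backoff handled at the HTTP layer.
http = urllib3.PoolManager(retries=urllib3.Retry(total=3, backoff_factor=1))
s3 = boto3.client("s3")


def handler(event, context):
    resp = http.request("GET", API_URL)
    if resp.status != 200:
        # Raise a structured error so the orchestrator sees a clean failure.
        raise RuntimeError(f"extract failed: HTTP {resp.status} from {API_URL}")

    now = datetime.now(timezone.utc)
    # Date-partitioned raw layout: downstream Glue jobs prune by ingest date.
    key = f"raw/weather/dt={now:%Y-%m-%d}/{now:%H%M%S}.json"
    s3.put_object(Bucket=BUCKET, Key=key, Body=resp.data,
                  ContentType="application/json")
    return {"status": "ok", "s3_key": key}
```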
Glue PySpark to Iceberg
Configure the Iceberg catalog in Spark, write tables that Snowflake can read, partition for query patterns, and handle schema evolution and small files.
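The Spark-side catalog wiring looks roughly like this; the catalog name, warehouse bucket, table name, and `obs_date` column are illustrative stand-ins:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Catalog name, warehouse bucket, and column names are illustrative.
spark = (
    SparkSession.builder
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.glue", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue.catalog-impl",
            "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue.io-impl",
            "org.apache.iceberg.aws.s3.S3FileIO")
    .config("spark.sql.catalog.glue.warehouse", "s3://my-lakehouse-warehouse/")
    .getOrCreate()
)

# Raw JSON written by the Lambda extract; assumes an obs_date column exists.
df = spark.read.json("s3://my-lakehouse-raw/raw/weather/")

# Partition to match query patterns so both engines prune files.
(df.writeTo("glue.analytics.weather")
   .partitionedBy(col("obs_date"))
   .createOrReplace())
```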
Snowflake external Iceberg tables
Wire Catalog Integration plus External Volume in Snowflake, register the tables, and verify both engines see consistent rows.
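A sketch of that wire-up through the Python connector; every identifier, ARN, and credential below is a placeholder you replace with your own:

```python
import snowflake.connector

# Every identifier, ARN, and credential here is a placeholder.
conn = snowflake.connector.connect(
    account="myorg-myaccount", user="deployer", password="...",
    role="SYSADMIN", warehouse="COMPUTE_WH",
    database="ANALYTICS_DB", schema="PUBLIC",
)
cur = conn.cursor()

# Catalog integration: Snowflake reads table metadata from the Glue Data Catalog.
cur.execute("""
CREATE CATALOG INTEGRATION IF NOT EXISTS glue_iceberg_int
  CATALOG_SOURCE = GLUE
  CATALOG_NAMESPACE = 'analytics'
  TABLE_FORMAT = ICEBERG
  GLUE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/snowflake-glue-access'
  GLUE_CATALOG_ID = '123456789012'
  ENABLED = TRUE
""")

# External volume: read access to the Iceberg data files in S3.
cur.execute("""
CREATE EXTERNAL VOLUME IF NOT EXISTS lakehouse_vol
  STORAGE_LOCATIONS = ((
    NAME = 'warehouse'
    STORAGE_PROVIDER = 'S3'
    STORAGE_BASE_URL = 's3://my-lakehouse-warehouse/'
    STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/snowflake-s3-access'
  ))
""")

# Register the Glue-managed table; metadata and data stay in place, nothing copies.
cur.execute("""
CREATE ICEBERG TABLE IF NOT EXISTS weather
  EXTERNAL_VOLUME = 'lakehouse_vol'
  CATALOG = 'glue_iceberg_int'
  CATALOG_TABLE_NAME = 'weather'
""")
```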
Airflow orchestration
Run the Lambda extract, fan out to multiple Glue runs, force a Snowflake refresh, and handle failures gracefully.
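A sketch of the DAG shape using the standard Amazon and SQL provider operators; the DAG id, function and job names, and connection id are assumptions:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.glue import GlueJobOperator
from airflow.providers.amazon.aws.operators.lambda_function import (
    LambdaInvokeFunctionOperator,
)
from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator

# DAG id, function and job names, and the connection id are illustrative.
with DAG("weather_lakehouse", start_date=datetime(2024, 1, 1),
         schedule="@daily", catchup=False) as dag:

    extract = LambdaInvokeFunctionOperator(
        task_id="extract_weather",
        function_name="weather-extract",
    )

    # Fan out: one Glue run per dataset, in parallel.
    transforms = [
        GlueJobOperator(task_id=f"glue_{name}", job_name=f"iceberg-{name}")
        for name in ("observations", "forecasts")
    ]

    # Point Snowflake at the newest Iceberg snapshot.
    refresh = SQLExecuteQueryOperator(
        task_id="refresh_snowflake",
        conn_id="snowflake_default",
        sql="ALTER ICEBERG TABLE weather REFRESH",
    )

    extract >> transforms >> refresh
```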
Terraform end to end
Provision S3, IAM, Glue, CodeBuild, and the Snowflake-AWS trust policy as code. Reproduce the whole stack with one command per environment.
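At the center of that trust wiring sits one small JSON document. A sketch of its shape rendered from Python, with placeholders standing in for the values Snowflake's DESC commands report:

```python
import json

# Both values come from DESC CATALOG INTEGRATION / DESC EXTERNAL VOLUME
# output in Snowflake; these are placeholders.
SNOWFLAKE_IAM_USER_ARN = "arn:aws:iam::987654321098:user/abc1-s"
SNOWFLAKE_EXTERNAL_ID = "MYORG_SFCRole=2_AbCdEf=="

trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        # Only Snowflake's account-scoped IAM user may assume the role.
        "Principal": {"AWS": SNOWFLAKE_IAM_USER_ARN},
        "Action": "sts:AssumeRole",
        # The external id pins the trust to this one integration.
        "Condition": {"StringEquals": {"sts:ExternalId": SNOWFLAKE_EXTERNAL_ID}},
    }],
}

# Feed this into the role's assume_role_policy in Terraform.
print(json.dumps(trust_policy, indent=2))
```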
Who it's for
tasked with building a "lakehouse" without a clear story for how Snowflake and Glue cooperate without doubling storage cost
who already use Snowflake and want their dbt models to read straight from raw Iceberg without nightly copy jobs
wiring Iceberg for the first time and discovering Glue, Snowflake, IAM, and Catalog Integration each have surprising defaults
asked to own analytics infra and needing a reference pattern that holds up to a real production deploy
FAQ
Plain Parquet is files in a folder. Iceberg adds a manifest layer that tracks snapshots, schema evolution, and partition pruning. Two engines reading the same Iceberg tables see consistent data even during writes. Plain Parquet does not give you that, which is why every modern lakehouse uses a table format on top.
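You can poke at that manifest layer directly. A quick sketch, reusing the Spark session and table names from the Glue example above; the snapshot id is a placeholder:

```python
# Iceberg exposes its snapshot history as a queryable metadata table.
spark.sql("""
    SELECT snapshot_id, committed_at, operation
    FROM glue.analytics.weather.snapshots
    ORDER BY committed_at
""").show(truncate=False)

# Time travel: read the table exactly as it was at an earlier snapshot
# (the snapshot id below is a placeholder).
old = spark.read.option("snapshot-id", 1234567890).table("glue.analytics.weather")
```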
Not for the local part. The smoke test, transform validation, and Terraform validate all run with zero credentials. The deployment chapters show the AWS Console and Snowflake worksheets so you understand the wire-up, but the workshop is fully runnable on a laptop until you choose to deploy.
Yes. The Iceberg layer is identical. The Glue PySpark job swaps for a Databricks notebook or job. The Snowflake side does not change. The course flags every Glue-specific detail so the port is mechanical.
They are alternative catalog implementations to the Glue Data Catalog. The course uses Glue because it is the AWS-native default and sets up cleanly with Snowflake. Polaris and Unity can each stand in as drop-in replacements, with their own IAM and integration setup.
Pricing
One subscription unlocks every paid course and workshop replay. Pick yearly or monthly.
Unlock with Pro
You save 47% with regional pricing
Billed annually. Cancel anytime.
Still deciding? Ask Param a question
AWS lakehouse with Apache Iceberg, Glue, and Snowflake
From $16/mo with Pro