Write a PySpark transform once. Iterate locally for free. Ship to AWS Glue 4.0 via CodeBuild. Stop paying DPU-seconds to debug typos.
Message a mentor about fit, prerequisites, or where to start. Replies come on WhatsApp, usually within a day.
Take a 336K-row flight dataset from a raw CSV in S3 to a curated Parquet warehouse. Write a PySpark transform once, run it locally without AWS credits, and ship the same code to AWS Glue 4.0 via CodeBuild. The first cloud course in the learnwithparam data engineering track.
Write a PySpark transform that runs identically locally and on AWS Glue 4.0. Ship via CodeBuild, validate with a local smoke run, and skip the surprise DPU bills.
Curriculum
Why AWS Glue
Frame the shape of a real ETL job, decide when Glue is the right primitive over Lambda or EMR, and meet the bundled flight dataset that exercises every transformation you will need.
The PySpark transform
Write the heart of the job: typed reads, safe null handling, derived columns, and window aggregations that compute carrier and route metrics without losing rows.
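A minimal sketch of that style, assuming the bundled CSV follows the nycflights13 layout; the column names (carrier, origin, dest, dep_delay) are placeholders for whatever the dataset actually ships:

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("flights-etl").getOrCreate()

# Typed read: an explicit schema instead of inferSchema, so bad values
# surface as visible nulls rather than silently widened types.
schema = StructType([
    StructField("carrier", StringType()),
    StructField("origin", StringType()),
    StructField("dest", StringType()),
    StructField("dep_delay", IntegerType()),
])
flights = spark.read.csv("data/flights.csv", header=True, schema=schema)

# Safe null handling: default missing delays to 0 instead of dropping rows.
flights = flights.withColumn("dep_delay", F.coalesce(F.col("dep_delay"), F.lit(0)))

# Derived column plus a window aggregation: the per-carrier average delay is
# attached to every row, so nothing is lost to a groupBy.
flights = (
    flights
    .withColumn("route", F.concat_ws("-", "origin", "dest"))
    .withColumn("carrier_avg_delay",
                F.avg("dep_delay").over(Window.partitionBy("carrier")))
)
```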
Local versus Glue runtime
Split the transform from the entry points so the same code runs locally and inside AWS Glue. Iterate at zero cost. Skip the DPU-second tax.
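A sketch of the split, assuming a transform.py that exposes a pure run_transform(df) function. The file and argument names are illustrative, and the awsglue imports only resolve inside the Glue runtime (or a local aws-glue-libs install):

```python
# transform.py exposes a pure function: run_transform(df) -> DataFrame.

# --- local_main.py: iterate on a laptop with a plain SparkSession ---
from pyspark.sql import SparkSession
from transform import run_transform

spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.read.csv("data/flights.csv", header=True, inferSchema=True)
run_transform(df).write.mode("overwrite").parquet("out/flights")

# --- glue_main.py: the same transform behind Glue's entry point ---
import sys
from awsglue.utils import getResolvedOptions  # available only in the Glue runtime
from awsglue.context import GlueContext
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME", "input_path", "output_path"])
spark = GlueContext(SparkContext.getOrCreate()).spark_session
df = spark.read.csv(args["input_path"], header=True)
run_transform(df).write.mode("overwrite").parquet(args["output_path"])
```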
Packaging and CI/CD
Encode the Glue job in JSON, ship it through CodeBuild on every push, and wire IAM so your first run lands without a morning of debugging.
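One way the CodeBuild step could apply that JSON, sketched with the real boto3 Glue calls (get_job, update_job, create_job); the glue_job.json file name is a placeholder:

```python
import json
import boto3

glue = boto3.client("glue")

# glue_job.json lives in the repo: Name, Role, Command {Name: "glueetl",
# ScriptLocation, PythonVersion}, GlueVersion "4.0", and worker settings.
with open("glue_job.json") as f:
    cfg = json.load(f)

try:
    glue.get_job(JobName=cfg["Name"])
    # Job exists: push the new definition. JobUpdate must not carry Name.
    glue.update_job(JobName=cfg["Name"],
                    JobUpdate={k: v for k, v in cfg.items() if k != "Name"})
except glue.exceptions.EntityNotFoundException:
    glue.create_job(**cfg)
```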
Validating outputs and the capstone
Round-trip Parquet through pyarrow, watch the job in CloudWatch, then wire the whole pipeline end-to-end as a capstone.
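A minimal smoke check of the curated output, assuming the local run wrote Parquet to out/flights and produced the derived column from the earlier sketch:

```python
import pyarrow.parquet as pq

table = pq.read_table("out/flights")  # reads the whole Parquet directory

assert table.num_rows > 0, "empty output"
assert "carrier_avg_delay" in table.schema.names, "derived column missing"
print(f"{table.num_rows} rows, {len(table.schema.names)} columns")
```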
Who it's for
inheriting a Glue job they cannot debug locally because the Glue runtime is opaque
tired of paying DPU-seconds to discover a typo and waiting 90 seconds per iteration
wanting to graduate from notebooks to a real PySpark job that ships through CI/CD
asked to host AWS Glue jobs and needing a reference pattern they can teach the team
FAQ
Do I need an AWS account to follow along?
No. The whole transform runs locally against the bundled flights CSV. AWS is the deployment target, not the development target. The course shows you the AWS Console, IAM setup, and CodeBuild pipeline so when you do hit "Apply" on real infrastructure, the path is mechanical.
Will my local PySpark behave the same as Glue 4.0?
AWS Glue 4.0 ships Spark 3.3. The course uses PySpark 3.5 locally. Every API used (window functions, groupBy, Parquet writes, casting) is identical across both. The course flags the few places where Glue 4.0 differs and shows the workaround.
Why Glue instead of dbt or EMR?
dbt is for SQL transforms in a warehouse. EMR is cluster-managed Spark for long-running, stateful jobs. Glue sits between them: managed Spark for short batch ETL with a pay-per-DPU billing model. The course opens with a "pick the right primitive" lesson so you know when each fits.
What makes the flight dataset a good teaching set?
The flights CSV is 336K rows of real-world dirty airline data: nulls, timestamps that need coercion, duplicate flight numbers, and join keys spread across columns. It is small enough to run on a laptop and big enough to expose the bugs schema autodetection would never catch.
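A quick probe that surfaces that dirt, again assuming nycflights13-style column names (dep_time, time_hour, flight):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
raw = spark.read.csv("data/flights.csv", header=True)  # every column lands as a string

# Nulls and failed timestamp coercion both show up as nulls after a cast.
raw.select(
    F.count(F.when(F.col("dep_time").isNull(), 1)).alias("null_dep_time"),
    F.count(F.when(F.to_timestamp("time_hour").isNull(), 1)).alias("bad_timestamps"),
).show()

# Duplicate flight numbers: flight alone is not a join key.
raw.groupBy("flight").count().filter(F.col("count") > 1).show(5)
```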
Pricing
One subscription unlocks every paid course and workshop replay. Pick yearly or monthly.
Unlock with Pro
You save 47% with regional pricing
Billed annually. Cancel anytime.
Still deciding? Ask Param a question
AWS Glue and PySpark ETL on a real flight dataset
From $16/mo with Pro