Write a PySpark transform once. Iterate locally for free. Ship to AWS Glue 4.0 via CodeBuild. Stop paying DPU-seconds to debug typos.
Message a mentor about fit, prerequisites, or where to start. Replies come on WhatsApp, usually within a day.
Take a 336K-row flight dataset from a raw CSV in S3 to a curated Parquet warehouse. Write a PySpark transform once, run it locally without AWS credits, and ship the same code to AWS Glue 4.0 via CodeBuild. The first cloud course in the learnwithparam data engineering track.
Write a PySpark transform that runs identically locally and on AWS Glue 4.0. Ship via CodeBuild, validate with a local smoke run, and skip the surprise DPU bills.
Curriculum
Why AWS Glue
Frame the shape of a real ETL job, decide when Glue is the right primitive over Lambda or EMR, and meet the bundled flight dataset that exercises every transformation you will need.
The PySpark transform
Write the heart of the job: typed reads, safe null handling, derived columns, and window aggregations that compute carrier and route metrics without losing rows.
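A minimal sketch of that style, assuming the bundled CSV follows the nycflights13 layout; the column names (carrier, origin, dest, dep_delay) are placeholders for whatever the dataset actually ships:

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("flights-etl").getOrCreate()

# Typed read: an explicit schema instead of inferSchema, so bad values
# surface as visible nulls rather than silently widened types.
schema = StructType([
    StructField("carrier", StringType()),
    StructField("origin", StringType()),
    StructField("dest", StringType()),
    StructField("dep_delay", IntegerType()),
])
flights = spark.read.csv("data/flights.csv", header=True, schema=schema)

# Safe null handling: default missing delays to 0 instead of dropping rows.
flights = flights.withColumn("dep_delay", F.coalesce(F.col("dep_delay"), F.lit(0)))

# Derived column plus a window aggregation: the per-carrier average delay is
# attached to every row, so nothing is lost to a groupBy.
flights = (
    flights
    .withColumn("route", F.concat_ws("-", "origin", "dest"))
    .withColumn("carrier_avg_delay",
                F.avg("dep_delay").over(Window.partitionBy("carrier")))
)
```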
Local versus Glue runtime
Split the transform from the entry points so the same code runs locally and inside AWS Glue. Iterate at zero cost. Skip the DPU-second tax.
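A sketch of the split, assuming a transform.py that exposes a pure run_transform(df) function. The file and argument names are illustrative, and the awsglue imports only resolve inside the Glue runtime (or a local aws-glue-libs install):

```python
# transform.py exposes a pure function: run_transform(df) -> DataFrame.

# --- local_main.py: iterate on a laptop with a plain SparkSession ---
from pyspark.sql import SparkSession
from transform import run_transform

spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.read.csv("data/flights.csv", header=True, inferSchema=True)
run_transform(df).write.mode("overwrite").parquet("out/flights")

# --- glue_main.py: the same transform behind Glue's entry point ---
import sys
from awsglue.utils import getResolvedOptions  # available only in the Glue runtime
from awsglue.context import GlueContext
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME", "input_path", "output_path"])
spark = GlueContext(SparkContext.getOrCreate()).spark_session
df = spark.read.csv(args["input_path"], header=True)
run_transform(df).write.mode("overwrite").parquet(args["output_path"])
```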
Packaging and CI/CD
Encode the Glue job in JSON, ship it through CodeBuild on every push, and wire IAM so your first run lands without a morning of debugging.
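One way the CodeBuild step could apply that JSON, sketched with the real boto3 Glue calls (get_job, update_job, create_job); the glue_job.json file name is a placeholder:

```python
import json
import boto3

glue = boto3.client("glue")

# glue_job.json lives in the repo: Name, Role, Command {Name: "glueetl",
# ScriptLocation, PythonVersion}, GlueVersion "4.0", and worker settings.
with open("glue_job.json") as f:
    cfg = json.load(f)

try:
    glue.get_job(JobName=cfg["Name"])
    # Job exists: push the new definition. JobUpdate must not carry Name.
    glue.update_job(JobName=cfg["Name"],
                    JobUpdate={k: v for k, v in cfg.items() if k != "Name"})
except glue.exceptions.EntityNotFoundException:
    glue.create_job(**cfg)
```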
Validating outputs and the capstone
Round-trip Parquet through pyarrow, watch the job in CloudWatch, then wire the whole pipeline end-to-end as a capstone.
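A minimal smoke check of the curated output, assuming the local run wrote Parquet to out/flights and produced the derived column from the earlier sketch:

```python
import pyarrow.parquet as pq

table = pq.read_table("out/flights")  # reads the whole Parquet directory

assert table.num_rows > 0, "empty output"
assert "carrier_avg_delay" in table.schema.names, "derived column missing"
print(f"{table.num_rows} rows, {len(table.schema.names)} columns")
```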
Who it's for
inheriting a Glue job they cannot debug locally because the Glue runtime is opaque
tired of paying DPU-seconds to discover a typo and waiting 90 seconds per iteration
wanting to graduate from notebooks to a real PySpark job that ships through CI/CD
asked to host AWS Glue jobs and needing a reference pattern they can teach the team
FAQ
Do I need an AWS account to follow along?
No. The whole transform runs locally against the bundled flights CSV. AWS is the deployment target, not the development target. The course shows you the AWS Console, IAM setup, and CodeBuild pipeline so when you do hit "Apply" on real infrastructure, the path is mechanical.
Will my local PySpark behave the same as Glue 4.0?
AWS Glue 4.0 ships Spark 3.3. The course uses PySpark 3.5 locally. Every API used (window functions, groupBy, Parquet writes, casting) is identical across both. The course flags the few places where Glue 4.0 differs and shows the workaround.
Why Glue instead of dbt or EMR?
dbt is for SQL transforms in a warehouse. EMR is cluster-managed Spark for long-running, stateful jobs. Glue sits between them: managed Spark for short batch ETL with a pay-per-DPU billing model. The course opens with a "pick the right primitive" lesson so you know when each fits.
What makes the flight dataset a good teaching set?
The flights CSV is 336K rows of real-world dirty airline data: nulls, timestamps that need coercion, duplicate flight numbers, and join keys spread across columns. It is small enough to run on a laptop and big enough to expose the bugs schema autodetection would never catch.
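A quick probe that surfaces that dirt, again assuming nycflights13-style column names (dep_time, time_hour, flight):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
raw = spark.read.csv("data/flights.csv", header=True)  # every column lands as a string

# Nulls and failed timestamp coercion both show up as nulls after a cast.
raw.select(
    F.count(F.when(F.col("dep_time").isNull(), 1)).alias("null_dep_time"),
    F.count(F.when(F.to_timestamp("time_hour").isNull(), 1)).alias("bad_timestamps"),
).show()

# Duplicate flight numbers: flight alone is not a join key.
raw.groupBy("flight").count().filter(F.col("count") > 1).show(5)
```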
Pricing
One subscription unlocks every paid course and workshop replay. Pick yearly or monthly.
Unlock with Pro
You save 47% with regional pricing
Billed annually. Cancel anytime.
Still deciding? Ask Param a question
AWS Glue and PySpark ETL on a real flight dataset
From $16/mo with Pro