
AWS operators (S3, Redshift, EMR) in Apache Airflow - Mini Project: Build & Apply

AWS Operators with Airflow: S3, Redshift, and EMR
📖 Scenario: You are working as a data engineer. Your team uses Apache Airflow to automate cloud tasks. You need to create a simple workflow that uploads a file to AWS S3, loads data into Redshift, and runs a job on EMR.
🎯 Goal: Build an Airflow DAG that uses AWS operators to upload a file to S3, copy data into Redshift, and start an EMR job flow.
📋 What You'll Learn
Create an Airflow DAG named aws_data_pipeline
Use S3CreateObjectOperator to upload a file to an S3 bucket
Use RedshiftSQLOperator to run a SQL COPY command in Redshift
Use EmrCreateJobFlowOperator to start an EMR cluster
Set task dependencies so the S3 upload runs before Redshift load, which runs before EMR job
💡 Why This Matters
🌍 Real World
Automating data workflows in the cloud is common in data engineering. Airflow helps schedule and manage these tasks reliably.
💼 Career
Knowing how to use AWS operators in Airflow is valuable for cloud data engineers and DevOps professionals managing ETL pipelines.
1
Create the Airflow DAG and S3 upload task
Create an Airflow DAG named aws_data_pipeline with default arguments. Inside it, create a task called upload_to_s3 using S3CreateObjectOperator to upload the string 'Hello, Airflow!' as a file named greeting.txt to the bucket my-test-bucket.
Need a hint?

Remember to import S3CreateObjectOperator and set s3_bucket, s3_key, and data exactly as specified.
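A minimal sketch of this step, assuming Airflow 2.4+ with the Amazon provider package (`apache-airflow-providers-amazon`) installed; the `start_date`, schedule, and `default_args` values are illustrative choices, not part of the spec:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.s3 import S3CreateObjectOperator

default_args = {"owner": "airflow", "retries": 1}

with DAG(
    dag_id="aws_data_pipeline",
    default_args=default_args,
    start_date=datetime(2024, 1, 1),
    schedule=None,  # trigger manually; older Airflow 2.x uses schedule_interval
    catchup=False,
) as dag:
    # Upload the literal string as greeting.txt to the target bucket
    upload_to_s3 = S3CreateObjectOperator(
        task_id="upload_to_s3",
        s3_bucket="my-test-bucket",
        s3_key="greeting.txt",
        data="Hello, Airflow!",
        aws_conn_id="aws_default",
    )
```

At runtime the operator uses the `aws_default` connection's credentials to write the object, so that connection must be configured and the bucket must already exist.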

2
Add Redshift load task configuration
Add a task called load_to_redshift using RedshiftSQLOperator inside the same DAG. Configure it to run the SQL command COPY my_table FROM 's3://my-test-bucket/greeting.txt' IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'; on the Redshift cluster (COPY loads delimited text by default, so no FORMAT clause is needed). Use redshift_default as the connection ID.
Need a hint?

Use triple quotes for the SQL string and set redshift_conn_id exactly as shown.
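A sketch of the Redshift task, assuming it sits inside the same `with DAG(...)` block as the S3 upload. The table `my_table` and the IAM role ARN are the exercise's placeholders; no FORMAT clause is given because Redshift's COPY defaults to delimited text:

```python
from airflow.providers.amazon.aws.operators.redshift_sql import RedshiftSQLOperator

# Load the uploaded file from S3 into Redshift via a COPY command
load_to_redshift = RedshiftSQLOperator(
    task_id="load_to_redshift",
    redshift_conn_id="redshift_default",
    sql="""
        COPY my_table
        FROM 's3://my-test-bucket/greeting.txt'
        IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole';
    """,
)
```

The `redshift_default` connection must point at a reachable cluster, and the IAM role must grant Redshift read access to the bucket, or the COPY will fail at runtime.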

3
Add EMR job flow creation task
Add a task called start_emr_cluster using EmrCreateJobFlowOperator inside the DAG. Configure it with job_flow_overrides to create an EMR cluster with the name AirflowEMRCluster, release label emr-6.3.0, and one master node of type m5.xlarge. Use aws_default as the AWS connection ID.
Need a hint?

Define JOB_FLOW_OVERRIDES as a dictionary with the specified keys and values before the DAG context.
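A sketch of the EMR task, again inside the same DAG block. The cluster name, release label, and instance type come from the spec; the market type, instance-group name, and the `EMR_EC2_DefaultRole`/`EMR_DefaultRole` IAM roles are assumptions you may need to adapt:

```python
from airflow.providers.amazon.aws.operators.emr import EmrCreateJobFlowOperator

# Cluster definition passed to the EMR RunJobFlow API
JOB_FLOW_OVERRIDES = {
    "Name": "AirflowEMRCluster",
    "ReleaseLabel": "emr-6.3.0",
    "Instances": {
        "InstanceGroups": [
            {
                "Name": "Master node",           # assumed label
                "Market": "ON_DEMAND",           # assumed market type
                "InstanceRole": "MASTER",
                "InstanceType": "m5.xlarge",
                "InstanceCount": 1,
            }
        ],
        "KeepJobFlowAliveWhenNoSteps": False,
        "TerminationProtected": False,
    },
    "JobFlowRole": "EMR_EC2_DefaultRole",  # assumed default EMR roles
    "ServiceRole": "EMR_DefaultRole",
}

start_emr_cluster = EmrCreateJobFlowOperator(
    task_id="start_emr_cluster",
    job_flow_overrides=JOB_FLOW_OVERRIDES,
    aws_conn_id="aws_default",
)
```

Defining JOB_FLOW_OVERRIDES as a module-level constant before the DAG context keeps the cluster configuration separate from the task wiring.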

4
Set task dependencies to define workflow order
Set the task dependencies so that upload_to_s3 runs before load_to_redshift, and load_to_redshift runs before start_emr_cluster.
Need a hint?

Use the >> operator to set the order of tasks.
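Assuming the three tasks from steps 1–3 are defined inside the DAG context, the ordering can be expressed in one line at the end of the `with DAG(...)` block:

```python
# S3 upload first, then Redshift load, then EMR cluster creation
upload_to_s3 >> load_to_redshift >> start_emr_cluster
```

The `>>` operator chains, so this single line is equivalent to setting the two pairwise dependencies separately.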