AWS operators (S3, Redshift, EMR) in Apache Airflow - Time & Space Complexity
When using AWS operators in Airflow, it's important to understand how the number of tasks affects execution time.
We want to know how the time to run grows as we add more AWS operations like S3 uploads or Redshift queries.
Analyze the time complexity of the following operation sequence.
from airflow import DAG
from airflow.providers.amazon.aws.operators.s3 import S3CreateObjectOperator
from airflow.providers.amazon.aws.operators.redshift import RedshiftSQLOperator
from airflow.providers.amazon.aws.operators.emr import EmrCreateJobFlowOperator

# n is the number of task sets to generate (the input size in this analysis).
with DAG('aws_batch_dag') as dag:
    for i in range(n):
        # Each iteration defines one S3 task, one Redshift task, and one EMR task.
        s3_task = S3CreateObjectOperator(task_id=f's3_upload_{i}', ...)
        redshift_task = RedshiftSQLOperator(task_id=f'redshift_query_{i}', ...)
        emr_task = EmrCreateJobFlowOperator(task_id=f'emr_job_{i}', ...)
This DAG defines n sets of AWS tasks, each made up of an S3 upload, a Redshift query, and an EMR job flow creation, for 3n tasks in total.
Identify the API calls, resource provisioning, and data transfers that repeat.
- Primary operation: Each iteration defines three tasks, and each task makes one AWS request when it runs: one to S3, one to Redshift, and one to EMR.
- How many times: Each of these requests happens once per iteration, so 3n in total.
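The count above can be checked with a small stand-in for the loop. This is a minimal sketch that does not touch Airflow or AWS: `count_calls` and the service labels are hypothetical names used only for illustration, and it assumes each task issues exactly one request when it runs.

```python
from collections import Counter


def count_calls(n: int) -> Counter:
    """Tally one simulated request per service for each of n iterations."""
    calls = Counter()
    for _ in range(n):
        # Stand-ins for the S3 upload, Redshift query, and EMR job flow tasks.
        for service in ("s3", "redshift", "emr"):
            calls[service] += 1
    return calls


for n in (10, 100, 1000):
    calls = count_calls(n)
    # Print the input size, the total number of requests, and the per-service tally.
    print(n, sum(calls.values()), dict(calls))
```

Each service is hit n times, so the total is always three times the input size.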
As you increase n, the number of AWS calls grows directly with n.
| Input Size (n) | Approx. API Calls/Operations |
|---|---|
| 10 | 30 (3 calls x 10) |
| 100 | 300 (3 calls x 100) |
| 1000 | 3000 (3 calls x 1000) |
Pattern observation: The total number of operations grows in direct proportion to the number of iterations.
Time Complexity: O(n)
This means the total work grows linearly with the number of AWS tasks, and so does the run time when the tasks execute one after another.
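To make the linear growth concrete, here is a rough model of the run time when every task executes one after another. The function name `sequential_seconds` and the per-task durations are made-up placeholders for illustration, not measurements of any real S3, Redshift, or EMR workload.

```python
def sequential_seconds(
    n: int,
    s3: float = 2.0,         # assumed seconds per S3 upload (placeholder)
    redshift: float = 30.0,  # assumed seconds per Redshift query (placeholder)
    emr: float = 60.0,       # assumed seconds per EMR job flow creation (placeholder)
) -> float:
    """Estimated run time when all 3n tasks execute strictly in sequence."""
    return n * (s3 + redshift + emr)


for n in (10, 20, 40):
    # Doubling n doubles the estimate, which is what O(n) predicts.
    print(n, sequential_seconds(n))
```

Whatever the real per-task durations are, they act as a constant factor: the estimate scales with n and nothing else.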
[X] Wrong: "Adding more AWS tasks won't affect total time much because they run in the cloud."
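[OK] Correct: Each AWS call takes time and resources, so more tasks mean more total work. Running tasks in parallel can shorten the wall-clock time, but only up to the concurrency limits of Airflow and the AWS services involved.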
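The effect of parallelism can be approximated with a simple "waves" model. This is only a sketch under strong assumptions: all 3n tasks are independent, every task takes the same time, and a fixed number of worker slots is available. The name `parallel_seconds` and the default duration are hypothetical.

```python
import math


def parallel_seconds(n: int, slots: int, task_seconds: float = 30.0) -> float:
    """Rough wall-clock estimate for 3n equal-length tasks run in waves of `slots`."""
    if slots < 1:
        raise ValueError("slots must be at least 1")
    total_tasks = 3 * n
    # Each wave runs up to `slots` tasks at once; a partial final wave still takes a full task time.
    waves = math.ceil(total_tasks / slots)
    return waves * task_seconds


for n in (10, 100, 1000):
    print(n, parallel_seconds(n, slots=16))
```

With a fixed slot count the estimate still grows roughly linearly in n, only with a smaller constant factor. Under this model it would stay flat only if the number of slots grew along with n, which real concurrency limits usually prevent.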
Understanding how task count affects execution helps you design efficient workflows and explain your choices clearly.
"What if we changed the tasks to run in parallel instead of sequentially? How would the time complexity change?"