In Airflow, tasks may run multiple times due to retries or manual triggers. Why is designing tasks to be idempotent important?
Think about what happens if a task runs more than once and changes data each time.
Idempotency guarantees that running a task multiple times produces the same result as running it once, with no unintended side effects or duplicate data. This is crucial in Airflow, where retries, backfills, and manual reruns are routine.
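The contrast can be sketched with two plain-Python "task" bodies run twice; this is a minimal illustration, not Airflow code, and the function and key names are made up for the example.

```python
# Minimal sketch (not Airflow-specific): the same task body run twice.
state = {'rows': []}

def append_row_task():
    # Non-idempotent: every run appends, so a retry duplicates the data.
    state['rows'].append('order-42')

def upsert_row_task():
    # Idempotent: a rerun leaves the state unchanged.
    if 'order-42' not in state['rows']:
        state['rows'].append('order-42')

append_row_task()
append_row_task()
print(state['rows'])    # ['order-42', 'order-42'] -- duplicated after a retry

state['rows'].clear()
upsert_row_task()
upsert_row_task()
print(state['rows'])    # ['order-42'] -- a retry changes nothing
```

The second version can be re-run any number of times and the end state is always the same, which is exactly the property retries require.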
Consider an Airflow PythonOperator task that creates a file only if it does not exist. What will be the output if the task runs twice?
```python
import os

file_path = '/tmp/data.txt'
if not os.path.exists(file_path):
    with open(file_path, 'w') as f:
        f.write('Airflow data')
print('File created or already exists')
```
Check what happens when the file already exists on the second run.
The code checks whether the file exists before creating it. On the first run it creates the file; on the second run the file already exists, so creation is skipped and only the message is printed. The end state is identical either way, which makes the task idempotent.
You want an Airflow task to update a database record only if a specific condition is met, avoiding duplicate updates on retries. Which SQL statement ensures idempotency?
Think about how to avoid changing the record if it already has the desired status.
Option A updates the record only when its status is not already 'active'. The first run changes the row; any retry matches zero rows and changes nothing, which makes the update idempotent.
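The conditional-update pattern can be demonstrated with an in-memory SQLite database; the table and column names here are illustrative assumptions, not taken from the question's options.

```python
# Sketch of a conditional UPDATE that is safe to retry.
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute(
    "CREATE TABLE records (id INTEGER PRIMARY KEY, status TEXT, updated_count INTEGER)"
)
conn.execute("INSERT INTO records VALUES (1, 'pending', 0)")

def activate(record_id):
    # The WHERE clause makes the update a no-op once status is 'active'.
    cur = conn.execute(
        "UPDATE records SET status = 'active', updated_count = updated_count + 1 "
        "WHERE id = ? AND status != 'active'",
        (record_id,),
    )
    return cur.rowcount  # number of rows actually changed

print(activate(1))  # 1 -- first run updates the row
print(activate(1))  # 0 -- retry matches nothing, no duplicate effect
```

Because the guard lives in the SQL itself, the task stays idempotent even if the retry happens on a different worker with no memory of the first attempt.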
An Airflow task inserts rows into a table every time it runs, causing duplicates on retries. What is the best way to fix this?
How can you prevent inserting the same data multiple times?
Checking for existing rows before inserting (or using an upsert such as `INSERT ... ON CONFLICT`) ensures the task does not create duplicates on retries, making it idempotent.
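One common realization of this fix is to give each row a natural unique key and let the database reject duplicates. A minimal sketch using SQLite's `INSERT OR IGNORE` (table and column names are illustrative):

```python
# Retry-safe inserts: a unique key plus INSERT OR IGNORE (SQLite syntax).
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE events (event_id TEXT PRIMARY KEY, payload TEXT)")

def load_task():
    # A rerun re-issues the same insert, but the primary key makes it a no-op.
    conn.execute(
        "INSERT OR IGNORE INTO events (event_id, payload) VALUES (?, ?)",
        ('run-2024-01-01', 'daily totals'),
    )

load_task()
load_task()  # simulated retry
count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
print(count)  # 1 -- no duplicate row despite two runs
```

Deriving the key from the logical run (for example the DAG's execution date) is what ties "same run" to "same row" and keeps retries harmless.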
You have an Airflow DAG that calls an external API to create resources. The API does not support idempotency keys. How can you design the DAG to avoid creating duplicate resources on retries?
Think about how to remember what was created in previous attempts.
By storing the IDs of created resources in XCom, the task can check on a later attempt whether the resource was already created and skip the duplicate API call, achieving idempotency even though the API itself offers no idempotency keys.
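The check-before-create pattern can be sketched in plain Python. Airflow's real XCom is backed by the metadata database and accessed via `ti.xcom_push` / `ti.xcom_pull`; here a dict stands in for it, and `create_resource()` is a hypothetical stand-in for the external API call.

```python
# Sketch of check-before-create with a remembered resource ID.
created = []                       # records every simulated API call

def create_resource(name):
    created.append(name)           # pretend external API call (side effect)
    return f'id-for-{name}'

xcom = {}                          # stand-in for Airflow's XCom store

def create_if_needed(name):
    key = f'resource_{name}'
    if key in xcom:                # already created on an earlier attempt
        return xcom[key]
    resource_id = create_resource(name)
    xcom[key] = resource_id        # remember it for future retries
    return resource_id

first = create_if_needed('bucket-a')
second = create_if_needed('bucket-a')   # simulated retry
print(first == second, len(created))    # True 1 -- only one API call made
```

The essential move is that the record of "what was already created" lives outside the task attempt, so a retry can consult it instead of blindly calling the API again.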