GCP operators (BigQuery, GCS, Dataflow) in Apache Airflow - Time & Space Complexity
When using GCP operators in Airflow, it's important to understand how the number of tasks and API calls grows as you handle more data or jobs. In other words, we want to know how the work these operators do changes as the input size changes.
Analyze the time complexity of the following operation sequence.
```python
from airflow import DAG
from airflow.utils.dates import days_ago
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator
from airflow.providers.google.cloud.operators.gcs import GCSListObjectsOperator
from airflow.providers.google.cloud.operators.dataflow import DataflowStartFlexTemplateOperator

with DAG('gcp_data_pipeline', start_date=days_ago(1)) as dag:
    # One GCS list call; the file names are pushed to XCom.
    list_files = GCSListObjectsOperator(task_id='list_files', bucket='my-bucket')

    # Illustrative loop: in real Airflow an XComArg is not iterable at parse
    # time, so per-file tasks would be created with dynamic task mapping
    # (.partial() / .expand()). The scaling behaviour is the same either way.
    for file in list_files.output:
        start_dataflow = DataflowStartFlexTemplateOperator(
            task_id=f'start_dataflow_{file}',
            location='us-central1',
            body={
                'launchParameter': {
                    'jobName': f'dataflow-job-{file}',
                    'containerSpecGcsPath': 'gs://dataflow-templates/latest/flex',
                    'parameters': {'inputFile': f'gs://my-bucket/{file}'},
                }
            },
        )
        run_bigquery = BigQueryInsertJobOperator(
            task_id=f'run_bigquery_{file}',
            configuration={...},  # query configuration elided in the original
        )
        list_files >> start_dataflow >> run_bigquery
```
This sequence lists files in a GCS bucket, then for each file starts a Dataflow job and runs a BigQuery job.
Identify the API calls, resource provisioning, and data transfers that repeat.
- Primary operation: Starting a Dataflow job and running a BigQuery job for each file.
- How many times: Once per file found in the GCS bucket.
As the number of files in the bucket increases, the number of Dataflow and BigQuery jobs started grows at the same rate.
| Input Size (n) | Approx. API Calls/Operations |
|---|---|
| 10 | About 10 Dataflow + 10 BigQuery jobs |
| 100 | About 100 Dataflow + 100 BigQuery jobs |
| 1000 | About 1000 Dataflow + 1000 BigQuery jobs |
Pattern observation: The number of operations grows directly with the number of files.
Time Complexity: O(n)
This means the work grows linearly with the number of files processed.
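The linear fan-out can be checked with a small pure-Python simulation. No Airflow or GCP calls are involved; the function below is a hypothetical stand-in that simply counts the control-plane operations the DAG above would issue:

```python
def simulate_pipeline(files):
    """Count the control-plane operations the per-file DAG would issue."""
    api_calls = 1  # one GCS list call for the whole bucket
    for _ in files:
        api_calls += 1  # one Dataflow flex-template launch per file
        api_calls += 1  # one BigQuery insert-job call per file
    return api_calls

for n in (10, 100, 1000):
    calls = simulate_pipeline([f"file_{i}.csv" for i in range(n)])
    print(n, calls)  # 1 list call + 2n job launches: linear in n
```

Doubling the number of files doubles the job launches, which is exactly what O(n) predicts.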
[X] Wrong: "Starting one Dataflow job can handle all files at once, so the number of files doesn't affect the number of jobs."
[OK] Correct: In this setup, each file triggers its own Dataflow and BigQuery job, so more files mean more jobs and API calls.
Understanding how cloud operator tasks scale with input size helps you design efficient pipelines and explain your reasoning clearly in discussions.
"What if we changed the pipeline to batch all files into a single Dataflow job? How would the time complexity change?"