You have a Dataflow pipeline processing streaming data with fixed windows of 5 minutes. What will happen if late data arrives after the window has closed and the allowed lateness period has passed?
Think about how Dataflow handles data that arrives after the window and allowed lateness period.
Dataflow discards late data that arrives after the window has closed and the allowed lateness period has expired, maintaining consistent results and avoiding reprocessing.
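The decision can be illustrated with a small pure-Python sketch (outside Beam) of how an element's fate depends on where the watermark sits relative to the window end; the 2-minute allowed-lateness value is a hypothetical setting:

```python
from datetime import datetime, timedelta

WINDOW_SIZE = timedelta(minutes=5)       # fixed 5-minute windows
ALLOWED_LATENESS = timedelta(minutes=2)  # hypothetical allowed-lateness setting

def classify(event_time: datetime, watermark: datetime) -> str:
    """Mimic Dataflow's late-data handling for a fixed window.

    An element belongs to the window [start, end) containing its event
    time. It is "on-time" while the watermark has not passed the window
    end, "late" (pane re-fired) while the watermark is within allowed
    lateness past the end, and "dropped" after that.
    """
    # Align the window start to a multiple of the window size.
    epoch = datetime(1970, 1, 1)
    offset = (event_time - epoch) % WINDOW_SIZE
    window_end = event_time - offset + WINDOW_SIZE
    if watermark <= window_end:
        return "on-time"
    if watermark <= window_end + ALLOWED_LATENESS:
        return "late"
    return "dropped"
```

For an event at 10:03 in the [10:00, 10:05) window, a watermark at 10:06 still yields a late firing, while a watermark past 10:07 means the element is silently dropped.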
You need to process a large dataset that updates once a day and produce a report. Which Dataflow processing mode is most appropriate?
Consider how often the data updates and the best way to process large static datasets.
Batch mode is best for large datasets that update periodically, as it processes the entire dataset efficiently once per update.
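As a toy illustration of the batch pattern (the dataset and report shape are hypothetical): the whole static dataset is read once per daily update, aggregated, and a single report is emitted — there is no need for windows, watermarks, or triggers.

```python
from collections import defaultdict

# Hypothetical daily records: (product, units_sold)
daily_records = [
    ("widget", 3), ("gadget", 5), ("widget", 2), ("gadget", 1),
]

def build_report(records):
    """Batch-style processing: consume the entire dataset in one pass
    and produce one report, run once per daily refresh."""
    totals = defaultdict(int)
    for product, units in records:
        totals[product] += units
    return dict(totals)

report = build_report(daily_records)  # {"widget": 5, "gadget": 6}
```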
You want to restrict who can start and manage your Dataflow jobs in your GCP project. Which IAM role should you assign to users to allow them to create and cancel Dataflow jobs but not modify other resources?
Think about the role that allows job management but limits broader admin permissions.
The Dataflow Developer role (roles/dataflow.developer) lets users create and cancel Dataflow jobs without granting full admin rights over other project resources.
You want your Dataflow streaming job to automatically adjust the number of worker instances based on workload. Which autoscaling algorithm should you choose for best responsiveness?
Consider which metric best reflects workload changes in streaming data.
THROUGHPUT_BASED autoscaling adjusts the worker count based on observed throughput and backlog, providing responsive scaling for streaming jobs.
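With the Beam Python SDK these are passed as pipeline options when launching the job on the Dataflow runner; a sketch of the relevant flags (project, region, and the worker cap are placeholder values):

```python
# Pipeline options typically passed to a streaming job on the Dataflow
# runner. Flag names match the Beam Python Dataflow options; the project
# ID, region, and max worker count below are hypothetical placeholders.
dataflow_args = [
    "--runner=DataflowRunner",
    "--project=my-project",      # placeholder project ID
    "--region=us-central1",      # placeholder region
    "--streaming",
    "--autoscaling_algorithm=THROUGHPUT_BASED",
    "--max_num_workers=20",      # upper bound the autoscaler may reach
]
```

The `max_num_workers` cap matters: without it, THROUGHPUT_BASED scaling can grow the worker pool (and the bill) as far as quota allows.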
You have a Dataflow batch pipeline that processes large files daily. You want to reduce cost without significantly increasing processing time. Which combination of strategies is best?
Think about balancing worker count and pipeline optimizations.
Autoscaling capped with a maximum worker count keeps cost bounded, while shuffle optimizations, such as combining values before a group-by-key, improve performance by reducing data movement between workers.
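The shuffle saving can be sketched in plain Python: pre-aggregating per worker before the shuffle boundary (what Beam's CombinePerKey does automatically via combiner lifting) means only one partial result per key per worker crosses the network instead of every record. The batch contents are made up for illustration.

```python
from collections import Counter

# Hypothetical per-worker batches of (key, value) records.
worker_batches = [
    [("a", 1), ("a", 1), ("b", 1)] * 100,  # worker 1: 300 records
    [("a", 1), ("b", 1), ("b", 1)] * 100,  # worker 2: 300 records
]

# Naive GroupByKey: every record crosses the shuffle boundary.
naive_shuffled = sum(len(batch) for batch in worker_batches)

# Combine-before-shuffle (combiner lifting): each worker pre-sums
# locally, so only one partial sum per key per worker is shuffled.
partials = [Counter(k for k, _ in batch) for batch in worker_batches]
lifted_shuffled = sum(len(p) for p in partials)

# naive_shuffled: 600 records shuffled vs lifted_shuffled: 4 partial sums
```

Fewer shuffled bytes means less time spent on data movement, which in turn lets the capped worker pool finish the daily batch in roughly the same wall-clock time at lower cost.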