Reliability design principles in GCP - Time & Space Complexity
When designing reliable cloud systems, it is important to understand how the time to recover or respond grows as the system scales.
We want to know how the effort or operations needed to keep the system reliable change as we add more components or users.
Analyze the time complexity of monitoring and recovery operations in a multi-instance service.
// Pseudocode for reliability checks
for each instance in service_instances:
    healthy = check_health(instance)
    if not healthy:
        restart_instance(instance)
        notify_admin(instance)
This sequence checks each instance's health, restarts if needed, and sends notifications.
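The loop above can be sketched in runnable Python. This is a minimal illustration, not a GCP API: the `Instance` class and the bodies of `check_health`, `restart_instance`, and `notify_admin` are hypothetical stand-ins for whatever health probe, restart call, and alerting channel a real service would use.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    # Hypothetical in-memory model of a service instance.
    name: str
    healthy: bool = True


def check_health(instance: Instance) -> bool:
    # Stand-in for a real health probe (assumption: one call per instance).
    return instance.healthy


def restart_instance(instance: Instance) -> None:
    # Stand-in for a real restart; here it simply marks the instance healthy.
    instance.healthy = True


def notify_admin(instance: Instance, notifications: list) -> None:
    # Stand-in for an alert; here it records a message in a list.
    notifications.append(f"restarted {instance.name}")


def run_reliability_cycle(service_instances: list) -> list:
    """Run one monitoring cycle: one health check per instance,
    plus a restart and a notification for each unhealthy one."""
    notifications = []
    for instance in service_instances:
        if not check_health(instance):
            restart_instance(instance)
            notify_admin(instance, notifications)
    return notifications
```

Calling `run_reliability_cycle([Instance("a"), Instance("b", healthy=False)])` would restart only `b` and return a single notification for it.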
Look at what repeats as the number of instances grows.
- Primary operation: Health check API call per instance
- How many times: Once per instance each cycle
- Additional operations: Restart and notify run only for unhealthy instances, so their count varies from cycle to cycle (at most once per instance)
As the number of instances increases, the number of health checks grows proportionally.
| Input Size (n) | Approx. API Calls/Operations |
|---|---|
| 10 | 10 health checks |
| 100 | 100 health checks |
| 1000 | 1000 health checks |
Pattern observation: The operations grow linearly with the number of instances.
Time Complexity: O(n)
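This means the time to perform health checks grows in direct proportion to the number of instances. Even in the worst case, where every instance is unhealthy, each instance adds at most one restart and one notification, so a full cycle is still O(n).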
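To make the linear growth concrete, the following sketch counts operations for one monitoring cycle at several sizes. The function name `count_operations` and the assumption that every tenth instance is unhealthy are illustrative choices, not measurements from a real service.

```python
def count_operations(n: int, unhealthy_every: int = 10) -> tuple:
    """Count health checks and restarts for one cycle over n instances.

    Assumption for illustration: every `unhealthy_every`-th instance
    (indices 0, 10, 20, ...) is unhealthy.
    """
    health_checks = 0
    restarts = 0
    for i in range(n):
        health_checks += 1          # one check per instance, always
        if i % unhealthy_every == 0:
            restarts += 1           # extra work only for unhealthy ones
    return health_checks, restarts


for n in (10, 100, 1000):
    checks, restarts = count_operations(n)
    print(n, checks, restarts)
# prints:
# 10 10 1
# 100 100 10
# 1000 1000 100
```

Both counts scale by the same factor as `n`, which is the linear pattern described above.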
[X] Wrong: "Adding more instances won't affect monitoring time because checks run in parallel."
[OK] Correct: Parallelism shortens the elapsed time only up to the number of workers available. The total work is still one health check per instance, so API calls, resource use, and potential delays all grow with the instance count.
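A small sketch can show why parallelism does not remove the linear cost. It uses Python's standard `concurrent.futures.ThreadPoolExecutor`; the `check_health` function is a hypothetical stand-in that only counts calls, and `rounds` is an idealized estimate that assumes every check takes the same time.

```python
import math
import threading
from concurrent.futures import ThreadPoolExecutor


def run_parallel_checks(n: int, workers: int) -> tuple:
    """Run n simulated health checks on a pool of `workers` threads.

    Returns (total_calls, rounds), where rounds = ceil(n / workers) is an
    idealized count of sequential "waves", assuming equal check durations.
    """
    calls = 0
    lock = threading.Lock()

    def check_health(instance_id: int) -> bool:
        # Hypothetical probe: just record that a call happened.
        nonlocal calls
        with lock:
            calls += 1
        return True

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Consume the iterator so every check actually runs.
        list(pool.map(check_health, range(n)))

    rounds = math.ceil(n / workers)
    return calls, rounds
```

With 10 workers, `run_parallel_checks(100, 10)` performs 100 calls in about 10 rounds, and `run_parallel_checks(1000, 10)` performs 1000 calls in about 100 rounds. For a fixed worker count, both the total work and the idealized elapsed time still grow linearly with `n`.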
Understanding how reliability operations scale helps you design systems that stay dependable as they grow, a key skill in cloud roles.
"What if we batch health checks instead of checking each instance separately? How would the time complexity change?"