Reliability pillar principles in AWS - Time & Space Complexity
We want to understand how the time to keep a system reliable changes as the system grows.
How does adding more parts affect the work needed to keep everything running smoothly?
Analyze the time complexity of monitoring and recovering multiple AWS resources.
```
// Pseudocode for monitoring and recovery
for each resource in resources:
    check health status
    if unhealthy:
        trigger recovery action
    log status
```
This sequence checks each resource's health and recovers it if needed.
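The loop above can be sketched in Python. This is a minimal simulation, not real AWS SDK code: the resource names and the `check_health`/recovery helpers are illustrative stand-ins for health-check and recovery API calls.

```python
# Minimal sketch of the monitoring-and-recovery loop, with simulated
# health checks standing in for real AWS API calls.

def check_health(resource, statuses):
    # Stand-in for one health-check API call per resource.
    return statuses.get(resource, "healthy")

def monitor(resources, statuses):
    log = []
    for resource in resources:          # one iteration per resource -> O(n)
        status = check_health(resource, statuses)
        if status != "healthy":
            status = "recovered"        # stand-in for a recovery action
        log.append((resource, status))  # log status for every resource
    return log

resources = ["web-1", "web-2", "db-1"]
statuses = {"db-1": "unhealthy"}        # simulated current health
print(monitor(resources, statuses))
```

Running it recovers only the unhealthy resource while still checking and logging all three, which is exactly the per-resource work the complexity analysis counts.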
Look at what repeats as the system grows.
- Primary operation: Health check API call for each resource
- How many times: Once per resource
- Secondary operation: Recovery action if needed, also per resource but only when unhealthy
As you add more resources, the number of health checks grows directly with the number of resources.
| Input Size (n) | Approx. API Calls/Operations |
|---|---|
| 10 | 10 health checks |
| 100 | 100 health checks |
| 1000 | 1000 health checks |
Pattern observation: The work grows in a straight line as you add resources; each new resource adds exactly one health check.
Time Complexity: O(n)
This means the time to monitor and recover grows directly with the number of resources.
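The table's pattern can be confirmed with a tiny counter. This sketch simply tallies one simulated check per resource, mirroring the loop above, to show the check count equals the input size n.

```python
# Count simulated health checks to confirm the O(n) pattern:
# the number of checks equals the number of resources.

def count_checks(n):
    checks = 0
    for _ in range(n):  # one health check per resource
        checks += 1
    return checks

for n in (10, 100, 1000):
    print(n, count_checks(n))  # count grows in lockstep with n
```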
[X] Wrong: "Adding more resources won't affect monitoring time much because checks are fast."
[OK] Correct: Each resource adds one health check, so total monitoring time grows linearly rather than staying constant.
Understanding how monitoring scales helps you design systems that stay reliable as they grow, a key skill in cloud roles.
"What if we grouped resources and checked groups instead of individual resources? How would the time complexity change?"