Failover routing for disaster recovery in AWS - Time & Space Complexity
When setting up failover routing for disaster recovery, it is important to understand how the time to switch traffic grows as the system scales.
We want to know how the routing process behaves as more resources or endpoints are involved.
Analyze the time complexity of the following AWS Route 53 failover routing setup.
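# Create the primary failover record
aws route53 change-resource-record-sets --hosted-zone-id ZONEID --change-batch '{"Changes":[{"Action":"CREATE","ResourceRecordSet":{"Name":"example.com","Type":"A","SetIdentifier":"Primary","Failover":"PRIMARY","TTL":60,"ResourceRecords":[{"Value":"192.0.2.1"}]}}]}'
# Create the secondary failover record
aws route53 change-resource-record-sets --hosted-zone-id ZONEID --change-batch '{"Changes":[{"Action":"CREATE","ResourceRecordSet":{"Name":"example.com","Type":"A","SetIdentifier":"Secondary","Failover":"SECONDARY","TTL":60,"ResourceRecords":[{"Value":"192.0.2.2"}]}}]}'
# Create a health check for the primary endpoint (the response includes its Id)
aws route53 create-health-check --caller-reference "primary-check" --health-check-config '{"IPAddress":"192.0.2.1","Port":80,"Type":"HTTP","ResourcePath":"/health"}'
# Associate the health check with the primary record (HEALTH_CHECK_ID is a placeholder for the Id returned above)
aws route53 change-resource-record-sets --hosted-zone-id ZONEID --change-batch '{"Changes":[{"Action":"UPSERT","ResourceRecordSet":{"Name":"example.com","Type":"A","SetIdentifier":"Primary","Failover":"PRIMARY","HealthCheckId":"HEALTH_CHECK_ID","TTL":60,"ResourceRecords":[{"Value":"192.0.2.1"}]}}]}'
# Route 53 then switches to the secondary automatically when the primary's health check fails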
This sequence sets up DNS failover between a primary and a secondary endpoint, with a health check monitoring the primary.
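The routing decision itself can be sketched as a small shell function. This is a simplified model, not Route 53's actual implementation: `pick_endpoint` is a hypothetical helper, the status strings are made up for illustration, and the secondary is assumed to have no health check of its own, so it is always treated as available.

```shell
# pick_endpoint PRIMARY_IP SECONDARY_IP PRIMARY_STATUS
# Hypothetical helper: prints the address a failover policy would answer with.
# Assumption: only the primary is health-checked; the secondary is always eligible.
pick_endpoint() {
  if [ "$3" = "healthy" ]; then
    printf '%s\n' "$1"   # primary passes its health check: keep serving it
  else
    printf '%s\n' "$2"   # primary fails its health check: fail over to the secondary
  fi
}

# Example calls using the addresses from the setup above.
pick_endpoint 192.0.2.1 192.0.2.2 healthy
pick_endpoint 192.0.2.1 192.0.2.2 unhealthy
```

The first call prints the primary address and the second prints the secondary, which mirrors the switch that happens when the primary's health check fails.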
In this setup, the main repeating operations are:
- Primary operation: DNS health checks and routing decisions by Route 53.
- How many times: Health checks run continuously at regular intervals; routing decisions happen each time a health check result is evaluated.
As the number of endpoints or health checks increases, Route 53 must evaluate more health check results to decide routing.
| Input Size (n) | Approx. API Calls/Operations |
|---|---|
| 2 endpoints | 2 health checks evaluated regularly |
| 10 endpoints | 10 health checks evaluated regularly |
| 100 endpoints | 100 health checks evaluated regularly |
Pattern observation: The number of health checks and routing evaluations grows linearly with the number of endpoints.
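The linear pattern can be reproduced with a rough counting model. This is only a sketch under stated assumptions: `evaluations` is a hypothetical helper, each endpoint is assumed to have exactly one health check, and each check is assumed to be evaluated once per interval (30 seconds is used here as the interval).

```shell
# evaluations ENDPOINTS WINDOW_SECONDS INTERVAL_SECONDS
# Hypothetical helper: counts health-check evaluations in a time window.
# Assumption: one health check per endpoint, evaluated once per interval.
evaluations() {
  echo $(( $1 * ($2 / $3) ))
}

# Count evaluations per 60-second window for the endpoint counts in the table.
for n in 2 10 100; do
  echo "$n endpoints: $(evaluations "$n" 60 30) evaluations per minute"
done
```

Under these assumptions the counts are 4, 20, and 200 per minute: multiplying the number of endpoints by a factor multiplies the evaluation count by the same factor.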
Time Complexity: O(n)
This means the total work of evaluating health checks and deciding routing grows in direct proportion to the number of endpoints monitored.
[X] Wrong: "The work behind failover routing stays the same no matter how many endpoints we add."
[OK] Correct: Each additional endpoint adds a health check to evaluate, so the total evaluation work grows linearly with the number of endpoints.
Understanding how failover routing scales helps you design reliable systems that respond quickly during outages, a key skill in cloud infrastructure roles.
"What if we added a global traffic policy with multiple failover groups? How would the time complexity change?"