Introduction

Reliability means your app or service keeps working correctly even when individual components fail. A reliable system detects failures early, recovers from them automatically, and keeps downtime for users to a minimum.

When you want your website to stay online even if a server crashes

When you need your app to recover quickly from unexpected errors

When you want to test how your system behaves under failure conditions

When you want to automatically fix problems without manual help

When you want to plan for growth without losing service quality

Commands

This command creates a CloudWatch alarm that watches average CPU utilization on one EC2 instance. If the average stays above 80% for 2 consecutive 5-minute periods, the alarm enters the ALARM state and publishes a notification to the SNS topic given in --alarm-actions. This helps detect problems early.

Terminal

aws cloudwatch put-metric-alarm --alarm-name HighCPUUtilization --metric-name CPUUtilization --namespace AWS/EC2 --statistic Average --period 300 --threshold 80 --comparison-operator GreaterThanThreshold --dimensions Name=InstanceId,Value=i-0123456789abcdef0 --evaluation-periods 2 --alarm-actions arn:aws:sns:us-east-1:123456789012:MyTopic --unit Percent

Expected Output

No output (command runs silently)

→ --alarm-name - Name of the alarm

→ --threshold - Value the metric is compared against (here, 80 percent)

→ --alarm-actions - Actions to run when the alarm enters the ALARM state (here, publishing to an SNS topic)
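Because put-metric-alarm prints nothing, it can help to confirm the alarm afterwards and to work out how long CPU must stay high before it fires. The following is a minimal sketch that assumes the alarm name and values from the command above; the helper name alarm_state is hypothetical, not part of the AWS CLI.

```shell
# Values taken from the put-metric-alarm command above.
PERIOD_SECONDS=300
EVALUATION_PERIODS=2

# Approximate time the metric must stay above the threshold before the alarm fires.
ALARM_WINDOW_SECONDS=$((PERIOD_SECONDS * EVALUATION_PERIODS))
echo "CPU must stay above 80% for about $((ALARM_WINDOW_SECONDS / 60)) minutes"

# Hypothetical helper: print the alarm's current state
# (OK, ALARM, or INSUFFICIENT_DATA).
alarm_state() {
  aws cloudwatch describe-alarms \
    --alarm-names "$1" \
    --query 'MetricAlarms[0].StateValue' \
    --output text
}
```

Calling alarm_state HighCPUUtilization right after creation will often print INSUFFICIENT_DATA, since the alarm has not yet collected enough data points to evaluate.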

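This command creates an auto scaling group that keeps between 1 and 3 EC2 instances running, starting with 2. The group replaces instances that fail health checks, so capacity stays at the desired level. To add or remove instances as demand changes, the group also needs a scaling policy. The command assumes that the launch configuration my-launch-config and the subnet subnet-12345678 already exist.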

Terminal

aws autoscaling create-auto-scaling-group --auto-scaling-group-name my-asg --launch-configuration-name my-launch-config --min-size 1 --max-size 3 --desired-capacity 2 --vpc-zone-identifier subnet-12345678

Expected Output

No output (command runs silently)

→ --min-size - Minimum number of instances

→ --max-size - Maximum number of instances

→ --desired-capacity - Number of instances the group starts with and tries to maintain
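On its own, the group above only maintains its desired capacity. One way to make it react to demand is a target tracking scaling policy. The sketch below assumes the group name my-asg from the command above; the policy name, the 50% CPU target, and the helper name add_cpu_policy are hypothetical values chosen for illustration.

```shell
# Hypothetical policy name and CPU target; adjust for your workload.
POLICY_NAME="cpu-target-50"
TARGET_CPU=50

# Build the target tracking configuration as JSON.
# ASGAverageCPUUtilization is the predefined metric for the group's average CPU.
TRACKING_CONFIG=$(printf '{"PredefinedMetricSpecification":{"PredefinedMetricType":"ASGAverageCPUUtilization"},"TargetValue":%s}' "$TARGET_CPU")

# Attach the policy to the group; the group then scales between
# its min and max sizes to hold average CPU near the target.
add_cpu_policy() {
  aws autoscaling put-scaling-policy \
    --auto-scaling-group-name my-asg \
    --policy-name "$POLICY_NAME" \
    --policy-type TargetTrackingScaling \
    --target-tracking-configuration "$TRACKING_CONFIG"
}
```

Running add_cpu_policy applies the policy. Target tracking was picked here because it needs only a single target value; step scaling or scheduled scaling may suit other workloads better.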

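This command checks the current state of the auto scaling group: its size limits, its desired capacity, and the lifecycle state and health of each instance. The history of scaling actions is reported separately by aws autoscaling describe-scaling-activities.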

Terminal

aws autoscaling describe-auto-scaling-groups --auto-scaling-group-names my-asg

Expected Output (abbreviated; the real response contains more fields)

{
  "AutoScalingGroups": [
    {
      "AutoScalingGroupName": "my-asg",
      "MinSize": 1,
      "MaxSize": 3,
      "DesiredCapacity": 2,
      "Instances": [
        {
          "InstanceId": "i-0123456789abcdef0",
          "LifecycleState": "InService",
          "HealthStatus": "Healthy"
        },
        {
          "InstanceId": "i-0fedcba9876543210",
          "LifecycleState": "InService",
          "HealthStatus": "Healthy"
        }
      ]
    }
  ]
}
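Reading the full JSON by eye gets tedious, so a status check can be reduced to one number with the CLI's built-in --query option. This is a rough sketch that assumes the group exists; healthy_count and check_capacity are hypothetical helper names.

```shell
# Hypothetical helper: count instances that are both InService and Healthy.
healthy_count() {
  aws autoscaling describe-auto-scaling-groups \
    --auto-scaling-group-names "$1" \
    --query "length(AutoScalingGroups[0].Instances[?LifecycleState=='InService' && HealthStatus=='Healthy'])" \
    --output text
}

# Hypothetical helper: compare the healthy count with an expected minimum.
# Returns non-zero when the group has fewer healthy instances than expected.
check_capacity() {
  group="$1"
  expected="$2"
  actual=$(healthy_count "$group")
  if [ "$actual" -ge "$expected" ]; then
    echo "OK: $actual healthy instances in $group"
  else
    echo "WARNING: only $actual healthy instances in $group (expected $expected)"
    return 1
  fi
}
```

For the group above, check_capacity my-asg 2 would report whether both instances are in service and healthy. The non-zero return code makes the check usable from a cron job or CI step.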

Key Concept

If you remember nothing else from this pattern, remember: design your system to detect problems early and automatically fix them to keep your service running smoothly.

Common Mistakes

Not setting alarms for important metrics like CPU or disk space

Without alarms, you won't know when your system is struggling until users complain or it crashes

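Set alarms on key metrics to get notified early and take action. EC2 reports CPU utilization to CloudWatch by default, but disk space is only available after installing the CloudWatch agent on the instance.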

Setting auto scaling min and max sizes too close together

If min and max are the same, auto scaling can't adjust capacity to handle load changes

Set a range that allows scaling up and down based on demand

Not verifying auto scaling group status after creation

You might think scaling is working, but instances could be unhealthy or missing

Use describe commands to check the health and number of instances regularly
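The second mistake above can be caught before the group is created. Below is a small pre-flight check, written as a sketch: validate_sizes is a hypothetical helper, and the rule that min must be strictly less than max is this tutorial's advice rather than an AWS requirement (AWS accepts equal values, which gives a fixed-size group).

```shell
# Hypothetical pre-flight check for auto scaling sizes.
# Returns 0 when min < max and min <= desired <= max.
validate_sizes() {
  min="$1"
  max="$2"
  desired="$3"
  if [ "$min" -ge "$max" ]; then
    echo "min ($min) must be less than max ($max) for scaling to have room"
    return 1
  fi
  if [ "$desired" -lt "$min" ] || [ "$desired" -gt "$max" ]; then
    echo "desired ($desired) must be between $min and $max"
    return 1
  fi
  echo "sizes look reasonable: min=$min desired=$desired max=$max"
}

# Values from the create-auto-scaling-group command earlier in this page.
validate_sizes 1 3 2
```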

Summary

Create alarms to watch important system metrics and get notified of issues.

Use auto scaling groups, together with scaling policies, to automatically add or remove servers based on demand.

Check the status of your auto scaling groups to ensure your system stays healthy.