I need help understanding the optimal configuration for ECS task placement in a multi-AZ environment.
We recently resolved a task placement issue by constraining our Auto Scaling group (ASG) and ECS service to a single Availability Zone (AZ) and adding the AmazonECSManaged tag to the ASG with:

    aws autoscaling create-or-update-tags --tags "ResourceId=ecs-xyz-asg,ResourceType=auto-scaling-group,Key=AmazonECSManaged,Value=true,PropagateAtLaunch=false"

This fixed the immediate “TaskFailedToStart” error related to scale-in protection, but it isn’t suitable for production: running everything in a single AZ sacrifices fault tolerance.
We also can’t justify keeping idle instances in every AZ just to keep the ASG and the service’s task placement aligned, since each idle instance costs approximately $400/month.
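For a rough sense of scale, here is a hypothetical sketch of that arithmetic. Only the ~$400/month per idle instance comes from our estimate; the AZ counts and the one-spare-instance-per-AZ policy are illustrative assumptions, not our actual setup.

```python
# Hypothetical sketch: monthly cost of keeping spare "alignment" capacity warm
# in every AZ that would otherwise run no tasks.
# Assumption: ~$400/month per idle instance (our estimate); AZ counts are illustrative.

IDLE_INSTANCE_COST_USD = 400  # approximate monthly cost of one idle instance


def idle_overhead(total_azs: int, busy_azs: int, idle_per_az: int = 1) -> int:
    """Return the monthly USD cost of keeping `idle_per_az` spare instances
    in each AZ that has no running tasks (total_azs - busy_azs of them)."""
    if not 0 <= busy_azs <= total_azs:
        raise ValueError("busy_azs must be between 0 and total_azs")
    if idle_per_az < 0:
        raise ValueError("idle_per_az must be non-negative")
    idle_azs = total_azs - busy_azs
    return idle_azs * idle_per_az * IDLE_INSTANCE_COST_USD


if __name__ == "__main__":
    # Illustrative case: a 3-AZ ASG whose tasks fit on instances in one AZ
    print(idle_overhead(total_azs=3, busy_azs=1))  # 800
```

In that illustrative three-AZ case, the idle capacity alone would add about $800/month before serving any traffic.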
How can we get reliable task placement and keep multi-AZ resilience without paying for idle capacity? Specifically, we’re looking for guidance on balancing reliable task placement, multi-AZ fault tolerance, and cost in an ECS-on-EC2 environment.