diff --git a/knowledge base/cloud computing/aws/ecs.md b/knowledge base/cloud computing/aws/ecs.md index 1a189d9..608d001 100644 --- a/knowledge base/cloud computing/aws/ecs.md +++ b/knowledge base/cloud computing/aws/ecs.md @@ -22,14 +22,16 @@ 1. [Docker volumes](#docker-volumes) 1. [Bind mounts](#bind-mounts) 1. [Networking](#networking) -1. [Intra-task container dependencies](#intra-task-container-dependencies) + 1. [Connecting to a service](#connecting-to-a-service) + 1. [Allow tasks to communicate with each other](#allow-tasks-to-communicate-with-each-other) + 1. [Load Balancer](#load-balancer) + 1. [ECS Service Connect](#ecs-service-connect) + 1. [ECS service discovery](#ecs-service-discovery) + 1. [VPC Lattice](#vpc-lattice) +1. [Container dependencies](#container-dependencies) 1. [Execute commands in tasks' containers](#execute-commands-in-tasks-containers) 1. [Scale the number of tasks automatically](#scale-the-number-of-tasks-automatically) 1. [Target tracking](#target-tracking) -1. [Allow tasks to communicate with each other](#allow-tasks-to-communicate-with-each-other) - 1. [ECS Service Connect](#ecs-service-connect) - 1. [ECS service discovery](#ecs-service-discovery) - 1. [VPC Lattice](#vpc-lattice) 1. [Scrape metrics using Prometheus](#scrape-metrics-using-prometheus) 1. [Send logs to a central location](#send-logs-to-a-central-location) 1. [FireLens](#firelens) @@ -39,9 +41,11 @@ 1. [Mount Secrets Manager secrets as files in containers](#mount-secrets-manager-secrets-as-files-in-containers) 1. [Make a sidecar container write secrets to shared volumes](#make-a-sidecar-container-write-secrets-to-shared-volumes) 1. [Best practices](#best-practices) -1. [Cost-saving measures](#cost-saving-measures) +1. [Pricing](#pricing) + 1. [Cost-saving measures](#cost-saving-measures) 1. [Troubleshooting](#troubleshooting) 1. [Invalid 'cpu' setting for task](#invalid-cpu-setting-for-task) + 1. 
[Tasks in a service using a Load Balancer are being stopped even if healthy](#tasks-in-a-service-using-a-load-balancer-are-being-stopped-even-if-healthy) 1. [Further readings](#further-readings) 1. [Sources](#sources) @@ -1015,7 +1019,359 @@ one's behalf.
Such role is automatically created when creating a cluster, or when creating or updating a service in the AWS Management Console. -## Intra-task container dependencies +### Connecting to a service + +Services create Tasks.
+Each Task is granted an IP address or can otherwise be reached depending on its network configuration. + +At this point there is no way to refer to multiple replicas as a single one.
+One can: + +- Manually add a generic Route53 record, containing the references to those Tasks, and refer to it. +- Create a proxy that forwards (and maybe load balances) the traffic to those Tasks. +- Leverage a Load Balancer to forward and balance the traffic. + + > [!warning] + > This is the easiest way to allow reachability, but it could be also the most expensive if incorrectly configured. + + ```ts + const service = new aws.ecs.Service( + "someApp", + { + name: "someApp", + …, + loadBalancers: [{ + containerName: "service", + containerPort: 8080, + targetGroupArn: targetGroup.arn, + }], + }, + ); + ``` + +- Leverage other types of resources to allow communication between services and applications.
+ See also [Allow tasks to communicate with each other]. + +### Allow tasks to communicate with each other + +Refer [How can I allow the tasks in my Amazon ECS services to communicate with each other?] and +[Interconnect Amazon ECS services]. + +Tasks in a cluster are **not** normally able to communicate with each other.
+Use a Load Balancer, ECS Service Connect, ECS service discovery or VPC Lattice to allow that. + +#### Load Balancer + +Configure a Load Balancer for the services, and optionally a mnemonic Route53 `CNAME` record (or alias) for them, then +call the corresponding FQDN.
+Or leverage an existing one if that is the case. + +It is the easiest way, but it is overkill and expensive if all one wants to do is routing internal traffic. + +#### ECS Service Connect + +Refer [Use Service Connect to connect Amazon ECS services with short names]. + +ECS Service Connect provides ECS clusters with the configuration they need for service-to-service discovery, +connectivity, and traffic monitoring by building both service discovery and a service mesh in the clusters. + +It provides: + +- The complete configuration services need to join the mesh. +- A unified way to refer to services within namespaces that does **not** depend on the VPC's DNS configuration. +- Standardized metrics and logs to monitor all the applications. + +The feature creates a virtual network of related services.
+The same service configuration can be used across different namespaces to run independent yet identical sets of +applications. + +When using Service Connect, ECS dynamically manages Service Connect endpoints for each task as they start and stop. It +does so by injecting the definition of a _sidecar_ proxy container **in services**. This does **not** change their task +definition.
+Each task created for each registered service will end up running the sidecar proxy container, so that the task +is added to the mesh. + +Injecting the proxy in the services and not in the task definitions allows for the same task definition to be reused to +run identical applications in different namespaces with different Service Connect configurations.
+It also means that, since the proxy is **not** in the task definition, it **cannot** be configured by users. + +Service Connect **only** interconnects **services** within the **same** namespace. + +One can add one Service Connect configuration to new or existing services.
+When that happens, ECS creates: + +- A Service Connect endpoint in the namespace. +- A new deployment in the service that replaces the tasks that are currently running with ones equipped with the proxy. + +Existing tasks and other applications can continue to connect to existing endpoints and external applications.
+If a service using Service Connect adds tasks by scaling out, new connections from clients will be load balanced between +**all** of the running tasks. If the service is updated, new connections from clients will be load balanced only between +the **new** version of the tasks. + +The list of endpoints in the namespace changes every time **any** service in that namespace is deployed.
+Existing tasks, and replacement tasks, continue to behave the same as they did after the most recent deployment.
+Existing tasks **cannot** resolve and connect to new endpoints. Only tasks with a Service Connect configuration in the +same namespace **and** that start running after this deployment can. + +Applications can use short names and standard ports to connect to **services** in the same or other clusters.
+This includes connecting across VPCs in the same AWS Region. + +By default, the Service Connect proxy listens on the `containerPort` specified in the task definition's port +mapping.
+The service's Security Group rules **must** allow incoming traffic to this port from the subnets where clients will run. + +The proxy will consume some of the resources allocated to its task.
+It is recommended to: + +- Add at least 256 CPU units and 64 MiB of memory to the task's resources. +- \[If expecting tasks to receive more than 500 requests per second at their peak load] Increase the CPU units + added for the sidecar to at least 512. +- \[If expecting to create more than 100 Service Connect services in the namespace, or 2000 tasks in total across all + ECS services within the namespace] Add 128 MiB extra of memory for the Service Connect proxy container.
+ One **must** do this in **every** task definition that is used by **any** of the ECS services in the namespace. + +It is recommended one sets the log configuration in the Service Connect configuration. + +Proxy configuration: + +- Tasks in a Service Connect endpoint are load balanced in a `round-robin` strategy. +- The proxy uses data about prior failed connections to avoid sending new connections to the tasks that had the failed + connections for some time.
+ At the time of writing, failing 5 or more connections in the last 30 seconds makes the proxy avoid that task for 30 to + 300 seconds. +- Connections that pass through the proxy and fail are retried, but **avoid** the host that failed the previous + connection.
+ This ensures that each connection through Service Connect doesn't fail for one-off reasons. +- Wait a maximum time for applications to respond.
+ The default timeout value is 15 seconds, but it can be updated. + +
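The failure-avoidance behaviour in the list above can be pictured with a minimal TypeScript sketch. It is **not** the Service Connect proxy's implementation. The class name `OutlierTracker`, its methods and the fixed 30-second ejection are assumptions made for illustration, since the text only states "5 or more failures in the last 30 seconds" and "30 to 300 seconds".

```typescript
// Hypothetical illustration only: tracks failed connections per task and
// temporarily avoids tasks that failed too often in a sliding time window.
class OutlierTracker {
  private readonly threshold: number;
  private readonly windowMs: number;
  private readonly ejectionMs: number;
  private readonly failures = new Map<string, number[]>();
  private readonly ejectedUntil = new Map<string, number>();

  // Defaults mirror the numbers quoted above; the ejection time is assumed
  // to be the lower bound (30 s) of the 30-300 s range.
  constructor(threshold = 5, windowMs = 30_000, ejectionMs = 30_000) {
    this.threshold = threshold;
    this.windowMs = windowMs;
    this.ejectionMs = ejectionMs;
  }

  // Record a failed connection to `task` at time `nowMs` (milliseconds).
  recordFailure(task: string, nowMs: number): void {
    const recent = (this.failures.get(task) ?? []).filter(
      (timestamp) => nowMs - timestamp < this.windowMs,
    );
    recent.push(nowMs);
    if (recent.length >= this.threshold) {
      // Too many recent failures: avoid the task for a while and start over.
      this.ejectedUntil.set(task, nowMs + this.ejectionMs);
      this.failures.set(task, []);
    } else {
      this.failures.set(task, recent);
    }
  }

  // A task can receive new connections once its ejection period is over.
  isAvailable(task: string, nowMs: number): boolean {
    return nowMs >= (this.ejectedUntil.get(task) ?? 0);
  }
}
```

Passing the current time explicitly keeps the sketch deterministic and easy to test. A real proxy would read a clock and would likely scale the ejection time with repeated failures.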
+ Limitations + +Service Connect does **not** support: + +- ECS' `host` network mode. +- Windows containers. +- HTTP 1.0. +- Standalone tasks and any task created by other resources than services. +- Services using the `blue/green` or `external deployment` types. +- External container instance for ECS Anywhere. +- PPv2. +- Task definitions that set _container_ memory limits.
+ It is required to set the _task_ memory limit, though. + +Tasks using the `bridge` network mode and Service Connect will **not** support the `hostname` container definition +parameter. + +Each service can belong to only one namespace. + +Service Connect can use any AWS Cloud Map namespace, as long as they are in the **same** Region **and** AWS account. + +Service Connect does **not** delete namespaces when clusters are deleted.
+One must delete namespaces in AWS Cloud Map oneself. + +
+ +
+ Requirements + +- Tasks running in Fargate **must** use the Fargate Linux platform version 1.4.0 or higher. +- The ECS agent on container instances must be version 1.67.2 or higher. +- Container instances must run the ECS-optimized Amazon Linux 2023 AMI version `20230428` or later, or the ECS-optimized + Amazon Linux 2 AMI version `2.0.20221115` or later.
+ These versions equip the Service Connect agent in addition to the ECS container agent. +- Container instances must have the `ecs:Poll` permission assigned to them for resource + `arn:aws:ecs:{{region}}:{{accountId}}:task-set/cluster/*`.
+ If using the `ecsInstanceRole` or `AmazonEC2ContainerServiceforEC2Role` IAM roles, there is no need for additional + permissions. +- Services **must** use the **rolling deployment** strategy, as it is the only one supported. +- Task definitions **must** set their task's memory limit. +- The task memory limit must be set to a number **greater** than the sum of the container memory limits.
+ The CPU and memory in the task limits that aren't allocated in the container limits will be used by the Service + Connect's proxy container and other containers that don't set container limits. +- All endpoints must be **unique** within their namespace. +- All discovery names must be **unique** within their namespace. +- One **must** redeploy existing services before applications can resolve the new endpoints.
+ New endpoints that are added to the namespace **after** the service's most recent deployment **will not** be added to + the proxy configuration. +- Application Load Balancer traffic defaults to routing through the Service Connect agent in `awsvpc` network mode.
+ If one wants non-service traffic to bypass the Service Connect agent, one will need to use the `ingressPortOverride` + parameter in their Service Connect service configuration. + +
+ +Procedure: + +1. Configure the ECS cluster to use the desired AWS Cloud Map namespace. + +
+ Simplified process + + Create the cluster with the desired name for the AWS Cloud Map namespace, and specify that name for the namespace + when asked.
+ ECS will create a new HTTP namespace with the necessary configuration.
+ As a reminder, Service Connect doesn't use or create DNS hosted zones in Amazon Route 53. FIXME: check this + +
+ +1. Configure port names in the server services' task definitions for all the port mappings that the services will expose + in Service Connect. + +
+ + ```json + containerDefinitions: [{ + "name": "postgres", + "protocol": "tcp", + "containerPort": 5432 + }] + ``` + +
+ +1. Configure the server services to create Service Connect endpoints within the namespace. + +
+ + ```json + "serviceConnectConfiguration": { + "enabled": true, + "namespace": "ecs-dev-cluster", + "services": [{ + "portName": "postgres", + "discoveryName": "postgres", + "clientAliases": [{ + "port": 5432, + "dnsName": "pgsql" + }] + }] + } + ``` + +
+ +1. Deploy the services.
+ This will create the endpoints in the AWS Cloud Map namespace used by the cluster.
+ ECS also injects the Service Connect proxy container in each task. +1. Deploy the client applications as ECS services.
+ ECS connects them to the Service Connect endpoints through the Service Connect proxy in each task. +1. Applications only use the proxy to connect to Service Connect endpoints.
+ No additional configuration is required to use the proxy. +1. \[optionally] Monitor traffic through the Service Connect proxy in Amazon CloudWatch. + +#### ECS service discovery + +Service discovery helps manage HTTP and DNS namespaces for ECS services. + +ECS automatically registers and de-registers the list of launched tasks to AWS Cloud Map.
+Cloud Map maintains DNS records that resolve to the internal IP addresses of one or more tasks from registered +services.
+Other services in the **same** VPC can use such DNS records to send traffic directly to containers using their internal +IP addresses. + +This approach provides low latency since traffic travels directly between the containers. + +ECS service discovery is a good fit when using the `awsvpc` network mode, where: + +- Each task is assigned its own, unique IP address. +- That IP address is an `A` record. +- Each service can have a unique security group assigned. + +When using _bridged network_ mode, `A` records are no longer enough for service discovery and one **must** also use a +`SRV` DNS record. This is due to containers sharing the same IP address and having ports mapped randomly.
+`SRV` records can keep track of both IP addresses and port numbers, but require applications to be appropriately +configured. + +Service discovery supports only the `A` and `SRV` DNS record types.
+DNS records are automatically added or removed as tasks start or stop for ECS services. + +Task registration in CloudMap might take some seconds to finish.
+Until ECS registers the tasks, Containers in them might complain about being unable to resolve the services they are +using. + +DNS records have a TTL and it might happen that tasks died before this ended.
+One **must** implement extra logic in one's applications, so that they can handle retries and deal with connection +failures when the records are not yet updated. + +See also [Use service discovery to connect Amazon ECS services with DNS names]. + +Procedure: + +1. Create the desired AWS Cloud Map namespace. +1. Create the desired Cloud Map service in the namespace. +1. Configure the ECS service offering acting as server to use the Cloud Map service. + +
+ + ```json + "serviceRegistries": [{ + "registryArn": "arn:aws:servicediscovery:eu-west-1:012345678901:service/srv-uuf33b226vw93biy" + }] + ``` + +
+ +`nslookup` commands run from within containers might fail, but the containers might still be able to resolve services +registered in CloudMap namespaces. + +
+ +```sh +$ aws ecs execute-command --cluster 'dev' \ + --task 'arn:aws:ecs:eu-west-1:012345678901:task/dev/abcdef0123456789abcdef0123456789' --container 'prometheus' \ + --interactive --command 'nslookup mimir.dev.ecs.internal' + +The Session Manager plugin was installed successfully. Use the AWS CLI to start a session. + +Starting session with SessionId: ecs-execute-command-p3pkkrysjdptxa8iu3cz3kxnke +Server: 172.16.0.2 +Address: 172.16.0.2:53 + +Non-authoritative answer: + +$ aws ecs execute-command --cluster 'dev' \ + --task 'arn:aws:ecs:eu-west-1:012345678901:task/dev/abcdef0123456789abcdef0123456789' --container 'prometheus' \ + --interactive --command 'wget -SO- mimir.dev.ecs.local:8080/ready' + +The Session Manager plugin was installed successfully. Use the AWS CLI to start a session. + +Starting session with SessionId: ecs-execute-command-hjgyio7n6nf2o9h4qn6ht7lzri +Connecting to mimir.dev.ecs.local:8080 (172.16.88.99:8080) + HTTP/1.1 200 OK + Date: Thu, 08 May 2025 09:35:02 GMT + Content-Type: text/plain + Content-Length: 5 + Connection: close + +saving to '/dev/stdout' +stdout 100% |********************************| 5 0:00:00 ETA +'/dev/stdout' saved + +Exiting session with sessionId: ecs-execute-command-hjgyio7n6nf2o9h4qn6ht7lzri. +``` + +
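The retry logic that service discovery asks of client applications (registration delays, stale DNS records) could look like the TypeScript sketch below. The helper name `retryWithBackoff` and its default values are assumptions for illustration, not part of any AWS SDK.

```typescript
// Hypothetical helper: retries an async operation with exponential backoff.
// Useful when a DNS name is not resolvable yet, or still points to a dead task.
async function retryWithBackoff<T>(
  operation: () => Promise<T>,
  maxAttempts = 5,
  baseDelayMs = 200,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      if (attempt < maxAttempts - 1) {
        // Wait 200 ms, 400 ms, 800 ms, ... before the next attempt.
        const delayMs = baseDelayMs * 2 ** attempt;
        await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  throw lastError;
}

// Usage sketch, reusing the endpoint from the shell example above:
// const response = await retryWithBackoff(() => fetch("http://mimir.dev.ecs.local:8080/ready"));
```

Keeping the operation as a parameter means the same helper can wrap HTTP calls, database connections or explicit DNS lookups.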
+ +#### VPC Lattice + +Managed application networking service that customers can use to observe, secure, and monitor applications built across +AWS compute services, VPCs, and accounts without having to modify their code. + +VPC Lattice technically replaces the need for Application Load Balancers by leveraging target groups themselves.
+Target groups are collections of compute resources, and can refer to EC2 instances, IP addresses, Lambda functions, +and Application Load Balancers.
+Listeners are used to forward traffic to specified target groups when the conditions are met.
+ECS also automatically replaces unhealthy tasks. + +ECS tasks can be enabled **as IP targets** in VPC Lattice by associating their services with a VPC Lattice target +group.
+ECS automatically registers tasks to the VPC Lattice target group when they are launched for registered services. + +Deployments _might_ take longer when using VPC Lattice due to the extent of changes required. + +See also [What is Amazon VPC Lattice?] and its [Amazon VPC Lattice pricing]. + +## Container dependencies Containers can depend on other containers **from the same task**.
On startup, ECS evaluates all container dependency conditions and starts the containers only when the required @@ -1335,317 +1691,6 @@ The **only** available metrics for the integrated checks are currently: - The service's **average** memory utilization (`ECSServiceMemoryUtilization`) for the last minute. - The service's Application Load Balancer's **average** requests count (`ALBRequestCountPerTarget`) for the last minute. -## Allow tasks to communicate with each other - -Refer [How can I allow the tasks in my Amazon ECS services to communicate with each other?] and -[Interconnect Amazon ECS services]. - -Tasks in a cluster are **not** normally able to communicate with each other.
-Use ECS Service Connect, ECS service discovery or VPC Lattice to allow that. - -### ECS Service Connect - -Refer [Use Service Connect to connect Amazon ECS services with short names]. - -ECS Service Connect provides ECS clusters with the configuration they need for service-to-service discovery, -connectivity, and traffic monitoring by building both service discovery and a service mesh in the clusters. - -It provides: - -- The complete configuration services need to join the mesh. -- A unified way to refer to services within namespaces that does **not** depend on the VPC's DNS configuration. -- Standardized metrics and logs to monitor all the applications. - -The feature creates a virtual network of related services.
-The same service configuration can be used across different namespaces to run independent yet identical sets of -applications. - -When using Service Connect, ECS dynamically manages Service Connect endpoints for each task as they start and stop. It -does so by injecting the definition of a _sidecar_ proxy container **in services**. This does **not** change their task -definition.
-Each task created for each registered service will end up running the sidecar proxy container in order, so that the task -is added to the mesh. - -Injecting the proxy in the services and not in the task definitions allows for the same task definition to be reused to -run identical applications in different namespaces with different Service Connect configurations.
-It also means that, since the proxy is **not** in the task definition, it **cannot** be configured by users. - -Service Connect **only** interconnects **services** within the **same** namespace. - -One can add one Service Connect configuration to new or existing services.
-When that happens, ECS creates: - -- A Service Connect endpoint in the namespace. -- A new deployment in the service that replaces the tasks that are currently running with ones equipped with the proxy. - -Existing tasks and other applications can continue to connect to existing endpoints and external applications.
-If a service using Service Connect adds tasks by scaling out, new connections from clients will be load balanced between -**all** of the running tasks. If the service is updated, new connections from clients will be load balanced only between -the **new** version of the tasks. - -The list of endpoints in the namespace changes every time **any** service in that namespace is deployed.
-Existing tasks, and replacement tasks, continue to behave the same as they did after the most recent deployment.
-Existing tasks **cannot** resolve and connect to new endpoints. Only tasks with a Service Connect configuration in the -same namespace **and** that start running after this deployment can. - -Applications can use short names and standard ports to connect to **services** in the same or other clusters.
-This includes connecting across VPCs in the same AWS Region. - -By default, the Service Connect proxy listens on the `containerPort` specified in the task definition's port -mapping.
-The service's Security Group rules **must** allow incoming traffic to this port from the subnets where clients will run. - -The proxy will consume some of the resources allocated to their task.
-It is recommended: - -- Adding at least 256 CPU units and 64 MiB of memory to the task's resources. -- \[If expecting tasks to receive more than 500 requests per second at their peak load] Increasing the sidecar's - resources addition to at least 512 CPU units. -- \[If expecting to create more than 100 Service Connect services in the namespace, or 2000 tasks in total across all - ECS services within the namespace], Adding 128 MiB extra of memory for the Service Connect proxy container.
- One **must** do this in **every** task definition that is used by **any** of the ECS services in the namespace. - -It is recommended one sets the log configuration in the Service Connect configuration. - -Proxy configuration: - -- Tasks in a Service Connect endpoint are load balanced in a `round-robin` strategy. -- The proxy uses data about prior failed connections to avoid sending new connections to the tasks that had the failed - connections for some time.
- At the time of writing, failing 5 or more connections in the last 30 seconds makes the proxy avoid that task for 30 to - 300 seconds. -- Connection that pass through the proxy and fail are retried, but **avoid** the host that failed the previous - connection.
- This ensures that each connection through Service Connect doesn't fail for one-off reasons. -- Wait a maximum time for applications to respond.
- The default timeout value is 15 seconds, but it can be updated. - -
- Limitations - -Service Connect does **not** support: - -- ECS' `host` network mode. -- Windows containers. -- HTTP 1.0. -- Standalone tasks and any task created by other resources than services. -- Services using the `blue/green` or `external deployment` types. -- External container instance for ECS Anywhere. -- PPv2. -- Task definitions that set _container_ memory limits.
- It is required to set the _task_ memory limit, though. - -Tasks using the `bridge` network mode and Service Connect will **not** support the `hostname` container definition -parameter. - -Each service can belong to only one namespace. - -Service Connect can use any AWS Cloud Map namespace, as long as they are in the **same** Region **and** AWS account. - -Service Connect does **not** delete namespaces when clusters are deleted.
-One must delete namespaces in AWS Cloud Map themselves. - -
- -
- Requirements - -- Tasks running in Fargate **must** use the Fargate Linux platform version 1.4.0 or higher. -- The ECS agent on container instances must be version 1.67.2 or higher. -- Container instances must run the ECS-optimized Amazon Linux 2023 AMI version `20230428` or later, or the ECS-optimized - Amazon Linux 2 AMI version `2.0.20221115` or later.
- These versions equip the Service Connect agent in addition to the ECS container agent. -- Container instances must have the `ecs:Poll` permission assigned to them for resource - `arn:aws:ecs:{{region}}:{{accountId}}:task-set/cluster/*`.
- If using the `ecsInstanceRole` or `AmazonEC2ContainerServiceforEC2Role` IAM roles, there is no need for additional - permissions. -- Services **must** use the **rolling deployment** strategy, as it is the only one supported. -- Task definitions **must** set their task's memory limit. -- The task memory limit must be set to a number **greater** than the sum of the container memory limits.
- The CPU and memory in the task limits that aren't allocated in the container limits will be used by the Service - Connect's proxy container and other containers that don't set container limits. -- All endpoints must be **unique** within their namespace. -- All discovery names must be **unique** within their namespace. -- One **must** redeploy existing services before applications can resolve the new endpoints.
- New endpoints that are added to the namespace **after** the service's most recent deployment **will not** be added to - the proxy configuration. -- Application Load Balancer traffic defaults to routing through the Service Connect agent in `awsvpc` network mode.
- If one wants non-service traffic to bypass the Service Connect agent, one will need to use the `ingressPortOverride` - parameter in their Service Connect service configuration. - -
- -Procedure: - -1. Configure the ECS cluster to use the desired AWS Cloud Map namespace. - -
- Simplified process - - Create the cluster with the desired name for the AWS Cloud Map namespace, and specify that name for the namespace - when asked.
- ECS will create a new HTTP namespace with the necessary configuration.
- As reminder, Service Connect doesn't use or create DNS hosted zones in Amazon Route 53. FIXME: check this - -
- -1. Configure port names in the server services' task definitions for all the port mappings that the services will expose - in Service Connect. - -
- - ```json - containerDefinitions: [{ - "name": "postgres", - "protocol": "tcp", - "containerPort": 5432 - }] - ``` - -
- -1. Configure the server services to create Service Connect endpoints within the namespace. - -
- - ```json - "serviceConnectConfiguration": { - "enabled": true, - "namespace": "ecs-dev-cluster", - "services": [{ - "portName": "postgres", - "discoveryName": "postgres", - "clientAliases": [{ - "port": 5432, - "dnsName": "pgsql" - }] - }] - } - ``` - -
- -1. Deploy the services.
- This will create the endpoints AWS Cloud Map namespace used by the cluster.
- ECS also injects the Service Connect proxy container in each task. -1. Deploy the client applications as ECS services.
- ECS connects them to the Service Connect endpoints through the Service Connect proxy in each task. -1. Applications only use the proxy to connect to Service Connect endpoints.
- No additional configuration is required to use the proxy. -1. \[optionally] Monitor traffic through the Service Connect proxy in Amazon CloudWatch. - -### ECS service discovery - -Service discovery helps manage HTTP and DNS namespaces for ECS services. - -ECS automatically registers and de-registers the list of launched tasks to AWS Cloud Map.
-Cloud Map maintains DNS records that resolve to the internal IP addresses of one or more tasks from registered -services.
-Other services in the **same** VPC can use such DNS records to send traffic directly to containers using their internal -IP addresses. - -This approach provides low latency since traffic travels directly between the containers. - -ECS service discovery is a good fit when using the `awsvpc` network mode, where: - -- Each task is assigned its own, unique IP address. -- That IP address is an `A` record. -- Each service can have a unique security group assigned. - -When using _bridged network_ mode, `A` records are no longer enough for service discovery and one **must** also use a -`SRV` DNS record. This is due to containers sharing the same IP address and having ports mapped randomly.
-`SRV` records can keep track of both IP addresses and port numbers, but requires applications to be appropriately -configured. - -Service discovery supports only the `A` and `SRV` DNS record types.
-DNS records are automatically added or removed as tasks start or stop for ECS services. - -Task registration in CloudMap might take some seconds to finish.
-Until ECS registers the tasks, Containers in them might complain about being unable to resolve the services they are -using. - -DNS records have a TTL and it might happen that tasks died before this ended.
-One **must** implement extra logic in one's applications, so that they can handle retries and deal with connection -failures when the records are not yet updated. - -See also [Use service discovery to connect Amazon ECS services with DNS names]. - -Procedure: - -1. Create the desired AWS Cloud Map namespace. -1. Create the desired Cloud Map service in the namespace. -1. Configure the ECS service offering acting as server to use the Cloud Map service. - -
- - ```json - "serviceRegistries": [{ - "registryArn": "arn:aws:servicediscovery:eu-west-1:012345678901:service/srv-uuf33b226vw93biy" - }] - ``` - -
- -NS lookup commands from within containers might fail, but they might still be able to resolve services registered in -CloudMap namespaces. - -
- -```sh -$ aws ecs execute-command --cluster 'dev' \ - --task 'arn:aws:ecs:eu-west-1:012345678901:task/dev/abcdef0123456789abcdef0123456789' --container 'prometheus' \ - --interactive --command 'nslookup mimir.dev.ecs.internal' - -The Session Manager plugin was installed successfully. Use the AWS CLI to start a session. - -Starting session with SessionId: ecs-execute-command-p3pkkrysjdptxa8iu3cz3kxnke -Server: 172.16.0.2 -Address: 172.16.0.2:53 - -Non-authoritative answer: - -$ aws ecs execute-command --cluster 'dev' \ - --task 'arn:aws:ecs:eu-west-1:012345678901:task/dev/abcdef0123456789abcdef0123456789' --container 'prometheus' \ - --interactive --command 'wget -SO- mimir.dev.ecs.local:8080/ready' - -The Session Manager plugin was installed successfully. Use the AWS CLI to start a session. - -Starting session with SessionId: ecs-execute-command-hjgyio7n6nf2o9h4qn6ht7lzri -Connecting to mimir.dev.ecs.local:8080 (172.16.88.99:8080) - HTTP/1.1 200 OK - Date: Thu, 08 May 2025 09:35:02 GMT - Content-Type: text/plain - Content-Length: 5 - Connection: close - -saving to '/dev/stdout' -stdout 100% |********************************| 5 0:00:00 ETA -'/dev/stdout' saved - -Exiting session with sessionId: ecs-execute-command-hjgyio7n6nf2o9h4qn6ht7lzri. -``` - -
- -### VPC Lattice - -Managed application networking service that customers can use to observe, secure, and monitor applications built across -AWS compute services, VPCs, and accounts without having to modify their code. - -VPC Lattice technically replaces the need for Application Load Balancers by leveraging target groups themselves.
-Target groups which are a collection of compute resources, and can refer EC2 instances, IP addresses, Lambda functions, -and Application Load Balancers.
-Listeners are used to forward traffic to specified target groups when the conditions are met.
-ECS also automatically replaces unhealthy tasks. - -ECS tasks can be enabled **as IP targets** in VPC Lattice by associating their services with a VPC Lattice target -group.
-ECS automatically registers tasks to the VPC Lattice target group when they are launched for registered services. - -Deployments _might_ take longer when using VPC Lattice due to the extent of changes required. - -See also [What is Amazon VPC Lattice?] and its [Amazon VPC Lattice pricing]. - ## Scrape metrics using Prometheus Refer [Prometheus service discovery for AWS ECS] and [Scraping Prometheus metrics from applications running in AWS ECS]. @@ -2034,7 +2079,75 @@ Useful when wanting multiple containers to access the same secret, or just clean - When using **spot** compute capacity, consider ensuring containers exit gracefully before the task stops.
Refer [Capacity providers]. -## Cost-saving measures +## Pricing + +Refer [AWS Fargate Pricing] and [A Simple Breakdown of Amazon ECS Pricing]. + +20 GB of ephemeral storage per task are **included**. + +**Hourly** costs in `eu-west-1` as per 2026-02-10 (tax _excluded_): + +| Provider | Capacity Type | Architecture | OS | Resource | Price | +| -------- | ------------- | ------------ | ------- | ---------------------------- | ------------------------------- | +| Fargate | On-Demand | X86 | Linux | 1 vCPU | $0.04048 | +| Fargate | On-Demand | X86 | Linux | 1 GB RAM | $0.004445 | +| Fargate | SPOT | X86 | Linux | 1 vCPU | $0.01467395 | +| Fargate | SPOT | X86 | Linux | 1 GB RAM | $0.00161131 | +| Fargate | On-Demand | ARM | Linux | 1 vCPU | $0.03238 | +| Fargate | On-Demand | ARM | Linux | 1 GB RAM | $0.00356 | +| Fargate | SPOT | ARM | Linux | 1 vCPU | $0.01173771 | +| Fargate | SPOT | ARM | Linux | 1 GB RAM | $0.0012905 | +| Fargate | On-Demand | X86 | Windows | 1 vCPU | $0.046552 + $0.046 (OS license) | +| Fargate | On-Demand | X86 | Windows | 1 GB RAM | $0.00511175 | +| Fargate | Any | Any | Any | 1 GB extra ephemeral storage | $0.000122 | + +
+ Example: Fargate (Linux, X86) + +| Resource | Amount | 1h | 1d | 1m(31d) | 1y(366d) | +| ----------------------- | -----: | --------: | -------: | --------: | ---------: | +| vCPU | 0.5 | $0.02024 | $0.48576 | $15.05856 | $177.78816 | +| RAM | 1 GB | $0.004445 | $0.10668 | $3.30708 | $39.04488 | +| Extra ephemeral storage | 5 GB | $0.00061 | $0.01464 | $0.45384 | $5.35824 | + +Total: ~$0.03 per hour, ~$0.61 per day, ~$18.82 per 31d-month, ~$222.20 per 366d-year. + +--- + +| Resource | Amount | 1h | 1d | 1m(31d) | 1y(366d) | +| ----------------------- | -----: | -------: | -------: | ---------: | -----------: | +| vCPU | 4 | $0.16192 | $3.88608 | $120.46848 | $1,422.30528 | +| RAM | 20 GB | $0.0889 | $2.1336 | $66.1416 | $780.8976 | +| Extra ephemeral storage | 0 GB | $0.00 | $0.00 | $0.00 | $0.00 | + +Total: ~$0.26 per hour, ~$6.02 per day, ~$186.62 per 31d-month, ~$2,203.21 per 366d-year. + +
+ +
+ Example: Fargate SPOT (Linux, ARM) + +| Resource | Amount | 1h | 1d | 1m(31d) | 1y(366d) | +| ----------------------- | -----: | -----------: | ----------: | ----------: | -----------: | +| vCPU | 0.5 | $0.005868855 | $0.14085252 | $4.36642812 | $51.55202232 | +| RAM | 1 GB | $0.0012905 | $0.030972 | $0.960132 | $11.335752 | +| Extra ephemeral storage | 5 GB | $0.00061 | $0.01464 | $0.45384 | $5.35824 | + +Total: ~$0.01 per hour, ~$0.19 per day, ~$5.79 per 31d-month, ~$68.25 per 366d-year. + +--- + +| Resource | Amount | 1h | 1d | 1m(31d) | 1y(366d) | +| ----------------------- | -----: | ----------: | ----------: | -----------: | ------------: | +| vCPU | 4 | $0.04695084 | $1.12682016 | $34.93142496 | $412.41617856 | +| RAM | 20 GB | $0.02581 | $0.61944 | $19.20264 | $226.71504 | +| Extra ephemeral storage | 0 GB | $0.00 | $0.00 | $0.00 | $0.00 | + +Total: ~$0.08 per hour, ~$1.75 per day, ~$54.14 per 31d-month, ~$639.14 per 366d-year. + +
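The figures in the examples above can be re-derived with a few lines of arithmetic. The following is a minimal sketch, not an official calculator: the prices are copied from the table above, the function names are hypothetical, and rounding totals **up** to the next cent is an assumption that happens to match the totals shown.

```python
from decimal import Decimal, ROUND_CEILING

# Hourly Fargate prices in eu-west-1, copied from the table above (USD, tax excluded).
PRICES = {
    ("on-demand", "x86"): {"vcpu": Decimal("0.04048"), "gb_ram": Decimal("0.004445")},
    ("spot", "arm"): {"vcpu": Decimal("0.01173771"), "gb_ram": Decimal("0.0012905")},
}
# Price per GB of ephemeral storage beyond the 20 GB included with each task.
EXTRA_STORAGE_GB_HOUR = Decimal("0.000122")


def hourly_cost(capacity, arch, vcpus, ram_gb, extra_storage_gb=0):
    """Return the exact hourly cost of a single task as a Decimal."""
    price = PRICES[(capacity, arch)]
    return (
        Decimal(str(vcpus)) * price["vcpu"]
        + Decimal(str(ram_gb)) * price["gb_ram"]
        + Decimal(str(extra_storage_gb)) * EXTRA_STORAGE_GB_HOUR
    )


def cost_over_days(hourly, days):
    """Scale an hourly cost to a number of full days."""
    return hourly * 24 * days


def round_up_to_cent(amount):
    """Round up to the next cent (assumption: mirrors the totals above)."""
    return amount.quantize(Decimal("0.01"), rounding=ROUND_CEILING)


small = hourly_cost("on-demand", "x86", vcpus=0.5, ram_gb=1, extra_storage_gb=5)
print("hour:", small)
print("31d: ", round_up_to_cent(cost_over_days(small, 31)))
print("366d:", round_up_to_cent(cost_over_days(small, 366)))
```

Using `Decimal` instead of floats keeps the intermediate values exact, so the results can be compared digit by digit with the tables.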
+ +### Cost-saving measures - Prefer using ARM-based compute capacity over the default `X86_64`, where feasible.
Refer [CPU architectures]. @@ -2043,7 +2156,7 @@ Useful when wanting multiple containers to access the same secret, or just clean - Prefer using [**spot** capacity][effectively using spot instances in aws ecs for production workloads] for - Non-critical services and tasks. - - State**less** or otherwise **interruption tolerant** tasks. + - State**less** or otherwise **interruption-tolerant** tasks. Refer [Capacity providers]. - Consider applying for EC2 Instance and/or Compute Savings Plans if using EC2 capacity.
@@ -2061,7 +2174,7 @@ Useful when wanting multiple containers to access the same secret, or just clean > Mind the limitations that come with the auto scaling settings. - If only used internally (e.g., via a VPN), consider configuring intra-network communication capabilities for the - application **instead of** using a load balancer.
+ application (e.g., via Cloud Map) **instead of** using load balancers.
Refer [Allow tasks to communicate with each other]. ## Troubleshooting @@ -2088,6 +2201,55 @@ Specify a supported value for the task CPU and memory in your task definition. +### Tasks in a service using a Load Balancer are being stopped even if healthy + +
+ Context + +One or more container definitions in the Task define a health check. + +Traffic to the Service is served by a Load Balancer.
+The Load Balancer uses a Target Group where the Service registers Tasks. + +The containers' health checks pass and the Task is considered _healthy_. + +Messages like the following are visible from the Service's page in ECS or from the Load Balancer or Target Group's +pages: + +- `service X deregistered targets` +- `task stopped because it failed ELB health checks` +- `Health checks failed` + +
+ +
+ Cause + +Load Balancing and ECS are integrated. + +The Target Group defines its own health check in order to decide whether to serve traffic to specific targets. + +The containers' health checks and the Target Group's health check are completely separate. + +The containers' health checks only require ECS to communicate with the container engine.
+If a container's health check fails, the Task is deemed _unhealthy_ and ECS replaces it. + +ECS also reacts to the opinion of an associated Load Balancer (and hence of its Target Group).
+If the Target Group's health check fails, traffic is not forwarded to the Task. After `unhealthy_threshold × interval`, +the integration makes ECS mark the Task as unhealthy and deregister it from the Target Group.
+ECS will eventually stop the Task, then launch a replacement to maintain the desired count. + +
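The `unhealthy_threshold × interval` window mentioned above can be estimated before changing a Target Group. The following is a minimal sketch: the class name and the sample values are hypothetical and purely illustrative, and the real time to deregistration may be longer (e.g., due to check timeouts).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TargetGroupHealthCheck:
    """Hypothetical container for the two settings used in the estimate."""

    interval_seconds: int
    unhealthy_threshold: int

    def seconds_until_unhealthy(self) -> int:
        # Consecutive failed checks needed, multiplied by the time between checks.
        return self.unhealthy_threshold * self.interval_seconds


# Illustrative values only; they are not taken from any real Target Group.
strict = TargetGroupHealthCheck(interval_seconds=10, unhealthy_threshold=2)
forgiving = TargetGroupHealthCheck(interval_seconds=30, unhealthy_threshold=5)

print("strict:   ", strict.seconds_until_unhealthy(), "seconds")
print("forgiving:", forgiving.seconds_until_unhealthy(), "seconds")
```

A larger window gives slow or briefly unresponsive Tasks more time before the Target Group marks them as unhealthy, at the cost of detecting real failures later.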
+ +
+ Solution + +- Align the containers' and the Target Group's health checks. +- Consider making the Target Group's health check more forgiving, e.g., via a higher unhealthy threshold, or more + accepted HTTP codes. + +
+ ## Further readings - [Amazon Web Services] @@ -2102,8 +2264,9 @@ Specify a supported value for the task CPU and memory in your task definition. - [What Is AWS Cloud Map?] - [Centralized Container Logging with Fluent Bit] - [Effective Logging Strategies with Amazon ECS and Fluentd] -- [ECS pricing] +- [A Simple Breakdown of Amazon ECS Pricing] - [Announcing AWS Graviton2 Support for AWS Fargate] +- [Optimize load balancer health check parameters for Amazon ECS] ### Sources @@ -2183,6 +2346,7 @@ Specify a supported value for the task CPU and memory in your task definition. [Announcing AWS Graviton2 Support for AWS Fargate]: https://aws.amazon.com/blogs/aws/announcing-aws-graviton2-support-for-aws-fargate-get-up-to-40-better-price-performance-for-your-serverless-containers/ [Automatically scale your Amazon ECS service]: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/service-auto-scaling.html [AWS Distro for OpenTelemetry]: https://aws-otel.github.io/ +[AWS Fargate Pricing]: https://aws.amazon.com/fargate/pricing/ [AWS Fargate Spot Now Generally Available]: https://aws.amazon.com/blogs/aws/aws-fargate-spot-now-generally-available/ [Centralized Container Logging with Fluent Bit]: https://aws.amazon.com/blogs/opensource/centralized-container-logging-fluent-bit/ [ecs execute-command proposal]: https://github.com/aws/containers-roadmap/issues/1050 @@ -2198,6 +2362,7 @@ Specify a supported value for the task CPU and memory in your task definition. 
[install the session manager plugin for the aws cli]: https://docs.aws.amazon.com/systems-manager/latest/userguide/session-manager-working-with-install-plugin.html [Interconnect Amazon ECS services]: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/interconnecting-services.html [Metrics collection from Amazon ECS using Amazon Managed Service for Prometheus]: https://aws.amazon.com/blogs/opensource/metrics-collection-from-amazon-ecs-using-amazon-managed-service-for-prometheus/ +[Optimize load balancer health check parameters for Amazon ECS]: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/load-balancer-healthcheck.html [Pass Secrets Manager secrets through Amazon ECS environment variables]: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/secrets-envvar-secrets-manager.html [Pass sensitive data to an Amazon ECS container]: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/specifying-sensitive-data.html [storage options for amazon ecs tasks]: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/using_data_volumes.html @@ -2219,12 +2384,12 @@ Specify a supported value for the task CPU and memory in your task definition. 
[`aws ecs execute-command` results in `TargetNotConnectedException` `The execute command failed due to an internal error`]: https://stackoverflow.com/questions/69261159/aws-ecs-execute-command-results-in-targetnotconnectedexception-the-execute [308 Permanent Redirect]: https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Status/308 +[A Simple Breakdown of Amazon ECS Pricing]: https://awsfundamentals.com/blog/amazon-ecs-pricing [a step-by-step guide to enabling amazon ecs exec]: https://medium.com/@mariotolic/a-step-by-step-guide-to-enabling-amazon-ecs-exec-a88b05858709 [attach ebs volume to aws ecs fargate]: https://medium.com/@shujaatsscripts/attach-ebs-volume-to-aws-ecs-fargate-e23fea7bb1a7 [Avoiding Common Pitfalls with ECS Capacity Providers and Auto Scaling]: https://medium.com/@bounouh.fedi/avoiding-common-pitfalls-with-ecs-capacity-providers-and-auto-scaling-24899ab6fc25 [AWS Fargate Pricing Explained]: https://www.vantage.sh/blog/fargate-pricing [aws-cloudmap-prometheus-sd]: https://github.com/awslabs/aws-cloudmap-prometheus-sd -[ECS pricing]: https://awsfundamentals.com/blog/amazon-ecs-pricing [Effective Logging Strategies with Amazon ECS and Fluentd]: https://reintech.io/blog/effective-logging-strategies-amazon-ecs-fluent [exposing multiple ports for an aws ecs service]: https://medium.com/@faisalsuhail1/exposing-multiple-ports-for-an-aws-ecs-service-64b9821c09e8 [guide to using amazon ebs with amazon ecs and aws fargate]: https://stackpioneers.com/2024/01/12/guide-to-using-amazon-ebs-with-amazon-ecs-and-aws-fargate/