diff --git a/knowledge base/cloud computing/aws/ecs.md b/knowledge base/cloud computing/aws/ecs.md
index 1a189d9..608d001 100644
--- a/knowledge base/cloud computing/aws/ecs.md
+++ b/knowledge base/cloud computing/aws/ecs.md
@@ -22,14 +22,16 @@
1. [Docker volumes](#docker-volumes)
1. [Bind mounts](#bind-mounts)
1. [Networking](#networking)
-1. [Intra-task container dependencies](#intra-task-container-dependencies)
+ 1. [Connecting to a service](#connecting-to-a-service)
+ 1. [Allow tasks to communicate with each other](#allow-tasks-to-communicate-with-each-other)
+ 1. [Load Balancer](#load-balancer)
+ 1. [ECS Service Connect](#ecs-service-connect)
+ 1. [ECS service discovery](#ecs-service-discovery)
+ 1. [VPC Lattice](#vpc-lattice)
+1. [Container dependencies](#container-dependencies)
1. [Execute commands in tasks' containers](#execute-commands-in-tasks-containers)
1. [Scale the number of tasks automatically](#scale-the-number-of-tasks-automatically)
1. [Target tracking](#target-tracking)
-1. [Allow tasks to communicate with each other](#allow-tasks-to-communicate-with-each-other)
- 1. [ECS Service Connect](#ecs-service-connect)
- 1. [ECS service discovery](#ecs-service-discovery)
- 1. [VPC Lattice](#vpc-lattice)
1. [Scrape metrics using Prometheus](#scrape-metrics-using-prometheus)
1. [Send logs to a central location](#send-logs-to-a-central-location)
1. [FireLens](#firelens)
@@ -39,9 +41,11 @@
1. [Mount Secrets Manager secrets as files in containers](#mount-secrets-manager-secrets-as-files-in-containers)
1. [Make a sidecar container write secrets to shared volumes](#make-a-sidecar-container-write-secrets-to-shared-volumes)
1. [Best practices](#best-practices)
-1. [Cost-saving measures](#cost-saving-measures)
+1. [Pricing](#pricing)
+ 1. [Cost-saving measures](#cost-saving-measures)
1. [Troubleshooting](#troubleshooting)
1. [Invalid 'cpu' setting for task](#invalid-cpu-setting-for-task)
+ 1. [Tasks in a service using a Load Balancer are being stopped even if healthy](#tasks-in-a-service-using-a-load-balancer-are-being-stopped-even-if-healthy)
1. [Further readings](#further-readings)
1. [Sources](#sources)
@@ -1015,7 +1019,359 @@ one's behalf.
Such role is automatically created when creating a cluster, or when creating or updating a service in the AWS Management
Console.
-## Intra-task container dependencies
+### Connecting to a service
+
+Services create Tasks.
+Each Task is granted an IP address or can otherwise be reached depending on its network configuration.
+
+By default, there is no single endpoint that refers to all the replicas of a service at once.
+One can:
+
+- Manually add a generic Route53 record, containing the references to those Tasks, and refer to it.
+- Create a proxy that forwards (and maybe load balances) the traffic to those Tasks.
+- Leverage a Load Balancer to forward and balance the traffic.
+
+ > [!warning]
+  > This is the easiest way to allow reachability, but it could also be the most expensive if incorrectly configured.
+
+  ```ts
+  const service = new aws.ecs.Service(
+    "someApp",
+    {
+      name: "someApp",
+      …,
+      loadBalancers: [{
+        containerName: "service",
+        containerPort: 8080,
+        targetGroupArn: targetGroup.arn,
+      }],
+    },
+  );
+  ```
+
+- Leverage other types of resources to allow communication between services and applications.
+ See also [Allow tasks to communicate with each other].
+
+### Allow tasks to communicate with each other
+
+Refer [How can I allow the tasks in my Amazon ECS services to communicate with each other?] and
+[Interconnect Amazon ECS services].
+
+Tasks in a cluster are **not** normally able to communicate with each other.
+Use a Load Balancer, ECS Service Connect, ECS service discovery or VPC Lattice to allow that.
+
+#### Load Balancer
+
+Configure a Load Balancer for the services, and optionally a mnemonic Route53 `CNAME` record (or alias) for them, then
+call the corresponding FQDN.
+Alternatively, leverage an existing Load Balancer if one is available.
+
+It is the easiest way, but it is overkill and expensive if all one wants to do is to route internal traffic.
+
+#### ECS Service Connect
+
+Refer [Use Service Connect to connect Amazon ECS services with short names].
+
+ECS Service Connect provides ECS clusters with the configuration they need for service-to-service discovery,
+connectivity, and traffic monitoring by building both service discovery and a service mesh in the clusters.
+
+It provides:
+
+- The complete configuration services need to join the mesh.
+- A unified way to refer to services within namespaces that does **not** depend on the VPC's DNS configuration.
+- Standardized metrics and logs to monitor all the applications.
+
+The feature creates a virtual network of related services.
+The same service configuration can be used across different namespaces to run independent yet identical sets of
+applications.
+
+When using Service Connect, ECS dynamically manages Service Connect endpoints for each task as they start and stop. It
+does so by injecting the definition of a _sidecar_ proxy container **in services**. This does **not** change their task
+definition.
+Each task created for each registered service will end up running the sidecar proxy container, so that the task is
+added to the mesh.
+
+Injecting the proxy in the services and not in the task definitions allows for the same task definition to be reused to
+run identical applications in different namespaces with different Service Connect configurations.
+It also means that, since the proxy is **not** in the task definition, it **cannot** be configured by users.
+
+Service Connect **only** interconnects **services** within the **same** namespace.
+
+One can add one Service Connect configuration to new or existing services.
+When that happens, ECS creates:
+
+- A Service Connect endpoint in the namespace.
+- A new deployment in the service that replaces the tasks that are currently running with ones equipped with the proxy.
+
+Existing tasks and other applications can continue to connect to existing endpoints and external applications.
+If a service using Service Connect adds tasks by scaling out, new connections from clients will be load balanced between
+**all** of the running tasks. If the service is updated, new connections from clients will be load balanced only between
+the **new** version of the tasks.
+
+The list of endpoints in the namespace changes every time **any** service in that namespace is deployed.
+Existing tasks, and replacement tasks, continue to behave the same as they did after the most recent deployment.
+Existing tasks **cannot** resolve and connect to new endpoints. Only tasks with a Service Connect configuration in the
+same namespace **and** that start running after this deployment can.
+
+Applications can use short names and standard ports to connect to **services** in the same or other clusters.
+This includes connecting across VPCs in the same AWS Region.
+
+By default, the Service Connect proxy listens on the `containerPort` specified in the task definition's port
+mapping.
+The service's Security Group rules **must** allow incoming traffic to this port from the subnets where clients will run.
+
+The proxy consumes some of the resources allocated to its task.
+It is recommended to:
+
+- Add at least 256 CPU units and 64 MiB of memory to the task's resources for the proxy.
+- \[If expecting tasks to receive more than 500 requests per second at their peak load] Increase the CPU units added
+  for the sidecar to at least 512.
+- \[If expecting to create more than 100 Service Connect services in the namespace, or 2000 tasks in total across all
+  ECS services within the namespace] Add 128 MiB of extra memory for the Service Connect proxy container.
+  One **must** do this in **every** task definition that is used by **any** of the ECS services in the namespace.
+
+It is recommended one sets the log configuration in the Service Connect configuration.
+
+Proxy configuration:
+
+- Tasks in a Service Connect endpoint are load balanced using a `round-robin` strategy.
+- The proxy uses data about prior failed connections to avoid sending new connections to the tasks that had the failed
+ connections for some time.
+ At the time of writing, failing 5 or more connections in the last 30 seconds makes the proxy avoid that task for 30 to
+ 300 seconds.
+- Connections that pass through the proxy and fail are retried, but **avoid** the host that failed the previous
+  connection.
+  This ensures that each connection through Service Connect doesn't fail for one-off reasons.
+- The proxy waits a maximum time for applications to respond.
+  The default timeout value is 15 seconds, but it can be updated.
+
+
+ Limitations
+
+Service Connect does **not** support:
+
+- ECS' `host` network mode.
+- Windows containers.
+- HTTP 1.0.
+- Standalone tasks, and any task created by resources other than services.
+- Services using the `blue/green` or `external` deployment types.
+- External container instances for ECS Anywhere.
+- PPv2.
+- Task definitions that set _container_ memory limits.
+ It is required to set the _task_ memory limit, though.
+
+Tasks using the `bridge` network mode and Service Connect will **not** support the `hostname` container definition
+parameter.
+
+Each service can belong to only one namespace.
+
+Service Connect can use any AWS Cloud Map namespace, as long as it is in the **same** Region **and** AWS account.
+
+Service Connect does **not** delete namespaces when clusters are deleted.
+One must delete namespaces in AWS Cloud Map oneself.
+
+
+
+
+ Requirements
+
+- Tasks running in Fargate **must** use the Fargate Linux platform version 1.4.0 or higher.
+- The ECS agent on container instances must be version 1.67.2 or higher.
+- Container instances must run the ECS-optimized Amazon Linux 2023 AMI version `20230428` or later, or the ECS-optimized
+ Amazon Linux 2 AMI version `2.0.20221115` or later.
+  These versions include the Service Connect agent in addition to the ECS container agent.
+- Container instances must have the `ecs:Poll` permission assigned to them for resource
+ `arn:aws:ecs:{{region}}:{{accountId}}:task-set/cluster/*`.
+ If using the `ecsInstanceRole` or `AmazonEC2ContainerServiceforEC2Role` IAM roles, there is no need for additional
+ permissions.
+- Services **must** use the **rolling deployment** strategy, as it is the only one supported.
+- Task definitions **must** set their task's memory limit.
+- The task memory limit must be set to a number **greater** than the sum of the container memory limits.
+ The CPU and memory in the task limits that aren't allocated in the container limits will be used by the Service
+ Connect's proxy container and other containers that don't set container limits.
+- All endpoints must be **unique** within their namespace.
+- All discovery names must be **unique** within their namespace.
+- One **must** redeploy existing services before applications can resolve the new endpoints.
+ New endpoints that are added to the namespace **after** the service's most recent deployment **will not** be added to
+ the proxy configuration.
+- Application Load Balancer traffic defaults to routing through the Service Connect agent in `awsvpc` network mode.
+ If one wants non-service traffic to bypass the Service Connect agent, one will need to use the `ingressPortOverride`
+ parameter in their Service Connect service configuration.
+
+
+
+Procedure:
+
+1. Configure the ECS cluster to use the desired AWS Cloud Map namespace.
+
+
+ Simplified process
+
+ Create the cluster with the desired name for the AWS Cloud Map namespace, and specify that name for the namespace
+ when asked.
+ ECS will create a new HTTP namespace with the necessary configuration.
+ As a reminder, Service Connect doesn't use or create DNS hosted zones in Amazon Route 53. FIXME: check this
+
+
+
+1. Configure port names in the server services' task definitions for all the port mappings that the services will expose
+ in Service Connect.
+
+
+
+   ```json
+   "containerDefinitions": [{
+     "portMappings": [{
+       "name": "postgres",
+       "protocol": "tcp",
+       "containerPort": 5432
+     }]
+   }]
+   ```
+
+
+
+1. Configure the server services to create Service Connect endpoints within the namespace.
+
+
+
+   ```json
+   "serviceConnectConfiguration": {
+     "enabled": true,
+     "namespace": "ecs-dev-cluster",
+     "services": [{
+       "portName": "postgres",
+       "discoveryName": "postgres",
+       "clientAliases": [{
+         "port": 5432,
+         "dnsName": "pgsql"
+       }]
+     }]
+   }
+   ```
+
+
+
+1. Deploy the services.
+   This will create the endpoints in the AWS Cloud Map namespace used by the cluster.
+ ECS also injects the Service Connect proxy container in each task.
+1. Deploy the client applications as ECS services.
+ ECS connects them to the Service Connect endpoints through the Service Connect proxy in each task.
+1. Applications only use the proxy to connect to Service Connect endpoints.
+ No additional configuration is required to use the proxy.
+1. \[optionally] Monitor traffic through the Service Connect proxy in Amazon CloudWatch.
+
+#### ECS service discovery
+
+Service discovery helps manage HTTP and DNS namespaces for ECS services.
+
+ECS automatically registers and de-registers the list of launched tasks to AWS Cloud Map.
+Cloud Map maintains DNS records that resolve to the internal IP addresses of one or more tasks from registered
+services.
+Other services in the **same** VPC can use such DNS records to send traffic directly to containers using their internal
+IP addresses.
+
+This approach provides low latency since traffic travels directly between the containers.
+
+ECS service discovery is a good fit when using the `awsvpc` network mode, where:
+
+- Each task is assigned its own, unique IP address.
+- That IP address is an `A` record.
+- Each service can have a unique security group assigned.
+
+When using the `bridge` network mode, `A` records are no longer enough for service discovery and one **must** also use
+`SRV` DNS records. This is due to containers sharing the same IP address and having ports mapped randomly.
+`SRV` records can keep track of both IP addresses and port numbers, but require applications to be appropriately
+configured.
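+
+As an illustration of what _appropriately configured_ could mean, the sketch below picks a host and port out of a set
+of `SRV` records. The `SrvRecord` interface mirrors the objects returned by Node.js' `dns.promises.resolveSrv()`, while
+`pickSrvTarget` is a hypothetical helper. It simplifies the selection by always preferring the highest weight, where
+RFC 2782 prescribes a weighted random choice among records of equal priority.
+
+```ts
+// Shape of the records returned by Node.js' `dns.promises.resolveSrv()`.
+interface SrvRecord { name: string; port: number; priority: number; weight: number }
+
+// Pick the target with the lowest priority value, preferring higher weights on ties.
+// Simplification: no weighted random selection among records sharing the same priority.
+function pickSrvTarget(records: SrvRecord[]): { host: string; port: number } | undefined {
+  const sorted = [...records].sort((a, b) => a.priority - b.priority || b.weight - a.weight);
+  const best = sorted[0];
+  if (!best) return undefined; // no records, nothing to connect to
+  return { host: best.name, port: best.port };
+}
+```
+
+An application would feed it the result of the `SRV` lookup for the service's discovery name, then connect to the
+returned host and port instead of assuming a fixed port.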
+
+Service discovery supports only the `A` and `SRV` DNS record types.
+DNS records are automatically added or removed as tasks start or stop for ECS services.
+
+Task registration in Cloud Map might take some seconds to finish.
+Until ECS registers the tasks, containers in them might complain about being unable to resolve the services they are
+using.
+
+DNS records have a TTL, and tasks might die before it expires.
+Clients could then keep resolving the addresses of tasks that no longer exist.
+One **must** implement extra logic in one's applications, so that they can handle retries and deal with connection
+failures when the records are not yet updated.
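+
+A minimal sketch of such retry logic follows, assuming exponential backoff is acceptable for the application.
+`withRetries` and its default values are hypothetical and only meant as a starting point.
+
+```ts
+// Retry an async operation with exponential backoff. Names and defaults are illustrative.
+async function withRetries<T>(
+  operation: () => Promise<T>,
+  maxAttempts = 5,
+  baseDelayMs = 200,
+): Promise<T> {
+  let lastError: unknown;
+  for (let attempt = 0; attempt < maxAttempts; attempt++) {
+    try {
+      return await operation();
+    } catch (error) {
+      lastError = error;
+      if (attempt < maxAttempts - 1) {
+        // With the defaults, wait 200 ms, 400 ms, 800 ms, … before the next attempt.
+        await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** attempt));
+      }
+    }
+  }
+  // All attempts failed: surface the last error to the caller.
+  throw lastError;
+}
+```
+
+Wrapping both the name resolution and the connection attempt in the same `operation` lets every retry resolve the
+service's name again, and pick up records that have been updated in the meantime.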
+
+See also [Use service discovery to connect Amazon ECS services with DNS names].
+
+Procedure:
+
+1. Create the desired AWS Cloud Map namespace.
+1. Create the desired Cloud Map service in the namespace.
+1. Configure the ECS service acting as server to use the Cloud Map service.
+
+
+
+   ```json
+   "serviceRegistries": [{
+     "registryArn": "arn:aws:servicediscovery:eu-west-1:012345678901:service/srv-uuf33b226vw93biy"
+   }]
+   ```
+
+
+
+`nslookup` commands run from within containers might fail, even though those containers are still able to resolve
+services registered in Cloud Map namespaces.
+
+
+
+```sh
+$ aws ecs execute-command --cluster 'dev' \
+ --task 'arn:aws:ecs:eu-west-1:012345678901:task/dev/abcdef0123456789abcdef0123456789' --container 'prometheus' \
+ --interactive --command 'nslookup mimir.dev.ecs.internal'
+
+The Session Manager plugin was installed successfully. Use the AWS CLI to start a session.
+
+Starting session with SessionId: ecs-execute-command-p3pkkrysjdptxa8iu3cz3kxnke
+Server: 172.16.0.2
+Address: 172.16.0.2:53
+
+Non-authoritative answer:
+
+$ aws ecs execute-command --cluster 'dev' \
+ --task 'arn:aws:ecs:eu-west-1:012345678901:task/dev/abcdef0123456789abcdef0123456789' --container 'prometheus' \
+ --interactive --command 'wget -SO- mimir.dev.ecs.local:8080/ready'
+
+The Session Manager plugin was installed successfully. Use the AWS CLI to start a session.
+
+Starting session with SessionId: ecs-execute-command-hjgyio7n6nf2o9h4qn6ht7lzri
+Connecting to mimir.dev.ecs.local:8080 (172.16.88.99:8080)
+ HTTP/1.1 200 OK
+ Date: Thu, 08 May 2025 09:35:02 GMT
+ Content-Type: text/plain
+ Content-Length: 5
+ Connection: close
+
+saving to '/dev/stdout'
+stdout 100% |********************************| 5 0:00:00 ETA
+'/dev/stdout' saved
+
+Exiting session with sessionId: ecs-execute-command-hjgyio7n6nf2o9h4qn6ht7lzri.
+```
+
+
+
+#### VPC Lattice
+
+Managed application networking service that customers can use to observe, secure, and monitor applications built across
+AWS compute services, VPCs, and accounts without having to modify their code.
+
+VPC Lattice technically replaces the need for Application Load Balancers by leveraging target groups themselves.
+Target groups are collections of compute resources, and can refer to EC2 instances, IP addresses, Lambda functions,
+and Application Load Balancers.
+Listeners are used to forward traffic to specified target groups when the conditions are met.
+ECS also automatically replaces unhealthy tasks.
+
+ECS tasks can be enabled **as IP targets** in VPC Lattice by associating their services with a VPC Lattice target
+group.
+ECS automatically registers tasks to the VPC Lattice target group when they are launched for registered services.
+
+Deployments _might_ take longer when using VPC Lattice due to the extent of changes required.
+
+See also [What is Amazon VPC Lattice?] and its [Amazon VPC Lattice pricing].
+
+## Container dependencies
Containers can depend on other containers **from the same task**.
On startup, ECS evaluates all container dependency conditions and starts the containers only when the required
@@ -1335,317 +1691,6 @@ The **only** available metrics for the integrated checks are currently:
- The service's **average** memory utilization (`ECSServiceMemoryUtilization`) for the last minute.
- The service's Application Load Balancer's **average** requests count (`ALBRequestCountPerTarget`) for the last minute.
-## Allow tasks to communicate with each other
-
-Refer [How can I allow the tasks in my Amazon ECS services to communicate with each other?] and
-[Interconnect Amazon ECS services].
-
-Tasks in a cluster are **not** normally able to communicate with each other.
-Use ECS Service Connect, ECS service discovery or VPC Lattice to allow that.
-
-### ECS Service Connect
-
-Refer [Use Service Connect to connect Amazon ECS services with short names].
-
-ECS Service Connect provides ECS clusters with the configuration they need for service-to-service discovery,
-connectivity, and traffic monitoring by building both service discovery and a service mesh in the clusters.
-
-It provides:
-
-- The complete configuration services need to join the mesh.
-- A unified way to refer to services within namespaces that does **not** depend on the VPC's DNS configuration.
-- Standardized metrics and logs to monitor all the applications.
-
-The feature creates a virtual network of related services.
-The same service configuration can be used across different namespaces to run independent yet identical sets of
-applications.
-
-When using Service Connect, ECS dynamically manages Service Connect endpoints for each task as they start and stop. It
-does so by injecting the definition of a _sidecar_ proxy container **in services**. This does **not** change their task
-definition.
-Each task created for each registered service will end up running the sidecar proxy container in order, so that the task
-is added to the mesh.
-
-Injecting the proxy in the services and not in the task definitions allows for the same task definition to be reused to
-run identical applications in different namespaces with different Service Connect configurations.
-It also means that, since the proxy is **not** in the task definition, it **cannot** be configured by users.
-
-Service Connect **only** interconnects **services** within the **same** namespace.
-
-One can add one Service Connect configuration to new or existing services.
-When that happens, ECS creates:
-
-- A Service Connect endpoint in the namespace.
-- A new deployment in the service that replaces the tasks that are currently running with ones equipped with the proxy.
-
-Existing tasks and other applications can continue to connect to existing endpoints and external applications.
-If a service using Service Connect adds tasks by scaling out, new connections from clients will be load balanced between
-**all** of the running tasks. If the service is updated, new connections from clients will be load balanced only between
-the **new** version of the tasks.
-
-The list of endpoints in the namespace changes every time **any** service in that namespace is deployed.
-Existing tasks, and replacement tasks, continue to behave the same as they did after the most recent deployment.
-Existing tasks **cannot** resolve and connect to new endpoints. Only tasks with a Service Connect configuration in the
-same namespace **and** that start running after this deployment can.
-
-Applications can use short names and standard ports to connect to **services** in the same or other clusters.
-This includes connecting across VPCs in the same AWS Region.
-
-By default, the Service Connect proxy listens on the `containerPort` specified in the task definition's port
-mapping.
-The service's Security Group rules **must** allow incoming traffic to this port from the subnets where clients will run.
-
-The proxy will consume some of the resources allocated to their task.
-It is recommended:
-
-- Adding at least 256 CPU units and 64 MiB of memory to the task's resources.
-- \[If expecting tasks to receive more than 500 requests per second at their peak load] Increasing the sidecar's
- resources addition to at least 512 CPU units.
-- \[If expecting to create more than 100 Service Connect services in the namespace, or 2000 tasks in total across all
- ECS services within the namespace], Adding 128 MiB extra of memory for the Service Connect proxy container.
- One **must** do this in **every** task definition that is used by **any** of the ECS services in the namespace.
-
-It is recommended one sets the log configuration in the Service Connect configuration.
-
-Proxy configuration:
-
-- Tasks in a Service Connect endpoint are load balanced in a `round-robin` strategy.
-- The proxy uses data about prior failed connections to avoid sending new connections to the tasks that had the failed
- connections for some time.
- At the time of writing, failing 5 or more connections in the last 30 seconds makes the proxy avoid that task for 30 to
- 300 seconds.
-- Connection that pass through the proxy and fail are retried, but **avoid** the host that failed the previous
- connection.
- This ensures that each connection through Service Connect doesn't fail for one-off reasons.
-- Wait a maximum time for applications to respond.
- The default timeout value is 15 seconds, but it can be updated.
-
-
- Limitations
-
-Service Connect does **not** support:
-
-- ECS' `host` network mode.
-- Windows containers.
-- HTTP 1.0.
-- Standalone tasks and any task created by other resources than services.
-- Services using the `blue/green` or `external deployment` types.
-- External container instance for ECS Anywhere.
-- PPv2.
-- Task definitions that set _container_ memory limits.
- It is required to set the _task_ memory limit, though.
-
-Tasks using the `bridge` network mode and Service Connect will **not** support the `hostname` container definition
-parameter.
-
-Each service can belong to only one namespace.
-
-Service Connect can use any AWS Cloud Map namespace, as long as they are in the **same** Region **and** AWS account.
-
-Service Connect does **not** delete namespaces when clusters are deleted.
-One must delete namespaces in AWS Cloud Map themselves.
-
-
-
-
- Requirements
-
-- Tasks running in Fargate **must** use the Fargate Linux platform version 1.4.0 or higher.
-- The ECS agent on container instances must be version 1.67.2 or higher.
-- Container instances must run the ECS-optimized Amazon Linux 2023 AMI version `20230428` or later, or the ECS-optimized
- Amazon Linux 2 AMI version `2.0.20221115` or later.
- These versions equip the Service Connect agent in addition to the ECS container agent.
-- Container instances must have the `ecs:Poll` permission assigned to them for resource
- `arn:aws:ecs:{{region}}:{{accountId}}:task-set/cluster/*`.
- If using the `ecsInstanceRole` or `AmazonEC2ContainerServiceforEC2Role` IAM roles, there is no need for additional
- permissions.
-- Services **must** use the **rolling deployment** strategy, as it is the only one supported.
-- Task definitions **must** set their task's memory limit.
-- The task memory limit must be set to a number **greater** than the sum of the container memory limits.
- The CPU and memory in the task limits that aren't allocated in the container limits will be used by the Service
- Connect's proxy container and other containers that don't set container limits.
-- All endpoints must be **unique** within their namespace.
-- All discovery names must be **unique** within their namespace.
-- One **must** redeploy existing services before applications can resolve the new endpoints.
- New endpoints that are added to the namespace **after** the service's most recent deployment **will not** be added to
- the proxy configuration.
-- Application Load Balancer traffic defaults to routing through the Service Connect agent in `awsvpc` network mode.
- If one wants non-service traffic to bypass the Service Connect agent, one will need to use the `ingressPortOverride`
- parameter in their Service Connect service configuration.
-
-
-
-Procedure:
-
-1. Configure the ECS cluster to use the desired AWS Cloud Map namespace.
-
-
- Simplified process
-
- Create the cluster with the desired name for the AWS Cloud Map namespace, and specify that name for the namespace
- when asked.
- ECS will create a new HTTP namespace with the necessary configuration.
- As reminder, Service Connect doesn't use or create DNS hosted zones in Amazon Route 53. FIXME: check this
-
-
-
-1. Configure port names in the server services' task definitions for all the port mappings that the services will expose
- in Service Connect.
-
-
-
- ```json
- containerDefinitions: [{
- "name": "postgres",
- "protocol": "tcp",
- "containerPort": 5432
- }]
- ```
-
-
-
-1. Configure the server services to create Service Connect endpoints within the namespace.
-
-
-
- ```json
- "serviceConnectConfiguration": {
- "enabled": true,
- "namespace": "ecs-dev-cluster",
- "services": [{
- "portName": "postgres",
- "discoveryName": "postgres",
- "clientAliases": [{
- "port": 5432,
- "dnsName": "pgsql"
- }]
- }]
- }
- ```
-
-
-
-1. Deploy the services.
- This will create the endpoints AWS Cloud Map namespace used by the cluster.
- ECS also injects the Service Connect proxy container in each task.
-1. Deploy the client applications as ECS services.
- ECS connects them to the Service Connect endpoints through the Service Connect proxy in each task.
-1. Applications only use the proxy to connect to Service Connect endpoints.
- No additional configuration is required to use the proxy.
-1. \[optionally] Monitor traffic through the Service Connect proxy in Amazon CloudWatch.
-
-### ECS service discovery
-
-Service discovery helps manage HTTP and DNS namespaces for ECS services.
-
-ECS automatically registers and de-registers the list of launched tasks to AWS Cloud Map.
-Cloud Map maintains DNS records that resolve to the internal IP addresses of one or more tasks from registered
-services.
-Other services in the **same** VPC can use such DNS records to send traffic directly to containers using their internal
-IP addresses.
-
-This approach provides low latency since traffic travels directly between the containers.
-
-ECS service discovery is a good fit when using the `awsvpc` network mode, where:
-
-- Each task is assigned its own, unique IP address.
-- That IP address is an `A` record.
-- Each service can have a unique security group assigned.
-
-When using _bridged network_ mode, `A` records are no longer enough for service discovery and one **must** also use a
-`SRV` DNS record. This is due to containers sharing the same IP address and having ports mapped randomly.
-`SRV` records can keep track of both IP addresses and port numbers, but requires applications to be appropriately
-configured.
-
-Service discovery supports only the `A` and `SRV` DNS record types.
-DNS records are automatically added or removed as tasks start or stop for ECS services.
-
-Task registration in CloudMap might take some seconds to finish.
-Until ECS registers the tasks, Containers in them might complain about being unable to resolve the services they are
-using.
-
-DNS records have a TTL and it might happen that tasks died before this ended.
-One **must** implement extra logic in one's applications, so that they can handle retries and deal with connection
-failures when the records are not yet updated.
-
-See also [Use service discovery to connect Amazon ECS services with DNS names].
-
-Procedure:
-
-1. Create the desired AWS Cloud Map namespace.
-1. Create the desired Cloud Map service in the namespace.
-1. Configure the ECS service offering acting as server to use the Cloud Map service.
-
-
-
- ```json
- "serviceRegistries": [{
- "registryArn": "arn:aws:servicediscovery:eu-west-1:012345678901:service/srv-uuf33b226vw93biy"
- }]
- ```
-
-
-
-NS lookup commands from within containers might fail, but they might still be able to resolve services registered in
-CloudMap namespaces.
-
-
-
-```sh
-$ aws ecs execute-command --cluster 'dev' \
- --task 'arn:aws:ecs:eu-west-1:012345678901:task/dev/abcdef0123456789abcdef0123456789' --container 'prometheus' \
- --interactive --command 'nslookup mimir.dev.ecs.internal'
-
-The Session Manager plugin was installed successfully. Use the AWS CLI to start a session.
-
-Starting session with SessionId: ecs-execute-command-p3pkkrysjdptxa8iu3cz3kxnke
-Server: 172.16.0.2
-Address: 172.16.0.2:53
-
-Non-authoritative answer:
-
-$ aws ecs execute-command --cluster 'dev' \
- --task 'arn:aws:ecs:eu-west-1:012345678901:task/dev/abcdef0123456789abcdef0123456789' --container 'prometheus' \
- --interactive --command 'wget -SO- mimir.dev.ecs.local:8080/ready'
-
-The Session Manager plugin was installed successfully. Use the AWS CLI to start a session.
-
-Starting session with SessionId: ecs-execute-command-hjgyio7n6nf2o9h4qn6ht7lzri
-Connecting to mimir.dev.ecs.local:8080 (172.16.88.99:8080)
- HTTP/1.1 200 OK
- Date: Thu, 08 May 2025 09:35:02 GMT
- Content-Type: text/plain
- Content-Length: 5
- Connection: close
-
-saving to '/dev/stdout'
-stdout 100% |********************************| 5 0:00:00 ETA
-'/dev/stdout' saved
-
-Exiting session with sessionId: ecs-execute-command-hjgyio7n6nf2o9h4qn6ht7lzri.
-```
-
-
-
-### VPC Lattice
-
-Managed application networking service that customers can use to observe, secure, and monitor applications built across
-AWS compute services, VPCs, and accounts without having to modify their code.
-
-VPC Lattice technically replaces the need for Application Load Balancers by leveraging target groups themselves.
-Target groups which are a collection of compute resources, and can refer EC2 instances, IP addresses, Lambda functions,
-and Application Load Balancers.
-Listeners are used to forward traffic to specified target groups when the conditions are met.
-ECS also automatically replaces unhealthy tasks.
-
-ECS tasks can be enabled **as IP targets** in VPC Lattice by associating their services with a VPC Lattice target
-group.
-ECS automatically registers tasks to the VPC Lattice target group when they are launched for registered services.
-
-Deployments _might_ take longer when using VPC Lattice due to the extent of changes required.
-
-See also [What is Amazon VPC Lattice?] and its [Amazon VPC Lattice pricing].
-
## Scrape metrics using Prometheus
Refer [Prometheus service discovery for AWS ECS] and [Scraping Prometheus metrics from applications running in AWS ECS].
@@ -2034,7 +2079,75 @@ Useful when wanting multiple containers to access the same secret, or just clean
- When using **spot** compute capacity, consider ensuring containers exit gracefully before the task stops.
Refer [Capacity providers].
-## Cost-saving measures
+## Pricing
+
+Refer [AWS Fargate Pricing] and [A Simple Breakdown of Amazon ECS Pricing].
+
+20 GB of ephemeral storage per task are **included**.
+
+**Hourly** costs in `eu-west-1` as of 2026-02-10 (tax _excluded_):
+
+| Provider | Capacity Type | Architecture | OS | Resource | Price |
+| -------- | ------------- | ------------ | ------- | ---------------------------- | ------------------------------- |
+| Fargate | On-Demand | X86 | Linux | 1 vCPU | $0.04048 |
+| Fargate | On-Demand | X86 | Linux | 1 GB RAM | $0.004445 |
+| Fargate | SPOT | X86 | Linux | 1 vCPU | $0.01467395 |
+| Fargate | SPOT | X86 | Linux | 1 GB RAM | $0.00161131 |
+| Fargate | On-Demand | ARM | Linux | 1 vCPU | $0.03238 |
+| Fargate | On-Demand | ARM | Linux | 1 GB RAM | $0.00356 |
+| Fargate | SPOT | ARM | Linux | 1 vCPU | $0.01173771 |
+| Fargate | SPOT | ARM | Linux | 1 GB RAM | $0.0012905 |
+| Fargate | On-Demand | X86 | Windows | 1 vCPU | $0.046552 + $0.046 (OS license) |
+| Fargate | On-Demand | X86 | Windows | 1 GB RAM | $0.00511175 |
+| Fargate | Any | Any | Any | 1 GB extra ephemeral storage | $0.000122 |
+
+
+ Example: Fargate (Linux, X86)
+
+| Resource | Amount | 1h | 1d | 1m(31d) | 1y(366d) |
+| ----------------------- | -----: | --------: | -------: | --------: | ---------: |
+| vCPU | 0.5 | $0.02024 | $0.48576 | $15.05856 | $177.78816 |
+| RAM | 1 GB | $0.004445 | $0.10668 | $3.30708 | $39.04488 |
+| Extra ephemeral storage | 5 GB | $0.00061 | $0.01464 | $0.45384 | $5.35824 |
+
+Total: ~$0.03 per hour, ~$0.61 per day, ~$18.82 per 31d-month, ~$222.20 per 366d-year.
+
+---
+
+| Resource | Amount | 1h | 1d | 1m(31d) | 1y(366d) |
+| ----------------------- | -----: | -------: | -------: | ---------: | -----------: |
+| vCPU | 4 | $0.16192 | $3.88608 | $120.46848 | $1,422.30528 |
+| RAM | 20 GB | $0.0889 | $2.1336 | $66.1416 | $780.8976 |
+| Extra ephemeral storage | 0 GB | $0.00 | $0.00 | $0.00 | $0.00 |
+
+Total: ~$0.26 per hour, ~$6.02 per day, ~$186.62 per 31d-month, ~$2,203.21 per 366d-year.
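
The rows above are the hourly unit price multiplied by the requested amount and by the number of hours in the period. A minimal Python sketch of that arithmetic, assuming the `eu-west-1` On-Demand Linux/X86 prices listed in the table; the function and constant names are illustrative and not part of any AWS tooling:

```python
from decimal import Decimal

# Hourly unit prices copied from the pricing table (eu-west-1, On-Demand, Linux, X86).
VCPU_HOUR = Decimal("0.04048")
GB_RAM_HOUR = Decimal("0.004445")
GB_EXTRA_STORAGE_HOUR = Decimal("0.000122")


def fargate_hourly_cost(vcpu, ram_gb, extra_storage_gb=0):
    """Return the hourly cost of a single task as a Decimal (illustrative helper)."""
    return (
        Decimal(str(vcpu)) * VCPU_HOUR
        + Decimal(str(ram_gb)) * GB_RAM_HOUR
        + Decimal(str(extra_storage_gb)) * GB_EXTRA_STORAGE_HOUR
    )


# Same sizing as the first example: 0.5 vCPU, 1 GB RAM, 5 GB extra ephemeral storage.
hourly = fargate_hourly_cost(vcpu=0.5, ram_gb=1, extra_storage_gb=5)
for label, hours in (("1h", 1), ("1d", 24), ("31d", 24 * 31), ("366d", 24 * 366)):
    print(f"{label}: ${hourly * hours:.5f}")
```

`Decimal` is used instead of floats so that the small per-hour prices do not accumulate binary rounding noise. Swapping the three constants for the SPOT or ARM prices should give the figures for those capacity types.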
+
+
+
+
+ Example: Fargate SPOT (Linux, ARM)
+
+| Resource | Amount | 1h | 1d | 1m(31d) | 1y(366d) |
+| ----------------------- | -----: | -----------: | ----------: | ----------: | -----------: |
+| vCPU | 0.5 | $0.005868855 | $0.14085252 | $4.36642812 | $51.55202232 |
+| RAM | 1 GB | $0.0012905 | $0.030972 | $0.960132 | $11.335752 |
+| Extra ephemeral storage | 5 GB | $0.00061 | $0.01464 | $0.45384 | $5.35824 |
+
+Total: ~$0.01 per hour, ~$0.19 per day, ~$5.79 per 31d-month, ~$68.25 per 366d-year.
+
+---
+
+| Resource | Amount | 1h | 1d | 1m(31d) | 1y(366d) |
+| ----------------------- | -----: | ----------: | ----------: | -----------: | ------------: |
+| vCPU | 4 | $0.04695084 | $1.12682016 | $34.93142496 | $412.41617856 |
+| RAM | 20 GB | $0.02581 | $0.61944 | $19.20264 | $226.71504 |
+| Extra ephemeral storage | 0 GB | $0.00 | $0.00 | $0.00 | $0.00 |
+
+Total: ~$0.08 per hour, ~$1.75 per day, ~$54.14 per 31d-month, ~$639.14 per 366d-year.
+
+
+
+### Cost-saving measures
- Prefer using ARM-based compute capacity over the default `X86_64`, where feasible.
Refer [CPU architectures].
@@ -2043,7 +2156,7 @@ Useful when wanting multiple containers to access the same secret, or just clean
- Prefer using [**spot** capacity][effectively using spot instances in aws ecs for production workloads] for
- Non-critical services and tasks.
- - State**less** or otherwise **interruption tolerant** tasks.
+ - State**less** or otherwise **interruption-tolerant** tasks.
Refer [Capacity providers].
- Consider applying for EC2 Instance and/or Compute Savings Plans if using EC2 capacity.
@@ -2061,7 +2174,7 @@ Useful when wanting multiple containers to access the same secret, or just clean
> Mind the limitations that come with the auto scaling settings.
- If only used internally (e.g., via a VPN), consider configuring intra-network communication capabilities for the
- application **instead of** using a load balancer.
+ application (e.g., CloudMap) **instead of** using load balancers.
Refer [Allow tasks to communicate with each other].
## Troubleshooting
@@ -2088,6 +2201,55 @@ Specify a supported value for the task CPU and memory in your task definition.
+### Tasks in a service using a Load Balancer are being stopped even if healthy
+
+
+ Context
+
+One or more container definitions in the Task define a health check.
+
+Traffic to the Service is served by a Load Balancer.
+The Load Balancer uses a Target Group where the Service registers Tasks.
+
+The containers' health checks pass and the Task is considered _healthy_.
+
+Messages like the following are visible on the Service's page in ECS, or on the Load Balancer's or Target Group's
+pages:
+
+- `service X deregistered targets`
+- `task stopped because it failed ELB health checks`
+- `Health checks failed`
+
+
+
+
+ Cause
+
+Load Balancing and ECS are integrated.
+
+The Target Group defines its own health check in order to decide whether to serve traffic to specific targets.
+
+The containers' health checks and the Target Group's health check are completely separate.
+
+The containers' health checks only require ECS to communicate with the container engine.
+If a container's health check fails, the Task is deemed _unhealthy_ and ECS replaces it.
+
+ECS also reacts to the opinion of an associated Load Balancer (and hence of its Target Group).
+If the Target Group's health check fails, traffic is not forwarded to the Task. After roughly
+`unhealthy_threshold × interval` seconds of consecutive failures, the integration makes ECS mark the Task as unhealthy
+and deregister it from the Target Group.
+ECS will eventually stop the Task, then launch a replacement to maintain the desired count.
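
The `unhealthy_threshold × interval` window can be estimated before changing any setting. A rough Python sketch follows. The function name and the two sets of values are hypothetical, and the result ignores check timeouts and deregistration delay, so treat it as a lower bound and not as an exact figure:

```python
def seconds_until_unhealthy(unhealthy_threshold: int, interval_seconds: int) -> int:
    """Approximate time a failing target has before the Target Group marks it unhealthy."""
    if unhealthy_threshold < 1 or interval_seconds < 1:
        raise ValueError("threshold and interval must be positive")
    return unhealthy_threshold * interval_seconds


# Hypothetical settings: a strict health check versus a more forgiving one.
strict = seconds_until_unhealthy(unhealthy_threshold=2, interval_seconds=10)
forgiving = seconds_until_unhealthy(unhealthy_threshold=5, interval_seconds=30)
print(f"strict: {strict}s, forgiving: {forgiving}s")
```

An application that needs longer than this window to start answering the Target Group's check would likely be deregistered before it is ready, even though its container health check passes.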
+
+
+
+
+ Solution
+
+- Align the containers' and the Target Group's health checks.
+- Consider making the Target Group's health check more forgiving, e.g., via higher unhealthy threshold, or more
+ accepted HTTP codes.
+
+
+
## Further readings
- [Amazon Web Services]
@@ -2102,8 +2264,9 @@ Specify a supported value for the task CPU and memory in your task definition.
- [What Is AWS Cloud Map?]
- [Centralized Container Logging with Fluent Bit]
- [Effective Logging Strategies with Amazon ECS and Fluentd]
-- [ECS pricing]
+- [A Simple Breakdown of Amazon ECS Pricing]
- [Announcing AWS Graviton2 Support for AWS Fargate]
+- [Optimize load balancer health check parameters for Amazon ECS]
### Sources
@@ -2183,6 +2346,7 @@ Specify a supported value for the task CPU and memory in your task definition.
[Announcing AWS Graviton2 Support for AWS Fargate]: https://aws.amazon.com/blogs/aws/announcing-aws-graviton2-support-for-aws-fargate-get-up-to-40-better-price-performance-for-your-serverless-containers/
[Automatically scale your Amazon ECS service]: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/service-auto-scaling.html
[AWS Distro for OpenTelemetry]: https://aws-otel.github.io/
+[AWS Fargate Pricing]: https://aws.amazon.com/fargate/pricing/
[AWS Fargate Spot Now Generally Available]: https://aws.amazon.com/blogs/aws/aws-fargate-spot-now-generally-available/
[Centralized Container Logging with Fluent Bit]: https://aws.amazon.com/blogs/opensource/centralized-container-logging-fluent-bit/
[ecs execute-command proposal]: https://github.com/aws/containers-roadmap/issues/1050
@@ -2198,6 +2362,7 @@ Specify a supported value for the task CPU and memory in your task definition.
[install the session manager plugin for the aws cli]: https://docs.aws.amazon.com/systems-manager/latest/userguide/session-manager-working-with-install-plugin.html
[Interconnect Amazon ECS services]: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/interconnecting-services.html
[Metrics collection from Amazon ECS using Amazon Managed Service for Prometheus]: https://aws.amazon.com/blogs/opensource/metrics-collection-from-amazon-ecs-using-amazon-managed-service-for-prometheus/
+[Optimize load balancer health check parameters for Amazon ECS]: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/load-balancer-healthcheck.html
[Pass Secrets Manager secrets through Amazon ECS environment variables]: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/secrets-envvar-secrets-manager.html
[Pass sensitive data to an Amazon ECS container]: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/specifying-sensitive-data.html
[storage options for amazon ecs tasks]: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/using_data_volumes.html
@@ -2219,12 +2384,12 @@ Specify a supported value for the task CPU and memory in your task definition.
[`aws ecs execute-command` results in `TargetNotConnectedException` `The execute command failed due to an internal error`]: https://stackoverflow.com/questions/69261159/aws-ecs-execute-command-results-in-targetnotconnectedexception-the-execute
[308 Permanent Redirect]: https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Status/308
+[A Simple Breakdown of Amazon ECS Pricing]: https://awsfundamentals.com/blog/amazon-ecs-pricing
[a step-by-step guide to enabling amazon ecs exec]: https://medium.com/@mariotolic/a-step-by-step-guide-to-enabling-amazon-ecs-exec-a88b05858709
[attach ebs volume to aws ecs fargate]: https://medium.com/@shujaatsscripts/attach-ebs-volume-to-aws-ecs-fargate-e23fea7bb1a7
[Avoiding Common Pitfalls with ECS Capacity Providers and Auto Scaling]: https://medium.com/@bounouh.fedi/avoiding-common-pitfalls-with-ecs-capacity-providers-and-auto-scaling-24899ab6fc25
[AWS Fargate Pricing Explained]: https://www.vantage.sh/blog/fargate-pricing
[aws-cloudmap-prometheus-sd]: https://github.com/awslabs/aws-cloudmap-prometheus-sd
-[ECS pricing]: https://awsfundamentals.com/blog/amazon-ecs-pricing
[Effective Logging Strategies with Amazon ECS and Fluentd]: https://reintech.io/blog/effective-logging-strategies-amazon-ecs-fluent
[exposing multiple ports for an aws ecs service]: https://medium.com/@faisalsuhail1/exposing-multiple-ports-for-an-aws-ecs-service-64b9821c09e8
[guide to using amazon ebs with amazon ecs and aws fargate]: https://stackpioneers.com/2024/01/12/guide-to-using-amazon-ebs-with-amazon-ecs-and-aws-fargate/