feat(mimir): deduplicate multiple prometheus sources

Michele Cereda
2025-05-03 00:36:33 +02:00
parent db5755dd65
commit e9bff01864
2 changed files with 136 additions and 6 deletions

@@ -7,6 +7,7 @@
1. [Standalone tasks](#standalone-tasks)
1. [Services](#services)
1. [Resource constraints](#resource-constraints)
1. [Environment variables](#environment-variables)
1. [Storage](#storage)
1. [EBS volumes](#ebs-volumes)
1. [EFS volumes](#efs-volumes)
@@ -21,7 +22,7 @@
1. [Troubleshooting](#troubleshooting)
1. [Invalid 'cpu' setting for task](#invalid-cpu-setting-for-task)
1. [Further readings](#further-readings)
1. [Sources](#sources)
1. [Sources](#sources)
## TL;DR
@@ -279,6 +280,42 @@ the `memoryReservation` value.<br/>
If specifying `memoryReservation`, that value is guaranteed to the container and subtracted from the available memory
resources for the container instance that the container is placed on. Otherwise, the value of `memory` is used.
## Environment variables
Refer [Amazon ECS environment variables].
ECS sets default environment variables for any task it runs.
<details>
```sh
$ aws ecs list-tasks --cluster 'devel' --service-name 'prometheus' --query 'taskArns' --output 'text' \
| xargs -I '%%' aws ecs execute-command --cluster 'devel' --task '%%' --container 'prometheus' \
--interactive --command 'printenv'
The Session Manager plugin was installed successfully. Use the AWS CLI to start a session.
Starting session with SessionId: ecs-execute-command-abcdefghijklmnopqrstuvwxyz
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
HOSTNAME=ip-172-31-10-103.eu-west-1.compute.internal
AWS_CONTAINER_CREDENTIALS_RELATIVE_URI=/v2/credentials/abcdefgh-1234-abcd-9876-abcdefgh0123
AWS_DEFAULT_REGION=eu-west-1
AWS_EXECUTION_ENV=AWS_ECS_FARGATE
AWS_REGION=eu-west-1
ECS_AGENT_URI=http://169.254.170.2/api/abcdef0123456789abcdef0123456789-1111111111
ECS_CONTAINER_METADATA_URI=http://169.254.170.2/v3/abcdef0123456789abcdef0123456789-1111111111
ECS_CONTAINER_METADATA_URI_V4=http://169.254.170.2/v4/abcdef0123456789abcdef0123456789-1111111111
HOME=/root
TERM=xterm-256color
LANG=C.UTF-8
Exiting session with sessionId: ecs-execute-command-abcdefghijklmnopqrstuvwxyz.
```
</details>
## Storage
Refer [Storage options for Amazon ECS tasks].
@@ -503,7 +540,7 @@ Requirements:
"Statement": [{
"Effect": "Allow",
"Action": "ecs:ExecuteCommand",
"Resource": "arn:aws:ecs:eu-west-1:012345678901:cluster/staging",
"Resource": "arn:aws:ecs:eu-west-1:012345678901:cluster/devel",
"Condition": {
"StringEquals": {
"aws:ResourceTag/application": "appName",
@@ -527,15 +564,15 @@ Procedure:
<summary>Example</summary>
```sh
aws ecs describe-tasks --cluster 'staging' --tasks 'ef6260ed8aab49cf926667ab0c52c313' --output 'yaml' \
aws ecs describe-tasks --cluster 'devel' --tasks 'ef6260ed8aab49cf926667ab0c52c313' --output 'yaml' \
--query 'tasks[0] | {
"managedAgents": containers[].managedAgents[?@.name==`ExecuteCommandAgent`][],
"enableExecuteCommand": enableExecuteCommand
}'
aws ecs list-tasks --cluster 'staging' --service-name 'mimir' --query 'taskArns' --output 'text' \
aws ecs list-tasks --cluster 'devel' --service-name 'mimir' --query 'taskArns' --output 'text' \
| xargs \
aws ecs describe-tasks --cluster 'Staging' \
aws ecs describe-tasks --cluster 'devel' \
--output 'yaml' --query 'tasks[0] | {
"managedAgents": containers[].managedAgents[?@.name==`ExecuteCommandAgent`][],
"enableExecuteCommand": enableExecuteCommand
@@ -560,7 +597,7 @@ Procedure:
```sh
aws ecs execute-command --interactive --command 'df -h' \
--cluster 'staging' --task 'ef6260ed8aab49cf926667ab0c52c313' --container 'nginx'
--cluster 'devel' --task 'ef6260ed8aab49cf926667ab0c52c313' --container 'nginx'
```
```plaintext
@@ -977,6 +1014,7 @@ Specify a supported value for the task CPU and memory in your task definition.
[efs]: efs.md
<!-- Upstream -->
[Amazon ECS environment variables]: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-environment-variables.html
[amazon ecs exec checker]: https://github.com/aws-containers/amazon-ecs-exec-checker
[Amazon ECS Service Discovery]: https://aws.amazon.com/blogs/aws/amazon-ecs-service-discovery/
[amazon ecs services]: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs_services.html

@@ -19,6 +19,8 @@ and set up alerting rules across multiple tenants to leverage tenant federation.
1. [Migrate to Mimir](#migrate-to-mimir)
1. [Ingest Out-Of-Order samples](#ingest-out-of-order-samples)
1. [Deduplication of data from multiple Prometheus scrapers](#deduplication-of-data-from-multiple-prometheus-scrapers)
1. [Configure Prometheus for deduplication](#configure-prometheus-for-deduplication)
1. [Configure Mimir for deduplication](#configure-mimir-for-deduplication)
1. [APIs](#apis)
1. [Troubleshooting](#troubleshooting)
1. [HTTP status 401 Unauthorized: no org id](#http-status-401-unauthorized-no-org-id)
@@ -430,6 +432,94 @@ limits:
Refer [Configure Grafana Mimir high-availability deduplication].
Mimir can deduplicate the data received from HA pairs of Prometheus instances.<br/>
It does so by:
- Electing a _leader_ replica for each data source pair.
- Only ingesting samples from the leader, and dropping the ones from the other replica.
- Switching the leader to the standby replica, should Mimir see no new samples from the leader for some time (30s by
default).
The failure timeout should be kept low enough to avoid dropping too much data before failing over to the standby
replica.<br/>
For queries using the `rate()` function, make the rate time interval at least four times the scrape period to account
for these failover scenarios (e.g., a rate time interval of at least 1 minute for a scrape period of 15 seconds).
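As a rough sketch of the four-times rule above, the fragments below pair a 15-second scrape period with a 1-minute
rate window. The rule group, the recorded series name, and the `http_requests_total` metric are assumptions made for
illustration only.

```yml
# prometheus.yml (fragment)
global:
  scrape_interval: 15s  # scrape period

---
# rules.yml (fragment); group, record, and metric names are hypothetical
groups:
  - name: example
    rules:
      - record: job:http_requests:rate1m
        # rate window of 1m = 4 x the 15s scrape period
        expr: sum by (job) (rate(http_requests_total[1m]))
```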
The distributor includes a high-availability (HA) tracker.<br/>
It deduplicates incoming samples based on a `cluster` and `replica` label that it expects on each incoming series.
The `cluster` label uniquely identifies the cluster of redundant Prometheus servers for a given tenant.<br/>
The `replica` label uniquely identifies the replica instance within that Prometheus cluster.<br/>
Incoming samples are considered duplicated (and thus dropped) if they are received from any replica that is **not** the
currently elected leader within its cluster.
> For performance reasons, the HA tracker only checks the `cluster` and `replica` label of the **first** series in the
> request to determine whether **all** series in the request should be deduplicated.
### Configure Prometheus for deduplication
Set the two labels for each Prometheus server.
The easiest approach is to set them as _external labels_.<br/>
The default labels are `cluster` and `__replica__`.
```yml
global:
  external_labels:
    cluster: infra
    __replica__: replica1  # since Prometheus 3.0 one can use vars like ${HOSTNAME}
```
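For illustration, the other server of the same HA pair would presumably share the `cluster` value and differ only in
the replica label. The `replica2` value is an assumed name.

```yml
# prometheus.yml on the second server of the same HA pair (sketch)
global:
  external_labels:
    cluster: infra         # same value on both replicas of the pair
    __replica__: replica2  # assumed name; must be unique per replica
```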
### Configure Mimir for deduplication
The **minimal** configuration requires the following:
1. Enable the distributor's HA tracker.
<details style="padding: 0 0 1rem 1rem">
<summary>Example: enable for <b>all</b> tenants</summary>
```sh
mimir … \
-distributor.ha-tracker.enable='true' \
-distributor.ha-tracker.enable-for-all-users='true'
```
```yml
limits:
  accept_ha_samples: true
distributor:
  ha_tracker:
    enable_ha_tracker: true
```
</details>
1. Configure the HA tracker's KV store.
`memberlist` support is currently experimental.<br/>
See also [In-Depth Comparison of Distributed Coordination Tools: Consul, etcd, ZooKeeper, and Nacos].
<details style="padding: 0 0 1rem 1rem">
<summary>Example: <code>inmemory</code></summary>
```sh
mimir … -distributor.ha-tracker.store='inmemory'
```
```yml
distributor:
  ha_tracker:
    kvstore:
      store: inmemory
```
</details>
1. Configure the expected label names for each cluster and its replica.<br/>
Only needed when using different labels than the default ones.
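
A minimal sketch of the last step, assuming the `ha_cluster_label` and `ha_replica_label` limits from the Mimir
configuration reference. The `prom_cluster` and `prom_replica` label names are hypothetical and must match the
external labels set on the Prometheus servers.

```yml
limits:
  ha_cluster_label: prom_cluster  # hypothetical name; the default is `cluster`
  ha_replica_label: prom_replica  # hypothetical name; the default is `__replica__`
```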
## APIs
Refer [Grafana Mimir HTTP API].
@@ -525,6 +615,7 @@ ingester:
- [hashicorp/memberlist]
- [Gossip protocol]
- [Ceiling Function]
- [In-Depth Comparison of Distributed Coordination Tools: Consul, etcd, ZooKeeper, and Nacos]
Alternatives:
@@ -581,4 +672,5 @@ Alternatives:
[Ceiling Function]: https://www.geeksforgeeks.org/ceiling-function/
[Gossip protocol]: https://en.wikipedia.org/wiki/Gossip_protocol
[hashicorp/memberlist]: https://github.com/hashicorp/memberlist
[In-Depth Comparison of Distributed Coordination Tools: Consul, etcd, ZooKeeper, and Nacos]: https://medium.com/@karim.albakry/in-depth-comparison-of-distributed-coordination-tools-consul-etcd-zookeeper-and-nacos-a6f8e5d612a6
[Mimir on AWS ECS Fargate]: https://github.com/grafana/mimir/discussions/3807#discussioncomment-4602413