From e9bff01864ebaa5bd498198bcd2e5861dd7f05ac Mon Sep 17 00:00:00 2001 From: Michele Cereda Date: Sat, 3 May 2025 00:36:33 +0200 Subject: [PATCH] feat(mimir): deduplicate multiple prometheus sources --- knowledge base/cloud computing/aws/ecs.md | 50 ++++++++++-- knowledge base/mimir.md | 92 +++++++++++++++++++++++ 2 files changed, 136 insertions(+), 6 deletions(-) diff --git a/knowledge base/cloud computing/aws/ecs.md b/knowledge base/cloud computing/aws/ecs.md index ee76c2f..25dff54 100644 --- a/knowledge base/cloud computing/aws/ecs.md +++ b/knowledge base/cloud computing/aws/ecs.md @@ -7,6 +7,7 @@ 1. [Standalone tasks](#standalone-tasks) 1. [Services](#services) 1. [Resource constraints](#resource-constraints) +1. [Environment variables](#environment-variables) 1. [Storage](#storage) 1. [EBS volumes](#ebs-volumes) 1. [EFS volumes](#efs-volumes) @@ -21,7 +22,7 @@ 1. [Troubleshooting](#troubleshooting) 1. [Invalid 'cpu' setting for task](#invalid-cpu-setting-for-task) 1. [Further readings](#further-readings) - 1. [Sources](#sources) + 1. [Sources](#sources) ## TL;DR @@ -279,6 +280,42 @@ the `memoryReservation` value.
If specifying `memoryReservation`, that value is guaranteed to the container and subtracted from the available memory resources for the container instance that the container is placed on. Otherwise, the value of `memory` is used. +## Environment variables + +Refer [Amazon ECS environment variables]. + +ECS sets default environment variables for any task it runs. + +
+
+```sh
+$ aws ecs list-tasks --cluster 'devel' --service-name 'prometheus' --query 'taskArns' --output 'text' \
+  | xargs -I '%%' aws ecs execute-command --cluster 'devel' --task '%%' --container 'prometheus' \
+      --interactive --command 'printenv'
+
+The Session Manager plugin was installed successfully. Use the AWS CLI to start a session.
+
+
+Starting session with SessionId: ecs-execute-command-abcdefghijklmnopqrstuvwxyz
+PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
+HOSTNAME=ip-172-31-10-103.eu-west-1.compute.internal
+AWS_CONTAINER_CREDENTIALS_RELATIVE_URI=/v2/credentials/abcdefgh-1234-abcd-9876-abcdefgh0123
+AWS_DEFAULT_REGION=eu-west-1
+AWS_EXECUTION_ENV=AWS_ECS_FARGATE
+AWS_REGION=eu-west-1
+ECS_AGENT_URI=http://169.254.170.2/api/abcdef0123456789abcdef0123456789-1111111111
+ECS_CONTAINER_METADATA_URI=http://169.254.170.2/v3/abcdef0123456789abcdef0123456789-1111111111
+ECS_CONTAINER_METADATA_URI_V4=http://169.254.170.2/v4/abcdef0123456789abcdef0123456789-1111111111
+HOME=/root
+TERM=xterm-256color
+LANG=C.UTF-8
+
+
+Exiting session with sessionId: ecs-execute-command-abcdefghijklmnopqrstuvwxyz.
+```
+
+
+ ## Storage Refer [Storage options for Amazon ECS tasks]. @@ -503,7 +540,7 @@ Requirements: "Statement": [{ "Effect": "Allow", "Action": "ecs:ExecuteCommand", - "Resource": "arn:aws:ecs:eu-west-1:012345678901:cluster/staging", + "Resource": "arn:aws:ecs:eu-west-1:012345678901:cluster/devel", "Condition": { "StringEquals": { "aws:ResourceTag/application": "appName", @@ -527,15 +564,15 @@ Procedure: Example ```sh - aws ecs describe-tasks --cluster 'staging' --tasks 'ef6260ed8aab49cf926667ab0c52c313' --output 'yaml' \ + aws ecs describe-tasks --cluster 'devel' --tasks 'ef6260ed8aab49cf926667ab0c52c313' --output 'yaml' \ --query 'tasks[0] | { "managedAgents": containers[].managedAgents[?@.name==`ExecuteCommandAgent`][], "enableExecuteCommand": enableExecuteCommand }' - aws ecs list-tasks --cluster 'staging' --service-name 'mimir' --query 'taskArns' --output 'text' \ + aws ecs list-tasks --cluster 'devel' --service-name 'mimir' --query 'taskArns' --output 'text' \ | xargs \ - aws ecs describe-tasks --cluster 'Staging' \ + aws ecs describe-tasks --cluster 'devel' \ --output 'yaml' --query 'tasks[0] | { "managedAgents": containers[].managedAgents[?@.name==`ExecuteCommandAgent`][], "enableExecuteCommand": enableExecuteCommand @@ -560,7 +597,7 @@ Procedure: ```sh aws ecs execute-command --interactive --command 'df -h' \ - --cluster 'staging' --task 'ef6260ed8aab49cf926667ab0c52c313' --container 'nginx' + --cluster 'devel' --task 'ef6260ed8aab49cf926667ab0c52c313' --container 'nginx' ``` ```plaintext @@ -977,6 +1014,7 @@ Specify a supported value for the task CPU and memory in your task definition. 
[efs]: efs.md +[Amazon ECS environment variables]: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-environment-variables.html [amazon ecs exec checker]: https://github.com/aws-containers/amazon-ecs-exec-checker [Amazon ECS Service Discovery]: https://aws.amazon.com/blogs/aws/amazon-ecs-service-discovery/ [amazon ecs services]: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs_services.html diff --git a/knowledge base/mimir.md b/knowledge base/mimir.md index 4f593d2..03ec865 100644 --- a/knowledge base/mimir.md +++ b/knowledge base/mimir.md @@ -19,6 +19,8 @@ and set up alerting rules across multiple tenants to leverage tenant federation. 1. [Migrate to Mimir](#migrate-to-mimir) 1. [Ingest Out-Of-Order samples](#ingest-out-of-order-samples) 1. [Deduplication of data from multiple Prometheus scrapers](#deduplication-of-data-from-multiple-prometheus-scrapers) + 1. [Configure Prometheus for deduplication](#configure-prometheus-for-deduplication) + 1. [Configure Mimir for deduplication](#configure-mimir-for-deduplication) 1. [APIs](#apis) 1. [Troubleshooting](#troubleshooting) 1. [HTTP status 401 Unauthorized: no org id](#http-status-401-unauthorized-no-org-id) @@ -430,6 +432,94 @@ limits: Refer [Configure Grafana Mimir high-availability deduplication]. +Mimir can deduplicate the data received from HA pairs of Prometheus instances.
+It does so by:
+
+- Electing a _leader_ replica for each data source pair.
+- Only ingesting samples from the leader, and dropping the ones from the other replica.
+- Switching the leader to the standby replica, should Mimir see no new samples from the leader for some time (30s by
+  default).
+
+The failure timeout should be kept low enough to avoid dropping too much data before failing over to the standby
+replica.
+For queries using the `rate()` function, it is suggested to make the rate time interval at least four times that of the
+scrape period to account for any of these failover scenarios (e.g., a rate time interval of at least 1 minute for a
+scrape period of 15 seconds).
+
+The distributor includes a high-availability (HA) tracker.
+It deduplicates incoming samples based on the `cluster` and `replica` labels that it expects on each incoming series.
+
+The `cluster` label uniquely identifies the cluster of redundant Prometheus servers for a given tenant.
+The `replica` label uniquely identifies the replica instance within that Prometheus cluster.
+Incoming samples are considered duplicated (and thus dropped) if they are received from any replica that is **not** the
+currently elected leader within its cluster.
+
+> For performance reasons, the HA tracker only checks the `cluster` and `replica` labels of the **first** series in the
+> request to determine whether **all** series in the request should be deduplicated.
+
+### Configure Prometheus for deduplication
+
+Set the two labels for each Prometheus server.
+
+The easiest approach is to set them as _external labels_.
+The default labels are `cluster` and `__replica__`.
+
+```yml
+global:
+  external_labels:
+    cluster: infra
+    __replica__: replica1  # since Prometheus 3.0 one can use vars like ${HOSTNAME}
+```
+
+### Configure Mimir for deduplication
+
+The **minimal** configuration requires the following:
+
+1. Enable the distributor's HA tracker.
+
+
+   Example: enable for all tenants
+
+   ```sh
+   mimir … \
+     -distributor.ha-tracker.enable='true' \
+     -distributor.ha-tracker.enable-for-all-users='true'
+   ```
+
+   ```yml
+   limits:
+     accept_ha_samples: true
+   distributor:
+     ha_tracker:
+       enable_ha_tracker: true
+   ```
+
+
+
+1. Configure the HA tracker's KV store.
+
+   `memberlist` support is currently experimental.
+   See also [In-Depth Comparison of Distributed Coordination Tools: Consul, etcd, ZooKeeper, and Nacos].
+
+
+   Example: inmemory
+
+   ```sh
+   mimir … -distributor.ha-tracker.store='inmemory'
+   ```
+
+   ```yml
+   distributor:
+     ha_tracker:
+       kvstore:
+         store: inmemory
+   ```
+
+
+
+1. Configure the expected label names for each cluster and its replica.
+   Only needed when using different labels than the default ones.
+
 ## APIs
 
 Refer [Grafana Mimir HTTP API].
@@ -525,6 +615,7 @@ ingester:
 - [hashicorp/memberlist]
 - [Gossip protocol]
 - [Ceiling Function]
+- [In-Depth Comparison of Distributed Coordination Tools: Consul, etcd, ZooKeeper, and Nacos]
 
 Alternatives:
 
@@ -581,4 +672,5 @@ Alternatives:
 [Ceiling Function]: https://www.geeksforgeeks.org/ceiling-function/
 [Gossip protocol]: https://en.wikipedia.org/wiki/Gossip_protocol
 [hashicorp/memberlist]: https://github.com/hashicorp/memberlist
+[In-Depth Comparison of Distributed Coordination Tools: Consul, etcd, ZooKeeper, and Nacos]: https://medium.com/@karim.albakry/in-depth-comparison-of-distributed-coordination-tools-consul-etcd-zookeeper-and-nacos-a6f8e5d612a6
 [Mimir on AWS ECS Fargate]: https://github.com/grafana/mimir/discussions/3807#discussioncomment-4602413