mirror of https://gitea.com/mcereda/oam.git, synced 2026-02-09 05:44:23 +00:00
feat(mimir): deduplicate multiple prometheus sources
@@ -7,6 +7,7 @@
1. [Standalone tasks](#standalone-tasks)
1. [Services](#services)
1. [Resource constraints](#resource-constraints)
1. [Environment variables](#environment-variables)
1. [Storage](#storage)
1. [EBS volumes](#ebs-volumes)
1. [EFS volumes](#efs-volumes)
@@ -21,7 +22,7 @@
1. [Troubleshooting](#troubleshooting)
1. [Invalid 'cpu' setting for task](#invalid-cpu-setting-for-task)
1. [Further readings](#further-readings)
1. [Sources](#sources)
## TL;DR

@@ -279,6 +280,42 @@ the `memoryReservation` value.<br/>

If specifying `memoryReservation`, that value is guaranteed to the container and subtracted from the available memory
resources for the container instance that the container is placed on. Otherwise, the value of `memory` is used.
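
For illustration, a container definition fragment combining both settings might look like the following (container name, image, and values are hypothetical; units are MiB):

```json
{
  "name": "app",
  "image": "nginx:latest",
  "memory": 512,
  "memoryReservation": 256
}
```

Here 256 MiB is reserved for the container on its instance, while 512 MiB acts as the hard limit.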

## Environment variables

Refer [Amazon ECS environment variables].

ECS sets default environment variables for any task it runs.

<details>

```sh
$ aws ecs list-tasks --cluster 'devel' --service-name 'prometheus' --query 'taskArns' --output 'text' \
  | xargs -I '%%' aws ecs execute-command --cluster 'devel' --task '%%' --container 'prometheus' \
      --interactive --command 'printenv'

The Session Manager plugin was installed successfully. Use the AWS CLI to start a session.


Starting session with SessionId: ecs-execute-command-abcdefghijklmnopqrstuvwxyz
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
HOSTNAME=ip-172-31-10-103.eu-west-1.compute.internal
AWS_CONTAINER_CREDENTIALS_RELATIVE_URI=/v2/credentials/abcdefgh-1234-abcd-9876-abcdefgh0123
AWS_DEFAULT_REGION=eu-west-1
AWS_EXECUTION_ENV=AWS_ECS_FARGATE
AWS_REGION=eu-west-1
ECS_AGENT_URI=http://169.254.170.2/api/abcdef0123456789abcdef0123456789-1111111111
ECS_CONTAINER_METADATA_URI=http://169.254.170.2/v3/abcdef0123456789abcdef0123456789-1111111111
ECS_CONTAINER_METADATA_URI_V4=http://169.254.170.2/v4/abcdef0123456789abcdef0123456789-1111111111
HOME=/root
TERM=xterm-256color
LANG=C.UTF-8


Exiting session with sessionId: ecs-execute-command-abcdefghijklmnopqrstuvwxyz.
```

</details>

## Storage

Refer [Storage options for Amazon ECS tasks].
|
||||
@@ -503,7 +540,7 @@ Requirements:
|
||||
"Statement": [{
|
||||
"Effect": "Allow",
|
||||
"Action": "ecs:ExecuteCommand",
|
||||
"Resource": "arn:aws:ecs:eu-west-1:012345678901:cluster/staging",
|
||||
"Resource": "arn:aws:ecs:eu-west-1:012345678901:cluster/devel",
|
||||
"Condition": {
|
||||
"StringEquals": {
|
||||
"aws:ResourceTag/application": "appName",
|
||||
@@ -527,15 +564,15 @@ Procedure:
|
||||
<summary>Example</summary>
|
||||

```sh
-aws ecs describe-tasks --cluster 'staging' --tasks 'ef6260ed8aab49cf926667ab0c52c313' --output 'yaml' \
+aws ecs describe-tasks --cluster 'devel' --tasks 'ef6260ed8aab49cf926667ab0c52c313' --output 'yaml' \
   --query 'tasks[0] | {
     "managedAgents": containers[].managedAgents[?@.name==`ExecuteCommandAgent`][],
     "enableExecuteCommand": enableExecuteCommand
   }'

-aws ecs list-tasks --cluster 'staging' --service-name 'mimir' --query 'taskArns' --output 'text' \
+aws ecs list-tasks --cluster 'devel' --service-name 'mimir' --query 'taskArns' --output 'text' \
   | xargs \
-    aws ecs describe-tasks --cluster 'Staging' \
+    aws ecs describe-tasks --cluster 'devel' \
       --output 'yaml' --query 'tasks[0] | {
         "managedAgents": containers[].managedAgents[?@.name==`ExecuteCommandAgent`][],
         "enableExecuteCommand": enableExecuteCommand
@@ -560,7 +597,7 @@ Procedure:

```sh
aws ecs execute-command --interactive --command 'df -h' \
-  --cluster 'staging' --task 'ef6260ed8aab49cf926667ab0c52c313' --container 'nginx'
+  --cluster 'devel' --task 'ef6260ed8aab49cf926667ab0c52c313' --container 'nginx'
```

```plaintext
@@ -977,6 +1014,7 @@ Specify a supported value for the task CPU and memory in your task definition.
[efs]: efs.md

<!-- Upstream -->
[Amazon ECS environment variables]: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-environment-variables.html
[amazon ecs exec checker]: https://github.com/aws-containers/amazon-ecs-exec-checker
[Amazon ECS Service Discovery]: https://aws.amazon.com/blogs/aws/amazon-ecs-service-discovery/
[amazon ecs services]: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs_services.html
@@ -19,6 +19,8 @@ and set up alerting rules across multiple tenants to leverage tenant federation.
1. [Migrate to Mimir](#migrate-to-mimir)
1. [Ingest Out-Of-Order samples](#ingest-out-of-order-samples)
1. [Deduplication of data from multiple Prometheus scrapers](#deduplication-of-data-from-multiple-prometheus-scrapers)
1. [Configure Prometheus for deduplication](#configure-prometheus-for-deduplication)
1. [Configure Mimir for deduplication](#configure-mimir-for-deduplication)
1. [APIs](#apis)
1. [Troubleshooting](#troubleshooting)
1. [HTTP status 401 Unauthorized: no org id](#http-status-401-unauthorized-no-org-id)
@@ -430,6 +432,94 @@ limits:

Refer [Configure Grafana Mimir high-availability deduplication].

Mimir can deduplicate the data received from HA pairs of Prometheus instances.<br/>
It does so by:

- Electing a _leader_ replica for each data source pair.
- Only ingesting samples from the leader, and dropping the ones from the other replica.
- Switching the leader to the standby replica, should Mimir see no new samples from the leader for some time (30s by
  default).
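
As a sketch, the failover timeout should be tunable in the distributor's HA tracker block; the key name below is an assumption taken from the Mimir configuration reference and should be verified against your version:

```yml
distributor:
  ha_tracker:
    # assumed key name; time without samples from the elected leader
    # before failing over to another replica (default 30s)
    ha_tracker_failover_timeout: 30s
```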

The failure timeout should be kept low enough to avoid dropping too much data before failing over to the standby
replica.<br/>
For queries using the `rate()` function, it is suggested to make the rate time interval at least four times the
scrape period to account for these failover scenarios (e.g., a rate time interval of at least 1 minute for a scrape
period of 15 seconds).
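
Applied to a recording rule, the guidance above would look like this minimal sketch (the metric and rule names are hypothetical; the 1m window is 4x the 15s scrape period):

```yml
groups:
  - name: example-rules
    rules:
      # 1m rate window >= 4 x 15s scrape period, to tolerate HA failover gaps
      - record: job:http_requests:rate1m
        expr: rate(http_requests_total[1m])
```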

The distributor includes a high-availability (HA) tracker.<br/>
It deduplicates incoming samples based on the `cluster` and `replica` labels that it expects on each incoming series.

The `cluster` label uniquely identifies the cluster of redundant Prometheus servers for a given tenant.<br/>
The `replica` label uniquely identifies the replica instance within that Prometheus cluster.<br/>
Incoming samples are considered duplicated (and thus dropped) if they are received from any replica that is **not** the
currently elected leader within any cluster.

> For performance reasons, the HA tracker only checks the `cluster` and `replica` labels of the **first** series in the
> request to determine whether **all** series in the request should be deduplicated.

### Configure Prometheus for deduplication

Set the two labels on each Prometheus server.

The easiest approach is to set them as _external labels_.<br/>
The default label names are `cluster` and `__replica__`.

```yml
global:
  external_labels:
    cluster: infra
    __replica__: replica1  # since Prometheus 3.0 one can use vars like ${HOSTNAME}
```

### Configure Mimir for deduplication

The **minimal** configuration requires the following:

1. Enable the distributor's HA tracker.

   <details style="padding: 0 0 1rem 1rem">
     <summary>Example: enable for <b>all</b> tenants</summary>

   ```sh
   mimir … \
     -distributor.ha-tracker.enable='true' \
     -distributor.ha-tracker.enable-for-all-users='true'
   ```

   ```yml
   limits:
     accept_ha_samples: true
   distributor:
     ha_tracker:
       enable_ha_tracker: true
   ```

   </details>

1. Configure the HA tracker's KV store.

   `memberlist` support is currently experimental.<br/>
   See also [In-Depth Comparison of Distributed Coordination Tools: Consul, etcd, ZooKeeper, and Nacos].

   <details style="padding: 0 0 1rem 1rem">
     <summary>Example: <code>inmemory</code></summary>

   ```sh
   mimir … -distributor.ha-tracker.store='inmemory'
   ```

   ```yml
   distributor:
     ha_tracker:
       kvstore:
         store: inmemory
   ```

   </details>

1. Configure the expected label names for each cluster and its replica.<br/>
   Only needed when using different labels than the default ones.
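
As a sketch, custom label names should be settable as per-tenant limits; the `ha_cluster_label` and `ha_replica_label` key names below are assumptions to be verified against the Mimir configuration reference for your version:

```yml
limits:
  # assumed key names; check the Mimir configuration reference
  ha_cluster_label: site
  ha_replica_label: __replica__
```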

## APIs

Refer [Grafana Mimir HTTP API].

@@ -525,6 +615,7 @@ ingester:
- [hashicorp/memberlist]
- [Gossip protocol]
- [Ceiling Function]
- [In-Depth Comparison of Distributed Coordination Tools: Consul, etcd, ZooKeeper, and Nacos]

Alternatives:

@@ -581,4 +672,5 @@ Alternatives:
[Ceiling Function]: https://www.geeksforgeeks.org/ceiling-function/
[Gossip protocol]: https://en.wikipedia.org/wiki/Gossip_protocol
[hashicorp/memberlist]: https://github.com/hashicorp/memberlist
[In-Depth Comparison of Distributed Coordination Tools: Consul, etcd, ZooKeeper, and Nacos]: https://medium.com/@karim.albakry/in-depth-comparison-of-distributed-coordination-tools-consul-etcd-zookeeper-and-nacos-a6f8e5d612a6
[Mimir on AWS ECS Fargate]: https://github.com/grafana/mimir/discussions/3807#discussioncomment-4602413