docs(prometheus): high availability with extra steps

Michele Cereda
2025-04-17 23:07:24 +02:00
parent b2549693f5
commit 00c142096e
6 changed files with 518 additions and 1 deletion


@@ -316,6 +316,7 @@ yubikeytotp = awscli_plugin_yubikeytotp
## Further readings
- [Amazon Web Services]
- [Codebase]
- CLI [quickstart]
- [Configure profiles] in the CLI
- [How do I assume an IAM role using the AWS CLI?]
@@ -349,6 +350,7 @@ yubikeytotp = awscli_plugin_yubikeytotp
[cli config files]: ../../../examples/dotfiles/.aws
<!-- Upstream -->
[codebase]: https://github.com/aws/aws-cli/tree/v2
[configure profiles]: https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-profiles.html
[how do i assume an iam role using the aws cli?]: https://repost.aws/knowledge-center/iam-assume-role-cli
[improved cli auto-prompt mode]: https://github.com/aws/aws-cli/issues/5664

knowledge base/cortex.md (new file, 78 lines)

@@ -0,0 +1,78 @@
# Cortex
> TODO
Intro
<!-- Remove this line to uncomment if used
## Table of contents <!-- omit in toc -->
1. [TL;DR](#tldr)
1. [Further readings](#further-readings)
1. [Sources](#sources)
## TL;DR
<!-- Uncomment if used
<details>
<summary>Setup</summary>
```sh
```
</details>
-->
<!-- Uncomment if used
<details>
<summary>Usage</summary>
```sh
```
</details>
-->
<!-- Uncomment if used
<details>
<summary>Real world use cases</summary>
```sh
```
</details>
-->
## Further readings
- [Website]
- [Codebase]
- [Prometheus]
Alternatives:
- Grafana's [Mimir]
- [Thanos]
### Sources
- [Documentation]
<!--
Reference
═╬═Time══
-->
<!-- In-article sections -->
<!-- Knowledge base -->
[mimir]: mimir.md
[prometheus]: prometheus.md
[thanos]: thanos.md
<!-- Files -->
<!-- Upstream -->
[codebase]: https://github.com/cortexproject/cortex
[documentation]: https://cortexmetrics.io/docs/
[website]: https://cortexmetrics.io/
<!-- Others -->


@@ -52,6 +52,9 @@ dig '@8.8.8.8' 'google.com'
# Return all results.
dig 'google.com' 'ANY'
# Only return the answers, in terse form.
dig +short 'google.com'
```
</details>
@@ -61,6 +64,7 @@ dig 'google.com' 'ANY'
```sh
dig +trace '@1.1.1.1' 'google.com'
dig 'A' +short '@172.31.0.2' 'fs-0123456789abcdef0.efs.eu-west-1.amazonaws.com'
```
</details>

knowledge base/mimir.md (new file, 335 lines)

@@ -0,0 +1,335 @@
# Grafana's Mimir
Metrics aggregator.
Allows ingesting [Prometheus] or OpenTelemetry metrics, running queries, creating new data through recording rules,
and setting up alerting rules across multiple tenants to leverage tenant federation.
<!-- Remove this line to uncomment if used
## Table of contents <!-- omit in toc -->
1. [TL;DR](#tldr)
1. [Setup](#setup)
1. [Monolithic mode](#monolithic-mode)
1. [Microservices mode](#microservices-mode)
1. [Storage](#storage)
1. [Object storage](#object-storage)
1. [APIs](#apis)
1. [Deduplication of data from multiple Prometheus scrapers](#deduplication-of-data-from-multiple-prometheus-scrapers)
1. [Migrate to Mimir](#migrate-to-mimir)
1. [Further readings](#further-readings)
1. [Sources](#sources)
## TL;DR
Scrapers (like Prometheus or Grafana's Alloy) need to send metrics data to Mimir.<br/>
Mimir will **not** scrape metrics itself.
Mimir listens by default on port `8080` for HTTP and on port `9095` for gRPC.
Mimir stores time series in TSDB blocks, which are uploaded to an object storage bucket.<br/>
These blocks are the same ones that Prometheus and Thanos use, though each application stores blocks in different
places and uses slightly different metadata files for them.
Mimir supports multiple tenants, and stores blocks on a **per-tenant** level.<br/>
When multi-tenancy is **disabled**, it manages a single tenant named `anonymous`.
Blocks can be uploaded using the `mimirtool` utility, so that Mimir can access them.<br/>
Mimir **will** perform some sanitization and validation of each block's metadata.
```sh
mimirtool backfill --address='http://mimir.example.org' --id='anonymous' 'block_1' 'block_N'
```
As a result of validation, Mimir will probably reject Thanos' blocks due to unsupported labels.<br/>
As a workaround, upload Thanos' blocks directly to Mimir's blocks bucket, using the `<tenant>/<block ID>/` prefix.
<details>
<summary>Setup</summary>
```sh
# Pull the container image.
docker pull 'grafana/mimir'

# Run with the default configuration.
mimir
docker run --rm --name 'mimir' --publish '8080:8080' --publish '9095:9095' 'grafana/mimir'

# Run with a custom configuration file.
mimir --config.file='./demo.yaml'
docker run --rm --name 'mimir' --publish '8080:8080' --publish '9095:9095' \
  --volume "$PWD/config.yaml:/etc/mimir/config.yaml" \
  'grafana/mimir' --config.file='/etc/mimir/config.yaml'
```
</details>
<details>
<summary>Usage</summary>
```sh
# Get help.
mimir -help
mimir -help-all

# Validate configuration files.
mimir -modules -config.file 'path/to/config.yaml'

# See the current configuration of components.
# Mimir listens on port 8080 for HTTP by default.
curl 'http://localhost:8080/config'
curl 'http://localhost:8080/runtime_config'

# See changes in the runtime configuration compared to the defaults.
curl 'http://localhost:8080/runtime_config?mode=diff'

# Check the service is ready (readiness probe).
curl 'http://localhost:8080/ready'

# Get metrics.
curl 'http://localhost:8080/metrics'
```
</details>
<!-- Uncomment if used
<details>
<summary>Real world use cases</summary>
```sh
```
</details>
-->
## Setup
Mimir's configuration file is YAML-based.<br/>
There is no default configuration file; one _can_ be specified at launch.
```sh
mimir --config.file='./demo.yaml'
docker run --rm --name 'mimir' --publish '8080:8080' --publish '9095:9095' \
  --volume "$PWD/config.yaml:/etc/mimir/config.yaml" \
  'grafana/mimir' --config.file='/etc/mimir/config.yaml'
```
Refer to [Grafana Mimir configuration parameters] for the available parameters.
If enabled, environment variable references can be used in the configuration file to set values that need to be
configurable during deployment.<br/>
This feature is enabled on the command line via the `-config.expand-env=true` option.
Each variable reference is replaced at startup by the value of the environment variable.<br/>
The replacement is case-**sensitive**, and occurs **before** the YAML file is parsed.<br/>
References to undefined variables are replaced by empty strings unless a default value or custom error text is
specified.
Use the `${VAR}` placeholder, optionally specifying a default value with `${VAR:default_value}`, where `VAR` is the name
of the environment variable and `default_value` is the value to use if the environment variable is undefined.
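As an illustrative sketch only, reusing the S3 settings shown later in this document, with hypothetical `S3_ENDPOINT`
and `AWS_REGION` variables:
```yaml
# Requires launching Mimir with '-config.expand-env=true'.
common:
  storage:
    backend: s3
    s3:
      endpoint: ${S3_ENDPOINT}            # hypothetical variable; replaced by an empty string if undefined
      region: ${AWS_REGION:us-east-2}     # falls back to 'us-east-2' when AWS_REGION is undefined
```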
Configuration files can be stored gzip-compressed. In this case, give a `.gz` extension to those files that should be
decompressed before parsing.
Mimir loads a given configuration file at startup. This configuration **cannot** be modified at runtime.
Mimir supports _secondary_ configuration files that define the _runtime_ configuration.<br/>
This configuration is reloaded **dynamically**, and allows changing the runtime configuration without having to restart
Mimir's components or instances.
Runtime configuration must be **explicitly** enabled, either on launch or in the configuration file under
`runtime_config`.<br/>
If multiple runtime configuration files are specified, they will be **merged** left to right.<br/>
Mimir reloads the contents of these files every 10 seconds.
```sh
mimir … -runtime-config.file='path/to/file/1,path/to/file/N'
```
It only encompasses a **subset** of the whole configuration that was set at startup, but its values take precedence over
command-line options.
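As a sketch of what such a file might contain, per-tenant limit overrides are a typical use case. The tenant name and
values below are made up; check the exact option names against [Grafana Mimir configuration parameters]:
```yaml
# runtime.yaml, reloaded roughly every 10 seconds without restarting Mimir.
overrides:
  tenant-a:                            # hypothetical tenant ID
    ingestion_rate: 350000             # raise this tenant's ingestion rate limit
    max_global_series_per_user: 1500000
```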
Some settings are repeated for multiple components.<br/>
To avoid repetition in the configuration file, set them up in the `common` configuration file section or give them to
Mimir using the `-common.*` CLI options.<br/>
Common settings are applied to all components first, then the components' specific configurations override them.
Settings are applied in the following order, with each later one overriding the previous ones:
1. YAML common values
1. YAML specific values
1. CLI common flags
1. CLI specific flags
Component-specific configuration passed to other components is simply ignored by them.<br/>
This makes it safe to reuse configuration files.
Mimir can be deployed in one of two modes:
- _Monolithic_, which runs all required components in a single process.
- _Microservices_, where components are run as distinct processes.
The deployment mode is determined by the `-target` option given to Mimir's process.
Whatever Mimir's deployment mode, it will need to receive data from other applications.<br/>
It will **not** scrape metrics itself.
<details style="padding: 0 0 1rem 0">
<summary>Prometheus configuration</summary>
```yaml
remote_write:
- url: http://mimir.example.org:9009/api/v1/push
```
</details>
[Grafana] treats Mimir as a data source of type _Prometheus_, which must be [provisioned](grafana.md#datasources)
accordingly.<br/>
From there, metrics can be queried in Grafana's _Explore_ tab, or can populate dashboards that use Mimir as their data
source.
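A minimal provisioning sketch, assuming Grafana's file-based datasource provisioning and a Mimir instance reachable at
the illustrative address below (Mimir's Prometheus-compatible query API is usually exposed under the `/prometheus`
prefix):
```yaml
# E.g., /etc/grafana/provisioning/datasources/mimir.yaml
apiVersion: 1
datasources:
  - name: Mimir
    type: prometheus                                  # Grafana treats Mimir as a Prometheus data source
    access: proxy
    url: http://mimir.example.org:8080/prometheus     # illustrative address, default HTTP port
```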
### Monolithic mode
Runs **all** required components in a **single** process.
Can be horizontally scaled out by deploying multiple instances of Mimir's binary, all of them started with the
`-target=all` option.
```mermaid
graph LR
r(Reads)
w(Writes)
lb(Load Balancer)
m1(Mimir<br/>instance 1)
mN(Mimir<br/>instance N)
os(Object Storage)
r --> lb
w --> lb
lb --> m1
lb --> mN
m1 --> os
mN --> os
```
### Microservices mode
Mimir's components are deployed as distinct processes.<br/>
Each process is invoked with its own `-target` option set to a specific component (e.g., `-target='ingester'` or
`-target='distributor'`).
```mermaid
graph LR
r(Reads)
qf(Query Frontend)
q(Querier)
sg(Store Gateway)
w(Writes)
d(Distributor)
i(Ingester)
os(Object Storage)
c(Compactor)
r --> qf --> q --> sg --> os
w --> d --> i --> os
os <--> c
```
**Every** required component **must** be deployed in order to have a working Mimir instance.
This mode is the preferred method for production deployments, but it is also the most complex.<br/>
Using Kubernetes and the [`mimir-distributed` Helm chart][helm chart] is recommended.
Each component scales up independently.<br/>
This allows for greater flexibility and more granular failure domains.
## Storage
Mimir supports the `s3`, `gcs`, `azure`, `swift`, and `filesystem` backends.<br/>
`filesystem` is the default one.
### Object storage
Refer to [Configure Grafana Mimir object storage backend].
The blocks storage must be located under a **different** prefix or bucket than both the ruler's and the AlertManager's
stores. Mimir **will** fail to start if they share the same location.
To avoid that, it is suggested to override the `bucket_name` setting in the component-specific configurations:
```yaml
common:
  storage:
    backend: s3
    s3:
      endpoint: s3.us-east-2.amazonaws.com
      region: us-east-2
blocks_storage:
  s3:
    bucket_name: mimir-blocks
alertmanager_storage:
  s3:
    bucket_name: mimir-alertmanager
ruler_storage:
  s3:
    bucket_name: mimir-ruler
```
## APIs
Refer to [Grafana Mimir HTTP API].
## Deduplication of data from multiple Prometheus scrapers
Refer to [Configure Grafana Mimir high-availability deduplication].
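As a rough sketch of the moving parts: the distributor's HA tracker elects one replica per Prometheus cluster and drops
samples from the others. The option names below are indicative only; verify them against the linked page:
```yaml
# Indicative only; check key names against the linked documentation.
limits:
  accept_ha_samples: true              # honour the 'cluster' / '__replica__' labels sent by scrapers
distributor:
  ha_tracker:
    enable_ha_tracker: true
    kvstore:
      store: consul                    # the replica election needs a consistent KV store
      consul:
        host: consul.example.org:8500  # illustrative address
```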
## Migrate to Mimir
Refer to [Migrate from Thanos or Prometheus to Grafana Mimir].
## Further readings
- [Website]
- [Codebase]
- [Prometheus]
- [Grafana]
Alternatives:
- [Cortex]
- [Thanos]
### Sources
- [Documentation]
- [Migrate from Thanos or Prometheus to Grafana Mimir]
- [Configure Grafana Mimir object storage backend]
- [Grafana Mimir configuration parameters]
<!--
Reference
═╬═Time══
-->
<!-- In-article sections -->
<!-- Knowledge base -->
[cortex]: cortex.md
[grafana]: grafana.md
[prometheus]: prometheus.md
[thanos]: thanos.md
<!-- Files -->
<!-- Upstream -->
[codebase]: https://github.com/grafana/mimir
[configure grafana mimir high-availability deduplication]: https://grafana.com/docs/mimir/latest/configure/configure-high-availability-deduplication/
[configure grafana mimir object storage backend]: https://grafana.com/docs/mimir/latest/configure/configure-object-storage-backend/
[documentation]: https://grafana.com/docs/mimir/latest/
[grafana mimir configuration parameters]: https://grafana.com/docs/mimir/latest/configure/configuration-parameters/
[grafana mimir http api]: https://grafana.com/docs/mimir/latest/references/http-api/
[helm chart]: https://github.com/grafana/mimir/tree/main/operations/helm/charts/mimir-distributed
[migrate from thanos or prometheus to grafana mimir]: https://grafana.com/docs/mimir/latest/set-up/migrate/migrate-from-thanos-or-prometheus/
[website]: https://grafana.com/oss/mimir/
<!-- Others -->


@@ -21,6 +21,7 @@ prohibited from opening ports by security policies.
1. [Write to remote Prometheus servers](#write-to-remote-prometheus-servers)
1. [Management API](#management-api)
1. [Take snapshots of the current data](#take-snapshots-of-the-current-data)
1. [High availability](#high-availability)
1. [Further readings](#further-readings)
1. [Sources](#sources)
@@ -395,6 +396,17 @@ $ curl -X 'POST' 'http://localhost:9090/api/v1/admin/tsdb/snapshot'
The snapshot now exists at `<data-dir>/snapshots/20171210T211224Z-2be650b6d019eb54`
## High availability
Typically achieved by:
1. Running multiple Prometheus replicas.<br/>
   Replicas could each focus on a subset of the whole data, or just duplicate it (see the sketch after this list).
1. Running a separate AlertManager instance.<br/>
   This handles alerts from all the Prometheus instances, automatically deduplicating any duplicate alerts.
1. Using tools like [Thanos], [Cortex], or Grafana's [Mimir] to aggregate and deduplicate data.
1. Directing visualizers like Grafana to the aggregator instead of the Prometheus replicas.
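A minimal sketch of the duplicating-replicas approach, assuming a Mimir-like aggregator performs the deduplication. The
`cluster` and `__replica__` external labels are the ones such aggregators usually expect by default, and the push URL
is illustrative:
```yaml
# prometheus.replica-1.yml; replica 2 is identical except for the '__replica__' value.
global:
  external_labels:
    cluster: prometheus-ha        # identifies the HA pair
    __replica__: replica-1        # unique per replica; dropped by the aggregator when deduplicating
remote_write:
  - url: http://mimir.example.org:9009/api/v1/push   # illustrative aggregator endpoint
```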
## Further readings
- [Website]
@@ -412,6 +424,9 @@ The snapshot now exists at `<data-dir>/snapshots/20171210T211224Z-2be650b6d019eb
- [Prometheus Definitive Guide Part I - Metrics and Use Cases]
- [Prometheus Definitive Guide Part II - Prometheus Query Language]
- [Prometheus Definitive Guide Part III - Prometheus Operator]
- [Cortex]
- [Thanos]
- Grafana's [Mimir]
### Sources
@@ -432,6 +447,7 @@ The snapshot now exists at `<data-dir>/snapshots/20171210T211224Z-2be650b6d019eb
- [Install Prometheus and Grafana by Helm]
- [Prometheus and Grafana setup in Minikube]
- [I need to know about the below kube_state_metrics description. Exactly looking is what the particular metrics doing]
- [High Availability in Prometheus: Best Practices and Tips]
<!--
Reference
@@ -439,9 +455,12 @@ The snapshot now exists at `<data-dir>/snapshots/20171210T211224Z-2be650b6d019eb
-->
<!-- Knowledge base -->
[cortex]: cortex.md
[grafana]: grafana.md
[mimir]: mimir.md
[node exporter]: node%20exporter.md
[snmp exporter]: snmp%20exporter.md
[thanos]: thanos.md
<!-- Files -->
[docker/monitoring]: ../docker%20compositions/monitoring/README.md
@@ -465,6 +484,7 @@ The snapshot now exists at `<data-dir>/snapshots/20171210T211224Z-2be650b6d019eb
[dropping metrics at scrape time with prometheus]: https://www.robustperception.io/dropping-metrics-at-scrape-time-with-prometheus/
[getting started with prometheus]: https://opensource.com/article/18/12/introduction-prometheus
[high availability for prometheus and alertmanager: an overview]: https://promlabs.com/blog/2023/08/31/high-availability-for-prometheus-and-alertmanager-an-overview/
[high availability in prometheus: best practices and tips]: https://last9.io/blog/high-availability-in-prometheus/
[how i monitor my openwrt router with grafana cloud and prometheus]: https://grafana.com/blog/2021/02/09/how-i-monitor-my-openwrt-router-with-grafana-cloud-and-prometheus/
[how relabeling in prometheus works]: https://grafana.com/blog/2022/03/21/how-relabeling-in-prometheus-works/
[how to integrate prometheus and grafana on kubernetes using helm]: https://semaphoreci.com/blog/prometheus-grafana-kubernetes-helm

knowledge base/thanos.md (new file, 78 lines)

@@ -0,0 +1,78 @@
# Thanos
> TODO
Intro
<!-- Remove this line to uncomment if used
## Table of contents <!-- omit in toc -->
1. [TL;DR](#tldr)
1. [Further readings](#further-readings)
1. [Sources](#sources)
## TL;DR
<!-- Uncomment if used
<details>
<summary>Setup</summary>
```sh
```
</details>
-->
<!-- Uncomment if used
<details>
<summary>Usage</summary>
```sh
```
</details>
-->
<!-- Uncomment if used
<details>
<summary>Real world use cases</summary>
```sh
```
</details>
-->
## Further readings
- [Website]
- [Codebase]
- [Prometheus]
Alternatives:
- [Cortex]
- Grafana's [Mimir]
### Sources
- [Documentation]
<!--
Reference
═╬═Time══
-->
<!-- In-article sections -->
<!-- Knowledge base -->
[cortex]: cortex.md
[mimir]: mimir.md
[prometheus]: prometheus.md
<!-- Files -->
<!-- Upstream -->
[codebase]: https://github.com/thanos-io/thanos
[documentation]: https://thanos.io/tip/thanos/
[website]: https://thanos.io/
<!-- Others -->