mirror of https://gitea.com/mcereda/oam.git synced 2026-02-09 05:44:23 +00:00

Michele Cereda 836fd43764 chore(loki,mimir): add sources

2025-05-27 23:42:09 +02:00

Grafana's Mimir

Metrics aggregator.

Allows ingesting Prometheus or OpenTelemetry metrics, running queries, creating new data through recording rules, and setting up alerting rules across multiple tenants by leveraging tenant federation.

TL;DR
Setup
Storage
1. Object storage
Authentication and authorization
Migrate to Mimir
Ingest Out-Of-Order samples
Deduplication of data from multiple Prometheus scrapers
1. Configure Prometheus for deduplication
2. Configure Mimir for deduplication
APIs
Troubleshooting
1. HTTP status 401 Unauthorized: no org id
2. HTTP status 500 Internal Server Error: send data to ingesters: at least 2 live replicas required, could only find 1
Further readings
1. Sources

TL;DR

Scrapers (like Prometheus or Grafana's Alloy) need to send metrics data to Mimir.
Mimir will not scrape metrics itself.

The server listens by default on port 8080 for HTTP and on port 9095 for gRPC.
It also internally advertises data or actions to members in the cluster using hashicorp/memberlist, which implements a gossip protocol. This uses port 7946 by default, and must be reachable by all members in the cluster to work.

Mimir stores time series in TSDB blocks on the local file system by default.
It can upload those blocks to an object storage bucket.

The data blocks use the same format that Prometheus and Thanos use for storage, though each application stores blocks in different places and uses slightly different metadata files for them.

Mimir supports multiple tenants, and stores blocks on a per-tenant level.
Multi-tenancy is enabled by default. It can be disabled using the -auth.multitenancy-enabled=false option.

Multi-tenancy, if enabled, will require every API request to have the X-Scope-OrgID header with the value set to the tenant ID one is authenticating for.
When multi-tenancy is disabled, Mimir will store everything under a single tenant going by the name anonymous, and will assume all API requests are for it by automatically filling the X-Scope-OrgID header if not given.
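
A minimal Python sketch of what a tenant-scoped API request could look like. It assumes Mimir is reachable at the example address used in these notes and serves the Prometheus-compatible API under the /prometheus prefix. The build_query_request helper is hypothetical; it only builds the request and does not send it.

```python
import urllib.request
from urllib.parse import urlencode


def build_query_request(base_url: str, tenant_id: str, promql: str) -> urllib.request.Request:
    """Build (but do not send) an instant-query request scoped to one tenant."""
    body = urlencode({"query": promql}).encode()
    request = urllib.request.Request(
        f"{base_url}/prometheus/api/v1/query", data=body, method="POST"
    )
    # Mimir reads the tenant ID from this header when multi-tenancy is enabled.
    request.add_header("X-Scope-OrgID", tenant_id)
    return request


# Assumed example address and tenant; adjust both to the actual deployment.
request = build_query_request("http://mimir.example.org:8080", "anonymous", "up")

# Sending it requires a reachable Mimir instance:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode())
```

urllib normalizes the capitalization of header names, which is harmless since HTTP header names are case-insensitive.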

Blocks can be uploaded using the mimirtool utility, so that Mimir can access them.
The server will perform some sanitization and validation of each block's metadata.

mimirtool backfill --address='http://mimir.example.org' --id='anonymous' 'block_1' … 'block_N'

As a result of validation, Mimir will probably reject Thanos' blocks due to unsupported labels.
As a workaround, upload Thanos' blocks directly to Mimir's blocks directory or bucket, using the <tenant>/<block ID>/ prefix.

Setup

docker pull 'grafana/mimir'
helm repo add 'grafana' 'https://grafana.github.io/helm-charts' && helm repo update 'grafana'

# Does *not* look for default configuration files.
# When no configuration file is given, only default values are used. This is not something one might usually want.
mimir --config.file='./demo.yaml'
docker run --rm --name 'mimir' -p '8080:8080' -p '9095:9095' -v "$PWD/config.yaml:/etc/mimir/config.yaml" \
  'grafana/mimir' --config.file='/etc/mimir/config.yaml'
helm --namespace 'mimir-test' upgrade --install --create-namespace 'mimir' 'grafana/mimir-distributed'

Usage

# Get help.
mimir -help
mimir -help-all

# Validate configuration files.
mimir -modules -config.file 'path/to/config.yaml'

# Run tests.
# Refer <https://grafana.com/docs/mimir/latest/manage/tools/mimir-continuous-test/>.
# '-tests.smoke-test' runs the tests just once.
# The listen ports are changed to avoid colliding with the running instance.
mimir -target='continuous-test' \
  -tests.write-endpoint='http://localhost:8080' -tests.read-endpoint='http://localhost:8080' \
  -tests.smoke-test \
  -server.http-listen-port='18080' -server.grpc-listen-port='19095'

# See the current configuration of components.
GET /config
GET /runtime_config

# See changes in the runtime configuration from the default one.
GET /runtime_config?mode=diff

# Check the service is ready.
# A.K.A. readiness probe.
GET /ready

# Get metrics.
GET /metrics

Setup

Mimir's configuration file is YAML-based.

There is no default configuration file, but one can be specified on launch.
If neither a configuration file nor CLI options are given, only the default values will be used.

mimir --config.file='./demo.yaml'

docker run --rm --name 'mimir' --publish '8080:8080' --publish '9095:9095' \
  --volume "$PWD/config.yaml:/etc/mimir/config.yaml" \
  'grafana/mimir' --config.file='/etc/mimir/config.yaml'

Refer Grafana Mimir configuration parameters for the available parameters.

If enabled, environment variable references can be used in the configuration file to set values that need to be configurable during deployment.
This feature is enabled on the command line via the -config.expand-env=true option.

Each variable reference is replaced at startup by the value of the environment variable.
The replacement is case-sensitive, and occurs before the YAML file is parsed.
References to undefined variables are replaced by empty strings unless a default value or custom error text is specified.

Use the ${VAR} placeholder, optionally specifying a default value with ${VAR:default_value}, where VAR is the name of the environment variable and default_value is the value to use if the environment variable is undefined.
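
The substitution rules above can be emulated with a short Python sketch. This is an illustration of the semantics as described in these notes, not Mimir's own code: expand_env is a hypothetical helper, the variable names are made up, and how Mimir treats variables that are set but empty is not covered here.

```python
import re
from typing import Mapping

# Matches ${VAR} and ${VAR:default_value}.
_PLACEHOLDER = re.compile(r"\$\{(\w+)(?::([^}]*))?\}")


def expand_env(text: str, env: Mapping[str, str]) -> str:
    """Replace placeholders in raw text, before any YAML parsing happens."""

    def replace(match: "re.Match[str]") -> str:
        name, default = match.group(1), match.group(2)
        if name in env:
            return env[name]
        # Undefined variables fall back to the default, or to an empty string.
        return default if default is not None else ""

    return _PLACEHOLDER.sub(replace, text)


raw_config = "bucket_name: ${BUCKET:mimir-blocks}\nregion: ${REGION}\n"
expanded = expand_env(raw_config, {"REGION": "us-east-2"})
print(expanded)
```

In a real run, the env mapping would be os.environ, and the expansion only happens when -config.expand-env=true is given.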

Configuration files can be stored gz-compressed. In this case, add a .gz extension to those files that should be decompressed before parsing.

Mimir loads a given configuration file at startup. This configuration cannot be modified at runtime.

Mimir supports secondary configuration files that define the runtime's configuration.
This configuration is reloaded dynamically. It allows changing the runtime configuration without restarting Mimir's components or instance.

Runtime configuration must be explicitly enabled, either on launch or in the configuration file under runtime_config.
If multiple runtime configuration files are specified, they will be merged left to right.
Mimir reloads the contents of these files every 10 seconds.

mimir … -runtime-config.file='path/to/file/1,path/to/file/N'

The runtime configuration only encompasses a subset of the whole configuration that was set at startup, but its values take precedence over command-line options.

Some settings are repeated for multiple components.
To avoid repetition in the configuration file, set them up in the common configuration file section or give them to Mimir using the -common.* CLI options.
Common settings are applied to all components first, then the components' specific configurations override them.

Settings are applied as follows, with each one applied later overriding the previous ones:

1. YAML common values
2. YAML specific values
3. CLI common flags
4. CLI specific flags
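
The override order can be pictured as layered dictionaries where later layers win. A rough Python sketch follows; the flat keys are made-up names for illustration, and it glosses over how common values are mapped onto each component.

```python
def effective_settings(*layers: dict) -> dict:
    """Merge setting layers in order; later layers override earlier ones."""
    merged: dict = {}
    for layer in layers:
        merged.update(layer)
    return merged


settings = effective_settings(
    {"storage.backend": "filesystem", "storage.s3.region": "us-east-2"},  # YAML common values
    {"storage.backend": "s3"},                                            # YAML specific values
    {"storage.s3.region": "eu-west-1"},                                   # CLI common flags
    {},                                                                   # CLI specific flags
)
print(settings)
```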

Specific configuration for one component that is passed to other components is simply ignored by those.
This makes it safe to reuse files.

Mimir can be deployed in one of two modes:

Monolithic, which runs all required components in a single process.
Microservices, where components are run as distinct processes.

The deployment mode is determined by the target option given to Mimir's process.

$ mimir -target='ruler'
$ mimir -target='all,alertmanager,overrides-exporter'

$ yq -y '.' 'config.yml'
target: all,alertmanager,overrides-exporter
$ mimir -config.file='config.yml'

Whatever Mimir's deployment mode, it will need to receive data from other applications.
It will not scrape metrics itself.

Prometheus configuration

remote_write:
  - url: http://mimir.example.org:8080/api/v1/push
    headers:
      # Required unless multi-tenancy is disabled.
      # Set it to the correct tenant ID; 'anonymous' is the default one.
      X-Scope-OrgID: anonymous

Grafana considers Mimir a data source of type Prometheus, and it must be provisioned accordingly.

Example

---
apiVersion: 1
datasources:
  - id: 1
    name: Mimir
    orgId: 1
    uid: abcdef01-a0c1-432e-8ef5-7b277cb0b32b
    type: prometheus
    typeName: Prometheus
    typeLogoUrl: public/app/plugins/datasource/prometheus/img/prometheus_logo.svg
    access: proxy
    url: http://mimir.example.org:8080/prometheus
    user: ''
    database: ''
    basicAuth: false
    isDefault: true
    jsonData:
      httpMethod: POST
    readOnly: false

From there, metrics can be queried in Grafana's Explore tab, or can populate dashboards that use Mimir as their data source.

Monolithic mode

Runs all required components in a single process.

Can be horizontally scaled out by deploying multiple instances of Mimir's binary, all of them started with the -target=all option.

graph LR
  r(Reads)
  w(Writes)
  lb(Load Balancer)
  m1(Mimir<br/>instance 1)
  mN(Mimir<br/>instance N)
  os(Object Storage)

  r --> lb
  w --> lb
  lb --> m1
  lb --> mN
  m1 --> os
  mN --> os

By default Mimir expects 3 ingester replicas, and data ingestion will fail if there are fewer than 2 in the ingester ring.
See HTTP status 500 Internal Server Error: send data to ingesters: at least 2 live replicas required, could only find 1.

Microservices mode

Mimir's components are deployed as distinct processes.
Each process is invoked with its own -target option set to a specific component (e.g., -target='ingester' or -target='distributor').

graph LR
  r(Reads)
  qf(Query Frontend)
  q(Querier)
  sg(Store Gateway)
  w(Writes)
  d(Distributor)
  i(Ingester)
  os(Object Storage)
  c(Compactor)

  r --> qf --> q --> sg --> os
  w --> d --> i --> os
  os <--> c

Every required component must be deployed in order to have a working Mimir instance.

This mode is the preferred method for production deployments, but it is also the most complex.
Using Kubernetes and the mimir-distributed Helm chart is recommended.

Each component scales up independently.
This allows for greater flexibility and more granular failure domains.

Run on AWS ECS Fargate

See also AWS ECS and Mimir on AWS ECS Fargate.

Things to consider:

Go for ECS service discovery instead of ECS Service Connect.

This needs to be confirmed, but it is how it worked for me.

Apparently, at the time of writing, Service Connect prefers answering in IPv6 for ECS-related queries.
There seems to be no way to customize this for now.

At the same time, hashicorp/memberlist seems to only use IPv4 unless explicitly required to listen on an IPv6 address, which one would have no way to set programmatically before creating the resources.

Storage

Mimir supports the s3, gcs, azure, swift, and filesystem backends.
filesystem is the default one.

Object storage

Refer Configure Grafana Mimir object storage backend.

Blocks storage must be located under a different prefix or bucket than both the ruler's and Alertmanager's stores. Mimir will fail to start if that is not the case.

To avoid that, it is suggested to override the bucket_name setting in the specific configurations.

Different buckets

common:
  storage:
    backend: s3
    s3:
      endpoint: s3.us-east-2.amazonaws.com  # required
      region: us-east-2

blocks_storage:
  s3:
    bucket_name: mimir-blocks

alertmanager_storage:
  s3:
    bucket_name: mimir-alertmanager

ruler_storage:
  s3:
    bucket_name: mimir-ruler

Same bucket, different prefixes

common:
  storage:
    backend: s3
    s3:
      endpoint: s3.us-east-2.amazonaws.com  # required
      region: us-east-2
      bucket_name: mimir

blocks_storage:
  storage_prefix: blocks

alertmanager_storage:
  storage_prefix: alertmanager

ruler_storage:
  storage_prefix: ruler

The WAL is only retained on local disk, not persisted to the object storage.

Metrics data is uploaded to the object storage every 2 hours, typically when a block is cut from the in-memory TSDB head.
After the metrics data block is uploaded, its related WAL is truncated too.

Authentication and authorization

Refer Grafana Mimir authentication and authorization.

Migrate to Mimir

Refer Configure TSDB block upload and Migrate from Thanos or Prometheus to Grafana Mimir.

Ingest Out-Of-Order samples

Refer Configure out-of-order samples ingestion.

limits:
  out_of_order_time_window: 5m  # Allow up to 5 minutes since the latest received sample for the series.

Deduplication of data from multiple Prometheus scrapers

Refer Configure Grafana Mimir high-availability deduplication.

Mimir can deduplicate the data received from HA pairs of Prometheus instances.
It does so by:

Electing a leader replica for each data source pair.
Only ingesting samples from the leader, and dropping the ones from the other replica.
Switching the leader to the standby replica, should Mimir see no new samples from the leader for some time (30s by default).

The failure timeout should be kept low enough to avoid dropping too much data before failing over to the standby replica.
For queries using the rate() function, it is suggested to make the rate time interval at least four times the scrape period to account for any of these failover scenarios (e.g., a rate time interval of at least 1 minute for a scrape period of 15 seconds).
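
The rule of thumb is plain arithmetic. A small Python sketch that derives the range and builds the matching PromQL expression; rate_expression is a hypothetical helper and the metric name is only an example.

```python
def rate_expression(metric: str, scrape_interval_seconds: int, factor: int = 4) -> str:
    """Build a rate() expression whose range is `factor` times the scrape period."""
    window_seconds = factor * scrape_interval_seconds
    return f"rate({metric}[{window_seconds}s])"


# 15 seconds scrape period -> 60 seconds (1 minute) range.
expression = rate_expression("http_requests_total", 15)
print(expression)
```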

The distributor includes a high-availability (HA) tracker.
It deduplicates incoming samples based on a cluster and replica label that it expects on each incoming series.

The cluster label uniquely identifies the cluster of redundant Prometheus servers for a given tenant.
The replica label uniquely identifies the replica instance within that Prometheus cluster.
Incoming samples are considered duplicated (and thus dropped) if they are received from any replica that is not the currently elected leader within any cluster.

For performance reasons, the HA tracker only checks the cluster and replica label of the first series in the request to determine whether all series in the request should be deduplicated.
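
The election and failover behaviour described above can be sketched in Python as follows. This is a simplified model under stated assumptions, not Mimir's implementation: HaTrackerSketch is a hypothetical class, time is passed in explicitly, and details of the real tracker (KV store, update intervals, per-tenant limits) are left out.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class HaTrackerSketch:
    # Seconds without samples from the leader before failing over (30s default in these notes).
    failover_timeout: float = 30.0
    # cluster label -> (elected replica, time its last sample was accepted)
    leaders: Dict[str, Tuple[str, float]] = field(default_factory=dict)

    def accept(self, cluster: str, replica: str, now: float) -> bool:
        """Return True if the sample should be ingested, False if it is dropped."""
        leader = self.leaders.get(cluster)
        if leader is None or leader[0] == replica:
            # First replica seen becomes the leader; the leader's samples are always kept.
            self.leaders[cluster] = (replica, now)
            return True
        if now - leader[1] > self.failover_timeout:
            # The leader has been silent for too long: switch to this replica.
            self.leaders[cluster] = (replica, now)
            return True
        # Sample comes from a non-leader replica: considered a duplicate.
        return False


tracker = HaTrackerSketch()
results = [
    tracker.accept("infra", "replica1", now=0.0),   # elected leader
    tracker.accept("infra", "replica2", now=5.0),   # dropped as duplicate
    tracker.accept("infra", "replica1", now=15.0),  # leader, kept
    tracker.accept("infra", "replica2", now=40.0),  # leader silent for 25s only, dropped
    tracker.accept("infra", "replica2", now=46.0),  # leader silent for 31s, failover
    tracker.accept("infra", "replica1", now=47.0),  # old leader is now the standby, dropped
]
print(results)
```

The samples sent by the standby between the leader's last sample and the failover are lost, which is why the timeout should be kept low.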

Configure Prometheus for deduplication

Set the two labels for each Prometheus server.

The easiest approach is to set them as external labels.
The default labels are cluster and __replica__.

global:
  external_labels:
    cluster: infra
    __replica__: replica1  # since Prometheus 3.0 one can use vars like ${HOSTNAME}

Configure Mimir for deduplication

The minimal configuration requires the following:

Enable the distributor's HA tracker.

Example: enable for all tenants

mimir … \
  -distributor.ha-tracker.enable='true' \
  -distributor.ha-tracker.enable-for-all-users='true'

limits:
  accept_ha_samples: true
distributor:
  ha_tracker:
    enable_ha_tracker: true

Configure the HA tracker's KV store.

memberlist support is currently experimental.
See also In-Depth Comparison of Distributed Coordination Tools: Consul, etcd, ZooKeeper, and Nacos.

Example: inmemory

```
mimir … -distributor.ha-tracker.store='inmemory'
```

```
distributor:
  ha_tracker:
    kvstore:
      store: inmemory
```

Configure the expected label names for each cluster and its replica.
Only needed when using different labels than the default ones.

APIs

Refer Grafana Mimir HTTP API.

Troubleshooting

HTTP status 401 Unauthorized: no org id

Context: Prometheus servers get this error when trying to push metrics.

Root cause: The push request is missing the X-Scope-OrgID header that would specify the tenant for the data.

Solution:

Configure Prometheus to add the X-Scope-OrgID header to the data pushed.
The header is required whenever multi-tenancy is enabled. When multi-tenancy is disabled, Mimir stores everything under the default anonymous tenant, which can also be set explicitly:

remote_write:
  - url: http://mimir.example.org:8080/api/v1/push
    headers:
      # Required unless multi-tenancy is disabled.
      # Set it to the correct tenant ID; 'anonymous' is the default one.
      X-Scope-OrgID: anonymous

HTTP status 500 Internal Server Error: send data to ingesters: at least 2 live replicas required, could only find 1

Context:

Mimir is running on AWS ECS in monolithic mode for evaluation.
It is loading the following configuration from a mounted AWS EFS volume:

multitenancy_enabled: false

common:
  storage:
    backend: s3
    s3:
      endpoint: s3.us-east-2.amazonaws.com  # required
blocks_storage:
  s3:
    bucket_name: my-mimir-blocks

The service is backed by a load balancer.
The load balancer is allowing requests to reach the task serving Mimir correctly.

A Prometheus server is configured to send data to Mimir:

remote_write:
  - url: http://mimir.dev.somecompany.com:8080/api/v1/push
    headers:
      X-Scope-OrgID: anonymous

The push request passes Mimir's validation for the above header.

Both Mimir and Prometheus print this error when Prometheus tries to push metrics.

Root cause:

Writes to the Mimir cluster are successful only if the majority of the ingesters received the data.

The default value for the ingester.ring.replication_factor setting is 3. As such, Mimir expects 3 ingesters in its ingester ring by default, with a majority of them (floor(replication_factor/2) + 1, so 2 by default) alive at all times.
Data ingestion will fail with that error when fewer live replicas than that minimum are listed in the ingester ring.

This happens even when running in monolithic mode.

Solution:

As a rule of thumb, make sure at least floor(replication_factor/2) + 1 ingesters are available to Mimir.

When just testing the setup, configure the ingesters' replication_factor to 1:

ingester:
  ring:
    replication_factor: 1
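
The minimum number of live ingesters can be worked out with a few lines of Python. The sketch assumes a simple majority quorum, floor(replication_factor/2) + 1, which gives the 2 out of 3 seen in the error message; both helpers are hypothetical names.

```python
def min_live_ingesters(replication_factor: int) -> int:
    """Smallest majority of the replicas: floor(rf / 2) + 1."""
    return replication_factor // 2 + 1


def can_ingest(live_ingesters: int, replication_factor: int) -> bool:
    """True when enough ingesters are alive for a write to succeed."""
    return live_ingesters >= min_live_ingesters(replication_factor)


# Single monolithic instance with the default replication factor: writes fail.
print(can_ingest(live_ingesters=1, replication_factor=3))
# Same instance after setting replication_factor to 1: writes succeed.
print(can_ingest(live_ingesters=1, replication_factor=1))
```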
