Grafana's Mimir

Metrics aggregator.

Allows ingesting Prometheus or OpenTelemetry metrics, running queries, creating new data through recording rules, and setting up alerting rules across multiple tenants, leveraging tenant federation.

  1. TL;DR
  2. Setup
    1. Monolithic mode
    2. Microservices mode
    3. Run on AWS ECS Fargate
  3. Storage
    1. Object storage
  4. Authentication and authorization
  5. Migrate to Mimir
  6. Ingest Out-Of-Order samples
  7. Deduplication of data from multiple Prometheus scrapers
  8. APIs
  9. Troubleshooting
    1. HTTP status 401 Unauthorized: no org id
    2. HTTP status 500 Internal Server Error: send data to ingesters: at least 2 live replicas required, could only find 1
  10. Further readings
    1. Sources

TL;DR

Scrapers (like Prometheus or Grafana's Alloy) need to send metrics data to Mimir.
Mimir will not scrape metrics itself.

Mimir listens by default on port 8080 for HTTP and on port 9095 for gRPC.
It also internally advertises data and actions to the members of the cluster using hashicorp/memberlist, which implements a gossip protocol. This uses port 7946 by default, which must be reachable by all members of the cluster for it to work.
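
These ports can be changed in the configuration file. A minimal sketch, assuming the server and memberlist configuration blocks and a hypothetical peer address:

server:
  http_listen_port: 8080
  grpc_listen_port: 9095
memberlist:
  bind_port: 7946
  join_members:
    - mimir-1.example.org:7946  # hypothetical peer address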

Mimir stores time series in TSDB blocks, which are uploaded to an object storage bucket.
Such blocks are the same ones Prometheus and Thanos use, though each application stores blocks in different places and uses slightly different metadata files for them.

Mimir supports multiple tenants, and stores blocks on a per-tenant level.
Multi-tenancy is enabled by default, and can be disabled with the -auth.multitenancy-enabled=false option.
When enabled, multi-tenancy requires every API request to carry the X-Scope-OrgID header, with its value set to the ID of the tenant one is authenticating for.
When multi-tenancy is disabled, Mimir only manages a single tenant named anonymous.
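
As a sketch, querying the Prometheus-compatible HTTP API as a given tenant (the hostname is a placeholder used throughout this document):

# Query label names as the 'anonymous' tenant.
curl -H 'X-Scope-OrgID: anonymous' 'http://mimir.example.org:8080/prometheus/api/v1/labels'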

Blocks can be uploaded using the mimirtool utility, so that Mimir can access them.
Mimir will perform some sanitization and validation of each block's metadata.

mimirtool backfill --address='http://mimir.example.org' --id='anonymous' 'block_1' … 'block_N'

As a result of this validation, Mimir will likely reject Thanos' blocks due to unsupported labels.
As a workaround, upload Thanos' blocks directly to Mimir's blocks bucket, under the <tenant>/<block ID>/ prefix.
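
A sketch of such a direct upload, assuming the S3 backend, the mimir-blocks bucket used in the examples below, and a placeholder block ULID:

# Copy a Thanos block directly under the tenant's prefix in the blocks bucket.
aws s3 cp --recursive '01EXAMPLEBLOCKULID/' 's3://mimir-blocks/anonymous/01EXAMPLEBLOCKULID/'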

Setup
docker pull 'grafana/mimir'
helm repo add 'grafana' 'https://grafana.github.io/helm-charts' && helm repo update 'grafana'

# Does *not* look for default configuration files.
# When no configuration file is given, only default values are used. This is not something one might usually want.
mimir --config.file='./demo.yaml'
docker run --rm --name 'mimir' -p '8080:8080' -p '9095:9095' -v "$PWD/config.yaml:/etc/mimir/config.yaml" \
  'grafana/mimir' --config.file='/etc/mimir/config.yaml'
helm --namespace 'mimir-test' upgrade --install --create-namespace 'mimir' 'grafana/mimir-distributed'
Usage
# Get help.
mimir -help
mimir -help-all

# Validate configuration files.
mimir -modules -config.file 'path/to/config.yaml'

# Run tests.
# Refer <https://grafana.com/docs/mimir/latest/manage/tools/mimir-continuous-test/>.
# '-tests.smoke-test' runs the tests just once.
# The overridden listen ports avoid colliding with the running instance.
mimir -target='continuous-test' \
  -tests.write-endpoint='http://localhost:8080' -tests.read-endpoint='http://localhost:8080' \
  -tests.smoke-test \
  -server.http-listen-port='18080' -server.grpc-listen-port='19095'

# See the current configuration of components.
GET /config
GET /runtime_config

# See changes in the runtime configuration from the default one.
GET /runtime_config?mode=diff

# Check the service is ready.
# A.K.A. readiness probe.
GET /ready

# Get metrics.
GET /metrics

Setup

Mimir's configuration file is YAML-based.

There is no default configuration file, but one can be specified on launch.
If no configuration file is given, only the default values will be used.

mimir --config.file='./demo.yaml'

docker run --rm --name 'mimir' --publish '8080:8080' --publish '9095:9095' \
  --volume "$PWD/config.yaml:/etc/mimir/config.yaml" \
  'grafana/mimir' --config.file='/etc/mimir/config.yaml'

Refer Grafana Mimir configuration parameters for the available parameters.

If enabled, environment variable references can be used in the configuration file to set values that need to be configurable during deployment.
This feature is enabled on the command line via the -config.expand-env=true option.

Each variable reference is replaced at startup by the value of the environment variable.
The replacement is case-sensitive, and occurs before the YAML file is parsed.
References to undefined variables are replaced by empty strings unless a default value or custom error text is specified.

Use the ${VAR} placeholder, optionally specifying a default value with ${VAR:default_value}, where VAR is the name of the environment variable and default_value is the value to use if the environment variable is undefined.
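
A minimal sketch, assuming a hypothetical MIMIR_BLOCKS_BUCKET variable used in the configuration file:

# config.yaml
blocks_storage:
  s3:
    bucket_name: ${MIMIR_BLOCKS_BUCKET:mimir-blocks}  # falls back to 'mimir-blocks' if unset

# Launch with expansion enabled.
MIMIR_BLOCKS_BUCKET='my-mimir-blocks' mimir -config.expand-env=true -config.file='./config.yaml'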

Configuration files can be stored gzip-compressed. In this case, give a .gz extension to the files that should be decompressed before parsing.

Mimir loads a given configuration file at startup. This configuration cannot be modified at runtime.

Mimir supports secondary configuration files that define the runtime configuration.
This configuration is reloaded dynamically, allowing one to change runtime settings without having to restart Mimir's components or instances.

Runtime configuration must be explicitly enabled, either on launch or in the configuration file under runtime_config.
If multiple runtime configuration files are specified, they will be merged left to right.
Mimir reloads the contents of these files every 10 seconds.

mimir … -runtime-config.file='path/to/file/1,path/to/file/N'

The runtime configuration only covers a subset of the settings defined at startup, but its values take precedence over command-line options.
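
As a sketch, a runtime configuration file could hold per-tenant overrides, such as the out_of_order_time_window limit shown later in this document (the tenant name is hypothetical):

# runtime.yaml, reloaded every 10 seconds.
overrides:
  some-tenant:
    out_of_order_time_window: 10m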

Some settings are repeated for multiple components.
To avoid repetition in the configuration file, set them in the file's common section or give them to Mimir using the -common.* CLI options.
Common settings are applied to all components first, then the components' specific configurations override them.

Settings are applied as follows, with each one applied later overriding the previous ones:

  1. YAML common values
  2. YAML specific values
  3. CLI common flags
  4. CLI specific flags

Specific configuration for one component that is passed to other components is simply ignored by those.
This makes it safe to reuse files.
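
A small sketch of that precedence, assuming a config.yml whose common section sets the storage backend to filesystem: the component-specific CLI flag below wins over it for the ruler's store only, leaving the other components on the common backend.

mimir -config.file='config.yml' -ruler-storage.backend='s3'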

Mimir can be deployed in one of two modes:

  • Monolithic, which runs all required components in a single process.
  • Microservices, where components are run as distinct processes.

The deployment mode is determined by the target option given to Mimir's process.

$ mimir -target='ruler'
$ mimir -target='all,alertmanager,overrides-exporter'

$ yq '.' 'config.yml'
target: all,alertmanager,overrides-exporter
$ mimir -config.file='config.yml'

Whatever its deployment mode, Mimir needs to receive data from other applications.
It will not scrape metrics itself.

Prometheus configuration
remote_write:
  - url: http://mimir.example.org:8080/api/v1/push
    headers:
      X-Scope-OrgID:
        # Required when multi-tenancy is enabled.
        # Set it to the correct tenant ID; 'anonymous' is the default.
        anonymous

Grafana treats Mimir as a data source of type Prometheus, so it must be provisioned accordingly.

Example
---
apiVersion: 1
datasources:
  - id: 1
    name: Mimir
    orgId: 1
    uid: abcdef01-a0c1-432e-8ef5-7b277cb0b32b
    type: prometheus
    typeName: Prometheus
    typeLogoUrl: public/app/plugins/datasource/prometheus/img/prometheus_logo.svg
    access: proxy
    url: http://mimir.example.org:8080/prometheus
    user: ''
    database: ''
    basicAuth: false
    isDefault: true
    jsonData:
      httpMethod: POST
    readOnly: false

From there, metrics can be queried in Grafana's Explore tab, or can populate dashboards that use Mimir as their data source.

Monolithic mode

Runs all required components in a single process.

Can be horizontally scaled out by deploying multiple instances of Mimir's binary, all of them started with the -target=all option.

graph LR
  r(Reads)
  w(Writes)
  lb(Load Balancer)
  m1(Mimir<br/>instance 1)
  mN(Mimir<br/>instance N)
  os(Object Storage)

  r --> lb
  w --> lb
  lb --> m1
  lb --> mN
  m1 --> os
  mN --> os

By default, Mimir expects 3 ingester replicas, and data ingestion will fail if fewer than 2 are present in the ingester ring.
See HTTP status 500 Internal Server Error: send data to ingesters: at least 2 live replicas required, could only find 1.

Microservices mode

Mimir's components are deployed as distinct processes.
Each process is invoked with its own -target option set to a specific component (e.g., -target='ingester' or -target='distributor').

graph LR
  r(Reads)
  qf(Query Frontend)
  q(Querier)
  sg(Store Gateway)
  w(Writes)
  d(Distributor)
  i(Ingester)
  os(Object Storage)
  c(Compactor)

  r --> qf --> q --> sg --> os
  w --> d --> i --> os
  os <--> c

Every required component must be deployed in order to have a working Mimir instance.

This mode is the preferred method for production deployments, but it is also the most complex.
Using Kubernetes and the mimir-distributed Helm chart is recommended.

Each component scales up independently.
This allows for greater flexibility and more granular failure domains.
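
For instance, a sketch of the components being started as separate processes, each reusing the same configuration file:

mimir -target='distributor' -config.file='config.yml'
mimir -target='ingester' -config.file='config.yml'
mimir -target='querier' -config.file='config.yml'
mimir -target='query-frontend' -config.file='config.yml'
mimir -target='store-gateway' -config.file='config.yml'
mimir -target='compactor' -config.file='config.yml'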

Run on AWS ECS Fargate

See also AWS ECS and Mimir on AWS ECS Fargate.

Things to consider:

  • Go for ECS service discovery instead of ECS Service Connect.

    This needs to be confirmed, but it is how it worked for me.

    Apparently, at the time of writing, Service Connect prefers answering in IPv6 for ECS-related queries.
    There seems to be no way to customize this for now.

    At the same time, hashicorp/memberlist seems to only use IPv4 unless explicitly required to listen on an IPv6 address, which one has no way to set programmatically before creating the resources.

Storage

Mimir supports the s3, gcs, azure, swift, and filesystem backends.
filesystem is the default one.
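
A minimal sketch for the default filesystem backend, assuming a hypothetical local directory:

common:
  storage:
    backend: filesystem
    filesystem:
      dir: /var/lib/mimir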

Object storage

Refer Configure Grafana Mimir object storage backend.

Blocks storage must be located under a different prefix or bucket than both the ruler's and the Alertmanager's stores; Mimir will fail to start otherwise.

To avoid that, it is suggested to override the bucket_name (or storage_prefix) setting in the component-specific configurations.

Different buckets
common:
  storage:
    backend: s3
    s3:
      endpoint: s3.us-east-2.amazonaws.com  # required
      region: us-east-2

blocks_storage:
  s3:
    bucket_name: mimir-blocks

alertmanager_storage:
  s3:
    bucket_name: mimir-alertmanager

ruler_storage:
  s3:
    bucket_name: mimir-ruler
Same bucket, different prefixes
common:
  storage:
    backend: s3
    s3:
      endpoint: s3.us-east-2.amazonaws.com  # required
      region: us-east-2
      bucket_name: mimir

blocks_storage:
  storage_prefix: blocks

alertmanager_storage:
  storage_prefix: alertmanager

ruler_storage:
  storage_prefix: ruler

The WAL is only retained on local disk, not persisted to the object storage.

Metrics data is uploaded to the object storage every 2 hours, typically when a block is cut from the in-memory TSDB head.
After the metrics data block is uploaded, its related WAL is truncated too.

Authentication and authorization

Refer Grafana Mimir authentication and authorization.

Migrate to Mimir

Refer Configure TSDB block upload and Migrate from Thanos or Prometheus to Grafana Mimir.

Ingest Out-Of-Order samples

Refer Configure out-of-order samples ingestion.

limits:
  out_of_order_time_window: 5m  # Allow up to 5 minutes since the latest received sample for the series.

Deduplication of data from multiple Prometheus scrapers

Refer Configure Grafana Mimir high-availability deduplication.

APIs

Refer Grafana Mimir HTTP API.

Troubleshooting

HTTP status 401 Unauthorized: no org id

Context: Prometheus servers get this error when trying to push metrics.

Root cause: The push request is missing the X-Scope-OrgID header that would specify the tenancy for the data.

Solution:

Configure Prometheus to add the X-Scope-OrgID header to the data it pushes.
When multi-tenancy is disabled, use the default anonymous tenant:

remote_write:
  - url: http://mimir.example.org:8080/api/v1/push
    headers:
      X-Scope-OrgID:
        # Required when multi-tenancy is enabled.
        # Set it to the correct tenant ID; 'anonymous' is the default.
        anonymous

HTTP status 500 Internal Server Error: send data to ingesters: at least 2 live replicas required, could only find 1

Context:

Mimir is running on AWS ECS in monolithic mode for evaluation.
It is loading the following configuration from a mounted AWS EFS volume:

multitenancy_enabled: false

common:
  storage:
    backend: s3
    s3:
      endpoint: s3.us-east-2.amazonaws.com  # required
blocks_storage:
  s3:
    bucket_name: my-mimir-blocks

The service is backed by a load balancer.
The load balancer correctly lets requests reach the task serving Mimir.

A Prometheus server is configured to send data to Mimir:

remote_write:
  - url: http://mimir.dev.somecompany.com:8080/api/v1/push
    headers:
      X-Scope-OrgID: anonymous

The push request passes Mimir's validation for the above header.

Both Mimir and Prometheus print this error when Prometheus tries to push metrics.

Root cause:

Writes to the Mimir cluster are successful only if the majority of the ingesters received the data.

The default value for the ingester.ring.replication_factor setting is 3. As such, Mimir expects by default 3 ingesters in its ingester ring, with at least ceil(replication_factor/2) of them (2 by default) alive at all times.
Data ingestion fails with this error when fewer than the minimum number of live replicas are listed in the ingester ring.

This happens even when running in monolithic mode.

Solution:

As a rule of thumb, make sure at least ceil(replication_factor/2) ingesters are available to Mimir.

When just testing the setup, configure the ingesters' replication_factor to 1:

ingester:
  ring:
    replication_factor: 1

Further readings

Alternatives:

Sources