Prometheus
Monitoring and alerting system that collects metrics from configured targets at given intervals, evaluates rule
expressions, displays the results, and can trigger alerts when specified conditions are observed.
Metrics can also be pushed using plugins, in the event hosts are behind a firewall or prohibited from opening ports by
security policy.
Components
Prometheus is composed of its server, the Alertmanager, and exporters.
Alerting rules can be created within Prometheus, and configured to send custom alerts to Alertmanager.
Alertmanager then processes and handles the alerts, including sending notifications through different mechanisms or
third-party services.
The exporters can be libraries, processes, devices, or anything else exposing metrics so that they can be scraped by
Prometheus.
Such metrics are usually made available at the /metrics endpoint, which allows them to be scraped directly by
Prometheus without the need for an agent.
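As an illustration, an exporter's /metrics endpoint returns plain-text samples in the Prometheus exposition format; the metric below is a real node exporter metric, but the labels and value are made up:

```
# HELP node_filesystem_avail_bytes Filesystem space available to non-root users in bytes.
# TYPE node_filesystem_avail_bytes gauge
node_filesystem_avail_bytes{device="/dev/sda1",fstype="ext4",mountpoint="/"} 4.2e+09
```

Each sample is a metric name, an optional set of key-value labels, and a value; the # HELP and # TYPE comments are metadata used by Prometheus and other tooling.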
Extras
As a welcome addition, Grafana can be configured to use Prometheus as a data source in order to provide data visualization and dashboarding functions on top of the data it collects.
Configuration
The default configuration file is at /etc/prometheus/prometheus.yml.
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: [ 'localhost:9090' ]
  - job_name: nodes
    static_configs:
      - targets:
          - fqdn:9100
          - host.local:9100
  - job_name: router
    static_configs:
      - targets: [ 'openwrt.local:9100' ]
    metric_relabel_configs:
      - source_labels: [__name__]
        action: keep
        regex: '(node_cpu)'
Queries
Prometheus' query syntax is PromQL.
All data is stored as time series, each one identified by a metric name, e.g. node_filesystem_avail_bytes for
available filesystem space.
Metric names can be used in expressions to select all the time series with that name, producing an
instant vector.
Time series can be filtered using selectors and labels (sets of key-value pairs):
node_filesystem_avail_bytes{fstype="ext4"}
node_filesystem_avail_bytes{fstype!="xfs"}
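Besides exact match (=) and negative match (!=), labels also support the regex matchers =~ and !~; for example (the mountpoint and filesystem values here are illustrative):

```
node_filesystem_avail_bytes{mountpoint=~"/mnt.*"}
node_filesystem_avail_bytes{fstype!~"tmpfs|squashfs"}
```

The first selects series whose mountpoint starts with /mnt, the second excludes the listed filesystem types.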
Square brackets allow selecting a range of samples from the current time backwards:
node_memory_MemAvailable_bytes[5m]
When using time ranges, the vector returned will be a range vector.
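Range vectors cannot be graphed directly; they are typically passed to functions such as rate(). For example, a sketch of the per-second average of received network bytes over the last 5 minutes (the device label value is illustrative):

```
rate(node_network_receive_bytes_total{device="eth0"}[5m])
```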
Functions can be used to build advanced queries:
100 * (1 - avg by(instance)(irate(node_cpu_seconds_total{job='node_exporter',mode='idle'}[5m])))
Labels are used to filter the job and the mode. node_cpu_seconds_total returns a counter, and the irate() function
calculates the per-second rate of change based on the last two data points of the range interval.
To calculate the overall CPU usage, the idle mode of the metric is used. Since the idle fraction of a processor is the
complement of its busy fraction, the irate value is subtracted from 1. To make it a percentage, it is multiplied by 100.
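The same complement-and-percentage pattern applies to plain gauges, without needing a rate. A sketch of memory usage as a percentage, using two standard node exporter metrics:

```
100 * (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
```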
Filter metrics
Refer to How relabeling in Prometheus works, Scrape selective metrics in Prometheus and Dropping metrics at scrape time with Prometheus.
Use metric relabeling configurations to select which series to ingest after scraping:
scrape_configs:
  - job_name: router
    …
+   metric_relabel_configs:
+     - # do *not* record metrics whose name matches the regex
+       # in this case, those whose name starts with 'node_disk_'
+       source_labels: [ __name__ ]
+       action: drop
+       regex: node_disk_.*
  - job_name: hosts
    …
+   metric_relabel_configs:
+     - # *only* record metrics whose name matches the regex
+       # in this case, those whose name starts with 'node_cpu_' with cpu=1 and mode=user
+       source_labels:
+         - __name__
+         - cpu
+         - mode
+       regex: node_cpu_.*1.*user.*
+       action: keep
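Relabeling can also strip individual labels instead of whole series, which helps with high-cardinality labels. A minimal sketch using the labeldrop action, whose regex matches label names rather than values (the label name here is hypothetical):

```
metric_relabel_configs:
  - action: labeldrop
    regex: kubernetes_pod_name
```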
Further readings
Sources
All the references in the further readings section, plus the following:
- Getting started with Prometheus
- Node exporter guide
- SNMP monitoring and easing it with Prometheus
- prometheus/node_exporter
- prometheus/snmp_exporter
- How I monitor my OpenWrt router with Grafana Cloud and Prometheus
- Scrape selective metrics in Prometheus
- Dropping metrics at scrape time with Prometheus
- How relabeling in Prometheus works
