Grafana monitoring with Docker. Part 3 - Metrics with Prometheus

Intro

Metrics are the most useful monitoring signal because they tell you roughly when and where an issue occurs.

For example, we can see CPU, RAM, and disk usage and other general metrics about a running application, including when OOM kills were reported, with a history covering, say, the last month.
Or we could see a breakdown of which endpoints were slow and what percentage of requests to them were slow. On which endpoint does the web server spend the most total time (summed over all processed responses)?
Or we could see where our message queue workers get stuck, where they error, and so on.

Metrics show the When and the What: historical data that helps us understand where the source of an issue is, or which timeframe to dig into further with logs and traces.

You no longer need to watch the current values of application usage in real time. You can monitor historical data and find clues in the patterns you notice!

The most important quality of metrics is that they are HIGHLY PERFORMANT to query and to store. They are sampled, for example, once per minute for each unique set of labels (application_name, endpoint_name, status_code as example labels). So while with logs we may struggle to retain weeks of data, and with tracing we can barely afford days of data, we can still keep metrics for many months even in the most heavily loaded infrastructure.
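To get a feel for what "one sample per minute per unique set of labels" adds up to, here is a rough back-of-the-envelope sketch. The label value counts are made-up assumptions, and the product is only an upper bound, since not every label combination occurs in practice.

```python
from math import prod


def series_count(label_cardinalities: dict[str, int]) -> int:
    """Upper bound on time series of one metric: product of distinct values per label."""
    return prod(label_cardinalities.values())


def samples_per_day(series: int, scrape_interval_seconds: int = 60) -> int:
    """Number of samples stored per day for the given number of series."""
    return series * (24 * 60 * 60 // scrape_interval_seconds)


# Hypothetical numbers of distinct values per label.
labels = {"application_name": 5, "endpoint_name": 40, "status_code": 6}

series = series_count(labels)
print(series)                   # 1200
print(samples_per_day(series))  # 1728000

# Adding one label with many unique values multiplies everything.
with_user_id = series_count({**labels, "user_id": 10_000})
print(with_user_id)             # 12000000
```

A bounded set of labels stays cheap, while a single label with thousands of unique values multiplies the number of series by that amount.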

The most common issue. High cardinality.

The biggest and most frequent mistake people make is having a lot of unique label values in some metric. That causes a rapid explosion of RAM usage, and of storage as well. You can find which metrics have the most series, and therefore consume the most space, by using this query:

topk(20, count by (__name__, job)({__name__=~".+"}))

Optionally, if you know that a specific label in a metric has a lot of values but you have trouble identifying which ones, you can output the metric's label values grouped by their first N characters. Here is an example where we output span_name in traces_spanmetrics_latency_count, grouped by the first 5 characters.

sum by (span_prefix) (
  label_replace(
    traces_spanmetrics_latency_count,
    "span_prefix",
    "$1",
    "span_name",
    "(.{5}).*"
  )
)
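The same prefix-grouping idea can be tried offline on a list of label values, for example one exported from Prometheus. The sketch below is only an illustration: the span names are invented, and unlike the PromQL query above it counts distinct label values per prefix instead of summing the metric values.

```python
from collections import Counter


def group_by_prefix(values: list[str], prefix_len: int = 5) -> Counter:
    """Count how many distinct label values share the same first prefix_len characters."""
    return Counter(value[:prefix_len] for value in set(values))


# Hypothetical span_name values; the numeric ids are what blows up cardinality.
span_names = [
    "GET /users/1",
    "GET /users/2",
    "GET /users/3",
    "POST /login",
    "GET /health",
]

counts = group_by_prefix(span_names, prefix_len=10)
for prefix, count in counts.most_common():
    print(f"{prefix!r}: {count}")
```

The prefix with the highest count points at the family of values that is most likely generated dynamically (ids, timestamps, and similar).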

Raising Prometheus

Important

We provide the docker-compose way of configuration as a demo example because more devs are likely to be familiar and comfortable with docker-compose than with Terraform. We use Terraform to configure it ourselves and recommend using it instead of docker-compose if you can. The book “Terraform: Up & Running” is an excellent place to start with it.


services:
  prometheus:
    build:
      dockerfile: ./Dockerfile.prometheus
      context: .
    container_name: prometheus
    restart: always
    entrypoint: ["/bin/prometheus"]
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --web.enable-remote-write-receiver
      - --enable-feature=exemplar-storage 
      - --storage.tsdb.retention.time=30d
      - --storage.tsdb.retention.size=10GB
    networks:
      grafana:
        aliases:
          - prometheus
    volumes:
      - prometheus_data:/prometheus
    ports:
      - "9090:9090"
    mem_limit: 1000m

  alloy-metrics:
    build:
      dockerfile: ./Dockerfile.alloy.metrics
      context: .
    container_name: alloy-metrics
    restart: always
    privileged: true
    entrypoint: ["/bin/alloy"]
    command:
      - run
      - /etc/alloy/config.alloy
      - --storage.path=/var/lib/alloy/data
    logging:
      driver: "json-file"
      options:
        mode: "non-blocking"
        max-buffer-size: "500m"
    networks:
      grafana:
        aliases:
          - alloy-metrics
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - /cgroup:/cgroup:ro
      - /:/rootfs:ro
      - /proc:/host/proc:ro
      - /sys:/sys:ro
      - /var/run:/var/run:rw
      - /dev/disk:/dev/disk:ro
      - /etc:/host/etc:ro
    mem_limit: 1000m

networks:
  grafana:
    external: true

volumes:
  prometheus_data:
    name: prometheus_data


# Option to raise as Terraform
terraform {
  required_providers {
    docker = {
      source  = "kreuzwerker/docker"
      version = ">=3.0.2"
    }
    grafana = {
      source = "grafana/grafana"
    }
  }
}

provider "docker" {
  host     = "ssh://homelab"
  ssh_opts = ["-o", "StrictHostKeyChecking=no", "-o", "UserKnownHostsFile=/dev/null", "-i", "~/.ssh/id_rsa.darklab"]
}

module "caddy" {
  source = "./infra/tf/modules/docker_stack/caddy"
}

data "external" "secrets" {
  program = ["pass", "personal/terraform/grafana"]
}

module "monitoring" {
  // Relevant for part 1 article setup and logging
  source = "./infra/tf/modules/docker_stack/monitoring"
  # optionally we can lock ourselves which code to use from external git repo via git source.
  # source = "git@github.com:darklab8/infra.git//tf/modules/docker_stack/monitoring?ref=28407027ebdaba2b48816b63f627c18acd521f46"
  docker_network_caddy_id = module.caddy.network_id
  grafana_password        = data.external.secrets.result["grafana_password"]
  grafana_domain          = "homelab.dd84ai.com"
  logging = {
    enabled = true
  }

  // Relevant for part 2 article
  tracing = {
    enabled = true
  }
  // Relevant for part 3 article
  metrics = {
    enabled = true
  }
  // Relevant for part 4 article
  alerts = {
    enabled             = true
    discord_webhook_url = data.external.secrets.result["discord_webhook_url"]
  }
}

locals {
  grafana_password = data.external.secrets.result["grafana_password"]
  grafana_creds    = "admin:${local.grafana_password}"
}


provider "grafana" {
  url  = "https://demo.dd84ai.com/"
  auth = local.grafana_creds
}

// Data sources for all article parts at the same time
module "datasources" {
  # source = "./datasources"
  source = "./infra/tf/modules/grafana_stack/datasources"
  # optionally we can lock ourselves which code to use from external git repo via git source.
  # source = "git@github.com:darklab8/infra.git//tf/modules/grafana_stack/datasources?ref=27d0889348b1b526234d6db7ff60cf2793a772ca"
}

Participating configs:

prometheus.yaml


global:
  scrape_interval: 1m


alerting:
  alertmanagers:
    - scheme: http
      static_configs:
        - targets: [ 'alertmanager:9093' ]

cfg.metrics.alloy


prometheus.exporter.cadvisor "docker_metrics" {
  docker_host = "unix:///var/run/docker.sock"
  storage_duration = "5m"
  store_container_labels = true
  env_metadata_allowlist = [
    "VERSION_ID",
  ]
  enabled_metrics = [
    // default
    "cpu", "sched", "percpu", "memory", "cpuLoad", "diskIO", "disk",
    "network", "app", "process", "perf_event", "oom_event",
    // not default
    "process",
    // DO NOT TURN ON advtcp: it has the high-cardinality metric `container_network_advance_tcp_stats_total`. Exclude that metric before turning it on.
    // "memory_numa", "tcp", "udp", "advtcp", "hugetlb", "referenced_memory", "cpu_topology", "resctrl", "cpuset"
  ]
}

// Configure a prometheus.scrape component to collect cadvisor metrics.
prometheus.scrape "scraper" {
  targets    = prometheus.exporter.cadvisor.docker_metrics.targets
  forward_to = [ prometheus.relabel.labelator.receiver ]
}

prometheus.relabel "labelator" {
  forward_to = [prometheus.remote_write.backend.receiver]

  rule {
    source_labels = ["__meta_docker_container_name"]
    regex = "/(.*)"
    action = "replace"
    target_label = "container_name"
  }
}

prometheus.relabel "host_metrics" {
  forward_to = [prometheus.remote_write.backend.receiver]
  rule {
      target_label = "instance"
      replacement = string.trim_space(local.file.hostname.content)
  }
}

local.file "hostname" {
  filename  = "/host/etc/hostname"
}

prometheus.remote_write "backend" {
  endpoint {
    url = coalesce(sys.env("PROMETHEUS_URL"),"http://prometheus:9090/api/v1/write")
  }
}

// Add also Unix exporter
// https://grafana.com/docs/alloy/latest/reference/components/prometheus/prometheus.exporter.unix/
// https://grafana.com/docs/grafana-cloud/send-data/metrics/metrics-prometheus/prometheus-config-examples/docker-compose-linux/

prometheus.exporter.unix "hosts" {
    procfs_path = "/host/proc" // procfs mount point (default: /proc)
    sysfs_path  = "/sys"       // sysfs mount point (default: /sys)
    rootfs_path = "/rootfs"    // prefix for accessing the host filesystem (default: /)
    filesystem {
        mount_points_exclude = "^/(sys|proc|dev|host)($$|/)"
    }
}

// Configure a prometheus.scrape component to collect unix metrics.
prometheus.scrape "hosts" {
  targets    = prometheus.exporter.unix.hosts.targets
  forward_to = [prometheus.relabel.host_metrics.receiver]
}

discovery.docker "prometheus_endpoints" {
  host = "unix:///var/run/docker.sock"
  match_first_network = false
  filter {
    name = "label"
    values = ["prometheus"] 
  }
}

prometheus.scrape "prometheus_endpoints" {
  targets    = discovery.docker.prometheus_endpoints.targets
  forward_to = [prometheus.relabel.labelator.receiver]
  scrape_interval = "60s"
}

discovery.dockerswarm "prometheus_endpoints_swarm" {
  host = "unix:///var/run/docker.sock"
  role = "tasks"
  filter {
    name = "label"
    values = ["prometheus"] 
  }
}

prometheus.scrape "prometheus_endpoints_swarm" {
  targets    = discovery.dockerswarm.prometheus_endpoints_swarm.targets
  forward_to = [prometheus.relabel.labelator.receiver]
  scrape_interval = "60s"
}

Dockerfile.prometheus


FROM prom/prometheus:v3.2.1
COPY infra/tf/modules/docker_stack/monitoring/prometheus.yaml /etc/prometheus/prometheus.yml

Dockerfile.alloy.metrics


FROM grafana/alloy:v1.8.3
COPY infra/tf/modules/docker_stack/monitoring/cfg.metrics.alloy /etc/alloy/config.alloy

Proceed to apply the deployment to raise the metrics part of the stack (or use OpenTofu/Terraform to raise everything together as modules from ./main.tf).

git clone --recurse-submodules https://github.com/darklab8/blog
cd blog/articles/article_detailed/article_20250609_grafana/code_examples

export DOCKER_HOST=ssh://root@demo
docker ps

# ONLY if you did not follow the previous article parts about Loki or Tempo, take the docker-compose path:
docker compose up -d caddy # we need it for reverse proxy and automated TLS certs
docker compose up -d grafana # visualizer where we query metrics. Its provisioned datasources are already configured in YAML

# Continue with Prometheus article content:
# if docker-compose way:
docker compose -f docker-compose.prometheus.yaml build
docker compose -f docker-compose.prometheus.yaml up -d prometheus
docker compose -f docker-compose.prometheus.yaml up -d alloy-metrics

# if opentofu way
tofu init
tofu apply

# after deploy, just in case, grant prometheus the rights it needs to initialize and persist its data (run on the Docker host)
chmod -R a+rw /var/lib/docker/volumes/prometheus_data
chmod -R a+rw /var/lib/docker/volumes/grafana_data # just in case, grant grafana rights too if not granted yet

If everything was configured correctly, you will be able to open the Metrics Drilldown page and see incoming metrics already. This article ships an Alloy configuration with prewritten Docker monitoring, as it is the most comfortable minimalistic approach for deployment in a homelab.

Note

If you wish to monitor something else with metrics besides Docker and the applications in Docker, for example postgres, elasticsearch, AWS CloudWatch, and so on, check the Grafana Alloy components for other provided Prometheus integrations.

Infrastructure Dashboards

In the previous section we raised Prometheus and the universal metrics scraper “alloy”, which is already configured to scrape unix, docker, and app metrics.

Cadvisor alloy component for docker metrics
Unix alloy component for node exporter metrics accordingly

Import dashboards for Docker and Unix

If you imported everything correctly and your Grafana image version is 11.6, as supported by those dashboards and written in the docker compose file, you will see metrics about your containers and Linux server accordingly.

Cadvisor dashboard (about docker containers):
Node exporter dashboard (about linux server):

Caution

If you still do not see the dashboards properly, take note of which Grafana version you use. We can be sure it works fine with 11.6 at least.

Note

The author of the article handles the Grafana side of configuration with the Terraform Grafana provider instead of manual actions, using this code: https://github.com/darklab8/infra/tree/master/tf/grafana

If you want other kinds of Grafana dashboards, you can browse everything people have released at https://grafana.com/grafana/dashboards/ , or you can make your own.

Caution

If imported dashboards still do not show data in some of their graphs, make sure the Prometheus data source has timeInterval set to 60s, matching the Alloy scrape interval. If you raised the Grafana web interface with the configs from this article, you will have it automatically, as it is written in the data source provisioning config.
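A rough sketch of why this setting matters, assuming the imported dashboards use Grafana's `$__rate_interval` variable (not verified for every panel). Grafana documents that variable as max($__interval + scrape interval, 4 * scrape interval), where the scrape interval comes from the data source setting. The 15s values below are assumptions: Grafana's default for that setting and a typical dashboard step.

```python
def rate_interval_seconds(dashboard_interval_s: int, configured_scrape_interval_s: int) -> int:
    """Window Grafana substitutes for $__rate_interval, per its documented formula."""
    return max(
        dashboard_interval_s + configured_scrape_interval_s,
        4 * configured_scrape_interval_s,
    )


def approx_samples_in_window(window_s: int, real_scrape_interval_s: int) -> int:
    """Roughly how many scraped samples fall into a window of the given length."""
    return window_s // real_scrape_interval_s


REAL_SCRAPE_INTERVAL_S = 60  # Alloy scrapes once per minute in this article

for configured in (15, 60):  # 15s: assumed data source default, 60s: matching Alloy
    window = rate_interval_seconds(15, configured)
    samples = approx_samples_in_window(window, REAL_SCRAPE_INTERVAL_S)
    verdict = "rate() can work" if samples >= 2 else "rate() has too few samples"
    print(f"timeInterval={configured}s -> window={window}s, ~{samples} samples: {verdict}")
```

With the mismatched 15s setting the window is only 60s, which holds about one sample when scraping happens once per minute, and `rate()` needs at least two samples to return anything.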

Application dashboards

Now that we have the main infra dashboards handled, let's try scraping some custom application metrics and turning them into their own dashboard.

Supported languages for Prometheus libraries can be found here: https://prometheus.io/docs/instrumenting/clientlibs/
Depending on the language, framework, or infra element, there may already be existing integrations/exporters for it, which you can find here: https://prometheus.io/docs/instrumenting/exporters/ . For example, there are Prometheus integrations for Python Django and for PostgreSQL.
As mentioned before, the Grafana Alloy scraping agent offers plenty of common exporters you can use to get access to more metrics: https://grafana.com/docs/alloy/latest/reference/components/prometheus/ . To use them properly, you will most likely need to read the repository of the underlying exporter to learn which extra volumes/settings you need to pass to Grafana Alloy to make it work, depending on what you use.

Read thoroughly about the existing metric types at https://prometheus.io/docs/concepts/metric_types/ to understand how to write your own Prometheus metrics. Roughly, we can say:

Counter is good for things like request counts, action counts, and most other kinds of counts. Use a Counter when you can just “ADD” one more occurrence of something that was counted.
Gauge is for when we need to know the “temperature value” of something: how many workers we currently have, or how many users are currently active if we have access to active sessions. Its value can go up or down, and gauges can also be summed up in queries. Use a Gauge when it makes sense to “SET” the value of something.
Histogram is for when we need to capture the performance of request durations, or any other kind of duration, across different “route patterns”.
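To make the ADD / SET / observe distinction concrete, here is a minimal toy sketch of the three metric types. The `Toy*` classes are hypothetical stand-ins written only for illustration. They are not the API of any Prometheus client library, which additionally handles labels, registration, and exposition.

```python
import bisect


class ToyCounter:
    """Only ever goes up: we ADD one more counted event."""

    def __init__(self) -> None:
        self.value = 0.0

    def inc(self, amount: float = 1.0) -> None:
        if amount < 0:
            raise ValueError("counters can only increase")
        self.value += amount


class ToyGauge:
    """Current value of something: we SET it, and it can go up or down."""

    def __init__(self) -> None:
        self.value = 0.0

    def set(self, value: float) -> None:
        self.value = value


class ToyHistogram:
    """Counts observations per bucket, where a bucket holds values <= its upper bound."""

    def __init__(self, upper_bounds: list[float]) -> None:
        self.upper_bounds = sorted(upper_bounds)
        self.bucket_counts = [0] * (len(self.upper_bounds) + 1)  # last one is +Inf
        self.sum = 0.0
        self.count = 0

    def observe(self, value: float) -> None:
        # Index of the first upper bound that is >= value.
        index = bisect.bisect_left(self.upper_bounds, value)
        self.bucket_counts[index] += 1
        self.sum += value
        self.count += 1

    def cumulative_buckets(self) -> list[tuple[float, int]]:
        """Buckets as exposed to Prometheus: each count includes all smaller buckets."""
        running = 0
        result = []
        for bound, count in zip(self.upper_bounds + [float("inf")], self.bucket_counts):
            running += count
            result.append((bound, running))
        return result


requests_total = ToyCounter()
requests_total.inc()
requests_total.inc()

active_workers = ToyGauge()
active_workers.set(4)
active_workers.set(3)

request_duration = ToyHistogram([0.1, 0.5, 1.0])
for seconds in (0.05, 0.3, 0.3, 2.0):
    request_duration.observe(seconds)

print(requests_total.value)                   # 2.0
print(active_workers.value)                   # 3
print(request_duration.cumulative_buckets())  # [(0.1, 1), (0.5, 3), (1.0, 3), (inf, 4)]
```

In a real application you would use the client library for your language from the list above instead of classes like these.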

I integrated my pet project with the Go library for Prometheus, https://github.com/prometheus/client_golang , added metrics, and registered them explicitly so that I could add global labels: https://github.com/darklab8/fl-darkstat/blob/master/darkcore/metrics/metrics.go

Based on that, I have a detailed performance evaluation dashboard for the darkstat project: https://grafana.dd84ai.com/d/belbdnu2uqe4gd/app-darkstat?var-interval=2m&orgId=1&from=now-3h&to=now&timezone=browser&var-environment=production

Since it is a web app, I made sure to capture the usual things: success/failure rates and the duration of the responses my web server returns.
And I had plenty of problems with uptime in the past, so I made sure I have an uptime dashboard present and working through a regular Counter.

I paid extra attention to worst-case scenarios: how slow are the slowest 50%, 25%, and 10% of responses? (P50, P75, P90 metrics)
I took note of which pages have a large body size and create a lot of network traffic for me. Due to the nature of the app, it was important for me to watch this metric.
I took note of which path patterns take the most time to load, to mark them as potential targets for optimization.
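P50, P75, and P90 values are typically estimated from histogram buckets rather than measured exactly. The sketch below shows the idea behind PromQL's `histogram_quantile()` for classic histograms in simplified form: find the bucket where the requested rank falls and interpolate linearly inside it. The bucket bounds and counts are invented for illustration, and the real function handles more edge cases.

```python
import math


def estimate_quantile(q: float, buckets: list[tuple[float, int]]) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with the +Inf bucket.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0  # the lowest bucket is assumed to start at 0
    for bound, count in buckets:
        if count >= rank and count > prev_count:
            if math.isinf(bound):
                # Nothing to interpolate towards: report the highest finite bound.
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical request duration buckets in seconds: 50 of 100 requests took <= 0.1s, etc.
duration_buckets = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]

for q in (0.50, 0.75, 0.90):
    print(f"P{int(q * 100)}: {estimate_quantile(q, duration_buckets):.2f}s")
```

Because of the interpolation, the result is only as precise as the bucket layout, so it pays off to choose bucket bounds close to the latencies you care about.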

Lastly, I keep a public API running for the app, so out of curiosity I watch which endpoints are actually in use, and with which user agents, to estimate the number of users.
That also tells me what is NOT in use, so I can consider removing it as unnecessary.

The code of this application dashboard is provided by link

What do we capture in application dashboards?

We capture with metrics what is most important for us to monitor:

How responses are given by a web server
How interactions with databases work inside the app
How requests to other applications behave
How payments are processed
How message queue workers work
How databases run and behave, handling pressure, ram, disk usage

The role of dashboard metrics is to tell us WHERE the issues are happening (they do not have to tell exactly how they are happening; see traces/logs/profiles for more information). Metrics are also the most useful signal for Alerts because of their performance efficiency :)

Dashboards based on logs and traces

If we haven’t mentioned it before, we can make dashboards even from logs, but they will not be performant to query, and their usability is limited to applications with low logging volume. It is much easier to handle Prometheus metrics that are emitted a few times per minute than applications with thousands of log lines per minute.

Example of dashboard based on logs can be found here and its code is here

Dashboard from logs:

As for dashboards based on traces, we can make a generic dashboard that works on metrics generated from traces. It is provided here and its code is here

You may find it interesting because you get an auto-generated dashboard just because your app is connected to tracing. It has the strong limitation of requiring low-cardinality tracing span names, though. If some application breaks this rule, it needs to be excluded from metrics generator usage.

Caution

Turn off the metrics generator if you do not need the trace APM dashboard. That will save you some RAM.

Additionally, dashboard graphs can even be generated from TraceQL metrics, which can be useful in tricky tracing searches. Again, they aren’t useful for average everyday usage because the performance demand is too high. Only metrics-based dashboards are efficient enough to be navigated freely.

Articles updates

All articles about monitoring configurations, including the one about Prometheus metrics, are actively in use, at least in the author’s homelab, through Terraform configuration. If you have any doubts, or something got outdated or stopped working, see the Terraform code there as the source of truth. With some chance, the article content will be updated in its repository and redeployed to GitHub Pages with fresh fixes. I can be reached in the issues of the blog repo.

Production grade configuration tips.

While Prometheus is fine for a few homelab hosts or a very small production setup:

I can highly endorse deploying Mimir if you run serious production on horizontally scaled Kubernetes infrastructure: https://github.com/grafana/mimir/tree/main/operations/helm/charts/mimir-distributed . Its main advantage is that it is actually horizontally scalable and able to withstand a higher workload, because it can distribute the RAM workload between its scaled instances.
In turn, the k8s monitoring helm chart is perfect for scraping metrics from Kubernetes. Grafana Alloy is perfect for usage in Kubernetes, in dockerized deployments, in AWS ECS, or even on Linux hosts that run everything through systemd. It can work anywhere and scrape everything.
To make Mimir work at full power, you may want to learn how to configure Mimir Rules. That unlocks fully working dashboards of any kind in Mimir, if you just import the same Prometheus Rules that are provided for those dashboards. This can be handled by the Terraform Mimir provider.
Some people choose VictoriaMetrics as an alternative. It may be a good alternative, but the author of the article did not test it on a workload comparable to Mimir to be sure which choice works better. Since there are no serious complaints about Mimir and it works fine and scales, the author has had no need yet to switch to VictoriaMetrics.