Grafana monitoring with Docker. Part 2 - Traces with Tempo

Intro

In the previous article we configured Grafana, Loki and Alloy for log gathering.

In this article we configure tracing with Tempo. We assume the setup from Part 1 of the series is already in place (at least the Grafana web GUI configuration and the Caddy web server). Our goal remains configuring tracing for a homelab and your pet projects. At the end of the article we mention production-grade configuration tips.

What is tracing for?

Tracing is your best friend when you monitor backend systems that make many network requests to databases, third-party APIs and your own other services. Tracing shows exactly which SQL query takes the most time to execute within a request. It can also show whether your code is stuck in the common N+1 problem of Django ORM (when you execute an SQL query per row instead of a single one because of a forgotten select_related/prefetch_related).
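To make the N+1 symptom concrete, here is a minimal sketch of how it looks from the tracing side: one request produces one span for the list query plus one span per row. The `Span` and `Trace` types, the query strings and the helper functions are hypothetical stand-ins written for this illustration, not the OpenTelemetry API.

```go
package main

import (
	"fmt"
	"time"
)

// Span is a hypothetical, minimal stand-in for a tracing span
// (not the OpenTelemetry type): just a name and a duration.
type Span struct {
	Name     string
	Duration time.Duration
}

// Trace collects the spans recorded while serving one request.
type Trace struct {
	Spans []Span
}

// Record runs fn and stores how long it took under the given name.
func (t *Trace) Record(name string, fn func()) {
	start := time.Now()
	fn()
	t.Spans = append(t.Spans, Span{Name: name, Duration: time.Since(start)})
}

// fakeQuery stands in for a database round trip.
func fakeQuery() {}

// listBooksNPlusOne issues one query for the list and one more per row.
func listBooksNPlusOne(t *Trace, rows int) {
	t.Record("SELECT * FROM books", fakeQuery)
	for i := 0; i < rows; i++ {
		t.Record("SELECT * FROM authors WHERE id = ?", fakeQuery)
	}
}

// listBooksJoined fetches the same data with a single joined query.
func listBooksJoined(t *Trace) {
	t.Record("SELECT * FROM books JOIN authors", fakeQuery)
}

// CountSpans returns how many spans in the trace carry the given name.
func CountSpans(t *Trace, name string) int {
	n := 0
	for _, s := range t.Spans {
		if s.Name == name {
			n++
		}
	}
	return n
}

func main() {
	slow, fast := &Trace{}, &Trace{}
	listBooksNPlusOne(slow, 10)
	listBooksJoined(fast)
	fmt.Println("N+1 trace spans:", len(slow.Spans))    // 11
	fmt.Println("joined trace spans:", len(fast.Spans)) // 1
}
```

In a trace viewer the first variant shows up as a long staircase of identical short query spans under one request, which is usually easy to spot by eye.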

What are its properties?

Tracing is somewhat comparable to profiling but has big differences. Profiling monitors a single app only and can show the execution time of each function, memory allocations and other details. Tracing, in comparison, shows only what is covered by its tracing spans, but it can propagate its context to other services.
Tracing serves as an EASY TO NAVIGATE GLUE between all monitoring systems, linking Traces to Logs, Logs to Traces, Traces to Profiles and Metrics to Traces. Everything is joined by traces! We can find logs by traces, and if we really want to, we can find metrics from traces, and so on.
Tracing can work with mostly zero application code changes if your language already has “auto-instrumenting” solutions that cover all the common libraries with integrations.
- That is the case with Python and its rich set of auto-instrumenting solutions.
- Regretfully, it is not the case at all with Golang at the time of writing this article in 2026.
- How easy tracing is to configure essentially depends on the language.

My best recommendation for integrating it in any language: approach the problem with middlewares or universal interceptors of network requests for every network-interacting library you use. If necessary, make wrappers that automatically add tracing spans. Your code should be covered by tracing automatically, with the least amount of effort, for all network-interacting libraries; that is when tracing is most useful for backend apps!

Note

For Go there is a weak substitute for instrumentation, an [eBPF based tool](https://github.com/open-telemetry/opentelemetry-go-instrumentation), but it is highly limited: your logs and metrics will not be connected to traces, and it works only for a specific subset of libraries which you can't easily change. We will not cover this tool in this series of articles since it does not look like a good default approach.

Raising Tempo

Important

We provide the docker-compose configuration as a demo example because more devs are likely to be familiar and comfortable with docker-compose than with Terraform. We use Terraform ourselves to configure it and recommend using it instead of docker-compose if you can. The book "Terraform: Up and Running" is an excellent place to start with it.


version: '3.8'

services:
  tempo:
    build:
      dockerfile: ./Dockerfile.tempo
      context: .
    container_name: tempo
    user: root
    entrypoint: ["sh", "-c"]
    command: ["/tempo -config.file=/etc/tempo.yaml"]
    networks:
      grafana:
        aliases:
          - tempo
    restart: always
    logging:
      driver: json-file
      options:
        mode: non-blocking
        max-buffer-size: 500m
    volumes:
      - tempo_data:/var/tempo
    mem_limit: 1000m

  alloy-traces:
    build:
      dockerfile: ./Dockerfile.alloy.traces
      context: .
    container_name: alloy-traces
    entrypoint: ["/bin/alloy"]
    command: ["run","/etc/alloy/config.alloy","--storage.path=/var/lib/alloy/data"]
    restart: always
    logging:
      driver: json-file
      options:
        mode: non-blocking
        max-buffer-size: 500m
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
    networks:
      grafana:
        aliases:
          - alloy-traces
    mem_limit: 1000m

volumes:
  tempo_data:
    name: "tempo_data"

networks:
  grafana:
    name: grafana
    external: true


# Option to raise as Terraform
terraform {
  required_providers {
    docker = {
      source  = "kreuzwerker/docker"
      version = ">=3.0.2"
    }
    grafana = {
      source = "grafana/grafana"
    }
  }
}

provider "docker" {
  host     = "ssh://homelab"
  ssh_opts = ["-o", "StrictHostKeyChecking=no", "-o", "UserKnownHostsFile=/dev/null", "-i", "~/.ssh/id_rsa.darklab"]
}

module "caddy" {
  source = "./infra/tf/modules/docker_stack/caddy"
}

data "external" "secrets" {
  program = ["pass", "personal/terraform/grafana"]
}

module "monitoring" {
  // Relevant for part 1 article setup and logging
  source = "./infra/tf/modules/docker_stack/monitoring"
  # optionally we can lock ourselves which code to use from external git repo via git source.
  # source = "git@github.com:darklab8/infra.git//tf/modules/docker_stack/monitoring?ref=28407027ebdaba2b48816b63f627c18acd521f46"
  docker_network_caddy_id = module.caddy.network_id
  grafana_password        = data.external.secrets.result["grafana_password"]
  grafana_domain          = "homelab.dd84ai.com"
  logging = {
    enabled = true
  }

  // Relevant for part 2 article
  tracing = {
    enabled = true
  }
  // Relevant for part 3 article
  metrics = {
    enabled = true
  }
  // Relevant for part 4 article
  alerts = {
    enabled             = true
    discord_webhook_url = data.external.secrets.result["discord_webhook_url"]
  }
}

locals {
  grafana_password = data.external.secrets.result["grafana_password"]
  grafana_creds    = "admin:${local.grafana_password}"
}


provider "grafana" {
  url  = "https://demo.dd84ai.com/"
  auth = local.grafana_creds
}

// Data sources for all article parts at the same time
module "datasources" {
  # source = "./datasources"
  source = "./infra/tf/modules/grafana_stack/datasources"
  # optionally we can lock ourselves which code to use from external git repo via git source.
  # source = "git@github.com:darklab8/infra.git//tf/modules/grafana_stack/datasources?ref=27d0889348b1b526234d6db7ff60cf2793a772ca"
}

Participating configs:

Dockerfile.tempo


FROM grafana/tempo:2.7.2
COPY infra/tf/modules/docker_stack/monitoring/tempo.yaml /etc/tempo.yaml

tempo.yaml


stream_over_http_enabled: true
server:
  http_listen_port: 3200
  log_level: info

query_frontend:
  search:
    duration_slo: 5s
    throughput_bytes_slo: 1.073741824e+09
    metadata_slo:
      duration_slo: 5s
      throughput_bytes_slo: 1.073741824e+09
  trace_by_id:
    duration_slo: 100ms
  metrics:
    max_duration: 200h
    query_backend_after: 5m
    duration_slo: 5s
    throughput_bytes_slo: 1.073741824e+09

distributor:
  receivers:                           
    otlp:
      protocols:
        http:
          endpoint: "tempo:4318"
        grpc:
          endpoint: "tempo:4317"

compactor:
  compaction:
    block_retention: 24h

metrics_generator:
  registry:
    external_labels:
      source: tempo
      cluster: docker-compose
  storage:
    path: /var/tempo/generator/wal
    remote_write:
      - url: http://prometheus:9090/api/v1/write
        send_exemplars: true
  traces_storage:
    path: /var/tempo/generator/traces
  processor:
    local_blocks:
      filter_server_spans: false
      flush_to_storage: true

storage:
  trace:
    backend: local                 
    wal:
      path: /var/tempo/wal            
    local:
      path: /var/tempo/blocks

overrides:
  defaults:
    metrics_generator:
      processors: [service-graphs, span-metrics, local-blocks]
      generate_native_histograms: both

Dockerfile.alloy.traces


FROM grafana/alloy:v1.8.3
COPY infra/tf/modules/docker_stack/monitoring/cfg.traces.alloy /etc/alloy/config.alloy

cfg.traces.alloy



logging {
  level  = "info"
  format = "logfmt"
}

otelcol.receiver.otlp "receiver" {
  debug_metrics {
    disable_high_cardinality_metrics = true
  }

  grpc {
    endpoint = "0.0.0.0:4317"
  }
  http {
    endpoint = "0.0.0.0:4318"
  }
  output {
    metrics = [otelcol.processor.transform.default.input]
    logs    = [otelcol.processor.transform.default.input]
    traces  = [otelcol.processor.transform.default.input]
  }
}

otelcol.processor.transform "default" {
  error_mode = "ignore"

  trace_statements {
    context = "resource"
    statements = [
      `limit(attributes, 500, [])`,
      `truncate_all(attributes, 20480)`,
    ]
  }

  trace_statements {
    context = "span"
    statements = [
      `limit(attributes, 500, [])`,
      `truncate_all(attributes, 20480)`,
    ]
  }
  output {
    metrics = [otelcol.processor.batch.default.input]
    logs    = [otelcol.processor.batch.default.input]
    traces  = [otelcol.processor.batch.default.input]
  }
}

otelcol.processor.batch "default" {
  output {
    metrics = [otelcol.exporter.prometheus.default.input]
    logs    = [otelcol.exporter.loki.default.input]
    traces  = [otelcol.exporter.otlphttp.default.input]
  }
}

otelcol.exporter.otlphttp "default" {
  client {
    endpoint = coalesce(sys.env("TEMPO_URL"),"http://tempo:4318/")
    tls {
        insecure             = true
        insecure_skip_verify = true
    }
  }
}

tracing {
    sampling_fraction = encoding.from_json(coalesce(sys.env("SAMPLING_FRACTION"),"1"))
    write_to = [otelcol.exporter.otlphttp.default.input]
}

otelcol.exporter.prometheus "default" {
  forward_to = [prometheus.remote_write.local.receiver]

  include_target_info = true
  include_scope_info = true
  resource_to_telemetry_conversion = true
}

prometheus.remote_write "local" {
  endpoint {
    url = coalesce(sys.env("PROMETHEUS_URL"),"http://prometheus:9090/api/v1/write")
  }
}

otelcol.exporter.loki "default" {
  forward_to = [loki.write.local.receiver]
}

loki.write "local" {
  endpoint {
    url = coalesce(sys.env("LOKI_URL"), "http://loki:3100/loki/api/v1/push")
    tenant_id = ""
  }
}
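The two statements used in the transform processor above, `limit(attributes, 500, [])` and `truncate_all(attributes, 20480)`, cap how many attributes a resource or span may carry and how long each string value may be. The Go sketch below only illustrates that intent on a plain `map[string]string`. The function names and the choice of which keys to drop (sorted order, for determinism) are assumptions of this example; the real OTTL functions work on typed attribute maps and you should not rely on which non-priority keys they remove.

```go
package main

import (
	"fmt"
	"sort"
)

// TruncateAll shortens every string value to at most maxLen bytes,
// illustrating the intent of truncate_all(attributes, 20480).
func TruncateAll(attrs map[string]string, maxLen int) {
	for k, v := range attrs {
		if len(v) > maxLen {
			attrs[k] = v[:maxLen]
		}
	}
}

// Limit removes keys until at most max remain, illustrating the intent
// of limit(attributes, 500, []). Keys are dropped from the end of the
// sorted key list; that ordering is an assumption of this sketch.
func Limit(attrs map[string]string, max int) {
	if len(attrs) <= max {
		return
	}
	keys := make([]string, 0, len(attrs))
	for k := range attrs {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys[max:] {
		delete(attrs, k)
	}
}

func main() {
	attrs := map[string]string{
		"db.statement": "SELECT * FROM books WHERE id = 1",
		"http.method":  "GET",
		"http.route":   "/books/__book_id__",
	}
	Limit(attrs, 2)
	TruncateAll(attrs, 10)
	fmt.Println(len(attrs), attrs["db.statement"])
}
```

Guards like these keep one misbehaving service with huge or numerous attributes from bloating every trace that passes through the collector.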

Proceed to deploy the tracing part of the stack (or use OpenTofu (Terraform) to raise everything together as modules from ./main.tf).

git clone --recurse-submodules https://github.com/darklab8/blog
cd blog/articles/article_detailed/article_20250609_grafana/code_examples

export DOCKER_HOST=ssh://root@homelab
docker ps

# ONLY if you skipped the first article (about Loki) and are following the docker-compose path:
docker compose up -d caddy # we need it for the reverse proxy and automated TLS certs
docker compose up -d grafana # visualizer where we query traces; it already has the provisioned datasources YAML and the tracing drilldown plugin installed

# Continue with Tracing article content:
# if docker-compose way:
docker compose -f docker-compose.tracing.yaml build
docker compose -f docker-compose.tracing.yaml up -d tempo # tracing backend
docker compose -f docker-compose.tracing.yaml up -d alloy-traces # collector agent that receives traces over the network

# if opentofu way
tofu init
tofu apply

# after deploy, grant Tempo read/write access to its volume so it can initialize and persist data
# (run this on the Docker host itself, where the volumes live)
chmod -R a+rw /var/lib/docker/volumes/tempo_data
chmod -R a+rw /var/lib/docker/volumes/grafana_data # just in case, grant Grafana the same rights if not granted yet

Demo application to test it.

export DOCKER_HOST=ssh://root@homelab
docker compose -f docker-compose.app-traces.yaml build
docker compose -f docker-compose.app-traces.yaml run -it app-traces-go

The following code is deployed:

package main

import (
	"context"
	"errors"
	"fmt"
	"math/rand/v2"
	"time"

	"github.com/darklab8/go-typelog/otlp"
	"github.com/darklab8/go-utils/typelog"
	"go.opentelemetry.io/otel"
)

type WebEndpoint struct {
	pattern      string
	max_duration float64
	url          func() string
}

var WebEndpoints = []WebEndpoint{
	{
		pattern:      "/index.html",
		max_duration: 0.1,
		url:          func() string { return "/index.html" },
	},
	{
		pattern:      "/some_pattern1",
		max_duration: 1,
		url:          func() string { return "/some_pattern1" },
	},
	{
		pattern:      "/another_pattern",
		max_duration: 2,
		url:          func() string { return "/another_pattern" },
	},
	{
		pattern:      "/books/__book_id__",
		max_duration: 0.1,
		url:          func() string { return fmt.Sprintf("/books/%d", rand.IntN(100)) },
	},
	{
		pattern:      "/books/__book_id__/page/__page_id__",
		max_duration: 0.2,
		url:          func() string { return fmt.Sprintf("/books/%d/page/%d", rand.IntN(100), rand.IntN(1000)) },
	},
}

var (
	logger *typelog.Logger = typelog.NewLogger("go-demo-app")
	Tracer                 = otel.Tracer("go-demo-app")
)

func NestedAction(ctx_span context.Context) {
	ctx_span, span := Tracer.Start(ctx_span, "nested action")
	defer span.End()
}

func doRun() {
	time_start := time.Now()
	fmt.Println("started run", time_start)
	ctx_span, span := Tracer.Start(context.Background(), "web request")
	defer span.End()

	time.Sleep(3 * time.Second)

	web_endpoint := WebEndpoints[rand.IntN(len(WebEndpoints))]
	duration := rand.Float64() * web_endpoint.max_duration
	pattern := web_endpoint.pattern
	logger.InfoCtx(ctx_span, "web request",
		typelog.String("url_pattern", pattern),
		typelog.Float64("duration", duration),
		typelog.String("url_path", web_endpoint.url()),
	)
	NestedAction(ctx_span)
	fmt.Println("fninished run", time.Now(), time.Since(time_start))
	time.Sleep(3 * time.Second)
}

func main() {
	fmt.Println("starting app-traces")
	ctx := context.Background()
	otelShutdown, err := otlp.SetupOTelSDK(ctx) // Set up OpenTelemetry.
	if err != nil {
		fmt.Println("error to initialize tracing, err=", err.Error())
	}
	defer func() { // Handle shutdown properly so nothing leaks.
		err = errors.Join(err, otelShutdown(context.Background()))
	}()
	fmt.Println("configured tracing")
	for {
		doRun()
		time.Sleep(30 * time.Second)
	}
}

and its log output shows that it is working:

> starting app-traces
> configured tracing
> started run 2026-04-27 01:33:20.263967654 +0000 UTC m=+0.004603667
> fninished run 2026-04-27 01:33:23.264569558 +0000 UTC m=+3.005205571 3.000601995s

If everything is all right, no errors appear at any level of the chain:

- The app works fine (validate with docker logs app-traces-go).
- Grafana Alloy works fine and has no errors about sending traces (validate with docker logs alloy-traces).
- Tempo works fine (validate with docker logs tempo) and has no errors such as being unable to init its backend because of insufficient rights to initialize its data folder (to fix this, run chmod -R a+rw /var/lib/docker/volumes/tempo_data/).
- Grafana works fine and has initialized itself with the provisioned data sources (validate with docker logs grafana).
- The Grafana plugin for tracing drilldown works fine as well (open the tracing drilldown interface in Grafana and check that it has data).

Then you will see traces in your tracing drilldown interface!

In the real world, tracing is most useful for backend applications, and it is best to turn it on by default for all network-interacting libraries by writing some kind of middleware. Then it can tell you that the issue is in a specific SQL request, Elasticsearch query or HTTP request. And since it is distributed tracing, the trace will also show how the workload behaves within the called service (as you can see in the picture below, an HTTPS request is propagated into Keycloak and shows the internals of its authorization)!
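Propagation into another service usually happens through a request header. With the W3C Trace Context format the caller sends a `traceparent` header of the form `00-<trace id>-<parent span id>-<flags>`, and the called service continues the same trace. The sketch below formats and parses such a header; the `SpanContext` type and both function names are hypothetical, and the validation is deliberately simplified (for example, it does not reject all-zero IDs or uppercase hex, which the specification does).

```go
package main

import (
	"encoding/hex"
	"errors"
	"fmt"
	"strings"
)

// SpanContext is a hypothetical holder for the propagated identifiers.
type SpanContext struct {
	TraceID string // 32 hex characters
	SpanID  string // 16 hex characters
	Sampled bool
}

// FormatTraceparent builds a version "00" traceparent header value.
func FormatTraceparent(sc SpanContext) string {
	flags := "00"
	if sc.Sampled {
		flags = "01"
	}
	return "00-" + sc.TraceID + "-" + sc.SpanID + "-" + flags
}

// isHex reports whether s is exactly n hex characters long.
func isHex(s string, n int) bool {
	if len(s) != n {
		return false
	}
	_, err := hex.DecodeString(s)
	return err == nil
}

// ParseTraceparent extracts the identifiers from a version "00" header.
func ParseTraceparent(h string) (SpanContext, error) {
	parts := strings.Split(h, "-")
	if len(parts) != 4 || parts[0] != "00" {
		return SpanContext{}, errors.New("unsupported traceparent format")
	}
	if !isHex(parts[1], 32) || !isHex(parts[2], 16) || !isHex(parts[3], 2) {
		return SpanContext{}, errors.New("malformed traceparent fields")
	}
	flags, _ := hex.DecodeString(parts[3]) // already validated above
	return SpanContext{
		TraceID: parts[1],
		SpanID:  parts[2],
		Sampled: flags[0]&0x01 == 0x01,
	}, nil
}

func main() {
	header := "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
	sc, err := ParseTraceparent(header)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(sc.TraceID, sc.SpanID, sc.Sampled)
	fmt.Println(FormatTraceparent(sc) == header) // true
}
```

Because both sides share the trace ID, the spans created inside the called service appear under the caller's span in the same trace view.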

The tracing drilldown interface simplifies navigating over traces. By clicking the blue duration graph you can easily find the slowest ones. Click errors to find errors. Enter different filters, from “service_name” to Kubernetes cluster names and namespaces, to narrow traces down by where they came from.

Note

In pet projects the usefulness of tracing is honestly very limited, since a pet project is unlikely to have network interactions long enough to need tracing to debug them; your database will rarely reach the scale that requires it. In pet projects you will benefit more from logging and metrics monitoring. In any real backend work, though, tracing is the most useful system to have; I would dare to say potentially even more useful than any other type of monitoring.

Production grade configuration tips.

For production-grade Tempo under a serious workload, it is common to deploy the horizontally scalable tempo-distributed Helm chart in a k8s cluster.
As far as I have tested, MinIO remains the fastest storage backend for it; for some reason it works at least 3 times faster than Garage on a large volume of traces (600 GB in 2 days at a 10% sampling rate). Regretfully, MinIO is deprecated and a replacement will eventually have to be found.
You can run several Tempo instances in parallel with different storage backends to try other storage solutions and compare them.
To keep the workload sane in production, use a sampling fraction of preferably no more than 10% if you have a serious workload.
The k8s-monitoring Helm chart remains the most ready-made way to run it out of the box in Kubernetes. In other cases (like AWS ECS or a homelab) the Docker-based deployment is the easiest to use.
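One common way to implement a sampling fraction is to derive the keep/drop decision from the trace ID itself, so that every service seeing the same trace reaches the same decision. The sketch below illustrates the idea; `ShouldSample` is a hypothetical helper written for this article, similar in spirit to ratio-based samplers in tracing SDKs but not a copy of any of them.

```go
package main

import (
	"encoding/binary"
	"encoding/hex"
	"fmt"
)

// ShouldSample decides from the trace ID alone, so the result is
// deterministic for a given ID and fraction.
func ShouldSample(traceIDHex string, fraction float64) (bool, error) {
	raw, err := hex.DecodeString(traceIDHex)
	if err != nil {
		return false, err
	}
	if len(raw) != 16 {
		return false, fmt.Errorf("trace ID must be 16 bytes, got %d", len(raw))
	}
	if fraction >= 1 {
		return true, nil
	}
	if fraction <= 0 {
		return false, nil
	}
	// Treat the last 8 bytes as a 63-bit number and keep the trace
	// when it falls below fraction * 2^63.
	x := binary.BigEndian.Uint64(raw[8:16]) >> 1
	bound := uint64(fraction * float64(1<<63))
	return x < bound, nil
}

func main() {
	id := "4bf92f3577b34da6a3ce929d0e0e4736"
	for _, f := range []float64{0.1, 0.7, 1} {
		keep, err := ShouldSample(id, f)
		if err != nil {
			fmt.Println("error:", err)
			return
		}
		fmt.Printf("fraction=%.1f keep=%v\n", f, keep)
	}
}
```

With randomly generated trace IDs, roughly the requested share of traces is kept, and a trace is either kept everywhere or dropped everywhere instead of arriving with holes in it.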