The collector monitors NVIDIA GPUs via NVML using the go-nvml library, which loads libnvidia-ml.so at runtime. No CUDA toolkit or DCGM daemon is needed.

Requirements

  • Linux with NVIDIA GPU drivers installed
  • libnvidia-ml.so present on the host (installed with the NVIDIA driver)
  • For Docker: NVIDIA Container Toolkit
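The requirements above can be verified before starting the collector. A minimal sketch, using only standard tools (ldconfig ships with glibc, and nvidia-smi is installed alongside the NVIDIA driver):

```shell
# Check for the NVIDIA driver library and CLI before running the collector.
# ldconfig -p lists libraries the dynamic linker can resolve; libnvidia-ml.so
# ships with the NVIDIA driver, not with the CUDA toolkit.
if ldconfig -p 2>/dev/null | grep -q libnvidia-ml; then
  nvml_status="found"
else
  nvml_status="missing"
fi
echo "libnvidia-ml.so: $nvml_status"

# nvidia-smi is a quick proxy for "driver installed and working".
if command -v nvidia-smi >/dev/null 2>&1; then
  echo "nvidia-smi: found"
else
  echo "nvidia-smi: missing"
fi
```

If libnvidia-ml.so is reported missing, install or reinstall the NVIDIA driver; the collector cannot start without it.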

Collected metrics

Metric | Description
------ | -----------
hw.gpu.utilization | Compute, encoder, and decoder utilization (0.0–1.0), split by the hw.gpu.task attribute
hw.gpu.memory.utilization | Memory controller utilization (0.0–1.0)
hw.gpu.memory.limit | Total VRAM (bytes)
hw.gpu.memory.usage | Used VRAM (bytes)
hw.gpu.memory.free | Free VRAM (bytes)
hw.gpu.temperature | Die and memory temperature (°C), split by the sensor attribute
hw.gpu.fan_speed | Fan speed (RPM)
hw.gpu.power.draw | Current power draw (W)
hw.gpu.power.limit | Power cap (W)
hw.gpu.energy.consumed | Cumulative energy consumed (J)
hw.gpu.clock.graphics | Graphics/SM clock (MHz)
hw.gpu.clock.memory | Memory clock (MHz)
hw.errors | ECC correctable/uncorrectable error counts and PCIe replay errors
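Most of these values can be sanity-checked against nvidia-smi on the same host. The query fields below are standard nvidia-smi field names; the mapping onto the metric names above is an approximation, not part of the collector itself:

```shell
# Cross-check the collector's readings against nvidia-smi.
# The fields map roughly to hw.gpu.utilization, hw.gpu.memory.usage,
# hw.gpu.memory.limit, hw.gpu.temperature, and hw.gpu.power.draw.
# Note: nvidia-smi reports utilization as a percentage (0-100), while
# the collector normalizes it to 0.0-1.0.
fields="utilization.gpu,memory.used,memory.total,temperature.gpu,power.draw"
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu="$fields" --format=csv,noheader
else
  echo "nvidia-smi not available; skipping cross-check"
fi
```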

Docker

docker run -d \
  --name otel-gpu-collector \
  --gpus all \
  -e OTEL_SERVICE_NAME=my-app \
  -e OTEL_RESOURCE_ATTRIBUTES='deployment.environment=production' \
  -e OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318 \
  ghcr.io/openlit/otel-gpu-collector:latest

Docker Compose

services:
  otel-gpu-collector:
    image: ghcr.io/openlit/otel-gpu-collector:latest
    environment:
      OTEL_SERVICE_NAME: my-app
      OTEL_RESOURCE_ATTRIBUTES: deployment.environment=production
      OTEL_EXPORTER_OTLP_ENDPOINT: http://otel-collector:4318
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    restart: always

Kubernetes (DaemonSet)

To monitor GPUs on every node in a cluster, deploy the collector as a DaemonSet:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: otel-gpu-collector
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: otel-gpu-collector
  template:
    metadata:
      labels:
        app: otel-gpu-collector
    spec:
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: collector
          image: ghcr.io/openlit/otel-gpu-collector:latest
          env:
            - name: OTEL_SERVICE_NAME
              value: gpu-collector
            - name: OTEL_RESOURCE_ATTRIBUTES
              value: deployment.environment=production
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              value: http://otel-collector.monitoring.svc.cluster.local:4318
          resources:
            limits:
              nvidia.com/gpu: 1
          securityContext:
            privileged: false
The collector does not need privileged mode for NVML; it only needs access to libnvidia-ml.so on the node. The eBPF tracing features additionally require CAP_BPF and CAP_PERFMON.
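After applying the manifest, a quick sketch for confirming the DaemonSet is running on every GPU node (namespace, name, and label taken from the manifest above; adjust if yours differ):

```shell
# Verify the DaemonSet rolled out and list its pods across nodes.
ns="monitoring"
ds="otel-gpu-collector"
if command -v kubectl >/dev/null 2>&1; then
  kubectl -n "$ns" rollout status "daemonset/$ds" --timeout=60s
  kubectl -n "$ns" get pods -l app="$ds" -o wide
else
  echo "kubectl not available; skipping check"
fi
```

Pods should appear only on nodes that tolerate the nvidia.com/gpu taint and can satisfy the nvidia.com/gpu resource limit.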

Metrics reference

Full metrics list with types, units, and attributes

Configuration

All environment variables and defaults