The OpenTelemetry GPU Collector is a lightweight, single-binary metrics collector written in Go. It exports GPU hardware telemetry, host system metrics, and process metrics via the OpenTelemetry Protocol (OTLP) — with no Python dependencies, no DCGM daemon, and no vendor-specific agents. It is fully configured via standard OpenTelemetry environment variables and follows the OTel semantic conventions for hardware metrics.

Goals

  • OpenTelemetry-native — uses standard OTEL_* env vars, exports via OTLP gRPC or HTTP to any OTel-compatible backend
  • Cross-vendor GPU support — NVIDIA (NVML), AMD (sysfs/hwmon), and Intel (i915/Xe sysfs) from a single binary
  • OTel semantic conventions — hw.gpu.* metric names, hw.id / hw.name / hw.vendor attributes per spec
  • Zero dependencies — no DCGM, no Python, no CUDA toolkit needed at runtime for hardware metrics
  • Resilient — continues exporting host metrics even when no GPUs are present; retries GPU discovery every 30s

What it collects

GPU Hardware Telemetry

Utilization, memory, temperature, power draw, energy, clock speeds, ECC errors, PCIe errors — for NVIDIA, AMD, and Intel GPUs

Host System Metrics

CPU utilization, memory usage, disk I/O, filesystem usage, and network I/O — on Linux, macOS, and Windows

eBPF CUDA Tracing

Kernel launch counts, grid/block sizes, memory allocations, and memory copies — via uprobes on libcudart.so (opt-in, Linux only)

GPU vendor support

| Vendor | Backend | Metrics available |
|--------|---------|-------------------|
| NVIDIA | NVML via go-nvml — loads libnvidia-ml.so at runtime | Utilization, memory, temperature, power, energy, clocks, ECC errors, PCIe errors, fan speed |
| AMD | sysfs + hwmon — zero external dependencies | Utilization, memory, temperature, power, energy, fan speed |
| Intel | sysfs + hwmon + DRM (i915/Xe driver) | Temperature, power draw/limit, cumulative energy, graphics clock, fan speed (kernel 6.16+) |
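For the AMD and Intel backends, the kernel's hwmon files report values as plain-text integers in fixed units: temperature in millidegrees Celsius, power in microwatts. A hedged sketch of the unit conversion such a sysfs/hwmon backend performs (function names are illustrative, not the collector's API):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// tempCelsius converts the contents of a hwmon temp*_input file
// (millidegrees Celsius, e.g. "65000\n") to degrees Celsius.
func tempCelsius(raw string) (float64, error) {
	milli, err := strconv.ParseInt(strings.TrimSpace(raw), 10, 64)
	if err != nil {
		return 0, err
	}
	return float64(milli) / 1000.0, nil
}

// powerWatts converts the contents of a hwmon power*_input file
// (microwatts, e.g. "185000000\n") to watts.
func powerWatts(raw string) (float64, error) {
	micro, err := strconv.ParseInt(strings.TrimSpace(raw), 10, 64)
	if err != nil {
		return 0, err
	}
	return float64(micro) / 1e6, nil
}

func main() {
	t, _ := tempCelsius("65000\n")    // as read from temp1_input
	p, _ := powerWatts("185000000\n") // as read from power1_input
	fmt.Printf("%.1f C, %.1f W\n", t, p)
}
```

Because these are plain files under /sys, the backend needs no vendor library — just reads and integer parses.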

Platform support

| Feature | Linux | macOS | Windows |
|---------|-------|-------|---------|
| System metrics (CPU, memory, disk, network) | Yes | Yes | Yes |
| Process metrics (CPU, memory, threads, FDs) | Yes | Yes | Yes |
| GPU metrics (NVIDIA, AMD, Intel) | Yes | No | No |
| eBPF CUDA tracing | Yes | No | No |

How it works

Host Metrics (all platforms via gopsutil)
    +-- CPU utilization, memory, disk I/O, filesystem, network
    +-- Process: self CPU, memory, threads, FDs, Go runtime

GPU Metrics (Linux only)
    +-- PCI Bus Scan (/sys/bus/pci/devices/)
    |     +-- NVIDIA (0x10de) --> NVML backend
    |     +-- AMD    (0x1002) --> sysfs/hwmon backend
    |     +-- Intel  (0x8086) --> sysfs/hwmon + DRM backend
    |
    +-- [Optional: eBPF CUDA tracing via uprobes on libcudart.so]

Export
    +-- OTel SDK --> OTLP gRPC/HTTP --> your OTel collector / backend
1. Scans PCI bus

On startup, the collector scans /sys/bus/pci/devices/ for PCI class codes 0x0300 (VGA), 0x0302 (3D controller), and 0x0380 (display controller), identifying GPU vendor from the PCI vendor ID.
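The scan above needs only two sysfs files per device: `class` and `vendor`. A hedged sketch of that classification step in Go (names below are illustrative, not the collector's actual code):

```go
package main

import "fmt"

// isGPUClass reports whether a PCI class code, as read from a
// sysfs `class` file (e.g. "0x030000"), is a display-class device.
// The first 16 bits encode class + subclass: 0x0300 VGA controller,
// 0x0302 3D controller, 0x0380 other display controller.
func isGPUClass(class string) bool {
	if len(class) < 6 {
		return false
	}
	switch class[:6] {
	case "0x0300", "0x0302", "0x0380":
		return true
	}
	return false
}

// backendFor maps a PCI vendor ID, as read from a sysfs `vendor`
// file, to the metrics backend used for that GPU.
func backendFor(vendor string) string {
	switch vendor {
	case "0x10de":
		return "nvml" // NVIDIA
	case "0x1002":
		return "amd-sysfs" // AMD
	case "0x8086":
		return "intel-sysfs" // Intel
	}
	return "" // not a supported GPU vendor
}

func main() {
	fmt.Println(isGPUClass("0x030000"), backendFor("0x10de"))
}
```

In the real collector these checks would run over every entry in /sys/bus/pci/devices/; the sketch isolates just the class/vendor decision.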
2. Initialises vendor backend

Each detected GPU is handed to its vendor-specific backend. NVIDIA loads libnvidia-ml.so via NVML. AMD and Intel read directly from kernel sysfs/hwmon — no additional libraries needed.
3. Registers OTel instruments

Observable gauge and counter instruments are registered with the OTel SDK meter. On each collection tick, the SDK calls back into the collector to read fresh values from each GPU.
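That pull model — register a callback once, let the exporter invoke it on every collection tick — can be sketched without the OTel SDK itself. A stdlib-only illustration of the pattern (the types and names below are illustrative, not the OTel API):

```go
package main

import "fmt"

// observation is one metric sample produced during a collection tick.
type observation struct {
	name  string
	value float64
}

// registry stands in for the OTel SDK meter: instruments register
// callbacks once at startup, and collection invokes them all.
type registry struct {
	callbacks []func() observation
}

// registerGauge records a callback that reads a fresh value on demand,
// analogous to registering an observable gauge with the SDK.
func (r *registry) registerGauge(name string, read func() float64) {
	r.callbacks = append(r.callbacks, func() observation {
		return observation{name: name, value: read()}
	})
}

// collect is what a collection tick does: call back into the
// collector to read fresh values from each instrument.
func (r *registry) collect() []observation {
	out := make([]observation, 0, len(r.callbacks))
	for _, cb := range r.callbacks {
		out = append(out, cb())
	}
	return out
}

func main() {
	var r registry
	util := 42.0 // stand-in for a live NVML/sysfs read
	r.registerGauge("hw.gpu.utilization", func() float64 { return util })
	for _, o := range r.collect() {
		fmt.Printf("%s = %.0f\n", o.name, o.value)
	}
}
```

The key property the sketch demonstrates is that values are read at export time, not cached at registration time — which is why each tick reflects the current GPU state.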
4. Exports via OTLP

Metrics are exported via OTLP to any compatible backend — OpenLIT, Grafana, Datadog, New Relic, or a standard OTel Collector.

Quickstart

Get the collector running in under 5 minutes with Docker

Configuration

Full reference for all environment variables