`libnvidia-ml.so` at runtime. No CUDA toolkit or DCGM daemon is needed.
## Requirements
- Linux with NVIDIA GPU drivers installed
- `libnvidia-ml.so` present on the host (installed with the NVIDIA driver)
- For Docker: NVIDIA Container Toolkit
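You can check whether the NVML library is present on the host with a quick lookup (the library path varies by distribution):

```shell
# List dynamic-linker entries for NVML; prints a fallback message if absent.
ldconfig -p | grep libnvidia-ml || echo "libnvidia-ml.so not found"
```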
## Collected metrics
| Metric | Description |
|---|---|
| `hw.gpu.utilization` | Compute, encoder, and decoder utilization (0.0–1.0), split by the `hw.gpu.task` attribute |
| `hw.gpu.memory.utilization` | Memory controller utilization (0.0–1.0) |
| `hw.gpu.memory.limit` | Total VRAM (bytes) |
| `hw.gpu.memory.usage` | Used VRAM (bytes) |
| `hw.gpu.memory.free` | Free VRAM (bytes) |
| `hw.gpu.temperature` | Die and memory temperature (°C), split by the `sensor` attribute |
| `hw.gpu.fan_speed` | Fan speed (RPM) |
| `hw.gpu.power.draw` | Current power draw (W) |
| `hw.gpu.power.limit` | Power cap (W) |
| `hw.gpu.energy.consumed` | Cumulative energy (J) |
| `hw.gpu.clock.graphics` | Graphics/SM clock (MHz) |
| `hw.gpu.clock.memory` | Memory clock (MHz) |
| `hw.errors` | ECC correctable/uncorrectable errors and PCIe replay errors |
## Docker
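With the NVIDIA Container Toolkit installed, `--gpus all` injects the driver libraries (including `libnvidia-ml.so`) into the container. The invocation below is a sketch; `<collector-image>` is a placeholder for the actual image name:

```shell
# Requires the NVIDIA Container Toolkit on the host.
docker run --rm --gpus all <collector-image>
```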
## Docker Compose
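A minimal Compose sketch using the standard GPU device-reservation syntax; the service and image names are placeholders:

```yaml
services:
  gpu-collector:
    image: <collector-image>   # placeholder
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```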
## Kubernetes (DaemonSet)
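A minimal DaemonSet manifest might look like the following; the image name is a placeholder, and `runtimeClassName: nvidia` assumes the NVIDIA runtime class is installed on the nodes:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: gpu-collector
spec:
  selector:
    matchLabels:
      app: gpu-collector
  template:
    metadata:
      labels:
        app: gpu-collector
    spec:
      runtimeClassName: nvidia        # assumption: NVIDIA runtime class present
      containers:
        - name: gpu-collector
          image: <collector-image>    # placeholder
```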
To monitor GPUs on every node in a cluster, deploy the collector as a DaemonSet. The collector does not need privileged mode for NVML; it only needs access to `libnvidia-ml.so`. eBPF tracing requires `CAP_BPF` and `CAP_PERFMON`.

## Metrics reference
Full metrics list with types, units, and attributes.
## Configuration
All environment variables and defaults.

