Goals
- OpenTelemetry-native — uses standard `OTEL_*` env vars, exports via OTLP gRPC or HTTP to any OTel-compatible backend
- Cross-vendor GPU support — NVIDIA (NVML), AMD (sysfs/hwmon), and Intel (i915/Xe sysfs) from a single binary
- OTel semantic conventions — `hw.gpu.*` metric names; `hw.id`, `hw.name`, and `hw.vendor` attributes per spec
- Zero dependencies — no DCGM, no Python, no CUDA toolkit needed at runtime for hardware metrics
- Resilient — continues exporting host metrics even when no GPUs are present; retries GPU discovery every 30s
What it collects
GPU Hardware Telemetry
Utilization, memory, temperature, power draw, energy, clock speeds, ECC errors, PCIe errors — for NVIDIA, AMD, and Intel GPUs
Host System Metrics
CPU utilization, memory usage, disk I/O, filesystem usage, and network I/O — on Linux, macOS, and Windows
eBPF CUDA Tracing
Kernel launch counts, grid/block sizes, memory allocations, and memory copies — via uprobes on libcudart.so (opt-in, Linux only)
GPU vendor support
| Vendor | Backend | Metrics available |
|---|---|---|
| NVIDIA | NVML via go-nvml — loads libnvidia-ml.so at runtime | Utilization, memory, temperature, power, energy, clocks, ECC errors, PCIe errors, fan speed |
| AMD | sysfs + hwmon — zero external dependencies | Utilization, memory, temperature, power, energy, fan speed |
| Intel | sysfs + hwmon + DRM (i915/Xe driver) | Temperature, power draw/limit, cumulative energy, graphics clock, fan speed (kernel 6.16+) |
Platform support
| Feature | Linux | macOS | Windows |
|---|---|---|---|
| System metrics (CPU, memory, disk, network) | Yes | Yes | Yes |
| Process metrics (CPU, memory, threads, FDs) | Yes | Yes | Yes |
| GPU metrics (NVIDIA, AMD, Intel) | Yes | — | — |
| eBPF CUDA tracing | Yes | — | — |
How it works
1. Scans the PCI bus: on startup, the collector scans `/sys/bus/pci/devices/` for PCI class codes 0x0300 (VGA controller), 0x0302 (3D controller), and 0x0380 (display controller), identifying the GPU vendor from the PCI vendor ID.
2. Initialises the vendor backend: each detected GPU is handed to its vendor-specific backend. NVIDIA loads `libnvidia-ml.so` via NVML; AMD and Intel read directly from kernel sysfs/hwmon, with no additional libraries needed.
3. Registers OTel instruments: observable gauge and counter instruments are registered with the OTel SDK meter. On each collection tick, the SDK calls back into the collector to read fresh values from each GPU.
Quickstart
Get the collector running in under 5 minutes with Docker
Configuration
Full reference for all environment variables

