All metric names and attributes follow the OpenTelemetry semantic conventions for hardware and system metrics.

GPU Hardware Telemetry

Collected for each detected GPU on Linux. Availability depends on vendor and GPU model.

Metrics

| Metric | Type | Unit | Description | NVIDIA | AMD | Intel |
|--------|------|------|-------------|--------|-----|-------|
| hw.gpu.utilization | Gauge | 1 | GPU compute/encoder/decoder utilization (0.0–1.0) | Yes | Yes | |
| hw.gpu.memory.utilization | Gauge | 1 | Memory controller utilization (0.0–1.0) | Yes | Yes | |
| hw.gpu.memory.limit | UpDownCounter | By | Total GPU memory | Yes | Yes | |
| hw.gpu.memory.usage | UpDownCounter | By | Used GPU memory | Yes | Yes | |
| hw.gpu.memory.free | UpDownCounter | By | Free GPU memory | Yes | Yes | |
| hw.gpu.temperature | Gauge | Cel | Die or memory temperature | Yes | Yes | Yes |
| hw.gpu.fan_speed | Gauge | {rpm} | Fan speed | Yes | Yes | Yes* |
| hw.gpu.power.draw | Gauge | W | Current power draw | Yes | Yes | Yes |
| hw.gpu.power.limit | Gauge | W | Power limit/cap | Yes | Yes | Yes |
| hw.gpu.energy.consumed | Counter | J | Cumulative energy consumed | Yes | Yes | Yes |
| hw.gpu.clock.graphics | Gauge | MHz | Graphics/SM clock frequency | Yes | Yes | Yes* |
| hw.gpu.clock.memory | Gauge | MHz | Memory clock frequency | Yes | Yes | |
| hw.errors | Counter | {error} | ECC and PCIe error count | Yes | | |
* Intel fan speed requires Linux kernel 6.16+. Intel graphics clock requires the Xe driver.
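The memory and power metrics are related by their units: hw.gpu.memory.utilization is usage divided by limit, and dividing a delta of the cumulative hw.gpu.energy.consumed counter (J) by the scrape interval (s) yields average watts, which can be cross-checked against hw.gpu.power.draw. A minimal sketch of that arithmetic in Go (the helper names are illustrative, not part of the collector):

```go
package main

import "fmt"

// memoryUtilization derives the 0.0–1.0 ratio that hw.gpu.memory.utilization
// reports from the hw.gpu.memory.usage and hw.gpu.memory.limit counters.
func memoryUtilization(usageBytes, limitBytes uint64) float64 {
	if limitBytes == 0 {
		return 0 // guard against devices that report no limit
	}
	return float64(usageBytes) / float64(limitBytes)
}

// averagePowerWatts derives mean power over an interval from a delta of the
// cumulative hw.gpu.energy.consumed counter: joules / seconds = watts.
func averagePowerWatts(energyDeltaJoules, intervalSeconds float64) float64 {
	if intervalSeconds <= 0 {
		return 0
	}
	return energyDeltaJoules / intervalSeconds
}

func main() {
	fmt.Println(memoryUtilization(40<<30, 80<<30)) // 40 GiB used of 80 GiB
	fmt.Println(averagePowerWatts(4000, 10))       // 4 kJ over a 10 s interval
}
```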

Attributes

All GPU metrics carry these base attributes:
| Attribute | Description | Example |
|-----------|-------------|---------|
| hw.id | Unique device identifier (required by spec) | GPU-a1b2c3d4-5678-... |
| hw.name | Product name | NVIDIA A100-SXM4-80GB |
| hw.vendor | Vendor name | nvidia, amd, intel |
| gpu.index | Zero-based device index | 0, 1 |
| gpu.pci_address | PCI bus address | 0000:01:00.0 |
Additional per-metric attributes:
| Metric | Attribute | Values |
|--------|-----------|--------|
| hw.gpu.utilization | hw.gpu.task | general, encoder, decoder |
| hw.gpu.temperature | sensor | die, memory |
| hw.errors | error.type | corrected, uncorrected, pcie_replay |
| hw.errors | hw.type | gpu |
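Putting the two tables together, a single hw.gpu.utilization data point for, say, the decoder carries the full base set plus hw.gpu.task. A hypothetical sketch of assembling such an attribute set (plain maps stand in for the SDK's attribute type; this is illustrative, not the collector's code):

```go
package main

import "fmt"

// baseAttributes builds the attribute set that every GPU metric carries.
func baseAttributes(id, name, vendor string, index int, pci string) map[string]string {
	return map[string]string{
		"hw.id":           id,
		"hw.name":         name,
		"hw.vendor":       vendor,
		"gpu.index":       fmt.Sprint(index),
		"gpu.pci_address": pci,
	}
}

// withTask copies the base set and adds the per-metric hw.gpu.task attribute
// used by hw.gpu.utilization (general, encoder, or decoder).
func withTask(base map[string]string, task string) map[string]string {
	attrs := make(map[string]string, len(base)+1)
	for k, v := range base {
		attrs[k] = v
	}
	attrs["hw.gpu.task"] = task
	return attrs
}

func main() {
	base := baseAttributes("GPU-a1b2", "NVIDIA A100-SXM4-80GB", "nvidia", 0, "0000:01:00.0")
	fmt.Println(withTask(base, "decoder")["hw.gpu.task"]) // decoder
}
```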

System Metrics

Collected on all platforms (Linux, macOS, Windows) via gopsutil. Follows the OTel semantic conventions for system metrics.
| Metric | Type | Unit | Description | Attributes |
|--------|------|------|-------------|------------|
| system.cpu.utilization | Gauge | 1 | CPU utilization per logical core (0.0–1.0) | cpu.logical_number |
| system.cpu.logical.count | UpDownCounter | {cpu} | Number of logical CPU cores | |
| system.memory.usage | UpDownCounter | By | Memory bytes by state | system.memory.state (used, free, cached, buffers) |
| system.memory.utilization | Gauge | 1 | Memory utilization (0.0–1.0) | |
| system.disk.io | Counter | By | Disk I/O bytes | system.device, disk.io.direction (read, write) |
| system.disk.operations | Counter | {operation} | Disk I/O operations | system.device, disk.io.direction (read, write) |
| system.filesystem.usage | UpDownCounter | By | Filesystem space by state | system.device, system.filesystem.mountpoint, system.filesystem.type, system.filesystem.state (used, free, reserved) |
| system.filesystem.utilization | Gauge | 1 | Filesystem utilization (0.0–1.0) | system.device, system.filesystem.mountpoint, system.filesystem.type |
| system.network.io | Counter | By | Network I/O bytes | network.interface.name, network.io.direction (transmit, receive) |
| system.network.errors | Counter | {error} | Network errors | network.interface.name, network.io.direction (transmit, receive) |
The system.memory.state values cached and buffers are reported only on Linux. Loopback interfaces (lo, lo0) are excluded from network metrics.

Process Metrics

Self-monitoring of the collector process. Follows the OTel semantic conventions for process metrics.
| Metric | Type | Unit | Description | Attributes |
|--------|------|------|-------------|------------|
| process.cpu.time | Counter | s | Cumulative CPU time | cpu.mode |
| process.cpu.utilization | Gauge | 1 | CPU utilization (0.0–1.0) | |
| process.memory.usage | UpDownCounter | By | Resident memory (RSS) | |
| process.memory.virtual | UpDownCounter | By | Virtual memory size | |
| process.thread.count | UpDownCounter | {thread} | OS thread count | |
| process.unix.file_descriptor.count | UpDownCounter | {file_descriptor} | Open file descriptors (Linux/macOS) | |
| process.runtime.go.goroutines | Gauge | {goroutine} | Go goroutine count | |
| process.runtime.go.mem.heap_alloc | Gauge | By | Go heap memory allocated | |
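process.cpu.utilization can be cross-checked against process.cpu.time: utilization over a window is the delta of CPU-seconds (summed across cpu.mode series) divided by wall-clock seconds elapsed. A hypothetical sketch:

```go
package main

import "fmt"

// cpuUtilization derives a 0.0–1.0 value (per core) from two readings of the
// cumulative process.cpu.time counter taken wallSeconds of wall-clock time
// apart. A backwards-moving counter (restart) yields zero.
func cpuUtilization(prevCPUSeconds, curCPUSeconds, wallSeconds float64) float64 {
	if wallSeconds <= 0 || curCPUSeconds < prevCPUSeconds {
		return 0
	}
	return (curCPUSeconds - prevCPUSeconds) / wallSeconds
}

func main() {
	// 1.5 CPU-seconds consumed over a 10-second window -> 0.15 utilization.
	fmt.Println(cpuUtilization(120.0, 121.5, 10.0))
}
```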

eBPF CUDA Metrics (opt-in)

Enable with OTEL_GPU_EBPF_ENABLED=true. Requires Linux, CAP_BPF + CAP_PERFMON (or root), and an NVIDIA CUDA runtime (libcudart.so). Attaches uprobes to cudaLaunchKernel, cudaMalloc, and cudaMemcpy.
| Metric | Type | Unit | Description | Attributes |
|--------|------|------|-------------|------------|
| gpu.kernel.launch.calls | Counter | {call} | CUDA kernel launch count | cuda.kernel.name |
| gpu.kernel.grid.size | Histogram | {thread} | Total threads in grid per launch | cuda.kernel.name |
| gpu.kernel.block.size | Histogram | {thread} | Threads per block per launch | cuda.kernel.name |
| gpu.memory.allocations | Counter | By | Bytes allocated via cudaMalloc | |
| gpu.memory.copies | Histogram | By | Bytes per cudaMemcpy call | cuda.memcpy.kind |
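The two kernel histograms record values derived from the launch configuration captured at the cudaLaunchKernel uprobe: threads per block is the product of the block dimensions, and total threads in the grid is the number of blocks times that. A sketch of the arithmetic (the dim3 struct mirrors CUDA's launch-configuration type; the helpers are illustrative):

```go
package main

import "fmt"

// dim3 mirrors CUDA's dim3 launch-configuration type.
type dim3 struct{ x, y, z uint32 }

// blockSize is the value recorded in gpu.kernel.block.size:
// threads per block = blockDim.x * blockDim.y * blockDim.z.
func blockSize(block dim3) uint64 {
	return uint64(block.x) * uint64(block.y) * uint64(block.z)
}

// gridSize is the value recorded in gpu.kernel.grid.size: total threads in
// the grid, i.e. the number of blocks times threads per block.
func gridSize(grid, block dim3) uint64 {
	blocks := uint64(grid.x) * uint64(grid.y) * uint64(grid.z)
	return blocks * blockSize(block)
}

func main() {
	grid := dim3{x: 256, y: 1, z: 1}
	block := dim3{x: 128, y: 1, z: 1}
	fmt.Println(blockSize(block))      // 128 threads per block
	fmt.Println(gridSize(grid, block)) // 256 blocks * 128 = 32768 threads
}
```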