OpenLIT uses OpenTelemetry to help you monitor NVIDIA and AMD GPUs for AI applications. Track GPU metrics like utilization, temperature, memory usage, and power consumption during AI training and inference workloads.

Choose your method

GPU monitoring can be implemented in two ways depending on your setup and requirements:

OpenLIT SDK

This method is useful if you already have an AI application running on a GPU that is instrumented with OpenLIT. It extends your existing observability to include GPU metrics alongside your LLM traces.
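
For instance, a minimal sketch, assuming your application already calls `openlit.init()` (the flag name comes from the SDK Configuration Options table below):

```python
import openlit

# Sketch: if your app is already instrumented with OpenLIT, this one
# flag (see "SDK Configuration Options" below) extends the same
# telemetry pipeline to also emit GPU and system metrics.
openlit.init(collect_system_metrics=True)
```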

OpenTelemetry GPU Collector

This method is useful for remote GPUs that only host LLM models, and for containerized deployments. It lets you collect GPU metrics without modifying application code.
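
As a sketch, the collector can run as a standalone container next to the GPU. The image name below is an assumption, and the environment variables mirror the Environment Variables table later on this page; check the collector's own documentation for exact values:

```bash
# Sketch: run a GPU metrics collector beside the model server.
# The image name is assumed; the OTEL_* variables match the
# "Environment Variables" table below, and the values are placeholders.
docker run --rm --gpus all \
  -e OTEL_EXPORTER_OTLP_ENDPOINT="http://otel-backend:4318" \
  -e OTEL_SERVICE_NAME="remote-gpu-node" \
  -e OTEL_DEPLOYMENT_ENVIRONMENT="production" \
  ghcr.io/openlit/otel-gpu-collector:latest
```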

Supported Parameters

SDK Configuration Options

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `collect_system_metrics` | boolean | `False` | Enable GPU and system metrics collection |
| `otlp_endpoint` | string | `None` | OpenTelemetry OTLP endpoint URL |
| `otlp_headers` | string | `None` | Authentication headers for OTLP endpoint |
| `service_name` | string | `"unknown_service"` | Name of your AI application |
| `environment` | string | `None` | Deployment environment (dev, staging, prod) |
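
Putting the options together, a sketch of a fully configured `init()` call; every value is a placeholder:

```python
import openlit

# Sketch: all SDK options from the table above, with placeholder values.
openlit.init(
    collect_system_metrics=True,                  # GPU + system metrics
    otlp_endpoint="http://127.0.0.1:4318",        # OTLP endpoint URL
    otlp_headers="Authorization=Bearer <token>",  # placeholder auth header
    service_name="my-gpu-app",
    environment="production",
)
```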

Environment Variables

| Variable | Description | Example |
|----------|-------------|---------|
| `OPENLIT_COLLECT_SYSTEM_METRICS` | Enable GPU monitoring | `true` |
| `OTEL_EXPORTER_OTLP_ENDPOINT` | OTLP endpoint URL | `http://127.0.0.1:4318` |
| `OTEL_SERVICE_NAME` | Service name for telemetry | `my-gpu-app` |
| `OTEL_DEPLOYMENT_ENVIRONMENT` | Deployment environment | `production` |
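
The same configuration can be supplied through the environment instead of code; a sketch using the variables above (values are placeholders, and `app.py` stands in for your instrumented application):

```bash
# Sketch: configure GPU monitoring entirely via environment variables.
export OPENLIT_COLLECT_SYSTEM_METRICS=true
export OTEL_EXPORTER_OTLP_ENDPOINT="http://127.0.0.1:4318"
export OTEL_SERVICE_NAME="my-gpu-app"
export OTEL_DEPLOYMENT_ENVIRONMENT="production"

python app.py  # placeholder entry point for your instrumented app
```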

Kubernetes

Running in Kubernetes? Try the OpenLIT Operator

Automatically inject instrumentation into existing workloads without modifying pod specs, container images, or application code.