This guide demonstrates how to implement LLM evaluation tools to assess model output quality. With OpenLIT’s programmatic evaluations, you can perform hallucination detection, bias detection, and toxicity filtering using production-ready evaluation metrics. You’ll learn how to use the All evaluator for comprehensive LLM output assessment, measuring hallucination, bias, and toxicity simultaneously, and how to collect OpenTelemetry evaluation metrics for continuous model performance monitoring.
1. Initialize evaluations

Set up evaluations for large language models with just two lines of code:
import openlit

# Comprehensive LLM evaluation: hallucination detection, bias detection, toxicity filtering
evals = openlit.evals.All()
result = evals.measure(prompt=prompt, contexts=contexts, text=text)
Full Example:
example.py
import os

import openlit

# openlit can also read the OPENAI_API_KEY variable directly from the environment if it is not passed as a function argument
openai_api_key = os.getenv("OPENAI_API_KEY")

# Production-ready LLM evaluation tools for hallucination detection and bias screening
evals = openlit.evals.All(provider="openai", api_key=openai_api_key)

contexts = ["Einstein won the Nobel Prize for his discovery of the photoelectric effect in 1921"]
prompt = "When and why did Einstein win the Nobel Prize?"
text = "Einstein won the Nobel Prize in 1969 for his discovery of the photoelectric effect"

result = evals.measure(prompt=prompt, contexts=contexts, text=text)
print(result)
Output
verdict='yes' evaluation='Hallucination' score=0.9 classification='factual_inaccuracy' explanation='The text incorrectly states that Einstein won the Nobel Prize in 1969, while the context specifies that he won it in 1921 for his discovery of the photoelectric effect, leading to a significant factual inconsistency.'
The All evaluator assesses model outputs for hallucination, bias, and toxicity simultaneously. For targeted model evaluation, use a specific evaluator instead. For advanced LLM evaluation metrics and supported providers, explore our Evaluations Guide.
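You can also act on the evaluation result programmatically, for example to block flagged responses. The sketch below is illustrative only: it assumes the result object exposes the verdict, score, evaluation, and classification fields shown in the output above as attributes, and the is_safe helper and 0.5 threshold are arbitrary choices, not part of the OpenLIT API:
import openlit

evals = openlit.evals.All(provider="openai")  # reads OPENAI_API_KEY from the environment

def is_safe(prompt: str, contexts: list[str], text: str, threshold: float = 0.5) -> bool:
    """Return False when the evaluator flags the response with a score above the threshold."""
    result = evals.measure(prompt=prompt, contexts=contexts, text=text)
    # A 'yes' verdict means an issue (hallucination, bias, or toxicity) was detected
    if result.verdict == "yes" and result.score >= threshold:
        print(f"Blocked: {result.evaluation} ({result.classification}), score={result.score}")
        return False
    return True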
2. Track LLM evaluation metrics

To send evaluation scores to an OpenTelemetry backend, your application needs to be instrumented via OpenLIT. Choose one of the instrumentation methods, then add collect_metrics=True to track hallucination detection, bias screening, and toxicity filtering metrics.
No code changes needed - instrument via CLI:
# Run with zero-code instrumentation
openlit-instrument python your_app.py
Then in your application:
import openlit

# Enable evaluation metrics tracking - OpenLIT instrumentation handles the rest
evals = openlit.evals.All(collect_metrics=True)
result = evals.measure(prompt=prompt, contexts=contexts, text=text)
Metrics are sent to the same OpenTelemetry backend configured during instrumentation; check our supported destinations for configuration details.
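If you prefer to instrument from code instead of the CLI, the minimal sketch below shows the same setup; it assumes openlit.init() accepts an otlp_endpoint argument and that an OTLP collector is listening at http://127.0.0.1:4318 (adjust both to your environment):
import openlit

# Instrument the application; the endpoint below is an example value, not a required default
openlit.init(otlp_endpoint="http://127.0.0.1:4318")

# Evaluation scores are exported as metrics to the same backend
evals = openlit.evals.All(collect_metrics=True)
result = evals.measure(prompt=prompt, contexts=contexts, text=text)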
You’re all set! Your AI applications now have complete model evaluation capabilities with automated hallucination detection, bias screening, and toxicity filtering. Monitor LLM output quality with real-time evaluation metrics. If you have any questions or need support, reach out to our community.
Kubernetes

Running in Kubernetes? Try the OpenLIT Operator

Automatically inject instrumentation into existing workloads without modifying pod specs, container images, or application code.