LLM-as-a-Judge

LLM-as-a-Judge is a technique for evaluating the quality of LLM applications by using a capable language model as the evaluator. The LLM judge analyzes your AI outputs and returns structured scores, classifications, and detailed reasoning about response quality, safety, and performance.
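To make the idea concrete, here is a minimal sketch of the general pattern: a judge prompt is filled in, sent to a model, and the JSON reply is parsed into a structured result. This is not OpenLIT's implementation. `JudgeResult`, `JUDGE_PROMPT`, `judge`, and `fake_llm` are hypothetical names, and `fake_llm` is a stand-in for a real provider call so the example runs offline.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass
class JudgeResult:
    """Structured output of one judge call (hypothetical shape)."""
    score: float
    classification: str
    explanation: str


# Hypothetical judge prompt; a real deployment would tune this wording.
JUDGE_PROMPT = (
    "You are an evaluator. Assess the response below for {criterion}.\n"
    'Reply with JSON only, using the keys "score" (0 to 1), '
    '"classification", and "explanation".\n\n'
    "Prompt: {prompt}\n"
    "Response: {response}"
)


def judge(prompt: str, response: str, criterion: str,
          call_llm: Callable[[str], str]) -> JudgeResult:
    """Ask the judge model to grade one response and parse its JSON reply."""
    raw = call_llm(JUDGE_PROMPT.format(
        criterion=criterion, prompt=prompt, response=response))
    data = json.loads(raw)
    # Clamp the score so a misbehaving judge cannot leave the 0-1 range.
    score = min(1.0, max(0.0, float(data["score"])))
    return JudgeResult(score, str(data["classification"]),
                       str(data["explanation"]))


def fake_llm(judge_prompt: str) -> str:
    """Stand-in for a real model call (assumption: returns a JSON string)."""
    return json.dumps({
        "score": 0.8,
        "classification": "factual_inaccuracy",
        "explanation": "The response names the wrong capital city.",
    })


result = judge("What is the capital of France?",
               "The capital of France is Lyon.",
               "hallucination", fake_llm)
print(result)
```

Passing the model call in as a function keeps the parsing logic testable without network access; swapping `fake_llm` for a real client is the only change needed to try it against a provider.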

Why use LLM-as-a-Judge?

Scalable & Cost-Effective: Evaluate thousands of LLM outputs automatically at a fraction of human evaluation costs
Human-Like Quality Assessment: Capture nuanced quality dimensions like helpfulness, safety, and coherence that simple metrics miss
Consistent & Reproducible: Apply uniform evaluation criteria across all outputs with repeatable scoring for reliable model comparisons
Actionable Insights: Get structured reasoning and detailed explanations for evaluation decisions to systematically improve your AI systems

Built-in evaluators

OpenLIT provides built-in evaluators that assess the output of your LLM calls.

Hallucination detection

Identifies factual inaccuracies, contradictions, and fabricated information in AI responses

Bias detection

Monitors for discriminatory patterns across protected attributes and ensures fair outputs

Toxicity detection

Screens for harmful, offensive, or inappropriate content in AI-generated text

Combined analysis

All-in-one evaluator combining hallucination, bias, and toxicity detection

Running evaluations

OpenLIT provides two ways to configure evaluation settings. The first is through the Settings page:

Navigate to Settings → Evaluation Settings
Configure your model provider and the LLM to use as a judge (OpenAI GPT-4, Anthropic Claude, etc.)
Add your LLM provider API key to Vault, or select an existing Vault secret
Set how often evaluations run, using a cron schedule expression
Enable auto evaluation for continuous monitoring
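For the schedule step, the sketch below lists a few expressions in the standard five-field cron format and splits one into its named fields. Whether OpenLIT accepts exactly this five-field form is an assumption here, and `describe_fields` is a hypothetical helper, not part of OpenLIT.

```python
# Standard five-field cron layout (assumed; check what the settings form accepts).
CRON_FIELDS = ("minute", "hour", "day of month", "month", "day of week")

# Illustrative schedules and what they mean in standard cron.
EXAMPLE_SCHEDULES = {
    "*/15 * * * *": "every 15 minutes",
    "0 * * * *": "at the start of every hour",
    "0 2 * * *": "daily at 02:00",
    "0 9 * * 1": "every Monday at 09:00",
}


def describe_fields(expression: str) -> dict:
    """Split a cron expression into named fields, rejecting wrong field counts."""
    parts = expression.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError(
            f"expected {len(CRON_FIELDS)} fields, got {len(parts)}: {expression!r}")
    return dict(zip(CRON_FIELDS, parts))


for expression, meaning in EXAMPLE_SCHEDULES.items():
    print(f"{expression:<14} {meaning}")
```

A less frequent schedule means fewer judge calls and lower cost, at the price of slower feedback on new traces.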

Alternatively, enable evaluations directly from a trace:

Open any LLM trace in the Requests page
Click the Evaluation tab in the trace details
Click Setup Evaluation! to configure settings
Your evaluation configuration will apply to future traces

Monitor & Iterate

Once evaluations are running, OpenLIT continuously analyzes your LLM responses and provides actionable insights:

Review Individual Results: Examine detailed evaluation scores, classifications, and explanations for each LLM trace
Track Quality Trends: Monitor aggregate metrics across time periods and compare performance between different models or versions
Manage Evaluations: Enable, disable, or modify evaluation settings as your application evolves

Detailed results in traces

Go to the Requests page to see all your LLM traces
Click on any LLM trace to view details
Click the Evaluation tab to see evaluation results for that specific trace
Detailed Metrics: Each evaluation shows:
- Score: Numerical score (0-1) indicating the severity or likelihood of the issue
- Classification: Category classification (e.g., “factual_inaccuracy”)
- Explanation: Detailed reasoning from the LLM judge about why this score was given
- Verdict: Simple yes/no determination based on your threshold settings
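The relationship between the Score and Verdict fields can be sketched as a simple threshold comparison. The default threshold of 0.5 and the choice of "at or above the threshold means yes" are assumptions for illustration; the source does not state OpenLIT's exact rule, and `verdict` is a hypothetical helper.

```python
def verdict(score: float, threshold: float = 0.5) -> str:
    """Turn a 0-1 score into a yes/no verdict (assumed rule: score >= threshold)."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be between 0 and 1, got {score}")
    return "yes" if score >= threshold else "no"


print(verdict(0.8))                  # flagged under the assumed default threshold
print(verdict(0.8, threshold=0.9))   # not flagged under a stricter threshold
```

Raising the threshold flags fewer traces and lowers false positives; lowering it catches more borderline cases at the cost of more noise.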

Aggregate statistics in dashboard

Total Hallucination Detected: Count of traces flagged for hallucination issues
Total Bias Detected: Number of traces identified with bias concerns
Total Toxicity Detected: Count of traces containing toxic or harmful content
Detection Rate Trends: Percentage changes and trends over time periods
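As a rough sketch of how such dashboard figures could be derived from per-trace verdicts, the code below counts flagged traces per category and computes a detection rate and its percentage change between two periods. The trace dictionary shape and all function names are hypothetical; OpenLIT's own aggregation may differ.

```python
from collections import Counter
from typing import Optional


def count_detections(traces: list) -> Counter:
    """Count traces flagged per category (hypothetical trace shape)."""
    counts = Counter()
    for trace in traces:
        for category, flagged in trace["verdicts"].items():
            if flagged:
                counts[category] += 1
    return counts


def detection_rate(flagged: int, total: int) -> float:
    """Percentage of traces flagged; 0.0 when there are no traces."""
    return 100.0 * flagged / total if total else 0.0


def percent_change(previous: float, current: float) -> Optional[float]:
    """Relative change between two periods; None when there is no baseline."""
    if previous == 0:
        return None
    return (current - previous) / previous * 100.0


# Made-up sample data, for illustration only.
traces = [
    {"verdicts": {"hallucination": True, "bias": False, "toxicity": False}},
    {"verdicts": {"hallucination": False, "bias": False, "toxicity": True}},
    {"verdicts": {"hallucination": True, "bias": False, "toxicity": False}},
    {"verdicts": {"hallucination": False, "bias": False, "toxicity": False}},
]

counts = count_detections(traces)
rate = detection_rate(counts["hallucination"], len(traces))
print(counts, rate)
```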

Quickstart: LLM Guardrails

Protect and secure your LLM responses in 2 simple steps

Integrations

60+ AI integrations with automatic instrumentation and performance tracking

Create a dashboard

Create custom visualizations with flexible widgets, queries, and real-time AI monitoring
