LLM-as-a-Judge is a technique for evaluating the quality of LLM applications by using a capable language model as the evaluator. The LLM judge analyzes your AI outputs and returns structured scores, classifications, and detailed reasoning about response quality, safety, and performance.
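
To make the mechanism concrete, the sketch below shows roughly what a single judge call looks like: the judge model receives the output to evaluate plus a rubric, and replies with a structured result. This is a conceptual illustration only, not OpenLIT's internal implementation; the model name, rubric wording, and use of the OpenAI Python SDK are assumptions made for the example.

```python
# Conceptual sketch of one LLM-as-a-Judge call. OpenLIT performs this kind of call
# for you; the model, rubric wording, and OpenAI SDK usage here are illustrative
# assumptions, not OpenLIT internals.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

RUBRIC = """You are an evaluation judge. Given a CONTEXT and a RESPONSE, score the
RESPONSE for hallucination on a 0-1 scale and answer in JSON with the keys:
score, classification, explanation, verdict ("yes" or "no")."""

completion = client.chat.completions.create(
    model="gpt-4o",
    response_format={"type": "json_object"},
    messages=[
        {"role": "system", "content": RUBRIC},
        {"role": "user", "content": (
            "CONTEXT: Einstein won the 1921 Nobel Prize for the photoelectric effect.\n"
            "RESPONSE: Einstein won the Nobel Prize in 1969 for relativity."
        )},
    ],
)

print(json.loads(completion.choices[0].message.content))
# e.g. {"score": 0.9, "classification": "factual_inaccuracy", "verdict": "yes", ...}
```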

Why use LLM-as-a-Judge?

  • Scalable & Cost-Effective: Evaluate thousands of LLM outputs automatically at a fraction of human evaluation costs
  • Human-Like Quality Assessment: Capture nuanced quality dimensions like helpfulness, safety, and coherence that simple metrics miss
  • Consistent & Reproducible: Apply uniform evaluation criteria across all outputs with repeatable scoring for reliable model comparisons
  • Actionable Insights: Get structured reasoning and detailed explanations for evaluation decisions to systematically improve your AI systems

Built-in evaluators

OpenLIT ships a set of built-in evaluators that score the outputs of your LLM calls; a short SDK sketch follows the descriptions below.

Hallucination detection

Identifies factual inaccuracies, contradictions, and fabricated information in AI responses

Bias detection

Monitors for discriminatory patterns across protected attributes and ensures fair outputs

Toxicity detection

Screens for harmful, offensive, or inappropriate content in AI-generated text

Combined analysis

All-in-one evaluator combining hallucination, bias, and toxicity detection
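
If you prefer to run these evaluators programmatically rather than from the dashboard, the openlit Python SDK exposes them in its evals module. The sketch below assumes that module with a measure() method; the class and parameter names are assumptions based on the SDK docs, so verify them against your installed version. The bias, toxicity, and combined evaluators are expected to follow the same pattern.

```python
# Minimal sketch using the openlit Python SDK's evals module; class and argument
# names are assumptions -- check the SDK reference for the exact API.
import openlit

# Uses an LLM judge from the configured provider (API key read from the environment).
detector = openlit.evals.Hallucination(provider="openai")

result = detector.measure(
    prompt="Discuss Einstein's achievements",
    contexts=["Einstein won the 1921 Nobel Prize for the photoelectric effect."],
    text="Einstein won the Nobel Prize in 1969 for the theory of relativity.",
)
print(result)  # score, classification, explanation, and verdict for this output
```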

Running evaluations

OpenLIT provides two convenient ways to configure evaluation settings.

From the Settings page:
  1. Navigate to Settings → Evaluation Settings
  2. Configure your model provider and the LLM to use as the judge (OpenAI GPT-4, Anthropic Claude, etc.)
  3. Add your LLM provider API key to Vault, or select one of the secrets previously created in Vault
  4. Set the evaluation recurrence as a cron schedule (example schedules follow this list)
  5. Enable auto evaluation for continuous monitoring
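
The schedule uses standard five-field cron syntax (minute, hour, day of month, month, day of week). A few example schedules, shown purely for illustration:

```
0 * * * *     # every hour, on the hour
0 */6 * * *   # every six hours
0 2 * * *     # every day at 02:00
```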

Alternatively, enable evaluations directly from a trace:
  1. Open any LLM trace on the Requests page
  2. Click the Evaluation tab in the trace details
  3. Click Setup Evaluation! to configure settings
  4. The configuration will apply to future traces

Monitor and iterate

Once evaluations are running, OpenLIT continuously analyzes your LLM responses and provides actionable insights:
  • Review Individual Results: Examine detailed evaluation scores, classifications, and explanations for each LLM trace
  • Track Quality Trends: Monitor aggregate metrics across time periods and compare performance between different models or versions
  • Manage Evaluations: Enable, disable, or modify evaluation settings as your application evolves

Detailed results in traces

  1. Go to the Requests page to see all your LLM traces
  2. Click on any LLM trace to view details
  3. Click the Evaluation tab to see evaluation results for that specific trace
  4. Review the detailed metrics; each evaluation shows (see the example result after this list):
    • Score: Numerical score (0-1) indicating the severity or likelihood of the issue
    • Classification: Category label assigned by the judge (e.g., "factual_inaccuracy")
    • Explanation: Detailed reasoning from the LLM judge about why this score was given
    • Verdict: Simple yes/no determination based on your threshold settings
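
Put together, a single evaluation result shown in the Evaluation tab has roughly this shape. The values below are invented for illustration, and the exact field names in the stored payload may differ:

```python
# Illustrative shape of one evaluation result; all values are invented.
{
    "evaluation": "Hallucination",
    "score": 0.82,                           # 0-1 severity/likelihood
    "classification": "factual_inaccuracy",  # category assigned by the judge
    "explanation": "The response gives 1969 as the prize year; the context says 1921.",
    "verdict": "yes",                        # exceeds the configured threshold
}
```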

Aggregate statistics in the dashboard

  • Total Hallucination Detected: Count of traces flagged for hallucination issues
  • Total Bias Detected: Number of traces identified with bias concerns
  • Total Toxicity Detected: Count of traces containing toxic or harmful content
  • Detection Rate Trends: Percentage of evaluated traces flagged in each period, and how that rate changes over time (sketched below)
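
For clarity on how a detection-rate trend is derived, the arithmetic is sketched below with made-up counts; OpenLIT computes these figures for you in the dashboard.

```python
# Made-up counts, purely to illustrate the arithmetic behind a detection-rate trend.
flagged_this_period, total_this_period = 42, 1000   # traces flagged / traces evaluated
flagged_prev_period, total_prev_period = 30, 800

rate_now = flagged_this_period / total_this_period          # 0.042  -> 4.2%
rate_prev = flagged_prev_period / total_prev_period         # 0.0375 -> 3.75%
relative_change = (rate_now - rate_prev) / rate_prev * 100  # +12% period over period

print(f"Detection rate {rate_now:.1%} ({relative_change:+.0f}% vs. previous period)")
```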