Why use LLM-as-a-Judge?
- Scalable & Cost-Effective: Evaluate thousands of LLM outputs automatically at a fraction of human evaluation costs
- Human-Like Quality Assessment: Capture nuanced quality dimensions like helpfulness, safety, and coherence that simple metrics miss
- Consistent & Reproducible: Apply uniform evaluation criteria across all outputs with repeatable scoring for reliable model comparisons
- Actionable Insights: Get structured reasoning and detailed explanations for evaluation decisions to systematically improve your AI systems
Built-in evaluators
OpenLIT provides a set of built-in evaluators you can use to score the output of your LLM calls; a programmatic sketch follows the list below.
Hallucination detection
Identifies factual inaccuracies, contradictions, and fabricated information in AI responses
Bias detection
Monitors for discriminatory patterns across protected attributes and ensures fair outputs
Toxicity detection
Screens for harmful, offensive, or inappropriate content in AI-generated text
Combined analysis
All-in-one evaluator combining hallucination, bias, and toxicity detection
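If you also want to run these checks from code, the openlit Python SDK ships evaluators that mirror the ones above. The snippet below is a minimal sketch assuming the SDK's evals module and its measure() method; exact class names, parameters, and result fields may differ, so confirm against the OpenLIT SDK documentation.

```python
import openlit

# Assumed API: openlit.evals is expected to also expose Bias, Toxicity, and All evaluators.
# The judge model's API key is assumed to come from the provider's usual environment
# variable (e.g., OPENAI_API_KEY) or an explicit api_key argument.
detector = openlit.evals.Hallucination(provider="openai")

result = detector.measure(
    prompt="When did Einstein win the Nobel Prize?",
    contexts=["Einstein won the Nobel Prize in Physics in 1921."],
    text="Einstein won the Nobel Prize in 1922.",
)

# The result is expected to carry a score, classification, explanation, and verdict,
# matching the fields shown in the Evaluation tab.
print(result)
```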
Running evaluations
OpenLIT provides two convenient ways to configure evaluation settings.
From the Evaluation Settings page:
- Navigate to Settings → Evaluation Settings
- Configure your model provider and the LLM to use as a judge (OpenAI GPT-4, Anthropic Claude, etc.)
- Add your LLM provider API key to Vault, or select one of your previously created Vault secrets
- Set the evaluation recurrence using cron schedule format (see the examples after this list)
- Enable auto evaluation for continuous monitoring
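Cron schedules use the standard five-field layout (minute, hour, day of month, month, day of week). The expressions below are ordinary cron examples; whether OpenLIT accepts every five-field variant is an assumption, so verify your expression in Evaluation Settings.

```
0 * * * *    # run evaluations at the top of every hour
0 */6 * * *  # run every 6 hours
0 0 * * *    # run once a day at midnight
```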

Or, from an individual trace:
- Open any LLM trace in the Requests page
- Click the Evaluation tab in the trace details
- Click Setup Evaluation! to configure settings
- Your evaluation configuration will apply to future traces
Monitor & Iterate
Once evaluations are running, OpenLIT continuously analyzes your LLM responses and provides actionable insights:
- Review Individual Results: Examine detailed evaluation scores, classifications, and explanations for each LLM trace
- Track Quality Trends: Monitor aggregate metrics across time periods and compare performance between different models or versions
- Manage Evaluations: Enable, disable, or modify evaluation settings as your application evolves
Detailed results in traces

- Go to the Requests page to see all your LLM traces
- Click on any LLM trace to view details
- Click the Evaluation tab to see evaluation results for that specific trace
- Detailed Metrics: Each evaluation shows the following (see the example after this list):
  - Score: Numerical score (0-1) indicating the severity or likelihood of the issue
  - Classification: Category classification (e.g., “factual_inaccuracy”)
  - Explanation: Detailed reasoning from the LLM judge about why this score was given
  - Verdict: Simple yes/no determination based on your threshold settings
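For illustration, a single evaluation result might look like the sketch below. The field names and values are assumed here purely to show how score, classification, explanation, and verdict fit together; the actual layout in OpenLIT may differ.

```python
# Hypothetical shape of one evaluation result (field names assumed for clarity).
example_result = {
    "evaluation": "Hallucination",
    "classification": "factual_inaccuracy",
    "score": 0.82,  # 0-1: severity or likelihood of the issue
    "explanation": "The response states 1922, but the provided context says 1921.",
    "verdict": "yes",  # score exceeds the configured threshold
}
```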
Aggregate statistics in dashboard

- Total Hallucination Detected: Count of traces flagged for hallucination issues
- Total Bias Detected: Number of traces identified with bias concerns
- Total Toxicity Detected: Count of traces containing toxic or harmful content
- Detection Rate Trends: Percentage changes and trends over time periods (see the sketch below)
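How OpenLIT derives these dashboard figures is not spelled out in this section; as a rough mental model, a detection rate is simply the share of evaluated traces flagged in a period, and the trend is the change between periods. The sketch below illustrates that arithmetic under those assumptions.

```python
# Hypothetical arithmetic behind a detection rate and its period-over-period trend.
def detection_rate(flagged: int, total: int) -> float:
    """Percentage of evaluated traces flagged for an issue."""
    return 100.0 * flagged / total if total else 0.0

current = detection_rate(flagged=12, total=400)   # 3.0%
previous = detection_rate(flagged=20, total=500)  # 4.0%
change = current - previous                       # -1.0 percentage point (improving)
print(f"{current:.1f}% detection rate ({change:+.1f} pp vs previous period)")
```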