Why evaluations?
Evaluation is crucial for improving the accuracy and robustness of language models, ultimately enhancing user experience and trust in your AI applications. Here are the key benefits:
- Quality & Safety Assurance: Detect hallucinations, bias, toxicity, and safety issues, and ensure consistent, reliable AI outputs
- Performance Monitoring: Track model performance degradation and measure response quality across different scenarios
- Risk Mitigation: Catch potential issues before they reach users and ensure compliance with safety standards
- Cost Optimization: Monitor cost-effectiveness and ROI of different AI configurations and model choices
- Continuous Improvement: Build data-driven insights for A/B testing, optimization, and iterative development
11 built-in evaluation types
OpenLIT includes 11 evaluation types out of the box. Each can be independently enabled, customized with custom prompts, and linked to Rule Engine rules for conditional evaluation.

| Evaluation Type | Description | Default |
|---|---|---|
| Hallucination | Detects factual inaccuracies, contradictions, and fabricated information | Enabled |
| Bias | Monitors for discriminatory patterns across gender, ethnicity, age, religion | Enabled |
| Toxicity | Screens for harmful, offensive, threatening, or hateful language | Enabled |
| Relevance | Evaluates how well the response addresses the prompt | Disabled |
| Coherence | Assesses logical flow and internal consistency | Disabled |
| Faithfulness | Measures alignment with provided context or source material | Disabled |
| Safety | Detects jailbreak attempts, prompt injection, and unsafe content generation | Disabled |
| Instruction Following | Evaluates adherence to instructions, constraints, and formatting | Disabled |
| Completeness | Assesses whether all parts of the query are fully addressed | Disabled |
| Conciseness | Evaluates whether the response avoids unnecessary verbosity | Disabled |
| Sensitivity | Detects PII leakage, credentials, and confidential data exposure | Disabled |
Context is the source of truth: When context is provided (via the Rule Engine), evaluations judge the LLM response against the provided context — not against real-world knowledge. This enables domain-specific evaluation where your custom knowledge is authoritative.
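As a sketch of how context-grounded judging works conceptually, the judge can be instructed to treat the supplied context as the sole source of truth. The prompt template and `build_judge_prompt` helper below are illustrative, not OpenLIT's internal code:

```python
# Illustrative sketch: an LLM-as-a-judge prompt that grounds its verdict in
# Rule Engine context rather than real-world knowledge. The template and the
# build_judge_prompt helper are hypothetical, not OpenLIT internals.

JUDGE_TEMPLATE = """You are an evaluation judge. Treat the CONTEXT below as the
sole source of truth; do not rely on outside knowledge.

CONTEXT:
{context}

RESPONSE TO EVALUATE:
{response}

Verdict: does the response contradict or fabricate beyond the context?"""

def build_judge_prompt(context: str, response: str) -> str:
    """Assemble a context-grounded judge prompt."""
    return JUDGE_TEMPLATE.format(context=context, response=response)

prompt = build_judge_prompt(
    context="Our refund window is 45 days.",
    response="You can get a refund within 30 days.",
)
```

With domain context like this, a judge flags "30 days" as a contradiction of the provided context even though it might be plausible real-world knowledge.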
Custom evaluation types
Beyond the 11 built-in types, you can create custom evaluation types with your own prompts. Navigate to Evaluations > Settings > Evaluation Types and click Create Custom Type. Each custom type requires an id (lowercase with underscores), a label, a description, and an evaluation prompt. Custom types run alongside built-in types in both auto and manual evaluations, and can be linked to Rule Engine rules just like built-in types. Unlike built-in types, custom types can be deleted when no longer needed.

AI evaluation methods
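For illustration, a custom type's required fields might be represented and validated as follows. The field names mirror what the UI asks for; the dict shape and validation helper are assumptions, not the stored schema:

```python
import re

# Hypothetical representation of a custom evaluation type, mirroring the four
# required fields: id, label, description, and evaluation prompt.
custom_type = {
    "id": "brand_voice",  # lowercase with underscores
    "label": "Brand Voice",
    "description": "Checks that responses match the company's tone guidelines.",
    "prompt": "Rate whether this response follows the brand tone guide: {response}",
}

def is_valid_type_id(type_id: str) -> bool:
    """Enforce the stated id format: lowercase letters/digits joined by underscores."""
    return re.fullmatch(r"[a-z][a-z0-9_]*", type_id) is not None
```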
OpenLIT provides automated LLM evaluation and testing capabilities for production AI applications:

Automated LLM-as-a-Judge
Zero-setup AI quality monitoring that automatically evaluates your LLM responses:
- Production Monitoring: Auto-evaluate every LLM response for quality and safety issues
- Smart Scheduling: Configure evaluation frequency with cron schedules for cost optimization
- Real-time Scoring: Instant evaluation results visible in trace details and dashboards
- Rule Engine Integration: Link evaluation types to rules for conditional evaluation based on trace attributes (model, provider, service, environment)
- Context-Aware: Evaluations use context from the Rule Engine as ground truth for judging responses
Programmatic evaluations
Programmatic AI evaluation tools for custom testing workflows and development pipelines:
- Code Integration: Call LLM evaluators directly in your application code via the SDK
- CI/CD Quality Gates: Automated testing for model improvements and regression detection
- Rule Engine SDK: Use `evaluate_rule()` to fetch matching contexts and evaluation configs at runtime
LLM-as-a-Judge
Use advanced LLMs to evaluate AI application quality with automated scoring
Programmatic evaluations
Quick start guide for implementing custom evaluations in your code
Rule Engine
Define conditional rules to provide context and control which evaluations run

