OpenLIT provides automated evaluation that helps you assess and monitor the quality, safety, and performance of your LLM outputs across development and production environments.

Why evaluations?

Evaluation is crucial for improving the accuracy and robustness of language models, ultimately enhancing user experience and trust in your AI applications. Here are the key benefits:
  • Quality & Safety Assurance: Detect hallucinations, bias, toxicity, safety issues, and ensure consistent, reliable AI outputs
  • Performance Monitoring: Track model performance degradation and measure response quality across different scenarios
  • Risk Mitigation: Catch potential issues before they reach users and ensure compliance with safety standards
  • Cost Optimization: Monitor cost-effectiveness and ROI of different AI configurations and model choices
  • Continuous Improvement: Build data-driven insights for A/B testing, optimization, and iterative development

11 built-in evaluation types

OpenLIT includes 11 evaluation types out of the box. Each can be independently enabled, customized with custom prompts, and linked to Rule Engine rules for conditional evaluation.
| Evaluation Type | Description | Default |
| --- | --- | --- |
| Hallucination | Detects factual inaccuracies, contradictions, and fabricated information | Enabled |
| Bias | Monitors for discriminatory patterns across gender, ethnicity, age, religion | Enabled |
| Toxicity | Screens for harmful, offensive, threatening, or hateful language | Enabled |
| Relevance | Evaluates how well the response addresses the prompt | Disabled |
| Coherence | Assesses logical flow and internal consistency | Disabled |
| Faithfulness | Measures alignment with provided context or source material | Disabled |
| Safety | Detects jailbreak attempts, prompt injection, and unsafe content generation | Disabled |
| Instruction Following | Evaluates adherence to instructions, constraints, and formatting | Disabled |
| Completeness | Assesses whether all parts of the query are fully addressed | Disabled |
| Conciseness | Evaluates whether the response avoids unnecessary verbosity | Disabled |
| Sensitivity | Detects PII leakage, credentials, and confidential data exposure | Disabled |
Context is the source of truth: When context is provided (via the Rule Engine), evaluations judge the LLM response against the provided context — not against real-world knowledge. This enables domain-specific evaluation where your custom knowledge is authoritative.

Custom evaluation types

Beyond the 11 built-in types, you can create custom evaluation types with your own prompts. Navigate to Evaluations > Settings > Evaluation Types and click Create Custom Type. Each custom type requires an id (lowercase with underscores), a label, a description, and an evaluation prompt. Custom types run alongside built-in types in both auto and manual evaluations, and can be linked to Rule Engine rules just like built-in types. Unlike built-in types, custom types can be deleted when no longer needed.
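To illustrate the required fields, a custom evaluation type might look like the following sketch. All field values here are hypothetical examples, not OpenLIT defaults; the actual form is filled in under Evaluations > Settings > Evaluation Types.

```python
import re

# Illustrative shape of a custom evaluation type (values are made up):
custom_type = {
    "id": "brand_voice",  # lowercase with underscores, per the requirement above
    "label": "Brand Voice",
    "description": "Checks that responses match the company's tone guidelines.",
    "prompt": "Rate how well the response follows the brand tone guide.",
}

# A simple check for the id convention (lowercase with underscores):
assert re.fullmatch(r"[a-z_]+", custom_type["id"])
```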

AI evaluation methods

OpenLIT provides automated LLM evaluation and testing capabilities for production AI applications:

Automated LLM-as-a-Judge

Zero-setup AI quality monitoring that automatically evaluates your LLM responses:
  • Production Monitoring: Auto-evaluate every LLM response for quality and safety issues
  • Smart Scheduling: Configure evaluation frequency with cron schedules for cost optimization
  • Real-time Scoring: Instant evaluation results visible in trace details and dashboards
  • Rule Engine Integration: Link evaluation types to rules for conditional evaluation based on trace attributes (model, provider, service, environment)
  • Context-Aware: Evaluations use context from the Rule Engine as ground truth for judging responses
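The flow above can be sketched as follows. This is a minimal illustration of the LLM-as-a-judge pattern, not OpenLIT's internals: the function and prompt wording are hypothetical, and the judge model call is stubbed with a canned reply rather than a real API request.

```python
import json
from typing import Optional

def build_judge_prompt(evaluation_type: str, response: str,
                       context: Optional[str] = None) -> str:
    """Compose the instruction given to the judge model (illustrative wording)."""
    lines = [
        f"You are evaluating an LLM response for: {evaluation_type}.",
        'Reply with JSON: {"score": <0.0-1.0>, "verdict": "<pass|fail>"}.',
    ]
    if context:
        # When context is supplied, it is the ground truth the response is
        # judged against -- not general world knowledge.
        lines.append(f"Judge ONLY against this context:\n{context}")
    lines.append(f"Response to evaluate:\n{response}")
    return "\n\n".join(lines)

def parse_verdict(raw: str) -> dict:
    """Extract the structured verdict from the judge model's reply."""
    verdict = json.loads(raw)
    assert 0.0 <= verdict["score"] <= 1.0
    return verdict

# Usage with a stubbed judge reply (no API call is made here):
prompt = build_judge_prompt(
    "hallucination",
    "Paris is the capital of France.",
    context="Paris is the capital of France.",
)
verdict = parse_verdict('{"score": 0.02, "verdict": "pass"}')
```

The key point the sketch demonstrates is the context rule described earlier: when context is present, the judge prompt instructs the model to score against that context alone.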

Programmatic evaluations

Programmatic AI evaluation tools for custom testing workflows and development pipelines:
  • Code Integration: Call LLM evaluators directly in your application code via the SDK
  • CI/CD Quality Gates: Automated testing for model improvements and regression detection
  • Rule Engine SDK: Use evaluate_rule() to fetch matching contexts and evaluation configs at runtime
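A CI/CD quality gate built on these scores might look like the following sketch. The thresholds, score structure, and function name are illustrative assumptions, not the OpenLIT SDK schema; in practice the scores would come from the evaluators described above.

```python
from typing import Dict, List, Tuple

# Hypothetical per-type failure thresholds (scores above these fail the gate):
FAIL_THRESHOLDS = {"hallucination": 0.3, "toxicity": 0.1, "bias": 0.2}

def quality_gate(scores: Dict[str, float]) -> Tuple[bool, List[str]]:
    """Return (passed, failures) by comparing each score to its threshold."""
    failures = [
        f"{name}: {score:.2f} > {FAIL_THRESHOLDS[name]:.2f}"
        for name, score in scores.items()
        if name in FAIL_THRESHOLDS and score > FAIL_THRESHOLDS[name]
    ]
    return (not failures, failures)

# Usage: a passing run and a failing run.
passed, failures = quality_gate({"hallucination": 0.05, "toxicity": 0.02})
```

In a pipeline, a non-empty `failures` list would typically exit non-zero so the build is blocked before a regression reaches users.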

LLM-as-a-Judge

Use advanced LLMs to evaluate AI application quality with automated scoring

Programmatic evaluations

Quick start guide for implementing custom evaluations in your code

Rule Engine

Define conditional rules to provide context and control which evaluations run