Overview

The OpenLIT SDK provides tools to evaluate AI-generated text, ensuring it aligns with desired standards and identifying issues like hallucinations, bias, and toxicity. We offer four main evaluation modules:

Evaluations

Hallucination Detection

Evaluates text for inaccuracies compared to the given context or factual information, identifying instances where the generated content diverges from the truth.

Usage

Use LLM-based detection with providers like OpenAI or Anthropic:
import openlit

# Optionally, set your API key as an environment variable
import os
os.environ["OPENAI_API_KEY"] = "<YOUR_API_KEY>"

# Initialize the hallucination detector
hallucination_detector = openlit.evals.Hallucination(provider="openai")

# Measure hallucination in text
result = hallucination_detector.measure(
    prompt="Discuss Einstein's achievements",
    contexts=["Einstein discovered the photoelectric effect."],
    text="Einstein won the Nobel Prize in 1969 for the theory of relativity."
)

Supported Providers and LLMs

OpenAI: GPT-4o, GPT-4o mini
Anthropic: Claude 3.5 Sonnet, Claude 3.5 Haiku, Claude 3 Opus

Parameters

These parameters are used to set up the Hallucination class:
Name | Description | Default Value | Example Value
provider | The name of the LLM provider, either "openai" or "anthropic". | "openai" | "openai"
api_key | API key for LLM authentication, set via the OPENAI_API_KEY or ANTHROPIC_API_KEY environment variable. | None | os.getenv("OPENAI_API_KEY")
model | Specific model to use with the LLM provider (optional). | None | "gpt-4o"
base_url | Base URL for the LLM API (optional). | None | "https://api.openai.com/v1"
custom_categories | Additional categories for detection (optional). | None | {"custom_category": "Custom description"}
threshold_score | Score above which the verdict is "yes" (indicating hallucination). | 0.5 | 0.7
collect_metrics | Enable metrics collection. | False | True

These parameters are passed when you call the measure method to analyze a specific text:

Name | Description | Example Value
prompt | The prompt provided by the user. | "Discuss Einstein's achievements"
contexts | A list of context sentences relevant to the task. | ["Einstein discovered the photoelectric effect."]
text | The text to analyze for hallucination. | "Einstein won the Nobel Prize in 1969 for the theory of relativity."
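
As a reference, here is a minimal sketch that wires the optional constructor parameters above into a single setup; the model name, custom category, and threshold below are illustrative values, not defaults:

import os
import openlit

# Sketch: a fully configured detector using the optional parameters documented
# above. The custom category name/description and the threshold are illustrative.
hallucination_detector = openlit.evals.Hallucination(
    provider="openai",
    api_key=os.getenv("OPENAI_API_KEY"),
    model="gpt-4o",                  # optional: pin a specific model
    custom_categories={"outdated_information": "Claims that rely on outdated facts."},
    threshold_score=0.7,             # verdict becomes "yes" above this score
    collect_metrics=True,            # enable metrics collection
)

result = hallucination_detector.measure(
    prompt="Discuss Einstein's achievements",
    contexts=["Einstein discovered the photoelectric effect."],
    text="Einstein won the Nobel Prize in 1969 for the theory of relativity.",
)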

Classification Categories

Category | Definition
factual_inaccuracy | Incorrect facts, e.g., Context: ["Paris is the capital of France."]; Text: "Lyon is the capital."
nonsensical_response | Irrelevant info, e.g., Context: ["Discussing music trends."]; Text: "Golf uses clubs on grass."
gibberish | Nonsensical text, e.g., Context: ["Discuss advanced algorithms."]; Text: "asdas asdhasudqoiwjopakcea."
contradiction | Conflicting info, e.g., Context: ["Einstein was born in 1879."]; Text: "Einstein was born in 1875 and 1879."

How it Works

  1. Input Gathering: Collects prompt, contexts, and text to be analyzed.
  2. Prompt Creation: Constructs a system prompt using the specified context and categories.
  3. LLM Evaluation: Utilizes the LLM to evaluate the alignment of the text with provided contexts and detect hallucinations.
  4. JSON Output: Provides results with a score, evaluation (“hallucination”), classification, explanation, and verdict (“yes” or “no”).

JSON Output:

The JSON object returned includes:
{
  "score": "float",
  "evaluation": "hallucination",
  "classification": "CATEGORY_OF_HALLUCINATION or none",
  "explanation": "Very short one-sentence reason",
  "verdict": "yes or no"
}
  • Score: Represents the level of hallucination detected.
  • Evaluation: Specifies the type of evaluation conducted (“hallucination”).
  • Classification: Indicates the detected type of hallucination or “none” if none found.
  • Explanation: Provides a brief reason for the classification.
  • Verdict: “yes” if detected (score above threshold), “no” otherwise.
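
In practice, the verdict is usually what a caller acts on. A minimal sketch, assuming the object returned by measure() exposes the documented fields as attributes (if your version returns a plain dict, index into it instead):

# Sketch: gate downstream behavior on the evaluation verdict.
result = hallucination_detector.measure(
    prompt="Discuss Einstein's achievements",
    contexts=["Einstein discovered the photoelectric effect."],
    text="Einstein won the Nobel Prize in 1969 for the theory of relativity.",
)

# Assumes attribute access to the documented fields; adapt if a dict is returned.
if result.verdict == "yes":
    print(f"Hallucination ({result.classification}, score={result.score}): {result.explanation}")
else:
    print("No hallucination detected.")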

Bias Detection

Identifies and evaluates instances of bias in text generated by AI models. This module leverages large language model (LLM) analysis to help ensure fair and unbiased outputs.

Usage

Use LLM-based detection with providers like OpenAI or Anthropic:
import openlit

# Optionally, set your API key as an environment variable
import os
os.environ["OPENAI_API_KEY"] = "<YOUR_API_KEY>"

# Initialize the bias detector
bias_detector = openlit.evals.BiasDetector(provider="openai")

# Measure bias in text
result = bias_detector.measure(
    prompt="Discuss workplace equality.",
    contexts=["Everyone should have equal opportunity regardless of background."],
    text="Older employees tend to struggle with new technology."
)

Supported Providers and LLMs

OpenAI: GPT-4o, GPT-4o mini
Anthropic: Claude 3.5 Sonnet, Claude 3.5 Haiku, Claude 3 Opus
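
To run the same check through Anthropic instead, switch the provider and set the corresponding key; a minimal sketch (the Claude model id shown is illustrative, and the model argument is optional):

import os
import openlit

# Sketch: the same detector backed by Anthropic. Set ANTHROPIC_API_KEY in your
# environment (or pass api_key explicitly); the model id below is illustrative.
os.environ["ANTHROPIC_API_KEY"] = "<YOUR_API_KEY>"

bias_detector = openlit.evals.BiasDetector(
    provider="anthropic",
    model="claude-3-5-sonnet-20240620",  # optional
)

result = bias_detector.measure(
    prompt="Discuss workplace equality.",
    contexts=["Everyone should have equal opportunity regardless of background."],
    text="Older employees tend to struggle with new technology.",
)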

Parameters

These parameters are used to set up the BiasDetector class:
Name | Description | Default Value | Example Value
provider | The name of the LLM provider, either "openai" or "anthropic". | "openai" | "openai"
api_key | API key for LLM authentication, set via the OPENAI_API_KEY or ANTHROPIC_API_KEY environment variable. | None | os.getenv("OPENAI_API_KEY")
model | Specific model to use with the LLM provider (optional). | None | "gpt-4o"
base_url | Base URL for the LLM API (optional). | None | "https://api.openai.com/v1"
custom_categories | Additional categories for detection (optional). | None | {"custom_category": "Custom description"}
threshold_score | Score above which the verdict is "yes" (indicating bias). | 0.5 | 0.6
collect_metrics | Enable metrics collection. | False | True

These parameters are passed when you call the measure method to analyze a specific text:

Name | Description | Example Value
prompt | The prompt provided by the user. | "Discuss workplace equality."
contexts | A list of context sentences relevant to the task. | ["Everyone should have equal opportunity regardless of background."]
text | The text to analyze for bias. | "Older employees tend to struggle with new technology."

Classification Categories

Category | Definition
sexual_orientation | Biases or assumptions about an individual's sexual preferences.
age | Biases related to the age of an individual.
disability | Biases or stereotypes concerning individuals with disabilities.
physical_appearance | Biases based on the physical look of an individual.
religion | Biases or prejudices connected to a person's religious beliefs.
pregnancy_status | Biases towards individuals who are pregnant or have children.
marital_status | Biases related to whether someone is single, married, divorced, etc.
nationality / location | Biases associated with an individual's country or place of origin.
gender | Biases related to an individual's gender.
ethnicity | Assumptions or stereotypes based on racial or ethnic background.
socioeconomic_status | Biases regarding an individual's economic and social position.

How it Works

  1. Input Gathering: Collects prompt, contexts, and text to be analyzed.
  2. Prompt Creation: Constructs a system prompt using the specified context and categories.
  3. LLM Evaluation: Utilizes the LLM to evaluate the text against the provided contexts and detect any bias.
  4. JSON Output: Provides results with a score, evaluation (“bias_detection”), classification, explanation, and verdict (“yes” or “no”).

JSON Output:

The JSON object returned includes:
{
  "score": "float",
  "evaluation": "bias_detection",
  "classification": "CATEGORY_OF_BIAS or none",
  "explanation": "Very short one-sentence reason",
  "verdict": "yes or no"
}
  • Score: Represents the level of bias detected.
  • Evaluation: Specifies the type of evaluation conducted (“bias_detection”).
  • Classification: Indicates the detected type of bias or “none” if none found.
  • Explanation: Provides a brief reason for the classification.
  • Verdict: “yes” if bias detected (score above threshold), “no” otherwise.

Toxicity Detection

Evaluates AI-generated text for harmful or offensive language, helping keep interactions respectful and appropriate. This module uses large language model (LLM) analysis to detect and flag toxic content.

Usage

Use LLM-based detection with providers like OpenAI or Anthropic:
import openlit

# Optionally, set your API key as an environment variable
import os
os.environ["OPENAI_API_KEY"] = "<YOUR_API_KEY>"

# Initialize the toxicity detector
toxicity_detector = openlit.evals.ToxicityDetector(provider="openai")

# Measure toxicity in text
result = toxicity_detector.measure(
    prompt="Engage in a respectful discussion about global events.",
    contexts=["Conversations should remain civil and informative."],
    text="Your opinion is absurd, and only an idiot would think that."
)

Supported Providers and LLMs

OpenAI: GPT-4o, GPT-4o mini
Anthropic: Claude 3.5 Sonnet, Claude 3.5 Haiku, Claude 3 Opus

Parameters

These parameters are used to set up the ToxicityDetector class:
Name | Description | Default Value | Example Value
provider | The name of the LLM provider, either "openai" or "anthropic". | "openai" | "openai"
api_key | API key for LLM authentication, set via the OPENAI_API_KEY or ANTHROPIC_API_KEY environment variable. | None | os.getenv("OPENAI_API_KEY")
model | Specific model to use with the LLM provider (optional). | None | "gpt-4o"
base_url | Base URL for the LLM API (optional). | None | "https://api.openai.com/v1"
custom_categories | Additional categories for detection (optional). | None | {"custom_category": "Custom description"}
threshold_score | Score above which the verdict is "yes" (indicating toxicity). | 0.5 | 0.6
collect_metrics | Enable metrics collection. | False | True

These parameters are passed when you call the measure method to analyze a specific text:

Name | Description | Example Value
prompt | The prompt provided by the user. | "Engage in a respectful discussion about global events."
contexts | A list of context sentences relevant to the task. | ["Conversations should remain civil and informative."]
text | The text to analyze for toxicity. | "Your opinion is absurd, and only an idiot would think that."
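
Since measure() scores one text at a time, screening several candidate responses is a simple loop; a minimal sketch (the candidate strings are illustrative, and the result is assumed to expose the documented verdict field as an attribute):

# Sketch: screen a batch of candidate responses and keep only the non-toxic ones.
candidates = [
    "I see your point, but here is another perspective to consider.",
    "Your opinion is absurd, and only an idiot would think that.",
]

safe_responses = []
for text in candidates:
    result = toxicity_detector.measure(
        prompt="Engage in a respectful discussion about global events.",
        contexts=["Conversations should remain civil and informative."],
        text=text,
    )
    if result.verdict == "no":  # assumes attribute access to the documented fields
        safe_responses.append(text)

print(f"{len(safe_responses)} of {len(candidates)} responses passed the toxicity check.")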

Classification Categories

Category | Definition
threat | Language that threatens harm or danger to others, e.g., "You'll regret saying things like that."
dismissive | Dismissive or belittling language, e.g., "Your opinion is worthless."
hate | Hateful or intensely negative language, e.g., "Only an idiot would think this."
mockery | Mocking or sarcastic tone, e.g., "Brilliant observation! Did it take you all day to come up with that?"
personal_attack | Language attacking a person's character or abilities, e.g., "You're clueless."

How it Works

  1. Input Gathering: Collects prompt, contexts, and text to be analyzed.
  2. Prompt Creation: Constructs a system prompt using the specified context and categories.
  3. LLM Evaluation: Utilizes the LLM to evaluate the text against the provided contexts and detect any toxic language.
  4. JSON Output: Provides results with a score, evaluation (“toxicity_detection”), classification, explanation, and verdict (“yes” or “no”).

JSON Output:

The JSON object returned includes:
{
  "score": "float",
  "evaluation": "toxicity_detection",
  "classification": "CATEGORY_OF_TOXICITY or none",
  "explanation": "Very short one-sentence reason",
  "verdict": "yes or no"
}
  • Score: Represents the level of toxicity detected.
  • Evaluation: Specifies the type of evaluation conducted (“toxicity_detection”).
  • Classification: Indicates the detected type of toxicity or “none” if none found.
  • Explanation: Provides a brief reason for the classification.
  • Verdict: “yes” if toxicity detected (score above threshold), “no” otherwise.

All Evaluations

Combines the capabilities of bias, toxicity, and hallucination detection to provide a comprehensive analysis of AI-generated text. This module helps ensure that interactions are accurate, respectful, and free from bias.

Usage

Use LLM-based detection with providers like OpenAI or Anthropic:
import openlit

# Optionally, set your API key as an environment variable
import os
os.environ["OPENAI_API_KEY"] = "<YOUR_API_KEY>"

# Initialize the all evaluations detector
all_evals_detector = openlit.evals.All(provider="openai")

# Measure issues in text
result = all_evals_detector.measure(
    prompt="Discuss the achievements of scientists.",
    contexts=["Einstein discovered the photoelectric effect, contributing to quantum physics."],
    text="Einstein won the Nobel Prize in 1969 for discovering black holes."
)

Supported Providers and LLMs

OpenAI: GPT-4o, GPT-4o mini
Anthropic: Claude 3.5 Sonnet, Claude 3.5 Haiku, Claude 3 Opus

Parameters

These parameters are used to set up the All class:
Name | Description | Default Value | Example Value
provider | The name of the LLM provider, either "openai" or "anthropic". | "openai" | "openai"
api_key | API key for LLM authentication, set via the OPENAI_API_KEY or ANTHROPIC_API_KEY environment variable. | None | os.getenv("OPENAI_API_KEY")
model | Specific model to use with the LLM provider (optional). | None | "gpt-4o"
base_url | Base URL for the LLM API (optional). | None | "https://api.openai.com/v1"
custom_categories | Additional categories for detection (optional). | None | {"custom_category": "Custom description"}
threshold_score | Score above which the verdict is "yes" (indicating an issue). | 0.5 | 0.6
collect_metrics | Enable metrics collection. | False | True

These parameters are passed when you call the measure method to analyze a specific text:

Name | Description | Example Value
prompt | The prompt provided by the user. | "Discuss the achievements of scientists."
contexts | A list of context sentences relevant to the task. | ["Einstein discovered the photoelectric effect."]
text | The text to analyze for bias, toxicity, or hallucination. | "Einstein won the Nobel Prize in 1969 for discovering black holes."

Classification Categories

Bias categories:

Category | Definition
sexual_orientation | Biases or assumptions about an individual's sexual preferences.
age | Biases related to the age of an individual.
disability | Biases or stereotypes concerning individuals with disabilities.
physical_appearance | Biases based on the physical look of an individual.
religion | Biases or prejudices connected to a person's religious beliefs.
pregnancy_status | Biases towards individuals who are pregnant or have children.
marital_status | Biases related to whether someone is single, married, divorced, etc.
nationality / location | Biases associated with an individual's country or place of origin.
gender | Biases related to an individual's gender.
ethnicity | Assumptions or stereotypes based on racial or ethnic background.
socioeconomic_status | Biases regarding an individual's economic and social position.

Toxicity categories:

Category | Definition
threat | Language that threatens harm or danger to others.
dismissive | Dismissive or belittling language.
hate | Hateful or intensely negative language.
mockery | Mocking or sarcastic tone.
personal_attack | Language attacking a person's character or abilities.

Hallucination categories:

Category | Definition
factual_inaccuracy | Incorrect facts.
nonsensical_response | Irrelevant info.
gibberish | Nonsensical text.
contradiction | Conflicting info.

How it Works

  1. Input Gathering: Collects prompt, contexts, and text to be analyzed.
  2. Prompt Creation: Constructs a system prompt encompassing bias, toxicity, and hallucination categories.
  3. LLM Evaluation: Utilizes the LLM to evaluate the text against the provided contexts and detect any bias, toxicity, or hallucination.
  4. JSON Output: Provides results with a score, evaluation (highest risk type), classification, explanation, and verdict (“yes” or “no”).

JSON Output:

The JSON object returned includes:
{
  "score": "float",
  "evaluation": "evaluation_type",
  "classification": "CATEGORY or none",
  "explanation": "Very short one-sentence reason",
  "verdict": "yes or no"
}
  • Score: Represents the level of detected issue.
  • Evaluation: Indicates the type of evaluation conducted (bias, toxicity, or hallucination).
  • Classification: Indicates the detected type or “none” if none found.
  • Explanation: Provides a brief reason for the classification.
  • Verdict: “yes” if detected (score above threshold), “no” otherwise.
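
Because the evaluation field names which check was triggered, a caller can branch on it. A minimal sketch, assuming attribute access to the documented fields and that the evaluation values match the per-module strings documented above ("hallucination", "bias_detection", "toxicity_detection"):

# Sketch: route on which evaluation the combined detector flagged.
result = all_evals_detector.measure(
    prompt="Discuss the achievements of scientists.",
    contexts=["Einstein discovered the photoelectric effect, contributing to quantum physics."],
    text="Einstein won the Nobel Prize in 1969 for discovering black holes.",
)

if result.verdict == "yes":
    # Assumption: these evaluation strings mirror the per-module values above.
    if result.evaluation == "hallucination":
        print(f"Hallucination ({result.classification}): {result.explanation}")
    elif result.evaluation == "bias_detection":
        print(f"Bias ({result.classification}): {result.explanation}")
    elif result.evaluation == "toxicity_detection":
        print(f"Toxicity ({result.classification}): {result.explanation}")
else:
    print("No issues detected.")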

Kubernetes

Running in Kubernetes? Try the OpenLIT Operator, which automatically injects instrumentation into existing workloads without modifying pod specs, container images, or application code.