Overview
This topic explains how to use offline evaluations to validate AI Config variations before releasing them to production. Offline evaluations provide a repeatable workflow for testing prompt and model changes using datasets with known inputs and expected outputs. These evaluations help you make informed decisions before changes impact end users. By running variations on the same dataset and evaluation criteria, you can compare performance, detect regressions, and validate improvements with confidence.
Offline evaluations help you:
- Compare prompt and model variations using consistent inputs
- Identify regressions before changes reach production
- Measure quality using standardized criteria and judges
- Review aggregate scores and row-level results to understand performance
- Decide whether a variation is ready for rollout
Offline and online evaluations
Offline and online evaluations serve different purposes. Offline evaluations:
- Run before deployment
- Use datasets with known inputs and expected outputs
- Evaluate variations in a controlled environment
- Help validate changes before rollout
Online evaluations:
- Run in production on live user traffic
- Evaluate responses using attached judges
- Score responses continuously
- Help monitor performance after release
How offline evaluations work
Offline evaluations use AI Config variations to generate and evaluate outputs for a dataset. Each row in the dataset represents a single evaluation task. This approach lets you evaluate many inputs at once and understand how a variation performs across a consistent set of scenarios.
For each input, LaunchDarkly:
- Generates a model output
- Evaluates the output using selected criteria or judges
- Records structured results
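The per-row flow above can be sketched in Python. This is a minimal illustration of the generate-evaluate-record loop, not LaunchDarkly's implementation: `generate` and `judge` are hypothetical stand-ins for the model call and the evaluation criteria.

```python
# Illustrative sketch of the per-row offline evaluation flow.
# `generate` and `judge` are stand-ins, not LaunchDarkly APIs.

def generate(variation: dict, row_input: str) -> str:
    # Stand-in for calling the model configured by the AI Config variation.
    return f"{variation['model']} response to: {row_input}"

def judge(output: str, expected) -> dict:
    # Stand-in for an evaluation criterion; returns a score and reasoning.
    score = 1.0 if expected and expected in output else 0.0
    return {"score": score, "reasoning": "exact-match check (illustrative)"}

def evaluate_dataset(variation: dict, rows: list) -> list:
    results = []
    for row in rows:
        output = generate(variation, row["input"])           # 1. generate a model output
        verdict = judge(output, row.get("expected_output"))  # 2. evaluate using criteria or judges
        results.append({"input": row["input"], "output": output, **verdict})  # 3. record structured results
    return results
```

Each row is independent of the others, which is why (as described later in this topic) rows can be processed asynchronously and reported incrementally.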
Example evaluation result
An evaluation result for a single dataset row includes a score and reasoning returned by the evaluation criteria or judge.
Configure an offline evaluation
Configure an offline evaluation to define the dataset, AI Config variation, model, evaluation criteria, and execution settings such as sampling and runtime limits. Use this configuration to control what inputs are tested and how outputs are evaluated. For example, you can use sampling to test a subset of inputs, run a preview to validate your setup, and define thresholds that reflect your quality standards.
To access and configure an offline evaluation:
- Navigate to your project.
- In the left navigation, expand AI, then select Playgrounds.
- Create or open an evaluation in the playground. For detailed steps, read Create and manage evaluations.
- Configure the evaluation, including selecting a dataset and AI Config variation.
- Configure evaluation criteria and thresholds in the Acceptance criteria panel.
- Choose how dataset rows are sampled using the row selection controls.
- (Optional) Run a preview on a subset of rows.
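Row sampling for a preview run can be approximated as follows. This is a hedged sketch of the idea only; the `sample_rows` helper is hypothetical, not a LaunchDarkly API, and the product's actual row selection controls may differ.

```python
import random

def sample_rows(rows: list, fraction: float, seed: int = 0) -> list:
    """Return a reproducible random subset of dataset rows for a preview run.

    Hypothetical helper: illustrates sampling a fraction of rows so a
    preview evaluates fewer inputs than a full run.
    """
    k = max(1, round(len(rows) * fraction))
    return random.Random(seed).sample(rows, k)
```

Using a fixed seed keeps preview runs comparable across variations, since each variation is evaluated against the same sampled subset.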
Dataset requirements
Offline evaluations use uploaded datasets as input for AI Config variations. To learn how to prepare and upload datasets, read Datasets in AI Configs. Datasets must be in CSV or JSONL format and can include expected output, variables, and metadata. Use datasets to validate outputs by comparing them to expected results.
The dataset schema includes:
- input: Prompt or request used to generate a model response. Accepts a string or JSON object.
- expected_output: Optional. Ideal output used for comparison or scoring. Accepts a string or JSON object.
- variables: Optional. Named placeholders used in prompt templates. These are substituted into message templates. Example: {{variable_name}}
- metadata: Optional. JSON object of arbitrary key-value data attached to a row for tracking, filtering, or reporting. Metadata is stored alongside results but is not used in generation or evaluation. Example: {"source": "production", "category": "factual"}
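As a sketch of what a JSONL dataset following this schema might contain, the snippet below writes one illustrative row and reads it back. The field names follow the schema above; the values are made up for demonstration.

```python
import json

# Illustrative dataset rows following the schema above
# (input, expected_output, variables, metadata); values are invented.
rows = [
    {
        "input": "What is the capital of France?",
        "expected_output": "Paris",
        "variables": {"tone": "concise"},
        "metadata": {"source": "production", "category": "factual"},
    },
]

# JSONL format: one JSON object per line.
with open("dataset.jsonl", "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")

# Read the file back to confirm the round trip.
with open("dataset.jsonl") as f:
    loaded = [json.loads(line) for line in f]
```

A CSV dataset carries the same fields as columns; JSONL is convenient when `input` or `expected_output` are JSON objects rather than strings.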
Run and review evaluations
Run offline evaluations to understand how your AI Config variations perform across a dataset before deciding whether to release changes. When you start an evaluation run, LaunchDarkly processes dataset rows asynchronously. Each row is processed independently to generate outputs, apply evaluation criteria, and record results. Results are available during and after execution. As rows are processed, progress updates continuously so you can monitor evaluation status.
Complete evaluation results include:
- Status counts for rows
- Aggregate scores per criterion
- Latency and token usage metrics
- Row-level outputs and scores
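As an illustration of how row-level results roll up into the aggregates listed above, here is a hedged Python sketch. The field names (`status`, `scores`, `latency_ms`, `tokens`) are assumptions for the example, not LaunchDarkly's actual result schema.

```python
from collections import Counter
from statistics import mean

def summarize(results: list) -> dict:
    """Aggregate row-level results into status counts, per-criterion mean
    scores, and latency/token metrics. Field names are illustrative."""
    status_counts = Counter(r["status"] for r in results)
    criteria = {c for r in results for c in r.get("scores", {})}
    avg_scores = {
        c: mean(r["scores"][c] for r in results if c in r.get("scores", {}))
        for c in criteria
    }
    return {
        "status_counts": dict(status_counts),
        "avg_scores": avg_scores,
        "avg_latency_ms": mean(r["latency_ms"] for r in results),
        "total_tokens": sum(r["tokens"] for r in results),
    }
```

Aggregates like these are useful for the go/no-go decision, while the row-level outputs and scores let you inspect individual failures before rollout.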