Evaluations

Test AI response quality with configurable evaluation prompts and track results over time.

Evaluations let you define test prompts and check whether the AI produces correct responses. Use them to monitor response quality and catch regressions after configuration changes.

All users can view and run evaluations. Creating, editing, and deleting them requires the Admin role.

Overview

The page displays a time-series chart at the top showing pass/fail history across evaluation runs. Below it is a table listing all evaluations with their current status.

Each evaluation shows:

  • Name or prompt snippet
  • Status — Passed (green), Failed (red), or Not Run (outline)
  • Last Execution — When the evaluation last ran
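
If you work with these rows programmatically, each one reduces to a small record. The following TypeScript sketch models it; the type and field names (EvaluationRow, lastExecution) are illustrative assumptions, not a documented Kaelio API.

```typescript
// Hypothetical shape of one table row; the names here are assumptions
// for illustration, not a documented Kaelio API.
type EvaluationStatus = "Passed" | "Failed" | "Not Run";

interface EvaluationRow {
  name: string;               // Name or prompt snippet shown in the table
  status: EvaluationStatus;   // Rendered green, red, or outline respectively
  lastExecution: Date | null; // null until the evaluation first runs
}
```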

Creating an evaluation (Admin)

  1. Click the + Create Evaluation button.
  2. Enter a name, the prompt to send to the AI, and the expected output criteria.
  3. Save the evaluation. It appears in the table with a "Not Run" status.
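
Conceptually, those fields form a small definition object. Here is a hedged TypeScript sketch; EvaluationDefinition and its field names are assumptions for illustration, not a documented Kaelio schema.

```typescript
// Hypothetical shape of an evaluation definition. The interface and
// field names are assumptions for illustration, not a Kaelio schema.
interface EvaluationDefinition {
  name: string;             // Display name shown in the evaluations table
  prompt: string;           // Prompt sent to the AI under test
  expectedCriteria: string; // Phrase the response must satisfy to pass
}

const refundPolicyCheck: EvaluationDefinition = {
  name: "Refund policy accuracy",
  prompt: "What is our refund window for annual plans?",
  expectedCriteria: "30-day refund window",
};
```

Keeping the expected criteria short and distinctive makes pass/fail checks less brittle than matching full sentences.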

Running evaluations

You can run evaluations individually or in batches.

Single evaluation

Click the play button on any evaluation row to run it individually. A progress indicator appears while the evaluation executes, and the status updates when complete.
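
Under the hood, a run amounts to sending the prompt and grading the response. A minimal sketch, reusing the EvaluationDefinition type above and assuming grading is a plain substring check; Kaelio's actual grading may be more sophisticated (for example, an LLM-based judge), and sendPrompt is a placeholder for whatever calls the AI under test.

```typescript
// Minimal sketch of a single run, reusing EvaluationDefinition from
// above. Grading here is a plain substring check; the real product
// may grade differently. sendPrompt is a placeholder for whatever
// sends the prompt to the AI under test.
declare function sendPrompt(prompt: string): Promise<string>;

async function runEvaluation(
  def: EvaluationDefinition,
): Promise<"Passed" | "Failed"> {
  const response = await sendPrompt(def.prompt);
  const passed = response
    .toLowerCase()
    .includes(def.expectedCriteria.toLowerCase());
  return passed ? "Passed" : "Failed";
}
```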

Batch run

Use these options to run multiple evaluations at once:

  • Click Run All to execute every evaluation
  • Select specific evaluations with checkboxes, then click Run Selected (N) to run only those

Batch jobs show a progress bar tracking completion across all selected evaluations.
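
Conceptually, a batch run is the single-run logic fanned out with a completion counter. A sketch, assuming runs are independent and reusing runEvaluation from above; onProgress stands in for whatever drives the progress bar.

```typescript
// Sketch of a batch run: fan out the single-run logic and count
// completions. onProgress stands in for whatever drives the progress
// bar; runEvaluation is the sketch from the previous section.
async function runBatch(
  defs: EvaluationDefinition[],
  onProgress: (done: number, total: number) => void,
): Promise<("Passed" | "Failed")[]> {
  let done = 0;
  return Promise.all(
    defs.map(async (def) => {
      const result = await runEvaluation(def);
      onProgress(++done, defs.length); // Fires as each run finishes
      return result;
    }),
  );
}
```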

Tracking results

The time-series chart at the top visualizes pass/fail trends. Hover over data points to see details about failures, including which evaluations failed and why.
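
The chart's underlying data reduces to a pass rate per run. A sketch of that aggregation, assuming each run is stored with a timestamp and per-evaluation results; RunRecord and passRateSeries are invented names.

```typescript
// Sketch of the aggregation behind a pass/fail trend chart, assuming
// each batch run is stored with a timestamp and per-evaluation
// results. RunRecord and passRateSeries are invented names.
interface RunRecord {
  timestamp: Date;
  results: ("Passed" | "Failed")[];
}

function passRateSeries(history: RunRecord[]): { t: Date; rate: number }[] {
  return history.map((run) => ({
    t: run.timestamp,
    // Fraction of evaluations that passed in this run
    rate: run.results.filter((r) => r === "Passed").length / run.results.length,
  }));
}
```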

Run evaluations after changing knowledge blocks, security policies, or AI settings to verify that responses remain accurate.
