Evaluation toolkit for LLM completions and agent traces. Write test cases as YAML, run them against any model defined in your models.json registry, and score results with rule-based assertions or LLM-as-a-judge.
agent-eval run --file my-case.eval.yaml
agent-eval run --case all
agent-eval matrix --case my-case --variant "gpt4o=gpt-4o" --variant "mini=gpt-4o-mini"
agent-eval diff --case my-case --base '{"model":"gpt-4o"}' --candidate '{"model":"gpt-4o-mini"}'
- How it works
- Install
- Configure
- Quick start
- Case format
- Assertions reference
- Commands
- Advanced
- Environment variables
- Exit codes
┌─────────────────────────────────────────────────────────────┐
│                       agent-eval run                        │
└─────────────────────────────────────────────────────────────┘
         │
         ▼
┌─────────────────┐     ┌──────────────────────────────────┐
│  Load cases     │     │  .eval.yaml file  OR             │
│  (.eval.yaml,   │────▶│  --inline JSON    OR             │
│  built-in, CLI) │     │  --case <id>      OR             │
└─────────────────┘     │  --file <glob>                   │
         │              └──────────────────────────────────┘
         ▼
┌─────────────────┐
│  Run each case  │  plain → OpenAI-compatible chat completion
│  (runner)       │  agent → Full agent loop with MCP tools
└─────────────────┘
         │ EvalTrace (conversation + tool calls)
         ▼
┌─────────────────┐
│  Score trace    │  tier 1: rule-based (tool_usage, final_status)
│  (scorers)      │  tier 2: LLM-as-a-judge (llm_judge, task_success)
└─────────────────┘  tier 3: human review flag
         │ EvalResult (passed/failed + per-dimension scores)
         ▼
┌─────────────────┐
│  Report         │  terminal output + optional HTML report
└─────────────────┘

Two case types:
| Type | What it does | Required config |
|---|---|---|
| plain | Sends messages to a chat completion endpoint; scores the response | models.json with the model used in the case |
| agent | Runs a full agent loop with MCP tool calls; scores the full trace | Same + EVAL_MCP_* when your MCP server requires auth/override |

GitHub Packages requires authentication even for public packages. Create a Personal Access Token (PAT) with read:packages scope:
- Go to https://github.com/settings/tokens/new
- Select scopes: read:packages (minimum)
- Generate and copy the token

Then configure npm:
# Using GitHub CLI (recommended)
echo "//npm.pkg.github.com/:_authToken=$(gh auth token)" >> ~/.npmrc
# Or manually with your PAT
echo "//npm.pkg.github.com/:_authToken=YOUR_GITHUB_TOKEN" >> ~/.npmrc

Add this to your ~/.npmrc (or project-level .npmrc):

echo "@talesofai:registry=https://npm.pkg.github.com" >> ~/.npmrc

Your ~/.npmrc should now contain:

//npm.pkg.github.com/:_authToken=ghp_xxxx
@talesofai:registry=https://npm.pkg.github.com
# Global install (recommended for CLI usage)
npm install -g @talesofai/agent-eval
# Or as a project dependency
npm install @talesofai/agent-eval

agent-eval --version

If you want to contribute or need the latest unreleased changes:

git clone https://github.com/talesofai/talesofai-eval.git
cd talesofai-eval
pnpm install
pnpm build
pnpm agent-eval --version

- Create a models.json file in your project (see Model registry)
- Set up your .env with API keys (see Configure)
- Run agent-eval doctor to verify everything works

⚠️ The CLI command is agent-eval, not eval. The latter is a shell built-in that silently does nothing.

Copy the example env file and fill in your credentials:
cp packages/eval/.env.example .env # from source
# or create .env manually in your working directory

The tool auto-discovers .env and .env.local by walking up from the working directory. .env.local takes precedence over .env (good for personal overrides).

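The discovery rule above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the tool's actual loader: the helper names `parse_env_text` and `discover_env` are hypothetical, the sketch stops at the first directory that contains either file, and it does not handle quoting or inline comments.

```python
from pathlib import Path


def parse_env_text(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def discover_env(start: Path) -> dict[str, str]:
    """Walk up from `start` and load the first directory that has an env file.

    Assumption: only the nearest directory is used; the real tool may merge
    files found at several levels.
    """
    start = start.resolve()
    for directory in [start, *start.parents]:
        base = directory / ".env"
        local = directory / ".env.local"
        if base.is_file() or local.is_file():
            merged: dict[str, str] = {}
            if base.is_file():
                merged.update(parse_env_text(base.read_text()))
            if local.is_file():
                # .env.local is applied last, so it wins on conflicts.
                merged.update(parse_env_text(local.read_text()))
            return merged
    return {}
```

With `.env` containing `B=2` and `.env.local` containing `B=3` in the same directory, the sketch resolves `B` to `3`, which matches the precedence described above.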
All model configuration — endpoint, credentials, and model metadata — lives in models.json. The runner resolves input.model from the case file as a model id in that registry. See Model registry below for the full format.
# Path to your model registry (auto-discovered if ./models.json exists in cwd)
EVAL_MODELS_PATH=./models.json
# Judge model id — must be defined in your models.json
EVAL_JUDGE_MODEL=gpt-4o-mini

A more complete .env, covering the judge, MCP server, and upstream API:

# Model registry
EVAL_MODELS_PATH=./models.json
# LLM judge — single model
EVAL_JUDGE_MODEL=gpt-4o-mini
# LLM judge — multi-model (overrides EVAL_JUDGE_MODEL when set)
# EVAL_JUDGE_MODELS=gpt-4o-mini,gpt-4o,claude-3-5-sonnet
# EVAL_JUDGE_AGGREGATION=median # median (default) | mean | iqm
# Judge endpoint override (if different from runner model; e.g. LiteLLM gateway)
# EVAL_JUDGE_BASE_URL=https://your-litellm.com/v1
# EVAL_JUDGE_API_KEY=your-litellm-key
# Agent runner — MCP tool server
EVAL_MCP_SERVER_BASE_URL=https://mcp.talesofai.cn # default, override if needed
EVAL_MCP_X_TOKEN= # auth token for MCP server
# Agent runner — upstream API (character/asset provider)
EVAL_UPSTREAM_API_BASE_URL=https://api.talesofai.cn # default
EVAL_UPSTREAM_X_TOKEN= # auth token for upstream

agent-eval doctor

This checks all required env vars and prints ✅ / ❌ for each one. Use --mode plain or --mode agent to scope the check.

All models — for the runner and for judging — must be defined in a models.json file that you create and maintain. No models are bundled with the package — this is intentional so you control exactly which models and endpoints are available.
The input.model field in every case file is resolved as a model id against this registry.
Resolution order:
- EVAL_MODELS_PATH env var (explicit path)
- ./models.json in current working directory (auto-discovered)

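The two-step resolution order can be sketched as below. The function name `resolve_models_path` is hypothetical, and resolving a relative `EVAL_MODELS_PATH` against the working directory is an assumption rather than documented behavior.

```python
from pathlib import Path
from typing import Mapping, Optional


def resolve_models_path(env: Mapping[str, str], cwd: Path) -> Optional[Path]:
    """Return the registry path, or None when no registry can be found."""
    explicit = env.get("EVAL_MODELS_PATH")
    if explicit:
        # 1. Explicit path wins. Assumption: relative paths are taken
        #    relative to the working directory.
        return cwd / explicit
    candidate = cwd / "models.json"
    if candidate.is_file():
        # 2. Fall back to ./models.json in the working directory.
        return candidate
    return None
```

A `None` result corresponds to the case where neither source is available, which a checker such as `agent-eval doctor` would presumably report as missing configuration.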
Create ./models.json in your project root:
{
  "models": {
    "gpt-4o-mini": {
      "id": "gpt-4o-mini",
      "name": "GPT-4o Mini",
      "api": "openai-completions",
      "provider": "openai",
      "baseUrl": "https://api.openai.com/v1",
      "apiKey": "${OPENAI_API_KEY}"
    },
    "qwen-plus": {
      "id": "qwen-plus",
      "name": "Qwen Plus",
      "api": "openai-completions",
      "provider": "alibaba",
      "baseUrl": "${DASHSCOPE_BASE_URL}",
      "apiKey": "${DASHSCOPE_API_KEY}"
    },
    "claude-3-5-sonnet": {
      "id": "claude-3-5-sonnet",
      "name": "Claude 3.5 Sonnet",
      "api": "anthropic-messages",
      "provider": "anthropic",
      "baseUrl": "${ANTHROPIC_BASE_URL}",
      "headers": {
        "x-api-key": "${ANTHROPIC_API_KEY}"
      }
    }
  }
}

Field reference:

| Field | Required | Description |
|---|---|---|
| id | ✅ | Model identifier (used in case input.model, EVAL_JUDGE_MODEL, etc.) |
| name | ✅ | Human-readable name |
| api | ✅ | openai-completions or anthropic-messages |
| provider | ✅ | Provider label (informational) |
| baseUrl | ✅ | API base URL. Supports ${ENV_VAR} interpolation. |
| apiKey | optional | API key for this model. Supports ${ENV_VAR} interpolation. Preferred over putting auth in headers. |
| headers | optional | Additional HTTP headers. Supports ${ENV_VAR} interpolation. |
| input | optional | ["text"] or ["text", "image"] |
| contextWindow | optional | Context window size in tokens |
| maxTokens | optional | Max output tokens |

${VAR_NAME} in any field is expanded from environment variables at load time.
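A minimal sketch of what load-time `${VAR_NAME}` expansion could look like. The function name `expand_env` is hypothetical, and raising an error for an unset variable is an assumption; the real loader may behave differently.

```python
import re
from typing import Any, Mapping

# Matches ${NAME} where NAME is a conventional environment variable name.
PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_env(value: Any, env: Mapping[str, str]) -> Any:
    """Recursively expand ${VAR} placeholders in strings, dicts, and lists."""
    if isinstance(value, str):
        def replace(match: re.Match) -> str:
            name = match.group(1)
            if name not in env:
                # Assumption: a missing variable is treated as an error.
                raise KeyError(f"environment variable {name} is not set")
            return env[name]
        return PLACEHOLDER.sub(replace, value)
    if isinstance(value, dict):
        return {key: expand_env(item, env) for key, item in value.items()}
    if isinstance(value, list):
        return [expand_env(item, env) for item in value]
    return value  # numbers, booleans, None pass through unchanged
```

Applied to a registry entry, this expands placeholders in nested fields such as `headers` as well as top-level ones such as `baseUrl`, while leaving numeric fields like `maxTokens` untouched.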
# 1. Check config
agent-eval doctor
# 2. List built-in cases
agent-eval list
# 3. Run one case by id
agent-eval run --case <id>
# 4. Run all built-in cases
agent-eval run --case all
# 5. Run a local YAML file
agent-eval run --file my-case.eval.yaml
# 6. Run multiple YAML files with glob
agent-eval run --file "cases/**/*.eval.yaml"
# 7. One-liner (no YAML file)
agent-eval run \
  --model gpt-4o-mini \
  --system-prompt "You are a helpful assistant." \
  --message "user:Tell me a joke" \
  --judge-prompt "Response should be funny" \
  --judge-threshold 0.7

Cases are YAML files ending in .eval.yaml. Each file defines one case.

Tests a chat completion response directly.
type: plain
id: tone-check                    # unique kebab-case id
description: Response should be friendly and casual
input:
  model: gpt-4o-mini              # model id from models.json
  system_prompt: You are a friendly assistant.
  messages:
    - role: user
      content: Tell me a story
  # Optional: limit which tools are injected (plain cases rarely use tools)
  # allowed_tool_names: [tool_a, tool_b]
criteria:
  assertions:
    - type: llm_judge
      prompt: Response should be friendly and casual
      pass_threshold: 0.7

Multi-turn conversation:

input:
  model: gpt-4o-mini
  system_prompt: You are a math tutor.
  messages:
    - role: user
      content: What is 2+2?
    - role: assistant
      content: 4
    - role: user
      content: And 3+3?

Image input:

input:
  model: gpt-4o
  messages:
    - role: user
      content:
        - type: image_url
          image_url:
            url: https://example.com/image.png
        - type: text
          text: Describe this image

Runs a full agent loop (multi-turn, tool calls). The agent communicates with an MCP tool server.

The runner resolves the agent's identity from system_prompt + model. Values under parameters are interpolated into system_prompt and messages via {{key}} placeholders.

type: agent
id: make-image
description: Agent should call make_image and confirm generation
input:
  system_prompt: |
    You are a creative assistant. {{task_context}}
  model: gpt-4o-mini
  # Parameters are interpolated into system_prompt and messages via {{key}}
  parameters:
    task_context: Help users generate high-quality images.
  messages:
    - role: user
      content: Generate a cat image
  # Tool access control
  allowed_tool_names:             # whitelist of tools the agent may call
    - make_image_v1
    - make_video_v1
  need_approval_tool_names: []    # tools that pause for approval before running
  # Simulate follow-up turns after the agent finishes its first response
  auto_followup:
    mode: adversarial_help_choose # only supported mode
    max_turns: 1                  # default: 1
  # Deprecated: no longer used by the runner, kept for case file identification only
  # preset_key: latitude://8|live|running_agent_new
criteria:
  assertions:
    - type: tool_usage
      expected_tools: [make_image_v1]   # agent must have called this tool
    - type: llm_judge
      prompt: Response should confirm that the image was generated
      pass_threshold: 0.7

Note on preset_key: older case files may contain preset_key. The runner ignores it; only system_prompt + model determine how the agent is run. You can safely remove preset_key from existing cases and omit it in new ones.

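The `{{key}}` substitution used by agent cases can be illustrated with a short sketch. The function name `interpolate` is hypothetical; tolerating whitespace inside the braces and leaving unknown keys untouched are both assumptions, not documented behavior.

```python
import re
from typing import Mapping

# Matches {{key}}; optional whitespace inside the braces is an assumption.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def interpolate(template: str, parameters: Mapping[str, str]) -> str:
    """Replace {{key}} placeholders with values from `parameters`.

    Assumption: placeholders with no matching parameter are left as-is.
    """
    return PLACEHOLDER.sub(
        lambda match: str(parameters.get(match.group(1), match.group(0))),
        template,
    )
```

For the case above, interpolating `task_context` into the system prompt would yield "You are a creative assistant. Help users generate high-quality images."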
Assertions define how a trace is scored. All assertions live under criteria.assertions.
Each assertion runs as an independent dimension. A case passes only when all non-human_review assertions pass.
Assertions have a tier (1–3) that controls when they run. Use --tier-max to limit evaluation depth:
| Tier | Default for | Meaning |
|---|---|---|
| 1 | tool_usage, final_status, error_recovery | Rule-based, no LLM needed. Fast CI. |
| 2 | llm_judge, task_success, tool_parameter_accuracy | LLM-as-a-judge. Requires judge config. Default --tier-max. |
| 3 | human_review | Flag for async human review. Never blocks automated scoring. |

agent-eval run --case all --tier-max 1 # fast rules-only, no LLM judge
agent-eval run --case all --tier-max 2 # default: rules + LLM judge
agent-eval run --case all --tier-max 3 # include human_review flags

You can override the default tier of any assertion with tier: <1|2|3>.

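The tier rules and the pass rule described above can be modelled in a few lines. This is a sketch, not the scorer's real code: `Assertion`, `select_assertions`, and `case_passed` are hypothetical names, and human_review is filtered like any other tier-3 assertion here even though the real selection logic may special-case it.

```python
from dataclasses import dataclass
from typing import Optional

# Default tiers, taken from the table above.
DEFAULT_TIERS = {
    "tool_usage": 1, "final_status": 1, "error_recovery": 1,
    "llm_judge": 2, "task_success": 2, "tool_parameter_accuracy": 2,
    "human_review": 3,
}


@dataclass
class Assertion:
    type: str
    tier: Optional[int] = None  # explicit `tier:` override, if present

    @property
    def effective_tier(self) -> int:
        return self.tier if self.tier is not None else DEFAULT_TIERS[self.type]


def select_assertions(assertions: list[Assertion], tier_max: int = 2) -> list[Assertion]:
    """Keep only assertions whose effective tier is within --tier-max."""
    return [a for a in assertions if a.effective_tier <= tier_max]


def case_passed(results: dict[str, bool]) -> bool:
    """A case passes when every non-human_review assertion passed."""
    return all(ok for kind, ok in results.items() if kind != "human_review")
```

Under this model, an `llm_judge` assertion with `tier: 1` would still be selected by `--tier-max 1`, while one left at its default tier would not.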
Checks which tools were called during the trace.
- type: tool_usage
  expected_tools: [make_image_v1]        # all listed tools must appear
  forbidden_tools: [dangerous_tool_v1]   # none of these may appear

Both fields are optional; omitting both is a no-op (always passes).

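The tool_usage rule reduces to two set checks. A minimal sketch, assuming the trace exposes the names of the tools that were called; `check_tool_usage` is a hypothetical name, not part of the package.

```python
from typing import Iterable, Sequence


def check_tool_usage(
    called: Iterable[str],
    expected_tools: Sequence[str] = (),
    forbidden_tools: Sequence[str] = (),
) -> bool:
    """Pass when every expected tool was called and no forbidden tool was."""
    called_set = set(called)
    missing = [tool for tool in expected_tools if tool not in called_set]
    violations = [tool for tool in forbidden_tools if tool in called_set]
    # With both lists empty there is nothing to check, so the result is True.
    return not missing and not violations
```

Calling it with no expected or forbidden tools returns `True`, which mirrors the no-op behavior noted above.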
Checks the agent's final status after the run.
- type: final_status
  expected_status: SUCCESS    # SUCCESS | PENDING | FAILURE

Checks whether the agent retried or recovered after a tool failure.

- type: error_recovery
  tool_name: make_image_v1    # optional: scope to a specific tool
  pass_threshold: 0.5         # optional: default 0.5

An LLM reads the full conversation and scores it 0–1 against your prompt. Requires EVAL_JUDGE_MODEL (or EVAL_JUDGE_MODELS).

- type: llm_judge
  prompt: The response should be concise and answer the user's question directly.
  pass_threshold: 0.7         # score must be >= this to pass (0–1)

A holistic LLM evaluation of whether the agent completed the user's goal. Unlike llm_judge, the scoring criterion is inferred from context unless overridden.

- type: task_success
  user_goal: "Generate a cat image and confirm it to the user" # optional override
  pass_threshold: 0.7

An LLM checks whether the tool was called with correct and relevant parameters.

- type: tool_parameter_accuracy
  tool_name: make_image_v1
  expected_description: "Should include a cat in the prompt parameter"
  pass_threshold: 0.7

Flags the case for async manual review. Always runs (never blocks other assertions). Never causes the case to fail in automated scoring.

- type: human_review
  reason: "Needs visual inspection of generated image quality"

# By built-in case id
agent-eval run --case <id>
agent-eval run --case all
# By YAML file (glob supported)
agent-eval run --file my-case.eval.yaml
agent-eval run --file "cases/**/*.eval.yaml"
# Inline JSON (no file needed)
agent-eval run --inline '{"type":"plain","id":"x","description":"...","input":{...},"criteria":{...}}'
# Filter by type
agent-eval run --case all --type plain
agent-eval run --case all --type agent
# Control scoring depth
agent-eval run --case all --tier-max 1 # rule-based only, fast
# Concurrency (default: min(total, 8))
agent-eval run --case all --concurrency 4
# Verbose: show full conversation in output
agent-eval run --case my-case --verbose
# Output formats
agent-eval run --case my-case --format terminal # default, human-readable
agent-eval run --case my-case --format json # machine-readable NDJSON
# Record traces to a directory (auto-enabled for >1 case)
agent-eval run --case all --record ./my-records
# Replay from saved traces (re-score without re-running the LLM)
agent-eval run --case all --replay ./my-records
# Disable auto-share of HTML report
agent-eval run --case all --share=false

One-liner (construct case from CLI flags):

agent-eval run \
  --model gpt-4o-mini \
  --system-prompt "You are a helpful assistant." \
  --message "user:Hello" \
  --message "assistant:Hi there!" \
  --message "user:What can you do?" \
  --judge-prompt "Should give a helpful answer" \
  --judge-threshold 0.7

Run the same cases against multiple parameter variants. Produces a grid: cases × variants.

# Shorthand: label=model
agent-eval matrix --case all \
  --variant "gpt4o=gpt-4o" \
  --variant "mini=gpt-4o-mini"

# Full JSON variant (override any input field)
agent-eval matrix --case all \
  --variant '{"label":"v1","model":"gpt-4o","system_prompt":"Be concise."}' \
  --variant '{"label":"v2","model":"gpt-4o","system_prompt":"Be detailed."}'

# With file and concurrency
agent-eval matrix \
  --file "cases/**/*.eval.yaml" \
  --variant "a=gpt-4o" \
  --variant "b=gpt-4o-mini" \
  --concurrency 4 \
  --record ./matrix-results

Results are saved under <record>/<variant-label>/<case-id>.result.json.

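The two `--variant` forms and the result layout can be sketched as follows. Both helper names (`parse_variant`, `result_path`) are hypothetical, and distinguishing the forms by a leading `{` is an assumption about how the CLI tells them apart.

```python
import json
from pathlib import PurePosixPath


def parse_variant(raw: str) -> dict:
    """Turn a --variant value into a dict with at least `label` and `model`."""
    text = raw.strip()
    if text.startswith("{"):
        # Full JSON form: may override any input field.
        return json.loads(text)
    # Shorthand form: label=model
    label, separator, model = text.partition("=")
    if not separator or not label or not model:
        raise ValueError(f"expected label=model or a JSON object, got {raw!r}")
    return {"label": label, "model": model}


def result_path(record_dir: str, label: str, case_id: str) -> PurePosixPath:
    """Build <record>/<variant-label>/<case-id>.result.json."""
    return PurePosixPath(record_dir) / label / f"{case_id}.result.json"
```

For example, `parse_variant("gpt4o=gpt-4o")` gives a label of `gpt4o` and a model of `gpt-4o`, and the matching result for case `my-case` would sit at `matrix-results/gpt4o/my-case.result.json`.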
Run the same cases under two configurations and let an LLM judge which is better.
agent-eval diff --case all \
  --base '{"model":"gpt-4o"}' \
  --candidate '{"model":"gpt-4o-mini"}'

# Add labels for readable output
agent-eval diff --case all \
  --base '{"label":"prod","model":"gpt-4o"}' \
  --candidate '{"label":"cheap","model":"gpt-4o-mini"}'

Verdict per case: base_better | candidate_better | equivalent | error.

agent-eval list
# Outputs JSON array: [{id, type, description}, ...]

agent-eval inspect --case <id>
agent-eval inspect --file my-case.eval.yaml

agent-eval doctor                  # check all
agent-eval doctor --mode plain     # only plain-case env vars
agent-eval doctor --mode agent     # only agent-case env vars
agent-eval doctor --format json    # machine-readable output

After a recorded run, regenerate the HTML report without re-running cases:

agent-eval report --from ./my-records
# → ./my-records/run-report.html
# → ./my-records/run-report-list.html
agent-eval report --from ./my-records --out ./output/report.html

agent-eval matrix-report --from ./matrix-results
# Reads variant subdirs: ./matrix-results/<variant>/*.result.json
# → ./matrix-results/matrix-report.html

⚠️ This command is talesofai-internal. It requires EVAL_UPSTREAM_X_TOKEN and access to the talesofai API.

agent-eval pull-online \
  --collection-uuid <uuid> \
  --out cases/my-imported-case.eval.yaml

# With pagination (import page 2, 5 items)
agent-eval pull-online \
  --collection-uuid <uuid> \
  --page-index 1 \
  --page-size 5 \
  --out cases/batch.eval.yaml

Why replay? Running agent cases is slow and expensive. Record once, then iterate on scoring logic or LLM judge prompts without re-running the agent.

# Step 1: record traces
agent-eval run --case all --record ./records/run-001
# Step 2: replay (re-score, skip LLM execution)
agent-eval run --case all --replay ./records/run-001
# Step 3: change your assertions in the YAML, replay again
agent-eval run --case all --replay ./records/run-001

Replay behavior:

- If a <case-id>.result.json exists → use cached result directly
- If only <case-id>.trace.json exists → re-score the trace
- Auto-record is enabled for runs with >1 case (saved to .eval-records/run-<timestamp>/)

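The first two replay rules amount to a lookup order on the record directory. A minimal sketch, where `replay_action` and its return labels are hypothetical and the behavior when neither file exists is not specified by this document:

```python
from pathlib import Path


def replay_action(record_dir: Path, case_id: str) -> str:
    """Decide what a replay would do for one case, based on files present."""
    if (record_dir / f"{case_id}.result.json").is_file():
        return "use_cached_result"   # result file wins when it exists
    if (record_dir / f"{case_id}.trace.json").is_file():
        return "rescore_trace"       # only the trace exists: score it again
    return "missing"                 # assumption: caller decides what to do
```

One consequence of this order is that a case with a saved result is not re-scored, so changed assertions only take effect for cases that have a trace but no result file.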
Backfill metrics on older results:
agent-eval run --case all --replay ./old-records --replay-write-metrics

Matrix runs N cases × M variants in parallel and produces a comparison grid.

agent-eval matrix \
  --file "cases/**/*.eval.yaml" \
  --variant "v1=gpt-4o" \
  --variant "v2=gpt-4o-mini" \
  --variant "v3=claude-3-5-sonnet" \
  --record ./matrix-20240301

# Generate HTML report separately
agent-eval matrix-report --from ./matrix-20240301

Matrix automatically resumes from existing results if you re-run with the same --record directory.

diff is useful for evaluating prompt changes or model upgrades on a shared case set.
# Compare two system prompts
agent-eval diff --file cases/my-case.eval.yaml \
  --base '{"label":"original","system_prompt":"You are helpful."}' \
  --candidate '{"label":"new","system_prompt":"You are concise and helpful."}'

The diff uses a single LLM judge (EVAL_JUDGE_MODEL) to compare the two traces head-to-head.

Use multiple LLMs to judge the same output and aggregate scores for higher reliability.
Setup:
# Use a LiteLLM gateway or any unified endpoint
EVAL_JUDGE_BASE_URL=https://your-litellm.com/v1
EVAL_JUDGE_API_KEY=your-key
EVAL_JUDGE_MODELS=gpt-4o-mini,gpt-4o,claude-3-5-sonnet
EVAL_JUDGE_AGGREGATION=median

When EVAL_JUDGE_MODELS is set, it takes precedence over EVAL_JUDGE_MODEL.

Aggregation methods:
| Method | Description |
|---|---|
| median | Robust to outliers. Default. |
| mean | Simple average. |
| iqm | Interquartile mean: drops top/bottom 25%. |

Output example:
score: 0.85
reason: gpt-4o-mini: 0.80 - accurate | gpt-4o: 0.90 - detailed |
claude-3-5-sonnet: 0.85 - correct | [aggregated via median: 0.85]
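The three aggregation methods can be sketched with the standard library. The names `iqm` and `aggregate` are hypothetical, and the IQM here trims `len(scores) // 4` values from each end, which is one common convention; how the tool handles fractional quartiles is not stated in this document.

```python
import statistics
from typing import Sequence


def iqm(scores: Sequence[float]) -> float:
    """Interquartile mean: drop the lowest and highest quarter, average the rest."""
    ordered = sorted(scores)
    trim = len(ordered) // 4          # assumption: whole-item trimming only
    kept = ordered[trim:len(ordered) - trim]
    return statistics.fmean(kept)


def aggregate(scores: Sequence[float], method: str = "median") -> float:
    """Combine per-judge scores using median (default), mean, or iqm."""
    if not scores:
        raise ValueError("at least one judge score is required")
    if method == "median":
        return statistics.median(scores)
    if method == "mean":
        return statistics.fmean(scores)
    if method == "iqm":
        return iqm(scores)
    raise ValueError(f"unknown aggregation method: {method}")
```

With the three judge scores from the output example (0.80, 0.90, 0.85), the median is 0.85. Note that with fewer than four judges this sketch trims nothing, so `iqm` equals `mean`.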
See Model registry under Configure for the full format and field reference.
| Variable | Required | Default | Description |
|---|---|---|---|
| EVAL_MODELS_PATH | ✅* | ./models.json in cwd | Path to your model registry JSON. Required unless ./models.json exists in cwd. |
| EVAL_UPSTREAM_X_TOKEN | optional | none | Optional x-token header added to runner requests and upstream API calls. |

* Any env vars referenced by ${VAR} placeholders in your models.json entries must also be set (e.g. OPENAI_API_KEY, ANTHROPIC_API_KEY).

| Variable | Required | Default | Description |
|---|---|---|---|
| EVAL_JUDGE_MODEL | ✅* | none | Single judge model id (must be defined in your registry). Required unless EVAL_JUDGE_MODELS is set. |
| EVAL_JUDGE_MODELS | optional | none | Comma-separated model ids for multi-model judging. Overrides EVAL_JUDGE_MODEL. |
| EVAL_JUDGE_AGGREGATION | optional | median | Aggregation method: median, mean, or iqm |
| EVAL_JUDGE_BASE_URL | optional | none | Judge endpoint override (e.g. LiteLLM gateway). Overrides the model's baseUrl from registry. |
| EVAL_JUDGE_API_KEY | optional | none | Judge API key override. Overrides the model's apiKey from registry. |

| Variable | Required | Default | Description |
|---|---|---|---|
| EVAL_MCP_SERVER_BASE_URL | optional | https://mcp.talesofai.cn | MCP tool server base URL |
| EVAL_MCP_X_TOKEN | optional | none | Auth token for MCP server |
| EVAL_UPSTREAM_API_BASE_URL | optional | https://api.talesofai.cn | Upstream API for character/asset provider |
| EVAL_UPSTREAM_X_TOKEN | optional | none | Auth token for upstream API |

| Variable | Required | Default | Description |
|---|---|---|---|
| EVAL_LEGACY_AGENT_PROMPT_FILE | optional | none | Override legacy agent prompt template file |
| AGENT_EVAL_DISABLE_ENV_AUTOLOAD | optional | none | Set to 1 to disable auto .env discovery |

| Code | Meaning |
|---|---|
| 0 | All cases passed |
| 1 | One or more cases failed (assertions failed, but no system errors) |
| 2 | System error (missing config, runner crash, IO error) |

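A CI wrapper may want to treat assertion failures and system errors differently rather than collapsing both into one failure. The sketch below shows one way to do that; `classify_exit` and `run_and_classify` are hypothetical helpers, and the returned labels are illustrative only.

```python
import subprocess
from typing import Sequence

# Meanings taken from the exit code table above.
EXIT_LABELS = {
    0: "passed",
    1: "cases_failed",
    2: "system_error",
}


def classify_exit(code: int) -> str:
    """Map an agent-eval exit code to a short label."""
    return EXIT_LABELS.get(code, "unknown")


def run_and_classify(args: Sequence[str]) -> str:
    """Run agent-eval with the given arguments and classify its exit code.

    Assumes the agent-eval binary is on PATH.
    """
    completed = subprocess.run(["agent-eval", *args])
    return classify_exit(completed.returncode)
```

For instance, a pipeline could call `run_and_classify(["run", "--case", "all", "--tier-max", "1"])` and page an operator only on `system_error`, while reporting `cases_failed` as an ordinary test failure.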
Use exit codes in CI:
agent-eval run --case all --tier-max 1 || exit 1

MIT