Learn

What is AI Agent Evaluation?

How you know your AI employee is actually doing the job.

Updated Apr 2026
Short answer

Agent evaluation is the practice of systematically measuring AI agent quality against defined tasks and rubrics — accuracy, groundedness, tool-use correctness, end-to-end task success, and safety. It combines automated metrics (exact match, F1, BLEU), LLM-as-judge scoring, golden-set regression tests, and human review. Evals are the difference between 'the agent feels good' and 'the agent actually works in production at a known quality bar.'

In depth

LLMs and agents don't come with unit tests. An agent that answers correctly in a demo can fail in the next conversation; a prompt change that feels cleaner can silently regress accuracy by 15%. Without evaluations, teams ship based on vibes, which breaks the moment traffic scales or the model updates. Evaluation is the discipline that makes agent development feel like engineering instead of alchemy.

The evaluation hierarchy has four layers. (1) Offline eval on a golden set — a curated collection of inputs with expected outputs or rubrics. Run every code change through this set, compare against baseline, and gate deployment on regressions; 100-1000 examples is typical. (2) LLM-as-judge — a separate LLM (usually a stronger one) reads the input and agent output and scores against a rubric. Fast, cheap, and good for subjective tasks where exact match doesn't apply. (3) Live production metrics — user thumbs-up/down, task completion, escalation rate, time-to-resolution. These capture real quality but are noisy and lag. (4) Human review — periodic sampling of production outputs read by domain experts. This catches what automated evals miss but is expensive to scale.

What to measure depends on the agent. For RAG agents: groundedness (does the answer match retrieved context), answer relevance (does it address the query), context precision/recall (did retrieval find the right docs), and faithfulness (no hallucinations). For tool-using agents: tool selection accuracy, parameter correctness, multi-step planning quality, and recovery from tool errors. For end-to-end agents: task success rate on a benchmark like SWE-bench or WebArena, time to completion, and cost per task. Leading frameworks — RAGAS, DeepEval, TruLens, OpenAI Evals, Braintrust, LangSmith, Weights & Biases Weave — cover different slices of this stack.

LLM-as-judge deserves special attention. It's tempting to use GPT-5 to score your GPT-5 agent's output, but the judge must be at least as capable as the agent being judged and should use a different prompt/rubric to avoid circularity. Best practice: use a stronger model as judge, calibrate judge reliability against human annotators on a sample of 50-100 examples, and report inter-judge agreement. Anthropic and OpenAI have both published studies showing that LLM-as-judge correlates 70-85% with human judgments when properly set up — not perfect, but good enough to drive iteration cycles that would be impossible with humans alone.

Agent-specific benchmarks emerged in 2023-2024 and matured rapidly. SWE-bench Verified (Princeton, 2024) measures end-to-end code-fix tasks against real GitHub issues. WebArena (CMU, 2023) tests web-browsing agents on e-commerce, forum, and CMS tasks. GAIA (Meta AI, 2023) covers general assistant tasks. TauBench (Sierra, 2024) benchmarks conversational agents with tool use. These benchmarks let you compare your agent to published baselines, which is invaluable because raw per-task accuracy numbers are meaningless without context: a 60% SWE-bench score sounds bad until you know that human developers score 80% and the previous SOTA was 40%.

For Tycoon, agent evaluation is operational discipline. Every change to Astra's system prompt, skill loading, or tool configuration runs a golden-set eval before shipping. We track groundedness (Astra shouldn't claim numbers she didn't retrieve), task completion rate (did she do what the user asked), and escalation appropriateness (did she ask humans at the right moments). Production metrics feed back into the golden set — interesting failures become new eval cases. Without this loop, improving the agent would be guesswork.
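The golden-set layer can be sketched in a few lines. This is a minimal illustration assuming an exact-match rubric; the names (`toy_agent`, `GOLDEN_SET`, `gate_deploy`) are hypothetical and not from any specific framework:

```python
GOLDEN_SET = [
    {"input": "What is 2 + 2?", "expected": "4"},
    {"input": "Capital of France?", "expected": "Paris"},
    {"input": "Boiling point of water in C?", "expected": "100"},
]

def toy_agent(prompt: str) -> str:
    # Stand-in for a real agent call; answers two of three cases correctly.
    answers = {"What is 2 + 2?": "4", "Capital of France?": "Paris"}
    return answers.get(prompt, "I don't know")

def evaluate(agent, golden_set) -> float:
    """Exact-match accuracy over the golden set."""
    correct = sum(
        1 for case in golden_set
        if agent(case["input"]).strip() == case["expected"]
    )
    return correct / len(golden_set)

def gate_deploy(accuracy: float, baseline: float, tolerance: float = 0.02) -> bool:
    """Block deployment if accuracy drops more than `tolerance` below baseline."""
    return accuracy >= baseline - tolerance

accuracy = evaluate(toy_agent, GOLDEN_SET)  # 2/3
```

In CI this runs on every change: compare `accuracy` against the stored baseline and fail the build when `gate_deploy` returns False, exactly the regression-gating loop described above.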

Examples

  • RAGAS — Python library for RAG eval: faithfulness, answer relevance, context precision, context recall
  • DeepEval — pytest-style framework for LLM outputs with 15+ built-in metrics
  • TruLens — observability + eval for LLM apps, focused on the RAG triad (context relevance, groundedness, answer relevance)
  • OpenAI Evals — open-source framework of standardized eval tasks for comparing models
  • Braintrust — commercial eval platform with golden datasets, LLM-as-judge, and CI integration
  • LangSmith — LangChain's eval and observability product, focused on chain-level metrics
  • SWE-bench Verified (Princeton, 2024) — 500 real GitHub issues with expert-verified ground-truth patches
  • Tycoon's golden-set eval: 200 curated task-to-output pairs that gate every Astra prompt change

Frequently asked questions

Why can't I just use 'did it work for me' as evaluation?

Three reasons. (1) Vibes are noisy — the same prompt can feel good one day and bad the next; you can't distinguish real regressions from bad mood. (2) You only see your own usage patterns; production traffic has 10x the variety and 100x the edge cases. (3) Without a shared eval set, a team can't agree on whether a change is net-positive — different teammates will have different vibes. A golden set with rubric-based scoring gives you a shared, reproducible measurement. You still need it even if you're a solo founder — past-you disagrees with present-you, and the eval set keeps you honest across time.

How big should my golden eval set be?

Minimum 50, ideally 200-1000. Below 50 the variance is too high to detect small regressions; above 1000, marginal utility drops and iteration speed slows. For MVPs, start with 50 carefully chosen examples covering the happy path plus 5-10 edge cases you know about. Expand over time from production failures — every time the agent fails in production, add that case (or a close variant) to the eval set. After 6-12 months of this, you'll have a rich 500+ example set that reflects real usage. Quality beats quantity: 100 well-scored examples beat 1000 sloppy ones.
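The size guidance follows from simple binomial noise. A sketch, using the standard normal-approximation confidence interval (the helper name is illustrative, not from any eval framework):

```python
import math

def accuracy_margin(p: float, n: int, z: float = 1.96) -> float:
    """95% confidence-interval half-width for a pass rate p measured on n eval cases."""
    return z * math.sqrt(p * (1 - p) / n)

# With 50 cases at ~80% accuracy, the measurement is only good to about
# +/- 11 points, so a 5-point regression is invisible in the noise.
margin_50 = accuracy_margin(0.8, 50)    # ~0.111
# With 500 cases it tightens to about +/- 3.5 points.
margin_500 = accuracy_margin(0.8, 500)  # ~0.035
```

This is why regressions smaller than the margin can't be trusted at small set sizes, and why growing the set from production failures pays off.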

Should I use LLM-as-judge or human review?

Both, at different frequencies. LLM-as-judge on every commit: cheap, fast, and good for catching regressions on subjective tasks. Human review periodically (weekly or monthly) on a sampled slice to catch what LLM judges miss and to calibrate the judge. The key risk in LLM-as-judge is systematic bias — judges tend to prefer longer answers, more confident-sounding answers, and answers structurally similar to the reference. Human calibration on 50 examples per quarter keeps the judge honest. OpenAI, Anthropic, and Google have all published findings that well-calibrated LLM judges agree with humans 70-90% of the time depending on the task, which is enough for fast iteration but not enough to stake a product launch on.
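Calibrating a judge against human labels can be as simple as computing raw agreement plus Cohen's kappa to correct for chance agreement. A minimal sketch over binary pass/fail labels (hypothetical helper, not a library API):

```python
def agreement_and_kappa(judge: list, human: list) -> tuple:
    """Raw agreement and Cohen's kappa between judge and human pass/fail (1/0) labels."""
    assert len(judge) == len(human) and judge
    n = len(judge)
    agree = sum(j == h for j, h in zip(judge, human)) / n
    # Chance agreement from each rater's marginal pass rate.
    pj = sum(judge) / n
    ph = sum(human) / n
    chance = pj * ph + (1 - pj) * (1 - ph)
    kappa = (agree - chance) / (1 - chance) if chance < 1 else 1.0
    return agree, kappa
```

High raw agreement with low kappa means the judge is mostly riding the base rate (e.g. passing almost everything), which is exactly the failure mode periodic human calibration is meant to catch.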

What metrics matter for a RAG-based agent?

Five metrics from the RAG triad and friends. (1) Faithfulness: does the answer only use information from retrieved context, or does it hallucinate? (2) Answer relevance: does the answer actually address the query? (3) Context precision: of the retrieved chunks, how many are relevant? (4) Context recall: did retrieval find all the relevant chunks in your corpus? (5) End-to-end accuracy: does the final answer match ground truth on a QA set? RAGAS, TruLens, and DeepEval all automate these with LLM-as-judge. Faithfulness is usually the first to break — when you see it drop, the model has started hallucinating. Context recall breaks when your retrieval is tuned wrong. Tracking all five together localizes the failure.
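With ground-truth relevance labels, context precision and recall reduce to set overlap. This is the simplified labeled-data version; note that RAGAS and TruLens estimate these metrics with LLM judges rather than requiring labeled sets:

```python
def context_precision(retrieved: list, relevant: list) -> float:
    """Of the chunks retrieval returned, what fraction are actually relevant?"""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    return len(retrieved_set & relevant_set) / len(retrieved_set) if retrieved_set else 0.0

def context_recall(retrieved: list, relevant: list) -> float:
    """Of the chunks that are relevant, what fraction did retrieval find?"""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    return len(retrieved_set & relevant_set) / len(relevant_set) if relevant_set else 1.0

# Example: retrieval returned 4 chunks, 2 of which are relevant,
# but missed one relevant chunk (d7) entirely.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d2", "d4", "d7"]
```

Low precision means the prompt is padded with noise; low recall means the answer can't be grounded no matter how good the model is, which is why the two are tracked together.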

How do I evaluate multi-step agents that use tools?

Evaluate at multiple granularities. (1) Tool call correctness: for each tool call, was the right tool chosen with the right arguments? Rubric or exact match. (2) Trajectory quality: was the sequence of tool calls sensible, or did the agent thrash? Often an LLM-as-judge task. (3) End-to-end task success: did the final state match the goal? Often ground-truth comparable. SWE-bench and WebArena both evaluate at (3) with programmatic checks (did the code patch pass the test suite, did the form submit correctly). For custom agents, mirror this: build a task harness that checks end-state programmatically, augmented with LLM-as-judge for reasoning quality. TauBench from Sierra is the current reference pattern for evaluating conversational agents with tool use.
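Granularity (1), and a strict order-sensitive version of (2), can be checked programmatically when you have reference trajectories. A minimal sketch with hypothetical call records (dicts with "tool" and "args" keys, not any framework's schema):

```python
def tool_call_correct(actual: dict, expected: dict) -> bool:
    """Exact match on tool name and arguments for a single call."""
    return actual["tool"] == expected["tool"] and actual["args"] == expected["args"]

def trajectory_score(actual_calls: list, expected_calls: list) -> float:
    """Fraction of expected calls matched in order (a strict, order-sensitive rubric)."""
    matched = sum(
        tool_call_correct(a, e)
        for a, e in zip(actual_calls, expected_calls)
    )
    return matched / len(expected_calls) if expected_calls else 1.0

# Example: the agent picks the right tools but passes a wrong order_id
# on the second call, so the trajectory scores 0.5.
expected = [
    {"tool": "search", "args": {"q": "refund policy"}},
    {"tool": "refund", "args": {"order_id": "A1"}},
]
actual = [
    {"tool": "search", "args": {"q": "refund policy"}},
    {"tool": "refund", "args": {"order_id": "B9"}},
]
```

Real harnesses usually relax this rubric (allowing argument tolerances or alternative valid orderings) and layer LLM-as-judge on top for reasoning quality, but the exact-match core is what gates CI.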

Run your one-person company.

Hire your AI team in 30 seconds. Start for free.

Free to start · No credit card required · Set up in 30 seconds