Series note — Appendix to MAF v1: Python and .NET. Sits after Ch22 — Python ↔ .NET asymmetries. Supersedes the original Python-only Part 9 — Evaluating Agent Quality — this version updates the tool-call extraction to MAF v1’s canonical `AgentRunResponse` shape, switches alias matching from substrings to word boundaries (the original false-positived on “profit margin” matching the “price” alias), and ships a smoke / full split for CI vs nightly runs. A .NET equivalent lives alongside the Python implementation.
Capstone code — This chapter doesn’t ship a standalone tutorial folder. The runnable evaluation framework lives in the e-commerce-agents capstone at `agents/python/evals/`: `evaluator.py` (scoring + run loop), `run_evals.py` (CLI), `datasets/` (golden cases).
## Why this chapter
The first time you change a system prompt and ship it, three things happen: an agent that was correct yesterday hallucinates a price, a tool that was being called stops being called, and you find out four days later from a customer ticket. None of your unit tests catch it because the unit tests cover tools, not tool selection. The orchestration layer — “which tool, with which arguments, at which step” — is what the LLM owns, and that’s where regressions live.
This chapter builds the eval pipeline that closes the gap. Behavioural assertions instead of string matching. A golden dataset per agent. A scoring model that catches the three failure modes that actually matter. A CLI runner that returns non-zero on regression. A CI workflow that runs the eval on every PR.
## Prerequisites
- Completed Ch01 — Your First Agent and Ch02 — Adding Tools.
- Familiar with Ch07 — Observability (we read tool-call lists from the run result, not from spans, but the mental model overlaps).
- Python 3.12+ with `uv`, or .NET 9 with the `dotnet` CLI.
- An OpenAI or Azure OpenAI key — evals always run real LLM calls. Budget ~$0.02–0.05 per 5-case run; see the cost-tracking section.
## What you’ll learn
- Why “give X, expect Y” tests don’t work for agents, and what does work.
- The three behavioural axes — groundedness, correctness, completeness — and the weighting that pushed the capstone to 90%+ pass rates.
- How to extract tool-call lists from MAF v1’s `AgentRunResponse` cleanly (without `hasattr` guessing).
- Why the original alias matcher false-positived on `"profit margin"` matching `"price"`, and the four-line fix.
- How to split a dataset into smoke (PR-gate) vs full (nightly) tiers without duplicating cases.
## Why traditional testing fails for agents
Three approaches don’t survive contact with a non-deterministic LLM:
- Unit tests on tool functions — the tool is testable; the decision to call it is not. `search_products(query="headphones", max_price=300)` returns deterministic rows, but the LLM might call `semantic_search` instead, or set `max_price=299.99`, or call `search_products` twice.
- Integration tests on response strings — “Here are some headphones under $300” and “I found 5 wireless headphones within your budget” are both correct. A string-match test fails on one of them.
- Snapshot tests — record a “golden” response, compare future runs. LLM responses vary on every invocation. You’re updating snapshots constantly, and that defeats the test.
What works is behavioural assertions: instead of “the response contains the string `'Sony WH-1000XM5 - $348.00'`,” you assert that the agent called `search_products`, the `max_price` parameter was ≤ 300, and the response mentions a product name and price. Those properties hold across phrasings, models, and runs. They catch the failures that matter — silent tool-skip, wrong tool, missing field — and ignore the ones that don’t.
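A behavioural assertion can be sketched as follows — the helper name and the shapes of `tools_called` / `tool_args` are illustrative here, not the capstone’s real API:

```python
# Hypothetical behavioural assertion — `tools_called`, `tool_args`, and
# `response_text` are illustrative names, not the capstone's real API.
def assert_behaviour(tools_called: list[str],
                     tool_args: dict[str, dict],
                     response_text: str) -> None:
    # The right tool was selected — not a specific response string.
    assert "search_products" in tools_called
    # The user's constraint survived into the arguments.
    assert tool_args["search_products"]["max_price"] <= 300
    # Some price appears in the answer, in any phrasing.
    assert any(ch.isdigit() for ch in response_text)

# Two very different phrasings both pass, because we assert properties:
assert_behaviour(
    ["search_products"],
    {"search_products": {"max_price": 300}},
    "I found 5 wireless headphones within your $300 budget",
)
```

A hard-coded string match would accept exactly one phrasing; the property-based version accepts any of them and still fails the moment the tool call disappears.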
## The three axes
| Axis | What it asks | Score type | Weight |
|---|---|---|---|
| Groundedness | Did the agent call any tool before answering a factual question? | Binary (0 or 1) | 40% |
| Correctness | Did it call the right tool(s)? | Partial (0–1) | 40% |
| Completeness | Did the response contain the fields the user asked for? | Partial (0–1) | 20% |
Groundedness is binary because partial grounding doesn’t exist — either the agent consulted real data or it made something up. Correctness gives partial credit because multi-step queries can reasonably take alternative routes; if the agent calls 1 of 2 expected tools, that’s 0.5. We deliberately don’t penalize extra tool calls — being thorough isn’t being wrong.
Completeness is weighted lower because an incomplete response is recoverable (the user follows up); a hallucinated or wrong-tool response is not.
The 40/40/20 split came from running ~200 real eval cases against the capstone and tuning until the score correlated with what humans flagged as “this answer is broken.” Other splits work; what matters is that you commit to one and watch the trend.
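The arithmetic is worth seeing once. A minimal sketch of the weighted sum (the same weights tuple appears on the evaluator class later in this chapter):

```python
# The 40/40/20 weighted sum from the table above.
WEIGHTS = (0.4, 0.4, 0.2)  # groundedness, correctness, completeness

def overall(g: float, c: float, m: float) -> float:
    return WEIGHTS[0] * g + WEIGHTS[1] * c + WEIGHTS[2] * m

# Grounded, called 1 of 2 expected tools, covered every requested field:
# 0.4·1.0 + 0.4·0.5 + 0.2·1.0 ≈ 0.8 — passes a 0.7 gate.
print(round(overall(1.0, 0.5, 1.0), 2))

# Ungrounded on a factual question (and therefore wrong-tool too):
# 0.4·0 + 0.4·0 + 0.2·1.0 = 0.2 — a hard fail, even with a "complete" answer.
print(round(overall(0.0, 0.0, 1.0), 2))
```

Note how the binary groundedness axis dominates: once it zeroes out, no amount of completeness can rescue the case past a 0.7 threshold.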
## Golden datasets
A golden dataset is a JSON file. Each row is a behavioural test:
```json
{
  "input": "Find me wireless headphones under $200",
  "expected_tools": ["search_products"],
  "expected_fields": ["name", "price"],
  "criteria": { "tool_called": true, "grounded": true },
  "tier": "smoke"
}
```

- `input` — what you’d type into the chat box.
- `expected_tools` — the tool names you expect the LLM to invoke. Order doesn’t matter; extras are allowed.
- `expected_fields` — fields that should appear in the response, matched against an alias map (more on this in a moment).
- `criteria` — flags the scoring functions read. `tool_called: true` means “this is a factual question; expect at least one tool call.”
- `tier` — `smoke` for the PR gate (the 5–10 cases that must pass on every commit), `full` for the nightly run (the 50–100 cases that can take longer and cost more).
Five cases per agent is enough to catch obvious regressions during development. Twenty per agent is what you want pre-production. A hundred plus drawn from real conversations is what you want once you’re seeing real users.
Figure: the eval pipeline — golden dataset (JSON) → `load_dataset` → `AgentEvaluator` per-case run ↔ LLM provider → scoring (groundedness + correctness + completeness) → `EvalSummary` → CI gate (`--pass-threshold`).
Each case in the dataset goes through the agent, the response is scored on the three axes, the per-case results aggregate into an EvalSummary, and the CI gate checks the overall score against --pass-threshold.
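The `load_dataset` step is mechanical. A minimal sketch, assuming the `EvalCase` dataclass from the wiring section (repeated here so the snippet stands alone):

```python
import json
from dataclasses import dataclass

@dataclass
class EvalCase:  # repeated from the wiring section so this snippet runs standalone
    input: str
    expected_tools: list[str]
    expected_fields: list[str]
    criteria: dict
    tier: str = "full"

def load_dataset(path: str) -> list[EvalCase]:
    """Parse a golden-dataset JSON file (a list of case objects) into EvalCases."""
    with open(path, encoding="utf-8") as f:
        rows = json.load(f)
    return [EvalCase(**row) for row in rows]
```

A useful property of `EvalCase(**row)`: an unknown key in a row fails loudly with a `TypeError`, so a typo like `expected_tool` breaks the run instead of silently scoring 1.0.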
## Reading tool calls from MAF v1 cleanly
The original Part 9 used a defensive `hasattr` chain to extract tool calls because the `AgentRunResult` shape varied between MAF previews. v1 stabilises on `AgentRunResponse` with a documented contract. The clean extraction:
```python
# evals/extractors.py
from agent_framework import AgentRunResponse


def extract_tool_calls(response: AgentRunResponse) -> list[str]:
    """Return the names of every tool the agent invoked, in call order."""
    names: list[str] = []
    for message in response.messages:
        for content in message.contents:
            if content.type == "function_call":
                names.append(content.name)
    return names


def extract_response_text(response: AgentRunResponse) -> str:
    """Concatenate every text content from the assistant's final message(s)."""
    parts: list[str] = []
    for message in response.messages:
        if message.role != "assistant":
            continue
        for content in message.contents:
            if content.type == "text":
                parts.append(content.text)
    return "\n".join(parts)
```

Two things this gives you that the previous shape didn’t:
- Stable across MAF point releases. The `AgentRunResponse` / `ChatMessage` / `ChatMessageContent` triple is the canonical surface; future minor versions can add new content types without breaking this code.
- No `hasattr` guesswork. If `response.messages` ever moves, mypy/pyright catches it at type-check time, not at runtime when an eval silently scores 0 on every case.
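A third, smaller win: the extractors are trivially unit-testable with hand-built doubles. In this sketch, `FakeContent` / `FakeMessage` / `FakeResponse` are hypothetical test doubles (not `agent_framework` classes), and the extractor bodies are repeated in duck-typed form so the snippet runs standalone:

```python
from dataclasses import dataclass

# Hypothetical test doubles mimicking the AgentRunResponse shape — enough to
# unit-test the extractors without a live agent run.
@dataclass
class FakeContent:
    type: str
    name: str = ""
    text: str = ""

@dataclass
class FakeMessage:
    role: str
    contents: list

@dataclass
class FakeResponse:
    messages: list

# Duck-typed copies of the extractors above (annotations dropped so the fakes fit).
def extract_tool_calls(response) -> list[str]:
    return [c.name for m in response.messages for c in m.contents
            if c.type == "function_call"]

def extract_response_text(response) -> str:
    return "\n".join(c.text for m in response.messages if m.role == "assistant"
                     for c in m.contents if c.type == "text")

response = FakeResponse(messages=[
    FakeMessage("assistant", [FakeContent("function_call", name="search_products")]),
    FakeMessage("tool", [FakeContent("function_result")]),
    FakeMessage("assistant", [FakeContent("text", text="Found 5 headphones under $200.")]),
])

assert extract_tool_calls(response) == ["search_products"]
assert extract_response_text(response) == "Found 5 headphones under $200."
```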
The .NET twin reads the equivalent `AgentRunResponse` from `Microsoft.Agents.AI`:
```csharp
public static IReadOnlyList<string> ExtractToolCalls(AgentRunResponse response)
    => response.Messages
        .SelectMany(m => m.Contents)
        .OfType<FunctionCallContent>()
        .Select(c => c.Name)
        .ToList();

public static string ExtractResponseText(AgentRunResponse response)
    => string.Join("\n", response.Messages
        .Where(m => m.Role == ChatRole.Assistant)
        .SelectMany(m => m.Contents)
        .OfType<TextContent>()
        .Select(c => c.Text));
```

Same shape, language-idiomatic. If MAF ships a new content type (image, citation, etc.) you handle it here in one place — the rest of the evaluator stays untouched.
## Scoring
Three small functions. The data classes in the wiring section below carry everything they need.
### Groundedness — binary
```python
def score_groundedness(tools_called: list[str], criteria: dict) -> float:
    if not criteria.get("grounded", True):
        return 1.0  # case explicitly opts out (e.g., a chitchat case)
    if criteria.get("tool_called", True) and not tools_called:
        return 0.0  # expected a tool, got none — hard fail
    return 1.0
```

### Correctness — partial credit, no penalty for extras
```python
def score_correctness(tools_called: list[str], expected: list[str]) -> float:
    if not expected:
        return 1.0
    matched = sum(1 for t in expected if t in tools_called)
    return matched / len(expected)
```

### Completeness — word-boundary alias matching
This is where the original framework had the bug: the substring-based alias match for `"price"` false-positived on text like `"profit margin"`, `"surprise"`, and `"appraisal"`. The fix is one regex per alias, with word-boundary guards:
```python
import re

# Patterns are compiled once and cached; rebuilding per-case is hot in CI.
_alias_patterns: dict[str, list[re.Pattern]] = {}


def _patterns_for(field: str, aliases: list[str]) -> list[re.Pattern]:
    key = field
    if key not in _alias_patterns:
        _alias_patterns[key] = [
            re.compile(rf"(?<![A-Za-z]){re.escape(a)}(?![A-Za-z])", re.IGNORECASE)
            for a in aliases
        ]
    return _alias_patterns[key]


FIELD_ALIASES = {
    "price": ["price", "$", "USD", "cost"],
    "rating": ["rating", "stars", "score"],
    "status": ["status", "state"],
    "tracking_number": ["tracking", "shipment"],
}


def score_completeness(text: str, expected_fields: list[str]) -> tuple[float, list[str], list[str]]:
    if not expected_fields:
        return 1.0, [], []
    found, missing = [], []
    for field in expected_fields:
        aliases = FIELD_ALIASES.get(field, [field])
        patterns = _patterns_for(field, aliases)
        if any(p.search(text) for p in patterns):
            found.append(field)
        else:
            missing.append(field)
    return len(found) / len(expected_fields), found, missing
```

Why `(?<![A-Za-z])...(?![A-Za-z])` instead of `\b`? `\b` sits between a word character and a non-word character, and `$` is itself a non-word character — so `\b\$` only matches when `$` directly follows a letter or digit, which is exactly backwards for a price string like `"$199"`. The lookarounds are stricter and symmetric: only ASCII letter neighbours block the match, so `"$199"` matches and `"a$b"` doesn’t. Test thoroughly with your own data — character-class assumptions are domain-specific.
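Exercising the three scorers together shows the contract each one keeps. The function bodies are repeated here in condensed form so the snippet runs standalone:

```python
import re

# Condensed copies of the scorers above, so this demo runs standalone.
def score_groundedness(tools_called: list[str], criteria: dict) -> float:
    if not criteria.get("grounded", True):
        return 1.0
    return 0.0 if criteria.get("tool_called", True) and not tools_called else 1.0

def score_correctness(tools_called: list[str], expected: list[str]) -> float:
    if not expected:
        return 1.0
    return sum(1 for t in expected if t in tools_called) / len(expected)

FIELD_ALIASES = {"price": ["price", "$", "USD", "cost"],
                 "rating": ["rating", "stars", "score"],
                 "status": ["status", "state"]}

def score_completeness(text: str, fields: list[str]):
    if not fields:
        return 1.0, [], []
    found, missing = [], []
    for f in fields:
        pats = [re.compile(rf"(?<![A-Za-z]){re.escape(a)}(?![A-Za-z])", re.IGNORECASE)
                for a in FIELD_ALIASES.get(f, [f])]
        (found if any(p.search(text) for p in pats) else missing).append(f)
    return len(found) / len(fields), found, missing

# Silent tool-skip on a factual question is a hard groundedness fail:
assert score_groundedness([], {"tool_called": True}) == 0.0
# 1 of 2 expected tools is 0.5; extra tools are never penalized:
assert score_correctness(["search_products"], ["search_products", "get_reviews"]) == 0.5
assert score_correctness(["search_products", "log_event"], ["search_products"]) == 1.0
# Word-boundary aliases: "$348" and "rating" hit; "status" is reported missing:
_, found, missing = score_completeness(
    "The Sony WH-1000XM5 is $348 with a 4.7 rating.", ["price", "rating", "status"])
assert found == ["price", "rating"] and missing == ["status"]
```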
## Wiring it together
The evaluator class is small. The interesting bits are above; the orchestration is mechanical.
```python
# evals/evaluator.py
import time
from dataclasses import dataclass

from evals.extractors import extract_response_text, extract_tool_calls
# (score_groundedness / score_correctness / score_completeness imported from
# wherever the scorers live in your tree, e.g. evals.scoring)


@dataclass
class EvalCase:
    input: str
    expected_tools: list[str]
    expected_fields: list[str]
    criteria: dict[str, bool]
    tier: str = "full"  # "smoke" or "full"


@dataclass
class EvalResult:
    case: EvalCase
    groundedness: float
    correctness: float
    completeness: float
    overall: float
    tools_called: list[str]
    fields_missing: list[str]
    latency_ms: int
    tokens_in: int
    tokens_out: int
    error: str | None = None

    @property
    def passed(self) -> bool:
        return self.overall >= 0.7


class AgentEvaluator:
    WEIGHTS = (0.4, 0.4, 0.2)  # groundedness, correctness, completeness

    def __init__(self, agent, *, pass_threshold: float = 0.7):
        self.agent = agent
        self.pass_threshold = pass_threshold

    async def evaluate_case(self, case: EvalCase) -> EvalResult:
        t0 = time.perf_counter()
        try:
            response = await self.agent.run(case.input)
        except Exception as e:
            return EvalResult(case=case, groundedness=0.0, correctness=0.0,
                              completeness=0.0, overall=0.0, tools_called=[],
                              fields_missing=case.expected_fields, latency_ms=0,
                              tokens_in=0, tokens_out=0, error=repr(e))
        latency = int((time.perf_counter() - t0) * 1000)
        tools = extract_tool_calls(response)
        text = extract_response_text(response)
        usage = response.usage_details  # Ch07 attribute, populated by the LLM provider
        g = score_groundedness(tools, case.criteria)
        c = score_correctness(tools, case.expected_tools)
        m, _, missing = score_completeness(text, case.expected_fields)
        overall = self.WEIGHTS[0] * g + self.WEIGHTS[1] * c + self.WEIGHTS[2] * m
        return EvalResult(case=case, groundedness=g, correctness=c, completeness=m,
                          overall=overall, tools_called=tools, fields_missing=missing,
                          latency_ms=latency, tokens_in=usage.input_tokens,
                          tokens_out=usage.output_tokens)
```

Pulling token counts from `response.usage_details` (rather than estimating from response length) gives accurate cost numbers for free — the LLM provider populates them.
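`evaluate_dataset` and `EvalSummary` are the mechanical bits left out above. A minimal sketch of the aggregation — field names assumed, mirroring the report output below; the real summary would also carry latency percentiles, token totals, and estimated cost:

```python
from dataclasses import dataclass

@dataclass
class EvalSummary:
    results: list  # list of EvalResult

    @property
    def overall(self) -> float:
        if not self.results:
            return 0.0
        return sum(r.overall for r in self.results) / len(self.results)

    @property
    def passed(self) -> int:
        return sum(1 for r in self.results if r.passed)

async def evaluate_dataset(evaluator, cases) -> EvalSummary:
    # Sequential on purpose: rate limits stay predictable and per-case cost
    # attribution is trivial. Parallelise only when the dataset gets big.
    return EvalSummary([await evaluator.evaluate_case(c) for c in cases])
```

In the real class this is a method (`self.evaluate_case`); it is shown free-standing here to keep the sketch short.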
## CLI runner with smoke / full modes
The smoke / full split is what makes evals viable in CI without burning the budget on every PR.
```python
# evals/run_evals.py
import argparse
import asyncio
import importlib
import json
import sys
from dataclasses import asdict
# (imports of EvalCase, AgentEvaluator, load_dataset, print_report from the
# evaluator module elided)

AGENT_FACTORIES = {
    "product-discovery": ("product_discovery.agent", "create_product_discovery_agent"),
    "order-management": ("order_management.agent", "create_order_management_agent"),
    # ... one row per agent
}


def parse_args():
    p = argparse.ArgumentParser()
    p.add_argument("--agent", required=True, choices=AGENT_FACTORIES.keys())
    p.add_argument("--dataset", required=True)
    tier = p.add_mutually_exclusive_group()
    tier.add_argument("--smoke", action="store_true",
                      help="Run only cases with tier='smoke'. Use in PR CI.")
    tier.add_argument("--full", action="store_true",
                      help="Run every case regardless of tier. Use in nightly CI.")
    p.add_argument("--pass-threshold", type=float, default=0.7)
    p.add_argument("--output-json", help="Write summary to this path for CI artefact upload")
    p.add_argument("--verbose", action="store_true")
    return p.parse_args()


def filter_by_tier(cases: list[EvalCase], smoke: bool, full: bool) -> list[EvalCase]:
    if full or (not smoke and not full):
        return cases
    return [c for c in cases if c.tier == "smoke"]


async def main():
    args = parse_args()
    module, factory = AGENT_FACTORIES[args.agent]
    agent = await getattr(importlib.import_module(module), factory)()
    cases = filter_by_tier(load_dataset(args.dataset), args.smoke, args.full)
    if not cases:
        print("No cases match the requested tier", file=sys.stderr)
        sys.exit(2)
    evaluator = AgentEvaluator(agent, pass_threshold=args.pass_threshold)
    summary = await evaluator.evaluate_dataset(cases)
    print_report(summary, verbose=args.verbose)
    if args.output_json:
        with open(args.output_json, "w") as f:
            json.dump(asdict(summary), f, indent=2, default=str)
    sys.exit(0 if summary.overall >= args.pass_threshold else 1)


if __name__ == "__main__":
    asyncio.run(main())
```

PR CI runs `--smoke` (5 cases per agent, ~$0.05 per agent, finishes in 30 seconds). Nightly CI runs `--full` (50+ cases per agent, $1–2 per agent, finishes in 5 minutes). Same code, same dataset file, different tier filter.
The exit code is the gate: 0 when overall ≥ threshold, 1 when below, 2 when the runner itself errored (no cases matched, agent factory missing, etc.). 1 and 2 mean different things to the CI workflow — investigate 2 first because it’s a config bug, not a regression.
## Output
```text
======================================================================
EVALUATION REPORT — product-discovery (smoke tier)
======================================================================
Dataset:         evals/datasets/product_discovery.json
Cases:           5 (5 smoke / 0 skipped)
Passed:          4 (>= 0.70 threshold)
Failed:          1
----------------------------------------------------------------------
Groundedness:    100.0%
Correctness:      80.0%
Completeness:     90.0%
Overall:          90.0%  PASS
----------------------------------------------------------------------
Latency p50/p95: 1840ms / 3120ms
Tokens (in/out): 4,200 / 1,800
Estimated cost:  $0.0228
======================================================================
```

Verbose mode (`--verbose`) drops one block per case underneath the summary, with per-axis scores, the tools that were called, and the fields that were missing. That’s the view you want when you’re debugging a specific failure.
## CI workflow
This is the same file from the original Part 9, with the tier flag added and the secret hardened (the original had a hardcoded Postgres password — already fixed in the old article):
```yaml
# .github/workflows/agent-evals.yml
name: Agent Evals
on:
  pull_request:
    paths: ['agents/**', 'evals/**']
jobs:
  smoke:
    runs-on: ubuntu-latest
    services:
      postgres:
        image: pgvector/pgvector:pg16
        env:
          POSTGRES_DB: ecommerce_agents
          POSTGRES_USER: ecommerce
          POSTGRES_PASSWORD: ${{ secrets.EVAL_POSTGRES_PASSWORD }}
        ports: ['5432:5432']
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: '3.12' }
      - uses: astral-sh/setup-uv@v3
      - name: Install
        run: cd agents && uv sync
      - name: Seed DB
        run: cd agents && uv run python -m scripts.seed
      - name: Smoke evals (every agent)
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          DATABASE_URL: postgresql://ecommerce:${{ secrets.EVAL_POSTGRES_PASSWORD }}@localhost:5432/ecommerce_agents
        run: |
          cd agents
          for agent in product-discovery order-management pricing-promotions; do
            uv run python -m evals.run_evals \
              --agent "$agent" \
              --dataset evals/datasets/$agent.json \
              --smoke \
              --output-json eval-$agent.json \
              --pass-threshold 0.7
          done
      - name: Upload results
        if: always()
        uses: actions/upload-artifact@v4
        with: { name: eval-results, path: agents/eval-*.json }
```

The nightly workflow is identical except `on: schedule: [cron: '0 4 * * *']` and `--full` instead of `--smoke`.
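That diff, sketched as a fragment (the nightly workflow's filename here is an assumption):

```yaml
# .github/workflows/agent-evals-nightly.yml (assumed filename) — only the
# trigger and the tier flag differ from the PR workflow above.
on:
  schedule:
    - cron: '0 4 * * *'   # 04:00 UTC, after the day's merges land
# ...same services and steps, with `--full` in place of `--smoke` in the eval loop.
```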
## When to run evals
| Trigger | Tier | Reasoning |
|---|---|---|
| System-prompt edit | Smoke (affected agent) | PR check; immediate signal |
| Tool function refactor | Smoke (affected agent) | PR check; tool selection most likely affected |
| LLM model version bump | Full (every agent) | Manual or nightly; cross-agent impact |
| MAF framework upgrade | Full (every agent) | Same as above |
| New tool added | Smoke (the agent that gained it) | PR check |
| DB schema change touching tools | Smoke (every agent with affected tool) | PR check |
## Cost
Five-case smoke against a Sonnet 4.6 / GPT-4.1 agent runs $0.02–0.05 depending on response length. Three things to keep in mind:
- Smoke on every PR across five agents is ~$0.50/PR. At 50 PRs/week that’s $25/week — about a coffee.
- Full nightly (50 cases × 5 agents) is ~$2.50/night — $75/month.
- Model migration (running full evals against the old and new models for comparison) is a one-off $5–10.
These are 2026-04 numbers; rebuild the math when prices move. The evaluator surfaces the per-run cost so the trend is visible from the CI artefacts.
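The per-run estimate is a two-line computation from the usage the evaluator already records. A sketch — the per-million-token prices below are placeholder assumptions, not current list prices; at $2/M input and $8/M output they happen to reproduce the $0.0228 figure in the sample report above:

```python
# Placeholder prices (USD per million tokens) — substitute your provider's
# current price sheet; these values are assumptions for illustration.
PRICE_IN_PER_MTOK = 2.00
PRICE_OUT_PER_MTOK = 8.00

def estimate_cost(tokens_in: int, tokens_out: int) -> float:
    """Dollar cost of one eval run from the token counts in EvalResult."""
    return (tokens_in * PRICE_IN_PER_MTOK + tokens_out * PRICE_OUT_PER_MTOK) / 1_000_000

print(f"${estimate_cost(4_200, 1_800):.4f}")  # the sample report's token counts → $0.0228
```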
## Gotchas
- Non-determinism within a run. Two runs of the same case can score differently. For a single regression run, score variance is usually small (±5%). For tracking a metric over time, run each case 3× and take the median; the runner has a `--repeat 3` flag (omitted above for brevity) that does this.
- Tool calls vs tool attempts. If the LLM emits a malformed tool call, MAF surfaces it as a `function_call` content with the wrong arguments. The extractor counts it; the agent’s tool actually didn’t run. Decide which behaviour you want — usually you want to count attempts, not successes, because a malformed call is still a model decision worth catching.
- Streaming responses. The eval calls the non-streaming `agent.run()`, not `agent.run_stream()`. Token counts and tool-call lists are populated only on the final `AgentRunResponse`; streaming chunks don’t carry the totals. If your production path is streaming-only, the eval still gives you the right behavioural signal — the LLM’s tool selection doesn’t change between streaming and non-streaming.
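The `--repeat` implementation isn't shown above; a minimal sketch of the median approach, assuming the evaluator's `evaluate_case` from the wiring section:

```python
async def evaluate_case_repeated(evaluator, case, repeat: int = 3):
    """Run one case `repeat` times and keep the median result by overall score.

    Median rather than mean, so a single fluky run — good or bad — can't
    move the tracked number.
    """
    results = [await evaluator.evaluate_case(case) for _ in range(repeat)]
    results.sort(key=lambda r: r.overall)
    return results[len(results) // 2]
```

Odd repeat counts keep the semantics crisp; with an even count this picks the upper-middle run.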
## What changes for the capstone
The capstone today (`agents/python/evals/`) ships the original Part 9 framework. Phase 9 of the refactor plan migrates it:

- `agents/python/evals/extractors.py` — replace the `hasattr` chain with the canonical `AgentRunResponse` extraction.
- `agents/python/evals/scoring.py` — swap the substring alias matcher for the regex pattern above.
- `agents/python/evals/datasets/*.json` — add `"tier": "smoke" | "full"` to every case.
- `.github/workflows/agent-evals.yml` — split into smoke (PR) and nightly (schedule) jobs.
No agent code changes. The eval pipeline lives entirely in its own subtree.
## What’s next
That closes the production-quality loop the original Part 9 left open. The next chapter returns to the original e-commerce series with the prompt-engineering port — five concerns, YAML composition, role-aware overrides — modernised against MAF v1’s context-provider surface.

