Series note — Appendix to MAF v1: Python and .NET. Sits after Ch22 — Python ↔ .NET asymmetries. Supersedes the original Python-only Part 9 — Evaluating Agent Quality — this version updates the tool-call extraction to MAF v1’s canonical `AgentRunResponse` shape, switches alias matching from substrings to word boundaries (the original false-positived on “profit margin” matching the “price” alias), and ships a smoke / full split for CI vs nightly runs. A .NET equivalent lives alongside the Python implementation.
Capstone code — This chapter doesn’t ship a standalone tutorial folder. The runnable evaluation framework lives in the e-commerce-agents capstone at `agents/python/evals/`: `evaluator.py` (scoring + run loop), `run_evals.py` (CLI), `datasets/` (golden cases).
## Why this chapter
The first time you change a system prompt and ship it, three things happen: an agent that was correct yesterday hallucinates a price, a tool that was being called stops being called, and you find out four days later from a customer ticket. None of your unit tests catch it because the unit tests cover tools, not tool selection. The orchestration layer — “which tool, with which arguments, at which step” — is what the LLM owns, and that’s where regressions live.
This chapter builds the eval pipeline that closes the gap. Behavioural assertions instead of string matching. A golden dataset per agent. A scoring model that catches the three failure modes that actually matter. A CLI runner that returns non-zero on regression. A CI workflow that runs the eval on every PR.
## Prerequisites
- Completed Ch01 — Your First Agent and Ch02 — Adding Tools.
- Familiar with Ch07 — Observability (we read tool-call lists from the run result, not from spans, but the mental model overlaps).
- Python 3.12+ with `uv`, or .NET 9 with the `dotnet` CLI.
- An OpenAI or Azure OpenAI key — evals always run real LLM calls. Budget ~$0.02–0.05 per 5-case run; see the cost-tracking section.
## What you’ll learn
- Why “give X, expect Y” tests don’t work for agents, and what does work.
- The three behavioural axes — groundedness, correctness, completeness — and the weighting that pushed the capstone to 90%+ pass rates.
- How to extract tool-call lists from MAF v1’s `AgentRunResponse` cleanly (without `hasattr` guessing).
- Why the original alias matcher false-positived on `"profit margin"` matching `"price"`, and the four-line fix.
- How to split a dataset into smoke (PR-gate) vs full (nightly) tiers without duplicating cases.
## Why traditional testing fails for agents
Three approaches don’t survive contact with a non-deterministic LLM:
- Unit tests on tool functions — the tool is testable; the decision to call it is not. `search_products(query="headphones", max_price=300)` returns deterministic rows, but the LLM might call `semantic_search` instead, or set `max_price=299.99`, or call `search_products` twice.
- Integration tests on response strings — “Here are some headphones under $300” and “I found 5 wireless headphones within your budget” are both correct. A string-match test fails on one of them.
- Snapshot tests — record a “golden” response, compare future runs. LLM responses vary on every invocation. You’re updating snapshots constantly, and that defeats the test.
What works is behavioural assertions: instead of “the response contains the string `'Sony WH-1000XM5 - $348.00'`,” you assert that the agent called `search_products`, the `max_price` parameter was ≤ 300, and the response mentions a product name and price. Those properties hold across phrasings, models, and runs. They catch the failures that matter — silent tool-skip, wrong tool, missing field — and ignore the ones that don’t.
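A behavioural assertion can be sketched as follows — the helper name and the shapes of `tools_called` / `tool_args` are illustrative here, not the capstone’s real API:

```python
# Hypothetical behavioural assertion — `tools_called`, `tool_args`, and
# `response_text` are illustrative names, not the capstone's real API.
def assert_behaviour(tools_called: list[str],
                     tool_args: dict[str, dict],
                     response_text: str) -> None:
    # The right tool was selected — not a specific response string.
    assert "search_products" in tools_called
    # The user's constraint survived into the arguments.
    assert tool_args["search_products"]["max_price"] <= 300
    # Some price appears in the answer, in any phrasing.
    assert any(ch.isdigit() for ch in response_text)

# Two very different phrasings both pass, because we assert properties:
assert_behaviour(
    ["search_products"],
    {"search_products": {"max_price": 300}},
    "I found 5 wireless headphones within your $300 budget",
)
```

A hard-coded string match would accept exactly one phrasing; the property-based version accepts any of them and still fails the moment the tool call disappears.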
## The three axes
| Axis | What it asks | Score type | Weight |
|---|---|---|---|
| Groundedness | Did the agent call any tool before answering a factual question? | Binary (0 or 1) | 40% |
| Correctness | Did it call the right tool(s)? | Partial (0–1) | 40% |
| Completeness | Did the response contain the fields the user asked for? | Partial (0–1) | 20% |
Groundedness is binary because partial grounding doesn’t exist — either the agent consulted real data or it made something up. Correctness gives partial credit because multi-step queries can reasonably take alternative routes; if the agent calls 1 of 2 expected tools, that’s 0.5. We deliberately don’t penalize extra tool calls — being thorough isn’t being wrong.
Completeness is weighted lower because an incomplete response is recoverable (the user follows up); a hallucinated or wrong-tool response is not.
The 40/40/20 split came from running ~200 real eval cases against the capstone and tuning until the score correlated with what humans flagged as “this answer is broken.” Other splits work; what matters is that you commit to one and watch the trend.
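The arithmetic is worth seeing once. A minimal sketch of the weighted sum (the same weights tuple appears on the evaluator class later in this chapter):

```python
# The 40/40/20 weighted sum from the table above.
WEIGHTS = (0.4, 0.4, 0.2)  # groundedness, correctness, completeness

def overall(g: float, c: float, m: float) -> float:
    return WEIGHTS[0] * g + WEIGHTS[1] * c + WEIGHTS[2] * m

# Grounded, called 1 of 2 expected tools, covered every requested field:
# 0.4·1.0 + 0.4·0.5 + 0.2·1.0 ≈ 0.8 — passes a 0.7 gate.
print(round(overall(1.0, 0.5, 1.0), 2))

# Ungrounded on a factual question (and therefore wrong-tool too):
# 0.4·0 + 0.4·0 + 0.2·1.0 = 0.2 — a hard fail, even with a "complete" answer.
print(round(overall(0.0, 0.0, 1.0), 2))
```

Note how the binary groundedness axis dominates: once it zeroes out, no amount of completeness can rescue the case past a 0.7 threshold.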
## Golden datasets
A golden dataset is a JSON file. Each row is a behavioural test:
```json
{
  "input": "Find me wireless headphones under $200",
  "expected_tools": ["search_products"],
  "expected_fields": ["name", "price"],
  "criteria": { "tool_called": true, "grounded": true },
  "tier": "smoke"
}
```

- `input` — what you’d type into the chat box.
- `expected_tools` — the tool names you expect the LLM to invoke. Order doesn’t matter; extras are allowed.
- `expected_fields` — fields that should appear in the response, matched against an alias map (more on this in a moment).
- `criteria` — flags the scoring functions read. `tool_called: true` means “this is a factual question; expect at least one tool call.”
- `tier` — `smoke` for the PR gate (the 5–10 cases that must pass on every commit), `full` for the nightly run (the 50–100 cases that can take longer and cost more).
Five cases per agent is enough to catch obvious regressions during development. Twenty per agent is what you want pre-production. A hundred plus drawn from real conversations is what you want once you’re seeing real users.
Figure: the eval pipeline — golden dataset (JSON) → `load_dataset` → `AgentEvaluator` per-case run ↔ LLM provider → scoring (groundedness + correctness + completeness) → `EvalSummary` → CI gate (`--pass-threshold`).
Each case in the dataset goes through the agent, the response is scored on the three axes, the per-case results aggregate into an EvalSummary, and the CI gate checks the overall score against --pass-threshold.
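The `load_dataset` step is mechanical. A minimal sketch, assuming the `EvalCase` dataclass from the wiring section (repeated here so the snippet stands alone):

```python
import json
from dataclasses import dataclass

@dataclass
class EvalCase:  # repeated from the wiring section so this snippet runs standalone
    input: str
    expected_tools: list[str]
    expected_fields: list[str]
    criteria: dict
    tier: str = "full"

def load_dataset(path: str) -> list[EvalCase]:
    """Parse a golden-dataset JSON file (a list of case objects) into EvalCases."""
    with open(path, encoding="utf-8") as f:
        rows = json.load(f)
    return [EvalCase(**row) for row in rows]
```

A useful property of `EvalCase(**row)`: an unknown key in a row fails loudly with a `TypeError`, so a typo like `expected_tool` breaks the run instead of silently scoring 1.0.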
## Reading tool calls from MAF v1 cleanly
The original Part 9 used a defensive `hasattr` chain to extract tool calls because the `AgentRunResult` shape varied between MAF previews. v1 stabilises on `AgentRunResponse` with a documented contract. The clean extraction:
```python
# evals/extractors.py
from agent_framework import AgentRunResponse


def extract_tool_calls(response: AgentRunResponse) -> list[str]:
    """Return the names of every tool the agent invoked, in call order."""
    names: list[str] = []
    for message in response.messages:
        for content in message.contents:
            if content.type == "function_call":
                names.append(content.name)
    return names


def extract_response_text(response: AgentRunResponse) -> str:
    """Concatenate every text content from the assistant's final message(s)."""
    parts: list[str] = []
    for message in response.messages:
        if message.role != "assistant":
            continue
        for content in message.contents:
            if content.type == "text":
                parts.append(content.text)
    return "\n".join(parts)
```

Two things this gives you that the previous shape didn’t:
- Stable across MAF point releases. The `AgentRunResponse` / `ChatMessage` / `ChatMessageContent` triple is the canonical surface; future minor versions can add new content types without breaking this code.
- No `hasattr` guesswork. If `response.messages` ever moves, mypy/pyright catches it at type-check time, not at runtime when an eval silently scores 0 on every case.
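A third, smaller win: the extractors are trivially unit-testable with hand-built doubles. In this sketch, `FakeContent` / `FakeMessage` / `FakeResponse` are hypothetical test doubles (not `agent_framework` classes), and the extractor bodies are repeated in duck-typed form so the snippet runs standalone:

```python
from dataclasses import dataclass

# Hypothetical test doubles mimicking the AgentRunResponse shape — enough to
# unit-test the extractors without a live agent run.
@dataclass
class FakeContent:
    type: str
    name: str = ""
    text: str = ""

@dataclass
class FakeMessage:
    role: str
    contents: list

@dataclass
class FakeResponse:
    messages: list

# Duck-typed copies of the extractors above (annotations dropped so the fakes fit).
def extract_tool_calls(response) -> list[str]:
    return [c.name for m in response.messages for c in m.contents
            if c.type == "function_call"]

def extract_response_text(response) -> str:
    return "\n".join(c.text for m in response.messages if m.role == "assistant"
                     for c in m.contents if c.type == "text")

response = FakeResponse(messages=[
    FakeMessage("assistant", [FakeContent("function_call", name="search_products")]),
    FakeMessage("tool", [FakeContent("function_result")]),
    FakeMessage("assistant", [FakeContent("text", text="Found 5 headphones under $200.")]),
])

assert extract_tool_calls(response) == ["search_products"]
assert extract_response_text(response) == "Found 5 headphones under $200."
```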
The .NET twin reads the equivalent `AgentRunResponse` from `Microsoft.Agents.AI`:
```csharp
public static IReadOnlyList<string> ExtractToolCalls(AgentRunResponse response)
    => response.Messages
        .SelectMany(m => m.Contents)
        .OfType<FunctionCallContent>()
        .Select(c => c.Name)
        .ToList();

public static string ExtractResponseText(AgentRunResponse response)
    => string.Join("\n", response.Messages
        .Where(m => m.Role == ChatRole.Assistant)
        .SelectMany(m => m.Contents)
        .OfType<TextContent>()
        .Select(c => c.Text));
```

Same shape, language-idiomatic. If MAF ships a new content type (image, citation, etc.) you handle it here in one place — the rest of the evaluator stays untouched.
## Scoring
Three small functions. The data classes in the wiring section below carry everything they need.
### Groundedness — binary
```python
def score_groundedness(tools_called: list[str], criteria: dict) -> float:
    if not criteria.get("grounded", True):
        return 1.0  # case explicitly opts out (e.g., a chitchat case)
    if criteria.get("tool_called", True) and not tools_called:
        return 0.0  # expected a tool, got none — hard fail
    return 1.0
```

### Correctness — partial credit, no penalty for extras
```python
def score_correctness(tools_called: list[str], expected: list[str]) -> float:
    if not expected:
        return 1.0
    matched = sum(1 for t in expected if t in tools_called)
    return matched / len(expected)
```

### Completeness — word-boundary alias matching
This is where the original framework had the bug: the substring-based alias match for `"price"` false-positived on text like `"profit margin"`, `"surprise"`, and `"appraisal"`. The fix is one regex per alias, with word-boundary guards:
```python
import re

# Patterns are compiled once and cached; rebuilding per-case is hot in CI.
_alias_patterns: dict[str, list[re.Pattern]] = {}


def _patterns_for(field: str, aliases: list[str]) -> list[re.Pattern]:
    key = field
    if key not in _alias_patterns:
        _alias_patterns[key] = [
            re.compile(rf"(?<![A-Za-z]){re.escape(a)}(?![A-Za-z])", re.IGNORECASE)
            for a in aliases
        ]
    return _alias_patterns[key]


FIELD_ALIASES = {
    "price": ["price", "$", "USD", "cost"],
    "rating": ["rating", "stars", "score"],
    "status": ["status", "state"],
    "tracking_number": ["tracking", "shipment"],
}


def score_completeness(text: str, expected_fields: list[str]) -> tuple[float, list[str], list[str]]:
    if not expected_fields:
        return 1.0, [], []
    found, missing = [], []
    for field in expected_fields:
        aliases = FIELD_ALIASES.get(field, [field])
        patterns = _patterns_for(field, aliases)
        if any(p.search(text) for p in patterns):
            found.append(field)
        else:
            missing.append(field)
    return len(found) / len(expected_fields), found, missing
```

Why `(?<![A-Za-z])...(?![A-Za-z])` instead of `\b`? `\b` sits between a word character and a non-word character, and `$` is itself a non-word character — so `\b\$` only matches when `$` directly follows a letter or digit, which is exactly backwards for a price string like `"$199"`. The lookarounds are stricter and symmetric: only ASCII letter neighbours block the match, so `"$199"` matches and `"a$b"` doesn’t. Test thoroughly with your own data — character-class assumptions are domain-specific.
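Exercising the three scorers together shows the contract each one keeps. The function bodies are repeated here in condensed form so the snippet runs standalone:

```python
import re

# Condensed copies of the scorers above, so this demo runs standalone.
def score_groundedness(tools_called: list[str], criteria: dict) -> float:
    if not criteria.get("grounded", True):
        return 1.0
    return 0.0 if criteria.get("tool_called", True) and not tools_called else 1.0

def score_correctness(tools_called: list[str], expected: list[str]) -> float:
    if not expected:
        return 1.0
    return sum(1 for t in expected if t in tools_called) / len(expected)

FIELD_ALIASES = {"price": ["price", "$", "USD", "cost"],
                 "rating": ["rating", "stars", "score"],
                 "status": ["status", "state"]}

def score_completeness(text: str, fields: list[str]):
    if not fields:
        return 1.0, [], []
    found, missing = [], []
    for f in fields:
        pats = [re.compile(rf"(?<![A-Za-z]){re.escape(a)}(?![A-Za-z])", re.IGNORECASE)
                for a in FIELD_ALIASES.get(f, [f])]
        (found if any(p.search(text) for p in pats) else missing).append(f)
    return len(found) / len(fields), found, missing

# Silent tool-skip on a factual question is a hard groundedness fail:
assert score_groundedness([], {"tool_called": True}) == 0.0
# 1 of 2 expected tools is 0.5; extra tools are never penalized:
assert score_correctness(["search_products"], ["search_products", "get_reviews"]) == 0.5
assert score_correctness(["search_products", "log_event"], ["search_products"]) == 1.0
# Word-boundary aliases: "$348" and "rating" hit; "status" is reported missing:
_, found, missing = score_completeness(
    "The Sony WH-1000XM5 is $348 with a 4.7 rating.", ["price", "rating", "status"])
assert found == ["price", "rating"] and missing == ["status"]
```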
## Wiring it together
The evaluator class is small. The interesting bits are above; the orchestration is mechanical.
```python
# evals/evaluator.py
import time
from dataclasses import dataclass

from evals.extractors import extract_response_text, extract_tool_calls
# (score_groundedness / score_correctness / score_completeness imported from
# wherever the scorers live in your tree, e.g. evals.scoring)


@dataclass
class EvalCase:
    input: str
    expected_tools: list[str]
    expected_fields: list[str]
    criteria: dict[str, bool]
    tier: str = "full"  # "smoke" or "full"


@dataclass
class EvalResult:
    case: EvalCase
    groundedness: float
    correctness: float
    completeness: float
    overall: float
    tools_called: list[str]
    fields_missing: list[str]
    latency_ms: int
    tokens_in: int
    tokens_out: int
    error: str | None = None

    @property
    def passed(self) -> bool:
        return self.overall >= 0.7


class AgentEvaluator:
    WEIGHTS = (0.4, 0.4, 0.2)  # groundedness, correctness, completeness

    def __init__(self, agent, *, pass_threshold: float = 0.7):
        self.agent = agent
        self.pass_threshold = pass_threshold

    async def evaluate_case(self, case: EvalCase) -> EvalResult:
        t0 = time.perf_counter()
        try:
            response = await self.agent.run(case.input)
        except Exception as e:
            return EvalResult(case=case, groundedness=0.0, correctness=0.0,
                              completeness=0.0, overall=0.0, tools_called=[],
                              fields_missing=case.expected_fields, latency_ms=0,
                              tokens_in=0, tokens_out=0, error=repr(e))
        latency = int((time.perf_counter() - t0) * 1000)
        tools = extract_tool_calls(response)
        text = extract_response_text(response)
        usage = response.usage_details  # Ch07 attribute, populated by the LLM provider
        g = score_groundedness(tools, case.criteria)
        c = score_correctness(tools, case.expected_tools)
        m, _, missing = score_completeness(text, case.expected_fields)
        overall = self.WEIGHTS[0] * g + self.WEIGHTS[1] * c + self.WEIGHTS[2] * m
        return EvalResult(case=case, groundedness=g, correctness=c, completeness=m,
                          overall=overall, tools_called=tools, fields_missing=missing,
                          latency_ms=latency, tokens_in=usage.input_tokens,
                          tokens_out=usage.output_tokens)
```

Pulling token counts from `response.usage_details` (rather than estimating from response length) gives accurate cost numbers for free — the LLM provider populates them.
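`evaluate_dataset` and `EvalSummary` are the mechanical bits left out above. A minimal sketch of the aggregation — field names assumed, mirroring the report output below; the real summary would also carry latency percentiles, token totals, and estimated cost:

```python
from dataclasses import dataclass

@dataclass
class EvalSummary:
    results: list  # list of EvalResult

    @property
    def overall(self) -> float:
        if not self.results:
            return 0.0
        return sum(r.overall for r in self.results) / len(self.results)

    @property
    def passed(self) -> int:
        return sum(1 for r in self.results if r.passed)

async def evaluate_dataset(evaluator, cases) -> EvalSummary:
    # Sequential on purpose: rate limits stay predictable and per-case cost
    # attribution is trivial. Parallelise only when the dataset gets big.
    return EvalSummary([await evaluator.evaluate_case(c) for c in cases])
```

In the real class this is a method (`self.evaluate_case`); it is shown free-standing here to keep the sketch short.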
## CLI runner with smoke / full modes
The smoke / full split is what makes evals viable in CI without burning the budget on every PR.
```python
# evals/run_evals.py
import argparse
import asyncio
import importlib
import json
import sys
from dataclasses import asdict
# (imports of EvalCase, AgentEvaluator, load_dataset, print_report from the
# evaluator module elided)

AGENT_FACTORIES = {
    "product-discovery": ("product_discovery.agent", "create_product_discovery_agent"),
    "order-management": ("order_management.agent", "create_order_management_agent"),
    # ... one row per agent
}


def parse_args():
    p = argparse.ArgumentParser()
    p.add_argument("--agent", required=True, choices=AGENT_FACTORIES.keys())
    p.add_argument("--dataset", required=True)
    tier = p.add_mutually_exclusive_group()
    tier.add_argument("--smoke", action="store_true",
                      help="Run only cases with tier='smoke'. Use in PR CI.")
    tier.add_argument("--full", action="store_true",
                      help="Run every case regardless of tier. Use in nightly CI.")
    p.add_argument("--pass-threshold", type=float, default=0.7)
    p.add_argument("--output-json", help="Write summary to this path for CI artefact upload")
    p.add_argument("--verbose", action="store_true")
    return p.parse_args()


def filter_by_tier(cases: list[EvalCase], smoke: bool, full: bool) -> list[EvalCase]:
    if full or (not smoke and not full):
        return cases
    return [c for c in cases if c.tier == "smoke"]


async def main():
    args = parse_args()
    module, factory = AGENT_FACTORIES[args.agent]
    agent = await getattr(importlib.import_module(module), factory)()
    cases = filter_by_tier(load_dataset(args.dataset), args.smoke, args.full)
    if not cases:
        print("No cases match the requested tier", file=sys.stderr)
        sys.exit(2)
    evaluator = AgentEvaluator(agent, pass_threshold=args.pass_threshold)
    summary = await evaluator.evaluate_dataset(cases)
    print_report(summary, verbose=args.verbose)
    if args.output_json:
        with open(args.output_json, "w") as f:
            json.dump(asdict(summary), f, indent=2, default=str)
    sys.exit(0 if summary.overall >= args.pass_threshold else 1)


if __name__ == "__main__":
    asyncio.run(main())
```

PR CI runs `--smoke` (5 cases per agent, ~$0.05 per agent, finishes in 30 seconds). Nightly CI runs `--full` (50+ cases per agent, $1–2 per agent, finishes in 5 minutes). Same code, same dataset file, different tier filter.
The exit code is the gate: 0 when overall ≥ threshold, 1 when below, 2 when the runner itself errored (no cases matched, agent factory missing, etc.). 1 and 2 mean different things to the CI workflow — investigate 2 first because it’s a config bug, not a regression.
## Output
```text
======================================================================
EVALUATION REPORT — product-discovery (smoke tier)
======================================================================
Dataset:         evals/datasets/product_discovery.json
Cases:           5 (5 smoke / 0 skipped)
Passed:          4 (>= 0.70 threshold)
Failed:          1
----------------------------------------------------------------------
Groundedness:    100.0%
Correctness:      80.0%
Completeness:     90.0%
Overall:          90.0%  PASS
----------------------------------------------------------------------
Latency p50/p95: 1840ms / 3120ms
Tokens (in/out): 4,200 / 1,800
Estimated cost:  $0.0228
======================================================================
```

Verbose mode (`--verbose`) drops one block per case underneath the summary, with per-axis scores, the tools that were called, and the fields that were missing. That’s the view you want when you’re debugging a specific failure.
## CI workflow
This is the same file from the original Part 9, with the tier flag added and the secret hardened (the original had a hardcoded Postgres password — already fixed in the old article):
```yaml
# .github/workflows/agent-evals.yml
name: Agent Evals
on:
  pull_request:
    paths: ['agents/**', 'evals/**']
jobs:
  smoke:
    runs-on: ubuntu-latest
    services:
      postgres:
        image: pgvector/pgvector:pg16
        env:
          POSTGRES_DB: ecommerce_agents
          POSTGRES_USER: ecommerce
          POSTGRES_PASSWORD: ${{ secrets.EVAL_POSTGRES_PASSWORD }}
        ports: ['5432:5432']
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: '3.12' }
      - uses: astral-sh/setup-uv@v3
      - name: Install
        run: cd agents && uv sync
      - name: Seed DB
        run: cd agents && uv run python -m scripts.seed
      - name: Smoke evals (every agent)
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          DATABASE_URL: postgresql://ecommerce:${{ secrets.EVAL_POSTGRES_PASSWORD }}@localhost:5432/ecommerce_agents
        run: |
          cd agents
          for agent in product-discovery order-management pricing-promotions; do
            uv run python -m evals.run_evals \
              --agent "$agent" \
              --dataset evals/datasets/$agent.json \
              --smoke \
              --output-json eval-$agent.json \
              --pass-threshold 0.7
          done
      - name: Upload results
        if: always()
        uses: actions/upload-artifact@v4
        with: { name: eval-results, path: agents/eval-*.json }
```

The nightly workflow is identical except `on: schedule: [cron: '0 4 * * *']` and `--full` instead of `--smoke`.
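That diff, sketched as a fragment (the nightly workflow's filename here is an assumption):

```yaml
# .github/workflows/agent-evals-nightly.yml (assumed filename) — only the
# trigger and the tier flag differ from the PR workflow above.
on:
  schedule:
    - cron: '0 4 * * *'   # 04:00 UTC, after the day's merges land
# ...same services and steps, with `--full` in place of `--smoke` in the eval loop.
```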
## When to run evals
| Trigger | Tier | Reasoning |
|---|---|---|
| System-prompt edit | Smoke (affected agent) | PR check; immediate signal |
| Tool function refactor | Smoke (affected agent) | PR check; tool selection most likely affected |
| LLM model version bump | Full (every agent) | Manual or nightly; cross-agent impact |
| MAF framework upgrade | Full (every agent) | Same as above |
| New tool added | Smoke (the agent that gained it) | PR check |
| DB schema change touching tools | Smoke (every agent with affected tool) | PR check |
## Cost
Five-case smoke against a Sonnet 4.6 / GPT-4.1 agent runs $0.02–0.05 depending on response length. Three things to keep in mind:
- Smoke on every PR across five agents is ~$0.50/PR. At 50 PRs/week that’s $25/week — about a coffee.
- Full nightly (50 cases × 5 agents) is ~$2.50/night — $75/month.
- Model migration (running full evals against the old and new models for comparison) is a one-off $5–10.
These are 2026-04 numbers; rebuild the math when prices move. The evaluator surfaces the per-run cost so the trend is visible from the CI artefacts.
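The per-run estimate is a two-line computation from the usage the evaluator already records. A sketch — the per-million-token prices below are placeholder assumptions, not current list prices; at $2/M input and $8/M output they happen to reproduce the $0.0228 figure in the sample report above:

```python
# Placeholder prices (USD per million tokens) — substitute your provider's
# current price sheet; these values are assumptions for illustration.
PRICE_IN_PER_MTOK = 2.00
PRICE_OUT_PER_MTOK = 8.00

def estimate_cost(tokens_in: int, tokens_out: int) -> float:
    """Dollar cost of one eval run from the token counts in EvalResult."""
    return (tokens_in * PRICE_IN_PER_MTOK + tokens_out * PRICE_OUT_PER_MTOK) / 1_000_000

print(f"${estimate_cost(4_200, 1_800):.4f}")  # the sample report's token counts → $0.0228
```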
## Gotchas
- Non-determinism within a run. Two runs of the same case can score differently. For a single regression run, score variance is usually small (±5%). For tracking a metric over time, run each case 3× and take the median; the runner has a `--repeat 3` flag (omitted above for brevity) that does this.
- Tool calls vs tool attempts. If the LLM emits a malformed tool call, MAF surfaces it as a `function_call` content with the wrong arguments. The extractor counts it; the agent’s tool actually didn’t run. Decide which behaviour you want — usually you want to count attempts, not successes, because a malformed call is still a model decision worth catching.
- Streaming responses. The eval calls the non-streaming `agent.run()`, not `agent.run_stream()`. Token counts and tool-call lists are populated only on the final `AgentRunResponse`; streaming chunks don’t carry the totals. If your production path is streaming-only, the eval still gives you the right behavioural signal — the LLM’s tool selection doesn’t change between streaming and non-streaming.
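The `--repeat` implementation isn't shown above; a minimal sketch of the median approach, assuming the evaluator's `evaluate_case` from the wiring section:

```python
async def evaluate_case_repeated(evaluator, case, repeat: int = 3):
    """Run one case `repeat` times and keep the median result by overall score.

    Median rather than mean, so a single fluky run — good or bad — can't
    move the tracked number.
    """
    results = [await evaluator.evaluate_case(case) for _ in range(repeat)]
    results.sort(key=lambda r: r.overall)
    return results[len(results) // 2]
```

Odd repeat counts keep the semantics crisp; with an even count this picks the upper-middle run.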
## What changes for the capstone
The capstone today (`agents/python/evals/`) ships the original Part 9 framework. Phase 9 of the refactor plan migrates it:

- `agents/python/evals/extractors.py` — replace the `hasattr` chain with the canonical `AgentRunResponse` extraction.
- `agents/python/evals/scoring.py` — swap the substring alias matcher for the regex pattern above.
- `agents/python/evals/datasets/*.json` — add `"tier": "smoke" | "full"` to every case.
- `.github/workflows/agent-evals.yml` — split into smoke (PR) and nightly (schedule) jobs.
No agent code changes. The eval pipeline lives entirely in its own subtree.
## What’s next
That closes the production-quality loop the original Part 9 left open. The next chapter returns to the original e-commerce series with the prompt-engineering port — five concerns, YAML composition, role-aware overrides — modernised against MAF v1’s context-provider surface.

