Skip to main content
  1. Tags/

Agent-Quality

Evaluating Agent Quality -- Testing What You Cannot Unit Test

Deep Dive · Apr 12, 2026 · 18 min read
Revised with .NET examples — A newer version of this article, covering both Python and .NET, is available as part of the MAF v1: Python and .NET series: MAF v1 — 23-evaluation-framework. The newer version applies three substantive fixes to the framework below — canonical AgentRunResponse extraction (no more hasattr chain), word-boundary alias matching (the original false-positives "profit" against the "price" alias), and a smoke / full tier split for CI vs nightly runs. Read this article for the conceptual ground; read the new one for the production-grade implementation.
Evaluating Agent Quality -- Testing What You Cannot Unit Test