Evaluating Agent Quality -- Testing What You Cannot Unit Test
Revised with .NET examples — A newer version of this article, covering both Python and .NET, is available as part of the MAF v1: Python and .NET series: MAF v1 — 23-evaluation-framework. The newer version applies three substantive fixes to the framework below — canonical AgentRunResponse extraction (no more hasattr chain), word-boundary alias matching (the original false-positives "profit" against the "price" alias), and a smoke / full tier split for CI vs nightly runs. Read this article for the conceptual ground; read the new one for the production-grade implementation.
