AI Engineering · May 9, 2026 · 14 min read
Build an eval pipeline with golden datasets, scoring, smoke/full modes, and CI gates to catch prompt regressions before release.
Deep Dive · Apr 12, 2026 · 18 min read
Revised with .NET examples: A newer version of this article, covering both Python and .NET, is available as part of the MAF v1: Python and .NET series: MAF v1 - 23-evaluation-framework. The newer version applies three substantive fixes to the framework below: canonical AgentRunResponse extraction (no more hasattr chain), word-boundary alias matching (the original produced a false positive by matching "profit" against the "price" alias), and a smoke / full tier split for CI vs. nightly runs. Read this article for the conceptual ground; read the new one for the production-grade implementation.