Evaluation frameworks check outputs but miss the full picture. Here is what production AI agent testing looks like when engineers, QA, and product all need to stay in the loop.