11. Evaluation and Testing

How to measure whether a model is good and safe. The child topics split by method (human, automated, domain-specific) and by target: benchmarks for capability; safety, hallucination, adversarial, and red-teaming evaluation for failure modes. "Regression testing" and "eval-driven development" import software engineering discipline into ML. The defining challenge: unlike traditional software tests, which check against a known expected result, model outputs are open-ended, so evaluation is itself a hard, contested problem.
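
As a rough illustration of how regression testing might carry over from software to ML, the sketch below gates a candidate model on a fixed eval set. It scores outputs with a token-overlap F1 rather than exact equality, since open-ended outputs rarely match a reference verbatim. Everything in it (the `token_f1` scorer, `regression_gate`, the 0.02 tolerance, and the toy data) is a hypothetical assumption for illustration, not a method prescribed by this section.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a prediction and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def mean_score(predictions, references):
    """Average token F1 over a fixed eval set."""
    scores = [token_f1(p, r) for p, r in zip(predictions, references)]
    return sum(scores) / len(scores)


def regression_gate(baseline_score, candidate_score, tolerance=0.02):
    """Pass only if the candidate is at most `tolerance` below the baseline."""
    return candidate_score >= baseline_score - tolerance


# Toy eval set (made up for this sketch).
references = [
    "Paris is the capital of France",
    "Water boils at 100 degrees Celsius",
]
baseline_outputs = [
    "The capital of France is Paris",  # reworded, same tokens as the reference
    "Water boils at 100 degrees Celsius",
]
candidate_outputs = ["Paris", "It boils at 90 degrees"]

baseline = mean_score(baseline_outputs, references)
candidate = mean_score(candidate_outputs, references)
print("gate passed:", regression_gate(baseline, candidate))
```

The reworded baseline answer would fail an exact-match assertion yet scores full marks here, which hints at why the choice of metric is itself contested: token overlap is only one of many possible scorers, and it rewards reordering that a stricter judge might penalize.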

Children

benchmarks
human evaluation
automated evaluation
domain-specific evaluation
safety evaluation
hallucination testing
adversarial testing
red teaming
regression testing
eval-driven development

Related

Training & Post-Training: evaluation measures the output of these stages
Safety, Security & Governance: evaluation measures risk; governance enforces against it
AI Engineering: eval-driven development
