11. Evaluation and Testing
How you measure whether a model is good and safe. The children split along two axes: by method (human, automated, domain-specific) and by target (benchmarks for capability; safety evaluation, hallucination testing, adversarial testing, and red teaming for failure modes). "Regression testing" and "eval-driven development" import software-engineering discipline into ML. The defining challenge: model outputs are open-ended, so there is often no single correct answer to assert against. Unlike traditional software testing, evaluation is itself a hard, contested problem.
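As a rough illustration of how regression testing carries over from software to ML, the sketch below scores a model on a fixed eval set and gates on a stored baseline. Everything in it is hypothetical and chosen for illustration: the names `EvalCase`, `run_eval`, `check_regression`, and `toy_model`, the three cases, the baseline of 0.60, and the tolerance of 0.02. Exact-match scoring is a deliberate simplification; for open-ended outputs it would usually be swapped for a human or automated grader.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    """One eval item: a prompt and the reference answer (hypothetical schema)."""
    prompt: str
    expected: str


def exact_match(output: str, expected: str) -> bool:
    # Simplest possible grader: case- and whitespace-insensitive equality.
    return output.strip().lower() == expected.strip().lower()


def run_eval(model: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Return the fraction of cases the model answers correctly."""
    if not cases:
        raise ValueError("eval set is empty")
    passed = sum(exact_match(model(c.prompt), c.expected) for c in cases)
    return passed / len(cases)


def check_regression(score: float, baseline: float, tolerance: float = 0.02) -> bool:
    """Pass if the score has not dropped more than `tolerance` below the baseline."""
    return score >= baseline - tolerance


def toy_model(prompt: str) -> str:
    # Stand-in for a real model call; assumed for illustration only.
    answers = {"2 + 2 =": "4", "Capital of France?": "Paris"}
    return answers.get(prompt, "I don't know")


CASES = [
    EvalCase("2 + 2 =", "4"),
    EvalCase("Capital of France?", "paris"),
    EvalCase("Largest planet?", "Jupiter"),
]

score = run_eval(toy_model, CASES)
print(f"accuracy = {score:.2f}")
print("regression gate passed:", check_regression(score, baseline=0.60))
```

In an eval-driven workflow, a check like `check_regression` would typically run on every model or prompt change, with the baseline updated only when an improvement is accepted.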
Children
- benchmarks
- human evaluation
- automated evaluation
- domain-specific evaluation
- safety evaluation
- hallucination testing
- adversarial testing
- red teaming
- regression testing
- eval-driven development
Related
- Training & Post-Training: produces the models whose outputs eval measures
- Safety, Security & Governance: eval measures risk; governance enforces limits on it
- AI Engineering: home of eval-driven development as an engineering practice