11. Evaluation and Testing

How you measure whether a model is good and safe. The children split along two axes: by method (human, automated, domain-specific) and by target (benchmarks for capability; safety evaluation, hallucination testing, adversarial testing, and red teaming for failure modes). "Regression testing" and "eval-driven development" import software-engineering discipline into ML. The defining challenge: model outputs are open-ended, so, unlike traditional software testing, evaluation is itself a hard and contested problem.

Children

  • benchmarks
  • human evaluation
  • automated evaluation
  • domain-specific evaluation
  • safety evaluation
  • hallucination testing
  • adversarial testing
  • red teaming
  • regression testing
  • eval-driven development
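
To make the regression-testing and eval-driven-development ideas concrete, here is a minimal sketch of an eval harness with a regression gate. It is illustrative only: `run_eval`, `check_regression`, `toy_model`, the example cases, and the baseline values are all hypothetical names and numbers introduced here, and exact-match scoring is assumed purely for simplicity.

```python
from typing import Callable, List, Tuple

# Hypothetical eval case: (prompt, expected answer).
EvalCase = Tuple[str, str]


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences do not fail a case."""
    return " ".join(text.lower().split())


def run_eval(model: Callable[[str], str], cases: List[EvalCase]) -> float:
    """Return the fraction of cases where the model output exactly matches the expected answer."""
    if not cases:
        raise ValueError("eval set is empty")
    passed = sum(normalize(model(prompt)) == normalize(expected)
                 for prompt, expected in cases)
    return passed / len(cases)


def check_regression(score: float, baseline: float, tolerance: float = 0.0) -> bool:
    """Return True if the new score has not dropped below the baseline (minus a tolerance)."""
    return score >= baseline - tolerance


def toy_model(prompt: str) -> str:
    """Stand-in for a real model call (assumption: a model is any str -> str function)."""
    answers = {"2 + 2 =": "4", "Capital of France?": "Paris"}
    return answers.get(prompt, "I don't know")


# Tiny illustrative eval set; a real one would be much larger and versioned.
CASES: List[EvalCase] = [
    ("2 + 2 =", "4"),
    ("Capital of France?", " paris "),
    ("Largest planet?", "Jupiter"),
]

score = run_eval(toy_model, CASES)
print(f"score={score:.3f}")
print("passes gate at baseline 0.6:", check_regression(score, baseline=0.6))
print("passes gate at baseline 0.9:", check_regression(score, baseline=0.9))
```

Exact match only suits closed-form answers. For open-ended outputs, the scoring function is the part that would be swapped for human review or an automated grader, which is where the difficulty described above comes in.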