9. Data and Datasets

The raw material every model is made of, promoted here to a first-class branch because data quality often bounds model quality more tightly than architecture does. Children span the data lifecycle (sourcing, cleaning, labeling, curation, augmentation), dataset types (pretraining corpora, instruction and preference datasets, evaluation sets, synthetic data), and the governance concerns attached to data (licensing, provenance, PII, contamination, bias). This branch sits upstream of training and evaluation; "garbage in, garbage out" is the reason it stands alone rather than hiding inside Training & Post-Training.

Children

  • data lifecycle
      • sourcing / collection
      • cleaning / deduplication
      • labeling / annotation
      • curation / filtering
      • augmentation
  • dataset types
      • pretraining corpora
      • instruction datasets
      • preference datasets
      • evaluation / benchmark sets
      • synthetic data
  • data governance
      • licensing
      • provenance
      • PII / privacy
      • contamination / leakage
      • bias and representativeness
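
Two of the children above, cleaning / deduplication and contamination / leakage, can be made concrete with a small sketch. The Python below is illustrative only: the function names (`normalize`, `deduplicate`, `is_contaminated`) and the default n-gram length are assumptions made for this example, not part of the taxonomy. It shows exact deduplication on normalized text and a naive word n-gram overlap check between a training document and evaluation items. Production pipelines typically use fuzzier matching and tuned thresholds.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text.strip().lower())


def deduplicate(docs: list[str]) -> list[str]:
    """Exact deduplication on normalized text; keeps the first occurrence."""
    seen: set[str] = set()
    kept: list[str] = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Set of word n-grams in the normalized text (empty if text is shorter than n)."""
    tokens = normalize(text).split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def is_contaminated(train_doc: str, eval_items: list[str], n: int = 8) -> bool:
    """Flag a training document that shares any word n-gram with an evaluation item.

    The default n is an assumed value for illustration, not a recommendation.
    """
    train_grams = ngrams(train_doc, n)
    return any(train_grams & ngrams(item, n) for item in eval_items)
```

Hashing normalized text catches only exact duplicates up to case and whitespace; near-duplicates and paraphrased benchmark items would pass through both checks, which is one reason deduplication and contamination appear as separate children in this branch.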