9. Data and Datasets¶

Data is the raw material every model is made of. It is promoted here to a first-class branch because data quality bounds model quality more than architecture does. Children span the data lifecycle (sourcing, cleaning, labeling, curation, augmentation), dataset types (pretraining corpora, instruction/preference datasets, evaluation sets, synthetic data), and the governance that attaches to data (licensing, provenance, PII, contamination, bias). This branch is upstream of training and evaluation; "garbage in, garbage out" is the reason it stands alone rather than sitting inside Training & Post-Training.

Children¶

data lifecycle
  sourcing / collection
  cleaning / deduplication
  labeling / annotation
  curation / filtering
  augmentation
dataset types
  pretraining corpora
  instruction datasets
  preference datasets
  evaluation / benchmark sets
  synthetic data
data governance
  licensing
  provenance
  PII / privacy
  contamination / leakage
  bias and representativeness
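
To make two of the children above concrete (cleaning / deduplication and contamination / leakage), here is a minimal sketch, not a reference implementation. It assumes exact-match deduplication on normalized text and a word n-gram overlap test for contamination. All function names are hypothetical, and the default `n=8` is an arbitrary illustrative choice rather than a recommended threshold.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies compare equal."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def deduplicate(docs: list[str]) -> list[str]:
    """Keep the first occurrence of each document, comparing hashes of normalized text."""
    seen: set[str] = set()
    unique: list[str] = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of word n-grams in the normalized text (empty if text is shorter than n words)."""
    words = normalize(text).split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def is_contaminated(train_doc: str, eval_items: list[str], n: int = 8) -> bool:
    """Flag a training document that shares at least one word n-gram with any evaluation item."""
    train_grams = ngrams(train_doc, n)
    return any(train_grams & ngrams(item, n) for item in eval_items)
```

A pipeline built this way would typically run `deduplicate` first, then drop any training document for which `is_contaminated` is true against the held-out evaluation sets. Near-duplicate detection (for example MinHash) is a common next step that this sketch does not cover.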

Related¶

Training & Post-Training: the consumer of this data
Evaluation & Testing: evaluation sets and contamination
RAG: retrieval over document data
Safety, Governance & Alignment: data privacy and responsible sourcing