7. Transformer Architecture
The specific neural network design behind modern LLMs. The children below trace a token's forward path: embeddings (token + positional) → attention (self, cross, multi-head) → feed-forward network, with residual connections and layer normalization stabilizing the stack, all assembled into encoder and decoder stacks. Attention is the load-bearing innovation: it lets each token's representation draw on every other token in the sequence (in a decoder, only on earlier tokens, enforced by a causal mask). Decoder-only is the dominant LLM variant. This entry covers the mechanical detail of what Language Models are made of.
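The forward path described above can be sketched in a few dozen lines of NumPy. This is a minimal illustration under stated assumptions, not a reference implementation: it uses a single decoder block, pre-norm placement of layer normalization, a ReLU feed-forward network, learned positional embeddings stored as a table, no biases or learned layer-norm parameters, and random weights. All function and parameter names (`embed`, `decoder_block`, `init_params`, `w_q`, and so on) are hypothetical and chosen for readability.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean and unit variance.
    # Assumption: learned scale/shift parameters are omitted for brevity.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(x):
    # Numerically stable softmax over the last axis.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def embed(token_ids, tok_emb, pos_emb):
    # Token embedding plus positional embedding, one row per position.
    return tok_emb[token_ids] + pos_emb[: len(token_ids)]

def causal_self_attention(x, w_q, w_k, w_v, w_o, n_heads):
    seq_len, d_model = x.shape
    d_head = d_model // n_heads

    def split_heads(t):
        # (seq_len, d_model) -> (n_heads, seq_len, d_head)
        return t.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)
    # Scaled dot-product scores: (n_heads, seq_len, seq_len)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    # Causal mask: position i may not attend to positions j > i.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    weights = softmax(scores)
    # Weighted sum of values, then merge heads back to (seq_len, d_model).
    merged = (weights @ v).transpose(1, 0, 2).reshape(seq_len, d_model)
    return merged @ w_o, weights

def feed_forward(x, w1, w2):
    # Position-wise two-layer network with ReLU (an assumed activation).
    return np.maximum(x @ w1, 0.0) @ w2

def decoder_block(x, p, n_heads):
    # Pre-norm variant (an assumption): normalize, transform, add residual.
    attn_out, weights = causal_self_attention(
        layer_norm(x), p["w_q"], p["w_k"], p["w_v"], p["w_o"], n_heads
    )
    x = x + attn_out                                       # residual 1
    x = x + feed_forward(layer_norm(x), p["w1"], p["w2"])  # residual 2
    return x, weights

def init_params(rng, vocab_size, max_len, d_model, d_ff):
    # Random weights purely for illustration; nothing here is trained.
    def w(*shape):
        return rng.normal(0.0, 0.1, size=shape)
    return {
        "tok_emb": w(vocab_size, d_model), "pos_emb": w(max_len, d_model),
        "w_q": w(d_model, d_model), "w_k": w(d_model, d_model),
        "w_v": w(d_model, d_model), "w_o": w(d_model, d_model),
        "w1": w(d_model, d_ff), "w2": w(d_ff, d_model),
    }

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    params = init_params(rng, vocab_size=10, max_len=6, d_model=8, d_ff=16)
    ids = np.array([1, 4, 2, 7])
    hidden = embed(ids, params["tok_emb"], params["pos_emb"])
    out, attn = decoder_block(hidden, params, n_heads=2)
    print(out.shape, attn.shape)
```

A full decoder-only model would stack many such blocks and project the final hidden states back onto the vocabulary. Cross-attention, used in encoder-decoder stacks, follows the same pattern except that the keys and values come from the encoder output and no causal mask is applied.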
Children
- token embeddings
- positional embeddings
- attention:
  - self-attention
  - cross-attention
  - multi-head attention
- feed-forward network
- residual connections
- layer normalization
- encoder
- decoder
- decoder-only architecture
Related
- Deep Learning — the broader family
- Language Models — what this architecture powers
- Model Internals — attention heads, parameters at run time