7. Transformer Architecture

The specific neural network design behind modern LLMs. The children trace the forward path of a token: embeddings (token + positional) → attention (self/cross/multi-head) → feed-forward, with residual connections and layer normalization stabilizing training, all assembled into encoder/decoder stacks. Attention is the load-bearing innovation: it lets every token draw on every other token in the sequence (or, under a decoder's causal mask, on every earlier token). Decoder-only is the dominant LLM variant. This is the mechanical detail of what Language Models are made of.
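To make the forward path concrete, here is a minimal sketch of one decoder-style block in plain Python. It is an illustration under stated assumptions, not a reference implementation: the names (`decoder_block_forward`, `params`, the toy sizes) are hypothetical, the block uses a single attention head with no output projection, layer normalization has no learned scale or shift, it is applied before each sub-layer (the "pre-norm" arrangement, one of several in use), and positional embeddings are assumed to be a learned table added to the token embeddings.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Element-wise sum of two equally shaped matrices (used for residuals)."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def layer_norm(x, eps=1e-5):
    """Normalize each row to zero mean and unit variance (no learned scale/shift)."""
    out = []
    for row in x:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product self-attention with a causal mask."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    scale = math.sqrt(len(q[0]))
    weights = []
    for i, qi in enumerate(q):
        # Position i scores only positions 0..i; later positions get weight 0.
        scores = [sum(a * b for a, b in zip(qi, k[j])) / scale for j in range(i + 1)]
        weights.append(softmax(scores) + [0.0] * (len(x) - i - 1))
    return matmul(weights, v), weights

def feed_forward(x, w1, w2):
    """Position-wise two-layer network with a ReLU in between."""
    hidden = [[max(0.0, v) for v in row] for row in matmul(x, w1)]
    return matmul(hidden, w2)

def decoder_block_forward(token_ids, params):
    """Embeddings -> attention -> feed-forward, each sub-layer wrapped in a residual."""
    # Token embedding plus positional embedding for each position.
    x = [[t + p for t, p in zip(params["tok"][tid], params["pos"][i])]
         for i, tid in enumerate(token_ids)]
    attn_out, weights = causal_self_attention(
        layer_norm(x), params["wq"], params["wk"], params["wv"])
    x = add(x, attn_out)  # residual connection around attention
    x = add(x, feed_forward(layer_norm(x), params["w1"], params["w2"]))  # and around the FFN
    return x, weights

def rand_matrix(rows, cols, rng):
    """Small random weights; stands in for trained parameters."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

# Toy sizes, chosen only for illustration.
VOCAB, MAX_LEN, D_MODEL, D_FF = 10, 8, 4, 16
rng = random.Random(0)
params = {
    "tok": rand_matrix(VOCAB, D_MODEL, rng),
    "pos": rand_matrix(MAX_LEN, D_MODEL, rng),
    "wq": rand_matrix(D_MODEL, D_MODEL, rng),
    "wk": rand_matrix(D_MODEL, D_MODEL, rng),
    "wv": rand_matrix(D_MODEL, D_MODEL, rng),
    "w1": rand_matrix(D_MODEL, D_FF, rng),
    "w2": rand_matrix(D_FF, D_MODEL, rng),
}
out, weights = decoder_block_forward([3, 1, 4], params)
```

Here `out` holds one `D_MODEL`-sized vector per input token, and each row of `weights` shows how much that position attends to itself and to earlier positions. A full model would stack many such blocks, split attention across several heads, and finish with a projection back to vocabulary logits.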

Children

token embeddings
positional embeddings
attention:
  self-attention
  cross-attention
  multi-head attention
feed-forward network
residual connections
layer normalization
encoder
decoder
decoder-only architecture

Related

Deep Learning: the broader family
Language Models: what this architecture powers
Model Internals: attention heads, parameters at run time
