LLM

Text → Tokenizer → Token IDs → context-free Embedding Lookup (vectors) → Transformer Layers (with Attention) → Vector sequence
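
The embedding lookup is context-free: each token ID maps to one learned vector, regardless of its neighbors. A minimal PyTorch sketch of the pipeline up to that point, assuming a toy character-level tokenizer and an illustrative `d_model` (real LLMs use subword tokenizers such as BPE):

```python
import torch
import torch.nn as nn

# Toy character-level "tokenizer": each character gets an integer ID.
# (Assumption for illustration; real LLMs use subword tokenizers like BPE.)
text = "hello"
vocab = {ch: i for i, ch in enumerate(sorted(set(text)))}
token_ids = torch.tensor([vocab[ch] for ch in text])   # shape: (seq_len,)

# Context-free embedding lookup: one learned vector per token ID,
# independent of the surrounding tokens.
d_model = 8                                            # illustrative size
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=d_model)
x = embedding(token_ids)                               # shape: (seq_len, d_model)

# The transformer layers (attention + feed-forward) then turn this
# context-free sequence into a context-dependent vector sequence.
print(token_ids.shape, x.shape)
```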

Embedding

Attention

$$ \text{Attention}(q, D) = \sum_{i=1}^{m} \alpha(q, k_i)\, v_i $$
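
This is attention as weighted pooling over a database $D = \{(k_1, v_1), \dots, (k_m, v_m)\}$ of key–value pairs, with weights $\alpha(q, k_i)$ produced by a scoring function. A small sketch, assuming the weights come from a softmax over dot-product scores (other scoring functions fit the same template):

```python
import torch

def attention_pooling(q, keys, values):
    """Attention(q, D) = sum_i alpha(q, k_i) * v_i over the pairs (k_i, v_i) in D.
    Here alpha is assumed to be a softmax over dot-product scores."""
    scores = keys @ q                        # (m,)  one score per key
    alphas = torch.softmax(scores, dim=0)    # (m,)  attention weights, sum to 1
    return alphas @ values                   # (d_v,) weighted sum of the values

m, d_k, d_v = 5, 4, 3
q, keys, values = torch.randn(d_k), torch.randn(m, d_k), torch.randn(m, d_v)
print(attention_pooling(q, keys, values).shape)   # torch.Size([3])
```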

$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^{\mathrm{T}}}{\sqrt{d_k}}\right)V $$
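
Scaled dot-product attention is the matrix form of the same idea: all queries are stacked into $Q$ and the weights come from a row-wise softmax over $QK^{\mathrm{T}}/\sqrt{d_k}$. A minimal sketch (batch and sequence sizes are illustrative):

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, batched over leading dimensions."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)    # (..., n_q, n_k)
    weights = torch.softmax(scores, dim=-1)               # each row sums to 1
    return weights @ V                                    # (..., n_q, d_v)

Q = torch.randn(2, 6, 16)   # (batch, n_queries, d_k)
K = torch.randn(2, 6, 16)   # (batch, n_keys,    d_k)
V = torch.randn(2, 6, 32)   # (batch, n_keys,    d_v)
print(scaled_dot_product_attention(Q, K, V).shape)   # torch.Size([2, 6, 32])
```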

Stacking Transformer Layers
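
A transformer stacks identical layers: each layer applies self-attention followed by a position-wise feed-forward network, with residual connections and layer normalization, and the vector sequence produced by one layer is the input of the next. A rough sketch, assuming a pre-norm block and illustrative sizes (`d_model=64`, 4 heads, 6 layers):

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """One transformer layer: self-attention + feed-forward,
    each wrapped in a residual connection and LayerNorm (pre-norm variant)."""
    def __init__(self, d_model=64, n_heads=4, d_ff=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                nn.Linear(d_ff, d_model))
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):
        h = self.ln1(x)
        a, _ = self.attn(h, h, h)         # self-attention: Q = K = V = h
        x = x + a                         # residual connection
        return x + self.ff(self.ln2(x))   # residual around the feed-forward net

# Stacking: the vector sequence from one layer feeds the next.
model = nn.Sequential(*[Block() for _ in range(6)])
x = torch.randn(2, 10, 64)               # (batch, seq_len, d_model)
print(model(x).shape)                     # torch.Size([2, 10, 64])
```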

Multi-head Attention
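
Multi-head attention projects Q, K, V into several lower-dimensional subspaces, runs scaled dot-product attention in each head in parallel, then concatenates the heads and projects back to `d_model`. A sketch with illustrative sizes (`d_model=64`, 4 heads):

```python
import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Self-attention with n_heads parallel scaled dot-product attention heads."""
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, x):
        b, n, _ = x.shape
        # (b, n, d_model) -> (b, n_heads, n, d_head): split into heads
        split = lambda t: t.view(b, n, self.n_heads, self.d_head).transpose(1, 2)
        q, k, v = split(self.w_q(x)), split(self.w_k(x)), split(self.w_v(x))
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        out = torch.softmax(scores, dim=-1) @ v          # (b, n_heads, n, d_head)
        out = out.transpose(1, 2).reshape(b, n, -1)      # concatenate the heads
        return self.w_o(out)                             # project back to d_model

x = torch.randn(2, 10, 64)    # (batch, seq_len, d_model)
print(MultiHeadAttention()(x).shape)   # torch.Size([2, 10, 64])
```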

Positional Encoding
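
Self-attention by itself is permutation-invariant, so position information is injected by adding a positional encoding to the token embeddings. A sketch of the sinusoidal variant from the original Transformer paper (sizes are illustrative):

```python
import math
import torch

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...)."""
    pos = torch.arange(seq_len).unsqueeze(1)                   # (seq_len, 1)
    div = torch.exp(torch.arange(0, d_model, 2)
                    * (-math.log(10000.0) / d_model))          # (d_model/2,)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)                         # even dimensions
    pe[:, 1::2] = torch.cos(pos * div)                         # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(seq_len=10, d_model=64)
print(pe.shape)   # torch.Size([10, 64])
# Typically: x = token_embeddings + pe  (broadcast over the batch dimension)
```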