LLM
Text → Tokenizer → Token IDs → context-free Embedding Lookup (vectors) → Transformer Layers (with Attention) → sequence of context-aware vectors
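A minimal numpy sketch of this pipeline, assuming a toy whitespace tokenizer, a random embedding table, and a placeholder for the Transformer layers (real models use subword tokenizers and learned weights):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary and whitespace "tokenizer" (real tokenizers use subwords, e.g. BPE).
vocab = {"the": 0, "cat": 1, "sat": 2, "<unk>": 3}
def tokenize(text):
    return [vocab.get(w, vocab["<unk>"]) for w in text.lower().split()]

d_model = 8
embedding_table = rng.normal(size=(len(vocab), d_model))  # context-free lookup table

token_ids = tokenize("The cat sat")     # Text -> Token IDs
x = embedding_table[token_ids]          # Token IDs -> context-free embeddings, shape (seq_len, d_model)

def transformer_layers(x):
    # Placeholder for the stacked attention blocks sketched later in these notes.
    return x

h = transformer_layers(x)               # -> sequence of context-aware vectors
print(h.shape)                          # (3, 8)
```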
Embedding
- Embedding: A function that maps inputs to a vector space
- Token: the smallest unit of text the model computes over, represented as a discrete integer ID
- Image Embedding: MLP, CNN, autoencoder (MLP case sketched below)
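A minimal sketch of the MLP case, assuming a flattened grayscale image and random, untrained weights; CNN and autoencoder encoders differ only in the architecture that produces the vector:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_image_embedding(image, d_embed=16):
    """Map an (H, W) image to a d_embed vector with one hidden layer (weights are random here)."""
    x = image.reshape(-1)                        # flatten pixels into one vector
    W1 = rng.normal(size=(x.size, 64)) * 0.01
    W2 = rng.normal(size=(64, d_embed)) * 0.01
    h = np.maximum(x @ W1, 0.0)                  # ReLU hidden layer
    return h @ W2                                # embedding vector

img = rng.random((28, 28))
print(mlp_image_embedding(img).shape)            # (16,)
```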
Attention
- Transformer: a way to produce contextual (context-aware) embeddings
- On entering a Transformer layer (specifically the attention sub-layer), the model applies three linear projections to each embedding vector $x_i$: $Q_i=x_iW_Q$, $K_i=x_iW_K$, $V_i=x_iW_V$
$$
\text{Attention}(q, D)=\sum_{i=1}^{m}\alpha(q,k_i)\,v_i,\quad D=\{(k_1,v_1),\dots,(k_m,v_m)\}
$$

$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^{\mathrm{T}}}{\sqrt{d_k}}\right)V
$$
- The weights $\alpha$ come from a softmax over scaled dot products of queries and keys, i.e. scaled dot-product attention, as in the second formula above (see the numpy sketch below)
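A numpy sketch of the two formulas above, assuming random projection matrices $W_Q, W_K, W_V$ and a toy input sequence of context-free embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # similarity of each query to each key, (seq_len, seq_len)
    alpha = softmax(scores, axis=-1)     # attention weights alpha(q, k_i), each row sums to 1
    return alpha @ V                     # weighted sum of values: sum_i alpha(q, k_i) v_i

seq_len, d_model, d_k = 3, 8, 4
x = rng.normal(size=(seq_len, d_model))  # context-free embeddings x_i
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q, K, V = x @ W_Q, x @ W_K, x @ W_V      # the three linear projections of each x_i
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)                         # (3, 4)
```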

Stacking Transformer layers
- The attention output (after a linear projection, residual connection, and layer norm) becomes the input representation of the next layer
- This is effectively a new embedding tensor, but one that is already context-aware: each vector now carries contextual semantics (see the sketch below)
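A sketch of one such block under simplifying assumptions (random, untrained weights; single-head attention; a simple position-wise feed-forward sub-layer), showing how the output of one block becomes the input of the next:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def transformer_block(x, d_model):
    # Self-attention with random projections (real models learn these weights).
    W_Q, W_K, W_V, W_O = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))
    Q, K, V = x @ W_Q, x @ W_K, x @ W_V
    attn = softmax(Q @ K.T / np.sqrt(d_model)) @ V
    x = layer_norm(x + attn @ W_O)           # linear layer + residual connection + layer norm

    # Position-wise feed-forward sub-layer, again with residual + layer norm.
    W1 = rng.normal(size=(d_model, 4 * d_model)) * 0.1
    W2 = rng.normal(size=(4 * d_model, d_model)) * 0.1
    ffn = np.maximum(x @ W1, 0.0) @ W2
    return layer_norm(x + ffn)               # this output is the next layer's input representation

x = rng.normal(size=(3, 16))                 # context-free embeddings
for _ in range(2):                           # stacking: the output of one block feeds the next
    x = transformer_block(x, d_model=16)
print(x.shape)                               # still (3, 16), but now context-aware
```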
Multi-head attention
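A sketch of multi-head attention, assuming random projections and a model dimension that divides evenly into the number of heads; each head attends in its own subspace before the results are concatenated and projected:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, num_heads=4):
    """Run scaled dot-product attention in num_heads parallel subspaces, then concatenate."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads
    W_Q, W_K, W_V, W_O = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))

    def split(t):  # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split(x @ W_Q), split(x @ W_K), split(x @ W_V)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # per-head attention scores
    heads = softmax(scores) @ V                           # (num_heads, seq_len, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_O                                   # final linear projection

x = rng.normal(size=(3, 16))
print(multi_head_attention(x).shape)                      # (3, 16)
```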
Positional Encoding
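A sketch of the sinusoidal scheme from the original Transformer paper ("Attention Is All You Need"); it is added to the context-free embeddings so that attention, which is otherwise order-agnostic, can distinguish positions. Learned positional embeddings are a common alternative.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))."""
    pos = np.arange(seq_len)[:, None]                    # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]                 # (1, d_model/2)
    angle = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)                          # even dimensions
    pe[:, 1::2] = np.cos(angle)                          # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(seq_len=3, d_model=16)
# In practice this is simply added to the token embeddings: x = x + pe
print(pe.shape)                                          # (3, 16)
```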