LLM

Text → Tokenizer → Token IDs → context-free Embedding Lookup (vectors) → Transformer Layers (with Attention) → Vector sequence
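
The embedding lookup is context-free: each token ID maps to one learned vector, regardless of its neighbors. A minimal PyTorch sketch of the pipeline up to that point, assuming a toy character-level tokenizer and an illustrative `d_model` (real LLMs use subword tokenizers such as BPE):

```python
import torch
import torch.nn as nn

# Toy character-level "tokenizer": each character gets an integer ID.
# (Assumption for illustration; real LLMs use subword tokenizers like BPE.)
text = "hello"
vocab = {ch: i for i, ch in enumerate(sorted(set(text)))}
token_ids = torch.tensor([vocab[ch] for ch in text])   # shape: (seq_len,)

# Context-free embedding lookup: one learned vector per token ID,
# independent of the surrounding tokens.
d_model = 8                                            # illustrative size
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=d_model)
x = embedding(token_ids)                               # shape: (seq_len, d_model)

# The transformer layers (attention + feed-forward) then turn this
# context-free sequence into a context-dependent vector sequence.
print(token_ids.shape, x.shape)
```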

Embedding

Attention

$$ \text{Attention}(q, D) = \sum_{i=1}^{m} \alpha(q, k_i)\, v_i $$
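
This is attention as weighted pooling over a database $D = \{(k_1, v_1), \dots, (k_m, v_m)\}$ of key–value pairs, with weights $\alpha(q, k_i)$ produced by a scoring function. A small sketch, assuming the weights come from a softmax over dot-product scores (other scoring functions fit the same template):

```python
import torch

def attention_pooling(q, keys, values):
    """Attention(q, D) = sum_i alpha(q, k_i) * v_i over the pairs (k_i, v_i) in D.
    Here alpha is assumed to be a softmax over dot-product scores."""
    scores = keys @ q                        # (m,)  one score per key
    alphas = torch.softmax(scores, dim=0)    # (m,)  attention weights, sum to 1
    return alphas @ values                   # (d_v,) weighted sum of the values

m, d_k, d_v = 5, 4, 3
q, keys, values = torch.randn(d_k), torch.randn(m, d_k), torch.randn(m, d_v)
print(attention_pooling(q, keys, values).shape)   # torch.Size([3])
```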

$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^{\mathrm{T}}}{\sqrt{d_k}}\right)V $$
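
Scaled dot-product attention is the matrix form of the same idea: all queries are stacked into $Q$ and the weights come from a row-wise softmax over $QK^{\mathrm{T}}/\sqrt{d_k}$. A minimal sketch (batch and sequence sizes are illustrative):

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, batched over leading dimensions."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)    # (..., n_q, n_k)
    weights = torch.softmax(scores, dim=-1)               # each row sums to 1
    return weights @ V                                    # (..., n_q, d_v)

Q = torch.randn(2, 6, 16)   # (batch, n_queries, d_k)
K = torch.randn(2, 6, 16)   # (batch, n_keys,    d_k)
V = torch.randn(2, 6, 32)   # (batch, n_keys,    d_v)
print(scaled_dot_product_attention(Q, K, V).shape)   # torch.Size([2, 6, 32])
```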

Stacking Transformer Layers
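
A transformer stacks identical layers: each layer applies self-attention followed by a position-wise feed-forward network, with residual connections and layer normalization, and the vector sequence produced by one layer is the input of the next. A rough sketch, assuming a pre-norm block and illustrative sizes (`d_model=64`, 4 heads, 6 layers):

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """One transformer layer: self-attention + feed-forward,
    each wrapped in a residual connection and LayerNorm (pre-norm variant)."""
    def __init__(self, d_model=64, n_heads=4, d_ff=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                nn.Linear(d_ff, d_model))
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):
        h = self.ln1(x)
        a, _ = self.attn(h, h, h)         # self-attention: Q = K = V = h
        x = x + a                         # residual connection
        return x + self.ff(self.ln2(x))   # residual around the feed-forward net

# Stacking: the vector sequence from one layer feeds the next.
model = nn.Sequential(*[Block() for _ in range(6)])
x = torch.randn(2, 10, 64)               # (batch, seq_len, d_model)
print(model(x).shape)                     # torch.Size([2, 10, 64])
```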

Multi-head Attention
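
Multi-head attention projects Q, K, V into several lower-dimensional subspaces, runs scaled dot-product attention in each head in parallel, then concatenates the heads and projects back to `d_model`. A sketch with illustrative sizes (`d_model=64`, 4 heads):

```python
import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Self-attention with n_heads parallel scaled dot-product attention heads."""
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, x):
        b, n, _ = x.shape
        # (b, n, d_model) -> (b, n_heads, n, d_head): split into heads
        split = lambda t: t.view(b, n, self.n_heads, self.d_head).transpose(1, 2)
        q, k, v = split(self.w_q(x)), split(self.w_k(x)), split(self.w_v(x))
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        out = torch.softmax(scores, dim=-1) @ v          # (b, n_heads, n, d_head)
        out = out.transpose(1, 2).reshape(b, n, -1)      # concatenate the heads
        return self.w_o(out)                             # project back to d_model

x = torch.randn(2, 10, 64)    # (batch, seq_len, d_model)
print(MultiHeadAttention()(x).shape)   # torch.Size([2, 10, 64])
```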

Positional Encoding
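
Self-attention by itself is permutation-invariant, so position information is injected by adding a positional encoding to the token embeddings. A sketch of the sinusoidal variant from the original Transformer paper (sizes are illustrative):

```python
import math
import torch

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...)."""
    pos = torch.arange(seq_len).unsqueeze(1)                   # (seq_len, 1)
    div = torch.exp(torch.arange(0, d_model, 2)
                    * (-math.log(10000.0) / d_model))          # (d_model/2,)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)                         # even dimensions
    pe[:, 1::2] = torch.cos(pos * div)                         # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(seq_len=10, d_model=64)
print(pe.shape)   # torch.Size([10, 64])
# Typically: x = token_embeddings + pe  (broadcast over the batch dimension)
```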