Transformer

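The Transformer is an architecture introduced by Google researchers in the 2017 paper "Attention Is All You Need". It is built entirely on attention mechanisms, dispensing with recurrence (RNN) and convolution (CNN), and has become the foundation of modern NLP and large language models.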

1. What it is

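The Transformer is a sequence-to-sequence (Seq2Seq) model architecture. Its core is self-attention, which relates every position in a sequence to every other position and computes all of these relations in parallel.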

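Key properties:

Parallel computation: unlike an RNN, which processes tokens step by step, it handles the whole sequence at once (autoregressive generation still emits one token at a time)
Long-range dependencies: attention is computed directly between any two positions, so information does not have to survive a long chain of recurrent steps
Scalability: stacking more layers and adding parameters has, in practice, tended to improve capability (GPT, BERT, and others)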
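
To make the "all positions at once" idea concrete, here is a minimal NumPy sketch of single-head scaled dot-product self-attention. The projection matrices `w_q`, `w_k`, `w_v` and the sizes are illustrative assumptions (random values, not trained weights); the point is that one matrix product yields the score for every pair of positions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention over x of shape (seq_len, d_model)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d_k = q.shape[-1]
    # One matmul gives the (seq_len, seq_len) score for every pair of positions.
    scores = q @ k.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ v, weights

# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 16, 8
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

out, weights = self_attention(x, w_q, w_k, w_v)
print(out.shape, weights.shape)  # (5, 8) (5, 5)
```

Row `i` of `weights` says how much position `i` draws from each other position, regardless of how far apart they are.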

Overall architecture:

graph LR
    subgraph Encoder
        A[Input Embedding] --> B[+ Positional Encoding]
        B --> C[Multi-Head Attention]
        C --> D[Add & Norm]
        D --> E[Feed Forward]
        E --> F[Add & Norm]
    end
    subgraph Decoder
        G[Output Embedding] --> H[+ Positional Encoding]
        H --> I[Masked Multi-Head Attention]
        I --> J[Add & Norm]
        J --> K[Cross Attention]
        K --> L[Add & Norm]
        L --> M[Feed Forward]
        M --> N[Add & Norm]
    end
    F --> K
    N --> O[Linear + Softmax]

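Core components:

Component	Role
Self-Attention	Computes a weight relating each position in the sequence to every other position
Multi-Head	Runs several attention heads in parallel, each capturing features in a different subspace
Positional Encoding	Injects position information (sine/cosine or learned)
Feed Forward	Two fully connected layers with an activation, applied to each position independently
Add & Norm	Residual connection + LayerNorm, which stabilizes training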
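
The last two rows of the table can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not a reference implementation: it uses ReLU as the activation, the post-norm ordering `LayerNorm(x + sublayer(x))` shown in the diagram, random weights, and it omits LayerNorm's learnable gain and bias.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each position's feature vector to zero mean, unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def feed_forward(x, w1, b1, w2, b2):
    # Two linear layers with a ReLU in between, applied per position.
    hidden = np.maximum(0.0, x @ w1 + b1)
    return hidden @ w2 + b2

def add_and_norm(x, sublayer_out):
    # Residual connection followed by layer normalization (post-norm).
    return layer_norm(x + sublayer_out)

# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 4, 8, 32
x = rng.normal(size=(seq_len, d_model))
w1, b1 = rng.normal(size=(d_model, d_ff)) * 0.1, np.zeros(d_ff)
w2, b2 = rng.normal(size=(d_ff, d_model)) * 0.1, np.zeros(d_model)

y = add_and_norm(x, feed_forward(x, w1, b1, w2, b2))
print(y.shape)  # (4, 8)
```

Because the feed-forward network treats each row separately, running it on one position alone gives the same result as that position's row in the full output.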

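Three variants:

Type	Representative models	Structure	Typical tasks
Encoder-only	BERT	Encoder only	Classification, NER, sentence similarity
Decoder-only	GPT, LLaMA	Decoder only	Text generation, dialogue
Encoder-Decoder	T5, BART	Full structure	Translation, summarization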
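
At the attention level, the practical difference between an encoder and a decoder is the mask: encoder self-attention is bidirectional, while decoder self-attention is causal, so a position cannot look at later positions. The sketch below illustrates this with NumPy; the helper name `attention_weights` and the all-zero scores are assumptions chosen to make the effect of the mask easy to read.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_weights(scores, causal):
    """Turn raw (seq_len, seq_len) scores into attention weights."""
    if causal:
        seq_len = scores.shape[-1]
        # True above the diagonal: position i must not see positions j > i.
        future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    return softmax(scores, axis=-1)

scores = np.zeros((4, 4))  # equal scores, so only the mask shapes the result
encoder_style = attention_weights(scores, causal=False)
decoder_style = attention_weights(scores, causal=True)

print(encoder_style[0])  # [0.25 0.25 0.25 0.25]
print(decoder_style[0])  # [1. 0. 0. 0.]
```

An encoder-decoder model uses both: bidirectional attention in the encoder, causal attention in the decoder, plus cross-attention from decoder positions to the encoder output.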

2. How to use it

Using Hugging Face Transformers

import torch
from transformers import AutoTokenizer, AutoModel, AutoModelForCausalLM

# ========== Encoder model (e.g. BERT) ==========
tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModel.from_pretrained("bert-base-chinese")

# Chinese input, because bert-base-chinese is a Chinese checkpoint
text = "Transformer 是现代 NLP 的基础"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():  # inference only, no gradients needed
    outputs = model(**inputs)

# Token-level vectors, shape (batch_size, seq_len, hidden_size)
last_hidden = outputs.last_hidden_state

# Sentence vector, taken from the [CLS] position
sentence_embedding = last_hidden[:, 0, :]


# ========== Decoder model (e.g. GPT) ==========
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Once upon a time", return_tensors="pt")
# GPT-2 defines no pad token; reusing EOS avoids a warning during generation
outputs = model.generate(**inputs, max_new_tokens=50, pad_token_id=tokenizer.eos_token_id)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
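
The [CLS] vector is one way to get a sentence embedding; a common alternative is to average the token vectors while ignoring padding. The sketch below shows that arithmetic on small hand-made NumPy arrays so it runs without downloading a model. The helper name `mean_pool` is hypothetical; with the encoder example above, the same computation would apply to `last_hidden` and the tokenizer's `attention_mask`.

```python
import numpy as np

def mean_pool(last_hidden, attention_mask):
    """Average token vectors per sentence, skipping padded positions.

    last_hidden:    (batch_size, seq_len, hidden_size)
    attention_mask: (batch_size, seq_len), 1 for real tokens, 0 for padding
    """
    mask = attention_mask[..., None].astype(last_hidden.dtype)
    summed = (last_hidden * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # guard against empty rows
    return summed / counts

# Toy data: one sentence, three positions, the last one is padding.
last_hidden = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
attention_mask = np.array([[1, 1, 0]])

print(mean_pool(last_hidden, attention_mask))  # [[2. 3.]]
```

Which pooling works better depends on the model and task, so it is worth comparing both on your own data.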

Common task examples

from transformers import pipeline

# Text classification (uses the library's default sentiment model)
classifier = pipeline("sentiment-analysis")
classifier("I love this movie!")  # e.g. [{'label': 'POSITIVE', 'score': 0.99...}]

# Text generation
generator = pipeline("text-generation", model="gpt2")
generator("The future of AI is", max_length=50)

# Question answering
qa = pipeline("question-answering")
qa(question="What is Transformer?", context="Transformer is a neural network architecture.")

# Translation
translator = pipeline("translation_en_to_zh", model="Helsinki-NLP/opus-mt-en-zh")
translator("Hello, world!")

# Fill-mask (Chinese input for the Chinese BERT checkpoint)
fill = pipeline("fill-mask", model="bert-base-chinese")
fill("今天天气[MASK]好")

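Commonly used pretrained models

Model	Type	Parameters	Notes
BERT	Encoder	110M/340M	Bidirectional encoding, suited to understanding tasks
GPT-2	Decoder	124M-1.5B	Autoregressive generation
T5	Enc-Dec	60M-11B	Unified text-to-text format
LLaMA	Decoder	7B-70B	Open weights and efficient; a widely used base model
Qwen	Decoder	0.5B-72B	Strong in both Chinese and English
ChatGLM	Decoder	6B-130B	Optimized for Chinese dialogue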
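
The parameter counts in the table can be sanity-checked with a back-of-the-envelope formula: each layer has roughly 4·d² attention weights and 8·d² feed-forward weights (assuming d_ff = 4·d), plus the embedding tables. The sketch below is an approximation under those assumptions; it ignores biases, LayerNorm parameters, and task heads, and the vocabulary sizes are those of the `bert-base-uncased` and `gpt2` checkpoints.

```python
def approx_params(n_layers, d_model, vocab_size, max_positions):
    """Rough parameter count for a standard Transformer stack."""
    attention = 4 * d_model ** 2       # W_Q, W_K, W_V, W_O
    feed_forward = 8 * d_model ** 2    # two matrices with d_ff = 4 * d_model
    embeddings = (vocab_size + max_positions) * d_model
    return n_layers * (attention + feed_forward) + embeddings

bert_base = approx_params(n_layers=12, d_model=768, vocab_size=30522, max_positions=512)
gpt2_small = approx_params(n_layers=12, d_model=768, vocab_size=50257, max_positions=1024)

print(f"BERT-base:  {bert_base / 1e6:.1f}M")   # BERT-base:  108.8M
print(f"GPT-2 small: {gpt2_small / 1e6:.1f}M")  # GPT-2 small: 124.3M
```

Both estimates land close to the commonly quoted figures, which suggests most of the parameters sit in the attention and feed-forward matrices rather than in the smaller terms the formula skips.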

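Going further

Attention formula: $\text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$
Positional encoding: the original paper uses sine/cosine functions; many recent models use RoPE (rotary position embedding)
Efficiency variants: FlashAttention (exact attention that reduces memory use and memory traffic), GQA (grouped-query attention), sliding-window attention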
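
As an illustration of the RoPE idea, the NumPy sketch below rotates each pair of feature dimensions by an angle proportional to the token position. It is a minimal sketch under stated assumptions: the interleaved pairing of dimensions and `base=10000.0` are one common convention, and real implementations differ in layout details. The helper names `rope` and `score` are hypothetical.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Apply rotary position embedding to x of shape (seq_len, d), d even."""
    d = x.shape[-1]
    inv_freq = base ** (-np.arange(0, d, 2) / d)     # (d/2,) rotation speeds
    angles = positions[:, None] * inv_freq[None, :]  # (seq_len, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    # 2D rotation of each (even, odd) pair by its position-dependent angle.
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out

def score(q, k, m, n):
    # Dot product of a query at position m with a key at position n.
    q_rot = rope(q[None, :], np.array([m]))[0]
    k_rot = rope(k[None, :], np.array([n]))[0]
    return q_rot @ k_rot

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

# Same relative offset (2) at different absolute positions gives the same score.
print(np.isclose(score(q, k, 3, 1), score(q, k, 10, 8)))  # True
```

This is the property that makes RoPE attractive: the attention score depends on the relative offset between positions, not on their absolute values.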