Emergent Linear Representations in World Models of Self-Supervised Sequence Models

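Nanda, Lee, Wattenberg (BlackboxNLP 2023): the linearization correction for OthelloGPT. The Mine/Yours/Empty frame lifts linear-probe accuracy from ~75% to ~99%; the paper also contributes intervention by linear vector addition and the multiple-circuits hypothesis.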

Source

SOURCE · NANDA ET AL. 2023 · linear representations · OthelloGPT frame correction · EMNLP

Emergent Linear Representations in World Models

Nanda, Lee, Wattenberg (BlackboxNLP @ EMNLP 2023) — the linearization key for OthelloGPT

Li et al. found only non-linear probes could decode OthelloGPT’s board state (~75% linear vs ~98% non-linear). Nanda et al. pinpoint the cause: the labeling frame was wrong. Switch to Mine/Yours/Empty (player-to-move perspective) instead of Black/White/Empty (absolute color), and linear-probe accuracy jumps from ~75% to ~99%. The representation was linear all along.

Effect of the frame correction

Probe type	Frame	Accuracy
Linear probe	Black/White/Empty (Li et al.)	~75%
Non-linear MLP	Black/White/Empty (Li et al.)	~98%
Linear probe	Mine/Yours/Empty (Nanda et al.)	~99%

Extra findings and method contributions

Linear vector-addition intervention	A single addition edits activations: error rate 0.10, cleaner than iterative gradient (0.12), especially on the Erasing task
Flipped representation	A per-timestep “was-just-flipped” feature is linearly encoded (F1 >96% from layer 3 on)
Attention specialization	Some heads attend only to odd timesteps (own side), others only to even (opponent)
Multi-circuit hypothesis	Near endgame the model often predicts the move before fully completing board state: parallel circuits, not a single algorithm

→ Othello World Model · Linear Representation · Activation Intervention

arXiv:2309.00941 (2023)


Source: sources/arxiv_papers/2309.00941-emergent-linear-representations-world-models.md
URL: https://arxiv.org/abs/2309.00941
Authors: Neel Nanda (independent researcher), Andrew Lee (University of Michigan), Martin Wattenberg (Harvard University)
Published: 2023-09-02 (accepted at BlackboxNLP @ EMNLP 2023)

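The paper's core question

Li et al. (2022) had shown that OthelloGPT carries an internal representation of the board, but they needed non-linear probes to read it out. That seemed to imply the representation is non-linearly encoded and can only be decoded with a multi-layer MLP.

Nanda et al. ask a key question: is this non-linearity inherent to the representation itself, or is it an artifact of the labeling scheme?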

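Core insight: the choice of reference frame

Li et al. labeled the board by absolute color (Black/White/Empty), which is the intuitive choice for a human. Nanda et al. find that OthelloGPT does not encode absolute color. It encodes relative color (Mine/Yours/Empty), that is, the board as seen by the player about to move.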
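
The relabeling itself is mechanical, and a minimal sketch may make the two frames concrete. The function name `to_relative` and the string labels are hypothetical choices for illustration, not taken from the paper's code; the only thing assumed from the text is that "Mine" means the color of the player about to move.

```python
from typing import List


def to_relative(board: List[str], to_move: str) -> List[str]:
    """Relabel Black/White/Empty squares as Mine/Yours/Empty.

    `board` holds "black", "white" or "empty" for each square; `to_move`
    is the colour of the player about to move ("black" or "white").
    """
    if to_move not in ("black", "white"):
        raise ValueError(f"to_move must be 'black' or 'white', got {to_move!r}")
    relabeled = []
    for square in board:
        if square == "empty":
            relabeled.append("empty")
        elif square == to_move:
            relabeled.append("mine")
        elif square in ("black", "white"):
            relabeled.append("yours")
        else:
            raise ValueError(f"unknown square label: {square!r}")
    return relabeled


# The same physical squares get different labels depending on who moves next.
row = ["black", "white", "empty", "black"]
print(to_relative(row, "black"))  # ['mine', 'yours', 'empty', 'mine']
print(to_relative(row, "white"))  # ['yours', 'mine', 'empty', 'yours']
```

Because the player to move alternates every turn, the absolute-color label of a square and its relative label swap roles from one timestep to the next, which is why the two probing targets are not interchangeable.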

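With the labeling scheme changed, linear-probe accuracy jumps from ~75% to ~99%, on par with the non-linear probe.

Probe type	Labeling scheme	Accuracy (after layer 4)
Linear probe	Black/White/Empty (Li et al.)	~75%
Non-linear probe	Black/White/Empty (Li et al.)	~98%
Linear probe	Mine/Yours/Empty (Nanda et al.)	~99%

Implication: the representation was linear all along; Li et al. had simply chosen the wrong coordinate system for the feature space.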
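
For readers unfamiliar with the term, a linear probe is just an affine map from an activation vector to class scores, followed by an argmax. The sketch below shows that forward pass for a single board square. The weight layout, the presence of a bias term, and the name `linear_probe_predict` are assumptions made for illustration; the note does not specify how the paper's probes are parameterized.

```python
from typing import List, Sequence

CLASSES = ("empty", "mine", "yours")


def linear_probe_predict(
    x: Sequence[float],
    weights: Sequence[Sequence[float]],
    bias: Sequence[float],
) -> str:
    """Classify one square from one activation vector with a linear probe.

    `weights` has one row per class (3 rows, each as long as `x`), and
    `bias` has one entry per class. No hidden layer or non-linearity is
    applied before the argmax, which is what makes the probe linear.
    """
    logits: List[float] = [
        sum(w_i * x_i for w_i, x_i in zip(row, x)) + b
        for row, b in zip(weights, bias)
    ]
    best = max(range(len(logits)), key=logits.__getitem__)
    return CLASSES[best]


# Toy probe: the first activation coordinate separates "mine" from "yours".
toy_weights = [[0.0, 0.0], [1.0, 0.0], [-1.0, 0.0]]
toy_bias = [0.0, 0.0, 0.0]
print(linear_probe_predict([1.0, -1.0], toy_weights, toy_bias))  # mine
print(linear_probe_predict([-2.0, 0.0], toy_weights, toy_bias))  # yours
```

A non-linear probe would insert at least one hidden layer with an activation function between `x` and the logits; the table above says that extra capacity is unnecessary once the labels are in the Mine/Yours/Empty frame.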

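Activation intervention: simpler than gradient descent

Li et al.'s causal validation modifies activations by gradient descent, which takes many optimization iterations. Nanda et al. achieve an equivalent intervention with a single vector addition along the linear probe direction:

$x' \leftarrow x + \alpha \cdot p^\lambda_d(x)$

where $d \in \{\text{Mine}, \text{Yours}, \text{Empty}\}$ and $\alpha$ is a scaling factor.

Result: the error rate of the linear vector-addition intervention (0.10) is comparable to that of the gradient-guided intervention (0.12), and it is even better on the “Erasing” task (0.02 vs 0.11).

This is more than an engineering simplification. It shows that the direction vectors of a linear representation are the natural vehicle for intervention, with no non-linear optimization required.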
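
The update rule above can be sketched in a few lines. This is an illustration under simplifying assumptions, not the paper's implementation: it treats the probe direction for class $d$ as a fixed vector for one square, uses plain Python lists in place of tensors, and does not model the superscript $\lambda$ or the dependence on $x$ that appear in the formula. The names `add_probe_direction` and `dot` are hypothetical.

```python
from typing import Dict, List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(a_i * b_i for a_i, b_i in zip(a, b))


def add_probe_direction(
    x: Sequence[float],
    probe_dirs: Dict[str, Sequence[float]],
    d: str,
    alpha: float,
) -> List[float]:
    """Return x' = x + alpha * p_d, a single-step edit of the activation.

    `probe_dirs` maps each class name ("mine", "yours", "empty") to an
    assumed fixed probe direction with the same length as `x`.
    """
    direction = probe_dirs[d]
    if len(direction) != len(x):
        raise ValueError("activation and probe direction must have equal length")
    return [x_i + alpha * p_i for x_i, p_i in zip(x, direction)]


# Toy example: push an activation toward "yours" for one square.
dirs = {"yours": [0.5, 0.5, 0.0]}
x = [1.0, 0.0, -2.0]
x_new = add_probe_direction(x, dirs, "yours", alpha=2.0)
print(x_new)                    # [2.0, 1.0, -2.0]
print(dot(x_new, dirs["yours"]))  # 1.5
```

In the toy example the probe score along the chosen direction rises from 0.5 to 1.5, that is, by $\alpha \lVert p_d \rVert^2$. No loop or gradient computation is involved, which is the contrast with the iterative procedure described above.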

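Additional linear representations

Beyond the main Mine/Yours/Empty representation, the paper also finds:

Flipped representation: OthelloGPT linearly encodes which pieces are flipped at each timestep (F1 score above 96% from layer 3 onward)
Empty circuit: attention heads in the first layer broadcast which squares have already been played; the Empty direction is a linear function of the token embeddings
Attention specialization: some attention heads attend only to odd timesteps (own side) and others only to even timesteps (opponent), which further supports the Mine/Yours framing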
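
One way to see why the Empty feature can be a linear function of the move tokens: Othello never removes a piece from the board, so a square is empty exactly when it is not one of the four starting squares and has never been played. The sketch below makes that explicit with one-hot vectors standing in for token embeddings. The 64-square row-major indexing, the constant names, and the function names are assumptions for illustration and do not reflect OthelloGPT's actual vocabulary.

```python
from typing import List, Sequence

N_SQUARES = 64
# The four centre squares are occupied at the start of every game.
# Row-major, 0-indexed numbering is an assumption of this sketch.
CENTER = (27, 28, 35, 36)


def one_hot(square: int) -> List[int]:
    """One-hot vector for a played square (a stand-in for a token embedding)."""
    vec = [0] * N_SQUARES
    vec[square] = 1
    return vec


def empty_mask(moves: Sequence[int]) -> List[int]:
    """Return 1 for squares that are still empty and 0 otherwise.

    Assumes every move is a distinct non-centre square. The result is a
    constant minus the sum of the one-hot move vectors, i.e. an affine
    function of the inputs with no board simulation needed.
    """
    played = [0] * N_SQUARES
    for move in moves:
        played = [p + o for p, o in zip(played, one_hot(move))]
    start = [1 if sq in CENTER else 0 for sq in range(N_SQUARES)]
    return [1 - s - p for s, p in zip(start, played)]


mask = empty_mask([19, 18])
print(sum(mask))  # 58 squares remain empty after two moves
```

Tracking Mine versus Yours is harder because it depends on flips, but emptiness only requires summing over which tokens have appeared, which is consistent with a first-layer attention head being able to broadcast it.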

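Multiple Circuits Hypothesis

The paper reports a counterintuitive phenomenon: in the endgame (roughly the last 20 moves), the model often predicts a legal move before it has finished computing the full board state (the “MoveFirst” phenomenon).

Interpretation: OthelloGPT may contain several parallel circuits:

Main board circuit: Mine/Yours/Empty tracking, used in most positions
Endgame shortcut circuit: does not rely on the full board state and predicts directly from local features (empty squares, flanking relations)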
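
One possible way to operationalize "predicts the move before the board state is complete" is to compare, per position, the earliest layer at which the move prediction is correct with the earliest layer at which the probed board state is fully correct. The sketch below encodes that comparison. It is an assumed reading of the MoveFirst idea, not the paper's measurement code, and the per-layer boolean inputs and function names are hypothetical.

```python
from typing import Optional, Sequence


def first_true(flags: Sequence[bool]) -> Optional[int]:
    """Index of the first True entry, or None if there is none."""
    for index, flag in enumerate(flags):
        if flag:
            return index
    return None


def is_move_first(
    move_ok_by_layer: Sequence[bool],
    board_ok_by_layer: Sequence[bool],
) -> bool:
    """True if the move is predicted correctly at an earlier layer than the
    layer at which the probed board state first becomes fully correct.

    A position where the board state is never fully correct but the move
    is still predicted also counts as move-first.
    """
    move_layer = first_true(move_ok_by_layer)
    board_layer = first_true(board_ok_by_layer)
    if move_layer is None:
        return False
    return board_layer is None or move_layer < board_layer


# Move is right from layer 1, board state only from layer 2.
print(is_move_first([False, True, True], [False, False, True]))  # True
```

Under this reading, a rising share of move-first positions late in the game would be the signature of a shortcut circuit that bypasses the main board representation.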

This closely matches the “Bag of Heuristics” finding of jylin04 (2024): the model may work through a collection of many local rules rather than a single unified algorithm.

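Relation to the linear representation hypothesis

This paper is one of the key pieces of empirical evidence for the linear representation hypothesis:

It shows that world models, not only language features, can be encoded linearly
It suggests that the choice of feature space is key to finding linear representations: the coordinates that feel natural to a human may not be the model's internal coordinates
It shows that the directions of a linear representation can be used directly for causal intervention, with no complex optimization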

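Keywords and related concepts

Othello world model hypothesis: this paper is the second generation of that hypothesis, the linearization correction
Linear representation hypothesis: this paper provides empirical evidence of linear representations at the world-model level
Activation intervention: this paper proposes a simpler intervention method based on linear vector addition
Probing classifiers: the paper's core experimental tool; it shows how decisively the choice of feature space shapes what a probe concludes
Mechanistic interpretability: the paper's analysis of the Empty circuit and of attention specialization serves as a demonstration of MI methods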

References

sources/arxiv_papers/2309.00941-emergent-linear-representations-world-models.md
Li et al. (2022): sources/arxiv_papers/2210.13382-emergent-world-representations-othello-gpt.md