Li et al. (2022/2023): Emergent World Representations (OthelloGPT)

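Li et al. (ICLR 2023): the founding OthelloGPT paper. It combines non-linear probes with activation intervention to test whether the board representation is causal, and is a paradigm case of a world model emerging in a sequence model.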

Source

SOURCE · OTHELLOGPT · Li et al. · ICLR 2023 · founding the world-model hypothesis · activation intervention

OthelloGPT: Emergent World Representations

Li et al. (2022/ICLR 2023) — do language models memorize surface statistics, or build an internal world model?

OthelloGPT (an 8-layer GPT trained only on move sequences, never shown board geometry) provides controlled evidence that the model does not merely memorize: it builds an internal board representation. The model trained on synthetic data has a 0.01% error rate, and a skewed-dataset test (0.02%) rules out sequence memorization. In the paper's probing setup, board state appears to be encoded non-linearly in activations (linear probes fail, MLP probes succeed).

Key experimental results

Setup	Error rate
Untrained baseline	93.29%
Synthetic-data training	0.01%
Skewed dataset (anti-memorization test)	0.02%
Linear probe (fails)	>20% (near random)
Non-linear MLP probe (succeeds)	1.7–4.8%

Methodological contributions

Activation intervention for causality

Gradient descent edits mid-layer activations toward a counterfactual board state B′, and the resulting change in predictions is observed; this moves the evidence from correlation to causation

Intervention effect

Average error drops from 2.68 before intervention to 0.12 after; the effect holds even for off-distribution board states

Follow-up challenge

Nanda et al. (2023) found that with MINE/YOURS/EMPTY labels, linear probes reach ~99% accuracy, so the non-linear conclusion was partly a coordinate-frame artifact

→ Othello World Model Hypothesis · Activation Intervention · Neel Nanda · ICLR 2023 · arXiv:2210.13382


Source: sources/arxiv_papers/2210.13382-emergent-world-representations-othello-gpt.md
Original URL: https://arxiv.org/abs/2210.13382
Authors: Kenneth Li, Aspen K. Hopkins, David Bau, Fernanda Viégas, Hanspeter Pfister, Martin Wattenberg
Published: 2022-10-24 (preprint); ICLR 2023 (formal publication)
Common name: OthelloGPT

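Core question

During training, do language models "memorize surface statistics" or "build an internal world model"? Li et al. use Othello move sequences as a controlled testbed because the game tree is too large to memorize, while the rules and state are much simpler than in chess.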

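Key experiments and findings

Experimental setup

Model: 8-layer GPT with 8 attention heads, a 512-dimensional hidden space, and a 60-token vocabulary (one token per playable square)
Data: Championship (132,000 expert games) and Synthetic (20 million random legal games)
Key constraint: the model sees only move sequences and is never given board geometry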
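The 60-token vocabulary follows from the board itself: an 8×8 board has 64 squares, and the four center squares are occupied before the first move, which leaves 60 playable squares. A minimal sketch of such a vocabulary, assuming standard a1 to h8 square names and an arbitrary token order (both are illustrative choices, not taken from the paper):

```python
# Build a move vocabulary for Othello: 64 squares minus the 4 pre-filled
# center squares. Square names and token order are illustrative assumptions.
COLS = "abcdefgh"
CENTER = {"d4", "e4", "d5", "e5"}  # occupied at the start, never playable

squares = [f"{c}{r}" for r in range(1, 9) for c in COLS]
playable = [s for s in squares if s not in CENTER]

stoi = {s: i for i, s in enumerate(playable)}  # square name -> token id
itos = {i: s for s, i in stoi.items()}         # token id -> square name

def encode(game):
    """Turn a list of square names into token ids; no board state is included."""
    return [stoi[m] for m in game]

print(len(playable))                 # 60
print(encode(["d3", "c3", "c4"]))    # an arbitrary toy move sequence
```

The model's input is just such a list of ids, so any notion of rows, columns, or adjacency has to be inferred from move statistics.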

Performance results

Setting	Error rate
Untrained baseline	93.29%
Synthetic-data training	0.01%
Championship-data training	5.17%
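A minimal sketch of the metric behind these numbers, assuming the error rate is the share of top-1 next-move predictions that are illegal in the current position, and that predictions and legal-move sets are already available (the function name and toy data are hypothetical):

```python
def illegal_move_rate(predictions, legal_sets):
    """Fraction of top-1 predictions that are not legal in their position."""
    if len(predictions) != len(legal_sets):
        raise ValueError("one legal-move set is needed per prediction")
    errors = sum(1 for move, legal in zip(predictions, legal_sets) if move not in legal)
    return errors / len(predictions)

# Toy data (hypothetical): four positions, one illegal prediction.
preds = ["d3", "c5", "a1", "f4"]
legal = [{"d3", "c4", "f5", "e6"}, {"c3", "c5", "e3"}, {"b2", "c4"}, {"f4"}]
rate = illegal_move_rate(preds, legal)
print(f"{rate:.2%}")  # 25.00%
```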

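Anti-memorization test: on the "skewed dataset" (with 1/4 of the game tree removed), the error rate is still 0.02%, which rules out sequence memorization as an explanation.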
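A sketch of how such a skewed split could be built, assuming games are lists of square names and that the removed quarter corresponds to one of the four possible opening moves (the held-out opening "c4", the helper name, and the toy games are illustrative assumptions):

```python
def split_by_opening(games, held_out_opening):
    """Train on games that avoid one opening; test only on games that use it."""
    train = [g for g in games if g and g[0] != held_out_opening]
    test = [g for g in games if g and g[0] == held_out_opening]
    return train, test

# Toy game prefixes (hypothetical data).
games = [
    ["d3", "c3"], ["c4", "c3"], ["f5", "f6"], ["e6", "f6"], ["c4", "e3"],
]
train, test = split_by_opening(games, held_out_opening="c4")
print(len(train), len(test))  # 3 2
```

Every test sequence then starts with a move the model never saw in first position, so low error on the test split cannot come from replaying memorized training sequences.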

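Probe experiments

Linear probes fail: error rate >20% (random baseline 52.95%), a limited improvement
Non-linear (MLP) probes succeed: 1.7–4.8% on the synthetic-trained model and 9.4–12.8% on the championship-trained model, against a randomized baseline of 25.4–26.4%

Conclusion drawn in the paper: board state is encoded non-linearly in the model's activations.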
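To make the linear versus non-linear distinction concrete, here is a toy sketch, not the paper's probe or data: a feature that is the XOR of two activation coordinates cannot be read off by any single linear threshold, while a probe with one hidden ReLU layer recovers it exactly. The weights are hand-set for illustration.

```python
def relu(v):
    return max(0.0, v)

def linear_probe(x, w, b):
    """Linear readout: one weighted sum of the activation vector."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def mlp_probe(x):
    """One-hidden-layer readout with hand-set weights that computes XOR."""
    h1 = relu(x[0] + x[1])        # grows with the number of active coordinates
    h2 = relu(x[0] + x[1] - 1.0)  # nonzero only when both are active
    return h1 - 2.0 * h2

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
labels = [0, 1, 1, 0]  # XOR of the two coordinates

mlp_ok = all(round(mlp_probe(p)) == y for p, y in zip(points, labels))

# Brute-force a coarse grid of linear probes; none classifies all four points.
grid = [i / 2 for i in range(-8, 9)]
linear_ok = any(
    all((linear_probe(p, (w0, w1), b) > 0) == bool(y) for p, y in zip(points, labels))
    for w0 in grid for w1 in grid for b in grid
)
print(mlp_ok, linear_ok)  # True False
```

A failed linear probe therefore shows only that the chosen labels are not a linear function of the activations; it does not show that no linearly decodable description of the board exists.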

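Activation intervention experiment (methodological core)

High probe accuracy does not by itself show that the representation plays a causal role. Li et al. introduce activation intervention to test causality:

Use gradient descent to modify intermediate-layer activations so that the board state reported by the probe changes from B to a counterfactual B′
Observe whether downstream predictions change along with the modified activations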

x′ ← x − α ∂ℒ_CE(p_θ(x), B′)/∂x
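The update rule can be sketched for a single square as below. This is a simplified illustration rather than the paper's setup: it swaps in a linear softmax probe for the paper's MLP probe so that the gradient has a closed form, and the weights, activation, step size, and step count are all hypothetical.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]

def probe(W, x):
    """Linear softmax probe: class probabilities for one square."""
    return softmax([sum(w * xi for w, xi in zip(row, x)) for row in W])

def intervene(W, x, target, alpha=0.5, steps=50):
    """Run a fixed number of steps of x <- x - alpha * dCE/dx toward `target`."""
    x = list(x)
    for _ in range(steps):
        p = probe(W, x)
        # For softmax + cross-entropy, dCE/dlogits = p - onehot(target).
        g = [pk - (1.0 if k == target else 0.0) for k, pk in enumerate(p)]
        # Chain rule through the linear map: dCE/dx_j = sum_k g_k * W[k][j].
        grad = [sum(g[k] * W[k][j] for k in range(len(W))) for j in range(len(x))]
        x = [xj - alpha * gj for xj, gj in zip(x, grad)]
    return x

def argmax(v):
    return max(range(len(v)), key=v.__getitem__)

W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # hypothetical weights: 3 classes, 2 dims
x = [2.0, 0.0]                               # hypothetical activation
x_new = intervene(W, x, target=1)
print(argmax(probe(W, x)), argmax(probe(W, x_new)))  # 0 1
```

In the actual experiment the edited activation is written back into the network and the remaining layers are run forward; the causal test is whether the predicted legal moves then match B′ rather than B.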

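Results (intervention starting at layer Ls=4, applied to 5 layers):

Average error before intervention: 2.68 (legal board states) / 2.59 (illegal board states)
Average error after intervention: 0.12 (legal board states) / 0.06 (illegal board states)

The intervention remains effective even for unreachable board states outside the training distribution. The board representation therefore has a causal effect on predictions.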

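Latent saliency maps

A visualization is generated by measuring how a change to each square's internal representation affects the predicted output:

Synthetic training: saliency concentrates on the squares that determine legality, suggesting the model has learned the rules of the game
Championship training: more complex saliency patterns, reflecting strategic considerations beyond legality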
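The procedure can be sketched as a loop over squares: intervene on one square's latent state, re-run the prediction, and record how much the logit of a chosen move changes. The toy `predict_logit` below stands in for "model plus intervention" and is purely hypothetical, as is the sign convention (original minus intervened).

```python
def latent_saliency(predict_logit, board, move, flip):
    """Map each square to the logit change caused by flipping that square."""
    base = predict_logit(board, move)
    saliency = {}
    for square in board:
        edited = dict(board)
        edited[square] = flip(board[square])
        saliency[square] = base - predict_logit(edited, move)
    return saliency

# Toy stand-in: the logit depends only on squares "a" and "b" and ignores
# `move` (a hypothetical rule, not Othello).
def predict_logit(board, move):
    return 2.0 * (board["a"] == "black") + 1.0 * (board["b"] == "white")

def flip(state):
    return {"black": "white", "white": "black", "empty": "empty"}[state]

board = {"a": "black", "b": "white", "z": "empty"}
print(latent_saliency(predict_logit, board, "c", flip))
# {'a': 2.0, 'b': 1.0, 'z': 0.0}
```

Squares whose flip leaves the logit unchanged get zero saliency, which is why a legality-only model yields maps concentrated on the few squares that decide whether the move is legal.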

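Methodological contributions

Tool	Contribution
Non-linear probes	Reveal non-linearly encoded representations that linear probes miss
Activation intervention	Moves from correlation to causation by testing the functional role of the representation
Latent saliency maps	Attribution visualization that shows what the model attends to
Skewed-dataset test	Rules out sequence memorization as the source of performance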

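Relation to follow-up work

Follow-up work	Relation
Nanda et al. (2023)	Found that linear probes also succeed when squares are labeled MINE/YOURS/EMPTY, so the non-linear conclusion of Li et al. may partly be an artifact of the labeling frame
Yuan & Søgaard (2025)	Extends the analysis to 7 LLMs, replaces probes with Procrustes alignment, reaches 99% unsupervised accuracy, and finds that representations converge across architectures; also finds degraded 2-hop planning
jylin04 MATS (2024)	Mechanistic interpretability analysis of OthelloGPT, arguing that its algorithm is a "collection of heuristic rules" rather than a single unified algorithm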
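The relabeling behind the Nanda et al. result can be sketched as follows, assuming black moves first and that MINE means the player about to move (the function name, label strings, and toy board are illustrative):

```python
def to_player_relative(board, ply):
    """Relabel black/white/empty as mine/yours/empty for the player to move.

    `ply` is the number of moves already played; black moves on even plies.
    Passes (a player with no legal move) are ignored in this sketch.
    """
    mover = "black" if ply % 2 == 0 else "white"
    relabeled = {}
    for square, state in board.items():
        if state == "empty":
            relabeled[square] = "empty"
        else:
            relabeled[square] = "mine" if state == mover else "yours"
    return relabeled

board = {"d4": "white", "e4": "black", "c3": "empty"}
print(to_player_relative(board, ply=0))
print(to_player_relative(board, ply=1))
```

The same physical board receives opposite labels on alternating plies. If the model stores player-relative state, then color is a parity-dependent function of that state, which is the kind of target a linear probe on black/white labels would struggle to fit.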

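Key concept links

Othello world model hypothesis: this paper is the founding empirical evidence for the hypothesis
Probing classifiers: the paper's comparison of linear and non-linear probes is a classic case
Activation intervention (neural networks): the methodological core introduced by this paper
World models: OthelloGPT is early evidence for the LLM world-model hypothesis
Mechanistic interpretability: activation intervention and latent saliency maps are important precursors of MI methods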

References

Original paper: https://arxiv.org/abs/2210.13382
Clipped file: sources/arxiv_papers/2210.13382-emergent-world-representations-othello-gpt.md