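Yuan & Søgaard (2025): Revisiting the Othello World Model Hypothesis

Yuan & Søgaard (ICLR 2025): Othello experiments on 7 LLMs, up to 99% unsupervised grounding accuracy, and 93–96% cross-architecture representation alignment. Presented here as the strongest evidence to date for the world-model hypothesis, although 2-hop planning degrades.

Source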

SOURCE · YUAN & SØGAARD · ICLR 2025 · OthelloGPT revisited · cross-architecture alignment

Revisiting the Othello World Model Hypothesis

Yuan & Søgaard (University of Copenhagen, ICLR 2025) — validating the world-model hypothesis via cross-architecture Procrustes alignment

Extends the OthelloGPT experiment to 7 LLMs (GPT-2, T5, Bart, Flan-T5, Mistral, LLaMA-2, Qwen2.5), replacing probes with Procrustes representation alignment: unsupervised alignment reaches 96.1% — under sequence-prediction pressure, different architectures converge on the same underlying representational attractor. But one critical limit: 2-hop continued prediction degrades significantly.

Cross-architecture alignment results

Supervised alignment (GPT-2 → Bart): 93.1% cosine similarity

Unsupervised alignment (Bart → Mistral): 96.1% cosine similarity

Seven models — decoders (GPT-2, Mistral) and encoder-decoders (T5, Bart) — all converge on the same representational attractor

Key tensions and interpretations

2-hop degradation

1-hop move prediction near perfect; 2-hop degrades significantly — board-representation precision ≠ strategic planning ability

Relation to jylin04

MATS “Bag of Heuristics”: representations can be precise, yet the mechanism computing them is still a distributed aggregate of heuristic rules — two distinct levels of the question

Spatial-relation structure

Latent move projections: the top-5 predicted legal moves are nearest, in embedding space, to board-spatially adjacent cells — not just rules, but board geometry

Methodological strength

Procrustes alignment exposes global representational organization, not just presence of single features — a stronger method than probing

→ Othello World Model Hypothesis · World Models · Neel Nanda
ICLR 2025 · arXiv:2503.04421


Source: sources/arxiv_papers/2503.04421-revisiting-othello-world-model-hypothesis.md
Original URL: https://arxiv.org/abs/2503.04421
Authors: Yifei Yuan, Anders Søgaard (University of Copenhagen)
Published: 2025-03-06, ICLR 2025

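Summary

This paper systematically revisits the "Othello world model hypothesis" of Li et al. (2023). The hypothesis holds that sequence models such as GPT-2, trained only on sequences of game moves, can develop an internal representation of the board state, that is, an implicit "world model".

Yuan & Søgaard scale the experiment up to seven LLMs (GPT-2, T5, Bart, Flan-T5, Mistral, LLaMA-2, Qwen2.5) and test the hypothesis with a stronger methodology than probing: cross-model representation alignment. They conclude that all models reach up to 99% unsupervised grounding accuracy and that the learned board features are highly similar across architectures.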

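Key findings

1. All seven models develop board representations

Whether decoder-only (GPT-2, Mistral) or encoder-decoder (T5, Bart), and whether pretrained or not, every model learns Othello move prediction given enough data and induces the board layout.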

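2. Cross-architecture representation alignment reaches 93–96%

Key methodological innovation: Procrustes representation alignment, borrowed from cross-lingual embedding research, replaces probes. After two models have each processed the same sequences, a linear map W is learned that aligns the final hidden-layer representations of the two models.

Supervised alignment (GPT-2 → Bart): 93.1% cosine similarity
Unsupervised alignment (Bart → Mistral): 96.1% cosine similarity

This indicates that, under sequence-prediction pressure, different architectures converge on the same underlying representational attractor.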
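The alignment step can be illustrated with a minimal sketch. It assumes the orthogonal form of Procrustes used in cross-lingual embedding work, that both models' hidden states have already been brought to the same dimension, and NumPy as the only dependency. The matrices X and Y are synthetic stand-ins, not the paper's data, and the function names are hypothetical.

```python
import numpy as np

def procrustes_align(X, Y):
    """Orthogonal W minimising ||X @ W - Y||_F (closed form via SVD)."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

def mean_cosine(A, B):
    """Mean row-wise cosine similarity between two (n, d) matrices."""
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return float(np.mean(np.sum(A * B, axis=1)))

# Synthetic stand-ins for two models' final hidden states on the same sequences
# (assumption: equal dimension d; real models would need a projection first).
rng = np.random.default_rng(0)
n, d = 500, 16
X = rng.normal(size=(n, d))
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))   # hidden orthogonal map between "models"
Y = X @ Q + 0.01 * rng.normal(size=(n, d))     # second model = rotated copy plus noise

W = procrustes_align(X, Y)
score = mean_cosine(X @ W, Y)
print(f"mean cosine after alignment: {score:.3f}")
```

In this toy setup the second representation is a rotated, lightly perturbed copy of the first, so the recovered map brings the two close together. A high score on real hidden states would be the kind of evidence the cosine-similarity figures above summarize.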

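3. Tension between 99% accuracy and shallow sequence prediction

1-hop move prediction: near perfect
2-hop consecutive move prediction: degrades significantly

Board-state representation accuracy ≠ strategic depth. A structural world model does not automatically entail planning ability.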
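The 1-hop versus 2-hop distinction can be made concrete with a small legality check. The sketch below assumes that a 2-hop prediction counts as correct only when both emitted moves are legal in sequence, and it ignores pass turns. The two predictors are hypothetical toys, not the models from the paper.

```python
DIRS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

def new_board():
    """Standard opening: 1 = black, -1 = white, 0 = empty."""
    b = [[0] * 8 for _ in range(8)]
    b[3][3] = b[4][4] = -1
    b[3][4] = b[4][3] = 1
    return b

def flips(board, r, c, player):
    """Discs flipped if `player` plays (r, c); an empty list means illegal."""
    if board[r][c] != 0:
        return []
    out = []
    for dr, dc in DIRS:
        run, rr, cc = [], r + dr, c + dc
        while 0 <= rr < 8 and 0 <= cc < 8 and board[rr][cc] == -player:
            run.append((rr, cc))
            rr, cc = rr + dr, cc + dc
        if run and 0 <= rr < 8 and 0 <= cc < 8 and board[rr][cc] == player:
            out.extend(run)
    return out

def legal_moves(board, player):
    return [(r, c) for r in range(8) for c in range(8) if flips(board, r, c, player)]

def play(board, move, player):
    """Return a new board after `player` plays `move` (assumed legal)."""
    new = [row[:] for row in board]
    for rr, cc in flips(board, move[0], move[1], player) + [move]:
        new[rr][cc] = player
    return new

def hop_accuracy(predict, positions, hops):
    """Fraction of positions where `predict` emits `hops` consecutive legal moves."""
    ok = 0
    for board, player in positions:
        moves = predict(board, player, hops)
        good = len(moves) == hops
        for m in moves:
            if not good or m not in legal_moves(board, player):
                good = False
                break
            board, player = play(board, m, player), -player
        ok += good
    return ok / len(positions)

def stale_predictor(board, player, hops):
    """Toy: second move is chosen from the ORIGINAL board, never updated."""
    moves = [legal_moves(board, player)[0]]
    if hops == 2:
        moves.append(legal_moves(board, -player)[-1])
    return moves

def tracking_predictor(board, player, hops):
    """Toy: simulates the board after each of its own moves."""
    moves = []
    for _ in range(hops):
        m = legal_moves(board, player)[0]
        moves.append(m)
        board, player = play(board, m, player), -player
    return moves

positions = [(new_board(), 1)]
for name, p in [("stale", stale_predictor), ("tracking", tracking_predictor)]:
    print(name, hop_accuracy(p, positions, 1), hop_accuracy(p, positions, 2))
```

The stale predictor is deliberately extreme: it is correct on the first move but never updates its view of the board, so its second move fails. It illustrates why strong 1-hop accuracy alone does not show that a model can roll its own board state forward.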

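4. Latent move projections reveal spatial relations

When hidden-layer features are projected into a visual space, the model's top-5 predicted legal moves lie closest, in embedding space, to the cells spatially adjacent to the target cell. This suggests the model learns not only the rules of the game but also the spatial relational structure of the board.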
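One way to quantify this kind of observation is to ask what share of a cell's nearest neighbours in embedding space are also its neighbours on the board. The sketch below is an assumption about how such a check could be written, not the paper's procedure. It uses Euclidean distance and a toy embedding that simply copies each cell's coordinates, so the result is adjacent by construction. With real projected features, `emb` would hold the model-derived vectors instead.

```python
import math

def adjacent(a, b):
    """True if cells a and b are distinct and touch (including diagonally)."""
    return a != b and max(abs(a[0] - b[0]), abs(a[1] - b[1])) == 1

def adjacency_rate(emb, target, k=5):
    """Share of the k nearest embeddings to `target` that are board-adjacent."""
    others = [cell for cell in emb if cell != target]
    others.sort(key=lambda cell: math.dist(emb[cell], emb[target]))
    return sum(adjacent(cell, target) for cell in others[:k]) / k

# Toy embedding (assumption): each cell's vector is just its board coordinates.
toy_emb = {(r, c): (float(r), float(c)) for r in range(8) for c in range(8)}
rate = adjacency_rate(toy_emb, (3, 3))
print(f"adjacent share among top-5 neighbours: {rate:.2f}")
```

A rate well above what random embeddings would give is what "board geometry, not just rules" would look like under this measure.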

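Methodological contributions

Method	Advantage over probing
Procrustes representation alignment	Reveals how information is organized globally across models, not merely whether a single feature is present
Unsupervised alignment (adversarial training + Procrustes refinement)	Requires no parallel annotated data
2-hop generation evaluation	Separates board representation from strategic planning ability
Latent move projection	Shows spatial-relation encoding that probes cannot reach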

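Implications for related concepts

Relation to Li et al. (2023)

Yuan & Søgaard go a step further methodologically. The original OthelloGPT work used nonlinear probes and causal interventions to show that the representation exists; this paper uses cross-model alignment to show that the representation converges across architectures and has spatial structure. The chain of evidence for the hypothesis becomes more solid.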

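Tension with the jylin04 (2024) MATS analysis

The MATS analysis argues that OthelloGPT learns "a collection of heuristic rules" rather than a unified algorithm. The findings of Yuan & Søgaard (high accuracy, cross-architecture convergence) do not directly refute this: a representation can be accurate while the mechanism that computes it is still a distributed collection of heuristic rules. The two address different levels of the question:

"What is the representation?" (this paper's answer: a precise board state)
"How is the representation computed?" (the MATS answer: an aggregate of local heuristic rules)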

Key concept links

References

Original paper: https://arxiv.org/abs/2503.04421
Clipped file: sources/arxiv_papers/2503.04421-revisiting-othello-world-model-hypothesis.md
