Neel Nanda

Neel Nanda: interpretability researcher at Google DeepMind, author of TransformerLens, MATS mentor, and a core builder of the mechanistic interpretability field




Interpretability researcher at Google DeepMind — core infrastructure builder for the mechanistic interpretability field

Nanda built TransformerLens, the most widely used open-source toolkit in mechanistic interpretability (MI), which offers a clean hook API and supports residual-stream decomposition. He is a mentor in the MATS (ML Alignment Theory Scholars) training program and made a key contribution to OthelloGPT representation research: switching to the right feature basis (MINE/YOURS/EMPTY) lifted linear-probe accuracy from ~75% to ~99%.

Core Contributions

TransformerLens

The standard MI toolkit: a hook API that reads or modifies activations anywhere in the model, with support for residual-stream decomposition and attention-pattern analysis

MATS Training Program

Intensive research training program in which Nanda mentors MI projects; jylin04’s OthelloGPT “Bag of Heuristics” analysis came out of MATS 6.0 (Summer 2024)

OthelloGPT Linearization (2023)

With Lee & Wattenberg: replacing the Black/White/Empty basis with MINE/YOURS/EMPTY took linear probes from ~75% to ~99% accuracy

Methodological Insight

The Basis Problem

Li et al. concluded that only non-linear probes could decode the board state; that result stemmed from probing in the wrong feature basis, not from genuine non-linearity of the representation

Intervention Simplicity

Proposes a single-shot linear vector-addition intervention — simpler and more effective than Li et al.’s gradient iteration

Grokking Research

Circuit-level analysis of small transformer toy tasks (modular addition), revealing algorithmic phase transitions and grokking
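The single-shot vector-addition intervention described under "Intervention Simplicity" can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the probe matrix `W`, the synthetic activation `x`, the helper names `read_state` and `flip_square`, and the scale `alpha` are hypothetical and are not taken from the paper. The only point is that the edit is a single addition along a probe direction, with no gradient loop.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model = 16
LABELS = ["empty", "mine", "yours"]

# Hypothetical probe: three orthonormal directions, one per label (rows of W).
q, _ = np.linalg.qr(rng.normal(size=(d_model, 3)))
W = q.T  # shape (3, d_model)

def read_state(x):
    """Linear readout: the label whose probe direction scores highest."""
    return LABELS[int(np.argmax(W @ x))]

def flip_square(x, source, target, alpha=4.0):
    """Single-shot edit: one vector addition along (target - source)."""
    s, t = LABELS.index(source), LABELS.index(target)
    return x + alpha * (W[t] - W[s])

# Synthetic activation that encodes "mine" for one square, plus small noise.
x = 2.0 * W[LABELS.index("mine")] + 0.05 * rng.normal(size=d_model)
x_edited = flip_square(x, "mine", "yours")

print(read_state(x), "->", read_state(x_edited))  # prints: mine -> yours
```

In a real model the edited activation would be written back into the residual stream and the forward pass continued; the direction and scale used in the published work may differ from this sketch.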

→ Mechanistic Interpretability · Othello World Model · Linear Representation

BlackboxNLP @ EMNLP 2023


Field: Mechanistic Interpretability
Affiliation: Google DeepMind (interpretability team); formerly Anthropic
Role: MATS (ML Alignment Theory Scholars) program mentor; mechanistic interpretability researcher

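Main Contributions

TransformerLens (open-source library)

Neel Nanda developed TransformerLens (formerly EasyTransformer), the most widely used open-source library in mechanistic interpretability research. It provides:

A clean hook API for reading or modifying activations at any point in the model
A library of pretrained models
Support for core MI operations such as residual-stream decomposition and attention-pattern analysis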

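MATS Program

MATS (ML Alignment Theory Scholars) is a research training program in which Neel Nanda serves as a mentor; each cohort, his stream takes on graduate students and researchers for intensive mechanistic interpretability research. jylin04's OthelloGPT analysis ("Bag of Heuristics") came out of MATS 6.0 (Summer 2024).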

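OthelloGPT Linear Representations (2023)

With Andrew Lee and Martin Wattenberg, Nanda published "Emergent Linear Representations in World Models of Self-Supervised Sequence Models" (BlackboxNLP @ EMNLP 2023).

Core finding: Li et al. (2022) argued that OthelloGPT's board state could only be decoded with non-linear probes. Nanda et al. traced the problem to its root: the wrong feature coordinate system. After replacing Black/White/Empty (absolute color) with Mine/Yours/Empty (relative to the player to move), linear-probe accuracy jumped from ~75% to ~99%. They also proposed a single-shot linear vector-addition intervention, which is simpler than Li et al.'s iterative gradient-based intervention.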
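To see why the choice of basis matters for a linear probe, here is a toy sketch on synthetic data. It does not reproduce the paper's experiment or its numbers. It assumes, purely for illustration, that activations encode the player-relative state (mine/yours/empty) along two directions and carry no information about whose turn it is. Under that assumption a least-squares linear probe recovers the relative labels almost perfectly, while the absolute labels (black/white) are only recoverable at chance for occupied squares. All names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_model = 2000, 16

# Absolute square state: +1 black, -1 white, 0 empty.
abs_state = rng.integers(-1, 2, size=n)
# Player to move: +1 black, -1 white.
player = rng.choice([-1, 1], size=n)
# Relative state: +1 mine, -1 yours, 0 empty.
rel_state = abs_state * player

# Synthetic activations: one direction for "mine", another for "yours".
d_mine, d_yours = rng.normal(size=(2, d_model))
acts = (
    (rel_state == 1)[:, None] * d_mine
    + (rel_state == -1)[:, None] * d_yours
    + 0.1 * rng.normal(size=(n, d_model))
)

def linear_probe_accuracy(x, labels):
    """Fit a least-squares linear probe on the first half, score on the second."""
    classes = np.array([-1, 0, 1])
    onehot = (labels[:, None] == classes[None, :]).astype(float)
    x_bias = np.hstack([x, np.ones((len(x), 1))])
    half = len(x) // 2
    w, *_ = np.linalg.lstsq(x_bias[:half], onehot[:half], rcond=None)
    pred = classes[np.argmax(x_bias[half:] @ w, axis=1)]
    return float(np.mean(pred == labels[half:]))

acc_rel = linear_probe_accuracy(acts, rel_state)  # mine/yours/empty labels
acc_abs = linear_probe_accuracy(acts, abs_state)  # black/white/empty labels
print(f"relative basis: {acc_rel:.2f}, absolute basis: {acc_abs:.2f}")
```

The gap in this toy comes entirely from the labeling: the same activations and the same probe family are used in both cases, and only the target basis changes.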

→ Wiki summary: sources/2309.00941-emergent-linear-representations-world-models.md

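Modular Arithmetic and Toy-Task Research

Neel Nanda and collaborators systematically studied the circuits that small transformers learn on toy tasks such as modular addition, characterizing grokking and the phase transitions through which the model arrives at its algorithm.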

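Relationship to OthelloGPT Research

jylin04's (2024) OthelloGPT "Bag of Heuristics" analysis is a product of MATS 6.0 and was supervised by Neel Nanda. Nanda also worked with OthelloGPT directly: his research with collaborators established the linearized version of the Othello world-model hypothesis (the MINE/YOURS/EMPTY labeling scheme), which is the Nanda et al. (2023) entry in the evolution of that hypothesis.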

References

Nanda et al. (2023): "Emergent Linear Representations in World Models of Self-Supervised Sequence Models" (BlackboxNLP @ EMNLP 2023) → sources/arxiv_papers/2309.00941-emergent-linear-representations-world-models.md; wiki summary: sources/2309.00941-emergent-linear-representations-world-models.md
jylin04 (2024): "OthelloGPT Learned a Bag of Heuristics" (MATS 6.0) → sources/othellogpt-bag-of-heuristics-jylin04-mats2024.md
