
Neel Nanda

Neel Nanda: interpretability researcher at Google DeepMind, author of TransformerLens, MATS mentor, and a core builder of the mechanistic interpretability field
ENTITY · NEEL NANDA · GOOGLE DEEPMIND · TRANSFORMERLENS · MATS · MECHANISTIC INTERPRETABILITY


Interpretability researcher at Google DeepMind — core infrastructure builder for the mechanistic interpretability field

Nanda built TransformerLens, the most widely used open-source toolkit in mechanistic interpretability (MI), which offers a clean hook API and support for residual-stream decomposition. He is a mentor in the MATS (ML Alignment Theory Scholars) training program and co-authored a key result in OthelloGPT representation research: switching to the right feature basis (MINE/YOURS/EMPTY) lifted linear-probe accuracy from ~75% to ~99%.

Core Contributions
TransformerLens
The standard MI toolkit: a hook API that reads or modifies activations anywhere in the model, with support for residual-stream decomposition and attention-pattern analysis
MATS Training Program
Intensive MI research program; jylin04’s OthelloGPT “Bag of Heuristics” came out of MATS 6.0 (Summer 2024)
OthelloGPT Linearization (2023)
With Lee & Wattenberg: the MINE/YOURS/EMPTY basis correction took linear probes from ~75% to ~99%
Methodological Insight
The Basis Problem
Li et al.’s non-linear-probe conclusion stemmed from the wrong feature basis, not genuine non-linearity of the representation
Intervention Simplicity
Proposes a single-shot linear vector-addition intervention — simpler and more effective than Li et al.’s gradient iteration
Grokking Research
Circuit-level analysis of small transformer toy tasks (modular addition), revealing algorithmic phase transitions and grokking
→ Mechanistic Interpretability · Othello World Model · Linear Representation
BlackboxNLP @ EMNLP 2023


Field: Mechanistic Interpretability
Affiliation: Google DeepMind (interpretability team); formerly Anthropic
Role: Mentor in the MATS (ML Alignment Theory Scholars) program; mechanistic interpretability researcher


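Main Contributions

TransformerLens (open-source library)

Neel Nanda developed TransformerLens (formerly EasyTransformer), the most widely used open-source library in mechanistic interpretability research. It provides:

  • A clean hook API that can read or modify activations at any point in the model
  • A library of pretrained models
  • Support for core MI operations such as residual-stream decomposition and attention-pattern analysis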
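
The hook idea in the list above can be illustrated with a toy sketch. This is deliberately not the TransformerLens API: `ToyHookedModel`, its hook-point names, and the hook signature are all hypothetical, chosen only to show what "read or modify activations at a named point" means. A hook that returns `None` only observes the activation; a hook that returns a value replaces it for the rest of the forward pass.

```python
from typing import Callable, Dict, List, Optional

Vector = List[float]
# A hook receives the activation and the hook-point name.
# Returning None means "read only"; returning a vector overwrites the activation.
Hook = Callable[[Vector, str], Optional[Vector]]


class ToyHookedModel:
    """Hypothetical stand-in model: two layers that each add to a residual stream."""

    def __init__(self) -> None:
        # Hypothetical hook-point names, one per layer output.
        self.layer_names = ["layer0.resid_post", "layer1.resid_post"]

    def _layer(self, index: int, resid: Vector) -> Vector:
        # Each toy layer writes a fixed update (index + 1) into every coordinate.
        return [r + float(index + 1) for r in resid]

    def run_with_hooks(self, resid: Vector, hooks: Dict[str, Hook]) -> Vector:
        for index, name in enumerate(self.layer_names):
            resid = self._layer(index, resid)
            hook = hooks.get(name)
            if hook is not None:
                result = hook(resid, name)
                if result is not None:
                    resid = result
        return resid


def run_demo():
    model = ToyHookedModel()
    cache: Dict[str, Vector] = {}

    def save(act: Vector, name: str) -> None:
        # Read-only hook: copy the activation into a cache.
        cache[name] = list(act)

    def zero_ablate(act: Vector, name: str) -> Vector:
        # Modifying hook: replace the activation with zeros.
        return [0.0 for _ in act]

    clean = model.run_with_hooks([0.0, 0.0], {"layer0.resid_post": save})
    ablated = model.run_with_hooks([0.0, 0.0], {"layer0.resid_post": zero_ablate})
    return cache, clean, ablated
```

Comparing `clean` with `ablated` shows how much the final output depends on the hooked activation, which is the basic pattern behind ablation and patching experiments.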

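MATS Program

MATS (ML Alignment Theory Scholars) is a research training program in which Neel Nanda mentors a mechanistic interpretability stream. Each cohort recruits graduate students and researchers for intensive MI research. jylin04’s OthelloGPT analysis (“Bag of Heuristics”) came out of MATS 6.0 (Summer 2024).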

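OthelloGPT Linear Representations (2023)

With Andrew Lee and Martin Wattenberg, Nanda published “Emergent Linear Representations in World Models of Self-Supervised Sequence Models” (BlackboxNLP @ EMNLP 2023).

Core finding: Li et al. (2022) concluded that OthelloGPT’s board state could only be decoded with non-linear probes. Nanda et al. traced this to the wrong feature basis. Once the probe targets are Mine/Yours/Empty (relative to the player to move) instead of Black/White/Empty (absolute color), linear-probe accuracy jumps from ~75% to ~99%. The paper also proposes a single-shot linear vector-addition intervention that is simpler than Li et al.’s iterative gradient-based intervention.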
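
The change of basis is easiest to see as a relabeling of probe targets. The sketch below is a minimal illustration, assuming a made-up board encoding (`"B"`, `"W"`, `"."`) and a hypothetical helper name; it is not the paper's code. The same physical board yields different labels depending on whose turn it is.

```python
from typing import List

# Assumed encoding for this sketch only: "B" = black, "W" = white, "." = empty.
EMPTY, BLACK, WHITE = ".", "B", "W"


def to_relative_labels(board: List[str], player_to_move: str) -> List[str]:
    """Relabel absolute colors as MINE / YOURS / EMPTY relative to the mover."""
    if player_to_move not in (BLACK, WHITE):
        raise ValueError(f"unknown player: {player_to_move!r}")
    labels = []
    for square in board:
        if square == EMPTY:
            labels.append("EMPTY")
        elif square == player_to_move:
            labels.append("MINE")
        elif square in (BLACK, WHITE):
            labels.append("YOURS")
        else:
            raise ValueError(f"unknown square value: {square!r}")
    return labels


# A four-square slice of a board, labeled from each player's point of view.
board_slice = ["B", "W", ".", "W"]
labels_black_to_move = to_relative_labels(board_slice, BLACK)
labels_white_to_move = to_relative_labels(board_slice, WHITE)
```

Because the player to move alternates every turn, a fixed square's absolute color and its relative label disagree on half of the moves, which is one way to see why a probe trained on absolute colors can look non-linear.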

→ Wiki summary: sources/2309.00941-emergent-linear-representations-world-models.md
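
A single-shot linear vector-addition intervention can be sketched generically as x' = x + αd, where x is a residual-stream activation, d is a probe direction, and α is a scale. The function names, the example vectors, and the value of α below are illustrative assumptions; how the paper chooses the direction, the scale, and the layers is not stated on this page.

```python
from typing import List

Vector = List[float]


def probe_score(resid: Vector, direction: Vector) -> float:
    """Linear probe readout: the dot product of the activation with a direction."""
    return sum(r * d for r, d in zip(resid, direction))


def add_direction(resid: Vector, direction: Vector, alpha: float) -> Vector:
    """One-step intervention: shift the activation by alpha along the direction."""
    if len(resid) != len(direction):
        raise ValueError("activation and direction must have the same length")
    return [r + alpha * d for r, d in zip(resid, direction)]


# Illustrative values only.
resid = [1.0, -2.0, 0.5]
direction = [0.0, 1.0, 0.0]  # hypothetical probe direction for one square
patched = add_direction(resid, direction, alpha=3.0)

score_before = probe_score(resid, direction)
score_after = probe_score(patched, direction)
```

In contrast with an iterative gradient-based edit, this is a single addition with no optimization loop, which is the sense in which the intervention is simpler.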

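Modular Arithmetic and Toy-Task Research

Neel Nanda and collaborators systematically studied the circuits that small transformers learn on toy tasks such as modular addition, documenting the grokking phenomenon and the phase transitions through which the model discovers an algorithm.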
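
For readers unfamiliar with the task, a modular addition dataset is small enough to enumerate in full. The sketch below is an assumption-laden illustration: the modulus, train fraction, seed, and function names are placeholders, not the settings used in the research described above. Grokking is studied by training on a fraction of all pairs and tracking accuracy on the held-out remainder.

```python
import random
from typing import List, Tuple

Example = Tuple[int, int, int]  # (a, b, (a + b) mod p)


def modular_addition_dataset(p: int) -> List[Example]:
    """Enumerate every input pair (a, b) with its label (a + b) mod p."""
    return [(a, b, (a + b) % p) for a in range(p) for b in range(p)]


def train_test_split(
    examples: List[Example], train_fraction: float, seed: int
) -> Tuple[List[Example], List[Example]]:
    """Shuffle deterministically, then split into train and held-out sets."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Placeholder settings for illustration.
data = modular_addition_dataset(p=7)
train, held_out = train_test_split(data, train_fraction=0.3, seed=0)
```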


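Relationship to OthelloGPT Research

jylin04’s (2024) OthelloGPT Bag of Heuristics analysis came out of MATS 6.0 and was supervised by Neel Nanda. Nanda has also worked with OthelloGPT himself: his research with collaborators produced the linearized version of the Othello world-model hypothesis (the MINE/YOURS/EMPTY labeling scheme), which is the Nanda et al. (2023) entry in the hypothesis’s evolution.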


Related Concepts

References

  • Nanda et al. (2023): “Emergent Linear Representations in World Models of Self-Supervised Sequence Models” (BlackboxNLP @ EMNLP 2023) → sources/arxiv_papers/2309.00941-emergent-linear-representations-world-models.md; Wiki summary: sources/2309.00941-emergent-linear-representations-world-models.md
  • jylin04 (2024): “OthelloGPT Learned a Bag of Heuristics” (MATS 6.0) → sources/othellogpt-bag-of-heuristics-jylin04-mats2024.md