The Linear Representation Hypothesis and the Geometry of Large Language Models

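Park, Choe, Veitch (ICML 2024): a rigorous counterfactual formalization of the LRH; the causal inner product unifies the subspace, measurement, and intervention representations; validated experimentally on LLaMA-2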

SOURCE · PARK CHOE VEITCH · ICML 2024 · rigorous LRH formalization · causal inner product

The Linear Representation Hypothesis

Park, Choe, Veitch (U Chicago / Google, ICML 2024) — the rigorous mathematical formalization of LRH

“Linear representation” is empirically well known, but what does it actually mean? This paper distinguishes three different intuitions (subspace, measurement, intervention), formalizes “concept” with counterfactual variables, and introduces the causal inner product: an inner product under which causally separable concepts are orthogonal, and which unifies all three intuitions in one mathematical framework.

Three intuitions about linear representation

Subspace: counterfactual word-pair differences lie along one direction, e.g. queen − king ∥ woman − man

Measurement: a linear probe can read off the concept value (is this French or English?)

Intervention: adding a vector along a direction changes the target concept without disturbing others

Key properties of the causal inner product

Definition

An inner product under which causally separable concepts are orthogonal

Form

⟨γ̄, γ̄′⟩_C = γ̄ᵀ Cov(γ)⁻¹ γ̄′ (weighted by the inverse covariance of the vocabulary unembedding vectors)

LLaMA-2 7B validation

Block-diagonal structure over 27 concepts — semantically related concepts share blocks, causally separable ones are approximately orthogonal

Why plain Euclidean fails

LLM representations are identified only up to an arbitrary invertible linear transformation, so the Euclidean inner product carries no intrinsic semantic meaning

→ Linear Representation Hypothesis · Mechanistic Interpretability · ICML 2024 · arXiv:2311.03658

Source: sources/arxiv_papers/2311.03658-linear-rep-hypothesis.md
URL: https://arxiv.org/abs/2311.03658
Authors: Kiho Park, Yo Joong Choe, Victor Veitch (University of Chicago / Google)
Published: 2023-11-07 (accepted at ICML 2024)

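Core question of the paper

The “linear representation hypothesis” (LRH) is a widely circulated empirical observation, yet almost no one has seriously asked what “linear representation” actually means.

The LRH in fact corresponds to at least three different intuitions:

Subspace: counterfactual word-pair differences fall along the same direction (e.g. “queen” − “king” is parallel to “woman” − “man”)
Measurement: a linear probe can predict the concept value (is it French or English?)
Intervention: modifying activations along a direction vector changes the output concept without affecting other concepts

Under what conditions are these three intuitions equivalent? And what geometric structure should be used to measure similarity between directions? Park et al. give the most rigorous answer to date.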
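
The subspace intuition can be checked numerically by comparing each counterfactual difference against the average difference direction. The following is a minimal sketch on synthetic vectors, not the paper's procedure: the function name `pair_alignment`, the toy "male => female" direction, and the noise level are all assumptions made for illustration, and plain Euclidean cosine is used only for simplicity (which inner product is appropriate is a separate question that the paper addresses with the causal inner product).

```python
import numpy as np

def pair_alignment(gamma0, gamma1):
    """Return the mean difference direction and each pair's cosine to it."""
    diffs = gamma1 - gamma0                        # shape (n_pairs, d)
    direction = diffs.mean(axis=0)
    direction = direction / np.linalg.norm(direction)
    cosines = diffs @ direction / np.linalg.norm(diffs, axis=1)
    return direction, cosines

rng = np.random.default_rng(0)
d, n_pairs = 16, 5
concept = rng.normal(size=d)                       # hypothetical concept direction
gamma_male = rng.normal(size=(n_pairs, d))         # toy vectors for "king", "man", ...
scales = rng.uniform(0.5, 2.0, size=(n_pairs, 1))  # each pair may move a different distance
noise = 0.01 * rng.normal(size=(n_pairs, d))
gamma_female = gamma_male + scales * concept + noise

direction, cosines = pair_alignment(gamma_male, gamma_female)
print(np.round(cosines, 3))                        # values near 1 indicate a shared direction
```

With real unembedding vectors, the same function would take the rows for the base words and their counterfactual partners; cosines well below 1 would suggest that the pairs do not share a single direction.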

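Core contributions

1. Counterfactual formalization of concepts

The paper formalizes a “concept” with counterfactual variables: a concept $W$ is a latent variable that is caused by the context $X$ and in turn acts as a cause of the output $Y$. Two concepts are causally separable if and only if they can vary independently (for example, language and gender can each be switched on their own, but “French→English” and “French→Russian” cannot both hold at once).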

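2. Mathematical connections among the three representations

Representation type	Space	Defined by	Connecting theorem
Unembedding representation	Output word-vector (unembedding) space $\Gamma$	Direction of counterfactual word-pair differences	Connects to measurement (Thm 2.2)
Embedding representation	Context activation (embedding) space $\Lambda$	Direction of differences between context pairs that change only the target concept	Connects to intervention (Thm 2.5)
Unification via the causal inner product	Unified space after transformation	The two representations coincide under the causal inner product	Riesz isomorphism (Thm 3.2)

Core chain of theorems: unembedding representation → measurement representation (linear probe); embedding representation → intervention representation (steering vector); the causal inner product unifies the two into a single direction.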
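
The first link in this chain can be written out in one line. This is a sketch under two assumptions: the model has the softmax form $P(y \mid \lambda) \propto \exp(\lambda^\top \gamma(y))$, and for a counterfactual pair $(Y(0), Y(1))$ that differs only in $W$ the difference $\gamma(Y(1)) - \gamma(Y(0))$ is a positive multiple $\alpha$ of the unembedding representation $\bar{\gamma}_W$:

$$
\operatorname{logit} P\big(Y = Y(1) \mid Y \in \{Y(0), Y(1)\}, \lambda\big)
= \lambda^\top \big(\gamma(Y(1)) - \gamma(Y(0))\big)
= \alpha \, \lambda^\top \bar{\gamma}_W, \qquad \alpha > 0 .
$$

The first equality holds because the softmax normalizer cancels when conditioning on the two candidates. The projection $\lambda^\top \bar{\gamma}_W$ therefore behaves as a linear probe for $W$, up to a positive scale that depends on the word pair.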

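3. Causal Inner Product

The ordinary Euclidean inner product has no semantic meaning in an LLM's representation space: the model parameters are determined only through the softmax probabilities, so the representation space is identifiable only up to an arbitrary invertible linear transformation.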
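
The unidentifiability argument can be illustrated with a small numerical sketch. It assumes a toy softmax model with logits $\lambda^\top \gamma(y)$ and applies the reparameterization $\gamma \to A\gamma$, $\lambda \to A^{-\top}\lambda$ for an arbitrary invertible $A$; the sizes and random vectors are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab = 8, 50
gamma = rng.normal(size=(vocab, d))      # toy unembedding vectors, one row per token
lam = rng.normal(size=d)                 # toy context embedding

A = rng.normal(size=(d, d)) + 3.0 * np.eye(d)   # arbitrary matrix, checked to be invertible
assert np.linalg.matrix_rank(A) == d
gamma_t = gamma @ A.T                    # gamma(y) -> A gamma(y)
lam_t = np.linalg.solve(A.T, lam)        # lambda   -> A^{-T} lambda

def softmax(z):
    z = z - z.max()                      # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

same_probs = np.allclose(softmax(gamma @ lam), softmax(gamma_t @ lam_t))
cos_before = cosine(gamma[0], gamma[1])
cos_after = cosine(gamma_t[0], gamma_t[1])
print(same_probs, round(cos_before, 3), round(cos_after, 3))
```

The output distribution is identical in both parameterizations, while the Euclidean cosine between the same two tokens changes. Any similarity measure that is meant to be meaningful has to be chosen with this freedom in mind.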

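Park et al. define the causal inner product as an inner product that satisfies the following condition:

Causally separable concepts are orthogonal under this inner product.

That is, if “language” and “gender” are independent concepts, the inner product of their direction vectors under the causal inner product is zero.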

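Its explicit form (under reasonable assumptions): $\langle \bar{\gamma}, \bar{\gamma}' \rangle_C := \bar{\gamma}^\top \mathrm{Cov}(\gamma)^{-1} \bar{\gamma}'$

where $\gamma$ is the unembedding vector of a word sampled uniformly from the vocabulary, and $\mathrm{Cov}(\gamma)^{-1}$ is the inverse of the covariance matrix of the vocabulary's unembedding vectors.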
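
The explicit form above can be computed directly from an unembedding matrix. The sketch below uses a synthetic matrix in place of a real model's unembeddings; the helper names `vocab_cov`, `causal_inner_product`, and `whitening_matrix` are hypothetical. It also shows the equivalent view in which directions are first multiplied by $\mathrm{Cov}(\gamma)^{-1/2}$ and then compared with the ordinary dot product.

```python
import numpy as np

def vocab_cov(gamma):
    # Population covariance over rows, matching a uniform draw from the vocabulary.
    return np.cov(gamma, rowvar=False, bias=True)

def causal_inner_product(u, v, gamma):
    """<u, v>_C = u^T Cov(gamma)^{-1} v."""
    return u @ np.linalg.solve(vocab_cov(gamma), v)

def whitening_matrix(gamma):
    """Symmetric Cov(gamma)^{-1/2}, built from an eigendecomposition."""
    evals, evecs = np.linalg.eigh(vocab_cov(gamma))
    return evecs @ np.diag(evals ** -0.5) @ evecs.T

rng = np.random.default_rng(0)
vocab, d = 200, 6
# Synthetic, deliberately anisotropic stand-in for an unembedding matrix.
gamma = rng.normal(size=(vocab, d)) * np.array([5.0, 2.0, 1.0, 1.0, 0.5, 0.1])

u = gamma[1] - gamma[0]                  # toy difference directions
v = gamma[3] - gamma[2]
direct = causal_inner_product(u, v, gamma)
W = whitening_matrix(gamma)
via_whitening = (W @ u) @ (W @ v)        # Euclidean dot product after whitening
print(direct, via_whitening)
```

The two numbers agree up to floating-point error. Whitening once and then using ordinary dot products is convenient when many directions have to be compared.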

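Experimental results (LLaMA-2 7B)

Unembedding representations verified for 27 concepts (BATS 3.0 word-analogy dataset plus language pairs): counterfactual word-pair differences are highly aligned, so the LRH holds (with “thing⇒part” as the only exception)
Causally separable concepts are approximately orthogonal under the causal inner product: a clear block-diagonal structure appears, with blocks corresponding to groups of semantically similar concepts
Concept directions as linear probes: the unembedding direction $\bar{\gamma}_W$ classifies contexts with accuracy far above chance
Intervention experiments: modifying activations along the concept direction ( $\lambda \leftarrow \lambda + \alpha\bar{\lambda}_W$ ) changes the output from “king” to “queen” without affecting the capitalization concept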
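
The mechanics of the intervention $\lambda \leftarrow \lambda + \alpha\bar{\lambda}_W$ can be sketched on a toy softmax model. Everything here is an assumption made for illustration: the matrix `M` stands in for $\mathrm{Cov}(\gamma)^{-1}$, the two concept directions are made orthogonal under `M` by construction, and the embedding-side direction is taken to be `M @ g_w` (the Riesz image of the unembedding direction). The sketch shows why the target logit gap moves while the other one does not; it is not a reproduction of the LLaMA-2 experiment.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 6
B = rng.normal(size=(d, d))
M = B @ B.T + d * np.eye(d)              # hypothetical SPD matrix standing in for Cov(gamma)^{-1}

def ip(u, v):
    return u @ M @ v                     # toy causal inner product <u, v>_C

g_w = rng.normal(size=d)                 # toy direction for the target concept (male => female)
g_z = rng.normal(size=d)                 # toy direction for a second concept (lower => upper case)
g_z = g_z - ip(g_z, g_w) / ip(g_w, g_w) * g_w   # enforce <g_w, g_z>_C = 0 by construction

lam_w = M @ g_w                          # assumed embedding-side direction for the target concept

king = rng.normal(size=d)                # toy unembedding vectors
queen = king + g_w                       # differs from "king" only in the target concept
King = king + g_z                        # differs from "king" only in capitalization

def logit_gap(lam, y1, y0):
    return lam @ (y1 - y0)               # log-odds of y1 versus y0 under a softmax model

lam = rng.normal(size=d)                 # toy context embedding
alpha = 0.5
lam_new = lam + alpha * lam_w            # the intervention

shift_w = logit_gap(lam_new, queen, king) - logit_gap(lam, queen, king)
shift_z = logit_gap(lam_new, King, king) - logit_gap(lam, King, king)
print(shift_w, shift_z)
```

Here `shift_w` equals `alpha * ip(g_w, g_w)` and is positive, while `shift_z` equals `alpha * ip(g_w, g_z)` and vanishes up to rounding. In a real model the orthogonality would have to hold empirically rather than by construction.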

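Keywords and related concepts

Linear representation hypothesis: this paper is the rigorous formalization of the LRH
Causal inner product: the paper's core innovation
Probing classifiers: the experimental tool for the measurement representation; the paper proves that the subspace representation ≡ the linear probe direction
Mechanistic interpretability: the parent research area
Space-time world models: the empirical validation by Gurnee & Tegmark and this paper's formalization complement each other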

References

sources/arxiv_papers/2311.03658-linear-rep-hypothesis.md