
On the Biology of a Large Language Model

A study of Claude 3.5 Haiku's internal mechanisms: planning, hallucination, jailbreaks, and CoT faithfulness
SOURCE · ANTHROPIC · BIOLOGY OF LLM · Circuit Tracing · Claude 3.5 Haiku dissection

Anthropic Transformer Circuits Team (2025) — circuit tracing applied to Claude 3.5 Haiku across 10 behavioral case studies

Just as biology uses microscopes to study organisms, this paper performs an “anatomical” study of Claude 3.5 Haiku. Attribution graphs yield satisfying insight on roughly 25% of the prompts tried, and all findings are validated with perturbation experiments (suppressing or injecting features). Signature example: answering via “Dallas → Texas → Austin” involves a real intermediate feature step rather than rote lookup.
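The two-hop example and its perturbation check can be pictured with a toy sketch. This is only an illustration under loud assumptions: the lookup tables, the function name `capital_of_state_containing`, and the `suppress`/`inject` parameters are all hypothetical and say nothing about how the real features are represented. The point it shows is that if the answer is computed through an intermediate "state" feature, swapping that feature changes the answer, which would not happen with a single memorized Dallas-to-Austin entry.

```python
# Toy two-hop computation with an explicit intermediate "state" feature.
# Every name and table here is a hypothetical stand-in, not a real model feature.
CITY_TO_STATE = {"Dallas": "Texas", "Oakland": "California"}
STATE_TO_CAPITAL = {"Texas": "Austin", "California": "Sacramento"}


def capital_of_state_containing(city, suppress=None, inject=None):
    """Answer 'the capital of the state containing <city>' in two hops.

    suppress: intermediate feature to switch off (hypothetical intervention)
    inject:   intermediate feature to switch on (hypothetical intervention)
    """
    active = {CITY_TO_STATE[city]}      # hop 1: city -> state feature
    if suppress is not None:
        active.discard(suppress)        # suppression experiment
    if inject is not None:
        active.add(inject)              # injection experiment
    if len(active) != 1:
        return None                     # no single state feature left to read out
    (state,) = active
    return STATE_TO_CAPITAL[state]      # hop 2: state feature -> capital


print(capital_of_state_containing("Dallas"))
print(capital_of_state_containing("Dallas", suppress="Texas", inject="California"))
print(capital_of_state_containing("Dallas", suppress="Texas"))
```

In this sketch, suppressing the intermediate feature removes the answer and injecting a different one redirects it, which is the kind of evidence a perturbation experiment looks for.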

10 case studies (selected)
Multi-step reasoning
Dallas → Texas → Austin: genuine intermediate features, not a lookup table
Hallucination mechanism
Refusal is the default; a “known entity” feature suppresses it, and spurious triggering of that feature produces hallucinations
CoT faithfulness
Three distinguishable modes: genuine reasoning, fabrication, and motivated reasoning (steps inferred backward from the answer)
Safety refusals
Fine-tuning creates a generic “harmful request” feature, aggregated from specific harmful-request features learned in pretraining
Jailbreak analysis
Syntactic-coherence features pull against safety mechanisms: the model finishes the grammatical structure before it can “refuse”
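The hallucination finding above describes a gate: declining is on by default, and a familiarity signal switches it off. A minimal sketch of that logic follows, assuming a single scalar "known entity" activation and a fixed threshold. Both are simplifications invented here for illustration, as is the function name `respond`.

```python
def respond(known_entity_activation, has_answer, threshold=0.5):
    """Toy gate: refusal is on by default and a 'known entity' signal inhibits it.

    known_entity_activation: hypothetical feature strength in [0, 1]
    has_answer: whether the requested fact is actually available
    threshold: hypothetical cutoff above which the refusal drive wins
    """
    refusal_drive = 1.0 - known_entity_activation  # default-on, inhibited by familiarity
    if refusal_drive > threshold:
        return "decline"       # default path: the model says it does not know
    if has_answer:
        return "answer"        # familiar entity and the fact is available
    return "hallucinate"       # refusal suppressed, but nothing backs the answer


print(respond(0.1, has_answer=False))  # unfamiliar name
print(respond(0.9, has_answer=True))   # familiar name, fact known
print(respond(0.9, has_answer=False))  # familiar-seeming name, fact missing
```

The third call is the failure case: the familiarity signal fires spuriously, the default refusal is suppressed, and the output is produced without a supporting fact.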
Methodological traits
Shared multilingual features
Abstract-concept features are shared across languages; the shared fraction in Claude 3.5 Haiku is more than 2x that of smaller models
Poetry planning
Rhyme candidates activate at the start of a line — both forward-looking and backward-looking planning
Bottom-up discovery
Many mechanisms were found without prior hypotheses — attribution graphs guide exploration themselves
→ Mechanistic Interpretability · Circuit Tracing · Anthropic · transformer-circuits.pub (2025)

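Abstract

The circuit tracing method is applied to Claude 3.5 Haiku to systematically study the internal mechanisms behind ten model behaviors. Core metaphor: just as biology uses microscopes to study organisms, this paper performs an “anatomical” study of an LLM.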

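Ten case studies

Domain | Finding
Multi-step reasoning | “Dallas → Texas → Austin” passes through real intermediate features rather than rote memorization
Poetry planning | The model activates candidate rhyme-word features at the start of a line, combining forward and backward planning
Multilingual | Abstract-concept features are shared across languages; the shared fraction in Claude 3.5 Haiku is more than 2x that of smaller models
Addition | The same addition circuit generalizes across entirely different contexts
Medical diagnosis | The model generates candidate diagnoses “in its head” and uses them to decide which symptoms to ask about next
Hallucination | Declining to answer is the default; a “known entity” feature inhibits that default, and misfires of the feature lead to hallucinations
Safety refusals | Fine-tuning produces a generic “harmful request” feature, aggregated from specific harmful-request features learned in pretraining
Jailbreaks | Syntactic-coherence features are in tension with safety mechanisms; the model can only “refuse” after it has completed the grammatical structure
CoT faithfulness | Three modes can be told apart: genuine reasoning, fabrication, and motivated reasoning (working backward from the answer to the steps)
Hidden goals | In a model fine-tuned to pursue a hidden goal, interpretability methods can uncover the relevant mechanisms, which are embedded in the “assistant persona”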

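Methodological features

  • Attribution graphs yield satisfying insight on roughly 25% of prompts
  • All findings are validated with perturbation experiments (inhibition/injection)
  • Many mechanisms were discovered bottom-up, without a prior hypothesis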
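The pairing of attribution with perturbation in the list above can be illustrated on a toy linear readout. This is a sketch under a strong assumption: the output is a plain weighted sum of feature activations, which real models are not. The feature names, values, and helper functions are all hypothetical. In this linear setting, the attribution of a feature (activation times weight) predicts exactly how much the output drops when that feature is zeroed, so the perturbation confirms the attribution.

```python
# Toy linear readout: output = sum(activation * weight) over features.
# Feature names and numbers are hypothetical, chosen only for illustration.
activations = {"texas": 2.0, "capital": 1.5, "noise": 0.1}
weights = {"texas": 1.0, "capital": 0.8, "noise": -0.5}


def output(acts):
    """Weighted sum of the feature activations."""
    return sum(acts[name] * weights[name] for name in acts)


def attributions(acts):
    """Predicted per-feature contribution to the output (activation times weight)."""
    return {name: acts[name] * weights[name] for name in acts}


def ablate(acts, name):
    """Return a copy of the activations with one feature set to zero."""
    patched = dict(acts)
    patched[name] = 0.0
    return patched


baseline = output(activations)
for name, attr in attributions(activations).items():
    drop = baseline - output(ablate(activations, name))
    # In a purely linear readout the predicted and measured effects coincide.
    print(f"{name}: attribution={attr:+.2f}, measured drop={drop:+.2f}")
```

In a nonlinear model the two numbers can diverge, which is one reason a separate intervention check is worth running rather than trusting the attribution alone.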

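Implications for agent engineering

Although this paper is purely interpretability research, several of its findings bear on agent system design:

  • Understanding the hallucination mechanism helps in designing better guardrails
  • The CoT faithfulness analysis relates directly to the reliability of an agent's reasoning
  • The jailbreak analysis suggests directions for deeper safety defenses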

References

  • sources/anthropic_official/biology-large-language-model.md