On the Biology of a Large Language Model

A study of Claude 3.5 Haiku's internal mechanisms: planning, hallucination, jailbreaks, and CoT faithfulness

SOURCE · ANTHROPIC · BIOLOGY OF LLM · Circuit Tracing · Claude 3.5 Haiku dissection

Anthropic Transformer Circuits Team (2025) — circuit tracing applied to Claude 3.5 Haiku across 10 behavioral case studies

As biology uses microscopes to study organisms, this paper performs an “anatomical” study of Claude 3.5 Haiku. Attribution graphs yield satisfying insight on only about 25% of prompts; all findings are validated via perturbation experiments that suppress or inject features. Core example: completing “the capital of the state containing Dallas” passes through a real intermediate “Texas” feature (Dallas → Texas → Austin) rather than rote lookup.
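The two-hop example and the suppression/injection idea can be made concrete with a small toy. This is only an illustrative sketch: the two dictionaries stand in for learned features, and the names `two_hop`, `suppress`, and `inject` are hypothetical, not part of the paper's tooling or of any real model.

```python
# Toy illustration only: two hand-written lookup tables stand in for learned
# features. Nothing here is taken from Claude 3.5 Haiku or the paper's code.

CITY_TO_STATE = {"Dallas": "Texas", "Oakland": "California"}
STATE_TO_CAPITAL = {"Texas": "Austin", "California": "Sacramento"}


def two_hop(city, suppress=None, inject=None):
    """Answer 'the capital of the state containing <city>' in two hops.

    The intermediate state is kept as a set of active 'features' so that it
    can be perturbed between the first and second hop.
    """
    # Hop 1: the city activates a state feature.
    active_states = {CITY_TO_STATE[city]}

    # Perturbation: switch a feature off (suppression) or on (injection).
    if suppress is not None:
        active_states.discard(suppress)
    if inject is not None:
        active_states.add(inject)

    # Hop 2: every active state feature promotes its capital.
    return sorted(STATE_TO_CAPITAL[state] for state in active_states)


baseline = two_hop("Dallas")
suppressed = two_hop("Dallas", suppress="Texas")
swapped = two_hop("Dallas", suppress="Texas", inject="California")
print(baseline, suppressed, swapped)
```

If the answer came from a single memorized Dallas-to-Austin association, perturbing the intermediate state would leave the output unchanged. In this toy the output follows the intermediate feature, which is the kind of evidence a perturbation experiment looks for.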

10 case studies (selected)

Multi-step reasoning: Dallas → Texas → Austin passes through genuine intermediate features, not a lookup table

Hallucination mechanism: Refusal is the default; a “known entity” feature suppresses it, and spurious triggering of that feature produces hallucinations

CoT faithfulness: Three distinguishable modes: genuine reasoning, fabrication, and motivated reasoning (steps inferred backward from the answer)

Safety refusals: Fine-tuning creates a generic “harmful request” feature, aggregated from the specific harmful-request features learned in pretraining

Jailbreak analysis: Syntactic-coherence features pull against the safety mechanisms, so the model finishes the grammatical structure before it can “refuse”
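The hallucination item in the list above describes a gating arrangement: refusal is on by default and a “known entity” feature turns it off. A minimal sketch of that logic follows. The function name `respond`, the inhibition weight of 1.5, and the activation values are all invented for illustration and are not measurements from the model.

```python
# Toy sketch of the default-refusal gating. Weights and activations are
# invented for illustration; they are not measured values.

def respond(known_entity_activation, has_real_knowledge):
    """Return the behaviour of a toy 'default refusal' circuit."""
    default_refusal = 1.0                        # "can't answer" is on by default
    inhibition = 1.5 * known_entity_activation   # known-entity feature suppresses it
    refusal_drive = default_refusal - inhibition

    if refusal_drive > 0:
        return "refuse"
    # Refusal was suppressed, so the toy model commits to an answer.
    return "answer" if has_real_knowledge else "hallucinate"


# Well-known entity: the feature fires strongly and facts are available.
well_known = respond(known_entity_activation=1.0, has_real_knowledge=True)
# Unknown entity: the feature stays quiet, so the default refusal wins.
unknown = respond(known_entity_activation=0.0, has_real_knowledge=False)
# Familiar-looking name: the feature misfires although no facts are stored.
misfire = respond(known_entity_activation=0.8, has_real_knowledge=False)
print(well_known, unknown, misfire)
```

The third case is the failure mode of interest: the gate opens on familiarity alone, so the toy model answers without anything to answer with.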

Methodological traits

Shared multilingual features

Abstract-concept features are shared across languages; the shared fraction in Claude 3.5 Haiku is more than 2x that of smaller models

Poetry planning

Rhyme candidates activate at the start of a line — both forward-looking and backward-looking planning

Bottom-up discovery

Many mechanisms were found without prior hypotheses — attribution graphs guide exploration themselves

→ Mechanistic Interpretability · Circuit Tracing · Anthropic · transformer-circuits.pub (2025)

On the Biology of a Large Language Model

Source: sources/anthropic_official/biology-large-language-model.md
URL: https://transformer-circuits.pub/2025/attribution-graphs/biology.html
Authors: Anthropic (Transformer Circuits team)
Published: 2025

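Summary

The paper applies the circuit tracing method to Claude 3.5 Haiku to study systematically the internal mechanisms behind ten model behaviors. Core metaphor: just as biology uses microscopes to study organisms, the paper performs an “anatomical” study of an LLM.

Ten case studies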

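Area	Finding
Multi-step reasoning	“Dallas → Texas → Austin” involves real intermediate feature jumps, not rote memorization
Poetry planning	The model activates candidate rhyme-word features at the start of a line, combining forward-looking and backward-looking planning
Multilingual	Abstract-concept features are shared across languages; the shared fraction in Claude 3.5 Haiku is more than 2x that of smaller models
Addition	The same addition circuit generalizes across completely different contexts
Medical diagnosis	The model generates candidate diagnoses “in its head” and uses them to decide which symptoms to ask about next
Hallucination	Refusing to answer is the default; a “known entity” feature suppresses this default, and misfires of that feature cause hallucinations
Safety refusals	Fine-tuning produces a generic “harmful request” feature, aggregated from the specific harmful-request features learned in pretraining
Jailbreaks	Grammatical-coherence features pull against the safety mechanisms; the model can “refuse” only after completing the grammatical structure
CoT faithfulness	Three modes can be told apart: genuine reasoning, fabrication, and motivated reasoning (steps worked backward from the answer)
Hidden goals	In a model fine-tuned to pursue a hidden goal, interpretability methods can uncover the relevant mechanisms, which are embedded in the “Assistant persona”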

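Methodological traits

Attribution graphs yield satisfying insight on roughly 25% of prompts
All findings are validated with perturbation experiments (inhibition/injection)
Many mechanisms were discovered bottom-up, without a prior hypothesis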
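The relationship between an attribution and its perturbation check can be sketched on a toy linear readout. This is an assumption-laden illustration, not the paper's method: the feature names, activations, and weights are invented, and a real attribution graph is built on a learned replacement model rather than a hand-written dictionary.

```python
# Toy linear readout. Feature names, activations, and weights are invented.
activations = {"texas": 2.0, "capital": 1.0, "sports": 0.5}
weights_to_austin = {"texas": 1.5, "capital": 0.8, "sports": -0.2}


def austin_logit(acts):
    """Linear readout: sum of activation * weight over all features."""
    return sum(acts[name] * weights_to_austin[name] for name in acts)


def attribution(name):
    """Predicted direct contribution of one feature to the output."""
    return activations[name] * weights_to_austin[name]


def ablation_effect(name):
    """Measured change in the output when the feature is set to zero."""
    ablated = {**activations, name: 0.0}
    return austin_logit(activations) - austin_logit(ablated)


for name in activations:
    print(name, attribution(name), ablation_effect(name))
```

In a purely linear toy the predicted attribution and the measured ablation effect agree up to floating-point error. In a real network, nonlinearities and approximation error can make them diverge, which is one reason to measure the effect of a perturbation instead of trusting the graph alone.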

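Implications for agent engineering

Although the paper is purely interpretability research, several of its findings bear on agent system design:

Understanding the hallucination mechanism helps in designing better guardrails
The CoT faithfulness analysis bears directly on how reliable an agent's reasoning is
The jailbreak analysis points toward safety defenses that operate at a deeper level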

References

sources/anthropic_official/biology-large-language-model.md