Knowledge Extraction & Fidelity

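Knowledge extraction and fidelity: deriving symbolic descriptions from neural networks. Fidelity (how accurately the description matches the network’s behavior) is a core criterion for XAI, and methods such as LIME have very low fidelity.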

CONCEPT · KNOWLEDGE EXTRACTION FIDELITY · CORE XAI CRITERION

Knowledge Extraction and Fidelity

Fidelity — how well the extracted symbolic description matches the network’s behavior, not its training fit

Knowledge extraction = deriving a readable symbolic description (rules, decision trees) from a neural network. Fidelity = agreement between that description and the network’s actual behavior, not its fit to training data. LIME-family methods have very low fidelity — plausible-looking but causally disconnected explanations.

High fidelity

Student-teacher framework: rate at which extracted knowledge reproduces network predictions

Provable soundness: formal guarantees on description precision

Low fidelity (the problem)

LIME’s local linear approximation may diverge wildly from the original network within the same region

Post-hoc explanation layers cannot guarantee faithful description of the underlying model

Why extract knowledge

Bias discovery

GDPR compliance — reveal which proxies for protected variables the network actually uses

Model debugging

Find when the network relies on features it shouldn’t (shortcut learning)

Neurosymbolic loop

Extracted symbolic knowledge feeds back into the next learning round as a constraint

vs mechanistic interpretability

Knowledge extraction: map to symbolic rules. Mech interp: trace information flow at the activation level. Both demand causal accuracy.

→ Mechanistic Interpretability · Activation Intervention · Neurosymbolic AI

Garcez & Lamb (2020)

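Definition: Knowledge extraction is the process of deriving a readable symbolic description (such as logic rules or decision trees) from a trained neural network. Fidelity is the core measure of extraction quality: the degree to which the extracted symbolic description accurately reflects the network’s actual behavior.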

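Defining fidelity

Fidelity ≠ accuracy. However well an explanation method fits the training data, it has no fidelity if what it describes is not the computation the network actually performs.

Correct definition: the agreement between the extracted knowledge and the network’s behavior (in the student-teacher framework, how accurately the student mimics the teacher).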
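
The contrast between the two quantities can be written down as follows. The notation is assumed here for illustration and is not taken from the source: f is the trained network (teacher), g is the extracted symbolic description (student), and (x_i, y_i) for i = 1..n are evaluation inputs with their ground-truth labels.

```latex
\mathrm{fidelity}(g, f) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\left[\, g(x_i) = f(x_i) \,\right],
\qquad
\mathrm{accuracy}(g) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\left[\, g(x_i) = y_i \,\right]
```

Fidelity never consults the labels y_i. A description can score well on accuracy while disagreeing with the network, and a faithful description of an inaccurate network will itself be inaccurate.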

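Why fidelity is the core criterion

The fundamental purpose of XAI

The goal of explainable AI is to let people understand what an AI system is actually doing, not to offer an explanation that looks plausible but is in fact unrelated to the system’s behavior.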

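A systematic flaw in current XAI

Many popular methods (such as LIME) have very low fidelity:

The LIME problem: a local linear approximation is used to explain a local decision, yet the approximating model may behave very differently from the original network within that same region
Limits of post-hoc explanation: an explanation layer added after the model has been trained cannot guarantee a faithful description of the original model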
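
The first point can be illustrated with a toy local surrogate. This is a simplified sketch and not the LIME library or its algorithm: it uses a made-up one-dimensional "network", evenly spaced perturbations instead of random sampling, and an unweighted least-squares line instead of a proximity-weighted fit. The names `network_score`, `fit_local_line`, and `local_fidelity` are hypothetical.

```python
import math

def network_score(x):
    """Hypothetical 1-D network with made-up weights: its score is positive
    only for x roughly between 0.3 and 0.7."""
    return math.tanh(8.0 * (x - 0.3)) - math.tanh(8.0 * (x - 0.7)) - 1.0

def perturbations(x0, radius, n=101):
    """Evenly spaced points in [x0 - radius, x0 + radius] (an assumption;
    LIME itself samples perturbations randomly)."""
    return [x0 + radius * (2.0 * i / (n - 1) - 1.0) for i in range(n)]

def fit_local_line(x0, radius):
    """Ordinary least-squares line fitted to the network's scores around x0."""
    xs = perturbations(x0, radius)
    ys = [network_score(x) for x in xs]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept

def local_fidelity(x0, radius):
    """Share of perturbed points where the line and the network agree on the class."""
    slope, intercept = fit_local_line(x0, radius)
    xs = perturbations(x0, radius)
    agree = sum(
        1 for x in xs
        if (slope * x + intercept > 0) == (network_score(x) > 0)
    )
    return agree / len(xs)

X0 = 0.5
narrow = local_fidelity(X0, radius=0.05)  # region where the network is nearly flat
wide = local_fidelity(X0, radius=0.5)     # region that spans both decision boundaries
print(f"local fidelity, narrow region: {narrow:.3f}")
print(f"local fidelity, wide region:   {wide:.3f}")
```

In this toy setup the narrow fit agrees with the network everywhere it was fitted, while the wide fit flattens to a roughly constant negative score and so disagrees with the network across the whole interval where the network fires, including at x0 itself.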

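Garcez & Lamb 2020 explicitly criticize this trend of “abandoning fidelity”, arguing that it leaves XAI methods effectively unusable for the goal of trustworthy AI.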

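Formal guarantees for faithful knowledge extraction

Provable soundness: the extraction algorithm mathematically guarantees the precision of the description it produces. However, provably sound extraction typically comes with exponential complexity.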
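
One way to see where the exponential cost comes from is brute-force truth-table extraction over binary inputs. The sketch below is an illustration under stated assumptions, not an algorithm from the source: the `network` is a hypothetical single threshold unit with made-up weights, and the extracted rule is an unminimised DNF with one conjunction per accepted input.

```python
from itertools import product

def network(bits):
    """Hypothetical trained network over binary inputs: a single threshold
    unit with made-up weights, standing in for a real model."""
    weights = (2.0, -3.0, 0.5, 1.5)
    bias = -0.75
    score = sum(w * b for w, b in zip(weights, bits)) + bias
    return 1 if score > 0 else 0

def extract_truth_table_rule(model, n_inputs):
    """Query the model on every binary input and keep the inputs it accepts.

    The result is exact on the binary domain, but it needs 2**n_inputs queries.
    """
    accepted = set()
    n_queries = 0
    for bits in product((0, 1), repeat=n_inputs):
        n_queries += 1
        if model(bits) == 1:
            accepted.add(bits)
    return accepted, n_queries

def rule_predict(accepted, bits):
    """Evaluate the extracted DNF on one input."""
    return 1 if tuple(bits) in accepted else 0

N_INPUTS = 4
rule, n_queries = extract_truth_table_rule(network, N_INPUTS)
print(f"queries: {n_queries}, conjunctions in rule: {len(rule)}")
```

The rule reproduces the network on every binary input by construction, but the loop runs 2**n times, so this kind of guarantee stops being practical beyond a few dozen inputs.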

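Practical strategy: when exact extraction is too expensive, replace the formal proof with a fidelity measure (a numerical evaluation). Fidelity = the rate at which the extracted rules reproduce the network’s predictions.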
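
The reproduction rate described above can be measured with a few lines of code. This is a minimal sketch under assumptions: the teacher `network` is a tiny tanh model with made-up weights, `rule_and` and `rule_always_zero` are hypothetical candidate rules, and the evaluation inputs are a plain grid rather than real data.

```python
import math

def network(x1, x2):
    """Hypothetical teacher: a tiny tanh network with made-up weights."""
    h1 = math.tanh(4.0 * x1 - 2.0)
    h2 = math.tanh(4.0 * x2 - 2.0)
    return 1 if 3.0 * h1 + 3.0 * h2 - 1.0 > 0 else 0

def rule_and(x1, x2):
    """Candidate extracted rule: IF x1 > 0.5 AND x2 > 0.5 THEN 1 ELSE 0."""
    return 1 if (x1 > 0.5 and x2 > 0.5) else 0

def rule_always_zero(x1, x2):
    """Trivial baseline rule that always predicts class 0."""
    return 0

def fidelity(student, teacher, inputs):
    """Fraction of inputs on which the student reproduces the teacher's prediction."""
    agree = sum(1 for x in inputs if student(*x) == teacher(*x))
    return agree / len(inputs)

# Evaluation inputs: a 21 x 21 grid over [0, 1]^2. No ground-truth labels are needed.
GRID = [(i / 20, j / 20) for i in range(21) for j in range(21)]

fid_and = fidelity(rule_and, network, GRID)
fid_zero = fidelity(rule_always_zero, network, GRID)
print(f"fidelity of AND rule:      {fid_and:.3f}")
print(f"fidelity of constant rule: {fid_zero:.3f}")
```

The measure only compares the rule with the network, so it can be computed on unlabeled inputs. The price is that the number is an empirical estimate over the chosen inputs, not a guarantee.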

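Uses of knowledge extraction

Bias identification: removing protected variables such as gender or race from the data, as is commonly done for GDPR compliance, does not remove the bias, because proxy variables in the data still carry it. Extracting symbolic rules can reveal which variables the network actually uses
Model debugging: discover that the network depends on features it should not rely on (shortcut learning)
Decision support: provide traceable grounds for decisions in high-stakes settings such as medicine and law
Further learning: the extracted symbolic knowledge can be fed back as a constraint into the next round of learning (the neurosymbolic loop)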
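
The bias-identification use can be sketched with a toy example. Everything here is hypothetical: the feature names, the loan-approval `network` with made-up weights, and the assumption that `postcode_group` acts as a proxy for a protected attribute. The sketch scores every one-variable rule by its fidelity to the network and reports the most faithful one.

```python
from itertools import product

FEATURES = ("income_high", "postcode_group", "has_degree")  # hypothetical names

def network(bits):
    """Hypothetical loan-approval network. No protected attribute is an input,
    but 'postcode_group' is assumed to act as a proxy for one."""
    income_high, postcode_group, has_degree = bits
    score = 2.0 * income_high - 3.0 * postcode_group + 0.5 * has_degree - 0.2
    return 1 if score > 0 else 0

# All binary inputs, weighted equally (real data would not be uniform).
INPUTS = list(product((0, 1), repeat=len(FEATURES)))

def single_feature_rule_fidelity(index, value):
    """Fidelity of the rule 'approve iff feature[index] == value'."""
    agree = sum(
        1 for bits in INPUTS
        if (bits[index] == value) == (network(bits) == 1)
    )
    return agree / len(INPUTS)

# Score every one-variable rule and keep the most faithful one.
candidates = {
    (FEATURES[i], v): single_feature_rule_fidelity(i, v)
    for i in range(len(FEATURES))
    for v in (0, 1)
}
best_rule = max(candidates, key=candidates.get)
print(f"most faithful one-variable rule: approve iff {best_rule[0]} == {best_rule[1]}")
print(f"fidelity: {candidates[best_rule]:.3f}")
```

A single-variable rule that reproduces most of the network’s decisions is a signal that the network leans heavily on that variable, which is the kind of finding a reviewer would then check against the protected attribute.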

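Relation to mechanistic interpretability

Mechanistic interpretability shares a similar goal with knowledge extraction but uses different methods:

Knowledge extraction: maps the network’s behavior into a readable symbolic language (rules, logical formulas)
Mechanistic interpretability: traces information flow at the activation level (circuit tracing, attribution graphs)

Both require the description of the network to be causally accurate (that is, not merely a correlational approximation), which is closely aligned with the fidelity principle.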
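
The difference between a correlational and a causal description can be shown with a small activation intervention. This is a sketch under assumptions, not a method from the source: the two-unit network and its weights are made up, and zero-ablation is only one simple kind of intervention.

```python
import math

def forward(x, ablate=None):
    """Hypothetical two-unit network. `ablate` names a hidden unit to zero out."""
    hidden = {
        "h1": math.tanh(2.0 * x),
        "h2": math.tanh(2.0 * x + 0.1),
    }
    if ablate is not None:
        hidden[ablate] = 0.0  # the intervention: overwrite the activation
    # Only h1 feeds the output; h2 has an outgoing weight of zero.
    output = 3.0 * hidden["h1"] + 0.0 * hidden["h2"]
    return output, hidden

def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

XS = [i / 10 for i in range(-10, 11)]
outputs = [forward(x)[0] for x in XS]
h2_values = [forward(x)[1]["h2"] for x in XS]

# Correlational view: h2 tracks the output closely.
corr_h2 = pearson(h2_values, outputs)

# Causal view: largest change in the output when each unit is ablated.
effect_h1 = max(abs(forward(x)[0] - forward(x, ablate="h1")[0]) for x in XS)
effect_h2 = max(abs(forward(x)[0] - forward(x, ablate="h2")[0]) for x in XS)

print(f"correlation of h2 with output: {corr_h2:.3f}")
print(f"max effect of ablating h1:     {effect_h1:.3f}")
print(f"max effect of ablating h2:     {effect_h2:.3f}")
```

An explanation that credits h2 would look plausible from the correlation alone, yet ablating h2 leaves the output unchanged. Only the description in terms of h1 is causally accurate in this toy network.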

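Local vs global explanations

Local explanation: explains the decision for one specific input (LIME’s approach)
Global explanation: explains the behavior of the model as a whole

Garcez & Lamb argue that high-fidelity knowledge extraction should aim for global explanations wherever possible; local explanations are of limited value and can easily mislead.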

References

d’Avila Garcez, A. & Lamb, L.C. (2023). Neurosymbolic AI: The 3rd Wave. Artificial Intelligence Review. wikis/sources/2012.05876-neurosymbolic-ai-third-wave.md
Related concepts: mechanistic-interpretability, trajectory-bias