From Words to Worlds: Compositionality for Cognitive Architectures

Dhar & Søgaard 2024: an empirical test of compositionality in LLMs. Scale improves compositionality; instruction tuning reduces it. ICML 2024 Workshop.

SOURCE · DHAR & SØGAARD · ICML 2024 · LLM compositionality empirics · revisiting Fodor’s prediction

Dhar & Søgaard (ICML 2024 Workshop) — an empirical test of Fodor & Pylyshyn’s 1988 prediction against LLMs

Fodor & Pylyshyn (1988) predicted connectionist systems cannot support compositionality. Dhar & Søgaard test 12 LLMs across three compositionality dimensions and find: scale does improve compositional ability — partly refuting Fodor’s pessimism; but instruction tuning broadly reduces it — which matches Fodor’s core worry: connectionist systematicity is a training accident, not an architectural guarantee.

Three compositionality dimensions

Substitutivity (ANTAILS): synonym substitution preserves meaning, the most basic demand of compositionality

Systematicity (PLANE): recombining known compositional patterns in new contexts, mapping directly onto Fodor & Pylyshyn’s systematicity

Overgeneralization (COMPCOMB): distinguishing truly compositional compounds (a trenchcoat is a coat) from exocentric compounds (a turncoat is not a coat)

Core findings

Scale up → partial refutation of Fodor

Consistent across 4 model families: larger scale, better compositional ability — connectionism isn’t simply incapable

Instruction tuning down → Fodor vindicated

The RLHF objective is misaligned with compositional semantics: Mistral’s substitutivity score (ANTAILS) drops from 0.50 to 0.30

Subsective adjectives stay weak

“Good dancer”-style cases, mirroring the late-acquired adjective class in child development

Same root as trajectory bias

Instruction tuning lowering compositionality shares a mechanism with trajectory bias — friction between symbolic constraint and connectionist dynamics

→ Systematicity · Compositionality · Jerry Fodor · Paul Smolensky
ICML 2024 · arXiv:2407.13419

Source file: sources/arxiv_papers/2407.13419-from-words-to-worlds-compositionality-cognitive-architectures.md
Original: arXiv:2407.13419 [cs.CL], 18 July 2024
Authors: Ruchira Dhar, Anders Søgaard
Published: ICML 2024 Workshop on LLMs & Cognition
DOI: https://doi.org/10.48550/arXiv.2407.13419
License: CC BY-NC-ND 4.0

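Summary

This paper revisits Fodor & Pylyshyn's (1988) classic critique of connectionism: can connectionist systems exhibit enough compositionality to serve as a cognitive architecture?

The authors test four LLM families (12 models in total) on three kinds of compositionality tasks and find that scale improves compositional ability, whereas instruction tuning often reduces it. This split exposes an open question about the relationship between LLM development and human cognitive abilities.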

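Research framework

Why return to the 1988 debate

LLMs (large language models) are, at bottom, large-scale connectionist systems. Fodor & Pylyshyn predicted that systems lacking compositional syntactic structure cannot exhibit reliable compositionality.

In the LLM era this prediction becomes an empirically testable question:

Do LLMs exhibit compositionality?
If so, does that compositionality explain their performance gains?
How do scale and instruction tuning each affect compositionality?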

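The three compositionality dimensions tested

Dimension 1: Substitutivity (ANTAILS dataset)

Tests whether meaning is preserved under synonym substitution, the weakest and most basic requirement of compositionality.

Example: if "big dog" entails some relation, then "large dog" (big and large being synonyms) should entail the same relation.

Evaluation setups:

Setup 1: two-option QA, scored by fixed-rank accuracy
Setup 2: entailment judgements computed from log-probabilities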
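
Setup 2 can be pictured as comparing the log-probability a model assigns to each candidate answer, then checking whether the verdict survives a synonym swap. The sketch below is only an illustration under stated assumptions: `score` stands in for any function that returns a sequence log-probability, and the prompt template and `toy_score` are hypothetical. It does not reproduce the paper's prompts or scoring code.

```python
from typing import Callable

def judge_entailment(premise: str, hypothesis: str,
                     score: Callable[[str], float]) -> bool:
    """Return True if the 'yes' continuation scores higher than 'no'."""
    # Hypothetical prompt template, not taken from the paper.
    prompt = f"Premise: {premise}\nHypothesis: {hypothesis}\nEntailed? "
    return score(prompt + "yes") > score(prompt + "no")

def substitution_consistent(premise: str, premise_sub: str, hypothesis: str,
                            score: Callable[[str], float]) -> bool:
    """Substitutivity holds when the judgement is unchanged by the synonym swap."""
    return (judge_entailment(premise, hypothesis, score)
            == judge_entailment(premise_sub, hypothesis, score))

def toy_score(text: str) -> float:
    """Toy stand-in for a language model's log-probability (assumption)."""
    if text.endswith("yes"):
        return -1.0 if "animal" in text else -2.0
    return -1.5

consistent = substitution_consistent("a big dog", "a large dog",
                                     "an animal", toy_score)
```

In practice `score` would wrap a real model's summed token log-probabilities; any scorer with the same signature can be passed in unchanged.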

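Dimension 2: Systematicity and globalism (PLANE dataset)

Tests whether a model can recombine known compositional patterns in new contexts, which corresponds directly to Fodor & Pylyshyn's notion of systematicity.

The test distinguishes three classes of adjectives:

Intersective adjectives: "red car" = red things ∩ cars; the adjective holds independently of the noun
Subsective adjectives: "good dancer" ≠ good things ∩ dancers; "good" is meaningful only relative to the class of dancers
Intensional adjectives: "alleged criminal"; the adjective does not apply to the noun's extension at all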
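
The three adjective classes differ in which inferences an "Adj N" phrase licenses. The following sketch encodes that pattern as a small lookup, plus a set-intersection example for the intersective case. The names `AdjClass` and `licensed_inferences` are illustrative assumptions, and the example sets are toy data rather than PLANE items.

```python
from enum import Enum

class AdjClass(Enum):
    INTERSECTIVE = "intersective"
    SUBSECTIVE = "subsective"
    INTENSIONAL = "intensional"

def licensed_inferences(adj_class: AdjClass) -> dict:
    """Which conclusions follow from 'x is an Adj N' for each adjective class."""
    return {
        # "red car": x is a car, and x is red
        AdjClass.INTERSECTIVE: {"is_noun": True, "is_adj": True},
        # "good dancer": x is a dancer, but not good in general
        AdjClass.SUBSECTIVE: {"is_noun": True, "is_adj": False},
        # "alleged criminal": x need not be a criminal at all
        AdjClass.INTENSIONAL: {"is_noun": False, "is_adj": False},
    }[adj_class]

# Intersective case as literal set intersection (toy extensions).
red_things = {"car1", "apple1"}
cars = {"car1", "car2"}
red_cars = red_things & cars
```

A systematicity probe in this spirit would ask whether a model's judgements on unseen adjective-noun pairs follow the pattern of the adjective's class.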

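Dimension 3: Overgeneralization (COMPCOMB dataset, newly introduced)

Distinguishes compositional compounds from non-compositional (exocentric) ones:

Compositional: trenchcoat (trench + coat) → really is a kind of coat
Non-compositional: turncoat (a traitor) → not a kind of coat (exocentric compound)

Tests whether a model overgeneralizes the compositional reading.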

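Experimental design

Four model families: Falcon, Llama 2, Codellama, Mistral
Three variants: 7B base model, 7B instruction-tuned model, larger version
Evaluation: fully zero-shot, averaged over two prompt variants to guard against instruction-tuning bias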
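
Averaging over prompt variants amounts to scoring each variant separately and taking the mean. A minimal sketch, assuming accuracy as the metric and using made-up predictions; the variant names and data are hypothetical, not the paper's.

```python
from statistics import mean

def accuracy(predictions, gold):
    """Fraction of predictions that match the gold labels."""
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)

def averaged_accuracy(preds_by_prompt, gold):
    """Mean accuracy across prompt variants, so no single phrasing dominates."""
    return mean(accuracy(preds, gold) for preds in preds_by_prompt.values())

gold = ["A", "B", "A", "B"]
preds_by_prompt = {
    "prompt_v1": ["A", "B", "B", "B"],  # 3 of 4 correct
    "prompt_v2": ["A", "A", "B", "B"],  # 2 of 4 correct
}
score = averaged_accuracy(preds_by_prompt, gold)
```

Reporting the mean rather than the best variant keeps a model from being rewarded for matching one particular instruction format.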

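Main results

Substitutivity (ANTAILS)

Model family	Base	Instruction-tuned	Larger version
Falcon	0.50±0.01	0.46±0.01	0.54±0.03
Llama 2	0.50±0.01	0.54±0.04	0.60±0.01
Codellama	0.50±0.01	0.50±0.01	0.55±0.02
Mistral	0.50±0.01	0.30±0.04	0.51±0.02

Key finding: scale consistently improves performance; the effect of instruction tuning is inconsistent (a large drop for Mistral, a small gain for Llama 2).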
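
The pattern is easiest to see as differences from the base model. The snippet below takes the mean scores from the ANTAILS table and computes the change for the instruction-tuned and larger variants; the reported ± spreads are ignored, so small differences should be read with caution.

```python
# Mean ANTAILS scores copied from the table: (base, instruction-tuned, larger).
antails = {
    "Falcon": (0.50, 0.46, 0.54),
    "Llama 2": (0.50, 0.54, 0.60),
    "Codellama": (0.50, 0.50, 0.55),
    "Mistral": (0.50, 0.30, 0.51),
}

def deltas(scores):
    """Change relative to the base model for each family, rounded to 2 places."""
    return {
        family: {
            "instruct": round(inst - base, 2),
            "larger": round(large - base, 2),
        }
        for family, (base, inst, large) in scores.items()
    }

changes = deltas(antails)
for family, d in changes.items():
    print(f"{family:10s} instruct {d['instruct']:+.2f}  larger {d['larger']:+.2f}")
```

Every family gains from scale (between +0.01 and +0.10), while instruction tuning ranges from -0.20 for Mistral to +0.04 for Llama 2.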

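Systematicity (PLANE)

Scale improves systematicity, in a pattern similar to ANTAILS
Instruction tuning generally lowers systematicity
Subsective adjectives are a persistent weakness, paralleling the later acquisition of this adjective class in child development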

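Overgeneralization (COMPCOMB)

Large models still overgeneralize at the embedding layer, but their last-hidden-layer representations do better
This suggests that compositional understanding is more accessible in deep representations than in static word embeddings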
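
One way to picture the contrast between layers is a similarity probe: if an exocentric compound such as turncoat sits as close to its head noun as a genuine hyponym such as trenchcoat, the representation is overgeneralizing. The sketch below is an assumption-laden illustration. Cosine similarity as the measure, the `overgeneralizes` criterion, and the two-dimensional toy vectors are all hypothetical; this summary does not specify the paper's actual probing method.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def overgeneralizes(rep, endocentric, exocentric, head):
    """True if the exocentric compound is at least as close to the head
    noun as the endocentric (truly compositional) compound is."""
    return cosine(rep[exocentric], rep[head]) >= cosine(rep[endocentric], rep[head])

# Toy vectors (assumption): at the embedding layer both compounds look like "coat".
embedding_layer = {
    "coat": [1.0, 0.0],
    "trenchcoat": [0.9, 0.1],
    "turncoat": [0.95, 0.1],
}
# Toy vectors (assumption): in the last hidden layer "turncoat" has moved away.
last_hidden_layer = {
    "coat": [1.0, 0.0],
    "trenchcoat": [0.9, 0.1],
    "turncoat": [0.1, 0.9],
}

emb_flag = overgeneralizes(embedding_layer, "trenchcoat", "turncoat", "coat")
hid_flag = overgeneralizes(last_hidden_layer, "trenchcoat", "turncoat", "coat")
```

With real models, `rep` would hold vectors extracted from the relevant layer for each word, and the comparison would be aggregated over many compound pairs rather than a single one.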

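Core conclusions

An empirical response to Fodor & Pylyshyn's prediction

Scale partially shows that connectionism can approach compositionality: large-scale connectionist systems (LLMs) do exhibit measurable compositional ability. Fodor & Pylyshyn's pessimistic 1988 prediction is refuted in its strong form ("cannot at all").
Instruction tuning reveals how fragile compositionality is: the RLHF optimization objective (aligning with human preferences) is not aligned with compositional semantic structure and can partially destroy compositional ability. This is consistent with Fodor & Pylyshyn's core concern: connectionist systematicity is an accident of training, not an architectural guarantee.
Compositionality explains part, not all, of the performance gains: growing compositional ability accounts for part of the gains from scale, but it cannot account for the gains from instruction tuning, since instruction tuning usually lowers compositionality.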

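Implications for cognitive-architecture research

To establish LLMs as cognitive architectures, one needs to:

Show compositionality (partly met, according to this paper)
Show that compositional strategies explain the performance improvements (partly met, according to this paper)
Show that compositionality holds up on novel out-of-distribution cases (the paper's results are mixed)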

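Appendix: a brief history of the compositionality debate

Period	Contribution
Frege 1892	Principle of compositionality: the meaning of a complex expression is determined by the meanings of its parts and its structure
Fodor & Pylyshyn 1988	Connectionism cannot support compositional understanding; systematicity requires compositional structure
Smolensky 1988	Functional compositionality; the proper status of the subconceptual level; tensor product representations
Chalmers, Van Gelder	Neural networks exhibit "functional compositionality" through weights and activation patterns
Dhar & Søgaard 2024	Empirical return to the 1988 prediction: LLM compositionality is real but fragile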

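Connections to existing wiki concepts

Systematicity: this paper is a contemporary empirical version of the systematicity argument
Compositionality: the three test dimensions operationalize different aspects of compositionality
Trajectory bias: the drop in compositionality under instruction tuning shares a mechanistic root with trajectory bias, namely friction between symbolic-level constraints and connectionist dynamics
Neuro-symbolic AI: the study finds partial alignment between compositionality and performance rather than full integration, which fits the "partial realization" diagnosis of third-wave neuro-symbolic AI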

References

Original source: sources/arxiv_papers/2407.13419-from-words-to-worlds-compositionality-cognitive-architectures.md
In direct dialogue with: Fodor & Pylyshyn 1988
Theoretical background: Smolensky 1988
Conceptual basis: Systematicity, Compositionality