Seven Mental · 心智七篇
← Knowledge Atlas · Concept

Scaling Laws(缩放定律)

缩放定律:LLM 性能是参数量和数据量的可预测平滑函数,AI 军备竞赛的理论基础
CONCEPT · SCALING LAWS · PARAMETERS × DATA = PREDICTABLE PERFORMANCE · KARPATHY 2023

Scaling Laws

Scaling Laws — LLM performance is a predictable, smooth function of parameter count and training data

On a log scale, performance is roughly linear in N (parameters) and D (data), with no saturation yet. Two key properties: predictability (given N and D, you can forecast accuracy with high confidence) + no saturation (as of 2023, bigger = better). Algorithmic advances are icing, not a prerequisite.

Theoretical basis of the Gold Rush
Scaling laws → performance gains = engineering spend (not research breakthroughs) → GPU-cluster arms race
Practical corollaryNo need to wait for a theoretical breakthrough — buy a bigger cluster, prepare more data, and performance grows
Relations to neighboring concepts
LLM-OS analogy
Scaling laws are the “Moore’s law” of the LLM ecosystem — predictable compute growth drives everything else
Hedging error cascade
Per-step performance improves with scale, but multi-step agent error cascades can cancel the gain
Isomorphic to the Bitter Lesson
Scale, not insight — scaling laws are the precise training-time quantification of Sutton’s lesson
NFL reminder
Scaling-law predictions rest on a specific distribution assumption — out-of-distribution tasks may not follow the same curve
→ LLM Training Pipeline · Error Cascade · Bitter LessonKarpathy (2023)

Scaling Laws(缩放定律)

定义

LLM 的性能是参数量(N)和训练数据量(D)的可预测、平滑函数。这个关系在对数尺度上近似线性,且目前未见饱和迹象。Scaling laws 是当前 AI 基础设施”军备竞赛”的理论基础——它将模型能力提升从”研究突破”转变为”工程投入”。

核心特征

Karpathy2023 年演讲 中指出两个关键性质:

  1. 可预测性:给定 N 和 D,可以以极高置信度预测下一词预测精度——算法进步是锦上添花,不是必要条件
  2. 无饱和:到 2023 年为止,性能曲线未出现 plateau,更大的模型 + 更多数据 = 更好的性能

这解释了为什么”Gold Rush”发生在计算层面——不需要研究突破,只需更大的 GPU 集群和更多数据。

从下一词预测到涌现能力

Scaling laws 的形式定义基于下一词预测损失。但经验上,这一指标的改善与下游任务表现强相关——从 GPT-3.5 到 GPT-4 的升级中,几乎所有基准测试的表现同步提升。

这种关联不是理论保证,而是经验观察。它暗示下一词预测可能是一种”通用目标”——足够好的下一词预测器需要编码足够多的世界知识。

与 Wiki 已有概念的关系

  • LLM-OS Analogy — scaling laws 对应”硬件”层面的摩尔定律类比:计算能力的可预测增长驱动整个生态系统演进
  • LLM Training Pipeline — scaling laws 主要描述 pretraining 阶段的行为
  • Error Cascade — 即使单步性能随 scale 提升,多步任务中的误差级联效应可能抵消收益

References

  • sources/karpathy-intro-to-large-language-models.md