When Is Next-Token Prediction Useful? Marginalization, Ergodicity, Mixture Identifiability, Local Sufficiency, RAG, Tools, and Programming
Francesco Corielli
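AI summary: By distinguishing the full conditional language process, the marginal text-only process, and the model-induced distribution, the paper argues that the usefulness of next-token prediction depends on strong assumptions (stationarity, representativeness, ergodicity) and on the observed prefix being sufficient for the latent circumstances. It also explains the role of RAG and tool use as conditional sufficiency mechanisms.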
Language models trained on observed sequences are often described as learning the conditional distribution of the next token given previous tokens. This description is only conditionally correct. A model trained on realized token trajectories does not observe full conditional laws; it receives sampled continuations. Moreover, real language generation is conditioned not only on previous words but also on non-textual circumstances: facts, events, intentions, goals, beliefs, social context, and task-specific constraints. This paper distinguishes three objects that are often conflated: the full conditional language process conditioned on latent circumstances, the marginal text-only process obtained by integrating those circumstances out, and the model-induced distribution learned from finite observed corpora. The paper argues that interpreting model training as estimating the marginal text-only law requires strong assumptions of stationarity, representativeness, and ergodicity, assumptions that are standard in statistical estimation but problematic when applied to heterogeneous language corpora. Even if these assumptions hold, the marginal text-only law is useful only when the observed prefix is an approximately sufficient statistic for the latent circumstances relevant to continuation. In information-theoretic terms, usefulness requires that the residual conditional mutual information between the next token and the omitted circumstances, given the observed text, be small. The paper then extends this argument to heterogeneous training corpora. Finally, the paper interprets Retrieval Augmented Generation (RAG) and tool use as conditional sufficiency devices.
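The information-theoretic criterion in the abstract can be illustrated with a toy calculation. The sketch below is not taken from the paper: the variable names (`x` for the observed prefix, `c` for the omitted circumstance, `y` for the next token), the binary alphabets, and the uniform distribution over `(x, c)` are all assumptions made for illustration. It computes the conditional mutual information I(Y; C | X), which equals the expected extra log-loss incurred by predicting from the text-only law p(y | x) instead of the full law p(y | x, c).

```python
import math
from itertools import product


def conditional_mutual_information(joint):
    """Return I(Y; C | X) in nats for a table joint[(x, c, y)] = probability."""
    p_x, p_xc, p_xy = {}, {}, {}
    for (x, c, y), p in joint.items():
        p_x[x] = p_x.get(x, 0.0) + p
        p_xc[(x, c)] = p_xc.get((x, c), 0.0) + p
        p_xy[(x, y)] = p_xy.get((x, y), 0.0) + p
    total = 0.0
    for (x, c, y), p in joint.items():
        if p > 0.0:
            # p(y | x, c) / p(y | x) = p(x, c, y) p(x) / (p(x, c) p(x, y))
            total += p * math.log(p * p_x[x] / (p_xc[(x, c)] * p_xy[(x, y)]))
    return total


def make_joint(p_y1_given_xc):
    """Hypothetical toy joint: (x, c) uniform on {0, 1}^2, y binary.

    p_y1_given_xc[(x, c)] is the assumed probability that y = 1 given x and c.
    """
    joint = {}
    for x, c in product((0, 1), repeat=2):
        q = p_y1_given_xc[(x, c)]
        joint[(x, c, 1)] = 0.25 * q
        joint[(x, c, 0)] = 0.25 * (1.0 - q)
    return joint


# Case A: the prefix alone fixes the continuation law; c adds nothing.
sufficient = make_joint({(0, 0): 0.9, (0, 1): 0.9, (1, 0): 0.2, (1, 1): 0.2})

# Case B: the continuation is decided by c, which the prefix does not reveal.
insufficient = make_joint({(0, 0): 1.0, (0, 1): 0.0, (1, 0): 1.0, (1, 1): 0.0})

cmi_sufficient = conditional_mutual_information(sufficient)
cmi_insufficient = conditional_mutual_information(insufficient)
print(f"I(Y; C | X), sufficient prefix:   {cmi_sufficient:.4f} nats")
print(f"I(Y; C | X), insufficient prefix: {cmi_insufficient:.4f} nats")
```

In case A the residual information is zero up to rounding, so the text-only law loses nothing by ignoring the circumstance. In case B it equals log 2 nats: even a perfectly estimated text-only law predicts a coin flip where the full conditional law is deterministic. Under the same toy reading, a retrieval or tool call that places `c` into the prefix turns case B into a case-A situation.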