Large Language Models / LLM - arXivDaily Topic

2606.18663 2026-06-18 cs.CL New submission 90%

RegMix-D: Dynamic Data Mixing via Proxy Training Trajectories

Kaiyan Zhao, Zhongtao Miao, Akiko Aizawa, Yoshimasa Tsuruoka

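Affiliations: The University of Tokyo; National Institute of Informatics

Topic match (pretraining): Dynamic data mixing method for LLM pretraining

AI summary: Proposes RegMix-D, which uses proxy training trajectories to predict optimal mixture ratios at multiple training stages for dynamic data mixing. It outperforms RegMix and DoReMi on 13 downstream tasks, even with only 25% of RegMix's proxy compute budget.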

Comments Work in progress

Abstract

Data mixture selection is critical for Large Language Model pretraining. Existing methods such as RegMix select a single static mixture by fitting a regression model on small-scale proxy runs. We propose RegMix-D, a simple extension of RegMix to dynamic mixing. Our key observation is that proxy runs produce not only endpoint losses, but also full loss trajectories, which can be used to further improve the data mixture. By training a regression model on these trajectories, we can predict optimal mixtures at multiple training stages. RegMix-D supports two deployment modes: an offline variant that generates a complete mixture schedule before target training, and an online variant that adapts the mixture during training using the observed loss. Experiments on 25B tokens of the Pile dataset with a 1B-parameter target model show that RegMix-D consistently improves over RegMix and DoReMi across 13 downstream tasks while remaining proxy-efficient: it surpasses RegMix even with only 128 proxy models (25% of RegMix's proxy compute budget).
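
The idea of regressing proxy-run losses on mixture weights at several checkpoints, then choosing a mixture per stage, can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the synthetic losses, the plain least-squares regressor, the three-domain setup, and the names `fit_stage_models` and `best_mixture_per_stage` are all hypothetical, since the abstract does not specify the regressor or the search procedure.

```python
import numpy as np

def fit_stage_models(mixtures, trajectories):
    """Least-squares fit of loss ~ mixtures @ coefs, one column per stage.

    mixtures:     (n_proxies, n_domains), each row sums to 1
    trajectories: (n_proxies, n_stages), proxy loss at each checkpoint
    returns:      (n_domains, n_stages) coefficient matrix
    """
    coefs, *_ = np.linalg.lstsq(mixtures, trajectories, rcond=None)
    return coefs

def best_mixture_per_stage(coefs, candidates):
    """Pick, for every stage, the candidate mixture with the lowest predicted loss."""
    predicted = candidates @ coefs                 # (n_candidates, n_stages)
    return candidates[np.argmin(predicted, axis=0)]

# Synthetic proxy runs (hypothetical data): 16 proxies, 3 domains, 2 stages.
rng = np.random.default_rng(0)
mixtures = rng.dirichlet(np.ones(3), size=16)
true_coefs = np.array([[1.0, 3.0],
                       [2.0, 1.0],
                       [3.0, 2.0]])                # domain 0 is best at stage 0, domain 1 at stage 1
trajectories = mixtures @ true_coefs               # noise-free losses for clarity

coefs = fit_stage_models(mixtures, trajectories)
candidates = np.vstack([np.eye(3), np.full((1, 3), 1 / 3)])
schedule = best_mixture_per_stage(coefs, candidates)
print(schedule)                                    # one mixture (row) per stage
```

A purely linear model always prefers a single-domain candidate, so a practical system would need a richer regressor and a denser candidate set; the sketch only shows how trajectory data yields a per-stage schedule instead of one static mixture.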

2606.19036 2026-06-18 cs.LG New submission 85%

Geometric and Stochastic Analysis of Discontinuities in Sparse Mixture-of-Experts

Tho Tran Huu, Huu-Tuan Nguyen, Thien-Hai Nguyen, Nhat-Tri Ho, Viet-Hoang Tran, Tho Quan, Tan Minh Nguyen

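Affiliations: Department of Mathematics, National University of Singapore, Singapore; Faculty of Computer Science and Engineering, Ho Chi Minh City University of Technology (HCMUT), VNU-HCM, Ho Chi Minh City, Vietnam

Topic match (pretraining): Analyzes discontinuities in sparse MoE and proposes a smoothing mechanism; at its core an LLM architecture improvement.

AI summary: Gives a geometric and stochastic analysis of discontinuities in sparse Mixture-of-Experts models: classifies discontinuities by order, establishes asymptotic volume estimates, proves that random paths almost surely first hit an order-1 discontinuity, and proposes a low-overhead smoothing mechanism that improves performance.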

Comments ICML 2026 Spotlight. arXiv admin note: text overlap with arXiv:2510.17794 by other authors

Abstract

Sparse Mixture-of-Experts (SMoE) architectures are now widely deployed in state-of-the-art language and vision models, where conditional routing allows scaling to very large networks. However, this very Top-$k$ expert selection that enables conditional routing also renders the SMoE map inherently discontinuous. In the vicinity of these discontinuity surfaces, even inputs that are arbitrarily close may activate substantially different sets of experts, resulting in significantly different outputs. In this work we give a rigorous geometric and stochastic analysis of these discontinuities. We first classify them by order, determined by the number of tied experts at a switching event. Using measure-theoretic slicing arguments, we establish asymptotic volume estimates for the thickened discontinuity surfaces, showing that lower-order discontinuity sets dominate, whereas higher-order ones occupy a vanishingly small relative volume. Next, modeling random perturbations in the input space via a diffusion process, we prove that the path eventually encounters a discontinuity, and moreover that the first hit almost surely occurs on an order-1 discontinuity, with explicit finite-time probability bounds. We further derive occupation-time bounds that quantify the duration the random path spends in the neighborhoods of each discontinuity order. These theoretical results imply that inputs are more likely to lie near lower-order discontinuities. Motivated by this insight, we propose a simple smoothing mechanism that can be directly applied to existing SMoEs, softly incorporating experts near discontinuities; our analysis guarantees that the added computational overhead remains small while providing localized smoothing near discontinuities, and experiments across language and vision tasks show that smoothing not only enforces continuity of the SMoE map but also enhances empirical performance.
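
The discontinuity described here is easy to reproduce in a toy setting. The sketch below assumes a two-expert layer with linear experts and a linear router; the function `smoe` and all weights are hypothetical and chosen only so that two inputs on opposite sides of a routing tie are visible. It does not implement the paper's smoothing mechanism.

```python
import numpy as np

def smoe(x, router_w, expert_w, k=1):
    """Top-k sparse MoE with linear experts; returns (output, chosen experts)."""
    logits = router_w @ x                          # one routing logit per expert
    top = np.argsort(logits)[-k:]                  # indices of the k largest logits
    gates = np.exp(logits[top] - logits[top].max())
    gates /= gates.sum()                           # softmax over the selected experts
    outs = expert_w[top] @ x                       # scalar output of each chosen expert
    return float(gates @ outs), set(top.tolist())

router_w = np.array([[1.0, 0.0],
                     [0.0, 1.0]])                  # expert 0 wins when x[0] > x[1]
expert_w = np.array([[1.0, 0.0],
                     [-1.0, 0.0]])                 # the two experts disagree in sign

eps = 1e-6
x_a = np.array([1.0 + eps, 1.0])                   # just on expert 0's side of the tie
x_b = np.array([1.0 - eps, 1.0])                   # just on expert 1's side of the tie

y_a, experts_a = smoe(x_a, router_w, expert_w)
y_b, experts_b = smoe(x_b, router_w, expert_w)
print(np.linalg.norm(x_a - x_b), abs(y_a - y_b))   # tiny input gap, large output gap
```

The tie surface `x[0] == x[1]` is an order-1 discontinuity in the abstract's terminology, because exactly two experts are tied there.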

2606.19005 2026-06-18 cs.CL cs.LG New submission 85%

Sumi: Open Uniform Diffusion Language Model from Scratch

Mengyu Ye, Keito Kudo, Wataru Ikeda, Ryosuke Matsuda, Keisuke Sakaguchi, Jun Suzuki

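Affiliations: Tohoku University

Topic match (pretraining): Pretrains a 7B uniform diffusion language model from scratch, with performance comparable to autoregressive models.

AI summary: Presents Sumi, a 7B-parameter uniform diffusion language model pretrained from scratch on 1.5T tokens. It performs comparably to autoregressive models of the same scale, and all resources are openly released.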

Abstract

Diffusion models have become a promising alternative to autoregressive models. Among these, uniform diffusion language models (UDLMs) permit any token to be updated at any step, in principle enabling more flexible generation. However, no UDLM has yet been pretrained from scratch at both large parameter scale and large token budget. Both autoregressive modeling and masked diffusion modeling already have capable models at scale that the community can study and build on; uniform diffusion has none. A scratch-pretrained UDLM at scale would provide a clean reference point for studying scaling behavior, generation dynamics, controllability, and trade-offs against established autoregressive and masked diffusion models. To this end, we introduce Sumi ("ink" in Japanese), a fully open 7B uniform diffusion language model pretrained from scratch on 1.5T tokens. Sumi performs competitively with autoregressive models trained at comparable token budgets on knowledge, reasoning, and coding benchmarks, while under-performing on commonsense benchmarks, where our education-heavy data mixture is a likely contributor. We release our model weights, checkpoints, and full training recipe, including a complete specification of the data mixture over publicly available corpora. We hope this release enables the community to study native uniform diffusion at scale and catalyzes work on its as-yet poorly understood aspects.
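
For readers unfamiliar with uniform diffusion, the forward (noising) step in a common uniform-state formulation can be sketched as below: each token is independently replaced, with probability `t`, by a token drawn uniformly from the vocabulary, so any position can change at any step. Whether Sumi uses exactly this corruption process or noise schedule is an assumption; the abstract does not say, and `uniform_corrupt` is a hypothetical helper.

```python
import numpy as np

def uniform_corrupt(tokens, t, vocab_size, rng):
    """With probability t, resample each token uniformly from the vocabulary."""
    tokens = np.asarray(tokens)
    resample = rng.random(tokens.shape) < t        # which positions get noised
    noise = rng.integers(0, vocab_size, size=tokens.shape)
    return np.where(resample, noise, tokens)

rng = np.random.default_rng(0)
x0 = np.arange(8)                                  # a toy "sentence" of token ids
x_clean = uniform_corrupt(x0, 0.0, 100, rng)       # t = 0 leaves the input untouched
x_half = uniform_corrupt(x0, 0.5, 100, rng)        # roughly half the positions resampled
x_noise = uniform_corrupt(x0, 1.0, 100, rng)       # t = 1 gives pure uniform noise
print(x_clean, x_half, x_noise)
```

Unlike masked diffusion, there is no special mask token here: a corrupted position looks like any other token, which is why the denoiser may revise every position at every step.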

2606.19025 2026-06-18 cs.LG cs.AI cs.DC cs.SY eess.SY New submission 80%

FoMoE: Breaking the Full-Replica Barrier with a Federation of MoEs

Lorenzo Sani, Zeyu Cao, Meghdad Kurmanji, Alex Iacob, Andrej Jovanovic, Yan Gao, Wanru Zhao, Nicholas D. Lane

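Affiliations: DeepSeek-AI

Topic match (pretraining): Proposes a cross-datacenter MoE training system that reduces communication overhead.

AI summary: Proposes FoMoE, a system that breaks the full-replica paradigm by partitioning expert layers across workers; combined with partial expert replication and a skip-token mechanism, it substantially reduces communication overhead and improves throughput.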

Abstract

Pre-training Large Language Models (LLMs) typically demands large-scale infrastructure with tightly coupled hardware accelerators. While increasing model and dataset scale remains the dominant driver of performance, Mixture-of-Experts (MoE) architectures have recently achieved state-of-the-art results by decoupling parameter count from computational cost. This efficiency enables training massive models on constrained compute budgets, yet it typically requires the high-speed interconnects of a single datacenter. To overcome these physical limits, recent approaches such as DiLoCo and Photon use low-communication data-parallel methods to enable scaling across geographically distributed, weakly connected data centers. However, these methods suffer from a fundamental inefficiency: they require full model replicas at every site, which imposes prohibitive memory constraints and communication overheads. In this work, we introduce FoMoE, a system that breaks the full-replica paradigm by partitioning expert layers across workers. We demonstrate that FoMoE: (I) reduces communication costs by up to 1.42x over efficient baselines and 45.44x over DDP via partial expert replication in the studied regimes; (II) achieves empirical throughput speedups of up to 1.4x through a novel skip-token mechanism; and (III) shows stable routing in the trained proxy regimes and projects the communication/memory benefits to 100B-scale configurations through system modelling.

2606.18650 2026-06-18 cs.LG New submission 80%

BLADE: Scalable Bi-level Adaptive Data Selection for LLM Training

Jiaxing Wang, Deping Xiang, Jin Xu, Zirui Liu, Zicheng Zhang, Guoqiang Gong, Jun Fang, Chao Liu, Pengzhang Liu, Tongxuan Liu, Ke Zhang, Qixia Jiang

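Affiliations: University of Oxford; Renmin University of China; University of Chinese Academy of Sciences

Topic match (pretraining): Scalable bi-level adaptive data selection for LLM training

AI summary: Proposes the BLADE framework, which uses Lagrange multipliers to turn the bi-level optimization into a penalized single-level objective, avoiding inverse-Hessian computation and yielding a dynamic reference model. It comes with a first-order convergence guarantee and outperforms existing methods in experiments.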

Abstract

As Large Language Model (LLM) datasets scale to trillions of tokens, data selection has emerged as a critical frontier to filter out uninformative noise and construct adaptive learning trajectories. Beyond static heuristic filtering, advanced data selection methods for LLM training largely follow two paradigms, each with fundamental limitations. Influence-based methods provide principled bi-level objectives but require intractable inverse-Hessian computations, while excess-loss methods are computationally efficient but rely on a static reference model that becomes misaligned with the evolving proxy model during training. We propose BLADE (Bi-Level Adaptive Data sElection), a Hessian-free framework for data selection. BLADE reformulates the bi-level optimization problem underlying influence-based methods as a penalized single-level objective via Lagrange multipliers, avoiding inverse-Hessian computation while revealing a principled connection to excess-loss based data selection. The resulting objective recovers an excess-loss form but replaces the static reference model with a dynamic one that stays synchronized with training. Theoretically, we prove that this penalized formulation guarantees first-order convergence. For efficient online batch selection, we instantiate BLADE as a memoryless randomized block-coordinate Frank-Wolfe algorithm. Extensive experiments show that BLADE consistently outperforms state-of-the-art data selection baselines, providing a practical recipe for LLM training.
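
The excess-loss form mentioned in the abstract scores each example by how much worse the current model does than a reference model, then keeps the highest-scoring examples in a batch. The sketch below shows only that scoring and selection step for two given loss vectors. How BLADE keeps its reference synchronized with training, and its Frank-Wolfe solver, are not reproduced; `select_by_excess_loss` and the loss values are hypothetical.

```python
import numpy as np

def select_by_excess_loss(model_losses, reference_losses, k):
    """Return the indices of the k examples with the largest excess loss."""
    excess = np.asarray(model_losses) - np.asarray(reference_losses)
    order = np.argsort(-excess, kind="stable")     # largest excess loss first
    return np.sort(order[:k]), excess

# Per-example losses for one batch of four examples (made-up numbers).
model_losses = [2.0, 0.5, 3.0, 1.0]                # current training model
reference_losses = [1.9, 0.4, 1.0, 0.2]            # reference model on the same examples

selected, excess = select_by_excess_loss(model_losses, reference_losses, k=2)
print(selected, excess)                            # examples 2 and 3 have the most headroom
```

With a static reference, `reference_losses` would come from a frozen checkpoint; the abstract's point is that the reference should instead track the model as it trains.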

2606.18192 2026-06-18 cs.AI New submission 80%

The Stanford EDGAR Filings Dataset: Reconstructing U.S. Corporate and Financial Disclosures into Layout-Faithful and Token-Efficient Pretraining Data

Nick Bettencourt, Xiaowei Ding, Kay Giesecke

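Affiliations: University of California, Los Angeles; Nanjing University; Stanford University

Topic match (pretraining): Builds a long-context pretraining dataset for LLMs

AI summary: To address the scarcity of long-context documents, introduces the SEFD dataset, which reconstructs SEC filings into layout-faithful MultiMarkdown for financial language modeling and evaluation. The corpus is token-efficient and has less than 0.1% overlap with Common Crawl-derived corpora.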

Comments Preprint. Includes appendix, tables, and figures

Abstract

As high-quality public web corpora become increasingly exhausted, clean long-context documents have become a scarce and expensive source of training data for large language models (LLMs). Existing long-context corpora are often proprietary and costly to acquire, synthetically generated, or concentrated in narrow domains such as programming. We introduce the Stanford EDGAR Filings Dataset (SEFD), an open reconstruction of SEC filings into layout-faithful MultiMarkdown for financial language modeling and evaluation. SEFD makes audited financial statements, risk disclosures, ownership reports, accounting notes, and market-moving event filings usable as long-context pretraining data and as a basis for financial reasoning, forecasting, compliance, and document understanding. The resulting corpus is token-efficient, model-ready, and has less than 0.1% overlap with Common Crawl-derived corpora. We release SEFD-v1, a 152B-token initial public snapshot, and provide corpus-level analyses of a larger 18.5M-filing archive estimated at 550B tokens. We further introduce two SEFD-derived benchmarks: EDGAR-Forecast, which evaluates filing-grounded numerical forecasting after model knowledge cutoffs, and EDGAR-OCR, which evaluates transcription of complex financial tables.

2606.10466 2026-06-18 cs.LG cs.AI New submission 80%

UPLOTS: A Unified Pretrained Language Model for Constrained Time-series Generation

Du Yin, Hao Xue, Jinliang Deng, Yang Yang, Shuang Ao, Arian Prabowo, Flora Salim

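Affiliations: University of New South Wales; HKUST(GZ) (Hong Kong University of Science and Technology (Guangzhou)); BUAA (Beihang University)

Topic match (pretraining): A unified pretrained language model for time-series generation

AI summary: Proposes UPLOTS, a prompt-guided framework built on a unified pretrained language model. Using dynamic multi-dataset loss re-weighting and prompt-to-pattern mapping, it performs constrained time-series generation across domains; its generalization and data-augmentation benefits are validated on four benchmarks.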

Abstract

In time-series generation, existing approaches typically handcraft or train a separate model for each dataset, which hinders their scalability and fails to leverage shared temporal structures across domains. To address this fragmentation, we propose UPLOTS, a Unified, Prompt-guided Language model framework fOr constrained Time-Series Generation across diverse domains. Instead of building task-specific models, UPLOTS leverages a single pre-trained transformer backbone guided by learned constraint prompts, enabling on-demand generation with precise pattern control. One key innovation is our dynamic multi-dataset loss re-weighting and prompt-to-pattern mapping, which allows UPLOTS to internalize diverse temporal structures during training and conditionally generate them at inference. We evaluate UPLOTS on four real-world benchmarks and multiple constraint settings, including peak-period, calendar, load-level, and volatility patterns. Additional held-out constraint-combination and downstream forecasting experiments further demonstrate that UPLOTS generalizes beyond the original peak-pattern setting and improves data augmentation under scarce real-data regimes. Our code and baselines are available at an anonymous GitHub repo: https://anonymous.4open.science/r/UPLOTS-6C36.

2606.18587 2026-06-18 cs.CL cs.AI New submission 75%

Dual Dimensionality for Local and Global Attention

Zhiyuan Wang, Xuan Luo, Sirui Zeng, Xifeng Yan

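Affiliations: UC Santa Barbara

Topic match (pretraining): Proposes a distance-adaptive representation to optimize Transformer attention

AI summary: Proposes Distance-Adaptive Representation (DAR), which keeps full-dimensional representations for the local context and uses lower-dimensional representations for distant tokens, reducing the KV cache while maintaining performance.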

Abstract

Decoder-only Transformers compute attention over the KV cache of preceding tokens. Keys (and Values) are typically represented with the same dimensionality, regardless of their distance from the prediction target. In natural language, however, the next word is most strongly influenced by the immediately preceding tokens. We hypothesize that local and distant tokens impose asymmetric demands on representational capacity: local tokens are more critical for predicting immediate outputs and thus require richer representations, whereas distant tokens primarily serve as long-range memory, for which lower-dimensional representations may suffice. We formalize this idea as Distance-Adaptive Representation (DAR), implemented in a controlled setting that preserves full-dimensional representations within a local context window while assigning reduced-dimensional representations (e.g. 1/4 of the original dimensionality) to tokens beyond that window. Across multiple pretraining scales (70M to 410M parameters), as well as continued supervised fine-tuning on a 1B-scale model, this approach closely matches the performance of full-dimensional baselines. In contrast, uniformly reducing dimensionality across all token positions leads to worse performance. These results challenge the common assumption that key and value dimensionality should be uniform across token positions. Our findings suggest a new direction for designing attention architectures that adaptively allocate representational capacity across sequences, enabling further reductions in KV cache during inference.
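
The potential KV-cache saving is simple arithmetic and can be checked with a short calculation. The numbers below (sequence length 4096, key/value width 1024, local window 512) are illustrative assumptions rather than the paper's settings; only the 1/4 reduction ratio comes from the abstract. The helper `kv_cache_elements` is hypothetical and counts stored elements for one layer, not bytes.

```python
def kv_cache_elements(seq_len, dim, window=None, reduced_dim=None):
    """Count cached key and value elements for one layer.

    Without a window, every token stores `dim` elements for K and for V.
    With a window, the most recent `window` tokens keep `dim` elements and
    older tokens keep only `reduced_dim`.
    """
    if window is None or reduced_dim is None:
        return 2 * seq_len * dim
    local = min(seq_len, window)
    distant = seq_len - local
    return 2 * (local * dim + distant * reduced_dim)

full = kv_cache_elements(4096, 1024)
dar = kv_cache_elements(4096, 1024, window=512, reduced_dim=1024 // 4)
ratio = dar / full
print(full, dar, ratio)
```

In this configuration the distance-adaptive cache holds about 34% of the elements of the uniform cache, and the ratio approaches 1/4 as the sequence grows relative to the window.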

2606.19170 2026-06-18 cs.CL New submission 70%

Dango: A Strictly L1-Only Large Language Model for Studying Second Language Acquisition

Shiho Matta, Yin Jou Huang, Fei Cheng, Takashi Kodama, Hirokazu Kiyomaru, Yugo Murawaki

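Affiliations: Kyoto University; NII-LLMC

Topic match (pretraining): An LLM that simulates second language acquisition; involves pretraining

AI summary: Presents Dango, a 1.8B-parameter model that, by filtering L2 contamination and fine-tuning on L2-learning lessons, simulates human-like L2 production patterns and outperforms unfiltered and multilingual baselines.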

Comments 8 pages main text, 20 pages total including references and appendices

Abstract

We introduce Dango, a 1.8B-parameter large language model designed for controlled studies of L1-to-L2 (Japanese-to-English) transfer in second language acquisition (SLA). While previous studies have explored SLA in language models, they have predominantly relied on smaller or non-decoder models, limiting their ability to generate open-ended text and reducing their suitability as practical L2 simulators. We identify a key challenge when scaling models to this size: L2 contamination within the "monolingual" pretraining corpus used for L1 acquisition. To address this, we propose a filtering method to reduce premature exposure to English while preserving realistic, minimal exposure. We then fine-tune the model on LLM-generated L2-learning lessons to simulate the L2 acquisition process. Our evaluations confirm that Dango develops human-like L2 production patterns, outperforming both unfiltered and standard multilingual baselines. We release the model, data, and code to facilitate reproducible computational SLA research and learner-facing applications.

2606.18465 2026-06-18 cs.LG cs.AI New submission 70%

What Does the Weight Norm Control in Grokking? Logit-Scale Mediation under Cross-Entropy

Truong Xuan Khanh

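Affiliations: H&K Research Studio, Clevix LLC

Topic match (pretraining): Studies the role of the weight norm in grokking

AI summary: By holding the weight norm fixed and varying the output temperature, finds that the grokking delay is determined mainly by the logit scale; the weight norm acts only indirectly, through its effect on the logit scale.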

Comments 16 pages, 10 tables and 4 figures. Code and data to reproduce all numbers, tables, and figures: https://github.com/ClevixLab/grokking-logit-scale

Abstract

Grokking, the delayed jump from memorization to generalization, is usually tied to the weight norm: a smaller norm generalizes sooner. We ask what the norm actually controls. Holding the weight norm fixed by clamping and varying only an output temperature, we slide the grokking delay across its entire norm-induced range under cross-entropy; matching the effective logit scale back to baseline recovers about 85% of the delay at two moduli. Across a grid of norms and temperatures the delay collapses onto the logit scale alone ($R^2 = 0.97$), with the norm adding 1-2% beyond it. The effect is loss-dependent: under mean-squared error the logit scale is pinned and the norm acts through a different route. A memorization control, a float64 softmax-collapse audit, and a no-LayerNorm transformer point to the same channel. Forking arms from one identical state, the delay follows the held norm value and not the clamp operation, which closes a rescaling-artifact concern. The proximal variable is the logit scale and the softmax saturation it drives; the weight norm is only an upstream handle. All numbers, tables, and figures reproduce from released code and data.
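
The link between logit scale, output temperature, and softmax saturation can be illustrated in a few lines. This sketch only shows the generic mechanism (dividing logits by a temperature rescales them, and larger effective logits saturate the softmax); it is not the paper's experiment, and the logit values are made up.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits / temperature."""
    z = np.asarray(logits, dtype=np.float64) / temperature
    z = z - z.max()                                # shift for numerical stability
    p = np.exp(z)
    return p / p.sum()

logits = np.array([2.0, 1.0, 0.0])

p_base = softmax(logits, temperature=1.0)
p_sharp = softmax(logits, temperature=0.1)         # larger effective logit scale, near one-hot
p_flat = softmax(logits, temperature=10.0)         # smaller effective logit scale, near uniform

# Scaling the logits by c and the temperature by c leaves the output unchanged,
# so only the effective logit scale (logits / temperature) matters.
p_matched = softmax(4.0 * logits, temperature=4.0)
print(p_base.max(), p_sharp.max(), p_flat.max())
```

In a network, multiplying the final-layer weights by a constant scales the logits by the same constant, which is one way a weight norm can feed into the logit scale.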

2606.18524 2026-06-18 cs.LG New submission 60%

On the Residual Scaling of Looped Transformers: Stability and Transferability

Shaowen Wang, Bingrui Li, Ge Zhang, Wenhao Huang, Shen Yan, Jian Li

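Affiliations: Tsinghua University

Topic match (pretraining): Analyzes residual scaling in looped Transformers

AI summary: For looped Transformers, argues that the residual scaling factor should be 1/N rather than 1/√L, and derives a factored parameterization for multi-layer blocks that enables hyperparameter transfer from few loops to many.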

Comments 19 pages, 9 figures

Abstract

Looped (weight-tied) Transformers apply a shared residual block $N$ times ($h \leftarrow h + \varepsilon\,f(h)$, same $f$ at each step), increasing effective depth without adding parameters. Prior depth-scaling analyses prescribe $\varepsilon = 1/\!\sqrt{L}$ for depth-$L$ residual networks. We show that this is insufficient for looped architectures: weight sharing makes residual updates correlated across iterations, requiring the stronger scaling $\varepsilon = 1/N$. For multi-layer blocks ($L$ unique layers looped $N$ times), we derive a factored parameterization $\varepsilon = \lambda/(N\!\sqrt{L})$ that separates the two sources of growth: $1/N$ controls the within-layer loop correlation, and $1/\!\sqrt{L}$ controls the across-layer variance. A key consequence is that the optimal learning rate depends only on the number of unique layers $L$, not on the loop count $N$, enabling direct hyperparameter transfer from small to large $N$ without retuning. Experiments on looped Transformers confirm that $1/N$ scaling improves trainability and yields better loss than $1/\!\sqrt{N}$ scaling across loop counts.
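
A deliberately extreme toy case makes the scaling argument concrete. Take the shared map to be the identity, $f(h) = h$, so every loop adds a perfectly correlated update and $h_N = (1 + \varepsilon)^N h_0$. The identity map is an assumption chosen for the example, not the paper's setting, and `looped_growth` is a hypothetical helper.

```python
import math

def looped_growth(n_loops, eps):
    """Apply h <- h + eps * f(h) with f(h) = h for n_loops steps, starting at h = 1."""
    h = 1.0
    for _ in range(n_loops):
        h = h + eps * h
    return h

loop_counts = (4, 64, 1024)
growth_inv_n = [looped_growth(n, 1.0 / n) for n in loop_counts]          # eps = 1/N
growth_inv_sqrt = [looped_growth(n, n ** -0.5) for n in loop_counts]     # eps = 1/sqrt(N)

for n, a, b in zip(loop_counts, growth_inv_n, growth_inv_sqrt):
    print(f"N={n:5d}  eps=1/N: {a:.4f}  eps=1/sqrt(N): {b:.3e}")
```

With $\varepsilon = 1/N$ the growth factor stays below $e$ for every loop count, while with $\varepsilon = 1/\sqrt{N}$ it grows roughly like $e^{\sqrt{N}}$, which matches the abstract's claim that correlated updates need the stronger scaling.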

2606.18324 2026-06-18 cs.LG cs.AI New submission 60%

Why SWAVE May Not Be All You Need: A Concept-Evolution Retrospective on Complex-Valued Recurrent Language Models

Ramprasath Ganesaraja, Swathika N, Sahil Dilip Panse

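Affiliations: EdgeVerve Systems Limited

Topic match (pretraining): A retrospective on the evolution of the complex-valued recurrent language model SWave.

AI summary: Reviews the evolution of the complex-valued recurrent language model SWave, exposes flaws in its design assumptions, and presents theoretical insights such as cos-domination collapse along with engineering principles.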

Abstract

SWave is a complex-valued recurrent language model (169.26M parameters, D=384, L=16, T=2048) trained on FineWeb-Edu using 2xH100 NVL. It was designed around three founding premises: that representing language as complex waves rather than real-valued numbers enables richer information encoding; that a Cayley-parameterised unitary transition provides a mathematical guarantee against state decay or explosion; and that a hidden state which rotates rather than shrinks preserves signal integrity over arbitrarily long contexts. The core of SWave evolved substantially across three development phases. The Resonance Head was found to structurally admit imaginary-channel collapse as a global loss minimum (a failure mode we term cos-domination collapse) and was superseded by an untied head with independent real and imaginary embedding tables from the Phase-Associative Memory (PAM) architecture. This resolved the degenerate minimum and enabled stable 200,000-step training (best-step PPL 22.0 at step 89,861). ComplexNorm and the Wave Propagation Scan proved load-bearing throughout all three phases and were retained to the final architecture. ProtectGatedScan was reframed as a structural prior rather than a learned behaviour. The four multi-scale retention concepts showed no measurable improvement under controlled evaluation and were found non-load-bearing. The ComplexGatedUnit was superseded by a real-valued squared-ReLU channel mixer with fewer parameters. The auxiliary training objectives showed no benefit once structural constraints were resolved. The investigation yields a formal characterisation of cos-domination collapse, a parallel scan with a log-space backward pass for numerical stability, six transferable engineering principles for complex-valued recurrent training, and a plan-to-code traceability methodology for catching structural divergences that conventional test suites miss.
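
The norm-preservation guarantee attributed to the Cayley-parameterised transition can be checked numerically. The sketch below uses the textbook Cayley transform $U = (I - A)^{-1}(I + A)$ with $A$ skew-Hermitian, which always yields a unitary $U$. How SWave actually parameterises $A$ is not stated in the abstract, so the construction of `a` from an unconstrained matrix, the 4x4 size, and the name `cayley_unitary` are assumptions.

```python
import numpy as np

def cayley_unitary(params):
    """Map an arbitrary complex square matrix to a unitary one via the Cayley transform."""
    a = 0.5 * (params - params.conj().T)           # skew-Hermitian part: a^H = -a
    eye = np.eye(a.shape[0], dtype=complex)
    return np.linalg.solve(eye - a, eye + a)       # (I - A)^{-1} (I + A)

rng = np.random.default_rng(0)
params = rng.normal(size=(4, 4)) + 1j * rng.normal(size=(4, 4))
u = cayley_unitary(params)

h = rng.normal(size=4) + 1j * rng.normal(size=4)   # toy hidden state
norm_before = np.linalg.norm(h)
for _ in range(1000):                              # 1000 recurrent steps
    h = u @ h
norm_after = np.linalg.norm(h)
print(norm_before, norm_after)                     # equal up to floating-point error
```

The state rotates without decaying or exploding, which is the first half of the premise; the retrospective's point is that this guarantee alone did not prevent other failure modes such as cos-domination collapse.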
