2606.09079 2026-06-10 cs.LG cs.AI (version update)

FlashMemory-DeepSeek-V4: Lightning Index Ultra-Long Context via Lookahead Sparse Attention

Yan Wang, Qifan Zhang, Jiachen Yu, Tian Liang, Dongyang Ma, Xiang Hu, Zibo Lin, Chunyang Li, Zhichao Wang, Miao Peng, Nuo Chen, Jia Li, Yujiu Yang, Haitao Mi, Dong Yu

Affiliations: Independent Researchers; Tencent; The Hong Kong University of Science and Technology (Guangzhou); Tsinghua University

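AI summary: Proposes Lookahead Sparse Attention (LSA), built on a neural memory indexer for the DeepSeek-V4 architecture, which predicts future context needs and retains only the critical KV chunks; in ultra-long-context settings it compresses the physical KV cache to 13.5% of the full-context footprint while preserving or slightly improving downstream accuracy.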

Comments: Technical report. 11 pages. Code and model available at https://github.com/libertywing/FlashMemory-Deepseek-V4 and https://huggingface.co/libertywing/FlashMemory-Deepseek-V4

Abstract

Conventional LLMs keep the full KV cache loaded during decoding, causing a severe GPU memory bottleneck for ultra-long context serving. In this report, we propose Lookahead Sparse Attention (LSA), a novel inference paradigm powered by a Neural Memory Indexer built upon the DeepSeek-V4 architecture. Rather than passively attending to all historical tokens, LSA proactively predicts future context demands and preserves only the query-critical KV chunks in the GPU memory. Crucially, we instantiate this architecture via a backbone-free decoupled training strategy. By formulating the indexer as a standard dual-encoder architecture, we train it independently using standard retrieval training frameworks without ever loading the massive backbone model into GPU memory. We demonstrate that this "less is more" paradigm significantly maximizes serving efficiency while acting as an effective attention denoiser in tasks that rely on long-term global memory. Across primary long-context evaluation suites (e.g., LongBench-v2, LongMemEval, and RULER), FM-DS-V4 compresses the average physical KV cache footprint down to merely 13.5% of the full-context baseline, while consistently preserving or slightly elevating downstream accuracy (+0.6% absolute margin on average). Crucially, at extreme 500K scales, FlashMemory suppresses the physical KV cache overhead by over 90% without destabilizing the backbone's core reasoning capacities.
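
The chunk-retention idea described in this abstract can be illustrated with a rough sketch: score each KV chunk against a query embedding with a dual-encoder-style dot product and keep only the best-scoring fraction. The helper name `select_chunks`, the toy embeddings, and the use of 13.5% as a default budget are illustrative assumptions; this is not the authors' indexer.

```python
import math

def select_chunks(query_vec, chunk_vecs, keep_ratio=0.135):
    """Return indices of the highest-scoring chunks, in original order.

    Hypothetical helper: scores are plain dot products between a query
    embedding and per-chunk embeddings, as in a dual-encoder retriever.
    """
    scores = [sum(q * c for q, c in zip(query_vec, vec)) for vec in chunk_vecs]
    # Always keep at least one chunk, otherwise round the budget up.
    budget = max(1, math.ceil(keep_ratio * len(chunk_vecs)))
    ranked = sorted(range(len(chunk_vecs)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])

# Toy 2-dimensional embeddings (made up for illustration).
chunks = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
kept = select_chunks([1.0, 0.2], chunks, keep_ratio=0.5)
print(kept)
```

Returning the kept indices in their original order preserves positional ordering for whatever attention step consumes the retained chunks.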

2606.08982 2026-06-10 cs.AI (version update)

Baichuan-M4: A Clinical-Grade Medical Agent System for Continuous Care

Aiyuan Yang, Canbin Piao, Chengfeng Dou, Da Pan, Dian Wang, Fan Yang, Fei Deng, Fei Li, Guangwei Ai, Hui Liu, Hongda Zhang, Jinyang Tai, Kai Lu, Lijun Liu, Linwei Chen, Linyu Li, Meiqing Guo, Peidong Guo, Qiang Ju, Rihui Xin, Shuai Wang, XinKai Ma, Xudong Chen, Yichuan Mo, Yijie Zhou, Leyi Pan, Yihe Luo, Zian Wang

Affiliations: Baichuan AI; THUBPM Group, Tsinghua University

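AI summary: Presents Baichuan-M4, a clinical-grade medical large model built as an agent system around three pillars (a unified runtime, a continuous-care reinforcement-learning framework, and a clinical tool layer); it reports leading results across multiple medical evaluations and a hallucination rate lowered to 3.3%.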

Abstract

Baichuan-M4 is Baichuan Intelligence's clinical-grade medical large model, designed for continuous care rather than single-turn medical question answering. It is built as a coordinated medical agent system around three pillars: Baichuan-Harness, a unified runtime that keeps reinforcement-learning training and real-world deployment consistent while enforcing action constraints, tool use, long-term patient memory, and multi-agent coordination; a core reasoning model trained with a continuous-care reinforcement-learning framework that integrates span-level reward modeling (SPAR++), reasoning-path compression, curriculum learning, and stabilized policy optimization; and a clinical tool layer for patient-memory management, authoritative evidence-based retrieval, and multimodal medical perception across documents, X-rays, and dermatology. On a cross-dimensional medical evaluation suite, Baichuan-M4 attains leading results in static medical knowledge and safety, dynamic OSCE-style consultation, long-context clinical memory, evidence-based retrieval, medical document OCR, and multimodal image understanding, while lowering the hallucination rate to 3.3%.

2606.08779 2026-06-10 cs.LG (version update)

Reformulate LLM Reinforcement Learning for Efficient Training under Black-box Discrepancy

Jiashun Liu, Runze Liu, Xu Wan, Jing Liang, Hongyao Tang, Ling Pan

Affiliations: Hong Kong University of Science and Technology; Zhejiang University; Tianjin University

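AI summary: Addresses the train-inference discrepancy in reinforcement learning by proposing the Discrepancy-Constrained Markov Decision Process (DCMDP), which uses Lagrangian relaxation to adaptively balance performance gains against discrepancy control for stable, efficient training.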

Abstract

Reinforcement Learning (RL) has emerged as a pivotal post-training paradigm, yet it frequently suffers from unpredictable suboptimal performance or even training collapse. Recent findings attribute these failures to a hidden train-inference discrepancy (or mismatch) stemming from differences in the underlying engines and architectures. We find that the training policy can actively self-correct such a discrepancy when provided with an appropriate learning signal. We further empirically identify a discrepancy tolerance region: within this region, aggressively narrowing the discrepancy can suppress policy exploration and reduce learning efficiency, whereas outside this region, reducing excessive discrepancy improves optimization consistency and raises the achievable local performance ceiling. Based on these findings, we formulate the problem as a Discrepancy-Constrained Markov Decision Process (DCMDP), in which reward maximization is coupled with a constraint that aligns train-inference behavior. To adaptively balance performance improvement and discrepancy control, we introduce a Lagrangian relaxation mechanism that dynamically adjusts the relative weight of the two objectives according to the current degree of discrepancy violation. This enables stable dual-objective optimization: the policy is allowed to explore freely within the tolerance region, while being guided back when the discrepancy exceeds the safe boundary. Empirically, DCMDP significantly improves the performance of an 8B dense model (Qwen-3-8b) and a 30B Mixture-of-Experts model (Qwen-3-30bA3b), and enables a heterogeneous training paradigm in which LLMs can be optimized in a high-fidelity training setup while being explicitly aligned for low-cost, resource-constrained inference deployment.
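
The Lagrangian relaxation mentioned in this abstract can be pictured with the textbook projected dual-ascent rule for a constrained MDP. The sketch below is only that generic rule: the function names, step size, tolerance value, and discrepancy readings are assumptions for illustration, and the paper's actual mechanism may differ.

```python
def update_multiplier(lam, discrepancy, tolerance, step=0.1):
    """Projected dual ascent: raise lam while the constraint is violated,
    let it decay toward zero (never below) once the policy is back inside
    the tolerance region."""
    return max(0.0, lam + step * (discrepancy - tolerance))

def lagrangian_objective(reward, discrepancy, lam, tolerance):
    """Scalar objective the policy would maximise for a fixed multiplier."""
    return reward - lam * (discrepancy - tolerance)

lam = 0.0
# Hypothetical measured train-inference discrepancies over four updates.
for d in [0.05, 0.30, 0.40, 0.08]:
    lam = update_multiplier(lam, d, tolerance=0.10)
    print(round(lam, 3))
```

With the multiplier at zero the policy optimizes reward alone, which matches the abstract's description of free exploration inside the tolerance region.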

2606.07936 2026-06-10 cs.CL cs.AI (version update)

Illusions of the Gold Standard: A Large-scale Analysis of Human Evaluation Protocols for Long-form Text Generation

Katelyn Xiaoying Mei, Yi-Li Hsu, Minjoon Choi, Zongwan Cao, Chenjun Xu, Bingbing Wen, Su Lin Blodgett, Lucy Lu Wang

Affiliations: University of Washington; National Tsing Hua University; Seoul National University; Mila - Québec AI Institute; Allen Institute for AI

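AI summary: Analyzes human evaluation protocols in *CL conference papers from 2023-2025, finds opaque reporting and poor reproducibility, and proposes recommendations for improvement.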

Comments: Accepted to ACL 2026 Main

Abstract

Human evaluation plays a critical role in assessing the quality of generated text. However, the reliability and reproducibility of these evaluations depend on transparent and well-documented protocols, details that are frequently missing in current practice. In this work, we conduct a large-scale analysis of human evaluation protocols for evaluating long-form generation tasks in *CL conference publications from 2023-2025, including a full manual review of 284 papers and LLM-assisted analysis for another 1.8k+ papers. We define a set of 20 reportable criteria related to reproducibility of human evaluation studies, and apply these criteria to systematically examine reporting norms and practices within the community. We find widespread under-reporting of important aspects of human evaluation study design, leading to ambiguity about what was measured and how, who contributed judgments, and how judgments should be interpreted. Based on these findings, we outline actionable recommendations to support more transparent and reproducible reporting in future research. Our analysis code and annotated dataset can be found at: https://github.com/larchlab/Illusions-of-the-Gold-Standard

2606.07605 2026-06-10 cs.LG cs.AI (version update)

SRT: Super-Resolution for Time Series via Disentangled Rectified Flow

Jufang Duan, Shenglong Xiao, Yuren Zhang

Affiliations: ByteDance

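AI summary: Proposes the SRT framework, which reconstructs high-resolution time series from low-resolution inputs via disentangled rectified flow: it decomposes the input into trend and seasonal components, aligns resolutions with an implicit neural representation, and introduces a cross-resolution attention mechanism to generate fine detail.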

Journal ref: The Fourteenth International Conference on Learning Representations (ICLR 2026)
Comments: Accepted to the International Conference on Learning Representations (ICLR) 2026

Abstract

Fine-grained time series data with high temporal resolution is critical for accurate analytics across a wide range of applications. However, the acquisition of such data is often limited by cost and feasibility. This problem can be tackled by reconstructing high-resolution signals from low-resolution inputs based on specific priors, known as super-resolution. While extensively studied in computer vision, directly transferring image super-resolution techniques to time series is not trivial. To address this challenge at a fundamental level, we propose Super-Resolution for Time series (SRT), a novel framework that reconstructs temporal patterns lost in low-resolution inputs via disentangled rectified flow. SRT decomposes the input into trend and seasonal components, aligns them to the target resolution using an implicit neural representation, and leverages a novel cross-resolution attention mechanism to guide the generation of high-resolution details. We further introduce SRT-large, a scaled-up version with extensive pre-training, which enables strong zero-shot super-resolution capability. Extensive experiments on nine public datasets demonstrate that SRT and SRT-large consistently outperform existing methods across multiple scale factors, showing both robust performance and the effectiveness of each component in our architecture.
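
The trend/seasonal split mentioned in this abstract can be pictured with a classical moving-average decomposition. The abstract does not say how SRT performs the split, so the moving average here is a swapped-in, commonly used stand-in, and `decompose` and its `window` parameter are hypothetical names.

```python
def decompose(series, window=3):
    """Split a series into a moving-average trend and a residual part.

    The residual stands in for the 'seasonal' component; the window is
    truncated at the edges so the output has the same length as the input.
    """
    half = window // 2
    trend = []
    for i in range(len(series)):
        lo, hi = max(0, i - half), min(len(series), i + half + 1)
        trend.append(sum(series[lo:hi]) / (hi - lo))
    seasonal = [x - t for x, t in zip(series, trend)]
    return trend, seasonal

series = [1.0, 3.0, 1.0, 3.0, 1.0, 3.0]
trend, seasonal = decompose(series)
print(trend)
print(seasonal)
```

By construction the two components sum back to the input, so nothing is lost before each part is processed separately.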

2606.07586 2026-06-10 cs.LG cs.AI cs.AR cs.MA (version update)

From Human Guidance to Autonomy: Agent Skill System for End-to-End LLM Deployment on Spatial NPUs

Jiajie Li, Erwei Wang, Zhiru Zhang, Samuel Bayliss

Affiliations: AMD Research and Advanced Development

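AI summary: Proposes a two-stage methodology that moves from human-guided, agent-assisted deployment to an autonomous agent skill system, deploying eight additional LLMs end-to-end on the AMD XDNA 2 NPU; the human-guided reference deployment outperforms the hand-optimized baseline, and three of the eight autonomous deployments match or exceed the reference.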

Comments: Accepted to the Machine Learning for Architecture and Systems Workshop (MLArchSys), co-located with ISCA 2026

Abstract

Spatial neural processing units (NPUs) provide an energy-efficient platform for edge LLM inference, but efficiently deploying an LLM end-to-end on such hardware remains labor-intensive. Although AI coding agents have begun to lower this cost, existing studies have largely focused on single-kernel optimization rather than end-to-end LLM deployment on resource-constrained spatial NPUs. We present a two-stage methodology, instantiated on the AMD XDNA 2 NPU, that progresses from human-guided development to agent autonomy. In the first stage, we develop a reference deployment of Llama-3.2-1B through human-guided agent assistance. The resulting implementation achieves a speedup of 2.2x on prefill and 4.0x on decode over the hand-optimized baseline, with the optimization trajectory and its lessons recorded as structured documentation throughout. In the second stage, we distill the documentation into an agent skill system consisting of eight phases, orchestrating the optimization and debugging skill sets, with numerical correctness strictly enforced at each phase. Using our agent skill system, we autonomously deploy eight additional decoder-only LLMs (Llama-3.2-3B, SmolLM2-1.7B, Qwen2.5-{0.5B, 1.5B, 3B}, Qwen3-{0.6B, 1.7B, 4B}) end-to-end on the AMD XDNA 2 NPU using the open-source compiler stack. To our knowledge, these models have not previously been deployed on AMD NPUs via any open-source software stack. Each deployment completes in 0.5-4 hours of agent wall time with almost no human guidance, and passes the numerical-correctness gates, demonstrating functional generalization to previously unencountered LLMs. Three of the eight match or exceed the sustained performance of our Llama-3.2-1B reference deployment, suggesting that the resulting implementations can be competitive without additional model-specific human engineering.

2606.07422 2026-06-10 cs.CL cs.AI (version update)

The Masked Advantage: Uncovering Local-Language Access to Cultural Knowledge in LLMs

Yang Zhang, Xiao Fei, Amr Mohamed, Sarah Almeida Carneiro, Mersin Konomi, Mingmeng Geng, Ahmed Asaad, Guokan Shang, Michalis Vazirgiannis

Affiliations: Ecole Polytechnique; MBZUAI; ENS-PSL; Durham University

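AI summary: Uses controlled experiments and an item response theory model to separate language proficiency from access to cultural knowledge, finding that local languages hold an advantage in cultural-knowledge access that is often masked by weaker language proficiency.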

Abstract

Large language models are increasingly used to answer culturally grounded questions across languages, yet it remains unclear whether local cultural knowledge is better accessed through English or the local language. Existing evaluations face two key limitations: many rely on parallel template-based questions that may not reflect how cultural knowledge naturally appears, and raw accuracy conflates general language proficiency with language-conditioned knowledge access. We address these issues with a controlled framework built on real-world cultural questions collected from regional benchmarks and local sources. By crossing question type (culture-agnostic vs. culture-specific) with query language (English vs. local language), and estimating ability with a shared 1PL item response theory model, we separate proficiency from localized knowledge access. Across 13 locales and roughly 80 models, we find a consistent English advantage on culture-agnostic questions, indicating stronger English proficiency. However, after accounting for this proficiency gap, local languages show a positive knowledge-access advantage in nearly all locale-model settings. This advantage is often masked in raw accuracy but becomes more visible for frontier, regionally aligned, or language-adapted models. Our results suggest that weaker local-language performance does not necessarily imply weaker cultural knowledge; rather, local cultural knowledge may be more accessible through the local language but hidden by limited language proficiency.
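
For readers unfamiliar with the 1PL model named in this abstract, the sketch below shows the standard one-parameter (Rasch) response function, in which the probability of a correct answer depends only on the gap between ability and item difficulty. The ability and difficulty values are invented for illustration, and how the paper shares parameters across languages is not shown here.

```python
import math

def p_correct(ability, difficulty):
    """1PL (Rasch) probability that a respondent answers an item correctly."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# Hypothetical abilities of one model when queried in two languages.
ability_english, ability_local = 1.0, 0.2
item_difficulty = 0.5

print(round(p_correct(ability_english, item_difficulty), 3))
print(round(p_correct(ability_local, item_difficulty), 3))
```

Because difficulty is tied to the item rather than to the query language, a gap in predicted accuracy on the same item can be attributed to the ability term, which is what allows proficiency to be separated from other effects.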

2606.07135 2026-06-10 cs.LG (version update)

Explaining Unsupervised Disease Staging in Huntington's Disease: Insights into Model Representations and Clusters

Lubna Mahmoud Abu Zohair, Hind Zantout

Affiliations: Heriot-Watt University

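AI summary: Extends an unsupervised disease staging framework with explainability analysis; on the Enroll-HD dataset, the learned embeddings align with clinical progression, and SHAP is used to quantify feature importance, identifying disease stages that range from early cognitive-motor impairment to severe functional dependency.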

DOI: 10.48550/arXiv.2606.07135
Comments: Accepted for oral presentation and as a full-length paper at the International Conference on AI in Healthcare 2026 (26-28 August 2026, Imperial College London) and will be published by Springer in the Lecture Notes in Computer Science (LNCS) series

Abstract

Huntington's disease (HD) is a progressive neurodegenerative disorder that affects motor, cognitive, and behavioral functions, and accurate characterization of disease progression remains essential to improving patient outcomes and quality of life. Unsupervised machine learning (ML) approaches have demonstrated the ability to uncover disease progression trajectories and meaningful latent stages from longitudinal data; however, their limited interpretability restricts clinical trust and translation. We extend a previously proposed ML-based disease staging framework by applying an explainability analysis to the extracted feature representations and discovered disease stages. Using the Enroll-HD dataset, we first project the learned representations into a lower-dimensional space to intuitively assess whether the resulting clusters align with the progression of established clinical measures. We then use saliency maps to identify the clinical features that most strongly contribute to the learned embeddings over time. Finally, we train a surrogate classifier and apply SHAP to quantify feature importance for cluster assignments and to analyze which clinical variables drive transitions between disease stages. The explainability analysis indicates that the learned embeddings capture clinically meaningful disease structure, aligning with established motor and functional severity scores and exhibiting progressive deterioration across clusters. Within this analysis, SHAP reveals a stratification of disease stages, ranging from early cognitive-motor impairment to severe functional dependency, consistent with known clinical progression patterns, while also highlighting intra-stage variability.

2606.07088 2026-06-10 cs.LG math.OC (version update)

Residual-Controlled Multiplier Learning for Stochastic Constrained Decision-Making

Kang Liu, Jianchen Hu, Ziyu Qu, Edward Hengzhou Yan, Lun Yang, Meng Zhang

Affiliations: Xi’an Jiaotong University; Tencent; China University of Geosciences

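AI summary: Proposes Residual-Controlled Multiplier Learning (RCML), which recasts multiplier updates as projected-pressure feedback and adds modular stochastic stabilization components to address the unstable multiplier updates that mini-batch noise causes in primal-dual methods for stochastic constrained decision-making; it establishes finite-gain convergence and a local KKT-residual interpretation.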

Abstract

Stochastic constrained decision-making requires optimizing performance objectives while enforcing statistical requirements such as safety or fairness. However, standard primal-dual methods struggle to update multipliers robustly under stochastic mini-batch feedback, as the noise of mini-batch gradients and constraint estimates can be directly accumulated into the multiplier memory. To address this issue, we propose Residual-Controlled Multiplier Learning (RCML), which reformulates multiplier updating as projected-pressure feedback. The central idea is to decompose the projected multiplier into an effective pressure signal for primal descent and a pressure-memory residual for finite-gain multiplier tracking. To handle heterogeneous and noisy observations, we further augment this residual-integral backbone with modular stochastic stabilization components. For the convex-affine backbone, we establish finite-gain convergence, derive a stochastic residual bound under mini-batch feedback, and show that the residual feedback law admits a local KKT-residual interpretation near regular KKT points of nonconvex problems. Experiments across optimization, allocation, and fair-ranking tasks show that RCML improves feasibility control and multiplier stability while maintaining competitive objective performance. Code is released at https://anonymous.4open.science/r/RCML-3114/.

2606.06888 2026-06-10 cs.LG (version update)

Data-Constrained Language Model Pretraining: Improved Regularization and Scaling Laws

Zhiwei Xu, Shihao Wu, Hanseul Cho, Wei Hu, Yixin Wang

Affiliations: University of Michigan; KAIST AI

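AI summary: Studies regularization and scaling for language model pretraining under data constraints; proposes masked-input regularization (MIR) to improve validation loss and the SoftQ scaling law, which more accurately fits the interaction between model size and data size under repeated data.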

Abstract

Classical scaling laws for language model pretraining balance model size against training dataset size under a fixed compute budget, assuming abundant data and a single pass over the corpus. As training compute grows faster than the supply of natural language data, pretraining is likely to enter a data-constrained, compute-rich regime where models train for multiple epochs over a finite dataset. We study data-constrained pretraining along two axes, regularization and scaling. For regularization, we study masked-input regularization (MIR), an auxiliary next-token prediction loss on randomly masked inputs. MIR tests whether the random masking central to diffusion language models can benefit autoregressive pretraining without architectural changes or inference overhead. Across 72M to 1.4B parameter models, we find that MIR added on top of strong weight decay improves validation loss over autoregressive strong-weight-decay-only models, with downstream gains at 1.4B. For scaling, we propose SoftQ, a scaling law that couples model size and data size to capture their interaction under repeated data. Classical alternatives such as the Chinchilla law use an additive form that decouples these terms, making them misspecified in the data-constrained regime. We find that SoftQ fits data-constrained experiments substantially better than these alternatives, and estimates MIR's gains as equivalent to roughly 1.3 times as much unique training data. We release our code at https://github.com/yixinw-lab/dc_pretrain.
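
The data side of masked-input regularization, as this abstract describes it, can be sketched as follows: the inputs are randomly masked while the next-token targets stay untouched, and the auxiliary loss on the masked view would be added to the usual loss on the clean view. The mask id, mask rate, seed handling, and helper names are assumptions; the loss weighting and the model are left out.

```python
import random

MASK_ID = -1  # hypothetical mask-token id, chosen only for this sketch

def mask_inputs(tokens, mask_rate, rng):
    """Replace each input token with MASK_ID with probability mask_rate."""
    return [MASK_ID if rng.random() < mask_rate else t for t in tokens]

def mir_batch(tokens, mask_rate=0.15, seed=0):
    """Build (clean_inputs, masked_inputs, targets) for next-token prediction.

    Both input views share the same targets, so the masked view only
    changes what the model sees, not what it must predict.
    """
    inputs, targets = tokens[:-1], tokens[1:]
    masked = mask_inputs(inputs, mask_rate, random.Random(seed))
    return inputs, masked, targets

clean, masked, targets = mir_batch([5, 6, 7, 8, 9], mask_rate=0.5)
print(clean, masked, targets)
```

Since masking touches only the training inputs, a scheme of this shape needs no architectural change and adds nothing at inference time, consistent with the claim in the abstract.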

2606.06758 2026-06-10 cs.CL (version update)

Diagnosing Evidence Utilization in Long-Context and Retrieval-Augmented Language Models under Matched Evidence Conditions

Haizhou Xia

Affiliations: University of Western Ontario

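AI summary: Proposes a four-condition evidence-availability protocol that uses the ONCU estimator to separate model behavior under no-evidence, full-context, retrieved-evidence, and oracle-evidence conditions, diagnosing evidence-utilization bottlenecks in long-context and retrieval-augmented language models.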

Comments: 46 pages, 37 tables, 1 figure

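AI-generated abstract (translated from Chinese)

Final-answer accuracy, retrieval recall, and citation overlap do not by themselves establish whether a long-context or retrieval-augmented language model used the evidence it was given. A model may answer from parametric memory, fail despite receiving the correct passage, or cite evidence without converting it into the requested answer. This paper proposes a matched four-condition evidence-availability protocol (no evidence, full context, retrieved evidence, and an oracle-evidence reference) for diagnosing evidence utilization under fixed examples, prompts, scoring fields, retrieval settings, and validity checks. ONCU is used as a protocol-bound estimator of the recovered oracle-reference evidence advantage and is computed only for denominator-valid groups; denominator-free answer, evidence, retrieval, and failure-audit metrics are reported separately. The empirical study evaluates five local open-source models from the Qwen, Gemma, Llama, and Mistral families on Controlled-ONCU-safe16K, HotpotQA-ONCU, and 2WikiMultiHopQA-ONCU, yielding 18,000 ONCU-compatible predictions. The main finding is a task-dependent split in bottlenecks: the controlled synthetic setting mainly exposes full-context utilization failures, whereas the tested realistic multi-hop settings mainly expose retrieval-chain coverage failures in the denominator-free answer and evidence metrics, with ONCU supporting the same direction on oracle-improving groups. The contribution is a diagnostic protocol that separates no-evidence answerability, oracle-evidence recoverability, full-context utilization, and retrieval-conditioned utilization, rather than a single-score leaderboard for long-context or retrieval-augmented systems.

Abstract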

Final-answer accuracy, retrieval recall, and citation overlap do not reveal how much answer advantage a long-context or retrieval-augmented language model actually recovers from supplied evidence. A model may answer from parametric priors, fail to use evidence that is present, or cite relevant text without converting it into the final answer. This paper introduces a four-condition diagnostic protocol for evidence-utilization evaluation under matched examples, models, prompts, and scoring rules. The protocol compares no-evidence, full-context, retrieved-evidence, and oracle-evidence reference conditions, and uses Oracle-Reference Normalized Context Utilization (ONCU) as a denominator-valid estimate of recovered oracle-reference evidence advantage. The empirical study evaluates five local open-weight models from the Qwen, Gemma, Llama, and Mistral families over Controlled-ONCU-safe16K, HotpotQA-ONCU, and 2WikiMultiHopQA-ONCU, comprising 18,000 ONCU-compatible predictions. Results show a task-dependent diagnostic pattern: controlled synthetic settings expose reduced recovery when the same evidence is embedded in long input rather than supplied compactly, while realistic multi-hop reconstructions show that full-context inputs outperform the tested retrieved inputs in denominator-free answer and evidence metrics, with ONCU supporting the same direction on oracle-improving groups. Sensitivity audits with stronger retrieval settings narrow some gaps but do not overturn the scoped interpretation. The main contribution is therefore not a single utilization ratio, but a matched diagnostic protocol that separates no-evidence answerability, oracle-evidence recoverability, full-context recovery, retrieval-conditioned recovery, denominator validity, and companion answer/evidence diagnostics.
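
The abstract does not give the ONCU formula. One plausible reading of "normalized" and "denominator-valid" is a ratio of the advantage a condition recovers over the no-evidence score to the advantage the oracle reference achieves, defined only when the oracle actually improves on no evidence. The sketch below encodes that guess; the function name, signature, and `eps` threshold are assumptions and should not be taken as the paper's definition.

```python
def oncu(score_none, score_condition, score_oracle, eps=1e-9):
    """Assumed form: share of the oracle advantage recovered by a condition.

    Returns None when the oracle does not beat the no-evidence score,
    because the ratio's denominator is then not meaningful.
    """
    denom = score_oracle - score_none
    if denom <= eps:
        return None
    return (score_condition - score_none) / denom

# Hypothetical accuracies for one group of examples.
print(oncu(0.2, 0.5, 0.8))   # full-context condition
print(oncu(0.5, 0.6, 0.5))   # oracle gives no advantage, so no valid ratio
```

Returning `None` rather than a number keeps denominator-invalid groups out of any average, which mirrors the abstract's point that such groups need separate, denominator-free diagnostics.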

2606.06744 2026-06-10 cs.LG cs.GT cs.MA econ.TH (version update)

Learn to Match: Two-Sided Matching with Temporally Extended Feedback

Haijing Zong, Yancheng Liang, Boyang Zhou, Natasha Jaques

Affiliations: University of Washington

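AI summary: Proposes a two-sided matching framework with temporally extended feedback, modeled as a partially observable Markov game, and builds the Learn2Match benchmark on multi-agent reinforcement learning; experiments show that independent PPO outperforms bandit baselines but still incurs losses from information frictions.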

RAG over Thinking Traces Can Improve Reasoning Tasks

Negar Arabzadeh, Wenjie Ma, Sewon Min, Matei Zaharia

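AI summary: Proposes retrieving thinking traces rather than documents and introduces T3, a method that converts them into structured representations; this yields substantial gains on reasoning tasks, outperforming both standard RAG and non-RAG baselines.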

Abstract

Retrieval-augmented generation (RAG) has proven effective for knowledge-intensive tasks, but is widely believed to offer limited benefit for reasoning-intensive problems such as math and code generation. We challenge this assumption by showing that the limitation lies not in RAG itself, but in the choice of corpus. Instead of retrieving documents, we propose retrieving thinking traces, i.e., intermediate thinking trajectories generated during problem-solving attempts. We show that thinking traces are already a strong retrieval source, and further introduce T3, an offline method that transforms them into structured, retrieval-friendly representations to improve usability. Using these traces as a corpus, a simple retrieve-then-generate pipeline consistently improves reasoning performance across strong models and benchmarks such as AIME 2025-2026, LiveCodeBench, and GPQA-Diamond, outperforming both non-RAG baselines and retrieval over standard web corpora. For instance, on AIME 2025-2026, RAG with traces generated by Gemini-2-thinking achieves relative gains of +56.3%, +8.6%, and +7.6% for Gemini-2.5-Flash, GPT-OSS-120B, and GPT-5, respectively, even though these are more recent models. Overall, our results suggest that thinking traces are an effective retrieval corpus for reasoning tasks, and transforming them into structured, compact, or diagnostic representations unlocks even stronger gains. Code available at https://github.com/Narabzad/t3.
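
The retrieve-then-generate pipeline in this abstract can be sketched with a toy retriever over stored traces. Jaccard token overlap is used here purely as a stand-in scoring function, and `retrieve`, `build_prompt`, the prompt wording, and the example traces are all hypothetical; the paper's retriever and its T3 transformation are not reproduced.

```python
def tokenize(text):
    """Lowercase whitespace tokenization into a set of tokens."""
    return set(text.lower().split())

def retrieve(question, traces, k=1):
    """Rank stored traces by Jaccard token overlap with the question."""
    q = tokenize(question)

    def score(trace):
        t = tokenize(trace)
        union = q | t
        return len(q & t) / len(union) if union else 0.0

    return sorted(traces, key=score, reverse=True)[:k]

def build_prompt(question, traces, k=1):
    """Prepend the top-k retrieved traces to the question."""
    context = "\n".join(retrieve(question, traces, k))
    return f"Related reasoning traces:\n{context}\n\nQuestion: {question}"

traces = [
    "to sum an arithmetic series use n times first plus last over two",
    "binary search halves the interval each step",
]
question = "how do I sum an arithmetic series"
print(build_prompt(question, traces))
```

The resulting prompt would then be passed to whichever generator model is under evaluation; only the corpus being searched differs from a document-based RAG setup.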

2606.09677 2026-06-10 eess.AS cs.AI (version update)

MeCo: One-Step MeanFlow-based Corrector for Multi-Channel Speech Separation

Dohwan Kim, Jung-Woo Choi

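AI summary: Proposes MeCo, a MeanFlow-based one-step generative corrector that uses Data-Space Optimization to jointly train a generative objective and signal fidelity, improving both signal fidelity and human listening quality with very low computational overhead.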

Comments: 5 pages, accepted to Interspeech 2026

Abstract

While discriminative models for multi-channel speech separation excel in reference-based metrics, they often exhibit suboptimal human listening quality. To address this, we propose a novel MeanFlow-based one-step generative corrector (MeCo). MeCo learns a conditional average velocity field to map discriminative estimates directly onto the clean speech manifold in a single step. To maximize one-step generation performance, we introduce Data-Space Optimization (DSO). DSO integrates an $\mathbf{x}_r$-loss, which penalizes prediction errors on longer displacement intervals to serve as a generative objective for human listening quality, with an Endpoint SI-SDR loss that directly optimizes terminal signal fidelity. Experiments demonstrate that MeCo achieves state-of-the-art (SOTA) performance with minimal computational overhead, simultaneously achieving superior signal fidelity and human listening quality in both in-domain and out-of-domain scenarios.
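
The SI-SDR metric behind the endpoint loss in this abstract has a standard definition: project the estimate onto the target, then compare the energy of that projection with the energy of the remainder. The sketch below implements the standard metric only; treating its negative as the training loss is a common convention and an assumption here, and the `eps` guard and sample signals are illustrative.

```python
import math

def si_sdr(estimate, target, eps=1e-12):
    """Scale-invariant SDR in dB between two equal-length signals."""
    dot = sum(e * t for e, t in zip(estimate, target))
    target_energy = sum(t * t for t in target)
    # Optimal scaling of the target that best explains the estimate.
    scale = dot / (target_energy + eps)
    projection = [scale * t for t in target]
    noise = [e - p for e, p in zip(estimate, projection)]
    signal_power = sum(p * p for p in projection)
    noise_power = sum(n * n for n in noise)
    return 10.0 * math.log10((signal_power + eps) / (noise_power + eps))

target = [1.0, -1.0, 1.0, -1.0]
estimate = [1.1, -0.9, 1.0, -1.2]
value = si_sdr(estimate, target)
rescaled = si_sdr([2.0 * e for e in estimate], target)
print(round(value, 2), round(rescaled, 2))
```

Rescaling the estimate leaves the value essentially unchanged, which is the scale invariance that distinguishes SI-SDR from plain SDR.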

2606.09543 2026-06-10 cs.CL (version update)

From Genes to Tokens: a GWAS-inspired Approach for Interpretable Stylometric Analysis

Dmitry Pronin, Evgeny Kazartsev

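AI summary: Inspired by genome-wide association studies (GWAS), proposes a stylometric method that detects author-specific lexical markers using logistic regression with multiple-comparison correction, validated on English, German, and Russian corpora.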

2606.09141 2026-06-10 eess.AS cs.SD (version update)

FlashTTS: Fast Streaming TTS with MTP Acceleration and X-pred Mean Flow Distillation

Hanke Xie, Xiaming Ren, Dake Guo, Ruonan You, Wenhao Li, Jingbin Hu, Guobin Ma, Huakang Chen, Kejie Xu, Rui Huang, Weiguo Tan, Xianrong Wang, Lei Xie

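AI summary: Proposes the FlashTTS framework, which combines a lagged multi-track architecture, parallel multi-token prediction, and an X-pred mean flow matching decoder for low-latency streaming TTS, reducing first-packet latency to 325 ms while preserving zero-shot voice cloning and cross-lingual intelligibility.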

Comments: Accepted to Interspeech 2026

Abstract

Recent progress in speech dialogue systems requires Text-to-Speech (TTS) models to be faster and more responsive. Modern speech dialogue systems impose two primary requirements on TTS models: low latency and support for streaming inputs and outputs. However, most existing single-codebook LLM-based TTS methods rely on multi-stage pipelines that lack native streaming capabilities. These systems typically suffer from high end-to-end latency due to slow autoregressive prediction and multi-step flow matching. To address these limitations, we propose FlashTTS, an open-source and low-latency streaming TTS framework. FlashTTS introduces a lagged multi-track architecture that natively processes streaming text and speech inputs, thereby eliminating the need for sentence-level buffering. To accelerate acoustic generation, we integrate parallel Multi-Token Prediction (MTP) with an X-pred mean flow matching decoder. This configuration achieves high-fidelity token-to-mel generation in exactly two function evaluations (2-NFE). By jointly optimizing input processing and decoding efficiency, FlashTTS offers a practical foundation for real-time speech dialogue systems. Experiments show that FlashTTS substantially reduces First-Packet Latency to 325ms compared to robust streaming baselines, all while preserving strong zero-shot voice cloning and cross-lingual intelligibility. Speech samples are available. The model code and checkpoints will be released as open source.
