arXivDaily: arXiv daily academic digest, updated Monday to Friday
2606.09809 2026-06-10 cs.AI Version update

Evaluation Cards: An Interpretive Layer for AI Evaluation Reporting

Avijit Ghosh, Anka Reuel, Jenny Chim, Wm. Matthew Kennedy, Srishti Yadav, Jennifer Mickel, Yanan Long, Andrew Tran, Anastassia Kornilova, Damian Stachura, Kevin Klyman, Felix Friedrich, Jeba Sania, Jan Batzner, Anoop Mishra, Eliya Habba, Yixiong Hao, Nathan Heath, Shalaleh Rismani, Usman Gohar, Andrea Loehr, David Manheim, Ruchira Dhar, Sree Harsha Nelaturu, Aarush Sinha, Leshem Choshen, Drishti Sharma, Ishan Khire, Amit Saha, Subramanyam Sahoo, Michael Hardy, Michael Alexander Riegler, Kabir Manghnani, Michelle Lin, Yanan Jiang, Yilin Huang, Asaf Yehudai, Jessica Ji, Aris Hofmann, Mubashara Akhtar, Max Lamparth, Nuno Moniz, Yacine Jernite, Stella Biderman, Zeerak Talat, Sanmi Koyejo, Mykel Kochenderfer, Irene Solaiman

Affiliations: Hugging Face; Stanford University; Queen Mary University of London; University of Copenhagen; Trustible; EleutherAI; TU Darmstadt; Weizenbaum Institute & Technical University of Munich; Harvard University; The Hebrew University of Jerusalem; Iowa State University; IBM Research; University of Chicago; Independent; Berkeley AI Safety Institute (BASIS); Simula; University of Edinburgh; ETH Zurich & ETH AI Center; Oxford Internet Institute; Amherst College; University of Nebraska; Syntony Research; McGill University; Evals Consensus; Israel Institute of Technology; IOL.Learn & Zuse Institute Berlin; Georgia Institute of Technology; Quebec AI Institute, Université de Montréal; University of Notre Dame; Georgetown University; DHBW Stuttgart; Massachusetts Institute of Technology

AI summary: To address inconsistent AI evaluation reporting, the paper proposes EvalCards as a unified record layer. A structured schema, four interpretive signals, and a monitoring tool cover 5,816 models and 635 benchmarks and reveal systematic gaps in reporting practice.

Abstract

AI evaluation results are produced at scale but reported inconsistently across leaderboards, model cards, benchmark papers, and company blogs. The cost is interpretive: readers cannot reliably compare results across sources, identify what a report omits, or trace an aggregate claim to its underlying evidence. Recent efforts address isolated components but leave three gaps: they cover only narrow slices of the evaluation lifecycle and do not compose into a single interpretable record; they specify static representations that do not differentiate the questions different stakeholders bring to the same evidence; and they remain proposals on paper, lacking the extraction infrastructure required for adoption at scale. We present EvalCards, an operational reporting layer that composes benchmark metadata, evaluation run data, and model metadata into a unified record. We (1) derive a reporting schema from a structured review of 52 papers and 10 stakeholder interviews, (2) implement four interpretive signals (reproducibility, documentation completeness, provenance and risk, and score comparability), rendered through reader modes calibrated to research and non-research audiences, and (3) deploy a monitoring tool that applies EvalCards across 5,816 models, 635 benchmarks, and 101,843 results, surfacing systematic gaps in current reporting practice.
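To make the idea of composing benchmark metadata, evaluation run data, and model metadata into one record more concrete, here is a minimal Python sketch. Every class name, field name, and the completeness rule are hypothetical and are not taken from the EvalCards schema; the documentation-completeness signal is approximated here simply as the share of fields that are filled in.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class BenchmarkMeta:              # hypothetical fields, not the EvalCards schema
    name: str
    version: Optional[str] = None
    license: Optional[str] = None


@dataclass
class ModelMeta:                  # hypothetical fields
    name: str
    developer: Optional[str] = None


@dataclass
class EvalRun:                    # hypothetical fields
    metric: str
    score: float
    prompt_template: Optional[str] = None
    seed: Optional[int] = None


@dataclass
class EvalRecord:
    """One unified record composed of the three metadata sources."""
    benchmark: BenchmarkMeta
    model: ModelMeta
    run: EvalRun

    def completeness(self) -> float:
        """Assumed signal: fraction of fields across all parts that are not None."""
        parts = (self.benchmark, self.model, self.run)
        values = [getattr(p, f.name) for p in parts for f in fields(p)]
        return sum(v is not None for v in values) / len(values)


record = EvalRecord(
    BenchmarkMeta("toy-bench", version="1.0"),
    ModelMeta("toy-model"),
    EvalRun(metric="accuracy", score=0.5, seed=0),
)
print(f"documentation completeness: {record.completeness():.2f}")
```

A reader-facing tool could sort or flag records by such a signal, although the paper's actual signals are presumably richer than a field count.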

2606.09681 2026-06-10 cs.CV Version update

GenEyePose: Patient-Free, Knowledge-Based Saccadic Eye Movement Modeling for Digital Neurophysiologic Biomarker Development

Tianyu Lin, Jooyoung Ryu, Puvada Sreevarsha, Rahul Srinivasaragavan, Riya Satavlekar, Susan Kim, Nidhi Soley, Yujie Yan, Ishan Vatsaraj, Carl Harris, Aimon Rahman, Vishal Patel, Joseph Greenstein, Casey Taylor, Kemar E. Green

Affiliations: Whiting School of Engineering, Johns Hopkins University; Department of Neurology, Johns Hopkins Medicine

AI summary: Proposes the first fully synthetic, patient-free, multimodal eye movement generation pipeline for generalizable saccade analysis. A deep learning classifier trained on the synthetic data distinguishes normal from abnormal saccadic accuracy on real clinical data with an AUROC of 0.76.

Abstract

Eye movements, including saccades, are widely regarded as highly sensitive and objective biomarkers of neurophysiologic states. Detecting saccadic signatures in neurologic diseases offers a rapid, portable alternative to brain imaging, avoiding access and cost barriers. Currently, there are no robust AI-enabled video-oculographic solutions (e.g., digital biomarkers) for screening, triaging, or localizing brain abnormalities due to privacy issues and scarce datasets. In this work, we propose the first fully synthetic, patient-free, multimodal eye movement generation pipeline for generalizable saccade analysis. Using this synthetic dataset, we trained a deep learning classifier to distinguish between normal and abnormal (hypometria and hypermetria) saccadic accuracies and evaluated its performance on real-world clinical data. The model achieved an AUROC of 0.76 and a sensitivity of 0.71, showing that the synthetic data has strong potential to generalize for clinical applications, including as a screening tool in at-home and emergency room settings or a tool for precise neuroanatomic localization.

2606.09601 2026-06-10 cs.LG Version update

Assessing Sample Quality in Conditional Generation under Compositional Shift

Berker Demirel, Valentino Maiorca, Marco Fumero, Theofanis Karaletsos, Francesco Locatello

Affiliations: Institute of Science and Technology Austria (ISTA); Pyramidal Inc; Achira Inc

AI summary: To address the difficulty of evaluating conditional generation under compositional shift, proposes a post-hoc trust score that relies only on the training distribution and combines global realism with attribute-wise faithfulness, enabling filtering, ranking, and abstention of samples and improving generation quality.

Abstract

Conditional generators provide a natural tool for controllable generation, including settings where the desired condition is a new composition of observed attributes or experimental factors. In many applications, especially in scientific domains, such models are attractive to explore conditions for which real samples are rare, expensive, or not yet observed. However, this creates a circularity for evaluation: standard conditional quality metrics require a reference target distribution, but in the extrapolative regime that distribution is unavailable by definition. We address this problem with a post-hoc, per-sample trust score for assessing conditional samples using only the training distribution. The score combines two estimable quantities: global realism, measuring compatibility with the real data manifold, and attribute-wise faithfulness, measuring whether a sample is closer to the requested attributes than to plausible alternatives. We show that the score can recover meaningful comparisons across extrapolated generations, under a mild coverage condition on the observed attributes. These comparisons enable effective filtering, ranking, and abstention of generations and can be used directly on off-the-shelf pretrained models. In biological imaging, selected samples preserve real morphological structure better and improve downstream predictive performance, while similar gains are observed on controlled vision benchmarks. Finally, we show how the score can be applied during generation, enabling abstention before full decoding. Code is available at https://github.com/berkerdemirel/faithful-cond-gen.
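The two ingredients of the trust score can be illustrated with a toy Python sketch. This is not the authors' estimator: the realism term (negative mean distance to the k nearest training points), the faithfulness term (a centroid-distance margin for a single attribute), and the additive combination with a `weight` parameter are all assumptions chosen for illustration.

```python
import numpy as np


def realism(x, train, k=3):
    """Assumed proxy: negative mean distance to the k nearest training points."""
    d = np.linalg.norm(train - x, axis=1)
    return -np.sort(d)[:k].mean()


def faithfulness(x, train, labels, requested):
    """Assumed proxy: distance to the nearest alternative-attribute centroid
    minus distance to the requested-attribute centroid (positive is good)."""
    centroids = {int(a): train[labels == a].mean(axis=0) for a in np.unique(labels)}
    d_req = np.linalg.norm(x - centroids[requested])
    d_alt = min(np.linalg.norm(x - c) for a, c in centroids.items() if a != requested)
    return d_alt - d_req


def trust_score(x, train, labels, requested, weight=1.0):
    """Hypothetical combination of the two terms; higher means more trusted."""
    return realism(x, train) + weight * faithfulness(x, train, labels, requested)


train = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
labels = np.array([0, 0, 0, 1, 1, 1])
good = np.array([0.2, 0.2])   # requested attribute 0 and looks like attribute 0
bad = np.array([5.0, 5.2])    # requested attribute 0 but looks like attribute 1

ranked = sorted([("good", good), ("bad", bad)],
                key=lambda s: trust_score(s[1], train, labels, 0), reverse=True)
print([name for name, _ in ranked])
```

Ranking or thresholding on such a score is what enables filtering and abstention; both samples here are about equally close to the training data, so only the faithfulness term separates them.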

2606.09570 2026-06-10 cs.CL cs.HC Version update

UXBench: Benchmarking User Experience in AI Assistants

Mengze Hong, Xia Zeng, Zeyang Lei, Sheng Wang, Chen Jason Zhang, Di Jiang, Taiming Fu, Jinfeng Huang, Mengqiao Liu, Qinghe Chang, Haosheng Zou, Qiongyi Zhou, Sijun He, Simonjmdeng, Haojing Huang, Zijian Li, Lucas Mu Li, Fubao Zhang, Mona Zhou, Wei Ma, Chenxuan Ma, Yuanmeng Zhang, Jian Song, Minlong Peng, Di Liang, Davey Chen

Affiliations: Hong Kong Polytechnic University; Tencent

AI summary: Proposes UXBench, the first user-centric benchmark grounded in real user feedback, with three tasks and 7,400 test instances. An evaluation of 26 frontier language models finds that user feedback prediction is a learnable capability and documents systematic biases of LLM-as-a-judge protocols.

Abstract

As AI assistants serve millions of users daily, evaluating user experience (UX) beyond general model capability has become increasingly important. We present UXBench, the first user-centric benchmark grounded in real user feedback signals for evaluating preference alignment and dialogue generation. The benchmark consists of three interconnected tasks, UX Judge, UX Eval, and UX Recovery, with 7,400 test instances extracted from over 70K interaction logs of a mainstream Chinese AI assistant. The dataset closely reflects real user distributions, covering 8 scenarios, 83 domains, and diverse failure patterns that pose severe challenges. Extensive experiments on 26 frontier language models provide novel insights into how well models perceive user experience and how improvements in model capability contribute to better dialogue engagement. Through comprehensive analysis of model behavior and performance gaps, we show that user feedback prediction is a learnable capability, where a reward model trained from in-the-wild feedback signals can achieve well-calibrated accuracy. We further document the systematic biases of LLM-as-a-judge evaluation protocols and compare typical response strategies that directly affect user experience. UXBench establishes a new evaluation landscape and calls for greater attention to tailored UX optimization, contributing to a user-centric scaling law that shapes the success of AI assistants.

2606.09475 2026-06-10 cs.AI cs.LG Version update

Emergent alignment and the projectability of ethical personas

Guillermo Del Pinal, Youngchan Lee, Calum McNamara, Alejandro Perez Carballo

Affiliations: University of Massachusetts Amherst; Indiana University Bloomington

AI summary: Studies how finetuning large language models on narrow tasks induces broadly aligned behavior. Using the Constitutional AI approach to give models ethical personas, the authors find that narrow alignment projects to untrained categories and argue that alignment strategies should be evaluated on projectability.

Abstract

Work on 'emergent misalignment' shows that finetuning LLMs on narrow tasks can induce broadly misaligned behavior. This supports the 'persona selection' (PSM) hypothesis: during pre-training, LLMs learn to simulate different characters and perspectives, which can be elicited and refined during post-training. This paper investigates the converse phenomenon, 'emergent alignment', and uses it to support and refine the PSM and motivate a novel desideratum for alignment. We finetune a helpful-only model on broad and narrow safety tasks. To create SFT samples, we follow the 'Constitutional AI' (CAI) approach and use four constitutions which encode reasonable alignment strategies: deontology, consequentialism, virtue ethics, and aligning AIs as subordinate to human authority. For each of those models, we show that finetuning on two narrow safety sub-categories reliably induces emergent alignment over a representative set of general safety categories, and on safety sub-categories that we directly filtered out of the data sets used for narrow alignment. To test the PSM using a more fine-grained evaluation, we used a multidimensional 'ethical persona' diagnostic. For each constitutionally finetuned (broad/narrow) model, we evaluate how well its behavior matches its expected signature profile. Our results show that our CAI models acquire their expected 'ethical persona'; for example, the model narrowly finetuned on SFT samples created using the consequentialist constitution agrees significantly more with utilitarian than deontological beliefs. Yet our coarse and fine-grained evaluations show that there are significant differences across our (broad/narrow) finetuned CAI models in how well they project. We conclude that alignment strategies should be evaluated, not just on their (in-distribution) general safety performance, but also specifically on their degree of projectability.

2606.09466 2026-06-10 cs.CL Version update

DECSELFMASK: Leveraging Unlabeled Text via Self-Relevance-Guided Masking for Decoder-Only Classification

Pietro Ferrazzi, Matteo Merler, Giovanni Bonetta, Alberto Lavelli, Bernardo Magnini

Affiliations: Fondazione Bruno Kessler, Trento, Italy; University of Padova, Italy

AI summary: Proposes DecSelfMask, which uses a relevance-attribution-guided masking strategy to create self-supervised training examples from unlabeled data and trains the model to reconstruct the masked portions via next-token prediction. This improves decoder-only models on classification, with a 19.9-point Macro F1 gain over standard supervised fine-tuning across 136 clinical tasks.

Abstract

Classification tasks require annotated data, which can often be expensive, time-consuming, or even infeasible to collect. This is the case in the medical domain, where large datasets often have few annotated examples. To address this, we propose DecSelfMask (Decoder Self-learning by Masking), an approach to enhance the performance of decoder-only models on classification tasks. We build on common self-learning approaches, which leverage a model to create training examples from unlabeled data, and propose a novel relevance-guided masking strategy. We use relevance attribution methods to determine which portions of unannotated texts are relevant for a task. We then create self-supervised training examples by masking out those portions and training the model to reconstruct them via next-token prediction. We hypothesize that those examples convey knowledge about the structure and semantics of unannotated data that can be useful for downstream performance. We test our approach on 136 tasks from a collection of 1.9M clinical notes from an Italian hospital. We quantify DecSelfMask's impact on downstream tasks on 5 models of different scales and families, including a probing analysis. Experiments show consistent gains, outperforming standard supervised fine-tuning approaches (+19.9 points in Macro F1), synthetic label generation (+12.5), and continual pretraining (+6.3), as well as common baselines.
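The masking step can be sketched in a few lines of Python. This is only an illustration of relevance-guided masking in general: the function name, the `<mask>` placeholder, the top-k selection rule, and the relevance scores are all made up here, and the abstract does not specify how DecSelfMask formats its prompts or targets.

```python
def make_masked_example(tokens, relevance, k=2, mask_token="<mask>"):
    """Mask the k tokens with the highest relevance scores.

    Returns (prompt, target): the prompt contains the masked text, and the
    target is the text a decoder would be trained to generate token by token.
    """
    top = sorted(range(len(tokens)), key=lambda i: relevance[i], reverse=True)[:k]
    masked_positions = set(top)
    prompt = [mask_token if i in masked_positions else t
              for i, t in enumerate(tokens)]
    target = [tokens[i] for i in sorted(masked_positions)]
    return " ".join(prompt), " ".join(target)


tokens = ["patient", "reports", "severe", "chest", "pain"]
relevance = [0.1, 0.0, 0.4, 0.9, 0.8]   # made-up attribution scores
prompt, target = make_masked_example(tokens, relevance)
print(prompt)
print(target)
```

Because the target is produced left to right, an example like this fits the ordinary next-token-prediction loss of a decoder-only model without any architectural change.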

2606.09421 2026-06-10 cs.CL Version update

What Should a Skill Remember? Quality–Cost Trade-offs in Cost-Aware Skill Rewriting for Language Model Agents

Qinghua Xing, Yinda Chen, Yaping Jin, Zhenhe Wu, Bohan Lin, Hang Zhou, Xinghao Chen, Hanting Chen, Zhiwei Xiong

Affiliations: University of Science and Technology of China; Huawei Technologies; Tianjin University

AI summary: Studies the quality–cost trade-off of skill rewriting for language model agents and proposes information-preservation strategies that reduce cost by 7% to 14.7% on SkillsBench while preserving verifier quality.

Abstract

Large language model agents increasingly rely on skills: reusable procedural documents encoding workflows, tool use, implementation patterns, validation checks, and domain rules. Skill rewriting is often treated as prompt compression, but shorter skills can make agents more expensive by removing sparse operational anchors that prevent exploration, debugging, and recovery. We study skill rewriting through this economic lens. Our controlled framework profiles skill structure, rewrites skills using information-preservation strategies, and evaluates the rewrites under fixed task instructions, environments, and verifiers. Experiments on SkillsBench reveal distinct quality–cost trade-offs across strategies: API/code anchoring, workflow guarding, and rule/formula anchoring benefit different task families, with no universally dominant template. In the main held-out evaluation, the learned policy reduces total cost by 7.0% and downstream agent-token cost by 6.0%; in frozen cross-model transfer, the corresponding reductions average 14.7% and 13.7%, while verifier quality is preserved. These results position skill design as cost-aware operational knowledge engineering rather than prompt compression. Resources: https://github.com/1Reminding/Skill_EE.

2606.09377 2026-06-10 cs.LG cs.AI Version update

Scaling Neural Network Verification with Tensor Parallelism and Fully Sharded Data Parallelism

Sergei Vorobyov, Eugene Ilyushin

Affiliations: Lomonosov Moscow State University; Central University

AI summary: To address the GPU memory bottleneck in formal neural network verification, adapts tensor parallelism (TP) and fully sharded data parallelism (FSDP) to the auto_LiRPA / α,β-CROWN framework. TP roughly halves peak memory but loosens the bounds; FSDP cuts baseline memory by 80–90% with bounds bitwise identical to the single-GPU baseline and supports complete verification and convolutional layers.

Abstract

Formal neural network verification (proving that a network satisfies safety properties for all inputs in a specified domain) is bounded in practice by GPU memory: standard implementations of bound-propagation algorithms (IBP, CROWN, α-CROWN) require weight and relaxation-coefficient matrices to reside entirely on one accelerator. We adapt two parallelism techniques originally developed for large-scale model training to the auto_LiRPA / α,β-CROWN verification framework. Tensor Parallelism (TP) shards both weight and A-matrices across GPUs, achieving an approximately 2× peak-memory reduction at P = 2; soundness is confirmed on VNN-COMP 2022 MNIST-FC benchmarks, though bound tightness degrades with the number of sharded zones due to forced IBP substitution for intermediate bounds inside sharded zones. Fully Sharded Data Parallelism (FSDP) shards only weight matrices with a per-layer AllGather, producing bounds that are bitwise identical to the single-GPU baseline: baseline memory drops by 80–90% and peak memory by 34–39% on wide MLPs. FSDP integrates cleanly with complete verification (β-CROWN + Branch-and-Bound) and with convolutional layers (BoundConv); a complete unsat result is obtained for CIFAR-100 ResNet-large (VNN-COMP 2024) under FSDP. Across all experiments the memory bottleneck in α-CROWN+BaB mode proves to be per-neuron alpha tensors, not weight matrices, pointing to the key direction for future work.
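Why weight-only sharding with a per-layer gather leaves the bounds unchanged can be seen in a small single-process simulation. The sketch below is not auto_LiRPA code and uses no distributed runtime: `all_gather` is a stand-in implemented with `np.concatenate`, the layer is a single linear map, and the bound method is plain interval bound propagation (IBP). Integer-valued weights are an assumption made only to keep the toy arithmetic exact.

```python
import numpy as np


def ibp_linear(W, b, lower, upper):
    """Interval bound propagation through y = W x + b for x in [lower, upper]."""
    center = (lower + upper) / 2.0
    radius = (upper - lower) / 2.0
    y_center = W @ center + b
    y_radius = np.abs(W) @ radius
    return y_center - y_radius, y_center + y_radius


def all_gather(shards):
    """Stand-in for a collective AllGather: rebuild the full weight matrix."""
    return np.concatenate(shards, axis=0)


rng = np.random.default_rng(0)
# Integer-valued weights keep the toy arithmetic exact.
W = rng.integers(-3, 4, size=(6, 4)).astype(np.float64)
b = rng.integers(-3, 4, size=6).astype(np.float64)
lower, upper = -np.ones(4), np.ones(4)

shards = np.array_split(W, 2, axis=0)   # each simulated device stores one row shard
base_lo, base_hi = ibp_linear(W, b, lower, upper)
fsdp_lo, fsdp_hi = ibp_linear(all_gather(shards), b, lower, upper)

print(np.array_equal(base_lo, fsdp_lo) and np.array_equal(base_hi, fsdp_hi))
```

Between layers each device only needs to hold its own shard, which is where the baseline-memory saving comes from; the gathered matrix is the same matrix, so the bound computation itself is unchanged. Tensor parallelism differs because the bound computation itself is split across devices.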

2606.09316 2026-06-10 cs.AI Version update

Anything2Skill: Compiling External Knowledge into Reusable Skills for Agents

Qianjun Pan, Yutao Yang, Junsong Li, Jie Zhou, Kai Chen, Xin Li, Qin Chen, Liang He

Affiliations: East China Normal University; Shanghai AI Laboratory

AI summary: Proposes the Anything2Skill framework, which compiles heterogeneous external knowledge into reusable, retrievable, and executable skills; combined with RAG, it substantially improves agent task success rates.

Abstract

Retrieval-augmented generation (RAG) enables agents to access external knowledge at inference time, but it primarily retrieves fragmented declarative evidence, leaving agents to repeatedly infer task procedures from passages, manuals, examples, logs, or trajectories. This raises a fundamental question: can skills extracted from external knowledge bases be installed into an agent, enabling it to rapidly approximate domain expertise? In this paper, we propose Anything2Skill, a taxonomy-guided framework that compiles heterogeneous external knowledge into reusable, retrievable, and executable skills for agents. Given a corpus of knowledge records, Anything2Skill first decomposes each record into evidence windows and performs plan-and-expand skill extraction under a skill-tree prior. The extracted candidates are then converted into structured skill contracts that specify invocation conditions, contraindications, action moves, workflow steps, constraints, output specifications, supporting evidence, and confidence scores. To construct a deployable procedural memory, Anything2Skill manages the extracted skills in a persistent SkillBank through taxonomy-aware compilation, registry-level reconciliation, lifecycle tracking, versioned updates, and visible skill-tree projection. At inference time, agents retrieve both task-specific passages from the original knowledge base and relevant procedural skills from the SkillBank, allowing RAG to provide declarative evidence while compiled skills provide reusable procedural guidance. Experiments on qsv and GitHub-CLI show that Anything2Skill combined with RAG achieves 98.85% and 94.10% success rates, respectively, substantially outperforming RAG-only agents. These results suggest that compiling latent procedural knowledge into explicit skills is an effective way to extend retrieval-augmented agents from knowledge access toward capability reuse.

2606.09203 2026-06-10 cs.RO Version update

Deterministic Execution of ROS 2 Applications via Lingua Franca

Harun Teper, Shaokai Lin, Shulu Li, Edward A. Lee, Jian-Jia Chen

Affiliations: TU Dortmund University; University of California, Berkeley; RWTH Aachen University

AI summary: Proposes a framework that converts unmodified ROS 2 applications into Lingua Franca programs and uses logical time for deterministic execution, addressing the nondeterminism of callback execution order and message interleaving in ROS 2.

Abstract

The Robot Operating System 2 (ROS 2) is a widely used middleware for robotic systems, characterized by a publish-subscribe (pub-sub) communication mechanism in which computation is structured as callbacks dispatched by ROS 2 executors. Despite its popularity, the pub-sub pattern in ROS 2 is inherently nondeterministic: the order in which these callbacks run is nondeterministic even within a single executor, and distributed deployments add further nondeterminism from the interleaving of messages across nodes and from network latency. Such nondeterminism often leads to concurrency issues and makes it virtually impossible to analyze safety and provide guarantees. We present a framework that converts an unmodified ROS 2 application and runs it under Lingua Franca (LF), a coordination language for deterministic execution using logical time, so that the same input always produces the same execution order. We first describe which ROS 2 features can be executed deterministically under logical time. These features make it possible to build an automatic conversion framework that extracts information from a ROS 2 application and converts it directly into an LF program. The rich features of LF, such as logical-time delays, federated execution across processes, and fault handling, can then be applied to execute the ROS 2 application in a deterministic and timing-predictable manner without changing the ROS 2 code. We evaluate the framework on a synthetic example and on the Autoware reference system. We show that in default ROS 2 the order in which callbacks are executed differs across executions, and end-to-end latencies vary as well. In contrast, our LF-controlled ROS 2 system produces a deterministic execution order and consistent end-to-end latencies.
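The core idea of logical time, that execution order depends on tags attached to events rather than on physical arrival order, can be sketched in plain Python. This is not the Lingua Franca runtime or the ROS 2 API: the event tuples, the priority field, and the callback names are hypothetical stand-ins.

```python
import heapq


def run_in_logical_order(events):
    """Process events by (logical_time, priority), ignoring arrival order.

    events: iterable of (logical_time, priority, callback_name) tuples.
    Returns the names in the order the callbacks would be invoked.
    """
    queue = list(events)
    heapq.heapify(queue)
    order = []
    while queue:
        _, _, name = heapq.heappop(queue)
        order.append(name)
    return order


# The same three events, delivered in two different physical orders.
arrival_a = [(10, 1, "planner"), (0, 0, "lidar"), (10, 0, "fusion")]
arrival_b = [(10, 0, "fusion"), (10, 1, "planner"), (0, 0, "lidar")]

print(run_in_logical_order(arrival_a))
print(run_in_logical_order(arrival_b))
```

Both calls yield the same sequence, which is the property the abstract describes: identical inputs lead to an identical execution order even when messages interleave differently on the wire.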

2606.09079 2026-06-10 cs.LG cs.AI Version update

FlashMemory-DeepSeek-V4: Lightning Index Ultra-Long Context via Lookahead Sparse Attention

Yan Wang, Qifan Zhang, Jiachen Yu, Tian Liang, Dongyang Ma, Xiang Hu, Zibo Lin, Chunyang Li, Zhichao Wang, Miao Peng, Nuo Chen, Jia Li, Yujiu Yang, Haitao Mi, Dong Yu

Affiliations: Independent Researchers; Tencent; The Hong Kong University of Science and Technology (Guangzhou); Tsinghua University

AI summary: Proposes Lookahead Sparse Attention (LSA), driven by a neural memory indexer built on the DeepSeek-V4 architecture, which predicts future context demands and retains only the critical KV chunks. In ultra-long-context settings it compresses the physical KV cache to 13.5% of the full-context baseline while preserving or slightly improving downstream accuracy.

Comments
Technical report. 11 pages. Code and model available at https://github.com/libertywing/FlashMemory-Deepseek-V4 and https://huggingface.co/libertywing/FlashMemory-Deepseek-V4

Abstract

Conventional LLMs keep the full KV cache loaded during decoding, causing a severe GPU memory bottleneck for ultra-long context serving. In this report, we propose Lookahead Sparse Attention (LSA), a novel inference paradigm powered by a Neural Memory Indexer built upon the DeepSeek-V4 architecture. Rather than passively attending to all historical tokens, LSA proactively predicts future context demands and preserves only the query-critical KV chunks in the GPU memory. Crucially, we instantiate this architecture via a backbone-free decoupled training strategy. By formulating the indexer as a standard dual-encoder architecture, we train it independently using standard retrieval training frameworks without ever loading the massive backbone model into GPU memory. We demonstrate that this "less is more" paradigm significantly maximizes serving efficiency while acting as an effective attention denoiser in tasks that rely on long-term global memory. Across primary long-context evaluation suites (e.g., LongBench-v2, LongMemEval, and RULER), FM-DS-V4 compresses the average physical KV cache footprint down to merely 13.5% of the full-context baseline, while consistently preserving or slightly elevating downstream accuracy (+0.6% absolute margin on average). Crucially, at extreme 500K scales, FlashMemory suppresses the physical KV cache overhead by over 90% without destabilizing the backbone's core reasoning capacities.
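A dual-encoder indexer that keeps only the highest-scoring KV chunks can be approximated with a short NumPy sketch. It is a generic top-k retrieval illustration, not the FlashMemory implementation: the function name, the `keep_ratio` parameter, the dot-product scoring, and the toy embeddings are all assumptions, and the query embedding merely stands in for whatever representation of predicted future context demand the indexer produces.

```python
import numpy as np


def select_chunks(query_emb, chunk_embs, keep_ratio):
    """Return indices of the chunks to keep resident, in original order.

    query_emb:  (d,) embedding standing in for predicted future demand.
    chunk_embs: (n_chunks, d) one embedding per KV chunk.
    keep_ratio: fraction of chunks to retain (at least one is kept).
    """
    scores = chunk_embs @ query_emb
    n_keep = max(1, int(round(keep_ratio * len(chunk_embs))))
    keep = np.argsort(-scores)[:n_keep]
    return np.sort(keep)


chunk_embs = np.eye(4)                       # four toy chunks
query_emb = np.array([0.1, 0.9, 0.0, 0.5])   # made-up demand vector
kept = select_chunks(query_emb, chunk_embs, keep_ratio=0.5)
print(kept.tolist())
```

Only the selected chunks would stay in GPU memory; the rest could be offloaded or dropped, which is where the cache-footprint reduction reported in the abstract would come from.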

2606.08982 2026-06-10 cs.AI Version update

Baichuan-M4: A Clinical-Grade Medical Agent System for Continuous Care

Aiyuan Yang, Canbin Piao, Chengfeng Dou, Da Pan, Dian Wang, Fan Yang, Fei Deng, Fei Li, Guangwei Ai, Hui Liu, Hongda Zhang, Jinyang Tai, Kai Lu, Lijun Liu, Linwei Chen, Linyu Li, Meiqing Guo, Peidong Guo, Qiang Ju, Rihui Xin, Shuai Wang, XinKai Ma, Xudong Chen, Yichuan Mo, Yijie Zhou, Leyi Pan, Yihe Luo, Zian Wang

Affiliations: Baichuan AI; THUBPM Group, Tsinghua University

AI summary: Presents Baichuan-M4, a clinical-grade medical large model built as an agent system around three pillars: a unified runtime, a continuous-care reinforcement-learning framework, and a clinical tool layer. It attains leading results across multiple medical evaluations and lowers the hallucination rate to 3.3%.

Abstract

Baichuan-M4 is Baichuan Intelligence's clinical-grade medical large model, designed for continuous care rather than single-turn medical question answering. It is built as a coordinated medical agent system around three pillars: Baichuan-Harness, a unified runtime that keeps reinforcement-learning training and real-world deployment consistent while enforcing action constraints, tool use, long-term patient memory, and multi-agent coordination; a core reasoning model trained with a continuous-care reinforcement-learning framework that integrates span-level reward modeling (SPAR++), reasoning-path compression, curriculum learning, and stabilized policy optimization; and a clinical tool layer for patient-memory management, authoritative evidence-based retrieval, and multimodal medical perception across documents, X-rays, and dermatology. On a cross-dimensional medical evaluation suite, Baichuan-M4 attains leading results in static medical knowledge and safety, dynamic OSCE-style consultation, long-context clinical memory, evidence-based retrieval, medical document OCR, and multimodal image understanding, while lowering the hallucination rate to 3.3%.

2606.08779 2026-06-10 cs.LG Version update

Reformulate LLM Reinforcement Learning for Efficient Training under Black-box Discrepancy

Jiashun Liu, Runze Liu, Xu Wan, Jing Liang, Hongyao Tang, Ling Pan

Affiliations: Hong Kong University of Science and Technology; Zhejiang University; Tianjin University

AI summary: To address the train-inference discrepancy in reinforcement learning, proposes the Discrepancy-Constrained Markov Decision Process (DCMDP), which uses Lagrangian relaxation to adaptively balance performance improvement against discrepancy control for stable and efficient training.

Abstract

Reinforcement Learning (RL) has emerged as a pivotal post-training paradigm, yet it frequently suffers from unpredictable suboptimal performance or even training collapse. Recent findings attribute these failures to a hidden train-inference discrepancy (or mismatch), stemming from differences in the underlying engines and architectures. We find that the training policy can actively self-correct such a discrepancy when provided with an appropriate learning signal. We further empirically identify a discrepancy tolerance region: within this region, aggressively narrowing the discrepancy can suppress policy exploration and reduce learning efficiency, whereas outside this region, reducing excessive discrepancy improves optimization consistency and raises the achievable local performance ceiling. Based on these findings, we formulate this problem as a Discrepancy-Constrained Markov Decision Process (DCMDP), where reward maximization is coupled with a constraint that aligns training-inference behavior, achieving stable dual-objective optimization. To adaptively balance performance improvement and discrepancy control, we introduce a Lagrangian relaxation mechanism that dynamically adjusts the relative weight of the two objectives according to the current degree of discrepancy violation. This enables stable dual-objective optimization: the policy is allowed to explore freely within the tolerance region, while being guided back when the discrepancy exceeds the safe boundary. Empirically, DCMDP significantly improves the performance of an 8B dense model (Qwen-3-8b) and a 30B Mixture-of-Experts model (Qwen-3-30bA3b), and enables a heterogeneous training paradigm, where LLMs can be optimized in a high-fidelity training setup while being explicitly aligned for low-cost, resource-constrained inference deployment.
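
The Lagrangian-relaxation idea in this abstract can be illustrated with a generic dual-ascent update. The sketch below is not the authors' algorithm: the function names, the tolerance `epsilon`, and the step size `eta` are hypothetical, and the rule shown is the textbook projected dual update for a constraint of the form discrepancy <= epsilon.

```python
def update_multiplier(lam, discrepancy, epsilon, eta=0.1):
    """Projected dual ascent (illustrative, not the paper's rule).

    The multiplier grows while the measured discrepancy exceeds the
    tolerance epsilon and shrinks, never below zero, once it is inside.
    """
    return max(0.0, lam + eta * (discrepancy - epsilon))


def penalized_objective(reward, discrepancy, lam, epsilon):
    """Lagrangian of 'maximize reward subject to discrepancy <= epsilon'."""
    return reward - lam * (discrepancy - epsilon)


if __name__ == "__main__":
    lam = 0.0
    # Made-up discrepancy measurements, for illustration only.
    for d in [0.05, 0.30, 0.40, 0.10, 0.02]:
        lam = update_multiplier(lam, d, epsilon=0.2)
        print(f"discrepancy={d:.2f} lambda={lam:.3f}")
```

With a zero multiplier the policy optimizes reward alone, which matches the "explore freely within the tolerance region" behavior described above; the penalty only becomes active after a violation.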

2606.07936 2026-06-10 cs.CL cs.AI Version update

Illusions of the Gold Standard: A Large-scale Analysis of Human Evaluation Protocols for Long-form Text Generation

Katelyn Xiaoying Mei, Yi-Li Hsu, Minjoon Choi, Zongwan Cao, Chenjun Xu, Bingbing Wen, Su Lin Blodgett, Lucy Lu Wang

Affiliations * University of Washington, National Tsing Hua University, Seoul National University, Mila - Québec AI Institute, Allen Institute for AI

AI summary: An analysis of human evaluation protocols in *CL conference papers from 2023-2025 finds opaque reporting and poor reproducibility, and proposes recommendations for improvement.

Details
Comments
Accepted to ACL 2026 Main
Abstract

Human evaluation plays a critical role in assessing the quality of generated text. However, the reliability and reproducibility of these evaluations depend on transparent and well-documented protocols, details that are frequently missing in current practice. In this work, we conduct a large-scale analysis of human evaluation protocols for evaluating long-form generation tasks in *CL conference publications from 2023-2025, including a full manual review of 284 papers and LLM-assisted analysis for another 1.8k+ papers. We define a set of 20 reportable criteria related to reproducibility of human evaluation studies, and apply these criteria to systematically examine reporting norms and practices within the community. We find widespread under-reporting of important aspects of human evaluation study design, leading to ambiguity about what was measured and how, who contributed judgments, and how judgments should be interpreted. Based on these findings, we outline actionable recommendations to support more transparent and reproducible reporting in future research. Our analysis code and annotated dataset can be found at: https://github.com/larchlab/Illusions-of-the-Gold-Standard

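2606.07605 2026-06-10 cs.LG cs.AI Version update

SRT: Super-Resolution for Time Series via Disentangled Rectified Flow

Jufang Duan, Shenglong Xiao, Yuren Zhang

Affiliations * Bytedance

AI summary: Proposes the SRT framework, which reconstructs high-resolution time series from low-resolution inputs via disentangled rectified flow, decomposing trend and seasonal components, aligning resolutions with an implicit neural representation, and introducing a cross-resolution attention mechanism to generate fine detail.

Details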
Journal ref
The Fourteenth International Conference on Learning Representations (ICLR 2026)
Comments
Accepted to the International Conference on Learning Representations (ICLR) 2026
Abstract

Fine-grained time series data with high temporal resolution is critical for accurate analytics across a wide range of applications. However, the acquisition of such data is often limited by cost and feasibility. This problem can be tackled by reconstructing high-resolution signals from low-resolution inputs based on specific priors, known as super-resolution. While extensively studied in computer vision, directly transferring image super-resolution techniques to time series is not trivial. To address this challenge at a fundamental level, we propose Super-Resolution for Time series (SRT), a novel framework that reconstructs temporal patterns lost in low-resolution inputs via disentangled rectified flow. SRT decomposes the input into trend and seasonal components, aligns them to the target resolution using an implicit neural representation, and leverages a novel cross-resolution attention mechanism to guide the generation of high-resolution details. We further introduce SRT-large, a scaled-up version with extensive pre-training, which enables strong zero-shot super-resolution capability. Extensive experiments on nine public datasets demonstrate that SRT and SRT-large consistently outperform existing methods across multiple scale factors, showing both robust performance and the effectiveness of each component in our architecture.
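
To make the trend/seasonal split concrete, here is a minimal sketch using a classical centered moving average. This is an assumption for illustration only: the abstract does not say how SRT performs its decomposition, and `moving_average_trend` and `decompose` are hypothetical names.

```python
def moving_average_trend(series, window):
    """Centered moving average; the window shrinks at the edges so the
    output has the same length as the input."""
    half = window // 2
    trend = []
    for i in range(len(series)):
        lo = max(0, i - half)
        hi = min(len(series), i + half + 1)
        trend.append(sum(series[lo:hi]) / (hi - lo))
    return trend


def decompose(series, window):
    """Split a series into a smooth trend and the remainder, which is
    treated here as the seasonal component."""
    trend = moving_average_trend(series, window)
    seasonal = [x - t for x, t in zip(series, trend)]
    return trend, seasonal
```

By construction the two parts sum back to the input, so each component can be processed separately and recombined afterwards.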

2606.07586 2026-06-10 cs.LG cs.AI cs.AR cs.MA Version update

From Human Guidance to Autonomy: Agent Skill System for End-to-End LLM Deployment on Spatial NPUs

Jiajie Li, Erwei Wang, Zhiru Zhang, Samuel Bayliss

Affiliations * AMD Research and Advanced Development

AI summary: Proposes a two-stage methodology, moving from human-guided, agent-assisted deployment to an autonomous agent skill system, that deploys eight additional LLMs end-to-end on the AMD XDNA 2 NPU; three of them match or exceed the sustained performance of the reference deployment.

Details
Comments
Accepted to the Machine Learning for Architecture and Systems Workshop (MLArchSys), co-located with ISCA 2026
Abstract

Spatial neural processing units (NPUs) provide an energy-efficient platform for edge LLM inference, but efficiently deploying an LLM end-to-end on such hardware remains labor-intensive. Although AI coding agents have begun to lower this cost, existing studies have largely focused on single-kernel optimization rather than end-to-end LLM deployment on resource-constrained spatial NPUs. We present a two-stage methodology, instantiated on the AMD XDNA 2 NPU, that progresses from human-guided development to agent autonomy. In the first stage, we develop a reference deployment of Llama-3.2-1B through human-guided agent assistance. The resulting implementation achieves a speedup of 2.2x on prefill and 4.0x on decode over the hand-optimized baseline, with the optimization trajectory and its lessons recorded as structured documentation throughout. In the second stage, we distill the documentation into an agent skill system consisting of eight phases, orchestrating the optimization and debugging skill sets, with numerical correctness strictly enforced at each phase. Using our agent skill system, we autonomously deploy eight additional decoder-only LLMs (Llama-3.2-3B, SmolLM2-1.7B, Qwen2.5-{0.5B, 1.5B, 3B}, Qwen3-{0.6B, 1.7B, 4B}) end-to-end on the AMD XDNA 2 NPU using the open-source compiler stack. To our knowledge, these models have not previously been deployed on AMD NPUs via any open-source software stack. Each deployment completes in 0.5-4 hours of agent wall time with almost no human guidance, and passes the numerical-correctness gates, demonstrating functional generalization to previously unencountered LLMs. Three of the eight match or exceed the sustained performance of our Llama-3.2-1B reference deployment, suggesting that the resulting implementations can be competitive without additional model-specific human engineering.

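2606.07422 2026-06-10 cs.CL cs.AI Version update

The Masked Advantage: Uncovering Local-Language Access to Cultural Knowledge in LLMs

Yang Zhang, Xiao Fei, Amr Mohamed, Sarah Almeida Carneiro, Mersin Konomi, Mingmeng Geng, Ahmed Asaad, Guokan Shang, Michalis Vazirgiannis

Affiliations * Ecole Polytechnique, MBZUAI, ENS-PSL, Durham University

AI summary: Using controlled experiments and an item response theory model, the paper separates language proficiency from access to cultural knowledge and finds that local languages hold an advantage in cultural-knowledge access that is often masked by weaker language proficiency.

Details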
Abstract

Large language models are increasingly used to answer culturally grounded questions across languages, yet it remains unclear whether local cultural knowledge is better accessed through English or the local language. Existing evaluations face two key limitations: many rely on parallel template-based questions that may not reflect how cultural knowledge naturally appears, and raw accuracy conflates general language proficiency with language-conditioned knowledge access. We address these issues with a controlled framework built on real-world cultural questions collected from regional benchmarks and local sources. By crossing question type (culture-agnostic vs. culture-specific) with query language (English vs. local language), and estimating ability with a shared 1PL item response theory model, we separate proficiency from localized knowledge access. Across 13 locales and roughly 80 models, we find a consistent English advantage on culture-agnostic questions, indicating stronger English proficiency. However, after accounting for this proficiency gap, local languages show a positive knowledge-access advantage in nearly all locale-model settings. This advantage is often masked in raw accuracy but becomes more visible for frontier, regionally aligned, or language-adapted models. Our results suggest that weaker local-language performance does not necessarily imply weaker cultural knowledge; rather, local cultural knowledge may be more accessible through the local language but hidden by limited language proficiency.
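
The 1PL (Rasch) model mentioned in this abstract gives the probability of a correct answer as a logistic function of ability minus item difficulty. The sketch below shows that formula and a simple gradient-ascent ability estimate with item difficulties held fixed. The function names and toy data are hypothetical, and the paper's shared-model fitting procedure is not described in the abstract.

```python
import math


def p_correct(theta, b):
    """1PL / Rasch model: P(correct) = sigmoid(theta - b)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def estimate_ability(responses, difficulties, steps=500, lr=0.1):
    """Maximum-likelihood ability for one respondent, difficulties fixed.

    The gradient of the log-likelihood with respect to theta is
    sum(response - p_correct). Note that an all-correct or all-wrong
    response pattern has no finite maximum-likelihood estimate.
    """
    theta = 0.0
    for _ in range(steps):
        grad = sum(r - p_correct(theta, b)
                   for r, b in zip(responses, difficulties))
        theta += lr * grad
    return theta
```

Estimating ability separately for English and local-language versions of the same question set is one way to picture how a proficiency gap could be separated from knowledge access, although the paper's exact design may differ.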

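2606.07135 2026-06-10 cs.LG Version update

Explaining Unsupervised Disease Staging in Huntington's Disease: Insights into Model Representations and Clusters

Lubna Mahmoud Abu Zohair, Hind Zantout

Affiliations * Heriot-Watt University

AI summary: Extends an unsupervised disease-staging framework with an explainability analysis; on the Enroll-HD dataset, it shows that model embeddings align with clinical progression and uses SHAP to quantify feature importance, identifying disease stages that range from early cognitive-motor impairment to severe functional dependency.

Details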
Comments
Accepted for oral presentation and as a full-length paper at the International Conference on AI in Healthcare 2026 (26-28 August 2026, Imperial College London) and will be published by Springer in the Lecture Notes in Computer Science (LNCS) series
Abstract

Huntington's disease (HD) is a progressive neurodegenerative disorder that affects motor, cognitive, and behavioral functions, and accurate characterization of its progression remains essential to improving patient outcomes and quality of life. Unsupervised machine learning (ML) approaches have demonstrated the ability to uncover disease progression trajectories and meaningful latent stages from longitudinal data; however, their limited interpretability restricts clinical trust and translation. We extend a previously proposed ML-based disease staging framework by applying an explainability analysis to the extracted feature representations and discovered disease stages. Applied to the Enroll-HD dataset, we first project the learned representations into a lower-dimensional space to intuitively assess whether the resulting clusters align with the progression of established clinical measures. We then use saliency maps to identify the clinical features that most strongly contribute to the learned embeddings over time. Finally, we train a surrogate classifier and apply SHAP to quantify feature importance for cluster assignments and to analyze which clinical variables drive transitions between disease stages. The explainability analysis indicates that the learned embeddings capture clinically meaningful disease structure, aligning with established motor and functional severity scores and exhibiting progressive deterioration across clusters. Within this analysis, SHAP reveals a stratification of disease stages, ranging from early cognitive-motor impairment to severe functional dependency, consistent with known clinical progression patterns, while also highlighting intra-stage variability.

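2606.07088 2026-06-10 cs.LG math.OC Version update

Residual-Controlled Multiplier Learning for Stochastic Constrained Decision-Making

Kang Liu, Jianchen Hu, Ziyu Qu, Edward Hengzhou Yan, Lun Yang, Meng Zhang

Affiliations * Xi’an Jiaotong University, Tencent, China University of Geosciences

AI summary: Proposes Residual-Controlled Multiplier Learning (RCML), which recasts multiplier updates as projected-pressure feedback and adds modular stochastic stabilization components, addressing the unstable multiplier updates that mini-batch noise causes in primal-dual methods for stochastic constrained decision-making; it establishes finite-gain convergence and a local KKT-residual interpretation.

Details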
Abstract

Stochastic constrained decision-making requires optimizing performance objectives while enforcing statistical requirements such as safety or fairness. However, standard primal-dual methods struggle to update multipliers robustly under stochastic mini-batch feedback, as the noise of mini-batch gradients and constraint estimates can be directly accumulated into the multiplier memory. To address this issue, we propose Residual-Controlled Multiplier Learning (RCML), which reformulates multiplier updating as projected-pressure feedback. The central idea is to decompose the projected multiplier into an effective pressure signal for primal descent and a pressure-memory residual for finite-gain multiplier tracking. To handle heterogeneous and noisy observations, we further augment this residual-integral backbone with modular stochastic stabilization components. For the convex-affine backbone, we establish finite-gain convergence, derive a stochastic residual bound under mini-batch feedback, and show that the residual feedback law admits a local KKT-residual interpretation near regular KKT points of nonconvex problems. Experiments across optimization, allocation, and fair-ranking tasks show that RCML improves feasibility control and multiplier stability while maintaining competitive objective performance. Code is released at https://anonymous.4open.science/r/RCML-3114/.

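2606.06888 2026-06-10 cs.LG Version update

Data-Constrained Language Model Pretraining: Improved Regularization and Scaling Laws

Zhiwei Xu, Shihao Wu, Hanseul Cho, Wei Hu, Yixin Wang

Affiliations * University of Michigan, KAIST AI

AI summary: Studies regularization and scaling for data-constrained language model pretraining, proposing masked-input regularization (MIR) to improve validation loss and the SoftQ scaling law to fit the interaction between model size and data size under repeated data more accurately.

Details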
Abstract

Classical scaling laws for language model pretraining balance model size against training dataset size under a fixed compute budget, assuming abundant data and a single pass over the corpus. As training compute grows faster than the supply of natural language data, pretraining is likely to enter a data-constrained, compute-rich regime where models train for multiple epochs over a finite dataset. We study data-constrained pretraining along two axes, regularization and scaling. For regularization, we study masked-input regularization (MIR), an auxiliary next-token prediction loss on randomly masked inputs. MIR tests whether the random masking central to diffusion language models can benefit autoregressive pretraining without architectural changes or inference overhead. Across 72M to 1.4B parameter models, we find that MIR added on top of strong weight decay improves validation loss over autoregressive strong-weight-decay-only models, with downstream gains at 1.4B. For scaling, we propose SoftQ, a scaling law that couples model size and data size to capture their interaction under repeated data. Classical alternatives such as the Chinchilla law use an additive form that decouples these terms, making them misspecified in the data-constrained regime. We find that SoftQ fits data-constrained experiments substantially better than these alternatives, and estimates MIR's gains as equivalent to roughly 1.3 times as much unique training data. We release our code at https://github.com/yixinw-lab/dc_pretrain.
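
The remark that the Chinchilla-style additive form "decouples" model size and data size can be checked numerically. The sketch below uses the well-known additive parameterization L(N, D) = E + A / N^alpha + B / D^beta; the coefficient values are placeholders chosen for illustration, not fitted values from this or any other paper, and the SoftQ form itself is not given in the abstract and is not reproduced here.

```python
def additive_loss(n_params, n_tokens, E, A, B, alpha, beta):
    """Additive scaling law: L(N, D) = E + A / N**alpha + B / D**beta."""
    return E + A / n_params ** alpha + B / n_tokens ** beta


# Placeholder coefficients (hypothetical, for illustration only).
COEFFS = dict(E=1.7, A=400.0, B=400.0, alpha=0.34, beta=0.28)


def gain_from_doubling_data(n_params, n_tokens):
    """Predicted loss change when the token count is doubled."""
    return (additive_loss(n_params, 2 * n_tokens, **COEFFS)
            - additive_loss(n_params, n_tokens, **COEFFS))
```

Under this form the predicted gain from doubling the data is the same for a 100M-parameter model and a 1B-parameter model, because the N term cancels. A law that couples the two terms would let that gain depend on model size, which is the kind of interaction the abstract says SoftQ is designed to capture.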

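2606.06758 2026-06-10 cs.CL Version update

Diagnosing Evidence Utilization in Long-Context and Retrieval-Augmented Language Models under Matched Evidence Conditions

Haizhou Xia

Affiliations * University of Western Ontario

AI summary: Proposes a four-condition evidence-availability protocol that uses the ONCU estimator to separate model performance under no-evidence, full-context, retrieved-evidence, and oracle-evidence conditions, diagnosing evidence-utilization bottlenecks in long-context and retrieval-augmented language models.

Details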
Comments
46 pages, 37 tables, 1 figure
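AI Chinese abstract

Final-answer accuracy, retrieval recall, and citation overlap do not by themselves establish whether a long-context or retrieval-augmented language model used the evidence it was given. A model may answer from parametric memory, fail despite receiving the correct passage, or cite evidence without converting it into the requested answer. This paper proposes a matched four-condition evidence-availability protocol (no evidence, full context, retrieved evidence, and oracle-evidence reference) for diagnosing evidence utilization under fixed examples, prompts, scoring fields, retrieval settings, and validity checks. ONCU is used as a protocol-bound estimator of the recovered oracle-reference evidence advantage and is computed only for denominator-valid groups; denominator-free answer, evidence, retrieval, and failure-audit metrics are reported separately. The empirical study evaluates five local open-source models from the Qwen, Gemma, Llama, and Mistral families on Controlled-ONCU-safe16K, HotpotQA-ONCU, and 2WikiMultiHopQA-ONCU, yielding 18,000 ONCU-compatible predictions. The main finding is a task-dependent bottleneck split: the controlled synthetic setting mainly exposes full-context utilization failures, while the tested realistic multi-hop settings mainly expose retrieval-chain coverage failures in the denominator-free answer and evidence metrics, with ONCU supporting the same direction on oracle-improving groups. The contribution is a diagnostic protocol that separates no-evidence answerability, oracle-evidence recoverability, full-context utilization, and retrieval-conditioned utilization, rather than a single-score leaderboard for long-context or retrieval-augmented systems.

English abstract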

Final-answer accuracy, retrieval recall, and citation overlap do not reveal how much answer advantage a long-context or retrieval-augmented language model actually recovers from supplied evidence. A model may answer from parametric priors, fail to use evidence that is present, or cite relevant text without converting it into the final answer. This paper introduces a four-condition diagnostic protocol for evidence-utilization evaluation under matched examples, models, prompts, and scoring rules. The protocol compares no-evidence, full-context, retrieved-evidence, and oracle-evidence reference conditions, and uses Oracle-Reference Normalized Context Utilization (ONCU) as a denominator-valid estimate of recovered oracle-reference evidence advantage. The empirical study evaluates five local open-weight models from the Qwen, Gemma, Llama, and Mistral families over Controlled-ONCU-safe16K, HotpotQA-ONCU, and 2WikiMultiHopQA-ONCU, comprising 18,000 ONCU-compatible predictions. Results show a task-dependent diagnostic pattern: controlled synthetic settings expose reduced recovery when the same evidence is embedded in long input rather than supplied compactly, while realistic multi-hop reconstructions show that full-context inputs outperform the tested retrieved inputs in denominator-free answer and evidence metrics, with ONCU supporting the same direction on oracle-improving groups. Sensitivity audits with stronger retrieval settings narrow some gaps but do not overturn the scoped interpretation. The main contribution is therefore not a single utilization ratio, but a matched diagnostic protocol that separates no-evidence answerability, oracle-evidence recoverability, full-context recovery, retrieval-conditioned recovery, denominator validity, and companion answer/evidence diagnostics.
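
The abstract does not give the ONCU formula. One plausible reading of a normalized, "denominator-valid" utilization score is the share of the oracle-over-no-evidence gain that a given condition recovers, defined only when the oracle condition actually improves on the no-evidence condition. The sketch below implements that reading as an assumption; `normalized_recovery` is a hypothetical name and should not be taken as the paper's definition.

```python
def normalized_recovery(no_evidence, condition, oracle):
    """Share of the oracle-over-no-evidence gain recovered by `condition`.

    All three arguments are scores (for example accuracies) on the same
    matched group of examples. Returns None when the oracle does not
    improve on the no-evidence score, since the ratio would then have a
    zero or negative denominator.
    """
    gain = oracle - no_evidence
    if gain <= 0:
        return None
    return (condition - no_evidence) / gain
```

Returning `None` instead of a number keeps invalid groups out of any average, which mirrors the abstract's point that answer and evidence metrics without a denominator are reported separately.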

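2606.06744 2026-06-10 cs.LG cs.GT cs.MA econ.TH Version update

Learn to Match: Two-Sided Matching with Temporally Extended Feedback

Haijing Zong, Yancheng Liang, Boyang Zhou, Natasha Jaques

Affiliations * University of Washington

AI summary: Proposes a two-sided matching framework with temporally extended feedback, modeled as a partially observable Markov game, and builds the Learn2Match benchmark on multi-agent reinforcement learning; experiments show that independent PPO outperforms a bandit baseline but incurs higher information-friction loss.

Details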
Abstract

Two-sided matching markets often involve information that unfolds over time through interviews, repeated interaction, learning, and separation. Existing matching models typically reduce this process to immediate sub-Gaussian feedback about fixed preferences, missing settings where payoff-relevant information is revealed gradually and changes future matching decisions. We introduce a framework with temporally extended feedback that formulates two-sided matching as a partially observable Markov game with costly pre-match screening, noisy post-match observations, evolving latent profiles, and endogenous continuation or dissolution. We instantiate this framework in Learn2Match, a multi-agent reinforcement-learning benchmark for dynamic matching markets. Learn2Match supports decentralized decision making over whom to interview, whom to match with, and when to dissolve a match, while evaluating policies using regret, social welfare, and an information-friction loss that measures the welfare gap caused by incomplete revelation of latent preferences. We find that independent PPO achieves higher cumulative social welfare and lower cumulative regret than the bandit-style CA-ETC baseline under temporally extended feedback, demonstrating the promise of MARL for dynamic matching markets. However, PPO still incurs higher information-friction loss, revealing that end-to-end MARL does not yet provide the coordinated exploration structure of matching-bandit methods. These results position Learn2Match as a benchmark for developing the next generation of matching-market algorithms: methods that are adaptive like RL agents, statistically disciplined like bandit algorithms, and structurally aware like stable-matching mechanisms. The official website and the code link are available at https://sites.google.com/view/learn-to-match/home.

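2606.06742 2026-06-10 cs.LG stat.ML Version update

TorchKM: A GPU-Oriented Library for Kernel Learning and Model Selection

Yikai Zhang, Gaoxiang Jia, Jie Ding, Boxiang Wang

Affiliations * University of Iowa, University of Minnesota

AI summary: Presents TorchKM, a GPU-accelerated kernel learning library that speeds up training and model selection for SVMs, kernel logistic regression, and related models through intelligent reuse of matrix operations, with substantial speedups over standard baselines.

Details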
Comments
14 pages, 2 figures
Abstract

TorchKM is an open-source library for kernel machines, including support vector machines, kernel logistic regression, and kernel quantile regression, with GPU acceleration. The library features a scikit-learn-style API and is designed to exploit GPU-friendly linear algebra, accelerating the full training and model-selection pipeline through intelligent reuse of matrix operations. Benchmarks show competitive predictive performance with substantial speedups over standard baselines. The efficiency and programmable design also make TorchKM a kernel-learning component for AI-driven workflows. Code and documentation are available at https://github.com/YikaiZhang95/torchkm, and the package can be easily installed via PyPI.

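2606.06735 2026-06-10 cs.AI Version update

A Geometric Account of Activation Steering through Angle-Norm Decomposition

Georgii Aparin, Tatiana Gaintseva

Affiliations * Huawei Noah’s Ark Lab, Queen Mary University of London

AI summary: Using controlled experiments that separate angular and radial components, the paper finds that concepts are encoded mainly in angular structure but that norm matters for the stability and downstream effects of steering, and recommends parameterizing activation steering by interpretable angular and radial components.

Details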
Abstract

Linear activation steering has gained popularity as a simple and empirically effective way to control language model behavior. More recently, spherical steering paradigms have been proposed to address limitations of additive interventions, often motivated by the assumption that hidden-state norm does not carry concept-relevant information. In this work, we revisit this assumption through a controlled empirical study designed to disentangle the roles of angular and radial components. We show that steering methods differ mainly in how they couple two geometric effects: changing a token's angular alignment with a concept direction and changing its hidden-state norm. Across seven language models, we find that concepts are represented primarily in angular structure, supporting the motivation for spherical methods, but that norm remains important for the stability and downstream effects of steering. Our results explain why interventions with similar concept-level effects can behave differently, and suggest that activation steering should be parameterized by interpretable angular and radial components of the intervention, rather than by a single additive coefficient that entangles these two effects.
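
The two geometric effects named in this abstract, a change in angle to the concept direction and a change in hidden-state norm, can be seen on a toy vector. The sketch below compares plain additive steering with a variant that rescales the result back to the original norm. The rescaling variant is a simple stand-in chosen for illustration and is not necessarily one of the spherical methods the paper studies; all function names are hypothetical.

```python
import math


def norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def angle_to(v, direction):
    """Angle in radians between v and a concept direction."""
    cos = sum(a * b for a, b in zip(v, direction)) / (norm(v) * norm(direction))
    return math.acos(max(-1.0, min(1.0, cos)))


def additive_steer(h, direction, alpha):
    """Additive steering h + alpha * d: changes both angle and norm."""
    return [a + alpha * b for a, b in zip(h, direction)]


def norm_preserving_steer(h, direction, alpha):
    """Add alpha * d, then rescale to the original norm so that only the
    angular component changes. Assumes the steered vector is nonzero."""
    steered = additive_steer(h, direction, alpha)
    scale = norm(h) / norm(steered)
    return [scale * x for x in steered]
```

For `h = [1.0, 0.0]` and `direction = [0.0, 1.0]` with `alpha = 1.0`, both variants move the angle to the concept direction from pi/2 to pi/4, but only the additive one increases the norm, which illustrates how a single additive coefficient entangles the two effects.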

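2606.06698 2026-06-10 cs.LG cs.CL Version update

RECAP: Regression Evaluation for Continual Adaptation of Prompts

Harsh Deshpande, Kushal Chawla, Sangwoo Cho, William Campbell, Sambit Sahu

Affiliations * Capital One

AI summary: Proposes the RECAP benchmark, which evaluates how well prompt optimization methods handle changing constraints, in continual-learning terms, under a strictly proactive adapt-then-test protocol; existing methods show no significant performance improvement in the proactive setting, underscoring the need for proactive prompt adaptation methods.

Details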
Abstract

Production agentic systems routinely face evolving constraints and must comply from the very next interaction. Scenarios such as a tool-call notification changing a compliance threshold or a policy update adding disclosure requirements meet this criterion, leaving almost no room for error in production. This proactive adaptation setting is common in deployment, but absent from current benchmarks, which assume either static constraint sets or reactive protocols with evaluation feedback. We introduce RECAP, a benchmark that measures continual-learning phenomena (forgetting, regression, forward transfer) at the constraint level under a strictly proactive adapt-then-test protocol: prompt optimization methods receive only the constraint specification and must generalize before seeing any test data. Evaluating six methods across four LLMs and three schedules with evolving constraints, we find that these methods show no significant improvement in performance, even while incurring higher latency. These methods, designed for offline or reactive settings, are inadequate for the proactive paradigm. Our work emphasizes the growing need for designing proactive prompt adaptation methods, where the models must remain robust to evolving needs in deployment.

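2606.06622 2026-06-10 cs.CL Version update

UnpredictaBench: A Benchmark for Evaluating Distributional Randomness in LLMs

Amirhossein Abaskohi, Amirhossein Dabiriaghdam, Liang Luo, Ellie Dingqiao Wen, Lele Wang, Giuseppe Carenini, Peter West

Affiliations * University of British Columbia, Independent Researcher

AI summary: Proposes the UnpredictaBench benchmark, which uses the KS@N metric to evaluate how well LLMs sample from target distributions (statistical distributions, stochastic programs, and natural-language scenarios); model performance varies widely and no model exceeds 40% at KS@100, indicating substantial headroom in distributional sampling.

Details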
Abstract

We introduce UnpredictaBench, an evaluation that tests the ability of large language models (LLMs) to capture true underlying distributions. As LLMs are increasingly used as substitutes for other entities (e.g., for humans in economic simulations), the tendency of many models to collapse towards a single plausible answer means a failure to capture the unpredictability of real systems. Recent work on improving output diversity is insufficient for this setting: simulation requires samples that are calibrated to a target distribution, not merely varied outputs. UnpredictaBench isolates a simplified but fundamental version of this problem: sampling outcomes from individual target distributions, including canonical statistical distributions, distributions induced by stochastic programs, and natural-language scenarios that describe random processes. We introduce 448 such problems together with KS@N, a general-purpose evaluation metric that quantifies how well a model's outputs approximate black-box target distributions via the Kolmogorov-Smirnov statistical test. KS@N is the rate at which we fail to reject model samples of size N against ground-truth samples, with larger N indicating greater difficulty. Tested across open and proprietary models, we find a large spread in distributional capabilities. For instance, when models generate samples of size 100 (KS@100, our standard metric), scores range from near 0 to over 20%. No model is able to achieve over 40% at KS@100, showing significant headroom in distributional sampling as a capability. Although adding reasoning can somewhat increase scores, we find no immediate solution for this issue. UnpredictaBench shows that even simple distributional simulation remains challenging, making it a necessary first step toward using LLMs as stand-ins for complex systems.
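
The KS@N description above (the rate at which model samples of size N are not rejected against ground-truth samples) can be sketched with a two-sample Kolmogorov-Smirnov test. The code below is a minimal illustration, not the benchmark's implementation: the 5% significance level, the large-sample critical value with `c_alpha = 1.358`, and the function names are all assumptions, since the abstract does not state the test settings.

```python
import math


def ks_statistic(xs, ys):
    """Two-sample KS statistic: the largest gap between the two
    empirical CDFs. Ties are handled by advancing both samples."""
    xs, ys = sorted(xs), sorted(ys)
    i = j = 0
    d = 0.0
    while i < len(xs) and j < len(ys):
        v = min(xs[i], ys[j])
        while i < len(xs) and xs[i] == v:
            i += 1
        while j < len(ys) and ys[j] == v:
            j += 1
        d = max(d, abs(i / len(xs) - j / len(ys)))
    return d


def fails_to_reject(xs, ys, c_alpha=1.358):
    """Large-sample KS test at roughly the 5% level (assumed setting)."""
    n, m = len(xs), len(ys)
    critical = c_alpha * math.sqrt((n + m) / (n * m))
    return ks_statistic(xs, ys) <= critical


def ks_at_n(problems):
    """Fraction of (model_samples, reference_samples) pairs not rejected."""
    passed = sum(fails_to_reject(model, ref) for model, ref in problems)
    return passed / len(problems)
```

A model that always returns the same value is rejected on almost any continuous target once N is moderately large, which is the collapse behavior the abstract describes.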

2605.03344 2026-06-10 cs.IR cs.AI cs.CL Version update

RAG over Thinking Traces Can Improve Reasoning Tasks

RAG 基于思考轨迹可提升推理任务

Negar Arabzadeh, Wenjie Ma, Sewon Min, Matei Zaharia

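AI summary Proposes retrieving thinking traces instead of documents and converting them into structured representations with the T3 method, which significantly improves performance on reasoning tasks and outperforms both standard RAG and non-RAG baselines.

Details
AI Chinese abstract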

检索增强生成(RAG)已被证明对知识密集型任务有效,但普遍认为其对数学和代码生成等推理密集型问题帮助有限。我们通过证明限制不在于 RAG 本身而在于语料库的选择来挑战这一假设。我们不检索文档,而是提出检索思考轨迹,即问题求解尝试过程中产生的中间思考轨迹。我们表明思考轨迹本身就是一个强大的检索源,并进一步引入 T3,一种离线方法,将其转化为结构化、利于检索的表示,以提高可用性。使用这些轨迹作为语料库,简单的检索-生成流水线在强模型和基准测试(如 AIME 2025--2026、LiveCodeBench 和 GPQA-Diamond)上持续提升推理性能,优于无 RAG 基线和检索标准网络语料库。例如,在 AIME 2025-2026 上,使用 Gemini-2-thinking 生成的轨迹进行 RAG,在 Gemini-2.5-Flash、GPT-OSS-120B 和 GPT-5 上分别实现了 +56.3%、+8.6% 和 +7.6% 的相对增益,尽管这些是更新的模型。总体而言,我们的结果表明思考轨迹是推理任务的有效检索语料库,将其转化为结构化、紧凑或诊断性表示可带来更强的增益。代码见此链接。

English abstract

Retrieval-augmented generation (RAG) has proven effective for knowledge-intensive tasks, but is widely believed to offer limited benefit for reasoning-intensive problems such as math and code generation. We challenge this assumption by showing that the limitation lies not in RAG itself, but in the choice of corpus. Instead of retrieving documents, we propose retrieving thinking traces, i.e., intermediate thinking trajectories generated during problem-solving attempts. We show that thinking traces are already a strong retrieval source, and further introduce T3, an offline method that transforms them into structured, retrieval-friendly representations, to improve usability. Using these traces as a corpus, a simple retrieve-then-generate pipeline consistently improves reasoning performance across strong models and benchmarks such as AIME 2025-2026, LiveCodeBench, and GPQA-Diamond, outperforming both non-RAG baselines and retrieval over standard web corpora. For instance, on AIME 2025-2026, RAG with traces generated by Gemini-2-thinking achieves relative gains of +56.3%, +8.6%, and +7.6% for Gemini-2.5-Flash, GPT-OSS-120B, and GPT-5, respectively, even though these are more recent models. Overall, our results suggest that thinking traces are an effective retrieval corpus for reasoning tasks, and transforming them into structured, compact, or diagnostic representations unlocks even stronger gains. Code is available at https://github.com/Narabzad/t3.
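
The retrieve-then-generate pipeline mentioned in the abstract can be sketched roughly as follows. This is a toy illustration only: it assumes a bag-of-words cosine retriever and a plain prompt template, neither of which comes from the paper, and it does not implement T3. The names `retrieve` and `build_prompt` and the example traces are hypothetical.

```python
import math
from collections import Counter


def bow(text):
    """Lowercased bag-of-words vector for a piece of text."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(question, traces, k=1):
    """Return the k stored thinking traces most similar to the question."""
    q = bow(question)
    ranked = sorted(traces, key=lambda t: cosine(q, bow(t)), reverse=True)
    return ranked[:k]


def build_prompt(question, retrieved):
    """Prepend retrieved traces to the question before calling a generator."""
    context = "\n".join(f"Related reasoning: {t}" for t in retrieved)
    return f"{context}\n\nQuestion: {question}"


# Hypothetical corpus of thinking traces from earlier solving attempts.
traces = [
    "to count lattice paths use binomial coefficients n choose k",
    "for modular arithmetic reduce each term mod m before multiplying",
]
question = "how many lattice paths go from corner to corner"
prompt = build_prompt(question, retrieve(question, traces))
print(prompt)
```

The resulting prompt would then be passed to whichever model plays the generator role; a dense retriever would normally stand in for the lexical one used here.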

2606.09677 2026-06-10 eess.AS cs.AI Version update

MeCo: One-Step MeanFlow-based Corrector for Multi-Channel Speech Separation

MeCo: 基于MeanFlow的一步校正器用于多通道语音分离

Dohwan Kim, Jung-Woo Choi

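AI summary Proposes MeCo, a MeanFlow-based one-step generative corrector that jointly trains a generative objective and signal fidelity through data-space optimization, improving both signal fidelity and human listening quality at very low computational overhead.

Details
Comments
5 pages, accepted to Interspeech 2026
AI Chinese abstract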

虽然用于多通道语音分离的判别模型在基于参考的指标上表现出色,但它们通常表现出次优的人耳听觉质量。为了解决这个问题,我们提出了一种新颖的基于MeanFlow的一步生成式校正器(MeCo)。MeCo学习一个条件平均速度场,以一步方式将判别估计直接映射到干净语音流形上。为了最大化一步生成性能,我们引入了数据空间优化(DSO)。DSO集成了一个$\mathbf{x}_r$损失,该损失惩罚较长位移间隔上的预测误差,作为人耳听觉质量的生成目标,以及一个端点SI-SDR损失,直接优化终端信号保真度。实验表明,MeCo以最小的计算开销实现了最先进的性能,在域内和域外场景中同时实现了卓越的信号保真度和人耳听觉质量。

English abstract

While discriminative models for multi-channel speech separation excel in reference-based metrics, they often exhibit suboptimal human listening quality. To address this, we propose a novel MeanFlow-based one-step generative corrector (MeCo). MeCo learns a conditional average velocity field to map discriminative estimates directly onto the clean speech manifold in a single step. To maximize one-step generation performance, we introduce Data-Space Optimization (DSO). DSO integrates an $\mathbf{x}_r$-loss, which penalizes prediction errors on longer displacement intervals to serve as a generative objective for human listening quality, with an Endpoint SI-SDR loss that directly optimizes terminal signal fidelity. Experiments demonstrate that MeCo achieves state-of-the-art (SOTA) performance with minimal computational overhead, simultaneously achieving superior signal fidelity and human listening quality in both in-domain and out-of-domain scenarios.
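
For readers unfamiliar with the signal-fidelity term, the standard scale-invariant SDR (SI-SDR) can be computed as in the sketch below. This shows only the general metric; how the paper's Endpoint SI-SDR loss applies it (presumably as a negated SI-SDR on the one-step output) is an assumption, and the function name `si_sdr` is illustrative.

```python
import math


def si_sdr(estimate, reference, eps=1e-12):
    """Scale-invariant signal-to-distortion ratio in dB.

    The reference is rescaled by the optimal gain alpha before comparison,
    so multiplying the estimate by a constant does not change the score.
    """
    dot = sum(e * r for e, r in zip(estimate, reference))
    ref_energy = sum(r * r for r in reference)
    alpha = dot / (ref_energy + eps)           # projection gain
    target = [alpha * r for r in reference]    # scaled reference
    noise = [e - t for e, t in zip(estimate, target)]
    target_energy = sum(t * t for t in target)
    noise_energy = sum(n * n for n in noise)
    return 10.0 * math.log10((target_energy + eps) / (noise_energy + eps))


reference = [1.0, 0.0, -1.0, 0.0]
estimate = [1.0, 0.1, -1.0, -0.1]
louder = [2.0 * x for x in estimate]
print(round(si_sdr(estimate, reference), 3))  # 20.0
print(round(si_sdr(louder, reference), 3))    # 20.0, unchanged by rescaling
```

A training loss built on this metric would typically minimize its negative, so that higher fidelity means lower loss.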

2606.09543 2026-06-10 cs.CL Version update

From Genes to Tokens: a GWAS-inspired Approach for Interpretable Stylometric Analysis

从基因到词元:受GWAS启发的可解释风格计量分析方法

Dmitry Pronin, Evgeny Kazartsev

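AI summary Inspired by genome-wide association studies, proposes a stylometric method that detects author-specific lexical markers using logistic regression with multiple-comparison correction, validated on English, German, and Russian corpora.

Details
AI Chinese abstract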

这篇短文介绍了一种受全基因组关联研究(GWAS)启发的风格计量解释方法。每个“基因”词元与“表型”作者身份的关联通过逻辑回归进行检验,并进行了多重比较校正。将该方法应用于英语、德语和俄语语料库,检测出了个体作者特有的统计显著的词汇标记。

English abstract

This short paper introduces a stylometric interpretation method inspired by genome-wide association studies (GWAS). Each "gene" token's association with "phenotype" authorship is tested using logistic regression with multiple-comparison correction. Applied to English, German, and Russian corpora, the method detects statistically significant lexical markers distinctive of individual authors.
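
The multiple-comparison step can be illustrated with a short sketch. The abstract does not say which correction is used, so the Benjamini-Hochberg procedure below is an assumption, and the tokens and p-values are made up for illustration. In a GWAS-style setup, each p-value would come from a per-token logistic regression against authorship.

```python
def benjamini_hochberg(pvalues, q=0.05):
    """Return the keys whose p-values pass Benjamini-Hochberg at FDR level q.

    pvalues: dict mapping a token to the p-value of its association test.
    """
    m = len(pvalues)
    ranked = sorted(pvalues, key=pvalues.get)  # tokens by ascending p-value
    cutoff = 0
    for rank, token in enumerate(ranked, start=1):
        # Keep the largest rank whose p-value is under its stepped threshold.
        if pvalues[token] <= q * rank / m:
            cutoff = rank
    return set(ranked[:cutoff])


# Hypothetical per-token p-values for one author.
token_pvalues = {
    "whilst": 0.001,
    "upon": 0.008,
    "indeed": 0.039,
    "very": 0.041,
    "quite": 0.042,
    "the": 0.060,
    "and": 0.074,
    "of": 0.205,
}
markers = benjamini_hochberg(token_pvalues, q=0.05)
print(sorted(markers))  # ['upon', 'whilst']
```

Without correction, five of the eight tokens would pass a naive 0.05 threshold; with it, only two remain, which matters when thousands of tokens are tested at once.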

2606.09141 2026-06-10 eess.AS cs.SD Version update

FlashTTS: Fast Streaming TTS with MTP Acceleration and X-pred Mean Flow Distillation

FlashTTS: 基于MTP加速和X-pred均值流蒸馏的快速流式TTS

Hanke Xie, Xiaming Ren, Dake Guo, Ruonan You, Wenhao Li, Jingbin Hu, Guobin Ma, Huakang Chen, Kejie Xu, Rui Huang, Weiguo Tan, Xianrong Wang, Lei Xie

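AI summary Proposes the FlashTTS framework, which achieves low-latency streaming TTS through a lagged multi-track architecture, parallel multi-token prediction, and an X-pred mean flow matching decoder, reducing first-packet latency to 325 ms while preserving zero-shot voice cloning and cross-lingual intelligibility.

Details
Comments
Accepted to Interspeech 2026
AI Chinese abstract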

近期语音对话系统的进展要求文本转语音(TTS)模型更快、响应更及时。现代语音对话系统对TTS模型有两个主要要求:低延迟和支持流式输入输出。然而,大多数现有的基于单码本LLM的TTS方法依赖于多阶段流水线,缺乏原生流式能力。这些系统通常由于缓慢的自回归预测和多步流匹配而遭受高端到端延迟。为了解决这些限制,我们提出了FlashTTS,一个开源、低延迟的流式TTS框架。FlashTTS引入了一种滞后多轨架构,原生处理流式文本和语音输入,从而消除了句子级缓冲的需要。为了加速声学生成,我们将并行多令牌预测(MTP)与X-pred均值流匹配解码器集成。这种配置在恰好两次函数评估(2-NFE)中实现了高保真度的令牌到梅尔频谱生成。通过联合优化输入处理和解码效率,FlashTTS为实时语音对话系统提供了实用基础。实验表明,与稳健的流式基线相比,FlashTTS将首包延迟显著降低至325毫秒,同时保持了强大的零样本语音克隆和跨语言可懂度。语音样本可用。模型代码和检查点将作为开源发布。

English abstract

Recent progress in speech dialogue systems requires Text-to-Speech (TTS) models to be faster and more responsive. Modern speech dialogue systems impose two primary requirements on TTS models: low latency and support for streaming inputs and outputs. However, most existing single-codebook LLM-based TTS methods rely on multi-stage pipelines that lack native streaming capabilities. These systems typically suffer from high end-to-end latency due to slow autoregressive prediction and multi-step flow matching. To address these limitations, we propose FlashTTS, an open-source and low-latency streaming TTS framework. FlashTTS introduces a lagged multi-track architecture that natively processes streaming text and speech inputs, thereby eliminating the need for sentence-level buffering. To accelerate acoustic generation, we integrate parallel Multi-Token Prediction (MTP) with an X-pred mean flow matching decoder. This configuration achieves high-fidelity token-to-mel generation in exactly two function evaluations (2-NFE). By jointly optimizing input processing and decoding efficiency, FlashTTS offers a practical foundation for real-time speech dialogue systems. Experiments show that FlashTTS substantially reduces First-Packet Latency to 325ms compared to robust streaming baselines, all while preserving strong zero-shot voice cloning and cross-lingual intelligibility. Speech samples are available. The model code and checkpoints will be released as open source.