arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.22863 2026-06-09 cs.LG Updated version

Latent Cache Flow: Model-to-Model Communication Without Text

Maximillian Rossi, Prajwal Raghunath, Eugene Wu

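Institutions: University of California, Berkeley; Stanford University

AI summary: Proposes Latent Cache Flow (LCF), which jointly translates and compresses key-value caches for efficient model-to-model communication. In the differing-context setting it improves accuracy (Exact Match) by 23% over text-based communication and runs 8.5 times faster.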

Comments 6 pages, 5 figures

English abstract

LLM agents today communicate via text, which incurs considerable latency and information loss due to the need to autoregressively decode the sharer model's state and encode at the receiver model. Recent work such as Cache-to-Cache (C2C; Fu et al., 2026) seeks to exchange KV caches by learning adapters that translate sharer KV matrices to the receiver model. However, the adapters are large and expensive to train, and translate individual tokens, which requires the target context to be identical. This is unsuitable for agent communication, where the LLMs have differing context. We introduce Latent Cache Flow (LCF). To address efficiency, we observe that keys and values can be jointly translated and compressed, reducing the adapter to about 4% of C2C's size. To address differing context, we design the adapter to transmit a summary of new information that the target model does not have. Our early experiments show that a pruned 13 MB LCF adapter can be more accurate than C2C at 956 MB in shared-context settings; for different contexts, LCF improves F1 by 7.5% and Exact Match by 23% while 8.5 times faster than text-based communication.

2604.24594 2026-06-09 cs.CL cs.AI Updated version

Skill Retrieval Augmentation for Agentic AI

Weihang Su, Jianming Long, Qingyao Ai, Qiaozhi He, Yichen Tang, Changyue Wang, Yiteng Tu, Yingbo Wang, Yiqun Liu

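Institutions: Department of Computer Science and Technology, Tsinghua University; ByteDance Inc.

AI summary: Existing agent systems run out of context window and lose skill-identification accuracy as skill libraries grow. The paper proposes the Skill Retrieval Augmentation (SRA) paradigm, which improves agent performance by dynamically retrieving from an external skill corpus, and builds the SRA-Bench benchmark to expose bottlenecks in skill incorporation.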

English abstract

As large language models (LLMs) evolve into agentic problem solvers, they increasingly rely on external, reusable skills to handle tasks beyond their native parametric capabilities. In existing agent systems, the dominant strategy for incorporating skills is to explicitly enumerate available skills within the context window. However, this strategy fails to scale: as skill corpora expand, context budgets are consumed rapidly, and the agent becomes markedly less accurate in identifying the right skill. To this end, this paper formulates Skill Retrieval Augmentation (SRA), a new paradigm in which agents dynamically retrieve, incorporate, and apply relevant skills from large external skill corpora on demand. To make this problem measurable, we construct a large-scale skill corpus and introduce SRA-Bench, the first benchmark for decomposed evaluation of the full SRA pipeline, covering skill retrieval, skill incorporation, and end-task execution. SRA-Bench contains 5,400 capability-intensive test instances and 636 manually constructed gold skills, which are mixed with web-collected distractor skills to form a large-scale corpus of 26,262 skills. Extensive experiments show that retrieval-based skill augmentation can substantially improve agent performance, validating the promise of the paradigm. At the same time, we uncover a fundamental gap in skill incorporation: current LLM agents tend to load skills at similar rates, regardless of whether a gold skill is retrieved or whether the task actually requires external capabilities. This shows that the bottleneck in skill augmentation lies not only in retrieval but also in the base model's ability to determine which skill to load and when external loading is actually needed. These findings position SRA as a distinct research problem and establish a foundation for the scalable augmentation of capabilities in future agent systems.
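The retrieval step described above can be illustrated with a minimal sketch. It assumes a bag-of-words cosine similarity as the retriever, which is a stand-in chosen for brevity and not the retriever used in SRA-Bench; the skill names and descriptions are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_skills(task, corpus, k=2):
    """Return up to k skill names whose descriptions best match the task."""
    query = Counter(tokenize(task))
    scored = [(cosine(query, Counter(tokenize(desc))), name)
              for name, desc in corpus.items()]
    # Highest score first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for score, name in scored[:k] if score > 0]


# Hypothetical toy corpus (not taken from SRA-Bench).
SKILLS = {
    "pdf_table_extract": "extract tables from pdf documents into csv",
    "unit_convert": "convert physical units such as meters to feet",
    "sql_query": "write sql query against a relational database",
}

top = retrieve_skills("Extract the tables from this PDF report", SKILLS, k=1)
print(top)
```

Only the retrieved skills would then be placed in the context window, instead of enumerating the whole corpus. Deciding whether to load a retrieved skill at all, which the abstract identifies as a separate bottleneck, is not modeled here.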

2605.22763 2026-06-09 cs.AI Updated version

Advancing Mathematics Research with AI-Driven Formal Proof Search

George Tsoukalas, Anton Kovsharov, Sergey Shirobokov, Anja Surina, Moritz Firsching, Gergely Bérczi, Francisco J. R. Ruiz, Arun Suggala, Adam Zsolt Wagner, Eric Wieser, Lei Yu, Aja Huang, Miklós Z. Horváth, Andrew Ferraiuolo, Henryk Michalewski, Edward Lockhart, Codrut Grosu, Thomas Hubert, Matej Balog, Pushmeet Kohli, Swarat Chaudhuri

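Institutions: Google DeepMind; Aarhus University

AI summary: Studies how large language models can generate formal proofs to resolve open mathematical problems, and demonstrates applications and contributions of AI-aided formal proof search in mathematics research.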

English abstract

Large language models (LLMs) increasingly excel at mathematical reasoning, but their unreliability limits their utility in mathematics research. A mitigation is using LLMs to generate formal proofs in languages like Lean. We perform the first large-scale evaluation of this method's ability to solve open problems. Our most capable agent autonomously resolved 9 of 353 open Erdős problems at the per-problem cost of a few hundred dollars, proved 44/492 OEIS conjectures, and is being deployed in combinatorics, optimization, graph theory, algebraic geometry, and quantum optics research. A basic agent alternating LLM-based generation with Lean-based verification replicated the Erdős successes but proved costlier on the hardest problems. These findings demonstrate the power of AI-aided formal proof search and shed light on the agent designs that enable it.
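The "basic agent" that alternates generation with verification can be sketched as a simple loop. This is only an illustration under stated assumptions: `generate` and `verify` are hypothetical callables standing in for an LLM and a Lean checker, and the toy versions below do not call either.

```python
def proof_search(statement, generate, verify, max_rounds=8):
    """Alternate candidate generation and checking until a candidate verifies."""
    feedback = ""
    for _ in range(max_rounds):
        candidate = generate(statement, feedback)
        ok, message = verify(statement, candidate)
        if ok:
            return candidate
        # Accumulate checker messages so the generator can react to them.
        feedback += message + "\n"
    return None  # budget exhausted without a verified proof


def toy_generate(statement, feedback):
    """Hypothetical stand-in for an LLM: try the next tactic after each rejection."""
    tactics = ["simp", "omega", "rfl"]
    attempts = feedback.count("rejected")
    return tactics[min(attempts, len(tactics) - 1)]


def toy_verify(statement, candidate):
    """Hypothetical stand-in for a proof checker: accepts only one fixed string."""
    return candidate == "rfl", f"{candidate} rejected"


proof = proof_search("2 + 2 = 4", toy_generate, toy_verify)
print(proof)
```

The property the loop relies on is that a candidate is returned only after the verifier accepts it, so an unreliable generator cannot produce a false positive as long as the verifier is sound.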

2605.11314 2026-06-09 cs.CV cs.AI Updated version

Quantifying Rodda and Graham Gait Classification from 3D Markerless Kinematics derived from a Single-view Video in a Heterogeneous Pediatric Clinical Cohort

Lauhitya Reddy, Seth Donahue, Jeremy Bauer, Susan Sienko, Anita Bagley, Joseph Krzak, Maura Eveld, Karen Kruger, Ross Chafetz, Vedant Kulkarni, Hyeokhyen Kwon

Institutions: Department of Biomedical Informatics, Emory University; Shriners Children’s; The Wallace H. Coulter Department of Biomedical Engineering, Emory University and Georgia Institute of Technology

AI summary: Proposes a markerless gait analysis method based on single-view video that quantifies the knee and ankle z-scores used in the Rodda and Graham gait classification, enabling scalable, objective gait assessment in low-resource clinical settings.

Comments 29 pages, 8 figures, 9 tables (including 1 supplementary table); manuscript prepared in PLOS ONE format

English abstract

Cerebral Palsy (CP) is a neurological disorder of movement and the most common cause of lifelong physical disability in childhood. Approximately 75% of children with CP are ambulatory, and accurate gait assessment is central to preserving walking function, which deteriorates by mid-adulthood in a quarter to half of adults with CP. The Rodda and Graham classification system quantifies sagittal-plane gait deviations using ankle and knee z-scores derived from 3D Instrumented Gait Analysis (3D-IGA), but 3D-IGA is expensive and limited to specialized centers, while observational assessment shows only moderate inter-rater agreement. We developed a markerless gait analysis pipeline that quantifies Rodda and Graham knee and ankle z-scores directly from single-view clinical gait videos. Across 1,058 bilateral limb samples from 529 trials of 152 children (88 male, 63 female; age 12.1 $\pm$ 4.0 years; 60 distinct primary diagnoses, cerebral palsy the most common at $n=54$), the sagittal-view model achieved $R^2 = 0.80 \pm 0.02$ and CCC $= 0.89 \pm 0.02$ for knee z-scores and $R^2 = 0.57 \pm 0.02$ and CCC $= 0.72 \pm 0.02$ for ankle z-scores against 3D-IGA. Binary screening for excess knee flexion achieves AUROC $= 0.88$, correctly identifying 83% of affected children, and applying Rodda and Graham rules yields $43 \pm 1$% 7-class accuracy with macro-AUROC $= 0.78 \pm 0.01$, ankle prediction error remaining the primary bottleneck. Beyond cross-sectional screening, continuous z-scores support longitudinal trajectory tracking across visits, providing a quantitative substrate for monitoring disease progression and treatment response unavailable from observational scales. These results demonstrate the feasibility of video-based z-score estimation, excess-flexion screening, and longitudinal trajectory tracking as a path toward scalable, objective gait assessment in low-resource clinical settings.
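For readers unfamiliar with the agreement metric reported above, the following sketch computes a z-score against a reference population and Lin's concordance correlation coefficient (CCC) between two sets of measurements. The numbers are made up for illustration and are not from the study; the normative mean and standard deviation are hypothetical parameters.

```python
def z_score(value, ref_mean, ref_sd):
    """Deviation of a measurement from a reference mean, in reference SDs."""
    return (value - ref_mean) / ref_sd


def concordance_cc(x, y):
    """Lin's CCC: 2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))**2)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    return 2 * cov / (var_x + var_y + (mean_x - mean_y) ** 2)


# Illustrative values only: video-based estimates vs. a reference method.
reference = [1.0, 2.0, 3.0]
shifted = [2.0, 3.0, 4.0]  # perfectly correlated but offset by +1

print(z_score(30.0, 20.0, 5.0))
print(concordance_cc(reference, reference))
print(concordance_cc(reference, shifted))
```

Unlike Pearson correlation, CCC penalizes a constant offset: the shifted series is perfectly correlated with the reference yet scores well below 1, which is why CCC is a natural companion to $R^2$ when two measurement methods are compared.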

2605.22079 2026-06-09 cs.CL Updated version

Ishigaki-IDS-Bench: A Benchmark for Generating Information Delivery Specification from BIM Information Requirements

Ryo Kanazawa, Koyo Hidaka, Teppei Miyamoto, Takayuki Kato, Tomoki Ando, Chenguang Wang, Dayuan Jiang, Naofumi Fujita, Shuhei Saitoh, Atomu Kondo, Koki Arakawa, Daiho Nishioka

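Institutions: ONESTRUCTION Inc.; AWS GenAI Innovation Center

AI summary: Proposes Ishigaki-IDS-Bench, a benchmark for evaluating how well large language models generate industry-standard Information Delivery Specification (IDS) XML. Using 166 examples written and validated by BIM/IDS experts, and combining content-fidelity evaluation with structural auditing, it shows the limits of current LLMs in generating XML that satisfies both the IDS standard and IFC vocabulary constraints.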

Comments 7 pages; benchmark data and evaluation scripts are available on GitHub and Hugging Face

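AI-generated Chinese abstract (translated to English)

Large language models (LLMs) are widely used to generate structured outputs such as JSON, SQL, and code, but public resources remain limited for evaluating generation that must satisfy both an industry-standard XML format and domain vocabulary constraints. This paper presents Ishigaki-IDS-Bench, a benchmark for evaluating the generation of Information Delivery Specification (IDS) XML from BIM information requirements. The benchmark contains 166 examples written and validated by BIM/IDS experts, obtained by expanding 83 practical scenarios into Japanese and English, each paired with a gold IDS file and metadata covering input format, language, turn setting, IFC version, and construction domain. Evaluation combines IDSAuditTool-based audits of processability, structure, and content with a content-fidelity assessment against the gold IDS files. In zero-shot evaluation, the best-performing of 10 LLMs reaches a macro-F1 of 65.6% on content fidelity, but only 27.7% of outputs pass the content audit. These results indicate that current LLMs can express part of the information requirements as IDS but still struggle to reliably generate XML that satisfies the IDS standard and IFC vocabulary constraints. Ishigaki-IDS-Bench supports comparative evaluation, failure analysis, and the development of constrained structured-generation methods that conform to domain standards. The evaluation scripts and benchmark data are released on GitHub and Hugging Face under the CC BY 4.0 license.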

English abstract

Building Information Modeling (BIM) projects increasingly use Information Delivery Specification (IDS) to formalize information requirements in a machine-checkable XML format. Because IDS conditions are grounded in the Industry Foundation Classes (IFC) vocabulary, authoring them requires expertise in IFC concepts, validation tools, and property set conventions. Existing benchmarks for structured generation do not adequately capture the additional burden of vocabulary conformance and external-validator agreement that IDS imposes. We present Ishigaki-IDS-Bench, the first publicly released benchmark for IDS generation from BIM information requirements. The benchmark contains 166 examples spanning 83 practical scenarios authored in Japanese and English by six BIM/IDS experts, each paired with a gold IDS file and metadata covering input format, turn setting, target IFC versions, and construction domain. Evaluation proceeds in two stages: (i) formal validity scored by the buildingSMART IDSAuditTool along Processability, Structure, and Content, and (ii) content fidelity scored by facet-level macro-F1 against the gold IDS. Across 10 LLMs in zero-shot, the highest Facet F1 is 65.6%, achieved by GPT-5.5, while the highest Content pass rate is only 33.1%, achieved by Claude Opus 4.5. Ishigaki-IDS-Bench is released on Hugging Face (DOI 10.57967/hf/8873) under CC BY 4.0, and the evaluation code is released on Zenodo (DOI 10.5281/zenodo.20550510) under Apache-2.0.
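As a rough illustration of the second evaluation stage, the sketch below scores a predicted set of facets against a gold set and averages F1 over facet types. The tuple representation and the choice to macro-average over facet types are assumptions made here for illustration; the benchmark's own matching rules may differ.

```python
def f1(pred, gold):
    """Set-based F1; two empty sets count as a perfect match."""
    if not pred and not gold:
        return 1.0
    true_pos = len(pred & gold)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)


def facet_macro_f1(pred, gold):
    """Average F1 over every facet type that appears in pred or gold."""
    facet_types = sorted({facet_type for facet_type, _ in pred | gold})
    if not facet_types:
        return 1.0
    per_type = [
        f1({f for f in pred if f[0] == t}, {f for f in gold if f[0] == t})
        for t in facet_types
    ]
    return sum(per_type) / len(per_type)


# Hypothetical (facet_type, value) pairs for one wall requirement.
gold = {
    ("entity", "IFCWALL"),
    ("property", "Pset_WallCommon.FireRating"),
    ("property", "Pset_WallCommon.IsExternal"),
}
pred = {
    ("entity", "IFCWALL"),
    ("property", "Pset_WallCommon.FireRating"),
    ("material", "concrete"),
}

score = facet_macro_f1(pred, gold)
print(round(score, 4))
```

In this toy case the entity facet matches exactly, one of two gold properties is recovered, and the spurious material facet scores zero. A score like this measures content fidelity only and says nothing about whether the XML would pass a validator, which is why the abstract reports the audit pass rate separately.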

2605.21854 2026-06-09 cs.CV cs.AI Updated version

CrossVLA: Cross-Paradigm Post-Training and Inference Optimization for Vision-Language-Action Models

Zhi Liu

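Institutions: Tianjin University

AI summary: Studies cross-paradigm post-training for vision-language-action (VLA) models and proposes the CrossVLA framework. It introduces a surrogate log-probability estimator for continuous-action flow matching, compares LoRA and DoRA as parameter-efficient layers, and shows that the denoise loop dominates inference latency, reporting notable gains on the LIBERO benchmark.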

Comments Workshop draft, 14 pages, 4 figures. Code, ckpts, data: https://github.com/lz-googlefycy/vla-lab

English abstract

Vision-Language-Action (VLA) models have rapidly converged on a small set of architectural patterns: discrete-token autoregression (e.g. OpenVLA) and continuous-action flow-matching (e.g. pi-0.5). Yet preference alignment via Direct Preference Optimisation (DPO) -- the de-facto post-training step in language models -- has been studied almost exclusively on autoregressive VLAs. We present CrossVLA, an empirical study of cross-paradigm VLA post-training. Three contributions: (i) a surrogate flow-matching log-probability estimator that lets DPO operate on continuous-action backbones without probability-flow ODE integration; (ii) a head-to-head comparison of LoRA and DoRA as the parameter-efficient layer for VLA DPO, finding DoRA improves over OpenVLA SFT by a mean +10.4 pp across LIBERO 4-suite (600 trials, 3 seeds) -- per-suite +20.0 Object, +11.0 Long-horizon, +8.0 Goal, +2.7 Spatial -- with zero seed variance on Object (38/50 on each of 3 seeds); (iii) an inference-time anatomy showing the denoise loop dominates 78.6% of sample_actions latency and prefix-K/V caching a la VLA-Cache caps at a 21% acceleration ceiling -- both chunk-level and token-level cache strategies degrade success rate to 0-80% in our benchmarks. We further pretrain a multi-view + temporal projection head on 6000 LIBERO frames, achieving 99.5% k-NN recall@1 for same-task retrieval (36x over random), available as a downstream initialisation. All code, ckpts, training logs, and reproduction scripts are open at https://github.com/lz-googlefycy/vla-lab.
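The DPO objective referred to above needs a log-probability for the preferred and the rejected output under both the policy and a frozen reference model. The sketch below shows the standard per-pair DPO loss with those four log-probabilities passed in as plain numbers. How CrossVLA's surrogate estimator produces them for flow-matching backbones is not described in the abstract and is not modeled here.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair: -log(sigmoid(margin))."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return softplus(-margin)


# Policy identical to the reference: the margin is zero and the loss is log(2).
neutral = dpo_loss(-1.0, -1.0, -1.0, -1.0)

# Policy raises the chosen action and lowers the rejected one: the loss drops.
improved = dpo_loss(-0.5, -2.0, -1.0, -1.0, beta=1.0)

print(neutral, improved)
```

Because the loss depends only on log-probability differences, any estimator that is consistent up to an additive constant shared by the policy and the reference would leave it unchanged. That is one plausible reason a surrogate can be enough, though this reading is an assumption and not a claim from the paper.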

2605.20735 2026-06-09 cs.CV cs.LG Updated version

Lowering the Barrier to IREX Participation: Open-Source Algorithms, Toolkit, and Benchmarking for Iris Recognition

Siamul Karim Khan, Patrick J. Flynn, Adam Czajka

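Institutions: University of Notre Dame

AI summary: Proposes two new open-source iris recognition algorithms, with Python implementations and IREX-compliant C++ implementations submitted to the official IREX X program. The study aims to evaluate open-source iris recognition solutions under the IREX testing protocols for the first time and to provide a model C++ submission that makes it substantially easier for other teams' open-source methods to enter IREX evaluation. The new methods are two neural networks, one trained with a triplet loss and batch-hard triplet mining (TripletIris) and one with the ArcFace loss (ArcIris). The paper also provides open-source, IREX-compliant C++ implementations of two existing methods: an iris image filtering-based algorithm using human saliency-driven kernels (HDBIF) and a human-interpretable algorithm for detecting and comparing Fuchs' crypts (CRYPTS). All methods except CRYPTS, which runs into time limits in 1:N search, have passed the official IREX X evaluation and were also evaluated on several popular academic benchmarks. Finally, the paper provides open-source iris segmentation and circle estimation models that can be used in any new iris recognition method.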

English abstract

NIST Iris Exchange (IREX) offers an appealing solution to evaluating new open-source iris recognition algorithms, but it presents high barriers to entry because these algorithms must be written in C++, using a specific API, and adapted to meet strict IREX speed and memory constraints. The main goal of this paper is to lower these barriers and advance open-source iris recognition large-scale evaluations by offering: (a) two new modern deep learning-based open-source iris matchers (ArcIris and TripletIris), along with their C++ IREX X-compliant implementations, which are the first open-source iris recognition methods included into the IREX X leaderboard (and thus IREX-vetted), as well as new segmentation and iris circular approximation models that can be incorporated into any new iris recognition method, and (b) a performance assessment (according to IREX X testing protocols) of all major and currently available open-source iris recognition solutions. The paper also provides Python implementations of the new ArcIris and TripletIris methods and discusses the differences one may encounter between C++ and Python implementations of the same conceptually equivalent approaches. Finally, the paper offers open-source, IREX X-compliant C++ implementations of two existing methods: (a) an iris image filtering-based algorithm utilizing human saliency-driven kernels (HDBIF), and (b) a human-interpretable algorithm for detecting and comparing Fuchs' crypts (CRYPTS). In addition to IREX X evaluation results, the paper reports the performance of all methods on major academic benchmarks: Quality-Face/Iris Research Ensemble (Q-FIRE), Warsaw-Biobase Post-Mortem Iris, CASIA-Iris-Thousand-V4, CASIA-Iris-Lamp-V4, IIT Delhi Iris Database, IIITD Contact Lens Iris Database, NDIris3D, and Notre Dame Variable Iris Image Quality Release 2 (VII-Q-R2).
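The digest summary for this entry describes TripletIris as trained with a triplet loss and batch-hard triplet mining. A minimal sketch of that loss follows, using plain Python lists as embeddings and Euclidean distance; the margin value and the 2-D toy embeddings are hypothetical, and the paper's network and training setup are not reproduced.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def batch_hard_triplet_loss(embeddings, labels, margin=0.2):
    """For each anchor, pair its farthest positive with its closest negative."""
    losses = []
    for i, (anchor, label) in enumerate(zip(embeddings, labels)):
        positives = [euclidean(anchor, e)
                     for j, (e, l) in enumerate(zip(embeddings, labels))
                     if l == label and j != i]
        negatives = [euclidean(anchor, e)
                     for e, l in zip(embeddings, labels) if l != label]
        if not positives or not negatives:
            continue  # anchor has no valid triplet in this batch
        losses.append(max(0.0, max(positives) - min(negatives) + margin))
    return sum(losses) / len(losses) if losses else 0.0


labels = ["eye_a", "eye_a", "eye_b", "eye_b"]

# Classes interleaved on a line: every anchor violates the margin.
mixed = [(0.0, 0.0), (2.0, 0.0), (1.0, 0.0), (3.0, 0.0)]
# Classes well separated: no anchor violates the margin.
separated = [(0.0, 0.0), (0.0, 1.0), (3.0, 0.0), (3.0, 1.0)]

print(batch_hard_triplet_loss(mixed, labels))
print(batch_hard_triplet_loss(separated, labels))
```

Mining the hardest positive and negative inside each batch concentrates the gradient on the triplets that currently violate the margin, at the cost of being sensitive to label noise.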

2604.24199 2026-06-09 cs.SD cs.AI eess.AS eess.SP Updated version

Speech Enhancement Based on Drifting Models

Liang Xu, Diego Caviedes-Nozal, W. Bastiaan Kleijn, Longfei Felix Yan, Rasmus Kongsgaard Olsson

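Institutions: Victoria University of Wellington; Lincoln University; GN Advanced Science

AI summary: Proposes DriftSE, a speech enhancement framework based on drifting models that formulates denoising as an equilibrium problem and achieves one-step inference, enabling high-quality speech enhancement without paired data.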

Comments 6 pages, 2 figures

English abstract

We propose Speech Enhancement based on Drifting Models (DriftSE), a novel generative framework that formulates denoising as an equilibrium problem. Rather than relying on iterative sampling, DriftSE natively achieves one-step inference by evolving the pushforward distribution of a mapping function to directly match the clean speech distribution. This evolution is driven by a Drifting Field, a learned correction vector that guides samples toward the high-density regions of the clean distribution, which naturally facilitates training on unpaired data by matching distributions rather than paired samples. We investigate the framework under two formulations: a direct mapping from the noisy observation, and a stochastic conditional generative model from a Gaussian prior. Experiments on the VoiceBank-DEMAND benchmark demonstrate that DriftSE achieves high-fidelity enhancement in a single step, outperforming multi-step diffusion baselines and establishing a new paradigm for speech enhancement.

2605.20341 2026-06-09 cs.LG cs.AI cs.CR cs.PF Updated version

Causal Unlearning in Collaborative Optimization: Exact and Approximate Influence Reversal under Adversarial Contributions

Ali Mahdavi, Azadeh Zamanifar, Amirfarhad Farhadi, Omid Kashefi

Institutions: Department of Computer Engineering, SRC, Islamic Azad University, Tehran, Iran; School of Computer Engineering, Iran University of Science and Technology, Tehran, Iran; Meta, CA, USA

AI summary: Proposes HF-KCU, which approximates the influence function with conjugate gradient iterations in Krylov subspaces to support data deletion in collaborative optimization, reducing computational complexity and improving privacy protection.

English abstract

Federated learning systems must support data deletion requests to comply with privacy regulations, yet retraining from scratch after each deletion is computationally prohibitive. We present HF-KCU, a method that removes a client's contribution by approximating the influence function through conjugate gradient iterations in Krylov subspaces, reducing complexity from $O(d^3)$ to $O(kd)$ where $k \ll d$. A causal weighting mechanism ensures that only clients holding the deleted data receive parameter updates, preventing spurious changes to unaffected clients. Our method is designed to handle bounded adversarial perturbations to the Hessian and gradient, providing graceful degradation under realistic threat models. We validate HF-KCU across convolutional (ResNet-18, SimpleCNN) and transformer (ViT-Lite) architectures on CIFAR-10, MNIST, and Fashion-MNIST. On CIFAR-10 under Dirichlet (alpha=0.5) partitioning, HF-KCU achieves a 47.75 times speedup over retraining while maintaining test accuracy within 0.60% of the retrained baseline (71.16% vs 71.76%). Membership inference attacks on the forget set yield success rates of 0.499, matching the retrained model and confirming effective privacy restoration. We provide convergence guarantees showing that the Krylov approximation error decreases as $O\big((\sqrt{k}-1)/(\sqrt{k}+1)\big)$, where $k$ here denotes the Hessian condition number. The causal weighting mechanism ensures surgical updates, where only clients holding deleted data are modified, preserving model quality for unaffected participants and avoiding the instability of gradient-based approaches in asynchronous federated settings. This design provides interpretability as each update is directly traceable to the influence of the deleted data. The method's efficiency and precision make it suitable for production federated systems where deletion requests arrive asynchronously and computational budgets are constrained.
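An influence-function update needs the product of an inverse Hessian with a gradient. Conjugate gradient (CG) obtains it by solving $Hx = g$ using only Hessian-vector products, so the $d \times d$ Hessian is never inverted. The sketch below is textbook CG on a tiny symmetric positive-definite system. The matrix, the right-hand side, and the `hvp` callable are illustrative assumptions and not HF-KCU itself, which adds causal weighting and robustness terms on top.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(hvp, g, max_iters=50, tol=1e-10):
    """Solve H x = g given only a Hessian-vector product function hvp(v)."""
    x = [0.0] * len(g)
    residual = list(g)          # r = g - H x, with x = 0 initially
    direction = list(residual)
    rs_old = dot(residual, residual)
    for _ in range(max_iters):
        if rs_old ** 0.5 < tol:
            break               # residual is small enough
        h_dir = hvp(direction)
        step = rs_old / dot(direction, h_dir)
        x = [xi + step * di for xi, di in zip(x, direction)]
        residual = [ri - step * hi for ri, hi in zip(residual, h_dir)]
        rs_new = dot(residual, residual)
        direction = [ri + (rs_new / rs_old) * di
                     for ri, di in zip(residual, direction)]
        rs_old = rs_new
    return x


# Toy symmetric positive-definite "Hessian" and "gradient" (illustrative only).
H = [[4.0, 1.0], [1.0, 3.0]]
g = [1.0, 2.0]


def hvp(v):
    """Hessian-vector product for the toy matrix H."""
    return [dot(row, v) for row in H]


x = conjugate_gradient(hvp, g)
print(x)
```

Each iteration costs one Hessian-vector product. When that product can be computed in time linear in the parameter count, truncating after $k$ iterations gives the $O(kd)$ cost quoted in the abstract, and the quality of the truncated solution depends on the conditioning of $H$.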

2605.19674 2026-06-09 cs.AI Updated version

Beyond Rational Illusion: Behaviorally Realistic Strategic Classification

Xinpeng Lv, Yunxin Mao, Renzhe Xu, Chunyuan Zheng, Yikai Chen, Haoxuan Li, Yang Shi, Jinxuan Yang, Zhouchen Lin, Yuanlong Chen, Yuanxing Zhang, Shaowu Yang, Wenjing Yang, Haotian Wang

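Institutions: University of Science and Technology of China

AI summary: Proposes a behaviorally realistic strategic classification framework grounded in prospect theory, to handle strategic manipulation by real-world agents whose decisions are shaped by psychological biases.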

Comments Accepted by ICML2026

English abstract

Strategic classification (SC) studies the interaction between decision models and agents who strategically manipulate their features for favorable outcomes. Existing SC frameworks typically rely on the idealized assumption that agents are strictly rational. However, evidence from behavioral economics and psychology consistently shows that real-world decision-making is often shaped by cognitive biases, deviating from pure rationality. To formalize this limitation, we identify and define a new problem setting, termed the behaviorally realistic strategic classification problem, where agents' strategic manipulations deviate from full rationality due to psychological biases. Motivated by this limitation, we propose the Prospect-Guided Strategic Framework (Pro-SF), a principled framework grounded in prospect theory, to model and learn under behaviorally realistic strategic responses. Specifically, to capture behaviorally realistic strategic manipulations, our framework reformulates the Stackelberg-style interaction between agents and the decision-maker by incorporating three key mechanisms inspired by prospect theory: the asymmetry between benefits and costs, different subjective reference points, and non-rational probability distortion. Experiments on synthetic and real-world datasets establish Pro-SF as a behaviorally grounded approach to strategic classification, bridging machine learning and behavioral economics for more reliable deployment in the real world.
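The three prospect-theory mechanisms named above have standard textbook forms, sketched below: a value function that is steeper for losses than for gains, evaluated relative to a reference point, and an inverse-S probability weighting function. The functional forms and default parameters are the classic Tversky and Kahneman (1992) estimates and are used here only as an assumption; the abstract does not state which parameterization Pro-SF adopts.

```python
def value(outcome, reference=0.0, alpha=0.88, beta=0.88, loss_aversion=2.25):
    """Subjective value of an outcome relative to a reference point."""
    x = outcome - reference
    if x >= 0:
        return x ** alpha                      # concave for gains
    return -loss_aversion * (-x) ** beta       # steeper for losses


def weight(p, gamma=0.61):
    """Inverse-S probability weighting: small p overweighted, large p underweighted."""
    return p ** gamma / (p ** gamma + (1 - p) ** gamma) ** (1 / gamma)


gain = value(100.0)
loss = value(-100.0)
print(gain, loss)                   # the loss looms larger than the equal gain
print(weight(0.01), weight(0.9))    # 0.01 is overweighted, 0.9 underweighted

# The same outcome feels like a loss to an agent with a higher reference point.
print(value(50.0, reference=80.0))
```

In a strategic classification setting, an agent with these preferences would compare the weighted, reference-dependent benefit of being classified favorably against the felt cost of manipulating features, instead of comparing raw expected utilities. How Pro-SF embeds this in the Stackelberg game is specific to the paper and is not reproduced here.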

2605.19662 2026-06-09 cs.AI Updated version

When Tabular Foundation Models Meet Strategic Tabular Data: A Prior Alignment Approach

Xinpeng Lv, Yunxin Mao, Renzhe Xu, Chunyuan Zheng, Yikai Chen, Haoxuan Li, Jinxuan Yang, Kun Kuang, Yuanlong Chen, Mingyang Geng, Wanrong Huang, Shixuan Liu, Shaowu Yang, Wenjing Yang, Zhouchen Lin, Haotian Wang

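Institutions: University of Science and Technology of China; Tsinghua University

AI summary: Studies how well tabular foundation models generalize to strategic tabular data and proposes SPN, a strategy-aware prior alignment framework that improves robustness and predictive performance in strategic environments.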

Comments Accepted by ICML2026

English abstract

Tabular foundation models based on pretrained prior-data fitted networks (PFNs) have shown strong generalization on diverse tabular tasks, but they are typically designed for non-strategic settings where data distributions are independent of deployed classifiers. In many real-world decision scenarios, however, individuals may strategically modify their features after deployment to obtain favorable outcomes, inducing a post-deployment distribution shift. This paper studies whether PFN-style tabular foundation models can generalize to such strategic tabular data. We show that strategic manipulation creates a mismatch between the non-strategic prior learned during pretraining and the post-manipulation strategic prior, which leads to systematic prediction bias. To address this issue, we propose Strategic Prior-data Fitted Network (SPN), an inference-time strategy-aware framework that adapts tabular foundation models to strategic environments without retraining. SPN constructs strategic in-context examples to approximate post-manipulation inputs and aligns PFN predictions with the induced strategic distribution. Experiments on real-world and synthetic tabular datasets show that SPN consistently improves robustness and predictive performance under strategic manipulation compared with both tabular foundation models and classical tabular methods.

2605.19266 2026-06-09 cs.CL cs.AI Updated version

FormalASR: End-to-End Spoken Chinese to Formal Text

Wanyi Ning, Yinshang Guo, Haitao Qian, Jiyuan Cheng, Weiyuan Feng, Yufei Zhang

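Institutions: arXiv

AI summary: Proposes FormalASR, an end-to-end model that converts spoken Chinese into formal written text. The authors build large-scale spoken-to-formal datasets and fine-tune Qwen3-ASR, achieving up to 37.4% relative CER reduction over verbatim baselines while also improving ROUGE-L and BERTScore, and providing a lightweight on-device solution.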

English abstract

Automatic speech recognition (ASR) systems are typically optimized for verbatim transcription, which preserves disfluencies, filler words, and informal spoken structures that are often unsuitable for downstream writing-oriented applications. A common workaround is a two-stage ASR+LLM pipeline for post-editing, but this design increases latency and memory cost and is difficult to deploy on-device. We present FormalASR, two compact end-to-end models (0.6B and 1.7B) that directly transcribe spoken Chinese into formal written text. To enable this setting, we build WenetSpeech-Formal and Speechio-Formal, two large-scale spoken-to-formal datasets constructed by LLM-based rewriting and quality filtering. We then fine-tune Qwen3-ASR at two scales (0.6B and 1.7B) with supervised fine-tuning. Experiments on WenetSpeech-Formal and Speechio-Formal show that FormalASR achieves up to 37.4% relative CER reduction over verbatim baselines, while also improving ROUGE-L and BERTScore. FormalASR requires no post-processing LLM at deployment time, providing a lightweight, on-device solution for spoken-to-formal transcription.
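To make the headline metric concrete, the sketch below computes character error rate (CER) from Levenshtein distance and then a relative CER reduction. The example sentences and the two CER values passed to `relative_reduction` are made up for illustration and are not results from the paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # delete r
                           cur[j - 1] + 1,           # insert h
                           prev[j - 1] + (r != h)))  # substitute or match
        prev = cur
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edits needed, divided by reference length."""
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline_cer, new_cer):
    """Fraction of the baseline error that the new system removes."""
    return (baseline_cer - new_cer) / baseline_cer


formal_ref = "我们明天开会"        # formal target text
verbatim_hyp = "嗯我们那个明天开会"  # verbatim output keeps filler words
formal_hyp = "我们明天开会"        # output already in formal style

print(cer(formal_ref, verbatim_hyp))
print(cer(formal_ref, formal_hyp))
print(relative_reduction(0.10, 0.0626))
```

Scored against a formal reference, a verbatim transcript is penalized for every filler character it preserves, which is why a model trained to emit formal text directly can post a lower CER than a verbatim baseline on these datasets.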

2605.19228 2026-06-09 cs.CL cs.AI cs.IT cs.LG math.IT Updated version

Diagnosing Multi-step Reasoning Failures in Black-box LLMs via Stepwise Confidence Attribution

Xiaoou Liu, Tiejin Chen, Dengjia Zhang, Yaqing Wang, Lu Cheng, Hua Wei

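Institutions: University of Science and Technology of China

AI summary: Proposes Stepwise Confidence Attribution (SCA) for diagnosing multi-step reasoning failures in black-box large language models. It applies the Information Bottleneck principle to assign confidence to the steps of generated reasoning traces and validates the approach on mathematical reasoning and multi-hop question answering tasks.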

Comments Accepted by ICML 2026

English abstract

Large Language Models have achieved strong performance on reasoning tasks with objective answers by generating step-by-step solutions, but diagnosing where a multi-step reasoning trace might fail remains difficult. Confidence estimation offers a diagnostic signal, yet existing methods are restricted to final answers or require internal model access. In this paper, we introduce Stepwise Confidence Attribution (SCA), a framework for closed-source LLMs that assigns step-level confidence based only on generated reasoning traces. SCA applies the Information Bottleneck principle: steps aligning with consensus structures across correct solutions receive high confidence, while deviations are flagged as potentially erroneous. We propose two complementary methods: (1) NIBS, a non-parametric IB approach measuring consistency without graph structures, and (2) GIBS, a graph-based IB model that learns subgraphs through a differentiable mask to capture logical variability. Extensive experiments on mathematical reasoning and multi-hop question answering show that SCA reliably identifies low-confidence steps strongly correlated with reasoning errors. Moreover, using step-level confidence to guide self-correction improves the correction success rate by up to 13.5% over answer-level feedback.

2605.18856 2026-06-09 cs.LG cs.CL cs.IT math.IT 版本更新

SPHERICAL KV: Angle-Domain Attention and Rate-Distortion Retention for Efficient Long-Context Inference

SPHERICAL KV: 角度域注意力与率失真保持用于高效长上下文推理

Anay Chauhan, Gurucharan Marthi Krishna Kumar, Arion Das, Amit Dhanda, Vinija Jain, Aman Chadha, Amitava Das

发表机构 * Synopsys McGill University(麦吉尔大学) IIIT Ranchi(印度信息技术学院兰契分校) Amazon(亚马逊) Meta Apple(苹果) Pragya Lab, BITS Pilani Goa(Pragya实验室,BITS Pilani果阿校区)

AI总结 提出Spherical KV方法,通过角度域注意力(ADA)和率失真保持(RDR)机制,在长上下文推理中减少KV缓存占用并保持解码效率。

详情
AI中文摘要

长上下文推理日益受到KV缓存的限制:常驻内存随上下文长度增长,解码受限于重复的高带宽内存(HBM)流而非算术运算。现有方法如驱逐、窗口化、量化和卸载减少了占用,但通常仅部分解决了关键路径瓶颈,尤其是在解码期间压缩状态仍需重建为密集向量时。我们提出Spherical KV,一种将KV分配视为基于注意力几何的率失真问题以实现高效解码的长上下文推理方法。该方法基于两个思想:(i) 在解码热循环中廉价地表示方向信息,(ii) 根据估计的未来效用分配保留和精度。其第一个组件,角度域注意力(ADA),将键存储在由标量半径和紧凑角度码组成的球面参数化中,并直接根据这些码计算注意力对数,无需重建密集键。这保留了分页、块局部、融合友好的解码路径,并在实际服务设置中直接针对HBM流量。其第二个组件,率失真保持(RDR),在固定预算下联合选择每个令牌和头的保留/丢弃决策及精度层级,生成层级同质的页面,具有轻量级元数据和合并读取。ADA和RDR共同提供了一种面向部署的机制,在保持解码效率的同时减少KV常驻内存。

英文摘要

Long-context inference is increasingly constrained by the KV cache: resident memory grows with context length, and decoding becomes limited by repeated High Bandwidth Memory (HBM) streaming rather than arithmetic. Existing methods such as eviction, windowing, quantization, and offloading reduce footprint, but often leave the critical-path bottleneck only partially addressed, especially when compressed states must still be reconstructed into dense vectors during decoding. We present Spherical KV, a long-context inference method that treats KV allocation as a rate-distortion problem grounded in attention geometry for efficient decoding. The method is built on two ideas: (i) represent directional information cheaply in the decode hot loop, and (ii) allocate retention and precision according to estimated future utility. Its first component, Angle-Domain Attention (ADA), stores keys in a spherical parameterization consisting of a scalar radius and compact angle codes, and computes attention logits directly from these codes without reconstructing dense keys. This preserves a paged, block-local, fusion-friendly decode path and directly targets HBM traffic in realistic serving settings. Its second component, Rate-Distortion Retention (RDR), jointly chooses keep/drop decisions and precision tiers per token and head under a fixed budget, producing tier-homogeneous pages with lightweight metadata and coalesced reads. Together, ADA and RDR provide a deployment-oriented mechanism for reducing KV residency while preserving decode efficiency.

2605.18643 2026-06-09 cs.LG cs.AI cs.CL 版本更新

Post-Trained MoE Can Skip Half Experts via Self-Distillation

后训练的MoE可通过自蒸馏跳过一半专家

Xingtai Lv, Li Sheng, Kaiyan Zhang, Yichen You, Siyan Gao, Xueheng Luo, Yuxin Zuo, Yuchen Fan, Junlin Yang, Ganqu Cui, Bingning Wang, Fan Yang, Youbang Sun, Ning Ding, Bowen Zhou

发表机构 * Frontis.AI Kuaishou Technology(快手科技) Shanghai AI Lab(上海人工智能实验室) TsinghuaC3I/ZEDA(清华大学C3I/ZEDA)

AI总结 本文提出ZEDA框架,通过自蒸馏将后训练的静态MoE模型转换为高效的动态MoE模型,显著减少专家FLOPs并提升推理速度。

详情
AI中文摘要

混合专家(MoE)通过稀疏专家激活高效地扩展语言模型,其动态变体进一步通过输入依赖的方式调整激活专家以减少计算。现有动态MoE方法通常依赖从头训练或任务特定适应,使完全训练的MoE的实际转换未被充分探索。启用此类适应可直接缓解推理成本,通过允许简单令牌在服务时绕过不必要的专家。本文引入了零专家自蒸馏适应(ZEDA),一种低成本框架,将后训练的静态MoE模型转换为高效的动态MoE模型。为稳定此架构转换,ZEDA在每个MoE层中注入无参数的零输出专家,并通过两阶段自蒸馏适应增强模型,利用原始MoE作为冻结的教师,并应用组级平衡损失。在Qwen3-30B-A3B和GLM-4.7-Flash上跨11个基准测试(涵盖数学、代码和指令跟随)中,ZEDA在边际精度损失下消除了超过50%的专家FLOPs。在两个模型上,ZEDA比最强的动态MoE基线分别高出6.1和4.0个点,并提供约1.20倍的端到端推理加速。

英文摘要

Mixture-of-Experts (MoE) scales language models efficiently through sparse expert activation, and its dynamic variant further reduces computation by adjusting the activated experts in an input-dependent manner. Existing dynamic MoE methods usually rely on pre-training from scratch or task-specific adaptation, leaving the practical conversion of fully trained MoE underexplored. Enabling such adaptation would directly alleviate inference costs by allowing easy tokens to bypass unnecessary experts during serving. This paper introduces Zero-Expert Self-Distillation Adaptation (ZEDA), a low-cost framework that transforms post-trained static MoE models into efficient dynamic ones. To stabilize this architectural conversion, ZEDA injects parameter-free zero-output experts into each MoE layer and adapts the augmented model through two-stage self-distillation, utilizing the original MoE as a frozen teacher and applying a group-level balancing loss. On Qwen3-30B-A3B and GLM-4.7-Flash across 11 benchmarks spanning math, code, and instruction following, ZEDA eliminates over 50% of expert FLOPs at marginal accuracy loss. It outperforms the strongest dynamic MoE baseline by 6.1 and 4.0 points on the two models, and delivers a ~1.20× end-to-end inference speedup.

2605.06317 2026-06-09 cs.CV cs.AI 版本更新

NavOne: One-Step Global Planning for Vision-Language Navigation on Top-Down Maps

NavOne:基于俯视地图的视觉语言导航一步全局规划

Dijia Zhan, Jinyi Li, Chenxi Zheng, Shaoyu Huang, Yong Li, Jie Tang, Xuemiao Xu

发表机构 * South China University of Technology(华南理工大学)

AI总结 本文提出了一种基于俯视地图的视觉语言导航方法,通过引入NavOne框架,在多模态地图上实现单步全局路径规划,显著提升了导航效率和性能。

Comments 10 pages, 7 figures

详情
AI中文摘要

现有的视觉语言导航(VLN)方法通常采用第一人称视角、逐步决策的范式,容易累积误差并限制效率。尽管近期方法尝试利用预先构建的环境地图,但它们通常依赖于增量更新记忆图或对离散路径提案打分,这限制了连续的空间推理并造成离散化瓶颈。我们提出俯视VLN(TD-VLN),将导航重新表述为在预先构建的俯视地图上的一步全局路径规划问题,并构建了新的R2R-TopDown数据集作为支撑。为了解决这个问题,我们引入了NavOne,一个统一的框架,它在单次端到端前向传递中直接预测多模态地图上的密集路径概率。NavOne包含用于联合多模态地图表示的俯视地图融合器(Top-Down Map Fuser),并扩展了Attention Residuals以实现空间感知的深度混合。在R2R-TopDown上的大量实验表明,NavOne在基于地图的VLN方法中达到了最先进的性能,其规划阶段相比现有基于地图的基线加速8倍,相比第一人称视角方法加速80倍,从而实现了高效的全局导航。

英文摘要

Existing Vision-Language Navigation (VLN) methods typically adopt an egocentric, step-by-step paradigm, which struggles with error accumulation and limits efficiency. While recent approaches attempt to leverage pre-built environment maps, they often rely on incrementally updating memory graphs or scoring discrete path proposals, which restricts continuous spatial reasoning and creates discrete bottlenecks. We propose Top-Down VLN (TD-VLN), reformulating navigation as a one-step global path planning problem on pre-built top-down maps, supported by our newly constructed R2R-TopDown dataset. To solve this, we introduce NavOne, a unified framework that directly predicts dense path probabilities over multi-modal maps in a single end-to-end forward pass. NavOne features a Top-Down Map Fuser for joint multi-modal map representation, and extends Attention Residuals for spatial-aware depth mixing. Extensive experiments on R2R-TopDown show that NavOne achieves state-of-the-art performance among map-based VLN methods, with a planning-stage speedup of 8x over existing map-based baselines and 80x over egocentric methods, enabling highly efficient global navigation.

2605.17609 2026-06-09 cs.LG 版本更新

Adaptive Generate-Rank-Verify: Inference-Time Search with Costly Verification

自适应生成-排序-验证:具有高成本验证的推理时间搜索

Shaddin Dughmi, Mahdi Haghifam, Yusuf Hakan Kalayci

发表机构 * University of Southern California(南加州大学) Northwestern University(西北大学) University of Chicago(芝加哥大学) Toyota Technological Institute at Chicago(芝加哥丰田技术研究所) Simons Institute for the Theory of Computing(Simons计算理论研究所) Data Science Institute at the University of Chicago(芝加哥大学数据科学研究所)

AI总结 本文提出了一种自适应生成-排序-验证方法,通过在未知分布下自适应地生成和验证候选答案,以在保证成本的前提下找到正例,同时通过理论分析和实验验证了该方法在数学推理和编程竞赛中的有效性。

Comments 33 Pages, 6 Figures, 4 Tables. Changes compared to V1: updated the related work section

详情
AI中文摘要

许多推理时语言模型流水线将低成本的奖励信号与高成本的验证器结合起来,例如数学推理中的精确答案检查或代码生成中的隐藏测试执行。我们从学习理论的视角将这一设置形式化为生成式主动搜索:一个成本敏感的首个正例搜索问题,其中策略自适应地从未知分布中采样候选,观察低成本评分,并为验证器标签付费,直到找到正例。对于固定的提示,生成器和奖励模型诱导出两个未知对象:奖励分数上的分布和以分数为条件的成功函数。当这些量已知时,我们使用动态规划方法刻画分布感知的最优策略。在更贴近实际的设置中,评分分布和成功函数都未知,我们提出ADAP,一种逐壳层(shellwise)自适应的生成-排序-验证算法,逐步增加采样的响应数量和对排名靠前候选的验证次数。在单调性假设下,即更高的奖励分数通过验证的可能性不会更低,我们证明ADAP的期望成本在分布感知最优值的常数因子之内。我们还基于中心化星数(centered star number)给出学习理论下界,表明对评分-标签关系的结构性假设是必要的。在数学推理和竞赛编程上的实验验证了ADAP相对于固定非自适应策略和难度自适应基线的预期优势。

英文摘要

Many inference-time language-model pipelines combine a cheap reward signal with an expensive verifier, such as exact answer checking in mathematical reasoning or hidden-test execution in code generation. We formalize this setting using a learning-theoretic lens as generative active search: a cost-sensitive first-positive search problem in which a policy adaptively samples candidates from an unknown distribution, observes cheap scores, and pays for verifier labels until it finds a positive example. For a fixed prompt, the generator and reward model induce two unknown objects: a distribution over reward scores and a score-conditioned success function. When these quantities are known, we characterize the distribution-aware optimal policy using a dynamic programming approach. In the realistic and practical setting where both the score distribution and success function are unknown, we propose ADAP, a shellwise adaptive generate-rank-verify algorithm that progressively increases the number of sampled responses and top-ranked verifications. Under the monotonicity assumption that higher reward scores are no less likely to pass verification, we show that ADAP achieves expected cost within a constant factor of the distribution-aware optimum. We complement this result with learning-theoretic lower bounds, based on a centered star number, showing that structural assumptions on the score--label relationship are necessary. Experiments on mathematical reasoning and competitive programming validate the predicted advantage over both fixed non-adaptive policies and difficulty-adaptive baselines.
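
The generate-rank-verify loop described above can be illustrated with a small sketch. This is not the paper's ADAP algorithm: the doubling sample schedule, the linearly growing verification budget, the cost constants, and the names `sample` and `verify` are all assumptions chosen only to show how cheap scores can gate expensive verifier calls.

```python
def generate_rank_verify(sample, verify, sample_cost=1.0, verify_cost=10.0, max_rounds=10):
    """Generic first-positive search with a cheap score and a costly verifier.

    sample() -> (score, candidate); verify(candidate) -> bool.
    Hypothetical schedule: round t draws 2**t new candidates, then verifies
    the (t + 1) highest-scoring candidates that have not been verified yet.
    """
    pool = []   # unverified (score, candidate) pairs
    cost = 0.0
    for t in range(max_rounds):
        for _ in range(2 ** t):
            pool.append(sample())
            cost += sample_cost
        # Rank by the cheap score, best first.
        pool.sort(key=lambda pair: pair[0], reverse=True)
        to_check, pool = pool[: t + 1], pool[t + 1:]
        for _score, candidate in to_check:
            cost += verify_cost
            if verify(candidate):
                return candidate, cost
    return None, cost


if __name__ == "__main__":
    import random

    rng = random.Random(0)

    def sample():
        quality = rng.random()
        noisy_score = quality + rng.gauss(0.0, 0.2)  # cheap, imperfect reward signal
        return noisy_score, quality

    found, total_cost = generate_rank_verify(sample, lambda q: q > 0.8)
    print(found, total_cost)
```

Under the abstract's monotonicity assumption, verifying the highest-scoring candidates first is the natural order; how fast to grow the two budgets is exactly the design question the paper analyzes, and the schedule here is only a placeholder.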

2605.02439 2026-06-09 cs.CV cs.LG 版本更新

Anomaly-Preference Image Generation

异常偏好图像生成

Fuyun Wang, Yuanzhi Wang, Xu Guo, Sujia Huang, Tong Zhang, Dan Wang, Hui Yan, Xin Liu, Zhen Cui

发表机构 * Nanjing University of Science and Technology(南京理工大学) Beijing Normal University, Beijing, China(北京师范大学) China Academy of Space Technology, Beijing, China(中国空间技术研究院)

AI总结 本文提出了一种新的异常生成方法,通过隐式偏好对齐机制和时间感知能力分配模块,提升生成图像的真实性和多样性,实验表明其在真实性和多样性上均优于现有方法。

Comments Accepted by ICML 2026

详情
AI中文摘要

从有限数据中合成逼真且多样的异常样本对于鲁棒的模型泛化至关重要。然而,现有方法难以兼顾保真度和多样性,往往分别受制于分布不匹配和过拟合。为缓解这一问题,我们引入了异常偏好优化,一种将异常生成重新表述为偏好学习问题的新范式。我们方法的核心是隐式偏好对齐机制,它利用真实异常作为正例参考,直接从去噪轨迹偏差中推导优化信号,而无需昂贵的人工标注。此外,我们提出了一个时间感知容量分配模块,沿扩散时间线动态分配模型容量,在高噪声阶段优先考虑结构多样性,在低噪声阶段增强细粒度保真度。在推理过程中,分层采样策略调节连贯性与对齐之间的权衡,实现对生成过程的精确控制。大量实验表明,该方法显著优于现有基线,在真实性和多样性方面均达到了最先进的性能。

英文摘要

Synthesizing realistic and diverse anomalous samples from limited data is vital for robust model generalization. However, existing methods struggle to reconcile fidelity and diversity, often hampered by distribution misalignment and overfitting, respectively. To mitigate this, we introduce Anomaly Preference Optimization, a novel paradigm that reformulates anomaly generation as a preference learning problem. Central to our approach is an implicit preference alignment mechanism that leverages real anomalies as positive references, deriving optimization signals directly from denoising trajectory deviations without requiring costly human annotation. Furthermore, we propose a Time-Aware Capacity Allocation module that dynamically distributes model capacity along the diffusion timeline, prioritizing structural diversity during high-noise phases while enhancing fine-grained fidelity in low-noise stages. During inference, a hierarchical sampling strategy modulates the coherence-alignment trade-off, enabling precise control over generation. Extensive experiments demonstrate that our method significantly outperforms existing baselines, achieving state-of-the-art performance in both realism and diversity.

2605.17301 2026-06-09 cs.CL cs.AI 版本更新

ConflictRAG: Detecting and Resolving Knowledge Conflicts in Retrieval Augmented Generation

ConflictRAG: 检测和解决检索增强生成中的知识冲突

Chenyu Wang, Yueyuan Li, Yingmin Liu, Yang Shu

发表机构 * Zhejiang University(浙江大学)

AI总结 本研究提出ConflictRAG框架,通过两阶段冲突检测模块、熵-TOPSIS框架和冲突感知RAG评分,有效检测和解决检索增强生成中的知识冲突,实验表明其在冲突检测F1和正确性方面优于现有方法。

Comments 6 pages, 6 figures, submitted to IEEE SMC 2026

详情
AI中文摘要

检索增强生成(RAG)系统隐式假设检索到的文档之间相互一致,而这一假设在实践中经常失效。我们提出了ConflictRAG,一种冲突感知的RAG框架,能够在生成答案之前检测、分类和解决知识冲突。该框架包含三项贡献:(1)一个两阶段冲突检测模块,将基于嵌入的轻量级MLP分类器与选择性LLM细化相结合,使API成本降低62%,同时保持90.8%的检测准确率;(2)一个用于数据驱动的来源可信度评估的熵-TOPSIS框架,比手动启发式方法的选择准确率高7.1%;(3)一个用于诊断性评估冲突处理能力的冲突感知RAG评分(CARS)。在三个基准上与六个基线对比的实验表明,冲突检测F1达到88.7%,正确性相比最强的冲突感知基线稳定提升5.3-6.1%,且该流程能够有效迁移到不同的骨干LLM。

英文摘要

Retrieval-Augmented Generation (RAG) systems implicitly assume mutual consistency among retrieved documents -- an assumption that frequently fails in practice. We present ConflictRAG, a conflict-aware RAG framework that detects, classifies, and resolves knowledge conflicts prior to answer generation. The framework introduces three contributions: (1) a two-stage conflict detection module combining a lightweight embedding-based MLP classifier with selective LLM refinement, reducing API costs by 62% while maintaining 90.8% detection accuracy; (2) an Entropy-TOPSIS framework for data-driven source credibility assessment, improving selection accuracy by 7.1% over manual heuristics; and (3) a Conflict-Aware RAG Score (CARS) for diagnostic evaluation of conflict-handling capabilities. Experiments on three benchmarks against six baselines demonstrate 88.7% conflict-detection F1 and consistent 5.3--6.1% correctness gains over the strongest conflict-aware baseline, with the pipeline transferring effectively across backbone LLMs.
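
The Entropy-TOPSIS idea mentioned above combines two standard techniques: entropy weighting (criteria whose values vary more across alternatives get more weight) and TOPSIS ranking (alternatives are scored by relative closeness to an ideal point). The sketch below shows the textbook versions only; the three credibility criteria in the example and the assumption that every criterion is a positive, higher-is-better number are illustrative and not taken from the paper.

```python
import math


def entropy_weights(matrix):
    """Entropy weight method: rows are alternatives, columns are criteria (all > 0)."""
    n, m = len(matrix), len(matrix[0])
    diversities = []
    for j in range(m):
        column = [row[j] for row in matrix]
        total = sum(column)
        probs = [x / total for x in column]
        entropy = -sum(p * math.log(p) for p in probs if p > 0) / math.log(n)
        diversities.append(max(0.0, 1.0 - entropy))
    norm = sum(diversities)
    if norm <= 1e-12:          # every criterion is uninformative: fall back to uniform
        return [1.0 / m] * m
    return [d / norm for d in diversities]


def topsis_scores(matrix, weights):
    """Relative closeness to the ideal solution, in [0, 1]; higher is better."""
    m = len(weights)
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) for j in range(m)]
    weighted = [[weights[j] * row[j] / norms[j] for j in range(m)] for row in matrix]
    best = [max(r[j] for r in weighted) for j in range(m)]
    worst = [min(r[j] for r in weighted) for j in range(m)]
    scores = []
    for r in weighted:
        d_best = math.sqrt(sum((r[j] - best[j]) ** 2 for j in range(m)))
        d_worst = math.sqrt(sum((r[j] - worst[j]) ** 2 for j in range(m)))
        denom = d_best + d_worst
        scores.append(d_worst / denom if denom > 0 else 0.5)
    return scores


if __name__ == "__main__":
    # Hypothetical sources scored on (authority, recency, agreement with other sources).
    sources = [[0.9, 0.8, 0.9], [0.5, 0.4, 0.6], [0.2, 0.1, 0.3]]
    w = entropy_weights(sources)
    print(w, topsis_scores(sources, w))
```

Deriving the weights from the data rather than setting them by hand is what makes the assessment "data-driven" in the sense the abstract contrasts with manual heuristics.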

2605.17289 2026-06-09 cs.LG cs.AI 版本更新

LEAP: Learnable End-to-End Adaptive Pruning of Large Language Models

LEAP:可学习的端到端自适应大型语言模型剪枝

Mohammad Mozaffari, Younes Hourri, Mohammad Rastegari, Mahyar Najibi

发表机构 * University of Maryland(马里兰大学)

AI总结 本文提出LEAP,一种可学习的端到端非结构化剪枝方法,用逐权重的Gumbel-sigmoid伯努利松弛替代按模式分类的参数化,提高了非结构化剪枝的端到端准确率,实验表明在多个LLM家族上平均提升了零样本准确率。

Comments Accepted at the ICML 2026 Workshop on Resource-Adaptive Foundation Model Inference (AdaptFM)

详情
AI中文摘要

非结构化稀疏性如今已被新近的GPU内核和数据流硬件原生加速,瓶颈从推理执行转移到了剪枝算法。最先进的非结构化LLM剪枝方法是源自Optimal Brain Surgeon(最优脑外科)原理的逐层代理方法,它们牺牲了端到端准确性,在高稀疏度下尤甚。MaskLLM和PATCH等端到端替代方案表明可学习掩码可以缩小这一差距,但它们对模式做类别分布的参数化,其规模随每行有效掩码的数量增长,无法移植到非结构化设置。我们引入LEAP,用逐权重的Gumbel-sigmoid伯努利松弛替代这种难以处理的参数化,使端到端非结构化掩码学习变得可行。在参数量从0.5B到8B的五个LLM家族上,在50%和60%稀疏度下,LEAP的六任务平均零样本准确率比ADMM平均高出2.59个点,ADMM是我们实验扫描中最好的逐层基线。

英文摘要

Unstructured sparsity is now natively accelerated by recent GPU kernels and dataflow hardware, shifting the bottleneck from inference execution to the pruning algorithm. State-of-the-art methods for unstructured LLM pruning are layer-wise surrogates derived from the Optimal Brain Surgeon principle, and they sacrifice end-to-end accuracy, especially under aggressive sparsity. End-to-end alternatives such as MaskLLM and PATCH show that learnable masks can close this gap, but their categorical-over-patterns parameterization scales with the number of valid masks per row and does not port to the unstructured setting. We introduce LEAP, which replaces this intractable parameterization with a per-weight Bernoulli-via-Gumbel-sigmoid relaxation that makes end-to-end unstructured mask learning tractable. Across five LLM families from 0.5B to 8B parameters at 50% and 60% sparsity, LEAP improves six-task average zero-shot accuracy by +2.59 points on average over ADMM, the best layer-wise baseline in our sweep.
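
For readers unfamiliar with the relaxation named above, a per-weight Gumbel-sigmoid sample can be written in a few lines. This is a generic sketch of the standard trick, not LEAP's training code: the temperature value, the function names, and the plain-Python setting (no autograd) are assumptions. The key fact it relies on is that the difference of two Gumbel(0, 1) samples follows a Logistic(0, 1) distribution, so thresholding the relaxed value at 0.5 yields a Bernoulli mask with keep probability sigmoid(logit).

```python
import math
import random


def gumbel_sigmoid(logit, tau=0.5, rng=random):
    """Relaxed Bernoulli sample in (0, 1) for one weight's mask logit."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    noise = math.log(u) - math.log(1.0 - u)      # Logistic(0, 1) noise
    return 1.0 / (1.0 + math.exp(-(logit + noise) / tau))


def sample_soft_mask(logits, tau=0.5, rng=random):
    """One relaxed mask value per weight; lower tau pushes values toward 0 or 1."""
    return [gumbel_sigmoid(l, tau, rng) for l in logits]


if __name__ == "__main__":
    rng = random.Random(0)
    print(sample_soft_mask([-3.0, 0.0, 3.0], tau=0.5, rng=rng))
```

Because each weight carries a single logit, the number of mask parameters equals the number of weights, in contrast to a categorical distribution over whole sparsity patterns.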

2605.16928 2026-06-09 cs.CL cs.AI 版本更新

Full Attention Strikes Back: Transferring Full Attention into Sparse within Hundred Training Steps

全注意力再临:在数百次训练步骤内将全注意力转化为稀疏

Yanke Zhou, Yiduo Li, Hanlin Tang, Maohua Li, Kan Liu, Tao Lan, Lin Qu, Yuan Yao, Xiaoxing Ma

发表机构 * Nanjing University(南京大学) Alibaba Group(阿里巴巴集团)

AI总结 本文提出RTPurbo方法,通过利用模型内在稀疏性,在少量训练步骤内实现高效的稀疏注意力,从而在保持接近无损精度的同时,显著提升推理效率。

Comments 20 pages, 9 figures

详情
AI中文摘要

大型语言模型的长上下文推理受到全注意力二次成本的限制。现有的高效替代方法通常依赖于原生稀疏训练或启发式令牌驱逐,导致效率、训练成本和准确性之间存在不理想的权衡。在本文中,我们证明全注意力LLM本质上已经是稀疏的,并且可以通过最小的适应转化为高度稀疏的模型。我们的方法基于三个观察:(1) 只有少量的注意力头真正需要完整的长上下文处理;(2) 长距离检索主要由低维子空间支配,允许相关令牌通过16维索引器高效检索;(3) 有用的令牌预算强烈依赖于查询,使得动态top-p选择比固定top-k稀疏化更合适。基于这些见解,我们提出了RTPurbo,该方法仅保留检索头的完整KV缓存,并引入轻量级令牌索引器进行稀疏注意力。通过利用模型的内在稀疏性,RTPurbo仅在数百次训练步骤内即可实现稀疏化。在长上下文基准和推理任务上的实验表明,RTPurbo在保持接近无损精度的同时,实现了显著的效率提升,包括在100万上下文下的预填充速度提升高达9.36倍,以及解码速度提升约2.01倍。这些结果表明,可以通过标准的全注意力训练获得强大的稀疏推理,而无需昂贵的原生稀疏预训练。

英文摘要

Long-context inference in large language models is bottlenecked by the quadratic cost of full attention. Existing efficient alternatives often rely either on native sparse training or on heuristic token eviction, creating an undesirable trade-off among efficiency, training cost, and accuracy. In this work, we show that full-attention LLMs are already intrinsically sparse and can be transformed into highly sparse models with only minimal adaptation. Our approach is built on three observations: (1) only a small subset of attention heads truly requires full long-context processing; (2) long-range retrieval is governed primarily by a low-dimensional subspace, allowing relevant tokens to be retrieved efficiently with a 16-dimensional indexer; and (3) the useful token budget is strongly query-dependent, making dynamic top-p selection more suitable than fixed top-k sparsification. Based on these insights, we propose RTPurbo, which retains the full KV cache only for retrieval heads and introduces a lightweight token indexer for sparse attention. By exploiting the model's intrinsic sparsity, RTPurbo achieves sparsification with only a few hundred training steps. Experiments on long-context benchmarks and reasoning tasks show that RTPurbo preserves near-lossless accuracy while delivering substantial efficiency gains, including up to a 9.36× prefill speedup at 1M context and about a 2.01× decode speedup. These results suggest that strong sparse inference can be obtained from standard full-attention training without expensive native sparse pretraining.
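
The third observation above, that the useful token budget depends on the query, is easy to see in a toy top-p selector. The sketch below is an illustration under assumptions, not RTPurbo's indexer: the scores stand in for indexer logits, and the threshold p = 0.9 is arbitrary.

```python
import math


def top_p_select(scores, p=0.9):
    """Indices of the smallest token set whose softmax mass reaches p."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]   # subtract the max for stability
    total = sum(exps)
    probs = [e / total for e in exps]
    order = sorted(range(len(scores)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return kept


if __name__ == "__main__":
    peaked = [8.0, 1.0, 0.5, 0.2, 0.1, 0.0]    # one token dominates
    flat = [1.0, 0.9, 1.1, 1.0, 0.95, 1.05]    # relevance spread across tokens
    print(len(top_p_select(peaked)), len(top_p_select(flat)))
```

With a peaked score distribution a single token already covers the mass, while a flat one needs every token; a fixed top-k would over-select in the first case or under-select in the second.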

2605.16823 2026-06-09 cs.LG 版本更新

VQ-Atom: Semantic Discretization of Local Atomic Environments for Molecular Representation Learning

VQ-Atom:面向分子表示学习的局部原子环境语义离散化

Takayuki Kimura

发表机构 * Atoms as Language, LLC(Atoms as Language公司)

AI总结 本文提出VQ-Atom,一种用于分子表示学习的语义离散化框架,通过将连续的原子级图表示转换为对应局部化学环境的离散标记,从而提升分子表示的学习效果。

详情
AI中文摘要

分子表示学习已成为AI驱动药物发现中的核心方法,但现有分子分词如SMILES仍主要是语法性的,无法自然对齐具有化学意义的子结构。在本文中,我们介绍了VQ-Atom,一种语义离散化框架,将连续的原子级图表示转换为对应局部化学环境的离散标记。利用图神经网络嵌入和向量量化,原子被分配到代表化学有意义的原子上下文的代码本条目中。这些离散标记定义了一种适合基于Transformer的预训练的分子语言。我们评估了VQ-Atom在蛋白质-配体相互作用预测中的表现,采用蛋白质冷分割设置且不依赖3D结构信息。实验结果表明,与传统分词方法相比,VQ-Atom在预测性能上始终有所提升,表明语义基础的离散化可以显著增强分子表示学习。我们的发现表明,分词设计本身在使化学领域有效语言建模中起着关键作用。

英文摘要

Large language models succeed by combining large-scale pretraining with meaningful discrete tokens. In molecular machine learning, SMILES is widely used as a token representation, but it is primarily a linearization format for molecular graphs rather than a semantic decomposition of chemistry. We propose VQ-Atom, a semantic tokenization framework that assigns discrete atom-level tokens based on local chemical environments via vector quantization. Unlike SMILES tokens, VQ-Atom tokens encode graph-local chemical context and are aligned with molecular structure. On protein-cold drug--target interaction prediction using the KIBA dataset, VQ-Atom substantially improves global ranking performance, achieving AUROC of 0.79 while substantially outperforming both SMILES-based and continuous molecular representations under an identical downstream architecture. Furthermore, VQ-Atom enables approximately 3 times faster downstream training than continuous atom-level representations by replacing per-atom continuous features with reusable discrete tokens. These results suggest that molecular tokenization is not merely a preprocessing step, but a central design choice. In particular, well-structured tokens can encode substantial chemical semantics, reducing the burden on downstream learning. VQ-Atom can be interpreted as defining a molecular language, where tokens correspond to chemically meaningful atomic environments, suggesting that token design may constitute an additional axis of machine learning research alongside architecture, objectives, and optimization.
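
The tokenization step described above, assigning each atom a discrete token by vector quantization, reduces at inference time to a nearest-codebook lookup. The following is a minimal sketch of that lookup only; the 2-dimensional embeddings and the two-entry codebook are made up for illustration, and the encoder that would produce the atom embeddings as well as the codebook training are out of scope.

```python
def quantize(embeddings, codebook):
    """Map each embedding to the index of its nearest codebook vector (squared L2)."""
    tokens = []
    for emb in embeddings:
        dists = [sum((a - b) ** 2 for a, b in zip(emb, code)) for code in codebook]
        tokens.append(min(range(len(codebook)), key=dists.__getitem__))
    return tokens


if __name__ == "__main__":
    # Hypothetical atom embeddings and a toy codebook of two "environments".
    atom_embeddings = [[0.1, 0.0], [0.9, 1.1], [0.0, 0.2]]
    codebook = [[0.0, 0.0], [1.0, 1.0]]
    print(quantize(atom_embeddings, codebook))
```

Atoms whose local environments embed close together share a token, which is what lets a downstream model reuse discrete tokens instead of recomputing per-atom continuous features.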

2605.16551 2026-06-09 cs.CL 版本更新

PQR: A Framework to Generate Diverse and Realistic User Queries that Elicit QA Agent Failures

PQR:一个生成多样化且逼真用户查询的框架,以引发问答代理失败

Yunan Lu, Luigi Liu, Omar Yahia, Arpit Sharma, Zhou Yu

发表机构 * Columbia University(哥伦比亚大学) University of California San Diego(加州大学圣地亚哥分校) Walmart(沃尔玛)

AI总结 PQR框架通过迭代交互生成多样化且逼真的用户查询,以发现问答代理的失败案例,其方法比现有方法更有效。

详情
AI中文摘要

评估基于LLM的代理仍然具有挑战性,因为识别有意义的失败案例通常需要大量人力来设计真实的测试场景。先前的工作主要关注自动发现由对抗性用户引起的代理失败,而忽略了带有真实用户意图、同样会触发代理失败的查询。我们引入PQR,该框架不仅能针对特定目标(如有用性、安全性等)揭示代理失败,而且生成的查询贴近真实用户的意图。PQR通过两个互补模块的迭代交互运作:查询精炼模块通过改写来探索多样化的查询变体,提示精炼模块则利用先前的反馈推导新的违反目标的策略和真实性策略来精炼提示,进而生成既能引发失败又真实的查询。我们在检测电子商务问答代理的无帮助回复这一任务上评估了PQR。我们的方法多发现了23%-78%的无帮助回复,且生成的查询比先前方法更加多样和真实。

英文摘要

Evaluating LLM-based agents remains challenging because identifying meaningful failure cases often requires substantial human effort to design realistic test scenarios. Prior work primarily focuses on automatically discovering agent failures induced by adversarial users, while overlooking queries with real user intents that also trigger agent failures. We introduce PQR, a framework that not only surfaces agent failures with respect to specific objectives (e.g., helpfulness, safety) but also does so with queries that resemble real users' intents. PQR operates through an iterative interaction between two complementary modules. The query refinement module performs rewrites to explore diverse query variations, while the prompt refinement module uses prior feedback to derive new objective-violating strategies and realism policies for refining prompts, which in turn generate failure-triggering yet realistic queries. We evaluate PQR on detecting an e-commerce QA agent's unhelpful responses. Our method uncovers 23%-78% more unhelpful responses, and our generated queries are more diverse and realistic compared to previous methods.

2605.16309 2026-06-09 cs.AI cs.LG cs.MA 版本更新

ANNEAL: Adapting LLM Agents via Governed Symbolic Patch Learning

ANNEAL:通过受控符号补丁学习适应大语言模型代理

Safayat Bin Hakim, Keyan Guo, Wenkai Tan, Alvaro Velasquez, Shouhuai Xu, Houbing Herbert Song

发表机构 * University of Maryland, Baltimore County(马里兰大学巴尔的摩县分校) University at Buffalo(布法罗大学) University of Colorado Boulder(科罗拉多大学博尔德分校) University of Colorado Colorado Springs(科罗拉多大学科罗拉多斯普林斯分校)

AI总结 ANNEAL通过受控符号补丁学习适应大语言模型代理,解决重复故障问题,其核心机制FDKA能定位责任操作符并生成类型补丁,实现持久结构修复,优于现有方法。

Comments Code Implementation: https://github.com/sbhakim/anneal-agents

详情
AI中文摘要

基于大语言模型的代理可以从单次执行错误中恢复,但当底层的过程知识(操作符模式、前置条件和约束)未被修复时,它们会在同一故障上反复失败。现有的自我进化方法通过更新提示、记忆或模型权重来弥补这一差距,但都没有直接修复编码任务执行方式的符号结构,也很少提供安全部署所需的治理保证。我们引入ANNEAL,一种神经符号代理,它在不修改基础模型权重的情况下,将反复出现的失败转化为对过程知识图谱的受控符号编辑。其核心机制故障驱动知识获取(FDKA)先定位负有责任的操作符,通过受约束的LLM生成合成类型化补丁,并在提交之前通过多维评分、符号护栏和金丝雀测试验证该提案。每条被接受的编辑都带有完整的溯源信息和确定性回滚能力。在四个领域和27次多种子运行中,ANNEAL是唯一能提交持久结构性修复的被评估系统:ReAct和Reflexion等强基线的单回合恢复率很高,但在反复出现的故障上仍有72-100%的留出失败率,而ANNEAL在所测试的重复故障设置中将其降至0%。消融实验证实,移除FDKA会消除所有结构性修复,并使成功率最多下降26.7个百分点。这些结果表明,受控符号修复为持久性故障消除提供了一种与权重级和提示级适应互补的范式。

英文摘要

LLM-based agents can recover from individual execution errors, yet they repeatedly fail on the same fault when the underlying process knowledge--operator schemas, preconditions, and constraints--remains unrepaired. Existing self-evolving approaches address this gap by updating prompts, memory, or model weights, but none directly repair the symbolic structures that encode how tasks are executed, and few provide the governance guarantees required for safe deployment. We introduce ANNEAL, a neuro-symbolic agent that converts recurring failures into governed symbolic edits of a process knowledge graph without modifying foundation model weights. Its core mechanism, Failure-Driven Knowledge Acquisition (FDKA), localizes the responsible operator, synthesizes a typed patch through constrained LLM generation, and validates the proposal via multi-dimensional scoring, symbolic guardrails, and canary testing before commit. Every accepted edit carries full provenance and deterministic rollback capability. Across four domains and 27 multi-seed runs, ANNEAL is the only evaluated system that commits persistent structural repairs--strong baselines such as ReAct and Reflexion achieve high episodic recovery yet retain 72--100% holdout failure rates on recurring faults, whereas ANNEAL reduces these to 0% in the tested recurring-failure settings. Ablation confirms that removing FDKA eliminates all structural repairs and drops success rate by up to 26.7 percentage points. These results suggest that governed symbolic repair offers a complementary paradigm to weight-level and prompt-level adaptation for persistent fault elimination.

2602.16346 2026-06-09 cs.CL cs.LG 版本更新

Helpful to a Fault: Measuring Illicit Assistance in Multi-Turn, Multilingual LLM Agents

乐于助人过了头:测量多轮、多语言LLM代理中的非法协助

Nivya Talokar, Ayush K Tarun, Murari Mandal, Maksym Andriushchenko, Antoine Bosselut

发表机构 * EPFL(洛桑联邦理工学院) independent(独立研究员) tubingen(图宾根大学)

AI总结 本文提出STING框架,用于评估多轮多语言LLM代理在执行非法任务时的协助程度,发现攻击成功率在低资源语言中并未一致升高,并为实际部署环境提供了压力测试方法。

Comments Accepted in ICML 2026

详情
AI中文摘要

基于LLM的代理通过工具和记忆执行现实世界的工作流。这些能力也使心怀恶意的对手能够利用这些代理实施复杂的滥用场景。现有的代理滥用基准主要测试单提示指令,在衡量代理如何在多轮交互中最终协助完成有害或非法任务方面留下了空白。我们引入STING(Sequential Testing of Illicit N-step Goal execution,非法N步目标执行的序列测试),一种自动化红队框架,它构建以良性角色为依托的逐步非法计划,并通过自适应追问迭代探测目标代理,同时使用评判代理跟踪阶段完成情况。我们进一步引入一个分析框架,将多轮红队测试建模为首次越狱时间这一随机变量,从而支持发现曲线、按攻击语言的风险比归因等分析工具,以及一个新指标:受限平均越狱发现(Restricted Mean Jailbreak Discovery)。在AgentHarm场景中,STING得到的非法任务完成率显著高于单轮提示以及为工具使用代理适配的面向聊天的多轮基线。在六种非英语设置的多语言评估中,我们发现攻击成功率和非法任务完成率在低资源语言中并未一致升高,这与常见的聊天机器人研究结论不同。总体而言,STING为在现实部署环境中评估和压力测试代理滥用提供了一种实用方法,这类环境中的交互本质上是多轮的,而且往往是多语言的。

英文摘要

LLM-based agents execute real-world workflows via tools and memory. These affordances enable ill-intended adversaries to also use these agents to carry out complex misuse scenarios. Existing agent misuse benchmarks largely test single-prompt instructions, leaving a gap in measuring how agents end up helping with harmful or illegal tasks over multiple turns. We introduce STING (Sequential Testing of Illicit N-step Goal execution), an automated red-teaming framework that constructs a step-by-step illicit plan grounded in a benign persona and iteratively probes a target agent with adaptive follow-ups, using judge agents to track phase completion. We further introduce an analysis framework that models multi-turn red-teaming as a time-to-first-jailbreak random variable, enabling analysis tools like discovery curves, hazard-ratio attribution by attack language, and a new metric: Restricted Mean Jailbreak Discovery. Across AgentHarm scenarios, STING yields substantially higher illicit-task completion than single-turn prompting and chat-oriented multi-turn baselines adapted to tool-using agents. In multilingual evaluations across six non-English settings, we find that attack success and illicit-task completion do not consistently increase in lower-resource languages, diverging from common chatbot findings. Overall, STING provides a practical way to evaluate and stress-test agent misuse in realistic deployment settings, where interactions are inherently multi-turn and often multilingual.
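
Treating a red-teaming run as a time-to-first-event variable makes standard survival-analysis summaries available. The sketch below computes an empirical discovery curve and a restricted mean over a fixed turn horizon. It follows the textbook restricted-mean definition, E[min(T, horizon)], and may differ from the paper's exact Restricted Mean Jailbreak Discovery metric; the function names, the convention that `None` marks a run with no event, and the assumption that every run was given at least `horizon` turns are all illustrative.

```python
def discovery_curve(first_event_turns, horizon):
    """Fraction of runs whose first event occurred by turn k, for k = 1..horizon."""
    n = len(first_event_turns)
    return [
        sum(1 for t in first_event_turns if t is not None and t <= k) / n
        for k in range(1, horizon + 1)
    ]


def restricted_mean_turns(first_event_turns, horizon):
    """Empirical mean of min(T, horizon); runs with no event count as `horizon`."""
    capped = [horizon if t is None else min(t, horizon) for t in first_event_turns]
    return sum(capped) / len(capped)


if __name__ == "__main__":
    # Turn of the first flagged response in each run; None means none was observed.
    runs = [2, 5, None, 1]
    print(discovery_curve(runs, horizon=4))
    print(restricted_mean_turns(runs, horizon=4))
```

A lower restricted mean means the evaluated agent was flagged earlier on average, and capping at a horizon keeps runs that never triggered a flag from making the mean undefined.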

2605.15690 2026-06-09 cs.LG 版本更新

FRWKV+: Periodic-Aware Adaptive Gating for Frequency-Space Linear Time Series Forecasting

FRWKV+: 基于周期感知的自适应门控用于频率域线性时间序列预测

Qingyuan Yang, Dongyue Chen, Da Teng, Junhua Xiao, Jiaji Pan, Shizhuo Deng

发表机构 * College of Information Science and Engineering, Northeastern University(东北大学信息科学与工程学院) Foshan Graduate School of Innovation, Northeastern University(东北大学佛山研究生创新学院) National Frontiers Science Center for Industrial Intelligence and Systems Optimization(工业智能与系统优化国家级前沿科学中心)

AI总结 本文提出FRWKV-Plus模型,通过引入跨分支频谱门和信任门控残差修正,提升频率域时间序列预测的准确性与效率,实验表明其在多个基准数据集上表现优异。

详情
AI中文摘要

准确且高效的长期多变量时间序列预测需要捕捉重复出现的时序结构,同时在大量变量和预测步长上保持较低的推理成本。频率域模型能紧凑地表示长程和周期性变化,但通常将频谱的实部和虚部作为弱耦合的两路流来处理,并将周期性线索当作普通输入特征,即使这些线索并不可靠。本文提出FRWKV-Plus,一种基于高效FRWKV骨干网络的轻量级周期感知频率域预测模型。FRWKV-Plus引入了跨分支频谱门,利用兄弟分支的摘要对每个频谱分支重新加权;还引入了信任门控残差修正,在一个学习得到的、依赖数据的信任分数控制下,将紧凑的周期内上下文转换为对这些门的有界且符号灵活的调整。按照构造,该修正在初始化时保持恒等并且严格有界,因此周期性证据可以细化基础交互,但绝不会主导或反转它。在七个标准基准上,FRWKV-Plus与强大的线性、频率域、循环式和基于Transformer的预测器相比始终具有竞争力,同时保持了骨干网络的轻量级特性。受控的三种子消融实验表明,每个组件都有贡献,收益在强周期性数据上较小,在更难的Exchange和ILI数据集上更为明显,并且周期内上下文是影响最大的单一组件。实现已公开在https://github.com/yangqingyuan-byte/FRWKV-plus。

英文摘要

Accurate and efficient long-term multivariate time series forecasting requires capturing recurring temporal structure while keeping inference cheap across many variables and horizons. Frequency-space models represent long-range and periodic variation compactly, but they typically process the real and imaginary spectral components as weakly coupled streams and treat periodic cues as ordinary input features, even when such cues are unreliable. This paper proposes FRWKV-Plus, a lightweight periodic-aware frequency-space forecasting model built on the efficient FRWKV backbone. FRWKV-Plus introduces a cross-branch spectral gate that reweights each spectral branch using a summary of its sibling branch, and a trust-gated residual correction that converts compact within-period context into a bounded, sign-flexible adjustment of these gates under a learned, data-dependent trust score. By construction, the correction is identity-preserving at initialization and strictly bounded, so periodic evidence can refine but never dominate or invert the base interaction. On seven standard benchmarks, FRWKV-Plus is consistently competitive with strong linear, frequency-domain, recurrent-style, and Transformer-based forecasters while preserving the lightweight profile of the backbone. Controlled three-seed ablations show that each component contributes, that the benefit is modest on strongly periodic data and pronounced on the harder Exchange and ILI datasets, and that the within-period context is the most influential single component. The implementation is publicly available at https://github.com/yangqingyuan-byte/FRWKV-plus.

2605.15491 2026-06-09 cs.LG cs.AI cs.PF 版本更新

Ghosted Layers: Unconstrained Activation Alignment for Recovering Layer-Pruned LLMs

Ghosted Layers: 无约束激活对齐用于恢复层剪枝的LLM

Vincent-Daniel Yun, Junhyuk Jo, Sai Praneeth Karimireddy, Sunwoo Lee

发表机构 * University of Southern California(南加州大学) Inha University(inha大学)

AI总结 本文提出Ghosted Layers方法,通过无约束优化解决层剪枝后激活分布不匹配问题,提升LLM准确性和 perplexity 而不牺牲效率。

详情
AI中文摘要

层剪枝从大型语言模型中移除整个Transformer解码器块,但导致后续存活层接收到的隐藏状态分布与训练时分布不匹配,从而引起显著性能下降。我们提出Ghosted Layers,一种无需训练的恢复模块,通过解决边界激活对齐问题来解决此问题。我们的方法从少量校准集推导出闭合形式的最优线性算子,以重建由剪枝层引入的激活差异。我们展示该解决方案对应于对齐目标的无约束最优解,而现有方法受限于有限算子子空间内的约束解。在多个LLM backbone和剪枝策略上的实验表明,我们的方法在保持层剪枝效率增益的同时,一致提升了准确性和perplexity,优于先前的无训练基线。官方代码仓库:https://github.com/daniel-eai/ghosted_layers_official_repository/.

英文摘要

Layer pruning removes entire Transformer decoder blocks from large language models, but introduces a mismatch between the hidden state received by the next surviving layer and the distribution it was trained to process, leading to significant performance degradation. We propose Ghosted Layers, a training-free recovery module that addresses this issue by solving a boundary activation alignment problem. Our method derives a closed-form optimal linear operator from a small calibration set to reconstruct the activation discrepancy introduced by the pruned layers. We show that this solution corresponds to the unconstrained optimum of the alignment objective, whereas existing methods are restricted to constrained solutions over limited operator subspaces. Experiments across multiple LLM backbones and pruning strategies demonstrate that our method consistently improves accuracy and perplexity over prior training-free baselines, while preserving the efficiency gains of layer pruning. Official code repository: https://github.com/daniel-eai/ghosted_layers_official_repository/.

2605.15466 2026-06-09 cs.CV 版本更新

Entity-Centric World Models: Interaction-Aware Masking for Causal Video Prediction

以实体为中心的世界模型:交互感知的掩码用于因果视频预测

Santosh Kumar Paidi

发表机构 * Genentech, Inc.(基因泰克公司)

AI总结 本文提出IA-JEPA,通过运动中心的自监督掩码策略,优先捕捉物理交互,提升因果推理任务的准确性,并在真实世界动作和物理谜题中验证了其泛化能力。

Comments 12 pages, 4 figures

详情
AI中文摘要

从未标记视频中学习预测性世界模型是人工智能的基础挑战。尽管联合嵌入预测架构(JEPA)在语义分类中设定了新基准,但它们往往缺乏物理感知,无法捕捉下游推理所需的因果动态。我们假设这源于标准的基于块的掩码策略,这些策略优先考虑视觉纹理而非罕见但信息丰富的运动事件。我们提出交互感知JEPA(IA-JEPA),利用自监督的运动中心掩码策略,优先考虑物理交互。通过专门针对碰撞或动量转移的实体,我们迫使架构重建潜在轨迹而非静态背景特征。在CLEVRER基准上评估,IA-JEPA在因果推理任务中达到14.26%的准确率,显著高于标准块掩码基线的3.22%。关键的是,我们证明IA-JEPA通过诱导更高熵、更具判别性的潜在空间(+10%熵增)打破了标准自监督的“静态偏见”,并线性化物理能量(R²=0.43)。我们展示这种交互偏见可推广到真实世界的人类动作(Something-Something V2)和零样本物理谜题(PHYRE-Lite)。我们的结果提供了一条可扩展的、完全自监督的路径,以构建开始内部化物理世界因果结构的基础世界模型。

英文摘要

Learning predictive world models from unlabelled video is a foundational challenge in artificial intelligence. While Joint Embedding Predictive Architectures (JEPA) have set new benchmarks in semantic classification, they often remain physics-blind, failing to capture the causal dynamics necessary for downstream reasoning. We hypothesize that this stems from standard patch-based masking strategies, which prioritize visual texture over rare but informative kinematic events. We propose Interaction-Aware JEPA (IA-JEPA), which utilizes a self-supervised motion-centric masking strategy to prioritize physical interactions. By specifically targeting entities engaged in collisions or momentum transfers, we force the architecture to reconstruct latent trajectories rather than static background features. Evaluated on the CLEVRER benchmark, IA-JEPA achieves 14.26% accuracy on causal reasoning tasks, a significant lead over the 3.22% achieved by standard patch-masked baselines. Crucially, we demonstrate that IA-JEPA breaks the "static bias" of standard self-supervision by inducing a higher-entropy, more discriminative latent space (+10% entropy gain) that linearizes physical energy ($R^2=0.43$). We show that this interaction bias generalizes to real-world human actions (Something-Something V2) and zero-shot physical puzzles (PHYRE-Lite). Our results provide a scalable, fully self-supervised path toward building foundational world models that begin to internalize the causal structure of the physical world.

2605.15416 2026-06-09 cs.LG cs.AI version update

Margin-Adaptive Confidence Ranking for Reliable LLM Judgement

Gaojie Jin, Yong Tao, Lijia Yu, Tianjin Huang

Affiliations * Department of Computer Science, University of Exeter; Institute of AI for Industries, Chinese Academy of Sciences; Department of Mathematics and Computer Science, Eindhoven University of Technology

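AI summary This paper proposes a margin-based confidence ranking method that learns a dedicated confidence estimator to improve how reliably LLM judgments agree with human judgments. Using simulated annotator diversity and a margin-based ranking formulation, it explicitly models how confidently an LLM distinguishes human-agreement cases from human-disagreement cases, and it derives generalization guarantees for the estimator.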

Comments Accepted to ICML 2026

Details
English abstract

Jung et al. (2025) introduce a hypothesis testing framework for guaranteeing agreement between large language models (LLMs) and human judgments, relying on the assumption that the model's estimated confidence is monotonic with respect to human-disagreement risk. In practice, however, this assumption may be violated, and the generalization behavior of the confidence estimator is not explicitly analyzed. We mitigate these issues by learning a dedicated confidence estimator instead of relying on heuristic confidence signals. Our approach leverages simulated annotator diversity and a margin-based ranking formulation to explicitly model how confidently an LLM distinguishes between human-agreement and human-disagreement cases. We further derive generalization guarantees for this estimator, revealing a margin-dependent trade-off that informs the design of an adaptive estimator training procedure. When integrated into fixed-sequence testing, the learned confidence estimator yields improved ranking accuracy and empirically strengthens the monotonic relationship between confidence and disagreement risk, leading to higher success rates in satisfying target agreement levels across multiple datasets and judge models.
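As a rough illustration of what a margin-based ranking formulation can look like, the sketch below uses a generic pairwise hinge loss: each human-agreement case should receive a confidence score at least `margin` above each human-disagreement case. The abstract does not give the paper's actual objective, so the function names, the hinge form, and the default `margin` value are all assumptions made for illustration.

```python
def margin_ranking_loss(agree_scores, disagree_scores, margin=0.1):
    """Mean pairwise hinge loss (illustrative, not the paper's objective).

    A pair (a, d) contributes zero loss once the agreement score `a`
    exceeds the disagreement score `d` by at least `margin`.
    """
    pairs = [(a, d) for a in agree_scores for d in disagree_scores]
    if not pairs:
        return 0.0
    return sum(max(0.0, margin - (a - d)) for a, d in pairs) / len(pairs)


def pairwise_ranking_accuracy(agree_scores, disagree_scores):
    """Fraction of (agreement, disagreement) pairs that are ordered correctly."""
    pairs = [(a, d) for a in agree_scores for d in disagree_scores]
    if not pairs:
        return 0.0
    return sum(1 for a, d in pairs if a > d) / len(pairs)


if __name__ == "__main__":
    # Hypothetical confidence scores for cases where the LLM judge
    # agreed or disagreed with the human label.
    agree = [0.9, 0.8, 0.4]
    disagree = [0.5, 0.2]
    print(margin_ranking_loss(agree, disagree))
    print(pairwise_ranking_accuracy(agree, disagree))
```

A larger `margin` demands a wider gap between the two groups, which is one simple way to picture the margin-dependent trade-off the abstract mentions.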

2605.14531 2026-06-09 cs.CL version update

Language Generation as Optimal Control: Closed-Loop Diffusion in Latent Control Space

ZiYi Dong, Yuliang Huang, Weijian Deng, Xiangyang Ji, Liang Lin, Pengxu Wei

Affiliations * University of Science and Technology of China

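AI summary This paper reformulates language generation as a stochastic optimal control problem, analyzes autoregressive and diffusion models from a unified theoretical perspective, explains their limitations, and proposes a flow-matching-based closed-loop controller for efficient text generation.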

Details
English abstract

This work reformulates language generation as a stochastic optimal control problem, providing a unified theoretical perspective to analyze autoregressive and diffusion models and explain their limitations (Efficiency-Fidelity Paradox, Irreversibility Error Propagation, Optimization Tractability and Fidelity) in terms of a combination of trajectory singularity, adjoint state vanishing, and gradient absence. To address these issues, we approximate the solution to the Hamilton-Jacobi-Bellman (HJB) equation, yielding an optimal policy that acts as a closed-loop controller. To bypass the intractability of directly solving the HJB PDE, we employ Flow Matching as the optimal trajectory solver within the rectified latent control space. This allows our Manta-LM with Global Integral Operator to approximate the global vector field, effectively realizing a model that simultaneously achieves high-fidelity text generation and efficient, low-cost parallel sampling. Empirically, our method achieves strong performance on language modeling and conditional generation tasks, while exhibiting improved stability, efficiency, and controllability.
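For readers unfamiliar with Flow Matching, the sketch below shows the standard straight-line (rectified) training target: a point on the path x_t = (1 - t) x0 + t x1 is paired with the constant velocity x1 - x0, and a velocity predictor is regressed onto that target. This is the generic textbook objective, not Manta-LM's implementation; the function names and the plain-list vector representation are assumptions made for illustration.

```python
def flow_matching_pair(x0, x1, t):
    """Return (x_t, v) for the straight-line path between x0 and x1.

    x_t = (1 - t) * x0 + t * x1 is the interpolated point at time t,
    and v = x1 - x0 is the velocity the model should predict there.
    """
    x_t = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
    v = [b - a for a, b in zip(x0, x1)]
    return x_t, v


def flow_matching_loss(predict_velocity, x0, x1, t):
    """Mean squared error between predicted and target velocity at time t.

    `predict_velocity(x_t, t)` is a hypothetical stand-in for a learned model.
    """
    x_t, v = flow_matching_pair(x0, x1, t)
    pred = predict_velocity(x_t, t)
    return sum((p - q) ** 2 for p, q in zip(pred, v)) / len(v)


if __name__ == "__main__":
    noise = [0.0, 2.0]    # sample from the source distribution
    latent = [1.0, 4.0]   # sample from the data (latent) distribution
    zero_model = lambda x_t, t: [0.0 for _ in x_t]
    print(flow_matching_pair(noise, latent, 0.5))
    print(flow_matching_loss(zero_model, noise, latent, 0.5))
```

Because the target velocity is constant along a straight path, a well-trained predictor can be integrated in very few steps, which is the usual motivation for rectified paths when fast parallel sampling is a goal.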