arXivDaily: arXiv daily academic digest, updated Monday to Friday
2606.12569 2026-06-12 cs.CL cs.AI New submission

EDEN: A Large-Scale Corpus of Clinical Notes for Italian

Tiziano Labruna, Guido Bertolini, Pietro Ferrazzi, Bernardo Magnini

Affiliations: Fondazione Bruno Kessler; Istituto di Ricerche Farmacologiche Mario Negri IRCCS; University of Padua

AI summary: Introduces EDEN, a large-scale corpus of Italian emergency department clinical notes comprising about 4 million anonymized notes and about 6,000 expert-annotated notes, intended to support the use of large language models in medicine, and proposes CRF-filling as a new structured information extraction benchmark.

Abstract

We present EDEN (Emergency Department Electronic Notes), a new and unique large-scale corpus of clinical notes produced in the Emergency Departments of Italian hospitals. In its current version, the corpus comprises approximately 4 million fully anonymized clinical notes covering diverse phases of patient care during the stay in the emergency department. In addition, a subset of about six thousand notes has been manually annotated by clinical experts through a structured Case Report Form (CRF) containing 132 items relevant to two patient situations in emergency departments: dyspnea and loss of consciousness. Items may take numerical (e.g., blood saturation), categorical (e.g., level of consciousness), binary (e.g., presence of trauma), or mixed value types. The annotation process involved multiple clinicians and underwent iterative revision to resolve ambiguities in item formulation, resulting in a richly structured (although highly imbalanced) resource. The dataset aims to fill an important gap in the data available to support both the development and the use of Large Language Models in concrete medical applications. We describe the data collection protocol, the on-site anonymisation pipeline, corpus statistics, and the annotation scheme. Finally, we propose CRF-filling as a novel structured information extraction benchmark and provide zero-shot baselines obtained with Gemma-27B and MedGemma-27B. To the best of our knowledge, EDEN is the largest freely available corpus of clinical notes for the Italian language.
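
The item types named in the abstract (numerical, categorical, binary) can be pictured as a small schema. The following is only an illustrative sketch: the class name `CRFItem`, its fields, and the three example items are assumptions and not the EDEN schema, and mixed-type items are left out.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Union

Value = Union[float, str, bool, None]


@dataclass(frozen=True)
class CRFItem:
    """One CRF item; names and fields are hypothetical, not the EDEN schema."""
    name: str
    kind: str  # "numerical", "categorical", or "binary"
    categories: Optional[Sequence[str]] = None  # only used for categorical items

    def accepts(self, value: Value) -> bool:
        """Return True if `value` is a valid filling for this item."""
        if value is None:  # item not mentioned in the note
            return True
        if self.kind == "numerical":
            # bool is a subclass of int, so exclude it explicitly
            return isinstance(value, (int, float)) and not isinstance(value, bool)
        if self.kind == "categorical":
            return value in (self.categories or ())
        if self.kind == "binary":
            return isinstance(value, bool)
        return False


# Hypothetical items modeled on the examples given in the abstract.
saturation = CRFItem("blood_saturation", "numerical")
consciousness = CRFItem("level_of_consciousness", "categorical",
                        categories=("alert", "drowsy", "unresponsive"))
trauma = CRFItem("presence_of_trauma", "binary")
```

A CRF-filling system would then emit one value per item, and a validator of this kind could reject ill-typed predictions before scoring.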

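2606.12563 2026-06-12 cs.AI New submission

Arbor: Tree Search as a Cognition Layer for Autonomous Agents

Neha Prakriya, Chaojun Hou, Zheng Gong, Huasha Zhao, Xi Zhao, Mou Li, Zhenyu Gu, Emad Barsoum

Affiliations: AMD

AI summary: Proposes Arbor, a multi-agent framework that uses structured tree search as a cognition layer to enable autonomous optimization in large, stateful action spaces, achieving up to a 193% throughput-latency Pareto improvement in LLM inference optimization.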

Abstract

Arbor is a multi-agent framework that introduces structured tree search as a cognition layer for autonomous agents operating in large, stateful action spaces. Prior autonomous optimization systems operate on isolated targets with stateless evaluation. Arbor instead maintains an explicit search tree of scored hypotheses that serves as the shared working memory across agents, evolving with every measurement, treating failures as diagnostic signals that reshape subsequent exploration, and expanding as prior successes shift the bottleneck distribution. We validate Arbor on full-stack LLM inference optimization, a domain where achieving peak performance has historically required coordinated effort from engineering teams across the application, framework, compiler, kernel, and hardware stack. Arbor pairs an Orchestrator agent, which drives optimization by delegating to Domain Specialists across the inference stack, with a Critic agent that safeguards stability through root-cause analysis, introspection, and measurement validation -- a checks-and-balances architecture where neither agent can unilaterally drive the system. Agent capabilities are decomposed into hard skills (domain expertise) and soft skills (coordination protocols that determine how contributions compose), enabling fully autonomous multi-day campaigns. Arbor achieves up to 193% inference throughput-latency Pareto improvement over vendor-optimized baselines, while a single agent without the harness plateaus at +33% throughput improvement and crashes irrecoverably within hours. Arbor generalizes to multiple generations of hardware platforms, and run-to-run variance is within 2 percentage points, demonstrating that the method is hardware-agnostic and reproducible.
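
As a rough picture of "an explicit search tree of scored hypotheses", the sketch below keeps a tree of nodes, marks failures with a diagnostic note, and picks the best remaining leaf to explore next. Everything here is an assumption made for illustration: the `Hypothesis` class, the rule of skipping subtrees under a failed node, and the example hypotheses are not taken from Arbor.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Hypothesis:
    """A node in a scored hypothesis tree (hypothetical structure)."""
    description: str
    score: float = 0.0     # e.g. a measured improvement; higher is better
    failed: bool = False   # set when a measurement rejects the hypothesis
    note: str = ""         # diagnostic recorded on failure
    children: List["Hypothesis"] = field(default_factory=list)

    def add(self, child: "Hypothesis") -> "Hypothesis":
        self.children.append(child)
        return child


def best_open_leaf(root: Hypothesis) -> Optional[Hypothesis]:
    """Return the highest-scoring leaf that has not failed, or None."""
    best: Optional[Hypothesis] = None
    stack = [root]
    while stack:
        node = stack.pop()
        if node.failed:
            continue  # simplification: skip everything below a failed node
        if not node.children and (best is None or node.score > best.score):
            best = node
        stack.extend(node.children)
    return best


# Made-up example: one promising branch fails and is recorded as such.
root = Hypothesis("baseline")
fusion = root.add(Hypothesis("enable kernel fusion", score=0.12))
batch = root.add(Hypothesis("increase batch size", score=0.30))
batch.failed, batch.note = True, "out of memory"
next_node = best_open_leaf(root)
```

Here the failed branch stays in the tree with its note, so later decisions can consult why it failed instead of retrying it blindly.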

2606.12562 2026-06-12 cs.CV cs.GR New submission

HairPort: In-context 3D-aware Hair Import and Transfer for Images

Alireza Heidari, Amirhossein Alimohammadi, Wallace Michel Pinto Lira, Adi Bar-Lev, Ali Mahdavi-Amiri

Affiliations: Simon Fraser University; Huawei Canada

AI summary: Proposes HairPort, a framework that explicitly separates hair removal from hair transfer and uses a 3D-aware pipeline to handle large pose gaps; it combines a LoRA-adapted Bald Converter with a conditional flow-matching generator for high-quality, identity-preserving hairstyle transfer.

Comments
Accepted to SIGGRAPH 2026 (Conference Papers Track). 23 pages, 15 figures, 10 tables, including supplementary material as appendices. Project page: this https URL
Abstract

Transferring hairstyles between images is an important but challenging task in computer graphics, computer vision, and visual effects. It enables users to explore new looks without physically altering their hair, with applications in virtual try-on systems, augmented reality, and entertainment. Most prior works operate best under small pose gaps, and they fall short under large viewpoint and scale differences, where missing hair content must be synthesized rather than transferred. We propose HairPort, a 3D-aware hairstyle transfer framework that attempts to solve these issues by explicitly separating hair removal from transfer and enforcing geometric consistency before synthesis. We introduce a Bald Converter, which produces realistic bald versions of faces through LoRA-based in-context adaptation of FLUX.1 Kontext. To train our Bald Converter, we introduce a new dataset, Baldy, containing 6,000 paired bald and original images across diverse identities and conditions. We also use a 3D-Aware Transfer Pipeline that reconstructs and re-renders the reference hairstyle from the target viewpoint before compositing it onto the source image. Being 3D aware, our method supports large pose and scale discrepancies between the source and target. Finally, a conditional flow-matching generator synthesizes the transferred result from the bald source and geometry-aligned reference guidance. Together, our method enables accurate, pose-consistent, and identity-preserving hairstyle transfer, outperforming existing methods both qualitatively and quantitatively.

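2606.12556 2026-06-12 cs.DC New submission

ITME: Inference Tiered Memory Expansion with Disaggregated CXL-Hybrid Memories

Hakbeom Jang, Younghoon Min, Sunwoong Kim, Taeyoung Ahn, Hanyee Kim, Youngpyo Joo, Hoshik Kim, Jongryool Kim

AI summary: To address the memory bottleneck of TB-scale context state in LLM inference, proposes the ITME architecture, which uses CXL-hybrid memory to provide byte-addressable remote memory expansion and exploits deterministic access patterns to optimize data movement, improving throughput by up to 35.7%.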

Abstract

The rapid shift toward agentic and long-context workloads in Large Language Models (LLMs) is pushing the industry beyond the capacity of individual servers toward disaggregated shared storage to handle TB-scale context states. This movement has led to the emergence of specialized shared context layers designed to externalize and share cumulative inference states across distributed clusters. While offloading to a data processing unit (DPU) within just-a-bunch-of-flash (JBOF) architectures accelerates NVMe-over-fabrics (NVMe-oF) target processing, the need for sophisticated software-level optimization and the cost-efficiency burden remain significant. Consequently, the ideal architecture for scaling this shared context infrastructure is still an active area of exploration. In this paper, we propose ITME (Inference Tiered Memory Expansion), which leverages a CXL-hybrid memory to present a massive, TB-scale byte-addressable remote memory expansion. This approach enables cost-efficient scaling and simplifies the software stack through direct byte-addressability, effectively addressing the challenges of shared context infrastructure. Our key insight is that the deterministic access patterns of voluminous model weights and prefix caches enable the system to proactively manage data movement across the memory-storage hierarchy. We validate ITME by evaluating its performance potential with production-grade SK Hynix CMM and PCIe Gen5 NVMe SSDs, while further demonstrating its functional feasibility through an FPGA-based hardware prototype. Overall, ITME enhances conventional CPU-offloading by providing additional remote memory expansion to accommodate large KV cache footprints beyond host memory limits, achieving up to a 35.7% throughput improvement.

2606.12555 2026-06-12 cs.SD cs.CV cs.MM New submission

AudioX-Turbo: A Unified Framework for Efficient Anything-to-Audio Generation

Zeyue Tian, Lei Ke, Zhaoyang Liu, Ruibin Yuan, Liumeng Xue, Yujiu Yang, Weijia Chen, Xu Tan, Qifeng Chen, Wei Xue, Yike Guo

Affiliations: The Hong Kong University of Science and Technology; Tsinghua University; Noiz AI; Independent Researcher

AI summary: Proposes AudioX-Turbo, a unified and efficient framework built on a teacher-student paradigm that uses a Multimodal Diffusion Transformer and Distribution Matching Distillation to generate audio from text, video, and audio inputs, requiring only 4 sampling steps and about 25x fewer function evaluations (NFE).

Abstract

Audio and music generation based on flexible multimodal control signals is a widely applicable topic with three key challenges: 1) a unified multimodal modeling framework, 2) large-scale, high-quality training data, and 3) the prohibitive inference cost of multi-step diffusion sampling. To this end, we propose AudioX-Turbo, a unified and efficient framework for anything-to-audio generation that integrates varied multimodal conditions (i.e., text, video, and audio signals). AudioX-Turbo follows a teacher-student paradigm. The teacher, AudioX-Base, is built on a Multimodal Diffusion Transformer with a Multimodal Adaptive Fusion module that aligns diverse multimodal inputs for high-fidelity synthesis. It is then distilled into the few-step student, AudioX-Turbo, via Distribution Matching Distillation adapted to flow matching, complemented by a diffusion-based discriminator for high-quality few-step generation. To support the training of AudioX-Turbo, we construct a large-scale, high-quality dataset, IF-caps-Pro, comprising approximately 9.2M samples curated through a two-stage data collection and annotation pipeline. We benchmark AudioX-Turbo across a wide range of tasks, finding that our model achieves superior performance, especially on text-to-audio and text-to-music generation, while operating at only 4 sampling steps and requiring approximately 25x fewer function evaluations (NFE) than multi-step baselines. These results demonstrate that our method is capable of audio generation under flexible multimodal control, showing efficient and powerful instruction-following capabilities. The code and datasets will be available at this https URL.

2606.12552 2026-06-12 cs.LG New submission

Crossing the Validation Crisis: Cross-Validation Reduces Benchmarking Variance Surprisingly Well

Célestin Eve, Gaël Varoquaux, Thomas Moreau

Affiliations: MIND Team, Université Paris-Saclay, Inria, CEA, Palaiseau, France; SODA Team, Inria, Palaiseau, France; Probabl

AI summary: Shows that cross-validation, quantified through the notion of sample gain as a form of virtual data augmentation, markedly improves the confidence and stability of algorithm performance evaluation, and introduces a dynamic early-stopping mechanism to reduce computational cost.

Comments
34 pages, 11 figures
Abstract

Modern machine learning progresses through empirical work, benchmarking new methods to evaluate relative performance. However, the statistical variability inherent to evaluation, exacerbated by the stochastic nature of many algorithms, often makes performance estimation unreliable because few test samples are available, leading to a validation crisis in which genuine advances are difficult to discern. In this work, we show that cross-validation markedly improves confidence when evaluating and comparing the performance of learning algorithms. We introduce the concept of sample gain, which quantifies the virtual data augmentation achieved by using multiple cross-validation splits to reduce benchmarking variance. Experiments on both synthetic and real-world datasets (histopathologic scans and NLP fine-tuning) demonstrate that multiple splits can substantially improve the reliability and stability of performance estimates, with diminishing returns often setting in later than expected. We also introduce a procedure to dynamically early-stop cross-validation by estimating from the first few folds whether subsequent folds will bring large sample gains. Our findings highlight the value of pushing cross-validation on available samples to achieve robust and reliable benchmarking.
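
Why extra splits help, and why the benefit eventually saturates, can be seen from a textbook identity: the mean of k estimates with common variance sigma2 and pairwise correlation rho has variance sigma2 * (1 + (k - 1) * rho) / k. The sketch below evaluates that identity. It is not the paper's definition of sample gain, which the abstract does not spell out; the equicorrelation assumption and both function names are illustrative.

```python
def variance_of_mean(sigma2: float, rho: float, k: int) -> float:
    """Variance of the mean of k estimates, each with variance sigma2
    and pairwise correlation rho (equicorrelation assumption)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return sigma2 * (1.0 + (k - 1) * rho) / k


def variance_reduction_factor(rho: float, k: int) -> float:
    """How many times smaller the variance is with k splits than with one."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return k / (1.0 + (k - 1) * rho)


# With uncorrelated splits the variance shrinks k-fold; with fully
# correlated splits (rho = 1) extra splits bring no reduction at all.
factors = [variance_reduction_factor(0.5, k) for k in (1, 4, 16)]
```

Under this simplified model the reduction factor approaches 1 / rho as k grows, which is one way to think about diminishing returns from additional folds.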

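2606.12550 2026-06-12 cs.RO cs.AI New submission

Foresight: Iterative Reasoning About Clues that Matter for Navigation

Arthur Zhang, Carl Qi, Donne Su, Xiangyun Meng, Amy Zhang, Joydeep Biswas

Affiliations: UT Austin; FieldAI

AI summary: Proposes Foresight, a framework in which a fine-tuned VLM alternates between proposing and critiquing image-space motion plans, with a reward model learned from human feedback used for reinforcement-learning post-training; it enables iterative motion refinement for mapless navigation from sparse language instructions and raises task success by 37%.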

Comments
22 pages, 10 figures, 3 tables
Abstract

Open-world mapless navigation from sparse language instructions requires resolving underspecified goals and inferring which environmental cues are relevant for reaching the goal. For instance, reaching an out-of-view destination may require interpreting ramps, signs, or detours that reveal where to go or which route to take. Prior works are limited by their reliance on known navigation factors and closed-set factor categories, or identify cues before motion planning and miss plan-dependent cues. We argue that pretrained Vision-Language Models (VLMs) can discover novel instruction-relevant cues, but require adaptation to focus on which cues matter and how they should influence motion planning. We realize these ideas in Foresight, a test-time framework in which a finetuned VLM alternates between proposing image-space motion plans and critiquing them using the language goal and visual context. Subsequent plans are conditioned on prior critiques, enabling iterative motion refinement before execution. To align plan critiques and refinements with open-set behavior preferences, we learn a reward model from human feedback and use it to post-train the VLM with reinforcement learning in the plan-critique loop. In offline evaluations and 6 real-world environments, Foresight improves average task success by 37% and reduces interventions per mission by 52% relative to state-of-the-art test-time reasoning and foundation-model baselines, while running in real-time on a Jetson AGX Orin. We will release code, data, and training details to support future work on test-time reasoning for robot motion refinement. Additional videos at: this https URL

2606.12507 2026-06-12 cs.LG New submission

Rubric-Guided Self-Distillation: Post-Training Without Rubric Verifiers

MohammadHossein Rezaei, Anas Mahmoud, Zihao Wang, Utkarsh Tyagi, Advait Gosai, Razvan-Gabriel Dumitru, Aakash Sabharwal, Bing Liu, Yunzhong He

Affiliations: Scale AI

AI summary: Proposes RGSD, which distills a rubric-conditioned teacher into a student model to obtain dense per-token learning signals without a verifier, reaching rubric satisfaction comparable to judge-based GRPO on medical and science domains.

Abstract

Rubrics have emerged as an alternative to RLVR in open-ended domains where a single ground-truth final answer is not available. Existing rubric-based training methods rely on an LLM verifier that scores each rollout against rubrics. This introduces substantial training-time overhead, exposes optimization to verifier-specific biases, and reduces rubric feedback to a sparse end-of-trajectory signal. We propose Rubric-Guided Self-Distillation (RGSD), a verifier-free training method in which the base policy, conditioned on the rubric, serves as the teacher for the unconditioned student. RGSD distills the rubric-conditioned teacher distribution into the student token-by-token, replacing sparse trajectory-level rewards with dense per-token learning signals and removing the LLM judge from the training loop entirely. Across Qwen-2.5 (3B, 7B) and Qwen3-Thinking (4B, 8B) models on medical and science domains, RGSD achieves rubric satisfaction comparable to judge-based GRPO while using one on-policy rollout per prompt and no training-time verifier calls. Ablations show that raw rubrics provide a stronger teacher enrichment signal than self-generated reference responses, while a stronger GRPO judge can outperform RGSD in some settings, positioning RGSD as a complementary verifier-free alternative when verifier cost or reliability is the bottleneck.
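
The "dense per-token learning signal" can be illustrated with a per-position divergence between two next-token distributions: one from the model prompted with the rubric (teacher) and one from the same model without it (student). The sketch below is an assumption-laden toy: the KL direction, the use of raw logits, and the function names are illustrative choices, since the abstract does not state the exact objective.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert logits to probabilities, subtracting the max for stability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def per_token_kl(teacher_logits: Sequence[Sequence[float]],
                 student_logits: Sequence[Sequence[float]]) -> List[float]:
    """KL(teacher || student) at each token position (assumed direction)."""
    losses = []
    for t_row, s_row in zip(teacher_logits, student_logits):
        p = softmax(t_row)
        q = softmax(s_row)
        losses.append(sum(pi * math.log(pi / qi)
                          for pi, qi in zip(p, q) if pi > 0))
    return losses


# Two positions over a toy 2-token vocabulary: the first already agrees,
# the second does not, so only the second contributes a learning signal.
kl = per_token_kl([[1.0, 2.0], [0.0, 0.0]],
                  [[1.0, 2.0], [3.0, 0.0]])
```

Each position yields its own loss term, in contrast to a single scalar reward assigned at the end of a rollout.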

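2606.12505 2026-06-12 cs.LG cs.AI New submission

Boosting Direct Preference Optimization with Penalization

Pengwei Sun

AI summary: Proposes DPOP, which adds to the DPO loss a gated penalty on the reference model's greedy responses, activated only when the current policy assigns a lower probability to the preferred response than to the rejected one, and improves length-controlled win rate on AlpacaEval 2.0.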

Comments
Accepted at ICML 2026 Workshop on Decision-Making from Offline Datasets to Online Adaptation: Black-Box Optimization to Reinforcement Learning
Abstract

Offline preference optimization has become a practical substitute for reinforcement learning from human feedback, but pairwise objectives such as Direct Preference Optimization (DPO) and its variants use only the chosen and rejected responses stored in a static dataset. This leaves a useful signal unused: the response that the reference model itself would generate for the same prompt. We propose Direct Preference Optimization with Penalization (DPOP), a simple extension of DPO that augments the base preference loss with a gated penalty on reference-greedy responses. DPOP activates this penalty only when the current policy still assigns a lower likelihood to the preferred response than to the rejected response. On AlpacaEval 2.0, DPOP improves length-controlled win rate over DPO, SimPO, and AlphaDPO on both Llama-3-8b-it and Gemma-2-9b-it, achieving relative gains of 5.3% and 4.4% over baselines on the two models, respectively. Ablations further show that a SimNPO-style length-normalized penalty is stronger than NPO and token-level unlikelihood in this setting.
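
The gating rule described above can be sketched on top of the standard DPO loss for a single preference pair. The DPO term below follows the usual formulation on sequence log-probabilities; the penalty term is only an assumed length-normalized form chosen for illustration, since the abstract does not give the exact expression, and the names `gated_penalty`, `dpop_style_loss`, and `lam` are hypothetical.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(pi_chosen: float, pi_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one pair, from sequence log-probabilities."""
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -log_sigmoid(beta * margin)


def gated_penalty(pi_chosen: float, pi_rejected: float,
                  pi_ref_greedy: float, ref_greedy_len: int,
                  beta: float = 0.1) -> float:
    """Assumed penalty: zero unless the policy still ranks the pair wrongly."""
    if pi_chosen >= pi_rejected:
        return 0.0  # gate closed: the pair is already ordered correctly
    avg_logp = pi_ref_greedy / ref_greedy_len  # length-normalized log-likelihood
    # Shrinks as the policy makes the reference-greedy response less likely.
    return -log_sigmoid(-beta * avg_logp)


def dpop_style_loss(pi_chosen: float, pi_rejected: float,
                    ref_chosen: float, ref_rejected: float,
                    pi_ref_greedy: float, ref_greedy_len: int,
                    beta: float = 0.1, lam: float = 1.0) -> float:
    """DPO loss plus a weighted gated penalty (illustrative combination)."""
    base = dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta)
    return base + lam * gated_penalty(pi_chosen, pi_rejected,
                                      pi_ref_greedy, ref_greedy_len, beta)
```

When the policy already prefers the chosen response, the loss reduces to plain DPO; the extra term only acts on pairs that are still ordered incorrectly.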

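2606.12504 2026-06-12 cs.LO New submission

A Type Theory of Sense: Witnessed Choice in Stratified Semantic Spaces

Iman Poernomo

AI summary: Proposes TTS, a dependent type theory that represents semantic composition by horn filling and records separation witnesses through measurement contexts, yielding composition that is not globally canonical and supporting a geometric account of Fregean sense, reference, and hyperintensional difference.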

Abstract

We introduce TTS, a dependent type theory in which semantic composition is represented by horn filling and distinctions between possible completions are witnessed relative to explicit measurement regimes. TTS replaces globally canonical composition with regime-indexed indiscernibility and constructive apartness, allowing filler spaces to be classified as canonical when all completions are observationally connected and forked when two warranted completions are positively separated. Separation witnesses enter the calculus only through measurement contexts recording actual instrument outputs, yielding conservativity, provenance, and a no-fork-from-the-empty-record result. We prove that forks persist under refinement while canonicity may fail, and characterize exactly when an identification made by one regime can consistently coexist with a separation made by another. This framework supports a geometric account of Fregean sense as a choice of filler, reference as the boundary constraining that choice, and hyperintensional difference as measured apartness, while providing a falsifiable bridge to stratified representation spaces and branching behaviour in language-model generation.

2606.12503 2026-06-12 cs.LG cs.SD New submission

Dolph2Vec: Self-Supervised Representations of Dolphin Vocalizations

Chiara Semenzin, Faadil Mustun, Roberto Dessi, Pierre Orhan, Alexis Emanuelli, Yair Lakretz, Gonzalo de Polavieja, German Sumbre

Affiliations: École Normale Supérieure, Paris, France; Not Diamond, San Francisco, USA; Institut du Cerveau, Paris, France; Champalimaud Foundation, Lisbon, Portugal

AI summary: Proposes Dolph2Vec, the first self-supervised model trained on five years of longitudinal dolphin recordings, which significantly outperforms general-purpose baselines on signature whistle classification and whistle detection and reveals interpretable acoustic units.

Abstract

Self-supervised learning (SSL) has opened new opportunities in bioacoustics by enabling scalable modeling of animal vocalizations without the need for expensive manual annotation. However, current SSL models in this domain prioritize broad generalization across species and are not optimized for uncovering the fine-grained structure of individual communication systems. In this work, we collect and release a novel dataset of over five years of longitudinal recordings from five known dolphins in a semi-naturalistic marine environment, an unprecedented resource for studying dolphin communication. We adapt the Wav2Vec2.0 architecture (Baevski et al., 2020) to this domain and introduce Dolph2Vec, the first large-scale, species-specific SSL model trained exclusively on this data. We benchmark our model on two biologically relevant tasks: signature whistle classification and whistle detection. Dolph2Vec significantly outperforms general-purpose baselines in both tasks. Beyond performance, we show that learned embeddings and codebook structure capture interpretable acoustic units aligned with dolphin whistle categories and possibly sub-whistle structure, enabling fine-grained analysis of communication patterns. Our findings demonstrate how SSL can serve as both a model and a scientific tool to explore hypotheses in animal communication research.

2606.12501 2026-06-12 cs.LG New submission

Policy-driven Conformal Prediction for Trustworthy QoT Estimation

Kiarash Rezaei, Omran Ayoub, Paolo Monti, Carlos Natalino

Affiliations: Chalmers University of Technology; University of Applied Sciences and Arts of Southern Switzerland

AI summary: Proposes Conformal QoT, a framework that combines statistically guaranteed QoT estimation with operational decision policies to deliver reliable lightpath-feasibility predictions under domain shift, improving accuracy from 92% to 99.6% on open datasets.

Abstract

We propose Conformal QoT, a policy-driven framework that combines statistically guaranteed QoT estimation with operational decision policies, enabling reliable lightpath-feasibility predictions under domain shift and improving accuracy from 92% to 99.6% on open datasets.
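
One common way to obtain a statistical guarantee of this kind is split conformal prediction: a quantile of absolute residuals on a held-out calibration set gives an interval half-width, and a decision policy then acts on the interval rather than on the point estimate. The sketch below shows that generic recipe; whether the paper uses this exact variant is not stated in the abstract, and the conservative policy, the threshold, and the numbers are assumptions.

```python
import math
from typing import Sequence


def conformal_quantile(residuals: Sequence[float], alpha: float) -> float:
    """Split-conformal quantile of absolute calibration residuals.

    Uses the ceil((n + 1) * (1 - alpha))-th smallest absolute residual.
    """
    n = len(residuals)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-indexed rank
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(abs(r) for r in residuals)[rank - 1]


def is_feasible(predicted_qot: float, half_width: float, threshold: float) -> bool:
    """Conservative policy: accept only if the interval's lower end clears
    the threshold (hypothetical policy, units left abstract)."""
    return predicted_qot - half_width >= threshold


# Calibration residuals (prediction minus measurement), made-up values.
calibration = [0.1, -0.2, 0.3, -0.4, 0.5, -0.6, 0.7]
q = conformal_quantile(calibration, alpha=0.25)
```

A lightpath predicted at 15.0 with this half-width would be accepted against a threshold of 14.0, while one predicted at 14.5 would be rejected even though its point estimate clears the threshold.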

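2606.12500 2026-06-12 cs.LG cs.AI New submission

Improving Crash Frequency Prediction from Simulated Traffic Conflicts Using Machine Learning Based Microsimulation

Xian Liu, Carlo G. Prato, Gustav Markkula

AI summary: Replaces conventional rule-based behaviour models with machine-learning behaviour models in traffic microsimulation and analyses simulated conflicts with Extreme Value Theory to predict crash frequency; validation at five signalised intersections in Leeds, UK, shows that the ML model improves prediction accuracy without site-specific calibration.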

Abstract

Traffic microsimulation combined with surrogate safety measures has increasingly been used as a proactive alternative to historical crash data for predicting crash frequency for current or planned road infrastructure designs. However, existing microsimulation-based safety studies have adopted simplified rule-based behaviour models, which reproduce traffic flow reasonably well but often fail to generate realistic conflict dynamics, limiting crash prediction accuracy. Recent advances in machine learning (ML)-based behaviour models offer a promising opportunity to potentially improve microsimulation realism and crash frequency predictions by learning human driving behaviour directly from large-scale trajectory datasets. To investigate this possibility, traffic microsimulation was conducted for five real-world signalised intersections in Leeds, UK, using both a standard rule-based model and a state-of-the-art ML model. Simulated vehicle trajectories were analysed using a two-dimensional Time-to-Collision metric to identify simulated conflicts, which were then modelled using Extreme Value Theory to predict crash frequency. Results show that conflicts from the ML model yielded crash predictions in line with the real-world crash data, whereas the rule-based model did not permit meaningful predictions, presumably due to a lack of model calibration to the specific simulated intersections. Directly using ML-generated simulated crashes to predict real-world crash frequency also yielded poor results, suggesting that while current ML models can realistically reproduce conflicts, they are not yet able to generate realistic crashes. Overall, the findings demonstrate that ML-based behaviour models are promising for improving crash prediction from simulated conflicts, without a need for location-specific model calibration, and suggest clear future directions for ML-based traffic microsimulation.
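
To make the conflict-detection step concrete, the sketch below computes a time-to-collision in the plane for two road users moving at constant velocity, each approximated by a disc. This is a simplified stand-in: the abstract does not define the two-dimensional Time-to-Collision metric used in the paper, so the disc approximation and the function name are assumptions.

```python
import math
from typing import Optional, Tuple

Vec = Tuple[float, float]


def time_to_collision(p_rel: Vec, v_rel: Vec, radius: float) -> Optional[float]:
    """Earliest t >= 0 at which two constant-velocity discs touch.

    p_rel, v_rel: position and velocity of B relative to A.
    radius: sum of the two disc radii. Returns None if they never touch.
    Solves |p_rel + v_rel * t| = radius for the smallest non-negative t.
    """
    a = v_rel[0] ** 2 + v_rel[1] ** 2
    b = 2.0 * (p_rel[0] * v_rel[0] + p_rel[1] * v_rel[1])
    c = p_rel[0] ** 2 + p_rel[1] ** 2 - radius ** 2
    if c <= 0:
        return 0.0   # already overlapping
    if a == 0:
        return None  # no relative motion
    disc = b * b - 4.0 * a * c
    if disc < 0 or b >= 0:
        return None  # paths miss, or the road users are separating
    return (-b - math.sqrt(disc)) / (2.0 * a)


# Head-on approach along x: 10 m apart, closing at 2 m/s, radii summing to 2 m.
ttc = time_to_collision((10.0, 0.0), (-2.0, 0.0), 2.0)
```

In a pipeline of the kind described, the minimum such value per interaction would be the quantity whose extremes are then modelled statistically.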

2606.12499 2026-06-12 cs.RO New submission

Action-Effect Memory Pretraining for Robot Manipulation

Yijing Zhou, Qiwei Liang, Sitong Zhuang, Jiaxi Li, Xianpeng Wang, Boyang Cai, Yunyang Mo, Renjing Xu

Affiliations: Hong Kong University of Science and Technology (Guangzhou); Shenzhen University

AI summary: Proposes AEM, a framework that learns compact temporal representations through masked modeling of vision-action history, improving robot manipulation under partial observability and outperforming single-frame pretraining and frame stacking.

Abstract

We present AEM, an Action-Effect Memory pretraining framework for robot manipulation that learns compact temporal representations from vision-action history. Unlike prior robot representation pretraining methods that mainly focus on single-frame visual encoding, AEM targets the temporal nature of manipulation, where the current observation alone is often insufficient under partial observability. AEM models manipulation as an action-driven interaction process by interleaving visual and action features and applying masked modeling to recover missing content from incomplete histories, thereby learning action-conditioned state evolution. The Mamba-encoded output of the final vision token is used as a compact history representation, serving as the global context for decoding and downstream control. This design preserves a single-vector temporal bottleneck while keeping inference efficient. We evaluate AEM with Diffusion Policy and Flow Policy. AEM consistently improves manipulation performance in both simulation and real-world settings, outperforming baselines across clean scenes, cluttered and random scenes, and non-Markovian tasks. Ablation studies further show that history-aware pretraining surpasses single-frame pretraining and direct frame stacking, while reducing inference latency and computational cost.
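
The interleaving and masking steps can be illustrated on plain token lists. This is a toy sketch under stated assumptions: the sequence layout (one more observation than actions), the choice never to mask the final vision token, and the names `interleave`, `mask_tokens`, and `MASK` are all hypothetical and not taken from AEM.

```python
import random
from typing import List, Sequence, Tuple

MASK = "<mask>"


def interleave(vision: Sequence[str], actions: Sequence[str]) -> List[str]:
    """Build v_0, a_0, v_1, a_1, ..., v_T (one more observation than actions)."""
    if len(vision) != len(actions) + 1:
        raise ValueError("expected one more vision token than action tokens")
    seq: List[str] = []
    for v, a in zip(vision, actions):
        seq.extend([v, a])
    seq.append(vision[-1])
    return seq


def mask_tokens(seq: Sequence[str], ratio: float,
                rng: random.Random) -> Tuple[List[str], List[int]]:
    """Mask a random subset of tokens, never the final vision token."""
    candidates = list(range(len(seq) - 1))
    k = round(ratio * len(candidates))
    masked_idx = sorted(rng.sample(candidates, k))
    masked = [MASK if i in masked_idx else tok for i, tok in enumerate(seq)]
    return masked, masked_idx


history = interleave(["v0", "v1", "v2"], ["a0", "a1"])
masked_history, positions = mask_tokens(history, 0.5, random.Random(0))
```

A pretraining objective would then ask an encoder to reconstruct the tokens at `positions` from the rest of the history.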

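2606.12498 2026-06-12 cs.CR cs.LG New submission

From Parameters to Feature Space: Task Arithmetic for Backdoor Mitigation in Model Merging

Zhenqian Zhu, Yamin Hu, Yiya Diao, Weixiang Li, Haodong Li, Wenjian Luo

AI summary: Proposes Linear Feature Path Minimization (LFPM), a framework that uses cross-task linearity to optimize an anti-backdoor task vector in feature space, effectively suppressing backdoors in model merging while preserving clean-task performance.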

Abstract

Model merging (MM) has gained significant attention as a cost-effective approach to integrate multiple task-specific models into a unified model. However, recent work reveals that MM is highly susceptible to backdoor attacks. Existing defenses based on task arithmetic often fail to eliminate backdoors without substantially degrading clean-task performance, owing to their reliance on direct parameter-space editing. To address this gap, we propose Linear Feature Path Minimization (LFPM), a backdoor mitigation framework for model merging, which introduces an anti-backdoor task vector into the backdoored merged model. Unlike prior approaches, LFPM formulates the backdoor robustness of the merged model from a unified feature-space perspective under the Cross-Task Linearity (CTL) framework, which leverages the approximate linearity of features across tasks. This perspective guides the optimization of the anti-backdoor task to suppress backdoors while preserving clean-task performance. Furthermore, we introduce an effective optimization mechanism based on gradient accumulation and loss path-integral, ensuring robust backdoor suppression along the interpolation path. Extensive experiments demonstrate that LFPM consistently exhibits strong robustness against backdoor attacks in both full fine-tuning and Parameter-Efficient Fine-Tuning (PEFT) settings.
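
To make the task-arithmetic setting concrete, the sketch below shows how task vectors are formed and merged, and where an extra anti-backdoor vector would be added. It is a minimal illustration on toy weight lists, not the LFPM method: the weights, the `scale` parameter, and the hand-picked `tau_anti` are hypothetical, whereas the abstract says LFPM optimizes that vector from a feature-space objective.

```python
def task_vector(finetuned, base):
    """Task vector: element-wise difference between fine-tuned and base weights."""
    return [f - b for f, b in zip(finetuned, base)]


def merge(base, vectors, scale=1.0):
    """Task-arithmetic merge: base + scale * (sum of task vectors)."""
    merged = list(base)
    for vec in vectors:
        merged = [m + scale * v for m, v in zip(merged, vec)]
    return merged


base = [0.0, 1.0, 2.0]
model_a = [0.5, 1.0, 2.0]  # hypothetical clean fine-tune
model_b = [0.0, 1.0, 3.0]  # hypothetical backdoored fine-tune

tau_a = task_vector(model_a, base)
tau_b = task_vector(model_b, base)
merged = merge(base, [tau_a, tau_b])

# Hypothetical anti-backdoor vector, hand-picked here for illustration only.
tau_anti = [0.0, 0.0, -1.0]
defended = merge(merged, [tau_anti])
```

In this toy case the anti-backdoor vector cancels only the coordinate changed by the backdoored model and leaves the clean task's update intact.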

2606.12497 2026-06-12 cs.LG cs.RO 新提交

$μ$VLA: On Recurrent Memory for Partially Observable Manipulation in VLA Models

$μ$VLA:部分可观测操作中VLA模型的循环记忆研究

Egor Cherepanov, Nikita Kachaev, Daniil Zelezetsky, Aydar Bulatov, Artem Pshenitsyn, Yuri Kuratov, Alexey Skrynnik, Aleksandr I. Panov, Alexey K. Kovalev

发表机构 * CogAI Lab, Moscow, Russia(CogAI实验室,莫斯科,俄罗斯) MIRAI, Moscow, Russia(MIRAI,莫斯科,俄罗斯)

AI总结 针对VLA模型在部分可观测场景中的记忆缺失问题,提出仅通过可学习记忆令牌和截断反向传播时间实现最小化循环记忆增强,在MIKASA-Robo上将训练任务成功率从0.42提升至0.84,并在LIBERO上保持全可观测性能。

详情
Comments
34 pages, 20 figures, 9 tables
AI中文摘要

视觉-语言-动作(VLA)模型从当前观测预测未来动作块,这一假设在部分可观测性下失效,因为决策依赖于不再可见的信息。现有的记忆增强VLA同时引入了循环、检索、压缩模块、辅助目标、层次化记忆或特定任务架构变化,因此循环本身的贡献与周围机制纠缠不清。我们提出了一个在强预训练VLA骨干网络中的受控隔离研究。我们的方案通过一小部分可学习的记忆令牌增强Transformer,这些令牌跨时间步传递并通过自注意力更新,使用截断反向传播时间进行端到端训练,没有辅助损失和架构变化。我们将其实例化为$μ$VLA,一组由记忆宽度m、TBPTT长度K和记忆更新规则(跨步梯度或分离的EMA)参数化的OpenVLA-OFT变体,使得循环是唯一变化的因素。在MIKASA-Robo上,$μ$VLA在最强设置下将五个训练任务的平均成功率从0.42提高到0.84,并在具有相同记忆结构的保留任务上达到0.23,而无记忆基线为0.07。在需要不同记忆结构的任务上,性能接近基线。在LIBERO上,最强的循环变体达到96.2%的平均成功率,表明在全可观测性下没有性能下降。我们将这些结果解释为对最小化骨干网络循环能力范围的校准,识别了其足够的情况以及需要额外记忆结构的情况。演示和视频可在以下链接找到:https://example.com。

英文摘要

Vision-language-action (VLA) models predict chunks of future actions from the current observation, an assumption that fails under partial observability, where decisions depend on information no longer visible. Existing memory-augmented VLAs simultaneously introduce recurrence, retrieval, compression modules, auxiliary objectives, hierarchical memory, or task-specific architectural changes, so the contribution of recurrence itself remains entangled with surrounding machinery. We present a controlled isolation study of recurrence in a strong pretrained VLA backbone. Our formulation augments the transformer with a small set of learnable memory tokens carried across timesteps and updated through self-attention, trained end to end with truncated backpropagation through time, with no auxiliary losses and no architectural changes. We instantiate this as $\mu$VLA, a family of OpenVLA-OFT variants parameterized by memory width m, TBPTT length K, and the memory update rule (cross-step gradients or a detached EMA), so that recurrence is the only varying factor. On MIKASA-Robo, $\mu$VLA improves average success rate on five training tasks from 0.42 to 0.84 at the strongest setting and reaches 0.23 on held-out tasks with the same memory structure versus 0.07 for the memoryless baseline. On tasks requiring different memory structure, performance remains near baseline. On LIBERO, the strongest recurrent variant achieves 96.2% average success, indicating no regression under full observability. We interpret these results as a calibration of the capability envelope of minimal in-backbone recurrence, identifying the regime in which it is sufficient and the regime where additional memory structure is required. Demos and videos can be found in this https URL.
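
The two knobs named in the abstract, TBPTT length K and a detached EMA update rule, can be sketched as follows. This is an assumption-laden illustration: the abstract does not give the EMA formula, so the convex-combination form and the `alpha` parameter are hypothetical, and memory tokens are reduced to flat lists of floats.

```python
def tbptt_windows(num_steps, k):
    """Split an episode of num_steps timesteps into truncation windows of length k.

    Gradients would flow only inside each (start, end) window.
    """
    return [(start, min(start + k, num_steps)) for start in range(0, num_steps, k)]


def ema_update(memory, candidate, alpha):
    """Assumed EMA rule: blend old memory with the newly computed memory.

    In a real model the old memory would be detached from the autograd graph.
    """
    return [(1.0 - alpha) * m + alpha * c for m, c in zip(memory, candidate)]


windows = tbptt_windows(5, 2)
new_memory = ema_update([0.0, 2.0], [1.0, 0.0], alpha=0.5)
```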

2606.12495 2026-06-12 cs.SD 新提交

Missing-Token Prompted Reliability-Aware Fusion for Robust Polyglot Speaker Identification

缺失令牌提示的可靠性感知融合用于鲁棒多语种说话人识别

Peng Jia, Li Dai, Jia Li, Zhenzhen Hu, Ye Zhao, Richang Hong

发表机构 * Hefei University of Technology(合肥工业大学) Intelligent Interconnected Systems Laboratory of Anhui Province(安徽省智能互联系统实验室)

AI总结 提出MRAF框架,通过可学习的缺失令牌和可靠性感知交叉注意力融合,解决多语种场景下跨语言泛化和人脸缺失时的鲁棒性问题,在POLY-SIM 2026测试集上取得高准确率。

详情
Comments
8 pages, 3 figures, 4 tables
AI中文摘要

准确且鲁棒的多模态说话人识别对于多媒体理解和生物特征认证至关重要。然而,现实中的多语种场景带来了两个关键挑战:说话人判别性表示应跨语言泛化,并且当人脸信息不可用时模型应保持可靠。为了解决这些挑战,我们提出了MRAF,一个缺失令牌提示的可靠性感知融合框架,用于跨完整模态、缺失人脸和跨语言场景的多语种说话人识别。MRAF用可学习的缺失令牌代替固定的零值特征来表示不可用的人脸输入,提供了缺失视觉状态的可训练表示。这种设计减少了由缺失输入引起的分布差距,并允许后续的可靠性估计和跨模态融合在统一的令牌空间内操作。为了自适应地集成具有不同可靠性的模态,MRAF进一步引入了可靠性感知的交叉注意力融合模块,该模块估计人脸和音频的可靠性分数,将其归一化为模态权重,并在双向交叉注意力之前将这些权重应用于令牌表示。这样,模型可以强调可靠的模态线索,同时抑制不可靠的。在训练过程中,MRAF联合优化多分支分类损失、仅音频知识蒸馏和中心损失,以提高说话人判别性和缺失模态鲁棒性。在官方POLY-SIM 2026测试集上的实验证明了所提出框架的有效性。在最终评估中,MRAF在P3和P5上达到了100%的准确率,并在更具挑战性的缺失人脸设置P4和P6上获得了有竞争力的结果。源代码将在此 https URL 发布。

英文摘要

Accurate and robust multimodal speaker identification is essential for multimedia understanding and biometric authentication. However, real-world polyglot scenarios pose two key challenges: speaker-discriminative representations should generalize across languages, and the model should remain reliable when face information is unavailable. To address these challenges, we propose MRAF, a Missing-Token Prompted Reliability-Aware Fusion framework for polyglot speaker identification across complete-modality, missing-face, and cross-lingual scenarios. MRAF represents unavailable face inputs with a learnable missing token instead of fixed zero-valued features, providing a trainable representation of the missing visual state. This design reduces the distribution gap caused by missing inputs and allows subsequent reliability estimation and cross-modal fusion to operate within a unified token space. To adaptively integrate modalities with different reliability, MRAF further introduces a reliability-aware cross-attention fusion module, which estimates face and audio reliability scores, normalizes them into modality weights, and applies these weights to token representations before bidirectional cross-attention. In this way, the model can emphasize reliable modality cues while suppressing unreliable ones. During training, MRAF jointly optimizes multi-branch classification losses, audio-only knowledge distillation, and center loss to improve speaker discrimination and missing-modality robustness. Experiments on the official POLY-SIM 2026 test set demonstrate the effectiveness of the proposed framework. In the final evaluation, MRAF achieves 100% accuracy on P3 and P5, and obtains competitive results on the more challenging missing-face settings P4 and P6. The source code will be released at this https URL.
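
The two mechanisms described above, substituting a missing token for an absent face input and normalizing reliability scores into modality weights, can be sketched as below. The softmax normalization, the function names, and the toy token values are assumptions for illustration; the abstract only states that scores are normalized into weights.

```python
import math


def modality_weights(face_score, audio_score):
    """Normalize two reliability scores into weights that sum to 1 (softmax, assumed)."""
    peak = max(face_score, audio_score)  # subtract the max for numerical stability
    exp_face = math.exp(face_score - peak)
    exp_audio = math.exp(audio_score - peak)
    total = exp_face + exp_audio
    return exp_face / total, exp_audio / total


def face_tokens_or_missing(face_tokens, missing_token):
    """Use the (learnable) missing token when no face input is available."""
    return face_tokens if face_tokens is not None else [missing_token]


def scale_tokens(tokens, weight):
    """Apply a modality weight to every token before cross-attention."""
    return [[weight * value for value in token] for token in tokens]


w_face, w_audio = modality_weights(0.0, 0.0)
tokens = face_tokens_or_missing(None, missing_token=[0.1, 0.2])
weighted = scale_tokens(tokens, w_face)
```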

2606.12494 2026-06-12 cs.LG 新提交

Net-Ev$^2$: A Generative Simulator for Network Event Evolution

Net-Ev$^2$:网络事件演化的生成式模拟器

Guangyu Wang, Zhaonan Wang

发表机构 * NYU Shanghai(上海纽约大学)

AI总结 提出Net-Ev$^2$,一种结合事件线索与网络拓扑的生成式模拟器,通过结构引导掩码预训练和拓扑感知扩散过程模拟网络事件演化,在多个道路网络数据集上达到最优性能。

详情
Comments
Accepted by KDD 2026 Research Track
AI中文摘要

减少现实世界的试错一直是决策的核心目标,生成式模拟器通过建模未来状态的演化推进了这一目标。一个更具挑战性且更有意义的任务是模拟扰动事件(如事故)如何通过网络传播其影响。现有方法在模拟网络事件演化时,未能同时建模事件的结构化属性和非结构化语义,也未能捕捉拓扑结构。因此,我们提出Net-Ev$^2$($\underline{\textbf{Net}}$work $\underline{\textbf{Ev}}$ent $\underline{\textbf{Ev}}$olution),一种新颖的生成式模拟器,在模拟中联合利用事件线索并保留网络拓扑。具体而言,该框架包含两个阶段:结构引导的掩码预训练和拓扑感知扩散过程,后者通过类似U-Net的图下采样和上采样实现去噪。在推理时,Net-Ev$^2$仅需自然语言事件输入即可生成模拟,具有更大的实际使用灵活性。此外,我们引入了Net-Ev$^2$-6.5M,一个跨四个大规模道路网络的对齐事件和网络流量数据的多模态基准,以及一个新的拓扑感知指标JL-MMD,用于评估生成网络动态的拓扑保真度。大量实验证明了Net-Ev$^2$的最优性能和强泛化能力。代码已开源。

英文摘要

Reducing real-world trial and error has long been a central goal of decision making, and generative simulators advance this goal by modeling the evolution of future states. An even more challenging yet meaningful task is simulating how disturbance events (e.g., accidents) propagate their impacts across real-world networks. Existing approaches to simulating network event evolution fall short both in modeling the structured attributes and unstructured semantics of events and in capturing topological structure. We therefore propose Net-Ev$^2$ ($\underline{\textbf{Net}}$work $\underline{\textbf{Ev}}$ent $\underline{\textbf{Ev}}$olution), a novel generative simulator that jointly leverages event cues while preserving network topology in simulations. Specifically, the framework consists of two stages: structure-guided masked pre-training and a topology-aware diffusion process, the latter realized through U-Net-like graph downsampling and upsampling during denoising. At inference time, Net-Ev$^2$ can generate simulations from natural-language event input alone, which gives greater flexibility in practical use. Furthermore, we introduce Net-Ev$^2$-6.5M, a multimodal benchmark of aligned event and network traffic data across four large-scale road networks, as well as a new topology-aware metric, JL-MMD, to evaluate topological fidelity in generated network dynamics. Extensive experiments demonstrate the state-of-the-art performance and strong generalization ability of Net-Ev$^2$. Code is made available at this https URL.

2606.12490 2026-06-12 cs.LG 新提交

Robustness Verification of Recurrent Neural Networks with Abstraction Refinement

基于抽象精化的循环神经网络鲁棒性验证

Li-Jen Lin, Chih-Duo Hong

发表机构 * National Science and Technology Council (NSTC), Taiwan(台湾国家科学与技术委员会)

AI总结 提出抽象精化框架,通过分割预激活区间消除非线性松弛误差,并利用SHAP引导的时间步选择策略降低组合成本,显著提升RNN鲁棒性验证成功率。

详情
AI中文摘要

循环神经网络(RNN)的认证局部鲁棒性验证具有挑战性,因为非线性松弛引入的近似误差会通过循环连接传播并随时间累积。因此,可扩展的线性边界传播方法往往过于保守,无法认证实际上鲁棒的输入,尤其是当许多预激活区间跨越零点时。我们提出了一种用于RNN验证的抽象精化框架,该框架划分此类区间以消除主要的松弛误差:在每个精化分支上,ReLU变得精确,而tanh和sigmoid等平滑激活函数则允许更紧的线性包络。为了控制在长序列中分裂的组合成本,我们引入了一种SHAP引导的时间步选择策略,该策略根据隐藏状态对验证目标的贡献进行排序,并按时间顺序仅精化最关键的时间步。在CIFAR10和MNIST笔画基准上的实验表明,与仅使用抽象的基线相比,验证成功率和鲁棒性边界紧度持续提升,同时揭示了ReLU和tanh模型之间清晰的运行时权衡。

英文摘要

Certified local robustness verification for recurrent neural networks (RNNs) is challenging because approximation errors introduced by nonlinear relaxations can propagate through recurrent connections and accumulate over time. As a result, scalable linear bound propagation methods often become overly conservative and fail to certify inputs that are in fact robust, especially when many pre-activation intervals cross zero. We propose an abstraction-refinement framework for RNN verification that partitions such intervals to remove the dominant relaxation error: on each refined branch, ReLU becomes exact, and smooth activations such as tanh and sigmoid admit substantially tighter linear envelopes. To control the combinatorial cost of splitting in long sequences, we introduce a SHAP-guided timestep selection strategy that ranks hidden states by their contribution to the verification objective and refines only the most critical timesteps in temporal order. Experiments on CIFAR10 and MNIST stroke benchmarks demonstrate consistent improvements in verification success and robustness-margin tightness over abstraction-only baselines, while exposing clear runtime trade-offs between ReLU and tanh models.
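
Why splitting a zero-crossing pre-activation interval removes the ReLU relaxation error can be checked with a few lines. The sketch uses the standard triangle relaxation, whose upper bound on [l, u] with l < 0 < u is the chord from (l, 0) to (u, u); whether the paper uses exactly this relaxation is an assumption.

```python
def relu_relaxation_gap(lower, upper):
    """Largest vertical gap between the chord upper bound and ReLU on [lower, upper].

    The gap is attained at x = 0 and equals -upper * lower / (upper - lower).
    If the interval does not cross zero, ReLU is linear there and the gap is 0.
    """
    if lower >= 0.0 or upper <= 0.0:
        return 0.0
    return -upper * lower / (upper - lower)


def split_at_zero(lower, upper):
    """Partition a zero-crossing interval into its negative and positive parts."""
    if lower < 0.0 < upper:
        return [(lower, 0.0), (0.0, upper)]
    return [(lower, upper)]


gap_before = relu_relaxation_gap(-1.0, 1.0)
gaps_after = [relu_relaxation_gap(lo, hi) for lo, hi in split_at_zero(-1.0, 1.0)]
```

Each split doubles the number of branches for that neuron, which is the combinatorial cost that a timestep selection strategy is meant to contain.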

2606.12488 2026-06-12 cs.LG 新提交

A Stationary (and Therefore Compatible) Representation is All You Need

静态(因此兼容)表示即所需

Niccolò Biondi, Federico Pernici, Simone Ricci, Alberto Del Bimbo

发表机构 * Media Integration and Communication Center (MICC), Dipartimento di Ingegneria dell’Informazione, Università degli Studi di Firenze(佛罗伦萨大学信息工程系媒体集成与通信中心(MICC))

AI总结 本文证明d-Simplex固定分类器学习的静态表示满足兼容性定义,并通过交叉熵与对比损失的凸组合捕获高阶依赖,实现模型更新时无需重处理的检索服务。

详情
Comments
Accepted to TPAMI2026. Extension of the CVPR2024 version (arXiv:2405.02581)
AI中文摘要

学习兼容表示旨在当模型更新时,特征表示可以互换使用。本文证明,由d-Simplex固定分类器学习的静态表示隐含了其正式定义中的兼容性。这一结果为未来工作奠定了基础,并可直接应用于实际学习场景。我们解决了在模型顺序微调时使用d-Simplex固定分类器学习兼容性的挑战。使用交叉熵损失的d-Simplex固定分类器学习对齐一阶统计量的特征分布,因此可能无法完全捕捉模型更新之间表示的高阶依赖。为解决此问题,我们证明通过交叉熵损失和对比损失的凸组合使用d-Simplex固定分类器训练模型,不仅能捕捉高阶依赖,而且等价于在兼容性约束下使用交叉熵学习。我们通过大量实验证实了我们的发现,并考虑了一个新场景:预训练模型被顺序微调,偶尔被改进模型替换。我们表明,静态表示能够实现不间断的检索服务(无需重新处理图库图像),同时在模型更新和替换期间提升性能,达到最先进水平。代码见此 https URL。

英文摘要

Learning compatible representations aims to learn feature representations that can be used interchangeably over time whenever a model undergoes updates. In this paper, we demonstrate that stationary representations learned by $d$-Simplex fixed classifiers imply compatibility in the sense of its formal definition. This result establishes a foundation for future work and can be directly exploited in practical learning scenarios. We address the challenge of learning compatibility using $d$-Simplex fixed classifiers when the model is sequentially fine-tuned. Learning according to a $d$-Simplex fixed classifier with the cross-entropy loss aligns feature distributions at the first-order statistics. Consequently, it may not fully capture higher-order dependencies in the representation between model updates. To address this issue, we demonstrate that training the model using a $d$-Simplex fixed classifier through a convex combination of the cross-entropy loss and a contrastive loss not only captures higher-order dependencies, but is also equivalent to learning with the cross-entropy under the compatibility constraints. We confirm our findings with extensive experiments, including a new scenario in which a pre-trained model is sequentially fine-tuned and occasionally replaced with an improved model. We show that stationary representations enable uninterrupted retrieval services (without reprocessing gallery images) while improving performance during model updates and replacements, achieving state-of-the-art results. Code at this https URL.
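
A fixed simplex classifier keeps its class prototypes frozen at the vertices of a regular simplex, so the targets do not move between model updates. The sketch below builds such prototypes by centering and normalizing one-hot vectors; this is one standard construction and is assumed here, since the abstract does not state which coordinates the authors use.

```python
import math


def simplex_classifier(num_classes):
    """Return num_classes unit vectors forming a regular simplex.

    Built by centering the one-hot basis and normalizing each row, so every
    pair of prototypes has cosine similarity -1 / (num_classes - 1).
    """
    k = num_classes
    rows = []
    for i in range(k):
        row = [(1.0 if i == j else 0.0) - 1.0 / k for j in range(k)]
        norm = math.sqrt(sum(v * v for v in row))
        rows.append([v / norm for v in row])
    return rows


prototypes = simplex_classifier(4)
cosine_01 = sum(a * b for a, b in zip(prototypes[0], prototypes[1]))
```

Because the prototypes are never trained, logits are plain dot products between a feature and these fixed rows.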

2606.12487 2026-06-12 cs.LG 新提交

DynamicPTQ: Mitigating Activation Quantization Collapse via Residual-Stream Dynamics

DynamicPTQ: 通过残差流动态缓解激活量化崩溃

Zimo Zhao, Maolin Wang, Bowen Yu, Bowen Liu, Xiao Han, Xiangyu Zhao

发表机构 * City University of Hong Kong(香港城市大学) Zhejiang University of Technology(浙江工业大学)

AI总结 提出DynamicPTQ,通过分析残差流中激活的相位式动态变化,识别量化敏感层并分配8位精度,在W4A4KV4量化下提升LLaMA-2/3的困惑度和零样本QA性能,吞吐量提升1.05-1.07倍。

详情
AI中文摘要

训练后量化(PTQ)对于高效的大语言模型推理至关重要,但当权重、激活和KV缓存全部量化到4位精度时,可靠地量化激活仍然具有挑战性。一个关键困难在于大规模激活,其极端值主导激活范围并放大量化误差。最先进的方法主要通过基于变换的平滑(如正交旋转和仿射缩放)来缓解大规模激活,但忽略了残差流的跨层动态。在本文中,我们展示了大规模激活在网络深度上以相位模式出现和消失,触发大的残差变化。这些变化导致新注入的逐层更新主导4位量化尺度,并削弱历史残差信息。为了表征这种行为,我们引入了跳跃比和历史特征信噪比。这表明基于静态变换的平滑无法完全解决由跨层残差变化引起的动态量化不稳定性。基于这一分析,我们提出了DynamicPTQ,一种用于相位感知混合精度激活量化的动态训练后量化策略。DynamicPTQ从残差流动态中识别量化敏感层,并仅对这些层分配8位激活精度,同时保持权重、KV缓存和其他激活为4位精度。它可以直接集成到强大的PTQ基线中,如QuaRot、SpinQuant和FlatQuant。在LLaMA-2和LLaMA-3上的实验表明,DynamicPTQ在W4A4KV4量化下一致地提高了困惑度和零样本QA性能,同时实现了1.05到1.07倍的吞吐量提升,且内存开销适中。这些结果展示了实现鲁棒低位LLM推理的实用路径。

英文摘要

Post-training quantization (PTQ) is essential for efficient large language model inference, but reliably quantizing activations remains challenging when weights, activations, and KV caches are all quantized to 4-bit precision. A key difficulty lies in massive activations, whose extreme values dominate the activation range and amplify quantization errors. State-of-the-art methods mainly mitigate massive activations through transformation-based smoothing, such as orthogonal rotations and affine scaling, but overlook the cross-layer dynamics of the residual stream. In this paper, we show that massive activations emerge and disappear in a phase-wise pattern across network depth, triggering large residual changes. These changes cause newly injected layer-wise updates to dominate the 4-bit quantization scale and weaken historical residual information. To characterize this behavior, we introduce Jump Ratio and Historical Feature SNR. This suggests that static transformation-based smoothing cannot fully resolve dynamic quantization instability caused by cross-layer residual changes. Based on this analysis, we propose DynamicPTQ, a Dynamic Post-Training Quantization policy for phase-aware mixed-precision activation quantization. DynamicPTQ identifies quantization-sensitive layers from residual-stream dynamics and assigns 8-bit activation precision only to these layers, while keeping weights, KV caches, and other activations in 4-bit precision. It can be directly integrated with strong PTQ baselines such as QuaRot, SpinQuant, and FlatQuant. Experiments on LLaMA-2 and LLaMA-3 show that DynamicPTQ consistently improves perplexity and zero-shot QA performance under W4A4KV4 quantization, while achieving 1.05 to 1.07 times throughput improvement with modest memory overhead. These results demonstrate a practical path toward robust low-bit LLM inference.
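
The effect of a massive activation on a shared 4-bit scale can be reproduced on a toy vector. The sketch assumes symmetric per-tensor quantization with a max-based scale, which is a common choice but not necessarily the scheme used in the paper; the activation values are made up.

```python
def fake_quantize(values, bits):
    """Symmetric quantize-then-dequantize using one scale for the whole tensor."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax
    if scale == 0.0:
        return list(values)
    dequantized = []
    for v in values:
        # Round to the nearest integer level and clamp to the signed range.
        level = max(-qmax - 1, min(qmax, round(v / scale)))
        dequantized.append(level * scale)
    return dequantized


def mean_squared_error(reference, approximation):
    return sum((r - a) ** 2 for r, a in zip(reference, approximation)) / len(reference)


# Three ordinary activations plus one massive outlier (hypothetical values).
activations = [0.1, -0.2, 0.3, 50.0]
quantized_4bit = fake_quantize(activations, 4)
error_4bit = mean_squared_error(activations, quantized_4bit)
error_8bit = mean_squared_error(activations, fake_quantize(activations, 8))
```

With the outlier setting the scale, the three small activations all collapse to zero at 4 bits, which is the kind of information loss that motivates giving sensitive layers 8-bit activations.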

2606.12486 2026-06-12 cs.LG 新提交

An Empirical Study on Predictive Maintenance for Component X in Heavy-Duty Scania Trucks

重型斯堪尼亚卡车中组件X的预测性维护实证研究

Valeriu Dimidov, Sasan Jafarnejad, Raphaël Frank

发表机构 * SnT, University of Luxembourg(卢森堡大学SnT) Scania CV AB(斯堪尼亚商用车公司)

AI总结 针对卡车车队,提出一种基于状态监测的预测性维护方法,将磨损状态建模为单调非递减时间序列,通过选取最近观测并转换为表格数据,利用AutoML简化建模,在Scania组件X数据集上降低了成本。

详情
AI中文摘要

近年来,基于状态的预测性维护(PdM)在卡车车队中得到了广泛应用。这种维护策略旨在通过监测车辆的健康状况并根据其状态采取主动措施,最大限度地减少计划外停机并降低成本。然而,由于卡车产生的大量数据、通过传感器数据检测故障的内在复杂性以及在解决方案实施中寻找成本效益权衡的困难,基于状态的PdM系统的实施具有挑战性。在本文中,我们定义并验证了一种基于状态的PdM方法,该方法基于一个假设:被监测组件的磨损状态可以表示为单调非递减的时间序列。它涉及仅从时间序列中选择最近的观测值,并将其转换为表格格式,以便使用为表格数据设计的机器学习(ML)模型进行分类。我们的结果表明,与当前最先进(SOTA)方法相比,所提出的方法在Scania组件X数据集上降低了成本,同时通过AutoML简化了建模过程。

英文摘要

Condition-based Predictive Maintenance (PdM) for truck fleets has gained momentum in recent years. This maintenance strategy aims to minimize unplanned downtimes and reduce costs by monitoring the health status of vehicles and taking proactive action based on their condition. However, the implementation of condition-based PdM systems is challenging due to the large volume of data generated by the trucks, the inherent complexity of detecting failures through sensor data and the difficulties in finding cost-effective trade-offs in the solution's implementation. In this paper, we define and validate a condition-based PdM methodology built on the assumption that the wear-and-tear state of the monitored component can be represented as a monotonically non-decreasing time series. It involves selecting only the most recent observations from the time series and transforming them into a tabular format for classification using machine learning (ML) models designed for tabular data. Our results indicate that the proposed methodology reduces costs on the Scania Component X dataset compared to current state-of-the-art (SOTA) approaches, while also simplifying the modeling process through AutoML.
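
The data preparation step described above, keeping only the most recent observations and flattening them into a fixed-width row, can be sketched as follows. The left-padding rule for short histories and the function names are assumptions; the abstract does not say how short series are handled.

```python
def is_non_decreasing(series):
    """Check the modeling assumption that wear never decreases over time."""
    return all(earlier <= later for earlier, later in zip(series, series[1:]))


def last_k_as_row(series, k):
    """Flatten the k most recent observations into one tabular row.

    Histories shorter than k are left-padded with their earliest value
    (an assumed convention) so every row has exactly k columns.
    """
    recent = list(series[-k:])
    padding = [recent[0]] * (k - len(recent))
    return padding + recent


wear_counter = [1.0, 2.0, 3.0, 4.0]  # hypothetical readings for one truck
row = last_k_as_row(wear_counter, 2)
short_row = last_k_as_row([5.0], 3)
```

Rows built this way can be stacked into a table and passed to any classifier for tabular data.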

2606.12485 2026-06-12 cs.LG cs.AI 新提交

Speculative Rollback Correction for Quality-Diverse Web Agent Imitation

面向质量多样性的Web智能体模仿的推测性回滚修正

Longkun Hao, Hongyu Lin, Hao Li, Zhichao Yang, Haojie Hao, Dongshuo Huang, Haitao Yang, Hongyu Ge, Ming jie Xie, Yanjun Wu, Zi Hao Yin, Yan Bai, Yihang Lou

发表机构 * Beihang University(北京航空航天大学) Institute of Software, Chinese Academy of Sciences(中国科学院软件研究所) The Hong Kong University of Science and Technology(香港科技大学) Northwestern Polytechnical University(西北工业大学) Tsinghua University(清华大学) The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) Peking University(北京大学)

AI总结 提出推测性回滚修正(SRC)框架,通过固定视野分支审查和回滚机制,在减少教师查询的同时保持轨迹多样性,在WebArena-Infinity上收集了977条通过验证的轨迹和9183个下一步动作示例。

详情
AI中文摘要

通过从专家轨迹进行模仿学习来训练交互式Web智能体已成为一种高效的方法。然而,在此背景下,确定专家干预的最佳时机是一个关键挑战。延迟干预往往导致早期错误的累积,将页面状态推入不可恢复的区域。相反,过早或过度干预会使智能体过度依赖专家策略,将模型困在以单一刚性轨迹为特征的局部最优中。我们提出推测性回滚修正(SRC),一种针对可重置智能体环境的分支级模仿框架。SRC不是在每个访问状态请求教师标签,也不是仅在完成轨迹后修正,而是采用固定视野分支审查:学生先执行一个短的推测性片段,然后由教师审查,仅当局部进展中断时,教师才定位第一个有害偏差。回滚保留有用的前缀,而成功的展开由硬验证器过滤并保留在轻量级质量多样性档案中。所得数据支持对局部修正和通过验证器的轨迹进行下一步动作监督微调。在WebArena-Infinity上,SRC收集了977条通过验证器的轨迹和9183个下一步动作示例;固定视野审查在保留通过验证器的解决方案变体的同时,改善了恢复与查询的权衡。代码可在该https URL获取。

英文摘要

Training interactive web agents through imitation learning from expert trajectories has emerged as a highly effective approach. However, determining the optimal timing for expert intervention presents a critical challenge in this context. Delayed intervention often leads to the accumulation of early-stage errors, pushing the page state into an irrecoverable regime. Conversely, premature or excessive intervention causes the agent to become overly reliant on expert policies, trapping the model in local optima characterized by a single, rigid trajectory. We propose Speculative Rollback Correction (SRC), a branch-level imitation framework for resettable agent environments. Instead of requesting teacher labels at every visited state or correcting only after a completed trajectory, SRC uses fixed-horizon branch review: the student executes a short speculative segment before teacher review, and the teacher localizes the first harmful deviation only when local progress breaks. Rollback preserves useful prefixes, while successful rollouts are filtered by a hard verifier and retained in a lightweight quality-diversity archive. The resulting data supports next-action supervised fine-tuning on both localized corrections and verifier-passing trajectories. On WebArena-Infinity, SRC collects 977 verifier-passing trajectories and 9,183 next-action examples; fixed-horizon review improves the recovery-versus-query tradeoff over step-level review while retaining verifier-passing solution variants. Code is available at this https URL.

2606.12483 2026-06-12 cs.LG 新提交

Scalable anomaly detection via a univariate Christoffel function

通过单变量Christoffel函数实现可扩展的异常检测

Florian Grivet (CNES, LAAS-DISCO, Comue de Toulouse), Didier Henrion (LAAS-POP), Jean-Bernard Lasserre (TSE-R, LAAS-POP), Louise Travé-Massuyès (LAAS-DISCO, Comue de Toulouse)

AI总结 针对Christoffel函数方法因矩阵大小随维度指数增长而难以应用于高维数据的问题,提出基于查询点与支撑点间平方距离的单变量Christoffel函数(UCF),在ADBench基准上平均精度优于14种基线方法。

详情
AI中文摘要

异常检测在欺诈检测、网络入侵和系统故障诊断等领域识别异常模式中发挥关键作用。近年来,基于Christoffel函数的方法(根植于多项式优化)因其坚实的数学基础和计算节俭性,成为深度学习的有前景替代方案。然而,其实用性受限于需要求逆一个大小随数据维度指数增长的矩阵,即使对于中等维度数据集也难以处理。本文解决了Christoffel函数异常检测的维度限制,同时保留了其关键理论性质,即开关支撑二分法行为和准确的支撑形状捕获。我们引入了UCF,一种基于查询点与支撑点间平方距离的单变量Christoffel函数。在ADBench基准上的大量实验表明,UCF在平均精度上持续优于14个最先进的基线方法。通过解决Christoffel函数的可扩展性瓶颈,本文扩展了异常检测方法的工具箱,提供了一种稳健、有理论依据且普遍适用的方法。

英文摘要

Anomaly detection plays a critical role in identifying unusual patterns across domains such as fraud detection, network intrusion, and system fault diagnosis. Recently, Christoffel function-based methods, rooted in polynomial optimization, have emerged as promising alternatives to deep learning due to their strong mathematical foundations and computational frugality. However, their practical applicability is hindered by the need to invert a matrix whose size grows exponentially with the data dimension, rendering the method intractable even for moderate-dimensional datasets. This paper addresses the dimensionality limitations of Christoffel function-based anomaly detection while preserving its key theoretical properties, i.e., the on-off support dichotomy behavior and the accurate support shape capture. We introduce UCF, a univariate Christoffel function that is based on the squared distance between the query point and the support points. Extensive experiments on the ADBench benchmark demonstrate that UCF consistently outperforms 14 state-of-the-art baselines in terms of Average Precision. By resolving the scalability bottleneck of the Christoffel function, this work expands the toolkit of anomaly detection methods with a robust, theoretically grounded, and universally applicable approach.
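
The scalability bottleneck can be quantified directly: the moment matrix behind a degree-p Christoffel function in d variables has one row per monomial of degree at most p, and there are C(d + p, p) of those. The short sketch below compares the multivariate and univariate cases; the chosen dimensions and degree are illustrative only.

```python
import math


def moment_matrix_size(dimension, degree):
    """Number of monomials of total degree <= degree in `dimension` variables.

    This is the side length of the moment matrix that must be inverted.
    """
    return math.comb(dimension + degree, degree)


size_50d = moment_matrix_size(50, 4)  # multivariate: 50 features, degree 4
size_1d = moment_matrix_size(1, 4)    # univariate: a single scalar input
```

A function of one scalar, such as a squared distance, keeps the matrix at degree + 1 rows regardless of how many features the data has.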

2606.12481 2026-06-12 cs.LG cs.AI 新提交

Representing Time Series as Structured Programs for LLM Reasoning

将时间序列表示为结构化程序以进行LLM推理

Jaeho Kim, Changhun Oh, Seokhyun Lee, Irina Rish, Changhee Lee

发表机构 * Korea University(高丽大学) Mila, University of Montreal(蒙特利尔大学米拉研究所)

AI总结 提出T2SP方法,将时间序列分解为趋势、周期和显著事件并表示为结构化符号程序,使LLM无需微调即可高效推理,在编辑、描述和问答任务上优于原始序列表示。

详情
Comments
Preprint
AI中文摘要

大型语言模型(LLM)展示了强大的推理和指令遵循能力,使其成为时间序列分析的潜在强大工具。然而,时间序列超出了其原生文本模态,引发了一个基本问题:应该如何表示时间序列,以便LLM能够有效地推理它们?现有工作通常序列化原始数值序列或在时间序列数据上微调预训练的LLM。这些方法将提取时间结构的负担直接放在LLM上,造成了模态不匹配,常常降低长序列的性能并引入大量计算开销。在这项工作中,我们引入了时间序列到结构化程序表示(T2SP),一种确定性的、无需训练的方法,将时间序列表示为结构化的符号程序。T2SP将时间序列分解为趋势、周期和显著事件,并以与LLM原生训练的文本和代码类模态对齐的程序友好格式表达它们。通过将时间结构提取从模型转移到表示本身,T2SP使现成的LLM能够利用其现有的推理能力进行时间序列理解。我们在三个推理任务上评估T2SP——编辑、描述和问答——与原始字符串表示相比,它持续提高了性能,减少了推理时间,并降低了失败率。我们的结果表明,T2SP提供了时间序列和LLM之间的有效接口。

英文摘要

Large language models (LLMs) have demonstrated strong reasoning and instruction-following capabilities, making them potentially powerful tools for time-series analysis. However, time series lie outside their native textual modality, raising a fundamental question: how should time series be represented so that LLMs can reason about them effectively? Existing work typically serializes raw numerical sequences or fine-tunes pre-trained LLMs on time-series data. These approaches place the burden of extracting temporal structure directly on the LLM, creating a modality mismatch that often degrades performance on long sequences and introduces substantial computational overhead. In this work, we introduce Time-Series-to-Structured-Program representation (T2SP), a deterministic, training-free method that represents a time series as a structured symbolic program. T2SP decomposes time series into trends, periods, and salient events, expressing them in a program-friendly format aligned with the textual and code-like modalities on which LLMs are natively trained. By shifting temporal-structure extraction from the model to the representation itself, T2SP enables off-the-shelf LLMs to leverage their existing reasoning capabilities for time-series understanding. We evaluate T2SP on three reasoning tasks -- editing, captioning, and question answering -- where it consistently improves performance, reduces reasoning time, and lowers failure rates compared with raw-string representations. Our results demonstrate that T2SP provides an effective interface between time series and LLMs.
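
The idea of handing an LLM a compact symbolic description instead of raw numbers can be illustrated with a toy converter. Everything in this sketch is hypothetical: the `trend(...)` and `event(...)` syntax, the least-squares trend, and the residual threshold are stand-ins rather than the T2SP format, and period extraction is omitted.

```python
def linear_trend(series):
    """Ordinary least-squares slope and intercept against the time index."""
    n = len(series)
    t_mean = (n - 1) / 2
    y_mean = sum(series) / n
    numerator = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(series))
    denominator = sum((t - t_mean) ** 2 for t in range(n))
    slope = numerator / denominator
    return slope, y_mean - slope * t_mean


def to_program(series, event_threshold):
    """Render a series as text: one trend line plus one line per large residual."""
    slope, intercept = linear_trend(series)
    lines = [f"trend(slope={slope:.2f}, intercept={intercept:.2f})"]
    for t, y in enumerate(series):
        residual = y - (intercept + slope * t)
        if abs(residual) > event_threshold:
            lines.append(f"event(t={t}, residual={residual:+.2f})")
    return "\n".join(lines)


clean_program = to_program([0, 1, 2, 3, 4], event_threshold=0.5)
spiky_program = to_program([0, 1, 2, 10, 4], event_threshold=3.0)
```

The resulting text is deterministic and short even for long series, since its length depends on the number of detected components rather than the number of samples.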

2606.12479 2026-06-12 cs.LG cs.AI 新提交

ReCal: Reward Calibration for RL-based LLM Routing

ReCal: 基于强化学习的LLM路由的奖励校准

Qihang Yu, Hanwen Tong, Zhengqi Zhang, Bo Zheng, Feng Wei, Shengyu Zhang, Zemin Liu, Fei Wu

发表机构 * Zhejiang University(浙江大学) Ant Group(蚂蚁集团) Shanghai AI Laboratory(上海人工智能实验室)

AI总结 提出ReCal框架,通过分层奖励分解和分布感知优化校准奖励信号,解决多目标冲突和异质性任务优化偏差,提升LLM路由性能与稳定性。

详情
AI中文摘要

大型语言模型(LLM)路由已成为一种有效范式,通过动态模型和推理策略选择来利用多个LLM的互补优势。最近的基于强化学习(RL)的路由方法通过从交互反馈中优化路由策略,进一步提高了路由质量。然而,在难度不同的异质性任务下,它们仍然难以提供信息丰富且可比较的学习信号。在实践中,多个目标(如正确性、格式行为)被聚合为单个标量奖励,导致模糊的信用分配和冲突的优化信号。此外,奖励信号在不同实例间表现出显著变异性,其中一些实例产生更高或更可变的奖励,引入了偏向于平凡样本而非信息性样本的优化偏差。为了解决这些问题,我们提出了\textbf{ReCal},一个用于基于RL的LLM路由的\textbf{\underline{Re}}ward \textbf{\underline{Cal}}ibration(奖励校准)框架。我们首先引入了一种具有分量式优势估计的分层奖励分解机制。我们进一步提出了一种分布感知的优化策略,通过方差感知重加权和每数据集归一化来校准优化变异性。在七个数据集上的实验表明,ReCal在路由性能和训练稳定性上持续优于基线方法。代码可在该网址获取。

英文摘要

Large language model (LLM) routing has emerged as an effective paradigm for leveraging the complementary strengths of multiple LLMs through dynamic model and reasoning-strategy selection. Recent reinforcement learning (RL)-based routing methods further improve routing quality by optimizing routing policies from interaction feedback. However, they still struggle to provide informative and comparable learning signals under heterogeneous tasks with varying difficulty. In practice, multiple objectives (e.g., correctness, format behavior) are aggregated into a single scalar reward, leading to ambiguous credit assignment and conflicting optimization signals. Moreover, reward signals exhibit significant variability across instances, where some instances produce higher or more variable rewards, introducing optimization bias that favors trivial samples over informative ones. To address these issues, we propose \textbf{ReCal}, a \textbf{\underline{Re}}ward \textbf{\underline{Cal}}ibration framework for RL-based LLM routing. We first introduce a hierarchical reward decomposition mechanism with component-wise advantage estimation. We further propose a distribution-aware optimization strategy that calibrates optimization variability through variance-aware reweighting and per-dataset normalization. Experiments on seven datasets demonstrate that ReCal consistently improves routing performance and training stability over baselines. Code is available at this https URL.
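
Per-dataset normalization, one of the calibration steps named above, can be sketched as a z-score computed separately within each dataset. The exact formula, the `eps` constant, and the example reward values are assumptions; the abstract names the technique without defining it.

```python
from statistics import mean, pstdev


def normalize_per_dataset(rewards_by_dataset, eps=1e-8):
    """Standardize rewards within each dataset so scales become comparable."""
    normalized = {}
    for name, rewards in rewards_by_dataset.items():
        center = mean(rewards)
        spread = pstdev(rewards)
        normalized[name] = [(r - center) / (spread + eps) for r in rewards]
    return normalized


# Hypothetical rewards: an easy dataset with uniformly high rewards and a
# hard dataset with widely spread rewards.
rewards = {"easy": [0.9, 1.0, 1.1], "hard": [0.0, 0.5, 1.0]}
calibrated = normalize_per_dataset(rewards)
```

After normalization both datasets contribute signals of the same magnitude, so samples from the high-reward dataset no longer dominate the update.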

2606.12476 2026-06-12 cs.LG cs.AI cs.CL 新提交

Quickest Detection of Hallucination Onset: Delay Bounds and Learned CUSUM Statistics

幻觉起始的快速检测:延迟界与学习型CUSUM统计量

Igor Itkin

发表机构 * Independent Researcher(独立研究员)

AI总结 将幻觉起始检测建模为快速变化检测问题,基于RAGTruth验证的一阶马尔可夫模型,利用学习型CUSUM算法在匹配虚警率下实现11-13个token的检测延迟,优于线性基线,并揭示了分类指标掩盖的延迟结构。

详情
Comments
14 pages, 1 figure
AI中文摘要

Token级幻觉检测器作为分类器进行评估,通过所有token的AUC,但流式监控器由其反应时间判断:从幻觉开始到警报之间的token数量。我们将幻觉起始检测表述为一个快速变化检测问题。在RAGTruth上验证的潜在忠实/幻觉状态的一阶马尔可夫模型,将任务置于经典变点理论中,并得出Lorden关于检测延迟的下界:在虚警率为0.01时约为1.3个token。然后我们证明,因果循环标注器充当了具有学习增量的CUSUM;在匹配的虚警率下,它在11-13个token内检测到,而线性每token基线为31个token,受控分解将大部分优势归因于更好的每token得分,而非时间累积。Donsker-Varadhan型的信息率最优性定理解释了剩余的数量级差距:学习得分仅实现了特征携带散度的1/4.5,这一缺陷无法通过重新校准消除,其余部分为有限时域效应。分类指标掩盖了这种延迟结构;序列分析使其可测量。

英文摘要

Token-level hallucination detectors are evaluated as classifiers, by AUC over all tokens, yet a streaming monitor is judged by its reaction time: the number of tokens that pass between the onset of a hallucination and the alarm. We formulate hallucination onset detection as a quickest change detection problem. A first-order Markov model of the latent faithful/hallucinated state, validated on RAGTruth, places the task inside classical change-point theory and yields Lorden's lower bound on detection delay: about 1.3 tokens at a false-alarm rate of 0.01. We then show that a causal recurrent labeler acts as a CUSUM with a learned increment; at a matched false-alarm rate it detects in 11-13 tokens, against 31 for a linear per-token baseline, and a controlled decomposition attributes most of this advantage to a better per-token score rather than to temporal accumulation. An information-rate optimality theorem of Donsker-Varadhan type explains the remaining order-of-magnitude gap: the learned score realizes only 1/4.5 of the divergence the features carry, a deficit that recalibration cannot remove, with the remainder a finite-horizon effect. Classification metrics conceal this delay structure; sequential analysis makes it measurable.
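
The CUSUM recursion referred to above is short enough to write out: the statistic accumulates per-token increments, resets at zero, and raises an alarm once it crosses a threshold. The scores, the threshold, and the convention of counting the alarm token in the delay are illustrative assumptions; in the paper the increment is learned.

```python
def cusum_alarm(increments, threshold):
    """Return the index of the first token at which the CUSUM statistic
    reaches the threshold, or None if it never does."""
    statistic = 0.0
    for index, increment in enumerate(increments):
        # Classic CUSUM update: accumulate evidence, never drop below zero.
        statistic = max(0.0, statistic + increment)
        if statistic >= threshold:
            return index
    return None


def detection_delay(increments, onset, threshold):
    """Tokens from hallucination onset to alarm, counting the alarm token."""
    alarm = cusum_alarm(increments, threshold)
    if alarm is None or alarm < onset:
        return None  # missed detection or false alarm before onset
    return alarm - onset + 1


# Hypothetical per-token scores: negative while faithful, positive after onset.
scores = [-1.0, -1.0, -1.0, 2.0, 2.0, 2.0]
delay = detection_delay(scores, onset=3, threshold=4.0)
```

Raising the threshold lowers the false-alarm rate at the cost of a longer delay, which is the trade-off the delay bound describes.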

2606.12475 2026-06-12 cs.RO 新提交

Learning to Assist: Collaborative VLAs for Implicit Human-Robot Collaboration

学习辅助:面向隐式人机协作的协作式VLA模型

Leo Xu, Letian Li, Alex Cuellar, Michael Hagenow

发表机构 * University of Wisconsin–Madison(威斯康星大学麦迪逊分校) Massachusetts Institute of Technology(麻省理工学院)

AI总结 本文研究利用视觉-语言-动作(VLA)模型通过模仿学习实现人机协作,发现动作分块策略在隐式协作中存在演示动作泄漏问题,提出推理时引导方法缓解过早辅助行为,并通过用户研究验证其有效性。

详情
AI中文摘要

人机协作(HRC)结合了人类和机器人的互补优势,以提高任务效率。然而,许多现有的协作系统依赖于手工设计的流程,限制了其对新任务的可扩展性和灵活性。在这项工作中,我们展示了通过模仿学习进行端到端训练的模型,特别是视觉-语言-动作(VLA)模型,可以支持协作操作,并刻画了影响其真实世界性能的关键因素。我们评估了两种最先进的模型,并识别了隐式HRC中动作分块策略的一种失败模式,其中演示动作泄漏(即动作块跨越潜在任务转换)可能导致过早的辅助行为。我们发现,这个问题随着执行时域的增长而加剧,并在真实世界的协作VLA系统中出现,例如当机器人试图在人员准备好之前移交工具时。我们提出了一种推理时引导方法,以减轻这些错误的辅助动作,同时保持策略性能。最后,通过一项16名参与者在长时域协作组装任务上的用户研究,我们表明引导能够实现更长的执行时域,同时减轻过早辅助,与短时域策略相比,实现了更快的协作和更少的失败。

英文摘要

Human-robot collaboration (HRC) combines the complementary strengths of humans and robots to improve task efficiency. However, many existing collaborative systems rely on hand-engineered pipelines, limiting their scalability and flexibility for new tasks. In this work, we show that models trained end-to-end with imitation learning, specifically vision-language-action (VLA) models, can support collaborative manipulation, and characterize the key factors affecting their real-world performance. We evaluate two state-of-the-art models and identify a failure mode of action-chunking policies in implicit HRC, where demonstration action leakage (i.e., action chunks crossing latent task transitions) can cause premature assistive behavior. We find that this issue increases with longer execution horizons and occurs in real-world collaborative VLA systems, such as when a robot attempts to hand over a tool before the person is ready. We propose an inference-time steering method to mitigate these erroneous assistive actions while preserving policy performance. Finally, through a 16-participant user study on a long-horizon collaborative assembly task, we show that steering enables a longer execution horizon while mitigating premature assistance, leading to faster collaboration and fewer failures compared to a shorter-horizon policy.

2606.12474 2026-06-12 cs.MA cs.AI cs.CR New submission

SAIGuard: Communication-State Simulation for Proactive Defense of LLM Multi-Agent Systems

Ruxue Shi, Yili Wang, Mengnan Du, Qinggang Zhang, Rui Miao, Yixin Liu, Xin Wang

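AI summary: Proposes SAIGuard, a proactive defense framework that uses communication-state simulation to detect and sanitize risky messages, lowering attack success rates while preserving system utility.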

Details

English abstract

LLM-based multi-agent systems (MAS) solve complex tasks through inter-agent collaboration, but their communication-driven nature also allows security risks to spread across agents and trigger system-wide failures. Existing MAS defenses mainly follow a reactive, post-execution paradigm of detecting and isolating harmful agents, which may cause irreversible damage and degrade collaborative utility. To address this, we propose a proactive defense framework for MAS security, the Simulation-aware Interception Guard (SAIGuard). SAIGuard performs communication-state simulation over the MAS interaction graph, estimates the impact of incoming messages on local agent states and the global MAS state, and detects risky messages via reconstruction deviations from benign communication patterns. Instead of isolating agents, SAIGuard sanitizes or regenerates suspicious messages before they propagate into the system. Experiments across diverse topologies and attack scenarios show that SAIGuard reduces attack success rates while maintaining MAS utility, outperforming reactive defenses.
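
The idea of flagging messages by their reconstruction deviation from benign patterns can be sketched as follows. This is a toy stand-in, not SAIGuard itself: the abstract does not specify the model, so the sketch assumes messages are already encoded as numeric feature vectors, uses the benign centroid as a trivial "reconstruction", and sets a hypothetical threshold of mean plus `k` standard deviations of the benign errors.

```python
import math
from statistics import mean, pstdev


def centroid(vectors):
    """Per-dimension mean of the benign vectors (the toy 'reconstruction')."""
    return [mean(column) for column in zip(*vectors)]


def deviation(vector, reference):
    """Euclidean distance between a message vector and its reconstruction."""
    return math.dist(vector, reference)


def fit_threshold(benign_vectors, k=3.0):
    """Calibrate a deviation threshold on benign data only (k is an assumption)."""
    reference = centroid(benign_vectors)
    errors = [deviation(v, reference) for v in benign_vectors]
    return reference, mean(errors) + k * pstdev(errors)


def is_risky(vector, reference, threshold):
    """Flag a message whose deviation exceeds the calibrated threshold."""
    return deviation(vector, reference) > threshold


# Hypothetical benign message features, for illustration only.
benign = [[0.0, 0.0], [0.2, 0.0], [0.0, 0.2], [0.2, 0.2]]
reference, threshold = fit_threshold(benign)
print(is_risky([0.1, 0.1], reference, threshold))
print(is_risky([5.0, 5.0], reference, threshold))
```

A flagged message would then be sanitized or regenerated rather than delivered; that step is omitted here because the abstract gives no detail on how it is done.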

2606.12473 2026-06-12 cs.CV New submission

Stereo Vision-Based Fall Prediction and Detection using Human Pose Estimation on the AMD Kria K26 SOM

Shreyas Narasimhiah Ramesh, P. D. Rathika, Mahasweta Sarkar, Kristen Wells, Michel Audette, Christopher Paolini

Affiliations * San Diego State University; PSG College of Technology; Old Dominion University

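AI summary: Proposes a low-power, portable stereo-vision fall prediction and detection system on the AMD Kria K26 SOM. A three-stage pipeline of quantized YOLOX, A2J, and CNN models performs real-time, privacy-preserving fall detection, with the multi-threaded version reaching 4.5 FPS.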

Details
Comments
19 pages; 31 figures

English abstract

Background and Objective: Falls among elderly people can cause serious injury and reduce quality of life. Timely prediction and detection are essential to prevent harm and support well-being. We propose a portable, low-power, battery-operated, vision-based fall prediction and detection system using human pose estimation (HPE) on an AMD Kria K26 System-on-Module (SOM). The objective is a non-intrusive, privacy-preserving system for real-time fall detection. Methods: The system uses an Intel RealSense D455 range-sensing camera connected to the K26 SOM by USB. It captures synchronized RGB and depth frames, at 640 x 480 x 3 and 640 x 480 pixels respectively, at 60 FPS. The SOM runs a three-stage pipeline with quantized YOLOX, Anchor-to-Joint (A2J), and fall-detection models. YOLOX identifies human bounding boxes from RGB frames, then discards the RGB frames to preserve privacy. A2J uses depth frames to estimate 15 joint keypoints per person. A CNN uses selected joint coordinates (x, y, z) to classify fall activity. YOLOX was trained on CrowdHuman; A2J on ITOP, MP-3DHP, UR Fall Detection, and a custom SDSU PSG dataset; and the CNN on UR Fall Detection and SDSU PSG. Two configurations were used: a single-core DPU with a serial pipeline, and a dual-core DPU running YOLOX and A2J in multiple threads. Results: Quantized accuracy was evaluated using IoU >= 50% for YOLOX, mAP with a 10-cm rule for A2J, and classification accuracy, (TP + TN)/(TP + TN + FP + FN), for the CNN. Accuracies were 74%, 84.13%, and 75.85%, respectively. Throughput improved from 2.5 FPS for the single-threaded pipeline to 4.5 FPS for the multi-threaded version. Conclusion: Results demonstrate the feasibility of privacy-preserving fall detection on an AMD Kria K26 edge device. On-device HPE and fall classification run without cloud dependency, supporting elderly monitoring and assistive healthcare. Future work will improve model accuracy and speed.
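
The three evaluation criteria named in the abstract can be sketched in a few lines of Python. This is a rough illustration, not the authors' evaluation code: the box format `(x_min, y_min, x_max, y_max)`, the function names, and all sample values are assumptions. The 10-cm rule is read here as "a joint counts as correct if it lies within 10 cm of the ground truth", and the function returns a simple per-joint hit rate rather than a full mAP.

```python
import math


def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_detection_match(pred_box, true_box, threshold=0.5):
    """A detection counts as correct when IoU >= 50%."""
    return iou(pred_box, true_box) >= threshold


def joints_within_tolerance(pred_joints, true_joints, tol_m=0.10):
    """Fraction of 3D joints (in meters) within tol_m of the ground truth."""
    hits = sum(math.dist(p, t) <= tol_m for p, t in zip(pred_joints, true_joints))
    return hits / len(true_joints)


def accuracy(tp, tn, fp, fn):
    """Classification accuracy: (TP + TN) / (TP + TN + FP + FN)."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("confusion counts must not all be zero")
    return (tp + tn) / total


# Hypothetical values, for illustration only.
print(is_detection_match((0, 0, 2, 2), (1, 0, 3, 2)))
print(joints_within_tolerance([(0, 0, 1.0), (0.5, 0, 1.0)],
                              [(0, 0, 1.05), (0.5, 0.2, 1.0)]))
print(accuracy(tp=40, tn=35, fp=15, fn=10))
```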