arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.21972 2026-05-22 cs.LG

How Sparsity Allocation Shapes Label-Free Post-Pruning Recoverability

Qishi Zhan, Minxuan Hu, Liang He

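AI summary: This paper studies how sparsity allocation affects post-repair recoverability under a fixed activation-statistic repair backend. Comparing ERK and LAMP allocations across datasets and models, it finds that the allocation choice substantially affects post-repair accuracy and identifies a repair-sensitive transition regime.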

Abstract

Unstructured magnitude pruning at high sparsity can reduce neural network accuracy to near-random performance, while labeled retraining may be unavailable in practical deployment settings. Label-free post-pruning repair methods can partially recover collapsed sparse models, but their effectiveness depends on the sparse model left by the upstream pruning allocation. This paper studies how sparsity allocation shapes post-repair recoverability under a fixed activation-statistic repair backend. We compare ERK and LAMP allocations under the same label-free repair protocol across CIFAR-10, CIFAR-100, and Imagenette with ResNet-18, ResNet-34, and ResNet-50 at sparsities from 90% to 95.5%. The results show that allocation choice can substantially change post-repair accuracy at the same global sparsity, and that the preferred allocation varies with architecture, dataset difficulty, and sparsity level. We identify a repair-sensitive transition regime in which BatchNorm recalibration begins to fail, while activation-statistic repair still recovers nontrivial accuracy. Additional validation on ImageNet-100 and DenseNet-121 shows that the location and width of this recoverable regime depend on data scale and connectivity structure. These findings suggest that pruning allocation and post-pruning repair should be studied jointly, since the allocation determines how much activation signal remains available for label-free recovery.
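
As a rough illustration of the setting, unstructured magnitude pruning at a target global sparsity can be sketched in plain Python. This is a minimal sketch under stated assumptions: the function names and the toy two-layer dictionary are hypothetical, and the code ranks all weights against one global threshold rather than applying the ERK or LAMP allocations that the paper compares.

```python
def magnitude_prune(layers, sparsity):
    """Zero the smallest-magnitude weights, ranked across all layers together.

    layers: dict mapping a layer name to a flat list of weights (toy stand-in).
    sparsity: fraction of all weights to remove, e.g. 0.9 for 90%.
    """
    magnitudes = sorted(abs(w) for ws in layers.values() for w in ws)
    n_prune = int(len(magnitudes) * sparsity)
    if n_prune == 0:
        return {name: list(ws) for name, ws in layers.items()}
    # Largest magnitude that still gets pruned; ties at this value are pruned too.
    threshold = magnitudes[n_prune - 1]
    return {
        name: [0.0 if abs(w) <= threshold else w for w in ws]
        for name, ws in layers.items()
    }


def layer_sparsity(weights):
    """Fraction of exactly-zero weights in one layer."""
    return sum(1 for w in weights if w == 0.0) / len(weights)


# Hypothetical toy model: one layer with large weights, one with small weights.
toy = {
    "conv1": [0.9, -0.8, 0.7, 0.6, -0.5],
    "fc": [0.05, -0.04, 0.03, 0.02, -0.01],
}
pruned = magnitude_prune(toy, sparsity=0.5)
per_layer = {name: layer_sparsity(ws) for name, ws in pruned.items()}
```

In this toy case the single global threshold removes every weight in `fc` and none in `conv1`, even though the overall sparsity is 50%. Layer-wise imbalance of this kind is what an allocation rule decides, which is why two allocations at the same global sparsity can leave very different models for a repair step to work with.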

2605.21968 2026-05-22 cs.LG

An Improved Adaptive PID Optimizer with Enhanced Convergence and Stability for Deep Learning

Saurabh Saini, Kapil Ahuja, Thomas Wick, Saurav Kumar

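AI summary: This paper proposes IAdaPID-ADG, an improved adaptive PID optimizer that addresses AdaPID's convergence and stability shortcomings by introducing a non-increasing effective learning rate and a gradient-difference-based modulation factor; experiments show strong performance on multiple datasets.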

Comments 11 Pages, Double Column, 6 Tables, 5 Figures

Abstract

Optimization is essential in deep learning. The foundational method upon which most optimizers are built is momentum-based stochastic gradient descent. However, it suffers from two key drawbacks: noisy, varying gradients and an overshoot phenomenon. To address noisy gradients, Adam was proposed, and it remains the most widely used adaptive optimizer. To address the overshoot phenomenon, a control-theory-based PID optimizer was proposed. To tackle both limitations within a single framework, several variants of Adaptive PID (AdaPID) have recently been proposed. Although AdaPID performs well, it still inherits two critical drawbacks from Adam, namely convergence and stability issues. In this work, we address both of these limitations. To fix the convergence issue, we integrate into AdaPID the idea of using a non-increasing effective learning rate (originally proposed in AMSGrad, an extension of Adam). To fix the stability issue, we integrate into AdaPID a gradient-difference-based modulation factor (originally proposed in DiffGrad, another extension of Adam). Combining both ideas in AdaPID results in our novel IAdaPID-ADG optimizer. We evaluate the proposed optimizer on multiple datasets, including benchmark datasets (MNIST and CIFAR10) and real-world datasets (IARC and AnnoCerv). IAdaPID-ADG substantially outperforms all competing optimizers. Additionally, we perform an ablation study on the MNIST dataset to demonstrate the contribution of each added component.
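
The two borrowed ingredients can be illustrated on a plain Adam-style update. The sketch below is not the IAdaPID-ADG optimizer: the PID terms are omitted because the abstract does not specify them, bias correction is left out for brevity, and the function name, state layout, and hyperparameter values are assumptions chosen for illustration.

```python
import math


def adam_step_with_amsgrad_and_diffgrad(theta, grad, state, lr=0.01,
                                        beta1=0.9, beta2=0.999, eps=1e-8):
    """One scalar Adam-style step with two add-ons (no PID terms, no bias correction).

    state is a dict holding m, v, v_max, and prev_grad.
    """
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad
    # AMSGrad idea: keep the running maximum of v, so the effective
    # learning rate lr / sqrt(v_max) can never increase.
    state["v_max"] = max(state["v_max"], state["v"])
    # DiffGrad idea: scale the step by sigmoid(|g_{t-1} - g_t|), which lies in
    # [0.5, 1) and shrinks the step when consecutive gradients are similar.
    friction = 1.0 / (1.0 + math.exp(-abs(state["prev_grad"] - grad)))
    state["prev_grad"] = grad
    return theta - lr * friction * state["m"] / (math.sqrt(state["v_max"]) + eps)


# Toy problem (assumption): minimize f(theta) = theta**2 from theta = 5.
state = {"m": 0.0, "v": 0.0, "v_max": 0.0, "prev_grad": 0.0}
theta = 5.0
v_max_history = []
for _ in range(200):
    grad = 2.0 * theta
    theta = adam_step_with_amsgrad_and_diffgrad(theta, grad, state)
    v_max_history.append(state["v_max"])
```

Recording `v_max` makes the AMSGrad property directly checkable: the sequence never decreases, so the per-step scale `lr / sqrt(v_max)` never grows.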

2605.21965 2026-05-22 cs.CL

SpecHop: Continuous Speculation for Accelerating Multi-Hop Retrieval Agents

Mehrdad Saberi, Keivan Rezaei, Soheil Feizi

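AI summary: This paper studies how to accelerate multi-hop tool use without changing the final trajectory, and proposes SpecHop, a continuous speculation framework that reduces latency by maintaining multiple speculative threads and asynchronously verifying predicted observations.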

Abstract

Large language models increasingly use external tools such as web search and document retrieval to solve information-intensive tasks. However, multi-hop tool use in complex tasks introduces substantial latency, since the model must repeatedly wait for tool observations before continuing. We study how to accelerate such trajectories without changing the final trajectory the model would have taken without acceleration, assuming access to faster but less reliable speculator tools. We develop a theoretical framework for lossless speculation in multi-hop tool-use settings, characterizing the optimal achievable latency gain. We propose SpecHop, a continuous speculation framework that maintains multiple speculative threads, verifies predicted observations asynchronously as target tool outputs arrive, commits correct branches, and rolls back incorrect ones. This preserves accuracy while reducing wall-clock latency. We show that SpecHop can approach oracle latency gains with enough active threads. Empirically, on retrieval-augmented multi-hop tasks, SpecHop closely matches theoretical predictions and reduces latency by up to 40% in some settings. Code: https://github.com/mehrdadsaberi/spechop

2605.21963 2026-05-22 cs.LG cs.AI

ChronoMedicalWorld: A Medical World Model for Learning Patient Trajectories from Longitudinal Care Data

Jiangyuan Wang, Xuyong Chen, Junwei He, Xu Xu, Shasha Xie, Fuman Han

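AI summary: This paper proposes ChronoMedicalWorld, a model for learning patient trajectories from longitudinal care data. It combines a joint-embedding state encoder with a wide action encoder and trains a recurrent latent transition module under a six-term objective, aiming to improve long-horizon prediction accuracy in chronic-disease care.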

Comments 14 pages, 2 figures, 6 tables

Abstract

Long-horizon clinical simulation, which predicts how a patient's physiology evolves over years under specified interventions, is central to chronic-disease care, yet existing electronic health record (EHR) models are predominantly discriminative, and general-purpose large language models drift under repeated interventions. We propose the ChronoMedicalWorld Model (CMWM), an action-conditioned latent world-model framework for learning patient trajectories from longitudinal care data. CMWM couples a joint-embedding state encoder with a wide action encoder that admits both structured intervention indicators and free-text communication embeddings, and trains a recurrent latent transition module under a six-term objective: next-observation supervision, next-latent prediction, SIGReg latent regularisation, and three physiology-aware shape priors (slope, continuity, large-jump penalty). A closed-loop rollout-prefix protocol matches training to deployment, so the model is optimised against the same multi-step error it exhibits at inference. As a concrete case study, we instantiate CMWM for annual estimated glomerular filtration rate (eGFR) trajectory forecasting in chronic kidney disease (CKD). On a 2,232-patient nephrology cohort, the CKD instantiation achieves a dynamic-50% history rollout test mean absolute error (MAE) of 7.384 and root-mean-square error (RMSE) of 10.256, against 7.964 and 11.069 for a tuned GPT-5.5 structured-prompting baseline (-7.28% MAE, -7.35% RMSE), with the gain dominated by the dialogue portion of patient–health-coach communication. The framework is not CKD-specific: its architecture, loss design, and training protocol apply to any chronic condition that can be cast as periodic clinical state interleaved with structured and conversational interventions.

2605.21962 2026-05-22 cs.AI cs.CY cs.HC cs.MA

AI-Enabled Serious Games: Integrating Intelligence and Adaptivity in Training Systems

Priyamvada Tripathi, Bill Kapralos

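AI summary: This chapter explores how AI techniques can improve real-time instructional adaptation in serious games, analyzes the definitions of intelligence and adaptivity, and discusses the application of large language models, reinforcement learning, and agent-based architectures in serious games along with the challenges they face.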

Comments Book chapter, 1 figure. To appear in "Advances in Global Applied Artificial Intelligence," G. A. Tsihrintzis, M. Virvou, N. G. Bourbakis, and L. C. Jain (Eds.), Springer, Learning and Analytics in Intelligent Systems book series, 2026

Abstract

Serious games are widely used for learning and training across domains such as healthcare, defense, and education. Persistent challenges remain, however, including static scenario design, authoring bottlenecks, limited learner modeling, and difficulty implementing meaningful real-time instructional adaptation. Recent advances in artificial intelligence (AI) introduce novel capabilities such as dynamic scenario variation, contextual feedback, adaptive pacing, and learner-state modeling that may help address some of these limitations. At the same time, integrating AI into serious games raises important questions related to validity, transparency, system control, and learner trust. This chapter examines how contemporary AI approaches may support real-time instructional adaptation in serious games. It distinguishes between instructional intelligence, defined as a system's capacity to infer learner knowledge and reason about pedagogically appropriate responses, and adaptivity, defined as the ability to modify instructional actions during interaction. A historical synthesis of adaptive learning systems is presented, tracing developments from early computer-assisted instruction through intelligent tutoring systems (ITS), dynamic difficulty adjustment (DDA), authoring platforms, learning analytics, and recent AI-enabled architectures. Building on this perspective, the chapter discusses how large language models (LLMs), reinforcement learning (RL), and agent-based architectures may contribute to more integrated forms of intelligence and adaptivity in serious games. It also highlights practical and research challenges associated with AI-enabled systems, including explainability, validation, computational cost, and the limited empirical evidence regarding long-term learning outcomes in AI-enabled serious games.

2605.21958 2026-05-22 cs.CL

Diagnosis Is Not Prescription: Linguistic Co-Adaptation Explains Patching Hazards in LLM Pipelines

Yoon Jeonghun, Kim Dongchan

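AI summary: This paper studies the tension between diagnosis and patching when multi-module LLM agents fail. Although the routing module is the bottleneck, injecting correction examples into it degrades performance, whereas patching the query-rewriting module is more effective; the paper proposes the Linguistic Contract hypothesis to explain this.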

Comments Preprint. Under review at EMNLP 2026 (ARR)

Abstract

When a multi-module LLM agent fails, the module most responsible for the failure is not necessarily the best place to intervene. We demonstrate this Diagnostic Paradox empirically: causal analysis consistently identifies the routing module -- which selects which tool to call next -- as the primary bottleneck across three independent agent families. Yet injecting prompt-level correction examples into this module consistently degrades performance, sometimes severely. Patching an upstream query-rewriting module instead reliably improves outcomes. The effect holds with statistical significance on two agent families and directional consistency on a third; alternative repair strategies at the routing module (instruction rewriting, model upgrade) are neutral, confirming that the harm is specific to correction-injection patching. We explain this asymmetry through the Linguistic Contract hypothesis: each downstream module implicitly adapts to its upstream's characteristic error distribution, so correcting the bottleneck breaks this implicit alignment in a way that upstream corrections do not. We operationalize this via a per-agent co-adaptation measure, derived from diagnosis alone, and show it is consistently associated with patching harm across agent families: higher co-adaptation co-occurs with harm, lower with safety. This trend holds across all three agent families, providing preliminary support for the hypothesis beyond a single-agent observation.

2605.21957 2026-05-22 cs.CV

Bounding-Box Trajectories Matter for Video Anomaly Detection

Inpyo Song, Jangwon Lee

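AI summary: This paper proposes TrajVAD, a framework that learns normal motion patterns by modeling multi-class bounding-box trajectories and uses those trajectories as the primary anomaly cue, outperforming existing pose-based methods on the ShanghaiTech dataset.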

Comments 17 pages, 3 figures

Abstract

Video anomaly detection is critical for public safety and security, yet remains highly challenging despite extensive research due to large variations in appearance, viewpoint, and scene dynamics. Among existing approaches, human pose-based methods have emerged as a major line of research, showing strong performance since many anomalies in public datasets involve humans and pose representations are robust to appearance changes while providing compact motion descriptions. However, these methods often overlook bounding-box trajectories, although such information is inherently available in pose-based pipelines. In this paper, we explicitly leverage these trajectories as a primary anomaly cue. We present TrajVAD, a framework that models multi-class bounding-box trajectories using normalizing flows to learn normal kinematic patterns. Its trajectory-only variant (TrajVAD-T) eliminates pose estimation and surpasses all compared pose-based methods on ShanghaiTech in AP (87.7%), while achieving the best results on MSAD. An extended version (TrajVAD-P) incorporates pose information and further improves performance to 88.6% AUROC and 90.9% AP on ShanghaiTech, highlighting bounding-box trajectories as an effective yet underexplored modality for video anomaly detection.

2605.21954 2026-05-22 cs.CV cs.AI

MLLMs Know When Before Speaking: Revealing and Recovering Temporal Grounding via Attention Cues

Dazhao Du, Liao Duan, Jian Liu, Tao Han, Yujia Zhang, Eric Liu, Xi Chen, Song Guo

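AI summary: This paper studies the gap between perception and generation in video temporal grounding with multimodal large language models (MLLMs) and proposes an inference-time read-then-regenerate framework that uses attention cues to improve temporal grounding accuracy, improving MiMo-VL-7B, Qwen3-VL-8B, and TimeLens-8B on three video temporal grounding benchmarks.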

Comments Project Website: https://ddz16.github.io/mllmsknowwhen.github.io/

Abstract

Video temporal grounding (VTG), which localizes the start and end times of a queried event in an untrimmed video, is a key test of whether multimodal large language models (MLLMs) understand not only what happens but also when it happens. Although modern MLLMs describe video content fluently, their timestamp predictions remain unreliable, while existing remedies either require costly post-training on temporal annotations or rely on coarse training-free heuristics. In this work, we probe the cross-modal attention of MLLMs and uncover a perception-generation gap. Our key finding is that MLLMs often know the target interval during prefill, but lose this signal when generating the final answer. In the prefill stage, a sparse set of attention heads, which we call Temporal Grounding Heads (TG-Heads), concentrates query-to-video attention on the ground-truth interval. During autoregressive decoding, however, the answer tokens shift attention away from this interval toward visually salient but query-irrelevant segments. This observation motivates an inference-time read-then-regenerate framework. We first convert TG-Head prefill attention into a debiased frame-level relevance signal and extract the high-attention interval it highlights. We then re-invoke the MLLM with visual context restricted to this interval, using video cropping or attention masking to suppress distractors. Without parameter updates or architectural changes, our framework consistently improves MiMo-VL-7B, Qwen3-VL-8B, and TimeLens-8B on three VTG benchmarks, with gains of up to +3.5 mIoU. The project website can be found at https://ddz16.github.io/mllmsknowwhen.github.io/.
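
Two pieces of this pipeline are easy to make concrete: pulling an interval out of a frame-level relevance signal, and scoring a predicted interval by temporal IoU (the quantity averaged in mIoU). The sketch below is illustrative only. The thresholding rule (`ratio` times the peak score, then the contiguous run with the largest total relevance) and the 1-frame-per-second sampling in the example are assumptions, not the paper's debiasing or extraction procedure.

```python
def temporal_iou(pred, gt):
    """IoU of two time intervals given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def high_attention_interval(relevance, ratio=0.5):
    """Return (first, last) frame indices of the strongest high-relevance run.

    A frame counts as high-relevance if its score is at least ratio * max score.
    Assumes non-negative scores with a positive peak; returns None otherwise.
    """
    cutoff = ratio * max(relevance)
    best, best_mass, start = None, 0.0, None
    # A trailing sentinel below any cutoff closes a run that reaches the last frame.
    for i, score in enumerate(list(relevance) + [float("-inf")]):
        if score >= cutoff and start is None:
            start = i
        elif score < cutoff and start is not None:
            mass = sum(relevance[start:i])
            if mass > best_mass:
                best, best_mass = (start, i - 1), mass
            start = None
    return best


# Hypothetical per-frame relevance scores, one frame per second.
relevance = [0.1, 0.2, 0.9, 1.0, 0.8, 0.1, 0.6, 0.1]
frames = high_attention_interval(relevance)
# Frames 2..4 at 1 fps cover the span from 2.0 s to 5.0 s.
predicted = (float(frames[0]), float(frames[1] + 1))
score = temporal_iou(predicted, (2.0, 6.0))
```

The isolated spike at frame 6 is ignored because its run carries less total relevance than frames 2 to 4, a toy version of suppressing a salient but irrelevant segment.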

2605.21951 2026-05-22 cs.LG

Dynamic Mixture of Latent Memories for Self-Evolving Agents

Dianzhi Yu, Vireo Zhang, Hongru Wang, Yanyu Chen, Minda Hu, Wanghan Xu, Siki Chen, Philip Torr, Zhenfei Yin, Irwin King

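AI summary: This paper proposes MoLEM, a framework that uses a dynamic mixture-of-experts mechanism to support continual learning in agents, avoiding catastrophic forgetting and improving both task learning and competence preservation.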

Comments 19 pages, 5 figures, 5 tables

Abstract

Achieving self-evolution in intelligent agents requires the continual accumulation of new knowledge across changing task sequences without forgetting previously acquired abilities. Existing approaches either internalize knowledge by updating model parameters, which induces catastrophic forgetting, or rely on external memory, which fails to genuinely enhance the model's intrinsic capabilities. We propose MoLEM, a generative mixture of latent memory framework based on a dynamic mixture-of-experts (MoE). We treat multiple experts as independent carriers to generate memory. A router selects and weights experts through key-query matching, and the aggregated latent memory is injected into the reasoning process. The base model for reasoning remains entirely frozen, with all experiential knowledge internalized into the additional modules, avoiding catastrophic forgetting. For continual learning, each training stage is paired with a lightweight autoencoder that selects the appropriate routing group at inference, and inputs that match no stage fall back to the pretrained model. Experiments train the framework on continual-learning sequences spanning math, science, and code domains. After training, we evaluate the framework on the corresponding test sets to measure task learning and competence preservation across continual adaptation stages. After the full continual-learning sequence, our method improves the average accuracy by 10.40% over the Vanilla pretrained baseline, while none of the competing methods consistently exceed this baseline across different training orders.
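
The routing step, in which a router selects and weights experts by key-query matching and the weighted memories are aggregated, can be sketched as follows. The abstract does not give the scoring or normalization rule, so the dot-product scores, top-k selection, and softmax weights here are assumptions, as are the function name and the toy vectors.

```python
import math


def route(query, expert_keys, expert_memories, top_k=2):
    """Pick top_k experts by dot-product score and blend their memory vectors."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in expert_keys]
    chosen = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    # Softmax over the chosen experts only; subtracting the peak keeps exp() stable.
    peak = max(scores[i] for i in chosen)
    exps = {i: math.exp(scores[i] - peak) for i in chosen}
    total = sum(exps.values())
    weights = {i: e / total for i, e in exps.items()}
    dim = len(expert_memories[0])
    memory = [sum(weights[i] * expert_memories[i][d] for i in chosen) for d in range(dim)]
    return weights, memory


# Hypothetical 2-dimensional keys and memories for three experts.
keys = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]
memories = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, memory = route([1.0, 0.0], keys, memories)
```

With this query, experts 0 and 2 are selected and expert 1 contributes nothing. In the framework described above, only such add-on modules would be trained while the base model stays frozen.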

2605.21949 2026-05-22 cs.CL

Claim-Selective Certification for High-Risk Medical Retrieval-Augmented Generation

Shao Kan

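AI summary: This paper studies claim-selective certification for retrieval-augmented generation in high-risk medical question answering. Responses are decomposed into verifiable claims, scored against retrieved evidence, and mapped by an intent-aware selector to {full, partial, conflict, abstain}, achieving high-accuracy certification results on a weak-label certificate protocol.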

Comments 22 pages, 7 figures, 11 tables

Abstract

Medical RAG systems in high-risk QA settings are often evaluated through a single answer-or-abstain decision, but mixed evidence may support one claim, require conditions for another, and contradict a third. We study claim-selective certification: each response is decomposed into verifiable claims, scored against retrieved evidence, and mapped by an intent-aware selector to {full, partial, conflict, abstain}. On the primary weak-label certificate protocol, whose real-source-only dev/test rows cover the naturally occurring non-abstain actions, the full system records UCCR=0.0000, PAU=1.0000, PAU Precision=0.9901, and action accuracy=0.9204 on dev (n=314), and UCCR=0.0000, PAU=0.9967, PAU Precision=0.9739, and action accuracy=0.8997 on test (n=319). UCCR measures unsupported-claim risk within the certificate definition, and a source-missing counterfactual slice evaluates abstain under empty evidence. Shortcut controls quantify the action-label prior explained by source and intent metadata, while source/evidence-novel slices characterize transfer boundaries. The resulting interface separates action-label prediction from evidence-linked claim selection under mixed evidence.

2605.21948 2026-05-22 cs.LG

SCI-Defense: Defending Manipulation Attacks from Generative Engine Optimization

Xucheng Yu, Haibo Jin, Huimin Zeng, Haohan Wang

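AI summary: This paper proposes SCI-Defense, a framework that detects Generative Engine Optimization attacks using three components (perplexity detection, semantic integrity scoring, and inter-candidate detection), achieving high precision and a low false-positive rate while exposing the limitations of existing defenses and pointing to directions for future research.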

Comments 20 pages, NeurIPS 2026 submission

Abstract

LLM-based ranking systems are vulnerable to Generative Engine Optimization (GEO) attacks, where adversaries inject semantic signals into product descriptions to artificially boost rankings. We propose SCI-Defense, a three-component defense framework combining Perplexity detection (PPL), Semantic Integrity Scoring (SIS), and Inter-Candidate Detection (ICD). SIS evaluates four manipulation dimensions: Authority Attribution (AA), Narrative Purposiveness (NP), Comparative Claims (CA), and Temporal Claims (TC). Evaluated on 600 Amazon product descriptions across 6 categories, SCI-Defense achieves Precision=1.000 and FPR=0.000, with Recall of 1.000, 0.952, and 0.830 against String, Reasoning, and Review attacks, respectively. On 600 MS MARCO web passages, String attacks are blocked with perfect recall while Review attacks yield near-zero recall, as web passages lack the persuasion-oriented signals that SIS targets in product descriptions. We demonstrate that existing defenses (PPL-only filters, SafetyClf content classifiers, and paraphrasing) achieve zero recall against semantic manipulation attacks. We further demonstrate that new attacks such as Specification Amplification and Use-Case Saturation expose semantic relevance manipulation as a structural defense blind spot, which suggests directions for future research.
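
For readers less familiar with the reported metrics, the sketch below computes precision, recall, and false-positive rate from binary detection outcomes using their standard definitions. The function name and the toy labels are made up for illustration and are not data from the paper.

```python
def detection_metrics(is_attack, is_flagged):
    """Precision, recall, and false-positive rate for a binary detector.

    is_attack: ground-truth booleans (True = manipulated item).
    is_flagged: detector decisions (True = flagged as manipulated).
    """
    pairs = list(zip(is_attack, is_flagged))
    tp = sum(1 for a, f in pairs if a and f)
    fp = sum(1 for a, f in pairs if not a and f)
    fn = sum(1 for a, f in pairs if a and not f)
    tn = sum(1 for a, f in pairs if not a and not f)
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "fpr": fp / (fp + tn) if fp + tn else 0.0,
    }


# Toy example: four attacked items, four clean items, one attack missed.
metrics = detection_metrics(
    [True, True, True, True, False, False, False, False],
    [True, True, True, False, False, False, False, False],
)
```

The toy detector never flags a clean item, so precision is 1.0 and the false-positive rate is 0.0, yet recall is only 0.75. The same pattern appears in the figures above: perfect precision and zero FPR say nothing about how many attacks slip through.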

2605.21947 2026-05-22 cs.RO

A Visitation Grid for Complete Coverage Foraging in Robot Swarms

Qi Arturo Gonzalez, Yifeng Gao, Li Zhang, Qi Lu

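AI summary: This paper proposes a grid-based stochastic foraging strategy that reduces redundant visits and accelerates late-stage collection, improving the efficiency and completeness of resource collection by robot swarms in large, unknown environments.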

Comments The 23rd International Conference on Ubiquitous Robots, 10 figures, 3 tables

Abstract

The complete collection of sparse resources in large, unknown environments remains a challenging problem for autonomous robot swarms. Previous studies have shown that a substantial portion of total mission time is consumed during the final stage of collection, where only a small fraction of randomly scattered resources remain. Consequently, many existing swarm foraging algorithms (search and collection) focus on collecting most resources within a limited time window, rather than improving end-stage efficiency for collecting all resources. We propose a grid-based stochastic foraging strategy that explicitly reduces redundant visits and accelerates late-stage collection. The unknown search area is partitioned into a grid map, which is maintained by a lightweight central server. To maintain scalability, both robots and the server operate within limited memory and computational constraints. The server updates the grid-level visitation counts based on robot-reported locations, producing a global estimate of the exploration density. For each new foraging trip, a robot selects its next search area from a local 3 × 3 neighborhood of grid cells, probabilistically favoring the cells with the lowest visitation counts, thus biasing exploration toward under-visited regions while maintaining stochasticity. Extensive simulation experiments demonstrate that the proposed strategy consistently outperforms the canonical baseline, the central-place foraging algorithm (CPFA). Compared to CPFA, the proposed method reduces the total collection time by up to 33% and improves collection efficiency by more than 48% during the final stage of the mission. These results indicate that the proposed strategy is robust, flexible, and scalable for near-complete and complete resource collection in robot swarms and can serve as a general enhancement for stochastic swarm foraging methods under limited onboard resources.
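
The server-side count update and the robot-side cell choice can be sketched in a few lines. The abstract does not state the exact selection probabilities, so the inverse weighting `1 / (1 + visits)` below is an assumption, as are the function names and the toy grid.

```python
import random


def record_visit(visits, row, col):
    """Server side: increment the count for a cell a robot reports visiting."""
    visits[row][col] += 1


def choose_next_cell(visits, row, col, rng=random):
    """Robot side: sample a cell from the 3x3 neighborhood around (row, col).

    Cells with fewer recorded visits get proportionally larger weights, so the
    choice is biased toward under-visited cells but stays random.
    """
    rows, cols = len(visits), len(visits[0])
    cells = [
        (r, c)
        for r in range(max(0, row - 1), min(rows, row + 2))
        for c in range(max(0, col - 1), min(cols, col + 2))
    ]
    weights = [1.0 / (1 + visits[r][c]) for r, c in cells]
    return rng.choices(cells, weights=weights, k=1)[0]


# Hypothetical 3x3 map in which only cell (1, 2) has never been visited.
visits = [[50, 50, 50], [50, 50, 0], [50, 50, 50]]
rng = random.Random(0)
picks = [choose_next_cell(visits, 1, 1, rng) for _ in range(1000)]
unvisited_share = picks.count((1, 2)) / len(picks)
```

Under this weighting the unvisited cell is drawn most of the time, while heavily visited neighbors keep a small nonzero chance. Each robot only needs the nine counts around its current cell, which keeps the per-trip memory and computation small.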

2605.21938 2026-05-22 cs.LG cs.CR cs.IT math.IT

Optimal Guarantees for Auditing Rényi Differentially Private Machine Learning

Benjamin D. Kim, Lav R. Varshney, Daniel Alabi

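AI summary: This paper studies black-box auditing of machine learning algorithms that claim Rényi differential privacy (RDP) guarantees. It proposes a hypothesis-testing-based auditing framework that uses the Donsker-Varadhan (DV) variational estimator to directly estimate the Rényi divergence between neighboring executions, derives non-asymptotic confidence intervals via class-restricted DV estimators, and proves that its sample-complexity guarantees are information-theoretically optimal, establishing the first optimal guarantees for auditing RDP via DV estimators.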

Comments 28 pages, 3 figures

Abstract

We study black-box auditing for machine learning algorithms that claim Rényi differential privacy (RDP) guarantees. We introduce an auditing framework, based on hypothesis testing, that directly estimates Rényi divergence between neighboring executions using the Donsker-Varadhan (DV) variational estimator. Our analysis yields explicit and non-asymptotic confidence intervals for RDP auditing via class-restricted DV estimators, separating statistical estimation error from algorithmic privacy leakage. We prove matching minimax lower bounds showing that, up to logarithmic factors, our sample-complexity guarantees are information-theoretically optimal, thereby establishing the first optimal guarantees for auditing RDP via DV estimators. Empirically, we instantiate our framework for auditing DP-SGD in a fully black-box setting. Across MNIST and CIFAR-10, and over a wide range of privacy regimes, our auditors produce a strong overall improvement on empirical RDP lower bounds compared to prior state-of-the-art black-box methods, especially at small and moderate Rényi orders where accurate auditing is most challenging.
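
For orientation, the standard definitions behind this abstract are collected below. These are textbook forms, not the paper's estimator: the Donsker-Varadhan formula is shown for the KL divergence (the limit of the Rényi divergence as the order tends to 1), and the Rényi-order variational form used in the paper is not reproduced here.

```latex
% Rényi divergence of order \alpha > 1 between P and Q (densities p, q)
D_\alpha(P \,\|\, Q) = \frac{1}{\alpha - 1}
  \log \mathbb{E}_{x \sim Q}\!\left[ \left( \frac{p(x)}{q(x)} \right)^{\alpha} \right]

% (\alpha, \varepsilon)-RDP for a randomized mechanism M
D_\alpha\big( M(D) \,\|\, M(D') \big) \le \varepsilon
  \quad \text{for all neighboring datasets } D, D'

% Donsker-Varadhan representation of the KL divergence
D_{\mathrm{KL}}(P \,\|\, Q) = \sup_{f}
  \left\{ \mathbb{E}_{P}[f(X)] - \log \mathbb{E}_{Q}\!\left[ e^{f(X)} \right] \right\}
```

Restricting the supremum to a smaller function class can only lower its value, so a class-restricted variational estimate targets a lower bound on the divergence. That is the direction an audit needs: a statistically confident lower bound that exceeds the claimed ε is evidence against the claimed guarantee.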

2605.21935 2026-05-22 cs.RO

Learning to Evolve: Multi-modal Interactive Fields for Robust Humanoid Navigation in Dynamic Environments

Peifeng Jiang, Hong Liu, Jin Jin, Wenshuai Wang, Xia Li

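AI summary: This paper proposes the Multi-modal Interactive Field (MIF) system, which combines confidence-aware semantic 3D Gaussian Splatting, discrepancy-triggered spatial memory updates, and task-driven geometric reconstruction in a closed-loop perception-adaptation pipeline for robust humanoid navigation, substantially improving relocation success in non-static environments and reducing the semantic memory footprint.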

Comments Accepted by Robotics: Science and Systems 2026

Abstract

Safe manipulation-oriented navigation for humanoid robots requires scene memory that remains reliable under locomotion-induced perceptual distortion, environmental changes, and interaction-level geometric safety constraints. Existing semantic mapping and scene-graph systems are difficult to deploy directly in this setting because they often assume stable camera trajectories, static environments, or coarse object geometry. We introduce the Multi-modal Interactive Field (MIF), a humanoid-oriented system that integrates confidence-aware semantic 3D Gaussian Splatting, discrepancy-triggered spatial memory updates, and task-driven geometric reconstruction within a closed-loop perception-adaptation pipeline. MIF couples three fields: an uncertainty-aware 3DGS Appearance Field that suppresses gait-induced blur, a Spatial Field that maintains topological memory, and a Geometry Field that supports Interaction Pose Safety (IPS) before manipulation. A discrepancy detection score is introduced to separate locomotion-induced false-positive changes from persistent changes and updates only locally inconsistent regions. On a Unitree-G1 humanoid in a real dynamic office, MIF improves relocation success in non-static environments from 12% to 94% compared with static scene-graph memory, while reducing semantic memory footprint by 91.4% through feature distillation for practical online operation. Project page and code: https://ziya-jiang.github.io/MIF-homepage/

2605.21932 2026-05-22 cs.RO

Auction-Consensus Algorithm with Learned Bidding Scheme for Multi-Robot Systems

带有学习出价方案的拍卖-共识算法用于多机器人系统

Jose Rodriguez, Constantine Tarawneh, Sven Koenig, Wenjie Dong, Qi Lu

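AI summary: This paper proposes a learning-enhanced auction-consensus framework that improves task allocation in multi-robot systems by training a neural bidding policy with reinforcement learning, while retaining the standard auction and consensus phases for decentralized coordination.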

Comments The 23rd International Conference on Ubiquitous Robots, 9 figures, 6 pages

详情
AI中文摘要

多机器人任务分配(MRTA)是分布式多智能体系统中的核心挑战,其中机器人团队必须在有限通信的情况下协作分配和执行任务,同时优化全局性能目标。拍卖-共识算法,如基于共识的捆绑算法(CBBA),提供了可扩展的去中心化协调,具有可证明的收敛性,但依赖于手工设计的贪婪评分函数,通常导致次优的任务分配。本文提出了一种学习增强的拍卖-共识框架,其中CBBA的确定性出价机制被神经出价策略取代,该策略通过强化学习进行训练。在集中训练和去中心化执行范式下,智能体学会从部分局部观测中计算任务出价,同时保留标准拍卖和共识阶段以实现去中心化协调。学习的出价策略通过混合整数线性规划获得的接近全局最优解的奖励进行训练。多个神经网络架构被评估,包括神经加法模型、长短期记忆(LSTM)模型和集合转换器模型。在不同群体大小的实验结果中,学习的出价策略在经典CBBA之上提高了解决方案的质量,同时保持了去中心化的执行。所提出的方法突显了将强化学习与经典分布式协调算法结合的有效性,为高质量的去中心化多机器人任务分配提供了可扩展的路径。

英文摘要

Multi-Robot Task Allocation (MRTA) is a central challenge in decentralized multi-agent systems, where teams of robots must cooperatively assign and execute tasks under limited communication while optimizing global performance objectives. Auction-consensus algorithms, such as the Consensus-Based Bundle Algorithm (CBBA), provide scalable decentralized coordination with provable convergence, but rely on hand-crafted greedy scoring functions that often lead to suboptimal task allocations. This paper proposes a learning-enhanced auction-consensus framework in which CBBA's deterministic bidding mechanism is replaced by a neural bidding policy trained using reinforcement learning. Under a centralized training and decentralized execution paradigm, agents learn to compute task bids from partial local observations while retaining the standard auction and consensus phases for decentralized coordination. The learned bidding policy is trained using Proximal Policy Optimization with rewards shaped by proximity to globally optimal solutions obtained via mixed-integer linear programming. Multiple neural architectures are evaluated, including a Neural Additive Model, the Long Short-Term Memory (LSTM) model, and the Set Transformer Model. Experimental results across varying swarm sizes demonstrate that learned bidding policies can improve solution quality over classical CBBA while preserving decentralized execution. The proposed approach highlights the effectiveness of integrating reinforcement learning with classical distributed coordination algorithms, offering a scalable pathway toward higher-quality decentralized multi-robot task allocation.
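
A minimal sketch of the idea in this abstract, not the authors' implementation: the consensus step stays fixed while the bidding function is swappable. Every name here is an illustrative assumption (`greedy_bid`, `resolve_by_max_consensus`, the distance-based score, and the stand-in `learned_bid`). Real CBBA builds task bundles and exchanges bids over a communication graph, which this single-round, per-task maximum leaves out.

```python
from typing import Callable, Dict, Tuple

Position = Tuple[float, float]
BidFn = Callable[[Position, Position], float]


def greedy_bid(robot: Position, task: Position) -> float:
    """Hand-crafted score (assumed form): nearer tasks receive higher bids."""
    distance = ((robot[0] - task[0]) ** 2 + (robot[1] - task[1]) ** 2) ** 0.5
    return 1.0 / (1.0 + distance)


def resolve_by_max_consensus(
    robots: Dict[str, Position],
    tasks: Dict[str, Position],
    bid_fn: BidFn,
) -> Dict[str, str]:
    """Give each task to its highest bidder; a robot may win several tasks."""
    winners: Dict[str, str] = {}
    for task_id, task_pos in tasks.items():
        bids = {robot_id: bid_fn(pos, task_pos) for robot_id, pos in robots.items()}
        winners[task_id] = max(bids, key=bids.get)
    return winners


robots = {"r1": (0.0, 0.0), "r2": (10.0, 0.0)}
tasks = {"a": (1.0, 0.0), "b": (9.0, 0.0)}

# Baseline: the hand-crafted greedy score.
greedy_allocation = resolve_by_max_consensus(robots, tasks, greedy_bid)


def learned_bid(robot: Position, task: Position) -> float:
    """Placeholder for a trained policy; any callable with this signature fits."""
    return -abs(robot[0] - task[0])


learned_allocation = resolve_by_max_consensus(robots, tasks, learned_bid)
```

Only `bid_fn` changes between the two calls; the consensus step is untouched, which is the separation the abstract describes.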

2605.21931 2026-05-22 cs.CV

EvoVid: Temporal-Centric Self-Evolution for Video Large Language Models

EvoVid: 以时间为中心的自我进化用于视频大语言模型

Shiqi Huang, Ziyue Wang, Zhongrong Zuo, Han Qiu, Qi She, Bihan Wen

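AI summary: This paper proposes EvoVid, a temporal-centric self-evolving framework that enables Video-LLMs to improve directly from unannotated videos. By introducing two complementary temporal-centric rewards, a temporal-aware Questioner reward and a temporal-grounded Solver reward, EvoVid improves over both base models and existing self-evolving baselines across four base models and six benchmarks, demonstrating the effectiveness of temporal-centric self-evolution for video understanding and reasoning.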

Comments Project page: https://huangshiqi128.github.io/EvoVid.io/

详情
AI中文摘要

近期的视频大语言模型(Video-LLMs)通过强化学习(RL)展示了在视频推理中的强大能力。然而,现有的RL流程严重依赖于人工标注的任务和解决方案,使其扩展成本高且本质上受人类专业知识的限制。自我进化框架最近作为一种有前途的替代方案出现,通过自主的提问者-求解者自玩。不幸的是,这些方法主要针对静态模态,如文本和图像,从根本上无法捕捉视频推理中至关重要的时间动态。在本工作中,我们提出了EvoVid,一种以时间为中心的自我进化框架,使Video-LLMs能够直接从原始、未标注的视频中改进。具体来说,我们引入了两个互补的时间感知奖励:一个时间感知的问题生成奖励,通过时间扰动敏感性鼓励时间依赖性的问题生成;一个时间基础的求解奖励,通过固有的视频片段定位提供自动的时间监督。在四个基础模型和六个基准测试中的广泛实验显示,EvoVid在基线模型和现有自我进化基线上实现了持续的改进,取得了与监督方法相竞争的性能。这些结果突显了时间为中心的自我进化作为视频理解和推理的有效且可扩展的范式。

英文摘要

Recent Video Large Language Models (Video-LLMs) have demonstrated strong capabilities in video reasoning through reinforcement learning (RL). However, existing RL pipelines rely heavily on human-annotated tasks and solutions, making them costly to scale and fundamentally constrained by human expertise. Self-evolving frameworks have recently emerged as a promising alternative through autonomous Questioner-Solver self-play. Unfortunately, these approaches are primarily designed for static modalities such as text and images, fundamentally failing to capture the temporal dynamics that are central to video reasoning. In this work, we propose EvoVid, a temporal-centric self-evolving framework that enables Video-LLMs to improve directly from raw, unannotated videos. Specifically, we introduce two complementary temporal-centric rewards: a temporal-aware Questioner reward that encourages temporally dependent question generation through temporal perturbation sensitivity, and a temporal-grounded Solver reward that provides automatic temporal supervision via inherent video segment localization. Extensive experiments across four base models and six benchmarks demonstrate consistent improvements over both base models and existing self-evolving baselines, achieving competitive performance with supervised methods. These results highlight temporal-centric self-evolution as an effective and scalable paradigm for video understanding and reasoning.

2605.21928 2026-05-22 cs.LG cs.AI stat.ME

CausalGuard: Conformal Inference under Graph Uncertainty

CausalGuard: 在图不确定性下的契合推断

Vikash Singh, Weicong Chen, Debargha Ganguly, Yanyan Zhang, Nengbo Wang, Sreehari Sankar, Mohsen Hariri, Alexander Nemecek, Chaoda Song, Shouren Wang, Biyao Zhang, Van Yang, Erman Ayday, Jing Ma, Vipin Chaudhary

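AI summary: This paper proposes CausalGuard, a structure-weighted conformal framework that calibrates after aggregating graph-conditional doubly robust pseudo-outcomes, providing distribution-free finite-sample marginal coverage under graph uncertainty.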

详情
AI中文摘要

从观察数据估计治疗效应需要选择调整集,但有效的调整依赖于未知的因果图。图的不规范可能导致覆盖不足,而图无关的契合包装可能只能通过大填充来恢复名义覆盖。我们介绍了CausalGuard,一种结构加权的契合框架,该框架在聚合图条件双稳健伪结果后进行校准。候选DAGs从LLM衍生的边先验中提出,通过条件独立性测试进行修剪,并通过贝叶斯信息准则重新加权。然后,一个复合非契合分数校准后加权的伪结果。CausalGuard为聚合的伪结果提供无分布的有限样本边际覆盖;在因果识别、重叠、条件均值噪声稳定性以及集中在目标对齐的有效调整策略下,其条件均值收敛于真实的条件平均治疗效应。在五个基准测试中,CausalGuard在可直接评估的目标上实现了均值覆盖超过名义90%水平,并在图无关契合基线需要大填充时减少了宽度。压力测试显示,当保留的候选集受数据支持时,CausalGuard能抑制无效的碰撞调整并在不规范的先验下保持稳定。

英文摘要

Estimating treatment effects from observational data requires choosing an adjustment set, but valid adjustment depends on an unknown causal graph. Graph misspecification can cause under-coverage, while graph-agnostic conformal wrappers may regain nominal coverage only through large padding. We introduce CausalGuard, a structure-weighted conformal framework that calibrates after aggregating graph-conditional doubly robust pseudo-outcomes. Candidate DAGs are proposed from an LLM-derived edge prior, pruned by conditional-independence tests, and reweighted by Bayesian Information Criterion. A composite nonconformity score then calibrates the posterior-weighted pseudo-outcome. CausalGuard provides distribution-free finite-sample marginal coverage for this aggregated pseudo-outcome; under causal identification, overlap, conditional-mean nuisance stability, and concentration on target-aligned valid adjustment strategies, its conditional mean converges to the true Conditional Average Treatment Effect. Across five benchmarks, CausalGuard attains mean coverage above the nominal 90% level for the directly evaluable target and reduces width when graph-agnostic conformal baselines require large padding. Stress tests show that CausalGuard suppresses invalid collider adjustment and remains stable under misspecified priors when the retained candidate set is data-supported.
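
The aggregate-then-calibrate order described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's method: weights are taken proportional to exp(-BIC/2), a common BIC approximation, and the nonconformity score is a plain absolute residual rather than the paper's composite score. All function names are hypothetical.

```python
import math
from typing import List, Sequence


def bic_weights(bic_scores: Sequence[float]) -> List[float]:
    """Normalized graph weights, w_g proportional to exp(-BIC_g / 2) (assumed)."""
    best = min(bic_scores)  # subtract the minimum for numerical stability
    raw = [math.exp(-0.5 * (score - best)) for score in bic_scores]
    total = sum(raw)
    return [value / total for value in raw]


def aggregate(
    pseudo_outcomes_per_graph: Sequence[Sequence[float]],
    weights: Sequence[float],
) -> List[float]:
    """Weighted average of graph-conditional pseudo-outcomes, per sample."""
    n_samples = len(pseudo_outcomes_per_graph[0])
    return [
        sum(w * outcomes[i] for w, outcomes in zip(weights, pseudo_outcomes_per_graph))
        for i in range(n_samples)
    ]


def conformal_halfwidth(residuals: Sequence[float], alpha: float) -> float:
    """Split-conformal quantile of absolute calibration residuals."""
    n = len(residuals)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # too few calibration points for this level
    return sorted(abs(r) for r in residuals)[rank - 1]
```

Calibration happens once, on the aggregated pseudo-outcome, so the interval is `prediction ± conformal_halfwidth(...)` rather than one interval per candidate graph.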

2605.21924 2026-05-22 cs.CV

Visual-Advantage On-Policy Distillation for Vision-Language Models

基于视觉优势的在线策略蒸馏用于视觉-语言模型

Ruiqi Liu, Xiaolei Lv, Gengsheng Li, Ximo Zhu, Zhiheng Wang, Zhengbo Zhang, Junkai Chen, Zhiheng Li, Bo Li, Jun Gao, Shu Wu

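AI summary: This paper proposes a visual-advantage on-policy distillation method that strengthens a vision-language model's reliance on visual input, introducing a visual advantage metric to separate vision-critical tokens from language tokens and thereby improve distillation.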

详情
AI中文摘要

在线策略知识蒸馏在语言模型中已被证明有效,但其在视觉-语言模型(VLMs)中的应用仍显不足。我们发现标准在线策略蒸馏可以提高学生模型的输出质量,但未能增强其对视觉输入的依赖性:在视觉关键token上,学生模型的预测在是否具备细粒度视觉细节时基本保持不变,尽管教师模型的预测依赖于它。为了使这种差异变得明显,我们引入了视觉优势(VA),即当教师在评分学生生成的rollout时,有无细粒度视觉细节的token级对数概率差异。VA集中在少数token上,这些高VA token实际上承载了视觉监督信号。这促使我们提出了一种蒸馏目标,使它们与语言支架不同,以避免其被大量语言token稀释。我们提出了视觉优势在线策略蒸馏(VA-OPD),它在两个粒度上使用VA:通过轨迹平均VA进行rollout级重新加权,以及在高VA和低VA组内分别计算token级KL平均值。我们在这两个数学数据集(Geometry3K和ViRL39K)上进行训练,并在八个基准测试上进行评估,涵盖数学推理和视觉理解,跨三种教师大小(4B、8B和32B)在Qwen3-VL系列上。VA-OPD在每个基准测试上均优于标准在线策略蒸馏,增益随着教师大小和数据规模轴单调增长,表明这些因素一致地相互作用。

英文摘要

On-policy knowledge distillation has proven effective for language models, yet its application to vision-language models (VLMs) remains underexplored. We observe that standard on-policy distillation can improve a student's output quality while failing to strengthen its reliance on visual input: on vision-critical tokens, the student's predictions remain largely unchanged whether or not fine-grained visual detail is present, even though the teacher's predictions depend heavily on it. To make this difference observable, we introduce visual advantage (VA), the token-level log-probability difference when the teacher scores a student-generated rollout with versus without access to fine-grained visual detail. VA is concentrated in a small minority of tokens, and these high-VA tokens are the ones that actually carry the visual supervision signal. This motivates a distillation objective that treats them differently from language scaffolding, so their contribution is not diluted by the abundant surrounding language tokens. We propose Visual-Advantage On-Policy Distillation (VA-OPD), which uses VA at two granularities: rollout-level reweighting by trajectory-averaged VA, and token-level KL averaged within high-VA and low-VA groups separately. We train on two math datasets (Geometry3K and ViRL39K) and evaluate on eight benchmarks covering both mathematical reasoning and visual understanding, across three teacher sizes (4B, 8B, and 32B) on the Qwen3-VL family. VA-OPD improves over standard on-policy distillation on every benchmark, with the gain growing monotonically along both the teacher-size and data-scale axes, suggesting that these factors compound consistently.
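
A small sketch of why group-wise averaging matters, assuming per-token teacher log-probabilities and per-token KL values are already available. The threshold, the way the two group means are combined (a plain sum), and all function names are assumptions made for illustration; the abstract does not specify them.

```python
from typing import List, Sequence, Tuple


def visual_advantage(
    logp_with_detail: Sequence[float],
    logp_without_detail: Sequence[float],
) -> List[float]:
    """Token-level VA: teacher log-prob with visual detail minus without."""
    return [a - b for a, b in zip(logp_with_detail, logp_without_detail)]


def split_by_va(va: Sequence[float], threshold: float) -> Tuple[List[int], List[int]]:
    """Indices of high-VA and low-VA tokens (threshold is an assumed knob)."""
    high = [i for i, value in enumerate(va) if value >= threshold]
    low = [i for i, value in enumerate(va) if value < threshold]
    return high, low


def group_mean(values: Sequence[float], indices: Sequence[int]) -> float:
    """Mean over a subset of tokens; an empty group contributes zero."""
    return sum(values[i] for i in indices) / len(indices) if indices else 0.0


def grouped_kl(token_kl: Sequence[float], va: Sequence[float], threshold: float) -> float:
    """Average KL inside each VA group, then add the two group means."""
    high, low = split_by_va(va, threshold)
    return group_mean(token_kl, high) + group_mean(token_kl, low)


def rollout_weight(va: Sequence[float]) -> float:
    """Rollout-level weight: the trajectory-averaged VA."""
    return sum(va) / len(va)


# One vision-critical token (index 1) among three language tokens.
va = visual_advantage([-0.5, -1.0, -0.25, -2.0], [-0.5, -3.0, -0.25, -2.0])
token_kl = [0.5, 4.0, 0.5, 0.5]
plain_mean = sum(token_kl) / len(token_kl)
grouped = grouped_kl(token_kl, va, threshold=1.0)
```

In this toy rollout the single high-VA token contributes its full KL to the grouped loss, whereas the plain token mean divides it by the sequence length.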

2605.21919 2026-05-22 cs.CV cs.AI

SDGBiasBench: Benchmarking and Mitigating Vision-Language Models' Biases in Sustainable Development Goals

SDGBiasBench: 评估和减轻可持续发展目标中视觉-语言模型的偏见

Zihang Lin, Huaiyuan Qin, Muli Yang, Hongyuan Zhu

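AI summary: This paper proposes SDGBiasBench, a large-scale benchmark suite for evaluating and mitigating the biases of vision-language models on the Sustainable Development Goals. It analyzes decision-level and estimation-level bias and proposes the CADE method to reduce bias and improve accuracy and reliability.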

详情
AI中文摘要

评估可持续发展目标(SDGs)的进展需要对视觉线索、上下文知识和发展指标进行多步骤推理,其中不完整的证据使用和不完美的证据整合可能引入隐藏的预测偏见。现实中的SDG监测还涵盖定性判断和定量估计。然而,现有基准通常孤立地评估这些方面,掩盖了当模型用先验代替证据时系统性偏见。为解决这一差距,我们提出了SDGBiasBench,一个面向SDG的视觉-语言推理大型基准测试集。该基准涵盖50万专家参与的多项选择题和5万回归任务,能够全面评估视觉-语言模型(VLMs)在决策和估计层面的偏见。在SDGBiasBench上的评估揭示了当前VLMs中固有的SDG偏见,其中预测通常由SDG特定的先验驱动,而非可靠的多模态线索。为减轻这种偏见,我们提出CADE(对比自适应去偏集合),一种无需训练的即插即用方法,利用模态特定的答案先验。CADE在所提出的基准上取得显著成效,提高了多项选择的准确率高达25%,并减少了回归MAE高达12点,适用于多种VLMs。我们希望我们的工作能促进更公平和可靠的AI系统在可持续发展中的发展。

英文摘要

Assessing progress toward the Sustainable Development Goals (SDGs) requires multi-step reasoning over visual cues, contextual knowledge, and development indicators, where incomplete evidence use and imperfect evidence integration can introduce hidden prediction biases. Real-world SDG monitoring further spans both qualitative judgments and quantitative estimation. However, existing benchmarks typically evaluate these aspects in isolation, obscuring systematic biases that emerge when models substitute priors for evidence. To address this gap, we propose SDGBiasBench, a large-scale benchmark suite for SDG-oriented vision-language reasoning. Spanning 500k expert-involved multiple-choice questions and 50k regression tasks, the benchmark enables comprehensive assessment of both decision-level and estimation-level bias in Vision-Language Models (VLMs). Evaluations on SDGBiasBench reveal an intrinsic SDG bias in current VLMs, where predictions are frequently driven by SDG-specific priors rather than reliable multi-modal cues. To mitigate such bias, we propose CADE (Contrastive Adaptive Debias Ensemble), a training-free, plug-and-play method that leverages modality-specific answer priors. CADE yields significant gains on the proposed benchmark, improving multiple-choice accuracy by up to 25% and reducing regression MAE by up to 12 points across multiple VLMs. We hope our work can foster the development of fairer and more reliable AI systems for sustainable development.

2605.21917 2026-05-22 cs.CV cs.AI

MAVEN: A Multi-stage Agentic Annotation Pipeline for Video Reasoning Tasks

MAVEN:一种多阶段代理标注管道用于视频推理任务

Han Zhang, Wanting Jiang, Tomasz Kornuta, Tian Zheng, Vidya Murali

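AI summary: This paper proposes MAVEN, a multi-stage agentic annotation pipeline that generates multi-task training data with Chain-of-Thought reasoning traces for video event reasoning. Its core is the Multi-Scale Spatio-Temporal Event Description; it supports agent-driven domain adaptation, improves data quality through a hierarchical refinement loop, and is validated on multiple datasets.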

Comments CVPR 2026 Workshop

详情
AI中文摘要

训练视频事件推理的视觉语言模型(VLMs)需要高质量的结构化标注,这些标注不仅要描述发生了什么,还要捕捉何时、何地、为何以及后果。我们提出了MAVEN(多阶段代理视频事件标注),一种多阶段代理管道,通过链式推理(CoT)轨迹将原始视频转换为多任务训练数据,围绕指定的事件焦点组织。在核心部分,MAVEN从三个互补的标题级别合成多尺度时空事件描述(MSTED),该显式中间体是下游问答生成的唯一输入,适用于多种任务格式。关键的是,MAVEN支持代理驱动的领域适应:给定新的视频数据集和目标问题示例,代理可以重新设计所有提示,而无需手动重新工程。分层细化循环进一步将注释错误分类到分类学中,追溯根本原因到起始管道阶段,并应用有针对性的编辑,重写提示或修改管道结构本身,迭代改进数据质量。我们应用MAVEN标注超过5,300个交通视频,并在生成的数据上微调Cosmos-Reason2-8B。在私人CCTV评估集上,微调优于Gemini 2.5 Pro和3.1 Flash,包括在零样本情况下MCQ准确率提高了38.8个百分点。在AccidentBench上,仅使用CCTV训练提升了Cosmos-Reason2的MCQ分数10.7分,并在没有dashcam视频的情况下与Gemini 2.5 Pro持平;添加代理适应的dashcam注释缩小了与Gemini 3.1 Flash的差距,RL后训练将总体性能推过了Gemini基线。对仓库监控和公共安全视频的定性结果进一步表明,代理工作流能够轻松适应新领域。

英文摘要

Training Vision Language Models (VLMs) for video event reasoning requires high-quality structured annotations capturing not only what happened, but when, where, why, and with what consequence, at a scale manual labelling cannot support. We present MAVEN (Multi-stage Agentic Video Event aNnotation), a multi-stage agentic pipeline that turns raw videos into multi-task training data with Chain-of-Thought (CoT) reasoning traces, organized around a designated Event of Focus. At its core, MAVEN synthesizes a Multi-Scale Spatio-Temporal Event Description (MSTED) from three complementary caption levels; this explicit intermediate serves as the sole input to downstream Q&A generation across multiple task formats. Crucially, MAVEN supports agent-driven domain adaptation: given a new video dataset and target question examples, the agent redesigns all prompts top-down without manual re-engineering. A hierarchical refinement loop further classifies annotation errors against a taxonomy, traces root causes to the originating pipeline stage, and applies targeted edits that rewrite prompts or modify the pipeline structure itself, iteratively improving data quality. We apply MAVEN to label over 5,300 traffic videos and fine-tune Cosmos-Reason2-8B on the resulting data. On a private CCTV evaluation set, fine-tuning surpasses both Gemini 2.5 Pro and 3.1 Flash, including a +38.8-point gain in MCQ accuracy over zero-shot. On AccidentBench, CCTV-only training lifts Cosmos-Reason2 by +10.7 MCQ points and matches Gemini 2.5 Pro despite seeing no dashcam videos; adding agent-adapted dashcam annotations narrows the gap to Gemini 3.1 Flash, and RL post-training pushes overall performance past both Gemini baselines. Qualitative results on warehouse surveillance and public safety videos further show the agentic workflow readily adapts the pipeline to new domains.

2605.21914 2026-05-22 cs.RO

Non-Contact Vibration-Based Damage Detection of Civil Structures Using a Cost-Effective Autonomous UAV

基于低成本自主无人机的非接触式振动法 civil 结构损伤检测

Javier Becerril, Maximiliano Vargas, Jennifer Herrera, Joanna Gutierrez, Jorge Rios, Mohsen Amjadian, Constantine Tarawneh, Jinghao Yang, Qi Lu

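AI summary: This paper proposes a non-contact, vibration-based method for detecting damage in civil structures using a low-cost autonomous UAV. Vibration signals are extracted from video recordings through vision-based motion tracking to identify shifts in natural frequencies that indicate structural degradation. A laboratory-scale frame structure is evaluated under healthy and simulated-damage conditions; the UAV reliably detects damage-induced frequency changes despite slightly higher errors, with inspection performance comparable to commercial UAV systems at significantly lower cost.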

Comments 8 pages, 8 figures, The 2026 International Conference on Unmanned Aircraft Systems, ICUAS 2026

详情
AI中文摘要

本文提出了一种非接触式振动法 civil 结构损伤检测方法,利用自主且定制化的低成本无人机(UAV)。通过基于视觉的运动追踪从视频记录中提取振动信号,以识别自然频率的变化,从而检测结构退化。在实验室规模的框架结构上,评估了健康和模拟损伤条件下的性能。所提出的系统通过实验研究验证,使用两部智能手机、USB相机和定制的低成本无人机,该无人机配备了内置相机和自主对齐系统,以在GPS受限环境中操作。提取并分析位移时间,并在频域中与参考测量值(来自接触加速度计和有限元模型)进行比较。实验结果表明,所有平台均能成功捕捉基频及其因损伤引起的偏移。尽管由于平台干扰和传感限制,无人机表现出略高的误差(最高达5.7%),但其能够可靠地检测到损伤引起的频率变化。与商业无人机系统相比,所提出的平台在显著降低成本的情况下实现了可比的检查性能。这些结果表明,低成本自主无人机为结构健康监测提供了一种实用、灵活且可扩展的解决方案,特别是在接触式传感不可行的情况下。此外,研究结果也支持了多个协作无人机部署的潜力,以进一步提高检查的覆盖范围和鲁棒性。

英文摘要

This paper presents a non-contact approach for vibration-based structural damage detection using an autonomous and customized cost-effective unmanned aerial vehicle (UAV). Vibration signals are extracted from video recordings through vision-based motion tracking to identify shifts in natural frequencies indicative of structural degradation. A laboratory-scale frame structure is evaluated under healthy and simulated-damage conditions. The proposed system is validated through an experimental study involving two smartphones, a USB camera, and a custom-built low-cost UAV equipped with an onboard camera and an autonomous alignment system for operation in GPS-denied environments. The displacement time history is extracted, analyzed in the frequency domain, and compared to reference measurements from contact accelerometers and a finite element model. Experimental results show that all platforms successfully capture the fundamental frequency and its shift due to damage. Although the UAV exhibits slightly higher errors (up to 5.7%) due to platform-induced disturbances and sensing limitations, it reliably detects damage-induced frequency changes. Compared to commercial UAV systems, the proposed platform achieves comparable inspection performance at significantly lower cost. These results demonstrate that low-cost autonomous UAVs provide a practical, flexible, and scalable solution for structural health monitoring, particularly in scenarios where contact-based sensing is impractical. The findings also support the potential for the deployment of multiple cooperative UAVs to further enhance inspection coverage and robustness.
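
The frequency-domain step can be illustrated with a plain discrete Fourier transform over a tracked displacement signal. This is a sketch on synthetic data, not the authors' processing chain: the 50 Hz frame rate, the 5 Hz and 4 Hz test frequencies, and the helper names are all assumptions chosen so the result is easy to check.

```python
import cmath
import math
from typing import List, Sequence


def dominant_frequency(displacement: Sequence[float], sample_rate_hz: float) -> float:
    """Return the frequency (Hz) of the largest non-DC peak of a plain DFT."""
    n = len(displacement)
    mean = sum(displacement) / n
    centered = [x - mean for x in displacement]  # remove the static offset
    best_bin, best_magnitude = 1, -1.0
    for k in range(1, n // 2 + 1):
        coefficient = sum(
            x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(centered)
        )
        if abs(coefficient) > best_magnitude:
            best_bin, best_magnitude = k, abs(coefficient)
    return best_bin * sample_rate_hz / n


def synthetic_track(freq_hz: float, sample_rate_hz: float, n: int) -> List[float]:
    """Hypothetical tracked displacement: a clean sinusoid plus a constant offset."""
    return [2.0 + math.sin(2 * math.pi * freq_hz * i / sample_rate_hz) for i in range(n)]


SAMPLE_RATE_HZ = 50.0  # assumed camera frame rate
healthy = dominant_frequency(synthetic_track(5.0, SAMPLE_RATE_HZ, 100), SAMPLE_RATE_HZ)
damaged = dominant_frequency(synthetic_track(4.0, SAMPLE_RATE_HZ, 100), SAMPLE_RATE_HZ)
relative_shift = (damaged - healthy) / healthy  # negative when the frequency drops
```

The frequency resolution is `sample_rate_hz / n`, so longer recordings are needed to resolve small damage-induced shifts; real footage would also need windowing and noise handling that this sketch omits.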

2605.21913 2026-05-22 cs.CV

Multi-scale interaction network for stereo image super-resolution

多尺度交互网络用于立体图像超分辨率

Liyi Xu, Lin Qi

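AI summary: This paper proposes a multi-scale interaction network for stereo image super-resolution that improves intra-view feature extraction and cross-view matching accuracy to achieve better super-resolution results.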

详情
AI中文摘要

立体图像超分辨率旨在通过利用双目系统的互补信息生成高分辨率图像。尽管先前研究取得了显著成果,但视内和视间信息的潜力尚未被充分挖掘。为了解决这个问题,我们提出了一种新颖的多尺度交互网络用于立体图像超分辨率。具体来说,我们设计了一个多尺度空间-通道注意模块,利用多尺度大可分离核注意和简单的通道注意来改进视内特征提取。此外,我们提出了一个双视图极线注意模块,利用最优传输算法实现更精确的极线匹配。广泛的实验和消融研究显示,我们的方法实现了具有竞争力的结果,优于大多数最先进的方法。

英文摘要

Stereo image super-resolution aims to generate high-resolution images by leveraging complementary information from binocular systems. Although previous studies have achieved impressive results, the potential of intra-view and cross-view information has not been fully exploited. To address this issue, we propose a novel multi-scale interaction network for stereo image super-resolution. Specifically, we design a Multi-scale Spatial-Channel Attention Module that utilizes multi-scale large separable kernel attention and simple channel attention to improve intra-view feature extraction. Additionally, we propose a Dual-View Epipolar Attention Module, utilizing an optimal transport algorithm to achieve more accurate matching along the epipolar line. Extensive experiments and ablation studies show that our method achieves competitive results that outperform most SOTA methods.

2605.21911 2026-05-22 cs.LG

Noise Schedule Design for Diffusion Models: An Optimal Control Perspective

扩散模型的噪声调度设计:一个最优控制视角

Seo Taek Kong, Weina Wang, R. Srikant

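AI summary: Taking an optimal control perspective, this paper proposes a framework for analyzing and designing noise schedules in diffusion models. It recasts schedule design as an optimal control problem, derives sufficient conditions on noise schedules under which state-of-the-art sampling error is achievable, and obtains new schedules through parameter tuning that improve FID scores in image generation.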

详情
AI中文摘要

我们开发了一个系统分析和设计扩散模型噪声调度的框架。我们证明可以将此设计问题重新表述为一个最优控制问题,其状态是扩散过程的Fisher信息,该信息根据微分方程演变,控制输入是噪声调度。最优控制问题的目标函数涉及Fisher信息,它被证明是Kullback-Leibler采样误差的上界。通过求解此最优控制问题,我们获得噪声调度的充分条件,使得最先进的~O(d/n)采样误差得以实现,其中d是数据维度,n是离散化步骤数。尽管现有理论工作也证明~O(d/n)采样误差界是可行的,但这些结果仅适用于特定的噪声调度,不包括实践中使用的调度。在进一步的数据分布参数假设下,我们证明可以得到噪声调度的闭式表达。这些噪声调度通过允许额外可调参数来推广标准经验调度,如指数和Sigmoid调度。系统地调整这些调度的参数可得到新的调度方案,在图像生成基准上取得更优的FID分数。

英文摘要

We develop a principled framework for analyzing and designing noise schedules in diffusion models. We show that one can recast this design problem as an optimal control problem whose state is the Fisher information of the diffusion process, which evolves according to an ODE, and whose control input is the noise schedule. The objective of the optimal control problem is a functional involving the Fisher information, which is shown to be an upper bound on the Kullback-Leibler sampling error. By solving this optimal control problem, we obtain sufficient conditions on noise schedules under which state-of-the-art $\tilde{\mathcal{O}}(d/n)$ sampling error is achievable, where $d$ is the data dimension and $n$ is the number of discretization steps. While existing theoretical work also proves that $\tilde{\mathcal{O}}(d/n)$ sampling error bounds are achievable, these results hold for specific noise schedules, which do not include the schedules used in practice. Under a further parametric assumption on the data distribution, we show that one can obtain closed-form expressions for the noise schedules. These noise schedules generalize standard empirical schedules such as exponential and sigmoid schedules by allowing additional parameters that can be tuned. Systematically tuning the parameters of these schedules yields new schedules that achieve superior FID scores on image generation benchmarks.

2605.21907 2026-05-22 cs.CV

Guided Trajectory Optimization with Sparse Scaling for Test-Time Diffusion

引导轨迹优化与稀疏缩放用于测试时间扩散

Gang Dai, Yining Huang, Yiming Xia, Guohao Chen, Shuaicheng Niu

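AI summary: This paper proposes RTS, which improves the generation performance of diffusion models through a reward-guided noise optimization strategy and a sparse test-time scaling framework; experiments show it outperforms existing methods on both GenEval and ImageReward.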

详情
AI中文摘要

高效的测试时间缩放(TTS)范式为提升扩散模型的生成性能提供了有前途的视角。然而,当前解决方案局限于静态、预定义的噪声池,并在去噪轨迹中的噪声探索上表现出灵活性不足。为弥合这一差距,我们提出了RTS,一种新颖的奖励引导轨迹缩放方法,以充分释放扩散模型的生成潜力。与现有方法不同,RTS通过两个核心创新实现了高质量图像的合成:1)奖励引导的噪声优化策略,主动将搜索方向引导至有前途的区域;2)结合PCA驱动的曲率分析方案的稀疏测试时间缩放框架,优先考虑去噪空间中的关键中间步骤,有效压缩搜索空间。实验表明,我们的方法在GenEval得分上比基线高出15.6%,在ImageReward得分上提升60.4%,设定了新的SOTA,并为扩散特定架构的更有效的测试时间缩放提供了实用指南。

英文摘要

The efficient Test-Time Scaling (TTS) paradigm offers a promising perspective for enhancing the generation performance of diffusion models. However, current solutions are limited to a static, pre-defined noise pool and suffer from inflexible noise exploration across the denoising trajectory. To bridge this gap, we propose RTS, a novel Reward-guided Trajectory Scaling method to fully unlock the generative potential of diffusion models. Unlike existing methods, RTS facilitates the synthesis of refined, high-fidelity images via two core innovations: 1) a reward-guided noise optimization strategy to actively direct the search towards promising regions; and 2) a sparse test-time scaling framework together with a PCA-driven curvature analysis scheme to prioritize key intermediate steps in the entire denoising space, effectively compressing the search space. Experiments show that our approach outperforms baselines by 15.6% on GenEval score and by 60.4% on ImageReward score, setting a new SOTA while providing a practical guideline for more effective test-time scaling across diffusion-specific architectures.

2605.21902 2026-05-22 cs.AI cs.CL

Planning in the LLM Era: Building for Reliability and Efficiency

在大语言模型时代进行规划:构建可靠性与效率

Michael Katz, Harsha Kokel, Kavitha Srinivas, Shirin Sohrabi

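AI summary: This paper discusses how the planning field is developing in the LLM era, focusing on approaches that improve the reliability and efficiency of planning by generating verifiable symbolic solvers.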

Comments Published at ICAPS 2026

详情
AI中文摘要

随着智能代理受到越来越多的关注,规划能力成为其核心能力之一。早期尝试利用大语言模型(LLMs)进行规划的方法主要依赖于单次计划生成,随后发展出结合LLMs与有限外部搜索的混合方法。这些方法本质上不严谨且不完整,往往需要大量资源,但并未在未见问题上产生更好的解决方案。随着对LLMs局限性的认识加深,近期的研究转向在解决方案构建时使用LLMs,生成可用于验证并高效用于推理时间的一类问题的符号求解器。这一趋势反映了对既可靠又资源高效的代理日益增长的需求。它还提供了一条生成可维护规划器的路径,从而在推理时对语言模型的依赖最小化。在本文中,我们论证这种转变反映了在大语言模型时代规划领域更广泛的真实调整。我们检查了三种主要的规划器生成方法类别,讨论了它们当前的局限性,并概述了朝着更可靠和高效的大语言模型驱动的规划器生成的研究步骤。

英文摘要

Growing attention to intelligent agents has put a spotlight on one of their central capabilities: planning. Early attempts to leverage large language models (LLMs) for planning relied on single-shot plan generation, followed by hybrid approaches that coupled LLMs with limited external search. These methods, unsound and incomplete by their very nature, often require substantial resources without yielding better solutions on unseen problems. As the limitations of LLMs become clearer, recent work has shifted toward using them at solution construction time -- generating symbolic solvers for a family of problems that can be verified and then used efficiently at inference time. This trend reflects the growing need for agents that are both reliable and resource-efficient. It also offers a path towards generating maintainable planners with minimal dependence on language models at inference time. In this paper, we argue that this shift reflects a broader realignment of the planning field in the LLM era. We examine three major categories of planner-generation methods, discuss their current limitations, and outline research steps towards a more reliable and efficient LLM-based generation of planners.

2605.21901 2026-05-22 cs.RO

Higher Order Reasoning for Collaborative Communicationless Mobile Robot Operations

高阶推理用于无通信协作移动机器人操作

Jonathan Reasoner, Nicola Bezzo

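AI summary: This paper proposes a dynamic epistemic planning framework based on higher-order reasoning that lets robots achieve implicit coordination and long-horizon planning without communication; simulations and physical experiments show that it reduces task completion time in communication-restricted domains.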

详情
AI中文摘要

在无通信环境下,多机器人系统必须在不进行常规定步协调策略所假设的持续信息交换的情况下运作。本文提出了一种新颖的动态认知规划框架,通过机器人之间的高阶推理实现隐式协调和长周期规划。我们的方法使机器人能够形成并传播高阶信念粒子,利用贝叶斯推断更新世界信念,并通过行为树选择动作,以预测队友的可能决策。一种时间感知的模型预测路径积分(MPPI)控制器将这种推理整合到低层执行中,使机器人能够在部分可观测条件下规划拦截并适应轨迹。所提出的框架在仿真和实物实验中均显示出比一阶基线方法更短的任务完成时间,证明了认知逻辑可以作为在通信受限领域中具有鲁棒性的协调基础。

英文摘要

In communicationless environments, multi-robot systems must operate without the constant information exchange that many coordination strategies typically assume. This paper presents a novel dynamic epistemic planning framework that enables implicit coordination and long-horizon planning through higher-order reasoning among robots. With our approach, robots form and propagate higher-order belief particles, update world beliefs using Bayesian inference, and select actions via a behavior tree that anticipates teammates' likely decisions. A temporally aware Model Predictive Path Integral (MPPI) controller integrates this reasoning into low-level execution, allowing robots to plan intercepts and adapt trajectories under partial observability. The proposed framework is evaluated in both simulations and physical experiments, where it consistently reduces task completion time compared to a first-order baseline, demonstrating that epistemic logic can serve as a robust foundation for resilient coordination in communication-restricted domains.

2605.21882 2026-05-22 cs.CV

Thermo-VL: Extending Vision-Language Models to Thermal Infrared Perception

Thermo-VL:扩展视觉语言模型以适应热红外感知

Rusiru Thushara, Yasiru Ranasinghe, Jay Paranjape, Vishal M. Patel

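AI summary: This paper proposes Thermo-VL, a vision-language model extended to thermal infrared perception. It adds a trainable thermal encoder and a text-guided dual-attention fusion module to improve multispectral fusion under low illumination, and achieves strong gains on thermal-only and RGB+thermal reasoning tasks.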

Comments 18 pages, 11 figures

详情
AI中文摘要

视觉语言模型(VLMs)在低光照条件下往往表现不佳,因为它们的视觉基础主要学习自RGB图像,而热红外图像在可见线索退化时能保留互补的场景结构。我们提出了Thermo-VL,一种波长感知的VLM,它在冻结的Molmo-7B主干上添加了可训练的热编码器和文本引导的双注意力融合模块。给定对齐的RGB标记、热标记和提示嵌入,融合模块将热特征条件化为语言和RGB上下文,然后将门控残差注入冻结的RGB流中,使热证据能够被纳入而不破坏Molmo预训练的RGB-语言接口。我们使用标准的语言建模目标以及辅助对齐和正则化损失来训练模型,这些损失提高了跨模态基础并减少了对RGB的依赖。我们还引入了一个像素对齐的RGB-热指令微调数据集和Thermo-VL-Bench,一个手动筛选的RGB-热VQA基准,用于低光照和跨光谱推理。实验表明,在具有挑战性的热红外和RGB+热红外推理任务中取得了显著的提升,突显了基于提示的多光谱融合的价值。我们的数据集和代码可在:https://thusharakart.github.io/Thermo-VL 公开获取。

英文摘要

Vision-language models (VLMs) often fail under low illumination because their visual grounding is learned predominantly from RGB imagery, whereas thermal infrared preserves complementary scene structure when visible cues degrade. We present Thermo-VL, a wavelength-aware VLM that augments a frozen Molmo-7B backbone with a trainable thermal encoder and a text-guided dual-attention fusion module. Given aligned RGB tokens, thermal tokens, and prompt embeddings, the fusion module conditions thermal features on both language and RGB context, then injects a gated residual into the frozen RGB stream so thermal evidence can be incorporated without disrupting Molmo's pretrained RGB-language interface. We train the model with the standard language-modeling objective together with auxiliary alignment and regularization losses that improve cross-modal grounding and reduce over-reliance on RGB. We also introduce a pixel-aligned RGB-thermal instruction-tuning dataset and Thermo-VL-Bench, a manually screened RGB-thermal VQA benchmark for low-light and cross-spectrum reasoning. Experiments show strong gains on challenging thermal-only and RGB+thermal reasoning tasks, highlighting the value of prompt-conditioned multispectral fusion. Our dataset and code are publicly available at: https://thusharakart.github.io/Thermo-VL

2605.21869 2026-05-22 cs.CV cs.AI cs.HC

Two-Stage Multimodal Framework for Emotion Mimicry Intensity Prediction

双阶段多模态框架用于情感模仿强度预测

Dinithi Dissanayake, Shaveen Silva, Ovindu Atukorala, Prasanth Sasikumar, Suranga Nanayakkara

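AI summary: This paper proposes a two-stage multimodal framework for predicting six continuous emotion intensity dimensions from in-the-wild video clips, combining textual, acoustic, and visual representations with an optional motion branch, and provides a practical and reproducible baseline.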

Comments 10th Affective & Behavior Analysis in-the-wild, CVPR Workshop 2026

详情
AI中文摘要

我们提交了Hume-ABAW10情感模仿强度(EMI)挑战的参赛方案,旨在从真实多模态视频片段中预测六个连续情绪强度维度:钦佩、娱乐、决心、共情痛苦、兴奋和快乐。我们提出了一种分阶段的多模态框架,结合文本、音频和视觉表示,可选运动分支。我们的方法首先独立训练模态特定的编码器,然后通过轻量级回归器融合其学习的表示,通过模态丢弃和受控编码器适应。在我们提交的系统中,最佳验证性能由文本-音频-视觉-运动融合模型在扩展的4:1划分下获得,平均皮尔逊相关系数为0.4722。尽管运动分支仅带来极小的提升,但其行为值得研究。我们的团队在EMI挑战中获得第三名,测试集的平均皮尔逊相关系数为0.57。总体而言,我们提供了一个实用且可复现的EMI预测基线。

英文摘要

We present our submission to the Hume-ABAW10 Emotional Mimicry Intensity (EMI) Challenge, which aims to predict six continuous emotion intensity dimensions: Admiration, Amusement, Determination, Empathic Pain, Excitement, and Joy, from in-the-wild multimodal video clips. We propose a staged multimodal framework that combines textual, acoustic, and visual representations, with an optional motion branch. Our approach first trains modality-specific encoders independently and then fuses their learned representations through a lightweight regressor with modality dropout and controlled encoder adaptation. Across our submitted systems, the best validation performance is obtained by the text-audio-vision-motion fusion model under the expanded 4:1 split, achieving an average Pearson correlation of 0.4722. Although the motion branch yields only very slight gains, its behavior may be interesting to study. Our team placed third in the EMI challenge, achieving an average Pearson correlation of 0.57 on the test set. Overall, we provide a practical and reproducible baseline for EMI prediction.
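
For reference, the reported metric (Pearson correlation averaged over the six emotion dimensions) can be computed as in the sketch below. The dictionary layout and the function names are assumptions for illustration; the challenge's official scoring code may differ in detail.

```python
import math
from typing import Dict, Sequence

EMOTIONS = [
    "Admiration", "Amusement", "Determination",
    "Empathic Pain", "Excitement", "Joy",
]


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences with nonzero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / math.sqrt(var_x * var_y)


def mean_pearson(
    predictions: Dict[str, Sequence[float]],
    targets: Dict[str, Sequence[float]],
) -> float:
    """Average the per-emotion correlations over all six dimensions."""
    return sum(pearson(predictions[e], targets[e]) for e in EMOTIONS) / len(EMOTIONS)
```

Because each dimension is correlated separately before averaging, a model that is strong on a few emotions and weak on the rest is penalized in proportion to the number of weak dimensions.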

2605.21868 2026-05-22 cs.LG

When to Switch, Not Just What: Transition Quality Prediction in Clash Royale

何时切换,而不仅仅是选择:Clash Royale中的切换质量预测

Heeyun Heo, Huy Kang Kim

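AI summary: This study examines the inverse association between how often players in competitive games switch strategies after losing streaks and their win rate. It proposes TQP (Transition Quality Predictor), a three-stage pipeline that uses PersonaGate, TimingGate, and ScoreFusion to improve strategy recommendation, and introduces SwitchGap as an evaluation metric for a policy's discriminative quality.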

Comments 11 pages, 2 figures, 4 tables; Accepted at IEEE Conference on Games (CoG) 2026

详情
AI中文摘要

在竞技游戏中,玩家经常在连续失利后切换策略,但通过对34,619名Clash Royale玩家的926,334场比赛记录分析,发现切换频率与胜率之间存在反直觉的关联:切换频率与胜率成反比,且这种影响在不同玩家和情境中差异显著。我们归因于许多先前推荐系统的一个局限性,即仅通过预期质量评估策略,而忽略了切换行为的成本和个体在切换倾向上的差异。我们将这一隐含前提称为零切换成本假设。为了解决这一问题,我们将策略推荐重新表述为一个过渡层面的决策问题,并将其实例化为TQP(Transition Quality Predictor),一个三阶段的流程,结构为Who -> When -> What。PersonaGate抑制了那些在经验上与更优结果相关联的玩家的推荐。TimingGate识别出切换可能比保持更有净收益的时刻,使用子类型和状态匹配的基线来控制自然胜率恢复。ScoreFusion通过结合采用性信号和预测的过渡质量(delta WR)来对候选策略进行排名。我们进一步引入了SwitchGap,一种衡量策略判别质量的评估指标,不将观察到的玩家选择视为最优地面真实。这一属性尤为重要,因为最频繁切换的玩家记录了最低的胜率。完整的流程在推荐率为5.4%时实现了SwitchGap的+10.4个百分点,尽管在表现最差的群体中,触发损失的切换者从子类型条件指导中受益最大。

英文摘要

In competitive games, players frequently switch strategies after losing streaks, yet our analysis of 926,334 match records from 34,619 Clash Royale players reveals a counterintuitive pattern: switching frequency is inversely associated with the win rate, with effects that vary substantially across players and situational contexts. We attribute this to a limitation common in many prior recommendation systems, which evaluate strategies by expected quality while overlooking the behavioral cost of switching and individual differences in switching propensity. We refer to this implicit premise as the Zero Switching Cost Assumption. To address this, we reformulate strategy recommendation as a transition-level decision problem and instantiate it as TQP (Transition Quality Predictor), a three-stage pipeline structured as Who -> When -> What. PersonaGate suppresses recommendations for players whose strategic consistency is empirically associated with superior outcomes. TimingGate identifies moments when switching is likely to yield a net benefit over staying, using a subtype- and state-matched baseline to control for natural win-rate recovery. ScoreFusion ranks candidate strategies by combining an adoptability signal with predicted transition quality (delta WR). We further introduce SwitchGap, an evaluation metric that measures a policy's discriminative quality without treating observed player choices as optimal ground truth. This property is particularly important because the most frequent switchers record the lowest win rates. The full pipeline achieves a SwitchGap of +10.4 percentage points at a recommendation rate of 5.4%, and loss-triggered switchers, despite being the lowest-performing group, benefit the most from subtype-conditioned guidance.

2605.21863 2026-05-22 cs.RO

OCELOT: Odometry and Contact Estimation for Legged Robots

Emre Girgin, Cagri Kilic

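AI summary This paper presents a complete leg odometry pipeline based on an error-state extended Kalman filter (ESEKF) that achieves accurate odometry using only proprioceptive data (a body-fixed IMU, joint encoders, and force sensors). Its core contribution is a fused contact detection and uncertainty quantification module that explicitly identifies and rejects slippage.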

Comments 8 pages

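Details
AI-generated abstract (translated from Chinese)

A major challenge in legged robotics is achieving accurate odometry using only onboard proprioceptive sensors. In this work we present a complete leg odometry pipeline based on an error-state extended Kalman filter (ESEKF) that relies solely on proprioceptive data: a body-fixed IMU, joint encoders, and force sensors. The filter state is corrected using feet determined to be in stationary stance. Our core contribution is a fused contact detection and uncertainty quantification module designed to explicitly identify and reject slippage. For each foot the module runs two detectors in parallel: (1) a debounced, force-based finite state machine (FSM) guided by a Gaussian Mixture Model (GMM) to confirm physical contact, and (2) a kinematics-based Generalized Likelihood Ratio Test (GLRT) on the estimated foot velocity. The continuous quality scores from the two detectors are fused to decide whether a foot is both physically loaded and kinematically stationary, and the fused score serves as an uncertainty signal for each contact. To validate the approach, we collected a multimodal dataset of 29 sequences totaling 2.4 km across diverse indoor and outdoor terrain (e.g., concrete, grass, pebbles, and rock). We benchmarked against both proprioceptive and exteroceptive methods. The results show that the method provides accurate odometry estimates and is robust in slippage-prone environments. The code and a real-time ROS2 package are released as open source.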

English abstract

One of the significant challenges in legged robotics is achieving accurate odometry using only onboard proprioceptive sensors. In this study, we present a complete leg odometry pipeline based on an Error-State EKF (ESEKF) that relies exclusively on proprioceptive data: a body-fixed IMU, joint encoders, and force sensors, where the filter's state is corrected by feet determined to be in a stationary stance. The core of our contribution is a fused contact detection and uncertainty quantification module designed to explicitly identify and reject slippage. This module runs two detectors in parallel for each foot: (1) a debounced, force-based Gaussian Mixture Model (GMM) guided Finite State Machine (FSM) to confirm physical contact, and (2) a kinematic-based Generalized Likelihood Ratio Test (GLRT) on the estimated velocity of the foot. The continuous quality scores from both estimators are fused to detect whether the foot is both physically loaded and kinematically stationary, and serve as an uncertainty signal for each contact. To validate our approach, we collected a multi-modal dataset of 29 sequences, totaling 2.4 km, spanning diverse indoor and outdoor terrains (e.g., concrete, grass, pebble, and rock). We benchmarked our approach against both proprioceptive and exteroceptive methods. The results demonstrate our method's efficacy in providing accurate odometry estimates and robustly handling slippage-prone environments. We also share our code and real-time ROS2 package as open-source.
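
The two-detector idea in the abstract can be illustrated with a small, dependency-free sketch. It is not the OCELOT code: the two-component force model, its parameters, the velocity noise level `sigma`, the `exp(-T/2)` score mapping, and the product fusion are all assumptions chosen to keep the example short. The GLRT here tests a zero-mean foot velocity against an unknown constant mean under isotropic Gaussian noise, which reduces to the statistic `T = N * ||mean(v)||^2 / sigma^2`.

```python
import math
from dataclasses import dataclass


def log_gauss(x: float, mean: float, std: float) -> float:
    """Log-density of a 1-D Gaussian."""
    return -0.5 * ((x - mean) / std) ** 2 - math.log(std * math.sqrt(2 * math.pi))


def stance_probability(force_n: float, swing=(0.0, 5.0), stance=(60.0, 15.0)) -> float:
    """Posterior of the 'stance' component in an assumed two-component force GMM."""
    log_odds = log_gauss(force_n, *stance) - log_gauss(force_n, *swing)
    log_odds = max(-50.0, min(50.0, log_odds))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-log_odds))


@dataclass
class DebouncedContact:
    """Two-state FSM that flips only after `hold` consecutive disagreeing samples."""
    hold: int = 3            # hypothetical debounce length
    in_contact: bool = False
    _streak: int = 0

    def update(self, p_stance: float) -> bool:
        wants_contact = p_stance > 0.5
        if wants_contact != self.in_contact:
            self._streak += 1
            if self._streak >= self.hold:
                self.in_contact = wants_contact
                self._streak = 0
        else:
            self._streak = 0
        return self.in_contact


def glrt_statistic(velocities, sigma: float = 0.05) -> float:
    """GLRT statistic for 'foot velocity has zero mean' over a window of 3-D samples."""
    n = len(velocities)
    mean = [sum(v[i] for v in velocities) / n for i in range(3)]
    return n * sum(m * m for m in mean) / sigma ** 2


def fused_contact_quality(force_n: float, velocities) -> float:
    """Assumed fusion: product of the load score and a stationarity score in (0, 1]."""
    stationarity = math.exp(-0.5 * glrt_statistic(velocities))
    return stance_probability(force_n) * stationarity
```

One plausible use of the fused score, again an assumption rather than something the abstract states, is to scale the measurement noise of each foot's zero-velocity update in the filter, so that a loaded but slipping foot contributes little to the correction.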