arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.23672 2026-05-25 cs.CV

RiGS: Rigid-aware 4D Gaussian Splatting from a Single Monocular Video

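Chenyu Wu, Wanhua Li, Zhu-Tian Chen, Hanspeter Pfister

AI summary: Reconstructing dynamic 3D scenes from monocular video is a fundamental yet highly challenging task, because real-world motion often combines long-term smooth transformations with short-term complex deformations. This paper proposes RiGS, a rigid-aware 4D Gaussian splatting method that captures motion at multiple temporal scales simultaneously. The method introduces three kinds of Gaussian primitives, representing the static background, long-term low-frequency motion, and short-term high-frequency dynamics, and uses an object-wise dynamic mask to aggregate long-range spatiotemporal motion information and guide the decomposition into static and dynamic regions. Experiments show that RiGS achieves state-of-the-art performance on novel view synthesis.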

Abstract

Reconstructing dynamic 3D scenes from monocular videos is a fundamental yet highly challenging task, as real-world motions often involve both long-term smooth transformations and short-term complex deformations. Existing methods either struggle to maintain temporal consistency or fail to capture high-frequency dynamics due to limited motion modeling capacity. In this work, we present Rigid-aware 4D Gaussian Splatting (RiGS), which simultaneously captures motions across multiple temporal scales. Specifically, RiGS introduces three types of Gaussian primitives: static, rigid, and transient, which represent static backgrounds, long-term low-frequency motions, and short-term high-frequency dynamics, respectively. An object-wise dynamic mask is proposed to aggregate long-range spatiotemporal motion information and guide the decomposition of static and dynamic regions. To jointly model motion across scales, rigid Gaussians are allowed to transition into transient Gaussians based on their temporal duration, and both are optimized under scene flow guidance, providing dense 3D motion supervision. Extensive experiments demonstrate that RiGS achieves state-of-the-art performance on novel view synthesis benchmarks. Code is available at https://github.com/ladvu/RiGS.

2605.23668 2026-05-25 cs.CL cs.AI

OnePred: Next-Query Prediction via Recursive Intent Memory in Multi-Turn Conversations

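Jiangwang Chen, Bowen Zhang, Zixin Song, Jiazheng Kang, Xiao Yang, Da Zhu, Guanjun Jiang

AI summary: This work proposes OnePred, a model for predicting the user's next query in multi-turn conversations, with the aim of making conversational systems more proactive. Its core idea is to capture the evolution of user intent through a recursively updated intent memory, enabling efficient and accurate prediction without relying on the full dialogue history. The model is trained with two-stage reinforcement learning, which teaches both what to predict and what to compress. The work also releases the NQP-Bench benchmark; experiments show that OnePred cuts per-turn token consumption by up to 22x relative to full-history input while exceeding the baselines in prediction quality.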

Abstract

Although large language model (LLM) conversational systems process millions of multi-turn dialogues daily, they remain fundamentally reactive: they respond only after the user types a query. A key step toward proactive interaction is next-query prediction, which anticipates the user's subsequent query based solely on the preceding dialogue. Progress on this task is hindered by the lack of dedicated benchmarks and a fundamental efficiency-quality trade-off: naively concatenating full dialogue history incurs linearly growing token consumption, while truncating to the latest turn discards crucial cross-turn context. Our key insight is that accurate prediction does not require re-reading raw history; it suffices to track the user's evolving intent trajectory across topics, unresolved needs, and interest shifts. We propose OnePred, which maintains a recursively updated memory as its sole cross-turn context, bounding the per-turn cost independently of conversation length. We train the model via a two-stage reinforcement learning pipeline that first teaches what to predict, then what to compress, shaping the memory into a prediction-oriented intent chain. To establish a rigorous testbed, we introduce NQP-Bench, spanning three diverse subsets. Experiments demonstrate that OnePred reduces per-turn token consumption by up to 22x compared to full-history inputs while consistently exceeding all baselines in prediction quality, with larger gains on longer conversations. Our code is publicly available at https://github.com/ZBWpro/OnePred.
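
The efficiency argument in this abstract can be illustrated with a small sketch that compares per-turn input size for full-history concatenation against a bounded recursive memory. The turn lengths and the `memory_budget` value are hypothetical and not taken from the paper; the sketch only shows the linear-versus-constant growth the abstract describes, not OnePred's actual implementation.

```python
def full_history_cost(turn_lengths):
    """Per-turn input tokens when every prior turn is concatenated."""
    costs, running = [], 0
    for n in turn_lengths:
        running += n          # history grows with each turn
        costs.append(running)
    return costs


def recursive_memory_cost(turn_lengths, memory_budget):
    """Per-turn input tokens when only a bounded memory plus the latest turn is read."""
    return [memory_budget + n for n in turn_lengths]


# Hypothetical conversation: 20 turns of 100 tokens each, memory capped at 150 tokens.
turns = [100] * 20
full = full_history_cost(turns)
mem = recursive_memory_cost(turns, memory_budget=150)
ratio = full[-1] / mem[-1]   # cost ratio at the final turn
```

With these made-up numbers the full-history input reaches 2000 tokens at turn 20 while the memory-based input stays at 250, and the gap keeps widening as the conversation gets longer.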

2605.23663 2026-05-25 cs.HC cs.LG

Detecting Drunk Driving Using Off-the-Shelf Smartwatches

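Robin Deuber, Lanlan Yang, Michal Bechny, Christoph Heck, Matthias Pfäffli, Matthias Bantle, Florian von Wangenheim, Elgar Fleisch, Wolfgang Weinmann, Manuel Günther, Felix Wortmann, Varun Mishra

AI summary: This paper studies how commercially available smartwatches can detect drunk driving in order to prevent road traffic accidents. Using wrist accelerometer data and physiological signals derived from heart rate variability, it proposes a machine-learning detection system and evaluates it in a randomized controlled experiment on a closed test track. The system is trained with logistic regression and a 1D convolutional neural network; the CNN reaches participant-averaged AUROCs of 0.88 and 0.86 on the two detection tasks, offering a feasible wearable-based route to drunk-driving prevention.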

Comments
27 pages, 7 figures

Abstract

Alcohol-impaired driving remains a major yet preventable cause of road traffic injury and death, with many drivers underestimating their level of intoxication. Compared to in-vehicle systems, mobile drunk-driving detection using consumer smartwatches offers a scalable way to trigger preventive interventions and increase awareness without additional in-vehicle hardware. We introduce a system that leverages wrist accelerometer data and heart rate variability-derived physiological signals to detect alcohol-related driving impairment. We collected data in a randomized, controlled three-arm test-track study (n=54) and trained both logistic regression models with window-aggregated features and a two-tower 1D convolutional neural network (CNN) to detect alcohol-impaired driving. The CNN achieved a participant-averaged area under the receiver operating characteristic curve (AUROC) of 0.88 for detecting any alcohol intoxication and 0.86 for detecting driving above the WHO-recommended limit of 0.05 g/dL. To the best of our knowledge, this is the first work to (1) demonstrate drunk-driving detection using consumer smartwatches, (2) develop and evaluate such a system in a real vehicle on a closed test track, and (3) rigorously assess generalization to unseen participants. Together, these findings highlight the potential of wearable-based sensing to support scalable, measurement-driven prevention of alcohol-related traffic harm.
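
A participant-averaged AUROC can be sketched as follows. The abstract does not spell out the averaging procedure, so this assumes one AUROC per participant followed by an unweighted mean; the record layout and the example values are hypothetical.

```python
from collections import defaultdict


def auroc(labels, scores):
    """AUROC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def participant_averaged_auroc(records):
    """records: iterable of (participant_id, label, score) tuples."""
    grouped = defaultdict(lambda: ([], []))
    for pid, label, score in records:
        grouped[pid][0].append(label)
        grouped[pid][1].append(score)
    values = [auroc(labels, scores) for labels, scores in grouped.values()]
    return sum(values) / len(values)   # unweighted mean over participants


# Hypothetical windows from two participants: (id, impaired?, model score).
records = [
    ("a", 0, 0.10), ("a", 0, 0.40), ("a", 1, 0.35), ("a", 1, 0.80),
    ("b", 0, 0.20), ("b", 1, 0.90),
]
result = participant_averaged_auroc(records)
```

Averaging per participant keeps a participant with many recorded windows from dominating the metric, which matters when the goal is generalization to unseen people.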

2605.23656 2026-05-25 cs.CV

Recursive Block-Diagonal Coupling for Resource-Efficient Training of Vision Models

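Maxim Henry, Adrien Deliège, Sébastien Piérard, Marc Van Droogenbroeck

AI summary: This paper proposes RBDC, an efficient training method that builds a wide model by recursively coupling several narrow models in a parameter-free, block-diagonal fashion, allowing the training budget to be allocated flexibly. On ImageNet, it uses 30% fewer FLOPs than standard from-scratch training at similar test accuracy, and it outperforms existing model-growth methods at equal compute. Models trained with RBDC also perform better on downstream object detection and instance segmentation.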

Comments
22 pages, 3 figures, 4 tables, and 34 references

Abstract

Training high-capacity vision models from scratch requires substantial computational resources. To improve the training efficiency of a wide target model, existing growth methods often assume the availability of narrower models, obscuring the true computational cost of the entire pipeline. We propose an efficient training protocol, RBDC, that builds wide models by recursively coupling narrower, independently trained models in a parameter-free block-diagonal way. This allows the available training budget to be allocated flexibly across all the models involved. Evaluated with vision transformers (DeiT) and convolutional networks (ResNet) on ImageNet, our RBDC training protocol is markedly more efficient than training from scratch with the standard protocol, yielding a 30% reduction in FLOPs at similar test accuracy. It also achieves higher performance at the same training FLOPs than training protocols from the model growth literature. Finally, we show that our models can serve as better backbones than their original counterparts for downstream object detection and instance segmentation tasks.
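
The basic idea of a block-diagonal coupling can be shown on a single linear layer. The sketch below is an illustration under simple assumptions (plain Python lists, two narrow layers, hypothetical weights); how RBDC applies the coupling across real architectures and recursion levels is not described in the abstract and is not reproduced here.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def block_diag(A, B):
    """Return the block-diagonal matrix [[A, 0], [0, B]] without adding parameters."""
    cols_a, cols_b = len(A[0]), len(B[0])
    top = [row + [0.0] * cols_b for row in A]
    bottom = [[0.0] * cols_a + row for row in B]
    return top + bottom


# Hypothetical weights of two independently trained narrow layers.
A = [[1.0, 2.0], [0.0, 1.0]]
B = [[3.0, 0.0], [1.0, 1.0]]
x_a, x_b = [1.0, 1.0], [2.0, 0.5]

wide = block_diag(A, B)
coupled_out = matvec(wide, x_a + x_b)          # wide layer on concatenated input
separate_out = matvec(A, x_a) + matvec(B, x_b)  # narrow layers run separately
```

Right after coupling, the wide layer reproduces the concatenated outputs of the narrow layers, so the wide model starts from what the narrow models already learned; the off-diagonal zeros are the entries that further training of the wide model can then fill in.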

2605.23655 2026-05-25 cs.CV cs.AI cs.LG cs.MM

CVSearch: Empowering Multimodal LLMs with Cognitive Visual Search for High-Resolution Image Perception

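Liupeng Li, Haoqian Kang, Zhenyu Lu, Jinpeng Wang, Bin Chen, Ke Chen, Yaowei Wang

AI summary: High-resolution image perception is a key bottleneck for multimodal large language models. To resolve the tension between coverage and efficiency in visual search, this paper proposes CVSearch, a training-free adaptive framework that dynamically schedules search strategies through an Assess-then-Search workflow. It uses expert-assisted search when global information is insufficient and falls back to a semantic-aware scanning mechanism on failure, which reduces object fragmentation, and it improves exploration of local details with a dynamic bottom-up search strategy. Experiments show that CVSearch achieves state-of-the-art accuracy and substantially better search efficiency on high-resolution benchmarks.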

Comments
Accepted by ICML 2026. 22 pages, 12 figures, 7 tables

Abstract

High-resolution (HR) image perception presents a key bottleneck for multimodal large language models (MLLMs). While visual search offers a promising solution, existing methods struggle with the trade-off between coverage and efficiency. Visual expert-assisted search is efficient but prone to blind spots when proposals fail, whereas scan-based search guarantees coverage at the cost of computational redundancy and semantic fragmentation. To address this dilemma, we introduce CVSearch, a training-free adaptive framework that dynamically schedules search strategies via an Assess-then-Search workflow. Specifically, CVSearch first invokes expert-assisted search when global information is insufficient, and only triggers a novel semantic-aware scanning mechanism upon failure. Distinct from rigid grid partitioning, this efficient scanning paradigm incorporates Semantic Guided Adaptive Patching to decompose images into semantically consistent regions, effectively mitigating object fragmentation. Furthermore, we devise a Dynamic Bottom-Up Search strategy driven by a Visual Complexity prior to enable efficient and precise iterative exploration of local details. Extensive experiments on HR benchmarks demonstrate that CVSearch achieves state-of-the-art accuracy while substantially improving search efficiency. Code is released at https://github.com/liliupeng28/ICML26-CVSearch.

2605.23653 2026-05-25 cs.CV

ExpOS: Explainable Open-Surgery Skills Assessment Using 3D Hand Reconstruction

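Roi Papo, Idan Smoller, Shlomi Laufer

AI summary: This paper proposes ExpOS, an explainable framework for assessing open-surgery skills based on 3D hand reconstruction, aiming at automatic, feedback-oriented evaluation in surgical training. The method extracts hand poses and tool detections from surgical videos, learns discriminative temporal patterns, and uses a temporal convolutional network with an attention mechanism to produce frame-level importance maps, from which it predicts skill level and provides interpretable feedback. Experiments show strong correlation with expert ratings across several surgical tasks, with the best results on fascial closure, indicating the approach's potential for scalable, practical use.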

Comments
10 pages, 4 figures

Abstract

Timely and transparent feedback is essential for effective surgical training, yet current assessment remains dependent on expert observation, limiting scalability and opportunities for autonomous practice. We present ExpOS, an explainable framework for data-driven assessment of open-surgery skills designed to enable automatic, feedback-oriented evaluation. Rather than relying on expert-defined metrics, ExpOS learns discriminative temporal patterns directly from motion data and identifies the segments and behaviors most predictive of skill level. We trained and evaluated the method on 221 videos of medical students performing three open-surgery tasks. Hand poses and tool detections were extracted from each frame to derive kinematic descriptors and global motion statistics. Spatiotemporal hand-tool dynamics were modeled using a temporal convolutional backbone with attention-based pooling to generate frame-level importance maps. These representations were fused with global motion statistics to predict skill level and to provide interpretable feedback. ExpOS provides multi-level explainability by identifying when informative events occur through attention weights and which motion characteristics most influence predictions through global feature analysis. Across tasks, the framework achieved strong correlation with expert ratings, with best performance on fascial closure (r = 0.778, R2 = 0.74). These results demonstrate that combining weakly-supervised temporal importance learning with interpretable motion statistics enables scalable and actionable surgical skill assessment.
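
Attention-based pooling of the kind mentioned in this abstract can be sketched in a few lines. This is a generic illustration, not the ExpOS architecture: the per-frame features and scores are hypothetical inputs that would normally come from a temporal convolutional backbone and a learned scoring layer.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(frame_features, frame_scores):
    """Return the attention-weighted mean feature and the per-frame weights.

    The weights sum to 1 and can be read as a frame-level importance map.
    """
    weights = softmax(frame_scores)
    dim = len(frame_features[0])
    pooled = [
        sum(w * f[d] for w, f in zip(weights, frame_features))
        for d in range(dim)
    ]
    return pooled, weights


# Hypothetical 3-frame clip with 2-D features; frame 1 gets the highest score.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled, weights = attention_pool(features, [0.0, 5.0, 0.0])
```

Because the same weights drive both the pooled representation and the importance map, frames that most influence the prediction are also the ones highlighted as feedback.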

2605.23652 2026-05-25 cs.AI

One Policy, Infinite NPCs: Persona-Traceable Shared RL Policies for Scalable Game Agents

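Yoosung Hong

AI summary: This work proposes pcsp, a shared reinforcement learning policy for scalable game NPC control that generates personality-consistent, controllable behavior from free-form persona descriptions. The policy is conditioned on frozen LLM embeddings and combines techniques such as low-rank projection and a consistency training objective to ensure persona consistency and behavioral diversity. Experiments report zero-shot persona identification up to 17x above chance, a Spearman correlation of about 0.73 for semantic-behavioral alignment, and 22x faster inference than an LLM-as-policy baseline, and the approach is validated in a commercial game engine.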

Comments
18 pages, 15 figures, 14 tables

Abstract

On a 300-persona life-simulation benchmark, pcsp achieves compositional zero-shot persona identification up to 17x above chance, Spearman ρ ≈ 0.73 semantic-behavioral alignment, and 22x faster inference than an LLM-as-policy baseline. Life simulation games require hundreds to thousands of non-player characters (NPCs) that behave consistently with distinct personalities while remaining controllable through designer-authored natural language. Existing methods fail on constraints like persona consistency, controllability, or real-time inference. We introduce pcsp (Persona Conditioned Shared Policy), a single reinforcement learning policy conditioned on frozen LLM embeddings of free-form persona descriptions. pcsp combines once-per-NPC persona encoding, low-rank persona projection, neural persona conditioning, and a PPO + InfoNCE consistency + KL diversity training objective. Across three experimental settings, ablations show that the InfoNCE trajectory-consistency objective is load-bearing: removing it collapses zero-shot persona identification to chance. External validation on Melting Pot 2.4.0 substrates confirms that our method produces persona-conditioned behavioral divergence in multi-agent strategic environments. We distinguish two senses of held-out evaluation: compositional zero-shot and vocabulary-expansion held-out. Finally, a UE5 deployment reproduces the in-engine persona-conditioning ablation at 64 agents with a low failure rate, showing that the sub-frame inference profile survives in a commercial game engine. These results prove that shared RL policies can support scalable, real-time, persona-conditioned NPC control.

2605.23645 2026-05-25 cs.LG cs.AI

Learning Through Noise: Why Subliminal Learning Works and When It Fails

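Vincent C. Brockers, Roman D. Ventzke, Valentin Neuhaus, Belén Hidalgo-Ogalde, Viola Priesemann

AI summary: This paper studies "subliminal learning" in artificial neural networks: the mechanism by which a student model implicitly picks up task-relevant knowledge or biases from a teacher when distilled on task-unrelated input-output pairs. It finds that the process does not depend on matching teacher and student initializations but is governed by the compatibility of output heads. In controlled experiments, the authors show that even with random initialization or changes to the network structure, a student can learn useful information from the teacher through compatible auxiliary heads and, under certain conditions, reach teacher-level task performance. The study provides a theoretical explanation for subliminal learning and delineates where it applies and where it fails.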

Abstract

In the context of artificial neural networks, subliminal learning refers to the transfer of task-relevant knowledge or unintended biases from teacher to student models through distillation on task-unrelated input-output pairs. Prior explanations tie this effect to shared or closely matched teacher-student initialization. We show that a closely matched initialization is not necessary. Instead, subliminal learning is governed by compatible output heads. Using a controlled MNIST setting, we split outputs into an auxiliary head (for auxiliary, task-unrelated noise signals) and a class head (for classification) to demonstrate that subliminal learning occurs even when we randomly initialize hidden layers and remove layers, add new layers, or change the architecture (MLP-to-CNN). Compatible auxiliary heads enable transfer of a recoverable teacher signal, bringing the student's representations closer to the teacher's. When the class heads remain compatible as well, students trained only on task-unrelated noise can approach, and in favorable regimes match, teacher-level task performance. Our setting enables us to develop a theory that explains the mechanism of subliminal learning and to derive upper bounds on when subliminal learning fails. Together, our results turn subliminal learning from a surprising transfer effect into a theoretically grounded mechanism with predictable limits.

2605.23643 2026-05-25 cs.CR cs.LG

Less Effort, Shorter Proofs: Reinforcement Learning for Security Protocol Analysis in Tamarin

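Matthias Cosler, Cas Cremers, Bernd Finkbeiner, Mohamed Ghanem, Niklas Medinger

AI summary: This paper proposes a reinforcement-learning framework to assist the Tamarin tool in the formal verification of security protocols. Inspired by AlphaZero and AlphaProof, it combines Monte Carlo Tree Search with a neural heuristic to make proof search more efficient and proofs shorter. Experiments on multiple case studies show that it finds more proofs automatically than Tamarin's default search and produces shorter proofs than both the default and human-engineered heuristics, reducing the human effort that verification requires.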

Abstract

Tools like Tamarin and ProVerif have achieved notable success in analyzing and verifying complex real-world protocols such as EMV, 5G, and WPA2, even detecting zero-day exploits. Despite these successes, verifying such protocols remains a time-consuming, challenging task, often requiring significant human effort and expertise. In this paper, we present a reinforcement learning (RL) framework inspired by AlphaZero and AlphaProof that implements a new style of proof search for Tamarin. We have developed a stateless API for Tamarin that acts as a classical RL environment. We guide a Monte Carlo Tree Search (MCTS) by a neural heuristic that learns from completed subproofs. We evaluate our framework on 16 case studies, ranging from classical protocol models to challenging state-of-the-art protocol models from recent publications. Our method finds more proofs automatically than Tamarin's standard search and produces shorter proofs than both the standard and human-engineered heuristics. Our pipeline is applicable out of the box to assist Tamarin users in active research, reducing the human effort required. Moreover, our standardized interface provides a programmatic way for users to interact with Tamarin. Finally, our work demonstrates the promising potential of adapting RL-based methods to the Tamarin domain.

2605.23635 2026-05-25 stat.ML cs.LG

Dirichlet-Based Monte Carlo Dropout for Uncertainty Estimation in Neural Networks

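Rouaa Hoblos, Noura Dridi, Noureddine Zerhouni, Zeina Al Masry

AI summary: Conventional neural networks provide no uncertainty estimates for their predictions, while Bayesian neural networks can quantify uncertainty but are computationally expensive. This paper proposes a Dirichlet-based Monte Carlo Dropout method that improves the quality of uncertainty estimates while retaining computational efficiency. By modeling class probabilities with a Dirichlet distribution, it obtains a more informative uncertainty representation, and the reported results support its effectiveness for uncertainty calibration.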

Journal ref
56es Journées de Statistique de la SFdS, Jun 2025, Marseille, France

Abstract

Traditional neural networks provide deterministic predictions without inherent uncertainty estimates. While Bayesian Neural Networks (BNNs) offer a principled approach to uncertainty quantification, their computational complexity limits scalability. Monte Carlo (MC) Dropout, initially introduced as a regularization technique, has been shown to approximate Bayesian inference by enabling probabilistic modeling through multiple stochastic forward passes. In this work, we enhance uncertainty estimation in deep learning by integrating a Dirichlet-based framework within MC Dropout. Specifically, we leverage the formulation proposed by Sensoy et al. (2018), where class probabilities are modeled using a Dirichlet distribution, allowing for a more informative uncertainty representation. The proposed approach maintains the computational efficiency of MC Dropout while improving the quality of uncertainty estimates. We discuss the theoretical foundations of our method and compare it with existing uncertainty quantification techniques. The results highlight the effectiveness of the proposed method in producing well-calibrated uncertainty estimates, offering a practical solution for uncertainty-aware deep learning models.
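
The Dirichlet parameterization from Sensoy et al. (2018) maps non-negative per-class evidence e_k to concentration parameters alpha_k = e_k + 1, with expected class probabilities alpha_k / S and an uncertainty mass K / S, where S is the sum of the alpha_k and K is the number of classes. The sketch below applies that summary to evidence averaged over several stochastic forward passes. Averaging evidence across passes is an assumption made for illustration; the abstract does not say how the paper combines the two ideas, and the evidence values are hypothetical stand-ins for network outputs.

```python
def dirichlet_summary(evidence):
    """Expected probabilities and uncertainty mass for non-negative evidence."""
    alpha = [e + 1.0 for e in evidence]        # alpha_k = e_k + 1
    strength = sum(alpha)                      # S = sum of alpha_k
    probs = [a / strength for a in alpha]      # expected class probabilities
    uncertainty = len(alpha) / strength        # u = K / S, in (0, 1]
    return probs, uncertainty


def mc_dropout_dirichlet(evidence_samples):
    """Average evidence over T stochastic passes, then summarize (assumed scheme)."""
    t = len(evidence_samples)
    k = len(evidence_samples[0])
    mean_evidence = [sum(s[i] for s in evidence_samples) / t for i in range(k)]
    return dirichlet_summary(mean_evidence)


# Hypothetical evidence from two dropout passes on a 3-class problem.
confident_probs, confident_u = mc_dropout_dirichlet([[8.0, 0.0, 0.0], [10.0, 0.0, 0.0]])
vacuous_probs, vacuous_u = mc_dropout_dirichlet([[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
```

With no evidence at all the prediction is uniform and the uncertainty mass is 1, whereas strong evidence for one class drives the uncertainty toward 0; this is the extra signal a plain softmax output does not expose.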

2605.23634 2026-05-25 cs.CV cs.AI

DualMem: Bypassing the Objectness Bottleneck for Calibrated Unknown-Stream Filtering in Open-World Object Detection

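Yingjun Xiao, Xi Chen, Gang Fang, Siyuan Chen

AI summary: Open-world object detection (OWOD) requires a detector to localize known classes while also identifying unknown objects to support future incremental learning. This paper finds that the unknown-prediction stream of current strong OWOD detectors contains an excessive share of background false positives, and traces the problem to an information bottleneck at the objectness head. The authors propose DualMem, a calibrated post-hoc filter in frozen SigLIP feature space that screens unknown proposals with a non-parametric likelihood ratio test, reducing background false unknowns while leaving known-class detection performance unchanged.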

Abstract

Open-world object detection (OWOD) requires detectors to localize known classes while identifying unknown objects for future incremental learning. We find that the unknown prediction streams of strong OWOD detectors are heavily polluted: on M-OWODB, across PROB, OW-DETR, and HypOW, future-task positive unknowns make up less than 10% of unknown predictions, whereas background false positives account for 46-71%. We show that this is not a missing-information problem, but an information bottleneck at the objectness head. On PROB Task 1, a linear probe on the 256-D decoder query achieves an AUROC of 0.908 for positive-versus-negative unknown discrimination, but the final one-dimensional objectness scalar drops to 0.642. A frozen SigLIP feature, without access to the detector, independently recovers much of this proposal-level separability at the filtering stage (AUROC = 0.871). Motivated by this finding, we propose DualMem, a calibrated post-hoc filter that assumes a small image-disjoint annotated calibration split of held-out future-task objects and performs a non-parametric likelihood ratio test in frozen SigLIP feature space. DualMem uses a k-nearest-neighbor positive memory to protect future-task objects and a negative memory to suppress background-like proposals. Its decision threshold is chosen by Neyman-Pearson calibration, giving users an explicit trade-off between false-unknown suppression and novel recall. Across PROB, OW-DETR, and HypOW on M-OWODB Task 1, DualMem reduces background-type false unknown proposals per image by 44.9%-66.3%, with a mean reduction of 56.6%. On PROB Task 1, it more than doubles the reduction achieved by a natural K-means prototype baseline, while leaving known-class mAP unchanged because known detections bypass the filter.
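
A k-nearest-neighbor likelihood-ratio filter with a calibrated threshold can be sketched as below. This is a rough illustration, not DualMem itself: the score (log ratio of k-th neighbor distances), the choice to constrain recall on calibration positives, and all feature vectors are assumptions, and real features would come from a frozen image encoder rather than 2-D points.

```python
import math


def kth_nn_distance(x, memory, k):
    """Euclidean distance from x to its k-th nearest neighbor in memory."""
    dists = sorted(math.dist(x, m) for m in memory)
    return dists[k - 1]


def lr_score(x, pos_memory, neg_memory, k=1, eps=1e-12):
    """Higher means more object-like: closer to positives than to negatives."""
    d_pos = kth_nn_distance(x, pos_memory, k)
    d_neg = kth_nn_distance(x, neg_memory, k)
    return math.log((d_neg + eps) / (d_pos + eps))


def calibrate_threshold(cal_pos_scores, target_recall):
    """Largest threshold that keeps at least target_recall of calibration positives."""
    ranked = sorted(cal_pos_scores, reverse=True)
    n_keep = max(1, math.ceil(target_recall * len(ranked)))
    return ranked[n_keep - 1]


# Hypothetical memories in a toy 2-D feature space.
pos_memory = [(0.0, 0.0), (0.0, 1.0)]
neg_memory = [(5.0, 5.0), (6.0, 5.0)]
object_like = lr_score((0.1, 0.1), pos_memory, neg_memory)
background_like = lr_score((5.1, 5.0), pos_memory, neg_memory)
threshold = calibrate_threshold([0.2, 1.0, 0.5, 2.0], target_recall=0.75)
```

A proposal would be kept as unknown when its score is at least `threshold`; raising `target_recall` lowers the threshold, which protects more novel objects at the price of letting more background through.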

2605.23632 2026-05-25 cs.LG

Valid and Expressive Copulas for Irregular Multivariate Time Series

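Christian Klötergens, Tom Hanika, Lars Schmidt-Thieme, Vijaya Krishna Yalavarthi

AI summary: This paper proposes CopFITi, a model for probabilistic forecasting of irregular multivariate time series. It combines the expressivity of normalizing flows for the univariate marginals with the flexibility and consistency of a Gaussian mixture copula for the joint dependency structure. The work presents the first copula model for irregular multivariate time series that is marginalization-consistent, and sets a new state of the art in joint density modeling.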

Abstract

We introduce CopFITi, a copula model for probabilistic forecasting of irregular multivariate time series (IMTS). Our model combines the expressivity of normalizing flows for univariate marginals with the consistency and flexibility of a Gaussian Mixture Copula for the joint dependency structure. Our experiments show that copula-based approaches, which decouple the marginals from the joint, yield better marginal models than architectures that directly fit the full joint. With CopFITi, we propose the first IMTS copula that is marginalization-consistent by construction and establish a new state of the art in joint IMTS density modeling.
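
For readers unfamiliar with copulas, the decoupling of marginals from the joint rests on Sklar's theorem. The block below states the standard decomposition in generic notation ($F_i$, $f_i$, $C$, $c$ are textbook symbols, not the paper's); it is background, not a description of CopFITi's architecture.

```latex
% Sklar's theorem: a joint CDF factors into marginals and a copula C.
F(x_1, \dots, x_d) = C\bigl(F_1(x_1), \dots, F_d(x_d)\bigr)

% Corresponding density, with u_i = F_i(x_i) and copula density c:
f(x_1, \dots, x_d) = c(u_1, \dots, u_d) \prod_{i=1}^{d} f_i(x_i)

% Marginalizing a copula over a variable amounts to setting its argument to 1,
% e.g. dropping the last variable:
C_{1:d-1}(u_1, \dots, u_{d-1}) = C(u_1, \dots, u_{d-1}, 1)
```

In this reading, the marginals $f_i$ can be fitted by flexible univariate models while the copula density $c$ carries only the dependency structure, and a model is marginalization-consistent when its prediction for a subset of variables agrees with the marginal of its prediction for the full set.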

2605.23629 2026-05-25 cs.CV

DDX-TRACE: A Benchmark for Medical Diagnostic Trajectories in VLMs

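Jiazhen Pan, Weixiang Shen, Jun Li, Julian Canisius, Felix Bitzer, Paula Roßmüller, Jiancheng Yang, Virginie Kreutzinger, Daniel Rueckert, Benedikt Wiestler

AI summary: DDX-TRACE is a benchmark for evaluating vision-language models over the course of medical diagnosis, focused on neuroradiology and comprising 211 challenging cases. It simulates a realistic diagnostic workup: starting from limited clinical information, a model must request imaging studies step by step, update its diagnostic hypotheses, and finally commit to a diagnosis. The study finds that conventional final-answer scoring can misrepresent diagnostic quality, whereas DDX-TRACE, by focusing on the diagnostic trajectory, exposes key weaknesses in evidence acquisition, uncertainty updating, and reasoning.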

Comments
41 pages

Abstract

Medical diagnosis is not a single prediction from a fully specified vignette. It is a sequential workup: clinicians decide what evidence to obtain, revise a differential diagnosis, and stop when the diagnosis is sufficiently supported. Most medical AI benchmarks instead reveal the relevant context upfront and score only the final answer, making unsupported correct guesses, premature closure, inefficient workups, and poor uncertainty updating invisible. We introduce DDX-TRACE, a physician-adjudicated benchmark for multimodal neuroradiology that evaluates diagnostic trajectories under hidden evidence over 211 challenging cases. Each case begins with limited clinical history; models request imaging studies in free form, receive matched image bundles when available, update a probabilistic differential diagnosis after each turn, and stop with a localized final diagnosis. Evaluating state-of-the-art VLMs, we find that final diagnosis scores can substantially misrepresent workup quality: models may guess plausible diagnoses without essential evidence, request useful studies but misinterpret raw images, or acquire evidence inefficiently while updating uncertainty poorly. Controlled evidence variants isolate bottlenecks in planning, visual evidence extraction, and downstream differential reasoning. DDX-TRACE shifts medical AI evaluation from final answers to evidence-supported diagnostic trajectories.

2605.23628 2026-05-25 cs.LG

How Hard is it to Rig a Benchmark? A Social Choice Analysis of Leaderboard Robustness

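Polina Gordienko, Georg Schollmeyer, Frauke Kreuter, Christoph Jansen

AI summary: This paper studies how hard it is to manipulate model rankings on multi-task benchmarks through the choice of training data, casting the problem as election manipulation in social choice theory. Treating datasets as voters and models as candidates, the authors prove that benchmark-specific training is NP-hard under Borda count and mean win rate. They also introduce an instance-level robustness measure, the minimum number of datasets a model developer must include in training to top a leaderboard, and compare robustness across aggregation rules on several benchmarks, finding that mean win rate is the hardest to manipulate.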

Abstract

Multi-task benchmarks have become a central pillar of machine learning research, yet their growing influence has incentivised benchmark gaming, that is, strategic actions taken to improve the leaderboard rank of a specific model. Treating datasets as voters and models as candidates, we consider benchmark-specific training (the inclusion of benchmark data in training) as a form of election manipulation. For any ordinal benchmark, the problem of choosing datasets to train on so that a target model becomes top-ranked corresponds to shift bribery, a class of manipulation problems from computational social choice. Leveraging this identification, we show that the benchmark-specific training problem is NP-hard under Borda count and mean win rate. Complementing this worst-case perspective, we introduce instance-level robustness, the minimum number of datasets a model developer must include in training to top a given leaderboard, and derive expressions for it under arithmetic mean, median, mean win rate and pairwise majority. We evaluate these expressions on MMLU under HELM and on BIG-Bench Hard (BBH) under the Open LLM Leaderboard. Across both suites, mean win rate is hardest to manipulate: this gap is clear on BBH (24 tasks, 4507 models), where its median robustness is 22 tasks (92%), compared with 13 (54%) under arithmetic mean and 12 (50%) under median and pairwise majority.
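
The difference between two of the aggregation rules named in this abstract can be seen on a toy leaderboard. The sketch below computes arithmetic mean and mean win rate; the scores are hypothetical, ties are counted as losses for simplicity, and the paper's robustness expressions are not reproduced.

```python
def arithmetic_mean(scores):
    """scores: dict model -> list of per-task scores (higher is better)."""
    return {m: sum(v) / len(v) for m, v in scores.items()}


def mean_win_rate(scores):
    """Average over tasks of the fraction of other models beaten on that task."""
    models = list(scores)
    n_tasks = len(next(iter(scores.values())))
    result = {}
    for m in models:
        per_task = []
        for t in range(n_tasks):
            wins = sum(scores[m][t] > scores[o][t] for o in models if o != m)
            per_task.append(wins / (len(models) - 1))
        result[m] = sum(per_task) / n_tasks
    return result


# Hypothetical leaderboard: model A is far ahead on one task and behind on two.
scores = {
    "A": [0.99, 0.30, 0.30],
    "B": [0.40, 0.50, 0.50],
    "C": [0.35, 0.45, 0.45],
}
top_by_mean = max(arithmetic_mean(scores), key=arithmetic_mean(scores).get)
top_by_win_rate = max(mean_win_rate(scores), key=mean_win_rate(scores).get)
```

Here one outsized score is enough to put A first under the arithmetic mean, while mean win rate ranks B first because it only counts how many tasks and opponents a model beats. That ordinal behavior is consistent with the abstract's finding that mean win rate requires changing many more tasks to manipulate.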

2605.23623 2026-05-25 cs.CR cs.AI cs.LG

Adversarial Vulnerability Under Temporal Concept Drift: A Longitudinal Study of Android Malware Detection

时间概念漂移下的对抗脆弱性:Android恶意软件检测的纵向研究

Ahmed Sabbah, Mohammed Kharma, Radi Jarrar, Samer Zein, David Mohaisen

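AI summary: This paper takes a longitudinal view of the adversarial vulnerability of Android malware detectors under temporal concept drift, analysing adversarial robustness over more than a decade of app data with static and dynamic feature representations. Models are evaluated under three deployment protocols, and several temporal linkage metrics are introduced to quantify how distribution shift affects robustness. The results show that adversarial robustness declines as the train-test gap grows, while attack success rises in a configuration-dependent way, underscoring the need for drift-aware robustness evaluation in long-lived adversarial environments.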

详情
Comments
42 pages, 4 tables, 10 figures
AI中文摘要

我们提出了一种纵向的、考虑漂移的对抗鲁棒性评估,使用从模拟器和真实设备执行中提取的静态和动态特征表示,跨越超过十年的Android应用。数据集按年度切片组织,并在三种模拟现实学习场景的部署协议下进行评估:(1)同年度训练和测试,(2)跨年度部署且不更新模型,(3)使用累积历史数据进行扩展窗口重训练。在多个分类器家族中,使用FGSM和SPSA在可行性约束下生成对抗样本。我们测量了干净性能、对抗准确率(AA)、攻击成功率(ASR),并引入了时序关联指标——RobustDrop、$\Delta$ASR和对抗放大因子(AAF)——以量化分布漂移与鲁棒性退化之间的关系。结果表明,在评估的基于迁移的特征空间设置下,时间分离与对抗鲁棒性降低相关。随着训练-测试间隔增加,干净准确率和对抗准确率下降,而攻击成功率呈现配置相关的增加,特别是在FGSM扰动和静态特征下。扩展窗口重训练可以缓解但无法消除在持续分布演化下的鲁棒性损失。这些发现表明,在评估智能检测系统在演化数据分布下的长期鲁棒性时,应考虑时间漂移,并强调了在长期对抗环境中需要漂移感知的鲁棒性评估框架。

英文摘要

We present a longitudinal, drift-aware evaluation of adversarial robustness across more than a decade of Android applications using static and dynamic feature representations extracted from emulator and real-device executions. The dataset is organized into yearly slices and evaluated under three deployment protocols that emulate realistic learning scenarios: (1) same-year training and testing, (2) cross-year deployment without model updates, and (3) expanding-window retraining with cumulative historical data. Across multiple classifier families, adversarial examples are generated using FGSM and SPSA under feasibility constraints. We measure clean performance, Adversarial Accuracy (AA), Attack Success Rate (ASR), and introduce temporal linkage metrics -- RobustDrop, $\Delta$ASR, and Adversarial Amplification Factor (AAF) -- to quantify the relationship between distribution shift and robustness degradation. Results show that temporal separation is associated with reduced adversarial robustness under the evaluated transfer-based feature-space setting. As the train-test gap increases, clean accuracy and adversarial accuracy decline, while attack success exhibits configuration-dependent increases, particularly under FGSM perturbations and static features. Expanding-window retraining mitigates, but does not eliminate, robustness loss under continued distributional evolution. These findings indicate that temporal drift should be considered when assessing the long-term robustness of intelligent detection systems under evolving data distributions and highlight the need for drift-aware robustness assessment frameworks in long-lived adversarial environments.
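The three deployment protocols can be pictured as different ways of pairing training years with a test year. The sketch below is one possible reading, not the authors' code: the function name is hypothetical, same-year evaluation is assumed to split data within a year, and cross-year deployment is assumed to test only on later years.

```python
def protocol_splits(years):
    """Return (train_years, test_year) pairs for each deployment protocol."""
    years = sorted(years)
    # (1) Same-year: train and test on (disjoint parts of) the same yearly slice.
    same_year = [([y], y) for y in years]
    # (2) Cross-year: train once on year t, deploy unchanged on every later year.
    cross_year = [([t], s) for t in years for s in years if s > t]
    # (3) Expanding window: retrain on all years seen so far, test on the next one.
    expanding = [(years[:i], years[i]) for i in range(1, len(years))]
    return {
        "same_year": same_year,
        "cross_year": cross_year,
        "expanding_window": expanding,
    }


# Hypothetical yearly slices.
splits = protocol_splits([2012, 2013, 2014])
```

Under these assumptions, the train-test gap for a cross-year pair is simply the test year minus the training year, which is the quantity the drift analysis varies.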

2605.23619 2026-05-25 eess.AS cs.SD

Frame-Aligned Fusion of Canary and WavLM for Non-Intrusive Intelligibility Prediction of Hearing-Aid-Processed Speech

Canary与WavLM的帧对齐融合用于助听器处理语音的非侵入式清晰度预测

Kazushi Nakazawa

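AI summary: This paper studies reference-free prediction of the intelligibility of hearing-aid-processed speech and proposes a frame-aligned fusion of two frozen pretrained speech encoders, Canary and WavLM. Comparing several fusion strategies, the author finds that passing WavLM through a learnable strided convolution and fusing it with Canary on the coarser Canary timeline before pooling performs best among the compared systems, reaching Eval RMSE 24.96 and Eval Corr 0.796. The analyses suggest that coarse local temporal correspondence before pooling is a useful inductive bias for this task.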

详情
Comments
7 pages, 2 figures
AI中文摘要

非侵入式清晰度预测估计听力受损听众对助听器处理语音的理解程度,无需干净参考。我们在第三届清晰度预测挑战赛中研究此任务,使用两个冻结的语音编码器Canary和WavLM。核心问题不仅在于是否应结合互补的预训练表示,还在于它们的交互应发生在何处。我们在共享的左右保留双耳框架下比较了单骨干基线、统一分数平均、池后融合、交叉注意力、帧对齐融合和反向对齐。在比较的系统中,最佳模型使用可学习的步进卷积对WavLM进行时间准备,并在池化前在较粗的Canary时间线上将其与Canary融合,达到Eval RMSE 24.96±0.06和Eval Corr 0.796±0.001。严重性、增强系统、层窗口和时间偏移分析表明,池化前的粗局部时间对应是该任务的有用归纳偏置。

英文摘要

Non-intrusive intelligibility prediction estimates how well hearing-impaired listeners understand hearing-aid-processed speech without a clean reference. We study this task in the 3rd Clarity Prediction Challenge using two frozen speech encoders, Canary and WavLM. The central question is not only whether complementary pretrained representations should be combined, but where their interaction should occur. We compare single-backbone baselines, uniform score averaging, pool-late fusion, cross-attention, frame-aligned fusion, and reverse alignment under a shared left/right-preserving binaural framework. Among the compared systems, the best model temporally prepares WavLM with a learnable strided convolution and fuses it with Canary on the coarser Canary timeline before pooling, reaching Eval RMSE 24.96$\pm$0.06 and Eval Corr 0.796$\pm$0.001. Severity, enhancement-system, layer-window, and temporal-shift analyses indicate that coarse local temporal correspondence before pooling is a useful inductive bias for this task.

2605.23618 2026-05-25 cs.CL

Benchmarking Google Embeddings 2 against Open-Source Models for Multilingual Dense Retrieval and RAG Systems

Google Embeddings 2 与开源模型在多语言稠密检索和 RAG 系统中的基准测试

Stefano Cirillo, Domenico Desiato, Giuseppe Polese, Giandomenico Solimando

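AI summary: This paper benchmarks Google Embeddings 2 (GE2) against five open-source models for multilingual dense retrieval and RAG systems. GE2 ranks first on every task but has high latency (231.6 ms median). mE5-L comes within 0.003 nDCG of GE2 on Italian at 31 ms, making it the better fit for latency-sensitive applications. All models saturate at 32-token chunks, and semantic chunking yields measurable gains only at 16 tokens.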

详情
Comments
9 pages, 2 figures, 5 tables. Text and evaluation code available at https://github.com/cciro94/GoogleEmbeddings2-benchmark
AI中文摘要

我们对 Google Embeddings (GE2) 进行了基准测试,这是一个由 Vertex-AI 托管的双编码器,具有 2048 令牌上下文和显式任务类型条件,与五个开源替代方案:BGE-M3、E5-large、Multilingual-E5-large (mE5-L)、LaBSE 和 Paraphrase-Multilingual-MPNet (mMPNet)。评估涵盖四个 BEIR 子集、一个合成意大利语 RAG 语料库、考虑三种策略下 5 种令牌大小的分块消融实验,以及在商品 CPU 硬件上的每查询延迟。GE2 在每个任务上排名第一,达到 BEIR 平均 nDCG@10 = 0.638 和 IT-RAG-Bench nDCG@10 = 0.282,但中位延迟为 231.6 毫秒,比最快的本地模型慢约 14 倍。mE5-L 在意大利语上以 31 毫秒的延迟达到与 GE2 相差 0.003 nDCG 的性能,使其成为在子 100 毫秒 SLA 下更优的选择。一个更惊人的发现涉及 LaBSE,尽管广泛部署于多语言场景,其在 BEIR 上的平均 nDCG@10 仅为 0.188,低于包括 mMPNet 在内的所有专用检索模型。分块实验表明,所有六个模型在我们的语料库上在 32 令牌分块时性能饱和,语义分块仅在 16 令牌时提供可衡量的增益。

英文摘要

We benchmark Google Embeddings 2 (GE2), a Vertex-AI-hosted bi-encoder with 2,048-token context and explicit task-type conditioning, against five open-source alternatives: BGE-M3, E5-large, Multilingual-E5-large (mE5-L), LaBSE, and Paraphrase-Multilingual-MPNet (mMPNet). Evaluation covers four BEIR subsets, a synthetic Italian RAG corpus, a chunking ablation over five chunk sizes (in tokens) and three strategies, and per-query latency on commodity CPU hardware. GE2 ranks first on every task, achieving BEIR avg. nDCG@10 = 0.638 and IT-RAG-Bench nDCG@10 = 0.282, but at 231.6 ms median latency it is roughly 14x slower than the fastest local models. mE5-L reaches within 0.003 nDCG of GE2 on Italian at 31 ms, making it the preferred option when sub-100 ms SLAs matter. A more striking finding concerns LaBSE, which, despite widespread multilingual deployment, scores 0.188 average nDCG@10 on BEIR, below every dedicated retrieval model including mMPNet. Chunking experiments show that all six models saturate at 32-token chunks on our corpus, with semantic chunking providing measurable gains only at 16 tokens.
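For readers unfamiliar with the headline metric, the following is a minimal sketch of nDCG@10 for a single query. It assumes the linear-gain form (gain equal to the relevance grade) and is not the paper's evaluation code; function names are illustrative.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k relevance grades."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, k=10, all_relevances=None):
    """nDCG@k: DCG of the system ranking divided by DCG of the ideal ranking.

    ranked_relevances: relevance grades in the order the system returned them.
    all_relevances: grades of every judged document for the query; defaults to
    the returned list when no other relevant documents exist.
    """
    pool = ranked_relevances if all_relevances is None else all_relevances
    ideal_dcg = dcg_at_k(sorted(pool, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # no relevant documents for this query
    return dcg_at_k(ranked_relevances, k) / ideal_dcg


perfect = ndcg_at_k([1, 0, 0])       # relevant document ranked first
second_place = ndcg_at_k([0, 1, 0])  # relevant document ranked second
```

A benchmark-level figure such as the BEIR average would then be the mean of this per-query value over queries and subsets.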

2605.23610 2026-05-25 cs.CV cs.AI

EM-Vid: Training-Free Entity-Centric Memory for Efficient and Consistent Multi-Shot Video Generation

EM-Vid:无需训练的以实体为中心的记忆,用于高效且一致的多镜头视频生成

Jente Vandersanden, Matheus Gadelha, Chun-Hao P. Huang, Hyeonho Jeong, Yulia Gryaditskaya

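AI summary: This paper proposes EM-Vid, a training-free entity-centric memory for efficient and consistent multi-shot video generation. It stores entity-indexed latent patches to separate persistent entity information from transient scene context, and combines sparse token conditioning with a structured multi-shot script format to cut computational cost and improve consistency. A budgeted memory update strategy and a noise-injection mechanism add fine-grained control over entity appearance and prevent leakage of irrelevant information.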

详情
AI中文摘要

多镜头视频生成需要在不同镜头间保持重复实体的一致外观,同时忠实于镜头特定的文本提示。最近的自回归方法重用先前生成的帧作为记忆。然而,全帧存储将持久实体信息与瞬态场景上下文纠缠在一起,导致无关信息泄漏和高计算成本。我们提出一种以实体为中心的记忆,形式为实体索引的潜在补丁库。我们引入与预训练模型兼容的稀疏令牌条件化,将自注意力限制在实体相关令牌上,降低计算成本。为此,我们引入一种结构化的多镜头脚本格式。我们还提出一种预算记忆更新策略,以维护紧凑且不断演化的记忆。最后,我们为实体表示配备噪声注入机制,实现细粒度外观控制,防止无关信息泄漏。我们的方法在保持主体一致性的同时,提高了提示遵循度和效率。

英文摘要

Multi-shot video generation requires maintaining a consistent appearance of recurring entities across shots while remaining faithful to shot-specific text prompts. Recent autoregressive methods reuse previously generated frames as memory. However, full-frame storage entangles persistent entity information with transient scene context, leading to irrelevant information leakage and high computational cost. We propose an entity-centric memory in the form of an entity-indexed bank of latent patches. We introduce sparse token conditioning compatible with pretrained models, restricting self-attention to entity-relevant tokens and reducing computational cost. To support this, we introduce a structured multi-shot script format. We additionally propose a budgeted memory update strategy to maintain a compact, evolving memory. Finally, we equip the entity representation with a noise-injection mechanism that enables fine-grained appearance control, preventing leakage of irrelevant information. Our method improves prompt adherence and efficiency while preserving subject consistency.

2605.23605 2026-05-25 cs.LG cs.AI cs.CL

DiLaDiff: Distilled Latent-Augmented Diffusion for Language Modeling

DiLaDiff: 蒸馏潜在增强扩散用于语言建模

Jean-Marie Lemercier, Tomas Geffner, Karsten Kreis, Morteza Mardani, Arash Vahdat, Ante Jukić

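AI summary: DiLaDiff is a variant of masked diffusion language models designed to ease the trade-off between sampling quality and throughput. It introduces a continuous semantic latent space learned by an auto-encoder, a latent diffusion model that learns the prior over the encoder distribution, and consistency distillation of that prior into a few-step generator. Even without distillation, DiLaDiff outperforms the masked diffusion baseline while significantly accelerating inference. Distillation further reduces overhead so that latent generation takes negligible time compared with discrete decoding.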

详情
AI中文摘要

扩散语言模型本质上无法捕捉解码令牌之间的相关性,导致采样质量与吞吐量之间存在严峻的权衡。为了解决这个问题,我们提出了DiLaDiff,一种掩码扩散语言模型的变体,包含三个组件:(1)具有语义能力的连续潜在空间,通过从现有掩码扩散语言模型微调的自编码器学习;(2)学习编码器分布先验的潜在扩散模型;(3)将学习到的先验蒸馏为少步潜在生成模型的一致性模型。我们表明,即使没有蒸馏,我们的潜在引导扩散模型在显著加速推理的同时也优于掩码扩散基线。一致性蒸馏进一步降低了连续扩散的计算开销,使得潜在生成的时间相对于离散解码可以忽略不计。

英文摘要

Diffusion language models intrinsically fail to capture correlations between decoded tokens, which leads to a harsh trade-off between sampling quality and throughput. To solve this issue, we propose DiLaDiff, a variant of masked diffusion language models with three components: (1) a continuous latent space with semantic capabilities, learned by an auto-encoder fine-tuned from an existing masked diffusion language model; (2) a latent diffusion model learning the prior over the encoder distribution; (3) a consistency model distilling the learned prior into a few-step latent generative model. We show that, even without distillation, our latent-guided diffusion model outperforms the masked diffusion baseline while significantly accelerating inference. Consistency distillation further lowers the computational overhead of continuous diffusion, such that the latent is generated in negligible time compared to discrete decoding.

2605.23604 2026-05-25 eess.AS cs.SD

Word-Level Modeling with Alignment-Aware Acoustic Fusion for Text-Assisted Intelligibility Prediction in Listeners with Hearing Loss

基于对齐感知声学融合的词级建模用于听力损失患者文本辅助可懂度预测

Kazushi Nakazawa

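AI summary: This paper studies text-assisted prediction of speech intelligibility for hearing-impaired listeners and proposes word-level modeling with alignment-aware acoustic fusion. A frozen Whisper encoder analyzes the degraded speech, a teacher-forced decoder conditions on the canonical transcript, and two branches are added: a word-aligned local acoustic branch and an utterance-level global acoustic branch for calibration. On the official evaluation set, joint fusion improves on the decoder baseline (RMSE 24.39 vs. 24.92, correlation 0.806 vs. 0.795), which the author attributes to prediction granularity and alignment-aware fusion.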

详情
Comments
7 pages, 2 figures
AI中文摘要

我们针对CPC3中听力受损者的文本辅助语音可懂度预测问题。尽管目标是句子级百分比,但它由参考词识别结果决定。我们将预测建模为参考条件下的词级正确性建模:冻结的Whisper编码器分析退化语音,教师强制解码器以规范转录为条件,句子可懂度通过对有效参考词的预测正确概率取平均得到。为了补充转录条件解码器状态,我们添加了一个基于字符级交叉注意力对齐的词对齐局部声学分支,以及一个用于校准的语句级全局声学分支。在官方评估集上,解码器基线获得RMSE 24.92和相关系数0.795,而联合融合将错误词F1提升至0.778,MCC 0.626,相关系数0.806,RMSE 24.39。使用Whisper medium的类似趋势表明,增益来自预测粒度和对齐感知融合。

英文摘要

We address text-assisted speech intelligibility prediction for hearing-impaired listeners in CPC3. Although the target is a sentence-level percentage, it is determined by reference-word recognition outcomes. We formulate prediction as reference-conditioned word-level correctness modeling: a frozen Whisper encoder analyzes degraded speech, a teacher-forced decoder conditions on the canonical transcript, and sentence intelligibility is obtained by averaging predicted correctness probabilities over valid reference words. To complement transcript-conditioned decoder states, we add a word-aligned local acoustic branch based on character-level cross-attention alignment and an utterance-level global acoustic branch for calibration. On the official evaluation set, the decoder baseline obtains RMSE 24.92 and correlation 0.795, while joint fusion improves to incorrect-word F1 0.778, MCC 0.626, correlation 0.806, and RMSE 24.39. A similar trend with Whisper medium suggests that the gain comes from prediction granularity and alignment-aware fusion.
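The aggregation step described above, averaging predicted correctness probabilities over valid reference words, can be sketched as follows. The function name, the validity mask, and the scaling to a 0-100 percentage are illustrative assumptions. The per-word probabilities would come from the model, which is not reproduced here.

```python
def sentence_intelligibility(word_probs, valid_mask):
    """Mean predicted correctness over valid reference words, as a percentage.

    word_probs: predicted probability that each reference word is recognized.
    valid_mask: True for reference words that count toward the score.
    """
    kept = [p for p, valid in zip(word_probs, valid_mask) if valid]
    if not kept:
        raise ValueError("no valid reference words in this sentence")
    return 100.0 * sum(kept) / len(kept)


# Hypothetical three-word reference; the last word is excluded from scoring.
score = sentence_intelligibility([0.9, 0.5, 0.1], [True, True, False])
```

Because the sentence score is a plain average, each word-level prediction contributes equally, which is what lets word-level supervision shape a sentence-level target.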

2605.23603 2026-05-25 cs.LG cond-mat.dis-nn cs.AI cs.NE

Preisach Attention: A Hysteretic Model of Sequential Memory

Preisach注意力:序列记忆的迟滞模型

Piotr Frydrych

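AI summary: This paper proposes the Preisach Attention Layer (PAL), a sequence modelling architecture based on the classical Preisach hysteresis operator. PAL replaces softmax attention with binary relay operators parameterised by learned activation and deactivation thresholds and maintains a stack of local extrema as its internal state. Under arbitrary-precision arithmetic, a single-layer PAL-Transformer of O(1) depth is Turing-complete, in contrast to the O(log n) depth required by standard hard-attention transformers. The function classes computable by PAL and by transformers are shown to be incomparable: PAL computes historical range statistics in O(1) layers, whereas transformers support random-access retrieval that PAL cannot perform without auxiliary state. PAL responds only to the sequence of local extrema, not to absolute token positions or temporal spacing.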

详情
Comments
24 pages, 2 tables, preprint
AI中文摘要

我们引入了Preisach注意力层(PAL),一种基于数学物理中经典Preisach迟滞算子的新型序列建模架构。PAL用由学习到的激活和去激活阈值参数化的二进制继电器算子替代了softmax注意力机制,并维护一个局部极值栈作为其内部状态。在任意精度算术下,具有O(1)深度的单层PAL-Transformer是图灵完备的,这可以通过模拟双栈下推自动机实现——而标准硬注意力变压器需要O(log n)深度。其次,我们证明了PAL和Transformer可计算的函数类是不可比的:PAL在O(1)层内计算历史范围统计,而Transformer需要O(log n)层;Transformer支持随机访问检索,而PAL在没有辅助状态的情况下无法执行。分离性质是率无关性——PAL仅响应局部极值序列,而不响应绝对标记位置或时间间隔。第三,我们证明了极值栈构成了所有率无关泛函的输入历史的最小充分统计量,提供了经典迟滞理论中擦除性质的形式类比。因此,PAL是一种适用于长情节记忆和弱位置依赖任务的高效架构,其总推理成本为O(n log n),而标准注意力为O(n^2)。

英文摘要

We introduce the Preisach Attention Layer (PAL), a novel sequence modelling architecture grounded in the classical Preisach hysteresis operator from mathematical physics. PAL replaces the softmax attention mechanism with a binary relay operator parameterised by learned activation and deactivation thresholds, maintaining a stack of local extrema as its internal state. First, we show that a single-layer PAL-Transformer with O(1) depth is Turing-complete under arbitrary precision arithmetic, achievable through simulation of a two-stack pushdown automaton -- in contrast to the O(log n) depth required by standard hard-attention transformers. Second, we prove that the function classes computable by PAL and by the transformer are incomparable: PAL computes historical range statistics in O(1) layers that require O(log n) layers for transformers, while transformers support random-access retrieval that PAL cannot perform without auxiliary state. The separating property is rate-independence -- PAL responds only to the sequence of local extrema, not to absolute token positions or temporal spacing. Third, we show that the extremum stack constitutes a minimal sufficient statistic of the input history for all rate-independent functionals, providing a formal analogue of the wiping property in classical hysteresis theory. PAL is thus an efficient architecture for tasks with long episodic memory and weak positional dependence, with O(n log n) total inference cost versus O(n^2) for standard attention.
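To make the relay operator and rate-independence concrete, here is a minimal sketch of a single classical relay (hysteron) with an activation threshold alpha and a deactivation threshold beta. It illustrates the textbook operator only, not the PAL layer itself. The class name, the 0/1 output convention, and the example signals are assumptions.

```python
class Relay:
    """Classical two-threshold relay: switches on at alpha, off at beta, else holds."""

    def __init__(self, alpha, beta, state=0):
        if beta >= alpha:
            raise ValueError("deactivation threshold must lie below activation threshold")
        self.alpha = alpha
        self.beta = beta
        self.state = state

    def step(self, x):
        if x >= self.alpha:
            self.state = 1
        elif x <= self.beta:
            self.state = 0
        # Between the thresholds the previous state is retained (memory).
        return self.state


def run(alpha, beta, signal):
    relay = Relay(alpha, beta)
    return [relay.step(x) for x in signal]


# The same sequence of local extrema (0.0, 0.7, 0.2, 0.5), sampled coarsely and finely.
fast = run(0.6, 0.3, [0.0, 0.7, 0.5, 0.2, 0.5])
slow = run(0.6, 0.3, [0.0, 0.35, 0.7, 0.6, 0.5, 0.35, 0.2, 0.35, 0.5])
```

Both signals visit the same extrema, so the relay ends in the same state regardless of how densely the path between extrema is sampled. This is the rate-independence property the abstract refers to.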

2605.23602 2026-05-25 cs.CV

GlowGS: Generative Semantic Feature Learning for 3D Gaussian Splatting in Nighttime Glow Scenes

GlowGS: 夜间发光场景中用于3D高斯溅射的生成式语义特征学习

Beibei Lin, Xiao Cao, Jingyuan Guo, Robby T. Tan

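AI summary: Existing 3D Gaussian Splatting (3DGS) methods render high-quality novel views in clear daytime scenes but perform poorly in nighttime glow regions, which lack structural features such as textures and edges. GlowGS combines a diffusion model with a Vision Foundation Model (VFM) around two ideas, semantic feature generation and novel-view semantic learning, to produce high-quality implicit structural cues and to optimize rendered views without ground truth. The authors report significant improvements over existing methods in nighttime glow scenes.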

详情
Comments
Accepted by CVPR Findings 2026
AI中文摘要

现有的3DGS方法在晴朗场景中能有效渲染高质量的新视图。然而,它们在夜间场景中表现不佳,特别是在发光区域,因为缺乏纹理和边缘等结构特征,而这些特征是基于溅射重建的关键线索。为了解决这个问题,我们利用扩散模型和视觉基础模型(VFM)来补偿缺失的结构线索。我们的方法包含两个关键的新思想:语义特征生成和新视图语义学习。首先,语义特征生成为新视图生成高质量的语义特征作为隐式结构线索。具体来说,扩散模型从训练视图中合成具有未知相机姿态的新视图,而VFM评估其质量。一旦识别出高质量的新视图,VFM提取鲁棒特征以构建语义特征库。其次,新视图语义学习使3DGS能够优化渲染的新视图,而无需真实标签。它通过从渲染的新视图中提取语义特征,在特征库中搜索最相似的特征,并最小化它们的距离来实现。这个过程施加了隐式结构约束,确保语义一致、无伪影的渲染视图。大量实验证明了我们的GlowGS在生成语义准确的3D视图方面的有效性,显示出比现有方法显著的改进。

英文摘要

Existing 3DGS methods effectively render high-quality novel views in clear-day scenes. However, they struggle with night scenes, particularly in glow regions, due to the lack of structural features such as textures and edges, which are key cues for splatting-based reconstruction. To address this problem, we leverage a diffusion model and a Vision Foundation Model (VFM) to compensate for missing structural cues. Our method consists of two key novel ideas: semantic feature generation and novel-view semantic learning. First, semantic feature generation produces high-quality semantic features as implicit structural cues for novel views. Specifically, a diffusion model synthesizes novel views with unknown camera poses from training views, while a VFM evaluates their quality. Once high-quality novel views are identified, the VFM extracts robust features to construct the semantic feature bank. Second, novel-view semantic learning enables 3DGS to optimize rendered novel views without requiring ground truth. It achieves this by extracting semantic features from a rendered novel view, searching the feature bank for the most similar features, and minimizing their distance. This process enforces implicit structural constraints, ensuring semantically coherent, artifact-free rendered views. Extensive experiments demonstrate the effectiveness of our GlowGS in generating semantically accurate 3D views, showing significant improvements over existing methods.
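The novel-view semantic learning step (extract a feature, find the most similar bank entry, minimize the distance) can be sketched as a nearest-neighbour lookup. The abstract does not state the similarity measure, so cosine similarity and a loss of one minus similarity are assumptions here, as are the function names and the tiny 2-D features.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_bank_loss(feature, bank):
    """Return (index of the most similar bank feature, 1 - cosine similarity to it)."""
    sims = [cosine_similarity(feature, entry) for entry in bank]
    best = max(range(len(bank)), key=sims.__getitem__)
    return best, 1.0 - sims[best]


# Hypothetical feature bank and a feature extracted from a rendered novel view.
bank = [[1.0, 0.0], [0.0, 1.0]]
index, loss = nearest_bank_loss([0.9, 0.1], bank)
```

In a real pipeline the feature would be differentiable with respect to the Gaussian parameters, so minimizing this loss would pull the rendered view toward the matched bank entry. That training loop is outside this sketch.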

2605.23597 2026-05-25 cs.CL cs.LG

Structure-Guided Entity Resolution: Fine-Tuning LLMs for Robust Name Matching in Complex Linguistic Contexts

结构引导的实体解析:微调大语言模型以实现复杂语言上下文中的鲁棒姓名匹配

Shivam Chourasia, Hitesh Kapoor, Nilesh Patil

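AI summary: This paper addresses person-name matching for entity resolution in linguistically and culturally complex settings. It proposes Structure-Guided Entity Resolution (SGER), a framework that fine-tunes an LLM through a two-phase curriculum: the model first learns to parse the grammatical and semantic structure of names, then is optimized for binary entity matching. On Indian identity data, SGER reaches 99.02% accuracy and an F1 of 0.994 on a held-out set of 50,000 real-world pairs, and the system is deployed in production at Dream11.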

详情
Comments
Accepted to ACL 2026. 8 pages, 1 figure, 2 tables
AI中文摘要

跨异构记录匹配人名是实体解析的核心挑战,尤其是在语言和文化复杂的环境中。命名惯例的差异、跨文字的不一致音译以及频繁的数据录入错误使得统一用户身份变得困难,而这对于了解你的客户(KYC)合规至关重要。虽然大语言模型在理解自然语言方面显示出潜力,但它们往往难以处理此类特定领域设置中存在的结构化歧义。本文介绍了结构引导实体解析(SGER),一种新颖的框架,通过两阶段课程微调大语言模型。模型首先被训练解析人名的语法和语义结构,然后针对二元实体匹配的下游任务进行优化。我们在印度身份数据的挑战性背景下评估SGER,这是全球语言最多样化和噪声最大的环境之一。SGER在包含50,000个真实世界对的保留测试集上达到了99.02%的准确率和0.994的F1分数,优于GPT-4o少样本提示和单阶段微调基线。该系统已完全部署在全球最大的梦幻体育平台Dream11的生产环境中,服务超过2.5亿用户。我们的结果表明,课程引导的训练能够在现实世界的多语言系统中实现大规模、高精度的实体解析。

英文摘要

Matching person names across heterogeneous records is a core challenge in entity resolution, especially within linguistically and culturally complex environments. Variations in naming conventions, inconsistent transliteration across scripts, and frequent data entry errors make it difficult to unify user identities, an essential requirement for Know Your Customer (KYC) compliance. While Large Language Models have shown promise in understanding natural language, they often struggle with the structured ambiguity present in such domain-specific settings. This paper introduces Structure-Guided Entity Resolution (SGER), a novel framework that fine-tunes an LLM through a two-phase curriculum. The model is first trained to parse the grammatical and semantic structure of personal names, then optimized for the downstream task of binary entity matching. We evaluate SGER in the challenging context of Indian identity data, one of the most linguistically diverse and noisy environments globally. SGER achieves 99.02% accuracy and an F1 of 0.994 on a held-out set of 50,000 real-world pairs, outperforming GPT-4o few-shot prompting and single-stage fine-tuning baselines. The system is fully deployed in production at Dream11, the world's largest fantasy sports platform, serving 250M+ users. Our results demonstrate that curriculum-guided training enables robust, high-precision entity resolution in real-world multilingual systems at scale.

2605.23592 2026-05-25 cs.AI

Solving the Aircraft Disassembly Scheduling Problem

解决飞机拆解调度问题

Charles Thomas, Pierre Schaus

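AI summary: This paper studies the scheduling of end-of-life aircraft disassembly, a problem with many tasks and varied constraints whose efficient solution matters for making sustainable dismantling profitable. It proposes two solution approaches, a Constraint Programming model and a Mixed-Integer Programming (MIP) model, and tests them on instances of up to 1450 tasks built from real operational data provided by an industrial partner.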

详情
AI中文摘要

拆解寿命终结的飞机是一项复杂的工程,对于可持续性而言是必要的,但为航空运输公司带来的利润空间很小。因此,拆解过程的高效调度对于确保流程的盈利能力和激励实践至关重要。这是一个涉及数千个任务和许多不同约束的大规模调度问题:提取计划重复使用的部件需要具有特定认证和设备的技师。提取操作可能受先后顺序关系约束。此外,在整个过程中必须保持飞机平衡。最后,飞机的某些位置空间有限,限制了可同时工作的技师数量。本文详细介绍了该问题,并提出了两种解决方法:约束规划模型和混合整数规划模型。这些模型在基于工业合作伙伴提供的真实运营数据、规模不同(最多1450个任务)的实例上进行了测试。

英文摘要

Dismantling aircraft reaching their end of life is a complex endeavour that is necessary in terms of sustainability but yields small income margins for air transport companies. An efficient scheduling of the disassembly procedure is thus crucial to ensure the profitability of the process and incentivize the practice. This is a large scheduling problem that involves thousands of tasks and many different constraints. Extracting parts that are destined to be reused requires technicians with specific certifications and equipment. Extraction operations might be subject to precedence relations. Furthermore, the aircraft must be kept balanced during the whole process. Finally, some of the locations of the aircraft have a limited space that caps the number of technicians able to work there concurrently. This article presents the problem in detail and proposes two approaches to solve it: a Constraint Programming model and a MIP model. The models are tested on instances of varying sizes involving up to 1450 tasks, which are based on real operational data provided by an industrial partner.

2605.23591 2026-05-25 stat.ML cond-mat.dis-nn cs.LG math.ST stat.TH

Asymmetric Scaling Laws from Sparse Features

基于稀疏特征的非对称缩放定律

John Sous, Michael Winer

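AI summary: This paper introduces a model of neural scaling laws under sparse activations in which test loss is often dominated by rare coordinates never observed in the training inputs, a bottleneck absent from dense models. It derives the asymptotic loss in the underparameterized and overparameterized regimes and finds a double-descent peak near the interpolation threshold, with two distinct scaling exponents whose gap is set by the degree of sparsity. It also analyzes gradient-descent dynamics, gives a scaling law for the probability that fixed-step gradient descent becomes unstable, and shows that the sparsity-induced effect persists under nonlinear activations.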

详情
AI中文摘要

我们引入了一个稀疏激活下的神经缩放定律模型。在该模型中,测试损失通常由训练输入中从未观察到的稀有坐标主导。这种机制引入了一个密集模型中不存在的新瓶颈。我们推导了欠参数化和过参数化区域的渐近总体损失,并表明损失在插值阈值附近出现双下降峰值——其中参数数量刚好足以拟合训练数据——导致损失曲线由两个不同的缩放指数控制:一个用于过参数化区域,一个用于欠参数化区域,其差距由稀疏程度决定。此外,我们推导了一个计算最优边界,在固定计算预算下倾向于增加数据集大小而非模型容量。我们还分析了梯度下降动力学,并确定了固定步长梯度下降变得不稳定的概率的缩放定律。我们进一步表明,稀疏诱导效应在非线性激活下仍然存在。

英文摘要

We introduce a model for neural scaling laws under sparse activations. In the model, test loss is often dominated by rare coordinates that are never observed in the training input. This mechanism induces a novel bottleneck absent from dense models. We derive the asymptotic population loss in both the underparameterized and overparameterized regimes, and show that the loss exhibits a double-descent peak near the interpolation threshold, where the number of parameters is just sufficient to fit the training data. The resulting loss curve is governed by two distinct scaling exponents, one for the overparameterized regime and one for the underparameterized regime, with a gap determined by the degree of sparsity. Additionally, we derive a compute-optimal frontier that favors increasing dataset size over model capacity under fixed compute budgets. We also analyze gradient-descent dynamics and identify a scaling law for the probability that fixed-step gradient descent becomes unstable. We further show that the sparsity-induced effect persists under nonlinear activations.

2605.23590 2026-05-25 cs.AI

Co-ReAct: Rubrics as Step-Level Collaborators for ReAct Agents

Co-ReAct:作为ReAct智能体逐步协作者的评分准则

Jiazheng Kang, Bowen Zhang, Zixin Song, Jiangwang Chen, Xiao Yang, Da Zhu, Guanjun Jiang

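AI summary: Co-ReAct is a rubric-guided action-selection framework for improving the decisions of ReAct agents on multi-step reasoning tasks. At each decision step it injects a rubric into the agent's context, specifying what the agent should target in evidence seeking, search, reasoning, or self-evaluation. With a dedicated rubric generator trained against multi-judge consensus rankings, Co-ReAct consistently improves over ReAct and test-time compute baselines on DeepResearchBench and SQA-CS-V2, and the generator can be dropped into those baselines without changing their decision mechanisms.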

详情
AI中文摘要

用于搜索密集型、多步推理任务的ReAct风格智能体主要依赖自身内部判断来决定寻求哪些证据、下一步采取哪个推理或行动步骤以及何时停止,常常产生浅显、冗余或目标不明确的轨迹。先前的工作探索了将评分准则作为外部质量信号,但现有用途主要是评估性的而非行动指导性的:评分准则通常作为训练时的奖励或完成输出的事后评估器,在深度研究场景中,它们往往是粗粒度的、报告级别的而非步骤级别的。我们引入了Co-ReAct,一个评分准则指导的行动选择框架,在推理过程中将评分准则作为步骤级指导。在每个决策步骤,Co-ReAct将评分准则注入智能体的上下文,以指导下一个“推理或行动”决策,明确智能体在证据寻求、搜索、推理或自我评估中应瞄准什么。为了使这种指导可靠,我们使用GRPO训练了一个专用的评分准则生成器。与先前的成对或二元偏好公式不同,我们的目标优化了针对多评判专家共识排名的列表式斯皮尔曼等级相关奖励,鼓励评分准则具有区分性而不仅仅是合理。在DeepResearchBench和SQA-CS-V2上,Co-ReAct在基于8B/14B开源和前沿闭源基础模型构建的搜索智能体上,一致优于ReAct和代表性的测试时计算基线。训练好的评分准则生成器还可以作为即插即用组件,在不改变底层决策机制的情况下改进这些基线。我们的代码公开在https://github.com/ZBWpro/Co-ReAct。

英文摘要

ReAct-style agents for search-intensive, multi-step reasoning tasks rely largely on their own internal judgment to decide what evidence to seek, which reasoning or action step to take next, and when to stop, often producing shallow, redundant, or poorly targeted trajectories. Prior work has explored rubrics as external quality signals, but existing uses are mostly evaluative rather than action-guiding: rubrics typically serve as training-time rewards or post-hoc evaluators of completed outputs, and in deep-research settings they are often coarse-grained and report-level rather than step-level. We introduce Co-ReAct, a rubric-guided action-selection framework that uses rubrics as step-level guidance during inference. At each decision step, Co-ReAct injects a rubric into the agent's context to guide the next Reason-or-Act decision, specifying what the agent should target in evidence seeking, search, reasoning, or self-evaluation. To make this guidance reliable, we train a dedicated rubric generator with GRPO. Unlike prior pairwise or binary preference formulations, our objective optimizes a list-wise Spearman rank-correlation reward against multi-judge expert consensus rankings, encouraging rubrics that are discriminative rather than merely plausible. On DeepResearchBench and SQA-CS-V2, Co-ReAct consistently improves over ReAct and representative test-time compute baselines across search agents built on both 8B/14B open-source and frontier closed-source base models. The trained rubric generator can also serve as a drop-in component that improves these baselines without changing their underlying decision mechanisms. Our code is publicly available at https://github.com/ZBWpro/Co-ReAct.
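The list-wise reward is described as a Spearman rank correlation against a consensus ranking. A minimal, dependency-free sketch of that correlation is given below. How the scores induced by a rubric are obtained, and how the reward enters GRPO, are not specified in the abstract and are left out. Function names and the tie handling (average ranks) are assumptions.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_reward(rubric_scores, consensus_scores):
    """Spearman correlation: Pearson correlation computed on the two rank vectors."""
    ra, rb = average_ranks(rubric_scores), average_ranks(consensus_scores)
    n = len(ra)
    mean_a, mean_b = sum(ra) / n, sum(rb) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb))
    var_a = sum((x - mean_a) ** 2 for x in ra)
    var_b = sum((y - mean_b) ** 2 for y in rb)
    if var_a == 0 or var_b == 0:
        return 0.0  # a constant scoring carries no ranking information
    return cov / math.sqrt(var_a * var_b)


# Hypothetical scores for four candidate responses.
reward = spearman_reward([1, 2, 3, 4], [1, 3, 2, 4])
```

A rubric that scores every candidate the same receives zero reward under this sketch, which matches the stated goal of favouring rubrics that discriminate between candidates.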

2605.23583 2026-05-25 cs.RO cs.LG

How Many Training Samples Are Needed for the Inverse Kinematics Solutions by Artificial Neural Networks

人工神经网络求解逆运动学需要多少训练样本

Dong-Won Lim

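AI summary: This paper studies how many training samples are needed when artificial neural networks are used to solve robot inverse kinematics. Feedforward networks are trained on datasets of varying size and assessed for accuracy, convergence, and generalization. Beyond 125 training samples, model efficiency no longer improves. The study offers practical guidance for sizing training data so as to balance computational cost and model accuracy in real robotic applications.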

详情
Comments
14 pages, 5 figures
AI中文摘要

逆运动学在机器人运动规划与控制中扮演关键角色。机器人操作臂的逆运动学求解可通过传统方法如几何法、代数法或雅可比法实现,但这些方法存在缺陷。人工神经网络因其泛化能力和计算效率,已成为近似逆运动学解的有前途的替代方案。该方法基本上只训练记录用于求解逆运动学问题的少量末端执行器样本。然而,一个基本问题仍然存在:多少训练样本足以实现可靠且准确的逆运动学预测?本研究探讨了训练数据集大小与基于ANN的逆运动学求解器精度之间的数学框架。使用关节型机器人操作臂,我们生成不同数量的关节位置对来训练前馈神经网络,并评估其精度、收敛性和泛化能力。结果表明,超过125个训练样本并未有助于提高模型效率,该效率通过采样大小上的近似精度可比度量来衡量,为数据效率提供了宝贵见解。这项工作为优化ANN解决方案的数据规模提供了实用指导,平衡了实际机器人应用中的计算成本和模型精度。

英文摘要

Inverse Kinematics (IK) plays a critical role in robotic motion planning and control. The IK of a robot manipulator can be solved by conventional geometric, algebraic, or Jacobian-based methods, each of which has drawbacks. Artificial Neural Networks (ANNs) have become a promising alternative for approximating IK solutions due to their generalization ability and computational efficiency. This approach trains the network on only a few recorded end-effector samples. However, a fundamental question remains: how many training samples are sufficient to achieve reliable and accurate IK predictions? This study investigates a mathematical framework relating the size of the training dataset to the accuracy of ANN-based IK solvers. Using an articulated robotic manipulator, we generate varying amounts of joint-position pairs to train feedforward neural networks and assess their accuracy, convergence, and generalization capability. The results reveal that using more than 125 training samples did not improve model efficiency, a comparable measure relating approximation accuracy to sample size, offering valuable insight into data efficiency. This work provides practical guidance for optimizing the data sizing of ANN solutions, balancing computational cost and model accuracy for real-world robotic applications.

2605.23580 2026-05-25 cs.CV

Calibration-Informative Region Selection for Online LiDAR--Camera Calibration in Agricultural Environments

农业环境中在线LiDAR-相机标定的标定信息区域选择

Rajitha de Silva, Grzegorz Cielniak

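AI summary: This paper studies the selection of calibration-informative regions for online LiDAR-camera calibration in agricultural environments. It proposes a support-map-driven approach that decouples calibration into four blocks: initial calibration, cross-modal residual extraction, support-map estimation, and support-aware refinement. Instantiated with the target-less calibration method MDPCalib and the dense matching model CMRNext, it builds a dense calibration support map that highlights where calibration evidence is consistently reliable. Experiments on the Bacchus Long-Term and KITTI datasets show that calibration evidence is spatially and semantically non-uniform. On KITTI, support-guided refinement improves translation accuracy while rotational gains remain limited.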

详情
Comments
Accepted to ICRA 2026 Workshop on Agricultural Robotics
AI中文摘要

可靠的多模态标定需要识别哪些观测真正约束外参,哪些主要引入噪声或模糊性。本文提出一种基于支持图的多模态标定方法,解耦四个功能模块:初始标定、跨模态残差提取、支持图估计和支持感知精化。我们利用MDPCalib(一种基于运动和深度点对应的无目标LiDAR-相机标定方法)和CMRNext(一种预测光流状图像平面残差的密集LiDAR-相机匹配模型)实例化该公式用于在线LiDAR-相机标定。关键贡献是密集标定支持图,它聚合对齐观测上的跨模态一致性,并突出标定证据持续可靠的区域。在Bacchus Long-Term (BLT)数据集和KITTI上,我们表明标定证据在空间和语义上不均匀,表明某些语义区域为标定提供更强的线索。在KITTI上,支持引导的精化改善了标定性能,平移精度更好,而旋转增益仍然有限。

英文摘要

Reliable multi-modal calibration requires identifying which observations truly constrain the extrinsic parameters and which ones mainly add noise or ambiguity. In this paper, we propose a support-map-driven approach to multi-modal calibration that decouples four functional blocks: initial calibration, cross-modal residual extraction, support-map estimation, and support-aware refinement. We instantiate this formulation for online LiDAR--camera calibration using MDPCalib, a target-less LiDAR--camera calibration method based on motion and deep point correspondences, and CMRNext, a dense LiDAR--camera matching model that predicts optical-flow-like image-plane residuals. The key contribution is a dense calibration support map that aggregates cross-modal agreement over aligned observations and highlights where calibration evidence is consistently reliable. Across the Bacchus Long-Term (BLT) dataset and KITTI, we show that calibration evidence is spatially and semantically non-uniform, indicating that some semantic regions provide stronger cues for calibration than others. On KITTI, support-guided refinement improves the calibration performance with better translation accuracy while rotational gains remain limited.

2605.23574 2026-05-25 cs.LG cs.SE

Push Your Agent: Measuring and Enforcing Quantitative Goal Persistence in Long-Horizon LLM Agents

推动你的智能体:在长周期LLM智能体中测量和强制实现定量目标持续性

Yuandao Cai, Yuzhang Zhu, Liyou Gao, Wensheng Tang, Shengchao Qin

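AI summary: This paper studies Quantitative Goal Persistence (QGP) in long-horizon language agents: whether an agent keeps working until an external verifier confirms enough distinct valid items. The authors introduce PushBench, a benchmark that directly measures repeated work, duplicate submissions, false completion, and progress drift. A state-tracking retrieval controller eliminates duplicate submissions, and a backlog-tracking work-unit controller succeeds in settings where standard controllers complete nothing. Frontier agents solve many 50-artifact tasks but drop to 3 of 9 successes per condition at 100 artifacts, showing that quantitative goals impose a reliability requirement distinct from local task competence.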

详情
AI中文摘要

长周期语言智能体可能做出许多看似合理的局部工具调用,但未能持续直到请求的数量实际完成。我们将这一差距研究为定量目标持续性(QGP):即智能体是否持续工作,直到外部验证器确认足够数量的不同有效项。PushBench将其转化为一个用于仓库-工件收集和验证器支持的工作单元的基准,因此重复工作、重复提交、虚假完成和进度漂移被直接测量,而不是隐藏在最终成功标志之后。在匹配的控制器比较中,状态追踪检索控制器达到69-78%的成功率,同时消除了重复提交;而积压追踪工作单元控制器在标准和完成门控控制器无法完成任何任务实例的设置中达到25-50%的成功率。使用Claude Code(Sonnet 4.6)和Codex CLI(gpt-5.4)的黑盒前沿智能体评估解决了许多50个工件的任务,但在100个工件时每条件仅剩3/9的成功率。结果表明,定量目标对不同于局部任务能力的可靠性要求提出了挑战:智能体必须维护已验证的进度,并仅在请求的工作完成时停止。

英文摘要

Long-horizon language agents can make many plausible local tool calls yet fail to persist until a requested count is actually complete. We study this gap as Quantitative Goal Persistence (QGP): whether an agent keeps working until an external verifier confirms enough distinct valid items. PushBench turns this into a benchmark for repository-artifact collection and verifier-backed work units, so repeated work, duplicate submissions, false completion, and progress drift are measured directly rather than hidden behind a final success flag. In matched controller comparisons, a state-tracking retrieval controller reaches 69-78% success while eliminating duplicate submissions, and a backlog-tracking work-unit controller reaches 25-50% success in settings where standard and completion-gated controllers complete no task instances. Black-box frontier-agent evaluations with Claude Code (Sonnet 4.6) and Codex CLI (gpt-5.4) solve many 50-artifact tasks but drop to 3 out of 9 successes per condition at 100 artifacts. The results show that quantitative goals stress a different reliability requirement from local task competence: agents must maintain verified progress and stop only when the requested work is complete.
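The notion of verified progress toward a requested count can be illustrated with a small bookkeeping class. This is a hypothetical sketch of the idea, not PushBench or any controller from the paper. The class name, the verifier callback, and the file-name example are assumptions.

```python
class ProgressTracker:
    """Counts distinct, verifier-approved items toward a requested target."""

    def __init__(self, target, verifier):
        self.target = target
        self.verifier = verifier  # external check: item -> bool
        self.accepted = set()
        self.duplicates = 0
        self.rejected = 0

    def submit(self, item):
        """Record a submission; return True only if it is new verified progress."""
        if item in self.accepted:
            self.duplicates += 1  # repeated work, no progress
            return False
        if not self.verifier(item):
            self.rejected += 1    # claimed item fails verification
            return False
        self.accepted.add(item)
        return True

    def done(self):
        """Stop condition: enough distinct valid items, not just enough attempts."""
        return len(self.accepted) >= self.target


# Hypothetical task: collect two distinct Python files.
tracker = ProgressTracker(target=2, verifier=lambda name: name.endswith(".py"))
results = [tracker.submit(x) for x in ["a.py", "a.py", "notes.txt"]]
stopped_early = tracker.done()
tracker.submit("b.py")
```

Tying the stop condition to the verified set, rather than to the number of tool calls or the agent's own claim, is what separates false completion from actual completion in this sketch.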

2605.23572 2026-05-25 cs.IR cs.AI cs.LG

HARNESS-LM: A Three-Phase Training Recipe for Harnessing SLMs in Sponsored Search Retrieval

HARNESS-LM: 一种在赞助搜索中利用小语言模型的三阶段训练方案

Vipul Gupta, Shikhar Mohan, Lakshya Kumar, Pranjal Chitale, Nikit Begwani, Amit Singh, Manik Varma

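AI summary: Balancing retrieval quality against response latency is a key challenge in sponsored search. This paper proposes HARNESS-LM (HLM), a three-phase training framework for transferring the retrieval capability of large retrievers into smaller, cheaper models via teacher fine-tuning, L2 alignment of query representations, and contrastive refinement. HLM recovers over 98% of the reference retriever's precision with up to 27x lower online query-encoder latency, and online A/B tests on Bing Ads show uplifts of +1% revenue, +0.6% impressions, and +0.4% clicks.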

Details
Comments
9 pages, 3 figures, 10 tables
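AI abstract (translated from Chinese)

In the competitive landscape of sponsored search, balancing retrieval quality against production latency is a key challenge. Although large retrieval models built on Small Language Models (SLMs), such as Qwen3-Embedding-4B/8B, set strong upper bounds on public benchmarks, deploying them in high-throughput, latency-sensitive environments remains impractical. This paper proposes HARNESS-LM (HLM), a three-phase training framework for transferring the capabilities of large-scale retrievers into compact, cost-efficient models. The method consists of: (1) training a high-performance reference ("teacher") retriever by fine-tuning a billion-parameter-scale SLM; (2) distilling knowledge into a student encoder with fewer than 600M parameters by aligning query representations through an L2 objective; and (3) applying a final contrastive refinement stage to optimize the student's retrieval performance. We also conduct a comprehensive empirical study of key design choices, including alignment objectives, embedding dimensionality, model scale, architecture, and optimization strategies, to identify the configurations that are most effective in production. On a real-world Bing Ads evaluation benchmark, HLM recovers over 98% of the reference retriever's precision across multiple settings, while achieving up to 27x lower online query-encoder latency and 20x higher throughput on NVIDIA A100 GPUs. With the deployed 190M-parameter model, online A/B testing on Bing Ads further shows uplifts of +1% in revenue, +0.6% in impressions, and +0.4% in clicks over the ensemble of retrievers currently running in production, highlighting the practical effectiveness of the HLM recipe in a real-world sponsored search setting.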

English abstract

In the competitive landscape of sponsored search, balancing retrieval quality with production latency is a critical challenge. While large retrieval models based on Small Language Models (SLMs) such as Qwen3-Embedding-4B/8B set strong upper bounds on public benchmarks, their deployment in high-throughput, latency-sensitive environments remains impractical. In this paper, we present HARNESS-LM (HLM), a three-phase training framework for transferring the capabilities of large-scale retrievers into compact, cost-efficient models. The approach comprises: (1) training a high-performance reference ("teacher") retriever by fine-tuning a billion-parameter-scale SLM; (2) aligning query representations via an L2 objective to distill knowledge into a sub-600M parameter student encoder; and (3) applying a final contrastive refinement stage to optimize the student for retrieval performance. We also present a comprehensive empirical study of key design choices, including alignment objectives, embedding dimensionality, model scale, architecture, and optimization strategies, to identify configurations that are most effective in production settings. On a real-world Bing Ads evaluation benchmark, HLM recovers over 98% of the reference retriever's precision across multiple settings, while delivering up to 27x lower online query-encoder latency and 20x higher throughput on NVIDIA A100 GPUs. Online A/B testing on Bing Ads further shows a +1% Revenue, +0.6% Impression, and +0.4% Click uplift over the current ensemble of retrievers running in production with the deployed 190M parameter model, clearly highlighting the practical efficacy of the HLM recipe in a real-world sponsored search setting.
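
The two student-side objectives named in the abstract, L2 alignment and contrastive refinement, can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' implementation: the abstract does not give the exact loss forms, so the squared-distance L2 loss, the InfoNCE formulation with in-batch negatives, the `temperature` value, and all function names are assumptions, as is the premise that student and teacher embeddings share one dimension.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def l2_alignment_loss(student: List[Vector], teacher: List[Vector]) -> float:
    """Mean squared L2 distance between student and teacher query embeddings.

    Assumed form of the phase-2 objective; both sides must share a dimension.
    """
    total = 0.0
    for s, t in zip(student, teacher):
        total += sum((si - ti) ** 2 for si, ti in zip(s, t))
    return total / len(student)


def _normalize(v: Vector) -> List[float]:
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def in_batch_contrastive_loss(
    queries: List[Vector], ads: List[Vector], temperature: float = 0.05
) -> float:
    """InfoNCE with in-batch negatives: ads[i] is the positive for queries[i].

    Assumed form of the phase-3 objective; the temperature is illustrative.
    """
    q = [_normalize(v) for v in queries]
    a = [_normalize(v) for v in ads]
    loss = 0.0
    for i, qi in enumerate(q):
        # Cosine similarity of query i to every ad in the batch, scaled.
        logits = [sum(x * y for x, y in zip(qi, aj)) / temperature for aj in a]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_denominator - logits[i]
    return loss / len(q)


# Toy check: correctly paired embeddings should score a lower contrastive loss.
aligned = in_batch_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = in_batch_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

In this sketch the alignment loss needs only queries and teacher embeddings, while the contrastive loss needs query-ad pairs, which is one plausible reason to run them as separate phases.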