arXivDaily: arXiv daily academic digest, updated Monday through Friday
2510.22186 2026-05-26 cs.LG cs.IT math.FA math.IT math.MG

Quantitative Bounds for Sorting-Based Permutation-Invariant Embeddings

Nadav Dym, Matthias Wellershoff, Efstratios Tsoukanis, Daniel Levy, Radu Balan

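AI summary: Studies permutation-invariant embeddings obtained by sorting independent one-dimensional projections, improves the upper bound and gives a lower bound on the embedding dimension required for injectivity, and estimates the bi-Lipschitz constants, with a distortion that scales quadratically in the number of points n and is independent of the dimension d.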

Comments Minor revision; 37 pages, 1 figure, 2 tables

Journal ref
IEEE Trans. Inf. Theory, vol. 72, no. 6, pp. 4297-4311, Jun. 2026

Abstract

We study permutation-invariant embeddings of $d$-dimensional point sets, which are defined by sorting $D$ independent one-dimensional projections of the input. Such embeddings arise in graph deep learning where outputs should be invariant to permutations of graph nodes. Previous work showed that for large enough $D$ and projections in general position, this mapping is injective, and moreover satisfies a bi-Lipschitz condition. However, two gaps remain: firstly, the optimal size $D$ required for injectivity is not yet known, and secondly, no estimates of the bi-Lipschitz constants of the mapping are known. In this paper, we make substantial progress in addressing both of these gaps. Regarding the first gap, we improve upon the best known upper bounds for the embedding dimension $D$ necessary for injectivity, and also provide a lower bound on the minimal injectivity dimension. Regarding the second gap, we construct matrices of projection vectors, so that the bi-Lipschitz distortion of the mapping depends quadratically on the number of points $n$, and is completely independent of the dimension $d$. We also show that for any choice of projection vectors, the distortion of the mapping will never be better than a bound proportional to the square root of $n$. Finally, we show that similar guarantees can be provided even when linear projections are applied to the mapping to reduce its dimension.
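
The construction described in the abstract, sorting each of $D$ one-dimensional projections of the point set, can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `sort_embedding` and the example projection vectors are assumptions chosen for readability, and nothing here claims injectivity or a particular bi-Lipschitz constant for this choice of projections.

```python
def sort_embedding(points, projections):
    """Embed a point set by sorting each one-dimensional projection.

    points:      list of n points, each a list of d floats
    projections: list of D projection vectors, each a list of d floats
    Returns D sorted lists of length n (one per projection vector).
    """
    embedding = []
    for w in projections:
        # Project every point onto w, then sort to discard the point order.
        column = sorted(sum(wi * xi for wi, xi in zip(w, x)) for x in points)
        embedding.append(column)
    return embedding


# Hypothetical example: n = 3 points in d = 2 dimensions, D = 3 projections.
points = [[1.0, 2.0], [3.0, -1.0], [0.0, 0.5]]
shuffled = [points[2], points[0], points[1]]
projections = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# The embedding does not depend on the order in which the points are listed.
same = sort_embedding(points, projections) == sort_embedding(shuffled, projections)
```

Because each projection is sorted independently, reordering the input points cannot change the output, which is the permutation invariance the abstract refers to.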

2510.22143 2026-05-26 cs.CL

Benchmarking and Learning Real-World Customer Service Dialogue

Tianhong Gao, Jundong Shen, Jiapeng Wang, Bei Shi, Ying Ju, Junfeng Yao, Huiyu Yu

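AI summary: To address the mismatch between industrial intelligent customer service and real-world dialogue requirements, proposes the OlaBench benchmark and the OlaMind model. By distilling expert reasoning patterns and applying staged reinforcement learning, OlaMind surpasses GPT-5.2 and Gemini 3 Pro on OlaBench and, in online A/B tests, raises issue resolution by 23.67% and lowers the human transfer rate by 6.6%.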

Abstract

Existing benchmarks and training pipelines for industrial intelligent customer service (ICS) remain misaligned with real-world dialogue requirements, overemphasizing verifiable task success while under-measuring subjective service quality and realistic failure modes, leaving a gap between offline gains and deployable dialogue behavior. We close this gap with a benchmark-to-optimization loop: we first introduce OlaBench, an ICS benchmark spanning retrieval-augmented generation, workflow-based systems, and agentic settings, which evaluates service capability, safety, and latency sensitivity; moreover, motivated by OlaBench results showing state-of-the-art LLMs still fall short, we propose OlaMind, which distills reusable reasoning patterns and service strategies from expert dialogues and applies staged exploration--exploitation reinforcement learning with instance-level rubric-aware guidance to improve model capability. OlaMind surpasses GPT-5.2 and Gemini 3 Pro on OlaBench (83.64 vs. 70.58/70.84) and, in online A/B tests, delivers an average +23.67% issue resolution and -6.6% human transfer rate versus the baseline, bridging offline gains to deployment. Together, OlaBench and OlaMind advance ICS systems toward more anthropomorphic, professional, and reliable deployment. The project page and evaluation are available at https://olamind-olabench.github.io.

2510.20390 2026-05-26 cs.RO

NeuralTouch: Neural Descriptors for Precise Sim-to-Real Tactile Robot Control

Yijiong Lin, Bowen Deng, Keju Pu, Chenghua Lu, Max Yang, Efi Psomopoulou, Nathan F. Lepora

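AI summary: Proposes NeuralTouch, a multimodal framework that combines Neural Descriptor Fields (NDF) with tactile sensing and uses a deep reinforcement learning policy to refine grasp poses from tactile feedback, enabling precise and generalizable robot manipulation.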

Journal ref
IEEE/ASME Transactions on Mechatronics, 2026

Abstract

Grasping accuracy is a critical prerequisite for precise object manipulation, often requiring careful alignment between the robot hand and object. Neural Descriptor Fields (NDF) offer a promising vision-based method to generate grasping poses that generalize across object categories. However, NDF alone can produce inaccurate poses due to imperfect camera calibration, incomplete point clouds, and object variability. Meanwhile, tactile sensing enables more precise contact, but existing approaches typically learn policies limited to simple, predefined contact geometries. In this work, we introduce NeuralTouch, a multimodal framework that integrates NDF and tactile sensing to enable accurate, generalizable grasping through gentle physical interaction. Our approach leverages NDF to implicitly represent the target contact geometry, from which a deep reinforcement learning (RL) policy is trained to refine the grasp using tactile feedback. This policy is conditioned on the neural descriptors and does not require explicit specification of contact types. We validate NeuralTouch through ablation studies in simulation and zero-shot transfer to real-world manipulation tasks, such as peg-out-in-hole and bottle lid opening, without additional fine-tuning. Results show that NeuralTouch significantly improves grasping accuracy and robustness over baseline methods, offering a general framework for precise, contact-rich robotic manipulation.

2510.15264 2026-05-26 cs.CV

DriveGen3D: Boosting Feed-Forward Driving Scene Generation with Efficient Video Diffusion

Weijie Wang, Jiagang Zhu, Zeyu Zhang, Xiaofeng Wang, Zheng Zhu, Guosheng Zhao, Chaojun Ni, Haoxiao Wang, Guan Huang, Xinze Chen, Yukun Zhou, Wenkang Qin, Duochao Shi, Haoyun Li, Yicheng Xiao, Donny Y. Chen, Jiwen Lu

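AI summary: Proposes the DriveGen3D framework, which combines a fast video diffusion transformer (FastDrive-DiT) with a feed-forward 3D reconstruction module (FastRecon3D) to generate high-quality, controllable dynamic 3D driving scenes, reaching state-of-the-art results on long-video generation and 3D consistency.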

Comments ICME 2026 Oral, Project Page: https://lhmd.top/drivegen3d

Abstract

We present DriveGen3D, a novel framework for generating high-quality and highly controllable dynamic 3D driving scenes that addresses critical limitations in existing methodologies. Current approaches to driving scene synthesis either suffer from prohibitive computational demands for extended temporal generation, focus exclusively on prolonged video synthesis without 3D representation, or restrict themselves to static single-scene reconstruction. Our work bridges this methodological gap by integrating accelerated long-term video generation with large-scale dynamic scene reconstruction through multimodal conditional control. DriveGen3D introduces a unified pipeline consisting of two specialized components: FastDrive-DiT, an efficient video diffusion transformer for high-resolution, temporally coherent video synthesis under text and Bird's-Eye-View (BEV) layout guidance; and FastRecon3D, a feed-forward module that rapidly builds 3D Gaussian representations across time, ensuring spatial-temporal consistency. DriveGen3D enables the generation of long driving videos (up to $800\times424$ at $12$ FPS) and corresponding 3D scenes, achieving state-of-the-art results while maintaining efficiency.

2510.14862 2026-05-26 cs.CV cs.DC

Multi-modal video data-pipelines for machine learning with minimal human supervision

Mihai-Cristian Pîrvu, Marius Leordeanu

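AI summary: Proposes a fully automated data pipeline that uses pre-trained expert models and procedural combinations of them to fuse multiple visual modalities without human supervision, then efficiently distills the PHG-MAE model so that a low-parameter (<1M) model is competitive with ~300M-parameter models. The result is deployed for real-time semantic segmentation and depth estimation.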

Abstract

The real world is inherently multi-modal. Our tools observe it and take digital snapshots, such as videos or sounds, but much of the information is lost. Similarly, for actions and information passing between humans, language is used as a written form of communication. Traditionally, machine learning models have been unimodal (e.g., rgb -> semantic or text -> sentiment_class). Recent trends go towards bi-modality, where images and text are learned together; however, to truly understand the world, we need to integrate all these independent modalities. In this work, we try to combine as many visual modalities as we can using little to no human supervision. To do this, we use pre-trained experts and procedural combinations between them on top of raw videos in a fully autonomous data pipeline, which we also open-source. We then make use of PHG-MAE, a model specifically designed to leverage multi-modal data. We show that this model, once efficiently distilled into a low-parameter (<1M) model, can achieve results competitive with models of ~300M parameters. We deploy this model and analyze the use case of real-time semantic segmentation from handheld devices or webcams on commodity hardware. Finally, we deploy other off-the-shelf models using the same framework, such as DPT for near real-time depth estimation.

2510.11296 2026-05-26 cs.CV cs.LG

$Δ\mathrm{Energy}$: Optimizing Energy Change During Vision-Language Alignment Improves both OOD Detection and OOD Generalization

Lin Zhu, Yifeng Yang, Xinbing Wang, Qinying Gu, Nanyang Ye

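AI summary: Proposes the ΔEnergy score, which uses the energy change observed when re-aligning the vision and language modalities to improve both out-of-distribution detection and out-of-distribution generalization, and builds a unified fine-tuning framework on lower-bound maximization of ΔEnergy (EBM).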

Comments Accepted by NeurIPS2025

Abstract

Recent approaches for vision-language models (VLMs) have shown remarkable success in achieving fast downstream adaptation. When applied to real-world downstream tasks, VLMs inevitably encounter both in-distribution (ID) and out-of-distribution (OOD) data. OOD datasets often include both covariate shifts (e.g., known classes with changes in image styles) and semantic shifts (e.g., test-time unseen classes). This highlights the importance of improving VLMs' generalization ability to covariate-shifted OOD data, while effectively detecting open-set semantic-shifted OOD classes. In this paper, inspired by the substantial energy change observed in closed-set data when re-aligning vision-language modalities (specifically by directly reducing the maximum cosine similarity to a low value), we introduce a novel OOD score, named ΔEnergy. ΔEnergy significantly outperforms the vanilla energy-based OOD score and provides a more reliable approach for OOD detection. Furthermore, ΔEnergy can simultaneously improve OOD generalization under covariate shifts, which is achieved by lower-bound maximization for ΔEnergy (termed EBM). EBM is theoretically proven to not only enhance OOD detection but also yield a domain-consistent Hessian, which serves as a strong indicator for OOD generalization. Based on this finding, we develop a unified fine-tuning framework that improves VLMs' robustness in both OOD generalization and OOD detection. Extensive experiments on challenging OOD detection and generalization benchmarks demonstrate the superiority of our method, outperforming recent approaches by 10% to 25% in AUROC.
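
As a rough illustration of the idea behind the score, the sketch below computes the standard energy score, $-T \log \sum_i \exp(s_i / T)$, over image-text cosine similarities and then measures how much it changes when the largest similarity is lowered. The abstract does not give the exact formula, so the sign convention, the temperature `temperature=0.01`, the replacement value `low_value=0.0`, the function names, and the toy similarity vectors are all assumptions made here rather than the paper's definition.

```python
import math


def energy(similarities, temperature=0.01):
    """Energy score: -T * log(sum_i exp(s_i / T))."""
    scaled = [s / temperature for s in similarities]
    m = max(scaled)  # subtract the max for numerical stability
    return -temperature * (m + math.log(sum(math.exp(z - m) for z in scaled)))


def delta_energy(similarities, low_value=0.0, temperature=0.01):
    """Energy change after lowering the maximum cosine similarity to low_value.

    Assumed form for illustration only; the paper's exact score may differ.
    """
    realigned = list(similarities)
    realigned[realigned.index(max(realigned))] = low_value
    return energy(realigned, temperature) - energy(similarities, temperature)


# Hypothetical cosine similarities between one image and three class prompts.
confident = [0.32, 0.18, 0.17]  # one class clearly dominates
ambiguous = [0.20, 0.19, 0.19]  # no class stands out

change_confident = delta_energy(confident)
change_ambiguous = delta_energy(ambiguous)
```

Under these assumptions the change is larger for the sample with a clearly dominant class than for the ambiguous one, which matches the abstract's observation that closed-set data shows a substantial energy change.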

2510.10921 2026-05-26 cs.CV cs.AI cs.LG

FG-CLIP 2: A Bilingual Fine-grained Vision-Language Alignment Model

Chunyu Xie, Bin Wang, Fanjing Kong, Jincheng Li, Dawei Liang, Ji Ao, Dawei Leng, Yuhui Yin

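AI summary: Proposes FG-CLIP 2, a bilingual vision-language model that achieves fine-grained alignment in English and Chinese through fine-grained supervision such as region-text matching, long-caption modeling, and a Textual Intra-modal Contrastive loss, with state-of-the-art results on 29 datasets.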

Comments Accepted in ICML2026

Abstract

Fine-grained vision-language understanding requires precise alignment between visual content and linguistic descriptions, a capability that remains limited in current models, particularly in non-English settings. While models like CLIP perform well on global alignment, they often struggle to capture fine-grained details in object attributes, spatial relations, and linguistic expressions, with limited support for bilingual comprehension. To address these challenges, we introduce FG-CLIP 2, a bilingual vision-language model designed to advance fine-grained alignment for both English and Chinese. Our approach leverages rich fine-grained supervision, including region-text matching and long-caption modeling, alongside multiple discriminative objectives. We further introduce the Textual Intra-modal Contrastive (TIC) loss to better distinguish semantically similar captions. Trained on a carefully curated mixture of large-scale English and Chinese data, including a newly released 12M Chinese region-text dataset, FG-CLIP 2 achieves powerful bilingual performance. To enable rigorous evaluation, we present a new benchmark for Chinese multimodal understanding, featuring long-caption retrieval and bounding box classification. Extensive experiments on 29 datasets across 8 tasks show that FG-CLIP 2 outperforms existing methods, achieving state-of-the-art results in both languages. We release the model, code, and benchmark to facilitate future research on bilingual fine-grained vision-language alignment.

2510.08558 2026-05-26 cs.AI cs.CL cs.IR cs.LG

Agent Learning via Early Experience

Kai Zhang, Xiangchao Chen, Bo Liu, Tianci Xue, Zeyi Liao, Zhihan Liu, Xiyao Wang, Yuting Ning, Zhaorun Chen, Xiaohan Fu, Jian Xie, Yuxuan Sun, Boyu Gou, Qi Qi, Zihang Meng, Jianwei Yang, Ning Zhang, Xian Li, Ashish Shah, Dat Huynh, Hengduo Li, Zi Yang, Sara Cao, Lawrence Jang, Shuyan Zhou, Jiacheng Zhu, Huan Sun, Jason Weston, Yu Su, Yifan Wu

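AI summary: Proposes the early experience paradigm, which uses interaction data generated by the agent's own actions (no reward signal required) and two strategies, implicit world modeling and self-reflection, to improve agent effectiveness and out-of-domain generalization across diverse environments.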

Comments ICML 2026

Abstract

A long-term goal of language agents is to learn and improve through their own experience, ultimately outperforming humans in complex, real-world tasks. However, training agents from experience data with reinforcement learning remains difficult in many environments, which either lack verifiable rewards (e.g., websites) or require inefficient long-horizon rollouts (e.g., multi-turn tool use). As a result, most current agents rely on supervised fine-tuning on expert data, which is challenging to scale and generalizes poorly. This limitation stems from the nature of expert demonstrations: they capture only a narrow range of scenarios, and expose the agent to limited environment diversity. We address this limitation with a middle-ground paradigm we call early experience: interaction data generated by the agent's own actions, where the resulting future states serve as supervision without reward signals. Within this paradigm, we study two strategies of using such data: (1) implicit world modeling, which uses collected states to ground the policy in environment dynamics; and (2) self-reflection, where the agent learns from its suboptimal actions to improve reasoning and decision-making. Evaluation across eight diverse environments and multiple model families shows that our approaches consistently improve effectiveness and out-of-domain generalization, highlighting the value of early experience. Moreover, in environments with verifiable rewards, our results provide promising signals that early experience offers a strong foundation for subsequent reinforcement learning, making it a practical bridge between imitation learning and fully experience-driven agents.

2510.08350 2026-05-26 cs.LG cs.AI

DeepEN: A Deep Reinforcement Learning Framework for Personalized Enteral Nutrition in Critical Care

Daniel Jason Tan, Jiayang Chen, Dilruk Perera, Kay Choong See, Mengling Feng

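AI summary: Proposes DeepEN, a framework that uses deep reinforcement learning to learn personalized enteral nutrition regimens from electronic health records. On the MIMIC-IV dataset, it lowers estimated mortality by an absolute 4.0 percentage points relative to clinician practice.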

Abstract

Objective: Enteral nutrition (EN) delivery in the ICU remains suboptimal due to limited personalization and uncertainty regarding appropriate calorie, protein, and fluid targets under dynamic metabolic demands. We introduce DeepEN, a reinforcement learning (RL) framework for personalized EN optimization using electronic health record data. Methods: DeepEN was trained on over 11,000 ICU patients from MIMIC-IV to generate 4-hourly, patient-specific caloric, protein, and fluid targets. The state representation incorporated demographics, comorbidities, vital signs, laboratory values, and recent interventions. A physiologically aligned reward framework balanced biomarker stability with long-term survival. Policy learning employed a dueling double deep Q-network with Conservative Q-Learning regularization to enable safe offline training. Results: DeepEN achieved the highest estimated policy value ($V^\pi = 9.48$) and the lowest calibrated mortality (18.8 ± 1.0%), representing a 4.0 percentage-point absolute reduction compared with clinician practice (22.8%). The policy also demonstrated superior metabolic stability, achieving the highest proportion of glucose, phosphate, and sodium values within target range. Furthermore, deviation from the DeepEN policy was independently associated with increased mortality and biomarker instability, whereas deviation from a random policy showed no such association. Interpretability analyses further indicated that recommendations were conditioned on physiologically relevant markers of organ function and metabolic status rather than static dosing heuristics. Conclusion: DeepEN demonstrates the feasibility of conservative offline RL for safe, individualized EN optimization, highlighting the potential of data-driven personalization to complement guideline-based approaches in critical care.
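
The Methods sentence names three standard ingredients: a dueling architecture, double Q-learning, and Conservative Q-Learning (CQL) regularization. The sketch below shows the textbook form of each for a single state with discrete actions. It is not DeepEN's implementation: the function names and toy numbers are hypothetical, and the neural network, the clinical state features, and the reward design are all omitted.

```python
import math


def dueling_q(value, advantages):
    """Dueling head: Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)."""
    mean_adv = sum(advantages) / len(advantages)
    return [value + a - mean_adv for a in advantages]


def double_dqn_target(reward, gamma, online_next_q, target_next_q, done):
    """Double DQN target: the online net picks the action, the target net scores it."""
    if done:
        return reward
    best = max(range(len(online_next_q)), key=lambda a: online_next_q[a])
    return reward + gamma * target_next_q[best]


def cql_penalty(q_values, logged_action):
    """CQL regularizer for one state: logsumexp_a Q(s, a) - Q(s, a_logged).

    It is non-negative and grows when actions absent from the logged data
    receive high Q-values, which discourages overestimating them offline.
    """
    m = max(q_values)
    logsumexp = m + math.log(sum(math.exp(q - m) for q in q_values))
    return logsumexp - q_values[logged_action]


# Toy numbers, for illustration only.
q = dueling_q(1.0, [0.5, -0.5, 0.0])
target = double_dqn_target(1.0, 0.5, [0.1, 0.9], [2.0, 4.0], done=False)
penalty = cql_penalty(q, logged_action=0)
```

In an offline training loop, a penalty of this kind would typically be added, with a weighting coefficient, to the squared error between the predicted Q-value and the double DQN target.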

2510.06672 2026-05-26 cs.LG

XRPO: Pushing the limits of GRPO with Targeted Exploration and Exploitation

Udbhav Bamba, Minghao Fang, Yifan Yu, Haizhong Zheng, Fan Lai

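AI summary: Proposes the XRPO framework, which uses an adaptive exploration allocator, an in-context seeding strategy, and a novelty-aware advantage mechanism to improve over GRPO by up to 4% pass@1 and 6% cons@32 on math and coding benchmarks while accelerating training convergence by up to 2.7x.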

Abstract

Reinforcement learning algorithms such as GRPO have driven recent advances in large language model (LLM) reasoning. While scaling the number of rollouts stabilizes training, existing approaches suffer from limited exploration on challenging prompts and leave informative feedback signals underexploited, due to context-independent rollout allocation across prompts (e.g., generating 16 rollouts per prompt) and heavy reliance on sparse rewards. This paper presents XRPO (eXplore-eXploit GRPO), a unified framework that recasts policy optimization through the principled lens of rollout exploration-exploitation. To enhance exploration, XRPO introduces a mathematically grounded rollout allocator that adaptively prioritizes prompts with higher potential for uncertainty reduction. It further addresses stagnation on zero-reward prompts through an in-context seeding strategy that injects curated exemplars, steering the model into more difficult reasoning trajectories. To strengthen exploitation, XRPO develops a group-relative, novelty-aware advantage sharpening mechanism that leverages sequence likelihoods to amplify low-probability yet correct responses, thereby extending the policy's reach beyond sparse rewards. Experiments across diverse math and coding benchmarks on both reasoning and non-reasoning models demonstrate that XRPO outperforms existing advances (e.g., GRPO and GSPO) by up to 4% pass@1 and 6% cons@32, while accelerating training convergence by up to 2.7x.
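
For context, the group-relative advantage that GRPO-style methods start from can be sketched as below. This shows only the baseline computation, assuming standardization by the group's mean and population standard deviation. XRPO's rollout allocator, in-context seeding, and novelty-aware sharpening are not reproduced, because the abstract does not specify them. The function name and the `eps` parameter are illustrative.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one prompt's group of rollouts.

    eps is an assumed stabilizer that avoids division by zero when every
    rollout in the group receives the same reward.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# One correct rollout out of four: the correct one gets a positive advantage.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 0.0])

# A prompt on which every rollout fails: all advantages are zero.
stalled = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
```

When every rollout for a prompt receives zero reward, all advantages are zero and the prompt contributes no reward-driven learning signal, which is the stagnation on zero-reward prompts that the abstract mentions.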

2510.06351 2026-05-26 cs.RO

A Formal gatekeeper Framework for Safe Dual Control with Active Exploration

Kaleb Ben Naveed, Devansh R. Agrawal, Dimitra Panagou

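AI summary: Proposes a framework that integrates robust planning with active exploration. A gatekeeper mechanism allows exploration only when it yields a verifiable improvement without compromising safety, balancing safety against uncertainty reduction.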

Comments Accepted at American Control Conference (ACC) 2026

Abstract

Planning safe trajectories under model uncertainty is a fundamental challenge. Robust planning ensures safety by considering worst-case realizations, yet ignores uncertainty reduction and leads to overly conservative behavior. Actively reducing uncertainty on-the-fly during a nominal mission defines the dual control problem. Most approaches address this by adding a weighted exploration term to the cost, tuned to trade off the nominal objective and uncertainty reduction, but without formal consideration of when exploration is beneficial. Moreover, safety is enforced in some methods but not in others. We propose a framework that integrates robust planning with active exploration under formal guarantees. The key innovation and contribution is that exploration is pursued only when it provides a verifiable improvement without compromising safety. To achieve this, we utilize our earlier work on gatekeeper as an architecture for safety verification, and extend it so that it generates both safe and informative trajectories that reduce uncertainty and the cost of the mission, or keep it within a user-defined budget. The methodology is evaluated via simulation case studies on the online dual control of a quadrotor under parametric uncertainty.
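
The acceptance rule stated in the abstract, explore only when a trajectory is verified safe and offers an improvement, can be illustrated with a small selection routine. Everything in this sketch is hypothetical: the `Candidate` fields, the `select_trajectory` function, and the scalar cost and uncertainty values stand in for the paper's verification and planning machinery, which the abstract does not detail.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    verified_safe: bool   # outcome of an assumed safety verification step
    mission_cost: float   # predicted mission cost if this trajectory is committed
    uncertainty: float    # predicted remaining model uncertainty


def select_trajectory(committed, candidates, cost_budget):
    """Keep the committed plan unless a safe candidate reduces uncertainty.

    A candidate is accepted only if it is verified safe, its cost either does
    not exceed the committed plan's cost or stays within cost_budget, and it
    lowers the remaining uncertainty.
    """
    best = committed
    for cand in candidates:
        if not cand.verified_safe:
            continue  # never trade safety for information
        cost_ok = (cand.mission_cost <= committed.mission_cost
                   or cand.mission_cost <= cost_budget)
        if cost_ok and cand.uncertainty < best.uncertainty:
            best = cand
    return best


robust = Candidate("robust", True, 10.0, 1.0)
options = [
    Candidate("unsafe-explore", False, 8.0, 0.2),
    Candidate("costly-explore", True, 20.0, 0.1),
    Candidate("informative", True, 11.0, 0.5),
]
chosen = select_trajectory(robust, options, cost_budget=12.0)
```

With these toy values the routine selects the `informative` candidate. If no candidate passes both checks, the committed robust plan is returned unchanged, so exploration is never forced.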

2510.05688 2026-05-26 cs.LG cs.AI

vAttention: Verified Sparse Attention

Aditya Desai, Kumar Krishna Agrawal, Shuo Yang, Alejandro Cuadron, Luis Gaspar Schroeder, Matei Zaharia, Joseph E. Gonzalez, Ion Stoica

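AI summary: Proposes vAttention, which unifies top-k and random sampling to give the first practical sparse attention mechanism with user-specified (ε, δ) guarantees on approximation accuracy, markedly improving the quality-efficiency trade-off.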

Journal ref
Proceedings of the International Conference on Learning Representations (ICLR), 2026

Abstract

State-of-the-art sparse attention methods for reducing decoding latency fall into two main categories: approximate top-$k$ (and its extension, top-$p$) and recently introduced sampling-based estimation. However, these approaches are fundamentally limited in their ability to approximate full attention: they fail to provide consistent approximations across heads and query vectors and, most critically, lack guarantees on approximation quality, limiting their practical deployment. We observe that top-$k$ and random sampling are complementary: top-$k$ performs well when attention scores are dominated by a few tokens, whereas random sampling provides better estimates when attention scores are relatively uniform. Building on this insight and leveraging the statistical guarantees of sampling, we introduce vAttention, the first practical sparse attention mechanism with user-specified $(ε, δ)$ guarantees on approximation accuracy (thus, "verified"). These guarantees make vAttention a compelling step toward practical, reliable deployment of sparse attention at scale. By unifying top-$k$ and sampling, vAttention outperforms both individually, delivering a superior quality-efficiency trade-off. Our experiments show that vAttention significantly improves the quality of sparse attention (e.g., $\sim$4.5 percentage points for Llama 3.1 8B Instruct and DeepSeek-R1-Distill-Llama-8B on RULER-HARD), and effectively bridges the gap between full and sparse attention (e.g., across datasets, it matches full model quality with up to 20x sparsity). We also demonstrate that it can be deployed in reasoning scenarios to achieve fast decoding without compromising model quality (e.g., vAttention achieves full model quality on AIME2024 at 10x sparsity with up to 32K token generations). Code: https://github.com/skylight-org/sparse-attention-hub. Webpage: https://sky-light.eecs.berkeley.edu.
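
The complementarity of top-$k$ and random sampling can be illustrated with a toy single-query estimator: the $k$ largest scores are handled exactly, and the remaining tail is estimated from a uniform sample that is rescaled to the tail size. This is a sketch under assumptions, not vAttention itself. It does not implement the $(ε, δ)$ guarantee or any rule for choosing the sample size, the function names are made up, and it computes every score up front, which a practical sparse kernel would avoid.

```python
import math
import random


def full_attention(scores, values):
    """Exact softmax-weighted average of the value vectors."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]


def sparse_attention(scores, values, k, num_samples, rng):
    """Top-k keys exactly, plus a rescaled uniform sample of the rest (k >= 1)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top, rest = order[:k], order[k:]
    sampled = rng.sample(rest, min(num_samples, len(rest)))
    # Each sampled key stands in for len(rest) / len(sampled) tail keys.
    scale = len(rest) / len(sampled) if sampled else 0.0
    m = max(scores)  # shift for numerical stability
    dim = len(values[0])
    numerator = [0.0] * dim
    denominator = 0.0
    for idx, mult in [(i, 1.0) for i in top] + [(i, scale) for i in sampled]:
        w = mult * math.exp(scores[idx] - m)
        denominator += w
        for j in range(dim):
            numerator[j] += w * values[idx][j]
    return [x / denominator for x in numerator]


# Toy data: 6 keys with 2-dimensional values.
scores = [2.0, 0.1, 0.0, -0.3, 1.5, 0.2]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5], [2.0, -1.0], [0.0, 0.0]]
estimate = sparse_attention(scores, values, k=2, num_samples=2, rng=random.Random(0))
```

When the sample covers the entire tail, the estimate coincides with full attention up to floating-point rounding. With fewer samples it trades accuracy for fewer key-value reads.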

2510.03827 2026-05-26 cs.CV cs.RO

LIBERO-PRO: Towards Robust and Fair Evaluation of Vision-Language-Action Models Beyond Memorization

Xueyang Zhou, Yangming Xu, Guiyao Tie, Yongchao Chen, Guowen Zhang, Duanfeng Chu, Pan Zhou, Lichao Sun

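AI summary: To address memorization bias in LIBERO benchmark evaluation, proposes LIBERO-PRO, an extended benchmark that applies reasonable perturbations along four dimensions (manipulated objects, initial states, task instructions, and environments). It reveals that the performance of existing VLA models collapses from over 90% to 0.0% and calls for robust evaluation methods.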

Comments 10 pages, 7 figures, 0 tables

Abstract

LIBERO has emerged as a widely adopted benchmark for evaluating Vision-Language-Action (VLA) models; however, its current training and evaluation settings are problematic, often leading to inflated performance estimates and preventing fair model comparison. To address these issues, we introduce LIBERO-PRO, an extended LIBERO benchmark that systematically evaluates model performance under reasonable perturbations across four dimensions: manipulated objects, initial states, task instructions, and environments. Experimental results reveal that, although existing models achieve over 90% accuracy under the standard LIBERO evaluation, their performance collapses to 0.0% under our generalized setting. Crucially, this discrepancy exposes the models' reliance on rote memorization of action sequences and environment layouts from the training set, rather than genuine task understanding or environmental perception. For instance, models persist in executing grasping actions when the target object is replaced with irrelevant items, and their outputs remain unchanged even when given corrupted instructions or messy tokens. These findings expose severe flaws in current evaluation practices, and we call on the community to abandon misleading methodologies in favor of robust assessments of model generalization and comprehension. Our code is available at: https://github.com/Zxy-MLlab/LIBERO-PRO.

2510.02837 2026-05-26 cs.AI cs.CL

Beyond the Final Answer: Evaluating the Reasoning Trajectories of Tool-Augmented Agents

Wonjoong Kim, Sangwu Park, Yeonjun In, Sein Kim, Dongha Lee, Chanyoung Park

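AI summary: For tool-augmented LLMs, proposes TRACE, a reference-free framework that uses an evidence bank to evaluate reasoning trajectories along multiple dimensions (efficiency, hallucination, and adaptivity), and validates it with a meta-evaluation dataset.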

Comments International Conference on Machine Learning (ICML) 2026

Abstract

Although recent tool-augmented benchmarks involve complex requests, evaluation remains limited to answer matching, neglecting critical trajectory aspects like efficiency, hallucination, and adaptivity. The most straightforward method for evaluation is to compare an agent's trajectory with the ground truth, but annotating all valid ground-truth trajectories is prohibitively expensive. To this end, we introduce TRACE, a reference-free framework for the multi-dimensional evaluation of tool-augmented LLMs. By incorporating an evidence bank that accumulates knowledge from preceding steps, TRACE assesses an agent's reasoning trajectory effectively. To validate our framework, we develop a new meta-evaluation dataset with diverse and flawed trajectories, each labeled with multi-faceted performance scores. Our results confirm that TRACE accurately evaluates complex trajectories even with small open-source LLMs. Furthermore, we apply our method to evaluate the trajectories that agents produce while solving tool-augmented tasks, presenting previously unreported observations and their corresponding insights.

2510.02361 2026-05-26 cs.CL cs.AI

ChunkLLM: A Lightweight Pluggable Framework for Accelerating LLMs Inference

ChunkLLM: 一种轻量级可插拔的LLM推理加速框架

Haojie Ouyang, Jianwei Lv, Lei Ren, Chen Wei, Xiaojie Wang, Fangxiang Feng

AI总结 针对Transformer自注意力二次复杂度导致的推理效率低下问题,提出ChunkLLM框架,通过QK适配器和块适配器实现块选择与压缩,在保持性能的同时显著加速推理。

详情
AI中文摘要

基于Transformer的大模型在自然语言处理和计算机视觉中表现出色,但由于自注意力对输入令牌的二次复杂度,面临严重的计算效率低下问题。最近,研究人员提出了一系列基于块选择和压缩的方法来缓解这一问题,但它们要么存在语义不完整的问题,要么训练-推理效率低下。为了全面解决这些挑战,我们提出了ChunkLLM,一个轻量级且可插拔的训练框架。具体来说,我们引入了两个组件:QK适配器(Q-Adapter和K-Adapter)和块适配器。前者附加在每个Transformer层上,兼具特征压缩和块注意力获取的双重目的。后者在模型的最底层运行,通过利用上下文语义信息来检测块边界。在训练阶段,骨干网络的参数保持冻结,仅QK适配器和块适配器进行训练。值得注意的是,我们设计了一种注意力蒸馏方法来训练QK适配器,这提高了关键块的召回率。在推理阶段,仅当当前令牌被检测为块边界时才触发块选择,从而加速模型推理。我们在涵盖多个任务的多种长文本和短文本基准数据集上进行了实验评估。ChunkLLM不仅在短文本基准上取得了可比的性能,而且在长上下文基准上保持了98.64%的性能,同时保持了48.58%的键值缓存保留率。特别地,在处理120K长文本时,ChunkLLM相比原始Transformer实现了最大4.48倍的加速。

英文摘要

Transformer-based large models excel in natural language processing and computer vision, but face severe computational inefficiencies because self-attention scales quadratically with the number of input tokens. Recently, researchers have proposed a series of methods based on block selection and compression to alleviate this problem, but these methods suffer from either semantic incompleteness or poor training and inference efficiency. To comprehensively address these challenges, we propose ChunkLLM, a lightweight and pluggable training framework. Specifically, we introduce two components: QK Adapter (Q-Adapter and K-Adapter) and Chunk Adapter. The former is attached to each Transformer layer and serves the dual purposes of feature compression and chunk attention acquisition. The latter operates at the bottommost layer of the model and detects chunk boundaries by leveraging contextual semantic information. During the training phase, the parameters of the backbone remain frozen, and only the QK Adapter and Chunk Adapter are trained. Notably, we design an attention distillation method for training the QK Adapter, which enhances the recall rate of key chunks. During the inference phase, chunk selection is triggered only when the current token is detected as a chunk boundary, thereby accelerating model inference. Experimental evaluations are conducted on a diverse set of long-text and short-text benchmark datasets spanning multiple tasks. ChunkLLM not only attains comparable performance on short-text benchmarks but also maintains 98.64% of the performance on long-context benchmarks while preserving a 48.58% key-value cache retention rate. In particular, ChunkLLM attains a maximum speedup of 4.48x over the vanilla Transformer when processing 120K-token long texts.

2510.02327 2026-05-26 cs.CL cs.AI eess.AS

KAME: Tandem Architecture for Enhancing Knowledge in Real-Time Speech-to-Speech Conversational AI

KAME:用于增强实时语音到语音对话AI知识的串联架构

So Kuroki, Yotaro Kubo, Takuya Akiba, Yujin Tang

AI总结 提出一种混合架构,通过实时注入后端LLM的文本响应来增强S2S模型的知识,在保持低延迟的同时提升响应正确性。

Comments Published at IEEE ICASSP 2026

详情
AI中文摘要

实时语音到语音(S2S)模型擅长生成自然、低延迟的对话响应,但往往缺乏深层知识和语义理解。相反,结合自动语音识别、基于文本的大语言模型(LLM)和文本到语音合成的级联系统提供了优越的知识表示,但代价是高延迟,这破坏了自然交互的流畅性。本文介绍了一种新颖的混合架构,弥合了这两种范式之间的差距。我们的框架通过S2S Transformer处理用户语音以实现即时响应,同时将查询并发地传递给强大的后端LLM。然后,LLM的基于文本的响应被实时注入以指导S2S模型的语音生成,有效地为其输出注入丰富的知识,而无需承受级联系统的全部延迟惩罚。我们使用MT-Bench基准的语音合成变体(包含多轮问答会话)评估了我们的方法。结果表明,我们的系统在响应正确性上显著优于基线S2S模型,接近级联系统的水平,同时保持了与基线相当的延迟。

英文摘要

Real-time speech-to-speech (S2S) models excel at generating natural, low-latency conversational responses but often lack deep knowledge and semantic understanding. Conversely, cascaded systems combining automatic speech recognition, a text-based Large Language Model (LLM), and text-to-speech synthesis offer superior knowledge representation at the cost of high latency, which disrupts the flow of natural interaction. This paper introduces a novel hybrid architecture that bridges the gap between these two paradigms. Our framework processes user speech through an S2S transformer for immediate responsiveness while concurrently relaying the query to a powerful back-end LLM. The LLM's text-based response is then injected in real time to guide the S2S model's speech generation, effectively infusing its output with rich knowledge without the full latency penalty of a cascaded system. We evaluated our method using a speech-synthesized variant of the MT-Bench benchmark that consists of multi-turn question-answering sessions. The results demonstrate that our system substantially outperforms a baseline S2S model in response correctness, approaching that of a cascaded system, while maintaining a latency on par with the baseline.

2510.01348 2026-05-26 cs.RO

Kilometer-Scale GNSS-Denied UAV Navigation via Heightmap Gradients: A Winning System from the SPRIN-D Challenge

基于高程图梯度的千米级GNSS拒止无人机导航:SPRIN-D挑战优胜系统

Michal Werner, David Čapek, Tomáš Musil, Ondřej Franěk, Tomáš Báča, Martin Saska

AI总结 针对GNSS拒止环境下无人机长距离飞行中的漂移问题,提出一种利用高程图梯度模板匹配进行漂移校正的轻量级定位方法,并在SPRIN-D挑战中实现9公里航点导航。

Comments 8 pages

详情
AI中文摘要

在GNSS拒止环境中实现可靠的长距离无人机飞行具有挑战性:集成里程计会导致漂移,在未探索区域无法进行闭环检测,且嵌入式平台计算能力有限。我们提出了一套完全机载的无人机系统,专为SPRIN-D Funke Fully Autonomous Flight Challenge开发,该挑战要求在没有GNSS或先验密集地图的情况下,在低于25米AGL(离地高度)的高度完成9公里长距离航点导航。该系统集成了感知、建图、规划和控制,并采用一种轻量级漂移校正方法,通过梯度模板匹配将激光雷达导出的局部高程图与先验地理数据高程图进行匹配,并在聚类粒子滤波器中融合里程计证据。在竞赛部署中,该系统在城区、森林和开阔地形中执行了千米级飞行,相对于原始里程计显著减少了漂移,同时在仅CPU硬件上实时运行。我们描述了系统架构、定位流程和竞赛评估,并报告了现场部署中的实际经验,为GNSS拒止无人机自主性的设计提供了参考。

英文摘要

Reliable long-range flight of unmanned aerial vehicles (UAVs) in GNSS-denied environments is challenging: integrating odometry leads to drift, loop closures are unavailable in previously unseen areas, and embedded platforms provide limited computational power. We present a fully onboard UAV system developed for the SPRIN-D Funke Fully Autonomous Flight Challenge, which required 9 km long-range waypoint navigation below 25 m AGL (Above Ground Level) without GNSS or prior dense mapping. The system integrates perception, mapping, planning, and control with a lightweight drift-correction method that matches LiDAR-derived local heightmaps to a prior geo-data heightmap via gradient-template matching and fuses the evidence with odometry in a clustered particle filter. Deployed during the competition, the system executed kilometer-scale flights across urban, forest, and open-field terrain and reduced drift substantially relative to raw odometry, while running in real time on CPU-only hardware. We describe the system architecture, the localization pipeline, and the competition evaluation, and we report practical insights from field deployment that inform the design of GNSS-denied UAV autonomy.
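
The gradient-template matching step mentioned in the abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration, not the authors' pipeline: heightmaps are small integer grids, gradients are forward differences, the match score is a sum of squared gradient differences, and the search is exhaustive. The function names `gradients` and `match_offset` are hypothetical, and the particle-filter fusion is not shown.

```python
def gradients(heightmap):
    """Forward-difference gradients of a 2D heightmap (list of rows).

    Both gradient grids are trimmed to (rows - 1) x (cols - 1).
    """
    rows, cols = len(heightmap), len(heightmap[0])
    gx = [[heightmap[r][c + 1] - heightmap[r][c] for c in range(cols - 1)]
          for r in range(rows - 1)]
    gy = [[heightmap[r + 1][c] - heightmap[r][c] for c in range(cols - 1)]
          for r in range(rows - 1)]
    return gx, gy


def match_offset(prior, local):
    """Return the (row, col) offset in `prior` whose gradients best match `local`."""
    pgx, pgy = gradients(prior)
    lgx, lgy = gradients(local)
    lr, lc = len(lgx), len(lgx[0])
    best = None
    for r0 in range(len(pgx) - lr + 1):
        for c0 in range(len(pgx[0]) - lc + 1):
            # Sum of squared differences between gradient templates (assumed score).
            cost = sum(
                (pgx[r0 + r][c0 + c] - lgx[r][c]) ** 2
                + (pgy[r0 + r][c0 + c] - lgy[r][c]) ** 2
                for r in range(lr) for c in range(lc)
            )
            if best is None or cost < best[0]:
                best = (cost, r0, c0)
    return best[1], best[2]


# Synthetic prior map and a local patch cut at offset (2, 1) with a +10 height bias.
prior = [[3 * r * r + c ** 3 + r * c for c in range(6)] for r in range(6)]
local = [[prior[2 + r][1 + c] + 10 for c in range(3)] for r in range(3)]
offset = match_offset(prior, local)
```

Matching on gradients rather than raw heights makes the score insensitive to a constant altitude bias, which is why the patch is still located despite the +10 offset in this toy setup.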

2509.24978 2026-05-26 cs.AI cond-mat.quant-gas quant-ph

Agentic Exploration of Physics Models

物理模型的智能体探索

Maximilian Nägele, Florian Marquardt

AI总结 提出 SciExplorer 智能体,利用大语言模型工具使用能力,无需领域特定蓝图即可探索未知物理系统,通过实验和观测恢复运动方程和哈密顿量。

详情
AI中文摘要

科学发现的过程依赖于观察、分析和假设生成的相互作用。机器学习正越来越多地被用于处理这一过程的各个方面。然而,完全自动化发现未知系统定律所需的启发式迭代循环(通过实验和分析探索系统)仍然是一个开放挑战,且不能针对特定任务进行定制。在这里,我们介绍了 SciExplorer,一个利用大语言模型工具使用能力来探索系统而无需任何领域特定蓝图的智能体,并将其应用于最初对智能体未知的物理系统。我们在涵盖机械动力学系统、波演化和量子多体物理的广泛模型上测试了 SciExplorer。尽管使用了最小工具集(主要基于代码执行),我们观察到在从观测动力学恢复运动方程和从期望值推断哈密顿量等任务上表现出色。该设置的有效性为在其他领域进行类似的科学探索打开了大门,无需微调或任务特定指令。

英文摘要

The process of scientific discovery relies on an interplay of observations, analysis, and hypothesis generation. Machine learning is increasingly being adopted to address individual aspects of this process. However, it remains an open challenge to fully automate the heuristic, iterative loop required to discover the laws of an unknown system by exploring it through experiments and analysis, without tailoring the approach to the specifics of a given task. Here, we introduce SciExplorer, an agent that leverages large language model tool-use capabilities to enable exploration of systems without any domain-specific blueprints, and apply it to physical systems that are initially unknown to the agent. We test SciExplorer on a broad set of models spanning mechanical dynamical systems, wave evolution, and quantum many-body physics. Despite using a minimal set of tools, primarily based on code execution, we observe impressive performance on tasks such as recovering equations of motion from observed dynamics and inferring Hamiltonians from expectation values. The demonstrated effectiveness of this setup opens the door towards similar scientific exploration in other domains, without the need for finetuning or task-specific instructions.

2509.24621 2026-05-26 cs.CV

FreeRet: MLLMs as Training-Free Retrievers

FreeRet: 无需训练的多模态大语言模型检索器

Yuhan Zhu, Xiangyu Zeng, Chenting Wang, Xinhao Li, Chunxu Liu, Yicheng Xu, Ziang Yan, Yi Wang, Limin Wang

AI总结 提出FreeRet框架,将现成的多模态大语言模型转化为无需额外训练的两阶段检索器,通过语义嵌入和重排序提升检索性能。

Comments ICML 2026

详情
AI中文摘要

多模态大语言模型正成为混合模态检索的通用基础。然而,它们通常需要大量的后期训练才能转化为用于检索的对比编码器。本文提出:现成的多模态大语言模型能否在无需额外训练的情况下作为强大的检索器?我们提出了FreeRet,一个即插即用的框架,可将任何多模态大语言模型转化为两阶段检索器。FreeRet首先直接从模型中导出语义嵌入以进行快速候选搜索,然后利用其推理能力进行精确重排序。该框架贡献了三个进步:绕过词汇对齐层以获得语义保真的嵌入、通过显式先验条件化表示生成、以及通过中性选择框架减轻重排序中的框架效应。在涵盖46个数据集的MMEB和MMEB-V2基准测试中,FreeRet显著优于在数百万个对上训练的模型。除基准测试外,FreeRet与模型无关,可无缝扩展至不同多模态大语言模型系列和规模,保留其生成能力,支持任意模态组合,并将检索、重排序和生成统一到单个模型内的端到端RAG中。我们的发现表明,经过精心利用的预训练多模态大语言模型可以在无需训练的情况下作为强大的检索引擎,弥补了其作为通才角色的关键差距。

英文摘要

Multimodal large language models (MLLMs) are emerging as versatile foundations for mixed-modality retrieval. Yet, they often require heavy post-hoc training to convert them into contrastive encoders for retrieval. This work asks: Can off-the-shelf MLLMs serve as powerful retrievers without additional training? We present FreeRet, a plug-and-play framework that turns any MLLM into a two-stage retriever. FreeRet first derives semantically grounded embeddings directly from the model for fast candidate search, and then exploits its reasoning ability for precise reranking. The framework contributes three advances: bypassing lexical alignment layers to obtain semantically faithful embeddings, conditioning representation generation with explicit priors, and mitigating the framing effect in reranking via neutral choice framing. On the MMEB and MMEB-V2 benchmarks spanning 46 datasets, FreeRet substantially outperforms models trained on millions of pairs. Beyond benchmarks, FreeRet is model-agnostic and scales seamlessly across MLLM families and sizes, preserves their generative abilities, supports arbitrary modality combinations, and unifies retrieval, reranking, and generation into end-to-end RAG within a single model. Our findings demonstrate that pretrained MLLMs, when carefully harnessed, can serve as strong retrieval engines without training, closing a critical gap in their role as generalists.

2509.23651 2026-05-26 cs.RO

HeLoM: Hierarchical Learning for Whole-Body Loco-Manipulation by a Hexapod Robot

HeLoM: 六足机器人全身移动操作的分层学习

Xinrong Yang, Peizhuo Li, Hongyi Li, Yifeng Peng, Arhaan Jain, Junkai Lu, Linnan Chang, Yuhong Cao, Yifeng Zhang, Ge Sun, Guillaume Sartoretti

AI总结 提出HeLoM分层框架,通过协调多肢控制实现六足机器人对重/不规则物体的稳定推动,在仿真和实物实验中验证了有效性。

详情
AI中文摘要

在自然界中,动物经常需要移动/操纵与自身重量/大小相当的物体。与抓取和搬运相比,推动提供了一种更直接、高效的非抓取操纵策略,避免了复杂的抓取设计,同时利用直接接触在交互过程中调节物体的姿态。然而,实现有效的推动既需要足够的操纵能力,也需要稳定的全身协调,这在处理重型或不规则物体时尤其具有挑战性。为了解决这些挑战,我们提出了HeLoM,一种基于学习的六足机器人分层全身操纵框架,该框架利用协调的多肢控制,并适用于多足机器人系统。受多足昆虫合作策略的启发,我们的框架利用多个接触点和高度自由度,在物体交互过程中实现高效、动态的全身协调。HeLoM的高层规划器规划推动行为,而其低层控制器保持运动稳定性并生成动态一致的关节动作。这种设计使机器人能够通过协调的前肢交互和支撑性的后肢推进,在执行连续可控的推动行为的同时保持平衡。我们通过仿真和实物实验验证了HeLoM的有效性。结果表明,我们的框架能够在现实世界中稳定地将不同尺寸和未知物理属性的物体推动到指定的目标姿态。

英文摘要

In nature, animals often need to move/manipulate objects comparable in weight/size to their own bodies. Compared to grasping and carrying, pushing provides a more straightforward and efficient non-prehensile manipulation strategy, avoiding complex grasp design while leveraging direct contact to regulate an object's pose during interaction. Achieving effective pushing, however, requires both sufficient manipulation capability and stable whole-body coordination, which is particularly challenging when dealing with heavy or irregular objects. To address these challenges, we propose HeLoM, a learning-based hierarchical whole-body manipulation framework for hexapod robots that exploits coordinated multi-limb control and is applicable to multi-legged robotic systems. Inspired by the cooperative strategies of multi-legged insects, our framework leverages multiple contact points and high degrees of freedom to enable efficient and dynamic whole-body coordination during object interaction. HeLoM's high-level planner plans pushing behaviors, while its low-level controller maintains locomotion stability and generates dynamically consistent joint actions. This design enables the robot to maintain balance while executing continuous and controllable pushing behaviors through coordinated foreleg interaction and supportive hind-leg propulsion. We validate the effectiveness of HeLoM through both simulation and real-world experiments. Results show that our framework can stably push objects of varying sizes and unknown physical properties to designated goal poses in the real world.

2509.22299 2026-05-26 cs.LG cs.AI

HEAPr: Hessian-based Efficient Atomic Expert Pruning in Output Space

HEAPr: 基于Hessian的输出空间中高效原子专家剪枝

Ke Li, Zheng Yang, Zhongbin Zhou, Feng Xue, Zhonglin Jiang, Wenxiao Wang

AI总结 针对MoE模型粗粒度专家剪枝导致精度下降的问题,提出HEAPr算法,通过将专家分解为原子专家并利用二阶信息(最优脑外科原理)评估重要性,在输出空间简化计算,实现高比例无损压缩。

Comments ICLR 2026

详情
Journal ref
Proceedings of the International Conference on Learning Representations (ICLR), 2026
AI中文摘要

大型语言模型中的混合专家(MoE)架构相比密集LLM具有卓越性能和更低的推理成本。然而,其庞大的参数数量导致内存需求过高,限制了实际部署。现有的剪枝方法主要关注专家级剪枝,这种粗粒度通常导致显著的精度下降。在这项工作中,我们引入了HEAPr,一种新颖的剪枝算法,它将专家分解为更小、不可分割的原子专家,从而实现更精确和灵活的原子专家剪枝。为了衡量每个原子专家的重要性,我们利用基于最优脑外科理论原理的二阶信息。为了解决二阶信息带来的计算和存储挑战,HEAPr利用原子专家的固有属性,将专家参数的二阶信息转换为原子专家参数的二阶信息,并进一步简化为原子专家输出的二阶信息。这种方法将空间复杂度从$O(d^4)$(其中$d$是模型的维度)降低到$O(d^2)$。HEAPr仅需在小型校准集上进行两次前向传播和一次反向传播即可计算原子专家的重要性。在包括DeepSeek MoE和Qwen MoE系列在内的MoE模型上的大量实验表明,HEAPr在广泛的剪枝比例和基准测试中优于现有的专家级剪枝方法。具体来说,在大多数模型中,HEAPr在20%~25%的剪枝比例下实现了几乎无损的压缩,同时FLOPs也减少了近20%。代码可在 https://github.com/LLIKKE/HEAPr 找到。

英文摘要

Mixture-of-Experts (MoE) architectures in large language models (LLMs) deliver exceptional performance and reduced inference costs compared to dense LLMs. However, their large parameter counts result in prohibitive memory requirements, limiting practical deployment. While existing pruning methods primarily focus on expert-level pruning, this coarse granularity often leads to substantial accuracy degradation. In this work, we introduce HEAPr, a novel pruning algorithm that decomposes experts into smaller, indivisible atomic experts, enabling more precise and flexible atomic expert pruning. To measure the importance of each atomic expert, we leverage second-order information based on principles similar to the Optimal Brain Surgeon theory. To address the computational and storage challenges posed by second-order information, HEAPr exploits the inherent properties of atomic experts to transform the second-order information from expert parameters into that of atomic expert parameters, and further simplifies it to the second-order information of atomic expert outputs. This approach reduces the space complexity from $O(d^4)$, where $d$ is the model's dimensionality, to $O(d^2)$. HEAPr requires only two forward passes and one backward pass on a small calibration set to compute the importance of atomic experts. Extensive experiments on MoE models, including the DeepSeek MoE and Qwen MoE families, demonstrate that HEAPr outperforms existing expert-level pruning methods across a wide range of pruning ratios and benchmarks. Specifically, HEAPr achieves nearly lossless compression at pruning ratios of 20% to 25% in most models, while also reducing FLOPs by nearly 20%. The code is available at https://github.com/LLIKKE/HEAPr.
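
The idea of splitting an expert into atomic experts can be sketched as follows. This is a minimal illustration under two assumptions that the abstract does not state: the expert is a gated (SwiGLU-style) feed-forward block, and one atomic expert corresponds to one intermediate channel (one gate row, one up row, one down column). All names and weights are hypothetical, and the second-order importance score is not reproduced; the sketch only shows that the expert output is a sum of atomic contributions, so dropping an atomic expert removes exactly its term.

```python
import math


def silu(z):
    return z / (1.0 + math.exp(-z))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def atomic_output(x, gate_row, up_row, down_col):
    """Contribution of one intermediate channel (assumed 'atomic expert')."""
    scale = silu(dot(gate_row, x)) * dot(up_row, x)
    return [scale * w for w in down_col]


def expert_output(x, W_gate, W_up, W_down_cols, keep=None):
    """Sum atomic contributions; `keep` lists the channels that survive pruning."""
    keep = range(len(W_gate)) if keep is None else keep
    y = [0.0] * len(W_down_cols[0])
    for i in keep:
        part = atomic_output(x, W_gate[i], W_up[i], W_down_cols[i])
        y = [a + b for a, b in zip(y, part)]
    return y


# Hypothetical toy weights: input dim 2, three intermediate channels, output dim 2.
x = [1.0, -2.0]
W_gate = [[0.5, 0.1], [-0.3, 0.2], [0.0, 1.0]]
W_up = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_down_cols = [[1.0, 0.0], [0.0, 1.0], [0.5, -0.5]]

full = expert_output(x, W_gate, W_up, W_down_cols)
pruned = expert_output(x, W_gate, W_up, W_down_cols, keep=[0, 2])
dropped = atomic_output(x, W_gate[1], W_up[1], W_down_cols[1])
```

Here `full` minus `pruned` equals `dropped` up to floating-point error, which is the additive structure that makes per-channel importance scoring meaningful.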

2509.21592 2026-05-26 cs.CV cs.AI cs.LG

What Happens Next? Anticipating Future Motion by Generating Point Trajectories

接下来会发生什么?通过生成点轨迹预测未来运动

Gabrijel Boduljak, Laurynas Karazija, Iro Laina, Christian Rupprecht, Andrea Vedaldi

AI总结 提出一种基于单张图像预测未来运动的方法,通过生成密集轨迹网格来捕捉场景动态和不确定性,相比现有方法更准确多样,并验证其在机器人等下游任务中的有效性。

详情
Journal ref
ICLR 2026
AI中文摘要

我们考虑从单张图像预测运动的问题,即预测世界中物体可能如何移动,而无法观察其他参数如物体速度或施加的力。我们将此任务表述为密集轨迹网格的条件生成,模型紧密遵循现代视频生成器的架构,但输出运动轨迹而非像素。这种方法捕捉了场景范围的动态和不确定性,比先前的回归器和生成器产生更准确和多样化的预测。我们在模拟数据上广泛评估了我们的方法,展示了其在机器人等下游应用中的有效性,并在真实世界的直觉物理数据集上显示出有希望的准确性。尽管最近最先进的视频生成器常被视为世界模型,但我们表明它们在从单张图像预测运动方面存在困难,即使在简单的物理场景如落块或机械物体交互中,尽管对这些数据进行了微调。我们表明这一局限性源于生成像素的开销,而非直接建模运动。

英文摘要

We consider the problem of forecasting motion from a single image, i.e., predicting how objects in the world are likely to move, without the ability to observe other parameters such as the object velocities or the forces applied to them. We formulate this task as conditional generation of dense trajectory grids with a model that closely follows the architecture of modern video generators but outputs motion trajectories instead of pixels. This approach captures scene-wide dynamics and uncertainty, yielding more accurate and diverse predictions than prior regressors and generators. We extensively evaluate our method on simulated data, demonstrate its effectiveness on downstream applications such as robotics, and show promising accuracy on real-world intuitive physics datasets. Although recent state-of-the-art video generators are often regarded as world models, we show that they struggle with forecasting motion from a single image, even in simple physical scenarios such as falling blocks or mechanical object interactions, despite fine-tuning on such data. We show that this limitation arises from the overhead of generating pixels rather than directly modeling motion.

2509.17057 2026-05-26 cs.RO

RoboManipBaselines: A Unified Framework for Imitation Learning in Robotic Manipulation across Real and Simulation Environments

RoboManipBaselines:面向真实与仿真环境的机器人操作模仿学习统一框架

Masaki Murooka, Tomohiro Motoda, Ryoichi Nakajo, Hanbit Oh, Koshi Makihara, Keisuke Shirai, Tetsuya Ogata, Yukiyasu Domae

AI总结 提出RoboManipBaselines开源框架,统一支持仿真和真实环境下的机器人操作模仿学习全流程,包括数据收集、策略训练和部署,并通过基准测试和研究应用验证其有效性。

Comments Added a Limitations section in response to comments from reviewers

详情
Journal ref
IEEE Access 2026
AI中文摘要

我们提出RoboManipBaselines,一个用于机器人操作模仿学习研究的开源软件框架。该框架支持完整的模仿学习流程,包括数据收集、策略训练和部署,覆盖仿真和真实环境。其设计强调通过一致的工作流程实现集成,跨不同环境和机器人平台的通用性,通过易于添加新机器人、任务和策略的可扩展性,以及通过使用公开数据集进行评估的可重复性。RoboManipBaselines系统地实现了模仿学习的核心组件:环境、数据集和策略。通过统一接口,该框架支持多种仿真器和真实机器人环境,以及多模态传感器和多种策略模型。我们进一步在仿真和真实环境中进行了基准评估,并介绍了多项研究应用,包括数据增强、与触觉模型的集成、交互式机器人系统、3D感知评估和硬件扩展。这些结果表明,RoboManipBaselines为利用模仿学习推进机器人操作的研究和实验验证提供了有用的基础。https://isri-aist.github.io/RoboManipBaselines-ProjectPage

英文摘要

We present RoboManipBaselines, an open-source software framework for imitation learning research in robotic manipulation. The framework supports the entire imitation learning pipeline, including data collection, policy training, and rollout, across both simulation and real-world environments. Its design emphasizes integration through a consistent workflow, generality across diverse environments and robot platforms, extensibility for easily adding new robots, tasks, and policies, and reproducibility through evaluations using publicly available datasets. RoboManipBaselines systematically implements the core components of imitation learning: environment, dataset, and policy. Through a unified interface, the framework supports multiple simulators and real robot environments, as well as multimodal sensors and a wide variety of policy models. We further present benchmark evaluations in both simulation and real-world environments and introduce several research applications, including data augmentation, integration with tactile models, interactive robotic systems, 3D sensing evaluation, and hardware extensions. These results demonstrate that RoboManipBaselines provides a useful foundation for advancing research and experimental validation in robotic manipulation using imitation learning. Project page: https://isri-aist.github.io/RoboManipBaselines-ProjectPage

2509.16139 2026-05-26 cs.LG

Spatio-temporal, multi-field deep learning of shock propagation in meso-structured media

介观结构介质中冲击传播的时空多场深度学习

M. Giselle Fernández-Godino, Meir H. Shachar, Kevin Korner, Jonathan L. Belof, Mukul Kumar, Jonathan Lind, William J. Schill

AI总结 提出多场时空模型(MSTM),通过训练多尺度多物理场数据,同时演化七个耦合热力学和动力学场,以高精度预测冲击传播中的异常响应,实现1000倍加速。

Comments 25 pages, 12 figures

详情
AI中文摘要

预测多孔和晶格材料极端流体动力学响应是高能量密度物理学中的一个基本挑战,其中冲击诱导的孔洞塌陷、斜压涡度和异常动力学与热力学状态必须在多个尺度上解析。传统高保真流体动力学代码在行星防御和惯性约束聚变等应用的大规模设计探索中计算成本过高。我们提出了一种多场时空模型(MSTM),旨在克服标准机器学习替代模型的局限性,这些模型通常无法捕捉冲击传播特征的尖锐梯度和非线性场耦合。通过在高保真、多尺度多物理场数据上训练,MSTM同时演化七个耦合的热力学和动力学场——包括压力、温度、密度和速度——跨越复杂材料架构。我们的框架展示了准确预测异常响应的能力,例如反直觉的冲击后密度降低和局部热点形成,均方根误差低至1.4%。关键的是,模型的多场公式在长自回归展开中保持了物理一致性和界面稳定性,在结构保真度上比单场模型提高了94%。该框架实现了1000倍的求解时间减少,为介观结构介质中能量耗散和动量传递的实时分析与优化提供了实用途径。

英文摘要

Predicting the extreme hydrodynamic response of porous and architected lattice materials is a fundamental challenge in high energy density physics, where shock-induced pore collapse, baroclinic vorticity, and anomalous kinetic and thermodynamic states must be resolved across multiple scales. Traditional high-fidelity hydrocodes are computationally prohibitive for large-scale design exploration in applications like planetary defense and inertial confinement fusion. We present a multi-field spatio-temporal model (MSTM) designed to overcome the limitations of standard machine learning surrogates, which often fail to capture the sharp gradients and non-linear field couplings characteristic of shock propagation. By training on high-fidelity, multiscale multiphysics data, MSTM simultaneously evolves seven coupled thermodynamic and kinetic fields - including pressure, temperature, density, and velocity - across complex material architectures. Our framework demonstrates the ability to accurately predict anomalous responses, such as counterintuitive post-shock density reductions and localized hotspot formation, with mean root mean squared errors as low as 1.4%. Crucially, the model's multi-field formulation maintains physical consistency and interface stability over long autoregressive rollouts, outperforming single-field models by 94% in structural fidelity. This framework enables a 1000x reduction in time to solution, providing a practical pathway for the real-time analysis and optimization of energy dissipation and momentum transfer in meso-structured media.

2509.14250 2026-05-26 cs.CL

The meaning of prompts and the prompts of meaning: Semiotic reflections and modelling

提示的意义与意义的提示:符号学反思与建模

Martin Thellefsen, Amalia Nurma Dewi, Bent Sorensen

AI总结 本文基于皮尔士符号学三元模型和Dynacom传播模型,将大型语言模型中的提示重新概念化为动态符号现象,强调其作为沟通和认知行为的迭代符号形成与解释过程。

Comments 18 pages, 2 figures

详情
AI中文摘要

本文探讨了大型语言模型(LLMs)中的提示(prompts)和提示工程(prompting)作为动态符号现象,借鉴了皮尔士的符号三元模型、他的九种符号类型以及Dynacom传播模型。目的是将提示重新概念化,不是作为一种技术输入机制,而是作为一种沟通和认知行为,涉及符号形成、解释和精炼的迭代过程。理论基础建立在皮尔士符号学上,特别是再现体(representamen)、对象(object)和解释项(interpretant)之间的相互作用,以及符号的类型学丰富性:性质符号(qualisign)、单一符号(sinsign)、法则符号(legisign);像似符(icon)、指示符(index)、象征符(symbol);呈位(rheme)、述位(dicent)、论位(argument)——以及Dynacom模型中捕捉的解释项三元组。在分析上,本文将LLM定位为一种符号资源,它根据用户提示生成解释项,从而参与共享话语宇宙中的意义创造。研究结果表明,提示是一种符号和沟通过程,重新定义了数字环境中知识的组织、搜索、解释和共建方式。这一视角邀请我们在计算符号学时代重新构想知识组织和信息检索的理论与方法基础。

英文摘要

This paper explores prompts and prompting in large language models (LLMs) as dynamic semiotic phenomena, drawing on Peirce's triadic model of signs, his nine sign types, and the Dynacom model of communication. The aim is to reconceptualize prompting not as a technical input mechanism but as a communicative and epistemic act involving an iterative process of sign formation, interpretation, and refinement. The theoretical foundation rests on Peirce's semiotics, particularly the interplay between representamen, object, and interpretant, and the typological richness of signs: qualisign, sinsign, legisign; icon, index, symbol; rheme, dicent, argument - alongside the interpretant triad captured in the Dynacom model. Analytically, the paper positions the LLM as a semiotic resource that generates interpretants in response to user prompts, thereby participating in meaning-making within shared universes of discourse. The findings suggest that prompting is a semiotic and communicative process that redefines how knowledge is organized, searched, interpreted, and co-constructed in digital environments. This perspective invites a reimagining of the theoretical and methodological foundations of knowledge organization and information seeking in the age of computational semiosis.

2509.09658 2026-05-26 cs.CV

Measuring Epistemic Humility in Multimodal Large Language Models

测量多模态大语言模型中的认知谦逊

Bingkui Tong, Jiaer Xia, Sifeng Shang, Kaiyang Zhou

AI总结 提出HumbleBench基准,通过强制选择多项选择中引入“以上皆非”选项,评估多模态大语言模型拒绝错误选项的谦逊行为。

详情
AI中文摘要

多模态大语言模型(MLLMs)中的幻觉——即模型生成与输入图像不一致的内容——在现实应用中带来显著风险,从视觉问答中的错误信息到决策中的不安全错误。现有基准主要测试识别准确性,即评估模型能否在干扰项中选择正确答案。这忽略了可信AI的另一个重要能力:当没有提供的选项得到图像支持时,能够识别并避免做出错误选择,这是一种与谦逊相关的行为。我们提出了HumbleBench,这是一个新的幻觉基准,旨在评估MLLMs在强制选择多项选择设置中拒绝错误选项的能力,其中包含“以上皆非”选项。基于全景场景图数据集,我们利用对象和关系的细粒度场景图注释,使用候选属性线索,并提示GPT-4-Turbo生成多项选择问题,随后进行严格的人工筛选。每个问题都包含一个“以上皆非”选项,要求模型不仅识别正确的视觉信息,还要识别何时没有提供的答案有效。我们在HumbleBench上评估了各种最先进的MLLMs——包括通用型、专门推理型和专有模型——并为社区报告了实证结果。通过纳入明确的错误选项拒绝,HumbleBench填补了当前评估套件中的一个关键空白,评估了一种较窄但重要的、与可信多模态推理相关的弃权行为。我们的代码和数据集已公开发布,可在https://github.com/maifoundations/HumbleBench获取。

英文摘要

Hallucinations in multimodal large language models (MLLMs) -- where the model generates content inconsistent with the input image -- pose significant risks in real-world applications, from misinformation in visual question answering to unsafe errors in decision-making. Existing benchmarks primarily test recognition accuracy, i.e., evaluating whether models can select the correct answer among distractors. This overlooks another important capability for trustworthy AI: recognizing when none of the provided options is supported by the image and abstaining from committing to a false choice, a humility-related behavior. We present HumbleBench, a new hallucination benchmark designed to evaluate false-option rejection in MLLMs under a forced-choice multiple-choice setting with a "None of the above" option. Built from a panoptic scene graph dataset, we leverage fine-grained scene graph annotations for objects and relations, use candidate attribute cues, and prompt GPT-4-Turbo to generate multiple-choice questions, followed by a rigorous manual filtering process. Each question includes a "None of the above" option, requiring models not only to recognize correct visual information but also to identify when no provided answer is valid. We evaluate a variety of state-of-the-art MLLMs -- including general-purpose, specialized reasoning, and proprietary models -- on HumbleBench and report empirical findings for the community. By incorporating explicit false-option rejection, HumbleBench fills a key gap in current evaluation suites by assessing a narrower but important abstention-oriented behavior that is relevant to trustworthy multimodal reasoning. Our code and dataset are released publicly and can be accessed at https://github.com/maifoundations/HumbleBench.

2509.04445 2026-05-26 cs.LG

Towards Cognitively-Faithful Decision-Making Models to Improve AI Alignment

朝向认知忠实决策模型以改善AI对齐

Cyrus Cousins, Vijay Keswani, Vincent Conitzer, Hoda Heidari, Jana Schaich Borg, Walter Sinnott-Armstrong

AI总结 提出一种基于公理的方法,从成对比较中学习认知忠实的决策过程,以解决标准偏好诱导方法未能捕捉人类决策认知过程的问题,并在肾脏分配任务中验证了模型的有效性。

Comments In ICLR 2026

详情
AI中文摘要

最近的AI趋势旨在将AI模型与以人为中心的学习目标(如个人偏好、效用或社会价值观)对齐。使用标准偏好诱导方法,研究人员和从业者构建人类决策和判断的模型,AI模型与之对齐。然而,标准诱导方法通常未能捕捉人类决策背后的认知过程,如启发式或简化的结构化思维模式。为了解决这一失败,我们采用公理化的方法从成对比较中学习认知忠实的决策过程。基于分析塑造人类决策的认知过程的文献,我们推导出一个模型类,其中特征首先通过学习的规则处理,然后通过固定规则(如Bradley-Terry规则)聚合以产生决策。这种结构化的信息处理确保了这些模型作为代表潜在人类决策过程的现实且可行的候选者。我们通过在肾脏分配任务中学习可解释的人类决策模型来展示这种建模方法的有效性,并表明我们提出的模型在准确性上匹配或超越了先前的人类成对决策模型。

英文摘要

Recent AI trends seek to align AI models to learned human-centric objectives, such as personal preferences, utility, or societal values. Using standard preference elicitation methods, researchers and practitioners build models of human decisions and judgments, to which AI models are aligned. However, standard elicitation methods often fail to capture the cognitive processes behind human decision making, such as heuristics or simplifying structured thought patterns. To address this failure, we take an axiomatic approach to learning cognitively faithful decision processes from pairwise comparisons. Building on the literature analyzing cognitive processes that shape human decision-making, we derive a model class in which features are first processed with learned rules, then aggregated via a fixed rule, such as the Bradley-Terry rule, to produce a decision. This structured processing of information ensures that such models are realistic and feasible candidates to represent underlying human decision-making processes. We demonstrate the efficacy of this modeling approach by learning interpretable models of human decision making in a kidney allocation task, and show that our proposed models match or surpass the accuracy of prior models of human pairwise decision-making.
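
The two-stage structure described in the abstract (features processed by rules, then aggregated by a fixed Bradley-Terry rule) can be sketched as below. The Bradley-Terry rule itself is standard: the probability that option a is preferred over option b is the logistic function of their score difference. The feature names, threshold rules, and weights are hypothetical placeholders chosen for illustration; in the paper such rules are learned from pairwise comparisons, and nothing here reflects the authors' fitted models.

```python
import math


def bradley_terry_prob(score_a, score_b):
    """P(a preferred over b) under the Bradley-Terry rule."""
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))


def threshold_rule(value, cutoff):
    """Hypothetical heuristic: a feature only matters once it crosses a cutoff."""
    return 1.0 if value >= cutoff else 0.0


def score(patient):
    """Combine per-feature rule outputs with hypothetical weights."""
    return (1.5 * threshold_rule(patient["years_waiting"], 3)
            - 0.8 * threshold_rule(patient["age"], 65))


patient_a = {"years_waiting": 5, "age": 40}
patient_b = {"years_waiting": 1, "age": 70}
p_a_over_b = bradley_terry_prob(score(patient_a), score(patient_b))
```

Keeping the aggregation rule fixed and putting all flexibility into the per-feature rules is what keeps such a model interpretable: each rule can be read off and compared against a known heuristic.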

2509.00056 2026-05-26 cs.CV

Apex-Centered Spatio-Temporal Rank Pooling and Gradient Attention for Micro-Expression Recognition

基于顶点的时空秩池化和梯度注意力用于微表情识别

Luu Tu Nguyen, Vu Tram Anh Khuong, Thanh Ha Le, Thi Duyen Ngo

AI总结 提出微表情时空图像(MESTI)和微表情梯度注意力网络(MEGANet),通过改进输入模态和注意力机制提升微表情识别性能。

详情
AI中文摘要

微表情识别(MER)由于微表情的细微和短暂性是一项具有挑战性的任务。传统的输入模态,如顶点帧、光流和动态图像,往往无法充分捕捉这些短暂的面部运动,导致性能次优。在本研究中,我们引入了微表情时空图像(MESTI),这是一种针对微表情的动态秩池化的重新表述,将视频序列转换为单张图像,同时强调微表情的起始-顶点-结束时间模式。此外,我们提出了微表情梯度注意力网络(MEGANet),该网络包含一个提出的梯度注意力块,以增强从微表情中提取细粒度运动特征。通过结合MESTI和MEGANet,我们旨在建立一种更有效的MER方法。进行了大量实验以评估MESTI的有效性,将其与现有输入模态在常规架构上进行比较。此外,我们证明将先前发表的MER网络的输入替换为MESTI会导致一致的性能提升。还评估了MEGANet的性能,显示我们提出的网络在SMIC-HS、SAMM数据集上达到了最先进的结果,在CASMEII数据集上具有竞争力的性能,并且在报告的跨数据集评估设置中也取得了领先性能。MESTI和MEGANet的组合始终优于比较方法。这些发现强调了MESTI作为优越输入模态和MEGANet作为先进识别网络的潜力,旨在在各种应用中实现更有效的MER系统。

英文摘要

Micro-expression recognition (MER) is a challenging task due to the subtle and fleeting nature of micro-expressions. Traditional input modalities, such as Apex Frame, Optical Flow, and Dynamic Image, often fail to adequately capture these brief facial movements, resulting in suboptimal performance. In this study, we introduce the Micro-expression Spatio-Temporal Image (MESTI), a micro-expression-specific reformulation of dynamic rank pooling that transforms a video sequence into a single image while emphasizing the onset-apex-offset temporal pattern of micro-expressions. Additionally, we present the Micro-expression Gradient Attention Network (MEGANet), which incorporates a proposed Gradient Attention block to enhance the extraction of fine-grained motion features from micro-expressions. By combining MESTI and MEGANet, we aim to establish a more effective approach to MER. Extensive experiments were conducted to evaluate the effectiveness of MESTI, comparing it with existing input modalities across regular architectures. Moreover, we demonstrate that replacing the input of previously published MER networks with MESTI leads to consistent performance improvements. The performance of MEGANet is also evaluated, showing that our proposed network achieves state-of-the-art results on the SMIC-HS and SAMM datasets and competitive performance on the CASMEII dataset. It also achieves leading performance in the reported cross-dataset evaluation settings. The combination of MESTI and MEGANet consistently outperforms the compared methods. These findings underscore the potential of MESTI as a superior input modality and MEGANet as an advanced recognition network, paving the way for more effective MER systems in a variety of applications.
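
For context, the plain rank-pooling idea that MESTI reformulates can be sketched in a few lines. This is not MESTI: the abstract does not specify how the onset-apex-offset pattern is emphasized, so the sketch below shows only the simplest approximate rank pooling, where frame t of T gets the weight 2t - T - 1. That weight follows from summing the pairwise differences V_q - V_t over all q > t: frame s appears positively s - 1 times and negatively T - s times. Frames are tiny nested lists here, and the function names are hypothetical.

```python
def rank_pooling_weights(num_frames):
    """Weight 2t - T - 1 for frames t = 1..T; later frames get larger weights."""
    return [2 * t - num_frames - 1 for t in range(1, num_frames + 1)]


def rank_pool(frames):
    """Collapse a list of equally sized 2D frames into one weighted-sum image."""
    weights = rank_pooling_weights(len(frames))
    rows, cols = len(frames[0]), len(frames[0][0])
    return [[sum(w * f[r][c] for w, f in zip(weights, frames))
             for c in range(cols)]
            for r in range(rows)]


# A static clip pools to zero; a clip where one pixel brightens pools to a positive value.
static_clip = [[[5, 5], [5, 5]] for _ in range(4)]
moving_clip = [[[t, 0], [0, 0]] for t in range(4)]
static_image = rank_pool(static_clip)
moving_image = rank_pool(moving_clip)
```

Because the weights sum to zero, unchanging pixels cancel out and only temporal change survives in the pooled image, which is the property that makes this family of representations attractive for subtle facial motion.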

2508.19988 2026-05-26 cs.CL

AgentCoMa: A Compositional Benchmark Mixing Commonsense and Mathematical Reasoning in Real-World Scenarios

AgentCoMa:一个混合常识与数学推理的现实场景组合基准

Lisa Alazraki, Lihu Chen, Ana Brassard, Joe Stacey, Hossein A. Rahmani, Marek Rei

AI总结 提出AgentCoMa基准,测试大语言模型在组合常识与数学推理任务上的性能,发现模型在单独步骤上准确率高但组合后平均下降近30%。

Comments ACL 2026

详情
AI中文摘要

大型语言模型(LLMs)在涉及多个推理步骤组合的复杂常识和数学问题上取得了高准确率。然而,当前测试这些技能的组合基准往往侧重于常识或数学推理,而解决现实世界任务的LLM智能体需要两者的结合。在这项工作中,我们引入了一个智能体常识与数学基准(AgentCoMa),其中每个组合任务需要一个常识推理步骤和一个数学推理步骤。我们在61个不同规模、模型家族和训练策略的LLM上进行了测试。我们发现,LLM通常可以孤立地解决这两个步骤,但当两者结合时,它们的准确率平均下降近30%。这比我们在先前组合相同推理类型多个步骤的组合基准中观察到的性能差距要大得多。相比之下,非专家人类标注者可以以同样高的准确率解决AgentCoMa中的组合问题和各个步骤。此外,我们进行了一系列可解释性研究,以更好地理解性能差距,检查了神经元模式、注意力图和成员推断。我们的工作强调了在混合类型组合推理背景下模型脆弱性的显著程度,并为未来的改进提供了一个测试平台。

英文摘要

Large Language Models (LLMs) have achieved high accuracy on complex commonsense and mathematical problems that involve the composition of multiple reasoning steps. However, current compositional benchmarks testing these skills tend to focus on either commonsense or math reasoning, whereas LLM agents solving real-world tasks would require a combination of both. In this work, we introduce an Agentic Commonsense and Math benchmark (AgentCoMa), where each compositional task requires a commonsense reasoning step and a math reasoning step. We test it on 61 LLMs of different sizes, model families, and training strategies. We find that LLMs can usually solve both steps in isolation, yet their accuracy drops by nearly 30% on average when the two are combined. This is a substantially greater performance gap than the one we observe in prior compositional benchmarks that combine multiple steps of the same reasoning type. In contrast, non-expert human annotators can solve the compositional questions and the individual steps in AgentCoMa with similarly high accuracy. Furthermore, we conduct a series of interpretability studies to better understand the performance gap, examining neuron patterns, attention maps and membership inference. Our work underscores a substantial degree of model brittleness in the context of mixed-type compositional reasoning and offers a test bed for future improvement.
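
The gap between isolated-step accuracy and compositional accuracy described above can be illustrated with a small calculation. This is a sketch under an assumed definition (accuracy on questions where both steps are solved in isolation, minus accuracy on the composed question); the abstract does not state the exact metric, and the function names and toy data below are hypothetical.

```python
def accuracy(flags):
    """Fraction of True values in a list of per-question correctness flags."""
    return sum(flags) / len(flags)


def compositionality_gap(step1_ok, step2_ok, composed_ok):
    """Assumed gap: P(both isolated steps correct) - P(composed question correct)."""
    both_ok = [a and b for a, b in zip(step1_ok, step2_ok)]
    return accuracy(both_ok) - accuracy(composed_ok)


# Hypothetical results for five questions (not data from the paper).
step1_ok = [True, True, True, True, False]
step2_ok = [True, True, True, True, True]
composed_ok = [True, True, False, False, False]
gap = compositionality_gap(step1_ok, step2_ok, composed_ok)
```

In this toy case both steps are solved in isolation on 4 of 5 questions, but the composed question is solved on only 2 of 5, so the gap is 0.4.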

2508.19113 2026-05-26 cs.AI

Hybrid Deep Searcher: Scalable Parallel and Sequential Search Reasoning

混合深度搜索器:可扩展的并行与顺序搜索推理

Dayoon Ko, Jihyuk Kim, Haeju Park, Sohyeon Kim, Dahyun Lee, Yongrae Jo, Gunhee Kim, Moontae Lee, Kyungjae Lee

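AI summary Proposes HybridDeepSearcher, a hybrid search strategy that combines parallel query expansion and explicit evidence aggregation with sequential reasoning, significantly improving performance on multiple benchmarks and achieving test-time search scaling.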

Comments Accepted to ICLR 2026

详情
AI中文摘要

大型推理模型(LRMs)结合检索增强生成(RAG)使得深度研究智能体能够通过外部知识检索进行多步推理。然而,我们发现现有方法很少展示测试时搜索扩展。通过单查询顺序搜索扩展推理的方法受限于证据覆盖范围,而每步生成多个独立查询的方法通常缺乏结构化聚合,阻碍了更深的顺序推理。我们提出一种混合搜索策略来解决这些限制。我们引入了HybridDeepSearcher,一种结构化的搜索智能体,它在进入更深的顺序推理之前集成了并行查询扩展与显式证据聚合。为了监督这种行为,我们引入了HDS-QA,一个新颖的数据集,通过包含并行子查询的监督推理-查询-检索轨迹,指导模型将广泛的并行搜索与结构化聚合相结合。在五个基准上,HybridDeepSearcher显著优于现有技术,在FanOutQA上F1分数提高+15.9,在BrowseComp子集上提高+9.2。进一步分析显示其一致的测试时搜索扩展:随着允许的额外搜索轮次或调用次数增加,性能持续提升,而竞争方法则趋于平稳。

英文摘要

Large reasoning models (LRMs) combined with retrieval-augmented generation (RAG) have enabled deep research agents capable of multi-step reasoning with external knowledge retrieval. However, we find that existing approaches rarely demonstrate test-time search scaling. Methods that extend reasoning through single-query sequential search suffer from limited evidence coverage, while approaches that generate multiple independent queries per step often lack structured aggregation, hindering deeper sequential reasoning. We propose a hybrid search strategy to address these limitations. We introduce HybridDeepSearcher, a structured search agent that integrates parallel query expansion with explicit evidence aggregation before advancing to deeper sequential reasoning. To supervise this behavior, we introduce HDS-QA, a novel dataset that guides models to combine broad parallel search with structured aggregation through supervised reasoning-query-retrieval trajectories containing parallel sub-queries. Across five benchmarks, HybridDeepSearcher significantly outperforms the state-of-the-art, improving F1 scores by +15.9 on FanOutQA and +9.2 on a subset of BrowseComp. Further analysis shows its consistent test-time search scaling: performance improves as additional search turns or calls are allowed, while competing methods plateau.