arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.22629 2026-05-22 cs.CV

H-Flow: Self-supervised Human Scene Flow via Physics-inspired Joint Multi-modal Learning

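Zhanbo Huang, Xiaoming Liu, Yu Kong

Affiliations Michigan State University; University of North Carolina, Chapel Hill

AI summary This paper proposes H-Flow, a dense human scene flow method that captures both skeletal kinematics and surface deformation. It is trained without labels through physics-inspired joint multi-modal learning, introduces the high-fidelity synthetic benchmark DynAct4D, and outperforms existing methods on standard benchmarks and in zero-shot settings.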

Comments 19 pages, 7 figures, 4 tables

Abstract

Parametric human models capture global pose but cannot represent the non-rigid surface dynamics of clothing and soft tissue. Generic scene flow estimates dense motion but breaks down on articulated bodies, where pixel-level supervision is also intractable to acquire. We introduce H-Flow, a dense human scene flow that captures both skeletal kinematics and surface deformation. A unified multi-head transformer estimates flow from monocular video, jointly predicting pose and depth as companion outputs. The challenge lies in the lack of supervision. In place of unattainable labels, we anchor the network in the physics of human motion, encoding geometric, structural, and biomechanical priors as cross-modal training objectives. We further introduce DynAct4D, a high-fidelity synthetic benchmark providing dense flow annotations across diverse subjects, garments, and motions. On standard benchmarks, H-Flow outperforms scene-flow and parametric baselines, and generalizes zero-shot to in-the-wild video. Code, models, and the DynAct4D benchmark will be released upon publication.

2605.22622 2026-05-22 cs.LG math.OC

A note on convergence of Wasserstein policy optimization

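David Šiška, Yufei Zhang

Affiliations School of Mathematics, University of Edinburgh; Department of Mathematics, Imperial College London

AI summary This note studies the convergence of Wasserstein policy optimization in continuous state and action spaces. Using mean-field analysis and log-Sobolev inequalities, it argues that, within the framework of entropy-regularised Markov decision processes, WPO converges linearly to the global optimum.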

Abstract

Wasserstein Policy Optimization (WPO) is a recently proposed reinforcement learning algorithm that leverages Wasserstein gradient flows to optimize stochastic policies in continuous action spaces. Despite its empirical success, the theoretical convergence properties of WPO in environments with continuous state and action spaces have yet to be fully established. In this note, we argue that WPO within the framework of entropy-regularised Markov Decision Processes converges linearly. This is done by leveraging recent advances in mean-field analysis for convergence of gradient flows using log-Sobolev inequalities. Assuming the existence of a sufficiently regular solution to the gradient flow equation, we demonstrate monotonic energy dissipation along the flow and establish a local log-Sobolev inequality. Ultimately, these properties allow us to argue that the value function should converge linearly to the global optimum.
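
The step from energy dissipation plus a log-Sobolev-type inequality to a linear rate follows a standard pattern, sketched below in generic notation. The symbols $F$ (an energy to be minimised), $F^\ast$ (its optimal value), $I$ (the dissipation functional) and $\lambda > 0$ are illustrative assumptions and are not taken from the note; its sign conventions and constants may differ.

```latex
% Energy dissipation along the flow (\mu_t)_{t \ge 0}:
\frac{d}{dt} F(\mu_t) = -I(\mu_t) \le 0 .

% A log-Sobolev-type inequality bounds the optimality gap by the dissipation:
F(\mu_t) - F^\ast \le \frac{1}{2\lambda}\, I(\mu_t) .

% Combining both and applying Gronwall's lemma gives exponential decay in time:
\frac{d}{dt}\bigl(F(\mu_t) - F^\ast\bigr) \le -2\lambda\,\bigl(F(\mu_t) - F^\ast\bigr)
\quad\Longrightarrow\quad
F(\mu_t) - F^\ast \le e^{-2\lambda t}\,\bigl(F(\mu_0) - F^\ast\bigr) .
```

When the inequality is only local, as in the abstract, the bound in this sketch holds for as long as the flow remains in the region where the inequality is valid.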

2605.22620 2026-05-22 cs.LG cs.CL

Two is better than one: A Collapse-free Multi-Reward RLIF Training Framework

Shourov Joarder, Diganta Sikdar, Ahsan Habib Akash, Binod Bhattarai, Prashnna Gyawali

Affiliations Bangladesh University of Engineering and Technology; West Virginia University; University of Aberdeen; Fogsphere (Redev.AI Ltd, UK); University College London

AI summary This paper proposes a multi-reward RLIF framework that decomposes the training signal into an answer-level reward and a completion-level reward and combines them using GDPO normalization and KL-Cov regularization. The approach improves stability and robustness and comes close to the performance of supervised RLVR methods on mathematical reasoning and code-generation tasks.

Abstract

Reinforcement learning with verifiable rewards (RLVR) has substantially improved the reasoning ability of LLMs, but often depends on external supervision from human annotations or gold-standard solutions. Reinforcement learning from internal feedback (RLIF) has recently emerged as a scalable unsupervised alternative, using signals extracted from the model itself. However, existing RLIF methods typically rely on a single internal reward, which can lead to reward hacking, entropy collapse, and degraded reasoning structure. We propose a multi-reward RLIF framework that decomposes the training signal into two complementary components: an answer-level reward based on cluster voting and a completion-level reward based on token-wise self-certainty. To combine these signals robustly, we apply GDPO-based normalization to reduce reward-scale imbalance. We further introduce KL-Cov regularization, which targets low-entropy token distributions responsible for disproportionate entropy reduction, preserving exploration and preventing late-stage collapse. Across mathematical reasoning and code-generation benchmarks, our method improves stability and robustness over prior unsupervised RL approaches, while achieving performance close to supervised RLVR methods. These results show that complementary internal rewards, combined with targeted regularization, can support stable long-horizon reasoning without relying on external ground-truth supervision. Code will be released soon.
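
As a rough illustration of combining two internal rewards, the sketch below scores a group of sampled completions with a majority-vote answer reward and a self-certainty reward, normalizes each reward separately within the group, and then sums them. This is not the authors' implementation: exact string matching stands in for clustering, the per-reward z-score stands in for GDPO-based normalization, and the function names and the `certainty` values are hypothetical.

```python
import math
from collections import Counter


def cluster_vote_rewards(answers):
    """Reward 1.0 for completions whose final answer matches the majority answer.

    Assumption: answers are compared by exact string match, a simplification
    of the cluster voting described in the abstract.
    """
    majority, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == majority else 0.0 for a in answers]


def zscore(values, eps=1e-8):
    """Normalize one reward within the sampled group (zero mean, unit scale)."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(var)
    return [(v - mean) / (std + eps) for v in values]


def combined_advantages(answers, certainty):
    """Normalize each reward separately, then add them.

    Normalizing before summing keeps the reward with the larger raw scale
    from dominating the combined signal.
    """
    vote = cluster_vote_rewards(answers)
    return [v + c for v, c in zip(zscore(vote), zscore(certainty))]


if __name__ == "__main__":
    answers = ["42", "42", "7", "42"]      # hypothetical final answers
    certainty = [0.9, 0.5, 0.8, 0.7]       # hypothetical self-certainty scores
    print(combined_advantages(answers, certainty))
```

In this toy group, the completion that disagrees with the majority receives the lowest combined value even though its self-certainty is high.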

2605.22619 2026-05-22 cs.CV

GLeVE: Graph-Guided Lesion Grounding with Proposal Verification in 3D CT

Shuo Jiang, Yuhao Hong, Chunbo Jiang, Weihong Chen, Huangwei Chen, Shenghao Zhu, Beining Wu, Mingxuan Liu, Zhu Zhu, Feiwei Qin, Min Tan, Yifei Chen

Affiliations Zhejiang Key Laboratory of Space Information Sensing and Transmission; Hangzhou Dianzi University; Zhejiang University; Tsinghua University; Children's Hospital, Zhejiang University School of Medicine

AI summary This paper proposes GLeVE, a framework that uses graph-guided lesion grounding and anatomical prior verification to bridge the semantic-spatial gap between free-text narratives and volumetric anatomy in 3D CT, improving lesion localization accuracy.

Comments 11 pages, 4 figures

Abstract

Grounding radiology report descriptions to 3D CT volumes is essential for verifiable clinical interpretation, yet remains challenging due to the semantic-spatial gap between free-text narratives and volumetric anatomy. Existing report-assisted and vision-language grounding methods typically rely on phrase-level alignment or dense pixel supervision, resulting in limited lesion-wise correspondence and suboptimal localization accuracy. We propose GLeVE, a graph-guided lesion grounding framework with anatomical prior verification and octree-based autoregressive refinement. GLeVE treats each lesion description as an atomic semantic unit and encodes organ attribution, attributes, and inter-lesion relations through relation-aware graph reasoning to produce discriminative lesion-wise queries. Anatomy-aware proposal generation with region-level verification enforces one-to-one text-lesion alignment, while hierarchical octree refinement progressively improves boundary delineation. Experiments on AbdomenAtlas 3.0 demonstrate consistent gains over classical multimodal foundation models and report-supervised baselines in both segmentation accuracy and lesion-level localization.

2605.22613 2026-05-22 cs.LG

Evolutionary Multi-Task Optimization for LLM-Guided Program Discovery

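Halil Alperen Gozeten, Xuechen Zhang, Emrullah Ildiz, Ege Onur Taga, Tara Javidi, Samet Oymak

Affiliations University of Michigan - Ann Arbor; University of California San Diego

AI summary This paper proposes Evolutionary Multi-Task Optimization (EMO) for LLM-guided program discovery. Its two-stage framework, EMO-STA (Shared-Then-Adapt), improves the efficiency and robustness of program discovery across multiple task families, and the paper also shows that shared evolution helps reduce overfitting.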

Abstract

Recent LLM-guided evolutionary search methods have shown that iterative program mutation can discover strong algorithms, but they typically optimize each task independently, even when related tasks share reusable structure. We introduce Evolutionary Multi-Task Optimization (EMO) for LLM-guided program discovery, and propose EMO-STA (Shared-Then-Adapt), a two-stage framework that first evolves a shared archive of executable programs across a task family and then adapts selected shared candidates to each target task. Within EMO-STA, we explore multiple adaptation strategies, including warm-starting from the shared archive, adapting the best average shared program, and adapting the shared program that performs best on each target task. Across eight task families spanning continuous optimization, geometric construction, modeling, and algorithmic optimization, EMO-STA improves over matched-compute single-task evolution in most settings, with STA Best-Local providing the strongest in-distribution adaptation and STA Best-Shared yielding robust transfer to unseen tasks. Compute-allocation experiments show that allocating a substantial fraction of the family-level budget to shared evolution is consistently beneficial, with roughly balanced shared and adaptation budgets often being optimal. Beyond compute efficiency, we show that shared evolution can mitigate overfitting in low-evidence settings (e.g., limited training data), including ARC tasks and time-series feature engineering, by favoring programs that generalize across all tasks rather than exploiting task-specific brittle artifacts.

2605.22611 2026-05-22 cs.LG

Benchmarking Machine Learning Architectures for Antimicrobial Stewardship in Pediatric ICUs

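Niklas Raehse, Luregn J. Schlapbach, Daphné Chopard

Affiliations Department of Intensive Care and Neonatology and Children’s Research Center, University of Zurich, University Children’s Hospital Zurich; Department of Health Sciences and Technology, ETH Zurich; Department of Computer Science, ETH Zurich

AI summary This study benchmarks machine learning models for antimicrobial stewardship in pediatric ICUs, systematically evaluating four clinically relevant targets on a public dataset and a private institutional cohort. Predictive performance is driven mainly by target prevalence and dataset characteristics rather than model complexity. Sequence models improve the precision-recall trade-off at coarse resolution, but finer temporal modeling adds limited benefit, and the gains come with poorer calibration.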

Comments 16 pages, 6 figures, code: https://anonymous.4open.science/r/AMS_intervention_prediction-C024

Abstract

Antimicrobial stewardship (AMS) is critical in pediatric intensive care units (PICUs), where diagnostic uncertainty often drives broad-spectrum antibiotic use, increasing antimicrobial resistance and potential long-term harms. Machine learning offers a promising approach for identifying patient-level opportunities for stewardship interventions from electronic health record data, yet prior work has focused largely on adult populations and static tabular representations. We present a systematic benchmarking study of AMS intervention prediction in the PICU across a public dataset and a private institutional cohort. We define four clinically relevant proxy targets for reducing antibiotic exposure: intravenous-to-oral switching, de-escalation, discontinuation, and short-course therapy. Under a unified evaluation framework, we compare tabular, sequence-based, and graph-based temporal models at multiple temporal resolutions. We find that predictive performance is driven primarily by target prevalence and dataset characteristics rather than model complexity. Sequence models improve the precision-recall trade-off over tabular approaches at coarse (24-hour) resolution, while finer temporal modeling provides limited additional benefit. However, these gains come at the cost of poorer calibration, with simpler tabular models yielding more reliable probability estimates. Multi-task learning produces only marginal improvements, suggesting limited shared structure across stewardship targets. Our findings highlight the importance of target design, temporal representation, and calibration in clinical machine learning, and provide practical guidance for developing reliable decision support systems for pediatric AMS.

2605.22608 2026-05-22 cs.CL cs.AI

Agentic CLEAR: Automating Multi-Level Evaluation of LLM Agents

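Asaf Yehudai, Lilach Eden, Michal Shmueli-Scheuer

Affiliations IBM Research

AI summary This study proposes Agentic CLEAR, a framework that automates the evaluation of LLM agents through fine-grained, multi-level analysis, providing high-quality, data-driven feedback and predicting task success rates.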

Comments ACL

Abstract

Agentic systems are becoming more capable: agents define strategies, take actions, and interact with different environments. This autonomy poses serious challenges for overseeing and assessing agent behavior. Most current tools are limited, focusing on observability with basic evaluation capabilities or imposing static, hand-crafted error taxonomies that cannot adapt to new domains. To address this gap, we present Agentic CLEAR, an automatic, dynamic, and easy-to-use evaluation framework. It produces textual insights into the agent behavior on three levels of granularity: system, trace, and node. Agentic CLEAR operates above the observability layer, enabling seamless integration and featuring an intuitive UI that makes agent evaluation highly accessible. In our experiments on four benchmarks, seven agentic settings, and tens of thousands of LLM calls, we show that Agentic CLEAR produces high-quality, data-driven, insightful feedback. Our analysis shows strong alignment with human-annotated errors and the ability to predict task success rate.

2605.22607 2026-05-22 cs.CV

Enhancing Gaze Reasoning in Vision Foundation Models for Gaze Following

Shijing Wang, Yaping Huang, Chaoqun Cui, David Wong, Yihua Cheng, Alexandros Neophytou, Hyung Jin Chang

Affiliations Beijing Jiaotong University; University of Birmingham; MAIS, Institute of Automation, Chinese Academy of Sciences; Microsoft, UK

AI summary This paper proposes a new training mechanism that uses a local LoRA and an out-of-cone penalty to enhance gaze reasoning in vision foundation models, improving gaze-following performance, especially when targets are not salient.

Comments 11 pages, 8 figures

Abstract

Gaze following requires both scene understanding and gaze reasoning to localize the gaze target of an in-scene person. Recently, vision foundation models (VFMs) have demonstrated strong performance on this task, enabling simpler architectures while outperforming prior methods. However, we observe a key limitation of VFM-based approaches: while VFMs substantially improve scene understanding, they contribute little to gaze reasoning. As a result, existing methods often rely on semantically salient objects rather than true gaze cues, leading to degraded performance when targets are not salient. To address this, we propose a novel training mechanism to enhance gaze reasoning in VFMs for gaze following. Our method includes: (1) a head-conditioned local LoRA, which enables localized adaptation to preserve scene token learning while improving head token learning for gaze reasoning; and (2) an out-of-cone penalty, which injects gaze cues into head tokens while aligning them with scene tokens. Experiments on the GazeFollow and VAT datasets demonstrate that our method achieves state-of-the-art performance, with particularly strong improvements when gaze targets are not semantically salient. Our findings offer valuable insights for advancing future gaze following research. We will release the code once the paper is accepted.

2605.22605 2026-05-22 cs.RO cs.CV

Decoupling Ego-Motion from Target Dynamics via Dual-Interval Motion Cues for UAV Detection

Liuyang Wang, Feitian Zhang

Affiliations Department of Robotics, School of Advanced Manufacturing and Robotics; State Key Laboratory of Turbulence and Complex Systems; Peking University; Great Bay University

AI summary This paper proposes a vision-based, motion-guided detection framework that uses a dual-interval motion extraction strategy and a lightweight motion-guided attention module to decouple target motion from camera-induced disturbances, improving UAV detection under severe ego-motion.

Abstract

Object detection from Unmanned Aerial Vehicles (UAVs) is challenged by severe ego-motion, camera jitter, and large scale variations. While modern detectors perform well on static images, their direct application to UAV video often fails, particularly for small objects in dynamic scenes. Existing motion-based methods either rely on computationally expensive optical flow or use single-interval differencing, which is sensitive to jitter and limited in capturing diverse motion patterns. We propose a vision-only motion-guided detection framework that decouples target motion from camera-induced disturbances. A homography-based Global Motion Compensation (GMC) first aligns adjacent frames. We then introduce a Dual-Interval Motion Extraction strategy that captures both short-term and long-term motion cues. To integrate these cues, a lightweight Motion-Guided Attention (MGA) module enhances feature representations within a Feature Pyramid Network. Experiments on the VisDrone-VID dataset demonstrate consistent improvements over a strong YOLOv8 baseline under severe ego-motion. Ablation studies further confirm the effectiveness of the dual-interval design and the proposed motion-guided attention mechanism.
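
The dual-interval idea can be pictured as two frame differences taken over a short and a long temporal gap. The sketch below is a minimal illustration under stated assumptions, not the paper's method: frames are assumed to be already aligned by global motion compensation, plain absolute differencing stands in for the motion extraction step, and the function name and the interval lengths (1 and 5 frames) are hypothetical.

```python
import numpy as np


def dual_interval_motion(frames, t, short=1, long=5):
    """Return a (2, H, W) array of short- and long-interval motion cues.

    frames: sequence of grayscale frames of shape (H, W), assumed to be
            already warped into a common view (e.g., by a homography).
    t:      index of the current frame; must be at least `long`.
    """
    if t < long:
        raise ValueError("need at least `long` earlier frames")
    current = np.asarray(frames[t], dtype=np.float32)
    # Short gap: responds to fast motion but also to residual jitter.
    short_cue = np.abs(current - np.asarray(frames[t - short], dtype=np.float32))
    # Long gap: accumulates displacement, so slow targets become visible.
    long_cue = np.abs(current - np.asarray(frames[t - long], dtype=np.float32))
    return np.stack([short_cue, long_cue], axis=0)


if __name__ == "__main__":
    # Toy clip: one bright pixel moving one column per frame.
    clip = np.zeros((6, 4, 8), dtype=np.float32)
    for i in range(6):
        clip[i, 0, i] = 1.0
    cues = dual_interval_motion(clip, t=5)
    print(cues.shape)
```

The two cue maps could then be fed to an attention module as extra channels; how the paper fuses them inside the feature pyramid is not specified in the abstract.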

2605.22602 2026-05-22 cs.AI

Think Thrice Before You Speak: Dual knowledge-enhanced Theory-of-Mind Reasoning for Persuasive Agents

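Minghui Ma, Bin Guo, Runze Yang, Mengqi Chen, Yan Liu, Jingqi Liu, Yahan Pei, Xuehao Ma, Qiuyun Zhang, Zhiwen Yu

Affiliations Northwestern Polytechnical University; Peking University; Harbin Engineering University

AI summary This paper proposes a dual knowledge-enhanced Theory-of-Mind reasoning method to improve the dialogue ability of persuasive agents. By constructing a large-scale annotated dataset and proposing the TTBYS framework, it improves LLM performance in inferring desires, beliefs, and persuasive strategies.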

Comments 19 pages, 6 figures

Abstract

Persuasive dialogue requires reasoning about others' latent mental states, a capability known as Theory of Mind (ToM). However, due to reliance on simple prompting strategies and insufficient ToM knowledge, existing LLMs often fail to capture the intrinsic dependencies among mental states, leading to fragmented representations and unstable reasoning. To address these challenges, we introduce the ToM-based Persuasive Dialogue (ToM-PD) task, grounded in the Belief-Desire-Intention (BDI) framework, which explicitly models the sequential dependencies among mental states in multi-turn dialogues. To facilitate research on this task, we construct a large-scale annotated dataset, ToM-based Broad Persuasive Dialogues (ToM-BPD), capturing fine-grained mental states and corresponding persuasive strategies. We further propose Think Thrice Before You Speak (TTBYS), a knowledge-enhanced stepwise reasoning framework that leverages both explicit and implicit prior experiences to improve LLMs' inference of desires, beliefs, and persuasive strategies. Experimental results demonstrate that Qwen3-8B equipped with TTBYS outperforms GPT-5 by 1.20%, 22.80%, and 16.97% in predicting desires, beliefs, and persuasive strategies, respectively. Case studies further show that our approach enhances interpretability and consistency in reasoning.

2605.22600 2026-05-22 cs.RO

Branch-Stochastic Model Predictive Control for Motion Planning under Multi-Modal Uncertainty with Scenario Clustering

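Zekun Xing, Ramkrishna Chaudhari, Marion Leibold, Dirk Wollherr, Martin Buss

Affiliations Chair of Automatic Control Engineering

AI summary This paper proposes a method that combines stochastic model predictive control with a branching structure for motion planning under multi-modal uncertainty, using scenario clustering to achieve real-time computation and reduce conservatism.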

Comments This work has been accepted for presentation at IFAC World Congress 2026

Abstract

Motion planning for autonomous driving must account for multi-modal uncertainty in both the intentions and trajectories of surrounding vehicles. Handling uncertainty in a worst-case manner guarantees robustness but often leads to excessive conservatism. Stochastic Model Predictive Control (SMPC) reduces trajectory-level conservatism through chance constraints, yet remains conservative with respect to intention uncertainty since constraints must hold across all intentions. We present a novel combination of SMPC and the branching structure, enabling the planner to generate distinct trajectories for different possible intentions while maintaining safety under trajectory uncertainty. A novel scenario clustering is proposed to merge prediction scenarios based on high-level decision similarity, thereby ensuring real-time tractability. Furthermore, an adaptive branching-time computation postpones commitment to separate plans until intention uncertainty is sufficiently reduced. Simulation studies in challenging highway scenarios demonstrate that the proposed method improves safety, reduces conservatism, and achieves real-time computational performance.

2605.22597 2026-05-22 cs.LG cs.AI cs.GR cs.RO

MoSA: Motion-constrained Stress Adaptation for Mitigating Real-to-Sim Gap in Continuum Dynamics via Learning Residual Anisotropy

Jiaxu Wang, Junhao He, Jingkai Sun, Yi Gu, Yunyang Mo, Jiahang Cao, Qiang Zhang, Renjing Xu

Affiliations Hong Kong University of Science; MMLab, Chinese University of Hong Kong, Hong Kong SAR; The University of Hong Kong, Hong Kong SAR

AI summary This paper proposes MoSA, a motion-constrained stress adaptation framework for mitigating the real-to-sim gap in continuum dynamics. It uses an isotropic model as a physics prior, learns residual stress operators to capture mild anisotropy and heterogeneity, and is validated in robot manipulation.

Journal ref International Conference on Machine Learning 2026

Abstract

Learning real-world dynamics from visual observations is crucial for various domains. A common strategy is to calibrate simulators by estimating physical parameters, yet accuracy is ultimately bounded by the underlying physical models, which often assume materials are homogeneous and isotropic. Even if reasonable, real-world objects typically exhibit mild anisotropy and heterogeneity. After the near-isotropic backbone is well calibrated, these residual effects become the key bottleneck for further closing the real-to-sim gap. Although neural networks can fit dynamics end-to-end, such black-box modeling discards strong physical priors, leading to poor data efficiency and overfitting. Therefore, we propose MoSA, a motion-constrained stress adaptation framework that targets these residual effects to further improve real-to-sim dynamics learning. MoSA uses an isotropic model as a physics prior and learns residual stress operators to capture mild anisotropy and heterogeneity. It progressively adapts stresses via microplane-constrained redistribution in a physics-informed cascaded network. We further impose motion constraints by supervising temporal and spatial derivatives of the deformation field. Experimentally, our learned dynamics achieves superior accuracy, generalization, and robustness, while learning physically meaningful residual anisotropy. Finally, we validate MoSA in a robot manipulation setting, showing that better real-to-sim dynamics modeling translates into more reliable sim-to-real transfer. Project Page is available at https://mercerai.github.io/MoSA/.

2605.22596 2026-05-22 cs.LG

Factored Diffusion Policies: Compositionally Generalized Robot Control with a Single Score Network

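Sayan Mitra, Ege Yuceel, Noah Giles, Abhishek Pai

Affiliations University of Illinois Urbana-Champaign

AI summary This paper proposes factored diffusion policies, which use a single shared diffusion network whose score decomposes additively across factors at inference, reducing the training-task budget from a product of factor cardinalities to a sum. A trajectory-tube certificate converts the score-level bound into a closed-loop state-trajectory tube, and experiments confirm both the generalization bound and the certificate.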

Abstract

Robotic tasks are typically specified by a tuple of factors, such as the object to be grasped, the obstacles to be avoided, the color of the target, and so on. Collecting expert demonstrations for every combination of factor values grows combinatorially. We present factored diffusion policies: a single shared diffusion network trained with per-factor null-token dropout, whose score decomposes additively across factors at inference. Under approximate conditional independence between factors given the action-observation pair, this composition approximates the true joint score with a bounded uniform error, reducing the training-task budget from a product of factor cardinalities to a sum. A trajectory-tube certificate chains this score-level bound through the reverse-time sampling ODE and a contracting tracking controller into a closed-loop state-trajectory tube whose radius factors into an ODE-sensitivity constant and a per-factor score-error budget. Unlike compositional-diffusion methods for control that combine separately trained networks, we use one shared network. Drone racing experiments confirm both the generalization bound and the certificate. On state-based multi-gate racing, the factored policy passes 90% of held-out gates -- matching an oracle -- while a K-network composition baseline collapses to 3%; on vision-based single-gate traversal, it transfers zero-shot to an unseen venue with +11.7pp success-rate gain and 2.4X crash-rate reduction.
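
Two ideas in this abstract lend themselves to a small numerical sketch: composing per-factor scores additively, and the product-versus-sum training budget. The composition rule used below, unconditional score plus the sum of per-factor corrections, is the standard form that is exact when factors are conditionally independent given the sample; it is assumed here for illustration and may differ from the paper's exact formulation. The function names and the toy numbers are hypothetical.

```python
import math

import numpy as np


def compose_scores(unconditional, per_factor):
    """Combine per-factor conditional scores into one joint score estimate.

    unconditional: score with every factor masked out (the "null" condition).
    per_factor:    list of scores, each conditioned on a single factor only.
    Returns unconditional + sum_k (per_factor[k] - unconditional).
    """
    unconditional = np.asarray(unconditional, dtype=float)
    composed = unconditional.copy()
    for score in per_factor:
        composed += np.asarray(score, dtype=float) - unconditional
    return composed


def task_budgets(cardinalities):
    """Number of training tasks needed to cover every combination (product)
    versus every individual factor value (sum)."""
    return math.prod(cardinalities), sum(cardinalities)


if __name__ == "__main__":
    print(compose_scores([1.0, 0.0], [[2.0, 0.0], [1.0, 3.0]]))
    print(task_budgets([4, 5, 3]))
```

With three factors taking 4, 5, and 3 values, covering every combination takes 60 tasks, while covering each factor value once takes 12.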

2605.22593 2026-05-22 cs.LG

Do Deep Ensembles Actually Capture Uncertainty in Graph Neural Networks?

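Pedro C. Vieira, Pedro Ribeiro, Viacheslav Borovitskiy

Affiliations University of Edinburgh; DCC/FCUP, University of Porto

AI summary This paper studies the effectiveness of deep ensembles for graph neural networks and finds that they offer limited benefit for uncertainty quantification. The marginal gains come mainly from stabilizing optimization noise rather than from better uncertainty estimates, revealing epistemic collapse in the ensemble.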

Abstract

While deep ensembles are widely considered to be the default method for uncertainty quantification in deep learning, their effectiveness for graph-structured data is often simply assumed based on successes in domains like computer vision. We investigate standard deep ensembles specifically for message-passing graph neural networks. Benchmarking across seven datasets representing varied tasks and complexities, we reveal that ensembles provide surprisingly little improvement over a single model. Instead, the observed marginal gains stem primarily from stabilizing optimization noise in point predictions rather than yielding meaningfully better uncertainty estimates. Through an aleatoric-epistemic decomposition, we identify epistemic collapse: independently trained networks consistently converge to overly similar predictions. Because disagreement is the fundamental mechanism through which ensembles capture epistemic uncertainty, this lack of diversity neutralizes their key advantage. Analyzing this phenomenon further, we suggest this collapse is driven by functional rather than weight-space convexity, where distinct parameter solutions induce almost identical behavior. Our results suggest that deep ensemble success does not seamlessly transfer to graph machine learning.
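
One common way to separate aleatoric from epistemic uncertainty in an ensemble is the entropy-based decomposition sketched below: total uncertainty is the entropy of the averaged prediction, aleatoric uncertainty is the average entropy of the members, and the epistemic part is their difference. Whether the paper uses exactly this decomposition is an assumption, and the function names are hypothetical.

```python
import numpy as np


def entropy(probs, eps=1e-12):
    """Shannon entropy (in nats) over the last axis."""
    probs = np.clip(probs, eps, 1.0)
    return -np.sum(probs * np.log(probs), axis=-1)


def decompose_uncertainty(member_probs):
    """Split ensemble uncertainty into total, aleatoric, and epistemic parts.

    member_probs: array of shape (members, nodes, classes) holding each
                  member's predicted class probabilities.
    """
    member_probs = np.asarray(member_probs, dtype=float)
    total = entropy(member_probs.mean(axis=0))        # entropy of the mean
    aleatoric = entropy(member_probs).mean(axis=0)    # mean of the entropies
    epistemic = total - aleatoric                     # disagreement term
    return total, aleatoric, epistemic


if __name__ == "__main__":
    identical = np.array([[[0.7, 0.3]]] * 3)             # members agree
    disagreeing = np.array([[[1.0, 0.0]], [[0.0, 1.0]]])  # members disagree
    print(decompose_uncertainty(identical)[2])
    print(decompose_uncertainty(disagreeing)[2])
```

When every member predicts the same distribution, the epistemic term is zero no matter how uncertain each member is, which is the situation the abstract calls epistemic collapse.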

2605.22591 2026-05-22 cs.CV

Rethinking Noise-Robust Training for Frozen Vision Foundation Models: A Cross-Dataset Benchmark with a Case Study of Small-Loss Failure

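Zitong Li, Haoyu Wang

Affiliations Department of Biostatistics and Health Informatics

AI summary This paper re-evaluates noisy-label learning methods in the frozen-feature regime through a benchmark spanning five medical datasets, three backbones, two noise types, and five noise rates. It reveals the limitations of the small-loss assumption in high-risk regimes and proposes a feature-space selector to guide practical use.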

Abstract

Frozen Vision Foundation Models (VFMs) with lightweight classification heads are increasingly used in medical imaging because they offer efficient and reproducible deployment. Yet noisy-label learning methods for this frozen-feature regime remain poorly understood, and most existing methods still rely on a small-loss assumption inherited from end-to-end training. We present a controlled benchmark of eight noisy-label methods across five medical datasets, three backbones, two noise types, and five noise rates (150 conditions, 6,000 training runs), evaluated with balanced accuracy. The benchmark shows that there is no universal winner: Friedman ranking over the 150 conditions yields $χ^2 = 333.2$ ($p = 4.77 \times 10^{-68}$), ELR wins the most conditions (49/150), while CUFIT attains the best mean rank (2.51). The practical cost of method choice grows sharply with noise severity, from 4.5pp on clean data to 18.8pp at asymmetric 40% noise. To explain these benchmark-level patterns, we revisit the small-loss assumption in a representative high-risk regime. Under frozen DINOv2 features, clean and noisy loss distributions overlap by 53-61%, and matched-rate clean-sample detection shows that prediction agreement is markedly more stable than loss ranking under asymmetric noise (3pp vs. 13pp precision drop). On ISIC2019 with asymmetric 40% noise, Co-Teaching reaches 68% overall accuracy while collapsing to 35.1% balanced accuracy with zero recall on three minority classes. Together, these results recast noisy-label learning for frozen VFMs as a regime-aware method-selection problem rather than a search for a single dominant algorithm. We conclude with evidence-based guidance and a low-regret feature-space selector for practical recommendation.
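
The gap between overall and balanced accuracy reported for Co-Teaching can be illustrated with a small sketch. It assumes the usual definition of balanced accuracy as the unweighted mean of per-class recall. The class counts are invented toy data (not the ISIC2019 label distribution), and the helper names are hypothetical.

```python
from collections import defaultdict


def overall_accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of per-class recall over the classes present in y_true."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    recalls = [hits[c] / totals[c] for c in totals]
    return sum(recalls) / len(recalls)


# Toy data (assumption): one majority class and three small minority classes.
y_true = [0] * 85 + [1] * 5 + [2] * 5 + [3] * 5
# A classifier that always predicts the majority class.
y_pred = [0] * 100

print(overall_accuracy(y_true, y_pred))   # 0.85
print(balanced_accuracy(y_true, y_pred))  # 0.25
```

A model with zero recall on every minority class still scores well on overall accuracy here, which is why the benchmark's choice of balanced accuracy matters under class imbalance.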

2605.22581 2026-05-22 cs.CV cs.AI cs.LG

SceneAligner: 3D-Grounded Floorplan Localization in the Wild

Junhyeong Cho, Ruojin Cai, Hadar Averbuch-Elor

Affiliations Cornell University; Kempner Institute, Harvard University

AI summary This paper presents a method for floorplan localization in the wild that grounds the task in a reconstructed 3D representation of the scene, addressing the limited applicability of existing methods to large-scale buildings and rasterized floorplans.

Comments Project Page: https://Cornell-VAILab.github.io/SceneAligner

Abstract

Many public buildings provide floorplans with a "you are here" indicator to help visitors orient themselves. Floorplan localization seeks to computationally replicate this capability by determining where visual observations were captured within a floorplan. However, existing methods typically assume controlled small-scale environments and precise vectorized floorplans, limiting their ability to operate in large-scale buildings and rasterized floorplans. In this work, we present an approach for performing floorplan localization in the wild by grounding the task in a reconstructed 3D representation of the scene. Given an unconstrained image collection, our method reconstructs a gravity-aligned 3D scene and projects it into a 2D density map that serves as a floorplan proxy. Floorplan localization is then formulated as aligning this proxy with the input floorplan via a 2D similarity transform. To bridge the appearance gap between density maps and architectural floorplans, we adapt a 2D foundation model to learn cross-modal correspondences, introducing a fine-tuning scheme that encourages semantically aligned matches while preserving structural consistency. Extensive experiments demonstrate substantial improvements over prior methods, including in extremely sparse settings with as little as a single input image. Our code and data will be publicly available.
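
The alignment step is described as fitting a 2D similarity transform between the density-map proxy and the floorplan. As a rough illustration only, the sketch below fits such a transform to point correspondences that are assumed to be already known, using the closed-form least-squares solution in complex arithmetic. How SceneAligner obtains correspondences or solves for the transform is not stated in the abstract. The function names are hypothetical, and reflections are not modelled.

```python
import cmath
import math


def fit_similarity_2d(src, dst):
    """Least-squares fit of dst ~ a * src + b with points treated as complex numbers.

    abs(a) is the scale, cmath.phase(a) the rotation angle, b the translation.
    """
    n = len(src)
    p = [complex(x, y) for x, y in src]
    q = [complex(x, y) for x, y in dst]
    p_mean = sum(p) / n
    q_mean = sum(q) / n
    num = sum((qi - q_mean) * (pi - p_mean).conjugate() for pi, qi in zip(p, q))
    den = sum(abs(pi - p_mean) ** 2 for pi in p)
    a = num / den
    b = q_mean - a * p_mean
    return a, b


def apply_similarity_2d(a, b, points):
    """Apply z -> a * z + b to a list of (x, y) points."""
    out = []
    for x, y in points:
        z = a * complex(x, y) + b
        out.append((z.real, z.imag))
    return out


# Toy check (assumed data): scale 2, rotation 90 degrees, translation (3, -1).
src = [(0.0, 0.0), (1.0, 0.0), (1.0, 2.0), (-1.0, 1.0)]
true_a, true_b = 2j, complex(3.0, -1.0)
dst = apply_similarity_2d(true_a, true_b, src)

a, b = fit_similarity_2d(src, dst)
scale = abs(a)
rotation_deg = math.degrees(cmath.phase(a))
print(scale, rotation_deg, b)
```

With noisy or partly wrong correspondences, a robust wrapper such as RANSAC around this fit would be the usual next step. That is a general practice, not something the abstract claims.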

2605.22579 2026-05-22 cs.CL cs.AI stat.ML

Beyond Temperature: Hyperfitting as a Late-Stage Geometric Expansion

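Meimingwei Li, Yuanhao Ding, Esteban Garces Arias, Christian Heumann

Affiliations Department of Statistics, LMU Munich; Munich Center for Machine Learning (MCML); School of Computer and Information Engineering, Henan University

AI summary This paper studies hyperfitting and finds it distinct from distribution sharpening. Experiments show that it relies on a dynamic, context-dependent rank reordering mechanism and that a geometric expansion of the feature space occurs in a "Terminal Expansion" at the final transformer block. The paper also proposes Late-Stage LoRA to improve generation quality.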

Comments Accepted at ICML 2026

Abstract

Recent work has identified a counterintuitive phenomenon termed "Hyperfitting", where fine-tuning Large Language Models (LLMs) to near-zero training loss on small datasets surprisingly enhances open-ended generation quality and mitigates repetition in greedy decoding. While effective, the underlying mechanism remains poorly understood, with the extremely low-entropy output distributions suggesting a potential equivalence to simple temperature scaling. In this work, we demonstrate that this phenomenon is fundamentally distinct from distribution sharpening; entropy-matched control experiments reveal that temperature scaling fails to replicate the diversity gains of hyperfitting. Furthermore, we falsify the hypothesis of static vocabulary reweighting, showing through ablation studies that hyperfitting relies on a dynamic, context-dependent rank reordering mechanism. Layer-wise analysis localizes this effect to a "Terminal Expansion" in the final transformer block, where a substantial geometric expansion of the feature space ($\Delta\mathrm{Dim} \approx +80.8$) facilitates the promotion of deep-tail tokens. Additionally, we introduce Late-Stage LoRA, a targeted fine-tuning strategy that updates only the final 5 layers, yielding robust generation with minimal parameter updates.
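
One way to see why temperature scaling alone cannot act as a rank-reordering mechanism: dividing logits by a positive temperature changes the entropy of the softmax distribution but never the order of the tokens. The sketch below checks this on made-up logits. It illustrates that general property only and does not reproduce the paper's entropy-matched experiments.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits / temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def ranking(probs):
    """Token indices sorted from most to least probable."""
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)


# Made-up next-token logits (assumption).
logits = [2.0, 1.0, 0.5, -1.0]
base = softmax(logits, temperature=1.0)
sharp = softmax(logits, temperature=0.25)

print(entropy(base), entropy(sharp))            # sharpening lowers entropy
print(ranking(base) == ranking(sharp))          # True: the order is unchanged
```

Any effect that promotes a deep-tail token above a previously higher-ranked one therefore has to come from a change in the logits themselves, not from a global rescaling.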

2605.22578 2026-05-22 cs.CV

Beyond Chamfer Distance: Granular Order-aware Evaluation Metric For Online Mapping

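Chouaib Bencheikh Lehocine, Adam Lilja, Junsheng Fu, Lars Hammarstrand

Affiliations Zenseact AB; Chalmers University of Technology

AI summary This paper proposes granular, order-aware evaluation metrics for online mapping. By introducing sequence optimal sub-pattern assignment (SOSPA) and, for multi-instance evaluation, polyline localisation and detection (PLD), it improves on conventional Chamfer-distance-based evaluation and shows that detection capability is the main bottleneck of current methods.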

Abstract

Online map estimation is a crucial component of autonomous driving systems that reduces the reliance on costly high-definition maps. State-of-the-art (SOTA) methods commonly predict map elements as ordered sequences of points that form polylines and polygons. The evaluation of these methods relies predominantly on mean average precision (mAP) based on thresholded Chamfer distance (CD). This framework lacks sensitivity to point ordering and provides limited granularity in assessing geometric quality, making it difficult to distinguish which methods truly excel over others. In this work, we address these limitations on two fronts. For the single-instance similarity measure, we introduce sequence optimal sub-pattern assignment (SOSPA), an order-aware metric that enables fine-grained evaluation of individual geometries while satisfying all metric axioms. For the multi-instance evaluation framework, we propose polyline localisation and detection (PLD), a soft metric that jointly captures detection quality and geometric accuracy, replacing the hard thresholding of mAP with a principled soft assignment. Through evaluations on nuScenes, we demonstrate that PLD effectively ranks SOTA online mapping methods (MapTRv2, StreamMapNet, MapTracker) while providing a decomposed error analysis. This analysis identifies detection capability as the dominant bottleneck in current methods, revealing a performance trend that mAP fails to capture. Code for evaluation using our metrics will be released.
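
The claim that Chamfer distance lacks sensitivity to point ordering can be checked directly. The sketch below uses one common symmetric form (mean nearest-neighbour distance in each direction, averaged). The exact variant used inside mAP-based map evaluation may differ, and the polyline is invented toy data. SOSPA and PLD are not implemented here, since the abstract does not define them.

```python
import math


def chamfer_distance(a, b):
    """Symmetric Chamfer distance between two point sets."""
    def one_way(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)
    return 0.5 * (one_way(a, b) + one_way(b, a))


def path_length(points):
    """Length of the polyline obtained by connecting points in the given order."""
    return sum(math.dist(p, q) for p, q in zip(points, points[1:]))


# Toy polyline (assumption) and the same points visited in a scrambled order.
polyline = [(0.0, 0.0), (1.0, 0.0), (2.0, 1.0), (3.0, 1.0)]
scrambled = [polyline[i] for i in (2, 0, 3, 1)]

print(chamfer_distance(polyline, scrambled))          # 0.0
print(path_length(polyline), path_length(scrambled))  # the drawn shapes differ
```

Both sequences contain the same points, so the set-based distance is zero even though connecting them in order traces very different curves.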

2605.22572 2026-05-22 cs.CV

SegGuidedNet: Sub-Region-Aware Attention Supervision for Interpretable Brain Tumor Segmentation

Hasaan Maqsood, Saif Ur Rehman Khan, Sebastian Vollmer, Andreas Dengel, Muhammad Nabeel Asim

Affiliations German Research Center for Artificial Intelligence; Intelligentx GmbH (intelligentx.com); National University of Sciences and Technology (NUST); Department of Core Informatics, Graduate School of Informatics, Osaka Metropolitan University

AI summary This paper proposes SegGuidedNet, a three-dimensional residual encoder-decoder network with a novel SegAttentionGate module. A lightweight auxiliary loss explicitly supervises the decoder to produce spatially discriminative attention maps for each tumour sub-region (necrotic core, peritumoral oedema, enhancing tumour), giving spatial interpretability at no extra cost and without post-hoc explanation methods. The model achieves strong segmentation performance on BraTS2021 and BraTS2023 GLI.

Abstract

Accurate segmentation of brain tumour sub-regions from multi-parametric MRI is critical for treatment planning yet remains challenging due to morphological variability, class imbalance, and overlapping appearances of tumour regions across imaging sequences. We propose SegGuidedNet, a three-dimensional residual encoder-decoder network introducing a novel SegAttentionGate module that explicitly supervises the decoder to produce spatially discriminative attention maps for each tumour sub-region (necrotic core, peritumoral oedema, and enhancing tumour) via a lightweight auxiliary loss, adding less than 0.2% parameter overhead. This sub-region supervision maintains decoder discriminability between visually ambiguous classes while providing free-of-cost spatial interpretability at inference without any post-hoc explanation method. Evaluated independently on BraTS2021 and BraTS2023 GLI across 251 held-out subjects each, SegGuidedNet achieves mean Dice of 0.905 (ET=0.873, TC=0.906, WT=0.935) and 0.897 (ET=0.859, TC=0.902, WT=0.931) respectively, surpassing ensemble-based nnU-Net and HNF-Netv2 as a single model and approaching Swin UNETR, a 10-model ensemble, within 2-4 Dice points at a fraction of the inference cost. The consistency of results across two benchmark editions further confirms the generalisability of the proposed approach, offering competitive accuracy with built-in interpretability in a lightweight, clinically practical framework.
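
For reference, the Dice score quoted above is the standard overlap measure 2|P ∩ T| / (|P| + |T|). The sketch below computes it on toy voxel index sets and checks that the reported means match the unweighted average of the three region scores to three decimals. That the authors average ET, TC, and WT in exactly this way is an assumption, although the numbers agree.

```python
def dice(pred, target):
    """Dice overlap between two collections of voxel indices."""
    pred, target = set(pred), set(target)
    if not pred and not target:
        return 1.0  # both empty: treated as perfect agreement (a common convention)
    return 2 * len(pred & target) / (len(pred) + len(target))


def mean_dice(region_scores):
    """Unweighted mean over per-region Dice scores."""
    return sum(region_scores.values()) / len(region_scores)


# Toy masks (assumption): 4 predicted voxels, 4 true voxels, 2 shared.
print(dice({1, 2, 3, 4}, {3, 4, 5, 6}))  # 0.5

# Region scores as quoted in the abstract.
brats2021 = {"ET": 0.873, "TC": 0.906, "WT": 0.935}
brats2023 = {"ET": 0.859, "TC": 0.902, "WT": 0.931}
print(round(mean_dice(brats2021), 3))  # 0.905
print(round(mean_dice(brats2023), 3))  # 0.897
```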

2605.22570 2026-05-22 cs.CV cs.AI

VGenST-Bench: A Benchmark for Spatio-Temporal Reasoning via Active Video Synthesis

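Jinho Park, Youbin Kim, Hogun Park, Eunbyung Park

Affiliations Department of Artificial Intelligence, Sungkyunkwan University; Department of Artificial Intelligence, Yonsei University

AI summary This paper proposes VGenST-Bench, a video benchmark that uses generative models to actively synthesize diverse evaluation scenarios for assessing the spatio-temporal reasoning of multimodal large language models. A multi-agent pipeline and a 3x2x2 video taxonomy enable precise diagnosis of fine-grained spatio-temporal understanding.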

Comments 82 pages, 91 figures (7 in main paper, 84 in appendix). Project page: https://zinosii.github.io/VGenST-Bench/

Abstract

Spatio-temporal reasoning is a core capability for Multimodal Large Language Models (MLLMs) operating in the real world. As such, evaluating it precisely has become an essential challenge. However, existing spatio-temporal reasoning benchmark datasets primarily rely on static image sets or passively curated video data, which limits the evaluation of fine-grained reasoning capabilities. In this paper, we introduce VGenST-Bench, a video benchmark that employs generative models to actively synthesize highly controlled and diverse evaluation scenarios. To construct VGenST-Bench, we propose a multi-agent pipeline incorporating a human quality control stage, ensuring the quality of all generated videos and QA pairs. We establish a comprehensive 3x2x2 video taxonomy, encompassing Spatial Scale, Perspective, and Scene Dynamics to span diverse scenarios. Furthermore, we design a hierarchical task suite that decouples low-level visual perception from high-level spatio-temporal reasoning. By shifting the paradigm from passive curation to active synthesis, VGenST-Bench enables fine-grained diagnosis of spatio-temporal understanding in MLLMs.

2605.22567 2026-05-22 cs.CL

LANG: Reinforcement Learning for Multilingual Reasoning with Language-Adaptive Hint Guidance

Yuchun Fan, Bei Li, Peiguang Li, Yilin Wang, Yongyu Mu, Jian Yang, Xin Chen, Rongxiang Weng, Jingang Wang, Xunliang Cai, Jingbo Zhu, Tong Xiao

Affiliations NLP Lab, School of Computer Science and Engineering, Northeastern University, Shenyang, China; Meituan Inc.; NiuTrans Research, Shenyang, China

AI summary This paper proposes the LANG framework, which uses language-conditioned hints to guide exploration in non-English reasoning tasks. It addresses the trade-off between input-language consistency and reasoning quality that reinforcement learning faces in multilingual settings, improving reasoning performance without compromising language consistency.

Comments Accepted to ACL 2026 (main conference)

Abstract

Reinforcement learning has proven effective for enhancing multi-step reasoning in large language models (LLMs), yet its benefits have not fully translated to multilingual contexts. Existing methods struggle with a fundamental trade-off: prioritizing input-language consistency severely hampers reasoning quality, while prioritizing reasoning often leads to unintended language drift toward English. We address this challenge with LANG, a novel framework that leverages language-conditioned hints to guide exploration in non-English reasoning tasks. Our method incorporates two key mechanisms to prevent dependency on these hints: a progressive decay schedule that gradually withdraws scaffolding, and a language-adaptive switch that tailors learning horizons to specific language difficulties. Empirical results on challenging multilingual mathematical benchmarks reveal that LANG substantially enhances reasoning performance without compromising language consistency. Moreover, we show that our framework generalizes beyond mathematics, fostering more consistent language alignment across model layers.

2605.22566 2026-05-22 cs.LG

GraphFlow: A Graph-Based Workflow Management for Efficient LLM-Agent Serving

Ao Li, Shangpeng Yang, Fahao Chen, Tianheng Xu, Peng Li, Zhou Su

Affiliations School of Cyber Science and Engineering, Xi'an Jiaotong University, Xi'an, China; School of Artificial Intelligence, Shandong University, Jinan, China; Shanghai Advanced Research Institute, Chinese Academy of Sciences, Shanghai, China

AI summary This paper proposes GraphFlow, a graph-based workflow management approach that dynamically generates task-specific workflows from a unified graph structure, wGraph, to improve the efficiency and performance of LLM-agent serving. Experiments on multiple benchmark datasets show strong results, with a clear performance gain and a reduced memory footprint.

Comments Accepted to ICML 2026

Abstract

Large Language Model (LLM)-based agents demonstrate strong reasoning and execution capabilities on complex tasks when guided by structured instructions, commonly referred to as workflows. However, existing workflow-assisted agent serving systems typically rely on predefined templates and shallow matching mechanisms, which limit their ability to capture deep semantic relationships and generalize to previously unseen tasks. To address these limitations, we propose a new workflow management paradigm that represents workflows using a unified graph, termed wGraph, where each node corresponds to an atomic operation. wGraph serves as a shared substrate from which task-specific workflows are dynamically instantiated. Building on wGraph primitives, we introduce GraphFlow, a system that efficiently integrates workflows into agent serving through two key designs. First, adaptive workflow generation dynamically constructs workflows from wGraph based on task semantics and constraint requirements. Second, workflow state management exploits wGraph structure to efficiently manage Key-Value (KV) caches, reducing redundant computation during agent serving. Extensive experiments across five benchmark datasets show that GraphFlow consistently outperforms state-of-the-art methods, yielding an average performance improvement of approximately 4.95 percentage points, while achieving an approximately 4$\times$ reduction in memory footprint.

2605.22564 2026-05-22 cs.CL cs.LG cs.SE

SynAE: A Framework for Measuring the Quality of Synthetic Data for Tool-Calling Agent Evaluations

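Shuaiqi Wang, Aadyaa Maddi, Zinan Lin, Giulia Fanti

Affiliations Carnegie Mellon University; Microsoft Research

AI summary This paper proposes SynAE, a framework for assessing the quality of synthetic data for multi-turn tool-calling agents. It evaluates the validity, fidelity, and diversity of synthetic data across four metric categories and shows that no single metric fully characterizes synthetic data quality.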

Abstract

Today, tool-calling agents are commonly evaluated or tested on static datasets of execution traces, including input commands, agent responses, and associated tool calls. However, internal production datasets are often insufficient or unusable for testing; for example, they may contain sensitive or proprietary data, or they may be too sparse to support comprehensive testing (especially pre-deployment). In these settings, practitioners are increasingly replacing or augmenting real datasets with synthetic ones for evaluation purposes. A key challenge is quantifying the relation between these synthetic datasets and the real data. We introduce SynAE, an evaluation framework for assessing how well synthetic benchmarks for multi-turn, tool-calling agents replicate and augment the characteristics of real data trajectories. SynAE assesses the validity, fidelity, and diversity of synthetic data across four metric categories: (i) task instructions and intermediate responses, (ii) tool calls, (iii) final outputs, and (iv) downstream evaluation. We evaluate SynAE using recent agent benchmarks and test common synthetic data failure modes via realistic and controlled generation schemes. SynAE detects fine-grained variations in data validity, fidelity and diversity, and shows that no single metric is sufficient to fully characterize synthetic data quality, motivating a multi-axis evaluation of synthetic data for agent testing. A demo of SynAE is available at https://synae-2026-synae-demo.static.hf.space/index.html, with code at https://github.com/wsqwsq/SynAE.

2605.22563 2026-05-22 cs.CV

Cell Phantom Video Generation in Elliptical Fourier Descriptor Domain

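Francesco Benedetto, Roberto Basla, Luca Magri, Giacomo Boracchi

Affiliations Department of Electronics Information and Bioengineering

AI summary This work proposes a new framework for generating cell phantom videos in the Elliptical Fourier Descriptor (EFD) domain. Representing phantom evolution as a multivariate time series of EFD coefficients introduces a strong prior and allows temporally consistent videos to be generated efficiently. The experiments indicate that modelling temporal evolution in EFD space yields biologically plausible phantom videos, offering a way to generate synthetic annotated data and reduce annotation effort.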

Comments 6 pages, Accepted at the International Conference on Image Processing (ICIP) 2026

Abstract

Training Deep Neural Networks for tracking individual cells in biomedical videos requires a large amount of annotated data. The annotation of videos for cell tracking is very time consuming and often requires domain expertise; this explains the limited availability of public annotated data to address important medical problems like tissue repair or cancer treatment. Generating synthetic videos along with their Ground Truth annotations is a promising solution that relies, as a foundational first step, on the synthesis of single cell annotations (or phantoms). Phantoms need to be time consistent, as they have to replicate biological processes that are specific to the cell types. In this work, we propose a novel framework for generating videos of cell phantoms in the Elliptical Fourier Descriptors (EFDs) domain, a compact and geometrically interpretable representation for 2D closed contours. We represent the cell phantom evolution as a multivariate time series of EFD coefficients, introducing a strong prior for cell morphology and enabling the efficient generation of sequences that evolve coherently in time. Our experimental validation proves that modelling the temporal evolution in EFD space enables the generation of biologically plausible phantom videos. Our method can be used in generative pipelines for synthesizing annotated data for cell tracking, thus strongly mitigating the annotation effort for creating new datasets. Our code is available for download here: https://github.com/FrancescoBenedetto99/efd-cell-video-gen.
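
To make the representation concrete, the sketch below reconstructs a closed 2D contour from elliptical Fourier coefficients using the standard harmonic expansion, then produces a short sequence of contours by blending two coefficient sets. The linear blend is only a placeholder for a temporal model. The abstract does not describe how the paper generates the coefficient time series, and all names and values here are hypothetical.

```python
import math


def efd_contour(coeffs, center=(0.0, 0.0), num_points=64):
    """Sample a closed contour from EFD coefficients.

    coeffs is a list of (a_n, b_n, c_n, d_n) for harmonics n = 1..N, with
    x(t) = x0 + sum(a_n cos(2 pi n t) + b_n sin(2 pi n t)) and
    y(t) = y0 + sum(c_n cos(2 pi n t) + d_n sin(2 pi n t)) for t in [0, 1).
    """
    x0, y0 = center
    points = []
    for k in range(num_points):
        t = k / num_points
        x, y = x0, y0
        for n, (a, b, c, d) in enumerate(coeffs, start=1):
            angle = 2.0 * math.pi * n * t
            x += a * math.cos(angle) + b * math.sin(angle)
            y += c * math.cos(angle) + d * math.sin(angle)
        points.append((x, y))
    return points


def lerp_coeffs(c0, c1, alpha):
    """Linear blend of two coefficient lists (placeholder temporal model)."""
    return [tuple((1 - alpha) * u + alpha * v for u, v in zip(h0, h1))
            for h0, h1 in zip(c0, c1)]


# Toy shapes (assumption): a unit circle morphing into an ellipse over 5 frames.
circle = [(1.0, 0.0, 0.0, 1.0)]
ellipse = [(2.0, 0.0, 0.0, 1.0)]
frames = [efd_contour(lerp_coeffs(circle, ellipse, k / 4)) for k in range(5)]
print(len(frames), len(frames[0]))  # 5 64
```

Because every frame is a small, fixed-length coefficient vector, a sequence of frames is a multivariate time series, which is the form the abstract says the method operates on.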

2605.22561 2026-05-22 cs.LG

Regret-Based $(ε,δ)$-optimal Stopping Criteria for Bayesian Optimization

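Haowei Wang, Jingyi Wang, Qiyu Wei

Affiliations National University of Singapore; Lawrence Livermore National Laboratory; The University of Manchester

AI summary This paper proposes a stopping criterion based on a tighter instantaneous regret bound for the Gaussian process upper confidence bound (GP-UCB), ensuring an ε-optimal solution with high probability 1-δ upon termination, and validates its effectiveness through numerical experiments.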

Comments 21 pages

Abstract

Bayesian optimization (BO) is a widely used iterative black-box optimization method that utilizes Gaussian process (GP) surrogate models. In practice, BO is typically terminated after a fixed evaluation budget is exhausted, which can incur unnecessary cost and provides no optimality guarantee on solution quality. Recent research in developing a practical stopping criterion has made empirical progress, yet a theoretically sound stopping criterion remains a work in progress. In this work, we present provably tighter instantaneous regret bounds for GP upper confidence bound (GP-UCB) at any given iteration. Then, we propose stopping criteria for GP-UCB based on this tighter bound that ensures an $ε$-optimal solution with high probability $1-δ$ upon termination. Numerical experiments are performed to validate and demonstrate the effectiveness and efficiency of our stopping criteria.

2605.22558 2026-05-22 cs.CV

GeoWeaver: Grounding Visual Tokens with Geometric Evidence before Scene Reasoning

Deshui Miao, Xingsen Huang, Yameng Gu, Xin Li, Haijun Zhang, Ming-Hsuan Yang

Affiliations Harbin Institute of Technology, Shenzhen; Pengcheng Laboratory; Pazhou Lab (Huangpu); Hainan University; University of California at Merced

AI summary This paper proposes GeoWeaver, a framework that grounds visual tokens with geometric evidence before scene reasoning, improving spatial reasoning while preserving multimodal capabilities.

Abstract

Spatio-temporal reasoning in vision-language models requires visual representations that preserve physical geometry rather than merely semantic appearance. Recent multimodal models incorporate geometric information through structural branches, 3D-aware supervision, reasoning-stage fusion, or long-horizon memory. While these approaches demonstrate the importance of geometry for spatial intelligence, they typically treat geometric cues as a shared signal across all visual tokens. We note that this overlooks a finer-grained challenge: different visual tokens require different geometric evidence depending on their spatial roles. To address this limitation, we introduce GeoWeaver, a pre-reasoning geometric grounding framework that treats geometry as a representational prerequisite for spatio-temporal reasoning. GeoWeaver constructs a multi-level geometry bank from a frozen geometry encoder and performs token-adaptive geometric evidence allocation, enabling each visual token to retrieve the most relevant geometric abstractions. The selected evidence is incorporated into visual tokens via a residual grounding operation prior to language modeling, yielding geometry-grounded representations for downstream reasoning. Extensive evaluations on spatial reasoning benchmarks demonstrate that GeoWeaver consistently enhances geometry-aware reasoning while retaining general multimodal capabilities. This indicates that geometric information yields the greatest benefit not as a late-fusion auxiliary signal but as a fundamental prerequisite that shapes the representational foundation on which large language models perform reasoning. All source code and models will be released at https://github.com/yahooo-m/GeoWeaver .

2605.22556 2026-05-22 cs.LG

ImplicitTerrainV2: Wavelet-Guided Spatially Adaptive Neural Terrain Representation

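Haoan Feng, Xin Xu, Leila De Floriani

Affiliations University of Maryland, College Park

AI summary This paper proposes ImplicitTerrainV2, which combines a spectral control mechanism, wavelet-guided spatial adaptivity, derivative-aware supervision, and post-training model compression to obtain a compact, efficient neural terrain data format, improving the accuracy and efficiency of terrain analysis.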

Comments 14 pages, 8 figures

Abstract

Digital elevation models (DEMs) underpin terrain analysis in Geographic Information Systems (GIS), but in their common raster form, they rely on interpolation for off-grid sampling and finite-difference operators for derivative-based analysis. Implicit neural representations (INRs) offer a continuous alternative, but prior terrain INRs lack explicit frequency control, neglect the gradient structure of terrain, and remain too large and costly to train for practical deployment. We present ImplicitTerrainV2, which advances terrain INRs toward a compact, efficient neural terrain data format by combining a spectral control mechanism with wavelet-guided spatial adaptivity, derivative-aware supervision, and post-training model compression. At its core, a wavelet complexity field (WCF) derives spatially-adaptive frequency masks from analytically computed wavelet coefficients, localizing high-frequency capacity to complex terrain regions. The same field guides complexity-aware adaptive sampling that concentrates training in high-complexity regions, while gradient matching applies extra supervision to enforce the smooth manifold structure of terrain DEMs for improved derivative fidelity. Post-training mixed-precision quantization and entropy coding reduce storage to 1.23 bpp with a 0.28 dB PSNR drop. On 50 Swiss terrain tiles, ImplicitTerrainV2 reaches 66.25 dB end-to-end PSNR, improving over the prior work by 5.70 dB while using 3.2x fewer parameters and training in 55 s per tile on a single GPU. Our compressed neural format is competitive with several established DEM codecs in rate-distortion performance, while additionally supporting off-grid point queries, closed-form derivative evaluation, and resolution-independent reconstruction, which may benefit many downstream GIS applications.
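
The PSNR figures above can be put in perspective with the usual definition, PSNR = 10 log10(range² / MSE). The sketch below assumes that form. Which peak value the paper uses for elevation data is not stated in the abstract, so `data_range` is an explicit argument, and the helper names are hypothetical.

```python
import math


def psnr(reference, reconstruction, data_range):
    """Peak signal-to-noise ratio in dB for two equal-length value sequences."""
    mse = sum((r - x) ** 2 for r, x in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(data_range ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """Factor by which MSE shrinks when PSNR rises by gain_db (same data_range)."""
    return 10.0 ** (gain_db / 10.0)


# Toy elevations (assumption): unit error on a range of 10 gives MSE = 1.
print(psnr([0.0, 10.0], [1.0, 9.0], data_range=10.0))  # 20.0

# A 5.70 dB gain corresponds to roughly a 3.7x lower mean squared error.
print(mse_ratio_for_gain(5.70))
```

Under the same assumption, the quoted 0.28 dB drop after quantization and entropy coding corresponds to only a few percent increase in MSE.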

2605.22552 2026-05-22 cs.CV cs.MM

FashionLens: Toward Versatile Fashion Image Retrieval via Task-Adaptive Learning

Haokun Wen, Xuemeng Song, Xinghao Xie, Xiaolin Chen, Xiangyu Zhao, Weili Guan

Affiliations School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen); Department of Data Science, City University of Hong Kong; Department of Computer Science and Engineering, Southern University of Science and Technology; School of Artificial Intelligence, Nanjing University; Institute of Data Science, National University of Singapore; School of Information Science and Technology, Harbin Institute of Technology (Shenzhen); Shenzhen Loop Area Institute

AI summary This paper proposes the FashionLens framework, which achieves versatile fashion image retrieval through task-adaptive learning and addresses the inability of existing methods to handle diverse retrieval needs.

Abstract

Fashion image retrieval is a cornerstone of modern e-commerce systems. A unified framework that supports diverse query formats and search intentions is highly desired in practice. However, existing approaches focus on narrow retrieval tasks and do not fully capture such diversity. Therefore, in this work, we aim to develop a unified framework capable of handling diverse realistic fashion retrieval scenarios, achieving truly versatile fashion image retrieval. To establish a data foundation, we first introduce U-FIRE, a comprehensive benchmark that consolidates fragmented fashion datasets into a unified collection, supplemented by two manually curated datasets for testing generalization. Building upon this, we propose FashionLens, a unified framework based on Multimodal Large Language Models. To handle divergent matching objectives, we design a Proposal-Guided Spherical Query Calibrator that dynamically shifts query representations into task-aligned metric spaces via adaptive spherical linear interpolation. Additionally, to mitigate the optimization imbalance caused by varying task complexities and data scales, we develop a Gradient-Guided Adaptive Sampling strategy that automatically re-weights tasks based on realtime learning difficulty and the data scale prior. Experiments on U-FIRE show that FashionLens achieves state-of-the-art performance across diverse retrieval scenarios and generalizes robustly to unseen tasks. The data and code are publicly released at https://github.com/haokunwen/FashionLens.
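
The Proposal-Guided Spherical Query Calibrator is said to rely on adaptive spherical linear interpolation. The sketch below shows plain slerp between two unit vectors. How FashionLens chooses the interpolation weight adaptively, and which embeddings serve as endpoints, is not described in the abstract, so `t` is simply an argument here and the vectors are toy values.

```python
import math


def slerp(u, v, t):
    """Spherical linear interpolation between unit vectors u and v.

    Assumes u and v are unit length and not antipodal.
    """
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(u, v))))
    omega = math.acos(dot)  # angle between the two vectors
    if omega < 1e-8:
        # Nearly parallel: a plain linear blend is accurate enough.
        return [(1 - t) * a + t * b for a, b in zip(u, v)]
    s = math.sin(omega)
    w_u = math.sin((1 - t) * omega) / s
    w_v = math.sin(t * omega) / s
    return [w_u * a + w_v * b for a, b in zip(u, v)]


# Toy endpoints (assumption): two orthogonal unit vectors.
query = [1.0, 0.0]
proposal = [0.0, 1.0]
mid = slerp(query, proposal, 0.5)
print(mid, math.hypot(*mid))  # the midpoint stays on the unit circle
```

Unlike a straight linear blend, the interpolated vector keeps unit norm, which matters when retrieval scores are cosine similarities.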

2605.22550 2026-05-22 cs.CV

MOTOR: A Multimodal Dataset for Two-Wheeler Rider Behavior Understanding

MOTOR: 两轮车骑行行为理解的多模态数据集

Varun A. Paturkar, Shankar Gangisetty, C. V. Jawahar

发表机构 * CVIT, IIIT-Hyderabad(IIIT-海得拉巴学院计算机视觉研究所)

AI总结 本文提出MOTOR数据集,用于研究密集无结构交通中的两轮车骑行行为,通过多视角、多模态数据融合,为骑行行为分析和合法性感知预测提供新的研究基准。

详情
AI中文摘要

在全球南方国家,两轮车在道路致死事故中所占比例显著偏高。然而,关于两轮车骑行行为的研究远远落后于四轮车,后者的多模态数据集已推动高级驾驶辅助系统(ADAS)取得重大进展。为填补这一空白,我们提出了MOtorized TwO-wheeler Rider(MOTOR)数据集,这是首个专门面向密集无结构交通中两轮车的大规模、多视角、多模态资源。MOTOR包含1,629个序列(25多个小时的视频数据),采集自16名骑行者,整合了同步的前视、后视和头盔视频、来自可穿戴追踪器的骑行者眼动注视数据、道路音频以及遥测数据(GPS、加速度计、陀螺仪)。丰富的标注涵盖交通情境、骑行者状态、12种骑行动作(包括常规和非常规行为)以及合法性标签(合法、非法、未指定)。我们使用最先进的视频动作识别骨干网络(基于CNN和Transformer),并辅以多模态融合,对骑行行为识别和动作合法性分类进行了基准测试,发现结合RGB、注视和遥测数据始终能获得最佳性能。MOTOR因此为推进对两轮车骑行的安全关键理解提供了独特基础。它为研究社区提供了一个基准,用于开发和评估面向行为分析、合法性感知预测和智能交通系统的模型。数据集和代码可在https://varuniiith.github.io/MOTOR-Dataset/获取。

英文摘要

Two-wheelers account for a disproportionately high share of road fatalities in the Global South. Research on two-wheeler rider behavior, however, lags far behind that on four-wheelers, where multimodal datasets have driven major advances in Advanced Driver Assistance Systems (ADAS). To address this gap, we present the MOtorized TwO-wheeler Rider (MOTOR) dataset, the first large-scale, multi-view, multimodal resource dedicated to two-wheelers in dense, unstructured traffic. MOTOR comprises 1,629 sequences (25+ hours of video data) collected from 16 riders and integrates synchronized front, rear, and helmet videos, rider eye-gaze from wearable trackers, on-road audio, and telemetry (GPS, accelerometer, gyroscope). Rich annotations capture traffic context, rider state, 12 riding maneuvers spanning conventional and unconventional behaviors, and legality labels (Legal, Illegal, Unspecified). We benchmark rider behavior recognition and maneuver legality classification using state-of-the-art video action recognition backbones (CNN and Transformer-based), extended with multimodal fusion, and find that combining RGB, gaze, and telemetry consistently yields the best performance. MOTOR thus provides a unique foundation for advancing safety-critical understanding of two-wheeler riding. It offers the research community a benchmark to develop and evaluate models for behavior analysis, legality-aware prediction, and intelligent transportation systems. The dataset and code are available at https://varuniiith.github.io/MOTOR-Dataset/

2605.22544 2026-05-22 cs.CL cs.IR

One prompt is not enough: Instruction Sensitivity Undermines Embedding Model Evaluation

一个提示不够:指令敏感性削弱了嵌入模型评估

Yevhen Kostiuk, Kenneth Enevoldsen

发表机构 * Aarhus University(奥胡斯大学)

AI总结 本文研究了单提示评估在指令调优嵌入模型中的不足,发现默认提示可能系统性低估或高估性能,并指出排行榜对提示选择不鲁棒,建议通过多提示评估或报告敏感性来改进基准测试。

详情
AI中文摘要

指令嵌入模型已成为最先进模型中的常见选择,但通常仅使用单个提示进行评估。单点评估忽略了指令方法的主要问题,即对指令措辞的敏感性。我们对6个嵌入模型、11个数据集和每个数据集15个任务特定提示进行了实证研究,共990个案例。我们发现报告的分数无法代表在合理提示下的分数分布。默认提示既可能系统性低估也可能高估性能。此外,我们发现排行榜对提示选择不鲁棒:通过选择有利的提示,研究中的任何模型都可以被提升到首位。我们的发现表明,单提示评估不足以评估指令调优的嵌入模型,基准测试应纳入提示鲁棒性,通过多提示评估或报告敏感性来改进。

英文摘要

Instruction embedding models have become common among state-of-the-art models, but they are evaluated using a single prompt per task. This single-point evaluation ignores a main problem of the instruction-based approach, namely sensitivity to the phrasing of the instruction. We present an empirical study of prompt sensitivity across 6 embedding models, 11 datasets, and 15 task-specific prompts per dataset, for a total of 990 model-dataset-prompt combinations. We show that reported scores misrepresent the distribution of scores over plausible prompts. The default prompt can systematically either understate or overstate performance. Furthermore, we show that the leaderboard ranking is not robust to prompt selection: by choosing prompts favorably, any model in our study can be promoted to first place. Our findings suggest that single-prompt evaluation is insufficient for instruction-tuned embedding models and that benchmarks should incorporate prompt robustness, either by evaluating over multiple prompts or by reporting sensitivity alongside point estimates.
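
To make the counting and the ranking argument concrete, here is a small sketch of how sensitivity could be reported alongside a point estimate. The model names and scores in `toy_scores` are invented for illustration and are not results from the paper; the helper names are likewise assumptions, and the first-place check is a simplified single-dataset version of the abstract's claim.

```python
from statistics import mean


def n_cases(models, datasets, prompts_per_dataset):
    """Number of model-dataset-prompt combinations in a full grid."""
    return models * datasets * prompts_per_dataset


def sensitivity(scores):
    """Summarize each model's score distribution over prompts.

    scores: dict mapping model name -> list of scores, one per prompt.
    """
    return {
        model: {
            "min": min(vals),
            "mean": mean(vals),
            "max": max(vals),
            "range": max(vals) - min(vals),
        }
        for model, vals in scores.items()
    }


def can_rank_first(model, scores):
    """True if some prompt choice puts `model` above every other model.

    Most favorable selection: the model's best prompt versus each
    competitor's worst prompt.
    """
    best = max(scores[model])
    return all(best > min(vals) for m, vals in scores.items() if m != model)


# Hypothetical scores for three models under three prompts (not real data).
toy_scores = {
    "model_a": [0.61, 0.66, 0.70],
    "model_b": [0.63, 0.65, 0.68],
    "model_c": [0.58, 0.64, 0.69],
}

total = n_cases(6, 11, 15)  # the grid size stated in the abstract
report = sensitivity(toy_scores)
promotable = {m: can_rank_first(m, toy_scores) for m in toy_scores}
```

In this toy setup every model can be ranked first under some prompt choice, because the score ranges overlap. That overlap is the kind of fragility a single-prompt leaderboard would hide.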