arXivDaily: arXiv daily academic digest, updated Monday to Friday
2605.22649 2026-05-22 cs.CV cs.LG

From Baseline to Follow-Up: Counterfactual Spine DXA Image Synthesis in UK Biobank Using a Causal Hierarchical Variational Autoencoder

Yilin Zhang, Nicholas C. Harvey, Nicholas R. Fuggle, Rahman Attar

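AI summary: This paper proposes a metadata-conditioned causal hierarchical variational autoencoder for causally consistent generation of spine DXA images in UK Biobank. Causal consistency is evaluated in a baseline-to-follow-up setting, and key vertebral morphometry variables show strong agreement under age intervention, supporting anatomically plausible DXA image synthesis.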

Comments
7 pages, 4 figures, 3 tables. Accepted at the 48th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC 2026)
English abstract

Dual-energy X-ray absorptiometry (DXA) is widely used for large-scale skeletal assessment, yet learning controllable and interpretable factor-specific anatomical variation remains challenging. We propose a metadata-conditioned causal hierarchical variational autoencoder (CHVAE) for causally consistent generation of anteroposterior (AP) spine DXA images from the UK Biobank (UKB). The model is trained on 3,743 raw AP spine scans from the first imaging visit and conditioned on basic participant attributes and lumbar morphometry. Causal consistency is evaluated in a baseline-to-follow-up setting using abduction-action-prediction (AAP): latent variables are abducted from baseline images, age is intervened to the repeat-imaging value, and the resulting counterfactual follow-up morphometry is compared with observed repeat-imaging measurements. Results show strong absolute-level agreement for key vertebral morphometry variables under age intervention, supporting intervention-aligned synthesis of anatomically plausible DXA images.
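
The three AAP steps named in the abstract can be illustrated on a toy structural model. The sketch below is not the CHVAE: it assumes a hypothetical one-variable linear mechanism, `height_mm = A + B * age + u`, where `u` is an individual noise term, and the coefficients `A` and `B` are made up for illustration.

```python
# Toy abduction-action-prediction (AAP) on a hypothetical linear mechanism.
# Assumed model (not from the paper): height_mm = A + B * age + u

A = 30.0   # hypothetical intercept (mm)
B = -0.05  # hypothetical change per year of age (mm/year)


def abduct(age, observed_height):
    """Abduction: recover the individual noise term u from the baseline."""
    return observed_height - (A + B * age)


def predict(age, u):
    """Prediction: push the (possibly intervened) age through the mechanism."""
    return A + B * age + u


def counterfactual_height(baseline_age, baseline_height, followup_age):
    u = abduct(baseline_age, baseline_height)  # 1. abduction
    intervened_age = followup_age              # 2. action: do(age = follow-up age)
    return predict(intervened_age, u)          # 3. prediction


if __name__ == "__main__":
    print(round(counterfactual_height(60.0, 27.5, 64.0), 2))
```

In an evaluation of the kind the abstract describes, the counterfactual value would then be compared with the measurement actually observed at the repeat visit.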

2605.22645 2026-05-22 cs.AI

AtelierEval: Agentic Evaluation of Humans & LLMs as Text-to-Image Prompters

Hanjun Luo, Zhimu Huang, Sylvia Chung, Yiran Wang, Yingbin Jin, Jialin Li, Jiang Li, Xinfeng Li, Hanan Salam

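AI summary: This paper introduces AtelierEval, the first unified benchmark that quantifies prompting proficiency across 360 expert-crafted tasks, together with a skill-based, memory-augmented agentic evaluator that correlates strongly with human experts. The results favor an image-augmented direction for future prompters.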

Comments
Accepted by ICML 2026
English abstract

Text-to-image (T2I) systems increasingly rely on upstream prompters, either humans or multimodal large language models (MLLMs), to translate user intent into detailed prompts. Yet current benchmarks fix the prompt and only evaluate T2I models, leaving the prompting proficiency of this upstream component entirely unmeasured. We introduce AtelierEval, the first unified benchmark that quantifies prompting proficiency across 360 expert-crafted tasks. Grounded in a cognitive view, it spans three task categories and instantiates tasks using a taxonomy of real-world challenges, with a dual interface for both humans and MLLMs. To enable scalable and reliable evaluation, we propose AtelierJudge, a skill-based, memory-augmented agentic evaluator. It produces subjective and objective scores for prompt-image pairs, achieving a Spearman correlation of 0.79 with human experts, approaching human performance. Extensive experiments benchmark 8 MLLMs against 48 human users across 4 T2I backends, validate AtelierEval as a robust diagnostic tool, and reveal the superiority of mimicry over planning, advocating for an image-augmented direction for future prompters. Our work is released to support future research.

2605.22644 2026-05-22 cs.LG

Why SGD is not Brownian Motion: A New Perspective on Stochastic Dynamics

Igor Ignashin, Anna Radovskaya, Andrew Semenov, Egor Lopatin, Stanislav Potapov, Aleksandr Kovalenko, Andrey Veprikov, Aleksandr Shestakov, Andrey Leonidov, Aleksandr Beznosikov

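AI summary: Starting from the discrete update, this paper recasts SGD as deterministic dynamics in a fluctuating loss landscape, characterizes SGD behavior near critical points, and validates the predictions experimentally on neural network models.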

Comments
Preprint
English abstract

Stochastic Gradient Descent (SGD) is commonly modeled as a Langevin process, assuming that minibatch noise acts as Brownian motion. However, this approximation relies on a continuous-time limit and a sqrt(eta) noise scaling that does not match the discrete SGD update at finite learning rate. In this work, we propose an alternative formulation of SGD as deterministic dynamics in a fluctuating loss landscape induced by minibatch sampling. Starting directly from the discrete update, we derive a master equation for the parameter distribution and obtain a discrete Fokker-Planck equation that differs from the standard Langevin form at order eta^2. Using this framework, we analyze SGD dynamics near critical points of the loss. We show that the behavior decomposes along the eigenbasis of the mean Hessian into qualitatively distinct regimes. In particular, nearly-flat directions do not admit a stationary distribution: the variance grows over time, corresponding to effective diffusion along valleys with a coefficient proportional to the learning rate. We provide empirical evidence supporting these predictions on neural network models in computer vision and natural language processing, observing a clear qualitative separation between confined and diffusive modes.
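
The contrast between confined and diffusive directions can be reproduced in a one-dimensional toy model. The sketch below is an illustration under simplifying assumptions, not the paper's framework: the gradient along one Hessian eigendirection is taken to be `curvature * theta` plus Gaussian noise of fixed standard deviation `sigma`. For this toy update the variance settles at eta * sigma^2 / (h * (2 - eta * h)) when the curvature h is positive, and grows as eta^2 * sigma^2 * t when h is zero.

```python
import random


def sgd_variance(curvature, eta=0.1, sigma=1.0, steps=200, runs=2000, seed=0):
    """Empirical variance of theta after `steps` SGD updates, over `runs` runs.

    Each update is theta <- theta - eta * (curvature * theta + noise),
    where noise ~ N(0, sigma^2) stands in for minibatch gradient noise.
    """
    rng = random.Random(seed)
    finals = []
    for _ in range(runs):
        theta = 0.0
        for _ in range(steps):
            grad = curvature * theta + rng.gauss(0.0, sigma)
            theta -= eta * grad
        finals.append(theta)
    mean = sum(finals) / runs
    return sum((x - mean) ** 2 for x in finals) / runs


if __name__ == "__main__":
    print("confined direction (h=1):", sgd_variance(1.0))
    print("flat direction     (h=0):", sgd_variance(0.0))
```

With the default values, the confined direction should stay near 0.1 / 1.9, while the flat direction keeps spreading and reaches a variance near 2.0 after 200 steps.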

2605.22642 2026-05-22 cs.AI

Spreadsheet-RL: Advancing Large Language Model Agents on Realistic Spreadsheet Tasks via Reinforcement Learning

Banghao Chi, Yining Xie, Mingyuan Wu, Jingcheng Yang, Jize Jiang, Zhaoheng Li, Shengyi Qian, Minjia Zhang, Klara Nahrstedt, Rui Hou, Xiangjun Fan, Hanchao Yu

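AI summary: This paper presents Spreadsheet-RL, a reinforcement learning fine-tuning framework for training specialized spreadsheet agents in a realistic Microsoft Excel environment. An automated pipeline collects paired start-goal spreadsheets from online forums along with domain-specific evaluation tasks in areas such as finance and supply chain management, which form the new Domain-Spreadsheet benchmark. The method yields substantial gains on both general and domain-specific spreadsheet tasks.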

Comments
Mingyuan served as the project lead. Banghao, Yining, and Mingyuan contributed equally to this work, with more junior authors listed before senior authors. All data and code releases are maintained by the corresponding authors at UIUC and are not affiliated with Meta
English abstract

Spreadsheet systems (e.g., Microsoft Excel, Google Sheets) play a central role in modern data-centric workflows. As AI agents grow increasingly capable of automating complex tasks, such as controlling computers and generating presentations, building an AI-driven spreadsheet agent has emerged as a promising research direction. Most existing spreadsheet agents rely on specialized prompting over general-purpose LLMs; while this design shows promise on simple spreadsheet operations, it struggles to manage the complex, multi-step workflows typical of real-world applications. We introduce Spreadsheet-RL, a reinforcement learning (RL) fine-tuning framework designed to train specialized spreadsheet agents within a realistic Microsoft Excel environment. Spreadsheet-RL features an automated pipeline for scalable collection of paired start-goal spreadsheets from online forums, as well as domain-specific evaluation tasks in areas such as finance and supply chain management, which we compile into the new Domain-Spreadsheet benchmark dataset. It also includes a Spreadsheet Gym environment designed for multi-turn RL: Spreadsheet Gym exposes extensive Excel functionality through a Python sandbox, along with a refined harness that incorporates a comprehensive tool set and carefully designed tool-routing rules for spreadsheet tasks. Through comprehensive experiments, we show that Spreadsheet-RL substantially enhances AI agents' performance on both general and domain-specific spreadsheet tasks: it improves Qwen3-4B-Thinking-2507's Pass@1 on SpreadsheetBench from 12.0% to 23.4%, and raises Pass@1 from 8.4% to 17.2% on our curated Domain-Spreadsheet dataset. These results highlight Spreadsheet-RL's strong potential for generalization and real-world adoption in spreadsheet automation and, more broadly, its promise for advancing LLM-based interactions with data interfaces in everyday work.

2605.22633 2026-05-22 cs.RO

SE3Kit: A Lightweight Python Library for Specialized Geometric Primitives in Robotics

Daniyal Maroufi, Omid Rezayof, Farshid Alambeigi

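AI summary: This paper presents SE3Kit, a lightweight Python library for efficient operations on the Special Euclidean Group SE(3) and the Special Orthogonal Group SO(3). It offers a mathematically rigorous implementation suited to embedded deployment, rapid prototyping, and education.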

English abstract

The Python robotics ecosystem faces a challenge: while many libraries exist for rigid body transformations, few are both lightweight and mathematically strict. This paper introduces SE3Kit, a lightweight Python library for efficient operations on the Special Euclidean Group SE(3) and the Special Orthogonal Group SO(3). Unlike established frameworks that require heavy dependencies (e.g., SpatialMath, PyPose) or general tools that lack robotics-specific features (e.g., SciPy), SE3Kit targets the gap between these extremes. It is designed for embedded deployment, rapid prototyping, and education while providing rigorous mathematical implementation. It provides a pure-Python, NumPy-only implementation of Lie Group operations, without the overhead of deep learning or other visualization software.
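
For readers unfamiliar with the group operations involved, the sketch below shows composition, inversion, and point transformation for SE(3) elements stored as a (rotation, translation) pair. It does not use or reproduce SE3Kit's API, which the abstract does not describe; every function name here is hypothetical, and plain lists are used instead of NumPy only to keep the example dependency-free.

```python
import math


def rot_z(theta):
    """Rotation matrix in SO(3) about the z-axis by `theta` radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def mat_vec(R, v):
    """3x3 matrix times 3-vector."""
    return [sum(R[i][k] * v[k] for k in range(3)) for i in range(3)]


def mat_mul(A, B):
    """3x3 matrix product."""
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def transpose(R):
    return [[R[j][i] for j in range(3)] for i in range(3)]


def compose(T1, T2):
    """(R1, t1) * (R2, t2) = (R1 R2, R1 t2 + t1)."""
    R1, t1 = T1
    R2, t2 = T2
    rotated = mat_vec(R1, t2)
    return mat_mul(R1, R2), [rotated[i] + t1[i] for i in range(3)]


def inverse(T):
    """(R, t)^-1 = (R^T, -R^T t), valid because R is orthogonal."""
    R, t = T
    Rt = transpose(R)
    return Rt, [-x for x in mat_vec(Rt, t)]


def transform_point(T, p):
    """Apply the rigid transform T to the point p: R p + t."""
    R, t = T
    rotated = mat_vec(R, p)
    return [rotated[i] + t[i] for i in range(3)]


if __name__ == "__main__":
    T = (rot_z(math.pi / 2), [1.0, 2.0, 3.0])
    print(transform_point(T, [1.0, 0.0, 0.0]))
```

Composing a transform with its inverse should return the identity, which gives a quick sanity check for any implementation of these operations.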

2605.22631 2026-05-22 cs.CV

AtomicMotion: Learning Human Motion From Different Human Parts

Runzhen Liu, Chuhua Xian, Fa-Ting Hong

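AI summary: This work proposes AtomicMotion, a framework that decouples and re-integrates body dynamics to reconstruct full-body poses from sparse head and hand trajectories. Its core components are logical body partitioning, a masked full-body pre-conditioning strategy, and a kinematic attention mechanism; experiments on the AMASS dataset show that it significantly outperforms existing baselines.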

English abstract

Accurately reconstructing full-body poses from sparse head and hand trajectories is a foundational challenge for immersive AR/VR telepresence. Current methods often struggle with error accumulation and unnatural joint coordination, primarily because they treat the human body as a monolithic entity, thereby failing to capture the fine-grained "atomic intents" embedded in subtle signal variations and overlooking the inherent structural topology. To bridge this gap, we present AtomicMotion, a framework designed to decouple and re-integrate body dynamics through three core innovations. First, we introduce a logical body partitioning scheme that decomposes the skeleton into five distinct clusters based on functional intent; this ensures that each partition preserves internal joint synergies while isolating local motion primitives. Second, to robustly map sparse inputs to high-dimensional poses, we employ a masked full-body pre-conditioning strategy during training, forcing the model to internalize global skeletal topology and latent kinematic constraints. Finally, addressing the limitations of vanilla spatial attention, which often ignores fixed physiological connectivity, we propose Kinematic Attention. By embedding the classical kinematic tree structure into the attention mechanism, we ensure biological plausibility in the synthesized motions. Extensive evaluations on the AMASS dataset demonstrate that AtomicMotion significantly outperforms existing baselines, yielding higher reconstruction fidelity and superior biomechanical realism.
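
The abstract does not say how the kinematic tree enters the attention mechanism. One plausible reading, sketched below purely as an assumption, is an attention mask that lets each joint attend only to itself, its parent, and its children. The five-joint skeleton `PARENTS` is a made-up toy, not the paper's partitioning.

```python
import math

# Hypothetical toy skeleton: index -> parent index (-1 marks the root).
PARENTS = [-1, 0, 1, 1, 1]  # pelvis, spine, head, left hand, right hand


def kinematic_mask(parents):
    """mask[i][j] is True when joint i may attend to joint j
    (itself, its parent, or one of its children)."""
    n = len(parents)
    mask = [[i == j for j in range(n)] for i in range(n)]
    for child, parent in enumerate(parents):
        if parent >= 0:
            mask[child][parent] = True
            mask[parent][child] = True
    return mask


def masked_softmax(scores, mask_row):
    """Softmax over one row of attention scores, ignoring masked-out joints."""
    allowed = [s for s, m in zip(scores, mask_row) if m]
    peak = max(allowed)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) if m else 0.0 for s, m in zip(scores, mask_row)]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    mask = kinematic_mask(PARENTS)
    print(masked_softmax([0.0] * 5, mask[2]))  # attention row for the head joint
```

Joints that are not connected in the tree receive exactly zero weight, so attention can only flow along physical links.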

2605.22629 2026-05-22 cs.CV

H-Flow: Self-supervised Human Scene Flow via Physics-inspired Joint Multi-modal Learning

Zhanbo Huang, Xiaoming Liu, Yu Kong

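AI summary: This paper proposes H-Flow, a dense human scene flow method that captures both skeletal kinematics and surface deformation, trained without labels through physics-inspired joint multi-modal learning. It also introduces DynAct4D, a high-fidelity synthetic benchmark, and outperforms existing methods on standard benchmarks and in zero-shot settings.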

Comments
19 pages, 7 figures, 4 tables
English abstract

Parametric human models capture global pose but cannot represent the non-rigid surface dynamics of clothing and soft tissue. Generic scene flow estimates dense motion but breaks down on articulated bodies, where pixel-level supervision is also intractable to acquire. We introduce H-Flow, a dense human scene flow that captures both skeletal kinematics and surface deformation. A unified multi-head transformer estimates flow from monocular video, jointly predicting pose and depth as companion outputs. The challenge lies in the lack of supervision. In place of unattainable labels, we anchor the network in the physics of human motion, encoding geometric, structural, and biomechanical priors as cross-modal training objectives. We further introduce DynAct4D, a high-fidelity synthetic benchmark providing dense flow annotations across diverse subjects, garments, and motions. On standard benchmarks, H-Flow outperforms scene-flow and parametric baselines, and generalizes zero-shot to in-the-wild video. Code, models, and the DynAct4D benchmark will be released upon publication.

2605.22622 2026-05-22 cs.LG math.OC

A note on convergence of Wasserstein policy optimization

David Šiška, Yufei Zhang

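AI summary: This note studies the convergence of Wasserstein policy optimization in continuous state and action spaces. Using mean-field analysis and log-Sobolev inequalities, it argues that WPO converges linearly to the global optimum within the framework of entropy-regularised Markov decision processes.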

English abstract

Wasserstein Policy Optimization (WPO) is a recently proposed reinforcement learning algorithm that leverages Wasserstein gradient flows to optimize stochastic policies in continuous action spaces. Despite its empirical success, the theoretical convergence properties of WPO in environments with continuous state and action spaces have yet to be fully established. In this note, we argue that WPO within the framework of entropy-regularised Markov Decision Processes converges linearly. This is done by leveraging recent advances in mean-field analysis for convergence of gradient flows using log-Sobolev inequalities. Assuming the existence of a sufficiently regular solution to the gradient flow equation, we demonstrate monotonic energy dissipation along the flow and establish a local log-Sobolev inequality. Ultimately, these properties allow us to argue that the value function should converge linearly to the global optimum.
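
The step from energy dissipation plus a log-Sobolev-type inequality to a linear rate follows a standard Grönwall argument. The schematic below uses generic notation that is an assumption of this note, not the paper's: E(t) ≥ 0 is the gap to the optimum along the flow, I(t) is the dissipation rate, and λ > 0 is the inequality constant.

```latex
\begin{align*}
\frac{d}{dt} E(t) &= -I(t) \le 0
  && \text{(energy dissipation along the flow)} \\
E(t) &\le \frac{1}{2\lambda}\, I(t)
  && \text{(log-Sobolev-type inequality)} \\
\Longrightarrow\quad \frac{d}{dt} E(t) &\le -2\lambda\, E(t)
  \quad\Longrightarrow\quad E(t) \le e^{-2\lambda t}\, E(0)
  && \text{(Gr\"onwall's inequality)}
\end{align*}
```

In continuous time, this exponential decay of the gap is what a linear convergence rate refers to.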

2605.22621 2026-05-22 cs.CR cs.LG cs.NI

UNAD+: An Explainable Hybrid Framework for Unknown Network Attack Detection

Saif Alzubi, Frederic Stahl

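AI summary: This paper proposes the UNAD+ framework, which combines an unsupervised ensemble with a supervised refinement stage and an integrated explainability layer to improve the performance and transparency of unknown network attack detection.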

English abstract

The detection of previously unseen network attacks remains a major challenge for intrusion detection systems. Although supervised learning methods often perform well on known attack classes, they are limited when new attack types are not represented in the training data. Unsupervised methods are more suitable for detecting zero-day attacks, as they do not require labelled attack samples, but they often suffer from high false positive rates, which limits their real-world usefulness. This paper presents UNAD+, an enhanced framework for unknown network attack detection derived from the previously proposed Unknown Network Attack Detector (UNAD). UNAD+ combines a benign-only unsupervised ensemble with Weighted Majority Voting (WMV), a supervised refinement stage trained on pseudo-labelled detections, and a post hoc explainability layer that provides both local and global explanations. The framework was evaluated on the CICIDS2017 and NSL-KDD benchmark datasets. The results show that UNAD+ improves on the original UNAD framework, achieving F1-scores above 98% across the benchmark datasets while significantly reducing false positives and enhancing transparency and deployment suitability through integrated explainability.
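
Weighted majority voting, the rule the abstract names for combining the unsupervised detectors, can be sketched in a few lines. The weights and the 0.5 threshold below are hypothetical; the abstract does not state how UNAD+ sets them.

```python
def weighted_majority_vote(votes, weights, threshold=0.5):
    """Return 1 (attack) if the weighted share of attack votes exceeds `threshold`.

    votes   -- list of 0/1 decisions, one per detector (1 = flagged as anomalous)
    weights -- non-negative weight per detector (hypothetical values)
    """
    if len(votes) != len(weights):
        raise ValueError("votes and weights must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    attack_share = sum(w for v, w in zip(votes, weights) if v == 1) / total
    return 1 if attack_share > threshold else 0


if __name__ == "__main__":
    # One trusted detector outvotes two weaker ones.
    print(weighted_majority_vote([1, 0, 0], [3, 1, 1]))
    # With equal weights the same votes are rejected.
    print(weighted_majority_vote([1, 0, 0], [1, 1, 1]))
```

The two calls show why weighting matters: identical votes lead to different decisions depending on how much each detector is trusted.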

2605.22620 2026-05-22 cs.LG cs.CL

Two is better than one: A Collapse-free Multi-Reward RLIF Training Framework

Shourov Joarder, Diganta Sikdar, Ahsan Habib Akash, Binod Bhattarai, Prashnna Gyawali

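AI summary: This paper proposes a multi-reward RLIF framework that decomposes the training signal into an answer-level reward and a completion-level reward, combined with GDPO-based normalization and KL-Cov regularization to improve stability and robustness. On mathematical reasoning and code-generation tasks it approaches the performance of supervised RLVR methods.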

English abstract

Reinforcement learning with verifiable rewards (RLVR) has substantially improved the reasoning ability of LLMs, but often depends on external supervision from human annotations or gold-standard solutions. Reinforcement learning from internal feedback (RLIF) has recently emerged as a scalable unsupervised alternative, using signals extracted from the model itself. However, existing RLIF methods typically rely on a single internal reward, which can lead to reward hacking, entropy collapse, and degraded reasoning structure. We propose a multi-reward RLIF framework that decomposes the training signal into two complementary components: an answer-level reward based on cluster voting and a completion-level reward based on token-wise self-certainty. To combine these signals robustly, we apply GDPO-based normalization to reduce reward-scale imbalance. We further introduce KL-Cov regularization, which targets low-entropy token distributions responsible for disproportionate entropy reduction, preserving exploration and preventing late-stage collapse. Across mathematical reasoning and code-generation benchmarks, our method improves stability and robustness over prior unsupervised RL approaches, while achieving performance close to supervised RLVR methods. These results show that complementary internal rewards, combined with targeted regularization, can support stable long-horizon reasoning without relying on external ground-truth supervision. Code will be released soon.
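
A rough sketch of how two internal rewards might be combined is given below. Everything in it is a simplifying assumption: answer clusters are formed by exact match after whitespace and case normalization, the self-certainty scores are supplied as plain numbers, and an ordinary per-reward z-score stands in for the normalization step. It is not GDPO itself and omits the KL-Cov regularizer.

```python
from collections import Counter
from statistics import mean, pstdev


def vote_rewards(answers):
    """Answer-level reward: 1.0 for completions whose final answer falls in the
    largest cluster, else 0.0 (exact-match clusters are an assumption)."""
    keys = [a.strip().lower() for a in answers]
    majority, _ = Counter(keys).most_common(1)[0]
    return [1.0 if k == majority else 0.0 for k in keys]


def zscore(values):
    """Standardize one reward within a group so rewards share a common scale."""
    mu, sd = mean(values), pstdev(values)
    return [0.0 if sd == 0 else (v - mu) / sd for v in values]


def combined_rewards(answers, certainty):
    """Sum of the separately normalized answer-level and completion-level rewards."""
    return [a + c for a, c in zip(zscore(vote_rewards(answers)), zscore(certainty))]


if __name__ == "__main__":
    answers = ["42", " 42 ", "7"]
    certainty = [0.2, 0.4, 0.9]  # hypothetical token-wise self-certainty scores
    print(combined_rewards(answers, certainty))
```

Normalizing each reward before summing keeps the reward with the larger raw scale from dominating the combined signal, which is the imbalance the abstract refers to.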

2605.22619 2026-05-22 cs.CV

GLeVE: Graph-Guided Lesion Grounding with Proposal Verification in 3D CT

Shuo Jiang, Yuhao Hong, Chunbo Jiang, Weihong Chen, Huangwei Chen, Shenghao Zhu, Beining Wu, Mingxuan Liu, Zhu Zhu, Feiwei Qin, Min Tan, Yifei Chen

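AI summary: This paper proposes GLeVE, a framework that uses graph-guided lesion grounding and anatomical prior verification to bridge the semantic-spatial gap between free-text narratives and volumetric anatomy in 3D CT, improving lesion localization accuracy.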

Comments
11 pages, 4 figures
English abstract

Grounding radiology report descriptions to 3D CT volumes is essential for verifiable clinical interpretation, yet remains challenging due to the semantic-spatial gap between free-text narratives and volumetric anatomy. Existing report-assisted and vision-language grounding methods typically rely on phrase-level alignment or dense pixel supervision, resulting in limited lesion-wise correspondence and suboptimal localization accuracy. We propose GLeVE, a graph-guided lesion grounding framework with anatomical prior verification and octree-based autoregressive refinement. GLeVE treats each lesion description as an atomic semantic unit and encodes organ attribution, attributes, and inter-lesion relations through relation-aware graph reasoning to produce discriminative lesion-wise queries. Anatomy-aware proposal generation with region-level verification enforces one-to-one text-lesion alignment, while hierarchical octree refinement progressively improves boundary delineation. Experiments on AbdomenAtlas 3.0 demonstrate consistent gains over classical multimodal foundation models and report-supervised baselines in both segmentation accuracy and lesion-level localization.

2605.22613 2026-05-22 cs.LG

Evolutionary Multi-Task Optimization for LLM-Guided Program Discovery

Halil Alperen Gozeten, Xuechen Zhang, Emrullah Ildiz, Ege Onur Taga, Tara Javidi, Samet Oymak

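AI summary: This paper introduces Evolutionary Multi-Task Optimization (EMO) for LLM-guided program discovery. Its two-stage framework, EMO-STA (Shared-Then-Adapt), improves the efficiency and robustness of program discovery across multiple task families, and shared evolution is shown to reduce overfitting.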

English abstract

Recent LLM-guided evolutionary search methods have shown that iterative program mutation can discover strong algorithms, but they typically optimize each task independently, even when related tasks share reusable structure. We introduce Evolutionary Multi-Task Optimization (EMO) for LLM-guided program discovery, and propose EMO-STA (Shared-Then-Adapt), a two-stage framework that first evolves a shared archive of executable programs across a task family and then adapts selected shared candidates to each target task. Within EMO-STA, we explore multiple adaptation strategies, including warm-starting from the shared archive, adapting the best average shared program, and adapting the shared program that performs best on each target task. Across eight task families spanning continuous optimization, geometric construction, modeling, and algorithmic optimization, EMO-STA improves over matched-compute single-task evolution in most settings, with STA Best-Local providing the strongest in-distribution adaptation and STA Best-Shared yielding robust transfer to unseen tasks. Compute-allocation experiments show that allocating a substantial fraction of the family-level budget to shared evolution is consistently beneficial, with roughly balanced shared and adaptation budgets often being optimal. Beyond compute efficiency, we show that shared evolution can mitigate overfitting in low-evidence settings (e.g., limited training data), including ARC tasks and time-series feature engineering, by favoring programs that generalize across all tasks rather than exploiting task-specific brittle artifacts.

2605.22612 2026-05-22 cs.CY cs.AI cs.LG

Healthcare LLM Benchmarks Are Only as Good as Their Explicit Assumptions

Naveen Raman, Santiago Cortes-Gomez, Mateo Dulce Rubio, Fei Fang, Bryan Wilder

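AI summary: This paper argues that the evaluation-deployment gap in healthcare LLM benchmarks stems from implicit assumptions rather than poor benchmark design, and proposes BenchmarkCards and staged evaluation to address it.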

Comments
13 pages, 1 figure
English abstract

Benchmarks are necessary for healthcare evaluation, but are not sufficient for predicting deployment performance. Our position is that the evaluation-deployment gap arises not because of poorly designed benchmarks, but from implicit assumptions about how users interact with models that cannot be surfaced from benchmarks alone. To make this precise, we propose a classification of assumptions into two categories: task, which can be tested from conversation data alone, and outcome, which requires outcome data and behavioral studies for testing. Critically, outcome assumptions depend on human behavior, something that even well-designed benchmarks cannot directly observe. To demonstrate the operationality of this framework, we retrospectively analyze a healthcare RCT as a case study and find that the gap naturally separates into task and outcome gaps of roughly equal size. To address this, we make two contributions: first, we propose BenchmarkCards, an artifact that documents assumptions, and second, we propose staged evaluation, a procedure that systematically tests assumptions and evaluates performance.

2605.22611 2026-05-22 cs.LG

Benchmarking Machine Learning Architectures for Antimicrobial Stewardship in Pediatric ICUs

Niklas Raehse, Luregn J. Schlapbach, Daphné Chopard

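AI summary: This study benchmarks machine learning models for antimicrobial stewardship in pediatric ICUs, systematically evaluating four clinically relevant targets on a public dataset and a private institutional cohort. Predictive performance is driven mainly by target prevalence and dataset characteristics rather than model complexity. Sequence models improve the precision-recall trade-off at coarse resolution, finer temporal modeling adds little, and these gains come with poorer calibration.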

Comments
16 pages, 6 figures, code: https://anonymous.4open.science/r/AMS_intervention_prediction-C024
English abstract

Antimicrobial stewardship (AMS) is critical in pediatric intensive care units (PICUs), where diagnostic uncertainty often drives broad-spectrum antibiotic use, increasing antimicrobial resistance and potential long-term harms. Machine learning offers a promising approach for identifying patient-level opportunities for stewardship interventions from electronic health record data, yet prior work has focused largely on adult populations and static tabular representations. We present a systematic benchmarking study of AMS intervention prediction in the PICU across a public dataset and a private institutional cohort. We define four clinically relevant proxy targets for reducing antibiotic exposure: intravenous-to-oral switching, de-escalation, discontinuation, and short-course therapy. Under a unified evaluation framework, we compare tabular, sequence-based, and graph-based temporal models at multiple temporal resolutions. We find that predictive performance is driven primarily by target prevalence and dataset characteristics rather than model complexity. Sequence models improve the precision-recall trade-off over tabular approaches at coarse (24-hour) resolution, while finer temporal modeling provides limited additional benefit. However, these gains come at the cost of poorer calibration, with simpler tabular models yielding more reliable probability estimates. Multi-task learning produces only marginal improvements, suggesting limited shared structure across stewardship targets. Our findings highlight the importance of target design, temporal representation, and calibration in clinical machine learning, and provide practical guidance for developing reliable decision support systems for pediatric AMS.

2605.22608 2026-05-22 cs.CL cs.AI

Agentic CLEAR: Automating Multi-Level Evaluation of LLM Agents

Asaf Yehudai, Lilach Eden, Michal Shmueli-Scheuer

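AI summary: This study proposes Agentic CLEAR, a framework that automates the evaluation of LLM agents through fine-grained analysis at multiple levels, providing high-quality, data-driven feedback and predicting task success rates.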

Comments
ACL
English abstract

Agentic systems are becoming more capable: agents define strategies, take actions, and interact with different environments. This autonomy poses serious challenges for overseeing and assessing agent behavior. Most current tools are limited, focusing on observability with basic evaluation capabilities or imposing static, hand-crafted error taxonomies that cannot adapt to new domains. To address this gap, we present Agentic CLEAR, an automatic, dynamic, and easy-to-use evaluation framework. It produces textual insights into the agent behavior on three levels of granularity: system, trace, and node. Agentic CLEAR operates above the observability layer, enabling seamless integration and featuring an intuitive UI that makes agent evaluation highly accessible. In our experiments on four benchmarks, seven agentic settings, and tens of thousands of LLM calls, we show that Agentic CLEAR produces high-quality, data-driven, insightful feedback. Our analysis shows strong alignment with human-annotated errors and the ability to predict task success rate.

2605.22607 2026-05-22 cs.CV

Enhancing Gaze Reasoning in Vision Foundation Models for Gaze Following

Shijing Wang, Yaping Huang, Chaoqun Cui, David Wong, Yihua Cheng, Alexandros Neophytou, Hyung Jin Chang

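AI summary: This paper proposes a new training mechanism that strengthens gaze reasoning in vision foundation models through a local LoRA and an out-of-cone penalty, improving gaze-following performance, especially when targets are not salient.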

Comments
11 pages, 8 figures
English abstract

Gaze following requires both scene understanding and gaze reasoning to localize the gaze target of an in-scene person. Recently, vision foundation models (VFMs) have demonstrated strong performance on this task, enabling simpler architectures while outperforming prior methods. However, we observe a key limitation of VFM-based approaches: while VFMs substantially improve scene understanding, they contribute little to gaze reasoning. As a result, existing methods often rely on semantically salient objects rather than true gaze cues, leading to degraded performance when targets are not salient. To address this, we propose a novel training mechanism to enhance gaze reasoning in VFMs for gaze following. Our method includes: (1) a head-conditioned local LoRA, which enables localized adaptation to preserve scene token learning while improving head token learning for gaze reasoning; and (2) an out-of-cone penalty, which injects gaze cues into head tokens while aligning them with scene tokens. Experiments on the GazeFollow and VAT datasets demonstrate that our method achieves state-of-the-art performance, with particularly strong improvements when gaze targets are not semantically salient. Our findings offer valuable insights for advancing future gaze following research. We will release the code once the paper is accepted.

2605.22605 2026-05-22 cs.RO cs.CV

Decoupling Ego-Motion from Target Dynamics via Dual-Interval Motion Cues for UAV Detection

Liuyang Wang, Feitian Zhang

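AI summary: This paper proposes a vision-based motion-guided detection framework that uses a dual-interval motion extraction strategy and a lightweight motion-guided attention module to decouple target motion from camera-induced disturbances, improving UAV detection under severe ego-motion.
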
Abstract

Object detection from Unmanned Aerial Vehicles (UAVs) is challenged by severe ego-motion, camera jitter, and large scale variations. While modern detectors perform well on static images, their direct application to UAV video often fails, particularly for small objects in dynamic scenes. Existing motion-based methods either rely on computationally expensive optical flow or use single-interval differencing, which is sensitive to jitter and limited in capturing diverse motion patterns. We propose a vision-only motion-guided detection framework that decouples target motion from camera-induced disturbances. A homography-based Global Motion Compensation (GMC) first aligns adjacent frames. We then introduce a Dual-Interval Motion Extraction strategy that captures both short-term and long-term motion cues. To integrate these cues, a lightweight Motion-Guided Attention (MGA) module enhances feature representations within a Feature Pyramid Network. Experiments on the VisDrone-VID dataset demonstrate consistent improvements over a strong YOLOv8 baseline under severe ego-motion. Ablation studies further confirm the effectiveness of the dual-interval design and the proposed motion-guided attention mechanism.
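
The dual-interval idea in the abstract above can be illustrated with plain frame differencing. This is a minimal sketch, not the authors' implementation: it assumes the frames have already been aligned by global motion compensation (the homography warp is not shown), and the function name `dual_interval_motion` and the interval lengths 1 and 4 are hypothetical choices made for illustration.

```python
import numpy as np

def dual_interval_motion(frames, t, short=1, long=4):
    """Return (short_cue, long_cue) absolute-difference maps for frame t.

    Assumes `frames` are grayscale arrays already warped into a common
    reference frame (the GMC step is not shown). `short` and `long` are
    illustrative interval lengths, not values from the paper.
    """
    if t - long < 0:
        raise ValueError("need at least `long` earlier frames")
    cur = frames[t].astype(np.float32)  # cast first to avoid uint8 wraparound
    short_cue = np.abs(cur - frames[t - short].astype(np.float32))
    long_cue = np.abs(cur - frames[t - long].astype(np.float32))
    return short_cue, long_cue

# Toy sequence: one bright pixel that moves slowly, so it does not move
# between the last two frames but has moved two columns over four frames.
frames = []
for k in range(5):
    f = np.zeros((4, 8), dtype=np.uint8)
    f[2, (k + 1) // 2] = 255
    frames.append(f)

short_cue, long_cue = dual_interval_motion(frames, t=4)
print(short_cue.sum(), long_cue.sum())
```

In this toy case the short-interval cue is empty while the long-interval cue highlights both the old and the new target position, which is the kind of slow motion a single short interval would miss.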

2605.22604 2026-05-22 cs.CR cs.AI cs.LG cs.SE

Innovations in Cardless Artificial Intelligence Banking: A Comprehensive Framework for Cyber Secure and Fraud Mitigation using Machine Learning Algorithms

Md Israfeel

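AI summary: This paper proposes a comprehensive framework that uses machine learning algorithms to strengthen cybersecurity and fraud mitigation in cardless AI banking systems, generating virtual cards through AI-driven data cryptography to reduce the risk of information exposure.
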
Abstract

The advent of cardless artificial intelligence (AI) banking heralds a paradigm shift in the financial landscape, offering users unprecedented security and convenience. This paper outlines a comprehensive framework designed to enhance cybersecurity, introduce auto-generated virtual cards, and mitigate fraud risks within cardless AI banking systems. The framework envisions a future banking architecture that employs AI-powered data cryptography to create secure virtual cards for seamless transactions. By emphasizing secure communication channels, it ensures the integrity of financial activities among banking systems, cardholders, and third-party vendors. AI-based authorization methodologies play a pivotal role in authenticating each transaction while proactively identifying potential fraud, demonstrating the framework's efficacy in fortifying cardless AI banking security. The initial approach, featuring an AI-driven, feature-based banking system, ensures the generation of virtual cards with encrypted data, minimizing information exposure and reducing fraud risks. Integrating a machine learning algorithm adds an additional layer of protection against potential fraudulent activities. In conclusion, the proposed framework establishes a holistic cybersecurity and fraud-mitigation paradigm for cardless AI banking systems. Its implementation empowers financial institutions to address security concerns associated with traditional banking, paving the way for a future banking landscape that is not only fraud-resistant but also secure and convenient for users.

2605.22602 2026-05-22 cs.AI

Think Thrice Before You Speak: Dual knowledge-enhanced Theory-of-Mind Reasoning for Persuasive Agents

Minghui Ma, Bin Guo, Runze Yang, Mengqi Chen, Yan Liu, Jingqi Liu, Yahan Pei, Xuehao Ma, Qiuyun Zhang, Zhiwen Yu

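AI summary: This paper proposes a dual knowledge-enhanced Theory-of-Mind reasoning method to improve the dialogue ability of persuasive agents; by constructing a large-scale annotated dataset and proposing the TTBYS framework, it improves LLM performance in inferring desires, beliefs, and persuasive strategies.
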
Comments
19 pages, 6 figures
Abstract

Persuasive dialogue requires reasoning about others' latent mental states, a capability known as Theory of Mind (ToM). However, due to reliance on simple prompting strategies and insufficient ToM knowledge, existing LLMs often fail to capture the intrinsic dependencies among mental states, leading to fragmented representations and unstable reasoning. To address these challenges, we introduce the ToM-based Persuasive Dialogue (ToM-PD) task, grounded in the Belief-Desire-Intention (BDI) framework, which explicitly models the sequential dependencies among mental states in multi-turn dialogues. To facilitate research on this task, we construct a large-scale annotated dataset, ToM-based Broad Persuasive Dialogues (ToM-BPD), capturing fine-grained mental states and corresponding persuasive strategies. We further propose Think Thrice Before You Speak (TTBYS), a knowledge-enhanced stepwise reasoning framework that leverages both explicit and implicit prior experiences to improve LLMs' inference of desires, beliefs, and persuasive strategies. Experimental results demonstrate that Qwen3-8B equipped with TTBYS outperforms GPT-5 by 1.20%, 22.80%, and 16.97% in predicting desires, beliefs, and persuasive strategies, respectively. Case studies further show that our approach enhances interpretability and consistency in reasoning.

2605.22600 2026-05-22 cs.RO

Branch-Stochastic Model Predictive Control for Motion Planning under Multi-Modal Uncertainty with Scenario Clustering

Zekun Xing, Ramkrishna Chaudhari, Marion Leibold, Dirk Wollherr, Martin Buss

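AI summary: This paper proposes a method that combines stochastic model predictive control with a branching structure for motion planning under multi-modal uncertainty, using scenario clustering to achieve real-time computation and reduce conservatism.
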
Comments
This work has been accepted for presentation at IFAC World Congress 2026
Abstract

Motion planning for autonomous driving must account for multi-modal uncertainty in both the intentions and trajectories of surrounding vehicles. Handling uncertainty in a worst-case manner guarantees robustness but often leads to excessive conservatism. Stochastic Model Predictive Control (SMPC) reduces trajectory-level conservatism through chance constraints, yet remains conservative with respect to intention uncertainty since constraints must hold across all intentions. We present a novel combination of SMPC and the branching structure, enabling the planner to generate distinct trajectories for different possible intentions while maintaining safety under trajectory uncertainty. A novel scenario clustering is proposed to merge prediction scenarios based on high-level decision similarity, thereby ensuring real-time tractability. Furthermore, an adaptive branching-time computation postpones commitment to separate plans until intention uncertainty is sufficiently reduced. Simulation studies in challenging highway scenarios demonstrate that the proposed method improves safety, reduces conservatism, and achieves real-time computational performance.

2605.22597 2026-05-22 cs.LG cs.AI cs.GR cs.RO

MoSA: Motion-constrained Stress Adaptation for Mitigating Real-to-Sim Gap in Continuum Dynamics via Learning Residual Anisotropy

Jiaxu Wang, Junhao He, Jingkai Sun, Yi Gu, Yunyang Mo, Jiahang Cao, Qiang Zhang, Renjing Xu

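AI summary: This paper proposes MoSA, a motion-constrained stress adaptation framework that mitigates the real-to-sim gap in continuum dynamics by using an isotropic model as a physics prior and learning residual stress operators to capture mild anisotropy and heterogeneity, and validates it in robot manipulation.
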
Journal ref
International Conference on Machine Learning 2026
Abstract

Learning real-world dynamics from visual observations is crucial for various domains. A common strategy is to calibrate simulators by estimating physical parameters, yet accuracy is ultimately bounded by the underlying physical models, which often assume materials are homogeneous and isotropic. Even if reasonable, real-world objects typically exhibit mild anisotropy and heterogeneity. After the near-isotropic backbone is well calibrated, these residual effects become the key bottleneck for further closing the real-to-sim gap. Although neural networks can fit dynamics end-to-end, such black-box modeling discards strong physical priors, leading to poor data efficiency and overfitting. Therefore, we propose MoSA, a motion-constrained stress adaptation framework that targets these residual effects to further improve real-to-sim dynamics learning. MoSA uses an isotropic model as a physics prior and learns residual stress operators to capture mild anisotropy and heterogeneity. It progressively adapts stresses via microplane-constrained redistribution in a physics-informed cascaded network. We further impose motion constraints by supervising temporal and spatial derivatives of the deformation field. Experimentally, our learned dynamics achieves superior accuracy, generalization, and robustness, while learning physically meaningful residual anisotropy. Finally, we validate MoSA in a robot manipulation setting, showing that better real-to-sim dynamics modeling translates into more reliable sim-to-real transfer. Project Page is available at https://mercerai.github.io/MoSA/.

2605.22596 2026-05-22 cs.LG

Factored Diffusion Policies: Compositionally Generalized Robot Control with a Single Score Network

Sayan Mitra, Ege Yuceel, Noah Giles, Abhishek Pai

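AI summary: This paper proposes factored diffusion policies, which achieve compositionally generalized robot control with a single shared diffusion network whose score decomposes additively across factors at inference, reducing the training-task budget from a product of factor cardinalities to a sum; a trajectory-tube certificate turns the score bound into a closed-loop state-trajectory tube, and experiments validate both the generalization bound and the certificate.
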
Abstract

Robotic tasks are typically specified by a tuple of factors, such as the object to be grasped, the obstacles to be avoided, the color of the target, and so on. Collecting expert demonstrations for every combination of factor values grows combinatorially. We present factored diffusion policies: a single shared diffusion network trained with per-factor null-token dropout, whose score decomposes additively across factors at inference. Under approximate conditional independence between factors given the action-observation pair, this composition approximates the true joint score with a bounded uniform error, reducing the training-task budget from a product of factor cardinalities to a sum. A trajectory-tube certificate chains this score-level bound through the reverse-time sampling ODE and a contracting tracking controller into a closed-loop state-trajectory tube whose radius factors into an ODE-sensitivity constant and a per-factor score-error budget. Unlike compositional-diffusion methods for control that combine separately trained networks, we use one shared network. Drone racing experiments confirm both the generalization bound and the certificate. On state-based multi-gate racing, the factored policy passes 90% of held-out gates -- matching an oracle -- while a K-network composition baseline collapses to 3%; on vision-based single-gate traversal, it transfers zero-shot to an unseen venue with +11.7pp success-rate gain and 2.4X crash-rate reduction.
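
The additive score composition described in the abstract above can be checked on a toy problem. This is a minimal sketch under stated assumptions, not the paper's method: a scalar action `a` with a standard normal prior, and two factors `c_i = a + noise_i` that are exactly conditionally independent given `a`. The composition rule used here, the unconditional score plus the sum of per-factor differences, is one standard additive form; the abstract does not spell out the exact rule, and all function names are hypothetical.

```python
import numpy as np

def gaussian_score(a, mean, var):
    """Score (gradient of the log density) of N(mean, var) evaluated at a."""
    return -(a - mean) / var

def posterior(cs, sigmas2):
    """Posterior N(mean, var) of a ~ N(0, 1) given observations c_i = a + noise_i."""
    precision = 1.0 + sum(1.0 / s2 for s2 in sigmas2)
    mean = sum(c / s2 for c, s2 in zip(cs, sigmas2)) / precision
    return mean, 1.0 / precision

def composed_score(a, cs, sigmas2):
    """Unconditional score plus the sum of per-factor conditional differences."""
    s_null = gaussian_score(a, 0.0, 1.0)  # stands in for the all-null-token score
    total = s_null
    for c, s2 in zip(cs, sigmas2):
        m, v = posterior([c], [s2])       # condition on one factor only
        total = total + (gaussian_score(a, m, v) - s_null)
    return total

a = np.linspace(-2.0, 2.0, 5)
cs, sigmas2 = [0.7, -1.3], [0.5, 2.0]
joint = gaussian_score(a, *posterior(cs, sigmas2))  # true joint-conditional score
composed = composed_score(a, cs, sigmas2)
print(np.allclose(joint, composed))
```

Because conditional independence holds exactly in this toy model, the composed score matches the joint score; with only approximate independence, the gap between the two is the kind of error the abstract says is bounded.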

2605.22593 2026-05-22 cs.LG

Do Deep Ensembles Actually Capture Uncertainty in Graph Neural Networks?

Pedro C. Vieira, Pedro Ribeiro, Viacheslav Borovitskiy

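AI summary: This paper studies the effectiveness of deep ensembles in graph neural networks and finds that they offer limited benefit for uncertainty quantification: the gains come mainly from stabilizing optimization noise rather than from better uncertainty estimates, revealing a collapse in ensemble diversity.
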
Abstract

While deep ensembles are widely considered to be the default method for uncertainty quantification in deep learning, their effectiveness for graph-structured data is often simply assumed based on successes in domains like computer vision. We investigate standard deep ensembles specifically for message-passing graph neural networks. Benchmarking across seven datasets representing varied tasks and complexities, we reveal that ensembles provide surprisingly little improvement over a single model. Instead, the observed marginal gains stem primarily from stabilizing optimization noise in point predictions rather than yielding meaningfully better uncertainty estimates. Through an aleatoric-epistemic decomposition, we identify epistemic collapse: independently trained networks consistently converge to overly similar predictions. Because disagreement is the fundamental mechanism through which ensembles capture epistemic uncertainty, this lack of diversity neutralizes their key advantage. Analyzing this phenomenon further, we suggest this collapse is driven by functional rather than weight-space convexity, where distinct parameter solutions induce almost identical behavior. Our results suggest that deep ensemble success does not seamlessly transfer to graph machine learning.
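
A common entropy-based form of the aleatoric-epistemic decomposition mentioned above can be sketched as follows. This is an illustration under assumptions: the abstract does not state which decomposition the authors use, and the member predictions below are made-up numbers, not results from the paper. Total uncertainty is the entropy of the averaged prediction, aleatoric uncertainty is the average entropy of the members, and epistemic uncertainty is their difference (the mutual information).

```python
import numpy as np

def entropy(p, axis=-1):
    """Shannon entropy in nats, with clipping to avoid log(0)."""
    p = np.clip(p, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=axis)

def decompose(probs):
    """probs: array of shape (members, classes) with softmax outputs for one input."""
    total = entropy(probs.mean(axis=0))   # entropy of the ensemble mean
    aleatoric = entropy(probs).mean()     # mean entropy of the members
    return total, aleatoric, total - aleatoric

# Hypothetical member predictions for a single node.
collapsed = np.array([[0.6, 0.4], [0.6, 0.4], [0.6, 0.4]])  # members agree
diverse = np.array([[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]])    # members disagree

print(decompose(collapsed))
print(decompose(diverse))
```

When every member returns the same distribution, the epistemic term vanishes no matter how uncertain that shared distribution is, which is the situation the abstract calls epistemic collapse.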

2605.22591 2026-05-22 cs.CV

Rethinking Noise-Robust Training for Frozen Vision Foundation Models: A Cross-Dataset Benchmark with a Case Study of Small-Loss Failure

Zitong Li, Haoyu Wang

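AI summary: Through a benchmark spanning five medical datasets, three backbones, two noise types, and five noise rates, this paper re-evaluates noisy-label learning methods in the frozen-feature regime, exposes the limits of the small-loss assumption in high-risk regimes, and proposes a feature-space selector to guide practical use.
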
Abstract

Frozen Vision Foundation Models (VFMs) with lightweight classification heads are increasingly used in medical imaging because they offer efficient and reproducible deployment. Yet noisy-label learning methods for this frozen-feature regime remain poorly understood, and most existing methods still rely on a small-loss assumption inherited from end-to-end training. We present a controlled benchmark of eight noisy-label methods across five medical datasets, three backbones, two noise types, and five noise rates (150 conditions, 6,000 training runs), evaluated with balanced accuracy. The benchmark shows that there is no universal winner: Friedman ranking over the 150 conditions yields $\chi^2 = 333.2$ ($p = 4.77 \times 10^{-68}$), ELR wins the most conditions (49/150), while CUFIT attains the best mean rank (2.51). The practical cost of method choice grows sharply with noise severity, from 4.5pp on clean data to 18.8pp at asymmetric 40% noise. To explain these benchmark-level patterns, we revisit the small-loss assumption in a representative high-risk regime. Under frozen DINOv2 features, clean and noisy loss distributions overlap by 53-61%, and matched-rate clean-sample detection shows that prediction agreement is markedly more stable than loss ranking under asymmetric noise (3pp vs. 13pp precision drop). On ISIC2019 with asymmetric 40% noise, Co-Teaching reaches 68% overall accuracy while collapsing to 35.1% balanced accuracy with zero recall on three minority classes. Together, these results recast noisy-label learning for frozen VFMs as a regime-aware method-selection problem rather than a search for a single dominant algorithm. We conclude with evidence-based guidance and a low-regret feature-space selector for practical recommendation.
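
The gap between overall and balanced accuracy reported above is easy to reproduce on synthetic labels. The sketch below uses made-up class counts (not ISIC2019 data) and takes balanced accuracy to be the mean of per-class recalls, which is the usual definition; the helper names are illustrative.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of samples predicted correctly."""
    return float(np.mean(y_true == y_pred))

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recalls, so every class counts equally."""
    classes = np.unique(y_true)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return float(np.mean(recalls))

# Synthetic imbalanced labels: 90 majority samples and two minority classes of 5.
y_true = np.array([0] * 90 + [1] * 5 + [2] * 5)
y_pred = np.zeros_like(y_true)  # a degenerate model that always predicts class 0

print(accuracy(y_true, y_pred), balanced_accuracy(y_true, y_pred))
```

A model with zero recall on the minority classes still scores 0.9 overall accuracy here but only one third balanced accuracy, the same pattern the abstract reports for Co-Teaching.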

2605.22581 2026-05-22 cs.CV cs.AI cs.LG

SceneAligner: 3D-Grounded Floorplan Localization in the Wild

Junhyeong Cho, Ruojin Cai, Hadar Averbuch-Elor

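AI summary: This paper proposes a method for floorplan localization in the wild that grounds the task in a reconstructed 3D representation of the scene, addressing the limited applicability of existing methods to large-scale buildings and rasterized floorplans.
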
Comments
Project Page: https://Cornell-VAILab.github.io/SceneAligner
Abstract

Many public buildings provide floorplans with a "you are here" indicator to help visitors orient themselves. Floorplan localization seeks to computationally replicate this capability by determining where visual observations were captured within a floorplan. However, existing methods typically assume controlled small-scale environments and precise vectorized floorplans, limiting their ability to operate in large-scale buildings and rasterized floorplans. In this work, we present an approach for performing floorplan localization in the wild by grounding the task in a reconstructed 3D representation of the scene. Given an unconstrained image collection, our method reconstructs a gravity-aligned 3D scene and projects it into a 2D density map that serves as a floorplan proxy. Floorplan localization is then formulated as aligning this proxy with the input floorplan via a 2D similarity transform. To bridge the appearance gap between density maps and architectural floorplans, we adapt a 2D foundation model to learn cross-modal correspondences, introducing a fine-tuning scheme that encourages semantically aligned matches while preserving structural consistency. Extensive experiments demonstrate substantial improvements over prior methods, including in extremely sparse settings with as little as a single input image. Our code and data will be publicly available.
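
The alignment step above, a 2D similarity transform between the density-map proxy and the floorplan, can be sketched with a closed-form least-squares fit once point correspondences are available. The abstract does not say how the authors estimate the transform; the code below uses Umeyama's classical solution as a stand-in, with synthetic noiseless correspondences and a hypothetical function name.

```python
import numpy as np

def fit_similarity_2d(src, dst):
    """Least-squares scale s, rotation R, translation t with dst ~ s * R @ src + t."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / len(src)            # cross-covariance between dst and src
    U, S, Vt = np.linalg.svd(cov)
    D = np.eye(2)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        D[1, 1] = -1.0                    # guard against a reflection
    R = U @ D @ Vt
    var_s = (xs ** 2).sum() / len(src)    # variance of the source points
    s = np.trace(np.diag(S) @ D) / var_s
    t = mu_d - s * R @ mu_s
    return s, R, t

# Synthetic correspondences: rotate by 0.3 rad, scale by 2, shift by (1, -2).
theta = 0.3
R_true = np.array([[np.cos(theta), -np.sin(theta)],
                   [np.sin(theta), np.cos(theta)]])
src = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [2.0, 3.0]])
dst = 2.0 * src @ R_true.T + np.array([1.0, -2.0])

s, R, t = fit_similarity_2d(src, dst)
print(s, t)
```

With real cross-modal matches the correspondences would contain outliers, so a fit like this would normally sit inside a robust estimator such as RANSAC.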

2605.22579 2026-05-22 cs.CL cs.AI stat.ML

Beyond Temperature: Hyperfitting as a Late-Stage Geometric Expansion

Meimingwei Li, Yuanhao Ding, Esteban Garces Arias, Christian Heumann

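AI summary: This paper studies hyperfitting and finds it distinct from distribution sharpening: experiments show that it relies on a dynamic, context-dependent rank reordering mechanism and that a terminal expansion in the final transformer layer geometrically expands the feature space; it also proposes Late-Stage LoRA to improve generation quality.
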
Comments
Accepted at ICML 2026
Abstract

Recent work has identified a counterintuitive phenomenon termed "Hyperfitting", where fine-tuning Large Language Models (LLMs) to near-zero training loss on small datasets surprisingly enhances open-ended generation quality and mitigates repetition in greedy decoding. While effective, the underlying mechanism remains poorly understood, with the extremely low-entropy output distributions suggesting a potential equivalence to simple temperature scaling. In this work, we demonstrate that this phenomenon is fundamentally distinct from distribution sharpening; entropy-matched control experiments reveal that temperature scaling fails to replicate the diversity gains of hyperfitting. Furthermore, we falsify the hypothesis of static vocabulary reweighting, showing through ablation studies that hyperfitting relies on a dynamic, context-dependent rank reordering mechanism. Layer-wise analysis localizes this effect to a "Terminal Expansion" in the final transformer block, where a substantial geometric expansion of the feature space ($\Delta\mathrm{Dim} \approx +80.8$) facilitates the promotion of deep-tail tokens. Additionally, we introduce Late-Stage LoRA, a targeted fine-tuning strategy that updates only the final 5 layers, yielding robust generation with minimal parameter updates.
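
One reason temperature scaling cannot reproduce a rank-reordering effect is that dividing logits by a positive temperature is monotonic: it changes how peaked the distribution is, but never the order of the tokens, so greedy decoding picks the same token at every temperature. The sketch below shows this on made-up logits; the values are illustrative and not taken from the paper.

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax with the usual max-subtraction for stability."""
    z = np.asarray(z, dtype=np.float64) / T
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def entropy(p):
    p = np.clip(p, 1e-12, 1.0)
    return float(-(p * np.log(p)).sum())

logits = np.array([2.0, 1.0, 0.5, -1.0])  # hypothetical next-token logits
temps = (0.1, 0.5, 1.0, 2.0)

ranks = [np.argsort(-softmax(logits, T)) for T in temps]
entropies = [entropy(softmax(logits, T)) for T in temps]
print(ranks[0], entropies)
```

The entropy drops as the temperature is lowered, yet the token ranking is identical across all four settings.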

2605.22578 2026-05-22 cs.CV

Beyond Chamfer Distance: Granular Order-aware Evaluation Metric For Online Mapping

Chouaib Bencheikh Lehocine, Adam Lilja, Junsheng Fu, Lars Hammarstrand

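AI summary: This paper proposes granular, order-aware evaluation metrics for online mapping methods, introducing sequence optimal sub-pattern assignment (SOSPA) and, for multi-instance evaluation, polyline localisation and detection (PLD); these improve on conventional Chamfer-distance-based evaluation and reveal detection capability as the main bottleneck in current methods.
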
Abstract

Online map estimation is a crucial component of autonomous driving systems that reduces the reliance on costly high-definition maps. State-of-the-art (SOTA) methods commonly predict map elements as ordered sequences of points that form polylines and polygons. The evaluation of these methods relies predominantly on mean average precision (mAP) based on thresholded Chamfer distance (CD). This framework lacks sensitivity to point ordering and provides limited granularity in assessing geometric quality, making it difficult to distinguish which methods truly excel over others. In this work, we address these limitations on two fronts. For the single-instance similarity measure, we introduce sequence optimal sub-pattern assignment (SOSPA), an order-aware metric that enables fine-grained evaluation of individual geometries while satisfying all metric axioms. For the multi-instance evaluation framework, we propose polyline localisation and detection (PLD), a soft metric that jointly captures detection quality and geometric accuracy, replacing the hard thresholding of mAP with a principled soft assignment. Through evaluations on nuScenes, we demonstrate that PLD effectively ranks SOTA online mapping methods (MapTRv2, StreamMapNet, MapTracker) while providing a decomposed error analysis. This analysis identifies detection capability as the dominant bottleneck in current methods, revealing a performance trend that mAP fails to capture. Code for evaluation using our metrics will be released.
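
The order-insensitivity of Chamfer distance noted above can be seen on a four-point polyline. This sketch assumes one common (unsquared, mean-of-minima) variant of Chamfer distance, since definitions differ across papers. The index-wise distance used for contrast is only an illustration of order awareness; it is not the SOSPA metric proposed in the paper.

```python
import numpy as np

def chamfer(p, q):
    """Symmetric Chamfer distance: mean nearest-neighbour distance in both directions."""
    d = np.linalg.norm(p[:, None, :] - q[None, :, :], axis=-1)
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def ordered_distance(p, q):
    """Mean distance between points at the same index (illustrative, not SOSPA)."""
    return np.linalg.norm(p - q, axis=-1).mean()

gt = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]])
shuffled = gt[[0, 2, 1, 3]]  # same point set, but the polyline now zig-zags

print(chamfer(gt, shuffled), ordered_distance(gt, shuffled))
```

Chamfer distance treats the zig-zag prediction as a perfect match because it compares point sets, while any measure that respects point order assigns it a non-zero error.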

2605.22572 2026-05-22 cs.CV

SegGuidedNet: Sub-Region-Aware Attention Supervision for Interpretable Brain Tumor Segmentation

Hasaan Maqsood, Saif Ur Rehman Khan, Sebastian Vollmer, Andreas Dengel, Muhammad Nabeel Asim

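AI summary: This paper proposes SegGuidedNet, a three-dimensional residual encoder-decoder network with a novel SegAttentionGate module that uses a lightweight auxiliary loss to explicitly supervise the decoder to produce spatially discriminative attention maps for each tumour sub-region (necrotic core, peritumoral oedema, enhancing tumour), providing spatial interpretability at no extra cost without any post-hoc explanation method and achieving strong segmentation performance on BraTS2021 and BraTS2023 GLI.
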
Abstract

Accurate segmentation of brain tumour sub-regions from multi-parametric MRI is critical for treatment planning yet remains challenging due to morphological variability, class imbalance, and overlapping appearances of tumour regions across imaging sequences. We propose SegGuidedNet, a three-dimensional residual encoder-decoder network that introduces a novel SegAttentionGate module. Through a lightweight auxiliary loss, the module explicitly supervises the decoder to produce spatially discriminative attention maps for each tumour sub-region (necrotic core, peritumoral oedema, and enhancing tumour), adding less than 0.2% parameter overhead. This sub-region supervision maintains decoder discriminability between visually ambiguous classes while providing free-of-cost spatial interpretability at inference without any post-hoc explanation method. Evaluated independently on BraTS2021 and BraTS2023 GLI across 251 held-out subjects each, SegGuidedNet achieves mean Dice of 0.905 (ET=0.873, TC=0.906, WT=0.935) and 0.897 (ET=0.859, TC=0.902, WT=0.931), respectively, surpassing ensemble-based nnU-Net and HNF-Netv2 as a single model and approaching Swin UNETR (a 10-model ensemble) to within 2-4 Dice points at a fraction of the inference cost. The consistency of results across two benchmark editions further confirms the generalisability of the proposed approach, offering competitive accuracy with built-in interpretability in a lightweight, clinically practical framework.

2605.22570 2026-05-22 cs.CV cs.AI

VGenST-Bench: A Benchmark for Spatio-Temporal Reasoning via Active Video Synthesis

Jinho Park, Youbin Kim, Hogun Park, Eunbyung Park

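AI summary: This paper proposes VGenST-Bench, a video benchmark that uses generative models to actively synthesize diverse evaluation scenarios for assessing the spatio-temporal reasoning of multimodal large language models; a multi-agent pipeline and a 3x2x2 video taxonomy enable precise diagnosis of fine-grained spatio-temporal understanding.
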
Comments
82 pages, 91 figures (7 in main paper, 84 in appendix). Project page: https://zinosii.github.io/VGenST-Bench/
Abstract

Spatio-temporal reasoning is a core capability for Multimodal Large Language Models (MLLMs) operating in the real world. As such, evaluating it precisely has become an essential challenge. However, existing spatio-temporal reasoning benchmark datasets primarily rely on static image sets or passively curated video data, which limits the evaluation of fine-grained reasoning capabilities. In this paper, we introduce VGenST-Bench, a video benchmark that employs generative models to actively synthesize highly controlled and diverse evaluation scenarios. To construct VGenST-Bench, we propose a multi-agent pipeline incorporating a human quality control stage, ensuring the quality of all generated videos and QA pairs. We establish a comprehensive 3x2x2 video taxonomy, encompassing Spatial Scale, Perspective, and Scene Dynamics to span diverse scenarios. Furthermore, we design a hierarchical task suite that decouples low-level visual perception from high-level spatio-temporal reasoning. By shifting the paradigm from passive curation to active synthesis, VGenST-Bench enables fine-grained diagnosis of spatio-temporal understanding in MLLMs.

2605.22568 2026-05-22 cs.CR cs.AI

Measuring Security Without Fooling Ourselves: Why Benchmarking Agents Is Hard

Sahar Abdelnabi, Chris Hicks, Konrad Rieck, Ahmad-Reza Sadeghi

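AI summary: This paper examines core challenges in the benchmarks used to evaluate AI agents in security-critical roles, including benchmark vulnerabilities, temporal staleness, and runtime uncertainty, and outlines directions for building more reliable and trustworthy evaluation frameworks.
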
Abstract

The benchmarks used to evaluate AI agents in security-critical roles suffer from crucial weaknesses. Building on recent empirical evidence, we characterize three core challenges that undermine security evaluations: benchmark vulnerabilities, temporal staleness, and runtime uncertainty. We then outline practical directions toward building more robust and trustworthy evaluation frameworks.