arXivDaily: arXiv daily academic digest, updated Monday to Friday
2511.19652 2026-06-12 cs.CV Updated version

Navigating Gigapixel Pathology Images with Large Multimodal Models

Thomas A. Buckley, Kian R. Weihrauch, Katherine Latham, Andrew Z. Zhou, Padmini A. Manrai, Arjun K. Manrai

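Affiliations: Department of Biomedical Informatics, Harvard Medical School; Department of Pathology, Massachusetts General Hospital; Department of Pathology and Laboratory Medicine, Brown University

AI summary: Proposes GIANT, a training-free method that lets general-purpose multimodal models navigate WSIs autonomously by iteratively selecting multi-magnification crops and aggregating evidence; with GPT-5 it reaches state-of-the-art results on four of the five MultiPathQA benchmarks.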

Abstract

Recent advances in large multimodal models have allowed for the development of interactive chat models that can converse and reason about pathology whole-slide images (WSIs). However, existing slide-level chat systems are often highly specialized, typically compressing WSIs into fixed slide-level embeddings or relying on multi-component pipelines, which can lose multi-scale detail and limit generalizability beyond the target task. We present GIANT (Gigapixel Image Agent for Navigating Tissue), a simple, training-free approach that lets general-purpose multimodal models navigate WSIs on their own, iteratively selecting multi-magnification crops and aggregating evidence over time. To evaluate generalizability in WSI question answering and to promote reproducibility, we introduce MultiPathQA, a benchmark suite spanning five clinical challenges and 934 questions over 868 unique WSIs. This includes a new set of 128 pathologist-authored multiple-choice questions designed to mirror real diagnostic search and multi-scale reasoning. Using GPT-5, GIANT outperforms models specialized for pathology question answering, achieving state-of-the-art performance on four out of five benchmarks.

2511.17221 2026-06-12 cs.CV cs.RO Updated version

QueryOcc: Query-based Self-Supervision for 3D Semantic Occupancy

Adam Lilja, Ji Lan, Junsheng Fu, Lars Hammarstrand

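Affiliations: Chalmers University of Technology; Zenseact

AI summary: Proposes QueryOcc, a query-based self-supervised framework that learns continuous 3D semantic occupancy directly from 4D spatio-temporal queries sampled across adjacent frames, supervised by pseudo-point clouds from vision foundation models or by raw lidar. A contractive scene representation enables long-range supervision under constant memory; semantic RayIoU improves by 26% over previous camera-based methods on the self-supervised Occ3D-nuScenes benchmark.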

Abstract

Learning 3D scene geometry and semantics from images is a core challenge in computer vision and a key capability for autonomous driving. Since large-scale 3D annotation is prohibitively expensive, recent work explores self-supervised learning directly from sensor data without manual labels. Existing approaches either rely on 2D rendering consistency, where 3D structure emerges only implicitly, or on discretized voxel grids from accumulated lidar point clouds, limiting spatial precision and scalability. We introduce QueryOcc, a query-based self-supervised framework that learns continuous 3D semantic occupancy directly through independent 4D spatio-temporal queries sampled across adjacent frames. The framework supports supervision from either pseudo-point clouds derived from vision foundation models or raw lidar data. To enable long-range supervision and reasoning under constant memory, we introduce a contractive scene representation that preserves near-field detail while smoothly compressing distant regions. QueryOcc surpasses previous camera-based methods by 26% in semantic RayIoU on the self-supervised Occ3D-nuScenes benchmark while running at 11.6 FPS, demonstrating that direct 4D query supervision enables strong self-supervised occupancy learning. https://research.zenseact.com/publications/queryocc/
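The abstract does not state the form of the contractive scene representation. As a rough illustration of the general idea (leave near-field coordinates untouched, squash distant ones into a bounded shell), the sketch below uses the scene contraction popularized by mip-NeRF 360. The function name `contract` and the `near_radius` parameter are assumptions made for this example and are not taken from QueryOcc.

```python
import math
from typing import Sequence, Tuple


def contract(point: Sequence[float], near_radius: float = 1.0) -> Tuple[float, ...]:
    """Map an unbounded point into a ball of radius 2 * near_radius.

    Illustrative stand-in (mip-NeRF 360 style), not QueryOcc's actual mapping.
    Points within near_radius are returned unchanged, so near-field detail is
    preserved; points farther away are smoothly compressed.
    """
    norm = math.sqrt(sum(c * c for c in point))
    if norm <= near_radius:
        return tuple(float(c) for c in point)
    # Contracted distance is 2r - r^2 / norm: equal to r at the boundary and
    # approaching 2r as the original distance grows without bound.
    scale = (2.0 * near_radius - near_radius ** 2 / norm) / norm
    return tuple(scale * c for c in point)


if __name__ == "__main__":
    for distance in (0.5, 1.0, 10.0, 1000.0):
        print(distance, contract((distance, 0.0, 0.0)))
```

Because every contracted point lies inside a ball of fixed radius, a grid or feature volume over the contracted space has a fixed size regardless of how far the supervision reaches, which is one plausible reading of "constant memory" here.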

2511.04260 2026-06-12 cs.CV cs.AI Updated version

Proto-LeakNet: Towards Signal-Leak Aware Attribution in Synthetic Human Face Imagery

Claudio Giusti, Luca Guarnera, Sebastiano Battiato

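Affiliations: Department of Mathematics and Computer Science; University of Catania

AI summary: Proposes Proto-LeakNet, which exploits signal-leak traces left by diffusion models and combines closed-set classification with density-based open-set evaluation for interpretable generator attribution; trained only on closed-set data, it also handles unseen generators.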

Comments
44 pages, 27 figures, 11 tables
Abstract

The growing sophistication of synthetic image and deepfake generation models has turned source attribution and authenticity verification into a critical challenge for modern computer vision systems. Recent studies suggest that diffusion pipelines unintentionally imprint persistent statistical traces, known as signal-leaks, within their outputs, particularly in latent representations. Building on this observation, we propose Proto-LeakNet, a signal-leak-aware and interpretable attribution framework that integrates Closed-set classification with a density-based Open-set evaluation on the learned embeddings, enabling analysis of unseen generators without retraining. Acting in the latent domain of diffusion models, our method re-simulates partial forward diffusion to expose residual generator-specific cues. A temporal attention encoder aggregates multi-step latent features, while a feature-weighted prototype head structures the embedding space and enables transparent attribution. Trained solely on closed data and achieving a Macro AUC of 98.13%, Proto-LeakNet learns a latent geometry that remains robust under post-processing, surpassing state-of-the-art methods, and achieves strong separability both between real images and known generators, and between known and unseen ones. The codebase is available at the following link: https://github.com/claudiunderthehood/Proto-LeakNet
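To make "re-simulating partial forward diffusion" concrete, here is a minimal sketch of the standard DDPM closed-form forward step, x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps, applied to a latent at a few early timesteps. The linear beta schedule, the flat-list latent, and all function names are assumptions for illustration; the abstract does not specify the schedule, the timesteps, or how Proto-LeakNet consumes the result.

```python
import math
import random
from typing import List, Optional, Sequence


def linear_betas(num_steps: int, beta_start: float = 1e-4, beta_end: float = 0.02) -> List[float]:
    """Linear noise schedule (an assumed, commonly used default)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def alpha_bar(betas: Sequence[float], t: int) -> float:
    """Cumulative product of (1 - beta) over steps 0..t inclusive."""
    prod = 1.0
    for beta in betas[: t + 1]:
        prod *= 1.0 - beta
    return prod


def forward_diffuse(latent: Sequence[float], t: int, betas: Sequence[float],
                    noise: Optional[Sequence[float]] = None,
                    rng: Optional[random.Random] = None) -> List[float]:
    """Sample x_t from q(x_t | x_0) in closed form."""
    a_bar = alpha_bar(betas, t)
    if noise is None:
        rng = rng or random.Random(0)
        noise = [rng.gauss(0.0, 1.0) for _ in latent]
    signal, sigma = math.sqrt(a_bar), math.sqrt(1.0 - a_bar)
    return [signal * x + sigma * e for x, e in zip(latent, noise)]


def partial_trajectory(latent: Sequence[float], steps: Sequence[int],
                       betas: Sequence[float], seed: int = 0) -> List[List[float]]:
    """Noised copies of one latent at several (hypothetical) timesteps."""
    rng = random.Random(seed)
    return [forward_diffuse(latent, t, betas, rng=rng) for t in steps]


if __name__ == "__main__":
    betas = linear_betas(1000)
    for noised in partial_trajectory([0.3, -1.2, 0.8], steps=[10, 50, 100], betas=betas):
        print([round(v, 3) for v in noised])
```

A multi-step sequence like the one returned by `partial_trajectory` is the kind of input a temporal encoder could aggregate, although the actual feature extraction in the paper may differ.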

2511.11022 2026-06-12 cs.RO Updated version

Miniature Testbed for Validating Multi-Agent Cooperative Autonomous Driving

Hyunchul Bae, Eunjae Lee, Jehyeop Han, Minhee Kang, Jaehyeon Kim, Junggeun Seo, Minkyun Noh, Heejin Ahn

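Affiliations: School of Electrical Engineering; School of Mechanical Engineering; Korea Advanced Institute of Science and Technology

AI summary: Proposes CIVAT, a miniature testbed that integrates V2V/V2I communication with the ROS2 framework, and validates cooperative autonomous driving through infrastructure-based perception and intersection management experiments.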

Comments
Accepted by ICRA 2026, 8 pages
Abstract

Cooperative autonomous driving, which extends vehicle autonomy by enabling real-time collaboration between vehicles and smart roadside infrastructure, remains a challenging yet essential problem. However, none of the existing testbeds employ smart infrastructure equipped with sensing, edge computing, and communication capabilities. To address this gap, we design and implement a 1:15-scale miniature testbed, CIVAT, for validating cooperative autonomous driving, consisting of a scaled urban map, autonomous vehicles with onboard sensors, and smart infrastructure. The proposed testbed integrates V2V and V2I communication with the publish-subscribe pattern through a shared Wi-Fi and ROS2 framework, enabling information exchange between vehicles and infrastructure to realize cooperative driving functionality. As a case study, we validate the system through infrastructure-based perception and intersection management experiments.
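The publish-subscribe pattern mentioned in the abstract can be sketched with a small in-process broker. This is deliberately not the ROS2 API: the `Broker` class, the topic name `/infra/detections`, and the message fields are all hypothetical and only show how infrastructure and vehicles stay decoupled by exchanging messages through named topics.

```python
from collections import defaultdict
from typing import Any, Callable, DefaultDict, List

Callback = Callable[[Any], None]


class Broker:
    """Toy in-process message broker illustrating publish-subscribe."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Callback]] = defaultdict(list)

    def subscribe(self, topic: str, callback: Callback) -> None:
        """Register a callback to be invoked for every message on a topic."""
        self._subscribers[topic].append(callback)

    def publish(self, topic: str, message: Any) -> int:
        """Deliver a message to all subscribers; return how many received it."""
        callbacks = list(self._subscribers.get(topic, []))
        for callback in callbacks:
            callback(message)
        return len(callbacks)


if __name__ == "__main__":
    broker = Broker()
    # A vehicle listens for detections shared by roadside infrastructure (V2I).
    broker.subscribe("/infra/detections", lambda msg: print("vehicle_1 got", msg))
    # The infrastructure node publishes without knowing who is listening.
    broker.publish("/infra/detections", {"object": "pedestrian", "x": 1.2, "y": 0.4})
```

The publisher never references a specific vehicle, so vehicles or infrastructure nodes can be added or removed without changing the other side, which is the property that makes the pattern a natural fit for V2V and V2I exchange.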

2503.10919 2026-06-12 cs.RO cs.SY eess.SY nlin.PS Updated version

Data-Driven Soft Robot Control via Adiabatic Spectral Submanifolds

Roshan S. Kaundinya, John Irvin Alora, Jonas G. Matt, Luis A. Pabon, Marco Pavone, George Haller

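Affiliations: Institute for Mechanical Systems, ETH Zürich; Autonomous Systems Lab, Stanford University; Automatic Control Laboratory, ETH Zürich

AI summary: Addresses the difficulty of controlling soft robots in nonlinear regimes with a model-predictive control strategy based on adiabatic spectral submanifolds (aSSMs): low-dimensional attracting manifolds are constructed from data, and the resulting reduced models track trajectories up to 10 times better than other data-driven modeling methods.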

Comments
41 pages, 24 figures, IJRR (2026) in press
Abstract

The mechanical complexity of soft robots creates significant challenges for their model-based control. Specifically, linear data-driven models have struggled to control soft robots on complex, spatially extended paths that explore regions with significant nonlinear behavior. To account for these nonlinearities, we develop here a model-predictive control strategy based on the recent theory of adiabatic spectral submanifolds (aSSMs). This theory is applicable because the internal vibrations of heavily overdamped robots decay at a speed that is much faster than the desired speed of the robot along its intended path. In that case, low-dimensional attracting invariant manifolds (aSSMs) emanate from the path and carry the dominant dynamics of the robot. Aided by this recent theory, we devise an aSSM-based model-predictive control scheme purely from data. We demonstrate the effectiveness of our data-driven model in tracking dynamic trajectories across diverse tasks. We validate on high-fidelity, high-dimensional finite-element models of a soft trunk robot and Cosserat-rod-based elastic soft arms, with additional experiments confirming robust performance even in the presence of experimental noise. Notably, we find that five- or six-dimensional aSSM-reduced models outperform the tracking performance of other data-driven modeling methods by a factor up to 10 across all closed-loop control tasks.

2504.21561 2026-06-12 cs.CV Updated version

Iterative Tool Usage Exploration for Multimodal Agents via Step-wise Preference Tuning

Pengxiang Li, Zhi Gao, Bofei Zhang, Yapeng Mi, Xiaojian Ma, Chenrui Shi, Tao Yuan, Yuwei Wu, Yunde Jia, Song-Chun Zhu, Qing Li

Affiliations: Beijing Key Laboratory of Intelligent Information Technology, School of Computer Science & Technology, Beijing Institute of Technology; State Key Laboratory of General Artificial Intelligence, BIGAI; State Key Laboratory of General Artificial Intelligence, Peking University; Harbin Institute of Technology; Guangdong Laboratory of Machine Perception and Intelligent Computing, Shenzhen MSU-BIT University; Department of Automation, Tsinghua University

AI summary: Proposes SPORT, an iterative loop of task synthesis, step sampling, step verification, and preference tuning that lets multimodal agents explore and refine tool-use strategies on their own without pre-collected data, improving results by 6.41% on GTA and 3.64% on GAIA.

Comments
24 pages
Abstract

Multimodal agents, which integrate a controller (e.g., a vision language model) with external tools, have demonstrated remarkable capabilities in tackling complex multimodal tasks. Existing approaches for training these agents, both supervised fine-tuning and reinforcement learning, depend on extensive human-annotated task-answer pairs and tool trajectories. However, for complex multimodal tasks, such annotations are prohibitively expensive or impractical to obtain. In this paper, we propose an iterative tool usage exploration method for multimodal agents without any pre-collected data, namely SPORT, via step-wise preference optimization to refine the trajectories of tool usage. Our method enables multimodal agents to autonomously discover effective tool usage strategies through self-exploration and optimization, eliminating the bottleneck of human annotation. SPORT has four iterative components: task synthesis, step sampling, step verification, and preference tuning. We first synthesize multimodal tasks using language models. Then, we introduce a novel trajectory exploration scheme, where step sampling and step verification are executed alternately to solve synthesized tasks. In step sampling, the agent tries different tools and obtains corresponding results. In step verification, we employ a verifier to provide AI feedback to construct step-wise preference data. The data is subsequently used to update the controller for tool usage through preference tuning, producing a SPORT agent. By interacting with real environments, the SPORT agent gradually evolves into a more refined and capable system. Evaluation in the GTA and GAIA benchmarks shows that the SPORT agent achieves 6.41% and 3.64% improvements, underscoring the generalization and effectiveness introduced by our method. The project page is https://SPORT-Agents.github.io.

2510.16380 2026-06-12 cs.CL cs.AI cs.CY cs.HC cs.LG Updated version

MoReBench: Evaluating Procedural and Pluralistic Moral Reasoning in Language Models, More than Outcomes

Yu Ying Chiu, Michael S. Lee, Rachel Calcott, Brandon Handoko, Paul de Font-Reaulx, Raphaël Millière, Paula Rodriguez, Chen Bo Calvin Zhang, Ziwen Han, Udari Madhushani Sehwag, Yash Maurya, Christina Q Knight, Harry R. Lloyd, Florence Bacus, Conor Downey, Mantas Mazeika, Bing Liu, Yejin Choi, Mitchell L Gordon, Sydney Levine

Affiliations: University of Washington; New York University; Scale AI; Harvard University; University of Michigan; UNC Chapel Hill; Center for AI Safety; Stanford University; MIT; University of Oxford

AI summary: Proposes MoReBench, a benchmark of 1,000 moral scenarios with over 23 thousand rubric criteria for evaluating the procedural side of moral reasoning in language models; finds that scaling laws and existing benchmarks fail to predict moral-reasoning ability and that models favor particular moral frameworks.

Comments
46 pages, 8 figures, 10 tables. Published in ICLR 2026. Accepted at CHAI workshop and SPP 2026 (non-archival)
Abstract

As AI systems progress, we rely more on them to make decisions with us and for us. To ensure that such decisions are aligned with human values, it is imperative for us to understand not only what decisions they make but also how they come to those decisions. Reasoning language models, which provide both final responses and (partially transparent) intermediate thinking traces, present a timely opportunity to study AI procedural reasoning. Unlike math and code problems, which often have objectively correct answers, moral dilemmas are an excellent testbed for process-focused evaluation because they allow for multiple defensible conclusions. To this end, we present MoReBench: 1,000 moral scenarios, each paired with a set of rubric criteria that experts consider essential to include (or avoid) when reasoning about the scenarios. MoReBench contains over 23 thousand criteria, including identifying moral considerations, weighing trade-offs, and giving actionable recommendations, covering cases where AI advises humans on moral decisions as well as cases where it makes moral decisions autonomously. Separately, we curate MoReBench-Theory: 150 examples to test whether AI can reason under five major frameworks in normative ethics. Our results show that scaling laws and existing benchmarks on math, code, and scientific reasoning tasks fail to predict models' abilities to perform moral reasoning. Models also show partiality towards specific moral frameworks (e.g., Benthamite Act Utilitarianism and Kantian Deontology), which might be side effects of popular training paradigms. Together, these benchmarks advance process-focused reasoning evaluation towards safer and more transparent AI.

2510.16311 2026-06-12 cs.LG Updated version

Toward General Digraph Contrastive Learning: A Dual Spatial Perspective

Zhengyu Wu, Daohan Su, Yang Zhang, Xunkai Li, Rong-Hua Li, Guoren Wang

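Affiliations: National University of Singapore; University of Science and Technology of China

AI summary: Proposes S2-DiGCL, a framework for contrastive learning on directed graphs from two spatial perspectives, the complex domain and the real domain, using adaptive modulation of the magnetic Laplacian and path-based subgraph augmentation; improves node classification by 4.41% and link prediction by 4.34%.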

Abstract

Graph Contrastive Learning (GCL) has emerged as a powerful tool for extracting consistent representations from graphs, independent of labeled information. However, existing methods predominantly focus on undirected graphs, disregarding the pivotal directional information that is fundamental and indispensable in real-world networks (e.g., social networks and recommendations). In this paper, we introduce S2-DiGCL, a novel framework that emphasizes spatial insights from complex- and real-domain perspectives for directed graph (digraph) contrastive learning. From the complex-domain perspective, S2-DiGCL introduces personalized perturbations into the magnetic Laplacian to adaptively modulate edge phases and directional semantics. From the real-domain perspective, it employs a path-based subgraph augmentation strategy to capture fine-grained local asymmetries and topological dependencies. By jointly leveraging these two complementary spatial views, S2-DiGCL constructs high-quality positive and negative samples, leading to more general and robust digraph contrastive learning. Extensive experiments on 7 real-world digraph datasets demonstrate the superiority of our approach, achieving SOTA performance with 4.41% improvement in node classification and 4.34% in link prediction under both supervised and unsupervised settings.
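For readers unfamiliar with the magnetic Laplacian, the sketch below builds the standard unnormalized version for a small digraph: edge weights are symmetrized, and direction is encoded as a complex phase controlled by a charge parameter q. This is only the textbook operator. The "personalized perturbations" that S2-DiGCL applies to it are not described in the abstract and are not reproduced here; the function name and the default q = 0.25 are assumptions for illustration.

```python
import cmath
import math
from typing import List


def magnetic_laplacian(adj: List[List[float]], q: float = 0.25) -> List[List[complex]]:
    """Unnormalized magnetic Laplacian L = D_s - A_s * exp(i * Theta).

    A_s[u][v]   = (adj[u][v] + adj[v][u]) / 2        symmetrized weights
    Theta[u][v] = 2 * pi * q * (adj[u][v] - adj[v][u])  direction as a phase
    D_s         = diagonal matrix of row sums of A_s
    The result is Hermitian, so its eigenvalues are real.
    """
    n = len(adj)
    lap = [[0j] * n for _ in range(n)]
    for u in range(n):
        degree = 0.0
        for v in range(n):
            sym = 0.5 * (adj[u][v] + adj[v][u])
            phase = 2.0 * math.pi * q * (adj[u][v] - adj[v][u])
            degree += sym
            lap[u][v] -= sym * cmath.exp(1j * phase)
        lap[u][u] += degree
    return lap


if __name__ == "__main__":
    # Single directed edge 0 -> 1.
    for row in magnetic_laplacian([[0.0, 1.0], [0.0, 0.0]], q=0.25):
        print([complex(round(z.real, 3), round(z.imag, 3)) for z in row])
```

With q = 0 the phases vanish and the operator reduces to the ordinary Laplacian of the symmetrized graph, so q controls how strongly direction is expressed; perturbing the phase term is one natural place where an augmentation could act.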

2510.05430 2026-06-12 cs.RO Updated version

Active Semantic Perception

Huayi Tang, Pratik Chaudhari

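Affiliations: General Robotics, Automation, Sensing and Perception (GRASP) Laboratory

AI summary: Proposes an active semantic perception method built on a compact multi-layer scene graph and large language models for efficient exploration of unknown environments; evaluated in simulation and on a real robot, where it outperforms existing approaches.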

Abstract

We develop an approach for active semantic perception, which refers to using the semantics of the scene for tasks such as exploration. We build a compact, multi-layer scene graph that can represent large, complex indoor environments at various levels of abstraction, e.g., nodes corresponding to rooms, objects, walls, windows etc., as well as fine-grained details of their geometry. We develop a procedure based on large language models (LLMs) to sample new plausible scene graphs of unobserved regions that are consistent with partial observations of the scene. We develop a procedure to compute the information gain of a potential waypoint upon this scene graph to enable sophisticated spatial reasoning: for example, of the two doors that lead out of the living room, one probably leads to the kitchen and the other to the bedroom. We evaluate our approach in realistic 3D indoor apartments in simulation and also on a Unitree Go 2 robot in the real world. Qualitative and quantitative analysis shows that our approach can pin down high-level and low-level semantic information in the environment quickly and more accurately than existing approaches.
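One simple way to picture the information-gain computation is sketched below, under strong simplifying assumptions that are not from the paper: each sampled scene graph is reduced to a single predicted label per waypoint (what the robot would find there), samples are equally weighted, and the observation is a deterministic function of the hypothesis. In that case the expected information gain of visiting a waypoint equals the Shannon entropy of its predicted labels across samples. The names `entropy_bits` and `best_waypoint` are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Hashable, Sequence


def entropy_bits(labels: Sequence[Hashable]) -> float:
    """Shannon entropy (in bits) of the empirical label distribution."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def best_waypoint(predictions: Dict[str, Sequence[Hashable]]) -> str:
    """Pick the waypoint whose outcome the sampled hypotheses disagree on most."""
    return max(predictions, key=lambda waypoint: entropy_bits(predictions[waypoint]))


if __name__ == "__main__":
    # Four sampled scene graphs; each predicts the room behind two doors.
    sampled_predictions = {
        "door_a": ["kitchen", "bedroom", "kitchen", "bedroom"],
        "door_b": ["kitchen", "kitchen", "kitchen", "kitchen"],
    }
    for waypoint, labels in sampled_predictions.items():
        print(waypoint, entropy_bits(labels))
    print("explore:", best_waypoint(sampled_predictions))
```

A waypoint on which all samples agree carries no new information, while one on which they split evenly carries the most, which matches the intuition in the two-doors example from the abstract.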

2503.06573 2026-06-12 cs.CL cs.AI Updated version

WildIFEval: Instruction Following in the Wild

Gili Lior, Asaf Yehudai, Ariel Gera, Liat Ein-Dor

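Affiliations: The Hebrew University of Jerusalem; IBM Research

AI summary: Introduces WildIFEval, a dataset of 7K real user instructions with multiple constraints for evaluating how well LLMs follow instructions; all evaluated models leave substantial room for improvement.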

Comments
Accepted to the 5th Workshop on Generation, Evaluation and Metrics (GEM) at ACL 2026
Abstract

Recent LLMs have shown remarkable success in following user instructions, yet handling instructions with multiple constraints remains a significant challenge. In this work, we introduce WildIFEval - a large-scale dataset of 7K real user instructions with diverse, multi-constraint conditions. Unlike prior datasets, our collection spans a broad lexical and topical spectrum of constraints, extracted from natural user instructions. We categorize these constraints into eight high-level classes to capture their distribution and dynamics in real-world scenarios. Leveraging WildIFEval, we conduct extensive experiments to benchmark the instruction-following capabilities of leading LLMs. WildIFEval clearly differentiates between small and large models, and demonstrates that all models have substantial room for improvement on such tasks. We analyze the effects of the number and type of constraints on performance, revealing interesting patterns of model constraint-following behavior. We release our dataset to promote further research on instruction-following under complex, realistic conditions.

2510.03896 2026-06-12 cs.CV cs.RO Updated version

GAE: Unleashing Physical Potential of VLM with Generalizable Action Expert

Mingyu Liu, Zheng Huang, Xiaoyi Lin, Muzhi Zhu, Canyu Zhao, Yating Wang, Haoyi Zhu, Hao Chen, Chunhua Shen

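Affiliations: arXiv.org; University of California, Berkeley; Stanford University

AI summary: Proposes the Generalizable Action Expert (GAE), which turns a VLM's high-level intent into continuous action trajectories through a sparse geometric interface; an Action Pre-training, Pointcloud Fine-tuning (APPF) scheme decouples action dynamics from geometry grounding, giving strong generalization across visual domains, camera viewpoints, and instructions.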

Abstract

Vision-language models demonstrate strong reasoning and planning abilities, yet grounding these predictions into precise robot actions remains a central challenge. Existing Vision-Language-Action methods typically entangle reasoning and action generation, leading to limited generalization. We propose Generalizable Action Expert (GAE), a task-agnostic model that converts sparse geometric plans into dense robot actions. Our approach introduces a sparse geometric interface: the VLM predicts sparse 3D waypoints representing high-level intention, while GAE maps these waypoints together with real-time point cloud observations to continuous action trajectories. GAE is pretrained on a large-scale pointcloud-trajectory dataset comprising 150k trajectories from both simulation and real-world robots. To further improve efficiency and generalization, we introduce an Action Pre-training, Pointcloud Fine-tuning (APPF) scheme that decouples learning action dynamics from geometry grounding. After pretraining, GAE is frozen and reused across downstream tasks, requiring only lightweight fine-tuning of the VLM to produce the sparse interface. Experiments show that our method achieves strong performance and generalization across diverse visual domains, camera viewpoints, and natural language instructions.

2505.20076 2026-06-12 cs.LG Updated version

ExPLAIND: Unifying Model, Data, and Training Attribution to Study Model Behavior

Florian Eichin, Yupei Du, Philipp Mondorf, Maria Matveev, Barbara Plank, Michael A. Hedderich

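Affiliations: University of Michigan

AI summary: Proposes ExPLAIND, a unified framework that attributes model behavior to model components, data, and training trajectory and supports explanations across granularities. It generalizes gradient path kernels to recast AdamW-trained models as kernel machines and derives parameter-wise and step-wise influence scores; case studies cover a Transformer exhibiting grokking and a two-phase dynamic in EuroLLM pretraining.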

Comments
published at ICML 2026, code at https://github.com/mainlp/explaind
Abstract

Post-hoc interpretability methods typically attribute a model's behavior to its components, data, or training trajectory in isolation, and are often tied to a particular level of granularity along the local-to-global spectrum. This leads to explanations that lack a unified view and may miss key interactions. We present ExPLAIND, a theoretically grounded, unified framework that integrates model components, data, and training trajectory while supporting explanations across granularities. We generalize recent work on gradient path kernels, reformulating models trained by AdamW as kernel machines. From the resulting kernel feature maps, we derive novel parameter-wise and step-wise influence scores. We empirically validate the resulting decomposition of model behavior in several settings and apply ExPLAIND to two case studies. Our findings on a Transformer exhibiting Grokking support previously proposed learning phases, while refining the final phase as one in which outer layers align around a representation pipeline learned after memorization. For EuroLLM pretraining, ExPLAIND reveals a two-phase dynamic, with the first characterized by outer-layer MLP learning and the second by increased relative influence of intermediate attention layers. These results establish ExPLAIND as a unified framework for interpreting model behavior and training dynamics.

2509.22050 2026-06-12 cs.LG Updated version

BrainPro: Towards Large-scale Brain State-aware EEG Representation Learning

Yi Ding, Muyun Jiang, Weibang Jiang, Shuailei Zhang, Xinliang Zhou, Chenyu Liu, Shanglin Li, Yong Li, Cuntai Guan

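Affiliations: Nanyang Technological University; Shanghai Jiao Tong University; Advanced Telecommunications Research Institute International; Southeast University

AI summary: Proposes BrainPro, which learns shared and state-specific representations through retrieval-based spatial alignment and a brain state-decoupling module, achieving state-of-the-art performance on nine public BCI datasets.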

Comments
31 pages, 11 figures
Abstract

Electroencephalography (EEG) reflects underlying brain states, whose activities are distributed across brain regions and manifest as spatial patterns on the scalp. Learning these spatially structured, state-related patterns requires consistent spatial representations across datasets. However, existing EEG foundation models are typically based on self-attention, which does not preserve location-specific information and struggles to align signals recorded with different channel configurations. Moreover, brain states contain both shared and state-specific regional activity, suggesting that learning neurophysiologically plausible, state-aware representations can complement the shared representations targeted by current models and improve downstream decoding. To address these limitations, we propose BrainPro, a large EEG model that combines a retrieval-based spatial learning mechanism for cross-layout spatial alignment with a brain state-decoupling module that learns both shared and state-specific representations through parallel encoders and region-aware reconstruction. Pre-trained on a large EEG corpus, BrainPro achieves state-of-the-art performance across nine public BCI datasets spanning emotion, motor, speech, stress, mental disease, and attention tasks. Analyses of spatial filters, channel-drop robustness, and encoder contributions further validate the effectiveness of its spatial alignment and state-aware pathways. These results show that BrainPro achieves improved interpretability of learned spatial patterns and produces representations that benefit diverse EEG decoding tasks.

2509.21398 2026-06-12 cs.CV eess.IV Updated version

Skeleton Sparsification and Densification Scale-Spaces

Julia Gierke, Pascal Peter

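Affiliations: Mathematical Image Analysis Group, Saarland University; Department of Mathematics and Computer Science, Saarland University

AI summary: Proposes skeletonisation scale-spaces, which sparsify the medial axis to simplify shapes hierarchically, and adds densification as the inverse, coarse-to-fine process; applications include robust skeletonisation, shape compression, and stiffness enhancement for additive manufacturing.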

Abstract

The Hamilton-Jacobi skeleton, also known as the medial axis, is a powerful shape descriptor that represents binary objects in terms of the centres of maximal inscribed discs. Despite its broad applicability, the medial axis suffers from sensitivity to noise: Minor boundary variations can lead to disproportionately large and undesirable expansions of the skeleton. Classical pruning methods mitigate this shortcoming by systematically removing extraneous skeletal branches. This sequential simplification of skeletons resembles the principle of sparsification scale-spaces that embed images into a family of reconstructions from increasingly sparse pixel representations. We combine both worlds by introducing skeletonisation scale-spaces: They leverage sparsification of the medial axis to achieve hierarchical simplification of shapes. Unlike conventional pruning, our framework inherently satisfies key scale-space properties such as hierarchical architecture, controllable simplification, and equivariance to geometric transformations. We provide a rigorous theoretical foundation in both continuous and discrete formulations and extend the concept further with densification. By growing the skeleton successively instead of shrinking it, we allow inverse progression from coarse to fine scales. Densification scale-spaces can even reach beyond the original skeleton to produce overcomplete shape representations with relevancy for practical applications. Through proof-of-concept experiments, we demonstrate the effectiveness of our framework for practical tasks including robust skeletonisation, shape compression, and stiffness enhancement for additive manufacturing.

2509.19526 2026-06-12 cs.LG cs.SY eess.SY 版本更新

Metriplectic Conditional Flow Matching for Dissipative Dynamics

度量辛条件流匹配用于耗散动力学

Ali Baheri, Lars Lindemann

发表机构 * Rochester Institute of Technology, Rochester, NY, USA(美国罗切斯特理工学院) Automatic Control Laboratory, ETH Zürich, Switzerland(瑞士苏黎世联邦理工学院自动控制实验室)

AI总结 提出度量辛条件流匹配(MCFM)方法,通过将保守-耗散分解融入向量场和结构保持采样器,学习耗散动力学,保证能量单调递减和长期稳定性。

详情
AI中文摘要

度量辛条件流匹配(MCFM)在不违反第一原理的情况下学习耗散动力学。神经替代模型常常注入能量并破坏长期推演的稳定性;MCFM 则将保守-耗散分解同时融入向量场和结构保持采样器。MCFM 通过短时间过渡上的条件流匹配进行训练,避免了长时间推演伴随的梯度计算。在推理时,Strang-prox 方案交替进行辛更新和近端度量步骤,确保离散能量衰减;当有可信能量可用时,可选投影强制严格衰减。我们提供了连续和离散时间保证,将该参数化和采样器与守恒、单调耗散和稳定推演联系起来。在一个受控机械基准上,MCFM 产生的相图更接近真实情况,并且与同等表达能力的无约束神经流相比,能量增加和正能量率事件显著减少,同时匹配终端分布拟合。

英文摘要

Metriplectic conditional flow matching (MCFM) learns dissipative dynamics without violating first principles. Neural surrogates often inject energy and destabilize long-horizon rollouts; MCFM instead builds the conservative-dissipative split into both the vector field and a structure-preserving sampler. MCFM trains via conditional flow matching on short transitions, avoiding long rollout adjoints. At inference time, a Strang-prox scheme alternates a symplectic update with a proximal metric step, ensuring discrete energy decay; an optional projection enforces strict decay when a trusted energy is available. We provide continuous- and discrete-time guarantees linking this parameterization and sampler to conservation, monotonic dissipation, and stable rollouts. On a controlled mechanical benchmark, MCFM yields phase portraits closer to ground truth and markedly fewer energy-increase and positive energy rate events than an equally expressive unconstrained neural flow, while matching terminal distributional fit.
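
The alternation of a symplectic update with a proximal dissipative step can be illustrated on a toy problem. The following is a minimal sketch, not the authors' implementation: it assumes a unit-mass, unit-stiffness harmonic oscillator with linear friction, uses the exact rotation as the symplectic (energy-conserving) substep, and uses the closed-form proximal map of a quadratic friction potential as the dissipative substep. All function names and the values of `gamma` and `h` are hypothetical.

```python
import math

def energy(q, p):
    # Hamiltonian of a unit-mass, unit-stiffness oscillator (assumed toy system).
    return 0.5 * (q * q + p * p)

def prox_dissipation(p, gamma, h):
    # Proximal map of x -> h * gamma * x**2 / 2 evaluated at p, i.e.
    # argmin_x 0.5 * (x - p)**2 + 0.5 * h * gamma * x**2.
    # This equals an implicit Euler step for dp/dt = -gamma * p and never increases |p|.
    return p / (1.0 + gamma * h)

def symplectic_rotation(q, p, h):
    # Exact flow of the conservative part; a rotation, so it preserves the energy.
    c, s = math.cos(h), math.sin(h)
    return c * q + s * p, -s * q + c * p

def strang_prox_step(q, p, gamma, h):
    # Strang splitting: half dissipative step, full conservative step, half dissipative step.
    p = prox_dissipation(p, gamma, 0.5 * h)
    q, p = symplectic_rotation(q, p, h)
    p = prox_dissipation(p, gamma, 0.5 * h)
    return q, p

def rollout(q0, p0, gamma=0.3, h=0.1, steps=200):
    q, p = q0, p0
    energies = [energy(q, p)]
    for _ in range(steps):
        q, p = strang_prox_step(q, p, gamma, h)
        energies.append(energy(q, p))
    return energies

energies = rollout(1.0, 0.0)
```

In this toy setting the conservative substep leaves the energy unchanged and each proximal substep can only shrink the momentum, so the recorded energies form a non-increasing sequence up to floating-point rounding. A learned vector field would replace both hand-written substeps.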

2509.01630 2026-06-12 cs.LG cs.MA cs.RO cs.SY eess.SY 版本更新

DiffCoord: Differentiable Coordination for Distributed Multi-Agent Trajectory Optimization

DiffCoord: 分布式多智能体轨迹优化的可微协调

Bingheng Wang, Yichao Gao, Tianchen Sun, Shanker Ajay, Lin Zhao

发表机构 * Department of Electrical and Computer Engineering, National University of Singapore(新加坡国立大学电子与计算机工程系)

AI总结 提出DiffCoord框架,将截断ADMM-DDP管道的耦合参数通过端到端元学习联合优化,利用智能体神经网络实现任务自适应,并扩展到不同智能体数量。在协作空中运输系统中验证,相比现有方法将每智能体梯度计算时间减少70%。

详情
AI中文摘要

将交替方向乘子法(ADMM)与微分动态规划(DDP)相结合,为分布式多智能体轨迹优化提供了一个可扩展的框架。在实践中,ADMM通常被截断以提高计算效率,这紧密耦合了原本分别控制协调质量和任务性能的参数。在本文中,我们提出了可微协调(DiffCoord),一个统一框架,联合元学习截断ADMM-DDP管道的这些耦合参数。这些参数由智能体神经网络生成以实现任务自适应,并且同构智能体之间共享相同的网络,从而能够扩展到不同数量的智能体。我们通过端到端微分ADMM-DDP管道实现了高效的元学习。值得注意的是,这产生了一个辅助的ADMM-LQR分布式梯度求解器,用于计算和协调关于这些参数的元梯度。该求解器继承了管道的计算结构,使得关键计算结果可以重用,并能够在智能体和轨迹时间线上高效并行化。我们通过协作空中运输系统的数值和物理实验验证了DiffCoord,该系统在狭窄空间中重新配置四旋翼编队以实现安全的六自由度负载操作。它能够鲁棒地适应变化的团队规模和负载动力学,同时与最先进的轨迹梯度方法相比,将每智能体梯度计算时间减少高达70%。

英文摘要

Integrating the Alternating Direction Method of Multipliers (ADMM) with Differential Dynamic Programming (DDP) provides a scalable framework for distributed multi-agent trajectory optimization. In practice, ADMM is typically truncated for computational efficiency, tightly coupling parameters that would otherwise separately govern coordination quality and task performance. In this paper, we propose Differentiable Coordination (DiffCoord), a unified framework that jointly meta-learns these coupled parameters for the truncated ADMM-DDP pipeline. These parameters are generated by agent-wise neural networks for task adaptation, and the same networks are shared among isomorphic agents to enable scalability to varying agent counts. We achieve efficient meta-learning by differentiating the ADMM-DDP pipeline end-to-end. Notably, this yields an auxiliary ADMM-LQR distributed gradient solver that computes and coordinates meta-gradients with respect to these parameters. This solver inherits the computational structure of the pipeline, enabling reuse of key computation results and efficient parallelization over agents and along trajectory horizons. We validate DiffCoord through numerical and physical experiments on a cooperative aerial transport system, where it reconfigures quadrotor formations for safe 6-DoF load manipulation in tight spaces. It adapts robustly to varying team sizes and load dynamics, while reducing per-agent gradient computation time by up to 70% compared with state-of-the-art trajectory-gradient methods.
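
To make the effect of truncation concrete, here is a small sketch of scaled-form consensus ADMM on scalar quadratic costs. It is an illustrative stand-in only: the DDP subproblems, the meta-learning, and the neural parameter generators of DiffCoord are not modelled, and the cost weights `a`, targets `c`, penalty `rho`, and function name are hypothetical.

```python
def truncated_consensus_admm(a, c, rho, iters):
    """Run `iters` rounds of scaled consensus ADMM for
    min_x sum_i 0.5 * a[i] * (x - c[i])**2 and return the consensus value z."""
    n = len(a)
    z = 0.0
    u = [0.0] * n  # scaled dual variables, one per agent
    for _ in range(iters):
        # Local step: each agent minimises its own cost plus the quadratic penalty.
        x = [(a[i] * c[i] + rho * (z - u[i])) / (a[i] + rho) for i in range(n)]
        # Coordination step: average the local solutions and duals.
        z = sum(x[i] + u[i] for i in range(n)) / n
        # Dual step: accumulate each agent's disagreement with the consensus.
        u = [u[i] + x[i] - z for i in range(n)]
    return z

a = [1.0, 2.0, 3.0]
c = [0.0, 3.0, 6.0]
exact = sum(ai * ci for ai, ci in zip(a, c)) / sum(a)  # weighted mean of the targets
z_truncated = truncated_consensus_admm(a, c, rho=1.0, iters=3)
z_converged = truncated_consensus_admm(a, c, rho=1.0, iters=200)
```

With only a few iterations the consensus value still depends noticeably on `rho`, which is the kind of coupling between coordination and task parameters that the abstract describes for truncated pipelines.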

2509.04682 2026-06-12 cs.SD cs.AI cs.CV cs.IR cs.LG eess.AS 版本更新

GetNetUPAM: Ecologically Informed Nested Cross-Validation and Noise-Robust Attention for Marine Bioacoustic Monitoring

GetNetUPAM:生态信息嵌套交叉验证与噪声鲁棒注意力用于海洋生物声学监测

Nicholas R. Rasmussen, Rodrigue Rizk, Longwei Wang, KC Santosh

发表机构 * University of California, San Diego(加州大学圣地亚哥分校)

AI总结 提出GetNetUPAM框架,通过分层嵌套交叉验证保持生态异质性,并集成CBAM空间注意力的ARPA-N网络,在高噪声低信噪比条件下实现鲁棒泛化,在零训练区域将误报率降低约10倍。

详情
Comments
Resubmitted and under review as an anonymous submission to IEEE TAI - We are allowed an archive submission. Final formatting is yet to be determined.
AI中文摘要

部署可靠的生物声学监测系统需要能够在高噪声、低信噪比条件下泛化的模型,以及能够暴露部署相关故障模式的评估协议,这些在当前UPAM实践中基本未得到解决。内在噪声、可变传播以及混合的生物和人为源会导致分布偏移,而传统模型和单次划分评估会掩盖这些偏移,夸大性能并掩盖不稳定性。我们提出GetNetUPAM,一种分层嵌套交叉验证框架,它利用嵌套阶段来量化模型稳定性,而不是调整以获取夸大的保留分数。通过将数据划分为站点-年份块,GetNetUPAM保留了生态异质性,并迫使每个外层折代表不同的环境条件,防止过拟合局部噪声或传感器伪影。内层分层折衡量整个UPAM信号分布上的泛化能力,强制模型开发与外层保留部署条件严格分离。使用GetNetUPAM,我们评估了自适应分辨率池化和注意力网络(ARPA-N),一种用于不规则频谱图维度的CNN架构。ARPA-N将CBAM空间注意力集成为学习型噪声抑制器,生成注意力图以定位真实叫声结构,并避免标准CNN在长窗口数据上利用的全局非生物线索。在GetNetUPAM下,ARPA-N在不同环境条件下鲁棒泛化。在零训练的Balleny Islands区域,它在固定90%召回率下将每小时误报率降低超过一个数量级(约10倍),并在各折上持续改进指标。这些进展提供了可重复的基准,推动UPAM向可扩展、部署可靠的生态监测发展。

英文摘要

Deploying reliable bioacoustic monitoring systems requires models that generalize under high-noise, low-SNR conditions and evaluation protocols that expose deployment-relevant failure modes, gaps largely unaddressed in current UPAM practice. Intrinsic noise, variable propagation, and mixed biological and anthropogenic sources induce distribution shifts that conventional models and single-split evaluations obscure, inflating performance and masking instability. We introduce GetNetUPAM, a hierarchical nested cross-validation framework that uses the nested stage to quantify model stability rather than tune for inflated hold-out scores. By partitioning data into site-year blocks, GetNetUPAM preserves ecological heterogeneity and forces each outer fold to represent a distinct environmental regime, preventing overfitting to localized noise or sensor artifacts. Inner stratified folds measure generalization across the full UPAM signal distribution, enforcing strict separation between model development and the outer held-out deployment condition. Using GetNetUPAM, we evaluate the Adaptive Resolution Pooling and Attention Network (ARPA-N), a CNN architecture for irregular spectrogram dimensions. ARPA-N integrates CBAM spatial attention as a learned noise suppressor, producing attention maps that localize true call structure and avoid the global, non-biological cues exploited by standard CNNs on long-window data. Under GetNetUPAM, ARPA-N generalizes robustly across diverse environmental regimes. In the zero-training support Balleny Islands region, it reduces false positives per hour by over an order of magnitude (approximately 10x) at fixed 90 percent recall, yielding consistently improved metrics across folds. These advances provide a reproducible benchmark and move UPAM toward scalable, deployment-reliable ecological monitoring.
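
The site-year blocking with stratified inner folds can be sketched as follows. This is a simplified illustration under stated assumptions, not the GetNetUPAM code: each record is assumed to carry hypothetical `site`, `year`, and `label` fields, one outer fold is formed per site-year block, and inner folds are built by round-robin assignment within each label.

```python
from collections import defaultdict

def site_year_outer_folds(records):
    """Yield (block, dev_indices, test_indices) with one held-out (site, year) block per fold."""
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[(rec["site"], rec["year"])].append(idx)
    for block in sorted(blocks):
        test_idx = blocks[block]
        dev_idx = [i for b in sorted(blocks) if b != block for i in blocks[b]]
        yield block, dev_idx, test_idx

def stratified_inner_folds(dev_idx, records, k):
    """Split development indices into k folds, dealing each label out round-robin."""
    by_label = defaultdict(list)
    for i in dev_idx:
        by_label[records[i]["label"]].append(i)
    folds = [[] for _ in range(k)]
    for label in sorted(by_label):
        for pos, i in enumerate(by_label[label]):
            folds[pos % k].append(i)
    return folds

# Toy metadata (made up for illustration): 2 sites x 2 years x 4 clips each.
records = [
    {"site": s, "year": y, "label": lab}
    for s in ("A", "B")
    for y in (2019, 2020)
    for lab in (0, 1, 0, 1)
]
outer = list(site_year_outer_folds(records))
```

Model selection would use only the inner folds of `dev_idx`; the held-out block is scored once per outer fold, so the spread of outer scores reflects stability across environmental regimes rather than tuning.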

2508.14143 2026-06-12 cs.LG q-bio.NC 版本更新

The Urysohn Machine: A Metric-Topological Model of Computation

Urysohn机器:一种度量-拓扑计算模型

Xin Li

发表机构 * University at Albany, State University of New York(纽约州立大学阿尔巴尼分校)

AI总结 提出Urysohn机器,一种基于度量分离、前沿结构和收缩的分类计算模型,通过Urysohn三元组和分层构造实现分类复杂度度量与可重用推理。

详情
AI中文摘要

我们引入Urysohn机器,一种面向分类计算的有效模型,其中度量分离、前沿结构和收缩是计算状态的显式部分。其基本对象是Urysohn三元组:一个支撑区域、一个目标划分以及一个存储在可重用度量库中的分离分类器。拓扑基础是有限单纯形设置下的构造性Urysohn实现定理。它通过嵌套多面体区域的二进阶梯构建分离器,并为其前沿配备链级微积分:前沿是循环,层级之间的壳层边界由前沿之差给出。该构造产生两种相关的复杂度度量:决策边界宽度(单个分类器边界的几何度量)和Urysohn宽度(库或实现所表示的总前沿质量)。我们证明了摊销分离定理,该定理表明在显式边界足迹假设下,逼近宽度为的边界达到精度所需的简单基三元组数量与边界宽度成正比,与分辨率成反比。我们还引入了一种对比分离算子,其图割泛函能从采样度量数据中一致地估计决策边界宽度,而其拉普拉斯谱则能证明类组件结构和电导率。最后,我们分析了动态Urysohn阶梯,并证明了四个保证:商塌缩下的可分离性、已提交前沿的稳定性、收缩下的有界容量以及商距离下的可扩展性。这些结果共同给出了分类复杂度、摊销推理和组合重用的度量-拓扑解释,在保留经典可计算性的同时,揭示了纯符号描述所隐藏的几何结构。

英文摘要

We introduce the Urysohn Machine, an effective model of classification-oriented computation in which metric separation, frontier structure, and contraction are explicit parts of the computational state. Its basic object is a Urysohn Triple: a support region, a target partition, and a separating classifier stored in a reusable Metric Library. The topological foundation is a constructive Urysohn Realization theorem for finite simplicial settings. It builds separators from dyadic ladders of nested polyhedral regions and equips their frontiers with a chain-level calculus: frontiers are cycles, and shells between levels have boundaries given by differences of frontiers. This construction yields two related complexity measures: decision-boundary width, the geometric measure of a single classifier's boundary, and Urysohn width, the total frontier mass represented by a library or realization. We prove an Amortized Separation Theorem showing that approximating a boundary of width to accuracy requires a number of simple basis triples proportional to boundary width and inversely proportional to resolution, under explicit boundary-footprint assumptions. We also introduce a contrastive separation operator whose graph-cut functional consistently estimates decision-boundary width from sampled metric data, while its Laplacian spectrum certifies class-component structure and conductance. Finally, we analyze the dynamic Urysohn ladder and prove four guarantees: separability under quotient collapse, stability of committed frontiers, bounded capacity under contraction, and scalability with quotient distance. Together, these results give a metric-topological account of classification complexity, amortized inference, and compositional reuse that preserves classical computability while exposing geometric structure hidden by purely symbolic descriptions.

2508.04888 2026-06-12 cs.LG 版本更新

Retrieval-Augmented Foundation Models for Water Level Prediction in the Everglades

用于大沼泽地水位预测的检索增强基础模型

Rahuul Rangaraj, Jimeng Shi, Rajendra Paudel, Giri Narasimhan, Yanzhao Wu

发表机构 * Florida International University(佛罗里达国际大学) Everglades National Park(大沼泽地国家公园)

AI总结 针对大沼泽地水位预测,提出检索增强机制,利用统计相似性或互信息检索历史水文事件,提升预训练时序基础模型的长期预测性能,尤其在极端事件中效果显著。

详情
AI中文摘要

大沼泽地的准确水位预测对于防洪、干旱管理、水资源规划和生物多样性保护至关重要。尽管最近的时序基础模型在通用任务(体现在其预训练中)上表现出色,但它们在特定领域应用中的有效性仍未被充分理解。在这项工作中,我们整理了一个用于大沼泽地水位预测的领域特定数据集,并观察到当前最先进模型的性能仍然有限。为了解决这一差距,我们利用检索增强机制,从历史观测的外部档案中检索类似的多变量水文事件,以丰富这些预训练模型的输入上下文。我们研究了两种检索策略:基于统计相似性的检索和基于互信息的检索,并分析了纳入检索到的历史上下文如何影响预测性能。大量实验表明,检索增强一致地改善了长期水位预测,并在极端事件期间产生了不成比例的更大收益,这对环境决策尤为关键。我们的研究提供了经验证据,表明基于类比检索可以有益于环境科学中的预训练时序基础模型,为它们在大沼泽地水文预测中的应用提供了关于其优势、局限性和失败模式的实用见解。尽管在大沼泽地进行了评估,但所提出的框架是通用的,并且可以应用于给定时间序列数据的其他水文系统。代码和数据已在 https://github.com/rahuul2992000/WaterRAF 公开。

英文摘要

Accurate water level forecasting in the Everglades is essential for flood mitigation, drought management, water resource planning, and biodiversity conservation. While recent time-series foundation models have shown strong performance on generic tasks (represented in their pre-training), their effectiveness in domain-specific applications remains insufficiently understood. In this work, we curate a domain-specific dataset for water-level forecasting in the Everglades and observe that the performance of current state-of-the-art models remains limited. To address this gap, we leverage a retrieval-augmented mechanism that retrieves analogous multivariate hydrological episodes from an external archive of historical observations to enrich the input context of those pre-trained models. We study two retrieval strategies, statistical similarity-based retrieval and mutual information-based retrieval, and analyze how incorporating retrieved historical contexts affects predictive performance. Extensive experiments show that retrieval augmentation consistently improves long-horizon water level forecasts and yields disproportionately larger gains during extreme events, which is particularly critical for environmental decision-making. Our study provides empirical evidence that analog-based retrieval can benefit pretrained time-series foundation models in environmental science, offering practical insights into their strengths, limitations, and failure modes when applied to hydrological forecasting in the Everglades. Although evaluated in the Everglades, the proposed framework is general and can be applied to other hydrological systems given time series data. The code and data have been made publicly available at https://github.com/rahuul2992000/WaterRAF.
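
Similarity-based retrieval of analogous episodes can be sketched in a few lines. The example below is a simplified, univariate illustration and not the code from the linked repository: it assumes z-normalised Euclidean distance as the similarity measure, and the function names and toy series are hypothetical.

```python
import math

def znorm(window):
    # Standardise a window so that retrieval compares shape rather than absolute level.
    mean = sum(window) / len(window)
    var = sum((x - mean) ** 2 for x in window) / len(window)
    std = math.sqrt(var) or 1.0  # constant windows map to all zeros
    return [(x - mean) / std for x in window]

def retrieve_analogs(history, query, k):
    """Return the start indices of the k historical windows most similar to `query`."""
    m = len(query)
    q = znorm(query)
    scored = []
    for start in range(len(history) - m + 1):
        w = znorm(history[start:start + m])
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(q, w)))
        scored.append((dist, start))
    scored.sort()
    return [start for _, start in scored[:k]]

# Made-up toy series: the steadily rising segment [1, 2, 3] starts at index 3.
history = [5.0, 5.0, 4.0, 1.0, 2.0, 3.0, 3.0, 2.0, 9.0, 1.0]
query = [10.0, 20.0, 30.0]
best = retrieve_analogs(history, query, k=1)
```

In a retrieval-augmented setup, the retrieved windows (and what followed them historically) would be appended to the forecasting model's input context; a mutual-information criterion would replace the distance used here.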

2508.01656 2026-06-12 cs.CL cs.AI cs.CY cs.HC physics.soc-ph 版本更新

Authorship Attribution in Multilingual Machine-Generated Texts

多语言机器生成文本的作者归属

Lucio La Cava, Dominik Macko, Róbert Móro, Ivan Srba, Andrea Tagarelli

发表机构 * DIMES Department, University of Calabria(卡拉布里亚大学DIMES系) Kempelen Institute of Intelligent Technologies(肯佩伦智能技术研究所)

AI总结 提出多语言作者归属问题,研究单语言方法在18种语言和8个生成器上的跨语言迁移能力,发现显著局限。

详情
Comments
Accepted at ACL 2026 - Main
AI中文摘要

随着大型语言模型(LLM)达到类人的流畅性和连贯性,区分机器生成文本(MGT)与人类撰写的内容变得越来越困难。虽然MGT检测的早期工作侧重于二元分类,但LLM的不断发展和多样性需要更细粒度且更具挑战性的作者归属(AA),即能够识别文本背后的确切生成器(LLM或人类)。然而,目前AA仍局限于单语言环境,其中英语是研究最多的语言,忽视了现代LLM的多语言性质和使用。在这项工作中,我们引入了多语言作者归属问题,涉及将文本归因于跨多种语言的人类或多个LLM生成器。聚焦于18种语言——涵盖多个语系和书写系统——以及8个生成器(7个LLM和人类撰写类别),我们研究了单语言AA方法在多语言环境中的适用性,包括其跨语言迁移能力,以及生成器对归属性能的影响。我们的结果表明,虽然某些单语言AA方法可以适应多语言环境,但仍然存在显著的局限性和挑战,特别是在跨不同语系迁移时,这凸显了多语言AA的复杂性以及需要更稳健的方法以更好地匹配现实场景。

英文摘要

As Large Language Models (LLMs) have reached human-like fluency and coherence, distinguishing machine-generated text (MGT) from human-written content has become increasingly difficult. While early efforts in MGT detection focused on binary classification, the growing number and diversity of LLMs call for the more fine-grained and more challenging task of authorship attribution (AA), i.e., identifying the precise generator (LLM or human) behind a text. However, AA currently remains confined to monolingual settings, with English the most investigated language, overlooking the multilingual nature and usage of modern LLMs. In this work, we introduce the problem of Multilingual Authorship Attribution, which involves attributing texts to human or multiple LLM generators across diverse languages. Focusing on 18 languages -- covering multiple families and writing scripts -- and 8 generators (7 LLMs and the human-authored class), we investigate the multilingual suitability of monolingual AA methods in terms of their cross-lingual transferability, and the impact of generators on attribution performance. Our results reveal that while certain monolingual AA methods can be adapted to multilingual settings, significant limitations and challenges remain, particularly in transferring across diverse language families, underscoring the complexity of multilingual AA and the need for more robust approaches to better match real-world scenarios.

2507.22791 2026-06-12 cs.CV 版本更新

Modality-Aware Feature Matching in Visual and Vision-Language Applications: A Comprehensive Survey

视觉与视觉-语言应用中的模态感知特征匹配:全面综述

Weide Liu, Wei Zhou, Jun Liu, Ping Hu, Jun Cheng, Jungong Han, Weisi Lin

发表机构 * School of Computing and Artificial Intelligence, Jiangxi University of Finance and Economics(江西财经大学计算机与人工智能学院) College of Computing and Data Science, Nanyang Technological University(南洋理工大学计算机与数据科学学院) School of Computer Science and Informatics, Cardiff University(卡迪夫大学计算机科学与信息学院) School of Computing and Communications, Lancaster University(兰卡斯特大学计算机与通讯学院) School of Computer Science and Engineering, University of Electronic Science and Technology of China(电子科技大学计算机科学与工程学院) Institute for Infocomm Research, Agency for Science, Technology and Research (A*STAR)(新加坡资讯研究院,科技研究局(A*STAR)) Department of Automation, Tsinghua University(清华大学自动化系)

AI总结 综述基于模态的特征匹配,涵盖传统手工方法和现代深度学习方法,重点讨论跨RGB、深度、3D点云、LiDAR、医学图像及视觉-语言模态的进展,突出模态感知技术。

详情
Comments
CSUR
AI中文摘要

特征匹配是计算机视觉中的一项基础任务,对于图像检索、立体匹配、三维重建和SLAM等应用至关重要。本综述全面回顾了基于模态的特征匹配,探索了传统手工方法,并强调了当代深度学习方法在各种模态中的应用,包括RGB图像、深度图像、3D点云、LiDAR扫描、医学图像和视觉-语言交互。传统方法利用Harris角点等检测器和SIFT、ORB等描述符,在中等模态内变化下表现出鲁棒性,但在显著模态差距下表现不佳。当代基于深度学习的方法,例如基于CNN的SuperPoint和基于Transformer的LoFTR等无检测器策略,显著提高了跨模态的鲁棒性和适应性。我们重点介绍了模态感知的进展,例如用于深度图像的几何和深度特定描述符、用于3D点云的稀疏和密集学习方法、用于LiDAR扫描的注意力增强神经网络,以及用于复杂医学图像匹配的MIND描述符等专门解决方案。跨模态应用,特别是在医学图像配准和视觉-语言任务中,突显了特征匹配处理日益多样化数据交互的演变。

英文摘要

Feature matching is a cornerstone task in computer vision, essential for applications such as image retrieval, stereo matching, 3D reconstruction, and SLAM. This survey comprehensively reviews modality-based feature matching, exploring traditional handcrafted methods and emphasizing contemporary deep learning approaches across various modalities, including RGB images, depth images, 3D point clouds, LiDAR scans, medical images, and vision-language interactions. Traditional methods, leveraging detectors like Harris corners and descriptors such as SIFT and ORB, demonstrate robustness under moderate intra-modality variations but struggle with significant modality gaps. Contemporary deep learning-based methods, exemplified by detector-free strategies like CNN-based SuperPoint and transformer-based LoFTR, substantially improve robustness and adaptability across modalities. We highlight modality-aware advancements, such as geometric and depth-specific descriptors for depth images, sparse and dense learning methods for 3D point clouds, attention-enhanced neural networks for LiDAR scans, and specialized solutions like the MIND descriptor for complex medical image matching. Cross-modal applications, particularly in medical image registration and vision-language tasks, underscore the evolution of feature matching to handle increasingly diverse data interactions.

2507.22028 2026-06-12 cs.CV cs.RO 版本更新

From Seeing to Experiencing: Scaling Navigation Foundation Models with Reinforcement Learning

从看见到体验:通过强化学习扩展导航基础模型

Honglin He, Yukai Ma, Brad Squicciarini, Wayne Wu, Bolei Zhou

发表机构 * University of California, Los Angeles(加州大学洛杉矶分校) Coco Robotics(Coco机器人)

AI总结 提出S2E框架,结合离线视频预训练和模拟环境强化学习,通过锚点引导分布匹配和残差注意力模块,提升导航基础模型的交互性和安全性。

详情
Comments
27 pages, 20 figures, 9 tables, conference
AI中文摘要

基于大规模网络数据训练的导航基础模型使智能体能够跨不同环境和实体进行泛化。然而,这些仅基于离线数据训练的模型往往缺乏推理其行为后果或通过反事实理解进行适应的能力。因此,它们在现实世界城市导航中面临重大限制,其中交互性和安全行为(如避开障碍物和移动行人)至关重要。为解决这些挑战,我们引入了从看见到体验(S2E)学习框架,通过强化学习扩展导航基础模型的能力。S2E结合了离线视频预训练和强化学习后训练的优势。它保持了从大规模真实世界视频中获得的模型泛化能力,同时通过模拟环境中的强化学习增强了其交互性。具体而言,我们引入了两项创新:(1)用于离线预训练的锚点引导分布匹配策略,通过基于锚点的监督稳定学习并建模多样化的运动模式;(2)用于强化学习的残差注意力模块,从模拟环境中获得反应性行为,同时不抹除模型的预训练知识。此外,我们建立了一个全面的端到端评估基准NavBench-GS,该基准基于真实世界场景的光照逼真3D高斯溅射重建,并融入了物理交互。它可以系统评估导航基础模型的泛化能力和安全性。

英文摘要

Navigation foundation models trained on massive web-scale data enable agents to generalize across diverse environments and embodiments. However, these models, which are trained solely on offline data, often lack the capacity to reason about the consequences of their actions or adapt through counterfactual understanding. They thus face significant limitations in real-world urban navigation, where interactive and safe behaviors, such as avoiding obstacles and moving pedestrians, are critical. To tackle these challenges, we introduce the Seeing-to-Experiencing (S2E) learning framework to scale the capability of navigation foundation models with reinforcement learning. S2E combines the strengths of pretraining on offline videos and post-training through reinforcement learning. It maintains the model's generalizability acquired from large-scale real-world videos while enhancing its interactivity through reinforcement learning in simulation environments. Specifically, we introduce two innovations: (1) an Anchor-Guided Distribution Matching strategy for offline pretraining, which stabilizes learning and models diverse motion patterns through anchor-based supervision; and (2) a Residual-Attention Module for reinforcement learning, which obtains reactive behaviors from simulation environments without erasing the model's pretrained knowledge. Moreover, we establish a comprehensive end-to-end evaluation benchmark, NavBench-GS, built on photorealistic 3D Gaussian Splatting reconstructions of real-world scenes that incorporate physical interactions. It can systematically assess the generalizability and safety of navigation foundation models.

2507.20208 2026-06-12 cs.CL 版本更新

From Benchmarks to Skills: Low-Rank Factors for LLM Evaluation

从基准到技能:LLM评估的低秩因子

Aviya Maimon, Amir DN Cohen, Gal Vishne, Shauli Ravfogel, Reut Tsarfaty

发表机构 * Bar-Ilan University(巴伊兰大学) OriginAI Data Science Institute, Columbia University(哥伦比亚大学数据科学研究所) Center for Data Science, New York University(纽约大学数据科学中心)

AI总结 通过因子分析发现LLM基准性能矩阵本质低秩,揭示任务冗余,提出基于潜在技能空间的评估框架,用于识别冗余任务、用小任务子集建模新模型和按技能轮廓选模型。

详情
AI中文摘要

当前对大型语言模型(LLM)的评估严重依赖于不断增长的基准集合和聚合基准分数,然而这种比较实际捕捉了什么,以及这些分数揭示了模型的哪些底层能力,仍不清楚。在此,我们提出了一种新的LLM评估范式,通过询问基准性能是反映许多独立能力,还是依赖于少量共享维度。为了回答这个问题,我们将因子分析(FA)应用于LLM与基准的大规模性能矩阵(60×44),揭示了该矩阵的固有低秩结构。也就是说,少量潜在因子捕捉了完整任务空间中的大部分结构。这种低秩几何揭示了现有任务之间存在大量冗余,并解释了为什么许多基准似乎测量了重叠的能力。我们进一步表明,这些潜在因子对应于连贯的、类似技能的LLM行为维度。利用这个潜在技能空间,我们为LLM评估和下游用户提供了三个实用工具:(i)识别冗余任务,(ii)使用少量任务子集对新模型进行画像,以及(iii)选择与所需技能轮廓一致的模型。我们的方法为单一聚合分数的事实标准提供了一个可靠的替代方案,并建立了一个可解释且实用的框架,用于理解和基准测试LLM的核心能力。

英文摘要

Current evaluations of large language models (LLMs) rely heavily on a growing collection of benchmarks and on aggregate benchmark scores, yet it remains unclear what this comparison actually captures, and what these scores reveal about models' underlying capabilities. Here, we propose a new paradigm for LLM evaluation, by asking whether benchmark performance reflects many independent abilities, or rather relies on a small number of shared dimensions. To answer this, we apply Factor Analysis (FA) to a massive performance matrix of LLMs versus benchmarks ($60\times44$), revealing an intrinsically low-rank structure of that matrix. That is, a small number of latent factors captures most of the structure in the full task space. This low-rank geometry reveals substantial redundancy across existing tasks and explains why many benchmarks appear to be measuring overlapping abilities. We further show that these latent factors correspond to coherent, skill-like dimensions of LLM behavior. Leveraging this latent skill-space, we deliver three practical tools for LLM evaluation and downstream users: (i) identifying redundant tasks, (ii) profiling new models using a small subset of tasks, and (iii) selecting models aligned with desired skill profiles. Our method provides a solid alternative to the de-facto standard of a single aggregate score, and establishes an interpretable and practical framework for understanding and benchmarking LLM core capabilities.

2507.10599 2026-06-12 cs.CL cs.AI cs.LG 版本更新

Emergence of Hierarchical Emotion Organization in Large Language Models

大型语言模型中层级情感组织的涌现

Maya Okawa, Bo Zhao, Eric J. Bigelow, Rose Yu, Tomer Ullman, Ekdeep Singh Lubana, Hidenori Tanaka

发表机构 * University of California, Berkeley(加州大学伯克利分校) Stanford University(斯坦福大学) University of Washington(华盛顿大学) University of Tokyo(东京大学)

AI总结 受情感轮理论启发,分析大型语言模型输出中情感状态间的概率依赖关系,发现模型自然形成与人类心理模型一致的层级情感树,且更大模型发展出更复杂的层级结构,同时揭示社会经济角色在情感识别中的系统性偏差。

详情
Comments
ICML 2026
AI中文摘要

随着大型语言模型(LLMs)越来越多地驱动对话代理,理解它们如何建模用户的情绪状态对于伦理部署至关重要。受情感轮(即一种认为情感层级组织的心理学框架)的启发,我们分析了模型输出中情感状态之间的概率依赖关系。我们发现LLMs自然形成与人类心理模型一致的层级情感树,且更大的模型发展出更复杂的层级结构。我们还揭示了跨社会经济角色的情感识别中存在系统性偏差,对于交叉、代表性不足的群体,错误分类会叠加。人类研究显示出惊人的相似性,表明LLMs内化了社会感知的某些方面。除了突出LLMs中的涌现情感推理能力,我们的结果还暗示了利用认知基础理论开发更好模型评估的潜力。

英文摘要

As large language models (LLMs) increasingly power conversational agents, understanding how they model users' emotional states is critical for ethical deployment. Inspired by emotion wheels, i.e., a psychological framework that argues emotions organize hierarchically, we analyze probabilistic dependencies between emotional states in model outputs. We find that LLMs naturally form hierarchical emotion trees that align with human psychological models, and larger models develop more complex hierarchies. We also uncover systematic biases in emotion recognition across socioeconomic personas, with compounding misclassifications for intersectional, underrepresented groups. Human studies reveal striking parallels, suggesting that LLMs internalize aspects of social perception. Beyond highlighting emergent emotional reasoning in LLMs, our results hint at the potential of using cognitively-grounded theories for developing better model evaluations.

2507.05019 2026-06-12 cs.LG cs.AI 版本更新

Meta-Learning Transformers to Improve In-Context Generalization

元学习变换器以改进上下文泛化

Lorenzo Braccaioli, Anna Vettoruzzo, Prabhant Singh, Joaquin Vanschoren, Mohamed-Rafik Bouguelia, Nicola Conci

发表机构 * University of Trento, Italy(特伦托大学,意大利) Eindhoven University, Netherlands(埃因霍温大学,荷兰) University of Doha for Science and Technology, Qatar(多哈科学与技术大学,卡塔尔)

AI总结 提出利用多个小规模领域特定数据集训练上下文学习器,通过元学习提升跨领域泛化能力,并在持续学习和无监督场景下验证其鲁棒性。

详情
AI中文摘要

上下文学习使变换器模型能够仅基于输入提示泛化到新任务,无需任何权重更新。然而,现有的训练范式通常依赖于大型非结构化数据集,这些数据集存储成本高,难以评估质量和平衡性,并且由于包含敏感信息而引发隐私和伦理问题。受这些局限性和风险的启发,我们提出了一种替代训练策略,利用多个小规模、领域特定的数据集集合。我们经验性地证明,此类数据质量的提高和多样性的增加提升了上下文学习器在其训练领域之外的泛化能力,同时与在单个大规模数据集上训练的模型相比,性能相当。我们通过利用元学习在Meta-Album集合上训练上下文学习器来研究这一范式,在多种设置下进行实验。首先,我们在受控环境中展示性能,其中测试领域完全排除在训练知识之外。其次,我们探索这些模型在信息可访问时间有限的持续场景中对遗忘的鲁棒性。最后,我们探索更具挑战性的无监督场景。我们的发现表明,当在精心策划的数据集集合上训练时,变换器仍然能够泛化用于上下文预测,同时在模块化和可替换性方面提供了优势。

英文摘要

In-context learning enables transformer models to generalize to new tasks based solely on input prompts, without any need for weight updates. However, existing training paradigms typically rely on large, unstructured datasets that are costly to store, difficult to evaluate for quality and balance, and pose privacy and ethical concerns due to the inclusion of sensitive information. Motivated by these limitations and risks, we propose an alternative training strategy where we leverage a collection of multiple, small-scale, and domain-specific datasets. We empirically demonstrate that the increased quality and diversity of such data improve the generalization abilities of in-context learners beyond their training domain, while achieving comparable performance with models trained on a single large-scale dataset. We investigate this paradigm by leveraging meta-learning to train an in-context learner on the Meta-Album collection under several settings. Firstly, we show the performance in a controlled environment, where the test domain is completely excluded from the training knowledge. Secondly, we explore the robustness of these models to forgetting in a continual scenario where the information is accessible for a limited time. Finally, we explore the more challenging unsupervised scenario. Our findings demonstrate that transformers still generalize for in-context prediction when trained on a curated dataset collection while offering advantages in modularity and replaceability.

2507.03660 2026-06-12 cs.LG 版本更新

Single vs. Multiple Branches in DeepONet and S-DeepONet: Network Architecture Follows Coupling in Multiphysics Systems

DeepONet和S-DeepONet中的单分支与多分支:网络架构遵循多物理系统中的耦合

Jaewan Park, Kazuma Kobayashi, Qibang Liu, Seid Koric, Diab Abueidda, Syed Bahauddin Alam

发表机构 * National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign(国家超级计算应用中心,伊利诺伊大学厄巴纳-香槟分校) The Grainger College of Engineering, Mechanical Science and Engineering, University of Illinois at Urbana-Champaign(格兰杰工程学院,机械科学与工程系,伊利诺伊大学厄巴纳-香槟分校) The Grainger College of Engineering, Nuclear, Plasma & Radiological Engineering, University of Illinois at Urbana-Champaign(格兰杰工程学院,核、等离子体与放射工程系,伊利诺伊大学厄巴纳-香槟分校) Department of Industrial and Manufacturing Systems Engineering, Kansas State University(工业与制造系统工程系,堪萨斯州立大学) Civil and Urban Engineering Department, New York University Abu Dhabi, UAE(土木与城市工程系,纽约大学阿布扎比分校,阿联酋)

AI总结 研究比较单分支与多分支神经算子架构在强耦合多物理系统中的表现,发现单分支网络在紧耦合场景下通过共享潜在表示优于多分支,而多分支适用于解耦或单物理任务,代理模型加速高达1.8×10^4倍。

详情
AI中文摘要

复杂物理系统的实时预测需要从数据中学习并代表强多物理耦合的代理模型。深度算子网络在单物理问题中已显示出成功,但其在捕捉耦合系统(如热-机械或电-热耦合)中非线性相互作用方面的有效性仍未充分探索。这里我们提出一个实际问题:神经算子的架构是否应反映其旨在建模的物理耦合强度?我们比较了单分支和多分支设计,包括前馈和顺序循环形式,跨越三个代表性系统:具有异质源的反应-扩散问题、具有温度依赖电导率和焦耳热的非线性热电问题,以及钢凝固的粘塑性热-机械模型。单分支网络在紧耦合场景中通过鼓励共享潜在表示持续优于多分支变体,而多分支设计对于解耦或单物理任务仍然有利。一旦训练完成,这些代理模型提供全场预测的速度比基于物理的求解器快高达1.8×10^4倍。

英文摘要

Real-time prediction of complex physical systems requires surrogate models that learn from data while representing strong multiphysics coupling. Deep Operator Networks have shown success in single-physics problems, yet their effectiveness in capturing nonlinear interactions in coupled systems (such as thermo-mechanical or electro-thermal coupling) remains underexplored. Here we pose a practical question: should the architecture of a neural operator reflect the strength of physical coupling it aims to model? We compare single-branch and multi-branch designs, in both feedforward and sequential recurrent forms, across three representative systems: a reaction--diffusion problem with heterogeneous sources, a nonlinear thermo-electrical problem with temperature-dependent conductivity and Joule heating, and a viscoplastic thermo-mechanical model of steel solidification. Single-branch networks consistently outperform multi-branch variants in tightly coupled regimes by encouraging shared latent representations, whereas multi-branch designs remain favorable for decoupled or single-physics tasks. Once trained, these surrogates deliver full-field predictions up to $1.8 \times 10^4$ times faster than physics-based solvers.

2506.23033 2026-06-12 cs.LG stat.ML 版本更新

How Reliable are Fairness Audits with Unreliable Data?

不可靠数据下的公平性审计有多可靠?

Yash Vardhan Tomar

发表机构 * Purdue University(普渡大学)

AI总结 研究受保护标签缺失对公平性缓解审计的影响,提出种子校准压力测试区分缺失效应与随机波动,发现正可用性缺失通常不改变缓解方法效果,但无标签端点表现不同,且阈值优化可能将单轴公平性增益转化为交叉危害。

详情
AI中文摘要

公平性审计是负责任机器学习部署的关键组成部分。然而,在受保护标签访问不完整的情况下,审计建议的可靠性仍然知之甚少。在这项工作中,我们关注公平性缓解审计中的受保护标签缺失。我们引入了一种种子校准压力测试,以将缺失效应与完整标签下已经存在的种子间波动区分开来。在ACS/Folktables任务中,保留部分受保护标签的缺失设置通常不会使选定的缓解方法偏离完整标签下的种子间基线。在受保护标签访问为0%时,候选方法退化为经验风险最小化(ERM)基线和确定性的平局决胜,而不是揭示广泛的缺失效应。我们还发现,阈值优化可以将单一受保护轴上的公平性增益转化为高于种子基线的交叉危害,并且这一关于阈值优化器的发现在随机森林验证下依然成立。总体而言,我们的结果强调,在将受保护标签缺失视为审计脆弱性的证据之前,应同时报告种子零校准、候选集背景和交叉后果。

英文摘要

Fairness audits are a key component of responsible machine-learning deployment. Yet, audit-recommendation reliability under incomplete protected-label access is still poorly understood. In this work, we focused on protected-label missingness in fairness mitigation audits. We introduced a seed-calibrated stress test to separate missingness effects from seed-to-seed movement already present under complete labels. Across ACS/Folktables tasks, missingness settings that retain some protected labels usually do not move selected mitigation methods beyond a complete-label seed-to-seed baseline. At $0\%$ protected-label access, candidates collapse to an empirical-risk-minimization baseline and deterministic tie-breaking rather than revealing a broad missingness effect. We also found that threshold optimization can turn fairness gains on a single protected axis into intersectional harm above a seed baseline, and this threshold-optimizer finding persists under random-forest validation. Overall, our results highlight that protected-label missingness should be reported with seed-null calibration, candidate-set context, and intersectional consequences before it is treated as evidence of audit fragility.
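
The idea of calibrating a missingness effect against seed-to-seed movement can be sketched as a simple comparison. This is a hypothetical decision rule for illustration, not the paper's procedure: it assumes one fairness-gap value per random seed, takes the largest pairwise gap among complete-label runs as the seed-null spread, and flags a missingness setting only if its mean shift exceeds that spread. All numbers below are made up.

```python
from itertools import combinations
from statistics import mean

def seed_null_spread(complete_scores):
    # Largest difference between any two seeds when all protected labels are available.
    return max(abs(a - b) for a, b in combinations(complete_scores, 2))

def exceeds_seed_null(complete_scores, missing_scores):
    """Return (flag, shift): flag is True when the mean shift under missingness
    is larger than the seed-to-seed spread seen under complete labels."""
    shift = abs(mean(missing_scores) - mean(complete_scores))
    return shift > seed_null_spread(complete_scores), shift

# Illustrative fairness-gap values per seed (not results from the paper).
complete = [0.10, 0.12, 0.11, 0.13]
mild_missing = [0.11, 0.12, 0.12, 0.13]
severe_missing = [0.20, 0.21, 0.19, 0.22]

mild_flag, mild_shift = exceeds_seed_null(complete, mild_missing)
severe_flag, severe_shift = exceeds_seed_null(complete, severe_missing)
```

Reporting the shift next to the seed-null spread makes it visible when an apparent missingness effect is no larger than ordinary run-to-run variation.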

2506.21855 2026-06-12 cs.CV 版本更新

Periodic-MAE: Periodic Video Masked Autoencoder for rPPG Estimation

Periodic-MAE:用于rPPG估计的周期性视频掩码自编码器

Jiho Choi, Sang Jun Lee

发表机构 * Division of Electronics and Information Engineering, Jeonbuk National University, Republic of Korea(韩国全北国立大学电子与信息工程学部)

AI总结 提出Periodic-MAE,一种自监督框架,通过周期性感知掩码和生理频带约束,从无标签面部视频学习可泛化的时空表示,提升远程光电容积描记法(rPPG)估计性能。

详情
AI中文摘要

在本文中,我们提出Periodic-MAE,一种自监督框架,用于从无标签面部视频中学习周期性生理信号的通用时空表示。该方法利用掩码自编码器(MAE),通过重建掩码视频令牌学习高维面部表示,而不依赖远程光电容积描记法(rPPG)特定监督。为了明确地将表示学习与rPPG特征对齐,我们引入了一种基于视频重采样的周期性感知帧掩码策略,使编码器能够学习捕获与脉搏信号估计相关的准周期性时间模式的表示。此外,生理频带约束被集成到MAE预训练框架中,利用脉搏信号在频域的稀疏性,引导学习到的表示朝向生理上有意义的模式。预训练后,学习到的表示被迁移到下游rPPG估计任务,其中编码器作为通用特征提取器,从面部视频中恢复脉搏相关信号。我们在四个基准数据集(包括PURE、UBFC-rPPG、MMPD和V4V)上进行了广泛实验。此外,我们在无约束光照条件和受试者运动下收集的真实世界rPPG数据集上评估了所提方法。实验结果表明,Periodic-MAE持续改善了rPPG估计性能,特别是在具有挑战性的跨数据集和真实世界评估场景中。我们的代码可在以下网址获取:https://github.com/ziiho08/Periodic-MAE。

英文摘要

In this paper, we propose Periodic-MAE, a self-supervised framework for learning generalizable spatio-temporal representations of periodic physiological signals from unlabeled facial videos. The proposed method leverages a masked autoencoder (MAE), which learns high-dimensional facial representations by reconstructing masked video tokens without relying on remote photoplethysmography (rPPG) specific supervision. To explicitly align representation learning with the characteristics of rPPG, we introduce a periodicity-aware frame masking strategy based on video resampling, enabling the encoder to learn representations that capture quasi-periodic temporal patterns relevant to pulse signal estimation. In addition, physiological bandlimit constraints are integrated into the MAE pre-training framework, exploiting the sparsity of pulse signals in the frequency domain to guide the learned representations toward physiologically meaningful patterns. After pre-training, the learned representations are transferred to downstream rPPG estimation, where the encoder serves as a generic feature extractor for recovering pulse-related signals from facial videos. We conduct extensive experiments on four benchmark datasets, including PURE, UBFC-rPPG, MMPD, and V4V. Moreover, we evaluate the proposed approach on a real-world rPPG dataset collected under unconstrained lighting conditions and subject motion. Experimental results demonstrate that Periodic-MAE consistently improves rPPG estimation performance, particularly in challenging cross-dataset and real-world evaluation settings. Our code is available at https://github.com/ziiho08/Periodic-MAE.
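
The abstract does not spell out how the physiological bandlimit constraint is formulated. As a rough sketch of the underlying idea, the snippet below measures what fraction of a signal's spectral power falls inside a plausible heart-rate band. The band edges (0.66 to 3.0 Hz, roughly 40 to 180 bpm), the function name `band_power_ratio`, and the synthetic signals are assumptions for illustration, not values taken from the paper.

```python
import cmath
import math

def band_power_ratio(signal, fps, low_hz=0.66, high_hz=3.0):
    """Fraction of non-DC spectral power inside [low_hz, high_hz] (naive DFT)."""
    n = len(signal)
    mean_val = sum(signal) / n
    centered = [x - mean_val for x in signal]
    in_band = 0.0
    total = 0.0
    for k in range(1, n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(centered))
        power = abs(coeff) ** 2
        freq = k * fps / n  # frequency of bin k in Hz
        total += power
        if low_hz <= freq <= high_hz:
            in_band += power
    return in_band / total if total > 0 else 0.0

fps = 30.0
# Synthetic 10-second traces sampled at 30 frames per second.
pulse = [math.sin(2.0 * math.pi * 1.2 * t / fps) for t in range(300)]    # 72 bpm
flicker = [math.sin(2.0 * math.pi * 5.0 * t / fps) for t in range(300)]  # outside the band

pulse_ratio = band_power_ratio(pulse, fps)
flicker_ratio = band_power_ratio(flicker, fps)
```

A quantity such as `1 - band_power_ratio(...)` could in principle act as a penalty that discourages out-of-band energy, but whether Periodic-MAE uses this form is not stated in the abstract.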

2502.18959 2026-06-12 cs.LG stat.ML 版本更新

Fourier Multi-Component and Multi-Layer Neural Networks: Unlocking High-Frequency Potential

傅里叶多分量与多层神经网络:解锁高频潜力

Shijun Zhang, Hongkai Zhao, Yimin Zhong, Haomin Zhou

发表机构 * Department of Applied Mathematics(应用数学系) Hong Kong Polytechnic University(香港理工大学) Department of Mathematics(数学系) Duke University(杜克大学) Department of Mathematics and Statistics(数学与统计学系) Auburn University(奥本大学) School of Mathematics(数学学院) Georgia Institute of Technology(佐治亚理工学院)

AI总结 提出傅里叶多分量与多层神经网络(FMMNN),结合正弦型激活函数与多分量多层结构,通过低秩架构实现指数级函数逼近能力,优化景观优于标准全连接网络,并设计缩放随机初始化方法加速训练,在高频函数逼近任务中取得高精度与良好收敛性。

详情
Comments
Our code and implementation details are available at https://github.com/ShijunZhangMath/FMMNN
AI中文摘要

神经网络的结构及其激活函数的选择对其性能至关重要。同样重要的是确保这两个元素良好匹配,因为它们的对齐是有效表示和学习的关键。在本文中,我们引入了傅里叶多分量与多层神经网络(FMMNN),该模型将正弦型激活函数与MMNN的多分量多层结构相结合。在FMMNN中,每个分量表示为固定随机正弦型基函数的可训练线性组合,而多层组合则生成更复杂且自适应的频率特征。我们证明,即使在低秩架构下,FMMNN仍能保持函数逼近的指数级表达能力。我们还分析了FMMNN的优化景观,发现其比标准全连接神经网络更有利,尤其是对于高频目标。此外,我们提出了一种针对FMMNN第一层权重的缩放随机初始化方法,当样本充足时,该方法能加速训练并提高最终性能。大量数值实验支持我们的理论见解,表明FMMNN在振荡函数逼近基准上实现了高精度和良好的收敛行为。

英文摘要

The architecture of a neural network and the choice of its activation function are both fundamental to its performance. Equally important is ensuring that these two elements are well matched, as their alignment is key to effective representation and learning. In this paper, we introduce the Fourier Multi-Component and Multi-Layer Neural Network (FMMNN), a model that combines sine-type activations with the multi-component and multi-layer structure of MMNNs. In an FMMNN, each component is represented as a trainable linear combination of fixed random sine-type basis functions, while multi-layer composition generates more complex and adaptive high-frequency features. We establish that FMMNNs retain exponential expressive power for function approximation even under a low-rank architectural structure. We also analyze the optimization landscape of FMMNNs and find it to be substantially more favorable than that of standard fully connected neural networks, especially for high-frequency targets. In addition, we propose a scaled random initialization method for the first-layer weights in FMMNNs, which accelerates training and improves final performance when sufficient samples are available. Extensive numerical experiments support our theoretical insights, showing that FMMNNs achieve strong accuracy and favorable convergence behavior on oscillatory function-approximation benchmarks.
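
The abstract describes each component as a trainable linear combination of fixed random sine-type basis functions. The sketch below fits a single such component to an oscillatory target with plain full-batch gradient descent. It covers one component only and omits the multi-layer composition; the width, frequency scale, step size, and all function names are assumptions for illustration and do not reproduce the authors' implementation or their scaled initialization.

```python
import math
import random

def make_sine_basis(width, freq_scale, rng):
    """Draw fixed (non-trainable) frequencies and phases for sin(w * x + b)."""
    return [(rng.uniform(-freq_scale, freq_scale), rng.uniform(0.0, 2.0 * math.pi))
            for _ in range(width)]

def features(x, basis):
    return [math.sin(w * x + b) for w, b in basis]

def mse(coeffs, rows, targets):
    preds = [sum(a * p for a, p in zip(coeffs, row)) for row in rows]
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets)

def fit_component(xs, targets, basis, steps=200):
    """Train only the linear coefficients; the sine basis stays fixed."""
    rows = [features(x, basis) for x in xs]
    width = len(basis)
    n = len(xs)
    coeffs = [0.0] * width
    lr = 1.0 / (2.0 * width)  # conservative step: features are bounded by 1
    for _ in range(steps):
        residuals = [sum(a * p for a, p in zip(coeffs, row)) - t
                     for row, t in zip(rows, targets)]
        grads = [2.0 / n * sum(r * row[j] for r, row in zip(residuals, rows))
                 for j in range(width)]
        coeffs = [a - lr * g for a, g in zip(coeffs, grads)]
    return coeffs, rows

rng = random.Random(0)
xs = [i / 99 for i in range(100)]
targets = [math.sin(2.0 * math.pi * 3.0 * x) for x in xs]  # oscillatory target
basis = make_sine_basis(width=32, freq_scale=30.0, rng=rng)

coeffs, rows = fit_component(xs, targets, basis)
initial_error = mse([0.0] * len(basis), rows, targets)
final_error = mse(coeffs, rows, targets)
```

Because only the outer coefficients are trained, the fit is a convex least-squares problem, which is one intuition for why a fixed sine basis can make optimization easier than training every weight.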

2506.01274 2026-06-12 cs.CV cs.AI 版本更新

ReFoCUS: Reinforcement-guided Frame Optimization for Contextual Understanding

ReFoCUS: 用于上下文理解的强化引导帧优化

Hosu Lee, Junho Kim, Hyunjun Kim, Yong Man Ro

发表机构 * Korea Advanced Institute of Science & Technology(韩国科学技术院) University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

AI总结 提出ReFoCUS框架,首次将在线策略梯度强化学习集成到视频大语言模型的帧级优化中,通过自回归和查询条件选择架构学习帧选择策略,无需显式帧级监督,提升视频问答推理准确性。

详情
Comments
Project page: https://interlive-team.github.io/ReFoCUS/
AI中文摘要

近期大型多模态模型(LMMs)的进展实现了有效的视觉-语言推理,然而视频理解能力仍受限于次优的帧选择策略,尽管视频专用LMMs发展迅速。先前的工作尝试通过静态启发式或外部检索模块来提供帧级信息,但这些方法往往无法捕捉与给定用户查询相关的视觉线索,混淆了原始视觉动态与真正的语义相关性。在本文中,我们介绍了ReFoCUS(用于上下文理解的强化引导帧优化),这是首个将在线策略梯度强化学习集成到视频-LLMs帧级优化的框架。ReFoCUS旨在学习帧选择策略,利用来自参考模型的奖励信号来捕捉其对最佳支持时间接地响应的帧组合的潜在评分行为。为了高效探索巨大的组合帧空间,我们采用了一种自回归且查询条件的选择架构,确保上下文一致性的同时降低复杂度。我们的策略学习无需显式帧级监督,因为它隐式地发现了最优且语义一致的帧组合。ReFoCUS在多个视频问答基准测试中持续提高了推理准确性,证明了将帧选择与模型内部效用对齐的优势。

英文摘要

Recent progress in Large Multi-modal Models (LMMs) has enabled effective vision-language reasoning, yet video understanding remains constrained by suboptimal frame selection strategies, despite the rapid development of video-specialized LMMs. Prior works attempted to solve this with static heuristics or external retrieval modules to feed frame-level information, but these approaches often fail to capture visual cues grounded in the given user queries, conflating raw visual dynamics with true semantic relevance. In this paper, we introduce ReFoCUS (Reinforcement-guided Frame Optimization for Contextual UnderStanding), the first framework to integrate online policy-gradient reinforcement learning into frame-level optimization for video-LLMs. ReFoCUS aims to learn a frame selection policy, leveraging reward signals derived from reference models to capture their underlying scoring behavior over frame combinations that best support temporally grounded responses. To efficiently explore the large combinatorial frame space, we employ an autoregressive and query-conditional selection architecture that ensures contextual consistency while reducing complexity. Our policy learning removes the need for explicit frame-level supervision, as it implicitly discovers optimal and semantically consistent frame compositions. ReFoCUS consistently improves reasoning accuracy across multiple video QA benchmarks, demonstrating the advantage of aligning frame selection with model-internal utility.
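
To make the combination of autoregressive frame selection and policy-gradient learning more concrete, here is a toy sketch. It is a simplification and not the ReFoCUS architecture: the per-frame scores are fixed numbers rather than outputs of a query-conditioned network that is re-evaluated after each pick, and the reward and baseline are made-up constants standing in for a reference model's signal. All names are hypothetical.

```python
import math
import random

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sample_frames(frame_scores, num_select, rng):
    """Pick frames one at a time; already-chosen frames are masked out."""
    remaining = list(range(len(frame_scores)))
    chosen = []
    log_prob = 0.0
    for _ in range(num_select):
        probs = softmax([frame_scores[i] for i in remaining])
        pick = rng.choices(range(len(remaining)), weights=probs, k=1)[0]
        log_prob += math.log(probs[pick])  # accumulate log-probability of the sequence
        chosen.append(remaining.pop(pick))
    return chosen, log_prob

def reinforce_loss(log_prob, reward, baseline):
    """Score-function surrogate: minimizing it favors above-baseline selections."""
    return -(reward - baseline) * log_prob

rng = random.Random(0)
scores = [0.1, 2.0, -1.0, 0.5, 1.5, 0.0]  # hypothetical relevance scores for six frames
frames, log_prob = sample_frames(scores, num_select=3, rng=rng)
loss = reinforce_loss(log_prob, reward=1.0, baseline=0.4)
```

Sampling one frame at a time keeps each step a choice over at most N frames instead of a single choice over all N-choose-k subsets, which is the complexity reduction the abstract alludes to.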