arXivDaily arXiv每日学术速递 周一至周五更新
2509.17428 2026-05-20 cs.CL

QWHA: Quantization-Aware Walsh-Hadamard Adaptation for Parameter-Efficient Fine-Tuning on Large Language Models

QWHA: 量化感知的沃尔什-哈达玛适应用于大型语言模型的参数高效微调

Hyesung Jeon, Seojune Lee, Beomseok Kang, Yulhwa Kim, Jae-Joon Kim

发表机构 * Seoul National University(首尔国立大学) Sungkyunkwan University(成均馆大学)

AI总结 本文提出QWHA方法,通过将基于傅里叶变换的适配器与沃尔什-哈达玛变换结合,改进量化感知参数高效微调,有效减少量化误差并降低计算成本。

Comments 25 pages, 9 figures, 14 tables

Journal ref ICLR 2026 Poster

AI中文摘要

大型语言模型(LLMs)的高效部署需求推动了量化(降低推理成本)和参数高效微调(PEFT,降低训练开销)的研究。这进而催生了量化感知PEFT,以获得准确且高效的量化模型。在该设定中,在微调之前减少量化误差对于获得高模型精度至关重要。然而,现有依赖低秩适应的方法表示能力有限。近期基于傅里叶相关变换(FT)的适配器比低秩适配器具有更强的表示能力,但将其直接集成到量化模型中往往导致误差削减效果不佳,并增加计算开销。为克服这些限制,我们提出了QWHA,该方法采用沃尔什-哈达玛变换(WHT)作为变换核,并结合包含自适应参数选择和值细化的新型适配器初始化方案,将基于FT的适配器集成到量化模型中。我们证明QWHA能有效减轻量化误差并促进微调,其设计显著降低了计算成本。实验结果表明,QWHA在低比特量化精度上始终优于基线,并相比现有基于FT的适配器实现了显著的训练加速。代码可在 https://github.com/vantaa89/qwha 获取。

英文摘要

The demand for efficient deployment of large language models (LLMs) has driven interest in quantization, which reduces inference cost, and parameter-efficient fine-tuning (PEFT), which lowers training overhead. This motivated the development of quantization-aware PEFT to produce accurate yet efficient quantized models. In this setting, reducing quantization error prior to fine-tuning is crucial for achieving high model accuracy. However, existing methods that rely on low-rank adaptation suffer from limited representational capacity. Recent Fourier-related transform (FT)-based adapters offer greater representational power than low-rank adapters, but their direct integration into quantized models often results in ineffective error reduction and increased computational overhead. To overcome these limitations, we propose QWHA, a method that integrates FT-based adapters into quantized models by employing the Walsh-Hadamard Transform (WHT) as the transform kernel, together with a novel adapter initialization scheme incorporating adaptive parameter selection and value refinement. We demonstrate that QWHA effectively mitigates quantization errors while facilitating fine-tuning, and that its design substantially reduces computational cost. Experimental results show that QWHA consistently outperforms baselines in low-bit quantization accuracy and achieves significant training speedups over existing FT-based adapters. The code is available at https://github.com/vantaa89/qwha.
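
To illustrate the transform kernel named in the abstract, here is a minimal sketch of a fast Walsh-Hadamard transform in plain Python. It is not taken from the QWHA code base: the function name `fwht`, the unnormalized scaling, and the natural (Hadamard) ordering are assumptions made for this example. It shows that the transform needs only additions and subtractions, with no multiplications by transform coefficients.

```python
def fwht(values):
    """Unnormalized fast Walsh-Hadamard transform (natural ordering).

    Hypothetical helper for illustration; the input length must be a power of two.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(values)
    h = 1
    while h < n:
        # Butterfly step: combine entries that are h apart using only add/subtract.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return out


signal = [1, 0, 1, 0, 0, 1, 1, 0]
spectrum = fwht(signal)
# Applying the unnormalized transform twice returns the input scaled by its length.
recovered = [v // len(signal) for v in fwht(spectrum)]
```

Because the unnormalized transform is its own inverse up to a factor of the length, the same routine serves for both directions in this sketch.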

2509.14968 2026-05-20 cs.LG cs.NI

FAWN: A MultiEncoder Fusion-Attention Wave Network for Integrated Sensing and Communication Indoor Scene Inference

FAWN:一种多编码器融合-注意力波网络用于集成感知与通信室内场景推断

Carlos Barroso-Fernández, Alejandro Calvillo-Fernandez, Antonio de la Oliva, Carlos J. Bernardos

发表机构 * Ericsson(爱立信)

AI总结 本文提出FAWN,一种基于Transformer架构的多编码器融合-注意力波网络,用于整合感知与通信的室内场景推断,通过融合Wi-Fi和5G信号提高环境感知精度。

Comments 7 pages, 6 figures and tables, less than 5500 words. Under revision at IEEE Communications Magazine

AI中文摘要

下一代无线技术有望实现万物互联和智能化的时代。随着对智能需求的增长,网络必须学会更好地理解物理世界。然而,部署专用硬件来感知环境并不总是可行,主要是由于成本和/或复杂性。集成感知与通信(ISAC)在解决这一挑战上迈出了重要一步。在ISAC中,被动感知作为一种成本效益高的解决方案,利用无线通信来感知环境,而不干扰现有通信。然而,当前大多数解决方案仅限于一种技术(主要是Wi-Fi或5G),限制了最大精度。由于不同技术使用不同的频谱,我们看到有必要整合多种技术以扩大覆盖范围。因此,我们利用ISAC被动感知,提出FAWN,一种用于ISAC室内场景推断的多编码器融合-注意力波网络。FAWN基于原始Transformer架构,融合Wi-Fi和5G信息,使网络能够理解物理世界而不干扰当前通信。为了测试我们的解决方案,我们构建了一个原型并将其集成到真实场景中。结果表明,在84%的时间内,误差低于0.6米。

英文摘要

The upcoming generations of wireless technologies promise an era where everything is interconnected and intelligent. As the need for intelligence grows, networks must learn to better understand the physical world. However, deploying dedicated hardware to perceive the environment is not always feasible, mainly due to costs and/or complexity. Integrated Sensing and Communication (ISAC) has made a step forward in addressing this challenge. Within ISAC, passive sensing emerges as a cost-effective solution that reuses wireless communications to sense the environment, without interfering with existing communications. Nevertheless, the majority of current solutions are limited to one technology (mostly Wi-Fi or 5G), constraining the maximum accuracy reachable. As different technologies work with different spectrums, we see a need to integrate more than one technology to augment the coverage area. Hence, we take advantage of ISAC passive sensing to present FAWN, a MultiEncoder Fusion-Attention Wave Network for ISAC indoor scene inference. FAWN builds on the original Transformer architecture to fuse information from Wi-Fi and 5G, making the network capable of understanding the physical world without interfering with the current communication. To test our solution, we have built a prototype and integrated it in a real scenario. Results show errors below 0.6 m about 84% of the time.

2509.14839 2026-05-20 cs.CV

MapAnything: Evaluating Monocular Metric Depth Models for 3D Urban Asset Localization

MapAnything: 评估单目度量深度模型用于3D城市资产定位

Miriam Louise Carnot, Jonas Kunze, Erik Quinten Fastermann, Eric Peukert, André Ludwig, Bogdan Franczyk

发表机构 * ScaDS.AI (University of Leipzig)(ScaDS.AI(莱比锡大学)) University of Leipzig(莱比锡大学) Kühne Logistics University(库纳物流大学) Wrocław University of Economics(弗罗茨瓦夫经济大学)

AI总结 本文提出MapAnything框架,通过单目图像自动映射城市物体和事件,利用度量深度估计模型计算物体坐标,验证其在复杂城市环境中的精度,展示其在交通标志和道路损坏等实际应用中的有效性。

AI中文摘要

城市管理部门越来越多地依赖全面的数据库和城市资产的数字孪生(如交通标志和树木),以及涂鸦或道路损坏等事件的记录,以有效掌握城市状况。数字化提高了对持续更新的空间数据集的需求,但当前的数据采集和维护过程仍涉及大量人工劳动,带来了显著的可扩展性挑战。本文介绍了MapAnything,一种新颖的地理定位框架,能够从单张单目图像自动完成城市物体和事件的空间映射。通过利用先进的度量深度估计模型,MapAnything准确计算物体的地理坐标,将2D图像数据转换为有价值的3D空间信息。该方法将估计的相机到物体距离与几何原理和已知相机参数相结合。我们对该框架进行了详细验证,在复杂城市环境中将其距离估计精度与高精度LiDAR点云进行对比。我们的评估对不同距离区间和语义区域(如道路和植被)上的空间性能进行了细致分析。最后,我们通过具体用例(如映射交通标志和道路路面损坏)展示了该框架的实际有效性,并就将其整合到自动化城市资产清查系统提出了建议。

英文摘要

City administrations increasingly rely on comprehensive databases and urban digital twins of city assets, such as traffic signs and trees, as well as incidents like graffiti or road damage, to maintain an effective overview of urban conditions. Digitization has increased the demand for continuously updated spatial datasets, yet current data acquisition and maintenance processes still involve considerable manual effort, posing significant scalability challenges. This paper introduces MapAnything, a novel geo-localization framework that automates the spatial mapping of urban objects and incidents from a single monocular image. By leveraging advanced Metric Depth Estimation models, MapAnything accurately calculates object geocoordinates, converting 2D image data into valuable 3D spatial information. The methodology integrates the estimated camera-to-object distance with geometric principles and known camera specifications. We present a detailed validation of the framework, comparing its distance-estimation accuracy against high-precision LiDAR point clouds in complex urban environments. Our evaluation provides a granular analysis of spatial performance across various distance intervals and semantic areas, such as roads and vegetation. Finally, we demonstrate the framework's practical efficacy through specific use cases, including mapping traffic signs and road pavement damage, and provide recommendations for its integration into automated urban inventory systems.
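
The step of combining an estimated camera-to-object distance with geometric principles and known camera specifications can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: the function `locate_object`, all parameter names, the pinhole camera model, the level-horizon assumption, and the local flat-earth approximation are assumptions introduced here.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius for a local flat-earth approximation


def locate_object(cam_lat, cam_lon, heading_deg, hfov_deg, image_width, pixel_x, distance_m):
    """Estimate (lat, lon) of an object seen at image column pixel_x.

    All names are hypothetical. Assumes a pinhole camera, a level horizon,
    and that distance_m is the horizontal camera-to-object range in metres.
    """
    # Focal length in pixels, derived from the horizontal field of view.
    focal_px = (image_width / 2) / math.tan(math.radians(hfov_deg) / 2)
    # Angle of the pixel column relative to the optical axis (positive = right).
    offset_rad = math.atan((pixel_x - image_width / 2) / focal_px)
    bearing = math.radians(heading_deg) + offset_rad
    # Split the range into north and east components, then convert to degrees.
    north_m = distance_m * math.cos(bearing)
    east_m = distance_m * math.sin(bearing)
    lat = cam_lat + math.degrees(north_m / EARTH_RADIUS_M)
    lon = cam_lon + math.degrees(east_m / (EARTH_RADIUS_M * math.cos(math.radians(cam_lat))))
    return lat, lon


# Object in the image centre, camera facing north, 100 m away.
obj_lat, obj_lon = locate_object(51.34, 12.37, 0.0, 90.0, 1920, 960, 100.0)
```

The flat-earth conversion is only reasonable for the short ranges typical of street-level imagery; a geodesic library would be the safer choice for longer distances.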

2509.14787 2026-05-20 cs.RO

COMPASS: Confined-space Manipulation Planning with Active Sensing Strategy

COMPASS:基于主动感知策略的受限空间操作规划

Qixuan Li, Chen Le, Dongyue Huang, Jincheng Yu, Xinlei Chen

发表机构 * Shenzhen International Graduate School, Tsinghua University, Shenzhen, China(清华大学深圳国际研究生院,中国深圳) Department of Electronic Engineering, and the Institute for Embodied Intelligence and Robotics, Tsinghua University, Beijing, China(清华大学电子工程系及具身智能与机器人研究院,中国北京) The School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore(南洋理工大学电气与电子工程学院,新加坡)

AI总结 本文提出COMPASS框架,通过主动感知策略在受限和杂乱环境中实现安全操作,提高了操作成功率。

Comments Accepted to the 2026 IEEE International Conference on Robotics and Automation (ICRA 2026)

AI中文摘要

在受限和杂乱环境中进行操作仍然是一个重大挑战,由于部分可观测性和复杂的配置空间。有效在这些环境中进行操作需要一种智能探索策略来安全地理解和搜索目标。在本文中,我们提出COMPASS,一种多阶段探索和操作框架,其特征是具有操作意识的基于采样的规划器。首先,我们通过近场意识扫描减少碰撞风险,以构建局部碰撞图。此外,我们采用多目标效用函数来寻找同时具有信息性和有利于后续操作的视角。此外,我们执行一种受限操作优化策略,以生成遵守障碍物约束的操作姿态。为了系统评估方法在这些困难下的性能,我们提出了一个包含四个挑战性场景的受限空间探索和操作基准。与为其他机器人设计的探索方法和仅考虑信息增益的方法相比,我们的框架在模拟中将操作成功率提高了24.25%。现实世界实验展示了我们的方法在受限环境中进行主动感知和操作的能力。

英文摘要

Manipulation in confined and cluttered environments remains a significant challenge due to partial observability and complex configuration spaces. Effective manipulation in such environments requires an intelligent exploration strategy to safely understand the scene and search for the target. In this paper, we propose COMPASS, a multi-stage exploration and manipulation framework featuring a manipulation-aware sampling-based planner. First, we reduce collision risks with a near-field awareness scan to build a local collision map. Additionally, we employ a multi-objective utility function to find viewpoints that are both informative and conducive to subsequent manipulation. Moreover, we perform a constrained manipulation optimization strategy to generate manipulation poses that respect obstacle constraints. To systematically evaluate the method's performance under these difficulties, we propose a benchmark of confined-space exploration and manipulation containing four levels of challenging scenarios. Compared to exploration methods designed for other robots and methods that consider only information gain, our framework increases manipulation success rate by 24.25% in simulations. Real-world experiments demonstrate our method's capability for active sensing and manipulation in confined environments.

2508.16112 2026-05-20 cs.AI

IR-Agent: Expert-Inspired LLM Agents for Structure Elucidation from Infrared Spectra

IR-Agent: 专家启发的LLM代理用于从红外光谱中解析分子结构

Heewoong Noh, Namkyeong Lee, Gyoung S. Na, Kibum Kim, Chanyoung Park

发表机构 * KAIST(韩国科学技术院) KRICT(韩国化学研究院)

AI总结 本文提出IR-Agent,一种新的多代理框架,用于从红外光谱中解析分子结构,通过模拟专家驱动的分析过程,提升结构解析的准确性。

Comments ICLR 2026

AI中文摘要

光谱分析为解析未知材料提供了关键线索。在各种技术中,红外光谱(IR)在实验室环境中扮演着重要角色,因为它具有高可访问性和低成本。然而,现有方法往往无法反映专家分析过程,并且在整合多种化学知识方面缺乏灵活性,这对于现实世界的分析场景至关重要。在本文中,我们提出了IR-Agent,一种新的多代理框架,用于从红外光谱中解析分子结构。该框架旨在模拟专家驱动的红外分析过程,并具有内在的可扩展性。每个代理专门处理红外解释的特定方面,其互补作用使集成推理成为可能,从而提高整体结构解析的准确性。通过广泛的实验,我们证明了IR-Agent不仅在实验红外光谱上提高了基线性能,而且在各种化学信息形式上表现出强大的适应性。

英文摘要

Spectral analysis provides crucial clues for the elucidation of unknown materials. Among various techniques, infrared spectroscopy (IR) plays an important role in laboratory settings due to its high accessibility and low cost. However, existing approaches often fail to reflect expert analytical processes and lack flexibility in incorporating diverse types of chemical knowledge, which is essential in real-world analytical scenarios. In this paper, we propose IR-Agent, a novel multi-agent framework for molecular structure elucidation from IR spectra. The framework is designed to emulate expert-driven IR analysis procedures and is inherently extensible. Each agent specializes in a specific aspect of IR interpretation, and their complementary roles enable integrated reasoning, thereby improving the overall accuracy of structure elucidation. Through extensive experiments, we demonstrate that IR-Agent not only improves baseline performance on experimental IR spectra but also shows strong adaptability to various forms of chemical information.

2508.14134 2026-05-20 cs.LG cs.AI

ERIS: An Energy-Guided Feature Disentanglement Framework for Out-of-Distribution Time Series Classification

ERIS: 一种面向分布外时间序列分类的能量引导特征解耦框架

Xin Wu, Fei Teng, Ji Zhang, Xingwang Li, Yuxuan Liang

发表机构 * Hong Kong University of Science and Technology(香港科技大学)

AI总结 本文提出ERIS框架,通过能量引导机制和语义指导,解决时间序列分类中分布外数据的可靠特征解耦问题,提升模型鲁棒性和泛化能力。

Journal ref Information Fusion 135, 104407 (2026)

AI中文摘要

理想的时间序列分类(TSC)应能捕捉不变表示,但实现对分布外(OOD)数据的可靠性能仍是一个核心障碍。这一障碍源于模型内在地将领域特定和标签相关特征纠缠在一起,导致虚假相关性。尽管特征解耦旨在解决这一问题,但当前方法大多缺乏必要的语义方向,无法隔离真正普遍的特征。为此,我们提出一个端到端的Energy-Regularized Information for Shift-Robustness(ERIS)框架,以实现引导且可靠的特征解耦。核心思想是有效的解耦不仅需要数学约束,还需要语义指导来锚定分离过程。ERIS集成了三个关键机制来实现这一目标。具体来说,我们首先引入一种能量引导校准机制,为分离过程提供关键的语义指导,使模型能够自我校准。此外,一个权重层面正交性策略强制领域特定和标签相关特征之间的结构性独立,从而减轻它们的干扰。此外,一个辅助对抗泛化机制通过注入结构化扰动来增强鲁棒性。在四个基准测试中的实验表明,ERIS在统计上显著优于最先进的基线方法,始终保持最佳性能排名。

英文摘要

An ideal time series classification (TSC) should be able to capture invariant representations, but achieving reliable performance on out-of-distribution (OOD) data remains a core obstacle. This obstacle arises from the way models inherently entangle domain-specific and label-relevant features, resulting in spurious correlations. While feature disentanglement aims to solve this, current methods are largely unguided, lacking the semantic direction required to isolate truly universal features. To address this, we propose an end-to-end Energy-Regularized Information for Shift-Robustness (ERIS) framework to enable guided and reliable feature disentanglement. The core idea is that effective disentanglement requires not only mathematical constraints but also semantic guidance to anchor the separation process. ERIS incorporates three key mechanisms to achieve this goal. Specifically, we first introduce an energy-guided calibration mechanism, which provides crucial semantic guidance for the separation, enabling the model to self-calibrate. Additionally, a weight-level orthogonality strategy enforces structural independence between domain-specific and label-relevant features, thereby mitigating their interference. Moreover, an auxiliary adversarial generalization mechanism enhances robustness by injecting structured perturbations. Experiments across four benchmarks demonstrate that ERIS achieves a statistically significant improvement over state-of-the-art baselines, consistently securing the top performance rank.
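
Two of the ingredients named in the abstract, an energy signal and a weight-level orthogonality constraint, can be illustrated with a small sketch. The abstract does not give ERIS's exact formulas, so both functions below are assumptions: `energy_score` is the standard free-energy form computed from classifier logits, and `orthogonality_penalty` is a generic squared Frobenius norm of the product of two weight matrices. Neither is claimed to match the paper's definitions.

```python
import math


def energy_score(logits, temperature=1.0):
    """Free energy of a logit vector: -T * log(sum(exp(logit / T))).

    Illustrative only; uses the max-shift trick for numerical stability.
    """
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scaled))
    return -temperature * log_sum_exp


def orthogonality_penalty(w_domain, w_label):
    """Squared Frobenius norm of w_domain @ w_label^T (weights as lists of rows).

    Hypothetical penalty: zero when every row of one matrix is orthogonal
    to every row of the other.
    """
    total = 0.0
    for row_d in w_domain:
        for row_l in w_label:
            dot = sum(a * b for a, b in zip(row_d, row_l))
            total += dot * dot
    return total


confident = energy_score([10.0, 0.0])   # peaked logits give lower energy
uncertain = energy_score([0.0, 0.0])    # flat logits give higher energy
```

In this form, lower energy corresponds to more confident predictions, which is how an energy value can act as a calibration signal; how ERIS actually uses it is described in the paper, not here.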

2507.15698 2026-05-20 cs.CL cs.AI cs.LG

CoLD: Counterfactually-Guided Length Debiasing for Process Reward Models in Mathematical Reasoning

CoLD: 面向数学推理中过程奖励模型的反事实引导长度去偏

Congmin Zheng, Jiachen Zhu, Jianghao Lin, Xinyi Dai, Weiwen Liu, Haoxuan Li, Yong Yu, Weinan Zhang, Mengyue Yang

发表机构 * Shanghai Jiao Tong University(上海交通大学) Huawei Noah’s Ark Lab(华为诺亚实验室) Peking University(北京大学) University of Bristol(布里斯托大学)

AI总结 本文提出CoLD,一种通过反事实引导消除过程奖励模型中长度偏差的统一框架,旨在提高多步骤推理的准确性和简洁性,同时提升下游强化学习性能和跨领域泛化能力。

AI中文摘要

过程奖励模型(PRMs)在评估和引导大型语言模型(LLMs)的多步推理中起着核心作用,特别是在数学问题解决中。然而,我们发现现有PRMs存在普遍的长度偏差:即使语义内容和逻辑有效性未变,它们也倾向于对较长的推理步骤赋予更高的分数。这种偏差会削弱奖励预测的可靠性,并导致推理过程中输出过于冗长。为了解决这一问题,我们提出了CoLD(Counterfactually-Guided Length Debiasing),一种统一的框架,通过三个组件减轻长度偏差:显式的长度惩罚调整、一个训练以捕捉虚假长度相关信号的学得偏差估计器,以及一种联合训练策略,强制奖励预测的长度不变性。我们的方法基于反事实推理,并受因果图分析的启发。在MATH500和GSM-Plus上的广泛实验表明,CoLD提高了步骤选择的准确性,并鼓励了更简洁、逻辑有效的推理。此外,它一致提高了下游RL性能,并通过减轻长度偏差在跨领域中泛化,展示了CoLD强大的泛化能力。

英文摘要

Process Reward Models (PRMs) play a central role in evaluating and guiding multi-step reasoning in large language models (LLMs), especially for mathematical problem solving. However, we identify a pervasive length bias in existing PRMs: they tend to assign higher scores to longer reasoning steps, even when the semantic content and logical validity are unchanged. This bias undermines the reliability of reward predictions and leads to overly verbose outputs during inference. To address this issue, we propose CoLD (Counterfactually-Guided Length Debiasing), a unified framework that mitigates length bias through three components: an explicit length-penalty adjustment, a learned bias estimator trained to capture spurious length-related signals, and a joint training strategy that enforces length-invariance in reward predictions. Our approach is grounded in counterfactual reasoning and informed by causal graph analysis. Extensive experiments on MATH500 and GSM-Plus show that CoLD improves accuracy in step selection and encourages more concise, logically valid reasoning. Furthermore, it consistently improves downstream RL performance and generalizes across domains by mitigating length bias, demonstrating CoLD's strong generalization capability.
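
The first of the three components, an explicit length-penalty adjustment, can be sketched as below. The abstract does not state the penalty's functional form, so the linear penalty, the coefficient `alpha`, and the helper names `debias_score` and `select_step` are assumptions made for illustration. The learned bias estimator and the joint training strategy are not covered.

```python
def debias_score(raw_score, num_tokens, alpha=0.01):
    """Subtract a linear length penalty from a raw PRM score.

    The linear form and the value of alpha are assumptions for illustration.
    """
    return raw_score - alpha * num_tokens


def select_step(candidates, alpha=0.01):
    """Return the text of the candidate step with the highest adjusted score.

    candidates: list of (step_text, raw_score, num_tokens) tuples.
    """
    return max(candidates, key=lambda c: debias_score(c[1], c[2], alpha))[0]


candidates = [
    ("x = 4", 0.80, 5),
    ("Restating the problem at length, we finally see x = 4", 0.85, 20),
]
best = select_step(candidates)                 # penalty applied
best_raw = select_step(candidates, alpha=0.0)  # penalty disabled
```

With the penalty disabled the longer step wins on raw score alone; with it enabled the shorter step of equal content is preferred, which is the behavior the abstract describes as the goal.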

2507.10614 2026-05-20 cs.LG cs.AI

Fine-tuning Large Language Model for Automated Algorithm Design

微调大语言模型用于自动化算法设计

Fei Liu, Rui Zhang, Xi Lin, Zhichao Lu, Qingfu Zhang

发表机构 * City University of Hong Kong(香港城市大学) Xi’an Jiaotong University(西安交通大学)

AI总结 本文探讨了微调大语言模型以提升其在自动化算法设计中的性能,提出了一种多样性感知的排名策略和直接偏好优化方法,通过实验验证了任务特定微调在不同算法设计任务中的有效性。

AI中文摘要

将大语言模型(LLMs)整合到自动化算法设计中已展现出巨大潜力。一种常见的方法是将LLMs嵌入到搜索过程中,以迭代生成和优化候选算法。然而,现有大多数方法依赖于为通用编码任务训练的现成LLMs,留下一个关键问题:是否需要专门针对算法设计训练的LLMs?如果是,如何有效获得此类LLMs,并且它们在不同算法设计任务中有多好的泛化能力?在本文中,我们通过探索针对算法设计的LLMs微调,初步回答了这些问题。我们引入了一种多样性感知的排名(DAR)采样策略,以平衡训练数据的多样性和质量,然后利用直接偏好优化来高效地对齐LLMs的输出与任务目标。我们的实验主要在Llama-3.2-1B-Instruct和Llama-3.1-8B-Instruct上进行,针对三个不同的算法设计任务,此外,openPangu-Embedded模型还作为辅助比较在可允许集合问题上进行评估。结果表明,微调后的LLMs在较小的Llama-3.2-1B-Instruct上显著优于其现成的对应者,并在可允许集合问题上与较大的Llama-3.1-8B-Instruct匹配。此外,我们观察到良好的泛化能力:在特定算法设计任务上微调的LLMs在相关任务中也表现出色。这些发现突显了LLMs在算法设计中任务特定适应的价值,并为未来研究开辟了新途径。我们的代码可在 https://github.com/RayZhhh/dpo-aad 上公开获取。

英文摘要

The integration of large language models (LLMs) into automated algorithm design has shown promising potential. A prevalent approach embeds LLMs within search routines to iteratively generate and refine candidate algorithms. However, most existing methods rely on off-the-shelf LLMs trained for general coding tasks, leaving a key question open: Do we need LLMs specifically tailored for algorithm design? If so, how can such LLMs be effectively obtained and how well can they generalize across different algorithm design tasks? In this paper, we take a preliminary step toward answering these questions by exploring fine-tuning of LLMs for algorithm design. We introduce a Diversity-Aware Rank-based (DAR) sampling strategy to balance training data diversity and quality, then we leverage direct preference optimization to efficiently align LLM outputs with task objectives. Our experiments are primarily conducted on Llama-3.2-1B-Instruct and Llama-3.1-8B-Instruct across three distinct algorithm design tasks, with openPangu-Embedded models additionally included as auxiliary comparisons on the admissible set problem. Results suggest that fine-tuned LLMs can significantly outperform their off-the-shelf counterparts with the smaller Llama-3.2-1B-Instruct and match the larger Llama-3.1-8B-Instruct on the admissible set problem. Moreover, we observe promising generalization: LLMs fine-tuned on specific algorithm design tasks also improve performance on related tasks with varying settings. These findings highlight the value of task-specific adaptation for LLMs in algorithm design and open new avenues for future research. Our code is publicly available at https://github.com/RayZhhh/dpo-aad.

2507.10492 2026-05-20 cs.CV cs.AI cs.LG

BenchReAD: A systematic benchmark for retinal anomaly detection

BenchReAD: 一种系统性的视网膜异常检测基准

Chenyu Lian, Hong-Yu Zhou, Zhanli Hu, Jing Qin

发表机构 * The Center for Smart Health, School of Nursing, the Hong Kong Polytechnic University, Hong Kong, China(香港理工大学护理学院智能健康中心) School of Biomedical Engineering, Tsinghua University, Beijing, China(清华大学生物医学工程学院) Research Center for Medical AI, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China(中国科学院深圳先进技术研究院医学人工智能研究中心)

AI总结 本研究提出BenchReAD基准,旨在解决视网膜异常检测领域缺乏全面且公开的评估标准的问题,通过系统化的数据和算法分类,引入了全监督方法DRA,并改进为NFM-DRA,实现了SOTA性能。

Comments MICCAI 2025

AI中文摘要

视网膜异常检测在筛查眼部和系统性疾病中起着关键作用。尽管其重要性,该领域的进展受到缺乏全面且公开可用的基准的阻碍,这对于公平评估和推进方法至关重要。由于这一限制,与视网膜图像相关的先前异常检测工作受到(1)异常类型有限且过于简单的限制,(2)测试集几乎饱和,以及(3)缺乏泛化评估的影响,导致实验设置说服力不足。此外,现有医学异常检测基准大多专注于单类监督方法(仅使用负样本训练),忽视了临床实践中大量可用的标记异常数据和未标记数据。为了填补这些差距,我们引入了视网膜异常检测的基准,该基准在数据和算法上都是全面且系统的。通过分类和评估先前方法,我们发现利用解耦异常表示的全监督方法(DRA)取得了最佳性能,但在遇到某些未见异常时性能显著下降。受单类监督学习中记忆库机制的启发,我们提出了NFM-DRA,将其与正常特征记忆结合,以缓解性能下降,建立新的SOTA。该基准可在https://github.com/DopamineLcy/BenchReAD上公开获取。

英文摘要

Retinal anomaly detection plays a pivotal role in screening ocular and systemic diseases. Despite its significance, progress in the field has been hindered by the absence of a comprehensive and publicly available benchmark, which is essential for the fair evaluation and advancement of methodologies. Due to this limitation, previous anomaly detection work related to retinal images has been constrained by (1) a limited and overly simplistic set of anomaly types, (2) test sets that are nearly saturated, and (3) a lack of generalization evaluation, resulting in less convincing experimental setups. Furthermore, existing benchmarks in medical anomaly detection predominantly focus on one-class supervised approaches (training only with negative samples), overlooking the vast amounts of labeled abnormal data and unlabeled data that are commonly available in clinical practice. To bridge these gaps, we introduce a benchmark for retinal anomaly detection, which is comprehensive and systematic in terms of data and algorithm. Through categorizing and benchmarking previous methods, we find that a fully supervised approach leveraging disentangled representations of abnormalities (DRA) achieves the best performance but suffers from significant drops in performance when encountering certain unseen anomalies. Inspired by the memory bank mechanisms in one-class supervised learning, we propose NFM-DRA, which integrates DRA with a Normal Feature Memory to mitigate the performance degradation, establishing a new SOTA. The benchmark is publicly available at https://github.com/DopamineLcy/BenchReAD.

2507.05843 2026-05-20 cs.CV

USIGAN: Unbalanced Self-Information Feature Transport for Weakly Paired Image IHC Virtual Staining

USIGAN: 用于弱配对图像IHC虚拟染色的不平衡自信息特征传输

Yue Peng, Bing Xiong, Fuqiang Chen, De Eybo, RanRan Zhang, Wanming Hu, Jing Cai, Wenjian Qin

发表机构 * Shenzhen Institutes of Advanced Technology, University of Chinese Academy of Sciences(深圳先进技术研究院,中国科学院大学)

AI总结 本文提出USIGAN方法,通过提取全局形态学语义来解决弱配对条件下IHC虚拟染色的不一致问题,改进生成结果的病理语义一致性。

AI中文摘要

免疫组化(IHC)虚拟染色任务旨在从H&E图像生成虚拟IHC图像,同时保持与相邻切片的病理语义一致性。该任务通过生成模型实现形态结构与染色模式的跨域映射,为病理分析提供高效且经济的解决方案。然而,在弱配对条件下,相邻切片之间的空间异质性带来了显著挑战,可能导致不准确的一对多映射并生成与相邻切片病理语义不一致的结果。为了解决这个问题,我们提出了一种新的IHC虚拟染色的不平衡自信息特征传输方法,称为USIGAN,该方法在不依赖位置对应的情况下提取全局形态学语义。通过在联合边缘分布中移除弱配对项,我们有效减轻了弱配对对联合分布的影响,从而显著提高了生成结果的内容一致性和病理语义一致性。此外,我们设计了不平衡最优传输一致性(UOT-CTM)机制和病理自对应(PC-SCM)机制,以构建H&E与生成IHC在图像级别以及真实IHC与生成IHC图像集内的相关矩阵。在两个公开数据集上的实验表明,我们的方法在多个临床相关指标上表现优异,如IoD和Pearson-R相关性,证明了更好的临床相关性。

英文摘要

Immunohistochemical (IHC) virtual staining is a task that generates virtual IHC images from H&E images while maintaining pathological semantic consistency with adjacent slices. This task aims to achieve cross-domain mapping between morphological structures and staining patterns through generative models, providing an efficient and cost-effective solution for pathological analysis. However, under weakly paired conditions, spatial heterogeneity between adjacent slices presents significant challenges. This can lead to inaccurate one-to-many mappings and generate results that are inconsistent with the pathological semantics of adjacent slices. To address this issue, we propose a novel unbalanced self-information feature transport for IHC virtual staining, named USIGAN, which extracts global morphological semantics without relying on positional correspondence. By removing weakly paired terms in the joint marginal distribution, we effectively mitigate the impact of weak pairing on joint distributions, thereby significantly improving the content consistency and pathological semantic consistency of the generated results. Moreover, we design the Unbalanced Optimal Transport Consistency (UOT-CTM) mechanism and the Pathology Self-Correspondence (PC-SCM) mechanism to construct correlation matrices between H&E and generated IHC at the image level, and between real IHC and generated IHC image sets at the intra-group level. Experiments conducted on two publicly available datasets demonstrate that our method achieves superior performance across multiple clinically significant metrics, such as IoD and Pearson-R correlation, demonstrating better clinical relevance.

2507.01123 2026-05-20 cs.CV cs.LG eess.IV

Landslide Detection and Mapping Using Deep Learning Across Multi-Source Satellite Data and Geographic Regions

利用多源卫星数据和地理区域的深度学习进行滑坡检测与制图

Rahul A. Burange, Harsh K. Shinde, Omkar Mutyalwar

发表机构 * Department of Electronics & Telecommunication, KDK College of Engineering(电子与电信系,KDK工程学院)

AI总结 本文提出了一种综合方法,结合多源卫星影像和深度学习模型,以提高滑坡识别和预测的准确性,通过Sentinel-2多光谱数据和ALOS PALSAR衍生的坡度和数字高程模型(DEM)层来捕捉影响滑坡发生的关键环境特征,并评估多种地理空间分析技术对检测精度的影响,同时评估了多种先进的深度学习分割模型,如U-Net、DeepLabV3+和Res-Net,以确定其在滑坡检测中的有效性。

Comments 17 pages, 22 figures

Journal ref JETIR March 2025, Volume 12, Issue 3

AI中文摘要

滑坡对基础设施、经济和人类生命构成严重威胁,需要在多样化的地理区域中进行准确的检测和预测制图。随着深度学习和遥感技术的进步,自动化滑坡检测已变得更加有效。本文提出了一种综合方法,整合多源卫星影像和深度学习模型,以增强滑坡识别和预测。我们利用Sentinel-2多光谱数据和ALOS PALSAR衍生的坡度和数字高程模型(DEM)层来捕捉影响滑坡发生的关键环境特征。各种地理空间分析技术被用来评估地形特征、植被覆盖和降雨对检测精度的影响。此外,我们评估了多种先进的深度学习分割模型,包括U-Net、DeepLabV3+和Res-Net,以确定其在滑坡检测中的有效性。所提出的框架有助于发展可靠的早期预警系统,改进灾害风险管理,并促进可持续的土地利用规划。我们的发现为深度学习和多源遥感在创建稳健、可扩展和可转移的滑坡预测模型中的潜力提供了有价值的见解。

英文摘要

Landslides pose severe threats to infrastructure, economies, and human lives, necessitating accurate detection and predictive mapping across diverse geographic regions. With advancements in deep learning and remote sensing, automated landslide detection has become increasingly effective. This study presents a comprehensive approach integrating multi-source satellite imagery and deep learning models to enhance landslide identification and prediction. We leverage Sentinel-2 multispectral data and ALOS PALSAR-derived slope and Digital Elevation Model (DEM) layers to capture critical environmental features influencing landslide occurrences. Various geospatial analysis techniques are employed to assess the impact of terrain characteristics, vegetation cover, and rainfall on detection accuracy. Additionally, we evaluate the performance of multiple state-of-the-art deep learning segmentation models, including U-Net, DeepLabV3+, and Res-Net, to determine their effectiveness in landslide detection. The proposed framework contributes to the development of reliable early warning systems, improved disaster risk management, and sustainable land-use planning. Our findings provide valuable insights into the potential of deep learning and multi-source remote sensing in creating robust, scalable, and transferable landslide prediction models.

2506.08618 2026-05-20 cs.LG cond-mat.mes-hall cond-mat.other cs.AI cs.CV

HSG-12M: A Large-Scale Benchmark of Spatial Multigraphs from the Energy Spectra of Non-Hermitian Crystals

HSG-12M: 一种大规模空间多图基准,源自非厄密晶体能量谱

Xianquan Yan, Hakan Akgün, Kenji Kawaguchi, N. Duane Loh, Ching Hua Lee

发表机构 * National University of Singapore(新加坡国立大学) NUS Centre for Bioimaging Sciences(新加坡国立大学生物成像科学中心)

AI总结 本文提出HSG-12M,一个包含1160万静态和510万动态哈密顿量谱图的数据集,用于研究非厄密量子物理中的复杂几何结构,填补了现有图基准在空间多边学习方面的空白。

Comments Accepted to ICLR 2026, OpenReview: [https://openreview.net/forum?id=YxuKCME576]. 49 pages, 13 figures, 14 tables. Code & pipeline: [https://github.com/sarinstein-yan/Poly2Graph] Dataset: [https://github.com/sarinstein-yan/HSG-12M] Dataset released under CC BY 4.0. The Fourteenth International Conference on Learning Representations (ICLR 2026)

Journal ref The Fourteenth International Conference on Learning Representations (ICLR 2026)

AI中文摘要

人工智能正通过揭示理解复杂物理系统的新方法改变科学研究,但其影响仍受限于缺乏大规模、高质量的领域专用数据集。非厄密量子物理中蕴藏着丰富的资源,其中晶体的能量谱在复平面上形成复杂的几何结构,称为哈密顿量谱图。尽管这些谱图作为电子行为的指纹具有重要意义,但其系统研究一直受限于手动提取的依赖。为释放这一潜力,我们引入Poly2Graph:一个高性能、开源的管道,自动化将一维晶体哈密顿量映射到谱图。使用该工具,我们提出了HSG-12M:一个包含1160万静态和510万动态哈密顿量谱图的数据集,涵盖1401个特征多项式类别,源自177TB的谱势数据。关键的是,HSG-12M是首个大规模空间多图数据集——图嵌入在度量空间中,其中两个节点之间不同的几何轨迹被保留为单独的边。这同时填补了现有图基准在空间多边学习方面的空白。流行的GNN基准测试揭示了在大规模学习空间多边时的新挑战。除了其实际用途外,我们还表明谱图是多项式、向量和矩阵的通用拓扑指纹,建立了新的代数到图的联系。HSG-12M为凝聚态物理的数据驱动科学发现奠定了基础,为几何感知图学习的新机会以及更广泛领域铺平了道路。

英文摘要

AI is transforming scientific research by revealing new ways to understand complex physical systems, but its impact remains constrained by the lack of large, high-quality domain-specific datasets. A rich, largely untapped resource lies in non-Hermitian quantum physics, where the energy spectra of crystals form intricate geometries on the complex plane -- termed as Hamiltonian spectral graphs. Despite their significance as fingerprints for electronic behavior, their systematic study has been intractable due to the reliance on manual extraction. To unlock this potential, we introduce Poly2Graph: a high-performance, open-source pipeline that automates the mapping of 1-D crystal Hamiltonians to spectral graphs. Using this tool, we present HSG-12M: a dataset containing 11.6 million static and 5.1 million dynamic Hamiltonian spectral graphs across 1401 characteristic-polynomial classes, distilled from 177 TB of spectral potential data. Crucially, HSG-12M is the first large-scale dataset of spatial multigraphs -- graphs embedded in a metric space where multiple geometrically distinct trajectories between two nodes are retained as separate edges. This simultaneously addresses a critical gap, as existing graph benchmarks overwhelmingly assume simple, non-spatial edges, discarding vital geometric information. Benchmarks with popular GNNs expose new challenges in learning spatial multi-edges at scale. Beyond its practical utility, we show that spectral graphs serve as universal topological fingerprints of polynomials, vectors, and matrices, forging a new algebra-to-graph link. HSG-12M lays the groundwork for data-driven scientific discovery in condensed matter physics, new opportunities in geometry-aware graph learning and beyond.

2506.05317 2026-05-20 cs.CV

ProJo4D: Progressive Joint Optimization for Sparse-View Inverse Physics Estimation

ProJo4D:渐进式联合优化用于稀疏视图逆物理估计

Daniel Rho, Jun Myeong Choi, Biswadip Dey, Roni Sengupta

发表机构 * University of North Carolina at Chapel Hill(北卡罗来纳大学教堂山分校) Meta Reality Labs(Meta现实实验室)

AI总结 本文提出ProJo4D,一种渐进式联合优化框架,用于解决稀疏视图下逆物理参数估计问题,通过逐步扩展联合优化参数集,提高了4D未来状态预测和物理参数估计的准确性,达到几何精度提升10倍的性能。

Comments TMLR 2026

AI中文摘要

神经渲染在3D重建和新视图合成方面已取得显著进展,将物理整合到这些框架中开辟了新的应用,如机器人和XR中的物理准确数字孪生。然而,从视觉观测中估计物理参数的逆问题仍具挑战性。现有物理感知神经渲染方法通常需要密集多视角视频,使其在可扩展的实际部署中不切实际。在稀疏视图设置下,当前方法采用的顺序优化策略导致严重误差累积:初始3D重建的不准确性会传播到后续阶段,降低物理状态和材料参数估计。另一方面,同时优化所有参数失败,因为问题高度非凸且通常非可微。我们提出ProJo4D,一种渐进式联合优化框架,逐步扩展联合优化的参数集。这种设计使物理感知梯度能够细化几何,同时避免直接对所有参数进行联合优化的不稳定性。在合成和真实世界数据集上的评估表明,ProJo4D在4D未来状态预测和物理参数估计方面显著优于先前工作,实现几何精度提升高达10倍,同时保持计算效率。请访问项目网页:https://daniel03c1.github.io/ProJo4D/

英文摘要

Neural rendering has advanced significantly in 3D reconstruction and novel view synthesis, and integrating physics into these frameworks opens new applications such as physically accurate digital twins for robotics and XR. However, the inverse problem of estimating physical parameters from visual observations remains challenging. Existing physics-aware neural rendering methods typically require dense multi-view videos, making them impractical for scalable, real-world deployment. Under sparse-view settings, the sequential optimization strategies employed by current approaches suffer from severe error accumulation: inaccuracies in initial 3D reconstruction propagate to subsequent stages, degrading physical state and material parameter estimates. On the other hand, simultaneous optimization of all parameters fails due to the highly non-convex and often non-differentiable nature of the problem. We propose ProJo4D, a progressive joint optimization framework that gradually expands the set of jointly optimized parameters. This design enables physics-informed gradients to refine geometry while avoiding the instability of direct joint optimization over all parameters. Evaluations on synthetic and real-world datasets demonstrate that ProJo4D substantially outperforms prior work in 4D future state prediction and physical parameter estimation, achieving up to 10x improvement in geometric accuracy while maintaining computational efficiency. Please visit the project webpage: https://daniel03c1.github.io/ProJo4D/

2506.01529 2026-05-20 cs.LG

Learning Abstract World Models with a Group-Structured Latent Space

通过群结构潜在空间学习抽象世界模型

Thomas Delliaux, Nguyen-Khanh Vu, Vincent François-Lavet, Elise van der Pol, Emmanuel Rachelson

Affiliations: ISAE-SUPAERO; ETH Zürich; Vrije Universiteit Amsterdam; Microsoft Research AI for Science

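AI summary: This study imposes geometric priors on the low-dimensional representation manifold to improve the learning of abstract models of Markov decision processes. This improves generalization from limited data and gives more effective learning on downstream reinforcement learning tasks in environments with rotational and translational features.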

Comments 20 pages, 18 figures

详情
AI中文摘要

学习有意义的马尔可夫决策过程(MDPs)的抽象模型对于从有限数据中提高泛化能力至关重要。在本文中,我们展示了如何在学习的转移模型的低维表示流形上施加几何先验。我们通过适当选择潜在空间和相关的群作用,纳入已知的对称结构,这些结构编码了环境中的先验知识关于不变性。此外,我们的框架允许将额外的无结构信息与这些对称性一起嵌入。我们实验表明,这导致了比完全无结构方法更好的潜在转移模型预测,以及在具有旋转和翻译特征的环境中下游RL任务学习的改进。此外,我们的实验还显示,这导致了更简单和更解耦的表示。完整的代码可在GitHub上获得以确保可重复性。

英文摘要

Learning meaningful abstract models of Markov Decision Processes (MDPs) is crucial for improving generalization from limited data. In this work, we show how geometric priors can be imposed on the low-dimensional representation manifold of a learned transition model. We incorporate known symmetric structures via appropriate choices of the latent space and the associated group actions, which encode prior knowledge about invariances in the environment. In addition, our framework allows the embedding of additional unstructured information alongside these symmetries. We show experimentally that this leads to better predictions of the latent transition model than fully unstructured approaches, as well as better learning on downstream RL tasks, in environments with rotational and translational features, including in first-person views of 3D environments. Additionally, our experiments show that this leads to simpler and more disentangled representations. The full code is available on GitHub to ensure reproducibility.
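To make the idea of a group action on the latent space concrete, here is a minimal sketch, assuming the simplest possible case: a 2-D latent on which an action acts as an SO(2) rotation. The function names and the choice of SO(2) are illustrative assumptions, not the authors' implementation.

```python
import math

def rotate(z, theta):
    """Apply the SO(2) group action: rotate the 2-D latent z by angle theta."""
    x, y = z
    c, s = math.cos(theta), math.sin(theta)
    return (c * x - s * y, s * x + c * y)

def latent_transition(z, action_angle):
    """Hypothetical latent transition: the action acts on z as a rotation."""
    return rotate(z, action_angle)

z0 = (1.0, 0.0)
# Composing two actions matches a single action with the summed angle,
# which is the group structure the latent space is meant to respect.
z_two_steps = latent_transition(latent_transition(z0, 0.3), 0.5)
z_one_step = latent_transition(z0, 0.8)
```

Because the transition is a group action, multi-step predictions compose exactly, which is one plausible reason a structured latent can predict better than an unstructured one.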

2506.01418 2026-05-20 cs.RO cs.CV

SEMNAV: Enhancing Visual Semantic Navigation in Robotics through Semantic Segmentation

SEMNAV: 通过语义分割增强机器人中的视觉语义导航

Rafael Flor-Rodríguez, Carlos Gutiérrez-Álvarez, Francisco Javier Acevedo-Rodríguez, Sergio Lafuente-Arroyo, Roberto J. López-Sastre

Affiliations: University of Alcalá; CAM-UAH; Ministry of Science and Innovation of Spain

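AI summary: This paper proposes SEMNAV, an approach that uses semantic segmentation as the main visual input representation of the environment to enhance a robotic agent's perception and decision-making. By incorporating high-level semantic information, it improves generalization to unseen environments, and it introduces the SEMNAV dataset for training.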

Journal ref Applied Intelligence, 2026

详情
AI中文摘要

视觉语义导航(VSN)是机器人学中的基本问题,其中智能体必须在未知环境中导航至目标对象,主要依靠视觉信息。大多数最先进的VSN模型是在模拟环境中训练的,其中使用的是现实世界的渲染场景,最理想的情况。这些方法通常依赖于虚拟场景的原始RGB数据,这限制了它们在真实世界环境中的泛化能力,由于域适应问题。为了解决这个问题,本文提出了SEMNAV,一种新的方法,利用语义分割作为环境的主要视觉输入表示,以增强代理的感知和决策能力。通过显式地引入这种高层语义信息,我们的模型学习到稳健的导航策略,提高了在未见过的环境中泛化的能力,无论是模拟还是真实世界。我们还引入了SEMNAV数据集,这是一个新编纂的数据集,用于训练如SEMNAV这样的语义分割感知导航模型。我们的方法在模拟环境和真实世界机器人平台上进行了广泛的评估。实验结果表明,SEMNAV优于现有的最先进VSN模型,在Habitat 2.0模拟环境使用HM3D数据集时实现了更高的成功率。此外,我们的实际实验突显了语义分割在缓解仿真到现实差距方面的有效性,使我们的模型成为实用VSN基于机器人应用的有希望的解决方案。代码和数据集可在https://github.com/gramuah/semnav访问。

英文摘要

Visual Semantic Navigation (VSN) is a fundamental problem in robotics, where an agent must navigate toward a target object in an unknown environment, mainly using visual information. Most state-of-the-art VSN models are trained in simulation environments, which at best use rendered scenes of the real world. These approaches typically rely on raw RGB data from the virtual scenes, which limits their ability to generalize to real-world environments due to domain adaptation issues. To tackle this problem, we propose SEMNAV, a novel approach that leverages semantic segmentation as the main visual input representation of the environment to enhance the agent's perception and decision-making capabilities. By explicitly incorporating this type of high-level semantic information, our model learns robust navigation policies that improve generalization across unseen environments, in both simulated and real-world settings. We also introduce the SEMNAV dataset, a newly curated dataset designed for training semantic segmentation-aware navigation models like SEMNAV. Our approach is evaluated extensively both in simulated environments and on real-world robotic platforms. Experimental results demonstrate that SEMNAV outperforms existing state-of-the-art VSN models, achieving higher success rates in the Habitat 2.0 simulation environment with the HM3D dataset. Furthermore, our real-world experiments highlight the effectiveness of semantic segmentation in mitigating the sim-to-real gap, making our model a promising solution for practical VSN-based robotic applications. The code and datasets are accessible at https://github.com/gramuah/semnav

2506.00286 2026-05-20 cs.LG cs.AI math.OC stat.ML

Recursive Entropic Risk Optimization in Discounted MDPs: Sample Complexity Bounds with a Generative Model

递归熵风险优化在折扣马尔可夫决策过程中的应用:带有生成模型的样本复杂性界

Oliver Mortensen, Mohammad Sadegh Talebi

Affiliations: Department of Computer Science, University of Copenhagen

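AI summary: This paper studies risk-sensitive reinforcement learning with recursive entropic risk measures (ERM) in finite discounted Markov decision processes (MDPs). It introduces a model-based algorithm, Model-Based ERM Q-Value Iteration (MB-RS-QVI), derives PAC-type sample complexity bounds for value learning and policy learning, and shows that the sample complexity depends exponentially on |β|/(1-γ) in the worst case. These are the first rigorous sample complexity guarantees for recursive ERM in both risk-averse and risk-seeking regimes.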

详情
AI中文摘要

我们研究了在有限折扣马尔可夫决策过程(MDP)中使用递归熵风险度量(ERM)进行风险敏感强化学习的问题,其中风险参数β≠0控制智能体的风险态度:β>0表示风险规避,β<0表示风险寻求行为。假设MDP具有生成模型。我们的关注点是学习最优状态-动作价值函数(价值学习)和最优策略(策略学习)在递归ERM下的样本复杂性。我们引入了一个基于模型的算法,称为Model-Based ERM Q-Value Iteration(MB-RS-QVI),并推导了该算法在价值和策略学习中的PAC型样本复杂性界。两种PAC界都随|β|/(1-γ)呈指数增长,其中γ是折扣因子。我们还为价值和策略学习建立了相应的下界,证明在最坏情况下样本复杂性对|β|/(1-γ)的指数依赖是不可避免的。这些界在状态和动作的数量(S和A)上是紧的,为递归ERM在风险规避和风险寻求情形下的样本复杂性提供了首次严格保证。

英文摘要

We study risk-sensitive reinforcement learning in finite discounted MDPs with recursive entropic risk measures (ERM), where the risk parameter $β\neq 0$ controls the agent's risk attitude: $β>0$ for risk-averse and $β<0$ for risk-seeking behavior. A generative model of the MDP is assumed to be available. Our focus is on the sample complexities of learning the optimal state-action value function (value learning) and an optimal policy (policy learning) under recursive ERM. We introduce a model-based algorithm, called Model-Based ERM $Q$-Value Iteration (MB-RS-QVI), and derive PAC-type bounds on its sample complexity for both value and policy learning. Both PAC bounds scale exponentially with $|β|/(1-γ)$, where $γ$ is the discount factor. We also establish corresponding lower bounds for both value and policy learning, showing that exponential dependence on $|β|/(1-γ)$ is unavoidable in the worst case. The bounds are tight in the number of states and actions ($S$ and $A$), providing the first rigorous sample complexity guarantees for recursive ERM across both risk-averse and risk-seeking regimes.
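The recursive ERM objective can be illustrated with a small planning sketch on a known tabular MDP. It assumes the reward-based convention ERM_β(X) = -(1/β) log E[exp(-βX)], under which β > 0 is risk-averse, matching the abstract; the paper's exact definition may differ. This is plain value iteration on a given model, not MB-RS-QVI itself (which would first estimate the model from generative samples), and the toy MDP is invented for illustration.

```python
import math

def entropic_risk(values, probs, beta):
    """ERM_beta(X) = -(1/beta) * log E[exp(-beta * X)], for beta != 0."""
    if beta == 0:
        raise ValueError("beta must be non-zero")
    moment = sum(p * math.exp(-beta * v) for v, p in zip(values, probs))
    return -math.log(moment) / beta

def erm_q_iteration(P, R, gamma, beta, iters=200):
    """Recursive ERM Q-value iteration on a tabular MDP.

    P[s][a] is a list of (next_state, probability) pairs and
    R[s][a] is a deterministic reward.
    """
    Q = [[0.0 for _ in P[s]] for s in range(len(P))]
    for _ in range(iters):
        V = [max(q_s) for q_s in Q]
        new_Q = []
        for s in range(len(P)):
            row = []
            for a in range(len(P[s])):
                next_values = [V[s2] for s2, _ in P[s][a]]
                next_probs = [p for _, p in P[s][a]]
                # The expectation over next states is replaced by the ERM.
                row.append(R[s][a] + gamma * entropic_risk(next_values, next_probs, beta))
            new_Q.append(row)
        Q = new_Q
    return Q

# Toy MDP (hypothetical): from state 0, action 0 is a gamble between a
# rewarding absorbing state (1) and a worthless one (2); action 1 moves
# to a moderately rewarding absorbing state (3) with certainty.
P = [
    [[(1, 0.5), (2, 0.5)], [(3, 1.0)]],
    [[(1, 1.0)]],
    [[(2, 1.0)]],
    [[(3, 1.0)]],
]
R = [[0.0, 0.0], [1.0], [0.0], [0.5]]
q_averse = erm_q_iteration(P, R, gamma=0.9, beta=0.5)
q_seeking = erm_q_iteration(P, R, gamma=0.9, beta=-0.5)
```

In this toy problem both actions have the same expected return, so the sign of β alone decides which one is preferred: the risk-averse agent takes the certain path and the risk-seeking agent takes the gamble.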

2505.23747 2026-05-20 cs.CV cs.AI cs.LG

Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence

Spatial-MLLM: 提升基于视觉的空域智能的MLLM能力

Diankun Wu, Fangfu Liu, Yi-Hsin Hung, Yueqi Duan

Affiliations: Tsinghua University

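AI summary: This paper proposes Spatial-MLLM, a framework for visual-based spatial reasoning from purely 2D observations. A dual-encoder architecture and a space-aware frame sampling strategy improve spatial understanding, and experiments show state-of-the-art performance on a range of visual-based spatial tasks.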

Comments 22 pages

详情
AI中文摘要

近年来,多模态大语言模型(MLLMs)在2D视觉任务上的性能显著提升。然而,提高其空间智能仍是一个挑战。现有的3D MLLMs总是依赖额外的3D或2.5D数据来整合空间意识,限制了它们在只有2D输入(如图像或视频)场景中的实用性。在本文中,我们提出了Spatial-MLLM,一种新颖的框架,用于从纯2D观测中进行基于视觉的空间推理。与传统视频MLLMs依赖CLIP-based视觉编码器优化语义理解不同,我们的关键见解是释放来自前馈视觉几何基础模型的强大结构先验。具体来说,我们提出了双编码器架构:一个预训练的2D视觉编码器用于提取语义特征,以及一个3D空间编码器,从视觉几何模型的主干初始化以提取3D结构特征。然后,一个连接器将两种特征整合到统一的视觉标记中以增强空间理解。此外,我们提出了一种在推理时间的空间感知帧采样策略,该策略选择视频序列中具有空间信息的帧,确保在有限的token长度下,模型专注于对空间推理至关重要的帧。除了架构改进外,我们从多个来源构建了一个训练数据集,并使用监督微调和GRPO对其进行训练。在各种真实世界数据集上的广泛实验表明,Spatial-MLLM在广泛的基于视觉的空间理解和推理任务中实现了SOTA性能。项目页面:https://diankun-wu.github.io/Spatial-MLLM/.

英文摘要

Recent advancements in Multimodal Large Language Models (MLLMs) have significantly enhanced performance on 2D visual tasks. However, improving their spatial intelligence remains a challenge. Existing 3D MLLMs always rely on additional 3D or 2.5D data to incorporate spatial awareness, restricting their utility in scenarios with only 2D inputs, such as images or videos. In this paper, we present Spatial-MLLM, a novel framework for visual-based spatial reasoning from purely 2D observations. Unlike conventional video MLLMs, which rely on CLIP-based visual encoders optimized for semantic understanding, our key insight is to unleash the strong structure prior from the feed-forward visual geometry foundation model. Specifically, we propose a dual-encoder architecture: a pretrained 2D visual encoder to extract semantic features, and a 3D spatial encoder, initialized from the backbone of the visual geometry model, to extract 3D structure features. A connector then integrates both features into unified visual tokens for enhanced spatial understanding. Furthermore, we propose a space-aware frame sampling strategy at inference time, which selects the spatially informative frames of a video sequence, ensuring that even under limited token length, the model focuses on frames critical for spatial reasoning. Beyond architecture improvements, we construct a training dataset from multiple sources and train the model on it using supervised fine-tuning and GRPO. Extensive experiments on various real-world datasets demonstrate that Spatial-MLLM achieves state-of-the-art performance in a wide range of visual-based spatial understanding and reasoning tasks. Project page: https://diankun-wu.github.io/Spatial-MLLM/.

2505.17726 2026-05-20 cs.CV cs.AI

Slot-MLLM: Object-Centric Visual Tokenization for Multimodal LLM

Slot-MLLM: 多模态大语言模型中的面向对象视觉标记化

Donghwan Chi, Hyomin Kim, Yoonjin Oh, Yongjin Kim, Donghoon Lee, Daejin Jo, Jongmin Kim, Junyeob Baek, Sungjin Ahn, Sungwoong Kim

Affiliations: Department of Artificial Intelligence, Korea University; Kakao Corp; School of Computing, KAIST

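AI summary: This paper proposes Slot-MLLM, an object-centric visual tokenization method. Its Slot Attention-based tokenizer encodes local visual details while preserving high-level semantics, improving multimodal large language models on visual understanding and generation.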

详情
AI中文摘要

近年来,多模态大语言模型(MLLMs)已成为实现人工通用智能的关键方法。特别是,视觉语言MLLMs已被开发用于从多模态输入中生成文本和视觉输出。这一进展需要高效的图像标记,使LLMs能够有效处理输入和输出。然而,现有的图像标记方法通常只能捕捉全局抽象概念或均匀分割的图像块,限制了MLLMs在理解和生成细节视觉内容方面的能力,尤其是在对象层面。为了解决这一限制,我们提出了一种基于Slot Attention的面向对象视觉标记器,专门针对MLLMs。具体而言,基于Q-Former编码器、扩散解码器和残差向量量化,我们提出的离散化槽标记能够编码局部视觉细节,同时保持高层语义,并与文本数据对齐,无缝集成到LLMs的统一下一个标记预测框架中。所得到的Slot-MLLM在各种涉及局部详细理解和生成的视觉语言任务中,相对于先前视觉标记器的基线表现显著提升。值得注意的是,这项工作是首次展示了使用MLLMs和真实自然图像进行面向对象槽注意力的可行性。

英文摘要

Recently, multimodal large language models (MLLMs) have emerged as a key approach in achieving artificial general intelligence. In particular, vision-language MLLMs have been developed to generate not only text but also visual outputs from multimodal inputs. This advancement requires efficient image tokens that LLMs can process effectively both in input and output. However, existing image tokenization methods for MLLMs typically capture only global abstract concepts or uniformly segmented image patches, restricting MLLMs' capability to effectively understand or generate detailed visual content, particularly at the object level. To address this limitation, we propose an object-centric visual tokenizer based on Slot Attention specifically for MLLMs. In particular, based on the Q-Former encoder, diffusion decoder, and residual vector quantization, our proposed discretized slot tokens can encode local visual details while maintaining high-level semantics, and also align with textual data to be integrated seamlessly within a unified next-token prediction framework of LLMs. The resulting Slot-MLLM demonstrates significant performance improvements over baselines with previous visual tokenizers across various vision-language tasks that entail local detailed comprehension and generation. Notably, this work is the first demonstration of the feasibility of object-centric slot attention performed with MLLMs and in-the-wild natural images.
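Residual vector quantization, one of the building blocks named above, can be sketched in a few lines. This is a generic illustration with random, untrained codebooks and invented sizes; it says nothing about how Slot-MLLM trains or sizes its own codebooks.

```python
import random

def nearest(codebook, v):
    """Index of the codeword closest to v in squared Euclidean distance."""
    return min(range(len(codebook)),
               key=lambda i: sum((c - x) ** 2 for c, x in zip(codebook[i], v)))

def rvq_encode(x, codebooks):
    """Quantize x stage by stage; each stage encodes the previous residual."""
    residual, codes = list(x), []
    for codebook in codebooks:
        i = nearest(codebook, residual)
        codes.append(i)
        residual = [r - c for r, c in zip(residual, codebook[i])]
    return codes, residual

def rvq_decode(codes, codebooks):
    """Reconstruct by summing the selected codeword from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for i, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[i])]
    return out

rng = random.Random(0)
dim, stages, size = 4, 3, 8  # hypothetical sizes
codebooks = [[[rng.gauss(0, 1.0 / (s + 1)) for _ in range(dim)]
              for _ in range(size)] for s in range(stages)]
x = [rng.gauss(0, 1) for _ in range(dim)]
codes, residual = rvq_encode(x, codebooks)
x_hat = rvq_decode(codes, codebooks)
```

The output of encoding is one discrete index per stage, which is what lets a continuous slot representation be expressed as a short sequence of tokens for next-token prediction.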

2505.12217 2026-05-20 cs.CV

HyperCap: Hyperspectral Land Cover Captioning Dataset for Vision Language Models

HyperCap:面向视觉语言模型的超光谱土地覆盖描述数据集

Aryan Das, Tanishq Rachamalla, Pravendra Singh, Koushik Biswas, Vinay Kumar Verma, Salvador Garcia, Antonio Plaza, Swalpa Kumar Roy

Affiliations: Department of Computer Science and Engineering, Vellore Institute of Technology; Department of Information Technology, Siddhartha Academy of Higher Education; Department of Computer Science and Engineering, Indian Institute of Technology, Roorkee; Department of Computer Science and Engineering, Indraprastha Institute of Information Technology Delhi; Department of Computer Science and Engineering, Indian Institute of Technology, Kanpur; Department of Computer Science and Artificial Intelligence, University of Granada; Hyperspectral Computing Laboratory, Department of Computers and Communications, University of Extremadura

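AI summary: This paper introduces the HyperCap dataset, which combines spectral data with pixel-wise textual annotations to improve model performance in remote sensing applications and to serve as a foundational resource for future research.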

Comments Accepted for publication in IEEE Geoscience and Remote Sensing Magazine (GRSM), 2026

详情
AI中文摘要

我们介绍了HyperCap,首个大规模超光谱描述数据集,旨在提升模型在遥感应用中的性能和有效性。与传统超光谱成像(HSI)基准不同,HyperCap将光谱数据与像素级文本标注相结合,实现更深入的语义理解。该数据集通过结合自动和手动方法对四个基准数据集进行标注,确保准确性和一致性。使用最先进的编码器和多样的融合技术进行实证评估,显示出显著的分类性能提升。这些结果突显了视觉-语言学习在HSI中的潜力,并将HyperCap定位为未来研究的基础数据集。代码和数据集可在https://github.com/arya-domain/HyperCap获取。

英文摘要

We introduce HyperCap, the first large-scale hyperspectral captioning dataset designed to enhance model performance and effectiveness in remote sensing applications. Unlike traditional hyperspectral imaging (HSI) benchmarks, HyperCap integrates spectral data with pixel-wise textual annotations, enabling deeper semantic understanding. This dataset enhances model performance in tasks like classification and feature extraction, providing a valuable resource for advanced remote sensing applications. HyperCap is constructed from four benchmark datasets and annotated through a hybrid approach combining automated and manual methods to ensure accuracy and consistency. Empirical evaluations using state-of-the-art encoders and diverse fusion techniques demonstrate significant improvements in classification performance. These results underscore the potential of vision-language learning in HSI and position HyperCap as a foundational dataset for future research in the field. The code and dataset are available at https://github.com/arya-domain/HyperCap.

2505.04588 2026-05-20 cs.CL

ZeroSearch: Incentivize the Search Capability of LLMs without Searching

ZeroSearch: 无需搜索即可激励大语言模型的搜索能力

Hao Sun, Zile Qiao, Jiayan Guo, Xuanbo Fan, Yingyan Hou, Yong Jiang, Pengjun Xie, Yan Zhang, Fei Huang, Jingren Zhou

Affiliations: Tongyi Lab, Alibaba Group

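AI summary: This study proposes the ZeroSearch framework, which uses simulated search to improve the search capabilities of large language models, addressing the uncontrolled document quality and high API costs of real search engines. Experiments show that it performs well on models of various parameter sizes.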

详情
AI中文摘要

有效信息检索对于增强大语言模型(LLMs)的推理和生成能力至关重要。最近的研究探索了通过与真实搜索引擎在现实环境中交互,利用强化学习(RL)来提高LLMs的搜索能力。尽管这些方法显示出有前景的结果,但面临两个主要挑战:(1)不可控的文档质量:搜索引擎返回的文档质量往往不可预测,引入噪声和训练过程的不稳定性。(2)极高的API成本:RL训练需要频繁的回放,可能涉及数万次搜索请求,导致显著的API费用,并严重限制可扩展性。为了解决这些挑战,我们引入了ZeroSearch,一种新颖的RL框架,通过在训练过程中使用模拟搜索来激励LLMs的搜索能力。我们的方法首先通过轻量级监督微调将LLM转换为检索模块,使其能够生成有用和嘈杂的文档以响应查询。在RL训练过程中,我们采用基于课程的回放策略,逐步降低生成文档的质量,逐步通过暴露模型于越来越具有挑战性的检索场景来激发其推理能力。广泛的实验表明,ZeroSearch有效地利用3B LLM作为检索模块来激励LLMs的搜索能力。令人印象深刻的是,7B检索模块的表现与真实搜索引擎相当,而14B检索模块甚至超过了它。此外,它在各种参数规模的基模型和指令微调模型上均表现出良好的泛化能力,并且与多种RL算法兼容。

英文摘要

Effective information searching is essential for enhancing the reasoning and generation capabilities of large language models (LLMs). Recent research has explored using reinforcement learning (RL) to improve LLMs' search capabilities by interacting with live search engines in real-world environments. While these approaches show promising results, they face two major challenges: (1) Uncontrolled Document Quality: The quality of documents returned by search engines is often unpredictable, introducing noise and instability into the training process. (2) Prohibitively High API Costs: RL training requires frequent rollouts, potentially involving hundreds of thousands of search requests, which incur substantial API expenses and severely constrain scalability. To address these challenges, we introduce ZeroSearch, a novel RL framework that incentivizes the capabilities of LLMs to use a real search engine with simulated searches during training. Our approach begins with lightweight supervised fine-tuning to transform the LLM into a retrieval module capable of generating both useful and noisy documents in response to a query. During RL training, we employ a curriculum-based rollout strategy that incrementally degrades the quality of generated documents, progressively eliciting the model's reasoning ability by exposing it to increasingly challenging retrieval scenarios. Extensive experiments demonstrate that ZeroSearch effectively incentivizes the search capabilities of LLMs using a 3B LLM as the retrieval module. Remarkably, a 7B retrieval module achieves comparable performance to the real search engine, while a 14B retrieval module even surpasses it. Furthermore, it generalizes well across both base and instruction-tuned models of various parameter sizes and is compatible with a wide range of RL algorithms.
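The curriculum-based rollout idea can be sketched as a schedule that raises the chance of returning a noisy document as training progresses. The linear ramp, the end points, and the function names below are assumptions made for illustration; the abstract only states that document quality is degraded incrementally.

```python
import random

def noise_probability(step, total_steps, p_start=0.0, p_end=0.75):
    """Hypothetical linear ramp from p_start to p_end over training."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return p_start + frac * (p_end - p_start)

def simulate_search(query, step, total_steps, rng):
    """Stand-in for the simulated retrieval module.

    A real module would prompt the fine-tuned LLM to write either a useful
    or a noisy document; here only the chosen quality label is returned.
    """
    noisy = rng.random() < noise_probability(step, total_steps)
    return {"query": query, "quality": "noisy" if noisy else "useful"}

rng = random.Random(0)
early = simulate_search("who wrote Dune?", step=0, total_steps=100, rng=rng)
late = simulate_search("who wrote Dune?", step=100, total_steps=100, rng=rng)
```

Early rollouts therefore see mostly useful documents, while later rollouts force the policy to reason through increasingly unreliable retrieval results.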

2504.09188 2026-05-20 cs.RO cs.SY eess.SY

Compliant Explicit Reference Governor for Contact Friendly Robotic Manipulators

顺应性显式参考 governor 用于接触友好的机器人机械臂

Yaashia Gautam, Gilberto Briscoe-Martinez, Adhitya Mohan, Nataliya Nechyporenko, Alessandro Roncone, Marco M. Nicotra

Affiliations: College of Engineering and Applied Sciences, University of Colorado Boulder, Boulder, CO 80301 USA

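AI summary: This paper proposes the Compliant Explicit Reference Governor (CERG), a modular reference management system that lets robots interact physically with their environment under provable guarantees. Acting as an intermediate layer between a high-level planner and a low-level controller, the CERG enforces operational constraints and enables smooth transitions between free-motion and contact operations. It ensures safety by limiting the total energy available to the robotic arm at the time of contact, and it does not penalize system performance in the absence of contact.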

Comments Updated paper with current contributions and author list, accepted at IFAC World Congress, Busan, 2026

详情
AI中文摘要

本文介绍了一种顺应性显式参考 governor (CERG),一种模块化的参考管理系统,使机器人能够在有保证的条件下与环境物理交互。CERG 是一个可以放置在高层规划器和低层控制器之间的中间层:它强制操作约束并使自由运动和接触操作之间平滑过渡。CERG 通过限制接触时机械臂可用的总能量来确保安全。然而,在没有接触的情况下,CERG 不会惩罚系统性能。仿真和硬件实验验证了 CERG 在日益复杂系统上的有效性。

英文摘要

This paper introduces the Compliant Explicit Reference Governor (CERG), a modular reference management system that enables robots to interact physically with their environment under provable guarantees. The CERG is an intermediate layer that can be placed between a high-level planner and a low-level controller: it enforces operational constraints and enables smooth transitions between free-motion and contact operations. The CERG ensures safety by limiting the total energy available to the robotic arm at the time of contact. In the absence of contact, however, the CERG does not penalize the system performance. Simulation and hardware experiments validate the CERG on increasingly complex systems.

2504.07756 2026-05-20 cs.AI cs.CY

Artificial Intelligence, conceptual metaphors and conceptual engineering: Are AI-based framings of human behaviour and cognition successful?

人工智能、概念隐喻和概念工程:基于人工智能的对人类行为和认知的框架是否成功?

Warmhold Jan Thomas Mollema, Thomas Wachter

Affiliations: Vrije University Amsterdam (Department of Philosophy); National Centre for Artificial Intelligence

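AI summary: This paper examines how successful it is to apply AI concepts to the domain of human behaviour and cognition, analyses whether these framings are conceptual metaphors or attempts at conceptual engineering, and points out the challenges of conceptual ethics and reductionism that they face.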

详情
AI中文摘要

利用人工智能领域的概念来理解人类行为、神经科学和心理学正变得越来越流行。鉴于人工智能技术在日常生活中的大规模整合,人工智能相关概念被用来将人工智能系统与人类行为、脑功能和认知能力(如语言习得)进行类比。但科学家和哲学家也越来越倾向于将人工智能对人类概念领域的框架视为字面意义。本文探讨了这些‘人工智能框架’的知识和实践成功性:应用人工智能的概念图景到人类概念领域意味着什么?我们考虑并比较了两种可能的答案:这些例子是概念隐喻,还是概念工程的尝试。首先,我们论证当这些人工智能框架被视为概念隐喻时,它们可能陷入‘地图-领土谬误’。其次,我们论证这些比较也包含误导性的‘双重隐喻’,因为人类心理学与计算之间的隐喻性联系存在于计算的基础概念中。但我们也论证人工智能框架中存在一个可能的语义陷阱,这被概念工程观点所捕捉。即,人工智能框架指向了概念工程的可能途径。如果概念伦理和简化主义的挑战被克服,一些人工智能框架可能会丰富我们的知识和实践生活。因此,在最坏的情况下——作为隐含的概念隐喻——人工智能框架会完全误导我们;在最好的情况下,它促使我们重新反思当前概念的边界如何服务于我们以及如何改进它们。

英文摘要

Understanding human behaviour, neuroscience and psychology using concepts from the domain of AI is increasing in popularity. Given the massive integration of AI technologies into our daily lives, AI-related concepts are being used to compare AI systems with human behaviour, brain functions, and cognitive abilities like language acquisition. But scientists and philosophers are also increasingly tempted to take the AI-framing of the human conceptual domain as a literal one. This paper investigates the epistemic and practical success of these 'AI-framings': What does it mean to apply the conceptual constellation of AI to the human conceptual domain? We consider and compare two possible answers: either these examples are conceptual metaphors, or they are attempts at conceptual engineering. Firstly, we argue that when viewed as conceptual metaphors, the AI-framed descriptions risk committing the 'map-territory fallacy'. Secondly, we argue the comparisons also contain a misleading 'double metaphor' because of the metaphorical connection between human psychology and computation at the conceptual foundation of computation. But we also argue that there is a possible semantic catch to the AI-framing, which is captured by the conceptual engineering view. This is that the AI-framings point towards avenues for forms of conceptual engineering. If the challenges of conceptual ethics and reductionism are overcome, some AI-framings might enrich our epistemic and practical lives. So, at its worst (as implicit conceptual metaphor) the AI-framing leads us completely astray; at its best, it prompts us to reflect anew on how the boundaries of our current concepts serve us and how they could be improved.

2504.04065 2026-05-20 cs.CV cs.IR cs.MM

Enabling Collaborative Parametric Knowledge Calibration for Retrieval-Augmented Vision Question Answering

使检索增强的视觉问答实现协作参数知识校准

Jiaqi Deng, Kaize Shi, Zonghan Wu, Huan Huo, Dingxian Wang, Guandong Xu

Affiliations: University of Technology Sydney; East China Normal University; The Education University of Hong Kong

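AI summary: This paper proposes a unified retrieval-augmented visual question answering framework that uses collaborative parametric knowledge calibration to exploit the cross-task synergy in KB-VQA, improving answering accuracy.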

Comments 10 pages, 5 figures, Under Review

Journal ref Knowledge-Based Systems, 8 July 2026, Volume 346

详情
AI中文摘要

基于知识的视觉问答(KB-VQA)系统通过从外部知识库检索的知识来解决复杂的视觉-地面化问题。知识检索和答案生成任务都要求对问题上下文和外部知识进行精确的多模态理解。然而,现有方法将这两个阶段视为独立模块,在训练过程中交互有限,这阻碍了双向参数知识共享,最终导致性能不佳。为充分利用KB-VQA中的跨任务协同效应,我们提出了一种统一的检索增强VQA框架,具有协作参数知识校准。所提出的框架可以有效地将通用多模态预训练模型适应于细粒度、知识密集型任务,同时在训练和推理过程中使检索器和生成器能够协作增强和共享其参数知识。为了增强对问题和外部文档的细粒度理解,我们还将晚期交互机制整合到所提出的训练框架中。此外,我们引入了一种反思-回答机制,使模型能够显式评估并细化其知识边界。我们的方法在与最先进的模型竞争中取得了竞争力的表现,实现了回答准确率的显著4.7%的提升,并为基础MLLMs的VQA性能带来了平均7.5%的提升。

英文摘要

Knowledge-based Vision Question Answering (KB-VQA) systems address complex visually grounded questions with knowledge retrieved from external knowledge bases. The knowledge retrieval and answer generation tasks both require precise multimodal understanding of the question context and the external knowledge. However, existing methods treat these two stages as separate modules with limited interaction during training, which hinders bi-directional parametric knowledge sharing and ultimately leads to suboptimal performance. To fully exploit the cross-task synergy in KB-VQA, we propose a unified retrieval-augmented VQA framework with collaborative parametric knowledge calibration. The proposed framework can effectively adapt general multimodal pre-trained models for fine-grained, knowledge-intensive tasks while enabling the retriever and generator to collaboratively enhance and share their parametric knowledge during both training and inference. To enhance fine-grained understanding of questions and external documents, we also integrate a late interaction mechanism into the proposed training framework. Additionally, we introduce a reflective-answering mechanism that allows the model to explicitly evaluate and refine its knowledge boundary. Our approach achieves competitive performance against state-of-the-art models, delivering a significant 4.7% improvement in answering accuracy, and brings an average 7.5% boost in base MLLMs' VQA performance.

2504.00470 2026-05-20 cs.LG cs.CV

Less is More: Efficient Black-box Attribution via Minimal Interpretable Subset Selection

少即是多:通过最小可解释子集选择实现高效的黑盒属性分析

Ruoyu Chen, Siyuan Liang, Jingzhi Li, Shiming Liu, Li Liu, Hua Zhang, Xiaochun Cao

Affiliations: Institute of Information Engineering, Chinese Academy of Sciences; University of Chinese Academy of Sciences; College of Computing and Data Science, Nanyang Technological University; School of Artificial Intelligence, University of Science and Technology Beijing; Department of Mechanical Engineering, Imperial College London; Center for Machine Vision and Signal Analysis (CMVS), University of Oulu; School of Cyber Science and Technology, Shenzhen Campus of Sun Yat-sen University

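AI summary: This paper proposes LiMA, an efficient black-box attribution method that recasts the attribution of important regions as a submodular subset selection problem. It gives more accurate explanations with fewer regions and shows significant improvements on multiple foundation models.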

详情
AI中文摘要

为了开发一个可信的AI系统,目标是识别对模型决策影响最大的输入区域。现有属性方法的主要任务是高效且准确地识别输入-预测交互关系。特别是当输入数据是离散的,如图像时,分析输入和输出之间的关系由于组合爆炸而成为重大挑战。在本文中,我们提出了一种新颖且高效的黑盒属性机制LiMA(Less input is More faithful for Attribution),它将重要区域的属性分析重新表述为一个子模子集选择的优化问题。首先,为了准确评估交互,我们设计了一个子模函数,该函数量化子集的重要性并有效捕捉其对决策结果的影响。然后,通过一种新的双向贪心搜索算法,高效地对输入子区域按重要性进行排序。LiMA能够识别最和最不重要的样本,同时确保一个最优的属性边界,以最小化误差。在八个基础模型上的广泛实验表明,我们的方法在更少的区域上提供了忠实的解释,并表现出强大的泛化能力,插入和删除任务的平均改进分别为36.3%和39.6%。我们的方法在属性效率方面也优于朴素的贪心搜索,速度提高了1.6倍。此外,当解释模型预测错误的原因时,我们的方法平均最高置信度比最先进的属性算法高86.1%。代码可在https://github.com/RuoyuChen10/LIMA上获得。

英文摘要

Developing a trustworthy AI system requires identifying the input regions that most influence the model's decisions. The primary task of existing attribution methods lies in efficiently and accurately identifying the relationships among input-prediction interactions. Particularly when the input data is discrete, such as images, analyzing the relationship between inputs and outputs poses a significant challenge due to the combinatorial explosion. In this paper, we propose a novel and efficient black-box attribution mechanism, LiMA (Less input is More faithful for Attribution), which reformulates the attribution of important regions as an optimization problem for submodular subset selection. First, to accurately assess interactions, we design a submodular function that quantifies subset importance and effectively captures their impact on decision outcomes. Then, to efficiently rank input sub-regions by their importance for attribution, we improve optimization efficiency through a novel bidirectional greedy search algorithm. LiMA identifies both the most and least important samples while ensuring an optimal attribution boundary that minimizes errors. Extensive experiments on eight foundation models demonstrate that our method provides faithful interpretations with fewer regions and exhibits strong generalization, with an average improvement of 36.3% in Insertion and 39.6% in Deletion. Our method also outperforms the naive greedy search in attribution efficiency, being 1.6 times faster. Furthermore, when explaining the reasons behind model prediction errors, the average highest confidence achieved by our method is 86.1% higher than that of state-of-the-art attribution algorithms. The code is available at https://github.com/RuoyuChen10/LIMA.
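As a point of reference for the subset-selection framing, here is a minimal sketch of the naive greedy search that the abstract uses as a baseline, not LiMA's bidirectional algorithm. The coverage score is a toy submodular function standing in for the paper's importance function, and the region names and feature sets are invented.

```python
def coverage(selected, region_features):
    """Toy submodular score: number of distinct features the chosen regions cover."""
    covered = set()
    for region in selected:
        covered |= region_features[region]
    return len(covered)

def greedy_rank(region_features, k):
    """Naive greedy search: repeatedly add the region with the largest marginal gain."""
    selected, remaining = [], set(region_features)
    while remaining and len(selected) < k:
        base = coverage(selected, region_features)
        # sorted() makes tie-breaking deterministic (alphabetical order).
        best = max(sorted(remaining),
                   key=lambda r: coverage(selected + [r], region_features) - base)
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical image regions and the evidence each one contributes.
regions = {"A": {1, 2, 3}, "B": {3, 4}, "C": {1, 2}, "D": {5}}
ranking = greedy_rank(regions, k=3)
```

Region C is never chosen because everything it covers is already explained by A, which is the diminishing-returns property that makes a small subset sufficient.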

2503.19877 2026-05-20 cs.CL

Scaling Evaluation-time Compute with Reasoning Models as Evaluators

通过推理模型作为评估器来提升评估时的计算能力

Seungone Kim, Ian Wu, Jinu Lee, Xiang Yue, Seongyun Lee, Mingyeong Moon, Carolin Lawrence, Kiril Gashteovski, Julia Hockenmaier, Graham Neubig, Sean Welleck

Affiliations: CMU; UIUC; KAIST AI; NEC Laboratories Europe; Ss. Cyril and Methodius University of Skopje

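AI summary: This paper explores improving the evaluation capability of language models by spending more compute at evaluation time, using reasoning models as evaluators that assess both the response as a whole and each of its steps.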

Comments ACL 2026 Findings

详情
AI中文摘要

随着语言模型(LM)的输出越来越自然,评估其质量变得越来越困难。同时,通过增加测试时的计算量来提升LM的'思考'时间,已被证明是解决数学和代码等领域挑战性问题的有效技术。这引发了一个自然的问题:是否可以通过增加测试时的计算量来提升LM的评估能力?为回答这个问题,我们研究了利用推理模型——即能够原生生成长链推理的LM——作为评估器。具体而言,我们考察了通过(1)使用推理模型,以及(2)提示这些模型不仅评估响应整体(即结果评估),还评估响应中的每个步骤(即过程评估)来利用更多测试时计算量的方法。在实验中,我们观察到评估器的性能随着生成更多推理标记而单调提升,类似于LM生成中的趋势。此外,我们使用这些更准确的评估器对多个生成进行重新排序,并证明在评估时花费更多计算量可以像在生成时花费更多计算量一样有效,从而提升LM的问题解决能力。

英文摘要

As language model (LM) outputs get more and more natural, it is becoming more difficult than ever to evaluate their quality. Simultaneously, increasing LMs' "thinking" time through scaling test-time compute has proven an effective technique to solve challenging problems in domains such as math and code. This raises a natural question: can an LM's evaluation capability also be improved by spending more test-time compute? To answer this, we investigate employing reasoning models (LMs that natively generate long chain-of-thought reasoning) as evaluators. Specifically, we examine methods to leverage more test-time compute by (1) using reasoning models, and (2) prompting these models to evaluate not only the response as a whole (i.e., outcome evaluation) but also assess each step in the response separately (i.e., process evaluation). In experiments, we observe that the evaluator's performance improves monotonically when generating more reasoning tokens, similar to the trends observed in LM-based generation. Furthermore, we use these more accurate evaluators to rerank multiple generations, and demonstrate that spending more compute at evaluation time can be as effective as using more compute at generation time in improving an LM's problem-solving capability.
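The reranking step can be sketched as follows, assuming an evaluator has already produced an outcome score and per-step process scores in [0, 1] for each candidate. Taking the weakest step as the process score and blending it with equal weight are illustrative choices, not the aggregation used in the paper.

```python
def combined_score(outcome_score, step_scores, weight=0.5):
    """Blend an outcome score with a process score (here: the weakest step)."""
    process_score = min(step_scores) if step_scores else outcome_score
    return weight * outcome_score + (1 - weight) * process_score

def rerank(candidates):
    """Sort candidates best-first by combined score.

    Each candidate is a dict with 'text', 'outcome' and 'steps' keys,
    where the scores are hypothetical evaluator judgments in [0, 1].
    """
    return sorted(candidates,
                  key=lambda c: combined_score(c["outcome"], c["steps"]),
                  reverse=True)

candidates = [
    {"text": "solution 1", "outcome": 0.9, "steps": [0.9, 0.2, 0.8]},
    {"text": "solution 2", "outcome": 0.8, "steps": [0.9, 0.8, 0.9]},
]
best = rerank(candidates)[0]
```

Here the candidate with the higher outcome score loses because one of its steps is judged weak, which is the kind of signal process evaluation adds on top of outcome evaluation.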

2503.13868 2026-05-20 cs.LG cs.AI

Out-of-Distribution Generalization in Time Series: A Survey

时间序列中的分布外泛化:综述

Xin Wu, Fei Teng, Xingwang Li, Ji Zhang, Tianrui Li, Qiang Duan

Affiliations: School of Computing and Artificial Intelligence, Southwest Jiaotong University; Engineering Research Center of Sustainable Urban Intelligent Transportation, Ministry of Education; Information Sciences and Technology Department, the Pennsylvania State University

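AI summary: This survey reviews methods for out-of-distribution generalization in time series along three dimensions (data distribution, representation learning, and OOD evaluation), summarizes mainstream algorithms, highlights application scenarios and open challenges, and proposes future research directions.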

Comments Work in Progress

Journal ref Information Fusion 133, 104336 (2026)

详情
AI中文摘要

时间序列经常表现出分布偏移、多样化的潜在特征和非平稳学习动态,特别是在开放和演变的环境中。这些特性对分布外(OOD)泛化提出了重大挑战。尽管已有显著进展,但系统性综述仍缺乏。为填补这一空白,我们首次全面回顾了时间序列中OOD泛化方法,旨在阐明该领域的发展轨迹和当前研究现状。我们的分析分为三个基础维度:数据分布、表示学习和OOD评估。在每个维度中,我们详细介绍了几种流行的算法。此外,我们强调了关键的应用场景,突显其实际影响。最后,我们识别了持续存在的挑战并提出了未来的研究方向。时间序列中OOD泛化方法的详细总结可通过https://tsood-generalization.com获取。

英文摘要

Time series frequently manifest distribution shifts, diverse latent features, and non-stationary learning dynamics, particularly in open and evolving environments. These characteristics pose significant challenges for out-of-distribution (OOD) generalization. While substantial progress has been made, a systematic synthesis of advancements remains lacking. To address this gap, we present the first comprehensive review of OOD generalization methodologies for time series, organized to delineate the field's evolutionary trajectory and contemporary research landscape. We organize our analysis across three foundational dimensions: data distribution, representation learning, and OOD evaluation. For each dimension, we present several popular algorithms in detail. Furthermore, we highlight key application scenarios, emphasizing their real-world impact. Finally, we identify persistent challenges and propose future research directions. A detailed summary of the reviewed methods for OOD generalization in time series can be accessed at https://tsood-generalization.com.

2503.12172 2026-05-20 cs.LG cs.CR cs.CV

SEAL: Semantic Aware Image Watermarking

SEAL:语义感知图像水印

Kasra Arabi, R. Teal Witter, Chinmay Hegde, Niv Cohen

Affiliations: New York University

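AI summary: This paper proposes a watermarking method that embeds semantic information about the generated image directly into the watermark, giving a distortion-free watermark that can be verified without a database of key patterns. The key pattern is inferred from the image's semantic embedding using locality-sensitive hashing, and conditioning detection on the original image content improves robustness against forgery attacks.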

详情
AI中文摘要

生成模型已迅速发展以生成逼真的输出。然而,它们的合成输出越来越多地挑战自然与AI生成内容之间的清晰区分,需要稳健的水印技术。水印通常需要保持目标图像的完整性,抵御移除尝试,并防止未经授权的复制到无关图像上。为了解决这一需求,最近的方法将持久水印嵌入由扩散模型生成的图像中使用初始噪声。然而,为此,它们要么会扭曲生成图像的分布,要么依赖于搜索一个长密钥字典进行检测。在本文中,我们提出了一种新的水印方法,将生成图像的语义信息直接嵌入水印中,使水印无损,且无需数据库中的密钥模式即可验证。相反,密钥模式可以从图像的语义嵌入中使用局部敏感哈希推断。此外,将水印检测条件化于原始图像内容可以提高对伪造攻击的鲁棒性。为了证明这一点,我们考虑了两种被忽视的攻击策略:(i)攻击者提取初始噪声并生成具有相同模式的新图像;(ii)攻击者在水印图像中插入无关(可能有害)的对象,可能在保持水印的情况下。我们通过实验证明了我们的方法对这些攻击的增强鲁棒性。总的来说,我们的结果表明,内容感知的水印可以缓解图像生成模型带来的风险。

英文摘要

Generative models have rapidly evolved to generate realistic outputs. However, their synthetic outputs increasingly challenge the clear distinction between natural and AI-generated content, necessitating robust watermarking techniques. Watermarks are typically expected to preserve the integrity of the target image, withstand removal attempts, and prevent unauthorized replication onto unrelated images. To address this need, recent methods embed persistent watermarks into images produced by diffusion models using the initial noise. Yet, to do so, they either distort the distribution of generated images or rely on searching through a long dictionary of used keys for detection. In this paper, we propose a novel watermarking method that embeds semantic information about the generated image directly into the watermark, enabling a distortion-free watermark that can be verified without requiring a database of key patterns. Instead, the key pattern can be inferred from the semantic embedding of the image using locality-sensitive hashing. Furthermore, conditioning the watermark detection on the original image content improves robustness against forgery attacks. To demonstrate that, we consider two largely overlooked attack strategies: (i) an attacker extracting the initial noise and generating a novel image with the same pattern; (ii) an attacker inserting an unrelated (potentially harmful) object into a watermarked image, possibly while preserving the watermark. We empirically validate our method's increased robustness to these attacks. Taken together, our results suggest that content-aware watermarks can mitigate risks arising from image-generative models.
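The step of inferring a key pattern from a semantic embedding with locality-sensitive hashing can be illustrated with random-hyperplane hashing (SimHash). This is one common LSH family chosen here as an assumption; the embedding values, bit count, and the use of a seed as the secret are all hypothetical, and the sketch does not cover how the bits are turned into initial diffusion noise.

```python
import random

def make_hyperplanes(dim, n_bits, seed):
    """Secret random hyperplanes; the seed plays the role of a private key."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]

def lsh_bits(embedding, hyperplanes):
    """SimHash: one bit per hyperplane, set by the sign of the projection."""
    return tuple(int(sum(h * e for h, e in zip(plane, embedding)) >= 0)
                 for plane in hyperplanes)

def hamming(a, b):
    """Number of positions where two bit tuples differ."""
    return sum(x != y for x, y in zip(a, b))

planes = make_hyperplanes(dim=8, n_bits=16, seed=42)
emb = [0.3, -1.2, 0.5, 0.9, -0.4, 0.1, 1.5, -0.7]  # hypothetical semantic embedding
key_bits = lsh_bits(emb, planes)
```

Embeddings pointing in similar directions tend to share most bits, so a detector can recompute the expected pattern from the image it is shown, while an image with unrelated content would map to a different pattern.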

2503.02170 2026-05-20 cs.CV cs.AI

Adaptive Camera Sensor for Vision Models

Eunsu Baek, Sunghwan Han, Taesik Gong, Hyung-Sin Kim

Affiliations * Graduate School of Data Science; Seoul National University; Department of Computer Science & Engineering; Seogang University; Ulsan National Institute of Science and Technology

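AI summary This paper proposes Lens, an adaptive camera sensor control method inspired by human visual perception. It improves model performance by capturing images that are high-quality from the model's perspective, adapts to specific models and scenes in real time, and is validated on the new ImageNet-ES Diverse dataset.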

Comments The International Conference on Learning Representations (ICLR 2025)

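Details
AI abstract (translated from Chinese)

Domain shift remains a persistent challenge in deep-learning-based computer vision and usually requires extensive model modification or labeled datasets to address. Inspired by human visual perception, which adjusts input quality with corrective lenses rather than over-training the brain, we propose Lens, a novel camera sensor control method that improves model performance by capturing images that are high-quality from the model's perspective instead of relying on traditional human-centric sensor control. Lens is lightweight and adapts sensor parameters to specific models and scenes in real time. At its core is VisiT, a training-free, model-specific quality indicator that uses confidence scores to evaluate individual unlabeled samples at test time, with no additional adaptation cost. To validate Lens, we introduce ImageNet-ES Diverse, a new benchmark dataset that captures natural perturbations arising from varying sensor and lighting conditions. Extensive experiments on ImageNet-ES and our new ImageNet-ES Diverse show that Lens significantly improves model accuracy across a range of baseline schemes for sensor control and model modification while keeping image capture latency low. Lens effectively compensates for large differences in model size and works synergistically with model improvement techniques. Our code and dataset are available at github.com/Edw2n/Lens.git.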

English abstract

Domain shift remains a persistent challenge in deep-learning-based computer vision, often requiring extensive model modifications or large labeled datasets to address. Inspired by human visual perception, which adjusts input quality through corrective lenses rather than over-training the brain, we propose Lens, a novel camera sensor control method that enhances model performance by capturing high-quality images from the model's perspective rather than relying on traditional human-centric sensor control. Lens is lightweight and adapts sensor parameters to specific models and scenes in real-time. At its core, Lens utilizes VisiT, a training-free, model-specific quality indicator that evaluates individual unlabeled samples at test time using confidence scores without additional adaptation costs. To validate Lens, we introduce ImageNet-ES Diverse, a new benchmark dataset capturing natural perturbations from varying sensor and lighting conditions. Extensive experiments on both ImageNet-ES and our new ImageNet-ES Diverse show that Lens significantly improves model accuracy across various baseline schemes for sensor control and model modification while maintaining low latency in image captures. Lens effectively compensates for large model size differences and integrates synergistically with model improvement techniques. Our code and dataset are available at github.com/Edw2n/Lens.git.
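The abstract describes VisiT as a training-free quality indicator that scores individual unlabeled samples using confidence scores, but it does not give the formula. The sketch below is one plausible reading, not the authors' implementation: it assumes the confidence score is the maximum softmax probability and picks the candidate capture that maximizes it. The sensor-setting names and logits are made-up illustrative values.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def confidence(logits):
    """Maximum softmax probability, used here as a label-free quality proxy (assumption)."""
    return max(softmax(logits))


def select_capture(candidates):
    """Return the sensor setting whose capture the model is most confident about.

    candidates maps a hypothetical sensor-setting name to the logits the
    target model produced for the image captured with that setting.
    """
    return max(candidates, key=lambda name: confidence(candidates[name]))


# Made-up logits for three hypothetical sensor settings of the same scene.
candidates = {
    "auto_exposure": [1.0, 0.9, 0.8],
    "short_shutter": [3.0, 0.5, 0.1],
    "high_gain": [1.5, 1.4, 0.2],
}
best = select_capture(candidates)
print(best)
```

No labels are needed at any point: the selection depends only on the target model's own outputs, which is what makes this kind of indicator usable at test time.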

2502.20981 2026-05-20 cs.CV

Distribution Prototype Diffusion Learning for Open-set Supervised Anomaly Detection

Fuyun Wang, Tong Zhang, Yuanzhi Wang, Yide Qiu, Xin Liu, Xu Guo, Zhen Cui

Affiliations * Nanjing University of Science and Technology; Nanjing SeetaCloud Technology; Beijing Normal University

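AI summary This paper proposes a distribution prototype diffusion learning method that builds learnable Gaussian prototypes to form a latent representation space and sharpen the discriminative boundary around normal samples. A Schrödinger bridge promotes the diffusion of normal samples toward the prototypes while pushing anomaly samples away, which improves anomaly detection performance.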

Comments Accepted by CVPR 2025

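Details
AI abstract (translated from Chinese)

In open-set supervised anomaly detection (OSAD), existing methods typically generate pseudo-anomalies to compensate for the scarcity of observed anomaly samples while neglecting key priors of normal samples, which leads to poor discriminative boundaries. To address this problem, we propose a distribution prototype diffusion learning (DPDL) method that aims to enclose normal samples in a compact and discriminative distribution space. Specifically, we construct multiple learnable Gaussian prototypes to create a latent representation space that accommodates abundant and diverse normal samples, and we learn a Schrödinger bridge that promotes a diffusive transition of normal samples toward these prototypes while pushing anomaly samples away. In addition, to strengthen inter-sample separation, we design a dispersion feature learning method in hyperspherical space, which helps identify out-of-distribution anomalies. Experimental results show that the proposed DPDL method achieves state-of-the-art performance on 9 public datasets.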

English abstract

In Open-set Supervised Anomaly Detection (OSAD), existing methods typically generate pseudo anomalies to compensate for the scarcity of observed anomaly samples while overlooking critical priors of normal samples, leading to less effective discriminative boundaries. To address this issue, we propose a Distribution Prototype Diffusion Learning (DPDL) method aimed at enclosing normal samples within a compact and discriminative distribution space. Specifically, we construct multiple learnable Gaussian prototypes to create a latent representation space for abundant and diverse normal samples and learn a Schrödinger bridge to facilitate a diffusive transition toward these prototypes for normal samples while steering anomaly samples away. Moreover, to enhance inter-sample separation, we design a dispersion feature learning scheme in hyperspherical space, which benefits the identification of out-of-distribution anomalies. Experimental results demonstrate the effectiveness and superiority of our proposed DPDL, achieving state-of-the-art performance on 9 public datasets.

2501.09203 2026-05-20 cs.CV cs.RO

3D Modeling and Automated Measurement of Concrete Cracks via Segment Anything Refinement and Visual Inertial LiDAR Fusion

Pengru Deng, Jiapeng Yao, Chun Li, Su Wang, Xinrun Li, Varun Ojha, Xuhui He

Affiliations * School of Civil Engineering; Central South University; Hunan Provincial Key Laboratory for Disaster Prevention and Mitigation of Rail Transit Engineering Structures; Nvidia School of Computing; Newcastle University

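AI summary This paper proposes an innovative framework that combines computer vision techniques with multi-modal simultaneous localization and mapping (SLAM) for 2D crack detection, 3D reconstruction, and automatic 3D crack measurement. It addresses the limited adaptability and robustness of existing methods, particularly when handling curved or complex geometries.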

Comments Title and author list updated

Journal ref Computer-Aided Civil and Infrastructure Engineering, Volume 45, 2026, 100019, ISSN 1093-9687

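Details
AI abstract (translated from Chinese)

Visual-spatial systems are becoming increasingly important in concrete crack inspection. However, existing methods often lack adaptability to diverse scenarios, show limited robustness in image-based approaches, and have difficulty with curved or complex geometries. To address these limitations, this paper proposes an innovative framework that integrates computer vision techniques with multi-modal simultaneous localization and mapping (SLAM) for two-dimensional (2D) crack detection, three-dimensional (3D) reconstruction, and automatic 3D crack measurement. First, building on a base DeepLabv3+ segmentation model and adding specific refinements that use the foundation model Segment Anything Model (SAM), we develop a crack segmentation method with strong generalization that produces precise 2D crack masks in unfamiliar scenarios. To improve the accuracy and robustness of 3D reconstruction, Light Detection and Ranging (LiDAR) point clouds are used together with image data and segmentation masks. Using both image-based and LiDAR-based SLAM, we develop a multi-frame, multi-modal fusion framework that produces dense, colorized point clouds and effectively captures crack semantics at real-world 3D scale. In addition, crack geometric attributes are measured automatically and directly in the dense 3D point cloud space, which overcomes the limitations of conventional 2D image-based measurement. This makes the method suitable for structural components with curved and complex 3D geometries. Experimental results on a variety of concrete structures highlight the significant improvements and distinctive advantages of the proposed method and demonstrate its effectiveness, accuracy, and robustness in real-world applications.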

English abstract

Visual-spatial systems have become increasingly essential in concrete crack inspection. However, existing methods often lack adaptability to diverse scenarios, exhibit limited robustness in image-based approaches, and struggle with curved or complex geometries. To address these limitations, this study proposes an innovative framework for two-dimensional (2D) crack detection, three-dimensional (3D) reconstruction, and 3D automatic crack measurement that integrates computer vision technologies and multi-modal simultaneous localization and mapping (SLAM). First, building on a base DeepLabv3+ segmentation model and incorporating specific refinements that use the foundation model Segment Anything Model (SAM), we developed a crack segmentation method with strong generalization across unfamiliar scenarios, enabling the generation of precise 2D crack masks. To enhance the accuracy and robustness of 3D reconstruction, Light Detection and Ranging (LiDAR) point clouds were used together with image data and segmentation masks. By leveraging both image- and LiDAR-SLAM, we developed a multi-frame and multi-modal fusion framework that produces dense, colorized point clouds, effectively capturing crack semantics at a 3D real-world scale. Furthermore, the crack geometric attributes were measured automatically and directly within 3D dense point cloud space, surpassing the limitations of conventional 2D image-based measurements. This advancement makes the method suitable for structural components with curved and complex 3D geometries. Experimental results across various concrete structures highlight the significant improvements and unique advantages of the proposed method, demonstrating its effectiveness, accuracy, and robustness in real-world applications.