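arXivDaily daily academic digest: synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, and electrical engineering & systems.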

2512.05277 2026-06-04 cs.CV cs.AI

From Segments to Scenes: Temporal Understanding for Agentic Autonomous Driving via Vision-Language Models

从片段到场景：自动驾驶中基于视觉语言模型的时间理解

Kevin Cannons, Saeed Ranjbar Alvar, Mohammad Asiful Hossain, Ahmad Rezaei, Mohsen Gholami, Alireza Heidarikhazaei, Zhou Weimin, Yong Zhang, Mohammad Akbari

Affiliations: University of California, Berkeley; University of Cambridge; University of Toronto; ETH Zurich; University of Washington; University of Southern California

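AI summary: Proposes TAD, a benchmark for temporal understanding in autonomous driving, and improves the temporal reasoning of vision-language models with two training-free methods, Scene-CoT (scene chain-of-thought) and TCogMap (trajectory cognitive map).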


AI中文摘要

视觉语言模型（VLM）越来越多地被部署为野外自主代理的感知和推理骨干，其中自动驾驶（AD）是最安全关键的实例之一。可靠的时间理解对于此类代理预测事件、归因原因和在动态环境中安全行动至关重要，但即使对于最先进的（SoTA）VLM来说，这仍然是一个重大挑战。先前的视频基准强调了其他内容（体育、烹饪等），但现有基准没有专门关注短时和长时AD视频的时间理解。为填补这一空白，我们提出了自动驾驶时间理解（TAD）基准，包含近6000个问答（QA）对，涵盖7个任务，并评估了9个闭源和开源通用以及AD专用模型。当前SoTA模型在TAD上的表现远低于人类准确率。为了改进基于VLM的驾驶代理的时间推理，我们提出了两种新颖的无训练解决方案：Scene-CoT，它使用思维链（CoT）推理；以及TCogMap，它结合了由轨迹分析模块生成的自我中心时间认知图，该模块作为VLM周围的代理工具运行。与现有VLM集成后，我们的方法在TAD上的平均准确率提高了高达17.72%，在STSBench上提高了高达10.35%。通过引入TAD、对SoTA模型进行基准测试并提出有效的增强方法，本工作旨在促进野外代理AD系统时间理解的进一步进展。基准和评估代码分别可在 Hugging Face（https://huggingface.co/datasets/vbdai/TAD）和 GitHub（https://github.com/vbdi/tad_bench）上获取。

英文摘要

Vision-Language Models (VLMs) are increasingly deployed as the perception and reasoning backbone of autonomous agents acting in the wild, with autonomous driving (AD) being one of the most safety-critical instances. Reliable temporal understanding is essential for such agents to anticipate events, attribute causes, and act safely in dynamic environments, yet this remains a significant challenge even for state-of-the-art (SoTA) VLMs. Prior video benchmarks have emphasized other content (sports, cooking, etc.), yet no existing benchmark focuses exclusively on temporal understanding for both short- and long-form AD footage. To fill this gap, we present the Temporal Understanding in Autonomous Driving (TAD) benchmark, comprising nearly 6000 question-answer (QA) pairs across 7 tasks, and evaluate 9 closed- and open-source generalist as well as AD-specialist models. Current SoTA models perform substantially below human accuracy on TAD. To improve the temporal reasoning of VLM-based driving agents, we propose two novel training-free solutions: Scene-CoT, which uses Chain-of-Thought (CoT) reasoning, and TCogMap, which incorporates an ego-centric temporal cognitive map produced by a trajectory-analysis module that operates as an agentic tool around the VLM. Integrated with existing VLMs, our methods improve average accuracy on TAD by up to 17.72% and by up to 10.35% on STSBench. By introducing TAD, benchmarking SoTA models, and proposing effective enhancements, this work aims to catalyze further progress on temporal understanding for agentic AD systems operating in the wild. The benchmark and evaluation code are available at Hugging Face (https://huggingface.co/datasets/vbdai/TAD) and GitHub (https://github.com/vbdi/tad_bench), respectively.


2512.08331 2026-06-04 cs.CV

DMAConv: Dual Mask-Adaptive Convolution for Remote Sensing Pansharpening

DMAConv: 用于遥感全色锐化的双掩膜自适应卷积

Xianghong Xiao, Zeyu Xia, Zhou Fei, Jinliang Xiao, Haorui Chen, Liangjian Deng

Affiliations: University of Electronic Science and Technology of China; Tongji University

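AI summary: Proposes Dual Mask-Adaptive Convolution (DMAConv), which uses soft and hard masks to allocate computation dynamically and a lightweight dual-branch structure to handle the regional heterogeneity of remote sensing images efficiently, achieving SOTA performance at the lowest computational cost among adaptive convolution models.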


AI中文摘要

全色锐化旨在融合高分辨率全色图像与低分辨率多光谱图像。现有的深度学习方法，包括最近的自适应卷积，难以应对遥感图像的区域异质性，且往往计算成本过高。为解决这些挑战，我们提出双掩膜自适应卷积（DMAConv），这是一种根据特征特性动态分配计算资源的新型算子。DMAConv首先使用轻量级模块生成软掩膜和硬掩膜。硬掩膜将特征分为一个紧凑分支（用于全局处理冗余信息）和一个聚焦分支（以更多计算投入建模复杂异质区域）。随后，软掩膜对两个分支的输入特征进行初步调制。这种双分支掩膜自适应设计显著增强了特征表示，同时最小化了计算开销。大量实验表明，我们的方法在广泛的定量基准上达到了SOTA，且参数数量显著更低，计算成本在自适应卷积模型中最低。

英文摘要

Pansharpening aims to fuse a high-resolution panchromatic image with a low-resolution multispectral image. Existing deep learning methods, including recent adaptive convolutions, struggle with regional heterogeneity in remote sensing images and often incur prohibitive computational costs. To address these challenges, we propose Dual Mask-Adaptive Convolution (DMAConv), a novel operator that dynamically allocates computational resources based on feature characteristics. DMAConv first employs a lightweight module to generate soft and hard masks. The hard mask separates features into a compact branch for processing redundant information globally and a focused branch that models complex, heterogeneous regions with greater computational investment. The soft mask then preliminarily modulates the input features for both branches. This dual-branch, mask-adaptive design significantly enhances feature representation while minimizing computational overhead. Extensive experiments demonstrate that our method achieves SOTA on a broad array of quantitative benchmarks, with substantially lower parameter counts and the minimal computational cost among adaptive convolution models.


2512.08094 2026-06-04 cs.CL

Segment, Embed, and Align: A Universal Recipe for Aligning Subtitles to Signing

分割、嵌入和对齐：将字幕与手语对齐的通用方法

Zifan Jiang, Youngjoon Jang, Liliane Momeni, Gül Varol, Sarah Ebling, Andrew Zisserman

Affiliations: VGG, Dept. of Engineering Science, University of Oxford; University of Zurich; KAIST; LIGM, CNRS, Univ Gustave Eiffel, ENPC, IP Paris

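AI summary: Proposes SEA, a universal framework that uses pretrained models to segment a video frame sequence into individual signs and to embed each sign clip into a latent space shared with text, then aligns them efficiently with a lightweight dynamic program, reaching state-of-the-art performance on multiple sign language datasets.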

Comments Camera-ready version of ACL 2026 (Main)


AI中文摘要

本文的目标是开发一种通用方法，用于将字幕（即带有对应时间戳的口语文本）与连续手语视频对齐。先前的方法通常依赖于针对特定语言或数据集的端到端训练，这限制了它们的通用性。相比之下，我们的方法Segment, Embed, and Align (SEA)提供了一个适用于多种语言和领域的单一框架。SEA利用两个预训练模型：第一个模型将视频帧序列分割为单个手势，第二个模型将每个手势的视频片段嵌入到与文本共享的潜在空间中。随后，通过轻量级动态规划程序进行对齐，该程序即使在长达一小时的视频中也能在CPU上高效运行，耗时不到一分钟。SEA灵活且能适应各种场景，利用从小型词汇表到大型连续语料库的资源。在四个手语数据集上的实验展示了最先进的对齐性能，突显了SEA在生成高质量并行数据以推动手语处理方面的潜力。SEA的代码和模型已公开提供。

英文摘要

The goal of this work is to develop a universal approach for aligning subtitles (i.e., spoken language text with corresponding timestamps) to continuous sign language videos. Prior approaches typically rely on end-to-end training tied to a specific language or dataset, which limits their generality. In contrast, our method Segment, Embed, and Align (SEA) provides a single framework that works across multiple languages and domains. SEA leverages two pretrained models: the first to segment a video frame sequence into individual signs and the second to embed the video clip of each sign into a shared latent space with text. Alignment is subsequently performed with a lightweight dynamic programming procedure that runs efficiently on CPUs within a minute, even for hour-long episodes. SEA is flexible and can adapt to a wide range of scenarios, utilizing resources from small lexicons to large continuous corpora. Experiments on four sign language datasets demonstrate state-of-the-art alignment performance, highlighting the potential of SEA to generate high-quality parallel data for advancing sign language processing. SEA's code and models are openly available.
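
The alignment step can be pictured as a monotonic dynamic program. The sketch below is an illustration under assumptions, not SEA's actual procedure: it assumes a precomputed similarity matrix `sim[i][j]` between subtitle `i` and sign segment `j`, requires every subtitle to receive at least one segment, and keeps subtitle order. The function name `align_monotonic` is hypothetical.

```python
def align_monotonic(sim):
    """Assign each sign segment to one subtitle, in order, maximizing total similarity.

    sim[i][j] is the (assumed, precomputed) similarity of subtitle i and segment j.
    Returns a list whose j-th entry is the subtitle index chosen for segment j.
    """
    n, m = len(sim), len(sim[0])
    if m < n:
        raise ValueError("need at least one segment per subtitle")
    neg_inf = float("-inf")
    # score[i][j]: best total when segment j is assigned to subtitle i.
    score = [[neg_inf] * m for _ in range(n)]
    # back[i][j]: 0 if segment j-1 stayed on subtitle i, 1 if it was on subtitle i-1.
    back = [[0] * m for _ in range(n)]
    score[0][0] = sim[0][0]
    for j in range(1, m):
        for i in range(min(n, j + 1)):
            stay = score[i][j - 1]
            advance = score[i - 1][j - 1] if i > 0 else neg_inf
            if advance > stay:
                score[i][j] = advance + sim[i][j]
                back[i][j] = 1
            else:
                score[i][j] = stay + sim[i][j]
    # Backtrack from the last subtitle at the last segment.
    assignment = [0] * m
    i = n - 1
    for j in range(m - 1, -1, -1):
        assignment[j] = i
        i -= back[i][j]
    return assignment


sim = [
    [0.9, 0.8, 0.1, 0.2],  # subtitle 0 against four sign segments
    [0.1, 0.2, 0.7, 0.9],  # subtitle 1 against the same segments
]
print(align_monotonic(sim))  # [0, 0, 1, 1]
```

The table has one cell per subtitle and segment pair and each cell looks at two predecessors, so the cost grows with the product of the two counts, which is in keeping with the CPU-only runtime the abstract describes.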


2504.03038 2026-06-04 cs.RO cs.SY eess.SY

Learning to Adapt Control Barrier Functions Under Epistemic and Aleatoric Uncertainty

在认知不确定性和偶然不确定性下学习自适应控制障碍函数

Taekyung Kim, Robin Inho Kee, Dimitra Panagou

Affiliations: Department of Robotics, University of Michigan; Charles Stark Draper Laboratory; Department of Aerospace Engineering, University of Michigan

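AI summary: Proposes the Online Adaptive CBF (OA-CBF) framework, which adjusts CBF parameters dynamically using a probabilistic ensemble neural network and a graph-attention encoder, reducing conservatism while maintaining safety.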

Comments Extended journal version of the IEEE CDC 2025 paper (available as arXiv:2504.03038v5). Project page: https://www.taekyung.me/oa-cbf


AI中文摘要

控制障碍函数（CBF）为机器人系统强制执行安全约束提供了一种可处理的机制，但其实际性能强烈依赖于类-K函数参数的选择。在输入约束下，保守参数通常以牺牲进展速度为代价保持可行性，而激进参数可能导致基于CBF的优化不可行或不安全。本文提出了在线自适应CBF（OA-CBF），一种在运行时调整CBF参数的框架。我们引入了局部验证的CBF参数概念，该概念在有限预测范围内认证候选参数，并表明当这种验证在连续更新间隔内保持时，安全性得以保留。为了高效识别局部验证的参数，OA-CBF训练一个概率集成神经网络来评估查询的CBF参数，而不是直接预测单个参数。图注意力编码器表示可变大小的障碍物环境，由保形预测校准的认知不确定性门拒绝不可靠的预测，分布鲁棒的CVaR条件筛选偶然风险。在验证的候选参数中，OA-CBF选择具有最佳预测进展度量的参数，并通过MPC-CBF或CBF-QP安全滤波器应用它。在动态独轮车、平面和三维四旋翼、运动学自行车以及VTOL四翼飞机基准上的仿真研究表明，OA-CBF在保持低碰撞率和不可行率的同时，减少了固定参数CBF控制器的保守性。

英文摘要

Control barrier functions (CBFs) provide a tractable mechanism for enforcing safety constraints in robotic systems, but their practical performance depends strongly on the choice of class-K function parameters. Under input constraints, conservative parameters often preserve feasibility at the cost of slow progress, whereas aggressive parameters can make the CBF-based optimization infeasible or unsafe. This paper proposes Online Adaptive CBF (OA-CBF), a framework for adapting CBF parameters at runtime. We introduce the notion of locally validated CBF parameters, which certify candidate parameters over a finite prediction horizon, and show that safety is preserved when such validation is maintained over successive update intervals. To identify locally validated parameters efficiently, OA-CBF trains a probabilistic ensemble neural network to evaluate queried CBF parameters rather than directly predict a single parameter. A graph-attention encoder represents variable-size obstacle environments, an epistemic uncertainty gate calibrated by conformal prediction rejects unreliable predictions, and a distributionally robust CVaR condition screens aleatoric risk. Among the verified candidates, OA-CBF selects the parameter with the best predicted progress metric and applies it through either an MPC-CBF or CBF-QP safety filter. Simulation studies on dynamic unicycle, planar and three-dimensional quadrotor, kinematic bicycle, and VTOL quadplane benchmarks show that OA-CBF reduces the conservatism of fixed-parameter CBF controllers while maintaining low collision and infeasibility rates.
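
To make the role of the class-K parameter concrete, here is a minimal sketch. It assumes a one-dimensional single integrator x' = u that must stay below a wall at `x_max`, the barrier h(x) = x_max - x, and a linear class-K function alpha(h) = gamma * h. This toy setup and the names `cbf_filter` and `simulate` are illustrative assumptions and not OA-CBF itself. For this system the CBF-QP reduces to clipping the nominal input.

```python
def cbf_filter(x, u_nom, gamma, x_max=1.0, u_max=1.0):
    """Closed-form CBF-QP for x' = u with barrier h(x) = x_max - x.

    The CBF condition h' + gamma * h >= 0 becomes u <= gamma * h, so the
    input closest to u_nom is u_nom clipped to [-u_max, min(u_max, gamma * h)].
    The interval is non-empty whenever h >= 0.
    """
    h = x_max - x
    upper = min(u_max, gamma * h)
    return max(-u_max, min(u_nom, upper))


def simulate(gamma, steps=200, dt=0.01):
    """Drive toward the wall with a constant nominal input and return the final state."""
    x = 0.0
    for _ in range(steps):
        x += dt * cbf_filter(x, u_nom=1.0, gamma=gamma)
    return x


conservative = simulate(gamma=0.5)
aggressive = simulate(gamma=5.0)
print(conservative < aggressive < 1.0)  # True
```

Under these assumptions the smaller gamma stops well short of the wall while the larger one gets close without crossing it, which is the progress versus conservatism trade-off that adapting the parameter at runtime is meant to manage.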


2511.21035 2026-06-04 cs.LG

RAVQ-HoloNet: Rate-Adaptive Vector-Quantized Hologram Compression

RAVQ-HoloNet：速率自适应向量量化全息图压缩

Shima Rafiei, Zahra Nabizadeh Shahr-Babak, Soroush Khoubyarian, Alexandre Cooper, Shadrokh Samavi, Shahram Shirani

Affiliations: Department of Electrical and Computer Engineering, McMaster University; Department of Physics and Astronomy, University of Waterloo; Institute for Quantum Computing, Department of Physics and Astronomy, University of Waterloo; Computer Science Department, Seattle University

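AI summary: Proposes RAVQ-HoloNet, a vector quantization framework that integrates rate-adaptive compression with the transformation to phase-only holograms, achieving high-fidelity reconstruction at low bit rates and outperforming existing methods.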


AI中文摘要

全息术为AR/VR应用提供了巨大潜力。然而，其应用受到数据压缩高需求的限制。现有的深度学习方法通常缺乏单一网络内的速率自适应性，往往需要多个模型来覆盖不同的带宽要求。我们提出了RAVQ-HoloNet，一种速率自适应向量量化框架，将速率自适应压缩与图像数据到纯相位全息图的变换相结合。RAVQ-HoloNet实现了高保真重建，通过两种不同的架构配置超越了当前最先进的方法：一种针对低比特率优化的标准模型，以及一种针对超低比特率设置的更深、扩展变体。为了评估这些模型，我们使用DIV2K数据集作为高保真全息重建的基准。模拟中的定量分析表明，我们的方法显著超越了当前基准。具体来说，在低比特率领域，相对于最先进的方法，我们的方法实现了-33.91%的BD-Rate降低和1.02dB的BD-PSNR增益。此外，在SLM设备上的实验结果表明，我们的方法实现了更高的对比度和改进的质量。

英文摘要

Holography offers significant potential for AR/VR applications. However, its adoption is limited by the high demand for data compression. Existing deep learning approaches generally lack rate adaptivity within a single network and often require multiple models to cover different bandwidth requirements. We present RAVQ-HoloNet, a rate-adaptive vector quantization framework that integrates rate-adaptive compression with the transformation of image data into phase-only holograms. RAVQ-HoloNet achieves high-fidelity reconstructions and outperforms current state-of-the-art methods in two distinct architectural configurations: a standard model optimized for low bit rates and a deeper, extended variant tailored for ultra-low bit rate settings. To evaluate these models, we used the DIV2K dataset as a benchmark for high-fidelity holographic reconstruction. Quantitative analysis in simulation shows that our approach significantly surpasses current benchmarks. Specifically, in the low bit rate domain, our method achieves a BD-Rate of -33.91% (that is, a 33.91% bit rate saving at equal quality) and a BD-PSNR gain of 1.02 dB relative to the state-of-the-art method. Additionally, experimental results on an SLM device show that our method achieves higher contrast and improved quality.


2511.12581 2026-06-04 cs.LG

LMM-IR: Large-Scale Netlist-Aware Multimodal Framework for Static IR-Drop Prediction

LMM-IR：面向静态IR压降预测的大规模网表感知多模态框架

Kai Ma, Zhen Wang, Hongquan He, Qi Xu, Tinghuan Chen, Hao Geng

Affiliations: Tsinghua University

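AI summary: Proposes a multimodal framework based on a large-scale netlist transformer and a 3D point cloud representation for fast, accurate prediction of static IR drop in chips, achieving the best F1 score and lowest MAE when compared with the winning teams of the ICCAD 2023 contest.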

Comments Accepted by DAC2025


DOI: 10.1109/DAC63849.2025.11133205

AI中文摘要

静态IR压降分析是芯片设计领域一项基础且关键的任务。然而，该过程可能相当耗时，有时需要数小时。此外，解决IR压降违规问题通常需要迭代分析，从而造成计算负担。因此，快速准确的IR压降预测对于减少芯片设计的总体投入时间至关重要。在本文中，我们首次提出了一种新颖的多模态方法，通过大规模网表变换器（LNT）高效处理SPICE文件。我们的关键创新在于将网表拓扑表示为3D点云并进行处理，从而能够高效处理节点数达数十万至数百万的网表。所有类型的数据，包括网表文件和图像数据，都被编码到潜在空间作为特征，并输入模型进行静态电压降预测。这使得来自多种模态的数据能够集成，实现互补预测。实验结果表明，我们提出的算法在ICCAD 2023竞赛的获胜团队和现有最优算法中，能够取得最佳F1分数和最低MAE。

英文摘要

Static IR drop analysis is a fundamental and critical task in chip design. Nevertheless, this process can be quite time-consuming, potentially requiring several hours. Moreover, addressing IR drop violations frequently demands iterative analysis, which adds to the computational burden. Therefore, fast and accurate IR drop prediction is vital for reducing the overall time invested in chip design. In this paper, we propose, for the first time, a novel multimodal approach that efficiently processes SPICE files through a large-scale netlist transformer (LNT). Our key innovation is representing and processing netlist topology as a 3D point cloud, enabling efficient handling of netlists with hundreds of thousands to millions of nodes. All types of data, including netlist files and image data, are encoded into a latent space as features and fed into the model for static voltage drop prediction. This enables the integration of data from multiple modalities for complementary predictions. Experimental results demonstrate that our proposed algorithm achieves the best F1 score and the lowest MAE compared with the winning teams of the ICCAD 2023 contest and state-of-the-art algorithms.


2511.06331 2026-06-04 cs.CV

Label-Efficient 3D Forest Mapping: Self-Supervised and Transfer Learning for Instance Segmentation, Semantic Segmentation, and Species Classification

标签高效的3D森林映射：自监督与迁移学习用于实例分割、语义分割和物种分类

Aldino Rizaldy, Fabian Ewald Fassnacht, Ahmed Jamal Afifi, Hua Jiang, Richard Gloaguen, Pedram Ghamisi

Affiliations: Helmholtz-Zentrum Dresden-Rossendorf (HZDR), Helmholtz Institute Freiberg for Resource Technology (HIF); Remote Sensing and Geoinformatics, Freie Universität Berlin; Institute of Geomatics, BOKU University; Faculty of Electrical and Computer Engineering, University of Iceland

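AI summary: Uses self-supervised and transfer learning strategies to improve tree instance segmentation, semantic segmentation, and species classification in 3D point clouds with little annotated data, and integrates the tasks into a unified framework to streamline the workflow.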


AI中文摘要

个体树木级别的详细结构和物种信息对于支持精准林业、生物多样性保护以及为生物量和碳映射提供参考数据日益重要。来自机载和地面激光扫描的点云是目前快速大规模获取此类信息的最合适数据源。深度学习的最新进展改进了对个体树木的分割和分类以及语义树组件的识别。然而，深度学习模型通常需要大量标注训练数据，这限制了进一步的改进。为3D点云生成密集、高质量的标注，尤其是在复杂森林中，劳动密集且难以规模化。我们探索使用自监督和迁移学习来减少对大型标注数据集的依赖。我们的目标是提高三个任务的性能：实例分割、语义分割和树木分类，使用现实且可操作的训练集。与从头训练相比，我们观察到所有任务均有所改进，并通过各自的指标进行评估。对于实例分割，自监督学习结合领域适应使AP50提高了16.98%。对于语义分割，仅自监督学习使mIoU提高了1.79%。对于树木分类，层次迁移学习使平均Jaccard提高了6.07%。为简化使用并鼓励采用，我们将这些任务集成到一个统一框架中，简化了从原始点云到树木描绘、结构分析和物种分类的流程。预训练模型减少了约21%的能耗和碳排放。这一开源贡献旨在加速从激光扫描点云中操作性地提取个体树木信息，以支持林业、生物多样性和碳映射。

英文摘要

Detailed structural and species information at the individual-tree level is increasingly important to support precision forestry and biodiversity conservation, and to provide reference data for biomass and carbon mapping. Point clouds from airborne and ground-based laser scanning are currently the most suitable data source to rapidly derive such information at scale. Recent advances in deep learning have improved the segmentation and classification of individual trees and the identification of semantic tree components. However, deep learning models typically require large amounts of annotated training data, which limits further improvement. Producing dense, high-quality annotations for 3D point clouds, especially in complex forests, is labor-intensive and challenging to scale. We explore strategies to reduce dependence on large annotated datasets using self-supervised and transfer learning. Our objective is to improve performance across three tasks (instance segmentation, semantic segmentation, and tree classification) using realistic and operational training sets. Compared with training from scratch, we observe improvements across all tasks, each evaluated with its respective metric. For instance segmentation, self-supervised learning combined with domain adaptation improves AP50 by 16.98%. For semantic segmentation, self-supervised learning alone improves mIoU by 1.79%. For tree classification, hierarchical transfer learning improves mean Jaccard by 6.07%. To simplify use and encourage uptake, we integrated the tasks into a unified framework, streamlining the process from raw point clouds to tree delineation, structural analysis, and species classification. Pretrained models reduce energy consumption and carbon emissions by ~21%. This open-source contribution aims to accelerate operational extraction of individual tree information from laser scanning point clouds to support forestry, biodiversity, and carbon mapping.


2511.00801 2026-06-04 cs.CV cs.MM

Med-Banana: Learning Quality-Controlled Medical Image Editing from Success-and-Failure Trajectories

Med-Banana：从成功与失败轨迹中学习质量可控的医学图像编辑

Zhihui Chen, Qingyuan Lei, Kai He, Yanrui Du, Mengling Feng

Affiliations: National University of Singapore; The Chinese University of Hong Kong; Harbin Institute of Technology

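AI summary: Proposes the Med-Banana framework, which collects Med-Banana-80K, a dataset of success-and-failure editing trajectories, and jointly trains an editor, verifier, and refiner for quality-controlled medical image editing.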


AI中文摘要

文本引导的医学图像编辑必须满足所需的病理特征，同时保留解剖结构、模态特定外观和临床合理性。然而，现有数据集主要用最终接受的编辑结果来监督编辑器，并丢弃生成过程中产生的失败尝试。我们认为这些失败为质量控制提供了必要的监督：它们指定了应该拒绝什么、为什么编辑在医学或视觉上无效，以及应该如何修改指令。我们提出了Med-Banana，一个用于质量可控的医学图像编辑的轨迹监督框架。我们引入了Med-Banana-80K，一个大规模的成功与失败编辑轨迹资源，包含候选图像、验证结果、拒绝原因和提示优化。在此基础上，Med-Banana联合训练编辑器、验证器和优化器，实现了从接受和拒绝尝试中进行编辑-验证-优化推理。在MLLM评估者、盲审专家评估、源保留和真实-合成可分离性探测上的实验表明，与开放的医学图像编辑器相比，该方法具有一致的改进。代码和数据已公开。

英文摘要

Text-guided medical image editing must satisfy the requested pathology while preserving anatomy, modality-specific appearance, and clinical plausibility. However, existing datasets largely supervise editors with final accepted edits and discard the failed attempts produced during generation. We argue that these failures provide essential supervision for quality control: they specify what should be rejected, why an edit is medically or visually invalid, and how the instruction should be revised. We present Med-Banana, a trajectory-supervised framework for quality-controlled medical image editing. We introduce Med-Banana-80K, a large-scale resource of success-and-failure editing trajectories with candidate images, verification outcomes, rejection reasons, and prompt refinements. Building on it, Med-Banana jointly trains an editor, verifier, and refiner, enabling edit–verify–refine inference from accepted and rejected attempts. Experiments across MLLM judges, blind expert assessment, source-preservation and real–synthetic separability probes demonstrate consistent improvements over open medical image editors. Code and data are publicly available.


2511.03304 2026-06-04 cs.LG cs.AI

Extending Fair Null-Space Projections for Continuous Attributes to Kernel Methods

将连续属性的公平零空间投影扩展到核方法

Felix Störck, Fabian Hinder, Barbara Hammer

发表机构 * Felix Störck ； Fabian Hinder ； Barbara Hammer

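AI summary: Extends fair null-space projection to kernel-induced feature spaces by transforming the kernel matrix directly via the empirical feature space, yielding a model- and fairness-score-agnostic method for continuous protected attributes that shows competitive or improved performance with support vector regression.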

Comments Accepted to ICML 2026


AI中文摘要

随着机器学习系统融入数百万人的日常社会生活，公平性在其发展中的优先级日益提高。公平性概念通常依赖受保护属性来评估潜在偏差。这里，大多数文献关注离散设置下的目标和受保护属性。关于连续属性尤其是与回归结合——我们称之为“连续公平性”——的文献很少。一种常见策略是迭代零空间投影，目前仅在线性模型或通过非线性编码器获得的嵌入中探索。我们通过“经验特征空间”将其扩展到核诱导特征空间，从而改进这一点。我们从理论上推导出这是核矩阵的直接变换，产生一种适用于连续受保护属性的模型和公平评分无关的方法。我们证明，与支持向量回归结合时，我们的新方法在多个数据集上相比其他当代方法具有竞争性或改进的性能。

英文摘要

With the ongoing integration of machine learning systems into the everyday social life of millions, fairness becomes an ever-increasing priority in their development. Fairness notions commonly rely on protected attributes to assess potential biases. Here, the majority of the literature focuses on discrete setups regarding both target and protected attributes. The literature on continuous attributes, especially in conjunction with regression (we refer to this as continuous fairness), is scarce. A common strategy is iterative null-space projection, which so far has only been explored for linear models or for embeddings such as those obtained by a non-linear encoder. We improve on this by extending it to kernel-induced feature spaces by means of the "empirical feature space". We theoretically derive this as a direct transformation of the kernel matrix, yielding a model- and fairness-score-agnostic method applicable to continuous protected attributes. We demonstrate that our novel approach in conjunction with Support Vector Regression (SVR) provides competitive or improved performance across multiple datasets in comparison to other contemporary methods.
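
For intuition, a single null-space projection step can be written directly on the kernel matrix. The sketch assumes the direction to remove is w = sum_i alpha[i] * phi(x_i), for example the weight vector of a kernel regressor that predicts the protected attribute. Projecting the features onto the complement of w then gives K' = K - (K alpha)(K alpha)^T / (alpha^T K alpha). This follows from standard linear algebra and is an illustration, not necessarily the exact transformation derived in the paper; `project_kernel` is a hypothetical helper.

```python
def project_kernel(K, alpha):
    """Remove the feature-space direction w = sum_i alpha[i] * phi(x_i) from K.

    K is a symmetric positive semi-definite kernel matrix given as a list of rows.
    Returns the kernel matrix of the features projected onto the null space of w.
    """
    n = len(K)
    # K @ alpha: inner products of every training feature with w.
    k_alpha = [sum(K[i][j] * alpha[j] for j in range(n)) for i in range(n)]
    # alpha^T K alpha: squared norm of w in feature space.
    norm_sq = sum(alpha[i] * k_alpha[i] for i in range(n))
    if norm_sq <= 0:
        raise ValueError("alpha must define a non-zero direction")
    return [
        [K[i][j] - k_alpha[i] * k_alpha[j] / norm_sq for j in range(n)]
        for i in range(n)
    ]


K = [[2.0, 1.0, 0.0], [1.0, 2.0, 1.0], [0.0, 1.0, 2.0]]
alpha = [1.0, 0.0, -1.0]
K_proj = project_kernel(K, alpha)
# After projection, no training feature has a component along w.
residual = [sum(K_proj[i][j] * alpha[j] for j in range(3)) for i in range(3)]
print(residual)  # [0.0, 0.0, 0.0]
```

Repeating the step with a freshly fitted alpha each time gives an iterative variant, and the projected matrix can be passed to any learner that accepts a precomputed kernel.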


2505.24528 2026-06-04 cs.CV cs.LG

Geospatial Foundation Models to Enable Progress on Sustainable Development Goals

地理空间基础模型推动可持续发展目标的进展

Pedram Ghamisi, Weikang Yu, Xiaokang Zhang, Aldino Rizaldy, Jian Wang, Chufeng Zhou, Richard Gloaguen, Gustau Camps-Valls

Affiliations: Helmholtz-Zentrum Dresden-Rossendorf; University of Iceland; Wuhan University; Wuhan University of Science and Technology; Universitat de València

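AI summary: Proposes SustainFM, a benchmarking framework grounded in the 17 Sustainable Development Goals for evaluating geospatial foundation models; finds that they often outperform traditional methods across diverse tasks, and argues for a shift from model-centric development to impact-driven deployment with attention to energy efficiency, generalization, and ethics.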


AI中文摘要

基础模型（FMs）是大规模预训练的人工智能系统，已革新自然语言处理和计算机视觉，并正在推进地理空间分析和地球观测（EO）。它们承诺在任务间改进泛化、可扩展性以及用最少标注数据高效适应。然而，尽管地理空间FMs迅速激增，其现实世界效用和与全球可持续发展目标的一致性仍未充分探索。我们提出SustainFM，一个基于17个可持续发展目标的全面基准框架，涵盖从资产财富预测到环境危害检测的极其多样化的任务。本研究提供了对地理空间FMs的严格、跨学科评估，并对其在实现可持续发展目标中的作用提供了关键见解。我们的发现表明：（1）虽然并非普遍优越，但FMs在多样任务和数据集上通常优于传统方法。（2）评估FMs应超越准确性，将可迁移性、泛化性和能效作为其负责任使用的关键标准。（3）FMs支持可扩展的、基于SDG的解决方案，为应对复杂可持续发展挑战提供广泛实用性。关键的是，我们倡导从以模型为中心的发展转向以影响驱动的部署，并强调能效、对领域变化的鲁棒性以及伦理考量等指标。

英文摘要

Foundation Models (FMs) are large-scale, pre-trained artificial intelligence (AI) systems that have revolutionized natural language processing and computer vision, and are now advancing geospatial analysis and Earth Observation (EO). They promise improved generalization across tasks, scalability, and efficient adaptation with minimal labeled data. However, despite the rapid proliferation of geospatial FMs, their real-world utility and alignment with global sustainability goals remain underexplored. We introduce SustainFM, a comprehensive benchmarking framework grounded in the 17 Sustainable Development Goals with extremely diverse tasks ranging from asset wealth prediction to environmental hazard detection. This study provides a rigorous, interdisciplinary assessment of geospatial FMs and offers critical insights into their role in attaining sustainability goals. Our findings show: (1) While not universally superior, FMs often outperform traditional approaches across diverse tasks and datasets. (2) Evaluating FMs should go beyond accuracy to include transferability, generalization, and energy efficiency as key criteria for their responsible use. (3) FMs enable scalable, SDG-grounded solutions, offering broad utility for tackling complex sustainability challenges. Critically, we advocate for a paradigm shift from model-centric development to impact-driven deployment, and emphasize metrics such as energy efficiency, robustness to domain shifts, and ethical considerations.


2510.24342 2026-06-04 cs.AI

A Unified Geometric Space for Topological Alignment Between Transformer-Based Models and Human Brain Networks

基于Transformer的模型与人脑网络之间拓扑对齐的统一几何空间

Silin Chen, Yuzhong Chen, Caiwei Wang, Zifan Wang, Junhao Wang, Zifeng Jia, Keith M Kendrick, Tuo Zhang, Lin Zhao, Dezhong Yao, Tianming Liu, Xi Jiang

Affiliations: The Clinical Hospital of Chengdu Brain Science Institute, MOE-K Lab for NeuroInformation, Brain-Apparatus Communication Institute, School of Life Science and Technology, University of Electronic Science and Technology of China; School of Automation, Northwestern Polytechnical University; Department of Biomedical Engineering, New Jersey Institute of Technology; School of Computing, University of Georgia

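AI summary: Proposes a modality-agnostic, task-free topological alignment space that maps the attention topology of Transformer models onto human intrinsic connectivity networks via graph-based organizational properties, revealing a continuous arc-shaped distribution and the alignment characteristics of models across modalities and scales.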


AI中文摘要

先前的脑-人工智能对齐研究通常受限于特定的输入和任务，限制了其捕捉不同模态模型组织特性的能力。在这项工作中，我们聚焦于基于Transformer的模型，引入了一个脑-模型拓扑对齐空间。我们不是从神经机制推断对齐，而是通过基于图的组织特性来检查对齐，将模型的内在空间注意力拓扑映射到规范的人脑固有连接网络（ICNs）。这使得在组织特性层面上，对视觉、语言和多模态系统进行模态无关且无任务的比较成为可能。通过分析跨这些模态和规模的151个基于Transformer的模型，我们观察到一个连续的弧形分布，反映了不同程度的拓扑对齐。与其训练目标一致，优化用于全局语义抽象的模型与高阶ICNs关联更紧密，而专注于局部细节的模型则与低级ICNs关联。更令人惊讶的是，我们发现了非直观的现象：DINOv2相比其前身表现出对齐降低，蒸馏的DeiT模型显示出反直觉的缩放反转，即更大的模型与高阶ICNs对齐更差，而微调和指令调优对对齐影响有限。此外，拓扑对齐分数与30个视觉Transformer的ImageNet-1K Top-1准确率相关性不显著（r=0.266, p=0.156）。这项工作为通过脑参考拓扑映射比较基于Transformer的模型的组织特性提供了新的定量视角。

英文摘要

Prior brain-AI alignment studies are typically constrained by specific inputs and tasks, limiting their ability to capture organizational properties across models with different modalities. In this work, we focus on Transformer-based models and introduce a brain-model topological alignment space. Rather than inferring alignment from neural mechanisms, we examine it through graph-based organizational properties, mapping the intrinsic spatial attention topology of a model onto canonical human intrinsic connectivity networks (ICNs). This enables a modality-agnostic and task-free comparison across vision, language, and multimodal systems at the level of organizational properties. Analyzing 151 Transformer-based models across these modalities and scales, we observe a continuous arc-shaped distribution, reflecting varying degrees of topological alignment. Consistent with their training objectives, models optimized for global semantic abstraction were associated more closely with higher-order ICNs, while local detail-focused models associated with low-level ICNs. More surprisingly, we uncovered non-intuitive phenomena: DINOv2 exhibited reduced alignment compared to its predecessors, distilled DeiT models showed a counterintuitive scaling inversion where larger models aligned less well with higher-order ICNs, and fine-tuning as well as instruction tuning had limited effect on alignment. Furthermore, topological alignment scores showed non-significant correlation with ImageNet-1K Top-1 accuracy in 30 vision Transformers (r=0.266, p=0.156). This work provides a new quantitative perspective for comparing the organizational properties of Transformer-based models through brain-referenced topological mapping.
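
As a quick sanity check on the reported correlation, the test statistic can be recomputed from r and the sample size alone. This assumes the value was assessed with the standard two-sided t-test for a Pearson coefficient; the abstract does not say which test was used. The helper name `pearson_t` is illustrative.

```python
import math


def pearson_t(r, n):
    """t statistic for testing a Pearson correlation r from n samples against zero.

    Under the null hypothesis it follows a t distribution with n - 2 degrees of freedom.
    """
    if n < 3 or not -1.0 < r < 1.0:
        raise ValueError("need n >= 3 and -1 < r < 1")
    return r * math.sqrt((n - 2) / (1.0 - r * r))


t = pearson_t(0.266, 30)
print(round(t, 2))  # 1.46
```

With 28 degrees of freedom the two-sided 5% critical value is about 2.05, so a statistic near 1.46 is consistent with the non-significant result stated in the abstract.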


2510.15416 2026-06-04 cs.AI

Adaptive Minds: Empowering Agents with LoRA-as-Tools

自适应心智：将LoRA作为工具赋予智能体能力

Pavan C Shekar, Aswanth Krishnan

发表机构 * GitHub

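AI summary: Proposes a framework that treats LoRA adapters as callable tools and aggregates the strengths of many specialized adapters through routing and agentic reasoning, reaching 98.3% routing accuracy on a 30-adapter library and substantial gains across nine task families.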

Comments 13 pages, 3 figures, 9 tables. ICML 2026 CompLearn Workshop camera-ready (non-archival). Code: https://github.com/qpiai/adaptive-minds


AI中文摘要

我们研究了一个框架，其中LoRA适配器被视为可调用的工具，基础语言模型可以动态选择并调用它们。我们假设，当适配器经过训练以提供强大的领域特定增益，并附带清晰的元数据时，基础模型可以可靠地将查询路由到适当的专家，从而有效地在单个框架内聚合许多专门适配器的优势。我们引入了自适应心智（Adaptive Minds），这是一个通用框架，在其中我们研究单步路由和多步智能体推理。在这种设置中，智能体可以迭代地调用多个适配器以及其他工具（例如，外部API、检索系统或执行环境），并在多个步骤中对其输出进行推理。这重新将适配器视为模块化技能或记忆单元，可以在推理过程中组合，而不是静态应用。在我们的评估中，路由层在30个适配器库上达到了98.3%的准确率，并且在单一共享训练配方下，训练有素的专业适配器在九个任务族中提供了+4.6到+84.0个百分点的严格评分增益；AM路由器在每个查询包含领域信号的基准测试中，将这些增益聚合在直接专业适配器的5个百分点以内。我们的研究结果表明，该方法的有效性取决于各个适配器的质量和专业化程度，并且启用许多此类专家的灵活组合可以显著扩展语言模型智能体的实际能力，朝着更通用的、工具增强的智能迈进。

英文摘要

We investigate a framework in which LoRA adapters are treated as callable tools that a base language model can dynamically select and invoke. We hypothesize that, when adapters are trained to provide strong domain-specific gains and are exposed with clear metadata, a base model can reliably route queries to the appropriate expert, effectively aggregating the benefits of many specialized adapters within a single framework. We introduce Adaptive Minds, a general framework within which we study both single-step routing and multi-step agentic reasoning. In this setting, the agent can iteratively invoke multiple adapters alongside other tools (e.g., external APIs, retrieval systems, or execution environments) and reason over their outputs across multiple steps. This reframes adapters as modular skills or memory units that can be composed during reasoning rather than statically applied. In our evaluation, the routing layer reaches 98.3% accuracy on a 30-adapter library, and well-trained specialists provide +4.6 to +84.0 percentage points of strict-scorer gain across nine task families under a single shared training recipe; the AM router aggregates these gains within 5 pp of the direct specialist on every benchmark whose queries surface domain signal. Our findings suggest that the effectiveness of this approach depends on the quality and specialization of individual adapters, and that enabling flexible composition of many such experts can significantly expand the practical capabilities of language model agents, moving toward more general, tool-augmented intelligence.


2510.13796 2026-06-04 cs.CL cs.CV

The Mechanistic Emergence of Symbol Grounding in Language Models

语言模型中符号接地机制的涌现

Shuyu Wu, Ziqiao Ma, Xiaoxi Luo, Yidong Huang, Josue Torres-Fonseca, Freda Shi, Joyce Chai

Affiliations: University of Science and Technology of China

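AI summary: Using mechanistic and causal analysis, finds that symbol grounding arises in the middle-layer computations of language models, where attention heads aggregate environmental information, and that the phenomenon replicates in multimodal dialogue and across architectures.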


AI中文摘要

符号接地（Harnad, 1990）描述了词语等符号如何通过连接真实世界的感知运动经验来获得意义。最近的研究初步表明，在大规模训练且未使用显式接地目标的（视觉-）语言模型中，接地可能涌现。然而，这种涌现的具体位置及其驱动机制在很大程度上仍未被探索。为解决这一问题，我们引入了一个受控评估框架，通过机械和因果分析系统地追踪符号接地如何在内部计算中产生。我们的发现表明，接地集中在中层计算中，并通过聚合机制实现，其中注意力头聚合环境接地以支持语言形式的预测。这种现象在多模态对话和跨架构（Transformer 和状态空间模型）中复现，但在单向 LSTM 中未出现。我们的结果提供了行为和机械证据，表明符号接地可以在语言模型中涌现，并对预测和潜在控制生成的可靠性具有实际意义。

英文摘要

Symbol grounding (Harnad, 1990) describes how symbols such as words acquire their meanings by connecting to real-world sensorimotor experiences. Recent work has shown preliminary evidence that grounding may emerge in (vision-)language models trained at scale without using explicit grounding objectives. Yet, the specific loci of this emergence and the mechanisms that drive it remain largely unexplored. To address this problem, we introduce a controlled evaluation framework that systematically traces how symbol grounding arises within the internal computations through mechanistic and causal analysis. Our findings show that grounding concentrates in middle-layer computations and is implemented through the aggregate mechanism, where attention heads aggregate the environmental ground to support the prediction of linguistic forms. This phenomenon replicates in multimodal dialogue and across architectures (Transformers and state-space models), but not in unidirectional LSTMs. Our results provide behavioral and mechanistic evidence that symbol grounding can emerge in language models, with practical implications for predicting and potentially controlling the reliability of generation.


2510.13704 2026-06-04 cs.LG cs.AI cs.RO

Simplicial Embeddings Improve Sample Efficiency in Actor-Critic Agents

单纯形嵌入提升Actor-Critic智能体的样本效率

Johan Obando-Ceron, Walter Mayor, Samuel Lavoie, Scott Fujimoto, Aaron Courville, Pablo Samuel Castro

Affiliations: Mila – Québec AI Institute; Université de Montréal; McGill University; CIFAR AI Chair

AI总结针对大规模环境并行化下Actor-Critic方法仍需大量交互的问题，提出使用单纯形嵌入作为轻量级表示层，通过几何归纳偏置产生稀疏离散特征，稳定评论家引导并强化策略梯度，在FastTD3、FastSAC和PPO中一致提升样本效率和最终性能。
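
The summary above does not spell out the layer itself. As a minimal sketch, assuming the common definition of a simplicial embedding (split a feature vector into equal groups and apply a temperature-scaled softmax inside each group), the idea can be illustrated as follows. The function name, `group_size`, and `temperature` are hypothetical choices made for this example, not taken from the paper.

```python
import math


def simplicial_embedding(z, group_size, temperature=1.0):
    """Split z into consecutive groups and apply a softmax within each group."""
    if len(z) % group_size != 0:
        raise ValueError("len(z) must be divisible by group_size")
    out = []
    for start in range(0, len(z), group_size):
        # Scale by the temperature; lower values give sharper, sparser groups.
        group = [v / temperature for v in z[start:start + group_size]]
        peak = max(group)  # subtract the max for numerical stability
        exps = [math.exp(v - peak) for v in group]
        total = sum(exps)
        out.extend(e / total for e in exps)
    return out


# Six raw features become two groups, each lying on a 3-way probability simplex.
features = simplicial_embedding([2.0, 0.0, -1.0, 0.5, 0.5, 3.0],
                                group_size=3, temperature=0.5)
```

Each group of the output is non-negative and sums to one, which is one way to obtain the sparse, near-discrete features the summary describes.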

2510.09953 2026-06-04 cs.CV

J-RAS: Mutual Adaptation for Medical Image Segmentation via Contrastive Retrieval-Augmented Joint Optimization

J-RAS：基于对比检索增强联合优化的医学图像分割互适应方法

Salma J. Ahmed, Emad A. Mohammed, Azam Asilian Bidgoli

发表机构 * Laurier University（劳里尔大学）

AI总结提出J-RAS框架，通过交替对比学习和监督学习联合优化分割与检索模型，实现检索与分割的互适应，提升医学图像分割的边界描绘、鲁棒性和跨数据集泛化能力。

详情

AI中文摘要

临床医生手动进行医学图像分割虽然准确，但耗时且在不同专家间存在差异，而基于AI的模型自动化了这一过程，但在数据有限和域偏移时往往表现不佳。受病理学学员通过指导性比较专家标注的切片和组织病理学图谱参考图像来获得疾病识别技能的启发，我们提出了联合检索增强分割（J-RAS）。该框架使分割网络能够在指导下学习。J-RAS通过交替对比学习和监督学习联合优化分割模型和检索模型，使检索网络能够发现上下文相关的图像-掩码对，从而细化分割模型的解剖推理。与被动提供相似样本的传统检索增强不同，J-RAS建立了一个互适应和优化循环，其中检索模型学习强调分割相关的线索，而分割模型利用检索到的示例来改进边界描绘、对罕见病例的鲁棒性以及跨数据集泛化。在涵盖不同成像模态的四个公共基准（包括ACDC和M&Ms（MRI）、乳腺癌超声以及肺部和感染CT）上，使用多种骨干网络（U-Net、TransUNet、SAM和SegFormer）进行的评估证明了J-RAS的泛化性和有效性。例如，在ACDC上，SegFormer的平均Dice从0.8708±0.042和HD从1.8130±2.49提升至0.9115±0.031和1.1489±0.30。这些结果突显了检索引导的对比优化如何在医学图像分割中桥接人类式指导与机器学习的精确性。

英文摘要

Manual medical image segmentation by clinicians, though accurate, is time-consuming and variable across experts, whereas AI-based models automate this process but often underperform with limited data and domain shifts. Inspired by how pathology trainees acquire disease recognition skills through guided comparison with expert-annotated slides and histopathology atlas reference images, we propose Joint Retrieval-Augmented Segmentation (J-RAS). This framework enables segmentation networks to learn with guidance. J-RAS jointly optimizes a segmentation model and a retrieval model through alternating contrastive and supervised learning, allowing the retrieval network to discover contextually relevant image-mask pairs that refine the segmentation model's anatomical reasoning. Unlike conventional retrieval-based augmentation that passively provides similar samples, J-RAS establishes a mutual adaptation and optimization loop where the retrieval model learns to emphasize segmentation-relevant cues, while the segmentation model leverages retrieved examples to improve boundary delineation, robustness to rare cases, and cross-dataset generalization. Evaluations on four public benchmarks spanning different imaging modalities, including ACDC and M&Ms (MRI), Breast Cancer Ultrasound, and lung and infection CT, across multiple backbones (U-Net, TransUNet, SAM, and SegFormer) demonstrate the generalizability and effectiveness of J-RAS. For instance, on ACDC, SegFormer improves from a mean Dice of 0.8708$\pm$0.042 and HD of 1.8130$\pm$2.49 to 0.9115$\pm$0.031 and 1.1489$\pm$0.30. These results highlight how retrieval-guided contrastive optimization bridges human-like guidance and machine-learned precision in medical image segmentation.
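
For readers unfamiliar with the Dice metric quoted above, here is a minimal sketch of how it is usually computed for binary masks, $\mathrm{Dice} = 2|P \cap T| / (|P| + |T|)$. The function name and the flat-list mask representation are assumptions made for illustration; the paper's own evaluation code may differ (for example, in how empty masks are handled).

```python
def dice_coefficient(pred, target):
    """Dice = 2|P ∩ T| / (|P| + |T|) for two binary masks given as flat 0/1 lists."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(p * t for p, t in zip(pred, target))
    denom = sum(pred) + sum(target)
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if denom == 0 else 2.0 * intersection / denom


pred = [1, 1, 0, 0, 1, 0]
target = [1, 0, 0, 0, 1, 1]
score = dice_coefficient(pred, target)  # 2 shared pixels, 3 + 3 foreground pixels
```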

2505.11166 2026-06-04 cs.CL cs.AI

SoLoPO: Unlocking Long-Context Capabilities in LLMs via Short-to-Long Preference Optimization

SoLoPO: 通过短到长偏好优化解锁大语言模型的长上下文能力

Huashan Sun, Shengyi Liao, Yansen Han, Yu Bai, Yang Gao, Cheng Fu, Weizhou Shen, Fanqi Wan, Ming Yan, Ji Zhang, Fei Huang

发表机构 * Tongyi Lab, Alibaba Group（通义实验室，阿里巴巴集团）

AI总结提出SoLoPO框架，将长上下文偏好优化解耦为短上下文偏好优化和短到长奖励对齐，以提升大语言模型的长上下文利用能力。

Comments Published as a conference paper at ICLR 2026

详情

AI中文摘要

尽管在扩展上下文大小的预训练方面取得了进展，但大语言模型（LLMs）在有效利用现实世界中的长上下文信息方面仍面临挑战，这主要是由于数据质量问题、训练效率低下以及缺乏设计良好的优化目标导致的长上下文对齐不足。为了解决这些限制，我们提出了一个名为Short-to-Long Preference Optimization（SoLoPO）的框架，将长上下文偏好优化（PO）解耦为两个组成部分：短上下文PO和短到长奖励对齐（SoLo-RA），并得到了理论和实验证据的支持。具体来说，短上下文PO利用从短上下文中采样的偏好对来增强模型的情境知识利用能力。同时，SoLo-RA明确鼓励在包含相同任务相关信息的短上下文和长上下文条件下，响应的奖励分数一致性。这有助于将模型处理短上下文的能力迁移到长上下文场景中。SoLoPO与主流的偏好优化算法兼容，同时显著提高了数据构建和训练过程的效率。实验结果表明，SoLoPO增强了所有这些算法在各种长上下文基准测试中的长度和领域泛化能力，同时在计算和内存效率方面取得了显著提升。

英文摘要

Despite advances in pretraining with extended context sizes, large language models (LLMs) still face challenges in effectively utilizing real-world long-context information, primarily due to insufficient long-context alignment caused by data quality issues, training inefficiencies, and the lack of well-designed optimization objectives. To address these limitations, we propose a framework named Short-to-Long Preference Optimization (SoLoPO), decoupling long-context preference optimization (PO) into two components: short-context PO and short-to-long reward alignment (SoLo-RA), supported by both theoretical and empirical evidence. Specifically, short-context PO leverages preference pairs sampled from short contexts to enhance the model's contextual knowledge utilization ability. Meanwhile, SoLo-RA explicitly encourages reward score consistency for the responses when conditioned on both short and long contexts that contain identical task-relevant information. This facilitates transferring the model's ability to handle short contexts into long-context scenarios. SoLoPO is compatible with mainstream preference optimization algorithms, while substantially improving the efficiency of data construction and training processes. Experimental results show that SoLoPO enhances all these algorithms with respect to stronger length and domain generalization abilities across various long-context benchmarks, while achieving notable improvements in both computational and memory efficiency.

1708.06233 2026-06-04 cs.AI cs.MA cs.SI econ.GN physics.soc-ph q-fin.EC

Fake News in Social Networks

社交网络中的虚假新闻

Christoph Aymanns, Jakob Foerster, Co-Pierre Georg, Matthias Weber

发表机构 * University of St. Gallen（圣加仑大学）； University of Oxford（牛津大学）； Frankfurt School of Finance and Management（法兰克福金融与管理学院）； Swiss Finance Institute（瑞士金融研究所）

AI总结本文提出多智能体强化学习作为建模社交网络中虚假新闻的新方法，发现针对连接度高、私有信息较弱的人群的攻击更有效，且信息分散传播比集中传播更有效，同时虚假新闻在平衡网络中的传播较弱，并通过人类实验验证了模型的适用性。

2510.03511 2026-06-04 cs.CV cs.AI cs.LG eess.IV

Platonic Transformers: A Solid Choice For Equivariance

柏拉图式Transformer：等变性的坚实选择

Mohammad Mohaiminul Islam, Rishabh Anand, David R. Wessels, Friso de Kruiff, Thijs P. Kuipers, Rex Ying, Clara I. Sánchez, Sharvaree Vadgama, Georg Bökman, Erik J. Bekkers

发表机构 * University of California, Berkeley（加州大学伯克利分校）

AI总结提出Platonic Transformer，通过基于柏拉图立体对称群参考帧的注意力机制实现等变性，在不增加计算成本的前提下提升性能。

详情

AI中文摘要

尽管Transformer广泛应用，但缺乏科学和计算机视觉中常见几何对称性的归纳偏置。现有的等变方法往往通过复杂、计算密集的设计牺牲了Transformer的高效性和灵活性。我们引入Platonic Transformer来解决这一权衡。通过将注意力定义为相对于柏拉图立体对称群参考帧，我们的方法引入了一种有原则的权重共享方案。这使得模型能够同时对连续平移和柏拉图对称性保持等变，同时保留标准Transformer的精确架构和计算成本。此外，我们证明这种注意力在形式上等价于动态群卷积，这表明模型学习自适应几何滤波器，并实现高度可扩展的线性时间卷积变体。在计算机视觉（CIFAR-10）、3D点云（ScanObjectNN）和分子性质预测（QM9、OMol25）等多个基准测试中，Platonic Transformer通过利用这些几何约束以零额外成本取得了有竞争力的性能。

英文摘要

While widespread, Transformers lack inductive biases for geometric symmetries common in science and computer vision. Existing equivariant methods often sacrifice the efficiency and flexibility that make Transformers so effective through complex, computationally intensive designs. We introduce the Platonic Transformer to resolve this trade-off. By defining attention relative to reference frames from the Platonic solid symmetry groups, our method induces a principled weight-sharing scheme. This enables combined equivariance to continuous translations and Platonic symmetries, while preserving the exact architecture and computational cost of a standard Transformer. Furthermore, we show that this attention is formally equivalent to a dynamic group convolution, which reveals that the model learns adaptive geometric filters and enables a highly scalable, linear-time convolutional variant. Across diverse benchmarks in computer vision (CIFAR-10), 3D point clouds (ScanObjectNN), and molecular property prediction (QM9, OMol25), the Platonic Transformer achieves competitive performance by leveraging these geometric constraints at no additional cost.

2510.01902 2026-06-04 cs.AI cs.CL cs.LG

Constrained Adaptive Rejection Sampling

约束自适应拒绝采样

Paweł Parys, Sairam Vaidya, Taylor Berg-Kirkpatrick, Loris D'Antoni

发表机构 * University of California, Berkeley（加州大学伯克利分校）

AI总结提出约束自适应拒绝采样（CARS），通过自适应剪枝无效前缀来提高拒绝采样的样本效率，同时保持无分布扭曲，在程序模糊测试和分子生成等任务中优于现有方法。

详情

AI中文摘要

语言模型（LMs）越来越多地应用于生成的输出必须满足严格语义或语法约束的场景。现有的约束生成方法处于一个谱系中：贪婪约束解码方法在解码过程中强制执行有效性，但扭曲了LM的分布；而拒绝采样（RS）保留了保真度，但通过丢弃无效输出浪费计算资源。在程序模糊测试等领域，样本的有效性和多样性都至关重要，这两种极端方法都有问题。我们提出约束自适应拒绝采样（CARS），一种严格提高RS样本效率且不产生分布扭曲的方法。CARS从无约束LM采样开始，通过将违反约束的续写记录在trie中并从后续抽取中减去其概率质量，自适应地排除它们。这种自适应剪枝确保已证明无效的前缀不会被重新访问，接受率单调提高，并且生成的样本精确遵循约束分布。在多个领域的实验（例如程序模糊测试和分子生成）中，CARS始终实现更高的效率（以每个有效样本的LM前向传递次数衡量），同时产生比GCD和近似LM分布的方法更强的样本多样性。

英文摘要

Language Models (LMs) are increasingly used in applications where generated outputs must satisfy strict semantic or syntactic constraints. Existing approaches to constrained generation fall along a spectrum: greedy constrained decoding methods enforce validity during decoding but distort the LM's distribution, while rejection sampling (RS) preserves fidelity but wastes computation by discarding invalid outputs. Both extremes are problematic in domains such as program fuzzing, where both validity and diversity of samples are essential. We present Constrained Adaptive Rejection Sampling (CARS), an approach that strictly improves the sample-efficiency of RS without distributional distortion. CARS begins with unconstrained LM sampling and adaptively rules out constraint-violating continuations by recording them in a trie and subtracting their probability mass from future draws. This adaptive pruning ensures that prefixes proven invalid are never revisited, acceptance rates improve monotonically, and the resulting samples exactly follow the constrained distribution. In experiments on a variety of domains (e.g., program fuzzing and molecular generation), CARS consistently achieves higher efficiency, measured in the number of LM forward passes per valid sample, while also producing stronger sample diversity than both GCD and methods that approximate the LM's distribution.
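
The pruning idea described above can be sketched on a toy problem. This is an illustration under stated assumptions, not the authors' implementation: `lm_probs` is a hypothetical stand-in for a language model, `violates` is a made-up constraint (no token may repeat consecutively), and a dictionary keyed by prefix plays the role of the trie.

```python
import random
from collections import defaultdict

VOCAB = ("a", "b", "c")
LENGTH = 3  # every sample is a sequence of exactly LENGTH tokens


def lm_probs(prefix):
    """Toy stand-in for an LM: next-token distribution given a prefix."""
    if not prefix:
        return {t: 1.0 / len(VOCAB) for t in VOCAB}
    # Illustrative bias: repeating the previous token is the most likely choice.
    return {t: (0.5 if t == prefix[-1] else 0.25) for t in VOCAB}


def violates(prefix):
    """Toy constraint check: a prefix is invalid once a token repeats in a row."""
    return len(prefix) >= 2 and prefix[-1] == prefix[-2]


def cars_sample(rng, removed):
    """Return (valid sequence, number of rejections).

    `removed` maps a prefix to the joint LM mass already proven invalid below it.
    """
    rejections = 0
    while True:
        prefix, mass = (), 1.0
        while len(prefix) < LENGTH and not violates(prefix):
            probs = lm_probs(prefix)
            # LM mass of each child minus the mass known to be invalid under it.
            weights = [max(mass * probs[t] - removed[prefix + (t,)], 0.0)
                       for t in VOCAB]
            token = rng.choices(VOCAB, weights=weights)[0]
            mass *= probs[token]
            prefix += (token,)
        if not violates(prefix):
            return prefix, rejections
        rejections += 1
        # Record the invalid prefix and propagate its mass to all ancestors.
        for i in range(1, len(prefix) + 1):
            removed[prefix[:i]] += mass


rng = random.Random(0)
removed = defaultdict(float)
samples = [cars_sample(rng, removed) for _ in range(5)]
```

Because each draw comes from the LM distribution conditioned on avoiding only prefixes already proven invalid, an accepted sample is still distributed as the LM conditioned on validity. Here an invalid prefix gets weight exactly zero after its first rejection, so the total number of rejections is bounded by the number of minimal invalid prefixes.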

2510.01532 2026-06-04 cs.CV

MATCH: Multi-faceted Adaptive Topo-Consistency for Semi-Supervised Histopathology Segmentation

MATCH: 面向半监督组织病理学分割的多面自适应拓扑一致性

Meilong Xu, Xiaoling Hu, Shahira Abousamra, Chen Li, Chao Chen

发表机构 * Stony Brook University（石溪大学）； Massachusetts General Hospital and Harvard Medical School（麻省总医院和哈佛医学院）； Department of Biomedical Data Science, Stanford University（斯坦福大学生物医学数据科学系）

AI总结提出一种半监督分割框架MATCH，通过随机丢弃和时间训练快照生成多种扰动预测，并强制拓扑一致性来识别和保留相关拓扑特征，引入结合空间重叠与全局结构对齐的匹配策略以减少预测差异，有效降低拓扑错误，提升分割鲁棒性和准确性。

Comments 20 pages, 6 figures. Accepted by NeurIPS 2025

详情

AI中文摘要

在半监督分割中，从无标签数据中捕获有意义的语义结构至关重要。这在组织病理学图像分析中尤其具有挑战性，因为物体分布密集。为了解决这个问题，我们提出了一个半监督分割框架，旨在稳健地识别和保留相关的拓扑特征。我们的方法利用通过随机丢弃和时间训练快照获得的多种扰动预测，强制这些不同输出之间的拓扑一致性。这种一致性机制有助于将生物学有意义的结构与瞬态和噪声伪影区分开来。这个过程的一个关键挑战是在没有真实标签的情况下准确匹配预测中对应的拓扑特征。为了克服这一点，我们引入了一种新颖的匹配策略，将空间重叠与全局结构对齐相结合，最小化预测之间的差异。大量实验表明，我们的方法有效减少了拓扑错误，从而产生更稳健和准确的分割，这对于可靠的下游分析至关重要。代码可在 https://github.com/Melon-Xu/MATCH 获取。

英文摘要

In semi-supervised segmentation, capturing meaningful semantic structures from unlabeled data is essential. This is particularly challenging in histopathology image analysis, where objects are densely distributed. To address this issue, we propose a semi-supervised segmentation framework designed to robustly identify and preserve relevant topological features. Our method leverages multiple perturbed predictions obtained through stochastic dropouts and temporal training snapshots, enforcing topological consistency across these varied outputs. This consistency mechanism helps distinguish biologically meaningful structures from transient and noisy artifacts. A key challenge in this process is to accurately match the corresponding topological features across the predictions in the absence of ground truth. To overcome this, we introduce a novel matching strategy that integrates spatial overlap with global structural alignment, minimizing discrepancies among predictions. Extensive experiments demonstrate that our approach effectively reduces topological errors, resulting in more robust and accurate segmentations essential for reliable downstream analysis. Code is available at https://github.com/Melon-Xu/MATCH.

2505.15497 2026-06-04 cs.LG cs.SY eess.SY

Certified Neural Approximations of Nonlinear Dynamics

非线性动力学的认证神经逼近

Frederik Baymler Mathiesen, Nikolaus Vertovec, Francesco Fabiano, Luca Laurenti, Alessandro Abate

发表机构 * Delft Center for Systems and Control（代尔夫特系统与控制中心）； Department of Computer Science, University of Oxford（牛津大学计算机科学系）； The Italian Institute of Artificial Intelligence (AI4I)（意大利人工智能研究所（AI4I））

AI总结提出一种基于认证一阶模型的自适应并行验证方法，为神经网络逼近非线性动力学提供形式化误差界，从而安全地用作替代模型，并在多个基准测试中显著优于现有方法。

Comments first and second author contributed equally

详情

AI中文摘要

神经网络作为非线性动力系统的近似模型具有巨大潜力，由此产生的神经逼近能够实现对此类系统的验证和控制。然而，在安全关键背景下，使用神经逼近需要对其与底层系统的接近程度有形式化界限。为了解决这一基本挑战，我们提出了一种新颖的、自适应的、可并行化的验证方法，基于认证的一阶模型。我们的方法为动力系统的神经逼近提供了形式化误差界，通过将误差界解释为作用于近似动力学的有界扰动，使得它们能够安全地用作替代模型。我们在文献中的一系列既定基准测试上展示了我们方法的有效性和可扩展性，表明它显著优于现有技术。此外，我们展示了我们的框架能够成功解决现有方法以前无法处理的额外场景——神经网络压缩和基于自编码器的深度学习架构，用于训练Koopman算子以进行轨迹预测。

英文摘要

Neural networks hold great potential to act as approximate models of nonlinear dynamical systems, with the resulting neural approximations enabling verification and control of such systems. However, in safety-critical contexts, the use of neural approximations requires formal bounds on their closeness to the underlying system. To address this fundamental challenge, we propose a novel, adaptive, and parallelizable verification method based on certified first-order models. Our approach provides formal error bounds on the neural approximations of dynamical systems, allowing them to be safely employed as surrogates by interpreting the error bound as bounded disturbances acting on the approximated dynamics. We demonstrate the effectiveness and scalability of our method on a range of established benchmarks from the literature, showing that it significantly outperforms the state of the art. Furthermore, we show that our framework can successfully address additional scenarios previously intractable for existing methods -- neural network compression and an autoencoder-based deep learning architecture for training Koopman operators for the purpose of trajectory prediction.

2509.22454 2026-06-04 cs.LG

Overclocking Electrostatic Generative Models

超频静电生成模型

Daniil Shlenskii, Alexander Korotin

发表机构 * University of California, Berkeley（加州大学伯克利分校）

AI总结提出逆泊松流匹配（IPFM）蒸馏框架，加速所有维度D下的静电生成模型，实现少步采样且质量接近甚至超越教师模型。

详情

AI中文摘要

诸如PFGM++等静电生成模型最近作为一种强大的框架出现，在图像合成中取得了竞争性能。PFGM++在具有辅助维度$D$的扩展数据空间中运行，当$D\to\infty$时恢复扩散模型框架，而在有限$D$下产生更优的经验结果。与扩散模型一样，PFGM++依赖昂贵的ODE模拟来生成样本，计算成本高。为解决此问题，我们提出逆泊松流匹配（IPFM），一个原则性的蒸馏框架，可加速所有$D$值下的静电生成模型。我们的IPFM将蒸馏重新表述为一个逆问题：学习一个生成器，其诱导的静电场与教师模型匹配。我们为该问题推导了一个可处理的训练目标，并表明当$D\to\infty$时，我们的IPFM紧密恢复分数恒等蒸馏（SiD），一种最近用于蒸馏扩散模型的方法。实验上，我们的IPFM生成的蒸馏生成器仅需少量函数评估即可达到接近教师甚至更优的样本质量。此外，我们发现单步生成器蒸馏在有限$D$下比在$D\to\infty$扩散极限下收敛更快，这与先前证据一致，即有限$D$的PFGM++模型提供更有利的优化和采样行为。

英文摘要

Electrostatic generative models such as PFGM++ have recently emerged as a powerful framework, achieving competitive performance in image synthesis. PFGM++ operates in an extended data space with auxiliary dimensionality $D$, recovering the diffusion model framework as $D\to\infty$, while yielding superior empirical results for finite $D$. Like diffusion models, PFGM++ relies on expensive ODE simulations to generate samples, making it computationally costly. To address this, we propose Inverse Poisson Flow Matching (IPFM), a principled distillation framework that accelerates electrostatic generative models across all values of $D$. Our IPFM reformulates distillation as an inverse problem: learning a generator whose induced electrostatic field matches that of the teacher. We derive a tractable training objective for this problem and show that, as $D\to\infty$, our IPFM closely recovers Score Identity Distillation (SiD), a recent method for distilling diffusion models. Empirically, our IPFM produces distilled generators that achieve near-teacher or even superior sample quality using only a few function evaluations. Moreover, we find that one-step generator distillation converges faster at finite $D$ than in the $D\to\infty$ diffusion limit, aligning with prior evidence that finite-$D$ PFGM++ models offer more favorable optimization and sampling behavior.

2509.15676 2026-06-04 cs.LG cs.AI cs.CL

KITE: Kernelized and Information Theoretic Exemplars for In-Context Learning

KITE: 基于核方法和信息论的上下文学习示例选择

Vaibhav Singh, Soumya Suvra Ghosal, Kapu Nirmal Joshua, Soumyabrata Pal, Sayak Ray Chowdhury

发表机构 * IIT Bombay（印度理工学院孟买分校）； UMD College Park（马里兰大学帕克分校）； IIT Kanpur（印度理工学院坎普尔分校）； Adobe Research（Adobe研究院）

AI总结针对上下文学习中的示例选择问题，提出一种基于信息论和核方法的贪心算法，通过最小化查询特定预测误差并引入多样性正则化，显著提升分类性能。

详情

AI中文摘要

上下文学习（ICL）已成为一种强大的范式，通过仅使用提示中精心选择的少量任务特定示例，使大型语言模型（LLM）适应新的、数据稀缺的任务。然而，鉴于LLM有限的上下文大小，一个基本问题出现了：应选择哪些示例以最大化给定用户查询的性能？虽然基于最近邻的方法（如KATE）已被广泛用于此目的，但它们在高维嵌入空间中存在众所周知的缺点，包括泛化能力差和缺乏多样性。在这项工作中，我们从原则性的、信息论驱动的角度研究ICL中的示例选择问题。我们首先将LLM建模为输入嵌入上的线性函数，并将示例选择任务框架化为一个查询特定的优化问题：从较大的示例库中选择一个子集，以最小化特定查询上的预测误差。这种表述通过针对特定查询实例的准确预测，偏离了传统的以泛化为中心的学习理论方法。我们推导出一个原则性的代理目标，该目标是近似子模的，从而能够使用具有近似保证的贪心算法。我们通过（i）引入核技巧以在高维特征空间中操作而无需显式映射，以及（ii）引入基于最优设计的正则化项以鼓励所选示例的多样性，进一步增强了我们的方法。实验上，我们在多个分类任务上展示了相对于标准检索方法的显著改进，突出了在真实世界、标签稀缺场景中，结构感知、多样化的示例选择对ICL的益处。

英文摘要

In-context learning (ICL) has emerged as a powerful paradigm for adapting large language models (LLMs) to new and data-scarce tasks using only a few carefully selected task-specific examples presented in the prompt. However, given the limited context size of LLMs, a fundamental question arises: Which examples should be selected to maximize performance on a given user query? While nearest-neighbor-based methods like KATE have been widely adopted for this purpose, they suffer from well-known drawbacks in high-dimensional embedding spaces, including poor generalization and a lack of diversity. In this work, we study this problem of example selection in ICL from a principled, information theory-driven perspective. We first model an LLM as a linear function over input embeddings and frame the example selection task as a query-specific optimization problem: selecting a subset of exemplars from a larger example bank that minimizes the prediction error on a specific query. This formulation departs from traditional generalization-focused learning theoretic approaches by targeting accurate prediction for a specific query instance. We derive a principled surrogate objective that is approximately submodular, enabling the use of a greedy algorithm with an approximation guarantee. We further enhance our method by (i) incorporating the kernel trick to operate in high-dimensional feature spaces without explicit mappings, and (ii) introducing an optimal design-based regularizer to encourage diversity in the selected examples. Empirically, we demonstrate significant improvements over standard retrieval methods across a suite of classification tasks, highlighting the benefits of structure-aware, diverse example selection for ICL in real-world, label-scarce scenarios.

2502.06301 2026-06-04 cs.LG cs.NE

Utilizing Novelty-based Evolution Strategies to Train Transformers in Reinforcement Learning

利用基于新颖性的进化策略训练强化学习中的Transformer

Matyáš Lorenc, Roman Neruda

发表机构 * Faculty of Mathematics and Physics, Charles University（数学与物理学院，查理大学）； Institute of Computer Science, Czech Academy of Sciences（计算机科学研究所，捷克科学院）

AI总结本研究实验了基于新颖性的进化策略变体（NS-ES和NSR-ES），评估其在训练强化学习中的Transformer架构（如Decision Transformer）的效果，并探索预训练模型加速训练的可能性。

详情

DOI: 10.1109/ICTAI66417.2025.00116
Journal ref: 2025 IEEE 37th International Conference on Tools with Artificial Intelligence (ICTAI), Athens, Greece, 2025, pp. 801-805

AI中文摘要

在本文中，我们实验了OpenAI-ES的基于新颖性的变体，即NS-ES和NSR-ES算法，并评估了它们在训练面向强化学习问题的复杂Transformer架构（如Decision Transformer）时的有效性。我们还测试了能否通过以预训练模型作为训练起点来加速这些较大模型的基于新颖性的训练。实验结果喜忧参半。NS-ES显示出进展，但显然还需要多得多的迭代才能产生有趣的智能体。另一方面，NSR-ES被证明可以直接用于较大模型，因为它在前馈模型和Decision Transformer上的性能相近，这与我们之前工作中OpenAI-ES的情况一致。

英文摘要

In this paper, we experiment with novelty-based variants of OpenAI-ES, the NS-ES and NSR-ES algorithms, and evaluate their effectiveness in training complex, transformer-based architectures designed for reinforcement learning problems, such as Decision Transformers. We also test whether we can accelerate the novelty-based training of these larger models by seeding the training with pretrained models. The experimental results were mixed. NS-ES showed progress, but it would clearly need many more iterations to yield interesting agents. NSR-ES, on the other hand, proved quite capable of being used straightforwardly on larger models, since its performance is similar between the feed-forward model and the Decision Transformer, as was the case for OpenAI-ES in our previous work.
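
As background for the two algorithm names, here is a minimal sketch of the quantities they are usually built on: novelty as the mean distance to the k nearest behaviours in an archive, and an equal-weight blend of reward and novelty for NSR-ES. The function names, the tuple-based behaviour descriptors, and the value of `k` are assumptions for illustration. Published formulations typically normalise both terms before blending, a step omitted here.

```python
import math


def novelty(behavior, archive, k=3):
    """Mean Euclidean distance from `behavior` to its k nearest archive entries."""
    if not archive:
        return 0.0
    dists = sorted(math.dist(behavior, other) for other in archive)
    nearest = dists[:k]
    return sum(nearest) / len(nearest)


def nsr_score(reward, novelty_value):
    """NSR-ES-style blend: equal-weight average of reward and novelty."""
    return 0.5 * (reward + novelty_value)


# Hypothetical archive of 2-D behaviour descriptors (e.g. final positions).
archive = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
far_from_archive = novelty((10.0, 10.0), archive, k=2)
close_to_archive = novelty((0.1, 0.1), archive, k=2)
```

NS-ES would follow the novelty signal alone, while NSR-ES follows the blended score, so a behaviour far from the archive is favoured in both cases but only NSR-ES also rewards task return.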

2509.10247 2026-06-04 cs.RO cs.AI

DiffAero: A GPU-Accelerated Differentiable Simulation Framework for Efficient Quadrotor Policy Learning

DiffAero: 一种用于高效四旋翼策略学习的GPU加速可微分仿真框架

Xinhong Zhang, Runqing Wang, Yunfan Ren, Jian Sun, Hao Fang, Jie Chen, Gang Wang

发表机构 * State Key Lab of Autonomous Intelligent Unmanned Systems, Beijing Institute of Technology（自主智能无人系统国家重点实验室，北京理工大学）； Zhongguancun Academy（中关村学院）； Department of Mechanical Engineering, University of Hong Kong（香港大学机械工程系）； Harbin Institute of Technology（哈尔滨工业大学）

AI总结提出DiffAero，一种轻量级、GPU加速且完全可微的仿真框架，通过并行化物理与渲染实现高效四旋翼控制策略学习，并在消费级硬件上数小时内训练出鲁棒策略。

Comments 8 pages, 11 figures, 1 table

详情

AI中文摘要

本文介绍了DiffAero，一种轻量级、GPU加速且完全可微的仿真框架，专为高效的四旋翼控制策略学习而设计。DiffAero支持环境级和智能体级并行，并在统一的GPU原生训练接口中集成了多种动力学模型、可定制的传感器堆栈（IMU、深度相机和LiDAR）以及多样化的飞行任务。通过在GPU上完全并行化物理和渲染，DiffAero消除了CPU-GPU数据传输瓶颈，并在仿真吞吐量上实现了数量级的提升。与现有仿真器相比，DiffAero不仅提供高性能仿真，还作为探索可微和混合学习算法的研究平台。广泛的基准测试和真实世界飞行实验表明，DiffAero与混合学习算法相结合，可以在消费级硬件上数小时内学习到鲁棒的飞行策略。代码可在https://github.com/flyingbitac/diffaero获取。

英文摘要

This letter introduces DiffAero, a lightweight, GPU-accelerated, and fully differentiable simulation framework designed for efficient quadrotor control policy learning. DiffAero supports both environment-level and agent-level parallelism and integrates multiple dynamics models, customizable sensor stacks (IMU, depth camera, and LiDAR), and diverse flight tasks within a unified, GPU-native training interface. By fully parallelizing both physics and rendering on the GPU, DiffAero eliminates CPU-GPU data transfer bottlenecks and delivers orders-of-magnitude improvements in simulation throughput. In contrast to existing simulators, DiffAero not only provides high-performance simulation but also serves as a research platform for exploring differentiable and hybrid learning algorithms. Extensive benchmarks and real-world flight experiments demonstrate that DiffAero and hybrid learning algorithms combined can learn robust flight policies in hours on consumer-grade hardware. The code is available at https://github.com/flyingbitac/diffaero.

2509.08846 2026-06-04 cs.LG cs.AI stat.ML

Uncertainty Estimation using Variance-Gated Distributions

使用方差门控分布的不确定性估计

H. Martin Gillis, Isaac Xu, Thomas Trappenberg

发表机构 * Faculty of Computer Science, Dalhousie University（达尔豪斯大学计算机科学学院）

AI总结提出基于类概率分布信噪比的方差门控不确定性估计框架，通过集成置信因子缩放预测，解决神经网络预测不确定性分解中的加性分解问题。

Comments NeurIPS Workshop: Mathematical Foundations and Operational Integration of Machine Learning for Uncertainty-Aware Decision-Making

2509.07963 2026-06-04 cs.LG

Customizing the Inductive Biases of Softmax Attention using Structured Matrices

使用结构化矩阵定制Softmax注意力的归纳偏置

Yilun Kuang, Noah Amsel, Sanae Lotfi, Shikai Qiu, Andres Potapczynski, Andrew Gordon Wilson

发表机构 * University of Cambridge（剑桥大学）

AI总结针对标准注意力机制在低维投影信息损失和缺乏距离依赖偏置的问题，提出基于块张量列（BTT）和连续多级低秩（MLR）结构化矩阵的高秩评分函数，在上下文回归、语言建模和长程时间序列预测中提升性能。

Comments ICML 2025. Code available at https://github.com/YilunKuang/structured-attention

详情

AI中文摘要

注意力机制的核心组件是评分函数，它将输入转换为低维查询和键，并计算每对向量的点积。虽然低维投影提高了效率，但对于某些具有本质高维输入的任务，它会导致信息损失。此外，注意力对所有输入对使用相同的评分函数，而没有对序列中相邻标记施加距离相关的计算偏置。在这项工作中，我们通过提出基于计算高效的高秩结构化矩阵（包括块张量列（BTT）和连续多级低秩（MLR）矩阵）的新评分函数来解决这些缺陷。在高维输入的上下文回归任务中，我们提出的评分函数在任意固定计算预算下均优于标准注意力。在语言建模（一种表现出局部性模式的任务）中，基于MLR的注意力方法相比标准注意力和滑动窗口注意力的变体实现了改进的扩展定律。此外，我们表明BTT和MLR都属于更广泛的高效结构化矩阵家族，能够编码全秩或距离依赖的计算偏置，从而解决了标准注意力的显著缺陷。最后，我们展示了MLR注意力在长程时间序列预测中具有令人期待的结果。

英文摘要

The core component of attention is the scoring function, which transforms the inputs into low-dimensional queries and keys and takes the dot product of each pair. While the low-dimensional projection improves efficiency, it causes information loss for certain tasks that have intrinsically high-dimensional inputs. Additionally, attention uses the same scoring function for all input pairs, without imposing a distance-dependent compute bias for neighboring tokens in the sequence. In this work, we address these shortcomings by proposing new scoring functions based on computationally efficient structured matrices with high ranks, including Block Tensor-Train (BTT) and contiguous Multi-Level Low Rank (MLR) matrices. On in-context regression tasks with high-dimensional inputs, our proposed scoring functions outperform standard attention for any fixed compute budget. On language modeling, a task that exhibits locality patterns, our MLR-based attention method achieves improved scaling laws compared to both standard attention and variants of sliding window attention. Additionally, we show that both BTT and MLR fall under a broader family of efficient structured matrices capable of encoding either full-rank or distance-dependent compute biases, thereby addressing significant shortcomings of standard attention. Finally, we show that MLR attention has promising results for long-range time-series forecasting.
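
As a reading aid for the low-rank point above, the standard score can be written out explicitly. The symbols are assumptions introduced here and do not appear in the abstract: inputs $x_i, x_j \in \mathbb{R}^d$ and projections $W_Q, W_K \in \mathbb{R}^{d \times r}$ with head dimension $r < d$. The usual $1/\sqrt{r}$ scaling is omitted.

$$
s_{ij} = (W_Q^\top x_i)^\top (W_K^\top x_j) = x_i^\top \left( W_Q W_K^\top \right) x_j,
\qquad \operatorname{rank}\left( W_Q W_K^\top \right) \le r < d .
$$

The score is therefore a bilinear form whose $d \times d$ matrix has rank at most $r$, which is where information can be lost for intrinsically high-dimensional inputs. On this reading, the structured matrices named in the abstract (BTT and MLR) replace $W_Q W_K^\top$ with a higher-rank matrix that remains cheap to apply.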

2508.01815 2026-06-04 cs.CL cs.AI

From Graph Retrieval to Schema Realization: Counterfactual Validation for Text-to-SPARQL over Heterogeneous Knowledge Graphs

从图检索到模式实现：面向异构知识图谱的文本到SPARQL的反事实验证

Chengxiao Dai, Yue Xiu, Dusit Niyato

发表机构 * University of Bristol（布里斯托大学）

AI总结提出SchemaForge框架，通过问题条件化的模式切片对齐和反事实验证，在异构知识图谱上提升文本到SPARQL查询生成的执行准确率。

详情

AI中文摘要

文本到SPARQL将自然语言问题映射为RDF知识图谱上的可执行SPARQL查询。标准评估通常预先固定目标图，但实际知识图谱问答（KGQA）可能涉及具有不同模式、部分对齐和不完整元数据的异构图集合。在此设置下，查询生成不仅依赖于SPARQL语法：系统必须识别能够支持问题所需的谓词、实体类型、连接、过滤器和约束的图模式。我们提出SchemaForge，一个面向异构KG集合的文本到SPARQL的基于模式的智能体框架。其核心机制是问题条件化的模式切片对齐：弱图证据首先识别可能的图，而更强的模式证据确定局部模式切片能否实现预期查询。选定的模式切片随后在执行前约束查询生成和验证。当仅有一个图可用时，该公式简化为带有模式基础的标准单KG文本到SPARQL。我们在LC-QuAD 2.0、QALD-9 Plus、QALD-10和Spider4SPARQL上评估SchemaForge。在四个公开基准上，SchemaForge相比最强匹配的智能体基线平均提高执行准确率11.50个百分点。在Spider4SPARQL上，SchemaForge将执行准确率从54.86%提升至64.18%，并达到73.0%的Top-1和97.0%的Top-3图分配准确率。这些结果表明，从弱图证据转向模式特定的查询承诺，结合反事实答案集检查，改进了异构知识图谱上的可执行查询生成。

英文摘要

Text-to-SPARQL maps natural-language questions to executable SPARQL queries over RDF knowledge graphs. While standard evaluations often fix the target graph in advance, practical knowledge graph question answering (KGQA) may involve heterogeneous graph collections with different schemas, partial alignments, and incomplete metadata. In this setting, query generation depends on more than SPARQL syntax: the system must identify a graph schema that can support the predicates, entity types, joins, filters, and constraints required by the question. We present SchemaForge, a schema-grounded agentic framework for text-to-SPARQL over heterogeneous KG collections. Its central mechanism is question-conditioned schema-slice alignment: weak graph evidence first identifies plausible graphs, while stronger schema evidence determines whether a local schema slice can realize the intended query. The selected schema slice then constrains query generation and verification before execution. When only one graph is available, the same formulation reduces to standard single-KG text-to-SPARQL with schema grounding. We evaluate SchemaForge on LC-QuAD 2.0, QALD-9 Plus, QALD-10, and Spider4SPARQL. Across the four public benchmarks, SchemaForge improves execution accuracy over the strongest matched agent baseline by 11.50 percentage points on average. On Spider4SPARQL, SchemaForge improves execution accuracy from 54.86% to 64.18% and achieves 73.0% Top-1 and 97.0% Top-3 graph allocation accuracy. These results show that moving from weak graph evidence to schema-specific query commitments, together with counterfactual answer-set checks, improves executable query generation over heterogeneous knowledge graphs.

2507.21892 2026-06-04 cs.CL

Graph-R1: Towards Agentic GraphRAG Framework via End-to-end Reinforcement Learning

Graph-R1：通过端到端强化学习实现智能图RAG框架

Haoran Luo, Haihong E, Guanting Chen, Qika Lin, Yikai Guo, Fangzhi Xu, Zemin Kuang, Meina Song, Xiaobao Wu, Yifan Zhu, Luu Anh Tuan

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结提出Graph-R1，首个通过端到端强化学习的智能图RAG框架，采用轻量知识超图构建、多轮智能体-环境交互检索和端到端奖励机制，在推理准确性、检索效率和生成质量上优于传统图RAG和强化学习增强RAG方法。

Comments Accepted by ICML 2026 main conference

详情

Journal ref: ICML 2026

AI中文摘要

检索增强生成（RAG）通过引入外部知识减轻大语言模型中的幻觉，但依赖于缺乏结构语义的基于块的检索。图RAG方法通过将知识建模为实体-关系图来改进RAG，但仍面临构建成本高、固定一次性检索以及依赖长上下文推理和提示设计等挑战。为解决这些问题，我们提出Graph-R1，首个通过端到端强化学习（RL）的智能图RAG框架。它引入了轻量知识超图构建，将检索建模为多轮智能体-环境交互，并通过端到端奖励机制优化智能体过程。在标准RAG数据集上的实验表明，Graph-R1在推理准确性、检索效率和生成质量上优于传统图RAG和强化学习增强的RAG方法。我们的软件和数据公开在https://github.com/LHRLAB/Graph-R1。

英文摘要

Retrieval-Augmented Generation (RAG) mitigates hallucination in LLMs by incorporating external knowledge, but relies on chunk-based retrieval that lacks structural semantics. GraphRAG methods improve RAG by modeling knowledge as entity-relation graphs, but still face challenges in high construction cost, fixed one-time retrieval, and reliance on long-context reasoning and prompt design. To address these challenges, we propose Graph-R1, the first agentic GraphRAG framework via end-to-end reinforcement learning (RL). It introduces lightweight knowledge hypergraph construction, models retrieval as a multi-turn agent-environment interaction, and optimizes the agent process via an end-to-end reward mechanism. Experiments on standard RAG datasets show that Graph-R1 outperforms traditional GraphRAG and RL-enhanced RAG methods in reasoning accuracy, retrieval efficiency, and generation quality. Our software and data are publicly available at https://github.com/LHRLAB/Graph-R1.

2507.21638 2026-06-04 cs.AI cs.LG cs.MA cs.RO

Assistax: A Multi-Agent Hardware-Accelerated Reinforcement Learning Benchmark for Assistive Robotics

Assistax: 一个用于辅助机器人的多智能体硬件加速强化学习基准

Leonard Hinckeldey, Elliot Fosong, Rimvydas Rubavicius, Elle Miller, Trevor McInroe, Fan Zhang, Patricia Wollstadt, Stefano V. Albrecht, Subramanian Ramamoorthy

发表机构 * University of California, Berkeley（加州大学伯克利分校）； Stanford University（斯坦福大学）

AI总结提出Assistax基准，利用JAX硬件加速和基于多智能体强化学习的辅助机器人任务，实现高达370倍加速，并测试机器人的零样本协调能力。

Comments Accepted at the Reinforcement Learning Conference 2026

详情

AI中文摘要

强化学习（RL）算法的发展在很大程度上受到具有挑战性的任务和基准的推动。游戏在RL基准中占据主导地位，因为它们呈现了相关的挑战，运行成本低且易于理解。虽然围棋和Atari等游戏带来了许多突破，但它们通常不能直接转化为现实世界的具身应用。在认识到需要多样化RL基准并解决具身交互场景中出现的复杂性的情况下，我们引入了Assistax：一个旨在解决辅助机器人任务中出现的挑战的开源基准。Assistax利用JAX的硬件加速，在基于物理的模拟中实现显著的学习加速。在开环挂钟时间方面，Assistax在向量化训练运行时比基于CPU的替代方案快高达370倍。Assistax使用多智能体RL将辅助机器人与活跃的人类患者之间的交互概念化，以训练一群多样化的伙伴智能体，从而可以测试具身机器人智能体的零样本协调能力。对流行的连续控制RL和MARL算法进行的广泛评估和超参数调优提供了可靠的基线，并将Assistax确立为推进辅助机器人RL研究的实用基准。代码可在以下网址获取：https://github.com/assistive-autonomy/assistax。

英文摘要

The development of reinforcement learning (RL) algorithms has been largely driven by ambitious challenge tasks and benchmarks. Games have dominated RL benchmarks because they present relevant challenges, are inexpensive to run and easy to understand. While games such as Go and Atari have led to many breakthroughs, they often do not directly translate to real-world embodied applications. In recognising the need to diversify RL benchmarks and addressing complexities that arise in embodied interaction scenarios, we introduce Assistax: an open-source benchmark designed to address challenges arising in assistive robotics tasks. Assistax uses JAX's hardware acceleration for significant speed-ups for learning in physics-based simulations. In terms of open-loop wall-clock time, Assistax runs up to $370\times$ faster when vectorising training runs compared to CPU-based alternatives. Assistax conceptualises the interaction between an assistive robot and an active human patient using multi-agent RL to train a population of diverse partner agents against which an embodied robotic agent's zero-shot coordination capabilities can be tested. Extensive evaluation and hyperparameter tuning for popular continuous control RL and MARL algorithms provide reliable baselines and establish Assistax as a practical benchmark for advancing RL research for assistive robotics. The code is available at: https://github.com/assistive-autonomy/assistax.
