arXivDaily: arXiv daily academic digest, updated Monday through Friday
2604.04554 2026-06-09 cs.CV cs.RO (version update)

Relational Epipolar Graphs for Robust Relative Camera Pose Estimation

Prateeth Rao, Sachit Rao

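Affiliations: International Institute of Information Technology

AI summary: This paper proposes a relational inference method over epipolar graphs for estimating relative camera pose. Graph operations estimate the rotation, translation, and essential matrix, improving robustness to dense noise and large baseline variation.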

Comments 21 pages, 11 figures, 11 Tables, Submitted to IJCV

Abstract

A key component of Visual Simultaneous Localization and Mapping (VSLAM) is estimating relative camera poses using matched keypoints. Accurate estimation is challenged by noisy correspondences. Classical methods rely on stochastic hypothesis sampling and iterative estimation, while learning-based methods often lack explicit geometric structure. In this work, we reformulate relative pose estimation as a relational inference problem over epipolar correspondence graphs, where matched keypoints are nodes and nearby ones are connected by edges. Graph operations such as pruning, message passing, and pooling estimate a quaternion rotation, translation vector, and the Essential Matrix (EM). Minimizing a loss comprising (i) $\mathcal{L}_2$ differences with ground truth (GT), (ii) Frobenius norm between estimated and GT EMs, (iii) singular value differences, (iv) heading angle differences, and (v) scale differences, yields the relative pose between image pairs. The dense detector-free method LoFTR is used for matching. Experiments on indoor and outdoor benchmarks show improved robustness to dense noise and large baseline variation compared to classical and learning-guided approaches, highlighting the effectiveness of global relational consensus.
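As an illustration of the quantities this abstract names, the following minimal sketch builds an essential matrix from a quaternion rotation and a translation vector and evaluates the epipolar residual for one correspondence. It assumes the common convention X2 = R X1 + t with E = [t]x R and normalized image coordinates; the function names are hypothetical and this is not the authors' graph-based estimator.

```python
import math

def quat_to_rot(w, x, y, z):
    """Convert a quaternion (w, x, y, z) to a 3x3 rotation matrix."""
    n = math.sqrt(w * w + x * x + y * y + z * z)
    w, x, y, z = w / n, x / n, y / n, z / n
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - z * w), 2 * (x * z + y * w)],
        [2 * (x * y + z * w), 1 - 2 * (x * x + z * z), 2 * (y * z - x * w)],
        [2 * (x * z - y * w), 2 * (y * z + x * w), 1 - 2 * (x * x + y * y)],
    ]

def skew(t):
    """Cross-product matrix [t]x, so that skew(t) applied to v equals t x v."""
    tx, ty, tz = t
    return [[0.0, -tz, ty], [tz, 0.0, -tx], [-ty, tx, 0.0]]

def matmul(a, b):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def matvec(a, v):
    """Product of a 3x3 matrix and a 3-vector."""
    return [sum(a[i][k] * v[k] for k in range(3)) for i in range(3)]

def essential_matrix(q, t):
    """E = [t]x R under the assumed convention X2 = R X1 + t."""
    return matmul(skew(t), quat_to_rot(*q))

def epipolar_residual(E, x1, x2):
    """x2^T E x1 for homogeneous normalized image points; zero for a perfect match."""
    Ex1 = matvec(E, x1)
    return sum(x2[i] * Ex1[i] for i in range(3))
```

A noise-free correspondence gives a residual of zero up to rounding, while a perturbed match does not, which is the kind of signal a robust estimator has to separate from noise.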

2604.04251 2026-06-09 cs.AI cs.CY cs.LG (version update)

MC-CPO: Mastery-Conditioned Constrained Policy Optimization for Pedagogically Safe Intelligent Tutoring Systems

Oluseyi Olukola, Nick Rahimi

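Affiliations: School of Computing Sciences and Computer Engineering, University of Southern Mississippi, Hattiesburg, MS 39406, USA

AI summary: This paper proposes the MC-CPO framework, which addresses pedagogical safety through structural constraints and improves learners' mastery gains; experiments on two platforms show significant gains.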

Comments 35 pages, 8 figures. v2: Major revision adding real-world validation on Junyi Academy (16.2M interactions, 72,758 students) and XES3G5M (NeurIPS 2023, 5.1M interactions, 14,453 students). Revised title and abstract. Submitted to Computers and Education: Artificial Intelligence

Abstract

Intelligent tutoring systems increasingly rely on reinforcement learning to personalise instruction, yet optimising for observable engagement signals can systematically decouple learner activity from genuine knowledge acquisition. Analysing over 21 million student interactions across two deployed platforms, we find engagement events without corresponding mastery gains occur in 26.5% of interactions on Junyi Academy (72,758 students) and 3.1% on XES3G5M (14,453 students, NeurIPS 2023), confirming this pattern is directly observable in deployed educational technology at scale. We introduce Mastery-Conditioned Constrained Policy Optimisation (MC-CPO), a reinforcement learning framework that addresses this problem structurally. MC-CPO conditions the admissible instructional action space on learner mastery state: a concept becomes available only when prerequisite knowledge meets a mastery threshold, yielding an action space that expands naturally as learners acquire knowledge. Pedagogical safety constraints are enforced by construction, with formal guarantees of structural prerequisite safety, primal-dual convergence, and strict dominance over post-hoc filtering. MC-CPO is the only method to reduce reward hacking severity across all conditions. Mean per-episode mastery gain increases by 18.3% on Junyi Academy and 54.0% on XES3G5M relative to all baselines, while competitive engagement performance is maintained. These results support structural constraint modelling as a principled foundation for safer adaptive instructional policies in deployed tutoring systems.
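The mastery-conditioned action space described in this abstract can be pictured with a small sketch. It assumes a prerequisite map from each concept to the concepts it depends on and a single scalar threshold; the function name, the 0.8 threshold, and the example concepts are hypothetical and not taken from the paper.

```python
def admissible_concepts(mastery, prerequisites, threshold=0.8):
    """Return the concepts whose prerequisites all meet the mastery threshold.

    mastery: dict mapping concept -> estimated mastery in [0, 1]
    prerequisites: dict mapping concept -> list of prerequisite concepts
    threshold: hypothetical mastery threshold (an assumption for this sketch)
    """
    return sorted(
        concept
        for concept, prereqs in prerequisites.items()
        # A concept with no prerequisites is always admissible.
        if all(mastery.get(p, 0.0) >= threshold for p in prereqs)
    )

# Hypothetical three-concept curriculum.
prerequisites = {
    "counting": [],
    "addition": ["counting"],
    "multiplication": ["addition"],
}
print(admissible_concepts({"counting": 0.9, "addition": 0.4}, prerequisites))
print(admissible_concepts({"counting": 0.9, "addition": 0.85}, prerequisites))
```

Because the mask is computed before any action is chosen, an inadmissible concept can never be selected, which differs from filtering a policy's choices after the fact.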

2604.02056 2026-06-09 cs.CV (version update)

COMPASS: Complete Multimodal Fusion via Proxy Tokens and Shared Spaces for Ubiquitous Sensing

Hao Wang, Yanyu Qian, Pengcheng Weng, Zixuan Xia, William Dan, Yangxin Xu, Fei Wang

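Affiliations: Universität Bern; Xi'an Jiaotong University; Nanyang Technological University

AI summary: COMPASS fuses modalities through proxy tokens and shared spaces, addressing the information loss and fusion-interface mismatch caused by missing modalities and improving the robustness of multimodal sensing.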

Abstract

Missing modalities in multimodal sensing cause not only information loss but also a fusion-interface mismatch: a fusion head trained on a canonical set of modality slots must operate on changing observed subsets at inference time. We propose Compass, an interface-complete fusion framework that restores this canonical slot structure before prediction. Each modality is assigned a fixed fusion slot. Observed modalities populate their slots with real representations, while absent modalities are filled with target-slot completion representations estimated from the observed sources. Multiple source-specific estimates for the same missing slot are aggregated into a single slot filler, allowing the same lightweight fusion operator to be applied under arbitrary missing-modality patterns. Training uses synthetic modality masking, slot-compatibility supervision, and representation-space stabilization to make completed slots compatible with real modality representations and useful for downstream recognition. Across XRF55, MM-Fi, and OctoNet, Compass improves robustness under diverse single- and multiple-missing settings, including controlled comparisons against imputation, distillation, and translation-style baselines. These results suggest that preserving the fusion interface is a simple and effective principle for robust multimodal sensing.

2604.01609 2026-06-09 cs.CL (version update)

Swift-SVD: Theoretical Optimality Meets Practical Efficiency in Low-Rank LLM Compression

Ruoling Qi, Yirui Liu, Xuaner Wu, Xiangyu Wang, Ming Li, Chen Chen, Jian Chen, Yin Chen, Qizhen Weng

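Affiliations: Nanyang Technological University

AI summary: This paper proposes Swift-SVD, an activation-aware, closed-form compression framework that guarantees theoretical optimality while improving practical efficiency and numerical stability; experiments show clear advantages in compression accuracy and end-to-end compression time.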

Comments Accepted to ICML 2026

Abstract

The deployment of Large Language Models is constrained by the memory and bandwidth demands of static weights and dynamic Key-Value cache. SVD-based compression provides a hardware-friendly solution to reduce these costs. However, existing methods suffer from two key limitations: some are suboptimal in reconstruction error, while others are theoretically optimal but practically inefficient. In this paper, we propose Swift-SVD, an activation-aware, closed-form compression framework that simultaneously guarantees theoretical optimum, practical efficiency and numerical stability. Swift-SVD incrementally aggregates covariance of output activations given a batch of inputs and performs a single eigenvalue decomposition after aggregation, enabling training-free, fast, and optimal layer-wise low-rank approximation. We employ effective rank to analyze local layer-wise compressibility and design a dynamic rank allocation strategy that jointly accounts for local reconstruction loss and end-to-end layer importance. Extensive experiments across six LLMs and eight datasets demonstrate that Swift-SVD outperforms state-of-the-art baselines, achieving optimal compression accuracy while delivering 3-70X speedups in end-to-end compression time. Our code is available at https://github.com/hiahei/Swift-SVD.

2604.00903 2026-06-09 cs.CV (version update)

IDDM: Identity-Decoupled Personalized Diffusion Models with a Tunable Privacy-Utility Trade-off

Linyan Dai, Xinwei Zhang, Haoyang Li, Qingqing Ye, Haibo Hu

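Affiliations: The Hong Kong Polytechnic University

AI summary: This paper proposes IDDM, which integrates identity decoupling into the personalization pipeline to support authorized personalization while reducing the identity linkability of public generations, with a tunable privacy-utility trade-off.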

Abstract

Personalized text-to-image diffusion models (e.g., DreamBooth, LoRA) enable users to synthesize high-fidelity avatars from a few reference photos for social expression. However, once these generations are shared on social media platforms (e.g., Instagram, Facebook), they can be linked to the real user via face recognition systems, enabling identity tracking and profiling. Existing defenses mainly follow an anti-personalization strategy that protects publicly released reference photos by disrupting model fine-tuning. While effective against unauthorized personalization, they do not address another practical setting in which personalization is authorized, but the resulting public outputs still leak identity information. To address this problem, we introduce a new defense setting, termed model-side output immunization, whose goal is to produce a personalized model that supports authorized personalization while reducing the identity linkability of public generations, with tunable control over the privacy-utility trade-off to accommodate diverse privacy needs. To this end, we propose Identity-Decoupled personalized Diffusion Models (IDDM), a model-side defense that integrates identity decoupling into the personalization pipeline. Concretely, IDDM follows an alternating procedure that interleaves short personalization updates with identity-decoupled data optimization, using a two-stage schedule to balance identity linkability suppression and generation utility. Extensive experiments across multiple datasets, diverse prompts, and state-of-the-art face recognition systems show that IDDM consistently reduces identity linkability while preserving high-quality personalized generation.

2603.13431 2026-06-09 cs.LG cs.AI (version update)

CHIMERA-Bench: A Benchmark Dataset for Epitope-Specific Antibody Design

Mansoor Ahmed, Nadeem Taj, Imdad Ullah Khan, Hemanth Venkateswara, Murray Patterson

Affiliations: Georgia State University; Georgia Institute of Technology; University of Engineering and Technology; Lahore University of Management Sciences

AI summary: This paper presents CHIMERA-Bench, a unified antibody design benchmark with 2,922 antibody-antigen complexes that tests generalization and evaluates a range of generative methods.

Abstract

Computational antibody design has seen rapid methodological progress, with dozens of deep generative methods proposed in the past three years, yet the field lacks a standardized benchmark for fair comparison and model development. These methods are evaluated on different SAbDab snapshots, non-overlapping test sets, and incompatible metrics, and the literature fragments the design problem into numerous sub-tasks with no common definition. We introduce CHIMERA-Bench: (CDR Modeling with Epitope-guided Redesign), a unified benchmark built around a single canonical task: epitope-conditioned CDR sequence-structure co-design. CHIMERA-Bench provides three components. The first is a curated, deduplicated dataset of 2,922 antibody-antigen complexes with epitope and paratope annotations. The second is a set of three biologically motivated splits that test generalization to unseen epitopes, unseen antigen folds, and prospective temporal targets. The third is a comprehensive evaluation protocol with five metric groups, including novel epitope-specificity measures. We benchmark eleven methods spanning six generative paradigms and report results across all splits. CHIMERA-Bench is the largest dataset of its kind for the antibody design problem, allowing the community to develop and test novel methods and evaluate their generalizability.

2603.29824 2026-06-09 cs.LG (version update)

Curvature-Guided LoRA: Matching Full Fine-Tuning in Function Space

Frédéric Zheng, Alexandre Proutière

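Affiliations: KTH Royal Institute of Technology

AI summary: This paper proposes Curvature-Guided LoRA, which takes a function-space view to align LoRA outputs with those of full fine-tuning and uses a curvature-aware second-order formulation to improve fine-tuning efficiency and performance.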

Comments Preprint

Abstract

Parameter-efficient fine-tuning methods such as LoRA enable efficient adaptation of large pretrained models, but often lag behind full fine-tuning in both convergence speed and final performance. Recent approaches aim to reduce this gap by aligning LoRA parameter updates with those of full fine-tuning, but such parameter-space alignment only indirectly controls model predictions. Instead, we adopt a function-space perspective and formulate the prediction alignment problem, whose objective is to match the outputs of LoRA fine-tuning to those of full fine-tuning. We show that this objective naturally leads to a curvature-aware, second-order formulation, where optimal low-rank updates correspond to a Newton-like, curvature-whitened gradient. Based on this insight, we propose Curvature-Guided LoRA (CG-LoRA), an algorithm that selects adaptation directions using local curvature information. Our method is computationally efficient and avoids explicit second-order matrix construction. Experiments on standard natural language understanding benchmarks demonstrate improved performance and faster convergence compared to existing LoRA variants.

2603.29495 2026-06-09 cs.CV cs.ET cs.HC (version update)

All-in-One Augmented Reality Guided Head and Neck Tumor Resection

Yue Yang, Matthieu Chabanas, Carrie Reale, Annie Benson, Jason Slagle, Matthew Weinger, Michael Topf, Jie Ying Wu

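Affiliations: Vanderbilt Institute for Surgery and Engineering; Vanderbilt University Medical Center

AI summary: This paper presents an integrated augmented reality system for localizing tumor margins during surgery, using HoloLens 2 depth sensing and automated markerless surface registration to substantially reduce margin relocalization error.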

Abstract

Positive margins are common in head and neck squamous cell carcinoma, yet intraoperative re-resection is often imprecise because margin locations are typically communicated verbally from pathology. We present an all-in-one augmented reality (AR) system that relocalizes positive margins from a resected specimen to the resection bed and visualizes them in situ using HoloLens 2 depth sensing and fully automated markerless surface registration. In a silicone phantom study with six medical trainees, markerless registration achieved target registration errors comparable to a marker-based baseline (median 1.8 mm vs. 1.7 mm; maximum < 4 mm). In a margin relocalization task, AR guidance reduced error from verbal guidance (median 14.2 mm) to a few millimeters (median 3.2 mm), with all AR localizations within 5 mm error. These results support the feasibility of markerless AR margin guidance for more precise intraoperative re-excision.

2603.29237 2026-06-09 cs.LG cs.NA math.NA (version update)

Stochastic Dimension Implicit Functional Projections for Global Integral Conservation in High-Dimensional PINNs

Zhangyong Liang, Huanhuan Gao

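Affiliations: National Center for Applied Mathematics, Tianjin University, China; School of Mechanical and Aerospace Engineering, Jilin University, China

AI summary: This paper proposes SDIFP, which enforces first and second spatial moment constraints in high-dimensional PINNs through a global affine correction of the network output, avoiding the poor dimensional scaling of tensor-product quadrature and improving computational efficiency.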

Abstract

Enforcing prescribed global integral constraints in mesh-free neural PDE solvers is challenging in high-dimensional domains. Existing projection methods for spatial integrals are often tied to fixed grids or uniform quadrature, which can conflict with randomly sampled physics-informed neural networks (PINNs) and scale poorly with dimension. High-order differential operators also increase reverse-mode automatic differentiation memory costs. We propose Stochastic Dimension Implicit Functional Projection (SDIFP), a quadrature-level framework for enforcing prescribed first and second spatial moments. SDIFP replaces tensor-product nodal projection by a global affine correction of the neural-network output, with two scalar coefficients determined from a weighted quadrature rule. Under positive target variance and nonzero empirical raw variance, this correction is the nearest-point projection, in the weighted quadrature norm, onto the empirical two-moment constraint set. Thus, the prescribed moments are exact for the selected quadrature rule, while continuum errors are quadrature errors of the corrected field. For decomposable high-dimensional linear operators, SDIFP combines affine moment correction with stochastic operator-subset sampling. With independent residual and derivative sampling and conditionally unbiased coefficient-gradient estimation, the resulting estimator is unbiased for the specified quadrature-based residual objective; the shared-subset fast mode is biased in general. SDIFP avoids tensor-product quadrature for moment enforcement, separates forward quadrature evaluation from the reverse-mode graph, and retains pointwise inference efficiency once the affine coefficients are fixed or precomputed.
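The global affine correction with two scalar coefficients can be sketched in a few lines. This sketch assumes the two moments are the weighted mean and weighted variance under normalized quadrature weights, and it picks the positive scale factor; the function name and moment definitions are assumptions, and the abstract's nearest-point projection property is the paper's claim rather than something this code verifies.

```python
import math

def affine_moment_correction(u, w, target_mean, target_var):
    """Rescale and shift samples u so their weighted mean and variance hit targets.

    u: network outputs at the quadrature nodes
    w: non-negative quadrature weights (normalized here to sum to 1)
    Returns (a, b, corrected) where corrected[i] = a * u[i] + b.
    """
    total = sum(w)
    w = [wi / total for wi in w]
    mean = sum(wi * ui for wi, ui in zip(w, u))
    var = sum(wi * (ui - mean) ** 2 for wi, ui in zip(w, u))
    if target_var <= 0 or var <= 0:
        raise ValueError("needs positive target variance and nonzero empirical variance")
    a = math.sqrt(target_var / var)   # positive root keeps the field's orientation
    b = target_mean - a * mean        # shift so the weighted mean matches
    return a, b, [a * ui + b for ui in u]
```

Once `a` and `b` are fixed, evaluating the corrected field at a new point costs one multiply and one add on top of the network's forward pass, which matches the abstract's remark about pointwise inference efficiency.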

2603.25726 2026-06-09 cs.CV (version update)

AnyHand: A Large-Scale Synthetic Dataset for RGB(-D) Hand Pose Estimation

Chen Si, Yulin Liu, Bo Ai, Jianwen Xie, Rolandos Alexandros Potamias, Chuanxia Zheng, Hao Su

Affiliations: University of California, San Diego; Lambda, Inc; Imperial College London; Nanyang Technological University

AI summary: AnyHand provides large-scale RGB-D images with rich geometric annotations and improves 3D hand pose estimation, showing that training data diversity and quality matter for model performance.

Abstract

We present AnyHand, a large-scale synthetic dataset designed to advance the state of the art in 3D hand pose estimation. While recent works with foundation approaches have shown that scaling training data markedly improves hand pose estimation, existing real-world datasets are limited in coverage, and prior synthetic datasets rarely provide occlusions, arm details, and aligned depth together at scale. To address this bottleneck, our proposed AnyHand contains 2.5M single-hand and 4.1M hand-object interaction RGB-D images, with rich geometric annotations. We show that extending the original training data recipes of existing RGB baselines with AnyHand yields significant gains on multiple benchmarks (FreiHAND and HO-3D), even when keeping the architectures and training schemes fixed. Together with extensive ablations on the scale and composition of the training data setups, these results suggest that training data diversity and quality are as critical as scale for advancing hand pose estimation. We further examine the utility of AnyHand's aligned depth maps in the appendix, showing that scaling RGB-D supervision with AnyHand allows a lightweight depth-fusion variant of existing RGB baselines to outperform prior RGB-D methods.

2511.04124 2026-06-09 cs.LG (version update)

Decomposable Neuro Symbolic Regression

Giorgio Morales, John W. Sheppard

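Affiliations: Gianforte School of Computing, Montana State University

AI summary: This paper proposes an explainable neuro-symbolic regression method that uses a Multi-Set Transformer to generate univariate symbolic skeletons, then selects, merges, and optimizes them with genetic algorithms and genetic programming to produce interpretable multivariate expressions that recover the underlying mathematical structure more reliably than other methods.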

Comments Under review as submission to TMLR

Abstract

Symbolic regression (SR) models complex systems by discovering mathematical expressions that capture underlying relationships in observed data. However, most SR methods prioritize minimizing prediction error over identifying the governing equations, often producing overly complex or inaccurate expressions. To address this, we present a decomposable SR method that generates interpretable multivariate expressions leveraging transformer models, genetic algorithms (GAs), and genetic programming (GP). In particular, our explainable SR method distills a trained "opaque" regression model into mathematical expressions that serve as explanations of its computed function. Our method employs a Multi-Set Transformer to generate multiple univariate symbolic skeletons that characterize how each variable influences the opaque model's response. We then evaluate the generated skeletons' performance using a GA-based approach to select a subset of high-quality candidates before incrementally merging them via a GP-based cascade procedure that preserves their original skeleton structure. The final multivariate skeletons undergo coefficient optimization via a GA. We evaluated our method on problems with controlled and varying degrees of noise, demonstrating lower or comparable interpolation and extrapolation errors compared to two GP-based methods, three neural SR methods, and a hybrid approach. Unlike them, our approach consistently learned expressions that matched the original mathematical structure. Similarly, our method achieved both a high symbolic solution recovery rate and competitive predictive performance relative to benchmark methods on the Feynman dataset.

2603.27493 2026-06-09 cs.CV (version update)

Fully Spiking Neural Networks with Target Awareness for Energy-Efficient UAV Tracking

Pengzhi Zhong, Jiwei Mo, Dan Zeng, Feixiang He, Shuiwang Li

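Affiliations: College of Computer Science and Engineering, Guilin University of Technology; School of Artificial Intelligence, Sun Yat-sen University; School of Electronic Information, Central South University

AI summary: This paper proposes STATrack, a fully spiking neural network framework for UAV visual tracking from RGB input. An adaptive mutual information maximization mechanism and a dynamic weighting strategy help preserve target semantics and suppress background interference, and experiments show efficient tracking at low energy consumption.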

Abstract

Spiking Neural Networks (SNNs), characterized by their event-driven computation and low power consumption, have shown great potential for energy-efficient visual tracking on unmanned aerial vehicles (UAVs). However, existing SNN-based trackers often rely on costly event cameras, which limits their deployment on standard RGB-camera UAV platforms. To address this limitation, we propose STATrack, a fully spiking neural network framework for UAV visual tracking using only RGB inputs. To the best of our knowledge, this is the first study to explore fully spiking neural networks for RGB-based UAV visual tracking. To alleviate target semantic degradation caused by spike discretization and reduce background interference in UAV scenes, we introduce an Adaptive Mutual Information Maximization (AMIM) mechanism. AMIM maximizes the mutual information between template inputs and their deep target-aware features, encouraging the spiking backbone to preserve discriminative target semantics. In addition, a sample-difficulty-aware dynamic weighting strategy is designed to adaptively adjust the mutual information constraint during training. Extensive experiments on four widely used UAV tracking benchmarks demonstrate that STATrack achieves state-of-the-art tracking performance with low theoretical energy consumption, highlighting its potential for energy-constrained UAV applications.

2603.26763 2026-06-09 cs.CV cs.MM eess.IV (version update)

A Camera-Native Talking-Head Video Dataset for Various Computer Vision Tasks

Babak Naderi, Ross Cutler, Nabakumar Singh Khongbantabam

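Affiliations: University of California, Berkeley

AI summary: This paper presents a dataset of 847 talking-head videos for evaluating video compression, super-resolution, and quality assessment, and demonstrates its value for real-time communication.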

Abstract

Talking-head videos constitute a predominant content type in real-time communication, yet publicly available datasets for video processing research in this domain remain scarce and limited in signal fidelity. In this paper, we open-source a camera-native dataset of 847 talking-head recordings (approximately 212 minutes), each 15s in duration, captured from 805 participants using 446 unique consumer webcam devices in their natural environments. All recordings are stored using the FFV1 lossless codec, preserving the camera-native signal -- uncompressed (24.4%) or MJPEG-encoded (75.6%) -- without additional lossy processing. Each recording is annotated with a Mean Opinion Score (MOS) and ten perceptual quality tokens that jointly explain 64.4% of the MOS variance. From this corpus, we curate a stratified benchmarking subset of 120 clips in three content conditions: original, background blur, and background replacement. Codec efficiency evaluation across four datasets and four codecs, namely H.264, H.265, H.266, and AV1, yields VMAF BD-rate savings up to $-71.3\%$ (H.266) relative to H.264, with significant encoder$\times$dataset ($η_p^2 = .112$) and encoder$\times$content condition ($η_p^2 = .149$) interactions, demonstrating that both content type and background processing affect compression efficiency. A preliminary super-resolution evaluation with four SR models confirms that the dataset significantly affects absolute performance while preserving model rankings, demonstrating applicability beyond codec benchmarking. The dataset offers 5$\times$ the scale of the largest prior talking-head webcam dataset (847 vs. 160 clips) with lossless signal fidelity, establishing a resource for benchmarking video compression, super-resolution, quality assessment, and enhancement models in real-time communication.

2603.25184 2026-06-09 cs.LG cs.AI (version update)

Train at Moving Edge: Online-Verified Prompt Selection for Efficient RL Training of Large Reasoning Model

Jiahao Wu, Ning Lu, Shengcai Liu, Kun Wang, Yanting Yang, Bailong Lin, Chen Jason Zhang, Li Qing, Ke Tang

Affiliations: Southern University of Science and Technology; The Hong Kong Polytechnic University; The Hong Kong University of Science and Technology; Nanyang Technological University; Rutgers University; The Hong Kong University of Science and Technology (Guangzhou)

AI summary: This paper proposes HIVE, which uses historical reward trajectories and real-time prompt entropy to select prompts for efficient RL training, improving rollout efficiency without sacrificing performance.

Abstract

Reinforcement learning (RL) has become essential for post-training large language models (LLMs) in reasoning tasks. While scaling rollouts can stabilize training and enhance performance, the computational overhead is a critical issue. In algorithms like GRPO, multiple rollouts per prompt incur prohibitive costs, as a large portion of prompts provide negligible gradients and are thus of low utility. To address this problem, we investigate how to select high-utility prompts before the rollout phase. Our experimental analysis reveals that sample utility is non-uniform and evolving: the strongest learning signals concentrate at the "learning edge", the intersection of intermediate difficulty and high uncertainty, which shifts as training proceeds. Motivated by this, we propose HIVE (History-Informed and online-VErified prompt selection), a dual-stage framework for data-efficient RL. HIVE utilizes historical reward trajectories for coarse selection and employs prompt entropy as a real-time proxy to prune instances with stale utility. By evaluating HIVE across multiple math reasoning benchmarks and models, we show that HIVE yields significant rollout efficiency without compromising performance.
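Why many prompts contribute negligible gradients under GRPO-style training can be seen from the group-relative advantage itself. The sketch below assumes the common formulation in which each rollout's reward is centered by the group mean and divided by the group standard deviation; the function names are hypothetical, and this illustrates the motivation only, not HIVE's selection procedure.

```python
import statistics

def group_advantages(rewards, eps=1e-8):
    """Group-relative advantages: (r - mean) / (std + eps) within one prompt's rollouts."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std is an assumption; variants differ
    return [(r - mean) / (std + eps) for r in rewards]

def has_learning_signal(rewards):
    """A prompt whose rollouts all receive the same reward yields zero advantages."""
    return len(set(rewards)) > 1

# Too-easy and too-hard prompts produce all-zero advantages; mixed outcomes do not.
for rewards in ([1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 1.0]):
    print(rewards, has_learning_signal(rewards), group_advantages(rewards))
```

Prompts that the model always solves or always fails therefore consume rollout compute without moving the policy, which is the cost that selecting prompts before the rollout phase aims to avoid.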

2603.25157 2026-06-09 cs.LG cs.AI cs.CV stat.ML (version update)

Vision Hopfield Memory Networks for Image Recognition

Vision Hopfield Memory Networks

Jianfeng Wang, Amine M'Charrak, Luk Koska, Xiangtao Wang, Daniel Petriceanu, Ruizhi Wang, Michael Bumbar, Luca Pinchetti, Thomas Lukasiewicz

发表机构 * Department of Computer Science, University of Oxford(牛津大学计算机科学系) Faculty of Informatics, Vienna University of Technology(维也纳理工大学信息学院)

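AI summary This paper proposes the Vision Hopfield Memory Network (V-HMN), a brain-inspired model that integrates hierarchical memory mechanisms with iterative refinement updates, modeling local and global dynamics in a unified framework and improving interpretability and data efficiency.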

详情
AI中文摘要

近年来,视觉和多模态基础模型,如Transformer家族和状态空间模型(如Mamba)在图像、文本等领域取得了显著进展。尽管这些架构在经验上取得了成功,但它们与人脑的计算原理仍有很大差距,通常需要大量的训练数据且可解释性有限。在本文中,我们提出了视觉Hopfield记忆网络(V-HMN),一种受大脑启发的基础模型,整合了分层记忆机制和迭代细化更新。具体而言,V-HMN包含局部Hopfield模块,提供图像块级别的关联记忆动态,全局Hopfield模块作为情境调节的事件记忆,以及受预测编码启发的细化规则用于迭代误差校正。通过将这些基于记忆的模块分层组织,V-HMN在一个统一的框架中捕捉了局部和全局动态。记忆检索揭示了输入与存储模式之间的关系,使决策更具可解释性,而存储模式的重用提高了数据效率。这种受大脑启发的设计因此在可解释性和数据效率方面超越了现有的自注意或状态空间方法。我们在公开的计算机视觉基准上进行了广泛的实验,V-HMN在与广泛采用的基础架构竞争的同时,提供了更好的可解释性、更高的数据效率和更强的生物合理性。这些发现突显了V-HMN作为下一代视觉基础模型的潜力,同时为文本和音频等领域的多模态基础模型提供了通用的蓝图,从而将受大脑启发的计算与大规模机器学习联系起来。

英文摘要

Recent vision backbones, such as Transformer families and state-space models like Mamba, have achieved remarkable progress on image recognition. Despite their empirical success, these architectures remain far from the computational principles of the human brain, often demanding enormous amounts of training data while offering limited interpretability. We propose the Vision Hopfield Memory Network (V-HMN), a brain-inspired vision backbone that integrates hierarchical memory mechanisms across layers with iterative refinement updates. Specifically, V-HMN incorporates local Hopfield modules that provide associative memory dynamics at the image patch level, global Hopfield modules that function as episodic memory for contextual modulation, and a predictive-coding-inspired refinement rule for iterative error correction. By organizing these memory-based modules hierarchically, V-HMN captures both local and global dynamics in a unified framework. Compared to existing self-attention- or state-space-based approaches, explicit memory retrieval exposes the relationship between inputs and stored patterns, providing a prototype-based form of interpretability, while the reuse of stored patterns improves data efficiency. We conducted extensive experiments on public image classification benchmarks. V-HMN achieves strong performance on small- and medium-scale benchmarks and remains competitive with widely adopted backbone architectures on ImageNet despite minimal architectural tuning. These findings highlight the potential of V-HMN as a memory-centric alternative to standard vision backbones, thereby bridging brain-inspired computation with modern machine learning.
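
The associative retrieval that the Hopfield modules rely on can be illustrated with the standard modern (continuous) Hopfield update, in which a query is repeatedly replaced by a softmax-weighted average of stored patterns. The abstract does not state the exact update rule used in V-HMN, so the rule, the `beta` value, and the step count below are assumptions chosen for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def hopfield_retrieve(query, patterns, beta=4.0, steps=3):
    """Iteratively pull `query` toward the stored pattern it most resembles.

    Each step scores the current state against every stored pattern (dot product
    scaled by `beta`), then replaces the state with the softmax-weighted average
    of the patterns. Returns the final state and the weights used in the last step.
    """
    state = list(query)
    weights = []
    for _ in range(steps):
        scores = [beta * sum(s * p for s, p in zip(state, pat)) for pat in patterns]
        weights = softmax(scores)
        state = [sum(w * pat[i] for w, pat in zip(weights, patterns))
                 for i in range(len(state))]
    return state, weights
```

The returned weights show which stored pattern dominated the retrieval, which is one plausible reading of the prototype-based interpretability the abstract mentions.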

2603.24925 2026-06-09 cs.LG cs.CL cs.IR 版本更新

GraphER: An Efficient Graph-Based Enrichment and Reranking Method for Retrieval-Augmented Generation

GraphER: 一种高效的基于图的增强和重排序方法用于检索增强生成

Ruizhong Miao, Yuying Wang, Rongguang Wang, Chenyang Li, Tao Sheng, Sujith Ravi, Dan Roth

发表机构 * Oracle AI

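AI summary GraphER leverages the organizational structure of data to capture proximity relationships beyond semantic similarity, constructs a graph at query time, and applies graph-based ranking to improve retrieval completeness; it requires no additional graph infrastructure and is compatible with standard vector stores.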

详情

英文摘要

Retrieval-augmented generation (RAG) systems that rely on semantic search often fail to retrieve the complete set of evidence for complex queries, particularly when information is distributed across multiple sources. Existing approaches either rely on iterative agentic retrieval, which can be inefficient, or maintain additional structures such as knowledge graphs, which introduce storage and maintenance overhead. In this paper, we propose GraphER, a graph-based enrichment and reranking framework that (1) leverages the organizational structure of data to capture proximity relationships beyond semantic similarity, (2) constructs a graph at query time based on these proximities, and (3) applies graph-based ranking to surface the top candidate documents. Experiments across table retrieval, multi-hop retrieval, and long-document retrieval benchmarks demonstrate consistent improvements in terms of retrieval completeness. Additionally, GraphER requires no additional graph infrastructure and integrates seamlessly with standard vector stores. The framework is retriever-agnostic, supports multiple forms of proximity, and introduces minimal query-time latency.
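
One way to picture steps (2) and (3) is to build a small graph over candidate chunks at query time and run a graph ranking seeded by the retriever's scores. The abstract does not name the ranking algorithm, so the sketch below substitutes personalized PageRank as an assumption; the function name, the edge list format, and the damping value are hypothetical.

```python
def personalized_pagerank(edges, seed_scores, damping=0.85, iterations=50):
    """Rank nodes of an undirected proximity graph, restarting at retrieved nodes.

    `edges` is a list of (u, v) proximity links (for example, chunks from the same
    table or document); `seed_scores` maps retrieved node ids to retriever scores.
    Returns (nodes sorted by rank, rank dictionary).
    """
    nodes = set(seed_scores)
    for u, v in edges:
        nodes.update((u, v))
    neighbors = {n: set() for n in nodes}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    total = sum(seed_scores.values())
    restart = {n: seed_scores.get(n, 0.0) / total for n in nodes}
    rank = dict(restart)

    for _ in range(iterations):
        new_rank = {n: (1 - damping) * restart[n] for n in nodes}
        for n in nodes:
            if neighbors[n]:
                # Spread this node's mass evenly over its proximity neighbors.
                share = damping * rank[n] / len(neighbors[n])
                for m in neighbors[n]:
                    new_rank[m] += share
            else:
                # Isolated node: send its mass back to the restart distribution.
                for m in nodes:
                    new_rank[m] += damping * rank[n] * restart[m]
        rank = new_rank

    return sorted(nodes, key=lambda n: rank[n], reverse=True), rank
```

In this toy setup a chunk that the semantic retriever missed can still surface if it is linked by proximity to a strongly retrieved chunk, which is the kind of completeness gain the abstract describes.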

2603.24388 2026-06-09 cs.CV 版本更新

Causal Transfer in Medical Image Analysis

医学图像分析中的因果迁移

Mohammed M. Abdelsamea, Daniel Tweneboah Anyimadu, Tasneem Selim, Saif Alzubi, Lei Zhang, Ahmed Karam Eldaly, Xujiong Ye

发表机构 * University of Waterloo(滑铁卢大学)

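AI summary This survey examines causal transfer methods for medical image analysis, integrating causal reasoning with cross-domain representation learning to address domain shift and improve the robustness and generalization of clinical AI.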

详情
AI中文摘要

医学成像模型在跨医院、扫描仪、人群或成像协议部署时经常失效,这是由于领域偏移,限制了其临床可靠性。虽然迁移学习和领域适应通过统计方法解决此类偏移,但它们通常依赖于破坏性变化条件下的虚假相关性。另一方面,因果推断提供了一种原则性方法,以识别在不同环境中保持稳定的不变机制。本文介绍了并系统化了医学图像分析中的因果迁移学习(CTL)。该范式将因果推理与跨域表示学习结合,以实现稳健且可推广的临床AI。我们把领域偏移视为因果问题,并分析结构性因果模型、不变风险最小化和反事实推理如何嵌入迁移学习流程中。我们研究了涵盖分类、分割、重建、异常检测和多模态成像的任务,并按任务、偏移类型和因果假设进行组织。提出了一种统一的分类法,将因果框架与迁移机制联系起来。我们进一步总结了数据集、基准测试和经验收益,突出因果迁移在何时以及为何优于基于相关性的领域适应。最后,我们讨论了CTL如何支持多机构和联邦设置中的公平性、鲁棒性和可信部署,并概述了临床可靠医学成像AI的开放挑战和研究方向。

英文摘要

Medical imaging models frequently fail when deployed across hospitals, scanners, populations, or imaging protocols due to domain shift, limiting their clinical reliability. While transfer learning and domain adaptation address such shifts statistically, they often rely on spurious correlations that break under changing conditions. On the other hand, causal inference provides a principled way to identify invariant mechanisms that remain stable across environments. This survey introduces and systematises Causal Transfer Learning (CTL) for medical image analysis. This paradigm integrates causal reasoning with cross-domain representation learning to enable robust and generalisable clinical AI. We frame domain shift as a causal problem and analyse how structural causal models, invariant risk minimisation, and counterfactual reasoning can be embedded within transfer learning pipelines. We review studies spanning classification, segmentation, reconstruction, anomaly detection, and multimodal imaging, and organise them by task, shift type, and causal assumption. A unified taxonomy is proposed that connects causal frameworks and transfer mechanisms. We further summarise datasets, benchmarks, and empirical gains, highlighting when and why causal transfer outperforms correlation-based domain adaptation. Finally, we discuss how CTL supports fairness, robustness, and trustworthy deployment in multi-institutional and federated settings, and outline open challenges and research directions for clinically reliable medical imaging AI.

2603.24238 2026-06-09 cs.RO 版本更新

Decentralized End-to-End Multi-AAV Pursuit Using Predictive Spatio-Temporal Observation via Deep Reinforcement Learning

基于深度强化学习的去中心化端到端多无人艇追捕

Yude Li, Zhexuan Zhou, Huizhe Li, Yanke Sun, Yenan Wu, Yichen Lai, Yiming Wang, Youmin Gong, Jie Mei

发表机构 * School of Intelligence Science and Engineering(智能科学与工程学院) Guangdong Key Laboratory of Intelligent Morphing Mechanisms and Adaptive Robotics(广东省智能变形机制与自适应机器人重点实验室) Shenzhen Key Lab for Advanced Motion Control and Modern Automation Equipments(深圳先进运动控制与现代自动化设备重点实验室) Harbin Institute of Technology(哈尔滨工业大学)

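AI summary This paper proposes a decentralized end-to-end multi-agent reinforcement learning framework that uses predictive spatio-temporal observations to let autonomous aerial swarms pursue targets in cluttered environments, achieving superior capture efficiency and competitive success rates.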

详情
AI中文摘要

在复杂环境中实现去中心化协作追捕对自主空中群体而言具有挑战性,特别是在感知不完全和噪声存在的情况下。现有方法通常依赖于抽象几何特征或特权地面真实状态,从而回避了现实环境中的感知不确定性。本文提出了一种去中心化端到端多智能体强化学习(MARL)框架,直接将原始激光雷达观测映射到连续控制命令。框架的核心是预测时空观测(PSTO),一种以自我为中心的网格表示,将障碍物几何与预测对抗意图和队友运动统一在一个固定分辨率投影中。基于PSTO,单一去中心化策略使智能体能够导航静态障碍物、拦截动态目标并维持协同包围。仿真显示,所提方法在捕获效率和成功率方面优于依赖特权障碍信息的现有学习方法。此外,统一策略可无缝扩展到不同团队规模而无需重新训练。最后,完全自主的户外实验验证了该框架在仅依赖机载传感和计算的四旋翼群体上的有效性。

英文摘要

Decentralized cooperative pursuit in cluttered environments is challenging for autonomous aerial swarms, especially under partial and noisy perception. Existing methods often rely on abstracted geometric features or privileged ground-truth states, and therefore sidestep perceptual uncertainty in real-world settings. We propose a decentralized end-to-end multi-agent reinforcement learning (MARL) framework that maps raw LiDAR observations directly to continuous control commands. Central to the framework is the Predictive Spatio-Temporal Observation (PSTO), an egocentric grid representation that aligns obstacle geometry with predictive adversarial intent and teammate motion in a unified, fixed-resolution projection. Built on PSTO, a single decentralized policy enables agents to navigate static obstacles, intercept dynamic targets, and maintain cooperative encirclement. Simulations demonstrate that the proposed method achieves superior capture efficiency and competitive success rates compared to state-of-the-art learning-based approaches relying on privileged obstacle information. Furthermore, the unified policy scales seamlessly across different team sizes without retraining. Finally, fully autonomous outdoor experiments validate the framework on a quadrotor swarm relying on only onboard sensing and computing.

2509.16136 2026-06-09 cs.RO 版本更新

Reward Evolution with Graph-of-Thoughts: A Bi-Level Language Model Framework for Reinforcement Learning

基于思维图的奖励进化:一种用于强化学习的双层语言模型框架

Changwei Yao, Xinzi Liu, Chen Li, Marios Savvides

发表机构 * Carnegie Mellon University(卡内基梅隆大学) University of Tokyo(东京大学)

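AI summary This paper proposes the RE-GoT framework, which combines LLMs and VLMs with graph-of-thoughts reasoning and iteratively refines reward functions through task decomposition and visual feedback; experiments on RoboGen and ManiSkill2 tasks show that it outperforms existing LLM-based baselines.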

详情
Journal ref
IEEE International Conference on Robotics and Automation (ICRA 2026)
AI中文摘要

设计有效的奖励函数仍是强化学习(RL)中的主要挑战,通常需要大量的人类专业知识和迭代优化。最近的进展利用大语言模型(LLM)进行自动化奖励设计,但这些方法受限于幻觉、依赖人类反馈以及处理复杂多步骤任务的困难。在本工作中,我们引入基于思维图的奖励进化(RE-GoT),一种新颖的双层框架,通过结构化的图推理增强LLM,并整合视觉语言模型(VLM)进行自动化 rollout 评估。RE-GoT首先将任务分解为文本属性图,实现全面分析和奖励函数生成,然后通过VLM的视觉反馈迭代优化奖励,无需人工干预。在10个RoboGen和4个ManiSkill2任务上的广泛实验表明,RE-GoT在多个指标上均优于现有基于LLM的基线方法。在RoboGen中,我们的方法将平均任务成功率提高了32.25%,在复杂多步骤任务上表现尤为突出。在ManiSkill2中,RE-GoT在四个多样化操作任务上的平均成功率为93.73%,显著超越了现有基于LLM的方法,甚至超过了专家设计的奖励。我们的结果表明,结合LLM和VLM的图思维推理提供了一种可扩展且有效的解决方案,用于RL中的自主奖励进化。

英文摘要

Designing effective reward functions remains a major challenge in reinforcement learning (RL), often requiring considerable human expertise and iterative refinement. Recent advances leverage Large Language Models (LLMs) for automated reward design, but these approaches are limited by hallucinations, reliance on human feedback, and challenges with handling complex, multi-step tasks. In this work, we introduce Reward Evolution with Graph-of-Thoughts (RE-GoT), a novel bi-level framework that enhances LLMs with structured graph-based reasoning and integrates Visual Language Models (VLMs) for automated rollout evaluation. RE-GoT first decomposes tasks into text-attributed graphs, enabling comprehensive analysis and reward function generation, and then iteratively refines rewards using visual feedback from VLMs without human intervention. Extensive experiments on 10 RoboGen and 4 ManiSkill2 tasks demonstrate that RE-GoT consistently outperforms existing LLM-based baselines. On RoboGen, our method improves average task success rates by 32.25%, with notable gains on complex multi-step tasks. On ManiSkill2, RE-GoT achieves an average success rate of 93.73% across four diverse manipulation tasks, significantly surpassing prior LLM-based approaches and even exceeding expert-designed rewards. Our results indicate that combining LLMs and VLMs with graph-of-thoughts reasoning provides a scalable and effective solution for autonomous reward evolution in RL.

2404.01948 2026-06-09 cs.CV 版本更新

Quantifying Noise of Dynamic Vision Sensor

量化动态视觉传感器的噪声

Evgeny V. Votyakov, Alessandro Artusi

发表机构 * DeepCamera MRG CYENS Centre of Excellence(CYENS卓越中心) Nicosia, Cyprus(塞浦路斯尼科西亚)

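AI summary This letter proposes a new technique based on Detrended Fluctuation Analysis to quantify the background activity noise of dynamic vision sensors, addressing the difficulty of separating noise from signal without ground truth, and shows how to determine optimal denoising filter parameters.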

Comments 5 pages, 4 figures, submitted to the IEEE Signal Processing Letters

详情
AI中文摘要

动态视觉传感器(DVS)以其大量背景活动噪声(BA)为特征,该噪声与原始信号混合。信号的动态特性和实际应用中缺乏地面真实,使得标准图像处理技术难以区分噪声与清洁信号。本文提出一种基于去趋势波动分析(DFA)的新技术,用于表征BA噪声。该技术可用于解决现有DVS问题:如何在无地面真实情况下量化噪声和信号,以及如何推导最优去噪滤波器参数。后者问题的解决方案在流行的实时移动汽车数据集中得到了演示。

英文摘要

Dynamic vision sensors (DVS) are characterized by a large amount of background activity (BA) noise, which is mixed with the original (clean) sensor signal. The dynamic nature of the signal and the absence of ground truth in practical applications make it difficult to distinguish noise from the clean sensor signal using standard image processing techniques. In this letter, a new technique derived from Detrended Fluctuation Analysis (DFA) is presented to characterise BA noise. The proposed technique can be used to address two existing DVS issues: how to quantitatively characterise noise and signal without ground truth, and how to derive optimal denoising filter parameters. The solution to the latter problem is demonstrated on the popular real moving-car dataset.
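
For readers unfamiliar with Detrended Fluctuation Analysis, the textbook first-order version can be sketched as below. This is the generic DFA procedure on a 1-D series, not the DVS-specific technique of the letter; how the authors derive a series from sensor events is not stated in the abstract, and the window sizes here are arbitrary.

```python
import math
import random


def _linear_fit(xs, ys):
    """Least-squares line through (xs, ys); returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def dfa_fluctuation(series, window):
    """RMS fluctuation F(window) of the integrated, linearly detrended series.

    Requires len(series) >= window; leftover samples at the end are ignored.
    """
    mean = sum(series) / len(series)
    profile, running = [], 0.0
    for v in series:
        running += v - mean          # cumulative sum of the mean-removed series
        profile.append(running)
    xs = list(range(window))
    sq_sum, count = 0.0, 0
    for start in range(0, len(profile) - window + 1, window):
        seg = profile[start:start + window]
        slope, intercept = _linear_fit(xs, seg)
        sq_sum += sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, seg))
        count += window
    return math.sqrt(sq_sum / count)


def dfa_exponent(series, windows):
    """Slope of log F(s) against log s over the given window sizes."""
    log_s = [math.log(w) for w in windows]
    log_f = [math.log(dfa_fluctuation(series, w)) for w in windows]
    slope, _ = _linear_fit(log_s, log_f)
    return slope
```

For uncorrelated noise the exponent is expected to sit near 0.5, while correlated signals give larger values; a gap of that kind is what makes the exponent usable for telling noise from structured activity.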

2603.22793 2026-06-09 cs.AI 版本更新

Signals Are Not States: Neuro-Symbolic Safeguards for Culturally Aware Classroom AI

信号不是状态:面向文化意识课堂AI的神经符号安全机制

Sina Bagheri Nezhad

发表机构 * Independent Researcher(独立研究者)

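AI summary This paper proposes the NSCR framework, a neuro-symbolic approach that processes multimodal classroom signals and separates observable evidence from culturally loaded interpretation, reducing the harm that cultural bias causes in classroom AI.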

Comments Accepted at the Workshop on Stereotypes Across Cultures in Language Technologies @ ACL 2026

详情
AI中文摘要

课堂AI系统越来越多地从多模态和语言信号推断出高水平教育状态,如参与度、困惑、合作、参与和教学质量。在多元文化和多语言课堂中,此类推断可能将文化特定的行为转化为刻板印象:沉默可能被解读为不参与,目光回避可能被解读为不专心,语言切换可能被解读为低能力,或间接求助可能被解读为困惑。我们主张,具有刻板印象意识的课堂AI应将可观察的证据与文化负载的解读分开,并将未经支持的构造层面的主张视为安全风险。我们引入NSCR,一种基于文化的神经符号框架,将视频、音频、语音识别、课程材料和上下文元数据转换为带不确定性的事实、来源和文化范围,然后通过可执行推理和政策约束组合它们。我们定义了刻板印象倾向课堂推断的分类学,并提出了涵盖文化条件下的状态推断、证据基础的主张验证、多语言和语言切换推理、合作分析、反事实文化鲁棒性以及文化条件下的红队测试的基准议程。我们进一步指定了刻板印象泄漏、未支持的归属、文化校准差距、文化模糊性下的回避以及证据忠实度的度量标准。贡献是方法学的:为减少课堂AI中的刻板印象推理提供具体的框架和评估议程,教育作为高风险、文化多变的部署场景。

英文摘要

Classroom AI systems increasingly infer high-level educational states such as engagement, confusion, collaboration, participation, and instructional quality from multimodal and linguistic signals. In multicultural and multilingual classrooms, such inferences can translate culturally situated behavior into stereotyped claims: silence may be read as disengagement, gaze aversion as inattention, code-switching as low proficiency, or indirect help-seeking as confusion. We argue that stereotype-aware classroom AI should separate observable evidence from culturally loaded interpretation and should treat unsupported construct-level claims as safety risks. We introduce NSCR, a culturally grounded neuro-symbolic framework that converts video, audio, ASR, lesson artifacts, and contextual metadata into typed facts with uncertainty, provenance, and cultural scope, then composes them through executable reasoning and policy constraints. We define a taxonomy of stereotype-prone classroom inferences and propose a benchmark agenda covering culture-conditioned state inference, evidence-grounded claim verification, multilingual and code-switched reasoning, collaboration analysis, counterfactual cultural robustness, and culture-conditioned red-teaming. We further specify metrics for stereotype leakage, unsupported attribution, cultural calibration gaps, abstention under cultural ambiguity, and evidence faithfulness. The contribution is methodological: a concrete framework and evaluation agenda for mitigating stereotyped reasoning in classroom AI, with education as a high-stakes, culturally variable deployment setting.

2603.22473 2026-06-09 cs.CL cs.AI cs.LG 版本更新

Component Ablation for Efficient Hybrid Language Model Architectures: Performance, Resilience, and Compression Implications

组件消融用于高效混合语言模型架构:性能、鲁棒性和压缩影响

Hector Borobia, Elies Seguí-Mas, Guillermina Tormo-Carbó

发表机构 * Doctoral Program in Computer Science, University of Valencia(瓦伦西亚大学计算机科学博士项目)

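AI summary Through component ablation of hybrid language models, this paper finds that removing either attention or the alternative sequence-processing pathway substantially degrades performance, with implications for model resilience and compression.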

Comments 25 pages, 7 figures, 6 tables; revised title, abstract, figures, and data/code repository URL

详情
AI中文摘要

混合语言模型结合softmax注意力与线性时间序列机制,如状态空间或线性注意力层,但各组件的功能贡献尚不明确。本文在两个子10亿参数的混合语言模型Qwen3.5-0.8B和Falcon-H1-0.5B上,通过基于似然的评估、下游基准、逐层干预、随机控制和表征级诊断研究组件消融。测试结果显示,移除注意力或替代序列处理路径会显著降低性能,表明两种组件类型均对模型行为有贡献。似然指标对线性注意力或状态空间路径特别敏感,而下游基准退化取决于任务和架构。逐层消融显示组件重要性位置依赖,最强效果集中在早期或中期网络组件而非整个深度。随机移除控制进一步显示混合架构与相同家族Transformer基线在结构扰动下退化不同。这些结果表明组件消融是理解混合语言模型架构的有效诊断方法。发现为高效模型设计、压缩、鲁棒性分析和部署决策提供了相关证据。

英文摘要

Hybrid language models combine softmax attention with linear-time sequence mechanisms such as state-space or linear-attention layers, but the functional contribution of each component type remains insufficiently characterized. We study component-level ablation in two sub-1B hybrid language models, Qwen3.5-0.8B and Falcon-H1-0.5B, using likelihood-based evaluation, downstream benchmarks, layer-wise interventions, random controls, and representation-level diagnostics. Across the tested models, removing either attention or the alternative sequence-processing pathway substantially degrades performance, indicating that both component types contribute to model behavior. Likelihood metrics are especially sensitive to the linear-attention or state-space pathway, while downstream benchmark degradation depends on task and architecture. Layer-wise ablations show that component importance is position-dependent, with the strongest effects concentrated in early or mid-network components rather than uniformly across depth. Random-removal controls further show that hybrid architectures and same-family Transformer baselines degrade differently under structural perturbation. These results suggest that component ablation is a useful diagnostic for understanding hybrid language model architectures. The findings provide evidence relevant to efficient model design, compression, robustness analysis, and deployment decisions in architectures that combine attention with alternative sequence-processing mechanisms.

2603.03292 2026-06-09 cs.CL cs.AI cs.IR 版本更新

From Conflict to Consensus: Boosting Medical Reasoning via Multi-Round Agentic RAG

从冲突到共识:通过多轮代理RAG提升医疗推理

Wenhao Wu, Zhentao Tang, Yafu Li, Shixiong Kai, Mingxuan Yuan, Zhenhong Sun, Chunlin Chen, Zhi Wang

发表机构 * GitHub

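AI summary This paper proposes the MA-RAG framework, which iteratively refines external evidence and internal reasoning history in a multi-round agentic loop to improve complex medical reasoning; experiments show that it outperforms competitive baselines on 7 medical Q&A benchmarks.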

Comments 27 pages, 8 figures, 18 tables

详情
AI中文摘要

大型语言模型(LLMs)在医疗问答中表现出高推理能力,但其产生幻觉和过时知识的倾向对医疗领域构成重大风险。虽然检索增强生成(RAG)缓解了这些问题,但现有方法依赖于噪声的token级信号,并缺乏复杂推理所需的多轮细化。本文提出MA-RAG(多轮代理RAG),通过在代理细化循环中迭代演变外部证据和内部推理历史,实现复杂医疗推理的测试时间扩展。在每一轮中,代理将候选响应间的语义冲突转换为可检索的外部证据查询,同时优化历史推理轨迹以缓解长上下文退化。MA-RAG通过利用不一致性作为主动信号来扩展自我一致性原则,并通过迭代最小化残差误差来实现稳定、高保真的医疗共识。在7个医疗问答基准上的广泛评估显示,MA-RAG在推理时间扩展和RAG基线方面均优于竞争方法,平均准确率比基础模型提高+6.8点。我们的代码可在https://github.com/NJU-RL/MA-RAG上获得。

英文摘要

Large Language Models (LLMs) exhibit high reasoning capacity in medical question-answering, but their tendency to produce hallucinations and outdated knowledge poses critical risks in healthcare fields. While Retrieval-Augmented Generation (RAG) mitigates these issues, existing methods rely on noisy token-level signals and lack the multi-round refinement required for complex reasoning. In this paper, we propose MA-RAG (Multi-Round Agentic RAG), a framework that facilitates test-time scaling for complex medical reasoning by iteratively evolving both external evidence and internal reasoning history within an agentic refinement loop. At each round, the agent transforms semantic conflict among candidate responses into actionable queries to retrieve external evidence, while optimizing historical reasoning traces to mitigate long-context degradation. MA-RAG extends the self-consistency principle by leveraging the lack of consistency as a proactive signal for multi-round agentic reasoning and retrieval, and mirrors a boosting mechanism that iteratively minimizes the residual error toward a stable, high-fidelity medical consensus. Extensive evaluations across 7 medical Q&A benchmarks show that MA-RAG consistently surpasses competitive inference-time scaling and RAG baselines, delivering a substantial gain of +6.8 points in average accuracy over the backbone model. Our code is available at https://github.com/NJU-RL/MA-RAG.

2602.20551 2026-06-09 cs.CV 版本更新

CAD-Prompted SAM3: Geometry-Conditioned Instance Segmentation for Industrial Objects

CAD-Prompted SAM3: 用于工业物体的几何条件实例分割

Zhenran Tang, Rohan Nagabhirava, Changliu Liu

发表机构 * Robotics Institute, Carnegie Mellon University(卡内基梅隆大学机器人研究所)

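AI summary This paper proposes a CAD-prompted segmentation framework built on SAM3 that performs geometry-conditioned instance segmentation, addressing objects in industrial settings that are hard to describe by language or appearance.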

详情
AI中文摘要

基于自然语言的分割方法受限于语言表达能力,在制造和3D打印环境中常遇到难以描述的实例特定对象。尽管图像示例提供替代方案,但它们主要编码外观线索如颜色和纹理,与部件的几何身份无关。在工业环境中,单一组件可能由不同材料、表面处理或颜色生产,使基于外观的提示不可靠。相反,此类对象通常由精确的CAD模型定义,捕捉其标准几何形状。我们提出基于SAM3的CAD提示分割框架,使用CAD模型的多视图渲染作为提示输入。渲染的视图提供独立于表面外观的几何条件。模型通过在模拟中生成的网格渲染数据进行训练,涵盖多样化的视角和场景上下文。我们的方法实现了单阶段CAD提示掩码预测,将可提示分割扩展到无法仅通过语言或外观描述的对象。

英文摘要

Verbal-prompted segmentation is inherently limited by the expressiveness of natural language and struggles with uncommon, instance-specific, or difficult-to-describe objects: scenarios frequently encountered in manufacturing and 3D printing environments. While image exemplars provide an alternative, they primarily encode appearance cues such as color and texture, which are often unrelated to a part's geometric identity. In industrial settings, a single component may be produced in different materials, finishes, or colors, making appearance-based prompting unreliable. In contrast, such objects are typically defined by precise CAD models that capture their canonical geometry. We propose a CAD-prompted segmentation framework built on SAM3 that uses canonical multi-view renderings of a CAD model as prompt input. The rendered views provide geometry-based conditioning independent of surface appearance. The model is trained using synthetic data generated from mesh renderings in simulation under diverse viewpoints and scene contexts. Our approach enables single-stage, CAD-prompted mask prediction, extending promptable segmentation to objects that cannot be robustly described by language or appearance alone.

2512.13869 2026-06-09 cs.CV 版本更新

Coarse-to-Fine Hierarchical Alignment for UAV-based Human Detection using Diffusion Models

基于扩散模型的UAV人类检测粗到细层次对齐

Wenda Li, Meng Wu, Liangzhao Chen, Sungmin Eum, Heesung Kwon, Qing Qu

发表机构 * University of Science and Technology of China(中国科学技术大学)

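AI summary This paper proposes the Coarse-to-Fine Hierarchical Alignment (CFHA) framework, which transforms synthetic data to narrow the domain gap and improve UAV-based human detection accuracy; experiments show an mAP50 improvement of up to +14.1.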

详情
AI中文摘要

训练目标检测器需要大量任务特定标注,但UAV人类检测因目标分布不断变化和标注图像稀缺而难以实现。为此,本文引入Coarse-to-Fine Hierarchical Alignment(CFHA),一个基于扩散模型的三阶段框架,旨在将合成数据转换为UAV人类检测数据,缩小领域差距并保持原始合成标签。CFHA通过三个模块显式解耦全局风格和局部内容领域的差异:(1)全局风格迁移——扩散模型通过少量真实参考集对齐合成图像的颜色、光照和纹理统计至真实风格;(2)局部细化——超分辨率扩散模型用于细化小物体的精细和逼真细节,如人类实例,保持形状和边界完整性;(3)幻觉消除——过滤掉与真实数据不匹配的人类实例,使人类外观更接近目标分布。在公开的UAV Sim2Real检测基准上进行的广泛实验表明,本文方法显著优于非转换基线。具体而言,本文方法在Semantic-Drone基准上mAP50提升达+14.1%。消融研究证实了全局和局部阶段的互补作用,并突显了层次对齐的重要性。代码已发布在https://github.com/liwd190019/CFHA。

英文摘要

Training object detectors demands extensive, task-specific annotations, yet this requirement becomes impractical in UAV-based human detection due to constantly shifting target distributions and the scarcity of labeled images. As a remedy, synthetic simulators are adopted to generate annotated data at a low annotation cost. However, the domain gap between synthetic and real images hinders the model from being effectively applied to the target domain. Accordingly, we introduce Coarse-to-Fine Hierarchical Alignment (CFHA), a three-stage diffusion-based framework designed to transform synthetic data for UAV-based human detection, narrowing the domain gap while preserving the original synthetic labels. CFHA explicitly decouples global style and local content domain discrepancies and bridges those gaps using three modules: (1) Global Style Transfer: a diffusion model aligns color, illumination, and texture statistics of synthetic images to the realistic style, using only a small real reference set; (2) Local Refinement: a super-resolution diffusion model adds fine-grained, photorealistic details to small objects, such as human instances, while preserving shape and boundary integrity; (3) Hallucination Removal: a module that filters out human instances whose visual attributes do not align with real-world data, bringing human appearance closer to the target distribution. Extensive experiments on public UAV Sim2Real detection benchmarks demonstrate that our method significantly improves detection accuracy compared to the non-transformed baselines. Specifically, our method achieves up to $+14.1$ improvement in mAP50 on the Semantic-Drone benchmark. Ablation studies confirm the complementary roles of the global and local stages and highlight the importance of hierarchical alignment. The code is released at https://github.com/liwd190019/CFHA.

2603.21511 2026-06-09 cs.CV 版本更新

Back to Point: Exploring Point-Language Models for Zero-Shot 3D Anomaly Detection

回到点:探索用于零样本3D异常检测的点语言模型

Kaiqiang Li, Gang Li, Mingle Zhou, Min Li, Delong Han, Jin Wan

发表机构 * Key Laboratory of Computing Power Network and Information Security, Ministry of Education, Shandong Computer Science Center (National Supercomputer Center in Jinan), Qilu University of Technology (Shandong Academy of Sciences), Jinan, China(计算机功率网络与信息安全重点实验室,教育部,山东计算机科学中心(济南国家超级计算机中心),齐鲁大学(山东科学院),济南,中国) Shandong Provincial Key Laboratory of Computing Power Internet and Service Computing, Shandong Fundamental Research Center for Computer Science, Jinan, China(山东省计算功率互联网与服务计算重点实验室,山东省计算机科学基础研究中心,济南,中国)

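AI summary This paper proposes the BTP framework, which aligns 3D point cloud and text embeddings to improve zero-shot 3D anomaly detection; experiments show strong performance on Real3D-AD and Anomaly-ShapeNet.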

Comments Corrected several numerical entries due to a reporting error; the corrected values do not affect the main conclusions

详情
AI中文摘要

零样本(ZS)3D异常检测对于可靠工业检测至关重要,因为它可以在不需目标类别训练数据的情况下检测和定位缺陷。现有方法将3D点云转换为2D图像,并利用预训练的视觉-语言模型(VLMs)进行异常检测。然而,这种策略不可避免地丢弃了几何细节,并对局部异常表现出有限的敏感性。在本文中,我们重新审视内在的3D表示,并探索预训练点语言模型(PLMs)在ZS 3D异常检测中的潜力。我们提出了BTP(Back To Point),一种新的框架,能够有效对齐3D点云和文本嵌入。具体而言,BTP将多粒度补丁特征与文本表示对齐,用于局部异常检测,同时结合几何描述符以增强对结构异常的敏感性。此外,我们引入了一种联合表示学习策略,利用辅助点云数据以提高鲁棒性并丰富异常语义。在Real3D-AD和Anomaly-ShapeNet上的大量实验表明,BTP在ZS 3D异常检测中实现了优越的性能。代码将在https://github.com/wistful-8029/BTP-3DAD上提供。

英文摘要

Zero-shot (ZS) 3D anomaly detection is crucial for reliable industrial inspection, as it enables detecting and localizing defects without requiring any target-category training data. Existing approaches render 3D point clouds into 2D images and leverage pre-trained Vision-Language Models (VLMs) for anomaly detection. However, such strategies inevitably discard geometric details and exhibit limited sensitivity to local anomalies. In this paper, we revisit intrinsic 3D representations and explore the potential of pre-trained Point-Language Models (PLMs) for ZS 3D anomaly detection. We propose BTP (Back To Point), a novel framework that effectively aligns 3D point cloud and textual embeddings. Specifically, BTP aligns multi-granularity patch features with textual representations for localized anomaly detection, while incorporating geometric descriptors to enhance sensitivity to structural anomalies. Furthermore, we introduce a joint representation learning strategy that leverages auxiliary point cloud data to improve robustness and enrich anomaly semantics. Extensive experiments on Real3D-AD and Anomaly-ShapeNet demonstrate that BTP achieves superior performance in ZS 3D anomaly detection. Code will be available at https://github.com/wistful-8029/BTP-3DAD.

2512.10656 2026-06-09 cs.LG 版本更新

Token Sample Complexity of Attention

注意力的标记采样复杂度

Léa Bohbot, Cyril Letrouit, Gabriel Peyré, François-Xavier Vialard

发表机构 * CNRS, ENS Paris, France(法国国家科学研究中心、巴黎高等师范学院)

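AI summary This paper studies the convergence behavior of attention at extreme sequence lengths, introduces the notion of token sample complexity, analyzes uniform convergence of the attention map and convergence rates for moments of the transformed distribution, and supports the predictions with experiments.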

详情
AI中文摘要

随着大语言模型上下文窗口的扩展,有必要研究注意力在极长序列长度下的行为。我们引入标记采样复杂度:在n个标记上计算的注意力收敛到无限标记极限的速率。我们估计有限n下的收敛界:注意力映射的点wise均匀收敛和变换分布矩的收敛。对于紧支撑(更一般地亚高斯)分布,我们的第一个结果表明,注意力映射在半径为R的球上以速率C(R)/√n收敛,其中C(R)随R指数增长。对于大R,此估计失去实用价值,我们的第二个结果通过建立变换分布矩的收敛速率来解决这一问题。在该情况下,速率是C'(R)/n^β,其中β<1/2,且C'(R)与分布支撑大小的多项式相关。指数β取决于注意力几何和标记分布的谱性质。我们还研究了注意力参数趋于无穷大且softmax趋近于hardmax的 regime,并在此设定下建立了对数收敛速率。合成和真实数据的实验支持我们的预测,并显示预测的减慢在下游准确性中得到反映。

英文摘要

As context windows in large language models continue to expand, it is essential to characterize how attention behaves at extreme sequence lengths. We introduce token sample complexity: the rate at which attention computed on $n$ tokens converges to its infinite-token limit. We estimate finite-$n$ convergence bounds at two levels: pointwise uniform convergence of the attention map, and convergence of moments for the transformed token distribution. For compactly supported (and more generally sub-Gaussian) distributions, our first result shows that the attention map converges uniformly on a ball of radius $R$ at rate $C(R)/\sqrt{n}$, where $C(R)$ grows exponentially with $R$. For large $R$, this estimate loses practical value, and our second result addresses this issue by establishing convergence rates for the moments of the transformed distribution (the token output of the attention layer). In this case, the rate is $C'(R)/n^{\beta}$ with $\beta<\tfrac{1}{2}$, and $C'(R)$ depends polynomially on the size of the support of the distribution. The exponent $\beta$ depends on the attention geometry and the spectral properties of the token distribution. We also examine the regime in which the attention parameter tends to infinity and the softmax approaches a hardmax, and in this setting, we establish a logarithmic rate of convergence. Experiments on synthetic and real data support our predictions and show that the predicted slowdown is reflected in downstream accuracy.
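
The idea of attention on $n$ tokens approaching an infinite-token limit can be checked numerically in a toy setting. The sketch below is an illustration under strong simplifying assumptions that are not taken from the paper: scalar tokens drawn uniformly from [-1, 1], identity key and value maps, and a single scalar query $q$. In that case the limit has the closed form $\coth(q) - 1/q$, so the finite-$n$ error can be measured directly.

```python
import math
import random


def attention_output(query, tokens):
    """1-D softmax attention: softmax(query * token) weighted average of the tokens."""
    m = max(query * t for t in tokens)
    weights = [math.exp(query * t - m) for t in tokens]
    return sum(w * t for w, t in zip(weights, tokens)) / sum(weights)


def infinite_token_limit(query):
    """Closed-form limit for tokens ~ Uniform[-1, 1] and nonzero query: coth(q) - 1/q."""
    return 1.0 / math.tanh(query) - 1.0 / query


def mean_abs_error(query, n, trials, rng):
    """Average |finite-n attention - limit| over independent token samples."""
    limit = infinite_token_limit(query)
    errors = []
    for _ in range(trials):
        tokens = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        errors.append(abs(attention_output(query, tokens) - limit))
    return sum(errors) / trials
```

Comparing `mean_abs_error` at, say, n = 100 and n = 10000 shows the error shrinking as more tokens are sampled, in line with the kind of finite-$n$ rate the abstract describes; the toy setup says nothing about the constants or exponents proved in the paper.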

2603.19183 2026-06-09 cs.RO 版本更新

Sparse Autoencoders Reveal Interpretable and Steerable Features in VLA Models

稀疏自编码器揭示VLA模型中可解释且可操控的特征

Aiden Swann, Lachlain McGranahan, Hugo Buurmeijer, Monroe Kennedy, Mac Schwager

发表机构 * Department of Mechanical Engineering(机械工程系) Department of Computer Science(计算机科学系) Department of Aeronautics & Astronautics(航空与航天系)

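AI summary By training sparse autoencoders, this paper reveals interpretable and steerable features in VLA models and verifies their transferability across different tasks and scenes.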

Comments 24 pages, 11 figures

详情
AI中文摘要

视觉-语言-动作(VLA)模型已成为通用机器人操作的有希望方法。然而,很少有研究系统地探讨了它们在物体、场景和指令之间泛化的原因和时机。为此,我们训练了稀疏自编码器(SAEs)来探索VLA隐藏层激活的内部表示。SAEs学习稀疏字典,通常揭示与模型表示空间中可解释方向对应的特征。我们识别出与运动原语和语义概念相关的SAE特征,包括在多个回合中普遍且因果可控的特征。我们提出了一种度量标准,将特征分类为通用可迁移原语或回合特定的记忆化,为VLA泛化提供了新的视角。我们通过在LIBERO模拟基准和真实世界DROID硬件上的操控实验验证了这些发现。我们发现增强通用和语义特征会诱导出与其意义一致的行为,而消去它们会破坏模型性能。此外,我们展示了操控作为在无提示方向上控制行为的方式。这些结果提供了机制证据,表明VLA可以学习可重用的内部特征,将感知、语言和动作跨任务和场景连接起来。我们的项目页面位于https://drvla.github.io

英文摘要

Vision-Language-Action (VLA) models have emerged as a promising approach for general-purpose robot manipulation. However, little research has mechanistically explored when and why they generalize across objects, scenes, and instructions. To probe internal representations, we train Sparse Autoencoders (SAEs) on the VLA's hidden-layer activations. SAEs learn sparse dictionaries over model activations, often revealing features that correspond to interpretable directions in the model's representation space. We identify SAE features corresponding to motion primitives and semantic concepts, including features that are general across episodes and causally steerable. We propose a metric to categorize features as general transferable primitives or episode-specific memorizations, offering a promising glimpse towards VLA generalization. We validate these findings through steering experiments on both the LIBERO simulation benchmark and on real-world DROID hardware. We find that amplifying general and semantic features induces behaviors consistent with their meanings, whereas ablating them destroys model performance. Furthermore, we demonstrate steering as a way to control behavior in unpromptable directions. Together, these results provide mechanistic evidence that VLAs can learn reusable internal features linking perception, language, and action across tasks and scenes. Our project page is located at https://drvla.github.io
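
The mechanics of a sparse autoencoder and of feature steering can be sketched with a tiny hand-written example. The abstract does not specify the SAE architecture or how steering is applied, so the ReLU encoder, the additive steering rule, and all names below are assumptions based on the common SAE setup rather than the authors' code.

```python
def relu(x):
    """Rectified linear unit on a scalar."""
    return x if x > 0 else 0.0


def sae_encode(activation, enc_weights, enc_bias):
    """Map a hidden activation to non-negative feature strengths.

    `enc_weights` holds one weight row per feature.
    """
    return [relu(sum(w * a for w, a in zip(row, activation)) + b)
            for row, b in zip(enc_weights, enc_bias)]


def sae_decode(features, dec_directions, dec_bias):
    """Reconstruct the activation as a bias plus a weighted sum of feature directions."""
    out = list(dec_bias)
    for strength, direction in zip(features, dec_directions):
        for i, d in enumerate(direction):
            out[i] += strength * d
    return out


def steer(activation, dec_directions, feature_index, strength):
    """Amplify one feature by adding its decoder direction to the activation.

    A negative `strength` pushes the activation away from that feature instead.
    """
    direction = dec_directions[feature_index]
    return [a + strength * d for a, d in zip(activation, direction)]
```

In practice the steered activation would be written back into the model's forward pass at the layer the SAE was trained on; that integration depends on the specific VLA and is left out here.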

2603.18388 2026-06-09 cs.AI cs.MA version update

Reflection in the Dark: Exposing and Escaping the Black Box in Reflective Prompt Optimization


Shiyan Liu, Qifeng Xia, Qiyun Xia, Yisheng Liu, Xinyu Yu, Rui Qu

Affiliations * University of California, Berkeley; Huazhong University of Science and Technology; Hefei University of Technology

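AI summary This paper proposes VISTA, a framework that decouples hypothesis generation from prompt rewriting to make prompt optimization interpretable, and that effectively addresses GEPA's performance degradation under defective seeds.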

Comments Accepted at ACL SRW 2026

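Details
AI Chinese abstract (translated to English)

Automatic prompt optimization (APO) has become a powerful paradigm for improving LLM performance without manual prompt engineering. Reflective APO methods such as GEPA iteratively refine prompts by diagnosing failure cases, but their optimization process remains black-box and label-free, leading to uninterpretable trajectories and systematic failures. We identify and empirically demonstrate four limitations: on GSM8K with a defective seed, GEPA lowers accuracy from 23.81% to 13.50%. We propose VISTA, a multi-agent APO framework that decouples hypothesis generation from prompt rewriting, enabling semantically labeled hypotheses, parallel minibatch verification, and interpretable optimization traces. A two-layer explore-exploit mechanism that combines random restart with epsilon-greedy sampling further helps escape local optima. VISTA restores accuracy to 87.57% on the same defective seed and outperforms the baselines under all conditions on GSM8K and AIME2025.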

English abstract

Automatic prompt optimization (APO) has emerged as a powerful paradigm for improving LLM performance without manual prompt engineering. Reflective APO methods such as GEPA iteratively refine prompts by diagnosing failure cases, but the optimization process remains black-box and label-free, leading to uninterpretable trajectories and systematic failure. We identify and empirically demonstrate four limitations: on GSM8K with a defective seed, GEPA degrades accuracy from 23.81% to 13.50%. We propose VISTA, a multi-agent APO framework that decouples hypothesis generation from prompt rewriting, enabling semantically labeled hypotheses, parallel minibatch verification, and interpretable optimization trace. A two-layer explore-exploit mechanism combining random restart and epsilon-greedy sampling further escapes local optima. VISTA recovers accuracy to 87.57% on the same defective seed and consistently outperforms baselines across all conditions on GSM8K and AIME2025.
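The abstract names random restart and epsilon-greedy sampling but gives no details of how VISTA combines them. The sketch below shows one plausible two-layer arrangement under stated assumptions: an outer layer that restarts with probability `restart_prob`, and an inner layer that picks a random candidate with probability `epsilon` and the best-scoring one otherwise. The function `choose_next`, its parameters, and the example scores are hypothetical.

```python
import random

def choose_next(scores, epsilon, restart_prob, rng):
    """Return ("restart", None) or ("select", index) for one optimization step."""
    if not scores:
        raise ValueError("scores must be non-empty")
    # Outer layer (assumed): occasionally abandon the current candidates and restart.
    if rng.random() < restart_prob:
        return ("restart", None)
    # Inner layer: epsilon-greedy choice over the scored candidates.
    if rng.random() < epsilon:
        return ("select", rng.randrange(len(scores)))  # explore: uniform random pick
    best = max(range(len(scores)), key=scores.__getitem__)
    return ("select", best)  # exploit: highest score

# Made-up minibatch accuracies for three candidate prompts.
rng = random.Random(0)
decision = choose_next([0.24, 0.61, 0.47], epsilon=0.1, restart_prob=0.05, rng=rng)
```

Setting both probabilities to zero reduces this to pure greedy selection, which is the regime in which a search can stall in a local optimum.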

2601.15165 2026-06-09 cs.CL cs.AI cs.LG version update

The Flexibility Trap: Rethinking the Value of Arbitrary Order in Diffusion Language Models


Zanlin Ni, Shenzhi Wang, Yang Yue, Tianyu Yu, Weilin Zhao, Yeguo Hua, Tianyi Chen, Jun Song, Cheng Yu, Bo Zheng, Gao Huang

Affiliations * LeapLab, Tsinghua University; NLPLab, Tsinghua University; Tsinghua University; Alibaba Group; BNRist, Tsinghua University

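AI summary This paper finds that although diffusion language models (dLLMs) allow arbitrary generation order, this flexibility may limit their reasoning ability. Applying standard Group Relative Policy Optimization (GRPO), an approach the authors call JustGRPO, improves reasoning performance while retaining parallel decoding.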

Comments Code and pre-trained models: https://github.com/LeapLabTHU/JustGRPO

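Details
AI Chinese abstract (translated to English)

Diffusion large language models (dLLMs) break the strict left-to-right constraint of traditional language models, allowing tokens to be generated in arbitrary order. Intuitively, this flexibility implies a solution space that is a strict superset of the fixed autoregressive trajectory, in theory unlocking stronger reasoning potential. In this paper, however, we find that for general reasoning tasks (e.g., mathematics and coding), arbitrary-order generation may in fact limit the reasoning potential of dLLMs. We observe that dLLMs tend to exploit this order flexibility to bypass high-uncertainty tokens that are crucial for exploration, which can lead to a premature collapse of solution coverage. This observation prompts us to rethink reinforcement learning approaches for dLLMs, in which considerable complexity, such as handling combinatorial trajectories and intractable likelihoods, is often devoted to preserving this flexibility. We show that reasoning ability can be elicited effectively by abandoning arbitrary order and applying standard Group Relative Policy Optimization (GRPO) instead. Our approach, JustGRPO, is simple yet surprisingly effective (e.g., 89.1% accuracy on GSM8K) while fully retaining the parallel decoding ability of dLLMs. Project page: https://nzl-thu.github.io/the-flexibility-trap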

English abstract

Diffusion Large Language Models (dLLMs) break the rigid left-to-right constraint of traditional LLMs, enabling token generation in arbitrary orders. Intuitively, this flexibility implies a solution space that strictly supersets the fixed autoregressive trajectory, theoretically unlocking superior reasoning potential. However, in this paper, we find that for general reasoning tasks (e.g., mathematics and coding), arbitrary order generation may in fact limit the reasoning potential of dLLMs. We observe that dLLMs tend to exploit this order flexibility to bypass high-uncertainty tokens that are crucial for exploration, which can lead to a premature collapse of solution coverage. This observation motivates a rethink of RL approaches for dLLMs, where considerable complexities, such as handling combinatorial trajectories and intractable likelihoods, are often devoted to preserving this flexibility. We show that effective reasoning can be elicited by simply forgoing arbitrary order and applying standard Group Relative Policy Optimization (GRPO) instead. Our approach, JustGRPO, is minimalist yet surprisingly effective (e.g., 89.1% accuracy on GSM8K) while fully retaining the parallel decoding ability of dLLMs. Project page: https://nzl-thu.github.io/the-flexibility-trap
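For readers unfamiliar with GRPO, its core step is to score each sampled response relative to the other responses drawn for the same prompt, rather than against a learned value function. The sketch below illustrates that group-relative normalization only. It assumes binary correctness rewards and a population standard deviation; whether JustGRPO uses exactly this normalization is not stated in the abstract, and the function name is hypothetical.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and spread of its own group."""
    if not rewards:
        raise ValueError("rewards must be non-empty")
    mu = mean(rewards)
    # Population std is an assumption; some implementations use the sample std.
    sigma = pstdev(rewards)
    # eps keeps the division defined when every reward in the group is equal.
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled answers to one prompt, scored 1.0 if correct and 0.0 otherwise.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct answers receive positive advantages and incorrect ones negative advantages. A group in which all answers score the same yields zero advantage for every sample and so contributes no learning signal.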