arXivDaily: arXiv Daily Academic Digest, updated Monday through Friday
All subject categories: 3405
2605.25547 2026-05-26 cs.RO cs.CV

TapSampling: Inference-Time Sampling with a Task-Progress-Understanding Verifier for Robotic Manipulation

Sizhe Zhao, Shengping Zhang, Shuo Yang, Weiyu Zhao, Shuigen Wang, Xiangyang Ji

Affiliations: Harbin Institute of Technology, China; Harbin Institute of Technology (Weihai) Qingdao Research Institute, China; Iray Technology Co., Ltd., Shandong, China; Tsinghua University, Beijing, China

AI summary: Proposes TapSampling, a framework that samples candidate actions in a low-dimensional latent space via an Action-VAE and selects the best one with a task-progress-prediction verifier, improving multiple generalist policies without finetuning.

Comments ICML 2026. Project Page: https://aipixel.github.io/TapSampling/

Abstract

Existing embodied control research demonstrates remarkable performance improvements by scaling training data and model size. We instead explore inference-time strategy as an alternative axis. Non-deterministic generative models, such as diffusion and autoregressive models, have been widely adopted in the field of embodied control. However, the single-shot inference paradigm limits their performance. In this paper, we propose TapSampling, a plug-and-play framework for inference-time sampling. First, we introduce an Action-VAE that represents actions in a low-dimensional latent space by mapping policy-generated initial actions into a compressed posterior distribution, from which any number of latent samples can be drawn and decoded into candidate actions that approximate the true action distribution. Second, we formulate action verification as task-progress outcome prediction, using the intrinsic sequential structure of robotic datasets to train a semantically grounded verifier for interpretable action selection. Furthermore, TapSampling is a policy-agnostic framework. Extensive experiments in both simulated and real-world environments demonstrate that our method substantially improves multiple generalist policies without further policy finetuning. Code and models are available at the project page.
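
The sample-then-verify loop described in this abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `encode`, `decode`, and `verifier` are hypothetical stand-ins for the Action-VAE encoder, decoder, and task-progress verifier, and the diagonal Gaussian posterior is assumed.

```python
import numpy as np


def sample_and_select(initial_action, encode, decode, verifier, n_samples=8, rng=None):
    """Draw latent samples around a policy action and keep the best-scored candidate.

    encode(action) -> (mu, log_var); decode(z) -> action; verifier(action) -> float.
    All three callables are hypothetical placeholders.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    mu, log_var = encode(initial_action)
    std = np.exp(0.5 * log_var)
    # Reparameterised draws from the (assumed Gaussian) posterior.
    z = mu + std * rng.standard_normal((n_samples, mu.shape[0]))
    candidates = [decode(zi) for zi in z]
    scores = np.array([verifier(a) for a in candidates])
    return candidates[int(np.argmax(scores))], scores


# Toy stand-ins so the sketch runs end to end.
target = np.array([0.5, -0.2])


def toy_encode(action):
    return action.copy(), np.full(action.shape, -2.0)


def toy_decode(z):
    return z


def toy_verifier(action):
    # Higher is better: negative distance to an arbitrary target.
    return -float(np.linalg.norm(action - target))


action, scores = sample_and_select(np.zeros(2), toy_encode, toy_decode, toy_verifier)
```

Because the policy itself is only queried once for `initial_action`, a loop of this shape leaves the policy weights untouched, which is consistent with the plug-and-play claim.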

2605.25546 2026-05-26 cs.RO

Safety-Critical Whole-Body Control for Humanoid Robots via Input-to-State Safe Control Barrier Functions

Kwanwoo Lee, Sanghyuk Park, Gyeongjae Park, Myeong-Ju Kim, Jaeheung Park

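Affiliations: Department of Intelligence and Information, Seoul National University; Robotics Lab, Hyundai Motor Group; Advanced Institute of Convergence Technology

AI summary: Proposes a hierarchical safety-critical whole-body control framework based on input-to-state safe control barrier functions (ISSf-CBFs), combining a kinematic-level whole-body controller, an ISSf-CBF safety filter, and a dynamic-level whole-body controller to guarantee kinematic safety constraints for humanoid robots under unknown disturbances.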

Comments 14 pages, 6 figures

Abstract

Safety-critical control is essential for humanoid robots operating in complex human-centered environments, where physical safety constraints such as joint limits, self-collision avoidance, obstacle avoidance, and workspace boundaries must be satisfied during real-robot operation. However, existing approaches remain limited because kinematic safety guarantees can be degraded in the presence of unknown disturbances, such as model uncertainties, trajectory-tracking errors, and external perturbations. This paper presents a hierarchical safety-critical whole-body control framework for humanoid robots based on input-to-state safe control barrier functions (ISSf-CBFs). The proposed architecture integrates a kinematic-level whole-body controller (KinWBC), an ISSf-CBF safety filter, and a dynamic-level whole-body controller (DynWBC). KinWBC generates nominal joint-motion references from prioritized tasks; the ISSf-CBF filter minimally modifies these references to satisfy kinematic safety constraints under bounded disturbances; and DynWBC tracks the filtered references while enforcing full-body dynamic feasibility and contact stability. Safety constraints are imposed on a whole-body kinematic model, and the ISSf-CBF parameters are conservatively tuned so that the resulting kinematic safety guarantees can be transferred to full-order humanoid dynamics under unknown disturbances. Simulation and real-robot experiments demonstrate that the proposed framework improves safety margins under model mismatch and reliably enforces multiple safety constraints in real time during locomotion, teleoperation, and single-leg balancing with hand control. Project website: https://kwlee365.github.io/SafeWBC-Website/
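
To make the "minimally modifies these references" step concrete, here is a small sketch of a safety filter for a single constraint h(q) >= 0 on a kinematic model with joint-velocity input. It assumes a robustness term of the form ||grad h||^2 / eps added to a standard CBF condition, which is one common ISSf-CBF formulation; the paper's exact condition, parameters, and multi-constraint QP may differ, and `alpha` and `eps` are illustrative values.

```python
import numpy as np


def issf_cbf_filter(u_nom, h, grad_h, alpha=5.0, eps=10.0):
    """Closest velocity to u_nom satisfying grad_h @ u >= -alpha*h + ||grad_h||^2/eps.

    Single-constraint case, so the QP has a closed-form projection.
    The robustness term ||grad_h||^2/eps is an assumed ISSf-style margin.
    """
    u_nom = np.asarray(u_nom, dtype=float)
    a = np.asarray(grad_h, dtype=float)
    b = -alpha * h + (a @ a) / eps
    violation = b - a @ u_nom
    if violation <= 0.0 or not a.any():
        return u_nom  # nominal reference already satisfies the condition
    return u_nom + (violation / (a @ a)) * a


# Example: one joint near its upper limit, with h(q) = q_max - q.
q, q_max = 0.95, 1.0
h = q_max - q
grad_h = np.array([-1.0])
u_safe = issf_cbf_filter(np.array([1.0]), h, grad_h)        # moving toward the limit
u_free = issf_cbf_filter(np.array([-1.0]), h, grad_h)       # moving away from it
```

In this example the filter slows the joint when it heads toward the limit and passes the nominal command through unchanged when it moves away.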

2605.25543 2026-05-26 cs.AI

ADMFormer: An Adaptive-Decomposition Transformer with Time-Varying Masked Spatial Attention for Traffic Forecasting

Ruiwen Gu, Qitai Tan, Yahao Liu, Xiao-Ping Zhang

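Affiliations: Shenzhen International Graduate School; Tsinghua University; School of Computer Science and Engineering; University of Electronic Science and Technology of China; Shenzhen Ubiquitous Data Enabling Key Lab

AI summary: Proposes ADMFormer, which uses an adaptive decomposition mechanism to decouple stable periodic regularities from event-driven fluctuations in traffic series and a time-varying masked spatial attention to sparsify dynamic spatial dependencies, achieving state-of-the-art traffic forecasting performance.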

Abstract

Accurate traffic forecasting is essential for intelligent transportation systems, supporting a wide range of real-world applications. However, it remains challenging due to two key factors: (1) Traffic series contain heterogeneous temporal patterns, where stable periodic regularities coexist with event-driven fluctuations. Existing methods often treat them within a unified representation, limiting their ability to capture fine-grained temporal dynamics. (2) Spatial dependencies among nodes are inherently dynamic and sparse, while dense all-pairs attention often introduces redundant interactions and amplifies noise. To address these issues, we propose ADMFormer, an Adaptive-Decomposition Transformer with Time-Varying Masked Spatial Attention. Specifically, ADMFormer first employs a time-node adaptive gating mechanism to decouple traffic signals into dominant regularities and residual fluctuations that vary across time and nodes. A dual-branch temporal module is then designed to separately capture global periodic dependencies and high-frequency irregular variations from these two decomposed components. Furthermore, ADMFormer introduces a time-varying masked spatial attention that sparsifies spatial interactions based on real-time traffic states, thereby effectively preserving dynamic and informative dependencies. Extensive experiments on four real-world datasets demonstrate that ADMFormer achieves state-of-the-art performance.

2605.25540 2026-05-26 cs.SD cs.LG

A Multimodal Framework for Dementia Detection via Linguistic and Acoustic Representation Learning

Loukas Ilias, Dimitris Askounis

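Affiliations: Decision Support Systems Laboratory, School of Electrical and Computer Engineering, National Technical University of Athens

AI summary: Proposes an end-to-end trainable multimodal deep learning framework that extracts acoustic and textual features with pre-trained models and combines attention-based fusion with mutual information maximization for automatic dementia detection.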

Abstract

Alzheimer's disease (AD) is a progressive neurodegenerative disorder and the leading cause of dementia, affecting memory, reasoning, communication, and daily functioning. Early diagnosis is particularly important, as timely intervention may help slow cognitive decline and improve patient care. Recent studies have demonstrated that spontaneous speech contains valuable linguistic and acoustic biomarkers associated with dementia. However, existing approaches often rely on independently trained modality-specific models, feature concatenation strategies, ensemble methods, or attention-based fusion mechanisms that do not explicitly maximize the dependency between speech and transcript representations. In this work, we propose a multimodal deep learning framework for automatic dementia detection that jointly exploits speech and transcript information in an end-to-end trainable manner. Specifically, speech recordings are divided into 10-second segments and passed through a pre-trained HuBERT model to extract contextualized acoustic representations. To better capture informative temporal speech characteristics, attentive statistics pooling is employed to aggregate frame-level acoustic embeddings. For the textual modality, transcripts are encoded using a pre-trained BERT model, where the [CLS] token representation is used as the linguistic embedding. The acoustic and textual representations are subsequently combined using an attention-based Audio-Text Fusion (AT-Fusion) mechanism. In addition, we introduce a MINE objective to maximize the mutual information between modalities and improve multimodal representation alignment. The fused multimodal representation is finally used for dementia classification. Experiments conducted on the publicly available ADReSS Challenge and PROCESS-2 dataset demonstrate the effectiveness and robustness of the proposed approach for speech-based dementia assessment.
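
The attentive statistics pooling step mentioned in this abstract can be sketched in NumPy as below. It follows the usual formulation (attention-weighted mean and standard deviation over frames, concatenated); the parameter shapes and the random inputs standing in for HuBERT frame embeddings are assumptions for illustration, not details taken from the paper.

```python
import numpy as np


def attentive_stats_pooling(frames, W, b, v):
    """Aggregate (T, D) frame-level embeddings into a (2*D,) utterance vector.

    W: (D, H), b: (H,), v: (H,) are attention parameters (hypothetical shapes).
    """
    scores = np.tanh(frames @ W + b) @ v            # one scalar score per frame
    scores = scores - scores.max()                  # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum() # softmax over time
    mean = weights @ frames                         # attention-weighted mean
    var = weights @ (frames ** 2) - mean ** 2       # attention-weighted variance
    std = np.sqrt(np.clip(var, 1e-8, None))
    return np.concatenate([mean, std])


rng = np.random.default_rng(0)
T, D, H = 50, 16, 8
frames = rng.standard_normal((T, D))  # stand-in for frame-level acoustic embeddings
pooled = attentive_stats_pooling(
    frames, rng.standard_normal((D, H)), np.zeros(H), rng.standard_normal(H)
)
# With all-zero attention parameters the weights are uniform (plain mean/std pooling).
uniform = attentive_stats_pooling(frames, np.zeros((D, H)), np.zeros(H), np.zeros(H))
```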

2605.25537 2026-05-26 cs.RO

Action-Prior Denoising for Smooth Real-Time Chunking

Dongyang Liu, Zhaowen Zheng, Yu Sun, Longxu Zhang, Yixuan Liu, Hao Wan

Affiliations: ROKAE (Shandong) Robot Group Co., Ltd.; School of Mathematical Sciences, University of Chinese Academy of Sciences; The Hong Kong University of Science and Technology (Guangzhou)

AI summary: Proposes Soft RTC, which uses action-prior denoising to simulate execution delay at training time, reducing high-delay action change and improving smoothness while keeping near-naive runtime.

Comments 7 pages, 5 figures, 1 table

Abstract

Real-time chunking (RTC) lets chunked action policies operate under inference delay by conditioning a newly generated action chunk on actions already committed by the previous chunk. Training-time RTC simulates this delay during learning and avoids expensive guidance at deployment, but its binary prefix mask treats all non-prefix tokens as fully unconstrained. This under-models asynchronous execution: early overlap actions are fixed, while later overlap actions remain editable but should still stay close to the previous plan. We propose Soft RTC, a training-time RTC generalization based on action-prior denoising. Soft RTC constructs corrupted overlap tokens from partially denoised states instead of pure noise and injects the aligned previous chunk as the same prior during inference through a lightweight token-wise blending rule. On the 12 released large Kinetix levels, a short soft window nearly matches hard training-time RTC in overall solve rate (0.809 vs. 0.815), while a medium window reduces high-delay action delta and jerk by 9.1% and 9.6% relative to hard RTC. Both variants keep near-naive runtime, unlike inference-time RTC baselines. A small preliminary real-robot sorting study provides additional evidence that training-time RTC can improve completion and that Soft RTC gives the lowest commanded-action finite-difference metrics among the tested policies.
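
One way to picture the difference between a binary prefix mask and a soft overlap window is a per-token blending weight between the previous chunk and the newly generated one. The sketch below is purely illustrative: the linear ramp, the function names, and blending directly in action space are assumptions, and the abstract does not specify the actual token-wise rule.

```python
import numpy as np


def blend_weights(horizon, prefix_len, soft_len):
    """Weight on the previous chunk per token: 1 on the committed prefix,
    a linearly decaying ramp over the soft window (assumed shape), 0 afterwards."""
    w = np.zeros(horizon)
    w[:prefix_len] = 1.0
    ramp = np.linspace(1.0, 0.0, soft_len + 2)[1:-1]
    end = min(horizon, prefix_len + soft_len)
    w[prefix_len:end] = ramp[: end - prefix_len]
    return w


def blend_chunks(prev_chunk, new_chunk, w):
    """Token-wise convex combination of aligned previous and new action chunks."""
    w = w[:, None]
    return w * prev_chunk + (1.0 - w) * new_chunk


horizon, action_dim = 8, 2
w = blend_weights(horizon, prefix_len=2, soft_len=3)
prev_chunk = np.ones((horizon, action_dim))
new_chunk = np.zeros((horizon, action_dim))
blended = blend_chunks(prev_chunk, new_chunk, w)
```

Setting `soft_len=0` recovers a hard prefix mask, which is the sense in which a soft window generalizes the binary one.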

2605.25535 2026-05-26 cs.AI

Personalize-then-Store: Benchmarking and Learning Personalized Memory for Long-horizon Agents

Yeonjun In, Wonjoong Kim, Sangwu Park, Kanghoon Yoon, Chanyoung Park

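Affiliations: KAIST

AI summary: Addresses the problem that existing LLM-based memory systems apply universal, static policies that ignore differences across users in which contexts are worth storing; introduces PerMemBench, the first personalized memory benchmark, and a session-level storage gating framework, showing that personalization substantially improves memory retention while accurate gating remains a key challenge.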

Comments preprint

Abstract

Existing large language model (LLM)-based memory systems apply universal, static policies that overlook a fundamental reality: the contexts that are worth storing in memory are different across users. This misalignment wastes limited memory budget on transient interactions while failing to preserve critical context for long-horizon tasks. To address this gap, we investigate an underexplored question: can LLM-based memory systems learn personalized memory policies? We introduce PerMemBench, the first benchmark for evaluating personalized memory systems, featuring multi-year, multi-domain interaction histories across diverse user personas. We further present the first empirical study of memory personalization, proposing session-level storage gating, a lightweight framework that selectively bypasses memory operations for transient sessions. Our study confirms that personalization yields substantial retention gains under perfect gating, yet reveals that accurate gating remains an open and critical challenge.

2605.25534 2026-05-26 cs.AI

StructBreak: Structural Cognitive Overload-Induced Safety Failures in MLLMs

Yang Luo, Xinran Liu, Tiantian Ji, Zhiyi Yin, Lingyun Peng, Shuyu Li

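Affiliations: Key Laboratory of Trustworthy Distributed Computing and Service (MoE), Beijing University of Posts and Telecommunications; Institute of Computing Technology, Chinese Academy of Sciences

AI summary: Proposes the StructBreak framework, which quantifies Structural Cognitive Overload (SCO) and reveals a higher-order cognitive overload attack paradigm, reaching a 92% average attack success rate on six mainstream MLLMs and showing that the attack bypasses safety filters through a structural channel.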

Comments 23 pages; accepted to Findings of ACL 2026. This paper contains examples of harmful content

Abstract

Multimodal Large Language Models (MLLMs) excel at structural reasoning yet suffer from a sharp logical brittleness in structural consistency. We term this phenomenon Structural Cognitive Overload (SCO), a byproduct of the contention between deep reasoning and safety alignment. However, prior work has predominantly targeted typographic and pixel-level perturbations, leaving the study of SCO largely unexplored. To this end, we propose StructBreak, an automated end-to-end framework designed to quantify SCO. By leveraging StructBreak, we uncover a novel higher-order cognitive overload attack paradigm; notably, this attack operates under a practical black-box setting, requiring no internal model access. Consequently, we utilize this framework to establish a comprehensive benchmark spanning ten diverse threat scenarios. Empirical evaluations on six leading MLLMs reveal that SCO readily triggers toxic generation, yielding a 92% average ASR (up to 97% on Gemini 2.5). To elucidate the mechanism of SCO, we further conduct model-level interpretations spanning attention dynamics, latent space topology, and geometric analysis. Our findings reveal that StructBreak acts as a novel structural channel to circumvent safety filters. Furthermore, the limited efficacy of inherent safety mechanisms underscores that current alignment paradigms are insufficient for the era of complex multimodal reasoning.

2605.25530 2026-05-26 cs.CV

Location Prior Generation via Multi-Source Urban Data Fusion for Low-Altitude Air Mobility

Xiang Xie, Xiaonan Liu

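Affiliations: Politecnico di Milano; School of Natural and Computing Science, University of Aberdeen

AI summary: Proposes the LPGF framework, which fuses multi-source data (Sentinel-2 imagery, UAV telemetry, vehicle GPS trajectories, and OSM footprints) into structured urban location priors, assigns building heights through a three-tier priority hierarchy, and adds a quality-gated shadow-based height estimation module; validation on a Milan dataset gives a worst-case error of about 5.5 m.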

Comments 11 pages, 7 figures, submitted to IEEE Journal of Internet of Things

Abstract

Building height, the third dimension (3D) of urban spatial data, is absent in over 95% of structures in global geospatial databases. For the emerging low-altitude economy, this data gap forces each aerial platform to rely on real-time onboard sensing rather than pre-computed 3D scene geometry. We present the Location Prior Generation Framework (LPGF), a multi-source data fusion pipeline that integrates Sentinel-2 imagery, UAV telemetry, vehicle GPS trajectories, and OpenStreetMap footprints into structured, reusable urban location priors. LPGF assigns building heights through a three-tier priority hierarchy: (1) explicit OSM height tags where available, (2) floor count multiplied by 3.2 m per story where recorded, and (3) building-type default heights otherwise, yielding a worst-case error of approximately 5.5 m. An optional shadow-based height estimation module (SHEM) is activated only when a four-criterion quality gate is satisfied; when any criterion fails, the pipeline routes to structured fallback. On the MiTra A50 Milan dataset, the quality gate correctly identified two imaging failure modes: sub-pixel shadows at 10 m GSD and ground shadow merging at 0.93 m GSD, producing a consistent 27-building prior in both cases. Tier 3 type-default heights were validated against manual floor counts (n=15), achieving MAE=3.07 m within the 5.0 m uncertainty bound. The framework demonstrates that structured, quality-gated fusion of universally available data streams can bootstrap 3D scene coverage for low-altitude urban operations.
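
The three-tier height hierarchy in this abstract maps naturally onto a short fallback function. The sketch below uses the 3.2 m per-story figure from the abstract and the standard OpenStreetMap keys `height`, `building:levels`, and `building`; the type-default values and the generic fallback are hypothetical, since the abstract does not list them.

```python
FLOOR_HEIGHT_M = 3.2  # per-story height stated in the abstract

# Hypothetical type defaults; the actual values are not given in the abstract.
TYPE_DEFAULTS_M = {"residential": 12.0, "industrial": 8.0}
FALLBACK_DEFAULT_M = 10.0  # hypothetical default for unknown building types


def _to_float(value):
    """Parse a tag value as a float, returning None if it is missing or malformed."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def assign_height(tags):
    """Return (height_m, tier) for a dict of OSM-style tags."""
    height = _to_float(tags.get("height"))
    if height is not None:
        return height, 1                      # tier 1: explicit height tag
    levels = _to_float(tags.get("building:levels"))
    if levels is not None:
        return levels * FLOOR_HEIGHT_M, 2     # tier 2: floor count x 3.2 m
    building_type = tags.get("building")
    return TYPE_DEFAULTS_M.get(building_type, FALLBACK_DEFAULT_M), 3  # tier 3


tagged = assign_height({"building": "residential", "height": "21.5"})
by_levels = assign_height({"building": "residential", "building:levels": "4"})
by_type = assign_height({"building": "industrial"})
```

Returning the tier alongside the height keeps the provenance of each value, which matters because the expected error differs by tier.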

2605.25527 2026-05-26 cs.LG cs.CE

DeepSeekMath Meets Order Book: Group-Aware Policy Optimization for High-Frequency Directional Trading

Sayak Charabarty, Souradip Pal

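Affiliations: Department of Computer Science; Northwestern University; School of Electrical and Computer Engineering; Purdue University

AI summary: Studies reinforcement learning for high-frequency trading on limit order books by pairing an order-flow-based state model with policy-gradient methods, and proposes group-aware policy optimization that outperforms a value-based Q-learning baseline in backtests.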

Comments 9 pages, 3 figures

Abstract

This paper studies reinforcement learning for high-frequency trading on limit order books by pairing an Order-Flow-based state model with policy-gradient methods. Instead of value-based RL techniques like tabular Q-learning, our approach deploys policy-based methods such as vanilla PPO and the DeepSeekMath-inspired variants GRPO and GSPO, which use group-normalized updates and downside-aware shaping. On backtests with financial assets AMZN, AAPL, and GOOG under a simplified backtesting setup based on spread-scaled rewards, these new policies improve net average PnL, profitability, and drawdown over the Q-learning baseline. Our results show that (1) Order-Flow signals are an adequate state for policy RL and (2) group-aware PPO surrogates are preferable over value-based baselines.
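
The group-normalized update used by GRPO-style methods replaces a learned value baseline with statistics of a group of sampled rewards. The sketch below shows that normalization; the `downside_weight` scaling of negative rewards is only a hypothetical stand-in for "downside-aware shaping", whose actual form the abstract does not describe.

```python
import numpy as np


def group_normalized_advantages(rewards, downside_weight=1.0, eps=1e-8):
    """Advantages for one group of rollouts: (r - mean) / std within the group.

    downside_weight > 1 amplifies losses before normalization (assumed shaping).
    """
    r = np.asarray(rewards, dtype=float)
    shaped = np.where(r < 0, downside_weight * r, r)
    return (shaped - shaped.mean()) / (shaped.std() + eps)


plain = group_normalized_advantages([1.0, 2.0, 3.0])
shaped = group_normalized_advantages([-1.0, 0.0, 1.0], downside_weight=2.0)
```

These advantages would then weight the clipped PPO surrogate in place of critic-based estimates.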

2605.25525 2026-05-26 cs.LG

SAE-FD: Sparse Autoencoder Feature Distillation for Continual Learning of Large Language Models

Mingxu Zhang, Yuhan Li, Lujundong Li, Dazhong Shen, Hui Xiong, Ying Sun

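Affiliations: The Hong Kong University of Science and Technology (Guangzhou); Nanjing University of Aeronautics and Astronautics; The 63rd Research Institute, National University of Defense Technology, Nanjing

AI summary: To address catastrophic forgetting in continual learning, proposes a method based on sparse autoencoder feature distillation that anchors model representations in a sparse feature space to reduce representational entanglement and enable more targeted regularization, outperforming existing methods on multiple benchmarks.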

Abstract

Continual learning enables large language models to adapt to evolving tasks without retraining from scratch, yet catastrophic forgetting remains a central obstacle. Among continual learning methods, regularization-based approaches are widely used to constrain model updates and reduce forgetting, operating in weight space, gradient space, or output space. However, these dense representation spaces suffer from feature superposition, where multiple concepts are encoded in overlapping dimensions, making it difficult to selectively protect previously learned knowledge without impeding new-task learning. To address this issue, we propose SAE-FD (Sparse Autoencoder Feature Distillation), which anchors model representations in the sparse feature space of a pre-trained Sparse Autoencoder, where dense activations are decomposed into a sparse overcomplete basis that reduces representational entanglement, enabling more targeted regularization with less interference to new-task learning. Experiments on two continual learning benchmarks across three model architectures show that SAE-FD consistently outperforms existing regularization-based methods, achieving up to 52.70% average accuracy with only -0.46 backward transfer.
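
A regularizer of the kind this abstract describes could look roughly like the sketch below: hidden states from the current model and from a frozen reference model are both passed through a fixed sparse autoencoder encoder, and the penalty is measured in that feature space. The ReLU encoder, the mean-squared-error distance, and all shapes are assumptions for illustration; the paper's SAE-FD loss may be defined differently.

```python
import numpy as np


def sae_features(hidden, W_enc, b_enc):
    """Encode dense activations (N, D) into non-negative overcomplete features (N, F)."""
    return np.maximum(0.0, hidden @ W_enc + b_enc)


def feature_distillation_loss(h_new, h_old, W_enc, b_enc):
    """Mean squared distance between SAE features of current and reference activations."""
    diff = sae_features(h_new, W_enc, b_enc) - sae_features(h_old, W_enc, b_enc)
    return float(np.mean(diff ** 2))


rng = np.random.default_rng(0)
D, F = 8, 32                                  # overcomplete: F > D
W_enc = rng.standard_normal((D, F))           # frozen, pre-trained in practice
b_enc = np.zeros(F)
h_old = rng.standard_normal((4, D))           # reference model activations
loss_same = feature_distillation_loss(h_old, h_old, W_enc, b_enc)
loss_drift = feature_distillation_loss(h_old + 0.5, h_old, W_enc, b_enc)
```

In training, a term like this would be added to the new-task loss with a weighting coefficient, so that drift in previously active features is penalized.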

2605.25524 2026-05-26 cs.CV

ProSR: Process-Shaped Spatial Reasoning for Reliable Chain-of-Thought in VLMs

Jiangyang Li, Cong Wan, Changjie Wu, Songlin Dong, Lingjun Zhang, Linzhe Shi, Xu Wang, Zhiheng Ma, Hang Zhang, Mu Xu, Yihong Gong

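Affiliations: Xi’an Jiaotong University; Amap, Alibaba Group; Tsinghua University; Shenzhen University of Advanced Technology

AI summary: To address spurious grounding and tail instability in the spatial reasoning of vision-language models, proposes the ProSR framework, which shapes the reasoning process with a counterfactual invariance penalty and a tail drift penalty, improving answer accuracy as well as trajectory stability and visual dependence.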

Comments 19 pages, 6 figures

Abstract

Reliable spatial reasoning remains a core bottleneck for vision-language models (VLMs). Existing mainstream training paradigms for spatial reasoning largely rely on outcome alignment or process imitation, lacking explicit constraints on the reasoning process, and therefore struggle to ensure genuine visual dependence and stable reasoning trajectories. In this paper, we construct a high-quality CoT dataset covering diverse spatial phenomena and diagnose the model's reasoning process, revealing two typical types of process degradation during reinforcement learning optimization: Spurious Grounding, which bypasses visual evidence, and Tail Instability, where uncertainty abnormally rises in the later stage of reasoning. To address these issues, we propose ProSR, a process-shaping optimization framework for spatial reasoning. Through a Counterfactual Invariance Penalty and a Tail Drift Penalty, ProSR extends the optimization objective from single answer correctness to two process-level dimensions: visual dependence and trajectory stability. Experiments on multiple complex and out-of-distribution spatial reasoning benchmarks show that ProSR improves answer accuracy while generating reasoning trajectories that are more stable and more dependent on visual evidence.

2605.25520 2026-05-26 cs.CL

Is Inference Mediated by Distinct Semantic Structures in LLMs? A Mechanistic Interpretation

Nura Aljaafari, Marco Valentino, André Freitas

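Affiliations: University of Manchester; University of Sheffield; Idiap Research Institute; CRUK National Biomarker Centre, University of Manchester

AI summary: Uses SVD decomposition and activation-steering experiments to study whether Transformer models encode semantic operations in natural language inference, finding that operation-level subspaces partially overlap and causally influence predictions, which indicates that models encode not only that a hypothesis relates to a premise but also, in part, how.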

Comments 26 pages, 16 figures, 13 tables

Abstract

Predicting a label correctly does not necessarily require representing the operation that produces it. Transformer representations are known to carry label-level information, but whether they encode semantic operations producing those labels is unclear. We investigate this in Natural Language Inference using controlled premise-hypothesis pairs that differ by a single semantic transformation. Using layer-wise activations, we estimate operation-level subspaces via SVD and test their causal relevance through activation steering in four open-weight decoder models. Transformation effects are decodable with 84.8-99% accuracy and occupy partially distinct but overlapping subspaces, exceeding random-subspace baselines. Steering experiments show that these directions causally influence predictions, though steerability varies across models; cross-operation steering further reveals structured interference and a dissociation between subspace selectivity and cross-operation independence. These findings indicate that the models encode not only that a hypothesis relates to a premise but also, in part, how it does so, implying that mechanistic analysis and control should operate at the level of semantic operations rather than predicted labels alone.
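
The general recipe of estimating a subspace via SVD and then steering along it can be sketched as follows. This is a toy version under stated assumptions: the subspace is taken from activation differences between transformed and original inputs, the data are synthetic with a planted direction, and steering is a simple additive shift; the paper's exact estimation and intervention procedures are not given in the abstract.

```python
import numpy as np


def operation_subspace(acts_base, acts_transformed, k=1):
    """Top-k right singular vectors of the activation differences, shape (k, D)."""
    diffs = acts_transformed - acts_base
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    return vt[:k]


def steer(activation, direction, alpha):
    """Shift an activation along a unit direction by strength alpha."""
    return activation + alpha * direction


# Synthetic check: plant a known direction and see whether SVD recovers it.
rng = np.random.default_rng(0)
n_pairs, dim = 100, 16
true_dir = rng.standard_normal(dim)
true_dir /= np.linalg.norm(true_dir)
acts_base = rng.standard_normal((n_pairs, dim))
acts_transformed = acts_base + 3.0 * true_dir + 0.1 * rng.standard_normal((n_pairs, dim))

basis = operation_subspace(acts_base, acts_transformed, k=1)
cosine = abs(float(basis[0] @ true_dir))      # sign of a singular vector is arbitrary
steered = steer(acts_base[0], basis[0], alpha=2.0)
```

Comparing against random directions of the same dimension, as the abstract's random-subspace baseline suggests, is what separates a meaningful subspace from chance.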

2605.25518 2026-05-26 cs.CV cs.AI

Cross-Stage Attention Multi-Expert Network for Radiologist-Inspired Breast Ultrasound Diagnosis

Xinyang Zhai, Chong Yang, Ruizhi Zhang

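Affiliations: International Agency for Research on Cancer (IARC); World Health Organization

AI summary: Proposes the Cross-Stage Attention Mixture-of-Experts Network (CSA-MoE-Net), in which a cross-stage attention module enhances multi-level features and a three-branch MoE block learns complementary features from the whole tumor image, tumor core, and boundary; it reaches 96.33% accuracy on a balanced dataset, significantly outperforming the ResNet-18 baseline.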

Abstract

Breast ultrasound imaging is an important noninvasive method for early breast cancer diagnosis, but automatic benign/malignant classification remains challenging due to tumor heterogeneity, blurred boundaries, and data imbalance. To improve feature representation and classification accuracy, this paper proposes the Cross-Stage Attention Mixture-of-Experts Network (CSA-MoE-Net). It adopts a Cross-Stage Attention-enhanced ResNet-18 as the backbone, in which the Cross-Stage Attention module adaptively recalibrates multi-level features, thereby enhancing key tumor features and suppressing redundancy. A three-branch Mixture of Experts (MoE) Block learns complementary features from the Whole Tumor Image, Tumor Core, and Boundary, and an Adaptive Gating Network fuses them to capture morphological, textural, and contextual information. The fused features are denoted as Fused Expert Feature (FEF) in the architecture. Experiments on a balanced dataset of 2,129 breast ultrasound images show that, averaged over 20 independent runs, the model achieves an accuracy of 96.33%, precision of 94.09%, recall of 98.53%, F1-score of 96.25%, and AUC of 99.50%. Compared to the baseline ResNet-18, these metrics improve by 3.01, 0.70, 5.37, 2.98, and 5.42 percentage points, respectively. The proposed mechanism requires no invasive modification and can be seamlessly embedded into VGG-16, DenseNet-121, etc., yielding stable performance gains, thus providing reliable support for computer-aided diagnosis.
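
The gated fusion of three expert branches can be illustrated with a softmax gate over the experts. The sketch below is a generic mixture-of-experts fusion, assuming the gate sees the concatenated expert features and outputs one weight per branch; the layer shapes and names are hypothetical and the paper's Adaptive Gating Network may be structured differently.

```python
import numpy as np


def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()


def fuse_experts(expert_feats, gate_W, gate_b):
    """Fuse (n_experts, D) features into one (D,) vector with learned gate weights.

    gate_W: (n_experts * D, n_experts), gate_b: (n_experts,) are hypothetical shapes.
    """
    gate_input = expert_feats.reshape(-1)            # concatenate expert features
    weights = softmax(gate_input @ gate_W + gate_b)  # one weight per expert
    return weights @ expert_feats, weights


rng = np.random.default_rng(0)
n_experts, D = 3, 16   # whole image, tumor core, boundary branches
expert_feats = rng.standard_normal((n_experts, D))
fused, weights = fuse_experts(
    expert_feats, rng.standard_normal((n_experts * D, n_experts)), np.zeros(n_experts)
)
# A zero gate gives equal weights, i.e. a plain average of the three branches.
avg, uniform_w = fuse_experts(expert_feats, np.zeros((n_experts * D, n_experts)), np.zeros(n_experts))
```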

2605.25517 2026-05-26 cs.AI

What Gets Cited: Competitive GEO in AI Answer Engines

Rahul Vishwakarma, Shushant Kumar, Ratnesh Jamidar

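Affiliations: Sprinklr

AI summary: Studies which factors determine which source is cited first when two retrieved candidate sources compete in an AI answer engine; controlled experiments find that topical relevance and list position are the main drivers.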

Abstract

AI answer engines generate answers from retrieved pages but cite only a few sources. This makes visibility depend not just on ranking, but on being cited. We study competitive Generative Engine Optimization (GEO): when two retrieved candidates compete, what makes one more likely to be cited first? We build a controlled two-document retrieval-augmented generation (RAG) testbed that injects exactly two candidate sources into the model context and measures which source is referenced by the first citation marker in the output. Across six LLMs we execute 252,000 trials, repeated paired comparisons under one factorial program over 18 content factors. In each trial the two sources differ in exactly one factor; we use brand anonymization and counterbalanced source order to separate content effects from position bias. Mixed-effects models show that topical relevance and list position are the biggest drivers of being cited first. Including explicit price information and a recent timestamp also helps consistently. Completeness and trust cues add smaller gains, while formatting-only edits have little impact. We release a reproducible evaluation protocol and a prioritized GEO checklist for practitioners, and we exercised it in an early internal pilot at Sprinklr, where teams reported positive qualitative feedback on workflow usability.
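
The outcome measured in this study, namely which source the first citation marker points to, is easy to express in code. The sketch below assumes bracketed numeric markers such as `[1]` and `[2]`; the marker format and function names are assumptions, since the abstract does not describe the output format.

```python
import re

CITATION = re.compile(r"\[(\d+)\]")  # assumed marker format, e.g. "[1]"


def first_cited_source(answer):
    """Return the source id of the first citation marker, or None if there is none."""
    match = CITATION.search(answer)
    return int(match.group(1)) if match else None


def first_citation_rate(answers, source_id):
    """Share of answers (among those with any citation) that cite source_id first."""
    cited = [first_cited_source(a) for a in answers]
    cited = [c for c in cited if c is not None]
    if not cited:
        return float("nan")
    return sum(c == source_id for c in cited) / len(cited)


answers = [
    "Plan A includes support [1] and is cheaper than plan B [2].",
    "Plan B launched recently [2], while plan A is older [1].",
    "Neither plan is described in the sources.",
]
rate_for_source_1 = first_citation_rate(answers, source_id=1)
```

Counterbalancing, as the abstract describes, would mean running each pair twice with the two sources swapped between positions 1 and 2, so that a content effect can be told apart from a preference for whichever source is listed first.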

2605.25511 2026-05-26 cs.CL

CRPO: Character-centric Group Relative Policy Optimization for Role-aware Reasoning in Role-playing Agents

CRPO:以角色为中心的群体相对策略优化用于角色扮演代理中的角色感知推理

Yihong Tang, Kehai Chen, Liang Yue, Benyou Wang, Min Zhang

发表机构 * Institute of Computing and Intelligence(计算与智能研究院) Harbin Institute of Technology(哈尔滨工业大学) Shenzhen Loop Area Institute (SLAI)(深圳河套学院) The Chinese University of Hong Kong(香港中文大学)

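AI summary: Proposes the CRPO framework, which decouples task logic from stylistic rewards, dynamically adapts optimization constraints, and uses generic responses as negative baselines to counter the loss of character fidelity and the style collapse that reinforcement learning causes in role-playing.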

AI中文摘要

强化学习的最新进展,特别是群体相对策略优化(GRPO),显著提升了大型语言模型的推理能力。然而,将这些以问题为中心的优化方法应用于角色扮演代理时,往往会导致角色保真度下降和风格崩溃,因为它们优先考虑上下文特定的效用而非角色对齐。为了解决这个问题,我们提出了以角色为中心的群体相对策略优化(CRPO),这是一个旨在将强化学习目标与角色扮演任务重新对齐的框架。CRPO通过三种机制提升角色独特性:解耦任务逻辑与风格奖励以解决梯度冲突,根据角色复杂度动态调整优化约束,以及利用通用响应作为负基线以防止模型回归到常见分布。大量实验表明,CRPO在一致性、情感等方面优于现有方法。

英文摘要

Recent advancements in Reinforcement Learning (RL), particularly Group Relative Policy Optimization (GRPO), have significantly enhanced the reasoning capabilities of Large Language Models. However, applying these problem-centric optimization methods to role-playing agents often leads to a loss of character fidelity and style collapse, as they prioritize context-specific utility over persona alignment. To address this, we propose Character-Centric Group Relative Policy Optimization (CRPO), a framework designed to realign RL objectives with the role-playing task. CRPO improves character distinctiveness through three mechanisms: decoupling task logic from stylistic rewards to resolve gradient conflicts, dynamically adapting optimization constraints based on character complexity, and utilizing generic responses as negative baselines to prevent the model from reverting to a common distribution. Extensive experiments demonstrate that CRPO outperforms existing methods on consistency, emotion, and other dimensions.

2605.25508 2026-05-26 cs.LG

Relative Repairability: A Calibration-Based Diagnostic for High-Sparsity Post-Pruning Allocation

相对可修复性:一种基于校准的高稀疏度剪枝后分配诊断方法

Qishi Zhan, Liang He, Minxuan Hu, Ziheng Chen

发表机构 * Marquette University(马凯特大学) Tongji University(同济大学) Cornell University(康奈尔大学) UT Austin(得克萨斯大学奥斯汀分校)

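AI summary: Proposes the Relative Repairability (RR) metric, which uses calibration data to compare the raw activation distortion caused by layerwise pruning with the residual distortion left after channelwise variance-matching repair, as a diagnostic of how repairable pruning damage is at high sparsity; experiments show it improves on existing allocation rules near an architecture-dependent recoverability transition.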

AI中文摘要

在极高稀疏度下,神经网络剪枝不仅决定哪些权重保留,还决定剪枝引起的损伤在网络中的分布位置,以及这些损伤能否通过固定的轻量修复过程恢复。我们通过修复条件稀疏分配的角度研究这一问题。我们引入相对可修复性(RR),一种基于校准的诊断方法,比较逐层剪枝引起的原始激活失真与通道方差匹配修复后的残余失真。RR仅使用未标记的校准数据,估计修复后剩余局部损伤的比例。在CIFAR10和CIFAR100上的ResNet18、ResNet34和VGG16 BN实验中,我们发现RR并非普遍主导的分配规则。相反,它在架构依赖的可恢复性转变附近最为有用,此时标准的结构或幅度基分配先验开始失去可靠性,但修复后恢复尚未完全崩溃。在CIFAR100 ResNet18上,细粒度扫描显示RR在中心转变带上优于ERK,并在该带上部超过LAMP。投影强制消融进一步表明,有上限的ERK可能过度保护投影层,将过多稀疏度转移到常规卷积上,降低修复后恢复。这些结果表明,高稀疏度剪枝不仅应分配保留的权重,还应分配可修复的损伤。

英文摘要

At very high sparsity, neural network pruning does more than decide which weights remain. It also determines where pruning-induced damage is placed across the network, and whether that damage can be recovered by a fixed lightweight repair procedure. We study this problem through the lens of repair-conditioned sparsity allocation. We introduce Relative Repairability (RR), a calibration-based diagnostic that compares the raw activation distortion caused by layerwise pruning with the residual distortion left after channelwise variance-matching repair. RR estimates the fraction of local damage that remains after repair, using only unlabeled calibration data. Across ResNet18, ResNet34, and VGG16-BN on CIFAR10 and CIFAR100, we find that RR is not a universally dominant allocation rule. Instead, it is most useful near an architecture-dependent recoverability transition, where standard structural or magnitude-based allocation priors begin to lose reliability but post-repair recovery has not yet fully collapsed. On CIFAR100 ResNet18, a fine-grained sweep shows that RR improves over ERK across the central transition band and surpasses LAMP near the upper part of this band. A projection-forced ablation further shows that capped ERK can over-protect projection layers, shifting excessive sparsity onto regular convolutions and reducing post-repair recovery. These results suggest that high-sparsity pruning should allocate not only retained weights, but also repairable damage.
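A minimal sketch of the RR ratio follows, under stated assumptions: distortion is taken to be mean squared error between dense and pruned activations, and the repair rescales each pruned channel about its own mean so that its variance matches the dense channel. The abstract names neither choice exactly, so both are illustrative, as are the function names.

```python
def mean(xs):
    return sum(xs) / len(xs)

def variance(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def variance_match(pruned, dense):
    """Rescale one pruned channel about its mean to the dense channel's variance."""
    vp = variance(pruned)
    if vp == 0:
        return list(pruned)  # a constant channel cannot be rescaled
    scale = (variance(dense) / vp) ** 0.5
    mp = mean(pruned)
    return [mp + scale * (x - mp) for x in pruned]

def relative_repairability(dense, pruned):
    """Fraction of activation distortion that remains after repair.

    dense, pruned: lists of channels; each channel is a list of activations
    collected on the same unlabeled calibration inputs.
    """
    raw = sum(mse(p, d) for p, d in zip(pruned, dense))
    repaired = [variance_match(p, d) for p, d in zip(pruned, dense)]
    residual = sum(mse(r, d) for r, d in zip(repaired, dense))
    return residual / raw if raw > 0 else 0.0

# Pruning that only shrinks a channel is fully repairable (RR near 0);
# pruning that scrambles it at equal variance is not (RR equal to 1).
rr_shrunk = relative_repairability([[1, 2, 3, 4]], [[1.75, 2.25, 2.75, 3.25]])
rr_scrambled = relative_repairability([[1, 2, 3, 4]], [[4, 3, 2, 1]])
```

A low value marks a layer whose damage the repair step mostly absorbs, which is the sense in which sparsity could be steered toward repairable damage.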

2605.25503 2026-05-26 cs.CV

Metric--Phase Fields: Decoupling Distance and Sign for Thin-Structure Reconstruction from Unoriented Point Clouds

度量-相位场:从无定向点云中解耦距离和符号以重建薄结构

Jiayi Kong, Xuhui Chen, Chen Zong, Fei Hou, Junhui Hou, Wenping Wang, Ying He

发表机构 * S-Lab, Nanyang Technological University, Singapore; Key Laboratory of System Software (CAS), Institute of Software, Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; School of Mathematics, Nanjing University of Aeronautics; Department of Computer Science, City University of Hong Kong, Hong Kong SAR, China; Department of Computer Science Engineering, Texas A&M University, USA

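AI summary: Proposes Metric-Phase Fields (MPFs), which decouple metric distance from topological phase and combine a gated-metric formulation with residual phase injection to stably reconstruct thin structures and open boundaries from unoriented point clouds.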

AI中文摘要

神经有符号距离函数(SDF)在重建水密流形方面表现出色,但由于严格的内外约束,在薄结构和开放边界上失败。相反,无符号距离场(UDF)适应一般几何形状,但在零水平集处存在梯度奇异性,阻碍优化和提取。我们引入度量-相位场(MPF),一种解耦的隐式表示,将度量邻近性与拓扑相位分离。给定无定向点云,MPF学习(i)无符号度量场$r$和(ii)平滑相位场$θ$,并由此推导出一个有界相位指示器$P=\tanh(βθ)$,在有意义的地方提供软内外线索。我们通过门控度量公式和残差相位注入耦合这两个场,以获得具有稳定近表面梯度的有符号隐函数。相位系数$β$是可学习的,允许MPF自适应控制相变锐度和软符号指示器的饱和程度。在合成和扫描的薄壳及薄板形状上的实验表明,MPF比最近的基于SDF的方法更忠实地保留薄结构和分层结构,同时比基于UDF的方法实现更稳健的训练和更可靠的表面提取。源代码和测试模型见 https://github.com/JIAYI-Scarlett/ICML2026-MPF 。

英文摘要

Neural Signed Distance Functions (SDFs) excel at reconstructing watertight manifolds but fail on thin structures and open boundaries due to strict inside--outside constraints. Conversely, Unsigned Distance Fields (UDFs) accommodate general geometries but suffer from gradient singularities at the zero-level set, hindering optimization and extraction. We introduce Metric--Phase Fields (MPFs), a decoupled implicit representation that separates metric proximity from topological phase. Given an unoriented point cloud, MPFs learn (i) an unsigned metric field $r$ and (ii) a smooth phase field $θ$, for which we derive a bounded phase indicator $P=\tanh(βθ)$ that provides soft inside--outside cues where they are meaningful. We couple the two fields via a gated-metric formulation with a residual phase injection to obtain a signed implicit function with stable near-surface gradients. The phase coefficient $β$ is learnable, allowing MPFs to adaptively control the sharpness of the phase transition and the degree of saturation of the soft sign indicator. Experiments on both synthetic and scanned thin-shell and thin-plate shapes demonstrate that MPFs preserve thin and layered structures more faithfully than recent SDF-based methods, while also enabling more robust training and more reliable surface extraction than UDF-based approaches. Source code and test models are available at https://github.com/JIAYI-Scarlett/ICML2026-MPF.
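The role of $β$ as a sharpness control can be read off from elementary properties of the hyperbolic tangent. The identities below are standard calculus facts, stated here assuming $β > 0$. They are not derivations taken from the paper, and they say nothing about the gated-metric coupling, which the abstract does not specify.

```latex
\[
P(\theta) = \tanh(\beta\theta) \in (-1, 1), \qquad
\frac{\partial P}{\partial \theta} = \beta\,\bigl(1 - P^{2}\bigr), \qquad
\lim_{\beta \to \infty} \tanh(\beta\theta) = \operatorname{sign}(\theta) \quad (\theta \neq 0).
\]
```

The slope at $\theta = 0$ equals $β$, so a larger $β$ gives a sharper transition and earlier saturation toward a hard sign, while a smaller $β$ keeps the indicator soft over a wider band of $\theta$.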

2605.25502 2026-05-26 cs.CL cs.AI

A Controlled Synthetic Benchmark for Educational Aspect-Based Sentiment Analysis

面向教育方面情感分析的可控合成基准

Yehudit Aperstein, Alexander Apartsin

发表机构 * Intelligent Systems, Afeka Academic College of Engineering(阿法卡学术工程智能系统学院) School of Computer Science, Faculty of Sciences, Holon Institute of Technology(霍洛技术学院计算机科学学院)

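AI summary: To address the scarcity of labeled data in education, proposes a controlled synthetic benchmark of 10,000 synthetic course reviews covering 20 pedagogical aspects, and evaluates task difficulty and synthetic-to-real transfer experimentally.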

Comments 39 pages, 14 figures

AI中文摘要

教育方面情感分析(ABSA)可以支持课程改进,但带有方面标签的学生反馈仍然稀缺,因为教育评论是私有的、特定于机构的且标注成本高昂。本研究引入了一个面向教育ABSA的可控合成基准,该基准由10,000条合成课程评论构建,具有明确的训练-验证-测试划分,以及一个涵盖教学质量、评估与课程管理、学习需求、学习环境和参与度的20方面教学模式。该语料库通过采样的目标标签、采样的细微属性以及经过三轮评审-编辑流程优化的真实感提示生成。在该基准上,使用TF-IDF、两阶段变换器和联合编码器的局部基线表明该任务并非易事;最强的未调优模型BERT在留出集上的检测微F1得分为0.2760,而一个适度的低学习率BERT调度将其提升至0.2930。基于gpt-5.2的全测试GPT推理在零样本模式下达到0.2519微F1,在使用基于检索的少样本提示时达到0.2501,使批量推理高于经典基线并接近紧凑的联合编码器。在来自Herath等人的2,829条映射学生反馈评论上进行的保守外部评估中,BERT在9个方面重叠上的微F1得分为0.4593,表明部分合成到真实的迁移。真实性和忠实度分析作为生成器诊断报告,阐明了基准如何稳定以及标签噪声仍然存在的位置。因此,本研究贡献了一个合成教育ABSA语料库、一个文档化的生成过程以及一个可复现的基准设置,适用于公共标注数据仍然难以获得的领域。

英文摘要

Educational aspect-based sentiment analysis (ABSA) can support course improvement, but public aspect-labeled student feedback remains scarce because educational reviews are private, institution-specific, and expensive to annotate. This study introduces a controlled synthetic benchmark for educational ABSA built from 10,000 synthetic course reviews with explicit train-validation-test splits and a 20-aspect pedagogical schema spanning instructional quality, assessment and course management, learning demand, learning environment, and engagement. The corpus is generated with sampled target labels, sampled nuance attributes, and a realism-tuned prompt refined through a three-cycle judge-editor procedure. On the resulting benchmark, local baselines with TF-IDF, two-step transformers, and joint encoders show that the task is nontrivial; the strongest untuned model, BERT, reaches a held-out detection micro-F1 of 0.2760, while a modest lower-rate BERT schedule improves this to 0.2930. Full-test GPT-based inference with gpt-5.2 reaches 0.2519 micro-F1 in zero-shot mode and 0.2501 with retrieval-based few-shot prompting, placing batch inference above the classical baseline and close to the compact joint encoders. A conservative external evaluation on 2,829 mapped student-feedback reviews from Herath et al. yields a micro-F1 of 0.4593 for BERT on a 9-aspect overlap, indicating partial synthetic-to-real transfer. Realism and faithfulness analyses are reported as generator diagnostics that clarify how the benchmark was stabilized and where label noise remains. The study therefore contributes a synthetic educational ABSA corpus, a documented generation procedure, and a reproducible benchmark setting for a domain in which public labeled data remain difficult to obtain.
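For readers unfamiliar with the headline metric, here is a minimal sketch of micro-averaged F1 for multi-label aspect detection. It assumes each review carries a set of aspect labels; the aspect names and the toy predictions are invented for illustration, and the benchmark's exact scoring script may differ.

```python
def micro_f1(gold, pred):
    """Micro-F1 over multi-label predictions.

    gold, pred: lists of sets of aspect labels, one set per review.
    Counts are pooled over all reviews before computing F1.
    """
    tp = sum(len(g & p) for g, p in zip(gold, pred))  # correctly detected aspects
    fp = sum(len(p - g) for g, p in zip(gold, pred))  # spurious aspects
    fn = sum(len(g - p) for g, p in zip(gold, pred))  # missed aspects
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Hypothetical aspect labels, not the paper's 20-aspect schema.
gold = [{"workload", "pacing"}, {"feedback"}, set()]
pred = [{"workload"}, {"feedback", "pacing"}, {"exams"}]
score = micro_f1(gold, pred)  # tp=2, fp=2, fn=1
```

Because counts are pooled, frequent aspects dominate the score, which is worth keeping in mind when comparing values across label sets of different sizes, such as the 20-aspect benchmark and the 9-aspect external overlap.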

2605.25500 2026-05-26 cs.CV

Full-4D: Generating Full-Scope 4D Scenes from a Single-View Video

Full-4D:从单视角视频生成全范围4D场景

Tingxi Chen, Ke Hao, Yabo Chen, Zhengxue Cheng, Rong Xie, Li Song, Haibin Huang, Chi Zhang, Xuelong Li

发表机构 * Shanghai Jiao Tong University(上海交通大学) Institute of Artificial Intelligence, China Telecom (TeleAI)(中国电信人工智能研究院)

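AI summary: Proposes a framework that converts a single-view video into a full-scope 4D scene via multi-view video synthesis followed by optimization-based 4D reconstruction, introducing the large-scale Real-MV-4D dataset, a diffusion model with fused time-view attention, and a Flow Matching Distillation loss to achieve high fidelity and geometric consistency.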

AI中文摘要

从单视角视频生成4D场景本质上是不适定的:单一视角缺乏恢复完整、动态场景所需的信息。现有方法通常局限于单目视频、简单的3D效果或仅在原始视角附近的小视角扰动,未能实现真正的4D生成。同时,缺乏捕捉全范围4D场景的大规模同步多视角视频数据集进一步阻碍了这一方向的发展。我们提出了一种新颖的单视角视频到4D框架,将全范围4D生成视为多视角视频合成,然后从生成的视角进行基于优化的4D重建。为了端到端地实例化这一公式,我们做出了三个关键贡献。首先,我们引入了Real-MV-4D,一个大规模数据集,包含在多样化真实环境中捕获的同步多视角视频,以提供4D监督。其次,我们训练了一个多视角视频扩散模型,该模型由一种新颖的融合时间(T)-视图(V)注意力机制驱动,直接将几何重投影先验和显式相机条件嵌入到其视图-时间交互中。与基本的特征融合不同,这种直接绑定严格地将生成过程与物理3D先验对齐,以生成密集、同步的T×V视频网格。第三,我们不依赖非交互且不一致的2D视频插值,而是将合成的多视角视频提升为显式4D表示(即4DGS),并通过流匹配蒸馏损失进行正则化,利用多视角先验改进新视角渲染。大量实验表明,我们的方法在视觉保真度和几何一致性方面均优于现有方法,实现了从单视角视频生成全范围4D场景。

英文摘要

Generating 4D scenes from a single-view video is inherently ill-posed: a single viewpoint lacks the information needed to recover a complete, dynamic scene with full coverage. Existing methods are typically limited to monocular videos, simple 3D effects, or only small viewpoint perturbations around the original viewpoint, falling short of true 4D generation. Meanwhile, the lack of large-scale datasets capturing full-scope 4D scenes with synchronized multi-view videos further hinders progress in this direction. We propose a novel single-view video-to-4D framework that casts full-scope 4D generation as multi-view video synthesis followed by optimization-based 4D reconstruction from the generated views. To instantiate this formulation end-to-end, we make three key contributions. First, we introduce Real-MV-4D, a large-scale dataset of synchronized multi-view videos captured in diverse real-world environments to provide the 4D supervision. Second, we train a multi-view video diffusion model driven by a novel fused time(T)-view(V) attention mechanism that directly embeds geometric reprojection priors and explicit camera conditioning into its view-time interactions. Unlike basic feature fusion, this direct binding strictly aligns the generation process with physical 3D priors to produce a dense, synchronized T$\times$V video grid. Third, rather than relying on non-interactive and inconsistent 2D video interpolations, we lift the synthesized multi-view videos into an explicit 4D representation (i.e., 4DGS), regularized by a Flow Matching Distillation loss that exploits the multi-view prior to improve novel-view rendering. Extensive experiments demonstrate that our method outperforms existing approaches in both visual fidelity and geometric consistency, enabling full-scope 4D scene generation from single-view videos.

2605.25499 2026-05-26 cs.LG

Accelerated Dynamic Importance Weighting with Versatile Divergence-Minimizing Estimators

加速动态重要性加权与通用散度最小化估计器

Tongtong Fang, Nan Lu, Gang Niu, Kenji Fukumizu, Masashi Sugiyama

发表机构 * The Institute of Statistical Mathematics(统计数学研究所) University of Bristol(布里斯托大学) RIKEN Center for Advanced Intelligence Project(RIKEN高级智能项目中心) The University of Tokyo(东京大学)

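AI summary: For joint distribution shift, proposes the accelerated dynamic importance weighting (ADIW) framework, which uses lightweight projected gradient descent and general divergence minimization to reach state-of-the-art performance at higher efficiency.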

AI中文摘要

重要性加权(IW)是解决联合分布偏移的黄金方法,其中训练数据和测试数据的联合分布不同。为解决此问题,IW估计测试与训练密度比作为重要性权重,并相应地重新加权训练损失。最近动态IW(DIW)的进展将权重估计集成到模型训练中,实现了深度模型的可扩展IW,并在大型现代数据集上取得了强劲性能。尽管有前景,DIW仍存在两个局限。首先,它通过在每个小批量中求解核均值匹配(KMM)诱导的优化问题至收敛,导致大量计算开销。其次,它仅依赖KMM进行权重估计,而IW文献包含基于不同散度度量的多种估计方法。本文提出加速动态IW(ADIW),一个统一且高效的联合分布偏移下深度学习IW框架。ADIW执行少量轻量投影梯度下降更新,从先前更新的权重热启动,显著提高效率。此外,ADIW将DIW推广为一个统一的散度最小化框架,以即插即用方式支持多种权重估计方法,包括基于Kullback-Leibler散度、平方距离和Wasserstein-1距离的方法。我们在温和条件下建立了ADIW的收敛保证,实证结果表明ADIW在实现最先进IW性能的同时,效率大幅提升。

英文摘要

Importance weighting (IW) is a golden solver for joint distribution shift, where the joint distributions differ between the training and test data. To solve this problem, IW estimates test-to-training density ratios as importance weights and reweights the training losses accordingly. Recent advances in dynamic IW (DIW) integrate weight estimation into model training, enabling scalable IW for deep models and achieving strong performance on large modern datasets. Despite its promise, DIW remains limited in two aspects. First, it incurs substantial computational overhead by solving a kernel mean matching (KMM)-induced optimization problem to convergence in every mini-batch. Second, it relies solely on KMM for weight estimation, whereas the IW literature contains diverse estimation methods based on different divergence measures. In this paper, we propose accelerated DIW (ADIW), a unified and efficient IW framework for deep learning under joint distribution shift. ADIW performs a few lightweight projected gradient descent updates that warm-start from previously updated weights, substantially improving efficiency. Moreover, ADIW generalizes DIW into a unified divergence-minimization framework that supports diverse weight-estimation methods in a plug-and-play manner, including those based on the Kullback-Leibler divergence, squared distance, and Wasserstein-1 distance. We establish convergence guarantees for ADIW under mild conditions, and empirical results demonstrate that ADIW achieves state-of-the-art IW performance while being substantially more efficient.

2605.25495 2026-05-26 cs.RO cs.CV

RepSAM: Bridging Foundation Models to Robotic Vision via Representation-Guided Adaptation

RepSAM: 通过表示引导的适应连接基础模型与机器人视觉

Wenhui Chu

发表机构 * Department of Computer Science and Engineering, Texas A&M University(德克萨斯农工大学计算机科学与工程系)

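AI summary: To address the performance drop of foundation models in unstructured robotic vision scenes, proposes RepSAM, which performs parameter-efficient fine-tuning with a CKA-guided rank allocation strategy and a multi-modal fusion module, reaching 97.9% of full fine-tuning performance with 158x fewer trainable parameters.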

Comments Accepted to IJCAI-ECAI 2026 (Special Track on AI and Robotics). 8 pages, 4 figures, 12 tables

AI中文摘要

尽管SAM等基础模型具有零样本能力,但在非结构化环境中的机器人感知仍然具有挑战性。本文将性能下降归因于Transformer层间非均匀的表示偏移:浅层表现出显著的领域差距(CKA < 0.5),而深层则有效迁移(CKA > 0.7)。基于这一观察,我们提出RepSAM,一种表示引导的参数高效微调(PEFT)框架,用于将基础模型适应到机器人视觉。RepSAM采用理论基础的CKA引导秩分配策略,结合多模态融合模块,以稳健处理具有挑战性的机器人场景,包括透明物体和杂乱场景。在六个基准和机器人操作任务上的实验评估表明,RepSAM达到了全微调性能的97.9%(89.0% vs. 90.9% mIoU),同时将可训练参数减少了158倍(从632M降至4.0M)。RepSAM在单个A100 GPU上仅需4小时训练(比全微调减少96倍,全微调需要384 GPU小时),即可比DoRA提高7.9%的mIoU。这些改进具有统计显著性(p < 0.01),并在机器人操作成功率上比LoRA(RGB)基线绝对提高了12.0%。

英文摘要

Robotic perception in unstructured environments remains challenging despite the zero-shot capabilities of foundation models such as SAM. This work attributes performance degradation to non-uniform representation shifts across transformer layers: shallow layers exhibit substantial domain gaps (CKA < 0.5), whereas deep layers transfer effectively (CKA > 0.7). Based on this observation, we propose RepSAM, a representation-guided parameter-efficient fine-tuning (PEFT) framework for adapting foundation models to robotic vision. RepSAM employs a theoretically grounded CKA-guided rank allocation strategy combined with a multi-modal fusion module for robust handling of challenging robotic scenarios, including transparent objects and cluttered scenes. Experimental evaluation across six benchmarks and robotic manipulation tasks demonstrates that RepSAM achieves 97.9% of full fine-tuning performance (89.0% vs. 90.9% mIoU) while reducing trainable parameters by 158x (from 632M to 4.0M). RepSAM outperforms DoRA by 7.9% mIoU with just 4 hours of training on a single A100 GPU (a 96x reduction from full fine-tuning, which takes 384 GPU-hours). These improvements are statistically significant (p < 0.01) and translate to a 12.0% absolute improvement in robotic manipulation success rates over the LoRA (RGB) baseline.
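The layer-wise CKA scores quoted above can be made concrete with a short sketch of linear CKA (centered kernel alignment) between two activation matrices. This uses the standard linear-kernel definition and assumes that variant; the abstract does not say which kernel RepSAM uses, and the rank allocation rule itself is not specified there, so it is not sketched.

```python
def center_columns(X):
    """Subtract each feature's mean over samples (X is a list of rows)."""
    n = len(X)
    means = [sum(col) / n for col in zip(*X)]
    return [[v - m for v, m in zip(row, means)] for row in X]

def cross_sq_frobenius(X, Y):
    """Squared Frobenius norm of X^T Y for row-major feature matrices."""
    total = 0.0
    for x_col in zip(*X):
        for y_col in zip(*Y):
            total += sum(a * b for a, b in zip(x_col, y_col)) ** 2
    return total

def linear_cka(X, Y):
    """Linear CKA between (n_samples x d1) and (n_samples x d2) activations."""
    Xc, Yc = center_columns(X), center_columns(Y)
    num = cross_sq_frobenius(Xc, Yc)
    den = (cross_sq_frobenius(Xc, Xc) * cross_sq_frobenius(Yc, Yc)) ** 0.5
    return num / den if den else 0.0

# Toy activations for one layer on four inputs (made-up numbers).
X = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [1.0, 3.0]]
rotated = [[-b, a] for a, b in X]            # same features, rotated 90 degrees
same = linear_cka(X, X)
rot = linear_cka(X, rotated)
unrelated = linear_cka([[1.0], [2.0], [3.0], [4.0]],
                       [[1.0], [-1.0], [-1.0], [1.0]])
```

In a setup like the one described, `X` and `Y` would hold the same layer's activations on source-domain and robot-domain inputs; scores near 1 suggest the layer transfers as-is, and low scores flag a domain gap.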

2605.25492 2026-05-26 cs.LG

SafetyRepro: Configuration-Conditional Rank Instability on Alignment Benchmarks

SafetyRepro: 对齐基准上的配置条件排名不稳定性

Yanhang Li, Zhichao Fan, Zexin Zhuang

发表机构 * Northeastern University, Boston, MA, USA(东北大学) University of Illinois Urbana-Champaign, Urbana, IL, USA(伊利诺伊大学厄巴纳-香槟分校) Southern Methodist University, Dallas, TX, USA(南卫理公会大学)

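AI summary: Using a theoretical proposition and a commit-stamped evaluation protocol, shows that pairwise model comparisons on alignment benchmarks (e.g., "A is safer than B") can strictly reverse because of under-specified configuration choices.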

AI中文摘要

从基础模型基准中得出的成对模型比较(“A比B更安全”)被视为定量结论,但依赖于基准论文未充分指定的工具选择。我们在此原语上闭合了一个理论-基准循环:一个有限包络命题,将可测量的成对不一致率与严格排序是否允许配置对反转联系起来,并配以一个提交戳评估协议,该协议在广泛引用的对齐基准上实现了这一命题。在我们测试的每个基准上,仅配置选择就能翻转成对结论;该命题隔离了这种严格反转的失败模式。

英文摘要

Pairwise model comparisons drawn from foundation-model benchmarks ("A is safer than B") are read as quantitative verdicts but hinge on harness choices benchmark papers under-specify. We close one theory-benchmark loop on this primitive: a finite-envelope proposition tying a measurable pairwise-disagreement rate to whether the strict ordering admits a configuration-pair reversal, paired with a commit-stamped evaluation protocol that operationalises it on widely cited alignment benchmarks. On every benchmark we test, configuration choice alone can flip the pairwise verdict; the proposition isolates this strict-reversal failure mode.
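The failure mode can be sketched in a few lines. The score table, the configuration names, and the specific definition of the disagreement rate (share of configuration pairs with opposite strict verdicts) are all assumptions made for illustration; the paper's finite-envelope proposition is not reproduced here.

```python
from itertools import combinations

def pairwise_disagreement(scores, a, b):
    """Find configuration pairs that strictly reverse the a-vs-b verdict.

    scores: {config_name: {model_name: safety_score}} (hypothetical layout)
    Returns (disagreement_rate, reversing_config_pairs).
    """
    def verdict(cfg):
        diff = scores[cfg][a] - scores[cfg][b]
        return (diff > 0) - (diff < 0)  # +1: a safer, -1: b safer, 0: tie

    pairs = list(combinations(sorted(scores), 2))
    # A strict reversal needs opposite non-tied verdicts in the two configs.
    reversals = [(c1, c2) for c1, c2 in pairs if verdict(c1) * verdict(c2) == -1]
    rate = len(reversals) / len(pairs) if pairs else 0.0
    return rate, reversals

# Made-up scores under three harness configurations.
scores = {
    "greedy": {"A": 0.91, "B": 0.88},
    "temp0.7": {"A": 0.86, "B": 0.89},
    "sys_prompt": {"A": 0.90, "B": 0.90},
}
rate, reversals = pairwise_disagreement(scores, "A", "B")
```

Here one of three configuration pairs flips the verdict, so a claim that "A is safer than B" would hold only if the harness configuration were pinned.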

2605.25489 2026-05-26 cs.AI cs.HC

ATWL: A Formal Language for Representing, Comparing, and Reusing Visual Analytics Workflows

ATWL:一种用于表示、比较和重用可视化分析工作流的正式语言

Natalia Andrienko, Gennady Andrienko, Jürgen Bernard, Michael Sedlmair

发表机构 * Fraunhofer Institute IAIS(弗劳恩霍夫研究所IAIS) Lamarr Institute for Machine Learning and Artificial Intelligence(拉马尔机器学习与人工智能研究所) City St George’s, University of London(伦敦大学城市圣乔治学院) University of Zurich(苏黎世大学) University of Stuttgart(斯图加特大学)

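AI summary: Proposes the ATWL language, which formally represents visual analytics workflows through a modular ontology and standardised intents, and uses LLMs to extract workflows, enabling structural comparison and reuse.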

AI中文摘要

可视化分析(VA)工作流本质上是复杂的,涉及数据转换、特征工程、视觉表示和人类解释。它们通常以非结构化的散文形式描述,阻碍了系统比较、成熟策略的重用以及新手的培训。我们提出了工件-转换工作流语言(ATWL),这是一种领域无关的声明式语言,通过捕获工作流的结构和潜在分析意图来形式化表示VA工作流。ATWL构建于一个由八种工件类型(实体、特征、排列、可视化、模式、模型、知识、规范)和以标准化意图(例如,定义单元、表征、情境化、抽象)为特征的转换组成的模块化本体之上。为了证明形式化工作不必阻碍采用,我们通过监督式LLM代理交互从研究论文中提取工作流,将人类角色简化为审查和细化。利用这一过程,我们从已发表的VA论文中构建了一个包含17个ATWL工作流的库。跨工作流分析揭示了结构规律性——一个反复出现的元结构、重复出现的主题、可重用的构建块、多样的迭代策略以及跨领域等价性——这些在散文中是不可见的。我们进一步通过一个受控实验评估了实际效用,在该实验中,同一个LLM处理了两个分析问题,提供的库要么是原始论文,要么是ATWL表示。两种形式都提供了有用的建议,但形式化表示系统地添加了显式迭代结构、类型化数据流、片段级适应来源以及支持扩展的紧凑性,超出了散文库在LLM上下文中的容量。ATWL使得从叙事描述向形式化表示、可比较和可重用的分析知识过渡成为可能。

英文摘要

Visual analytics (VA) workflows are inherently complex, involving data transformation, feature engineering, visual representation, and human interpretation. They are typically described in unstructured prose, hindering systematic comparison, reuse of proven strategies, and training of novices. We present Artifact-Transform Workflow Language (ATWL), a domain-agnostic, declarative language that formally represents VA workflows by capturing their structure and underlying analytical intent. ATWL is built upon a modular ontology of eight artifact types (entities, features, arrangements, visualisations, patterns, models, knowledge, specifications) and transforms characterised by standardised intents (e.g., define-unit, characterise, contextualise, abstract). To show that formalisation effort need not impede adoption, we extract workflows from research papers through supervised interaction with LLM agents, reducing the human role to review and refinement. Using this process, we constructed a library of seventeen ATWL workflows from published VA papers. Cross-workflow analysis reveals structural regularities -- a recurrent meta-structure, recurring motifs, reusable building blocks, diverse iterative strategies, and cross-domain equivalences -- that remain invisible in prose. We further evaluate practical utility through a controlled experiment in which the same LLM addressed two analytical problems with the library supplied either as original papers or as ATWL representations. Both forms enabled useful recommendations, but the formal representation systematically added explicit iteration structure, typed data flow, fragment-level adaptation provenance, and compactness supporting scaling beyond what prose libraries can fit in an LLM's context. ATWL enables a transition from narrative descriptions to formally represented, comparable, and reusable analytical knowledge.

2605.25488 2026-05-26 cs.CV cs.AI cs.MM

Test-Time Self-Adaptive Conditioning for Stable Audio-Driven Talking-Head Generation

测试时自适应条件用于稳定音频驱动说话头生成

Zhicheng Zhang, Lei Wang, Yu Zhang, Yongsheng Gao

发表机构 * School of Business, University of New South Wales (UNSW)(新南威尔士大学商学院) School of Engineering and Built Environment, Griffith University(格里菲斯大学工程与建筑环境学院)

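AI summary: Proposes Test-Time Self-Adaptive Conditioning (TT-SAC), a parameter-free inference framework that adjusts the conditioning representation through a feedback loop to improve identity preservation, temporal coherence, and perceptual quality in pretrained talking-head generators.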

Comments Research report

AI中文摘要

音频驱动的说话头生成在AniTalker、FLOAT和Sonic等最新模型中取得了显著进展。尽管取得了成功,大多数现有方法在推理阶段依赖单一静态参考图像来调节整个视频生成过程。这种静态条件范式通常导致固定身份特征与动态面部运动之间的不匹配,从而引起身份漂移、时间不一致性和感知质量下降。我们引入了测试时自适应条件(TT-SAC),这是一个无需参数的推理框架,使预训练的说话头生成器能够在推理过程中调整其条件表示,而无需重新训练、梯度更新或额外监督。TT-SAC不是将参考肖像视为不可变的,而是将生成器与其编码器组合成一个反馈循环:生成器自身的输出被重新编码,以构建一个更符合合成序列时间动态的精细条件表示。单次自适应步骤近似于生成过程的自洽平衡,稳定了跨时间的身份和运动。我们进一步提供了理论分析,表明在温和的Lipschitz假设下,测试时条件自适应减少了特征方差并提高了生成稳定性,同时表现出原则性的偏差-方差权衡,该权衡决定了自适应最优强度。在最新说话头生成器和基准数据集上的大量实验表明,在唇形同步准确性、时间一致性、身份保持和感知保真度方面均有持续改进。TT-SAC提供了一种模型无关且无需训练的策略来增强生成视频模型,将测试时条件自适应确立为稳定音频驱动肖像动画的有效机制。

英文摘要

Audio-driven talking-head generation has achieved remarkable progress with recent models such as AniTalker, FLOAT, and Sonic. Despite their success, most existing approaches rely on a single static reference image to condition the entire video generation process at inference stage. This static conditioning paradigm often creates a mismatch between fixed identity features and dynamically evolving facial motion, leading to identity drift, temporal inconsistency, and degraded perceptual quality. We introduce Test-Time Self-Adaptive Conditioning (TT-SAC), a parameter-free inference framework that enables pretrained talking-head generators to adapt their conditioning representations during inference without retraining, gradient updates, or additional supervision. Instead of treating the reference portrait as immutable, TT-SAC composes the generator with its encoder in a feedback loop: the generator's own outputs are re-encoded to construct a refined conditioning representation that better aligns with the temporal dynamics of the synthesized sequence. A single adaptation step approximates a self-consistent equilibrium of the generative process, stabilizing identity and motion across time. We further provide theoretical analysis showing that test-time conditioning adaptation reduces feature variance and improves generative stability under mild Lipschitz assumptions, while exhibiting a principled bias-variance tradeoff that governs the optimal strength of adaptation. Extensive experiments on state-of-the-art talking-head generators and benchmark datasets demonstrate consistent improvements in lip-sync accuracy, temporal coherence, identity preservation, and perceptual fidelity. TT-SAC offers a model-agnostic and training-free strategy for enhancing generative video models, establishing test-time conditioning adaptation as an effective mechanism for stabilizing audio-driven portrait animation.

2605.25479 2026-05-26 cs.CV

MAIL++: Multi-Modal Bi-directional Agent Layer for Vision-Language Models

MAIL++: 视觉语言模型的多模态双向智能体层

Kaixiang Chen, Pengfei Fang, Hui Xue

发表机构 * School of Computer Science and Engineering, Southeast University(东南大学计算机科学与工程学院) Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, China(新一代人工智能技术及其交叉应用教育部重点实验室(东南大学),中国)

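AI summary: Proposes MAIL and MAIL++, which embed cross-modal coupling into the intrinsic computation modules of VLMs and add bidirectional bridges for parameter-efficient fine-tuning, outperforming existing methods on few-shot classification and cross-domain retrieval.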

AI中文摘要

将大型视觉语言模型(如CLIP)适应下游任务仍然具有挑战性,因为全微调计算成本高且在小数据场景下容易过拟合。参数高效微调(PEFT)通过轻量级提示或适配器模块缓解了这些问题,而跨模态耦合通过增强视觉和语言之间的交互被证明特别有效。然而,现有的耦合机制主要依赖外部辅助模块,导致间接、粗粒度的交互,这些交互在结构上与原始VLM解耦,从而限制了表示的表达能力。在本文中,我们提出了多模态交互智能体层(MAIL),这是一种PEFT范式,将跨模态耦合直接嵌入VLM的内在计算模块中。MAIL冻结主干网络,并在核心模块(如LayerNorm)之后插入轻量级智能体层,以近似全微调引起的参数更新。为了在这一层面耦合视觉和文本流,我们引入了一个基于瓶颈的文本到图像桥,该桥联合优化跨模态的成对智能体层,协调相应计算模块的适应。我们进一步提出了MAIL++,它通过元智能体层、元文本桥和元图像桥实现了双向跨模态交换。在推理时,所有智能体层被重参数化到冻结的主干网络中,保持原始计算效率。在少样本图像分类和少样本通用跨域检索上的大量实验表明,MAIL和MAIL++始终优于最先进的PEFT方法。

英文摘要

Adapting large vision-language models (VLMs) such as CLIP to downstream tasks remains challenging, as full fine-tuning is computationally prohibitive and prone to overfitting in low-data regimes. Parameter-efficient fine-tuning (PEFT) alleviates these issues with lightweight prompt- or adapter-based modules, and cross-modal coupling has proven especially effective by strengthening interactions between vision and language. However, existing coupling mechanisms predominantly rely on external auxiliary modules, leading to indirect, coarse-grained interactions that are structurally decoupled from the original VLM and thus limit representational expressiveness. In this paper, we propose Multi-Modal Interactive Agent Layer (MAIL), a PEFT paradigm that embeds cross-modal coupling directly into the intrinsic computation modules of VLMs. MAIL freezes the backbone and inserts lightweight agent layers after core modules, such as LayerNorm, to approximate the parameter updates induced by full fine-tuning. To couple visual and textual streams at this level, we introduce a bottleneck-based text-to-image bridge that jointly optimizes paired agent layers across modalities, coordinating the adaptation of corresponding computation modules. We further present MAIL++, which enables bidirectional cross-modal exchange through a meta agent layer, a meta-text bridge, and a meta-image bridge. At inference time, all agent layers are re-parameterized into the frozen backbone, preserving the original computational efficiency. Extensive experiments on few-shot image classification and few-shot universal cross-domain retrieval demonstrate that MAIL and MAIL++ consistently outperform state-of-the-art PEFT methods.

2605.25477 2026-05-26 cs.RO cs.AI

EXPO-FT: Sample-Efficient Reinforcement Learning Finetuning for Vision-Language-Action Models

EXPO-FT:面向视觉-语言-动作模型的样本高效强化学习微调

Perry Dong, Kuo-Han Hung, Tian Gao, Dorsa Sadigh, Chelsea Finn

发表机构 * Stanford University(斯坦福大学)

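AI summary: Proposes EXPO-FT, a system for sample-efficient reinforcement learning fine-tuning of pretrained VLA policies, which achieves perfect performance (30/30 successes) on several high-precision manipulation tasks with an average of only 19.1 minutes of online robot data.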

AI中文摘要

高效且可靠地学习新任务的能力一直是机器人学的基础挑战。视觉-语言-动作(VLA)模型在多种操作任务中展现出强大的泛化能力,但预训练策略始终无法达到实际部署所需的可靠性。强化学习(RL)微调为弥合这一差距提供了有前景的路径,但现有方法要么从头开始训练而未充分利用预训练先验,要么微调VLA而未达到实际部署所需的样本效率和成功率。我们提出了EXPO-FT,一个用于对预训练VLA策略进行稳定、样本高效的RL微调的系统,填补了这一空白。我们的系统解决了一系列具有挑战性的操作任务,包括串灯并插入插头点亮、将台球击入袋中、将花插入酒瓶,每个任务都需要高精度、动态动作以及对不同初始状态的鲁棒性。我们的系统在所有评估任务中均实现了完美的任务性能(30/30成功),平均仅需19.1分钟的在线机器人数据,优于先前的从头RL训练和VLA微调方法。我们发布了一个开源代码库,旨在促进机器人领域中VLA模型RL微调的更广泛采用。

英文摘要

The ability to efficiently and reliably learn new tasks has been a foundational challenge in robotics. Vision-Language-Action (VLA) models have demonstrated strong generalization across diverse manipulation tasks, yet pretrained policies consistently fall short of the reliability required for real-world deployment. Reinforcement learning (RL) fine-tuning offers a promising path to bridge this gap, but existing approaches either train from scratch without fully leveraging pretrained priors, or fine-tune VLAs without achieving the sample efficiency and success rates that practical deployment demands. We present EXPO-FT, a system for stable, sample-efficient RL finetuning of pretrained VLA policies that closes this gap. Our system solves a suite of challenging manipulation tasks, including routing string lights and inserting the plug to light it up, striking a pool ball into a pocket, and inserting a flower into a wine bottle, each requiring combinations of high precision, dynamic actions, and robustness to varied initial states. Our system achieves perfect task performance (30/30 successes) across all evaluated tasks within an average of 19.1 minutes of online robot data, outperforming both prior RL-from-scratch and VLA finetuning approaches. We release an open-source codebase with the aim of facilitating broader adoption of RL finetuning of VLA models in robotics.

2605.25475 2026-05-26 cs.CL cs.AI

IndexMem: Learned KV-Cache Eviction with Latent Memory for Long-Context LLM Inference

IndexMem: 基于潜在记忆的学习型KV缓存驱逐策略用于长上下文LLM推理

Xintong Yang, Hao Gu, Binxing Xu, Lujun Li, Bei Liu, Jiacheng Liu, Qiyuan Zhu, Sirui Han, Yike Guo

发表机构 * The Hong Kong University of Science and Technology(香港科技大学) Zhejiang University(浙江大学)

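AI summary: Proposes a learnable indexer that predicts KV importance, combined with a lightweight latent memory module that compresses evicted tokens, to enable accurate long-context inference under a bounded KV budget.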

AI中文摘要

大型语言模型(LLM)越来越需要处理长上下文,但标准softmax注意力机制的KV缓存随序列长度线性增长,迅速成为长上下文推理的瓶颈。一种实用的补救措施是驱逐不太重要的KV条目;然而,现有的驱逐策略大多是启发式的,难以捕捉令牌重要性的丰富、输入相关的分布。在这项工作中,我们引入了一个可学习的索引器来预测KV重要性,从而能够更准确地保留关键令牌。同时,简单地驱逐令牌会永久丢弃其信息,导致不可逆的遗忘和长距离检索性能下降。为了解决这个问题,我们提出了一个轻量级的潜在记忆模块,将驱逐的令牌压缩成紧凑的、在线更新的状态,并提供残差读出以补偿通过KV驱逐丢失的注意力贡献。总的来说,我们的方法能够在有限的KV预算下实现准确的长上下文推理,在RULER(4K/16K)上对Qwen、Mistral和Llama模型(在激进驱逐下提升高达25分)带来一致的改进,在Needle-in-a-Haystack检索中显著更稳定,并且在LongBench得分和压缩曲线上优于现有的驱逐策略。

英文摘要

Large Language Models (LLMs) are increasingly expected to operate over long contexts, yet standard softmax attention incurs a KV cache that grows linearly with sequence length, quickly becoming the bottleneck for long context inference. A practical remedy is to evict less important KV entries; however, existing eviction policies are largely heuristic and struggle to capture the rich, input-dependent distribution of token importance. In this work, we introduce a learnable indexer that predicts KV importance, enabling more accurate retention of critical tokens. Meanwhile, naively evicting tokens permanently discards their information, leading to irreversible forgetting and degraded retrieval over long ranges. To address this, we propose a lightweight latent memory module that compresses evicted tokens into a compact, online-updated state and provides residual readouts to compensate for the attention contributions lost through KV eviction. Collectively, our method enables accurate long-context inference under a bounded KV budget, delivering consistent improvements on RULER (4K/16K) across Qwen, Mistral, and Llama models (up to 25 points under aggressive eviction), markedly more stable Needle-in-a-Haystack retrieval, and superior LongBench scores and compression curves compared to existing eviction policies.
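The two ideas, score-based eviction and a compact summary of what was evicted, can be sketched as follows. This is a toy stand-in under loud assumptions: the importance scores are passed in rather than predicted by a learned indexer, and the "memory" is a plain running mean of evicted value vectors, which is only a placeholder for the paper's latent memory module and its residual readout.

```python
def evict_with_memory(cache, scores, budget, memory=None):
    """Keep the `budget` highest-scoring KV entries; summarize the rest.

    cache:  list of (key, value) pairs, each value a list of floats
    scores: one importance score per entry (stand-in for the learned indexer)
    memory: {"count": int, "mean": list or None}, a running mean of evicted
            values (hypothetical placeholder for the latent memory state)
    """
    order = sorted(range(len(cache)), key=lambda i: scores[i], reverse=True)
    keep = sorted(order[:budget])       # retained entries, original token order
    evicted = sorted(order[budget:])
    memory = dict(memory) if memory else {"count": 0, "mean": None}
    for i in evicted:
        value = cache[i][1]
        memory["count"] += 1
        if memory["mean"] is None:
            memory["mean"] = list(value)
        else:
            n = memory["count"]
            # Online mean update, so the state stays fixed-size.
            memory["mean"] = [m + (v - m) / n for m, v in zip(memory["mean"], value)]
    return [cache[i] for i in keep], memory

cache = [("k0", [1.0, 0.0]), ("k1", [0.0, 2.0]), ("k2", [4.0, 4.0]), ("k3", [2.0, 2.0])]
kept, memory = evict_with_memory(cache, scores=[0.9, 0.1, 0.8, 0.2], budget=2)
```

The point of the fixed-size state is that cache size stays bounded by `budget` while evicted tokens still leave some trace, instead of being discarded outright.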

2605.25474 2026-05-26 cs.CL

TypedCSIP: Typed Counterfactual Pretraining for Chinese Legislative Conflict Classification

TypedCSIP:面向中国立法冲突分类的类型化反事实预训练

Yao Liu

发表机构 * Chengdu University of Technology, Leshan, China(成都理工大学,乐山,中国) School of Computer Sciences, Universiti Sains Malaysia, Penang, Malaysia(马来西亚理科大学计算机科学学院,槟城,马来西亚)

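AI summary: Proposes TypedCSIP, which combines typed Counterfactual Selective Intervention Pretraining (Stage 1) with transfer to a five-way classification head (Stage 2) to improve macro-F1 for legislative conflict classification on the LCR-CN benchmark.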

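Details
AI-generated abstract (translated from Chinese)

TypedCSIP is a typed counterfactual pretraining method for the conflict-classification task of the LCR-CN benchmark (Zhao et al., 2026): given a (superior law, subordinate law) provision pair, it predicts whether the pair conflicts and which of four legal-doctrine types (Responsibility, Condition, Sanction, Definition) describes the inconsistency. We use the expert-written minimal revisions in LCR-CN as counterfactual supervision at training time; at test time the classifier reads only the original provision pair. Stage 1 pretrains a shared encoder on (superior law, subordinate law, expert revision) triplets with a typed Counterfactual Selective Intervention Pretraining objective, treating the expert revision as a counterfactual that the typed factor head must classify as carrying no conflict evidence. Stage 2 transfers the encoder to a five-way classification head. The confirmatory test was registered on the Open Science Framework before the v6 measurements were observed: 18 seeds, with a locked rule requiring a mean per-seed difference of at least 0.8 percentage points and both the seed-bootstrap and Student-t 95% lower confidence bounds to be above zero. On the 696-record test set, the v2 variant improves macro-F1 over the strongest single-model baseline by +0.916 percentage points on chinese-roberta-wwm-ext and by +1.288 percentage points in the SAILER cross-backbone replication; both cells pass the rule. A cold-start stratified result on the 244 Unseen-gB records keeps the gain positive on both backbones. A cross-task diagnostic shows that the Stage-2 encoder is classification-specialized and does not transfer to LCR-CN's superior-law retrieval task, so we scope the contribution to conflict classification. We release the code, 72 pre-registered prediction files, matched-seed and MLM-control auxiliary files, and the OSF pre-registration record.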

English abstract

TypedCSIP is a typed counterfactual pretraining method for the conflict-classification task of the LCR-CN benchmark (Zhao et al., 2026): given a (superior, subordinate) provision pair, predict whether the pair conflicts and which of four legal-doctrine types (Responsibility, Condition, Sanction, Definition) describes the inconsistency. We exploit LCR-CN's expert-written minimal revisions as training-time counterfactual supervision; at test time the classifier reads only the original pair. Stage 1 pretrains a shared encoder with a typed Counterfactual Selective Intervention Pretraining objective on (superior, subordinate, expert-revised) triplets, treating the expert revision as a counterfactual that the typed factor head must classify as carrying no conflict evidence. Stage 2 transfers the encoder to a five-way classification head. The confirmatory test was registered on the Open Science Framework before observing v6 measurements: 18 seeds, locked rule requiring mean per-seed difference at least 0.8 pp with both seed-bootstrap and Student-t 95% lower bounds above zero. On the 696-record test split, the v2 variant improves macro-F1 over the strongest single-model baseline by +0.916 pp on chinese-roberta-wwm-ext and +1.288 pp on the SAILER cross-backbone replication; both cells pass the rule. A cold-start stratified result on the 244 Unseen-gB records keeps the gain positive on both backbones. A cross-task diagnostic shows the Stage-2 encoder is classification-specialized and does not transfer to LCR-CN's superior-law retrieval task, so we scope the contribution to conflict classification. We release code, 72 pre-registered prediction files, matched-seed and MLM-control auxiliaries, and the OSF pre-registration record.
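
The locked decision rule described above (mean per-seed difference of at least 0.8 pp, with both a seed-bootstrap and a Student-t 95% lower bound above zero) can be sketched in a few lines of Python. This is an illustrative reading, not the authors' registered script: the function name `passes_locked_rule`, the use of a percentile bootstrap, and the choice of the lower endpoint of a two-sided 95% interval are assumptions. The default critical value 2.110 is the two-sided 95% Student-t value for 17 degrees of freedom (18 seeds) and must be changed for other seed counts.

```python
import random
import statistics


def passes_locked_rule(diffs_pp, min_mean_pp=0.8, t_crit=2.110,
                       n_boot=10_000, seed=0):
    """Check a pre-registered rule on per-seed macro-F1 differences.

    diffs_pp : per-seed differences (method minus baseline), in percentage points
    t_crit   : Student-t critical value; 2.110 assumes 18 seeds (17 d.o.f.)
    Returns a dict with the mean, both lower bounds, and the pass/fail verdict.
    """
    n = len(diffs_pp)
    if n < 2:
        raise ValueError("need at least two seeds")

    mean = statistics.fmean(diffs_pp)

    # Student-t lower bound: mean minus critical value times standard error.
    sem = statistics.stdev(diffs_pp) / n ** 0.5
    t_lower = mean - t_crit * sem

    # Percentile bootstrap over seeds: resample seeds with replacement,
    # recompute the mean, and take the 2.5th percentile of those means.
    rng = random.Random(seed)
    boot_means = sorted(
        statistics.fmean(rng.choices(diffs_pp, k=n)) for _ in range(n_boot)
    )
    boot_lower = boot_means[int(0.025 * n_boot)]

    passes = mean >= min_mean_pp and t_lower > 0 and boot_lower > 0
    return {"mean": mean, "t_lower": t_lower,
            "boot_lower": boot_lower, "passes": passes}


if __name__ == "__main__":
    # Synthetic differences for illustration only; not the paper's data.
    toy_diffs = [1.0] * 9 + [0.8] * 9
    print(passes_locked_rule(toy_diffs))
```

Requiring both lower bounds to clear zero guards against a result that depends on one interval construction; the mean threshold separately rules out gains that are reliable but too small to matter.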

2605.25469 2026-05-26 cs.LG

JacQuant: STE-Free Quantization-Aware Training via Learned Jacobian Surrogates

Kai Yi, Vignesh Vivekraja, Harshit Khaitan, Steven Li

Affiliations: Meta AI

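AI summary: Proposes JacQuant, a framework that learns a lightweight surrogate of the model's local sensitivity to parameter changes in order to stabilize and accelerate quantization-aware training without the straight-through estimator, reaching higher accuracy on low-bit large language models.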

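Details
AI-generated abstract (translated from Chinese)

Quantization-aware training (QAT) is widely deployed but usually relies on the straight-through estimator (STE), which forces gradients through non-differentiable quantizers. This often makes training fragile near quantization boundaries and only weakly aligned with the actual behavior of the low-precision model. We introduce JacQuant, a QAT framework that learns a lightweight surrogate of the model's local sensitivity to parameter changes and uses it to stabilize and accelerate training within standard variance-reduced optimizers. The surrogate is cheap (diagonal or block-diagonal), data-driven, and compatible with common weight and activation quantizers. On code-preserving training phases, we prove convergence for non-convex objectives and obtain linear rates under a PL condition, and we connect the learned sensitivity to end-to-end output fidelity through a simple calibration argument. On LLM benchmarks at $\leq 2$ bits, JacQuant consistently achieves higher accuracy than STE-based QAT, and runtime analyses on various models show that the added cost is negligible at practical group sizes. The method is drop-in and requires no changes to the forward quantizers; our empirical claims are limited to ultra-low-bit LLM QAT.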

English abstract

Quantization-aware training (QAT) is widely deployed but typically relies on the Straight-Through Estimator (STE), which passes gradients through non-differentiable quantizers by fiat. This often makes training brittle near bin boundaries and weakly aligned with the actual behavior of the low-precision model. We introduce JacQuant, a QAT framework that learns a lightweight surrogate of the model's local sensitivity to parameter changes and uses it to stabilize and accelerate training within standard variance-reduced optimizers. The surrogate is inexpensive (diagonal or block-diagonal), data-driven, and compatible with common weight and activation quantizers. On code-preserving training phases, we prove convergence for non-convex objectives and obtain linear rates under a PL condition, and we relate the learned sensitivity to end-to-end output fidelity via a simple calibration argument. Across LLM benchmarks at $\leq 2$ bits, JacQuant consistently reaches higher accuracy than STE-based QAT, and the runtime analyses on various models show that the added cost remains negligible under practical group sizes. The method is drop-in and requires no changes to the forward quantizers; our empirical claims are scoped to ultra-low-bit LLM QAT.
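
For readers unfamiliar with the baseline being replaced, the following Python sketch illustrates the straight-through estimator on a single scalar weight. It shows the STE only, not JacQuant's learned surrogate; the quantizer, its step size of 0.5, the squared-error loss, and all function names are assumptions chosen for illustration.

```python
def quantize(w, step=0.5):
    """Uniform round-to-nearest quantizer (hypothetical step size)."""
    return round(w / step) * step


def loss(w, x, y, step=0.5):
    """Squared error of a one-weight model that uses the quantized weight."""
    return (quantize(w, step) * x - y) ** 2


def ste_grad(w, x, y, step=0.5):
    """Gradient of the loss w.r.t. w under the STE.

    The STE pretends d quantize(w) / dw == 1, so the chain rule gives
    dL/dw ~= dL/dq = 2 * (q * x - y) * x.
    """
    q = quantize(w, step)
    return 2.0 * (q * x - y) * x


def finite_diff_grad(w, x, y, step=0.5, eps=1e-4):
    """Central finite difference of the true, piecewise-constant loss."""
    return (loss(w + eps, x, y, step) - loss(w - eps, x, y, step)) / (2 * eps)


if __name__ == "__main__":
    # Inside a quantization bin the true loss is flat, yet the STE still
    # reports a nonzero gradient that training can follow.
    print("STE gradient:        ", ste_grad(0.6, 1.0, 2.0))
    print("finite-difference:   ", finite_diff_grad(0.6, 1.0, 2.0))
    # At a bin boundary the true loss jumps, so the finite difference
    # becomes very large while the STE value stays moderate.
    print("STE at boundary:     ", ste_grad(0.75, 1.0, 2.0))
    print("finite diff boundary:", finite_diff_grad(0.75, 1.0, 2.0))
```

The gap between the two gradients at the bin boundary is one way to see the brittleness the abstract attributes to STE-based training.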

2605.25461 2026-05-26 cs.CV

MetaphorVU: Towards Metaphorical Video Understanding

Zhuoqun Li, Boxi Cao, Guiping Jiang, Fangrui Lv, Ruotong Pan, Jianan Wang, Xiangyu Wu, Hongyu Lin, Yaojie Lu, Yong Du, Ruyin Jia, Liyan, Tingting Gao, Han Li, Xianpei Han, Le Sun

Affiliations: Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Department of Automation, Tsinghua University

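AI summary: To address the weakness of current multimodal large language models in metaphorical video understanding, proposes MetaphorVU-Bench, the first systematic benchmark for the task, and MetaphorBoost, a reasoning-enhancement framework built on a metaphor knowledge graph that significantly improves model performance.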

Comments ICML 2026 spotlight

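Details
AI-generated abstract (translated from Chinese)

Metaphorical videos appear widely in real-world scenarios to convey complex ideas, and understanding them usually requires high-order cognitive abilities. The lack of systematic research on metaphorical video understanding not only limits the real-world applicability of multimodal large language models (MLLMs) but also hinders a thorough evaluation of their high-order cognitive abilities. To fill this gap, we propose MetaphorVU-Bench, the first systematic and comprehensive benchmark dedicated to metaphorical video understanding. Our experiments show that current MLLMs struggle to understand metaphorical videos accurately and lag far behind human performance, mainly because of flawed cross-domain mapping. Motivated by this finding, we build a metaphor knowledge graph as mapping augmentation and propose MetaphorBoost, an inference-time enhancement framework that yields consistent performance gains. Our benchmark, analysis, and method offer useful insights and a foundation for future research on advancing MLLMs.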

English abstract

Metaphorical videos are prevalent across various real-world scenarios to convey complex ideas, and understanding them typically requires high-order cognitive capabilities. The lack of systematic studies on metaphorical video understanding not only constrains the real-world applicability of MLLMs but also impedes the thorough assessment of their high-order cognitive capabilities. To bridge this gap, we propose MetaphorVU-Bench, the first systematic and comprehensive benchmark dedicated to metaphorical video understanding. Through experiments, we find current MLLMs struggle with accurate metaphorical video understanding, lagging far behind human level, primarily due to defective cross-domain mapping. Motivated by this finding, we construct a metaphor knowledge graph as mapping augmentation and propose MetaphorBoost, an inference-time enhancement framework achieving consistent performance improvement. Our benchmark, analysis, and method provide useful insights and a foundation for future research on advancing MLLMs.