arXivDaily: arXiv daily academic digest, updated Monday through Friday
2604.07993 2026-05-20 cs.RO

HEX: Humanoid-Aligned Experts for Cross-Embodiment Whole-Body Manipulation

Shuanghao Bai, Meng Li, Xinyuan Lv, Jiawei Wang, Xinhua Wang, Fei Liao, Chengkai Hou, Langzhe Gu, Wanqi Zhou, Kun Wu, Ziluo Ding, Zhiyuan Xu, Lei Sun, Shanghang Zhang, Zhengping Che, Jian Tang, Badong Chen

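Affiliations: Beijing Innovation Center of Humanoid Robotics; Xi’an Jiaotong University; Nankai University; Peking University

AI summary: HEX introduces a humanoid-aligned universal state representation and a Mixture-of-Experts Unified Proprioceptive Predictor to achieve coordinated whole-body manipulation control on full-sized bipedal humanoid robots, and reports state-of-the-art task success rate and generalization.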

Comments Project page: https://hex-humanoid.github.io/

Abstract

Humans achieve complex manipulation through coordinated whole-body control, whereas most Vision-Language-Action (VLA) models treat robot body parts largely independently, making high-DoF humanoid control challenging and often unstable. We present HEX, a state-centric framework for coordinated manipulation on full-sized bipedal humanoid robots. HEX introduces a humanoid-aligned universal state representation for scalable learning across heterogeneous embodiments, and incorporates a Mixture-of-Experts Unified Proprioceptive Predictor to model whole-body coordination and temporal motion dynamics from large-scale multi-embodiment trajectory data. To efficiently capture temporal visual context, HEX uses lightweight history tokens to summarize past observations, avoiding repeated encoding of historical images during inference. It further employs a residual-gated fusion mechanism with a flow-matching action head to adaptively integrate visual-language cues with proprioceptive dynamics for action generation. Experiments on real-world humanoid manipulation tasks show that HEX achieves state-of-the-art performance in task success rate and generalization, particularly in fast-reaction and long-horizon scenarios.

2604.07393 2026-05-20 cs.LG cs.AI

DSPR: Dual-Stream Physics-Residual Networks for Trustworthy Industrial Time Series Forecasting

Yeran Zhang, Pengwei Yang, Guoqing Wang, Tianyu Li

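Affiliations: School of Computer Science and Engineering, University of Electronic Science and Technology of China; Research Center, East Hope Group Co., Ltd

AI summary: This paper proposes the DSPR framework, which separates stable temporal patterns from regime-dependent residual dynamics to improve the accuracy and physical plausibility of industrial time series forecasting. Experiments show that it maintains high forecasting accuracy and robustness across regimes.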

Comments 12 pages, 7 figures, accepted by KDD 2026

Abstract

Accurate forecasting of industrial time series requires balancing predictive accuracy with physical plausibility under non-stationary operating conditions. Existing data-driven models often achieve strong statistical performance but struggle to respect regime-dependent interaction structures and transport delays inherent in real-world systems. To address this challenge, we propose DSPR (Dual-Stream Physics-Residual Networks), a forecasting framework that explicitly decouples stable temporal patterns from regime-dependent residual dynamics. The first stream models the statistical temporal evolution of individual variables. The second stream focuses on residual dynamics through two key mechanisms: an Adaptive Window module that estimates flow-dependent transport delays, and a Physics-Guided Dynamic Graph that incorporates physical priors to learn time-varying interaction structures while suppressing spurious correlations. Experiments on four industrial benchmarks spanning heterogeneous regimes demonstrate that DSPR consistently improves forecasting accuracy and robustness under regime shifts while maintaining strong physical plausibility. It achieves state-of-the-art predictive performance, with Mean Conservation Accuracy exceeding 99% and Total Variation Ratio reaching up to 97.2%. Beyond forecasting, the learned interaction structures and adaptive lags provide interpretable insights that are consistent with known domain mechanisms, such as flow-dependent transport delays and wind-to-power scaling behaviors. These results suggest that architectural decoupling with physics-consistent inductive biases offers an effective path toward trustworthy industrial time-series forecasting. Furthermore, DSPR's demonstrated robust performance in long-term industrial deployment bridges the gap between advanced forecasting models and trustworthy autonomous control systems.

2604.07035 2026-05-20 cs.CL

Unified Deployment-Aware Evaluation of Open Reasoning Language Models

Md Motaleb Hossen Manik, Ge Wang

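Affiliations: Department of Computer Science, Rensselaer Polytechnic Institute; Department of Biomedical Engineering, Rensselaer Polytechnic Institute

AI summary: This paper presents a unified evaluation of open reasoning language models, assessing seven configurations on four benchmarks (ARC-Challenge, GSM8K, MATH levels 1 to 3, and TruthfulQA MC1) under zero-shot, chain-of-thought (CoT), and few-shot CoT prompting. It analyzes accuracy, latency, and memory use, and frames model selection as a deployment-aware, multi-objective problem.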

Abstract

Open reasoning language models are often compared under mixed sample sizes, partially standardized prompts, and accuracy-centered summaries, which makes practical model selection difficult to interpret. We present a unified evaluation of seven open reasoning language model configurations across four benchmarks: ARC-Challenge, GSM8K, MATH levels 1 to 3, and TruthfulQA MC1. We test zero-shot, chain-of-thought (CoT), and few-shot CoT prompting on the same 238-example subset for every model-dataset-strategy condition, yielding a complete 7 × 4 × 3 design with 84 conditions and 19,992 evaluated examples. Beyond accuracy, we report Wilson confidence intervals, latency, peak video random access memory (VRAM), weighted aggregate performance, Pareto-efficient operating points, prompt-sensitivity metrics, and compatibility diagnostics. Gemma-4-26B-A4B with zero-shot prompting achieves the highest weighted score at 0.794. Gemma-4-E4B remains close to the top across prompting settings while using substantially lower latency and memory, making it a strong practical operating point. Bootstrap and paired-permutation analyses show that the leading configurations are close enough that deployment tradeoffs remain important. We also find that prompting strategy changes model rankings rather than shifting all models uniformly. Benchmark-specific complementarity creates routing headroom, with an oracle task-aware selector reaching a weighted score of 0.825. Compatibility diagnostics show that some apparent failures, especially Phi-4-Reasoning on GSM8K, reflect robustness and interface-adherence problems under the shared evaluation pipeline. These results support a central claim: open-model evaluation should be framed as a deployment-aware, multi-objective operating-point problem rather than as a single-score leaderboard exercise.
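
The abstract reports Wilson confidence intervals on 238-example subsets and a 7 × 4 × 3 design. The sketch below is an illustration only, not the authors' code: it checks the stated design arithmetic and computes a Wilson score interval. The 190-correct count is a hypothetical input, not a reported result, and the 95% level (z = 1.96) is an assumption.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n))
    # Clamp to [0, 1] to guard against floating-point overshoot at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


# Design size stated in the abstract: 7 configurations x 4 benchmarks x 3 strategies,
# each evaluated on the same 238-example subset.
conditions = 7 * 4 * 3
total_examples = conditions * 238

# Hypothetical condition with 190 of 238 correct (illustrative, not from the paper).
low, high = wilson_interval(190, 238)
print(conditions, total_examples)
print(f"accuracy {190 / 238:.3f}, interval {low:.3f} to {high:.3f}")
```

With only 238 examples per condition, an interval of this kind spans roughly ten percentage points, which is one reason closely ranked configurations can be hard to separate on accuracy alone.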

2604.03419 2026-05-20 cs.LG math.CO

Adaptive Threshold-Driven Continuous Greedy Method for Scalable Submodular Optimization

Mohammadreza Rostami, Solmaz S. Kia

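Affiliations: Department of Mechanical and Aerospace Engineering, University of California Irvine

AI summary: This study proposes ATCG, an adaptive threshold-driven continuous greedy method for submodular maximization under matroid constraints. By adaptively controlling active-set expansion, it improves efficiency and reduces communication overhead.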

Abstract

Submodular maximization under matroid constraints is a fundamental problem in combinatorial optimization with applications in sensing, data summarization, active learning, and resource allocation. While the Sequential Greedy (SG) algorithm achieves only a $\frac{1}{2}$-approximation due to irrevocable selections, Continuous Greedy (CG) attains the optimal $\bigl(1-\frac{1}{e}\bigr)$-approximation via the multilinear relaxation, at the cost of a progressively dense decision vector that forces agents to exchange feature embeddings for nearly every ground-set element. We propose ATCG (Adaptive Thresholded Continuous Greedy), which gates gradient evaluations behind a per-partition progress ratio $\eta_i$, expanding each agent's active set only when current candidates fail to capture sufficient marginal gain, thereby directly bounding which feature embeddings are ever transmitted. Theoretical analysis establishes a curvature-aware approximation guarantee with effective factor $\tau_{\mathrm{eff}}=\max\{\tau,1-c\}$, interpolating between the threshold-based guarantee and the low-curvature regime where ATCG recovers the performance of CG. This shows that the problem structure, as captured by curvature, determines the amount of coordination and communication required to approach full-CG performance. Experiments on a class-balanced prototype selection problem over a subset of the CIFAR-10 animal dataset show that ATCG achieves objective values comparable to those of the full CG method while substantially reducing communication overhead through adaptive active-set expansion.
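
To make the Sequential Greedy baseline and the stated effective factor concrete, here is a minimal sketch on a toy coverage objective under a partition matroid (one pick per agent). It is not ATCG and not the authors' code: the instance, the function names, and the brute-force comparison are all illustrative assumptions. Only the formula max{τ, 1 − c} is taken from the abstract.

```python
from itertools import product


def coverage(selected, sets):
    """Monotone submodular objective: number of distinct elements covered."""
    covered = set()
    for name in selected:
        covered |= sets[name]
    return len(covered)


def sequential_greedy(partitions, sets):
    """Each agent, in order, irrevocably picks its item with the largest marginal gain."""
    chosen = []
    for items in partitions:
        base = coverage(chosen, sets)
        best = max(items, key=lambda name: coverage(chosen + [name], sets) - base)
        chosen.append(best)
    return chosen


def best_by_enumeration(partitions, sets):
    """Exhaustive optimum; only feasible for toy instances."""
    return max(coverage(list(choice), sets) for choice in product(*partitions))


def effective_factor(tau, c):
    """Effective factor max{tau, 1 - c} as written in the abstract."""
    return max(tau, 1.0 - c)


# Hypothetical instance: agent A chooses between a1 and a2, agent B can only pick b1.
sets = {"a1": {1, 2, 5}, "a2": {3, 4}, "b1": {1, 2, 5}}
partitions = [["a1", "a2"], ["b1"]]

picked = sequential_greedy(partitions, sets)
greedy_value = coverage(picked, sets)
best_value = best_by_enumeration(partitions, sets)
print(picked, greedy_value, best_value)
print(effective_factor(tau=0.5, c=0.2))
```

On this instance the first agent's locally best pick is fully duplicated by the second agent, so greedy covers 3 elements where 5 are attainable, which stays within the 1/2 bound while showing why irrevocable selections cost value.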

2604.02784 2026-05-20 cs.CV cs.CL

EnsemHalDet: Robust VLM Hallucination Detection via Ensemble of Internal State Detectors

Ryuhei Miyazato, Shunsuke Kitada, Kei Harada

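Affiliations: The University of Electro-Communications

AI summary: This paper proposes EnsemHalDet, a hallucination detection framework for vision-language models that ensembles detectors built on multiple internal representations to improve the robustness of multimodal hallucination detection.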

Abstract

Vision-Language Models (VLMs) excel at multimodal tasks, but they remain vulnerable to hallucinations that are factually incorrect or ungrounded in the input image. Recent work suggests that hallucination detection using internal representations is more efficient and accurate than approaches that rely solely on model outputs. However, existing internal-representation-based methods typically rely on a single representation or detector, limiting their ability to capture diverse hallucination signals. In this paper, we propose EnsemHalDet, an ensemble-based hallucination detection framework that leverages multiple internal representations of VLMs, including attention outputs and hidden states. EnsemHalDet trains independent detectors for each representation and combines them through ensemble learning. Experimental results across multiple VQA datasets and VLMs show that EnsemHalDet consistently outperforms prior methods and single-detector models in terms of AUC. These results demonstrate that ensembling diverse internal signals significantly improves robustness in multimodal hallucination detection.
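
The idea of combining per-representation detectors and scoring the result by AUC can be illustrated with a small sketch. The abstract does not say which ensemble rule is used, so the simple score averaging below is an assumption, and the detector scores and labels are made-up toy values rather than data from the paper.

```python
def auc(scores, labels):
    """ROC AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def ensemble_mean(*detector_scores):
    """Average the scores of several detectors example by example (assumed rule)."""
    return [sum(col) / len(col) for col in zip(*detector_scores)]


# Toy data: label 1 = hallucinated answer, 0 = grounded answer (hypothetical).
labels = [1, 1, 0, 0]
attention_scores = [0.9, 0.4, 0.5, 0.1]   # detector on attention outputs
hidden_scores = [0.6, 0.8, 0.2, 0.7]      # detector on hidden states

combined = ensemble_mean(attention_scores, hidden_scores)
print(auc(attention_scores, labels), auc(hidden_scores, labels), auc(combined, labels))
```

In this toy case each detector misranks a different example, so averaging corrects both mistakes. Real detectors need not be this complementary, and the gain depends on how correlated their errors are.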

2603.29501 2026-05-20 cs.LG cs.AI

Target-Aligned Reinforcement Learning

Leonard S. Pleiss, James Harrison, Maximilian Schiffer

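Affiliations: Technical University of Munich

AI summary: This paper proposes Target-Aligned Reinforcement Learning, which emphasizes transitions on which the target and online network estimates are highly aligned. The method improves the stability and convergence speed of standard deep reinforcement learning algorithms, with gains reported across multiple benchmark environments.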

Abstract

Many value-based deep reinforcement learning algorithms rely on target networks, lagged copies of the online network, to stabilize training. While effective, this mechanism introduces a fundamental stability-recency tradeoff: slower target updates improve stability but reduce the recency of learning signals, hindering convergence speed. We propose Target-Aligned Reinforcement Learning (TARL), a simple drop-in refinement for existing algorithms that emphasizes transitions for which the target and online network estimates are highly aligned. By focusing updates on well-aligned targets, TARL mitigates the adverse effects of stale target estimates while retaining the stabilizing benefits of target networks. We empirically demonstrate consistent improvements within discrete and continuous control algorithms across various benchmark environments without any hyperparameter tuning, including a 38.18% peak score gain on Atari-10, while incurring less than a 4% increase in wall-clock time.
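
The abstract does not state how TARL measures alignment or how it emphasizes transitions, so the following is only one plausible reading, written as a hedged sketch. The softmax weighting over the absolute gap between the two networks' estimates, the `temperature` parameter, and all function names are assumptions introduced for illustration, not the paper's method.

```python
import math


def alignment_weights(online_estimates, target_estimates, temperature=1.0):
    """Softmax over negative disagreement: closely agreeing estimates get more weight.

    Assumed rule for illustration only; the paper's actual criterion may differ.
    """
    gaps = [abs(o - t) for o, t in zip(online_estimates, target_estimates)]
    logits = [-g / temperature for g in gaps]
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_td_loss(q_values, td_targets, weights):
    """Weighted squared TD error over a batch of transitions."""
    return sum(w * (q - y) ** 2 for q, y, w in zip(q_values, td_targets, weights))


# Hypothetical next-state value estimates for a batch of three transitions.
online_next = [1.0, 2.0, 0.5]
target_next = [1.1, 0.5, 0.5]   # the second transition has a stale-looking target

weights = alignment_weights(online_next, target_next)
loss = weighted_td_loss([0.9, 1.8, 0.4], [1.0, 0.6, 0.5], weights)
print([round(w, 3) for w in weights], round(loss, 4))
```

Under this assumed rule, the transition whose two estimates disagree most contributes least to the update, which matches the abstract's description of focusing updates on well-aligned targets.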

2603.29092 2026-05-20 cs.CV

TrajectoryMover: Generative Movement of Object Trajectories in Videos

Kiran Chhatre, Hyeonho Jeong, Yulia Gryaditskaya, Christopher E. Peters, Chun-Hao Paul Huang, Paul Guerrero

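Affiliations: KTH Royal Institute of Technology; Adobe Research

AI summary: This paper proposes TrajectoryMover, a method for generatively moving object trajectories in videos, built on a pipeline that generates large-scale synthetic paired video data and a video generator fine-tuned on that data.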

Comments 15 pages, 9 figures. Project page: https://chhatrekiran.github.io/trajectorymover

Abstract

Generative video editing has enabled several intuitive editing operations for short video clips that would previously have been difficult to achieve, especially for non-expert editors. Existing methods focus on prescribing an object's 3D or 2D motion trajectory in a video, or on altering the appearance of an object or a scene, while preserving both the video's plausibility and identity. Yet a method to move an object's 3D motion trajectory in a video, i.e., moving an object while preserving its relative 3D motion, is still missing. The main challenge lies in obtaining paired video data for this scenario. Previous methods typically rely on clever data generation approaches to construct plausible paired data from unpaired videos, but this approach fails if one of the videos in a pair cannot easily be constructed from the other. Instead, we introduce TrajectoryAtlas, a new data generation pipeline for large-scale synthetic paired video data, and TrajectoryMover, a video generator fine-tuned with this data. We show that this successfully enables generative movement of object trajectories. Project page: https://chhatrekiran.github.io/trajectorymover

2603.25620 2026-05-20 cs.CL

PICon: A Multi-Turn Interrogation Framework for Evaluating Persona Agent Consistency

Minseo Kim, Sujeong Im, Junseong Choi, Junhee Lee, Chaeeun Shim, Hwajung Hong, Edward Choi

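Affiliations: KAIST

AI summary: This paper proposes PICon, a framework that evaluates the consistency of persona agents through logically chained multi-turn questioning. It finds that even systems previously considered highly consistent fall short of the human baseline on all three dimensions, revealing contradictions and evasive responses.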

Comments 20 pages, 6 figures

Abstract

Large language model (LLM)-based persona agents are rapidly being adopted as scalable proxies for human participants across diverse domains. Yet there is no systematic method for verifying whether a persona agent's responses remain free of contradictions and factual inaccuracies throughout an interaction. A principle from interrogation methodology offers a lens: no matter how elaborate a fabricated identity, systematic interrogation will expose its contradictions. We apply this principle to propose PICon, an evaluation framework that probes persona agents through logically chained multi-turn questioning. PICon evaluates consistency along three core dimensions: internal consistency (freedom from self-contradiction), external consistency (alignment with real-world facts), and retest consistency (stability under repetition). Evaluating seven groups of persona agents alongside 63 real human participants, we find that even systems previously reported as highly consistent fail to meet the human baseline across all three dimensions, revealing contradictions and evasive responses under chained questioning. This work provides both a conceptual foundation and a practical methodology for evaluating persona agents before trusting them as substitutes for human participants. We provide the source code and an interactive demo at: https://kaist-edlab.github.io/picon/

2603.22453 2026-05-20 cs.CL cs.SI

XNote: Benchmarking Automated Community Notes Generation for Image-based Contextual Deception

Jin Ma, Jingwen Yan, Mohammed Aldeen, Ethan Anderson, Taran Kavuru, Jinkyung Katie Park, Feng Luo, Long Cheng

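Affiliations: School of Computing, Clemson University

AI summary: This paper studies automated Community Notes generation for image-based contextual deception, introduces the real-world dataset XNote, and benchmarks frontier large vision language models and commercial tools on deception detection and note generation.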

Abstract

Community Notes have emerged as an effective crowd-sourced mechanism for combating online deception on social media platforms. However, their reliance on human contributors limits both timeliness and scalability. In this work, we study the automated Community Notes generation task for image-based contextual deception, where an authentic image is paired with misleading context (e.g., time, entity, and event). Unlike prior work that primarily focuses on deception detection (i.e., judging whether a post is true or false in a binary manner), automated Community Notes generation requires producing concise and grounded notes that help users recover the missing or corrected context. This problem remains underexplored due to the scarcity of datasets that support this task. To address this gap, we curate a real-world dataset, XNote, comprising X posts with associated Community Notes and external contexts, along with annotations of topics and deceptive factors. We further benchmark a range of frontier large vision language models (LVLMs) on XNote, evaluating their performance on both deception detection and note generation tasks. We also compare against an end-to-end approach, SNIFFER, and a commercial tool, GPT-5. Our results highlight the challenges in automated Community Notes generation, underscoring the need for improved methods and metrics tailored to this task.

2603.17839 2026-05-20 cs.CL cs.AI cs.LG

How do LLMs Compute Verbal Confidence

Dharshan Kumaran, Arthur Conmy, Federico Barbero, Simon Osindero, Viorica Patraucean, Petar Veličković

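Affiliations: Google DeepMind

AI summary: This study examines how large language models internally generate verbal confidence scores. Experiments indicate that confidence is cached after answer generation and retrieved for later output, shedding light on the mechanism of model self-evaluation.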

Abstract

Verbal confidence, in which LLMs are prompted to state their confidence as a number or category, is widely used to extract uncertainty estimates from black-box models. However, how LLMs internally generate such scores remains unknown. We address two questions: first, when confidence is computed (just-in-time when requested, or automatically during answer generation and cached for later retrieval); and second, what verbal confidence represents (token log-probabilities, or a richer evaluation of answer quality). Focusing on Gemma 3 27B (across TriviaQA, BigMath, and MMLU), Qwen 2.5 7B, and the reasoning model Magistral Small 24B, we provide convergent evidence for cached retrieval. Activation steering, patching, noising, and swap experiments reveal that confidence representations emerge at answer-adjacent positions before appearing at the verbalization site. Attention blocking pinpoints the information flow: confidence is gathered from answer tokens, cached at the first post-answer position, then retrieved for output. Critically, linear probing and variance partitioning reveal that these cached representations explain substantial variance in verbal confidence beyond token log-probabilities, suggesting a richer answer-quality evaluation rather than a simple fluency readout. These findings demonstrate that verbal confidence reflects automatic, sophisticated self-evaluation rather than post-hoc reconstruction, with implications for understanding metacognition in LLMs and improving calibration.

2603.16284 2026-05-20 cs.CV cs.LG

Locate-then-Sparsify: Attribution Guided Sparse Strategy for Visual Hallucination Mitigation

Tiantian Dang, Chao Bi, Shufan Shen, Jinzhe Liu, Qingming Huang, Shuhui Wang

Affiliations: State Key Lab. of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences; School of Advanced Interdisciplinary Sciences, University of Chinese Academy of Sciences; School of Computer Science and Technology, University of Chinese Academy of Sciences

AI summary: This paper proposes Locate-Then-Sparsify for Feature Steering (LTS-FS), a framework that locates hallucination-relevant layers and sets the feature steering intensity of each layer according to its relevance to hallucination, mitigating hallucination in vision-language models while preserving performance.

Comments Accepted by CVPR 2026

Abstract

Despite the significant advancements in Large Vision-Language Models (LVLMs), their tendency to generate hallucinations undermines reliability and restricts broader practical deployment. Among hallucination mitigation methods, feature steering has emerged as a promising approach that reduces erroneous outputs in LVLMs without increasing inference costs. However, current methods apply uniform feature steering across all layers. This heuristic strategy ignores inter-layer differences, potentially disrupting layers unrelated to hallucinations and ultimately leading to performance degradation on general tasks. In this paper, we propose Locate-Then-Sparsify for Feature Steering (LTS-FS), a plug-and-play framework that controls the steering intensity according to the hallucination relevance of each layer. We first construct a dataset comprising token-level and sentence-level hallucination cases. Based on this dataset, we introduce an attribution method based on causal interventions to quantify the hallucination relevance of each layer. With the attribution scores across layers, we propose a layerwise strategy that converts these scores into feature steering intensities for individual layers, enabling more precise adjustments specifically on hallucination-relevant layers. Extensive experiments across multiple LVLMs and benchmarks demonstrate that LTS-FS effectively mitigates hallucination while preserving strong performance. Code is available at https://github.com/huttersadan/LTS-FS.

2603.15411 2026-05-20 cs.AI cs.LG

A Hybrid Modeling Framework for Crop Prediction Tasks via Dynamic Parameter Calibration and Multi-Task Learning

William Solow, Paola Pesantez-Cabrera, Markus Keller, Lav Khot, Sandhya Saisubramanian, Alan Fern

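Affiliations: Oregon State University; Washington State University

AI summary: This paper proposes a hybrid modeling approach that combines dynamic parameter calibration with multi-task learning to improve crop prediction accuracy, particularly when data are limited. A neural network parameterizes a biophysical model, and data are shared efficiently across crop cultivars, improving both prediction accuracy and biological realism.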

Abstract

Accurate prediction of crop states (e.g., phenology stages and cold hardiness) is essential for timely farm management decisions such as irrigation, fertilization, and canopy management to optimize crop yield and quality. While traditional biophysical models can be used for season-long predictions, they lack the precision required for site-specific management. Deep learning methods are a compelling alternative, but can produce biologically unrealistic predictions and require large-scale data. We propose a hybrid modeling approach that uses a neural network to parameterize a differentiable biophysical model and leverages multi-task learning for efficient data sharing across crop cultivars in data-limited settings. By predicting the parameters of the biophysical model, our approach improves the prediction accuracy while preserving biological realism. Empirical evaluation using real-world and synthetic datasets demonstrates that our method improves prediction accuracy by 60% for phenology and 40% for cold hardiness compared to deployed biophysical models.

2603.13609 2026-05-20 cs.CV

A Grid-Based Framework for E-Scooter Demand Representation and Temporal Input Design for Deep Learning: Evidence from Austin, Texas

Mohammad Sahnoon, Merkebe Getachew Demissie, Roberto Souza

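Affiliations: Schulich School of Engineering, University of Calgary

AI summary: This paper proposes a grid-based representation of e-scooter demand and a framework for designing temporal inputs for deep learning. A systematic data-processing pipeline and a statistically grounded method support consistent spatial learning while preserving demand patterns. The resulting design reduces mean squared error by up to 37% for next-hour prediction and 35% for next-24-hour prediction.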

Comments 16 pages, 7 tables, 10 figures

Abstract

Despite progress in deep learning for shared micromobility demand prediction, the systematic design and statistical validation of temporal input structures remain underexplored. Temporal features are often selected heuristically, even though historical demand strongly affects model performance and generalizability. This paper introduces a reproducible data-processing pipeline and a statistically grounded method for designing temporal input structures for image-to-image demand prediction. Using large-scale e-scooter data from Austin, Texas, we build a grid-based spatiotemporal dataset by converting trip records into hourly pickup and dropoff demand images. The pipeline includes trip filtering, mapping Census Tracts to spatial locations, grid construction, demand aggregation, and creation of a global activity mask that limits evaluation to historically active areas. This representation supports consistent spatial learning while preserving demand patterns. We then introduce a combined correlation- and error-based procedure to identify informative historical inputs. Optimal temporal depth is selected through an ablation study using a baseline UNET model with paired non-parametric tests and Holm correction. The resulting temporal structures capture short-term persistence as well as daily and weekly cycles. Compared with adjacent-hour and fixed-period baselines, the proposed design reduces mean squared error by up to 37 percent for next-hour prediction and 35 percent for next-24-hour prediction. These results highlight the value of principled dataset construction and statistically validated temporal input design for spatiotemporal micromobility demand prediction.
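
Two steps named in the abstract lend themselves to a short illustration: aggregating trip records into hourly demand grids, and applying the Holm correction to a family of paired-test p-values. The sketch below assumes trips have already been mapped to grid cells as (hour, row, col) tuples. The record layout, function names, grid size, and p-values are hypothetical and are not taken from the paper.

```python
def pickup_grids(trips, n_rows, n_cols):
    """Aggregate (hour, row, col) trip records into one count grid per hour."""
    grids = {}
    for hour, row, col in trips:
        if hour not in grids:
            grids[hour] = [[0] * n_cols for _ in range(n_rows)]
        grids[hour][row][col] += 1
    return grids


def holm_reject(p_values, alpha=0.05):
    """Holm step-down procedure: returns one reject/keep decision per hypothesis."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is compared with alpha / (m - k + 1).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, all larger p-values are retained
    return reject


# Hypothetical trips on a 2 x 2 grid: three pickups at hour 8, one at hour 9.
trips = [(8, 0, 1), (8, 0, 1), (8, 1, 0), (9, 1, 1)]
grids = pickup_grids(trips, n_rows=2, n_cols=2)
print(grids[8], grids[9])

# Hypothetical p-values from four pairwise comparisons of temporal depths.
decisions = holm_reject([0.01, 0.04, 0.03, 0.005])
print(decisions)
```

The Holm procedure stops at the first comparison that fails its threshold, so in this toy family the two smallest p-values are rejected and the remaining two are retained even though 0.03 and 0.04 are below the uncorrected 0.05 level.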

2603.12296 2026-05-20 cs.LG cs.AI eess.SP

Synthetic Data Generation for Brain-Computer Interfaces: Overview, Benchmarking, and Future Directions

Ziwei Wang, Zhentao He, Xingyi He, Hongbin Wang, Tianwang Jia, Jingwei Luo, Siyang Li, Xiaoqing Chen, Dongrui Wu

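Affiliations: School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, Wuhan, China; Zhongguancun Academy, Beijing, China

AI summary: This survey reviews synthetic brain data generation for brain-computer interfaces, covering a taxonomy of generation methods, benchmark experiments, evaluation metrics, applications, and future research directions, with the aim of supporting data-efficient and privacy-aware BCI systems.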

Comments 33 pages, 8 figures

Abstract

Deep learning has achieved transformative performance across diverse domains, largely driven by large-scale and high-quality training data. In contrast, the development of brain-computer interfaces (BCIs) is fundamentally constrained by limited, heterogeneous, and privacy-sensitive neural recordings. Generating synthetic yet physiologically plausible brain signals has therefore emerged as a promising strategy to mitigate data scarcity, improve model generalization, and support data-efficient BCIs. This survey provides a comprehensive review of synthetic brain data generation for BCIs, covering methodological taxonomies, benchmark experiments, evaluation metrics, key applications, and future directions. We systematically categorize existing generation approaches into four types: signal-transformation-based, feature-based, model-based, and translation-based generation, and discuss their characteristics, advantages, and limitations. Furthermore, we benchmark representative brain signal generation approaches across four BCI paradigms, including motor imagery, epileptic seizure detection, steady-state visually evoked potentials, and auditory attention detection, to provide an objective comparison of their downstream utility. We also summarize evaluation principles for generated brain signals from multiple perspectives, including signal realism, physiological plausibility, downstream utility, and privacy preservation. Finally, we discuss the potential and challenges of current generation approaches and outline future research directions toward accurate, data-efficient, generalizable, and privacy-aware BCI systems. The benchmark codebase is available at https://github.com/wzwvv/DG4BCI.

2603.11024 2026-05-20 cs.CV cs.AI

Does AI See like Art Historians? Interpreting How Vision Language Models Recognize Artistic Style

AI 是否能像艺术史家一样看?解析视觉语言模型如何识别艺术风格

Marvin Limpijankit, Milad Alshomary, Yassin Oulad Daoud, Amith Ananthram, Tim Trombley, Emily L. Spratt, Anna Filonenko, Hannah Pivo, Elias Stengel-Eskin, Mohit Bansal, Noam M. Elcott, Kathleen McKeown

发表机构 * Columbia University, Department of Computer Science(哥伦比亚大学计算机科学系) Columbia University, Department of Art History & Archaeology(哥伦比亚大学艺术史与考古系) University of Texas at Austin(德克萨斯大学奥斯汀分校) UNC Chapel Hill(北卡罗来纳大学教堂山分校)

AI总结 本文研究了视觉语言模型(VLMs)在识别艺术风格方面的机制,通过跨学科合作,分析VLMs如何预测艺术风格,并评估其与艺术史家判断艺术风格的标准的一致性。

Comments 20 pages, 18 figures

详情
AI中文摘要

视觉语言模型(VLMs)在视觉问答、目标检测等一系列计算机视觉任务上日益熟练,在艺术领域的能力也不断增强,涵盖从分析艺术品到生成艺术品。在计算机科学家与艺术史家的跨学科合作中,我们刻画了VLMs预测艺术风格的内在机制,并评估这些机制与艺术史家推断艺术风格所用标准的契合程度。我们采用潜在空间分解方法识别驱动艺术风格预测的概念,并开展了定量评估、因果分析和艺术史家评审。结果表明,73%的提取概念被艺术史家判定为具有连贯且语义明确的视觉特征,在用于预测特定艺术品风格的概念中有90%被判定为相关。对于不相关概念成功预测风格的情况,艺术史家指出了可能的原因;例如,模型可能是从更偏形式的层面(如明暗对比)来“理解”某个概念。

英文摘要

VLMs have become increasingly proficient at a range of computer vision tasks, such as visual question answering and object detection. This includes increasingly strong capabilities in the domain of art, from analyzing artwork to generation of art. In an interdisciplinary collaboration between computer scientists and art historians, we characterize the mechanisms underlying VLMs' ability to predict artistic style and assess the extent to which they align with the criteria art historians use to reason about artistic style. We employ a latent-space decomposition approach to identify concepts that drive art style prediction and conduct quantitative evaluations, causal analysis and assessment by art historians. Our findings indicate that 73% of the extracted concepts are judged by art historians to exhibit a coherent and semantically meaningful visual feature and 90% of concepts used to predict style of a given artwork were judged relevant. In cases where an irrelevant concept was used to successfully predict style, art historians identified possible reasons for its success; for example, the model might "understand" a concept in more formal terms, such as dark/light contrasts.

2603.07561 2026-05-20 cs.CV

PureCC: Pure Learning for Text-to-Image Concept Customization

PureCC: 文本到图像概念定制的纯学习

Zhichao Liao, Xiaole Xian, Qingyu Li, Wenyu Qin, Meng Wang, Weicheng Xie, Siyang Song, Pingfa Feng, Long Zeng, Liang Pan

发表机构 * Tsinghua University(清华大学) School of Computer Science & Software Engineering, Shenzhen University(深圳大学计算机科学与软件工程学院) Guangdong Provincial Key Laboratory of Intelligent Information Processing, Shenzhen University(深圳大学智能信息处理广东省重点实验室) Kling Team, Kuaishou Technology(快手科技Kling团队) University of Exeter(埃克塞特大学) S-Lab, Nanyang Technological University(南洋理工大学S实验室)

AI总结 本文提出PureCC,一种用于文本到图像概念定制的纯学习方法,通过分离学习目标来平衡概念定制的保真度与模型保留。

Comments Accepted to CVPR 2026

详情
AI中文摘要

现有概念定制方法在高保真和多概念定制方面取得了显著成果。然而,它们往往忽视了在学习新个性化概念时对原始模型行为和能力的影响。为了解决这个问题,我们提出了PureCC。PureCC引入了一个新的分离学习目标用于概念定制,结合了目标概念的隐式指导与原始条件预测。这种分离形式使PureCC在训练过程中能够显著专注于原始模型。此外,基于此目标,PureCC设计了一个双分支训练流水线,包括一个冻结的提取器提供纯净的目标概念表示作为隐式指导,以及一个可训练的流模型产生原始条件预测,共同实现对个性化概念的纯学习。此外,PureCC引入了一个新的自适应指导尺度$λ^\star$,以动态调整目标概念的指导强度,平衡定制保真度和模型保留。广泛的实验表明,PureCC在保留原始行为和能力的同时,实现了高保真的概念定制。代码可在https://github.com/lzc-sg/PureCC上获得。

英文摘要

Existing concept customization methods have achieved remarkable outcomes in high-fidelity and multi-concept customization. However, they often neglect the influence on the original model's behavior and capabilities when learning new personalized concepts. To address this issue, we propose PureCC. PureCC introduces a novel decoupled learning objective for concept customization, which combines the implicit guidance of the target concept with the original conditional prediction. This separated form enables PureCC to substantially focus on the original model during training. Moreover, based on this objective, PureCC designs a dual-branch training pipeline that includes a frozen extractor providing purified target concept representations as implicit guidance and a trainable flow model producing the original conditional prediction, jointly achieving pure learning for personalized concepts. Furthermore, PureCC introduces a novel adaptive guidance scale $λ^\star$ to dynamically adjust the guidance strength of the target concept, balancing customization fidelity and model preservation. Extensive experiments show that PureCC achieves state-of-the-art performance in preserving the original behavior and capabilities while enabling high-fidelity concept customization. The code is available at https://github.com/lzc-sg/PureCC.

2603.03066 2026-05-20 cs.CV

EduVQA: Towards Concept-Aware Assessment of Educational AI-Generated Videos

EduVQA: 向概念感知的教育AI生成视频评估迈进

Baoliang Chen, Xinlong Bu, Hanwei Zhu, Lingyu Zhu, Jieyu Zhan

发表机构 * College of Computing and Data Science, Nanyang Technological University(南洋理工大学计算与数据科学学院) Department of Computer Science, South China Normal University, China(华南师范大学计算机学院) School of Computer Science, City University of Hong Kong(香港城市大学计算机科学学院)

AI总结 本研究提出EduVQA框架,通过引入结构化2D混合专家架构,实现了对教育AI生成视频中概念正确性的感知评估,解决了传统方法在教育场景中忽略概念正确性的不足。

详情
AI中文摘要

现有的AI生成视频质量评估(AIGVQA)方法主要关注全局感知真实性和粗粒度的文本-视频对齐,却忽视了教育场景中的一项关键要求:概念正确性。在早期数学教育中,即使生成结果在视觉上看似合理,数值数量、几何关系或空间配置上的细微错误也可能从根本上改变所传达的知识。为了解决这个问题,我们提出了EduAVQABench,这是首个面向概念感知教育AIGV评估的基准,包含由十种最先进的T2V模型生成的1,130个视频,以及超过310,650条细粒度人工标注,涵盖感知质量和语义对齐。基于该基准,我们进一步提出了EduVQA,一个配备结构化2D混合专家(S2D-MoE)架构的概念感知AIGVQA框架。通过共享专家和自适应二维路由来联合建模细粒度概念评估与整体质量预测,EduVQA有效捕捉了传统全局评分方法所忽略的细微概念层面不一致。大量实验表明,EduVQA在感知和语义评估任务上均稳定优于现有AIGVQA方法,并在未见过的基准上表现出很强的泛化能力。代码和数据集将公开于:https://github.com/EduVQA/EduVQA

英文摘要

Existing AI-generated video quality assessment (AIGVQA) methods mainly focus on global perceptual realism and coarse text-video alignment, while overlooking a critical requirement in educational scenarios: concept correctness. In early mathematics education, subtle errors in numerical quantities, geometric relations, or spatial configurations may fundamentally alter the conveyed knowledge despite visually plausible generation. To address this problem, we introduce EduAVQABench, the first benchmark for concept-aware educational AIGV assessment, containing 1,130 videos generated by ten state-of-the-art T2V models together with over 310,650 fine-grained human annotations spanning perceptual quality and semantic alignment. Built upon this benchmark, we further propose EduVQA, a concept-aware AIGVQA framework equipped with a Structured 2D Mixture-of-Experts (S2D-MoE) architecture. By jointly modeling fine-grained concept assessment and overall quality prediction through shared experts and adaptive two-dimensional routing, EduVQA effectively captures subtle concept-level inconsistencies overlooked by conventional global scoring methods. Extensive experiments demonstrate that EduVQA consistently outperforms existing AIGVQA approaches across both perceptual and semantic evaluation tasks while exhibiting strong generalization capability on unseen benchmarks. Code and dataset will be publicly available at: https://github.com/EduVQA/EduVQA.

2603.01009 2026-05-20 cs.CL

Qayyem: A Real-time Platform for Scoring Proficiency of Arabic Essays

Qayyem: 阿拉伯语作文水平实时评分平台

Hoor Elbahnasawi, Marwan Sayed, Sohaila Eltanbouly, Fatima Brahamia, Tamer Elsayed

发表机构 * Computer Science and Engineering Department, Qatar University(卡塔尔大学计算机科学与工程系)

AI总结 本文提出Qayyem平台,提供阿拉伯语作文评分的集成工作流程,通过友好的界面简化评分服务器API的交互,部署了多种先进的阿拉伯语作文评分模型。

Comments Accepted at ACL 2026

详情
AI中文摘要

近年来,自动作文评分(AES)系统作为评估学生写作水平的可扩展且一致的解决方案,受到越来越多的关注。尽管近期有所进展,但由于语言复杂性以及大规模公开标注数据集的匮乏,对阿拉伯语AES的支持仍然有限。在本工作中,我们提出了Qayyem,一个基于Web的平台,通过提供涵盖作业创建、批量作文上传、评分配置和按特质(per-trait)作文评估的集成工作流程来支持阿拉伯语AES。Qayyem屏蔽了与评分服务器API交互的技术复杂性,使教师能够通过用户友好的界面使用高级评分服务。该平台部署了多个最先进的阿拉伯语作文评分模型,它们在效果和效率上各不相同。

英文摘要

Over the past years, Automated Essay Scoring (AES) systems have gained increasing attention as scalable and consistent solutions for assessing the proficiency of student writing. Despite recent progress, support for Arabic AES remains limited due to linguistic complexity and scarcity of large publicly-available annotated datasets. In this work, we present Qayyem, a Web-based platform designed to support Arabic AES by providing an integrated workflow for assignment creation, batch essay upload, scoring configuration, and per-trait essay evaluation. Qayyem abstracts the technical complexity of interacting with scoring server APIs, allowing instructors to access advanced scoring services through a user-friendly interface. The platform deploys a number of state-of-the-art Arabic essay scoring models with different effectiveness and efficiency figures.

2602.23622 2026-05-20 cs.CV cs.AI

DLEBench: Evaluating Small-scale Object Editing Ability for Instruction-based Image Editing Model

DLEBench: 评估基于指令的图像编辑模型的小规模物体编辑能力

Shibo Hong, Boxian Ai, Jun Kuang, Wei Wang, FengJiao Chen, Zhongyuan Peng, Chenhao Huang, Yixin Cao

发表机构 * College of Computer Science and Artificial Intelligence, Fudan University(复旦大学计算机科学与人工智能学院)

AI总结 本文提出DLEBench,首个专门评估基于指令的图像编辑模型在小规模物体编辑能力的基准,通过1889个样本覆盖复杂场景,揭示了现有模型在小物体编辑上的性能差距,强调了专用基准的重要性。

详情
AI中文摘要

基于指令的图像编辑模型(IIEMs)领域已取得显著进展。然而,尽管这些模型在现有基准上表现出较好的指令遵循和较强的推理能力,它们编辑小物体的能力仍未得到充分探索,而这种能力对于真实图像和生成图像中的精确局部编辑与细节修饰十分重要。本文提出DeepLookEditBench(DLEBench),首个专门评估IIEMs小规模物体编辑能力的基准。具体而言,我们构建了一个具有挑战性的测试平台,包含覆盖七种指令类型的1889个样本。在这些样本中,目标物体仅占图像面积的1%-10%,并涵盖部分遮挡和多物体编辑等复杂场景。为确保在该基准上进行稳健评估,我们提出了一套带有细化评分标准的评估协议,以尽量减少“指令遵循”和“视觉一致性”两项标准中的主观性和歧义。该协议还引入了双模式评估框架(工具驱动模式和Oracle引导模式),以应对DLEBench上LMM-as-a-Judge与人类判断不一致的问题。在10个IIEMs上的实验结果揭示了小规模物体编辑方面的显著性能差距,凸显了专用基准对于推动该能力发展的必要性。

英文摘要

Significant progress has been made in the field of Instruction-based Image Editing Models (IIEMs). However, while these models demonstrate plausible adherence to instructions and strong reasoning ability on current benchmarks, their ability to edit small objects remains underexplored, despite its importance for precise local editing and refining details in both real and generated images. In this paper, we introduce DeepLookEditBench (DLEBench), the first benchmark dedicated to assessing the abilities of IIEMs in editing small-scale objects. Specifically, we construct a challenging testbed comprising 1889 samples across seven instruction types. In these samples, target objects occupy only 1%-10% of the image area, covering complex scenarios such as partial occlusion and multi-object editing. To ensure robust evaluation on this benchmark, we propose an evaluation protocol with refined score rubrics to minimize subjectivity and ambiguity in two criteria: Instruction Following and Visual Consistency. This protocol also introduces a dual-mode evaluation framework (Tool-driven and Oracle-guided Modes) addressing the misalignment between LMM-as-a-Judge and human judgements on DLEBench. Empirical results on 10 IIEMs reveal significant performance gaps in small-scale object editing, highlighting the need for specialized benchmarks to advance this ability.
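The 1%-10% area criterion above can be illustrated with a small check. This is a minimal sketch, not DLEBench's own tooling: the function names, the `(x0, y0, x1, y1)` pixel-box convention, and the inclusive thresholds are assumptions made for illustration.

```python
def area_fraction(box, image_size):
    """Fraction of the image covered by an axis-aligned box.

    box: (x0, y0, x1, y1) in pixels with x1 > x0 and y1 > y0 (assumed convention).
    image_size: (width, height) in pixels.
    """
    x0, y0, x1, y1 = box
    width, height = image_size
    return ((x1 - x0) * (y1 - y0)) / (width * height)


def is_small_target(box, image_size, low=0.01, high=0.10):
    """True if the box covers between `low` and `high` of the image (inclusive, assumed)."""
    return low <= area_fraction(box, image_size) <= high
```

For a 1000x1000 image, a 200x200 box covers 4% of the area and would count as a small target under these assumed thresholds, while a 500x500 box (25%) would not.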

2602.17038 2026-05-20 cs.AI

Phase-Aware Mixture of Experts for Agentic Reinforcement Learning

面向代理强化学习的阶段感知专家混合

Shengtian Yang, Yu Li, Shuo He, Yewen Li, Qingpeng Cai, Peng Jiang, Lei Feng

发表机构 * School of Computer Science and Engineering, Southeast University, Nanjing, China(东南大学计算机科学与工程学院,南京,中国) College of Computing and Data Science, Nanyang Technological University, Singapore(南洋理工大学计算与数据科学学院,新加坡)

AI总结 本文提出了阶段感知专家混合(PA-MoE),以解决传统专家混合(MoE)中token级路由将阶段一致的模式碎片化的问题,通过学习隐含的阶段边界来增强专家的专业化。

详情
AI中文摘要

强化学习(RL)已使LLM代理具备解决复杂任务的强大能力。然而,现有RL方法通常使用单一策略网络,由此产生简单性偏置(simplicity bias):简单任务占据大部分参数并主导梯度更新,留给复杂任务的容量不足。一个可行的补救办法是在策略网络中采用专家混合(MoE)架构,因为MoE允许不同参数(专家)专门处理不同任务,避免简单任务主导所有参数。然而,传统MoE的一个关键限制在于其token级路由:路由器将每个token分配给专门的专家,这会把阶段一致的模式拆散为零散的专家分配,从而削弱专家的专业化。在本文中,我们提出了阶段感知专家混合(PA-MoE)。它首先引入一个轻量级的阶段路由器,直接从RL目标中学习隐含的阶段边界,而无需预先定义阶段类别。随后,阶段路由器以时间上一致的方式将分配指向同一专家,使专家能够保留阶段特定的专业知识。实验结果证明了PA-MoE的有效性。

英文摘要

Reinforcement learning (RL) has equipped LLM agents with a strong ability to solve complex tasks. However, existing RL methods normally use a single policy network, causing simplicity bias where simple tasks occupy most parameters and dominate gradient updates, leaving insufficient capacity for complex tasks. A plausible remedy could be employing the Mixture-of-Experts (MoE) architecture in the policy network, as MoE allows different parameters (experts) to specialize in different tasks, preventing simple tasks from dominating all parameters. However, a key limitation of traditional MoE is its token-level routing, where the router assigns each token to specialized experts, which fragments phase-consistent patterns into scattered expert assignments and thus undermines expert specialization. In this paper, we propose Phase-Aware Mixture of Experts (PA-MoE). It first features a lightweight phase router that learns latent phase boundaries directly from the RL objective without pre-defining phase categories. Then, the phase router allocates temporally consistent assignments to the same expert, allowing experts to preserve phase-specific expertise. Experimental results demonstrate the effectiveness of our proposed PA-MoE.

2602.15752 2026-05-20 cs.LG

Beyond Match Maximization and Fairness: Retention-Optimized Two-Sided Matching

超越匹配最大化和公平性:以用户留存优化的双侧匹配

Ren Kishimoto, Rikiya Takehi, Koichi Tanaka, Masahiro Nomura, Riku Togashi, Yoji Tomita, Yuta Saito

发表机构 * Institute of Science Tokyo(东京科学大学) Waseda University(早稻田大学) Keio University(庆应义塾大学) CyberAgent Tokyo(CyberAgent 东京) Hajuku-kaso, Co., Ltd.

AI总结 本文提出了一种新的双侧匹配优化方法,旨在最大化用户留存而非单纯匹配数量或公平性,通过引入动态学习排序算法MRet,利用用户个性化留存曲线优化推荐策略,提升整体用户留存率。

Comments Published as a conference paper at ICLR 2026

详情
AI中文摘要

在在线约会和招聘等双侧匹配平台上,推荐算法通常以最大化匹配总数为目标。然而,这一目标会造成失衡:一些用户获得过多匹配,而许多其他用户获得的匹配极少,并最终离开平台。对许多平台(例如严重依赖订阅的平台)而言,留住用户至关重要。有些平台可能采用公平性目标来解决匹配最大化带来的问题。然而,公平性本身并不是许多平台的最终目标,因为用户不会仅仅因为曝光被均等化就回报平台。在实践中,用户留存往往才是最终目标,随意依赖公平性会使留存的优化全凭运气。在本工作中,我们既不最大化匹配,也不以公理化方式定义公平性,而是正式定义了在双侧匹配平台中最大化用户留存这一新的问题设定。为此,我们提出了一种动态排序学习(LTR)算法,称为Matching for Retention(MRet)。与传统的双侧匹配算法不同,我们的方法从每位用户的个人资料和交互历史中学习个性化留存曲线,以此建模用户留存。基于这些曲线,MRet同时考虑接收推荐的用户和被推荐用户双方的留存收益,动态调整推荐,使有限的匹配机会被分配到最能提升整体留存的地方。在合成数据集和来自某大型在线约会平台的真实数据集上的实证评估表明,MRet实现了更高的用户留存,这一结果在意料之中但十分重要,因为传统方法优化的是匹配或公平性而非留存。

英文摘要

On two-sided matching platforms such as online dating and recruiting, recommendation algorithms often aim to maximize the total number of matches. However, this objective creates an imbalance, where some users receive far too many matches while many others receive very few and eventually abandon the platform. Retaining users is crucial for many platforms, such as those that depend heavily on subscriptions. Some may use fairness objectives to solve the problem of match maximization. However, fairness in itself is not the ultimate objective for many platforms, as users do not suddenly reward the platform simply because exposure is equalized. In practice, where user retention is often the ultimate goal, casually relying on fairness will leave the optimization of retention up to luck. In this work, instead of maximizing matches or axiomatically defining fairness, we formally define the new problem setting of maximizing user retention in two-sided matching platforms. To this end, we introduce a dynamic learning-to-rank (LTR) algorithm called Matching for Retention (MRet). Unlike conventional algorithms for two-sided matching, our approach models user retention by learning personalized retention curves from each user's profile and interaction history. Based on these curves, MRet dynamically adapts recommendations by jointly considering the retention gains of both the user receiving recommendations and those who are being recommended, so that limited matching opportunities can be allocated where they most improve overall retention. Naturally but importantly, empirical evaluations on synthetic and real-world datasets from a major online dating platform show that MRet achieves higher user retention, since conventional methods optimize matches or fairness rather than retention.

2602.13466 2026-05-20 cs.CL cs.AI cs.LG

Language Model Memory and Memory Models for Language

语言模型的记忆与面向语言的记忆模型

Benjamin L. Badger

发表机构 * IBM(IBM公司)

AI总结 研究探讨了语言模型和记忆模型在信息存储中的能力差异,发现语言模型的嵌入向量信息较少,而自编码器在输入再生训练中能形成接近完美的记忆,提出了一种可并行的编码器-解码器记忆模型架构,并通过结合因果和信息保留目标函数来提升记忆形成和解码能力。

详情
AI中文摘要

机器学习模型在隐藏层向量嵌入中存储输入信息的能力(类似于“记忆”的概念)已被广泛利用,但尚未得到充分刻画。我们发现,无论训练时的数据和计算规模如何,语言模型的嵌入通常只包含相对较少的输入信息。相比之下,以输入再生为目标训练的自编码器,其嵌入能够形成近乎完美的记忆。用记忆嵌入替代token序列可带来可观的计算效率提升,这促使我们提出一种可并行的编码器-解码器记忆模型架构。仅经因果训练时,这些模型的嵌入信息贫乏,无法支持任意信息访问;但通过结合因果目标函数和信息保留目标函数,它们能够学会形成并解码信息丰富的记忆。训练还可以进一步简化:先冻结一个高保真编码器,再采用课程式训练,让解码器先学习处理记忆,然后再额外学习预测下一个token。我们提出这样一种观点:仅靠下一个token预测的训练并不适合准确的记忆形成,因为该目标本身不可逆,因此对于无法看到完整输入的模型,应使用组合目标函数。

英文摘要

The ability of machine learning models to store input information in hidden layer vector embeddings, analogous to the concept of 'memory', is widely employed but not well characterized. We find that language model embeddings typically contain relatively little input information regardless of data and compute scale during training. In contrast, embeddings from autoencoders trained for input regeneration are capable of nearly perfect memory formation. The substitution of memory embeddings for token sequences leads to substantial computational efficiencies, motivating the introduction of a parallelizable encoder-decoder memory model architecture. Upon causal training these models contain information-poor embeddings incapable of arbitrary information access, but by combining causal and information retention objective functions they learn to form and decode information-rich memories. Training can be further streamlined by freezing a high fidelity encoder followed by a curriculum training approach where decoders first learn to process memories and then learn to additionally predict next tokens. We introduce the perspective that next token prediction training alone is poorly suited for accurate memory formation as the objective itself is non-invertible, motivating the use of combined objective functions for models where the entire input is not exposed.

2602.11910 2026-05-20 cs.SD cs.LG

TADA! Tuning Audio Diffusion Models through Activation Steering

TADA! 通过激活引导调整音频扩散模型

Łukasz Staniszewski, Katarzyna Zaleska, Mateusz Modrzejewski, Kamil Deja

发表机构 * Warsaw University of Technology(华沙理工大学) IDEAS Research Institute(IDEAS研究院)

AI总结 本文利用激活修补技术揭示了音频扩散模型中的语义瓶颈,并表明局部激活引导在音频概念调节上达到了新的最先进水平。

Comments Preprint

详情
AI中文摘要

音频扩散模型能够根据文本合成高保真音乐,但要对特定音乐属性实现细粒度控制仍然具有挑战性,因为人们对其表示高层概念的内部机制知之甚少。在本文中,我们利用激活修补(activation patching)证明,近期的音频扩散架构存在语义瓶颈:一小部分共享的、连续的注意力层控制着不同的音乐概念,例如特定乐器、人声或音乐流派的存在。在此基础上,我们系统评估了多种引导范式,将激活引导与提示级、分数空间(score-space)和权重空间干预进行比较,并分析引导机制与干预位置之间的相互作用。我们的新基准辅以大规模用户研究,表明局部激活引导在音频概念调节上达到了新的最先进水平。

英文摘要

Audio diffusion models can synthesize high-fidelity music from text, yet achieving fine-grained control over specific musical attributes remains challenging, as their internal mechanisms for representing high-level concepts are poorly understood. In this work, we use activation patching to demonstrate that recent audio diffusion architectures exhibit a semantic bottleneck, where a small, shared subset of consecutive attention layers controls distinct musical concepts, such as the presence of specific instruments, vocals, or genres. Building on this, we systematically evaluate a broad spectrum of steering paradigms, comparing activation steering against prompt-level, score-space, and weight-space interventions, analyzing the interaction between the steering mechanism and the intervention site. Our new benchmark, supported by an extensive user study, demonstrates that localized activation steering establishes a new state-of-the-art in audio concept modulation.
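Localized activation steering, as described above, can be sketched generically as adding a direction to the hidden activations at a chosen subset of layers. The sketch below is an illustration only, not the authors' implementation: the layer stack is a list of plain callables, and `steer_vec`, `steer_layers`, and `alpha` are hypothetical names. How the steering direction is obtained is not stated in the abstract and is taken as given here.

```python
import numpy as np


def run_with_steering(x, layers, steer_vec, steer_layers, alpha=1.0):
    """Run a stack of layers, adding alpha * steer_vec after selected layers.

    x: input activations, shape (..., d).
    layers: list of callables mapping activations to activations.
    steer_vec: steering direction of shape (d,), assumed to be given.
    steer_layers: set of layer indices at which to intervene.
    """
    h = x
    for i, layer in enumerate(layers):
        h = layer(h)
        if i in steer_layers:
            # Additive intervention restricted to the chosen layers only.
            h = h + alpha * steer_vec
    return h
```

Restricting `steer_layers` to a few consecutive indices mirrors the idea of intervening only at the bottleneck layers rather than across the whole network.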

2602.11767 2026-05-20 cs.AI cs.CL cs.LG

TSR: Trajectory-Search Rollouts for Multi-Turn RL of LLM Agents

TSR:面向LLM代理多轮RL的轨迹搜索Rollout

Aladin Djuhera, Swanand Ravindra Kadhe, Farhan Ahmed, Syed Zawad, Heiko Ludwig, Holger Boche

发表机构 * Technical University Munich(慕尼黑工业大学) IBM Research(IBM研究院)

AI总结 本文提出TSR,一种在训练时改进每轮轨迹生成的方法,通过轻量级树状搜索构造高质量轨迹,提升rollout质量和学习稳定性,适用于多轮RL任务。

详情
AI中文摘要

大语言模型(LLMs)的进步正推动人们转向使用强化学习(RL),让代理从跨任务的迭代式多轮交互中学习。然而,多轮RL仍然具有挑战性,因为奖励往往稀疏或延迟,环境也可能是随机的。在这种情况下,朴素的轨迹采样会妨碍利用(exploitation)并引发模式崩溃。我们提出了TSR(Trajectory-Search Rollouts),一种训练时方法,它借用测试时扩展的思路来改进每一轮的rollout生成。TSR执行轻量级的树状搜索,在每一轮利用基于状态的反馈选择高分动作,从而构造高质量轨迹。这提高了rollout质量并稳定了学习,同时与标准策略梯度优化器保持兼容,因此TSR与具体优化器无关。我们用best-of-N、束搜索和浅层前瞻搜索实例化TSR,并将其与PPO和GRPO结合,在Sokoban、FrozenLake和WebShop任务上取得了最高15%的性能提升和更稳定的学习,代价是训练计算量一次性的适度增加。通过将搜索从推理阶段移到训练的rollout阶段,TSR为更强的多轮代理学习提供了一种模块化且通用的机制,与现有框架和拒绝采样式的选择方法互补。

英文摘要

Advances in large language models (LLMs) are driving a shift toward using reinforcement learning (RL) to train agents from iterative, multi-turn interactions across tasks. However, multi-turn RL remains challenging as rewards are often sparse or delayed, and environments can be stochastic. In this regime, naive trajectory sampling can hinder exploitation and induce mode collapse. We propose TSR (Trajectory-Search Rollouts), a training-time approach that repurposes test-time scaling ideas for improved per-turn rollout generation. TSR performs lightweight tree-style search to construct high-quality trajectories by selecting high-scoring actions at each turn using state-based feedback. This improves rollout quality and stabilizes learning while remaining compatible with standard policy gradient optimizers, making TSR optimizer-agnostic. We instantiate TSR with best-of-N, beam, and shallow lookahead search, and pair it with PPO and GRPO, achieving up to 15% performance gains and more stable learning on Sokoban, FrozenLake, and WebShop tasks at a modest, one-time increase in training compute. By moving search from inference time to the rollout stage of training, TSR provides a modular and general mechanism for stronger multi-turn agent learning, complementary to existing frameworks and rejection-sampling-style selection methods.
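The best-of-N instantiation mentioned above can be sketched on a toy problem: at every turn, sample N candidate actions, score the state each would lead to, and keep the best. This is a hedged illustration under stated assumptions, not the TSR codebase. The function `best_of_n_rollout` and the toy environment (`toy_step`, `toy_score`, `toy_sample`, `TARGET`) are hypothetical, and the scoring function stands in for whatever state-based feedback the paper uses.

```python
import random

TARGET = 3  # toy goal state (assumption for illustration)


def toy_step(state, action):
    """Deterministic toy transition: move left or right on the integer line."""
    return state + action


def toy_score(state):
    """State-based feedback: closer to TARGET is better."""
    return -abs(TARGET - state)


def toy_sample(state, rng):
    """Stand-in for sampling an action from the policy."""
    return rng.choice([-1, 1])


def best_of_n_rollout(state, sample_action, step, score, n=4, horizon=5, rng=None):
    """Build one trajectory, keeping at each turn the best of n sampled actions."""
    rng = rng or random.Random(0)
    trajectory = []
    for _ in range(horizon):
        candidates = [sample_action(state, rng) for _ in range(n)]
        # Pick the candidate whose resulting state scores highest.
        best = max(candidates, key=lambda a: score(step(state, a)))
        state = step(state, best)
        trajectory.append((best, state))
    return trajectory
```

The returned trajectory can then be handed to a standard policy-gradient update in place of a naively sampled one, which is what makes this kind of search independent of the optimizer.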

2602.09872 2026-05-20 cs.CV cs.HC

BabyMamba-HAR: Lightweight Selective State Space Models for Efficient Human Activity Recognition on Resource Constrained Devices

BabyMamba-HAR:轻量级选择性状态空间模型用于资源受限设备上高效的人体活动识别

Mridankan Mandal

发表机构 * Department of Information Technology(信息技术系) Indian Institute of Information Technology, Allahabad Prayagraj(印度信息技术学院阿拉哈巴德分校,普拉亚格拉吉)

AI总结 本文提出BabyMamba-HAR,一种轻量级选择性状态空间模型,用于在资源受限设备上高效进行人体活动识别,通过两种轻量级架构实现高精度和低资源消耗。

详情
AI中文摘要

在资源受限设备上进行人体活动识别(HAR)需要在多样的传感器配置下保持高精度。选择性状态空间模型(SSMs)能够以线性时间复杂度高效处理序列,是注意力机制的一种有吸引力的替代方案。然而,其TinyML设计空间仍未被探索。本文提出BabyMamba-HAR,包含两种轻量级架构:(1)CI-BabyMamba-HAR,采用通道独立的stem以提高噪声鲁棒性;(2)Crossover-BiDir-BabyMamba-HAR,采用早期融合的stem,使复杂度与通道数无关。两者都集成了权重共享(weight-tied)的双向扫描和门控时间注意力池化。在八个基准上,Crossover-BiDir-BabyMamba-HAR以27K参数和2.21M MACs取得平均86.52%的F1分数,与TinyHAR(86.16%)相当,而在高通道数数据集上所需MACs仅为其约1/11。设备端部署在Raspberry Pi Pico 2和ESP32上进行,使用混合精度C++运行时(INT8投影,float32状态)。结合生命周期感知内存管理的融合计算策略将峰值内存占用从O(B*dmodel*L*dstate)降低到O(B*dmodel*dstate),并经过调整以支持权重共享的双向执行和通道流式执行。两种架构均实现了完整的8/8数据集覆盖,与PyTorch的一致性(parity)超过99.2%,而INT8量化的TFLite基线在覆盖率和一致性上均出现退化(在Pico 2和ESP32上分别为:TinyHAR覆盖7/8和4/8,一致性60.4%和88.6%;TinierHAR覆盖8/8和6/8,一致性54.2%和90.8%;DeepConvLSTM覆盖1/8和0/8)。Crossover-BiDir-BabyMamba-HAR在ESP32上的平均延迟为154.4 ms,在Pico 2上为481.9 ms。消融实验证实,双向扫描和门控注意力分别将F1分数最多提高8.42%和8.94%,为TinyML场景下的SSM部署确立了实用原则。

英文摘要

Human activity recognition (HAR) on resource constrained devices requires high accuracy across diverse sensor setups. Selective state space models (SSMs) offer efficient linear time sequence processing, presenting a compelling alternative to attention mechanisms. However, their TinyML design space remains unexplored. This paper introduces BabyMamba-HAR, comprising two lightweight architectures: (1) CI-BabyMamba-HAR, utilizing a channel independent stem for noise robustness, and (2) Crossover-BiDir-BabyMamba-HAR, utilizing an early fusion stem for channel count independent complexity. Both integrate weight tied bidirectional scanning and gated temporal attention pooling. Across eight benchmarks, Crossover-BiDir-BabyMamba-HAR averages an 86.52% F1-score with 27K parameters and 2.21M MACs, matching TinyHAR (86.16%) while requiring 11x fewer MACs on high channel datasets. On-device deployment on the Raspberry Pi Pico 2 and ESP32 utilized a mixed precision C++ runtime (INT8 projections, float32 states). A fused computation strategy with lifetime aware memory management reduces peak memory footprint from O(B*dmodel*L*dstate) to O(B*dmodel*dstate), adapting to support weight-tied bidirectional and channel-streaming execution. Both architectures achieved full 8/8 dataset coverage with >99.2% PyTorch parity, whereas INT8 quantized TFLite baselines showed degraded coverage and parity (TinyHAR: 7/8 and 4/8 coverage at 60.4% and 88.6% parity, TinierHAR: 8/8 and 6/8 at 54.2% and 90.8%, DeepConvLSTM: 1/8 and 0/8 on Pico 2 and ESP32, respectively). Crossover-BiDir-BabyMamba-HAR averages 154.4 ms latency on ESP32 and 481.9 ms on Pico 2. Ablations confirm bidirectional scanning and gated attention improve F1-scores by up to 8.42% and 8.94%, respectively, establishing practical principles for TinyML SSM deployment.
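The memory reduction from O(B*dmodel*L*dstate) to O(B*dmodel*dstate) can be illustrated with a generic selective-scan recurrence in NumPy. This is a minimal sketch assuming the common Mamba-style discretization (`exp(delta * A)` for the state transition, `delta * B * x` for the input); it is not the paper's C++ runtime, and all function and variable names are illustrative. The first version materializes every hidden state; the second keeps only the running state and emits each output step immediately.

```python
import numpy as np


def scan_materialized(x, delta, A, Bm, Cm):
    """Stores all hidden states: peak memory O(B * D * L * N).

    x, delta: (B, D, L); A: (D, N); Bm, Cm: (B, L, N).
    """
    dA = np.exp(delta[..., None] * A[None, :, None, :])        # (B, D, L, N)
    dBx = delta[..., None] * Bm[:, None, :, :] * x[..., None]  # (B, D, L, N)
    h = np.zeros_like(dA)
    prev = np.zeros(dA.shape[:2] + dA.shape[3:])               # (B, D, N)
    for t in range(x.shape[2]):
        prev = dA[:, :, t] * prev + dBx[:, :, t]
        h[:, :, t] = prev
    return np.einsum("bdln,bln->bdl", h, Cm)


def scan_fused(x, delta, A, Bm, Cm):
    """Keeps only the current state: peak state memory O(B * D * N)."""
    batch, dim, length = x.shape
    state = np.zeros((batch, dim, A.shape[1]))
    y = np.empty((batch, dim, length))
    for t in range(length):
        dt = delta[:, :, t, None]                              # (B, D, 1)
        # Discretize, update the state, and project to the output in one step.
        state = np.exp(dt * A[None]) * state + dt * Bm[:, None, t, :] * x[:, :, t, None]
        y[:, :, t] = np.einsum("bdn,bn->bd", state, Cm[:, t])
    return y
```

Both functions compute the same outputs; the fused form simply never holds more than one time step of state, which is where the factor of L in peak memory disappears.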

2602.09259 2026-05-20 cs.RO cs.HC

Data-centric Design of Learning-based Surgical Gaze Perception Models in Multi-Task Simulation

以数据为中心的基于学习的多任务手术注视感知模型设计

Yizhou Li, Shuyuan Yang, Jiaji Su, Zonghe Chua

发表机构 * Department of Electrical, Computer, and Systems Engineering, Case Western Reserve University(电气、计算机与系统工程系,凯斯西储大学)

AI总结 本研究探讨了多任务模拟中基于学习的手术注视感知模型的设计,借助配对的主动-被动注视数据集,评估了不同注视监督来源对注意力模型学习的影响,并指出了可扩展的众包注视监督的可行路径。

Comments 8 pages, conference pre-print

详情
AI中文摘要

在机器人辅助微创手术(RMIS)中,触觉反馈和深度线索的减少使手术更加依赖专家的视觉感知,这推动了注视引导的训练和基于学习的手术感知模型的发展。然而,术中专家注视数据的采集成本高昂,而且注视监督的来源,即专业水平(中级 vs. 新手)和感知模态(主动执行 vs. 被动观看),如何影响注意力模型所学到的内容,目前尚不清楚。我们提出了一个配对的主动-被动、多任务手术注视数据集,数据在达芬奇SimNow模拟器上采集,涵盖四项训练科目(drills)。主动注视通过带眼动追踪的VR头显在任务执行期间记录,相应的视频随后被用作刺激材料以采集观察者的被动注视,从而实现受控的同视频比较。我们量化了注视组织中依赖于技能和模态的差异,并通过注视点密度重叠分析和单帧显著性建模,评估被动注视对术中监督的可替代性。在各种设置下,MSI-Net给出了稳定且可解释的预测,而SalGAN不稳定,且常与人类注视点对齐较差。在被动注视上训练的模型恢复了相当一部分中级主动注意力,但存在可预测的性能下降,并且主动与被动目标之间的迁移是不对称的。值得注意的是,在较高质量的演示上,新手的被动标签能够以有限的损失近似中级被动目标,这为手术指导和感知建模中可扩展的众包注视监督提供了一条可行路径。

英文摘要

In robot-assisted minimally invasive surgery (RMIS), reduced haptic feedback and depth cues increase reliance on expert visual perception, motivating gaze-guided training and learning-based surgical perception models. However, operative expert gaze is costly to collect, and it remains unclear how the source of gaze supervision, both expertise level (intermediate vs. novice) and perceptual modality (active execution vs. passive viewing), shapes what attention models learn. We introduce a paired active-passive, multi-task surgical gaze dataset collected on the da Vinci SimNow simulator across four drills. Active gaze was recorded during task execution using a VR headset with eye tracking, and the corresponding videos were reused as stimuli to collect passive gaze from observers, enabling controlled same-video comparisons. We quantify skill- and modality-dependent differences in gaze organization and evaluate the substitutability of passive gaze for operative supervision using fixation density overlap analyses and single-frame saliency modeling. Across settings, MSI-Net produced stable, interpretable predictions, whereas SalGAN was unstable and often poorly aligned with human fixations. Models trained on passive gaze recovered a substantial portion of intermediate active attention, but with predictable degradation, and transfer was asymmetric between active and passive targets. Notably, novice passive labels approximated intermediate-passive targets with limited loss on higher-quality demonstrations, suggesting a practical path for scalable, crowd-sourced gaze supervision in surgical coaching and perception modeling.

2602.09023 2026-05-20 cs.RO

TwinRL: Digital Twin-Driven Reinforcement Learning for Real-World Robotic Manipulation

TwinRL: 基于数字孪生的强化学习用于真实世界机器人操作

Qinwen Xu, Jiaming Liu, Rui Zhou, Shaojun Shi, Nuowei Han, Zhuoyang Liu, Chenyang Gu, Shuo Gu, Yang Yue, Gao Huang, Wenzhao Zheng, Sirui Han, Peng Jia, Shanghang Zhang

发表机构 * State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University(北京大学计算机学院多媒体信息处理全国重点实验室) Simplexity Robotics(Simplexity机器人) Tsinghua University(清华大学) Hong Kong University of Science and Technology(香港科技大学)

AI总结 本文提出TwinRL框架,通过数字孪生与真实世界协同训练,提升视觉-语言-动作模型在真实世界中的探索效率和收敛速度,实现高成功率和快速收敛。

详情
AI中文摘要

尽管具有强大的泛化能力,视觉-语言-动作(VLA)模型仍然受限于专家演示的高成本和有限的真实世界交互。虽然在线强化学习(RL)已显示出前景,但其在真实世界VLA操作中的应用受到探索效率低和探索覆盖受限的阻碍。通过系统的真实世界实验,我们观察到在线RL的有效探索空间在很大程度上受监督微调(SFT)阶段所诱导的轨迹分布限制。受此启发,我们提出TwinRL,一种数字孪生与真实世界协同的后训练框架,通过三个阶段扩展并引导VLA模型的RL探索:SFT预热、孪生RL预热和真实世界RL。TwinRL首先从智能手机拍摄的场景中重建高保真数字孪生。在SFT阶段,我们引入一种探索空间扩展策略,将轨迹分布的支撑集扩展到真实演示之外,重塑探索空间以使RL更有效。我们不把孪生仅当作数据增强工具,而是提出孪生RL预热策略,使其能够充当真实世界RL的探索向导。具体而言,TwinRL在数字孪生中执行高效的并行RL,生成交互轨迹来填充回放缓冲区,并稳定后续的真实世界RL学习。这一过程还能识别容易失败但信息丰富的配置,从而可以有针对性地进行人在回路的rollout,进一步提高真机训练效率。在四个任务上,TwinRL在分布内和分布外区域均实现了接近100%的成功率,收敛速度比此前的真实世界RL方法快30%以上,且仅需20分钟的真机交互。

英文摘要

Despite strong generalization capabilities, Vision-Language-Action (VLA) models remain constrained by the high cost of expert demonstrations and limited real-world interaction. While online reinforcement learning (RL) has shown promise, its application to real-world VLA manipulation is hindered by low exploration efficiency and restricted exploration coverage. Through systematic real-world experiments, we observe that the effective exploration space of online RL is largely constrained by the trajectory distribution induced during supervised fine-tuning (SFT). Motivated by this observation, we propose TwinRL, a digital twin-real-world collaborative post-training framework that expands and guides RL exploration for VLA models through three stages: SFT warm-up, twin RL warm-up, and real-world RL. TwinRL first reconstructs a high-fidelity digital twin from smartphone-captured scenes. During the SFT stage, we introduce an exploration space expansion strategy that expands the support of the trajectory distribution beyond real demonstrations, reshaping the exploration space for more effective RL. Rather than treating the twin as a data augmentation tool, we propose a twin RL warm-up strategy that enables it to act as an exploration guide for real-world RL. Specifically, TwinRL performs efficient parallel RL in the digital twin to generate interactive trajectories that populate the replay buffer and stabilize subsequent real-world RL learning. This process also identifies failure-prone yet informative configurations, enabling targeted human-in-the-loop rollouts to further improve on-robot efficiency. Across four tasks, TwinRL achieves near-100% success in both in-distribution and out-of-distribution regions, delivering over 30% faster convergence than prior real-world RL methods with only 20 minutes of on-robot interaction.

2602.07008 2026-05-20 cs.CV cs.LG

Where Not to Learn: Prior-Aligned Training with Subset-based Attribution Constraints for Reliable Decision-Making

不应学习的地方:基于子集归因约束的先验对齐训练以实现可靠的决策制定

Ruoyu Chen, Shangquan Sun, Xiaoqing Guo, Sanyi Zhang, Kangwei Liu, Shiming Liu, Zhangcheng Wang, Qunli Zhang, Hua Zhang, Xiaochun Cao

发表机构 * Institute of Information Engineering, Chinese Academy of Sciences(中国科学院信息工程研究所) University of Chinese Academy of Sciences(中国科学院大学) College of Computing and Data Science, Nanyang Technological University(南洋理工大学计算与数据科学学院) Department of Computer Science, Hong Kong Baptist University(香港浸会大学计算机科学系) Communication University of China(中国传媒大学) Imperial College London(帝国理工学院) School of Cyber Science and Technology, Shenzhen Campus of Sun Yat-sen University(中山大学深圳校区网络空间安全学院)

AI总结 本文提出了一种基于归因的先验对齐方法,通过子集选择归因技术约束模型依赖于人类先验区域,从而提升决策的可靠性。

详情
AI中文摘要

可靠的模型不仅要预测正确,还应以可接受的证据为其决策提供依据。然而,传统监督学习通常只提供类别级标签,使模型可以借助捷径相关性而非预期的证据取得高精度。人类先验有助于约束此类行为,但让模型与这些先验对齐仍然具有挑战性,因为学到的表示往往偏离人类感知。为应对这一挑战,我们提出了一种基于归因的人类先验对齐方法。我们将人类先验编码为模型应当依赖的输入区域(例如边界框),并利用一种高忠实度的基于子集选择的归因方法,在训练过程中揭示模型的决策证据。当归因区域明显偏离先验区域时,我们惩罚模型对先验之外证据的依赖,促使其将归因转向预期区域。这通过一个施加由人类先验导出的归因约束的训练目标来实现。我们在图像分类任务以及基于MLLM的GUI代理模型的点击决策任务上验证了该方法。在传统分类和自回归生成两种设置下,人类先验对齐都一致地提高了任务准确率,同时增强了模型决策的合理性。

英文摘要

Reliable models should not only predict correctly, but also justify decisions with acceptable evidence. Yet conventional supervised learning typically provides only class-level labels, allowing models to achieve high accuracy through shortcut correlations rather than the intended evidence. Human priors can help constrain such behavior, but aligning models to these priors remains challenging because learned representations often diverge from human perception. To address this challenge, we propose an attribution-based human prior alignment method. We encode human priors as input regions that the model is expected to rely on (e.g., bounding boxes), and leverage a highly faithful subset-selection-based attribution approach to expose the model's decision evidence during training. When the attribution region deviates substantially from the prior regions, we penalize reliance on off-prior evidence, encouraging the model to shift its attribution toward the intended regions. This is achieved through a training objective that imposes attribution constraints induced by the human prior. We validate our method on both image classification and click decision tasks in MLLM-based GUI agent models. Across conventional classification and autoregressive generation settings, human prior alignment consistently improves task accuracy while also enhancing the model's decision reasonability.
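The idea of penalizing reliance on off-prior evidence can be sketched with a dense attribution map and a bounding-box prior. This is a simplified stand-in, not the paper's objective: the paper uses a subset-selection-based attribution method, whereas the sketch assumes a non-negative per-pixel attribution map, and the function name, the `margin` threshold, and the end-exclusive box convention are all hypothetical. The NumPy version is also not differentiable; a training loss would need the same computation in an autodiff framework.

```python
import numpy as np


def off_prior_penalty(attribution, box, margin=0.2):
    """Penalty on attribution mass that falls outside the prior region.

    attribution: non-negative (H, W) attribution map (assumed form).
    box: (r0, c0, r1, c1) prior region, end-exclusive (assumed convention).
    margin: tolerated off-prior fraction before any penalty applies (assumed).
    """
    mask = np.zeros(attribution.shape, dtype=bool)
    r0, c0, r1, c1 = box
    mask[r0:r1, c0:c1] = True
    total = attribution.sum()
    if total == 0:
        return 0.0
    off_fraction = attribution[~mask].sum() / total
    # Only substantial deviation from the prior region is penalized.
    return float(max(0.0, off_fraction - margin))
```

In a training loop, a term like this would be added to the task loss with a weighting coefficient, so that accuracy and evidence alignment are optimized together.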

2602.06462 2026-05-20 cs.CL cs.LG

Diffusion-State Policy Optimization for Masked Diffusion Language Models

面向掩码扩散语言模型的扩散状态策略优化

Daisuke Oba, Hiroki Furuta, Naoaki Okazaki

Affiliations: Institute of Science Tokyo

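AI summary: This paper proposes Diffusion-State Policy Optimization (DiSPO), a plug-in credit-assignment layer for masked diffusion language models that directly optimizes intermediate filling decisions; experiments show that it improves on existing terminal-feedback baselines on math and planning benchmarks.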

Details
AI Chinese abstract

掩码扩散语言模型通过迭代填充掩码标记来生成文本,但仅针对最终补全结果的终端奖励,对塑造生成过程的中间填充决策只能提供粗粒度的信用分配。我们提出Diffusion-State Policy Optimization(DiSPO),一种即插即用的信用分配层,直接优化中间填充决策。在选定的中间掩码状态下,DiSPO进行分支:从rollout缓存的logits中重新采样当前被掩码的位置,对由此产生的补全结果打分,并仅更新新填充的标记,无需额外的多步扩散rollout或优化器步骤。我们为分支补全形式化了一个固定状态目标,并推导出一个策略梯度估计器,该估计器复用与终端反馈策略优化相同的rollout。在LLaDA-8B-Instruct上的实验表明,在rollout计算量和优化器步数相当的条件下,DiSPO在数学和规划基准测试中一致地改进了包括diffu-GRPO和SPG在内的终端反馈基线,支持将其作为掩码扩散策略优化的通用插件。我们的项目页面见 https://daioba.github.io/dispo 。

English abstract

Masked diffusion language models generate text through iterative masked-token filling, but terminal-only rewards on final completions provide coarse credit assignment for the intermediate filling decisions that shape the generation process. We propose Diffusion-State Policy Optimization (DiSPO), a plug-in credit-assignment layer that directly optimizes intermediate filling decisions. At selected intermediate masked states, DiSPO branches by resampling the currently masked positions from rollout-cached logits, scores the resulting completions, and updates only the newly filled tokens, requiring no additional multi-step diffusion rollouts or optimizer steps. We formalize a fixed-state objective for branched completions and derive a policy-gradient estimator that reuses the same rollouts as terminal-feedback policy optimization. Experiments on LLaDA-8B-Instruct show that DiSPO consistently improves terminal-feedback baselines, including diffu-GRPO and SPG, on math and planning benchmarks under matched rollout compute and optimizer steps, supporting its use as a general plug-in for masked diffusion policy optimization. Our project page is available at https://daioba.github.io/dispo.
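
The branching step in this abstract (resampling the currently masked positions from cached logits and tracking which tokens were newly filled) can be sketched roughly as follows. This is a toy illustration, not the DiSPO implementation: the state is assumed to be a list of token ids with `None` marking a masked position, `cached_logits` is assumed to hold one logit vector per position, and `branch_from_state` is a hypothetical name. Scoring and the policy-gradient update are left out.

```python
import math
import random

MASK = None  # assumed marker for a still-masked position


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    norm = sum(exps)
    return [e / norm for e in exps]


def branch_from_state(state, cached_logits, rng):
    """Fill every masked position in one shot from cached logits.

    Returns the branched completion and the indices that were newly filled,
    so that a later update could be restricted to those tokens only.
    """
    completion = list(state)
    newly_filled = []
    for pos, token in enumerate(state):
        if token is MASK:
            probs = softmax(cached_logits[pos])
            completion[pos] = rng.choices(range(len(probs)), weights=probs, k=1)[0]
            newly_filled.append(pos)
    return completion, newly_filled


# Usage: positions 1 and 3 are masked; unmasked tokens are left untouched.
rng = random.Random(0)
state = [3, MASK, 1, MASK]
logits = [[0.0] * 5, [0.1, 2.0, 0.3, 0.0, 0.5], [0.0] * 5, [1.0, 0.0, 0.0, 0.0, 1.0]]
completion, newly_filled = branch_from_state(state, logits, rng)
```

Because the logits are reused from the original rollout, a branch like this needs no extra forward passes, which is the property the abstract highlights.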

2602.05709 2026-05-20 cs.AI

Nonlinearity as Rank: Generative Low-Rank Adapter with Radial Basis Functions

非线性即秩:基于径向基函数的生成式低秩适配器

Yihao Ouyang, Shiwei Li, Haozhao Wang, Xiandi Luo, Zhuoqi Hu, Yuetong Song, Qiyu Qin, Yichen Li, Ruixuan Li

Affiliations: Huazhong University of Science and Technology; Hebei University of Technology

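AI summary: This paper proposes GenLoRA, which replaces the explicit basis-vector storage of conventional low-rank adapters with basis vectors generated by lightweight radial basis functions, improving parameter efficiency and fine-tuning performance.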

Details
AI Chinese abstract

低秩适配(LoRA)通过两个低秩矩阵的乘积来近似预训练权重矩阵的更新。然而,标准LoRA遵循显式秩范式,增加模型容量需要在低秩矩阵中添加更多行或列(即基向量),导致参数增长显著。在本文中,我们发现这些基向量表现出显著的参数冗余,可以被轻量级非线性函数紧凑地表示。因此,我们提出生成低秩适配器(GenLoRA),用非线性基向量生成替代显式基向量存储。具体而言,GenLoRA为每个低秩矩阵维护一个潜在向量,并使用一组轻量级径向基函数(RBFs)来合成基向量。每个RBF所需的参数远少于显式基向量,使GenLoRA实现了更高的参数效率。在多个数据集和架构上的广泛实验表明,GenLoRA在较小的参数预算下实现了更高的有效LoRA秩,从而获得更优越的微调性能。代码可在https://anonymous.4open.science/r/GenLoRA获取。

English abstract

Low-rank adaptation (LoRA) approximates the update of a pretrained weight matrix using the product of two low-rank matrices. However, standard LoRA follows an explicit-rank paradigm, where increasing model capacity requires adding more rows or columns (i.e., basis vectors) to the low-rank matrices, leading to substantial parameter growth. In this paper, we find that these basis vectors exhibit significant parameter redundancy and can be compactly represented by lightweight nonlinear functions. Therefore, we propose Generative Low-Rank Adapter (GenLoRA), which replaces explicit basis vector storage with nonlinear basis vector generation. Specifically, GenLoRA maintains a latent vector for each low-rank matrix and employs a set of lightweight radial basis functions (RBFs) to synthesize the basis vectors. Each RBF requires far fewer parameters than an explicit basis vector, enabling higher parameter efficiency in GenLoRA. Extensive experiments across multiple datasets and architectures show that GenLoRA attains higher effective LoRA ranks under smaller parameter budgets, resulting in superior fine-tuning performance. The code is available at https://anonymous.4open.science/r/GenLoRA.
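
To make the parameter argument in this abstract concrete, here is a rough sketch. The LoRA count follows from the standard factorization of a d x k update into a d x r and an r x k matrix. The generative part is only one hypothetical reading of "a latent vector plus a set of RBFs": it assumes a Gaussian RBF with a scalar center and width applied elementwise to the latent vector, which may differ from GenLoRA's actual parameterization. All function names are illustrative.

```python
import math


def lora_param_count(d, k, r):
    """Parameters in standard LoRA: a d x r matrix plus an r x k matrix."""
    return r * (d + k)


def explicit_matrix_params(d, r):
    """Parameters to store r explicit basis vectors of length d."""
    return d * r


def rbf_basis(latent, center, width):
    """One basis vector from a Gaussian RBF applied elementwise (assumed form)."""
    return [math.exp(-((z - center) ** 2) / (2.0 * width ** 2)) for z in latent]


def generate_matrix(latent, rbf_params):
    """Build a len(latent) x len(rbf_params) matrix, one column per RBF."""
    columns = [rbf_basis(latent, center, width) for center, width in rbf_params]
    return [list(row) for row in zip(*columns)]


def generated_matrix_params(d, r):
    """Under this sketch: one latent vector (d) plus (center, width) per RBF."""
    return d + 2 * r


# Usage: three latent entries and two RBFs give a 3 x 2 generated matrix.
matrix = generate_matrix([0.0, 1.0, 2.0], [(0.0, 1.0), (2.0, 0.5)])
```

Under these assumptions, adding a basis vector costs two scalars instead of d, which is the kind of saving the abstract attributes to generating basis vectors rather than storing them.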