2604.03419 2026-05-20 cs.LG math.CO

Adaptive Threshold-Driven Continuous Greedy Method for Scalable Submodular Optimization

Mohammadreza Rostami, Solmaz S. Kia

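Affiliations: Department of Mechanical and Aerospace Engineering, University of California Irvine

AI summary: This work proposes an adaptive threshold-driven continuous greedy method (ATCG) for submodular maximization under matroid constraints; by dynamically adjusting how active sets are expanded, it improves algorithmic efficiency and reduces communication overhead.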

Abstract

Submodular maximization under matroid constraints is a fundamental problem in combinatorial optimization with applications in sensing, data summarization, active learning, and resource allocation. While the Sequential Greedy (SG) algorithm achieves only a $\frac{1}{2}$-approximation due to irrevocable selections, Continuous Greedy (CG) attains the optimal $\bigl(1-\frac{1}{e}\bigr)$-approximation via the multilinear relaxation, at the cost of a progressively dense decision vector that forces agents to exchange feature embeddings for nearly every ground-set element. We propose ATCG (Adaptive Thresholded Continuous Greedy), which gates gradient evaluations behind a per-partition progress ratio $\eta_i$, expanding each agent's active set only when current candidates fail to capture sufficient marginal gain, thereby directly bounding which feature embeddings are ever transmitted. Theoretical analysis establishes a curvature-aware approximation guarantee with effective factor $\tau_{\mathrm{eff}}=\max\{\tau,1-c\}$, interpolating between the threshold-based guarantee and the low-curvature regime where ATCG recovers the performance of CG. This shows that the problem structure, as captured by curvature, determines the amount of coordination and communication required to approach full-CG performance. Experiments on a class-balanced prototype selection problem over a subset of the CIFAR-10 animal dataset show that ATCG achieves objective values comparable to those of the full CG method while substantially reducing communication overhead through adaptive active-set expansion.
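As a small illustration of the effective factor stated in the abstract, the sketch below evaluates max{τ, 1-c} for a few threshold and curvature pairs. The function name `effective_factor`, the assumed [0, 1] ranges, and the sample values are illustrative assumptions; the abstract does not give the full approximation bound, so only the factor itself is computed.

```python
def effective_factor(tau: float, c: float) -> float:
    """Effective factor tau_eff = max(tau, 1 - c) from the abstract.

    tau is the threshold parameter and c the curvature. Both are assumed
    here to lie in [0, 1]; the abstract does not state their ranges.
    """
    if not (0.0 <= tau <= 1.0 and 0.0 <= c <= 1.0):
        raise ValueError("this sketch assumes tau and c are in [0, 1]")
    return max(tau, 1.0 - c)


# Hypothetical (tau, c) pairs, chosen only to show both regimes.
for tau, c in [(0.5, 0.75), (0.5, 0.25), (0.5, 0.0)]:
    print(f"tau={tau}, c={c} -> tau_eff={effective_factor(tau, c)}")
```

With high curvature the threshold term dominates; as c approaches 0 the factor approaches 1, which corresponds to the low-curvature regime in which the abstract says ATCG recovers the performance of CG.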

2604.02784 2026-05-20 cs.CV cs.CL

EnsemHalDet: Robust VLM Hallucination Detection via Ensemble of Internal State Detectors

Ryuhei Miyazato, Shunsuke Kitada, Kei Harada

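Affiliations: The University of Electro-Communications

AI summary: This paper proposes EnsemHalDet, a hallucination detection framework for vision-language models that ensembles detectors over multiple internal representations to improve the robustness of multimodal hallucination detection.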

Abstract

Vision-Language Models (VLMs) excel at multimodal tasks, but they remain vulnerable to hallucinations that are factually incorrect or ungrounded in the input image. Recent work suggests that hallucination detection using internal representations is more efficient and accurate than approaches that rely solely on model outputs. However, existing internal-representation-based methods typically rely on a single representation or detector, limiting their ability to capture diverse hallucination signals. In this paper, we propose EnsemHalDet, an ensemble-based hallucination detection framework that leverages multiple internal representations of VLMs, including attention outputs and hidden states. EnsemHalDet trains independent detectors for each representation and combines them through ensemble learning. Experimental results across multiple VQA datasets and VLMs show that EnsemHalDet consistently outperforms prior methods and single-detector models in terms of AUC. These results demonstrate that ensembling diverse internal signals significantly improves robustness in multimodal hallucination detection.
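The abstract does not specify how the per-representation detectors are combined, so the following is only a hedged sketch of one common choice: averaging detector scores and comparing AUC before and after. The function names and the toy scores are hypothetical and were made up for illustration; they are not results from the paper.

```python
from typing import List, Sequence


def average_ensemble(scores_per_detector: Sequence[Sequence[float]]) -> List[float]:
    """Average the scores of several detectors, example by example."""
    n_detectors = len(scores_per_detector)
    n_examples = len(scores_per_detector[0])
    if any(len(s) != n_examples for s in scores_per_detector):
        raise ValueError("all detectors must score the same examples")
    return [
        sum(s[i] for s in scores_per_detector) / n_detectors
        for i in range(n_examples)
    ]


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Label 1 marks a hallucinated answer.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (invented): two detectors, each wrong on a different example.
labels = [1, 1, 0, 0]
attention_scores = [0.9, 0.4, 0.5, 0.1]
hidden_scores = [0.6, 0.8, 0.3, 0.7]
ensemble_scores = average_ensemble([attention_scores, hidden_scores])
print(auc(attention_scores, labels), auc(hidden_scores, labels), auc(ensemble_scores, labels))
```

In this constructed example each detector misranks a different pair, so averaging cancels the errors. Real detectors need not be this complementary, and the paper's ensemble learning step may differ from plain averaging.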

2603.29501 2026-05-20 cs.LG cs.AI

XNote: Benchmarking Automated Community Notes Generation for Image-based Contextual Deception

Jin Ma, Jingwen Yan, Mohammed Aldeen, Ethan Anderson, Taran Kavuru, Jinkyung Katie Park, Feng Luo, Long Cheng

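Affiliations: School of Computing, Clemson University

AI summary: This paper studies automated Community Notes generation for image-based contextual deception, introduces a real-world dataset, XNote, and benchmarks frontier large vision language models and a commercial tool on deception detection and note generation.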

Abstract

Community Notes have emerged as an effective crowd-sourced mechanism for combating online deception on social media platforms. However, their reliance on human contributors limits both timeliness and scalability. In this work, we study the automated Community Notes generation task for image-based contextual deception, where an authentic image is paired with misleading context (e.g., time, entity, and event). Unlike prior work that primarily focuses on deception detection (i.e., judging whether a post is true or false in a binary manner), automated Community Notes generation requires producing concise and grounded notes that help users recover the missing or corrected context. This problem remains underexplored due to the scarcity of datasets that support this task. To address this gap, we curate a real-world dataset, XNote, comprising X posts with associated Community Notes and external contexts, along with annotations of topics and deceptive factors. We further benchmark a range of frontier large vision language models (LVLMs) on XNote, evaluating their performance on both deception detection and note generation tasks. We also compare against an end-to-end approach, SNIFFER, and a commercial tool, GPT-5. Our results highlight the challenges in automated Community Notes generation, underscoring the need for improved methods and metrics tailored for this task.

2603.17839 2026-05-20 cs.CL cs.AI cs.LG

How do LLMs Compute Verbal Confidence

Dharshan Kumaran, Arthur Conmy, Federico Barbero, Simon Osindero, Viorica Patraucean, Petar Veličković

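Affiliations: Google DeepMind

AI summary: This study examines how large language models internally generate verbal confidence scores; experiments indicate that confidence is cached after answer generation and retrieved for later output, revealing a mechanism of model self-evaluation.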

Abstract

Verbal confidence -- prompting LLMs to state their confidence as a number or category -- is widely used to extract uncertainty estimates from black-box models. However, how LLMs internally generate such scores remains unknown. We address two questions: first, when confidence is computed -- just-in-time when requested, or automatically during answer generation and cached for later retrieval; and second, what verbal confidence represents -- token log-probabilities, or a richer evaluation of answer quality? Focusing on Gemma 3 27B (across TriviaQA, BigMath, and MMLU), Qwen 2.5 7B, and the reasoning model Magistral Small 24B, we provide convergent evidence for cached retrieval. Activation steering, patching, noising, and swap experiments reveal that confidence representations emerge at answer-adjacent positions before appearing at the verbalization site. Attention blocking pinpoints the information flow: confidence is gathered from answer tokens, cached at the first post-answer position, then retrieved for output. Critically, linear probing and variance partitioning reveal that these cached representations explain substantial variance in verbal confidence beyond token log-probabilities, suggesting a richer answer-quality evaluation rather than a simple fluency readout. These findings demonstrate that verbal confidence reflects automatic, sophisticated self-evaluation -- not post-hoc reconstruction -- with implications for understanding metacognition in LLMs and improving calibration.

2603.16284 2026-05-20 cs.CV cs.LG

Locate-then-Sparsify: Attribution Guided Sparse Strategy for Visual Hallucination Mitigation

Tiantian Dang, Chao Bi, Shufan Shen, Jinzhe Liu, Qingming Huang, Shuhui Wang

Affiliations: State Key Lab. of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences; School of Advanced Interdisciplinary Sciences, University of Chinese Academy of Sciences; School of Computer Science and Technology, University of Chinese Academy of Sciences

AI summary: This paper proposes Locate-Then-Sparsify for Feature Steering (LTS-FS), a framework that uses a locate-then-sparsify strategy to set the feature steering intensity of each layer according to its relevance to hallucination, mitigating hallucinations in vision-language models while preserving strong performance.

Comments: Accepted by CVPR 2026

Abstract

Despite the significant advancements in Large Vision-Language Models (LVLMs), their tendency to generate hallucinations undermines reliability and restricts broader practical deployment. Among the hallucination mitigation methods, feature steering emerges as a promising approach that reduces erroneous outputs in LVLMs without increasing inference costs. However, current methods apply uniform feature steering across all layers. This heuristic strategy ignores inter-layer differences, potentially disrupting layers unrelated to hallucinations and ultimately leading to performance degradation on general tasks. In this paper, we propose Locate-Then-Sparsify for Feature Steering (LTS-FS), a plug-and-play framework which controls the steering intensity according to the hallucination relevance of each layer. We first construct a dataset comprising token-level and sentence-level hallucination cases. Based on this dataset, we introduce an attribution method based on causal interventions to quantify the hallucination relevance of each layer. With the attribution scores across layers, we propose a layerwise strategy that converts these scores into feature steering intensities for individual layers, enabling more precise adjustments specifically on hallucination-relevant layers. Extensive experiments across multiple LVLMs and benchmarks demonstrate that LTS-FS effectively mitigates hallucination while preserving strong performance. Codes are available at https://github.com/huttersadan/LTS-FS.

2603.15411 2026-05-20 cs.AI cs.LG

EduVQA: Towards Concept-Aware Assessment of Educational AI-Generated Videos

Baoliang Chen, Xinlong Bu, Hanwei Zhu, Lingyu Zhu, Jieyu Zhan

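Affiliations: College of Computing and Data Science, Nanyang Technological University; Department of Computer Science, South China Normal University, China; School of Computer Science, City University of Hong Kong

AI summary: This study proposes the EduVQA framework, which introduces a Structured 2D Mixture-of-Experts architecture to assess conceptual correctness in educational AI-generated videos, addressing the tendency of conventional methods to overlook conceptual correctness in educational settings.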

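AI-generated abstract (translated from Chinese)

Existing AI-generated video quality assessment (AIGVQA) methods focus mainly on global perceptual realism and coarse text-video alignment, overlooking a key requirement in educational settings: conceptual correctness. In early mathematics education, subtle errors in numerical quantities, geometric relations, or spatial configurations can fundamentally alter the knowledge conveyed, even when the video looks plausible. To address this, we introduce EduAVQABench, the first concept-aware benchmark for assessing educational AIGVs, comprising 1,130 videos generated by ten state-of-the-art T2V models and more than 310,650 fine-grained human annotations covering perceptual quality and semantic alignment. Building on this benchmark, we further propose EduVQA, a concept-aware AIGVQA framework with a Structured 2D Mixture-of-Experts (S2D-MoE) architecture. By jointly modeling fine-grained concept assessment and overall quality prediction through shared experts and adaptive two-dimensional routing, EduVQA captures subtle concept-level inconsistencies that conventional global scoring methods miss. Extensive experiments show that EduVQA outperforms existing AIGVQA methods on both perceptual and semantic assessment tasks and generalizes well to unseen benchmarks. Code and the dataset will be released at https://github.com/EduVQA/EduVQA.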

Beyond Match Maximization and Fairness: Retention-Optimized Two-Sided Matching

Ren Kishimoto, Rikiya Takehi, Koichi Tanaka, Masahiro Nomura, Riku Togashi, Yoji Tomita, Yuta Saito

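Affiliations: Institute of Science Tokyo; Waseda University; Keio University; CyberAgent Tokyo; Hajuku-kaso, Co., Ltd.

AI summary: This paper proposes a new two-sided matching optimization approach that aims to maximize user retention rather than match counts or fairness alone; it introduces a dynamic learning-to-rank algorithm, MRet, which uses personalized user retention curves to optimize recommendations and improve overall retention.

Comments: Published as a conference paper at ICLR 2026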

Abstract

On two-sided matching platforms such as online dating and recruiting, recommendation algorithms often aim to maximize the total number of matches. However, this objective creates an imbalance, where some users receive far too many matches while many others receive very few and eventually abandon the platform. Retaining users is crucial for many platforms, such as those that depend heavily on subscriptions. Some may use fairness objectives to solve the problem of match maximization. However, fairness in itself is not the ultimate objective for many platforms, as users do not suddenly reward the platform simply because exposure is equalized. In practice, where user retention is often the ultimate goal, casually relying on fairness will leave the optimization of retention up to luck. In this work, instead of maximizing matches or axiomatically defining fairness, we formally define the new problem setting of maximizing user retention in two-sided matching platforms. To this end, we introduce a dynamic learning-to-rank (LTR) algorithm called Matching for Retention (MRet). Unlike conventional algorithms for two-sided matching, our approach models user retention by learning personalized retention curves from each user's profile and interaction history. Based on these curves, MRet dynamically adapts recommendations by jointly considering the retention gains of both the user receiving recommendations and those who are being recommended, so that limited matching opportunities can be allocated where they most improve overall retention. Naturally but importantly, empirical evaluations on synthetic and real-world datasets from a major online dating platform show that MRet achieves higher user retention, since conventional methods optimize matches or fairness rather than retention.

2602.13466 2026-05-20 cs.CL cs.AI cs.LG

Language Model Memory and Memory Models for Language

Benjamin L. Badger

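Affiliations: IBM

AI summary: This study examines how language models and memory models differ in their capacity to store information. It finds that language model embeddings carry relatively little input information, whereas autoencoders trained for input regeneration form near-perfect memories; it proposes a parallelizable encoder-decoder memory model architecture and combines causal and information-retention objective functions to improve memory formation and decoding.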

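AI-generated abstract (translated from Chinese)

The ability of machine learning models to store input information in hidden-layer vector embeddings, analogous to the concept of "memory", is widely used but poorly characterized. We find that language model embeddings typically contain relatively little input information, regardless of data and compute scale. In contrast, embeddings from autoencoders trained for input regeneration can form near-perfect memories. Substituting memory embeddings for token sequences yields substantial computational efficiencies, motivating a parallelizable encoder-decoder memory model architecture. After causal training, these models contain information-poor embeddings that do not support arbitrary information access, but by combining causal and information-retention objective functions they learn to form and decode information-rich memories. By freezing a high-fidelity encoder and adopting a curriculum training approach, the decoder first learns to process memories and then learns to predict the next token. We introduce the perspective that next-token prediction training alone is insufficient for accurate memory formation because the objective itself is non-invertible, which motivates the use of combined objective functions for models in which the input is not fully exposed.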

Data-centric Design of Learning-based Surgical Gaze Perception Models in Multi-Task Simulation

Yizhou Li, Shuyuan Yang, Jiaji Su, Zonghe Chua

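Affiliations: Department of Electrical, Computer, and Systems Engineering, Case Western Reserve University

AI summary: This study examines the design of learning-based surgical gaze perception models in multi-task simulation; using a paired active-passive gaze dataset, it evaluates how different sources of gaze supervision affect what attention models learn and points to scalable, crowd-sourced gaze supervision.

Comments: 8 pages, conference pre-print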

Abstract

In robot-assisted minimally invasive surgery (RMIS), reduced haptic feedback and depth cues increase reliance on expert visual perception, motivating gaze-guided training and learning-based surgical perception models. However, operative expert gaze is costly to collect, and it remains unclear how the source of gaze supervision, both expertise level (intermediate vs. novice) and perceptual modality (active execution vs. passive viewing), shapes what attention models learn. We introduce a paired active-passive, multi-task surgical gaze dataset collected on the da Vinci SimNow simulator across four drills. Active gaze was recorded during task execution using a VR headset with eye tracking, and the corresponding videos were reused as stimuli to collect passive gaze from observers, enabling controlled same-video comparisons. We quantify skill- and modality-dependent differences in gaze organization and evaluate the substitutability of passive gaze for operative supervision using fixation density overlap analyses and single-frame saliency modeling. Across settings, MSI-Net produced stable, interpretable predictions, whereas SalGAN was unstable and often poorly aligned with human fixations. Models trained on passive gaze recovered a substantial portion of intermediate active attention, but with predictable degradation, and transfer was asymmetric between active and passive targets. Notably, novice passive labels approximated intermediate-passive targets with limited loss on higher-quality demonstrations, suggesting a practical path for scalable, crowd-sourced gaze supervision in surgical coaching and perception modeling.

2602.09023 2026-05-20 cs.RO

TwinRL: Digital Twin-Driven Reinforcement Learning for Real-World Robotic Manipulation

Qinwen Xu, Jiaming Liu, Rui Zhou, Shaojun Shi, Nuowei Han, Zhuoyang Liu, Chenyang Gu, Shuo Gu, Yang Yue, Gao Huang, Wenzhao Zheng, Sirui Han, Peng Jia, Shanghang Zhang

Affiliations: State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University; Simplexity Robotics; Tsinghua University; Hong Kong University of Science and Technology

AI summary: This paper proposes TwinRL, a framework that co-trains across a digital twin and the real world to improve the real-world exploration efficiency and convergence speed of vision-language-action models, achieving high success rates and fast convergence.

Abstract

Despite strong generalization capabilities, Vision-Language-Action (VLA) models remain constrained by the high cost of expert demonstrations and limited real-world interaction. While online reinforcement learning (RL) has shown promise, its application to real-world VLA manipulation is hindered by low exploration efficiency and restricted exploration coverage. Through systematic real-world experiments, we observe that the effective exploration space of online RL is largely constrained by the trajectory distribution induced during supervised fine-tuning (SFT). Motivated by this observation, we propose TwinRL, a digital twin-real-world collaborative post-training framework that expands and guides RL exploration for VLA models through three stages: SFT warm-up, twin RL warm-up, and real-world RL. TwinRL first reconstructs a high-fidelity digital twin from smartphone-captured scenes. During the SFT stage, we introduce an exploration space expansion strategy that expands the support of the trajectory distribution beyond real demonstrations, reshaping the exploration space for more effective RL. Rather than treating the twin as a data augmentation tool, we propose a twin RL warm-up strategy that enables it to act as an exploration guide for real-world RL. Specifically, TwinRL performs efficient parallel RL in the digital twin to generate interactive trajectories that populate the replay buffer and stabilize subsequent real-world RL learning. This process also identifies failure-prone yet informative configurations, enabling targeted human-in-the-loop rollouts to further improve on-robot efficiency. Across four tasks, TwinRL achieves near-100% success in both in-distribution and out-of-distribution regions, delivering over 30% faster convergence than prior real-world RL methods with only 20 minutes of on-robot interaction.

2602.07008 2026-05-20 cs.CV cs.LG

Where Not to Learn: Prior-Aligned Training with Subset-based Attribution Constraints for Reliable Decision-Making

Ruoyu Chen, Shangquan Sun, Xiaoqing Guo, Sanyi Zhang, Kangwei Liu, Shiming Liu, Zhangcheng Wang, Qunli Zhang, Hua Zhang, Xiaochun Cao

Affiliations: Institute of Information Engineering, Chinese Academy of Sciences; University of Chinese Academy of Sciences; College of Computing and Data Science, Nanyang Technological University; Department of Computer Science, Hong Kong Baptist University; Communication University of China; Imperial College London; School of Cyber Science and Technology, Shenzhen Campus of Sun Yat-sen University

AI summary: This paper proposes an attribution-based prior alignment method that uses subset-selection attribution to constrain the model to rely on human-prior regions, improving decision reliability.

Abstract

Reliable models should not only predict correctly, but also justify decisions with acceptable evidence. Yet conventional supervised learning typically provides only class-level labels, allowing models to achieve high accuracy through shortcut correlations rather than the intended evidence. Human priors can help constrain such behavior, but aligning models to these priors remains challenging because learned representations often diverge from human perception. To address this challenge, we propose an attribution-based human prior alignment method. We encode human priors as input regions that the model is expected to rely on (e.g., bounding boxes), and leverage a highly faithful subset-selection-based attribution approach to expose the model's decision evidence during training. When the attribution region deviates substantially from the prior regions, we penalize reliance on off-prior evidence, encouraging the model to shift its attribution toward the intended regions. This is achieved through a training objective that imposes attribution constraints induced by the human prior. We validate our method on both image classification and click decision tasks in MLLM-based GUI agent models. Across conventional classification and autoregressive generation settings, human prior alignment consistently improves task accuracy while also enhancing the model's decision reasonability.

2602.06462 2026-05-20 cs.CL cs.LG

Diffusion-State Policy Optimization for Masked Diffusion Language Models

Daisuke Oba, Hiroki Furuta, Naoaki Okazaki

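Affiliations: Institute of Science Tokyo

AI summary: This paper proposes Diffusion-State Policy Optimization (DiSPO), a plug-in credit-assignment layer for masked diffusion language models that directly optimizes intermediate filling decisions; experiments show it improves over existing baselines on math and planning benchmarks.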

Abstract

Masked diffusion language models generate text through iterative masked-token filling, but terminal-only rewards on final completions provide coarse credit assignment for the intermediate filling decisions that shape the generation process. We propose Diffusion-State Policy Optimization (DiSPO), a plug-in credit-assignment layer that directly optimizes intermediate filling decisions. At selected intermediate masked states, DiSPO branches by resampling the currently masked positions from rollout-cached logits, scores the resulting completions, and updates only the newly filled tokens, requiring no additional multi-step diffusion rollouts or optimizer steps. We formalize a fixed-state objective for branched completions and derive a policy-gradient estimator that reuses the same rollouts as terminal-feedback policy optimization. Experiments on LLaDA-8B-Instruct show that DiSPO consistently improves terminal-feedback baselines, including diffu-GRPO and SPG, on math and planning benchmarks under matched rollout compute and optimizer steps, supporting its use as a general plug-in for masked diffusion policy optimization. Our project page is available at https://daioba.github.io/dispo.

2602.05709 2026-05-20 cs.AI

Nonlinearity as Rank: Generative Low-Rank Adapter with Radial Basis Functions

Yihao Ouyang, Shiwei Li, Haozhao Wang, Xiandi Luo, Zhuoqi Hu, Yuetong Song, Qiyu Qin, Yichen Li, Ruixuan Li

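Affiliations: Huazhong University of Science and Technology; Hebei University of Technology

AI summary: This paper proposes GenLoRA, which replaces the explicit basis-vector storage of conventional low-rank adapters with basis vectors generated by lightweight radial basis functions, improving parameter efficiency and fine-tuning performance.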

Abstract

Low-rank adaptation (LoRA) approximates the update of a pretrained weight matrix using the product of two low-rank matrices. However, standard LoRA follows an explicit-rank paradigm, where increasing model capacity requires adding more rows or columns (i.e., basis vectors) to the low-rank matrices, leading to substantial parameter growth. In this paper, we find that these basis vectors exhibit significant parameter redundancy and can be compactly represented by lightweight nonlinear functions. Therefore, we propose Generative Low-Rank Adapter (GenLoRA), which replaces explicit basis vector storage with nonlinear basis vector generation. Specifically, GenLoRA maintains a latent vector for each low-rank matrix and employs a set of lightweight radial basis functions (RBFs) to synthesize the basis vectors. Each RBF requires far fewer parameters than an explicit basis vector, enabling higher parameter efficiency in GenLoRA. Extensive experiments across multiple datasets and architectures show that GenLoRA attains higher effective LoRA ranks under smaller parameter budgets, resulting in superior fine-tuning performance. The code is available at https://anonymous.4open.science/r/GenLoRA.
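The first two sentences of the abstract describe standard LoRA, and a minimal sketch of that baseline may help make the parameter-growth argument concrete. It forms the update as the product of two low-rank matrices and counts the trainable parameters, which grow linearly with the rank. The helper names and the layer sizes are illustrative assumptions, and the sketch does not implement GenLoRA's RBF-based generator, whose details the abstract does not give.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain nested-list matrix product, enough for a small demonstration."""
    inner = len(b)
    if any(len(row) != inner for row in a):
        raise ValueError("inner dimensions do not match")
    cols = len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def lora_delta(B: Matrix, A: Matrix) -> Matrix:
    """Weight update of standard LoRA: delta_W = B @ A.

    B has shape (d_out, r) and A has shape (r, d_in), so delta_W has
    shape (d_out, d_in) and rank at most r.
    """
    return matmul(B, A)


def lora_param_count(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters of the pair (B, A): r * (d_out + d_in)."""
    return rank * (d_out + d_in)


# Rank-1 update of a 3x2 weight matrix.
print(lora_delta([[1.0], [2.0], [3.0]], [[4.0, 5.0]]))

# Hypothetical 4096x4096 layer: each extra unit of rank adds 8192 parameters.
for r in (8, 16, 32):
    print(r, lora_param_count(4096, 4096, r), 4096 * 4096)
```

Doubling the rank doubles the adapter's parameter count; this is the explicit-rank cost that the abstract says GenLoRA avoids by generating basis vectors instead of storing them.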

HEX: Humanoid-Aligned Experts for Cross-Embodiment Whole-Body Manipulation

DSPR: Dual-Stream Physics-Residual Networks for Trustworthy Industrial Time Series Forecasting

Unified Deployment-Aware Evaluation of Open Reasoning Language Models

Adaptive Threshold-Driven Continuous Greedy Method for Scalable Submodular Optimization

EnsemHalDet: Robust VLM Hallucination Detection via Ensemble of Internal State Detectors

Target-Aligned Reinforcement Learning

TrajectoryMover: Generative Movement of Object Trajectories in Videos

PICon: A Multi-Turn Interrogation Framework for Evaluating Persona Agent Consistency

XNote: Benchmarking Automated Community Notes Generation for Image-based Contextual Deception

How do LLMs Compute Verbal Confidence

Locate-then-Sparsify: Attribution Guided Sparse Strategy for Visual Hallucination Mitigation

A Hybrid Modeling Framework for Crop Prediction Tasks via Dynamic Parameter Calibration and Multi-Task Learning

A Grid-Based Framework for E-Scooter Demand Representation and Temporal Input Design for Deep Learning: Evidence from Austin, Texas

Synthetic Data Generation for Brain-Computer Interfaces: Overview, Benchmarking, and Future Directions

Does AI See like Art Historians? Interpreting How Vision Language Models Recognize Artistic Style

PureCC: Pure Learning for Text-to-Image Concept Customization

EduVQA: Towards Concept-Aware Assessment of Educational AI-Generated Videos

Qayyem: A Real-time Platform for Scoring Proficiency of Arabic Essays

DLEBench: Evaluating Small-scale Object Editing Ability for Instruction-based Image Editing Model

Phase-Aware Mixture of Experts for Agentic Reinforcement Learning

Beyond Match Maximization and Fairness: Retention-Optimized Two-Sided Matching

Language Model Memory and Memory Models for Language

TADA! Tuning Audio Diffusion Models through Activation Steering

TSR: Trajectory-Search Rollouts for Multi-Turn RL of LLM Agents

BabyMamba-HAR: Lightweight Selective State Space Models for Efficient Human Activity Recognition on Resource Constrained Devices

Data-centric Design of Learning-based Surgical Gaze Perception Models in Multi-Task Simulation

TwinRL: Digital Twin-Driven Reinforcement Learning for Real-World Robotic Manipulation

Where Not to Learn: Prior-Aligned Training with Subset-based Attribution Constraints for Reliable Decision-Making

Diffusion-State Policy Optimization for Masked Diffusion Language Models

Nonlinearity as Rank: Generative Low-Rank Adapter with Radial Basis Functions