arXivDaily: arXiv Daily Academic Digest, updated Monday through Friday
2605.26998 2026-05-27 cs.LG q-bio.NC

Probabilistic Recurrent Intention Switching Model

Wenyuan Sheng, Hao Zhu, Joschka Boedecker

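Affiliations: Department of Computer Science, University of Freiburg

AI summary: Proposes PRISM, which models non-stationary intention switching with a lightweight recurrent network, yields an exact EM decomposition with closed-form solutions, and achieves the best likelihood while recovering interpretable intentions on gridworld, mouse-labyrinth, and robotic-manipulation tasks.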

English abstract

Inverse reinforcement learning (IRL) recovers reward functions from observed behavior, yet traditional methods assume a single stationary reward that cannot capture goal switching within an episode. Recent multi-intention IRL methods address this by segmenting trajectories, but model intention transitions as either a memoryless Markov chain or via manual state augmentation with a fixed history window. We propose the Probabilistic Recurrent Intention Switching Model (PRISM), which replaces both mechanisms with a lightweight recurrent network that maps observation history to a per-step intention distribution. We prove that the resulting EM objective decomposes exactly into independent per-intention reward subproblems, each solvable in closed form, yielding an $\mathcal{O}(nK)$ E-step with no variational approximation. We evaluate PRISM on a non-Markovian gridworld, a mouse labyrinth, and BridgeData V2 robotic manipulation, the first large-scale robotic application of multi-intention IRL. Across all settings, PRISM achieves the highest held-out log-likelihood while recovering nameable, temporally coherent intentions from unlabeled demonstrations, suggesting that discrete goal switching is present in both biological and artificial agents.
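A small sketch can illustrate why a per-step intention posterior costs $\mathcal{O}(nK)$. The abstract does not give the update equations, so the form below is an assumption: the recurrent network is taken to output a per-step prior over K intentions from the observed history, and each intention supplies a likelihood for the observed action. The names `e_step`, `prior`, and `action_lik` are hypothetical, not the authors' API.

```python
import numpy as np

def e_step(prior: np.ndarray, action_lik: np.ndarray) -> np.ndarray:
    """Per-step intention responsibilities (illustrative assumption).

    prior:      (n, K) per-step intention distribution, assumed to come
                from a recurrent network over the observation history
    action_lik: (n, K) likelihood of the observed action under each
                intention's policy
    """
    joint = prior * action_lik                        # n * K multiplications
    return joint / joint.sum(axis=1, keepdims=True)   # normalize each step on its own

# Toy trajectory with n = 2 steps and K = 2 intentions.
prior = np.array([[0.5, 0.5], [0.9, 0.1]])
action_lik = np.array([[0.2, 0.8], [0.5, 0.5]])
gamma = e_step(prior, action_lik)
```

Under this reading, each step is normalized independently because the prior depends only on observed history, so no forward-backward pass over latent states is needed.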

2605.26992 2026-05-27 cs.CV

On the Robustness of Machine Unlearning for Vision-Language Models

Yujie Lin, Kaidi Jia, Jiayao Ma, Chengyi Yang, Jinsong Su

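Affiliations: Xiamen University

AI summary: Presents the first systematic survey of machine-unlearning robustness for vision-language models and proposes three attack paradigms showing that existing methods often hide rather than fully remove target knowledge.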

English abstract

Vision-language models (VLMs) may memorize undesirable information from training data, motivating growing interest in machine unlearning. In this work, we present the first systematic survey and robustness analysis of VLM unlearning. We provide a comprehensive taxonomy and review of existing VLM unlearning methods, together with unified evaluations under multiple prompt settings. We then propose three attack paradigms to examine whether forgotten multimodal knowledge can be reactivated through contextual prompting or downstream retraining. Extensive experiments show that many existing methods remain vulnerable under these attacks, indicating that current approaches often hide rather than fully remove target knowledge. Our study provides new insights into the robustness and limitations of current VLM unlearning methods and highlights the need for more reliable multimodal unlearning strategies. Code is available at https://github.com/XMUDeepLIT/VLM-UnL-Attack.

2605.26991 2026-05-27 cs.RO

Towards Shared Embodied Intelligence in Humanoid Robots through Optimization Development and Testing of the Human Aware ergoCub Robot

Carlotta Sartore, Mohamed Elobaid, Lorenzo Rapetti, Giulio Romualdi, Stefano Dafarra, Nicola A. Piga, Ines Sorrentino, Paolo Maria Vicecone, Silvio Traversaro, Ugo Pattacini, Luca Fiorio, Francesco Draicchio, Giovanna Tranfo, Lorenzo Natale, Marco Maggiali, Daniele Pucci

Affiliations: GenerativeBionics; Artificial and Mechanical Intelligence; School of Computer Science, University of Manchester; Humanoid Sensing and Perception; iCub Tech Facility; DiMEILA, Istituto Nazionale Assicurazione Infortuni sul Lavoro (INAIL)

AI summary: Proposes an architecture that combines shared intelligence and embodied cognition, optimizing robot hardware and control against human ergonomic metrics to enable physical human-robot collaboration, with the ergoCub humanoid robot as a concrete implementation.

English abstract

Collaboration is central to human behavior, enabling tasks beyond individual capability. This ability arises from coordinating actions through internal representations of others, a concept known as shared intelligence. Additionally, humans are characterized by physical bodies and cognitive abilities that are optimized in response to their environment, a phenomenon referred to as embodied cognition. Designing humanoid robots that collaborate safely and effectively with people requires unifying these principles. Here we propose an architecture that integrates shared intelligence and embodied cognition to enable robots to physically collaborate with humans, where robot hardware and control are optimized for human metrics, using representations of the human body and motion intelligence. The ultimate goal is to achieve a form of shared embodied intelligence. Specifically, our architecture optimizes robot hardware and physical intelligence parameters with respect to human ergonomic metrics. This is accomplished by modeling human-robot interaction as a function of hardware configurations and embedding human models into the robot's physical intelligence. As a concrete implementation, we present the humanoid robot ergoCub, whose morphology and control have been optimized for collaborative tasks with humans. Our approach provides a framework for designing humanoid robots that prioritize human ergonomics at both the hardware and physical intelligence levels, with applications in industrial and assistive robotics.

2605.26984 2026-05-27 cs.LG

TED: Related Party Transaction guided Tax Evasion Detection on Heterogeneous Graph

Yiming Xu, Bin Shi, Bo Dong, Jiaxiang Wang, Hua Wei, Qinghua Zheng

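Affiliations: School of Computer Science and Technology, Xi’an Jiaotong University; School of Distance Education, Xi’an Jiaotong University; School of Computing and Augmented Intelligence, Arizona State University

AI summary: Because existing tax evasion detection methods underuse the rich interaction information in tax scenarios, proposes TED, a heterogeneous graph neural network model that filters noise via related party transaction groups and captures deeper semantics with a hierarchical attention mechanism, significantly outperforming existing methods on real-world datasets.

Comments: Accepted by Data Mining and Knowledge Discovery (DMKD25)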

English abstract

Tax evasion causes severe losses of government revenues and disturbs the economic order of fair competition. To help alleviate this problem, the latest tax evasion detection solutions utilize expert knowledge to extract features and then train classifiers to determine whether a company is suspected of tax evasion. However, existing solutions mainly focus on the statistical features of the company and fail to exploit the rich interactive information in tax scenarios, which hurts detection performance. In this paper, we first model the tax scenario as a heterogeneous graph and study the tax evasion detection problem under the heterogeneous graph model. To improve the performance of tax evasion detection, a novel graph neural network model is proposed to extract the comprehensive information of heterogeneous graphs. Specifically, we use heterogeneous and complex related party transaction groups to filter low-level noise information. Moreover, a hierarchical attention mechanism is designed to capture the deeper structure and semantic information hidden in the related party transaction group. We apply our method to the real risk management system of the tax bureau, and evaluate it on two human-labeled real-world tax datasets. The results demonstrate that our method significantly outperforms the state-of-the-art in the tax evasion detection task.

2605.26978 2026-05-27 cs.CL cs.SD

PashtoTTS-Bench: automated screening for low-resource non-Latin-script text-to-speech

Hanif Rahman

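Affiliations: Independent Researcher

AI summary: To address the shortcomings of a single ASR round-trip WER when evaluating TTS for low-resource non-Latin-script languages, proposes the INSV reporting framework and its automated screening subset INSV-A, instantiated as the PashtoTTS-Bench benchmark, which evaluates several TTS systems on multiple metrics.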

English abstract

Text-to-speech (TTS) evaluation for low-resource non-Latin-script languages can fail when it relies on a single ASR round-trip word error rate (WER). A system may produce no audio, speak a neighbouring language, preserve target script text only in an ASR transcript, or sound unnatural to native listeners. We introduce INSV (Intelligibility, Naturalness, Script fidelity, and Verification), a reporting framework that separates these cases. This paper reports INSV-A, the automated screening subset: synthesis completion, ASR WER/CER, transcript Script Fidelity Rate, and audio language identification. Native MOS and phonetic annotation are specified but not claimed in this release. We instantiate INSV-A as PashtoTTS-Bench, a dated benchmark for Pashto TTS. The April-May 2026 run evaluates Edge GulNawaz, Edge Latifa, OmniVoice clone, OmniVoice auto, and an Urdu negative control on 200 FLEURS and 200 filtered Common Voice 24 prompts. Under the independent omniASR_CTC_300M_v2, OmniVoice auto has the lowest WER (24.1% FLEURS, 27.4% CV24), followed by Edge GulNawaz (32.8%, 39.5%), Edge Latifa (35.6%, 47.7%), and OmniVoice clone (45.4%, 34.8%). WER below the natural-speech baseline reflects clean synthetic audio and should not be read as better than native speech. Whisper Large V3 returns 0.0% Pashto labels on checked Pashto TTS audio, while MMS-LID-4017 and SpeechBrain VoxLingua107 separate Pashto outputs from the Urdu control. The release provides provider metadata, per-sentence scores, LID audits, failure logs, and scripts for adding systems.
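Two of the automated checks named above, ASR WER/CER and a script fidelity rate, can be sketched in a few lines. WER and CER follow the standard edit-distance definition. The script fidelity function is an assumption: the abstract does not define the metric, so it is taken here as the share of letters in a transcript that fall inside Arabic-script Unicode blocks. The names `error_rate`, `script_fidelity`, and `ARABIC_BLOCKS` are hypothetical and are not taken from the benchmark's released scripts.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (single-row DP)."""
    row = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        prev, row[0] = row[0], i
        for j, h in enumerate(hyp, start=1):
            # prev holds the diagonal cell; row[j] is still the cell above.
            prev, row[j] = row[j], min(row[j] + 1, row[j - 1] + 1, prev + (r != h))
    return row[-1]

def error_rate(reference: str, hypothesis: str, unit: str = "word") -> float:
    """WER when unit == "word", CER when unit == "char"."""
    ref = reference.split() if unit == "word" else list(reference)
    hyp = hypothesis.split() if unit == "word" else list(hypothesis)
    return edit_distance(ref, hyp) / max(len(ref), 1)

# Arabic, Arabic Supplement, and Arabic Extended-A blocks.
ARABIC_BLOCKS = ((0x0600, 0x06FF), (0x0750, 0x077F), (0x08A0, 0x08FF))

def script_fidelity(text: str) -> float:
    """Share of letters written in Arabic script (assumed definition)."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    hits = sum(any(lo <= ord(c) <= hi for lo, hi in ARABIC_BLOCKS) for c in letters)
    return hits / len(letters)

wer = error_rate("a b c d", "a x c")  # one substitution and one deletion
```

A transcript that an ASR system romanizes would score well on neither check, while a transcript in the right script but the wrong language would pass script fidelity and still need the audio language identification step.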

2605.26977 2026-05-27 cs.LG math.OC

Convergence of Spectral Descent for Non-smooth Optimization

Yixuan Yang, Yuqing He, Song Li

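Affiliations: School of Mathematical Sciences, Zhejiang University, Hangzhou, China

AI summary: Studies the global linear convergence of Spectral Descent (SD), a simplified variant of the Muon optimizer, and its truncated version (TSD) in non-smooth convex optimization, with an application to robust low-rank matrix recovery.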

English abstract

The Muon optimizer has recently demonstrated remarkable empirical success in training large language models. However, the theoretical understanding of its mechanisms remains limited. Current convergence guarantees for Muon rely heavily on smoothness assumptions, leaving its non-smooth convergence behavior largely unexplored. In this work, we take a step toward bridging this gap by investigating Spectral Descent (SD), a simplified variant of Muon, together with its truncated counterpart, Truncated Spectral Descent (TSD). Under convexity, Lipschitz continuity, and sharpness conditions, we establish global linear convergence for both SD and TSD in non-smooth convex formulations. We also study regularized variants equipped with decoupled weight decay and derive sublinear convergence guarantees through their connection with Frank-Wolfe methods. Finally, we apply our theoretical framework to robust low-rank matrix recovery under mixed sparse and dense noise regimes and provide rigorous recovery guarantees. Numerical experiments support the theoretical findings and demonstrate the effectiveness of Muon-type methods for non-smooth optimization.
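As a rough illustration, spectral descent is commonly described as steepest descent under the spectral norm: the (sub)gradient $G = U \Sigma V^\top$ is replaced by its orthogonal factor $U V^\top$ before the step is taken. The sketch below assumes that form; the paper's exact step-size rule and the truncated variant are not specified in the abstract and are not reproduced. The function name `spectral_descent_step` is hypothetical.

```python
import numpy as np

def spectral_descent_step(W: np.ndarray, G: np.ndarray, lr: float) -> np.ndarray:
    """One step W <- W - lr * U V^T, where G = U S V^T is a thin SVD.

    G may be a subgradient when the objective is non-smooth.
    """
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return W - lr * (U @ Vt)

# Toy usage: one step from zero with a random (full-rank) gradient.
rng = np.random.default_rng(0)
G = rng.standard_normal((4, 3))
W = np.zeros((4, 3))
W_new = spectral_descent_step(W, G, lr=0.1)
```

Every singular value of the update equals the learning rate, so the step size is independent of the gradient's scale. That is one reason a decaying step schedule matters in the non-smooth setting.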

2605.26971 2026-05-27 cs.LG

RLVR Datasets and Where to Find Them: Tracing Data Lineage for Better Training Data

Hsiu-Yuan Huang, Weijie Liu, Chenming Tang, Sanwoo Lee, Kai Yang, Yangkun Chen, Saiyong Yang, Yunfang Wu

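Affiliations: National Key Laboratory for Multimedia Information Processing, Peking University; School of Computer Science, Peking University; LLM Department, Tencent

AI summary: To address the unclear provenance of Reinforcement Learning from Verifiable Rewards (RLVR) datasets, proposes Atomic-source Tracing via Lineage-Aware Search (ATLAS), which traces over 99.7% of instances to 20 atomic sources, and builds the decontaminated dataset DAPO++ using the Source-level Counterfactual Attribution (SCA) principle; its quality score Q correlates strongly with downstream RLVR performance.

Comments: 7 figures, 12 tables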

English abstract

The proliferation of Reinforcement Learning from Verifiable Rewards (RLVR) datasets has exacerbated provenance collapse due to unclear lineage among existing datasets. To bridge this fragmented RLVR data landscape, we propose Atomic-source Tracing via Lineage-Aware Search (ATLAS), a systematic framework for tracing RLVR datasets back to their atomic sources, attributing over 99.7% of 1.45M instances to 20 atomic sources. Our analysis reveals that most RLVR datasets are variants of a small set of shared upstream sources, with few introducing genuinely new data, and many facing data contamination risks. These findings naturally motivate us to curate a new RLVR dataset, DAPO++, and to benchmark existing datasets from a lineage-aware perspective. To this end, we propose Source-level Counterfactual Attribution (SCA) as a guiding principle to curate a decontaminated training dataset with concentrated learning signals. Essentially, SCA measures a sample's marginal utility by comparing per-atomic-source RL checkpoints against a shared base model. Building upon these attribution signals, we further design a composite dataset quality score Q that strongly correlates with downstream RLVR performance. Experiments on Qwen3 series models verify that DAPO++ consistently improves performance on held-out benchmarks, while Q reliably predicts downstream RLVR training effectiveness. Our code and data are available at https://github.com/Celine-hxy/ATLAS.

2605.26969 2026-05-27 cs.CL cs.AI

Recon: Reconstruction-Guided Reasoning Synthesis for User Modeling

Alan Zhu, Mihran Miroyan, Carolyn Wang, Andrew Zhou, Lisa Dunlap, Narges Norouzi, Joseph E. Gonzalez

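Affiliations: University of California, Berkeley

AI summary: Proposes Recon, which scores the predictive power of reasoning traces via action reconstruction to improve reasoning synthesis for user modeling, outperforming a post-hoc rationalization baseline across multiple domains.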

English abstract

User modeling aims to use language models (LMs) to mimic an individual's behavior from a corpus of past context-action pairs (e.g., conversation turns), enabling the simulation of users in settings like behavioral science, human-AI collaboration, and market research. Recent approaches augment these corpora with synthesized reasoning traces, typically generated by conditioning on both context and action. However, such conditioning constitutes post-hoc rationalization rather than reasoning: the trace is guaranteed to justify the action, but may not encode the underlying latent causal decision paths. We propose Recon, which uses action reconstruction to score reasoning traces by their predictive power: given a context and candidate reasoning, a reconstruction model predicts the action, and reconstruction fidelity determines reasoning quality. Across four domains, Recon achieves a 54.7% win rate over Backward Synthesis, a standard post-hoc rationalization baseline. Further, we find that training a reasoning synthesis model with rewards derived from Recon improves downstream user modeling performance, achieving a win rate of up to 70.0% over baselines. We further show that Recon-synthesized reasoning transfers across models, and improves user modeling beyond the reconstruction model. Our work demonstrates that post-hoc rationalization is insufficient for reasoning synthesis, and that useful and interpretable reasoning should naturally elicit the action from the context.
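The scoring idea described above can be sketched as follows: a reconstruction model sees only the context and a candidate reasoning trace, predicts the action, and the trace is scored by how closely that prediction matches the true action. Everything concrete here is an assumption for illustration: `rank_traces` and `token_f1` are hypothetical names, token-level F1 stands in for whatever fidelity measure the paper uses, and the reconstruction model is a toy callable rather than a language model.

```python
from typing import Callable, List, Sequence, Tuple

def token_f1(pred: str, gold: str) -> float:
    """Token-overlap F1, used here as a stand-in fidelity measure."""
    p, g = pred.lower().split(), gold.lower().split()
    common = sum(min(p.count(t), g.count(t)) for t in set(p))
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)

def rank_traces(
    context: str,
    action: str,
    traces: Sequence[str],
    reconstruct: Callable[[str, str], str],
    fidelity: Callable[[str, str], float] = token_f1,
) -> List[Tuple[float, str]]:
    """Score each trace by how well the action is reconstructed from it."""
    scored = [(fidelity(reconstruct(context, t), action), t) for t in traces]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

# Toy reconstruction model: reads the conclusion after "so" in the trace.
toy_model = lambda ctx, reasoning: reasoning.split(" so ")[-1]
ranked = rank_traces(
    context="It is 7 pm.",
    action="order pizza",
    traces=["user is tired so go home", "user is hungry so order pizza"],
    reconstruct=toy_model,
)
```

The reconstruction model never sees the true action, which is what separates this score from post-hoc rationalization.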

2605.26967 2026-05-27 cs.CV

CodecCap: High-Fidelity Codec-Inspired Residual Modeling for Dense Video Captioning

Zihan Lin, Songhe Deng, Shuwei He, Danxiang Zhu, Dan Zhang, Yishu Lei, Xianlong Luo, Shikun Feng, Rui Liu

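Affiliations: ERNIE Team, Baidu; College of Artificial Intelligence, Inner Mongolia University

AI summary: Proposes the CodecCap framework, which mimics a video codec with keyframe and residual captions to reduce redundancy while preserving fine-grained visual evidence, and introduces the VidCapQA benchmark to verify its high fidelity.

Comments: 11 pages, 4 figures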

English abstract

Existing video captioning methods struggle to balance visual fidelity and redundancy: holistic captions are compact but lose fine-grained evidence, whereas segment-wise captions improve coverage but introduce heavy redundancy. We propose CodecCap, a codec-inspired framework for high-fidelity dense video captioning. Analogous to video codecs, CodecCap represents videos using keyframe and residual captions. Keyframe captions exhaustively encode stable visual context, while residual captions capture only temporally localized actions, motions, and changes. This effectively preserves fine-grained visual evidence while reducing redundant descriptions. To quantify the fidelity of captions, we introduce VidCapQA, a caption-then-QA benchmark with 1,000 questions across 14 capability dimensions. Results on VidCapQA show that captions directly generated by strong VLMs still miss many visual details, highlighting caption representation as a critical bottleneck. Experiments show that CodecCap significantly surpasses direct captioning with the same underlying VLMs, suggesting that keyframe-residual captioning is a route to high-fidelity video-language supervision. We further use CodecCap to construct CodecVDC-100K, a large-scale dense captioning dataset with anchor, residual, scene-level, and video-level supervision.
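The keyframe-plus-residual representation can be pictured with a small data structure. This is a sketch under assumptions: the abstract does not describe the storage format, so the `Segment` and `Residual` classes, their fields, and `describe_at` are all hypothetical. The idea shown is only that stable context is written once per keyframe and short residual captions record what changes.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Residual:
    start: float   # seconds
    end: float     # seconds
    text: str      # temporally local action, motion, or change

@dataclass
class Segment:
    keyframe_time: float
    keyframe_caption: str                      # exhaustive stable context
    residuals: List[Residual] = field(default_factory=list)

def describe_at(segments: List[Segment], t: float) -> str:
    """Rebuild a description at time t from the latest keyframe plus active residuals."""
    current = max(
        (s for s in segments if s.keyframe_time <= t),
        key=lambda s: s.keyframe_time,
        default=None,
    )
    if current is None:
        return ""
    changes = [r.text for r in current.residuals if r.start <= t <= r.end]
    return " ".join([current.keyframe_caption, *changes])

video = [
    Segment(0.0, "A kitchen with a red kettle on the stove.",
            [Residual(2.0, 4.0, "A hand lifts the kettle.")]),
]
caption = describe_at(video, 3.0)
```

The static scene description is stored once instead of being repeated in every segment caption, which is the redundancy saving the codec analogy points at.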

2605.26958 2026-05-27 cs.CL cs.AI

Tournament-GRPO: Group-Wise Tournament Rewards for Reinforcement Learning in Open-Ended Long-Form Generation

Zixuan Yang, Yiqun Chen, Wei Yang, Erhan Zhang, Zihan Shen, Xiaochi Wei, Yan Gao, Yi Wu, Yao Hu, Jiaxin Mao

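Affiliations: Renmin University of China; University of Southern California; Zhejiang University; Xiaohongshu Inc.

AI summary: To address the lack of reliable reference answers and automatic metrics in open-ended long-form generation, proposes the Tournament-GRPO framework, which converts rubric-guided LLM judgments into relative rewards through multi-round tournament comparisons among same-query rollouts, improving the score on Deep Research Bench by 4.52 points.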

English abstract

Reinforcement learning in open-ended long-form generation is challenging because reliable reference answers and automatic metrics are often unavailable. Existing rubric-based methods typically rely on pointwise LLM-as-a-judge scoring, but absolute scores are difficult to calibrate across complex responses, may provide weak discrimination among same-query rollouts, and can become saturated during optimization. We propose Tournament-GRPO, a group-wise reward framework that converts rubric-guided LLM judgments into relative rewards through repeated multi-round tournaments among same-query rollouts. Tournament-GRPO compares candidates within groups, accumulates tournament outcomes, and normalizes them into group-wise rewards for GRPO training. Experiments on Deep Research Bench show that Tournament-GRPO consistently outperforms existing reward-design baselines, achieving a 4.52-point overall-score improvement over the strongest baseline. Further analyses show that tournament rewards provide a favorable effectiveness-efficiency trade-off and that tournament design affects training dynamics. These results suggest that rubric-guided tournament comparison provides an effective reward signal for reinforcement learning in open-ended long-form generation.
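The compare, accumulate, and normalize pipeline can be sketched as below. The pairing scheme is an assumption: the abstract says only "repeated multi-round tournaments", so this sketch pairs rollouts at random in each round. The final step uses the group-relative standardization usually associated with GRPO. The function `tournament_rewards` and the `judge` callable (standing in for a rubric-guided LLM judge) are hypothetical.

```python
import random
import statistics
from typing import Callable, List, Sequence

def tournament_rewards(
    rollouts: Sequence[str],
    judge: Callable[[str, str], bool],  # True if the first rollout wins
    rounds: int = 3,
    seed: int = 0,
) -> List[float]:
    rng = random.Random(seed)
    wins = [0.0] * len(rollouts)
    for _ in range(rounds):
        order = list(range(len(rollouts)))
        rng.shuffle(order)                       # assumed: random pairing per round
        for i, j in zip(order[::2], order[1::2]):
            winner = i if judge(rollouts[i], rollouts[j]) else j
            wins[winner] += 1
    mean = statistics.fmean(wins)
    std = statistics.pstdev(wins)
    # Standardize within the group so rewards are relative, not absolute.
    return [(w - mean) / (std + 1e-8) for w in wins]

# Toy judge that prefers the longer response.
longer_wins = lambda a, b: len(a) >= len(b)
rewards = tournament_rewards(["a", "bbb", "cc", "dddd"], longer_wins)
```

Because only relative outcomes enter the reward, a judge whose absolute scores drift or saturate can still separate rollouts of the same query.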

2605.26955 2026-05-27 cs.CL cs.AI

JuICE: A Benchmark for Evaluating LLM-Judge in Identifying Cultural Errors

Jiho Jin, Junho Myung, Juhyun Oh, Junyeong Park, Rifki Afina Putri, Sunipa Dev, Vinodkumar Prabhakaran, Alice Oh

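Affiliations: KAIST; Google; Universitas Gadjah Mada

AI summary: Introduces the JuICE benchmark, a multilingual dataset with 7,470 annotations of cultural and linguistic errors, to evaluate how well LLM judges identify thick cultural errors in long-form text; the strongest model reaches an F1 of only 0.52 and often misses errors that locals readily identify.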

English abstract

As large language models (LLMs) are increasingly deployed to users around the world, they are integrated into everyday tasks across diverse cultural contexts, from drafting personal communications to brainstorming creative ideas. These tasks are inherently cultural: they require contextual appropriateness, symbolic resonance, and tacit cultural expectations that native speakers draw on instinctively, meaning that a response can be factually plausible yet unmistakably wrong to a local reader. Existing cultural benchmarks have treated culture as a flat set of facts via fact verification or norm entailment methods, and have adopted LLM-as-a-Judge without examining whether they can capture such thick cultural errors. To address this gap, we present JuICE (Benchmark for LLM-Judge in Identifying Cultural Errors), a multilingual dataset of 7,470 span-level annotations of cultural and linguistic errors in long-form LLM responses. It covers 1,050 query-response pairs from four countries (the United States, South Korea, Indonesia, and Bangladesh), in both English and their countries' main languages. Using JuICE, we find that even the strongest LLM-judge achieves only an F1 of 0.52 in the erroneous span detection task. Furthermore, LLM-judges consistently miss thick cultural errors that local residents readily identify. Our findings suggest that robust cultural evaluation must move beyond surface-level detection toward frameworks that account for the depth and situatedness of cultural meaning.
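For readers unfamiliar with span-level scoring, one common way to compute F1 for erroneous span detection is by character overlap between predicted and annotated spans. The abstract does not state which matching criterion JuICE uses, so the character-overlap rule and the function name `span_f1` below are assumptions.

```python
from typing import Iterable, Tuple

Span = Tuple[int, int]  # half-open character interval [start, end)

def span_f1(pred: Iterable[Span], gold: Iterable[Span]) -> float:
    """Character-overlap F1 between predicted and gold error spans."""
    pred_chars = {i for s, e in pred for i in range(s, e)}
    gold_chars = {i for s, e in gold for i in range(s, e)}
    overlap = len(pred_chars & gold_chars)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_chars)
    recall = overlap / len(gold_chars)
    return 2 * precision * recall / (precision + recall)

# A judge flags characters 0-9; annotators marked characters 5-14.
score = span_f1([(0, 10)], [(5, 15)])
```

Under this rule, a judge that flags the right sentence but the wrong phrase gets partial credit, whereas an exact-match criterion would give none.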

2605.26952 2026-05-27 cs.CL

Efficient Agentic Reinforcement Learning with On-Policy Intrinsic Knowledge Boundary Enhancement

Dingwei Chen, Zefang Zong, Zhipeng Ma, Leo Luo, Yang Li, Chengming Li, Peng Chen, Jie Jiang

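Affiliations: Tencent Inc; The Chinese University of Hong Kong; Shenzhen MSU-BIT University

AI summary: Proposes AKBE, which dynamically probes the model's intrinsic knowledge boundary through dual-path (with-tool and no-tool) on-policy training and constructs targeted supervisory signals, reducing tool calls while maintaining accuracy.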

English abstract

Agentic reinforcement learning (RL) has proven effective for training LLM-based agents with external tool-use capabilities. However, we identify that agentic RL training induces increasing redundant tool calls and blurs the model's intrinsic knowledge boundary, where the model fails to distinguish when tools are needed versus when parametric knowledge suffices. Existing solutions based on reward shaping create coarse-grained optimization targets that tend to incentivize indiscriminate tool-call suppression, leading to reward hacking. In this paper, we propose AKBE (Agentic Knowledge Boundary Enhancement), an on-policy method that dynamically probes the model's intrinsic knowledge boundary through dual-path (with-tool and no-tool) rollouts during training. We define the knowledge boundary as the per-instance determination of whether tools are required and the minimum tool calls necessary. By comparing correctness across paths, AKBE categorizes trajectories and constructs targeted supervisory signals that guide efficient tool-use patterns for each question. These signals are integrated seamlessly into the agentic RL training loop. Experiments on seven QA benchmarks demonstrate that AKBE improves task accuracy by +1.85 on average and reduces tool calls by 18% over standard agentic RL, yielding 25% higher tool productivity without any accuracy-efficiency trade-off. Further analysis suggests its plug-and-play compatibility across different RL algorithms and the mechanism of each signal category. Our code is available at https://github.com/CuSO4-Chen/AKBE.
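The per-instance knowledge boundary defined above (whether tools are required, and the minimum number of tool calls needed) can be derived from dual-path rollouts roughly as follows. This is a sketch under assumptions: the decision rule, the `Rollout` and `Boundary` classes, and `probe_boundary` are hypothetical, and the paper's actual trajectory categories and supervisory signals are not reproduced.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Rollout:
    correct: bool
    tool_calls: int

@dataclass
class Boundary:
    tools_required: Optional[bool]   # None when no rollout solved the question
    min_tool_calls: Optional[int]

def probe_boundary(with_tool: List[Rollout], no_tool: List[Rollout]) -> Boundary:
    """Compare correctness across the two paths for one question."""
    if any(r.correct for r in no_tool):
        # Parametric knowledge suffices, so any tool call is redundant.
        return Boundary(tools_required=False, min_tool_calls=0)
    solved = [r.tool_calls for r in with_tool if r.correct]
    if solved:
        return Boundary(tools_required=True, min_tool_calls=min(solved))
    return Boundary(tools_required=None, min_tool_calls=None)

boundary = probe_boundary(
    with_tool=[Rollout(True, 3), Rollout(True, 1)],
    no_tool=[Rollout(False, 0)],
)
```

A training signal could then favor with-tool trajectories that match `min_tool_calls` and discourage tool use where the no-tool path already succeeds, which is finer-grained than a blanket penalty on tool calls.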

2605.26949 2026-05-27 cs.CV cs.GR

DinoComplete: 3D Shape Completion with Distilled Semantic Priors and State Space Models

Furkan Mert Algan, Eckehard Steinbach

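Affiliations: Chair of Media Technology; Munich Institute of Robotics and Machine Intelligence; School of Computation, Information and Technology, Technical University of Munich

AI summary: Proposes the DinoComplete framework, which distills semantic priors from DINO features and combines them with a multi-scale voxel Mamba module for efficient, robust 3D shape completion, outperforming existing methods on unseen categories and noisy real-world scans.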

English abstract

3D shape completion from partial scans remains challenging for unseen categories and noisy real-world observations, where geometry alone is often insufficient for inferring missing structure. We present DinoComplete, a deterministic and efficient shape completion framework that augments geometric reconstruction with voxel-aligned semantic priors distilled from DINO features. First, we construct multi-view DINO feature volumes aligned with ShapeNet data and train a student network to predict dense semantic features directly from incomplete shapes. These predicted features capture global structure and part-aware semantic context while remaining aligned with the underlying geometry. We then integrate these distilled features into a completion network, where geometric and semantic voxel representations are fused through voxel state-space modeling. To enable efficient long-range reasoning without sacrificing resolution, we introduce a multi-scale voxel Mamba module that refines the fused features by combining full-grid and chunk-wise sequence modeling. Experiments on unseen ShapeNet categories and ScanNet objects show that DinoComplete achieves stronger completion quality than prior deterministic and generative-based completion methods while using fewer parameters, requiring lower memory, and achieving faster inference. Our results demonstrate that distilling semantic priors from visual foundation models improves generalization and robustness in 3D shape completion.

2605.26944 2026-05-27 cs.RO cs.CV

Object Pose and Shape Estimation for Grasping: Does it Work?

Pavan Karke, Kushal Shah, Gaurav Singh, Md Faizal Karim, K Madhava Krishna, Rajat Talak

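Affiliations: Robotics Research Center, IIIT Hyderabad; National University of Singapore

AI summary: Compares an end-to-end grasp synthesis method with modular methods (which first estimate object pose and shape, then sample grasps) to assess how well existing pose and shape estimation methods work for grasping.

Comments: 9 pages, 8 figures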

English abstract

The problem of object pose and shape estimation has seen key advancements lately. Encoder-decoder (e.g., SAM3D, LRM, CRISP) and diffusion-based models (e.g., InstantMesh, Zero123, SceneComplete) have shown category-agnostic shape encoding capacity and open-set generalizability. In this work, we ask the question: Are object pose and shape estimation methods mature enough that, when used with antipodal grasp sampling, they can outperform end-to-end grasp synthesis methods? We explore this question in detail by scoping our study to parallel jaw grippers, 7-DoF grasps, and a single-view RGB(-D) image as input. We implement and compare a state-of-the-art, end-to-end grasp synthesis method and three modular methods, which first estimate the object pose and shape for all objects in the scene, and then generate grasps using antipodal sampling. We observe that the modular methods outperform the end-to-end method in all our experiments. The modular methods are able to synthesize plenty of grasps, even for small objects, where the end-to-end methods fail. The effectiveness of the modular methods is contingent on the accuracy of the pose and shape estimation, and suffers partial degradation in cluttered scenes, a limitation of the existing pose and shape estimation methods. We also analyze the failure modes and run-times for the three modular methods, which use two different ways of object pose and shape estimation: one based on an encoder-decoder model, the other on a diffusion model. Finally, we demonstrate that the single-view object pose and shape estimation methods can be augmented with vision-language models to yield language-conditioned grasps from just a single-view RGB-D image as input. We observe performance comparable to the state-of-the-art LERF-TOGO baseline.
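Antipodal grasp sampling, which the modular pipelines rely on once a shape estimate is available, rests on a simple geometric test: for a parallel-jaw gripper, the line joining the two contact points should lie inside the friction cone at each contact. The sketch below shows that textbook test only; the function name `is_antipodal`, the default friction coefficient, and the use of outward surface normals are assumptions, not details from the paper.

```python
import numpy as np

def is_antipodal(p1, n1, p2, n2, friction_coef: float = 0.5) -> bool:
    """Check the friction-cone condition for two contacts.

    p1, p2: contact points on the object surface, shape (3,)
    n1, n2: outward unit surface normals at those points, shape (3,)
    """
    axis = np.asarray(p2, dtype=float) - np.asarray(p1, dtype=float)
    axis /= np.linalg.norm(axis)
    half_angle = np.arctan(friction_coef)      # friction cone half-angle
    # The finger at p1 pushes along +axis, the finger at p2 along -axis.
    # Each push direction must stay within the cone around the inward normal.
    angle1 = np.arccos(np.clip(np.dot(-np.asarray(n1), axis), -1.0, 1.0))
    angle2 = np.arccos(np.clip(np.dot(-np.asarray(n2), -axis), -1.0, 1.0))
    return bool(angle1 <= half_angle and angle2 <= half_angle)

# Opposite faces of a unit box along the x axis.
good = is_antipodal([0, 0, 0], [-1, 0, 0], [1, 0, 0], [1, 0, 0])
# Second contact on a face whose normal is perpendicular to the grasp axis.
bad = is_antipodal([0, 0, 0], [-1, 0, 0], [1, 0, 0], [0, 1, 0])
```

A sampler would draw candidate point pairs from the estimated mesh or point cloud and keep those that pass this test, which is why grasp quality depends directly on the accuracy of the estimated surface and normals.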

2605.26940 2026-05-27 cs.CL

Accountable Human-AI Deliberation with LLMs: Scaling Collective Intelligence through Symbiotic Scaffolding

负责任的基于LLM的人机协商:通过共生脚手架扩展集体智能

Wajdi Zaghouani

发表机构 * Northwestern University in Qatar(卡塔尔西北大学)

AI总结 提出一个三层共生人机框架,通过多样性放大、条款级溯源和人类主导批准,在扩展集体智能的同时保持主体性和合法性。

Comments Accepted at the LREC 2026 / 2nd Workshop on Language-driven Deliberation Technology

详情
AI中文摘要

大型语言模型(LLM)可以在以前受轮流发言和引导带宽限制的规模上支持民主协商。最近的研究表明,LLM生成的群体陈述通常比人类中介的输出更受欢迎,而理论分析认为LLM放松了限制集体智能的同时性约束。然而,纯LLM中介存在使多元性崩溃、过度优化一致性以及当参与者无法质疑其如何被代表时损害合法性的风险。我们提出了一个共生的人机框架,分为三个层次:观察与多样性放大、具有条款级溯源的引导、以及人类优先批准。我们的贡献包括:具有显著性加权分级覆盖、多样性和擦除度量;结合交叉编码器相似性与因果剔除诊断的溯源管道;偏好条件权衡控制;公平感知的可争议工作流;对抗性鲁棒性测试;以及基于LLM作为评判者局限性证据的消融设计评估协议。结果是一个可测试的协商技术蓝图,能够在扩展集体智能的同时保持主体性和合法性。

英文摘要

Large language models (LLMs) can support democratic deliberation at scales previously constrained by turn-taking and facilitation bandwidth. Recent work shows that LLM-generated group statements are often preferred over human-mediated outputs, while theoretical analyses argue that LLMs relax the simultaneity constraints limiting collective intelligence. Yet pure LLM mediation risks collapsing pluralism, over-optimizing for agreement, and undermining legitimacy when participants cannot contest how they are represented. We propose a symbiotic human-AI framework organized into three layers: observation and diversity amplification, facilitation with clause-level provenance, and human primacy for ratification. Our contributions include graded coverage, diversity, and erasure metrics with salience-aware weighting; a provenance pipeline combining cross-encoder similarity with causal knockout diagnostics; preference-conditioned trade-off control; equity-aware contestability workflows; adversarial robustness tests; and an evaluation protocol with ablation designs informed by evidence of LLM-as-judge limitations. The result is a testable blueprint for deliberation technology that scales collective intelligence while preserving agency and legitimacy.

2605.26937 2026-05-27 cs.CL cs.AI

Beyond Questions: Evaluating What Large Language Models (Actually) Know

超越问题:评估大型语言模型(实际)知道什么

Luca Giordano, Simon Razniewski

发表机构 * ScaDS.AI Dresden/Leipzig & TU Dresden, Germany(ScaDS.AI 德累斯顿/莱比锡及德累斯顿工业大学,德国)

AI总结 提出开放知识评估新范式,通过开放式提示(如“告诉我关于M.L. King的一切”)评估模型自然表达的知识,并构建BeQu基准测试10,000个实体。

详情
AI中文摘要

大型语言模型(LLM)中的参数化知识是其成功的基石,但仍未被充分理解。现有的知识基准通常依赖于预定义的问题(例如,“M.L. King的出生日期是什么?”),仅评估基准设计者明确选择查询的知识,这是一种有问题的可用性偏差。在本文中,我们引入了开放知识评估,这是一种用于LLM知识基准测试的新范式。它不提出狭隘的问题,而是评估模型在响应开放式引发提示(例如,“告诉我关于M.L. King的一切”)时选择呈现的知识。这将焦点从预定义的答案检索转向表征模型自然表达的知识。我们用BeQu(超越问题)实例化这一范式,这是一个包含10,000个实体并配有用于陈述验证的参考语料库的基准。使用BeQu,我们评估了广泛的语言模型,并分析了推理努力、模型规模、提示格式和知识领域的影响。数据和排行榜可在此工作的GitHub仓库和基准网站上获取。

英文摘要

Parametric knowledge in large language models (LLMs) is a cornerstone of their success, yet remains poorly understood. Existing knowledge benchmarks typically rely on predefined questions (e.g., "What is the birth date of M.L. King?"), evaluating only knowledge that benchmark designers explicitly choose to query, a problematic availability bias. In this paper, we introduce open knowledge evaluation, a new paradigm for LLM knowledge benchmarking. Instead of asking narrow questions, it evaluates models on the knowledge they choose to surface in response to open-ended elicitation prompts (e.g., "Tell me everything you know about M.L. King"). This shifts the focus from predefined answer retrieval toward characterizing the knowledge models naturally express. We instantiate this paradigm with BeQu (Beyond Questions), a benchmark of 10,000 entities paired with reference corpora for statement verification. Using BeQu, we evaluate a broad range of language models and analyze the effects of reasoning effort, model scale, prompt format, and knowledge domain. Data and leaderboard are available on this work's GitHub repository and at the benchmark's website.

2605.26935 2026-05-27 cs.CL

DunbaaBERT: From Sacrifice to Semantics

DunbaaBERT: 从牺牲到语义

Iffat Maab, Waleed Jamil, Raphael Schmitt

发表机构 * Research and Development Center for Large Language Models (LLMC), National Institute of Informatics, Tokyo(国立情报学研究所大型语言模型研发中心(LLMC),东京) School of Computation, Information and Technology, Technical University of Munich, Germany(慕尼黑工业大学计算、信息与技术学院,德国) Institute of General Practice, Faculty of Medicine and Medical Center, University of Freiburg, Germany(弗赖堡大学医学院及医学中心全科医学研究所,德国)

AI总结 本文提出DunbaaBERT,一种从零训练的乌尔都语RoBERTa-base模型族,通过不同词汇表大小在17GB语料上预训练,在多项下游任务中达到与强多语言基线相当的性能,并发现较大词汇表并不持续提升效果。

详情
AI中文摘要

大型语言模型在许多自然语言处理任务中取得了强劲性能,但由于资源有限和评估设置碎片化,乌尔都语仍相对未被充分探索。为填补这一空白,我们引入了DunbaaBERT,一个乌尔都语RoBERTa-base模型族,在去重后的17GB乌尔都语语料库上使用32k、52k和96k token的Byte-BPE词汇表从头训练。我们在内在和下游乌尔都语自然语言处理基准上评估DunbaaBERT,涵盖语言可接受性、新闻分类、攻击性语言检测和情感分析,同时分析词汇表大小对性能和效率权衡的影响。在各项基准中,DunbaaBERT变体与强多语言基线相比取得了有竞争力的性能,同时始终保持有利的效率权衡。有趣的是,较大的词汇表并不持续提升下游效果,DunbaaBERT$_{\text{32k}}$反复提供最强的整体效率概况。总体而言,我们的结果表明,尽管模型和训练规模相对紧凑,精心策划的乌尔都语特定编码器模型仍能保持高度竞争力。所有模型均在MIT许可下发布。

英文摘要

Large language models have achieved strong performance across many NLP tasks, yet Urdu remains comparatively underexplored due to limited resources and fragmented evaluation settings. To address this gap, we introduce DunbaaBERT, a family of Urdu RoBERTa-base models trained from scratch with Byte-BPE vocabularies of 32k, 52k, and 96k tokens on a deduplicated 17GB Urdu corpus. We evaluate DunbaaBERT across intrinsic and downstream Urdu NLP benchmarks covering linguistic acceptability, news classification, offensive language detection, and sentiment analysis while analyzing vocabulary-size effects on performance and efficiency trade-offs. Across benchmarks, the DunbaaBERT variants achieve competitive performance against strong multilingual baselines while consistently maintaining favorable efficiency trade-offs. Interestingly, larger vocabularies do not consistently improve downstream effectiveness, with DunbaaBERT$_{\text{32k}}$ repeatedly providing the strongest overall efficiency profile. Overall, our results demonstrate that carefully curated Urdu-specific encoder models can remain highly competitive despite comparatively compact model and training scales. All models are released under the MIT license.

2605.26934 2026-05-27 cs.CL cs.AI

Reasoning Depth and Environment Complexity: A Controlled Study of RLVR Data Allocation across Logical Reasoning Tasks

推理深度与环境复杂度:逻辑推理任务中RLVR数据分配的受控研究

Yihua Zhu, Qianying Liu, Fei Cheng, Jiaxin Wang, Akiko Aizawa, Sadao Kurohashi, Hidetoshi Shimodaira

发表机构 * Kyoto University(京都大学) University of Tokyo(东京大学) NII LLMC(国立情报学研究所LLMC) RIKEN(理化学研究所)

AI总结 通过将推理空间划分为深度和复杂度两个维度,并考虑四种推理形式,在合成知识图谱环境中进行受控实验,发现联合深度-复杂度覆盖优于单轴策略,不同推理家族对RLVR覆盖的反应非均匀,且均匀混合优于分阶段课程。

Comments Pre-print

详情
AI中文摘要

基于可验证奖励的强化学习(RLVR)已成为后训练推理模型的核心,但现有研究的一个关键局限在于对推理空间的狭隘视角:难度仅被视为推理深度,奖励集中在正向演绎状态追踪。相反,我们沿两个维度刻画推理空间。难度:除了推理深度,我们研究环境复杂度,即模型必须在干扰项和交互结构中识别正确路径。奖励推理形式:我们考虑现实世界推理核心的四种能力:演绎状态追踪、对隐藏事件或事实的溯因恢复、归纳规则归纳以及类比迁移。为解耦这些因素,我们构建了一个合成知识图谱环境,具有受控的预训练和后训练分布,其中每个实例在深度、复杂度和任务家族上变化。三个发现:联合深度-复杂度覆盖优于单轴策略;推理家族反应非均匀,溯因推理在RL覆盖区域外退化,任务相关性聚类为演绎-溯因对和归纳-类比对;在固定预算下,均匀混合优于分阶段课程。我们还发现,最近的现成模型表现出相同的演绎-溯因不对称性,表明这一差距并非我们受控设置的假象。

英文摘要

Reinforcement learning with verifiable rewards (RLVR) has become central to post-training reasoning models, yet a key limitation of existing studies is their narrow view of the reasoning space: difficulty is treated as reasoning depth alone, and reward is concentrated on forward deductive state tracking. We instead characterize the reasoning space along two dimensions. The first is difficulty: beyond reasoning depth, we study environment complexity, where models must identify the correct path amid distractors and interacting structures. The second is the rewarded reasoning form: we consider four abilities core to real-world reasoning, namely deductive state tracking, abductive recovery of hidden events or facts, inductive rule induction, and analogical transfer. To disentangle these factors, we construct a synthetic knowledge-graph environment with controlled pre- and post-training distributions, where each instance varies along depth, complexity, and task family. Three findings emerge: joint depth-complexity coverage outperforms single-axis recipes; reasoning families respond non-uniformly, with abductive reasoning degrading outside the RL-covered region and task correlations clustering into deductive-abductive and inductive-analogy pairs; and uniform mixing outperforms staged curricula under a fixed budget. We also find that recent off-the-shelf models exhibit the same deductive-over-abductive asymmetry, suggesting that this gap is not merely an artifact of our controlled setup.

2605.26933 2026-05-27 cs.CV

Leveraging Text-to-Image Diffusion Models for Unsupervised Visual Object Tracking

利用文本到图像扩散模型进行无监督视觉目标跟踪

Zhengbo Zhang, Zhigang Tu, Junsong Yuan, De Wen Soh, Bo Du

发表机构 * Information Systems Technology and Design Pillar, Singapore University of Technology and Design(新加坡科技设计大学信息系统技术与设计学院) State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University(武汉大学测绘遥感信息工程国家重点实验室) Department of Computer Science and Engineering, University at Buffalo, State University of New York(纽约州立大学布法罗分校计算机科学与工程系) School of Computer Science, Wuhan University(武汉大学计算机学院)

AI总结 提出Diff-Tracking方法,利用预训练文本到图像扩散模型的跨注意力机制,通过初始提示学习器和在线提示更新器实现无监督目标跟踪。

Comments Accepted by IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2026

详情
AI中文摘要

无监督视觉目标跟踪是一项具有挑战性的任务,需要在没有真实标注训练的情况下跟踪视频中的任意目标。尽管取得了显著进展,现有的最先进无监督跟踪器在处理需要细粒度理解视频帧内语义和视觉结构信息的场景时仍常遇到困难。文本到图像扩散模型以其生成准确反映输入提示中描述的语义和结构的图像的能力而闻名,展现出对视觉语义和结构的强大把握。基于这一能力,我们从新的角度处理无监督跟踪,利用预训练文本到图像扩散模型中编码的丰富语义知识。为了将原本用于图像生成的扩散模型适应到跟踪任务,我们将其重新解释为文本和图像模态之间的桥梁。这种连接通过跨注意力机制实现:当文本和图像同时输入模型时,模型会突出显示与文本语义对齐的图像区域(在跨注意力图中)。因此,我们学习一个表示跟踪目标的提示,并在每一帧中激活其在跨注意力图中的对应区域,从而利用扩散模型实现目标跟踪。具体来说,我们的方法Diff-Tracking由两个主要部分组成:初始提示学习器和在线提示更新器。初始提示学习器生成一个捕获第一帧中目标对象的提示,使扩散模型能够识别目标。在线提示更新器基于运动信息优化提示,实现跨视频帧的一致跟踪。我们在六个具有挑战性的跟踪数据集上评估了我们的方法,证明了其有效性。

英文摘要

Unsupervised visual object tracking is a challenging task that requires following arbitrary targets in videos without training on ground-truth annotations. Despite considerable progress, existing state-of-the-art unsupervised trackers often struggle in scenarios that demand fine-grained understanding of semantic and visual structural information within video frames. Text-to-image diffusion models are well known for their ability to generate images that accurately reflect the semantics and structures described in the input prompt, demonstrating a strong grasp of visual semantics and structures. Building on this capability, we approach unsupervised tracking from a new perspective by exploiting the rich semantic knowledge encoded in pretrained text-to-image diffusion models. To adapt diffusion models, which were originally developed for image generation, to the tracking task, we reinterpret them as a bridge between the text and image modalities. This connection is realized through the cross-attention mechanism: when both text and an image are input into the model, the cross-attention maps highlight the image regions that are semantically aligned with the text. We therefore learn a prompt that represents the tracking target and activates its corresponding region in the cross-attention map for each frame, which enables object tracking with the diffusion model. Specifically, our method, Diff-Tracking, is composed of two main components: an initial prompt learner and an online prompt updater. The initial prompt learner generates a prompt that captures the target object in the first frame, allowing the diffusion model to identify the target. The online prompt updater refines the prompt based on motion information, enabling consistent tracking across video frames. Evaluations on six challenging tracking datasets demonstrate the effectiveness of our approach.

2605.26926 2026-05-27 cs.AI

From Norms to Indicators (N2I-RAG): An Agentic Retrieval-Augmented Generation Framework for Legal Indicator Computation

从规范到指标 (N2I-RAG): 一种用于法律指标计算的智能检索增强生成框架

Youssef Al Mouatamid, Marie Bonnin, Jihad Zahir

发表机构 * LISI Laboratory(LISI实验室) Cadi Ayyad University(卡迪·阿亚德大学) Univ Brest(西布列塔尼大学) IRD, Univ Brest, CNRS, Ifremer, LEMAR(IRD、西布列塔尼大学、CNRS、Ifremer、LEMAR)

AI总结 提出N2I-RAG框架,通过自适应检索、基于LLM的智能体和验证机制,实现从法律文本到指标的透明、可追溯的自动计算,在法国海洋环境法语料库上优于基线方法。

详情
AI中文摘要

从规范文本计算法律指标是法律监测和政策评估中的关键任务,但由于法律语言的复杂性、规模、解释性以及可用文档质量的差异,这一任务面临重大挑战。现有的自然语言处理技术和生成模型可以辅助法律分析,但往往存在较高的幻觉风险,且缺乏可靠指标计算所需的可解释性和证据基础。本文提出N2I-RAG(从规范到指标),一种智能检索增强生成框架,旨在以透明且可追溯的方式自动化法律指标的计算。我们将自适应检索、基于LLM的智能体和验证机制集成到一个模块化流水线中,其中每个组件在过滤、检索和评估证据,以及生成与可识别法律条款相关的二元法律结果方面执行定义明确的角色。该框架通过要求对中间决策和最终指标分配进行明确解释来强调可追溯性。我们使用内部构建的包含扫描和数字两种来源的法国海洋环境法律语料库评估N2I-RAG。与多个语言模型家族的对比实验表明,所提出的方法始终优于基线系统,并且在两种不同禁令的测试中具有良好的泛化能力。结果表明,智能检索增强生成可以桥接开放文本法律语言和标准化指标计算,为透明且可扩展的法律观测站奠定基础。

英文摘要

Computing legal indicators from normative texts is a key task in legal monitoring and policy evaluation, but presents significant challenges due to the complexity, scale, and interpretive nature of legal language, as well as the variability in available document quality. Existing natural language processing techniques and generative models can assist in legal analysis, but often carry a high risk of hallucination and lack the interpretability and evidence grounding required for reliable indicator computation. This paper presents N2I-RAG (From Norms to Indicators), an agentic retrieval-augmented generation framework designed to automate the computation of legal indicators in a transparent and traceable way. We integrate adaptive retrieval, LLM-based agents, and validation mechanisms in a modular pipeline, where each component performs a defined role in filtering, retrieving, and assessing evidence, and in producing binary legal outcomes linked to identifiable legal provisions. The framework emphasizes traceability by requiring explicit explanations of intermediate decisions and final indicator assignments. We evaluate N2I-RAG using an in-house French marine environmental law corpus that includes both scanned and digital sources. Comparative experiments with multiple language model families demonstrate that the proposed approach consistently outperforms baseline systems and generalizes well when tested on two different bans. The results indicate that agentic retrieval-augmented generation can bridge open-text legal language and standardized indicator computation, offering a foundation for transparent and scalable legal observatories.

2605.26924 2026-05-27 cs.CL

Learning to Adapt SFT Data for Better Reasoning Generalization

学习适应SFT数据以实现更好的推理泛化

Lisong Sun, Li Wang, Chen Zhang, Jinyang Wu, Kui Zhang, Tianhao Peng, Wenjun Wu

发表机构 * Beihang University(北京航空航天大学) Tsinghua University(清华大学) Nanyang Technological University(南洋理工大学) Hangzhou International Innovation Institute(杭州国际创新研究院) Beijing Advanced Innovation Center for Future Blockchain and Privacy Computing(北京未来区块链与隐私计算高级创新中心)

AI总结 提出DART方法,通过强化学习训练映射器将分布不匹配的SFT数据转化为模型自适应的监督,提升推理泛化能力。

详情
AI中文摘要

大型语言模型(LLMs)取得了显著进展,其中后训练在增强其推理能力方面起着关键作用。在后训练范式中,监督微调(SFT)被广泛使用:它利用外部数据提供密集监督并实现高效训练。然而,当数据分布与目标模型自身分布不匹配时,直接在专家数据上微调可能会损害泛化能力。在这项工作中,我们提出了推理调优的数据适应(DART),它将使用固定且可能分布不匹配的SFT数据集表述为对演示转换的优化问题。DART使用强化学习训练一个映射器模型,将原始SFT数据转换为与目标模型分布和学习偏好更匹配的模型自适应监督。转换后的数据随后用于SFT,使目标模型能够更好地利用外部监督。在多个模型和数据集上的实验表明,DART提高了泛化能力,实现了比直接RL更高的训练效率,并帮助模型超越标准SFT。我们的代码可在https://anonymous.4open.science/r/DART525E50D获取。

英文摘要

Large language models (LLMs) have achieved remarkable progress, with post-training playing a crucial role in enhancing their reasoning capabilities. Among post-training paradigms, supervised fine-tuning (SFT) is widely used: it leverages external data to provide dense supervision and enables efficient training. However, directly fine-tuning on expert data can hurt generalization when the data distribution is mismatched with the target model's own distribution. In this work, we propose Data Adaptation for Reasoning Tuning (DART), which formulates the use of a fixed, potentially distributionally misaligned SFT dataset as an optimization problem over demonstration transformations. DART trains a mapper model with reinforcement learning to convert original SFT data into model-adapted supervision that better matches the target model's distribution and learning preferences. The transformed data are then used for SFT, allowing the target model to better exploit external supervision. Experiments across multiple models and datasets show that DART improves generalization, achieves higher training efficiency than direct RL, and helps models surpass standard SFT. Our code is available at https://anonymous.4open.science/r/DART525E50D.

2605.26918 2026-05-27 cs.CL

Are Video Models Zero-Shot Learners and Reasoners in Education? EduVideoBench, A Knowledge-Skills-Attitude Benchmark for Educational Video Generation

视频模型是教育领域的零样本学习者和推理者吗?EduVideoBench:面向教育视频生成的知识-技能-态度基准

Unggi Lee, Hoyoung Ahn, Yoon Choi, Seonmin Eun, Jahyun Jeong, Seonmin Jin, Harmony Jung, Hye Jin Kim, Chaerin Lee, Hyunji Lee, Jeongjin Lee, Soohwan Lee, Young-Seok Oh, Jaehyeon Park, Sun-ok Ryu, Sunyoung Shin, Yoorim Son, Haeun Park, Yeil Jeong

发表机构 * Korea University Sejong Campus(高丽大学世宗校区) Cardiff Metropolitan University(卡迪夫城市大学) Seoul National University(首尔大学) Bugil Academy(Bugil 学院) Gyeonggi Provincial Office of Education(京畿道教育厅) Loughborough University(拉夫堡大学) Korea National University of Education(韩国教员大学) Korea University(高丽大学) Korean Educational Development Institute(韩国教育开发院) Sungshin Women’s University(诚信女子大学) Seoul National University of Education(首尔教育大学) Korea Institute for Curriculum and Evaluation(韩国教育课程评价院) Indiana University Bloomington(印第安纳大学布卢明顿分校)

AI总结 提出基于知识-技能-态度框架的教育视频生成基准EduVideoBench,评估五个前沿视频生成模型在教育有效性上的不足,并发现教育有效性是多维度的,单一元素不匹配即可使视频失效。

详情
AI中文摘要

视频生成模型(VGMs)正迅速进入课堂,然而现有基准仅评估感知质量、内在忠实性、通用安全性或将视频作为推理媒介,没有评估输出是否具有教育有效性。在这项工作中,我们提出了EduVideoBench,这是教育领域第一个平衡的基准,基于知识-技能-态度(KSA)框架,使得教学充分性和教育安全性被联合评估,而非作为临时的质量维度。在五个前沿VGMs上,我们的结果显示,在知识、技能和态度方面,它们距离课堂准备就绪还有很大的改进空间。我们辅以专家评论的定性分析,发现教育有效性是多维度的,单个不匹配的元素(如节奏、可读性或符号)可能使原本正确的视频失效。我们希望EduVideoBench能够指导开发教学上合理且课堂安全的VGMs。

英文摘要

Video generation models (VGMs) are rapidly entering classrooms, yet existing benchmarks evaluate only perceptual quality, intrinsic faithfulness, generic safety, or video as a reasoning medium, and none assesses whether the outputs are educationally valid. In this work, we present EduVideoBench, the first balanced benchmark in the education domain, grounded in the Knowledge-Skills-Attitude (KSA) framework so that pedagogical adequacy and educational safety are evaluated jointly rather than as ad-hoc quality dimensions. Across five frontier VGMs, our results show substantial room for improvement across knowledge, skills, and attitude before they are classroom-ready. We complement this with a qualitative analysis of expert comments, finding that educational validity is multi-component, where a single misaligned element such as pacing, legibility, or notation can invalidate an otherwise correct video. We hope EduVideoBench will guide the development of VGMs that are pedagogically grounded and safe for the classroom.

2605.26911 2026-05-27 cs.AI

TADDLE: A Tool-Augmented Agent for Detecting Deficient LLM-Generated Peer Reviews

TADDLE: 一种用于检测有缺陷的LLM生成同行评审的工具增强型代理

Hanqi Duan, Xiang Li

发表机构 * East China Normal University(华东师范大学)

AI总结 针对LLM生成的同行评审难以检测缺陷的问题,提出TADDLE工具增强型代理,通过四个专用分析工具和两阶段半监督学习,在二元检测和多标签分类任务上表现优异。

详情
AI中文摘要

LLM生成的同行评审在主要会议中越来越常见,但由于它们语言流畅、结构良好,其缺陷难以检测。现有工作要么仅分类作者身份而不评判质量,要么使用为人类撰写的评审设计的特征来评分质量;没有先前系统能在单个缺陷类型级别检测LLM生成评审中的缺陷。为弥补这一空白,我们引入了TADDLE,一种用于检测有缺陷的LLM生成同行评审的工具增强型代理,以及首个针对此任务的专家标注基准。我们的基准包含对50篇ICLR 2025论文的1800条评审,由18位领域专家根据六个缺陷类别(加上一个无缺陷标签)的分类法进行多标签标注。TADDLE将检测分解为四个专用分析工具——验证、纠正、完善和转换——由一个代理协调;一个集成器通过两阶段半监督学习将其输出综合为二元和多标签分类。大量实验表明,TADDLE在二元检测和多标签分类任务上均表现强劲。我们在https://github.com/AquariusAQ/TADDLE发布基准和代码。

英文摘要

LLM-generated peer reviews are increasingly common at major venues, yet their deficiencies are hard to detect because they are uniformly fluent and well-structured. Existing work either classifies authorship without judging quality, or scores quality with features designed for human-written reviews; no prior system detects deficiencies in LLM-generated reviews at the level of individual defect types. To bridge the gap, we introduce TADDLE, a Tool-Augmented Agent for Detecting Deficient LLM-Generated Peer Reviews, together with the first expert-annotated benchmark for this task. Our benchmark comprises 1,800 reviews on 50 ICLR 2025 papers, multi-label-annotated by 18 domain experts against a taxonomy of six defect categories (plus a non-deficient label). TADDLE decomposes detection into four specialized analysis tools -- Verify, Correct, Complete, and Transform -- orchestrated by an agent; an integrator synthesizes their outputs into binary and multi-label classifications via two-stage semi-supervised learning. Extensive experiments show that TADDLE performs strongly on both binary detection and the multi-label classification task. We release the benchmark and code at https://github.com/AquariusAQ/TADDLE.

2605.26908 2026-05-27 cs.AI cs.DS cs.LG

On the Detection of Commutative Factors in Factor Graphs: Necessary and Sufficient Conditions

关于因子图中可交换因子检测的充要条件

Malte Luttermann, Ralf Möller, Marcel Gehrke

发表机构 * Institute for Humanities-Centered Artificial Intelligence, University of Hamburg, Germany(人文导向人工智能研究所,汉堡大学,德国)

AI总结 本文重新审视了因子图中可交换因子检测的理论基础,指出现有算法依赖的定理仅为必要条件而非充分条件,并提出了修正算法以保证正确性和效率。

详情
AI中文摘要

利用概率图模型(如因子图)中对象的不可区分性是提升概率推理算法的关键,并允许对领域规模进行可处理的概率推理问题。在因子图中利用不可区分对象的核心是识别可交换因子,即其输出值在分配给其部分参数的输入值的排列下保持不变的因子。本文重新审视了检测可交换因子的最先进算法的理论基础。具体而言,我们表明,在其当前形式下,最先进算法依赖于一个中心定理,该定理被错误地视为识别可交换因子的充分条件,而实际上它仅意味着必要条件。因此,正如我们在本文中所展示的,最先进算法可能会产生错误结果。为了修复当前最先进算法中存在的缺陷,我们证明了上述定理的一个略微修改版本,该版本作为识别可交换因子的必要条件。此外,我们提出了最先进算法的修正版本,在保持其效率的同时确保正确性,并引入了一种具有更严格最坏情况边界的补充算法。

英文摘要

Exploiting the indistinguishability of objects in a probabilistic graphical model such as a factor graph is key to lifted probabilistic inference algorithms and allows probabilistic inference to be tractable with respect to domain sizes. A central building block for the exploitation of indistinguishable objects in factor graphs is the identification of commutative factors, i.e., factors whose output values are invariant under permutations of input values assigned to a subset of their arguments. In this paper, we revisit the theoretical foundations underlying the state-of-the-art algorithm to detect commutative factors. Specifically, we show that in its current form, the state-of-the-art algorithm relies on a central theorem that is mistakenly regarded as a sufficient condition to identify commutative factors, while it actually implies only a necessary condition. Consequently, the state of the art might, as we show in this paper, deliver incorrect results. To fix the flaws currently present in the state of the art, we prove a slightly modified version of the aforementioned theorem, which serves as a necessary condition to identify commutative factors. Moreover, we present a corrected version of the state-of-the-art algorithm, which keeps its efficiency while ensuring correctness, and introduce a complementary algorithm with tighter worst-case bounds.
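
The definition of a commutative factor given in the abstract can be made concrete with a brute-force check. This is only a sketch of the definition, not the algorithm studied in the paper: it enumerates every assignment and every permutation, so its cost grows exponentially, and the dictionary representation of a factor and the name `is_commutative` are assumptions made for illustration.

```python
from itertools import permutations, product

def is_commutative(factor, domain, num_args, subset):
    """Check whether `factor` is commutative with respect to `subset`.

    factor:   dict mapping each full assignment (a tuple) to a value.
    domain:   values each argument can take (assumed identical for all arguments).
    num_args: number of arguments of the factor.
    subset:   tuple of argument positions whose values may be permuted.
    """
    for assignment in product(domain, repeat=num_args):
        base = factor[assignment]
        values = [assignment[i] for i in subset]
        # Try every rearrangement of the values held by the chosen positions.
        for perm in permutations(values):
            permuted = list(assignment)
            for pos, val in zip(subset, perm):
                permuted[pos] = val
            if factor[tuple(permuted)] != base:
                return False
    return True
```

A factor over two Boolean arguments that depends only on how many arguments are true passes the check, while one that distinguishes `(0, 1)` from `(1, 0)` does not.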

2605.26900 2026-05-27 cs.LG

SPHERE-JEPA: Spherical Prediction with Homogeneous Embeddings

SPHERE-JEPA: 均匀嵌入的球面预测

Léo Nicollier, Max Dunitz, Marc Pic, Pablo Musé, Enric Meinhardt-Llopis, Gabriele Facciolo

发表机构 * Université Paris-Saclay, CNRS, Advanced Track and Trace(巴黎萨克雷大学,国家科学研究中心,先进跟踪与追溯) ENS Paris-Saclay, Centre Borelli, Advanced Track and Trace(巴黎萨克雷高等师范学院,博雷利中心,先进跟踪与追溯) Université Paris-Saclay, CNRS(巴黎萨克雷大学,国家科学研究中心) ENS Paris-Saclay, Centre Borelli(巴黎萨克雷高等师范学院,博雷利中心)

AI总结 本文提出SPHERE-JEPA框架,通过将Cramér-Wold投影机制调整为强制超球面均匀性而非高斯先验,解决了自监督学习中高斯嵌入导致各向异性k-NN邻域的问题,在纹理检索和ImageNet-1K线性探测上取得显著提升。

详情
AI中文摘要

自监督学习中的一个基本开放问题是明确表征学习表示的最优几何。最近,LeJEPA将各向同性高斯嵌入确定为在欧几里得空间中最小化下游预测风险的最优解。然而,对于支撑在低维流形(如超球面)上的分布,相应问题仍未探索。在这项工作中,我们证明将这种极小极大分析扩展到黎曼流形上的光滑分布会根本性地改变最优解。我们表明,在最坏情况公式下,k近邻和核岭回归都诱导超球面均匀性。更精确地说,我们证明流形上的均匀分布对于k近邻是最优的,而球面上的均匀分布对于使用指数点积核和线性核的核岭回归是最优的。这一理论见解揭示了高斯嵌入的一个根本局限:其非均匀密度导致各向异性的k-NN邻域,严重偏置估计器。为纠正这一点,我们引入了SPHERE-JEPA,一个理论基础的SSL框架。我们调整LeJEPA的Cramér-Wold投影机制以强制超球面均匀性而非高斯先验。实验上,SPHERE-JEPA取得了显著改进,将纹理检索mAP提升了超过6%,同时在标准基准上持续匹配或超越LeJEPA——包括在ImageNet-1K(ViT-B/14)上+1.8%的线性探测增益。

英文摘要

A fundamental open question in self-supervised learning (SSL) is the explicit characterization of the optimal geometry of the learned representations. Recently, LeJEPA identified isotropic Gaussian embeddings as optimal for minimizing downstream prediction risk in Euclidean spaces. However, the corresponding problem for distributions supported on lower-dimensional manifolds, such as the hypersphere, remains unexplored. In this work, we demonstrate that extending this minimax analysis to smooth distributions on Riemannian manifolds fundamentally changes the optimal solution. We show that, under a worst-case formulation, both k-nearest neighbors and kernel ridge regression induce hyperspherical uniformity. More precisely, we show that uniform distributions on manifolds are optimal for k-nearest neighbors, and that the uniform distribution on the sphere is optimal for kernel ridge regression with both the exponential dot-product kernel and the linear kernel. This theoretical insight reveals a fundamental limitation of Gaussian embeddings: their non-uniform density induces anisotropic k-NN neighborhoods, severely biasing the estimator. To correct this, we introduce SPHERE-JEPA, a theoretically grounded SSL framework. We adapt LeJEPA's Cramér-Wold projection mechanism to enforce hyperspherical uniformity rather than a Gaussian prior. Empirically, SPHERE-JEPA yields significant improvements, boosting texture retrieval mAP by over 6%, while consistently matching or outperforming LeJEPA on standard benchmarks, including a +1.8% linear probing gain on ImageNet-1K (ViT-B/14).
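
To give a feel for what "hyperspherical uniformity" measures, here is a small sketch that projects embeddings onto the unit sphere and scores them with the pairwise Gaussian-potential uniformity loss of Wang and Isola. This is a swapped-in, well-known measure chosen for brevity; it is not the Cramér-Wold projection mechanism the abstract describes, and the function names and the temperature `t` are assumptions.

```python
import math
from itertools import combinations

def normalize(v):
    """Project a nonzero vector onto the unit sphere."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

def uniformity_loss(embeddings, t=2.0):
    """Log of the mean Gaussian potential over all pairs of normalized points.

    Lower values mean the points are spread more evenly over the sphere.
    Assumes at least two nonzero embeddings.
    """
    points = [normalize(v) for v in embeddings]
    potentials = [
        math.exp(-t * sum((a - b) ** 2 for a, b in zip(u, v)))
        for u, v in combinations(points, 2)
    ]
    return math.log(sum(potentials) / len(potentials))
```

Four points spread evenly around a circle score far lower than four points clustered near one direction, which is the behavior a uniformity-enforcing objective rewards.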

2605.26895 2026-05-27 cs.LG cs.AI stat.ML

Negligible in Size, Significant in Effect: On Scale Vectors in Large Language Models

微不足道的大小,显著的效果:大型语言模型中的尺度向量

Mingze Wang, Shuchen Zhu, Yuxin Fang, Binghui Li, Kai Shen, Shu Zhong

发表机构 * Peking University(北京大学)

AI总结 本文系统研究了大型语言模型中的尺度向量,发现其虽参数占比极小但对预训练至关重要,通过自放大预条件效应优化优化过程,并提出了三种轻量级改进策略,在多种模型规模上一致提升性能。

Comments 36 pages

详情
AI中文摘要

现代大型语言模型(LLM)中的归一化层由确定性归一化操作和可学习的尺度向量组成。尽管归一化操作已被广泛研究,但尺度向量尽管被普遍使用,其作用仍未被充分理解。在这项工作中,我们从表达能力、优化和架构结构的角度对LLM中的尺度向量进行了系统研究。首先,我们通过实验表明,虽然尺度向量仅占模型参数的极小部分,但移除它们会显著降低LLM的预训练效果。我们的理论进一步表明,在Pre-Norm架构中,尺度向量并不增加表达能力;相反,它们通过对后续线性映射产生自放大预条件效应来改善优化。其次,我们研究了权重衰减对尺度向量的作用。通过区分Input-Norm和Output-Norm层,我们从理论上证明,由于它们在优化和表达能力中的不同作用,权重衰减对前者有益但对后者有害。第三,受此理解的启发,我们提出了三种轻量级且互补的尺度向量改进方法:分支特异性异质性、线性映射周围的改进放置以及幅度-方向重参数化。理论和实验均表明,每种改进都能带来一致的收益。最后,我们将这些改进整合为一个统一的尺度向量策略,并通过在0.12B到2B参数的密集和混合专家模型上进行大规模LLM预训练实验,使用多种优化器和学习率调度,在工业级token预算下进行评估。该统一策略始终比精心调整的基线获得更低的终端损失,并展现出更有利的扩展行为,同时增加可忽略的参数和计算开销。

英文摘要

Normalization layers in modern large language models (LLMs) consist of a deterministic normalization operation and a learnable scale vector. While the normalization operation has been extensively studied, the scale vector remains poorly understood despite its ubiquitous use. In this work, we present a systematic study of scale vectors in LLMs from the perspectives of expressivity, optimization, and architectural structure. First, we show empirically that although scale vectors constitute only a negligible fraction of model parameters, removing them substantially degrades LLM pre-training. Our theory further shows that, in Pre-Norm architectures, scale vectors do not increase expressivity; instead, they improve optimization through a self-amplifying preconditioning effect on subsequent linear mappings. Second, we investigate the role of weight decay for scale vectors. By distinguishing Input-Norm and Output-Norm layers, we theoretically show that weight decay is beneficial for the former but harmful for the latter, due to their distinct roles in optimization and expressivity. Third, motivated by this understanding, we propose three lightweight and complementary improvements to scale vectors: branch-specific heterogeneity, improved placement around linear mappings, and magnitude-direction reparameterization. Both theory and experiments show that each improvement yields consistent gains. Finally, we combine these improvements into a unified scale-vector strategy and evaluate it through extensive LLM pre-training experiments on dense and mixture-of-experts models ranging from 0.12B to 2B parameters, across multiple optimizers and learning rate schedules, under industrial-scale token budgets. The unified strategy consistently achieves lower terminal loss than well-tuned baselines and exhibits more favorable scaling behavior, while adding negligible parameter and computational overhead.
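
The claim that scale vectors add no expressivity in Pre-Norm architectures can be illustrated with a short sketch, under the simplifying assumption that the normalization is an RMS normalization followed directly by a bias-free linear map. In that setting the scale vector `g` can be folded into the weight matrix, since W(g ⊙ x̂) = (W diag(g)) x̂. All function names below are hypothetical and the code is not taken from the paper.

```python
import math

def rms_normalize(x, eps=1e-6):
    """Deterministic part of the layer: divide by the root mean square."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def linear(W, x):
    """Bias-free linear map; W is a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def norm_scale_linear(W, g, x):
    """Normalize, apply the learnable scale vector g, then the linear map."""
    x_hat = rms_normalize(x)
    return linear(W, [gi * xi for gi, xi in zip(g, x_hat)])

def fold_scale(W, g):
    """Absorb g into W by scaling column j of W by g[j]."""
    return [[w * gi for w, gi in zip(row, g)] for row in W]
```

Calling `linear(fold_scale(W, g), rms_normalize(x))` gives the same output as `norm_scale_linear(W, g, x)` up to floating-point rounding. The two parameterizations represent the same functions, so any benefit of the scale vector has to come from how it changes optimization, which is the effect the abstract attributes to it.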

2605.26894 2026-05-27 cs.CV

SIMPC: Learning Self-Induced Mirror-Point Consistency for Unsupervised Point Cloud Denoising

SIMPC: 学习自诱导镜像点一致性用于无监督点云去噪

Chengwei Zhang, Xueyi Zhang, Tao Jiang, Xinhao Xu, Wenjie Li, Fubo Zhang, Longyong Chen

发表机构 * National Key Laboratory of Microwave Imaging, Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing, China(微波成像国家重点实验室,航天信息研究所,中国科学院,北京,中国) School of Computing, National University of Singapore, Singapore(计算学院,新加坡国立大学,新加坡)

AI总结 提出自诱导镜像点一致性(SIMPC)方法,通过几何先验生成镜像点并约束去噪目标一致性,实现无监督点云去噪,在合成和真实数据集上超越现有无监督及部分有监督方法。

Comments Accepted by ICML 2026. 17 pages, 8 figures, 8 tables

详情
AI中文摘要

在点云中,噪声直接扰动编码空间位置和几何形状的点坐标,使得构建一一对应关系比图像更具挑战性。现有方法通过噪声或最优传输在噪声变体之间施加统计映射,但存在对应歧义。本文提出自诱导镜像点一致性(SIMPC),以无监督方式学习点与潜在表面之间的确定性对应关系。对于每个噪声点,SIMPC在去噪过程中根据几何先验在潜在表面的另一侧生成一个镜像点。通过鼓励原始点与其镜像点的去噪目标之间的一致性,SIMPC有效定位潜在表面的位置。在合成和真实数据集上的大量实验表明,SIMPC显著优于最先进的无监督方法,并超越了几种强监督方法。

英文摘要

In point clouds, noise directly perturbs point coordinates that encode both spatial location and geometry, making one-to-one correspondence construction more challenging than in images. Existing methods impose statistical mappings across noisy variants via noise or optimal transport, but suffer from correspondence ambiguity. In this work, we propose Self-Induced Mirror-Point Consistency (SIMPC) to learn deterministic correspondences between points and the underlying surface in an unsupervised manner. For each noisy point, SIMPC generates a mirror-point on the opposite side of the underlying surface, guided by geometric priors during the denoising process. By encouraging consistency between the denoising targets of the original point and its mirror counterpart, SIMPC effectively localizes the underlying surface. Extensive experiments on synthetic and real-world datasets demonstrate that SIMPC significantly outperforms state-of-the-art unsupervised methods and surpasses several strong supervised counterparts.
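
One simple way to picture a mirror-point is as the reflection of a noisy point across a locally planar estimate of the surface. The sketch below makes that assumption explicit; the abstract does not specify how SIMPC constructs its mirror-points, so the function `mirror_point` and its inputs (a point on the estimated surface and a local normal) are hypothetical.

```python
import math

def mirror_point(p, surface_point, normal):
    """Reflect p across the plane through surface_point with the given normal.

    The normal need not be unit length, but must be nonzero.
    """
    n_len = math.sqrt(sum(c * c for c in normal))
    n = [c / n_len for c in normal]
    # Signed distance from p to the plane along the unit normal.
    signed = sum((a - b) * c for a, b, c in zip(p, surface_point, n))
    # Move twice that distance back through the plane.
    return [a - 2.0 * signed * c for a, c in zip(p, n)]
```

Under this planar assumption the original point and its reflection are equidistant from the surface and share the same closest surface point, which is why asking a denoiser to map both to the same target pins down where the surface lies.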

2605.26893 2026-05-27 cs.CL cs.AI

GeoFaith: A Spatio-Temporal Dual View of Faithful Chain-of-Thought

GeoFaith: 时空双视角下的忠实思维链

Weijiang Lv, Wentong Zhao, Jiayu Wang, Yuhao Wu, Jiaheng Wei, Xiaobo Xia

发表机构 * Xidian University(西安电子科技大学) Xi’an Jiaotong University(西安交通大学) Mohamed bin Zayed University of Artificial Intelligence(穆罕默德·本·扎耶德人工智能大学) The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) University of Science and Technology of China(中国科学技术大学)

AI总结 针对思维链推理中的事后合理化问题,提出基于潜在几何结构和熵动力学的时空框架GeoFaith,通过可扩展的引导流水线构建忠实性检测器并联合优化结果正确性、过程忠实性和轨迹一致性。

详情
AI中文摘要

思维链推理推动了大型语言模型的发展,但基于结果的监督导致了普遍的事后合理化,产生了看似合理但不忠实的推理链。大多数先前的忠实性评估方法要么不可扩展、昂贵,要么不可靠。我们提出GeoFaith,一个利用潜在几何结构和熵动力学来诊断和强制忠实推理的时空框架。我们开发了一个可扩展的引导流水线,将四个领域的步骤级标注从1k扩展到20k样本,训练了一个在标准基准上优于GPT-5的8B忠实性检测器,并设计了一个忠实性感知的强化学习框架,联合优化结果正确性、过程忠实性和轨迹一致性。实验表明,所提出的方法在忠实性检测和下游推理上均取得了优越性能,生成了更短、更可解释的链,且不牺牲准确性。我们的代码将公开提供。

英文摘要

Chain-of-Thought (CoT) reasoning has advanced large language models (LLMs), but outcome-based supervision leads to pervasive post-hoc rationalization, producing plausible yet unfaithful reasoning chains. Most prior faithfulness assessment methods are unscalable, expensive, or unreliable. We propose GeoFaith, a spatio-temporal framework that leverages latent geometric structure and entropy dynamics to diagnose and enforce faithful reasoning. We develop a scalable bootstrapping pipeline that expands step-level annotations from 1k to 20k samples across four domains, train an 8B faithfulness detector that outperforms GPT-5 on standard benchmarks, and design a faithfulness-aware reinforcement learning framework that jointly optimizes outcome correctness, process faithfulness, and trajectory consistency. Experiments show that the proposed method achieves superior performance on both faithfulness detection and downstream reasoning, producing shorter, more interpretable chains without sacrificing accuracy. Our code will be made publicly available.

2605.26879 2026-05-27 cs.CV

Natural Human Motion Recovery by Aligning High-Order Temporal Dynamics from Monocular Videos

通过对齐单目视频中的高阶时间动态恢复自然人体运动

Dingkun Wei, Zehong Shen, Yan Xia, Georgios Pavlakos, Yujun Shen, Xiaowei Zhou

Affiliations: Zhejiang University; Ant Group; The University of Texas at Austin

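AI summary: Proposes the HTD-Refine framework, which uses high-order temporal dynamics (velocity and acceleration) estimated by PVA-Net to refine global trajectories and recover natural human motion.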

Comments 13 pages, 6 figures. Accepted as an Oral presentation and Best Paper Candidate at CVPR 2026. Project page: https://zju3dv.github.io/htd-refine/

AI-generated Chinese abstract

从单目视频中恢复的人体运动通常显得过于平滑或动态不一致,即使关节位置在数值上是准确的。我们观察到,这种局限性源于缺乏可靠的高阶时间线索——速度和加速度——这些对于重建具有真实动量、时序和高频细节的运动至关重要。我们引入了HTD-Refine,一个后处理框架,通过显式估计的高阶时间动态来增强现有的人体运动恢复(HMR)流程。我们系统的核心是PVA-Net,一个时间变换器,它直接从单目视频推断每个关节的2D位置、3D速度和3D加速度。这些预测的动态作为全局优化过程中的软约束,优化世界空间轨迹,显著减少抖动、抑制过度平滑,并恢复物理上合理的运动。在具有挑战性的野外基准上的大量实验表明,HTD-Refine持续改进了最先进的HMR方法,产生了更准确的全局轨迹和更自然的运动动态。我们的结果强调了高阶时间建模在推进单目人体运动恢复中的关键作用。

English abstract

Human motion recovered from monocular videos often appears overly smooth or dynamically inconsistent, even when joint positions are numerically accurate. We observe that this limitation stems from the absence of reliable high-order temporal cues -- velocity and acceleration -- which are essential for reconstructing motion that exhibits realistic momentum, timing, and high-frequency detail. We introduce HTD-Refine, a post-processing framework that augments existing Human Motion Recovery (HMR) pipelines using explicitly estimated high-order temporal dynamics. At the core of our system is PVA-Net, a temporal transformer that infers per-joint 2D positions, 3D velocities, and 3D accelerations directly from a monocular video. These predicted dynamics serve as soft yet informative constraints in a global optimization procedure that refines world-space trajectories, significantly reducing jitter, suppressing over-smoothing, and restoring physically plausible motion. Extensive experiments on challenging in-the-wild benchmarks show that HTD-Refine consistently improves state-of-the-art HMR methods, yielding more accurate global trajectories and substantially more natural motion dynamics. Our results highlight the critical role of high-order temporal modeling in advancing monocular human motion recovery.
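
The abstract does not give the optimization objective, so the following is only a minimal sketch of the general idea: velocity and acceleration are derived from a trajectory by finite differences and compared against predicted dynamics as a soft penalty. The function names, the 1D trajectory, and the weights `w_vel` and `w_acc` are illustrative assumptions, and the "predicted" dynamics are taken from a reference trajectory as a stand-in for PVA-Net outputs.

```python
def finite_differences(positions, dt):
    """Return forward-difference velocities and central-difference accelerations."""
    vel = [(positions[t + 1] - positions[t]) / dt
           for t in range(len(positions) - 1)]
    acc = [(positions[t + 1] - 2 * positions[t] + positions[t - 1]) / dt ** 2
           for t in range(1, len(positions) - 1)]
    return vel, acc


def dynamics_penalty(positions, pred_vel, pred_acc, dt, w_vel=1.0, w_acc=1.0):
    """Soft penalty: squared mismatch between a trajectory's finite-difference
    dynamics and the predicted velocities/accelerations (hypothetical form)."""
    vel, acc = finite_differences(positions, dt)
    e_vel = sum((v - pv) ** 2 for v, pv in zip(vel, pred_vel))
    e_acc = sum((a - pa) ** 2 for a, pa in zip(acc, pred_acc))
    return w_vel * e_vel + w_acc * e_acc


dt = 1.0
true_traj = [0.0, 1.0, 4.0, 9.0, 16.0]   # x = t^2, constant acceleration of 2
# Stand-in for network predictions: dynamics of the reference trajectory.
pred_vel, pred_acc = finite_differences(true_traj, dt)

# An over-smoothed candidate with the same start point but constant velocity.
smoothed = [0.0, 2.0, 4.0, 6.0, 8.0]

penalty_true = dynamics_penalty(true_traj, pred_vel, pred_acc, dt)
penalty_smoothed = dynamics_penalty(smoothed, pred_vel, pred_acc, dt)
print(penalty_true, penalty_smoothed)
```

In this toy case the over-smoothed candidate has zero acceleration everywhere, so it is penalized even though its positions look plausible, which is the kind of mismatch a dynamics-based constraint is meant to expose.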

2605.26878 2026-05-27 cs.AI

Multi-Stakeholder LLM Alignment: Decomposing Estimation from Aggregation

多方利益相关者LLM对齐:将估计与聚合解耦

Lulu Zheng, Wenjin Yang, Xiangwen Zhang, Rong Yin, Yulan Hu, Zheng Pan, Xin Li

Affiliations: AMAP, Alibaba Group; Beihang University

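AI summary: For multi-stakeholder tasks with conflicting user preferences, proposes DecompR, which fixes counterfactual-calibrated weights from the query structure and estimates per-role utilities independently, removing candidate-dependent weight drift and reducing estimation noise.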

AI-generated Chinese abstract

多方利益相关者任务需要单个输出满足具有冲突偏好的用户。整体LLM评判器混淆了效用估计和效用聚合,产生不稳定的隐式权重。我们通过实验和理论证明,当利益相关者满意度分散时,这种聚合特定的\emph{权重噪声}可能导致较大的分数偏移;在我们的实验中,这些权重引起的偏移也随利益相关者数量增加而增加。我们提出 \textsc{DecompR}:反事实校准的权重在候选评分前从查询结构固定,而每个角色的效用独立估计,消除了候选依赖的权重漂移并减少了估计噪声。

English abstract

Multi-stakeholder tasks require one output to satisfy users with conflicting preferences. Holistic LLM judges conflate utility estimation and utility aggregation, yielding unstable implicit weights. We show empirically and theoretically that this aggregation-specific \emph{weighting noise} can create large score shifts when stakeholder satisfaction is dispersed; in our experiments, these weight-induced shifts also increase with stakeholder count. We propose \textsc{DecompR}: counterfactual-calibrated weights are fixed from query structure before candidate scoring, while per-role utilities are estimated independently, removing candidate-dependent weight drift and reducing estimation noise.
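
The claim that weighting noise matters most when stakeholder satisfaction is dispersed can be illustrated with simple arithmetic. The sketch below assumes the aggregate score is a weighted sum of per-role utilities with weights summing to 1; this form, the function names, and all numbers are illustrative assumptions and are not taken from the paper.

```python
def aggregate(weights, utilities):
    """Weighted sum of per-role utilities (weights assumed to sum to 1)."""
    return sum(w * u for w, u in zip(weights, utilities))


def score_shift(base_weights, drifted_weights, utilities):
    """Score change caused only by a change in weights, utilities held fixed."""
    return aggregate(drifted_weights, utilities) - aggregate(base_weights, utilities)


base = [0.5, 0.3, 0.2]       # weights fixed before scoring (hypothetical)
drifted = [0.3, 0.3, 0.4]    # implicit weights after a hypothetical drift

agree = [0.7, 0.7, 0.7]      # every stakeholder equally satisfied
dispersed = [0.9, 0.7, 0.1]  # stakeholders disagree strongly

shift_agree = score_shift(base, drifted, agree)          # approximately 0
shift_dispersed = score_shift(base, drifted, dispersed)  # approximately -0.16
print(shift_agree, shift_dispersed)
```

When all utilities are equal, any reweighting that still sums to 1 leaves the score unchanged; the same weight drift moves the score noticeably once utilities differ across roles.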