arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.30116 2026-05-29 cs.CV cs.LG

SGMD: Score Gradient Matching Distillation for Few-Step Video Diffusion Distillation

SGMD: 得分梯度匹配蒸馏用于少步视频扩散蒸馏

Zhuguanyu Wu, Ruihao Gong, Yang Yong, Yushi Huang, Xiangyu Fan, Lei Yang, Dahua Lin, Xianglong Liu

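AI summary: To address the high training cost and conservative motion dynamics of distribution matching distillation in few-step video diffusion, the paper proposes Score Gradient Matching Distillation (SGMD), which directly optimizes the fake score toward the teacher and uses a teacher stop-gradient Fisher objective as a stable target, yielding roughly 3x faster training and markedly stronger motion dynamics.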

Comments
ICML 2026
AI-generated Chinese abstract

分布匹配蒸馏(DMD)是加速少步视频扩散模型推理的常用范式。然而,DMD风格的视频蒸馏面临两个耦合挑战:假得分必须跟踪不断演化的生成器,当需要频繁更新时训练成本高昂,而反向KL风格匹配可能具有模式寻求性和保守性,难以保持强运动动态。为解决这些问题,我们提出\textbf{得分梯度匹配蒸馏(SGMD)}。SGMD采用假得分视角,直接优化假得分朝向教师,同时使用教师停止梯度Fisher作为稳定的分布匹配目标。我们提供了梯度分析,论证了在理想跟踪下该目标选择的合理性。在此基础上,SGMD引入一对双重势:负残差(NR)用于外环校正,残差收缩(RC)用于内环跟踪。实验上,与DMD2相比,SGMD实现了约$\sim 3\times$的训练加速,并显著改善了4步蒸馏模型的运动动态,同时保持了时间一致性。一项人类研究证实,SGMD在运动质量和整体偏好上更受青睐,而视觉质量和文本对齐保持相当。代码可在https://github.com/ModelTC/LightX2V获取。

English abstract

Distribution Matching Distillation (DMD) is a widely used paradigm for accelerating inference in few-step video diffusion models. However, DMD-style video distillation faces two coupled challenges: the fake score must track a continuously evolving generator, making training costly when frequent updates are required, while reverse-KL-style matching can be mode-seeking and conservative for preserving strong motion dynamics. To address these issues, we propose \textbf{Score Gradient Matching Distillation (SGMD)}. SGMD adopts a fake-score perspective by directly optimizing the fake score toward the teacher, while using teacher stop-gradient Fisher as a stable distribution-matching objective. We provide a gradient analysis that motivates this objective choice under ideal tracking. Building on this, SGMD introduces a pair of dual potentials: negative-residual (NR) for outer-loop correction and residual-contraction (RC) for inner-loop tracking. Empirically, compared to DMD2, SGMD achieves an approximately $\sim 3\times$ training speedup and substantially improves motion dynamics for 4-step distilled models while preserving temporal consistency. A human study confirms that SGMD is preferred in motion quality and overall preference, while visual quality and text alignment remain comparable. Code is available at https://github.com/ModelTC/LightX2V.

2605.30115 2026-05-29 cs.CV

Large Depth Completion Model from Sparse Observations

来自稀疏观测的大深度补全模型

Zhu Yu, Zhengyi Zhao, Runmin Zhang, Lingteng Qiu, Kejie Qiu, Yisheng He, Siyu Zhu, Zilong Dong, Si-Yuan Cao, Hui-Liang Shen

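AI summary: Proposes LDCM, which uses a monocular foundation model and a Poisson-based depth initialization strategy, together with a point map head that regresses 3D coordinates, to achieve metric-accurate depth completion from sparse observations.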

Comments
ICLR 2026. Project webpage: https://pkqbajng.github.io/ldcm/
AI-generated Chinese abstract

本文提出了大深度补全模型(LDCM),一个简单、有效且鲁棒的框架,用于稀疏观测下的单视图度量深度估计。在不依赖复杂架构设计的情况下,LDCM使用Transformer生成度量准确的密集深度图。它在多种数据集和稀疏观测下优于现有方法。我们从两个关键角度实现这一点:(1)利用现有的单目基础模型提高稀疏深度输入的质量,(2)重新制定训练目标以更好地捕捉几何结构和度量一致性。具体来说,首先引入基于泊松的深度初始化策略,从不同的稀疏观测生成均匀的粗密集深度图,为网络提供强大的结构先验。关于训练目标,我们用点图头替换传统的深度头,该点图头回归相机空间中的逐像素3D坐标,使模型能够直接学习底层3D场景结构,而不是执行逐像素深度图恢复。此外,这种设计消除了对相机内参的需求,使LDCM能够自然地产生度量尺度的3D点图。大量实验表明,LDCM在多个基准测试和不同稀疏度水平下,在深度补全和点图估计方面均持续优于最先进的方法,展示了其有效性和对未见数据分布的强泛化能力。

English abstract

This work presents the Large Depth Completion Model (LDCM), a simple, effective, and robust framework for single-view metric depth estimation with sparse observations. Without relying on complex architectural designs, LDCM generates metric-accurate dense depth maps using a transformer. It outperforms existing approaches across diverse datasets and sparse observations. We achieve this from two key perspectives: (1) leveraging existing monocular foundation models to improve the quality of sparse depth inputs, and (2) reformulating training objectives to better capture geometric structure and metric consistency. Specifically, a Poisson-based depth initialization strategy is first introduced to generate a uniform coarse dense depth map from diverse sparse observations, providing a strong structural prior for the network. Regarding the training objective, we replace the conventional depth head with a point map head that regresses per-pixel 3D coordinates in camera space, enabling the model to directly learn the underlying 3D scene structure instead of performing pixel-wise depth map restoration. Moreover, this design eliminates the need for camera intrinsic parameters, allowing LDCM to naturally produce metric-scaled 3D point maps. Extensive experiments demonstrate that LDCM consistently outperforms state-of-the-art methods across multiple benchmarks and varying sparsity levels in both depth completion and point map estimation, showcasing its effectiveness and strong generalization to unseen data distributions.
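
The contrast between a depth head and a point map head can be made concrete with a minimal sketch based on the standard pinhole camera model. The function names and the toy intrinsics below are illustrative assumptions, not code from LDCM. Lifting a depth map to 3D requires the intrinsics `fx`, `fy`, `cx`, `cy`, whereas a head that regresses camera-space `(x, y, z)` per pixel outputs the point map directly, and depth is simply its z channel.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]


def backproject_depth(depth: List[List[float]], fx: float, fy: float,
                      cx: float, cy: float) -> List[List[Point]]:
    """Lift a dense depth map to a camera-space point map (pinhole model).

    This is the step that needs camera intrinsics when a model only
    predicts per-pixel depth.
    """
    point_map = []
    for v, row in enumerate(depth):          # v: pixel row index
        out_row = []
        for u, z in enumerate(row):          # u: pixel column index
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            out_row.append((x, y, z))
        point_map.append(out_row)
    return point_map


def depth_from_point_map(point_map: List[List[Point]]) -> List[List[float]]:
    """Recover depth from a point map: it is the z coordinate of each pixel."""
    return [[p[2] for p in row] for row in point_map]
```

Under this view, a point map prediction carries both the depth and the implied ray directions, which is one way to read the abstract's claim that intrinsics are no longer needed as an input.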

2605.30112 2026-05-29 cs.LG

Striding Across Reynolds Numbers: Representation Geometry in Neural PDE Generalisation

跨越雷诺数:神经PDE泛化中的表示几何

Jianing Shi

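AI summary: Analyzes representation geometry in the cross-Reynolds generalisation of neural PDE solvers and finds that a matching method based on a convolutional autoencoder (ConvAE-Relay) reaches 38.34% error without any target-regime data, pointing to the key role of local, multi-scale representations in cross-Reynolds transfer.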

Comments
12 pages, 8 figures, 5 tables
AI-generated Chinese abstract

神经PDE求解器中的跨雷诺数泛化仍然缺乏表征。在标准的强迫二维Navier-Stokes基准上,训练好的傅里叶神经算子在10倍雷诺数偏移下达到46.68%的相对L2误差,而零前向模型检索基线已经改进到41-42%。这表明表示几何是测试方法中的一个主要组织变量。我们通过ConvAE-Relay测试这一假设,该方法在源训练卷积自编码器潜在空间中匹配状态,并从源域数据库借用动力学,仅使用源域数据库且无需目标域拟合、标签或数据库条目,达到38.34+/-0.07%的误差。2x2消融实验将匹配质量隔离为优于更新规则的主导因素。Oracle实验证实,当匹配保持在流形上时,源域动力学方向仍然可迁移(余弦相似度~0.84);自回归漂移是主要瓶颈(约12个百分点)。从学习预测方面,具有多尺度跳跃连接的U-Net达到34.72+/-0.60%的误差,与检索方面的发现一致,即局部多尺度表示组织测试方法中的跨雷诺数迁移。所有结论均限于该基准。

English abstract

Cross-Reynolds generalisation in neural PDE solvers remains poorly characterised. On the canonical forced 2D Navier-Stokes benchmark, a trained Fourier Neural Operator reaches 46.68% relative L2 error under a 10x Reynolds-number shift, yet zero-forward-model retrieval baselines already improve to 41-42%. This suggests representation geometry as a major organising variable among the tested methods. We test this hypothesis through ConvAE-Relay, which matches states in a source-trained convolutional autoencoder latent space and borrows dynamics from a source-regime database, achieving 38.34+/-0.07% using only a source-regime database and no target-regime fitting, labels, or database entries. A 2x2 ablation isolates matching quality as dominant over the update rule. Oracle experiments confirm that source-regime dynamics directions remain transferable (cosine similarity ~0.84) when matching stays on-manifold; autoregressive drift is the primary bottleneck (~12 percentage points). From the learned-prediction side, a U-Net with multi-scale skip connections achieves 34.72+/-0.60%, consistent with the retrieval-side finding that local, multi-scale representations organise cross-Reynolds transfer among tested methods. All claims are scoped to this benchmark.
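
The match-then-borrow idea and the reported metrics can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `encode` callable stands in for the source-trained autoencoder, the database is a plain list of (state, next state) pairs, and the additive update is only one plausible update rule, not necessarily the one used in ConvAE-Relay.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def relative_l2_error(pred: Vector, target: Vector) -> float:
    """Relative L2 error: ||pred - target|| / ||target||."""
    num = math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)))
    den = math.sqrt(sum(t ** 2 for t in target))
    return num / den


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two direction vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def relay_step(state: Vector,
               database: List[Tuple[Vector, Vector]],
               encode: Callable[[Vector], Vector]) -> List[float]:
    """One retrieval step: match in latent space, then borrow source dynamics.

    database holds (source_state, source_next_state) pairs. The update adds
    the matched source increment to the current state (hypothetical rule).
    """
    z = encode(state)

    def latent_distance(entry: Tuple[Vector, Vector]) -> float:
        z_src = encode(entry[0])
        return sum((a - b) ** 2 for a, b in zip(z, z_src))

    src, src_next = min(database, key=latent_distance)
    return [s + (n - m) for s, n, m in zip(state, src_next, src)]
```

Applying `relay_step` repeatedly gives an autoregressive rollout, which is where the drift described in the abstract would accumulate: each step's match is computed from the previous step's output rather than from a ground-truth state.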

2605.30111 2026-05-29 cs.CV cs.AI

xModel-KD: Cross-modal Knowledge Distillation for 3D Scene Perception using LiDAR

xModel-KD:基于LiDAR的3D场景感知跨模态知识蒸馏

Thenukan Pathmanathan, Kanchan Keisham, Thangarajah Akilan

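AI summary: Proposes xModel-KD, a cross-modal knowledge distillation framework that uses contrastive learning to align 2D image texture with 3D point cloud geometry features, improving LiDAR point cloud segmentation without additional annotation.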

Comments
3 figures, and 5 tables
AI-generated Chinese abstract

点云分割是3D场景理解中的基础任务。其进展受到密集3D标注高成本和高时间的限制,导致标注样本难以获取。除了标注稀缺,不同感知模态面临固有局限性。2D图像提供丰富的纹理和外观线索,但缺乏明确的深度和几何结构。相比之下,3D点云捕捉精确的空间几何,但稀疏且不含纹理信息。因此,依赖单一模态限制了所学表示的丰富性并削弱了泛化能力。尽管最近结合3D点云与2D图像的多模态方法在分类和检索等任务中表现出色,但它们通常依赖大规模标注数据集,且尚未充分用于数据高效的密集预测。为解决这些限制,我们提出一种新颖的跨模态知识蒸馏框架xModel-KD,用于3D点云分割。我们的方法通过跨模态对齐学习统一的逐点表示,利用2D纹理和3D几何的互补优势。具体而言,我们设计了一个跨模态融合编码器,通过对比目标训练,强制多视图下对应的2D和3D表示之间的特征一致性。通过将强大的预训练骨干与有针对性的融合策略相结合,所提框架有效地将图像的外观线索迁移到几何感知的点特征中。实验结果表明,跨模态融合在mIoU上比仅使用LiDAR的基线实现了2%的绝对提升,证明了利用互补多模态信息进行可扩展和标注高效的3D场景理解的优势。

English abstract

Point cloud segmentation is a fundamental task in 3D scene understanding. Its progress is constrained by the high cost and time required for dense 3D annotations, making labeled samples difficult to obtain. Beyond annotation scarcity, different sensing modalities face inherent limitations. 2D images provide rich texture and appearance cues, yet they lack explicit depth and geometric structure. In contrast, 3D point clouds capture accurate spatial geometry but are sparse and contain no texture information. As a result, relying on a single modality restricts the richness of learned representations and weakens generalization. Although recent multi-modal methods that combine 3D point clouds with 2D images have demonstrated strong performance in tasks such as classification and retrieval, they typically depend on large-scale labeled datasets and have not been fully exploited for data-efficient dense prediction. To address these limitations, we propose a novel cross-modal knowledge distillation framework, xModel-KD, for 3D point cloud segmentation. Our method exploits the complementary strengths of 2D texture and 3D geometry by learning unified per-point representations through cross-modal alignment. Specifically, we design a cross-modal fusion encoder trained with a contrastive objective that enforces feature consistency between corresponding 2D and 3D representations across multiple views. By integrating powerful pre-trained backbones with a targeted fusion strategy, the proposed framework effectively transfers appearance cues from images to geometry-aware point features. Experimental results show that cross-modal fusion achieves a 2% absolute improvement in mIoU over a LiDAR-only baseline, demonstrating the benefit of leveraging complementary multi-modal information for scalable and annotation-efficient 3D scene understanding.

2605.30107 2026-05-29 cs.CL

Dial HEALTHDIAL for Advice: A Multilingual and Multi-Parallel Spoken Dialogue Dataset for Knowledge-Grounded Information Seeking

Dial HEALTHDIAL for Advice: 一个用于知识驱动信息检索的多语言多平行口语对话数据集

Songbo Hu, Yinhong Liu, Ej Zhou, Evgeniia Razumovskaia, Xiaobin Wang, Alexander Fraser, Ivan Vulić, Anna Korhonen

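AI summary: Builds HEALTHDIAL, a large-scale multilingual, multi-parallel spoken dialogue dataset for developing spoken dialogue systems based on retrieval-augmented generation, and reveals performance disparities across languages.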

Comments
Accepted to Findings of ACL 2026
AI-generated Chinese abstract

创建口语对话数据集在方法论上具有挑战性,当目标是构建大规模多语言多平行数据集时,这些挑战更加突出。本文介绍了HEALTHDIAL,一个用于开发和评估基于检索增强生成(RAG)的口语对话系统的大规模多语言多平行数据集。该数据集包含6,000个信息寻求对话(每种语言1,500个),这些对话基于世界卫生组织(WHO)的可信内容,以及来自四种WHO官方语言(阿拉伯语、中文、英语和西班牙语)的母语者录制的163小时用户语音。每个说话者都标注了人口统计学(如性别、年龄)和社会语言学(如主要语言、原籍地区)变量。我们报告了关键对话任务的基准结果,揭示了不同语言之间(即使是高资源语言)持续存在的性能差异。为支持未来研究,我们发布了该数据集、一个原型系统以及一个用于数据收集和系统评估的工具包。

English abstract

Creating spoken dialogue datasets is methodologically challenging, and these challenges are amplified when the goal is to build multilingual, multi-parallel datasets at scale. This work introduces HEALTHDIAL, a large-scale, multilingual, and multi-parallel dataset for developing and evaluating retrieval-augmented generation (RAG)-based spoken dialogue systems. The dataset comprises 6,000 information-seeking dialogues (1,500 per language) grounded in trusted content from the World Health Organization (WHO) and 163 hours of user speech recorded from native speakers of diverse dialects across four official WHO languages: Arabic, Chinese, English, and Spanish. Each speaker is annotated with demographic (e.g., gender, age) and sociolinguistic (e.g., primary language, region of origin) variables. We report benchmark results across key dialogue tasks, which reveal consistent performance disparities across languages, even among high-resource ones. To support future research, we release the dataset, a prototype system, and a toolkit for data collection and system evaluation.

2605.30104 2026-05-29 cs.CL

SEAL: Can Saturated Benchmarks Be Revived by LLM-as-a-Meta-Judge?

SEAL: 饱和基准能否通过LLM作为元裁判得以复兴?

Jiamin Chen, Yidi Wu, Qiexiang Wang, Qianben Chen, Yuchen Li, Yansen Zhang, Xiaokun Zhang, Wangchunshu Zhou, Chen Ma

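AI summary: Proposes the SEAL protocol, which uses an adaptive LLM meta-judge to extract latent ranking signal from saturated benchmarks, achieving high ranking accuracy with fewer calls on tasks such as code generation and mathematical reasoning.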

AI-generated Chinese abstract

广泛使用的语言模型基准日益饱和,前沿系统常获得标准指标无法区分的接近分数。我们不构建更难的替代方案,而是探究是否可以通过改进对相同候选输出的评估来使现有任务重新具有信息量。因此,我们提出了带自适应LLM元裁判的种子淘汰法,这是一种自我改进的评估协议,用于从饱和基准中提取潜在排名信号。SEAL将候选输出种子化为单淘汰赛,并通过任务级原则和自改进检查表标准评估每场比赛。我们在涵盖代码生成、数学推理、知识密集型问答和工具使用智能体任务完成的多个饱和基准上评估SEAL。在这些设置中,SEAL改善了排名准确性与延迟之间的权衡,与完全成对评判相比达到了0.83-1.00的Spearman一致性和4/4的top-1一致性,同时每个任务仅需11.89次调用,而完全成对评估需要28.00次。

English abstract

Widely used language-model benchmarks are increasingly saturated, with frontier systems often receiving near-tied scores that standard metrics cannot resolve. Rather than constructing harder alternatives, we ask whether existing tasks can be made informative again through improved evaluation over the same candidate outputs. Therefore, we present Seeded Elimination with Adaptive LLM-as-a-Meta-Judge (SEAL), a self-improving evaluation protocol for extracting latent ranking signal from saturated benchmarks. SEAL seeds candidate outputs into a single-elimination tournament and evaluates each match with task-level principles plus self-improving checklist criteria. We evaluate SEAL on multiple saturated benchmarks covering code generation, mathematical reasoning, knowledge-intensive question answering, and tool-use agent task completion. Across these settings, SEAL improves the ranking-accuracy--latency trade-off over competing protocols, attaining 0.83--1.00 Spearman agreement with full pairwise judging and 4/4 top-1 agreement, while requiring only 11.89 calls per task compared with 28.00 for full pairwise evaluation.
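
The call-count comparison can be sketched with a bare seeded single-elimination bracket. This is a simplified illustration, not the SEAL protocol: the `judge` callable is a hypothetical stand-in for an LLM judge, and the pairing scheme (best remaining seed against worst) is an assumption. Note that 28 equals the number of unordered pairs among 8 candidates, which is consistent with, but not stated by, the abstract. A bare bracket over `n` candidates uses `n - 1` judge calls, so the sketch does not account for the additional calls behind the reported 11.89 average.

```python
from typing import Callable, List, Sequence, Tuple


def single_elimination(seeded: Sequence[str],
                       judge: Callable[[str, str], str]) -> Tuple[str, int]:
    """Run a seeded single-elimination bracket.

    seeded is ordered from strongest to weakest seed; judge(a, b) returns
    the winner of one match. Returns (winner, number of judge calls).
    """
    alive: List[str] = list(seeded)
    calls = 0
    while len(alive) > 1:
        next_round: List[str] = []
        i, j = 0, len(alive) - 1
        while i < j:
            # Pair the best remaining seed with the worst remaining seed.
            next_round.append(judge(alive[i], alive[j]))
            calls += 1
            i += 1
            j -= 1
        if i == j:
            # Odd number of candidates: the middle seed gets a bye.
            next_round.append(alive[i])
        alive = next_round
    return alive[0], calls


def full_pairwise_calls(n: int) -> int:
    """Number of judge calls when every unordered pair is compared once."""
    return n * (n - 1) // 2
```

A bracket only identifies a winner reliably; recovering a full ranking from it needs extra structure, which may be part of why the reported call count sits between the bracket minimum and the full pairwise cost.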

2605.30103 2026-05-29 cs.LG

Convergence Theory for Iterative LLM-Based Neural Architecture Search: A Parametric Cross-Entropy Framework with Closed-Form Proxy Reliability

基于迭代式LLM的神经架构搜索的收敛理论:一个具有闭式代理可靠性的参数化交叉熵框架

Santosh Premi Adhikari, Radu Timofte, Dmitry Ignatov

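AI summary: Models iterative LLM-NAS as a parametric cross-entropy method and proves convergence, geometric convergence of the elite-set probability, the advantage of delta-based generation in valid-generation rate, prevention of mode collapse by MinHash-Jaccard deduplication, and a closed-form proxy-reliability formula, with experiments checking the theoretical predictions.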

Comments
14 pages, 2 figures, 2 tables. Submitted to NeurIPS 2026
AI-generated Chinese abstract

大型语言模型(LLM)越来越多地被用作迭代式神经架构搜索(NAS)中的生成器,然而这类算法尚无正式的收敛理论。我们将迭代式LLM-NAS建模为在可执行程序上的参数化交叉熵(CE)方法,并证明了六个结果:(1)在精英架构上的迭代式LLM微调等价于限制在LLM参数族内的CE更新;(2)期望架构质量在循环间单调非减;(3)精英集概率以几何速率C_t >= 1-(1-rho_0)^t收敛到不动点;(4)在一阶马尔可夫令牌误差模型下,基于增量的生成比全代码生成实现严格更高的有效生成率;(5)MinHash-Jaccard新颖性过滤器防止模式崩溃;(6)代理可靠性具有闭式形式rho_S = (6/pi) arcsin(rho_P(SNR)/2),从而得出实际诊断条件sigma^2_arch >> sigma^2_noise作为基于代理的可靠排名的必要条件。在22个循环、三个LLM、六个数据集、3300个生成架构的实验中,定量验证了两个预测,在效应方向层面验证了两个预测,并解释了先前经验观察到但未得到解释的代理可靠性天花板效应。

English abstract

Large language models (LLMs) are increasingly used as generators in iterative neural architecture search (NAS), yet no formal convergence theory exists for this class of algorithms. We model iterative LLM-NAS as a parametric Cross-Entropy (CE) method over executable programs and prove six results: (1) iterative LLM fine-tuning on elite architectures is equivalent to the CE update restricted to the LLM parametric family; (2) expected architecture quality is monotonically non-decreasing across cycles; (3) elite-set probability converges to a fixed point at a geometric rate C_t >= 1-(1-rho_0)^t; (4) delta-based generation achieves a strictly higher valid-generation rate than full-code generation under a first-order Markov token-error model; (5) the MinHash-Jaccard novelty filter prevents mode collapse; (6) proxy reliability admits the closed-form rho_S = (6/pi) arcsin(rho_P(SNR)/2), yielding the practical diagnostic sigma^2_arch >> sigma^2_noise as a necessary condition for trustworthy proxy-based rankings. Testing against a 22-cycle, three-LLM, six-dataset experiment with 3,300 generated architectures confirms two predictions quantitatively, two at direction-of-effect level, and explains the proxy-reliability ceiling effect previously reported empirically but left unexplained.
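
The two closed-form expressions quoted in the abstract are easy to evaluate numerically. The sketch below is a direct transcription under stated assumptions: `rho_0` and `rho_p` are treated as plain inputs, because the abstract does not say how `rho_0` is estimated or how `rho_P` depends on the SNR, and the function names are hypothetical. The second expression has the same form as the classical relation between Spearman and Pearson correlation for bivariate normal data.

```python
import math


def elite_probability_lower_bound(rho_0: float, t: int) -> float:
    """Lower bound C_t >= 1 - (1 - rho_0)^t on elite-set probability."""
    return 1.0 - (1.0 - rho_0) ** t


def cycles_needed(rho_0: float, target: float) -> int:
    """Smallest cycle count t whose lower bound reaches target.

    Assumes 0 < rho_0 <= 1 and 0 <= target < 1 so the loop terminates.
    """
    t = 0
    while elite_probability_lower_bound(rho_0, t) < target:
        t += 1
    return t


def spearman_from_pearson(rho_p: float) -> float:
    """Proxy reliability rho_S = (6 / pi) * arcsin(rho_P / 2)."""
    return (6.0 / math.pi) * math.asin(rho_p / 2.0)
```

Because the map from `rho_p` to `rho_S` is monotone and sends 0 to 0 and 1 to 1, any noise that lowers the Pearson correlation of the proxy also lowers the rank correlation, which matches the abstract's diagnostic that architecture variance must dominate noise variance.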

2605.30102 2026-05-29 cs.MA cs.AI

When Cloud Agents Meet Device Agents: Lessons from Hybrid Multi-Agent Systems

当云端智能体遇到设备端智能体:混合多智能体系统的经验教训

Corrado Rainone, Davide Belli, Bence Major, Arash Behboodi

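AI summary: Systematically studies the design space of hybrid multi-agent systems (combining small on-device models with large cloud models), analyzes how design choices affect the Pareto frontier of power, cost, and performance, and finds that the optimal architecture is highly task-dependent and that more frontier compute does not always yield better performance.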

Comments
30 pages, 16 figures. Accepted to the Second Workshop on Agents in the Wild: Safety, Security, and Beyond (AIWILD) at ICML 2026
AI-generated Chinese abstract

智能体AI推理的设计空间涵盖两个极端:前沿大语言模型(LLM),通常托管在云端,在广泛任务上提供强性能但成本高昂;以及更具成本效益的小语言模型(SLM),适合设备端推理。结合设备端和云端模型的混合多智能体系统(MAS)提供了一种有前景的中间地带,但它们也引入了一个复杂且理解不足的设计空间,其中任务准确性、货币成本和边缘能耗紧密耦合;在缺乏通用设计原则的情况下,混合组件虽然并非最普遍的选择,但通常通过针对特定领域的临时决策引入。在这项工作中,我们更系统地审视了这一设计空间。我们调整了两种代表性的MAS架构以支持混合推理,并研究了单个设计选择如何沿着功耗、成本和性能的帕累托前沿移动工作点。我们的发现描绘了混合MAS设计的细致图景:虽然SLM可以有效受益于LLM的协助,但最优架构高度依赖任务,且更大的前沿计算并不总能转化为更好的性能。

English abstract

The design space of agentic AI inference spans two extremes: frontier large language models (LLMs), typically hosted in the cloud and offering strong performance across a wide range of tasks at substantially high cost, and more cost-efficient small language models (SLMs), which are amenable to on-device inference. Hybrid multi-agent systems (MASs) combining on-device and cloud models offer a promising middle ground, but they also introduce a complex and poorly understood design space in which task accuracy, monetary cost, and edge energy consumption are tightly coupled; in the absence of general design principles, hybrid components, although not the most prevalent choice, are typically introduced through ad hoc decisions tailored to specific domains. In this work, we examine this design space more systematically. We adapt two representative MAS architectures to support hybrid inference and study how individual design choices shift the operating point along the Pareto frontier of power, cost, and performance. Our findings paint a nuanced picture of hybrid MAS design: while SLMs can effectively benefit from LLM assistance, the optimal architecture is highly task-dependent, and greater frontier-level compute does not consistently translate to better performance.

2605.30100 2026-05-29 cs.LG

Chess-World-Model: A 10M-Game Benchmark for Exact State Tracking from Chess Move Sequences

Chess-World-Model: 一个用于从国际象棋走棋序列精确状态跟踪的1000万对局基准

Benjamin Walker, Terry Lyons

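AI summary: Proposes a large-scale state-tracking benchmark built from 10 million real chess games, in which models predict the board state after a sequence of legal moves to test whether they learn the transition rules; recurrent models outperform the Transformer, and a uniformly random split exposes failures hidden by scale.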

Comments
20 pages, 4 figures
AI-generated Chinese abstract

世界模型需要状态跟踪,即跨动作序列维持正确潜在状态的能力。现有基准通常是合成或基于语言的,限制了它们作为结构化状态更新测试在现实领域中的价值。我们引入了Chess-World-Model,一个基于1000万真实国际象棋对局构建的大规模状态跟踪基准,其中模型预测经过一系列合法走棋后达到的精确棋盘状态。除了一个留出的真实对局子集外,我们还包含一个来自均匀随机合法走棋的分布外子集,用于测试模型是否学习转换规则而非来自常见人类走法的捷径。先前的理论和实证工作表明,Transformer难以进行状态跟踪,而输入依赖的线性RNN需要表达性强的状态转换矩阵才能做到。因此,我们在匹配的接口和训练协议下,对因果Transformer、块对角SLiCE、Mamba-3和具有负特征值的Gated DeltaNet进行了基准测试。在300万和800万参数下,循环模型显著优于Transformer。真实对局性能在1800万参数以上饱和,但随机均匀子集在4000万参数下仍具有区分性,暴露了规模掩盖的失败。此外,消融实验表明,对于所有三种循环模型,表达性较弱的状态转换机制会降低分布外子集的性能。这些结果共同确立了Chess-World-Model作为一个实用的大规模状态跟踪基准,能够暴露模型规模原本会掩盖的失败。

English abstract

World models require state tracking, which is the ability to maintain a correct latent state across action sequences. Existing benchmarks are often synthetic or language-based, limiting their value as tests of structured state updates in realistic domains. We introduce Chess-World-Model, a large-scale state-tracking benchmark built from 10 million real chess games, where models predict the exact board state reached after a sequence of legal moves. Alongside a held-out real-game split, we include an out-of-distribution split from uniformly random legal play, which tests whether models learn the transition rules rather than shortcuts from common human positions. Prior theoretical and empirical work has shown that Transformers struggle to state-track, while input-dependent linear RNNs require expressive state-transition matrices to do so. We therefore benchmark a causal Transformer, block-diagonal SLiCE, Mamba-3, and Gated DeltaNet with negative eigenvalues under a matched interface and training protocol. The recurrent models strongly outperform the Transformer at 3 and 8 million parameters. Real-game performance saturates above 18 million parameters, but the random-uniform split remains discriminative up to 40 million, exposing failures otherwise hidden by scale. Additionally, ablations show that less expressive state-transition mechanisms reduce performance on the out-of-distribution split for all three recurrent models. Together, these results establish Chess-World-Model as a practical large-scale benchmark for state tracking that exposes failures model scale would otherwise conceal.

2605.30099 2026-05-29 cs.CV

Evaluation of Conversational Agents: Understanding Culture, Context and Environment in Emotion Detection

对话代理评估:理解情感检测中的文化、背景与环境

Martha Teiko Teye, Yaw Marfo Missah, Emmanuel Ahene, Twum Frimpong, Auxane Boch

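AI summary: For Black African society, proposes an emotion prediction model that combines speech and image data and uses a 3-layer CNN with the AFME algorithm, reaching 85%-96% accuracy and identifying sarcasm, to improve the credibility of emotion recognition in conversational AI.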

Journal ref
IEEE Access 10 (2022) 24976-24984; Erratum: IEEE Access (2022) 35900-35900
Comments
IEEE paper on arxiv
AI-generated Chinese abstract

现在,有价值决策和高度优先分析依赖于面部生物识别、社交媒体照片标记和人机交互等应用。然而,成功部署这些应用的能力取决于它们在考虑可能边缘情况下的测试用例效率。多年来,已经实施了大量通用解决方案来模仿人类情感,包括讽刺。然而,地理位置或文化差异等因素在其解决伦理问题和改进对话AI(人工智能)的相关性中尚未得到充分探索。在本文中,我们旨在解决在黑人非洲社会中对话AI使用的潜在挑战。我们开发了一个情感预测模型,准确率在85%到96%之间。我们的模型结合了语音和图像数据来检测七种基本情感,并特别关注识别讽刺。它使用了3层卷积神经网络,并结合了一种新的音频帧平均表情(AFME)算法,重点放在模型的预处理和后处理阶段。最后,我们的解决方案有助于维护对话AI中情感识别系统的可信度。

English abstract

Valuable decisions and highly prioritized analysis now depend on applications such as facial biometrics, social media photo tagging, and human-robot interaction. However, the ability to deploy such applications successfully depends on their efficiency on tested use cases, taking possible edge cases into consideration. Over the years, many generalized solutions have been implemented to mimic human emotions, including sarcasm. However, factors such as geographical location or cultural difference have not been fully explored, despite their relevance to resolving ethical issues and improving conversational AI (Artificial Intelligence). In this paper, we seek to address the potential challenges in the usage of conversational AI within Black African society. We develop an emotion prediction model with accuracies ranging between 85% and 96%. Our model combines both speech and image data to detect the seven basic emotions, with a focus on also identifying sarcasm. It uses a 3-layer convolutional neural network in addition to a new Audio-Frame Mean Expression (AFME) algorithm and focuses on model pre-processing and post-processing stages. In the end, our proposed solution contributes to maintaining the credibility of emotion recognition systems in conversational AI.

2605.30096 2026-05-29 cs.CR cs.AI

How Reliable Are AI Attackers Against a Fixed Vulnerable Target? A 400-Run Empirical Study of LLM Penetration Testing Consistency

AI攻击者对固定脆弱目标的可靠性如何?LLM渗透测试一致性的400次运行实证研究

Galip Tolga Erdem

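AI summary: Through 400 autonomous penetration testing runs (4 models, 100 each), studies the consistency of LLM attack behavior against a fixed target and finds significant differences in success rates across models along with model-distinctive failure modes.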

Comments
41 pages, 7 figures. Code and 400-run dataset: https://doi.org/10.5281/zenodo.20421592
AI-generated Chinese abstract

大型语言模型(LLM)可以自主进行多阶段网络攻击,但其在重复试验下攻击行为的一致性尚未被研究。本文首次对LLM攻击一致性进行了大规模实证测量:针对托管OWASP Juice Shop和另外两个脆弱服务的相同蜜罐,进行了400次自主渗透测试运行(4个模型,各100次),保持提示、编排器和目标不变。没有模型发出在编排器第0-1次迭代的一次性授权重新提示后仍存在的拒绝内容。Claude Sonnet 4的API调用确实遇到了上游服务不可用——在记录的Anthropic容量事件期间,1135次调用中有91次返回HTTP 529 overloaded_error,导致100次Claude运行中有39次被截断。早期草稿将这些归类为安全拒绝;在完整日志审计后,它们是上游API故障,而非模型级拒绝。尽管如此,Claude在100次运行中有61次实现了完全利用;Gemini 2.5 Flash-Lite为85次;GPT-4o-mini为56次,同时部署了98种独特的攻击策略;qwen2.5-coder:14b为25次。失败模式因模型而异:Claude因API截断(39次运行),qwen因过早完成(52次),GPT-4o-mini因迭代预算耗尽(23次)。跨服务凭据重用仅出现在保留最多对话历史的配置中(qwen 57%,GPT-4o-mini 49%,云模型在5次交换窗口内为0%)。跨模型利用率的差异具有统计学显著性(p < 0.001),效应量大;qwen与Gemini的SQL注入率差异的Cohen's h = 1.12。首次利用时间落在15-30秒的挂钟时间范围内。据我们所知,这是首个在N=100每模型下测量跨多服务目标的自主LLM攻击行为的研究。

English abstract

Large language models (LLMs) can autonomously conduct multi-stage cyber attacks, but the consistency of their offensive behavior under repeated trials remains unstudied. This work presents the first large-scale empirical measurement of LLM attack consistency: 400 autonomous penetration testing runs (4 models, 100 each) against an identical honeypot hosting OWASP Juice Shop and two additional vulnerable services, holding prompt, orchestrator, and target constant. No model emitted a content refusal that survived the orchestrator's one-shot authorization re-prompt at iterations 0-1. Claude Sonnet 4's API calls did encounter upstream service unavailability - 91 of 1,135 calls returned HTTP 529 overloaded_error during a documented Anthropic capacity event, truncating 39 of 100 Claude runs. An earlier draft catalogued these as safety refusals; on full-log audit they are upstream API failures, not model-level refusals. Despite this, Claude achieved full exploitation in 61 of 100 runs; Gemini 2.5 Flash-Lite in 85; GPT-4o-mini in 56 while deploying 98 unique attack strategies; qwen2.5-coder:14b in 25. Failure modes are model-distinctive: Claude through API truncation (39 runs), qwen through premature completion (52), GPT-4o-mini through iteration-budget exhaustion (23). Cross-service credential reuse appeared only in configurations retaining the most conversation history (qwen 57%, GPT-4o-mini 49%, cloud models 0% on 5-exchange windows). Cross-model exploitation rate differences are statistically significant (p < 0.001) with large effect sizes; qwen vs. Gemini SQL injection rates differ at Cohen's h = 1.12. First-exploit timing fell within a 15-30 second wall-clock range. To our knowledge, this is the first study to measure autonomous LLM attack behavior at N=100 per model across a multi-service target.
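
The effect-size statistic named in the abstract, Cohen's h, compares two proportions after an arcsine transform. The sketch below is a minimal implementation of the standard formula; the function name is an illustrative choice. The abstract's value of h = 1.12 refers to SQL injection rates, which are not listed there, so the usage example instead feeds in the full-exploitation counts that the abstract does report (85 of 100 for Gemini, 25 of 100 for qwen) and does not attempt to reproduce the published number.

```python
import math


def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h effect size between two proportions in [0, 1].

    h = 2 * asin(sqrt(p1)) - 2 * asin(sqrt(p2)). By Cohen's conventional
    thresholds, |h| near 0.2 is small, 0.5 medium, and 0.8 large.
    """
    return 2.0 * math.asin(math.sqrt(p1)) - 2.0 * math.asin(math.sqrt(p2))


# Full-exploitation rates taken from the abstract: Gemini 85/100, qwen 25/100.
h_gemini_vs_qwen = cohens_h(85 / 100, 25 / 100)
is_large_effect = abs(h_gemini_vs_qwen) >= 0.8
```

The arcsine transform makes a given difference in proportions count for more near 0 or 1 than near 0.5, which is why h is preferred over a raw difference when success rates vary widely across models.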

2605.30094 2026-05-29 cs.AI cs.GT

PokerSkill: LLMs Can Play Expert-Level Poker without Training or Solvers

PokerSkill: 无需训练或求解器,大语言模型可达到专家级扑克水平

Boning Li, Baoxiang Wang, Longbo Huang

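AI summary: Proposes the PokerSkill framework, which constrains LLM actions with a rule-driven skill library and reaches near-GTO performance in poker without training or solvers.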

Comments
45 pages, 3 figures
AI-generated Chinese abstract

扑克是人工智能的一个标志性挑战。主流方法依赖于基于反事实遗憾最小化的均衡求解器,需要数百万核心小时的训练。大语言模型(LLMs)拥有广泛的扑克知识,但当被要求直接游戏时,其表现远低于基于求解器的智能体。传统的基于规则的扑克智能体是可解释且无需训练的,但其策略上限仍远低于均衡玩法。我们提出了\textbf{PokerSkill},一个无需训练且无需求解器的框架,通过使用详细的基于规则的扑克技能作为LLMs的结构化动作基础接口来弥合这一差距。一个确定性上下文引擎分析当前状态,并从完全由人类扑克专家设计的分层技能库中仅检索相关片段,将LLM的选择限制在合理动作内。针对最先进的GTO基准GTOWizard,使用PokerSkill的GPT-5.5 XHigh达到$-57 \pm 21$ mbb/hand,Claude Opus 4.6达到$-80 \pm 29$ mbb/hand,Claude Opus 4.7达到$-87\pm 64$ mbb/hand,相比默认提示基线减少了49-61%的损失,并优于强机器人Slumbot。我们的关键发现是,仅靠基于规则的技能不足以构成强大策略,仅靠LLM也无法良好游戏,但它们的结合产生了一个既不需要训练也不需要求解器访问,却能媲美基于数百万核心小时计算构建的系统的智能体。据我们所知,这是首次证明LLM在复杂不完美信息游戏中无需特定游戏训练或求解器查询即可达到竞争性能。代码可在https://github.com/lbn187/PokerSkill获取。

English abstract

Poker is a landmark challenge for artificial intelligence. The dominant approach relies on equilibrium solvers built on counterfactual regret minimization, requiring millions of core-hours of training. Large Language Models (LLMs) possess extensive poker knowledge but perform far below solver-based agents when asked to play directly. Traditional rule-based poker agents are interpretable and training-free, but their strategic ceiling remains far below equilibrium play. We introduce \textbf{PokerSkill}, a training-free and solver-free framework that bridges this gap by using detailed rule-based poker skills as a structured action-grounding interface for LLMs. A deterministic context engine analyzes the current state and retrieves only the relevant fragments from a layered skill library, which is entirely designed by human poker experts, constraining the LLM's choice to reasonable actions. Against GTOWizard, a state-of-the-art GTO benchmark, GPT-5.5 XHigh with PokerSkill achieves $-57 \pm 21$ mbb/hand, Claude Opus 4.6 achieves $-80 \pm 29$ mbb/hand and Claude Opus 4.7 achieves $-87\pm 64$ mbb/hand, reducing losses by 49--61\% compared to default-prompt baselines and outperforming the strong bot Slumbot. Our key finding is that rule-based skills alone do not constitute a strong strategy, and LLMs alone cannot play well, but their combination yields an agent that requires neither training nor solver access yet competes with systems built on millions of core-hours of computation. To our knowledge, this is the first demonstration of an LLM achieving competitive performance in a complex imperfect-information game without game-specific training or solver queries. Code is available at https://github.com/lbn187/PokerSkill.

2605.30093 2026-05-29 cs.CV

Geometry Matters: 3D Foundation Priors for Learning Semantic Correspondence

几何至关重要:用于学习语义对应的3D基础先验

Artur Jesslen, Olaf Dünkel, Adam Kortylewski

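AI summary: Proposes a 3D-aware post-training framework that uses a 3D foundation model (SAM3D) to estimate object geometry and pose, generates geometry-aware feature maps that complement DINO and Stable Diffusion features, filters candidate correspondences with geodesic distances, and trains a lightweight adapter to improve semantic correspondence.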

Comments
9 pages (main paper), 21 pages (total), 4 figures
AI-generated Chinese abstract

来自自监督视觉模型和文本到图像扩散模型的基础特征已被证明对语义对应估计有效。然而,由于这些特征主要从2D图像目标学习,它们缺乏明确的3D意识,并且常常混淆对称物体侧面、重复部分以及在3D中不同的视觉相似结构。我们引入了一个3D感知的后训练框架,通过结合3D基础模型的先验,超越了现有的2D基础特征。给定一张图像,我们的方法使用SAM3D估计物体几何和姿态,并通过渲染-比较优化来细化姿态。随后,我们根据估计的物体姿态,将重建几何中的PartField描述符渲染到图像平面。由此产生的几何感知特征图补充了DINO和Stable Diffusion特征,而重建形状上的测地距离能够可靠地过滤候选对应。我们使用过滤后的匹配作为监督,在DINO和Stable Diffusion之上训练一个轻量适配器用于语义对应。与之前需要姿态标注并依赖粗略球形几何的后训练方法相比,我们的方法自动获得实例特定的3D结构,并用它来指导对应学习。实验表明,我们的方法改进了语义对应,同时减少了人工几何监督。代码和模型可在 https://github.com/GenIntel/3D-SC 获取。

English abstract

Foundation features from self-supervised vision models and text-to-image diffusion models have proven effective for semantic correspondence estimation. However, because these features are learned primarily from 2D image objectives, they lack explicit 3D awareness and often confuse symmetric object sides, repeated parts, and visually similar structures that are distinct in 3D. We introduce a 3D-aware post-training framework that goes beyond available 2D foundation features by incorporating priors from 3D foundation models. Given an image, our method uses SAM3D to estimate object geometry and pose, and refines the pose through render-and-compare optimization. Subsequently, we render PartField descriptors from the reconstructed geometry into the image plane based on the estimated object pose. The resulting geometry-aware feature maps complement DINO and Stable Diffusion features, while geodesic distances on the reconstructed shapes enable reliable filtering of candidate correspondences. We use the filtered matches as supervision to train a lightweight adapter on top of DINO and Stable Diffusion for semantic correspondence. In contrast to prior post-training approaches that require pose annotations and rely on coarse spherical geometry, our method automatically obtains instance-specific 3D structure and uses it to guide correspondence learning. Experiments show that our approach improves semantic correspondence over prior methods while reducing manual geometric supervision. Code and model can be found at https://github.com/GenIntel/3D-SC.

2605.30090 2026-05-29 cs.CL cs.CV

DirectorBench: Diagnosing Long-Form Video Generation with Personalized Multi-Agent Evaluation

DirectorBench: 通过个性化多智能体评估诊断长视频生成

Jiamin Chen, Qianben Chen, Jiawen Zhang, Yidi Wu, Yuchen Li, Xiaokun Zhang, Wangchunshu Zhou, Chen Ma

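AI summary: Proposes DirectorBench, a multi-agent diagnostic benchmark that evaluates long-form video generation across five dimensions (script, visual, audio, cross-modal, and stability) using 80 structured metadata entries, 7 user profiles, and 40 checkpoint criteria, localizing bottlenecks and user-preference dependencies.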

AI-generated Chinese abstract

长视频生成正从短的单场景合成快速转向分钟级、多镜头的创作,具有叙事结构、电影控制、音频和跨模态同步。然而,评估此类视频仍然具有挑战性,因为现有基准主要关注局部视觉质量、短期时间一致性或通用提示对齐,并且对工作流故障和用户依赖偏好的诊断有限。我们引入了DirectorBench,一个用于长视频生成的个性化多智能体诊断基准。DirectorBench根据80个结构化元数据、7个用户画像和40个检查点标准,在脚本、视觉、音频、跨模态和稳定性五个维度上评估生成的视频。DirectorBench不将质量简化为单一聚合分数,而是定位检查点级别的瓶颈并支持画像感知评估。我们评估了4个长视频生成工作流、6个基础LLM和7个用户画像。在不同工作流中,DirectorBench揭示了一个单元间瓶颈:过渡质量平均仅为0.256,最佳工作流达到0.356,而提示级别的用户需求满足度平均为0.71。我们进一步进行了14名标注者的人工评估,以验证DirectorBench与人类判断的一致性。结果表明,DirectorBench捕捉到了人类可感知的质量差异,并揭示了聚合评分所隐藏的工作流和画像依赖的故障模式。这些发现强调了长视频生成中诊断性和画像感知基准的重要性。

English abstract

Long-form video generation is rapidly moving from short, single-scene synthesis toward minute-long, multi-shot creation with narrative structure, cinematic control, audio, and cross-modal synchronization. However, evaluating such videos remains challenging, since existing benchmarks largely focus on local visual quality, short-horizon temporal consistency, or generic prompt alignment, and provide limited diagnosis of workflow failures and user-dependent preferences. We introduce DirectorBench, a personalized multi-agent diagnostic benchmark for long-form video generation. DirectorBench evaluates generated videos with respect to 80 structured metadata entries, 7 user profiles, and 40 checkpoint criteria across 5 dimensions: script, visual, audio, cross-modal, and stability. Instead of reducing quality to a single aggregate score, DirectorBench localizes checkpoint-level bottlenecks and supports profile-aware evaluation. We evaluate 4 long-form video generation workflows, 6 base LLMs, and 7 user profiles. Across workflows, DirectorBench reveals a between-unit bottleneck: transition quality averages only 0.256 and reaches 0.356 for the best workflow, while prompt-level user demand fulfillment averages 0.71. We further conduct human evaluation with 14 annotators to validate the alignment between DirectorBench and human judgment. The results show that DirectorBench captures human-perceptible quality differences and reveals workflow- and profile-dependent failure modes that are hidden by aggregate scoring. These findings highlight the importance of diagnostic and profile-aware benchmarking for long-form video generation.

2605.30089 2026-05-29 cs.LG

Distributionally Robust Set Representation Learning Under Inference-Time Element Corruption

推理时元素损坏下的分布鲁棒集合表示学习

Yankai Chen, Hanrong Zhang, Bowei He, Philip S. Yu, Xue Liu

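AI summary: To address inference-time element corruption, proposes SW-DRSO, a distributionally robust optimization framework that approximates the worst-case loss with a barycentric adversary; robustness and performance are validated on four tasks.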

详情
Comments
Accepted by ICML'26
AI中文摘要

标准集合表示学习方法通常在精心整理的数据上表现良好,但往往忽略了推理时元素损坏的挑战。这指的是部署模型遇到元素级别的退化(如异常值或缺失组件)时,可能扭曲集合表示并降低性能。我们提出了SW-DRSO,一个专门为集合设计的分布鲁棒优化框架。SW-DRSO不是仅最小化观测训练数据上的损失,而是优化一个关于一系列合理推理时变体的最坏情况期望损失的可处理替代项。我们引入了一个重心对抗,通过可微的训练时优化单纯形权重来近似对损坏集合的难以处理的搜索。在四个任务上的大量实验表明,SW-DRSO在保持高整体性能的同时,有效增强了对损坏的鲁棒性。

英文摘要

Standard Set Representation Learning methods typically excel on curated data but often overlook the challenge of inference-time element corruption. This refers to scenarios where deployed models encounter element-level degradations, such as outliers or missing components, that may distort set representation and degrade performance. We propose SW-DRSO, a distributionally robust optimization framework tailored for sets. Rather than minimizing loss solely on observed training data, SW-DRSO optimizes a tractable surrogate of the worst-case expected loss over a family of plausible inference-time variations. We introduce a barycentric adversary that approximates the intractable search over corrupted sets by a differentiable training-time optimization over simplex weights. Extensive experiments across four tasks demonstrate that SW-DRSO effectively enhances robustness against corruption while maintaining high overall performance.

2605.30087 2026-05-29 cs.AI

Selective QA over Conflicting Multi-Source Personal Memory: A Diagnostic Testbed and Method Comparison

冲突多源个人记忆上的选择性问答:诊断性测试平台与方法比较

Tiancheng Yang, Matthias Schonlau, Ilia Sucholutsky

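AI summary: For selective QA over conflicting multi-source memory, builds a diagnostic benchmark of 34,560 instances and evaluates several methods, finding that structured fusion methods beat prompt-only LLMs in both accuracy and selective accuracy.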

详情
Comments
55 pages, 5 figures
AI中文摘要

新兴的个人AI代理正朝着持久、多源记忆的方向发展。这带来了一个评估问题:系统必须决定如何使用冲突或不完整的证据;它们不能仅从一个干净的历史中检索事实。现有的基准很少能显示错误是来自提供给方法的证据还是来自方法的冲突解决步骤。我们将此研究为冲突多源个人记忆上的选择性问答:系统基于冲突的、有时不完整的来源进行回答,或者在证据不足时放弃回答。我们开发了一个基准,包含8种推理类型下的18个问题模板、480个角色、4个随机种子和34,560个实例,具有受控的来源扭曲和确定性的真实答案。我们评估了无法访问任何来源、访问单一来源、结构化融合方法以及前沿LLM的基线性能。最佳训练融合解析器达到80.3%的准确率,而最强的纯提示LLM基线达到70.0%。在允许弃权的情况下,同一解析器在78.3%的覆盖率下达到85.3%的选择性准确率,最佳LLM在95.4%的覆盖率下达到71.0%的选择性准确率。不同模型在不同推理类型上具有不同的优势。我们发布了数据、代码、缓存的模型输出以及数据生成过程以供复用。

英文摘要

Emerging personal AI agents are moving toward persistent, multi-source memory. This creates an evaluation problem: systems must decide how to use conflicting or incomplete evidence; they cannot just retrieve facts from one clean history. Existing benchmarks rarely show whether an error came from the evidence given to a method or from the method's conflict-resolution step. We study this as selective QA over conflicting multi-source personal memory: systems answer based on conflicting, sometimes incomplete sources, or abstain when evidence is insufficient. We develop a benchmark containing 18 question templates across 8 reasoning types, 480 personas, 4 random seeds, and 34,560 instances, with controlled source distortions and deterministic ground truth. We evaluate the performance of baselines without access to any source, access to a single source, structured fusion methods, and frontier LLMs. The best trained fusion resolver reaches 80.3% accuracy, while the strongest prompt-only LLM baseline reaches 70.0%. With abstention, the same resolver reaches 85.3% selective accuracy at 78.3% coverage and the best LLM reaches 71.0% selective accuracy at 95.4% coverage. Different models have different strengths across reasoning types. We release the data, code, cached model outputs, and data-generating process for reuse.
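
The two abstention metrics quoted in this abstract, coverage and selective accuracy, can be computed as in the minimal sketch below. The function name `selective_metrics`, the use of `None` to mark an abstention, and the toy answers are assumptions made for illustration; they are not taken from the paper's released code.

```python
def selective_metrics(predictions, gold):
    """Return (coverage, selective_accuracy).

    A prediction of None means the system abstained (illustrative convention).
    Coverage is the fraction of questions answered; selective accuracy is the
    accuracy measured only on the answered questions.
    """
    answered = [(p, g) for p, g in zip(predictions, gold) if p is not None]
    coverage = len(answered) / len(gold)
    if not answered:
        # Selective accuracy is undefined with zero answers; 0.0 is a convention here.
        return coverage, 0.0
    correct = sum(p == g for p, g in answered)
    return coverage, correct / len(answered)


# Toy data, invented for this example.
preds = ["Paris", None, "blue", "7", None]
gold = ["Paris", "Rome", "green", "7", "cat"]
coverage, sel_acc = selective_metrics(preds, gold)
print(coverage, sel_acc)
```

Under these definitions a system can raise selective accuracy by abstaining more often, which is why the abstract reports the two numbers as a pair.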

2605.30085 2026-05-29 cs.AI cs.CL cs.LG stat.ML

Conformal Certification of Reasoning Trace Prefixes

推理轨迹前缀的保形认证

Matt Y. Cheung, Ashok Veeraraghavan, Hanjie Chen, Guha Balakrishnan

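AI summary: Proposes CROP, which selects a threshold by conformal calibration, returns the longest prefix whose step risks stay below it, and controls the probability that the prefix contains an error, balancing retention of valid reasoning against discarding misleading suffixes.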

详情
Comments
Code available at https://github.com/matthewyccheung/crop
AI中文摘要

语言模型推理轨迹很少是全有或全无;在关键错误发生之前,它们通常包含有效的中间步骤。现有的不确定性量化方法通常认证最终答案或整个响应,未能为顺序轨迹中可安全保留的比例提供统计保证。为了解决这个问题,我们引入了CROP(保形推理输出前缀),一种与验证器无关的校准程序,用于干净前缀认证。给定任何步骤级风险代理,CROP选择一个校准阈值,并返回其步骤风险代理保持低于该阈值的最长连续前缀,将未认证的后缀路由到下游审查或修复。假设可交换性,CROP严格控制了返回前缀包含注释错误的边际概率。在六个过程标记的推理数据集上,我们证明了标准步骤级指标(如AUROC)不能完全捕捉前缀效用,建议验证器应改为通过认证前缀长度进行评估。此外,CROP平衡了过度保留和不足保留,通过保留有效的中间推理同时丢弃误导后缀,提高了下游修复的准确性。最终,这项工作将前缀认证定位为过程监督、弃权和修复之间的严格、实用的桥梁。

英文摘要

Language model reasoning traces are rarely all-or-nothing; they frequently contain valid intermediate steps before a critical error occurs. Existing uncertainty quantification methods typically certify final answers or entire responses, failing to provide statistical guarantees for the proportion of a sequential trace that can be safely retained. To address this, we introduce CROP (Conformal Reasoning Output Prefixes), a verifier-agnostic calibration procedure for clean-prefix certification. Given any step-level risk proxy, CROP selects a calibrated threshold and returns the longest contiguous prefix whose step risk proxies remain below it, routing the uncertified suffix for downstream review or repair. Assuming exchangeability, CROP rigorously controls the marginal probability that the returned prefix contains an annotated error. Across six process-labeled reasoning datasets, we demonstrate that standard step-level metrics such as AUROC do not fully capture prefix utility, suggesting verifiers should instead be evaluated by certified prefix length. Furthermore, CROP balances over- and under-withholding, improving downstream repair accuracy by preserving valid intermediate reasoning while discarding misleading suffixes. Ultimately, this work positions prefix certification as a rigorous, practical bridge between process supervision, abstention, and repair.
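
A minimal sketch of the calibrate-then-truncate idea described above, assuming a standard split-conformal quantile rule. The paper's exact calibration procedure may differ; the function names, the data layout (a list of step risks plus the index of the first annotated error, or `None`), and the toy numbers are all hypothetical.

```python
import math


def first_error_score(risks, first_error):
    """Largest step risk up to and including the first annotated error.

    A prefix cut at threshold t contains the error only if t exceeds this score.
    Traces without an annotated error get +inf (no threshold can include an error).
    """
    if first_error is None:
        return math.inf
    return max(risks[: first_error + 1])


def calibrate_threshold(calibration, alpha):
    """Pick the k-th smallest calibration score with k = floor(alpha * (n + 1))."""
    scores = sorted(first_error_score(r, e) for r, e in calibration)
    k = math.floor(alpha * (len(scores) + 1))
    if k == 0:
        return -math.inf  # too little data: certify nothing
    return scores[k - 1]


def certified_prefix_length(risks, threshold):
    """Length of the longest contiguous prefix whose step risks stay strictly below the threshold."""
    length = 0
    for risk in risks:
        if risk >= threshold:
            break
        length += 1
    return length


# Toy calibration set: (step risks, index of first annotated error or None).
calibration = [
    ([0.1, 0.8], 1), ([0.2, 0.3, 0.6], 2), ([0.1, 0.1], None),
    ([0.4, 0.7], 1), ([0.5], 0), ([0.2, 0.9], 1),
    ([0.3, 0.3], None), ([0.1, 0.65], 1), ([0.2, 0.75], 1),
]
threshold = calibrate_threshold(calibration, alpha=0.2)
kept = certified_prefix_length([0.1, 0.2, 0.9, 0.1], threshold)
print(threshold, kept)
```

With exchangeable traces, a test score falls strictly below the k-th smallest calibration score with probability at most k/(n+1), which is the usual argument for a marginal guarantee of this kind.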

2605.30083 2026-05-29 cs.CV

Future Forcing: Future-aware Training-free KV Cache Policy for Autoregressive Video Generation

未来强制:自回归视频生成中无需训练的未来感知KV缓存策略

Jiayi Luo, Qiyan Liu, Tengyang Wang, JunHao Liu, Jiayu Chen, Cong Wang, Hanxin Zhu, Chen Gao, Xiaobin Hu, Qingyun Sun, Zhibo Chen

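AI summary: Proposes Future Forcing, a training-free, future-aware KV cache policy that exploits the stationarity of the query distribution in autoregressive video models to estimate future queries, improving consistency in long video generation.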

详情
AI中文摘要

自回归(AR)视频生成已成为长时域视频合成的一种有前景的范式,其中每一帧的生成基于先前生成的令牌。为了加速推理,使用KV缓存避免跨生成步骤的冗余重计算。然而,随着生成长度的增长,KV缓存会引入越来越多的内存和误差累积,限制了AR模型扩展到更长序列的可扩展性。现有的KV缓存压缩方法通过选择性地保留被认为重要的视频令牌来缓解这一问题。然而,大多数现有方法使用从当前或历史生成上下文中提取的短时域信号来评估令牌重要性,这使得这些方法容易忽略在早期步骤中看似不重要但后来对未来帧至关重要的令牌。在这项工作中,我们识别了训练好的AR视频模型的一个重要性质:尽管RoPE调制的查询在自回归步骤中演变,但底层的规范预RoPE查询分布在视频生成过程中保持显著稳定。这种近似平稳性意味着未来查询分布可以从历史统计中估计,从而无需额外训练即可实现原则性的未来感知缓存决策。基于这一洞察,我们提出了Future Forcing,一种用于AR视频生成的无需训练的未来感知KV缓存策略。具体来说,Future Forcing首先从历史统计中构建未来查询代理,然后根据该代理下的重要性对KV缓存令牌进行评分,最后在未来查询诱导的仿射子空间内合并冗余令牌对。大量实验表明,Future Forcing在有限的KV缓存下改善了长时域一致性,在VBench-Long上针对60秒生成,与现有的AR视频KV缓存策略相比,主体一致性提升了高达1.49。

英文摘要

Autoregressive (AR) video generation has emerged as a promising paradigm for long-horizon video synthesis, where each frame is generated conditioned on previously generated tokens. To accelerate inference, the KV cache is used to avoid redundant recomputation across generation steps. Nevertheless, its growth with generation length introduces increasing memory and error accumulation, limiting the scalability of AR models to even longer sequences. Existing KV cache compression methods mitigate this issue by selectively retaining only video tokens deemed important. However, most existing methods assess token importance using short-horizon signals derived from the current or historical generation context, making these methods prone to overlooking tokens that appear unimportant at early steps but later become critical for future frames. In this work, we identify an important property of trained AR video models: although RoPE-modulated queries evolve across autoregressive steps, the underlying canonical pre-RoPE query distribution remains remarkably stable throughout the video generation process. This approximate stationarity implies that future query distributions are estimable from historical statistics, enabling principled future-aware cache decisions without any additional training. Building on this insight, we propose Future Forcing, a training-free future-aware KV cache policy for AR video generation. Specifically, Future Forcing first constructs a future query proxy from historical statistics, then scores KV cache tokens by their importance under this proxy, and finally merges redundant token pairs within the affine subspace induced by the future query. Extensive experiments show that Future Forcing improves long-horizon consistency under limited KV caches, achieving up to 1.49 improvement in subject consistency on VBench-Long for 60s generation over existing AR video KV cache policies.
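
The first two stages described above (build a future query proxy from historical statistics, then score cached tokens under it) can be sketched as follows. This is an illustration only: using the plain mean of past queries as the proxy, the dot-product score, and the names `mean_vector` and `select_tokens` are assumptions, and the RoPE handling and the token-merging stage are omitted entirely.

```python
def mean_vector(vectors):
    """Component-wise mean of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def select_tokens(keys, past_queries, budget):
    """Keep the `budget` cached keys that score highest under the query proxy.

    The proxy is the mean of historical (pre-RoPE) queries, which stands in for
    future queries if their distribution is approximately stationary.
    """
    proxy = mean_vector(past_queries)
    scores = [sum(q * k for q, k in zip(proxy, key)) for key in keys]
    ranked = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])  # indices of retained tokens, in cache order


past_queries = [[1.0, 0.0], [0.8, 0.2]]
keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 0.0]]
kept = select_tokens(keys, past_queries, budget=2)
print(kept)
```

The design point is that the ranking depends on where queries are expected to be, not only on the attention the current step happens to pay.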

2605.30080 2026-05-29 cs.CL

Adaptive Targeted Dynamic Chunking for Tokenization-Free Hierarchical Model

自适应目标动态分块用于无分词层次模型

Thang Dang, Akira Nakagawa, Kenichi Kobayashi, Koichi Shirahata

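AI summary: Proposes Adaptive Targeted Dynamic Chunking (ATDC), which uses curriculum learning to adjust the compression ratio during training and thereby improve byte compression in tokenization-free hierarchical models; on the FineWeb-Edu 100B dataset it achieves competitive bits-per-byte performance and improves training stability and downstream task results.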

详情
AI中文摘要

无分词层次模型正成为传统大型语言模型(LLM)的有前途替代方案,解决了词汇设计复杂性、词汇外(OOV)错误和语言特定约束等固有预处理问题。然而,这些字节级方法的一个重大挑战是压缩比的优化,这是决定模型通过分块处理字节数据性能的关键因素。在本文中,我们提出自适应目标动态分块(ATDC),一种新颖的字节压缩控制机制,旨在增强层次架构中动态分块的有效性。我们的方法利用课程学习在训练过程中逐步调整压缩比,从低压缩过渡到高压缩以稳定学习过程。我们提供分析,建立了目标压缩比与每最内层分块字节数(BPIC)之间的关系,从而能够在整个训练阶段跟踪分块大小的演变。在FineWeb-Edu 100B数据集上进行的评估表明,配备ATDC的层次模型在每字节比特数(BPB)性能上与在字节和词元级别上运行的常规基线相比具有竞争力。此外,与使用固定压缩比的模型相比,所提出的方法在多种下游任务中表现出更稳定的训练动态和更优的最终性能,同时保持了字节级处理的固有鲁棒性和灵活性。

英文摘要

Tokenization-free hierarchical models are emerging as a promising alternative to traditional Large Language Models (LLMs), addressing inherent preprocessing issues such as vocabulary design complexity, out-of-vocabulary (OOV) errors, and language-specific constraints. However, a significant challenge in these byte-level methods is the optimization of the compression ratio, a critical factor that dictates model performance when byte data is processed in chunks. In this paper, we propose Adaptive Targeted Dynamic Chunking (ATDC), a novel byte-compression control mechanism designed to enhance the effectiveness of dynamic chunking within hierarchical architectures. Our approach utilizes curriculum learning to progressively adjust the compression ratio during training, transitioning from low to high compression to stabilize the learning process. We provide an analysis establishing the relationship between the target compression ratio and Bytes-Per-Innermost-Chunk (BPIC), allowing for tracking of chunk-size evolution throughout the training phase. Evaluations conducted on the FineWeb-Edu 100B dataset demonstrate that hierarchical models equipped with ATDC achieve competitive Bits-Per-Byte (BPB) performance compared to conventional baselines operating at both byte and token levels. Furthermore, the proposed method exhibits more stable training dynamics and superior final performance across diverse downstream tasks compared to models using fixed compression ratios, while maintaining the inherent robustness and flexibility of byte-level processing.
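
A minimal sketch of a low-to-high compression curriculum, assuming a linear ramp. The abstract does not specify the schedule shape or its endpoints, so `target_ratio`, its default values, and the single-stage `bytes_per_chunk` measurement are hypothetical illustrations rather than the ATDC rule or the paper's BPIC analysis.

```python
def target_ratio(step, total_steps, start=2.0, end=6.0):
    """Target compression ratio (bytes per chunk) at a training step.

    Ramps linearly from `start` (low compression) to `end` (high compression);
    both endpoints are made-up defaults.
    """
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + frac * (end - start)


def bytes_per_chunk(boundary_flags):
    """Measured bytes per chunk for one chunking stage.

    `boundary_flags` has one 0/1 entry per byte, with 1 marking a chunk start.
    """
    chunks = sum(boundary_flags)
    return len(boundary_flags) / chunks


for step in (0, 50, 100):
    print(step, target_ratio(step, total_steps=100))
print(bytes_per_chunk([1, 0, 0, 1, 0, 0, 0, 1]))
```

Logging the measured bytes per chunk against the scheduled target is one way to track how chunk size evolves during training.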

2605.30076 2026-05-29 cs.CL

UniSteer: Text-Guided Flow Matching in Activation Space for Versatile LLM Steering

UniSteer: 文本引导的激活空间流匹配用于多功能LLM引导

Yingdong Shi, Ruiming Zhang, Changming Li, Zhiyu Yang, Kaixing Zhang, Jingyi Yu, Kan Ren

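AI summary: Proposes UniSteer, a text-guided activation flow matching model that learns a conditional distribution over residual-stream activations, giving a unified interface for behavioral control, truthfulness steering, fine-grained concept steering, multi-constraint instruction following, and activation-space classification.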

详情
Comments
16 pages, 4 figures
AI中文摘要

基于激活的控制通过在推理过程中干预大型语言模型(LLM)的内部表示来引导它们,并已成为控制个性、风格等行为的有效范式。然而,现有方法通常依赖于固定的引导方向或特定任务的干预模块,难以适应细粒度概念和组合约束。我们提出UniSteer,一种文本引导的激活流匹配模型,它从自然语言条件中学习残差流激活的条件分布。UniSteer不是为每个目标行为拟合单独的干预,而是在激活空间中学习一个通用的条件速度场。在推理时,UniSteer通过将源激活部分传输到潜在状态并在目标文本条件下重新生成它,然后将其注入回冻结的LLM,从而执行流反转。相同的条件模型通过选择具有最低重建能量的文本标签来支持激活空间分类。在三个目标LLM上的实验表明,UniSteer在行为控制、真实性引导、细粒度概念引导、多约束指令遵循和激活空间分类方面提供了统一的接口。

英文摘要

Activation-based control steers large language models (LLMs) by intervening on their internal representations during inference, and has emerged as an effective paradigm for controlling behaviors such as persona and style. However, existing methods often rely on fixed steering directions or task-specific intervention modules, making them difficult to adapt to fine-grained concepts and compositional constraints. We propose UniSteer, a text-guided activation flow matching model that learns a conditional distribution over residual-stream activations from natural-language conditions. Instead of fitting a separate intervention for each target behavior, UniSteer learns a universal conditional velocity field in activation space. At inference time, UniSteer performs flow inversion by partially transporting a source activation toward a latent state and regenerating it under a target textual condition before injecting it back into the frozen LLM. The same conditional model supports activation-space classification by selecting the textual label with the lowest reconstruction energy. Experiments on three target LLMs show that UniSteer provides a unified interface across behavioral control, truthfulness steering, fine-grained concept steering, multi-constraint instruction following, and activation-space classification.

2605.30075 2026-05-29 cs.LG cs.DC

Q-ANCHOR: Federated Quantum Learning with ZNE-guided Correction

Q-ANCHOR: 基于ZNE引导校正的量子联邦学习

Hoang M. Ngo, Quan Nguyen, Wanli Xing, My T. Thai

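AI summary: Quantum federated learning suffers from client drift caused by non-IID data and from hardware bias caused by quantum hardware noise. Q-ANCHOR is an aggregation architecture that anchors server updates with zero-noise extrapolation and applies stateful client correction; theory shows it mitigates both kinds of drift, and experiments show more stable training.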

详情
AI中文摘要

量子联邦学习(QFL)提供了一个有前景的框架,可以在保持数据严格本地化的同时,跨分布式客户端训练量子模型。由于其简单性和低通信开销,联邦平均(FedAvg)是QFL文献中的标准聚合选择。然而,在实际硬件上部署QFL会暴露出严重的双重漂移现象:全局模型同时受到来自非独立同分布数据的客户端漂移和来自噪声量子梯度估计的硬件偏差的干扰。在这项工作中,我们首先分析了FedAvg在这些现实条件下的收敛性,数学上证明了量子硬件偏差会产生标准平均无法纠正的持久误差下限。为了克服这一限制,我们提出了Q-ANCHOR,一种量子感知的联邦聚合架构,该架构通过零噪声外推锚定服务器更新,同时应用有状态客户端校正来抑制客户端漂移和硬件引起的偏差。我们的收敛理论证明,Q-ANCHOR成功减轻了经典客户端漂移,同时积极降低了硬件偏差下限。实验结果表明,Q-ANCHOR实现了比传统FL基线显著更稳定的训练。

英文摘要

Quantum Federated Learning (QFL) offers a promising framework to train quantum models across distributed clients while keeping data strictly local. Due to its simplicity and low communication overhead, Federated Averaging (FedAvg) is the standard aggregation choice in QFL literature. However, deploying QFL on practical hardware exposes a severe double-drift phenomenon: the global model is simultaneously derailed by client drift from non-IID data and hardware bias from noisy quantum gradient estimates. In this work, we first analyze the convergence of FedAvg under these realistic conditions, mathematically demonstrating that quantum hardware bias creates a persistent error floor that standard averaging cannot correct. To overcome this limitation, we propose Q-ANCHOR, a quantum-aware federated aggregation architecture that anchors server updates with zero-noise extrapolation while applying stateful client correction to suppress both client drift and hardware-induced bias. Our convergence theory proves that Q-ANCHOR successfully mitigates classical client drift while actively reducing the hardware-bias floor. Experimental results demonstrate that Q-ANCHOR achieves significantly more stable training than conventional FL baselines.
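
Zero-noise extrapolation (ZNE), the anchoring ingredient named above, estimates a noiseless quantity by measuring it at several amplified noise levels and extrapolating to zero. The sketch below shows only that generic step, using polynomial (Richardson-style) extrapolation on a scalar; how Q-ANCHOR applies it to server updates is not described in the abstract, and the function name and the synthetic noise model are assumptions.

```python
def zne_extrapolate(scales, values):
    """Extrapolate measurements to noise scale 0.

    Fits the unique polynomial through the points (scale, value) and evaluates
    it at 0 using Lagrange interpolation weights. Scales must be distinct.
    """
    estimate = 0.0
    for i, (s_i, v_i) in enumerate(zip(scales, values)):
        weight = 1.0
        for j, s_j in enumerate(scales):
            if j != i:
                weight *= (0.0 - s_j) / (s_i - s_j)
        estimate += weight * v_i
    return estimate


def noisy_expectation(scale):
    """Synthetic stand-in for a hardware measurement; the ideal value is 1.0."""
    return 1.0 - 0.1 * scale + 0.01 * scale ** 2


scales = [1.0, 2.0, 3.0]
values = [noisy_expectation(s) for s in scales]
print(zne_extrapolate(scales, values))
```

The extrapolation is exact here only because the synthetic noise model is itself a quadratic; on real hardware the result remains an estimate.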

2605.30073 2026-05-29 cs.CV

Native Audio-Visual Alignment for Generation

原生音视频对齐生成

Longbin Ji, Guan Wang, Xuan Wei, Chenye Yang, Xiangrui Liu, Zhenyu Zhang, Shuohuan Wang, Yu Sun, Jingzhou He

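AI summary: Proposes the NAVA framework, which uses native audio-visual alignment and context-conditioned joint denoising to achieve high-quality, synchronized, and controllable audio-video generation.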

详情
Comments
Project page: https://ernie-research.github.io/NAVA/
AI中文摘要

联合音视频生成旨在合成时间同步且语义一致的视觉-声学内容。然而,现有的开源方法主要依赖于带有后对齐的双塔设计或统一的三模态设计,将文本上下文、音频和视频混合在一个共享空间中。前者削弱了细粒度的音视频协同进化,而后者将语义条件与低级同步耦合。为了解决这些限制,我们提出了NAVA,一个用于联合音视频生成的原生音视频对齐框架。NAVA建立在上下文条件的原生音视频对齐之上:它首先在专用的交互空间中建立音视频对应关系,然后使用外部上下文来条件化联合去噪过程。具体地,NAVA通过Align-then-Fuse MMDiT架构实例化,该架构从模态感知的音视频对齐过渡到模态共享的联合去噪。此外,我们引入了上下文音色条件,将参考音色线索与相应的语音跨度关联,以实现可控的语音音色。在Verse-Bench和Seed-TTS上的实验以及用户研究表明,NAVA仅使用6.3B参数就实现了卓越的视频质量、精确的音视频同步、有竞争力的音频质量和更强的参考音色可控性。

英文摘要

Joint audio-video generation aims to synthesize temporally synchronized and semantically coherent visual-acoustic content. However, existing open-source methods mainly rely on either dual-tower designs with posterior alignment or fully unified tri-modal designs that mix textual context, audio and video in one shared space. The former weakens fine-grained audio-video co-evolution, while the latter couples semantic conditioning with low-level synchronization. To address these limitations, we propose NAVA, a Native Audio-Visual Alignment framework for joint audio-video generation. NAVA is built upon context-conditioned native audio-visual alignment: it first establishes audio-video correspondence in a dedicated interaction space, and then uses external context to condition the joint denoising process. Specifically, NAVA is instantiated with an Align-then-Fuse MMDiT architecture, which transitions from modality-aware audio-video alignment to modality-shared joint denoising. Furthermore, we introduce Timbre-in-Context Conditioning to associate reference timbre cues with corresponding speech spans to achieve controllable speech timbre. Experiments on Verse-Bench and Seed-TTS, together with a user study, demonstrate that NAVA achieves superior video quality, precise audio-visual synchronization, competitive audio quality, and stronger reference-timbre controllability using only 6.3B parameters.

2605.30070 2026-05-29 cs.LG cs.AI

A Predictive Law for On-Policy Self-Distillation From World Feedback

基于世界反馈的在线自蒸馏预测定律

Tommy He, Jerome Sieber, Matteo Saponati

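AI summary: Finds a linear relationship in on-policy self-distillation (OPSD) between the initial student-teacher performance gap and the final performance improvement, and proposes a predictive law for forecasting the effect of an OPSD configuration before training.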

详情
AI中文摘要

超越简单的标量奖励,向更丰富的世界反馈迈进,是实现更可扩展的RL后训练的自然路径。在线自蒸馏(OPSD)是一种有前景的最新方法,它使用任意反馈作为学习信号,但其与GRPO等成熟方法相比的可靠性仍不清楚。我们发现了OPSD中初始学生-教师性能差距与最终性能改进之间存在惊人的一致线性相关性。这种关系在不同上下文类型和模型家族中均成立,为预测OPSD配置的结果提供了一种强大的预测定律,而无需运行完整的训练过程。有趣的是,我们表明这种线性可预测性随模型规模成立,这为具有更强上下文学习能力的大型模型上新的经验缩放定律提供了潜在基础。本质上,我们的发现表明,OPSD性能可以在训练前进行预测和调整,为将世界反馈作为后训练流水线的一等组件提供了一种原则性方法。

英文摘要

Moving beyond simple scalar rewards toward richer world feedback is a natural path to more scalable RL post-training. On-policy self-distillation (OPSD) is a promising recent approach that uses arbitrary feedback as a learning signal, yet its reliability compared to established methods, such as GRPO, remains unclear. We identify a strikingly consistent linear correlation between the initial student-self-teacher performance gap and the final performance improvement in OPSD. This relationship holds across context types and model families, providing a powerful predictive law for anticipating the outcome of an OPSD configuration without running the full training procedure. Interestingly, we show that this linear predictability holds with model scale, suggesting a potential basis for new empirical scaling laws on larger models with stronger in-context learning capabilities. In essence, our findings show that OPSD performance can be predicted and tuned before training, offering a principled way to incorporate world feedback as a first-class component of the post-training pipeline.
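
If the relationship is linear as claimed, a law of this kind could in principle be fit with ordinary least squares on a few completed runs and then used to forecast a new configuration. The sketch below assumes exactly that; the pilot numbers are synthetic placeholders invented for the example and are not results from the paper.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Synthetic placeholders: initial student/self-teacher gap vs. final improvement.
initial_gaps = [0.10, 0.20, 0.40]
final_gains = [0.05, 0.10, 0.20]

slope, intercept = fit_line(initial_gaps, final_gains)
predicted_gain = slope * 0.30 + intercept  # forecast for an untried configuration
print(slope, intercept, predicted_gain)
```

The initial gap can be measured with inference alone, so a forecast like this costs far less than a full training run.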

2605.30065 2026-05-29 cs.CV

Boosting Zero-Shot 3D Style Transfer with 2D Pre-trained Priors

利用二维预训练先验提升零样本三维风格迁移

Xin Dong, Yunzhi Teng, Wenfeng Deng, Yansong Tang

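AI summary: Proposes the Data-Sufficient StyleGaussian model, which integrates a decoder pre-trained on large-scale 2D image datasets and combines feature Gaussian splatting with deferred stylization, achieving high-quality, multi-view-consistent rendering for zero-shot 3D style transfer under data scarcity.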

详情
Comments
Accepted by IEEE IVMSP 2026
AI中文摘要

在这项工作中,我们专注于零样本三维风格迁移,即给定任意风格图像,生成三维场景的多视图一致风格化视图。我们主要解决三维风格迁移中的数据稀缺问题,该问题源于每个模型仅在单个场景上训练,从而限制了可用内容图像的数量。这种稀缺性严重阻碍了风格化性能,因为模型优化依赖于足够数量的内容-风格图像对来提供监督信号。我们的核心思想是将在大规模二维图像数据集上预训练的解码器集成到三维风格迁移流程中,从而利用解码器从大量内容-风格图像对中学习到的先验知识。我们的方法结合了特征高斯溅射和延迟风格化,通过将视图相关操作统一为视图不变过程,在确保视图一致性的同时,利用数据充足的解码器网络实现高质量风格化。实验表明,我们的Data-Sufficient StyleGaussian(DS-StyleGaussian)模型在多个数据集上的视觉质量优于现有的零样本三维风格迁移方法。这项工作也表明,二维预训练可以作为三维任务的强增强手段,弥合二维与三维之间的数据差距。

英文摘要

In this work, we focus on zero-shot 3D style transfer that can generate multi-view consistent stylized views of the 3D scene given an arbitrary style image. We primarily tackle the issue of data scarcity in 3D style transfer, which arises when each model is trained on only a single scene, thereby limiting the number of available content images. This scarcity significantly hampers stylization performance, as model optimization relies on a sufficient number of content-style image pairs to provide supervisory signals. Our core idea is to integrate a decoder pre-trained on large-scale 2D image datasets into the 3D style transfer pipeline, thereby leveraging the prior knowledge encoded in the decoder from learning over numerous content-style image pairs. Our method combines feature Gaussian splatting and deferred stylization, enabling high-quality stylization with the data-sufficient decoder network while ensuring view consistency by unifying view-dependent operations into a view-invariant process. Experiments demonstrate that our Data-Sufficient StyleGaussian (DS-StyleGaussian) model outperforms existing zero-shot 3D style transfer methods in terms of visual quality across various datasets. This work also suggests that 2D pre-training can serve as a strong enhancement for 3D tasks, bridging the data gap between 2D and 3D.

2605.30062 2026-05-29 cs.CV

FakeVLM-R1: Internalizing Physical Laws via CoT for Synthetic Image Detection

FakeVLM-R1:通过思维链内化物理定律进行合成图像检测

Leqi Zhu, Junyan Ye, Kaiqing Lin, Zhiyuan Yan, Conghui He, Weijia Li

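AI summary: Proposes the FakeVLM-R1 framework, which combines supervised fine-tuning, Group Relative Policy Optimization, and a critical-thinking chain-of-thought mechanism; through bidirectional dialectical reasoning and authenticity counter-proofs built from physical commonsense, it achieves high-precision, logically interpretable synthetic image detection and addresses the over-rejection bias of existing methods.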

详情
AI中文摘要

生成式人工智能技术的发展已将合成图像的视觉真实性提升至前所未有的水平。尽管当前基于大型多模态模型(LMM)的可解释检测方法取得了一定进展,但它们仍然依赖于从大量伪造数据中获得的模仿学习,因此缺乏真正的因果推理能力,容易产生解释性幻觉。为克服这一瓶颈,我们提出FakeVLM-R1,旨在赋予模型在执行合成检测任务时类似人类的批判性思维能力。该框架在监督微调(SFT)基础上,将组相对策略优化(GRPO)与批判性思维链(CoT)机制相结合。在推理阶段,模型执行“双向辩证推理”过程:在提出伪造假设的同时,必须同时调用物理常识构建真实性反证。此外,我们构建了包含高质量样本的FakeClue++数据集,该数据集广泛引入了基于真实图像物理定律的注释,为模型提供了统一的真实性锚点。实验证实,FakeVLM-R1在多个基准测试中达到了评估模型中的最优性能(SOTA)。它不仅实现了高精度、逻辑可解释的检测,还解决了现有方法对真实图像的过度拒绝偏差,展现出对扰动的泛化性和鲁棒性。

英文摘要

The development of generative artificial intelligence technologies has propelled the visual realism of synthetic images to an unprecedented level. Although current interpretable detection methods based on Large Multimodal Models (LMMs) have made certain progress, they still rely on imitation learning derived from massive volumes of forged data. Consequently, they lack genuine causal reasoning capabilities and are prone to explanatory hallucinations. To overcome this bottleneck, we propose FakeVLM-R1, aiming to endow the model with human-like critical thinking capabilities when performing synthetic detection tasks. Building upon Supervised Fine-Tuning (SFT), this framework integrates Group Relative Policy Optimization (GRPO) with a Critical Thinking Chain-of-Thought (CoT) mechanism. During the inference phase, the model executes a "bidirectional dialectical reasoning" process: while proposing a forgery hypothesis, it must simultaneously invoke physical commonsense to construct an authenticity counter-proof. Furthermore, we constructed the FakeClue++ dataset with high-quality samples, which extensively introduces annotations guided by the physical laws of authentic images, providing a unified authenticity anchor for the model. Experiments confirm that FakeVLM-R1 achieves SOTA performance among the evaluated models across multiple benchmarks. It not only achieves high-precision, logically interpretable detection but also resolves the over-rejection bias of existing methods against real images, demonstrating generalization and robustness against perturbations.
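
For readers unfamiliar with GRPO, its core step scores each sampled response relative to the other responses drawn for the same prompt. The sketch below shows that generic normalization only; the reward values are placeholders, and the reward design FakeVLM-R1 actually uses is not given in the abstract.

```python
def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so a response
    is credited for beating its siblings rather than for its raw reward.
    The population standard deviation is used here; implementations vary.
    """
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = variance ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]


# Placeholder rewards for four sampled answers to one image (1 = correct verdict).
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
print(advantages)
```

Because the baseline is the group mean, no separate value network is needed to compute advantages.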

2605.30059 2026-05-29 cs.LG cond-mat.stat-mech stat.ML

Ridge Regression from Poisson Resetting: A Renewal Perspective on Spectral Regularization

泊松重置的岭回归:谱正则化的更新视角

Petar Jolakoski

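AI summary: Links stochastic resetting in non-equilibrium statistical physics to ridge regularization in statistical learning, showing that under linear gradient flow the stationary mean produced by resetting to the origin at rate r is exactly the ridge estimator, and extends the result to general renewal reset laws that generate alternative spectral filters.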

详情
AI中文摘要

我们将非平衡统计物理中的随机重置与统计学习中的岭正则化联系起来。对于线性梯度流,以速率$r$重置到原点产生稳态均值$(X^\top X+rI)^{-1}X^\top y$,这正是惩罚项$\lambda=r$的岭估计。这利用了岭回归与梯度流指数时间平均之间已知的拉普拉斯变换关系,其中指数时间现在被解释为与泊松重置相关的稳态年龄。然后我们将这一恒等式推广到一般更新重置律:指数重置时间分布是唯一的更新律,其稳态均值在每个特征方向上作为精确的滤波器恒等式对每个正曲率重现标量岭,而非指数更新律则生成替代的谱滤波器。在波动层面,我们研究了一个具有恒定扩散的独立加性奥恩斯坦-乌伦贝克扩展,解释为一种风格化的SGD近似。在这种设定下,等式仅在均值层面成立,因为重置过程由于累积的OU噪声和重置时序方差具有非零稳态协方差,而确定性岭是一个具有相同中心的固定估计量。风格化实验直接比较了确定性更新诱导的滤波器,并说明了非指数重置时间律诱导的滤波器何时可能在预测上与岭不同。关于稳态均值和诱导谱滤波器的结果是在二次目标上具有各向同性重置的连续时间梯度流下建立的;协方差和风险公式额外假设具有状态独立协方差的加性噪声。

英文摘要

We connect stochastic resetting from non-equilibrium statistical physics with ridge regularization in statistical learning. For linear gradient flow, resetting to the origin at rate $r$ produces stationary mean $(X^\top X+rI)^{-1}X^\top y$, exactly the ridge estimator with penalty $\lambda=r$. This uses the known Laplace-transform relationship between ridge regression and exponential-time averaging of gradient flow, with the exponential time now interpreted as the stationary age associated with Poisson resetting. We then extend this identity to general renewal reset laws: the exponential reset time distribution is the unique renewal law whose stationary mean reproduces scalar ridge in every eigendirection as an exact filter identity for every positive curvature, while non-exponential renewal laws generate alternative spectral filters. At the fluctuation level, we study a separate additive Ornstein-Uhlenbeck extension with constant diffusion, interpreted as a stylized SGD approximation. In this setting, the equality holds only at the level of the mean, since the reset process has a nonzero stationary covariance from accumulated OU noise and reset-timing variance, whereas deterministic ridge is a fixed estimator with the same center. Stylized experiments compare the deterministic renewal-induced filters directly and illustrate when filters induced by non-exponential reset-time laws can differ predictively from ridge. The results for the stationary mean and the induced spectral filters are established for continuous-time gradient flow with isotropic resetting on quadratic objectives; the covariance and risk formulas additionally assume additive noise with state-independent covariance.
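
The mean-level identity can be checked numerically one eigendirection at a time. Along a direction with curvature $s>0$, gradient flow started at the origin has coefficient $(1-e^{-st})/s$ at time $t$; averaging this over an exponentially distributed age with rate $r$ should give the ridge filter $1/(s+r)$. The sketch below does that average by quadrature; the function names and integration settings are choices made for this illustration, not the author's code.

```python
import math


def gradient_flow_coeff(s, t):
    """Gradient-flow filter at time t along an eigendirection with curvature s, starting from 0."""
    return (1.0 - math.exp(-s * t)) / s


def reset_filter(s, r, n=200_000):
    """Average of the gradient-flow filter over an Exp(r) age (midpoint rule).

    The integral is truncated at 60 / r, where the weight exp(-r t) is negligible.
    """
    horizon = 60.0 / r
    h = horizon / n
    total = 0.0
    for i in range(n):
        t = (i + 0.5) * h
        total += r * math.exp(-r * t) * gradient_flow_coeff(s, t)
    return total * h


def ridge_filter(s, lam):
    """Ridge spectral filter for eigenvalue s and penalty lam."""
    return 1.0 / (s + lam)


for s, r in [(2.0, 0.5), (0.3, 1.0)]:
    print(s, r, reset_filter(s, r), ridge_filter(s, r))
```

Swapping the exponential weight for another age density changes the first function into a different spectral filter, which is the generalization the abstract describes.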

2605.30058 2026-05-29 cs.CL

HEART-Bench: Do LLM Agents Exhibit Human-like Psychology?

HEART-Bench: 大语言模型智能体是否表现出类似人类的心理学?

Weihan Peng, Chenxu Zhang, Qianao Wang, Yuling Shi, Heng Lian, Qihong Mao, Jiahao Pang, Chunliang Feng, Bowen Li, Xiaodong Gu

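AI summary: Proposes the HEART-Bench benchmark, which builds virtual characters grounded in Big Five personality traits and autobiographical memories and evaluates, under the DIAMONDS situational framework, whether LLM agents exhibit consistent human psychological traits.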

详情
Comments
GitHub: https://github.com/peng-weihan/HEART-BENCH
AI中文摘要

尽管LLM智能体在规划、推理和行动等任务导向能力上表现出色,但很少有研究将它们视为完整的人类个性,其中情感维度同样重要。在本文中,我们引入了一个新颖的基准,系统评估LLM智能体是否能模拟连贯、类似人类的心理。具体来说,我们的基准构建了11个基于正交大五人格特质的多样化人类角色,每个角色都深入整合了1000个结构化的自传体式情景记忆,这些记忆分布在基于理论的发展生命阶段。为了严格评估LLM的心理表现,我们设计了一套由64个决策场景组成的精选套件,这些场景基于DIAMONDS分类法,这是一个心理框架,从八个维度描述情境:责任、智力、逆境、求偶、积极性、消极性、欺骗和社交性。通过将智能体置于不同场景中,基准评估它们是否能整合其固有的人格特质和自传体记忆,做出与其特定心理特征一致的行为决策。经过系统的人工验证和过滤,我们得到了一个包含673道多项选择题(MCQ)的基准。我们相信,这个基准为研究基于LLM的智能体中的人类情感、人格一致性和价值一致的行为决策提供了一个原则性且可扩展的测试平台。

英文摘要

While LLM agents have demonstrated remarkable task-oriented abilities such as planning, reasoning, and action, few works have treated them as complete human personalities where emotional dimensions hold equal importance. In this paper, we introduce a novel benchmark to systematically assess whether LLM agents can simulate coherent, human-like psychology. Specifically, our benchmark constructs 11 diverse human characters grounded in orthogonal Big Five personality traits, with each profile deeply integrated with 1,000 structured autobiographical-style episodic memories distributed across theory-grounded developmental life stages. To rigorously evaluate the psychological manifestations of LLMs, we designed a curated suite of 64 decision-making scenarios, guided by the DIAMONDS taxonomy, a psychological framework that characterizes situations along eight dimensions: Duty, Intellect, Adversity, Mating, pOsitivity, Negativity, Deception, and Sociality. By subjecting agents to varying scenarios, the benchmark evaluates whether they can consolidate their innate personality traits and autobiographical memories to make behavioral decisions that are consistent with their specific psychological profiles. After systematic human validation and filtering, we obtained a benchmark consisting of 673 multiple-choice questions (MCQs). We believe this benchmark provides a principled and scalable testbed for studying human-like emotions, personality consistency, and value-consistent behavioural decision-making in LLM-based agents.

2605.30056 2026-05-29 cs.RO cs.LG

Sample-Efficient Diffusion-based Reinforcement Learning with Critic Guidance

基于评论家引导的样本高效扩散强化学习

Shutong Ding, Zejia Zhong, Zhongyi Wang, Ke Hu, Bikang Pan, Jingya Wang, Ye Shi

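AI summary: To address the imbalance between exploration and exploitation when diffusion policies are used in reinforcement learning, proposes Critic-Guided diffusion Policy Optimization (CGPO), which balances the two with a training-free guidance technique and achieves state-of-the-art performance on MuJoCo tasks along with strong results on Franka robot grasping.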

详情
Comments
accepted by ICML2026
AI中文摘要

近年来,强化学习(RL)通过利用扩散策略的多模态性和探索能力取得了巨大成功。在这些方法中,一个代表性分支专注于基于采样的策略优化。这种设计使得扩散模型在训练初期具有更好的探索能力,但在Q值信息的利用上不足,导致策略收敛缓慢。另一个分支关注基于梯度的策略优化,该方法充分利用Q函数的梯度,但容易退化为低多样性的单峰策略。为了解决这个问题,我们提出了CGPO(评论家引导的扩散策略优化),通过将无训练引导技术集成到扩散策略的去噪过程中,有效平衡探索与利用。具体而言,CGPO将动作生成引导至评论家网络定义的高价值区域,并将引导后的动作作为回归目标。通过这种方式,CGPO减少了获取高质量动作所需的时间,并通过更好的探索-利用权衡提高了最终性能。我们在5个MuJoCo运动任务上验证了CGPO的有效性,与现有的基于扩散的RL方法相比,CGPO达到了最先进的性能。值得注意的是,CGPO是首次成功将扩散策略应用于真实世界RL的方法,在Franka机器人臂抓取任务上表现出优越性能。我们的官方页面发布在https://dingsht.tech/cgpo-webpage。

英文摘要

Recent advances in reinforcement learning (RL) have achieved great success by leveraging the multimodality and exploration capability of diffusion policies. Among these approaches, one representative branch focuses on sampling-based policy optimization. This design gives the diffusion model better exploration capability, particularly at the beginning of training, but suffers from low exploitation of Q-value information, resulting in slow policy convergence. Another branch focuses on gradient-based policy optimization, which fully exploits the gradient of the Q function yet tends to collapse into a unimodal policy with low diversity. To address this issue, we propose CGPO (Critic-Guided diffusion Policy Optimization), which effectively balances exploration and exploitation by integrating a training-free guidance technique into the denoising process of the diffusion policy. Concretely, CGPO steers action generation toward high-value regions defined by the critic network and uses the guided actions as regression targets. In this manner, CGPO reduces the time required to obtain high-quality actions and improves final performance through a better exploration-exploitation tradeoff. We validate the effectiveness of CGPO on 5 MuJoCo locomotion tasks, where it achieves state-of-the-art performance compared with existing diffusion-based RL methods. Notably, CGPO is the first method to successfully incorporate diffusion policy into real-world RL, showing superior performance on Franka robot arm grasping tasks. Our official page is released at https://dingsht.tech/cgpo-webpage.

2605.30054 2026-05-29 cs.SE cs.AI

Projectional Decoding: Towards Semantic-Aware LLM Generation

投影式解码:迈向语义感知的LLM生成

Boqi Chen, José Antonio Hernández López, Aren A. Babikian

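AI summary: Proposes projectional decoding, a framework that maintains a partial graph model as the primary artifact representation, enabling incremental semantic validation and error detection to improve the semantic validity of LLM-generated artifacts.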

详情
Comments
5 pages, 3 figures. Accepted at FSE 2026 IVR track
AI中文摘要

大型语言模型(LLM)越来越多地被用于跨许多软件工程(SE)任务生成软件工件,然而确保这些工件的语义有效性仍然是一个基本挑战。现有的约束解码技术可以强制执行语法正确性,并且在某些情况下强制执行特定的语义规则,但缺乏一种通用表示,能够将LLM生成的文本与SE中语义验证所需的推理联系起来。在本文中,我们提出了投影式解码,一种新颖的概念框架,通过在整个生成过程中与文本一起维护部分图模型作为主要工件表示,直接将领域语义集成到生成过程中。这种抽象表示通过显式捕获不确定性并原生支持错误检测,实现增量语义验证,同时引导生成朝向具有可证明保证的语义有效输出。我们在一个程序生成任务上展示了初步结果,证明了这种方法在提高LLM生成工件的语义有效性方面的潜力。我们还讨论了投影式解码如何能够在各种SE活动中实现与LLM的可验证自动化。

英文摘要

Large language models (LLMs) are increasingly used to generate software artifacts across many software engineering (SE) tasks, yet ensuring the semantic validity of these artifacts remains a fundamental challenge. Existing constrained decoding techniques can enforce syntactic correctness and, in some cases, specific semantic rules, but lack a general representation that bridges LLM-generated text with the reasoning required for semantic validation in SE. In this paper, we propose projectional decoding, a novel conceptual framework that integrates domain semantics directly into the generation process by maintaining, alongside text, a partial graph model as the primary artifact representation throughout generation. This abstract representation enables incremental semantic validation by explicitly capturing uncertainty and natively supporting error detection, while guiding generation toward semantically valid outputs with provable guarantees. We present preliminary results on a program generation task which demonstrate the potential of this approach to improve the semantic validity of LLM-generated artifacts. We also discuss how projectional decoding can enable verifiable automation with LLMs across various SE activities.

2605.30052 2026-05-29 cs.SE cs.AI cs.CL

REPOT: Recoverable Program-of-Thought via Checkpoint Repair

REPOT:通过检查点修复实现可恢复的思维程序

Parsa Mazaheri

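AI summary: Proposes RePoT, which combines a deterministic verified replay with an LLM call that resumes from the verified prefix, addressing the problem that a single invalid action invalidates a Program-of-Thought trajectory; it improves success rates across multiple models and benchmarks.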

Details
AI Chinese abstract

单次 Program-of-Thought (PoT) 生成一个打印基本动作计划的 Python 程序;单个无效动作会无声地使轨迹失效。我们引入 RePoT (可恢复 PoT):一种确定性验证重放,它将计划遍历环境直到第一个无效转换,然后通过一次 LLM 调用从验证前缀恢复。在 PoT 失败的约 14% 的问题上,RePoT 最多增加一次 LLM 调用。在 PuzzleZoo-775 上,RePoT 在四种闭模型配置上比 PoT 提高 +3 到 +11 个百分点,在 gpt-5.4-mini-medium 上达到 96.9% 对比 86.3% 的峰值;与预算匹配的 PoT-retry 基线相比,RePoT 在 Gemini 上明显获胜(+3.8pp,95% CI [+2.2,+5.4]),在 GPT-medium 和 Claude 上处于采样噪声范围内,在 GPT-mini 上失败——这是一种能力扩展模式,我们开始通过自适应 RePoT 解决,这是一种基于规则的调度器,根据验证前缀长度在后缀修复和全新 PoT 重试之间路由(初步)。我们在 PlanBench Blocksworld 上复现(+1.1 到 +11.4pp),在四个开放权重模型上(四个中的三个 +3.3 到 +20.0pp)。在 Derail-550(我们的受控恢复基准)上,每个能够访问检查点信息的条件在 GPT-medium 上达到 >=30%,在 Gemini 上达到 >=70%,而仅错误反馈条件 <=3.1%——表明检查点信息(而非特定的验证前缀尾部)是承载恢复的信号。

English abstract

One-shot Program-of-Thought (PoT) emits a Python program that prints a primitive-action plan; a single invalid action silently invalidates the trajectory. We introduce RePoT (Recoverable PoT): a deterministic verified replay that walks the plan through the environment to its first invalid transition, then one LLM call that resumes from the verified prefix. RePoT costs at most one extra LLM call on the ~14% of problems where PoT fails. RePoT beats PoT by +3 to +11pp across four closed-model configurations on PuzzleZoo-775 and peaks at 96.9% vs 86.3% on gpt-5.4-mini-medium; against the matched-budget PoT-retry baseline, RePoT wins decisively on Gemini (+3.8pp, 95% CI [+2.2,+5.4]), is within sampling noise on GPT-medium and Claude, and loses on GPT-mini -- a capability-scaling pattern we begin to address with Adaptive RePoT, a rule-based dispatcher that routes between suffix repair and a fresh PoT retry based on verified-prefix length (preliminary). We replicate on PlanBench Blocksworld (+1.1 to +11.4pp) and on four open-weights models (+3.3 to +20.0pp on three of four). On Derail-550, our controlled recovery benchmark, every condition with access to checkpoint information clears >=30% on GPT-medium and >=70% on Gemini, vs <=3.1% for error-only feedback -- showing that checkpoint information, not the specific verified-prefix tail, is the load-bearing recovery signal.
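
The replay-then-repair loop described in the abstract above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' code: the 3x3 grid environment, the `step`, `verified_replay`, and `repot` functions, and the `stub_repair` function (a deterministic stand-in for the single LLM repair call) are all hypothetical names introduced for illustration. The sketch covers only the case where the plan contains an invalid transition.

```python
from typing import Callable, List, Optional, Tuple

State = Tuple[int, int]  # (row, col) on a toy grid; hypothetical environment
MOVES = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}


def step(state: State, action: str, size: int = 3) -> Optional[State]:
    """Toy environment: return the next state, or None if the action is invalid."""
    if action not in MOVES:
        return None
    row, col = state[0] + MOVES[action][0], state[1] + MOVES[action][1]
    if 0 <= row < size and 0 <= col < size:
        return (row, col)
    return None


def verified_replay(start: State, plan: List[str]) -> Tuple[List[str], State, Optional[int]]:
    """Walk the plan until the first invalid transition.

    Returns the verified prefix, the checkpoint state reached by that prefix,
    and the index of the first invalid action (None if the whole plan is valid).
    """
    state = start
    for index, action in enumerate(plan):
        next_state = step(state, action)
        if next_state is None:
            return plan[:index], state, index
        state = next_state
    return list(plan), state, None


def stub_repair(checkpoint: State, goal: State) -> List[str]:
    """Stand-in for the one LLM call: emit a suffix from the checkpoint to the goal."""
    vertical = ["D"] * max(goal[0] - checkpoint[0], 0) + ["U"] * max(checkpoint[0] - goal[0], 0)
    horizontal = ["R"] * max(goal[1] - checkpoint[1], 0) + ["L"] * max(checkpoint[1] - goal[1], 0)
    return vertical + horizontal


def repot(start: State, goal: State, plan: List[str],
          repair: Callable[[State, State], List[str]]) -> List[str]:
    """Keep the verified prefix and make at most one repair call for the suffix."""
    prefix, checkpoint, first_invalid = verified_replay(start, plan)
    if first_invalid is None:
        return prefix  # no invalid transition found; no extra call is made
    return prefix + repair(checkpoint, goal)


broken_plan = ["R", "R", "R", "D", "D"]  # the third "R" walks off the grid
print(repot((0, 0), (2, 2), broken_plan, stub_repair))
```

The design point this illustrates is that the repair call receives the checkpoint state reached by the verified prefix rather than only an error message, which is the kind of checkpoint information the abstract identifies as the load-bearing recovery signal.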