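arXivDaily: a daily academic digest synced with the full arXiv feed, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems science, and related fields.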

2605.30144 2026-05-29 cs.AI cs.MA

AgentSchool: An LLM-Powered Multi-Agent Simulation for Education


Yulei Ye, Wenhao Li, Zhong Wen, Yunshu Huang, Yichen Hu, Zifan Wei, Yige Wang, Xinyu Xie, Haoxuan Yang, Yanjun Huang, Ruijia Li, Hong Qian, Yu Song, Bo Jiang, Bingdong Li, Lijun Li, Bo Zhang, Pinlong Cai, Xingcheng Xu, Shuangye Chen, Xia Hu, Liang He, Aimin Zhou, Jingjing Qu, Jing Shao, Xiangfeng Wang

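Affiliations: Shanghai Institute of AI for Education; School of Computer Science and Technology; East China Normal University; School of Design; Faculty of Education; Shanghai Artificial Intelligence Laboratory

AI summary: Proposes AgentSchool, an LLM-driven multi-agent simulator that models learning through growable student agents (with knowledge graphs, thinking workflows, and misconceptions) and adaptive teacher agents (grounded in the Zone of Proximal Development). It supports multi-scale simulation, and experiments show that it produces differentiated mastery trajectories and behavior patterns consistent with classroom social theories.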

Comments 39 pages, 10 figures

Abstract

Despite the rapid deployment of LLMs into classrooms, validating educational AI remains uniquely intractable: interventions act on developing learners whose cognitive and social trajectories are irreversibly shaped, while real-world trials are slow, ethically constrained, and institutionally locked. LLM-based educational simulators have emerged as a potential remedy, but many still collapse learning into persona-conditioned role-play and, when optimized only to reproduce existing classrooms, can structurally penalize the institutional novelty that pedagogical reform requires. In this work, we introduce AgentSchool, an LLM-driven multi-agent simulator that models learning as state transition rather than prompted behavior. AgentSchool couples cognitively growable student agents -- equipped with weighted subject knowledge graphs, thinking-workflow pools, and explicit misconceptions -- with adaptive teacher agents that plan, scaffold, and reflect along the Zone of Proximal Development, embedded in a configurable scenery generator that situates instruction within both formal and informal learning fields, and a multi-scale simulator that decouples interaction scale, temporal granularity, and simulation duration. Experiments show that structured student agents produce more differentiated mastery and misconception traces than a baseline simulator, while teacher-agent comparisons show backbone-dependent patterns consistent with ZPD-informed adaptation. Further, AgentSchool generates plausible traces of peripheral participation, clique formation, aggressor-induced cohesion, and opinion-leader emergence consistent with classroom social theories. Beyond its role as an educational research instrument, AgentSchool frames education as a socially meaningful testbed for long-horizon memory, multi-agent coordination, and future institutional reasoning under organizational pressure.


2605.30140 2026-05-29 cs.CV

AnomalyAgent: Training-Free Agentic Models for Zero-/Few-Shot Anomaly Detection


Yi Zhang, Jiawen Zhu, Lele Fu, Guansong Pang

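Affiliations: Singapore Management University; Sun Yat-sen University

AI summary: Proposes AnomalyAgent, a training-free agentic framework built on multimodal large language models that performs zero-/few-shot anomaly detection through a tailored toolset and a memory module, outperforming existing methods on complex cases such as logical and contextual anomalies.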

Abstract

Benefiting from the generalizability of vision-language models (VLMs) such as CLIP, many zero-/few-shot anomaly detection (AD) approaches have achieved impressive detection performance across various datasets. Nevertheless, they require substantial training on large auxiliary datasets to adapt VLMs to anomaly detection, and their inference largely relies on visual-text embedding similarity-based anomaly scores, lacking reasoning abilities to detect complex anomalies that require in-depth contextual understanding. To address this limitation, we propose AnomalyAgent, a novel training-free, agentic framework that leverages the advanced reasoning and generalization capabilities of multimodal large language models (MLLMs) for anomaly detection. The key ingredients include 1) a comprehensive anomaly-centric toolset that enables adaptive MLLM-driven, agentic anomaly reasoning in zero-shot settings, and 2) a customized memory module that grounds anomaly reasoning with few-shot, in-context reference examples. We extend evaluation beyond the detection of simple anomalies (e.g., surface defects like cracks and dents and clear lesions) in widely used benchmarks to more diverse types of anomalies such as logical/contextual anomalies in logistics and manufacturing settings. Extensive experimental results demonstrate that our AnomalyAgent achieves substantially better performance compared to training-free VLM-based AD and generic agentic methods, highlighting its superior generalization capability in both zero-shot and few-shot anomaly detection settings. The code implementation can be found at this address.


2605.30136 2026-05-29 cs.AI

Enhancing Multi-Agent Communication through Attention Steering with Context Relevance


Hongxiang Zhang, Yuan Tian, Tianyi Zhang

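Affiliations: Purdue University

AI summary: To address the dilution of information caused by long conversation histories in LLM-based multi-agent systems, proposes Agent-Radar, a training-free context-management method that dynamically steers attention with a spatiotemporal decay mechanism, yielding gains of up to 7.64 absolute points across five benchmarks.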

2605.30135 2026-05-29 cs.LG cs.AI

DAMEL: Dual-Axis Multi-Expert Learning for Class-Imbalanced Learning


Hyuck Lee, Taemin Park, Heeyoung Kim

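Affiliations: AI Research, Krafton; Department of Industrial and Systems Engineering, Korea Advanced Institute of Science and Technology (KAIST)

AI summary: Proposes DAMEL, a dual-axis multi-expert learning algorithm that ensembles multiple experts along the representation and time axes to reduce both prediction bias and variance, addressing class-imbalanced learning.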

Abstract

Various algorithms have been proposed to address the challenges posed by class-imbalanced learning from real-world data with long-tailed distributions. While these algorithms reduce prediction bias through rebalancing techniques, they often introduce increased prediction variance as a trade-off. Several multi-expert learning algorithms aim to address this variance but involve complex procedures. We propose a new multi-expert learning algorithm, called the dual-axis multi-expert learning (DAMEL), which reduces both bias and variance of predictions by using multiple experts along both representation and time axes. Along the representation axis, DAMEL concatenates the representations of multiple experts and trains an auxiliary balanced classifier simultaneously with the concatenated representations. Along the time axis, DAMEL aggregates network weights across training epochs, employing these aggregated weights during testing. Experimental results demonstrate that DAMEL reduces both bias and variance of predictions, highlighting its effectiveness in class-imbalanced learning.
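
The two axes described in the abstract can be illustrated with a minimal sketch. It assumes uniform averaging of per-epoch weight snapshots (the abstract does not say whether the aggregation is uniform, exponential, or something else) and represents weights as plain dictionaries of float lists. The names `average_weights` and `concat_representations` are hypothetical and not taken from the paper.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]


def average_weights(snapshots: List[Weights]) -> Weights:
    """Time axis (assumed uniform): average weight snapshots saved at different epochs."""
    if not snapshots:
        raise ValueError("need at least one snapshot")
    averaged: Weights = {}
    for name in snapshots[0]:
        # Walk the same parameter across all snapshots, element by element.
        columns = zip(*(snap[name] for snap in snapshots))
        averaged[name] = [sum(col) / len(snapshots) for col in columns]
    return averaged


def concat_representations(expert_features: List[List[float]]) -> List[float]:
    """Representation axis: concatenate the feature vectors of several experts."""
    return [value for features in expert_features for value in features]
```

In this sketch the averaged weights would be the ones loaded at test time, and the concatenated vector would be the input of the auxiliary balanced classifier.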


2605.30133 2026-05-29 cs.CL

CorPipe at CRAC 2026: Empty Nodes and Cross-Lingual Transfer in Multilingual Coreference Resolution


Milan Straka

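Affiliations: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics

AI summary: Presents CorPipe 26, a system that jointly predicts empty nodes, mentions, and coreference links with a single model. It outperforms all other systems in the CRAC 2026 shared task on multilingual coreference resolution, leading by 2.8 and 9.5 percentage points in the LLM track and the unconstrained track, respectively.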

Comments Accepted to CODI-CRAC 2026

2605.30132 2026-05-29 cs.LG stat.ML

Learning to Extrapolate to New Tasks: A Relational Approach to Task Extrapolation


Adam Ousherovitch, Yixin Wang

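Affiliations: Department of Statistics, University of Michigan, Ann Arbor

AI summary: Proposes the Relational Task Extrapolator (RTE), which decomposes each target task into an anchor task and a transformation and learns a relational operator, enabling systematic extrapolation to unseen tasks; it substantially outperforms existing methods on function prediction and sequence prediction.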

Comments ICML 2026

Abstract

Modern learning systems excel at interpolation but struggle to generalize to unseen tasks outside the training distribution's support. This failure occurs even in simple settings, such as handling task parameters beyond the training range, and persists despite advances in foundation models. To this end, we develop the Relational Task Extrapolator (RTE), an algorithm designed to enable systematic extrapolation to novel tasks. The key observation is that extrapolation is inherently relational: extrapolating to unseen tasks requires learning how tasks transform into one another. If a model learns the transformation between tasks A and B during training, it can apply that same transformation to relate known tasks to unseen ones at test time. RTE operationalizes this idea by decomposing each target task into a known anchor task and a transformation linking the anchor and target. It then learns a relational operator, mapping an anchor-transformation pair to predictions for the target task. We instantiate RTE across multiple task extrapolation regimes in function prediction, e.g. where target tasks use out-of-range parameters (parameter extrapolation), have greater compositional depth (length extrapolation), and/or recombine function primitives in unseen ways (compositional extrapolation). We further extend RTE to sequence prediction, integrating it into fine-tuning algorithms for foundation models. Across empirical studies, we find that RTE substantially outperforms existing approaches on extrapolation to novel, unseen tasks.


2605.30131 2026-05-29 cs.CL cs.CV

CCS: Clinical Consensus Selection for Radiology Report Generation


Xi Zhang, Yingshu Li, Zaiqiao Meng, Jake Lever, Edmond S. L. Ho

Affiliations: School of Computing Science, University of Glasgow; School of Electrical and Computer Engineering, University of Sydney; Language Technology Lab, University of Cambridge

AI summary: Proposes CCS, a framework that samples multiple candidate reports and selects the one with the highest clinical consensus, improving the quality of radiology report generation at inference time.

Comments 17 pages, 6 figures

Abstract

Radiology report generation (RRG) is commonly formulated as a single-path generation task, where a multimodal large language model (MLLM) produces one decoded report as the final output. While recent progress has largely been driven by scaling training data, model capacity, and retrieval mechanisms, improving report quality at inference time remains underexplored. In this work, we observe that fixed radiology MLLMs often generate clinically stronger reports elsewhere in their candidate pool than the one selected by default decoding, suggesting that inference-time decision making remains an overlooked bottleneck. To address this, we propose Clinical Consensus Selection (CCS), a decoder-agnostic inference-time selection framework that samples multiple candidate reports and selects the one with the highest clinical consensus across the rollout pool. CCS unifies text-based utilities with a radiology-adapted utility computed by an image--report-trained multimodal embedder, which measures candidate agreement beyond surface-level textual similarity. Across three datasets and multiple radiology MLLMs, CCS consistently improves inference-time performance over single-path decoding and generic Best-of-N baselines, with particularly clear gains on clinical metrics. Further analysis shows that image-grounded utility forms a selection axis distinct from textual consensus and that substantial headroom remains for improving RRG at inference time.
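
The sample-then-select idea can be sketched as follows. This is only an illustration: token-level Jaccard overlap stands in for the paper's text-based and image-grounded utilities, which the abstract does not specify, and the names `jaccard` and `select_by_consensus` are hypothetical.

```python
from typing import Callable, List


def jaccard(a: str, b: str) -> float:
    """Token-set overlap, used here as a stand-in utility (an assumption)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def select_by_consensus(
    candidates: List[str],
    utility: Callable[[str, str], float] = jaccard,
) -> str:
    """Return the candidate with the highest mean agreement with the other candidates."""
    if not candidates:
        raise ValueError("need at least one candidate")

    def consensus(i: int) -> float:
        scores = [utility(candidates[i], c) for j, c in enumerate(candidates) if j != i]
        return sum(scores) / len(scores) if scores else 0.0

    best_index = max(range(len(candidates)), key=consensus)
    return candidates[best_index]
```

Passing a different `utility` callable, for example one backed by a multimodal embedder, would change the selection axis without touching the selection loop.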


2605.30126 2026-05-29 cs.CV cs.AI cs.CL cs.LG

PARCEL: Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding


Selim Kuzucu, Alessio Tonioni, Vasile Lup, Bernt Schiele, Federico Tombari, Muhammad Ferjad Naeem

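Affiliations: Max Planck Institute for Informatics; Google

AI summary: Proposes PARCEL, a visual tokenization architecture that uses pool anchoring and conditioned elastic query resampling to resolve the conflict between spatial and query representations in visual-token compression, improving the performance-efficiency Pareto frontier across 27 benchmarks.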

Comments 33 pages, 4 figures

Abstract

Large Vision-Language Models (LVLMs) map visual inputs into dense token sequences, imposing a quadratic computational bottleneck for inference. Elastic visual-token compression addresses this by training a single model that can run at multiple visual-token budgets. However, existing approaches struggle under aggressive compression. Spatial-only compression, as in nested pooling, behaves as an imperfect low-pass filter and induces spectral aliasing that obscures fine-grained detail. Query-only compression, as in nested query resampling, replaces explicit grid-aligned tokens with non-local summaries and substantially degrades spatial grounding. To resolve this representational conflict, we introduce PARCEL (Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding), a visual tokenization architecture that dynamically partitions the labor of feature extraction. PARCEL establishes spatial pool tokens as low-frequency layout anchors and conditions elastic query tokens on these anchors through Pool-Conditioned Query Resampling. This encourages query tokens to focus on complementary visual features rather than redundant spatial mapping. Extensive evaluations across 27 benchmarks show that PARCEL improves the performance-efficiency Pareto frontier, consistently outperforming existing matryoshka baselines across visual-token budgets while preserving the "train once, deploy anywhere" paradigm.


2605.30117 2026-05-29 cs.AI

VLA-Trace: Diagnosing Vision-Language-Action Models through Representation and Behavior Tracing


Haoyuan Shi, Xiancong Ren, Yingji Zhang, Qinfan Zhang, Jiayu Hu, Haozhe Shan, Han Dong, Jinpeng Lu, Yinda Chen, Yi Zhang, Yong Dai, Xiaozhu Ju

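Affiliations: University of Science and Technology of China; University of Manchester; Beihang University; Fudan University; University of New South Wales

AI summary: Proposes VLA-Trace, a diagnostic framework that analyzes representation evolution, causal control attribution, and behavioral manifestation to reveal how VLA models turn multimodal knowledge into embodied control, finding differences and limitations across models in finetuning adaptation, multimodal routing, and semantic following.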

Abstract

Understanding how Vision-Language-Action (VLA) models transform multimodal knowledge into embodied control remains an open challenge. We present VLA-Trace, a progressive diagnostic framework that analyzes VLA models through a unified evidence chain from representation dynamics to causal control attribution and behavioral manifestation. It specifically combines cross-modal and checkpoint-drift centered kernel alignment (CKA) to trace representation evolution, attention knockout interventions to identify modality-specific control pathways, and rollout-level behavioral probes to examine grounding, shortcut dependence, and semantic following. Experiments on $π_{0.5}$ and OpenVLA reveal three key findings. First, the two models exhibit distinct modality-specific adaptation dynamics during VLA finetuning. Second, they rely on different multimodal routing strategies and layer-wise dependencies during action decoding. Third, although VLA policies excel at visually grounded trajectory generation, they remain limited in fine-grained semantic following. These findings highlight future directions for representation-preserving adaptation, causal VLA circuits, and compositional semantic control.
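
For readers unfamiliar with centered kernel alignment, the sketch below computes the standard linear-kernel variant between two activation matrices with one row per example. Whether VLA-Trace uses a linear or another kernel is not stated in the abstract, so the linear kernel is an assumption, and the function names are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def center_columns(m: Matrix) -> Matrix:
    """Subtract the per-feature mean so that every column sums to zero."""
    n = len(m)
    means = [sum(col) / n for col in zip(*m)]
    return [[v - mu for v, mu in zip(row, means)] for row in m]


def cross_gram_sq_norm(a: Matrix, b: Matrix) -> float:
    """Squared Frobenius norm of a^T b, computed column pair by column pair."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            dot = sum(x * y for x, y in zip(col_a, col_b))
            total += dot * dot
    return total


def linear_cka(x: Matrix, y: Matrix) -> float:
    """Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F) on centered features."""
    if len(x) != len(y):
        raise ValueError("x and y need the same number of examples (rows)")
    xc, yc = center_columns(x), center_columns(y)
    numerator = cross_gram_sq_norm(xc, yc)
    denominator = math.sqrt(cross_gram_sq_norm(xc, xc) * cross_gram_sq_norm(yc, yc))
    if denominator == 0.0:
        raise ValueError("CKA is undefined for constant features")
    return numerator / denominator
```

Comparing the same layer across two checkpoints with `linear_cka` would give one point of a checkpoint-drift curve; comparing vision and language activations on the same inputs would give a cross-modal score.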


2605.30116 2026-05-29 cs.CV cs.LG

SGMD: Score Gradient Matching Distillation for Few-Step Video Diffusion Distillation


Zhuguanyu Wu, Ruihao Gong, Yang Yong, Yushi Huang, Xiangyu Fan, Lei Yang, Dahua Lin, Xianglong Liu

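Affiliations: Beihang University; SenseTime Research; Hong Kong University of Science and Technology

AI summary: Distribution matching distillation is expensive to train and yields conservative motion dynamics in few-step video diffusion. To address this, proposes Score Gradient Matching Distillation (SGMD), which directly optimizes the fake score toward the teacher and uses teacher stop-gradient Fisher as a stable objective, achieving roughly 3x faster training and markedly better motion dynamics.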

Comments ICML 2026

Abstract

Distribution Matching Distillation (DMD) is a widely used paradigm for accelerating inference in few-step video diffusion models. However, DMD-style video distillation faces two coupled challenges: the fake score must track a continuously evolving generator, making training costly when frequent updates are required, while reverse-KL-style matching can be mode-seeking and conservative for preserving strong motion dynamics. To address these issues, we propose Score Gradient Matching Distillation (SGMD). SGMD adopts a fake-score perspective by directly optimizing the fake score toward the teacher, while using teacher stop-gradient Fisher as a stable distribution-matching objective. We provide a gradient analysis that motivates this objective choice under ideal tracking. Building on this, SGMD introduces a pair of dual potentials: negative-residual (NR) for outer-loop correction and residual-contraction (RC) for inner-loop tracking. Empirically, compared to DMD2, SGMD achieves an approximately $3\times$ training speedup and substantially improves motion dynamics for 4-step distilled models while preserving temporal consistency. A human study confirms that SGMD is preferred in motion quality and overall preference, while visual quality and text alignment remain comparable. Code is available at https://github.com/ModelTC/LightX2V.


2605.30115 2026-05-29 cs.CV

Large Depth Completion Model from Sparse Observations


Zhu Yu, Zhengyi Zhao, Runmin Zhang, Lingteng Qiu, Kejie Qiu, Yisheng He, Siyu Zhu, Zilong Dong, Si-Yuan Cao, Hui-Liang Shen

Affiliations: Zhejiang University; Tongyi Lab, Alibaba Group; Fudan University; Ningbo Innovation Center, Zhejiang University; NingboTech University; Jinhua Institute of Zhejiang University

AI summary: Proposes LDCM, which combines a monocular foundation model and a Poisson-based depth initialization strategy with a point map head that regresses 3D coordinates, achieving metric-accurate depth completion from sparse observations.

Comments ICLR 2026. Project webpage: https://pkqbajng.github.io/ldcm/

Abstract

This work presents the Large Depth Completion Model (LDCM), a simple, effective, and robust framework for single-view metric depth estimation with sparse observations. Without relying on complex architectural designs, LDCM generates metric-accurate dense depth maps using a transformer. It outperforms existing approaches across diverse datasets and sparse observations. We achieve this from two key perspectives: (1) leveraging existing monocular foundation models to improve the quality of sparse depth inputs, and (2) reformulating training objectives to better capture geometric structure and metric consistency. Specifically, a Poisson-based depth initialization strategy is first introduced to generate a uniform coarse dense depth map from diverse sparse observations, providing a strong structural prior for the network. Regarding the training objective, we replace the conventional depth head with a point map head that regresses per-pixel 3D coordinates in camera space, enabling the model to directly learn the underlying 3D scene structure instead of performing pixel-wise depth map restoration. Moreover, this design eliminates the need for camera intrinsic parameters, allowing LDCM to naturally produce metric-scaled 3D point maps. Extensive experiments demonstrate that LDCM consistently outperforms state-of-the-art methods across multiple benchmarks and varying sparsity levels in both depth completion and point map estimation, showcasing its effectiveness and strong generalization to unseen data distributions.


2605.30112 2026-05-29 cs.LG

Striding Across Reynolds Numbers: Representation Geometry in Neural PDE Generalisation


Jianing Shi

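Affiliations: London School of Economics and Political Science

AI summary: Analyzes representation geometry in the cross-Reynolds generalisation of neural PDE solvers and finds that ConvAE-Relay, a matching method built on a convolutional autoencoder, reaches 38.34% error without any target-regime data, highlighting the key role of local, multi-scale representations in cross-Reynolds transfer.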

Comments 12 pages, 8 figures, 5 tables

Abstract

Cross-Reynolds generalisation in neural PDE solvers remains poorly characterised. On the canonical forced 2D Navier-Stokes benchmark, a trained Fourier Neural Operator reaches 46.68% relative L2 error under a 10x Reynolds-number shift, yet zero-forward-model retrieval baselines already improve to 41-42%. This suggests representation geometry as a major organising variable among the tested methods. We test this hypothesis through ConvAE-Relay, which matches states in a source-trained convolutional autoencoder latent space and borrows dynamics from a source-regime database, achieving 38.34+/-0.07% using only a source-regime database and no target-regime fitting, labels, or database entries. A 2x2 ablation isolates matching quality as dominant over the update rule. Oracle experiments confirm that source-regime dynamics directions remain transferable (cosine similarity ~0.84) when matching stays on-manifold; autoregressive drift is the primary bottleneck (~12 percentage points). From the learned-prediction side, a U-Net with multi-scale skip connections achieves 34.72+/-0.60%, consistent with the retrieval-side finding that local, multi-scale representations organise cross-Reynolds transfer among tested methods. All claims are scoped to this benchmark.
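
The match-then-borrow loop and the reported error metric can be sketched roughly as below. Several pieces are assumptions: the encoder is passed in as an arbitrary callable rather than a trained convolutional autoencoder, matching uses squared Euclidean distance in latent space, and the update rule simply adds the matched source increment to the current state. The abstract does not specify the actual update rule, and the names `relay_step` and `relative_l2_error` are hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def relative_l2_error(pred: Sequence[float], target: Sequence[float]) -> float:
    """||pred - target||_2 / ||target||_2, the metric the percentages refer to."""
    numerator = math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)))
    denominator = math.sqrt(sum(t ** 2 for t in target))
    if denominator == 0.0:
        raise ValueError("target has zero norm")
    return numerator / denominator


def relay_step(
    state: Vector,
    database: List[Tuple[Vector, Vector]],
    encode: Callable[[Vector], Vector],
) -> Vector:
    """One retrieval step: find the nearest source state in latent space and
    borrow its increment (assumed update rule: state + (source_next - source))."""
    if not database:
        raise ValueError("source database is empty")
    latent = encode(state)

    def latent_distance(entry: Tuple[Vector, Vector]) -> float:
        source_latent = encode(entry[0])
        return sum((a - b) ** 2 for a, b in zip(latent, source_latent))

    source, source_next = min(database, key=latent_distance)
    return [s + (n - c) for s, c, n in zip(state, source, source_next)]
```

Calling `relay_step` repeatedly on its own output gives an autoregressive rollout, which is where the drift mentioned in the abstract would accumulate.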


2605.30111 2026-05-29 cs.CV cs.AI

xModel-KD: Cross-modal Knowledge Distillation for 3D Scene Perception using LiDAR


Thenukan Pathmanathan, Kanchan Keisham, Thangarajah Akilan

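Affiliations: Dept. of Computer Science, Lakehead University, Thunder Bay, Canada; School of Computer Science Engg. & Info. Systems, Vellore Institute of Technology, Tamil Nadu, India; Dept. of Software Engg., Lakehead University, Thunder Bay, Canada

AI summary: Proposes xModel-KD, a cross-modal knowledge distillation framework that uses contrastive learning to align 2D image texture features with 3D point cloud geometric features, improving LiDAR point cloud segmentation without additional annotations.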

Comments 3 figures, and 5 tables

Abstract

Point cloud segmentation is a fundamental task in 3D scene understanding. Its progress is constrained by the high cost and time required for dense 3D annotations, making labeled samples difficult to obtain. Beyond annotation scarcity, different sensing modalities face inherent limitations. 2D images provide rich texture and appearance cues, yet they lack explicit depth and geometric structure. In contrast, 3D point clouds capture accurate spatial geometry but are sparse and contain no texture information. As a result, relying on a single modality restricts the richness of learned representations and weakens generalization. Although recent multi-modal methods that combine 3D point clouds with 2D images have demonstrated strong performance in tasks such as classification and retrieval, they typically depend on large-scale labeled datasets and have not been fully exploited for data-efficient dense prediction. To address these limitations, we propose a novel cross-modal knowledge distillation framework, xModel-KD, for 3D point cloud segmentation. Our method exploits the complementary strengths of 2D texture and 3D geometry by learning unified per-point representations through cross-modal alignment. Specifically, we design a cross-modal fusion encoder trained with a contrastive objective that enforces feature consistency between corresponding 2D and 3D representations across multiple views. By integrating powerful pre-trained backbones with a targeted fusion strategy, the proposed framework effectively transfers appearance cues from images to geometry-aware point features. Experimental results show that cross-modal fusion achieves a 2% absolute improvement in mIoU over a LiDAR-only baseline, demonstrating the benefit of leveraging complementary multi-modal information for scalable and annotation-efficient 3D scene understanding.
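
A contrastive objective of the kind the abstract mentions can be illustrated with an InfoNCE-style loss over paired point and pixel features. The abstract does not name the exact loss, so InfoNCE, the cosine similarity, and the temperature value are all assumptions, and `info_nce` is a hypothetical name.

```python
import math
from typing import List

Vector = List[float]


def _normalize(v: Vector) -> Vector:
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def info_nce(point_feats: List[Vector], pixel_feats: List[Vector],
             temperature: float = 0.07) -> float:
    """Mean InfoNCE loss where row i of each list is a corresponding 3D/2D pair."""
    if not point_feats or len(point_feats) != len(pixel_feats):
        raise ValueError("need the same non-zero number of point and pixel features")
    points = [_normalize(v) for v in point_feats]
    pixels = [_normalize(v) for v in pixel_feats]
    loss = 0.0
    for i, p in enumerate(points):
        logits = [sum(a * b for a, b in zip(p, q)) / temperature for q in pixels]
        # Numerically stable log-sum-exp over all candidate pixels.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        loss += log_denominator - logits[i]
    return loss / len(points)
```

The loss is small when each point feature is closest to its own pixel feature and grows when pairs are mismatched, which is the consistency pressure a contrastive alignment relies on.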


2605.30107 2026-05-29 cs.CL

Dial HEALTHDIAL for Advice: A Multilingual and Multi-Parallel Spoken Dialogue Dataset for Knowledge-Grounded Information Seeking


Songbo Hu, Yinhong Liu, Ej Zhou, Evgeniia Razumovskaia, Xiaobin Wang, Alexander Fraser, Ivan Vulić, Anna Korhonen

Affiliations: Language Technology Lab, University of Cambridge; School of Computation, Information and Technology, Technical University of Munich; Independent Researcher

AI summary: Builds HEALTHDIAL, a large-scale multilingual, multi-parallel spoken dialogue dataset for developing spoken dialogue systems based on retrieval-augmented generation, and reveals performance disparities across languages.

Comments Accepted to Findings of ACL 2026

Abstract

Creating spoken dialogue datasets is methodologically challenging, and these challenges are amplified when the goal is to build multilingual, multi-parallel datasets at scale. This work introduces HEALTHDIAL, a large-scale, multilingual, and multi-parallel dataset for developing and evaluating retrieval-augmented generation (RAG)-based spoken dialogue systems. The dataset comprises 6,000 information-seeking dialogues (1,500 per language) grounded in trusted content from the World Health Organization (WHO) and 163 hours of user speech recorded from native speakers of diverse dialects across four official WHO languages: Arabic, Chinese, English, and Spanish. Each speaker is annotated with demographic (e.g., gender, age) and sociolinguistic (e.g., primary language, region of origin) variables. We report benchmark results across key dialogue tasks, which reveal consistent performance disparities across languages, even among high-resource ones. To support future research, we release the dataset, a prototype system, and a toolkit for data collection and system evaluation.


2605.30104 2026-05-29 cs.CL

SEAL: Can Saturated Benchmarks Be Revived by LLM-as-a-Meta-Judge?


Jiamin Chen, Yidi Wu, Qiexiang Wang, Qianben Chen, Yuchen Li, Yansen Zhang, Xiaokun Zhang, Wangchunshu Zhou, Chen Ma

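Affiliations: ByteDance Inc.; City University of Hong Kong

AI summary: Proposes SEAL, a protocol that uses an adaptive LLM meta-judge to extract latent ranking signal from saturated benchmarks, achieving high ranking accuracy with fewer calls on tasks such as code generation and mathematical reasoning.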

Abstract

Widely used language-model benchmarks are increasingly saturated, with frontier systems often receiving near-tied scores that standard metrics cannot resolve. Rather than constructing harder alternatives, we ask whether existing tasks can be made informative again through improved evaluation over the same candidate outputs. Therefore, we present Seeded Elimination with Adaptive LLM-as-a-Meta-Judge, a self-improving evaluation protocol for extracting latent ranking signal from saturated benchmarks. SEAL seeds candidate outputs into a single elimination and evaluates each match with task-level principles plus self-improving checklist criteria. We evaluate SEAL on multiple saturated benchmarks covering code generation, mathematical reasoning, knowledge-intensive question answering, and tool-use agent task completion. Across these settings, SEAL improves the ranking-accuracy--latency trade-off over competing protocols, attaining 0.83--1.00 Spearman agreement with full pairwise judging and 4/4 top-1 agreement, while requiring only 11.89 calls per task compared with 28.00 for full pairwise evaluation.
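
The call-count gap between a seeded single-elimination bracket and full pairwise judging can be sketched as below. The `judge` callable is a hypothetical stand-in for the LLM meta-judge, and the best-versus-worst seed pairing is an assumption, since the abstract does not describe the bracket layout. A bare bracket over n candidates needs n - 1 matches, while judging every pair once needs n(n - 1)/2. The reported 28.00 full-pairwise calls would correspond to 8 candidates, which is an inference from the numbers rather than something the abstract states, and the reported 11.89 calls for SEAL exceeds the 7 of a bare 8-candidate bracket, so the protocol presumably spends additional calls that this sketch does not model.

```python
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")


def single_elimination(seeded: List[T], judge: Callable[[T, T], T]) -> Tuple[T, int]:
    """Run a bracket over `seeded` (best seed first); return (winner, judge calls)."""
    if not seeded:
        raise ValueError("need at least one candidate")
    alive = list(seeded)
    calls = 0
    while len(alive) > 1:
        next_round: List[T] = []
        low, high = 0, len(alive) - 1
        # Pair the strongest remaining seed with the weakest one.
        while low < high:
            next_round.append(judge(alive[low], alive[high]))
            calls += 1
            low += 1
            high -= 1
        if low == high:
            next_round.append(alive[low])  # odd field: the middle seed gets a bye
        alive = next_round
    return alive[0], calls


def full_pairwise_calls(n: int) -> int:
    """Number of judge calls when every unordered pair is compared once."""
    return n * (n - 1) // 2
```

With a deterministic judge the bracket always returns the same winner; with a noisy LLM judge the seeding matters more, because a strong candidate can be knocked out by a single bad match.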


2605.30103 2026-05-29 cs.LG

Convergence Theory for Iterative LLM-Based Neural Architecture Search: A Parametric Cross-Entropy Framework with Closed-Form Proxy Reliability

Santosh Premi Adhikari, Radu Timofte, Dmitry Ignatov

Affiliations: Computer Vision Lab, CAIDAS & IFI, University of Würzburg, Germany

AI summary: Models iterative LLM-NAS as a parametric cross-entropy method and proves convergence, geometric convergence of the elite-set probability, the validity advantage of delta-based generation, that MinHash-Jaccard deduplication prevents mode collapse, and a closed-form proxy-reliability formula; experiments check the theoretical predictions.

Comments: 14 pages, 2 figures, 2 tables. Submitted to NeurIPS 2026

Abstract

Large language models (LLMs) are increasingly used as generators in iterative neural architecture search (NAS), yet no formal convergence theory exists for this class of algorithms. We model iterative LLM-NAS as a parametric Cross-Entropy (CE) method over executable programs and prove six results: (1) iterative LLM fine-tuning on elite architectures is equivalent to the CE update restricted to the LLM parametric family; (2) expected architecture quality is monotonically non-decreasing across cycles; (3) elite-set probability converges to a fixed point at a geometric rate C_t >= 1-(1-rho_0)^t; (4) delta-based generation achieves a strictly higher valid-generation rate than full-code generation under a first-order Markov token-error model; (5) the MinHash-Jaccard novelty filter prevents mode collapse; (6) proxy reliability admits the closed-form rho_S = (6/pi) arcsin(rho_P(SNR)/2), yielding the practical diagnostic sigma^2_arch >> sigma^2_noise as a necessary condition for trustworthy proxy-based rankings. Testing against a 22-cycle, three-LLM, six-dataset experiment with 3,300 generated architectures confirms two predictions quantitatively, two at direction-of-effect level, and explains the proxy-reliability ceiling effect previously reported empirically but left unexplained.
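
Two of the closed-form statements above are easy to evaluate numerically. The sketch below simply plugs numbers into the bound from result (3) and the formula from result (6); the function names and the value rho_0 = 0.1 are illustrative assumptions. The abstract does not give the form of rho_P(SNR), so rho_P is passed in directly. The arcsin expression has the same form as the classical relation between Pearson and Spearman correlation for bivariate normal variables.

```python
import math


def elite_probability_lower_bound(rho_0: float, t: int) -> float:
    """Lower bound C_t >= 1 - (1 - rho_0)^t from result (3)."""
    if not 0.0 <= rho_0 <= 1.0:
        raise ValueError("rho_0 must lie in [0, 1]")
    return 1.0 - (1.0 - rho_0) ** t


def spearman_from_pearson(rho_p: float) -> float:
    """Closed form rho_S = (6 / pi) * arcsin(rho_P / 2) from result (6)."""
    if not -1.0 <= rho_p <= 1.0:
        raise ValueError("rho_p must lie in [-1, 1]")
    return (6.0 / math.pi) * math.asin(rho_p / 2.0)


# rho_0 = 0.1 is an arbitrary illustrative value; 22 matches the cycle
# count of the experiment mentioned in the abstract.
for t in (1, 5, 22):
    print(t, round(elite_probability_lower_bound(0.1, t), 4))

for rho_p in (0.3, 0.6, 0.9):
    print(rho_p, round(spearman_from_pearson(rho_p), 4))
```

Because arcsin is increasing on this range, the rank correlation rho_S rises with the linear correlation rho_P, so anything that lowers rho_P also lowers the rank agreement a proxy can reach.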

2605.30100 2026-05-29 cs.LG

Chess-World-Model: A 10M-Game Benchmark for Exact State Tracking from Chess Move Sequences

Benjamin Walker, Terry Lyons

Affiliations: Mathematical Institute, University of Oxford; Department of Mathematics, Imperial College London

AI summary: Proposes a large-scale state-tracking benchmark built from 10 million real chess games that tests whether models learn the transition rules by predicting the board state after a sequence of legal moves; recurrent models outperform the Transformer, and a uniformly random split exposes failures that scale hides.

Comments: 20 pages, 4 figures

Abstract

World models require state tracking, which is the ability to maintain a correct latent state across action sequences. Existing benchmarks are often synthetic or language-based, limiting their value as tests of structured state updates in realistic domains. We introduce Chess-World-Model, a large-scale state-tracking benchmark built from 10 million real chess games, where models predict the exact board state reached after a sequence of legal moves. Alongside a held-out real-game split, we include an out-of-distribution split from uniformly random legal play, which tests whether models learn the transition rules rather than shortcuts from common human positions. Prior theoretical and empirical work has shown that Transformers struggle to state-track, while input-dependent linear RNNs require expressive state-transition matrices to do so. We therefore benchmark a causal Transformer, block-diagonal SLiCE, Mamba-3, and Gated DeltaNet with negative eigenvalues under a matched interface and training protocol. The recurrent models strongly outperform the Transformer at 3 and 8 million parameters. Real-game performance saturates above 18 million parameters, but the random-uniform split remains discriminative up to 40 million, exposing failures otherwise hidden by scale. Additionally, ablations show that less expressive state-transition mechanisms reduce performance on the out-of-distribution split for all three recurrent models. Together, these results establish Chess-World-Model as a practical large-scale benchmark for state tracking that exposes failures model scale would otherwise conceal.

2605.30099 2026-05-29 cs.CV

Evaluation of Conversational Agents: Understanding Culture, Context and Environment in Emotion Detection

Martha Teiko Teye, Yaw Marfo Missah, Emmanuel Ahene, Twum Frimpong, Auxane Boch

Affiliations: Cluster of Excellence, University of Stuttgart; Department of Computer Science, Kwame Nkrumah University of Science and Technology; Institute for Ethics in Artificial Intelligence, Technical University of Munich

AI summary: Targeting Black African society, proposes an emotion prediction model that combines speech and image data using a 3-layer CNN and the AFME algorithm, with accuracies of 85% to 96%; it also identifies sarcasm and improves the credibility of emotion recognition in conversational AI.

Comments: IEEE paper on arxiv

Journal ref: IEEE Access 10 (2022) 24976-24984; Erratum: IEEE Access (2022) 35900-35900

DOI: 10.1109/ACCESS.2022.3153787

Abstract

Valuable decisions and high-priority analyses now depend on applications such as facial biometrics, social media photo tagging, and human-robot interaction. However, successful deployment of such applications depends on their efficiency on tested use cases, taking possible edge cases into consideration. Over the years, many generalized solutions have been implemented to mimic human emotions, including sarcasm. However, factors such as geographical location or cultural difference have not been fully explored, despite their relevance to resolving ethical issues and improving conversational AI (Artificial Intelligence). In this paper, we seek to address the potential challenges in the usage of conversational AI within Black African society. We develop an emotion prediction model with accuracies ranging between 85% and 96%. Our model combines speech and image data to detect the seven basic emotions, with a focus on also identifying sarcasm. It uses a 3-layer convolutional neural network together with a new Audio-Frame Mean Expression (AFME) algorithm and focuses on the model's pre-processing and post-processing stages. In the end, our proposed solution contributes to maintaining the credibility of emotion recognition systems in conversational AI.

2605.30094 2026-05-29 cs.AI cs.GT

PokerSkill: LLMs Can Play Expert-Level Poker without Training or Solvers

Boning Li, Baoxiang Wang, Longbo Huang

Affiliations: IIIS, Tsinghua University; The Chinese University of Hong Kong, Shenzhen

AI summary: Proposes PokerSkill, a framework that constrains LLM actions with a rule-driven skill library and approaches GTO-level poker performance without training or solvers.

Comments: 45 pages, 3 figures

Abstract

Poker is a landmark challenge for artificial intelligence. The dominant approach relies on equilibrium solvers built on counterfactual regret minimization, requiring millions of core-hours of training. Large Language Models (LLMs) possess extensive poker knowledge but perform far below solver-based agents when asked to play directly. Traditional rule-based poker agents are interpretable and training-free, but their strategic ceiling remains far below equilibrium play. We introduce PokerSkill, a training-free and solver-free framework that bridges this gap by using detailed rule-based poker skills as a structured action-grounding interface for LLMs. A deterministic context engine analyzes the current state and retrieves only the relevant fragments from a layered skill library, which is entirely designed by human poker experts, constraining the LLM's choice to reasonable actions. Against GTOWizard, a state-of-the-art GTO benchmark, GPT-5.5 XHigh with PokerSkill achieves -57 ± 21 mbb/hand, Claude Opus 4.6 achieves -80 ± 29 mbb/hand, and Claude Opus 4.7 achieves -87 ± 64 mbb/hand, reducing losses by 49-61% compared to default-prompt baselines and outperforming the strong bot Slumbot. Our key finding is that rule-based skills alone do not constitute a strong strategy, and LLMs alone cannot play well, but their combination yields an agent that requires neither training nor solver access yet competes with systems built on millions of core-hours of computation. To our knowledge, this is the first demonstration of an LLM achieving competitive performance in a complex imperfect-information game without game-specific training or solver queries. Code is available at https://github.com/lbn187/PokerSkill.

2605.30093 2026-05-29 cs.CV

Geometry Matters: 3D Foundation Priors for Learning Semantic Correspondence

Artur Jesslen, Olaf Dünkel, Adam Kortylewski

Affiliations: University of Freiburg; Max Planck Institute for Informatics; CISPA Helmholtz Center for Information Security

AI summary: Proposes a 3D-aware post-training framework that uses a 3D foundation model (SAM3D) to estimate object geometry and pose, generates geometry-aware feature maps that complement DINO and Stable Diffusion features, filters candidate correspondences by geodesic distance, and trains a lightweight adapter to improve semantic correspondence.

Comments: 9 pages (main paper), 21 pages (total), 4 figures

Abstract

Foundation features from self-supervised vision models and text-to-image diffusion models have proven effective for semantic correspondence estimation. However, because these features are learned primarily from 2D image objectives, they lack explicit 3D awareness and often confuse symmetric object sides, repeated parts, and visually similar structures that are distinct in 3D. We introduce a 3D-aware post-training framework that goes beyond available 2D foundation features by incorporating priors from 3D foundation models. Given an image, our method uses SAM3D to estimate object geometry and pose, and refines the pose through render-and-compare optimization. Subsequently, we render PartField descriptors from the reconstructed geometry into the image plane based on the estimated object pose. The resulting geometry-aware feature maps complement DINO and Stable Diffusion features, while geodesic distances on the reconstructed shapes enable reliable filtering of candidate correspondences. We use the filtered matches as supervision to train a lightweight adapter on top of DINO and Stable Diffusion for semantic correspondence. In contrast to prior post-training approaches that require pose annotations and rely on coarse spherical geometry, our method automatically obtains instance-specific 3D structure and uses it to guide correspondence learning. Experiments show that our approach improves semantic correspondence over prior methods while reducing manual geometric supervision. Code and model can be found at https://github.com/GenIntel/3D-SC.

2605.30090 2026-05-29 cs.CL cs.CV

DirectorBench: Diagnosing Long-Form Video Generation with Personalized Multi-Agent Evaluation

Jiamin Chen, Qianben Chen, Jiawen Zhang, Yidi Wu, Yuchen Li, Xiaokun Zhang, Wangchunshu Zhou, Chen Ma

Affiliations: ByteDance Inc.; City University of Hong Kong

AI summary: Proposes DirectorBench, a multi-agent diagnostic benchmark that evaluates long-form video generation against 80 structured metadata entries, 7 user profiles, and 40 checkpoint criteria across five dimensions (script, visual, audio, cross-modal, and stability), localizing bottlenecks and dependence on user preferences.

Abstract

Long-form video generation is rapidly moving from short, single-scene synthesis toward minute-long, multi-shot creation with narrative structure, cinematic control, audio, and cross-modal synchronization. However, evaluating such videos remains challenging, since existing benchmarks largely focus on local visual quality, short-horizon temporal consistency, or generic prompt alignment, and provide limited diagnosis of workflow failures and user-dependent preferences. We introduce DirectorBench, a personalized multi-agent diagnostic benchmark for long-form video generation. DirectorBench evaluates generated videos with respect to 80 structured metadata entries, 7 user profiles, and 40 checkpoint criteria across 5 dimensions: script, visual, audio, cross-modal, and stability. Instead of reducing quality to a single aggregate score, DirectorBench localizes checkpoint-level bottlenecks and supports profile-aware evaluation. We evaluate 4 long-form video generation workflows, 6 base LLMs, and 7 user profiles. Across workflows, DirectorBench reveals a between-unit bottleneck: transition quality averages only 0.256 and reaches 0.356 for the best workflow, while prompt-level user demand fulfillment averages 0.71. We further conduct human evaluation with 14 annotators to validate the alignment between DirectorBench and human judgment. The results show that DirectorBench captures human-perceptible quality differences and reveals workflow- and profile-dependent failure modes that are hidden by aggregate scoring. These findings highlight the importance of diagnostic and profile-aware benchmarking for long-form video generation.

2605.30085 2026-05-29 cs.AI cs.CL cs.LG stat.ML

Conformal Certification of Reasoning Trace Prefixes

Matt Y. Cheung, Ashok Veeraraghavan, Hanjie Chen, Guha Balakrishnan

Affiliations: Department of Electrical & Computer Engineering, Rice University

AI summary: Proposes CROP, which selects a threshold by conformal calibration, returns the longest prefix it certifies as clean while controlling the probability that the prefix contains an error, and balances retaining valid reasoning against discarding misleading suffixes.

Comments: Code available at https://github.com/matthewyccheung/crop

Abstract

Language model reasoning traces are rarely all-or-nothing; they frequently contain valid intermediate steps before a critical error occurs. Existing uncertainty quantification methods typically certify final answers or entire responses, failing to provide statistical guarantees for the proportion of a sequential trace that can be safely retained. To address this, we introduce CROP (Conformal Reasoning Output Prefixes), a verifier-agnostic calibration procedure for clean-prefix certification. Given any step-level risk proxy, CROP selects a calibrated threshold and returns the longest contiguous prefix whose step risk proxies remain below it, routing the uncertified suffix for downstream review or repair. Assuming exchangeability, CROP rigorously controls the marginal probability that the returned prefix contains an annotated error. Across six process-labeled reasoning datasets, we demonstrate that standard step-level metrics such as AUROC do not fully capture prefix utility, suggesting verifiers should instead be evaluated by certified prefix length. Furthermore, CROP balances over- and under-withholding, improving downstream repair accuracy by preserving valid intermediate reasoning while discarding misleading suffixes. Ultimately, this work positions prefix certification as a rigorous, practical bridge between process supervision, abstention, and repair.
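
The threshold-then-truncate idea can be sketched with a standard split-conformal construction. This is a generic illustration under stated assumptions, not necessarily the authors' exact procedure: each calibration trace is scored by the largest step risk up to and including its first annotated error (infinity if it has none), and the threshold is a lower order statistic of those scores, so that under exchangeability a returned prefix contains an annotated error with probability at most alpha. All names and the calibration numbers are hypothetical.

```python
import math
from typing import List, Optional, Sequence


def error_score(risks: Sequence[float], first_error: Optional[int]) -> float:
    """Largest step risk up to and including the first annotated error.

    A prefix cut at threshold tau contains the error exactly when this
    score is below tau. Traces without an error can never fail.
    """
    if first_error is None:
        return math.inf
    return max(risks[: first_error + 1])


def calibrate_threshold(
    cal_risks: Sequence[Sequence[float]],
    cal_first_errors: Sequence[Optional[int]],
    alpha: float,
) -> float:
    """Pick tau as the k-th smallest score with k = floor(alpha * (n + 1))."""
    scores: List[float] = sorted(
        error_score(r, e) for r, e in zip(cal_risks, cal_first_errors)
    )
    k = math.floor(alpha * (len(scores) + 1))
    if k == 0:
        return -math.inf  # too little data: certify nothing
    return scores[k - 1]


def certified_prefix_length(risks: Sequence[float], tau: float) -> int:
    """Length of the longest contiguous prefix whose step risks stay below tau."""
    length = 0
    for r in risks:
        if r >= tau:
            break
        length += 1
    return length


# Hypothetical calibration set: per-step risk scores and first-error index.
cal_risks = [
    [0.1, 0.2, 0.9], [0.3, 0.8], [0.2, 0.4, 0.7], [0.1, 0.1],
    [0.5, 0.6], [0.2, 0.3, 0.95], [0.6, 0.2],
]
cal_first_errors = [2, 1, 2, None, 0, 2, 0]

tau = calibrate_threshold(cal_risks, cal_first_errors, alpha=0.25)
print(tau, certified_prefix_length([0.1, 0.3, 0.65, 0.2], tau))
```

Note that the prefix stops at the first step whose risk reaches the threshold, even if later steps look safe again; the remaining suffix is what would be routed to review or repair.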

2605.30083 2026-05-29 cs.CV

Future Forcing: Future-aware Training-free KV Cache Policy for Autoregressive Video Generation

Jiayi Luo, Qiyan Liu, Tengyang Wang, JunHao Liu, Jiayu Chen, Cong Wang, Hanxin Zhu, Chen Gao, Xiaobin Hu, Qingyun Sun, Zhibo Chen

Affiliations: BUAA (Beihang University); ZGCA; PKU (Peking University); CASIA (Institute of Automation, Chinese Academy of Sciences); USTC (University of Science and Technology of China); NUS (National University of Singapore)

AI summary: Proposes Future Forcing, a training-free future-aware KV cache policy that estimates future queries from the stationarity of the query distribution in autoregressive video models, improving consistency in long video generation.

Abstract

Autoregressive (AR) video generation has emerged as a promising paradigm for long-horizon video synthesis, where each frame is generated conditioned on previously generated tokens. To accelerate inference, the KV cache is used to avoid redundant recomputation across generation steps. Nevertheless, its growth with generation length introduces increasing memory cost and error accumulation, limiting the scalability of AR models to even longer sequences. Existing KV cache compression methods mitigate this issue by selectively retaining only video tokens deemed important. However, most existing methods assess token importance using short-horizon signals derived from the current or historical generation context, making these methods prone to overlooking tokens that appear unimportant at early steps but later become critical for future frames. In this work, we identify an important property of trained AR video models: although RoPE-modulated queries evolve across autoregressive steps, the underlying canonical pre-RoPE query distribution remains remarkably stable throughout the video generation process. This approximate stationarity implies that future query distributions are estimable from historical statistics, enabling principled future-aware cache decisions without any additional training. Building on this insight, we propose Future Forcing, a training-free future-aware KV cache policy for AR video generation. Specifically, Future Forcing first constructs a future query proxy from historical statistics, then scores KV cache tokens by their importance under this proxy, and finally merges redundant token pairs within the affine subspace induced by the future query. Extensive experiments show that Future Forcing improves long-horizon consistency under limited KV caches, achieving an improvement of up to 1.49 in subject consistency on VBench-Long for 60s generation over existing AR video KV cache policies.

2605.30080 2026-05-29 cs.CL

Adaptive Targeted Dynamic Chunking for Tokenization-Free Hierarchical Model

Thang Dang, Akira Nakagawa, Kenichi Kobayashi, Koichi Shirahata

Affiliations: Fujitsu Research of America; Fujitsu Limited

AI summary: Proposes Adaptive Targeted Dynamic Chunking (ATDC), which uses curriculum learning to adjust the compression ratio during training and so improve byte compression in tokenization-free hierarchical models; on the FineWeb-Edu 100B dataset it reaches competitive bits-per-byte and improves training stability and downstream performance.

Abstract

Tokenization-free hierarchical models are emerging as a promising alternative to traditional Large Language Models (LLMs), addressing inherent preprocessing issues such as vocabulary design complexity, out-of-vocabulary (OOV) errors, and language-specific constraints. However, a significant challenge in these byte-level methods is the optimization of the compression ratio, a critical factor that dictates model performance when processing byte data in chunks. In this paper, we propose Adaptive Targeted Dynamic Chunking (ATDC), a novel byte-compression control mechanism designed to enhance the effectiveness of dynamic chunking within hierarchical architectures. Our approach utilizes curriculum learning to progressively adjust the compression ratio during training, transitioning from low to high compression to stabilize the learning process. We provide an analysis establishing the relationship between the target compression ratio and Bytes-Per-Innermost-Chunk (BPIC), allowing for tracking of chunk-size evolution throughout the training phase. Evaluations conducted on the FineWeb-Edu 100B dataset demonstrate that hierarchical models equipped with ATDC achieve competitive Bits-Per-Byte (BPB) performance compared to conventional baselines operating at both byte and token levels. Furthermore, the proposed method exhibits more stable training dynamics and superior final performance across diverse downstream tasks compared to models using fixed compression ratios, while maintaining the inherent robustness and flexibility of byte-level processing.
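
Two quantities in the abstract lend themselves to a short sketch: the bits-per-byte metric and a low-to-high compression curriculum. The BPB conversion below is the usual definition (summed negative log-likelihood in nats divided by ln 2 times the byte count). The schedule is only a hypothetical linear ramp; the abstract does not specify ATDC's actual schedule, and the function names, ramp length, and start and end ratios are assumptions.

```python
import math


def bits_per_byte(total_nll_nats: float, n_bytes: int) -> float:
    """Convert a summed negative log-likelihood in nats to bits per byte."""
    if n_bytes <= 0:
        raise ValueError("n_bytes must be positive")
    return total_nll_nats / (math.log(2) * n_bytes)


def target_compression_ratio(
    step: int, ramp_steps: int, start: float, end: float
) -> float:
    """Hypothetical linear curriculum from a low to a high target ratio.

    The ratio is interpreted here as the target number of bytes per chunk;
    it is held at `end` once the ramp is finished.
    """
    progress = min(max(step / ramp_steps, 0.0), 1.0)
    return start + progress * (end - start)


for step in (0, 500, 1000, 2000):
    print(step, target_compression_ratio(step, ramp_steps=1000, start=2.0, end=6.0))

# A model that spends exactly one bit on each of 8 bytes scores 1.0 BPB.
print(bits_per_byte(total_nll_nats=8 * math.log(2), n_bytes=8))
```

Because BPB normalizes by bytes rather than tokens, it allows byte-level and token-level models to be compared on the same text.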

2605.30076 2026-05-29 cs.CL

UniSteer: Text-Guided Flow Matching in Activation Space for Versatile LLM Steering

Yingdong Shi, Ruiming Zhang, Changming Li, Zhiyu Yang, Kaixing Zhang, Jingyi Yu, Kan Ren

Affiliations: ShanghaiTech University

AI summary: Proposes UniSteer, a text-guided activation flow matching model that learns a conditional distribution over residual-stream activations, giving a unified interface for behavioral control, truthfulness steering, fine-grained concept steering, multi-constraint instruction following, and activation-space classification.

Comments: 16 pages, 4 figures

Abstract

Activation-based control steers large language models (LLMs) by intervening on their internal representations during inference, and has emerged as an effective paradigm for controlling behaviors such as persona and style. However, existing methods often rely on fixed steering directions or task-specific intervention modules, making them difficult to adapt to fine-grained concepts and compositional constraints. We propose UniSteer, a text-guided activation flow matching model that learns a conditional distribution over residual-stream activations from natural-language conditions. Instead of fitting a separate intervention for each target behavior, UniSteer learns a universal conditional velocity field in activation space. At inference time, UniSteer performs flow inversion by partially transporting a source activation toward a latent state and regenerating it under a target textual condition before injecting it back into the frozen LLM. The same conditional model supports activation-space classification by selecting the textual label with the lowest reconstruction energy. Experiments on three target LLMs show that UniSteer provides a unified interface across behavioral control, truthfulness steering, fine-grained concept steering, multi-constraint instruction following, and activation-space classification.

2605.30075 2026-05-29 cs.LG cs.DC

Q-ANCHOR: Federated Quantum Learning with ZNE-guided Correction

Hoang M. Ngo, Quan Nguyen, Wanli Xing, My T. Thai

Affiliations: Department of Computer & Information Science & Engineering; University of Florida; Frost Institute for Data Science and Computing; University of Miami

AI summary: To counter client drift from non-IID data and hardware bias from quantum hardware noise in quantum federated learning, proposes the Q-ANCHOR aggregation architecture, which anchors server updates with zero-noise extrapolation and applies stateful client correction; theory shows it mitigates both kinds of drift, and experiments show more stable training.

Abstract

Quantum Federated Learning (QFL) offers a promising framework to train quantum models across distributed clients while keeping data strictly local. Due to its simplicity and low communication overhead, Federated Averaging (FedAvg) is the standard aggregation choice in QFL literature. However, deploying QFL on practical hardware exposes a severe double-drift phenomenon: the global model is simultaneously derailed by client drift from non-IID data and hardware bias from noisy quantum gradient estimates. In this work, we first analyze the convergence of FedAvg under these realistic conditions, mathematically demonstrating that quantum hardware bias creates a persistent error floor that standard averaging cannot correct. To overcome this limitation, we propose Q-ANCHOR, a quantum-aware federated aggregation architecture that anchors server updates with zero-noise extrapolation while applying stateful client correction to suppress both client drift and hardware-induced bias. Our convergence theory proves that Q-ANCHOR successfully mitigates classical client drift while actively reducing the hardware-bias floor. Experimental results demonstrate that Q-ANCHOR achieves significantly more stable training than conventional FL baselines.
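
Zero-noise extrapolation, the error-mitigation technique named above, can be illustrated on its own. The sketch below shows generic Richardson-style extrapolation: an expectation value is measured at several amplified noise scales and the interpolating polynomial is evaluated at scale zero. It does not reproduce how Q-ANCHOR uses the extrapolated quantity inside the server update, which the abstract does not detail; the toy noise model and all names are assumptions.

```python
from typing import Sequence


def richardson_zero_noise(scales: Sequence[float], values: Sequence[float]) -> float:
    """Evaluate at scale 0 the Lagrange polynomial through (scale, value) points."""
    if len(scales) != len(values) or len(set(scales)) != len(scales):
        raise ValueError("need exactly one value per distinct noise scale")
    estimate = 0.0
    for i, (s_i, v_i) in enumerate(zip(scales, values)):
        weight = 1.0
        for j, s_j in enumerate(scales):
            if j != i:
                weight *= s_j / (s_j - s_i)  # Lagrange basis evaluated at 0
        estimate += weight * v_i
    return estimate


def noisy_expectation(scale: float) -> float:
    """Toy noise model (assumption): the ideal value 1.0 degrades with noise scale."""
    return 1.0 - 0.1 * scale + 0.01 * scale**2


scales = [1.0, 2.0, 3.0]  # scale 1.0 stands for the hardware's native noise level
values = [noisy_expectation(s) for s in scales]
print(values[0], richardson_zero_noise(scales, values))
```

In this toy case the raw measurement at native noise is biased away from the ideal value, while the extrapolated estimate recovers it because the assumed noise model is itself a quadratic. On real hardware the extrapolation reduces bias at the cost of amplified statistical variance.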

2605.30073 2026-05-29 cs.CV

Native Audio-Visual Alignment for Generation

Longbin Ji, Guan Wang, Xuan Wei, Chenye Yang, Xiangrui Liu, Zhenyu Zhang, Shuohuan Wang, Yu Sun, Jingzhou He

Affiliations: ERNIE Team, Baidu Inc.

AI summary: Proposes the NAVA framework, which achieves high-quality, synchronized, and controllable audio-video generation through native audio-visual alignment and context-conditioned joint denoising.

Comments: Project page: https://ernie-research.github.io/NAVA/

Abstract

Joint audio-video generation aims to synthesize temporally synchronized and semantically coherent visual-acoustic content. However, existing open-source methods mainly rely on either dual-tower designs with posterior alignment or fully unified tri-modal designs that mix textual context, audio and video in one shared space. The former weakens fine-grained audio-video co-evolution, while the latter couples semantic conditioning with low-level synchronization. To address these limitations, we propose NAVA, a Native Audio-Visual Alignment framework for joint audio-video generation. NAVA is built upon context-conditioned native audio-visual alignment: it first establishes audio-video correspondence in a dedicated interaction space, and then uses external context to condition the joint denoising process. Specifically, NAVA is instantiated with an Align-then-Fuse MMDiT architecture, which transitions from modality-aware audio-video alignment to modality-shared joint denoising. Furthermore, we introduce Timbre-in-Context Conditioning to associate reference timbre cues with corresponding speech spans to achieve controllable speech timbre. Experiments on Verse-Bench and Seed-TTS, together with a user study, demonstrate that NAVA achieves superior video quality, precise audio-visual synchronization, competitive audio quality, and stronger reference-timbre controllability using only 6.3B parameters.

2605.30070 2026-05-29 cs.LG cs.AI

A Predictive Law for On-Policy Self-Distillation From World Feedback

Tommy He, Jerome Sieber, Matteo Saponati

Affiliations: Open-source models; LiveCodeBench

AI summary: Finds a linear relationship in on-policy self-distillation (OPSD) between the initial student-teacher performance gap and the final performance improvement, and proposes a predictive law for forecasting the effect of an OPSD configuration before training.

Abstract

Moving beyond simple scalar rewards toward richer world feedback is a natural path to more scalable RL post-training. On-policy self-distillation (OPSD) is a promising recent approach that uses arbitrary feedback as a learning signal, yet its reliability compared to established methods, such as GRPO, remains unclear. We identify a strikingly consistent linear correlation between the initial student-self-teacher performance gap and the final performance improvement in OPSD. This relationship holds across context types and model families, providing a powerful predictive law for anticipating the outcome of an OPSD configuration without running the full training procedure. Interestingly, we show that this linear predictability holds across model scales, suggesting a potential basis for new empirical scaling laws on larger models with stronger in-context learning capabilities. In essence, our findings show that OPSD performance can be predicted and tuned before training, offering a principled way to incorporate world feedback as a first-class component of the post-training pipeline.
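
How such a predictive law would be used in practice can be sketched as an ordinary least-squares fit: measure the initial student-self-teacher gap for a few configurations that were trained, fit a line to the observed gains, and read off a forecast for a new configuration from its gap alone. The numbers below are synthetic placeholders, not results from the paper, and the function name is an assumption.

```python
from typing import Sequence, Tuple


def fit_line(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need at least two paired observations")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Synthetic placeholder numbers, NOT results from the paper.
initial_gap = [0.00, 0.10, 0.20, 0.30]  # self-teacher accuracy minus student accuracy
final_gain = [0.01, 0.06, 0.11, 0.16]   # student improvement after training

slope, intercept = fit_line(initial_gap, final_gain)
predicted_gain = slope * 0.40 + intercept  # forecast for an untrained configuration
print(round(slope, 3), round(intercept, 3), round(predicted_gain, 3))
```

The forecast is only as trustworthy as the linear relationship itself, so it should be checked against held-out configurations before being used to choose among them.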

2605.30065 2026-05-29 cs.CV

Boosting Zero-Shot 3D Style Transfer with 2D Pre-trained Priors

利用二维预训练先验提升零样本三维风格迁移

Xin Dong, Yunzhi Teng, Wenfeng Deng, Yansong Tang

发表机构 * Shenzhen International Graduate School, Tsinghua University（清华大学深圳国际研究生院）； Pengcheng Laboratory（鹏城实验室）

AI总结提出Data-Sufficient StyleGaussian模型，通过集成大规模2D图像数据集预训练的解码器，结合特征高斯溅射与延迟风格化，在数据稀缺条件下实现零样本3D风格迁移的高质量多视图一致渲染。

Comments Accepted by IEEE IVMSP2026

详情

AI中文摘要

在这项工作中，我们专注于零样本三维风格迁移，即给定任意风格图像，生成三维场景的多视图一致风格化视图。我们主要解决三维风格迁移中的数据稀缺问题，该问题源于每个模型仅在单个场景上训练，从而限制了可用内容图像的数量。这种稀缺性严重阻碍了风格化性能，因为模型优化依赖于足够数量的内容-风格图像对来提供监督信号。我们的核心思想是将在大规模二维图像数据集上预训练的解码器集成到三维风格迁移流程中，从而利用解码器从大量内容-风格图像对中学习到的先验知识。我们的方法结合了特征高斯溅射和延迟风格化，通过将视图相关操作统一为视图不变过程，在确保视图一致性的同时，利用数据充足的解码器网络实现高质量风格化。实验表明，我们的Data-Sufficient StyleGaussian（DS-StyleGaussian）模型在多个数据集上的视觉质量优于现有的零样本三维风格迁移方法。这项工作也表明，二维预训练可以作为三维任务的强增强手段，弥合二维与三维之间的数据差距。

English abstract

In this work, we focus on zero-shot 3D style transfer that can generate multi-view consistent stylized views of the 3D scene given an arbitrary style image. We primarily tackle the issue of data scarcity in 3D style transfer, which arises when each model is trained on only a single scene, thereby limiting the number of available content images. This scarcity significantly hampers stylization performance, as model optimization relies on a sufficient number of content-style image pairs to provide supervisory signals. Our core idea is to integrate a decoder pre-trained on large-scale 2D image datasets into the 3D style transfer pipeline, thereby leveraging the prior knowledge encoded in the decoder from learning over numerous content-style image pairs. Our method combines feature Gaussian splatting and deferred stylization, enabling high-quality stylization with the data-sufficient decoder network while ensuring view consistency by unifying view-dependent operations into a view-invariant process. Experiments demonstrate that our Data-Sufficient StyleGaussian (DS-StyleGaussian) model outperforms existing zero-shot 3D style transfer methods in terms of visual quality across various datasets. This work also suggests that 2D pre-training can serve as a strong enhancement for 3D tasks, bridging the data gap between 2D and 3D.

2605.30062 2026-05-29 cs.CV

FakeVLM-R1: Internalizing Physical Laws via CoT for Synthetic Image Detection

FakeVLM-R1：通过思维链内化物理定律进行合成图像检测

Leqi Zhu, Junyan Ye, Kaiqing Lin, Zhiyuan Yan, Conghui He, Weijia Li

Affiliations: Shanghai AI Lab; Nanjing University; Sun Yat-Sen University; Shenzhen University; Peking University; Tsinghua University

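AI summary: Proposes the FakeVLM-R1 framework, which combines supervised fine-tuning, Group Relative Policy Optimization, and a critical-thinking chain-of-thought mechanism. Using bidirectional dialectical reasoning and physical commonsense, it constructs authenticity counter-proofs to achieve high-precision, logically interpretable synthetic image detection and to address the over-rejection bias of existing methods.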

AI Chinese abstract

生成式人工智能技术的发展已将合成图像的视觉真实性提升至前所未有的水平。尽管当前基于大型多模态模型（LMM）的可解释检测方法取得了一定进展，但它们仍然依赖于从大量伪造数据中获得的模仿学习，因此缺乏真正的因果推理能力，容易产生解释性幻觉。为克服这一瓶颈，我们提出FakeVLM-R1，旨在赋予模型在执行合成检测任务时类似人类的批判性思维能力。该框架在监督微调（SFT）基础上，将组相对策略优化（GRPO）与批判性思维链（CoT）机制相结合。在推理阶段，模型执行“双向辩证推理”过程：在提出伪造假设的同时，必须同时调用物理常识构建真实性反证。此外，我们构建了包含高质量样本的FakeClue++数据集，该数据集广泛引入了基于真实图像物理定律的注释，为模型提供了统一的真实性锚点。实验证实，FakeVLM-R1在多个基准测试中达到了评估模型中的最优性能（SOTA）。它不仅实现了高精度、逻辑可解释的检测，还解决了现有方法对真实图像的过度拒绝偏差，展现出对扰动的泛化性和鲁棒性。

English abstract

The development of generative artificial intelligence technologies has propelled the visual realism of synthetic images to an unprecedented level. Although current interpretable detection methods based on Large Multimodal Models (LMMs) have made certain progress, they still rely on imitation learning derived from massive volumes of forged data. Consequently, they lack genuine causal reasoning capabilities and are prone to explanatory hallucinations. To overcome this bottleneck, we propose FakeVLM-R1, aiming to endow the model with human-like critical thinking capabilities when performing synthetic detection tasks. Building upon Supervised Fine-Tuning (SFT), this framework integrates Group Relative Policy Optimization (GRPO) with a Critical Thinking Chain-of-Thought (CoT) mechanism. During the inference phase, the model executes a "bidirectional dialectical reasoning" process: while proposing a forgery hypothesis, it must simultaneously invoke physical commonsense to construct an authenticity counter-proof. Furthermore, we constructed the FakeClue++ dataset with high-quality samples, which extensively introduces annotations guided by the physical laws of authentic images, providing a unified authenticity anchor for the model. Experiments confirm that FakeVLM-R1 achieves SOTA performance among the evaluated models across multiple benchmarks. It not only achieves high-precision, logically interpretable detection but also resolves the over-rejection bias of existing methods against real images, demonstrating generalization and robustness against perturbations.
