arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.09253 2026-06-02 cs.CL cs.AI

Cornerstones or Stumbling Blocks? Deciphering the Rock Tokens in On-Policy Distillation

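Yuxuan Jiang, Runchao Li, Shubhashis Roy Dipta, Dawei Li, Zhao Yang

Affiliations University of Maryland, Baltimore County; Case Western Reserve University; Arizona State University; VU Amsterdam

AI summary This paper studies "Rock Tokens," tokens whose loss stays persistently high in on-policy distillation, and finds that although they account for a large share of the gradient, they contribute little functionally; it proposes that bypassing them can streamline the alignment process.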

Abstract

While recent work in Reinforcement Learning with Verifiable Rewards (RLVR) has shown that a small subset of critical tokens disproportionately drives reasoning gains, an analogous token-level understanding of On-Policy Distillation (OPD) remains largely unexplored. In this work, we investigate high-loss tokens, which, as the most direct signal of student-teacher mismatch under OPD's per-token KL objective, should progressively diminish as training converges according to existing studies; however, our empirical analysis shows otherwise. Even after OPD training reaches apparent saturation, a substantial subset of tokens continues to exhibit persistently high loss; these tokens, which we term Rock Tokens, can account for up to 18% of the tokens in generated outputs. Our investigation reveals two startling paradoxes. First, despite their high occurrence frequency providing a disproportionately large share of total gradient norms, Rock Tokens themselves remain stagnant throughout training, resisting teacher-driven corrections. Second, through causal intervention, we find that these tokens provide negligible functional contribution to the model's actual reasoning performance. These findings suggest that a vast amount of optimization bandwidth is spent on structural and discourse residuals that the student model cannot or need not internalize. By deconstructing these dynamics, we demonstrate that strategically bypassing these "stumbling blocks" can significantly streamline the alignment process, challenging the necessity of uniform token weighting and offering a more efficient paradigm for large-scale model distillation.
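
To make the per-token KL objective in this abstract concrete, the following is a minimal Python sketch of how high-loss positions could be flagged. It assumes the reverse direction KL(student ‖ teacher), which is common in on-policy distillation but is not stated in the abstract. The function names, toy logits, and threshold are hypothetical and are not the authors' definition of Rock Tokens.

```python
import math


def log_softmax(logits):
    """Log-probabilities from raw logits (shifted by the max for stability)."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]


def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) at one token position, in nats (assumed direction)."""
    log_p = log_softmax(student_logits)
    log_q = log_softmax(teacher_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def high_loss_mask(student_seq, teacher_seq, threshold):
    """Per-token losses plus a flag for positions above a hypothetical threshold."""
    losses = [reverse_kl(s, t) for s, t in zip(student_seq, teacher_seq)]
    return losses, [loss > threshold for loss in losses]


# Toy sequence of two positions over a three-word vocabulary.
student = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0]]
teacher = [[2.0, 0.0, 0.0], [3.0, 0.0, 0.0]]
losses, mask = high_loss_mask(student, teacher, threshold=0.5)
```

The first position has identical logits, so its loss is zero. The second position, where the student and teacher prefer different words, is flagged. Tracking which flagged positions stay flagged across checkpoints would be one way to look for the persistence the abstract describes.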

2605.09098 2026-06-02 cs.CL

Dynamic Meta-Metrics: Source-Sentence Conditioned Weighting for MT Evaluation

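Luke Zhang, Justin Vasselli, Aditya Khan, York Hay Ng, En-Shiun Annie Lee

Affiliations University of Toronto, Canada; Nara Institute of Science and Technology, Japan; Ontario Tech University, Canada

AI summary Proposes the Dynamic Meta-Metrics (DMM) framework, which improves machine translation evaluation by combining existing metrics conditioned on the source sentence; experiments show that MLP-based combinations outperform linear and Gaussian process-based ensembles, and that a soft-conditioned extension yields gains over linear models.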

Comments 5 pages, ACL SRW 2026

Abstract

We propose Dynamic Meta-Metrics (DMM), a framework for machine translation evaluation that learns source-sentence conditioned combinations of existing metrics. Rather than relying on a single static ensemble or language-specific weighting, DMM adapts the metric combination based on properties of the source segment. We study hard conditioning, which fits an interpretable combiner per cluster, and an exploratory soft-conditioned extension whose weights vary continuously with source-cluster responsibilities. We evaluate DMM on the WMT Metrics Shared Task data across multiple language pairs using pairwise agreement measures at the system and segment levels. Across settings, MLP-based combinations outperform linear and Gaussian process-based ensembles, and introducing soft conditioning yields gains over linear models.
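
The difference between hard and soft conditioning described above can be sketched in a few lines of Python. This is an illustrative reading only: it assumes a linear combiner per cluster and mixes the per-cluster weights by the source segment's cluster responsibilities. All names and numbers are hypothetical, and the paper's actual combiners (for example the MLP variant) may differ.

```python
def soft_conditioned_score(metric_scores, responsibilities, cluster_weights):
    """Combine metric scores with weights mixed by cluster responsibilities.

    metric_scores:    one score per existing metric for a single translation
    responsibilities: soft assignment of the source segment to each cluster
    cluster_weights:  one weight vector (over metrics) per cluster
    """
    n_metrics = len(metric_scores)
    effective = [
        sum(r * weights[j] for r, weights in zip(responsibilities, cluster_weights))
        for j in range(n_metrics)
    ]
    return sum(w * s for w, s in zip(effective, metric_scores))


metric_scores = [0.8, 0.6]                  # two hypothetical metric outputs
cluster_weights = [[1.0, 0.0], [0.0, 1.0]]  # cluster 0 trusts metric 0, cluster 1 trusts metric 1

# Hard conditioning corresponds to a one-hot responsibility vector.
hard_score = soft_conditioned_score(metric_scores, [1.0, 0.0], cluster_weights)

# Soft conditioning lets the weights vary continuously between clusters.
soft_score = soft_conditioned_score(metric_scores, [0.25, 0.75], cluster_weights)
```

With a one-hot responsibility vector the function reduces to the per-cluster combiner, which is why hard conditioning can be read as a special case of the soft version in this sketch.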

2605.08398 2026-06-02 cs.LG cs.CV

Exploring and Exploiting Stability in Latent Flow Matching

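Rania Briq, Michael Kamp, Ohad Fried, Sarel Cohen, Stefan Kesselheim

Affiliations University of California, Berkeley

AI summary Shows that latent flow-matching models are robust to data reduction and model capacity shrinkage, and exploits this stability to propose more efficient training and inference algorithms, yielding data savings and a more than two-fold inference speedup.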

Comments Accepted at ICML 2026

Abstract

In this work, we show that Latent Flow-Matching (LFM) models are robust to different types of perturbations, including data reduction and model capacity shrinkage. We characterize this stability by these models' tendency to generate similar outputs under identical noise seeds. We provide a perspective relating this phenomenon to flow matching theory, which indicates that this stability is inherent to the FM objective. We further exploit this stability to derive practical algorithms for more efficient training and inference. Concretely, first, we show that by training LFM models on significantly reduced datasets, performance is preserved, and in compute-constrained regimes, the model converges faster while maintaining quality. This yields multiple advantages, including savings in the training time due to faster convergence, and alleviating annotation effort when training conditional models. Second, LFM stability under architectural shrinkage gives rise to a two-model coarse-to-fine approach, one using a light-weight architecture for the first phase of the FM trajectory, and one with higher capacity for the second, thereby reducing the inference cost substantially. To determine which samples are informative, we introduce three sample-scoring criteria and evaluate them under standard metrics for generative models. Our results are thoroughly evaluated on multiple datasets, demonstrating the practical advantage of this stability, including data savings and a more than two-fold inference speedup while generating comparable outputs.
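
The two-model coarse-to-fine idea can be illustrated with a small Python sketch of an Euler sampler that calls a light velocity model for the first part of the trajectory and a heavier one for the rest. It assumes time runs from noise at t = 0 to data at t = 1 and that the switch happens at a fixed fraction of the steps. The function names, the switch point, and the scalar toy velocity field are hypothetical stand-ins, not the authors' implementation.

```python
def sample_coarse_to_fine(x0, light_velocity, heavy_velocity,
                          switch_fraction=0.5, steps=10):
    """Euler-integrate dx/dt = v(x, t) from t=0 to t=1, switching models once."""
    switch_step = round(switch_fraction * steps)
    dt = 1.0 / steps
    x = x0
    calls = {"light": 0, "heavy": 0}
    for step in range(steps):
        t = step * dt
        if step < switch_step:
            velocity = light_velocity(x, t)   # cheap model, early trajectory
            calls["light"] += 1
        else:
            velocity = heavy_velocity(x, t)   # higher-capacity model, late trajectory
            calls["heavy"] += 1
        x = x + dt * velocity
    return x, calls


TARGET = 3.0


def toy_velocity(x, t):
    """Velocity of a straight-line path that reaches TARGET at t = 1."""
    return (TARGET - x) / (1.0 - t)


# Both "models" share the toy field here; real models would differ in size.
sample, calls = sample_coarse_to_fine(1.0, toy_velocity, toy_velocity)
```

In this sketch half of the function evaluations go to the light model, so the inference saving depends on how much cheaper that model is and on where the switch is placed.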

2605.07971 2026-06-02 cs.CV cs.LG

DVD: Discrete Voxel Diffusion for 3D Generation and Editing

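Zhengrui Xiang, Jiaqi Wu, Fupeng Sun, Heliang Zheng, Yingzhen Li

Affiliations Imperial College London; Math Magic Hitem3D

AI summary Proposes Discrete Voxel Diffusion (DVD), a framework that treats voxel occupancy as a discrete variable to support sparse-voxel generation, uncertainty estimation, and editing, avoiding continuous-to-discrete thresholding and offering interpretable generation dynamics.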

Abstract

We introduce Discrete Voxel Diffusion (DVD), a discrete diffusion framework to generate, assess, and edit sparse voxels for SLat (Structured LATent) based 3D generative pipelines. Although discrete diffusion has not generally displaced continuous diffusion in image-like generation, we show that it can be an effective first-stage prior for sparse voxel scaffolds. By treating voxel occupancy as a native discrete variable, DVD avoids continuous-to-discrete thresholding and provides a simple framework for voxel generation, uncertainty estimation, and editing. Beyond quality gains, DVD provides more interpretable generation dynamics through explicit categorical modeling. Furthermore, we leverage the predictive entropy as a robust uncertainty metric to identify ambiguous voxel regions and complicated samples, facilitating tasks such as data filtering and quality assessment. Finally, we propose a lightweight fine-tuning strategy using block-structured perturbation patterns. This approach empowers the model to inpaint and edit voxels within a single sampling round, requiring negligible auxiliary computation and no additional model evaluations.
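
As a rough illustration of using predictive entropy as an uncertainty signal, the Python sketch below scores each voxel by the entropy of its predicted occupancy and lists the ambiguous ones. It assumes binary occupancy (occupied or empty) with one probability per voxel. The dictionary layout, the threshold, and the probabilities are hypothetical and not taken from the paper.

```python
import math


def occupancy_entropy(p_occupied):
    """Entropy in nats of a binary occupancy prediction."""
    if p_occupied <= 0.0 or p_occupied >= 1.0:
        return 0.0  # a fully confident prediction carries no uncertainty
    q = 1.0 - p_occupied
    return -(p_occupied * math.log(p_occupied) + q * math.log(q))


def ambiguous_voxels(probabilities, threshold):
    """Voxel indices whose predictive entropy exceeds a hypothetical threshold."""
    return [
        index
        for index, p in probabilities.items()
        if occupancy_entropy(p) > threshold
    ]


# Toy predictions keyed by (x, y, z) grid index.
predicted = {(0, 0, 0): 0.99, (0, 0, 1): 0.5, (0, 1, 0): 0.02}
ambiguous = ambiguous_voxels(predicted, threshold=0.5)
```

Averaging the same entropy over a whole sample would give a per-sample score of the kind that could support the data filtering use mentioned in the abstract.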

2605.07527 2026-06-02 cs.LG cs.AI

Why Self-Inconsistency Arises in GNN Explanations and How to Exploit It

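Wenxin Tai, Yaqian Liu, Ting Zhong, Fan Zhou

Affiliations University of Electronic Science and Technology of China

AI summary Analyzes why self-inconsistency arises in GNN explanations (context perturbation induced by re-explanation), proposes a latent signal assignment hypothesis to explain edge sensitivity, and designs Self-Denoising, a training-free post-processing strategy for calibrating explanations.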

Comments Corrected result errors and fixed typos

Abstract

Recent work has observed that explanations produced by Self-Interpretable Graph Neural Networks (SI-GNNs) can be self-inconsistent: when the model is reapplied to its own explanatory graph subset, it may produce a different explanation. However, why self-inconsistency arises remains poorly understood. In this work, we first identify re-explanation-induced context perturbation as the direct cause of score variation. We then introduce a latent signal assignment hypothesis to explain why only some edges are sensitive to this perturbation, and analyze how conciseness regularization affects latent signal assignment. Given that self-inconsistent edges do not provide stable evidence for the model's prediction, we propose Self-Denoising (SD), a model-agnostic and training-free post-processing strategy that calibrates explanations with only one additional forward pass. Experiments across representative SI-GNN frameworks, backbone architectures, and benchmark datasets support our hypothesis and show that SD consistently improves explanation quality while adding only about 4-6% computational overhead in practice.

2605.07061 2026-06-02 cs.SD cs.AI cs.CV cs.MM

Do Joint Audio-Video Generation Models Understand Physics?

Zijun Cui, Xiulong Liu, Hao Fang, Mingwei Xu, Jiageng Liu, Zexin Xu, Weiguo Pian, Shijian Deng, Feiyu Du, Chenming Ge, Yapeng Tian

Affiliations University of Texas at Dallas; University of Washington; University of California, Los Angeles

AI summary Proposes AV-Phys Bench, a benchmark that tests the physical commonsense of joint audio-video generation models, and finds that all models fall short on physical consistency, especially in event-driven and environment-driven transition scenes.

Comments Preprint. Project Page: https://zijuncui.com/AV-Phys/. Full abstract appears in the PDF

Abstract

Joint audio-video generation models are rapidly approaching professional production quality, raising a central question: do they understand audio-visual physics, or merely generate plausible sounds and frames that violate real-world consistency? We introduce AV-Phys Bench, a benchmark for evaluating physical commonsense in joint audio-video generation. AV-Phys Bench tests models across three scene categories: Steady State, Event Transition, and Environment Transition. It covers physics-grounded subcategories drawn from real-world scenes, plus Anti-AV-Physics prompts that deliberately request physically inconsistent audio-video behavior. Each generation is evaluated along five dimensions: visual semantic adherence, audio semantic adherence, visual physical commonsense, audio physical commonsense, and cross-modal physical commonsense. Across three proprietary and four open-source models, we find that Seedance 2.0 performs best overall, but all models remain far from robust physical understanding. Performance drops sharply on event-driven and environment-driven transitions, and even strong proprietary systems collapse on Anti-AV-Physics prompts. We further introduce AV-Phys Agent, a ReAct-style evaluator that combines a multimodal language model with deterministic acoustic measurement tools, producing rankings that closely align with human ratings. Our results identify cross-modal physical consistency and transition-driven scene dynamics as key open challenges for joint audio-video generation.

2602.16571 2026-06-02 cs.CL

Utility-Preserving De-Identification for Math Tutoring: Investigating Numeric Ambiguity in the MathEd-PII Benchmark Dataset

Zhuqian Zhou, Kirk Vanacore, Bakhtawar Ahtisham, Jinsook Lee, Doug Pietrzak, Daryl Hedley, Jorge Dias, Chris Shaw, Ruth Schäfer, René F. Kizilcec

Affiliations University of Washington

AI summary To address over-redaction in math tutoring dialogues caused by numeric expressions that resemble identifiers, introduces the MathEd-PII benchmark dataset and shows that domain-aware prompting (F1 up to 0.821) detects PII effectively while preserving data utility.

Abstract

Large-scale sharing of dialogue data is key to advancing the science of teaching and learning, yet rigorous de-identification remains a major barrier. In mathematics tutoring transcripts, numeric expressions frequently resemble structured identifiers (e.g., dates or IDs), leading generic Personally Identifiable Information (PII) detection systems to over-redact core instructional content and reduce data utility. This work asks how to detect PII while preserving educational utility, focusing on this "numeric ambiguity" problem. We introduce MathEd-PII, the first benchmark dataset for PII detection in math tutoring dialogues, built with human-in-the-loop LLM annotation. Using density-based segmentation, we show that false PII redactions cluster in math-dense regions, confirming numeric ambiguity as a key failure mode. We then compare four detection strategies: a Presidio baseline and three LLM-based approaches with basic, math-aware, and segment-aware prompting. Domain-aware prompting, including both math-aware (F1: 0.802) and segment-aware versions (F1: 0.821), substantially outperforms the baseline (F1: 0.379) while reducing numeric false positives, demonstrating that de-identification must incorporate domain context to preserve analytic utility. This work provides a new benchmark and evidence that utility-preserving de-identification for tutoring data requires domain-aware modeling.
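
The "numeric ambiguity" problem is easy to reproduce with a toy detector. The Python sketch below uses a deliberately naive date pattern, written for this illustration only, to show how a chained division in a tutoring turn gets redacted as if it were a date. It is not Presidio's recognizer and not any of the paper's prompting strategies.

```python
import re

# Hypothetical, deliberately naive date pattern: d/m/y with 1-2 digit day and month.
NAIVE_DATE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b")


def naive_redact(text):
    """Replace anything that looks like a slash-separated date with a tag."""
    return NAIVE_DATE.sub("[DATE]", text)


# A math turn: the expression is instructional content, not a date.
tutor_turn = "Let's simplify 10/5/25 from left to right."
redacted_math = naive_redact(tutor_turn)

# A turn containing a date that should be caught.
student_turn = "My birthday is 10/5/2012."
redacted_pii = naive_redact(student_turn)
```

Both strings are redacted identically, so the pattern alone cannot tell a division exercise from a birthday. This is the kind of failure that motivates adding domain context to the detector.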

2602.04672 2026-06-02 cs.CV cs.GR cs.RO

AGILE: Hand-Object Interaction Reconstruction from Video via Agentic Generation

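Jin-Chuan Shi, Binhong Ye, Tao Liu, Junzhe He, Yangjinhui Xu, Xiaoyang Liu, Zeju Li, Hao Chen, Chunhua Shen

Affiliations State Key Lab of CAD & CG, Zhejiang University; Zhejiang University of Technology

AI summary Proposes AGILE, a framework in which a vision-language model guides the generation of a complete object mesh; combined with an anchor-and-track strategy and contact-aware optimization, it robustly reconstructs hand-object interactions from monocular video and produces simulation-ready assets.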

Comments 16 pages, SIGGRAPH 2026

Abstract

Reconstructing dynamic hand-object interactions from monocular videos is critical for dexterous manipulation data collection and creating realistic digital twins for robotics and VR. However, current methods face two prohibitive barriers: (1) reliance on neural rendering often yields fragmented, non-simulation-ready geometries under heavy occlusion, and (2) dependence on brittle Structure-from-Motion (SfM) initialization leads to frequent failures on in-the-wild footage. To overcome these limitations, we introduce AGILE, a robust framework that shifts the paradigm from reconstruction to agentic generation for interaction learning. First, we employ an agentic pipeline where a Vision-Language Model (VLM) guides a generative model to synthesize a complete, watertight object mesh with high-fidelity texture, independent of video occlusions. Second, bypassing fragile SfM entirely, we propose a robust anchor-and-track strategy. We initialize the object pose at a single interaction onset frame using a foundation model and propagate it temporally by leveraging the strong visual similarity between our generated asset and video observations. Finally, a contact-aware optimization integrates semantic, geometric, and interaction stability constraints to enforce physical plausibility. Extensive experiments on HO3D, DexYCB, ARCTIC, and in-the-wild videos reveal that AGILE outperforms baselines in global geometric accuracy while demonstrating exceptional robustness on challenging sequences where prior methods frequently collapse. By prioritizing physical validity, our method produces simulation-ready assets validated via real-to-sim retargeting for robotic applications. Project page: https://agile-hoi.github.io.

2411.13109 2026-06-02 cs.RO

Special Unitary Parameterized Estimators of Rotation

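Akshay Chandrasekhar

Affiliations University of California, Berkeley

AI summary Revisits rotation estimation through special unitary matrices, proposes two new continuous representations for learning rotations in neural networks, and validates them experimentally.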

Comments Published at ICLR 2026; clarified paper contribution and theoretical narrative; 33 pages

Journal ref
Proceedings of the International Conference on Learning Representations (ICLR 2026)
Abstract

This paper revisits the topic of rotation estimation through the lens of special unitary matrices. We begin by reformulating Wahba's problem using $SU(2)$ to derive multiple solutions that yield linear constraints on corresponding quaternion parameters. We then explore applications of these constraints by formulating efficient methods for related problems. Finally, from this theoretical foundation, we propose two novel continuous representations for learning rotations in neural networks. Extensive experiments validate the effectiveness of the proposed methods.
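
For readers unfamiliar with it, Wahba's problem in its standard textbook form asks for the rotation that best aligns weighted pairs of reference vectors $a_i$ and observed vectors $b_i$. The notation below follows the general literature and is an assumption here, since the abstract does not give the paper's own symbols.

```latex
\min_{R \in SO(3)} \; \frac{1}{2} \sum_{i=1}^{n} w_i \,\bigl\lVert b_i - R\, a_i \bigr\rVert^{2},
\qquad w_i \ge 0 .
```

Because $SU(2)$ is a double cover of $SO(3)$ and is isomorphic to the group of unit quaternions, a solution expressed in $SU(2)$ translates directly into constraints on quaternion parameters, which is the connection the abstract refers to.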

2605.05427 2026-06-02 cs.AI

The Refusal-Compliance Tradeoff: A Large-Scale Safety Behavior Audit of Large Language Models

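Alif Al Hasan, Sumon Biswas

Affiliations Department of Computer and Data Sciences

AI summary Audits the over-refusal and harmful-compliance failure modes of 21 open-weight LLMs on four safety benchmarks, using a composition adjustment to isolate model sensitivity from dataset toxicity confounds; finds that models adopt different calibration strategies, that demographic protection is unequal, and that refusal and compliance tendencies are stable within model families.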

Abstract

Refusal rates are a poor proxy for LLM safety, i.e., a model may over-refuse benign prompts while still complying with harmful ones. We audit both failure modes across 21 open-weight LLMs on four safety benchmarks (OR-Bench, XSTest, ToxiGen, BOLD), using a composition adjustment to isolate model sensitivity from dataset toxicity confounds. We report three findings. First, models adopt fundamentally different calibration strategies: conservative ecosystems such as Llama suppress unsafe outputs at the cost of elevated over-refusals, while permissive ecosystems such as DeepSeek and Qwen preserve helpfulness but tolerate higher harmful compliance. Second, demographic protection is unequal: models over-protect prominent racial and religious groups, frequently refusing even benign prompts about them, while providing substantially weaker protection against disability-targeted attacks. Third, refusal and compliance tendencies are stable within model families across generations and scales, suggesting that post-training objectives shape safety behavior more than architecture. Our results call for joint, demographically-aware, and multi-judge safety evaluation.
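
The claim that a single refusal rate hides both failure modes can be checked with a few lines of Python. The sketch below computes the overall refusal rate alongside an over-refusal rate (benign prompts refused) and a harmful-compliance rate (harmful prompts answered). The record format and the two toy models are made up for illustration, and the paper's composition adjustment is not reproduced here.

```python
def safety_rates(records):
    """Summarize refusal behavior from records with 'harmful' and 'refused' flags."""
    benign = [r for r in records if not r["harmful"]]
    harmful = [r for r in records if r["harmful"]]
    if not benign or not harmful:
        raise ValueError("need at least one benign and one harmful prompt")
    return {
        "refusal_rate": sum(r["refused"] for r in records) / len(records),
        "over_refusal_rate": sum(r["refused"] for r in benign) / len(benign),
        "harmful_compliance_rate": sum(not r["refused"] for r in harmful) / len(harmful),
    }


def toy_model(benign_refused, harmful_refused, per_class=4):
    """Build hypothetical records: the first k prompts of each class are refused."""
    records = []
    for i in range(per_class):
        records.append({"harmful": False, "refused": i < benign_refused})
        records.append({"harmful": True, "refused": i < harmful_refused})
    return records


# Model A refuses half of everything; model B refuses exactly the harmful prompts.
rates_a = safety_rates(toy_model(benign_refused=2, harmful_refused=2))
rates_b = safety_rates(toy_model(benign_refused=0, harmful_refused=4))
```

Both toy models refuse half of all prompts, yet one fails on both axes while the other fails on neither, which is why the two rates need to be reported jointly.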

2604.17415 2026-06-02 cs.LG cs.AI cs.CV

Reward Score Matching: Unifying Reward-based Fine-tuning for Flow and Diffusion Models

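Jeongjae Lee, Jinho Chang, Jeongsol Kim, Jong Chul Ye

Affiliations Graduate School of AI, KAIST, Korea

AI summary Proposes Reward Score Matching (RSM), a framework that unifies many reward-based fine-tuning methods by casting alignment as score matching against a value-guided target, which simplifies the design space and improves efficiency.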

Comments 43 pages, 15 figures

Abstract

Reward-based fine-tuning steers a pretrained diffusion or flow-based generative model toward higher-reward samples while remaining close to the pretrained model. Although existing methods are derived from different perspectives, we show that many can be written under a common framework, which we call reward score matching (RSM). Under this view, alignment becomes score matching against a value-guided target, and the main differences across methods reduce to the construction of the value-guidance estimator and the effective optimization strength across timesteps. This unification clarifies the bias-variance-compute tradeoffs of existing designs, and distinguishes core optimization components from auxiliary mechanisms that add complexity without clear benefit. Guided by this perspective, we develop simpler, more efficient redesigns across representative differentiable and black-box reward alignment tasks. Overall, RSM turns a seemingly fragmented collection of reward-based fine-tuning methods into a smaller, more interpretable, and more actionable design space. Code is available at https://github.com/jaylee2000/rsm

2604.09063 2026-06-02 cs.CV cs.AI

Frequency-Enhanced Diffusion Models: Curriculum-Guided Semantic Alignment for Zero-Shot Skeleton Action Recognition

Yuxi Zhou, Zhengbo Zhang, Jingyu Pan, Zhiyu Lin, Zhigang Tu

Affiliations State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing; Wuhan University; Information Systems Technology and Design Pillar; Singapore University of Technology and Design; School of Geodesy and Geomatics; School of Mathematics and Statistics; Wuhan University Shenzhen Research Institute

AI summary Proposes FDSM, a frequency-aware diffusion model that combines a semantic-guided spectral residual module, a timestep-adaptive spectral loss, and curriculum-based semantic abstraction to counter the oversmoothing of high-frequency dynamics caused by the spectral bias of diffusion models, achieving state-of-the-art zero-shot skeleton action recognition on multiple datasets.

Comments Accepted by The Visual Computer

Abstract

Human action recognition is pivotal in computer vision, with applications ranging from surveillance to human-robot interaction. Despite the effectiveness of supervised skeleton-based methods, their reliance on exhaustive annotation limits generalization to novel actions. Zero-Shot Skeleton Action Recognition (ZSAR) emerges as a promising paradigm, yet it faces challenges due to the spectral bias of diffusion models, which oversmooth high-frequency dynamics. Here, we propose Frequency-Aware Diffusion for Skeleton-Text Matching (FDSM), integrating a Semantic-Guided Spectral Residual Module, a Timestep-Adaptive Spectral Loss, and Curriculum-based Semantic Abstraction to address these challenges. Our approach effectively recovers fine-grained motion details, achieving state-of-the-art performance on NTU RGB+D, PKU-MMD, and Kinetics-skeleton datasets. Code has been made available at https://github.com/yuzhi535/FDSM. Project homepage: https://yuzhi535.github.io/FDSM.github.io/

2605.01270 2026-06-02 cs.LG

Continuous Temporal Representations of Event-Based Signals via Interference-Based Wave Modeling

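Magnus Bengtsson

Affiliations Department of Engineering, University of Borås

AI summary Proposes a continuous temporal modeling framework based on interference-based wave representations, encoding the temporal structure of event-driven signals in a complex-valued latent wave field; it enables efficient gradient-based optimization and robust feature extraction, and improves representation quality over purely real-valued representations.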

Comments 18 pages, 3 figures, Submitted to Journal

Abstract

Spatio-temporal signals arising from event-driven biological processes, such as surface electromyography (sEMG), exhibit asynchronous and highly structured activation patterns that are challenging to model using conventional discrete or purely real-valued representations. In this work, we propose a continuous temporal modeling framework based on interference-based wave representations. The approach maps event-like input signals into a complex-valued latent wave field, where temporal structure is encoded through phase modulation and interactions between latent components. By projecting the resulting wave field onto an energy domain, the model induces structured activation patterns that capture both temporal localization and relational dependencies within finite observation windows, without relying on explicit recurrence or causal state propagation. The proposed formulation is particularly suited for event-driven biosignals, where continuous representations enable efficient gradient-based optimization and robust feature extraction. In particular, the method is designed to support learning from sEMG data for downstream control tasks in biomechanical systems, such as prosthetic devices and exoskeletons. Experimental results demonstrate that the proposed interference-based wave model provides improved representation quality compared to purely real-valued representations, while maintaining computational efficiency suitable for practical deployment.
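
The abstract does not give the model's parametrization, so the following Python sketch only illustrates the general mechanism it names: complex-valued components whose phases interfere, followed by a projection to energy. It assumes each latent component is a single complex exponential and that "energy" means the squared magnitude of the summed field. Every name and number is hypothetical.

```python
import cmath
import math


def wave_energy(t, components):
    """Squared magnitude of a sum of complex waves at time t.

    components: iterable of (amplitude, angular_frequency, phase) tuples
    """
    field = sum(
        amplitude * cmath.exp(1j * (omega * t + phase))
        for amplitude, omega, phase in components
    )
    return abs(field) ** 2


# Two unit-amplitude components at the same frequency.
in_phase = [(1.0, 2.0, 0.0), (1.0, 2.0, 0.0)]
out_of_phase = [(1.0, 2.0, 0.0), (1.0, 2.0, math.pi)]

constructive = wave_energy(0.3, in_phase)      # amplitudes add
destructive = wave_energy(0.3, out_of_phase)   # amplitudes cancel
```

The energy depends on the relative phase of the components rather than on their individual magnitudes, which is the property that lets phase modulation carry temporal structure in a representation of this kind.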

2405.15491 2026-06-02 cs.CV

GSDeformer: Direct, Real-time and Extensible Cage-based Deformation for 3D Gaussian Splatting

Jiajun Huang, Shuolin Xu, Hongchuan Yu, Tong-Yee Lee

Affiliations National Centre for Computer Animation; Bournemouth University; Department of Computer Science and Information Engineering

AI summary Proposes GSDeformer, which bridges cage-based deformation and 3D Gaussian Splatting through a proxy point-cloud representation, enabling direct, real-time deformation that requires no retraining and is compatible with multiple 3DGS variants.

Comments Project Page: https://jhuangbu.github.io/gsdeformer, Video: https://www.youtube.com/watch?v=-ecrj48-MqM

Abstract

We present GSDeformer, a method that enables cage-based deformation on 3D Gaussian Splatting (3DGS). Our approach bridges cage-based deformation and 3DGS by using a proxy point-cloud representation. This point cloud is generated from 3D Gaussians, and deformations applied to the point cloud are translated into transformations on the 3D Gaussians. To handle potential bending caused by deformation, we incorporate a splitting process to approximate it. Our method does not modify or extend the core architecture of 3D Gaussian Splatting, making it compatible with any trained vanilla 3DGS or its variants. Additionally, we automate cage construction for 3DGS and its variants using a render-and-reconstruct approach. Experiments demonstrate that GSDeformer delivers superior deformation results compared to existing methods, is robust under extreme deformations, requires no retraining for editing, runs in real-time, and can be extended to other 3DGS variants. Project Page: https://jhuangbu.github.io/gsdeformer/

2605.00394 2026-06-02 cs.LG

Mesh Field Theory: Port-Hamiltonian Formulation of Mesh-Based Physics

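Satoshi Noguchi, Yoshinobu Kawahara

Affiliations University of Tokyo

AI summary Proposes Mesh Field Theory (MeshFT) and its neural realization MeshFT-Net, which use a port-Hamiltonian formulation to separate the topological and metric structure of the physics, achieving near-zero energy drift and strong physical fidelity.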

Comments 29 pages, 7 figures, 15 tables. Accepted to ICML 2026

Abstract

We present Mesh Field Theory (MeshFT) and its neural realization, MeshFT-Net: a structure-preserving framework for mesh-based continuum physics that cleanly separates the physics' topological structure from its metric structure. Imposing minimal physical principles (locality, permutation equivariance, orientation covariance, and energy balance/dissipation inequality), we prove a reduction theorem for mesh-based physics. Under these conditions, the physical dynamics admit a local factorization into a port-Hamiltonian form: the conservative interconnection is fixed uniquely by mesh topology, whereas metric effects enter only through constitutive relations and dissipation. This reduction clarifies what must be fixed and what should be learned, directly informing MeshFT-Net's design. Across evaluations on analytic and realistic datasets, physics-consistency tests, and out-of-distribution validation, MeshFT-Net achieves near-zero energy drift and strong physical fidelity (correct dispersion and momentum conservation) along with robust extrapolation and high data efficiency. By eliminating non-physical degrees of freedom and learning only metric-dependent structure, MeshFT provides a principled inductive bias for stable, faithful, and data-efficient learning-based physical simulation.
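
As background for the port-Hamiltonian form mentioned above, the standard finite-dimensional version is shown below. It is the general textbook formulation, not the paper's mesh-specific notation: $H$ is the stored energy, $J$ the conservative interconnection, $R$ the dissipation, and $u$, $y$ the port input and output.

```latex
\dot{x} = \bigl(J - R\bigr)\,\nabla H(x) + G\,u,
\qquad y = G^{\top}\nabla H(x),
\qquad J = -J^{\top},\quad R = R^{\top} \succeq 0,

\frac{dH}{dt} = -\,\nabla H(x)^{\top} R\,\nabla H(x) + y^{\top}u \;\le\; y^{\top}u .
```

The skew-symmetry of $J$ makes the interconnection term drop out of the energy balance, so energy can only change through dissipation or the ports. In the abstract's terms, the analogue of $J$ is fixed by mesh topology, while metric effects enter through the constitutive relations (the analogue of $\nabla H$) and the dissipation $R$.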

2605.00310 2026-06-02 cs.CV cs.AI cs.LG

Beyond Visual Fidelity: Benchmarking Super-Resolution Models for Large-Scale Remote Sensing Imagery via Downstream Task Integration

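Zhili Li, Kangyang Chai, Zhihao Wang, Xiaowei Jia, Yanhua Li, Gengchen Mai, Sergii Skakun, Dinesh Manocha, Yiqun Xie

Affiliations University of Maryland; University of Pittsburgh; Worcester Polytechnic Institute; University of Texas at Austin

AI summary Because existing super-resolution evaluation relies on fidelity metrics such as PSNR and SSIM and overlooks downstream task utility, introduces the GeoSR-Bench benchmark dataset, which integrates downstream tasks such as land cover segmentation and infrastructure mapping; evaluating nine SR models (GAN, transformer, neural operator, and diffusion-based) across 270 settings shows that fidelity metrics correlate weakly or even negatively with task performance.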

Comments Under review at IEEE TPAMI

Abstract

Super-resolution (SR) techniques have made major advances in reconstructing high-resolution images from low-resolution inputs. The increased resolution provides visual enhancement and utility for monitoring tasks. In particular, SR has been increasingly developed for satellite-based Earth observation, with applications in urban planning, agriculture, ecology, and disaster response. However, existing SR studies and benchmarks typically use fidelity metrics such as PSNR or SSIM, whereas the true utility of super-resolved images lies in supporting downstream tasks such as land cover classification, biomass estimation, and change detection. To bridge this gap, we introduce GeoSR-Bench, a downstream task-integrated SR benchmark dataset to evaluate SR models beyond fidelity metrics. GeoSR-Bench comprises spatially co-located, temporally aligned, and quality-controlled image pairs from about 36,000 locations across diverse land covers, spanning resolutions from 500m to 0.6m. To the best of our knowledge, GeoSR-Bench is the first SR benchmark that directly connects improved image resolution from SR models with downstream Earth monitoring tasks, including land cover segmentation, infrastructure mapping, and biophysical variable estimation. Using GeoSR-Bench, we benchmark GAN, transformer, neural operator, and diffusion-based SR models on perceptual quality and downstream task performance. We conduct experiments with 270 settings, covering 2 cross-platform SR tasks, 9 SR models, 3 downstream task models, and 5 downstream tasks for each SR task. The results show that improvements in traditional SR metrics often do not correlate with gains in task performance, and the correlations can be negative, indicating that these metrics provide limited guidance for selecting superior models for downstream tasks. This reveals the need to integrate downstream tasks into SR model development and evaluation.
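The 270-setting figure is consistent with the stated factors if every SR task is crossed with every SR model, downstream task model, and downstream task (2 × 9 × 3 × 5 = 270). A minimal sketch of that grid follows; the full cross product is an assumption, and all labels are placeholders because the abstract gives only the counts.

```python
from itertools import product

# Placeholder labels: the abstract states counts, not names.
sr_tasks = [f"sr_task_{i}" for i in range(2)]            # 2 cross-platform SR tasks
sr_models = [f"sr_model_{i}" for i in range(9)]          # 9 SR models
task_models = [f"task_model_{i}" for i in range(3)]      # 3 downstream task models
downstream_tasks = [f"downstream_{i}" for i in range(5)] # 5 downstream tasks per SR task

# Assumption: the benchmark evaluates the full cross product of the four factors.
settings = list(product(sr_tasks, sr_models, task_models, downstream_tasks))
print(len(settings))
```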

2605.00161 2026-06-02 cs.LG

Consistent Diffusion Language Models

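Hasan Amin, Yuan Gao, Yaser Souri, Subhojit Som, Ming Yin, Rajiv Khanna, Xia Song

Affiliations * University of California, Berkeley

AI summary Proposes the Multi-Path Discrete Consistency (MPDC) principle, which trains a denoiser to be path-invariant across stochastic bridges, yielding the single-stage Consistent Diffusion Language Model (CDLM), which reaches state-of-the-art performance in text generation.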

Comments ICML 2026

Details
Abstract

Diffusion language models (DLMs) are an attractive alternative to autoregressive models because they promise sublinear-time, parallel generation, yet practical gains remain elusive as high-quality samples still demand hundreds of refinement steps. In continuous domains, consistency training along the probability-flow ODE is a popular recipe to accelerate diffusion. For discrete diffusion, no analogous sample-space ODE exists, making direct adaptation ill-defined. We argue that the right discrete substitute is the exact posterior bridge, the closed-form conditional law linking any two noise levels, which is available for broad corruptions including masked and uniform diffusion. Building on this observation, we introduce Multi-Path Discrete Consistency (MPDC), a new principle that trains a denoiser to be path-invariant in expectation across these stochastic bridges, and instantiate it as the Consistent Diffusion Language Model (CDLM), a single-stage training framework that does not require an already trained teacher model. Our CDLM objective recovers masked diffusion, continuous consistency models, and progressive or discrete distillation as analytic limits or empirical approximations of one common view. Empirically, CDLM establishes a new state of the art on both conditional and unconditional text-generation, consistently outperforming strong base discrete diffusion models and often even multi-stage distilled baselines across sampling budgets, with the largest gains in the few-step regime. Together, these results position CDLM as a principled and scalable foundation for the next generation of fast, high-fidelity discrete generative modeling.
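For masked (absorbing-state) diffusion, the closed-form conditional law between two noise levels has a well-known token-wise form: a token that is still masked at the noisier time t is revealed at the cleaner time s with probability (α_s − α_t) / (1 − α_t), where α is the probability of remaining unmasked. The sketch below illustrates only that standard posterior, not the CDLM objective; the linear schedule `alpha` and the function names are hypothetical.

```python
import random

MASK = "[MASK]"

def alpha(t):
    """Assumed linear schedule: probability a token is still unmasked at time t."""
    return 1.0 - t

def unmask_probability(s, t):
    """P(x_s = x_0 | x_t = MASK, x_0) for 0 <= s < t <= 1 (requires t > 0)."""
    return (alpha(s) - alpha(t)) / (1.0 - alpha(t))

def sample_bridge(x_t, x_0, s, t, rng=random):
    """Sample x_s token-wise given the noisier x_t and the clean sequence x_0."""
    p = unmask_probability(s, t)
    out = []
    for tok_t, tok_0 in zip(x_t, x_0):
        if tok_t != MASK:
            out.append(tok_t)  # tokens already revealed stay revealed
        else:
            out.append(tok_0 if rng.random() < p else MASK)
    return out

print(unmask_probability(0.4, 0.8))
print(sample_bridge(["the", MASK, MASK], ["the", "cat", "sat"], 0.4, 0.8))
```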

2506.05412 2026-06-02 cs.CV cs.CL

Vision-Language Models Mistake Head Orientation for Gaze Direction: Nonverbal Conversation Cues

Zory Zhang, Pinyuan Feng, Bingyang Wang, Tianwei Zhao, Suyang Yu, Qingying Gao, Hokin Deng, Ziqiao Ma, Yijiang Li, Dezhi Luo

Affiliations * Brown University; Columbia University; Emory University; Johns Hopkins University; University of Washington; Carnegie Mellon University; University of Michigan; UC San Diego

AI summary Through experiments that control head orientation, this study finds that vision-language models (VLMs) rely mainly on head orientation rather than eye appearance when inferring gaze targets, which leads to a substantial performance gap with humans, and points to data bias as the main cause.

Comments Accepted by ACL 2026. Project page at https://zoryzhang.github.io/gaze/

Details
Abstract

Where someone looks is a nonverbal communication cue that children and adults readily use. How well can Vision-Language Models (VLMs) infer gaze targets? To construct evaluation stimuli, we captured 1,360 real-world photos of scenes in which a person gazes at one of several objects on a table. Importantly, we also controlled the gazer's head orientation: sometimes it was directed toward the gaze target, sometimes toward a distractor object, and sometimes left unconstrained. We found a substantial performance gap between VLMs and humans, ruled out alternative explanations such as resolution and object-naming skills, and identified the main reason for the gap as VLMs inferring gaze direction using head orientation rather than eye appearance. Such a bias is likely due to data rather than architecture, as suggested by a proof-of-concept experiment finetuning a transformer-based vision model. Future work should investigate whether these findings hold broadly across various deep learning methods trained on existing data, and whether better data mitigates this problem for all architectures. Pinpointing the reason sets the stage for technologies that can interpret gaze targets to have more efficient interactions with humans.

2604.25702 2026-06-02 cs.CL

Backtranslation Augmented Direct Preference Optimization for Neural Machine Translation

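Mehrdad Ghassabi, Spehr Rajabi, Hamidreza Baradaran Kashani, Sadra Hakim, Mahshid Keivandarian, Amirhossein Jahani Bahnamiri

Affiliations * University of California, Berkeley; Stanford University

AI summary Proposes a reinforcement-learning post-training framework based on Direct Preference Optimization that uses a general text corpus and expert feedback to correct neural machine translation errors, raising the COMET score from 0.703 to 0.747 on English-to-German translation.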

Comments 5 pages, 2 figures

Details
Abstract

Contemporary neural machine translation (NMT) systems are almost exclusively built by training on supervised parallel data. Despite the tremendous progress achieved, these systems still exhibit persistent translation errors. This paper proposes that a post-training paradigm based on reinforcement learning (RL) can effectively rectify such mistakes. We introduce a novel framework that requires only a general text corpus and an expert translator, which can be either a human or an AI system, to provide iterative feedback. In our experiments, we focus specifically on English-to-German translation as a representative high-resource language pair. Crucially, we implement this RL-based post-training using Direct Preference Optimization (DPO). Applying our DPO-driven framework to the gemma3-1b model yields a significant improvement in translation quality, raising its COMET score from 0.703 to 0.747 on the English-to-German task. The results demonstrate that DPO offers an efficient and stable pathway for enhancing pre-trained NMT models through preference-based post-training.
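As a rough illustration of the objective named above, the standard per-pair DPO loss is −log σ(β · [(log π(y_w|x) − log π_ref(y_w|x)) − (log π(y_l|x) − log π_ref(y_l|x))]), where y_w is the preferred translation and y_l the rejected one. The sketch below is not the authors' training code: the log-probabilities are made-up numbers and β = 0.1 is an assumed value. It also works out the relative COMET gain implied by the reported scores.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard per-pair DPO loss; beta=0.1 is an assumed value, not from the paper."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(m)) = log(1 + exp(-m)), written in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Made-up sequence log-probabilities for one preference pair.
loss = dpo_loss(-12.0, -15.0, -13.0, -14.0)
print(f"loss = {loss:.4f}")

# Relative COMET gain implied by the reported scores (0.703 -> 0.747).
relative_gain = (0.747 - 0.703) / 0.703
print(f"relative gain = {100 * relative_gain:.1f}%")
```

When the policy and the reference agree on both translations the margin is zero and the loss equals log 2; the loss falls below that as the policy shifts probability toward the preferred translation.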

2604.23593 2026-06-02 cs.AI

When AI reviews science: Can we trust the referee?

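Jialiang Wang, Yuchen Liu, Hang Xu, Kaichun Hu, Shimin Di, Wangze Ni, Linan Yue, Min-Ling Zhang, Kui Ren, Lei Chen

Affiliations * School of Electronic Engineering, Southeast University; Zhejiang University

AI summary Addressing the security and reliability of AI reviewing, this paper builds a taxonomy of attack types and experimentally tests the effects of prestige framing, assertion strength, rebuttal sycophancy, and contextual poisoning on review scores, providing a baseline for assessing the reliability of AI peer review.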

Details
Journal ref
The Innovation Informatics 2:100030 (2026)
Abstract

The volume of scientific submissions continues to climb, outpacing the capacity of qualified human referees and stretching editorial timelines. At the same time, modern large language models (LLMs) offer impressive capabilities in summarization, fact checking, and literature triage, making the integration of AI into peer review increasingly attractive -- and, in practice, unavoidable. Yet early deployments and informal adoption have exposed acute failure modes. Recent incidents have revealed that hidden prompt injections embedded in manuscripts can steer LLM-generated reviews toward unjustifiably positive judgments. Complementary studies have also demonstrated brittleness to adversarial phrasing, authority and length biases, and hallucinated claims. These episodes raise a central question for scholarly communication: when AI reviews science, can we trust the AI referee? This paper provides a security- and reliability-centered analysis of AI peer review. We map attacks across the review lifecycle -- training and data retrieval, desk review, deep review, rebuttal, and system-level. We instantiate this taxonomy with four treatment-control probes on a stratified set of ICLR 2025 submissions, using two advanced LLM-based referees to isolate the causal effects of prestige framing, assertion strength, rebuttal sycophancy, and contextual poisoning on review scores. Together, this taxonomy and experimental audit provide an evidence-based baseline for assessing and tracking the reliability of AI peer review and highlight concrete failure points to guide targeted, testable mitigations.

2604.22896 2026-06-02 cs.RO cs.LG

Magnetic Indoor Localization through CNN Regression and Rotation Invariance

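Helge Rosé, Konstantin Klipp, Tom Koubek, Bernd Schäufele, Ilja Radusch

Affiliations * University of Freiburg

AI summary Proposes training lightweight CNN models on rotation-invariant features (magnetic field norm and projection onto the gravity axis), enabling indoor localization without orientation alignment and matching or exceeding state-of-the-art accuracy on the MagPie dataset.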

Comments Published and presented at the 2026 4th International Conference on Mechatronics, Control and Robotics (ICMCR)

Details
Abstract

Indoor positioning is an essential technology for a wide range of applications in GNSS-denied environments, including indoor navigation and IoT systems. Combining convolutional neural networks (CNNs) and magnetic field-based features offers a low-cost, infrastructure-free solution for precise positioning. While magnetic fingerprints are a promising approach for indoor positioning, models trained on raw 3D magnetometer data are highly sensitive to device orientation. We address this by using two rotation-invariant features derived from the 3D magnetic field: the norm (Mn) and the projection onto the gravity axis (Mg). We train a lightweight 7-layer dilated CNN (MagNetS/XL) on magnetic sequences to directly regress (x, y) positions. Using the MagPie dataset (three buildings, handheld trajectories), we systematically evaluate fixed and random rotations of test and/or train data. Raw 3D inputs (Mx, My, Mz) exhibit isotropic error increases under fixed 90° rotations and further degrade with growing random rotations. In contrast, 2D (Mn, Mg) inputs maintain rotation-invariant accuracy and surpass the 3D inputs once rotation exceeds building-specific thresholds for three reference buildings: 0° for Loomis (large), 5° for Talbot (medium), and 6° for CSL (small). MagNetXL achieves or exceeds state-of-the-art accuracy on the MagPie dataset, and MagNetS delivers similar performance with roughly one third of the parameters, favoring mobile deployment. These results show that the robustness gained from rotation-invariant inputs outweighs the loss of input dimensionality in realistic usage, allowing mapping and localization without orientation alignment or added infrastructure.
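The two features can be illustrated with a small sketch: Mn is the Euclidean norm of the magnetic vector, and Mg is its component along the unit gravity direction. Both are unchanged when the device frame rotates, because the magnetometer and gravity readings rotate together. The sensor values below are made up, and taking gravity from an accelerometer reading is an assumption; the abstract does not say how the gravity axis is obtained.

```python
import math

def invariant_features(m, g):
    """Return (Mn, Mg): field norm and field component along gravity."""
    mn = math.sqrt(sum(c * c for c in m))
    g_norm = math.sqrt(sum(c * c for c in g))
    mg = sum(a * b for a, b in zip(m, g)) / g_norm
    return mn, mg

def rotate_x(v, deg):
    """Rotate a 3-vector about the x axis by `deg` degrees."""
    t = math.radians(deg)
    x, y, z = v
    return (x, y * math.cos(t) - z * math.sin(t), y * math.sin(t) + z * math.cos(t))

# Made-up readings in the device frame (microtesla and m/s^2).
m = (22.0, -5.0, 41.0)
g = (0.3, 0.2, 9.8)

base = invariant_features(m, g)
# Rotating the device changes the frame of both readings in the same way.
turned = invariant_features(rotate_x(m, 90), rotate_x(g, 90))
print(base)
print(turned)
```

The raw components (Mx, My, Mz) change under the same rotation, which is the sensitivity the abstract attributes to 3D inputs.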

2604.07967 2026-06-02 cs.CL cs.AI

AtomEval: Validity-Aware Atomic Evaluation of Adversarial Claim Rewriting in Fact Verification

Hongyi Cen, Mingxin Wang, Yule Liu, Jingyi Zheng, Hanze Jia, Tan Tang

Affiliations * Zhejiang University; The Hong Kong University of Science and Technology (Guangzhou)

AI summary Proposes the AtomEval protocol, which uses atomic decomposition and a preservation gate to separate valid verifier evasion from proposition-changing rewrites, and introduces the VASR metric to address the inflation of conventional ASR.

Details
Abstract

Large language models (LLMs) can rewrite refuted claims to evade evidence-based fact verifiers, but conventional attack success rate (ASR) can be inflated when rewrites change, weaken, or correct the false proposition they are supposed to preserve. We introduce AtomEval, a validity-aware evaluation protocol for fixed-evidence adversarial claim rewriting. AtomEval represents claims as subject--relation--object--modifier (SROM) atoms, applies a one-way preservation gate to separate valid verifier evasion from proposition-changing rewrites, and reports validity-aware attack success rate (VASR), which counts only verifier-evasive rewrites that preserve the original false proposition. AtomEval further provides fine-grained diagnostics that explain both proposition-level failures and non-minimal valid rewrites. On FEVER refuted-claim rewriting, AtomEval exposes and explains ASR inflation: many apparent attacks fool the verifier by altering, weakening, or correcting the proposition they should preserve. By making attacked-proposition preservation explicit and measurable, AtomEval provides a stable target for evaluating adversarial rewriters that must balance verifier evasion with proposition preservation.
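The gap between ASR and VASR can be shown on toy data. In the sketch below each rewrite carries two boolean outcomes, and VASR counts a success only when both hold. The `Rewrite` record, the use of all rewrites as the denominator, and the counts are illustrative assumptions rather than details taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class Rewrite:
    evades_verifier: bool        # verifier no longer labels the claim as refuted
    preserves_proposition: bool  # rewrite passed the preservation gate

def asr(rewrites):
    """Conventional attack success rate: share of rewrites that evade the verifier."""
    return sum(r.evades_verifier for r in rewrites) / len(rewrites)

def vasr(rewrites):
    """Validity-aware rate: evasion counts only if the false proposition is preserved."""
    valid = sum(r.evades_verifier and r.preserves_proposition for r in rewrites)
    return valid / len(rewrites)

# Toy data: 6 of 10 rewrites evade the verifier, but only 2 of those keep the proposition.
rewrites = (
    [Rewrite(True, True)] * 2
    + [Rewrite(True, False)] * 4
    + [Rewrite(False, True)] * 4
)
print(asr(rewrites), vasr(rewrites))
```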

2602.17513 2026-06-02 cs.CL

Bridging the Domain Divide: Supervised vs. Zero-Shot Clinical Section Segmentation from MIMIC-III to Obstetrics

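Baris Karacan, Barbara Di Eugenio, Patrick Thornton

Affiliations * Massachusetts Institute of Technology

AI summary By building an obstetrics notes dataset, evaluating Transformer-based supervised models, and making the first comparison with zero-shot large language models, the paper finds that supervised models perform strongly in-domain but degrade substantially out-of-domain, whereas zero-shot models show robust out-of-domain adaptability once hallucinations are corrected.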

Comments 14 pages. Camera-ready version accepted at LREC 2026; includes minor revisions and an appendix. To appear in the conference proceedings

Details
Journal ref
Proceedings of the 2026 Language Resources and Evaluation Conference (LREC 2026), pages 2594-2607, Palma, Spain. ELRA 2026
Abstract

Clinical free-text notes contain vital patient information. They are structured into labelled sections; recognizing these sections has been shown to support clinical decision-making and downstream NLP tasks. In this paper, we advance clinical section segmentation through three key contributions. First, we curate a new de-identified, section-labeled obstetrics notes dataset, to supplement the medical domains covered in public corpora such as MIMIC-III, on which most existing segmentation approaches are trained. Second, we systematically evaluate transformer-based supervised models for section segmentation on a curated subset of MIMIC-III (in-domain), and on the new obstetrics dataset (out-of-domain). Third, we conduct the first head-to-head comparison of supervised models for medical section segmentation with zero-shot large language models. Our results show that while supervised models perform strongly in-domain, their performance drops substantially out-of-domain. In contrast, zero-shot models demonstrate robust out-of-domain adaptability once hallucinated section headers are corrected. These findings underscore the importance of developing domain-specific clinical resources and highlight zero-shot segmentation as a promising direction for applying healthcare NLP beyond well-studied corpora, as long as hallucinations are appropriately managed.

2604.20308 2026-06-02 cs.LG

Sheaf Neural Networks on SPD Manifolds: Second-Order Geometric Representation Learning

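Yuhan Peng, Junwen Dong, Yuzhi Zeng, Hao Li, Ce Ju, Huitao Feng, Diaaeldin Taha, Anna Wienhard, Kelin Xia

Affiliations * arXiv.org GitHub

AI summary To address the linear-structure limitation of graph neural networks in Euclidean space, proposes the first sheaf neural network that operates on the manifold of symmetric positive definite matrices, using its Lie group structure to define sheaf operators for second-order geometric representation learning, and achieves the best result on 6 of 7 MoleculeNet benchmarks.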

Details
Abstract

Graph neural networks face two fundamental challenges rooted in the linear structure of Euclidean vector spaces: (1) Current architectures represent geometry through vectors (directions, gradients), yet many tasks require matrix-valued representations that capture relationships between directions, such as how atomic orientations covary in a molecule. These second-order representations are naturally captured by points on the manifold of symmetric positive definite (SPD) matrices; (2) Standard message passing applies shared transformations across edges. Sheaf neural networks address this via edge-specific transformations, but existing formulations remain confined to vector spaces and therefore cannot propagate matrix-valued features. We address both challenges by developing the first sheaf neural network that operates natively on the SPD manifold. Our key insight is that the SPD manifold admits a Lie group structure, enabling well-posed analogs of sheaf operators without projecting to Euclidean space. Theoretically, we prove that SPD-valued sheaves are strictly more expressive than Euclidean sheaves: they admit consistent configurations (global sections) that vector-valued sheaves cannot represent, directly translating to richer learned representations. Empirically, our sheaf convolution transforms effectively rank-1 directional inputs into full-rank matrices encoding local geometric structure. Our dual-stream architecture achieves state-of-the-art results on 6 of 7 MoleculeNet benchmarks, with the sheaf framework providing consistent depth robustness.

2604.19786 2026-06-02 cs.CL

HumorRank: A Tournament-Based Leaderboard for Evaluating Humor Generation in Large Language Models

Edward Ajayi, Prasenjit Mitra

Affiliations * Carnegie Mellon University Africa

AI summary Proposes HumorRank, a tournament-based framework that ranks textual humor generation through theory-grounded pairwise preference judgments and produces a global leaderboard via Bradley-Terry estimation.

Details
Abstract

Humor remains difficult to evaluate in large language models (LLMs) because what makes a response funny is subjective, comparative, and shaped by interacting comedic mechanisms rather than a single scalar property. Existing humor evaluation protocols therefore tend to produce isolated scores or task-specific judgments that are difficult to compare across models. We introduce HumorRank, a tournament-based framework for ranking textual humor generation through theory-grounded pairwise preference judgments. Across SemEval-2026 MWAHAHA and Humor Transfer Bench, HumorRank evaluates nine proprietary, open-weight, and specialized models using LLM-based comparative judgments informed by the General Theory of Verbal Humor (GTVH), with tournament aggregation yielding global rankings via Bradley-Terry estimation. The resulting rankings are cross-judge stable: independent Llama and Qwen LLM judges achieve Kendall τ = 0.889 on both benchmarks. The leaderboard reveals clear model stratification, showing that strong humor generation depends not only on scale but on mastery of comedic mechanisms such as incongruity, conciseness, escalation, and absurdity. HumorRank provides a scalable and interpretable methodology for benchmarking LLM-generated humor without relying solely on isolated automatic metrics or limited human evaluation.
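Two of the quantities mentioned above can be sketched on toy data. The first function fits Bradley-Terry strengths from a pairwise win matrix with the classic minorization-maximization update; the second computes Kendall's τ between two rankings without ties. The win counts and judge rankings are invented for illustration, and this is not the HumorRank implementation. With nine models there are 36 model pairs, so τ = 0.889 is what two discordant pairs would give, (34 − 2) / 36, assuming no ties.

```python
def bradley_terry(wins, iters=200):
    """Fit Bradley-Terry strengths; wins[i][j] = times model i beat model j."""
    n = len(wins)
    p = [1.0] * n
    for _ in range(iters):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i])
            denom = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                        for j in range(n) if j != i)
            new_p.append(total_wins / denom)
        s = sum(new_p)
        p = [v / s for v in new_p]  # normalize so strengths sum to 1
    return p

def kendall_tau(rank_a, rank_b):
    """Kendall's tau for two rankings of the same items, assuming no ties."""
    n = len(rank_a)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# Invented tournament: three models, ten comparisons per pair.
wins = [[0, 7, 9],
        [3, 0, 6],
        [1, 4, 0]]
strengths = bradley_terry(wins)
print([round(v, 3) for v in strengths])

# Invented rankings of nine models by two judges, differing in two adjacent swaps.
judge_a = [1, 2, 3, 4, 5, 6, 7, 8, 9]
judge_b = [2, 1, 3, 4, 5, 6, 7, 9, 8]
print(round(kendall_tau(judge_a, judge_b), 3))
```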

2603.15956 2026-06-02 cs.RO cs.AI

ExpertGen: Scalable Sim-to-Real Expert Policy Learning from Imperfect Behavior Priors

Zifan Xu, Ran Gong, Maria Vittoria Minniti, Kausik Sivakumar, Ahmet Salih Gundogdu, Eric Rosen, Riedana Yan, Tushar Kusnur, Zixing Wang, Di Deng, Peter Stone, Xiaohan Zhang, Karl Schmeckpeper

Affiliations * Robotics and AI Institute; University of Texas at Austin; Sony AI

AI summary Proposes the ExpertGen framework, which initializes a behavior prior with a diffusion policy and uses reinforcement learning to optimize its initial noise, generating high-quality expert policies from sparse rewards alone and enabling scalable sim-to-real transfer.

Details
Abstract

Learning generalizable and robust behavior cloning policies requires large volumes of high-quality robotics data. While human demonstrations (e.g., through teleoperation) serve as the standard source for expert behaviors, acquiring such data at scale in the real world is prohibitively expensive. This paper introduces ExpertGen, a framework that automates expert policy learning in simulation to enable scalable sim-to-real transfer. ExpertGen first initializes a behavior prior using a diffusion policy trained on imperfect demonstrations, which may be synthesized by large language models or provided by humans. Reinforcement learning is then used to steer this prior toward high task success by optimizing the diffusion model's initial noise while keeping the original policy frozen. By keeping the pretrained diffusion policy frozen, ExpertGen regularizes exploration to remain within safe, human-like behavior manifolds, while also enabling effective learning with only sparse rewards. Empirical evaluations on challenging manipulation benchmarks demonstrate that ExpertGen reliably produces high-quality expert policies with no reward engineering. On industrial assembly tasks, ExpertGen achieves a 90.5% overall success rate, while on long-horizon manipulation tasks it attains 85% overall success, outperforming all baseline methods. The resulting policies exhibit dexterous control and remain robust across diverse initial configurations and failure states. To validate sim-to-real transfer, the learned state-based expert policies are further distilled into visuomotor policies via DAgger and successfully deployed on real robotic hardware.

2604.18360 2026-06-02 cs.SD cs.CL

Omni-Embed-Audio: Leveraging Multimodal LLMs for Robust Audio-Text Retrieval

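HaeJun Yoo, Yongseop Shin, Insung Lee, Myoung-Wan Koo, Du-Seong Chang

Affiliations * Sogang University

AI summary Proposes the Omni-Embed-Audio (OEA) retrieval encoder, which leverages multimodal LLMs with native audio understanding; using User-Intent Queries (UIQs) and hard negative mining, it matches M2D-CLAP on text-to-audio retrieval while substantially outperforming existing methods on text-to-text retrieval and hard negative discrimination.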

Comments Accepted at ACL 2026 Main Conference. Camera-ready version

Details
Abstract

Audio-text retrieval systems based on Contrastive Language-Audio Pretraining (CLAP) achieve strong performance on traditional benchmarks; however, these benchmarks rely on caption-style queries that differ substantially from real-world search behavior, limiting their assessment of practical retrieval robustness. We present Omni-Embed-Audio (OEA), a retrieval-oriented encoder leveraging multimodal LLMs with native audio understanding. To systematically evaluate robustness beyond caption-style queries, we introduce User-Intent Queries (UIQs): five formulations reflecting natural search behaviors, namely questions, commands, keyword tags, paraphrases, and exclusion-based negative queries. For negative queries, we develop a hard negative mining pipeline and propose discrimination metrics (HNSR, TFR) assessing models' ability to suppress acoustically similar distractors. Experiments on AudioCaps, Clotho, and MECAT show that OEA achieves comparable text-to-audio retrieval performance to state-of-the-art M2D-CLAP, while demonstrating clear advantages in two critical areas: (1) dominant text-to-text retrieval (+22% relative improvement), and (2) substantially superior hard negative discrimination (+4.3 percentage points HNSR@10, +34.7% relative TFR@10), revealing that LLM backbones provide superior semantic understanding of complex queries.

2604.18326 2026-06-02 cs.CV

OmniHuman: A Large-scale Dataset and Benchmark for Human-Centric Video Generation

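Lei Zhu, Xing Cai, Yingjie Chen, Yiheng Li, Binxin Yang, Hao Liu, Jie Chen, Chen Li, Jing LYu

Affiliations * Peking University; WeChat Lab; Chinese Academy of Sciences

AI summary To address the structural deficiencies of existing datasets in scene diversity, interaction modeling, and attribute alignment, proposes the large-scale, multi-scene OmniHuman dataset with a fully automated annotation pipeline, and establishes the three-level OHBench evaluation system, whose diagnoses are highly consistent with human perception.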

Comments 19 pages, 6 figures

Details
Abstract

Recent advancements in audio-video joint generation models have demonstrated impressive capabilities in content creation. However, generating high-fidelity human-centric videos in complex, real-world physical scenes remains a significant challenge. We identify that the root cause lies in the structural deficiencies of existing datasets across three dimensions: limited global scene and camera diversity, sparse interaction modeling (both person-person and person-object), and insufficient individual attribute alignment. To bridge these gaps, we present OmniHuman, a large-scale, multi-scene dataset designed for fine-grained human modeling. OmniHuman provides a hierarchical annotation covering video-level scenes, frame-level interactions, and individual-level attributes. To facilitate this, we develop a fully automated pipeline for high-quality data collection and multi-modal annotation. Complementary to the dataset, we establish the OmniHuman Benchmark (OHBench), a three-level evaluation system that provides a scientific diagnosis for human-centric audio-video synthesis. Crucially, OHBench introduces metrics that are highly consistent with human perception, filling the gaps in existing benchmarks by providing a comprehensive diagnosis across global scenes, relational interactions, and individual attributes.

2604.17838 2026-06-02 cs.LG stat.CO stat.ML

Efficient Diffusion Models under Nonconvex Equality and Inequality constraints via Landing

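Kijung Jeon, Michael Muehlebach, Molei Tao

Affiliations * University of California, Berkeley

AI summary Proposes a unified framework that replaces projection with a computationally efficient landing mechanism and uses underdamped dynamics to accelerate mixing, enabling diffusion models under equality and inequality constraints on nonconvex feasible sets at substantially lower computational cost.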

Comments 58 pages

Details
Abstract

Generative modeling within constrained sets is essential for scientific and engineering applications involving physical, geometric, or safety requirements (e.g., molecular generation, robotics). We present a unified framework for constrained diffusion models on generic nonconvex feasible sets $Σ$ that simultaneously enforces equality and inequality constraints throughout the diffusion process. Our framework incorporates both overdamped and underdamped dynamics for forward and backward sampling. A key algorithmic innovation is a computationally efficient landing mechanism that replaces costly and often ill-defined projections onto $Σ$, ensuring feasibility without iterative Newton solves or projection failures. By leveraging underdamped dynamics, we accelerate mixing toward the prior distribution, effectively alleviating the high simulation costs typically associated with constrained diffusion. Empirically, this approach reduces function evaluations and memory usage during both training and inference while preserving sample quality. On benchmarks featuring equality and mixed constraints, our method achieves comparable sample quality to state-of-the-art baselines while significantly reducing computational cost, providing a practical and scalable solution for diffusion on nonconvex feasible sets.

2604.17625 2026-06-02 cs.CV

FlowC2S: Flowing from Current to Succeeding Frames for Fast and Memory-Efficient Video Continuation

Hovhannes Margaryan, Quentin Bammey, Christian Sandor

Affiliations * Team ARAI, Université Paris-Saclay, CNRS, LISN, France; LTCI, Télécom Paris, Institut Polytechnique de Paris, France

AI summary Proposes FlowC2S, which fine-tunes a pre-trained text-to-video flow model to learn a vector field between the current and succeeding video chunks, using inherent optimal couplings and target inversion to achieve fast, memory-efficient video continuation.

Details
Abstract

This paper introduces a novel methodology for generating fast and memory-efficient video continuations. Our method, dubbed FlowC2S, fine-tunes a pre-trained text-to-video flow model to learn a vector field between the current and succeeding video chunks. Two design choices are key. First, we introduce inherent optimal couplings, utilizing temporally adjacent video chunks during training as a practical proxy for true optimal couplings, resulting in straighter flows. Second, we incorporate target inversion, injecting the inverted latent of the target chunk into the input representation to strengthen correspondences and improve visual fidelity. By flowing directly from current to succeeding frames, instead of the common combination of current frames with noise to generate a video continuation, we reduce the dimensionality of the model input by a factor of two. The proposed method, fine-tuned from LTXV and Wan, surpasses the state-of-the-art scores across quantitative evaluations with FID and FVD, with as few as five neural function evaluations.
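The link between straighter flows and few function evaluations can be seen in a toy sketch. Sampling from a flow model amounts to integrating a velocity field from the source (here, the current chunk's latent) to the target (the succeeding chunk's latent), with one network call per integration step. In the idealized case of a perfectly straight path, Euler integration reaches the target exactly in any number of steps. The `straight_velocity` function below is a hypothetical stand-in for the learned network, and the three-number latents are made up.

```python
def euler_flow(x0, velocity, steps=5):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = velocity(x, t)  # one function evaluation per step
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Made-up latents standing in for the current and succeeding video chunks.
current = [0.2, -1.0, 0.5]
succeeding = [0.6, -0.4, 0.1]

def straight_velocity(x, t):
    """Idealized constant field pointing from the current to the succeeding latent."""
    return [b - a for a, b in zip(current, succeeding)]

predicted = euler_flow(current, straight_velocity, steps=5)
print(predicted)
```

A learned field is never perfectly straight, so in practice the step count trades accuracy against cost.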