arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.27046 2026-05-27 cs.RO

Learning to Balance Motor Thermal Safety and Quadrupedal Locomotion Performance with Residual Policy

Yuhang Wan, Weixian Lin, Letian Qian, Yiqi Zou, Weiwei Wu, Shengwei Wu, Chuanlin Zhao, Xin Luo

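AI summary: Proposes a two-stage training framework that combines a whole-body thermal model with a residual policy to prevent motor overheating while preserving locomotion performance, enabling long-duration locomotion under payload.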

Abstract

Motor thermal management is often overlooked in the context of electrically actuated robots, particularly legged robots, but motor overheating is a key factor that limits long-duration locomotion, especially under payload conditions. This paper integrates a whole-body thermal model of a quadruped robot into the reinforcement learning pipeline to update motor temperatures, and proposes a two-stage training framework for motor thermal management. In this framework, a nominal policy is first pre-trained as a locomotion baseline capable of traversing diverse terrains. A residual policy is then trained on top of the nominal policy to provide corrective actions based on the robot's thermal state, ensuring high performance under low-temperature conditions and preventing motor overheating under high-temperature conditions. Simulation results demonstrate that the proposed policy achieves an effective balance between motor thermal safety and locomotion performance. Real-world experiments on a Unitree A1 quadruped robot further validate the approach: under a 3 kg payload, the robot achieves stable locomotion across multiple terrains for over 13 minutes, while the nominal policy alone leads to motor overheating in about 5 minutes.
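
The nominal-plus-residual idea can be illustrated with a minimal Python sketch. This is not the authors' implementation: it assumes a first-order lumped thermal model for a single motor (heating proportional to squared torque, cooling toward ambient) and a hand-written residual, whereas the paper uses a whole-body thermal model and a learned residual policy. All function names and constants below are hypothetical.

```python
# Hypothetical constants, chosen only for illustration (not from the paper).
T_AMBIENT = 25.0   # ambient temperature, deg C
T_SOFT = 60.0      # residual starts acting above this temperature
T_LIMIT = 80.0     # temperature treated as overheating
HEAT_GAIN = 0.02   # deg C per (N*m)^2 per second, stands in for I^2 R heating
COOL_RATE = 0.01   # 1/s, first-order cooling toward ambient


def step_temperature(temp, torque, dt):
    """One explicit-Euler step of a first-order lumped thermal model."""
    heating = HEAT_GAIN * torque ** 2
    cooling = COOL_RATE * (temp - T_AMBIENT)
    return temp + dt * (heating - cooling)


def residual_action(nominal, temp):
    """Stand-in for the learned residual: zero when cool, opposes the nominal action when hot."""
    hotness = min(max((temp - T_SOFT) / (T_LIMIT - T_SOFT), 0.0), 1.0)
    return -0.5 * hotness * nominal


def combined_action(nominal, temp):
    """Final command = nominal policy output + residual correction."""
    return nominal + residual_action(nominal, temp)
```

In this sketch the residual is zero below `T_SOFT`, so the nominal behavior is untouched at low temperature, and it halves the command at `T_LIMIT`. That mirrors the stated goal of keeping performance when cool and limiting heating when hot, without claiming to reproduce the trained policy.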

2605.27045 2026-05-27 cs.CL

ExTax: Explainable Disinformation Detection via Persuasion, Emotion, and Narrative Role Taxonomies

Shang Luo, Yingguang Yang, Zhenchen Sun, Yang Liu, Bin Chong, Jingru Chen, Yancheng Chen, Jiayu Liang, Kefu Xu, Hao Peng, Philip S. Yu

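AI summary: Proposes the ExTax framework, which unifies persuasive rhetoric, emotional manipulation, and narrative roles into a 17-dimensional taxonomic space and fuses taxonomic and contextual features through entropy-driven dynamic label smoothing and multi-head attention for explainable disinformation detection, reaching a Macro F1 of 0.8456 on cross-domain benchmarks.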

Abstract

The democratization of LLMs has accelerated the generation and circulation of highly fluent disinformation, making traditional syntax-semantic verification increasingly insufficient. Such deception rarely relies solely on surface-level falsity; instead, it often combines persuasive rhetoric, emotional manipulation, and narrative role construction to influence readers' interpretations through multiple cognitive pathways. However, existing detectors typically emphasize isolated signals (such as syntax, external knowledge, persuasion, or affective cues) and therefore struggle to capture the multi-faceted manipulative intents underlying disinformation or provide human-auditable explanations. To address this gap, we present ExTax, a taxonomy-aligned framework for explainable disinformation detection. ExTax unifies persuasive rhetoric, emotional manipulation, and narrative roles into a 17-dimensional taxonomic space, covering 6 persuasive-rhetoric strategies, 5 emotional-manipulation methods, and 6 narrative-role categories. It elicits attributes from multiple frontier LLMs, reconciles their disagreements through Entropy-driven Dynamic Label Smoothing, and fuses the resulting taxonomic representations with contextual encodings via Heterogeneous Multi-Head Attention, grounding each prediction in an interpretable manipulation profile. Across five cross-domain and cross-genre benchmarks, ExTax achieves an overall Macro $F_1$ of 0.8456, outperforming state-of-the-art deep learning and LLM-based baselines. It also remains robust under severe genre imbalance, where the strongest deep baseline degrades from 0.9454 to 0.6194.
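
One way to picture entropy-driven label smoothing is the Python sketch below. The abstract does not give the actual formula, so this is an assumption throughout: several LLM annotators cast binary votes on one taxonomy attribute, their disagreement is measured as the binary entropy of the vote fraction, and the hard label is pulled toward 0.5 in proportion to that entropy. The names `binary_entropy`, `smoothed_label`, and `max_smoothing` are hypothetical.

```python
import math


def binary_entropy(p):
    """Shannon entropy (in bits) of a Bernoulli(p) variable; 0 for p in {0, 1}."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def smoothed_label(votes, max_smoothing=0.5):
    """Turn binary votes from several annotators into one soft label.

    Unanimous votes keep a hard 0/1 label; split votes are pulled toward 0.5.
    """
    p = sum(votes) / len(votes)
    hard = 1.0 if p >= 0.5 else 0.0
    eps = max_smoothing * binary_entropy(p)  # more disagreement, more smoothing
    return (1.0 - eps) * hard + eps * 0.5
```

For example, `smoothed_label([1, 1, 1])` stays at the hard label, while a split vote such as `[1, 0]` yields a softer target, so attributes on which the annotators disagree contribute a weaker training signal.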

2605.27043 2026-05-27 stat.ML cs.LG stat.ME

Causal Representation Learning for Generalisable Recommendation

Yorgos Felekis, Michael O'Riordan, Oriol Corcoll, Ciarán M. Gilligan-Lee

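AI summary: To address the generalisation problem caused by the mismatch between training and deployment distributions in recommender systems, proposes an information-theoretic disentanglement criterion based on causal representation learning, together with a tractable variational lower bound, that improves generalisation under distribution shift using only confounded logs; validated in a Spotify A/B test, on the KuaiRand dataset, and on a synthetic benchmark.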

Abstract

Predictive models trained on observational data often fail to generalise to the distributions they encounter when deployed, especially when the training data is a product of the system being optimised. Recommender systems are a canonical example: they are trained on interaction logs confounded by the deployed policy, past user behaviour, and platform filtering. As a result, the training distribution differs substantially from the candidate distribution scored at serving time, a gap that makes offline metrics unreliable predictors of online performance. We address the distribution shift problem with a method motivated by causal representation learning (CRL). We propose an information-theoretic disentanglement criterion and prove that its optimum depends only on the causal components of the input. We then derive a tractable variational lower bound that makes the criterion optimisable from finite observational data alone. The scope of our method is narrower than that of much of the CRL literature, in that we target better generalisation under distribution shift, not full identification of all latent causal factors. This narrower target is what makes the method practical, requiring only the existing confounded logs, applying to any standard supervised model, and adding no inference-time cost. Our headline evaluation is an A/B test with millions of users on Spotify, applied to a production ranker for personalised playlist generation. A capacity-matched CRL variant performed on par offline but delivered substantial online gains in listener engagement. Complementary evidence on the public KuaiRand recommendation dataset and a synthetic benchmark with known causal structure shows the same pattern: offline parity with baseline, gains under distribution shift. Across all three settings, adding our causal disentanglement objective yields meaningfully better out-of-distribution generalisation.

2605.27042 2026-05-27 cs.CR cs.AI

Lessons from Penetration Tests on Large-Scale Agent Systems

Kevin Eykholt, Dhilung Kirat, Xiaokui Shu, Jiyong Jang, Frederico Araujo, Ian Molloy

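AI summary: Through two penetration tests of proprietary agent products conducted in 2025, the paper evaluates whether the security posture of AI agents has improved, and notes that many vulnerabilities are not new but reflect recurring classes of weaknesses long observed in earlier computing systems.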

Comments
Accepted at SAGAI 2026
Abstract

As AI systems gain increasing autonomy and execution capability, the number of discovered security vulnerabilities continues to rise. However, many of these vulnerabilities are not fundamentally novel, but instead reflect recurring classes of weaknesses long observed in prior computing systems. Execution-capable AI agents are effectively unbounded, self-modifying programs that interact extensively with multiple layers of the computing stack. This broad interaction surface imposes a significant security burden on developers, who must reason about and secure complex cross-layer behaviors. Prior research has primarily focused on vulnerabilities in open-source agents and agent frameworks. In contrast, it remains unclear whether proprietary agent systems -- developed under stricter coding standards and formal review processes -- exhibit similar security weaknesses. In this paper, we present findings from two penetration tests conducted in 2025 against proprietary agent products and evaluate whether the security posture of AI agents has improved since these assessments.

2605.27039 2026-05-27 eess.AS cs.SD

Why Can't They Remember? Uncovering Representation and Retrieval Bottlenecks in Multi-Turn Acoustic Memory

Yang Xiao, Siyi Wang, Han Yin, Hong Jia, Vidhyasaharan Sethu, Eun-Jung Holden, Ting Dang

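AI summary: By introducing the EnvMem benchmark, the paper finds that the main cause of large audio language models failing to remember non-speech information across multi-turn interactions is representational trajectory drift rather than insufficient attention allocation.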

Abstract

Large audio language models (LALMs) process both speech and environmental acoustic cues, yet struggle to retain non-speech information across multi-turn interactions. The performance gap between semantic (speech) and acoustic (non-speech) understanding remains poorly understood, and the underlying mechanisms of representation and retrieval are still unclear. This work introduces EnvMem, a controlled multi-turn benchmark designed to study this gap and identify the root causes of failures at the representation (i.e., latent embeddings) and retrieval levels (i.e., attention allocation). We further conduct post-hoc interventions to probe representational structure and attention dynamics. Our results reveal representational trajectory drift as the key failure mode, while showing that attention allocation plays a limited role in explaining the observed degradation. Overall, we provide a systematic framework for analyzing and improving non-linguistic memory in long-context LALMs, shedding light on future data and training design for robust acoustic memory modeling.

2605.27038 2026-05-27 cs.RO

TPS-Drive: Task-Guided Representation Purification for VLM-based Autonomous Driving

Jiaxiang Li, Yumao Liu, Ke Ma

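AI summary: Proposes the TPS-Drive framework, which uses task-guided representation purification (an Agent-Centric Tokenizer) to address spatial hallucination and representation interference in VLM-based autonomous driving, enabling precise 3D spatial forecasting and safe planning.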

Abstract

Vision-Language Models (VLMs) provide a promising foundation for autonomous driving planning, yet bridging semantic reasoning and precise 3D spatial forecasting remains a critical challenge. Existing representation strategies generally follow two paths: text-aligned methods flatten continuous spatial states into symbols, which compromises geometric structure and induces "spatial hallucinations"; dense visual methods preserve spatial topology but overwhelm standard tokenizers with redundant background textures, leading to "representation interference". To address these limitations, we introduce TPS-Drive, a novel framework centered on Task-Guided Representation Purification that empowers VLMs to Think in Purified Space. At its core, an Agent-Centric Tokenizer utilizes a task-guided vector quantization mechanism supervised by a frozen 3D detection head, which explicitly reallocates limited codebook capacity from pervasive static backgrounds to critical dynamic agents and effectively isolates spatial redundancy. Leveraging this purified spatial vocabulary, TPS-Drive employs a decoupled reasoning pipeline that sequentially performs scene understanding, future forecasting, and action generation. The framework is optimized via a progressive three-stage training paradigm, culminating in reward-driven refinement that surpasses pure imitation learning. Extensive experiments validate our approach: TPS-Drive achieves accurate agent spatial state forecasting and reduces collision rates in open-loop nuScenes evaluations, while establishing new safety records on the rigorous closed-loop NAVSIMv1 and NAVSIMv2 benchmarks.

2605.27033 2026-05-27 cs.CL cs.AI cs.LG

Tracing Computation Density in LLMs

Corentin Kervadec, Iuliia Lysova, Iuri Macocco, Marco Baroni, Gemma Boleda

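AI summary: Proposes the s-Trace method for estimating optimal subgraphs and finds that LLM computation splits into two phases, a sparse early core and dense later refinement, with the amount of computation correlated with model uncertainty.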

Abstract

Transformer-based large language models (LLMs) are comprised of billions of parameters arranged in deep and wide computational graphs, but it is not clear that they exploit their full capacity for all inputs. We introduce the s-Trace method to efficiently estimate the subgraph of size s that best approximates a full model output. With this method, we find the computation in a variety of LLMs to be organized in two distinct phases. A small subgraph mostly composed of early-layer nodes can reconstruct the head of the full model output distribution. Adding further nodes, mostly located in later layers and increasingly consisting of attention heads, leads to incremental refinements in approximating the full output distribution. We find moreover that the amount of necessary computation per input correlates with model uncertainty, and that sparser subgraphs encode shallow statistics, such as unigram frequency. Overall, our results suggest a consistent modular organization in effective LLM computation, with a sparse early-layer core providing a rough prediction that is further refined through denser computations in later layers.

2605.27032 2026-05-27 cs.CV

SCKAN: Structural Consensus-based KAN Prototype Learning for Semi-Supervised Pancreas Segmentation

Yuqi Liu, Yufei Chen, Wei Fu, Xiaodong Yue, Shuo Li

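AI summary: To address the supervision bias caused by sparse supervision in semi-supervised pancreas segmentation, proposes Structural Consensus-based KAN Prototype Learning (SCKAN), which achieves more generalizable and accurate segmentation through cross-sample structural consensus learning and KAN-based adaptive fusion.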

Comments
10.5 pages, 5 figures, Medical Image Computing and Computer Assisted Intervention 2026
Abstract

Accurate pancreas segmentation is critical for early cancer diagnosis, where annotation scarcity necessitates Semi-Supervised Learning (SSL). However, due to significant inter-sample morphological variability, existing SSL methods face severe generalizability limitations under sparse supervision, leading to the Supervision Bias problem. To address this, we propose Structural Consensus-based KAN Prototype Learning (SCKAN), which constructs the first cross-sample structural consensus learning with Kolmogorov-Arnold Networks (KANs), to achieve more generalizable and accurate segmentation. Specifically, SCKAN contains two key designs: Structure-constrained Prototype Consistency Learning (SPCL), which promotes unbiased structural representation by enforcing cross-sample consistency via prototype-level contrastive optimization, and Consensus-based Kolmogorov-Arnold Fusion (CKaF), which reduces morphology-specific bias by aggregating stable consensus and filtering sample-wise noise via KAN's adaptive B-spline nonlinearity. Extensive experiments on two public pancreas datasets demonstrate the effectiveness of SCKAN. Code is at https://github.com/rhodaliu17/SCKAN.

2605.27030 2026-05-27 cs.CL

Share More, Search Less: Collaborative Parallel Thinking for Efficient Test-Time Scaling

Xinglin Wang, Hao Lin, Shaoxiong Feng, Peiwen Yuan, Yiwei Li, Jiayi Shi, Yueqi Zhang, Chuyi Tan, Ji Zhang, Boyuan Pan, Yao Hu, Kan Li

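AI summary: Proposes a training-free collaborative parallel thinking framework that shares search information across parallel branches to reduce redundant exploration, achieving a better accuracy-latency Pareto frontier in test-time scaling.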

Comments
Preprint
Abstract

Test-Time Scaling (TTS) enhances the reasoning capabilities of large language models by allocating additional inference compute to explore the solution space. However, existing parallel TTS methods typically keep branches isolated during search: intermediate discoveries remain branch-private and cannot guide other branches in time. This information isolation causes substantial redundant exploration, as branches repeatedly rediscover information already found elsewhere and require more search steps to collect complete decision information needed to reach correct answers. To bridge this gap, we propose Collaborative Parallel Thinking (CPT), a training-free inference framework that enables search-time information sharing across parallel branches. CPT extracts compact intermediate information from ongoing branches, maintains a deduplicated query-level information pool, and broadcasts pool entries through the input context, allowing each branch in subsequent search steps to reuse discoveries made by other branches rather than rediscover the same information. Empirically, experiments on HMMT and AIME benchmarks show that CPT establishes a stronger accuracy-latency Pareto frontier than strong baselines across rollout budgets and model scales, highlighting search-time collaboration as an effective direction for efficient parallel TTS.
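
The pool-and-broadcast mechanism described above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the CPT implementation: deduplication here is a crude case- and whitespace-insensitive string match, and the prompt layout is invented. `InfoPool` and `build_context` are hypothetical names.

```python
class InfoPool:
    """Deduplicated, insertion-ordered pool of findings for one query."""

    def __init__(self):
        self._entries = []
        self._seen = set()

    def add(self, finding):
        """Add a finding; return True if it was new, False if a duplicate."""
        key = " ".join(finding.lower().split())  # crude normalization (assumption)
        if not key or key in self._seen:
            return False
        self._seen.add(key)
        self._entries.append(finding.strip())
        return True

    def entries(self):
        """Return a copy of the pooled findings in insertion order."""
        return list(self._entries)


def build_context(query, pool):
    """Prepend shared findings to the prompt for a branch's next search step."""
    lines = ["Question: " + query]
    shared = pool.entries()
    if shared:
        lines.append("Findings shared by other branches:")
        lines.extend("- " + e for e in shared)
    return "\n".join(lines)
```

Each branch would call `pool.add(...)` with whatever compact information it extracts, and every branch's next step would start from `build_context(query, pool)`, so a finding is discovered once and reused rather than rediscovered.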

2605.27028 2026-05-27 cs.LG cs.AI

Less is More: Early Stopping Rollout for On-Policy Distillation

Zhou Ziheng, Jiaqi Li, Huacong Tang, Ying Nian Wu, Demetri Terzopoulos

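AI summary: To address the "Off-policy Teacher Decay" problem in on-policy distillation, proposes Early Stopping Rollout (ESR), which restricts response generation to the first few tokens to improve performance, GPU efficiency, and training stability.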

Abstract

On-policy distillation has recently emerged as a promising alternative to standard sequence-level imitation, training a student by scoring its own rollouts with a teacher model. However, we observe an "Off-policy Teacher Decay" problem in this paradigm: for later tokens, the student's earlier trajectory serves as context that is off-policy for the teacher, so the teacher's ability to produce a corrective score decays, and it may fall back to token-completion behavior learned during pre-training. We empirically verify this problem and propose Early Stopping Rollout (ESR) to fix it: a simple yet effective distillation strategy that restricts rollout generation to the first few response tokens. We show that ESR surpasses full-rollout on-policy distillation (OPD) across model sizes, families, tasks, and training regimes, and exhibits much higher GPU efficiency and training stability, especially in cross-model-family scenarios. We further investigate the mechanism behind this surprising performance and discover the "Cascading Alignment" and "Sub-mode Commitment" effects of ESR, which may explain why it works effectively and sometimes even exceeds the teacher model's performance. In addition, we show that this position-based token selection strategy cannot be fully explained by KL divergence and entropy signals.
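
The core of ESR, as described, is simply cutting the student's rollout short before the teacher scores it. A toy Python sketch of that control flow follows; it assumes a token-by-token sampler and a per-token teacher penalty, both passed in as plain callables. The names `rollout`, `distillation_loss`, and `stop_after` are hypothetical, and the abstract does not specify the actual cutoff length or loss.

```python
def rollout(next_token, prompt, max_new_tokens, stop_after=None):
    """Sample a response token by token from the student.

    next_token: callable mapping the token list so far to the next token.
    stop_after: if set, stop after this many response tokens (ESR-style).
    """
    budget = max_new_tokens if stop_after is None else min(stop_after, max_new_tokens)
    tokens = list(prompt)
    for _ in range(budget):
        tokens.append(next_token(tokens))
    return tokens[len(prompt):]


def distillation_loss(response, teacher_score):
    """Mean per-token penalty assigned by the teacher to the student's response."""
    if not response:
        return 0.0
    return sum(teacher_score(i, tok) for i, tok in enumerate(response)) / len(response)
```

With `stop_after` set, only the early positions are generated and scored, so the teacher never has to score tokens that sit deep inside a student trajectory it would not have produced itself; generating fewer tokens is also the plausible source of the reported GPU savings.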

2605.27027 2026-05-27 cs.LG

SQARL: A Size-Agnostic Reinforcement Learning approach for Circuit Allocation in Distributed Quantum Architectures

Víctor Carballo, Júlia López-Closa, Mario Martin

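AI summary: For the qubit allocation problem in distributed quantum computing, proposes a flexible Transformer-based reinforcement learning architecture that handles arbitrary numbers of qubits and cores without retraining and reduces allocation cost by 33% relative to the Hungarian Qubit Allocation algorithm on the Cuccaro Adder.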

Abstract

The scaling of quantum processors is currently limited by technical challenges such as decoherence and cross-talk. As the number of qubits grows, interference increases the computational noise. Distributed quantum computing addresses these limitations by interconnecting smaller, easier-to-handle quantum processors (cores), but it introduces the challenge of minimizing slow, error-prone inter-core communication. The task of distributing quantum circuits across cores while minimizing communication costs is known as the Qubit Allocation problem. This work focuses on developing a deep learning approach to this problem, emphasizing flexibility to quantum hardware topology and improving state-of-the-art performance. Heuristic and non-learning algorithms, such as the Hungarian Qubit Allocation (HQA), currently represent the state of the art. Reinforcement Learning (RL) approaches leverage learned allocation policies but often lack flexibility, requiring retraining when hardware configurations change, and they fall short of the solution quality achieved by non-learning methods. However, learning mechanisms could outperform human-crafted heuristics. To overcome these limitations, this work proposes a flexible, transformer-based architecture that can handle arbitrary numbers of qubits and cores without retraining. Results show that the trained policy consistently outperforms the previous RL state of the art and narrows the gap between RL and HQA for the most common circuits. It achieves a 33% reduction in allocation cost relative to the HQA for the Cuccaro Adder and 25% on average for random circuits. These findings show that learning-based approaches can effectively match the performance of hand-crafted heuristics, a crucial step towards their application in real-world scenarios.

2605.27025 2026-05-27 cs.CL cs.MM

Attribute-Based Diagnosis of LLM Alignment with Hate Speech Annotations

Mohammad Amine Jradi, Faeze Ghorbanpour, Alexander Fraser

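AI summary: By analyzing how LLM judgments align with human annotations across ten subjective attributes, the paper finds good alignment on behaviorally explicit dimensions but systematic inversion on evaluative ones, and proposes combining attributes through confidence-weighted Ridge regression to reconstruct continuous hate speech scores, with R² of up to 0.71.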

Abstract

Hate speech annotation is costly, subjective, and prone to annotator disagreement, making large-scale dataset construction challenging. We systematically analyze how well large language models (LLMs) align with human judgments across ten theoretically grounded subjective attributes, such as dehumanization, violence, and sentiment, evaluating both small and large variants of Llama 3.1 and Qwen 2.5. Our analysis reveals a consistent split across all models: behaviorally explicit dimensions (insult, humiliate, attack-defend) correlate strongly with human annotations, while evaluative dimensions (respect, sentiment, hate speech) are systematically inverted. Demographic persona conditioning reduces model confidence without improving alignment. Building on these insights, we propose combining attribute-level LLM predictions via a confidence-weighted Ridge regression to reconstruct continuous hate speech scores from the Measuring Hate Speech corpus, achieving $R^2$ of up to 0.71 and outperforming direct prompting baselines, demonstrating that structured attribute decomposition recovers a richer and more human-aligned signal than end-to-end label prediction alone.

2605.27024 2026-05-27 cs.CV cs.MM

NeR-SC: Adapting Neural Video Representation to Screen Content

Ruohan Shi, Jiaoyan Zhao, Haogang Feng

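AI summary: Proposes the NeR-SC framework, which uses a learnable color palette, multi-gate dense fusion, and an embedding-level frame skip strategy to exploit the discrete colors and strong temporal redundancy of screen content video, surpassing H.264/H.265 at low bitrates.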

Comments
Submitted to PRMVAI 2026
Abstract

Implicit neural representations have emerged as a promising paradigm for video compression, with recent methods achieving competitive performance on natural video. However, screen content video (common in remote desktop, online education, and cloud gaming) exhibits distinct statistics: sharp edges, limited color palettes, and strong temporal redundancy. Existing neural representation methods, designed for natural scenes, lack mechanisms to exploit these properties, leaving substantial room for improvement. In this paper, we propose NeR-SC, a neural representation framework tailored for screen content video. Building on the SNeRV backbone, NeR-SC introduces three screen-content-specific modules: (i) a learnable color palette that models the discrete color structure of screen content by restricting the low-frequency sub-band to a learned color set; (ii) a multi-gate dense fusion module that replaces sequential feature fusion with dense, attention-gated cross-stage interaction; and (iii) an embedding-level frame skip strategy that bypasses redundant decoder invocations for static frames, with zero training overhead. Experiments on DSCVC and VCD show that NeR-SC achieves 40.32 dB and 41.73 dB average PSNR, outperforming representative neural video representation methods and, at low bitrates, surpassing H.264 and H.265. The skip strategy enables real-time decoding with no loss in quality.
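
The embedding-level frame skip in module (iii) can be pictured with the Python sketch below. It is an assumption-laden illustration, not the NeR-SC code: frame embeddings are plain lists of floats, the decoder is any deterministic callable, and a frame counts as static when its embedding matches the last decoded one within `tol`. The names `decode_with_skip` and `tol` are hypothetical.

```python
def decode_with_skip(embeddings, decode, tol=0.0):
    """Decode frames, reusing the last output when the embedding is unchanged.

    Returns the list of decoded frames and the number of decoder calls made.
    """
    frames, calls = [], 0
    last_emb, last_frame = None, None
    for emb in embeddings:
        static = (
            last_emb is not None
            and len(emb) == len(last_emb)
            and all(abs(a - b) <= tol for a, b in zip(emb, last_emb))
        )
        if not static:
            last_frame = decode(emb)  # embedding changed: run the decoder
            last_emb = emb            # compare later frames against this one
            calls += 1
        frames.append(last_frame)     # static frames reuse the cached output
    return frames, calls
```

With `tol=0.0` and a deterministic decoder the output is identical to decoding every frame, which is consistent with the abstract's claim of no quality loss; the saving comes entirely from the skipped decoder calls on static frames.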

2605.27023 2026-05-27 cs.AI

Boosting Knowledge Graph Foundation Models via Enhanced Negative Sampling

Yinan Liu, Wenjin Xu, Zhiyuan Zha, Xiaochun Yang, Bin Wang

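AI summary: Proposes KMAS, an adaptive negative sampling method that dynamically adjusts the ratio of hard negative triples to improve knowledge graph foundation models on zero-shot completion.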

Abstract

Knowledge graphs (KGs) have become the backbone of numerous downstream tasks such as question answering and recommender systems. However, KGs are often highly incomplete. To perform zero-shot knowledge graph completion on unseen KGs, whose relational vocabularies differ from those used for pre-training, KG foundation models (KGFMs) have received wide attention. Existing KGFMs are often trained with random negative triples, which are constructed by replacing the head or tail entity of a positive triple with a random entity. However, these negative triples are often of limited quality and provide weak supervision for KGFM training. In this paper, we propose a simple yet effective adaptive negative sampling approach, KMAS, to enhance existing KGFMs. KMAS constructs hard negative triples from the updated relation embeddings generated by the existing KGFM's relation encoder. To align adaptively with the evolving capability of the KGFM during training, KMAS adjusts the ratio of hard negative triples dynamically throughout training: after a warmup phase, it increases the ratio linearly and then decreases it linearly. Extensive experiments are conducted on 44 datasets. The results demonstrate that the proposed negative sampling method can enhance many SOTA KGFMs without requiring excessive additional time or memory.
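
The warmup, linear increase, then linear decrease of the hard-negative ratio can be written as a small schedule function. The Python sketch below is a guess at the shape only: the abstract does not say where the peak falls, how high it is, or whether the ratio returns to zero, so the midpoint peak, the `peak_ratio` default, and the function name are all assumptions.

```python
def hard_negative_ratio(step, total_steps, warmup_steps, peak_ratio=0.5):
    """Fraction of hard negatives to use at a given training step.

    0 during warmup, then a linear ramp up to peak_ratio at the midpoint of the
    remaining steps, then a linear ramp back down to 0 at the end of training.
    """
    if step < warmup_steps or total_steps <= warmup_steps:
        return 0.0
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    progress = min(progress, 1.0)  # clamp steps past the end of training
    if progress <= 0.5:
        return peak_ratio * (progress / 0.5)
    return peak_ratio * ((1.0 - progress) / 0.5)
```

A training loop would call this once per step and draw that fraction of each batch's negatives from the hard-negative sampler, with the rest drawn at random.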

2605.27022 2026-05-27 cs.AI

ORCA: An End-to-End Interactive Copilot for Optimized Root Cause Analysis

Phi Nguyen Xuan, Nicholas Tagliapietra, Lavdim Halilaj, Kristian Kersting, Juergen Luettin

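AI summary: Proposes ORCA, a copilot for end-to-end causal analysis that orchestrates agents to understand user goals and guide users through causal analysis workflows ranging from fully automatic to highly user-guided, covering causal discovery, effect estimation, explainability, and root cause analysis, and producing structured reports.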

Abstract

Causal analysis is a crucial task in many domains, including manufacturing, social science, and medicine. However, despite recent progress, the conceptual and methodological complexity of causal methods makes them largely inaccessible to domain experts. This gap prevents experts from leveraging these advances and hinders researchers who lack access to real-world data for validation. To bridge this divide, we introduce ORCA, a copilot for end-to-end causal analysis. ORCA orchestrates agents to understand the user's goals and guide them through the most appropriate causal analysis workflow, from fully automatic to highly user-guided execution. It features causal discovery, causal effect estimation, explainability, and root cause analysis (RCA). ORCA evaluates and compares performance, generates key metrics and diagrams, and produces insights through structured reports. We highlight its effectiveness across several real-world use cases.

2605.27020 2026-05-27 cs.CV cs.AI

Black-box Membership Inference Attacks on the Pre-training Data of Image-generation Models

Tao Qi, Huili Wang, Yuanhong Huang, Wendan Wang, Lianchao Zhao, Jinrui Wang, Zichen Qin, Shangguang Wang, Yongfeng Huang

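AI summary: Proposes SD-MIA, a black-box membership inference attack framework based on cross-modal data perturbation, which detects membership in pre-training data by analyzing how a diffusion model denoises a target image together with perturbed textual instructions.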

Comments
13 pages, 9 figures; CVPR 2026 camera-ready
AI中文摘要

基于扩散的图像生成模型的快速发展引发了对涉及人类创建数据的潜在版权和隐私侵犯的严重担忧。成员推断攻击(MIA)已成为识别模型训练期间未经授权数据使用的有前景工具。现有方法通常评估模型对扰动嫌疑图像的去噪能力作为成员状态的指标。然而,此类特征的判别能力高度依赖于模型记忆程度,并且在应用于曝光较少的数据(例如预训练数据)时显著下降。尽管有几种方法尝试通过利用模型内部特征来增强检测,但这些特征在主流闭源图像生成平台中通常不可访问,限制了其实用性。在本文中,我们证明分析黑盒扩散模型如何对目标图像和相应的扰动文本指令进行去噪可以揭示更具区分性的成员线索。基于这一见解,我们提出了一种名为SD-MIA的黑盒成员推断攻击框架,该框架利用跨模态数据扰动机制来检测扩散模型中的预训练数据。我们在一个公共基准数据集和一个新构建的数据集上进行了广泛实验,每个数据集包含具有相同分布的预训练成员和非成员样本。实验结果表明,SD-MIA相比现有基线(包括那些具有不公平访问模型内部特征优势的基线)实现了更优的性能。

English abstract

The rapid advancement of diffusion-based image generation models has raised serious concerns regarding potential copyright and privacy infringements involving human-created data. Membership inference attacks (MIAs) have emerged as a promising tool for identifying unauthorized data usage during model training. Existing methods typically assess a model's ability to denoise perturbed suspect images as an indicator of membership status. However, the discriminative power of such features is highly dependent on the degree of model memorization and deteriorates significantly when applied to less exposed data (e.g., pre-training data). Although several methods attempt to enhance detection by leveraging internal model features, these features are generally inaccessible in mainstream closed-source image generation platforms, limiting their practicality. In this paper, we demonstrate that analyzing how a black-box diffusion model denoises a target image and corresponding perturbed textual instructions can reveal more distinctive membership cues. Based on this insight, we propose a black-box membership inference attack framework (named SD-MIA) that leverages a cross-modal data perturbation mechanism to detect pre-training data in diffusion models. We conduct extensive experiments on both a public benchmark dataset and a newly constructed dataset, each comprising pre-training membership and non-membership samples with identical distributions. Experimental results demonstrate that SD-MIA achieves superior performance compared to existing baselines, including those with the unfair advantage of accessing internal model features.

2605.27016 2026-05-27 cs.CL cs.AI cs.LG stat.ML

Evaluating the Relevance of Uncertainty Estimators for LLM Hallucination

评估不确定性估计器与LLM幻觉的相关性

Yedidia Agnimo, Anna Korba, Annabelle Blangero, Nicolas Chesneau, Karteek Alahari

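AI summary: Through a systematic empirical study, evaluates how information-theoretic, sampling-based, and reflexive uncertainty estimators relate to LLM hallucinations, finding that the association is highly variable and often weak, which challenges the use of uncertainty as a direct signal of hallucination.

Details
Comments
35 pages, 7 figures, 9 tables
AI Chinese abstract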

大型语言模型(LLM)容易产生幻觉,即与输入或训练数据不符的陈述,阻碍了可靠部署。同时,许多不确定性估计(UE)方法被提出来量化模型置信度,并常被隐含地视为模型失败的代理。然而,不确定性与幻觉之间的关系尚未得到充分表征。我们对不确定性估计器与LLM幻觉之间的关联进行了系统的实证研究。我们不是假设这种关联,而是直接评估它在何时以及在多大程度上成立。我们考虑了多种不确定性估计器,包括信息论、基于采样和反思性估计器,并检查了它们在幻觉设置中的行为。我们的实验涵盖了内在幻觉(违反输入忠实性)和外在幻觉(相对于训练数据的无根据主张),使用了四个互补基准,包括RAGTruth和HalluLens。我们发现,这种关联性高度可变且通常较弱,取决于幻觉类型和所评估的LLM。这些结果挑战了将不确定性作为幻觉直接信号的做法,并阐明了何时它能提供可操作的信息。

English abstract

Large language models (LLMs) are prone to hallucinations, i.e., statements unsupported by the input or training data, hindering reliable deployment. In parallel, numerous uncertainty estimation (UE) methods have been proposed to quantify model confidence and are often implicitly treated as proxies for model failure. However, the relationship between uncertainty and hallucinations remains insufficiently characterized. We present a systematic empirical study of the association between uncertainty estimators and hallucinations in LLMs. Rather than assuming this association, we evaluate directly when and to what extent it holds. We consider a diverse set of uncertainty estimators, including information-theoretic, sampling-based, and reflexive estimators, and examine their behavior across hallucination settings. Our experiments cover both intrinsic hallucinations (violations of input faithfulness) and extrinsic hallucinations (unsupported claims relative to training data), using four complementary benchmarks, including RAGTruth and HalluLens. We find that the association is highly variable and often weak, depending on the hallucination type and the LLM under evaluation. These results challenge the use of uncertainty as a direct signal of hallucination and clarify when it provides actionable information.
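The kind of association the abstract evaluates can be illustrated with a minimal sketch. It assumes per-token probability distributions and binary hallucination labels are already available. The names `mean_token_entropy` and `auroc` are hypothetical helpers, not the paper's code, and mean token entropy stands in for just one information-theoretic estimator among the families the abstract lists.

```python
import math


def mean_token_entropy(token_distributions):
    """Average Shannon entropy (in nats) over per-token probability distributions."""
    entropies = []
    for dist in token_distributions:
        entropies.append(-sum(p * math.log(p) for p in dist if p > 0.0))
    return sum(entropies) / len(entropies)


def auroc(scores, labels):
    """Probability that a hallucinated response (label 1) receives a higher
    uncertainty score than a faithful one (label 0); ties count as 0.5."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one example of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    # Toy data (made up for illustration): two responses, each a list of
    # per-token distributions over a three-word vocabulary.
    responses = [
        [[0.9, 0.05, 0.05], [0.8, 0.1, 0.1]],   # confident response
        [[0.4, 0.3, 0.3], [0.34, 0.33, 0.33]],  # uncertain response
    ]
    hallucinated = [0, 1]
    uncertainty = [mean_token_entropy(r) for r in responses]
    print(auroc(uncertainty, hallucinated))
```

An AUROC near 0.5 would correspond to the weak association the abstract reports for some settings, while a value near 1.0 would mean uncertainty ranks hallucinated responses above faithful ones.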

2605.27015 2026-05-27 cs.CL

PersLitEval: Fine-grained Benchmark and Evaluation of LLMs on Persian Literature Questions

PersLitEval:波斯文学问题上的细粒度基准与LLM评估

Ruhallah Niazi, Faeze Ghorbanpour, Alexander Fraser

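AI summary: Introduces PersLitEval, a benchmark of 4,514 Persian literature multiple-choice questions, and evaluates six LLMs under ten prompting strategies; models are accurate on conceptual similarity tasks but struggle with formal linguistic analysis such as spelling and word formation, and prompting strategy significantly affects performance.

Details
AI Chinese abstract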

尽管多语言能力令人印象深刻,但大型语言模型(LLM)在非英语语言的文学知识方面仍然缺乏充分评估。我们引入了PersLitEval,这是一个包含4514道波斯文学多选题的基准,涵盖拼写、修辞手法、语法、词汇、构词和概念理解等八个细粒度类别,题目来源于Konkur大学入学考试材料。我们评估了六种LLM在十种提示策略下的表现,揭示了三个难度层级上显著的类别差异:模型在概念相似性任务上准确率较高,但在正式语言分析上表现不佳,其中拼写和构词对所有模型来说都是最难的。提示策略对性能有显著影响,其中带解释的少样本示例效果最佳,尤其是在正式语言类别上。错误分析识别出三种失败模式:语义理解差距、正式语言知识差距以及计数/枚举错误,表明不同类别需要不同的改进策略。

English abstract

Despite impressive multilingual capabilities, large language models (LLMs) remain poorly evaluated on literary knowledge in non-English languages. We introduce PersLitEval, a benchmark of 4,514 Persian literature multiple-choice questions across eight fine-grained categories spanning spelling, literary devices, grammar, vocabulary, word formation, and conceptual understanding, sourced from materials for the Konkur university entrance examination. We evaluate six LLMs across ten prompting strategies, revealing striking category-level disparities across three tiers of task difficulty: models reach higher accuracy on conceptual similarity tasks but struggle with formal linguistic analysis, with spelling and word formation proving the hardest across all models. Prompting strategy has a significant impact on performance, with explained few-shot examples yielding the best results, particularly on formal linguistic categories. An error analysis identifies three failure modes: semantic comprehension gaps, formal linguistic knowledge gaps, and counting/enumeration errors, suggesting that different categories require different improvement strategies.

2605.27014 2026-05-27 cs.LO cs.AI

ReasonOps: A Unified Operational Paradigm for Trustworthy Verified LLM Reasoning

ReasonOps: 可信验证的LLM推理的统一操作范式

Adnan Rashid

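AI summary: Proposes ReasonOps, a unified paradigm that treats reasoning as a continuously monitored, verifiable, reliability-aware operational process. It integrates semantic interpretation, autoformalization, symbolic reasoning, theorem proving, runtime assurance, probabilistic reliability estimation, and adaptive correction to address problems in current LLM reasoning such as logical inconsistencies and hallucinated symbolic transitions.

Details
Comments
5 pages
AI Chinese abstract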

大型语言模型(LLM)已将人工智能从主要生成系统转变为日益强大的推理代理。最近在定理证明、自动形式化、符号推理和工具增强语言模型方面的进展表明,在机器辅助形式推理方面取得了实质性进展。然而,当前的推理系统仍然存在隐藏的逻辑不一致、幻觉符号转换、无支持的定理应用以及有限可靠性保证。现有方法在形式验证、运行时保证、神经符号推理和可信人工智能(AI)研究社区之间仍然分散。本文介绍了ReasonOps,一种用于可信验证推理系统的统一操作范式。受DevOps和MLOps等操作生态系统的启发,ReasonOps将推理视为一个持续监控、可验证、可靠性感知的操作过程,而不是一个孤立的推理任务。所提出的范式将语义解释、自动形式化、符号推理、定理证明、运行时保证、概率可靠性估计和自适应修正整合到一个统一的推理生命周期中。本文进一步介绍了ReasonOps架构,使用自主制动系统分析示例演示了其工作流程,并讨论了其在未来安全关键自主AI系统中的潜在作用。我们认为,像ReasonOps这样的操作推理范式可能成为下一代可信AI生态系统的基础设施。

English abstract

Large Language Models (LLMs) have transformed artificial intelligence from primarily generative systems into increasingly capable reasoning agents. Recent advances in theorem proving, autoformalization, symbolic reasoning, and tool-augmented language models demonstrate substantial progress toward machine-assisted formal reasoning. However, current reasoning systems still suffer from hidden logical inconsistencies, hallucinated symbolic transitions, unsupported theorem applications, and limited reliability guarantees. Existing approaches remain fragmented across formal verification, runtime assurance, neuro-symbolic reasoning and trustworthy Artificial Intelligence (AI) research communities. This paper introduces ReasonOps, a unified operational paradigm for trustworthy verified reasoning systems. Inspired by operational ecosystems such as DevOps and MLOps, ReasonOps treats reasoning as a continuously monitored, verifiable, reliability-aware operational process rather than an isolated inference task. The proposed paradigm integrates semantic interpretation, autoformalization, symbolic reasoning, theorem proving, runtime assurance, probabilistic reliability estimation, and adaptive correction into a unified reasoning lifecycle. The paper further presents the ReasonOps architecture, demonstrates its workflow using an autonomous braking system analysis example, and discusses its potential role in future safety-critical autonomous AI systems. We argue that operational reasoning paradigms such as ReasonOps may become foundational infrastructure for next-generation trustworthy AI ecosystems.

2605.27013 2026-05-27 cs.AI

Generating Robust Portfolios of Optimization Models using Large Language Models

使用大型语言模型生成鲁棒的优化模型组合

Eleni Straitouri, Cheol Woo Kim, Milind Tambe

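AI summary: Proposes a unified framework that uses an LLM as both a stochastic generator and a reasoning evaluator to produce a robust portfolio of optimization models, with a guarantee that the portfolio contains high-quality candidates whenever either the generator or the evaluator is well aligned with human preferences.

Details
Comments
Accepted at the ICML 2026 LM4Plan Workshop
AI Chinese abstract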

数学优化是跨领域(如资源分配和规划)进行结构化决策的强大工具。然而,制定忠实于现实的优化模型仍然是一个重大瓶颈,因为它通常需要领域专业知识和优化知识,而这些往往是稀缺的。最近大型语言模型(LLM)的进展有望弥合这一差距,使得从自然语言描述中生成候选优化模型成为可能。然而,无法保证任何单个LLM生成的模型是可靠的,因此仅输出一个模型的现有方法存在风险。在这项工作中,我们提出了一种新颖的算法,生成一个优化模型组合,旨在对LLM的局限性具有鲁棒性。我们的方法利用了一个观察:单个LLM可以扮演两个不同的角色——作为随机生成器和作为推理评估器——并提出了一个统一的框架,以互补的方式利用这两种能力。我们提供了理论保证,表明只要生成器或评估器中至少有一个与人类偏好良好对齐,该组合就保证包含高质量的候选模型,从而实现一个原则性的人机交互过程,决策者可以在承诺使用一个模型之前审查多个候选模型。我们进一步通过实验验证了我们的方法,展示了在一系列优化建模任务中的强大性能。

English abstract

Mathematical optimization is a powerful tool for structured decision-making across domains such as resource allocation and planning. Formulating optimization models faithful to reality, though, remains a significant bottleneck as it typically demands both domain expertise and optimization knowledge that are often scarce. Recent advances in large language models (LLMs) promise to bridge this gap, enabling the generation of candidate optimization models from natural language descriptions. However, there is no guarantee that any single LLM-generated model is reliable, and existing approaches that output only one model are therefore risky. In this work, we propose a novel algorithm that generates a portfolio of optimization models, designed to be robust to the limitations of LLMs. Our method exploits the observation that a single LLM can play two distinct roles (as a stochastic generator and as a reasoning evaluator) and proposes a unified framework that leverages both capabilities in a complementary manner. We provide theoretical guarantees showing that, as long as either the generator or the evaluator is well-aligned with human preferences, the portfolio is guaranteed to contain high-quality candidates, enabling a principled human-in-the-loop process in which a decision-maker can review multiple candidates before committing to one. We further validate our approach empirically, demonstrating strong performance across a range of optimization modeling tasks.

2605.27009 2026-05-27 cs.LG

SCENT: Aligning Mass Spectra with Molecular Structure for Olfactory Perception

SCENT: 将质谱与分子结构对齐用于嗅觉感知

Ziqi Zhang, Eunyeong Jin, Miguel Vasco, Farzaneh Taleb, Nona Rajabi, Alexandra Gutmann, Jonathan Williams, Antônio H. Ribeiro, Danica Kragic

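AI summary: Proposes SCENT, a multi-modal contrastive learning framework that aligns electron ionization mass spectrum representations with pretrained chemical structure embeddings, achieving olfactory prediction performance comparable to structure-based models without requiring molecular structure at inference.

Details
AI Chinese abstract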

从分子结构预测人类嗅觉感知已取得显著进展,但这些方法在推理时需要明确的化学结构,而这在实际传感场景中并不可用。我们通过探索直接电子电离质谱(EI-MS)作为嗅觉预测的替代输入模态来弥补这一差距,该传感技术可在数秒内获取化学信息丰富的碎片指纹。我们提出了谱图到化学嵌入对齐(SCENT),这是一个多模态对比学习框架,它将EI-MS表示与预训练的化学结构嵌入对齐,同时在推理时仅需要质谱。在多标签气味描述符预测任务中,SCENT显著优于仅使用MS的基线,并实现了与基于结构的模型相当的性能,尽管在测试时不需要明确的分子结构。学习到的表示还能更好地逼近连续的人类感知评分,并泛化到真实实验室测量的谱图,表明跨模态对齐是将分析谱图嵌入化学语义的有效策略。

English abstract

Predicting human olfactory perception from molecular structure has seen remarkable progress, yet these approaches require explicit chemical structure at inference, which is not available in practical sensing settings. We address this gap by exploring direct electron ionization mass spectrometry (EI-MS), a sensing technique that acquires chemically informative fragmentation fingerprints in seconds, as an alternative input modality for olfactory prediction. We contribute Spectrum-to-Chemical Embedding alignmeNT (SCENT), a multi-modal contrastive learning framework that aligns EI-MS representations with pretrained chemical structure embeddings, while requiring only mass spectra at inference. On the multi-label odor descriptor prediction task, SCENT significantly outperforms MS-only baselines and achieves performance comparable to structure-based models, despite requiring no explicit molecular structure at test time. The learned representations also better approximate continuous human perceptual ratings and generalize to real-world lab-measured spectra, suggesting that cross-modal alignment is an effective strategy for grounding analytical spectra in chemical semantics.

2605.27006 2026-05-27 cs.LG cond-mat.dis-nn stat.ML

Sampling Data with Chains of Forward-Backward Diffusion Steps

通过前向-反向扩散步骤链采样数据

Hyunmo Kang, Noam Itzhak Levi, Corinna Elena Wegner, Daniel J. Korchinski, Matthieu Wyart

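AI summary: Introduces U-turn chains, Markov chains built by iterating short forward-backward steps of a diffusion model and paired with a Metropolis-Hastings correction to sample from energy-modified targets; finds that minimal U-turn dynamics undergo an ergodicity-breaking phase transition driven by fragmentation of the data manifold.

Details
AI Chinese abstract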

从学习到的高维分布中采样是一个基础的计算问题。我们引入U-turn链:通过迭代扩散模型的短前向-反向步骤获得的马尔可夫链,其中每一步提出一个保持在所学数据流形上的移动,并与Metropolis-Hastings校正配对,从能量修正目标中采样。对于合成语言,我们表明最小U-turn动力学经历由数据流形碎片化驱动的遍历性破缺相变;在更大的U-turn幅度下遍历性得以恢复。在非遍历区域,低层特征比高层特征松弛得更快,这种顺序仅在足够大的U-turn幅度下才会反转。我们在自然语言和自然图像上测试这些预测。在两种模态中,最小U-turn松弛缓慢,尤其是对于由CNN或LLM中深层表示近似的高层特征。层序反转仅在噪声足够大且混合高效时出现——这些特征与强约束、弱混合的局部动力学一致。我们讨论了这些结果对使用扩散模型采样的启示。

English abstract

Sampling from learned high-dimensional distributions is a foundational computational problem. We introduce U-turn chains: Markov chains obtained by iterating short forward-backward steps of a diffusion model, in which each step proposes a move that remains on the learned data manifold and, paired with a Metropolis-Hastings correction, samples from energy-modified targets. For synthetic languages, we show that minimal U-turn dynamics undergoes an ergodicity-breaking phase transition driven by fragmentation of the data manifold; ergodicity is restored at larger U-turn magnitude. In the non-ergodic regime, low-level features relax faster than high-level ones, an ordering that inverts only at sufficiently large U-turn magnitude. We test these predictions on natural language and natural images. In both modalities, minimal U-turns relax slowly, especially for high-level features approximated by deep representations in CNNs or LLMs. The layer-ordering inversion appears only at large noise when mixing is efficient. These signatures are consistent with strongly constrained, weakly mixing local dynamics. We discuss the implications of these results for sampling with diffusion models.
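As a toy illustration of a forward-backward chain with a Metropolis-Hastings correction, the sketch below replaces the learned diffusion model with an exact Gaussian posterior for one-dimensional data distributed as N(0, 1). This is an assumption made for illustration, not the paper's setup. In this toy case the proposal is reversible with respect to the data distribution, so the acceptance ratio reduces to the energy difference. The acceptance rule used in the paper may differ. The names `u_turn_proposal` and `run_chain` are hypothetical.

```python
import math
import random


def u_turn_proposal(x, alpha, rng):
    """One forward-backward step for data distributed as N(0, 1).

    Forward:  x_t = sqrt(alpha) * x + sqrt(1 - alpha) * noise.
    Backward: sample from the exact posterior
              p(x0 | x_t) = N(sqrt(alpha) * x_t, 1 - alpha).
    Smaller alpha means more noise, i.e. a larger U-turn.
    """
    x_t = math.sqrt(alpha) * x + math.sqrt(1.0 - alpha) * rng.gauss(0.0, 1.0)
    return math.sqrt(alpha) * x_t + math.sqrt(1.0 - alpha) * rng.gauss(0.0, 1.0)


def run_chain(energy, alpha, n_steps, seed=0, x0=0.0):
    """Sample from a target proportional to N(x; 0, 1) * exp(-energy(x))."""
    rng = random.Random(seed)
    x = x0
    samples = []
    for _ in range(n_steps):
        proposal = u_turn_proposal(x, alpha, rng)
        # The proposal satisfies detailed balance w.r.t. N(0, 1) in this toy
        # case, so only the energy term enters the acceptance probability.
        log_accept = energy(x) - energy(proposal)
        if log_accept >= 0.0 or rng.random() < math.exp(log_accept):
            x = proposal
        samples.append(x)
    return samples


if __name__ == "__main__":
    # A linear energy E(x) = -x tilts N(0, 1) into N(1, 1).
    chain = run_chain(lambda x: -x, alpha=0.5, n_steps=20000)
    print(sum(chain) / len(chain))
```

With a learned denoiser the backward step is only approximate, and the data manifold can be fragmented, which is where the slow relaxation discussed in the abstract would arise; the Gaussian toy above mixes quickly and shows none of that behavior.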

2605.27003 2026-05-27 cs.CV cs.AI

Timestep-Aware SVDQuant-GPTQ for W4A4 Quantization of Wan2.2-I2V

时间步感知的 SVDQuant-GPTQ 用于 Wan2.2-I2V 的 W4A4 量化

Junhao Wu, Dezhong Yao, Hai Jin

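AI summary: For W4A4 quantization of the Wan2.2-I2V video diffusion Transformer, proposes a post-training quantization framework combining SVDQuant low-rank outlier compensation, GPTQ reconstruction-aware residual weight quantization, and a timestep-binned per-layer activation clipping-ratio search; on OpenS2V-Eval it cuts peak GPU memory by 59.3% while losing only 0.9% in average VBench score.

Details
AI Chinese abstract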

大型视频扩散 Transformer 的 W4A4 量化提供了显著的内存节省,但面临两个主要挑战:稀疏的大幅度激活异常值,以及跨多步去噪轨迹的强时间步依赖的激活分布。这些困难因 Wan2.2-I2V 的双专家混合专家 DiT 设计而加剧,其高噪声和低噪声专家表现出不同的量化敏感性,单一全局校准策略无法捕捉。我们提出了一种后训练量化框架,结合基于 SVDQuant 的低秩异常补偿、基于 GPTQ 的重建感知残差权重量化,以及针对每个专家独立进行的时间步分箱逐层激活裁剪比搜索。在 OpenS2V-Eval 基准上,我们的方法相对于 BF16 基线将峰值 GPU 内存降低了 59.3%,同时仅导致 VBench 平均分数下降 0.9%,成像质量下降 2.3%,表明专家和时间步感知的校准对于 MoE 视频 DiT 的高保真 W4A4 推理至关重要。

English abstract

W4A4 quantization of large video diffusion Transformers offers substantial memory savings but is hindered by two main challenges: sparse large-magnitude activation outliers, and strongly timestep-dependent activation distributions across the multi-step denoising trajectory. These difficulties are compounded by Wan2.2-I2V's two-expert Mixture-of-Experts DiT design, whose high-noise and low-noise experts exhibit distinct quantization sensitivities that a single global calibration policy cannot capture. We propose a post-training quantization framework combining SVDQuant-based low-rank outlier compensation, GPTQ-based reconstruction-aware residual weight quantization, and timestep-bin-wise per-layer activation clipping-ratio search conducted independently for each expert. On the OpenS2V-Eval benchmark, our method reduces peak GPU memory by 59.3% relative to the BF16 baseline while incurring only a 0.9% drop in VBench average score and a 2.3% drop in Imaging Quality, demonstrating that expert- and timestep-aware calibration is essential for high-fidelity W4A4 inference on MoE video DiTs.
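The idea of a per-bin clipping-ratio search can be sketched as follows, assuming symmetric per-tensor 4-bit fake quantization and a mean-squared-error criterion. Both are illustrative choices rather than details confirmed by the abstract, and the SVDQuant and GPTQ components are not shown. The names `quantize_int4`, `search_clip_ratio`, and `search_per_bin` are hypothetical.

```python
def quantize_int4(values, clip_ratio):
    """Symmetric fake quantization to signed 4-bit levels in [-8, 7].

    The scale maps clip_ratio * max|value| onto level 7; anything beyond
    the representable range is clipped.
    """
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return list(values)
    scale = clip_ratio * max_abs / 7.0
    quantized = []
    for v in values:
        level = max(-8, min(7, round(v / scale)))
        quantized.append(level * scale)
    return quantized


def mean_squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def search_clip_ratio(values, candidates=(1.0, 0.9, 0.8, 0.7, 0.6, 0.5)):
    """Pick the candidate ratio with the lowest reconstruction error."""
    return min(
        candidates,
        key=lambda r: mean_squared_error(values, quantize_int4(values, r)),
    )


def search_per_bin(activations_by_bin, candidates=(1.0, 0.9, 0.8, 0.7, 0.6, 0.5)):
    """Run the search independently for each timestep bin (hypothetical keys)."""
    return {
        bin_id: search_clip_ratio(acts, candidates)
        for bin_id, acts in activations_by_bin.items()
    }


if __name__ == "__main__":
    # Many small activations plus one outlier: clipping the outlier buys
    # finer resolution for everything else.
    early = [(i / 500.0) - 1.0 for i in range(1001)] + [3.0]
    late = [-1.0, 0.0, 1.0]
    print(search_per_bin({"early": early, "late": late}))
```

Searching each bin separately lets a bin dominated by outliers choose an aggressive clip while a well-behaved bin keeps the full range, which is the motivation the abstract gives for timestep-aware calibration.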

2605.26999 2026-05-27 cs.CL cs.CR

Prompt Injection Detection is Regime-Dependent: A Deployment-Aware Evaluation with Interpretable Structural Signals

提示注入检测是依赖于场景的:一种基于可解释结构信号的部署感知评估

Akindoyin Akinrele, Shreyank N Gowda

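AI summary: Using a multi-model, multi-regime experimental framework, evaluates prompt injection detection methods and finds that detection performance depends heavily on the deployment regime and threshold selection; Transformer-based models perform best, and structural signals give modest but consistent gains in certain regimes.

Details
AI Chinese abstract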

提示注入对大型语言模型的安全部署构成严重威胁,然而现有的检测方法通常在有限的设置下进行评估,未能反映真实世界的操作约束。在这项工作中,我们使用多模型和多场景实验框架,对提示注入检测进行了部署感知评估。我们比较了基于词汇、语义、结构和Transformer的检测器,在多个分布外设置、重复数据划分以及排名和阈值部署指标下的表现。我们引入了可解释的结构信号,这些信号捕捉了层次覆盖、系统提示欺骗、角色重定义和逃避模式,并评估了它们在稀疏模型中以及与强编码器基线结合时的贡献。我们的结果表明,检测性能高度依赖于场景,并且对阈值选择敏感,没有单一模型在所有设置中占据主导地位。基于Transformer的模型实现了最强的整体性能,而结构信号在特定场景下提供了适度但一致的改进,并在更困难的场景中改善了低假阳性率行为。这些发现凸显了排名性能与部署有效性之间的差距,并强调了在现实操作约束下评估提示注入防御的重要性。代码将发布。

English abstract

Prompt injection poses a critical threat to the safe deployment of large language models, yet existing detection approaches are typically evaluated under limited settings that do not reflect real-world operating constraints. In this work, we present a deployment-aware evaluation of prompt injection detection using a multi-model and multi-regime experimental framework. We compare lexical, semantic, structural, and transformer-based detectors across multiple out-of-distribution settings, repeated data splits, and both ranking and thresholded deployment metrics. We introduce interpretable structural signals that capture hierarchy overrides, system prompt spoofing, role redefinition, and evasion patterns, and assess their contribution both within sparse models and in combination with strong encoder baselines. Our results show that detection performance is highly regime-dependent and sensitive to threshold selection, with no single model dominating across all settings. Transformer-based models achieve the strongest overall performance, while structural signals provide modest but consistent gains in certain regimes and improve low false positive rate behaviour in harder scenarios. These findings highlight the gap between ranking performance and deployment effectiveness and underscore the importance of evaluating prompt injection defences under realistic operational constraints. Code will be released.
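The gap between ranking metrics and thresholded deployment metrics can be made concrete with a small sketch of a true positive rate at a fixed false positive rate budget. It assumes each detector outputs a score where higher means more likely an injection. The function names are hypothetical, and the abstract does not state which thresholded metrics the paper actually reports.

```python
def threshold_at_fpr(benign_scores, max_fpr):
    """Return a threshold such that flagging scores strictly above it
    keeps the false positive rate on benign prompts at or below max_fpr."""
    ordered = sorted(benign_scores, reverse=True)
    allowed = int(max_fpr * len(ordered))  # benign prompts we may flag
    if allowed >= len(ordered):
        return float("-inf")
    return ordered[allowed]


def rate_above(scores, threshold):
    """Fraction of scores strictly above the threshold."""
    return sum(1 for s in scores if s > threshold) / len(scores)


if __name__ == "__main__":
    # Made-up detector scores for illustration.
    benign = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95]
    attacks = [0.92, 0.85, 0.99, 0.5]
    threshold = threshold_at_fpr(benign, max_fpr=0.1)
    print("threshold:", threshold)
    print("FPR:", rate_above(benign, threshold))
    print("TPR:", rate_above(attacks, threshold))
```

Calibrating the threshold on one regime and measuring `rate_above` on another is one simple way to expose the threshold sensitivity the abstract describes, since a detector can rank well overall yet catch few attacks at a strict false positive budget.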

2605.26998 2026-05-27 cs.LG q-bio.NC

Probabilistic Recurrent Intention Switching Model

概率递归意图切换模型

Wenyuan Sheng, Hao Zhu, Joschka Boedecker

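AI summary: Proposes PRISM, which models non-stationary intention switching with a lightweight recurrent network, yields an exact EM decomposition with closed-form solutions, and achieves the best likelihood while recovering interpretable intentions on gridworld, mouse labyrinth, and robotic manipulation tasks.

Details
AI Chinese abstract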

逆强化学习(IRL)从观察到的行为中恢复奖励函数,但传统方法假设单一固定奖励,无法捕捉一个回合内的目标切换。最近的多意图IRL方法通过分割轨迹来解决这一问题,但将意图转换建模为无记忆马尔可夫链或通过固定历史窗口的手动状态增强。我们提出概率递归意图切换模型(PRISM),该模型用轻量级递归网络替代这两种机制,将观察历史映射到每步意图分布。我们证明由此产生的EM目标可以精确分解为独立的每意图奖励子问题,每个子问题可闭式求解,从而得到$\mathcal{O}(nK)$的E步,无需变分近似。我们在非马尔可夫网格世界、小鼠迷宫和BridgeData~V2机器人操作(首个大规模多意图IRL机器人应用)上评估PRISM。在所有设置中,PRISM在保持最高留出对数似然的同时,从未标记的演示中恢复出可命名、时间上连贯的意图,表明离散目标切换存在于生物和人工智能体中。

English abstract

Inverse reinforcement learning (IRL) recovers reward functions from observed behavior, yet traditional methods assume a single stationary reward that cannot capture goal switching within an episode. Recent multi-intention IRL methods address this by segmenting trajectories, but model intention transitions as either a memoryless Markov chain or via manual state augmentation with a fixed history window. We propose the Probabilistic Recurrent Intention Switching Model (PRISM), which replaces both mechanisms with a lightweight recurrent network that maps observation history to a per-step intention distribution. We prove that the resulting EM objective decomposes exactly into independent per-intention reward subproblems, each solvable in closed form, yielding an $\mathcal{O}(nK)$ E-step with no variational approximation. We evaluate PRISM on a non-Markovian gridworld, a mouse labyrinth, and BridgeData V2 robotic manipulation, the first large-scale robotic application of multi-intention IRL. Across all settings, PRISM achieves the highest held-out log-likelihood while recovering nameable, temporally coherent intentions from unlabeled demonstrations, suggesting that discrete goal switching is present in both biological and artificial agents.

2605.26992 2026-05-27 cs.CV

On the Robustness of Machine Unlearning for Vision-Language Models

机器遗忘在视觉-语言模型中的鲁棒性研究

Yujie Lin, Kaidi Jia, Jiayao Ma, Chengyi Yang, Jinsong Su

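AI summary: Presents the first systematic survey of the robustness of machine unlearning for vision-language models and, through three proposed attack paradigms, shows that existing methods often hide rather than fully remove the target knowledge.

Details
AI Chinese abstract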

视觉-语言模型(VLM)可能会记忆训练数据中的不良信息,这激发了人们对机器遗忘的兴趣。在这项工作中,我们首次对VLM遗忘进行了系统调查和鲁棒性分析。我们提供了现有VLM遗忘方法的全面分类和回顾,以及在多种提示设置下的统一评估。然后,我们提出了三种攻击范式,以检验被遗忘的多模态知识是否可以通过上下文提示或下游微调重新激活。大量实验表明,许多现有方法在这些攻击下仍然脆弱,这表明当前方法往往隐藏而非完全移除目标知识。我们的研究为当前VLM遗忘方法的鲁棒性和局限性提供了新见解,并强调了需要更可靠的多模态遗忘策略。代码可在https://github.com/XMUDeepLIT/VLM-UnL-Attack获取。

English abstract

Vision-language models (VLMs) may memorize undesirable information from training data, motivating growing interest in machine unlearning. In this work, we present the first systematic survey and robustness analysis of VLM unlearning. We provide a comprehensive taxonomy and review of existing VLM unlearning methods, together with unified evaluations under multiple prompt settings. We then propose three attack paradigms to examine whether forgotten multimodal knowledge can be reactivated through contextual prompting or downstream retraining. Extensive experiments show that many existing methods remain vulnerable under these attacks, indicating that current approaches often hide rather than fully remove target knowledge. Our study provides new insights into the robustness and limitations of current VLM unlearning methods and highlights the need for more reliable multimodal unlearning strategies. Code is available at https://github.com/XMUDeepLIT/VLM-UnL-Attack.

2605.26991 2026-05-27 cs.RO

Towards Shared Embodied Intelligence in Humanoid Robots through Optimization Development and Testing of the Human Aware ergoCub Robot

通过优化开发与测试人类感知的ergoCub机器人迈向人形机器人的共享具身智能

Carlotta Sartore, Mohamed Elobaid, Lorenzo Rapetti, Giulio Romualdi, Stefano Dafarra, Nicola A. Piga, Ines Sorrentino, Paolo Maria Vicecone, Silvio Traversaro, Ugo Pattacini, Luca Fiorio, Francesco Draicchio, Giovanna Tranfo, Lorenzo Natale, Marco Maggiali, Daniele Pucci

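AI summary: Proposes an architecture that unifies shared intelligence and embodied cognition, optimizing robot hardware and control against human ergonomic metrics to enable physical human-robot collaboration, with the ergoCub humanoid robot as its concrete implementation.

Details
AI Chinese abstract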

协作是人类行为的核心,使得完成超出个人能力的任务成为可能。这种能力源于通过对他人的内部表征来协调行动,这一概念被称为共享智能。此外,人类以其身体和认知能力为特征,这些能力会根据环境进行优化,这种现象被称为具身认知。设计能够安全有效地与人协作的人形机器人需要统一这些原则。在此,我们提出一种整合共享智能与具身认知的架构,使机器人能够与人类进行物理协作,其中机器人硬件和控制针对人体指标进行优化,利用人体和运动智能的表征。最终目标是实现一种共享具身智能的形式。具体而言,我们的架构根据人体工程学指标优化机器人硬件和物理智能参数。这是通过将人机交互建模为硬件配置的函数,并将人体模型嵌入机器人的物理智能中来实现的。作为具体实现,我们介绍了人形机器人ergoCub,其形态和控制已针对与人类的协作任务进行了优化。我们的方法为设计在硬件和物理智能层面优先考虑人体工程学的人形机器人提供了一个框架,并应用于工业和辅助机器人领域。

English abstract

Collaboration is central to human behavior, enabling tasks beyond individual capability. This ability arises from coordinating actions through internal representations of others, a concept known as shared intelligence. Additionally, humans are characterized by physical bodies and cognitive abilities that are optimized in response to their environment, a phenomenon referred to as embodied cognition. Designing humanoid robots that collaborate safely and effectively with people requires unifying these principles. Here we propose an architecture that integrates shared intelligence and embodied cognition to enable robots to physically collaborate with humans, where robot hardware and control are optimized for human metrics, using representations of the human body and motion intelligence. The ultimate goal is to achieve a form of shared embodied intelligence. Specifically, our architecture optimizes robot hardware and physical intelligence parameters with respect to human ergonomic metrics. This is accomplished by modeling human-robot interaction as a function of hardware configurations and embedding human models into the robot's physical intelligence. As a concrete implementation, we present the humanoid robot ergoCub, whose morphology and control have been optimized for collaborative tasks with humans. Our approach provides a framework for designing humanoid robots that prioritize human ergonomics at both the hardware and physical intelligence levels, with applications in industrial and assistive robotics.

2605.26990 2026-05-27 stat.ML cs.LG

Constrained Bayesian Experimental Design via Online Planning

通过在线规划的约束贝叶斯实验设计

Yujia Guo, Daolang Huang, Xinyu Zhang, Sammie Katt, Samuel Kaski, Ayush Bharti

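AI summary: Proposes a method that combines an offline pre-trained amortized policy and posterior network with online multi-step lookahead planning over scenario trees to optimize Bayesian experimental design under dynamic constraints, yielding more informative design sequences than existing methods at modest computational overhead.

Details
Comments
24 pages, 9 figures. Accepted at the Forty-Third International Conference on Machine Learning (ICML 2026)
AI Chinese abstract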

贝叶斯实验设计(BED)是一个用于数据高效顺序实验设计的理论框架。然而,现有的BED方法无法适应实际任务中由于预算限制、成本变化或物理约束(限制设计随时间演化)而产生的动态约束。在本文中,我们介绍了一种新的BED方法,通过将离线预训练的摊销策略和后验网络与使用场景树的在线多步前瞻规划相结合,实现了实验设计的约束优化。我们通过实验证明,在多种约束BED任务中,我们的方法相比现有方法产生了更信息丰富的设计序列,同时仅增加了适度的额外计算开销。

English abstract

Bayesian experimental design (BED) is a principled framework for data-efficient design of sequential experiments. However, existing BED methods are unable to adapt to dynamic constraints inherent in real-world tasks due to budget limitations, varying costs, or physical constraints that restrict how designs evolve over time. In this paper, we introduce a novel approach to BED that enables constrained optimization of experimental designs by combining offline pre-training of an amortized policy and a posterior network with online multi-step lookahead planning using scenario trees. We empirically demonstrate that our method yields substantially more informative design sequences than existing methods across a range of constrained BED tasks, while incurring only a modest additional computational overhead.

2605.26984 2026-05-27 cs.LG

TED: Related Party Transaction guided Tax Evasion Detection on Heterogeneous Graph

TED:基于关联方交易的异构图偷漏税检测

Yiming Xu, Bin Shi, Bo Dong, Jiaxiang Wang, Hua Wei, Qinghua Zheng

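AI summary: To address the failure of existing tax evasion detection methods to exploit the rich interaction information in tax scenarios, proposes TED, a heterogeneous graph neural network model that filters noise using related party transaction groups and captures deeper semantics with a hierarchical attention mechanism, significantly outperforming existing methods on real-world datasets.

Details
Comments
Accepted by Data Mining and Knowledge Discovery (DMKD25)
AI Chinese abstract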

偷漏税导致政府收入严重损失并扰乱公平竞争的经济秩序。为缓解这一问题,最新的偷漏税检测解决方案利用专家知识提取特征,然后训练分类器判断公司是否涉嫌偷漏税。然而,现有方案主要关注公司的统计特征,未能利用税务场景中丰富的交互信息,从而影响检测性能。在本文中,我们首先将税务场景建模为异构图,并研究异构图模型下的偷漏税检测问题。为了提高偷漏税检测的性能,提出了一种新颖的图神经网络模型来提取异构图的综合信息。具体来说,我们利用异构且复杂的关联方交易组来过滤低层噪声信息。此外,设计了一种层次注意力机制来捕获关联方交易组中隐藏的更深层次结构和语义信息。我们将该方法应用于税务局的真实风险管理系统,并在两个人工标注的真实世界税务数据集上进行评估。结果表明,我们的方法在偷漏税检测任务上显著优于现有最先进方法。

English abstract

Tax evasion causes severe losses of government revenues and disturbs the economic order of fair competition. To help alleviate this problem, the latest tax evasion detection solutions utilize expert knowledge to extract features and then train classifiers to determine whether a company is suspected of tax evasion. However, existing solutions mainly focus on the statistical features of the company and fail to exploit the rich interactive information in tax scenarios, which degrades detection performance. In this paper, we first model the tax scenario as a heterogeneous graph and study the tax evasion detection problem under the heterogeneous graph model. To improve the performance of tax evasion detection, a novel graph neural network model is proposed to extract the comprehensive information of heterogeneous graphs. Specifically, we use heterogeneous and complex related party transaction groups to filter low-level noise information. Moreover, a hierarchical attention mechanism is designed to capture the deeper structure and semantic information hidden in the related party transaction group. We apply our method to the real risk management system of the tax bureau, and evaluate it on two human-labeled real-world tax datasets. The results demonstrate that our method significantly outperforms the state-of-the-art in the tax evasion detection task.

2605.26978 2026-05-27 cs.CL cs.SD

PashtoTTS-Bench: automated screening for low-resource non-Latin-script text-to-speech

PashtoTTS-Bench:低资源非拉丁文字文本转语音的自动化筛选

Hanif Rahman

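AI summary: To address the shortcomings of relying on a single ASR round-trip WER when evaluating TTS for low-resource non-Latin-script languages, proposes the INSV reporting framework and its automated screening subset INSV-A, instantiated as the PashtoTTS-Bench benchmark, which evaluates multiple TTS systems with several metrics.

Details
AI Chinese abstract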

对于低资源非拉丁文字语言,当文本转语音(TTS)评估依赖于单一的ASR往返词错误率(WER)时可能会失败。系统可能不产生音频、说出邻近语言、仅在ASR转录中保留目标文字脚本,或者对母语者来说听起来不自然。我们引入了INSV(可懂度、自然度、脚本保真度和验证)报告框架,将这些情况分开。本文报告了INSV-A,即自动化筛选子集:合成完成度、ASR WER/CER、转录脚本保真率和音频语言识别。原生MOS和语音标注已指定但未在此版本中声明。我们将INSV-A实例化为PashtoTTS-Bench,一个针对普什图语TTS的带日期基准。2026年4月至5月的运行评估了Edge GulNawaz、Edge Latifa、OmniVoice clone、OmniVoice auto和一个乌尔都语阴性对照,使用200个FLEURS和200个过滤后的Common Voice 24提示。在独立的omniASR_CTC_300M_v2下,OmniVoice auto的WER最低(FLEURS 24.1%,CV24 27.4%),其次是Edge GulNawaz(32.8%,39.5%)、Edge Latifa(35.6%,47.7%)和OmniVoice clone(45.4%,34.8%)。低于自然语音基线的WER反映了干净的合成音频,不应被解读为优于原生语音。Whisper Large V3在检查的普什图语TTS音频上返回0.0%的普什图语标签,而MMS-LID-4017和SpeechBrain VoxLingua107将普什图语输出与乌尔都语对照区分开。该版本提供了提供者元数据、每句分数、LID审计、失败日志和用于添加系统的脚本。

English abstract

Text-to-speech (TTS) evaluation for low-resource non-Latin-script languages can fail when it relies on a single ASR round-trip word error rate (WER). A system may produce no audio, speak a neighbouring language, preserve target script text only in an ASR transcript, or sound unnatural to native listeners. We introduce INSV (Intelligibility, Naturalness, Script fidelity, and Verification), a reporting framework that separates these cases. This paper reports INSV-A, the automated screening subset: synthesis completion, ASR WER/CER, transcript Script Fidelity Rate, and audio language identification. Native MOS and phonetic annotation are specified but not claimed in this release. We instantiate INSV-A as PashtoTTS-Bench, a dated benchmark for Pashto TTS. The April-May 2026 run evaluates Edge GulNawaz, Edge Latifa, OmniVoice clone, OmniVoice auto, and an Urdu negative control on 200 FLEURS and 200 filtered Common Voice 24 prompts. Under the independent omniASR_CTC_300M_v2, OmniVoice auto has the lowest WER (24.1% FLEURS, 27.4% CV24), followed by Edge GulNawaz (32.8%, 39.5%), Edge Latifa (35.6%, 47.7%), and OmniVoice clone (45.4%, 34.8%). WER below the natural-speech baseline reflects clean synthetic audio and should not be read as better than native speech. Whisper Large V3 returns 0.0% Pashto labels on checked Pashto TTS audio, while MMS-LID-4017 and SpeechBrain VoxLingua107 separate Pashto outputs from the Urdu control. The release provides provider metadata, per-sentence scores, LID audits, failure logs, and scripts for adding systems.
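Two of the automated checks named above, ASR word error rate and a script fidelity measure, can be sketched as follows. WER is the standard word-level edit distance divided by the reference length. The script fidelity function is an assumption: the abstract does not define Script Fidelity Rate, so this version simply reports the fraction of letters that fall in Arabic-script Unicode blocks (the script used for Pashto). The function names and the `ARABIC_SCRIPT_RANGES` constant are hypothetical.

```python
# Unicode blocks covering Arabic-script letters (assumed sufficient for Pashto).
ARABIC_SCRIPT_RANGES = (
    (0x0600, 0x06FF),  # Arabic
    (0x0750, 0x077F),  # Arabic Supplement
    (0x08A0, 0x08FF),  # Arabic Extended-A
    (0xFB50, 0xFDFF),  # Arabic Presentation Forms-A
    (0xFE70, 0xFEFF),  # Arabic Presentation Forms-B
)


def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


def script_fidelity(text):
    """Fraction of alphabetic characters that lie in Arabic-script blocks.

    Returns 0.0 for text without letters (for example an empty transcript).
    """
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    in_script = sum(
        1
        for ch in letters
        if any(low <= ord(ch) <= high for low, high in ARABIC_SCRIPT_RANGES)
    )
    return in_script / len(letters)


if __name__ == "__main__":
    print(word_error_rate("a b c d", "a x c"))
    print(script_fidelity("\u0633\u0644\u0627\u0645 abc"))
```

Reporting the two numbers side by side separates a transcript that is wrong from one that is in the wrong script, a distinction that a single WER figure hides.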