arXivDaily: arXiv daily academic digest, updated Monday to Friday
2601.11004 2026-06-12 cs.CL Version update

NOVA: NOise-aware Verbal Confidence CAlibration for Robust Large Language Models in RAG Systems

Jiayu Liu, Rui Wang, Qing Zong, Yumeng Wang, Cheng Qian, Qingcheng Zeng, Tianshi Zheng, Haochen Shi, Dadi Guo, Baixuan Xu, Chunyang Li, Yangqiu Song

AI summary: Proposes NOVA, a framework that uses rule-guided supervised fine-tuning to address the overconfidence caused by noisy contexts in retrieval-augmented generation, improving ECE by 10.9% in-domain and 8.0% out-of-domain.

Abstract

Accurately assessing model confidence is essential for deploying large language models (LLMs) in mission-critical factual domains. While retrieval-augmented generation (RAG) is widely adopted to improve grounding, confidence calibration in RAG settings remains poorly understood. We conduct a systematic study across four benchmarks, revealing that LLMs exhibit poor calibration performance especially when noisy contexts are retrieved. Specifically, contradictory or irrelevant evidence tends to exacerbate the model's overconfidence issue. To address this, we propose NOVA Rules (NOise-Aware Verbal Confidence CAlibration Rules) to provide a principled foundation for resolving overconfidence under noise. We further design NOVA, a noise-aware calibration framework that synthesizes supervision from ~2K HotpotQA examples guided by these rules. By performing supervised fine-tuning (SFT) with this data, NOVA equips models with intrinsic noise awareness without relying on stronger teacher models. Empirical results show that NOVA yields substantial gains, improving ECE scores by 10.9% in-domain and 8.0% out-of-domain. By bridging the gap between retrieval noise and verbal calibration, NOVA paves the way for both accurate and epistemically reliable LLMs.
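
The ECE figures above refer to expected calibration error. As a rough illustration of what that metric measures, here is a minimal sketch of the common equal-width binned estimator. The function name, the bin count of 10, and the binning scheme are assumptions for illustration; the abstract does not say which ECE variant the authors use.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width binned ECE (assumed variant, not taken from the paper).

    confidences: stated confidences in [0, 1]
    correct:     matching booleans (or 0/1), True when the answer was right
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("need at least one prediction")

    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, bool(hit)))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_conf = sum(conf for conf, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, hit in bucket if hit) / len(bucket)
        # Each bin contributes its |accuracy - confidence| gap, weighted by size.
        ece += (len(bucket) / total) * abs(accuracy - mean_conf)
    return ece
```

An overconfident model, for example one that states 0.9 confidence but is right only three times out of four, gets a positive score, while a perfectly calibrated one scores zero.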

2603.11863 2026-06-12 cs.AI cs.CL Version update

CreativeBench: Benchmarking and Enhancing Machine Creativity via Self-Evolving Challenges

Zi-Han Wang, Lam Nguyen, Zhengyang Zhao, Mengyue Yang, Chengwei Qin, Yujiu Yang, Linyi Yang

AI summary: Proposes CreativeBench, a benchmark grounded in a cognitive framework that evaluates machine creativity through code generation. It comprises combinatorial and exploratory subsets, generates challenges automatically via reverse engineering and self-play, and distinguishes creativity from hallucination with a metric defined as the product of quality and novelty.

Comments
ACL 2026. Project page: https://zethwang.github.io/creativebench.github.io/
Abstract

The saturation of high-quality pre-training data has shifted research focus toward evolutionary systems capable of continuously generating novel artifacts, leading to the success of AlphaEvolve. However, the progress of such systems is hindered by the lack of rigorous, quantitative evaluation. To tackle this challenge, we introduce CreativeBench, a benchmark for evaluating machine creativity in code generation, grounded in a classical cognitive framework. Comprising two subsets, CreativeBench-Combo and CreativeBench-Explore, the benchmark targets combinatorial and exploratory creativity through an automated pipeline utilizing reverse engineering and self-play. By leveraging executable code, CreativeBench objectively distinguishes creativity from hallucination via a unified metric defined as the product of quality and novelty. Our analysis of state-of-the-art models reveals distinct behaviors: (1) scaling significantly improves combinatorial creativity but yields diminishing returns for exploration; (2) larger models exhibit "convergence-by-scaling," becoming more correct but less divergent; and (3) reasoning capabilities primarily benefit constrained exploration rather than combination. Finally, we propose EvoRePE, a plug-and-play inference-time steering strategy that internalizes evolutionary search patterns to consistently enhance machine creativity.

2603.12530 2026-06-12 cs.LG Version update

Mixing Makes Markovian Contexts Cheap for Linear Bandits

Kaan Buyukkalayci, Osama Hanna, Christina Fragouli

AI summary: For linear bandits with Markovian contexts, proposes a reduction under uniform geometric ergodicity that builds a stationary surrogate action set and uses a delayed-update scheme, achieving worst-case regret bounds that match those of standard linear bandits.

Abstract

Recent work shows that when contexts are drawn i.i.d., linear contextual bandits can be reduced to single-context linear bandits. This "contexts are cheap" perspective is highly advantageous, as it allows for sharper finite-time analyses and leverages mature techniques from the linear bandit literature, such as those for misspecification and adversarial corruption. However, this reduction crucially relies on the independence of contexts and does not extend to settings with temporally correlated (e.g., Markovian) contexts, which arise frequently in practice. Motivated by applications with temporally correlated availability, we extend this perspective to linear bandits with Markovian context processes, where the action set evolves via an exogenous Markov chain. Our main contribution is a reduction that applies under uniform geometric ergodicity. We construct a stationary surrogate action set to solve the problem using a standard linear bandit oracle, employing a delayed-update scheme to control the bias induced by the nonstationary conditional context distributions. We further provide a phased algorithm for unknown stationary distributions that learns the surrogate mapping online. In both settings, we obtain a high-probability worst-case regret bound matching that of the underlying linear bandit oracle in sufficiently fast mixing regimes. We then validate our results on a real-world instance, where we show practical gains over a LinUCB baseline.

2603.02234 2026-06-12 cs.LG cs.AI Version update

Structured vs. Unstructured Pruning: An Exponential Gap

Davide Ferre', Frédéric Giroire, Frederik Mallmann-Trenn, Emanuele Natale

AI summary: Studies the limits of pruning in randomly initialized networks, proving that neuron pruning needs an exponentially larger network than unstructured pruning to reach the same approximation accuracy.

Abstract

The Strong Lottery Ticket Hypothesis (SLTH) states that large, randomly initialized neural networks contain sparse subnetworks capable of approximating a target function at initialization without training, suggesting that pruning alone is sufficient. Pruning methods are typically classified as unstructured, where individual weights can be removed from the network, and structured, where parameters are removed according to specific patterns, as in neuron pruning. Existing theoretical results supporting the SLTH rely almost exclusively on unstructured pruning, showing that logarithmic overparameterization suffices to approximate simple target networks. In contrast, neuron pruning has received limited theoretical attention, despite its practical appeal for direct hardware speedups. In this work, we consider the problem of approximating a single bias-free ReLU neuron by pruning hidden units of a randomly initialized two-layer ReLU network, effectively isolating the intrinsic limitations of neuron pruning. We show that achieving an $\varepsilon$-approximation requires a starting network size of $\Omega(1/\varepsilon)$ for neuron pruning, whereas weight pruning succeeds with only $O(\log(1/\varepsilon))$ hidden units, revealing an exponential separation between the two approaches.
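
To see why the separation is called exponential, the two rates can be placed side by side. The symbols $n_{\text{neuron}}$ and $n_{\text{weight}}$ are notation introduced here for the required number of hidden units, and the numeric line ignores hidden constants, which the abstract does not state, so it is only an order-of-magnitude illustration.

```latex
\[
  n_{\text{neuron}}(\varepsilon) = \Omega\!\left(\frac{1}{\varepsilon}\right),
  \qquad
  n_{\text{weight}}(\varepsilon) = O\!\left(\log\frac{1}{\varepsilon}\right),
  \qquad
  \frac{1}{\varepsilon} = \exp\!\left(\log\frac{1}{\varepsilon}\right).
\]
\[
  \varepsilon = 10^{-3}: \qquad
  \frac{1}{\varepsilon} = 1000,
  \qquad
  \ln\frac{1}{\varepsilon} \approx 6.9 .
\]
```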

2602.08913 2026-06-12 cs.LG stat.ML Version update

GEMSS: A Variational Bayesian Method for Discovering Multiple Sparse Solutions in Classification and Regression Problems

Kateřina Henclová, Václav Šmídl

AI summary: Proposes GEMSS, an algorithm that combines a structured spike-and-slab prior, a Gaussian-mixture approximate posterior, and a Jaccard penalty to discover multiple diverse sparse feature combinations simultaneously via variational inference. It outperforms the compared methods across 128 experiments and is demonstrated on 3 real-world datasets.

Abstract

High-dimensional, underdetermined and highly correlated systems are common in data science practice, especially when analyzing physical measurements. In such settings, feature selection poses a fundamental challenge because multiple distinct sparse subsets may explain the response equally well. Their identification is crucial not only for predictive modeling but also for generating domain-specific insights into the underlying mechanisms. Yet, conventional methods typically isolate a single solution, obscuring the full spectrum of plausible explanations. This work introduces GEMSS (Gaussian Ensemble for Multiple Sparse Solutions), a variational algorithm designed to simultaneously discover multiple, diverse sparse feature combinations. The method employs a structured spike-and-slab prior for sparsity, a mixture of Gaussians to approximate the intractable multimodal posterior, and a Jaccard-based penalty to further control solution diversity. A single objective function is optimized via stochastic gradient descent. The method is tested on 128 comprehensive experiments by a novel benchmarking framework designed to generate artificial problems with multiple sparse solutions of equal predictive properties. This allows us to measure the retrieval of ground truth features rather than only evaluating predictive performance -- characteristics more fitting to our practical needs. A comparative analysis shows that GEMSS consistently outperforms five prominent feature selection methods adapted through the ALFESE framework. Finally, we demonstrate practical usability through 3 challenging real-world datasets from metabolomics and physical chemistry: GEMSS successfully isolates multiple distinct yet quality solutions. GEMSS is available as a PyPI package 'gemss'. The corresponding repository github.com/kat-er-ina/gemss/ includes the full codebase and a free, no-code application GEMSS Explorer.
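
The Jaccard-based penalty mentioned above rewards solutions that select different features. The sketch below shows the underlying idea on hard feature sets: the mean pairwise Jaccard similarity, which a diversity penalty would try to keep low. It is an illustration only; the function names are hypothetical and are not the API of the `gemss` package, and the actual method presumably works with a differentiable relaxation inside the variational objective.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two feature-index collections."""
    a, b = set(a), set(b)
    union = a | b
    # Two empty selections share nothing; treat their similarity as 0.
    return len(a & b) / len(union) if union else 0.0


def mean_pairwise_jaccard(solutions):
    """Average Jaccard similarity over all pairs of candidate solutions.

    Lower values mean the selected feature subsets overlap less,
    which is what a diversity penalty would encourage.
    """
    pairs = list(combinations(solutions, 2))
    if not pairs:
        return 0.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```

For instance, the subsets {1, 2, 3} and {2, 3, 4} share two of four distinct features, giving a similarity of 0.5.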

2602.04208 2026-06-12 cs.RO cs.AI cs.LG Version update

SCALE: Self-uncertainty Conditioned Adaptive Looking and Execution for Vision-Language-Action Models

Hyeonbeom Choi, Daechul Ahn, Youhan Lee, Taewook Kang, Seongwon Cho, Jonghyun Choi

AI summary: Proposes SCALE, an inference strategy that uses self-uncertainty to jointly modulate visual perception and action. It requires no additional training or verifier and only a single forward pass, and improves the robustness of VLA models in simulated and real-world environments.

Comments
ICML 2026 Spotlight. Project page: https://dcahn12.github.io/projects/scale/
Abstract

Vision-Language-Action (VLA) models have emerged as a promising paradigm for general-purpose robotic control, with test-time scaling (TTS) gaining attention to enhance robustness beyond training. However, existing TTS methods for VLAs require additional training, verifiers, and multiple forward passes, making them impractical for deployment. Moreover, they intervene only at action decoding while keeping visual representations fixed, which is insufficient under perceptual ambiguity, where reconsidering how to perceive is as important as deciding what to do. To address these limitations, we propose SCALE, a simple inference strategy that jointly modulates visual perception and action based on 'self-uncertainty', inspired by uncertainty-driven exploration in Active Inference theory, requiring no additional training, no verifier, and only a single forward pass. SCALE broadens exploration in both perception and action under high uncertainty, while focusing on exploitation when confident, enabling adaptive execution across varying conditions. Experiments on simulated and real-world benchmarks demonstrate that SCALE improves state-of-the-art VLAs and outperforms existing TTS methods while maintaining single-pass efficiency.

2509.25787 2026-06-12 cs.CV Version update

Self-Evolving Vision-Language Models for Image Quality Assessment via Voting and Ranking

Wen Wen, Tianwu Zhi, Kanglong Fan, Yang Li, Xinge Peng, Yabin Zhang, Yiting Liao, Junlin Li, Li Zhang

AI summary: Proposes EvoQuality, a framework that generates pseudo-labels through self-consistency and uses group relative policy optimization to iteratively improve a VLM's image quality perception, surpassing supervised methods on several IQA benchmarks without any labels.

Comments
Published as a conference paper at ICLR 2026
Abstract

Improving vision-language models (VLMs) in the post-training stage typically relies on supervised fine-tuning or reinforcement learning, methods that necessitate costly, human-annotated data. While self-supervised techniques have proven effective for enhancing reasoning capabilities, their application to perceptual domains such as image quality assessment (IQA) remains largely unexplored. In this work, we introduce EvoQuality, a novel framework that enables a VLM to autonomously refine its quality perception capabilities without any ground-truth labels. EvoQuality adapts the principle of self-consistency to the ranking-based nature of IQA. It generates pseudo-labels by performing pairwise majority voting on the VLM's own outputs to establish a consensus on relative quality. These pseudo-rankings are then formulated into a fidelity reward that guides the model's iterative evolution through group relative policy optimization (GRPO). By iteratively leveraging its own predictions, EvoQuality progressively refines the VLM's perceptual capability. Extensive experiments show that EvoQuality boosts the base VLM's zero-shot performance by 31.8% on PLCC across diverse IQA benchmarks. Remarkably, despite being entirely self-supervised, EvoQuality achieves performance that is competitive with, or even surpasses, state-of-the-art supervised VLM-based IQA models, outperforming these models on 5 out of 7 IQA benchmarks. Furthermore, the framework demonstrates significant flexibility, allowing it to be stacked with pre-trained IQA models to bolster generalization on unseen datasets. Codes and checkpoints will be available at https://github.com/bytedance/EvoQuality.
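
The pairwise majority-voting step described above can be sketched as follows. This is a simplified illustration, not the released EvoQuality code: the function names, the "A"/"B" vote encoding, and the choice to drop tied pairs are all assumptions.

```python
from collections import Counter


def majority_vote(votes):
    """Return 'A' or 'B' for the image preferred in most samples, or None on a tie.

    votes: list of 'A' / 'B' strings, one per sampled VLM comparison
    of the same image pair.
    """
    counts = Counter(votes)
    a_votes, b_votes = counts.get("A", 0), counts.get("B", 0)
    if a_votes == b_votes:
        return None
    return "A" if a_votes > b_votes else "B"


def pseudo_labels(pair_votes):
    """Build pseudo-rankings from repeated comparisons.

    pair_votes: dict mapping (image_a, image_b) to the list of sampled votes.
    Tied pairs are dropped (an assumption made for this sketch).
    """
    labels = {}
    for pair, votes in pair_votes.items():
        winner = majority_vote(votes)
        if winner is not None:
            labels[pair] = winner
    return labels
```

The resulting dictionary plays the role of the consensus on relative quality; turning it into the fidelity reward used with GRPO is outside the scope of this sketch.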

2512.15134 2026-06-12 cs.LG cs.AI cs.CL Version update

From Isolation to Entanglement: When Do Interpretability Methods Identify and Disentangle Known Concepts?

Aaron Mueller, Andrew Lee, Shruti Joshi, Ekdeep Singh Lubana, Dhanya Sridhar, Patrik Reizinger

AI summary: Proposes a multi-concept evaluation framework to test whether methods such as sparse autoencoders and probes truly disentangle concepts. Features are typically sensitive to a single concept, but concepts are spread across many features, and steering a feature often affects several concepts, showing that correlational metrics are insufficient to establish steering selectivity.

Comments
ACL 2026
Abstract

A goal of interpretability is to recover disentangled representations of latent concepts (features) from the activations of neural networks. The quality of features is typically evaluated in isolation, and under implicit independence assumptions that may not hold in practice. Thus, it is unclear to what extent common featurization methods such as sparse autoencoders (SAEs) and probes disentangle one concept from another. We propose a multi-concept evaluation setting using concepts including sentiment, domain, voice, and tense. We evaluate how well featurizers produce disentangled representations of each concept, observing that features are typically sensitive to only one concept, but also that concepts are distributed across many features. Then, we steer these features, measuring whether each concept is independently manipulable, and whether features interact. Even in idealized settings, steering a feature often affects many concepts, despite a near absence of interaction effects. These results suggest that correlational metrics are insufficient to establish steering selectivity, and that demonstrating that two features operate in separate spaces is insufficient to claim that they will be selective for one concept. These results underscore the importance of multi-concept evaluations in interpretability research.

2507.08794 2026-06-12 cs.LG cs.CL Version update

One Token to Fool LLM-as-a-Judge

Yulai Zhao, Haolin Liu, Dian Yu, Sunyuan Kung, Meijia Chen, Haitao Mi, Dong Yu

AI summary: Finds that reference-based generative reward models are vulnerable to reward hacking: superficial inputs such as non-word symbols or generic reasoning openers consistently elicit false-positive rewards. Proposes a data augmentation strategy that uses truncated model outputs as adversarial negative examples to build robust Master reward models.

Abstract

Large language models (LLMs) are increasingly trusted as automated judges, assisting evaluation and providing reward signals for training other models, particularly in reference-based settings like Reinforcement Learning with Verifiable Rewards (RLVR). However, we uncover a critical vulnerability even in this reference-based paradigm: generative reward models are systematically susceptible to reward hacking. We find that superficial inputs, which we term "master keys," such as non-word symbols (e.g., ":" or ".") or generic reasoning openers (e.g., "Thought process:" or "Let's solve this problem step by step."), can consistently elicit false positive rewards without any substantive reasoning. Our systematic evaluation demonstrates this is a widespread failure affecting a diverse range of models, including leading proprietary systems such as GPT-o1 and Claude-4. These results challenge the assumed robustness of LLM judges and pose a significant threat to their reliability. To address this, we propose a simple yet effective data augmentation strategy using truncated model outputs as adversarial negative examples. The resulting Master Reward Models (Master-RMs) demonstrate state-of-the-art robustness against these "master key" attacks while maintaining high performance in standard evaluation settings. We supplement these findings with a comprehensive analysis of the vulnerability across model scales, prompt variations, and common inference-time strategies, offering insights to guide future research on robust LLM evaluation. We release our robust, general-domain reward models and the synthetic training data at https://huggingface.co/sarosavo/Master-RM and https://huggingface.co/datasets/sarosavo/Master-RM.
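
The augmentation idea, using truncated model outputs as adversarial negatives, might look roughly like the sketch below. The abstract does not specify how outputs are truncated, so keeping only the first line of each response is an assumption, as are the function names and the record fields (`question`, `reference`, `response`, `label`).

```python
def truncate_to_opener(response):
    """Keep only the first line of a response (assumed truncation rule)."""
    return response.strip().split("\n", 1)[0].strip()


def augment_with_truncated_negatives(examples):
    """Return the original examples plus truncated copies labeled as negatives.

    examples: list of dicts with keys 'question', 'reference',
    'response', and 'label' (1 = correct, 0 = incorrect).
    A truncated opener carries no substantive reasoning, so it gets label 0.
    """
    augmented = list(examples)
    for example in examples:
        opener = truncate_to_opener(example["response"])
        # Skip single-line responses, where truncation changes nothing.
        if opener and opener != example["response"].strip():
            augmented.append({**example, "response": opener, "label": 0})
    return augmented
```

A reward model trained on such data sees openers like "Let's solve this problem step by step." paired with a negative label, which is the behavior the abstract describes as the goal.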

2507.07879 2026-06-12 cs.SD eess.AS Version update

LISTEN: Lightweight Industrial Sound-representable Transformer for Edge Notification

Changheon Han, Yun Seok Kang, Yuseop Sim, Hyung Wook Park, Martin Byung-Guk Jun

AI summary: Proposes LISTEN, a lightweight foundation model for industrial sound distilled from the large IMPACT model. Adapted with only a small amount of data, it enables real-time machine monitoring on edge devices with performance close to that of the large model.

Journal ref
Advanced Engineering Informatics, Volume 76, Part A, 2026, 104944
Abstract

Deep learning-based machine listening is broadening the scope of industrial acoustic analysis, yet its widespread implementation on live shop floors is hindered by the reliance on large, task-specific annotated datasets for every new task. While emerging general-purpose sound foundation models aim to alleviate data dependency, they reveal critical dilemmas in practice. General-purpose sound foundation models are computationally expensive and fail in industrial scenarios characterized by tonal harmonics, broadband noise, and transient fault events, making instant, on-site deployment impractical. These challenges combined mean that a practical, end-to-end system for deploying a sound foundation model on a live shop floor has remained elusive. To address this challenge, this study introduces LISTEN (Lightweight Industrial Sound-representable Transformer for Edge Notification), the first lightweight foundation model specialized for industrial sound. Through Knowledge Distillation (KD) from the large-scale teacher model IMPACT (Industrial Machine Perception via Acoustic Cognitive Transformer), we construct LISTEN optimized for resource-constrained edge environments. By freezing the backbone and training only a shallow head on minimal target-process data, rather than performing full fine-tuning or retraining, LISTEN achieves nearly identical performance to IMPACT across diverse manufacturing processes. This study further demonstrates a complete system for real-time machine monitoring, encompassing data acquisition with Industrial Internet of Things (IIoT) devices, rapid model adaptation using minimal annotated data, and real-time monitoring on a low-cost edge device. By validating the entire system on a live CNC machine, this work establishes the first feasible end-to-end system for deploying a lightweight industrial sound foundation model in an active industrial environment.

2501.08425 2026-06-12 cs.LG math.AP math.PR Version update

Is Stochastic Gradient Descent Effective? A PDE Perspective on Machine Learning processes

Davide Barbieri, Matteo Bonforte, Peio Ibarrondo

AI summary: Analyzes SGD through parabolic PDEs of Fokker-Planck type, distinguishing a drift regime from a diffusion regime, quantifying the concentration phenomenon, proving bounds on the mean exit time, and giving new asymptotic convergence results for non-convex losses and degenerate diffusion matrices.

Abstract

In this paper we analyze the behaviour of stochastic gradient descent (SGD), a widely used method in supervised learning for optimizing neural network weights via minimization of non-convex loss functions. Since the pioneering work of E, Li and Tai (2017), the underlying structure of such processes can be understood via parabolic PDEs of Fokker-Planck type, which are at the core of our analysis. Although Fokker-Planck equations have a long history and an extensive literature, almost nothing is known when the potential is non-convex or when the diffusion matrix is degenerate, and this is the main difficulty that we face in our analysis. We identify two different regimes: in the initial phase of SGD, the loss function drives the weights to concentrate around the nearest local minimum. We refer to this phase as the drift regime and we provide quantitative estimates on this concentration phenomenon. Next, we introduce the diffusion regime, where stochastic fluctuations help the learning process to escape suboptimal local minima. We analyze the Mean Exit Time (MET) and prove upper and lower bounds of the MET. Finally, we address the asymptotic convergence of SGD for a non-convex cost function and a degenerate diffusion matrix, a setting that rules out the standard approaches and requires new techniques. For this purpose, we exploit two different methods: duality and entropy methods. We provide new results about the dynamics and effectiveness of SGD, offering a deep connection between stochastic optimization and PDE theory, and some answers and insights to basic questions in the Machine Learning processes: How long does SGD take to escape from a bad minimum? Do neural network parameters converge using SGD? How do parameters evolve in the first stage of training with SGD?
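
For orientation, the kind of equation involved can be written out in a standard form from the stochastic-modified-equation literature. This is a sketch under assumed notation, not the paper's own statement: $L$ is the loss, $\eta$ the learning rate, $\Sigma$ the gradient-noise covariance, $W_t$ a Brownian motion, and $\rho(t,\theta)$ the density of the weights; the paper's scaling and symbols may differ.

```latex
\[
  \mathrm{d}\theta_t
  = -\nabla L(\theta_t)\,\mathrm{d}t
  + \sqrt{\eta\,\Sigma(\theta_t)}\,\mathrm{d}W_t ,
\]
\[
  \partial_t \rho
  = \nabla\!\cdot\!\bigl(\rho\,\nabla L\bigr)
  + \frac{\eta}{2}\sum_{i,j}
    \partial_{\theta_i}\partial_{\theta_j}\bigl(\Sigma_{ij}\,\rho\bigr).
\]
```

In this picture the first term on the right is the drift that concentrates mass near local minima, the second is the diffusion that lets the process escape them, and the difficult cases named in the abstract correspond to a non-convex $L$ and a $\Sigma$ that is not positive definite.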

2605.01727 2026-06-12 cs.AI cs.CY

Are LLMs More Skeptical of Entertainment News?

Huiqian Lai

AI summary: Examines whether zero-shot LLMs misclassify entertainment news at a higher rate in news credibility assessment, finds that the effect differs across models, and probes its causes with style-swap and prompt-mitigation experiments.

Journal ref
Proceedings of the ICWSM Workshops, MisD 2026: The 2nd Workshop on Misinformation Detection in the Era of LLMs, 2026
Comments
Accepted at the 2nd Workshop on Misinformation Detection in the Era of LLMs (MisD), co-located with ICWSM 2026, May 26, 2026, Los Angeles, CA, USA
Abstract

Large language models (LLMs) are increasingly used for automated news credibility assessment, yet it remains unclear whether they apply even-handed standards across journalistic genres. We examine whether zero-shot LLMs are more likely to misclassify legitimate entertainment news as fake than legitimate hard news, using a within-dataset design on GossipCop from FakeNewsNet. Across four frontier models, we find a clear but model-specific genre asymmetry: DeepSeek-V3.2 and GPT-5.2 show false-positive-rate gaps of 10.1 and 8.8 percentage points, respectively (both $p < .001$), whereas Claude Opus 4.6 and Gemini 3 Flash show no comparable difference. A style-swap experiment yields only limited and inconsistent changes, suggesting that the asymmetry is not reducible to stylistic register alone. Prompt-based mitigation is likewise possible but not generic: framing the model as an entertainment-news fact-checker reduces false positives for DeepSeek-V3.2 by about 50% without detectable recall loss, but offers little improvement for GPT-5.2. Exploratory qualitative coding further suggests two recurring error patterns in sampled false positives: treating private-life claims as inherently unverifiable and discounting entertainment journalism as an epistemically weaker genre. Taken together, these findings show that aggregate performance metrics can obscure structured false positives within legitimate journalism. We argue that LLM-based credibility assessment may not only evaluate truth claims but also differentially recognize the legitimacy of journalistic genres, and that evaluation should therefore include genre-stratified false-positive analysis alongside overall accuracy.
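
The genre-stratified false-positive analysis recommended above reduces to a small computation. The sketch below assumes a hypothetical record layout of `(genre, true_label, predicted_label)` with labels `"real"` and `"fake"`; it does not reproduce the paper's data pipeline or its significance tests.

```python
def false_positive_rate(records, genre):
    """Share of legitimate ('real') items of one genre that were predicted 'fake'."""
    legit_predictions = [
        predicted
        for item_genre, true_label, predicted in records
        if item_genre == genre and true_label == "real"
    ]
    if not legit_predictions:
        raise ValueError(f"no legitimate items for genre {genre!r}")
    flagged = sum(1 for predicted in legit_predictions if predicted == "fake")
    return flagged / len(legit_predictions)


def fpr_gap_points(records, genre_a, genre_b):
    """False-positive-rate gap between two genres, in percentage points."""
    gap = false_positive_rate(records, genre_a) - false_positive_rate(records, genre_b)
    return 100.0 * gap
```

Reporting this gap next to overall accuracy is what exposes an asymmetry that an aggregate metric would average away.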

2604.08581 2026-06-12 cs.LG

Fully Autonomous Z-Score-Based TinyML Anomaly Detection on Resource-Constrained MCUs Using Power Side-Channel Data

Abdulrahman Albaiz, Fathi Amsaad

Journal ref
Proc. IEEE 2nd International Conference on Secure IoT, Assured and Trusted Computing (SATC), Houston, TX, USA, 2026, pp. 1-6
Comments
SaTC 2026 Conference
Abstract

This paper presents a fully autonomous Tiny Machine Learning (TinyML) Z-Score-based anomaly detection system deployed on a low-power microcontroller for real-time monitoring of appliance behavior using power side-channel data. Unlike existing Internet of Things (IoT) anomaly detection approaches that rely on offline training or cloud-assisted analytics, the proposed system performs both model training and inference directly on a resource-constrained microcontroller without external computation or connectivity. The system continuously samples current consumption, computes Root Mean Square (RMS) values on-device, and derives statistical parameters during an initial training phase. Anomalies are detected using lightweight Z-Score thresholds, enabling interpretable and computationally efficient inference suitable for embedded deployment. The architecture was implemented on an STM32-based platform and evaluated using a 14-day dataset collected from a household mini-fridge under normal operation and controlled anomaly conditions. Results demonstrate perfect detection performance, with Precision and Recall of 1.00, inference latencies on the order of tens of microseconds, and a total memory footprint of approximately 3.3 KB SRAM and 63 KB Flash. These results confirm that robust and fully autonomous TinyML anomaly detection can be achieved on low-cost microcontrollers. Future work includes extending the framework to incorporate additional lightweight models and multi-device learning scenarios.
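
The pipeline described above (RMS of sampled current, statistics learned in a training phase, Z-Score thresholding) can be sketched in a few lines. This Python version is only an illustration of the logic; the deployed system runs on an STM32 microcontroller, and the class name, the running-statistics update (Welford's method), and the threshold of 3.0 are assumptions not taken from the paper.

```python
import math


def rms(samples):
    """Root mean square of one window of current samples."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))


class ZScoreDetector:
    """Learns mean and standard deviation online, then flags large deviations."""

    def __init__(self, threshold=3.0):
        self.threshold = threshold  # assumed value, not from the paper
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations

    def train(self, value):
        """Update running statistics with one RMS value (Welford's method)."""
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    def std(self):
        """Population standard deviation of the training values."""
        return math.sqrt(self._m2 / self.count) if self.count else 0.0

    def is_anomaly(self, value):
        """True when the absolute Z-Score of `value` exceeds the threshold."""
        sigma = self.std()
        if sigma == 0.0:
            return value != self.mean
        return abs(value - self.mean) / sigma > self.threshold
```

Keeping only a count, a mean, and one accumulator is what makes this style of detector plausible within a few kilobytes of SRAM.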

2603.27393 2026-06-12 cs.LG

K-Means Based TinyML Anomaly Detection and Distributed Model Reuse via the Distributed Internet of Learning (DIoL)

Abdulrahman Albaiz, Fathi Amsaad

Journal ref
Proc. IEEE 2nd International Conference on Secure IoT, Assured and Trusted Computing (SATC), Houston, TX, USA, 2026, pp. 1-5
Comments
SaTC 2026 Conference
Abstract

This paper presents a lightweight K-Means anomaly detection model and a distributed model-sharing workflow designed for resource-constrained microcontrollers (MCUs). Using real power measurements from a mini-fridge appliance, the system performs on-device feature extraction, clustering, and threshold estimation to identify abnormal appliance behavior. To avoid retraining models on every device, we introduce the Distributed Internet of Learning (DIoL), which enables a model trained on one MCU to be exported as a portable, text-based representation and reused directly on other devices. A two-device prototype demonstrates the feasibility of the "Train Once, Share Everywhere" (TOSE) approach using a real-world appliance case study, where Device A trains the model and Device B performs inference without retraining. Experimental results show consistent anomaly detection behavior, negligible parsing overhead, and identical inference runtimes between standalone and DIoL-based operation. The proposed framework enables scalable, low-cost TinyML deployment across fleets of embedded devices.
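
A rough sketch of the "Train Once, Share Everywhere" idea: a trained model (centroids plus a distance threshold) is written out as plain text on one device and parsed on another, which then runs inference without retraining. The text format, the function names, and the nearest-centroid distance rule are assumptions made for illustration; the abstract does not describe the actual DIoL representation.

```python
import math


def export_model(centroids, threshold):
    """Serialize centroids and threshold to a portable text form (hypothetical format)."""
    lines = [f"threshold {threshold!r}"]
    for centroid in centroids:
        lines.append("centroid " + " ".join(repr(value) for value in centroid))
    return "\n".join(lines)


def import_model(text):
    """Parse the text form back into (centroids, threshold)."""
    centroids, threshold = [], None
    for line in text.splitlines():
        if not line.strip():
            continue
        kind, *values = line.split()
        if kind == "threshold":
            threshold = float(values[0])
        elif kind == "centroid":
            centroids.append([float(value) for value in values])
    return centroids, threshold


def is_anomaly(point, centroids, threshold):
    """Flag a feature vector whose distance to the nearest centroid exceeds the threshold."""
    nearest = min(math.dist(point, centroid) for centroid in centroids)
    return nearest > threshold
```

In this sketch Device A would call `export_model` after clustering, and Device B would call `import_model` followed by `is_anomaly` on its own measurements.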

2507.11936 2026-06-12 cs.CL cs.AI cs.CV cs.LG

A Survey of Deep Learning for Geometry Problem Solving

Jianzhe Ma, Wenxuan Wang, Qin Jin

Details
Comments
ACL 2026 Main Conference
English abstract

Geometry problem solving, a crucial aspect of mathematical reasoning, is vital across various domains, including education, the assessment of AI's mathematical abilities, and multimodal capability evaluation. The recent surge in deep learning technologies, particularly the emergence of multimodal large language models, has significantly accelerated research in this area. This paper presents a survey of the applications of deep learning in geometry problem solving, including (i) a comprehensive summary of the relevant tasks in geometry problem solving; (ii) a thorough review of related deep learning methods; (iii) a detailed analysis of evaluation metrics and methods; and (iv) a critical discussion of state-of-the-art performance, existing challenges, and promising future directions. Our objective is to offer a comprehensive and practical reference of deep learning for geometry problem solving, thereby fostering further advancements in this field. We maintain a list of relevant papers: https://github.com/majianz/dl4gps.

2508.03721 2026-06-12 cs.CV eess.IV

Enhancing Diameter Measurement Accuracy in Machine Vision Applications

Ahmet Gokhan Poyraz, Ahmet Emir Dirik, Hakan Gurkan, Mehmet Kacmaz

Details
Journal ref
Measurement 278 (2026) 121646
Comments
Preprint
English abstract

In camera measurement systems, specialized equipment such as telecentric lenses is often employed to measure parts with narrow tolerances. However, despite the use of such equipment, measurement errors can occur due to mechanical and software-related factors within the system. These errors are particularly evident in applications where parts of different diameters are measured using the same setup. This study proposes two innovative approaches to enhance measurement accuracy using multiple known reference parts: a conversion factor-based method and a pixel-based method. In the first approach, the conversion factor is estimated from known references to calculate the diameter (mm) of the unknown part. In the second approach, the diameter (mm) is directly estimated using pixel-based diameter information from the references. The experimental setup includes an industrial-grade camera and telecentric lenses. Tests conducted on glass samples (1-12 mm) and metal workpieces (3-24 mm) show that measurement errors, which originally ranged from 13-114 micrometers, were reduced to 1-2 micrometers using the proposed methods. By utilizing only a few known reference parts, the proposed approach enables high-accuracy measurement of all parts within the camera's field of view. Additionally, this method enhances the existing diameter measurement literature by significantly reducing error rates and improving measurement reliability.
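
The two approaches can be contrasted in a short sketch. The abstract does not say how the references are combined, so the piecewise-linear interpolation used here, along with every function name, is an assumption for illustration only. `ref_px` must be sorted in ascending order and contain at least two references.

```python
from bisect import bisect_left

def _interp(x, xs, ys):
    """Piecewise-linear interpolation; end segments are used to extrapolate."""
    i = min(max(bisect_left(xs, x), 1), len(xs) - 1)
    x0, x1, y0, y1 = xs[i - 1], xs[i], ys[i - 1], ys[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

def diameter_via_factor(px, ref_px, ref_mm):
    """Approach 1: estimate a mm-per-pixel factor, then scale the pixel diameter."""
    factors = [mm / p for p, mm in zip(ref_px, ref_mm)]
    return px * _interp(px, ref_px, factors)

def diameter_via_pixels(px, ref_px, ref_mm):
    """Approach 2: map the pixel diameter directly to millimetres."""
    return _interp(px, ref_px, ref_mm)
```

Both functions reproduce a reference part exactly when `px` equals one of the reference pixel diameters; between references they differ slightly because one interpolates the factor and the other interpolates the diameter itself.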

2507.21086 2026-06-12 cs.CL

Multi-Amateur Contrastive Decoding for Text Generation

Jaydip Sen, Subhasis Dasgupta, Hetvi Waghela

Details
Comments
This paper has been accepted for oral presentation and publication in the proceedings of the IEEE I2ITCON 2025. The conference will be organized in Pune, India, from July 4 to 5, 2025. This is the accepted version of the paper and NOT the final camera-ready version. The paper is 11 pages long and contains 5 figures and 6 tables
English abstract

Contrastive Decoding (CD) has emerged as an effective inference-time strategy for enhancing open-ended text generation by exploiting the divergence in output probabilities between a large expert language model and a smaller amateur model. Although CD improves coherence and fluency, its dependence on a single amateur restricts its capacity to capture the diverse and multifaceted failure modes of language generation, such as repetition, hallucination, and stylistic drift. This paper proposes Multi-Amateur Contrastive Decoding (MACD), a generalization of the CD framework that employs an ensemble of amateur models to more comprehensively characterize undesirable generation patterns. MACD integrates contrastive signals through both averaging and consensus penalization mechanisms and extends the plausibility constraint to operate effectively in the multi-amateur setting. Furthermore, the framework enables controllable generation by incorporating amateurs with targeted stylistic or content biases. Experimental results across multiple domains, such as news, encyclopedic, and narrative, demonstrate that MACD consistently surpasses conventional decoding methods and the original CD approach in terms of fluency, coherence, diversity, and adaptability, all without requiring additional training or fine-tuning.
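
As a rough illustration of the averaging variant, the sketch below scores next-token candidates by the expert log-probability minus the mean amateur log-probability, restricted to tokens that pass the usual CD plausibility cutoff. The exact MACD formulas are not given in the abstract, so the averaging rule, the `alpha` default, and the function name are assumptions; the consensus penalization mechanism is not shown.

```python
import math

def macd_scores(expert_probs, amateur_probs_list, alpha=0.1):
    """Contrastive score per token, averaging over several amateurs.

    Tokens whose expert probability falls below alpha * max(expert_probs)
    are ruled out (score -inf), following the CD plausibility constraint.
    All amateur probabilities are assumed to be strictly positive.
    """
    cutoff = alpha * max(expert_probs)
    scores = []
    for i, p in enumerate(expert_probs):
        if p < cutoff:
            scores.append(float("-inf"))
            continue
        avg_amateur = sum(math.log(a[i]) for a in amateur_probs_list) / len(amateur_probs_list)
        scores.append(math.log(p) - avg_amateur)
    return scores
```

With a single amateur in `amateur_probs_list`, this reduces to the original single-amateur contrastive score.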

2412.14631 2026-06-12 cs.CV

Review of Fruit Tree Image Segmentation

Il-Seok Oh

Details
Journal ref
Agriculture, Volume 15, Issue 21, 2025
English abstract

Fruit tree image segmentation is an essential problem in automating a variety of agricultural tasks such as phenotyping, harvesting, spraying, and pruning. Many research papers have proposed a diverse spectrum of solutions suitable to specific tasks and environments. The review scope of this paper is confined to the front views of fruit trees and based on 158 relevant papers collected using a newly designed crawling review method. These papers are systematically reviewed based on a taxonomy that sequentially considers the method, image, task, and fruit. This taxonomy will assist readers to intuitively grasp the big picture of these research activities. Our review reveals that the most noticeable deficiency of the previous studies was the lack of a versatile dataset and segmentation model that could be applied to a variety of tasks and environments. Six important future research tasks are suggested, with the expectation that these will pave the way to building a versatile tree segmentation module.

2306.01690 2026-06-12 cs.LG cs.AI

Context selectivity with dynamic availability enables lifelong continual learning

Martin Barry, Wulfram Gerstner, Guillaume Bellec

Details
English abstract

"You never forget how to ride a bike", but how is that possible? The brain is able to learn complex skills, stop the practice for years, learn other skills in between, and still retrieve the original knowledge when necessary. The mechanisms of this capability, referred to as lifelong learning (or continual learning, CL), are unknown. We suggest a bio-plausible meta-plasticity rule building on classical work in CL, which we summarize in two principles: (i) neurons are context selective, and (ii) a local availability variable partially freezes the plasticity if the neuron was relevant for previous tasks. In a new neuro-centric formalization of these principles, we suggest that neuron selectivity and neuron-wide consolidation is a simple and viable meta-plasticity hypothesis to enable CL in the brain. In simulation, this simple model balances forgetting and consolidation, leading to better transfer learning than contemporary CL algorithms on image recognition and natural language processing CL benchmarks.

2302.01090 2026-06-12 cs.SD cs.IR eess.AS

Goniometers are a Powerful Acoustic Feature for Music Information Retrieval Tasks

Tim Ziemer

Details
Journal ref
Fortschritte der Akustik (DAGA) 2023
English abstract

Goniometers, also known as Phase Scopes or Vector Scopes, are audio metering tools that help music producers and mixing engineers monitor spatial aspects of a music mix, such as the stereo panorama, the width of single sources, the amount and diffuseness of reverberation as well as phase cancellations that may occur on the sweet-spot and in a mono-mixdown. In addition, they implicitly inform about the dynamics of the sound. Self-organizing maps trained with a goniometer are consulted to explore the usefulness of this acoustic feature for music information retrieval tasks. One can see that goniometers are able to classify different genres and cluster a single album. The advantage of goniometers is the causality: Music producers and mixing engineers consciously consult goniometers to reach their desired sound, which is not the case for other acoustic features, from Zero-Crossing Rate to Mel-Frequency Cepstral Coefficients.

2204.10552 2026-06-12 cs.RO

Making Parameterization and Constrains of Object Landmark Globally Consistent via SPD(3) Manifold and Improved Cost Functions

Yutong Hu, Wei Wang

Details
Comments
8 pages, 8 figures, submitted to IROS 2022 & RA-L
English abstract

Object-level SLAM introduces semantically meaningful and compact object landmarks that help both indoor robot applications and outdoor autonomous driving tasks. However, the back end of object-level SLAM suffers from singularity problems because existing methods parameterize object landmarks separately by their scales and poses. Under that parameterization method, the same abstract object can be represented by rotating the object coordinate frame by 90 deg and swapping its length with width value, making the pose of the same object landmark not globally consistent. To avoid the singularity problem, we first introduce the symmetric positive-definite (SPD) matrix manifold as an improved object-level landmark representation and further improve the cost functions in the back end to make them compatible with the representation. Our method demonstrates a faster convergence rate and more robustness in simulation experiments. Experiments on real datasets also reveal that, using the same front-end data, our strategy improves the mapping accuracy by 22% on average.
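
The ambiguity described above can be checked numerically in a planar toy case. The sketch assumes the SPD matrix is built as R diag(length^2, width^2) R^T from a rotation R and the object scales; that construction and the function name are illustrative assumptions, since the abstract does not define the paper's exact SPD(3) parameterization.

```python
import math

def spd_from_pose_scale(theta, length, width):
    """2x2 SPD matrix R(theta) * diag(length^2, width^2) * R(theta)^T."""
    c, s = math.cos(theta), math.sin(theta)
    l2, w2 = length ** 2, width ** 2
    off = c * s * (l2 - w2)
    return [
        [c * c * l2 + s * s * w2, off],
        [off, s * s * l2 + c * c * w2],
    ]

# Two different (pose, scale) descriptions of the same object:
a = spd_from_pose_scale(0.3, 4.0, 2.0)
b = spd_from_pose_scale(0.3 + math.pi / 2, 2.0, 4.0)  # rotated 90 deg, axes swapped
same = all(abs(a[i][j] - b[i][j]) < 1e-9 for i in range(2) for j in range(2))
```

Here `same` evaluates to true: the two separate pose-and-scale parameterizations differ, while the single matrix representation is identical, which is the kind of global consistency the abstract refers to.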

2606.13629 2026-06-12 stat.ME cs.AI cs.LG stat.ML New submission

Valid Inference with Synthetic Data via Task Exchangeability

Lezhi Tan, Tijana Zrnic

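AI summary: Proposes the task exchangeability condition, which ensures valid statistical inference with synthetic data in scientific research, and demonstrates applications to public opinion surveys and AI evaluation.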

Details
English abstract

There is a proliferation of work arguing for the use of synthetic data in scientific research. For example, social scientists are arguing for the use of LLM-generated "silicon samples" in pilot studies; AI evaluations increasingly rely on "LLM-as-a-judge" outputs; and proteomics research is accelerated by generative models that produce synthetic protein structures. These developments raise an intriguing possibility: synthetic data may help researchers ask more questions, run more studies, and accelerate discovery. But they also raise a fundamental concern: synthetic data can be biased, noisy, and misspecified. In this work, we propose statistical principles for using synthetic data in scientific research with provable validity guarantees. The key insight is a new technical condition that we call task exchangeability. Informally, this is a requirement that the researcher can identify historical tasks, for which real data is available, such that their current task of interest is exchangeable with the historical tasks in an appropriate mathematical sense. We develop methods for valid inference under task exchangeability, together with extensions that provide guarantees even beyond exchangeability. We demonstrate the framework on public opinion surveys with silicon samples and AI evaluation with autoraters.

2606.13544 2026-06-12 eess.AS cs.AI cs.CL New submission

Adaptive Turn-Taking for Real-time Multi-Party Voice Agents

Soumyajit Mitra, Prabhat Pandey, Abhinav Jain, Shanmukha Sahith, K V Vijay Girish

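AI summary: Proposes ModeratorLM, a role-conditioned speech large language model that uses chunk-wise streaming and chain-of-thought reasoning to achieve adaptive turn-taking in multi-party conversations, substantially improving turn-taking precision and recall.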

Details
Comments
Accepted for publication at Interspeech 2026
English abstract

Turn-taking in multi-party spoken conversations remains a fundamental challenge for voice-based agents, particularly under dynamic floor competition and varying user expectations. We propose ModeratorLM, a role-playing voice agent that conditions turn-taking behavior on an explicitly assigned role in multi-party settings. The system is built on a speech large language model operating in a chunk-wise streaming manner. We further introduce a reasoning-augmented variant that incorporates chain-of-thought reasoning over conversational context and the assigned role. We construct RolePlayConv, a large-scale synthetic dataset of spoken multi-party conversations with diverse assistant roles. Experiments on real-world meeting data and RolePlayConv show improved turn-taking precision by over 40% and recall by more than 70%, while substantially reducing false-positive interruptions compared to non-role-conditioned baselines.

2606.13450 2026-06-12 eess.AS cs.SD New submission

Endpoint Anticipation for Low-Latency Spoken Dialogue

Sathvik Udupa, Shinji Watanabe, Petr Schwarz, Jan Cernocky

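AI summary: Proposes Endpoint Anticipation, which achieves low latency by forecasting end-of-turn signals in advance and speculatively executing the LLM and TTS pipelines on partial context, reducing average latency by 505 ms.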

Details
Comments
Accepted at Interspeech 2026
English abstract

While low-latency interaction is critical for spoken dialogue, cascaded architectures are often bottlenecked by reactive turn-completion detection. We propose Endpoint Anticipation, shifting from reactive detection to proactive forecasting of end-of-turn signals. Our speech-based model anticipates endpoints up to 2.56 seconds in advance, enabling speculative execution of LLM and TTS pipelines on partial context. We introduce metrics to quantify the trade-off between realized latency reduction and computational redundancy. Evaluation across conversational and task-oriented datasets shows our model consistently outperforms competitive VAP-based baselines. Integration with the Unmute framework demonstrates a 505 ms average latency reduction with a 28.4% increase in speculative computation, effectively masking sequential bottlenecks to enable complex reasoning in real-time speech-to-speech interaction.

2606.13295 2026-06-12 stat.ML cs.LG stat.ME New submission

Simultaneous Latent Budget Trees for Stratified Classification

Cristian Buoncompagni, Stefano Pellegrino, Giulia Vannucci, Roberta Siciliano

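AI summary: Proposes the Simultaneous Latent Budget Trees framework, which handles a stratification factor through a model-based split rule to give interpretable classification, and applies it to the analysis of gender-related differences in Amyotrophic Lateral Sclerosis.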

Details
English abstract

In the era of Explainable Artificial Intelligence, there is a renewed focus on single trees for their ease of interpretation. This paper introduces Simultaneous Latent Budget Trees, a probabilistic machine learning framework for classification trees in the presence of a stratification factor such as a temporal, spatial, or demographic variable, acting as a control variable or potential confounder. Standard tree growth procedures are not designed to optimize a conditional split rule. A model-based split rule is proposed in which child nodes are interpreted as latent components of a simultaneous mixture model, such as the Simultaneous Latent Budget Model and its constrained versions, fitted to the parent node. Mixing parameters drive the observations, differently for each group, to the child nodes, whereas latent budget parameters update the response class profile of each level of the control variable. Parameters are estimated by least squares considering a neural network perspective of the model. An informative tree structure can be interactively visualized with interpretation aids on the nodes and the paths, including visual pruning and a decision tree selection procedure. Suitable measures are proposed to handle an unbalanced response class distribution. The proposed methodology is applied to investigate gender-related differences in disease progression of Amyotrophic Lateral Sclerosis. The SLBT library with the various tree-based algorithms is available in the linked GitHub repository.

2606.13277 2026-06-12 stat.ML cs.LG New submission

ProtoX-AD: Self-Explainable Time Series Anomaly Detection and Characterization

Aitor Sánchez-Ferrera, Elisabeth Wetzer, Kristoffer Wickstrøm, Michael Kampffmeyer, Robert Jenssen

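AI summary: Proposes the ProtoX-AD framework, which uses prototype learning to make self-supervised time series anomaly detection explainable, providing semantically consistent explanations of anomaly characteristics while maintaining detection performance.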

Details
Comments
26 pages, 8 figures
English abstract

Recent advances in time series anomaly detection (TSAD) have highlighted the effectiveness of self-supervised classification-based approaches. These methods apply transformations to normal training samples, training a classifier to recognize transformation-specific patterns that help identify anomalies through increased classification errors. Despite their strong performance, a significant challenge is their lack of explainability, as they provide limited insight into the characteristics of flagged anomalies. To address this limitation, we propose ProtoX-AD, a prototype-based self-explainable framework for self-supervised TSAD. ProtoX-AD learns transformation-aware latent representations alongside interpretable prototypes, enabling both accurate anomaly detection and the identification of distinct anomalous profiles through prototype-based explanations. Additionally, it allows for systematic analysis of how transformation design impacts detection performance and explainability. Experimental results on synthetic and real-world datasets demonstrate that ProtoX-AD achieves detection performance comparable to its black-box counterparts while offering more consistent and semantically meaningful explanations than existing explainable baselines. Our code is publicly available at https://github.com/Aitorzan3/ProtoX-AD.

2606.13193 2026-06-12 eess.AS cs.PL cs.SD New submission

A Dual-Mode Faust-to-CLAP Compilation System

Facundo Franchino, Stéphane Letz, Jatin Chowdhury

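AI summary: Proposes the faust2clap framework, which supports a static compilation mode and a dynamic interpretation mode and preserves DSP parameter identity through an address-based identity matching algorithm and a stable slot allocation scheme, enabling efficient compilation and hot reloading.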

Details
Comments
4 pages, 4 figures, 1 algorithm. Presented at the International Faust Conference (IFC-26), Lyon, France, June 2026
English abstract

We describe faust2clap, a framework establishing the first officially maintained compilation pathway from Faust DSP specifications to the CLAP format. The system operates in two different modes. A static mode employs ahead-of-time compilation to yield native binaries of optimal efficiency, while a dynamic mode uses runtime interpretation to permit DSP code modification without interrupting the host application. This latter capability addresses a persistent friction in audio software development, namely the cumulative overhead of the edit, compile, and reload cycle. We detail the algorithmic machinery underlying both modes, focusing specifically on the problem of parameter identity. To preserve both parameter values and their bindings to host automation across structural DSP mutations, we introduce an address-based identity matching algorithm and a stable slot allocation scheme. The implementation, comprising approximately 2,400 lines of C++ architecture and Python tooling code, has been integrated into the main Faust distribution.
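
To make the parameter identity problem concrete, the following toy sketch keys each parameter by its address string and never reassigns a slot once it has been handed out. It is not the faust2clap algorithm: the class, its methods, and the example addresses are hypothetical, and the real system's matching and allocation rules are not described in the abstract.

```python
class SlotTable:
    """Toy model: address-keyed parameters with slots that stay stable."""

    def __init__(self):
        self.slot_of = {}   # address -> slot index, never reassigned
        self.value_of = {}  # address -> last known value

    def reload(self, params):
        """Apply a new DSP's parameters (address -> default value).

        Known addresses keep their slot and current value; new addresses
        get the next free slot; addresses absent from the new DSP are
        left out of the result, but their slot stays reserved.
        """
        active = {}
        for address, default in params.items():
            if address not in self.slot_of:
                self.slot_of[address] = len(self.slot_of)
                self.value_of[address] = default
            active[address] = (self.slot_of[address], self.value_of[address])
        return active

    def set_value(self, address, value):
        """Record a value change, e.g. one driven by host automation."""
        self.value_of[address] = value
```

In this toy version, a parameter that survives a code edit keeps both its slot and the value the host last set, even when other parameters are added or removed around it.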

2606.13146 2026-06-12 stat.ML cs.LG stat.ME New submission

Robust State-Conditional Feature-Weighted Jump Models for Temporal Clustering

Federico P. Cortese, Alessio Farcomeni

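AI summary: Proposes a robust feature-weighted jump model that achieves robustness through Tukey's biweight loss function and introduces state-specific feature weights, outperforming competing methods in simulation, with two empirical applications.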

Details
English abstract

We propose a robust feature-weighted jump model for time-dependent clustering. A penalty is used to encourage smoothness of transitions over time, while robustness is achieved through Tukey's biweight loss function. An additional parameter controls the variability of feature weights across states, allowing the model to assign state-specific relevance to each feature. We illustrate in simulation how the method accurately recovers the true cluster sequence and reliably identifies relevant features, outperforming competing approaches, particularly in the presence of outliers. We conclude with two empirical applications, one on the number of conflict-related homicides in Kosovo in the period 1998-2000, and another on macroeconomic performance of twelve European countries in the period 1949-2024.
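
The two ingredients named above, a bounded robust loss and a penalty on state changes, can be sketched as follows. Tukey's biweight function is standard; the tuning constant 4.685, the way the per-state feature weights enter the sum, and the function names are assumptions for illustration, not the paper's exact objective.

```python
def tukey_biweight(r, c=4.685):
    """Tukey's biweight loss: roughly quadratic near 0, constant beyond |r| >= c."""
    if abs(r) >= c:
        return c * c / 6.0
    u = 1.0 - (r / c) ** 2
    return (c * c / 6.0) * (1.0 - u ** 3)

def jump_objective(y, states, centers, weights, lam):
    """Robust fit term plus lam for every change of state between time steps.

    y:       list of feature vectors, one per time step
    states:  state index per time step
    centers: centers[k][p] is the centre of feature p in state k
    weights: weights[k][p] is the relevance of feature p in state k
    """
    total = 0.0
    for t, (row, s) in enumerate(zip(y, states)):
        for p, value in enumerate(row):
            total += weights[s][p] * tukey_biweight(value - centers[s][p])
        if t > 0 and s != states[t - 1]:
            total += lam
    return total
```

Because the loss is capped at c^2/6, a single extreme observation can add only a bounded amount to the objective, which is what limits the influence of outliers on the estimated state sequence.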

2606.13109 2026-06-12 eess.AS cs.SD New submission

Generating Training Targets for Real-World Speech Enhancement via Close-to-Distant Microphone Projection

Tomohiro Nakatani, Rintaro Ikeshita, Naoyuki Kamo, Marc Delcroix, Shoko Araki

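AI summary: Proposes Close-to-Distant microphone Projection (C2D projection), which generates paired data from real recordings and realizes the projection with a variant of the Parametric Multichannel Wiener Filter; neural networks trained on this data outperform the existing GSS method in distant speech enhancement.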

Details
Journal ref
Proceedings of IEEE ICASSP 2026
English abstract

Training neural networks (NNs) for speech enhancement (SE) in distant speech-capturing scenarios requires paired distorted and clean reference speech signals. While such data are often generated through simulation, the mismatch between simulated and real recordings significantly limits SE accuracy. To address this issue, we propose Close-to-Distant microphone Projection (C2D projection), a method that generates paired data from real recordings captured by close and distant microphones. C2D projection estimates an optimal projection matrix that transforms close-microphone inputs into clean reference signals aligned with distant-microphone recordings, while simultaneously performing denoising. We show this projection can be effectively realized using a variant of the Parametric Multichannel Wiener Filter (PMWF). Experimental results demonstrate that an NN trained with C2D-projected data outperforms the state-of-the-art Guided Source Separation (GSS) on the challenging CHiME6 dinner party ASR task under oracle diarization, when using the enhanced output from GSS as an auxiliary input to the NN.
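
For orientation, the commonly cited rank-1 form of the Parametric Multichannel Wiener Filter is given below. The abstract says a variant is used and does not give its equations, so this is background only; the symbols are the usual ones and are assumptions here: \Phi_s and \Phi_n are the speech and noise spatial covariance matrices, u is a one-hot vector selecting the reference channel, and \beta is the trade-off parameter.

```latex
\mathbf{w}_{\beta}
  = \frac{\boldsymbol{\Phi}_{n}^{-1}\,\boldsymbol{\Phi}_{s}}
         {\beta + \operatorname{tr}\!\left(\boldsymbol{\Phi}_{n}^{-1}\boldsymbol{\Phi}_{s}\right)}\,
    \mathbf{u}
```

In this form, \beta = 0 gives the MVDR beamformer and \beta = 1 gives the multichannel Wiener filter, with larger \beta trading more noise reduction for more speech distortion.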

2606.13095 2026-06-12 eess.AS cs.SD New submission

Balancing ASR and diarization in end-to-end LLMs for multi-talker speech recognition

Naijun Zheng, Yuke Lin, Sanli Tian, Mengtian Li, Zhiwei Lin, Longshuai Xiao, Dandan Tu

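AI summary: Proposes a dual-encoder architecture, a feature interleaving format, a length-aware speaker ID loss, and an adaptive-threshold ASR loss strategy to train an LLM-based system efficiently with limited real data while balancing the ASR and diarization tasks, achieving relative improvements of 18% and 24% on the AliMeeting and Aishell4 corpora, respectively.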

Details
Comments
Accepted in Interspeech 2026
English abstract

Multi-talker speech recognition is often addressed by combining automatic speech recognition (ASR) and speaker diarization in a pipeline system. Recently, LLM-based approaches have shown promise by jointly modeling semantic and speaker information, but they typically require large-scale multi-talker corpora that are costly to annotate. In this paper, we investigate how to efficiently train an LLM-based system with limited real-recorded data while maintaining high accuracy in speaker attribution. We propose several strategies: (1) a dual-encoder architecture to extract semantic and speaker features, (2) a feature interleaving format to merge these features as the inputs to the LLM, (3) a length-aware speaker ID loss to enhance diarization capability, and (4) an adaptive threshold strategy for ASR loss computation to mitigate hallucinations caused by speech overlaps. These strategies balance training between ASR and diarization tasks. Our system outperforms open-source baseline approaches, achieving relative improvements of 18% on the AliMeeting corpus and 24% on the Aishell4 corpus.