arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.07822 2026-06-19 cs.CL cs.AI cs.LG New submission

The ACUTE Protocol: Operationalizing Language Model Activations for Better Calibration, Utility, and Trust

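Nishant Subramani, Palash Goyal, Yiwen Song, Mani Malek, Yuan Xue, Tomas Pfister, Hamid Palangi

Affiliations Carnegie Mellon University; Google; Scale AI

AI summary Proposes the ACUTE protocol, which estimates confidence by operationalizing language model activations and balances calibration against informativeness. On multiple choice question answering, tool-calling, and scientific document summarization, it outperforms strong baselines and improves calibration, utility, and trustworthiness.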

Comments ICML 2026

English abstract

As language models improve and become increasingly deployed to solve a variety of tasks, trustworthiness becomes essential. Calibration is a good proxy for trust: well-calibrated confidence estimates help inform the risk versus reward tradeoff when trusting a specific model output. Unfortunately, even as models improve, they remain poorly calibrated, often biasing towards overconfidence. Additionally, calibration can be gamed: a policy that always predicts the base rate is perfectly calibrated, but completely uninformative. To resolve this, we develop a new metric, expected utility renormalized by the oracle (EURO), that balances calibration and informativeness. We also propose a general-purpose activation-based confidence, utility, and trust estimation protocol (ACUTE) to appropriately adjudicate uncertainty. The ACUTE protocol provides flexible, sample-efficient, and compute-efficient confidence estimators for 3 tasks including multiple choice question answering, tool-calling, and scientific document summarization across 6 models from 4 model families. ACUTE outperforms strong baselines on EURO, while maintaining low calibration error. Taken together, our work shows that equipping LLMs with the ACUTE protocol can improve calibration, utility, and trustworthiness in numerous settings.
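The point that calibration can be gamed can be checked numerically. The sketch below is not from the paper: it uses a standard binned expected calibration error (ECE), and the correctness labels, confidence values, and function name are hypothetical. It does not implement EURO, whose definition is not given in the abstract.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean of |accuracy - mean confidence| over equal-width bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    # Assign each confidence to one of n_bins equal-width bins on [0, 1].
    bin_ids = np.minimum((confidences * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight by the fraction of samples in the bin
    return ece

# Hypothetical outcomes: 1 = the model output was correct, 0 = incorrect.
correct = np.array([1, 1, 1, 0, 1, 1, 0, 1, 1, 1])
base_rate = correct.mean()  # 0.8

# Policy A: always report the base rate, whatever the input.
constant_conf = np.full(len(correct), base_rate)
# Policy B: ranks correct above incorrect outputs, but is overconfident.
overconfident = np.where(correct == 1, 0.99, 0.90)

ece_constant = expected_calibration_error(constant_conf, correct)
ece_overconfident = expected_calibration_error(overconfident, correct)
```

Policy A reaches an ECE of essentially zero while carrying no per-example information, whereas Policy B separates correct from incorrect outputs yet scores a worse ECE. That gap is the motivation the abstract gives for a metric that also rewards informativeness.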

2407.11933 2026-06-19 cs.LG

Fairness-Aware Multi-Group Target Detection in Online Discussion

Soumyajit Gupta, Maria De-Arteaga, Matthew Lease

Affiliations Dept. of Computer Science, The University of Texas at Austin; Department of Data, Analytics, Technology, and Artificial Intelligence, ESADE; The Information School, The University of Texas at Austin

AI summary Studies the fairness implications of target-group detection in online discussion and proposes a fairness-aware multi-group target detection approach that reduces bias across groups and improves predictive performance.

Journal ref 2026 ACM Conference on Fairness, Accountability, and Transparency (FAccT)

English abstract

Target-group detection is the task of detecting which group(s) a piece of content is "directed at or about". Applications include targeted marketing, content recommendation, and group-specific content assessment. Key challenges include: 1) that a single post may target multiple groups; and 2) ensuring consistent detection accuracy across groups for fairness. In this work, we investigate fairness implications of target-group detection in the context of toxicity detection, where the perceived harm of a social media post often depends on which group(s) it targets. Because toxicity is highly contextual, language that appears benign in general can be harmful when targeting specific demographic groups. We show our fairness-aware multi-group target detection approach both reduces bias across groups and shows strong predictive performance, surpassing existing fairness-aware baselines. To enable reproducibility and spur future work, we share our code online.

2605.00569 2026-06-19 cs.CV cs.GR

2D-SuGaR: Surface-Aware Gaussian Splatting for Geometrically Accurate Mesh Reconstruction

Prajwal Gupta C. R., Divyam Sheth, Jinjoo Ha, Mirela Ostrek, Justus Thies

Affiliations TU Darmstadt; ELIZA; Max Planck Institute for Intelligent Systems

AI summary Proposes 2D-SuGaR, which incorporates monocular depth and normal priors to improve the geometric accuracy and robustness of mesh reconstruction from multi-view images, achieving state-of-the-art reconstruction results on the DTU dataset.

Journal ref Eurographics 2026 Short Papers, The Eurographics Association, 2026

English abstract

3D Gaussian Splatting (3DGS) has emerged as a powerful technique for generating photorealistic renderings of a scene in real-time. However, the volumetric nature of 3DGS limits its ability to accurately capture surface geometry. To address this, 2D Gaussian Splatting (2DGS) was proposed to enable view-consistent and geometrically accurate surface reconstruction from multi-view images. However, 2DGS can be sensitive to the initialization of the Gaussian primitives. Reliance on Structure-from-Motion (SfM) initializations, which can produce poor estimates on challenging image sets, may lead to subpar results. In this work, we enhance 2DGS by incorporating monocular depth and normal priors to improve both geometric accuracy and robustness. We propose a depth-guided initialization strategy for Gaussians and introduce a clustering-based technique for pruning degenerate Gaussians. We evaluate our method on the DTU dataset, where it achieves state-of-the-art results in mesh reconstruction while preserving high-quality novel view synthesis.

2603.16648 2026-06-19 cs.AI

Domain-Independent Dynamic Programming with Constraint Propagation

Imko Marijnissen, J. Christopher Beck, Emir Demirović, Ryo Kuroiwa

AI summary Integrates constraint propagation into dynamic programming, combining the dynamic programming and constraint programming approaches; this reduces the number of state expansions and performs well on several combinatorial optimisation problems.

Comments 13 pages. To appear at the 36th International Conference on Automated Planning and Scheduling (ICAPS 2026)

Journal ref Proceedings of the International Conference on Automated Planning and Scheduling (2026) | Volume 36(1) | Pages 171-180

English abstract

There are two prevalent model-based paradigms for combinatorial problems: 1) state-based representations, such as heuristic search, dynamic programming (DP), and decision diagrams, and 2) constraint and domain-based representations, such as constraint programming (CP), (mixed-)integer programming, and Boolean satisfiability. In this paper, we bridge the gap between the DP and CP paradigms by integrating constraint propagation into DP, enabling a DP solver to prune states and transitions using constraint propagation. To this end, we implement constraint propagation using a general-purpose CP solver in the Domain-Independent Dynamic Programming framework and evaluate using heuristic search on three combinatorial optimisation problems: Single Machine Scheduling with Time Windows, the Resource Constrained Project Scheduling Problem (RCPSP), and the Travelling Salesperson Problem with Time Windows (TSPTW). Our evaluation shows that constraint propagation significantly reduces the number of state expansions, causing our approach to solve more instances than a DP solver for Single Machine Scheduling and RCPSP, and showing similar improvements for tightly constrained TSPTW instances. The runtime performance indicates that the benefits of propagation outweigh the overhead for constrained instances, but that further work into reducing propagation overhead could improve performance further. Our work is a key step in understanding the value of constraint propagation in DP solvers, providing a model-based approach to integrating DP and CP.

2511.23071 2026-06-19 cs.CV cs.AI cs.CL

Bharat Scene Text: A Novel Comprehensive Dataset and Benchmark for Indian Language Scene Text Understanding

Anik De, Abhirama Subramanyam Penamakuri, Rajeev Yadav, Aditya Rathore, Harshiv Shah, Devesh Sharma, Sagar Agarwal, Pravin Kumar, Anand Mishra

Affiliations Indian Institute of Technology Jodhpur

AI summary Introduces the BSTD dataset, covering 11 Indian languages and English, for studying Indian language scene text recognition; evaluates how well existing models adapt to Indian languages and highlights the challenges and opportunities.

Comments Accepted in International Journal on Document Analysis and Recognition (IJDAR)

Journal ref International Journal on Document Analysis and Recognition (IJDAR), 2026

English abstract

Reading scene text, that is, text appearing in images, has numerous application areas, including assistive technology, search, and e-commerce. Although scene text recognition in English has advanced significantly and is often considered nearly a solved problem, Indian language scene text recognition remains an open challenge. This is due to script diversity, non-standard fonts, and varying writing styles, and, more importantly, the lack of high-quality datasets and open-source models. To address these gaps, we introduce the Bharat Scene Text Dataset (BSTD) - a large-scale and comprehensive benchmark for studying Indian Language Scene Text Recognition. It comprises more than 100K words that span 11 Indian languages and English, sourced from over 6,500 scene images captured across various linguistic regions of India. The dataset is meticulously annotated and supports multiple scene text tasks, including: (i) Scene Text Detection, (ii) Script Identification, (iii) Cropped Word Recognition, and (iv) End-to-End Scene Text Recognition. We evaluated state-of-the-art models originally developed for English by adapting (fine-tuning) them for Indian languages. Our results highlight the challenges and opportunities in Indian language scene text recognition. We believe that this dataset represents a significant step toward advancing research in this domain. All our models and data are open source.

2603.27698 2026-06-19 cs.CV cs.DL

Ink Detection from Surface Topography of the Herculaneum Papyri

Giorgio Angelotti, Federica Nicolardi, Paul Henderson, W. Brent Seales

Affiliations Vesuvius Challenge, USA; Università degli Studi di Napoli Federico II, Italy; University of Glasgow, Scotland, UK; EduceLab, University of Kentucky, USA

AI summary Trains machine learning models on three-dimensional optical profilometry to distinguish ink from papyrus using the surface morphology of written regions, and examines how lateral sampling affects learnability and what high-resolution topography contributes to ink detection.

Comments 9 pages, 3 figures, 2 tables. Currently under review

Journal ref Scientific Reports (2026)

English abstract

Reading the Herculaneum papyri is challenging because both the scrolls and the ink, which is carbon-based, are carbonized. In X-ray radiography and tomography, ink detection typically relies on density- or composition-driven contrast, but carbon ink on carbonized papyrus provides little attenuation contrast. Building on the morphological hypothesis, we show that the surface morphology of written regions contains enough signal to distinguish ink from papyrus. To this end, we train machine learning models on three-dimensional optical profilometry from mechanically opened Herculaneum papyri to separate inked and uninked areas. We further quantify how lateral sampling governs learnability and how a native-resolution model behaves on coarsened inputs. We show that high-resolution topography alone contains a usable signal for ink detection. Diminishing segmentation performance with decreasing lateral resolution provides insight into the characteristic spatial scales that must be resolved on our dataset to exploit the morphological signal. These findings inform spatial resolution targets for morphology-based reading of closed scrolls through X-ray tomography.

2603.27361 2026-06-19 cs.RO

Online Inertia Tensor Identification for Non-Cooperative Spacecraft via Augmented UKF

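Batu Candan, Simone Servadio

Affiliations Department of Aerospace Engineering, Iowa State University

AI summary Proposes an augmented UKF framework that jointly estimates the 6-DOF pose and full inertia tensor of a non-cooperative target spacecraft, fusing vision and LiDAR data to estimate inertial parameters in real time and improve navigation and guidance accuracy in deep-space environments.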

Journal ref AIAA 2026 Region V Student Conference, AIAA 2026-108993

English abstract

Autonomous proximity operations, such as active debris removal and on-orbit servicing, require high-fidelity relative navigation solutions that remain robust in the presence of parametric uncertainty. Standard estimation frameworks typically assume that the target spacecraft's mass properties are known a priori; however, for non-cooperative or tumbling targets, these parameters are often unknown or uncertain, leading to rapid divergence in model-based propagators. This paper presents an augmented Unscented Kalman Filter (UKF) framework designed to jointly estimate the relative 6-DOF pose and the full inertia tensor of a non-cooperative target spacecraft. The proposed architecture fuses visual measurements from monocular vision-based Convolutional Neural Networks (CNN) with depth information from LiDAR to constrain the coupled rigid-body dynamics. By augmenting the state vector to include the six independent elements of the inertia tensor, the filter dynamically recovers the target's normalized mass distribution in real-time without requiring ground-based pre-calibration. To ensure numerical stability and physical consistency during the estimation of constant parameters, the filter employs an adaptive process noise formulation that prevents covariance collapse while allowing for the gradual convergence of the inertial parameters. Numerical validation is performed via Monte Carlo simulations, demonstrating that the proposed Augmented UKF enables the simultaneous convergence of kinematic states and inertial parameters, thereby facilitating accurate long-term trajectory prediction and robust guidance in non-cooperative deep-space environments.
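The state augmentation described above can be illustrated with a minimal process-model sketch. It is not the authors' filter: it covers only torque-free rotational dynamics, omits pose, sigma points, measurements, and the adaptive process noise, and the state layout, element ordering, and function names are assumptions made for illustration.

```python
import numpy as np

def inertia_from_params(p):
    """Build the symmetric 3x3 inertia tensor from its six independent elements.

    Assumed ordering: [Ixx, Iyy, Izz, Ixy, Ixz, Iyz].
    """
    ixx, iyy, izz, ixy, ixz, iyz = p
    return np.array([[ixx, ixy, ixz],
                     [ixy, iyy, iyz],
                     [ixz, iyz, izz]])

def propagate(state, dt):
    """One explicit-Euler step of torque-free rigid-body rotation.

    Assumed augmented state: [wx, wy, wz, Ixx, Iyy, Izz, Ixy, Ixz, Iyz].
    """
    omega, params = state[:3], state[3:]
    inertia = inertia_from_params(params)
    # Euler's equation with zero external torque: I * omega_dot = -(omega x (I * omega)).
    omega_dot = np.linalg.solve(inertia, -np.cross(omega, inertia @ omega))
    # The inertia elements are modeled as constants, so they pass through unchanged.
    return np.concatenate([omega + dt * omega_dot, params])

# Hypothetical tumbling target with a slightly off-diagonal inertia tensor.
omega0 = np.array([0.1, 0.2, 0.3])
params0 = np.array([1.0, 2.0, 3.0, 0.1, 0.0, 0.0])
next_state = propagate(np.concatenate([omega0, params0]), dt=0.01)
next_state_scaled = propagate(np.concatenate([omega0, 5.0 * params0]), dt=0.01)
```

Scaling every inertia element by the same factor leaves the propagated angular velocity unchanged, because the factor cancels in the torque-free equation. This is consistent with the abstract speaking of a normalized mass distribution rather than absolute inertia values.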

2412.20298 2026-06-19 cs.LG cs.CY stat.ML

An Experimental Study on Fairness-aware Machine Learning for Credit Scoring Problems

Huyen Giang Thi Thu, Thang Viet Doan, Ha-Bang Ban, Tai Le Quy

Affiliations Banking Academy of Vietnam; Vietnam Academy of Science and Technology; Hanoi University of Science and Technology; University of Koblenz

AI summary Studies key aspects of fairness-aware machine learning in credit scoring and compares fairness-aware models with traditional classification models, finding that fairness-aware models strike a better balance between predictive accuracy and fairness.

Comments The manuscript is submitted to Springer Nature's journal

English abstract

The digitalization of credit scoring has become essential for financial institutions and commercial banks, especially in the era of digital transformation. Machine learning techniques are commonly used to evaluate customers' creditworthiness. However, the predicted outcomes of machine learning models can be biased toward protected attributes, such as race or gender. Numerous fairness-aware machine learning models and fairness measures have been proposed. Nevertheless, their performance in the context of credit scoring has not been thoroughly investigated. In this paper, we present a comprehensive experimental study of fairness-aware machine learning in credit scoring. The study explores key aspects of credit scoring, including financial datasets, predictive models, and fairness measures. We also provide a detailed evaluation of fairness-aware predictive models and fairness measures on widely used financial datasets. The experimental results show that fairness-aware models achieve a better balance between predictive accuracy and fairness compared to traditional classification models.

2508.21190 2026-06-19 cs.CV

Radially Distorted Homographies, Revisited

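Mårten Wadenbäck, Marcus Valtonen Örnhag, Johan Edstedt

Affiliations Linköping University; Ericsson Research

AI summary Proposes a unified approach to the three configurations of radially distorted homographies and provides fast, stable, and accurate minimal solvers that are faster than existing state-of-the-art solvers at similar accuracy.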

Journal ref 2026, Proceedings of the International Conference on 3D Vision (3DV). Vancouver, BC, Canada: IEEE, pp. 52-62

English abstract

Homographies are among the most prevalent transformations occurring in geometric computer vision and projective geometry, and homography estimation is consequently a crucial step in a wide assortment of computer vision tasks. When working with real images, which are often afflicted with geometric distortions caused by the camera lens, it may be necessary to determine both the homography and the lens distortion (particularly the radial component, called radial distortion) simultaneously to obtain anything resembling useful estimates. When considering a homography with radial distortion between two images, there are three conceptually distinct configurations for the radial distortion: (i) distortion in only one image, (ii) identical distortion in the two images, and (iii) independent distortion in the two images. While these cases have been addressed separately in the past, the present paper provides a novel and unified approach to solve all three cases. We demonstrate how the proposed approach can be used to construct new fast, stable, and accurate minimal solvers for radially distorted homographies. In all three cases, our proposed solvers are faster than the existing state-of-the-art solvers while maintaining similar accuracy. The solvers are tested on well-established benchmarks including images taken with fisheye cameras. A reference implementation of the proposed solvers is made available as part of HomLib (https://github.com/marcusvaltonen/HomLib).

2511.16223 2026-06-19 cs.RO

DynaMimicGen: A Data Generation Framework for Robot Learning of Dynamic Tasks

Vincenzo Pomponi, Paolo Franceschi, Stefano Baraldo, Loris Roveda, Oliver Avram, Luca Maria Gambardella, Anna Valente

Affiliations Institute of Systems and Technologies for Sustainable Production (ISTePS); Department of Innovative Technologies (DTI); University of Applied Science and Arts of Southern Switzerland (SUPSI); Istituto Dalle Molle di studi sull’intelligenza artificiale (IDSIA); Department of Mechanical Engineering; Politecnico di Milano (PoliMi); Faculty of Informatics; Università della Svizzera Italiana (USI)

AI summary Proposes the DynaMimicGen framework, which generates data from a small number of human demonstrations, supports learning of dynamic tasks, and produces adaptive trajectories, improving robot performance in complex environments.

English abstract

Learning robust manipulation policies typically requires large and diverse datasets, the collection of which is time-consuming, labor-intensive, and often impractical for dynamic environments. In this work, we introduce DynaMimicGen (D-MG), a scalable dataset generation framework that enables policy training from minimal human supervision while uniquely supporting dynamic task settings. Given only a few human demonstrations, D-MG first segments the demonstrations into meaningful sub-tasks, then leverages Dynamic Movement Primitives (DMPs) to adapt and generalize the demonstrated behaviors to novel and dynamically changing environments. Improving on prior methods that rely on static assumptions or simplistic trajectory interpolation, D-MG produces smooth, realistic, and task-consistent Cartesian trajectories that adapt in real time to changes in object poses, robot states, or scene geometry during task execution. Our method supports different scenarios - including scene layouts, object instances, and robot configurations - making it suitable for both static and highly dynamic manipulation tasks. We show that robot agents trained via imitation learning on D-MG-generated data achieve strong performance across long-horizon and contact-rich benchmarks, including tasks like cube stacking and placing mugs in drawers, even under unpredictable environment changes. By eliminating the need for extensive human demonstrations and enabling generalization in dynamic settings, D-MG offers a powerful and efficient alternative to manual data collection, paving the way toward scalable, autonomous robot learning.
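How a DMP adapts to a goal that changes during execution can be sketched in one dimension. This is a generic textbook-style transformation system, not D-MG itself: the forcing term that would normally be fit to a demonstration is set to zero, and the gains, time step, function names, and goal schedule are assumptions chosen for illustration.

```python
def rollout_dmp(y0, goal_fn, steps=3000, dt=0.001, tau=1.0, alpha=25.0, beta=6.25):
    """Integrate a 1-D DMP transformation system with a zero forcing term.

        tau * dz/dt = alpha * (beta * (g - y) - z)
        tau * dy/dt = z

    With beta = alpha / 4 the system is critically damped.
    """
    y, z = y0, 0.0
    trajectory = []
    for k in range(steps):
        g = goal_fn(k * dt)  # the goal is re-read every step, so it may move
        z += dt * alpha * (beta * (g - y) - z) / tau
        y += dt * z / tau
        trajectory.append(y)
    return trajectory

def moving_goal(t):
    """Hypothetical goal: the target position shifts from 1.0 to 1.5 at t = 1 s."""
    return 1.0 if t < 1.0 else 1.5

trajectory = rollout_dmp(y0=0.0, goal_fn=moving_goal)
```

The trajectory first settles near 1.0 and then converges smoothly to 1.5 after the goal moves, with no replanning step. This is the property that makes DMPs a natural fit for the changing object poses mentioned in the abstract.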

2510.24435 2026-06-19 cs.AI

Human-Level Reasoning: A Comparative Study of Large Language Models on Logical and Abstract Reasoning

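Benjamin Grando Moreira

Affiliations Universidade Federal de Santa Catarina

AI summary Compares the logical and abstract reasoning abilities of several LLMs on eight custom-designed reasoning questions, revealing differences between models and their shortcomings on reasoning tasks.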

Comments 12 pages

Journal ref Proceedings of the 2026 Computer on the Beach

English abstract

Evaluating reasoning ability in Large Language Models (LLMs) is important for advancing artificial intelligence, as it transcends mere linguistic task performance. It involves understanding whether these models truly understand information, perform inferences, and are able to draw conclusions in a logical and valid way. This study compares logical and abstract reasoning skills of several LLMs - including GPT, Claude, DeepSeek, Gemini, Grok, Llama, Mistral, Perplexity, and Sabiá - using a set of eight custom-designed reasoning questions. The LLM results are benchmarked against human performance on the same tasks, revealing significant differences and indicating areas where LLMs struggle with deduction.

2507.23027 2026-06-19 cs.CV cs.AI

Recovering Diagnostic Value: Super-Resolution-Aided Echocardiographic Classification in Resource-Constrained Imaging

Krishan Agyakari Raja Babu, Om Prabhu, Annu, Mohanasankar Sivaprakasam

Affiliations Indian Institute of Technology Madras; All India Institute of Medical Sciences; Indian Institute of Technology Hyderabad

AI summary Investigates deep learning-based super-resolution for classifying low-quality 2D echocardiograms, using the CAMUS dataset to show that SRGAN and SRResNet improve classification accuracy, with SRResNet also offering computational efficiency.

Comments Accepted at the MICCAI Workshop on "Medical Image Computing in Resource Constrained Settings & Knowledge Interchange (MIRASOL)" 2025

English abstract

Automated cardiac interpretation in resource-constrained settings (RCS) is often hindered by poor-quality echocardiographic imaging, limiting the effectiveness of downstream diagnostic models. While super-resolution (SR) techniques have shown promise in enhancing magnetic resonance imaging (MRI) and computed tomography (CT) scans, their application to echocardiography, a widely accessible but noise-prone modality, remains underexplored. In this work, we investigate the potential of deep learning-based SR to improve classification accuracy on low-quality 2D echocardiograms. Using the publicly available CAMUS dataset, we stratify samples by image quality and evaluate two clinically relevant tasks of varying complexity: a relatively simple Two-Chamber vs. Four-Chamber (2CH vs. 4CH) view classification and a more complex End-Diastole vs. End-Systole (ED vs. ES) phase classification. We apply two widely used SR models, Super-Resolution Generative Adversarial Network (SRGAN) and Super-Resolution Residual Network (SRResNet), to enhance poor-quality images and observe significant gains in performance metrics, particularly with SRResNet, which also offers computational efficiency. Our findings demonstrate that SR can effectively recover diagnostic value in degraded echo scans, making it a viable tool for AI-assisted care in RCS, achieving more with less.

2406.15465 2026-06-19 cs.CL cs.AI

RadEx: A Framework for Structured Information Extraction from Radiology Reports based on Large Language Models

Daniel Reichenpfader, Jonas Knupp, André Sander, Kerstin Denecke

Affiliations Institute for Patient-centered Digital Health, Bern University of Applied Sciences, Biel, Switzerland; ID Suisse AG, St. Gallen, Switzerland

AI summary The RadEx framework comprises 15 software components and 10 artifacts for automated extraction of structured information from radiology reports; it supports generative and encoder-only models and improves information-processing efficiency and system interoperability in clinical applications.

English abstract

Annually and globally, over three billion radiography examinations and computer tomography scans result in mostly unstructured radiology reports containing free text. Despite the potential benefits of structured reporting, its adoption is limited by factors such as established processes, resource constraints and potential loss of information. However, structured information would be necessary for various use cases, including automatic analysis, clinical trial matching, and prediction of health outcomes. This study introduces RadEx, an end-to-end framework comprising 15 software components and ten artifacts to develop systems that perform automated information extraction from radiology reports. It covers the complete process from annotating training data to extracting information by offering a consistent generic information model and setting boundaries for model development. Specifically, RadEx allows clinicians to define relevant information for clinical domains (e.g., mammography) and to create report templates. The framework supports both generative and encoder-only models and the decoupling of information extraction from template filling enables independent model improvements. Developing information extraction systems according to the RadEx framework facilitates implementation and maintenance as components are easily exchangeable, while standardized artifacts ensure interoperability between components.

2306.12679 2026-06-19 cs.CL

Constructing Colloquial Dataset for Persian Sentiment Analysis of Social Microblogs

Mojtaba Mazoochi, Leila Rabiei, Farzaneh Rahmani, Zeinab Rajabi

Affiliations Faculty member in ICT Research Institute; Iran Telecommunication Research Center (ITRC); Faculty member in Computer Department; Mehralborz University; Hazrat-e Masoumeh University

AI summary Constructs a colloquial Persian dataset and proposes a CNN-based model to improve sentiment analysis of colloquial text in social microblogs; experiments show 72% accuracy.

Journal ref Multimedia Tools and Applications, 2025

English abstract

Introduction: Microblogging websites have amassed rich data sources for sentiment analysis and opinion mining. In this regard, sentiment classification has frequently proven inefficient because microblog posts typically lack syntactically consistent terms and representatives since users on these social networks do not like to write lengthy statements. Also, there are some limitations to low-resource languages. The Persian language has exceptional characteristics and demands unique annotated data and models for the sentiment analysis task, which are distinct from the text features of English. Method: This paper first constructs a user opinion dataset called ITRC-Opinion in a collaborative environment and insource way. Our dataset contains 60,000 informal and colloquial Persian texts from social microblogs such as Twitter and Instagram. Second, this study proposes a new architecture based on the convolutional neural network (CNN) model for more effective sentiment analysis of colloquial text in social microblog posts. The constructed datasets are used to evaluate the presented architecture. Furthermore, some models, such as LSTM, CNN-RNN, BiLSTM, and BiGRU with different word embeddings, including Fasttext, Glove, and Word2vec, were also evaluated on our dataset. Results: The results demonstrate the benefit of our dataset and the proposed model (72% accuracy), displaying meaningful improvement in sentiment classification performance.

1902.06202 2026-06-19 cs.CV cs.CG

Using Persistent Homology to Quantify a Diurnal Cycle in Hurricane Felix

Sarah Tymochko, Elizabeth Munch, Jason Dunion, Kristen Corbosiero, Ryan Torn

Affiliations Michigan State University, Dept. of Computational Mathematics, Science and Engineering; Michigan State University, Dept. of Mathematics; Cooperative Institute for Marine and Atmospheric Studies, University of Miami; Hurricane Research Division, NOAA/Atlantic Oceanographic and Meteorological Laboratory; University at Albany - SUNY Albany, Dept. of Atmospheric and Environmental Sciences

AI summary Proposes a method for quantifying the diurnal cycle of tropical cyclones using persistent homology, tracking maximum persistence and quantifying the cycle with the discrete Fourier transform.

详情
AI中文摘要

热带气旋的日变化是卫星图像中出现的每日云层循环,可能对气旋结构和强度有影响。这些日变化脉冲在红外卫星图像中表现为周期性脉冲,从几乎所有大西洋气旋中心向外径向传播。我们提出利用持续同调追踪最大持续性并利用离散傅里叶变换量化日变化,通过Geostationary Operational Environmental Satellite IR影像数据检测飓风菲利克斯的日变化。

英文摘要

The diurnal cycle of tropical cyclones (TCs) is a daily cycle in clouds that appears in satellite images and may have implications for TC structure and intensity. The diurnal pattern can be seen in infrared (IR) satellite imagery as cyclical pulses in the cloud field that propagate radially outward from the center of nearly all Atlantic-basin TCs. These diurnal pulses, a distinguishing characteristic of the TC diurnal cycle, begin forming in the storm's inner core near sunset each day and appear as a region of cooling cloud-top temperatures. The area of cooling takes on a ring-like appearance as cloud-top warming occurs on its inside edge and the cooling moves away from the storm overnight, reaching several hundred kilometers from the circulation center by the following afternoon. The state-of-the-art TC diurnal cycle measurement has a limited ability to analyze the behavior beyond qualitative observations. We present a method for quantifying the TC diurnal cycle using one-dimensional persistent homology, a tool from Topological Data Analysis, by tracking maximum persistence and quantifying the cycle using the discrete Fourier transform. Using Geostationary Operational Environmental Satellite IR imagery data from Hurricane Felix (2007), our method is able to detect an approximate daily cycle.
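The Fourier step described in the abstract can be illustrated with a minimal sketch. It assumes a time series of maximum-persistence values has already been computed (the persistent homology step is not reproduced here); the function name `dominant_period_hours`, the 30-minute sampling step, and the synthetic signal are all hypothetical stand-ins, not the authors' code or data.

```python
import numpy as np

def dominant_period_hours(values, step_hours):
    """Return the period (in hours) of the strongest nonzero DFT frequency."""
    x = np.asarray(values, dtype=float)
    x = x - x.mean()                               # remove the zero-frequency component
    power = np.abs(np.fft.rfft(x)) ** 2            # power at each nonnegative frequency
    freqs = np.fft.rfftfreq(x.size, d=step_hours)  # frequencies in cycles per hour
    k = 1 + int(np.argmax(power[1:]))              # skip the zero-frequency bin
    return 1.0 / freqs[k]

# Synthetic stand-in for a maximum-persistence series: 4 days sampled every 30 minutes.
t = np.arange(0, 96, 0.5)
max_persistence = 5.0 + 2.0 * np.sin(2 * np.pi * t / 24.0)
period = dominant_period_hours(max_persistence, step_hours=0.5)
```

On real imagery the peak would rarely fall exactly on a DFT bin, so one would expect an approximate rather than exact 24-hour estimate, which matches the abstract's wording of an "approximate daily cycle".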

2606.20457 2026-06-19 eess.AS cs.AI cs.LG 新提交

Repurposing a Speech Classifier for Guided Diffusion-Based Speech Generation

重新利用语音分类器进行基于引导扩散的语音生成

Rostislav Makarov, Timo Gerkmann

发表机构 * University of Hamburg(汉堡大学)

AI总结 提出将预训练的语音分类器作为扩散生成的主干,通过附加轻量子网络并仅训练该子网络,实现单主干模型的高质量条件语音生成,降低内存和计算成本。

Comments Accepted for publication in the Proceedings of Interspeech 2026

详情
AI中文摘要

分类器引导是一种通过使用噪声条件分类器将采样过程导向目标类别来控制扩散生成的方法。分类器引导的一个缺点是需要两个单独训练的模型:一个分类器和一个扩散模型。因此,我们研究了一种更紧凑的替代方案,其中将传统训练的语音分类器重新用作扩散生成的主干。从log-Mel空间中的冻结噪声条件分类器开始,我们附加一个轻量子网络,该子网络重用中间分类器表示,并在去噪分数匹配目标下仅训练该子网络。我们的工作表明,预训练的分类器可以重新用于条件生成,为判别建模和条件语音合成之间提供了有吸引力的桥梁,从而在单主干模型中实现高语音质量,同时减少内存占用和计算成本。

英文摘要

Classifier guidance is a way to control diffusion generation by using a noise-conditioned classifier to steer the sampling process toward a target class. One drawback of classifier guidance is that it requires two separately trained models: a classifier and a diffusion model. We therefore study a more compact alternative in which a conventionally trained speech classifier is repurposed as the backbone for diffusion generation. Starting from a frozen noise-conditioned classifier in log-Mel space, we attach a lightweight subnetwork that reuses intermediate classifier representations and train only this subnetwork under a Denoising Score Matching objective. Our work shows that a pretrained classifier can be repurposed for conditional generation, providing an appealing bridge between discriminative modeling and conditional speech synthesis, and resulting in high speech quality within a single-backbone model with reduced memory footprint and computational cost.
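For readers unfamiliar with classifier guidance, the standard identity behind it can be written as follows. This is the textbook form rather than the paper's specific parameterization; the symbols $x_t$ (noisy sample), $y$ (target class), and the guidance scale $\gamma$ are notation assumed here for illustration.

```latex
\nabla_{x_t} \log p(x_t \mid y)
  = \nabla_{x_t} \log p(x_t) + \nabla_{x_t} \log p(y \mid x_t),
\qquad
\tilde{s}(x_t, y)
  = \nabla_{x_t} \log p(x_t) + \gamma \, \nabla_{x_t} \log p(y \mid x_t)
```

The first term is normally supplied by the diffusion model and the second by the noise-conditioned classifier, which is why the conventional setup needs two separately trained models.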

2606.20451 2026-06-19 stat.ML cs.LG stat.AP stat.CO 新提交

SSH-Net: A Deep Neural Network for Predicting Failure Time Distribution Functions under Competing Risks with Application to GPU Data

SSH-Net: 一种用于竞争风险下预测失效时间分布函数的深度神经网络及其在GPU数据上的应用

Jie Min, Yueyao Wang, Mengkun Chen

发表机构 * Department of Mathematics & Statistics, University of South Florida(南佛罗里达大学数学与统计学系) School of Statistics and Data Science, Zhejiang Gongshang University(浙江工商大学统计与数据科学学院) Department of Statistics, Virginia Tech(弗吉尼亚理工大学统计学系)

AI总结 提出结构化分段风险深度神经网络(SSH-Net),通过将网络结构与数据结构关联,允许不同协变量组通过子网络影响预测,在竞争风险框架下预测失效时间分布函数,仿真和GPU数据验证了准确性。

详情
AI中文摘要

竞争风险在工程领域常见,当应用场景复杂时会给时间事件数据建模带来挑战。近年来,深度神经网络因其灵活性和高学习能力在竞争风险预测中受到广泛关注。然而,神经网络结构的复杂性使得基于不同数据输入的超参数调优更加困难。此外,当工程系统具有多层级的复杂物理结构时,将所有结构层级视为单一输入组可能无法捕捉关键信息。为解决这些问题,我们提出了一种结构化分段风险深度神经网络(SSH-Net),用于在特定原因竞争风险框架下预测失效时间。我们的方法将神经网络结构与数据结构相关联,并允许不同的协变量组通过分离的子网络影响失效预测。神经网络基于特定原因竞争风险模型构建。SSH-Net输出特定原因风险函数,并采用惩罚对数似然作为损失函数。通过评估Brier分数、接收者操作特征曲线下面积(AUC)和预测的特定原因累积发生函数的均方根误差(RMSE),仿真研究验证了SSH-Net的预测准确性。我们进一步使用Titan GPU失效时间数据展示了模型预测失效时间分布函数的能力。

英文摘要

Competing risks are commonly observed in engineering fields and can bring challenges to time-to-event data modeling when the application scenarios are complicated. Recently, deep neural networks have received great attention for prediction with competing risks, due to their flexibility and high learning capability. However, the complexity of neural network structure brings extra difficulty in hyperparameter tuning based on different data inputs. Additionally, when an engineered system has complex physical structures with multiple hierarchical levels, treating all structural levels as a single group of inputs may fail to capture critical information. To address these issues, we propose a Structured Segmented Hazard Deep Neural Network (SSH-Net) for failure time prediction under a cause-specific competing risks framework. Our approach associates neural network structure with data structures, and allows different covariate groups to impact the failure prediction through separate sub-networks. The neural network is constructed based on a cause-specific competing risks model. The SSH-Net outputs cause-specific hazard functions, and utilizes the penalized log-likelihood as the loss function. The prediction accuracy of SSH-Net is validated through simulation studies by evaluating the Brier score, the area under the receiver operating characteristic curve (AUC), and the root mean square error (RMSE) of the predicted cause-specific cumulative incidence function. We further demonstrate the model's ability to predict failure time distribution functions using the Titan GPU failure time data.
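The link between cause-specific hazards and the cumulative incidence function mentioned above can be sketched in a few lines. The sketch assumes a discrete-time setting, which the abstract does not specify, and the function name `cumulative_incidence` and the toy hazard values are hypothetical; it is not the SSH-Net implementation.

```python
import numpy as np

def cumulative_incidence(hazards):
    """Discrete-time cumulative incidence from cause-specific hazards.

    hazards: array of shape (time steps, causes); entry [t, k] is the
    probability of failing from cause k at step t given survival so far.
    Returns (cif, survival), where cif[t, k] = P(failed from cause k by step t).
    """
    h = np.asarray(hazards, dtype=float)
    total = h.sum(axis=1)
    if np.any(h < 0) or np.any(total > 1):
        raise ValueError("hazards must be nonnegative and sum to at most 1 per step")
    survival = np.cumprod(1.0 - total)                 # no failure through step t
    at_risk = np.concatenate(([1.0], survival[:-1]))   # survival just before step t
    cif = np.cumsum(h * at_risk[:, None], axis=0)
    return cif, survival

# Toy example with two competing causes over two time steps.
cif, survival = cumulative_incidence([[0.1, 0.2], [0.1, 0.2]])
```

A useful sanity check is that, at every step, the cumulative incidences over all causes plus the overall survival probability add up to one.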

2606.20206 2026-06-19 stat.ML cs.LG 新提交

Off-Policy Evaluation for Missingness-Aware Policies in MDPs with Rewards Missing Not at Random

马尔可夫决策过程中奖励非随机缺失的缺失感知策略的离线评估

Ziheng Wei, Annie Qu, Rui Miao

发表机构 * Department of Statistics, University of Michigan at Ann Arbor(密歇根大学安娜堡分校统计学系) Department of Statistics and Applied Probability, University of California at Santa Barbara(加州大学圣巴巴拉分校统计与应用概率系) Department of Mathematical Sciences, University of Texas at Dallas(德克萨斯大学达拉斯分校数学科学系)

AI总结 针对奖励非随机缺失的离线强化学习问题,提出基于未来状态作为影子变量的识别方法,并利用桥函数和min-max估计器恢复条件均值奖励,实现缺失感知策略的离线评估。

Comments Accepted at ICML 2026. 31 pages, 6 figures

详情
AI中文摘要

在离线强化学习中,由于记录稀疏或不规则,或超出特定奖励值的审查,记录批次数据中的即时奖励通常未被观测到。这个问题出现在实际场景中,包括医疗和营销。我们研究了有限时域马尔可夫决策过程中奖励非随机缺失时的离线策略评估,这破坏了可忽略性,并即使在以状态和行动为条件后也会引起选择偏差。为了解决这个问题,我们形式化了一个依赖于奖励的倾向模型,并使用未来状态作为影子变量来识别完整数据的条件均值奖励。我们进一步引入了一个桥函数,无需显式建模MNAR机制即可恢复条件均值奖励,并通过min-max过程进行估计以避免双重采样。基于这些识别结果,我们提出了一个类似Fitted-Q-Evaluation的估计器,该估计器传播恢复的奖励,同时允许目标策略依赖于过去的缺失指示符。最后,我们为我们的OPE估计器建立了一致性和有限样本误差界,并通过实验在模拟数据和MIMIC-III脓毒症数据上展示了我们方法相比现有方法的强性能。

英文摘要

In offline Reinforcement Learning, immediate rewards in logged batch data are often unobserved due to sparse or irregular record-keeping, or censored beyond certain reward values. This issue arises in practical settings, including health care and marketing. We investigate off-policy evaluation (OPE) in finite-horizon Markov decision processes when rewards are missing not at random (MNAR), which breaks ignorability and induces selection bias even after conditioning on states and actions. To address this, we formalize a reward-dependent propensity model and use future states as shadow variables to identify the full-data conditional mean reward. We further introduce a bridge function that recovers the conditional mean reward without explicitly modeling the MNAR mechanism, and estimate it via a min-max procedure to avoid double sampling. Building upon these identification results, we propose a Fitted-Q-Evaluation-style estimator that propagates the recovered rewards while allowing target policies to depend on past missingness indicators. Finally, we establish consistency and finite-sample error bounds for our OPE estimator, and show through experiments the strong performance of our method compared to existing methods on simulated and MIMIC-III Sepsis data.

2606.20137 2026-06-19 eess.AS cs.CL cs.LG cs.SD 新提交

PASQA: Pitch-Accent-Focused Speech Quality Assessment Model Trained on Synthetic Speech with Accent Errors

PASQA:针对重音错误的合成语音训练的以音高重音为中心的语音质量评估模型

Masaya Kawamura, Yuma Shirahata, Kentaro Mitsui, Reo Shimizu

发表机构 * LY Corporation(LY公司)

AI总结 提出PASQA模型,通过可控重音合成数据集和伪重音质量分数,结合自监督表示、摩拉条件融合等训练策略,有效评估音高重音正确性,优于传统MOS模型。

Comments Accepted to INTERSPEECH 2026

详情
AI中文摘要

现有的平均意见得分(MOS)预测模型通常预测话语级别的自然度MOS,并且可能对局部音高重音错误不敏感。我们提出了以音高重音为中心的语音质量评估(PASQA),明确针对音高重音正确性。为了训练我们的模型,我们使用重音可控的文本转语音系统通过改变重音模式构建了一个受控的日语重音错误数据集,并根据重音错误率计算伪重音质量得分。PASQA建立在自监督表示的基础上,并采用摩拉条件融合、排序损失、辅助重音错误定位任务和说话者不变训练。实验表明,传统模型无法保持按重音错误严重程度的排序,而PASQA在已见和未见说话者上都实现了高排序准确性。此外,PASQA与人类重音正确性判断的一致性更强。代码可在以下网址获取:https://github.com/lycorp-jp/PASQA。

英文摘要

Existing mean opinion score (MOS) prediction models typically predict utterance-level naturalness MOS and can be insensitive to localized pitch-accent errors. We propose Pitch-Accent-focused Speech Quality Assessment (PASQA), which explicitly targets pitch-accent correctness. To train our model, we construct a controlled Japanese accent-error dataset by changing accent patterns using an accent-controllable text-to-speech system, and compute a pseudo accent-quality score from the accent-error rate. PASQA builds on self-supervised representations and employs mora-conditioned fusion, ranking loss, an auxiliary accent-error localization task, and speaker-invariant training. Experiments show that conventional models fail to preserve the ordering by accent-error severity, whereas PASQA achieves high ordering accuracy on both seen and unseen speakers. Further, PASQA shows stronger agreement with human accent-correctness judgments. The code is available at https://github.com/lycorp-jp/PASQA.

2606.20106 2026-06-19 eess.AS cs.SD 新提交

Personalized Keyword Spotting for User-Defined Keywords Leveraging Text-Independent Speaker Verification

利用文本无关说话人验证的用户自定义关键词个性化唤醒

Ming-Hsiang Hu, Kuan-Tang Huang, Chien-Chun Wang, Hung-Shin Lee, Berlin Chen

发表机构 * Dept. Computer Science and Information Engineering, National Taiwan Normal University, Taiwan(计算机科学与信息工程系,台湾国立台湾师范大学) United Link Co., Ltd., Taiwan(台湾联链公司)

AI总结 提出ZP-KWS轻量框架,结合音素监督音频编码器和紧凑说话人编码器,通过乘法后融合实现零样本关键词检测与说话人验证,在多个数据集上将目标误拒率降低高达60%。

Comments Accepted to Interspeech 2026

详情
AI中文摘要

用户自定义关键词唤醒(UD-KWS)能够从文本实现零样本唤醒词检测,但现有系统学习的是说话人不变表示,无法拒绝说出正确关键词的冒名顶替者。我们针对这种双重零样本设置——未见关键词和未见说话人——提出了ZP-KWS,一个轻量级框架,将音素监督的音频编码器与GE2E预训练的紧凑说话人编码器(约0.9M参数)相结合。推理时的乘法后融合赋予每个分支独立的否决权,支持从传统检测到严格说话人门控激活的模式,无需重新训练。在LibriPhrase、Google Speech Commands和Qualcomm数据集上,ZP-KWS在1%虚警率下将目标仅误拒率相对于最强基线降低了高达60%,同时保持有竞争力的关键词检测,且总参数量在1.55M以内,适合边缘部署。

英文摘要

User-defined keyword spotting (UD-KWS) enables zero-shot wake-word detection from text, but existing systems learn speaker-invariant representations that cannot reject impostors uttering the correct keyword. We address this dual zero-shot setting -- unseen keywords and unseen speakers -- with ZP-KWS, a lightweight framework combining a phoneme-supervised audio encoder with a GE2E-pretrained compact speaker encoder (about 0.9M parameters). Multiplicative late fusion at inference grants each branch independent veto power, supporting modes from conventional detection to strict speaker-gated activation without retraining. On LibriPhrase, Google Speech Commands, and Qualcomm datasets, ZP-KWS reduces target-only FRR at 1% FAR by up to 60% relative to the strongest baseline while maintaining competitive keyword detection, all within a 1.55M parameter budget for edge deployment.
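The veto behaviour of multiplicative late fusion can be illustrated with a small sketch. It assumes both branches emit scores in [0, 1]; the function names, the 0.5 threshold, and the `use_speaker_gate` switch are hypothetical and are not taken from ZP-KWS.

```python
def fused_score(keyword_score, speaker_score, use_speaker_gate=True):
    """Multiply branch scores in [0, 1]; a low score on either branch acts as a veto."""
    if not 0.0 <= keyword_score <= 1.0 or not 0.0 <= speaker_score <= 1.0:
        raise ValueError("scores must lie in [0, 1]")
    if not use_speaker_gate:
        return keyword_score              # conventional, speaker-agnostic detection
    return keyword_score * speaker_score  # speaker-gated activation

def accept(keyword_score, speaker_score, threshold=0.5, use_speaker_gate=True):
    """Trigger the wake word only if the fused score reaches the threshold."""
    return fused_score(keyword_score, speaker_score, use_speaker_gate) >= threshold

# An impostor saying the right keyword: high keyword score, low speaker score.
impostor_gated = accept(0.9, 0.1)
impostor_ungated = accept(0.9, 0.1, use_speaker_gate=False)
```

Because the fusion happens only at inference, switching between the gated and ungated modes changes no model weights, which is consistent with the abstract's claim that modes can be changed without retraining.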

2606.20074 2026-06-19 eess.SP cs.AI cs.LG 新提交

Evaluation of EEG Foundation Models for Event-Based Burst-Suppression Detection in ICU

用于ICU中基于事件的爆发-抑制检测的EEG基础模型评估

Elisa Vasta, Thorir Mar Ingolfsson, Andrea Cossettini, Luca Benini, Tilman Beck, Emanuela Keller, Una Pale

发表机构 * DEI, University of Bologna, Bologna, Italy(DEI,博洛尼亚大学,博洛尼亚,意大利)

AI总结 本研究首次评估EEG基础模型在ICU中无需患者校准的爆发检测性能,REVE-base模型在事件级F1分数上达到0.868,并将每分钟爆发错误率分别降低52.1%和36.2%。

Comments 4 pages, 1 figure. Code available upon publication

详情
AI中文摘要

爆发抑制(BS)是一种临床相关的脑电图(EEG)模式,用于监测危重患者的镇静深度和脑活动,特别是在重症监护病房(ICU)的诱导昏迷期间。自动爆发检测仍然具有挑战性,因为BS模式在不同患者之间差异很大,且标注数据集稀缺。最近,EEG基础模型(FMs)在多个下游EEG应用中显示出前景,但它们在BS检测中的实用性尚未被探索。我们提出了第一项研究,评估EEG FMs在减少导联的ICU EEG中无需患者校准的爆发检测性能。我们将REVE-base、LUNA-large和LuMamba-Tiny与自适应阈值基线以及任务特定的EEGNet基线进行比较。此外,我们补充了基于事件的爆发检测评估,以替代传统的EEG窗口分类。这有助于临床评估爆发事件是否被正确检测,减少预期标注变异性的影响。最佳模型REVE-base取得了最高的事件级F1分数($0.868 \pm 0.167$),并且与EEGNet和自适应阈值相比,分别将每分钟爆发错误减少了52.1%和36.2%,支持了FMs在ICU中可扩展的EEG监测。消融实验表明,与冻结骨干训练、两步微调和基于LoRA的适应相比,全微调是最有效的适应策略,对于LUNA-large,事件级F1分数比冻结骨干训练提高了最多$+0.102$。在减少标注数据集的情况下,预训练的REVE-base在25%的队列中比随机初始化高出$+0.723$事件级F1点,证明了在有限标注数据下适应爆发检测时预训练FM表示的优势。

英文摘要

Burst suppression (BS) is a clinically relevant electroencephalographic (EEG) pattern used to monitor sedation depth and brain activity in critically ill patients, particularly during induced coma in Intensive Care Units (ICUs). Automatic burst detection remains challenging because BS patterns vary substantially between patients and annotated datasets are scarce. Recently, EEG Foundation Models (FMs) have shown promise across several downstream EEG applications, but their usefulness for BS detection remains unexplored. We present the first study to evaluate EEG FMs for burst detection in reduced-montage ICU EEG without patient-specific calibration. We compare REVE-base, LUNA-large and LuMamba-Tiny with an adaptive thresholding baseline and a task-specific EEGNet baseline. Additionally, we complement conventional EEG window-based classification with event-based burst detection evaluation. This helps assess, in clinical terms, whether burst episodes are correctly detected, reducing the impact of expected annotation variability. The best model, REVE-base, achieved the highest event-based F1-score ($0.868 \pm 0.167$) and reduced burst-per-minute error by 52.1% and 36.2% compared to EEGNet and adaptive thresholding, respectively, supporting FMs for scalable EEG monitoring in the ICU. Ablation experiments showed that full fine-tuning was the most effective adaptation strategy compared with frozen-backbone training, two-step fine-tuning, and LoRA-based adaptation, improving event-based F1-score over frozen-backbone training by up to $+0.102$ for LUNA-large. With reduced labeled datasets, pretrained REVE-base outperformed random initialization by $+0.723$ event-based F1 points at 25% of the cohort, demonstrating the benefit of pretrained FM representations when adapting to burst detection with limited labeled data.
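To make the difference between window-based and event-based scoring concrete, here is a minimal sketch of an event-level F1. The matching rule (any temporal overlap counts as a hit) is an assumption chosen for simplicity; the abstract does not state the criterion the authors use, and `event_f1` is a hypothetical name.

```python
def event_f1(true_events, pred_events):
    """Event-level F1 for lists of (start, end) intervals in seconds.

    Assumed matching rule: a true event is detected if any predicted event
    overlaps it, and a predicted event is a false alarm if it overlaps none.
    """
    def overlaps(a, b):
        return a[0] < b[1] and b[0] < a[1]

    tp = sum(any(overlaps(t, p) for p in pred_events) for t in true_events)
    fn = len(true_events) - tp
    fp = sum(not any(overlaps(p, t) for t in true_events) for p in pred_events)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Two of three annotated bursts are found, plus one false alarm.
score = event_f1([(0, 2), (5, 7), (10, 12)], [(1, 3), (5.5, 6), (20, 21)])
```

Under this rule a detection that is slightly shifted relative to the annotation still counts, which is one way such a metric could tolerate annotation variability in burst boundaries.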

2606.20041 2026-06-19 econ.GN cs.AI cs.LG q-fin.EC q-fin.GN 新提交

AI Economist Agent: An Agentic Framework for Model-Grounded Economic Analysis with RAG, Knowledge Graphs, and Large Language Models

AI经济学家代理:一种基于模型的经济分析代理框架,结合RAG、知识图谱和大语言模型

Masahiro Kato

发表机构 * Mizuho-DL Financial Technology, Co., Ltd.(Mizuho-DL金融科技有限公司)

AI总结 提出一种基于RAG的AI经济学家代理框架,利用知识图谱和大语言模型进行经济情景分析,通过代理规划、检索证据、选择模型并生成报告,提高经济叙事的连贯性和可追溯性。

详情
AI中文摘要

我们提出了一种基于模型的RAG型AI经济学家,具有用于经济情景分析的代理框架,使用大语言模型(LLMs)和知识图谱。虽然LLMs可以生成流畅的经济叙事,但经济学家通常需要做出基于经济理论和现实数据的经济主张。基于这一动机,本研究提出了一种基于RAG的AI经济学家,它利用包含经济数据和理论的知识图谱以及基于LLM的代理来规划分析、检索相关证据、选择合适的模型并生成报告。在我们的框架中,我们不直接仅使用语言模型产生定量主张;相反,我们生成基于显式模型计算的叙事,并通过AI代理与检索到的证据相关联。我们将我们的框架称为AI经济学家代理。我们在两个应用中评估了AI经济学家代理:为美国通胀持续性和美联储政策生成经济学家报告,以及为美国商业房地产再融资压力生成银行压力测试叙事。结果说明了为生成的报告提供模型和证据依据如何提高其经济连贯性和可追溯性。

英文摘要

We propose a model-grounded RAG-based AI economist with an agentic framework for economic scenario analysis using large language models (LLMs) and knowledge graphs. While LLMs can generate fluent economic narratives, economists are often required to make economic claims grounded in economic theory and real-world data. Based on this motivation, this study proposes a RAG-based AI economist, which utilizes knowledge graphs including economic data and theory and LLM-based agents to plan the analysis, retrieve relevant evidence, select appropriate models, and generate reports. In our framework, we do not produce quantitative claims directly with the language model alone; instead, we generate narratives grounded in explicit model-based computations and linked to the retrieved evidence via AI agents. We refer to our framework as an AI economist agent. We evaluate the AI economist agent in two applications: economist report generation for U.S. inflation persistence and Federal Reserve policy, and bank stress-test narrative generation for U.S. commercial real estate refinancing stress. The results illustrate how grounding the generated reports improves their economic coherence and traceability.

2606.19951 2026-06-19 eess.AS cs.CL cs.LG cs.SD 新提交

Investigating Human-Model Discrepancies in Speech Quality Assessment via Acoustic and Prosodic Perturbations

通过声学和韵律扰动研究语音质量评估中的人机差异

Masato Takagi, Masaya Kawamura, Reo Shimizu, Yuma Shirahata

发表机构 * Nagoya Institute of Technology, Japan(名古屋工业大学,日本) LY Corporation, Japan(LY公司,日本)

AI总结 通过声学退化、韵律错误和说话人特征扰动,发现MOS预测模型对声学退化敏感,但对韵律错误不敏感,且对基频有偏见,而对语速和基频变化不敏感。

Comments Accepted to INTERSPEECH 2026

详情
AI中文摘要

平均意见得分(MOS)预测模型在文本到语音(TTS)研究中被广泛用作代理指标,但它们捕捉超出声学保真度的质量差异的能力仍不清楚。我们通过控制性扰动来研究这一点:声学退化、韵律错误以及说话人特定特征(如音高和语速)的操纵。我们从人类听众和模型那里获得了这些语音样本的MOS预测,并分析了它们感知特征的差异。结果表明,大多数模型能很好地跟踪声学退化,而所有模型对韵律错误不敏感,尽管主观评分大幅下降。对于说话人特征,模型表现出双重分离:在人类评分中不存在的强平均基频(F0)偏见,但对人类注意到的语速和F0变化不敏感。这些发现突出了标量MOS预测在声学保真度之外的局限性。

英文摘要

Mean opinion score (MOS) prediction models are widely used as proxy metrics in text-to-speech (TTS) research, yet their ability to capture quality differences beyond acoustic fidelity remains unclear. We investigate this via controlled perturbations on speech: acoustic degradation, prosodic errors, and manipulation of speaker-specific characteristics such as pitch and speaking rate. We obtained MOS scores for these speech samples from both human listeners and the models, and analyzed the differences in their perceptual characteristics. Results show that most models track acoustic degradation well, while all are insensitive to prosodic errors despite large subjective score drops. For speaker characteristics, models exhibit a double dissociation: strong mean fundamental frequency (F0) biases absent in human ratings, yet insensitivity to speaking rate and F0 variability that humans notice. These findings highlight limitations of scalar MOS prediction beyond acoustic fidelity.

2606.19823 2026-06-19 eess.AS cs.LG 新提交

Low-Burden Data Augmentation for Dysarthric ASR via Zero-Shot Voice Cloning

低负担数据增强:通过零样本语音克隆改善构音障碍语音识别

Satwinder Singh, Qianli Wang, Zihan Zhong, Clarion Mendes, Hasegawa-Johnson, Waleed Abdulla, Seyed Reza Shahamiri

发表机构 * DeepNet Discovery Network, University of Auckland, New Zealand(奥克兰大学深网发现网络, 新西兰) University of Illinois Urbana-Champaign, USA(伊利诺伊大学厄巴纳-香槟分校, 美国)

AI总结 针对构音障碍语音数据稀缺和变异性大的问题,提出使用零样本语音克隆(Higgs Audio V2)生成合成数据,微调Whisper-medium模型,在TORGO数据集上达到与真实数据微调相近的词错误率,并显著降低数据收集成本。

Comments Accepted to Interspeech 2026, Sydney, Australia

详情
AI中文摘要

由于数据稀缺和说话人之间高度变异,自动语音识别对于构音障碍语音仍然不可靠。虽然合成数据可以弥补这些不足,但传统方法通常需要大量的说话人特定数据,重新引入了数据收集瓶颈。我们研究零样本语音克隆作为一种低负担的增强策略,使用Higgs Audio V2克隆TORGO数据集中的说话人。我们在克隆数据、真实数据和混合数据上微调Whisper-medium,并在保留的真实语音上进行评估。与零样本基线(31.62%)相比,克隆数据微调实现了具有竞争力的26.00%词错误率,几乎与真实数据微调(24.44%)和混合数据微调(25.12%)相当。值得注意的是,对于中重度构音障碍说话人,克隆和混合微调优于真实数据微调。在SAP-1102上的跨语料库评估中,克隆微调取得了最佳结果(相对提升11.45%)。这些结果表明,零样本克隆提供了可扩展的训练数据,绕过了昂贵的数据收集瓶颈。

英文摘要

Automatic speech recognition remains unreliable for dysarthric speech due to data scarcity and high inter-speaker variability. While synthetic data can address these gaps, traditional methods often require extensive speaker-specific data, reintroducing the collection bottleneck. We investigate zero-shot voice cloning as a low-burden augmentation strategy, using Higgs Audio V2 to clone speakers in the TORGO dataset. We fine-tune (FT) Whisper-medium on cloned, real, and hybrid data and evaluate on held-out real speech. Compared to the zero-shot baseline (31.62% word error rate, WER), Clone FT achieved a competitive 26.00% WER, close to the 24.44% and 25.12% obtained with Real and Hybrid FT, respectively. Notably, Clone and Hybrid FT outperform Real FT for moderate-severe speakers. Clone FT achieves the best results (11.45% relative improvement) in cross-corpus evaluation on SAP-1102. These results suggest that zero-shot cloning provides scalable training data that circumvents the costly data collection bottleneck.
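The abstract mixes absolute WER values with a relative improvement, so a short worked calculation may help. The relative reductions computed below are derived here from the TORGO WER values quoted in the abstract; they are not figures reported by the authors, and `relative_wer_reduction` is a hypothetical helper name.

```python
def relative_wer_reduction(baseline_wer, new_wer):
    """Relative WER reduction in percent: how much of the baseline error is removed."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer

ZERO_SHOT = 31.62  # zero-shot WER quoted in the abstract (%)
reductions = {
    "clone": relative_wer_reduction(ZERO_SHOT, 26.00),
    "real": relative_wer_reduction(ZERO_SHOT, 24.44),
    "hybrid": relative_wer_reduction(ZERO_SHOT, 25.12),
}
```

For example, the drop from 31.62% to 26.00% is 5.62 absolute points, which corresponds to roughly an 18% relative reduction on TORGO. The 11.45% relative figure in the abstract refers to the separate cross-corpus evaluation and cannot be recomputed from the numbers given.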

2606.19797 2026-06-19 eess.AS cs.AI cs.SD eess.SP 新提交

Improving End-to-End Speech Recognition for Dysarthric Speech through In-Domain Data Augmentation

通过域内数据增强改进构音障碍语音的端到端语音识别

Paban Sapkota, Hemant Kumar Kathania, Sudarsana Reddy Kadiri, Shrikanth Narayanan

发表机构 * Department of Electronics and Communication Engineering, National Institute of Technology Sikkim, India(印度国立技术学院锡金分校电子与通信工程系) Signal Analysis and Interpretation Laboratory (SAIL), University of Southern California, Los Angeles, USA(信号分析与解释实验室(SAIL),美国南加州大学洛杉矶分校)

AI总结 针对构音障碍语音识别中数据稀缺和严重程度差异的问题,本文探索了四种数据增强方法(SRM、PM、FM、VTLP)对预训练Wav2Vec2模型进行微调,在不同严重程度上实现了显著的字错误率降低。

详情
AI中文摘要

构音障碍语音识别对于促进构音障碍患者之间的有效沟通至关重要。然而,由于严重程度不同和数据可用性有限,准确识别构音障碍语音面临重大挑战。在本文中,我们通过微调端到端预训练Wav2Vec2模型,探索了针对构音障碍自动语音识别(ASR)系统的数据增强技术,特别关注严重程度级别。为了解决数据稀缺以及微调预训练ASR系统用于构音障碍语音时需要大量数据的问题,我们研究了四种主要的数据增强方法:语速修改(SRM)、音高修改(PM)、共振峰修改(FM)和声道长度扰动(VTLP),这些方法针对构音障碍的不同方面进行了调整。本研究使用为每个严重程度类别单独微调的Wav2Vec2模型作为基线系统。此外,我们使用增强数据对ASR模型进行了特定严重程度的微调。结果表明,每种增强技术在不同严重程度级别上表现出不同的有效性模式。对于低(9.02%)和中(38.11%)严重程度,使用SRM($s$=0.8)获得了最佳WER;对于高严重程度(55.15%),使用PM($\tau$=0.8)获得了最佳WER,分别相对改进了30.02%、16.64%和15.47%。这些结果证实了增强方法在提高构音障碍ASR性能方面的有效性。

英文摘要

Dysarthric speech recognition is crucial for facilitating effective communication among individuals with dysarthria. However, accurately recognizing dysarthric speech poses significant challenges due to varying severity levels and limited data availability. In this paper, we explore data augmentation techniques for dysarthric automatic speech recognition (ASR) systems by fine-tuning the End-to-End pre-trained Wav2Vec2 model, with a specific focus on severity levels. To address the challenges of data scarcity and the need for extensive data in fine-tuning pre-trained ASR systems for dysarthric speech, we investigate four prominent data augmentation methods: Speaking-Rate Modification (SRM), Pitch Modification (PM), Formant Modification (FM), and Vocal Tract Length Perturbation (VTLP), tailored to different aspects of dysarthria. The study uses individually fine-tuned Wav2Vec2 models for each severity class as baseline systems. Additionally, we conducted severity-specific fine-tuning of the ASR model using augmented data. Results demonstrate distinct efficacy patterns for each augmentation technique across severity levels. The best WERs were achieved with SRM ($s$=0.8) for low (9.02%) and medium (38.11%) severities, and with PM ($\tau$=0.8) for high severity (55.15%), reflecting relative improvements of 30.02%, 16.64%, and 15.47%, respectively. These results confirm the effectiveness of the augmentation methods in improving dysarthric ASR performance.

2606.19793 2026-06-19 eess.AS cs.AI cs.LG cs.SD eess.SP 新提交

Systematic Study of Dysarthric Speech Recognition: Spectral Features and Acoustic Models

构音障碍语音识别的系统研究:频谱特征与声学模型

Paban Sapkota, Hemant Kumar Kathania, Mikko Kurimo, Sudarsana Reddy Kadiri, Shrikanth Narayanan

发表机构 * Department of Electronics and Communication Engineering, National Institute of Technology Sikkim, India(印度国立技术学院锡金分校电子与通信工程系) Department of Information and Communications Engineering, Aalto University, Finland(信息与通信工程系,阿尔托大学,芬兰) Signal Analysis and Interpretation Laboratory (SAIL), University of Southern California, Los Angeles, USA(信号分析与解释实验室(SAIL),美国南加州大学洛杉矶分校)

AI总结 本文系统研究不同频谱特征与声学模型的组合,通过引入音高特征和优化训练帧重叠数,在F-TDNN模型上实现孤立词和句子识别相对提升4.65%和4.63%。

详情
AI中文摘要

识别构音障碍语音的挑战主要源于发音精度受损导致的显著声学变异性。过去的研究表明,通过使用混合DNN/HMM序列区分性训练可以改善识别性能。本文对不同声学模型定制的各种声学特征组合进行了全面研究,为每种模型提供了合适的特征选择。音高特征的引入显著提高了识别性能,特别是对于涉及构音障碍语音的句子识别任务。通过对TORGO数据库的系统检查,我们证明了增强最先进的因子化时延神经网络(F-TDNN)模型识别构音障碍语音性能的潜力。使用F-TDNN模型实现的方法,与先前研究相比,在构音障碍语音的孤立词识别中获得了4.65%的相对改进,在句子识别中获得了4.63%的相对改进。这种改进有效补偿了语音变异性,这归因于我们精心选择了连续训练样本块之间的重叠帧数。

英文摘要

The challenge associated with recognizing dysarthric speech primarily arises from pronounced acoustic variability attributed to impaired articulatory precision. Past research has demonstrated improved recognition through the use of hybrid DNN/HMM sequence discriminative training. This paper presents a comprehensive investigation of various combinations of acoustic features tailored to different acoustic models, offering suitable feature selections for each. The incorporation of pitch features notably improved recognition performance, especially for sentence recognition tasks involving dysarthric speech. Through a systematic examination of the TORGO database, we have demonstrated the potential to enhance the performance of the state-of-the-art Factorized Time Delay Neural Network (F-TDNN) model for recognizing dysarthric speech. Our methods, implemented with the F-TDNN model, resulted in a 4.65% relative improvement in isolated word recognition and a 4.63% relative improvement in sentence recognition for dysarthric speech, compared to previous research. This improvement effectively compensates for speech variability, attributable to our deliberate selection of the number of overlapping frames between consecutive training example chunks.

2606.19791 2026-06-19 eess.AS cs.AI cs.SD 新提交

Cross-Dataset, Age, and Gender Generalization: A Comprehensive Analysis of Fine-Tuning Strategies for Low-Resource Children's ASR

跨数据集、年龄和性别泛化:低资源儿童语音识别的微调策略综合分析

Paban Sapkota, Hemant Kumar Kathania, Mikko Kurimo, Sudarsana Reddy Kadiri, Shrikanth Narayanan

发表机构 * Department of Electronics and Communication Engineering, National Institute of Technology Sikkim, India(印度国立技术学院锡金分校电子与通信工程系) Signal Analysis and Interpretation Laboratory, University of Southern California, Los Angeles, USA(美国南加州大学洛杉矶分校信号分析与解释实验室)

AI总结 针对低资源儿童语音识别,系统分析了不同微调策略在跨数据集、年龄和性别泛化上的表现,发现特定策略能显著提升泛化能力。

详情

2606.19714 2026-06-19 stat.ML cs.AI cs.LG stat.CO stat.ME 新提交

AURA: Adaptive Uncertainty-aware Refinement for LLM-as-a-Judge Auditing

AURA: 用于LLM作为评判审计的自适应不确定性感知精炼

Zilong Zhang, Yi-Ting Hung, Weiyi He, Junxi Zhang, Lei Ding, Chi-Kuang Yeh

发表机构 * Department of Mathematics and Statistics(数学与统计学系) Georgia State University(佐治亚州立大学) Department of Probability and Statistics(概率与统计学系) Department of Computer Science and Engineering(计算机科学与工程系) Michigan State University(密歇根州立大学) Concordia University(Concordia 大学) Department of Statistics(统计学系) University of Manitoba(曼尼托巴大学)

AI总结 提出AURA框架,通过自适应不确定性感知精炼,在少量人工验证下迭代学习人类一致性信号,优先审核不确定比较,提升LLM评判的可靠性。

详情
AI中文摘要

大型语言模型(LLM)越来越多地被用作开放式生成的评判者,因为大规模人工评估通常昂贵且难以扩展,但它们的偏好仍然是人类判断的不完美代理。现有的审计流程通常假设事先存在可靠的示例子集或干净的监督信号,例如来自人工注释、启发式过滤或强评判者的输出。在LLM评估中,这一假设是脆弱的:初始分割可能继承评判者偏差,而人工验证通常过于稀缺,无法在规模上定义稳定组。我们提出AURA,一种自适应不确定性感知精炼框架,用于在选定的人工验证下审计成对LLM作为评判的决策。AURA迭代学习人类一致性信号,传播可靠证据,并优先将不确定的比较提交人工审核。关键思想是将对评判者的信任视为一个潜在量,随着证据积累逐步精炼。我们提供了紧凑的公式、稳定的精炼过程,以及在合成和真实成对LLM答案数据上的全面评估。

英文摘要

Large language models (LLMs) are increasingly used as judges for open-ended generation, as large-scale human evaluation is often expensive and difficult to scale, yet their preferences remain imperfect proxies for human judgment. Existing auditing pipelines often assume that a reliable subset of examples or clean supervision signals are available beforehand, for example from human annotation, heuristic filtering, or the outputs of strong judges. In LLM evaluation, this assumption is fragile: the initial split may inherit judge bias, while human verification is typically too scarce to define stable groups at scale. We propose AURA, an adaptive uncertainty-aware refinement framework for auditing pairwise LLM-as-a-judge decisions under selected human verification. AURA iteratively learns a human-consistency signal, propagates reliable evidence, and prioritizes uncertain comparisons for human review. The key idea is to treat trust in a judge as a latent quantity that is progressively refined as evidence accumulates. We provide a compact formulation, a stable refinement procedure, and a comprehensive evaluation on both synthetic and real pairwise LLM-answer data.

2606.19643 2026-06-19 stat.ML cs.LG 新提交

Variational Consensus Monte Carlo for Bayesian Mixture

变分共识蒙特卡洛用于贝叶斯混合模型

Julie Fendler, Francesca L. Crowe, Tom Marshall, Sylvia Richardson, Paul D. W. Kirk

发表机构 * MRC Biostatistics Unit, University of Cambridge(剑桥大学生物统计学单位) Institute of Applied Health Research, University of Birmingham(伯明翰大学应用健康研究学院)

AI总结 提出变分共识蒙特卡洛方法扩展至过拟合贝叶斯混合模型,通过新颖的聚类匹配算法和聚合策略,在联邦学习设置下推断聚类数和所有参数,并在模拟和真实电子健康记录数据上验证了有效性。

详情
AI中文摘要

受健康数据的隐私、敏感性和共享限制的驱动,我们提出了一个在联邦学习设置下(即数据无法在计算节点之间完全共享或汇集)对贝叶斯混合模型进行推断的全面流程。我们采用共识蒙特卡洛(CMC)方法,在每个数据孤岛内独立运行MCMC算法以估计局部后验分布,然后聚合这些分布以近似完整数据的后验。Rabinovich, Angelino 和 Jordan (2015) [1] 的变分CMC方法将聚合步骤视为变分推断问题,但他们应用于混合模型时假设聚类数和关键混合参数已知。我们的主要方法贡献是:(i) 将变分CMC扩展到过拟合贝叶斯混合模型,该模型推断聚类数和所有模型参数,无需共轭性;(ii) 适用于跨孤岛设置的新颖聚类匹配算法,其中并非每个聚类都出现在每个局部数据集中;(iii) 针对聚合步骤的多种推断策略,匹配不同的联邦学习约束;以及 (iv) 在实践中选择这些策略的指南。一项全面的模拟研究验证了该框架,并允许我们与最先进的联邦学习替代方法进行比较。值得注意的是,我们表明当局部数据集的组成反映了数据中的底层聚类结构时,我们的方法可以比应用于汇集数据的标准MCMC更准确地恢复小聚类。我们在大规模电子健康记录数据上展示了该框架,识别了英国老年人群中的多发病模式。

英文摘要

Motivated by the privacy, sensitivity and sharing limitations of health data, we present a comprehensive pipeline for inference of Bayesian mixture models within a federated learning setting, i.e. when data cannot be fully shared or pooled across compute nodes. We adopt a Consensus Monte Carlo (CMC) approach, in which an MCMC algorithm is run independently within each data silo to estimate local posterior distributions, which are then aggregated to approximate the posterior over the full data. The variational CMC approach of Rabinovich, Angelino and Jordan (2015) [1] frames the aggregation step as a variational inference problem, but their application to mixtures assumes the number of clusters and key mixture parameters to be known. Our main methodological contributions are: (i) an extension of variational CMC to over-fitted Bayesian mixture models that infer the number of clusters and all model parameters, without requiring conjugacy; (ii) novel cluster-matching algorithms suitable for cross-silo settings in which not every cluster appears in each local dataset; (iii) a number of inference strategies for the aggregation step, matched to different federated learning constraints; and (iv) guidelines for choosing among these in practice. A comprehensive simulation study validates the framework and allows us to compare to state-of-the-art federated learning alternatives. Notably, we show that when the composition of local datasets reflects the underlying clustering structure in the data, our approach can recover small clusters with greater accuracy than standard MCMC applied to the pooled data. We illustrate the framework on large-scale electronic health record data, identifying multi-morbidity patterns in a British geriatric population.
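As background for the aggregation step, the following sketch shows the simple weighted-averaging form of consensus Monte Carlo for a single scalar parameter. It is deliberately not the variational aggregation or the cluster-matching procedure proposed in this paper; the function name `consensus_draws`, the inverse-variance weights, and the toy draws are illustrative assumptions.

```python
import numpy as np

def consensus_draws(sub_draws):
    """Combine per-silo MCMC draws of a scalar parameter.

    sub_draws: array of shape (silos, draws). The g-th consensus draw is the
    precision-weighted average of the g-th draw from every silo, with each
    silo weighted by the inverse of its sample variance.
    """
    sub_draws = np.asarray(sub_draws, dtype=float)
    weights = 1.0 / sub_draws.var(axis=1, ddof=1)   # one weight per silo
    return (weights[:, None] * sub_draws).sum(axis=0) / weights.sum()

# Two silos whose local posteriors have equal spread but different centres.
combined = consensus_draws([[0.0, 1.0, 2.0], [2.0, 3.0, 4.0]])
```

Only the draws leave each silo, never the raw data, which is what makes this family of methods attractive under data-sharing constraints. Plain averaging of this kind is not meaningful for mixture models without first aligning cluster labels across silos, which is the gap the cluster-matching contribution in the abstract addresses.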

2606.19587 2026-06-19 stat.ML cs.LG 新提交

A Solver-Free Training Method for Predict-then-Optimize

一种无求解器的预测后优化训练方法

Beichen Wan, Mo Liu

发表机构 * Department of Statistics and Operations Research, University of North Carolina at Chapel Hill, NC, USA(统计与运筹学系,北卡罗来纳大学教堂山分校)

AI总结 提出一种基于测度变换的决策聚焦学习管道,通过无求解器代理损失实现预测后优化中预测模型的高效训练,理论保证Fisher一致性,训练时间降低数个数量级。

Comments Accepted by ICML 2026

详情
AI中文摘要

我们提出了一种可扩展的方法,用于在预测后优化范式中训练预测(机器学习)模型,其中模型输出作为后续线性优化任务的系数。直接最小化经验决策遗憾对于线性规划和组合优化是不可行的,因为决策映射是分段常数,且梯度几乎处处为零。虽然现有方法通过平滑微分过程来解决这一问题,但它们存在可扩展性问题,因为每次梯度评估都需要调用计算昂贵的求解器。为了解决这个问题,我们提出了一种基于测度变换原理的决策聚焦学习管道,该管道在训练期间产生一个完全无优化求解器的新代理损失。我们建立了理论保证,包括Fisher一致性和超额风险界。实验上,我们的方法在实现与最先进方法相当的决策质量的同时,将训练时间减少了数个数量级。

英文摘要

We propose a scalable method for training prediction (machine learning) models in the predict-then-optimize paradigm, where model outputs serve as coefficients for a subsequent linear optimization task. Directly minimizing the empirical decision regret is intractable for linear programming and combinatorial optimization since the decision mapping is piecewise constant, and the gradients are zero almost everywhere. While existing methods address this by smoothing the differentiation process, they suffer from scalability issues, since a computationally expensive solver call is required for every gradient evaluation. To address this, we propose a decision-focused learning pipeline based on a measure transformation principle, which yields a new surrogate loss that is completely optimization-solver-free during training. We establish theoretical guarantees, including Fisher consistency and excess risk bounds. Empirically, our method achieves decision quality competitive with state-of-the-art methods while reducing training time by orders of magnitude.
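The claim that decision regret is piecewise constant in the prediction can be seen on a toy problem. The sketch below uses a hypothetical three-item selection problem whose feasible set is enumerated directly; it illustrates the regret that the paper's surrogate loss is designed to avoid differentiating, not the proposed solver-free loss itself.

```python
import numpy as np

# Hypothetical feasible set: the vertices of a tiny linear program (pick one of three items).
VERTICES = np.eye(3)

def decide(cost):
    """Minimize cost @ w over the vertices and return the optimal vertex."""
    return VERTICES[int(np.argmin(VERTICES @ cost))]

def regret(predicted_cost, true_cost):
    """Extra true cost incurred by acting on the prediction instead of the truth."""
    return float(true_cost @ decide(predicted_cost) - true_cost @ decide(true_cost))

true_cost = np.array([1.0, 2.0, 3.0])
r_small = regret(np.array([1.5, 2.0, 3.0]), true_cost)   # prediction still picks item 0
r_nudged = regret(np.array([1.6, 2.0, 3.0]), true_cost)  # small change, same decision
r_wrong = regret(np.array([2.5, 2.0, 3.0]), true_cost)   # prediction now picks item 1
```

Nudging the prediction leaves the regret unchanged until the chosen vertex flips, so the gradient with respect to the prediction is zero almost everywhere. In realistic problems `decide` is a call to an optimization solver, which is the per-gradient cost that the abstract says existing smoothing approaches incur.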