arXivDaily: arXiv daily academic digest, updated Monday through Friday
2502.18816 2026-05-08 cs.CV

Grad-ECLIP: Gradient-based Visual and Textual Explanations for CLIP

Chenyang Zhao, Kun Wang, Janet H. Hsiao, Antoni B. Chan

Affiliations: Department of Computer Science, City University of Hong Kong; Division of Social Science and Department of Computer Science & Engineering, Hong Kong University of Science & Technology; SenseTime Group Ltd

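AI summary: This paper proposes Grad-ECLIP, which decomposes the CLIP encoder architecture and analyzes the relationship between the matching similarity and intermediate spatial features to produce heat maps that explain CLIP matching results. Channel and spatial weights improve the quality of the visual explanations, and qualitative and quantitative evaluations verify the method's effectiveness.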

Abstract

Significant progress has been achieved on the improvement and downstream usages of the Contrastive Language-Image Pre-training (CLIP) vision-language model, while less attention is paid to the interpretation of CLIP. We propose a Gradient-based visual and textual Explanation method for CLIP (Grad-ECLIP), which interprets the matching result of CLIP for a specific input image-text pair. By decomposing the architecture of the encoder and discovering the relationship between the matching similarity and intermediate spatial features, Grad-ECLIP produces effective heat maps that show the influence of image regions or words on the CLIP results. Different from previous Transformer interpretation methods that focus on the utilization of self-attention maps, which are typically extremely sparse in CLIP, we produce high-quality visual explanations by applying channel and spatial weights on token features. Qualitative and quantitative evaluations verify the effectiveness and superiority of Grad-ECLIP compared with state-of-the-art methods. Furthermore, a series of analyses is conducted based on our visual and textual explanation results, from which we explore the working mechanism of image-text matching, the strengths and limitations in attribution identification of CLIP, and the relationship between the concreteness/abstractness of a word and its usage in CLIP. Finally, based on the ability of the explanation map to indicate the text-specific salient region of an input image, we also propose an application of Grad-ECLIP that boosts fine-grained alignment in CLIP fine-tuning. The code of Grad-ECLIP is available here: https://github.com/Cyang-Zhao/Grad-Eclip.

2502.16022 2026-05-08 cs.CL

Enhancing LLMs for Identifying and Prioritizing Important Medical Jargons from Electronic Health Record Notes Utilizing Data Augmentation

Won Seok Jang, Sharmin Sultana, Zonghai Yao, Hieu Tran, Zhichao Yang, Sunjae Kwon, Hong Yu

Affiliations: Miner School of Computer and Information Sciences, UMass Lowell, MA, USA; Manning College of Information and Computer Sciences, UMass Amherst, MA, USA; Center for Healthcare Organization and Implementation Research, VA Bedford Health Care, MA, USA

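AI summary: This paper uses data augmentation to improve LLMs at identifying and prioritizing medical jargon in electronic health record notes in low-resource settings. It evaluates closed- and open-source models and finds that fine-tuning and data augmentation improve performance.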

Comments 21 pages, 5 figures, 4 tables

Abstract

OpenNotes enables patients to access EHR notes, but medical jargon can hinder comprehension. To improve understanding, we evaluated closed- and open-source LLMs for extracting and prioritizing key medical terms using prompting, fine-tuning, and data augmentation. We assessed LLMs on 106 expert-annotated EHR notes, experimenting with (i) general vs. structured prompts, (ii) zero-shot vs. few-shot prompting, (iii) fine-tuning, and (iv) data augmentation. To enhance open-source models in low-resource settings, we used ChatGPT for data augmentation and applied ranking techniques. We incrementally increased the augmented dataset size (10 to 10,000) and conducted 5-fold cross-validation, reporting F1 score and Mean Reciprocal Rank (MRR). Our results show that fine-tuning and data augmentation improved performance over other strategies. GPT-4 Turbo achieved the highest F1 (0.433), while Mistral7B with data augmentation had the highest MRR (0.746). Open-source models, when fine-tuned or augmented, outperformed closed-source models. Notably, the best F1 and MRR scores did not always align. Few-shot prompting outperformed zero-shot in vanilla models, and structured prompts yielded different preferences across models. Fine-tuning improved zero-shot performance but sometimes degraded few-shot performance. Data augmentation performed comparably to or better than other methods. Our evaluation highlights the effectiveness of prompting, fine-tuning, and data augmentation in improving model performance for medical jargon extraction in low-resource scenarios.
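
As a reading aid for the MRR figures in this abstract, the following is a minimal sketch of how Mean Reciprocal Rank is commonly computed. It assumes that each note yields a ranked list of extracted terms plus a set of gold terms; the function name and the toy data are hypothetical and are not taken from the paper.

```python
def mean_reciprocal_rank(ranked_lists, gold_sets):
    """Average over queries of 1/rank of the first relevant item (0 if none)."""
    if not ranked_lists:
        return 0.0
    total = 0.0
    for ranked, gold in zip(ranked_lists, gold_sets):
        for rank, term in enumerate(ranked, start=1):
            if term in gold:
                total += 1.0 / rank  # only the first hit counts
                break
    return total / len(ranked_lists)


# Hypothetical toy data: two notes, each with a ranked term list and gold terms.
ranked = [["hypertension", "bid", "edema"], ["prn", "stent", "afib"]]
gold = [{"edema"}, {"prn"}]
score = mean_reciprocal_rank(ranked, gold)  # (1/3 + 1/1) / 2
```

Because MRR rewards only the position of the first correct term while F1 scores the whole extracted set, the two metrics can favor different models, which matches the abstract's remark that the best F1 and MRR scores did not always align.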

2501.02721 2026-05-08 cs.LG

Learning Stochastic Nonlinear Dynamics with Embedded Latent Transfer Operators

Naichang Ke, Ryogo Tanaka, Yoshinobu Kawahara

Affiliations: The University of Osaka; RIKEN

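AI summary: This paper proposes an operator-based latent Markov representation, uses a spectral method to learn embedded representations of stochastic nonlinear dynamics, and addresses sequential state estimation and operator-based mode decomposition in stochastic nonlinear systems.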

Journal ref International Conference on Artificial Intelligence and Statistics (AISTATS) 2025, PMLR, pp. 4861-4869

Abstract

We consider an operator-based latent Markov representation of a stochastic nonlinear dynamical system, where the stochastic evolution of the latent state embedded in a reproducing kernel Hilbert space is described with the corresponding transfer operator, and develop a spectral method to learn this representation based on the theory of stochastic realization. The embedding may be learned simultaneously using reproducing kernels, for example, constructed with feed-forward neural networks. We also address the generalization of sequential state-estimation (Kalman filtering) in stochastic nonlinear systems, and of operator-based eigen-mode decomposition of dynamics, for the representation. Several examples with synthetic and real-world data are shown to illustrate the empirical characteristics of our methods, and to investigate the performance of our model in sequential state-estimation and mode decomposition.

2412.15689 2026-05-08 cs.CV

DOLLAR: Few-Step Video Generation via Distillation and Latent Reward Optimization

Zihan Ding, Chi Jin, Difan Liu, Haitian Zheng, Krishna Kumar Singh, Qiang Zhang, Yan Kang, Zhe Lin, Yuchen Liu

Affiliations: Princeton University; Adobe Research

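AI summary: This paper combines variational score distillation and consistency distillation to achieve high-quality, diverse few-step video generation, and further improves generation performance through latent reward model fine-tuning, enabling efficient generation.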

Abstract

Diffusion probabilistic models have shown significant progress in video generation; however, their computational efficiency is limited by the large number of sampling steps required. Reducing sampling steps often compromises video quality or generation diversity. In this work, we introduce a distillation method that combines variational score distillation and consistency distillation to achieve few-step video generation, maintaining both high quality and diversity. We also propose a latent reward model fine-tuning approach to further enhance video generation performance according to any specified reward metric. This approach reduces memory usage and does not require the reward to be differentiable. Our method demonstrates state-of-the-art performance in few-step generation for 10-second videos (128 frames at 12 FPS). The distilled student model achieves a score of 82.57 on VBench, surpassing the teacher model as well as baseline models Gen-3, T2V-Turbo, and Kling. One-step distillation accelerates the teacher model's diffusion sampling by up to 278.6 times, enabling near real-time generation. Human evaluations further validate the superior performance of our 4-step student models compared to the teacher model using 50-step DDIM sampling.

2411.19182 2026-05-08 cs.CV cs.AI

SOWing Information: Cultivating Contextual Coherence with MLLMs in Image Generation

Yuhan Pei, Ruoyu Wang, Yongqi Yang, Ye Zhu, Olga Russakovsky, Yu Wu

Affiliations: School of Computer Science, Wuhan University; Department of Computer Science, Princeton University

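AI summary: This paper proposes SOW, which uses multimodal large language models for text-vision-to-image generation, improving pixel-level condition fidelity and visual-semantic coherence.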

Comments Project page: https://pyh-129.github.io/SOW/

Abstract

Originating from the diffusion phenomenon in physics, which describes the random movement and collisions of particles, diffusion generative models simulate a random walk in the data space along the denoising trajectory. This allows information to diffuse across regions, yielding harmonious outcomes. However, the chaotic and disordered nature of information diffusion in diffusion models often results in undesired interference between image regions, causing degraded detail preservation and contextual inconsistency. In this work, we address these challenges by reframing disordered diffusion as a powerful tool for text-vision-to-image generation (TV2I) tasks, achieving pixel-level condition fidelity while maintaining visual and semantic coherence throughout the image. We first introduce Cyclic One-Way Diffusion (COW), which provides an efficient unidirectional diffusion framework for precise information transfer while minimizing disruptive interference. Building on COW, we further propose Selective One-Way Diffusion (SOW), which utilizes Multimodal Large Language Models (MLLMs) to clarify the semantic and spatial relationships within the image. Based on these insights, SOW combines attention mechanisms to dynamically regulate the direction and intensity of diffusion according to contextual relationships. Extensive experiments demonstrate the untapped potential of controlled information diffusion, offering a path to more adaptive and versatile generative models in a learning-free manner.

2411.13549 2026-05-08 cs.CV

KFC-W: Generating 3D-Consistent Videos from Unposed Internet Photos

Gene Chou, Kai Zhang, Sai Bi, Hao Tan, Zexiang Xu, Fujun Luan, Bharath Hariharan, Noah Snavely

Affiliations: Cornell University; Adobe Research

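AI summary: This paper proposes KFC-W, a model trained with self-supervision to generate 3D-consistent videos from unposed internet photos without 3D annotations. It outperforms existing methods in geometric and appearance consistency and benefits applications such as 3D Gaussian Splatting.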

Comments project page: https://genechou.com/kfcw/

Abstract

We address the problem of generating videos from unposed internet photos. A handful of input images serve as keyframes, and our model interpolates between them to simulate a path moving between the cameras. Given random images, a model's ability to capture underlying geometry, recognize scene identity, and relate frames in terms of camera position and orientation reflects a fundamental understanding of 3D structure and scene layout. However, existing video models such as Luma Dream Machine fail at this task. We design a self-supervised method that takes advantage of the consistency of videos and variability of multiview internet photos to train a scalable, 3D-aware video model without any 3D annotations such as camera parameters. We validate that our method outperforms all baselines in terms of geometric and appearance consistency. We also show our model benefits applications that enable camera control, such as 3D Gaussian Splatting. Our results suggest that we can scale up scene-level 3D learning using only 2D data such as videos and multiview internet photos.

2409.00417 2026-05-08 cs.LG stat.ME

Learning linear acyclic causal model including Gaussian noise using ancestral relationships

Ming Cai, Penggang Gao, Hisayuki Hara

Affiliations: Graduate School of Informatics, Kyoto University; Institute for Liberal Arts and Sciences, Kyoto University

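AI summary: This paper proposes an algorithm with lower time complexity than PC-LiNGAM for learning the distribution-equivalence patterns of linear causal models, using an extended causal ancestor finding algorithm that handles Gaussian disturbances.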

Comments 30 pages, 6 figures

Abstract

This paper discusses algorithms for learning causal DAGs. The PC algorithm makes no assumptions other than the faithfulness to the causal model and can identify only up to the Markov equivalence class. LiNGAM assumes linearity and continuous non-Gaussian disturbances for the causal model, and the causal DAG defining LiNGAM is shown to be fully identifiable. The PC-LiNGAM, a hybrid of the PC algorithm and LiNGAM, can identify up to the distribution-equivalence pattern of a linear causal model, even in the presence of Gaussian disturbances. However, in the worst case, the PC-LiNGAM has factorial time complexity for the number of variables. In this paper, we propose an algorithm for learning the distribution-equivalence patterns of a linear causal model with a lower time complexity than PC-LiNGAM, using the causal ancestor finding algorithm in Maeda and Shimizu, which is generalized to account for Gaussian disturbances.

2407.18128 2026-05-08 cs.CV eess.IV

Estimating Earthquake Magnitude in Sentinel-1 Imagery via Ranking

Daniele Rege Cambrin, Isaac Corley, Paolo Garza, Peyman Najafirad

Affiliations: Politecnico di Torino; University of Texas at San Antonio

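AI summary: This paper poses earthquake magnitude estimation as a metric-learning problem, using Sentinel-1 imagery both to estimate magnitude and to rank samples. Experiments show an MAE improvement of more than 30% over conventional regression methods.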

Comments Accepted to ECML-PKDD 2024 MACLEAN Workshop

Abstract

Earthquakes are commonly estimated using physical seismic stations; however, due to the installation requirements and costs of these stations, global coverage quickly becomes impractical. An efficient and lower-cost alternative is to develop machine learning models to globally monitor earth observation data to pinpoint regions impacted by these natural disasters. However, due to the small number of historically recorded earthquakes, this becomes a low-data regime problem requiring algorithmic improvements to achieve peak performance when learning to regress earthquake magnitude. In this paper, we propose to pose the estimation of earthquake magnitudes as a metric-learning problem, training models to not only estimate earthquake magnitude from Sentinel-1 satellite imagery but to additionally rank pairwise samples. Our experiments show an improvement of up to 30%+ in MAE over prior regression-only methods, particularly for transformer-based architectures.
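
The idea of training a model both to regress magnitude and to rank pairs of samples can be illustrated with a combined objective. The sketch below is an assumption-laden illustration, not the paper's loss: the hinge-style pairwise term, the `margin` and `lam` parameters, and all function names are hypothetical choices made for this example.

```python
from itertools import combinations


def mae(pred, target):
    """Mean absolute error between predicted and true magnitudes."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def pairwise_ranking_loss(pred, target, margin=0.1):
    """Hinge penalty when a pair's predicted order disagrees with the true order."""
    losses = []
    for i, j in combinations(range(len(pred)), 2):
        if target[i] == target[j]:
            continue  # ties carry no ordering information
        sign = 1.0 if target[i] > target[j] else -1.0
        losses.append(max(0.0, margin - sign * (pred[i] - pred[j])))
    return sum(losses) / len(losses) if losses else 0.0


def combined_loss(pred, target, lam=1.0, margin=0.1):
    """Regression term plus a weighted ranking term (lam is an assumed weight)."""
    return mae(pred, target) + lam * pairwise_ranking_loss(pred, target, margin)


# Hypothetical magnitudes: one prediction set keeps the true order, one does not.
target = [5.0, 6.0, 7.0]
well_ordered = [5.2, 6.1, 6.9]
mis_ordered = [6.1, 5.2, 6.9]
```

In this toy case the ranking term is zero for `well_ordered` and positive for `mis_ordered`, so the combined objective penalizes ordering mistakes in addition to absolute error.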

2404.02534 2026-05-08 cs.CL cs.AI

ANGOFA: Leveraging OFA Embedding Initialization and Synthetic Data for Angolan Language Model

Osvaldo Luamba Quinjica, David Ifeoluwa Adelani

Affiliations: Masakhane NLP; Department of Computer Science, University College London

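AI summary: This paper presents four pre-trained language models tailored to Angolan languages using multilingual adaptive fine-tuning. Informed embedding initialization and synthetic data improve performance over the SOTA AfroXLMR-base and OFA models.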

Comments Accepted at AfricaNLP 2024

Abstract

In recent years, the development of pre-trained language models (PLMs) has gained momentum, showcasing their capacity to transcend linguistic barriers and facilitate knowledge transfer across diverse languages. However, this progress has predominantly bypassed the inclusion of very low-resource languages, creating a notable void in the multilingual landscape. This paper addresses this gap by introducing four tailored PLMs specifically finetuned for Angolan languages, employing a Multilingual Adaptive Fine-tuning (MAFT) approach. In this paper, we survey the role of informed embedding initialization and synthetic data in enhancing the performance of MAFT models in downstream tasks. We improve over the SOTA baselines AfroXLMR-base (developed through MAFT) and OFA (an effective embedding initialization) by 12.3 and 3.8 points, respectively.

2401.04560 2026-05-08 cs.CV

Phase-shifted remote photoplethysmography for estimating heart rate and blood pressure from facial video

Gyutae Hwang, Sang Jun Lee

Affiliations: Division of Electronics and Information Engineering, Jeonbuk National University

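AI summary: This paper proposes a vision-based method for estimating heart rate and blood pressure, using a two-stage deep learning framework that exploits phase-shifted rPPG signals to improve estimation accuracy.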

Comments 13 pages, 10 figures

Journal ref Measurement 2026

Abstract

Human health can be critically affected by cardiovascular diseases, such as hypertension, arrhythmias, and stroke. Heart rate and blood pressure are important biometric information for the monitoring of the cardiovascular system and early diagnosis of cardiovascular diseases. Existing methods for estimating the heart rate are based on electrocardiography and photoplethysmography, which require placing a sensor in contact with the skin surface. Moreover, catheter and cuff-based methods for measuring blood pressure cause inconvenience and have limited applicability. Therefore, in this thesis, we propose a vision-based method for estimating the heart rate and blood pressure. This thesis proposes a 2-stage deep learning framework consisting of a dual remote photoplethysmography network (DRP-Net) and bounded blood pressure network (BBP-Net). In the first stage, DRP-Net infers remote photoplethysmography (rPPG) signals for the acral and facial regions, and these phase-shifted rPPG signals are utilized to estimate the heart rate. In the second stage, BBP-Net integrates temporal features and analyzes phase discrepancy between the acral and facial rPPG signals to estimate SBP and DBP values. To improve the accuracy of estimating the heart rate, we employed a data augmentation method based on a frame interpolation model. Moreover, we designed BBP-Net to infer blood pressure within a predefined range by incorporating a scaled sigmoid function. Our method resulted in estimating the heart rate with a mean absolute error (MAE) of 1.78 BPM, reducing the MAE by 34.31% compared to the recent method, on the MMSE-HR dataset. The MAEs for estimating the systolic blood pressure (SBP) and diastolic blood pressure (DBP) were 10.19 mmHg and 7.09 mmHg. On the V4V dataset, the MAEs for the heart rate, SBP, and DBP were 3.83 BPM, 13.64 mmHg, and 9.4 mmHg, respectively.
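
The abstract states that BBP-Net keeps its blood pressure output within a predefined range by using a scaled sigmoid. A minimal sketch of that general idea follows. The range limits (80 to 200 mmHg) and the function names are illustrative assumptions, not values or identifiers from the paper.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def scaled_sigmoid(x, low, high):
    """Map an unbounded network output x into the range [low, high]."""
    return low + (high - low) * sigmoid(x)


# Assumed systolic range in mmHg, chosen only for illustration.
SBP_LOW, SBP_HIGH = 80.0, 200.0
midpoint = scaled_sigmoid(0.0, SBP_LOW, SBP_HIGH)        # centre of the range
very_low = scaled_sigmoid(-1000.0, SBP_LOW, SBP_HIGH)    # saturates near the lower limit
very_high = scaled_sigmoid(1000.0, SBP_LOW, SBP_HIGH)    # saturates near the upper limit
```

Bounding the output this way means that even an extreme raw activation cannot yield a physiologically implausible value, at the cost of vanishing gradients near the limits.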

2312.15868 2026-05-08 cs.CV

Restoration-Oriented Video Frame Interpolation with Region-Distinguishable Priors from SAM

Yan Han, Xiaogang Xu, Yingqi Lin, Jiafei Wu, Zhe Liu, Ming-Hsuan Yang

Affiliations: Zhejiang Lab; Department of Computer Science and Engineering, The Chinese University of Hong Kong; University of California at Merced

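AI summary: This paper proposes region-distinguishable priors derived from SAM2 to improve motion estimation accuracy in video frame interpolation, together with a hierarchical region-aware feature fusion module that improves intermediate-frame synthesis.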

Comments Code will be released

Abstract

In existing restoration-oriented Video Frame Interpolation (VFI) approaches, the motion estimation between neighboring frames plays a crucial role. However, the estimation accuracy in existing methods remains a challenge, primarily due to the inherent ambiguity in identifying corresponding areas in adjacent frames for interpolation. Therefore, enhancing accuracy by distinguishing different regions before motion estimation is of utmost importance. In this paper, we introduce a novel solution involving the utilization of open-world segmentation models, e.g., SAM2 (Segment Anything Model2) for frames, to derive Region-Distinguishable Priors (RDPs) in different frames. These RDPs are represented as spatial-varying Gaussian mixtures, distinguishing an arbitrary number of areas with a unified modality. RDPs can be integrated into existing motion-based VFI methods to enhance features for motion estimation, facilitated by our designed play-and-plug Hierarchical Region-aware Feature Fusion Module (HRFFM). HRFFM incorporates RDP into various hierarchical stages of VFI's encoder, using RDP-guided Feature Normalization (RDPFN) in a residual learning manner. With HRFFM and RDP, the features within VFI's encoder exhibit similar representations for matched regions in neighboring frames, thus improving the synthesis of intermediate frames. Extensive experiments demonstrate that HRFFM consistently enhances VFI performance across various scenes.

2312.06409 2026-05-08 cs.CV

LiCamPose: Combining Multi-View LiDAR and RGB Cameras for Robust Single-timestamp 3D Human Pose Estimation

Zhiyu Pan, Zhicheng Zhong, Wenxuan Guo, Yifan Chen, Jianjiang Feng, Jie Zhou

Affiliations: Department of Automation, BNRist, Tsinghua University, China

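AI summary: This paper proposes LiCamPose, which combines multi-view RGB and sparse point cloud data to estimate robust 3D human poses from a single frame, and designs a synthetic data generator and an unsupervised domain adaptation strategy. Evaluation on multiple datasets demonstrates the method's generalization ability.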

Comments Accepted by WACV2025

Abstract

Several methods have been proposed to estimate 3D human pose from multi-view images, achieving satisfactory performance on public datasets collected under relatively simple conditions. However, few approaches study the extraction of 3D human skeletons from multimodal inputs, such as RGB and point cloud data. To address this gap, we introduce LiCamPose, a pipeline that integrates multi-view RGB and sparse point cloud information to estimate robust 3D human poses from a single frame. We demonstrate the effectiveness of the volumetric architecture in combining these modalities. Furthermore, to circumvent the need for manually labeled 3D human pose annotations, we develop a synthetic dataset generator for pretraining and design an unsupervised domain adaptation strategy to train a 3D human pose estimator without manual annotations. To validate the generalization capability of our method, LiCamPose is evaluated on four datasets, including two public datasets, one synthetic dataset, and one challenging self-collected dataset named BasketBall, covering diverse scenarios. The results demonstrate that LiCamPose exhibits great generalization performance and significant application potential. The code, generator, and datasets will be made available upon acceptance of this paper.

2605.06628 2026-05-08 eess.IV cs.LG cs.MM eess.AS eess.SP

LiVeAction: a Lightweight, Versatile, and Asymmetric Neural Codec Design for Real-time Operation

Dan Jacobellis, Neeraja J. Yadwadkar

Affiliations: University of Texas at Austin

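AI summary: This paper proposes the LiVeAction neural codec, which improves rate-distortion performance by reducing encoder complexity and replacing the loss functions, making it suitable for low-power sensors.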

Comments DCC 2026

Abstract

Modern sensors generate rich, high-fidelity data, yet applications operating on wearable or remote sensing devices remain constrained by bandwidth and power budgets. Standardized codecs such as JPEG and MPEG achieve efficient trade-offs between bitrate and perceptual quality but are designed for human perception, limiting their applicability to machine-perception tasks and non-traditional modalities such as spatial audio arrays, hyperspectral images, and 3D medical images. General-purpose compression schemes based on scalar quantization or resolution reduction are broadly applicable but fail to exploit inherent signal redundancies, resulting in suboptimal rate-distortion performance. Recent generative neural codecs, or tokenizers, model complex signal dependencies but are often over-parameterized, data-hungry, and modality-specific, making them impractical for resource-constrained environments. We introduce a Lightweight, Versatile, and Asymmetric neural codec architecture (LiVeAction), that addresses these limitations through two key ideas. (1) To reduce the complexity of the encoder to meet the resource constraints of the execution environments, we impose an FFT-like structure and reduce the overall size and depth of the neural-network-based analysis transform. (2) To allow arbitrary signal modalities and simplify training, we replace adversarial and perceptual losses with a variance-based rate penalty. Our design produces codecs that deliver superior rate-distortion performance compared to state-of-the-art generative tokenizers, while remaining practical for deployment on low-power sensors. We release our code, experiments, and python library at https://github.com/UT-SysML/liveaction .

2605.06608 2026-05-08 stat.ML cs.LG stat.ME

DARTS: Targeting Prognostic Covariates in Budget-Constrained Sequential Experiments

Kateryna Husar, Alexander Volfovsky

Affiliations: Department of Statistical Science

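AI summary: DARTS uses dynamic adaptive rerandomization and regression adjustment to reduce the variance of batch-level treatment effect estimates while preserving randomization validity.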

Abstract

Randomized controlled trials typically assume that prognostic covariates are known and available at no cost. In practice, obtaining high-dimensional pretreatment data is costly, forcing a trade-off between covariate-adaptive precision and a measurement budget. We introduce Dynamic Adaptive Rerandomization via Thompson Sampling (DARTS), which treats covariate acquisition as a sequential optimization problem embedded within a design-based causal inference task. A budgeted combinatorial Thompson sampler learns which covariates are most prognostic across successive batches; selected covariates then drive rerandomization and regression adjustment to reduce batch-level average treatment effect variance. Our primary theoretical contribution is a decoupling result: adaptive covariate selection based on past batches preserves batch-level randomization validity, and the cumulative inverse-variance weighted estimator achieves at least nominal asymptotic coverage. We further derive a Bayes risk bound for the acquisition layer that matches the minimax lower bound up to logarithmic factors. Empirically, DARTS systematically concentrates the budget on informative features, significantly closing the efficiency gap to oracle designs while maintaining strict inferential validity.
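
The cumulative inverse-variance weighted estimator mentioned in this abstract can be illustrated with the textbook formula for pooling independent estimates. The sketch below assumes that each batch supplies a treatment effect estimate and a known variance; the function name and numbers are hypothetical, and the paper's estimator (which works with estimated variances under adaptive selection) may differ in detail.

```python
def inverse_variance_combine(estimates, variances):
    """Pool independent estimates, weighting each by 1/variance.

    Returns the pooled estimate and its variance, 1 / sum(1/variance).
    """
    weights = [1.0 / v for v in variances]
    total_weight = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, estimates)) / total_weight
    return pooled, 1.0 / total_weight


# Hypothetical batch-level effect estimates and their variances.
batch_estimates = [2.0, 4.0]
batch_variances = [1.0, 3.0]
pooled_estimate, pooled_variance = inverse_variance_combine(
    batch_estimates, batch_variances
)
```

The more precise batch (variance 1.0) dominates the pooled value, and the pooled variance is smaller than either batch variance, which is why reducing batch-level variance through better covariate selection tightens the cumulative estimate.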

2605.06601 2026-05-08 cs.CR cs.AI

Patch2Vuln: Agentic Reconstruction of Vulnerabilities from Linux Distribution Binary Patches

Isaac David, Arthur Gervais

Affiliations: University College London

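AI summary: This paper studies whether language-model agents can reconstruct vulnerabilities from Linux distribution binary patches. It proposes the Patch2Vuln pipeline, which extracts ELF pairs, diffs them, and runs an agentic audit, and it demonstrates the feasibility of vulnerability reconstruction from binary patches.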

Abstract

Security updates create a short but important window in which defenders and attackers can compare vulnerable and patched software. Yet in many operational settings, the most accessible artifacts are binary packages rather than source patches or advisory text. This paper asks whether a language-model agent, restricted to local binary-derived evidence, can reconstruct the security meaning of Linux distribution updates. Patch2Vuln is a local, resumable pipeline that extracts old/new ELF pairs, diffs them with Ghidra and Ghidriff, ranks changed functions, builds candidate dossiers, and asks an offline agent to produce a preliminary audit, bounded validation plan, and final audit. We evaluate Patch2Vuln on 25 Ubuntu `.deb` package pairs: 20 security-update pairs and five negative controls, all manually adjudicated against private source-patch and binary-function ground truth. The agent localizes a verified security-relevant patch function in 10 of 20 security pairs and assigns an accepted final root-cause class in 11 of 20. Oracle diagnostics show that six security pairs fail before model reasoning because the binary differ or ranker omits the right function, with one additional context-export miss. A separate bounded validation pass produces two target-level minimized behavioral old/new differentials, both for tcpdump, but no crash, timeout, sanitizer finding, or memory-corruption proof; all five negative controls are classified as unknown and produce no validation differentials. These results support agentic vulnerability reconstruction from binary patches as a useful research target while showing that binary-diff coverage and local behavioral validation remain the limiting components.

2605.06596 2026-05-08 cs.CR cs.LG

FedAttr: Towards Privacy-preserving Client-Level Attribution in Federated LLM Fine-tuning

FedAttr: 迈向联邦LLM微调中隐私保护的客户端级归因

Su Zhang, Junfeng Guo, Heng Huang

发表机构 * Department of Computer Science(计算机科学系)

AI总结 本文提出FedAttr,一种新的联邦学习客户端级归因协议,通过配对子集差分机制识别在带水印文档上训练过的客户端,同时保持安全聚合的隐私保障和联邦学习性能。实验中该方法达到100% TPR和0% FPR。

Comments 39 pages, 4 figures, 21 tables (including appendix)

详情
AI中文摘要

水印放射性测试类方法可以检测模型是否在带水印的文档上训练过,已成为保护大型语言模型(LLM)微调中数据所有权的关键工具。现有工作已证明其在集中式LLM微调中的有效性。然而,此类方法在联邦学习(FL)中面临多项挑战且尚未得到充分探索,而FL是一种被广泛用于让不同用户在私有数据上协作微调LLM的范式。FL主要通过安全聚合(SA)确保隐私,允许服务器聚合更新,同时保持各客户端更新的私密性。这种机制保护了隐私,但也使得难以确定哪些客户端在带水印的文档上训练过。在本文中,我们提出FedAttr,一种新的FL客户端级归因协议。FedAttr通过配对子集差分机制识别在带水印数据上训练过的客户端,同时保持SA的隐私保障和FL的性能。FedAttr分为三个步骤:(i)通过对两次SA查询求差来估计每个客户端的更新;(ii)通过差分评分,用水印检测器对该估计进行评分;(iii)通过Stouffer方法合并各轮次的分数。我们从理论上证明,FedAttr给出每个客户端更新的无偏估计,且互信息泄露有界(即每轮更新为$O(d^*/N)$)。此外,实验表明FedAttr实现了100%的TPR和0%的FPR,在TPR上至少领先所有基线44.4%,或在FPR上至少领先19.1%,而相对于FL训练时间仅增加6.3%的开销。消融研究证实FedAttr对协议参数和配置具有鲁棒性。

英文摘要

Watermark radioactivity testing type of methods can detect whether a model was trained on watermarked documents, and have become key tools for protecting data ownership in the fine-tuning of large language models (LLMs). Existing works have proved their effectiveness in centralized LLM fine-tuning. However, this type of method faces several challenges and remains underexplored in federated learning (FL), a widely-applied paradigm for fine-tuning LLMs collaboratively on private data across different users. FL mainly ensures privacy through secure aggregation (SA), which allows the server to aggregate updates while keeping clients' updates private. This mechanism preserves privacy but makes it difficult to identify which client trained on watermarked documents. In this work, we propose FedAttr, a new client-level attribution protocol for FL. FedAttr identifies which clients trained on watermarked data via a paired-subset-difference mechanism, while preserving the privacy guarantees of SA and FL performance. FedAttr proceeds in three steps: (i) estimate each client's update by differencing two SA queries, (ii) score the estimate with the watermark detector via differential scoring, and (iii) combine scores across rounds via Stouffer method. We theoretically show that FedAttr produces an unbiased estimator of each client's update with bounded mutual information leakage (i.e., $O(d^*/N)$ per-round update). Moreover, FedAttr empirically achieves 100% TPR and 0% FPR, outperforming all baselines by at least 44.4% in TPR or 19.1% in FPR, with only 6.3% overhead relative to FL training time. Ablation studies confirm that FedAttr is robust to protocol parameters and configurations.
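Steps (i) and (iii) of the abstract can be illustrated with a toy sketch. This is not the FedAttr protocol: it assumes an idealized secure-aggregation oracle (the hypothetical `sa_query`) that returns exact subset sums, so differencing recovers a client's update exactly, whereas the paper describes an unbiased estimator with bounded leakage. The per-round z-scores are placeholders for the output of a watermark detector, which is not modeled here.

```python
import math
import numpy as np

def sa_query(updates, subset):
    """Idealized secure aggregation (assumption): only the subset sum is revealed."""
    return sum(updates[i] for i in subset)

def estimate_client_update(updates, client, others):
    """Difference two aggregate queries: (others + client) minus (others)."""
    with_client = sa_query(updates, list(others) + [client])
    without_client = sa_query(updates, list(others))
    return with_client - without_client

def stouffer(z_scores):
    """Stouffer's method: combine k independent z-scores as sum(z) / sqrt(k)."""
    z = np.asarray(z_scores, dtype=float)
    return z.sum() / math.sqrt(len(z))

def one_sided_p(z):
    """Upper-tail p-value of a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

rng = np.random.default_rng(0)
updates = {i: rng.normal(size=4) for i in range(5)}  # toy per-client updates
est = estimate_client_update(updates, client=2, others=[0, 1, 3])
combined = stouffer([1.0, 1.0, 1.0, 1.0])  # four rounds with placeholder z = 1 each
```

The point of the combination step is that weak per-round evidence accumulates: four rounds with z = 1 each yield a combined z of 2, which has a smaller one-sided p-value than any single round.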

2605.06564 2026-05-08 stat.ML cs.LG

Dynamic Treatment on Networks

网络上的动态处理

Bengusu Nar, Jiguang Li, Veronika Ročková, Panos Toulis

发表机构 * Booth School of Business of the University of Chicago(芝加哥大学商学院)

AI总结 本文提出Q-Ising框架,通过贝叶斯动态Ising模型估计网络采纳动态,结合离线强化学习学习动态策略,以量化动态决策的不确定性,提升政策影响。

详情
AI中文摘要

在网络中,有效的动态处理分配需要同时决定处理对象和处理时机,以通过溢出效应放大政策影响。对连接度高的节点进行早期干预可能触发级联效应,从而改变下一期值得瞄准的节点。现有的网络干扰下的处理策略大多是静态的,而动态处理框架通常完全忽略网络结构。本文整合这两种视角,提出Q-Ising,一个三阶段流程:(i) 基于单个观测面板,通过贝叶斯动态Ising模型估计网络采纳动态;(ii) 用连续的后验潜在状态扩充处理采纳历史;(iii) 通过离线强化学习学习动态策略。贝叶斯机制使得可以对动态决策进行不确定性量化,得到具有可解释溢出估计的后验集成策略。我们给出有限样本遗憾上界,它可分解为标准离线RL不确定性、网络抽象误差和Ising状态估计的第一阶段误差。我们将该方法应用于印度村庄小额信贷网络数据以及合成随机块模型,在模拟的异质易感-感染-易感(SIS)动态下进行测试,结果表明自适应目标选择优于静态中心性基准。

英文摘要

In networks, effective dynamic treatment allocation requires deciding both whom to treat and also when, so as to amplify policy impact through spillovers. An early intervention at a well-connected node can trigger cascades that change which nodes are worth targeting in the next period. Existing treatment strategies under network interference are largely static while dynamic treatment frameworks typically ignore network structure altogether. We integrate these perspectives and propose Q-Ising, a three-stage pipeline that (i) estimates network adoption dynamics via a Bayesian dynamic Ising model from a single observed panel, (ii) augments treatment adoption histories with continuous posterior latent states, and (iii) learns a dynamic policy via offline reinforcement learning. The Bayesian mechanism enables uncertainty quantification over dynamic decisions, yielding posterior ensemble policies with interpretable spillover estimates. We provide a finite-sample regret upper bound that decomposes into standard offline-RL uncertainty, network abstraction error, and first stage error in Ising state estimation. We apply our method to data from Indian village microfinance networks and synthetic stochastic block models under simulated heterogeneous susceptible-infected-susceptible (SIS) dynamics and demonstrate that adaptive targeting outperforms static centrality benchmarks.

2605.06557 2026-05-08 cs.MA cs.AI cs.LG

Coordination Matters: Evaluation of Cooperative Multi-Agent Reinforcement Learning

协作至关重要:合作多智能体强化学习的评估

Maria Ana Cardei, Matthew Landers, Afsaneh Doryab

发表机构 * University of Virginia(弗吉尼亚大学)

AI总结 本文提出一种考虑协作的评估视角,通过STAT测试平台评估六种价值导向的MARL方法,揭示不同协调机制对性能的影响。

Comments 27 pages. Submitted and under review

详情
AI中文摘要

合作多智能体强化学习(MARL)基准通常强调聚合结果如回报、成功率或完成时间。尽管重要,这些指标往往无法揭示智能体如何协作,特别是在智能体、任务和联合分配选择呈指数级增长的场景中。我们提出了一种协调意识的评估视角,通过STAT测试平台,系统地变化智能体、任务和环境规模,同时保持观察访问和任务规则固定。我们评估了六种代表性的价值导向MARL方法,在不同集中化水平下。结果表明,相似的回报趋势可能反映不同的协调机制,包括冗余分配、分配多样性以及任务完成效率的差异。我们发现,在受约束的任务分配中,性能在规模上的表现不仅受名义动作空间大小影响,还受分配压力、稀疏决策机会和相互依赖智能体间的冗余选择影响。我们的发现促使协调意识的评估成为合作MARL回报导向基准的必要补充。

英文摘要

Cooperative multi-agent reinforcement learning (MARL) benchmarks commonly emphasize aggregate outcomes such as return, success rate, or completion time. While essential, these metrics often fail to reveal how agents coordinate, particularly in settings where agents, tasks, and joint assignment choices scale combinatorially. We propose a coordination-aware evaluation perspective that supplements return with process-level diagnostics. We instantiate this perspective using STAT, a controlled commitment-constrained spatial task-allocation testbed that systematically varies agents, tasks, and environment size while holding observation access and task rules fixed. We evaluate six representative value-based MARL methods across varying levels of centralization. Our results show that similar return trends can reflect distinct coordination mechanisms, including differences in redundant assignment, assignment diversity, and task-completion efficiency. We find that in commitment-constrained task allocation, performance under scale is shaped not only by nominal action-space size, but also by assignment pressure, sparse decision opportunities, and redundant choices among interdependent agents. Our findings motivate coordination-aware evaluation as a necessary complement to return-based benchmarking for cooperative MARL.

2605.06520 2026-05-08 cs.GT cs.LG cs.MA stat.ME

Optimizing Social Utility in Sequential Experiments

在序贯实验中优化社会效用

Ander Artola Velasco, Stratis Tsirtsis, Manuel Gomez-Rodriguez

发表机构 * Max Planck Institute for Software Systems(马克斯·普朗克软件系统研究所) Hasso Plattner Institute(哈索·普拉特纳研究所)

AI总结 本文提出一种统计协议:开发者以序贯方式进行随机对照试验,监管机构部分补贴其成本,以提高社会效用。最优策略可通过动态规划求得,社会最优补贴水平可通过分治法求得。

详情
AI中文摘要

在药物开发等高风险领域,产品的监管批准需要通过大规模随机对照试验提供安全性和有效性的统计证据。然而,这些试验的高昂成本可能使对产品有效性缺乏绝对把握的开发者望而却步,最终抑制可能带来高社会效用的“登月式”(moonshot)产品的开发。为了解决这一低效问题,本文引入了一种实验统计协议,其中产品开发者(代理人)以序贯方式进行随机对照试验,而监管机构(委托人)部分补贴其成本。通过将该协议建模为信念马尔可夫决策过程,我们证明代理人的最优策略可通过动态规划高效求得。进一步,我们证明社会效用是委托人所选补贴水平的分段线性凸函数,因此社会最优补贴也可通过分治法高效求得。使用公开的抗生素开发和批准数据进行的模拟实验表明,相对于标准的非序贯协议,我们的统计协议可将社会效用提高35%以上。

英文摘要

Regulatory approval of products in high-stakes domains such as drug development requires statistical evidence of safety and efficacy through large-scale randomized controlled trials. However, the high financial cost of these trials may deter developers who lack absolute certainty in their product's efficacy, ultimately stifling the development of "moonshot" products that could offer high social utility. To address this inefficiency, in this paper, we introduce a statistical protocol for experimentation where the product developer (the agent) conducts a randomized controlled trial sequentially and the regulator (the principal) partially subsidizes its cost. By modeling the protocol using a belief Markov decision process, we show that the agent's optimal strategy can be found efficiently using dynamic programming. Further, we show that the social utility is a piecewise linear and convex function over the subsidy level the principal selects, and thus the socially optimal subsidy can also be found efficiently using divide-and-conquer. Simulation experiments using publicly available data on antibiotic development and approval demonstrate that our statistical protocol can be used to increase social utility by more than 35% relative to standard, non-sequential protocols.

2605.06516 2026-05-08 math.OC cs.AI

Learning to Cut: Reinforcement Learning for Benders Decomposition

学习切割:用于Benders分解的强化学习

Haochen Cai, Xian Yu

发表机构 * Department of Integrated Systems Engineering(整合系统工程系)

AI总结 本文提出RLBD框架,通过神经网络策略自适应选择切割,提升求解两阶段随机电动汽车充电站选址问题的效率,并在相似结构问题中表现出良好的泛化能力。

详情
AI中文摘要

Benders分解(BD)是求解现实决策中不确定性下两阶段随机规划问题的常用方法。然而,随着切割数量增加,主问题收敛速度变慢。本文提出Reinforcement Learning for BD(RLBD),通过基于神经网络的随机策略自适应选择切割,策略通过REINFORCE算法进行训练。我们在两阶段随机电动汽车充电站选址问题上评估了该方法,并与传统BD和基于支持向量机的LearnBD进行比较。数值结果表明,RLBD在计算效率上取得显著改进,并在具有相似结构但不同数据输入和决策变量维度的问题中表现出强泛化能力。

英文摘要

Benders decomposition (BD) is a widely used solution approach for solving two-stage stochastic programs arising in real-world decision-making under uncertainty. However, it often suffers from slow convergence as the master problem grows with an increasing number of cuts. In this paper, we propose Reinforcement Learning for BD (RLBD), a framework that adaptively selects cuts using a neural network-based stochastic policy. The policy is trained using a policy gradient method via the REINFORCE algorithm. We evaluate the proposed approach on a two-stage stochastic electric vehicle charging station location problem and compare it with vanilla BD and LearnBD, a supervised learning approach that classifies cuts using a support vector machine. Numerical results demonstrate that RLBD achieves substantial improvements in computational efficiency and exhibits strong generalization to problems with similar structures but varying data inputs and decision variable dimensions.
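The policy-gradient training mentioned in the abstract can be sketched in a few lines. The code below is an illustration under stated assumptions, not the RLBD implementation: the neural network is replaced by a linear-logistic policy over hypothetical per-cut feature vectors, each cut gets an independent keep/drop decision, and the reward is a placeholder scalar (for example, a negative solve time). It shows the REINFORCE update, where the parameters move along (reward minus baseline) times the gradient of the log-probability of the sampled actions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def select_cuts(theta, cut_features, rng):
    """Sample a keep (1) / drop (0) decision per cut from a Bernoulli policy."""
    probs = sigmoid(cut_features @ theta)
    actions = (rng.random(len(probs)) < probs).astype(float)
    return actions, probs

def reinforce_step(theta, cut_features, actions, probs, reward, baseline, lr=0.1):
    """One REINFORCE update: theta += lr * (R - b) * grad log pi(actions)."""
    # For a Bernoulli-logistic policy, grad log pi = sum_i (a_i - p_i) * x_i.
    grad_log_pi = cut_features.T @ (actions - probs)
    return theta + lr * (reward - baseline) * grad_log_pi

features = np.array([[1.0, 0.0], [0.0, 1.0]])  # two cuts, two hypothetical features
theta0 = np.zeros(2)
actions, probs = select_cuts(theta0, features, np.random.default_rng(0))
# Suppose keeping cut 0 and dropping cut 1 earned reward 1.0 (placeholder value).
theta1 = reinforce_step(theta0, features, np.array([1.0, 0.0]), probs,
                        reward=1.0, baseline=0.0)
```

With a positive advantage, the update raises the keep-probability of the cut that was kept and lowers it for the cut that was dropped; a baseline is subtracted only to reduce the variance of the gradient estimate.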

2605.06508 2026-05-08 cs.CR cs.AI

On the Security of Research Artifacts

关于研究制品的安全性

Nanda Rani, Christian Rossow

发表机构 * CISPA – Helmholtz Center for Information Security(CISPA亥姆霍兹信息安全中心)

AI总结 本文研究了来自顶级安全会议的509个研究制品,发现41.60%的常见发现在实际使用中可能构成安全隐患,并提出SAFE框架以提升制品评估的安全性。

详情
AI中文摘要

研究制品被广泛共享以支持可复现性,而制品评估(AE)主要检查其功能和可复现性,忽视了潜在的安全风险。本文分析了来自顶级安全会议的509个研究制品,发现其中许多包含不安全的代码模式;我们提出了用于上下文感知安全评估的分类体系,并通过静态分析和误报过滤识别真实的安全风险。分析表明,41.60%的常见发现在实际使用中可能构成安全隐患。SAFE框架在区分安全与非安全风险方面准确率达84.80%,F1-score达84.63%,表明安全在AE中同样重要,有助于促进安全、负责任的研究共享。源代码见:https://github.com/nanda-rani/SAFE

英文摘要

Research artifacts are widely shared to support reproducibility, and artifact evaluation (AE) has become common at many leading conferences. However, AE mainly checks whether artifacts work as claimed and can be reproduced. It largely overlooks potential security risks. Since these artifacts are publicly released and reused, they may unintentionally create opportunities for misuse and raise concerns about safe and responsible sharing. We study 509 research artifacts from top-tier security venues and find that many contain insecure code patterns that may introduce potential attack vectors. We propose a taxonomy for context-aware security assessment to enable structured analysis of such risks. We perform static analysis and examine the resulting findings, filtering false positives and identifying real security risks. Our analysis shows that 41.60% of the prevalent findings may pose security concerns under practical usage. To support scalable analysis, we introduce SAFE (Security-Aware Framework for Artifact Evaluation), a first step toward an autonomous framework that analyzes tool-reported findings by considering code semantics, execution context, and practical exploitability. SAFE achieves 84.80% accuracy and 84.63% F1-score in distinguishing security and non-security risks. Overall, our results show that security is also important in AE for promoting safe and responsible research sharing. The source code is available at: https://github.com/nanda-rani/SAFE

2605.06484 2026-05-08 stat.ME cs.LG stat.ML

Estimate Level Adjustment For Inference With Proxies Under Random Distribution Shifts

随机分布偏移下基于代理推断的估计层面调整

Steven Wilkins-Reeves, Alexandra N. M. Darmon, Deeksha Sinha

发表机构 * Central Applied Science, Meta(Meta应用科学中心)

AI总结 本文提出一种基于领域适应的估计层框架,用于校准代理推断,通过建模代理-主要指标差异作为参数级随机效应,以应对分布偏移带来的不准确性。

Comments 10 pages, 5 figures

详情
AI中文摘要

在许多科学领域,包括实验中,研究人员依赖于代理结果的测量来实现更快更频繁的读取,特别是当主要结果难以直接测量时。尽管代理提供了更易获取的观测数据用于推断,但最终目标是推断主要结果参数,并且代理数据通常在某些方面不完美。为了纠正这些不完美之处,当前的统计推断方法通常依赖于严格的识别假设(如替代性、协变量/标签偏移或缺失性假设)。这些假设可能难以验证,并且可能被各种额外的分布偏移源所违反,从而导致参数估计偏倚和不确定性量化不准确。我们引入了一种受领域适应技术启发的估计层框架,用于经验性校准基于代理的推断。该框架将代理-主要指标差异建模为参数级的随机效应,从过去不同领域(如实验、时间段或不同细分市场)的聚合历史观测中估计其分布。该方法避免了保留个体层面响应数据的必要性。此外,这种调整可以叠加在现有代理校正方法(如预测驱动推断或重要加权)之上,以应对那些校正方法未解决的额外偏倚。为了在历史领域数量有限时管理不确定性,我们提供了矩估计法和领域自助法。我们进一步通过公开可用的数据集和实际实验验证了这种方法。

英文摘要

In many scientific domains, including experimentation, researchers rely on measurements of proxy outcomes to achieve faster and more frequent reads, especially when the primary outcome of interest is challenging to measure directly. While proxies offer a more readily accessible observation for inference, the ultimate goal is to draw statistical inferences about the primary outcome parameter and proxy data are typically imperfect in some ways. To correct for these imperfections, current statistical inference methods often depend on strict identifying assumptions (such as surrogacy, covariate/label shift, or missingness assumptions). These assumptions can be difficult to validate and may be violated by various additional sources of distribution shift, potentially leading to biased parameter estimates and miscalibrated uncertainty quantification. We introduce an estimate-level framework, inspired by domain adaptation techniques, to empirically calibrate proxy-based inference. This framework models the proxy-primary metric discrepancy as a random effect at the parameter level, estimating its distribution from aggregated historical observations across past domains (e.g., experiments, time periods, or distinct segments). This method avoids the requirement for retaining individual-level response data. Additionally, this adjustment can be layered on top of existing proxy-correction methods (such as prediction-powered inference or importance weighting) to account for additional biases not addressed by those corrections. To manage uncertainty when the number of historical domains is limited, we provide both a method-of-moments estimator and a domain bootstrap procedure. We further validate this approach using publicly available datasets and real-world experiments.
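The random-effect idea in the abstract can be sketched with a simple method-of-moments calculation. The estimator below is an assumption chosen for illustration and may differ from the one in the paper: each historical domain k contributes an observed discrepancy d_k (primary estimate minus proxy estimate) with a known sampling variance v_k, the between-domain variance is estimated as the sample variance of d_k minus the mean of v_k (clipped at zero), and a new proxy estimate is shifted by the mean discrepancy with its variance inflated accordingly. All names are hypothetical.

```python
import numpy as np

def fit_discrepancy(d, v):
    """Method-of-moments fit of d_k = mu + u_k + e_k, Var(u_k) = tau2, Var(e_k) = v_k."""
    d, v = np.asarray(d, float), np.asarray(v, float)
    mu = d.mean()
    tau2 = max(0.0, d.var(ddof=1) - v.mean())   # between-domain variance
    var_mu = (tau2 + v.mean()) / len(d)         # uncertainty of the mean shift
    return mu, tau2, var_mu

def adjust(proxy_estimate, proxy_se, mu, tau2, var_mu):
    """Shift a new proxy estimate and widen its standard error."""
    adjusted = proxy_estimate + mu
    adjusted_se = np.sqrt(proxy_se ** 2 + tau2 + var_mu)
    return adjusted, adjusted_se

# Aggregated history only: one discrepancy and one variance per past domain.
mu, tau2, var_mu = fit_discrepancy([0.1, 0.3, 0.2, 0.4], [0.001] * 4)
est, se = adjust(proxy_estimate=1.0, proxy_se=0.05, mu=mu, tau2=tau2, var_mu=var_mu)
```

Only domain-level summaries enter the calculation, which matches the abstract's point that individual-level response data need not be retained. With few historical domains the variance estimate is itself noisy, which is the situation the abstract's domain bootstrap is meant to address.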

2605.06479 2026-05-08 stat.ML cs.LG math.ST stat.TH

Risk-Controlled Post-Processing of Decision Policies

受风险控制的决策策略后处理

Sunay Joshi, Tao Wang, Hamed Hassani, Edgar Dobriban

发表机构 * University of Pennsylvania(宾夕法尼亚大学)

AI总结 本文研究了在风险约束下如何通过后处理优化决策策略,提出了一种基于阈值结构的后处理方法,能够在保持基线策略一致性的前提下控制风险,实验验证了其在医疗诊断、LLM路由和多类决策任务中的有效性。

详情
AI中文摘要

预测模型通常通过现有决策策略部署,除非风险约束要求干预,否则利益相关者不愿更改这些策略。我们研究受风险控制的后处理:给定一个确定性基线策略,选择一个新策略,使其在满足针对用户指定损失的机会约束的前提下,与基线的一致性最大。在总体层面,我们证明最优策略具有阈值结构:它遵循基线策略,只有在切换到oracle回退策略能大幅降低条件违规风险的上下文中才会切换。在有限样本层面,给定拟合的回退策略和评分,我们开发了一种利用校准数据选择阈值的后处理算法。借助算法稳定性与随机过程的工具,我们证明在正则条件下、i.i.d.设定中,后处理策略的期望超额风险为$O(\log n/n)$。在存在严格安全的回退策略的特殊情形下,该算法在可交换性下实现精确的期望风险控制。在此设定中,我们还给出了后处理策略的高概率近优性保证。在COVID-19放射影像诊断任务、LLM路由问题和合成多类决策任务上的实验表明,有针对性的后处理能够满足或接近满足风险预算,同时比不考虑评分的随机混合保留显著更多的基线一致性。

英文摘要

Predictive models are often deployed through existing decision policies that stakeholders are reluctant to change unless a risk constraint requires intervention. We study risk-controlled post-processing: given a deterministic baseline policy, choose a new policy that maximizes agreement with the baseline subject to a chance constraint on a user-specified loss. At the population level, we show that the optimal policy has a threshold structure: it follows the baseline except on contexts where switching to the oracle fallback policy yields a large reduction in conditional violation risk. At the finite-sample level, given a fitted fallback policy and score, we develop a post-processing algorithm that uses calibration data to select a threshold. Leveraging tools from algorithmic stability and stochastic processes, we show that under regularity conditions, in the i.i.d. setting, the expected excess risk of the post-processed policy is $O(\log n/n)$. In the special case when an exact-safe fallback policy is available, the algorithm achieves precise expected risk control under exchangeability. In this setting, we also give high-probability near-optimality guarantees on the post-processed policy. Experiments on a COVID-19 radiograph diagnosis task, an LLM routing problem, and a synthetic multiclass decision task show that targeted post-processing can meet or nearly meet risk budgets while preserving substantially more agreement with the baseline than score-blind random mixing.
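The threshold structure described above suggests a simple calibration routine, sketched here under assumptions that go beyond the abstract: losses are 0/1 violation indicators, a context switches to the fallback when its score is at least the threshold, and the threshold is chosen by plain empirical risk on the calibration set without any finite-sample correction the paper's algorithm may apply. Function and variable names are hypothetical.

```python
import numpy as np

def empirical_risk(scores, base_loss, fallback_loss, t):
    """Calibration risk of the policy that switches to the fallback when score >= t."""
    switched = scores >= t
    return float(np.mean(np.where(switched, fallback_loss, base_loss)))

def select_threshold(scores, base_loss, fallback_loss, alpha):
    """Largest threshold (fewest switches) whose calibration risk is <= alpha."""
    descending = np.sort(np.unique(scores))[::-1]
    candidates = np.concatenate(([np.inf], descending, [-np.inf]))
    for t in candidates:  # each step switches at least one more context
        if empirical_risk(scores, base_loss, fallback_loss, t) <= alpha:
            return float(t)
    return float("-inf")  # budget not met anywhere: switch on every context

scores = np.array([0.9, 0.1, 0.8, 0.2])       # estimated benefit of switching
base_loss = np.array([1.0, 0.0, 1.0, 0.0])    # baseline violates on contexts 0 and 2
fallback_loss = np.zeros(4)                   # fallback assumed safe in this toy case
t_loose = select_threshold(scores, base_loss, fallback_loss, alpha=0.25)
t_strict = select_threshold(scores, base_loss, fallback_loss, alpha=0.0)
```

Scanning thresholds from high to low means the first feasible threshold deviates from the baseline on as few calibration contexts as possible, which is the agreement-maximizing behavior the abstract describes.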

2605.06469 2026-05-08 math.OC cs.LG cs.SY eess.SY

Dynamic Controlled Variables Based Dynamic Self-Optimizing Control

基于动态可控变量的动态自优化控制

Chenchen Zhou, Shaoqi Wang, Hongxin Su, Xinhui Tang, Yi Cao, Shuang-Hua Yang

发表机构 * College of Chemical and Biological Engineering, Zhejiang University, Hangzhou, Zhejiang Province, P.R. China(浙江大学化学与生物工程学院) Institute of Zhejiang University-Quzhou, Quzhou, Zhejiang Province, P.R. China(浙江大学衢州研究院)

AI总结 本文提出动态自优化控制方法,引入动态可控变量概念,通过数据驱动设计解决动态优化问题,验证了其在多值和不连续函数逼近及非固定时间跨度优化中的有效性。

Journal ref Journal of Process Control, 2024, 138: 103228

详情
AI中文摘要

本文提出动态自优化控制方法,引入动态可控变量概念,通过数据驱动设计解决动态优化问题,验证了其在多值和不连续函数逼近及非固定时间跨度优化中的有效性。

英文摘要

Self-optimizing control is a strategy for selecting controlled variables, where the economic objective guides the selection and design of controlled variables, with the expectation that maintaining the controlled variables at constant values can achieve optimization effects, translating the process optimization problem into a process control problem. Currently, self-optimizing control is widely applied to steady-state optimization problems. However, the development of process systems exhibits a trend towards refinement, highlighting the importance of optimizing dynamic processes such as batch processes and grade transitions. This paper formally introduces the self-optimizing control problem for dynamic optimization, termed the dynamic self-optimizing control problem, extending the original definition of self-optimizing control. A novel concept, "dynamic controlled variables" (DCVs), is proposed, and an implicit control policy is presented based on this concept. The paper theoretically analyzes the advantages and generality of DCVs compared to explicit control strategies and elucidates the relationship between DCVs and traditional controllers. Moreover, this paper puts forth a data-driven approach to designing self-optimizing DCVs, which considers DCV design as a mapping identification problem and employs deep neural networks to parameterize the variables. Three case studies validate the efficacy and superiority of DCVs in approximating multi-valued and discontinuous functions, as well as their application to dynamic optimization problems with non-fixed horizons, which traditional self-optimizing control methods are unable to address.

2605.06445 2026-05-08 cs.SE cs.AI

Constraint Decay: The Fragility of LLM Agents in Backend Code Generation

约束衰减:后端代码生成中LLM代理的脆弱性

Francesco Dente, Dario Satriani, Paolo Papotti

发表机构 * EURECOM University of Basilicata(巴西利卡塔大学)

AI总结 研究探讨了LLM代理在多文件后端生成中处理结构约束的能力,发现随着约束累积,性能显著下降,且框架敏感性导致不同环境下的性能差异。

详情
AI中文摘要

大型语言模型(LLM)代理在宽松规范下表现出色,但在生产软件中需严格遵守结构约束,如架构模式、数据库和对象关系映射。现有基准测试忽视非功能需求,奖励功能正确但结构随意的解决方案。本文通过固定统一API合同,评估代理在多文件后端生成中处理结构约束的能力。使用端到端行为测试和静态验证器进行双重评估,发现约束衰减现象:随着结构要求累积,代理性能大幅下降。某些配置在完全指定任务中断言通过率下降30分,部分较弱配置接近零。框架敏感性分析揭示显著性能差异:代理在最小、显式框架(如Flask)中表现良好,但在惯例密集环境(如FastAPI、Django)中表现差。最终,错误分析发现数据层缺陷(如错误查询组成和ORM运行时违规)是主要原因。本研究强调,同时满足功能和结构要求仍是编码代理的关键挑战。

英文摘要

Large Language Model (LLM) agents demonstrate strong performance in autonomous code generation under loose specifications. However, production-grade software requires strict adherence to structural constraints, such as architectural patterns, databases, and object-relational mappings. Existing benchmarks often overlook these non-functional requirements, rewarding functionally correct but structurally arbitrary solutions. We present a systematic study evaluating how well agents handle structural constraints in multi-file backend generation. By fixing a unified API contract across 80 greenfield generation tasks and 20 feature-implementation tasks spanning eight web frameworks, we isolate the effect of structural complexity using a dual evaluation with end-to-end behavioral tests and static verifiers. Our findings reveal a phenomenon of constraint decay: as structural requirements accumulate, agent performance exhibits a substantial decline. Capable configurations lose 30 points on average in assertion pass rates from baseline to fully specified tasks, while some weaker configurations approach zero. Framework sensitivity analysis exposes significant performance disparities: agents succeed in minimal, explicit frameworks (e.g., Flask) but perform substantially worse on average in convention-heavy environments (e.g., FastAPI, Django). Finally, error analysis identifies data-layer defects (e.g., incorrect query composition and ORM runtime violations) as the leading root causes. This work highlights that jointly satisfying functional and structural requirements remains a key open challenge for coding agents.

2605.06439 2026-05-08 cs.CY cs.CV cs.ET

From Review to Design: Ethical Multimodal Driver Monitoring Systems for Risk Mitigation, Incident Response, and Accountability in Automated Vehicles

从评审到设计:面向自动驾驶车辆中风险缓解、事故响应和问责的伦理多模态驾驶员监控系统

Bilal Khan, Waseem Shariff, Rory Coyne, Muhammad Ali Farooq, Peter Corcoran

发表机构 * C3I Imaging Lab, School of Engineering, University of Galway(Galway大学工程学院C3I成像实验室) Royal College of Surgeons in Ireland(爱尔兰皇家外科医师学院)

AI总结 本文提出针对自动驾驶车辆中多模态驾驶员监控系统的设计框架,解决伦理和法律挑战,强调透明、可信和以人为本的设计方法。

详情
AI中文摘要

随着车辆向更高自动化水平过渡,驾驶员监控系统(DMS)已成为确保人类监督、安全和监管合规的关键。这些系统依赖多模态传感和人工智能推理来评估驾驶员注意力、认知状态和接管准备情况。尽管技术上具有前景,其部署引入了复杂的伦理和法律挑战,包括隐私、同意、数据所有权和算法公平性。虽然GDPR、欧盟人工智能法案和IEEE标准提供了重要指导,但它们缺乏解决车内传感技术独特风险的具体性。本文从评审到设计的视角,批判性地审查现有监管工具和伦理框架,识别其在多模态、人工智能驱动的车内监控中的适用性缺口。基于此审查,我们提出一个模块化的伦理设计框架,专门针对驾驶员监控系统。该框架将高层原则转化为可操作的设计和部署指导,包括用户可配置的同意机制、公平意识模型开发、透明性和可解释性工具,以及保护驾驶员情感福祉的保障措施。最后,本文概述了风险分析和故障缓解策略,强调针对DMS情境的主动事故响应和问责机制。这些贡献旨在指导开发透明、可信和以人为本的下一代自动驾驶车辆驾驶员监控系统。

英文摘要

As vehicles transition toward higher levels of automation, Driver Monitoring Systems (DMS) have become essential for ensuring human oversight, safety, and regulatory compliance in a vehicle. These systems rely on multimodal sensing and AI-driven inference to assess driver attention, cognitive state, and readiness to take control. While technologically promising, their deployment introduces a complex set of ethical and legal challenges, ranging from privacy and consent to data ownership and algorithmic fairness. While overarching frameworks such as the GDPR, EU AI Act, and IEEE standards offer important guidance, they lack the specificity required for addressing the unique risks posed by in-cabin sensing technologies. This paper adopts a review-to-design perspective, critically examining existing regulatory instruments and ethical frameworks (such as the GDPR, the EU AI Act, and IEEE guidelines) and identifying gaps in their applicability to the distinctive risks posed by multimodal, AI-enabled in-cabin monitoring. Building on this review, we propose a modular ethical design framework tailored specifically to Driver Monitoring Systems. The framework translates high-level principles into actionable design and deployment guidance, including user-configurable consent mechanisms, fairness-aware model development, transparency and explainability tools, and safeguards for driver emotional well-being. Finally, the paper outlines a risk analysis and failure mitigation strategy, emphasizing proactive incident response and accountability mechanisms tailored to the DMS context. Together, these contributions aim to inform the development of transparent, trustworthy, and human-centered driver monitoring systems for next-generation autonomous vehicles.

2605.06438 2026-05-08 stat.ML cs.LG q-fin.RM

Neural-Actuarial Longevity Forecasting: Anchoring LSTMs for Explainable Risk Management

神经-精算长寿预测:通过LSTM锚定实现可解释的风险管理

Davide Rindori

发表机构 * ETH Zürich(苏黎世联邦理工学院) University of Florence(佛罗伦萨大学)

AI总结 本文提出Hybrid-Lift框架,结合层级LSTM与Mean-Bias Correction机制,解决传统线性模型在长寿风险评估中的非线性问题,实验证明其在瑞典和西德的预测优势,同时提供治理工具提升监管资本校准和压力测试。

Comments 26 pages, 12 figures. Code available at https://github.com/davide-rindori/Actuarial-DS-Portfolio/tree/main/04_Multi_Population_Longevity_XAI

详情
AI中文摘要

传统多群体模型,如Li-Lee框架,假设国家特定偏差具有均值回复性。然而,高长寿集群数据表明这一范式存在系统性断裂。我们发现瑞典和西德等国家的死亡率残差表现出持续的单位根,导致线性模型对长寿风险的系统性误定价。为解决这些非线性问题,我们提出Hybrid-Lift框架,结合层级LSTM网络与Mean-Bias Correction(MBC)锚定机制。该框架作为治理友好的模型挑战者而非传统方法的替代品,在2012-2020年外推验证中表现出选择性优势:在瑞典和西德分别比Li-Lee高17.40%和12.57%,而在瑞士和日本等近线性地区表现相当。我们补充预测模型,整合基于SHAP的跨国家影响映射、双不确定性框架用于监管资本校准(瑞士ES 99.0%的+1.153年)以及反向压力测试以确定偿付能力缓冲耗尽的临界冲击阈值。本研究证明,当神经网络通过精算原则正确锚定时,可以作为SST和Solvency II标准下的有效长寿风险管理模式挑战者。

英文摘要

Traditional multi-population models, such as the Li-Lee framework, rely on the assumption of mean-reverting country-specific deviations. However, recent data from high-longevity clusters suggest a systemic break in this paradigm. We identify a stationarity paradox where mortality residuals in countries like Sweden and West Germany exhibit persistent unit roots, leading to a systematic mispricing of longevity risk in linear models. To address these non-linearities, we propose Hybrid-Lift, a neural-actuarial framework that combines Hierarchical LSTM networks with a Mean-Bias Correction (MBC) anchoring mechanism. Positioned as a governance-friendly model challenger rather than a replacement of classical approaches, the framework exhibits selective superiority on out-of-sample validation (2012-2020): it outperforms Li-Lee by 17.40% in Sweden and 12.57% in West Germany, while remaining comparable for near-linear regimes such as Switzerland and Japan. We complement the predictive model with an integrated governance suite comprising SHAP-based cross-country influence mapping, a dual uncertainty framework for regulatory capital calibration (Swiss ES 99.0% of +1.153 years), and a reverse stress test identifying the critical shock threshold for solvency buffer exhaustion. This research provides evidence that neural networks, when properly anchored by actuarial principles, can serve as effective model challengers for longevity risk management under the SST and Solvency II standards.

2605.06413 2026-05-08 stat.ML cs.LG

Decoupled PFNs: Identifiable Epistemic-Aleatoric Decomposition via Structured Synthetic Priors

解耦的PFNs:通过结构化合成先验实现可识别的epistemic-aleatoric分解

Richard Bergna, Stefan Depeweg, José Miguel Hernández-Lobato

发表机构 * University of Cambridge(剑桥大学) Siemens AG(西门子公司)

AI总结 本文提出解耦PFNs,通过结构化合成先验实现epistemic-aleatoric分解,改进了在噪声和异方差设置下的主动学习和贝叶斯优化性能。

详情
AI中文摘要

Prior-Fitted Networks (PFNs)通过元学习合成任务先验来实现贝叶斯预测,但其标准输出是对噪声观测的后验预测分布。在序列决策制定中,如主动学习和贝叶斯优化,获取应优先考虑对潜在信号的epistemic不确定性而非不可约的aleatoric观测噪声。我们显示,这种epistemic-aleatoric分解无法仅从后验预测分布中一般识别,即使该分布已知精确。然后我们利用PFNs的一个独特优势:由于合成数据生成过程在我们控制之下,每个任务可以包含显式的潜在信号和噪声函数,并且生成器可以为无噪声目标和观测噪声方差提供查询级别的标签。我们使用这些标签训练一个解耦的PFN,具有独立的潜在信号和aleatoric头部。观测级预测通过将潜在信号分布与学习的噪声模型卷积得到。实证表明,仅epistemic的获取缓解了在噪声和异方差设置下的总方差探索失败模式。在匹配比较中,解耦模型通常优于调优的观测级基线,最明显的收益出现在HPO中;在更广泛的扫描中,解耦模型在HPO和合成BO中均获得最佳平均排名。

英文摘要

Prior-Fitted Networks (PFNs) amortize Bayesian prediction by meta-learning over a synthetic task prior, but their standard output is a posterior predictive distribution over noisy observations. For sequential decision-making, such as active learning and Bayesian optimization, acquisition should prioritize epistemic uncertainty about the latent signal rather than irreducible aleatoric observation noise. We show that this epistemic--aleatoric split is not identifiable in general from the posterior predictive distribution alone, even when that distribution is known exactly. We then exploit a distinctive advantage of PFNs: because the synthetic data-generating process is under our control, each task can contain an explicit latent signal and noise function, and the generator can provide query-level labels for both the noiseless target and the observation-noise variance. We use these labels to train a decoupled PFN with separate latent-signal and aleatoric heads. The observation-level predictive is induced by convolving the latent signal distribution with the learned noise model. Empirically, epistemic-only acquisition mitigates the failure mode of total-variance exploration in noisy and heteroscedastic settings. In matched comparisons, decoupled models usually improve over tuned observation-level baselines, with the clearest gains in HPO; in broader sweeps, a decoupled model obtains the best average rank in both HPO and synthetic BO.
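The non-identifiability claim has a simple Gaussian illustration, given here as a sketch rather than the paper's argument. Assume a latent signal f ~ N(mu, s_e) and an observation y = f + eps with independent noise eps ~ N(0, s_a), where s_e and s_a are variances. Convolving the two gives y ~ N(mu, s_e + s_a), so only the sum of the variances shows up in the posterior predictive. The numbers below are arbitrary.

```python
import math

def predictive_params(mu, epistemic_var, aleatoric_var):
    """Gaussian latent convolved with Gaussian noise: means carry over, variances add."""
    return mu, epistemic_var + aleatoric_var

def gaussian_pdf(y, mean, var):
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

# Two opposite uncertainty splits with the same total variance.
case_a = predictive_params(0.0, epistemic_var=0.75, aleatoric_var=0.25)
case_b = predictive_params(0.0, epistemic_var=0.25, aleatoric_var=0.75)
densities_a = [gaussian_pdf(y, *case_a) for y in (-1.0, 0.0, 2.0)]
densities_b = [gaussian_pdf(y, *case_b) for y in (-1.0, 0.0, 2.0)]
```

Both cases produce the same predictive density everywhere, yet an acquisition rule that targets epistemic uncertainty should treat them very differently. Separating them requires extra supervision, such as the query-level labels for the noiseless target and the noise variance that the abstract obtains from the synthetic generator.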

2605.06407 2026-05-08 eess.AS cs.AI cs.CL

WavCube: Unifying Speech Representation for Understanding and Generation via Semantic-Acoustic Joint Modeling

WavCube:通过语义-声音联合建模统一语音表示以实现理解和生成

Guanrou Yang, Tian Tan, Qian Chen, Zhikang Niu, Yakun Song, Ziyang Ma, Yushen Chen, Zeyu Xie, Tianrui Wang, Yifan Yang, Wenxi Chen, Qi Chen, Wenrui Liu, Shan Yang, Xie Chen

发表机构 * Shanghai Jiao Tong University(上海交通大学) Shanghai Innovation Institute(上海创新研究院) Tencent(腾讯) Independent Researcher(独立研究员) Peking University(北京大学) Tianjin University(天津大学) Zhejiang University(浙江大学)

AI总结 WavCube通过语义-声音联合建模统一语音表示,解决语音理解和生成的表示兼容问题,实现压缩后的高性能语音任务表现。

详情
AI中文摘要

将语音理解和生成整合是构建统一语音模型的关键步骤。然而,这两个任务所需的不同表示目前存在显著的兼容性挑战。通常,语义导向的特征通过自监督学习(SSL)学习,而声音导向的特征通过重建学习。这种碎片化的表示阻碍了真正统一的语音系统的实现。我们提出了WavCube,一种从SSL语音编码器中衍生出的紧凑连续潜在表示,能够同时支持语音理解、重建和生成。WavCube采用两阶段训练方案。第一阶段训练一个语义瓶颈,过滤掉使原始SSL特征难以处理的冗余信息。第二阶段通过端到端重建注入细粒度的声音细节,同时语义锚定损失确保表示始终保持在原始语义流形上。全面的实验表明,尽管在维度上压缩了8倍,WavCube在SUPERB上接近WavLM的性能,重建质量与现有声音表示相当,实现了最先进的零样本TTS性能,训练收敛速度显著加快,并在SUPERB-SG基准测试中在语音增强、分离和语音转换任务中表现出色。系统消融研究揭示,WavCube的两阶段配方解决了SSL特征在生成建模中的两个内在缺陷,为未来统一的语音系统铺平了道路。代码和检查点可在https://github.com/yanghaha0908/WavCube上获取。

英文摘要

Integrating speech understanding and generation is a pivotal step toward building unified speech models. However, the different representations required for these two tasks currently pose significant compatibility challenges. Typically, semantics-oriented features are learned from self-supervised learning (SSL), and acoustic-oriented features from reconstruction. Such fragmented representations hinder the realization of truly unified speech systems. We present WavCube, a compact continuous latent derived from an SSL speech encoder that simultaneously supports speech understanding, reconstruction, and generation. WavCube employs a two-stage training scheme. Stage 1 trains a semantic bottleneck to filter off-manifold redundancy that makes raw SSL features intractable for diffusion. Stage 2 injects fine-grained acoustic details via end-to-end reconstruction, while a semantic anchoring loss ensures the representation remains grounded within its original semantic manifold. Comprehensive experiments show that WavCube closely approaches WavLM performance on SUPERB despite an 8x dimensional compression, attains reconstruction quality on par with existing acoustic representations, delivers state-of-the-art zero-shot TTS performance with markedly faster training convergence, and excels in speech enhancement, separation, and voice conversion tasks on the SUPERB-SG benchmark. Systematic ablations reveal that WavCube's two-stage recipe resolves two intrinsic flaws of SSL features for generative modeling, paving the way for future unified speech systems. Codes and checkpoints are available at https://github.com/yanghaha0908/WavCube.

2605.06386 2026-05-08 econ.EM cs.LG math.ST stat.ME stat.ML stat.TH

Covariate Balancing and Riesz Regression Should Be Guided by the Neyman Orthogonal Score in Debiased Machine Learning

协变量平衡与里斯回归应由奈曼正交分数引导以实现去偏机器学习

Masahiro Kato

发表机构 * Data Analytics Department, Mizuho-DL Financial Technology, Co., Ltd.(摩根大通数字技术财务科技部门)

AI总结 本文指出,在去偏机器学习中,平衡函数应基于奈曼正交分数推导,而非仅作为协变量函数。协变量平衡在回归误差可由协变量表示时有效,但针对处理效应异质性下的ATE估计,分数误差通常包含处理特定成分,因此提倡通过里斯回归实现回归平衡。

详情
AI中文摘要

本文主张在去偏机器学习中,平衡函数应基于奈曼正交分数推导,而非仅作为协变量函数。协变量平衡在回归误差可由协变量表示时有效,而针对处理效应异质性下的ATE估计,分数误差通常包含处理特定成分,因此提倡通过里斯回归实现回归平衡。

英文摘要

This position paper argues that, in debiased machine learning, balancing functions should be derived from the Neyman orthogonal score, not chosen only as functions of covariates. Covariate balancing is effective when the regression error entering the score can be represented by functions of covariates alone, and it is the natural finite-dimensional approximation for targets such as ATT counterfactual means. For ATE estimation under treatment effect heterogeneity, however, the score error generally contains treatment-specific components because the outcome regression is a function of the full regressor $X=(D,Z)$. In that case, balancing common functions of $Z$ can leave the treatment-specific component unbalanced. We therefore advocate regressor balancing, implemented by Riesz regression with basis functions of $X$, as the general balancing principle for DML. The position is not that covariate balancing is invalid, but that covariate balancing should be understood as the special case that is appropriate when the score-relevant regression error is a function of covariates alone.
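Regressor balancing via Riesz regression can be made concrete with a linear sieve. The sketch below is an illustration under assumptions, not the paper's code: the target is the ATE, whose score involves g(1, Z) - g(0, Z); the representer is modeled as alpha(X) = b(X) @ rho for a hypothetical basis b of the full regressor X = (D, Z); and rho minimizes the empirical Riesz loss mean(alpha(X)^2 - 2 * (alpha(1, Z) - alpha(0, Z))). The first-order condition of that loss is exactly a balancing identity on the basis functions.

```python
import numpy as np

def basis(d, z):
    """Hypothetical basis b(X) of X = (D, Z) with treatment-specific terms."""
    return np.column_stack([d, d * z, 1.0 - d, (1.0 - d) * z])

def riesz_regression(d, z):
    """Linear-sieve Riesz regression for the ATE: solve G rho = M."""
    b_obs = basis(d, z)
    ones, zeros = np.ones_like(z), np.zeros_like(z)
    m_hat = (basis(ones, z) - basis(zeros, z)).mean(axis=0)  # mean of b(1,Z) - b(0,Z)
    g_hat = b_obs.T @ b_obs / len(z)                         # mean of b(X) b(X)^T
    rho = np.linalg.solve(g_hat, m_hat)
    return b_obs @ rho, b_obs, m_hat

rng = np.random.default_rng(0)
z = rng.normal(size=500)
d = (rng.random(500) < 1.0 / (1.0 + np.exp(-z))).astype(float)  # confounded treatment
alpha_hat, b_obs, m_hat = riesz_regression(d, z)
# Balancing identity: mean(alpha_hat * b(X)) equals mean(b(1,Z) - b(0,Z)).
balance_gap = (alpha_hat[:, None] * b_obs).mean(axis=0) - m_hat
```

Because the basis contains separate terms for treated and control units, the fitted weights balance the treatment-specific components that, per the abstract, can remain unbalanced when only common functions of Z are balanced.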