arXivDaily: arXiv Daily Academic Digest, updated Monday through Friday

1. Deep Learning Architectures and Training Methods (7 papers)

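2606.19368 2026-06-19 math.NA cs.LG cs.NA math.OC cross-listed

Neural Architectures as Functional Priors in Physics-Informed Control Problems

Sonia Rubio Herranz, Fernando Carlos López Hernández, Antonio López Montes

AI Summary Studies the role of neural architectures as implicit functional priors in control problems governed by ordinary differential equations, finding that different architectures (MLP and Fourier KAN) produce qualitatively different controls under identical conditions and exhibit functional specialization.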

Comments 17 pages, 6 figures. Physics-informed neural networks, optimal control, spectral bias, Kolmogorov-Arnold Networks

Abstract

In this work we investigate the role of neural architectures as implicit functional priors in control problems governed by ordinary differential equations. Rather than focusing on highly complex problems, our objective is to investigate architecture-dependent effects in controlled dynamical systems within the simplest physically interpretable settings possible. In particular, we study a controlled linear RLC electrical circuit and a nonlinear Duffing-type dynamical system. Both systems are analyzed first through classical optimal-control formulations and later through PINN-based approaches. We compare different combinations of multilayer perceptrons (MLPs) and Fourier-based KAN-like architectures, and analyze their influence on the resulting controls. The numerical experiments suggest that different architectural choices systematically generate qualitatively distinct controls, even under identical governing equations, loss functionals, initial and target states, training parameters and physical constraints. Significant differences appear in the spectral structure, smoothness, energy distribution, and phase-space behavior of the learned solutions. A central observation of this work is the emergence of a functional specialization phenomenon when the neural architectures are allowed sufficient freedom to shape the structure of the learned controls. More specifically, in the systems considered here, Fourier-based architectures tend to produce trajectories with richer oscillatory content, whereas smoother low-frequency-biased architectures tend to generate more regular and energetically efficient controls. This suggests that different functional components of the control problem may be handled more efficiently by different neural architectures, leading to an implicit specialization between state representation and control generation.

2606.19538 2026-06-19 cs.AI cs.LG cross-listed

ITNet: A Learnable Integral Transform That Subsumes Convolution, Attention, and Recurrence

Ashim Dhor, Rasel Mondal, Pin Yu Chen

Affiliations Indian Institute of Science Education and Research Bhopal; IBM Research

AI Summary Proposes ITNet, a learnable integral transform network that unifies convolution, attention, and recurrent architectures through a joint position-feature kernel, achieving strong performance across modalities.

Abstract

Convolutional networks, recurrent networks, and transformers each encode different inductive biases (locality, sequential memory, and content-dependent pairwise interaction) and have remained mathematically distinct since their inception. We show that this fragmentation reflects not a fundamental diversity in how signals should be processed, but rather incomplete views of a single underlying mathematical object: a learnable integral transform. We introduce the Integral Transform Network (ITNet), a unified architecture built around a learnable kernel that depends jointly on positions and features. This kernel is implemented as a small neural network, specifically an MLP, that models pairwise interactions, enabling the model to adapt its behavior from data. We show that convolution, self-attention (including multi-head), and autoregressive recurrence (including LSTM, GRU, S4, and Mamba) arise as special cases under appropriate parameterizations, and that ITNet is a universal approximator of continuous operators. To make this practical, we develop tiled kernel fusion, importance-weighted Monte Carlo integration, and learned low-rank factorization, enabling efficient and scalable computation. A single ITNet architecture with a shared operator and lightweight modality-specific encoders matches or exceeds specialized baselines on ImageNet-1K, GLUE, ModelNet40, VQA v2, and NLVR2. The results demonstrate that a single learned interaction mechanism can recover the behavior of all three architectural families from data.
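Editor's sketch (not from the paper): the claim that convolution and attention are special cases of one kernel-weighted sum can be illustrated on a 1-D sequence of scalar features. The names `integral_transform`, `conv_kernel`, and `attention_kernel` are hypothetical, and the kernels are hand-written here, whereas the abstract describes a learned MLP kernel.

```python
import math

def integral_transform(xs, kernel):
    """Discrete integral transform: y_i = sum_j kernel(i, j, x_i, x_j) * x_j."""
    n = len(xs)
    return [sum(kernel(i, j, xs[i], xs[j]) * xs[j] for j in range(n))
            for i in range(n)]

def conv_kernel(weights):
    """Kernel that depends only on the offset j - i.

    This gives a zero-padded sliding-window filter (a cross-correlation,
    which equals convolution for the symmetric filter used below).
    """
    half = len(weights) // 2
    def k(i, j, xi, xj):
        off = j - i + half
        return weights[off] if 0 <= off < len(weights) else 0.0
    return k

def attention_kernel(xs):
    """Kernel that depends only on features: softmax over j of x_i * x_j."""
    def k(i, j, xi, xj):
        z = sum(math.exp(xi * xk) for xk in xs)
        return math.exp(xi * xj) / z
    return k

xs = [1.0, 2.0, 3.0, 4.0]
conv_out = integral_transform(xs, conv_kernel([0.25, 0.5, 0.25]))
attn_out = integral_transform(xs, attention_kernel(xs))
```

Both outputs come from the same `integral_transform` routine; only the kernel changes. A kernel that mixes position and feature arguments would sit between these two extremes.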

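2606.19617 2026-06-19 cs.CV cs.GR cs.LG cross-listed

GB-LSR: A Fast Local Spectral Image Representation with a Single Global Bandwidth for Continuous Reconstruction and Super-Resolution

Max Shad, Naeem Khoshnevis

Affiliations Harvard University

AI Summary Proposes GB-LSR, a local spectral representation with a global bandwidth in which a shared convolutional encoder predicts truncated Fourier basis coefficients for continuous image reconstruction; on Kodak and other benchmarks it improves PSNR by 2.8-3.6 dB, with inference about 4x faster than the slowest baseline.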

Abstract

We present GB-LSR (Global-Bandwidth Local Spectral Representation), a fixed-grid local spectral representation for continuous image reconstruction. The image domain is partitioned into non-overlapping square patches, each carrying coefficients for a truncated Fourier basis predicted from shared convolutional-encoder features. A single trainable scalar bandwidth is shared globally across all patches and images, and reconstruction at any continuous coordinate is a fixed-size basis contraction whose cost is independent of image size. We study three bandwidth-handling variants: a trainable global scalar (main), a fixed global scalar, and a per-patch bandwidth field. On a standardized native-reconstruction benchmark across Kodak, Set14, and Urban100, the main variant outperforms matched-budget amortized LIIF / LTE / WIRE re-implementations by 2.8-3.6 dB PSNR and 0.11-0.15 LPIPS, while running at roughly one-quarter of the slowest baseline's inference cost. The single global scalar suffices empirically: per-patch adaptive-bandwidth alternatives do not improve over it on either a closed-form locality diagnostic or an end-to-end ablation. In a separate arbitrary-scale super-resolution (ASR) extension, GB-LSR achieves competitive PSNR-Y under a canonical-style SR protocol and runs 1.44x faster than LIIF-RDN and 3.25x faster than LTE-SwinIR at x4; within the same extension, a variant trained and evaluated without 4-corner local-ensemble averaging gives a 1.77x speedup with 35% lower peak memory and negligible PSNR change, while additionally widening the RDN encoder from 64 to 96 channels gives a small positive PSNR shift with a 1.58x speedup and 31% lower peak memory. Native-reconstruction claims are scoped to the matched-budget amortized protocol, and ASR claims are scoped to a separate canonical-style SR protocol.
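Editor's sketch (not from the paper): the "fixed-size basis contraction" can be illustrated in one dimension. The function `reconstruct`, the coefficient layout, and the way the bandwidth scales the frequencies are all assumptions made for illustration; the abstract does not specify the exact basis.

```python
import math

def reconstruct(x, coeffs, patch_size, bandwidth, n_freq):
    """Evaluate a 1-D patchwise truncated Fourier representation at x.

    coeffs[p] holds [a_0, a_1, b_1, ..., a_K, b_K] for patch p (K = n_freq).
    bandwidth is a single scalar shared by every patch (assumed form).
    """
    p = min(int(x // patch_size), len(coeffs) - 1)  # index of the patch containing x
    t = x / patch_size - p                          # local coordinate within the patch
    basis = [1.0]
    for k in range(1, n_freq + 1):
        phase = 2.0 * math.pi * bandwidth * k * t
        basis += [math.cos(phase), math.sin(phase)]
    # Fixed-size contraction: the cost depends on n_freq, not on the image size.
    return sum(c * b for c, b in zip(coeffs[p], basis))

coeffs = [[0.5, 1.0, 0.0], [2.0, 0.0, 1.0]]  # two patches, K = 1
left = reconstruct(0.0, coeffs, patch_size=8.0, bandwidth=1.0, n_freq=1)
right = reconstruct(10.0, coeffs, patch_size=8.0, bandwidth=1.0, n_freq=1)
```

Because `x` can be any real coordinate, the same routine serves both native-resolution reconstruction and sampling on a finer grid.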

2606.19635 2026-06-19 cs.IR cs.AI cs.LG cross-listed

Token Factory: Efficiently Integrating Diverse Signals into Large Recommendation Models

Xilun Chen, Shao-Chuan Wang, Baykal Cakici, Lukasz Heldt, Lichan Hong, Raghu Keshavan, Aniruddh Nath, Li Wei, Xinyang Xi

AI Summary Proposes the Token Factory framework, which converts traditional signals into soft tokens and integrates them efficiently into transformer-based large recommendation models, avoiding prompt-length explosion and improving performance.

Comments 8 pages, 10 figures

Abstract

Large Recommendation Models (LRMs) have demonstrated promising capabilities in industry-scale recommendation tasks. However, holistically integrating traditional signals into these transformer-based architectures effectively and efficiently remains a major challenge. Conventional approaches that "textualize" these signals directly or create discrete item representations often lead to excessively long prompts, substantial memory footprints, and high computational overhead. To overcome these limitations, we propose "Token Factory", a framework designed to transform traditional signals into "soft tokens" that can be directly processed by LRMs. This approach enables efficient integration and compression of heterogeneous input features, preventing prompt length explosion while enhancing model performance. We detail the architecture of Token Factory and present experimental results validating its effectiveness in a production-scale recommendation environment.

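2606.19831 2026-06-19 cs.CL cs.LG cross-listed

Leverage Is Not Reach: A Control-Window Law for Single-Neuron Steering in Language Models

Hongliang Liu

Affiliations Palo Alto Networks

AI Summary Proposes a budget-normalized control-window framework in which a coherence budget, defined as the ratio of the residual norm to the write norm, predicts when a single-neuron intervention yields coherent behavioral control; prediction accuracy is validated on 15 neurons.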

Abstract

Aligned language models gate behaviors such as refusal and language routing through sparse feed-forward neurons, yet no theory predicts when a single-neuron intervention controls a behavior coherently rather than collapsing the output. We develop a budget-normalized control-window framework for single-neuron steering. A dose along one write direction reduces to one control coordinate: the alignment between the residual stream and the write, driven along a universal saturation curve in units of a coherence budget set by the residual norm divided by the write norm. Coherent control exists when a behavior trigger lies below the collapse ceiling. The same coordinate governs benign mode switches and refusal; the ceiling follows from weights and one generic forward pass, while triggers are measured at rollout. On fifteen held-out neurons, the predicted ceiling has mean absolute error 0.14, about 0.07 in bulk layers, and the committed open-or-closed verdict holds on eleven, against a ten-of-fifteen majority baseline. Closed cases expose three failure modes rather than violations: collapse before trigger, too little depth to propagate, or a normalization that caps how far one neuron can push. The law explains why local gradient attribution anti-predicts control: true controllers write off the readout axis and carry a near-zero first-order gradient. A forward-only contrastive screen made precise by the window recovers controllers that attribution misses. On refusal, the hardest case, intervention success is typed, not scalar: coherent bypass and strict actionable reach separate, so a neuron can flip refusal in fluent, on-task text with no actionable content, and genuine actionable reach appears only for three of six audited Llama pivots and only at later rollout horizons. Single-neuron steering is therefore a budgeted, typed audit of controllability rather than a fixed-dose anecdote.

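2606.19946 2026-06-19 cs.CL cs.LG cross-listed

GEMS: Geometric Constraints Enable Multi-Semantic Superposition in LLMs

Yu Deng

AI Summary Proposes GEMS, which addresses distributional deviation and directional interference in training-free multi-directional activation steering through two geometric constraints: norm-preserving weighted superposition with targeted attention-pathway injection, and real-time orthogonalization. It preserves 98% accuracy on GSM8K.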

Comments 30 pages, 5 figures, 20 tables. Code and logs are available at: https://github.com/LuLu663939/gems-multi-semantic-steering

Abstract

Activation steering controls model behavior by modifying intermediate hidden states at inference time without retraining. Existing methods handle only single-direction injection; when multiple semantic directions are superposed without constraints, the model collapses. We show that this collapse decomposes into two independently acting sources: distributional deviation, where additive perturbations accumulate in norm across layers and drive activations outside the training distribution, and directional interference, where non-orthogonal semantic vectors mutually dampen when superposed. These two sources define the design constraints that any training-free multi-directional intervention must address. As one instantiation of these principles, we propose GEMS, a training-free method that maps each source to a corresponding geometric constraint: norm-preserving weighted superposition and targeted attention-pathway injection for distributional deviation, and real-time orthogonalization for directional interference. On GSM8K, injecting three concurrent non-mathematical directions preserves accuracy at 98% (baseline 92%), while unconstrained addition collapses to 4%; on Wikitext-2, the same injection incurs only 2.2% PPL increase. Component ablation isolates the causal role of each constraint, and layer-level probes confirm that orthogonalized signals survive the FFN pathway and reach the output distribution with semantic specificity. Qualitative steering effects transfer across architectures from 3B to 31B.
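Editor's sketch (not from the paper): the two geometric constraints can be illustrated on plain vectors. Gram-Schmidt stands in for "real-time orthogonalization" and a rescale to the original norm stands in for "norm-preserving superposition"; both choices, and the names `orthogonalize` and `steer`, are assumptions. The released code may differ. The sketch also assumes the directions are linearly independent, so that weights and orthogonalized directions stay aligned.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def orthogonalize(directions):
    """Gram-Schmidt: strip from each direction its overlap with earlier ones."""
    basis = []
    for d in directions:
        for b in basis:
            proj = dot(d, b)
            d = [x - proj * y for x, y in zip(d, b)]
        n = norm(d)
        if n <= 1e-12:
            raise ValueError("directions must be linearly independent")
        basis.append([x / n for x in d])
    return basis

def steer(hidden, directions, weights):
    """Add weighted orthogonal directions, then rescale to the original norm."""
    mixed = list(hidden)
    for w, b in zip(weights, orthogonalize(directions)):
        mixed = [m + w * x for m, x in zip(mixed, b)]
    scale = norm(hidden) / norm(mixed)
    return [scale * m for m in mixed]

hidden = [3.0, 4.0, 0.0]
steered = steer(hidden, [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]], [1.0, 1.0])
```

In this toy case, unconstrained addition would raise the norm from 5 to about 6.4, while the rescaled result keeps it at 5.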

2606.20035 2026-06-19 cs.CV cs.LG cross-listed

PU-UNet: Stable Multiplicative Interactions for Medical Image Segmentation

Ziyuan Li, Osamah Sufyan, Uwe Jaekel, Babette Dellen

Affiliations Department of Mathematics, Informatics and Technology, University of Applied Sciences Koblenz; Technical University of Munich

AI Summary Proposes PU-UNet, which uses stable product-unit residual blocks at low-resolution stages for explicit multiplicative feature interactions, improving Dice and IoU on three medical image segmentation datasets and lowering the false-positive rate.

Comments Accepted to ICANN 2026

Abstract

Many dense prediction networks rely on additive feature transformations and model higher-order feature interactions only implicitly. Product units provide an explicit mechanism for multiplicative feature modeling, but their logarithmic-exponential formulation can cause numerical instability, which has limited their use in deep dense prediction networks. In this work, we propose Product-Unit U-Net (PU-UNet), a residual U-Net that integrates stable product-unit residual blocks into rich low-resolution stages for medical image segmentation. The proposed formulation combines smooth positivity mapping with log-domain clipping, enabling stable multiplicative feature learning with negligible computational overhead. On ISIC 2018, Kvasir-SEG, and BUSI, PU-UNet achieves Dice scores of 0.942, 0.959, and up to 0.925, respectively. Compared with a matched Residual U-Net baseline, PU-UNet consistently improves Dice and IoU while keeping parameters, FLOPs, and inference latency nearly unchanged, and reduces the image-level false-positive rate on normal BUSI cases from 0.077 to zero. Ablation studies suggest that the gains are associated with product-unit interactions, are strongest under low-resolution placement, and benefit from the proposed stabilization design. These results suggest that stable product-unit residual learning can be an effective way to enhance U-Net-style segmentation networks with explicit multiplicative interactions.
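Editor's sketch (not from the paper): a product unit computes a weighted product of its inputs as the exponential of a weighted sum of logarithms, which breaks on non-positive inputs and can overflow. The version below uses softplus as the smooth positivity map and a symmetric clip in the log domain. The function name, the softplus choice, `eps`, and the clip bound are assumptions; the abstract names the two ingredients but not their exact form.

```python
import math

def softplus(x):
    """Smooth positivity map log(1 + exp(x)), written to avoid overflow."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def stable_product_unit(xs, ws, eps=1e-6, clip=10.0):
    """Compute prod_i (softplus(x_i) + eps) ** w_i in the log domain.

    The summed log value is clipped to [-clip, clip] before exponentiation,
    so the output always lies in [exp(-clip), exp(clip)].
    """
    log_sum = 0.0
    for x, w in zip(xs, ws):
        log_sum += w * math.log(softplus(x) + eps)
    log_sum = max(-clip, min(clip, log_sum))
    return math.exp(log_sum)

small = stable_product_unit([-1000.0], [1.0])            # clipped from below
large = stable_product_unit([1000.0, 1000.0], [5.0, 5.0])  # clipped from above
```

A naive `math.exp(sum(w * math.log(x)))` would raise a `ValueError` on the negative input used for `small`.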

2. Representation Learning, Self-Supervised and Contrastive Learning (2 papers)

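2606.19882 2026-06-19 cs.CV cs.LG cross-listed

Multimodal Concept Bottleneck Models

Tongqing Shi, Ge Yan, Tuomas Oikarinen, Tsui-Wei Weng

Affiliations UC San Diego

AI Summary Proposes the Multimodal Concept Bottleneck Model (MM-CBM), which uses dual concept bottleneck layers to align image and text embeddings for interpretable zero-shot classification and image retrieval, with an average accuracy improvement of up to 51.26% across four benchmarks.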

Comments Presented at the NeurIPS 2025 Mechanistic Interpretability Workshop

Abstract

Concept Bottleneck Models (CBMs) enhance the interpretability of deep learning networks by aligning the features extracted from images with natural concepts. However, existing CBMs are limited in their ability to generalize beyond a fixed set of predefined classes and are prone to non-concept information leakage, where predictive signals outside the intended concepts are inadvertently exploited. In this paper, we propose the Multimodal Concept Bottleneck Model (MM-CBM) to address these issues and extend CBMs to CLIP. MM-CBM uses dual Concept Bottleneck Layers (CBLs) to align both the image and text embeddings into interpretable features. This allows us to perform new vision tasks such as zero-shot classification or image retrieval in an interpretable way. Compared to existing methods, MM-CBM achieves up to 51.26% accuracy improvement on average across four standard benchmarks. Our method maintains high accuracy, staying within ~5% of black-box performance while offering greater interpretability.

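2606.20559 2026-06-19 cs.CV cs.LG cross-listed

UNIEGO: Proxies as Mediators for Unified Egocentric Video Representation Learning

Wenhao Chi, Arkaprava Sinha, Dominick Reilly, Hieu Le, Srijan Das

AI Summary Proposes UNIEGO, a hierarchical multi-teacher distillation framework that uses proxy models to translate heterogeneous teacher knowledge into a homogeneous egocentric space and applies Selective Proxy Distillation to adaptively pick reliable supervision, reaching state-of-the-art results on three egocentric video understanding tasks.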

Abstract

Egocentric video understanding is inherently limited by the narrow perspective of wearable cameras: a single viewpoint, a single modality, or a single model cannot capture the full richness of human action. We argue that a truly expressive egocentric representation must subsume complementary knowledge across viewpoints, modalities, and foundation model representations, yet remain deployable from egocentric video alone. To this end, we introduce a hierarchical multi-teacher distillation framework that produces UNIEGO, a unified egocentric encoder trained with nine teachers spanning ego-exo viewpoints, RGB, depth, and skeleton modalities, and four foundation models. Rather than distilling directly from heterogeneous teachers whose incompatible architectures and feature geometries induce conflicting gradients, our framework interposes a layer of representation-specific Proxy models that translate diverse teacher knowledge into a homogeneous egocentric space. A second distillation stage, Selective Proxy Distillation (SPD), then adaptively selects, for each training sample, the subset of proxies that are both correct and confident, distilling exclusively from reliable supervision and suppressing erroneous signals. SPD is further stabilized by initializing UNIEGO as a learned convex combination of proxy parameters, placing the unified model in a well-conditioned region of the loss landscape before distillation begins. UNIEGO achieves state-of-the-art performance across three egocentric video understanding tasks (action recognition, video retrieval, and action segmentation) on three challenging ego-exo benchmarks, outperforming naive multi-teacher distillation baselines and demonstrating that structured, proxy-mediated knowledge transfer yields richer and more discriminative egocentric representations.
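Editor's sketch (not from the paper): the two SPD ingredients named in the abstract can be illustrated with toy numbers. The confidence threshold `min_conf`, the function names, and the fixed mixing weights are assumptions; the abstract says the selection is adaptive and the convex weights are learned, without giving the rule.

```python
def select_proxies(proxy_probs, label, min_conf=0.6):
    """Return indices of proxies whose top prediction is correct and confident."""
    chosen = []
    for idx, probs in enumerate(proxy_probs):
        top = max(range(len(probs)), key=probs.__getitem__)
        if top == label and probs[top] >= min_conf:
            chosen.append(idx)
    return chosen

def convex_init(proxy_params, alphas):
    """Initialize parameters as a convex combination of proxy parameter vectors."""
    if any(a < 0 for a in alphas) or abs(sum(alphas) - 1.0) > 1e-9:
        raise ValueError("alphas must be non-negative and sum to 1")
    size = len(proxy_params[0])
    return [sum(a * p[i] for a, p in zip(alphas, proxy_params))
            for i in range(size)]

# Three proxies scoring one sample whose true class is 0.
probs = [[0.9, 0.1], [0.55, 0.45], [0.2, 0.8]]
reliable = select_proxies(probs, label=0)
init = convex_init([[1.0, 2.0], [3.0, 4.0]], [0.5, 0.5])
```

Here only the first proxy is kept: the second is correct but below the threshold, and the third is wrong.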

3. Reinforcement Learning and Sequential Decision-Making (8 papers)

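2606.19632 2026-06-19 cs.RO cs.AI cs.LG cs.LO cs.MA cross-listed

Formal Verification of Learned Multi-Agent Communication Policies via Decision Tree Distillation

Ahmad Farooq, Kamran Iqbal

Affiliations University of Arkansas at Little Rock

AI Summary Proposes distilling multi-agent reinforcement learning policies into interpretable models via decision tree distillation and formally verifying them with PRISM, confirming that safety properties transfer to the original networks; achieves 88.9% property satisfaction on a drone formation task.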

Comments 9 pages, 3 figures, 7 tables. Accepted at the 2026 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2026), Pittsburgh, Pennsylvania, USA, September 27-October 1, 2026

Abstract

Multi-agent reinforcement learning (MARL) enables agents to develop coordination strategies through emergent communication, but neural policies lack the formal safety guarantees required for safety-critical robotic deployment in drone swarms and autonomous vehicle fleets. We present the first end-to-end framework for safety verification of learned multi-agent communication policies through policy abstraction: neural policies are distilled into interpretable decision trees, then formally verified, with empirical validation confirming that verified safety properties transfer to original networks. Our four-stage pipeline consists of domain-specific feature extraction from agent observations, decision tree distillation achieving 97.9% +/- 1.2% fidelity to neural policies, automated translation to PRISM probabilistic model checker specifications with complete feature-to-state-variable correspondence, and compositional verification of Probabilistic Computation Tree Logic (PCTL) properties via pairwise decomposition with union-bound aggregation and empirical neighbor modeling. Evaluating Vector-Quantized Variational Information Bottleneck (VQ-VIB) policies for multi-drone coordination with 5-7 agents, we verify 18 temporal logic properties across safety, liveness, and cooperation, achieving 88.9% property satisfaction with all five safety thresholds satisfied (0.3% collision probability vs. 1% threshold). Monte Carlo validation of original neural policies confirms that verified safety properties transfer with <=0.6 percentage-point deviation (95% CI). Discrete VQ-VIB messages provide +11.6 to +13.6 percentage-point fidelity advantages over continuous methods, enabling 3-4x faster verification. Our framework provides empirically validated safety verification for distilled policy abstractions, serving as a practical bridge between deep MARL and formal safety workflows for multi-robot deployment.
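Editor's sketch (not from the paper): "pairwise decomposition with union-bound aggregation" rests on the inequality P(any pair collides) <= sum of the pairwise collision probabilities. The function name and the per-pair probability below are illustrative and are not the paper's measurements.

```python
from itertools import combinations

def union_bound_collision(n_agents, pair_prob):
    """Upper-bound P(any pair collides) by the sum of pairwise probabilities."""
    total = sum(pair_prob(i, j) for i, j in combinations(range(n_agents), 2))
    return min(1.0, total)  # a probability bound never needs to exceed 1

# Five agents give ten pairs; assume each pair was verified at 0.0005.
bound = union_bound_collision(5, lambda i, j: 0.0005)
meets_threshold = bound <= 0.01
```

The bound is conservative: it can overstate the joint probability, but a system-level threshold that holds for the bound also holds for the true value.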

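2606.19656 2026-06-19 cs.RO cs.LG cross-listed

DF-ExpEnse: Diffusion Filtered Exploration for Sample Efficient Finetuning

Calvin Luo, Chen Sun, Shuran Song

Affiliations Stanford University; Brown University

AI Summary Proposes DF-ExpEnse, an exploration technique that uses the multimodal modeling capacity of generative control policies and an ensemble of critics to collect online experience efficiently during finetuning, improving sample efficiency.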

Comments ICML 2026

Abstract

A natural recipe for intelligent robotic decision-making is initializing from pretrained generative control policies, which have summarized offline experience, and adapting them to self-collected online experience. We present DF-ExpEnse, an exploration technique that improves the quality of online experience collection, thus increasing finetuning sample-efficiency. DF-ExpEnse leverages the multimodal modeling capabilities of the generative control policy to create an expressive and tractably evaluatable candidate set. It then utilizes an ensemble of critics to identify the action that best balances quality with high exploration interest. In fleet settings, DF-ExpEnse further enables cross-agent communication to facilitate collaborative exploration as a group. DF-ExpEnse can be seamlessly integrated with existing strategies that finetune pretrained generative control policies via reinforcement learning. We experimentally validate consistent sample-efficiency benefits through DF-ExpEnse across a variety of manipulation and locomotion tasks, compared to default finetuning and alternative action selection schemes. Project can be found at https://df-expense.github.io.
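Editor's sketch (not from the paper): one common way to balance "quality with high exploration interest" using a critic ensemble is to score each candidate by the ensemble mean plus a bonus proportional to the ensemble disagreement. That scoring rule, the name `pick_action`, and the toy critics are assumptions. The abstract does not state the paper's selection rule, and the candidate list here stands in for samples from a generative policy.

```python
import statistics

def pick_action(candidates, critics, beta=1.0):
    """Choose the candidate with the best mean value plus disagreement bonus."""
    def score(action):
        qs = [q(action) for q in critics]
        return statistics.mean(qs) + beta * statistics.pstdev(qs)
    return max(candidates, key=score)

# Two toy critics that agree near action 1.0 and disagree near action 2.0.
critics = [
    lambda a: -(a - 1.0) ** 2,
    lambda a: -(a - 1.0) ** 2 + 0.5 * a,
]
candidates = [0.0, 1.0, 2.0]

greedy = pick_action(candidates, critics, beta=0.0)      # value only
exploratory = pick_action(candidates, critics, beta=4.0)  # favors disagreement
```

With `beta=0.0` the rule is purely greedy; raising `beta` shifts the choice toward actions on which the critics disagree.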

2606.19920 2026-06-19 cs.RO cs.LG cs.MA cross-listed

Deep-Unfolded Coordination

Hunter Kuperman, Minchan Jung, Rahul V. Ghosh, Alex Oshin, Evangelos A. Theodorou

Affiliations Autonomous Control and Decision Systems Laboratory, Georgia Institute of Technology, United States

AI Summary Proposes the Deep Coordinator framework, which deep-unfolds ADMM-DDP iterations to learn dynamic hyperparameter adjustment, adapting the penalty parameters of a non-convex optimizer at solve time; in car-fleet and quadrotor simulations it is 6.18-9.44x faster and scales to systems 8x larger.

Comments The second and third authors contributed equally (equal second authorship). 35 pages (10 pages main text), 17 figures, 3 tables

Abstract

Distributed optimization is a highly scalable and structurally transparent technique to solve multi-agent robotics problems; however, such methods often suffer from the need for highly-specialized, problem-specific hyperparameter tunings. In this work, we propose Deep Coordinator, a deep-unfolding framework that learns to dynamically adjust the hyperparameters of ADMM-DDP, a popular distributed solver for robotics tasks, at solve-time in response to optimizer performance. Our architecture consists of unrolling a fixed number of ADMM-DDP iterations into a neural network with learnable functions between layers mapping the optimizer state to the next hyperparameters. To the best of our knowledge, Deep Coordinator is the first deep-unfolding framework to adapt the penalty parameters of a non-convex optimizer at solve-time; we show that the mainstream supervised approach can yield degenerate solutions when training such models, and propose an unsupervised learning scheme. On simulations with fleets of cars and quadrotors, Deep Coordinator produces trajectories of comparable quality 6.18-9.44x faster than conventional solvers. Furthermore, Deep Coordinator retains its performance benefits when deployed to systems up to 8x larger than trained on.

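2606.20022 2026-06-19 stat.ML cs.LG math.OC cross-listed

Stochastic Linear Contextual Bandits with Bounded Noise: A Set-Membership Approach

Haonan Xu, Yingying Li

AI Summary For stochastic linear contextual bandits with bounded reward noise, proposes the SME-OFU algorithm based on set-membership estimation and the optimism principle, achieving an O(log T) regret bound that improves on the optimal bound under sub-Gaussian noise.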

Comments 23 pages, 1 figure

Abstract

This paper considers stochastic linear contextual bandits (SLCB) with bounded reward noise. Existing works typically assume sub-Gaussian reward noise and bounded expected rewards, under which the optimal regret bound scales as $\tilde{O}(\sqrt{T})$ in terms of horizon $T$. However, in many applications, realized/observed rewards are also naturally bounded, implying bounded reward noise. Bounded noise is more informative than the sub-Gaussian condition but has not been leveraged explicitly in the SLCB literature. In this paper, we propose a novel algorithm SME-OFU by utilizing an uncertainty quantification method called set-membership estimation (SME) and applying the principle of optimism in the face of uncertainty (OFU). Our algorithm enjoys an improved regret bound $O(\log T)$. Notice that this does not contradict the existing optimal bound $\tilde{O}(\sqrt{T})$ for sub-Gaussian noise because bounded noise is a stronger condition. Finally, simulations show empirical improvements of SME-OFU over a benchmark algorithm designed for sub-Gaussian noise when the reward noise is bounded.
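Editor's sketch (not from the paper): set-membership estimation with bounded noise can be illustrated with a scalar parameter. Each observation (x, r) with |r - x * theta| <= w confines theta to an interval, and intersecting these intervals gives the uncertainty set. The function names and the scalar setting are assumptions; the paper works with vector parameters and its exact algorithm is not reproduced here.

```python
def update_interval(lo, hi, x, r, w):
    """Intersect [lo, hi] with {theta : |r - x * theta| <= w}."""
    if x == 0:
        return lo, hi  # the observation carries no information about theta
    a, b = (r - w) / x, (r + w) / x
    if a > b:          # dividing by a negative x flips the bounds
        a, b = b, a
    return max(lo, a), min(hi, b)

def optimistic_action(lo, hi, features):
    """OFU: pick the feature with the largest best-case reward over the set."""
    return max(features, key=lambda x: max(x * lo, x * hi))

lo, hi = -1.0, 1.0                               # prior bounds on theta
lo, hi = update_interval(lo, hi, x=1.0, r=0.5, w=0.2)
lo, hi = update_interval(lo, hi, x=2.0, r=1.3, w=0.2)
choice = optimistic_action(lo, hi, [-1.0, 0.5, 1.0])
```

After two observations the set shrinks from [-1, 1] to roughly [0.55, 0.7]. Unlike a confidence ellipsoid, every point outside the interval is ruled out with certainty, which is what the bounded-noise assumption buys.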

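2606.20206 2026-06-19 stat.ML cs.LG cross-listed

Off-Policy Evaluation for Missingness-Aware Policies in MDPs with Rewards Missing Not at Random

Ziheng Wei, Annie Qu, Rui Miao

AI Summary For offline reinforcement learning with rewards missing not at random, proposes an identification approach that uses future states as shadow variables and recovers the conditional mean reward with a bridge function and a min-max estimator, enabling off-policy evaluation of missingness-aware policies.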

Comments Accepted at ICML 2026. 31 pages, 6 figures

Abstract

In offline reinforcement learning, immediate rewards in logged batch data are often unobserved due to sparse or irregular record-keeping, or censored beyond certain reward values. This issue arises in practical settings, including health care and marketing. We investigate off-policy evaluation (OPE) in finite-horizon Markov decision processes when rewards are missing not at random (MNAR), which breaks ignorability and induces selection bias even after conditioning on states and actions. To address this, we formalize a reward-dependent propensity model and use future states as shadow variables to identify the full-data conditional mean reward. We further introduce a bridge function that recovers the conditional mean reward without explicitly modeling the MNAR mechanism, and estimate it via a min-max procedure to avoid double sampling. Building upon these identification results, we propose a Fitted-Q-Evaluation-style estimator that propagates the recovered rewards while allowing target policies to depend on past missingness indicators. Finally, we establish consistency and finite-sample error bounds for our OPE estimator, and show through experiments the strong performance of our method compared to existing methods on simulated and MIMIC-III Sepsis data.

2606.20236 2026-06-19 cs.AI cs.LG cs.MA 交叉投稿

A Multi-Agent system for Multi-Objective constrained optimization

多目标约束优化的多智能体系统

Federica Filippini

发表机构 * University of Milano-Bicocca(米兰比可卡大学)

AI总结 提出MAMO,通过多智能体强化学习解耦任务执行与目标设计,自动学习奖励权重以平衡主目标优化与约束违反,提升动态环境下RL的自主性和鲁棒性。

Comments Presented at the 17th Workshop on Optimization and Learning in Multiagent Systems (OptLearnMAS, https://optlearnmas.github.io), co-located with the 25th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2026)

详情
AI中文摘要

计算和网络系统中的许多决策问题可以自然地表述为在性能约束下的成本最小化问题。在动态环境中,强化学习(RL)通常通过在运行时将成本和约束违反通过加权惩罚项嵌入到单个标量奖励中(遵循拉格朗日启发式公式)来解决此类问题。然而,在这种背景下,学习策略的行为关键取决于这些权重的选择,而权重通常是手动选择的。这使得难以在优化主要目标和有效避免约束违反之间找到适当的权衡,特别是在非平稳环境中,它们的相对重要性可能发生变化。本文提出了MAMO(多目标约束优化的多智能体系统),一种通过多智能体RL解决这种平衡问题的方法。MAMO通过将奖励权重的选择表述为一个学习问题,将任务执行与目标设计解耦,为动态环境中约束优化问题的更自主和鲁棒的基于RL的解决方案迈出了第一步。

英文摘要

Many decision-making problems in computing and networking systems can be naturally formulated as cost-minimization problems under performance constraints. In dynamic environments, reinforcement learning (RL) is often used to solve such problems at runtime by embedding both costs and constraint violations into a single scalar reward through weighted penalty terms, following a Lagrangian-inspired formulation. However, in this context the behavior of the learned policy critically depends on the choice of these weights, which are typically selected manually. This makes it difficult to identify an appropriate trade-off between optimizing the primary objective and effectively avoiding constraint violations, particularly in non-stationary environments where their relative importance may change. This paper presents MAMO (Multi-Agent system for Multi-Objective constrained optimization), an approach to tackle this balancing problem through multi-agent RL. MAMO decouples task execution from objective design by formulating the selection of reward weights as a learning problem, providing a first step towards more autonomous and robust RL-based solutions for constrained optimization problems in dynamic environments.
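
The weighted-penalty reward mentioned in this abstract can be illustrated with a small sketch. It is not MAMO's implementation: the function names, the single constraint of the form `metric <= threshold`, and all numeric values are assumptions made for the example. It only shows why the manually chosen weight matters: the same two actions are ranked differently depending on it.

```python
def scalarized_reward(cost, metric, threshold, weight):
    """Lagrangian-inspired scalar reward: negative cost minus a weighted penalty.

    Hypothetical helper; the constraint is assumed to be metric <= threshold.
    """
    violation = max(0.0, metric - threshold)  # zero when the constraint holds
    return -cost - weight * violation


# Two candidate actions (made-up numbers): a cheap one that violates the
# constraint and a more expensive one that satisfies it.
CHEAP = {"cost": 1.0, "metric": 1.5}
SAFE = {"cost": 2.0, "metric": 0.8}


def preferred(weight, threshold=1.0):
    """Return which action the scalar reward favors for a given penalty weight."""
    r_cheap = scalarized_reward(CHEAP["cost"], CHEAP["metric"], threshold, weight)
    r_safe = scalarized_reward(SAFE["cost"], SAFE["metric"], threshold, weight)
    return "cheap" if r_cheap > r_safe else "safe"


if __name__ == "__main__":
    for w in (1.0, 4.0):
        print(w, preferred(w))
```

With these numbers the violating action wins for small weights and loses for large ones, which is the trade-off the abstract says is hard to tune by hand.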

2606.20324 2026-06-19 cs.SE cs.LG 交叉投稿

A Model-Driven Approach for Developing Families of Reinforcement Learning Environments

一种模型驱动的方法用于开发强化学习环境族

Xiaoran Liu, Istvan David

AI总结 提出一种模型驱动方法,通过混合遗传算法和模型转换自动生成强化学习训练环境族,以解决手动开发环境族耗时且易错的问题,并在野火缓解场景中验证了其有效性。

详情
AI中文摘要

虚拟训练环境是软件密集型系统,强化学习(RL)智能体在其中学习、适应并展示有意义的行为。虚拟训练环境为在现实环境中训练智能体提供了一种安全且成本效益高的替代方案。然而,为了收敛,大多数现实的RL问题需要在多个相似但略有不同的环境中进行训练——即环境变体族。环境族的典型开发过程是一项劳动密集型且容易出错的手动工作,难以扩展。为了缓解这些问题,本文提出了一种模型驱动的方法来开发RL训练环境族。为了获得环境族,我们开发了一种方法和原型工具。在我们的方法中,一种混合遗传算法——基于种群的全局搜索和启发式局部搜索的结合——生成环境族。变异和约束被表达为模型转换,并通过最先进的模型转换引擎操作化为搜索过程。我们在野火缓解场景和课程学习(一种依赖于环境族的特定学习范式)中展示了我们方法的有效性。

英文摘要

Virtual training environments are software-intensive systems in which reinforcement learning (RL) agents learn, adapt, and demonstrate meaningful behavior. Virtual training environments offer a safe and cost-efficient alternative to training agents in real-world settings. However, to converge, most realistic RL problems require training in multiple, mostly similar but slightly different environments - i.e., families of environment variants. The typical development process of environment families is a labor-intensive and error-prone manual endeavor that does not scale well. To alleviate these issues, in this paper, we propose a model-driven approach for developing families of RL training environments. To obtain the family of environments, we develop an approach and prototype tool. In our approach, a hybrid genetic algorithm - a combination of population-based global search and heuristic local search - generates environment families. Mutations and constraints are expressed as model transformations and are operationalized into a search process by a state-of-the-art model transformation engine. We demonstrate the soundness of our approach in a wildfire mitigation scenario and curriculum learning - a particular learning paradigm that relies on environment families.
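
The hybrid search described in this abstract (population-based global search combined with heuristic local search) can be sketched in a few lines. This is a toy under stated assumptions, not the authors' tool: environments are encoded as bitstrings, the fitness function simply counts enabled features, and mutation is a random bit flip, whereas the paper expresses mutations and constraints as model transformations.

```python
import random


def fitness(genome):
    # Toy objective (assumption): number of enabled features in the environment.
    return sum(genome)


def mutate(genome, rng, rate=0.1):
    # Global-search operator: flip each bit independently with probability `rate`.
    return [1 - g if rng.random() < rate else g for g in genome]


def local_search(genome):
    # Heuristic local step: apply the first single-bit flip that improves fitness.
    for i in range(len(genome)):
        neighbor = genome[:i] + [1 - genome[i]] + genome[i + 1:]
        if fitness(neighbor) > fitness(genome):
            return neighbor
    return genome


def hybrid_ga(length=12, pop_size=8, generations=20, seed=0):
    """Return the final population, read here as a 'family' of environment variants."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        # Global step: keep the better half, refill with mutated copies of it.
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]
        children = [mutate(rng.choice(parents), rng)
                    for _ in range(pop_size - len(parents))]
        # Local step: refine every individual with one heuristic move.
        population = [local_search(g) for g in parents + children]
    return population


if __name__ == "__main__":
    family = hybrid_ga()
    print(sorted(fitness(g) for g in family))
```

Because the best individual is always kept and then improved by the local step, the best fitness in this toy cannot decrease from one generation to the next.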

2606.20356 2026-06-19 math.OC cs.AI cs.LG math.PR stat.ML 交叉投稿

Robust $Q$-learning for mean-field control under Wasserstein uncertainty in common noise

公共噪声Wasserstein不确定性下的平均场控制鲁棒$Q$-学习

Mathieu Laurière, Ariel Neufeld, Kyunghyun Park

AI总结 提出一种针对公共噪声分布Wasserstein不确定性的离散时间平均场控制鲁棒$Q$-学习算法,结合量化投影与Wasserstein对偶,证明同步和异步学习的收敛性及有限时间界,并在系统风险和流行病模型中验证鲁棒性-性能权衡。

详情
AI中文摘要

在本文中,我们提出了一种针对公共噪声定律下Wasserstein不确定性的离散时间平均场控制问题的鲁棒$Q$-学习算法。该算法将量化投影方案与公共噪声空间上的Wasserstein对偶重述相结合。我们建立了其收敛性以及同步和异步学习方案的有限时间迭代界。关于系统风险和流行病模型的数值实验将异步实现与理想化的Bellman迭代进行了比较,说明了在公共噪声误设下的鲁棒性-性能权衡,并报告了异步$Q$-学习算法的观察收敛行为。

英文摘要

In this article, we present a robust $Q$-learning algorithm for discrete-time mean-field control problems under Wasserstein uncertainty in the common noise law. The algorithm combines a quantization-and-projection scheme with a Wasserstein dual reformulation on the common-noise space. We establish its convergence together with finite-time iteration bounds for both synchronous and asynchronous learning schemes. Numerical experiments on systemic risk and epidemic models compare the asynchronous implementation with an idealized Bellman iteration, illustrate the robustness-performance tradeoff under common-noise misspecification, and report the observed convergence behavior of the asynchronous $Q$-learning algorithm.

4. 生成模型与概率建模 4 篇

2606.19460 2026-06-19 cs.CV cs.AI cs.LG 交叉投稿

Scaling Generative Foundation Models for Chest Radiography with Rectified Flow Transformers

使用整流流变换器扩展胸部X光片的生成式基础模型

Fabio De Sousa Ribeiro, Emma A. M. Stanley, Charles Jones, Tian Xia, Dominic C. Marshall, Laurent Renard Triché, Christopher V. Cosgriff, Panagiotis Dimitrakopoulos, Sotirios A. Tsaftaris, Ben Glocker

发表机构 * Imperial College London(帝国理工学院) Causality in Healthcare AI Hub(医疗AI因果关系中心) University of Edinburgh(爱丁堡大学) Cleveland Clinic London(克利夫兰诊所伦敦) Department of Perioperative Medicine, CHU Clermont-Ferrand(克莱蒙费朗大学医院围手术期医学科) Department of Medicine, Massachusetts General Hospital(麻省总医院医学部) Broad Institute of MIT and Harvard(麻省理工学院与哈佛大学博德研究所)

AI总结 提出首个十亿参数级胸部X光片生成基础模型,通过整流流变换器实现高保真可控合成,显著提升合成图像与真实图像的不可区分性。

Comments Project page: https://RadiT-project.github.io

详情
AI中文摘要

我们引入了首个从零开始在十亿参数规模上训练的胸部X光片合成生成基础模型。现有的放射学AI模型通常在不同患者亚群、机构和采集设置下泛化能力差,导致实际临床效用有限。可控、高保真的胸部X光片合成是多样化临床数据集和评估诊断模型鲁棒性的有前景途径。因此,我们提出了迄今为止最大的胸部X光片专用生成基础模型,拥有超过13亿参数,在包含120万张X光片和临床专家指导元数据的精选异质数据集上训练了1.6万亿个token。我们的模型支持跨多个人口统计亚组、采集视图和十多种病理的可控X光片生成和编辑。此外,我们显著推进了X光片合成保真度的最新技术,生成的图像对临床专家而言与真实X光片无法区分。

英文摘要

We introduce the first generative foundation model for chest radiograph synthesis trained from scratch at the billion-parameter scale. Existing radiographic AI models often suffer from poor generalisation across patient subpopulations, institutions, and acquisition settings, resulting in limited real-world clinical utility. Controlled, high-fidelity synthesis of chest radiographs is a promising path toward diversifying clinical datasets and evaluating the robustness of diagnostic models. Therefore, we present the largest specialist generative foundation model for chest radiographs to date, with over 1.3B parameters, trained for 1.6T tokens on a curated, heterogeneous dataset comprising 1.2M radiographs and clinical expert-guided metadata. Our model supports controllable radiograph generation and editing across multiple demographic subgroups, acquisition views, and a dozen pathologies. Moreover, we significantly advance the state of the art in radiograph synthesis fidelity, producing images that are indistinguishable from real radiographs to clinical experts.

2606.19651 2026-06-19 cs.AI cs.CV cs.LG 交叉投稿

BrainG3N: A Dual-Purpose Tokenizer for Controllable 3D Brain MRI Generation

BrainG3N:用于可控3D脑MRI生成的双用途分词器

Max Van Puyvelde, Ibrahim Gulluk, Wim Van Criekinge, Olivier Gevaert

发表机构 * Department of Biomedical Data Science, Stanford University School of Medicine(斯坦福大学医学院生物医学数据科学系) Department of Mathematical Modelling, Statistics & Bioinformatics, Ghent University(根特大学数学建模、统计与生物信息学系) Department of Electrical Engineering, Stanford University(斯坦福大学电气工程系)

AI总结 提出基于3D掩码自编码器的分词器,解耦编码器与解码器,在23项线性探测任务中21项超越SOTA,并支持条件生成和纵向预测。

详情
AI中文摘要

三维(3D)脑MRI是临床神经病学和神经肿瘤学的核心,生成模型可以增强代表性不足的队列、模拟疾病轨迹并支持隐私保护的数据共享。潜在扩散已成为建模成像数据的首选解决方案,但它对分词器提出了两个竞争性要求:编码器嵌入必须保留下游任务所需的临床信息,解码器必须重建解剖学上准确的体积。现有的重建驱动分词器以牺牲前者为代价实现了后者。为了解决这个问题,我们引入了一种基于全体积掩码自编码器(MAE)的分词器,用于3D脑MRI潜在扩散,解耦编码器和解码器:冻结的3D MAE编码器产生临床信息丰富的嵌入,而专用的CNN解码器从这些嵌入的线性投影重建体素。我们在来自18个公共队列的35,309个体积上预训练编码器,涵盖四种模态、十种疾病类别和200多个采集站点,并在两种设置中展示了其双重用途。首先,在23项线性探测基准测试中,编码器在21项任务上优于或匹配SOTA模型(即BrainIAC、BrainSegFounder和MedicalNet)。其次,在这些临床信息丰富的嵌入上训练的条件扩散变压器(DiT)支持跨六个变量的条件生成和患者特定的纵向预测。这些结果共同建立了一个单一的3D脑MRI嵌入空间,能够同时支持下游临床任务和可控生成。

英文摘要

Three-dimensional (3D) brain MRI is central to clinical neurology and neuro-oncology, where generative models could augment under-represented cohorts, simulate disease trajectories, and support privacy-preserving data sharing. Latent diffusion has been the go-to solution for modeling imaging data, but it places two competing demands on the tokenizer: encoder embeddings must retain the clinical information that downstream tasks act on, and the decoder must reconstruct anatomically faithful volumes. Existing reconstruction-driven tokenizers achieve the second at the expense of the first. To address this, we introduce a fully volumetric masked-autoencoder (MAE) based tokenizer for 3D brain MRI latent diffusion, decoupling encoder and decoder: a frozen 3D MAE encoder produces clinically informative embeddings, while a dedicated CNN decoder reconstructs voxels from a linear projection of those embeddings. We pretrain the encoder on 35,309 volumes from 18 public cohorts spanning four modalities, ten disease categories, and 200+ acquisition sites, and demonstrate its dual utility in two settings. First, on a 23-task linear-probing benchmark, the encoder outperforms or matches SOTA models (i.e., BrainIAC, BrainSegFounder, and MedicalNet) on 21 of 23 tasks. Second, a conditional diffusion transformer (DiT) trained on these clinically informative embeddings supports both conditional generation across six variables and patient-specific longitudinal forecasting. Together these results establish a single 3D brain-MRI embedding space capable of both downstream clinical tasks and controllable generation.

2606.20094 2026-06-19 cs.CV cs.AI cs.GR cs.LG cs.MM 交叉投稿

MakeupMirror: Improving Facial Attribute Preservation in Diffusion Models for Makeup Transfer

MakeupMirror:在用于化妆迁移的扩散模型中改进面部属性保持

Nefeli Andreou, Angel Martínez-González, Sabine Sternig, Matthieu Guillaumin, Epameinondas Antonakos, Michael Opitz

发表机构 * Amazon(亚马逊)

AI总结 提出MakeupMirror扩散模型,通过ControlNet几何条件、区域特定迁移控制、肤色调制和Langevin采样器,在保持面部特征和肤色的同时实现高质量化妆迁移,相比Stable-Makeup提升面部识别相似度60%、降低肤色差异50%。

详情
AI中文摘要

化妆迁移模型能够实现有趣的增强现实(AR)体验以及在线化妆购物的虚拟试妆(VTO)。尽管最近最先进的基于扩散的解决方案(如Stable-Makeup)显著提高了化妆迁移的准确性和逼真度,但在身份和肤色保持方面仍存在局限性,使得用于化妆购物的生产级VTO不切实际。在这项工作中,我们提出了MakeupMirror,一种基于扩散的化妆迁移方法,在保持面部特征和肤色方面取得了显著进展。我们在Stable-Makeup的基础上引入了多项技术创新:(1)将面部几何条件与ControlNets集成以保持面部保真度;(2)区域特定的化妆迁移控制,以便在面部区域(如皮肤、眼睛和嘴唇)实现精确的化妆应用;(3)基于肤色的化妆迁移调制,防止跨主体迁移场景中的肤色改变;(4)集成Levenberg-Marquardt Langevin采样器以加速推理同时保持生成质量。我们在CPM-Real、Makeup Wild以及(本文新收集的、更多样化的)MakeupSelfies数据集上的实验表明,与Stable-Makeup相比,MakeupMirror将相对面部识别相似度提高了+60%,将相对肤色差异降低了-50%,延迟为0.7秒,同时在核心面部身份保持标准上达到了94%的专家接受率。

英文摘要

Makeup transfer models enable fun augmented reality (AR) experiences as well as virtual try-on (VTO) for online makeup shopping. While recent state-of-the-art diffusion-based solutions such as Stable-Makeup dramatically improve the accuracy and realism of makeup transfer, they still face limitations in identity and skin color preservation, making production-level VTO for makeup shopping unrealistic. In this work, we propose MakeupMirror, a diffusion-based approach to makeup transfer that makes significant progress towards preserving facial features and skin tone. We introduce several technical innovations over Stable-Makeup: (1) integration of facial geometry conditioning with ControlNets to maintain facial fidelity; (2) region-specific makeup transfer control to enable precise makeup application across facial regions such as skin, eyes and lips; (3) skin tone-based makeup transfer modulation that prevents skin tone alteration in cross-subject transfer scenarios; and (4) integration of a Levenberg-Marquardt Langevin sampler to speed up inference while maintaining generation quality. Our experiments on CPM-Real, Makeup Wild, and (herein newly collected, more diverse) MakeupSelfies datasets show that MakeupMirror improves relative facial recognition similarity by 60% and reduces relative skin tone difference by 50% over Stable-Makeup, with a latency of 0.7s, while achieving an expert acceptance rate of 94% across core facial identity preservation criteria.

2606.20457 2026-06-19 eess.AS cs.AI cs.LG 交叉投稿

Repurposing a Speech Classifier for Guided Diffusion-Based Speech Generation

重新利用语音分类器进行基于引导扩散的语音生成

Rostislav Makarov, Timo Gerkmann

AI总结 提出将预训练的语音分类器作为扩散生成的主干,通过附加轻量子网络并仅训练该子网络,实现单主干模型的高质量条件语音生成,降低内存和计算成本。

Comments Accepted for publication in the Proceedings of Interspeech 2026

详情
AI中文摘要

分类器引导是一种通过使用噪声条件分类器将采样过程导向目标类别来控制扩散生成的方法。分类器引导的一个缺点是需要两个单独训练的模型:一个分类器和一个扩散模型。因此,我们研究了一种更紧凑的替代方案,其中将传统训练的语音分类器重新用作扩散生成的主干。从log-Mel空间中的冻结噪声条件分类器开始,我们附加一个轻量子网络,该子网络重用中间分类器表示,并在去噪分数匹配目标下仅训练该子网络。我们的工作表明,预训练的分类器可以重新用于条件生成,为判别建模和条件语音合成之间提供了有吸引力的桥梁,从而在单主干模型中实现高语音质量,同时减少内存占用和计算成本。

英文摘要

Classifier guidance is a way to control diffusion generation by using a noise-conditioned classifier to steer the sampling process toward a target class. One drawback of classifier guidance is that it requires two separately trained models: a classifier and a diffusion model. We therefore study a more compact alternative in which a conventionally trained speech classifier is repurposed as the backbone for diffusion generation. Starting from a frozen noise-conditioned classifier in log-Mel space, we attach a lightweight subnetwork that reuses intermediate classifier representations and train only this subnetwork under a Denoising Score Matching objective. Our work shows that a pretrained classifier can be repurposed for conditional generation, providing an appealing bridge between discriminative modeling and conditional speech synthesis, resulting in high speech quality within a single-backbone model, with reduced memory footprint and computational cost.
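
For reference, the standard classifier-guidance identity is written out below. It follows from Bayes' rule and is a general fact about guided diffusion, not a formula taken from this paper; the guidance scale $w$ and the notation $x_t$, $y$, $\tilde{s}$ are assumptions introduced here. The first line shows why two models are usually needed: one supplies the unconditional score, the other the classifier gradient.

```latex
\nabla_{x_t} \log p_t(x_t \mid y)
  = \nabla_{x_t} \log p_t(x_t) + \nabla_{x_t} \log p_t(y \mid x_t),
\qquad
\tilde{s}(x_t, y)
  = \nabla_{x_t} \log p_t(x_t) + w \, \nabla_{x_t} \log p_t(y \mid x_t).
```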

5. 优化、泛化与理论分析 7 篇

2606.19410 2026-06-19 stat.ML cs.LG 交叉投稿

The Representational Limit of Scalar Interactions: An Interventional Decomposition

标量交互的表征限制:一种干预分解

Potito Aghilar, Sabino Roccotelli, Stanislao Fidanza, Vito Walter Anelli, Sebastiano Stramaglia, Tommaso Di Noia

AI总结 本文证明标量交互指标混淆了唯一性、冗余性和协同性,并提出Stochastic Hi-Fi方法,通过干预掩码推理分解每个特征的U/R/S轮廓,在表格和图像任务中恢复被标量基线遗漏的结构。

详情
AI中文摘要

有符号的成对交互指标从根本上混淆了唯一性(U)、冗余性(R)和协同性(S)。我们在一个最小的3路XOR结构因果模型上证明了这一点:忠实的指标如Shapley-Taylor对每对返回零,而投影指标如Shapley Interaction将三阶效应扩散到混淆三种机制的成对标量中。我们引入了Stochastic Hi-Fi,一种事后、无需重新训练的可预测性分解方法,通过干预掩码推理估计每个特征的U/R/S轮廓。该估计器提供精确的干预语义、有限样本蒙特卡洛界限、耦合菱形采样带来的严格方差减少以及均匀的有限词汇收敛。在表格SCM上,Stochastic Hi-Fi恢复了被标量基线遗漏的结构(交互幅度恢复比高达411倍)。它还在GPT-2 IOI电路中分离了冗余和协同头。在NIH ChestX-ray14上,Stochastic Hi-Fi在Pointing Game中匹配GradCAM,并在Deletion AUC上显著改进。

英文摘要

Signed pairwise interaction scores fundamentally conflate uniqueness (U), redundancy (R), and synergy (S). We prove this on a minimal 3-way XOR structural causal model: faithful indices such as Shapley-Taylor return zero per pair, whereas projective indices such as Shapley Interaction spread the third-order effect into pair scalars that conflate the three mechanisms. We introduce Stochastic Hi-Fi, a post-hoc, retraining-free predictability decomposition that estimates per-feature U/R/S profiles by interventional masked inference. The estimator provides exact interventional semantics, finite-sample Monte Carlo bounds, strict variance reduction from coupled diamond sampling, and uniform finite-vocabulary convergence. Across tabular SCMs, Stochastic Hi-Fi recovers structure missed by scalar baselines (up to 411x larger interaction-magnitude recovery ratios). It also separates redundant and synergistic heads in the GPT-2 IOI circuit. On NIH ChestX-ray14, Stochastic Hi-Fi matches GradCAM on Pointing Game and improves substantially on Deletion AUC.
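
The role of the 3-way XOR example can be illustrated with a short, self-contained check. This sketch does not implement Shapley-Taylor, Shapley Interaction, or Stochastic Hi-Fi; it uses plain mutual information (a substitution made here for simplicity) and assumes uniform, independent binary inputs. It shows that no single feature or pair of features carries any information about the XOR output, while all three together determine it.

```python
from itertools import product
from math import log2


def mutual_information(feature_idx):
    """I(X_S; Y) in bits for Y = X1 xor X2 xor X3 with uniform binary inputs.

    `feature_idx` is a tuple of feature positions, e.g. (0, 1).
    """
    joint = {}
    for x in product([0, 1], repeat=3):
        y = x[0] ^ x[1] ^ x[2]
        key = (tuple(x[i] for i in feature_idx), y)
        joint[key] = joint.get(key, 0.0) + 1 / 8  # each input has probability 1/8

    # Marginals of the selected features and of the output.
    p_s, p_y = {}, {}
    for (s, y), p in joint.items():
        p_s[s] = p_s.get(s, 0.0) + p
        p_y[y] = p_y.get(y, 0.0) + p

    return sum(p * log2(p / (p_s[s] * p_y[y])) for (s, y), p in joint.items())


if __name__ == "__main__":
    for subset in [(0,), (0, 1), (1, 2), (0, 1, 2)]:
        print(subset, mutual_information(subset))
```

Any method that summarizes this model with one scalar per pair therefore has no pairwise signal to report, even though the three features jointly explain the output completely.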

2606.19486 2026-06-19 quant-ph cs.IT cs.LG math.IT 交叉投稿

Optimal Ansatz-free Hamiltonian Learning In Situ

无假设哈密顿量的最优原位学习

Taiqi Zhou, Weiyuan Gong

AI总结 提出一种无需控制、无需辅助比特的算法,仅用泡利乘积态制备和测量,以最优总演化时间学习无假设哈密顿量,适用于近中期量子实验。

Comments 51 pages, 2 figures

详情
AI中文摘要

描述控制量子系统的哈密顿量特征,是量子设备校准、信号传感和纠错的基本子程序。近期工作提出了协议,通过实时演化实现无假设哈密顿量的最优海森堡极限学习,无需完全指定相互作用结构。然而,这些协议依赖于带有交错探测和控制的深电路以及极短的时间分辨率,使其难以在近中期原位量子实验中实现。本文提出一种计算高效、无需控制、无需辅助比特的算法,仅使用泡利乘积态制备和测量,在总演化时间 $\Theta(\frac{\Lambda}{\epsilon^2}\log(\frac{\Lambda}{\epsilon}))$ 内学习无假设哈密顿量 $H$(满足 $||H||\leq\Lambda$)。该算法的演化时间成本对于任何无控制协议是最优的,因为我们进一步证明了 $\Omega(\frac{\Lambda}{\epsilon^2}\log(\frac{\Lambda}{\epsilon}))$ 的下界。技术上,我们的方法引入了一个随机采样框架,结合了带限核时间采样和用于哈密顿量结构学习的位移筛。特征探测时间分辨率仅依赖于 $\Lambda$ 而非 $\varepsilon$,这使得我们的协议在传感和校准的高精度场景中特别有吸引力。我们还表明,当哈密顿量在校准后是局域的时,该算法在存在状态制备和测量(SPAM)噪声的情况下保持相同的渐近总演化时间。我们的结果展示了实验友好型哈密顿量学习的基本成本,并为近中期量子平台的严格原位表征提供了实用途径。

英文摘要

Characterizing the features of a Hamiltonian that governs a quantum system serves as a fundamental subroutine of quantum device calibration, signal sensing, and error correction. Recently proposed protocols achieve the optimal Heisenberg-limited scaling for learning ansatz-free Hamiltonians from their real-time evolutions without fully specifying interaction structures. However, these protocols rely on both deep circuits with interleaving probes and control, and extremely short time resolution, making them difficult to implement on near- and intermediate-term in situ quantum experiments. In this work, we propose a computationally efficient, control-free, and ancilla-free algorithm that uses only Pauli product state preparation and measurement, and learns an ansatz-free Hamiltonian $H$ with $||H||\leq\Lambda$ in total evolution time of $\Theta(\frac{\Lambda}{\epsilon^2}\log(\frac{\Lambda}{\epsilon}))$. The evolution time cost of our algorithm is optimal for any control-free protocols as we further prove a lower bound of $\Omega(\frac{\Lambda}{\epsilon^2}\log(\frac{\Lambda}{\epsilon}))$. Technically, our method introduces a randomized-sampling framework that combines band-limited kernel-based time sampling with a displacement sieve for Hamiltonian structure learning. The characteristic probe time resolution depends only on $\Lambda$ instead of $\varepsilon$, which makes our protocol especially appealing in the high-precision regime for sensing and calibration applications. We also show that the algorithm maintains the same asymptotic total evolution time in the presence of state-preparation-and-measurement (SPAM) noise when the Hamiltonian is local after calibration. Our results demonstrate the fundamental cost of experimentally friendly Hamiltonian learning and provide a practical route to rigorous in situ characterization of near-term quantum platforms.

2606.19587 2026-06-19 stat.ML cs.LG 交叉投稿

A Solver-Free Training Method for Predict-then-Optimize

一种无求解器的预测后优化训练方法

Beichen Wan, Mo Liu

AI总结 提出一种基于测度变换的决策聚焦学习管道,通过无求解器代理损失实现预测后优化中预测模型的高效训练,理论保证Fisher一致性,训练时间降低数个数量级。

Comments Accepted by ICML 2026

详情
AI中文摘要

我们提出了一种可扩展的方法,用于在预测后优化范式中训练预测(机器学习)模型,其中模型输出作为后续线性优化任务的系数。直接最小化经验决策遗憾对于线性规划和组合优化是不可行的,因为决策映射是分段常数,且梯度几乎处处为零。虽然现有方法通过平滑微分过程来解决这一问题,但它们存在可扩展性问题,因为每次梯度评估都需要调用计算昂贵的求解器。为了解决这个问题,我们提出了一种基于测度变换原理的决策聚焦学习管道,该管道在训练期间产生一个完全无优化求解器的新代理损失。我们建立了理论保证,包括Fisher一致性和超额风险界。实验上,我们的方法在实现与最先进方法相当的决策质量的同时,将训练时间减少了数个数量级。

英文摘要

We propose a scalable method for training prediction (machine learning) models in the predict-then-optimize paradigm, where model outputs serve as coefficients for a subsequent linear optimization task. Directly minimizing the empirical decision regret is intractable for linear programming and combinatorial optimization since the decision mapping is piecewise constant, and the gradients are zero almost everywhere. While existing methods address this by smoothing the differentiation process, they suffer from scalability issues, since a computationally expensive solver call is required for every gradient evaluation. To address this, we propose a decision-focused learning pipeline based on a measure transformation principle, which yields a new surrogate loss that is completely optimization-solver-free during training. We establish theoretical guarantees, including Fisher consistency and excess risk bounds. Empirically, our method achieves decision quality competitive with state-of-the-art methods while reducing training time by orders of magnitude.
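
The claim that the decision mapping is piecewise constant, so the regret has zero gradient almost everywhere, can be seen on a two-item toy problem. Everything in this sketch is an assumption made for illustration (the feasible set, the costs, the helper names); it does not implement the paper's surrogate loss. The "solver" is just a minimum over two vertices.

```python
def objective(cost, vertex):
    # Linear objective cost . x evaluated at a feasible vertex.
    return sum(c * x for c, x in zip(cost, vertex))


def decide(cost, vertices):
    """Return the vertex minimizing the linear objective under `cost`."""
    return min(vertices, key=lambda v: objective(cost, v))


def regret(predicted, true, vertices):
    """Decision regret: true cost of the decision induced by the prediction,
    minus the true optimal cost."""
    chosen = decide(predicted, vertices)
    best = decide(true, vertices)
    return objective(true, chosen) - objective(true, best)


VERTICES = [(1, 0), (0, 1)]  # toy feasible set: pick exactly one of two items
TRUE_COST = (1.0, 2.0)

# Sweep the predicted cost of item 0 while item 1's prediction stays at 2.0.
SWEEP = [(c0, regret((c0, 2.0), TRUE_COST, VERTICES))
         for c0 in (0.5, 1.0, 1.5, 2.5, 3.0)]

if __name__ == "__main__":
    for c0, r in SWEEP:
        print(c0, r)
```

The regret stays flat while the prediction moves and jumps only when the chosen vertex changes, so a gradient taken through `regret` carries no training signal on either side of the jump.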

2606.19859 2026-06-19 cs.IT cs.LG math.IT math.PR math.ST stat.TH 交叉投稿

Doeblin Curves

Doeblin 曲线

Dongmin Lee, William Lu, Anuran Makur, Japneet Singh

AI总结 提出 Doeblin 曲线概念,量化马尔可夫核在不同散度和功率水平下的收缩行为,并应用于噪声迭代优化、噪声电路可靠计算和差分隐私等领域的更细粒度收缩分析。

Comments 42 pages, 2 figures

Journal ref IEEE Transactions on Information Theory, vol. 72, no. 6, pp. 3556-3596, June 2026

详情
AI中文摘要

近期关于 Doeblin 系数的研究揭示了它们作为 TV 距离的 Dobrushin 收缩系数的多路泛化的有用性,这与它们在马尔可夫链遍历性理论中的经典作用不同。然而,为了建立信息收缩的存在性,通常需要强条件,例如远离 0。基于最近提出的非线性信息收缩概念,我们旨在提出一种更细粒度的基于 Doeblin 的多路收缩行为刻画,即使对于 Doeblin 系数为 0 的信道,也能产生非平凡的收缩保证。为此,我们引入了 Doeblin 曲线的概念——一种非线性函数,它量化了马尔可夫核在特定散度和功率水平下对输入分布集合的收缩行为。在我们的分析过程中,我们发展了 Doeblin 系数的新变分刻画,提出了 Doeblin 曲线的若干性质,定义了功率约束 Doeblin 曲线的几个版本,并利用上述变分刻画推导了上下界。然后,我们将这些结果应用于不同领域,包括噪声迭代优化的泛化界、噪声电路可靠计算的误差界以及在线迭代算法的差分隐私保证。特别是,我们将这些领域的结果扩展到更广泛的领域或群体设置,利用 Doeblin 曲线揭示比 Doeblin 系数更细粒度的收缩现象。

英文摘要

Recent research on Doeblin coefficients has shed light on their usefulness as a multi-way generalization of the Dobrushin contraction coefficient for TV distance, in a separate vein from their classic role in the theory of Markov chain ergodicity. However, strong conditions, such as being bounded away from 0, are typically necessary for Doeblin coefficients to establish the existence of information contraction. Building on recently formulated concepts of nonlinear information contraction, we aim to propose a finer-grained Doeblin-based characterization of multi-way contraction behavior which yields non-vacuous contraction guarantees even for channels whose Doeblin coefficient is 0. To this end, we introduce the notion of a Doeblin curve -- a nonlinear function which quantifies the contraction behavior of a Markov kernel on collections of input distributions at specific levels of divergence and power. Through the course of our analysis, we develop a new variational characterization of Doeblin coefficients, present several properties of Doeblin curves, define several versions of power-constrained Doeblin curves, and derive upper and lower bounds using our aforementioned variational characterization. We then utilize these results in diverse areas, including generalization bounds for noisy iterative optimization, error bounds for reliable computation with noisy circuits, and differential privacy guarantees for online iterative algorithms. In particular, we extend results in these areas to broader domains or group settings, leveraging Doeblin curves to reveal finer-grained contraction phenomena than Doeblin coefficients.

2606.20062 2026-06-19 math.OC cs.LG math.PR 交叉投稿

Optimal Coarse Correlated Equilibria in Mean Field Games: Linear Programming and No-Regret Learning

平均场博弈中的最优粗相关均衡:线性规划与无遗憾学习

Luciano Campi, Federico Cannerozzi, Ioannis Tzouanas

AI总结 针对连续时间平均场博弈,提出最优粗相关均衡的线性规划刻画,并设计基于拉格朗日对偶的无遗憾学习算法,给出收敛速率。

Comments 55 pages, 3 figures

详情
AI中文摘要

我们引入了连续时间平均场博弈的最优粗相关均衡。粗相关均衡是一种随机推荐方案,任何玩家都无法通过忽略推荐并转向替代策略而获益。问题如下:一个协调者在所有平均场粗相关均衡中选择一个,以优化一个规定的性能准则,该准则可能不同于代表性玩家的目标。在问题公式化之后,我们开发了一个线性规划(LP)公式,证明了最优LP粗相关均衡的存在性,并将LP刻画与原始概率设定联系起来。基于这一刻画,我们设计了一个无遗憾原始-对偶算法,基于外部遗憾约束的等价拉格朗日公式,用于学习此类均衡。我们提供了学习算法的显式收敛速率,数值例子说明了该方法。

英文摘要

We introduce optimal coarse correlated equilibria for continuous-time mean field games. A coarse correlated equilibrium is a randomized recommendation scheme from which no player can gain by ignoring the recommendation and switching to an alternative strategy. The problem is as follows: a moderator selects, among all mean-field coarse correlated equilibria, one that optimizes a prescribed performance criterion, which may differ from the representative player's objective. After formulating the problem, we develop a linear programming (LP) formulation, prove the existence of optimal LP coarse correlated equilibria, and relate the LP characterization to the original probabilistic setting. Building on this characterization, we design a no-regret primal-dual algorithm, based on an equivalent Lagrangian formulation of the external-regret constraint, for learning such equilibria. We provide explicit convergence rates for the learning algorithm, and numerical examples illustrate the method.

2606.20082 2026-06-19 math.OC cs.DS cs.LG 交叉投稿

Beyond Averaging in John Ellipsoid Approximation: High-Accuracy Algorithms in the Leverage-Score Model

超越John椭球逼近中的平均化:杠杆分数模型中的高精度算法

Xiaoyu Li, Junwei Yu, Jiaojiao Jiang, Junbin Gao, Andi Han

AI总结 本文分离了John椭球逼近算法中的认证、识别和精度三种成本,证明精度依赖仅为双对数,并提出了加速方法和阻尼牛顿法,在杠杆分数模型中实现了高精度逼近。

详情
AI中文摘要

对称多面体 $P=\{\mathbf{x}\in\mathbb{R}^d:\|\mathbf{A}\mathbf{x}\|_\infty\le1\}$, $\mathbf{A}\in\mathbb{R}^{n\times d}$ 的 John 椭球由一系列杠杆分数算法计算,从 Cohen, Cousins, Lee 和 Yang (COLT 2019) 到其后续工作 [WY24, CLS+25],均在 $\Theta(\varepsilon^{-1}\log(n/d))$ 次迭代内达到 $(1+\varepsilon)$-逼近。我们将这一复杂度分离为现代算法混淆的三种成本(认证、识别和精度),并发现历史上的 $\varepsilon^{-1}$ 仅存在于第一种成本中。在等价的 D-最优设计形式 $\min_{\mathbf{p}\in\Delta_n}-\log\det(\sum_i p_i\mathbf{a}_i\mathbf{a}_i^\top)$ 中,杠杆分数预言机恰好是一阶预言机,而 $(1+\varepsilon)$-John 保证对应于 Frank-Wolfe 间隙 $g(\mathbf{p})\le\varepsilon d$;通过这一对应关系,成本得以分离。$\varepsilon^{-1}$ 是认证的产物:迭代点的均匀平均(该系列算法中使用的认证)的间隙恰好为 $\Theta(1/T)$,无论每次迭代多么廉价。相反,针对最后迭代点,同一预言机是快速的:热启动加速方法在 $\varepsilon$-无关的初始化 $C(\mathbf{A})$ 后,仅需 $C(\mathbf{A})+O(\sqrt{\kappa}\log(1/\varepsilon))$ 次查询即可达到保证;一旦最优面被识别,面问题成为无约束自和谐最小化,其 Hessian 可由预言机精确恢复,因此阻尼牛顿法仅需 $O(\log\log(1/\varepsilon))$ 步,总查询数为 $C(\mathbf{A})+O(d^2\log\log(1/\varepsilon))$。因此,在 $\varepsilon$-无关、条件依赖的初始化后,精度依赖是双对数的;开放问题在于剩余的识别成本(达到最优面的无条件界)和下界。精度并非障碍。

英文摘要

The John ellipsoid of a symmetric polytope $P=\{\mathbf{x}\in\mathbb{R}^d:\|\mathbf{A}\mathbf{x}\|_\infty\le1\}$, $\mathbf{A}\in\mathbb{R}^{n\times d}$, is computed by a long line of leverage-score algorithms, from Cohen, Cousins, Lee and Yang (COLT 2019) to its successors [WY24, CLS+25], all reaching a $(1+\varepsilon)$-approximation in $\Theta(\varepsilon^{-1}\log(n/d))$ iterations. We separate this complexity into three costs the modern line conflates (certification, identification, and accuracy) and locate the historical $\varepsilon^{-1}$ in the first alone. In the equivalent D-optimal-design form $\min_{\mathbf{p}\in\Delta_n}-\log\det(\sum_i p_i\mathbf{a}_i\mathbf{a}_i^\top)$, the leverage-score oracle is exactly the first-order oracle and the $(1+\varepsilon)$-John guarantee the Frank-Wolfe gap $g(\mathbf{p})\le\varepsilon d$; through this dictionary the costs come apart. The $\varepsilon^{-1}$ is a certification artifact: the uniform average of the iterates, the certificate used throughout the line, has gap exactly $\Theta(1/T)$, however cheap each iteration is made. Pointed instead at the last iterate the same oracle is fast: a warm-started accelerated method reaches the guarantee in $C(\mathbf{A})+O(\sqrt{\kappa}\log(1/\varepsilon))$ queries after an $\varepsilon$-independent setup $C(\mathbf{A})$, and once the optimal face is identified the facial problem is an unconstrained self-concordant minimization whose Hessian the oracle recovers exactly, so damped Newton needs only $O(\log\log(1/\varepsilon))$ steps, for a total of $C(\mathbf{A})+O(d^2\log\log(1/\varepsilon))$ queries. The accuracy dependence is thus doubly logarithmic after an $\varepsilon$-independent, condition-dependent setup; the open problem is the remaining identification cost (a condition-free bound on reaching the optimal face) and lower bounds. Accuracy is not the obstruction.
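
The dictionary between leverage scores and the Frank-Wolfe gap can be made concrete on a tiny instance. The sketch below is an illustration under assumptions, not the paper's algorithms: it fixes $d=2$ so the matrix inverse can be written by hand, uses the classic multiplicative fixed-point update $p_i \leftarrow p_i\,\sigma_i(\mathbf{p})/d$ as a stand-in for the leverage-score iteration, and picks three made-up rows. Since $\sum_i p_i\sigma_i(\mathbf{p}) = d$, the gap is $g(\mathbf{p}) = \max_i \sigma_i(\mathbf{p}) - d$.

```python
def leverage_scores(rows, p):
    """sigma_i = a_i^T M^{-1} a_i with M = sum_i p_i a_i a_i^T (2x2 case only)."""
    m00 = sum(w * a[0] * a[0] for w, a in zip(p, rows))
    m01 = sum(w * a[0] * a[1] for w, a in zip(p, rows))
    m11 = sum(w * a[1] * a[1] for w, a in zip(p, rows))
    det = m00 * m11 - m01 * m01
    # Closed-form inverse of a symmetric 2x2 matrix applied to each row.
    return [(m11 * a[0] ** 2 - 2 * m01 * a[0] * a[1] + m00 * a[1] ** 2) / det
            for a in rows]


def fw_gap(rows, p, d=2):
    # Frank-Wolfe gap of -log det at p: max_i sigma_i - sum_i p_i sigma_i.
    return max(leverage_scores(rows, p)) - d


def multiplicative_update(rows, p, d=2):
    # One oracle call: rescale each weight by its leverage score over d.
    return [w * s / d for w, s in zip(p, leverage_scores(rows, p))]


ROWS = [(1.0, 0.0), (0.0, 1.0), (0.5, 0.5)]  # hypothetical rows of A
T = 50
iterates = [[1 / 3, 1 / 3, 1 / 3]]
for _ in range(T):
    iterates.append(multiplicative_update(ROWS, iterates[-1]))

last = iterates[-1]
average = [sum(col) / len(iterates) for col in zip(*iterates)]

gap_start = fw_gap(ROWS, iterates[0])
gap_last = fw_gap(ROWS, last)
gap_average = fw_gap(ROWS, average)

if __name__ == "__main__":
    print(gap_start, gap_last, gap_average)
```

On this toy instance the last iterate's gap is far smaller than the gap of the uniform average of the iterates, which is the kind of separation between certificate and iterate that the abstract describes; a single small example is of course not evidence for the general rates.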

2606.20299 2026-06-19 stat.ML cs.LG hep-ph physics.data-an 交叉投稿

Statistical Properties of Training & Generalization

训练与泛化的统计特性

Itay Lavie, Noam Levi, Yonatan Kahn

AI总结 从物理学角度研究深度学习的关键特征和意外现象,回顾神经缩放定律及其与物理问题中约束和归纳偏置的相互作用。

Comments 32 pages, 3 figures. Part of the VERaiPHY initiative

详情
AI中文摘要

深度学习成功规避了经典统计学的众多直觉,在多个现实任务中取得了前所未有的性能。本文从物理学角度研究深度学习的关键特征和意外现象,注意指出并尽可能证明构建深度学习模型时固有的多种选择。特别地,我们回顾了神经缩放定律的现象,并讨论了它们与在物理问题中应用机器学习时可能存在的约束和归纳偏置之间的相互作用。

英文摘要

Deep learning has managed to evade numerous intuitions from classical statistics to achieve unprecedented performance on a number of real-world tasks. In this article, we investigate the key features and surprises of deep learning from a physics-informed perspective, taking care to point out and justify where possible the many choices inherent in constructing a deep learning model. In particular, we review the phenomenon of neural scaling laws and discuss their interplay with the constraints and inductive biases which may be present when applying machine learning to problems in physics.

6. 高效学习、压缩与部署 3 篇

2606.19354 2026-06-19 cs.CL cs.LG 交叉投稿

Granularity-Regulated Adaptive Computational Efficiency for Optimal Verification in Test-Time Scaling

粒度调控的自适应计算效率:测试时扩展中的最优验证

Ardit Krasniqi, Luan Vejsiu, Elira Dervishi

发表机构 * European University of Tirana(欧洲地拉那大学)

AI总结 提出GRACE理论框架,将验证粒度建模为问题难度、验证器准确率和计算预算的函数,证明存在相变:细粒度验证在计算预算大或问题难时占优,粗粒度验证在低预算简单问题时更优,自适应策略可达到计算-性能帕累托前沿。

详情
AI中文摘要

测试时扩展(TTS)已成为一种强大的范式,通过在推理时投入额外计算来提升大语言模型(LLMs)的推理性能。TTS的核心组件是验证器,它选择或评分候选解以引导搜索过程。虽然先前工作已探索验证的益处,但一个基本问题仍未充分探索:在给定计算预算下,最优验证粒度是什么?粗粒度的结果奖励模型(ORMs)和细粒度的过程奖励模型(PRMs)代表两个极端,但两者单独均无法在所有场景下实现计算最优性。本文建立了一个统一的理论框架,称为GRACE(粒度调控的自适应计算效率),该框架将最优验证粒度刻画为问题难度、验证器准确率和计算预算的显式函数。我们证明存在一个相变:当计算预算大或问题难时,细粒度验证占优;而在低预算、简单问题场景下,粗粒度验证更受青睐。我们的理论将Best-of-N、束搜索和步骤级MCTS统一在一个帕累托最优框架内,并激发了一种自适应粒度策略,该策略可证明达到计算-性能帕累托前沿。在MATH-500、GSM8K和AIME基准上的实验结果证实了所有四个理论主张,在匹配计算量下,我们的自适应策略相比固定粒度基线准确率提升高达3.1%。

英文摘要

Test-time scaling (TTS) has emerged as a powerful paradigm for improving the reasoning performance of large language models (LLMs) by investing additional compute at inference time. A central component of TTS is the verifier, which selects or scores candidate solutions to guide the search process. While prior work has explored the benefit of verification, a fundamental question remains underexplored: what is the optimal granularity of verification under a given compute budget? Coarse-grained outcome reward models (ORMs) and fine-grained process reward models (PRMs) represent two extremes, yet neither alone achieves compute-optimality across all regimes. In this paper, we establish a unified theoretical framework, called GRACE (Granularity-Regulated Adaptive Computational Efficiency), that characterizes the optimal verification granularity as an explicit function of problem difficulty, verifier accuracy, and compute budget. We prove that there exists a phase transition: fine-grained verification dominates when either the compute budget is large or the problem is hard, whereas coarse-grained verification is preferred in the low-budget, easy-problem regime. Our theory unifies Best-of-$N$, beam search, and step-level MCTS within a single Pareto-optimality framework, and motivates an adaptive granularity strategy that provably achieves the compute-performance Pareto frontier. Empirical results on MATH-500, GSM8K, and AIME benchmarks corroborate all four theoretical claims, with our adaptive strategy outperforming fixed-granularity baselines by up to 3.1% accuracy at matched compute.

2606.19799 2026-06-19 cs.SE cs.LG 交叉投稿

The Hidden Environmental Cost of Poor Coding Practices in TensorFlow and Keras Applications: A Study on Resource Leaks and Carbon Emissions

TensorFlow和Keras应用中不良编码实践的隐藏环境成本:资源泄漏与碳排放研究

Bashar Abdallah, Gustavo Santos, Rola Al Bataineh, Alain Abran, Mohammad Hamdaqa

AI总结 研究TensorFlow/Keras中两种资源泄漏气味(IMR和UTR)对能耗和碳排放的影响,实验表明两者分别增加约32%和46%的电力消耗,证明资源泄漏显著降低ML能效并增加环境负担。

详情
AI中文摘要

效率和可持续性是机器学习(ML)应用开发和部署中的关键考量。在影响可持续性的因素中,ML代码中的资源泄漏可能引入隐藏的低效率,从而增加能源消耗和CO2排放。尽管如此,量化其环境影响的实证证据仍然有限。这篇新兴结果论文对两种常见的资源泄漏气味,即不当模型重用(IMR)和未释放张量引用(UTR),及其对TensorFlow和Keras工作负载中能源消耗和CO2排放的影响进行了初步实证研究。通过执行相同的训练任务,并与无气味基线进行比较,对每种气味进行了受控实验。我们的初步结果表明,两种气味都持续增加了估计的用电量和碳排放。IMR和UTR分别使电力消耗增加约32%和46%,CO2排放也成比例增加。配对统计检验表明这些差异是系统性的且具有统计显著性,提供了初步的实证证据,表明资源泄漏气味可能降低ML的能效和环境可持续性。这些发现表明,资源泄漏气味对软件质量和可持续性构成可衡量的风险,强调了将资源生命周期管理和能效考虑纳入ML开发的重要性。

英文摘要

Efficiency and sustainability are critical considerations in the development and deployment of machine learning (ML) applications. Among the factors influencing sustainability, resource leaks in ML code can introduce hidden inefficiencies that elevate energy consumption and CO2 emissions. Despite this, empirical evidence quantifying their environmental impact remains limited. This emerging results paper presents an initial empirical investigation of two common resource-leak smells, namely Improper Model Reuse (IMR) and Unreleased Tensor References (UTR), and their impact on energy consumption and CO2 emissions in TensorFlow and Keras workloads. Controlled experiments were conducted for each smell by executing identical training tasks while comparing against a smell-free baseline. Our preliminary results show that both smells consistently increase estimated electricity usage and carbon emissions. IMR and UTR increased electricity consumption by approximately 32% and 46%, respectively, with proportional increases in CO2 emissions. Paired statistical tests indicate that these differences are systematic and statistically significant, providing initial empirical evidence that resource-leak smells may degrade ML energy efficiency and environmental sustainability. These findings suggest that resource-leak smells pose measurable risks to both software quality and sustainability, emphasizing the importance of integrating resource-lifecycle management and energy-efficiency considerations into ML development.
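The abstract reports that CO2 emissions rise in proportion to electricity use. A minimal sketch of why that follows, assuming emissions are estimated as energy multiplied by a fixed grid carbon intensity (a common estimation approach; the paper's exact tooling is not stated here), is shown below. The function names and the numbers in the usage note are hypothetical.

```python
def estimated_co2_kg(energy_kwh: float, carbon_intensity_kg_per_kwh: float) -> float:
    """Estimate emissions as energy times a fixed grid carbon intensity (assumption)."""
    return energy_kwh * carbon_intensity_kg_per_kwh


def relative_increase(baseline: float, measured: float) -> float:
    """Fractional increase of a measurement over its baseline."""
    return (measured - baseline) / baseline
```

With a hypothetical 2.0 kWh baseline and an intensity of 0.5 kg/kWh, a run that uses 32% more electricity also yields roughly 32% more estimated CO2, because the intensity factor cancels in the ratio.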

2606.19989 2026-06-19 cs.DC cs.LG 交叉投稿

Online Dynamic Batching with Formal Guarantees for LLM Training

面向LLM训练的具有形式保证的在线动态批处理

Dian Li, Zekun Wang, Yaoru Wang, Jiahong Yan

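AI summary Proposes Online Dynamic Batching (ODB), a DataLoader-side system that defers batch construction until each sample's true cost is observable, addressing the cost blindness of offline batch samplers. It reports 1.58-4.43x throughput gains and formal guarantees of deadlock-free bounded termination.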

Comments 29 pages, 3 figures, 21 tables

详情
AI中文摘要

现代LLM训练打破了离线批采样器背后的一个核心假设:样本的真实训练成本只有在预处理、增强、模板化、分词和多模态视觉标记扩展之后才能观察到。除非为依赖于预处理和增强的长度缓存付费,否则批构建对于决定填充、内存使用和GPU饱和度的量是盲目的。我们引入了在线动态批处理(ODB),这是一个数据加载器侧的即插即用系统,它将批形成移动到这一精确可观测性点,同时保持DDP步骤对齐。我们将这一同步需求形式化为分布式组对齐问题,并证明了在默认加入模式身份覆盖和可选非加入样本配额封闭下的无死锁有界终止。ODB不需要修改模型、优化器或注意力核,并以轻量级训练器适配器的形式发布为online-dynamic-batching。在UltraChat/LLaVA/ShareGPT4o上对公开的2B/8B Qwen3-VL进行的实验中,与固定批Standard相比,ODB在单节点全量微调/LoRA上实现了1.58-2.51倍的逐字样本吞吐量提升,在两节点全量微调上实现了1.71-3.78倍提升,质量与Standard相当;生产环境MM-Mix达到4.43倍。与GMT/BMT离线令牌预算预言机相比,ODB在UltraChat/LLaVA上差距在15%以内,在高变异系数的ShareGPT4o上更快:单节点全量微调/LoRA为2.24-2.39倍,两节点全量微调为3.06-3.69倍。总之,ODB占据了高异质性LLM微调的在线/即插即用领域:在质量与Standard相当的情况下实现大幅吞吐量提升,提供形式化的DGAP保证,无需长度缓存预计算或核重写。

英文摘要

Modern LLM training breaks a core assumption behind offline batch samplers: the true training cost of a sample is only observable after preprocessing, augmentation, templating, tokenization, and multimodal visual-token expansion. Unless one pays for a preprocessing- and augmentation-dependent length cache, batch construction is therefore blind to the quantity that determines padding, memory use, and GPU saturation. We introduce Online Dynamic Batching (ODB), a DataLoader-side drop-in system that moves batch formation to this point of accurate observability while preserving DDP step alignment. We formalize this synchronization requirement as the Distributed Group Alignment Problem and prove deadlock-free bounded termination with default join-mode identity coverage and opt-in non-join sample-quota closure. ODB requires no model, optimizer, or attention-kernel changes and is released as online-dynamic-batching with lightweight trainer adapters. Across public 2B/8B Qwen3-VL runs on UltraChat/LLaVA/ShareGPT4o, ODB improves literal emitted-sample throughput vs. fixed-batch Standard by 1.58-2.51x on single-node Full FT/LoRA and 1.71-3.78x on two-node Full FT, with Standard-comparable quality; production MM-Mix reaches 4.43x. Against GMT/BMT offline token-budget oracles, ODB is within 15% on UltraChat/LLaVA and faster on high-CV ShareGPT4o: 2.24-2.39x single-node Full FT/LoRA and 3.06-3.69x two-node Full FT. Together, ODB occupies the online/drop-in regime for high-heterogeneity LLM fine-tuning: large throughput gains at Standard-comparable quality, formal DGAP guarantees, and no length-cache precompute or kernel rewrites.
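To make the idea of batching at the point of observability concrete, here is a minimal single-process sketch that forms batches only after preprocessing, under a padded-token budget. It is not the ODB implementation: the function name, the `max_padded_tokens` parameter, and the padded-cost rule (longest sequence times batch size) are assumptions, and the DDP step alignment and DGAP guarantees described in the abstract are not modeled.

```python
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def online_token_budget_batches(
    samples: Iterable[T],
    preprocess: Callable[[T], List[int]],
    max_padded_tokens: int,
) -> Iterator[List[List[int]]]:
    """Group samples into batches after preprocessing, when true length is known."""
    batch: List[List[int]] = []
    longest = 0
    for raw in samples:
        tokens = preprocess(raw)  # the real cost becomes observable only here
        new_longest = max(longest, len(tokens))
        # Padded cost if this sample joined the current batch.
        if batch and new_longest * (len(batch) + 1) > max_padded_tokens:
            yield batch
            batch, new_longest = [], len(tokens)
        batch.append(tokens)
        longest = new_longest
    if batch:
        yield batch  # flush the final partial batch
```

A sample that exceeds the budget on its own is still emitted as a single-item batch in this sketch; a real system would need an explicit policy for that case.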

7. Federated Learning, Privacy, and Security (4 papers)

2606.19535 2026-06-19 cs.CR cs.LG 交叉投稿

FloatDoor: Platform-Triggered Backdoors in LLMs

FloatDoor: 大语言模型中的平台触发后门

Nils Loose, Jonas Sander, Felix Mächtle, Thomas Eisenbarth

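AI summary Proposes FloatDoor, the first input-independent, platform-triggered backdoor attack. It exploits cross-platform differences in floating-point arithmetic and uses two lightweight LoRA adapters to trigger malicious behavior on a target platform while preserving normal model utility.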

详情
AI中文摘要

大型语言模型(LLM)越来越多地部署在软件工程等敏感环境中,其输出直接影响下游工件。最近的研究表明,由于非结合浮点运算和不同的内核实现,同一模型在不同部署平台上可能产生可测量的不同输出。我们研究了这种平台依赖可变性的安全影响,并揭示了LLM部署中一种新的攻击面。我们提出了FloatDoor,这是首个针对生成式LLM的输入无关、平台触发的后门攻击。被攻陷的模型在目标平台上表现出对手选择的行为,而在其他平台上则表现正常。FloatDoor通过两个轻量级LoRA适配器实现:一个放大平台间数值差异,另一个将由此产生的平台签名绑定到恶意下游任务,同时保持模型整体效用基本不变。FloatDoor利用了模型审计和部署之间的显著检查时间与使用时间差距。我们在Qwen3-4B上展示了FloatDoor,涵盖了广泛的部署目标,包括NVIDIA GPU、Google TPU、AWS Graviton和阿里巴巴Yitian-710。作为最终案例研究,我们展示了FloatDoor能够在选定的目标平台上可靠地诱导可利用的代码漏洞。我们的结果建立了一类新的LLM部署攻击,并强调了在敏感的LLM驱动应用中建立可信模型供应链的迫切需求。

英文摘要

Large language models (LLMs) are increasingly deployed in sensitive settings such as software engineering, where their outputs directly shape downstream artifacts. Recent work has shown that an identical model can produce measurably different outputs depending on the deployment platform, a consequence of non-associative floating-point arithmetic and divergent kernel implementations. We study the security implications of this platform-dependent variability and uncover a novel attack surface on LLM deployments. We introduce FloatDoor, the first input-independent, platform-triggered backdoor attack against generative LLMs. The compromised model exhibits adversary-chosen behavior when served on a target platform and is otherwise benign. FloatDoor is realized through two lightweight LoRA adapters, one that amplifies inter-platform numerical divergence and one that binds the resulting platform signature to a malicious downstream task, while leaving aggregate model utility largely intact. FloatDoor exploits a pronounced time-of-check, time-of-use gap between model auditing and serving. We demonstrate FloatDoor on Qwen3-4B across a broad range of deployment targets, including NVIDIA GPUs, Google TPUs, AWS Graviton, and Alibaba Yitian-710. As a final case study, we show that FloatDoor reliably induces exploitable code vulnerabilities on a chosen target platform. Our results establish a new class of attacks on LLM deployments and underscore the pressing need for trusted model supply chains in sensitive, LLM-powered applications.
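The platform dependence mentioned in this abstract rests on floating-point addition being non-associative: the same numbers summed in a different order can give a different result. The small Python sketch below shows this with two reduction orders; the function names are illustrative, and the example says nothing about how FloatDoor itself amplifies or exploits the effect.

```python
from typing import Sequence


def reduce_left(values: Sequence[float]) -> float:
    """Sum as ((v0 + v1) + v2) + ..., the order of a simple sequential loop."""
    total = values[0]
    for v in values[1:]:
        total = total + v
    return total


def reduce_right(values: Sequence[float]) -> float:
    """Sum as v0 + (v1 + (v2 + ...)), a different but mathematically equal order."""
    total = values[-1]
    for v in reversed(values[:-1]):
        total = v + total
    return total


values = [0.1, 0.2, 0.3]
orders_disagree = reduce_left(values) != reduce_right(values)
```

Kernels on different hardware may choose different reduction orders, which is one way an identical model can produce slightly different numbers across platforms.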

2606.19643 2026-06-19 stat.ML cs.LG 交叉投稿

Variational Consensus Monte Carlo for Bayesian Mixture

变分共识蒙特卡洛用于贝叶斯混合模型

Julie Fendler, Francesca L. Crowe, Tom Marshall, Sylvia Richardson, Paul D. W. Kirk

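AI summary Extends variational Consensus Monte Carlo to over-fitted Bayesian mixture models. With novel cluster-matching algorithms and aggregation strategies, it infers the number of clusters and all parameters in a federated learning setting, and is validated on simulated and real electronic health record data.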

详情
AI中文摘要

受健康数据的隐私、敏感性和共享限制的驱动,我们提出了一个在联邦学习设置下(即数据无法在计算节点之间完全共享或汇集)对贝叶斯混合模型进行推断的全面流程。我们采用共识蒙特卡洛(CMC)方法,在每个数据孤岛内独立运行MCMC算法以估计局部后验分布,然后聚合这些分布以近似完整数据的后验。Rabinovich, Angelino 和 Jordan (2015) [1] 的变分CMC方法将聚合步骤视为变分推断问题,但他们应用于混合模型时假设聚类数和关键混合参数已知。我们的主要方法贡献是:(i) 将变分CMC扩展到过拟合贝叶斯混合模型,该模型推断聚类数和所有模型参数,无需共轭性;(ii) 适用于跨孤岛设置的新颖聚类匹配算法,其中并非每个聚类都出现在每个局部数据集中;(iii) 针对聚合步骤的多种推断策略,匹配不同的联邦学习约束;以及 (iv) 在实践中选择这些策略的指南。一项全面的模拟研究验证了该框架,并允许我们与最先进的联邦学习替代方法进行比较。值得注意的是,我们表明当局部数据集的组成反映了数据中的底层聚类结构时,我们的方法可以比应用于汇集数据的标准MCMC更准确地恢复小聚类。我们在大规模电子健康记录数据上展示了该框架,识别了英国老年人群中的多发病模式。

英文摘要

Motivated by the privacy, sensitivity and sharing limitations of health data, we present a comprehensive pipeline for inference of Bayesian mixture models within a federated learning setting, i.e. when data cannot be fully shared or pooled across compute nodes. We adopt a Consensus Monte Carlo (CMC) approach, in which an MCMC algorithm is run independently within each data silo to estimate local posterior distributions, which are then aggregated to approximate the posterior over the full data. The variational CMC approach of Rabinovich, Angelino and Jordan (2015) [1] frames the aggregation step as a variational inference problem, but their application to mixtures assumes the number of clusters and key mixture parameters to be known. Our main methodological contributions are: (i) an extension of variational CMC to over-fitted Bayesian mixture models that infer the number of clusters and all model parameters, without requiring conjugacy; (ii) novel cluster-matching algorithms suitable for cross-silo settings in which not every cluster appears in each local dataset; (iii) a number of inference strategies for the aggregation step, matched to different federated learning constraints; and (iv) guidelines for choosing among these in practice. A comprehensive simulation study validates the framework and allows us to compare to state-of-the-art federated learning alternatives. Notably, we show that when the composition of local datasets reflects the underlying clustering structure in the data, our approach can recover small clusters with greater accuracy than standard MCMC applied to the pooled data. We illustrate the framework on large-scale electronic health record data, identifying multi-morbidity patterns in a British geriatric population.
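For readers unfamiliar with the aggregation step, the sketch below shows the precision-weighted averaging rule from the original Consensus Monte Carlo proposal for a single scalar parameter. This is background only: it is not the variational aggregation or the cluster matching contributed by this paper, and the function name is hypothetical.

```python
from statistics import variance
from typing import List, Sequence


def consensus_draws(shard_draws: Sequence[Sequence[float]]) -> List[float]:
    """Combine per-silo MCMC draws of one scalar parameter into consensus draws.

    Each silo is weighted by the inverse sample variance of its own draws, and
    the g-th consensus draw is the weighted average of the g-th draw per silo.
    """
    weights = [1.0 / variance(draws) for draws in shard_draws]
    total_weight = sum(weights)
    n_draws = min(len(draws) for draws in shard_draws)
    return [
        sum(w * draws[g] for w, draws in zip(weights, shard_draws)) / total_weight
        for g in range(n_draws)
    ]
```

Only posterior draws leave each silo in this scheme, never raw records, which is what makes the approach compatible with the data-sharing constraints the abstract describes.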

2606.20344 2026-06-19 quant-ph cs.DC cs.LG 交叉投稿

Quantum ring all-reduce: communication and privacy advantages for distributed learning

量子环全归约:分布式学习的通信与隐私优势

María Gragera Garcés, Lirandë Pira

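AI summary Proposes a quantum ring all-reduce protocol that halves per-link online communication using pre-shared entanglement and superdense coding, and achieves information-theoretically secure, composable ε-secure aggregation via verified entanglement, giving both communication and privacy advantages.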

Comments 23 pages, 1 figure

详情
AI中文摘要

机器学习模型已扩展到前所未有的规模,使得跨分布式设备的训练成为该领域的事实标准。在这项工作中,我们探讨量子通信如何使分布式训练在通信效率和信息论隐私方面都更具优势,适用于经典和量子学习模型。环全归约是大规模分布式训练的基础通信原语。我们提出一种量子版本,通过预共享纠缠和超密编码,将每链路在线通信量减少一个可证明最优的因子二,且无需改变学习模型或梯度计算。除了带宽优势,该原语还能实现任何经典协议在信息论上不可能实现的隐私保证,通过验证纠缠以GHZ副本的2倍开销实现可组合的ε-安全聚合。我们的混合量子-经典通信架构为大规模分布式训练同时带来通信和安全优势,无论学习本身是量子还是经典。最后,我们描述了在带宽约束下服务器到客户端通信中梯度冲突检测的量子优势,该设置出现在环全归约完成后,当完整梯度广播到外部客户端不可行时。该问题的两个变体呈现出不同的分离。对于基于间隔的对齐测试(\textsc{GapIP}_{\tau}),量子优势在间隔参数上是二次的:\widetilde{O}({\tau}^{-1}\log P) 量子比特对比 \widetilde{O}(\min({\tau}^{-2},P)) 比特。对于针对私有参数匹配的符号一致性审计(\textsc{TieAudit}_{\epsilon}),优势表现为通信复杂度的指数级分离:\Omega(\sqrt{P}) 比特,而 O({\epsilon}^{-2}\log P) 量子比特就足够了。

英文摘要

Machine learning models have scaled to unprecedented sizes, making training across distributed devices the de facto standard in the field. In this work, we explore how quantum communications can make distributed training both more communication-efficient and information-theoretically private, for both classical and quantum learning models. Ring all-reduce is the foundational communication primitive for large-scale distributed training. We present a quantum version that reduces per-link online communication by a provably optimal factor of two using pre-shared entanglement and superdense coding, without requiring the learning model or gradient computation to change. Beyond bandwidth, the primitive enables privacy guarantees that are information-theoretically impossible for any classical protocol, achieving composable ε-secure aggregation, via verified entanglement, at a 2x overhead in GHZ copies. Our hybrid quantum-classical communication architecture yields simultaneous communication and security advantages for large-scale distributed training, regardless of whether the learning itself is quantum or classical. Finally, we characterise quantum advantages in gradient conflict detection for server-to-client communication under bandwidth constraints, a setting that arises after ring all-reduce is completed, when full gradient broadcast to external clients is infeasible. Two variants of the problem admit different separations. For margin-based alignment testing ($\textsc{GapIP}_\tau$), the quantum advantage is quadratic in the margin parameter: $\widetilde{O}(\tau^{-1}\log P)$ qubits versus $\widetilde{O}(\min(\tau^{-2},P))$ bits. For sign-consistency auditing against a private parameter matching ($\textsc{TieAudit}_\epsilon$), the advantage represents an exponential separation in communication complexity: $\Omega(\sqrt{P})$ bits whereas $O(\epsilon^{-2}\log P)$ qubits suffice.
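As background for the factor-of-two claim, the sketch below simulates the classical ring all-reduce (reduce-scatter followed by all-gather) and counts how many chunks cross each link. It then applies the textbook fact that superdense coding carries two classical bits per transmitted qubit when an entangled pair is pre-shared. The function names and the one-number-per-chunk simplification are assumptions; this is an idealized count, not the paper's protocol.

```python
from typing import List, Tuple


def ring_all_reduce(vectors: List[List[float]]) -> Tuple[List[List[float]], int]:
    """Sum n vectors of length n over a ring; return results and chunks sent per link."""
    n = len(vectors)
    data = [list(v) for v in vectors]
    sent_per_link = 0
    # Reduce-scatter: after n - 1 steps, node i holds the full sum of chunk (i + 1) % n.
    for step in range(n - 1):
        msgs = [((i - step) % n, data[i][(i - step) % n]) for i in range(n)]
        for i, (chunk, value) in enumerate(msgs):
            data[(i + 1) % n][chunk] += value
        sent_per_link += 1
    # All-gather: circulate the completed chunks until every node has all of them.
    for step in range(n - 1):
        msgs = [((i + 1 - step) % n, data[i][(i + 1 - step) % n]) for i in range(n)]
        for i, (chunk, value) in enumerate(msgs):
            data[(i + 1) % n][chunk] = value
        sent_per_link += 1
    return data, sent_per_link


def online_qubits_with_superdense_coding(classical_bits: int) -> int:
    """Idealized count: each qubit conveys two classical bits given a pre-shared pair."""
    return (classical_bits + 1) // 2
```

In this simulation each link carries 2(n - 1) chunks, so a link that would move B classical bits online needs about B/2 qubits under ideal superdense coding, with the entanglement distributed ahead of time.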

2606.20520 2026-06-19 cs.CR cs.AI cs.DC cs.LG 交叉投稿

Sovereign Execution Brokers: Enforcing Certificate-Bound Authority in Agentic Control Planes

主权执行代理:在智能体控制平面中强制执行证书绑定权限

Jun He, Deying Yu

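AI summary Addresses the lack of a mandatory authority check when autonomous agents mutate production environments. Proposes the Sovereign Execution Broker (SEB), which enforces authority at runtime through certificate verification, state checks, and scoped identities, and evaluates its security and performance on AWS and Kubernetes.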

Comments 19 pages, 6 figures, 10 tables

详情
AI中文摘要

自主代理越来越多地连接到云、部署和数据控制工作流,但生产环境的变更权限不应存在于非确定性推理过程中。现有的访问控制机制授权身份,而保证层认证提议的操作;两者单独都无法在变更时刻提供对认证权限的强制执行点。本文介绍了主权执行代理(SEB),一种用于证书绑定智能体基础设施的运行时强制边界。SEB消耗由主权保证边界(SAB)颁发的证书,验证请求的变更与认证的执行合约匹配,检查有效期窗口、策略时期、撤销时期和实时状态漂移,铸造范围执行身份,调用基础设施API,并记录签名的决策和结果记录。通过分离提议、准入和执行,SEB将认证权限转化为短暂的、可撤销的、可审计的运行时能力,前提是生产变更API拒绝非代理身份。我们展示了SEB执行模型、证书和重放验证谓词、范围身份语义、绕过预防部署模式、失败行为以及一个具体的原型实现。我们在AWS和Kubernetes集群上评估了原型,测量了延迟开销、撤销传播、漂移检测以及故障注入下的安全性。

英文摘要

Autonomous agents are increasingly connected to cloud, deployment, and data-control workflows, but production mutation authority should not reside inside non-deterministic reasoning processes. Existing access-control mechanisms authorize identities, while assurance layers certify proposed actions; neither alone provides a mandatory enforcement point for certified authority at the moment of mutation. This paper introduces the Sovereign Execution Broker (SEB), a runtime enforcement boundary for certificate-bound agentic infrastructure. SEB consumes certificates issued by the Sovereign Assurance Boundary (SAB), verifies that the requested mutation matches the certified execution contract, checks validity windows, policy epochs, revocation epochs, and live-state drift, mints scoped execution identity, invokes infrastructure APIs, and records signed decision and outcome records. By separating proposal, admission, and execution, SEB turns certified authority into a short-lived, revocable, auditable runtime capability, provided that production mutation APIs reject non-broker identities. We present the SEB execution model, certificate and replay-verification predicates, scoped identity semantics, bypass-prevention deployment patterns, failure behavior, and a concrete prototype implementation. We evaluate the prototype on AWS and Kubernetes clusters, measuring latency overheads, revocation propagation, drift detection, and security under fault injection.
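The checks listed in this abstract (contract match, validity window, policy epoch, revocation epoch, live-state drift) can be pictured as a single admission predicate. The sketch below is one hypothetical reading: the `Certificate` fields, the digest strings, and the rule that epochs must match exactly are all assumptions, and signature verification, identity minting, and audit logging are omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Certificate:
    """Hypothetical certificate fields; the paper's actual schema is not given here."""
    contract_digest: str   # digest of the certified execution contract
    not_before: float      # start of the validity window
    not_after: float       # end of the validity window
    policy_epoch: int      # policy version the certificate was issued under
    revocation_epoch: int  # revocation counter observed at issuance
    state_digest: str      # digest of the live state the certificate assumed


def admit(
    cert: Certificate,
    request_digest: str,
    now: float,
    current_policy_epoch: int,
    current_revocation_epoch: int,
    live_state_digest: str,
) -> bool:
    """Admit a mutation only if every check passes (fail closed otherwise)."""
    return (
        cert.contract_digest == request_digest
        and cert.not_before <= now <= cert.not_after
        and cert.policy_epoch == current_policy_epoch
        and cert.revocation_epoch == current_revocation_epoch
        and cert.state_digest == live_state_digest
    )
```

Keeping the predicate a pure function of the certificate and the observed state makes each decision easy to log and replay, which fits the abstract's emphasis on auditable records.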

8. Robustness, Uncertainty, and Trustworthy Learning (4 papers)

2606.19353 2026-06-19 cs.CL cs.LG 交叉投稿

Quantifying Aleatoric Uncertainty of In-Context Learning for Robust Measure of LLM Prediction Confidence

量化上下文学习中的偶然不确定性以稳健衡量LLM预测置信度

Jinseok Chung, Minkyoung Song, Hyunji Jung, Namhoon Lee

Affiliations * POSTECH

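AI summary Targets the prompt sensitivity of in-context learning (ICL) predictions. Introduces self-function vectors, grounded in Bayesian views and mechanistic interpretability, to estimate aleatoric uncertainty directly, and designs a rigorous evaluation protocol. Results on synthetic and real-world datasets support the method's reliability and its usefulness in applications such as hallucination detection.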

Comments Accepted to ACL 2026

详情
AI中文摘要

上下文学习(ICL)使LLM能够从少量示例中适应新任务,但其可靠性仍存疑虑:预测对提示设计和模型理解上下文的能力高度敏感,使得失败源于数据特性还是模型限制难以区分。不确定性分解——将偶然不确定性从认知不确定性中分离——在此场景中尤为关键,然而现有方法针对标准生成任务设计,未能捕捉ICL的独特动态。为解决此问题,我们引入基于贝叶斯观点和ICL机制可解释性的自函数向量概念。这些向量利用模型内部表示来建模上下文提示中学习的潜在概念,从而在贝叶斯框架内直接估计偶然不确定性,并规避了对脆弱的输入或解码操作的依赖。鉴于缺乏既定基准和合适的评估协议,我们还提出了首个严格的评估协议,其中数据以受控方式被操纵,以便精确量化偶然不确定性并将其与认知不确定性分离。借助这一新的评估框架(最初基于合成任务进行概念开发,随后扩展到真实世界数据集),我们展示了所提出的方法比现有替代方法更可靠地衡量LLM在ICL下做出的预测的不确定性。此外,我们展示了它可作为可信相关应用(如幻觉检测)的实用工具。我们的发现为将不确定性的量化观点与模型行为的机制理解联系起来开辟了新方向。

英文摘要

In-Context Learning (ICL) allows LLMs to adapt to new tasks from a few demonstrations, but its reliability remains a concern: predictions are highly sensitive to both prompt design and the model's ability to understand the context, obscuring whether failures arise from data properties or model limitations. Uncertainty decomposition, which separates aleatoric from epistemic sources, is particularly crucial in this setting, yet existing methods, designed for standard generation tasks, fail to capture the unique dynamics of ICL. To address this, we introduce the concept of self-function vectors, built upon Bayesian views and the mechanistic interpretability of ICL. These vectors leverage internal model representations to model the latent concept learned during in-context prompting, thereby enabling a direct estimation of aleatoric uncertainty within a Bayesian framework and circumventing the reliance on brittle input or decoding manipulations. Given the lack of established benchmarks and suitable evaluation protocols, we also propose the first rigorous evaluation protocol, in which data is manipulated in controlled ways so as to quantify aleatoric uncertainty precisely and separately from epistemic uncertainty. With this new evaluation framework, initially grounded in synthetic tasks for conceptual development and subsequently extended to real-world datasets, we show that our proposed methodology can measure the uncertainty of LLM predictions made under ICL more reliably than existing alternative methods. Moreover, we show it can be used as a practical tool for trustworthiness-related applications, such as hallucination detection. Our findings pave a new direction for connecting the quantitative view of uncertainty with the mechanistic understanding of model behavior.
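For orientation, the sketch below shows the textbook entropy-based decomposition of predictive uncertainty into aleatoric and epistemic parts over a set of predictive distributions (for example, one per sampled prompt or model). It is not the self-function-vector estimator proposed in the paper, and the function names are illustrative.

```python
import math
from typing import List, Sequence, Tuple


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats, skipping zero-probability classes."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def decompose_uncertainty(predictions: List[Sequence[float]]) -> Tuple[float, float, float]:
    """Return (total, aleatoric, epistemic) for a set of predictive distributions.

    total     = entropy of the averaged prediction
    aleatoric = average entropy of the individual predictions
    epistemic = total - aleatoric (the disagreement between predictions)
    """
    n_classes = len(predictions[0])
    mean = [sum(p[j] for p in predictions) / len(predictions) for j in range(n_classes)]
    total = entropy(mean)
    aleatoric = sum(entropy(p) for p in predictions) / len(predictions)
    return total, aleatoric, total - aleatoric
```

Two confident but contradictory predictions give purely epistemic uncertainty, while two identical coin-flip predictions give purely aleatoric uncertainty, which is the distinction the abstract aims to measure under ICL.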

2606.19998 2026-06-19 cs.RO cs.AI cs.CV cs.LG 交叉投稿

Tri-Info: Generalizable, Interpretable Failure Prediction for VLA Models via Information Theory

Tri-Info: 基于信息论的VLA模型可泛化、可解释的故障预测

Jinghan Yang, Yunchao Zhang, Wang Yuan, Haolun Wan, Jiaming Zhang, Zhengyang Hu, Yanchao Yang

Affiliations * InfoBodied AI Lab, The University of Hong Kong; HKU Musketeers Foundation Institute of Data Science

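AI summary Proposes Tri-Info, which uses information-theoretic signals capturing action diversity, temporal consistency, and state coupling to detect failures zero-shot (without retraining) across architectures, environments, and the sim-to-real gap, reaching 83% accuracy on real-world tasks.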

详情
AI中文摘要

视觉-语言-动作(VLA)模型越来越多地部署在各种任务中,但它们仍然是黑箱,其物理交互可能导致不可逆的伤害,因此需要可泛化和可解释的故障检测。我们观察到成功和失败的轨迹具有系统不同的信息论特征。基于此,我们将VLA控制形式化为闭环信息管道,并推导出三重信息论(Tri-Info)信号,这些信号捕捉动作是否保持多样性、时间一致性以及与状态转换的耦合。在六个VLA模型和三个基准环境中,Tri-Info在域内匹配最强的基线。此外,Tri-Info无需重新训练即可跨架构、环境和仿真到现实差距迁移,在现实世界任务中达到83%的准确率,而先前的检测器则降至随机水平。这确立了Tri-Info作为一种简单而强大的方法,不仅能够检测故障并具有强大的跨域泛化能力,还能提供底层故障模式的可解释诊断。

英文摘要

Vision-Language-Action (VLA) models are increasingly deployed across diverse tasks, yet they remain black boxes whose physical interactions can cause irreversible harm, making generalizable and interpretable failure detection essential. We observe that successful and failed rollouts carry systematically different information-theoretic signatures. Building on this, we formalize VLA control as a closed-loop information pipeline and derive the Triple Information-theoretic (Tri-Info) signals that capture whether actions remain diverse, temporally consistent, and coupled to state transitions. Across six VLA models and three benchmark environments, Tri-Info matches the strongest baselines in-domain. Moreover, Tri-Info transfers across architectures, environments, and the sim-to-real gap without retraining, reaching 83% accuracy on real-world tasks where prior detectors collapse to chance. This establishes Tri-Info as a simple yet powerful method that not only detects failures with strong cross-domain generalization, but also delivers interpretable diagnostics of the underlying failure modes.

2606.20508 2026-06-19 cs.AI cs.LG 交叉投稿

What Do Safety-Aligned LLMs Learn From Mixed Compliance Demonstrations?

安全对齐的LLM从混合顺从演示中学到了什么?

Sihui Dai, Mann Patel

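AI summary Mixes benign and harmful compliance demonstrations to study how demonstration composition drives harmful compliance, finding that demonstration content, ordering, and training methodology shape what models extract.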

详情
AI中文摘要

先前工作表明,上下文演示可以越狱语言模型,但模型如何解释不同类型的顺从演示仍不清楚。我们通过混合良性顺从演示(无害请求,有帮助响应)与有害顺从演示(有害请求,有帮助响应)并测试关于演示组成如何驱动有害顺从的三个假设来研究这一点。在四个模型中,我们发现良性和有害演示不可互换:良性演示根据模型不同可以减少或增加有害顺从。我们进一步表明,偏好优化是防止良性演示增加有害顺从的关键训练阶段,演示顺序表现出强烈的近因偏差,并且模型在拒绝与上下文学习的交互方式上有所不同:一些模型在拒绝时也采用演示的格式,而其他模型在拒绝时覆盖所有上下文信号。综合来看,这项工作超越了展示基于演示的越狱有效,而是描述了其工作原理:模型从顺从演示中提取的内容取决于演示内容、顺序和训练方法。

英文摘要

Prior work has shown that in-context demonstrations can jailbreak language models, but it remains unclear how models interpret different types of compliance demonstrations. We study this by mixing benign compliance demonstrations (non-harmful request, helpful response) with harmful compliance demonstrations (harmful request, helpful response) and testing three hypotheses about how demonstration composition drives harmful compliance. Across four models, we find that benign and harmful demonstrations are not interchangeable: benign demonstrations can either reduce or increase harmful compliance depending on the model. We further show that preference optimization is the critical training stage that prevents benign demonstrations from increasing harmful compliance, that demonstration ordering exhibits strong recency bias, and that models differ in how refusal interacts with in-context learning: some adopt demonstrated formatting even when refusing, while others override all in-context signals upon refusal. Taken together, this work moves beyond showing that demonstration-based jailbreaking works to characterizing how it works: what models extract from compliance demonstrations depends on demonstration content, ordering, and training methodology.

2606.20544 2026-06-19 cs.AI cs.LG 交叉投稿

Toward Calibrated Mixture-of-Experts Under Distribution Shift

面向分布偏移下的校准混合专家模型

Gina Wong, Drew Prinster, Suchi Saria, Rama Chellappa, Anqi Liu

Affiliations * Johns Hopkins University

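AI summary Studies the calibration of mixture-of-experts models under distribution shift and proposes an adversarial reweighting method that reduces the calibration error of the routed aggregate, improving the accuracy-calibration tradeoff.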

Journal ref ICML 2026

详情
AI中文摘要

校准将模型的预测不确定性与其经验结果的频率对齐,对于理解和信任报告的概率很重要。最近的研究表明,在单个预测器级别强制执行校准可以提高集成准确性和校准,特别是混合专家(MoE)模型显示出强烈的经验改进;然而,校准有助于MoE的条件尚不清楚。在这项工作中,我们研究了MoE模型在分布偏移下的行为,重点关注路由机制如何与专家级校准相互作用。我们表明,在硬路由模型中,专家校准足以确保整体模型在一大类分布偏移下的校准,但不足以校准软路由模型。为了解决这个问题,我们提出了一种对抗性重加权方法,惩罚分布偏移下路由聚合的校准误差,并证明它在平均情况下以及在数据的困难子集上,跨模型类别、预测任务和分布偏移,改善了准确率-校准权衡。

英文摘要

Calibration aligns a model's predictive uncertainty with the frequencies of its empirical outcomes and is important for understanding and trusting reported probabilities. Recent work shows that enforcing calibration at the level of individual predictors can improve ensemble accuracy and calibration, with mixture-of-experts (MoE) models showing strong empirical improvements in particular; however, the conditions under which calibration helps MoE are not well understood. In this work, we study how MoE models behave under distribution shift, focusing on how routing mechanisms interact with expert-level calibration. We show that expert calibration is sufficient to ensure calibration of the overall model under a broad class of distribution shifts in hard-routed models, but is insufficient for calibrating soft-routed models. To address this, we propose an adversarial reweighting that penalizes calibration errors of the routed aggregate under distribution shift, and we demonstrate that it improves the accuracy-calibration tradeoff both on average and on difficult subsets of the data, across model classes, prediction tasks, and distribution shifts.
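The notion of calibration in this abstract is often summarized by the expected calibration error (ECE), which compares average confidence with empirical accuracy inside confidence bins. The sketch below is a generic equal-width-bin ECE; whether the paper uses this exact metric or binning scheme is not stated, so treat the function and its `n_bins` default as assumptions.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float], correct: Sequence[int], n_bins: int = 10
) -> float:
    """Weighted average gap between mean confidence and accuracy over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[index].append((conf, is_correct))
    n = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += (len(members) / n) * abs(mean_conf - accuracy)
    return ece
```

A predictor that says 0.75 and is right three times out of four scores zero here, while one that says 0.9 and is always wrong scores 0.9.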

9. Datasets, Benchmarks, and Evaluation (9 papers)

2606.19597 2026-06-19 cs.SD cs.AI cs.LG 交叉投稿

PrefSQA: Pairwise Preference Prediction for Speech Quality Assessment and the Critical Role of High Quality Datasets

PrefSQA: 用于语音质量评估的成对偏好预测及高质量数据集的关键作用

Junyi Fan, Donald S. Williamson

Affiliations * Department of Computer Science and Engineering, The Ohio State University, USA

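AI summary Proposes PrefSQA, which combines uncertainty-aware logits, an impairment attention head, and a non-matching-reference comparison module, and uses high-quality preference datasets to improve the accuracy of speech quality assessment.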

Comments Accepted to INTERSPEECH 2026

详情
AI中文摘要

平均意见得分(MOS)广泛用于语音质量评估,但标量标签对评估者变异性和听力测试差异敏感,这引入了标签噪声,限制了MOS预测的可靠性。偏好预测通过让听者直接比较信号来减少这种变异性,产生更干净的标签。我们研究了无MOS的偏好预测,并提出了PrefSQA,它结合了不确定性感知logits、损伤注意力头以及基于非匹配参考比较的模块。我们使用并精炼了五个数据集,包括MOS衍生和低噪声模拟集(包含匹配和非匹配内容),在人类偏好集上进行实验,并在未见数据上测试。实验表明,在MOS衍生数据上改进较小,而其他数据集显示出相对于基线的明显改进,突显了高质量偏好数据的价值,并证明了所提出方法的有效性。

英文摘要

Mean opinion scores (MOS) are widely used for speech quality assessment, yet scalar labels are sensitive to rater variability and listening test differences. This introduces labeling noise, which limits the reliability of MOS prediction. Preference prediction reduces this variability as listeners compare signals directly, producing cleaner labels. We study MOS-free preference prediction and propose PrefSQA, which incorporates uncertainty-aware logits, an impairment attention head, and a module based on non-matching-reference comparisons. We use and refine five datasets, including MOS-derived and low-noise simulated sets with matching and non-matching content, experiment with human preference sets, and test on unseen data. Experiments show small improvements on MOS-derived data, while other sets reveal clear improvement over the baselines, highlighting the value of high-quality preference data and demonstrating the effectiveness of the proposed method.

2606.19714 2026-06-19 stat.ML cs.AI cs.LG stat.CO stat.ME 交叉投稿

AURA: Adaptive Uncertainty-aware Refinement for LLM-as-a-Judge Auditing

AURA: 用于LLM作为评判审计的自适应不确定性感知精炼

Zilong Zhang, Yi-Ting Hung, Weiyi He, Junxi Zhang, Lei Ding, Chi-Kuang Yeh

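AI summary Proposes AURA, an adaptive uncertainty-aware refinement framework that iteratively learns a human-consistency signal under limited human verification and prioritizes uncertain comparisons for review, improving the reliability of LLM judges.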

详情
AI中文摘要

大型语言模型(LLM)越来越多地被用作开放式生成的评判者,因为大规模人工评估通常昂贵且难以扩展,但它们的偏好仍然是人类判断的不完美代理。现有的审计流程通常假设事先存在可靠的示例子集或干净的监督信号,例如来自人工注释、启发式过滤或强评判者的输出。在LLM评估中,这一假设是脆弱的:初始分割可能继承评判者偏差,而人工验证通常过于稀缺,无法在规模上定义稳定组。我们提出AURA,一种自适应不确定性感知精炼框架,用于在选定的人工验证下审计成对LLM作为评判的决策。AURA迭代学习人类一致性信号,传播可靠证据,并优先将不确定的比较提交人工审核。关键思想是将对评判者的信任视为一个潜在量,随着证据积累逐步精炼。我们提供了紧凑的公式、稳定的精炼过程,以及在合成和真实成对LLM答案数据上的全面评估。

英文摘要

Large language models (LLMs) are increasingly used as judges for open-ended generation, as large-scale human evaluation is often expensive and difficult to scale, yet their preferences remain imperfect proxies for human judgment. Existing auditing pipelines often assume that a reliable subset of examples or clean supervision signals are available beforehand, for example from human annotation, heuristic filtering, or the outputs of strong judges. In LLM evaluation, this assumption is fragile: the initial split may inherit judge bias, while human verification is typically too scarce to define stable groups at scale. We propose AURA, an adaptive uncertainty-aware refinement framework for auditing pairwise LLM-as-a-judge decisions under selected human verification. AURA iteratively learns a human-consistency signal, propagates reliable evidence, and prioritizes uncertain comparisons for human review. The key idea is to treat trust in a judge as a latent quantity that is progressively refined as evidence accumulates. We provide a compact formulation, a stable refinement procedure, and a comprehensive evaluation on both synthetic and real pairwise LLM-answer data.

2606.19719 2026-06-19 cs.IR cs.CL cs.LG 交叉投稿

Closing the Calibration Gap in Semantic Caching

缩小语义缓存中的校准差距

Aditeya Baral, Radoslav Ralev, Iliya Sotirov Zhechev, Srijith Rajamohan, Jen Agarwal

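AI summary Addresses the gap between offline metrics and deployed performance in semantic caching. Introduces the P-CHR AUC and CRR metrics, finds that the calibration gap is governed by the training objective, and argues that model selection is fundamentally a calibration problem.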

Comments 23 pages, 2 figures. Source code: https://github.com/aditeyabaral/calibration-gap-semantic-caching ; Models and Datasets: https://huggingface.co/redis

详情
AI中文摘要

语义缓存通过为语义相似的查询提供缓存响应来降低LLM推理成本。标准实践使用PR-AUC评估这些系统,该指标仅衡量分数排序的好坏,而忽略它们在固定阈值下是否可用。我们表明这种不匹配会导致系统性的部署选择不佳,因为具有最高PR-AUC的模型通常在操作中最差。我们引入精度-缓存命中率(P-CHR)AUC,一种衡量缓存利用率水平上精度的缓存感知指标,以及校准保留率(CRR),它捕捉离线排序质量在部署中保留多少。我们将离线质量与部署质量之间的操作差距分解为可恢复的校准组件和由数据集正例率固定的不可约结构组件。我们的实验表明,校准差距由训练目标而非数据规模主导,事后校准只能部分缩小它。最终,语义缓存的模型选择是一个校准问题,而非排序问题,而测量它是缩小差距的第一步。

英文摘要

Semantic caching cuts LLM inference costs by serving a cached response to semantically similar queries. Standard practice evaluates these systems using PR-AUC, a metric that only measures how well scores rank and ignores whether they are usable at a fixed threshold. We show this mismatch leads to systematically poor deployment choices, as models with the highest PR-AUC are often the worst in operation. We introduce Precision-Cache Hit Ratio (P-CHR) AUC, a cache-aware metric that measures precision across cache utilization levels, and Calibration Retention Rate (CRR), which captures how much offline ranking quality survives at deployment. We decompose the operational gap between offline and deployed quality into a recoverable calibration component and an irreducible structural component fixed by the dataset's positive rate. Our experiments show that the calibration gap is governed by the training objective rather than data scale, and post-hoc calibration only partially closes it. Ultimately, model selection for semantic caching is a calibration problem, not a ranking one, and measuring it is the first step to closing the gap.
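To illustrate why a fixed similarity threshold matters, the sketch below computes precision and cache hit ratio at chosen thresholds, which is one plausible reading of the quantities behind a precision versus cache-hit-ratio curve. The paper's exact P-CHR AUC and CRR definitions are not reproduced here; the function name and the convention that precision is 1.0 when nothing is served from cache are assumptions.

```python
from typing import List, Sequence, Tuple


def precision_hit_ratio_curve(
    scores: Sequence[float],
    is_correct_match: Sequence[bool],
    thresholds: Sequence[float],
) -> List[Tuple[float, float, float]]:
    """For each threshold, return (threshold, precision, cache_hit_ratio).

    A query is served from cache when its similarity score meets the threshold;
    precision is the share of served queries whose cached answer was correct.
    """
    points = []
    for threshold in thresholds:
        served = [ok for s, ok in zip(scores, is_correct_match) if s >= threshold]
        hit_ratio = len(served) / len(scores)
        precision = sum(served) / len(served) if served else 1.0  # convention (assumption)
        points.append((threshold, precision, hit_ratio))
    return points
```

Two models with the same score ranking can land at very different points on this curve for a given threshold, which is the ranking versus calibration distinction the abstract draws.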

2606.19951 2026-06-19 eess.AS cs.CL cs.LG cs.SD 交叉投稿

Investigating Human-Model Discrepancies in Speech Quality Assessment via Acoustic and Prosodic Perturbations

通过声学和韵律扰动研究语音质量评估中的人机差异

Masato Takagi, Masaya Kawamura, Reo Shimizu, Yuma Shirahata

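AI summary Using acoustic degradation, prosodic errors, and speaker-characteristic perturbations, finds that MOS prediction models are sensitive to acoustic degradation but insensitive to prosodic errors, and that they show a bias toward mean fundamental frequency (F0) while being insensitive to speaking rate and F0 variability.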

Comments Accepted to INTERSPEECH 2026

详情
AI中文摘要

平均意见得分(MOS)预测模型在文本到语音(TTS)研究中被广泛用作代理指标,但它们捕捉超出声学保真度的质量差异的能力仍不清楚。我们通过控制性扰动来研究这一点:声学退化、韵律错误以及说话人特定特征(如音高和语速)的操纵。我们从人类听众和模型那里获得了这些语音样本的MOS预测,并分析了它们感知特征的差异。结果表明,大多数模型能很好地跟踪声学退化,而所有模型对韵律错误不敏感,尽管主观评分大幅下降。对于说话人特征,模型表现出双重分离:在人类评分中不存在的强平均基频(F0)偏见,但对人类注意到的语速和F0变化不敏感。这些发现突出了标量MOS预测在声学保真度之外的局限性。

英文摘要

Mean opinion score (MOS) prediction models are widely used as proxy metrics in text-to-speech (TTS) research, yet their ability to capture quality differences beyond acoustic fidelity remains unclear. We investigate this via controlled perturbations on speech: acoustic degradation, prosodic errors, and manipulation of speaker-specific characteristics such as pitch and speaking rate. We obtained MOS ratings for these speech samples from both human listeners and the models, and analyzed the differences in their perceptual characteristics. Results show that most models track acoustic degradation well, while all are insensitive to prosodic errors despite large subjective score drops. For speaker characteristics, models exhibit a double dissociation: strong mean fundamental frequency (F0) biases absent in human ratings, yet insensitivity to speaking rate and F0 variability that humans notice. These findings highlight limitations of scalar MOS prediction beyond acoustic fidelity.

2606.20074 2026-06-19 eess.SP cs.AI cs.LG 交叉投稿

Evaluation of EEG Foundation Models for Event-Based Burst-Suppression Detection in ICU

用于ICU中基于事件的爆发-抑制检测的EEG基础模型评估

Elisa Vasta, Thorir Mar Ingolfsson, Andrea Cossettini, Luca Benini, Tilman Beck, Emanuela Keller, Una Pale

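AI summary Presents the first evaluation of EEG foundation models for burst detection in the ICU without patient-specific calibration. REVE-base reaches an event-based F1-score of 0.868 and reduces burst-per-minute error by 52.1% and 36.2% relative to EEGNet and adaptive thresholding, respectively.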

Comments 4 pages, 1 figure. Code available upon publication

详情
AI中文摘要

爆发抑制(BS)是一种临床相关的脑电图(EEG)模式,用于监测危重患者的镇静深度和脑活动,特别是在重症监护病房(ICU)的诱导昏迷期间。自动爆发检测仍然具有挑战性,因为BS模式在不同患者之间差异很大,且标注数据集稀缺。最近,EEG基础模型(FMs)在多个下游EEG应用中显示出前景,但它们在BS检测中的实用性尚未被探索。我们提出了第一项研究,评估EEG FMs在减少导联的ICU EEG中无需患者校准的爆发检测性能。我们将REVE-base、LUNA-large和LuMamba-Tiny与自适应阈值基线以及任务特定的EEGNet基线进行比较。此外,我们补充了基于事件的爆发检测评估,以替代传统的EEG窗口分类。这有助于临床评估爆发事件是否被正确检测,减少预期标注变异性的影响。最佳模型REVE-base取得了最高的事件级F1分数($0.868 \pm 0.167$),并且与EEGNet和自适应阈值相比,分别将每分钟爆发错误减少了52.1%和36.2%,支持了FMs在ICU中可扩展的EEG监测。消融实验表明,与冻结骨干训练、两步微调和基于LoRA的适应相比,全微调是最有效的适应策略,对于LUNA-large,事件级F1分数比冻结骨干训练提高了最多$+0.102$。在减少标注数据集的情况下,预训练的REVE-base在25%的队列中比随机初始化高出$+0.723$事件级F1点,证明了在有限标注数据下适应爆发检测时预训练FM表示的优势。

英文摘要

Burst suppression (BS) is a clinically relevant electroencephalographic (EEG) pattern used to monitor sedation depth and brain activity in critically ill patients, particularly during induced coma in Intensive Care Units (ICUs). Automatic burst detection remains challenging because BS patterns vary substantially between patients and annotated datasets are scarce. Recently, EEG Foundation Models (FMs) have shown promise across several downstream EEG applications, but their usefulness for BS detection remains unexplored. We present the first study to evaluate EEG FMs for burst detection in reduced-montage ICU EEG without patient-specific calibration. We compare REVE-base, LUNA-large and LuMamba-Tiny with an adaptive thresholding baseline and a task-specific EEGNet baseline. Additionally, we complement conventional EEG window-based classification with event-based burst detection evaluation. This helps assess, in clinical terms, whether burst episodes are correctly detected, reducing the impact of expected annotation variability. The best model, REVE-base, achieved the highest event-based F1-score ($0.868 \pm 0.167$) and reduced burst-per-minute error by 52.1% and 36.2% compared to EEGNet and adaptive thresholding, respectively, supporting FMs for scalable EEG monitoring in the ICU. Ablation experiments showed that full fine-tuning was the most effective adaptation strategy compared with frozen-backbone training, two-step fine-tuning, and LoRA-based adaptation, improving event-based F1-score over frozen-backbone training by up to $+0.102$ for LUNA-large. With reduced labeled datasets, pretrained REVE-base outperformed random initialization by $+0.723$ event-based F1 points at 25% of the cohort, demonstrating the benefit of pretrained FM representations when adapted to burst detection with limited labeled data.
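Event-based scoring, as opposed to window-based classification, counts whole burst episodes rather than individual windows. The sketch below shows one simple variant: events are (start, end) intervals, and a predicted event counts as a true positive if it overlaps a not-yet-matched annotated event. The matching rule and function name are assumptions; the paper's exact event-matching criteria are not given in the abstract.

```python
from typing import Sequence, Tuple

Event = Tuple[float, float]  # (start_seconds, end_seconds)


def event_based_f1(true_events: Sequence[Event], pred_events: Sequence[Event]) -> float:
    """F1 over events using greedy one-to-one matching on any temporal overlap."""
    matched = set()
    true_positives = 0
    for pred_start, pred_end in pred_events:
        for index, (true_start, true_end) in enumerate(true_events):
            overlaps = pred_start < true_end and true_start < pred_end
            if index not in matched and overlaps:
                matched.add(index)
                true_positives += 1
                break
    false_positives = len(pred_events) - true_positives
    false_negatives = len(true_events) - true_positives
    denominator = 2 * true_positives + false_positives + false_negatives
    # Convention (assumption): a recording with no events at all scores 1.0.
    return 2 * true_positives / denominator if denominator else 1.0
```

Because any overlap counts, small disagreements about exact burst boundaries do not change the score, which matches the abstract's goal of reducing sensitivity to annotation variability.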

2606.20108 2026-06-19 cs.CV cs.LG 交叉投稿

EFIQA: Explainable Fundus Image Quality Assessment via Anatomical Priors

EFIQA: 基于解剖先验的可解释眼底图像质量评估

Pengwei Wang, José Morano, Qian Wan, Hrvoje Bogunović

发表机构 Institute of Artificial Intelligence, Center for Medical Data Science, Medical University of Vienna, Austria(维也纳医科大学医学数据科学中心人工智能研究所); Christian Doppler Lab for Artificial Intelligence in Retina, Medical University of Vienna, Austria(维也纳医科大学视网膜人工智能克里斯蒂安·多普勒实验室)

AI总结 提出无需质量标签的EFIQA框架,利用解剖先验通过掩膜解剖修复学习正常结构,生成空间质量图,在多个基准上超越监督方法,兼具可解释性。

Comments Accepted in MIDL 2026. Code: https://github.com/penway/EFIQA

Journal ref Proceedings of Machine Learning Research 315:2248-2264, 2026

详情
AI中文摘要

图像质量控制对于广泛的下游应用至关重要。基于深度学习的图像质量评估方法通常根据数据集特定的质量标签训练分类器,这继承了两种局限性:(1)泛化能力受限于训练集的标注标准;(2)这些方法无法提供质量下降的空间反馈,缺乏可解释性。在这项工作中,我们提出了EFIQA,一个无需质量相关监督的框架,并通过设计生成空间质量图。EFIQA不是从人工标注的标签中学习“什么是退化”,而是通过利用解剖先验来学习“应该有什么”。对于眼底摄影,我们将其实例化为两阶段方法:首先通过掩膜解剖修复训练无监督异常检测器,以识别缺失血管区域;然后将这一先验知识蒸馏到一个浅层适配器中,将冻结基础模型的特征映射到精确的质量图。外部数据集评估表明,这种无需标签且只需最小适配的方法,在不同质量标准的基准上,与监督方法相比,实现了更好的性能和可解释性,突显了其在现实应用中的潜力。

英文摘要

Image quality control is vital for a wide range of downstream applications. Deep learning-based image quality assessment methods typically train classifiers on dataset-specific quality labels, inheriting two limitations: (1) generalization is tied to the labeling criteria of the training set and (2) these methods cannot provide spatial feedback on where the quality is degraded, lacking explainability. In this work, we propose EFIQA, a framework that requires no quality-related supervision and produces spatial quality maps by design. Rather than learning "what is degradation" from human-annotated labels, EFIQA learns "what should be there" by leveraging anatomical priors. For fundus photography, we instantiate this as a two-stage approach, by first training an unsupervised anomaly detector via masked anatomical inpainting to identify regions of missing vasculature, and then distilling this prior knowledge into a shallow adapter mapping features of a frozen foundation model to precise quality maps. External-dataset evaluation demonstrates that this label-free approach with minimal adaptation achieves better performance and explainability compared with supervised methods across benchmarks with different quality criteria, highlighting its potential for real-world applications.

2606.20128 2026-06-19 cs.SE cs.DC cs.LG 交叉投稿

The Correctness Illusion in LLM-Generated GPU Kernels

LLM生成的GPU内核中的正确性错觉

Dipankar Sarkar

AI总结 通过高精度CPU参考和操作模式感知的模糊测试,发现现有基准测试中基于固定形状的allclose检查无法检测LLM风格的转录错误,提出一种新协议并验证其有效性。

Comments 10 pages, 2 figures, LNCS format. Companion papers to follow on arXiv next week; IDs will be added in a v2 replace

详情
AI中文摘要

针对LLM生成的GPU内核的基准测试(KernelBench、TritonBench、GEAK)通过固定形状、小样本的allclose风格检查来评分正确性。不同基准测试的输入数量不同。每个内核的形状、数据类型和容差是固定的。我们凭经验测试了该oracle。我们构建了一个包含24个Triton和CPU替代内核(15个正确对照和9个带有记录转录错误的LLM风格错误变体)的受控语料库,并在操作模式感知的种子模糊测试下,使用高精度(fp64)CPU参考和每个(操作,数据类型)的绝对容差重新评估。种子oracle标记了9个错误内核中的9个,并通过了15个正确对照中的15个,对照的精度成本为零。我们将语料库扩展到26个操作(添加一个flash-attention对),并在五类GPU(RTX 3060、A10、L40S、A100 SXM4、H100 NVL)上重新运行相同的协议。所有五个GPU的判定结果相同:10个错觉中的10个被捕获,16个对照中的16个干净。语料库结果涉及LLM风格的转录错误,这些错误被单形状allclose oracle认证为正确,而不涉及任何特定部署的LLM的错误率。每个标记的失败都从存储的种子逐字节重放。

英文摘要

Benchmarks for LLM-generated GPU kernels (KernelBench, TritonBench, GEAK) score correctness through fixed-shape, small-sample allclose-style checks. The number of inputs varies between benchmarks. The shape, dtype, and tolerance are fixed for each kernel. We test that oracle empirically. We construct a controlled corpus of 24 Triton and CPU stand-in kernels (15 correct controls and 9 LLM-style buggy variants seeded with documented transcription errors) and re-evaluate it under op-schema-aware seeded fuzzing with a high-precision (fp64) CPU reference and per-(op, dtype) absolute tolerances. The seeded oracle flags 9 of 9 buggy kernels and passes 15 of 15 correct controls, at zero precision cost on controls. We extend the corpus to 26 ops (adding a flash-attention pair) and re-run the same protocol on five GPU classes (RTX 3060, A10, L40S, A100 SXM4, H100 NVL). The verdicts are identical across all five GPUs: 10 of 10 illusions caught and 16 of 16 controls clean. The corpus result is about LLM-style transcription bugs that the allclose-on-one-shape oracle certifies as correct, not about the bug rate of any specific deployed LLM. Every flagged failure replays byte-for-byte from a stored seed.
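
The gap between a fixed-shape allclose check and seeded fuzzing can be sketched in plain Python. This is an illustration under stated assumptions, not the paper's protocol or corpus: the "kernel" is a hypothetical blocked scaling routine with a missing tail mask, Python floats stand in for the fp64 reference, and all function names and the tolerance are invented for the example.

```python
import math
import random
from typing import Callable, List, Optional, Tuple

Kernel = Callable[[List[float], float], List[float]]


def reference_scale(x: List[float], alpha: float) -> List[float]:
    """Reference implementation; Python floats are IEEE-754 doubles (fp64)."""
    return [alpha * v for v in x]


def buggy_scale(x: List[float], alpha: float, block: int = 4) -> List[float]:
    """Hypothetical blocked kernel that forgets the tail when len(x) % block != 0."""
    out = [0.0] * len(x)
    full = (len(x) // block) * block   # elements past `full` are never written
    for i in range(full):
        out[i] = alpha * x[i]
    return out


def allclose(a: List[float], b: List[float], atol: float = 1e-6) -> bool:
    """Absolute-tolerance comparison, element by element."""
    return len(a) == len(b) and all(
        math.isclose(u, v, rel_tol=0.0, abs_tol=atol) for u, v in zip(a, b)
    )


def fixed_shape_check(kernel: Kernel) -> bool:
    """One shape only (length 8, a multiple of the block size)."""
    rng = random.Random(0)
    x = [rng.uniform(-1.0, 1.0) for _ in range(8)]
    return allclose(kernel(x, 2.0), reference_scale(x, 2.0))


def seeded_fuzz(kernel: Kernel, seed: int, trials: int = 50) -> Optional[Tuple[int, int]]:
    """Vary the shape under a stored seed; return the first failing (trial, length)."""
    rng = random.Random(seed)
    for trial in range(trials):
        n = rng.randint(1, 64)
        x = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        if not allclose(kernel(x, 2.0), reference_scale(x, 2.0)):
            return (trial, n)
    return None


passes_fixed = fixed_shape_check(buggy_scale)   # the single-shape oracle is fooled
failure = seeded_fuzz(buggy_scale, seed=123)    # shape fuzzing exposes the tail bug
```

Because the fuzzer draws every input from a generator seeded with a stored integer, calling `seeded_fuzz` again with the same seed reproduces the same failing case, which mirrors the replay-from-seed property the abstract describes.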

2606.20477 2026-06-19 cs.CV cs.CL cs.LG 交叉投稿

Scalable Training of Spatially Grounded 2D Vision-Language Models for Radiology

面向放射学的空间定位2D视觉-语言模型的可扩展训练

Yusuf Salcan, Simon Ging, Robin Schirrmeister, Philipp Arnold, Elmar Kotter, Behzad Bozorgtabar, Thomas Brox

发表机构 Computer Vision Group, University of Freiburg, Germany(德国弗莱堡大学计算机视觉组); Department of Radiology, Medical Center -- University of Freiburg, Germany(德国弗莱堡大学医学中心放射科); CRIION-AI Lab, Freiburg, Germany(德国弗莱堡CRIION-AI实验室)

AI总结 提出RefRad2D大规模双语数据集,通过LLM和自动分割生成空间定位数据,训练RadGrounder模型联合完成报告生成、VQA和空间定位,在外部基准上取得竞争性结果。

Comments Accepted for MICCAI 2026. First two authors: equal contribution. Last two authors: equal supervision

详情
AI中文摘要

我们研究了如何在没有手动空间标注的情况下,为放射学训练具有视觉定位能力的视觉-语言模型(VLM)。我们引入了RefRad2D,这是一个大规模的双语(德语/英语)数据集,包含来自临床实践的120万对CT和MR图像-文本对,并通过基于LLM的筛选和自动分割自动生成任务特定的VQA和空间定位子集。在此数据上训练的模型RadGrounder联合执行报告生成、视觉问答以及通过边界框检测或分割进行的空间定位。在外部VQA基准(Slake,VQA-RAD)上,RadGrounder取得了与专用医学VLM竞争的结果。将我们的临床数据加入训练混合集,相比于仅在下游数据集上微调,提高了开放式VQA的性能,显示了数据集的迁移性。关键在于,添加定位监督不会降低语言质量,从而在不牺牲VQA性能的情况下实现空间可验证的输出。

英文摘要

We study how to train visually grounded vision-language models (VLMs) for radiology without manual spatial annotations. We introduce RefRad2D, a large-scale bilingual (German/English) dataset of 1.2M CT and MR image-text pairs derived from clinical practice, with task-specific VQA and spatial grounding subsets generated automatically via LLM-based curation and automated segmentation. Trained on this data, our model RadGrounder jointly performs report generation, visual question answering, and spatial grounding via bounding-box detection or segmentation. On external VQA benchmarks (Slake, VQA-RAD), RadGrounder achieves competitive results with specialized medical VLMs. Adding our clinical data to the training mixture improves open-ended VQA over fine-tuning on the downstream datasets alone, showing the transferability of our dataset. Crucially, adding grounding supervision does not degrade language quality, enabling spatially verifiable outputs at no cost to VQA performance.

2606.20482 2026-06-19 cs.CL cs.HC cs.LG 交叉投稿

Your Mouse and Eyes Secretly Leak Your Preference: LLM Alignment using Implicit Feedback from Users

你的鼠标和眼睛悄悄泄露你的偏好:利用用户隐式反馈进行LLM对齐

Haw-Shiuan Chang, Jeffrey Gomez, Mehul Patwari, Aryan Sajith, Hamed Zamani

发表机构 University of Massachusetts, Amherst(马萨诸塞大学阿默斯特分校); York University(约克大学)

AI总结 针对显式反馈稀缺的问题,提出利用鼠标轨迹和眼动数据等隐式反馈训练奖励模型,将文本奖励模型准确率从55%提升至64%,并显著提高DPO对齐后响应质量。

详情
AI中文摘要

为了对齐大型语言模型(LLM),大多数现有方法收集显式的人类反馈,并基于响应文本训练奖励模型来预测人类偏好。这些现有方法有两个关键局限性。首先,用户很少为LLM响应提供显式反馈,这使得高质量偏好标注的收集成本高昂。其次,这些方法没有利用隐式人类反馈,而隐式反馈已被证明对互联网巨头的经济护城河至关重要。为了量化隐式反馈的价值,我们构建了一个名为IFLLM的新数据集,收集了来自59名Mechanical Turk工作者的1336个多轮问题、他们的鼠标轨迹以及通过网络摄像头记录的对LLM响应的眼动注视点。IFLLM显示用户具有非常多样化的注视行为和鼠标轨迹。基于隐式用户反馈的奖励模型将基于文本的奖励模型准确率从55%提升至64%,并在将DPO应用于八个LLM后,使相对响应质量改进提升至近三倍,证明了隐式反馈在现实场景中的价值。我们的数据收集网站、数据集和代码可在以下网址找到:https://github.com/themehulpatwari/llm-implicit-feedback/。

英文摘要

To align a Large Language Model (LLM), most existing methods collect explicit human feedback and train a reward model to predict human preference from the response text. These methods have two key limitations. First, users rarely provide explicit feedback on LLM responses, which makes high-quality preference annotations expensive to collect. Second, the methods do not leverage implicit human feedback, which has proven vital to the economic moats of Internet giants. To quantify the value of implicit feedback, we build a new dataset called IFLLM, which collects 1336 multi-turn questions from 59 Mechanical Turk workers, together with their mouse trajectories and webcam-recorded eye-gaze points on the LLMs' responses. IFLLM shows that users exhibit very diverse gazing behaviors and mouse trajectories. Our reward model based on implicit user feedback boosts the accuracy of the text-based reward model from 55% to 64% and nearly triples the relative response quality improvement after applying DPO to eight LLMs, demonstrating the value of implicit feedback in the wild. Our data collection website, dataset, and code can be found at https://github.com/themehulpatwari/llm-implicit-feedback/.

10. 机器学习应用 27 篇

2606.17054 2026-06-19 cs.RO cs.AI cs.CV cs.LG 交叉投稿

Human Universal Grasping

人类通用抓取

Kevin Yuanbo Wu, Tianxing Zhou, Isaac Tu, Billy Yan, Irmak Guzey, David Fouhey, Dandan Shan, Lerrel Pinto

发表机构 New York University(纽约大学); Tsinghua University(清华大学); University of Michigan(密歇根大学)

AI总结 提出HUG模型,利用人类抓取数据(1M-HUG数据集)和流匹配方法,从单张RGB-D图像生成多样化抓取姿态,并重定向到机器人手,实现零样本抓取,在HUG-Bench上超越基线23%-34%。

Comments 28 pages, 20 figures, 7 tables

详情
AI中文摘要

人类可以轻松抓取物体,而多指机器人远未达到这种通用性。我们认为机器人抓取数据最自然的来源是人类,他们每天拿起数千个物体。我们提出HUG,一个流匹配模型,能够为任何用户指定的物体(从立体相机捕获的单张RGB-D图像中)生成多样化的人类抓取。使用智能眼镜,我们首先收集了1M-HUGs,一个自我中心的人类抓取数据集,涵盖100万帧(27.8小时)和41栋建筑中的6,707个物体实例。接下来,为了建模自然人类抓取的分布,我们的新型流匹配模型融合RGB和深度观测,输出由手腕平移、手腕旋转和MANO手姿态参数化的抓取。预测的抓取可以重定向到各种机器人手,实现在日常场景中的零样本抓取。为了标准化评估,我们构建了一个新的模拟基准HUG-Bench,包含来自五个几何类别和不同尺寸的90个未见物体,并带有公制尺度的3D网格。我们在真实世界中评估HUG,使用HUG-Bench的30个物体测试集,跨越多个立体相机、机器人实体和家庭环境。HUG在我们具有挑战性的物体集上比最先进的抓取基线高出23%和34%。代码、数据、基准、检查点和交互式演示已在我们的网站上发布:https://grasping.io/

英文摘要

Humans can grasp objects effortlessly, whereas multi-fingered robots are far from this level of generality. We argue that the most natural source of robot grasping data is from humans, who pick up thousands of objects every day. We present HUG, a flow-matching model that generates diverse human grasps for any user-specified object in a single RGB-D image captured from a stereo camera. Using smart glasses, we first collect 1M-HUGs, an egocentric dataset of human grasps spanning 1M frames (27.8 hrs) and 6,707 object instances across 41 buildings. Next, to model the distribution of natural human grasps, our novel flow-matching model fuses RGB and depth observations to output a grasp parameterized by wrist translation, wrist rotation, and MANO hand pose. Predicted grasps can be retargeted to various robot hands, enabling zero-shot grasping in everyday scenes. To standardize evaluation, we build a new simulated benchmark, HUG-Bench, of 90 unseen objects from five geometric categories and various sizes, with metric-scale 3D meshes. We evaluate HUG in the real world on the 30-object test set of HUG-Bench across multiple stereo cameras, robot embodiments, and household environments. HUG outperforms the state-of-the-art grasping baselines by +23% and +34% on our challenging object set. Code, data, benchmark, checkpoints, and an interactive demo are released on our website: https://grasping.io/

2606.19372 2026-06-19 eess.IV cs.CV cs.LG 交叉投稿

Full-Self Diagnostics (FSD): Physics-Grounded Visual Biomarker Inference from Smartphone Video via Inverse Problems and Operator Learning

全自诊断(FSD): 通过逆问题和算子学习从智能手机视频进行基于物理的可视生物标志物推断

Jonathan Thomas, Harsh Thaker

AI总结 提出全自诊断(FSD)框架,结合物理前向模型、信息论可观测性、正则化逆问题、算子学习和随机变分推断,从9秒面部视频恢复生理状态,在59名受试者38812次扫描中验证,血糖MARD达29.86%。

Comments 38,812 paired scans, preliminary longitudinal validation of multichannel visual glucose inference (MARD 17 to 46 percent across cohorts); physics plus information theory plus operator learning framework

详情
AI中文摘要

我们提出全自诊断(FSD),一个统一的数学框架,用于从消费级智能手机拍摄的无约束9秒面部视频中恢复潜在生理状态。该方法整合了五个相互增强的组件:(1)基于辐射传输方程和发色团吸收的物理前向模型,将相机观测映射到生物标志物浓度;(2)信息论可观测性理论,证明多通道视觉信号(光谱、脉搏、呼吸、微表情和眼动)与生理状态包含严格递增的互信息;(3)具有域均匀可辨识性保证的稳定Tikhonov正则化逆问题;(4)算子学习公式,实现跨设备、分辨率和人群的泛化;(5)可解释为随机变分推断的监督学习过程,从配对生物传感器真实值持续优化模型,性能随配对观测数量的平方根倒数比例提升。在59名受试者的38812次真实世界配对扫描上的实证验证展示了实际性能。第一作者自采数据(血糖范围35-550 mg/dL)的MARD为29.86%,97.57%的预测落在Clarke误差网格A+B区,仅0.27%在危险E区。一位管理良好的糖尿病参与者在较窄的70-180 mg/dL范围内达到MARD 17%。这些结果证实,消费级面部视频编码了足够的结构化信息,可在完全无约束条件下进行临床相关的非侵入性生物标志物推断,且性能随更多配对数据的可用性可预测地提升。

英文摘要

We present Full-Self Diagnostics (FSD), a unified mathematical framework for recovering latent physiological states from unconstrained 9-second facial videos captured by consumer smartphones. The approach integrates five mutually reinforcing components: (1) a physics-based forward model derived from the radiative transfer equation and chromophore absorption that maps camera observables to biomarker concentrations; (2) an information-theoretic observability theory proving that multi-channel visual signals (spectral, pulse, respiratory, micro-expression, and oculomotor) contain strictly increasing mutual information with physiological state; (3) a stable, Tikhonov-regularized inverse problem with domain-uniform identifiability guarantees; (4) an operator-learning formulation that enables generalization across devices, resolutions, and populations; and (5) a supervised learning procedure, interpretable as stochastic variational inference, that continuously refines the model from paired biosensor ground truth with performance improving proportionally to one over the square root of the number of paired observations. Empirical validation on 38812 real-world paired scans across 59 subjects demonstrates practical performance. Self-collected data from the lead author (glucose range 35-550 mg/dL) yields MARD of 29.86 percent with 97.57 percent of predictions in Clarke Error Grid Zones A+B and only 0.27 percent in the dangerous Zone E. A well-managed diabetic participant achieves MARD of 17 percent in the narrower 70-180 mg/dL band. These results confirm that consumer-grade facial video encodes sufficient structured information for clinically relevant, non-invasive biomarker inference under fully unconstrained conditions, with performance scaling predictably as more paired data becomes available.
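
For readers unfamiliar with the headline metric, the following sketch shows how MARD (mean absolute relative difference) is conventionally computed, and what the stated one-over-square-root scaling would imply if read as a statement about error. The glucose values are invented for illustration and are not data from the paper; the function names are hypothetical, and reading "performance" as error proportional to $1/\sqrt{n}$ is an assumption.

```python
import math
from typing import Sequence


def mard_percent(predicted: Sequence[float], reference: Sequence[float]) -> float:
    """Mean absolute relative difference, in percent, against reference readings."""
    if len(predicted) != len(reference) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(p - r) / r for p, r in zip(predicted, reference))
    return 100.0 * total / len(reference)


def error_ratio(n_before: int, n_after: int) -> float:
    """Ratio of expected errors if error scales as 1/sqrt(n) (assumed reading)."""
    return math.sqrt(n_before / n_after)


# Illustrative values in mg/dL (made up): relative errors are 10%, 10%, and 20%.
example_mard = mard_percent([110.0, 180.0, 60.0], [100.0, 200.0, 50.0])

# Under 1/sqrt(n) scaling, four times as many paired scans would halve the error.
ratio = error_ratio(100, 400)
```

Because each term is divided by the reference value, the same absolute error weighs more heavily at low glucose levels, which is one reason MARD over a wide 35-550 mg/dL range is not directly comparable with MARD over a narrower 70-180 mg/dL band.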

2606.19380 2026-06-19 cs.SE cs.LG 交叉投稿

AgentArmor: A Framework, Evaluation, & Mitigation of Coding Agent Failures

AgentArmor:编码代理失败的框架、评估与缓解

Kenneth Ge, Andre Assis

AI总结 提出AgentArmor框架,通过系统提示增强、命令分类器、“三振”策略等机制,缓解编码代理因规范不足、能力错误和代理工具错误导致的失败,显著提升安全性。

详情
AI中文摘要

软件工程和部署正越来越多地委托给AI编码代理。它们的广泛采用暴露了罕见但极具破坏性的失败模式。在本文中,我们研究这些失败模式源于三种不同的机制:规范不足,即默认模型行为不安全;能力错误,即安全动作可用但模型因偏见或能力限制而未遵循;以及代理工具错误,即模型未能通过工具执行安全动作。我们在8个不同的评估中评估这些机制,每个评估都受实际部署失败的启发,总计20个编码环境和59个合成转录模板。基于此评估,我们提出AgentArmor,一种代理工具修改,以缓解这些错误。通过添加扩展的系统提示、单独的命令分类器、“三振”策略、确定性护栏以及代理编辑自身上下文的工具,我们证明AgentArmor在统计显著数量的样本上更安全。因此,我们为当前编码代理提出具体缓解措施,并为未来代理工具功能提出设计理念。

英文摘要

Software engineering and deployment are increasingly being delegated to AI coding agents. The scale of their adoption is surfacing rare, but highly destructive, failure modes. In this paper, we study these failure modes as stemming from three distinct mechanisms: underspecification, where default model behavior is unsafe; capability errors, where the safe action is available but the model does not adhere to it due to bias or capability limitations; and agent harness errors, where the model fails to execute the safe action through the harness. We evaluate these across 8 different evaluations, each inspired by real-life deployment failures, totaling 20 coding environments and 59 synthetic transcript templates. Based on this evaluation, we propose AgentArmor, an agent harness modification, to mitigate these errors. By adding an extended system prompt, a separate command classifier, a "3 strikes" policy, deterministic guardrails, and tools for the agent to edit its own context, we show that AgentArmor is safer across a statistically significant number of samples. Thus, we suggest concrete mitigations for current coding agents and a design philosophy for future agent harness features.

2606.19501 2026-06-19 cs.AI cs.CL cs.LG q-fin.RM 交叉投稿

DeXposure-Claw: An Agentic System for DeFi Risk Supervision

DeXposure-Claw: 一个用于DeFi风险监管的智能体系统

Aijie Shu, Bowei Chen, Wenbin Wu, Cathy Yi-Hsuan Chen, Fengxiang He

发表机构 University of Edinburgh(爱丁堡大学); University of Glasgow(格拉斯哥大学); University of Cambridge(剑桥大学)

AI总结 针对DeFi监管中LLM智能体易误报的问题,提出DeXposure-Claw系统,通过图时间序列基础模型预测风险网络,结合确定性监控和置信度门控生成可审计监管票据,并构建六轴评估基准DeXposure-Bench,实验验证有效性。

详情
AI中文摘要

去中心化金融使监管者面临快速变化的网络化信用风险。通用LLM智能体不适合此场景:它们过度解读弱证据并推荐高风险干预,而现有评估无法提供符合监管者需求的误报衡量方式。我们提出DeXposure-Claw,一个基于预测的智能体监管系统,通过结构化证据引导LLM决策:(1) DeXposure-FM,一个图时间序列基础模型,预测未来风险网络;(2) 确定性监控和压力场景将预测转化为类型化警报、归因信号和场景证据;(3) 数据健康和置信度门控在DeXposure-Claw发出带有理由的可审计监管票据前限制升级。我们进一步开发了DeXposure-Bench,一个六轴评估框架,其决策轴根据符合监管者的绝对损失真实情况和显式误干预率对票据评分。在五年每周真实数据上的实验充分支持了我们的系统。代码见 https://github.com/EVIEHub/DeXposure-Claw。

英文摘要

Decentralized finance exposes supervisors to fast-moving, networked credit risks. General-purpose LLM agents fit this setting poorly: they over-read weak evidence and recommend high-stakes interventions, while existing evaluations offer no regulator-aligned way to measure the resulting false alarms. We introduce DeXposure-Claw, a forecast-grounded agentic supervision system that routes LLM decisions through structured evidence: (1) DeXposure-FM, a graph time-series foundation model, forecasts future exposure networks; (2) deterministic monitors and stress scenarios then turn those forecasts into typed alerts, attribution signals, and scenario evidence; and (3) data-health and confidence gates constrain escalation before DeXposure-Claw emits auditable supervisory tickets with rationales. We further develop DeXposure-Bench, a six-axis evaluation harness, whose decision axis scores tickets against a regulator-aligned absolute-loss ground truth and an explicit false-intervention rate. Experiments on five years of weekly real data fully support our system. Code is at https://github.com/EVIEHub/DeXposure-Claw.

2606.19627 2026-06-19 cs.IR cs.AI cs.LG 交叉投稿

VCG: A Multimodal Retrieval Framework for E-Commerce Video Feeds under Extreme Cold-Start Conditions

VCG:极端冷启动条件下电商视频流的多模态检索框架

Katya Mirylenka, Egor Malykh, Mahdyar Ravanbakhsh, Michael Gygli, Marco-Andrea Buchmann, Andrew Dzhoha, Svitlana Borzenko, Francesca Catino, Mohamed Gaafar, Maarten Versteegh, Thomas Kober, Dario d'Andrea, Ellie Langhans

AI总结 针对电商视频流中的极端冷启动和偏差问题,提出基于领域自适应视觉-语言模型(CLIP)的可扩展多模态检索系统VCG,实现零样本检索,在线测试显示深度视频完成率提升50%。

详情
AI中文摘要

数字商业格局正从静态的搜索驱动型目录转向动态的沉浸式视频流。这一转变引入了“极端冷启动”问题:与传统商品不同,新的短视频缺乏协同过滤所需的密集交互历史。此外,沉浸式视频流引入了强烈的位置和时长偏差,扭曲了标准参与信号。在本文中,我们展示了视频候选生成(VCG)系统,这是一个可扩展的多模态检索引擎,旨在解决大规模电商环境中的这些挑战。通过利用领域自适应的视觉-语言模型(基于CLIP),我们将用户和视频映射到共享语义空间,实现基于视觉内容而非行为历史的零样本检索。我们详细介绍了系统的架构,并进行了严格的评估,比较了生成式(LLM)和判别式(CLIP)嵌入。结果表明,虽然生成式模型在属性预测方面表现出色,但在检索任务中会出现嵌入空间坍塌。在线A/B测试表明,VCG有效缓解了参与偏差,使深度视频完成率提升了50%。为了展示系统的能力,我们提供了一个交互式演示,包含三种双向检索场景:产品到视频、视频到产品和零样本语义搜索。

英文摘要

The digital commerce landscape is shifting from static, search-driven catalogs to dynamic, immersive video feeds. This transition introduces an "extreme cold-start" problem: unlike traditional items, new short-form videos lack the dense interaction history required for collaborative filtering. Furthermore, immersive feeds introduce strong position and duration biases that distort standard engagement signals. In this paper, we demonstrate the Video Candidate Generation (VCG) system, a scalable multimodal retrieval engine designed to solve these challenges in a large-scale e-commerce environment. By leveraging a domain-adapted vision-language model (based on CLIP), we map users and videos into a shared semantic space, enabling zero-shot retrieval based on visual content rather than behavioral history. We detail the system's architecture and present a rigorous evaluation comparing generative (LLM) vs. discriminative (CLIP) embeddings. Our results show that while generative models excel at attribute prediction, they suffer from embedding space collapse in retrieval tasks. Online A/B testing demonstrates that VCG effectively mitigates engagement biases, yielding a 50% uplift in deep video completion. To showcase the system's capabilities, we present an interactive demonstration featuring three bi-directional retrieval scenarios: Product-to-Video, Video-to-Product, and Zero-Shot Semantic Search.

2606.19629 2026-06-19 cs.SD cs.AI cs.LG 交叉投稿

RIVET: Robust Idempotent Voice Attribute Editing

RIVET: 鲁棒的幂等语音属性编辑

Dareen Alharthi, Bhuvan Koduru, Rita Singh, Bhiksha Raj

发表机构 Carnegie Mellon University(卡内基梅隆大学)

AI总结 提出RIVET训练框架,通过幂等性正则化提升语音属性编辑模型对标签噪声的鲁棒性,在合成噪声和真实噪声数据集上均优于标准训练。

详情
AI中文摘要

语音属性编辑模型在保留说话人身份的同时修改年龄和性别等特征。然而,在大规模语音数据集中,属性标注通常带有噪声或不一致,这可能导致条件生成模型产生不稳定的编辑。在这项工作中,我们证明幂等性为提升对噪声标签的鲁棒性提供了一种有效机制。幂等算子是指重复应用不会改变结果的算子,即 f(f(x)) = f(x)。强制这一性质作为一种隐式正则化器,降低了对错误标注样本的敏感性。我们引入了 RIVET,一种结合幂等性目标以提升对标签噪声鲁棒性的训练框架。我们在受控标签噪声下以及在具有自然噪声标注的 GLOBE 数据集上评估了 RIVET。RIVET 提高了编辑成功率,并且比标准训练更好地保留了说话人身份,表明幂等性提升了语音编辑模型的鲁棒性。

英文摘要

Voice attribute editing models modify characteristics such as age and gender while preserving speaker identity. In large-scale speech datasets, however, attribute annotations are often noisy or inconsistent, which can cause conditional generative models to produce unstable edits. In this work, we show that idempotency provides an effective mechanism for improving robustness to noisy labels. An idempotent operator is one for which repeated application does not change the result, i.e., f(f(x)) = f(x). Enforcing this property acts as an implicit regularizer that reduces sensitivity to mislabeled examples. We introduce RIVET, a training framework that incorporates an idempotency objective to improve robustness to label noise. We evaluate RIVET under controlled label noise and on the GLOBE dataset with naturally noisy annotations. RIVET improves editing success and better preserves speaker identity than standard training, showing that idempotency improves robustness in voice editing models.
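
The idempotency property $f(f(x)) = f(x)$ can be made concrete with a small sketch. The gap measure below (mean squared difference between one and two applications) is one plausible way to quantify departure from idempotency; it is an assumption for illustration and not RIVET's actual training objective, and the toy operators stand in for an editing model.

```python
from typing import Callable, List

Vector = List[float]


def idempotency_gap(f: Callable[[Vector], Vector], x: Vector) -> float:
    """Mean squared difference between f(f(x)) and f(x); zero if f is idempotent at x."""
    once = f(x)
    twice = f(once)
    return sum((a - b) ** 2 for a, b in zip(twice, once)) / len(once)


def clip_unit(x: Vector) -> Vector:
    """Clipping to [-1, 1] is idempotent: clipping twice equals clipping once."""
    return [max(-1.0, min(1.0, v)) for v in x]


def halve(x: Vector) -> Vector:
    """Halving is not idempotent: every application shrinks the input again."""
    return [0.5 * v for v in x]


gap_clip = idempotency_gap(clip_unit, [2.0, -3.0, 0.5])   # no change on re-application
gap_halve = idempotency_gap(halve, [2.0, -4.0])           # re-application keeps changing
```

A term of this kind, added to a training loss, would penalize an editor whose output keeps drifting when the same edit is applied again; whether and how RIVET weights such a term is not specified in the abstract.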

2606.19699 2026-06-19 cs.RO cs.LG cs.SY eess.SY 交叉投稿

Comparative Study on Agility, Efficiency, and Impact Absorption of Bipedal Robots with Active Toes

具有主动脚趾的双足机器人敏捷性、效率和冲击吸收的比较研究

Joong-Gil Kim, Wontae Ye, Geunwoo Cho, Seong-Ho Yun, Se-Hyoung Cho, Yong-Jae Kim

发表机构 School of Electrical, Electronics and Communication Engineering, Korea University of Technology and Education(韩国技术教育大学电气、电子与通信工程学院); Artificial Intelligence and Robotics Institute, Korea Institute of Science and Technology(韩国科学技术研究院人工智能与机器人研究所); Robot Innovation Hub, WIRobotics Inc.(WIRobotics公司机器人创新中心)

AI总结 提出一种14自由度双足机器人,模拟人类脚趾的轻量、高扭矩、坚固特性,通过高保真仿真训练环境,对比有无主动脚趾的配置,发现脚趾机器人以1.33米/秒行走时,CoT降低17.5%,脚跟冲击力降低5.0%,路径偏差平均和最大分别降低25.0%和34.0%。

Comments 6 pages, 7 figures

详情
AI中文摘要

人类腿部表现出高效率、敏捷性和冲击吸收能力,其中脚趾在这些能力中起着关键作用。尽管已经有许多尝试在机器人中实现类似人类的脚趾,但它们尚未完全复制人类特征,也没有严格验证其益处。我们提出了一种14自由度的双足机器人,模拟人类脚趾的轻量、高扭矩、坚固特性。为了定量分析主动脚趾在敏捷性、效率和冲击吸收方面的有效性,我们开发了一个高保真仿真训练环境,该环境反映了具有耦合传动和精确功耗的实际执行器。为了确保有和没有主动脚趾的配置之间的公平比较,我们设计了一个最小化强化学习奖励函数,并对两者应用了相同的训练程序。仿真结果表明,在1.33米/秒行走时,与无脚趾配置相比,配备脚趾的机器人将CoT降低了17.5%,脚跟冲击力降低了5.0%。在敏捷性测试中,平均和最大路径偏差分别降低了25.0%和34.0%。

英文摘要

Human legs exhibit high efficiency, agility, and impact absorption, with toes playing a crucial role in these capabilities. While many attempts have been made to implement human-like toes in robots, they have not fully replicated human characteristics nor rigorously validated their benefits. We propose a 14-DOF biped robot emulating human toes' lightweight, high-torque, robust nature. To quantitatively analyze the effectiveness of the active toes in terms of agility, efficiency, and impact absorption, we developed a high-fidelity simulation training environment that reflects actual actuators with coupled transmissions and accurate power consumption. To ensure a fair comparison between configurations with and without active toes, we designed a minimal RL reward function and applied an identical training procedure to both. The simulation results indicate that, at 1.33 m/s walking, the toe-equipped robot reduced CoT by 17.5% and heel-strike GRF by 5.0% compared with the toe-ablation configuration. On the agility test, average and maximum path deviation decreased by 25.0% and 34.0%, respectively.

2606.19711 2026-06-19 cs.RO cs.LG cs.SY eess.SY 交叉投稿

A Differentiable Composite Approximation Framework for Autonomous Underwater Vehicle Maneuvering Modeling from Sea-Trial Data

一种可微复合近似框架:基于海试数据的自主水下航行器机动建模

Aobo Wang, Aifei Xia, Zihao Wang, Lizhu Hao

发表机构 College of Shipbuilding Engineering, Harbin Engineering University(哈尔滨工程大学船舶工程学院); China Academy of Aerospace Aerodynamics(中国航天空气动力技术研究院); Institute of Artificial Intelligence, Shanghai University(上海大学人工智能研究院); China Ship Scientific Research Center(中国船舶科学研究中心)

AI总结 提出可微复合近似框架,结合多项式基与数据自适应基联合校准,并引入基于转向运动的海流估计与补偿,提升AUV机动预测精度。

详情
AI中文摘要

基于机载测量、利用现场数据进行建模,可以生成反映真实运行特性的自主水下航行器(AUV)机动模型。从近似角度看,传统机动模型使用预定义的约束多项式基,而数据驱动模型使用数据自适应基。受此基函数视角启发,本文提出一种可微复合近似公式,其中多项式基分量和数据自适应基分量被视为单个预测器的可微部分并联合校准。开发了一种基于梯度的协同校准方法用于全尺寸AUV机动预测,其中灵敏度感知机制调节有界多项式更新,而神经残差在共享预测目标下捕获剩余非线性差异。为了考虑现场数据中的海流效应,引入了一种基于转向运动的海流估计和补偿程序,以构建经海流补偿的学习目标,用于训练和滚动预测。该框架使用从7米长AUV在多种机动条件下收集的海试数据进行评估。结果表明,与纯多项式、纯神经网络和冻结先验混合基线相比,所提方法改进了递归轨迹和速度预测,证明了其在基于现场数据的AUV机动建模中的适用性。

英文摘要

Field-based modeling from onboard measurements can produce autonomous underwater vehicle (AUV) maneuvering models that reflect real operating characteristics. From an approximation perspective, conventional maneuvering models use predefined constraint polynomial bases, whereas data-driven models use data-adaptive bases. Motivated by this basis-function view, this paper presents a differentiable composite-approximation formulation, in which the polynomial-basis component and the data-adaptive basis component are treated as differentiable parts of a single predictor and calibrated jointly. A gradient-based co-calibration method is developed for full-scale AUV maneuvering prediction, where a sensitivity-aware mechanism regulates bounded polynomial updates while the neural residual captures remaining nonlinear discrepancies under a shared prediction objective. To account for ocean-current effects in field data, a turning-motion-based current estimation and compensation procedure is incorporated to construct current-compensated learning targets for training and rollout. The framework is evaluated using sea-trial data collected from a 7-meter AUV under multiple maneuvering conditions. Results show that the proposed method improves recursive trajectory and velocity prediction compared with polynomial-only, neural-only, and frozen-prior hybrid baselines, demonstrating its applicability to field-data-based AUV maneuvering modeling.

2606.19793 2026-06-19 eess.AS cs.AI cs.LG cs.SD eess.SP 交叉投稿

Systematic Study of Dysarthric Speech Recognition: Spectral Features and Acoustic Models

构音障碍语音识别的系统研究:频谱特征与声学模型

Paban Sapkota, Hemant Kumar Kathania, Mikko Kurimo, Sudarsana Reddy Kadiri, Shrikanth Narayanan

AI总结 本文系统研究不同频谱特征与声学模型的组合,通过引入音高特征和优化训练帧重叠数,在F-TDNN模型上实现孤立词和句子识别相对提升4.65%和4.63%。

详情
AI中文摘要

识别构音障碍语音的挑战主要源于发音精度受损导致的显著声学变异性。过去的研究表明,通过使用混合DNN/HMM序列区分性训练可以改善识别性能。本文对不同声学模型定制的各种声学特征组合进行了全面研究,为每种模型提供了合适的特征选择。音高特征的引入显著提高了识别性能,特别是对于涉及构音障碍语音的句子识别任务。通过对TORGO数据库的系统检查,我们证明了增强最先进的因子化时延神经网络(F-TDNN)模型识别构音障碍语音性能的潜力。使用F-TDNN模型实现的方法,与先前研究相比,在构音障碍语音的孤立词识别中获得了4.65%的相对改进,在句子识别中获得了4.63%的相对改进。这种改进有效补偿了语音变异性,这归因于我们精心选择了连续训练样本块之间的重叠帧数。

英文摘要

The challenge associated with recognizing dysarthric speech primarily arises from pronounced acoustic variability attributed to impaired articulatory precision. Past research has demonstrated improved recognition through the use of hybrid DNN/HMM sequence discriminative training. This paper presents a comprehensive investigation of various combinations of acoustic features tailored to different Acoustic Models, offering suitable feature selections for each. The incorporation of Pitch features notably improved recognition performance, especially for sentence recognition tasks involving dysarthric speech. Through a systematic examination of the TORGO database, we have demonstrated the potential to enhance the performance of the state-of-the-art Factorized Time Delay Neural Network (F-TDNN) model for recognizing dysarthric speech. Our methods, implemented with the F-TDNN model, resulted in a 4.65% relative improvement in isolated word recognition and a 4.63% relative improvement in sentence recognition for dysarthric speech, compared to previous research. This improvement effectively compensates for speech variability, attributable to our deliberate selection of the number of overlapping frames between consecutive training example chunks.

2606.19812 2026-06-19 cs.AI cs.LG 交叉投稿

Human-on-the-Loop Orchestration for AI-Assisted Legal Discovery

AI辅助法律发现中的人机协同编排

Anushree Sinha, Srivaths Ranganathan, Abhishek Dharmaratnakar, Debanshu Das

AI总结 针对AI代理在电子取证中因多步推理错误导致的法律风险,提出一种四层验证架构,通过人机协同阈值减少特权豁免风险达61%。

详情
AI中文摘要

自主大语言模型(LLM)代理越来越多地部署于电子发现(e-discovery),其中跨多步推理链的复合错误可能构成法律渎职。与单轮检索不同,在特权文档语料库上运行的代理工作流表现出我们称之为“轨迹崩溃”的一类失败:早期错误分类无声传播,导致整个特权审查失效。本文做出三项贡献。首先,我们提出一个按功能阶段组织的法律信息检索中代理失败的结构化分类法。其次,我们引入一个四层验证架构——涵盖规划、推理、执行和不确定性量化——旨在这些失败复合之前拦截它们。第三,我们在一个合成电子取证语料库上进行初步模拟研究,展示强制性人机协同(HOTL)升级阈值如何相对于完全自主基线降低特权豁免风险。我们的结果表明,与完全自主部署相比,校准的不确定性阈值可将特权豁免风险降低高达61%,同时将不到四分之一的文档路由给律师审查。

英文摘要

Autonomous Large Language Model (LLM) agents are increasingly deployed in electronic discovery (e-discovery), where compounding errors across multi-step reasoning chains can constitute legal malpractice. Unlike single-turn retrieval, agentic workflows operating over privileged document corpora exhibit a class of failure we term "trajectory collapse": an early misclassification silently propagates, rendering an entire privilege review invalid. This paper makes three contributions. First, we propose a structured taxonomy of agentic failures in legal information retrieval, organized by functional stage. Second, we introduce a four-layer verification architecture -- spanning planning, reasoning, execution, and uncertainty quantification -- designed to intercept these failures before they compound. Third, we present a preliminary simulation study on a synthetic e-discovery corpus that demonstrates how mandatory Human-on-the-Loop (HOTL) escalation thresholds reduce privilege-waiver risk relative to fully autonomous baselines. Our results suggest that calibrated uncertainty thresholds can reduce privilege-waiver risk by up to 61% versus fully autonomous deployment, while routing fewer than one quarter of documents to attorney review.
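
The escalation idea can be sketched as a simple routing rule. This is a toy illustration only: the abstract does not describe how uncertainty is computed or calibrated, so the per-document confidence scores, the threshold value, and the function names below are all hypothetical.

```python
from typing import List, Tuple


def route_documents(confidences: List[float], threshold: float) -> Tuple[List[int], List[int]]:
    """Split document indices into (handled autonomously, escalated to attorney review)."""
    auto: List[int] = []
    escalated: List[int] = []
    for index, confidence in enumerate(confidences):
        if confidence >= threshold:
            auto.append(index)        # agent's privilege call is accepted
        else:
            escalated.append(index)   # uncertain call goes to a human reviewer
    return auto, escalated


def escalation_rate(confidences: List[float], threshold: float) -> float:
    """Fraction of documents routed to human review at the given threshold."""
    if not confidences:
        return 0.0
    _, escalated = route_documents(confidences, threshold)
    return len(escalated) / len(confidences)


# Invented confidence scores for eight documents; only one falls below 0.9.
scores = [0.99, 0.95, 0.60, 0.97, 0.91, 0.98, 0.93, 0.96]
auto_ids, review_ids = route_documents(scores, threshold=0.9)
rate = escalation_rate(scores, threshold=0.9)
```

Raising the threshold sends more documents to review and lowers the chance that an early misclassification propagates unnoticed; the trade-off between review load and privilege-waiver risk is what the reported simulation study examines.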

2606.19821 2026-06-19 cs.AI cs.LG 交叉投稿

TelcoAgent: A Scalable 5G Multi-KPM Forecasting With 3GPP-Grounded Explainability

TelcoAgent: 具备基于3GPP的可解释性的可扩展5G多KPM预测

Geon Kim, Dara Ron, Sukhdeep Singh, Suyog Moogi, Pranshav Gajjar, V V N K Someswara Rao Koduri, Een Kee Hong, Vijay K. Shah

发表机构 NextG Wireless Lab, North Carolina State University(北卡罗来纳州立大学下一代无线实验室); Kyung Hee University(庆熙大学)

AI总结 提出TelcoAgent框架,利用基础模型实现多KPM的零样本预测,通过3GPP知识图谱和可解释性管道提供可操作诊断。

Comments 6 pages, 6 figures. Submitted to IEEE GLOBECOM 2026

详情
AI中文摘要

关键性能测量(KPM)预测对于5G及下一代电信网络的主动网络管理至关重要。然而,现有的机器学习(ML)方法在可扩展性和可解释性方面存在显著局限性,限制了其在实际部署中的有效性。我们提出TelcoAgent,一个基于基础模型的框架,能够在不需站点特定训练的情况下,跨不同网络单元实现多个KPM的准确、可扩展和可解释预测。具体而言,该框架包含三个关键组件:(i) 一个自动化的三智能体管道,直接从规范文档构建第三代合作伙伴计划(3GPP)知识图谱;(ii) 一个可扩展的基于时间序列基础模型(TSFM)的预测管道,以提供准确的零样本预测;以及(iii) 一个推理和解释管道,提供可操作的、领域基础的诊断。使用来自美国网络运营商的三个月真实城市级5G KPM数据集进行评估,TelcoAgent在200个单元中针对每个单元的7个KPM均展示了高预测准确性,同时提供了可解释的见解和可操作的指令来解决网络退化问题。

英文摘要

Key Performance Measurement (KPM) forecasting is essential for proactive network management of 5G and next-generation telecom networks. However, existing machine learning (ML) approaches face significant limitations in scalability and explainability, restricting their effectiveness in real-world deployments. We propose TelcoAgent, a foundation model-based framework that enables accurate, scalable, and explainable forecasting of multiple KPMs across diverse network cells without the need for site-specific training. Specifically, the framework comprises three key components: (i) an automated three-agent pipeline that constructs a 3rd Generation Partnership Project (3GPP) knowledge graph directly from specification documents, (ii) a scalable, time-series foundation model (TSFM)-based prediction pipeline to deliver accurate, zero-shot forecasting, and finally (iii) a reasoning and explanation pipeline that provides actionable, domain-grounded diagnostics. Evaluated using a 3-month, real-world, city-scale 5G KPM dataset from a U.S.-based network operator, TelcoAgent demonstrates high forecasting accuracy for all 7 considered KPMs per cell across 200 cells, while delivering explainable insights and actionable instructions to address network degradations.

2606.19823 2026-06-19 eess.AS cs.LG 交叉投稿

Low-Burden Data Augmentation for Dysarthric ASR via Zero-Shot Voice Cloning

低负担数据增强:通过零样本语音克隆改善构音障碍语音识别

Satwinder Singh, Qianli Wang, Zihan Zhong, Clarion Mendes, Hasegawa-Johnson, Waleed Abdulla, Seyed Reza Shahamiri

AI总结 针对构音障碍语音数据稀缺和变异性大的问题,提出使用零样本语音克隆(Higgs Audio V2)生成合成数据,微调Whisper-medium模型,在TORGO数据集上达到与真实数据微调相近的词错误率,并显著降低数据收集成本。

Comments Accepted to Interspeech 2026, Sydney, Australia

详情
AI中文摘要

由于数据稀缺和说话人之间高度变异,自动语音识别对于构音障碍语音仍然不可靠。虽然合成数据可以弥补这些不足,但传统方法通常需要大量的说话人特定数据,重新引入了数据收集瓶颈。我们研究零样本语音克隆作为一种低负担的增强策略,使用Higgs Audio V2克隆TORGO数据集中的说话人。我们在克隆数据、真实数据和混合数据上微调Whisper-medium,并在保留的真实语音上进行评估。与零样本基线(31.62%)相比,克隆数据微调实现了具有竞争力的26.00%词错误率,几乎与真实数据微调(24.44%)和混合数据微调(25.12%)相当。值得注意的是,对于中重度构音障碍说话人,克隆和混合微调优于真实数据微调。在SAP-1102上的跨语料库评估中,克隆微调取得了最佳结果(相对提升11.45%)。这些结果表明,零样本克隆提供了可扩展的训练数据,绕过了昂贵的数据收集瓶颈。

英文摘要

Automatic speech recognition remains unreliable for dysarthric speech due to data scarcity and high inter-speaker variability. While synthetic data can address these gaps, traditional methods often require extensive speaker-specific data, reintroducing the collection bottleneck. We investigate zero-shot voice cloning as a low-burden augmentation strategy, using Higgs Audio V2 to clone speakers in the TORGO dataset. We fine-tune (FT) Whisper-medium on cloned, real, and hybrid data and evaluate on held-out real speech. Compared to the zero-shot baseline (31.62%), Clone FT achieved a competitive 26.00% word error rate (WER), nearly matching the 24.44% and 25.12% seen with Real and Hybrid FT, respectively. Notably, Clone and Hybrid FT outperform Real FT for moderate-severe speakers. Clone FT achieves the best results (11.45% relative improvement) in cross-corpus evaluation on SAP-1102. These results suggest that zero-shot cloning provides scalable training data that circumvents the costly data collection bottleneck.
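
As a reading aid for the numbers above, the following minimal sketch shows how a word error rate is typically computed (word-level edit distance divided by reference length) and how the reported absolute WERs translate into relative reductions over the zero-shot baseline. The function names are illustrative and not taken from the paper; the paper's exact text normalization and scoring tool are not stated in the abstract.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the processed reference prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, system: float) -> float:
    """Relative WER reduction of `system` with respect to `baseline`."""
    return (baseline - system) / baseline


# WERs (in %) reported in the abstract for the TORGO evaluation.
reported = {"Clone FT": 26.00, "Real FT": 24.44, "Hybrid FT": 25.12}
zero_shot = 31.62
for name, wer in reported.items():
    print(f"{name}: {100 * relative_reduction(zero_shot, wer):.1f}% relative reduction")
```

On the TORGO figures this gives roughly 17.8% (Clone FT), 22.7% (Real FT), and 20.6% (Hybrid FT) relative reduction; the 11.45% figure in the abstract refers to the separate cross-corpus evaluation and is not recomputed here.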

2606.19852 2026-06-19 cs.CL cs.LG 交叉投稿

Prompt, Plan, Extract: Zero-Shot Agentic LLMs Workflows for Lung Pathology Extraction from Clinical Narratives

提示、规划、提取:用于从临床叙述中提取肺部病理学的零样本智能体LLM工作流

Aman Pathak, Cheng Peng, Mengxian Lyu, Ziyi Chen, Reema Solan, Sankalp Talankar, Yasir Khan, Hiren Mehta, Aokun Chen, Yi Guo, Yonghui Wu

AI总结 提出零样本智能体工作流,利用开源大语言模型从肺切除病理报告中提取13个CAP字段,在无训练下达到0.893 Micro-F1,接近监督方法。

Comments 7 pages, 2 figures, 3 tables. Affiliations: (1) Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL, USA; (2) Division of Pulmonary, Critical Care and Sleep Medicine, Department of Medicine, College of Medicine, University of Florida, Gainesville, FL, USA; (3) College of Nursing, Florida State University, Tallahassee, FL, USA

详情
AI中文摘要

从病理报告中提取信息对于癌症分期和肿瘤登记数据填报至关重要。然而关键数据仍嵌入在叙述性报告中,使得手动提取劳动密集且易出错。传统的监督自然语言处理流程通过完全监督的命名实体识别和关系提取来解决这一问题,但需要昂贵的人工标注,并且当上游实体缺失时会出现级联故障。在本研究中,我们开发了一个零样本智能体工作流,并评估了五个开源生成式大语言模型(LLMs),以从肺切除病理报告中填充13个美国病理学家学会的概要式报告字段。我们使用一种新颖的、与肿瘤登记对齐的评估框架,将它们与最先进的监督GatorTron NER-RE基线进行比较。基线达到了0.960的Micro-F1,而最佳零样本模型(GPT-OSS-20B)达到了0.893的Micro-F1(召回率:0.949),在没有任务特定训练的情况下准确提取了复杂关系(如病理分期)。这些结果表明,开源零样本智能体LLMs是提取肺部病理信息的低成本解决方案。

英文摘要

Information extraction from pathology reports is essential for cancer staging and tumor registry population. Yet key data remains embedded in narrative reports, making manual extraction labor-intensive and error-prone. Traditional supervised Natural Language Processing pipelines address this through fully supervised Named Entity Recognition and Relation Extraction, but require expensive manual annotation and suffer cascading failures when upstream entities are missed. In this study, we developed a zero-shot, agentic workflow, and evaluated five open-source generative Large Language Models (LLMs) to populate 13 College of American Pathologists synoptic fields from lung resection pathology reports. We compared them against a state-of-the-art supervised GatorTron NER-RE baseline using a novel, registry-aligned evaluation framework. The baseline achieved a Micro-F1 of 0.960, while the best zero-shot model (GPT-OSS-20B) achieved a Micro-F1 of 0.893 (recall: 0.949), accurately extracting complex relations like Pathologic Stage without task-specific training. These results suggest that open-source, zero-shot agentic LLMs are a low-cost solution for extracting lung pathology information.

2606.19895 2026-06-19 math.NA cs.LG cs.NA 交叉投稿

A fast direct solver based neural network for solving PDEs

基于快速直接求解器的神经网络求解偏微分方程

Jashwanth Reddy Kadaru, Vaishnavi Gujjula

AI总结 提出一种学习HODLR矩阵逆运算的神经网络,并扩展为非线性PDE求解算子,实验表明在多种PDE上高效且泛化良好。

Comments 26 pages, 7 Figures, 5 Tables

详情
AI中文摘要

大规模$N$体问题产生的矩阵可以使用层次矩阵高效表示,其关键思想是允许跨矩阵分区层次结构的可接受非对角子矩阵可以通过低秩矩阵很好地近似。HODLR(层次非对角低秩)矩阵是层次矩阵的一个子类,其中递归二分划分的每一级的所有非对角子矩阵都是低秩的。本文提出一种神经网络,基于Ambikasaran和Darve(2013)开发的HODLR矩阵快速直接求解器,学习HODLR矩阵的逆运算。我们进一步通过将部分线性层替换为深度子网络,扩展该架构以学习与PDE相关的非线性解算子。我们通过进行一组全面的实验来展示所提出架构的性能,包括(i)求解线性问题,如第二类Fredholm积分方程,(ii)求解PDE,如非线性薛定谔方程、Burgers方程和稳态达西流方程,(iii)跨不同参数值的泛化研究,(iv)将所提出网络的推理时间与经典数值求解器的运行时间进行比较,以及(v)将所提出网络与一些现有的神经算子学习网络进行比较。

英文摘要

The matrices arising from large scale $N$-body problems can be efficiently represented using hierarchical matrices, whose key idea is that the admissible off-diagonal sub-matrices can be well approximated by low-rank matrices across a hierarchy of matrix partitions. HODLR (Hierarchical Off-Diagonal Low-Rank) matrices are a subclass of hierarchical matrices in which all off-diagonal submatrices at every level of a recursive binary partition are low-rank. In this article, we present a neural network that learns the inverse operation of HODLR matrices based on the fast direct solver for HODLR matrices developed by Ambikasaran and Darve (2013). We further extend the architecture to learn nonlinear solution operators associated with PDEs by replacing some of the linear layers with deep sub-networks. We demonstrate the performance of the proposed architecture by performing a comprehensive set of experiments that include (i) solving a linear problem such as the Fredholm integral equation of the second kind, (ii) solving PDEs such as the nonlinear Schrödinger equation, Burgers' equation, and the steady-state Darcy's flow equation, (iii) generalization study across varying parameter values, (iv) comparing the inference time of the proposed network with the run time of a classical numerical solver, and (v) comparing the proposed network with some of the existing neural operator learning networks.

2606.19912 2026-06-19 math.NA cs.LG cs.NA physics.comp-ph 交叉投稿

Structure-Oriented Randomized Neural Networks for Poisson-Nernst-Planck and Poisson-Nernst-Planck-Navier-Stokes Systems

面向结构的随机神经网络用于泊松-能斯特-普朗克和泊松-能斯特-普朗克-纳维-斯托克斯系统

Yunlong Li, Fei Wang

AI总结 提出结构导向随机神经网络(SO-RaNN)框架,通过解耦线性化子问题、逐点截断保持浓度正性、离散质量缩放因子和SAV后处理修正,实现PNP和PNP-NS系统的高效求解,并理论推导残差估计和收敛性。

详情
AI中文摘要

我们开发了一种面向结构的随机神经网络框架,称为SO-RaNN,用于泊松-能斯特-普朗克(PNP)系统和泊松-能斯特-普朗克-纳维-斯托克斯(PNP-NS)系统。解耦的线性化子问题通过随机神经网络在时空框架中迭代求解。对于浓度变量,使用逐点截断在数值层面强制正性,并在选定的修正时刻计算离散质量缩放因子并在时间上插值,以确保在这些时刻精确匹配质量并促进近似质量守恒。为了引入辅助离散耗散机制,我们进一步采用SAV型后处理修正,该修正使得SAV辅助变量在理想SAV更新下具有单调性。对于PNP-NS系统,使用保结构随机神经网络(SP-RaNN)处理速度场,使得速度近似通过构造满足逐点不可压缩约束。在理论方面,我们推导了线性化子问题的原始未修正RaNN求解器的残差估计,为PNP系统的原始外Picard迭代制定了条件性局部时间收敛结果,并分析了数值层面的正性修正以及质量修正和SAV后处理步骤。对于PNP-NS系统,我们建立了SP-RaNN空间的逼近结果,并给出了相应线性化Oseen型问题的条件性误差陈述。数值实验展示了源驱动制造测试中的逼近精度,并说明了预期中的数值层面正性修正、选定时刻质量匹配、基于最终规范固定势的计算自由能曲线以及基准测试中的无散度逼近。

英文摘要

We develop a structure-oriented randomized neural network framework, termed SO-RaNN, for the Poisson-Nernst-Planck (PNP) system and the Poisson-Nernst-Planck-Navier-Stokes (PNP-NS) system. The decoupled linearized subproblems are solved iteratively by randomized neural networks in a space-time framework. For the concentration variables, a pointwise cut-off is used to enforce positivity at the value level, and discrete mass-scaling factors are computed at selected correction instants and interpolated in time, so as to ensure exact mass matching at those instants and to promote approximate mass preservation between them. To introduce an auxiliary discrete dissipation mechanism, we further employ an SAV-type post-processing correction, which yields monotonicity of the SAV auxiliary variable under the ideal SAV update. For the PNP-NS system, a structure-preserving randomized neural network (SP-RaNN) is used for the velocity field, so that the velocity approximation satisfies the incompressibility constraint pointwise by construction. On the theoretical side, we derive residual-based estimates for the raw, uncorrected RaNN solvers of the linearized subproblems, formulate a conditional local-in-time convergence result for the raw outer Picard iteration of the PNP system, and analyze the value-level positivity correction together with the mass-correction and SAV post-processing steps. For the PNP-NS system, we establish an approximation result for the SP-RaNN space and provide a conditional error statement for the corresponding linearized Oseen-type problem. Numerical experiments demonstrate approximation accuracy in the source-driven manufactured tests and illustrate the intended value-level positivity correction, selected-time mass matching, computed free-energy curves based on the final gauge-fixed potential, and divergence-free approximation in benchmark tests.
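
The value-level positivity cut-off and the discrete mass scaling described above can be pictured with a short sketch at a single correction instant. Everything here is an assumption made for illustration: the function name, the quadrature weights, and the choice of a single global scaling factor are hypothetical, and the interpolation of scaling factors in time mentioned in the abstract is not shown.

```python
def correct_concentration(values, weights, target_mass):
    """Clip negative values to zero, then rescale to match `target_mass`.

    values      -- raw network outputs at quadrature points
    weights     -- positive quadrature weights (hypothetical discretization)
    target_mass -- discrete mass to be matched at this correction instant
    """
    clipped = [max(v, 0.0) for v in values]              # pointwise cut-off
    mass = sum(w * c for w, c in zip(weights, clipped))  # discrete mass after clipping
    if mass <= 0.0:
        raise ValueError("clipped concentration has zero mass")
    scale = target_mass / mass                           # mass-scaling factor
    return [scale * c for c in clipped], scale


# Toy data: one slightly negative raw value, uniform weights on four points.
raw = [0.8, -0.05, 1.2, 0.4]
w = [0.25, 0.25, 0.25, 0.25]
corrected, factor = correct_concentration(raw, w, target_mass=0.5)
```

Because the scaling factor is positive, the rescaling keeps the clipped values non-negative, so positivity and mass matching hold together at the chosen instant.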

2606.19947 2026-06-19 quant-ph cs.LG 交叉投稿

QMaxCal: Path-Space Regularization for Open Quantum Control via Girsanov's Theorem

QMaxCal: 基于 Girsanov 定理的开放量子系统控制路径空间正则化

Merijn Moody, Zier Mensch, Miranda C. N. Cheng, Peter G. Bolhuis, Max Welling

AI总结 针对开放量子系统退相干问题,利用 Girsanov 定理推导 KL 散度的可微估计器,提出两种正则化项以最小化退相干影响,在多种量子系统中优于未正则化的梯度方法和强化学习基线。

Comments 26 pages, 6 figures. ICML 2026 AI4Physics Workshop

详情
AI中文摘要

在存在退相干的条件下,可靠的量子控制需要能够对抗环境噪声对受控动力学影响的策略。连续监测下的开放量子系统产生经典测量记录,其漂移依赖于系统所经历的噪声;共享相同退相干通道的两个演化的记录仅在此漂移上有所不同,因此 Girsanov 定理给出了它们轨迹分布之间 KL 散度的闭式、可微估计器。我们用两个物理动机的参考度量实例化该估计器,得到两个正则化项,它们都将系统驱动到退相干效应最小的状态:Wiener KL (KL_W),在噪声模型的某些条件下经验上更有效;以及漂移方差正则化项 (R_DV),适用于所有噪声模型。两者在性质上不同于现有的控制通量或平滑性惩罚:它们惩罚控制对退相干通道的可观测后果,而非控制幅度本身。这些正则化项在一系列开放量子系统中优于未正则化的基于梯度和强化学习的基线——包括单量子比特和多量子比特基准测试,以及一个校准到已发表的 IBM Kingston 处理器快照的多量子比特链——在多个评估维度上:最终态保真度、对假设噪声模型失配的鲁棒性(在训练噪声下增益从 +17 个百分点增长到 2.5 倍噪声失配下的 +27 个百分点),以及禁止态的占据。正则化项将不保真度降低高达 50%,在校准的 IBM Kingston 链上获得约 16% 的增益。

英文摘要

Reliable quantum control in the presence of decoherence requires policies that combat the effect of environmental noise on the controlled dynamics. Open quantum systems under continuous monitoring generate classical measurement records whose drift depends on the noise experienced by the system; the records of two evolutions sharing the same decoherence channels differ only in this drift, so Girsanov's theorem yields a closed-form, differentiable estimator of the KL divergence between their trajectory distributions. We instantiate this estimator with two physically motivated reference measures, yielding two regularizers that both drive the system toward states where the effects of decoherence are minimal: the Wiener KL (KL_W), which is empirically more effective under certain conditions on the noise model, and the drift-variance regularizer (R_DV), which works for all noise models. Both are qualitatively distinct from existing penalties on control fluence or smoothness: they penalize the observable consequences of control on the decoherence channels rather than the control amplitude itself. The regularizers outperform unregularized gradient-based and reinforcement-learning baselines across a range of open quantum systems -- including single- and multi-qubit benchmarks and a multi-qubit chain calibrated to a published snapshot of the IBM Kingston processor -- along several axes of evaluation: final-state fidelity, robustness to mismatch in the assumed noise model (gains grow from +17 pp at training noise to +27 pp under 2.5x noise mismatch), and occupation of forbidden states. The regularizers reduce infidelity by up to 50%, with ~16% gains on the calibrated IBM Kingston chain.
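
For intuition about why Girsanov's theorem gives a closed-form quantity here, consider the generic textbook case of two diffusion processes that share the same unit-variance noise and differ only in drift: the KL divergence between their path distributions is the expected time integral of half the squared drift difference. The sketch below evaluates a discretized version on one trajectory. It is a generic illustration under those assumptions, not the paper's estimator, and the reference measures behind KL_W and R_DV are not reproduced.

```python
def path_kl_estimate(drift_p, drift_q, dt):
    """Single-trajectory estimate of KL(P || Q) for two diffusions with unit noise.

    drift_p, drift_q -- drifts of the two processes evaluated at each time step
                        along a trajectory sampled from P
    dt               -- time step
    Computes 0.5 * sum_k (drift_p[k] - drift_q[k])**2 * dt.
    """
    return 0.5 * sum((a - b) ** 2 for a, b in zip(drift_p, drift_q)) * dt


# Sanity check with constant drifts 1.0 and 0.0 over a horizon T = 2.0,
# where the exact value is (1.0 - 0.0)**2 * T / 2.
n_steps, horizon = 1000, 2.0
kl = path_kl_estimate([1.0] * n_steps, [0.0] * n_steps, horizon / n_steps)
```

Since the expression is a smooth function of the drifts, it can be differentiated with respect to control parameters that enter the drift, which is what makes a term of this kind usable as a regularizer.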

2606.19966 2026-06-19 cs.CV cs.LG 交叉投稿

Semantic-Anchored Evidential Fusion for Domain-Robust Whole-Slide Survival Analysis

语义锚定证据融合用于域鲁棒的全切片生存分析

Yucheng Xing, Ling Huang, Pei Liu, Jingying Ma, Jiaqing Xu, Kai He, Mengling Feng

发表机构 * National University of Singapore(新加坡国立大学) Imperial College London(帝国理工学院) Hunan University(湖南大学)

AI总结 提出SAEFS框架,通过视觉问答提取语义锚点,结合双流证据提取和狄利克雷主观逻辑建模不确定性,实现跨域零样本生存分析,平均C-index提升10.2%。

详情
AI中文摘要

全切片图像(WSIs)广泛用于计算癌症预后。然而,现有方法主要关注域内性能,难以泛化到不同临床中心。这一局限性源于它们依赖像素级表示,极易受到染色协议和扫描硬件导致的域特定伪影影响。我们假设高级病理语义(如肿瘤分级和微环境结构)提供了域不变的语义表示,反映了人类病理学家的鲁棒诊断逻辑。因此,我们提出了语义锚定证据融合生存(SAEFS)框架,其中SAEFS通过视觉问答(VQA)从WSIs中推导语义锚点,采用双流WSI证据提取架构,使用基于狄利克雷的主观逻辑建模不确定性,并通过谨慎合取规则融合语义和视觉证据,以避免来自相关源的过度自信融合。仅在单一源域上训练并在四个未见域上进行零样本评估,SAEFS在预测准确性和可靠性上均一致优于最先进模型,平均C-index提升10.2%。定量分析进一步表明,VQA导出的语义特征比像素级特征表现出显著更低的跨中心差异,突显了其在跨中心临床应用中的鲁棒性。

英文摘要

Whole-slide images (WSIs) are widely used for computational cancer prognosis. However, most existing methods primarily focus on in-domain performance and fail to generalize across clinical centers. This limitation stems from their reliance on pixel-derived representations that are highly susceptible to domain-specific artifacts caused by staining protocols and scanner hardware. We hypothesize that high-level pathology semantics, such as tumor grade and micro-environmental architecture, provide a domain-invariant semantic representation that mirrors the robust diagnostic logic of human pathologists. Therefore, we propose a Semantic-Anchored Evidential Fusion Survival (SAEFS) framework, where SAEFS derives semantic anchors from WSIs via Visual Question Answering (VQA), employs a dual-stream WSI evidence extraction architecture, uses Dirichlet-based Subjective Logic to model uncertainty, and fuses semantic and visual evidence through a cautious conjunction rule to avoid overconfident fusion from correlated sources. Trained exclusively on one source domain and evaluated zero-shot across four unseen domains, SAEFS consistently outperforms state-of-the-art models both in prediction accuracy and reliability, improving the average C-index by 10.2%. Quantitative analyses further show that VQA-derived semantic features exhibit significantly lower cross-center divergence than pixel-derived features, highlighting their robustness for cross-center clinical applications.
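
The Dirichlet-based Subjective Logic step can be illustrated with the standard evidential mapping, in which non-negative evidence for each of K outcomes defines Dirichlet parameters and yields per-outcome belief masses plus an explicit uncertainty mass. This is a minimal sketch of that common formulation; whether SAEFS uses exactly this parameterization is an assumption, and the cautious conjunction rule used for fusion is not shown.

```python
def subjective_opinion(evidence):
    """Map non-negative evidence to (beliefs, uncertainty) via a Dirichlet.

    alpha_k = e_k + 1, S = sum(alpha), b_k = e_k / S, u = K / S,
    so that sum(b) + u == 1.
    """
    if any(e < 0 for e in evidence):
        raise ValueError("evidence must be non-negative")
    k = len(evidence)
    alpha = [e + 1.0 for e in evidence]   # Dirichlet concentration parameters
    strength = sum(alpha)                 # Dirichlet strength S
    beliefs = [e / strength for e in evidence]
    uncertainty = k / strength
    return beliefs, uncertainty


# No evidence gives total uncertainty; strong evidence shrinks it.
vacuous_beliefs, vacuous_u = subjective_opinion([0.0, 0.0, 0.0])
confident_beliefs, confident_u = subjective_opinion([8.0, 1.0, 1.0])
```

The uncertainty mass is what lets a fusion rule discount a stream that has little evidence instead of treating its output as a confident prediction.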

2606.20041 2026-06-19 econ.GN cs.AI cs.LG q-fin.EC q-fin.GN 交叉投稿

AI Economist Agent: An Agentic Framework for Model-Grounded Economic Analysis with RAG, Knowledge Graphs, and Large Language Models

AI经济学家代理:一种基于模型的经济分析代理框架,结合RAG、知识图谱和大语言模型

Masahiro Kato

AI总结 提出一种基于RAG的AI经济学家代理框架,利用知识图谱和大语言模型进行经济情景分析,通过代理规划、检索证据、选择模型并生成报告,提高经济叙事的连贯性和可追溯性。

详情
AI中文摘要

我们提出了一种基于模型的RAG型AI经济学家,具有用于经济情景分析的代理框架,使用大语言模型(LLMs)和知识图谱。虽然LLMs可以生成流畅的经济叙事,但经济学家通常需要做出基于经济理论和现实数据的经济主张。基于这一动机,本研究提出了一种基于RAG的AI经济学家,它利用包含经济数据和理论的知识图谱以及基于LLM的代理来规划分析、检索相关证据、选择合适的模型并生成报告。在我们的框架中,我们不直接仅使用语言模型产生定量主张;相反,我们生成基于显式模型计算的叙事,并通过AI代理与检索到的证据相关联。我们将我们的框架称为AI经济学家代理。我们在两个应用中评估了AI经济学家代理:为美国通胀持续性和美联储政策生成经济学家报告,以及为美国商业房地产再融资压力生成银行压力测试叙事。结果说明了为生成的报告提供模型与证据依据如何提高其经济连贯性和可追溯性。

英文摘要

We propose a model-grounded RAG-based AI economist with an agentic framework for economic scenario analysis using large language models (LLMs) and knowledge graphs. While LLMs can generate fluent economic narratives, economists are often required to make economic claims grounded in economic theory and real-world data. Based on this motivation, this study proposes a RAG-based AI economist, which utilizes knowledge graphs including economic data and theory and LLM-based agents to plan the analysis, retrieve relevant evidence, select appropriate models, and generate reports. In our framework, we do not produce quantitative claims directly with the language model alone; instead, we generate narratives grounded in explicit model-based computations and linked to the retrieved evidence via AI agents. We refer to our framework as an AI economist agent. We evaluate the AI economist agent in two applications: economist report generation for U.S. inflation persistence and Federal Reserve policy, and bank stress-test narrative generation for U.S. commercial real estate refinancing stress. The results illustrate how grounding the generated reports improves their economic coherence and traceability.

2606.20118 2026-06-19 cs.RO cs.LG 交叉投稿

Pose6DAug: Physically Plausible Multi-view Object Swapping for Robot Data Augmentation

Pose6DAug: 用于机器人数据增强的物理合理多视图物体替换

Jonghoon Lee, Seong Hyeon Park, Byungwoo Jeon, Minha Lee, Jinwoo Shin

AI总结 提出Pose6DAug,一种基于失败驱动的数据增强框架,通过3D网格和6D姿态轨迹替换成功轨迹中的物体,生成多视图一致的物理合理演示,无需额外数据收集,在新型物体上提升VLA策略成功率16.5%。

详情
AI中文摘要

视觉-语言-动作(VLA)策略在通用操作中展现出强大潜力,但在外观或几何形状偏离训练分布的新型分布外物体上常常失败。标准的补救措施是为每个失败案例收集多视图遥操作数据,但这在成本和时间上扩展性差。我们提出Pose6DAug,一种失败驱动的数据增强框架,将策略自身的成功回合转化为针对其失败模式的目标演示,无需任何新数据收集。我们的关键洞察是,每个成功回合已经编码了一个物理有效的动作轨迹以及校准的多视图观测。通过仅替换被操作物体同时保留该轨迹,我们获得新的且物理基础的演示。然而,简单的2D视频编辑会破坏多视图一致性和物理合理性,特别是在严重遮挡和以自我为中心的视角下。我们的方法直接在3D中操作,通过时间一致的6D姿态轨迹驱动的显式网格锚定目标物体,确保所有相机视图的几何一致渲染。在我们方法增强的数据上微调VLA,相对于最先进的基线,在新型物体上的成功率提高了16.5%,同时保持了分布内性能。这些结果表明,多视图和物理一致的增强是实现可扩展VLA泛化的实用途径。

英文摘要

Vision-language-action (VLA) policies have shown strong potential for general-purpose manipulation, yet they often fail on novel, out-of-distribution objects whose appearance or geometry deviates from the training distribution. The standard remedy is to collect multi-view teleoperation data for every failure case, but this scales poorly in both cost and time. We introduce Pose6DAug, a failure-driven data augmentation framework that turns a policy's own successful episodes into targeted demonstrations for its failure modes, without any new data collection. Our key insight is that each successful episode already encodes a physically valid action trajectory together with calibrated multi-view observations. By swapping only the manipulated object while preserving this trajectory, we obtain new and physically grounded demonstrations. However, naive 2D video editing breaks multi-view consistency and physical plausibility, particularly under heavy occlusion and egocentric viewpoints. Our method instead operates directly in 3D, anchoring the target object with an explicit mesh driven by a temporally coherent 6D pose trajectory, ensuring geometrically consistent renderings across all camera views. Fine-tuning a VLA on data augmented by our method improves success rates by 16.5% relative to the state-of-the-art baseline on novel objects, while preserving in-distribution performance. These results show that multi-view and physically consistent augmentation is a practical path to scalable VLA generalization.

2606.20137 2026-06-19 eess.AS cs.CL cs.LG cs.SD 交叉投稿

PASQA: Pitch-Accent-Focused Speech Quality Assessment Model Trained on Synthetic Speech with Accent Errors

PASQA:针对重音错误的合成语音训练的以音高重音为中心的语音质量评估模型

Masaya Kawamura, Yuma Shirahata, Kentaro Mitsui, Reo Shimizu

AI总结 提出PASQA模型,通过可控重音合成数据集和伪重音质量分数,结合自监督表示、摩拉条件融合等训练策略,有效评估音高重音正确性,优于传统MOS模型。

Comments Accepted to INTERSPEECH 2026

详情
AI中文摘要

现有的平均意见得分(MOS)预测模型通常预测话语级别的自然度MOS,并且可能对局部音高重音错误不敏感。我们提出了以音高重音为中心的语音质量评估(PASQA),明确针对音高重音正确性。为了训练我们的模型,我们使用重音可控的文本转语音系统通过改变重音模式构建了一个受控的日语重音错误数据集,并根据重音错误率计算伪重音质量得分。PASQA建立在自监督表示的基础上,并采用摩拉条件融合、排序损失、辅助重音错误定位任务和说话者不变训练。实验表明,传统模型无法保持按重音错误严重程度的排序,而PASQA在已见和未见说话者上都实现了高排序准确性。此外,PASQA与人类重音正确性判断的一致性更强。代码可在以下网址获取:https://github.com/lycorp-jp/PASQA。

英文摘要

Existing mean opinion score (MOS) prediction models typically predict utterance-level naturalness MOS and can be insensitive to localized pitch-accent errors. We propose Pitch-Accent-focused Speech Quality Assessment (PASQA), which explicitly targets pitch-accent correctness. To train our model, we construct a controlled Japanese accent-error dataset by changing accent patterns using an accent-controllable text-to-speech system, and compute a pseudo accent-quality score from the accent-error rate. PASQA builds on self-supervised representations and employs mora-conditioned fusion, ranking loss, an auxiliary accent-error localization task, and speaker-invariant training. Experiments show that conventional models fail to preserve the ordering by accent-error severity, whereas PASQA achieves high ordering accuracy on both seen and unseen speakers. Further, PASQA shows stronger agreement with human accent-correctness judgments. The code is available at https://github.com/lycorp-jp/PASQA.

2606.20138 2026-06-19 cs.AI cs.CL cs.HC cs.LG 交叉投稿

Learning to Prompt: Improving Student Engagement with Adaptive LLM-based High-School Tutoring

学习提示:基于自适应LLM的高中辅导提升学生参与度

Po-Chin Chang, Nicholas Hogan, Aske Plaat, Michiel T. van der Meer

发表机构 * Leiden University(莱顿大学) FutureWhiz

AI总结 提出一种基于14个教学特征的主题感知提示路由模型,通过模拟训练和在线A/B测试,在高中辅导中实现自适应策略切换,提高教学效率并减少交互轮次。

详情
AI中文摘要

LLMs可以个性化教育,尽管当前的静态提示辅导系统难以适应不同的学科。我们开发并测试了一个具有主题感知提示的系统,该系统基于从原始转录中提取的14个教学特征(例如,辅导支架、学生理解)。我们首先在模拟环境中训练一个提示路由模型,然后将其部署到实际高中学生的在线适应中。模拟基准测试显示,路由器的性能优于两个静态基线($0.694$ vs. $0.647$ 和 $0.64$, $p<0.001$)。A/B测试($N=656$ 次对话,来自359名学生)显示了从模拟到现实的迁移,其中模型从分析策略切换到支架学习策略。我们的自适应提示选择机制提高了教学效率,保持了教学质量,并减少了约3轮交互($p=0.007$)。虽然贪婪路由器的练习转化率与基线相当($19.1\%$ vs. $19.6\%$),但随机采样策略的随机路由器实现了更高的转化率($28.1\%$)。

英文摘要

LLMs can personalize education, although current static-prompt tutoring systems struggle to adapt to diverse academic disciplines. We develop and test a system with subject-aware prompting, based on 14 pedagogical features (e.g., tutor scaffolding, student understanding) extracted from raw transcripts. We first train a prompt routing model in a simulation environment, and then deploy it for online adaptation with actual high-school students. The simulation benchmark shows the router outperforming two static baselines ($0.694$ vs. $0.647$ and $0.64$, $p<0.001$). A/B testing ($N=656$ conversations from 359 students) shows sim-to-real transfer where the model switches from analytical to scaffolding learning strategies. Our adaptive prompt selection mechanism improves instructional efficiency, maintains pedagogical quality, and reduces interactions by around 3 turns ($p=0.007$). While a greedy router achieves an exercise conversion rate comparable to the baseline ($19.1\%$ vs. $19.6\%$), a stochastic router that samples strategies leads to a higher conversion rate ($28.1\%$).

2606.20164 2026-06-19 cs.CL cs.AI cs.LG q-bio.QM 交叉投稿

MedRLM: Recursive Multimodal Health Intelligence for Long-Context Clinical Reasoning, Sensor-Guided Screening, Evidence-Grounded Decision Support, and Community-to-Tertiary Referral Optimization

MedRLM:用于长上下文临床推理、传感器引导筛查、证据支持决策及社区到三级转诊优化的递归多模态健康智能

Aueaphum Aueawatthanaphisut

发表机构 * School of Information, Computer Communication Technology, Sirindhorn International Institute of Technology, Thammasat University, Pathum Thani, Thailand

AI总结 提出MedRLM递归多模态健康智能框架,通过递归检查、分解、检索、验证和合成患者信息,协调多个专业代理并引入临床证据图记忆,实现长上下文临床推理和传感器引导筛查。

Comments 9 pages, 3 figures, 3 tables, 1 Algorithm, 29 equations

详情
AI中文摘要

现实世界的临床决策支持需要对异质性和纵向的患者信息进行推理,而不是回答孤立的医学问题。然而,当前的医学大语言模型和检索增强生成系统通常依赖单步提示或检索,当临床证据分布在长电子健康记录、医学图像、传感器流、指南和转诊约束中时,这可能变得脆弱。本文提出MedRLM,一个用于长上下文临床推理、传感器引导筛查和社区到三级转诊支持的递归多模态健康智能框架。MedRLM不是将所有患者信息压缩到一个提示中,而是将患者病例视为一个外部临床环境,可以递归地检查、分解、检索、验证和综合。该框架协调了专门用于临床文本、纵向EHR、医学影像、生理传感器信号、指南检索、不确定性审计和转诊规划的代理。它进一步引入了临床证据图记忆,将患者特定的观察结果与检索到的证据、标准化定义、传感器衍生的生物标志物和转诊标准连接起来。传感器引导的递归触发机制在检测到异常生理或行为模式时激活更深层次的推理,而不确定性门控细化支持临床医生对高风险或低置信度病例的审查。我们还概述了一个使用公共和经认证的临床数据集(涵盖EHR、放射学、ECG、ICU时间序列和转诊代理结果)的真实数据评估设计。MedRLM旨在将医学AI从静态问答转向可审计、多模态和流程感知的临床决策支持。

英文摘要

Real-world clinical decision support requires reasoning over heterogeneous and longitudinal patient information rather than answering isolated medical questions. However, current medical large language models and retrieval-augmented generation systems often rely on single-step prompting or retrieval, which can be fragile when clinical evidence is distributed across long electronic health records, medical images, sensor streams, guidelines, and referral constraints. This paper proposes MedRLM, a Recursive Multimodal Health Intelligence framework for long-context clinical reasoning, sensor-guided screening, and community-to-tertiary referral support. Instead of compressing all patient information into one prompt, MedRLM treats the patient case as an external clinical environment that can be recursively inspected, decomposed, retrieved, verified, and synthesized. The framework coordinates specialized agents for clinical text, longitudinal EHR, medical imaging, physiological sensor signals, guideline retrieval, uncertainty auditing, and referral planning. It further introduces a Clinical Evidence Graph Memory to connect patient-specific observations with retrieved evidence, standardized definitions, sensor-derived biomarkers, and referral criteria. A sensor-guided recursive triggering mechanism activates deeper reasoning when abnormal physiological or behavioral patterns are detected, while uncertainty-gated refinement supports clinician review for high-risk or low-confidence cases. We also outline a real-data evaluation design using public and credentialed clinical datasets spanning EHR, radiology, ECG, ICU time series, and referral-proxy outcomes. MedRLM aims to move medical AI from static question answering toward auditable, multimodal, and workflow-aware clinical decision support.

2606.20437 2026-06-19 hep-ex cs.LG 交叉投稿

HEPTv2: End-to-End Efficient Point Transformer for Charged Particle Reconstruction

HEPTv2:用于带电粒子重建的端到端高效点变换器

Siqi Miao, Shitij Govil, Jack P. Rodgers, Mia Liu, Javier Duarte, Shih-Chieh Hsu, Yuan-Tang Chou, Pan Li

AI总结 提出HEPTv2,一种端到端点变换器架构,通过局部敏感哈希编码和扇区化解码,无需图构建即可从探测器击中点直接重建粒子轨迹,在TrackML上以0.8%假率实现98.6%追踪效率,延迟仅15ms。

详情
AI中文摘要

带电粒子追踪(即从稀疏探测器测量中重建轨迹)是一个基础的高能物理推理问题,也是在极端组合歧义下学习的典型例子。在高亮度大型强子对撞机(HL-LHC)上,尽管碰撞密度前所未有,追踪必须保持准确和高效。图神经网络表现强劲,但图构建和处理带来了大量成本,而基于变换器的方法依赖辅助阶段,阻碍了端到端优化。为解决这一问题,我们提出了HEPTv2,一种端到端点变换器架构,在一个可训练管道中从探测器击中点重建轨迹。HEPTv2结合了局部感知点编码器和轨迹解码器,无需图构建、聚类或过滤即可预测完整轨迹。编码器在探测器坐标空间中使用局部敏感哈希,以保留追踪相关几何结构,同时实现高效的局部注意力。解码器通过扇区化解码和联合编码器-解码器监督下的直接击中到轨迹预测来消除歧义,使整个管道能够端到端优化。在TrackML上,HEPTv2以0.8%的假率实现了98.6%的双多数追踪效率,同时在NVIDIA A100 GPU上每个事件仅需约15毫秒推理时间和0.4 GB峰值内存。对于最多包含$5\times10^5$个击中点的事件,延迟和内存大致线性扩展。HEPTv2在精度-延迟权衡中建立了新的最先进水平,相比之前最强的变换器效率提升4.5%,相比优化的基于图管道提升1.1-2.2%,同时延迟分别降低7倍和38-52倍。这些结果表明,端到端变换器能够提供HL-LHC实时粒子重建所需的精度和效率。

英文摘要

Charged-particle tracking -- reconstructing trajectories from sparse detector measurements -- is a fundamental high-energy-physics inference problem and a canonical example of learning under extreme combinatorial ambiguity. At the High-Luminosity Large Hadron Collider (HL-LHC), tracking must remain accurate and efficient despite unprecedented collision densities. Graph neural networks perform strongly, but incur substantial costs from graph construction and processing, while transformer-based approaches rely on auxiliary stages that prevent end-to-end optimization. To address this, we present HEPTv2, an end-to-end point-transformer architecture that reconstructs tracks from detector hits in one trainable pipeline. HEPTv2 combines a locality-aware point encoder with a track decoder that predicts complete trajectories without graph-building, clustering, or filtering. The encoder uses locality-sensitive hashing in detector coordinate space to preserve tracking-relevant geometry while enabling efficient local attention. The decoder resolves ambiguities through sectorized decoding and direct hit-to-track prediction under joint encoder-decoder supervision, allowing the full pipeline to be optimized end-to-end. On TrackML, HEPTv2 achieves 98.6% double-majority tracking efficiency at a 0.8% fake rate, while requiring only $\sim$15 ms inference time and 0.4 GB peak memory per event on an NVIDIA A100 GPU. Latency and memory scale approximately linearly for events with up to $5\times10^5$ hits. HEPTv2 establishes a new state of the art in the accuracy-latency trade-off, improving efficiency by 4.5% over the strongest prior transformer and by 1.1--2.2% over optimized graph-based pipelines, while reducing latency by factors of 7 and 38--52, respectively. These results show that end-to-end transformers can deliver the accuracy and efficiency required for real-time particle reconstruction at the HL-LHC.
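
To make the role of locality-sensitive hashing concrete, here is a minimal sketch of one common scheme (random projections quantized into buckets) used to group nearby points so that attention can be restricted to each group. The hash family, bucket width, and toy coordinates are assumptions chosen for illustration; the abstract does not specify the hashing scheme HEPTv2 actually uses.

```python
import math
import random
from collections import defaultdict


def make_lsh(dim, n_hashes, bucket_width, seed=0):
    """Return a hash function built from `n_hashes` random projections."""
    rng = random.Random(seed)
    directions = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_hashes)]
    offsets = [rng.uniform(0.0, bucket_width) for _ in range(n_hashes)]

    def hash_point(point):
        # One integer per projection: floor((a . x + b) / w).
        return tuple(
            math.floor((sum(a_i * x_i for a_i, x_i in zip(a, point)) + b) / bucket_width)
            for a, b in zip(directions, offsets)
        )

    return hash_point


def bucket_hits(hits, hash_point):
    """Group hit indices by hash code; attention would be restricted to each group."""
    buckets = defaultdict(list)
    for index, hit in enumerate(hits):
        buckets[hash_point(hit)].append(index)
    return dict(buckets)


# Hypothetical 3D hit coordinates; real detector coordinates would differ.
hits = [(0.10, 0.20, 0.30), (0.11, 0.21, 0.29), (5.0, -3.0, 2.0), (5.1, -3.1, 2.1)]
hash_point = make_lsh(dim=3, n_hashes=2, bucket_width=4.0)
buckets = bucket_hits(hits, hash_point)
```

Hashing is a single pass over the hits, so the grouping cost grows linearly with the number of hits, which is consistent with the approximately linear scaling reported above.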

2606.20443 2026-06-19 eess.SY cs.LG cs.SY math.AT 交叉投稿

Topological Data Analysis for High-Dimensional Dynamic Process Monitoring

高维动态过程监测的拓扑数据分析

Angan Mukherjee, Tyler A. Soderstrom, Michael J. Kurtz, Victor M. Zavala

AI总结 提出结合拓扑数据分析和机器学习的方法,将多变量时间序列表示为流形,用拓扑描述符总结结构,并用神经常微分方程学习拓扑结构动态演化,实现高效事件检测。

详情
AI中文摘要

实时过程监测需要从高维时间序列数据中提取可操作信息的方法。在这项工作中,我们提出了一种新的过程监测方法,结合了拓扑数据分析(TDA)和机器学习工具。在所提出的方法中,我们将多变量时间序列数据表示为流形,并使用拓扑描述符来总结此类数据的结构;然后,我们使用神经常微分方程来学习系统拓扑结构的动态演化。使用来自工业过程的真实数据,我们表明这种基于轨迹的事件检测方法能有效检测多种类型的事件。我们将该方法与基于重构的方法(如主成分分析和自编码器)以及使用Koopman自编码器的基于轨迹的方法进行了对比。

英文摘要

Real-time process monitoring requires methods that extract actionable information from high-dimensional time-series data. In this work, we present a new approach for process monitoring that combines tools of topological data analysis (TDA) and machine learning. In the proposed approach, we represent multivariate time-series data as manifolds and use topological descriptors to summarize the structure of such data; we then use a neural ordinary differential equation to learn the dynamic evolution of the topological structure of the system. Using real data from an industrial process, we show that this trajectory-based event detection approach is effective at detecting diverse types of events. We contrast this approach against reconstruction-based approaches such as principal component analysis and autoencoders and against a trajectory-based approach that uses Koopman autoencoders.

2606.20451 2026-06-19 stat.ML cs.LG stat.AP stat.CO 交叉投稿

SSH-Net: A Deep Neural Network for Predicting Failure Time Distribution Functions under Competing Risks with Application to GPU Data

SSH-Net: 一种用于竞争风险下预测失效时间分布函数的深度神经网络及其在GPU数据上的应用

Jie Min, Yueyao Wang, Mengkun Chen

AI总结 提出结构化分段风险深度神经网络(SSH-Net),通过将网络结构与数据结构关联,允许不同协变量组通过子网络影响预测,在竞争风险框架下预测失效时间分布函数,仿真和GPU数据验证了准确性。

详情
AI中文摘要

竞争风险在工程领域常见,当应用场景复杂时会给时间事件数据建模带来挑战。近年来,深度神经网络因其灵活性和高学习能力在竞争风险预测中受到广泛关注。然而,神经网络结构的复杂性使得基于不同数据输入的超参数调优更加困难。此外,当工程系统具有多层级的复杂物理结构时,将所有结构层级视为单一输入组可能无法捕捉关键信息。为解决这些问题,我们提出了一种结构化分段风险深度神经网络(SSH-Net),用于在特定原因竞争风险框架下预测失效时间。我们的方法将神经网络结构与数据结构相关联,并允许不同的协变量组通过分离的子网络影响失效预测。神经网络基于特定原因竞争风险模型构建。SSH-Net输出特定原因风险函数,并采用惩罚对数似然作为损失函数。通过评估Brier分数、接收者操作特征曲线下面积(AUC)和预测的特定原因累积发生函数的均方根误差(RMSE),仿真研究验证了SSH-Net的预测准确性。我们进一步使用Titan GPU失效时间数据展示了模型预测失效时间分布函数的能力。

英文摘要

Competing risks are commonly observed in engineering fields and can bring challenges to time-to-event data modeling when the application scenarios are complicated. Recently, deep neural networks have received great attention for prediction with competing risks, due to their flexibility and high learning capability. However, the complexity of neural network structure brings extra difficulty in hyperparameter tuning based on different data inputs. Additionally, when an engineered system has complex physical structures with multiple hierarchical levels, treating all structural levels as a single group of inputs may fail to capture critical information. To address these issues, we propose a Structured Segmented Hazard Deep Neural Network (SSH-Net) for failure time prediction under a cause-specific competing risks framework. Our approach associates neural network structure with data structures, and allows different covariate groups to impact the failure prediction through separate sub-networks. The neural network is constructed based on a cause-specific competing risks model. The SSH-Net outputs cause-specific hazard functions, and utilizes the penalized log-likelihood as the loss function. The prediction accuracy of SSH-Net is validated through simulation studies by evaluating the Brier score, the area under the receiver operating characteristic curve (AUC), and the root mean square error (RMSE) of the predicted cause-specific cumulative incidence function. We further demonstrate the model's ability to predict failure time distribution functions using the Titan GPU failure time data.
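
The link between cause-specific hazards (the network output) and the cumulative incidence functions used for evaluation can be sketched in discrete time. This is the standard discrete-time relationship, shown under the assumption of interval-wise hazards; the paper's own time parameterization is not given in the abstract, and the function name is illustrative.

```python
def cumulative_incidence(hazards):
    """Turn discrete cause-specific hazards into cumulative incidence functions.

    hazards[t][k] -- probability of failing from cause k in interval t,
                     given survival up to the start of interval t
    Returns (cif, survival): cif[t][k] is the probability of having failed from
    cause k by the end of interval t; survival[t] is the probability of no failure.
    """
    n_causes = len(hazards[0])
    alive = 1.0
    running = [0.0] * n_causes
    cif, survival = [], []
    for h_t in hazards:
        total = sum(h_t)
        if not 0.0 <= total <= 1.0:
            raise ValueError("cause-specific hazards in an interval must sum to [0, 1]")
        # Failure from cause k in this interval requires surviving all earlier ones.
        running = [c + alive * h for c, h in zip(running, h_t)]
        alive *= 1.0 - total
        cif.append(running)
        survival.append(alive)
    return cif, survival


# Two competing failure causes over three intervals (toy numbers).
toy_hazards = [[0.10, 0.05], [0.20, 0.10], [0.15, 0.15]]
cif, survival = cumulative_incidence(toy_hazards)
```

At every interval the cause-wise incidences and the survival probability sum to one, which is a convenient consistency check on predicted hazards.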

2606.20504 2026-06-19 quant-ph cs.LG 交叉投稿

Entropy Estimation in Multi-Qutrit Systems via Variational and Classical Neural Networks

多qutrit系统中基于变分和经典神经网络的熵估计

Sai Sakunthala Guddanti, Anil Prabhakar, Ria Rushin Joseph

AI总结 本文系统研究了多qutrit量子系统中von Neumann熵的估计,采用变分量子算法和经典卷积神经网络两种方法,发现VQA适用于小系统,而CNN在大系统中更具可扩展性和鲁棒性。

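Details
AI abstract (translated from Chinese)

Using two complementary approaches, variational quantum algorithms (VQAs) and classical convolutional neural networks (CNNs), we carry out a systematic study of von Neumann entropy estimation in multi-qutrit quantum systems on an ideal (noise-free) quantum simulator. For systems of up to three qutrits, we construct and evaluate 11 hardware-efficient SU(3)-inspired ansatzes. A parameter sweep shows that, when sufficient entanglement is present, estimation accuracy is determined mainly by the number of trainable parameters. Based on this, we fix the parameter count at about 120 for subsequent experiments and observe that increasing the number of entangling gates beyond a threshold brings only marginal improvement. For larger systems (two to five qutrits), we use a CNN trained on measurement outcomes from tensor-product mutually unbiased bases. The model gives accurate and stable predictions, and its performance improves systematically with system size: errors are highest for two-qutrit systems and lowest for five-qutrit systems. Notably, only 12.5% of the measurements required for full state tomography are enough to bring the 90th-percentile absolute error to about 0.13-0.16 nats for four- and five-qutrit systems. The CNN is also robust to shot noise and generalizes well to out-of-distribution states. Overall, within the simulated settings we study, the results indicate a transition in practical methods: VQAs are effective for small systems, while CNN-based estimators offer better scalability and robustness for larger qutrit systems.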

English abstract

We present a systematic study of von Neumann entropy estimation in multi-qutrit quantum systems using two complementary approaches: variational quantum algorithms (VQAs) and classical convolutional neural networks (CNNs), evaluated using an ideal (noise-free) quantum simulator. For systems up to three qutrits, we construct and evaluate 11 hardware-efficient SU(3)-inspired ansatzes. A parameter sweep shows that estimation accuracy is primarily determined by the number of trainable parameters, provided sufficient entanglement is present. Based on this study, we fix the parameter count to approximately 120 for subsequent experiments, observing that increasing entangling-gate counts beyond a threshold yields only marginal improvements. For larger systems (two to five qutrits), we use a CNN trained on measurement outcomes from tensor-product mutually unbiased bases. The model achieves accurate and stable predictions and exhibits a systematic improvement in performance with system size, with the highest errors for two-qutrit systems and the lowest for five-qutrit systems. Notably, using only 12.5% of the measurements required for full state tomography is sufficient to reach 90th-percentile absolute errors of approximately 0.13-0.16 nats for both four- and five-qutrit systems. The CNN model is also robust to shot noise and generalizes well to out-of-distribution states. Overall, within the simulated settings studied here, our results indicate a transition in practical methods: VQAs are effective for small systems, while CNN-based estimators offer improved scalability and robustness for larger qutrit systems.
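
To give the nat-valued errors above a sense of scale, here is a minimal sketch of the von Neumann entropy computed from the eigenvalue spectrum of a density matrix, together with the maximum entropy of an n-qutrit register. The function names and the choice to pass eigenvalues directly (rather than a full density matrix) are assumptions for illustration and are not taken from the paper.

```python
import math
from typing import Sequence


def von_neumann_entropy(eigenvalues: Sequence[float]) -> float:
    """Return S = -sum(p * ln p) in nats for the eigenvalues of a density matrix.

    Zero eigenvalues contribute nothing, following the convention 0 * ln 0 = 0.
    """
    if any(p < 0.0 for p in eigenvalues):
        raise ValueError("eigenvalues of a density matrix are non-negative")
    if abs(sum(eigenvalues) - 1.0) > 1e-9:
        raise ValueError("eigenvalues of a density matrix sum to 1")
    return -sum(p * math.log(p) for p in eigenvalues if p > 0.0)


def max_entropy_nats(n_qutrits: int) -> float:
    """Entropy of the maximally mixed state on n qutrits: ln(3**n) = n * ln 3."""
    return n_qutrits * math.log(3)


if __name__ == "__main__":
    print(von_neumann_entropy([1.0, 0.0, 0.0]))        # pure qutrit state
    print(von_neumann_entropy([1 / 3, 1 / 3, 1 / 3]))  # maximally mixed qutrit
    print(max_entropy_nats(5))                         # upper bound for 5 qutrits
```

Under these definitions the entropy of a five-qutrit state is bounded by 5 ln 3, roughly 5.49 nats, so an absolute error of 0.13-0.16 nats is small relative to that bound, although the states studied in the paper need not be close to maximally mixed.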

2606.20512 2026-06-19 cs.SE cs.LG cross-listed

Probe-and-Refine Tuning of Repository Guidance for Coding Agents

Probe-and-Refine Tuning of Repository Guidance for Code Agents

Asa Shepard, Jeannie Albrecht

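AI summary Proposes probe-and-refine tuning, which uses synthetic bug-fix probes to iteratively diagnose and patch repository guidance files; on SWE-bench Verified with Qwen3.5-35B-A3B it reaches a 33.0% resolve rate, ahead of 28.3% for the static knowledge base and 25.5% for the unguided baseline.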

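Details
AI abstract (translated from Chinese)

LLM-based coding agents need higher-level operational knowledge about a repository (which files hold which subsystems, how to run the test suite, which workflows have historically led to wrong fixes) that is not present in the code itself. Engineers commonly maintain AGENTS.md files to supply this context as instructions for coding agents, but whether they help is disputed: recent studies disagree on whether LLM-generated guidance improves or harms agent performance. In this paper we show that how the guidance is produced is the decisive variable, and we introduce probe-and-refine tuning: a procedure that iteratively diagnoses and patches a repository's guidance file using synthetic bug-fix probes, through single-shot LLM calls, with no agent loop or tool use during tuning. On SWE-bench Verified, across four independent trials with Qwen3.5-35B-A3B at 200 steps, probe-and-refine reaches a mean resolve rate of 33.0%, compared with 28.3% for the static knowledge base used to initialize it and 25.5% for the unguided baseline (p < 0.001 for both probe-and-refine contrasts). The improvement comes from coverage rather than precision: refined guidance yields evaluable patches for 14.5 percentage points (pp) more instances, while per-patch precision stays statistically unchanged (about 59%, p = 0.119), indicating that better guidance helps agents reach the right file rather than improving the quality of their changes. In addition, a step-budget experiment shows that guidance lets agents make more productive use of a larger step budget, and a cross-model experiment with NVIDIA-Nemotron-3-Nano-30B-A3B finds that the tuning loop degrades when the model cannot produce sufficiently diagnostic output, although per-patch precision remains unchanged even then.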

English abstract

LLM-based coding agents need higher-level operational knowledge about a repository (which files house which subsystems, how to run the test suite, which workflows have historically led to wrong fixes) that does not exist in the code itself. Engineers typically maintain AGENTS.md files to supply this context as instructions for coding agents, but whether they help is contested: recent studies disagree on whether LLM-generated guidance improves or harms agent performance. In this paper we show that how the guidance is produced is the decisive variable, and introduce probe-and-refine tuning: a procedure that uses synthetic bug-fix probes to iteratively diagnose and patch a repository's guidance file through single-shot LLM calls, with no agent loop or tool use during tuning. On SWE-bench Verified across four independent trials with Qwen3.5-35B-A3B at 200 steps, probe-and-refine achieves a 33.0% mean resolve rate vs. 28.3% for the static knowledge base used to initialize it and 25.5% for an unguided baseline (p < 0.001 for both probe-and-refine contrasts). The improvement comes from coverage rather than precision: refined guidance produces evaluable patches for 14.5 percentage points (pp) more instances while per-patch precision remains statistically constant (~59%, p = 0.119), showing that improved guidance helps agents reach the correct file rather than improving the quality of the changes they make. Further, a step-budget experiment shows that guidance is what lets the agent use a larger step budget productively, and a cross-model experiment with NVIDIA-Nemotron-3-Nano-30B-A3B finds that the tuning loop degrades when the model cannot generate sufficiently diagnostic output, though per-patch precision remains constant even then.

11. Other / General Machine Learning (4 papers)

2606.19386 2026-06-19 cs.SE cs.AI cs.LG cross-listed

Bistable by Construction: Wall-Clock-Calibrated State Monitors Have No Moment-Detection Regime at Agent Cadence

Bistable by Construction: Wall-Clock-Calibrated State Monitors Have No Regime for Moment Detection at Agent Cadence

Manvendra Modgil

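AI summary This paper finds that wall-clock-calibrated leaky-integrator monitors cannot work as moment detectors on agent streams, shows that the calibration class is the key variable, and proposes rising-edge triggers as an alternative.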

Comments 10 pages, 5 figures. Sequel to arXiv:2606.04296. Pre-registered; falsification clauses honored (H5 unsupported; H7 strict band 16/20). Repo: https://github.com/2025eb1100268-tech/intervention-timing-saturation-trap

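Details
AI abstract (translated from Chinese)

Runtime monitors for autonomous agents commonly apply a threshold to an accumulated internal state (a behavioural baseline, a drift statistic, or, in our prior work, a modelled affective state). We previously reported a State Saturation Trap: threshold-on-state triggers over a continuous affect engine became near-constant alarms on SWE-bench debugging agents (Modgil 2026). A post-release audit found that the engine received dt=0 between actions, so its exponential decay never operated: the published trap is a pure-accumulator result. We correct the record (erratum, v2) and treat the flaw as an experiment. The key variable it exposes is whether a monitor's dynamics are calibrated in sample time (per observation, as in CUSUM) or in wall-clock time (half-lives in seconds, as in affect models and EMA baselines). The two coincide on fixed-rate streams; on agent streams, where inter-action time varies by orders of magnitude, they do not. A pre-registered sweep over uniform intervals (dt in {0..600}s) on 20 trajectories shows that the wall-clock level trigger has two regimes: a constant alarm at dt<=1s (20/20; median 18 firings) and silence at dt>=60s. Every critical dt lies in (1,30]s. Real agent runs have a measured median latency of 1.53s (p90 2.33s); real coding cadence sits inside the trap regime, which confirms the empirical finding under the corrected mechanism. The structure is a property of the calibration class, not of the engine: a minimal wall-clock accumulator over the raw error stream reproduces the same cliff, while a sample-time CUSUM over the same stream is exactly dt-invariant (20/20). A rising-edge trigger with hysteresis fires 0-3 times per trajectory in every condition. We conclude that wall-clock-calibrated leaky-integrator monitors have no regime in which they act as moment detectors on agent streams; transition detection escapes the trap at every cadence but does not recover the timing of human intervention.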

English abstract

Runtime monitors for autonomous agents commonly threshold an accumulated internal state - a behavioural baseline, a drift statistic, or, in our prior work, a modelled affective state. We previously reported a State Saturation Trap: threshold-on-state triggers over a continuous affect engine become near-constant alarms on SWE-bench debugging agents (Modgil 2026). A post-release audit found the engine received dt=0 between actions, so its exponential decay never operated: the published trap is a pure-accumulator result. We correct the record (erratum, v2) and treat the flaw as an experiment. The key variable it exposes is whether a monitor's dynamics are calibrated in sample time (per observation, as in CUSUM) or wall-clock time (half-lives in seconds, as in affect models and EMA baselines). On fixed-rate streams these coincide; on agent streams, where inter-action time varies by orders of magnitude, they do not. A pre-registered sweep over uniform intervals (dt in {0..600}s) on 20 trajectories shows the wall-clock level trigger has two regimes: at dt<=1s a constant alarm (20/20; median 18 firings); at dt>=60s silent. Every critical dt lies in (1,30]s. Real agent runs measure latency at median 1.53s (p90 2.33s); real coding cadence sits inside the trap regime, vindicating the empirical finding under a corrected mechanism. The structure is a property of the calibration class, not the engine: a minimal wall-clock accumulator over the raw error stream reproduces the same cliff, while a sample-time CUSUM over the identical stream is exactly dt-invariant (20/20). A rising-edge trigger with hysteresis fires 0-3 times per trajectory in every condition. We conclude that wall-clock-calibrated leaky-integrator monitors admit no regime in which they act as moment detectors on agent streams; transition detection escapes the trap at every cadence, but does not recover human intervention timing.
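
The contrast between wall-clock and sample-time calibration can be sketched in a few lines. The half-life, thresholds, CUSUM drift, and binary error stream below are hypothetical values chosen only to make the two regimes visible; they are not the paper's engine or its parameters.

```python
from typing import Sequence


def wall_clock_alarms(errors: Sequence[float], dt: float,
                      half_life: float = 10.0, threshold: float = 3.0) -> int:
    """Count level-trigger firings of a leaky integrator calibrated in seconds.

    The state decays by half every `half_life` seconds, so the decay applied
    between two observations depends on the inter-action time `dt`.
    """
    decay = 0.5 ** (dt / half_life)
    state, alarms = 0.0, 0
    for e in errors:
        state = state * decay + e
        if state >= threshold:
            alarms += 1
    return alarms


def cusum_alarms(errors: Sequence[float], dt: float,
                 drift: float = 0.5, threshold: float = 2.0) -> int:
    """Count firings of a one-sided CUSUM calibrated per observation.

    `dt` is accepted only to mirror the signature above; it is never used,
    which is why the result cannot depend on cadence.
    """
    s, alarms = 0.0, 0
    for e in errors:
        s = max(0.0, s + e - drift)
        if s >= threshold:
            alarms += 1
    return alarms


if __name__ == "__main__":
    stream = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0] * 2  # toy error indicators
    for dt in (0.0, 1.0, 60.0, 600.0):
        print(dt, wall_clock_alarms(stream, dt), cusum_alarms(stream, dt))
```

With `dt = 0` the decay factor is 1 and the integrator is a pure accumulator that stays above threshold once it crosses it; with a large `dt` the state is almost fully forgotten between observations and the level trigger never fires. The CUSUM count is the same at every `dt` because its update never reads the clock.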

2606.19625 2026-06-19 cs.CL cs.LG cross-listed

Where Does Social Reasoning Come From? Capability Provenance in Language Models

Where Does Social Reasoning Come From? The Provenance of Capabilities in Language Models

Glenn Matlin, Chandreyi Chakraborty, Saehee Eom, Mika Okamoto, Rayan Castilla, Louis Jaburi, Alvin Deng, Taywon Min, Lucia Quirke, Stella Biderman, Mark Riedl

Affiliations Georgia Institute of Technology, College of Computing; MATS Program; EleutherAI; KAIST AI; Georgia Tech AI Safety Initiative

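AI summary Using training-data attribution, the paper finds that social reasoning and STEM reasoning in OLMo3-7B depend on different regions of the pretraining corpus, and that the difference is more pronounced at the reasoning level than at the knowledge level.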

Comments Under review at COLM 2026 (Conference)

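Details
AI abstract (translated from Chinese)

We use training-data attribution as an interpretable tool for capability discovery, mapping which regions of the pretraining corpus support social reasoning versus STEM reasoning in OLMo3-7B. Training-data attribution measures how strongly each training document influences a model's predictions on a benchmark, but document-level scores are too noisy to identify which corpus regions support which capabilities, and prior work has focused on factual knowledge rather than reasoning. We compute gradient-based attribution (TrackStar via Bergson) over a working set sampled from the de-duplicated Dolma3 mix, aggregate influence across WebOrganizer's 24-format x 24-topic taxonomy (576 bins), and contrast benchmark pairs in a 2x2 design that varies domain (social vs. STEM) and capability type (reasoning vs. knowledge): SocialIQA and MMLU Social Sciences against ARC-Challenge and MMLU STEM. Social and STEM reasoning rely on qualitatively different corpus regions, and the contrast is sharper at the reasoning level than at the knowledge level. Targeted machine unlearning provides partial causal validation: forgetting high-attribution topic bins (for example, Literature for SocialIQA) degrades the aligned benchmark more than within-bin random baselines. We open-source all code, sampling manifests, the bin-level influence matrix, and unlearning checkpoints.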

English abstract

We use training-data attribution as an interpretable tool for capability discovery, mapping which regions of the pretraining corpus support social-reasoning versus STEM-reasoning in OLMo3-7B. Training-data attribution measures how strongly each training document influences a model's predictions on a benchmark, but document-level scores are too noisy to identify which corpus regions support which capabilities, and prior work has emphasized factual knowledge rather than reasoning. We compute gradient-based attribution (TrackStar via Bergson) over a working set drawn from the de-duplicated Dolma3 mix, aggregate influence across WebOrganizer's 24-format x 24-topic taxonomy (576 bins), and contrast benchmark pairs in a 2x2 design that varies domain (social vs. STEM) and capability type (reasoning vs. knowledge): SocialIQA and MMLU Social Sciences against ARC-Challenge and MMLU STEM. Social and STEM reasoning draw on qualitatively distinct corpus regions, and the contrast is sharper at the reasoning level than at the knowledge level. Targeted machine unlearning provides partial causal validation: forgetting high-attribution topic bins (e.g., Literature for SocialIQA) degrades the aligned benchmark more than within-bin random baselines, and we open-source all code, sampling manifests, the bin-level influence matrix, and unlearning checkpoints.
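
The bin-level aggregation described above can be pictured with a small sketch: noisy per-document influence scores are pooled into (format, topic) bins, and two benchmarks are then compared bin by bin. Averaging within a bin, the function names, and the example labels are assumptions for illustration; the abstract does not specify the aggregation rule or name the 24 formats and 24 topics.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Bin = Tuple[str, str]  # (format label, topic label); labels here are hypothetical


def mean_influence_by_bin(
    doc_scores: Iterable[Tuple[str, str, float]]
) -> Dict[Bin, float]:
    """Pool document-level influence scores into per-bin means.

    Each input item is (format_label, topic_label, influence_score).
    """
    sums: Dict[Bin, float] = defaultdict(float)
    counts: Dict[Bin, int] = defaultdict(int)
    for fmt, topic, score in doc_scores:
        sums[(fmt, topic)] += score
        counts[(fmt, topic)] += 1
    return {b: sums[b] / counts[b] for b in sums}


def bin_contrast(a: Dict[Bin, float], b: Dict[Bin, float]) -> Dict[Bin, float]:
    """Per-bin difference a - b; bins missing from one side count as 0."""
    return {k: a.get(k, 0.0) - b.get(k, 0.0) for k in set(a) | set(b)}


if __name__ == "__main__":
    social = mean_influence_by_bin([
        ("forum", "literature", 0.4),
        ("forum", "literature", 0.2),
        ("paper", "math", 0.1),
    ])
    stem = mean_influence_by_bin([
        ("paper", "math", 0.5),
    ])
    print(bin_contrast(social, stem))
```

A full 24 x 24 taxonomy gives 24 * 24 = 576 such bins, matching the count in the abstract.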

2606.19741 2026-06-19 cs.AI cs.LG cross-listed

Interpreting Neural Combinatorial Optimization via Evolving Programmatic Bottlenecks

Interpreting Neural Combinatorial Optimization Through Evolving Programmatic Bottlenecks

Haocheng Duan, Yuxin Guo, Jieyi Bi, Anqi Xie, Sirui Li, Yining Ma, Cathy Wu

Affiliations Carnegie Mellon University; Nanyang Technological University; Microsoft Research; Massachusetts Institute of Technology

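AI summary Proposes the Evolving Programmatic Bottlenecks (EPB) framework, which distills black-box neural combinatorial optimization models into readable program portfolios, uses an LLM and hybrid gradient descent to achieve interpretability, and reveals how model behaviour relates to classic heuristic variants.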

Comments Under Review

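Details
AI abstract (translated from Chinese)

Neural Combinatorial Optimization (NCO) achieves strong performance, but its black-box nature remains a key obstacle to deployment and scientific diagnosis. Standard interpretability tools such as Concept Bottleneck Models do not suit NCO, whose decisions are dynamic and state-dependent and which lacks a properly defined concept vocabulary. To close this gap, we introduce Evolving Programmatic Bottlenecks (EPB), to our knowledge the first framework for interpreting NCO policies by distilling black-box NCO models into human-readable program portfolios. EPB uses an LLM to autonomously evolve a bank of programs, where each program's per-step action distribution serves as the bottleneck. EPB works through an iterative framework: Block I fixes the program bank capacity and introduces a hybrid textual-numerical gradient descent scheme that combines numerical gradients for student router updates with textual gradients for LLM-based program revision; Block II dynamically adjusts bank capacity through fault-targeted expansion and redundancy pruning. Extensive experiments demonstrate EPB's effectiveness and broad applicability, with the distilled program portfolios largely retaining the original performance. EPB also reveals that NCO behaviour changes across optimization stages and can be approximated as a composition of classic heuristic variants. Our work advances interpretable NCO and establishes EPB as a promising tool for interpreting sequential decision-making models.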

English abstract

Neural Combinatorial Optimization (NCO) achieves strong performance, yet its black-box nature remains a key roadblock to deployment and scientific diagnosis. Standard interpretability tools, such as Concept Bottleneck Models (CBMs), are ill-equipped for NCO, whose decisions are dynamic, state-dependent, and lack proper concept vocabulary definition. To close this gap, we introduce Evolving Programmatic Bottlenecks (EPB), to our knowledge, the first framework for interpreting NCO policies by distilling black-box NCO models into human-readable program portfolios. EPB employs an LLM to autonomously evolve a bank of programs, where each program's per-step action distribution serves as the bottleneck. EPB works through an iterative framework: Block I fixes program bank capacity and introduces a hybrid textual-numerical gradient descent scheme that couples numerical gradients for student router updates and textual gradients for LLM-based program revision; Block II dynamically adapts bank capacity via fault-targeted expansion and redundancy pruning. Extensive experiments demonstrate EPB's effectiveness and broad applicability, where the distilled program portfolios largely match original performance. EPB also reveals that NCO behavior shifts across optimization stages and can be approximated as a composition of classic heuristic variants. Our work advances interpretable NCO and establishes EPB as a promising tool for interpreting sequential decision-making models.

2606.19803 2026-06-19 cs.DB cs.AI cs.LG cross-listed

Policy-aware Vector Search: A Vision for Fine Grained Access Control in Vector Databases

Policy-Aware Vector Search: A Vision of Fine-Grained Access Control in Vector Databases

Lakshmi Sahithi Yalamarthi, Primal Pappachan

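AI summary This paper presents a vision for policy-aware vector search, formalizes the fine-grained access control (FGAC) policy model and enforcement problem in vector databases, compares enforcement strategies, and identifies future challenges.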

Comments Accepted at SeQureDB 26, SIGMOD 2026

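Details
AI abstract (translated from Chinese)

Vector databases are increasingly used in security-sensitive settings such as retrieval-augmented generation and organizational AI pipelines, yet their security capabilities remain limited. In particular, modern vector databases do not fully support fine-grained access control (FGAC), which is needed to ensure that data access complies with user-specific policies. Unlike relational databases, vector databases combine structured and unstructured attributes to return semantic, approximate query results, which complicates FGAC implementation. This creates an inherent tension among enforcing FGAC policies correctly, achieving high ANN search recall, and keeping query latency low. In this paper, we present a vision for policy-aware vector search by formalizing the FGAC policy model in vector databases together with the enforcement problem. We compare several enforcement strategies, present preliminary findings, and identify key open challenges for future research on policy-aware vector search.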

English abstract

Vector databases are increasingly used in security-sensitive contexts with Retrieval-Augmented Generation and organizational AI pipelines; however, their security capabilities remain limited. Specifically, Fine-grained Access Control (FGAC), which is required to ensure that data access adheres to user-specific policies, is not fully supported in modern vector databases. Unlike relational databases, vector databases combine structured and unstructured attributes to provide semantic, approximate query results, which complicates FGAC implementation. This creates an inherent tension between enforcing FGAC policies correctly, achieving high ANN search recall, and maintaining low query latency. In this paper, we present a vision for Policy-aware Vector Search by formalizing the FGAC policy model in vector databases as well as the enforcement problem. We compare various enforcement strategies, present preliminary findings, and identify key open challenges for future research in policy-aware vector search.
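
The tension between policy enforcement and recall can be seen in a toy comparison of two commonly discussed enforcement orders, filtering after the nearest-neighbour search versus filtering before it. The abstract does not name the strategies the paper compares, so both functions, the role-based policy model, and the brute-force search standing in for an ANN index are assumptions for illustration.

```python
import math
from typing import List, Sequence, Set, Tuple

# (document id, embedding, roles allowed to read it); a hypothetical policy model
Doc = Tuple[str, List[float], Set[str]]


def post_filter_search(docs: Sequence[Doc], query: List[float],
                       user_roles: Set[str], k: int) -> List[str]:
    """Search first, then drop results the user may not see.

    Policy-correct, but it can return fewer than k results when the nearest
    neighbours are mostly unauthorized.
    """
    top_k = sorted(docs, key=lambda d: math.dist(d[1], query))[:k]
    return [doc_id for doc_id, _, roles in top_k if roles & user_roles]


def pre_filter_search(docs: Sequence[Doc], query: List[float],
                      user_roles: Set[str], k: int) -> List[str]:
    """Restrict to authorized documents first, then search among them.

    Returns up to k authorized results, at the cost of evaluating the policy
    over the whole collection before searching.
    """
    allowed = [d for d in docs if d[2] & user_roles]
    top_k = sorted(allowed, key=lambda d: math.dist(d[1], query))[:k]
    return [doc_id for doc_id, _, _ in top_k]


if __name__ == "__main__":
    corpus: List[Doc] = [
        ("a", [0.0, 0.0], {"hr"}),
        ("b", [0.1, 0.0], {"hr"}),
        ("c", [0.2, 0.0], {"eng"}),
        ("d", [0.9, 0.9], {"eng"}),
    ]
    print(post_filter_search(corpus, [0.0, 0.0], {"eng"}, k=2))
    print(pre_filter_search(corpus, [0.0, 0.0], {"eng"}, k=2))
```

In this toy corpus the two nearest documents are readable only by `hr`, so post-filtering leaves an `eng` user with no results while pre-filtering still returns the two authorized documents. With a real ANN index, pre-filtering may instead cost latency or index efficiency, which is the trade-off the abstract describes.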