arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.21807 2026-05-22 cs.CL

When Cases Get Rare: A Retrieval Benchmark for Off-Guideline Clinical Question Answering

Doeun Lee, Muge Zhang, Yi Yu, Ashish Manne, Stephen Koesters, Frank Wen, Brady Buchanan, Lynda Villagomez, Oluwatoba Moninuola, James Lim, Kathryn Tobin, Andrew Srisuwananukorn, Ping Zhang, Sachin Kumar

Affiliations: The Ohio State University; The Ohio State University Wexner Medical Center; University of Chicago Medical Center

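AI summary: This paper introduces OGCaReBench, a benchmark for evaluating open-ended reasoning beyond standard guidelines in medical question answering, and shows that retrieving medical literature improves model performance in real-world clinical scenarios.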

Comments 34 pages, 20 figures

Abstract

Across medical specialties, clinical practice is anchored in evidence-based guidelines that codify the best-studied diagnostic and treatment pathways. These pathways routinely fall short for the long tail of real-world care not covered by guidelines. Most medical large language models (LLMs), however, are trained to encode common, guideline-focused medical knowledge in their parameters. Current evaluations test models primarily on recalling and reasoning with this memorized content, often in multiple-choice settings. Given the fundamental importance of evidence-based reasoning in medicine, it is neither feasible nor reliable to depend on memorization in practice. To address this gap, we introduce OGCaReBench, a free-form retrieval-focused benchmark aimed at evaluating LLMs at answering clinical questions that require going beyond typical guidelines. Extracted from published medical case reports and validated by medical experts, OGCaReBench contains long-form clinical questions requiring free-text answers, providing a systematic framework for assessing open-ended medical reasoning in rare, case-based scenarios. Our experiments reveal that even the best-performing baseline (GPT-5.2) correctly answers only 56% of our benchmark, with specialized models reaching only 42%. Augmenting models with retrieved medical articles improves this performance to as much as 82% (using GPT-5.2), highlighting the importance of evidence-grounding for real-world medical reasoning tasks. This work thus establishes a foundation for benchmarking and advancing both general-purpose and medical LLMs to produce reliable answers in challenging clinical contexts.

2605.21803 2026-05-22 cs.LG

Same Architecture, Different Capacity: Optimizer-Induced Spectral Scaling Laws

Nandan Kumar Jha, Brandon Reagen

Affiliations: New York University

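AI summary: This study examines how the optimizer shapes spectral scaling laws in Transformer architectures, finds that the same architecture shows markedly different spectral-capacity scaling under different optimizers, and argues for optimizer-architecture co-design.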

Comments 31 pages, 10 figures, 30 tables. Project page: https://optimizer-scaling-laws.github.io

Abstract

Scaling laws have made language-model performance predictable from model size, data, and compute, but they typically treat the optimizer as a fixed training detail. We show that this assumption misses a fundamental axis of representation scaling: how effectively the optimizer converts added FFN width into utilized spectral capacity. Using eigenspectra of feed-forward network representations, measured through soft and hard spectral ranks, we find that the same Transformer architecture realizes markedly different spectral scaling laws when trained with different optimizers. Holding architecture and width schedule fixed, AdamW exhibits weak hard-rank scaling ($\beta = 0.44$) on rare-token (TAIL) representations where learning is known to be hardest, whereas Muon achieves linear scaling ($\beta = 1.02$) in the same regimes, a $2.3\times$ increase in the scaling exponent. This difference is not reducible to validation loss: AdamW configurations can match low-rank Dion variants in perplexity, under extended training, while exhibiting sharply different spectral geometry, demonstrating that matched loss does not imply matched representation structure. Hard--soft rank asymmetry further reveals that optimizers differ not only in how much capacity is realized, but also in how that capacity is structured across eigenmodes. To disentangle optimizer effects from architectural ones, we compare against architectural interventions (e.g., attention rank and positional encoding), and find that optimizer-induced spectral shifts often exceed the architectural effects. These results suggest optimization as a first-class axis of representation scaling, motivating optimizer--architecture co-design.
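
Digest note: the abstract does not define its soft and hard spectral ranks, so the sketch below rests on assumed definitions that are common in the literature: soft rank as the exponential of the Shannon entropy of the normalized eigenvalues, and hard rank as the participation ratio. The function names are hypothetical, and the exponent fit is an ordinary least-squares slope in log-log space, which may differ from the paper's fitting procedure.

```python
import math

def soft_rank(eigenvalues):
    """Assumed definition: exp of the Shannon entropy of the normalized spectrum."""
    total = sum(eigenvalues)
    probs = [ev / total for ev in eigenvalues if ev > 0]
    return math.exp(-sum(p * math.log(p) for p in probs))

def hard_rank(eigenvalues):
    """Assumed definition: participation ratio (sum ev)^2 / sum ev^2."""
    return sum(eigenvalues) ** 2 / sum(ev * ev for ev in eigenvalues)

def scaling_exponent(widths, ranks):
    """Least-squares slope of log(rank) against log(width): beta in rank ~ width^beta."""
    xs = [math.log(w) for w in widths]
    ys = [math.log(r) for r in ranks]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den
```

Under these assumptions a flat spectrum of n equal eigenvalues has soft and hard rank n, and a rank that doubles whenever the width doubles gives an exponent of 1. The exponents reported in the abstract are consistent with its stated ratio: 1.02 / 0.44 is roughly 2.3.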

2605.21801 2026-05-22 cs.LG cs.CL

Why Semantic Entropy Fails: Geometry-Aware and Calibrated Uncertainty for Policy Optimization

Zheyuan Zhang, Kaiwen Shi, Han Bao, Zehong Wang, Tianyi Ma, Yanfang Ye

Affiliations: University of Notre Dame

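AI summary: This paper proposes GCPO, a policy optimization framework that uses geometry-aware measures to capture semantic disagreement and reward-based calibration to align uncertainty with learning-signal strength, tracking gradient variability more faithfully and improving post-training performance.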

Abstract

Post-training has become central to improving reasoning and alignment in large language models, where critic-free models enable scalable learning from model-generated outputs but lack principled mechanisms to distinguish informative from noisy signals. Recent approaches leverage response-level measures as uncertainty signals to regulate group-based optimization methods such as GRPO. Yet their empirical success remains unstable, and it is unclear how they influence optimization dynamics. In this paper, we provide, to our knowledge, the first principled formulation that interprets uncertainty signals as mechanisms for characterizing and regulating gradient variance and learning signal quality. Based on both empirical and theoretical analysis, we identify two critical gaps in current entropy-based estimators: the anisotropic gap and the calibration gap. Motivated by this analysis, we propose Geometric-aware Calibrated Policy Optimization (GCPO), a novel framework integrating geometry-aware measures to capture semantic disagreement with reward-based calibration to align uncertainty with learning signal strength. Experiments on multiple benchmarks show that our approach more faithfully tracks gradient variability and consistently improves post-training performance. Our results highlight the importance of designing uncertainty signals that are aligned with optimization dynamics, offering a principled perspective for robust post-training.

2605.21800 2026-05-22 cs.LG cs.RO

stable-worldmodel: A Platform for Reproducible World Modeling Research and Evaluation

Lucas Maes, Quentin Le Lidec, Luiz Facury, Nassim Massaudi, Ayush Chaurasia, Francesco Capuano, Richard Gao, Taj Gillin, Dan Haramati, Damien Scieur, Yann LeCun, Randall Balestriero

Affiliations: Mila & Université de Montréal; New York University; Universidade Federal de Minas Gerais; Independent Researcher; LanceDB; University of Oxford; Brown University

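AI summary: This paper presents stable-worldmodel, a platform that addresses fragmented codebases, data pipelines, and evaluation protocols in world modeling research by providing a high-performance data layer, implementations of modern world model baselines and planning solvers, and an extended suite of environments and tasks for standardized, reproducible research and evaluation.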

Abstract

World models are central to building agents that can reason, plan, and generalize beyond their training data. However, research on world models is currently fragmented, with disparate codebases, data pipelines, and evaluation protocols hindering reproducibility and fair comparison. Current practice is further limited by three key bottlenecks: fragile one-off codebases, slow video data loading, and the lack of standardized generalization benchmarks. We present stable-worldmodel (swm), an open-source platform for standardized and reproducible world modeling research and evaluation. It delivers (1) a high-performance Lance-based data layer with native support and conversion tools for MP4, HDF5, and LeRobot datasets, (2) clean, well-tested implementations of modern world model baselines and planning solvers, and (3) a broad suite of environments and tasks extended with controllable visual, geometric, and physical factors of variation for systematic in-silico evaluation of dynamics understanding, control performance, representation quality, and out-of-distribution generalization. By unifying the full pipeline under a single, scalable framework, swm dramatically reduces research overhead and accelerates trustworthy progress toward reliable world models.

2605.21798 2026-05-22 cs.LG stat.ML

Three Costs of Amortizing Gaussian Process Inference with Neural Processes

Robin Young

Affiliations: University of Cambridge, Cambridge, UK

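AI summary: This paper studies the costs of amortizing Gaussian process inference with neural processes, which replace the exact O(n^3) GP posterior with a learned O(n) map; it attributes the resulting error to three sources (label contamination, an information bottleneck, and amortization error) and gives architectural recommendations.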

Comments To appear at ProbNum 2026

Abstract

Neural processes amortize Gaussian process inference, replacing the exact $O(n^3)$ posterior with a learned $O(n)$ map from context sets to predictive distributions. For a class of latent neural processes, we bound the Kullback--Leibler (KL) divergence between the GP and LNP predictives, decomposing it into three interpretable sources, namely label contamination as the neural process uses label values to estimate a quantity that is label-independent in the exact GP, an information bottleneck because the finite-dimensional representation cannot resolve the full context geometry, and amortization error from a single encoder network shared across all contexts. The bottleneck truncation term decays in the representation dimension $d$ as $O(e^{-cd^{2/d_x}})$ for squared-exponential kernels on $\mathbb{R}^{d_x}$ where $c > 0$ is a kernel-dependent constant and as $O(d^{-2ν/d_x})$ for Matérn-$ν$ kernels, directly linking architecture sizing to kernel smoothness and input dimension. The label contamination term is $O(1)$ in general, with only the observation-noise component decaying as $O(1/n)$, identifying a persistent cost of routing uncertainty estimation through a label-dependent representation. These results characterize the costs of amortization within the analyzed class and yield architectural recommendations to predict variance from context locations alone in the GP-amortization regime, and replace mean aggregation with second-order pooling to close the dominant amortization gap.
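
Digest note: to get a feel for how the two bottleneck rates in the abstract depend on the representation dimension $d$ and the input dimension $d_x$, the sketch below evaluates the expressions inside the big-O terms. The function names are hypothetical, the hidden constants are set to 1, and the default value of `c` is an arbitrary placeholder, so the outputs show only how the rates scale and are not actual error bounds.

```python
import math

def se_bottleneck_rate(d, d_x, c=1.0):
    """exp(-c * d^(2/d_x)): squared-exponential kernel rate, constant factor dropped."""
    return math.exp(-c * d ** (2.0 / d_x))

def matern_bottleneck_rate(d, d_x, nu):
    """d^(-2*nu/d_x): Matern-nu kernel rate, constant factor dropped."""
    return d ** (-2.0 * nu / d_x)
```

For example, with $\nu = 1$ and $d = 16$, the Matérn expression is 16^(-1) = 0.0625 for $d_x = 2$ but only 16^(-1/2) = 0.25 for $d_x = 4$, which illustrates why higher input dimension calls for a larger representation.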

2605.21796 2026-05-22 cs.CV cs.CL

MM-Conv: A Multimodal Dataset and Benchmark for Context-Aware Grounding in 3D Dialogue

Anna Deichler, Jim O'Regan, Fethiye Irmak Dogan, Lubos Marcinek, Anna Klezovich, Iolanda Leite, Jonas Beskow

Affiliations: KTH Royal Institute of Technology

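AI summary: This paper presents a multimodal dataset and benchmark for context-aware grounding in dynamic 3D environments, built from 6.7 hours of egocentric VR interaction with synchronized speech, motion, gaze, and 3D scene geometry, together with a two-stage grounding pipeline that improves grounding performance in dialogue.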

Comments Extended version of the paper published at LREC 2026 (Palma de Mallorca, Spain), with expanded VLM baselines and inter-annotator agreement analysis

Journal ref Proceedings of the 15th Language Resources and Evaluation Conference (LREC 2026), Palma de Mallorca, Spain

Abstract

Grounding language in the physical world requires AI systems to interpret references that emerge dynamically during conversation. While current vision-language models (VLMs) excel at static image tasks, they struggle to resolve ambiguous expressions in spontaneous, multi-turn dialogue. We address this gap by introducing (1) a benchmark for referential communication in dynamic 3D environments, built from 6.7 hours of egocentric VR interaction with synchronized speech, motion, gaze, and 3D scene geometry, and (2) a two-stage grounding pipeline that explicitly resolves conversational ambiguity before visual localization. The benchmark includes over 4,200 manually verified referring expressions spanning full, partitive, and pronominal types. Our contextual rewriting approach improves grounding performance by 11-22 percentage points on average, with a pure detector (GroundingDINO) reaching 56.7% on pronominals after rewriting, nearly double the best end-to-end baseline. Results demonstrate that decoupling linguistic reasoning from visual perception is more effective than end-to-end approaches for conversational grounding.

2605.21792 2026-05-22 cs.CL cs.AI cs.DB cs.LG

Residual Skill Optimization for Text-to-SQL Ensembles

Jiongli Zhu, Haoquan Guan, Parjanya Prajakta Prashant, Nikki Lijing Kuang, Seyedeh Baharan Khatami, Canwen Xu, Xiaodong Yu, Yingyu Lin, Zhewei Yao, Yuxiong He, Babak Salimi

Affiliations: University of California, San Diego; Snowflake AI Research

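AI summary: This paper proposes DivSkill-SQL, a residual skill optimization framework that builds complementary Text-to-SQL ensembles by optimizing each new skill on examples the current ensemble fails on, raising Pass@K; it reports sizable accuracy gains on Spider2-Lite and consistent improvements across dialects and tasks.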

Abstract

Text-to-SQL ensembles improve over single-candidate generation by drawing multiple SQL candidates and selecting one, but their effectiveness is bounded by Pass@K, the probability that at least one of K candidates is correct. Existing methods source diversity heuristically through stochastic decoding or prompt variants, leaving candidate sets dominated by correlated failures. We present DivSkill-SQL, a residual skill optimization framework that builds complementary agentic Text-to-SQL ensembles without model fine-tuning: each new skill is optimized on examples the current skill ensemble fails on, provably targeting its marginal contribution to Pass@K. On Spider2-Lite, DivSkill-SQL improves selected accuracy by up to +11.1 points on Snowflake and +8.3 on BigQuery over the strongest ensemble baseline, with consistent gains across two base models (Opus-4.6 and GPT-5.4). Skills optimized on a single dialect transfer without retraining across dialects (Snowflake, BigQuery, SQLite) and to a different task formulation, such as BIRD-Critic (+2.6 pts). Error diagnostics show up to 3x fewer hallucinated schema references and function calls, indicating that gains come from genuinely reliable complementary skills rather than surface-form variation.
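
Digest note: the residual idea can be illustrated with a much simpler stand-in than the paper's method. The sketch below does not optimize skills; it greedily picks skills from a fixed, hypothetical pool, scoring each candidate only on the examples the current ensemble still fails on. The names `solved_by`, `pass_at_k`, and `greedy_residual_select` are assumptions made for illustration, and `pass_at_k` here is the empirical fraction of examples solved by at least one chosen skill.

```python
def pass_at_k(solved_by, chosen, n_examples):
    """Fraction of examples solved by at least one of the chosen skills."""
    covered = set()
    for skill in chosen:
        covered |= solved_by[skill]
    return len(covered) / n_examples

def greedy_residual_select(solved_by, n_examples, k):
    """Pick up to k skills, each maximizing gain on still-unsolved examples."""
    chosen = []
    residual = set(range(n_examples))  # examples the current ensemble fails on
    for _ in range(k):
        best, best_gain = None, 0
        for skill in sorted(solved_by):  # sorted for deterministic tie-breaking
            if skill in chosen:
                continue
            gain = len(solved_by[skill] & residual)
            if gain > best_gain:
                best, best_gain = skill, gain
        if best is None:  # no remaining skill solves any residual example
            break
        chosen.append(best)
        residual -= solved_by[best]
    return chosen
```

With `solved_by = {"a": {0, 1, 2}, "b": {0, 1, 3}, "c": {3, 4}}` and five examples, picking the two individually strongest skills ("a" and "b") covers 4 of 5 examples because their failures overlap, while residual selection picks "a" then "c" and covers all five.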

2605.21788 2026-05-22 cs.CV cs.RO

SceneGraphGrounder: Zero-Shot 3D Visual Grounding via Structured Scene Graph Matching

Xuefei Sun, Xujia Zhang, Brendan Crowe, Doncey Albin, Christoffer Heckman

Affiliations: University of Colorado Boulder

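AI summary: This paper proposes SceneGraphGrounder, which recasts 3D grounding as structured graph matching: a visual marker prompting strategy infers object-object relations from 2D views and lifts them into a persistent 3D scene graph. The method is competitive among zero-shot approaches on the ScanRefer benchmark, and a real-robot deployment demonstrates robust spatial reasoning in long-horizon physical environments.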

Abstract

Zero-shot 3D visual grounding requires localizing objects in unstructured environments from free-form natural language. Recent vision-language model (VLM) approaches achieve promising results but rely on view-dependent reasoning or implicit representations, limiting spatial consistency and interpretability for compositional queries. We propose SceneGraphGrounder, a framework that reformulates 3D grounding as structured graph matching over a reconstructed 3D scene graph. To enable this formulation, we introduce a visual marker prompting strategy that enables a VLM to infer object-object relationships from 2D views, which are subsequently lifted into a persistent 3D scene graph encoding both spatial and semantic relations. Given a query, we construct a query graph and perform constrained alignment with the scene graph, ensuring multi-view consistency and interpretable reasoning. Experiments on the ScanRefer benchmark demonstrate that our method achieves competitive performance among zero-shot approaches, using only RGB-D inputs. We further validate our framework through real-world deployment on a mobile robot, demonstrating robust spatial reasoning in long-horizon physical environments. We will make our code publicly available upon acceptance.

2605.21783 2026-05-22 cs.LG stat.ML

MMD-Balls as Credal Sets: A PAC-Bayesian Framework for Epistemic Uncertainty in Test-Time Adaptation

Ahanaf Hasan Ariq

Affiliations: Ideal School and College

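AI summary: This paper develops a PAC-Bayesian framework for test-time adaptation that interprets MMD-balls as credal sets, giving a natural way to quantify epistemic uncertainty, and establishes an MMD-dependent generalization bound, a finite-sample version, a uniform worst-case risk bound, and geodesic preservation bounds.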

Comments 15 pages, 0 figures. Accepted at the 2nd Workshop on Epistemic Intelligence in Machine Learning (EIML@ICML 2026)

Abstract

Test-time adaptation (TTA) methods improve model performance under distribution shift but lack formal guarantees connecting shift magnitude to prediction reliability. We develop a PAC-Bayesian framework yielding generalization bounds explicitly parameterized by the maximum mean discrepancy (MMD) between source and target distributions. Our principal contribution is interpreting MMD-balls around the source distribution as credal sets in Walley's imprecise probability theory, yielding natural epistemic uncertainty quantification. We establish: (i) a PAC-Bayesian bound with an MMD-dependent shift penalty under an RKHS-Lipschitz loss assumption; (ii) a finite-sample version via MMD concentration; (iii) a uniform worst-case risk bound over all distributions in the credal set, with a lower-upper risk decomposition; and (iv) geodesic preservation bounds explaining why kernel-guided adaptation protects local feature geometry. The credal set interpretation separates epistemic from aleatoric uncertainty and provides a principled decision criterion for when adaptation is warranted.
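
Digest note: for readers unfamiliar with the maximum mean discrepancy, the sketch below computes a standard biased (V-statistic) estimate of squared MMD between two samples with an RBF kernel, then checks whether a target sample falls inside an MMD-ball around the source. The kernel choice, the `gamma` value, the radius, and the function names are all illustrative assumptions; the paper's bounds and decision criterion are not reproduced here.

```python
import math

def rbf(x, y, gamma=1.0):
    """RBF kernel exp(-gamma * ||x - y||^2) on equal-length tuples."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def mmd_squared(xs, ys, gamma=1.0):
    """Biased (V-statistic) estimate of squared MMD between samples xs and ys."""
    kxx = sum(rbf(a, b, gamma) for a in xs for b in xs) / len(xs) ** 2
    kyy = sum(rbf(a, b, gamma) for a in ys for b in ys) / len(ys) ** 2
    kxy = sum(rbf(a, b, gamma) for a in xs for b in ys) / (len(xs) * len(ys))
    return kxx + kyy - 2 * kxy

def in_mmd_ball(source, target, radius, gamma=1.0):
    """True if the estimated MMD between source and target is at most radius."""
    return math.sqrt(max(mmd_squared(source, target, gamma), 0.0)) <= radius
```

A target sample identical to the source has estimated MMD zero and lies in every ball, while a strongly shifted sample falls outside a small ball. This is the sense in which the ball's radius bounds how much shift is being considered.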

2605.21781 2026-05-22 cs.CL

Reflective Prompt Tuning through Language Model Function-Calling

Farima Fatahi Bayat, Moin Aminnaseri, Pouya Pezeshkpour, Estevam Hruschka

Affiliations: Megagon Labs

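AI summary: This paper proposes Reflective Prompt Tuning (RPT), a framework that uses LLM function calling to simulate the iterative workflow of human prompt engineers: a diagnostic function evaluates the target model and returns a structured diagnostic report, and the optimizer uses the history of reports to revise the prompt, improving task performance and confidence calibration.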

Comments 17 pages, 6 figures

Abstract

Large language models (LLMs) have become increasingly capable of following instructions and complex reasoning, making prompting a flexible interface for adapting models without parameter updates. Yet prompt design remains labor-intensive and highly sensitive to formatting, phrasing, and instruction order, motivating automated prompt optimization methods that reduce manual effort while preserving inference-time flexibility. However, existing methods often search over prompt candidates or use fixed critique-refine pipelines driven by individual examples or small batches, limiting their ability to capture systematic error patterns and make targeted edits grounded in failure history. We propose Reflective Prompt Tuning (RPT), a framework that uses LLM function calling to simulate the iterative workflow of human prompt engineers. An LLM optimizer calls a diagnostic function that evaluates the target model over an entire optimization set, summarizes recurring failure modes, and returns a structured diagnostic report. The optimizer uses this report, together with an accumulated memory of prior reports, to revise the prompt for the next iteration. RPT further supports confidence-aware optimization by using calibration signals in diagnostic feedback and final prompt selection. Across three reasoning tasks, RPT improves over initial prompts by up to 12.9 points, remains competitive with state of the art, and improves confidence calibration. Our analyses show that RPT is especially effective on multi-hop and mathematical reasoning, producing targeted prompt revisions that align with diagnosed failure patterns and lead to gains in task performance and calibration.

2605.21780 2026-05-22 cs.LG cs.CR

Provable Robustness against Backdoor Attacks via the Primal-Dual Perspective on Differential Privacy

Aman Saxena, Jan Schuchardt, Yan Scholten, Stephan Günnemann

Affiliations: Department of Computer Science, Technical University of Munich; Munich Data Science Institute; MCML Machine Learning Research, Morgan Stanley

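AI summary: This paper proposes a framework based on the dual view of differential privacy for certifying robustness to adversarial perturbations; by connecting randomized smoothing with privacy profiles, it provides joint robustness guarantees against training-time and inference-time attacks.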

Abstract

Randomized smoothing is a powerful tool for certifying robustness to adversarial perturbations, including poisoning attacks via randomized training and evasion attacks via randomized inference. Extending these guarantees to backdoor attacks, where training and test data are jointly perturbed, remains challenging because training- and test-time randomized mechanisms must be analyzed within a single robustness certificate. We address this by connecting randomized smoothing to the dual view of differential privacy through privacy profiles, which provide a numerical procedure for composing heterogeneous mechanisms. The resulting framework enables tight, modular, end-to-end certification of complex, composed mechanisms while leveraging existing analyses of differentially private mechanisms. We instantiate the framework for DP-SGD and Deep Partition Aggregation with inference-time smoothing, deriving joint robustness guarantees against both training-time and inference-time attacks. Experiments on MNIST and CIFAR-10 demonstrate the effectiveness of our framework. Overall, we provide a principled and general framework for using composite mechanisms to certify robustness under complex threat models that better capture the capabilities of real-world adversaries.

2605.21778 2026-05-22 cs.AI

What Counts as AI Sycophancy? A Taxonomy and Expert Survey of a Fragmented Construct

Meryl Ye, Lujain Ibrahim, Jessica Y. Bo, Myra Cheng, Ida Mattsson, Daniel Vennemeyer, Robert Kraut, Steve Rathje

Affiliations: Carnegie Mellon University; University of Oxford; University of Toronto; Stanford University; University of Cincinnati; New York University

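AI summary: Through a literature review and an expert survey, this paper maps the definitional and measurement challenges of AI sycophancy and proposes a unified taxonomy to support understanding and addressing it.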

Abstract

AI sycophancy has become a prominent concern in large language model (LLM) research. Yet the term lacks a consistent definition and has been applied to behaviors ranging from agreeing with a user's false claim to excessively praising the user to withholding corrective feedback. When researchers, companies, and policymakers use the same term to describe different behaviors, evaluation results become difficult to compare, mitigation strategies fail to transfer, and systems that are resistant to one form of sycophancy continue exhibiting other forms. To address this, we make two contributions. First, we reviewed 70 papers on AI sycophancy to develop a taxonomy of how the behavior has been defined and measured. The taxonomy distinguishes (1) whether a model is sycophantic toward a user's positions and beliefs, or toward the user's broader personal traits and emotions, and (2) whether this occurs through explicit, direct language or more implicit, subtle behaviors such as framing, omission, or tone. Mapping existing literature to our taxonomy reveals that current research has focused on overt forms of sycophancy toward users' beliefs, leaving more subtle and person-directed behaviors relatively understudied. Second, we surveyed 106 experts in AI sycophancy and related fields to examine whether researchers agree on which model behaviors are sycophantic. While experts are nearly unanimous in believing that sycophancy is a significant problem in current AI systems (94.3% agree), they disagree substantially on which specific behaviors qualify. Together, these findings demonstrate that AI sycophancy is a broad family of behaviors with different measurement challenges, intervention requirements, and governance implications. Our taxonomy provides a shared vocabulary for understanding and addressing these behaviors.

2605.21776 2026-05-22 cs.CL

PromptNCE: Pointwise Mutual Information Predictions Using Only LLMs and Contrastive Estimation Prompts

Juliette Woodrow, Chris Piech

Affiliations: Department of Computer Science, Stanford University

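AI summary: This paper proposes PromptNCE, which frames conditional probability estimation as a contrastive task and adds an explicit OTHER category to recover the true conditional probability, enabling zero-shot pointwise mutual information estimation in low-data settings.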

Abstract

Estimating mutual information from text usually requires training a task-specific critic, which limits its use in low-data settings. We ask whether large language models can instead estimate pointwise mutual information zero-shot, using only prompts and elicited probabilities. We introduce a benchmark with human-derived ground-truth PMI across three publicly available datasets, and evaluate five information-theoretic prompting-based estimators. Our main method, PromptNCE, frames conditional probability estimation as a contrastive task and augments the candidate set with an explicit OTHER category. We show theoretically that adding OTHER recovers the true conditional P(y | x) rather than just a ranking over listed candidates, turning a contrastive prompt into a general-purpose zero-shot probability estimator. PromptNCE is the best zero-shot method on all three datasets, reaching Spearman correlation up to 0.82 with human-derived PMI. We also present a case study in computer science education showing how these estimators can be used to score student knowledge summaries in a low-data setting.
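
Digest note: the role of the OTHER category can be shown with a small numeric sketch. It assumes the LLM has already been prompted and has returned non-negative scores for each listed candidate plus an `"OTHER"` entry. How those scores are elicited is not shown, and the dictionary layout and function names are hypothetical. PMI is computed from its standard definition, log P(y | x) / P(y).

```python
import math

def conditional_with_other(scores, target):
    """Normalize elicited scores over listed candidates plus OTHER; return P(target | x)."""
    if "OTHER" not in scores:
        raise ValueError("scores must include an explicit OTHER entry")
    total = sum(scores.values())
    return scores[target] / total

def pmi(p_y_given_x, p_y):
    """Pointwise mutual information: log of P(y | x) / P(y)."""
    return math.log(p_y_given_x / p_y)
```

With scores `{"A": 3, "B": 1, "OTHER": 6}`, the estimate of P(A | x) is 0.3. Dropping OTHER and renormalizing over the listed candidates alone would give 0.75, which ranks A above B correctly but overstates its absolute probability and therefore its PMI.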

2605.21770 2026-05-22 cs.LG

Manifold-Guided Attention Steering

Ian Li, Kapilesh Guruprasad, Raunak Sengupta, Ninad Satish, Loris D'Antoni, Rose Yu

Affiliations: University of California, San Diego

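AI summary: This paper proposes manifold-guided attention steering, which monitors the distance of attention heads from a correctness manifold during inference and corrects deviations dynamically, improving large language model performance on mathematical reasoning, code generation, and molecular generation.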

Abstract

Large language models frequently produce errors in reasoning tasks despite possessing the underlying knowledge required for correct reasoning. One possible approach to improve reasoning consistency is through activation steering. However, existing activation steering approaches apply fixed, pre-computed correction vectors, ignoring where the model currently sits along its generation trajectory; the result is indiscriminate perturbation that disrupts already-correct steps as freely as erroneous ones. We propose Manifold-Guided Attention Steering (MAGS), a trajectory-aware inference-time intervention grounded in a geometric observation: the output activations of specific attention heads diverge from a low-dimensional correctness manifold at the point of error, and this deviation compounds through subsequent steps. For each identified attention head, we learn a low-dimensional subspace from contrastive pairs of correct and incorrect traces that capture the directions along which error behavior deviates from correct behavior. During inference, we monitor each head's proximity to this manifold and apply a targeted projection correction when deviation exceeds a learned threshold, steering the attention output back toward the correct subspace before the error propagates. MAGS consistently outperforms both unsteered baselines and static steering approaches across benchmarks spanning mathematical reasoning (MATH-500, GSM8K), code generation (HumanEval, MBPP), and molecular generation (SMILES), suggesting that correctness manifolds are a general feature of LLM attention geometry.
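
Digest note: the monitor-then-correct step can be sketched in a few lines under strong simplifying assumptions. Here the "correct" subspace is given as an orthonormal basis, deviation is the Euclidean distance from a head's output to that subspace, and the correction is a full orthogonal projection. The abstract does not specify how the subspace or threshold are learned or whether the correction is partial, so `project`, `steer`, and their parameters are hypothetical.

```python
import math

def project(v, basis):
    """Orthogonal projection of v onto span(basis); basis is assumed orthonormal."""
    out = [0.0] * len(v)
    for b in basis:
        coeff = sum(vi * bi for vi, bi in zip(v, b))
        out = [oi + coeff * bi for oi, bi in zip(out, b)]
    return out

def steer(v, basis, threshold):
    """Return (vector, corrected_flag): project v only if it strays beyond threshold."""
    p = project(v, basis)
    deviation = math.sqrt(sum((vi - pi) ** 2 for vi, pi in zip(v, p)))
    if deviation > threshold:
        return p, True       # pull the activation back onto the subspace
    return list(v), False    # close enough: leave the activation untouched
```

The threshold is what distinguishes this from static steering: an activation that is already near the subspace is returned unchanged, so steps that are going well are not perturbed.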

2605.21768 2026-05-22 cs.LG cs.MA

Memory-R2: Fair Credit Assignment for Long-Horizon Memory-Augmented LLM Agents

Memory-R2:面向长时程记忆增强 LLM 智能体的公平信用分配

Sikuan Yan, Ahmed Bahloul, Ercong Nie, Susanna Schwarzmann, Riccardo Trivisonno, Volker Tresp, Yunpu Ma

发表机构 * Ludwig Maximilian University of Munich(慕尼黑路德维希-马克西米利安大学) Munich Center for Machine Learning(慕尼黑机器学习中心) Huawei Heisenberg Research Center (Munich)(华为海森堡研究中心(慕尼黑)) Technical University of Munich(慕尼黑技术大学)

AI总结 本文提出 Memory-R2 框架,通过结合局部和全局组相对优化方法,解决长时程记忆增强 LLM 智能体在多会话环境中训练时由于记忆状态差异导致的信用分配不公平问题,同时联合优化记忆形成与记忆演化。

详情
英文摘要

Memory-augmented LLM agents enable interactions that extend beyond finite context windows by storing, updating, and reusing information across sessions. However, training such agents with reinforcement learning in multi-session environments is challenging because memory turns the agent's past actions into part of its future environment. Once different rollouts write, update, or delete different memories, they no longer share the same intermediate memory state, making trajectory-level comparisons fundamentally unfair. This violates a key assumption behind group-relative methods such as GRPO, where rollouts are compared as if they were sampled from the same effective environment. Consequently, trajectory-level rewards provide noisy or biased credit signals for long-horizon memory operations. To address this challenge, we introduce Memory-R2, a training framework for long-horizon memory-augmented LLM agents. Its core algorithm, LoGo-GRPO, combines local and global group-relative optimization. The global objective preserves end-to-end learning from long-horizon trajectory-level rewards, while local rerollouts compare different memory-operation outcomes from the same intermediate memory state, yielding fairer group comparisons and more precise supervision for memory construction. Beyond credit assignment, Memory-R2 jointly optimizes memory formation and memory evolution with a shared-parameter co-learning design, where a fact extractor and a memory manager are instantiated from the same LLM backbone through role-specific prompts. To stabilize multi-step RL over long memory horizons, we adopt a progressive curriculum that increases the training horizon from 8 to 16 to 32 sessions. Together, these components provide an effective training paradigm for memory-augmented LLM agents in long-horizon multi-session settings.
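The group-relative comparison that the abstract refers to can be sketched as follows. This shows only the standard GRPO-style normalization of rewards within one group; the group contents and the `eps` constant are assumptions, and how LoGo-GRPO weights or combines its local and global terms is not stated in the abstract, so it is not modeled here.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one comparison group (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps (an assumed constant) avoids division by zero when all rewards are equal.
    return [(r - mu) / (sigma + eps) for r in rewards]

# Global group: full trajectories that may have diverged in memory state.
global_adv = group_relative_advantages([1.0, 0.0, 1.0, 0.0])

# Local group: rerollouts that all start from the SAME intermediate memory
# state, so their rewards are directly comparable (hypothetical values).
local_adv = group_relative_advantages([0.2, 0.8, 0.5])
```

The normalization is identical in both cases. What changes is which rollouts are placed in the same group, which is where the fairness argument of the abstract applies.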

2605.21766 2026-05-22 cs.CV cs.GR

BodyReLux: Temporally Consistent Full-Body Video Relighting

BodyReLux: 时序一致的全身人体视频重照明

Li Ma, Mingming He, Xueming Yu, David M. George, Ahmet Levent Taşel, Paul Debevec, Julien Philip

发表机构 * Eyeline Labs(Eyeline实验室)

AI总结 本文提出BodyReLux,一种基于视频扩散的框架,用于在时序一致的方式下重照明全身人体表演。该方法利用混合数据集训练,结合传统静态单光源捕捉和新型动态表演捕捉技术,通过引入新的光照条件表示方法和数据增强管道,实现了高质量、鲁棒且时序一致的视频重照明。

Comments Siggraph 2026 Journal Track. Project page: https://eyeline-labs.github.io/bodyrelux/

详情
AI中文摘要

能够重照明人体表演是后期制作和内容创作中的基本任务。我们提出了BodyReLux,一种针对特定主体的视频扩散框架,用于在时序一致的方式下重照明全身人体表演。我们的模型是在一个混合的像素对齐视频重照明数据集上训练的,涵盖了多样化的光照条件、表演和视角组合。为了获得这样的数据集,我们结合了传统的静态单光源捕捉(OLAT)和一种新的动态表演捕捉方法,在其中两个平滑变化的光照序列被快速交错。由于光照操作在人类闪烁融合阈值之上,交错不会显得闪烁。我们从预训练的文本到视频模型中训练视频重照明模型,以充分利用生成先验来产生高质量视频。为了实现精确的光照控制,我们引入了一种新的光照条件方法,将每个光源表示为一个标记。我们进一步使用掩码注意力对光照序列进行条件处理,以支持动态光照控制。结合精心设计的数据增强管道,我们实现了高质量、鲁棒且时序一致的特定主体人体表演视频重照明。

英文摘要

Being able to relight human performance is a fundamental task for post-production and content creation. We present BodyReLux, a subject-specific video diffusion-based framework for relighting full-body human performances in a temporally consistent way. Our model is trained on a hybrid dataset of pixel-aligned video relighting pairs, covering a diverse combination of lighting conditions, performances and viewpoints. To acquire such a dataset, we combine traditional static One-Light-at-a-Time (OLAT) capture and a novel dynamic performance capture in which two smoothly varying lighting sequences are rapidly interleaved. Because the lighting operates above the human flicker-fusion threshold, the interleaving does not appear to strobe. We train our video relighting model from a pretrained text-to-video model to fully leverage the generative priors for producing high quality videos. To achieve accurate lighting control, we introduce a new lighting conditioning method that represents each light source as a token. We further condition on sequences of lighting using masked attention to support dynamic lighting control. Together with a carefully designed data augmentation pipeline, we achieve photorealistic, robust, and temporally consistent video relighting of subject-specific human performances.

2605.21765 2026-05-22 cs.LG

Position: The Time for Sampling Is Now! Charting a New Course for Bayesian Deep Learning

Emanuel Sommer, David Rügamer

发表机构 * Department of Statistics, LMU Munich(慕尼黑大学统计系) Munich Center for Machine Learning(慕尼黑机器学习中心)

AI总结 本文探讨了在贝叶斯深度学习中采样推理(SAI)的潜力,指出其在计算效率上已与优化方法相当,并可能成为更有效的推理方法。核心贡献是推动SAI在贝叶斯神经网络中的应用,解决现有误解,以实现更精确的不确定性量化。

Comments In Proceedings of the 43rd International Conference on Machine Learning, PMLR 306, 2026

详情
AI中文摘要

贝叶斯神经网络(BNNs)中基于采样的推理(SAI)的实用应用仍然有限,部分原因是持续存在的关于其可行性和效率的误解。本文认为,SAI在计算上已与基于优化的方法达到平衡,并即将超越这些方法,成为BNNs中更有效和高效的推理方法。这一发展应成为整个社区的利益,推动BNNs作为一种原则性的范式,实现其长期未实现的承诺,即为神经网络提供原则性的不确定性量化。SAI甚至可以做到更多——通过模型平均获得更优的预测性能,成为各种可能的下游任务的基础,并为BNNs的景观提供关键见解。为了实现这种变革并释放采样的潜力,克服当前的误解是必要的第一步。下一步是重新定向研究努力,解决SAI中尚存的挑战。特别是,社区必须专注于两个核心问题:充分探索后验景观和高保真度地蒸馏后验样本以实现高效的下游推理。通过解决概念和实践上的障碍,我们可以解锁SAI的全部潜力,并将其确立为贝叶斯深度学习中的核心工具。

英文摘要

The practical adoption of sampling-based inference (SAI) in Bayesian neural networks (BNNs) remains limited, partly due to persistent misconceptions about the feasibility and efficiency of sampling. This position paper argues that SAI has achieved computational parity with optimization-based methods and is on the verge of superseding such methods for effective and efficient inference in BNNs. This development should be in the interest of the whole community, promoting BNNs as a principled paradigm with its long-standing yet unfulfilled promise of providing principled uncertainty quantification for neural networks. SAI can even do more -- yielding superior prediction performance through model averaging, serving as the foundation for a plethora of possible downstream tasks, and providing crucial insights into the landscape of BNNs. In order to make such a change happen and unfold the potential of sampling, overcoming current misconceptions is a necessary first step. The next step is to realign research efforts toward addressing remaining challenges in SAI. In particular, the community must focus on two core problems: sufficient exploration of the posterior landscape and high-fidelity distillation of posterior samples for efficient downstream inference. By addressing conceptual and practical obstacles, we can unlock the full potential of SAI and establish it as a central tool in Bayesian deep learning.

2605.21763 2026-05-22 cs.LG cs.SY eess.SY stat.ML

On the Sample Complexity of Discounted Reinforcement Learning with Optimized Certainty Equivalents

关于优化确定等价的折扣强化学习样本复杂性

Oliver Mortensen, Mohammad Sadegh Talebi

发表机构 * Department of Computer Science, University of Copenhagen(哥本哈根大学计算机科学系)

AI总结 本文研究了有限折扣MDP中的风险敏感强化学习,考虑了优化确定性等价(OCE)这一风险度量族,分析了在递归OCE下学习最优状态-动作价值函数和最优策略的样本复杂性,并给出了PAC可学习的效用函数的精确刻画,同时建立了基于模型的简单方法的PAC样本复杂性界,并证明当效用函数的定义域不是全体实数时问题不可PAC学习,最后给出了价值学习和策略学习的下界,证明了关于状态-动作空间大小SA的紧性,并对更受限的效用函数类推导出显式体现对有效时域 1/(1-γ) 依赖关系的下界。

Comments Accepted to RLC 2026. arXiv admin note: substantial text overlap with arXiv:2506.00286

详情
AI中文摘要

我们研究了有限折扣MDP中的风险敏感强化学习,其中假设可以使用MDP的生成模型。我们考虑了一族称为优化确定性等价(OCE)的风险度量,其中包括熵风险、CVaR和均值-方差等重要的风险度量。我们的重点是在递归OCE下学习最优状态-动作价值函数(价值学习)和最优策略(策略学习)的样本复杂性。我们精确刻画了哪些效用函数u所对应的OCE定义了一个PAC可学习的目标。我们分析了一种简单的基于模型的方法,并推导了PAC样本复杂性界。我们证明了当u的定义域不是全体实数(即dom(u)≠R)时,相应的问题不可PAC学习。最后,我们为价值学习和策略学习建立了相应的下界,证明了关于状态-动作空间大小SA的紧性,并对更受限的效用函数类推导了下界,使其对有效时域 1/(1-γ) 的依赖关系显式化。具体而言,对于CVaR_τ,我们证明了对τ的正确依赖关系为1/τ²,从而比现有最优结果改进了1/τ因子,尽管我们的界对1/(1-γ)的依赖是次优的。

英文摘要

We study risk-sensitive reinforcement learning in finite discounted MDPs, where a generative model of the MDP is assumed to be available. We consider a family of risk measures called the optimized certainty equivalent (OCE), which includes important risk measures such as entropic risk, CVaR, and mean-variance. Our focus is on the sample complexities of learning the optimal state-action value function (value learning) and an optimal policy (policy learning) under recursive OCE. We provide an exact characterization of utility functions $u$ for which the corresponding OCE defines an objective that is PAC-learnable. We analyze a simple model-based approach and derive PAC sample complexity bounds. We establish that whenever $u$ does not have full domain $\text{dom}(u)\neq \mathbb{R}$, the corresponding problem is not PAC-learnable. Finally, we establish corresponding lower bounds for both value and policy learning, demonstrating tightness in the size $SA$ of state-action space, and for a more restricted class of utilities, we derive lower bounds that make the dependence on the effective horizon $\frac{1}{1-γ}$ explicit. Specifically, for $\text{CVaR}_τ$ we show that the correct dependence on $τ$ is $\frac{1}{τ^2}$, thus improving by a factor of $\frac{1}{τ}$ over state-of-the-art, although our bound has a suboptimal dependence on $\frac{1}{1-γ}$.
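To make the OCE family concrete, the sketch below evaluates one member, CVaR, on a finite sample. It assumes the common reward-oriented convention OCE_u(X) = sup_λ { λ + E[u(X − λ)] } with u(t) = min(t, 0)/τ, which gives CVaR_τ(X) = sup_λ { λ − E[(λ − X)_+]/τ }. The paper may use a different sign or tail convention, and the function name `empirical_cvar` is hypothetical.

```python
def empirical_cvar(samples, tau):
    """Lower-tail CVaR of reward samples via the OCE variational form."""
    if not 0 < tau <= 1:
        raise ValueError("tau must lie in (0, 1]")
    n = len(samples)

    def objective(lam):
        # Empirical E[(lam - X)_+], the expected shortfall below lam.
        shortfall = sum(max(lam - x, 0.0) for x in samples) / n
        return lam - shortfall / tau

    # The objective is concave and piecewise linear in lam, so on a finite
    # sample its maximum is attained at one of the sample points.
    return max(objective(lam) for lam in samples)

worst_half = empirical_cvar([1.0, 2.0, 3.0, 4.0], tau=0.5)
risk_neutral = empirical_cvar([1.0, 2.0, 3.0, 4.0], tau=1.0)
```

With τ = 0.5 the value is the mean of the worst half of the samples, and with τ = 1 it reduces to the ordinary mean. Smaller τ concentrates on a thinner tail, which is why the bounds in the abstract depend on τ.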

2605.21762 2026-05-22 cs.LG

Machine learning prediction of obstructive coronary artery disease using opportunistic coronary calcium and epicardial fat assessments from CT calcium scoring scans

利用CT钙扫描中的机会性冠状动脉钙化和心外膜脂肪评估进行阻塞性冠状动脉疾病的机器学习预测

Juhwan Lee, Ammar Hoori, Tao Hu, Justin N. Kim, Mohamed H. E. Makhlouf, Michelle C. Williams, David E. Newby, Robert Gilkeson, Sanjay Rajagopalan, David L. Wilson

发表机构 * Department of Biomedical Engineering, Virginia Commonwealth University(弗吉尼亚联邦大学生物医学工程系) Department of Biomedical Engineering, Case Western Reserve University(凯斯西储大学生物医学工程系) Harrington Heart and Vascular Institute, University Hospitals Cleveland Medical Center(克利夫兰医学中心哈灵顿心脏和血管研究所) BHF Centre for Cardiovascular Science, University of Edinburgh(爱丁堡大学BHF心血管科学中心) Department of Radiology, Case Western Reserve University(凯斯西储大学放射学系)

AI总结 本研究开发了一种先进的机器学习框架,通过分析CT钙扫描中的冠状动脉钙化和心外膜脂肪数据,预测阻塞性冠状动脉疾病,展示了该方法在提高预测性能和减少对增强CT或侵入性检查依赖方面的潜力。

Comments 16 pages, 4 figures, 3 tables

详情
AI中文摘要

非对比计算断层扫描钙评分(CTCS)是一种成本效益高的成像模态,广泛用于检测冠状动脉钙化。本研究旨在开发一种先进的机器学习框架,利用CTCS图像中冠状动脉钙化和心外膜脂肪的定量分析来预测阻塞性冠状动脉疾病(CAD)。研究人群包括1,324名接受CTCS和冠状动脉CT血管造影的SCOT-HEART临床试验参与者。我们从CTCS图像中提取并分析了广泛特征,包括24个临床变量、189个钙组学和211个心外膜脂肪组学特征。特征选择使用CatBoost算法结合SHapley Additive exPlanation(SHAP)值进行。预测建模利用CatBoost梯度提升方法,专注于最有信息量的特征。从初始的424个候选特征中,通过CatBoost-SHAP方法确定了14个最具有预测性的特征。前两个预测特征来自脂肪组学,其余12个特征来自钙组学。优化后的模型表现出稳健的预测能力,显示出灵敏度为83.1±4.6%、特异性为93.8±1.7%、准确度为85.3±2.0%、F1分数为73.9±3.3%。包括钙组学和脂肪组学数据显著提高了预测性能。值得注意的是,该模型在具有不同冠状动脉钙化评分的患者中也表现出可靠的预测准确性,包括在零钙化评分的情况下仍存在阻塞性CAD的病例。这种创新方法有潜力改善临床决策,并可能减少对增强CT或侵入性诊断程序的依赖,特别是在低至中等风险患者群体中。

英文摘要

Non-contrast computed tomography calcium scoring (CTCS) is a cost-effective imaging modality widely used to detect coronary artery calcifications. This study aimed to develop an advanced machine learning framework that utilizes quantitative analyses of coronary calcium and epicardial fat from CTCS images to predict obstructive coronary artery disease (CAD). The study population consisted of 1,324 patients from the SCOT-HEART clinical trial who underwent both CTCS and coronary CT angiography. We extracted and analyzed a broad range of features, including 24 clinical variables, 189 calcium-omics, and 211 epicardial fat-omics features from the CTCS images. Feature selection was conducted using the CatBoost algorithm combined with SHapley Additive exPlanation (SHAP) values. Predictive modeling utilized the CatBoost gradient boosting method, focusing on the most informative features. From an initial set of 424 candidate features, 14 were identified as most predictive through the CatBoost-SHAP method. The top two predictive features originated from fat-omics, with the remaining 12 features derived from calcium-omics. The optimized model achieved robust predictive capabilities, demonstrating a sensitivity of 83.1±4.6%, specificity of 93.8±1.7%, accuracy of 85.3±2.0%, and an F1 score of 73.9±3.3%. Inclusion of calcium-omics and fat-omics data significantly improved predictive performance. Notably, the model also showed reliable predictive accuracy in patients with diverse coronary calcium scores, including cases with obstructive CAD despite a zero-calcium score. This innovative approach holds promise for improving clinical decision-making and potentially reducing dependence on contrast-enhanced or invasive diagnostic procedures, particularly within low- to intermediate-risk patient groups.
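For readers less familiar with the reported metrics, the sketch below shows how sensitivity, specificity, accuracy, and F1 follow from a binary confusion matrix. The counts are invented for illustration and are not taken from the study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # recall on diseased patients
    specificity = tn / (tn + fp)   # recall on non-diseased patients
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "precision": precision,
        "f1": f1,
    }

# Hypothetical counts for one validation fold with a rare positive class.
metrics = classification_metrics(tp=8, fp=5, tn=85, fn=2)
```

In this invented fold accuracy is 0.93 while F1 is about 0.70. Because F1 ignores true negatives, it can sit well below accuracy when the positive class is rare.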

2605.21758 2026-05-22 cs.AI

A Causal Argumentation Method for Explainability of Machine Learning Models

一种用于机器学习模型可解释性的因果论辩方法

Henry Salgado, Meagan R. Kendall, Martine Ceberio

发表机构 * Department of Computer Science, The University of Texas at El Paso, El Paso, TX, USA(德克萨斯大学埃尔帕索分校计算机科学系) Department of Engineering Education and Leadership, The University of Texas at El Paso, El Paso, TX, USA(德克萨斯大学埃尔帕索分校工程教育与领导力系)

AI总结 本文提出一种结合因果推理和论辩推理的方法,用于解释机器学习模型为何做出特定预测,通过因果发现方法识别变量间的因果关系,并将其转化为双极论辩框架来表示特征间的支持与反对交互,最终通过半稳定语义确定解释性特征扩展。

Comments To be published in The 4th World Conference on eXplainable Artificial Intelligence

详情
AI中文摘要

可解释人工智能(XAI)方法旨在识别影响模型预测的相关特征,但往往无法清晰解释为何某些决策被做出。在本工作中,我们提出了一种新颖的方法,将因果推理与基于论辩的推理相结合,以解释模型为何做出预测。我们的方法首先使用因果发现方法识别变量间的因果关系,然后将这些关系转化为双极论辩框架(BAF)以表示特征间的支持与反对交互。通过使用半稳定语义,我们找到能够解释为何某些结果被选择的特征扩展。我们在两个基准数据集上展示了我们的方法,并将其结果与标准事后可解释性方法进行比较。

英文摘要

Explainable AI (XAI) methods identify which features are relevant to a model's predictions but often fail to clarify why certain decisions are made. In this work, we present a novel method that integrates causality with argument-based reasoning to explain why models may be making predictions. Our approach first identifies causal relationships among variables using causal discovery methods and then translates these into a Bipolar Argumentation Framework (BAF) to represent supportive and opposing interactions among features. By using semi-stable semantics, we find extensions of features that explain why certain outcomes may have been chosen. We demonstrate our method on two benchmark datasets and compare its results against standard post-hoc explainability approaches.

2605.21752 2026-05-22 cs.LG cs.AI

PEARL: Unbiased Percentile Estimation via Contrastive Learning for Industrial-Scale Livestream Recommendation

PEARL:通过对比学习实现工业级直播推荐的无偏百分位估计

Blake Gella, Wei Wu, Yuhao Yin, Zexi Huang, Zikai Wang, Emily Liu, Junlin Zhang, Wentao Guo, Qinglei Wang

发表机构 * TikTok ByteDance(字节跳动)

AI总结 本文提出PEARL框架,通过对比学习方法解决用户行为不平衡问题,通过相对偏好信号建模提升推荐系统的性能和鲁棒性。

详情
AI中文摘要

训练于用户交互数据的推荐系统容易受到行为强度不平衡的影响——这种系统性扭曲源于用户间异质的参与模式。这种不平衡会使反馈信号失真,使得观察到的互动不再真实反映真实的偏好,导致模型过度放大高活跃用户信号而低估其他人,最终在大规模情况下降低推荐质量与鲁棒性。为了解决这个问题,我们提出了一种非参数对比百分位近似框架PEARL,该框架建模相对偏好信号而非绝对参与程度。基于相对优势去偏,PEARL利用真实的对比交互样本直接近似百分位关系,而无需依赖辅助分布估计模型。我们提供了理论证明,表明这种成对比较能产生无偏的基于百分位的偏好信号估计。为了更广泛的应用,我们引入了基于预测的重采样机制用于百分位平滑以处理稀疏和离散的反馈,以及通用的价值加权形式和共训练策略以增强建模灵活性和表示学习。大量离线实验表明,PEARL有效减轻了行为偏差,并在多个排序目标上一致提高了推荐性能。在拥有数十亿用户的大规模直播平台部署后,在线A/B测试确认了实际收益:观看时长增加2.10%,消费金额增加0.80%,互动率增加1.49%,举报率降低6.91%。

英文摘要

Recommender systems trained on user interaction data are susceptible to behavioral intensity imbalance--a systematic distortion arising from heterogeneous engagement patterns across users. This imbalance skews feedback signals such that observed interactions no longer faithfully reflect true preferences, causing models to disproportionately amplify signals from highly active users while underrepresenting others, which ultimately degrades recommendation quality and robustness at scale. To address this issue, we propose a nonparametric contrastive percentile approximation framework, PEARL, that models relative preference signals instead of absolute engagement magnitudes. Building upon relative advantage debiasing, PEARL leverages real contrastive interaction samples to approximate percentile relationships directly, without relying on auxiliary distribution estimation models. We provide theoretical justification demonstrating that such pairwise comparisons yield unbiased estimates of percentile-based preference signals. For broader applicability, we introduce a prediction-based bootstrapping mechanism for percentile smoothing to handle sparse and discrete feedback, alongside a generalized value-weighted formulation and a co-training strategy to enhance both modeling flexibility and representation learning. Extensive offline experiments demonstrate that PEARL effectively mitigates behavioral bias and consistently improves recommendation performance across multiple ranking targets. Deployed in a production livestream platform with a combined user base of billions, online A/B testing confirms substantial real-world gains: +2.10% Watch Duration, +0.80% Consumption Amount, +1.49% Interaction Rate, and -6.91% Report Rate.
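The idea of replacing absolute engagement with a relative, percentile-style signal can be illustrated with a toy estimator built from pairwise comparisons. This is only a sketch of the general idea and not PEARL's estimator: the function name, the use of a single user's own history as the comparison set, and the tie handling are all assumptions.

```python
def pairwise_percentile(value, history):
    """Fraction of pairwise comparisons that `value` wins against `history`.

    Ties count as half a win, so the result lies in [0, 1].
    """
    if not history:
        raise ValueError("history must be non-empty")
    wins = sum(1.0 if value > h else 0.5 if value == h else 0.0 for h in history)
    return wins / len(history)

# Two hypothetical users with very different activity levels (watch seconds).
heavy_user_history = [300, 600, 900, 1200]
light_user_history = [5, 10, 15, 20]

heavy_signal = pairwise_percentile(900, heavy_user_history)
light_signal = pairwise_percentile(15, light_user_history)
```

Both users receive the same relative signal even though their raw watch times differ by a factor of 60, which is the kind of intensity imbalance the abstract aims to remove.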

2605.21751 2026-05-22 cs.LG

Models Can Model, But Can't Bind: Structured Grounding in Text-to-Optimization

模型可以建模,但无法绑定:文本到优化中的结构化数据落地(grounding)

Zhiqi Gao, Albert Ge, Alexander Berenbeim, Nathaniel D. Bastian, Frederic Sala

发表机构 * University of Wisconsin-Madison(威斯康星大学麦迪逊分校) United States Military Academy(美国军事学院)

AI总结 本文研究了文本到优化任务中建模与绑定两个关键能力的分离性,发现随着实例数据增长,模型准确性下降,提出BIND方法通过结构化文件外部化数据来提升绑定性能,验证了绑定专精模型在不同优化类别中的优势。

详情
AI中文摘要

文本到优化需要两种可分离的能力:建模(选择正确的优化结构)和绑定(将每个系数、索引和参数落实到具体的问题数据上)。我们通过Text2Opt-Bench研究这一问题。这是一个可扩展的基准,由经求解器验证的优化问题组成,涵盖12个类别,从教科书式的线性规划到变量多达数千个的随机和多目标形式。在10多个模型上,我们发现随着实例数据的增长,准确率会急剧下降,即使优化形式本身很简单。我们称之为有效绑定极限。我们通过一种简单的推理时方法BIND来解决该问题:它将数值数据外部化到结构化文件中,使模型以程序化方式绑定数据,而不是从提示中逐字转录。BIND将GPT-5-Nano的准确率从59.1%提升到82.4%,以低于pass@1的token成本达到了pass@5(82.0%)的水平,并将GPT-5的准确率从86.2%提升到95.8%。此外,我们通过仅在绑定任务上微调模型来验证我们的假设,结果表明在三个结构上不同的优化类别中,该模型优于端到端的SFT和RL,仅一个1.5B的绑定专精模型就能达到7B端到端基线的水平。

英文摘要

Text-to-optimization requires two separable capabilities: modeling -- choosing the right optimization structure -- and binding -- grounding every coefficient, index, and parameter in the concrete problem data. We study this via Text2Opt-Bench, a scalable benchmark of solver-verified optimization problems spanning 12 categories, from textbook linear programs to stochastic and multi-objective formulations with up to thousands of variables. Across 10+ models, we find that accuracy collapses as instance data grows, even when the formulation itself is simple. We call this the effective binding limit. We address this via a simple inference-time approach, BIND, which externalizes numeric data to structured files so the model binds data programmatically rather than transcribing from the prompt. BIND improves GPT-5-Nano from 59.1% to 82.4% accuracy, matching pass@5 (82.0%) at lower token cost than pass@1, and GPT-5 from 86.2% to 95.8%. Furthermore, we validate our hypothesis by finetuning a model exclusively on binding and show that it outperforms end-to-end SFT and RL across three structurally distinct optimization categories, with a 1.5B binding specialist alone matching a 7B end-to-end baseline.
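A rough sketch of what binding data programmatically, rather than transcribing it from the prompt, can look like is given below. The file layout, the field names (`costs`, `supply`, `demand`), and the helper functions are hypothetical and are not BIND's actual format. The point is only that coefficients are read from a structured file by index instead of being retyped.

```python
import json
import os
import tempfile

# Hypothetical instance data for a tiny transportation problem.
instance = {"costs": [[4, 6], [7, 3]], "supply": [10, 5], "demand": [8, 7]}

def save_instance(data, path):
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f)

def load_cost_coefficients(path):
    """Map each decision variable x[i][j] to its cost, read from the file."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    return {
        (i, j): cost
        for i, row in enumerate(data["costs"])
        for j, cost in enumerate(row)
    }

def total_cost(coeffs, plan):
    """Objective value of a shipping plan {(source, sink): quantity}."""
    return sum(coeffs[key] * qty for key, qty in plan.items())

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "instance.json")
    save_instance(instance, path)
    coeffs = load_cost_coefficients(path)

plan = {(0, 0): 8, (0, 1): 2, (1, 1): 5}
cost = total_cost(coeffs, plan)
```

The loading code stays the same size whether the cost matrix has four entries or four thousand, whereas transcription effort and the chance of a copying error grow with the data.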

2605.21748 2026-05-22 cs.CL

RankJudge: A Multi-Turn LLM-as-a-Judge Synthetic Benchmark Generator

RankJudge: 一个多轮LLM-as-a-Judge合成基准生成器

Zhenwei Tang, Zhaoyan Liu, Rasa Hosseinzadeh, Tongzi Wu, Keyvan Golestan, Jesse C. Cresswell

发表机构 * Layer 6 AI

AI总结 本文提出RankJudge,一种用于评估LLM作为评判者在多轮对话中表现的合成基准生成器,通过生成带有单个缺陷的对话对,实现对评判准确性的严格评估,并通过领域覆盖和21个前沿LLM评判者评估,验证了评判排名的稳定性。

详情
AI中文摘要

随着交互式基于LLM的应用被创建和优化,模型开发者需要在多个可能的轴上评估生成文本的质量。对于更简单的系统,人工评估可能是可行的,但在复杂的系统如对话聊天机器人中,生成文本的数量可能会超出人类注释资源的承受能力。模型开发者已经开始依赖自动评估,其中LLM也被用来判断生成质量。然而,现有的LLM-as-a-judge基准主要集中在简单的问答任务上,而无法匹配多轮对话的复杂性。我们引入了RankJudge,一种用于评估LLM-as-a-judge在基于参考文档的多轮对话中的基准生成器。RankJudge生成对话对,其中一组对话在某一回合中注入了一个单一的缺陷。这种构造使得对话对可以被无歧义地标记为更好或更差,并且能够精确地将失败类别隔离到单个回合中,从而实现一个严格的联合正确性标准来评判。我们实现了RankJudge在机器学习、生物医学和金融领域,评估了21个前沿LLM评判者,并通过Bradley-Terry模型对这些评判者进行排名。我们的方法还允许对每个对话对进行难度评分,我们利用这些评分动态地整理评估切片以减少标签噪声,这已通过人工注释得到验证。我们发现,在部分可观测性、更粗略的正确性标准以及替代的随机游走评分算法下,评判排名是稳定的。

英文摘要

As interactive LLM-based applications are created and refined, model developers need to evaluate the quality of generated text along many possible axes. For simpler systems, human evaluation may be practical, but in complicated systems like conversational chatbots, the amount of generated text can overwhelm human annotation resources. Model developers have begun to rely heavily on auto-evaluation, where LLMs are also used to judge generation quality. However, existing LLM-as-a-judge benchmarks largely focus on simple Q&A tasks that do not match the complexity of multi-turn conversations. We introduce RankJudge, a benchmark generator for evaluating LLM-as-a-judge on multi-turn conversations grounded in reference documents. RankJudge creates pairs of conversations where one conversation has a single flaw injected into one turn. This construction allows paired conversations to be labeled unambiguously as better or worse, and precisely isolates failure categories to individual turns, enabling a strict joint correctness criterion for judging. We implement RankJudge across the domains of machine learning, biomedicine, and finance, evaluate 21 frontier LLM judges, and rank those judges via the Bradley-Terry model. Our formulation also allows ranking each conversation pair with difficulty ratings, which we use to dynamically curate the evaluation slice to reduce label noise, as confirmed via human annotation. We find that judge rankings are stable under partial observability, coarser correctness criteria, and an alternative random-walk rating algorithm.
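The Bradley-Terry ranking step can be sketched with the classic minorization-maximization update on a matrix of pairwise win counts. How RankJudge turns judge verdicts into pairwise outcomes is not described in the abstract, so the win matrix here is purely hypothetical. The sketch also assumes every player has at least one win, which keeps all strengths positive.

```python
def fit_bradley_terry(wins, iters=200):
    """Estimate Bradley-Terry strengths from wins[i][j] = times i beat j.

    Uses the standard MM update p_i <- W_i / sum_j n_ij / (p_i + p_j),
    followed by normalization so the strengths sum to 1.
    """
    n = len(wins)
    p = [1.0 / n] * n
    for _ in range(iters):
        updated = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i
            )
            updated.append(total_wins / denom if denom > 0 else p[i])
        norm = sum(updated)
        p = [x / norm for x in updated]
    return p

# Hypothetical outcomes: player 0 beat player 1 three times and lost once.
strengths = fit_bradley_terry([[0, 3], [1, 0]])
```

Under the model, the probability that i beats j is p_i / (p_i + p_j), so the fitted strengths 0.75 and 0.25 reproduce the observed 3:1 record.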

2605.21747 2026-05-22 cs.CV cs.RO

Improving 3D Labeling in Self-Driving by Inferring Vehicle Information using Vision Language Models

通过利用视觉语言模型推断车辆信息以改进自动驾驶中的3D标注

Steven Chen, Shivesh Khaitan, Nemanja Djuric

发表机构 * Aurora Innovation, Inc.(Aurora创新公司)

AI总结 本文提出了一种利用视觉语言模型推断车辆信息以提高自动驾驶中3D车辆标注精度的方法,通过零样本推理车辆信息,结合车辆品牌与型号识别(VMMR)方法,提升了标注效率和质量。

Comments To appear in Proceedings of the IEEE Intelligent Vehicles Symposium (IV), 2026. Accepted for oral presentation

详情
AI中文摘要

我们提出了一种通过零样本推理车辆信息来提高自动驾驶应用中3D车辆标注的方法,利用车辆制造商和型号识别(VMMR)方法。所提出的方法利用视觉语言模型(VLM)从图像片段中推断车辆的制造商、型号和代数,并输出准确的3D包围盒尺寸以引导手动标注。我们评估了迭代提示工程和不同VLMs选择对车辆包围盒推断和制造商/型号/代数识别的影响。与强大的基线相比,所提出的方法不仅在准确性上表现出色,而且在缓解特定失败模式方面也表现出色,例如在车辆显著遮挡的情况下,VLMs提供的尺寸比初始激光雷达辅助的人工标注标签更优。在公共和专有数据上的实验强烈表明,我们的结论可以推广到不同的标注者和数据集。结果表明,将VLMs整合到标注过程中可以减少手动标注时间,同时提高标注质量。

英文摘要

We present an approach to improve 3D vehicle labeling in self-driving applications through zero-shot inference of vehicle information, leveraging Vehicle Make and Model Recognition (VMMR) methods. The proposed approach utilizes a Vision Language Model (VLM) to both infer a vehicle's make, model, and generation from image crops, and output accurate 3D bounding box dimensions to seed manual labeling. We evaluate the impact of iterative prompt engineering and the choice of different VLMs on both vehicle bounding box inference and make/model/generation recognition. When compared to strong baselines, the proposed approach not only shows high accuracy, but also excels in mitigating specific failure modes where VLMs provide better dimensions than initial lidar-aided human annotated labels (e.g., in cases of significant vehicle occlusion). Experiments on both public and proprietary data strongly suggest that our conclusions are generalizable across different labelers and datasets. The results demonstrate that integrating VLMs into the labeling process can reduce manual labeling time while increasing label quality.

2605.21745 2026-05-22 cs.LG

Quantitative coronary calcification analysis for prediction of myocardial ischemia using non-contrast CT calcium scoring

基于非增强CT钙化评分的冠状动脉钙化定量分析用于预测心肌缺血

Juhwan Lee, Sadeer Al-Kindi, Ammar Hoori, Tao Hu, Hao Wu, Justin N. Kim, Robert Gilkeson, Sanjay Rajagopalan, David L. Wilson

发表机构 * Virginia Commonwealth University(弗吉尼亚联邦大学) Case Western Reserve University(凯斯西储大学) Houston Methodist Hospital(休斯顿卫理公会医院) University Hospitals Cleveland Medical Center(University Hospitals克利夫兰医学中心) Department of Radiology, Case Western Reserve University(凯斯西储大学放射科)

AI总结 本文提出了一种新的机器学习框架,利用非增强CT钙化评分扫描中的定量冠状动脉钙化评估来预测心肌缺血,通过XGBoost和SHAP识别相关特征,并在5折交叉验证中训练和评估模型,结果显示钙化组学特征显著提高了预测性能。

Comments 15 pages, 4 figures, 3 tables

详情
AI中文摘要

非增强计算机断层扫描钙化评分(CTCS)被广泛认可为心血管风险分层的有效工具。本研究旨在开发一种新的机器学习框架,利用常规非增强CTCS扫描进行定量冠状动脉钙化评估,以预测心肌缺血。本研究分析了1,375名患者,这些患者于University Hospitals克利夫兰医学中心在一年内同时接受了非增强CTCS和瑞加德松(regadenoson)负荷心脏正电子发射断层扫描心肌灌注成像。总共评估了74个变量,包括临床变量、Agatston评分和钙化组学特征。通过XGBoost结合SHAP确定了相关特征。使用5折交叉验证训练和评估预测模型。在987名患者中,89名(9%)被确定为心肌缺血阳性。最终模型整合了Agatston评分、八个钙化组学特征和年龄。所提出的模型实现了98.9±3.0%的精确率、79.2±8.4%的灵敏度以及87.7±5.3%的F1分数。与仅使用临床变量或临床变量加Agatston评分的模型相比,添加钙化组学特征显著提高了预测性能(p<0.05)。有趣的是,尽管基于SHAP分析,钙化动脉的数量是排名最低的特征,但在逻辑回归分析中,它与心肌缺血的关联最强(比值比:3.63,95%置信区间:2.80-4.77,p<0.00001)。我们开发了一种机器学习方法,用于使用常规获取的非增强CTCS扫描预测心肌缺血。钙化组学特征在传统风险因素和Agatston评分之外提供了额外的预测价值,并可能支持更易获得的心血管风险分层。

英文摘要

Non-contrast computed tomography calcium scoring (CTCS) is widely recognized as an effective tool for cardiovascular risk stratification. This study aimed to develop a novel machine learning framework for predicting myocardial ischemia from routine non-contrast CTCS scans using quantitative coronary calcium assessment. This study analyzed 1,375 patients who underwent both non-contrast CTCS and regadenoson stress cardiac positron emission tomography myocardial perfusion imaging within one year at University Hospitals Cleveland Medical Center. A total of 74 variables, including clinical variables, Agatston score, and calcium-omics features, were evaluated. Relevant features were identified using XGBoost with Shapley Additive exPlanations (SHAP). Predictive models were trained and evaluated using 5-fold cross-validation. Among 987 patients, 89 (9%) were positive for myocardial ischemia. The final model incorporated the Agatston score, eight calcium-omics features, and age. The proposed model achieved a precision of 98.9±3.0%, sensitivity of 79.2±8.4%, and F1 score of 87.7±5.3%. The addition of calcium-omics features significantly improved predictive performance compared with models using clinical variables alone or clinical variables with the Agatston score (p<0.05). Interestingly, the number of calcified arteries, despite being the lowest-ranked feature based on SHAP analysis, showed the strongest association with myocardial ischemia in logistic regression analysis (odds ratio: 3.63, 95% confidence interval: 2.80-4.77, p<0.00001). We developed a machine learning approach for predicting myocardial ischemia using routinely acquired non-contrast CTCS scans. Calcium-omics features provided incremental predictive value beyond conventional risk factors and Agatston scoring and may support more accessible cardiovascular risk stratification.

2605.21742 2026-05-22 cs.LG cs.IT math.IT

Correcting Class Imbalance in Prior-Data Fitted Networks for Tabular Classification

修正先验数据拟合网络在表格分类中的类别不平衡

Samuel McDowell, Nathan Stromberg, Lalitha Sankar

发表机构 * School of Electrical, Computer and Energy Engineering(电气、计算机与能源工程学院)

AI总结 本文研究了如何修正先验数据拟合网络在表格分类中因类别不平衡导致的性能问题,通过分析现有技术发现阈值法因PFNs的校准特性表现优异,下采样因PFNs的有限数据性能表现相当,并具有降低推理计算成本的优势。

Comments 5 pages, 6 figures, Information Theory Workshop (ITW)

详情
英文摘要

Prior-data fitted networks (PFNs) have achieved exceptional performance on tabular classification tasks. However, like other classifiers, their performance can suffer under the effect of class imbalance, resulting in poor performance for rare classes. Several techniques exist that attempt to mitigate the deleterious effect of class imbalance on classification performance, but the in-context learning (ICL) dynamic of PFNs means that loss-based strategies are impossible, and other techniques are unproven. We have adapted several classical techniques addressing class imbalance and analyzed their performance on PFN classification. We observe that thresholding performs exceptionally well because of the calibration characteristics of PFNs, and downsampling performs comparably because of PFNs' exceptional limited-data performance, with the additional benefit of reduced computation cost for inference.
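Thresholding, one of the classical techniques mentioned above, needs no change to the loss: the decision cut-off on the predicted probability is simply tuned on held-out data. The sketch below picks the cut-off that maximizes balanced accuracy. The selection criterion, function names, and toy data are assumptions, since the abstract does not state which objective the authors tune.

```python
def balanced_accuracy(probs, labels, threshold):
    """Mean of true-positive and true-negative rates at a given cut-off."""
    preds = [1 if p >= threshold else 0 for p in probs]
    pos = sum(labels)
    neg = len(labels) - pos
    tp = sum(1 for y, yhat in zip(labels, preds) if y == 1 and yhat == 1)
    tn = sum(1 for y, yhat in zip(labels, preds) if y == 0 and yhat == 0)
    return 0.5 * (tp / pos + tn / neg)

def best_threshold(probs, labels):
    """Try each observed probability as a cut-off and keep the best one."""
    candidates = sorted(set(probs))
    return max(candidates, key=lambda t: balanced_accuracy(probs, labels, t))

# Hypothetical validation predictions for the rare positive class.
val_probs = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40]
val_labels = [0, 0, 0, 0, 1, 1]

threshold = best_threshold(val_probs, val_labels)
```

In this toy set the default cut-off of 0.5 would miss every positive, while the tuned cut-off of 0.35 recovers both. A threshold chosen this way is only meaningful when the predicted probabilities track the true class frequencies, which is why the abstract ties the method's success to the calibration of PFNs.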

2605.21728 2026-05-22 cs.CV cs.CL cs.LG

BEiTScore: Reference-free Image Captioning Evaluation with an Efficient Cross-Encoder Model

BEiTScore: 一种基于高效交叉编码器的无参考图像描述评估方法

Gonçalo Gomes, Bruno Martins, Chrysoula Zerva

发表机构 * Instituto Superior Técnico(里斯本大学理工学院) INESC-ID Instituto de Telecomunicações(电信机构)

AI总结 本文提出了一种无参考图像描述评估方法BEiTScore,通过高效的交叉编码器模型解决传统评估方法在计算成本和敏感性方面的不足,提出了一种新的评估指标,并在多种场景下验证了其优越的性能。

详情
AI中文摘要

图像描述评估仍是一个重大挑战,因为视觉-语言模型朝着生成长形式和上下文丰富的描述等更具挑战性的能力发展。最先进的评估度量标准涉及使用大型语言模型(LLMs)作为评判者的大量计算成本,或者受到标准CLIP基于编码器的限制,例如严格的令牌限制、缺乏细粒度敏感性或缺乏组合泛化能力,因为将描述视为“词袋”。我们提出了一种新的学习度量标准,以解决上述挑战,基于一个轻量级交叉编码器,其初始化来自视觉问答模型检查点,平衡了强大的权重初始化与计算效率。我们的训练方案使用精心编排的数据混合进行监督学习,特征是对抗性的LLM基于数据增强,以增强模型对细粒度视觉-语言错误的敏感性。我们还引入了一个新的基准,用于在多种场景中评估详细的描述评估。实验结果表明,所提出的度量标准在保持大规模基准测试、质量感知解码或奖励指导所需的效率的同时,实现了最先进的性能。

英文摘要

Image captioning evaluation remains a significant challenge, as vision-language models evolve toward more challenging capabilities such as generating long-form and context-rich descriptions. State-of-the-art evaluation metrics involve extensive computational costs associated with the use of Large Language Models (LLMs) as judges, or instead suffer from the limitations of standard CLIP-based encoders, such as strict token limits, lack of fine-grained sensitivity, or lack of compositional generalization by treating captions as "bags-of-words." We propose a new learned metric that tackles the aforementioned challenges, based on a lightweight cross-encoder that is initialized from a visual question-answering model checkpoint, balancing a strong weight initialization with computational efficiency. Our training scheme uses a carefully assembled data mixture for supervised learning, featuring adversarial LLM-based data augmentations to enhance model sensitivity to fine-grained visual-linguistic errors. We also introduce a new benchmark designed to assess detailed captioning evaluation across diverse scenarios. Experimental results demonstrate that the proposed metric achieves state-of-the-art performance while maintaining the efficiency required for large-scale benchmarking, quality-aware decoding, or reward guidance.

2605.21726 2026-05-22 cs.CL cs.AI

Probabilistic Attribution For Large Language Models

Shilpika Shilpika, Carlo Graziani, Bethany Lusch, Venkatram Vishwanath, Michael E. Papka

Affiliations: Argonne Leadership Computing Facility; Argonne National Laboratory; Mathematics and Computer Science Division; Department of Computer Science

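AI summary: This paper proposes a model-agnostic probabilistic token attribution measure that uses Bayes' rule to invert next-token log-probabilities, capturing the model's internal representation of the distribution over token sequences and thereby improving the interpretability of large language models.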

Comments 29 pages, 13 figures

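AI-generated Chinese abstract (translated to English)

The generative nature of large language models (LLMs) shows in the conditional probabilities they compute for each response token in order to sample it given the preceding tokens. These probabilities encode the distributional structure that the model learns in training and exploits at inference. In this paper, we use these probabilities to place LLMs within the mathematical theory of stochastic processes. We use this framework to design a model-agnostic probabilistic token attribution measure that applies Bayes' rule to invert the next-token log-probabilities, capturing the model's internal representation of the distribution over token sequences. This representation is independent of the model's computational structure. It yields the conditional probability of the response given the prompt, and of the response given the prompt with one token marginalized away. Our attribution score is the logarithm of the ratio of these two probabilities. We further compute the entropy of the distribution of a single prompt token, conditioned on the remaining context. The interplay between entropy and attribution score reveals LLM behavior. We evaluate 8 models on 7 prompts and investigate anomalies, token sensitivity, response stability, model stability, and training convergence, thereby improving interpretability and guiding users toward uncertain or unstable parts of the generation.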

English abstract

The generative nature of Large Language Models (LLMs) is reflected in the conditional probabilities they compute to sample each response token given the previous tokens. These probabilities encode the distributional structure that the model learns in training and exploits in inference. In this work, we use these probabilities to situate LLMs within the mathematical theory of stochastic processes. We use this framework to design a model-agnostic probabilistic token attribution measure, using Bayes' rule to invert the next-token log-probabilities so as to capture the model's internal representation of the distribution over token sequences. The representation is independent of the model's computational structure. This representation yields the conditional probability of the response given the prompt, and of the response given the prompt with a token marginalized away. Our attribution score is the log of the ratio of these probabilities. We further compute the entropies of a single prompt's token distributions, conditioned on the remaining context. The interplay between entropy and attribution score sheds light on LLM behavior. We evaluate 8 models across 7 prompts and investigate anomalies, token sensitivity, response stability, model stability, and training convergence, thereby improving interpretability and guiding users to focus on uncertain or unstable parts of the generation.
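
As a rough illustration of the score described in the abstract (the log of the ratio between the response probability given the full prompt and the response probability with one prompt token marginalized away), here is a toy sketch. It is not the authors' implementation: the candidate-token distribution, the per-candidate response probabilities, and all function names are hypothetical, and marginalization is assumed to be a probability-weighted sum over a small set of candidate tokens.

```python
import math

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) for a list of log-values."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def marginalized_log_prob(token_probs, response_log_probs):
    """log P(response | prompt with one token marginalized away).

    token_probs[v]        : assumed P(token = v | remaining context)
    response_log_probs[v] : assumed log P(response | prompt with v in the slot)
    """
    terms = [math.log(p) + response_log_probs[v]
             for v, p in token_probs.items() if p > 0]
    return log_sum_exp(terms)

def attribution_score(log_p_full, log_p_marginalized):
    """Log of the ratio of the two response probabilities."""
    return log_p_full - log_p_marginalized

def entropy(token_probs):
    """Shannon entropy (in nats) of the token distribution for one prompt slot."""
    return -sum(p * math.log(p) for p in token_probs.values() if p > 0)

# Toy numbers (hypothetical): the prompt slot actually holds "Paris".
token_probs = {"Paris": 0.5, "Rome": 0.25, "Oslo": 0.25}
response_log_probs = {
    "Paris": math.log(0.8),
    "Rome": math.log(0.1),
    "Oslo": math.log(0.1),
}

log_p_full = response_log_probs["Paris"]
log_p_marg = marginalized_log_prob(token_probs, response_log_probs)
score = attribution_score(log_p_full, log_p_marg)
slot_entropy = entropy(token_probs)
```

In this toy setup a positive score means the observed token makes the response more likely than it would be on average over the alternatives; the entropy indicates how uncertain the slot is given the rest of the prompt.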

2605.21723 2026-05-22 cs.RO cs.AI cs.MA cs.SY eess.SY

Learning Altruistic Collaboration in Heterogeneous Multi-Team Systems

Riwa Karam, Ruoyu Lin, Brooks A. Butler, Magnus Egerstedt

Affiliations: Samueli School of Engineering, University of California, Irvine; University of North Carolina at Chapel Hill

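AI summary: This paper studies heterogeneous multi-team collaboration through dynamic robot allocation, treating robots as transferable resources. With Hamilton's rule from ecology as the altruistic decision-making mechanism, it proposes a multi-team collaborative resource allocation framework with heterogeneous capabilities, transfer costs, and capability-dependent contributions. The resulting allocation problem is combinatorial and shown to be NP-hard. For scalability, the authors develop a graph neural network policy that approximates the Hamilton's-rule-based altruistic allocations under centralized training and decentralized execution. The model operates on the team interaction graph and predicts robot-level transfer decisions and the next robot-to-team assignments. Simulations and experiments in a firefighting scenario show that the learned policy achieves near-optimal performance while scaling to larger systems.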

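AI-generated Chinese abstract (translated to English)

This paper studies heterogeneous multi-team collaboration achieved through dynamic robot allocation, in which robots are treated as transferable resources. Using Hamilton's rule from ecology as an altruistic decision-making mechanism, we propose a multi-team collaborative resource allocation framework with heterogeneous capabilities, transfer costs, and capability-dependent contributions. The resulting allocation problem is combinatorial and is shown to be NP-hard. To address scalability, we develop a graph neural network policy that, under centralized training and decentralized execution, approximates the altruistic allocations based on Hamilton's rule. The model operates on the team interaction graph and predicts robot-level transfer decisions and the next robot-to-team assignments. The approach is validated through simulations and experiments in a firefighting scenario, showing that the learned policy achieves near-optimal performance while scaling to larger systems.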

English abstract

This paper studies heterogeneous multi-team collaboration through dynamic robot allocation, where robots are treated as transferable resources. Leveraging Hamilton's rule from ecology as an altruistic decision-making mechanism, we propose a multi-team collaborative resource allocation framework with heterogeneous capabilities, transfer costs, and capability-dependent contributions. The resulting allocation problem is combinatorial and is shown to be NP-hard. To address scalability, we develop a graph neural network policy under centralized training and decentralized execution that approximates the altruistic allocations based on Hamilton's rule. The model operates over the team interaction graph and predicts robot-level transfer decisions and next robot-to-team assignments. The proposed approach is validated in a firefighting scenario through simulations and experiments, demonstrating that the learned policy achieves near-optimal performance while scaling to larger systems.
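
For readers unfamiliar with Hamilton's rule, its classical form says an altruistic act is favored when r·B > C (relatedness times benefit to the recipient exceeds the cost to the donor). The sketch below applies that inequality to robot transfers between teams. It is only an illustration under assumptions: the abstract does not give the paper's exact formulation, so the `TransferOption` fields, the example values, and the strict-inequality test are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class TransferOption:
    robot: str
    relatedness: float  # r: assumed weight on the receiving team's gain
    benefit: float      # B: assumed gain to the receiving team
    cost: float         # C: assumed loss to the donor team, including transfer cost

def hamilton_margin(option: TransferOption) -> float:
    """Classical Hamilton's rule margin r*B - C; positive favors the transfer."""
    return option.relatedness * option.benefit - option.cost

def altruistic_transfers(options: list[TransferOption]) -> list[str]:
    """Return the robots whose transfer satisfies r*B > C (strict inequality)."""
    return [o.robot for o in options if hamilton_margin(o) > 0]

# Hypothetical candidates for one donor team.
options = [
    TransferOption("drone_1", relatedness=0.5, benefit=10.0, cost=3.0),  # margin  2.0
    TransferOption("rover_2", relatedness=0.5, benefit=4.0, cost=3.0),   # margin -1.0
    TransferOption("drone_3", relatedness=1.0, benefit=2.0, cost=2.0),   # margin  0.0
]
chosen = altruistic_transfers(options)
```

Evaluating each option independently like this sidesteps the combinatorial coupling between transfers, which is the part the abstract reports as NP-hard and approximates with a learned policy.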

2605.21719 2026-05-22 cs.RO cs.SY eess.SY

Mind the Gaps: Multi-Robot Feedback-Driven Ergodic Coverage in Unknown Environments

Thales Costa Silva, Nora Ayanian

Affiliations: Department of Computer Science at Brown University

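AI summary: This paper proposes a feedback-driven ergodic coverage strategy for multi-robot teams that uses real-time feedback from an environmental model to adjust the robots' sampling behavior, improving coverage efficiency and resource allocation in unknown environments.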

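AI-generated Chinese abstract (translated to English)

In this paper, we address multi-robot adaptive coverage, in which a team of robots performs dynamic sampling by continuously adjusting positions to collect environmental data. The task is challenging, especially when robots must be allocated efficiently to new sampling locations over time. Ergodic search methods optimize robot trajectories by ensuring that the robots' time-averaged spatial distribution matches the spatial distribution of environmental information. Although these methods promote effective exploration when the target distribution is known, they often fail to account for unknown prior distributions of the environment. To overcome this limitation, we propose an adaptive coverage strategy that uses real-time feedback from an environmental model to adjust robot sampling behavior in response to unknown conditions. Our approach enhances traditional ergodic trajectory optimization by constructing the target spatial information distribution from parametric models of the environment that are updated online. The strategy assumes that the environment is static or changes slowly relative to the robots' motion. Our framework lets robots dynamically prioritize regions of high interest, improving coverage efficiency, synthesizing effective control policies for individual agents, and optimizing resource use in settings with unknown prior distributions. We validate the approach in simulation, demonstrating its effectiveness in improving coverage and resource allocation.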

English abstract

In this work, we address the problem of multi-robot adaptive coverage, where teams of robots perform dynamic sampling by continuously adjusting their positions to collect data in an environment. This task can be challenging, particularly when robots must be efficiently allocated to new sampling locations over time. Ergodic search methods optimize robot trajectories by ensuring that the robots' time-averaged spatial distribution aligns with the spatial distribution of environmental information. While these methods promote effective exploration provided a target distribution, they often fail to account for unknown prior distributions of the environment. To overcome this limitation, we propose an adaptive coverage strategy that utilizes real-time feedback from an environmental model to adjust robot sampling behavior in response to unknown conditions. Our approach enhances traditional ergodic trajectory optimization by constructing a target spatial information distribution based on parametric models of the environment, which are updated online. This strategy assumes that the environment is either static or changes slowly compared to the robot's motion. Our framework allows robots to dynamically prioritize regions of high interest, improving coverage efficiency, synthesizing effective control policies for individual agents, and optimizing resource use in settings with unknown prior distributions. We validate our approach through simulations, demonstrating its effectiveness in enhancing coverage and resource allocation.
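
The core idea of ergodic coverage, matching the robots' time-averaged spatial distribution to a target information distribution, can be illustrated with a simplified grid proxy. This is not the paper's method: ergodic metrics are commonly defined on spectral (Fourier) coefficients, whereas the sketch below assumes a discretized 1-D grid and a plain squared-difference gap, and all names and numbers are hypothetical.

```python
def time_averaged_distribution(trajectories, n_cells):
    """Fraction of all time steps (across all robots) spent in each grid cell."""
    counts = [0] * n_cells
    total = 0
    for trajectory in trajectories:
        for cell in trajectory:
            counts[cell] += 1
            total += 1
    return [c / total for c in counts]

def coverage_gap(visit_dist, target_dist):
    """Squared-difference gap between visitation and target distributions.

    A simplified stand-in for an ergodic metric; zero means a perfect match.
    """
    return sum((v - t) ** 2 for v, t in zip(visit_dist, target_dist))

# Hypothetical target information distribution over 4 cells.
target = [0.1, 0.2, 0.3, 0.4]

# Two robots whose combined visits follow the target (1, 2, 3, 4 visits per cell).
matched = [[0, 1, 2, 3, 3], [1, 2, 2, 3, 3]]
# One robot sweeping every cell equally often.
uniform = [[0, 1, 2, 3]]

gap_matched = coverage_gap(time_averaged_distribution(matched, 4), target)
gap_uniform = coverage_gap(time_averaged_distribution(uniform, 4), target)
```

In a feedback-driven setting such as the one the abstract describes, `target` would be re-estimated online from the environmental model rather than fixed in advance.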