arXivDaily: daily arXiv research digest, updated Monday through Friday
2605.21862 2026-05-22 cs.RO cs.AI

EvoScene-VLA: Evolving Scene Beliefs Inside the Action Decoder for Chunked Robot Control

Chushan Zhang, Ruihan Lu, Jinguang Tong, Xuesong Li, Yikai Wang, Hongdong Li

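AI summary: This paper proposes EvoScene-VLA, which maintains an updated scene state in the action decoder to improve multi-step control prediction in chunked robot control, making scene beliefs more persistent and accurate.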

Abstract

Chunked vision-language-action (VLA) policies predict multi-step robot controls, conditioning each update on the current visual observation alone. Yet robot actions cause contact, occlusion, and object motion, and the geometry that later decisions depend on can change before the next visual update arrives. Spatial VLAs improve current-frame geometry. Temporal VLAs aggregate past frames. Neither maintains an action-updated scene prior across chunks. We argue for a persistent action-updated scene state across control calls, and introduce EvoScene-VLA. Its recurrent scene prefix carries a geometry-aware scene state across chunks. At each vision-language model (VLM) call, the VLM combines scene information from the current observation with the action-updated prior from the previous chunk; the action decoder outputs both the next action chunk and a compact scene update. This update becomes the next prior, which the VLM corrects against the new observation when the next call arrives. Each control call therefore starts from a scene prior that reflects both recent actions and fresh visual evidence. During training, Scene Predictor supplies future scene-token targets, and Geometric Anchor aligns scene slots with frozen depth and 3D teachers. We discard both modules at deployment. On 31 RoboTwin tasks, EvoScene-VLA raises average success from 87.2% to 89.1% in fixed evaluation and from 86.1% to 88.5% in randomized evaluation. On the Galaxea R1-Lite real robot, EvoScene-VLA outperforms all baselines.
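
The recurrent data flow this abstract describes (fuse the fresh observation with the previous prior, decode an action chunk plus a scene update, carry the update forward) can be sketched as a short loop. This is a toy illustration under stated assumptions, not the authors' implementation: `run_chunked_control`, `toy_fuse`, and `toy_decode` are hypothetical names, the scene state is reduced to a list of floats, and the simple averaging rule stands in for whatever the VLM actually learns.

```python
from typing import Callable, List, Optional, Sequence, Tuple

SceneState = List[float]
ActionChunk = List[float]


def run_chunked_control(
    observations: Sequence[SceneState],
    fuse: Callable[[SceneState, Optional[SceneState]], SceneState],
    decode: Callable[[SceneState], Tuple[ActionChunk, SceneState]],
) -> List[ActionChunk]:
    """Run one control call per observation, carrying a scene prior across calls."""
    prior: Optional[SceneState] = None
    chunks: List[ActionChunk] = []
    for obs in observations:
        scene = fuse(obs, prior)        # correct the prior against the new observation
        chunk, update = decode(scene)   # action chunk plus a compact scene update
        chunks.append(chunk)
        prior = update                  # the update becomes the next call's prior
    return chunks


def toy_fuse(obs: SceneState, prior: Optional[SceneState]) -> SceneState:
    """Hypothetical fusion rule: average the observation with the prior, if any."""
    if prior is None:
        return list(obs)
    return [0.5 * o + 0.5 * p for o, p in zip(obs, prior)]


def toy_decode(scene: SceneState) -> Tuple[ActionChunk, SceneState]:
    """Hypothetical decoder: act to cancel the offset and predict the resulting scene."""
    chunk = [-s for s in scene]
    update = [s + a for s, a in zip(scene, chunk)]
    return chunk, update


chunks = run_chunked_control([[4.0], [2.0]], toy_fuse, toy_decode)
print(chunks)
```

In the toy run, the second call acts on a belief that blends the new observation with the predicted post-action scene, which is the behavior the abstract attributes to the action-updated prior.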

2605.21861 2026-05-22 cs.CV cs.AI

Learning Emergent Modular Representations in Multi-modality Medical Vision Foundation Models

Yuting He, Chenyu You, Shuo Li

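AI summary: This paper proposes the Director-Experts (DEX) framework, which regulates modular dynamics to learn stable modular representations in multi-modality medical vision foundation models. After pre-training on a new medical vision benchmark, DEX reports improvements across 26 downstream tasks.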

Comments Accepted by KDD 2026

Abstract

Multi-modality medical vision (MV) foundation models (FM) are fundamentally challenged by pronounced Non-IID feature statistics across heterogeneous imaging modalities. Monolithic self-supervised optimization on such data induces conflicting gradients, driving representations to collapse toward modality-dominant shortcuts. This work reframes this failure as an imbalance between specialization and coordination in emergent modularity, and proposes Director-Experts (DEX), a modular network that explicitly regulates these dynamics in stacked modules. Each DEX module comprises a pool of experts, dynamically adapted by our image-wise activation strategy, that autonomously specialize in modality-dominant statistics, together with a director, updated via our group exponential moving average, that distills multi-expert knowledge into a shared space for semantic integration across modalities, thus driving the emergence of modular representations. We curate a new benchmark, Medical Vision Universe, comprising over 4 million images across 10 modalities, which gives DEX FM-level pre-training with the broadest coverage of distinct imaging modalities. Extensive evaluations on 26 downstream tasks demonstrate improved optimization behavior and transferability, positioning DEX as a principled step toward general-purpose multi-modality medical AI. Our code and dataset will be released at https://github.com/YutingHe-list/DEX.

2605.21858 2026-05-22 cs.CL

Hypergraph as Language

Mengqi Lei, Guohuan Xie, Shihui Ying, Shaoyi Du, Jun-Hai Yong, Siqi Li, Yue Gao

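AI summary: This paper proposes Hyper-Align, a hypergraph-based alignment framework for language models that converts hypergraph structure into hypergraph tokens an LLM can consume, handling high-order associations more effectively and achieving significant gains on structural modeling tasks.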

Abstract

Large language models (LLMs) have recently shown strong potential in modeling relational structures. However, existing approaches remain fundamentally graph-centric: they focus on processing pairwise graph structures into tokens that LLMs can understand. In contrast, many real-world relational patterns do not naturally conform to the pairwise-edge assumption, and are better modeled as high-order associations in hypergraphs. For hypergraph structures, existing methods often fail to preserve the native semantics that multiple objects are jointly connected by the same high-order relation, limiting their ability to exploit complex structures. To address this limitation, we put forth the "Hypergraph as Language" perspective and propose Hyper-Align, a hypergraph-native alignment framework for large language models. Hyper-Align compiles the query-object-centered hypergraph context into hypergraph tokens directly consumable by a base LLM. Specifically, we introduce Hypergraph Incidence Detail Template with Overview (HIDT-O), which serializes high-order association structures into a fixed-shape hybrid template combining local incidence details and overview-level summaries. We then design a Hypergraph Incidence Projector (HIP), which maps native high-order incidence structures into the LLM token space through explicit semantic-structural decoupling and bidirectional message passing between vertices and hyperedges. We further define a concrete Hypergraph-as-Language input protocol, which jointly feeds hypergraph tokens and textual prompts into a frozen base LLM, supporting both vertex-level and hyperedge-level tasks under a unified question-answering paradigm. To systematically evaluate different methods in hypergraph structural modeling, we introduce HyperAlign-Bench. Extensive experiments show that Hyper-Align significantly outperforms existing methods across in-domain and zero-shot evaluations.
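
To make the contrast with pairwise edges concrete, the sketch below keeps each hyperedge as one unit with all of its members and prints the hyperedges incident to a query vertex, preceded by a one-line overview. It is only an illustration of the incidence idea: the functions `incidence` and `serialize`, the text layout, and the toy co-authorship data are hypothetical, and this plain-text output is not the HIDT-O template or the token representation produced by the paper's projector.

```python
from typing import Dict, List


def incidence(hyperedges: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Map each vertex to the sorted list of hyperedges that contain it."""
    by_vertex: Dict[str, List[str]] = {}
    for edge, members in hyperedges.items():
        for vertex in members:
            by_vertex.setdefault(vertex, []).append(edge)
    return {v: sorted(edges) for v, edges in sorted(by_vertex.items())}


def serialize(hyperedges: Dict[str, List[str]], query: str) -> str:
    """Render an overview line plus every hyperedge incident to the query vertex."""
    inc = incidence(hyperedges)
    lines = [f"overview: {len(inc)} vertices, {len(hyperedges)} hyperedges"]
    for edge in inc.get(query, []):
        members = ", ".join(sorted(hyperedges[edge]))
        # Each hyperedge is emitted whole, so joint membership is preserved.
        lines.append(f"{edge}: {{{members}}}")
    return "\n".join(lines)


# Hypothetical co-authorship hypergraph: each hyperedge is one paper.
papers = {"e1": ["alice", "bob", "carol"], "e2": ["bob", "dave"]}
text = serialize(papers, "bob")
print(text)
```

Expanding `e1` into the pairs alice-bob, bob-carol, and alice-carol would lose the fact that all three share a single relation, which is the semantic loss the abstract attributes to graph-centric methods.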

2605.21856 2026-05-22 cs.LG cs.AI

The Illusion of Reasoning: Exposing Evasive Data Contamination in LLMs via Zero-CoT Truncation

Yifan Lan, Yuanpu Cao, Hanyu Wang, Lu Lin, Jinghui Chen

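AI summary: This paper proposes the Zero-CoT Probe (ZCP), which truncates the entire reasoning process to expose latent shortcut mappings and thereby detect both direct and evasive data contamination in LLMs. It also introduces the Contamination Confidence metric to quantify the likelihood and severity of contamination.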

Abstract

Large language models (LLMs) have demonstrated impressive reasoning abilities across a wide range of tasks, but data contamination undermines the objective evaluation of these capabilities. This problem is further exacerbated by malicious model publishers who use evasive, or indirect, contamination strategies, such as paraphrasing benchmark data to evade existing detection methods and artificially boost leaderboard performance. Current approaches struggle to reliably detect such stealthy contamination. In this work, we uncover a critical phenomenon: a model's generated reasoning steps actively mask its underlying memorization. Inspired by this, we propose the Zero-CoT Probe (ZCP), a novel black-box detection method that deliberately truncates the entire Chain-of-Thought (CoT) process to expose latent shortcut mappings. To further isolate memorization from the model's intrinsic problem-solving capabilities, ZCP compares the model's zero-CoT performance on the original benchmark against an isomorphically perturbed reference dataset. Furthermore, we introduce Contamination Confidence, a metric that quantifies both the likelihood and severity of contamination, moving beyond simple binary classifications. Extensive experiments on both previously identified contaminated models and specially fine-tuned contaminated models demonstrate that ZCP robustly detects both direct and evasive data contamination. The code for ZCP is accessible at https://github.com/Yifan-Lan/zero-cot-probe.
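
The comparison this abstract describes, zero-CoT accuracy on the original benchmark versus an isomorphically perturbed reference set, can be sketched as follows. The abstract does not define Contamination Confidence, so the sketch reports only a raw accuracy gap and does not attempt to reproduce that metric. All names (`zero_cot_accuracy`, `zero_cot_gap`, `toy_model`) and the toy data are hypothetical; `answer_fn` stands in for a model that is forced to answer with no reasoning tokens.

```python
from typing import Callable, Dict, Sequence, Tuple

Item = Tuple[str, str]  # (question, gold answer)


def zero_cot_accuracy(answer_fn: Callable[[str], str], items: Sequence[Item]) -> float:
    """Fraction of items answered correctly when the model must answer directly."""
    if not items:
        raise ValueError("items must be non-empty")
    correct = sum(1 for question, gold in items if answer_fn(question).strip() == gold.strip())
    return correct / len(items)


def zero_cot_gap(
    answer_fn: Callable[[str], str],
    original: Sequence[Item],
    perturbed: Sequence[Item],
) -> float:
    """Accuracy on the original items minus accuracy on the perturbed items."""
    return zero_cot_accuracy(answer_fn, original) - zero_cot_accuracy(answer_fn, perturbed)


# Toy stand-in for a model that has memorized the original benchmark answers.
MEMORIZED: Dict[str, str] = {"What is 17 * 3?": "51", "What is 12 + 30?": "42"}


def toy_model(question: str) -> str:
    return MEMORIZED.get(question, "unknown")


original = [("What is 17 * 3?", "51"), ("What is 12 + 30?", "42")]
perturbed = [("What is 19 * 3?", "57"), ("What is 14 + 30?", "44")]

gap = zero_cot_gap(toy_model, original, perturbed)
print(gap)
```

A large positive gap means the model answers the original items without reasoning but fails on structurally equivalent ones, which is the memorization signature the abstract targets; a model that truly solves the problems should score similarly on both sets.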

2605.21852 2026-05-22 cs.CV

Seizure-Semiology-Suite (S3): A Clinically Multimodal Dataset, Benchmark, and Models for Seizure Semiology Understanding

Lina Zhang, Tonmoy Monsoor, Peizheng Li, Jiarui Cui, Xinyi Peng, Chong Han, Prateik Sinha, Siyuan Dai, Jessica Nichole Pasqua, Colin M McCrimmon, Weiting Liu, Hailey Marie Miranda, Bing Hu, Xiangting Wu, Tengyou Xu, Chunhan Li, Jiaye Tian, Jiarui Tang, Detao Ma, Lingye Kong, Junnan Lyu, Jungang Li, Yan Zan, Junhua Huang, Rajarshi Mazumder, Vwani Roychowdhury

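AI summary: This paper presents the S3 dataset and benchmark for fine-grained, structured seizure semiology understanding. It evaluates multimodal large language models on low-level visual perception, temporal sequencing, narrative report generation, and seizure diagnosis, revealing systematic weaknesses in laterality reasoning, temporal localization, symptom sequencing, and clinically faithful reporting. It also shows that seizure-specific fine-tuning improves performance and that a two-stage neuro-symbolic framework reaches a high F1 score on epileptic versus non-epileptic seizure classification.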

Comments Accepted to ICML 2026 as a Spotlight presentation

Abstract

While Multimodal Large Language Models (MLLMs) have demonstrated remarkable proficiency in general video understanding, their capacity to interpret involuntary, spatio-temporally evolving pathologic motor behaviors such as seizure semiology remains largely untested. To address this gap, we introduce Seizure-Semiology-Suite, a clinically grounded dataset and benchmark for fine-grained, structured seizure semiology understanding. The dataset includes 438 seizure videos annotated with over 35,000 dense labels covering 20 ILAE-defined semiological features. Building on this dataset, we propose a seven-task hierarchical benchmark that systematically evaluates MLLMs from low-level visual perception to temporal sequencing, narrative report generation, and seizure diagnosis. To enable clinically meaningful evaluation of generated reports, we further introduce the Report Quality Index for Seizure Semiology (Seizure-RQI). Extensive baselines across 11 open-weight MLLMs reveal systematic weaknesses in laterality reasoning, temporal localization, symptom sequencing, and clinically faithful reporting. We show that seizure-specific fine-tuning substantially improves performance across tasks, and that a two-stage neuro-symbolic framework achieves an F1 score of 0.96 on epileptic versus non-epileptic seizure classification. Seizure-Semiology-Suite establishes a rigorous benchmark for evaluating multimodal models in safety-critical medical video understanding and guides the development of clinically reliable, domain-adaptive multimodal intelligence.

2605.21849 2026-05-22 cs.LG cs.CL

Geometry-Adaptive Explainer for Faithful Dictionary-Based Interpretability under Distribution Shift

Sungjun Lim, Heedong Kim, Andrew Lee, Kyungwoo Song

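AI summary: This paper proposes the Geometry-Adaptive Explainer (GAE) to improve dictionary-based interpretability under distribution shift. By realigning the explainer's dictionary with the shifted active subspace while preserving the original feature structure, GAE narrows the faithfulness gap under distribution shift using only unlabeled activations.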

Abstract

Mechanistic interpretability aims to explain a model's behavior by identifying causally responsible internal structures. Dictionary-based explainers such as sparse autoencoders and transcoders are a primary tool, but their faithfulness under out-of-distribution (OOD) shift has received little systematic attention. We show that distribution shift rotates the subspace that the model actively uses, misaligning the explainer's dictionary trained on in-distribution (ID) activations. We formalize this misalignment as the faithfulness gap, a geometric distance between the ID dictionary and the OOD-active subspace, and show that it controls OOD faithfulness degradation. To reduce this gap, we propose the Geometry-Adaptive Explainer (GAE), which realigns the explainer's dictionary with the OOD-active subspace while preserving the original feature structure. This requires only unlabeled OOD activations and no gradient updates. We prove that GAE improves over the unadapted ID explainer, with excess loss bounded quadratically by the second-moment shift. Empirically, GAE even matches or surpasses all training-based baselines in causal faithfulness across multiple models and OOD settings.

2605.21845 2026-05-22 cs.CL cs.AI

Comparing LLM and Fine-Tuned Model Performance on NVDRS Circumstance Extraction with Varying Prompt Complexity

Geoffrey Martin, Xuan Zhong Feng, Yifan Peng

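AI summary: This paper compares LLMs with fine-tuned models on NVDRS circumstance extraction under varying prompt complexity, proposes a complexity scoring algorithm, and presents a hybrid approach that selects a prompt strategy per circumstance. LLMs perform better on low-prevalence circumstances, and the framework generalizes across frontier LLMs.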

Comments Accepted at IEEE ICHI 2026

Abstract

Suicide is a leading cause of death in the United States, and understanding the circumstances that precede it requires extracting structured information from death investigation narratives. Many of these circumstances require semantic inference beyond simple keyword matching. We develop a "Complexity Score" algorithm that analyzes coding manual structure to predict when detailed prompts with full coding guidelines improve over name-only prompts. We then construct a hybrid approach that selects a prompt strategy per circumstance. We evaluate large language models (LLMs) against fine-tuned RoBERTa on 25 inferentially complex circumstances from the National Violent Death Reporting System (NVDRS). We find that LLMs substantially outperform the fine-tuned model on low-prevalence circumstances where training data is insufficient. We further demonstrate that our framework generalizes across frontier LLMs, with GPT-5.2, Gemini 2.5 Pro, and Llama-3 70B showing consistent performance patterns. These findings support a hybrid architecture where LLMs handle rare, inferentially complex circumstances while fine-tuned models handle common ones.

2605.21842 2026-05-22 cs.LG cs.CL eess.SP

Energy-Gated Attention: Spectral Salience as an Inductive Bias for Transformer Attention

Athanasios Zeris

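AI summary: This paper proposes Energy-Gated Attention (EGA), which uses spectral salience as an inductive bias for transformer attention. Gating on the spectral energy of key embeddings raises the attention given to information-dense positions, with consistent improvements reported on TinyShakespeare and Penn Treebank.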

Comments 12 pages, 4 figures

Abstract

Standard transformer attention computes pairwise similarity between queries and keys, treating all tokens as equally salient regardless of their intrinsic informational content. In turbulent fluid dynamics, coherent structures (the energetically dominant, spatially organized patterns that persist amid background chaos) carry a disproportionate fraction of total energy and govern all transport. We propose that tokens play an analogous role in transformer attention: informationally dense positions (morphological boundaries, syntactic heads, discourse markers) concentrate spectral energy and should attract proportionally more attention than background tokens (function words, repeated patterns, low-information filler). We propose Energy-Gated Attention (EGA): a simple modification that gates value aggregation by the spectral energy of key token embeddings, computed by a single learned linear projection that discovers the dominant spectral mode of the embedding field. On TinyShakespeare, EGA achieves +0.103 validation loss improvement with only 12,480 additional parameters (<0.26% overhead) and no measurable computational cost. The result is consistent on Penn Treebank (+0.101), demonstrating dataset independence. A systematic ablation across three wavelet families (fixed Morlet, Daubechies db2/db4, and a parametric Morlet) establishes that fixed structured bases are suboptimal (the optimal energy direction is data-adaptive and non-sinusoidal) while identifying learned wavelet packets as a promising open direction. The learned energy threshold converges to τ ≈ 0.35 independently of initialization, corresponding to the fraction (~36%) of tokens carrying above-average spectral energy in English text, a stable linguistic property consistent with the fraction of content words in running English text.
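
The gating idea can be illustrated for a single query with a few lines of plain Python. The abstract states only that value aggregation is gated by an energy computed from key embeddings through one learned linear projection, with a learned threshold near 0.35. Everything else here is an assumption made for illustration: the squared-projection energy, the peak normalization, the sigmoid gate with a fixed `sharpness`, and the renormalization of the gated weights. The function names are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax(scores: Sequence[float]) -> List[float]:
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def energy_gates(keys: Sequence[Vector], w: Vector, tau: float, sharpness: float = 10.0) -> List[float]:
    """Assumed gate: sigmoid of (peak-normalized squared projection minus tau)."""
    energies = [dot(w, k) ** 2 for k in keys]
    peak = max(energies) or 1.0  # avoid dividing by zero when all energies vanish
    return [1.0 / (1.0 + math.exp(-sharpness * (e / peak - tau))) for e in energies]


def energy_gated_attention(
    query: Vector, keys: Sequence[Vector], values: Sequence[Vector], w: Vector, tau: float = 0.35
) -> List[float]:
    """Scaled dot-product attention whose weights are rescaled by per-key energy gates."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    gated = [a * g for a, g in zip(weights, energy_gates(keys, w, tau))]
    norm = sum(gated)
    gated = [x / norm for x in gated]
    dim = len(values[0])
    return [sum(p * v[d] for p, v in zip(gated, values)) for d in range(dim)]


# Both keys match the query equally, but only the first has energy along w.
query = [1.0, 0.0]
keys = [[1.0, 2.0], [1.0, 0.0]]
values = [[1.0], [0.0]]
w = [0.0, 1.0]  # stands in for the learned projection

output = energy_gated_attention(query, keys, values, w)
print(output)
```

Ungated attention would weight the two values equally here and return 0.5; with the gate, the high-energy key dominates the aggregate.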

2605.21836 2026-05-22 cs.RO

Analytical and Experimental Force Analysis of a Soft Linear Pneumatic Actuator

Mohammed Abboodi

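AI summary: This paper studies the force characteristics of a linear soft sleeve actuator (LSSA) analytically and experimentally, examining the coupled effects of pressure, geometry, displacement, loading, and axial stiffness to explain its force-generation mechanism.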

Abstract

Soft sleeve actuators (SSAs) have recently been developed as a pneumatic actuation approach for wearable and assistive robotic systems. By integrating the actuation structure into a sleeve-like geometry, these actuators can reduce reliance on external attachment layers and transmission mechanisms while maintaining compliance with limb-shaped surfaces. However, the force-generation behavior of SSAs remains insufficiently explained, particularly with respect to the variation of output force during extension, the influence of external loading, and the mechanical role of axial stiffness. This paper presents an analytical and experimental force analysis of a linear soft sleeve actuator (LSSA). A quasi-static analytical model was developed by expressing the net axial force as the pressure-generated contribution from the cap and folded walls, reduced by the force associated with axial stiffness. The model incorporates internal pressure, projected pressure areas, folded wall geometry, axial displacement, and an experimentally fitted axial stiffness relation. Prescribed-extension and static-load experiments were conducted to evaluate the actuator response. At 125 kPa, the generated force decreased from approximately 112 N at zero extension to nearly zero at 40 mm. Static loading delayed measurable force generation and reduced force output, particularly at low and intermediate pressures. The results show that LSSA force generation is governed by coupled effects of pressure, geometry, displacement, loading, and axial stiffness.
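
The structure of the quasi-static model (pressure-generated force minus a stiffness-related force) can be illustrated with a back-of-envelope calculation. This is a deliberately simplified sketch, not the paper's model: it lumps the cap and folded-wall projected areas into one constant effective area and assumes a linear stiffness force, whereas the abstract describes geometry-dependent areas and an experimentally fitted stiffness relation. The two constants are calibrated only from the figures quoted in the abstract (about 112 N at zero extension and nearly zero at 40 mm, both at 125 kPa); under these assumptions they come out to roughly 8.96 cm² and 2800 N/m.

```python
def net_axial_force(
    pressure_pa: float,
    effective_area_m2: float,
    stiffness_n_per_m: float,
    extension_m: float,
) -> float:
    """Pressure-generated force minus the force taken up by axial stiffness."""
    return pressure_pa * effective_area_m2 - stiffness_n_per_m * extension_m


PRESSURE_PA = 125_000.0         # 125 kPa, quoted in the abstract
FORCE_AT_ZERO_N = 112.0         # approximate force at zero extension, quoted in the abstract
ZERO_FORCE_EXTENSION_M = 0.040  # extension at which force is nearly zero, quoted in the abstract

# Assumed calibration: constant lumped area and linear stiffness.
effective_area_m2 = FORCE_AT_ZERO_N / PRESSURE_PA
stiffness_n_per_m = FORCE_AT_ZERO_N / ZERO_FORCE_EXTENSION_M

for extension_mm in (0, 10, 20, 30, 40):
    force_n = net_axial_force(
        PRESSURE_PA, effective_area_m2, stiffness_n_per_m, extension_mm / 1000.0
    )
    print(f"{extension_mm:2d} mm: {abs(force_n):6.1f} N")
```

Intermediate values from this sketch are interpolations under the linear assumption, not measurements from the paper.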

2605.21834 2026-05-22 cs.LG

On-Policy Consistency Training Improves LLM Safety with Minimal Capability Degradation

Andy Han, Kristina Fujimoto, Avidan Shah, Kiet Nguyen, Kai Xu, Chen Yueh-Han, Ilia Sucholutsky, Rico Angell

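AI summary: This paper proposes On-Policy Consistency Training (OPCT), which improves LLM safety by training on the model's own responses, supervised by its responses to contrastive prompts. Experiments show that OPCT outperforms conventional supervised fine-tuning (SFT) at reducing sycophancy, resisting jailbreaks, and improving safety awareness, while largely avoiding the capability regressions that SFT induces.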

Abstract

Aligned models can misbehave in several ways: they are often sycophantic, fall victim to jailbreaks, or fail to include appropriate safety warnings. Consistency training is a promising new alignment paradigm to mitigate such failures by training invariants into the model using contrastive input pairs. Existing consistency training procedures generate the supervision signal once, offline, and use supervised fine-tuning (SFT) to update the model. Unfortunately, the resulting models tend to merely memorize the surface forms of the training distribution and thus generalize poorly and regress in their capabilities. We introduce On-Policy Consistency Training (OPCT), a new consistency training approach where the objective is computed over the model's own responses to prompts, supervised by itself conditioned on corresponding contrastive prompts. We evaluate OPCT on three safety axes: sycophancy, jailbreaking, and safety awareness. Across three model families, OPCT outperforms its SFT counterpart on all safety desiderata. It nearly halves the sycophancy rate relative to baseline (8.1% vs. 15.4%, compared to 11.2% for SFT). Under an adaptive per-target attacker, OPCT holds jailbreak defense success near 99% on held-out jailbreak behaviors, whereas SFT achieves 87% on average. On safety awareness, OPCT outperforms SFT in two out of three models, and matches it on the other. OPCT also largely avoids the capability regressions that SFT induces, such as a 28-point drop on MATH-500. Our results suggest that consistency training is best implemented as OPCT rather than as SFT, especially when generalization beyond the training distribution is desired.
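
The "nearly halves" claim can be checked directly from the three sycophancy rates quoted in the abstract. The helper name `relative_reduction` is ours; the numbers are the abstract's.

```python
def relative_reduction(baseline: float, treated: float) -> float:
    """Fractional drop from baseline to treated."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - treated) / baseline


BASELINE_RATE = 15.4  # sycophancy rate (%) of the baseline model
OPCT_RATE = 8.1       # after OPCT
SFT_RATE = 11.2       # after SFT

opct_drop = relative_reduction(BASELINE_RATE, OPCT_RATE)
sft_drop = relative_reduction(BASELINE_RATE, SFT_RATE)
print(f"OPCT: {opct_drop:.1%}, SFT: {sft_drop:.1%}")  # OPCT: 47.4%, SFT: 27.3%
```

A 47.4% relative reduction is consistent with "nearly halves", and it is well above the 27.3% reduction that SFT achieves from the same baseline.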

2605.21827 2026-05-22 cs.CL cs.AI

Does Slightly Mean Somewhat? Measuring Vague Intensity Words in LLM Numeric Actions

Daniel Tabach

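AI summary: This paper examines whether language models preserve the ordinal meaning of intensity words when they must produce numeric actions. It finds that the model compresses vague intensity words in its numeric output, and that its interpretation is state-dependent and discontinuous near operational boundaries.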

Comments 9 figures, 2 tables, 16 references

Abstract

Do language models preserve the ordinal meaning of intensity words when those words must produce numeric actions? I study a researcher-constructed scale of 10 English degree modifiers, from slightly to drastically, informed by the Quirk et al. degree-modifier taxonomy, in a controlled resource-allocation environment where Claude Haiku receives a natural-language instruction, produces a numeric allocation, and a deterministic backend converts that allocation into a measurable outcome. The only variable that changes between runs is the intensity word or the starting system state, isolating their effects on the model's numeric output. Across 6,620 runs at T=0.0 and T=0.7, three patterns emerge. First, the model compresses 10 intensity words into 5 distinct median outputs: four lower-tier words all map to the same value, while stronger words break into higher regimes (Spearman rho = 0.845, p < 0.001). Second, when the current system state is supplied as context, separate Kruskal-Wallis tests show that grouping by starting allocation captures far more rank-based variance than grouping by word (epsilon-squared baseline = 0.782 vs. epsilon-squared word = 0.079), and lexical differentiation collapses to zero as the system approaches capacity. Third, near feasibility limits the model exhibits three behavioral modes: weak words hedge with small adjustments, strong words abstain entirely, and the word drastically pushes to the local ceiling. These patterns persist across temperature, with stochastic sampling broadening distributions but not restoring ordinal distinctions between words. In this model and domain, the model's numeric interpretation of vague intensity words is compressed, state-dependent, and discontinuous near operational boundaries.
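
The compression finding is easier to picture with a small rank-correlation example. The sketch below computes Spearman's rho with average ranks for ties on made-up data in which ten ordered words collapse to five distinct outputs, with the four weakest sharing one value, mirroring the pattern the abstract describes. The output values are invented for illustration and are not the paper's data, so the resulting coefficient (about 0.963) is not meant to match the reported 0.845.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


word_rank = list(range(1, 11))                         # 10 intensity words, weakest to strongest
median_output = [5, 5, 5, 5, 10, 10, 15, 15, 20, 30]   # hypothetical medians, 5 distinct values

rho = spearman_rho(word_rank, median_output)
print(len(set(median_output)), round(rho, 3))
```

The correlation stays high because the outputs never decrease as words get stronger, yet it falls short of 1 because ties erase the ordinal distinctions between neighboring words.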

2605.21825 2026-05-22 cs.AI cs.HC

Toward AI VIS Co-Scientists: A General and End-to-End Agent Harness for Solving Complex Data Visualization Tasks

Haichao Miao, Zhimin Li, Kuangshi Ai, Kaiyuan Tang, Chaoli Wang, Peer-Timo Bremer, Shusen Liu

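AI summary: This paper presents an end-to-end agentic harness that automatically generates custom visual analysis applications from the data and a high-level task description, a step toward the vision of a general AI co-scientist.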

Abstract

The ability to inspect, interpret, and communicate complex data is crucial for virtually any scientific endeavor, but often requires significant expertise outside the core domain, ranging from data management and analysis to visualization design and implementation. We present an end-to-end agentic harness that, based on only the data and a high-level description of the tasks, independently designs custom visual analysis applications (VIS apps). This represents an important step toward the general AI co-scientist envisioned by many: a system that can autonomously execute long-horizon tasks based on high-level directions. Our proposed VIS co-scientist is an essential component of this broader AI co-scientist vision: a harness that can autonomously analyze data and design visualization solutions using a collection of agents and specialized skills that coordinate exploratory analysis, planning, environment configuration, implementation, interface validation, and, most importantly, evaluation of overall task completion. Each stage produces document and instruction artifacts that guide downstream work and enable iterative refinement. We validate this approach on IEEE SciVis Contests spanning multiple science and engineering fields. These contests serve as ideal proving grounds because they encode real-world complexity: ambiguous requirements, diverse data modalities, design trade-offs, and task-driven validation. Given only the data and target tasks, our system autonomously produces functional single-page VIS apps with verified linked-view behavior, highly customized to domain experts' specified tasks and needs.

2605.21822 2026-05-22 cs.AI

Implicit Safety Alignment from Crowd Preferences

Qian Lin, Daniel S. Brown

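AI summary: This paper studies how to extract shared safety criteria from crowd preference data and transfer them to downstream reinforcement learning tasks to regularize agent behavior and enforce safety. It proposes a crowd-preference-based safe RL framework that composes safety-aligned skills through a high-level policy to solve downstream tasks safely.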

Comments Accepted to ICML 2026. Conference paper

Abstract

Reinforcement Learning from Human Feedback (RLHF) can reveal implicit objectives such as safety considerations that go beyond task completion. In this work, we focus on the common safety criteria embedded in crowd preference datasets, where different users may express distinct preferences or objectives, yet follow similar safety principles. Our aim is to discover shared safety criteria from crowd preferences and then transfer them to downstream RL tasks to regularize agent behavior and enforce safety. We first show that direct reward combination, that is, optimizing a preference-learned reward model together with downstream task rewards, has inherent limitations. Motivated by this, we propose Safe Crowd Preference-based RL, a hierarchical framework that extracts safety-aligned skills from crowd preferences and composes them via a high-level policy to safely solve downstream tasks. Experiments across safe RL environments and a preliminary LLM-style task with diverse user goals and shared safety constraints demonstrate that our approach substantially lowers safety costs without access to explicit safety rewards, while achieving task performance comparable to oracle methods trained with ground-truth safety signals.

2605.21820 2026-05-22 cs.LG cond-mat.mtrl-sci

Beyond Scalar Objectives: Expert-Feedback-Driven Autonomous Experimentation for Scientific Discovery at the Nanoscale

Ralph Bulanadi, Jefferey Baxter, Arpan Biswas, Hiroshi Funakubo, Dennis Meier, Jan Schultheiß, Rama Vasudevan, Yongtao Liu

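AI summary: This paper proposes deep-kernel pairwise learning (DKPL), which integrates expert judgment and interdisciplinary scientific knowledge into autonomous microscopy experiments to discover nanoscale phenomena more effectively.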

英文摘要

Self-driving laboratories or autonomous experimentation are emerging as transformative platforms for accelerating scientific discovery. Bayesian optimization (BO) is among the most widely used machine learning frameworks for these purposes, but these BO-based frameworks rely on predefined scalar descriptors to guide experimentation. In many situations, the determination of an appropriate scalar descriptor can be challenging, and may fail to capture subtle yet scientifically important phenomena apparent to experts with interdisciplinary insight. To overcome this limitation, here we develop deep-kernel pairwise learning (DKPL), an approach for autonomous microscopy experiments which incorporates human expertise and interdisciplinary scientific knowledge into an active learning loop. Instead of relying on explicit scalar objectives, DKPL enables experts to directly evaluate which experimental output is more promising using interdisciplinary knowledge. DKPL then learns a latent utility function from these expert judgements to guide subsequent autonomous microscopy experiments. We demonstrate DKPL's performance in learning physically meaningful nanoscale structures while effectively prioritizing high-information measurement regions using an experimental model dataset with known ground truth. We further apply DKPL to analyze the character of ferroelectric domain walls, where we find DKPL capable of distinguishing between high and low characteristic domain-wall angles in bismuth ferrite, and able to discover both head-to-head and tail-to-tail domain-wall character in erbium manganite. This development establishes an approach to integrate expert knowledge into autonomous microscopy experiments and demonstrates a pathway toward expert-guided self-driving laboratories capable of addressing scientific problems beyond the limits of scalar-metrics-driven learning.

2605.21811 2026-05-22 cs.RO

Safe and Steerable Geometric Motion Policies for Robotic Dexterous Manipulation

安全且可操控的几何运动策略用于机器人灵巧操作

Albert Wu, Riccardo Bonalli, Thomas Lew, C. Karen Liu

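AI summary This work presents SafePBDS, a geometrically consistent framework that computes optimal, certifiably safe configuration-manifold accelerations in order to continuously reconcile objectives and constraints in dexterous robotic manipulation. It is validated in simulation and on a Franka Panda-Allegro Hand platform, where it shows efficient planning and safety guarantees for dexterous grasping and in-hand reorientation.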

Comments 24 pages, 10 figures, 5 tables. Project page and demo video: https://tml.stanford.edu/safe-pbds

详情
AI中文摘要

机器人灵巧操作需要持续协调在异构几何空间上定义的目标和约束:一个在$\mathbb{R}^7$配置流形上控制的机器人可能需要在$\mathrm{SE}(3)$上跟踪末端执行器姿态,同时在$\mathbb{R}$上满足障碍物避让边距。我们提出了Safe Pullback Bundle Dynamical Systems(SafePBDS),一种几何一致的框架,该框架从任意任务流形上的目标和安全要求计算最优且可证明安全的配置流形加速度。SafePBDS建立在先前工作之上,将预定义的任务流形动力学系统结合以产生自主运动。其第一个创新是拉回控制屏障函数构造,将任务流形的安全条件转换为配置流形加速度上的线性约束。第二个创新是任务流形动作接口,允许高层策略注入低维残差运动;零输入恢复自主行为,而任意输入下保持安全。这使高层策略能够高效地引导探索,同时将精确运动留给自主行为。我们通过模拟和23自由度Franka Panda-Allegro手平台验证了SafePBDS。在灵巧抓取中,SafePBDS在20个家庭物体和120次试验中实现了92.5%的成功率。通过动作接口,该方法可通过一维动作排除抓取中的任一手指,实现94.4%的3指抓取成功率。SafePBDS的高效规划和安全保证还使其成为首个基于模型的、完全驱动的手部在手重定向方法,能够超过360度的yaw旋转,无论物体重量和腕部运动如何变化。演示视频和细节:https://tml.stanford.edu/safe-pbds

英文摘要

Robotic dexterous manipulation requires continuously reconciling objectives and constraints defined on heterogeneous geometric spaces: a robot controlled on a $\mathbb{R}^7$ configuration manifold may need to track end effector poses on $\mathrm{SE}(3)$ while satisfying obstacle avoidance margins in $\mathbb{R}$. We present Safe Pullback Bundle Dynamical Systems (SafePBDS), a geometrically consistent framework that computes optimal, certifiably safe configuration manifold accelerations from objectives and safety requirements on arbitrary task manifolds. SafePBDS builds on prior work that combines predefined task manifold dynamical systems to produce autonomous motion. Its first innovation is a pullback control barrier function construction, which converts task manifold safety conditions into linear constraints on configuration manifold accelerations. The second innovation is a task manifold action interface that allows a high-level policy to inject low dimensional residual motions; zero input recovers the autonomous behavior, while safety is preserved under arbitrary inputs. This lets high-level policies efficiently steer exploration while leaving precise motion to the autonomous behavior. We validate SafePBDS in simulation and on a 23-DOF Franka Panda-Allegro Hand platform. On dexterous grasping, SafePBDS achieves a $92.5\%$ success rate across 20 household objects and 120 trials. Using the action interface, the method can exclude any one of the four fingers during grasping via a one-dimensional action, achieving $94.4\%$ 3-finger grasp success across 3 objects and 36 trials. The efficient planning and safety guarantee of SafePBDS also enables the first model-based, fully actuated palm-down in-hand reorientation, exceeding $360^\circ$ of yaw rotation in both directions under varying object weight and wrist motion. Demo video and details: https://tml.stanford.edu/safe-pbds

2605.21810 2026-05-22 cs.AI cs.MA

Trace2Skill: Verifier-Guided Skill Evolution for Long-Context EDA Agents

Trace2Skill: 验证器引导的技能进化用于长上下文EDA代理

Zijian Du, Nathaniel Pinckney

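AI summary This paper proposes Trace2Skill, a framework that uses verifier-guided skill evolution to improve hardware agents on Complex Verilog Design Problems. Without RTL-specialized model fine-tuning, dense verifier feedback yields substantially higher task pass rates.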

详情
AI中文摘要

复杂Verilog设计问题(CVDP)对硬件LLM智能体构成挑战,因为求解这些问题需要在大型代码仓库快照中定位与验证器相关的RTL、测试平台、包含路径和构建依赖,进行精确编辑,并从稀疏的隐藏验证器失败中恢复。我们提出Trace2Skill,一个测试时扩展框架,它无需针对RTL进行专门的模型微调即可改进硬件智能体。Trace2Skill既不训练新模型,也不只是采样更多候选解,而是将智能体的自然语言技能视为可进化的策略。它从重复的运行轨迹中挖掘成功和失败模式,将其转化为密集的诊断信息和oracle经验,并通过由oracle、变异器和选择器组成的循环生成任务特定的技能,用以指导后续的搜索、编辑、验证和恢复。由于最终的通过/失败标签对于困难的失败案例往往过于粗略,Trace2Skill还支持有界的运行时密集验证器反馈:该反馈返回经过清理的功能性观察结果,同时使隐藏的测试框架和参考解对智能体保持不可访问。这种反馈将技能文本、验证器证据和下游行为联系起来,从而帮助引导技能进化和智能体执行。在种子CVDP智能体无法解决的困难CVDP任务上(其中也包括前沿编码智能体无法解决的任务),结合密集验证器反馈的Trace2Skill显著提高了任务通过率,并在此前未解决的任务上取得突破性通过,且无需高质量微调数据、专门的RTL模型训练或模型权重更新。同一框架提供了一种通用的测试时扩展策略,可从数字设计推广到其他可验证的EDA任务。

英文摘要

Complex Verilog Design Problems (CVDP) challenge hardware LLM agents because solving them requires localizing verifier-relevant RTL, testbenches, include paths, and build dependencies inside large repository snapshots, making precise edits, and recovering from sparse hidden-verifier failures. We present Trace2Skill, a test-time scaling framework that improves a hardware agent without RTL-specialized model fine-tuning. Rather than training a new model or only sampling more candidate solutions, Trace2Skill treats the agent's natural-language skill as an evolvable policy. It mines repeated rollout traces for success and failure modes, converts them into dense diagnostics and oracle lessons, and uses an oracle, mutator, and selector loop to produce task-specific skills that guide later search, editing, validation, and recovery. Because final pass/fail labels are often too coarse for hard failures, Trace2Skill also supports bounded runtime dense verifier feedback that returns sanitized functional observations while keeping hidden harnesses and reference solutions inaccessible to the agent. This feedback helps guide skill evolution and agent execution by connecting skill text, verifier evidence, and downstream behavior. Across hard CVDP tasks that defeat the seed CVDP agent, including tasks that also defeat frontier coding agents, Trace2Skill with dense verifier feedback substantially improves task pass rates and produces breakthrough passes on previously unsolved tasks, without requiring high-quality fine-tuning data, specialized RTL model training, or model weight updates. The same framework provides a general test-time scaling strategy that can extend beyond digital design to other verifiable EDA tasks.

2605.21807 2026-05-22 cs.CL

When Cases Get Rare: A Retrieval Benchmark for Off-Guideline Clinical Question Answering

当案例变得稀少时:一个用于偏离指南临床问答的检索基准

Doeun Lee, Muge Zhang, Yi Yu, Ashish Manne, Stephen Koesters, Frank Wen, Brady Buchanan, Lynda Villagomez, Oluwatoba Moninuola, James Lim, Kathryn Tobin, Andrew Srisuwananukorn, Ping Zhang, Sachin Kumar

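AI summary This paper introduces OGCaReBench, a retrieval-focused benchmark for evaluating open-ended medical question answering on cases that go beyond standard guidelines, and shows that retrieving medical literature improves model performance in these real-world clinical scenarios.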

Comments 34 pages, 20 figures

详情
AI中文摘要

在医学各专科中,临床实践基于证据导向的指南,这些指南通常无法覆盖现实世界中未被涵盖的长尾部分。然而,大多数医疗大型语言模型(LLMs)训练时专注于编码常见、指南导向的医学知识。当前评估主要测试模型在回忆和推理这些记忆内容上的能力,通常在多项选择设置中进行。鉴于证据导向推理在医学中的基础重要性,实践中依赖记忆是不可行且不可靠的。为此,我们引入OGCaReBench,一个以自由形式检索为核心的基准,旨在评估LLMs在回答需要超越典型指南的临床问题上的能力。该基准从发表的医学案例报告中提取,并由医学专家验证,包含需要自由文本回答的长形式临床问题,提供了一个系统框架,用于评估罕见、案例基于场景中的开放性医学推理。我们的实验发现,即使最佳基线模型(GPT-5.2)在基准中也只正确回答了56%,而仅使用专业模型的仅达到42%。通过增强模型检索医学文献,性能可提升至82%(使用GPT-5.2),突显了证据基础在真实世界医疗推理任务中的重要性。本文因此为基准测试和推动通用和医疗LLM在挑战性临床情境中产生可靠答案奠定了基础。

英文摘要

Across medical specialties, clinical practice is anchored in evidence-based guidelines that codify best studied diagnostic and treatment pathways. These pathways routinely fall short for the long tail of real-world care not covered by guidelines. Most medical large language models (LLMs), however, are trained to encode common, guideline-focused medical knowledge in their parameters. Current evaluations test models primarily on recalling and reasoning with this memorized content, often in multiple-choice settings. Given the fundamental importance of evidence-based reasoning in medicine, it is neither feasible nor reliable to depend on memorization in practice. To address this gap, we introduce OGCaReBench, a free-form retrieval-focused benchmark aimed at evaluating LLMs at answering clinical questions that require going beyond typical guidelines. Extracted from published medical case reports and validated by medical experts, OGCaReBench contains long-form clinical questions requiring free-text answers, providing a systematic framework for assessing open-ended medical reasoning in rare, case-based scenarios. Our experiments reveal that even the best-performing baseline (GPT-5.2) correctly answers only 56% of our benchmark with specialized models only reaching 42%. Augmenting models with retrieved medical articles improves this performance to up to 82% (using GPT-5.2) highlighting the importance of evidence-grounding for real-world medical reasoning tasks. This work thus establishes a foundation for benchmarking and advancing both general-purpose and medical LLMs to produce reliable answers in challenging clinical contexts.

2605.21803 2026-05-22 cs.LG

Same Architecture, Different Capacity: Optimizer-Induced Spectral Scaling Laws

相同架构,不同容量:优化器诱导的谱缩放定律

Nandan Kumar Jha, Brandon Reagen

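AI summary This study examines how the optimizer shapes spectral scaling laws in Transformers. The same architecture trained with different optimizers shows markedly different scaling of spectral capacity, which motivates optimizer-architecture co-design.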

Comments 31 pages, 10 figures, 30 tables. Project page: https://optimizer-scaling-laws.github.io

详情
AI中文摘要

缩放定律使语言模型性能可从模型大小、数据和计算量预测,但通常将优化器视为固定训练细节。我们显示,这一假设忽略了表示缩放的一个基本轴:优化器如何有效地将增加的FFN宽度转换为利用的谱容量。通过测量前馈网络表示的谱特征,通过软和硬谱秩,我们发现,当使用不同优化器训练时,相同的Transformer架构实现了显著不同的谱缩放定律。在固定架构和宽度计划的情况下,AdamW在稀有词(TAIL)表示上表现出弱的硬秩缩放(β=0.44),而在相同区域,Muon实现了线性缩放(β=1.02),缩放指数增加了2.3倍。这一差异无法归因于验证损失:AdamW配置可以在扩展训练下匹配低秩Dion变体的困惑度,但表现出显著不同的谱几何结构,表明匹配的损失不意味着匹配的表示结构。硬-软秩不对称进一步揭示,优化器不仅在实现的容量上有所不同,还影响了容量在特征模式上的结构。为了区分优化器效应与架构效应,我们比较了架构干预(例如注意力秩和位置编码),并发现优化器诱导的谱位移往往超过架构效应。这些结果表明优化是表示缩放的第一轴,推动了优化器-架构协同设计。

英文摘要

Scaling laws have made language-model performance predictable from model size, data, and compute, but they typically treat the optimizer as a fixed training detail. We show that this assumption misses a fundamental axis of representation scaling: how effectively the optimizer converts added FFN width into utilized spectral capacity. Using eigenspectra of feed-forward network representations, measured through soft and hard spectral ranks, we find that the same Transformer architecture realizes markedly different spectral scaling laws when trained with different optimizers. Holding architecture and width schedule fixed, AdamW exhibits weak hard-rank scaling ($\beta = 0.44$) on rare-token (TAIL) representations where learning is known to be hardest, whereas Muon achieves linear scaling ($\beta = 1.02$) in the same regimes, a $2.3\times$ increase in the scaling exponent. This difference is not reducible to validation loss: AdamW configurations can match low-rank Dion variants in perplexity under extended training while exhibiting sharply different spectral geometry, demonstrating that matched loss does not imply matched representation structure. Hard-soft rank asymmetry further reveals that optimizers differ not only in how much capacity is realized, but also in how that capacity is structured across eigenmodes. To disentangle optimizer effects from architectural ones, we compare against architectural interventions (e.g., attention rank and positional encoding), and find that optimizer-induced spectral shifts often exceed the architectural effects. These results suggest optimization as a first-class axis of representation scaling, motivating optimizer-architecture co-design.

2605.21801 2026-05-22 cs.LG cs.CL

Why Semantic Entropy Fails: Geometry-Aware and Calibrated Uncertainty for Policy Optimization

为何语义熵失效:面向策略优化的几何感知与校准不确定性

Zheyuan Zhang, Kaiwen Shi, Han Bao, Zehong Wang, Tianyi Ma, Yanfang Ye

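AI summary This paper proposes GCPO, a policy optimization framework that uses geometry-aware measures to capture semantic disagreement and reward-based calibration to align uncertainty with learning signal strength, so that the uncertainty signal tracks gradient variability more faithfully and post-training performance improves.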

详情
AI中文摘要

训练后已成为改进大语言模型推理和对齐的关键,其中无批评模型能够实现从模型生成输出的可扩展学习,但缺乏区分信息性与噪声信号的原理性机制。最近的方法利用响应级度量作为不确定性信号来调节基于群体的优化方法,如GRPO。然而,其经验成功仍不稳定,且不清楚它们如何影响优化动态。在本文中,我们提供迄今为止第一个原理性公式,将不确定性信号解释为表征和调节梯度方差和学习信号质量的机制。基于经验和理论分析,我们识别出当前基于熵的估计器的两个关键缺陷:各向异性缺口和校准缺口。受此分析启发,我们提出几何感知校准策略优化(GCPO),一种新的框架,整合几何感知度量以捕捉语义分歧,利用基于奖励的校准对齐不确定性与学习信号强度。在多个基准测试中的实验表明,我们的方法更忠实跟踪梯度变化,并且一致提升训练后性能。我们的结果强调了设计与优化动态对齐的不确定性信号的重要性,为稳健训练后方法提供了原理性视角。

英文摘要

Post-training has become central to improving reasoning and alignment in large language models, where critic-free models enable scalable learning from model-generated outputs but lack principled mechanisms to distinguish informative from noisy signals. Recent approaches leverage response-level measures as uncertainty signals to regulate group-based optimization methods such as GRPO. Yet their empirical success remains unstable and unclear in how they influence optimization dynamics. In this paper, we provide, to our knowledge, the first principled formulation that interprets uncertainty signals as mechanisms for characterizing and regulating gradient variance and learning signal quality. Based on both empirical and theoretical analysis, we identify two critical gaps of current entropy-based estimators: The anisotropic gap and The calibration gap. Motivated by this analysis, we propose Geometric-aware Calibrated Policy Optimization (GCPO), a novel framework integrating geometry-aware measures to capture semantic disagreement with reward-based calibration to align uncertainty with learning signal strength. Experiments on multiple benchmarks show that our approach more faithfully tracks gradient variability and consistently improves post-training performance. Our results highlight the importance of designing uncertainty signals that are aligned with optimization dynamics, offering a principled perspective for robust post-training.

2605.21800 2026-05-22 cs.LG cs.RO

stable-worldmodel: A Platform for Reproducible World Modeling Research and Evaluation

stable-worldmodel: 一个用于可重复世界建模研究和评估的平台

Lucas Maes, Quentin Le Lidec, Luiz Facury, Nassim Massaudi, Ayush Chaurasia, Francesco Capuano, Richard Gao, Taj Gillin, Dan Haramati, Damien Scieur, Yann LeCun, Randall Balestriero

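AI summary This paper presents stable-worldmodel, a platform that addresses the fragmentation of codebases, data pipelines, and evaluation protocols in world-model research. It provides a high-performance data layer, implementations of modern world-model baselines and planning solvers, and an extended suite of environments and tasks for standardized, reproducible research and evaluation.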

详情
AI中文摘要

世界模型是构建能够推理、规划并在训练数据之外泛化的智能体的核心。然而,目前世界模型的研究仍然碎片化,各不相同的代码库、数据管道和评估协议阻碍了可重复性和公平比较。当前实践还受到三个关键瓶颈的限制:脆弱的一次性代码库、缓慢的视频数据加载以及缺乏标准化的泛化基准。我们提出了stable-worldmodel (swm),一个开源平台,用于标准化和可重复的世界建模研究和评估。它提供了(1)一个基于Lance的高性能数据层,原生支持MP4、HDF5和LeRobot数据集并提供转换工具;(2)干净、经过良好测试的现代世界模型基线和规划求解器的实现;(3)一个广泛的环境和任务套件,扩展了可控的视觉、几何和物理变化因素,用于在仿真中系统地评估动态理解、控制性能、表示质量和分布外泛化。通过在单一可扩展框架下统一整个流程,swm显著减少了研究开销,并加速了向可靠世界模型的可信进展。

英文摘要

World models are central to building agents that can reason, plan, and generalize beyond their training data. However, research on world models is currently fragmented, with disparate codebases, data pipelines, and evaluation protocols hindering reproducibility and fair comparison. Current practice is further limited by three key bottlenecks: fragile one-off codebases, slow video data loading, and the lack of standardized generalization benchmarks. We present stable-worldmodel (swm), an open-source platform for standardized and reproducible world modeling research and evaluation. It delivers (1) a high-performance Lance-based data layer with native support and conversion tools for MP4, HDF5, and LeRobot datasets, (2) clean, well-tested implementations of modern world model baselines and planning solvers, and (3) a broad suite of environments and tasks extended with controllable visual, geometric, and physical factors of variation for systematic in-silico evaluation of dynamics understanding, control performance, representation quality, and out-of-distribution generalization. By unifying the full pipeline under a single, scalable framework, swm dramatically reduces research overhead and accelerates trustworthy progress toward reliable world models.

2605.21798 2026-05-22 cs.LG stat.ML

Three Costs of Amortizing Gaussian Process Inference with Neural Processes

用神经过程摊销高斯过程推断的三种成本

Robin Young

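AI summary This paper analyzes the costs of amortizing Gaussian process inference with neural processes, which replace the exact $O(n^3)$ posterior with a learned $O(n)$ map. It decomposes the resulting gap into three sources (label contamination, an information bottleneck, and amortization error) and derives architectural recommendations.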

Comments To appear at ProbNum 2026

详情
AI中文摘要

神经过程用于摊销高斯过程推断,将精确的O(n^3)后验替换为学习的O(n)映射,从上下文集到预测分布。对于一类潜在的神经过程,我们界定了高斯过程和LNP预测之间的KL散度,将其分解为三个可解释的来源,即标签污染,因为神经过程使用标签值来估计在精确高斯过程中标签无关的量;信息瓶颈,因为有限维表示无法解析完整的上下文几何;以及摊销误差,因为单个编码器网络在所有上下文中共享。瓶颈截断项随着表示维度d衰减为O(e^{-cd^{2/d_x}}),对于平方指数核在R^{d_x}上,其中c>0是核依赖的常数,以及对于Matérn-ν核为O(d^{-2ν/d_x}),直接将架构尺寸与核平滑度和输入维度联系起来。标签污染项通常为O(1),只有观测噪声部分衰减为O(1/n),识别了通过标签依赖的表示路由不确定性估计的持续成本。这些结果刻画了在分析类别中的摊销成本,并产生了架构建议,以在高斯过程摊销范围内仅从上下文位置预测方差,并用二阶池化代替均值聚合以关闭主导的摊销差距。

英文摘要

Neural processes amortize Gaussian process inference, replacing the exact $O(n^3)$ posterior with a learned $O(n)$ map from context sets to predictive distributions. For a class of latent neural processes, we bound the Kullback--Leibler (KL) divergence between the GP and LNP predictives, decomposing it into three interpretable sources, namely label contamination as the neural process uses label values to estimate a quantity that is label-independent in the exact GP, an information bottleneck because the finite-dimensional representation cannot resolve the full context geometry, and amortization error from a single encoder network shared across all contexts. The bottleneck truncation term decays in the representation dimension $d$ as $O(e^{-cd^{2/d_x}})$ for squared-exponential kernels on $\mathbb{R}^{d_x}$ where $c > 0$ is a kernel-dependent constant and as $O(d^{-2ν/d_x})$ for Matérn-$ν$ kernels, directly linking architecture sizing to kernel smoothness and input dimension. The label contamination term is $O(1)$ in general, with only the observation-noise component decaying as $O(1/n)$, identifying a persistent cost of routing uncertainty estimation through a label-dependent representation. These results characterize the costs of amortization within the analyzed class and yield architectural recommendations to predict variance from context locations alone in the GP-amortization regime, and replace mean aggregation with second-order pooling to close the dominant amortization gap.

2605.21796 2026-05-22 cs.CV cs.CL

MM-Conv: A Multimodal Dataset and Benchmark for Context-Aware Grounding in 3D Dialogue

MM-Conv: 一种多模态数据集和基准,用于上下文感知的3D对话中指代解析

Anna Deichler, Jim O'Regan, Fethiye Irmak Dogan, Lubos Marcinek, Anna Klezovich, Iolanda Leite, Jonas Beskow

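AI summary This paper presents a multimodal dataset and benchmark for context-aware grounding in dynamic 3D environments, built from 6.7 hours of egocentric VR interaction with synchronized speech, motion, gaze, and 3D scene geometry, together with a two-stage grounding pipeline that improves the resolution of referring expressions in dialogue.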

Comments Extended version of the paper published at LREC 2026 (Palma de Mallorca, Spain), with expanded VLM baselines and inter-annotator agreement analysis

详情
Journal ref
Proceedings of the 15th Language Resources and Evaluation Conference (LREC 2026), Palma de Mallorca, Spain
AI中文摘要

在物理世界中将语言进行定位需要AI系统解释在对话中动态出现的参考。尽管当前的视觉语言模型(VLMs)在静态图像任务上表现出色,但在自发的多轮对话中解决歧义表达方面存在困难。我们通过引入(1)一个用于动态3D环境中的指代交流的基准,该基准基于6.7小时的第一人称VR交互,同步语音、动作、注视和3D场景几何数据,以及(2)一个两阶段的定位流水线,该流水线在视觉定位之前显式解决对话中的歧义,来填补这一空白。该基准包含超过4,200个经过人工验证的指代表达,涵盖完整、部分和代词类型。我们的上下文重写方法在平均上将定位性能提高了11-22个百分点,纯检测器(GroundingDINO)在重写后在代词上达到了56.7%的准确率,几乎是最佳端到端基线的两倍。结果表明,将语言推理与视觉感知解耦比端到端方法在对话定位中更有效。

英文摘要

Grounding language in the physical world requires AI systems to interpret references that emerge dynamically during conversation. While current vision-language models (VLMs) excel at static image tasks, they struggle to resolve ambiguous expressions in spontaneous, multi-turn dialogue. We address this gap by introducing (1) a benchmark for referential communication in dynamic 3D environments, built from 6.7 hours of egocentric VR interaction with synchronized speech, motion, gaze, and 3D scene geometry, and (2) a two-stage grounding pipeline that explicitly resolves conversational ambiguity before visual localization. The benchmark includes over 4,200 manually verified referring expressions spanning full, partitive, and pronominal types. Our contextual rewriting approach improves grounding performance by 11-22 percentage points on average, with a pure detector (GroundingDINO) reaching 56.7% on pronominals after rewriting, nearly double the best end-to-end baseline. Results demonstrate that decoupling linguistic reasoning from visual perception is more effective than end-to-end approaches for conversational grounding.

2605.21792 2026-05-22 cs.CL cs.AI cs.DB cs.LG

Residual Skill Optimization for Text-to-SQL Ensembles

残差技能优化用于文本到SQL集成

Jiongli Zhu, Haoquan Guan, Parjanya Prajakta Prashant, Nikki Lijing Kuang, Seyedeh Baharan Khatami, Canwen Xu, Xiaodong Yu, Yingyu Lin, Zhewei Yao, Yuxiong He, Babak Salimi

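AI summary This paper proposes DivSkill-SQL, a residual skill optimization framework that builds complementary Text-to-SQL ensembles by optimizing each new skill on the examples the current skill ensemble fails on, which raises Pass@K. It reports sizable accuracy gains on Spider2-Lite and consistent improvements across dialects and tasks.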

详情
AI中文摘要

文本到SQL集成通过生成多个SQL候选并选择一个来优于单一候选生成,但其效果受限于Pass@K,即至少有一个K候选正确的概率。现有方法通过随机解码或提示变体启发式地引入多样性,导致候选集受相关失败主导。我们提出DivSkill-SQL,一种残差技能优化框架,构建互补的文本到SQL集成而无需模型微调:每个新技能在当前技能集成失败的示例上进行优化,证明其对Pass@K的边际贡献。在Spider2-Lite上,DivSkill-SQL在Snowflake和BigQuery上分别比最强集成基线提升11.1和8.3个点,且在两个基础模型(Opus-4.6和GPT-5.4)上表现一致。在单个方言上无重新训练即可转移至其他方言(Snowflake、BigQuery、SQLite)和不同任务形式(如BIRD-Critic,+2.6个点)。错误诊断显示幻觉的模式参考和函数调用减少3倍,表明收益来自真正可靠的互补技能,而非表面形式变化。

英文摘要

Text-to-SQL ensembles improve over single-candidate generation by drawing multiple SQL candidates and selecting one, but their effectiveness is bounded by Pass@K, the probability that at least one of K candidates is correct. Existing methods source diversity heuristically through stochastic decoding or prompt variants, leaving candidate sets dominated by correlated failures. We present DivSkill-SQL, a residual skill optimization framework that builds complementary agentic Text-to-SQL ensembles without model fine-tuning: each new skill is optimized on examples the current skill ensemble fails on, provably targeting its marginal contribution to Pass@K. On Spider2-Lite, DivSkill-SQL improves selected accuracy by up to +11.1 points on Snowflake and +8.3 on BigQuery over the strongest ensemble baseline, with consistent gains across two base models (Opus-4.6 and GPT-5.4). Skills optimized on a single dialect transfer without retraining across dialects (Snowflake, BigQuery, SQLite) and to a different task formulation, such as BIRD-Critic (+2.6 pts). Error diagnostics show up to 3x fewer hallucinated schema references and function calls, indicating that gains come from genuinely reliable complementary skills rather than surface-form variation.
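The residual idea in this abstract can be illustrated with a toy sketch. It assumes a hypothetical table `solved_by` that records which evaluation examples each candidate skill answers correctly, and it picks skills greedily by how many still-unsolved examples they cover. This is not the DivSkill-SQL procedure, which optimizes new skills rather than choosing among fixed ones; all names and data below are illustrative assumptions.

```python
from typing import Dict, List, Set


def pass_at_k(solved_by: Dict[str, Set[int]], skills: List[str], n_examples: int) -> float:
    """Empirical Pass@K: fraction of examples solved by at least one chosen skill."""
    covered = set().union(*(solved_by[s] for s in skills))
    return len(covered) / n_examples


def residual(solved_by: Dict[str, Set[int]], skills: List[str], n_examples: int) -> Set[int]:
    """Examples on which every skill in the current ensemble fails."""
    covered = set().union(*(solved_by[s] for s in skills))
    return set(range(n_examples)) - covered


def greedy_ensemble(solved_by: Dict[str, Set[int]], n_examples: int, k: int) -> List[str]:
    """Add, up to k times, the skill with the largest gain on the residual set."""
    chosen: List[str] = []
    for _ in range(k):
        rest = residual(solved_by, chosen, n_examples)
        candidates = [s for s in solved_by if s not in chosen]
        if not candidates:
            break
        # Ties go to the first candidate in insertion order.
        best = max(candidates, key=lambda s: len(solved_by[s] & rest))
        if not solved_by[best] & rest:
            break  # no remaining skill fixes any residual failure
        chosen.append(best)
    return chosen


# Hypothetical results on 8 examples: "a" and "b" are strong but fail on
# nearly the same examples, while "c" is weak but complementary.
toy = {"a": {0, 1, 2, 3}, "b": {0, 1, 2, 4}, "c": {5, 6}, "d": {3}}
print(greedy_ensemble(toy, 8, 2), pass_at_k(toy, ["a", "b"], 8))
```

In this toy data, pairing the two individually strongest skills covers fewer examples than pairing one strong skill with the complementary one, which is the correlated-failure effect the abstract describes.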

2605.21788 2026-05-22 cs.CV cs.RO

SceneGraphGrounder: Zero-Shot 3D Visual Grounding via Structured Scene Graph Matching

SceneGraphGrounder: 通过结构化场景图匹配实现零样本3D视觉定位

Xuefei Sun, Xujia Zhang, Brendan Crowe, Doncey Albin, Christoffer Heckman

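AI summary This paper proposes SceneGraphGrounder, which recasts 3D grounding as structured graph matching. A visual marker prompting strategy lets a VLM infer object-object relations from 2D views, and these relations are lifted into a persistent 3D scene graph. The method is competitive among zero-shot approaches on the ScanRefer benchmark, and a real-robot deployment demonstrates robust spatial reasoning in long-horizon physical environments.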

详情
AI中文摘要

零样本3D视觉定位需要从非结构化环境中通过自由形式自然语言定位物体。最近的视觉-语言模型(VLM)方法取得了有希望的结果,但依赖于视点依赖的推理或隐式表示,限制了组合查询的空间一致性和可解释性。我们提出了SceneGraphGrounder,一个将3D定位重新表述为在重建的3D场景图上的结构化图匹配的框架。为了实现这种表述,我们引入了一种视觉标记提示策略,使VLM能够从2D视图推断物体-物体关系,这些关系随后被提升为持久的3D场景图编码,既包含空间关系又包含语义关系。给定一个查询,我们构建查询图并与场景图进行受限对齐,确保多视图一致性和可解释的推理。在ScanRefer基准测试中,我们的方法在零样本条件下实现了与现有方法相当的性能,仅使用RGB-D输入。我们进一步通过在移动机器人上的真实世界部署验证了我们的框架,展示了其在长周期物理环境中的鲁棒空间推理能力。我们将在接受后公开我们的代码。

英文摘要

Zero-shot 3D visual grounding requires localizing objects in unstructured environments from free-form natural language. Recent vision-language model (VLM) approaches achieve promising results but rely on view-dependent reasoning or implicit representations, limiting spatial consistency and interpretability for compositional queries. We propose SceneGraphGrounder, a framework that reformulates 3D grounding as structured graph matching over a reconstructed 3D scene graph. To enable this formulation, we introduce a visual marker prompting strategy that enables a VLM to infer object-object relationships from 2D views, which are subsequently lifted into a persistent 3D scene graph encoding both spatial and semantic relations. Given a query, we construct a query graph and perform constrained alignment with the scene graph, ensuring multi-view consistency and interpretable reasoning. Experiments on the ScanRefer benchmark demonstrate that our method achieves competitive performance among zero-shot approaches, using only RGB-D inputs. We further validate our framework through real-world deployment on a mobile robot, demonstrating robust spatial reasoning in long-horizon physical environments. We will make our code publicly available upon acceptance.

2605.21783 2026-05-22 cs.LG stat.ML

MMD-Balls as Credal Sets: A PAC-Bayesian Framework for Epistemic Uncertainty in Test-Time Adaptation

Ahanaf Hasan Ariq

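AI summary This paper develops a PAC-Bayesian framework for test-time adaptation that interprets MMD-balls as credal sets, giving a natural way to quantify epistemic uncertainty. It establishes an MMD-dependent generalization bound, a finite-sample version, a uniform worst-case risk bound, and geodesic preservation bounds.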

Comments 15 pages, 0 figures. Accepted at the 2nd Workshop on Epistemic Intelligence in Machine Learning (EIML@ICML 2026)

详情
英文摘要

Test-time adaptation (TTA) methods improve model performance under distribution shift but lack formal guarantees connecting shift magnitude to prediction reliability. We develop a PAC-Bayesian framework yielding generalization bounds explicitly parameterized by the maximum mean discrepancy (MMD) between source and target distributions. Our principal contribution is interpreting MMD-balls around the source distribution as credal sets in Walley's imprecise probability theory, yielding natural epistemic uncertainty quantification. We establish: (i) a PAC-Bayesian bound with an MMD-dependent shift penalty under an RKHS-Lipschitz loss assumption; (ii) a finite-sample version via MMD concentration; (iii) a uniform worst-case risk bound over all distributions in the credal set, with a lower-upper risk decomposition; and (iv) geodesic preservation bounds explaining why kernel-guided adaptation protects local feature geometry. The credal set interpretation separates epistemic from aleatoric uncertainty and provides a principled decision criterion for when adaptation is warranted.
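As a rough illustration of the quantity this abstract is parameterized by, the sketch below computes a biased (V-statistic) estimate of the squared MMD between two one-dimensional samples with an RBF kernel, then checks whether a candidate sample falls inside an MMD-ball of a given radius around the source sample. The kernel choice, bandwidth, radius, and function names are assumptions made for illustration. The paper's credal set is defined over distributions, so this plug-in check on samples is only an empirical stand-in.

```python
import math
from typing import Sequence


def rbf(x: float, y: float, bandwidth: float = 1.0) -> float:
    """Gaussian (RBF) kernel on scalars."""
    return math.exp(-((x - y) ** 2) / (2.0 * bandwidth ** 2))


def mean_kernel(xs: Sequence[float], ys: Sequence[float], bandwidth: float) -> float:
    """Average kernel value over all pairs (x, y)."""
    total = sum(rbf(x, y, bandwidth) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd_squared(source: Sequence[float], target: Sequence[float], bandwidth: float = 1.0) -> float:
    """Biased estimate of MMD^2: E[k(s,s')] + E[k(t,t')] - 2 E[k(s,t)]."""
    return (
        mean_kernel(source, source, bandwidth)
        + mean_kernel(target, target, bandwidth)
        - 2.0 * mean_kernel(source, target, bandwidth)
    )


def in_mmd_ball(source: Sequence[float], candidate: Sequence[float],
                radius: float, bandwidth: float = 1.0) -> bool:
    """True if the empirical MMD between the samples is at most `radius`."""
    # Clamp at zero to guard against tiny negative values from rounding.
    return math.sqrt(max(mmd_squared(source, candidate, bandwidth), 0.0)) <= radius


src = [0.0, 0.5, 1.0]
near = [0.1, 0.6, 1.1]   # small shift
far = [3.0, 3.5, 4.0]    # large shift
print(in_mmd_ball(src, near, radius=0.2), in_mmd_ball(src, far, radius=0.2))
```

A larger shift between the samples yields a larger estimated MMD, so only the mildly shifted sample stays inside the ball for this radius.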

2605.21781 2026-05-22 cs.CL

Reflective Prompt Tuning through Language Model Function-Calling

通过语言模型功能调用实现反思式提示调优

Farima Fatahi Bayat, Moin Aminnaseri, Pouya Pezeshkpour, Estevam Hruschka

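AI summary This paper proposes Reflective Prompt Tuning (RPT), a framework that uses LLM function calling to simulate the iterative workflow of a human prompt engineer: a diagnostic function evaluates the target model and returns a structured diagnostic report, and the optimizer revises the prompt using the current and past reports, improving both prompt effectiveness and confidence calibration.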

Comments 17 pages, 6 figures

详情
AI中文摘要

大型语言模型(LLMs)在遵循指令和复杂推理方面的能力不断增强,使提示成为一种灵活的接口,用于在不更新参数的情况下调整模型。然而,提示设计仍然劳动密集且对格式、措辞和指令顺序高度敏感,这促使了自动提示优化方法的发展,以减少手动努力并保持推理时的灵活性。然而,现有方法通常在提示候选项上进行搜索或使用由单个示例或小批量驱动的固定批评-优化流水线,限制了它们捕捉系统性错误模式和基于失败历史进行针对性编辑的能力。我们提出Reflective Prompt Tuning (RPT),一个框架,利用语言模型功能调用模拟人类提示工程师的迭代工作流程。一个语言模型优化器调用一个诊断函数,该函数在完整的优化集上评估目标模型,总结反复出现的失败模式,并返回结构化的诊断报告。优化器利用此报告,以及累积的记忆中的先前报告,来修改下一个迭代的提示。RPT进一步通过在诊断反馈和最终提示选择中使用校准信号支持置信度-aware的优化。在三个推理任务上,RPT在初始提示上提高了高达12.9个点,与最先进方法保持竞争力,并提高了置信度校准。我们的分析显示,RPT在多跳和数学推理中特别有效,产生针对性的提示修改,与诊断的失败模式对齐,并导致任务性能和校准的提升。

英文摘要

Large language models (LLMs) have become increasingly capable of following instructions and complex reasoning, making prompting a flexible interface for adapting models without parameter updates. Yet prompt design remains labor-intensive and highly sensitive to formatting, phrasing, and instruction order, motivating automated prompt optimization methods that reduce manual effort while preserving inference-time flexibility. However, existing methods often search over prompt candidates or use fixed critique-refine pipelines driven by individual examples or small batches, limiting their ability to capture systematic error patterns and make targeted edits grounded in failure history. We propose Reflective Prompt Tuning (RPT), a framework that uses LLM function calling to simulate the iterative workflow of human prompt engineers. An LLM optimizer calls a diagnostic function that evaluates the target model over an entire optimization set, summarizes recurring failure modes, and returns a structured diagnostic report. The optimizer uses this report, together with an accumulated memory of prior reports, to revise the prompt for the next iteration. RPT further supports confidence-aware optimization by using calibration signals in diagnostic feedback and final prompt selection. Across three reasoning tasks, RPT improves over initial prompts by up to 12.9 points, remains competitive with state of the art, and improves confidence calibration. Our analyses show that RPT is especially effective on multi-hop and mathematical reasoning, producing targeted prompt revisions that align with diagnosed failure patterns and lead to gains in task performance and calibration.

2605.21780 2026-05-22 cs.LG cs.CR

Provable Robustness against Backdoor Attacks via the Primal-Dual Perspective on Differential Privacy

基于差分隐私原始-对偶视角的后门攻击可证明鲁棒性

Aman Saxena, Jan Schuchardt, Yan Scholten, Stephan Günnemann

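AI summary This paper proposes a framework based on the dual view of differential privacy for certifying robustness against backdoor attacks. By connecting randomized smoothing to privacy profiles, it provides joint robustness guarantees against training-time and inference-time attacks.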

详情
AI中文摘要

随机平滑是认证模型对对抗扰动鲁棒性的有力工具,既可通过随机化训练应对投毒攻击,也可通过随机化推理应对逃避攻击。将这些保证扩展到训练数据和测试数据同时被扰动的后门攻击仍然具有挑战性,因为训练时和测试时的随机化机制必须在同一个鲁棒性证书内进行分析。为此,我们通过隐私配置文件(privacy profile)将随机平滑与差分隐私的对偶视角联系起来,隐私配置文件提供了一种用于组合异构机制的数值方法。由此得到的框架能够对复杂的组合机制进行紧致、模块化、端到端的认证,同时利用已有的差分隐私机制分析。我们针对DP-SGD以及带推理时平滑的深度分区聚合(Deep Partition Aggregation)实例化了该框架,推导出同时针对训练时攻击和推理时攻击的联合鲁棒性保证。在MNIST和CIFAR-10上的实验证明了该框架的有效性。总体而言,我们提供了一个有原则且通用的框架,利用组合机制在复杂威胁模型下认证鲁棒性,这类威胁模型能更好地刻画现实世界攻击者的能力。

英文摘要

Randomized smoothing is a powerful tool for certifying robustness to adversarial perturbations, including poisoning attacks via randomized training and evasion attacks via randomized inference. Extending these guarantees to backdoor attacks, where training and test data are jointly perturbed, remains challenging because training- and test-time randomized mechanisms must be analyzed within a single robustness certificate. We address this by connecting randomized smoothing to the dual view of differential privacy through privacy profiles, which provide a numerical procedure for composing heterogeneous mechanisms. The resulting framework enables tight, modular, end-to-end certification of complex, composed mechanisms while leveraging existing analyses of differentially private mechanisms. We instantiate the framework for DP-SGD and Deep Partition Aggregation with inference-time smoothing, deriving joint robustness guarantees against both training-time and inference-time attacks. Experiments on MNIST and CIFAR-10 demonstrate the effectiveness of our framework. Overall, we provide a principled and general framework for using composite mechanisms to certify robustness under complex threat models that better capture the capabilities of real-world adversaries.

2605.21778 2026-05-22 cs.AI

What Counts as AI Sycophancy? A Taxonomy and Expert Survey of a Fragmented Construct

Meryl Ye, Lujain Ibrahim, Jessica Y. Bo, Myra Cheng, Ida Mattsson, Daniel Vennemeyer, Robert Kraut, Steve Rathje

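AI summary: Through a literature review and an expert survey, this paper reveals the challenges of classifying and measuring AI sycophancy, and proposes a unified taxonomy to support understanding and addressing the problem.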

Abstract

AI sycophancy has become a prominent concern in large language model (LLM) research. Yet the term lacks a consistent definition and has been applied to behaviors ranging from agreeing with a user's false claim to excessively praising the user to withholding corrective feedback. When researchers, companies, and policymakers use the same term to describe different behaviors, evaluation results become difficult to compare, mitigation strategies fail to transfer, and systems that are resistant to one form of sycophancy continue exhibiting other forms. To address this, we make two contributions. First, we reviewed 70 papers on AI sycophancy to develop a taxonomy of how the behavior has been defined and measured. The taxonomy distinguishes (1) whether a model is sycophantic toward a user's positions and beliefs, or toward the user's broader personal traits and emotions, and (2) whether this occurs through explicit, direct language or more implicit, subtle behaviors such as framing, omission, or tone. Mapping existing literature to our taxonomy reveals that current research has focused on overt forms of sycophancy toward users' beliefs, leaving more subtle and person-directed behaviors relatively understudied. Second, we surveyed 106 experts in AI sycophancy and related fields to examine whether researchers agree on which model behaviors are sycophantic. While experts are nearly unanimous in believing that sycophancy is a significant problem in current AI systems (94.3% agree), they disagree substantially on which specific behaviors qualify. Together, these findings demonstrate that AI sycophancy is a broad family of behaviors with different measurement challenges, intervention requirements, and governance implications. Our taxonomy provides a shared vocabulary for understanding and addressing these behaviors.

2605.21776 2026-05-22 cs.CL

PromptNCE: Pointwise Mutual Information Predictions Using Only LLMs and Contrastive Estimation Prompts

Juliette Woodrow, Chris Piech

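AI summary: This paper proposes PromptNCE, which recasts conditional probability estimation as a contrastive task and introduces an explicit OTHER category to recover the true conditional probability, enabling zero-shot pointwise mutual information estimation in low-data settings.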

Abstract

Estimating mutual information from text usually requires training a task-specific critic, which limits its use in low-data settings. We ask whether large language models can instead estimate pointwise mutual information zero-shot, using only prompts and elicited probabilities. We introduce a benchmark with human-derived ground-truth PMI across three publicly available datasets, and evaluate five information-theoretic prompting-based estimators. Our main method, PromptNCE, frames conditional probability estimation as a contrastive task and augments the candidate set with an explicit OTHER category. We show theoretically that adding OTHER recovers the true conditional P(y | x) rather than just a ranking over listed candidates, turning a contrastive prompt into a general-purpose zero-shot probability estimator. PromptNCE is the best zero-shot method on all three datasets, reaching Spearman correlation up to 0.82 with human-derived PMI. We also present a case study in computer science education showing how these estimators can be used to score student knowledge summaries in a low-data setting.
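
To make the role of the OTHER category concrete, the sketch below computes pointwise mutual information, PMI(x; y) = log(P(y | x) / P(y)), from elicited option probabilities. It is an illustration under assumptions rather than the paper's prompts or code: the probability masses are invented, the helper names are hypothetical, and the marginal P(y) is assumed to be elicited the same way without the context x.

```python
import math

def renormalize(scores):
    """Turn non-negative option scores into a distribution that sums to 1."""
    total = sum(scores.values())
    if total <= 0:
        raise ValueError("scores must have positive total mass")
    return {option: value / total for option, value in scores.items()}

def pmi(p_y_given_x, p_y):
    """Pointwise mutual information in nats."""
    return math.log(p_y_given_x / p_y)

# Invented masses a model might assign to the listed options.
with_context = renormalize({"A": 0.30, "B": 0.10, "OTHER": 0.60})
without_context = renormalize({"A": 0.05, "B": 0.05, "OTHER": 0.90})
pmi_a = pmi(with_context["A"], without_context["A"])

# Dropping OTHER forces all mass onto the listed candidates, so the
# value for A is only a relative ranking score, not P(A | x).
closed_set = renormalize({"A": 0.30, "B": 0.10})
```

Here candidate A keeps probability 0.30 when OTHER absorbs the unlisted outcomes, but is inflated to 0.75 once the distribution is renormalized over A and B alone.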

2605.21770 2026-05-22 cs.LG

Manifold-Guided Attention Steering

Ian Li, Kapilesh Guruprasad, Raunak Sengupta, Ninad Satish, Loris D'Antoni, Rose Yu

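AI summary: This paper proposes manifold-guided attention steering, which monitors the distance between attention heads and a correctness manifold during inference and dynamically corrects deviations, improving large language model performance on tasks such as mathematical reasoning, code generation, and molecular generation.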

Abstract

Large language models frequently produce errors in reasoning tasks despite possessing the underlying knowledge required for correct reasoning. One possible approach to improve reasoning consistency is through activation steering. However, existing activation steering approaches apply fixed, pre-computed correction vectors, ignoring where the model currently sits along its generation trajectory; the result is indiscriminate perturbation that disrupts already-correct steps as freely as erroneous ones. We propose Manifold-Guided Attention Steering (MAGS), a trajectory-aware inference-time intervention grounded in a geometric observation: the output activations of specific attention heads diverge from a low-dimensional correctness manifold at the point of error, and this deviation compounds through subsequent steps. For each identified attention head, we learn a low-dimensional subspace from contrastive pairs of correct and incorrect traces that capture the directions along which error behavior deviates from correct behavior. During inference, we monitor each head's proximity to this manifold and apply a targeted projection correction when deviation exceeds a learned threshold, steering the attention output back toward the correct subspace before the error propagates. MAGS consistently outperforms both unsteered baselines and static steering approaches across benchmarks spanning mathematical reasoning (MATH-500, GSM8K), code generation (HumanEval, MBPP), and molecular generation (SMILES), suggesting that correctness manifolds are a general feature of LLM attention geometry.
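
The monitor-then-project step in this abstract can be pictured with a small sketch. It reflects one possible reading, not the authors' method: the learned subspace is assumed to be given as an orthonormal list of error directions, deviation is taken as the norm of the activation's component in that subspace, and correction removes that component. The name `steer`, the threshold value, and the example vectors are hypothetical.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def steer(head_output, error_dirs, threshold):
    """Return (output, deviation); correct only when deviation exceeds threshold.

    error_dirs is assumed to be an orthonormal list of direction vectors.
    """
    coeffs = [dot(head_output, u) for u in error_dirs]
    deviation = math.sqrt(sum(c * c for c in coeffs))
    if deviation <= threshold:
        # Close enough to the assumed correct region: leave the step untouched.
        return list(head_output), deviation
    corrected = list(head_output)
    for c, u in zip(coeffs, error_dirs):
        # Subtract the component lying along each error direction.
        corrected = [x - c * ui for x, ui in zip(corrected, u)]
    return corrected, deviation

# Toy example: one error direction along the third coordinate.
error_dirs = [[0.0, 0.0, 1.0]]
on_track, d1 = steer([1.0, 2.0, 0.5], error_dirs, threshold=1.0)
drifted, d2 = steer([1.0, 2.0, 3.0], error_dirs, threshold=1.0)
```

The threshold is what distinguishes this from static steering: the first vector is returned unchanged, while only the second, which has drifted past the threshold, is projected.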