arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.26434 2026-05-27 cs.LG cs.AI

Aperiodic and Low-Frequency Spectral Bias in Reconstruction based EEG Foundation Models

Aditya Kommineni, Emily Zhou, Kleanthis Avramidis, Simon Bock Segaard, Jeppe Roden Münster, Andreas Peter Juhl Hansen, Takfarinas Medani, Tiantian Feng, Richard Leahy, Shrikanth Narayanan

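AI summary: The study shows that EEG foundation models pre-trained with reconstruction objectives are biased toward aperiodic and low-frequency components, which leads to poor performance in low-resource settings, and it proposes auxiliary losses that target high-frequency oscillatory structure as a remedy.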

Comments 18 pages, 13 figures, 3 tables

Abstract

EEG foundation models, pre-trained on large-scale unlabelled EEG data, have emerged as a promising direction towards learning generalizable EEG representations. Despite showing positive results in data-rich regimes, they often fail to outperform significantly smaller, fully supervised models in low-resource settings. We provide a mechanistic account of this shortcoming, attributing it to a fundamental mismatch between reconstruction-based pretext tasks and the idiosyncratic spectral structure of EEG signals, which decompose into distinct high-power aperiodic and low-power oscillatory components. Using controlled, synthetically generated EEG inputs, we demonstrate that EEG foundation model embeddings are biased to capture the aperiodic components of the EEG signal while under-representing oscillatory components, particularly at higher frequencies. Additionally, linear probe evaluations on real-world BCI datasets further reveal that embeddings encode subject identity more strongly than task-relevant information, thereby reinforcing the low-frequency and aperiodic component bias in foundation model embeddings trained primarily on reconstruction-based objectives. Together, these findings elucidate a failure mode in reconstruction-based EEG foundation models and motivate future work to incorporate auxiliary losses explicitly targeting high-frequency oscillatory structure as a path toward more capable and generalizable EEG representations.
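
To make the spectral decomposition concrete, here is a minimal sketch. It is not the paper's code. It assumes a commonly used parameterization in which the power spectrum is a 1/f-like aperiodic term plus Gaussian-shaped oscillatory peaks. The function names and every numeric value (offset, exponent, peak centers, heights, widths) are illustrative assumptions, including the choice of a smaller peak height at 40 Hz than at 10 Hz.

```python
import math

def aperiodic_power(freq_hz, offset=10.0, exponent=1.5):
    """Aperiodic (1/f-like) power at one frequency; parameters are illustrative."""
    return offset / (freq_hz ** exponent)

def oscillatory_power(freq_hz, center_hz, height, width_hz=1.5):
    """Gaussian-shaped oscillatory peak centred at center_hz (assumed shape)."""
    return height * math.exp(-((freq_hz - center_hz) ** 2) / (2 * width_hz ** 2))

def oscillatory_share(center_hz, height, f_lo=1.0, f_hi=45.0, step=0.5):
    """Fraction of total power in [f_lo, f_hi] contributed by one oscillatory peak."""
    n = int((f_hi - f_lo) / step) + 1
    freqs = [f_lo + i * step for i in range(n)]
    osc = sum(oscillatory_power(f, center_hz, height) for f in freqs)
    ape = sum(aperiodic_power(f) for f in freqs)
    return osc / (osc + ape)

# Hypothetical peaks: a stronger one near 10 Hz, a weaker one near 40 Hz.
alpha_share = oscillatory_share(center_hz=10.0, height=0.5)
gamma_share = oscillatory_share(center_hz=40.0, height=0.05)
print(f"10 Hz peak share: {alpha_share:.3f}, 40 Hz peak share: {gamma_share:.3f}")
```

Under these assumed values both peaks carry a small fraction of the total power, and the high-frequency peak carries the least. Because a squared-error reconstruction objective weights errors by their magnitude, components with little power contribute little to the loss, which is one way to read the mismatch the abstract describes.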

2605.26433 2026-05-27 cs.CL

Vectors Are Not Neutral: Sensitive-Information Inference from Exported LLM Representations in Summarization

Weixin Liu, Bowen Qu, Juming Xiong, Congning Ni, Bradley A. Malin, Zhijun Yin

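AI summary: The paper studies the risk of sensitive-information leakage from vectors exported by LLM summarization systems and proposes SurfaceLoRA, a fine-tuning method that lowers recoverability from the targeted vector, although untargeted pooled vectors remain at risk.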

Comments 30 pages, 2 figures; preprint

Abstract

Large language model (LLM) summarization systems may pass compact vector representations of private inputs to downstream retrieval, monitoring, audit, or analytic workflows. Even when source documents remain access-restricted, derived vectors may be handled under different access controls and still support sensitive-information inference, creating a residual information-disclosure risk. We study this issue in clinical discharge-summary generation as a high-stakes case study, using electronic health record (EHR)-recorded race as a controlled sensitive-label audit. We audit two artifacts that a system might retain or expose to downstream components: the final prompt-token hidden state and the mean-pooled prompt representation. Our results show that reducing recoverability of the case-study sensitive label from one exported artifact does not necessarily reduce recoverability from another. As a mitigation case study, we introduce SurfaceLoRA, an exported-vector-targeted parameter-efficient fine-tuning method that uses a gradient-reversal discriminator attached to a designated exported vector. Under a balanced five-way probing protocol, SurfaceLoRA reduces EHR-recorded race recoverability from the targeted final-token artifact toward chance while preserving summarization utility, yet recoverability remains substantially higher from untargeted pooled artifacts. These findings show that privacy auditing and mitigation should be performed on the exact vector artifact retained or exposed to downstream components.
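
The two audited artifacts are different functions of the same per-token hidden states, which is why protecting one does not automatically protect the other. The sketch below only illustrates how the two vectors are derived. The function names and toy values are assumptions, and it does not implement SurfaceLoRA or the gradient-reversal discriminator.

```python
def final_token_state(hidden_states):
    """Artifact 1: hidden state of the last prompt token."""
    return list(hidden_states[-1])

def mean_pooled_state(hidden_states):
    """Artifact 2: element-wise mean over all prompt-token hidden states."""
    n_tokens = len(hidden_states)
    dim = len(hidden_states[0])
    return [sum(tok[d] for tok in hidden_states) / n_tokens for d in range(dim)]

# Toy prompt with 3 tokens and hidden size 2 (values are made up).
hidden = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0]]
final_vec = final_token_state(hidden)
pooled_vec = mean_pooled_state(hidden)
print(final_vec, pooled_vec)
```

An audit in the spirit of the abstract would train a separate probe on each exported vector, since a mitigation attached to `final_vec` places no direct constraint on `pooled_vec`.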

2605.26423 2026-05-27 cs.LG eess.IV

FM-fMRI: Event Conditioned Flow Matching for Rest-to-Task fMRI Time-Series Synthesis

Peiyu Duan, Jiyao Wang, Nicha C. Dvornek, Junlin Yang, Ziqi Gao, Lawrence H. Staib, James S. Duncan

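AI summary: Proposes FM-fMRI, which uses event-conditioned flow matching to generate task fMRI time series from resting-state fMRI and task event information; it outperforms diffusion, GAN, and VAE baselines on spectral, connectivity, and distributional matching and improves autism classification.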

Comments MICCAI 2026 Early Accepted

Abstract

Task-based fMRI provides a direct readout of task-evoked neural dynamics, but it is expensive and difficult to acquire at scale, motivating rest-to-task synthesis from widely available resting-state fMRI (rsfMRI). We propose FM-fMRI, an event-conditioned flow-matching model that learns a continuous-time conditional vector field to generate task ROI time series from a subject's rsfMRI and the task event information. The formulation enables fast ODE-based sampling and flexible conditioning over heterogeneous event schedules. Rather than optimizing for pointwise reconstruction, we evaluate generated signals using complementary criteria that probe temporal and spectral structure, subject- and group-level connectome consistency, and distributional alignment. On the public Human Connectome Project and the internal BioPoint autism cohort, FM-fMRI achieves the strongest spectral and connectivity agreement and improved distribution-level matching over conditional diffusion, generative adversarial network (GAN), and variational autoencoder (VAE) baselines. Furthermore, we augment the BioPoint cohort by synthesizing task-fMRI ROI time series with our method, improving downstream autism classification and demonstrating practical utility in data-limited clinical settings. The code will be available on GitHub.
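
ODE-based sampling in a flow-matching model amounts to integrating a velocity field from t = 0 to t = 1. The sketch below uses plain Euler steps and a closed-form toy field in place of the learned, event-conditioned network, which the abstract does not specify. `make_toy_field`, the step count, and the example values are all assumptions for illustration.

```python
def euler_sample(x0, velocity, n_steps=10):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt  # t stays strictly below 1, so the toy field never divides by zero
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

def make_toy_field(target):
    """Stand-in for a learned field: exact velocity of the straight path to `target`."""
    def velocity(x, t):
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity

# Hypothetical 3-sample "time series" used as the endpoint of the flow.
target_series = [0.5, -1.0, 2.0]
sample = euler_sample([0.0, 0.0, 0.0], make_toy_field(target_series))
print(sample)
```

In a real model the velocity function would be a neural network that also receives the rsfMRI and event conditioning; only the integration loop carries over from this sketch.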

2605.26421 2026-05-27 cs.CV

HydraPrompt: An Adaptive and Asymmetric Framework of Vision-Language Models for Synthetic Image Detection

Senyuan Shi, Hao Tan, Zichang Tan, Shuhan Feng, Ajian Liu, Sergio Escalera, Jun Wan

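AI summary: Proposes HydraPrompt, an asymmetric prompting framework that dynamically adjusts category centers by aligning with fine-grained image cues and, combined with conditional supervised contrastive learning, achieves state-of-the-art synthetic image detection.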

Comments 8 pages, 6 figures

Abstract

The rapid evolution of generative models has precipitated a proliferation of fabricated content, posing significant challenges to existing Synthetic Image Detection (SID) methods. Capitalizing on advancements in vision-language models (e.g., CLIP), recent attempts have leveraged learnable textual prompts to identify synthetic images. However, they still use static prompts as a fixed boundary between real and fake images, failing to adapt to the varying types of forgery that emerge during inference. To overcome this issue, we propose **HydraPrompt**, an asymmetric prompting framework that dynamically adjusts the category centers by aligning with fine-grained image cues. Specifically, we propose an Asymmetric Prompt Adapter (**APA**): (1) for the authentic category, we introduce a single set of prompts to capture the consistent representative patterns, which serves as a unified anchor for real content; (2) for the fake category, we construct sample-adaptive prompts that specialize in capturing diverse cues from different samples, enabling adaptive modeling of forgery image variations. To increase discriminability among different synthetic images, we further introduce a Conditional Supervised Contrastive (**CSC**) objective, which compacts the authentic representations while capturing fine-grained forgery clues. Extensive experiments on popular SID benchmarks demonstrate the state-of-the-art performance of our framework.

2605.26419 2026-05-27 cs.LG

Amortized Factor Inference Networks for Posterior Inference

Joohwan Ko, Justin Domke

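AI summary: Proposes Amortized Factor Inference Networks (AFINs), whose encode-merge-decode architecture generalizes posterior inference across different priors, likelihoods, and dimensionalities, sharply reducing test-time compute while maintaining posterior accuracy.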

Abstract

Amortized inference promises fast test-time Bayesian inference, but existing methods are inherently tied to fixed models. Extending amortization to unseen models typically requires retraining or costly test-time finetuning. In this paper, we ask: is it possible to build a single inference network capable of generalizing across varying priors, likelihoods, and dimensionality? We introduce Amortized Factor Inference Networks (AFINs), a family of encode-merge-decode inference networks built on dimension-independent modules that map a model specification and its observations to the parameters of a variational posterior. Experimentally, a single trained AFIN achieves posterior accuracy comparable to NUTS and several variational inference methods, while requiring 2 to 4 orders of magnitude less test-time compute. Code is available at https://github.com/joohwanko/AFINs.

2605.26415 2026-05-27 cs.CV cs.AI

The Rescue Effect: Spatio-Semantic Early Exit Bypasses Quantization Collapse in CLIP

Kahyeon Nam, Hyesong Choi

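AI summary: To address the representation collapse caused by INT8 quantization of CLIP, proposes LRA-EE, which performs early exit through spatio-semantic aggregation, a multi-feature gate, and layer-adaptive thresholds; on ImageNet-1K zero-shot classification it reduces FLOPs by 13.4% and raises Top-1 accuracy by 2.44 percentage points.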

Abstract

Deploying Vision-Language Models on resource-constrained hardware typically requires INT8 quantization, but in joint-embedding architectures such as CLIP this introduces a failure mode distinct from quantized CNN classifiers: activation noise accumulated across transformer blocks perturbs the direction of the multimodal embedding, eroding the cosine alignment on which zero-shot retrieval depends. We characterize this as Quantization-Induced Representation Collapse (QIRC) and quantify it on INT8 CLIP ViT-B/32, where the layer-wise noise-to-signal ratio grows from below 10% in shallow blocks to 52% at Layer 11. We propose LRA-EE (Layer-wise Representation-Aware Early Exit), which bypasses noise-saturated deep layers via Spatio-Semantic Aggregation (replacing the immature shallow [CLS] with a global patch-token average), a learned multi-feature gate (confidence, top-2 margin, spatial-activation variance), and Layer-adaptive Confidence Thresholding calibrated to each layer's Information-to-Noise Ratio. On ImageNet-1K zero-shot classification, LRA-EE reduces FLOPs by 13.4% and improves Top-1 accuracy by +2.44%p (58.72% -> 61.16%) over the INT8 baseline. A four-quadrant decomposition isolates the Rescue Effect: 9.5% of samples are correctly classified at shallow exits but lost to noise at full depth, against only 7.1% suffering the inverse.
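
The early-exit idea can be sketched as a loop over per-layer predictions that stops at the first layer whose gate fires. The code below is a simplified stand-in, not LRA-EE itself: it computes two of the named gate features (confidence and top-2 margin) and applies a plain per-layer confidence threshold, whereas the paper describes a learned gate with thresholds calibrated to each layer's Information-to-Noise Ratio. The logits and thresholds are invented.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def gate_features(logits):
    """Return (confidence, top-2 margin) for one layer's class logits."""
    probs = sorted(softmax(logits), reverse=True)
    return probs[0], probs[0] - probs[1]

def early_exit_layer(per_layer_logits, thresholds):
    """Index of the first layer whose confidence clears its threshold, else the last layer."""
    for layer, (logits, thr) in enumerate(zip(per_layer_logits, thresholds)):
        confidence, _margin = gate_features(logits)
        if confidence >= thr:
            return layer
    return len(per_layer_logits) - 1

# Hypothetical logits at three exit points and hand-picked per-layer thresholds.
layer_logits = [[0.1, 0.0, 0.0], [4.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
thresholds = [0.9, 0.8, 0.5]
exit_layer = early_exit_layer(layer_logits, thresholds)
print(exit_layer)
```

Exiting at a shallow layer saves the compute of the remaining blocks, and per the abstract it can also avoid the deeper layers where quantization noise is highest.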

2605.26414 2026-05-27 cs.AI cs.CL cs.LG

Reasoning, Code, or Both? How Large Language Models Handle Variations in Math Questions

Matthew Kutakh

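AI summary: Comparing chain-of-thought reasoning, single-shot code execution, and iterative code execution on the GSM-Symbolic dataset, the study finds that code execution does not improve the reasoning robustness of large language models on math problem variations.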

Comments 6 pages, 4 figures, 2 tables

Abstract

Large Language Models (LLMs) achieve impressive accuracy on mathematical reasoning benchmarks, yet their performance drops when problems are modified with simple changes like different names or numbers. Code execution methods, which let models generate and run Python code instead of reasoning in natural language, have been proposed as a solution, but their effect on reasoning robustness (the ability to maintain accuracy across problem variations) has not been systematically tested. This study evaluates three approaches on 1,000 problems from the GSM-Symbolic dataset: pure reasoning using chain-of-thought (CoT) prompting, single-shot code execution using Program-Aided Language models (PAL), and iterative code execution using Step-by-Step Coding (SBSC). All three were run on paired original and modified problems using Claude Haiku 4.5. CoT was the most robust method, with an accuracy drop of 1.3 percentage points and 1.8% of problems breaking under perturbation. PAL was the least robust, with an accuracy drop of 1.7 percentage points and 3.1% of problems breaking; SBSC fell in between. Although these differences were not statistically significant ($p = .096$), the directional trend was consistent across all measures, suggesting that code execution, whether single-shot or iterative, does not improve reasoning robustness on grade-school-level problem variations.
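
The two robustness measures reported above answer different questions, as a small computation over paired results shows. This sketch assumes that "breaking" means a problem answered correctly in its original form and incorrectly in its modified form, with the rate taken over all problems; the abstract does not state the exact definition, and the function name and toy data are hypothetical.

```python
def robustness_metrics(original_correct, modified_correct):
    """Compute accuracy drop (percentage points) and break rate (%) from paired results."""
    n = len(original_correct)
    acc_orig = sum(original_correct) / n
    acc_mod = sum(modified_correct) / n
    # A problem "breaks" if it was solved in original form but not after modification.
    broken = sum(1 for o, m in zip(original_correct, modified_correct) if o and not m)
    return {
        "accuracy_drop_pp": 100.0 * (acc_orig - acc_mod),
        "break_rate_pct": 100.0 * broken / n,
    }

# Toy paired outcomes for four problems (not data from the study).
original = [True, True, True, False]
modified = [True, False, True, True]
metrics = robustness_metrics(original, modified)
print(metrics)
```

In the toy data one problem breaks while another is newly solved, so the break rate is nonzero even though the accuracy drop is zero. This is why the two numbers are reported separately.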

2605.26405 2026-05-27 cs.CL

Towards Just-in-Time Adaptive Feedback: Enhancing Student Learning via Knowledge-Grounded LLM

Younghun Lee, Amir Bralin, Nobel Sanjay Rebello, Dan Goldwasser

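AI summary: Proposes a framework that grounds large language models in domain-expert knowledge to deliver just-in-time adaptive feedback in authentic instructional settings, improving student performance by over 80% in a large-scale university course.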

Comments 8 pages, Accepted to 21st Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2026)

Abstract

Educational interventions are effective tools for enhancing student learning. While Large Language Models (LLMs) allow for generating adaptive feedback at scale, current studies lack clear methodologies for providing Just-in-Time (JiT) feedback in authentic instructional settings. In this paper, we present a framework that provides adaptive feedback by grounding LLMs with domain-specific expert knowledge. Our approach collects written reasoning logic (strategy essays) from students, analyzes potential error types based on the content of that reasoning, and delivers non-intrusive feedback designed to clarify missing or incorrect concepts. We deploy this framework in a large-scale university course (N > 1000), where it improved student performance by over 80% compared to previous semesters. Lastly, we validate the framework's pedagogical utility by analyzing learning trajectories; we demonstrate how iterative conversations with the LLM facilitate the shift from misconceptions to correct understanding.

2605.26403 2026-05-27 cs.AI

From Static Context to Calibrated Interactive RL: Mitigating Distribution Shift in Multi-turn Dialogue with Aligned Simulator

Xiaohua Wang, Jiakang Yuan, Zisu Huang, Muzhao Tian, Changze Lv, Kaitao Song, Tao Chen, Xiaoqing Zheng

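AI summary: Proposes Calibrated Interactive RL, a framework that couples interactive RL with simulator alignment to mitigate the policy-induced and simulator-induced distribution shift in multi-turn dialogue and improve dialogue quality.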

Abstract

A long-standing goal of the research community is to develop highly interactive LLM-based dialogue agents. Recent research focuses on optimizing policies based on fixed offline logs (Static Context RL) or using a prompt-based simulator (Interactive RL). In this work, we theoretically show that both paradigms are fundamentally limited by context distribution shift--a mismatch between dialogue histories observed during training and those encountered in real conversations. This shift compounds quadratically over turns and severely degrades dialogue quality. Specifically, we attribute this shift to two distinct sources: (i) policy-induced shift, arising from training on static histories rather than self-generated trajectories; and (ii) simulator-induced shift, stemming from discrepancies between simulated and real human behaviors. To address these challenges, we propose Calibrated Interactive RL, a unified framework that couples interactive RL with simulator alignment. By aligning the simulator with human interaction patterns, our approach reduces the sim-to-real gap and mitigates compounding distribution shifts. Experiments across multiple dialogue tasks confirm our theoretical analysis: (i) Interactive RL significantly outperforms the Static Context baseline by mitigating policy distribution shift; and (ii) calibrating simulators with our alignment method further bridges the sim-to-real gap, yielding state-of-the-art downstream performance.
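
One intuition for quadratic compounding: if each turn adds a small shift to the dialogue history and every later turn is conditioned on that history, the total cost summed over turns grows with the square of the number of turns. The toy model below illustrates that arithmetic only. It is an assumption made for illustration and is not the bound derived in the paper, which the abstract does not state.

```python
def cumulative_shift(num_turns, per_turn_shift):
    """Toy model: a shift introduced at one turn persists in the history of all later turns."""
    total = 0.0
    carried = 0.0
    for _ in range(num_turns):
        carried += per_turn_shift  # shift accumulated in the history so far
        total += carried           # cost paid at this turn
    return total

# Doubling the dialogue length roughly quadruples the accumulated cost.
short_dialogue = cumulative_shift(10, 0.01)
long_dialogue = cumulative_shift(20, 0.01)
print(short_dialogue, long_dialogue, long_dialogue / short_dialogue)
```

In this model the total equals `per_turn_shift * T * (T + 1) / 2` for `T` turns, so reducing the per-turn shift, for example through a better-aligned simulator, scales the whole sum down.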

2605.26399 2026-05-27 cs.CV

OmniGF: A Dual-Branch Vision-Language Framework for Unified Gaze Following

Qiaomu Miao, Haoyu Wu, Jingyi Xu, Minh Hoai, Dimitris Samaras

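AI summary: Proposes OmniGF, which combines a dual-branch decoding strategy (a language branch that generates discrete reasoning states and a continuous spatial branch that uses dense hidden states) with head embeddings to support precise spatial gaze estimation, semantic gaze prediction, and complex social gaze reasoning in multi-person scenes, setting a new state of the art on multiple benchmarks.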

Abstract

Understanding human gaze behavior is essential for complex scene comprehension and human-computer interaction. Traditional gaze following models are typically restricted to pure spatial localization, lacking the high-level capacity to reason about semantic targets or complex social contexts. Furthermore, these models often process individuals sequentially, requiring redundant computations over the same scene image for multi-person inference. While recent Vision-Language Models (VLMs) offer the exceptional semantic reasoning needed to address gaze-related semantic tasks, their reliance on discrete text generation inherently limits precision in continuous spatial tasks like gaze localization. To bridge this gap, we propose OmniGF, a unified vision-language framework that adapts foundational VLMs for highly scalable multi-person gaze reasoning. The model adopts a dual-branch decoding strategy: a structured language branch generates discrete reasoning states, while a continuous spatial branch directly taps into the VLM's dense hidden states. Supervising these extracted representations with high-resolution gaze target heatmaps effectively overcomes the spatial bottleneck of text-only coordinate generation. Furthermore, to explicitly ground the model in multi-person scenes, we augment the input with head embeddings encoded from cropped head images, providing fine-grained appearance and orientation cues for all individuals simultaneously. By modeling all individuals and leveraging the strong semantic capability of VLMs, OmniGF seamlessly integrates precise spatial gaze target estimation, semantic gaze prediction, and complex social gaze reasoning. Extensive experiments demonstrate that our framework establishes new state-of-the-art performance across multiple standard benchmarks. Code is available at https://github.com/cvlab-stonybrook/omnigf.

2605.26394 2026-05-27 cs.CL

Memory Architectures for Multi-Turn Text-to-SQL: A Benchmark and Empirical Study

Ravi Kumar Tummalapenta, Suman Addanki

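AI summary: For multi-turn Text-to-SQL, introduces the EnterpriseMem-Bench benchmark and evaluates five frontier models, finding that stateless approaches drop to zero execution accuracy by Turn 3 and that memory-architecture complexity does not monotonically improve accuracy.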

Comments 18 pages, 4 figures, 14 tables; includes appendices with verbatim prompts, example session, and full ablation tables; prepared by the LLM Suite Engineering Team, JP Morgan Chase & Co

Abstract

Multi-turn Text-to-SQL is central to enterprise analytics yet remains predominantly evaluated in single-turn settings. We introduce EnterpriseMem-Bench, a multi-turn Text-to-SQL benchmark of 300 sessions and 1,400 turns built programmatically from three enterprise domains (BIRD financial, SEC EDGAR, Northwind), with deterministic ground truth and per-turn memory-critical annotation. We evaluate five frontier models -- GPT-5 mini, GPT-5.2, Claude Sonnet 4.5, Sonnet 4.6, and Opus 4.6 -- across five memory conditions enabling a three-way ablation isolating working-memory window size, episodic retrieval, and semantic augmentation as independent effects. All Claude models are evaluated with extended thinking enabled to maintain parity with GPT reasoning models. We introduce the Memory Benefit Score (MBS) as a per-turn diagnostic metric. Four findings emerge: (1) stateless multi-turn Text-to-SQL collapses to zero execution accuracy by Turn 3 across all five models, even under reasoning; (2) memory-architecture complexity does not monotonically improve accuracy -- working memory dominates, and additional components produce model- and dataset-dependent effects from +14 to -16 percentage points; (3) Claude Sonnet 4.6 underperforms Sonnet 4.5 by 17-33pp on SEC EDGAR across conditions, a generational regression persisting under reasoning; (4) under reasoning, Claude error distributions become mono-modal -- every non-correct turn is a wrong-result error. We release the benchmark, agent, and evaluation code.
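
The difference between the stateless condition and a working-memory window can be sketched as a prompt builder that includes the last `window` turns. The prompt layout, the function name, and the example schema and queries are hypothetical; the benchmark's verbatim prompts are in the paper's appendices and are not reproduced here.

```python
def build_prompt(schema, history, question, window):
    """Assemble a Text-to-SQL prompt; window=0 corresponds to the stateless condition.

    history is a list of (question, sql) pairs from earlier turns in the session.
    """
    recent = history[-window:] if window > 0 else []
    lines = [f"Schema: {schema}"]
    for past_question, past_sql in recent:
        lines.append(f"Q: {past_question}")
        lines.append(f"SQL: {past_sql}")
    lines.append(f"Q: {question}")
    lines.append("SQL:")
    return "\n".join(lines)

# Toy session: the second question only makes sense given the first turn.
schema = "sales(region, amount, year)"
history = [("Total sales by region?",
            "SELECT region, SUM(amount) FROM sales GROUP BY region")]
stateless = build_prompt(schema, history, "Now only for 2023", window=0)
with_memory = build_prompt(schema, history, "Now only for 2023", window=1)
print(with_memory)
```

With `window=0` the follow-up "Now only for 2023" arrives without the query it refers to, which is consistent with the collapse of stateless accuracy on later turns reported in the abstract.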

2605.26383 2026-05-27 cs.CV

Zero-Shot Object Re-Identification in Egocentric Kitchen Videos via Multi-Stage SAM3 Feature Fusion

Dmytro Klepachevskyi, Alexander Wong, Sirisha Rambhatla, Yuhao Chen

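AI summary: To address object re-identification in egocentric kitchen videos, proposes a multi-stage zero-shot method built on SAM3 segmentation that fuses SAM3, DINOv2, and CLIP features and adds mask-shape IoU and k-reciprocal re-ranking, raising mAP from 45.3% to 52.8%.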

Abstract

Object re-identification (ReID) in egocentric kitchen videos is challenging due to rapid viewpoint changes, frequent occlusions, cluttered scenes, and large intra-class appearance variations. Objects may leave and re-enter the field of view, and the large diversity of instances with limited annotations makes supervised ReID difficult to scale, motivating zero-shot approaches. We study zero-shot object ReID on the EPIC-Kitchens benchmark, where the goal is to match active food and kitchen-tool instances across frames using only pre-trained visual features. We first evaluate five state-of-the-art feature extractors, including Vision-Language Models (VLMs) - CLIP, DINOv2, DreamSim, I-JEPA, and SAM3 - and show that zero-shot methods fail, with the best baseline achieving only 45.3% mAP. We then propose an Enhanced SAM3 ReID Pipeline, a zero-shot multi-stage method built around SAM3 segmentation as the core component. Stage 1 uses SAM3 to suppress background clutter. Stage 2 fuses embeddings from SAM3, DINOv2, and CLIP into a single L2-normalized descriptor. Stage 3 augments cosine similarity with mask-shape IoU for geometric consistency, and Stage 4 applies k-reciprocal re-ranking. The full pipeline improves performance by 7.5 mAP points, from 45.3% to 52.8%.
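
Stages 2 and 3 can be sketched in a few lines. The abstract does not say how the embeddings are fused or how the IoU term is weighted, so the sketch makes three assumptions: each model's embedding is L2-normalized before concatenation, the masks are already aligned to a common grid, and the two similarity terms are mixed with a hypothetical `iou_weight`. Stage 1 (segmentation) and Stage 4 (k-reciprocal re-ranking) are not shown.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)

def fuse(embeddings):
    """Stage 2 sketch: normalize each embedding, concatenate, then normalize the result."""
    fused = [v for emb in embeddings for v in l2_normalize(emb)]
    return l2_normalize(fused)

def mask_iou(mask_a, mask_b):
    """IoU of two same-shape binary masks given as flat lists of 0/1."""
    inter = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    union = sum(1 for a, b in zip(mask_a, mask_b) if a or b)
    return inter / union if union else 0.0

def match_score(desc_a, desc_b, mask_a, mask_b, iou_weight=0.2):
    """Stage 3 sketch: cosine similarity of unit-norm descriptors blended with mask IoU."""
    cosine = sum(a * b for a, b in zip(desc_a, desc_b))
    return (1 - iou_weight) * cosine + iou_weight * mask_iou(mask_a, mask_b)

# Toy embeddings standing in for the outputs of two feature extractors.
descriptor = fuse([[3.0, 4.0], [0.0, 2.0]])
score_same = match_score(descriptor, descriptor, [1, 1, 0, 0], [1, 1, 0, 0])
print(descriptor, score_same)
```

Normalizing before concatenation keeps one extractor with large-magnitude features from dominating the fused descriptor, which is the motivation for that assumption.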

2605.26382 2026-05-27 cs.CV

Detail Consistent Stage-Wise Distillation for Efficient 3D MRI Segmentation

Mengchen Fan, Baocheng Geng, Xi Xiao, Tianyang Wang, Siyuan Mei, Pulin Che, Xiaoqian Jiang, Qizhen Lan

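AI summary: Proposes Detail Consistent Distillation (DCD), a stage-wise distillation framework that aligns teacher-student features through wavelet decomposition to preserve multi-scale structural detail for efficient 3D MRI segmentation.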

Comments Accepted by MICCAI 2026. 11 pages, 3 figures

Abstract

Deploying high-performing 3D medical image segmenters (e.g., nnU-Net) is often limited by memory footprint and inference latency. Compression is therefore necessary, but compact 3D encoders tend to lose fine structural cues (small lesions and sharp boundaries) as downsampling repeats across multi-resolution stages. We propose Detail Consistent Distillation (DCD), a stage-wise distillation framework that preserves structural detail across scales by aligning teacher-student features in a wavelet-decomposed representation. At each encoder stage, DCD distills directional detail components in the wavelet domain while leaving the coarse approximation comparatively unconstrained, avoiding over-regularization of global semantics. DCD is used only during training and introduces no inference-time overhead. Experiments on the BraTS 2024 and ISLES 2022 benchmarks demonstrate that our approach achieves superior performance in MRI segmentation using 3D multi-modal data. Code and implementation details for DCD are publicly available at https://github.com/ClinicaAlpha/DCD-3D-MedSeg.
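
The core idea, matching detail coefficients while leaving the coarse approximation free, can be shown with a one-level Haar transform on a 1D feature vector. This is a deliberately reduced sketch: the paper works on 3D features with directional detail components, and the choice of the Haar wavelet, the mean-squared-error loss, and the function names are all assumptions here.

```python
def haar_1d(signal):
    """One-level Haar transform of an even-length list -> (approximation, detail)."""
    s = 2 ** 0.5
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail

def detail_distillation_loss(teacher_feat, student_feat):
    """Mean squared error on detail coefficients only; the approximation is unconstrained."""
    _, t_detail = haar_1d(teacher_feat)
    _, s_detail = haar_1d(student_feat)
    return sum((t - s) ** 2 for t, s in zip(t_detail, s_detail)) / len(t_detail)

teacher = [1.0, 3.0, 2.0, 2.0]
shifted_student = [11.0, 13.0, 12.0, 12.0]  # same local structure, different coarse level
flat_student = [2.0, 2.0, 2.0, 2.0]         # coarse level similar, local structure lost
loss_shifted = detail_distillation_loss(teacher, shifted_student)
loss_flat = detail_distillation_loss(teacher, flat_student)
print(loss_shifted, loss_flat)
```

The shifted student incurs no penalty because only its coarse level differs, while the flat student is penalized for losing the local edge, which mirrors the stated goal of preserving fine structure without over-regularizing global semantics.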

2605.26381 2026-05-27 cs.CV

Multi-Modal Building Inspection via Perceiver IO Fusion of Satellite and Street-Level Imagery

Niels Sombekke, Rob G. J. Wijnhoven, Martin R. Oswald

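AI summary: Proposes a multi-modal classification framework that fuses satellite and street-level imagery through a Perceiver IO architecture on spatial patch tokens from a shared DINOv2 backbone, handles a variable number of street-level views without padding or fixed-size pooling, and jointly predicts roof element and roof material classes; the RGB-M masking strategy and the fusion model are validated on a dataset of 32,135 buildings across ten countries.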

Abstract

We present a multi-modal classification framework that fuses satellite and street-level imagery through a Perceiver IO architecture operating on spatial patch tokens from a shared DINOv2 backbone. The design naturally handles a variable number of street-level views per building without padding or fixed-size pooling, and jointly predicts multi-label roof element and roof material classes. We construct a large-scale dataset of 32,135 buildings (61,672 segments) spanning ten countries, pairing satellite images with up to eight street-level views per segment and evaluating four masking strategies for isolating the target building. We propose an RGB-M masking strategy that appends the building footprint mask as a fourth input channel, providing a soft spatial prior that outperforms hard cropping across both modalities. The Perceiver IO fusion model improves over all other fusion strategies and yields substantial per-class gains for attributes visible from street level (e.g., +11.3 AP for slate, +1.3 AP for dormers), though the satellite-only baseline retains a slight advantage in macro-averaged mAP for classes that are predominantly visible from above. These results establish a scalable, flexible architecture for multi-modal building inspection that can accommodate heterogeneous inputs and multiple output tasks.
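
The RGB-M input described above is simply the RGB image with the footprint mask stacked on as a fourth channel. The sketch below shows that step on nested lists; the function name and the pixel values are illustrative, and how the backbone consumes the extra channel is not covered by the abstract or by this example.

```python
def to_rgbm(rgb_image, footprint_mask):
    """Append the building-footprint mask as a fourth channel.

    rgb_image: H x W x 3 nested lists; footprint_mask: H x W nested lists of 0/1.
    Returns an H x W x 4 nested list; the RGB values are left untouched.
    """
    return [
        [list(pixel) + [float(m)] for pixel, m in zip(row, mask_row)]
        for row, mask_row in zip(rgb_image, footprint_mask)
    ]

# Toy 1 x 2 image: the first pixel lies on the target building, the second does not.
image = [[[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]]
mask = [[1, 0]]
rgbm = to_rgbm(image, mask)
print(rgbm)
```

Unlike hard cropping, the pixels outside the footprint keep their original color values, so surrounding context stays available while the mask channel marks the target. That is the sense in which the prior is soft.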

2605.26380 2026-05-27 cs.CV cs.AI

VisualNeedle: Benchmarking Active Visual Search in Information-Dense Scenes

Jingru Chen, Yiming Liu, Mingtao Chen, Sijie Chen, Richeng Xuan, Liang Yang, Zhichao Hu, Fanyang Lu

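AI summary: Because multimodal large language models can rely on shortcuts rather than genuine visual evidence on fine-grained perception benchmarks, the paper proposes VisualNeedle, which uses a counterfactual crop-black setting to evaluate active visual search in information-dense scenes; the best model reaches only 56.01% accuracy, trailing the 63.00% human accuracy.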

Abstract

Frontier multimodal large language models (MLLMs) have been reported to achieve over 90% accuracy on fine-grained perception benchmarks. However, such scores do not necessarily imply faithful use of visual evidence. Prior studies have identified three shortcuts that inflate benchmark performance. First, linguistic priors and lexical cues in questions often enable models to infer plausible answers without seeing the image. Second, coarse global semantics from the visual encoder can bypass fine-grained local details. Third, in some "think-with-images" benchmarks, corrupting the intermediate images returned by visual tools barely affects the final answer. These findings suggest that higher input resolution or larger question pools alone do not elicit genuine active visual search. To address this, we introduce VisualNeedle, a challenging, information-dense, and fine-grained benchmark for scenes where critical evidence is spatially constrained to minute regions and not discernible at a glance. We further propose a counterfactual crop-black setting, which replaces crops returned by tools with black images of the same size, to test whether tool-enabled performance truly relies on intermediate visual evidence. We evaluate 9 prominent MLLMs across three settings: no-tool, standard tool-enabled, and crop-black. No-tool accuracy stays below 20%, and the best tool-enabled model reaches only 56.01%, still trailing the 63.00% human majority-vote accuracy. These results reveal persistent limitations in fine-grained visual search, while the crop-black ablation confirms that success on VisualNeedle hinges on genuine intermediate visual evidence.
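
The crop-black control is easy to express in code: whatever crop the tool would return is swapped for an all-black image of identical shape. The sketch below represents images as nested lists and uses hypothetical function and setting names; it shows the substitution only, not the benchmark's evaluation harness.

```python
def black_like(crop):
    """Return an all-black image with the same height, width, and channels as `crop`."""
    return [[[0] * len(pixel) for pixel in row] for row in crop]

def tool_return(crop, setting):
    """Image handed back to the model: the real crop, or a black placeholder of equal size."""
    if setting == "crop-black":
        return black_like(crop)
    if setting == "tool":
        return crop
    raise ValueError(f"unknown setting: {setting}")

# Toy 1 x 2 RGB crop (pixel values are made up).
crop = [[[12, 200, 7], [9, 9, 9]]]
blacked = tool_return(crop, "crop-black")
print(blacked)
```

If a model's accuracy barely changes between the `tool` and `crop-black` settings, its answers cannot depend on the crop's contents, which is the shortcut the benchmark is designed to expose.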

2605.26376 2026-05-27 cs.CV cs.AI cs.LG

BioFact-MoE: Biologically Factorized Mixture of Experts for Vision-Language Prognostic Modeling in Hepatocellular Carcinoma

BioFact-MoE:基于生物学因子分解的混合专家模型用于肝细胞癌的视觉-语言预后建模

Junlin Yang, Tian Yu, Nicha C. Dvornek, Yuexi Du, Peiyu Duan, Annabella Shewarega, Lawrence H. Staib, James S. Duncan, Julius Chapiro

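AI summary Proposes the BioFact-MoE framework, which explicitly factorizes liver and tumor factors through a biologically supervised mixture-of-experts model, improving accuracy and biological interpretability in hepatocellular carcinoma prognosis prediction.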

Comments Early accepted at MICCAI 2026

详情
AI中文摘要

肝细胞癌(HCC)具有生物学异质性,由肝功能储备和肿瘤相关肿瘤学因素之间的相互作用塑造;因此,相似的生存结果可能反映根本不同的潜在生物学过程。HCC的预后建模依赖于来自多参数MRI和常规临床实践放射学报告的丰富多模态信息。现有的预后视觉-语言模型(VLM)学习单一的纠缠潜在表示,混合了肝脏和肿瘤相关因素,限制了准确性和生物学可解释性。我们提出BioFact-MoE,一个生物学因子分解的混合专家(MoE)框架,通过残差MoE生存架构中的生物学监督专家显式分解肝脏和肿瘤因素。在N=588名患者的HCC队列(在4,582个3D MRI图像-报告对上预训练)中,BioFact-MoE在所有时间范围内持续优于所有基线的生存预测,实现了12、18和24个月的AUC分别为75.33%、75.85%和73.96%。除了标量风险预测,门控专家权重实现了表型感知的风险分层。通路感知的门控揭示了临床上有意义的治疗相关生存异质性。在保留验证中,肝脏和肿瘤嵌入分别与肝功能标志物和肿瘤负荷标志物显示出选择性关联(p<0.05),无需监督。代码可在https://github.com/jy-639/BioFact-MoE获取。

英文摘要

Hepatocellular carcinoma (HCC) is biologically heterogeneous, shaped by the interplay between hepatic functional reserve and tumor-related oncologic factors; thus, similar survival outcomes may reflect fundamentally different underlying biological processes. Prognostic modeling in HCC is informed by rich multimodal information from multiparametric MRI and radiology reports from routine clinical practice. Existing prognostic vision-language models (VLMs) learn a single entangled latent representation that blends hepatic and tumor-related factors, limiting both accuracy and biological interpretability. We present BioFact-MoE, a biologically factorized Mixture of Experts (MoE) framework that explicitly decomposes liver and tumor factors via biologically supervised experts within a residual MoE survival architecture. On an HCC cohort of N=588 patients (pretrained on 4,582 3D MRI image-report pairs), BioFact-MoE consistently improves survival prediction over all baselines across time horizons, achieving 12-, 18-, and 24-month AUCs of 75.33%, 75.85%, and 73.96%. Beyond scalar risk prediction, gated expert weights enable phenotype-aware risk stratification. Pathway-informed gating uncovers clinically meaningful treatment-associated survival heterogeneity. In held-out validation, hepatic and tumor embeddings show selective associations with liver function and tumor burden markers, respectively (p<0.05), without supervision. The code is available at https://github.com/jy-639/BioFact-MoE.

2605.26373 2026-05-27 cs.LG math.OC stat.ML

Online Learning on Hidden-Convex Losses via Algorithmic Equivalence: Optimal Regret, Geometric Barrier, and Bandit Feedback

通过算法等价性在隐凸损失上的在线学习:最优遗憾、几何障碍与Bandit反馈

Anas Barakat, Andreas Kontogiannis, Vasilis Pollatos, Ioannis Panageas, Antonios Varvitsiotis

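AI summary Using a sharper discrete-time algorithmic equivalence argument, the paper proves that online gradient descent attains the optimal $\mathcal{O}(\sqrt{T})$ regret on hidden-convex losses, clarifies the required geometric condition, and extends the analysis to one-point bandit feedback with an $\mathcal{O}(T^{3/4})$ expected regret bound.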

Comments 43 pages

详情
AI中文摘要

我们研究具有隐凸损失的对抗性在线学习,即经过非线性重参数化后变为凸的非凸损失。Ghai, Lu和Hazan (2022)证明,在几何和光滑性假设下,此类非凸损失上的在线梯度下降(OGD)近似模拟了具有适当正则化器的底层凸损失上的在线镜像下降(OMD),得到$\mathcal{O}(T^{2/3})$遗憾。他们留下了是否可以在隐凸设置中恢复在线凸优化的最优$\Theta(\sqrt{T})$遗憾的开放问题。我们肯定地回答了这个问题。更具体地,通过更尖锐的离散时间算法等价性论证,我们证明在相同假设下OGD达到$\mathcal{O}(\sqrt{T})$遗憾,匹配对抗性在线凸优化的最坏情况最优速率。我们还解决了Ghai, Lu和Hazan (2022)的另一个开放问题,澄清了这种算法等价性所需的几何条件。我们将对角雅可比充分条件替换为必要且充分的Hessian相容性条件,从而扩展了可允许重参数化的类别。我们用下界补充了紧的遗憾界,表明Hessian相容性假设对OGD是必要的;当该条件不成立时,我们构造一个光滑的重参数化和一个对抗性的隐凸损失序列,使得OGD遭受$\Omega(T)$遗憾。最后,我们将分析扩展到单点Bandit反馈,并证明使用球形平滑的Bandit OGD的$\mathcal{O}(T^{3/4})$期望遗憾界,匹配其在凸损失上的经典速率。

英文摘要

We study adversarial online learning with hidden-convex losses, i.e., nonconvex losses that become convex after a nonlinear reparameterization. Ghai, Lu and Hazan (2022) proved that, under geometric and smoothness assumptions, online gradient descent (OGD) on such nonconvex losses approximately simulates online mirror descent (OMD) on the underlying convex losses with a suitable regularizer, yielding $\mathcal{O}(T^{2/3})$ regret. They left open whether the optimal $\Theta(\sqrt{T})$ regret from online convex optimization can be recovered in this hidden-convex setting. We answer this question affirmatively. More specifically, via a sharper discrete-time algorithmic equivalence argument, we prove that OGD achieves $\mathcal{O}(\sqrt{T})$ regret under the same assumptions, matching the optimal worst-case rate for adversarial online convex optimization. We also address another open question of Ghai, Lu and Hazan (2022) by clarifying the geometry required for this algorithmic equivalence. We replace the diagonal-Jacobian sufficient condition with a necessary-and-sufficient Hessian compatibility condition, thereby expanding the class of admissible reparameterizations. We complement our tight regret bound with a lower bound showing that the Hessian compatibility assumption is essential for OGD; when it fails, we construct a smooth reparameterization and an adversarial sequence of hidden-convex losses for which OGD suffers $\Omega(T)$ regret. Finally, we extend our analysis to one-point bandit feedback and prove an $\mathcal{O}(T^{3/4})$ expected regret bound for bandit OGD with spherical smoothing, matching its classical rate on convex losses.

2605.26370 2026-05-27 cs.CV

Joint Instance Segmentation and Geometric Attribute Regression for Roof Structures in Aerial Imagery

航空影像中屋顶结构的联合实例分割与几何属性回归

Luuk Versteeg, Rob G. J. Wijnhoven, Martin R. Oswald

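AI summary Proposes a method that jointly predicts roof instance segmentation masks and three continuous geometric attributes (building height, roof slope, roof azimuth) from a single aerial orthophoto. A conditional azimuth loss and a log-normalized height representation address label noise and skewed distributions. The method reaches high accuracy on a large-scale Dutch dataset and supports reconstructing simplified 3D building models from a single image.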

详情
AI中文摘要

我们提出了一种方法,用于从单张航空正射影像中联合预测实例级屋顶分割掩码以及三个连续几何属性——建筑高度、屋顶坡度和屋顶方位角。我们的方法扩展了Mask R-CNN,增加了一个专门的属性回归分支,并引入了两个关键创新:一个条件方位角损失,抑制了对屋顶平坦段(其中方位角标签固有噪声)的监督;以及一个对数归一化高度表示,解决了建筑高度严重偏斜分布的问题。我们在一个大规模荷兰航空图像数据集上进行训练和评估,该数据集与从3DBAG(一个全国性的基于LiDAR的3D建筑数据集)自动导出的真实值配对。使用DINOv3 ConvNeXt-Base骨干网络,我们的方法在屋顶坡度上实现了约4度的平均绝对误差,方位角为7度,建筑高度为1米,实例分割AP$_{50}$为0.566。预测的每段掩码和属性足以从单张俯视图像重建简化的3D建筑模型(LoD2),仅需在训练时使用昂贵的3D参考数据。

英文摘要

We present a method for jointly predicting instance-level roof segment masks together with three continuous geometric attributes -- building height, roof slope, and roof azimuth -- from a single aerial orthophoto. Our approach extends Mask R-CNN with a dedicated attribute regression branch and introduces two key innovations: a conditional azimuth loss that suppresses supervision for flat roof segments where azimuth labels are inherently noisy, and a log-normalized height representation that addresses the heavily skewed distribution of building heights. We train and evaluate on a large-scale dataset of Dutch aerial images paired with automatically derived ground truth from 3DBAG, a nationwide LiDAR-based 3D building dataset. Using a DINOv3 ConvNeXt-Base backbone, our method achieves a mean absolute error of approximately 4 degrees for roof slope, 7 degrees for azimuth, and 1 meter for building height, with an instance segmentation AP$_{50}$ of 0.566. The predicted per-segment masks and attributes are sufficient to reconstruct simplified 3D building models (LoD2) from a single overhead image, requiring expensive 3D reference data only for training.
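The two ideas named in the abstract, a log-normalized height target and an azimuth loss that skips flat segments, can be sketched as follows. The abstract does not give the exact formulas, so everything specific here is an assumption: the `log1p` normalization, the constant `H_MAX`, the flatness threshold `FLAT_SLOPE_DEG`, the use of mean angular error, and all function names.

```python
import math
from typing import Sequence

H_MAX = 200.0          # assumed normalization constant in metres (not from the paper)
FLAT_SLOPE_DEG = 5.0   # assumed slope below which a roof segment counts as flat


def encode_height(h_m: float) -> float:
    """Map a height in metres to [0, 1] on a log scale, compressing the long tail."""
    return math.log1p(h_m) / math.log1p(H_MAX)


def decode_height(z: float) -> float:
    """Inverse of encode_height."""
    return math.expm1(z * math.log1p(H_MAX))


def angular_error_deg(pred: float, target: float) -> float:
    """Smallest absolute difference between two azimuths, in degrees (0..180)."""
    d = abs(pred - target) % 360.0
    return min(d, 360.0 - d)


def conditional_azimuth_loss(pred_az: Sequence[float],
                             true_az: Sequence[float],
                             true_slope: Sequence[float]) -> float:
    """Mean angular error over non-flat segments; flat segments contribute nothing."""
    errs = [angular_error_deg(p, t)
            for p, t, s in zip(pred_az, true_az, true_slope)
            if s >= FLAT_SLOPE_DEG]
    return sum(errs) / len(errs) if errs else 0.0


# The second segment is nearly flat, so its (noisy) azimuth label is ignored.
loss = conditional_azimuth_loss([90.0, 0.0], [100.0, 180.0], [30.0, 1.0])
```

The wrap-around in `angular_error_deg` matters because azimuths of 350 and 10 degrees are only 20 degrees apart.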

2605.26365 2026-05-27 cs.CL

Cultural Value Alignment Via Latent Activation Steering in Large Language Models

大型语言模型中通过潜在激活引导的文化价值对齐

Trung Duc Anh Dang, Sarah Masud

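AI summary Proposes a framework based on scenario-based behavioral probing and latent activation steering for evaluating and intervening on the cultural values of LLMs, and finds that cultural values are encoded as coupled structures, which limits precise alignment.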

Comments ACL 2026 Student Research Workshop (Non-Archival Track)

详情
AI中文摘要

大型语言模型(LLMs)通常表现出同质化的文化视角。虽然世界价值观调查(WVS)为映射人类价值观提供了黄金标准,但传统的直接提示LLMs回答WVS问题往往无法触及模型的潜在文化深度,导致安全对齐的拒绝或中性回应。在此,我们提出一个通用的文化评估与干预框架,从抽象查询过渡到基于场景的行为探测。通过提取300个情境困境中的隐式token概率,我们绕过表面层次的对齐,映射LLMs文化价值的潜在坐标。我们进一步引入激活引导,在前向传播过程中无需重新训练即可改变这些内部对齐。在多个LLMs上,我们发现适应性存在显著差异,并揭示了一个一致的现象——潜在纠缠,即沿一个文化维度的干预会引发沿另一维度的偏移。这些结果表明,文化价值被编码为耦合结构,限制了精确对齐。本工作建立了一个计算高效的文化引导框架,突出了在LLMs中导航全球价值观时的结构复杂性。

英文摘要

Large Language Models (LLMs) often exhibit homogenized cultural perspectives. While the World Values Survey (WVS) provides a gold standard for mapping human values, traditional direct prompting of LLMs on WVS often fails to access the model's latent cultural depth, leading to safety-aligned refusals or neutral responses. Here, we propose a generalizable framework for cultural evaluation and intervention that transitions from abstract queries to scenario-based behavioral probing. By extracting implicit token probabilities across 300 situational dilemmas, we bypass surface-level alignment to map the latent coordinates of LLMs' cultural values. We further introduce activation steering to shift these internal alignments during the forward pass without retraining. Across multiple LLMs, we find substantial variation in adaptability and uncover a consistent phenomenon of latent entanglement, where interventions along one cultural dimension induce shifts along another. These results suggest that cultural values are encoded as coupled structures, limiting precise alignment. This work establishes a computationally efficient framework for cultural steering, highlighting the structural complexities when navigating global values with LLMs.
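Activation steering, as mentioned in the abstract, is commonly implemented by adding a scaled direction vector to a hidden state during the forward pass. The sketch below shows that generic recipe on plain Python lists. It is not the authors' implementation: the mean-difference construction of the direction, the unit normalization, and the names `mean_difference_direction` and `steer` are assumptions.

```python
import math
from typing import List, Sequence


def mean_difference_direction(pos: Sequence[Sequence[float]],
                              neg: Sequence[Sequence[float]]) -> List[float]:
    """Assumed recipe: difference between mean activations of two contrasting prompt sets."""
    dim = len(pos[0])
    mean_pos = [sum(a[d] for a in pos) / len(pos) for d in range(dim)]
    mean_neg = [sum(a[d] for a in neg) / len(neg) for d in range(dim)]
    return [p - n for p, n in zip(mean_pos, mean_neg)]


def steer(hidden: Sequence[float], direction: Sequence[float], alpha: float) -> List[float]:
    """Return hidden + alpha * unit(direction); no weights are retrained."""
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0.0:
        raise ValueError("steering direction must be non-zero")
    return [h + alpha * d / norm for h, d in zip(hidden, direction)]


# Toy 2-dimensional activations standing in for one layer's hidden state.
direction = mean_difference_direction([[1.0, 1.0], [3.0, 1.0]], [[0.0, 1.0], [0.0, 1.0]])
shifted = steer([0.5, 0.5], direction, alpha=4.0)
```

Setting `alpha` to zero leaves the hidden state unchanged, so the intervention strength can be swept continuously at inference time.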

2605.26362 2026-05-27 cs.CL cs.AI

Why LLMs Hallucinate on Structured Knowledge: A Mechanistic Analysis of Reasoning over Linearized Representations

为什么LLMs会在结构化知识上产生幻觉:对线性化表示推理的机制分析

Shanghao Li, Jinda Han, Yibo Wang, Yuanjie Zhu, Zihe Song, Langzhou He, Kenan Kamel A Alghythee, Philip S. Yu

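AI summary Through mechanistic analysis, the paper finds that LLMs hallucinate when reasoning over structured knowledge because attention over-concentrates on shortcut-like structural cues and feed-forward layers fail to ground the provided knowledge semantically, so the model falls back on parametric memory.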

Comments To appear in Proceedings of ACL 2026

详情
AI中文摘要

在许多推理任务中,大型语言模型(LLMs)依赖于结构化外部知识,如图和表格,这些知识通常被线性化为连续的令牌表示。然而,即使有足够的知识可用,LLMs仍然可能产生幻觉输出,这种失败背后的潜在机制仍然知之甚少。我们研究了这些机制,发现幻觉源于系统性的内部动态而非随机噪声。首先,注意力不成比例地集中在类似捷径的结构线索上,而不是分布在完整的上下文中。其次,前馈表示未能将提供的知识接地,导致模型回归到参数记忆。此外,我们的结果表明,幻觉始终与前馈层中的语义接地失败相关,而注意力分配表现出更大的任务依赖性。最后,我们展示了这些机制模式从单跳图推广到多跳和表格设置,从而能够在结构化知识格式中有效检测幻觉。

英文摘要

In many reasoning tasks, large language models (LLMs) rely on structured external knowledge, such as graphs and tables, which is typically linearized into sequential token representations. However, even when sufficient knowledge is available, LLMs can still produce hallucinated outputs, and the underlying mechanisms behind such failures remain poorly understood. We investigate these mechanisms and find that hallucinations arise from systematic internal dynamics rather than random noise. First, attention concentrates disproportionately on shortcut-like structural cues rather than distributing across the full context. Second, feed-forward representations fail to ground the provided knowledge, causing the model to revert to parametric memory. Moreover, our results indicate that hallucination is consistently associated with failures in semantic grounding within feed-forward layers, while attention allocation exhibits greater task-dependent variability. Finally, we show that these mechanistic patterns generalize beyond single-hop graphs to multi-hop and tabular settings, enabling effective hallucination detection across structured knowledge formats.

2605.26356 2026-05-27 cs.CL

In-Context Optimization for Retrieval-Augmented Generation: A Gradient-Descent Perspective

检索增强生成的上下文优化:梯度下降视角

Mingchen Li, Jiatan Huang, Chuxu Zhang, Liang Zhao, Hong Yu

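AI summary Studies retrieval-augmented generation (RAG) as an in-context optimization process from a gradient-descent perspective and proposes a lightweight forward-only update that improves the generator's use of retrieved evidence while the LLM and retriever stay frozen.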

详情
AI中文摘要

上下文学习最近被与线性自注意力模型中的隐式梯度下降联系起来,表明上下文可以诱导前向传递更新。检索增强生成(RAG)也依赖于上下文,但检索到的文档通常被视为静态证据而非适应信号。我们将RAG研究为一种上下文优化过程。首先,我们展示一个线性自注意力层可以在统一的线性化RAG目标上实现一步梯度下降,该目标涵盖基于投影和基于点积的检索接口。这给出了检索增强预测与上下文优化一致的一个精确区域。我们使用这一结果并非作为LLM计算的字面模型,而是作为调整查询与检索证据之间交互的指南。然后,我们测试这种对应关系的边界:在受控的线性扩展下保持稳定,但在非线性架构下变得依赖于特征分布。最后,我们将这一观点转化为一种针对冻结RAG LLM的轻量级方法。该方法保持检索器和骨干网络固定,并预测一个上下文条件更新到生成器侧的证据使用接口。在七个QA基准、两个检索器和两个冻结LLM骨干网络上,这种仅前向的更新改进了共享接口基线,迁移到未见任务,并以更低的每查询成本接近测试时的梯度适应。

英文摘要

In-context learning has recently been linked to implicit gradient descent in linear self-attention models, suggesting that context can induce a forward-pass update. Retrieval-augmented generation (RAG) also relies on context, but retrieved documents are usually treated as static evidence rather than signals for adaptation. We study RAG as an in-context optimization process. First, we show that one linear self-attention layer can implement one gradient-descent step on a unified linearized RAG objective covering both projection-based and dot-product retrieval interfaces. This gives an exact regime where retrieval-augmented prediction and in-context optimization coincide. We use this result not as a literal model of LLM computation, but as a guide for adapting the interaction between queries and retrieved evidence. We then test the boundary of this correspondence: it remains stable under controlled linear extensions, but becomes feature-distribution dependent under nonlinear architectures. Finally, we turn this view into a lightweight method for frozen RAG LLMs. The method keeps the retriever and backbone fixed, and predicts a context-conditioned update to a generator-side evidence-use interface. Across seven QA benchmarks, two retrievers, and two frozen LLM backbones, this forward-only update improves a shared-interface baseline, transfers to held-out tasks, and approaches test-time gradient adaptation at much lower per-query cost.
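The link between linear self-attention and one gradient-descent step can be checked numerically on a toy problem. The sketch below uses plain scalar-output linear regression starting from zero weights, which is the standard setting in which this equivalence is usually stated; it is an assumption that this simplified objective stands in for the paper's unified linearized RAG objective, and the function names are illustrative.

```python
from typing import Sequence


def gd_step_prediction(xs: Sequence[Sequence[float]], ys: Sequence[float],
                       x_query: Sequence[float], lr: float) -> float:
    """Predict for x_query after one gradient step from w = 0 on
    L(w) = 1/(2N) * sum_i (w . x_i - y_i)^2."""
    n, dim = len(xs), len(x_query)
    # At w = 0 the gradient is -(1/N) * sum_i y_i * x_i.
    grad = [-sum(y * x[d] for x, y in zip(xs, ys)) / n for d in range(dim)]
    w = [-lr * g for g in grad]
    return sum(wd * qd for wd, qd in zip(w, x_query))


def linear_attention_prediction(xs: Sequence[Sequence[float]], ys: Sequence[float],
                                x_query: Sequence[float], lr: float) -> float:
    """Softmax-free attention: keys x_i, values y_i, query x_query, scaled by lr / N."""
    n = len(xs)
    scores = [sum(a * b for a, b in zip(x, x_query)) for x in xs]
    return (lr / n) * sum(y * s for y, s in zip(ys, scores))


context_x = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
context_y = [1.0, -2.0, 0.5]
query = [0.5, -1.0]
gd = gd_step_prediction(context_x, context_y, query, lr=0.1)
attn = linear_attention_prediction(context_x, context_y, query, lr=0.1)
```

Both functions compute `(lr / N) * sum_i y_i * (x_i . x_query)`, only in a different order, so `gd` and `attn` agree up to floating-point rounding.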

2605.26355 2026-05-27 cs.LG cs.CL eess.SP

Energy-Gated Attention and Wavelet Positional Encoding: Complementary Inductive Biases for Transformer Attention

能量门控注意力与小波位置编码:Transformer注意力的互补归纳偏置

Athanasios Zeris

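AI summary Standard attention lacks two complementary inductive biases, energy salience and scale-selective locality. The paper proposes Energy-Gated Attention (EGA) and Morlet Positional Encoding (MoPE), whose combination yields a superadditive improvement on character-level language modeling.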

Comments 10 pages, 1 figure, 3 tables. Part 2 of a five-paper series on spectral methods in transformer attention. Code: https://github.com/AthanasiosZeris/energy-gated-attention

详情
AI中文摘要

标准Transformer注意力计算成对标记相似性,但将所有标记视为同等显著、所有位置视为同等局部,忽略了输入的信息结构。我们识别出标准注意力缺乏两种互补归纳偏置:能量显著性(哪些标记集中了信息能量,通过端到端学习而不需要显式频率分解)和尺度选择性局部性(在每个频率上位置影响的范围,通过Morlet小波编码实现)。我们通过两个简单组件解决这两个问题。能量门控注意力(EGA)通过键标记嵌入的学习能量估计(通过单个线性投影计算)来门控值聚合;它选择关注什么。莫雷特位置编码(MoPE)用学习的高斯窗口小波替换固定的正弦编码,使联合位置-频率定位适应语料库;它指定注意力在每个尺度上操作的位置。在TinyShakespeare上,单独EGA相比标准注意力实现+0.092验证损失改进(相比Phase 1-3基线+0.103);单独MoPE为-0.032(作为独立编码低于基线);但它们的组合实现+0.119——超过各部分之和。这种超加性在两个独立训练运行中观察到,是核心实证发现:显著性和局部性是互补归纳偏置,各自填补对方无法单独填补的空白。消融实验证实,结构化谱先验(Morlet小波门控、尺度初始化头、固定正弦PE)始终不如其无约束学习对应物,而互补学习组件交互产生超加性。所有实验都在小规模(≤6M参数、字符级基准、单种子)进行;更大规模的多种子验证是未来工作最重要的方向。

英文摘要

Standard transformer attention computes pairwise token similarity but treats all tokens as equally salient and all positions as equally local, regardless of the informational structure of the input. We identify two complementary inductive biases that standard attention lacks: energy salience (which tokens concentrate informational energy, learned end-to-end without explicit frequency decomposition) and scale-selective locality (how far positional influence extends at each frequency, implemented via Morlet wavelet encoding). We address both with two simple components. Energy-Gated Attention (EGA) gates value aggregation by a learned energy estimate of key token embeddings, computed via a single linear projection; it selects what to attend to. Morlet Positional Encoding (MoPE) replaces fixed sinusoidal encodings with learned Gaussian-windowed wavelets that adapt the joint position-frequency localization to the corpus; it specifies where attention operates at each scale. On TinyShakespeare, EGA alone achieves +0.092 validation loss improvement over standard attention (+0.103 over Phase 1-3 baseline); MoPE alone is -0.032 (below baseline as a standalone encoding); but their combination achieves +0.119 -- more than the sum of parts. This superadditivity, observed across two independent training runs, is the central empirical finding: salience and locality are complementary inductive biases, each addressing a gap the other cannot fill alone. Ablations confirm that structured spectral priors (Morlet wavelet gates, scale-initialized heads, fixed sinusoidal PE) consistently underperform their unconstrained learned counterparts, while complementary learned components interact superadditively. All experiments are at small scale (<=6M parameters, character-level benchmarks, single seed); larger-scale multi-seed validation is the most important direction for future work.
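A rough sketch of the gating idea in EGA, for a single query and a single head: each key gets a scalar gate from one linear projection, and that gate scales the key's contribution to the value aggregation. The abstract does not specify the gate nonlinearity or whether weights are renormalized, so the sigmoid, the lack of renormalization, and the names `energy_gated_attention`, `gate_w`, and `gate_b` are assumptions; the released repository is the reference for the actual design.

```python
import math
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax(scores: Sequence[float]) -> List[float]:
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z: float) -> float:
    """Numerically safe logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def energy_gated_attention(query: Sequence[float], keys: Sequence[Sequence[float]],
                           values: Sequence[Sequence[float]],
                           gate_w: Sequence[float], gate_b: float) -> List[float]:
    """Scaled dot-product attention whose value aggregation is gated per key token."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    # Energy estimate: one linear projection of each key, squashed to (0, 1) (sigmoid assumed).
    gates = [sigmoid(dot(gate_w, k) + gate_b) for k in keys]
    dim = len(values[0])
    return [sum(w * g * v[d] for w, g, v in zip(weights, gates, values)) for d in range(dim)]


keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[2.0, 0.0], [0.0, 4.0]]
neutral = energy_gated_attention([0.0, 0.0], keys, values, gate_w=[0.0, 0.0], gate_b=0.0)
selective = energy_gated_attention([0.0, 0.0], keys, values, gate_w=[100.0, -100.0], gate_b=0.0)
```

With a zero query the softmax weights are uniform, so any difference between `neutral` and `selective` comes from the gate alone: the second call keeps the first token's value and suppresses the second.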

2605.26353 2026-05-27 cs.CV cs.AI cs.LG

Personalized Generative Models for Contextual Debiasing

用于上下文去偏的个性化生成模型

Xinran Liang, Esin Tureci, Prachi Sinha, Ye Zhu, Vikram V. Ramaswamy, Olga Russakovsky

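AI summary Proposes DecoupleGen, which uses personalized text-to-image diffusion models to generate images with rare contexts as training augmentation to mitigate contextual bias in visual recognition.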

Comments CVPR 2026 Workshop on Synthetic Data for Computer Vision and Generative Models for Computer Vision. Code available at https://github.com/princetonvisualai/DecoupleGen

详情
AI中文摘要

不同的视觉模式在世界中出现的频率不同:例如,沙滩球出现在沙滩上比出现在道路上更常见。这些统计数据反映在视觉数据集中,因此训练好的模型更容易在常见场景中识别物体。然而,在道路上识别沙滩球可能比在沙滩上识别更重要。我们研究如何缓解这种差异。由于在现实世界中收集不常见的图像可能很困难,我们探索生成具有较少频繁上下文的图像是否可以作为有效的训练增强。一个关键挑战是引导生成保持在原始数据集分布附近,同时创建具有不常见上下文的多样化图像。我们引入了DecoupleGen方法,该方法个性化文本到图像扩散模型,以促进罕见上下文图像的连贯合成,同时保留原始视觉细节。生成的图像包含语义上有意义的内容,并在视觉上与原始数据集保持一致。我们进一步应用验证约束以确保增强数据的相关性。我们在复杂场景数据集上的物体分类和识别任务中评估了我们的方法。实验表明,我们的方法比先前的方法有一致的改进,并且我们的分析确定了这些改进背后的因素。

英文摘要

Different visual patterns appear with different frequencies in the world: e.g., beach balls appear on sand more often than they do on a road. These statistics are reflected in vision datasets, and as a result trained models more easily recognize objects in common scenarios. However, recognizing a beach ball on a road may arguably be even more important than recognizing it on sand. We study how to mitigate this discrepancy. Since collecting uncommon images in the real world may be difficult, we explore whether generating images with less frequent contexts can serve as effective training augmentation. A key challenge is guiding generations to remain close to the original dataset distribution while creating diverse images with uncommon contexts. We introduce Decoupling Contextual Patterns with Generations (DecoupleGen), a method that personalizes text-to-image diffusion models to facilitate coherent synthesis of images with rare contexts while preserving original visual details. The generated images contain semantically meaningful content and remain visually aligned with the original datasets. We further apply verification constraints to ensure relevance of the augmented data. We evaluate our approach on object classification and recognition tasks on complex scene datasets. Our experiments demonstrate consistent improvements over previous approaches, and our analyses identify factors underlying these improvements.

2605.26352 2026-05-27 cs.CL

RICE-PO: Turning Retrieval Interactions into Credit Signals for Reasoning Agents

RICE-PO:将检索交互转化为推理智能体的信用信号

Mingchen Li, Hansi Zeng, Zhuo Qian, Jiatan Huang, Hamed Zamani, Hong Yu

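AI summary Proposes the RICE-PO framework, which converts retrieval interactions into localized learning signals to address credit assignment for reasoning steps when training reasoning-based retrieval agents, and outperforms baselines on BRIGHT and BEIR.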

详情
AI中文摘要

检索正从一次性匹配向交互式推理转变,语言智能体迭代检查证据、重新表述查询并再次搜索。训练此类智能体面临信用分配挑战:可执行动作(如查询或摘要)可由检索器直接评估,而潜在推理步骤不可直接观察,仅影响未来的可执行动作。这种不对称使得结果级奖励分配不可靠,因为相同的最终奖励可能奖励那些实际上并未促成检索成功的推理步骤。我们提出RICE-PO,一种无批评策略优化框架,将检索交互转化为局部学习信号。RICE-PO选择高不确定性的可执行动作作为锚点,使用检索指标评估局部反事实分支,并仅在推理到动作的影响强且未来残差效应稳定时,将信用传播给潜在推理步骤。在BRIGHT和BEIR上,RICE-PO在相同检索器设置下始终优于基于提示的智能体和基于组的强化学习基线。这些结果表明,智能体-环境交互结构本身可以为训练基于推理的检索智能体提供有用的监督。

英文摘要

Retrieval is increasingly moving from one-shot matching toward interactive reasoning, where language agents iteratively inspect evidence, reformulate queries, and search again. Training such agents raises a credit-assignment challenge: executable actions such as queries or summaries can be directly evaluated by the retriever, while latent reasoning steps are not directly observable and only affect future executable actions. This asymmetry makes outcome-level reward assignment unreliable, as the same final reward may credit reasoning steps that did not actually shape retrieval success. We propose RICE-PO, a critic-free policy optimization framework that converts retrieval interactions into localized learning signals. RICE-PO selects high-uncertainty executable actions as anchors, evaluates local counterfactual branches using retrieval metrics, and propagates credit to latent reasoning steps only when reasoning-to-action influence is strong and future residual effects are stable. On BRIGHT and BEIR, RICE-PO consistently outperforms prompt-based agents and group-based RL baselines under the same retriever setting. These results show that the structure of agent-environment interaction itself can provide useful supervision for training reasoning-based retrieval agents.

2605.26350 2026-05-27 cs.LG cs.AI

When Correct Demonstrations Hurt: Rethinking the Role of Exemplars in In-Context Learning

当正确示例有害时:重新思考示例在上下文学习中的作用

Chenghao Qiu, Chunli Peng, Yufeng Yang, Kuan-Hao Huang, Yi Zhou

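AI summary By introducing task-preserving perturbations, the paper reveals the counterintuitive phenomenon that correct demonstrations are not necessarily helpful and can even reduce in-context learning accuracy, and proposes the notion of contextual evidence shift to explain the gap between correctness and utility.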

详情
AI中文摘要

上下文学习(ICL)通常被直觉所驱动,即示例之所以有帮助是因为它们提供了正确的输入-输出对。然而,我们揭示了一个反直觉的现象:正确性并不能保证示例的效用,一些正确的示例甚至可能降低ICL的准确性。为了研究这种正确性-效用差距,我们引入了任务保持扰动,其中仅改变示例输入,而该示例仍然是同一任务的正确实例。具体来说,每个扰动后的示例被赋予由任务映射诱导的目标。该框架涵盖了标签更新扰动(其中任务相关语义发生变化且目标被重新计算)和更严格的目标保持扰动(其中原始目标仍然有效)。我们将由此产生的失败模式形式化为上下文证据转移:任务保持扰动可以改变模型用于上下文推理的有效证据混合,从而将示例正确性与示例效用分离。在情感分类、逻辑推理和数学应用题中,我们发现任务保持扰动的示例会显著降低ICL性能,尤其是对于较小的模型、较难的任务和较高的扰动比例。我们的结果表明,鲁棒的ICL不仅需要评估示例是否正确,还需要评估它们如何影响上下文推理。代码可在 https://github.com/Chenghao-Qiu/Task-Preserving-ICL 获取。

英文摘要

In-context learning (ICL) is often motivated by the intuition that demonstrations help because they provide correct input-output examples. However, we reveal a counterintuitive phenomenon: correctness does not guarantee exemplar utility, and some correct demonstrations can even reduce ICL accuracy. To study this correctness-utility gap, we introduce task-preserving perturbations, where only the exemplar input is changed, while the example remains a correct instance of the same task. Concretely, each perturbed exemplar is assigned the target induced by the task mapping. This framework covers both label-updating perturbations, where task-relevant semantics change and targets are recomputed, and stricter target-preserving perturbations, where the original target remains valid. We formalize the resulting failure mode as contextual evidence shift: task-preserving perturbations can change the effective mixture of evidence used by the model for contextual inference, thereby separating exemplar correctness from exemplar utility. Across sentiment classification, logical reasoning, and math word problems, we find that task-preserving perturbed demonstrations can substantially degrade ICL performance, especially for smaller models, harder tasks, and higher perturbation ratios. Our results show that robust ICL requires evaluating not only whether demonstrations are correct, but also how they influence contextual inference. Code is available at https://github.com/Chenghao-Qiu/Task-Preserving-ICL.

2605.26349 2026-05-27 cs.RO

Closing the Loop in Teleoperation: Episode-Level Data Quality Assessment and Feedback for High-Quality Demonstration Collection

在遥操作中闭环:面向高质量演示收集的片段级数据质量评估与反馈

Gokul Narayanan, Yash Shahapurkar, Melih Erdogan, Brian Zhu, Eugen Solowjow

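AI summary Proposes a data quality assessment and feedback framework that provides immediate post-episode feedback based on semantic task progress and robot telemetry, helping novice operators improve demonstration quality.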

详情
AI中文摘要

工业自动化正处于关键时刻,物理AI正推动从刚性、手工设计的自动化系统向更灵活、自适应的系统转变。这一转变产生了对大规模、真实世界机器人演示数据的需求,使得遥操作成为越来越重要的数据收集机制。然而,在实践中,高质量的遥操作演示仍然难以获得,因为新手操作员经常产生任务成功但下游使用次优的片段,原因包括低效运动、重复修正或接近机器人关节极限操作。我们提出一个数据质量评估与反馈(DQAF)框架,通过提供基于语义任务进度和机器人遥测的即时后片段反馈,在遥操作中实现闭环。该框架提取质量相关信号,如子任务进度、运动平滑度、停顿、运动学极限,并将其转化为结构化质量评估和可操作的自然语言反馈。与二元成功或失败反馈不同,所提系统解释了片段为何次优,并突出显示下次试验中需要纠正的具体行为。我们通过诊断验证研究和试点用户研究评估该框架。在验证研究中,系统在数据集整理过程中与人类评审员进行比较,产生拒绝原因和可操作的改进反馈。在涉及三个新手操作员的两项操作任务的试点研究中,接收系统即时自动后片段反馈的操作员比未接收的改进更快,更早产生更高质量的演示。

英文摘要

Industrial automation is at a pivotal moment, as Physical AI is driving a transition from rigid, hand-engineered automation systems toward more flexible and adaptive systems. This shift has created a growing demand for large-scale, real-world robot demonstration data, making teleoperation an increasingly important mechanism for data collection. However, high-quality teleoperated demonstrations remain difficult to obtain in practice, as novice operators often produce episodes that are task-successful but suboptimal for downstream use due to inefficient motion, repeated corrections, or operation near robot joint limits. We present a Data Quality Assessment and Feedback (DQAF) framework that closes the loop in teleoperation by providing immediate post-episode feedback grounded in semantic task progress and robot telemetry. The framework extracts quality-relevant signals, such as sub-task progress, motion smoothness, stalls, and kinematic limits, and converts them into structured quality assessments and actionable natural-language feedback. Unlike binary success or failure feedback, the proposed system explains why an episode is suboptimal and highlights specific behaviors to correct in the next trial. We evaluate the framework through a diagnostic validation study and a pilot user study. In the validation study, the system is compared with a human reviewer during dataset curation, producing rejection reasons and actionable feedback for improvement. In the pilot study with three novice operators across two manipulation tasks, the operator who received the system's immediate, automated post-episode feedback improved faster than those who did not, producing higher-quality demonstrations sooner.

2605.26348 2026-05-27 cs.RO

RCSP: Risk-Sensitive Conjectural Scenario Planning for Safe Dynamic Robot Navigation

RCSP: 面向安全动态机器人导航的风险敏感推测性场景规划

Zhengye Han, Quanyan Zhu

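AI summary Proposes Risk-Sensitive Conjectural Scenario Planning (RCSP), which combines lightweight belief maintenance, sampling of future interactions, and penalties on high-risk tails with a local safety check to address the predictive near-miss commitment problem for mobile robots among dynamic obstacles. Simulations indicate improved safety and path quality.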

详情
AI中文摘要

移动机器人在碰撞之前就可能失败:当前安全的速度可能使机器人陷入即将被移动障碍物关闭的通道。我们研究了这种预测性近碰撞承诺问题,并提出了风险敏感推测性场景规划(RCSP),这是一个规划层,它根据合理的短视障碍物未来对候选命令进行评估。RCSP维护一个关于局部运动推测的轻量级信念,采样未来交互,惩罚高风险尾部,并通过局部安全检查执行。在受控的MuJoCo瓶颈任务中,RCSP规划器无碰撞地到达目标,并且与非自适应预测器相比,提供了更高的次要安全性和路径质量点估计,但增加了延迟。在ROS2/Gazebo中,将局部安全层添加到标准Nav2堆栈可减少动态近碰撞失败。在官方DynaBARN/Jackal迁移中,调整后的DWA和TEB在严格的基准成功率上仍然更强,揭示了该方法的边界。这些仿真结果将RCSP定位为一个预测风险模块,在动态瓶颈机制中补充现有的导航堆栈。

英文摘要

Mobile robots can fail before they collide: a velocity that is safe now may commit the robot to a passage that moving obstacles will soon close. We study this predictive near-miss commitment problem and propose Risk-Sensitive Conjectural Scenario Planning (RCSP), a planning layer that evaluates candidate commands against plausible short-horizon obstacle futures. RCSP maintains a lightweight belief over local motion conjectures, samples future interactions, penalizes high-risk tails, and executes through a local safety check. In controlled MuJoCo bottleneck tasks, the RCSP planner reaches the goal without collisions and yields higher secondary safety and path-quality point estimates than a non-adaptive predictor, with additional latency. In ROS2/Gazebo, adding the local safety layer to a standard Nav2 stack reduces dynamic near-miss failures. On official DynaBARN/Jackal transfer, tuned DWA and TEB remain stronger on strict benchmark success, revealing the boundary of the approach. These simulation results position RCSP as a predictive-risk module that complements existing navigation stacks in dynamic bottleneck regimes.
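One way to "penalize high-risk tails" when scoring candidate commands against sampled obstacle futures is to add a conditional value-at-risk (CVaR) term to the mean cost. The abstract does not say which risk measure RCSP uses, so CVaR is an assumption here, as are the tail fraction, the risk weight, and the names `cvar`, `score_command`, and `pick_command`. Belief maintenance and the local safety check are not modelled.

```python
import math
from typing import Dict, Sequence


def cvar(costs: Sequence[float], tail_fraction: float) -> float:
    """Mean of the worst ceil(tail_fraction * n) sampled costs (at least one sample)."""
    k = max(1, math.ceil(tail_fraction * len(costs)))
    worst = sorted(costs, reverse=True)[:k]
    return sum(worst) / k


def score_command(scenario_costs: Sequence[float],
                  tail_fraction: float = 0.25, risk_weight: float = 1.0) -> float:
    """Expected cost plus a penalty on the high-cost tail; lower is better."""
    mean = sum(scenario_costs) / len(scenario_costs)
    return mean + risk_weight * cvar(scenario_costs, tail_fraction)


def pick_command(candidates: Dict[str, Sequence[float]]) -> str:
    """Choose the candidate command with the lowest risk-sensitive score."""
    return min(candidates, key=lambda name: score_command(candidates[name]))


# Each list holds one cost per sampled obstacle future (made-up illustrative numbers).
candidates = {
    "enter_passage": [0.0, 0.0, 0.0, 8.0],   # usually free, occasionally a near-miss
    "wait":          [3.0, 3.0, 3.0, 3.0],   # always slower, never risky
}
choice = pick_command(candidates)
```

A purely mean-based planner would prefer `enter_passage` (mean 2.0 versus 3.0); the tail term is what flips the choice to `wait`.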

2605.26346 2026-05-27 cs.CL

The Daily Dose: Workflow-Integrated Large Language Model Automation for Clinical Summarization and Trial Identification in Radiation Oncology

每日剂量:工作流集成的大型语言模型自动化在放射肿瘤学中的临床总结和试验识别

Jason Holmes, Federico Mastroleo, Mariana Borras-Osorio, Srinivas Seetamsetty, Satomi Shiraishi, Mirek Fatyga, Judy C. Boughey, Cornelius A. Thiels, William G. Breen, Daniel J. Ma, Daniel K. Ebner, David M. Routman, Brady S. Laughlin, Carlos E. Vargas, Samir H. Patel, Sujay A. Vora, Nadia N. Laack, Andrew Y. K. Foong, Wei Liu, Mark R. Waddle

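AI summary Introduces The Daily Dose, an LLM-driven system for automated clinical summarization and clinical-trial identification that is integrated into routine radiation oncology practice. A mixed-methods evaluation of usability, satisfaction, and perceived usefulness shows high usage and satisfaction.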

Comments 28 pages, 4 figures, 1 table

详情
AI中文摘要

目的:描述“每日剂量”(TDD)的设计和早期临床评估,这是一个由LLM驱动的自动化临床总结和临床试验识别系统,集成到常规放射肿瘤学实践中。设计:在系统部署1个月后,采用横断面匿名临床医生调查进行混合方法评估。暴露:每日自动生成医生特定的电子邮件摘要,使用RadOnc-GPT生成,包括患者日程、从电子健康记录中提取的简洁临床状态摘要,以及自动识别新就诊或咨询就诊的潜在相关临床试验。主要结果和指标:主要结果包括自我报告的可用性、满意度、感知有用性、对工作流程的感知影响、节省的时间以及继续使用的意愿。使用Cronbach's α评估内部一致性信度。结果:在55名受访者中,52名(94.5%)在放射肿瘤学领域工作,38名(69.1%)是主治医师。大多数参与者(83.6%)报告每天或每周多次使用TDD。平均(SD)得分为:可用性和满意度3.89(1.04),感知有用性3.43(1.24),影响和未来使用3.80(1.17)(5点李克特量表)。总体满意度与感知时间节省呈正相关(p < .001)。参与者报告的时间节省不一,27%的人估计每天节省≥10分钟。问卷表现出极好的内部一致性(总体Cronbach's α = 0.97)。

英文摘要

Objective: To describe the design and early clinical evaluation of The Daily Dose (TDD), an LLM-driven, automated clinical summarization and clinical-trial identification system integrated into routine radiation oncology practice. Design: Mixed-methods evaluation using a cross-sectional, anonymous clinician survey administered after 1 month of system deployment. Exposure: Daily automated delivery of physician-specific email summaries generated using RadOnc-GPT, including patient schedules, concise EHR-derived clinical-status summaries, and automated identification of potentially relevant clinical trials for new or consult visits. Main Outcomes and Measures: Primary outcomes included self-reported usability, satisfaction, perceived usefulness, perceived impact on workflow, time savings, and intention for continued use. Internal consistency reliability was assessed using Cronbach's $\alpha$. Results: Among 55 respondents, 52 (94.5%) worked in radiation oncology, and 38 (69.1%) were attending physicians. Most participants (83.6%) reported using TDD daily or several times per week. Mean (SD) scores were 3.89 (1.04) for usability and satisfaction, 3.43 (1.24) for perceived usefulness, and 3.80 (1.17) for impact and future use (5-point Likert scale). Overall satisfaction was positively associated with perceived time savings ($p < .001$). Participants reported variable time savings, with 27% estimating $\geq 10$ minutes saved per day. The questionnaire demonstrated excellent internal consistency (overall Cronbach's $\alpha$ = 0.97).
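For readers unfamiliar with the reliability statistic reported above, Cronbach's $\alpha$ for $k$ questionnaire items is $\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_i s_i^2}{s_T^2}\right)$, where $s_i^2$ is the variance of item $i$ and $s_T^2$ is the variance of respondents' total scores. The sketch below computes it with sample variances; the response matrices are made-up toy data, not the survey's, and the function name is illustrative.

```python
from statistics import variance
from typing import Sequence


def cronbach_alpha(item_scores: Sequence[Sequence[float]]) -> float:
    """item_scores[i][j] is respondent i's score on item j (e.g. a 5-point Likert rating)."""
    k = len(item_scores[0])
    if k < 2:
        raise ValueError("Cronbach's alpha needs at least two items")
    columns = list(zip(*item_scores))                      # one tuple of scores per item
    item_var_sum = sum(variance(col) for col in columns)   # sum of per-item sample variances
    total_var = variance([sum(row) for row in item_scores])  # variance of total scores
    return k / (k - 1) * (1 - item_var_sum / total_var)


# Two items that move together perfectly give alpha = 1.
consistent = cronbach_alpha([[1, 1], [2, 2], [3, 3]])
# Items that partly disagree give a lower value.
mixed = cronbach_alpha([[1, 2], [2, 1], [3, 3]])
```

Values close to 1, such as the 0.97 reported in the abstract, indicate that the items vary together across respondents.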

2605.26343 2026-05-27 cs.LG

MechRL: Reinforcement Learning Agents Perform Circuit Discovery for Mechanistic Interpretability

MechRL:强化学习智能体进行电路发现以实现机械可解释性

Barsat Khadka

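AI summary Casts circuit discovery as a reinforcement-learning problem: a PPO policy acts over the 144 attention heads of GPT-2 small with zero-ablation and a contrastive reward, recovers the canonical circuits on both training tasks and a held-out task, and supports reinforcement learning as a viable tool for mechanistic interpretability.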

详情
AI中文摘要

机械可解释性已经识别出在Transformer语言模型中实现特定行为的小型注意力头集合,但恢复这些电路通常需要为每个新任务定制分析流程。我们将电路发现重新定义为强化学习问题。一个智能体在GPT-2 small的144个注意力头上操作,作为离散动作空间;每个动作触发零消融和对比奖励,该奖励从消融对目标任务的损害中减去其对通用下一个词预测的损害。一个在向量化多任务环境中训练于两个任务(归纳和IOI)的单一PPO策略,在两个训练任务以及一个保留的第三个任务(文档字符串补全)上均达到每轮最优。其偏好的头与现有文献中规范的头一致,恰好符合这些论文在单头消融下识别为因果非冗余的轴;它们识别为冗余的类别被智能体正确降级。在保留任务上,最佳五次规划在评估时未提供任务信号的情况下恢复了最优上限的96%。这些结果表明,基于因果干预的强化学习是识别机械电路单头瓶颈的可行且可迁移的方法,与现有的路径修补方法互补。

英文摘要

Mechanistic interpretability has identified small sets of attention heads that implement specific behaviours in transformer language models, but recovering these circuits typically requires a bespoke analytical pipeline for each new task. We recast circuit discovery as a reinforcement-learning problem. An agent operates over the 144 attention heads of GPT-2 small as a discrete action space; each action triggers a zero-ablation and a contrastive reward that subtracts the ablation's damage to general next-token prediction from its damage to the target task. A single PPO policy, trained on two tasks (induction and IOI) in a vectorised multi-task environment, attains the per-episode oracle on both training tasks and on a held-out third task (docstring completion). Its preferred heads coincide with the canonical heads of established literature on precisely the axes those papers identify as causally non-redundant under single-head ablation; the categories they identify as redundant are correctly de-prioritised by the agent. On the held-out task, best-of-five planning recovers 96% of the oracle ceiling with no task signal supplied at evaluation. These results indicate that reinforcement learning over causal interventions is a viable, transferable substrate for identifying the single-head bottlenecks of mechanistic circuits, complementary to existing path-patching approaches.
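The contrastive reward described in the abstract can be written down directly: damage to the target task minus damage to general next-token prediction. The sketch below assumes that "damage" is measured as an increase in loss after zero-ablating a head, which the abstract does not state; the function name and the example loss values are illustrative. The 144 actions correspond to the 12 layers with 12 heads each in GPT-2 small.

```python
from typing import List, Tuple

# Discrete action space: one action per (layer, head) pair of GPT-2 small.
HEADS: List[Tuple[int, int]] = [(layer, head) for layer in range(12) for head in range(12)]


def contrastive_reward(task_clean: float, task_ablated: float,
                       general_clean: float, general_ablated: float) -> float:
    """Reward = damage to the target task minus damage to general prediction.

    Damage is taken to be the loss increase caused by the ablation (an assumption).
    """
    task_damage = task_ablated - task_clean
    general_damage = general_ablated - general_clean
    return task_damage - general_damage


# A head that matters mostly for the task earns a high reward ...
task_specific = contrastive_reward(1.0, 3.0, 3.0, 3.25)
# ... while a head whose removal hurts everything equally earns none.
generic = contrastive_reward(1.0, 3.0, 3.0, 5.0)
```

Subtracting the general damage keeps the agent from simply selecting heads that are important for all text, which would not isolate a task-specific circuit.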

2605.26341 2026-05-27 cs.LG stat.ML

A PAC-Bayesian View of Generalisation for Physics-Informed Machine Learning

Thien V. Nguyen, Amaury Habrard, Benjamin Guedj

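AI summary Uses a PAC-Bayesian framework to derive high-probability generalisation bounds for physics-informed machine learning in regression with unbounded losses, and proposes a self-bounding-aware learning algorithm. Experiments on standard PDE benchmarks indicate that the bounds are non-vacuous and tighter than union-bound baselines.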

Abstract

Physics-informed machine learning (PIML) integrates mechanistic knowledge, typically in the form of partial differential equations (PDEs), into data-driven models. Despite strong empirical performance, its statistical generalisation properties remain poorly understood, particularly in the regression setting with unbounded losses. Existing analyses rely on approximation or stability arguments and do not fully capture how physical structure influences generalisation from finite data. In this work, we develop a PAC-Bayesian framework for PIML that provides high-probability generalisation guarantees in the presence of unbounded losses. We adopt a multi-task perspective that jointly treats data fidelity, PDE residuals, and initial and boundary conditions, avoiding the looseness induced by standard union-bound approaches. Our analysis leverages the structure of physics-informed objectives to derive novel bounds whose complexity scales with input-gradient norms of the losses, revealing a direct link between physical regularity and generalisation. We instantiate this framework under Sobolev and Poincaré-type assumptions, yielding two classes of bounds that trade off statistical complexity and smoothness in different regimes. Building on these results, we propose a self-bounding-aware learning algorithm that directly optimises tractable surrogates of the derived bounds, along with a practical procedure to estimate the associated constants in realistic settings. Empirical evaluations on standard PDE benchmarks demonstrate that our bounds are non-vacuous, significantly tighter than union-bound baselines, and can be effectively minimised during training. Overall, our results provide a principled statistical foundation for the generalisation of physics-informed models.
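For orientation, the four components named in this abstract (data fidelity, PDE residual, initial condition, boundary condition) are commonly combined into a single weighted empirical objective. The form below is a generic sketch and not the paper's exact objective: the weights $\lambda$, the differential operator $\mathcal{N}$, the boundary operator $\mathcal{B}$, the targets $u_0$ and $g$, and the sample sizes $n_d, n_r, n_0, n_b$ are all illustrative assumptions.

```latex
\[
\mathcal{L}(\theta)
  = \lambda_{d}\,\frac{1}{n_d}\sum_{i=1}^{n_d}\bigl(u_\theta(z_i)-y_i\bigr)^2
  + \lambda_{r}\,\frac{1}{n_r}\sum_{j=1}^{n_r}\bigl|\mathcal{N}[u_\theta](z_j)\bigr|^2
  + \lambda_{0}\,\frac{1}{n_0}\sum_{k=1}^{n_0}\bigl(u_\theta(x_k,0)-u_0(x_k)\bigr)^2
  + \lambda_{b}\,\frac{1}{n_b}\sum_{l=1}^{n_b}\bigl|\mathcal{B}[u_\theta](z_l)-g(z_l)\bigr|^2
\]
```

Here $z=(x,t)$ denotes a space-time point and $u_\theta$ the learned solution. A union-bound analysis would typically control each of the four terms separately at confidence level $\delta/4$ and sum the results, which is the source of looseness that the abstract says the joint multi-task treatment avoids.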