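arXivDaily: a daily academic digest synced with the full arXiv feed, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and related fields.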

2605.26394 2026-05-27 cs.CL

Memory Architectures for Multi-Turn Text-to-SQL: A Benchmark and Empirical Study

Ravi Kumar Tummalapenta, Suman Addanki

Affiliations: LLM Suite Engineering Team, JP Morgan Chase & Co.

AI summary: For multi-turn Text-to-SQL, the paper introduces the EnterpriseMem-Bench benchmark and evaluates five frontier models. It finds that stateless approaches fall to zero execution accuracy by Turn 3 and that added memory-architecture complexity does not monotonically improve accuracy.

Comments: 18 pages, 4 figures, 14 tables; includes appendices with verbatim prompts, example session, and full ablation tables; prepared by the LLM Suite Engineering Team, JP Morgan Chase & Co

Abstract

Multi-turn Text-to-SQL is central to enterprise analytics yet remains predominantly evaluated in single-turn settings. We introduce EnterpriseMem-Bench, a multi-turn Text-to-SQL benchmark of 300 sessions and 1,400 turns built programmatically from three enterprise domains (BIRD financial, SEC EDGAR, Northwind), with deterministic ground truth and per-turn memory-critical annotation. We evaluate five frontier models -- GPT-5 mini, GPT-5.2, Claude Sonnet 4.5, Sonnet 4.6, and Opus 4.6 -- across five memory conditions enabling a three-way ablation isolating working-memory window size, episodic retrieval, and semantic augmentation as independent effects. All Claude models are evaluated with extended thinking enabled to maintain parity with GPT reasoning models. We introduce the Memory Benefit Score (MBS) as a per-turn diagnostic metric. Four findings emerge: (1) stateless multi-turn Text-to-SQL collapses to zero execution accuracy by Turn 3 across all five models, even under reasoning; (2) memory-architecture complexity does not monotonically improve accuracy -- working memory dominates, and additional components produce model- and dataset-dependent effects from +14 to -16 percentage points; (3) Claude Sonnet 4.6 underperforms Sonnet 4.5 by 17-33pp on SEC EDGAR across conditions, a generational regression persisting under reasoning; (4) under reasoning, Claude error distributions become mono-modal -- every non-correct turn is a wrong-result error. We release the benchmark, agent, and evaluation code.
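The working-memory condition can be pictured with a small prompt builder. The sketch below is illustrative only: the function name `build_prompt`, the prompt layout, and the sample questions and SQL are assumptions, not the benchmark's actual agent code. With `window=0` the model sees only the current turn, which corresponds to the stateless setting.

```python
def build_prompt(history, question, window=0):
    """Build a prompt from the last `window` turns plus the current question.

    history: list of (question, sql) pairs from earlier turns in the session.
    window=0 is the stateless setting: no earlier turn is included.
    (Hypothetical helper, not taken from the paper's released code.)
    """
    # Guard window=0 explicitly, since history[-0:] would return every turn.
    recent = history[-window:] if window > 0 else []
    lines = []
    for past_question, past_sql in recent:
        lines.append(f"Q: {past_question}")
        lines.append(f"SQL: {past_sql}")
    lines.append(f"Q: {question}")
    return "\n".join(lines)


# Made-up session used only to illustrate the two settings.
history = [
    ("Total sales by region?",
     "SELECT region, SUM(amount) FROM sales GROUP BY region"),
    ("Only for 2023?",
     "SELECT region, SUM(amount) FROM sales WHERE year = 2023 GROUP BY region"),
]
stateless = build_prompt(history, "And by product?", window=0)
windowed = build_prompt(history, "And by product?", window=1)
```

A follow-up such as "And by product?" names no table or filter on its own, which is one plausible reason stateless prompting breaks down on later turns.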


2605.26383 2026-05-27 cs.CV

Zero-Shot Object Re-Identification in Egocentric Kitchen Videos via Multi-Stage SAM3 Feature Fusion

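Dmytro Klepachevskyi, Alexander Wong, Sirisha Rambhatla, Yuhao Chen

Affiliations: University of Waterloo

AI summary: To address object re-identification in egocentric kitchen videos, the paper proposes a multi-stage zero-shot method built on SAM3 segmentation. It fuses SAM3, DINOv2, and CLIP features and adds mask-shape IoU and k-reciprocal re-ranking, raising mAP from 45.3% to 52.8%.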

Abstract

Object re-identification (ReID) in egocentric kitchen videos is challenging due to rapid viewpoint changes, frequent occlusions, cluttered scenes, and large intra-class appearance variations. Objects may leave and re-enter the field of view, and the large diversity of instances with limited annotations makes supervised ReID difficult to scale, motivating zero-shot approaches. We study zero-shot object ReID on the EPIC-Kitchens benchmark, where the goal is to match active food and kitchen-tool instances across frames using only pre-trained visual features. We first evaluate five state-of-the-art feature extractors, including Vision-Language Models (VLMs) - CLIP, DINOv2, DreamSim, I-JEPA, and SAM3 - and show that zero-shot methods fail, with the best baseline achieving only 45.3% mAP. We then propose an Enhanced SAM3 ReID Pipeline, a zero-shot multi-stage method built around SAM3 segmentation as the core component. Stage 1 uses SAM3 to suppress background clutter. Stage 2 fuses embeddings from SAM3, DINOv2, and CLIP into a single L2-normalized descriptor. Stage 3 augments cosine similarity with mask-shape IoU for geometric consistency, and Stage 4 applies k-reciprocal re-ranking. The full pipeline improves performance by 7.5% mAP to 52.8%.
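Stages 2 and 3 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' pipeline: fusion by concatenating per-model L2-normalized embeddings, a linear blend with weight `alpha`, and masks that are already resized to a common grid are all choices made here for the example.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit L2 norm (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def fuse(embeddings):
    """Assumed fusion rule: normalize each embedding, concatenate, normalize again."""
    fused = []
    for emb in embeddings:
        fused.extend(l2_normalize(emb))
    return l2_normalize(fused)


def cosine(a, b):
    """Cosine similarity for vectors that are already unit norm."""
    return sum(x * y for x, y in zip(a, b))


def mask_iou(mask_a, mask_b):
    """IoU of two binary masks given as equal-size 2D lists of 0/1."""
    pairs = [(a, b) for row_a, row_b in zip(mask_a, mask_b)
             for a, b in zip(row_a, row_b)]
    inter = sum(a & b for a, b in pairs)
    union = sum(a | b for a, b in pairs)
    return inter / union if union else 0.0


def similarity(desc_a, desc_b, mask_a, mask_b, alpha=0.8):
    """Blend appearance and shape cues; alpha is a hypothetical weight."""
    return alpha * cosine(desc_a, desc_b) + (1 - alpha) * mask_iou(mask_a, mask_b)
```

The resulting pairwise scores would then feed a re-ranking step such as k-reciprocal re-ranking, which is not shown here.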


2605.26382 2026-05-27 cs.CV

Detail Consistent Stage-Wise Distillation for Efficient 3D MRI Segmentation

Mengchen Fan, Baocheng Geng, Xi Xiao, Tianyang Wang, Siyuan Mei, Pulin Che, Xiaoqian Jiang, Qizhen Lan

Affiliations: University of Alabama at Birmingham; Friedrich-Alexander-Universität Erlangen-Nürnberg; UTHealth Houston

AI summary: Proposes Detail Consistent Distillation (DCD), a stage-wise distillation framework that aligns teacher-student features through wavelet decomposition to preserve multi-scale structural detail for efficient 3D MRI segmentation.

Comments: Accepted by MICCAI 2026. 11 pages, 3 figures

Abstract

Deploying high-performing 3D medical image segmenters (e.g., nnU-Net) is often limited by memory footprint and inference latency. Compression is therefore necessary, but compact 3D encoders tend to lose fine structural cues (small lesions and sharp boundaries) as downsampling repeats across multi-resolution stages. We propose Detail Consistent Distillation (DCD), a stage-wise distillation framework that preserves structural detail across scales by aligning teacher-student features in a wavelet-decomposed representation. At each encoder stage, DCD distills directional detail components in the wavelet domain while leaving the coarse approximation comparatively unconstrained, avoiding over-regularization of global semantics. DCD is used only during training and introduces no inference-time overhead. Experiments on the BraTS 2024 and ISLES 2022 benchmarks demonstrate that our approach achieves superior performance in MRI segmentation using 3D multi-modal data. Code and implementation details for DCD are publicly available at https://github.com/ClinicaAlpha/DCD-3D-MedSeg.
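The idea of distilling only wavelet detail components can be illustrated on a toy 1D signal. The sketch below assumes a single-level Haar transform and a mean-squared-error loss; the paper works on 3D encoder features, and its exact wavelet and loss are not stated in the abstract, so treat every choice here as a stand-in.

```python
import math


def haar_1d(signal):
    """Single-level Haar transform of an even-length list.

    Returns (approximation, detail) coefficients.
    """
    s = 1.0 / math.sqrt(2.0)
    pairs = [(signal[i], signal[i + 1]) for i in range(0, len(signal) - 1, 2)]
    approx = [(a + b) * s for a, b in pairs]
    detail = [(a - b) * s for a, b in pairs]
    return approx, detail


def detail_distill_loss(teacher_feat, student_feat):
    """MSE between teacher and student detail coefficients only.

    The coarse approximation band is deliberately left out of the loss.
    """
    _, t_detail = haar_1d(teacher_feat)
    _, s_detail = haar_1d(student_feat)
    return sum((t - s) ** 2 for t, s in zip(t_detail, s_detail)) / len(t_detail)
```

Because a constant offset changes only the approximation band, a student feature that equals the teacher feature plus a constant incurs zero loss, while a student that flattens a sharp edge is penalized.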


2605.26381 2026-05-27 cs.CV

Multi-Modal Building Inspection via Perceiver IO Fusion of Satellite and Street-Level Imagery

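Niels Sombekke, Rob G. J. Wijnhoven, Martin R. Oswald

Affiliations: University of Amsterdam (UvA); Spotr

AI summary: Proposes a multi-modal classification framework that fuses satellite and street-level imagery through a Perceiver IO architecture over spatial patch tokens from a shared DINOv2 backbone. It handles a variable number of street-level views without padding or fixed-size pooling and jointly predicts roof element and roof material classes. The RGB-M masking strategy and the fusion model are validated on a dataset of 32,135 buildings across ten countries.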

Abstract

We present a multi-modal classification framework that fuses satellite and street-level imagery through a Perceiver IO architecture operating on spatial patch tokens from a shared DINOv2 backbone. The design naturally handles a variable number of street-level views per building without padding or fixed-size pooling, and jointly predicts multi-label roof element and roof material classes. We construct a large-scale dataset of 32,135 buildings (61,672 segments) spanning ten countries, pairing satellite images with up to eight street-level views per segment and evaluating four masking strategies for isolating the target building. We propose an RGB-M masking strategy that appends the building footprint mask as a fourth input channel, providing a soft spatial prior that outperforms hard cropping across both modalities. The Perceiver IO fusion model improves over all other fusion strategies and yields substantial per-class gains for attributes visible from street level (e.g., +11.3 AP for slate, +1.3 AP for dormers), though the satellite-only baseline retains a slight advantage in macro-averaged mAP for classes that are predominantly visible from above. These results establish a scalable, flexible architecture for multi-modal building inspection that can accommodate heterogeneous inputs and multiple output tasks.
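Two ideas from the abstract lend themselves to a toy sketch: cross-attention from a fixed latent query pools any number of tokens into a fixed-size vector, and RGB-M appends the footprint mask as a fourth channel. The code below is a heavily simplified illustration, assuming a single latent vector and no learned projections; a real Perceiver IO model uses multiple latents and learned query, key, and value maps.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(latent, tokens):
    """Pool a variable-length list of token vectors with one latent query.

    The output size depends only on the token dimension, not on how many
    tokens (for example, how many street-level views) are supplied.
    """
    scale = math.sqrt(len(latent))
    scores = [sum(l * t for l, t in zip(latent, token)) / scale for token in tokens]
    weights = softmax(scores)
    dim = len(tokens[0])
    return [sum(w * token[c] for w, token in zip(weights, tokens)) for c in range(dim)]


def to_rgbm(rgb_image, footprint_mask):
    """Append the building footprint mask as a fourth channel per pixel."""
    return [[list(pixel) + [m] for pixel, m in zip(img_row, mask_row)]
            for img_row, mask_row in zip(rgb_image, footprint_mask)]
```

No padding is needed because the softmax runs over whatever tokens are present, so a building with two views and one with eight go through the same code path.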


2605.26380 2026-05-27 cs.CV cs.AI

VisualNeedle: Benchmarking Active Visual Search in Information-Dense Scenes

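Jingru Chen, Yiming Liu, Mingtao Chen, Sijie Chen, Richeng Xuan, Liang Yang, Zhichao Hu, Fanyang Lu

Affiliations: Hunyuan, Tencent; Peking University; Zhejiang University

AI summary: Multimodal LLMs can score well on fine-grained perception benchmarks through shortcuts rather than real visual evidence. The paper proposes the VisualNeedle benchmark, which uses a counterfactual crop-black setting to evaluate active visual search in information-dense scenes. The best model reaches only 56.01% accuracy, below the 63.00% human majority-vote accuracy.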

Abstract

Frontier multimodal large language models (MLLMs) have been reported to achieve over 90% accuracy on fine-grained perception benchmarks. However, such scores do not necessarily imply faithful use of visual evidence. Prior studies have identified three shortcuts that inflate benchmark performance. First, linguistic priors and lexical cues in questions often enable models to infer plausible answers without seeing the image. Second, coarse global semantics from the visual encoder can bypass fine-grained local details. Third, in some "think-with-images" benchmarks, corrupting the intermediate images returned by visual tools barely affects the final answer. These findings suggest that higher input resolution or larger question pools alone do not elicit genuine active visual search. To address this, we introduce VisualNeedle, a challenging, information-dense, and fine-grained benchmark for scenes where critical evidence is spatially constrained to minute regions and not discernible at a glance. We further propose a counterfactual crop-black setting, which replaces crops returned by tools with black images of the same size, to test whether tool-enabled performance truly relies on intermediate visual evidence. We evaluate 9 prominent MLLMs across three settings: no-tool, standard tool-enabled, and crop-black. No-tool accuracy stays below 20%, and the best tool-enabled model reaches only 56.01%, still trailing the 63.00% human majority-vote accuracy. These results reveal persistent limitations in fine-grained visual search, while the crop-black ablation confirms that success on VisualNeedle hinges on genuine intermediate visual evidence.
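The crop-black control is simple to picture in code. The sketch below represents an image as a 2D list of pixel values and wraps a hypothetical `zoom_tool`; the function names and the `(top, left, height, width)` box format are assumptions made for illustration, not the benchmark's tool interface.

```python
def crop(image, top, left, height, width):
    """Return the sub-image covering the requested box."""
    return [row[left:left + width] for row in image[top:top + height]]


def black_like(patch):
    """Return an all-zero (black) patch with the same shape as `patch`."""
    return [[0 for _ in row] for row in patch]


def zoom_tool(image, box, crop_black=False):
    """Hypothetical visual tool: returns the crop, or a black image of the
    same size when the counterfactual crop-black setting is enabled."""
    patch = crop(image, *box)
    return black_like(patch) if crop_black else patch
```

If a model's accuracy barely changes when `crop_black=True`, its answers were not relying on what the tool returned.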


2605.26376 2026-05-27 cs.CV cs.AI cs.LG

BioFact-MoE: Biologically Factorized Mixture of Experts for Vision-Language Prognostic Modeling in Hepatocellular Carcinoma

Junlin Yang, Tian Yu, Nicha C. Dvornek, Yuexi Du, Peiyu Duan, Annabella Shewarega, Lawrence H. Staib, James S. Duncan, Julius Chapiro

Affiliations: Department of Radiology & Biomedical Imaging, Department of Biomedical Engineering, Department of Electrical Engineering, Department of Statistics & Data Science, Yale University, New Haven, CT, 06510, USA

AI summary: Proposes BioFact-MoE, a framework that explicitly decomposes liver and tumor factors through a biologically supervised Mixture of Experts, improving both the accuracy and the biological interpretability of prognosis prediction in hepatocellular carcinoma.

Comments: Early accepted at MICCAI 2026

Abstract

Hepatocellular carcinoma (HCC) is biologically heterogeneous, shaped by the interplay between hepatic functional reserve and tumor-related oncologic factors; thus, similar survival outcomes may reflect fundamentally different underlying biological processes. Prognostic modeling in HCC is informed by rich multimodal information from multiparametric MRI and radiology reports from routine clinical practice. Existing prognostic vision-language models (VLMs) learn a single entangled latent representation that blends hepatic and tumor-related factors, limiting both accuracy and biological interpretability. We present BioFact-MoE, a biologically factorized Mixture of Experts (MoE) framework that explicitly decomposes liver and tumor factors via biologically supervised experts within a residual MoE survival architecture. On an HCC cohort of N=588 patients (pretrained on 4,582 3D MRI image-report pairs), BioFact-MoE consistently improves survival prediction over all baselines across time horizons, achieving 12-, 18-, and 24-month AUCs of 75.33%, 75.85%, and 73.96%. Beyond scalar risk prediction, gated expert weights enable phenotype-aware risk stratification. Pathway-informed gating uncovers clinically meaningful treatment-associated survival heterogeneity. In held-out validation, hepatic and tumor embeddings show selective associations with liver function and tumor burden markers, respectively (p<0.05), without supervision. The code is available at https://github.com/jy-639/BioFact-MoE.


2605.26373 2026-05-27 cs.LG math.OC stat.ML

Online Learning on Hidden-Convex Losses via Algorithmic Equivalence: Optimal Regret, Geometric Barrier, and Bandit Feedback

Anas Barakat, Andreas Kontogiannis, Vasilis Pollatos, Ioannis Panageas, Antonios Varvitsiotis

Affiliations: Singapore University of Technology and Design; National Technical University of Athens; National and Kapodistrian University of Athens; University of California, Irvine; Archimedes, Athena Research Center, Greece; National University of Singapore, Centre for Quantum Technologies

AI summary: Using a sharper discrete-time algorithmic equivalence argument, the paper proves that online gradient descent achieves the optimal $\mathcal{O}(\sqrt{T})$ regret on hidden-convex losses and clarifies the geometric condition this requires. It also extends the analysis to one-point bandit feedback, obtaining $\mathcal{O}(T^{3/4})$ expected regret.

Comments: 43 pages

Abstract

We study adversarial online learning with hidden-convex losses, i.e., nonconvex losses that become convex after a nonlinear reparameterization. Ghai, Lu and Hazan (2022) proved that, under geometric and smoothness assumptions, online gradient descent (OGD) on such nonconvex losses approximately simulates online mirror descent (OMD) on the underlying convex losses with a suitable regularizer, yielding $\mathcal{O}(T^{2/3})$ regret. They left open whether the optimal $\Theta(\sqrt{T})$ regret from online convex optimization can be recovered in this hidden-convex setting. We answer this question affirmatively. More specifically, via a sharper discrete-time algorithmic equivalence argument, we prove that OGD achieves $\mathcal{O}(\sqrt{T})$ regret under the same assumptions, matching the optimal worst-case rate for adversarial online convex optimization. We also address another open question of Ghai, Lu and Hazan (2022) by clarifying the geometry required for this algorithmic equivalence. We replace the diagonal-Jacobian sufficient condition with a necessary-and-sufficient Hessian compatibility condition, thereby expanding the class of admissible reparameterizations. We complement our tight regret bound with a lower bound showing that the Hessian compatibility assumption is essential for OGD; when it fails, we construct a smooth reparameterization and an adversarial sequence of hidden-convex losses for which OGD suffers $\Omega(T)$ regret. Finally, we extend our analysis to one-point bandit feedback and prove an $\mathcal{O}(T^{3/4})$ expected regret bound for bandit OGD with spherical smoothing, matching its classical rate on convex losses.
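For readers less familiar with the setting, the quantities named in the abstract can be written out as follows. The notation is an assumption made here for illustration ($c$ for the reparameterization, $g_t$ for the hidden convex loss, $\Pi_{\mathcal{X}}$ for projection onto the decision set $\mathcal{X}$, $\eta$ for the step size); the paper may use different symbols.

```latex
% Hidden convexity: each loss is convex after the reparameterization c
f_t(x) = g_t\bigl(c(x)\bigr), \qquad g_t \text{ convex}

% Online gradient descent run directly on the nonconvex losses f_t
x_{t+1} = \Pi_{\mathcal{X}}\bigl(x_t - \eta \nabla f_t(x_t)\bigr)

% Regret against the best fixed decision in hindsight
\mathrm{Regret}_T = \sum_{t=1}^{T} f_t(x_t) - \min_{x \in \mathcal{X}} \sum_{t=1}^{T} f_t(x)
```

Under this reading, the headline result says that $\mathrm{Regret}_T$ grows as $\mathcal{O}(\sqrt{T})$, improving on the earlier $\mathcal{O}(T^{2/3})$ bound.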


2605.26370 2026-05-27 cs.CV

Joint Instance Segmentation and Geometric Attribute Regression for Roof Structures in Aerial Imagery

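Luuk Versteeg, Rob G. J. Wijnhoven, Martin R. Oswald

Affiliations: University of Amsterdam (UvA)

AI summary: Proposes a method that jointly predicts roof instance segmentation masks and three continuous geometric attributes (building height, roof slope, roof azimuth) from a single aerial orthophoto. A conditional azimuth loss and a log-normalized height representation address label noise and the skewed height distribution. The method reaches high accuracy on a large-scale Dutch dataset and supports reconstruction of simplified 3D building models from a single image.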

Abstract

We present a method for jointly predicting instance-level roof segment masks together with three continuous geometric attributes -- building height, roof slope, and roof azimuth -- from a single aerial orthophoto. Our approach extends Mask R-CNN with a dedicated attribute regression branch and introduces two key innovations: a conditional azimuth loss that suppresses supervision for flat roof segments where azimuth labels are inherently noisy, and a log-normalized height representation that addresses the heavily skewed distribution of building heights. We train and evaluate on a large-scale dataset of Dutch aerial images paired with automatically derived ground truth from 3DBAG, a nationwide LiDAR-based 3D building dataset. Using a DINOv3 ConvNeXt-Base backbone, our method achieves a mean absolute error of approximately 4 degrees for roof slope, 7 degrees for azimuth, and 1 meter for building height, with an instance segmentation AP$_{50}$ of 0.566. The predicted per-segment masks and attributes are sufficient to reconstruct simplified 3D building models (LoD2) from a single overhead image, requiring expensive 3D reference data only for training.
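The two training tricks named in the abstract can be sketched as follows. The abstract does not give formulas, so the details are assumptions: `log1p` scaling against a hypothetical maximum height, a circular absolute error in degrees, and a 5-degree slope threshold below which a segment counts as flat.

```python
import math


def encode_height(height_m, max_height_m=100.0):
    """Compress a skewed height distribution with log1p, scaled to about [0, 1]."""
    return math.log1p(height_m) / math.log1p(max_height_m)


def decode_height(value, max_height_m=100.0):
    """Invert encode_height to recover meters."""
    return math.expm1(value * math.log1p(max_height_m))


def angular_error_deg(pred, target):
    """Smallest absolute difference between two angles, in degrees."""
    diff = abs(pred - target) % 360.0
    return min(diff, 360.0 - diff)


def conditional_azimuth_loss(pred_az, true_az, true_slope, flat_threshold_deg=5.0):
    """Mean azimuth error over non-flat segments only.

    Flat segments are skipped because their azimuth labels are unreliable.
    """
    errors = [angular_error_deg(p, t)
              for p, t, s in zip(pred_az, true_az, true_slope)
              if s >= flat_threshold_deg]
    return sum(errors) / len(errors) if errors else 0.0
```

In the usage below, the second segment has a 1-degree slope, so its 180-degree azimuth disagreement does not contribute to the loss.

```python
loss = conditional_azimuth_loss([350.0, 90.0], [10.0, 270.0], [30.0, 1.0])
```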


2605.26365 2026-05-27 cs.CL

Cultural Value Alignment Via Latent Activation Steering in Large Language Models

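Trung Duc Anh Dang, Sarah Masud

Affiliations: University of Copenhagen

AI summary: Proposes a framework based on scenario-based behavioral probing and latent activation steering to evaluate and intervene on the cultural values of LLMs. It finds that cultural values are encoded as coupled structures, which limits precise alignment.

Comments: ACL 2026 Student Research Workshop (Non-Archival Track)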

Abstract

Large Language Models (LLMs) often exhibit homogenized cultural perspectives. While the World Values Survey (WVS) provides a gold standard for mapping human values, traditional direct prompting of LLMs on WVS often fails to access the model's latent cultural depth, leading to safety-aligned refusals or neutral responses. Here, we propose a generalizable framework for cultural evaluation and intervention that transitions from abstract queries to scenario-based behavioral probing. By extracting implicit token probabilities across 300 situational dilemmas, we bypass surface-level alignment to map the latent coordinates of LLMs' cultural values. We further introduce activation steering to shift these internal alignments during the forward pass without retraining. Across multiple LLMs, we find substantial variation in adaptability and uncover a consistent phenomenon of latent entanglement, where interventions along one cultural dimension induce shifts along another. These results suggest that cultural values are encoded as coupled structures, limiting precise alignment. This work establishes a computationally efficient framework for cultural steering, highlighting the structural complexities when navigating global values with LLMs.
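Activation steering in its simplest form adds a scaled direction vector to a hidden state during the forward pass. The toy sketch below also shows one possible geometric reading of "latent entanglement": if the directions for two cultural dimensions are not orthogonal, steering along one moves the projection onto the other. The vectors and the names `dim_a` and `dim_b` are invented for illustration; the abstract does not say that this is the mechanism the authors identify.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def steer(hidden, direction, strength):
    """Add a scaled steering direction to a hidden-state vector."""
    return [h + strength * d for h, d in zip(hidden, direction)]


def projection(hidden, direction):
    """Scalar projection of the hidden state onto a direction."""
    return dot(hidden, direction) / math.sqrt(dot(direction, direction))


# Hypothetical directions for two cultural dimensions (not orthogonal).
dim_a = [1.0, 0.0]
dim_b = [0.6, 0.8]

hidden = [0.0, 0.0]
steered = steer(hidden, dim_a, strength=2.0)
shift_on_b = projection(steered, dim_b) - projection(hidden, dim_b)
```

Here the intervention targets only `dim_a`, yet `shift_on_b` is nonzero because the two directions overlap.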


2605.26362 2026-05-27 cs.CL cs.AI

Why LLMs Hallucinate on Structured Knowledge: A Mechanistic Analysis of Reasoning over Linearized Representations

Shanghao Li, Jinda Han, Yibo Wang, Yuanjie Zhu, Zihe Song, Langzhou He, Kenan Kamel A Alghythee, Philip S. Yu

Affiliations: University of Illinois Chicago; University of Illinois Urbana-Champaign

AI summary: Through mechanistic analysis, the paper finds that LLM hallucinations in structured-knowledge reasoning arise because attention over-concentrates on shortcut-like structural cues and feed-forward layers fail to semantically ground the provided knowledge, so the model falls back on parametric memory.

Comments: To appear in Proceedings of ACL 2026

Abstract

In many reasoning tasks, large language models (LLMs) rely on structured external knowledge, such as graphs and tables, which is typically linearized into sequential token representations. However, even when sufficient knowledge is available, LLMs can still produce hallucinated outputs, and the underlying mechanisms behind such failures remain poorly understood. We investigate these mechanisms and find that hallucinations arise from systematic internal dynamics rather than random noise. First, attention disproportionately concentrates toward shortcut-like structural cues rather than distributing across the full context. Second, feed-forward representations fail to ground the provided knowledge, causing the model to revert to parametric memory. Moreover, our results indicate that hallucination is consistently associated with failures in semantic grounding within feed-forward layers, while attention allocation exhibits greater task-dependent variability. Finally, we show that these mechanistic patterns generalize beyond single-hop graphs to multi-hop and tabular settings, enabling effective hallucination detection across structured knowledge formats.


2605.26356 2026-05-27 cs.CL

In-Context Optimization for Retrieval-Augmented Generation: A Gradient-Descent Perspective

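Mingchen Li, Jiatan Huang, Chuxu Zhang, Liang Zhao, Hong Yu

Affiliations: University of Massachusetts, Amherst; University of Connecticut; Emory University

AI summary: Studies retrieval-augmented generation (RAG) as an in-context optimization process from a gradient-descent perspective and proposes a lightweight forward-only update that improves how the generator uses retrieved evidence while the LLM and retriever stay frozen.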

Abstract

In-context learning has recently been linked to implicit gradient descent in linear self-attention models, suggesting that context can induce a forward-pass update. Retrieval-augmented generation (RAG) also relies on context, but retrieved documents are usually treated as static evidence rather than signals for adaptation. We study RAG as an in-context optimization process. First, we show that one linear self-attention layer can implement one gradient-descent step on a unified linearized RAG objective covering both projection-based and dot-product retrieval interfaces. This gives an exact regime where retrieval-augmented prediction and in-context optimization coincide. We use this result not as a literal model of LLM computation, but as a guide for adapting the interaction between queries and retrieved evidence. We then test the boundary of this correspondence: it remains stable under controlled linear extensions, but becomes feature-distribution dependent under nonlinear architectures. Finally, we turn this view into a lightweight method for frozen RAG LLMs. The method keeps the retriever and backbone fixed, and predicts a context-conditioned update to a generator-side evidence-use interface. Across seven QA benchmarks, two retrievers, and two frozen LLM backbones, this forward-only update improves a shared-interface baseline, transfers to held-out tasks, and approaches test-time gradient adaptation at much lower per-query cost.
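The link between linear self-attention and gradient descent that the abstract builds on can be checked numerically in its plainest form. The sketch below covers only in-context linear regression starting from zero weights, not the paper's unified RAG objective; the data and learning rate are made up. One gradient step on the squared loss and an unnormalized linear-attention readout (keys `x_i`, values `y_i`, query `x_query`) give the same prediction.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def gd_one_step_prediction(xs, ys, x_query, lr):
    """Predict with weights after one GD step from w = 0.

    Loss: (1 / 2n) * sum_i (w . x_i - y_i)^2, whose gradient at w = 0
    is -(1 / n) * sum_i y_i * x_i.
    """
    n = len(xs)
    dim = len(x_query)
    grad = [-sum(y * x[d] for x, y in zip(xs, ys)) / n for d in range(dim)]
    w1 = [-lr * g for g in grad]
    return dot(w1, x_query)


def linear_attention_prediction(xs, ys, x_query, lr):
    """Softmax-free attention: sum_i y_i * (x_i . x_query), scaled by lr / n."""
    n = len(xs)
    return (lr / n) * sum(y * dot(x, x_query) for x, y in zip(xs, ys))


# Made-up context examples and query.
xs = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
ys = [1.0, -1.0, 2.0]
x_query = [0.5, 1.5]
gd_pred = gd_one_step_prediction(xs, ys, x_query, lr=0.1)
attn_pred = linear_attention_prediction(xs, ys, x_query, lr=0.1)
```

The two functions compute the same sum in a different order, which is the sense in which context can induce a forward-pass update.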


2605.26355 2026-05-27 cs.LG cs.CL eess.SP

Energy-Gated Attention and Wavelet Positional Encoding: Complementary Inductive Biases for Transformer Attention

Athanasios Zeris

Affiliations: Independent Researcher, Athens, Greece

AI summary: Standard attention lacks two complementary inductive biases, energy salience and scale-selective locality. The paper proposes Energy-Gated Attention (EGA) and Morlet Positional Encoding (MoPE), whose combination yields a superadditive improvement on character-level language modeling.

Comments: 10 pages, 1 figure, 3 tables. Part 2 of a five-paper series on spectral methods in transformer attention. Code: https://github.com/AthanasiosZeris/energy-gated-attention

Abstract

Standard transformer attention computes pairwise token similarity but treats all tokens as equally salient and all positions as equally local, regardless of the informational structure of the input. We identify two complementary inductive biases that standard attention lacks: energy salience (which tokens concentrate informational energy, learned end-to-end without explicit frequency decomposition) and scale-selective locality (how far positional influence extends at each frequency, implemented via Morlet wavelet encoding). We address both with two simple components. Energy-Gated Attention (EGA) gates value aggregation by a learned energy estimate of key token embeddings, computed via a single linear projection; it selects what to attend to. Morlet Positional Encoding (MoPE) replaces fixed sinusoidal encodings with learned Gaussian-windowed wavelets that adapt the joint position-frequency localization to the corpus; it specifies where attention operates at each scale. On TinyShakespeare, EGA alone achieves +0.092 validation loss improvement over standard attention (+0.103 over Phase 1-3 baseline); MoPE alone is -0.032 (below baseline as a standalone encoding); but their combination achieves +0.119 -- more than the sum of parts. This superadditivity, observed across two independent training runs, is the central empirical finding: salience and locality are complementary inductive biases, each addressing a gap the other cannot fill alone. Ablations confirm that structured spectral priors (Morlet wavelet gates, scale-initialized heads, fixed sinusoidal PE) consistently underperform their unconstrained learned counterparts, while complementary learned components interact superadditively. All experiments are at small scale (<=6M parameters, character-level benchmarks, single seed); larger-scale multi-seed validation is the most important direction for future work.
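A gate on value aggregation can be sketched for a single query as follows. This is one plausible reading of the description, not the released implementation: the gate is assumed to be a sigmoid of a single linear projection of each key vector, and it multiplies each value after the softmax without renormalization. The names `gate_w` and `gate_b` are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def energy_gated_attention(query, keys, values, gate_w, gate_b=0.0):
    """Single-query attention whose values are scaled by a per-key gate.

    gate_j = sigmoid(gate_w . key_j + gate_b) plays the role of a learned
    energy estimate for token j (assumed form).
    """
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    gates = [sigmoid(dot(gate_w, k) + gate_b) for k in keys]
    out_dim = len(values[0])
    return [sum(w * g * v[c] for w, g, v in zip(weights, gates, values))
            for c in range(out_dim)]
```

With a zero query the softmax weights are uniform, so any difference between tokens in the output comes from the gate alone, which makes the "what to attend to" role easy to see.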


2605.26353 2026-05-27 cs.CV cs.AI cs.LG

Personalized Generative Models for Contextual Debiasing

Xinran Liang, Esin Tureci, Prachi Sinha, Ye Zhu, Vikram V. Ramaswamy, Olga Russakovsky

Affiliations: Department of Computer Science, Princeton University; LIX, CNRS, École Polytechnique

AI summary: Proposes DecoupleGen, which uses personalized text-to-image diffusion models to generate images with rare contexts as training augmentation, mitigating contextual bias in visual recognition.

Comments: CVPR 2026 Workshop on Synthetic Data for Computer Vision and Generative Models for Computer Vision. Code available at https://github.com/princetonvisualai/DecoupleGen

Abstract

Different visual patterns appear with different frequencies in the world: e.g., beach balls appear on sand more often than they do on a road. These statistics are reflected in vision datasets, and as a result trained models more easily recognize objects in common scenarios. However, recognizing a beach ball on a road may arguably be even more important than recognizing it on sand. We study how to mitigate this discrepancy. Since collecting uncommon images in the real world may be difficult, we explore whether generating images with less frequent contexts can serve as effective training augmentation. A key challenge is guiding generations to remain close to the original dataset distribution while creating diverse images with uncommon contexts. We introduce Decoupling Contextual Patterns with Generations (DecoupleGen), a method that personalizes text-to-image diffusion models to facilitate coherent synthesis of images with rare contexts while preserving original visual details. The generated images contain semantically meaningful content and remain visually aligned with the original datasets. We further apply verification constraints to ensure relevance of the augmented data. We evaluate our approach on object classification and recognition tasks on complex scene datasets. Our experiments demonstrate consistent improvements over previous approaches, and our analyses identify factors underlying these improvements.


2605.26352 2026-05-27 cs.CL

RICE-PO: Turning Retrieval Interactions into Credit Signals for Reasoning Agents

Mingchen Li, Hansi Zeng, Zhuo Qian, Jiatan Huang, Hamed Zamani, Hong Yu

Affiliations: University of Massachusetts, Amherst; Texas Tech University; University of Connecticut

AI summary: Proposes RICE-PO, a framework that converts retrieval interactions into localized learning signals to address credit assignment for reasoning steps when training reasoning-based retrieval agents. It outperforms baselines on BRIGHT and BEIR.

Abstract

Retrieval is increasingly moving from one-shot matching toward interactive reasoning, where language agents iteratively inspect evidence, reformulate queries, and search again. Training such agents raises a credit-assignment challenge: executable actions such as queries or summaries can be directly evaluated by the retriever, while latent reasoning steps are not directly observable and only affect future executable actions. This asymmetry makes outcome-level reward assignment unreliable, as the same final reward may credit reasoning steps that did not actually shape retrieval success. We propose RICE-PO, a critic-free policy optimization framework that converts retrieval interactions into localized learning signals. RICE-PO selects high-uncertainty executable actions as anchors, evaluates local counterfactual branches using retrieval metrics, and propagates credit to latent reasoning steps only when reasoning-to-action influence is strong and future residual effects are stable. On BRIGHT and BEIR, RICE-PO consistently outperforms prompt-based agents and group-based RL baselines under the same retriever setting. These results show that the structure of agent-environment interaction itself can provide useful supervision for training reasoning-based retrieval agents.

2605.26350 2026-05-27 cs.LG cs.AI

When Correct Demonstrations Hurt: Rethinking the Role of Exemplars in In-Context Learning

当正确示例有害时：重新思考示例在上下文学习中的作用

Chenghao Qiu, Chunli Peng, Yufeng Yang, Kuan-Hao Huang, Yi Zhou

发表机构 * Texas A&M University（德克萨斯农工大学）

AI总结本文通过引入任务保持扰动，揭示了正确示例不一定有益甚至可能降低上下文学习准确性的反直觉现象，并提出了上下文证据转移的概念来解释正确性与效用之间的差距。

AI中文摘要

上下文学习（ICL）通常被直觉所驱动，即示例之所以有帮助是因为它们提供了正确的输入-输出对。然而，我们揭示了一个反直觉的现象：正确性并不能保证示例的效用，一些正确的示例甚至可能降低ICL的准确性。为了研究这种正确性-效用差距，我们引入了任务保持扰动，其中仅改变示例输入，而该示例仍然是同一任务的正确实例。具体来说，每个扰动后的示例被赋予由任务映射诱导的目标。该框架涵盖了标签更新扰动（其中任务相关语义发生变化且目标被重新计算）和更严格的目标保持扰动（其中原始目标仍然有效）。我们将由此产生的失败模式形式化为上下文证据转移：任务保持扰动可以改变模型用于上下文推理的有效证据混合，从而将示例正确性与示例效用分离。在情感分类、逻辑推理和数学应用题中，我们发现任务保持扰动的示例会显著降低ICL性能，尤其是对于较小的模型、较难的任务和较高的扰动比例。我们的结果表明，鲁棒的ICL不仅需要评估示例是否正确，还需要评估它们如何影响上下文推理。代码可在 https://github.com/Chenghao-Qiu/Task-Preserving-ICL 获取。

英文摘要

In-context learning (ICL) is often motivated by the intuition that demonstrations help because they provide correct input-output examples. However, we reveal a counterintuitive phenomenon: correctness does not guarantee exemplar utility, and some correct demonstrations can even reduce ICL accuracy. To study this correctness-utility gap, we introduce task-preserving perturbations, where only the exemplar input is changed, while the example remains a correct instance of the same task. Concretely, each perturbed exemplar is assigned the target induced by the task mapping. This framework covers both label-updating perturbations, where task-relevant semantics change and targets are recomputed, and stricter target-preserving perturbations, where the original target remains valid. We formalize the resulting failure mode as contextual evidence shift: task-preserving perturbations can change the effective mixture of evidence used by the model for contextual inference, thereby separating exemplar correctness from exemplar utility. Across sentiment classification, logical reasoning, and math word problems, we find that task-preserving perturbed demonstrations can substantially degrade ICL performance, especially for smaller models, harder tasks, and higher perturbation ratios. Our results show that robust ICL requires evaluating not only whether demonstrations are correct, but also how they influence contextual inference. Code is available at https://github.com/Chenghao-Qiu/Task-Preserving-ICL.

2605.26349 2026-05-27 cs.RO

Closing the Loop in Teleoperation: Episode-Level Data Quality Assessment and Feedback for High-Quality Demonstration Collection

在遥操作中闭环：面向高质量演示收集的片段级数据质量评估与反馈

Gokul Narayanan, Yash Shahapurkar, Melih Erdogan, Brian Zhu, Eugen Solowjow

发表机构 * Siemens Corporation（西门子公司）

AI总结提出数据质量评估与反馈框架，通过语义任务进度和机器人遥测数据提供即时后片段反馈，帮助新手操作员提升演示质量。

AI中文摘要

工业自动化正处于关键时刻，物理AI正推动从刚性、手工设计的自动化系统向更灵活、自适应的系统转变。这一转变产生了对大规模、真实世界机器人演示数据的需求，使得遥操作成为越来越重要的数据收集机制。然而，在实践中，高质量的遥操作演示仍然难以获得，因为新手操作员经常产生任务成功但下游使用次优的片段，原因包括低效运动、重复修正或接近机器人关节极限操作。我们提出一个数据质量评估与反馈（DQAF）框架，通过提供基于语义任务进度和机器人遥测的即时后片段反馈，在遥操作中实现闭环。该框架提取质量相关信号，如子任务进度、运动平滑度、停顿、运动学极限，并将其转化为结构化质量评估和可操作的自然语言反馈。与二元成功或失败反馈不同，所提系统解释了片段为何次优，并突出显示下次试验中需要纠正的具体行为。我们通过诊断验证研究和试点用户研究评估该框架。在验证研究中，系统在数据集整理过程中与人类评审员进行比较，产生拒绝原因和可操作的改进反馈。在涉及三个新手操作员的两项操作任务的试点研究中，接收系统即时自动后片段反馈的操作员比未接收的改进更快，更早产生更高质量的演示。

英文摘要

Industrial automation is at a pivotal moment, as Physical AI is driving a transition from rigid, hand-engineered automation systems toward more flexible and adaptive systems. This shift has created a growing demand for large-scale, real-world robot demonstration data, making teleoperation an increasingly important mechanism for data collection. However, high-quality teleoperated demonstrations remain difficult to obtain in practice, as novice operators often produce episodes that are task-successful but suboptimal for downstream use due to inefficient motion, repeated corrections, or operation near robot joint limits. We present a Data Quality Assessment and Feedback (DQAF) framework that closes the loop in teleoperation by providing immediate post-episode feedback grounded in semantic task progress and robot telemetry. The framework extracts quality-relevant signals, such as sub-task progress, motion smoothness, stalls, and kinematic limits, and converts them into structured quality assessments and actionable natural-language feedback. Unlike binary success or failure feedback, the proposed system explains why an episode is suboptimal and highlights specific behaviors to correct in the next trial. We evaluate the framework through a diagnostic validation study and a pilot user study. In the validation study, the system is compared with a human reviewer during dataset curation, producing rejection reasons and actionable feedback for improvement. In the pilot study with three novice operators across two manipulation tasks, the operator who received the system's immediate, automated post-episode feedback improved faster than those who did not, producing higher-quality demonstrations sooner.

2605.26348 2026-05-27 cs.RO

RCSP: Risk-Sensitive Conjectural Scenario Planning for Safe Dynamic Robot Navigation

RCSP: 面向安全动态机器人导航的风险敏感推测性场景规划

Zhengye Han, Quanyan Zhu

发表机构 * Department of Electrical and Computer Engineering（电气计算机工程系）

AI总结提出风险敏感推测性场景规划（RCSP），通过轻量级信念维护、未来交互采样和高风险尾部惩罚，结合局部安全检查，解决移动机器人在动态障碍物环境中的预测性近碰撞承诺问题，并在仿真中验证其提升安全性和路径质量。

AI中文摘要

移动机器人在碰撞之前就可能失败：当前安全的速度可能使机器人陷入即将被移动障碍物关闭的通道。我们研究了这种预测性近碰撞承诺问题，并提出了风险敏感推测性场景规划（RCSP），这是一个规划层，它针对合理的短时域障碍物未来状态评估候选命令。RCSP维护一个关于局部运动推测的轻量级信念，采样未来交互，惩罚高风险尾部，并通过局部安全检查执行。在受控的MuJoCo瓶颈任务中，RCSP规划器无碰撞地到达目标，并且与非自适应预测器相比，提供了更高的次要安全性和路径质量点估计，但增加了延迟。在ROS2/Gazebo中，将局部安全层添加到标准Nav2堆栈可减少动态近碰撞失败。在官方DynaBARN/Jackal迁移中，调整后的DWA和TEB在严格的基准成功率上仍然更强，揭示了该方法的边界。这些仿真结果将RCSP定位为一个预测风险模块，在动态瓶颈机制中补充现有的导航堆栈。

英文摘要

Mobile robots can fail before they collide: a velocity that is safe now may commit the robot to a passage that moving obstacles will soon close. We study this predictive near-miss commitment problem and propose Risk-Sensitive Conjectural Scenario Planning (RCSP), a planning layer that evaluates candidate commands against plausible short-horizon obstacle futures. RCSP maintains a lightweight belief over local motion conjectures, samples future interactions, penalizes high-risk tails, and executes through a local safety check. In controlled MuJoCo bottleneck tasks, the RCSP planner reaches the goal without collisions and yields higher secondary safety and path-quality point estimates than a non-adaptive predictor, with additional latency. In ROS2/Gazebo, adding the local safety layer to a standard Nav2 stack reduces dynamic near-miss failures. On official DynaBARN/Jackal transfer, tuned DWA and TEB remain stronger on strict benchmark success, revealing the boundary of the approach. These simulation results position RCSP as a predictive-risk module that complements existing navigation stacks in dynamic bottleneck regimes.
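The abstract says RCSP samples future interactions and penalizes high-risk tails, but it does not name the tail measure. The following is a minimal sketch of that idea, assuming conditional value-at-risk (CVaR) as the tail penalty. The function names, the `alpha` and `tail_weight` parameters, and the cost values are all hypothetical and are not taken from the paper.

```python
import numpy as np

def cvar(costs, alpha=0.9):
    """Mean of the worst (1 - alpha) fraction of sampled costs (assumed tail measure)."""
    costs = np.sort(np.asarray(costs, dtype=float))
    tail_count = max(1, int(round((1.0 - alpha) * len(costs))))
    return float(costs[-tail_count:].mean())

def risk_sensitive_score(costs, alpha=0.9, tail_weight=1.0):
    """Expected cost plus a weighted penalty on the high-cost tail."""
    return float(np.mean(costs)) + tail_weight * cvar(costs, alpha)

def pick_command(costs_by_command, alpha=0.9, tail_weight=1.0):
    """Choose the candidate command with the lowest risk-sensitive score."""
    return min(
        costs_by_command,
        key=lambda name: risk_sensitive_score(costs_by_command[name], alpha, tail_weight),
    )

# Hypothetical costs of two candidate commands over ten sampled obstacle futures.
sampled_costs = {
    "fast": [0.0] * 9 + [10.0],   # usually free, but one future closes the passage
    "cautious": [2.0] * 10,       # uniformly slower, never trapped
}
chosen = pick_command(sampled_costs)
```

In this toy case a mean-only planner would prefer `fast`, while the tail penalty makes the rare trapped future dominate its score.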

2605.26346 2026-05-27 cs.CL

The Daily Dose: Workflow-Integrated Large Language Model Automation for Clinical Summarization and Trial Identification in Radiation Oncology

每日剂量：工作流集成的大型语言模型自动化在放射肿瘤学中的临床总结和试验识别

Jason Holmes, Federico Mastroleo, Mariana Borras-Osorio, Srinivas Seetamsetty, Satomi Shiraishi, Mirek Fatyga, Judy C. Boughey, Cornelius A. Thiels, William G. Breen, Daniel J. Ma, Daniel K. Ebner, David M. Routman, Brady S. Laughlin, Carlos E. Vargas, Samir H. Patel, Sujay A. Vora, Nadia N. Laack, Andrew Y. K. Foong, Wei Liu, Mark R. Waddle

发表机构 * Department of Radiation Oncology, Mayo Clinic, Phoenix, AZ, United States（放射肿瘤科，梅奥诊所，凤凰城，亚利桑那州，美国）； Department of Radiation Oncology, Mayo Clinic, Rochester, MN, United States（放射肿瘤科，梅奥诊所，罗切斯特，明尼苏达州，美国）； Division of Radiation Oncology, IEO, European Institute of Oncology, IRCCS, Milan, Italy（放射肿瘤学部，IEO，欧洲肿瘤研究所，IRCCS，米兰，意大利）

AI总结本文介绍了一个名为“每日剂量”的LLM驱动的自动化临床总结和临床试验识别系统，该系统集成到常规放射肿瘤学实践中，并通过混合方法评估其可用性、满意度和感知有用性，结果显示高使用率和满意度。

Comments 28 pages, 4 figures, 1 table

AI中文摘要

目的：描述“每日剂量”（TDD）的设计和早期临床评估，这是一个由LLM驱动的自动化临床总结和临床试验识别系统，集成到常规放射肿瘤学实践中。设计：在系统部署1个月后，采用横断面匿名临床医生调查进行混合方法评估。暴露：每日自动生成医生特定的电子邮件摘要，使用RadOnc-GPT生成，包括患者日程、从电子健康记录中提取的简洁临床状态摘要，以及自动识别新就诊或咨询就诊的潜在相关临床试验。主要结果和指标：主要结果包括自我报告的可用性、满意度、感知有用性、对工作流程的感知影响、节省的时间以及继续使用的意愿。使用Cronbach's α评估内部一致性信度。结果：在55名受访者中，52名（94.5%）在放射肿瘤学领域工作，38名（69.1%）是主治医师。大多数参与者（83.6%）报告每天或每周多次使用TDD。平均（SD）得分为：可用性和满意度3.89（1.04），感知有用性3.43（1.24），影响和未来使用3.80（1.17）（5点李克特量表）。总体满意度与感知时间节省呈正相关（p < .001）。参与者报告的时间节省不一，27%的人估计每天节省≥10分钟。问卷表现出极好的内部一致性（总体Cronbach's α = 0.97）。

英文摘要

Objective: To describe the design and early clinical evaluation of The Daily Dose (TDD), an LLM-driven, automated clinical summarization and clinical-trial identification system integrated into routine radiation oncology practice. Design: Mixed-methods evaluation using a cross-sectional, anonymous clinician survey administered after 1 month of system deployment. Exposure: Daily automated delivery of physician-specific email summaries generated using RadOnc-GPT, including patient schedules, concise EHR-derived clinical-status summaries, and automated identification of potentially relevant clinical trials for new or consult visits. Main Outcomes and Measures: Primary outcomes included self-reported usability, satisfaction, perceived usefulness, perceived impact on workflow, time savings, and intention for continued use. Internal consistency reliability was assessed using Cronbach's α. Results: Among 55 respondents, 52 (94.5%) worked in radiation oncology, and 38 (69.1%) were attending physicians. Most participants (83.6%) reported using TDD daily or several times per week. Mean (SD) scores were 3.89 (1.04) for usability and satisfaction, 3.43 (1.24) for perceived usefulness, and 3.80 (1.17) for impact and future use (5-point Likert scale). Overall satisfaction was positively associated with perceived time savings (p < .001). Participants reported variable time savings, with 27% estimating ≥ 10 minutes saved per day. The questionnaire demonstrated excellent internal consistency (overall Cronbach's α = 0.97).
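The internal-consistency statistic reported above can be illustrated with a short sketch. It uses the standard formula for Cronbach's α, namely k/(k-1) times one minus the ratio of summed item variances to the variance of the total score. The response matrix below is hypothetical and is not data from the study.

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for a (respondents x items) matrix of item scores."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                               # number of questionnaire items
    item_variances = scores.var(axis=0, ddof=1)       # sample variance of each item
    total_variance = scores.sum(axis=1).var(ddof=1)   # variance of each respondent's total
    return (k / (k - 1)) * (1.0 - item_variances.sum() / total_variance)

# Hypothetical 5-point Likert responses: five respondents, three items.
responses = [
    [4, 5, 4],
    [3, 3, 2],
    [5, 5, 5],
    [2, 2, 3],
    [4, 4, 4],
]
alpha = cronbach_alpha(responses)
```

Values close to 1 indicate that the items move together across respondents, which is what a high overall α suggests for the survey.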

2605.26343 2026-05-27 cs.LG

MechRL: Reinforcement Learning Agents Perform Circuit Discovery for Mechanistic Interpretability

MechRL：强化学习智能体进行电路发现以实现机械可解释性

Barsat Khadka

发表机构 * The University of Southern Mississippi（南密西西比大学）

AI总结提出将电路发现转化为强化学习问题，使用PPO策略在GPT-2 small的144个注意力头上进行零消融和对比奖励，成功在训练任务和未见任务上恢复标准电路，验证了强化学习在机械可解释性中的有效性。

AI中文摘要

机械可解释性已经识别出在Transformer语言模型中实现特定行为的小型注意力头集合，但恢复这些电路通常需要为每个新任务定制分析流程。我们将电路发现重新定义为强化学习问题。一个智能体在GPT-2 small的144个注意力头上操作，作为离散动作空间；每个动作触发零消融和对比奖励，该奖励从消融对目标任务的损害中减去其对通用下一个词预测的损害。一个在向量化多任务环境中训练于两个任务（归纳和IOI）的单一PPO策略，在两个训练任务以及一个保留的第三个任务（文档字符串补全）上均达到每轮最优。其偏好的头与现有文献中规范的头一致，恰好符合这些论文在单头消融下识别为因果非冗余的轴；它们识别为冗余的类别被智能体正确降级。在保留任务上，最佳五次规划在评估时未提供任务信号的情况下恢复了最优上限的96%。这些结果表明，基于因果干预的强化学习是识别机械电路单头瓶颈的可行且可迁移的方法，与现有的路径修补方法互补。

英文摘要

Mechanistic interpretability has identified small sets of attention heads that implement specific behaviours in transformer language models, but recovering these circuits typically requires a bespoke analytical pipeline for each new task. We recast circuit discovery as a reinforcement-learning problem. An agent operates over the 144 attention heads of GPT-2 small as a discrete action space; each action triggers a zero-ablation and a contrastive reward that subtracts the ablation's damage to general next-token prediction from its damage to the target task. A single PPO policy, trained on two tasks (induction and IOI) in a vectorised multi-task environment, attains the per-episode oracle on both training tasks and on a held-out third task (docstring completion). Its preferred heads coincide with the canonical heads of established literature on precisely the axes those papers identify as causally non-redundant under single-head ablation; the categories they identify as redundant are correctly de-prioritised by the agent. On the held-out task, best-of-five planning recovers 96% of the oracle ceiling with no task signal supplied at evaluation. These results indicate that reinforcement learning over causal interventions is a viable, transferable substrate for identifying the single-head bottlenecks of mechanistic circuits, complementary to existing path-patching approaches.
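The contrastive reward described above can be sketched in a few lines. This is only an illustration of the reward arithmetic, assuming that "damage" is measured as the increase in loss after zero-ablating a head. It does not run GPT-2 or PPO, and every loss value and head index below is hypothetical.

```python
def contrastive_reward(task_clean, task_ablated, general_clean, general_ablated):
    """Damage to the target task minus damage to general next-token prediction."""
    task_damage = task_ablated - task_clean            # assumed: loss increase on the task
    general_damage = general_ablated - general_clean   # assumed: loss increase in general
    return task_damage - general_damage

# Hypothetical clean losses and per-head ablated losses, keyed by (layer, head).
CLEAN_TASK_LOSS = 1.0
CLEAN_GENERAL_LOSS = 3.0
ablated_losses = {
    (5, 1): (2.5, 3.1),    # hurts the task a lot, general prediction barely
    (0, 3): (2.0, 4.5),    # hurts everything, so the contrastive term penalises it
    (9, 7): (1.05, 3.0),   # almost no effect on either
}

def rank_heads(losses):
    """Order heads from highest to lowest contrastive reward."""
    rewards = {
        head: contrastive_reward(CLEAN_TASK_LOSS, task, CLEAN_GENERAL_LOSS, general)
        for head, (task, general) in losses.items()
    }
    return sorted(rewards, key=rewards.get, reverse=True)

ranking = rank_heads(ablated_losses)
```

Subtracting the general damage is what separates task-specific heads from heads whose removal simply breaks the model.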

2605.26341 2026-05-27 cs.LG stat.ML

A PAC-Bayesian View of Generalisation for Physics-Informed Machine Learning

物理信息机器学习的泛化性的PAC-Bayesian视角

Thien V. Nguyen, Amaury Habrard, Benjamin Guedj

发表机构 * Université Jean Monnet Saint-Étienne, CNRS, Institut d’Optique Graduate School, Laboratoire Hubert Curien UMR 5516（圣埃蒂安让·莫内大学、法国国家科学研究中心、光学研究所研究生院、Hubert Curien实验室 UMR 5516）； Inria and University College London, France and United Kingdom（Inria 与伦敦大学学院，法国与英国）

AI总结本文通过PAC-Bayesian框架，针对无界损失下的回归问题，推导了物理信息机器学习的高概率泛化界，并提出了自界感知学习算法，在标准PDE基准上验证了界的非平凡性和更紧性。

AI中文摘要

物理信息机器学习（PIML）将机械知识（通常以偏微分方程（PDE）的形式）整合到数据驱动模型中。尽管经验性能强劲，但其统计泛化性质仍未被充分理解，尤其是在具有无界损失的回归设置中。现有分析依赖于近似或稳定性论证，未能完全捕捉物理结构如何影响有限数据的泛化。在这项工作中，我们为PIML开发了一个PAC-Bayesian框架，在存在无界损失的情况下提供高概率泛化保证。我们采用多任务视角，联合处理数据保真度、PDE残差、初始条件和边界条件，避免了标准联合界方法导致的松散性。我们的分析利用物理信息目标的结构，推导出新的界，其中复杂度与损失的输入梯度范数成比例，揭示了物理正则性与泛化之间的直接联系。我们在Sobolev和Poincaré型假设下实例化该框架，得到两类界，在不同机制下权衡统计复杂性和光滑性。基于这些结果，我们提出了一种自界感知学习算法，直接优化推导界的可处理代理，以及一种在实际设置中估计相关常数的实用程序。在标准PDE基准上的实证评估表明，我们的界是非平凡的，显著比联合界基线更紧，并且可以在训练过程中有效最小化。总体而言，我们的结果为物理信息模型的泛化提供了原则性的统计基础。

英文摘要

Physics-informed machine learning (PIML) integrates mechanistic knowledge, typically in the form of partial differential equations (PDE), into data-driven models. Despite strong empirical performance, its statistical generalisation properties remain poorly understood, particularly in the regression setting with unbounded losses. Existing analyses rely on approximation or stability arguments and do not fully capture how physical structure influences generalisation from finite data. In this work, we develop a PAC-Bayesian framework for PIML that provides high-probability generalisation guarantees in the presence of unbounded losses. We adopt a multi-task perspective that jointly treats data fidelity, PDE residuals, initial and boundary conditions, avoiding the looseness induced by standard union-bound approaches. Our analysis leverages the structure of physics-informed objectives to derive novel bounds where the complexity scales with input-gradient norms of the losses, revealing a direct link between physical regularity and generalisation. We instantiate this framework under Sobolev and Poincaré-type assumptions, yielding two classes of bounds that trade off statistical complexity and smoothness in different regimes. Building on these results, we propose a self-bounding-aware learning algorithm that directly optimises tractable surrogates of the derived bounds, along with a practical procedure to estimate the associated constants in realistic settings. Empirical evaluations on standard PDE benchmarks demonstrate that our bounds are non-vacuous, significantly tighter than union-bound baselines, and can be effectively minimised during training. Overall, our results provide a principled statistical foundation for the generalisation of physics-informed models.

2605.26340 2026-05-27 cs.AI cs.CL cs.MA

ScientistOne: Towards Human-Level Autonomous Research via Chain-of-Evidence

ScientistOne: 迈向基于证据链的人类级自主研究

Rui Meng, Bhavana Dalvi Mishra, Jiefeng Chen, Chun-Liang Li, Palash Goyal, Mihir Parmar, Yiwen Song, Yale Song, Rajarishi Sinha, Parthasarathy Ranganathan, Burak Gokturk, Jinsung Yoon, Tomas Pfister

发表机构 * Google Cloud AI Research（谷歌云人工智能研究）

AI总结提出证据链框架Chain-of-Evidence和自主研究系统ScientistOne，通过可追溯性解决可验证性失败问题，在多项任务上达到或超越人类专家水平。

Comments Project website: https://scientist-one.github.io/

AI中文摘要

自主研究代理能产生有竞争力的解决方案和专业手稿，但其输出存在表面评估无法察觉的可验证性失败：捏造的引用、不可复现的分数以及与实现不符的方法描述。我们通过三项贡献解决这一问题。第一，Chain-of-Evidence (CoE)，一个可验证性框架，要求每个声明都能追溯到其证据来源。第二，ScientistOne，一个端到端的自主研究系统，在文献综述、解决方案发现和论文撰写过程中通过构造保持证据链。第三，CoE Audit，一个事后审计，其四项完整性检查——分数验证、规范违反、引用验证和方法-代码对齐——统一适用于所有系统。在涵盖五个系统和五个前沿研究任务的75篇论文中，每个基线都表现出至少一种系统性失败模式：幻觉引用率高达21%，分数验证通过率低至42%，方法-代码对齐率在20%到80%之间。ScientistOne实现了零幻觉引用（0/337）、完美的分数验证（12/12）和最高的方法-代码对齐率（14/15），同时在所有五个任务上达到或超过人类专家表现。ScientistOne进一步泛化到涵盖医学影像、细粒度识别、3D感知和语言建模的六个额外任务，在Parameter Golf上取得最先进结果，并在基线完全失败的MLE-Bench任务上获得金牌。

英文摘要

Autonomous research agents produce competitive solutions and professional-looking manuscripts, yet their outputs contain verifiability failures undetectable by surface-level evaluation: fabricated citations, unreproducible scores, and method descriptions that diverge from the implementation. We address this through three contributions. First, Chain-of-Evidence (CoE), a verifiability framework requiring every claim to be traceable to its evidence source. Second, ScientistOne, an end-to-end autonomous research system that maintains evidence chains by construction throughout literature review, solution discovery, and paper writing. Third, CoE Audit, a post-hoc audit whose four integrity checks -- score verification, specification violation, reference verification, and method-code alignment -- apply uniformly to all systems. Across 75 papers spanning five systems and five frontier research tasks, every baseline exhibits at least one systematic failure mode: hallucinated reference rates reach 21%, score verification passes in as few as 42% of papers, and method-code alignment ranges from 20% to 80%. ScientistOne achieves zero hallucinated references (0/337), perfect score verification (12/12), and the highest method-code alignment (14/15), while matching or exceeding human expert performance on all five tasks. ScientistOne further generalizes to six additional tasks spanning medical imaging, fine-grained recognition, 3D perception, and language modeling, achieving state-of-the-art on Parameter Golf and gold medals on MLE-Bench tasks where baselines fail entirely.

2605.26339 2026-05-27 cs.LG cs.CL

QAM-W: Joint 2D Codebook Quantization for LLM Weights via Hadamard Rotation and Activation-Aware Scaling

QAM-W: 通过哈达玛旋转和激活感知缩放实现LLM权重的联合2D码本量化

Preetam Sharma, Kacper Dobek

发表机构 * Independent Research（独立研究）； Institute of Computing Science（计算科学研究所）； Poznan University of Technology（波兹南技术大学）

AI总结提出QAM-W方法，通过L2归一化、块哈达玛旋转和2D坐标配对量化，结合激活感知缩放，在约5.5 bpw下使困惑度接近BF16，优于极坐标编码，并在5-6 bpw范围内保持质量。

AI中文摘要

标量后训练量化器丢弃了权重行内的成对坐标结构。我们引入QAM-W（权重正交幅度调制），一种恢复该结构的编解码器：每行经过L2归一化、块哈达玛旋转、配对为2D坐标，并针对在单位圆高斯上训练的单个Lloyd-Max码本进行量化，同时采用激活感知的每通道缩放。在跨越四个家族（1.1B-13B参数）的五种LLM和八种量化配置的跨模型研究中，激活感知变体在约5.5 bpw下，每个模型的WikiText-2困惑度保持在BF16的±0.4%以内，以少32%的权重比特匹配SmoothQuant W8A8质量包络。联合2D编码在相同比特率下，在ΔPPL上优于极坐标（幅度×相位）编码2-15个百分点，且与BF16的配对KL散度在37个（方法，模型）行上以Spearman ρ=0.99跟踪ΔPPL%，与从编解码器失真到KL散度的单调复合界一致。3.5 bpw变体在量化容忍架构上具有竞争力。在严格的4 bpw下，旋转码本前沿方法QTIP优于QAM-W；贡献在于质量保持的5-6 bpw波段。

英文摘要

Scalar post-training quantizers discard pairwise coordinate structure within weight rows. We introduce QAM-W (Quadrature Amplitude Modulation for Weights), a codec that recovers this structure: each row is L2-normalized, block-Hadamard rotated, paired into 2D coordinates, and quantized against a single Lloyd-Max codebook trained on the unit circular Gaussian, with activation-aware per-channel scaling. In a cross-model study spanning five LLMs from four families (1.1B-13B parameters) and eight quantized configurations, the activation-aware variant at ≈5.5 bpw stays within ±0.4% of BF16 WikiText-2 perplexity on every model, matching the SmoothQuant W8A8 quality envelope at 32% fewer weight bits. Joint 2D coding outperforms polar (amplitude × phase) coding by 2-15 pp ΔPPL at equal bitrate, and paired KL against BF16 tracks ΔPPL% at Spearman ρ = 0.99 across 37 (method, model) rows, consistent with a monotone composite bound from codec distortion to KL divergence. A 3.5 bpw variant is competitive on quantization-tolerant architectures. At strict 4 bpw, the rotated-codebook frontier method QTIP outperforms QAM-W; the contribution is the quality-preserving 5-6 bpw band.
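The encoding steps named in the abstract (L2 normalization, block-Hadamard rotation, pairing into 2D coordinates, nearest-codeword lookup) can be sketched roughly as follows. This is not the authors' implementation. The 16-point grid codebook is a stand-in for the trained Lloyd-Max codebook, the rescaling by the square root of the row length is an assumption, activation-aware scaling is omitted, and all function names are hypothetical.

```python
import numpy as np

def hadamard(n):
    """Orthonormal Sylvester Hadamard matrix; n must be a power of two."""
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)

def encode_row(row, block, codebook):
    """Return the row norm and one codebook index per 2D coordinate pair."""
    norm = np.linalg.norm(row)
    unit = row / norm                                        # L2-normalize the row
    rotated = (unit.reshape(-1, block) @ hadamard(block).T).reshape(-1)
    pairs = rotated.reshape(-1, 2) * np.sqrt(len(row))       # assumed rescale to ~unit variance
    dists = np.linalg.norm(pairs[:, None, :] - codebook[None, :, :], axis=2)
    return norm, dists.argmin(axis=1)

def decode_row(norm, indices, block, codebook, dim):
    """Invert encode_row up to quantization error."""
    rotated = (codebook[indices] / np.sqrt(dim)).reshape(-1, block)
    unit = (rotated @ hadamard(block)).reshape(-1)           # inverse of the orthonormal rotation
    return norm * unit

# Stand-in codebook: a 4 x 4 grid of 2D points (4 bits per pair, 2 bits per weight).
levels = np.array([-1.5, -0.5, 0.5, 1.5])
codebook = np.array([[x, y] for x in levels for y in levels])

rng = np.random.default_rng(0)
weights = rng.standard_normal(256)                           # hypothetical weight row
norm, indices = encode_row(weights, block=64, codebook=codebook)
restored = decode_row(norm, indices, block=64, codebook=codebook, dim=256)
relative_error = np.linalg.norm(weights - restored) / np.linalg.norm(weights)
```

Quantizing pairs jointly, rather than each coordinate on its own, is the property the abstract attributes to the 2D codebook. The grid used here does not exploit it and only shows where such a codebook would plug in.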

2605.26333 2026-05-27 cs.AI

Managing Uncertainty in LLM-Generated Procedural Knowledge for Virtual Laboratory Planning

管理虚拟实验室规划中LLM生成程序性知识的不确定性

Polychronis Karpodinis, Dimitris Kalles

发表机构 * School of Science and Technology, Hellenic Open University（希腊开放大学科学与技术学院）

AI总结针对LLM生成实验程序存在的不确定性，提出一个原型框架，通过结构化领域表示和不确定的状态转移样本提取候选程序规则，转化为显式约束并修复不确定步骤，以提升虚拟实验室规划的可靠性。

AI中文摘要

教育虚拟实验室可以使实验培训更具可扩展性、适应性和可访问性，尤其是在学生接触物理实验室设施有限的情况下。然而，编写新的模拟实验程序仍然成本高昂：教育工作者必须描述新设备，定义仪器和材料如何交互，并指定可在虚拟环境中执行或评估的有效程序流程。大型语言模型可以通过生成详细的实验程序来辅助这一编写过程，但其输出不应被视为可直接执行的计划。它们可能遗漏必要的操作，步骤顺序错误，或产生逻辑上不正确或与实验室设备不兼容的指令。本文提出了一个用于管理虚拟实验室规划中LLM生成程序性知识不确定性的原型框架。该框架旨在通过使用结构化领域表示和不确定的LLM生成状态转移样本来提取候选程序规则，将其转化为显式且可检查的约束，并利用它们修复不确定的程序步骤，从而减少程序不确定性。尽管动机领域是教育虚拟实验室，但底层问题更为普遍：在结构化交互环境中管理用于行动规划的不确定程序性知识。我们通过一个涉及实验室仪器、容器、工具和材料转移操作的虚拟实验室领域来展示该方法。

英文摘要

Educational virtual laboratories can make experimental training more scalable, adaptive, and accessible, especially when students have limited access to physical laboratory facilities. However, authoring new simulated laboratory procedures remains costly: educators must describe new equipment, define how instruments and materials interact, and specify valid procedural flows that can be executed or assessed inside the virtual environment. Large language models can assist in this authoring process by generating detailed experimental procedures, but their output should not be treated as directly executable plans. They may omit necessary actions, arrange steps in the wrong order, or produce instructions that are logically incorrect or incompatible with the laboratory equipment. This paper presents a prototype framework for managing uncertainty in LLM-generated procedural knowledge for virtual laboratory planning. The framework aims to reduce procedural uncertainty by using structured domain representations and uncertain LLM-generated state-transition samples to extract candidate procedural rules, transform them into explicit and inspectable constraints, and use them to repair uncertain procedural steps. Although the motivating domain refers to educational virtual laboratories, the underlying problem is more general: managing uncertain procedural knowledge for action planning in structured interactive environments. We illustrate the approach in a virtual laboratory domain involving laboratory instruments, containers, tools, and material-transfer actions.

2605.26332 2026-05-27 cs.CV cs.AI

Erased but Exploitable: Black-box Embedding-Aware Prompting Against Unlearned Text-to-Image Diffusion Models

被擦除但可被利用：针对已遗忘文本到图像扩散模型的黑盒嵌入感知提示攻击

Arian Komaei Koma, Seyed Amir Kasaei, AmirMahdi Sadeghzadeh, Mohammad Hossein Rohban

发表机构 * Department of Computer Engineering（计算机工程系）

AI总结提出一种黑盒嵌入感知对抗提示攻击BEAP，利用大语言模型迭代生成有效对抗提示，以恢复被遗忘概念，并在攻击成功率上提升超过60%。

AI中文摘要

机器遗忘旨在从预训练的文本到图像扩散模型中移除特定概念，然而已有多种白盒和黑盒攻击被提出以使模型生成这些被遗忘的概念。然而，这些攻击并未假设现实的威胁模型，即它们要么假设可以访问模型权重，要么产生无意义的对抗提示，即使通过简单的基于规则的防护也能轻易检测到。本文旨在填补这一空白。我们提出BEAP，一种黑盒、嵌入感知的对抗提示攻击，利用大语言模型（LLM）迭代生成有效的对抗提示并利用这些隐藏的漏洞。BEAP在文本空间中执行嵌入感知搜索，结合多个奖励信号：被遗忘概念的存在性、文本-图像对齐和图像质量，以优化生成的提示。与之前的攻击方法不同，BEAP使其提示对安全过滤器不可检测，同时生成高质量图像。大量实验表明，BEAP的攻击成功率（ASR）比先前方法提高了60%以上，而每次成功攻击平均仅需15个提示。警告：本文包含可能具有冒犯性或令人不安性质的模型输出。

英文摘要

Machine unlearning aims to remove specific concepts from pretrained text-to-image diffusion models, yet several white- and black-box attacks have been introduced to make the model generate such unlearned concepts. These attacks, nevertheless, do not assume a realistic threat model, i.e. they either assume access to the model weights, or result in gibberish adversarial prompts that could be easily detected even through naive rule-based safeguarding. We aim to address this gap in this paper. We introduce BEAP, a black-box, embedding-aware adversarial prompting attack that leverages a large language model (LLM) to iteratively generate effective adversarial prompts and exploit such hidden vulnerabilities. BEAP performs an embedding-aware search in text space, combining multiple reward signals: unlearned concept presence, text-image alignment, and image quality, to refine generated prompts. Unlike previous attack methods, BEAP keeps its prompts undetectable to safety filters while producing high-quality images. Extensive experiments show that BEAP improves the Attack Success Rate (ASR) by more than 60% over prior methods, while requiring only an average of fifteen prompts per successful attack. Warning: This paper contains model outputs that may be offensive or upsetting in nature.

2605.26330 2026-05-27 cs.RO

NightSight: Passive Computation for Navigation in Dark Using Events

NightSight：利用事件在黑暗中进行被动计算导航

Deepak Singh, Brijan Vaghasiya, Shreyas Khobragade, Nitin Sanket

发表机构 * NVIDIA

AI总结提出一种结合单目事件相机、编码孔径镜头和红外点投影仪的轻量级感知方法，通过卷积神经网络解码深度相关模糊签名生成密集深度图，实现小型空中机器人在完全黑暗环境中的实时导航。

Comments 6 pages, 7 figures

AI中文摘要

小型空中机器人由于其敏捷性、低成本以及在大型平台无法进入的杂乱空间中穿行的能力，特别适合在受限和危险环境中进行搜索和救援。然而，在完全黑暗中实现自主导航仍然是一个重大挑战，因为小型空中机器人难以容纳需要大量载荷、功率或计算的感知系统。在这项工作中，我们提出了一种轻量级感知方法，结合单目事件相机、编码孔径镜头和红外点投影仪，以实现在此类条件下的导航。通过编码孔径成像的投影图案会产生深度相关的模糊签名，隐式编码场景几何。我们训练了一个卷积神经网络，仅使用从简单平面墙设置生成的合成数据来将这些签名解码为密集深度图。尽管训练条件有限，该模型能零样本泛化到复杂的真实场景。我们的系统在NVIDIA Jetson Orin Nano上以20 Hz实时运行，展示了其对资源受限平台的适用性。我们进一步分析了不同编码孔径设计对深度估计性能的影响。我们的方法在2.5米范围内实现了高精度（l1误差7.0厘米，2.80%误差）。这些结果突显了结合结构光照明、编码光学和事件传感在完全黑暗中实现鲁棒感知和导航的潜力。

英文摘要

Small aerial robots are particularly well-suited for search and rescue in confined and hazardous environments due to their agility, low cost, and ability to traverse cluttered spaces that are inaccessible to larger platforms. However, enabling autonomous navigation in complete darkness remains a significant challenge, because small aerial robots cannot easily accommodate perception systems that demand substantial payload, power, or computation. In this work, we present a lightweight perception approach that combines a monocular event camera, a coded aperture lens, and an infrared dot projector to enable navigation in such conditions. The projected pattern, when imaged through the coded aperture, produces depth-dependent blur signatures that implicitly encode scene geometry. We train a convolutional neural network to decode these signatures into dense depth maps using only synthetic data generated from a simple planar wall setup. Despite this minimal training regime, the model generalizes zero-shot to complex real-world scenes. Our system operates in real time at 20 Hz on an NVIDIA Jetson Orin Nano, demonstrating suitability for resource-constrained platforms. We further analyze the impact of different coded aperture designs on depth estimation performance. Our approach achieves high accuracy (L1 error of 7.0 cm, 2.80% error) at ranges up to 2.5 m. These results highlight the potential of combining structured illumination, coded optics, and event-based sensing for enabling robust perception and navigation in complete darkness.

2605.26329 2026-05-27 cs.AI

JobBench: Aligning Agent Work With Human Will

JobBench：使智能体工作符合人类意愿

Yuetai Li, Yichen Feng, Zhangchen Xu, Zixian Ma, Kaiyuan Zheng, Fengqing Jiang, Xinghua Sun, Rulin Shao, Zichen Chen, Yue Huang, Xinyang Han, Brian Lee, Kayla Xu, Shenglai Zeng, Hang Hua, Xiangliang Zhang, Basel Alomair, Ranjay Krishna, Luke Zettlemoyer, Pang Wei Koh, Bhaskar Ramasubramanian, Luyao Niu, Xiang Yue, Radha Poovendran

发表机构 * University of Washington（华盛顿大学）； University of California, Santa Barbara（加州大学圣芭芭拉分校）； Stanford University（斯坦福大学）； Carnegie Mellon University（卡内基梅隆大学）； Northwestern University（西北大学）； University of Notre Dame（圣母大学）； University of California, Berkeley（加州大学伯克利分校）； Michigan State University（密歇根州立大学）； MIT-IBM Watson AI Lab（MIT-IBM沃森人工智能实验室）； Bake AI； King Abdulaziz City for Science and Technology（国王阿卜杜勒阿齐兹科技城）； Western Washington University（西华盛顿大学）； University of Chicago（芝加哥大学）

AI总结提出JobBench基准，通过专家识别的高优先级工作流程评估AI智能体，以人类需求为中心而非经济价值，覆盖35个职业的130个任务，使用事实锚定的评分链评估，最强模型仅达45.9%，旨在推动从替代到增强的劳动力市场影响。

AI中文摘要

当前职业AI智能体的基准主要基于经济价值，讲述了一个替代的故事。我们引入了JobBench，该基准根据专家识别为高优先级委托的工作流程评估AI智能体，基于人类需求赋权，而非用GDP价值替代他们。JobBench覆盖了35个职业的130个智能体任务。每个任务被打包成一个包含异构参考文件的工作空间，要求智能体在真实专业工作的杂乱信息流中进行推理。输出由事实锚定的评分链进行评分，每个任务平均有35.6个二元标准。我们评估了36个模型；最强的Claude Opus 4.7在Claude Code下仅达到45.9%。我们希望JobBench将社区的目标劳动力市场影响从替代转向增强：构建能够完成人类真正希望委托的任务的智能体，而不仅仅是经济价值最高的任务。

英文摘要

Current benchmarks for occupational AI agents are scoped primarily by economic value, telling a replacement story. We introduce JobBench, which evaluates AI agents on the workflows that experts identify as high-priority for delegation, empowering humans based on their needs instead of replacing them with GDP value. JobBench covers 130 agentic tasks across 35 occupations. Each task is packaged as a workspace of heterogeneous reference files, requiring the agent to reason through the cluttered information streams of real professional work. Outputs are graded by a fact-anchored chain of rubrics, averaging 35.6 binary criteria per task. We evaluate 36 models; the strongest, Claude Opus 4.7 under Claude Code, reaches only 45.9%. We hope JobBench shifts the community's target labour-market effect from replacement to enhancement: building agents that do what humans actually want delegated, not only what is most economically valuable.

2605.26328 2026-05-27 cs.CV

RadarSim: Simulating Single-Chip Radar via Multimodal Neural Fields

RadarSim: 通过多模态神经场模拟单芯片雷达

Chuhan Chen, Tianshu Huang, Akarsh Prabhakara, Chaithanya Kumar Mummadi, Zhongxiao Cong, Anthony Rowe, Matthew O'Toole, Deva Ramanan

发表机构 * Carnegie Mellon University（卡内基梅隆大学）； Bosch Research（博世研究）； University of Wisconsin–Madison（威斯康星大学麦迪逊分校）

AI总结提出RadarSim，一种利用RGB相机高角分辨率从相机初始化的神经场生成多普勒雷达距离图像的统一可微渲染器，以解决雷达空间分辨率低的问题，并产生比纯雷达重建更清晰的几何和多普勒距离帧。

Comments Accepted to 3DV 2026. Project website: https://sally-chen.github.io/radar-sim/

AI中文摘要

雷达是相机的理想补充：两者都是廉价、固态的传感器，相机提供精细的角分辨率，而雷达在恶劣天气下提供度量深度和鲁棒性。然而，雷达数据比相机图像更难解释，且不同传感器之间差异显著，这增加了对仿真以进行传感器和处理流水线原型设计的依赖。最近将雷达重建视为新视角合成问题的工作在重建雷达相关几何和模拟低级雷达数据方面显示出巨大潜力。然而，此类方法受到底层雷达低空间分辨率的限制。为了解决这个问题，我们提出了一种统一的可微渲染器RadarSim，它利用RGB相机的高角分辨率从相机初始化的神经场生成多普勒雷达距离图像。通过使用来自定制手持装置的校准雷达相机记录的新数据集，我们证明RadarSim比纯雷达重建产生更清晰的几何和多普勒距离帧。

英文摘要

Radars are an ideal complement to cameras: both are inexpensive, solid-state sensors, with cameras offering fine angular resolution, while radars provide metric depth and robustness under adverse weather. However, radar data is more difficult to interpret than camera images and varies significantly between sensors, necessitating increased reliance on simulation for prototyping sensors and processing pipelines. Recent work treating radar reconstruction as a novel view synthesis problem has shown great promise in reconstructing radar-relevant geometry and simulating low-level radar data. However, such methods are constrained by the low spatial resolution of the underlying radar. To address this, we propose a unified differentiable renderer, RadarSim, which leverages the high angular resolution of RGB cameras to generate Doppler radar range images from a camera-initialized neural field. Using a novel data set of calibrated radar camera recordings from a custom hand-held rig, we demonstrate that RadarSim produces sharper geometry and Doppler range frames than radar-only reconstructions.

2605.26327 2026-05-27 cs.LG

Reparametrizing Shampoo and SOAP for Subspace Basis Updates and BFloat16 Storage

Alan Milligan, Zikun Xu, Simon Lacoste-Julien, Felix Dangel, Wu Lin

Affiliations: Mila & Université de Montréal Microsoft; Concordia University & Mila; University of Central Florida

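AI summary: Reparametrizes the preconditioner so that only part of the basis is updated, via QR decomposition in a subspace, while supporting BFloat16 storage. This reduces the compute and memory overhead of Shampoo-based methods and mitigates the performance degradation caused by low-precision storage.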

Comments Preprint, work in progress

English abstract

Shampoo-based methods, such as KL-Shampoo and SOAP, have demonstrated strong performance in training neural networks and rely on QR decomposition. Because existing QR implementations require single-precision (FP32) arithmetic and remain computationally expensive, these methods become time- and memory-intensive when their preconditioning matrices are large. Moreover, using BFloat16 (BFP16) storage to reduce memory usage can degrade the performance of Shampoo-based methods. We propose a reparametrization of the preconditioner that supports BFP16 storage and forms a complete basis by combining updated basis vectors with unchanged ones. By updating only part of the basis through QR decomposition in a subspace, our approach reduces computational overhead while mitigating the performance degradation caused by BFP16 storage. Our approach applies broadly to Shampoo-based methods that employ QR decomposition, including KL-Shampoo, SOAP, and KL-SOAP. In particular, it improves the performance of SOAP and KL-SOAP under BFP16 storage, enabling KL-SOAP to match or exceed KL-Shampoo. Overall, our approach makes Shampoo-based methods more memory- and time-efficient.
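
The idea of refreshing only part of an orthonormal basis and keeping the rest unchanged can be illustrated with a small sketch. This is not the authors' reparametrization: the helper names (`to_bf16`, `qr_in_complement`, `partial_basis_update`) are hypothetical, the QR step is approximated by modified Gram-Schmidt on plain Python lists, and BFloat16 storage is only emulated by rounding float32 bit patterns.

```python
import math
import struct


def to_bf16(x: float) -> float:
    """Emulate BFloat16 storage: round a finite float to the nearest bfloat16 value."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # round to nearest, ties to even
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def qr_in_complement(directions, unchanged):
    """Q factor (modified Gram-Schmidt) of `directions` after projecting out
    the span of the orthonormal `unchanged` vectors (an assumed construction)."""
    q_new = []
    for v in directions:
        w = list(v)
        for _ in range(2):  # a second pass reduces round-off error
            for q in unchanged + q_new:
                c = dot(w, q)
                w = [wi - c * qi for wi, qi in zip(w, q)]
        norm = math.sqrt(dot(w, w))
        q_new.append([wi / norm for wi in w])
    return q_new


def partial_basis_update(basis, k, directions):
    """Replace the first k basis vectors; the remaining ones are reused as-is."""
    unchanged = basis[k:]
    updated = qr_in_complement(directions, unchanged)
    return updated + unchanged


def max_orthogonality_error(basis):
    """Largest entry of |B^T B - I| for a basis given as a list of vectors."""
    n = len(basis)
    return max(
        abs(dot(basis[i], basis[j]) - (1.0 if i == j else 0.0))
        for i in range(n)
        for j in range(n)
    )


# Hypothetical example: a 4-dimensional identity basis, refresh 2 of 4 vectors.
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
new_basis = partial_basis_update(
    identity, 2, [[1.0, 1.0, 0.0, 0.5], [1.0, -1.0, 0.3, 0.0]]
)
stored = [[to_bf16(x) for x in vector] for vector in new_basis]

print(max_orthogonality_error(new_basis))  # near machine precision
print(max_orthogonality_error(stored))     # larger, due to bfloat16 rounding
```

In this toy setting the orthogonalization cost scales with the number of refreshed vectors rather than the full basis, and the gap between the two printed errors shows how low-precision storage perturbs orthonormality.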

2605.26324 2026-05-27 cs.LG cs.AI cs.NA math.NA

Semigroup Consistency as a Diagnostic for Learned Physics Simulators

Lennon J. Shikhman

Affiliations: Georgia Institute of Technology

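AI summary: Proposes the normalized semigroup error as a diagnostic metric for the temporal-composition and long-horizon rollout consistency of learned physics simulators. Experiments on heat-conduction and Burgers dynamics show that it correlates positively with rollout degradation.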

Comments 10 pages, 3 figures, 3 tables. Accepted to the AI4Physics Workshop at the 43rd International Conference on Machine Learning
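
The summary names a normalized semigroup error but does not define it. A minimal sketch, assuming the metric compares one step of length s + t against two composed steps and normalizes by the norm of the direct prediction (this normalization, and all function names, are assumptions rather than the paper's definition):

```python
import math


def exact_step(u, dt, rate=1.0):
    """Exact flow of du/dt = -rate * u (a single decaying heat mode)."""
    return [x * math.exp(-rate * dt) for x in u]


def surrogate_step(u, dt, rate=1.0):
    """Stand-in for a learned simulator: first-order approximation of the decay."""
    return [x * (1.0 - rate * dt) for x in u]


def norm(u):
    return math.sqrt(sum(x * x for x in u))


def normalized_semigroup_error(step, u, s, t):
    """|| step(u, s+t) - step(step(u, t), s) || / || step(u, s+t) ||."""
    direct = step(u, s + t)
    composed = step(step(u, t), s)
    diff = [a - b for a, b in zip(direct, composed)]
    return norm(diff) / norm(direct)


u0 = [1.0, -2.0, 0.5]
err_exact = normalized_semigroup_error(exact_step, u0, 0.1, 0.1)
err_surrogate = normalized_semigroup_error(surrogate_step, u0, 0.1, 0.1)
print(err_exact)      # essentially zero: the exact flow composes consistently
print(err_surrogate)  # s*t / |1 - s - t| = 0.0125 for this surrogate
```

The exact flow satisfies the semigroup property up to round-off, while the first-order surrogate does not, so the error separates the two without needing ground-truth trajectories.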

2605.26322 2026-05-27 cs.AI

OmniToM: Benchmarking Theory of Mind in LLMs via Explicit Belief Modeling

Adam Bawatneh, Sagar Sapkota, Amrit Singh Bedi, Santu Karmaker, Mubarak Shah

Affiliations: University of Central Florida

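AI summary: Proposes the OmniToM benchmark, which uses explicit belief structures, built in two stages (belief extraction and belief labeling), to evaluate how well LLMs track the mental states of different actors in a narrative. It reveals bottlenecks in knowledge-access and representational decisions.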

Comments 30 pages, 8 figures, 19 tables; includes appendix

English abstract

Theory of Mind (ToM), the ability to infer others' knowledge, intentions, and emotions, is commonly evaluated in large language models (LLMs) using end-point question answering, where performance is judged solely by the final answer to a social reasoning query. This paradigm obscures whether the model actually constructs the underlying mental-state representations required for robust reasoning, particularly in scenarios involving divergent, evolving, or mistaken beliefs. In order to address this research gap, we introduce OmniToM, a benchmark that directly evaluates these representations by requiring explicit modeling of belief structures for all relevant actors within a narrative. These structures are composed of belief propositions: minimal statements of what an actor takes to be true about the world or another actor's mental state, allowing knowledge, intentions, emotions, and false beliefs to be analyzed in a common format. Models are evaluated in two stages: Stage 1: Belief Extraction, which extracts from the story the beliefs relevant to its social dynamics, and Stage 2: Belief Labeling, which assigns each belief a seven-dimensional schema label covering recursive order, truth status, knowledge access, explicitness, content type, mental source, and context. Built from 895 stories from the existing ToMBench story corpus and augmented with 22,343 labeled belief propositions, OmniToM uses a human-calibrated LLM-assisted annotation pipeline. Across diverse models in zero-shot evaluation, OmniToM reveals an actor-specific belief-tracking bottleneck: current LLMs struggle with the knowledge-access and representational decisions required to transform narrative facts into actors' beliefs and shared mental states.
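
The belief-proposition format described in the abstract could be represented roughly as follows. Only the seven dimension names come from the abstract; the class names, field types, label values, and the example story are hypothetical and not taken from the benchmark.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class BeliefLabel:
    """Seven-dimensional schema label; the value vocabularies are assumed."""
    recursive_order: int   # e.g. 1 = "A believes X", 2 = "A believes B believes X"
    truth_status: str
    knowledge_access: str
    explicitness: str
    content_type: str
    mental_source: str
    context: str


@dataclass(frozen=True)
class BeliefProposition:
    """A minimal statement of what an actor takes to be true."""
    actor: str
    statement: str
    label: BeliefLabel


def beliefs_of(actor, propositions):
    """Return the propositions held by one actor."""
    return [p for p in propositions if p.actor == actor]


# Hypothetical false-belief story: Anne moves the marble while Sally is away.
propositions = [
    BeliefProposition(
        actor="Sally",
        statement="The marble is in the basket.",
        label=BeliefLabel(1, "false", "did not witness the move", "implicit",
                          "knowledge", "perception", "object location"),
    ),
    BeliefProposition(
        actor="Anne",
        statement="The marble is in the box.",
        label=BeliefLabel(1, "true", "witnessed the move", "explicit",
                          "knowledge", "perception", "object location"),
    ),
]

print([p.statement for p in beliefs_of("Sally", propositions)])
```

Keeping one record per actor and statement makes diverging beliefs about the same fact directly comparable, which is the property the two-stage evaluation relies on.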

Memory Architectures for Multi-Turn Text-to-SQL: A Benchmark and Empirical Study

Zero-Shot Object Re-Identification in Egocentric Kitchen Videos via Multi-Stage SAM3 Feature Fusion

Detail Consistent Stage-Wise Distillation for Efficient 3D MRI Segmentation

Multi-Modal Building Inspection via Perceiver IO Fusion of Satellite and Street-Level Imagery

VisualNeedle: Benchmarking Active Visual Search in Information-Dense Scenes

BioFact-MoE: Biologically Factorized Mixture of Experts for Vision-Language Prognostic Modeling in Hepatocellular Carcinoma

Online Learning on Hidden-Convex Losses via Algorithmic Equivalence: Optimal Regret, Geometric Barrier, and Bandit Feedback

Joint Instance Segmentation and Geometric Attribute Regression for Roof Structures in Aerial Imagery

Cultural Value Alignment Via Latent Activation Steering in Large Language Models

Why LLMs Hallucinate on Structured Knowledge: A Mechanistic Analysis of Reasoning over Linearized Representations

In-Context Optimization for Retrieval-Augmented Generation: A Gradient-Descent Perspective

Energy-Gated Attention and Wavelet Positional Encoding: Complementary Inductive Biases for Transformer Attention

Personalized Generative Models for Contextual Debiasing

RICE-PO: Turning Retrieval Interactions into Credit Signals for Reasoning Agents

When Correct Demonstrations Hurt: Rethinking the Role of Exemplars in In-Context Learning

Closing the Loop in Teleoperation: Episode-Level Data Quality Assessment and Feedback for High-Quality Demonstration Collection

RCSP: Risk-Sensitive Conjectural Scenario Planning for Safe Dynamic Robot Navigation

The Daily Dose: Workflow-Integrated Large Language Model Automation for Clinical Summarization and Trial Identification in Radiation Oncology

MechRL: Reinforcement Learning Agents Perform Circuit Discovery for Mechanistic Interpretability

A PAC-Bayesian View of Generalisation for Physics-Informed Machine Learning

ScientistOne: Towards Human-Level Autonomous Research via Chain-of-Evidence

QAM-W: Joint 2D Codebook Quantization for LLM Weights via Hadamard Rotation and Activation-Aware Scaling

Managing Uncertainty in LLM-Generated Procedural Knowledge for Virtual Laboratory Planning

Erased but Exploitable: Black-box Embedding-Aware Prompting Against Unlearned Text-to-Image Diffusion Models

NightSight: Passive Computation for Navigation in Dark Using Events

JobBench: Aligning Agent Work With Human Will

RadarSim: Simulating Single-Chip Radar via Multimodal Neural Fields

Reparametrizing Shampoo and SOAP for Subspace Basis Updates and BFloat16 Storage

Semigroup Consistency as a Diagnostic for Learned Physics Simulators

OmniToM: Benchmarking Theory of Mind in LLMs via Explicit Belief Modeling