arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.03043 2026-06-03 cs.CL

The Geometry of LLM-as-Judge: Why Inter-LLM Consensus Is Not Human Alignment

Sourabrata Mukherjee, Hamna Hamna, Kalika Bali, Sunayana Sitaram

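AI summary: Geometric measurements show that LLM judges agree strongly with one another but align poorly with humans; their score subspace is nearly orthogonal to the human subspace, so the consensus stems from subspace collapse rather than human alignment.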

Abstract

LLMs-as-judges are now standard, yet judges agree strongly with one another while agreeing only weakly with humans. We test whether this reflects shared signal or shared bias by measuring four geometric quantities on the standard LLM-as-judge stack across four community-built Indic datasets, eight Indic languages, and 41 LLM judges: score spread, effective rank, principal angle to the human subspace, and stacked correlations among judges and humans, all with bootstrap confidence intervals. On subjective rubrics, judges use less than half the human score range ($\sigma_J / \sigma_H \approx 0.3$--$0.5$). Their evaluation axis is nearly orthogonal to the human one and noticeably further from humans than humans are from each other ($87^\circ$--$89^\circ$ versus $78^\circ$--$81^\circ$). Inter-LLM agreement exceeds LLM--human agreement ($r_{LL} \approx 0.35$ versus $r_{LH} \approx 0.27$--$0.32$). On a rubric with a verifiable factual answer, the same diagnostics fall back into the human range (axis $58.5^\circ$; $r_{LH} = 0.519$). Fine-tuning and preference optimization recover spread ($0.32 \rightarrow 1.08$) but barely move the axis (still $87^\circ$--$88^\circ$). Only post-hoc calibration on a small human-anchored set improves all four community-health rubrics together, placing a calibrated 24B Indic judge ($r = 0.184$) ahead of GPT-5.5 ($r = 0.123$), yet still short of human reliability (human-human $r = 0.474$ on the verifiable rubric). We argue that inter-LLM agreement should be considered evidence of human alignment only when a direct geometric check on the judge's score subspace passes; otherwise, the consensus reflects agreement within a collapsed subspace.
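
Three of the diagnostics named in this abstract (the spread ratio $\sigma_J / \sigma_H$, the angle between a judge axis and a human axis, and a judge-human correlation) can be illustrated with a minimal sketch. The function names and toy ratings below are assumptions made for illustration; the abstract does not give the paper's exact estimators, so plain textbook definitions are used.

```python
import math
from statistics import pstdev


def spread_ratio(judge_scores, human_scores):
    """Ratio sigma_J / sigma_H of population standard deviations."""
    return pstdev(judge_scores) / pstdev(human_scores)


def axis_angle_deg(u, v):
    """Angle in degrees between two axes (undirected lines), in [0, 90]."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # abs() ignores the sign of each direction; min() guards against rounding above 1.
    cosine = min(1.0, abs(dot) / (norm_u * norm_v))
    return math.degrees(math.acos(cosine))


def pearson_r(x, y):
    """Pearson correlation between two equally long score lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy ratings: the judge uses a narrower range than the human.
judge = [3, 3, 4, 3, 4, 3]
human = [1, 5, 4, 2, 5, 3]
print(spread_ratio(judge, human))               # below 1 when the judge's range is compressed
print(axis_angle_deg([1.0, 0.0], [0.0, 1.0]))   # two orthogonal axes
print(pearson_r(judge, human))
```

A spread ratio well below 1 together with an axis angle near 90 degrees is the pattern the abstract describes for subjective rubrics.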

2606.03036 2026-06-03 cs.AI

TriEval: A Resource-Efficient Pipeline for LLM Bias, Toxicity, and Truthfulness Assessment

Akshatha Srikantha, Manpreet Singh, Yash Jajoo, Shyamal Lakhanpal

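AI summary: Proposes the TriEval pipeline, which evaluates the bias, toxicity, and truthfulness of LLM outputs together, runs efficiently on a standard laptop, and reveals differences between open-source and closed-source models in toxicity and truthfulness.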

Abstract

LLMs have evolved from basic chatbots to the backbone of the AI ecosystem, now widely used in healthcare, schools, and government services. The domain-wide adoption of LLMs necessitates continuous evaluation to ensure their safety and fairness. Common issues encountered after deploying LLMs include inconsistent outputs and hallucinations of incorrect information. Although numerous LLM evaluation tools exist, most are limited to testing a single parameter at a time or require massive computational resources that are not accessible to most researchers. TriEval addresses these challenges by evaluating LLM outputs across multiple parameters, including bias, toxicity, and truthfulness together, while minimizing computing resources. The pipeline is compatible with both open- and closed-source models and runs on a standard laptop without a GPU cluster. TriEval has been tested on four models: Llama 3 8B, Mistral 7B, Gemma 2 9B, and Claude Haiku. The results show clear differences between open-source and closed-source models, especially in terms of toxicity and truthfulness. TriEval is being released as open source to enable broader access for researchers with limited computational resources.

2606.02993 2026-06-03 cs.LG math.OC math.RT math.ST stat.ML stat.TH

Neural Networks Provably Learn Spectral Representations for Group Composition

Jianliang He, Leda Wang, Fengzhuo Zhang, Siyu Chen, Zhuoran Yang

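AI summary: By lifting the projected gradient flow to the Fourier domain, proves that a two-layer neural network trained on the group composition task converges almost surely to a single irreducible representation, and gives a representation-theoretic view of feature learning and low-rank compression.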

Abstract

Understanding how structured internal structure emerges during neural network training is central to the study of deep learning. We investigate this phenomenon through the group composition task, where a two-layer neural network is trained to predict $g_1 \star g_2$ for elements of a finite group $G$. By lifting the projected gradient flow to the Fourier domain, we demonstrate that the training dynamics are governed by a Riemannian gradient ascent on a representation-theoretic energy functional. We prove that, under random initialization, this flow drives each neuron to converge almost surely toward a single irreducible representation, while the cross-layer Fourier coefficients achieve a rotational rank-one alignment. This framework provides a representation-theoretic account of feature learning and characterizes a novel low-rank compression phenomenon for matrix-valued group representations. Moreover, for Abelian groups, we provide a complete population-level description: random initialization promotes uniform diversification across nontrivial representations and induces Haar-uniform phases, jointly approximating the indicator via a majority-vote mechanism. We further prove that both phase alignment and representation competition emerge with exponential convergence rates.

2606.02983 2026-06-03 cs.CL

A Locally Deployed RAG-Based Academic Advising System for Course Selection

Feng Li, Yoritaka Iwata

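AI summary: Proposes a locally deployed RAG academic advising system that uses large language models and retrieval over structured syllabus data to support course selection, prerequisite understanding, and personalized study planning in a privacy-preserving manner.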

Comments to be published in Elsevier's Procedia Computer Science (KES 2026)

Abstract

The correct sequence of courses in the curriculum, based on the prerequisites between courses, is of great importance for students to develop their knowledge and skills holistically. However, students crafting this sequence in isolation frequently struggle with cognitive limitations and information overload, which lead to confusion. At the same time, educational institutions have difficulty providing adequate academic advice on the correct sequence because of limited educational resources. To address these challenges, we propose a locally deployed RAG-based academic advising system grounded in syllabus information. By combining large language models with retrieval from structured syllabus data, the system is designed to support course selection, prerequisite understanding, and personalized study planning in a privacy-preserving manner.

2606.02974 2026-06-03 cs.AI cs.HC cs.LG

WISE-HAR: A Generalizable Ensemble Deep Learning Framework for WiFi-Based Human Activity Recognition

Maheen Arshad, Qindeel E Zahra, Muhammad Khuram Shahzad

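AI summary: Proposes the WISE-HAR framework, which combines an ensemble of five CNN architectures, data augmentation, and cross-scenario evaluation to reach 94.87% LOS test accuracy on the Wallhack1.8k dataset while showing strong generalization.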

Comments 8 pages, 5 figures

Abstract

Human Activity Recognition (HAR) using WiFi signals has emerged as a transformative technology for smart homes, healthcare monitoring, security systems, and ambient assisted living. Unlike traditional camera-based systems that raise significant privacy concerns and fail in low-light conditions, or wearable sensors that require user compliance, WiFi-based HAR is non-intrusive, privacy-preserving, cost-effective, and works seamlessly in any lighting condition. This paper presents a comprehensive approach to recognize three distinct human activities: "No Presence" (empty room), "Walking", and "Walking + Arm-waving" using the Wallhack1.8k WiFi spectrogram dataset. We propose three key improvements to address the main challenges in WiFi-based HAR. First, to address high performance variance, we implement ensemble learning with five different CNN architectures (Deep CNN, Wide CNN, MobileNetV2, ResNet50V2, and EfficientNetB0). Second, to address the small dataset size limitation, we apply aggressive data augmentation techniques including time-warping, frequency masking, and noise addition. Third, to evaluate real-world generalization capability, we perform cross-scenario evaluation (training on Line-of-Sight and testing on Non-Line-of-Sight) and cross-antenna evaluation (training on Biquad antenna and testing on PIFA antenna). Our ensemble model achieved a test accuracy of 94.87% on the LOS scenario with Biquad antenna, outperforming the best individual model by 0.66%. Data augmentation improved Random Forest performance from 60% to 95%. Cross-scenario evaluation showed minimal accuracy drops of only 1.37% and 2.07%, demonstrating strong generalization capabilities. The results indicate that the proposed approach is robust, reliable, and suitable for real-world deployment in diverse environments with different hardware configurations.

2606.02924 2026-06-03 cs.CV

ATLAS: A Large-Scale Evaluation Benchmark for Adversarial LiDAR Perception

Mellon M. Zhang, Siddhant Panse, Zimo Fan, Akshal Dhal, Rishit Sarkar, Glen Chou

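AI summary: To fill the gap in evaluating the robustness of LiDAR perception models under black-box sensor attacks, proposes ATLAS, the first large-scale, physically grounded benchmark; using two attack modes, point injection and point removal, it reveals an asymmetry between model performance and robustness and traces it to standard data augmentation.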

Comments preprint

Abstract

Autonomous driving perception is typically evaluated on clean benchmark data, yet real-world deployment requires robustness to rare, structured, and potentially adversarial sensor anomalies. This gap is especially critical for LiDAR, where external actors can physically manipulate the sensing process to induce black-box perception failures without accessing the model. Existing LiDAR benchmarks provide little visibility into this failure mode. Prior adversarial LiDAR studies have largely centered on attack hardware, geometric and algorithmic defenses, and early-generation detectors, leaving the robustness of modern perception systems unexplored. To address this evaluation gap, we introduce ATLAS (Adversarial Temporal LiDAR Attack Suite), the first large-scale, physically grounded evaluation benchmark for LiDAR perception models under black-box sensor attacks, simulating the two primary attack modes -- point injection and point removal -- across real driving sequences. Evaluating a broad cross-section of current state-of-the-art LiDAR perception models, ATLAS reveals a surprising robustness asymmetry: models with stronger performance on standard benchmarks tend to better withstand removal attacks, yet are actually more vulnerable to injection attacks than weaker models. We trace this vulnerability to standard object database sampling augmentations, revealing how current training practices can induce architecture-agnostic robustness failures, and study initial directions for mitigating both attack modes. We release the ATLAS generation code to support extensible, reproducible evaluations as attack capabilities evolve, helping make black-box sensor robustness an explicit consideration in future LiDAR perception development.

2606.02887 2026-06-03 cs.LG cs.NA math.NA math.OC

A Nonmonotone Gradient-Based Algorithm for Symmetric Nonnegative Matrix Factorization and Graph Clustering

Ryan Swart, Johannes Brust

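AI summary: Proposes SNMPBB, the first application of nonmonotone projected Barzilai-Borwein methods to symmetric nonnegative matrix factorization, extends it to graph clustering and large-scale problems, proves global convergence, and reports substantial speedups and accuracy gains in experiments.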

Abstract

Symmetric nonnegative matrix factorization (Symmetric NMF) approximates a matrix as $WW^T$ with a nonnegative rectangular factor $W$. It has broad applications in graph clustering and machine learning. In contrast to standard NMF, projected gradient methods for the symmetric problem have been associated with slow convergence. To address this, we introduce SNMPBB, the first adaptation of nonmonotone projected Barzilai-Borwein methods to Symmetric NMF, demonstrating that gradient algorithms are significantly more effective than previously understood. We further extend SNMPBB to graph clustering using graph Laplacian regularization (Graph-SNMPBB) and to large problems with low-rank approximations (LAI-SNMPBB). For all variants, we prove global convergence to first-order stationary points and show that Barzilai-Borwein curvature information is preserved under randomized approximations. On synthetic data, SNMPBB achieves a 6-fold speedup over the alternative SymANLS for similar residuals, with advantages growing at higher ranks. Across six real-world clustering benchmarks, Graph-SNMPBB matches or exceeds SymANLS accuracy. Lastly, LAI-SNMPBB outperforms the state-of-the-art LAI-SymPGNCG on 34 SuiteSparse matrices in both runtime and residual quality.
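
To make the ingredients named in this abstract concrete, here is a minimal sketch of a generic projected gradient iteration with a Barzilai-Borwein step and a nonmonotone line search for $\min_{W \ge 0} \|A - WW^T\|_F^2$. It is not the authors' SNMPBB: the function names, the safeguards, the memory length, and the toy matrix are all assumptions chosen for illustration, in the style of standard spectral projected gradient methods.

```python
def matmul(X, Y):
    """Dense product of two list-of-lists matrices."""
    cols = list(zip(*Y))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in X]


def residual(A, W):
    """R = W W^T - A."""
    Wt = [list(col) for col in zip(*W)]
    return [[p - a for p, a in zip(prow, arow)] for prow, arow in zip(matmul(W, Wt), A)]


def objective(A, W):
    """f(W) = ||A - W W^T||_F^2."""
    return sum(r * r for row in residual(A, W) for r in row)


def gradient(A, W):
    """Gradient 4 (W W^T - A) W of f, valid when A is symmetric."""
    return [[4.0 * g for g in row] for row in matmul(residual(A, W), W)]


def inner(X, Y):
    """Frobenius inner product."""
    return sum(x * y for rx, ry in zip(X, Y) for x, y in zip(rx, ry))


def symnmf_projected_bb(A, W, iters=200, memory=5, sigma=1e-4):
    step = 1e-2                          # trial step used before BB information exists
    history = [objective(A, W)]
    G = gradient(A, W)
    for _ in range(iters):
        # Projected gradient direction: step along -G, then clip at zero.
        D = [[max(0.0, w - step * g) - w for w, g in zip(rw, rg)] for rw, rg in zip(W, G)]
        slope = inner(G, D)
        if slope > -1e-14:
            break                        # approximately first-order stationary
        f_ref = max(history[-memory:])   # nonmonotone reference value
        lam = 1.0
        while True:
            W_new = [[w + lam * d for w, d in zip(rw, rd)] for rw, rd in zip(W, D)]
            f_new = objective(A, W_new)
            if f_new <= f_ref + sigma * lam * slope or lam < 1e-12:
                break
            lam *= 0.5                   # backtrack along the feasible segment
        G_new = gradient(A, W_new)
        S = [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(W_new, W)]
        Y = [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(G_new, G)]
        sy = inner(S, Y)
        # Barzilai-Borwein step <S,S>/<S,Y>, safeguarded to a bounded interval.
        step = min(1e8, max(1e-8, inner(S, S) / sy)) if sy > 1e-14 else 1e-2
        W, G = W_new, G_new
        history.append(f_new)
    return W, history


W0 = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
A = matmul(W0, [list(col) for col in zip(*W0)])   # exactly factorable toy target
W_init = [[0.5, 0.4], [0.3, 0.6], [0.6, 0.2], [0.2, 0.7]]
W_fit, history = symnmf_projected_bb(A, W_init)
print(history[0], history[-1])
```

The acceptance test compares against the largest of the last few objective values rather than the most recent one, which is what lets individual iterations increase the objective while the run as a whole still decreases it.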

2606.02852 2026-06-03 cs.LG

RESCAST-100K: A Comprehensive Dataset for Cross-Domain Residential Load and Indoor Temperature Forecasting

Jainam Dhruva, Yousaf Raza, A. B. Siddique, Simone Silvestri

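AI summary: Proposes RESCAST-100K, a large-scale benchmark dataset with a configuration-driven interface for cross-domain generalization research, covering about 100,000 simulated homes and 5 real-world datasets, for evaluating transfer learning, domain adaptation, and zero-shot forecasting methods.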

Abstract

Accurate short-term forecasting of residential energy load and indoor temperature is essential for home energy management systems, grid-level demand response, and community energy efficiency efforts. Domain adaptation and transfer learning have shown promise for improving forecasting accuracy under data heterogeneity and scarcity commonly seen in residential settings. However, progress is limited by the lack of comprehensive residential datasets: existing benchmarks are narrow in target coverage and rarely support structured cross-domain evaluation. We introduce RESCAST-100K, a large-scale residential forecasting benchmark for studying cross-domain generalization. It provides a configuration-driven interface that instantiates source and target domains along interpretable axes, including geography, climate zone, wall construction, and heating equipment, enabling systematic evaluation of transfer learning, domain adaptation, and zero-shot generalization under controlled domain shifts. The benchmark covers approximately 100,000 EnergyPlus-simulated U.S. homes derived from ResStock, with 15-minute time series for three coupled targets per home: total load, HVAC load, and indoor temperature. These are paired with weather channels, HVAC setpoints, and over 40 static building covariates. RESCAST-100K also integrates five real-world residential datasets under a unified schema, supporting sim-to-real evaluation on the same tasks. We benchmark recurrent, attention-based, and MLP-mixer architectures for zero-shot performance across domains, missing-input conditions, and forecasting tasks. Cross-attention and MLP-mixer models consistently outperform recurrent and classical transformer baselines under domain shift. RESCAST-100K is intended to help the machine learning and building analytics communities advance cross-domain residential forecasting at home, community, and grid scale.

2606.02849 2026-06-03 cs.LG

A Systematic Evaluation of Current Architectures in Wind Power Forecasting

Vinicius Bortolini, Gilson Adamczuk Oliveira, Erick Oliveira Rodrigues, Matheus Henrique Dal Molin Ribeiro

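AI summary: Through a systematic literature review, evaluates hybrid deep learning, modal decomposition, and statistical methods for interval forecasting of wind power, and finds that adding decomposition techniques such as VMD or EEMD improves forecast accuracy and reliability.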

Journal ref: IEEE Access 2025

Abstract

Interval wind speed forecasting is essential for the efficient integration of wind energy into power systems, as it accounts for the inherent uncertainty of wind resources. This study presents a systematic literature review focused on hybrid approaches to interval forecasting of wind generation, exploring the combination of deep learning, modal decomposition, and statistical methods. To guide the paper selection, Latent Dirichlet Allocation (LDA) was applied for topic modeling, enabling the identification of patterns and research trends. The findings emphasize that integrating hybrid models with decomposition techniques, such as Variational Mode Decomposition (VMD) and Ensemble Empirical Mode Decomposition (EEMD), enhances forecast accuracy and reliability by narrowing prediction intervals without compromising coverage. Regarding interval construction, most studies adopt a dual-model strategy, independently forecasting the lower and upper bounds. Input data are commonly decomposed using techniques like EMD, EEMD, or VMD, which extract frequency-based components. These components serve as inputs to models such as LSTM or ELM, trained separately for each bound. This approach allows for targeted modeling of uncertainty, improving flexibility and precision. Interval quality is typically evaluated through metrics that balance coverage and interval width. The review also highlights challenges, including the lack of standardized evaluation metrics, computational complexity, and limited real-world validation. Overall, the study reinforces the value of interval forecasting for wind energy operations and offers insights for advancing model robustness and decision-making.
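
The trade-off between coverage and interval width mentioned in this abstract can be sketched with two metrics that are common in the interval forecasting literature: prediction interval coverage probability (PICP) and prediction interval normalized average width (PINAW). The abstract does not say which metrics the reviewed studies use, so this choice, the function names, and the toy numbers are assumptions.

```python
def picp(observed, lower, upper):
    """Fraction of observations that fall inside their prediction interval."""
    hits = sum(lo <= y <= hi for y, lo, hi in zip(observed, lower, upper))
    return hits / len(observed)


def pinaw(observed, lower, upper):
    """Mean interval width divided by the range of the observations."""
    mean_width = sum(hi - lo for lo, hi in zip(lower, upper)) / len(observed)
    return mean_width / (max(observed) - min(observed))


# Hypothetical wind speeds (m/s) with separately forecast lower and upper bounds,
# as in the dual-model strategy described above.
observed = [1.0, 2.0, 3.0, 4.0]
lower = [0.0, 1.5, 3.5, 3.0]
upper = [2.0, 2.5, 4.5, 5.0]
print(picp(observed, lower, upper))   # 0.75: the third observation is missed
print(pinaw(observed, lower, upper))  # 0.5: mean width 1.5 over a range of 3.0
```

Narrower intervals lower PINAW but tend to lower PICP as well, which is why the two are usually reported together.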

2606.02842 2026-06-03 cs.LG

Spectral-Progressive Thought Flow for Lightweight Multimodal Reasoning

Yixian Shen, Zhiheng Yang, Qi Bi, Changshuo Wang, Shuai Wang, Jia-Hong Huang, George Floros, Prayag Tiwari, Anuj Pathania

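AI summary: Proposes Spectral-Progressive Thought Flow (SpecFlow), which represents intermediate visual thoughts in a fixed-size discrete cosine space and uses classifier-free guidance to align visual state updates with textual intent, enabling lightweight multimodal spatial reasoning that stays competitive while cutting computation and KV cache costs by up to 2.1 times.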

Comments Accepted at ICML 2026

Abstract

Multimodal spatial reasoning often relies on long chains of intermediate textual and visual thoughts, where accumulating visual tokens and dense cross-modal attention incur substantial computation and memory overhead. To address this challenge, we propose Spectral-Progressive Thought Flow (SpecFlow), a novel lightweight multimodal spatial reasoning framework that represents intermediate visual thoughts in a fixed-size discrete cosine space. By exploiting strong energy compaction, SpecFlow preserves global layout and relational structure while introducing high-frequency details only when increased spatial precision is required. To align visual state evolution with linguistic intent, classifier-free guidance enables autoregressive textual thoughts to steer flow-based updates of the visual workspace/state without expanding the context. As a result, SpecFlow maintains a bounded visual workspace whose updates depend only on the current visual state and accumulated textual trace, enabling long-horizon inference with stable latency and memory usage independent of reasoning depth. Empirical results show that SpecFlow achieves competitive or superior reasoning performance while reducing computation and KV cache costs by up to 2.1 times.
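
The energy compaction property that this abstract relies on can be shown with a one-dimensional toy example: for a smooth signal, a few low-frequency discrete cosine coefficients carry almost all of the energy. This is only an illustration of the general property, not SpecFlow's representation; the signal, the coefficient budget, and the function names are assumptions.

```python
import math


def dct2(x):
    """Orthonormal DCT-II of a real sequence."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        out.append(scale * sum(v * math.cos(math.pi * (i + 0.5) * k / n)
                               for i, v in enumerate(x)))
    return out


def idct2(coeffs):
    """Inverse of dct2 (orthonormal DCT-III)."""
    n = len(coeffs)
    out = []
    for i in range(n):
        total = 0.0
        for k, c in enumerate(coeffs):
            scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
            total += scale * c * math.cos(math.pi * (i + 0.5) * k / n)
        out.append(total)
    return out


def keep_low_frequencies(coeffs, budget):
    """Zero every coefficient beyond the first `budget` (fixed-size truncation)."""
    return coeffs[:budget] + [0.0] * (len(coeffs) - budget)


signal = [float(i) for i in range(16)]            # smooth ramp
coeffs = dct2(signal)
energy = sum(c * c for c in coeffs)
low_share = sum(c * c for c in coeffs[:4]) / energy
coarse = idct2(keep_low_frequencies(coeffs, 4))   # coarse reconstruction from 4 coefficients
print(low_share)
```

Raising the budget adds high-frequency detail to the reconstruction without changing the length of the coefficient vector, which is the sense in which the representation stays fixed-size.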

2606.02831 2026-06-03 cs.CV

Principled Reflection Separation via Nonlinear Superposition and Feature Interaction

Qiming Hu, Mingjia Li, Yuntong Li, Xiaojie Guo

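AI summary: To address the nonlinear coupling of transmission and reflection layers in single-image reflection separation, proposes a learnable nonlinear superposition model and a generalized dual-stream interactive framework, achieving better decomposition performance and generalization.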

Comments 23 pages

Abstract

Single-image reflection separation is fundamentally challenged by the entanglement of transmission and reflection layers under complex image formation processes. Existing approaches largely rely on simplified assumptions or independent modeling, limiting their ability to handle real-world scenarios. In this work, we revisit the problem from a unified perspective and identify a key issue of existing approaches, i.e., the widely adopted linear composition model in the sRGB domain fails to capture the nonlinear coupling introduced by real-world image signal processing pipelines. To address this, we introduce a learnable nonlinear superposition model that more faithfully characterizes layer interactions and improves decomposition fidelity. Building upon this formulation, we propose a generalized dual-stream interactive framework that explicitly models bidirectional dependencies between transmission and reflection through feature exchange. This framework unifies activation-, gating-, and attention-based interaction mechanisms, and is compatible with both CNN and Transformer backbones. Extensive experiments on diverse real-world benchmarks demonstrate that the proposed approach achieves superior performance with strong generalization capability. More importantly, our study reveals that reflection separation is not about undoing a linear mixture, but about learning nonlinear formation and interaction, offering new insights into the design of principled image decomposition models. Code and models are publicly available at https://mingcv.github.io/DIRS-Page.

2606.02806 2026-06-03 cs.CL

Translating Classical Poetry into Modern Prose

Chalamalasetti Kranti, Sowmya Vajjala

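AI summary: Introduces the Padyam2Gadyam dataset for translating 13th-17th century classical Telugu poetry into contemporary Telugu and English prose, and evaluates 5 large language models, finding substantial room for improvement.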

Comments Preprint

Abstract

We introduce Padyam2Gadyam, a dataset for the task of poem-to-prose translation from 13th-17th century classical Telugu poetry to contemporary Telugu and English prose. The dataset consists of 600 poems and their human-verified Telugu and English prose translations. We evaluated 5 contemporary Large Language Models (LLMs) on their ability to do poem-to-prose translation into Telugu and English. Our results indicate that while there are differences across LLMs, their overall performance leaves substantial room for improvement in both languages. Through qualitative analysis, we discuss the capabilities and limitations of contemporary MT evaluation approaches for this task.

2606.02789 2026-06-03 cs.CV

Diagnosis of Human Object Interaction Detectors for Real World Educational Applications

Divya Mereddy, Ashwin Tudur Sadashiva, Marcos Quinones-Grueiro, Gautam Biswas

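AI summary: Proposes a diagnosis-driven framework that combines a triplet-level HOI error taxonomy with error-factor attribution analysis; targeted refinement raises the macro-F1 score of a pretrained CDN model on the CCATT dataset from 48.6 to 90.2.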

Abstract

Human-object interaction (HOI) recognition is critical for automatically analyzing student behavior in complex educational environments. Although state-of-the-art (SOTA) HOI detectors perform well on benchmark datasets, their performance often degrades when deployed in real-world training environments due to domain-specific objects, occlusions, and complex visual conditions. In this paper, we introduce a diagnosis-driven framework that integrates a triplet-level HOI error taxonomy with error-factor attribution analysis for real-world educational video data. We study this problem in the context of Critical Care Air Transport Team (CCATT) mixed-reality medical training. Based on an analysis of HOI failure modes and their causes, we develop a diagnosis-informed refinement strategy for adapting pretrained HOI models to the target domain. Experiments on the CCATT dataset show that this approach improves the macro-F1 score of a pretrained CDN model from 48.6 to 90.2 through targeted refinement guided by diagnosed error factors. These results highlight the value of detailed diagnostic analysis for informing targeted adaptation of HOI models in real-world educational environments.
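
The macro-F1 score reported in this abstract is, in its standard definition, the unweighted mean of per-class F1 scores, so rare classes count as much as frequent ones. The sketch below shows that definition on made-up labels; the label names are hypothetical and the paper's actual class set and matching rules are not given in the abstract.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        # F1 = 2TP / (2TP + FP + FN); the denominator is positive for any seen label.
        scores.append(2 * tp / (2 * tp + fp + fn))
    return sum(scores) / len(scores)


# Hypothetical interaction labels for five clips.
y_true = ["hold", "hold", "press", "press", "none"]
y_pred = ["hold", "press", "press", "press", "none"]
print(100 * macro_f1(y_true, y_pred))  # on the same 0-100 scale the abstract uses
```

Here the per-class F1 values are 2/3 for "hold", 0.8 for "press", and 1.0 for "none", so the macro average is about 82.2.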

2606.02747 2026-06-03 cs.CV cs.AI

Plan2Map: A Multimodal Benchmark for Document-Grounded Geospatial Boundary Reconstruction from Planning Records

Fabian Degen, Oishi Deb, Jindong Gu, Junchi Yu, Samuele Marro, Philip Torr, Jialin Yu

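AI summary: Proposes the Plan2Map benchmark and the GeoPlanAgent system, which reconstructs geospatial boundaries from UK planning records through document evidence extraction, localisation, map registration, boundary segmentation, and further steps, substantially outperforming direct VLM approaches.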

Comments Project page: https://odeb1.github.io/Plan2Map_Project_Page/. Fabian Degen and Oishi Deb contributed equally.

Abstract

Planning records define restrictions over geographic areas, but their source documents often provide only indirect spatial evidence rather than machine-readable boundaries. We introduce Plan2Map, a 208-case multimodal benchmark for document-grounded geospatial boundary reconstruction from UK planning records. Given only a source planning document, systems must reconstruct a valid geospatial boundary from notice text, schedules, map plates, map labels, and boundary annotations; the reference GeoJSON is held out for scoring. We propose GeoPlanAgent, a document-grounded, geospatial-tool-in-the-loop system that decomposes the task into evidence extraction, localisation, map registration, boundary segmentation, projection, and verification. On Plan2Map, GeoPlanAgent achieves 0.736 mean IoU and 0.904 median IoU, with 67.8% of predictions at or above 0.8 IoU, substantially outperforming direct VLM-to-GeoJSON baselines. Diagnostic analysis shows that direct VLM prediction remains unreliable, while remaining errors are concentrated in localisation and map registration, and supervised boundary segmentation substantially improves pixel-level mask quality. Plan2Map provides a concrete testbed for multimodal geospatial reconstruction from public planning records. Project page: https://odeb1.github.io/Plan2Map_Project_Page/.
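
The three headline numbers in this abstract (mean IoU, median IoU, and the share of predictions at or above 0.8 IoU) can be computed as in the sketch below. It rasterizes regions to sets of grid cells purely for simplicity; the benchmark scores against GeoJSON boundaries, and its exact scoring procedure is not described in the abstract, so the representation and the function names are assumptions.

```python
from statistics import mean, median


def iou(pred_cells, ref_cells):
    """Intersection over union of two regions given as collections of grid cells."""
    pred, ref = set(pred_cells), set(ref_cells)
    union = pred | ref
    # Convention assumed here: two empty regions score 0.0 rather than 1.0.
    return len(pred & ref) / len(union) if union else 0.0


def summarize(ious, threshold=0.8):
    """Mean IoU, median IoU, and fraction of cases at or above the threshold."""
    return {
        "mean": mean(ious),
        "median": median(ious),
        "share_at_or_above": sum(v >= threshold for v in ious) / len(ious),
    }


# Hypothetical 2x2 reference region and a prediction shifted by one column.
reference = [(0, 0), (0, 1), (1, 0), (1, 1)]
predicted = [(0, 1), (0, 2), (1, 1), (1, 2)]
print(iou(predicted, reference))        # 2 shared cells out of 6 in the union
print(summarize([1.0, 0.5, 0.9]))
```

A median well above the mean, as in the reported 0.904 versus 0.736, indicates that most cases score high while a minority of poor reconstructions pulls the average down.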

2606.02641 2026-06-03 cs.RO cs.AI

CARVE: Certified Affordable Repair of Vetoed Maneuvers via Envelopes for Interactive Driving

Yifan Wang

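AI summary: To address negative hard-rule margins that rule-aware stacks easily overlook in interactive driving, proposes CARVE, a certificate layer that uses ego-owned and agent-owned tactical operators on a finite lattice to certify affordable repairs of vetoed maneuvers, and proves its soundness.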

Comments 8 pages, 3 figures

Abstract

Interactive driving exposes a failure mode that is easy to miss in rule-aware autonomous-driving stacks: a hard-rule margin can be negative for an ego candidate even though a small lawful accommodation by a non-priority agent would restore feasibility. Existing rulebooks, shields, and reachability filters are strong at vetoing unsafe actions, while prediction-based planners model likely responses. Neither returns a runtime proof object that states which bounded multi-agent edit repairs the maneuver, who owns the edit, whether the request is right-of-way affordable, and what ego fallback remains if the request is not observed. We formulate this missing object as *interactive repair certification* and introduce *CARVE*, a prediction-free certificate layer over a finite lattice of ego-owned and agent-owned tactical operators. Agent-owned requests are admissible only inside \(B_j(s) = \beta(\pi_j)\alpha_j^{\max}(s)\), a cooperation envelope that separates kinematic reachability from normative priority. The resulting certificate records the binding rule, repair category, repair set, responsibility-weighted cost split, and fallback. On 589 Lanelet2-geometry-grounded INTERACTION replay episodes, CARVE-Greedy accepts 98.64% of initially vetoed maneuvers and recovers 370/378 human-resolved false vetoes, while preserving 589/589 right-of-way respect, zero priority-agent false positives, and 400/400 negative-stress vetoes. We prove certificate soundness, structural right-of-way respect, exact finite-lattice minimality, fallback contingency, and blame-consistency conditions. CARVE does not predict or require another driver's compliance; it certifies whether a proposed interaction is bounded, attributable, and normatively admissible under declared assumptions.

2606.02584 2026-06-03 cs.CL cs.AI cs.IR

IdiomX: A Multilingual Benchmark for Idiom Understanding, Retrieval, and Interpretation

IdiomX:习语理解、检索和解释的多语言基准

Ayman Ali Sharara

AI总结 提出IdiomX,一个大规模多语言习语基准,通过可复现的多阶段流水线构建,涵盖190K+上下文示例和12K+习语,定义四项任务(检测、上下文-习语检索、阿拉伯语-英语习语检索、习语解释),实验表明上下文Transformer模型提升检测,混合检索重排序增强单语和跨语言检索,习语解释可建模为语义检索任务。

Comments 12 pages, 21 figures. Includes dataset and code. Resources available on HuggingFace, Kaggle, and GitHub

详情
AI中文摘要

习语表达仍然是自然语言处理中的持续挑战,因为它们的含义通常是非组合性的、依赖于上下文的,并且难以跨语言对齐。现有的习语资源在规模、上下文多样性或多语言覆盖方面往往有限,限制了它们对现代语言模型的实用性。我们介绍了IdiomX,一个用于习语理解、检索和解释的大规模多语言基准,通过可复现的多阶段流水线构建,结合词汇资源提取、大规模归一化、受控的大语言模型丰富和结构化验证。生成的数据集包含超过190K个上下文示例,涵盖12K+习语,具有对齐的英语、阿拉伯语和法语语义表示、习语和字面用法标签以及丰富的语言元数据。基于这一资源,我们定义了一个统一的四任务基准,涵盖习语检测、上下文到习语检索、阿拉伯语到英语习语检索和习语解释,将评估从比喻识别扩展到语义基础和可解释的含义检索。实验表明,上下文Transformer模型显著提高了习语检测,而混合检索和重排序架构则显著增强了单语和跨语言习语检索。结果进一步表明,习语解释可以有效地建模为语义检索任务,将可解释性作为基准的补充维度。总体而言,IdiomX提供了一个可扩展的基准,用于研究从检测到检索和语义解释的习语语言进展,并提供了一个模块化框架,可扩展到其他语言和比喻推理任务。

英文摘要

Idiomatic expressions remain a persistent challenge for natural language processing because their meanings are often non-compositional, context-dependent, and difficult to align across languages. Existing idiom resources are often limited in scale, contextual diversity, or multilingual coverage, restricting their utility for modern language models. We introduce IdiomX, a large-scale multilingual benchmark for idiom understanding, retrieval, and interpretation, constructed through a reproducible multi-stage pipeline combining lexical resource extraction, large-scale normalization, controlled large language model enrichment, and structured validation. The resulting dataset contains over 190K contextualized examples spanning 12K+ idioms, with aligned English, Arabic, and French semantic representations, idiomatic and literal usage labels, and rich linguistic metadata. Building on this resource, we define a unified four-task benchmark covering idiom detection, context-to-idiom retrieval, Arabic-to-English idiom retrieval, and idiom interpretation, extending evaluation from figurative recognition to semantic grounding and explainable meaning retrieval. Experiments show that contextual transformer models substantially improve idiom detection, while hybrid retrieval and reranking architectures significantly strengthen both monolingual and cross-lingual idiom retrieval. Results further demonstrate that idiom interpretation can be effectively modeled as a semantic retrieval task, introducing interpretability as a complementary benchmark dimension. Overall, IdiomX provides a scalable benchmark for studying idiomatic language as a progression from detection to retrieval and semantic interpretation, and offers a modular framework extensible to additional languages and figurative reasoning tasks.

2606.02437 2026-06-03 cs.LG cs.CL

On the Scaling of PEFT: Towards Million Personal Models of Trillion Parameters

论PEFT的扩展:迈向万亿参数百万个性化模型

Mind Lab: Vin Bo, Song Cao, Vic Cao, Andrew Chen, Kaijie Chen, Cleon Cheng, Steven Chiang, Kaixuan Fan, Hera Feng, Huan Feng, Arthur Fu, Jun Gao, Hongquan Gu, Aaron Guan, Nolan Ho, Mutian Hong, Hailee Hou, Peixuan Hua, Charles Huang, Miles Jiang, Nora Jiang, Yuyi Jiang, Qiuyu Jin, Fancy Kong, Andrew Lei, Kyrie Lei, Alexy Li, Lucian Li, Ray Li, Theo Li, Wenhao Li, Zhihui Li, Allen Lin, Jiayi Lin, Kairus Liu, Kieran Liu, Logan Liu, Xiang Liu, Irvine Lu, Maeve Luo, Runze Lv, Pony Ma, Verity Niu, Anson Qiu, Vincent Wang, Rio Yang, Maxwell Yao, Carrie Ye, Regis Ye, Wenlin Ye, Josh Ying, Danney Zeng, Yuhan Zhan, Anya Zhang, Di Zhang, Ruijia Zhang, Shiyang Zhang, Sueky Zhang, Ya Zhang, Wei Zhao, Ada Zhou, Adrian Zhou, Yuhua Zhou, Xinyue Zhu, Murphy Zhuang

AI总结 研究参数高效微调(PEFT)作为共享基础模型上的持久局部状态,通过三个扩展轴(向上、向下、向外)分析其作为个性化模型基质的可行性,并提出MinT基础设施管理适配器生命周期。

详情
AI中文摘要

参数高效微调(PEFT)通常被视为全微调的廉价替代方案。我们研究了一个更广泛的作用:将小型可训练适配器作为强大共享基础模型之上的持久局部状态。在这种框架下,基础模型提供共享能力,而适配器承载实例特定行为,如偏好、技能、工具习惯和类似记忆的更新。我们围绕三个扩展轴组织问题:向上扩展,更强的共享先验使小型局部更新更有用;向下扩展,研究适配器可以多小同时保持可靠性;向外扩展,许多持久化适配实例共存。MinT提供了一个管理适配器身份、修订、来源、评估和服务驻留的基础设施示例。综合来看,结果表明PEFT可以成为持久个性化模型的紧凑基质,而不仅仅是全微调的预算替代方案。

英文摘要

Parameter-efficient fine-tuning (PEFT) is usually treated as a cheaper alternative to full fine-tuning. We study a broader role: small trainable adapters as persistent local state on top of strong shared foundation models. In this framing, the base model provides shared competence while adapters carry instance-specific behavior such as preferences, skills, tool habits, and memory-like updates. We organize the problem around three scaling axes: Scale Up, where stronger shared priors make small local updates more useful; Scale Down, where we study how small adapters can be while remaining reliable; and Scale Out, where many persistent adapted instances coexist. MinT provides one infrastructure example for managing adapter identity, revision, provenance, evaluation, and serving residency. Together, the results suggest that PEFT can be a compact substrate for persistent personal models rather than only a budget substitute for full fine-tuning.

2606.01624 2026-06-03 cs.CV cs.SE

What to Test Next: Interpretable Coverage Gap Discovery in Driving VLMs

下一步测试什么:驾驶视觉语言模型中可解释的覆盖缺口发现

Abhishek Aich, Sparsh Garg, Vijay Kumar BG, Turgun Yusuf Kashgari, Manmohan Chandraker

AI总结 提出 SliceScorer 和 SliceNav 方法,通过结合暴露先验和邻居失败先验的确定性评分规则,在驾驶视觉语言模型中有效发现高风险覆盖缺口,并支持可解释和可审计的验证流程。

详情
AI中文摘要

驾驶视觉语言模型必须准确理解由操作设计域定义的各种条件下的场景,然而验证仍然稀疏:许多切片缺失,使得经验故障率不可靠。我们提出 SliceScorer,一种用于缺失切片推荐的确定性评分规则,它结合了 (i) 基于暴露的覆盖先验,优先考虑罕见、测试不足的区域,以及 (ii) 邻居失败先验,从类似测试条件传播风险。SliceScorer 刻意简单——可解释、可审计且保守——这些属性对于安全关键验证至关重要。为了在声明的 ODD 之外进行压力测试,我们将 SliceScorer 嵌入 SliceNav,一个由 LLM 编排的验证流程,其中模型解释开发者查询以选择相关操作(分诊、评分、获取、评估)和词汇扩展,组合验证工作流,同时保持所有评分确定性和可审计性。在三个驾驶 VLM(WiseAD、DriveMM、Cosmos-Reason2-2B)上的实验表明,SliceNav 比先前的切片发现方法更有效地发现高风险覆盖缺口,同时在条件空间中保持多样化的推荐。消融实验证实了两个评分组件的贡献,定性分析展示了从开发者查询到目标评估的端到端工作流。

英文摘要

Driving vision-language models (VLMs) must accurately understand scenes across diverse conditions defined by Operational Design Domains (ODDs), yet verification remains sparse: many slices are missing, making empirical failure rates unreliable. We propose SliceScorer, a deterministic scoring rule for missing-slice recommendation that combines (i) an exposure-based coverage prior to prioritize rare, under-tested regions, and (ii) a neighbor-failure prior that propagates risk from similar tested conditions. SliceScorer is deliberately simple - interpretable, auditable, and conservative - properties essential for safety-critical validation. For stress testing beyond the declared ODD, we embed SliceScorer within SliceNav, an LLM-orchestrated verification pipeline where the model interprets developer queries to select relevant operators (triage, scoring, acquisition, evaluation) and vocabulary extensions, composing verification workflows while keeping all scoring deterministic and auditable. Experiments on three driving VLMs (WiseAD, DriveMM, Cosmos-Reason2-2B) show that SliceNav surfaces high-risk coverage gaps more effectively than prior slice-discovery methods while maintaining diverse recommendations across the condition space. Ablations confirm both scoring components contribute, and qualitative analysis demonstrates end-to-end workflows from developer query to targeted evaluation.
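The abstract names the two ingredients of SliceScorer but not its formula, so the following Python sketch is only one plausible shape for such a deterministic rule. Everything in it is an assumption made for illustration: the exposure prior is taken to be one minus the slice's observed frequency, the neighbor-failure prior is a similarity-weighted mean of failure rates on tested neighbor slices, and the two are mixed linearly with a hypothetical weight `mix`.

```python
def exposure_prior(count: int, total: int) -> float:
    """Assumed form: rare or untested conditions score close to 1."""
    return 1.0 - count / total if total > 0 else 1.0


def neighbor_failure_prior(neighbors) -> float:
    """Similarity-weighted mean failure rate over tested neighbor slices.

    neighbors: list of (similarity, failure_rate) pairs; returns 0.0 when
    there are no neighbors or all similarities are zero.
    """
    weight = sum(sim for sim, _ in neighbors)
    if weight == 0:
        return 0.0
    return sum(sim * fail for sim, fail in neighbors) / weight


def slice_score(count: int, total: int, neighbors, mix: float = 0.5) -> float:
    """Hypothetical linear mix of the two priors; deterministic by construction."""
    return mix * exposure_prior(count, total) + (1.0 - mix) * neighbor_failure_prior(neighbors)
```

Because the rule contains no sampling or learned component, the same inputs always yield the same score, which is the auditability property the abstract emphasizes.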

2606.00680 2026-06-03 cs.AI cs.LG

Regularized Offline Policy Optimization with Posterior Hybrid Bayesian Belief

具有后验混合贝叶斯信念的正则化离线策略优化

Hongqiang Lin, Pengfei Wang, Nenggan Zheng

AI总结 提出后验混合贝叶斯信念(PhyB)以统一量化离线强化学习中的认知不确定性,并基于此开发迭代正则化策略优化算法,实现单调改进直至收敛。

详情
AI中文摘要

离线强化学习旨在从预先收集的数据集中优化策略。该范式的一个瓶颈是管理认知不确定性,这种不确定性源于有限的数据覆盖(样本层面)以及从有限数据中识别转移动态的模糊性(模型层面)。为了统一量化这些不确定性,贝叶斯强化学习通过将动态模型视为随机变量并维护相应的信念而被提出。尽管具有理论吸引力,贝叶斯强化学习中的策略优化在计算上仍然具有挑战性,因为它需要求解带有期望的复合目标。先前的方法要么采用计算可扩展性差的基于搜索的技术,要么施加牺牲贝叶斯强化学习适应性的限制性后验假设。为了解决这些局限性,我们提出了后验混合贝叶斯信念(PhyB),它将期望重新表述为动态模型子集上的凸组合。理论分析表明,这种近似引起的目标差异是有界的。基于PhyB,我们开发了一种迭代正则化策略优化算法,该算法为单调改进直至收敛提供了与度量无关的保证。实验结果表明,PhyB在各种基准测试中达到了最先进的性能。

英文摘要

Offline reinforcement learning (RL) aims to optimize policies from pre-collected datasets. A bottleneck of this paradigm is managing epistemic uncertainty, which arises from limited data coverage (sample-level) and the ambiguity in identifying transition dynamics from finite data (model-level). To provide a unified quantification of these uncertainties, Bayesian RL has been proposed by treating the dynamics model as a random variable and maintaining a corresponding belief. Despite its theoretical appeal, policy optimization in Bayesian RL remains computationally challenging as it requires solving composite objectives with expectations. Prior methods either employ search-based techniques with poor computational scalability or impose restrictive posterior assumptions that sacrifice the adaptability of Bayesian RL. To address these limitations, we propose Posterior Hybrid Bayesian Belief (PhyB), which reformulates the expectation as a convex combination over a subset of dynamics models. Theoretical analysis demonstrates that the objective discrepancy induced by this approximation remains bounded. Based on PhyB, we develop an iterative regularized policy optimization algorithm that provides metric-agnostic guarantees for monotonic improvement until convergence. Empirical results demonstrate that PhyB achieves state-of-the-art performance on various benchmarks.

2605.31067 2026-06-03 cs.RO

Seeing Fast and Slow: Bimodal 3D Scene Graphs for Open-set Tasks

快与慢:面向开放集任务的双模态3D场景图

Marcel Bartholomeus Prasetyo, Shrutika Vishal Thengane, A Manicka Praveen, Yi Loo, Malika Meghjani

AI总结 提出BiMoSG方法,通过默认快速模式生成粗粒度3D场景图,并在需要时切换至慢速模式生成细粒度开放词汇3D场景图,实现实时任务执行。

Comments Submission has not been cleared with funding agency

详情
AI中文摘要

开放集任务执行可以显著受益于根据上下文和机器人探索环境时不断变化的信息,在粗粒度和细粒度场景表示之间无缝切换。例如,通常从粗粒度场景表示开始就足够了,只有当机器人遇到可能包含任务相关对象的区域时,才采用更精细、更细粒度的场景表示。因此,在这项工作中,我们提出了BiMoSG,一种用于开放集任务的双模态3D场景图生成方法。BiMoSG默认采用“快速”模式,以高效生成粗粒度3D场景图,并可以切换到“慢速”模式,为任务相关对象生成更精细的开放词汇3D场景图。我们证明,我们提出的3D场景图生成方法显著快于开源的最新方法。这使得我们能够将场景图生成过程与任务执行集成,用于实时部署。

英文摘要

Open-set task execution can significantly benefit from seamlessly switching between coarse and fine scene representations depending on the context and the evolving information as the robot explores the environment. For example, it is often sufficient to start with a coarse scene representation and only employ a finer, more granular scene representation when the robot encounters regions that are likely to contain task-relevant objects. Hence, in this work, we propose BiMoSG, a bimodal 3D scene graph generation approach for open-set tasks. BiMoSG employs a "fast" mode by default to efficiently generate a coarse 3D scene graph and can switch to a "slow" mode for generating a finer open-vocabulary 3D scene graph of task-relevant objects. We demonstrate that our proposed 3D scene graph generation approach is significantly faster than the open-source state-of-the-art approaches. This allows us to integrate the scene graph generation process with task execution for real-time deployment.

2605.28556 2026-06-03 cs.AI

A Matter of TASTE: Improving Coverage and Difficulty of Agent Benchmarks

品味问题:提高智能体基准测试的覆盖率和难度

Tomer Keren, Nitay Calderon, Asaf Yehudai, Yotam Perlitz, Michal Shmueli-Scheuer, Roi Reichart

AI总结 提出TASTE方法,通过反转任务构建流程,利用自适应对比n-gram模型和聚类自动生成覆盖广泛工具组合的高难度基准任务,以解决现有基准饱和问题。

详情
AI中文摘要

随着智能体能力的提升,现有基准(如$τ^2$-Bench)逐渐饱和。然而构建新的基准任务仍然复杂、昂贵且劳动密集。此外,标准方法(先以自然语言编写场景,再映射到工具序列)仅捕获了智能体使用的工具模式的一个狭窄子集。在本文中,我们通过反转任务构建过程来解决这些问题。我们提出TASTE:基于工具序列进化的任务合成,一种自动生成具有更广工具使用覆盖率的挑战性任务的方法。TASTE利用在LLM判断的有效性信号上训练的自适应对比n-gram模型。这使得能够采样覆盖大量工具组合的有效工具序列。然后TASTE通过聚类从池中选择代表性序列,将其实例化为完整的基准任务,并通过迭代难度进化进行优化。使用TASTE,我们构建了$τ^c$-Bench,这是$τ^2$-Bench三个领域的具有挑战性的扩展。我们评估了11个智能体/用户LLM对,发现几乎饱和$τ^2$-Bench的模型在我们的任务上性能严重下降(例如,Gemini-3-Flash从$0.82-0.94$降至$0.28-0.61$)。除了增加难度,我们生成的任务使智能体必须执行的独特工具组合数量翻倍以上。我们的结果表明,现有基准的高分往往反映饱和而非稳健的任务解决能力。通过自动化生成高难度、高覆盖率的基准,TASTE使得未来智能体的持续、可扩展评估成为可能。

英文摘要

As agent capabilities advance, existing benchmarks, such as $τ^2$-Bench, are becoming increasingly saturated. Yet constructing new benchmark tasks remains complex, costly, and labor-intensive. Moreover, the standard approach, in which scenarios are first written in natural language and then mapped to tool sequences, captures only a narrow subset of the tool-use patterns agents exercise. In this paper, we address these problems by reversing the task construction process. We propose TASTE: Task Synthesis from Tool Sequence Evolution, an automatic method that generates challenging tasks with broader tool-use coverage. TASTE utilizes an Adaptive Contrastive $n$-gram model trained on LLM-judged validity signals. This enables sampling valid tool sequences that cover a vast range of tool combinations. TASTE then selects representative sequences from the pool via clustering, instantiates them into complete benchmark tasks, and refines them through iterative difficulty evolution. Using TASTE, we construct $τ^c$-Bench, a challenging extension of the three domains of $τ^2$-Bench. We evaluate $11$ agent/user LLM pairs and find that models nearly saturating $τ^2$-Bench suffer severe performance drops on our tasks (e.g., Gemini-3-Flash falls from $0.82\!-\!0.94$ to $0.28\!-\!0.61$). Beyond increasing difficulty, our generated tasks more than double the number of unique tool combinations agents must execute. Our results suggest high scores on existing benchmarks often reflect saturation rather than robust task-solving ability. By automating the generation of difficult, high-coverage benchmarks, TASTE enables continuous, scalable evaluation of future agents.

2605.28119 2026-06-03 cs.CV

ST-ColoNet: Spatio-Temporal Colon Segment Recognition via Hybrid Attention and Edge-Guided Feature Learning

ST-ColoNet: 通过混合注意力与边缘引导特征学习的时空结肠段识别

Crystal Cai, Ziyi Wang, Zhengjie Zhang, Jingsheng Gao, Dahong Qian, Suncheng Xiang

AI总结 提出ST-ColoNet框架,结合Colorlaus模块(度量学习优化边缘空间特征)和Full-Temp模块(三种自注意力模式近似全自注意力),在自建数据集上实现结肠段识别准确率81.0%、F1分数70.7%。

Comments Some experiments need to be updated

详情
AI中文摘要

结肠镜视频中的结肠段识别是许多下游任务的关键需求,但现有自动识别方法仅使用结肠镜图像,未充分利用时间信息,导致性能不佳。此外,相关的公开视频数据集稀缺。为解决此问题,我们整理并发布了一个专门用于结肠段识别任务的标注数据集。此外,我们提出了一种基于两阶段深度学习的框架——时空网络结肠段识别(ST-ColoNet),用于从结肠镜视频中识别结肠段,该框架包括Colorlaus模块(使用度量学习优化边缘介导的空间特征提取)和Full-Temp模块(结合三种自注意力模式,以更好地近似长结肠镜序列上的全自注意力并优化时间特征聚合)。通过大量消融实验,我们证明该框架能够在结肠段识别任务上达到最先进的性能,准确率为81.0%,F1分数为70.7%,相比现有最先进方法有巨大提升。

英文摘要

Colo-segment recognition in colonoscopy videos is a key requirement for many downstream tasks, but existing automatic recognition methods use only colonoscopy images without fully exploiting temporal information, leading to poor performance. Additionally, relevant public video-based datasets are scarce. To tackle this problem, we curate and release a labeled dataset specifically for the task of colo-segment recognition. In addition, we propose a two-stage deep learning-based framework, Colo-Segment Recognition via SpatioTemporal Network (ST-ColoNet), for colo-segment recognition from colonoscopy videos. It includes the Colorlaus module, which uses metric learning to optimize edge-mediated spatial feature extraction, and the Full-Temp module, which combines three self-attention patterns to better approximate full self-attention on long colonoscopy sequences and optimize temporal feature aggregation. Through extensive ablation experiments, we show that our framework achieves state-of-the-art performance on colo-segment recognition, with an accuracy of 81.0% and an F1-score of 70.7%, a substantial improvement over existing state-of-the-art methods.

2605.26914 2026-06-03 cs.CV

I2PRef: Image-Driven Point Completion with Iterative Refinement

I2PRef: 图像驱动的点云补全与迭代细化

Azhar Hussian, Marina Ritthaler, André Kaup, Vasileios Belagiannis

AI总结 提出一种以图像为主要几何来源的点云补全方法,通过图像到点(I2P)模块直接从单张RGB图像重建完整点云,并利用基于Transformer的点到点(P2P)细化模块迭代优化,在ShapeNet-ViPC上取得最先进性能,Chamfer距离相对提升12.3%。

Comments Accepted at European Signal Processing Conference (EUSIPCO 2026)

详情
AI中文摘要

我们提出了一种图像条件化的点云补全方法,将图像视为主要的几何来源而非次要的引导。为此,我们引入了一个图像到点(I2P)模块,该模块可以直接从单张RGB图像重建完整的点云,无需3D输入。此外,我们引入了一个基于Transformer的点到点(P2P)细化模块,该模块利用点令牌和图像特征之间的自注意力和交叉注意力,迭代地细化粗I2P输出。I2P模块使图像编码器能够学习丰富的几何表示,而P2P模块逐步恢复细粒度细节。与依赖辅助损失或融合模块的现有多模态方法不同,我们的显式I2P任务仅基于图像提供了强大的几何感知先验。在ShapeNet-ViPC上的大量实验表明,我们的方法取得了最先进的补全性能,Chamfer距离相对先前方法提升了12.3%。代码可在 https://github.com/AzharSindhi/I2PRef.git 获取。

英文摘要

We present an image-conditioned point cloud completion approach that treats images as the primary geometric source rather than a secondary guide. To this end, we introduce an Image-to-Point (I2P) module that can reconstruct complete point clouds directly from a single RGB image, with no need for 3D inputs. Additionally, we introduce a transformer-based Point-to-Point (P2P) refinement module that uses self- and cross-attention between point tokens and image features to iteratively refine the coarse I2P output. The I2P module enables the image encoder to learn rich geometric representations, while the P2P module progressively recovers fine-grained details. Unlike existing multimodal methods that rely on auxiliary losses or fusion modules, our explicit I2P task provides a strong, geometry-aware prior based on images alone. Extensive experiments on ShapeNet-ViPC demonstrate state-of-the-art completion performance with a 12.3% relative Chamfer Distance improvement over prior methods. Code is available at: https://github.com/AzharSindhi/I2PRef.git
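To make the reported metric concrete, here is a minimal pure-Python sketch of a symmetric Chamfer distance and of a relative improvement figure. It assumes the common squared-L2, mean-reduced variant; the abstract does not say which variant or scaling the paper uses, and the function names are illustrative only.

```python
def chamfer_distance(P, Q):
    """Symmetric Chamfer distance between two point sets (squared-L2 variant).

    P, Q: non-empty lists of equal-dimension coordinate tuples.
    """
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    def one_way(A, B):
        # Mean distance from each point in A to its nearest neighbor in B.
        return sum(min(sqdist(a, b) for b in B) for a in A) / len(A)

    return one_way(P, Q) + one_way(Q, P)


def relative_improvement(baseline: float, ours: float) -> float:
    """Fractional reduction of a lower-is-better metric."""
    return (baseline - ours) / baseline
```

Under this definition, a 12.3% relative improvement means the new Chamfer distance is 87.7% of the baseline value; this brute-force version is quadratic in the number of points and is meant only for small examples.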

2605.29661 2026-06-03 cs.CV

Geometry-Guided Modeling of Foundation Features Enables Generalizable Object Shape Deformation Learning

几何引导的基础特征建模实现可泛化的物体形状变形学习

Yiyao Ma, Kai Chen, Zhongxiang Zhou, Zhuheng Song, Dongsheng Xie, Zelong Tan, Rong Xiong, Qi Dou

AI总结 提出一种几何引导的特征建模机制和视图自适应特征聚合模块,通过变形类别级形状模板实现单目3D形状恢复,在形状变化和视角多样性上显著优于现有方法。

Comments 20 pages, 12 figures, accepted by ICML 2026

详情
AI中文摘要

单目3D形状恢复是几何理解的基础,但在任意视角和未见物体类别上实现鲁棒泛化仍然是一个重大挑战。本文提出一个可泛化的变形学习框架,通过显式变形类别级形状模板以匹配目标观测来重建3D物体。为了解决模板与目标之间的复杂形状变化,我们引入了几何引导的特征建模机制。该过程首先用模板拓扑丰富基础特征以生成几何感知表示,然后将其与目标观测显式关联以指导精确变形。此外,为了弥合固定模板与任意目标视图之间的差异,我们提出一个视图自适应特征聚合模块。该模块利用多视图模板特征及其对应的相机姿态来丰富规范模板表示,确保无论目标视角如何都能实现鲁棒的特征对齐。大量实验表明,我们的方法在处理大形状变化和多样化视角方面显著优于最先进的方法,展现出对新颖类别的强泛化能力,并有效支持下游真实世界的灵巧机器人操作任务。项目主页:https://GODeform.github.io/

英文摘要

Monocular 3D shape recovery is fundamental to geometric understanding, yet achieving robust generalization across arbitrary viewpoints and unseen object categories remains a significant challenge. In this paper, we present a generalizable deformation learning framework that reconstructs 3D objects by explicitly deforming a category-level shape template to match the target observation. To address complex shape variations between the template and the target, we introduce a geometry-guided feature modeling mechanism. This process first enriches foundation features with template topology to yield a geometry-aware representation, which is then explicitly correlated with the target observation to guide precise deformation. Furthermore, to bridge the disparity between the fixed template and arbitrary target views, we propose a view-adaptive feature aggregation module. This module leverages multi-view template features and their corresponding camera poses to enrich the canonical template representation, ensuring robust feature alignment regardless of the target's perspective. Extensive experiments demonstrate that our approach significantly outperforms state-of-the-art methods in handling large shape variations and diverse viewpoints, exhibiting strong generalization to novel categories and effectively supporting downstream real-world dexterous robotic manipulation tasks. Project homepage: https://GODeform.github.io/

2605.28166 2026-06-03 cs.LG cs.AI

QuITE: Query-Based Irregular Time Series Embedding

QuITE: 基于查询的不规则时间序列嵌入

Junghoon Lim

AI总结 提出一种即插即用的嵌入模块QuITE,通过可学习查询令牌聚合不规则观测值,无需插值或修改架构,显著提升多变量时间序列模型的预测和分类性能。

Comments ICML 2026

详情
AI中文摘要

不规则多变量时间序列在实践中很常见,但其不规则采样给有效建模带来了困难。现有方法通常要么(i)设计专门架构,限制了经过验证的多变量时间序列模型的复用,要么(ii)通过插值将不规则时间序列映射到规则时间网格,这可能会引入人工值从而扭曲时间动态。为解决这些限制,我们提出了一种新的基于输入嵌入的方法。我们发现关键瓶颈不在于主干架构,而在于假设均匀采样的传统嵌入层。在这项工作中,我们引入了QuITE(基于查询的不规则时间序列嵌入),一种简单而有效的即插即用嵌入模块。QuITE使用可学习查询令牌通过单层自注意力聚合不规则观测值,直接生成与主干兼容的潜在表示,无需生成人工值或修改架构。在真实世界基准上的大量实验表明,QuITE持续改进多变量时间序列模型,在不同数据集和主干架构上,预测任务平均相对提升高达54.7%,分类任务平均相对提升高达15.8%。代码可在 https://github.com/Meaningfull9502/QuITE 获取。

英文摘要

Irregular Multivariate Time Series (IMTS) are common in practice, yet their irregular sampling complicates effective modeling. Existing approaches typically either (i) design specialized architectures that limit the reuse of proven Multivariate Time Series (MTS) models, or (ii) map IMTS onto regular temporal grids through interpolation, which may distort temporal dynamics by introducing artificial values. To address these limitations, we propose a new input-embedding-based approach. We identify that the key bottleneck lies not in the backbone architecture, but in conventional embedding layers that assume uniform sampling. In this work, we introduce QuITE (Query-Based Irregular Time Series Embedding), a simple yet effective plug-and-play embedding module for IMTS. QuITE employs learnable query tokens to aggregate irregular observations through a single self-attention layer, directly producing backbone-compatible latent representations without artificial value generation or architectural modification. Extensive experiments on real-world benchmarks show that QuITE consistently improves MTS models, yielding average relative gains of up to $54.7\%$ in forecasting and $15.8\%$ in classification across diverse datasets and backbone architectures. Code is available at: https://github.com/Meaningfull9502/QuITE.

2605.25051 2026-06-03 cs.RO

A Decentralized LiDAR-SLAM System with Certifiably Optimal Pose Graph Optimization

一种具有可认证最优位姿图优化的去中心化LiDAR-SLAM系统

Baoshan Song, Feng Huang, Li-Ta Hsu

AI总结 针对多机器人去中心化LiDAR-SLAM全局一致性问题,提出首个集成可认证最优位姿图优化后端的系统,利用黎曼块坐标下降算法实现全局一致轨迹估计,无需精确初始猜测,轨迹RMSE相比DiSCo-SLAM最高降低48.9%。

Comments In Proceedings of the IEEE International Conference on Robotics & Automation (ICRA'26) 1st Workshop on Robot Meets GNSS and Ranging for Seamless Autonomy, Vienna, Austria, Jun. 5, 2026

详情
AI中文摘要

去中心化多机器人LiDAR-SLAM对于协作任务至关重要,但在保持全局一致性方面面临重大挑战。现有框架主要依赖局部搜索优化或一次性坐标对齐,容易导致次优收敛和长期不一致,尤其是在大规模或退化环境中。为解决这些局限性,本文提出了首个集成最先进的可认证最优位姿图优化(PGO)后端的去中心化LiDAR-SLAM系统。通过利用黎曼块坐标下降(RBCD)算法,我们的系统无需精确初始猜测即可确保全局一致的轨迹估计。实验结果表明,所提出的框架实现了卓越的鲁棒性,与最先进的DiSCo-SLAM相比,轨迹RMSE最高改善了48.9%。

英文摘要

Decentralized multi-robot LiDAR-SLAM is essential for collaborative missions but faces significant challenges in maintaining global consistency. Existing frameworks predominantly rely on local-search optimization or one-time coordinate alignment, which are prone to suboptimal convergence and long-term inconsistency, especially in large-scale or degenerate environments. To address these limitations, this paper presents the first decentralized LiDAR-SLAM system that integrates a state-of-the-art certifiably optimal Pose Graph Optimization (PGO) backend. By leveraging the Riemannian Block Coordinate Descent (RBCD) algorithm, our system ensures globally consistent trajectory estimation without requiring accurate initial guesses. Experimental results demonstrate that the proposed framework achieves superior robustness, improving trajectory RMSE by up to 48.9% compared to the state-of-the-art DiSCo-SLAM.

2603.05691 2026-06-03 cs.LG stat.ML

Improved Scaling Laws via Weak-to-Strong Generalization in Random Feature Ridge Regression

随机特征岭回归中弱到强泛化的改进缩放定律

Diyuan Wu, Lehan Chen, Theodor Misiakiewicz, Marco Mondelli

AI总结 本文通过随机特征岭回归的确定性等价分析,揭示了弱教师训练的强学生模型在偏差主导和方差主导场景下均能改进缩放定律,甚至达到极小极大最优率。

详情
AI中文摘要

在机器学习中,使用学习模型标记数据,然后用这些数据训练更强大的模型变得越来越常见。弱到强泛化现象体现了这种两阶段过程的优势:强学生模型在由弱教师模型获得的不完美标签上训练,但强学生模型的表现优于弱教师模型。在本文中,我们展示了这种潜在改进是显著的,因为它影响了测试误差所遵循的缩放定律。具体来说,我们考虑通过随机特征岭回归(RFRR)训练的学生和教师模型。我们的主要技术贡献是推导出学生模型在教师模型获得的标签上训练时的超额测试误差的确定性等价。通过这个确定性等价,我们识别出学生模型的缩放定律相对于教师模型得到改进的区域,揭示了这种改进可以在偏差主导和方差主导的设置中实现。引人注目的是,无论教师模型的缩放定律如何,学生模型都可能达到极小极大最优率——事实上,即使教师模型的测试误差不随样本量衰减,这一结论也成立。

英文摘要

It is increasingly common in machine learning to use learned models to label data and then employ such data to train more capable models. The phenomenon of weak-to-strong generalization exemplifies the advantage of this two-stage procedure: a strong student is trained on imperfect labels obtained from a weak teacher, and yet the strong student outperforms the weak teacher. In this paper, we show that the potential improvement is substantial, in the sense that it affects the scaling law followed by the test error. Specifically, we consider students and teachers trained via random feature ridge regression (RFRR). Our main technical contribution is to derive a deterministic equivalent for the excess test error of the student trained on labels obtained via the teacher. Via this deterministic equivalent, we then identify regimes in which the scaling law of the student improves upon that of the teacher, unveiling that the improvement can be achieved both in bias-dominated and variance-dominated settings. Strikingly, the student may attain the minimax optimal rate regardless of the scaling law of the teacher -- in fact, when the test error of the teacher does not even decay with the sample size.

2605.15806 2026-06-03 cs.LG

Martingale Neural Operators: Learning Stochastic Marginals via Doob-Meyer Factorization

鞅神经算子:通过Doob-Meyer分解学习随机边际分布

Kai Hidajat

AI总结 提出鞅神经算子(MNO),利用Doob-Meyer定理将随机偏微分方程的边际分布分解为可预测漂移和鞅部分,直接预测条件均值和协方差,在多种任务上显著降低Wasserstein距离并提升效率。

详情
AI中文摘要

神经算子作为确定性代理表现出色,但在应用于随机偏微分方程时不可避免地坍缩到条件均值,丢弃了不确定性量化所依赖的方差和尾部结构。恢复这种结构通常需要蒙特卡洛滚动或嫁接的生成模型,两者都放弃了定义算子范式的单次效率和分辨率不变性。为解决此问题,我们借鉴Doob-Meyer定理,该定理确立了任何半鞅从根本上分解为一个可预测漂移和一个不可预测的零均值鞅。将该定理转化为架构先验,我们引入了鞅神经算子(MNO)。MNO将初始条件直接映射到终端律的条件均值和协方差,参数化为类似漂移的均值和低秩因子$B_ϕ$,其中$B_ϕ^\top B_ϕ$通过构造是半正定的。在我们的实验中,我们使用高斯残差实例化。在一维SPDE、粗糙波动率和二维算子任务中,MNO在$ϕ^4$场论上将Wasserstein距离减少高达$120$倍,在随机Burgers方程上减少$68$倍,在匹配的壁钟训练预算下,评估速度比条件扩散基线快约$3$倍。在二维任务上,MNO在零样本分辨率转移和湍流方面与FNO相当,而准确定性系统(如Gray-Scott)仍然是失败模式。

英文摘要

Neural operators excel as deterministic surrogates, but inevitably collapse to the conditional mean when applied to stochastic PDEs, discarding the variance and tail structure upon which uncertainty quantification depends. Recovering this structure typically requires Monte Carlo rollouts or grafted generative models, both of which surrender the one-shot efficiency and resolution invariance that define the operator paradigm. To resolve this, we draw on the Doob-Meyer theorem, which establishes that any semimartingale fundamentally decomposes into a predictable drift and an unpredictable, zero-mean martingale. Translating this theorem into an architectural prior, we introduce the Martingale Neural Operator (MNO). MNO maps an initial condition directly to the conditional mean and covariance of the terminal law, parameterized by a drift-like mean and a low-rank factor $B_ϕ$ with $B_ϕ^\top B_ϕ$ positive semi-definite by construction. For our experiments, we use a Gaussian residual instantiation. Across 1D SPDEs, rough volatility, and 2D operator tasks, MNO reduces Wasserstein distance by up to $120\times$ on $ϕ^4$ field theory and $68\times$ on stochastic Burgers, evaluating $\sim 3\times$ faster than a conditional diffusion baseline at matched wall-clock training budgets. On 2D tasks, MNO is comparable to FNO on zero-shot resolution transfer and turbulent flow, while quasi-deterministic systems such as Gray-Scott remain a failure mode.
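The claim that \(B_ϕ^\top B_ϕ\) is positive semi-definite by construction follows from \(x^\top B^\top B x = \lVert Bx \rVert^2 \ge 0\) for every vector \(x\). The pure-Python sketch below checks this numerically for a random low-rank factor. The sizes `r` and `d`, the random initialization, and the helper names are illustrative assumptions and say nothing about how MNO actually parameterizes \(B_ϕ\).

```python
import random


def gram(B):
    """Return Sigma = B^T B for a factor B given as r rows of length d."""
    d = len(B[0])
    return [[sum(row[i] * row[j] for row in B) for j in range(d)] for i in range(d)]


def quad_form(S, x):
    """Return x^T S x for a square matrix S and vector x."""
    n = len(x)
    return sum(x[i] * S[i][j] * x[j] for i in range(n) for j in range(n))


random.seed(0)
r, d = 2, 5  # rank r < d, so Sigma is singular but still PSD
B = [[random.gauss(0.0, 1.0) for _ in range(d)] for _ in range(r)]
Sigma = gram(B)

# The quadratic form equals ||Bx||^2, so it cannot be negative
# (up to floating-point rounding).
samples = [[random.gauss(0.0, 1.0) for _ in range(d)] for _ in range(100)]
min_value = min(quad_form(Sigma, x) for x in samples)
```

No projection or eigenvalue clipping is needed to keep the covariance valid, which is the practical appeal of this parameterization.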

2605.08747 2026-06-03 cs.AI

Done, But Not Sure: Disentangling World Completion from Self-Termination in Embodied Agents

完成,但不确定:具身智能体中世界完成与自我终止的解耦

Ying Chen, Lihuang Fang, Rui Jiang, Mingxu Wang, Zhifeng Gu, Lei Yi, Jie Chen

AI总结 提出VIGIL评估框架,通过分离世界状态完成和终端报告正确性,独立衡量智能体的终止承诺能力,并揭示不同模型在相似完成率下终端报告准确性的显著差异。

详情
AI中文摘要

标准的具身评估不会独立评分智能体在情节结束时是否正确承诺任务完成,我们将这种能力称为终端承诺。行为上不同的失败——从未完成任务、完成但未能停止、以及在没有足够证据的情况下报告成功——都归为相同的基准失败。我们引入了VIGIL,一个使终端承诺可独立测量的评估框架。在VIGIL的默认协议下,智能体仅观察以自我为中心的RGB,不接收动作成功信号,并且必须通过语义报告结束每个情节,该报告会确定性地与隐藏的世界状态进行核对。这产生了两个独立的分数:世界状态完成(W)和基准成功(B),其中B额外要求正确的终端报告。这种解耦使得四种结果类别可区分:未执行、达成后漂移、无依据承诺和验证成功。在1000个冻结情节上对20个模型进行评估,具有可比W的系统在B上相差高达19.7个百分点:一个模型将实现的状态转换为正确的报告,而另一个具有几乎相同执行能力的模型在目标处漂移而未关闭。动作反馈干预进一步测试了这种分离:执行导向的信号广泛改善了W,但在那些尚未将终端报告基于已实现状态的模型中,承诺失败仍然存在。VIGIL提供了一个使终端承诺独立可见和可评分的协议。

英文摘要

Standard embodied evaluations do not independently score whether an agent correctly commits to task completion at episode closure, a capacity we call terminal commitment. Behaviorally distinct failures--never completing the task, completing it but failing to stop, and reporting success without sufficient evidence--collapse into the same benchmark failure. We introduce VIGIL, an evaluation framework that makes terminal commitment independently measurable. Under VIGIL's default protocol, agents observe only egocentric RGB, receive no action-success signals, and must end each episode with a semantic report checked deterministically against hidden world state. This yields two separate scores: world-state completion (W) and benchmark success (B), where B additionally requires a correct terminal report. This decoupling makes four outcome categories distinguishable: missed execution, post-attainment drift, unsupported commitment, and verified success. Across 20 models on 1,000 frozen episodes, systems with comparable W differ by up to 19.7 pp in B: one model converts achieved states into correct reports, while another with near-identical execution drifts past the goal without closing. An action-feedback intervention further tests the separation: execution-oriented signals improve W broadly, yet commitment failures persist in models that do not already ground terminal reports in the achieved state. VIGIL provides a protocol that makes terminal commitment independently visible and scorable.
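The decoupling of W and B can be sketched as a small classifier over two booleans per episode. This is a simplified reading of the abstract, not VIGIL's implementation: it assumes each episode reduces to whether the hidden world state satisfies the goal (`world_done`) and whether the agent closed with a success report (`reported_success`), and it maps the four combinations onto the four named outcome categories.

```python
from collections import Counter


def classify(world_done: bool, reported_success: bool) -> str:
    """Map one episode onto the four outcome categories (assumed mapping)."""
    if world_done and reported_success:
        return "verified_success"
    if world_done:
        return "post_attainment_drift"   # goal reached, episode never closed correctly
    if reported_success:
        return "unsupported_commitment"  # success claimed, world state disagrees
    return "missed_execution"


def score(episodes):
    """episodes: non-empty list of (world_done, reported_success) pairs.

    Returns (W, B, counts): W is the world-state completion rate, B is the
    rate of episodes that are both completed and correctly reported.
    """
    n = len(episodes)
    counts = Counter(classify(w, r) for w, r in episodes)
    W = (counts["verified_success"] + counts["post_attainment_drift"]) / n
    B = counts["verified_success"] / n
    return W, B, counts
```

With this mapping, B can never exceed W, and the gap between them is exactly the post-attainment drift rate, which is how two systems with comparable W can differ widely in B.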

2605.10328 2026-06-03 cs.CL

ANCHOR: Abductive Network Construction with Hierarchical Orchestration for Reliable Probability Inference in Large Language Models

ANCHOR: 基于层次编排的溯因网络构建用于大型语言模型中可靠概率推断

Wentao Qiu, Guanran Luo, Zhongquan Jian, Jingqi Gao, Meihong Wang, Qingqiang Wu

AI总结 针对LLM概率估计中因子空间稀疏导致“未知”预测和因子增多引入噪声的问题,提出ANCHOR框架,通过层次因子空间构建、层次检索与精炼以及因果贝叶斯网络增强朴素贝叶斯,显著减少未知预测并提高概率估计可靠性。

Comments Accepted by ICML 2026

详情
AI中文摘要

在不完全信息下的大规模决策中,一个核心挑战是估计可靠的概率。最近的方法使用大型语言模型(LLM)生成解释因子和粗粒度概率估计,然后通过朴素贝叶斯模型在因子组合上进行精炼。然而,稀疏的因子空间常常产生“未知”预测,而扩展因子会增加噪声和虚假相关性,削弱条件独立性并降低可靠性。为了解决这些局限性,我们提出了ANCHOR,一个在层次因子空间上的聚合贝叶斯推断框架。它通过迭代生成和聚类构建密集的因子层次结构,通过层次检索和精炼映射上下文,并通过因果贝叶斯网络增强朴素贝叶斯以建模潜在因子依赖关系。实验表明,ANCHOR显著减少了“未知”预测,并产生了比直接LLM基线更可靠的概率估计,实现了最先进的性能,同时大幅减少了时间和令牌开销。

英文摘要

A central challenge in large-scale decision-making under incomplete information is estimating reliable probabilities. Recent approaches use Large Language Models (LLMs) to generate explanatory factors and coarse-grained probability estimates, which are then refined by a Naïve Bayes model over factor combinations. However, sparse factor spaces often yield "unknown" predictions, while expanding factors increases noise and spurious correlations, weakening conditional independence and degrading reliability. To address these limitations, we propose ANCHOR, an aggregated Bayesian inference framework over a hierarchical factor space. It constructs dense factor hierarchies through iterative generation and clustering, maps contexts via hierarchical retrieval and refinement, and augments Naïve Bayes with a Causal Bayesian Network to model latent factor dependencies. Experiments show that ANCHOR markedly reduces "unknown" predictions and produces more reliable probability estimates than direct LLM baselines, achieving state-of-the-art performance while significantly reducing time and token overhead.