arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.17553 2026-06-17 cs.LG New submission

SpatioTemporal Causal Network Diagnostics for Geographic Tipping Point Early Warning

Zhaoyuan Yu, Zhangyong Liang

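Affiliations: Jiangsu Center for Collaborative Innovation in Geographical Information Resource Development and Application; National Center for Applied Mathematics, Tianjin University

AI summary: Proposes the SpatioTemporal Causal Network Diagnostics (ST-CND) framework, which builds a data-driven directed causal network and combines local recovery-rate estimation with vulnerable-subnetwork identification to address spatial dilution, Euclidean assumptions, and correlated noise in early warning of geographic tipping points; it reaches an AUROC of 0.783 on the AMOC task.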

Abstract

Geographic tipping points in ecosystems, climate subsystems, or ice sheets pose severe challenges for localized early warning. Classical spatial indicators such as Moran's I summarize global spatial structure, but they struggle with three issues: spatial dilution, Euclidean assumptions, and correlated noise. This paper introduces SpatioTemporal Causal Network Diagnostics (ST-CND), a framework that addresses these three issues by representing the geographic field as a time-evolving directed causal network. The core workflow is as follows: (1) infer which spatial nodes help predict other nodes via transfer entropy, replacing fixed Euclidean neighborhoods with data-driven information-flow topology; (2) estimate local recovery rates within each candidate subnetwork via dynamic mode decomposition; and (3) identify the most vulnerable subnetwork by combining three signals, namely high internal fluctuation, high internal synchronization, and low external coupling, thereby suppressing false alarms from spatially correlated noise. Validated on synthetic bifurcations and two observational sea-surface temperature benchmarks, namely Indo-Pacific SST and North Atlantic AMOC, ST-CND delivers localized and interpretable warnings. On the AMOC task, it achieves an AUROC of 0.783 and a critical-subnetwork IoU of 0.378, outperforming recurrence-network and lambda-AR1 baselines. The framework provides an interpretable and scalable pipeline for spatial early warning in Earth system science.
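For readers unfamiliar with the classical indicator that the abstract contrasts ST-CND with, the following is a minimal sketch of global Moran's I for a single snapshot of a spatial field. It is not part of ST-CND; the chain-shaped neighbourhood in `chain_weights` and the toy values are assumptions made purely for illustration.

```python
import numpy as np

def morans_i(values, weights):
    """Global Moran's I for one snapshot of a spatial field."""
    x = np.asarray(values, dtype=float)
    w = np.asarray(weights, dtype=float)
    dev = x - x.mean()
    total_weight = w.sum()
    denom = (dev ** 2).sum()
    if total_weight == 0 or denom == 0:
        raise ValueError("Moran's I is undefined for zero weights or a constant field")
    # I = (N / W) * sum_ij w_ij d_i d_j / sum_i d_i^2
    return (x.size / total_weight) * (dev @ w @ dev) / denom

def chain_weights(n):
    """Hypothetical neighbourhood: nodes on a line, adjacent nodes get weight 1."""
    w = np.zeros((n, n))
    idx = np.arange(n - 1)
    w[idx, idx + 1] = 1.0
    w[idx + 1, idx] = 1.0
    return w

# A smooth gradient is positively autocorrelated, an alternating field negatively.
smooth = morans_i([1.0, 2.0, 3.0, 4.0], chain_weights(4))
alternating = morans_i([1.0, -1.0, 1.0, -1.0], chain_weights(4))
```

Because the statistic collapses the whole field into one number computed over a fixed neighbourhood matrix, a change confined to a few nodes has little effect on it, which is the spatial dilution issue the abstract mentions.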

2606.17551 2026-06-17 cs.LG cs.AI New submission

Reversal Q-Learning

Aditya Oberai, Seohong Park, Sergey Levine

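Affiliations: University of California, Berkeley

AI summary: Proposes the Reversal Q-learning (RQL) algorithm, which builds on the expanded MDP framework, generates virtual on-policy trajectories by reversing flows, and applies a bias-and-variance reduction technique to perform offline reinforcement learning with flow policies, achieving the best average performance on 50 robotic tasks.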

Abstract

Iterative generative modeling techniques, such as flow matching, provide powerful tools to model complex behaviors for effective offline reinforcement learning (RL). In this work, we propose a new off-policy RL algorithm that trains a flow policy based on prior data. Our idea starts from the "expanded" Markov decision process (MDP) framework, which treats individual flow refinement steps as separate actions in an MDP. To enable off-policy RL within this framework, we apply two techniques: we generate virtual on-policy trajectories (by "reversing" flows) to make this framework compatible with prior data, and we apply a bias-and-variance reduction technique to mitigate the curse of horizon in off-policy RL. We call the resulting algorithm Reversal Q-learning (RQL). RQL has several advantages over previous flow-based RL methods: it does not suffer from backpropagation through time, makes better use of the learned value function, and directly trains the full, expressive flow policy. Through our experiments on 50 challenging simulated robotic tasks, we show that RQL leads to the best average offline RL performance compared to state-of-the-art flow-based offline RL algorithms.

2606.17546 2026-06-17 cs.AI New submission

SEAGym: An Evaluation Environment for Self-Evolving LLM Agents

Congjie Zheng, Chuanyi Xue, Bin Liang, Jun Yang, Changshui Zhang

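Affiliations: Department of Automation, Tsinghua University; Beijing National Research Center for Information Science and Technology (BNRist), Tsinghua University

AI summary: Proposes the SEAGym evaluation environment, which measures agent harness updates across training, validation, test, replay, and cost records, revealing whether an update yields reusable improvement, overfits, raises cost, or degrades older behavior.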

Abstract

Self-evolving LLM-based agents improve mainly by changing their agent harness: the structured execution layer around a base model, including prompts, memory, tools, middleware, runtime state, and the model-tool interaction loop. Existing evaluations often reduce this process to isolated task scores or a single sequential curve, obscuring whether an update produces reusable improvement, overfits recent tasks, increases cost, or harms older behavior. We introduce SEAGym, an evaluation environment for measuring agent harness updates across training, validation, test, replay, and cost records. SEAGym turns Harbor-compatible benchmarks into dynamic self-evolution task sources with train batches, frozen update-validation, held-out ID and OOD transfer views, replay diagnostics, and saved snapshot and metric records. Instantiating SEAGym on Terminal-Bench 2.0 and HLE, we compare ACE, TF-GRPO, and AHE under a shared epoch/batch protocol. The results show that these evaluation views provide complementary signals about the evolution process: frequent updates may fail to improve held-out performance, useful intermediate snapshots may collapse later, and source diversity and model backend can affect harness reliability.

2606.17545 2026-06-17 cs.LG q-fin.CP q-fin.PR New submission

Continuous-time Optimal Stopping through Deep Reinforcement Learning

Cosmin Borsa, Michael Ludkovski

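Affiliations: Department of Statistics & Applied Probability, UC Santa Barbara

AI summary: Proposes the CARLOS algorithm, which uses an aggregate deep neural network to learn stopping rules at arbitrarily fine time resolution through progressive time-grid refinement and adaptive sampling, approaching the American option price upper bound.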

Comments 33 pages

Abstract

Simulation-based solvers for optimal stopping problems must discretize the stopping decision. Under classical dynamic programming, a coarse exercise grid with only a few stopping opportunities can materially undervalue the optimal expected reward, whereas on a very fine grid, approximation errors accumulate through the backward recursion. To remove this limitation, we develop a new reinforcement-learning-inspired algorithm that enables us to learn the exercise rule at arbitrarily fine time resolution. Our CARLOS (Continuous-time Adaptive Reinforcement Learning for Optimal Stopping) algorithm utilizes an aggregate deep neural network (ADNN) to learn a joint space-time decision boundary. Starting from a coarse time grid, we progressively increase the frequency of stopping opportunities, while in parallel training the ADNN to refine its timing-value estimates. We also design an adaptive sampling strategy that gradually concentrates training effort near the stopping boundary. Benchmarked results show that CARLOS delivers higher prices than existing Bermudan solvers, approaching the American upper bound, and achieves high computational efficiency relative to non-RL comparators.
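The claim that a coarse exercise grid undervalues the optimal reward can be illustrated with a small example. The sketch below is not CARLOS and involves no learning: it prices a put on a standard Cox-Ross-Rubinstein binomial tree and allows early exercise only at a chosen set of tree steps. The function name `bermudan_put_crr` and all parameter values are assumptions chosen for illustration.

```python
import math

def bermudan_put_crr(s0, strike, rate, sigma, maturity, n_steps, exercise_steps):
    """Put price on a CRR tree where early exercise is allowed only at exercise_steps."""
    dt = maturity / n_steps
    up = math.exp(sigma * math.sqrt(dt))
    down = 1.0 / up
    disc = math.exp(-rate * dt)
    p_up = (math.exp(rate * dt) - down) / (up - down)
    allowed = set(exercise_steps)

    # Payoff at maturity; node j has seen j up-moves.
    values = [max(strike - s0 * up**j * down**(n_steps - j), 0.0)
              for j in range(n_steps + 1)]
    # Backward recursion: continuation value, optionally capped below by exercise.
    for step in range(n_steps - 1, -1, -1):
        new_values = []
        for j in range(step + 1):
            value = disc * (p_up * values[j + 1] + (1.0 - p_up) * values[j])
            if step in allowed:
                exercise = max(strike - s0 * up**j * down**(step - j), 0.0)
                value = max(value, exercise)
            new_values.append(value)
        values = new_values
    return values[0]

# Same tree, three nested sets of stopping opportunities (hypothetical parameters).
args = (100.0, 100.0, 0.05, 0.2, 1.0, 64)
european = bermudan_put_crr(*args, exercise_steps=[])
quarterly = bermudan_put_crr(*args, exercise_steps=[16, 32, 48])
every_step = bermudan_put_crr(*args, exercise_steps=range(1, 64))
```

Because the three exercise sets are nested, the prices are ordered `european <= quarterly <= every_step`; the gap between the coarse and fine grids is the undervaluation the abstract refers to. On a tree the fine grid is cheap, whereas in simulation-based solvers each extra exercise date adds a regression step whose error propagates backward.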

2606.17542 2026-06-17 cs.CL New submission

Evaluating Large Language Models Abilities for Addressee, Turn-change, and Next Speaker Prediction in Meetings

Ryo Fukuda, Takatomo Kano, Siddhant Arora, Marc Delcroix, Naohiro Tawara, Atsunori Ogawa, Yuya Chiba, Atsushi Ando, William Chen, Shinji Watanabe

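Affiliations: NTT, Inc., Japan; Language Technologies Institute, Carnegie Mellon University, USA

AI summary: Uses large language models (LLMs) to study turn-taking in multimodal multi-party conversations and builds an evaluation framework for three tasks: addressee detection, turn-change prediction, and next speaker prediction. Experiments show that LLMs outperform supervised models and humans on next speaker prediction, while a multimodal LLM remains below human performance on addressee detection and turn-change prediction.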

Comments Accepted to INTERSPEECH 2026

Abstract

We investigate turn-taking in multimodal multi-party conversations using large language models (LLMs). We construct an evaluation framework for three tasks: addressee detection, turn-change prediction, and next speaker prediction. We compare supervised models trained for these tasks, text-based LLMs, multimodal LLMs (MM-LLMs), and human subjects. Experiments on the AMI corpus showed that LLMs outperformed supervised models and humans in next speaker prediction, despite not being trained on the target domain and without access to audio or visual information. An MM-LLM performed better than text-based LLMs on addressee detection and turn-change prediction but remained below human performance, indicating difficulty leveraging raw audio-visual signals. Ablation analyses revealed that conversational context was critical, particularly for next speaker prediction. We observed that human and LLM prediction patterns were similar, and intervals with frequent turn changes were difficult for both.

2606.17541 2026-06-17 cs.LG cs.AI New submission

Offline Preference-Based Trajectory Evaluation

Fernando Diaz

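Affiliations: Carnegie Mellon University

AI summary: Addresses the statistical inefficiency of offline evaluation that relies only on terminal success by proposing preference-based trajectory evaluation, which compares trajectories through temporal preferences to reduce ties and improve discriminative power, ranking stability, and data efficiency.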

Abstract

Offline evaluation of agentic systems often collapses trajectories to terminal success, discarding information about partial progress and inducing widespread ties, creating substantial statistical inefficiency by reducing effective sample size and weakening the ability to distinguish systems. We propose preference-based trajectory evaluation, which compares trajectories directly through temporal preferences over progress and time-to-return profiles. We find that, across diverse agentic and interactive benchmarks, standard success-based metrics produce tied comparisons on roughly 75% of instances, whereas trajectory-aware preferences reduce ties to roughly 35%, improving discriminative power, ranking stability, and data efficiency. Our results suggest that benchmark saturation, often attributed to poor data collection or problem difficulty, may also be explained by the choice of evaluation measure.
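The effect of ties on paired comparisons can be seen in a toy sketch. The abstract does not specify the preference relation, so the tie-breaking order used in `trajectory_preference` below (terminal success, then furthest progress, then fewer steps) is a hypothetical stand-in rather than the paper's definition, and the trajectories are made-up data.

```python
def success_preference(a, b):
    """Return +1 if a is preferred, -1 if b is, 0 on a tie, using terminal success only."""
    return (a["success"] > b["success"]) - (a["success"] < b["success"])

def trajectory_preference(a, b):
    """Hypothetical tie-breaker: success, then furthest progress, then fewer steps."""
    key_a = (a["success"], max(a["progress"]), -len(a["progress"]))
    key_b = (b["success"], max(b["progress"]), -len(b["progress"]))
    return (key_a > key_b) - (key_a < key_b)

def tie_rate(system_a, system_b, preference):
    """Fraction of per-instance comparisons that end in a tie."""
    outcomes = [preference(a, b) for a, b in zip(system_a, system_b)]
    return outcomes.count(0) / len(outcomes)

# Made-up trajectories for two systems on the same four task instances.
system_a = [
    {"success": 0, "progress": [0.2, 0.5]},
    {"success": 1, "progress": [0.5, 1.0]},
    {"success": 0, "progress": [0.1]},
    {"success": 1, "progress": [0.3, 0.6, 1.0]},
]
system_b = [
    {"success": 0, "progress": [0.1, 0.2]},
    {"success": 1, "progress": [1.0]},
    {"success": 0, "progress": [0.1]},
    {"success": 0, "progress": [0.4]},
]

success_ties = tie_rate(system_a, system_b, success_preference)        # 0.75
trajectory_ties = tie_rate(system_a, system_b, trajectory_preference)  # 0.25
```

In this toy data, three of the four instances tie under terminal success, so only one comparison carries information; using the progress profile leaves a single tie. Fewer ties means more informative pairs per evaluated instance, which is the effective-sample-size argument made in the abstract.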

2606.17540 2026-06-17 cs.CV New submission

TaFD: Threat-Aware Frequency Decoupling for Adversarial Robustness against Heterogeneous Attacks

Mengda Xie, Yiling He, Meie Fang

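Affiliations: School of Computer Science and Cyber Engineering, Guangzhou University; Information Security Research Group, University College London

AI summary: Targets gradient incompatibility in joint adversarial training under heterogeneous attacks by proposing the Threat-aware Frequency Decoupling (TaFD) framework, which enforces structural parameter separation through a frequency-domain divide-and-conquer strategy and improves average robust accuracy by about 11% across several benchmarks.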

Abstract

Multi-threat robustness remains a fundamental challenge in deep learning. Although joint adversarial training (JAT) is widely adopted, it suffers from negative transfer under heterogeneous threats, particularly between $\ell_p$-bounded and semantic attacks. Through first-order gradient analysis, we formalize this as gradient incompatibility and theoretically establish the necessity of decoupled optimization. We further reveal that these conflicting threats exhibit separable spectral characteristics in the frequency domain. Motivated by this observation, we propose Threat-aware Frequency Decoupling (TaFD), a two-stage defense framework that reformulates JAT as a frequency-domain divide-and-conquer paradigm. TaFD first discovers latent threat domains via unsupervised clustering of attack spectral prototypes and trains a lightweight classifier for inference-time threat domain identification. Conditioned on the prediction, TaFD employs a Frequency-Conditional Convolution that learns threat-domain-specific spectral masks and routes each sample to the corresponding expert, enforcing structural parameter separation and alleviating optimization conflicts. We validate TaFD on three representative image-classification benchmarks (CIFAR-10, CIFAR-100, and Tiny-ImageNet) and on two representative architectures (the convolutional ResNet and the hybrid-transformer MobileViT). Extensive results demonstrate that TaFD achieves more balanced robustness against heterogeneous attacks than existing JAT and frequency-domain baselines, improving average robust accuracy by approximately 11% over the strongest baseline while maintaining leading clean accuracy.

2606.17539 2026-06-17 cs.CV cs.AI New submission

Reinforcing Dual-Path Reasoning in Spatial Vision Language Models

Yatai Ji, An-Chieh Cheng, Yang Fu, Yukang Chen, Han Zhang, Zhaojing Yang, Wei Huang, Ka Chun Cheung, Song Han, Vidya Nariyambut Murali, Pavlo Molchanov, Jan Kautz, Simon See, Hongxu Yin, Ping Luo, Sifei Liu

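Affiliations: The University of Hong Kong; NVIDIA; University of California, San Diego

AI summary: Proposes the SR-REAL framework, which uses reinforcement learning to combine two reasoning paths, language-only reasoning and 3D detect-then-reason, substantially improving spatial VLM performance on complex geometric reasoning tasks.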

Abstract

Spatial VLMs have made substantial progress in geometric perception, yet complex spatial reasoning requiring multi-step inference over depth, distance, and scene relations remains challenging. Moreover, different spatial queries call for fundamentally different strategies: some are best addressed through purely linguistic, step-by-step deduction, while others require explicit 3D grounding before quantitative inference. We present Dual-Path Spatial Reasoning via Reinforcement Learning for Spatial VLMs (SR-REAL), a unified framework that equips a spatial VLM with two complementary reasoning paths: Language-Only Reasoning (LOR), which performs step-by-step linguistic deduction, and Detect-Then-Reason (DTR), which detects 3D geometric cues (e.g., centers or bounding boxes) via region tokens before explicit geometric inference. SR-REAL begins with a cold-start supervised fine-tuning stage that constructs LOR and DTR chain-of-thought supervision and exposes a region-to-3D interface, followed by RL that optimizes the policy model with accuracy and format rewards; for DTR, a discrete center-based detection reward further refines geometric alignment. Across diverse spatial benchmarks, SR-REAL significantly outperforms spatial VLM baselines: (i) a single RL-trained model supports both reasoning paths, with DTR excelling in region-aware tasks through precise 3D localization and LOR enhancing general spatial reasoning; (ii) jointly training both paths fosters mutual reinforcement; (iii) high-quality, blended cold-start data is crucial for stable RL optimization; and (iv) the model generalizes across datasets and domains without per-task tuning, demonstrating positive transfer between LOR and DTR.

2606.17536 2026-06-17 cs.CV cs.AI New submission

OmniDrive: An LLM-Choreographed Multi-Agent World Model with Unified Latent Co-Compression for Multi-View Driving Video Generation

Zijie Meng, Yufei Liu, Chengqian Ma, Zhiyu Li, Jiyuan Liu, Wenhua Nie, Bingcai Wei, Shuqin Chen, Weichen Xu, Jiquan Yuan, Miao Zhang

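Affiliations: Peking University; Xiamen University; Korea Advanced Institute of Science and Technology (KAIST); National Taiwan University; Wuhan University; Wuhan University of Technology; Tsinghua University; Jimei University

AI summary: Proposes DRIVE-CHOREO, an LLM-choreographed multi-agent world model in which three Qwen2.5-VL agents jointly author a position-aware latent token sequence that is co-compressed with the multi-view video through a view-time permutation and a 3D VAE, enabling controllable multi-view video generation; on nuScenes it reaches state-of-the-art multi-view consistency and a BEV mAP of 21.6.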

Comments 24 pages, 10 figures

Abstract

Generative world models for autonomous driving face two unresolved tensions: heterogeneous control injection, where free-form language, HD-maps, trajectories, and camera poses reside in incompatible representational spaces, and post-hoc cross-view fusion, where per-camera latents fail to encode global 3-D geometry. We trace both to a single root cause: the absence of a shared symbolic interlingua aligning language, geometry, and pixels at the latent-token level. We present DRIVE-CHOREO, an LLM-choreographed multi-agent world model that recasts controllable multi-view video generation as latent choreography. Three Qwen2.5-VL agents - a Director parsing user intent into a structured WorldScript, a Cartographer grounding it into spatially-anchored layout tokens, and an Auditor feeding cross-view critiques back as auxiliary supervision - jointly author a single position-aware token sequence. This sequence is co-compressed with the multi-view video via a view-time permutation that enforces inter-camera geometry within the convolutional receptive field of a 3-D VAE. On nuScenes, DRIVE-CHOREO sets new state-of-the-art multi-view consistency and BEV mAP (21.6) with competitive FVD (45.7); a detector trained purely on our synthetic data gains +2.4 NDS on the real validation split, validating downstream utility.

2606.17534 2026-06-17 cs.RO New submission

RICH-SLAM: Radar SLAM with Incremental and Continuous Hilbert Mapping

Bingbing Zhang, Huan Yin, Yang Xu, Shuo Liu, Shaojie Shen, Fumin Zhang, Wen Xu

Affiliations: State Key Laboratory of Ocean Sensing, Zhejiang University; Interdisciplinary Student Training Platform for Marine areas, Zhejiang University; Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology; School of AI and Robotics, Hunan University; Ocean College, Zhejiang University; Institute of Deep-Sea Science and Engineering, Chinese Academy of Sciences

AI summary: Proposes the RICH-SLAM framework, which uses a Rao-Blackwellized particle filter back end and incremental Hilbert-space reduced-rank Gaussian process mapping to build continuous occupancy maps from sparse radar measurements and support uncertainty-aware planning.

Comments 12 figures

Abstract

Simultaneous localization and mapping using radar sensors has gained increasing attention due to radar's inherent robustness to adverse weather and lighting conditions. However, radar measurements are characteristically sparse and noisy compared to LiDAR and visual data, posing significant challenges in achieving dense, continuous, and consistent map representations. In this paper, we present RICH-SLAM, a radar SLAM framework designed to address these challenges. Our approach features a Rao-Blackwellized particle filter-based back end that employs particle filtering for pose estimation and Kalman filtering for map updates. We propose an incremental Hilbert-space reduced-rank Gaussian process mapping strategy that enables continuous and uncertainty-aware map representations given sparse radar inputs. We further introduce a posterior-aware particle weighting scheme that leverages the full posterior distribution of map parameters for more robust likelihood evaluation. Experiments on self-collected and public ColoRadar datasets show that RICH-SLAM constructs continuous occupancy maps from sparse radar measurements and supports uncertainty-aware planning for mobile robots.
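To make the combination of a Hilbert-space reduced-rank Gaussian process with Kalman-style map updates more concrete, here is a minimal one-dimensional regression sketch. It is not the RICH-SLAM implementation: the 1-D domain, the squared-exponential kernel, the scalar Gaussian measurement model, and every function name are assumptions for illustration, and occupancy mapping and the particle filter are left out entirely.

```python
import numpy as np

def hilbert_features(x, n_basis, half_width):
    """Laplacian eigenfunctions on [-half_width, half_width], evaluated at x (1-D)."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    j = np.arange(1, n_basis + 1)
    arg = np.pi * j * (x[:, None] + half_width) / (2.0 * half_width)
    return np.sqrt(1.0 / half_width) * np.sin(arg)

def se_spectral_density(freq, signal_var, length_scale):
    """Spectral density of the 1-D squared-exponential kernel (assumed kernel)."""
    return (signal_var * np.sqrt(2.0 * np.pi) * length_scale
            * np.exp(-0.5 * (length_scale * freq) ** 2))

def prior_covariance(n_basis, half_width, signal_var, length_scale):
    """Diagonal prior over basis weights: spectral density at sqrt(eigenvalues)."""
    freq = np.pi * np.arange(1, n_basis + 1) / (2.0 * half_width)
    return np.diag(se_spectral_density(freq, signal_var, length_scale))

def kalman_update(mean, cov, phi, y, noise_var):
    """Fold one scalar measurement y = phi @ w + noise into the weight posterior."""
    innovation_var = phi @ cov @ phi + noise_var
    gain = cov @ phi / innovation_var
    mean = mean + gain * (y - phi @ mean)
    cov = cov - np.outer(gain, phi @ cov)
    return mean, cov

# Incremental mapping: sparse measurements arrive one at a time (toy data).
n_basis, half_width, noise_var = 4, 2.0, 0.01
mean = np.zeros(n_basis)
cov = prior_covariance(n_basis, half_width, signal_var=1.0, length_scale=0.5)
for x_obs, y_obs in [(-0.5, 0.2), (0.1, -0.1), (0.7, 0.4)]:
    phi = hilbert_features(x_obs, n_basis, half_width)[0]
    mean, cov = kalman_update(mean, cov, phi, y_obs, noise_var)

# Continuous query anywhere in the domain, with predictive variance.
phi_q = hilbert_features(0.3, n_basis, half_width)[0]
pred_mean, pred_var = phi_q @ mean, phi_q @ cov @ phi_q
```

The map is stored as a fixed-size weight posterior (`mean`, `cov`), so each new measurement costs the same regardless of how many came before, and the map can be queried at any continuous location together with its uncertainty.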

2606.17526 2026-06-17 cs.LG New submission

MGUP: A Momentum-Gradient Alignment Update Policy for Stochastic Optimization

Da Chang, Ganzhao Yuan

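Affiliations: Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences; Shenzhen University of Advanced Technology; Pengcheng Laboratory; University of Chinese Academy of Sciences

AI summary: Proposes the MGUP mechanism, which augments momentum-based optimizers by applying a large step size to a fixed proportion of selected parameters and a small step size to the rest; convergence is guaranteed theoretically, and experiments show improved training efficiency and stability.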

Comments Published in NeurIPS 2025

Abstract

Efficient optimization is essential for training large language models. Although intra-layer selective updates have been explored, a general mechanism that enables fine-grained control while ensuring convergence guarantees is still lacking. To bridge this gap, we propose MGUP, a novel mechanism for selective updates. MGUP augments standard momentum-based optimizers by applying larger step-sizes to a selected fixed proportion of parameters in each iteration, while applying smaller, non-zero step-sizes to the rest. As a nearly plug-and-play module, MGUP seamlessly integrates with optimizers such as AdamW, Lion, and Muon. This yields powerful variants such as MGUP-AdamW, MGUP-Lion, and MGUP-Muon. Under standard assumptions, we provide theoretical convergence guarantees for MGUP-AdamW (without weight decay) in stochastic optimization. Extensive experiments across diverse tasks, including MAE pretraining, LLM pretraining, and downstream fine-tuning, demonstrate that our MGUP-enhanced optimizers achieve superior or more stable performance compared to their original base optimizers. We offer a principled, versatile, and theoretically grounded strategy for efficient intra-layer selective updates, accelerating and stabilizing the training of large-scale models. The code is publicly available at https://github.com/MaeChd/MGUP.
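The selective-update idea can be sketched on top of plain SGD with momentum. This is not the released MGUP code. The abstract does not say how parameters are selected, so ranking by the element-wise product of momentum and gradient (suggested only by the paper's title) is an assumption, as are the function name `selective_momentum_step` and all hyperparameter values.

```python
import numpy as np

def selective_momentum_step(params, grad, momentum, lr=0.1, beta=0.9,
                            fraction=0.25, large_scale=2.0, small_scale=0.5):
    """One momentum step with two step sizes chosen per parameter."""
    momentum = beta * momentum + (1.0 - beta) * grad
    # Assumed selection score: momentum-gradient alignment per element.
    score = (momentum * grad).ravel()
    k = max(1, int(round(fraction * score.size)))
    top = np.argpartition(score, -k)[-k:]
    # Every parameter moves: the selected fraction gets the larger step,
    # the remainder gets a smaller but non-zero step.
    scale = np.full(score.size, small_scale)
    scale[top] = large_scale
    new_params = params - lr * scale.reshape(params.shape) * momentum
    return new_params, momentum

params = np.zeros(4)
momentum = np.zeros(4)
grad = np.array([4.0, 1.0, 2.0, 3.0])
params, momentum = selective_momentum_step(params, grad, momentum)
```

With `fraction=0.25` and four parameters, only the element with the highest score receives the large step. Keeping the small step non-zero distinguishes this scheme from masking-style selective updates that freeze unselected parameters.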

2606.17524 2026-06-17 cs.LG New submission

Learning to Refine Hidden States for Reliable LLM Reasoning

Chia-Hsuan Hsu, Jui-Ming Yao

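Affiliations: Tongyu0924

AI summary: Proposes the ReLAR framework, which uses reinforcement-learning-guided latent state refinement to adapt the number and direction of refinement steps, improving accuracy and stability on complex multi-step reasoning while lowering inference overhead.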

Comments Code is available at tongyu0924/Learning-to-Refine-Hidden-States

Abstract

Large language models show strong reasoning ability, but their internal reasoning process can remain unstable in complex multi-step settings, where early hidden-state errors may propagate to incorrect predictions. We propose ReLAR, a reinforcement-guided latent refinement framework that iteratively updates hidden representations before decoding. ReLAR maintains a compact latent reasoning state and uses learned depth and action controllers to adaptively determine both the number and direction of refinement steps. The controllers are trained with a policy gradient objective based on step-wise likelihood improvement, enabling efficient input-dependent reasoning without explicit chain-of-thought generation. Experiments on medical, mathematical, multi-hop reasoning, and open-ended generation benchmarks show that ReLAR improves accuracy, generation quality, and reasoning stability with substantially lower inference overhead than explicit reasoning baselines.

2606.17522 2026-06-17 cs.CL cs.LG New submission

An expressivity analysis of hierarchical modelling in deep transformers via bounded-depth grammars

Vinoth Nandakumar, Qiang Qu, Pramod Thebe, Sakshi Khachariya, Tongliang Liu

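Affiliations: University of Sydney; San Francisco State University; IIT Madras

AI summary: Using bounded-depth context-free grammars, shows that deep transformers can be constructed whose depth grows linearly with grammar depth and whose neuron count scales with the number of derivation-tree shapes and production rules, supporting the linear representation hypothesis.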

Abstract

Deep neural networks are widely believed to derive their expressive power from their ability to form hierarchical representations, capturing progressively more abstract and compositional features across layers. In language modeling, transformers have emerged as the dominant architecture, with early layers capturing local syntactic patterns and later layers encoding more complex clause-level dependencies. While this intuition has shaped model design, there remains a lack of rigorous theoretical work demonstrating how deep transformers represent such hierarchical structures. In this work, we analyze the expressiveness of deep transformer models through the formal lens of bounded-depth, non-recursive context-free grammars. For this class of grammars, we explicitly construct transformers with positional attention whose depth grows linearly with grammar depth, while the neuron count scales with the number of derivation-tree shapes and quadratically with the number of production rules. Our theoretical results support the linear representation hypothesis by demonstrating that these architectures possess the structural capacity to encode abstract grammatical states into low-dimensional, linearly separable subspaces within the residual stream.

2606.17520 2026-06-17 cs.RO cs.CV New submission

GASE: Gaussian Splatting-Based Automated System for Reconstructing Embodied-Simulation Environments

Jiawei Zhang, Yiming Yan, Chao Liang, Nuo Xu, Seson Sun, Qichen Zhang, Yuhao Xu, Yantai Yang, Yingqiao Wang, Qin Jin, Zhipeng Zhang

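Affiliations: AutoLab, SAI, Shanghai Jiao Tong University; AIM3 Lab, School of Information, Renmin University of China; Research Lab, Anyverse Dynamics

AI summary: Proposes the GASE system, which uses multi-view video streams from panoramic camera arrays, extracts foreground objects with a camera-pose-based strategy, inpaints the scene, and reconstructs objects and background independently before importing them into physics simulators, enabling efficient construction of high-fidelity simulation environments; segmentation accuracy improves by more than 10%, and real-robot deployment keeps the performance gap below 10%.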

Abstract

Training embodied agents in the real world requires skilled operators and expensive hardware. Simulation environments offer a compelling alternative by enabling large-scale, cost-effective data augmentation. Consequently, rapidly constructing high-fidelity simulation scenes with a minimal sim-to-real gap has become a critical objective in robot learning. While reconstruction-based methods provide superior visual quality, current workflows are hindered by inefficient data acquisition and subpar foreground object extraction. We thus propose GASE, a highly automated system for simulation scene construction. GASE leverages multi-view video streams from panoramic camera arrays to enable rapid environment scanning. To ensure high-quality asset generation, our pipeline introduces a camera-pose-based strategy that robustly extracts objects across frames in the 2D domain, followed by high-fidelity scene inpainting. Foreground objects and the static background are then reconstructed independently and seamlessly imported into physics simulators for policy training. Extensive experiments demonstrate that GASE outperforms existing 3D Gaussian-based methods in segmentation accuracy by over 10% while achieving state-of-the-art inpainting quality. Furthermore, real-robot deployments across manipulation and navigation tasks maintain a performance gap of less than 10% compared to policies trained purely on real-world data. These results confirm that GASE provides an efficient and highly effective solution for bridging the sim-to-real gap. Code will be released.

2606.17519 2026-06-17 cs.CL cs.AI New submission

Scaling Enterprise Agent Routing: Degradation, Diagnosis, and Recovery

Kellen Gillespie, Robyn Perry

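Affiliations: Superhuman, Inc.

AI summary: Studies the decline in routing accuracy as an enterprise assistant's tool library grows, and recovers 10 to 17 percentage points of F1 through embedding-based shortlisting.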

Comments 10 pages (6 main + 4 appendix), 4 figures, 6 tables

Abstract

Production LLM assistants route user requests to growing libraries of specialized tools, but how does routing accuracy degrade as the catalog scales? We study single-step routing on a 110-agent, 584-tool catalog from a deployed enterprise productivity assistant, evaluating three frontier models from 10 to 110 agents. Routing F1 on under-specified requests drops 16-23 percentage points across models. An oracle analysis decomposes the degradation into a retrieval gap (the model cannot surface the right tool) and a confusion gap (even with perfect retrieval, the oracle ceiling drops 10pp). Embedding-based shortlisting recovers +10-11pp F1 at full scale across all three models and two providers. A production annotation study (1,435 human-labeled utterances, three annotators) confirms the recovery on real traffic at +10-17pp despite 10-15pp lower absolute performance.
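Embedding-based shortlisting can be sketched as a nearest-neighbour filter that runs before the routing model sees the catalog. The sketch below is an illustration under stated assumptions, not the paper's pipeline: the tool names and three-dimensional vectors are made up, and in practice the vectors would come from whatever embedding model a deployment uses.

```python
import numpy as np

def shortlist(query_vec, tool_vecs, tool_names, k=3):
    """Return the k tool names whose embeddings are most cosine-similar to the query."""
    q = np.asarray(query_vec, dtype=float)
    t = np.asarray(tool_vecs, dtype=float)
    q = q / np.linalg.norm(q)
    t = t / np.linalg.norm(t, axis=1, keepdims=True)
    sims = t @ q
    order = np.argsort(-sims)[:k]
    return [tool_names[i] for i in order]

# Hypothetical catalog: names and embeddings are placeholders.
tool_names = ["calendar.create_event", "email.send", "docs.search", "calendar.find_slot"]
tool_vecs = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.9, 0.1, 0.0],
]
query_vec = [0.9, 0.0, 0.1]

candidates = shortlist(query_vec, tool_vecs, tool_names, k=2)
# Only `candidates` would be placed in the routing prompt instead of the full catalog.
```

Shrinking the candidate set targets the retrieval gap described in the abstract; it does not by itself resolve confusion between near-duplicate tools that both survive the shortlist.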

2606.17516 2026-06-17 cs.LG cs.AI stat.ME stat.ML 新提交

FoundCause: Causal Discovery with Latent Confounders from Observational Data

FoundCause: 从观测数据中发现含隐混淆因子的因果关系

Patrick Blöbaum, Krishnakumar Balasubramanian, Shiva Prasad Kasiviswanathan

发表机构 * Amazon Web Services(亚马逊云服务) Department of Statistics, University of California, Davis(加州大学戴维斯分校统计系)

AI总结 提出FoundCause,一种基于合成数据训练的摊销因果发现模型,通过单次前向传递直接映射数据集到因果图,显式建模隐混淆因子,在15个真实数据集上优于11种非摊销和4种摊销方法。

Comments Download the model at https://github.com/amazon-science/foundcause

详情
AI中文摘要

从观测数据中发现因果关系仍然具有挑战性,因为需要在没有干预的情况下恢复有向结构和隐混淆因子。我们提出了FoundCause,一种完全在合成数据上训练的摊销因果发现模型,它通过单次前向传递直接将数据集映射到因果图。通过从大量模拟结构因果模型中学习,FoundCause捕获了可迁移的统计模式,这些模式泛化到单个数据集之外。该架构融合了因果发现的几个关键归纳偏置。它使用一个置换不变的Transformer编码器,通过交替关注样本和变量来联合建模跨变量依赖性和每个变量的分布。通过统计条件注意力注入来自经典非对称度量的成对统计特征,引导模型朝向已知的因果信号。一个分解的解码器将边的存在性与方向分离,而一个三角细化模块使得能够推理高阶因果模式,如链和碰撞器。此外,一个基于可学习隐令牌的专用混淆因子模块显式建模隐藏的共同原因,并且模型通过其掩码输入表示显式处理缺失数据。据我们所知,FoundCause是第一个显式建模隐混淆因子的摊销因果发现方法。FoundCause在15个真实数据集上优于11种经典非摊销方法(如PC、GES、NOTEARS风格优化)和4种摊销因果发现方法,相对于最强的非摊销方法,在$F_1$上提高了9.6%,在AUROC上提高了1.2%,结构汉明距离减少了18.9%,同时仅需单次前向传递即可完成推理。

英文摘要

Causal discovery from observational data remains challenging due to the need to recover directed structure and latent confounding without interventions. We propose FoundCause, an amortized causal discovery model trained entirely on synthetic data that maps datasets directly to causal graphs in a single forward pass. By learning from large collections of simulated structural causal models, FoundCause captures transferable statistical patterns that generalize beyond individual datasets. The architecture incorporates several key inductive biases for causal discovery. It uses a permutation-invariant transformer encoder with alternating attention over samples and variables to jointly model cross-variable dependence and per-variable distributions. Pairwise statistical features derived from classical asymmetry measures are injected through statistics-conditioned attention, guiding the model toward known causal signals. A factorized decoder separates edge existence from direction, while a triangular refinement module enables reasoning over higher-order causal motifs such as chains and colliders. In addition, a dedicated confounder module based on learnable latent tokens explicitly models hidden common causes, and the model explicitly handles missing data via its masked input representation. To our knowledge, FoundCause is the first amortized causal discovery approach to explicitly model latent confounding. FoundCause outperforms 11 classical non-amortized methods (e.g., PC, GES, NOTEARS-style optimization) and 4 amortized causal discovery methods on 15 real-world datasets, achieving +9.6% improvement in $F_1$, +1.2% in AUROC, and an 18.9% reduction in structural Hamming distance relative to the strongest non-amortized methods, while performing inference in a single forward pass.
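
For readers unfamiliar with the reported metrics, the sketch below computes edge-level $F_1$ and structural Hamming distance (SHD) for directed graphs given as edge sets. It assumes one common SHD convention (a missing, extra, or reversed edge each counts as one error); the paper's exact convention is not stated in the abstract, and the example graphs are hypothetical.

```python
def edge_f1(true_edges, pred_edges):
    """F1 over directed edges, each edge given as a (parent, child) tuple."""
    true_edges, pred_edges = set(true_edges), set(pred_edges)
    tp = len(true_edges & pred_edges)
    if tp == 0:
        return 0.0
    precision = tp / len(pred_edges)
    recall = tp / len(true_edges)
    return 2 * precision * recall / (precision + recall)

def shd(true_edges, pred_edges):
    """Structural Hamming distance: count node pairs whose edge status differs.
    Assumed convention: a reversed edge counts as a single error."""
    true_edges, pred_edges = set(true_edges), set(pred_edges)
    errors, seen = 0, set()
    for u, v in true_edges | pred_edges:
        pair = frozenset((u, v))
        if pair in seen:
            continue
        seen.add(pair)
        both_directions = {(u, v), (v, u)}
        if both_directions & true_edges != both_directions & pred_edges:
            errors += 1
    return errors

# Hypothetical ground truth and prediction
truth = {("A", "B"), ("B", "C")}
pred = {("B", "A"), ("B", "C"), ("A", "C")}
```

Here the prediction reverses A→B and adds a spurious A→C, so it differs from the truth on two node pairs.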

2606.17513 2026-06-17 cs.LG cs.AI 新提交

Geometry-Aware Post-Hoc Uncertainty Quantification in Operator Learning

几何感知的算子学习事后不确定性量化

Oriol Vendrell-Gallart, Nima Negarandeh, Ramin Bostanabad

发表机构 * Department of Mechanical and Aerospace Engineering, University of California, Irvine(加州大学尔湾分校机械与航空航天工程系)

AI总结 提出REEF-GP框架,通过高斯过程拟合冻结神经算子的残差,利用其内在坐标-特征表示构建几何感知的不确定性,在多个PDE基准上实现校准的不确定性估计,且计算成本远低于深度集成。

详情
AI中文摘要

神经算子为偏微分方程提供快速代理模型,但其确定性预测限制了在需要不确定性量化(UQ)的任务中的使用,尤其是在几何变化下。现有方法主要对网络参数进行不确定性建模,很大程度上忽略了算子本身学习的几何感知表示。我们提出REEF-GP(残差嵌入特征高斯过程),一种事后UQ框架,将高斯过程拟合到冻结神经算子的残差上,该算子的内部嵌入定义了核特征空间。REEF-GP不学习单独的特征映射,而是调整算子固有的坐标-特征表示以构建几何感知的不确定性。为了确保非结构化域上的稳定性和可扩展性,REEF-GP结合了谱归一化投影、异方差几何感知噪声以及高效基于子集的训练,避免了限制性的低秩近似。在五个具有不同几何形状的PDE基准测试中,REEF-GP保持了预测准确性,同时实现了与深度集成相竞争但成本仅为其一小部分的校准不确定性估计。我们的方法在几何分布偏移下保持鲁棒性,不确定性集中在物理上有意义的区域(例如激波前沿)。我们的结果表明,神经算子的准确且可扩展的事后UQ可以直接在其学习的特征空间中实现,为参数中心方法提供了实用替代方案。

英文摘要

Neural operators provide fast surrogates for PDEs but their deterministic predictions limit their use in tasks requiring uncertainty quantification (UQ), especially under geometric variability. Existing approaches primarily model uncertainty in network parameters, largely overlooking the geometry-aware representations learned by the operator itself. We propose REEF-GP (Residual on Embedded Features Gaussian Process), a post-hoc UQ framework that fits a GP to the residuals of a frozen neural operator whose internal embeddings define the kernel feature space. Rather than learning a separate feature map, REEF-GP adapts the operator's intrinsic coordinate-feature representations to construct geometry-aware uncertainties. To ensure stability and scalability on unstructured domains, REEF-GP incorporates spectral-normalized projections, heteroscedastic geometry-aware noise, and efficient subset-based training that avoids restrictive low-rank approximations. Across five PDE benchmarks with varying geometries, REEF-GP preserves predictive accuracy while achieving calibrated uncertainty estimates competitive with deep ensembles but at a fraction of their cost. Our approach remains robust under geometric distribution shift, with uncertainty concentrating in physically meaningful regions (e.g., shock fronts). Our results demonstrate that accurate and scalable post-hoc UQ for neural operators can be achieved directly in their learned feature space, offering a practical alternative to parameter-centric approaches.

2606.17511 2026-06-17 cs.RO cs.AI cs.CV 新提交

MagicSim: A Unified Infrastructure for Executable Embodied Interaction

MagicSim: 可执行具身交互的统一基础设施

Haoran Lu, Songling Liu, Yue Chen, Guo Ye, Mutian Shen, Shuyang Yu, Yu Xiao, Jihai Zhao, Shang Wu, Jianshu Zhang, Xiangtian Gui, Chuye Hong, Yuran Wang, Maojiang Su, Jiayi Wang, Ruihai Wu, Zhaoran Wang, Han Liu

发表机构 * Northwestern University(西北大学) Peking University(北京大学) University of California, Berkeley(加州大学伯克利分校) ShanghaiTech University(上海科技大学)

AI总结 提出MagicSim,一个基于确定性批处理运行时和共享MDP的具身交互基础设施,通过YAML规范解耦内容、放置、行为和智能体暴露,统一世界构建、执行、评估和自动生成轨迹。

详情
AI中文摘要

机器人学习和具身智能体现在需要模拟作为连接控制、技能和规划的共享执行基底,而不仅仅是渲染器、控制器测试平台或固定任务环境。现有的流水线通过“魔法”动作、脱节的训练环境或仅前向渲染来分割这些层,无法重现、评估和标注同一情节。我们提出MagicSim,一个围绕确定性批处理运行时和共享马尔可夫决策过程(MDP)构建的具身交互基础设施。通过YAML优先的规范解耦内容、放置、行为和智能体暴露,MagicSim在单一重置-步进循环中构建多样化的可执行世界,涵盖任务族、交互模式、物理、布局、传感器、化身和机器人具身。一个通用的执行接口通过控制器、原子技能、规划器原语和异步规划将高级命令具体化,将其实现为机器人动作而非模拟器端的状态编辑。一个任务定义支持三种能力:基准测试和强化学习评估、自动收集接口(自动将命令转化为具体轨迹)以及面向智能体/VLM的交互。对于自动执行,命令流经Command->Skill->Planner->Robot->Record流水线,而每个环境的命令、技能、规划、重试、标注和情节状态在共享物理滴答之上独立推进。成功的展开被保存为结构化的多模态轨迹,将语言监督、动作表示、视觉/几何表示和任务级别状态与执行的情节对齐。因此,MagicSim在一个规划器在环运行时中统一了多样化的世界构建、具身执行、任务评估、自动展开生成和交互式智能体接口。

英文摘要

Robot learning and embodied agents now require simulation to serve as a shared execution substrate linking control, skills, and planning, not only as a renderer, controller testbed, or fixed task environment. Existing pipelines split these layers with "magic" actions, disconnected training environments, or forward-only renders that cannot reproduce, evaluate, and annotate the same episode. We present MagicSim, an embodied interaction infrastructure built around one deterministic batched runtime and a shared Markov decision process (MDP). From YAML-first specifications that decouple contents, placement, behavior, and agent exposure, MagicSim constructs diverse executable worlds spanning task families, interaction regimes, physics, layouts, sensors, avatars, and robot embodiments in one reset-and-step loop. A common execution interface grounds high-level commands through controllers, atomic skills, planner primitives, and asynchronous planning, realizing them as robot actions rather than simulator-side state edits. One task definition supports three capabilities: benchmark and RL evaluation, an autocollect interface that automatically turns commands into grounded trajectories, and agent/VLM-facing interaction. For automatic execution, commands flow through a Command->Skill->Planner->Robot->Record pipeline, while per-environment command, skill, planning, retry, annotation, and episode states advance independently above the shared physics tick. Successful rollouts are saved as structured multimodal trajectories aligning language supervision, action representations, visual/geometric representations, and task-level status with the executed episode. MagicSim thus unifies diverse world construction, embodied execution, task evaluation, automatic rollout generation, and interactive agent interfaces in one planner-in-the-loop runtime.

2606.17508 2026-06-17 cs.LG cs.DC cs.PL cs.SE 新提交

When the Next Step Is Not One Step: Distribution-Aware Execution Modeling for Concurrent Go Programs

当下一步不是一步:面向并发Go程序的分布感知执行建模

Kaviru Hapuarachchi

发表机构 * University of Colombo School of Computing(科伦坡大学计算学院)

AI总结 针对并发程序非确定性调度导致单标签预测困难的问题,提出分布感知训练方法,通过多次运行聚合经验分布并微调7B模型,在真实Go缺陷预测中准确率达36.2%,并降低期望校准误差。

Comments 10 pages, 2 figures

详情
AI中文摘要

训练模型预测并发程序中的下一步比看起来更难:从相同跟踪前缀出发的同一程序的两次运行可能产生不同的有效下一事件,因为调度器是非确定性的。针对单一标签训练的模型实际上是在学习猜测随机过程的一个结果。我们反过来利用这种非确定性作为训练信号。我们将每个程序运行多次,将观察到的下一事件聚合成经验分布,并使用KL散度目标微调一个7B模型。在从真实生产Go缺陷(CockroachDB、Kubernetes、gRPC、etcd)中抽取的798个保留预测上,对少于一千个跟踪进行微调达到了36.2%的准确率,超过了零样本的Gemini 3.5 Flash(34.8%)和未微调的同一模型(28.6%)。分布训练在准确率上与交叉熵相当(35.8% vs. 36.2%),同时将期望校准误差从0.205降低到0.169。我们还推导出一类select阻塞goroutine的形式化goroutine泄漏特征,其中P(GoUnblock)=0由调度器语义保证,而非学习得到。我们发布了数据集、训练适配器和所有工具。

英文摘要

Training a model to predict the next step in a concurrent program is harder than it looks: two runs of the same program from the same trace prefix can produce different next events, both valid, because the scheduler is nondeterministic. A model trained against a single label is learning to guess one outcome of a random process. We turn this around and use the nondeterminism as a training signal. We run each program many times, aggregate the observed next events into an empirical distribution, and fine-tune a 7B model to match that distribution with a KL objective. On 798 held-out predictions drawn from real production Go bugs (CockroachDB, Kubernetes, gRPC, etcd), fine-tuning on fewer than a thousand traces reaches 36.2% accuracy, ahead of Gemini 3.5 Flash used zero-shot (34.8%) and the same model without fine-tuning (28.6%). Distribution training matches cross-entropy on accuracy (35.8% vs. 36.2%) while reducing Expected Calibration Error from 0.205 to 0.169. We also derive a formal goroutine-leak signature for a class of select-blocked goroutines where P(GoUnblock)=0 holds by scheduler semantics, not by learning. We release the dataset, trained adapters, and all tooling.
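
The distribution-matching idea in this abstract can be illustrated with a small sketch: aggregate the next events seen across repeated runs into an empirical distribution, then score a model's predicted distribution against it with a KL term. The direction KL(target || model), the flooring constant, and all event names other than `GoUnblock` are assumptions made for illustration; the abstract does not specify them.

```python
import math
from collections import Counter

def empirical_distribution(observed_events):
    """Turn a list of observed next events into relative frequencies."""
    counts = Counter(observed_events)
    total = sum(counts.values())
    return {event: c / total for event, c in counts.items()}

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) over events with p > 0; q is floored at eps to stay finite."""
    return sum(pv * math.log(pv / max(q.get(e, 0.0), eps))
               for e, pv in p.items() if pv > 0)

# Hypothetical next events observed over 10 runs from the same trace prefix
runs = ["GoUnblock"] * 6 + ["GoBlockRecv"] * 3 + ["GoSched"]
target = empirical_distribution(runs)

one_hot_model = {"GoUnblock": 1.0}  # commits to a single label
matched_model = {"GoUnblock": 0.6, "GoBlockRecv": 0.3, "GoSched": 0.1}

loss_one_hot = kl_divergence(target, one_hot_model)
loss_matched = kl_divergence(target, matched_model)
```

A model that commits to one label is penalized heavily for the outcomes it rules out, while a model that spreads probability like the scheduler does incurs almost no loss.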

2606.17507 2026-06-17 cs.AI cs.SE 新提交

LLM-as-Judge in Education: A Curriculum-Grounded Marking Pipeline

教育中的LLM作为评判者:基于课程标准的评分流水线

Xiwei Xu, Chen Wang, Jacky Jiang, Phil Yang, Qian Fu, Mohan Dhall, Wenjie Zhang, Liming Zhu

发表机构 * NSW Department of Education(新南威尔士州教育部) South Australian Department for Education(南澳大利亚州教育部) OC Selective exam preparation platform(OC精英考试备考平台) Studitory: HSC preparation platform(Studitory: HSC备考平台)

AI总结 提出一种基于课程标准的可配置LLM评判流水线,用于高利害考试评分,通过整合授权课程工件和评分指南,提高评分一致性、透明度和与官方实践的契合度。

详情
AI中文摘要

生成式AI和大语言模型(LLM)越来越多地应用于题目生成和自动评估。然而,在备考高风险考试中部署LLM需要的不仅仅是提示工程,还需要软件流水线,系统地将模型输出锚定在授权课程工件和教育当局发布的评分指南上。本文提出了一种基于课程标准的、可配置的LLM-as-Judge流水线,用于题目级评分,与工业合作伙伴共同开发,以支持大学入学考试准备。该流水线识别问题的相关主题、子主题和认知需求,并组装可验证和授权的上下文以支持LLM判断。课程意图通过具体的课程大纲工件(包括规定的动词和结果、表现等级描述符、术语表定义和评分指南原则)来操作化。采用分阶段LLM工作流,首先生成特定题目的评分标准,捕获结构化的表现期望,然后推导和评估用于分配学生回答分数的评分标准。这种设计提高了与官方评分实践的一致性、透明度和对齐度。初步评估表明,所提出的LLM-as-Judge流水线提供的评分结果与人类导师相当,同时产生的理由更可追溯到授权课程工件和评分标准。该流水线已集成到在线学习平台中,早期部署数据提供了操作使用和手动覆盖的初步见解。

英文摘要

Generative AI and large language models (LLMs) are increasingly applied to question generation and automated assessment. However, deploying LLMs in preparation for high-stakes exams requires more than prompt engineering; it demands software pipelines that systematically ground model outputs in authorised curriculum artefacts and marking guidelines issued by education authorities. This paper presents a curriculum-grounded, configurable LLM-as-Judge pipeline for question-level marking, co-developed with an industrial partner, to support exam preparation for university admission. The pipeline identifies the relevant topics, subtopics, and cognitive demand of a question, and assembles verifiable and authorised context to support LLM judgement. Curriculum intent is operationalised through concrete syllabus artefacts, including prescribed verbs and outcomes, performance band descriptors, glossary definitions, and marking-guideline principles. A staged LLM workflow is employed to first generate question-specific rubrics, capturing structured expectations of performance, and then derive and evaluate marking criteria used to allocate marks to student responses. This design improves consistency, transparency, and alignment with official marking practices. Preliminary evaluation shows that the proposed LLM-as-Judge pipeline delivers marking outcomes comparable to human tutors, while yielding justifications that are more traceable to authorised curriculum artefacts and marking standards. The pipeline has also been integrated into an online study platform, where early deployment data provide initial insights into operational usage and manual overrides.

2606.17506 2026-06-17 cs.CL 新提交

Evaluating Second-Order Bias of LLMs Through Epistemic Entitlement

通过认知权利评估大语言模型的二阶偏见

Ramaravind Kommiya Mothilal, Terry Jingchen Zhang, Raiyan Ahmed, Zhijing Jin, Shion Guha, Syed Ishtiaque Ahmed

发表机构 * University of Toronto(多伦多大学) Vector Institute(向量研究所) EuroSafeAI Max Planck Institute for Intelligent Systems, Tübingen, Germany(马克斯·普朗克智能系统研究所(德国图宾根))

AI总结 提出基于认知权利理论的逻辑推理任务,评估LLM在判断偏见内容时表现出的二阶偏见,发现模型判断存在系统性偏差且能规避安全护栏。

Comments 20 pages, 13 tables, 2 figures

详情
AI中文摘要

对大语言模型社会偏见的评估主要关注模型是否生成或暗示偏见内容。然而,随着LLM越来越多地被用作偏见评判者,它们可能在评估偏见内容时以更微妙的方式表现出社会偏见,而当前方法并未系统性地捕捉这一点。我们称之为二阶偏见:LLM对社会偏见的判断中存在的偏见,并通过一种新颖的、基于哲学推理的任务进行评估。借鉴认知权利理论,我们将偏见概念化为塑造主体理性探究的错位基础知识,并推导出一个逻辑推理任务,让LLM判断偏见文本对谁是可接受的或不可接受的。我们开发了两个简单指标来衡量LLM评判者在没有充分支持的情况下推断人口统计学可接受性的偏见程度,以及这些推断在不同目标群体间的差异。评估开源和闭源模型时,我们发现我们的任务通过揭示模型判断中的偏见来规避安全护栏。它随目标群体系统性地变化,反映了隐性的社会图谱,并展示了模型如何仍然被人口统计标签触发。我们的工作指出了在判断任务中评估LLM偏见的必要性,以及更广泛地,在NLP中采用更具理论基础的偏见评估方法。我们发布了代码和模型响应(https://github.com/uofthcdslab/second-order-bias)。

英文摘要

Evaluations of social bias in LLMs largely focus on whether models generate or imply biased content. However, as LLMs are increasingly used as judges of bias, they may exhibit social biases in subtler ways in how they evaluate biased content, which current methods do not systematically capture. We call this second-order bias: social bias in an LLM's judgment about social bias, which we evaluate through a novel, philosophically grounded reasoning task. Drawing on entitlement epistemology, we conceptualize bias as misplaced foundational knowledge that shapes an agent's rational inquiry, and derive a logical reasoning task for LLMs to judge to whom a biased text is acceptable or non-acceptable. We develop two simple metrics to measure how biased LLM judges are in inferring demographics for acceptability without sufficient support, and how these inferences vary across groups targeted by biased texts. Evaluating open and closed models, we find that our task evades safety guardrails by surfacing bias in model judgment. This bias varies systematically across target groups, reflects implicit social maps, and shows how models are still triggered by demographic labels. Our work points to the need for LLM bias evaluation in judgment tasks and, more broadly, for more theoretically grounded approaches to bias evaluation in NLP. We release our code and model responses at https://github.com/uofthcdslab/second-order-bias.

2606.17500 2026-06-17 cs.LG cs.AR 新提交

Reconfigurable Computing Challenge: Transformer for Jet Tagging on Versal AI Engines

可重构计算挑战:Versal AI Engine上的喷注标记Transformer

Gram Koski, Sean Lipps, Zhenghua Ma, G. Abarajithan, Ryan Kastner

发表机构 * Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA, USA(加州大学圣地亚哥分校计算机科学与工程系,美国加州拉霍亚)

AI总结 针对CERN LHC喷注标记任务,提出在AMD Versal AI Engine上部署量化整数Transformer的初始实现,并开发可重用软件框架自动生成Vitis图代码。

Comments 4 pages, 4 figures. In FCCM 2026 proceedings

Journal ref 2026 IEEE 34th Int. Symp. on Field-Programmable Custom Computing Machines (FCCM), Atlanta, GA, USA, 2026, pp. 307-310

详情
AI中文摘要

基于Transformer的模型在CERN LHC的喷注标记中表现出强大的性能,但在低延迟、资源受限的触发系统中部署它们具有挑战性。我们提出了一个在AMD Versal AI Engine(AIE)上用于喷注标记的量化、纯整数Transformer的初始实现,将密集层和多头注意力(MHA)层映射到AIE瓦片。主要贡献是一个可重用的软件框架,该框架将Transformer层表示为可组合的AIE构建块,并从高级Python模型描述自动生成相应的Vitis图代码。该框架为未来研究提供了基础,并作为开源软件发布于 https://github.com/KastnerRG/particle_transformer_aie

英文摘要

Transformer-based models achieve strong performance for jet tagging at the CERN LHC, but deploying them in low-latency, resource-constrained trigger systems is challenging. We present an initial implementation of a quantized, integer-only transformer for jet tagging on the AMD Versal AI Engine (AIE), mapping dense and multi-head attention (MHA) layers to AIE tiles. The main contribution is a reusable software framework that represents transformer layers as composable AIE building blocks and automatically generates the corresponding Vitis graph code from a high-level Python model description. This framework provides a foundation for future research and is released as open-source software at https://github.com/KastnerRG/particle_transformer_aie.

2606.17493 2026-06-17 cs.RO 新提交

When Robots Sleep: Offline Skill Consolidation for Shared-Policy Robot Learning

当机器人睡眠时:面向共享策略机器人学习的离线技能巩固

Nethmi Jayasinghe, Diana Gontero, Amit Ranjan Trivedi

发表机构 * University of Illinois at Chicago(伊利诺伊大学芝加哥分校)

AI总结 提出睡眠-觉醒框架,通过冻结技能记忆和纳什谈判梯度组合,解决多技能学习中的技能耦合崩溃问题,在Meta-World和SurgicAI上显著提升成功率和可靠性。

详情
AI中文摘要

在长期部署中学习的机器人必须添加新技能,同时不丢失使早期技能可重用的共享策略结构。我们研究顺序机器人技能学习,其中先前的轨迹和任务损失可能不可用,并且部署的策略必须保持单个共享控制器,没有特定任务的头部、路由或适配器。我们识别出技能耦合崩溃,这是一种故障模式,其中单个技能的成功仍然非平凡,而相关技能之间的可靠性下降。我们提出睡眠机器人,一种睡眠-觉醒框架,在觉醒期间学习每个新技能,并在睡眠期间使用紧凑的冻结技能记忆离线巩固共享策略:用于强化学习的冻结评论家与无序状态缓冲区,以及用于模仿学习的冻结演员快照与无序观察缓冲区。在睡眠期间,这些记忆定义了可微分的替代目标,其梯度通过纳什谈判组合,并具有自适应锚定和局部兴奋性以实现稳定巩固。在Meta-World MT5上,睡眠机器人相比最强的非神谕基线将平均成功率提高了64%,将成对可靠性提高了2.0倍;在SurgicAI上,相比持续模仿基线,它提高了平均成功率和反向迁移,同时在成对可靠性上保持竞争力。

英文摘要

Robots that learn over long deployments must add new skills without losing the shared policy structure that makes earlier skills reusable. We study sequential robot skill learning, where previous trajectories and task losses may be unavailable, and the deployed policy must remain a single shared controller without task-specific heads, routing, or adapters. We identify skill-coupling collapse, a failure mode in which individual skill success remains non-trivial while reliability among related skills deteriorates. We propose Sleeping Robots, a wake-sleep framework that learns each new skill during wake and consolidates the shared policy offline during sleep using compact frozen skill memories: frozen critics with unordered state buffers for reinforcement learning and frozen actor snapshots with unordered observation buffers for imitation learning. During sleep, these memories define differentiable surrogate objectives whose gradients are combined through Nash bargaining, with adaptive anchoring and local excitability for stable consolidation. On Meta-World MT5, Sleeping Robots improves average success by 64% and pairwise reliability by 2.0x over the strongest non-oracle baseline, and on SurgicAI it improves average success and backward transfer relative to continual imitation baselines while remaining competitive on pairwise reliability.

2606.17489 2026-06-17 cs.LG cs.AI 新提交

Online LLM Selection via Constrained Bandits with Time-Varying Demand

基于时变需求的约束赌博机在线LLM选择

Yin Huang, Qingsong Liu, Jie Xu

发表机构 * Department of Electrical and Computer Engineering, University of Florida(佛罗里达大学电气与计算机工程系) Manning College of Information and Computer Sciences, University of Massachusetts Amherst(马萨诸塞大学阿默斯特分校曼宁信息与计算机科学学院)

AI总结 针对边缘云推理系统中异构LLM的选择问题,提出一种基于置信界估计和需求预测的在线学习算法,在硬预算和软延迟约束下实现亚线性遗憾和约束违反。

Comments 11 pages, 3 figures with multiple subfigures, 1 table, submitted for possible journal publication

详情
AI中文摘要

大型语言模型(LLM)越来越多地部署在边缘云推理系统中,以处理具有异构准确性、延迟和成本配置的多样化用户任务。为每个传入任务选择合适的LLM对于确保服务质量和高效资源利用至关重要。然而,模型异构性、随机且未知的性能特征以及时变的任务需求使得静态选择策略不再适用。实际部署通常施加硬资源预算(如货币支出限制)和软服务级别要求(如延迟保证)。这些约束为在线决策带来了额外挑战。我们将该问题形式化为一个约束随机赌博机学习任务,其中学习者在包装型(硬)和覆盖型(软)约束下顺序选择模型,同时适应时变的任务需求。学习者无法访问底层奖励、成本或延迟分布,必须依赖部分反馈。我们开发了一种新颖的在线学习算法,利用置信界估计和需求预测来平衡奖励最大化与长期约束满足。我们提供了理论保证,表明与具有完整信息的离线基准相比,该算法实现了亚线性遗憾和亚线性覆盖约束违反。在合成工作负载上的实验结果证明了我们的方法在动态、资源受限环境中的有效性和鲁棒性。

英文摘要

Large Language Models (LLMs) are increasingly deployed in edge-cloud inference systems to handle diverse user tasks with heterogeneous accuracy, latency, and cost profiles. Selecting the appropriate LLM for each incoming task is critical for ensuring service quality and efficient resource utilization. However, model heterogeneity, stochastic and unknown performance characteristics, and time-varying task demands make static selection strategies inadequate. Real-world deployments often impose hard resource budgets such as monetary expenditure limits, along with soft service-level requirements such as latency guarantees. These constraints introduce additional challenges for online decision-making. We formulate this problem as a constrained stochastic bandit learning task, where the learner sequentially selects models under both packing-type (hard) and covering-type (soft) constraints, while adapting to time-varying task demand. The learner operates without access to the underlying reward, cost, or latency distributions and must rely on partial feedback. We develop a novel online learning algorithm that leverages confidence-bound estimates and demand predictions to balance reward maximization with long-term constraint satisfaction. We provide theoretical guarantees showing sublinear regret and sublinear covering constraint violations compared to an offline benchmark with full information. Experimental results on synthetic workloads demonstrate the effectiveness and robustness of our approach in dynamic, resource-constrained environments.
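
To make the problem setting concrete, here is a generic Lagrangian-UCB sketch, not the paper's algorithm: it picks a model by an optimistic reward estimate penalized by a dual variable on latency, and it stops before the hard budget can be exceeded. The simulated arms, the step size 0.1, the cost cap `max_cost`, and the function name `select_models` are all assumptions for illustration, and the demand-prediction component described in the abstract is omitted.

```python
import math
import random

def select_models(arms, budget, latency_target, horizon, max_cost=1.0, seed=0):
    """arms: list of (mean_reward, mean_cost, mean_latency), used only to simulate feedback."""
    rng = random.Random(seed)
    k = len(arms)
    pulls = [0] * k
    reward_sum = [0.0] * k
    latency_sum = [0.0] * k
    spent, lam, total_reward = 0.0, 0.0, 0.0
    for t in range(1, horizon + 1):
        if budget - spent < max_cost:  # hard budget: never start a pull we may not afford
            break
        if t <= k:
            a = t - 1  # warm-up: try every model once
        else:
            def score(i):
                bonus = math.sqrt(2 * math.log(t) / pulls[i])  # UCB exploration bonus
                return reward_sum[i] / pulls[i] + bonus - lam * latency_sum[i] / pulls[i]
            a = max(range(k), key=score)
        mean_r, mean_c, mean_l = arms[a]
        reward = 1.0 if rng.random() < mean_r else 0.0  # Bernoulli quality feedback
        cost = min(max_cost, max(0.0, rng.gauss(mean_c, 0.05)))
        latency = max(0.0, rng.gauss(mean_l, 0.05))
        pulls[a] += 1
        reward_sum[a] += reward
        latency_sum[a] += latency
        spent += cost
        total_reward += reward
        # soft latency requirement: raise the penalty when the target is exceeded
        lam = max(0.0, lam + 0.1 * (latency - latency_target))
    return pulls, spent, total_reward

# Hypothetical models: (accuracy, cost per call, latency), all scaled to [0, 1]
arms = [(0.9, 0.8, 0.9), (0.6, 0.3, 0.3), (0.3, 0.1, 0.1)]
pulls, spent, total_reward = select_models(arms, budget=50.0, latency_target=0.5, horizon=200)
```

Because each simulated cost is capped at `max_cost` and the loop checks the remaining budget first, spending cannot exceed the budget, whereas the latency target is only encouraged through the dual variable and may be violated on individual rounds.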

2606.17482 2026-06-17 cs.CV 新提交

SPHINX: First Explain, Then Explore

SPHINX: 先解释,再探索

Nguyen Do, Tue M. Cao, Tien Van Do, András Hajdu, Tamás Bérczes, My T. Thai

发表机构 * University of Florida(佛罗里达大学) University of Debrecen(德布勒恩大学)

AI总结 提出SPHINX闭环框架,通过可解释AI分析驾驶策略的失败模式,并利用视觉语言模型生成针对性对抗场景,提升自动驾驶策略鲁棒性。

Comments 13 pages

详情
AI中文摘要

生成对抗性驾驶场景对于在仿真中评估和改进自动驾驶决策系统至关重要。最近的方法,如ChatScene和LLM-Attacker,主要依赖大型语言模型和视觉语言模型的先验知识来程序化生成驾驶场景。我们认为,对抗性场景应基于驾驶策略的失败诊断(例如,犹豫不决、多帧不一致)来生成,以专门针对策略的弱点,而不是依赖先验假设。在本文中,我们提出SPHINX,一个闭环框架,用于对抗性场景合成,遵循一个简单原则:先解释,再探索。除了盲目探索场景空间外,SPHINX利用可解释人工智能方法分析策略,识别关键视觉概念及其对策略输出的影响,以及决策的不确定性。基于从策略自身决策过程中提取的可解释证据,我们使用视觉语言模型对当前策略的失败模式进行推理和批评。然后,这些批评被用于生成针对性的对抗性场景,以进行策略再训练和改进。我们证明,SPHINX能够突出策略失败的可解释说明,而其他对抗性场景生成方法则不能。在评估的基准和测试套件中,SPHINX可应用于多种最先进的自动驾驶架构,并在现有场景生成方法上带来一致的鲁棒性改进。

英文摘要

Generating adversarial driving scenarios is critical for evaluating and improving autonomous vehicle decision-making systems in simulation. Recent approaches, such as ChatScene and LLM-Attacker, rely primarily on the prior knowledge of Large Language Models and Vision-Language Models to generate driving scenarios procedurally. We argue that adversarial scenes should be generated based on the failure diagnosis (e.g., indecisiveness, multi-frame inconsistency) of the driving policy to specifically address the policy's weaknesses instead of relying on prior assumptions. In this paper, we propose SPHINX, a closed-loop framework for adversarial scenario synthesis guided by a simple principle: first explain, then explore. Rather than blindly exploring the scenario space, SPHINX leverages explainable artificial intelligence methods to analyze the policy, identifying key visual concepts and their influence on policy outputs, as well as the uncertainty of the decisions. Given the interpretable evidence extracted from the policy's own decision process, we use a vision-language model to rationalize and critique failure modes of the current policy. These critiques are then used to generate targeted adversarial scenarios for policy retraining and improvement. We demonstrate that SPHINX can provide an interpretable account of policy failures, which other adversarial scene generation methods cannot. Across the evaluated benchmarks and test suites, SPHINX can be applied to diverse state-of-the-art autonomous vehicle architectures and yields consistent robustness improvements over existing scenario-generation methods.

2606.17480 2026-06-17 cs.CV cs.RO 新提交

GeneralVLA-2: Geometry-Aware Reconstruction and Governed Memory for Robot Planning

GeneralVLA-2: 几何感知重建与受控记忆用于机器人规划

Haoyu Wang, Guoqing Ma, Zeyu Zhang, Yandong Guo, Boxin Shi, Hao Tang

发表机构 * School of Computer Science, Peking University(北京大学计算机科学学院) CASIA(中国科学院自动化研究所) AI 2 Robotics

AI总结 针对机器人规划中3D物体重建幻觉和记忆质量不可控的问题,提出GeoFuse-MV3D几何先验引导重建分支和受控长期记忆系统,在GSO-30和Terminal-Bench等基准上显著提升性能。

详情
AI中文摘要

通用视觉-语言-动作系统需要以物体为中心的3D证据和可复用的操作经验来规划可靠的机器人轨迹。GeneralVLA提供了一个层次化接口,用于将语言和RGB-D观测转换为3D末端执行器路径,但仍存在两个瓶颈。首先,单目SAM3D风格的物体重建可能产生姿态和未见几何的幻觉,而操作受益于在标定多视图观测可用时的稳定物体形状。其次,原始的KnowledgeBank主要检索语义相似的片段并附加新知识,这使得难以控制记忆质量、冲突、置信度和几何相关性。为了解决第一个挑战,我们引入了GeoFuse-MV3D,一个几何先验引导的MV-SAM3D重建分支,它用输入视图掩码验证外部几何线索,应用软视觉外壳支持,执行轴方向细化,并仅融合几何同时保留外观。为了解决第二个挑战,我们将KnowledgeBank升级为一个受控的长期记忆系统,具有明确的质量、置信度、生命周期、验证器和冲突元数据,以及面向精度的检索。最后,我们在GSO-30上评估重建分支,在Terminal-Bench 2.0和SWE-Bench Verified上评估记忆模块;GeoFuse-MV3D相比MV-SAM3D基线,CD和LPIPS分别降低2.20%和2.02%,PSNR和SSIM分别提高2.36%和1.03%;KnowledgeBank相比ReasoningBank,在Terminal-Bench SR上提高4.53%,在SWE-Bench解决率上提高3.73%,同时AS分别降低4.95%和5.65%。代码:https://github.com/AIGeeksGroup/GeneralVLA-2 网站:https://aigeeksgroup.github.io/GeneralVLA-2

英文摘要

Generalist vision-language-action systems need object-centric 3D evidence and reusable manipulation experience to plan reliable robot trajectories. GeneralVLA provides a hierarchical interface for converting language and RGB-D observations into 3D end-effector paths, but two bottlenecks remain. First, monocular SAM3D-style object reconstruction can hallucinate pose and unseen geometry, while manipulation benefits from stable object shape when calibrated multi-view observations are available. Second, the original KnowledgeBank mainly retrieves semantically similar snippets and appends new knowledge, which makes it difficult to control memory quality, conflicts, confidence, and geometric relevance. To address the first challenge, we introduce GeoFuse-MV3D, a geometry-prior-guided MV-SAM3D reconstruction branch that verifies external geometry cues with input-view masks, applies soft visual-hull support, performs axis-wise refinement, and fuses only geometry while preserving appearance. To address the second challenge, we upgrade KnowledgeBank into a governed long-term memory system with explicit quality, confidence, lifecycle, verifier, and conflict metadata, together with precision-oriented retrieval. Finally, we evaluate the reconstruction branch on GSO-30 and the memory module on Terminal-Bench 2.0 and SWE-Bench Verified; GeoFuse-MV3D improves over the MV-SAM3D baseline by reducing CD and LPIPS by 2.20% and 2.02% while increasing PSNR and SSIM by 2.36% and 1.03%, and KnowledgeBank improves over ReasoningBank by 4.53% on Terminal-Bench SR and 3.73% on SWE-Bench resolve rate, while reducing AS by 4.95% and 5.65%, respectively. Code: https://github.com/AIGeeksGroup/GeneralVLA-2. Website: https://aigeeksgroup.github.io/GeneralVLA-2.

2606.17478 2026-06-17 cs.CL cs.AI 新提交

Decoding Hidden Deception in Reasoning LLMs: Activation Explainers for Deception Auditing

解码推理型LLM中的隐藏欺骗:用于欺骗审计的激活解释器

Kexin Chen, Yi Liu, Haonan Zhang, Yanhui Li, Xinyu Deng, Dongxia Wang

发表机构 * Zhejiang University(浙江大学) Griffith University(格里菲斯大学)

AI总结 提出STATEWITNESS,一种通过解码目标模型隐藏状态来生成自然语言查询答案和结构化报告的激活解释器,在欺骗检测中平均AUROC达0.916,优于现有方法。

Comments Under review

详情
AI中文摘要

随着LLM获得更强的推理能力,欺骗行为成为一个日益严重的安全问题。现有的欺骗监控器要么对可见文本进行评分,要么从表示向量中导出标量探针分数,几乎没有留下关于为什么响应可疑的可检查证据。我们引入了STATEWITNESS,一种用于欺骗审计的激活解释器。一个独立的解码器读取目标模型的隐藏状态,然后回答自然语言查询或发出关于它们的结构化报告。我们在两个目标推理LLM上评估了STATEWITNESS,涵盖七个欺骗数据集。在相同评估协议下,STATEWITNESS的平均AUROC达到0.916,比最佳黑盒文本监控器相对提升11.6%,比最佳激活探针基线相对提升25.0%。当与现有监控器结合时,STATEWITNESS在简单阈值集成中减少了遗漏的欺骗示例。除了标量检测,解码器还返回查询级答案、模式报告以及令牌级或句子级证据痕迹供人工检查。我们将此接口视为更广泛的可解释性和对齐工具的潜在构建块。

英文摘要

As LLMs acquire stronger reasoning capabilities, deceptive behavior becomes an increasingly serious safety concern. Existing deception monitors either score visible transcripts or derive scalar probe scores from representation vectors, leaving little inspectable evidence about why a response is suspicious. We introduce STATEWITNESS, an activation explainer for deception auditing. A separate decoder reads a target model's hidden states, then answers natural-language queries or emits structured reports about them. We evaluate STATEWITNESS on two target reasoning LLMs across seven deception datasets. STATEWITNESS reaches 0.916 mean AUROC, a relative gain of 11.6% over the best black-box text monitor and 25.0% over the best activation-probe baseline under the same evaluation protocol. When combined with existing monitors, STATEWITNESS reduces missed deceptive examples in simple threshold ensembles. Beyond scalar detection, the decoder returns query-level answers, schema reports, and token- or sentence-level evidence traces for human inspection. We view this interface as a potential building block for broader interpretability and alignment tools.
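
The "simple threshold ensembles" mentioned in this abstract are not specified further; one plausible reading is an OR-rule that flags a response when either monitor crosses its own threshold. The sketch below shows that rule on made-up scores. The scores, labels, and thresholds are hypothetical and are not results from the paper.

```python
def flag(score, threshold):
    """A single monitor flags a response when its score reaches the threshold."""
    return score >= threshold

def ensemble_flags(scores_a, scores_b, thr_a, thr_b):
    """OR-rule: flag an example if either monitor crosses its threshold."""
    return [flag(a, thr_a) or flag(b, thr_b) for a, b in zip(scores_a, scores_b)]

def missed(flags, labels):
    """Count deceptive examples (label 1) that were not flagged."""
    return sum(1 for f, y in zip(flags, labels) if y == 1 and not f)

# Hypothetical data: 1 = deceptive, 0 = honest
labels = [1, 1, 1, 0, 0]
text_scores = [0.9, 0.2, 0.6, 0.1, 0.3]        # black-box transcript monitor
activation_scores = [0.8, 0.7, 0.3, 0.2, 0.1]  # activation-based monitor

text_only = [flag(s, 0.5) for s in text_scores]
activation_only = [flag(s, 0.5) for s in activation_scores]
combined = ensemble_flags(text_scores, activation_scores, 0.5, 0.5)
```

An OR-rule can only lower the number of missed deceptive examples relative to either monitor alone, at the price of possibly more false alarms, so the thresholds would need tuning on held-out data.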

2606.17477 2026-06-17 cs.CV cs.LG 新提交

Theoretical Grounding of Out-Of-Distribution Detection With Reinforcement Learning Optimizer

基于强化学习优化器的分布外检测的理论基础

Salimeh Sekeh, Xin Zhang

发表机构 * San Diego State University(圣地亚哥州立大学)

AI总结 本文提出一种强化学习引导的优化器,通过修正梯度下降更新来降低语义分布外误报率,理论分析了模型变化和环境变化对泛化误差的影响。

详情
AI中文摘要

动态开放世界环境中的分布外(OOD)检测要求模型持续适应不断变化的数据分布,同时泛化到协变量偏移输入并拒绝语义偏移的OOD样本。大多数现有的OOD检测方法仅优化当前步目标,并未明确考虑部署后环境变化如何影响未来的OOD行为。在本文中,我们使用强化学习(RL)引导的优化器为动态OOD检测建立了理论基础,该优化器明确偏好随时间降低语义OOD假阳性率的更新。我们开发了一种新颖的增强优化器,在标准梯度下降(GD)之上使用RL引导的修正项,并展示了其在未来域泛化和语义OOD拒绝方面的改进。我们从模型变化和环境变化泛化误差的角度分析了时间误差分解,并开发了一个新的理论框架来比较GD和RL引导优化器下的泛化误差。

英文摘要

Out-of-distribution (OOD) detection in dynamic open-world environments requires a model to continually adapt to evolving data distributions while generalizing to covariate-shifted inputs and rejecting semantic-shifted OOD examples. Most existing OOD detection methods optimize only the current-step objective and do not explicitly account for how post-deployment environment changes affect future OOD behavior. In this paper, we establish a theoretical grounding for dynamic OOD detection using a reinforcement learning (RL)-guided optimizer that explicitly favors updates that reduce the semantic OOD false positive rate over time. We develop a novel augmented optimizer that adds an RL-guided correction term on top of standard gradient descent (GD) and show that it improves both future-domain generalization and semantic-OOD rejection. We analyze temporal error decomposition in terms of model-change and environment-change generalization errors and develop a new theoretical framework for comparing the generalization errors under both GD and RL-guided optimizers.
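
One generic way to write an update of the kind this abstract describes, a correction term added on top of standard gradient descent, is sketched below. This is an illustrative assumption, not the paper's formulation: the symbols $\eta$ (step size), $\lambda \ge 0$ (correction weight), $\mathcal{L}_t$ (current-step loss), and $c_t$ (a correction direction proposed by an RL policy) are all hypothetical.

```latex
\theta_{t+1} \;=\; \theta_t \;-\; \eta \left( \nabla_\theta \mathcal{L}_t(\theta_t) \;+\; \lambda\, c_t \right)
```

Under this reading, setting $\lambda = 0$ recovers plain GD, which gives a natural baseline when comparing the generalization errors of the two optimizers.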

2606.17476 2026-06-17 cs.LG 新提交

Multi-Adapter PPO: A Cross-Attention Enhanced Wavelength Selection Framework for LIBS Quantitative Analysis

多适配器PPO:一种用于LIBS定量分析的交叉注意力增强波长选择框架

Hao Li, Man Fung Zhuo

发表机构 * Electrical and Computer Engineering, University of Arizona(亚利桑那大学电气与计算机工程系) Computer Engineering, University of Arizona, Tucson, USA(亚利桑那大学计算机工程系,美国图森)

AI总结 提出多适配器PPO框架,将波长选择转化为强化学习问题,利用交叉注意力和多适配器捕获光谱关系,在钢铁和煤炭数据集上综合评分平均提升28.4%,预测精度提升45.2%。

Comments 6 pages

详情
AI中文摘要

激光诱导击穿光谱(LIBS)定量分析由于高维光谱数据以及预测精度与特征效率之间的基本权衡,在波长选择方面面临关键挑战。本文提出了一种新颖的多适配器PPO框架,将波长选择转化为强化学习问题,利用交叉注意力机制和多个专用适配器来捕获复杂的光谱关系。我们的方法在钢铁和煤炭数据集上的综合评分平均比传统粒子群优化(PSO)高出28.4%,预测精度高出45.2%。所提出的方法在平衡预测精度与特征效率方面表现出优越性能,在LIBS定量分析中取得了最先进的结果,同时保持了可解释性和计算效率。我们在以下网址发布了代码和数据集:https://github.com/Hflying/MAPPO

英文摘要

Laser-induced breakdown spectroscopy (LIBS) quantitative analysis faces critical challenges in wavelength selection due to high-dimensional spectral data and the fundamental trade-off between prediction accuracy and feature efficiency. This paper presents a novel Multi-Adapter PPO framework that transforms wavelength selection into a reinforcement learning problem, leveraging cross-attention mechanisms and multiple specialized adapters to capture complex spectral relationships. Our approach outperforms traditional Particle Swarm Optimization (PSO) by an average of 28.4% in comprehensive score and 45.2% in prediction accuracy across steel and coal datasets. The proposed method demonstrates superior performance in balancing prediction accuracy with feature efficiency, achieving state-of-the-art results in LIBS quantitative analysis while maintaining interpretability and computational efficiency. We release our code and dataset at https://github.com/Hflying/MAPPO

2606.17475 2026-06-17 cs.CV New submission

StereoFactory: A Unified Merging Framework for Robust Stereo Matching

StereoFactory: 一种用于鲁棒立体匹配的统一合并框架

Xianda Guo, Pinhan Fu, Ruilin Wang, Wenke Huang, Mang Ye, Qin Zou

Affiliations * School of Computer Science, Wuhan University; D-Star Robotics; Institute of Automation, Chinese Academy of Sciences; College of Computing and Data Science, Nanyang Technological University

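AI summary Proposes StereoFactory, a coarse-to-fine evolutionary framework for adaptive model merging that selects a model subset with a genetic algorithm and optimizes module-level routing with CMA-ES. It lowers error on multiple benchmarks, and its post-hoc search takes only 2.7–3.7% of the corresponding joint-retraining wall-clock time.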

Details
AI Chinese abstract

立体匹配通过在大规模数据集上训练的基础模型取得了进展,但这种范式存在可扩展性瓶颈:引入新数据需要昂贵的联合重新训练。模型合并提供了一种可扩展的事后替代方案,在源检查点可用后整合来自专门模型的知识。然而,现有的合并方法通常保留所有可用模型或依赖贪婪包含,这可能会保留有害的任务向量干扰。我们提出StereoFactory,一种用于自适应模型合并的粗到细进化框架。第一阶段采用遗传算法搜索模型子集的组合空间,确定哪些模型应该参与。第二阶段通过CMA-ES优化对所选任务向量进行架构自适应路由,并可选地进行模块级缩放,解决模块级知识专门化问题(不同功能模块对知识源表现出不同偏好)。在两个架构和四个基准上的实验表明,在相同检查点池下,StereoFactory始终达到最佳的四基准平均值,相对于最强的受控基线,在NMRF上将平均误差从3.80降至3.30,在FoundationStereo上从2.88降至2.19。事后搜索仅需要相应联合重新训练挂钟时间的2.7–3.7%。分析表明,知识贡献本质上是模块特定的,所选子集可以在架构间转移且性能下降最小。代码将在接收后公开发布于:此 https URL。

English abstract

Stereo matching has advanced through foundation models trained on large-scale datasets, yet this paradigm suffers from a scalability bottleneck: incorporating new data requires costly joint retraining. Model merging offers a scalable post-hoc alternative by integrating knowledge from specialized models after source checkpoints are available. However, existing merging methods typically retain all available models or rely on greedy inclusion, which can preserve harmful task-vector interference. We propose StereoFactory, a coarse-to-fine evolutionary framework for adaptive model merging. Stage 1 employs a genetic algorithm to search the combinatorial space of model subsets, determining which models should participate. Stage 2 addresses module-level knowledge specialization (different functional modules exhibit distinct preferences for knowledge sources) through CMA-ES optimization of architecture-adaptive routing over the selected task vectors, with optional module-level scaling. Experiments across two architectures and four benchmarks demonstrate that StereoFactory consistently achieves the best four-benchmark average under the same checkpoint pool, reducing the average error from 3.80 to 3.30 on NMRF and from 2.88 to 2.19 on FoundationStereo relative to the strongest controlled baseline. The post-hoc search requires only 2.7–3.7% of the corresponding joint-retraining wall-clock time. Analysis reveals that knowledge contributions are inherently module-specific, and selected subsets can transfer across architectures with minimal degradation. Code will be publicly released upon acceptance at: https://github.com/XiandaGuo/StereoFactory.
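The idea of choosing which checkpoints take part in a merge can be sketched with generic task-vector arithmetic, where each task vector is the difference between a specialized checkpoint and a shared base. This is a minimal illustration, not the StereoFactory implementation: parameters are plain lists of floats, an exhaustive search over subsets stands in for the genetic algorithm, module-level routing and CMA-ES are omitted, and the names `merge`, `best_subset`, and `error_fn` are hypothetical.

```python
from itertools import product


def merge(base, experts, mask, scale=1.0):
    """Add the task vectors (expert - base) of the selected experts to the base."""
    merged = list(base)
    for theta, keep in zip(experts, mask):
        if keep:
            for i, (t, b) in enumerate(zip(theta, base)):
                merged[i] += scale * (t - b)
    return merged


def best_subset(base, experts, error_fn):
    """Score every non-empty subset of experts and return the lowest-error mask.

    Exhaustive search is only feasible for a handful of experts; it is used
    here as a simple stand-in for an evolutionary search over subsets.
    """
    best_mask, best_err = None, float("inf")
    for mask in product([0, 1], repeat=len(experts)):
        if not any(mask):
            continue
        err = error_fn(merge(base, experts, mask))
        if err < best_err:
            best_mask, best_err = mask, err
    return best_mask, best_err


# Toy setup: the third expert pulls in the opposite direction and hurts the merge.
base = [0.0, 0.0]
experts = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
target = [1.0, 1.0]


def error_fn(theta):
    # Squared distance to a made-up target, standing in for a validation error.
    return sum((p - t) ** 2 for p, t in zip(theta, target))


mask, err = best_subset(base, experts, error_fn)
print(mask, err)  # (1, 1, 0) 0.0
```

In the toy setup, merging all three experts cancels the useful updates, while dropping the interfering one recovers the target exactly, which is the kind of harmful task-vector interference that retain-everything merging can preserve.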