arXivDaily: arXiv daily academic digest, updated Monday to Friday
2606.00762 2026-06-02 cs.RO

STEM: Semantic Target Search and Exploration using MAVs in Cluttered Environments

Nikhil Sethi, Max Lodel, Laura Ferranti, Robert Babuška, Javier Alonso-Mora

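Affiliations: Department of Cognitive Robotics, Delft University of Technology; CIIRC (Czech Technical University in Prague)

AI summary: Proposes a framework built on a semantically guided viewpoint planner that uses an MAV to minimize target search and exploration time in unstructured 3D environments, achieving efficient semantic exploration through a combinatorial planner and an active perception pipeline.

Details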
Comments
Accepted to Autonomous Robots Journal. Nikhil Sethi and Max Lodel contributed equally
Abstract

Autonomous target search is crucial for deploying Micro Aerial Vehicles (MAVs) in emergency response and rescue missions. Existing approaches either focus on 2D semantic navigation in structured environments -- which is less effective in complex 3D settings, or on robotic exploration in cluttered spaces -- which often lacks the semantic reasoning needed for efficient target search. This paper overcomes these limitations by proposing a novel framework that utilizes a semantically-guided viewpoint planner to minimize target search and exploration time in unstructured 3D environments using an MAV. Specifically, we develop a combinatorial planner that generates efficient semantic exploration plans by prioritizing viewpoints that likely lead to the target. To guide the planner towards the target, an active perception pipeline is developed that propagates semantic priorities of observed objects into neighboring frontier voxels for computing semantic information gains of frontier viewpoints. In addition, we demonstrate how LLM-based similarity scores can be leveraged as semantic priority input to our pipeline. Evaluations in two distinct simulation environments show that the proposed method consistently outperforms baselines by quickly finding the target while maintaining reasonable exploration times. Real-world experiments with an MAV further demonstrate the method's ability to handle practical constraints like limited battery life, small sensor range, and semantic uncertainty.
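
The propagation step described in this abstract can be pictured with a small sketch. It is only an illustration under stated assumptions: the max-within-radius rule, the `radius` parameter, the additive `base_gain`, and all function names are hypothetical and are not taken from the paper.

```python
from math import dist


def propagate_priorities(objects, frontiers, radius):
    """Assign each frontier voxel the highest priority of any observed object
    within `radius` (assumed rule), or 0.0 if no object is close enough.

    objects:   list of (position, priority) pairs
    frontiers: list of positions
    """
    priorities = []
    for frontier in frontiers:
        nearby = [prio for pos, prio in objects if dist(pos, frontier) <= radius]
        priorities.append(max(nearby, default=0.0))
    return priorities


def viewpoint_gain(visible_frontier_ids, frontier_priorities, base_gain=1.0):
    """Assumed semantic information gain of a viewpoint: every visible frontier
    voxel contributes a base exploration gain plus its propagated priority."""
    return sum(base_gain + frontier_priorities[i] for i in visible_frontier_ids)


# Hypothetical scene: a high-priority object at the origin, a low-priority one farther away.
objects = [((0.0, 0.0, 0.0), 0.9), ((5.0, 0.0, 0.0), 0.2)]
frontiers = [(1.0, 0.0, 0.0), (5.0, 1.0, 0.0), (10.0, 10.0, 10.0)]
frontier_priorities = propagate_priorities(objects, frontiers, radius=2.0)
gain = viewpoint_gain([0, 2], frontier_priorities)
```

A planner could then rank candidate viewpoints by such a gain, so that frontiers near objects related to the target are visited first.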

2606.00761 2026-06-02 cs.LG cs.CL

Confidence-Adaptive SwiGLU for Mixture-of-Experts

Shaohua Li, Xiuchao Sui, Xiaobing Sun, Yuhang Wu, Liangli Zhen, Yong Liu, Rick Siow Mong Goh

Affiliations: Institute of High Performance Computing, Agency for Science, Technology and Research, Singapore; Shanghai University of Engineering Science, China

AI summary: Proposes Confidence-Aware SwiGLU (κ-SwiGLU), which adjusts expert gate sharpness according to token-level routing confidence and improves performance in MoE Transformers while adding only a small number of parameters and little compute overhead.

Details
Comments
13 pages, 10 figures
Abstract

SwiGLU has become a standard gated activation in modern Transformer MLPs, yet its gate sharpness -- the smoothness and selectivity of the gating function -- is typically fixed throughout training. In this work, we propose Confidence-Aware SwiGLU ($κ$-SwiGLU), a variant of SwiGLU for Mixture-of-Experts (MoE) models that adjusts expert gate sharpness according to token-level routing confidence. Specifically, $κ$-SwiGLU parameterizes the SiLU gate sharpness coefficient as a learnable function of the router logit, enabling each expert gate unit to interpolate between smooth, broadly active gating and sharp, selective gating. We evaluate $κ$-SwiGLU on the FineWeb-Edu dataset across MoE Transformer models ranging from 8 to 28 layers. Across these settings, $κ$-SwiGLU improves mean CORE performance while adding negligible parameters and incurring only a small computational overhead, demonstrating that confidence-aware gate sharpness is a promising mechanism for improving MoE MLPs. The code is available at https://github.com/askerlee/kappa-swiglu.
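
To make "gate sharpness as a function of the router logit" concrete, here is a scalar sketch. The softplus parameterization and the parameters `a` and `b` are assumptions chosen for illustration; the abstract does not specify the functional form, and the linked repository should be consulted for the actual one.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softplus(z):
    return math.log1p(math.exp(z))


def kappa_silu(x, router_logit, a=1.0, b=0.0):
    """SiLU with a sharpness coefficient beta: x * sigmoid(beta * x).

    beta = softplus(a * router_logit + b) is an assumed parameterization that
    keeps the sharpness positive; a and b stand in for learnable parameters.
    """
    beta = softplus(a * router_logit + b)
    return x * sigmoid(beta * x)


def gated_unit(gate_pre, up_pre, router_logit, a=1.0, b=0.0):
    """One SwiGLU-style unit: activated gate pre-activation times the up-projection value."""
    return kappa_silu(gate_pre, router_logit, a, b) * up_pre


# A confident routing decision (large logit) gives a sharper, more ReLU-like gate;
# an unconfident one (very negative logit) gives a smoother, nearly linear gate.
sharp = kappa_silu(-1.0, router_logit=5.0)
smooth = kappa_silu(-1.0, router_logit=-5.0)
```

With `beta = 1` the function reduces to the standard SiLU used in SwiGLU.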

2606.00759 2026-06-02 cs.LG

Distributed GNEP Algorithms without Multiplier Sharing and Applications to Multi-Robot Coordination and Contextual Bandit-Based Active Learning

Shao-An Yin

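Affiliations: Shao-An Yin

AI summary: Proposes fully distributed continuous-time algorithms that need no exchange of Lagrange multipliers and converge to generalized Nash equilibria (not only variational equilibria), with applications to multi-robot coordination and to contextual-bandit-based selection of active learning strategies.

Details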
Comments
136 pages, 14 figures
Abstract

Recent advances in artificial intelligence have expanded the focus from classical optimization to include equilibrium analysis in noncooperative games. Many such games involve shared constraints, leading to Generalized Nash Equilibrium Problems (GNEPs). Existing distributed algorithms typically require agents to exchange Lagrange multipliers to enforce consensus and compute variational-GNEs (v-GNEs). This work introduces fully distributed continuous-time algorithms and establishes convergence without requiring multiplier exchange, thereby reducing information exchange per iteration while improving privacy preservation. The analysis focuses on strongly monotone games with convex individual constraints and linear shared constraints. I also propose several discretization schemes for the continuous-time algorithms. The proposed approach converges to general GNEs, rather than being restricted to v-GNEs, with the attained equilibrium depending on the initialization. The effectiveness of the proposed method is demonstrated through applications in multi-robot coordination and placement. In the second part, this work includes research conducted in collaboration with Amazon scientists. One of the most challenging problems in real-world machine learning is labeled data collection, which typically requires substantial human effort and cost. Active learning aims to reduce this labeling requirement. Existing handcrafted active learning strategies, however, generally perform well only on specific types of datasets, which are often unknown in advance. In this work, I propose using contextual bandits to adaptively select the most suitable active learning strategy. The effectiveness of the proposed approach is demonstrated on publicly available external datasets.

2606.00756 2026-06-02 cs.AI

CoMIC: Collaborative Memory and Insights Circulation for Long-Horizon LLM Agents in Cloud-Edge Systems

Yannan Wang, Longli Yang, Zhen Liu, Abhishek Kumar, Carsten Maple

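Affiliations: Beijing Jiaotong University; The Alan Turing Institute; University of Warwick

AI summary: Proposes CoMIC, a parameter-update-free cloud-edge framework with a centralized-reflection, decentralized-execution design that aggregates cross-agent experience keyed by semantic subgoal identifiers, improving the progress rate and action grounding of weak edge agents on long-horizon tasks.

Details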
Abstract

Deploying lightweight Large Language Model (LLM) agents on edge servers can reduce latency and move agentic services closer to users, but resource-constrained edge models often struggle with long-horizon tasks that require persistent memory, subgoal tracking, and reflection. Fine-tuning edge models after deployment is costly and difficult to scale across heterogeneous nodes, while purely local memory leaves agents with isolated experience and growing prompt context. We propose CoMIC, a parameter-update-free cloud-edge framework for Collaborative Memory and Insights Circulation. CoMIC follows a Centralized Reflection, Decentralized Execution design: edge agents execute locally using subgoal-oriented hierarchical memory and selective re-expansion of relevant histories, while a cloud-side LLM critic asynchronously evaluates completed trajectories, filters reusable experience, and aggregates cross-agent guidance keyed by semantic subgoal identifiers. Across five long-horizon agent tasks spanning symbolic planning and text interaction, CoMIC improves progress rate and action grounding for weak edge agents and yields task-dependent success-rate gains without updating model parameters.

2606.00755 2026-06-02 cs.CL cs.LG

Internalize the Temperature: On-Policy Self-Distillation as Policy Reheater for Reinforcement Learning

Xuewei Yang, Jiachen Yu, Jie Wu, Shaoning Sun, Junjie Wang, Yujiu Yang

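Affiliations: Tsinghua University

AI summary: Proposes Temperature-Scaled On-Policy Self-Distillation (TS-OPSD), which internalizes the exploratory effect of temperature into model parameters to mitigate entropy collapse in reinforcement learning, with no external teacher or additional inference cost.

Details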
Abstract

Reinforcement learning from verifiable rewards improves the reasoning ability of large language models, but often suffers from entropy collapse, in which increasingly concentrated policies reduce rollout diversity and useful learning signals. Existing remedies either constrain the RL objective (e.g., entropy regularization) or adjust sampling temperature during rollout collection, but these interventions remain external to the model parameters. We propose Temperature-Scaled On-Policy Self-Distillation (TS-OPSD), a lightweight policy reheating method that internalizes the exploratory effect of temperature into model parameters. Starting from an entropy-collapsed RL checkpoint, TS-OPSD constructs a self-teacher by applying high-temperature scaling to the model's own logits, then distills the resulting smoother distribution back into the student. This policy reheating requires no external teacher, privileged data, or additional inference cost. Experiments on Qwen3-4B-Base and Qwen3-8B-Base show that policy reheating yields a stronger initialization for continued RL than both standard continued RL and rollout-level temperature reheating. Further analyses show that TS-OPSD mainly reduces output sharpness while preserving intermediate representations, top candidate sets, and reasoning capability. These results suggest that entropy restoration can serve as a simple post-collapse intervention for extending reasoning-oriented RL.
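
The self-teacher construction can be sketched for a single token position. This is a minimal illustration, assuming a forward KL divergence from the temperature-scaled teacher to the student; the abstract does not state the divergence or its direction, and the function names are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits / temperature."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)


def kl_divergence(p, q):
    """KL(p || q) for two distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def reheating_loss(student_logits, frozen_logits, temperature):
    """Distill the high-temperature version of the model's own (frozen) logits
    back into the student evaluated at temperature 1 (assumed objective)."""
    teacher = softmax(frozen_logits, temperature)
    student = softmax(student_logits)
    return kl_divergence(teacher, student)


# A sharply peaked (collapsed) next-token distribution and its reheated teacher.
collapsed_logits = [4.0, 1.0, 0.0]
teacher = softmax(collapsed_logits, temperature=2.0)
loss = reheating_loss(collapsed_logits, collapsed_logits, temperature=2.0)
```

Minimizing such a loss pushes the student's ordinary temperature-1 distribution toward the smoother teacher, which is the sense in which the temperature is moved into the parameters.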

2606.00751 2026-06-02 cs.CV

Head-Pose-Aware Visual Speech Recognition with FiLM Modulation

Matthew Kit Khinn Teng, Haibo Zhang, Takeshi Saitoh

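Affiliations: Department of Artificial Intelligence, Kyushu Institute of Technology

AI summary: Proposes the HP-VSR-ResFiLM framework, which explicitly incorporates head-pose information through a pose-conditioned residual FiLM block, reaching word error rates of 25.0% on LRS2 and 33.2% on LRS3 and improving the robustness of visual speech recognition under non-frontal views.

Details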
Comments
27 pages, 4 figures
Abstract

Visual Speech Recognition (VSR) aims to recognize speech from visual cues such as lip movements, but its performance is fundamentally limited by viseme ambiguity and pose-induced variations that introduce geometric distortions and occlusions. Existing approaches mainly rely on linguistic context or implicit invariance, leaving visual representations insufficiently robust under non-frontal views. In this work, we propose a pose-aware phoneme-level framework, termed HP-VSR-ResFiLM, that explicitly incorporates head-pose information into visual feature extraction. The proposed framework adopts a two-stage pipeline consisting of a pose-conditioned visual encoder in Stage 1 and a pretrained NLLB language model in Stage 2 for phoneme-to-text reconstruction. Specifically, Stage 1 incorporates a pose-conditioned residual Feature-wise Linear Modulation (FiLM) block after the 2D CNN frontend to adaptively refine visual representations using head-pose information. Experiments on LRS2 and LRS3 demonstrate that HP-VSR-ResFiLM achieves competitive performance under comparable training conditions, attaining word error rates (WER) of 25.0% and 33.2%, respectively, without relying on additional training data. Ablation studies further show that a single residual FiLM block consistently improves overall WER, while deeper modulation at Layers 3 and 4 provides larger gains for samples with yaw angles greater than 30° without degrading performance for smaller pose variations. These findings demonstrate that explicit pose-aware feature modulation offers an effective and computationally efficient solution for improving VSR robustness in unconstrained settings.
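
Feature-wise Linear Modulation scales and shifts each feature channel with values predicted from a conditioning input. The sketch below shows one plausible residual form, `x + (gamma * x + beta)`, with `gamma` and `beta` produced by linear maps of a pose vector. The residual form, the pose layout (yaw, pitch, roll), and the weight names are assumptions; the paper's exact block may differ.

```python
def linear(weights, bias, vec):
    """Dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, bias)]


def residual_film(features, pose, w_gamma, b_gamma, w_beta, b_beta):
    """Residual FiLM on a per-channel feature vector.

    gamma and beta are predicted from the head pose; the modulated features
    are added back to the input so that zero weights give the identity map.
    """
    gamma = linear(w_gamma, b_gamma, pose)
    beta = linear(w_beta, b_beta, pose)
    return [x + g * x + b for x, g, b in zip(features, gamma, beta)]


# Two feature channels conditioned on a hypothetical (yaw, pitch, roll) vector.
features = [1.0, -2.0]
pose = [0.5, 0.0, 0.0]
zeros_w = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
zeros_b = [0.0, 0.0]
unchanged = residual_film(features, pose, zeros_w, zeros_b, zeros_w, zeros_b)
```

Initializing the modulation weights at zero makes the block start as an identity, a common design choice when inserting a new module into an existing encoder.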

2606.00750 2026-06-02 cs.CL

I-WebGenBench: Evaluating Interactivity in LLM-Generated Scientific Web Applications

Dasen Dai, Biao Wu, Meng Fang, Shuoqi Li, Wenhao Wang

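Affiliations: Vast Intelligence Lab; UTS; University of Liverpool

AI summary: Proposes a Paper-to-Interactive-System Agent that converts research papers into executable interactive web systems, and builds the PaperVoyager framework to model mechanisms and interaction logic explicitly, significantly improving generation quality.

Details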
Comments
9 pages, 4 figures
Abstract

Recent advances in visual language models have enabled autonomous agents for complex reasoning, tool use, and document understanding. However, existing document agents mainly transform papers into static artifacts such as summaries, webpages, or slides, which are insufficient for technical papers involving dynamic mechanisms and state transitions. In this work, we propose a Paper-to-Interactive-System Agent that converts research papers into executable interactive web systems. Given a PDF paper, the agent performs end-to-end processing without human intervention, including paper understanding, system modeling, and interactive webpage synthesis, enabling users to manipulate inputs and observe dynamic behaviors. To evaluate this task, we introduce a benchmark of 19 research papers paired with expert-built interactive systems as ground truth. We further propose PaperVoyager, a structured generation framework that explicitly models mechanisms and interaction logic during synthesis. Experiments show that PaperVoyager significantly improves the quality of generated interactive systems, offering a new paradigm for interactive scientific paper understanding.

2606.00746 2026-06-02 cs.CV cs.LG

Scaling Parallel Sequence Models to Foundation-Scale Vision Encoders

Yitong Jiang, Hongjun Wang, Collin McCarthy, Hanrong Ye, David Wehr, Xinhao Li, Qi Dou, Tianfan Xue, Ka Chun Cheung, Simon See, Wonmin Byeon, Ke Chen, Kai Han, Jinwei Gu, Hongxu Yin, Pavlo Molchanov, Jan Kautz, Sifei Liu

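Affiliations: NVIDIA; The Chinese University of Hong Kong; The University of Hong Kong; University of California, San Diego

AI summary: Proposes C-GSPN, a foundation-scale vision encoder based on 2D spatial propagation that uses a fast CUDA kernel, a compressed latent-space propagation block, and two-stage cross-operator distillation to improve performance with fewer parameters and deliver efficient inference.

Details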
Abstract

Vision foundation models are bottlenecked by the quadratic cost of self-attention, which limits usable resolution and increases the cost of large-scale pretraining. Subquadratic alternatives such as linear attention and state-space models reduce this cost, but often serialize images into 1D token streams and weaken the 2D spatial structure important for vision. Generalized Spatial Propagation Networks (GSPN) instead propagate context directly on the 2D grid through line-scan recurrences, achieving near-linear complexity without positional embeddings, but have seen little use as foundation-scale encoders. We present C-GSPN, a foundation-scale vision encoder based on 2D spatial propagation. C-GSPN makes the operator practical through three improvements: (1) a fast GSPN CUDA kernel that fuses per-step launches into a single warp-specialized implementation with shared-memory tiling, coalesced access, and a compact multi-channel propagation, reaching over 90% of peak memory bandwidth and running up to 40--52x faster than the original GSPN implementation; (2) a compressed latent-space propagation block with fused normalization, which turns kernel-level speed into block- and model-level efficiency; and (3) a two-stage cross-operator distillation recipe that trains the new architecture from an attention teacher without the cost of from-scratch foundation-scale training. Distilled with 600M image-text pairs, C-GSPN matches an isomorphic ViT baseline with 15% fewer parameters, improves ADE20K segmentation by +2.1%, transfers to high resolution with a fraction of the data needed from scratch, and delivers a 4x end-to-end block speedup at 2K with single-pass, tiling-free inference.

2606.00741 2026-06-02 cs.LG cs.AI stat.ML

Quantum Tunneling-Aware Machine Learning: Physics-Derived Noise Models for Robust Deployment

Uiwon Hwang, Jaeho Hwang

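Affiliations: Department of Computer Science and Engineering; Human-Centered Artificial Intelligence Research Institute

AI summary: Proposes quantum tunneling-aware machine learning (QTAML), derives the deployment-time weight-error distribution via the WKB approximation, and designs the Tunneling-Aware Compensation (TAC) algorithm, which recovers model accuracy at low ECC overhead without retraining or labels.

Details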
Abstract

Transistor scaling is approaching a quantum-mechanical limit, as thin gate oxides induce electron leakage through quantum tunneling. Unlike conventional digital systems, AI inference can tolerate such errors provided their structure is modeled correctly. In this paper, we introduce quantum tunneling-aware machine learning (QTAML). We derive the deployment-time weight-error distribution from first principles using the Wentzel-Kramers-Brillouin (WKB) approximation and show that it has structure that generic Gaussian noise models miss: an exact affine mean drift, a per-bit variance hierarchy dominated by the most-significant bit, and a per-layer dependence on $\|W_\ell\|_\infty$ and the trained-network Jacobian. We package these three structural properties into a single deployment-time algorithm, Tunneling-Aware Compensation (TAC), that combines closed-form mean correction with an optimal layer-adaptive bit-budget allocation derived from the WKB variance decomposition. Across four convolutional architectures at $p_\mathrm{flip}$=0.10 and a transformer encoder at $p_\mathrm{flip}$=0.05, TAC reaches $95\%$ of clean accuracy with 3.4$\times$ to 33.6$\times$ less ECC overhead than Uniform-MSP, the natural baseline derived from the same physics. The closed-form saturation ratio $ρ^*$ predicts these gains in advance, and on heterogeneous architectures WKB-derived scoring outperforms magnitude-based allocation by up to 24 percentage points at small budgets. The algorithm requires no retraining, no labels, and no inference-time overhead. We also verify the WKB-derived distributional theorems to Monte Carlo precision. These results connect WKB tunneling physics with noise-aware deep learning and suggest a principled path toward hardware--software co-design beyond conventional scaling limits.
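
Two of the structural properties named in this abstract, an affine mean drift and a variance hierarchy dominated by the most-significant bit, already appear in a much simpler toy model. The sketch below assumes an unsigned integer weight whose bits flip independently with the same probability `p_flip`. That uniform-flip model is an assumption made for illustration and is not the paper's WKB-derived distribution.

```python
def expected_drift(value, num_bits, p_flip):
    """Mean error of a stored unsigned integer under independent bit flips.

    Flipping bit k changes the value by +2**k if the bit was 0 and by -2**k
    if it was 1, so the mean is p_flip * ((2**num_bits - 1) - 2 * value),
    which is affine in the stored value.
    """
    drift = 0.0
    for k in range(num_bits):
        bit = (value >> k) & 1
        drift += p_flip * (2 ** k) * (1 - 2 * bit)
    return drift


def per_bit_variance(num_bits, p_flip):
    """Error variance contributed by each bit: p * (1 - p) * 4**k."""
    return [p_flip * (1.0 - p_flip) * 4 ** k for k in range(num_bits)]


def msb_share(num_bits, p_flip):
    """Fraction of the total error variance caused by the most-significant bit."""
    variances = per_bit_variance(num_bits, p_flip)
    return variances[-1] / sum(variances)


share_8bit = msb_share(num_bits=8, p_flip=0.10)
```

Because the per-bit variances grow as powers of four, the top bit accounts for roughly three quarters of the variance in this toy model, which suggests why protecting high-order bits first can be an efficient use of an ECC budget.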

2606.00739 2026-06-02 cs.LG

Score $\times$ Decoder: A Unified View of Unsupervised Inference-Time Scaling for Hallucination Mitigation

Yun-Chen Cheng, Che-Yu Lin, Cheng-Lin Yang

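Affiliations: CyCraft AI Lab, Taiwan

AI summary: Proposes the Score × Decoder framework, which pairs four intrinsic scores (perplexity, contrastive, power-distribution likelihood, self-verification) with three decoding families (optimization, sampling, consensus) to select the best combination for mitigating LLM hallucination without supervision.

Details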
Abstract

Large language models hallucinate even when the answer lies within their parameters. While inference-time scaling can surface this latent knowledge, the most effective methods require supervision: a trained verifier or reward model. We ask what can be done with only a base language model: which intrinsic signal best identifies correct outputs, and how should it be decoded? We cast this as a score $\times$ decoder grid pairing four scores (perplexity, contrastive, power-distribution likelihood, and self-verification) with three decoding families (optimization, sampling, consensus), and evaluate every cell on MATH500 with the base and instruction-tuned Qwen3-1.7B. While self-verification, which prompts the model to judge its own answer and is sharpened by a training-free virtual-thinking prefix, works well in most settings, no score has a fixed quality: its value depends on the decoder that consumes it and on model capability. When no supervision is available, the score and the decoding family must be chosen together.
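
The claim that a score's value depends on the decoder consuming it can be illustrated with one score and two decoders. In this sketch the score is inverse perplexity, "optimization" is read as picking the single best-scoring sample, and "consensus" as a score-weighted vote over final answers. These readings, the candidate data, and the function names are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def inverse_perplexity(candidate):
    """Intrinsic score: exp(mean token log-probability), higher is better."""
    logprobs = candidate["token_logprobs"]
    return math.exp(sum(logprobs) / len(logprobs))


def best_of_n(candidates, score):
    """Optimization-style decoding: return the answer of the top-scoring sample."""
    return max(candidates, key=score)["answer"]


def weighted_vote(candidates, score):
    """Consensus-style decoding: sum scores per distinct answer, return the largest."""
    totals = defaultdict(float)
    for candidate in candidates:
        totals[candidate["answer"]] += score(candidate)
    return max(totals, key=totals.get)


# Hypothetical samples: one fluent outlier and two less fluent samples that agree.
candidates = [
    {"answer": "7", "token_logprobs": [-0.1, -0.1]},
    {"answer": "5", "token_logprobs": [-0.7, -0.7]},
    {"answer": "5", "token_logprobs": [-0.6, -0.6]},
]
picked_by_argmax = best_of_n(candidates, inverse_perplexity)
picked_by_vote = weighted_vote(candidates, inverse_perplexity)
```

On this data the two decoders return different answers from the same score, which is the kind of interaction the grid evaluation is designed to expose.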

2606.00738 2026-06-02 cs.LG cs.AI cs.CV

SORA: Free Second-Order Attacks in Fast Adversarial Training

Mazdak Teymourian, Ramtin Moslemi, Farzan Rahmani, Mohammad Hossein Rohban

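Affiliations: arXiv.org cs.LG

AI summary: Targets catastrophic overfitting in fast adversarial training: proposes perturbation variability and the gradient-alignment metric PertAlign to predict and prevent overfitting, and designs the adaptive step-size method SORA, which achieves state-of-the-art robustness and clean accuracy.

Details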
Comments
Accepted at ICML 2026
Abstract

Adversarial Training (AT) is a leading defense against adversarial examples but often suffers from Catastrophic Overfitting (CO) in efficient single-step variants, where robustness to multi-step attacks collapses despite high single-step performance. We address this failure mode with two contributions. First, we formalize Epsilon Overfitting (EO), a perspective in which fixed perturbation magnitudes and directions exacerbate CO, and show that introducing perturbation variability significantly improves robust generalization across different architectures and datasets. Second, we propose PertAlign (Perturbation Alignment), a theoretically grounded, computationally negligible metric that predicts CO onset by measuring gradient alignment across attack stages. Leveraging these insights, we introduce SORA, an adaptive step-size AT method that dynamically adjusts perturbations based on loss surface geometry. SORA consistently prevents CO, achieves state-of-the-art robustness and clean accuracy, and generalizes across datasets and architectures using a single fixed set of hyperparameters, which is essential for applicability in fast AT. Extensive experiments on diverse datasets and architectures show that SORA matches or surpasses the robustness of prior methods while delivering higher clean accuracy and superior efficiency. Code is available at https://github.com/SecondOrderAT/SORA.
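
Gradient alignment between two attack stages is commonly measured with cosine similarity. The sketch below shows only that generic measure; it is not PertAlign, whose exact definition the abstract does not give, and the choice of which two gradients to compare is an assumption.

```python
import math


def cosine_alignment(grad_a, grad_b):
    """Cosine similarity between two input-gradient vectors.

    Values near 1 mean the loss looks locally linear between the two points
    where the gradients were taken; low or negative values indicate curvature.
    """
    dot = sum(a * b for a, b in zip(grad_a, grad_b))
    norm_a = math.sqrt(sum(a * a for a in grad_a))
    norm_b = math.sqrt(sum(b * b for b in grad_b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a vanishing gradient as uninformative
    return dot / (norm_a * norm_b)


# Hypothetical gradients at the clean input and at a perturbed input.
grad_clean = [0.8, -0.1, 0.3]
grad_perturbed = [0.7, 0.0, 0.4]
alignment = cosine_alignment(grad_clean, grad_perturbed)
```

A training loop could monitor such a value per batch and shrink the attack step when it drops, which is one plausible way to connect alignment to an adaptive step size.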

2606.00737 2026-06-02 cs.RO math.OC

Beyond Pure Sampling: Hybrid Optimization Mechanisms for Non-Convex Model Predictive Control

Yuichiro Aoyama, Minchan Jung, Akash Ratheesh, Evangelos A. Theodorou

Affiliations: School of Aerospace Engineering, Georgia Institute of Technology, Atlanta, GA, USA; Development Division, Komatsu Ltd., Tokyo, Japan; Department of Electrical and Computer Engineering, Inha University, Incheon, Republic of Korea

AI summary: Proposes a dual-step optimization mechanism for non-convex model predictive control that combines a gradient-exploiting DDP phase with sampling based on the inverse Hessian, achieving higher and more stable success rates than pure sampling methods such as MPPI across several robot navigation tasks.

Details
Comments
28 pages, 13 figures
Abstract

This paper investigates the optimization mechanisms of non-convex Model Predictive Control (MPC) using the Maximum Entropy Differential Dynamic Programming (ME-DDP) framework. Navigating non-convex cost landscapes induced by nonlinear dynamics, multiple obstacles, and similar factors remains a fundamental challenge in robotics, where gradient-based methods frequently converge to suboptimal local minima. We demonstrate a dual-step optimization mechanism designed to overcome these traps: (1) an initial phase that uses DDP to exploit the gradient of the cost landscape, followed by (2) disruption of the optimization via sampling from policies characterized by the inverse Hessian of the action-value function. We provide a rigorous analysis of this sampling mechanism for three ME-DDP variants: Unimodal Gaussian ME-DDP, Multimodal Gaussian ME-DDP, and Stein Variational DDP. Furthermore, on navigation tasks for four robotic systems in cluttered environments, we extensively benchmark the three ME-DDP variants against deterministic DDP and against one of the most successful sampling-based schemes, Model Predictive Path Integral (MPPI) control, with three policy parameterizations and update laws that correspond to those of the ME-DDP variants. The results show that in low-dimensional systems, where the cost landscapes are relatively simple and local information is sufficiently representative, our framework consistently outperforms the MPPI variants. In high-dimensional systems, MPPI can occasionally discover aggressive maneuvers that enable it to steer the systems faster than DDP-based methods, whereas our method maintains a higher, more stable success rate. Finally, we validate the practical efficacy of the framework through hardware experiments with a quadrotor navigating a dense, non-convex obstacle field, confirming the robustness of the proposed framework for real-world deployment.
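
For readers unfamiliar with the sampling baseline, the standard MPPI update weights each sampled control perturbation by the exponentiated negative trajectory cost and averages. The sketch below shows that textbook weighting only; it does not implement ME-DDP or the paper's parameterizations, and the numbers are made up.

```python
import math


def mppi_update(nominal, perturbations, costs, lam):
    """One MPPI step on a control sequence.

    nominal:       list of controls, one per time step
    perturbations: list of sampled perturbation sequences, same length as nominal
    costs:         trajectory cost of each perturbed rollout
    lam:           temperature; smaller values concentrate weight on the best rollout
    """
    lowest = min(costs)  # subtracting the minimum keeps the exponentials stable
    weights = [math.exp(-(cost - lowest) / lam) for cost in costs]
    total = sum(weights)
    return [
        u + sum(w * eps[t] for w, eps in zip(weights, perturbations)) / total
        for t, u in enumerate(nominal)
    ]


# Two hypothetical rollouts around a zero nominal sequence of length 2.
nominal = [0.0, 0.0]
perturbations = [[1.0, 1.0], [-1.0, -1.0]]
updated = mppi_update(nominal, perturbations, costs=[0.0, 1000.0], lam=1.0)
```

Because the update uses only rollout costs, it needs no gradients, which is the contrast with the DDP phase that the paper's dual-step mechanism starts from.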

2606.00730 2026-06-02 cs.RO

Infeasible optimization problems and the hierarchical augmented Lagrangian method in imitation learning

Roland Andrews, Justin Carpentier, Ajay Sathya

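Affiliations: arXiv.org; University of Cambridge

AI summary: Addresses training instability caused by infeasible constraints in imitation learning with an approach based on the augmented Lagrangian method that drives the policy toward the solution of a closest-feasible constrained problem, illustrated on a driving example.

Details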
Abstract

Imitation learning (IL) is an effective approach to train complex robotics policies. Recent works have introduced hard constraints into imitation-learning optimization problems to ensure safety, stability, and robustness of the learned policy. However, we argue that these constraints are sometimes infeasible, which can lead to unstable or difficult training dynamics. We study a simple remedy for such situations based on recent theoretical results on the augmented Lagrangian method in infeasible settings. We show that our approach drives the learned policy toward the solution of a closest-feasible constrained IL problem with desirable properties. The method is illustrated on a toy driving example with a total-acceleration constraint and pedestrian-safety constraints, a setting in which infeasibility can naturally arise while still allowing a safe learned policy.
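
The behavior of the augmented Lagrangian method on an infeasible problem can be seen in a one-dimensional toy. The example below is constructed for illustration and is not the paper's hierarchical method: it minimizes `(x - 0.5)**2` subject to the contradictory equality constraints `x = 1` and `x = -1`, using the plain method of multipliers with a fixed penalty `rho`.

```python
def alm_infeasible(rho=1.0, iters=60):
    """Method of multipliers on: min (x - 0.5)**2  s.t.  x - 1 = 0  and  x + 1 = 0.

    The constraints cannot both hold. Setting the derivative of the augmented
    Lagrangian to zero gives the inner minimizer in closed form:
        2*(x - 0.5) + lam1 + lam2 + rho*(x - 1) + rho*(x + 1) = 0
    """
    lam1 = lam2 = 0.0
    x = 0.0
    for _ in range(iters):
        x = (1.0 - (lam1 + lam2)) / (2.0 + 2.0 * rho)  # exact inner minimization
        lam1 += rho * (x - 1.0)  # multiplier update for x - 1 = 0
        lam2 += rho * (x + 1.0)  # multiplier update for x + 1 = 0
    return x, lam1, lam2


x_final, lam1_final, lam2_final = alm_infeasible()
```

In this toy the iterate settles at `x = 0`, the point that minimizes the total squared constraint violation, rather than at the unconstrained minimizer `0.5`, while the two multipliers drift apart without bound. That is the flavor of "closest-feasible" behavior the abstract refers to, shown here only for this hand-built case.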

2606.00729 2026-06-02 cs.AI

AI Sovereignty as National Learning Capacity: A Human-Centered Learning Mechanics Viewpoint on France, the United States, and China

Kim Phuc Tran

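Affiliations: Univ. Lille, ENSAIT, ULR 2461 – GEMTEX

AI summary: Views national AI development as a dynamical learning system governed by a controlled balance between information injection and entropy dissipation, and argues that AI sovereignty stems from a country's capacity to regulate its own information dynamics rather than from scale alone.

Details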
Abstract

Artificial Intelligence is often discussed in France in terms of investment, compute capacity, regulation, employment, sovereignty, and education. These dimensions are usually treated separately. This viewpoint paper proposes a unified interpretation: France should be understood as a national AI learning system. Building on Human-Centered Learning Mechanics (HCLM), recently formulated as a dynamical framework for entropy-regulated representation learning, we interpret national AI development as a controlled balance between information injection and entropy dissipation. Information injection corresponds to compute, data, talent, research, capital, industrial deployment, and institutional experimentation. Entropy dissipation corresponds to organizational complexity, coordination frictions, energy constraints, regulatory uncertainty, talent mobility pressures, and opportunities to strengthen industrial absorption. The central claim is that AI sovereignty does not emerge from scale alone but from a country's capacity to regulate its own information dynamics. This paper connects HCLM with neural scaling laws, endogenous growth theory, creative destruction, and game theory. It argues that the French AI debate should move beyond the binary opposition between techno-optimism and regulation-first caution. A competitive and human-centered AI strategy requires a controlled regime in which information injection grows faster than institutional dissipation, while avoiding unstable, unequal, or energy-intensive expansion. We provide a mathematical model, measurable policy indicators, game-theoretic propositions, illustrative simulations of national AI regimes, and concrete policy implications for France. The proposed viewpoint reframes AI policy as the governance of an open, strategic, non-equilibrium learning system.

2606.00728 2026-06-02 cs.CL

From Empathy to Personalized Empathy: Adapting Empathetic Strategies to Individual Users

从共情到个性化共情:根据个体用户调整共情策略

Wuqiang Zheng, Chengbing Wang, Yilin Yang, Junyi Cheng, Jianfei Xiao, Hu Sun, Yi Xie, Yangyang Li, Wenjie Wang

发表机构 * University of Science and Technology of China(中国科学技术大学) Huawei Technologies(华为技术)

AI总结 针对大语言模型长期交互中忽略用户个性对共情策略影响的问题,提出个性化共情任务,构建PersonaEmp数据集和PereGRM奖励建模框架,实验证明其有效提升个性化共情能力。

AI中文摘要

随着大语言模型(LLMs)越来越多地部署在与用户的长期交互中,共情已成为一项日益重要的能力。然而,现有研究忽视了用户个性特征对长期交互中共情策略的影响。为弥补这一空白,我们引入了个性化共情任务,其重点是根据从历史中获得的用户个性化特征调整共情策略。为了研究和增强这一能力,我们构建了PersonaEmp,一个基于长期用户-AI交互构建的个性化共情数据集,具有丰富的用户历史、人物信息和寻求共情的查询。我们进一步提出了PereGRM,一种奖励建模框架,它将共情评估结构与动态评估标准生成相结合,用于细粒度奖励建模。在不同设置和多个评判模型下的实验结果表明,PereGRM始终取得最强的性能提升,表明其在增强个性化共情能力方面的有效性。

英文摘要

As Large Language Models (LLMs) are increasingly deployed in long-term interactions with users, empathy has become an increasingly important capability. However, existing research overlooks the influence of users' personality traits on empathetic strategies during long-term interactions. To address this gap, we introduce the task of personalized empathy, which focuses on adapting empathetic strategies according to users' personalized characteristics derived from history. To study and enhance this capability, we construct PersonaEmp, a personalized empathy dataset built from long-term user-AI interactions, featuring rich user histories, persona information, and empathy-seeking queries. We further propose PereGRM, a reward modeling framework that combines the empathy evaluation structure with dynamic evaluation criteria generation for fine-grained reward modeling. Experimental results across different settings and multiple judge models show that PereGRM consistently achieves the strongest performance improvements, indicating its effectiveness for enhancing personalized empathetic capabilities.

2606.00726 2026-06-02 cs.AI

Latent Reward Steering: An Adaptive Inference-Time Framework that Implicitly Promotes Cognitive Behaviors in Reasoning LLMs

潜在奖励引导:一种自适应推理时框架,隐式促进推理大语言模型中的认知行为

Jiakang Li, Guanyu Zhu, Can Jin, Chenxi Huang, Dexu Yu, Ronghao Chen, Yang Zhou, Hongwu Peng, Xuanqi Lan, Dimitris N. Metaxas, Youhua Li

发表机构 * Rutgers University(罗格斯大学) South China Agricultural University(华南农业大学) Columbia University(哥伦比亚大学) Fenz.AI QuantaAlpha Adobe Santa Clara University(圣克拉拉大学) City University of Hong Kong(香港城市大学)

AI总结 提出潜在奖励引导(LRS)框架,通过优化稀疏自编码器潜在状态隐式促进认知行为,利用最终答案正确性训练潜在奖励模型估计中间状态质量,并在推理时提供状态特定的修正方向,实验表明该方法能提升推理性能并修复原始推理错误。

AI中文摘要

强推理不仅依赖于模型知识,还取决于生成过程中认知行为的有效部署。现有方法通常依赖显式的行为级控制,当失败和所需修正因推理状态、任务和模型而异时,其适应性不足。为此,我们提出潜在奖励引导(LRS),一种自适应推理时框架,通过优化隐式携带认知行为的稀疏自编码器(SAE)潜在状态来促进认知行为。LRS不依赖预定义的认知行为或由此衍生的引导方向,而是基于最终答案正确性在推理轨迹上训练潜在奖励模型,以估计中间潜在状态的质量。推理时,奖励梯度为脆弱的潜在状态提供状态特定的修正方向,而奖励与置信度门控将干预限制在奖励信号标记为脆弱的状态上。在多个推理LLM骨干和基准上的实验表明,LRS一致地提升了相对于各种基线的性能,事后分析进一步表明LRS隐式促进了修复原始推理错误的良好认知行为。代码见:https://github.com/jiakanglee/Latent-Reward-Steering。

英文摘要

Strong reasoning depends not only on model knowledge but also on how effectively cognitive behaviors are deployed during generation. Existing methods often rely on explicit behavior-level control, making them insufficiently adaptive when failures and required corrections vary across reasoning states, tasks, and models. To this end, we propose Latent Reward Steering (LRS), an adaptive inference-time framework that promotes cognitive behaviors by optimizing the sparse-autoencoder (SAE) latent states that implicitly carry them. Rather than relying on predefined cognitive behaviors or steering directions derived from them, LRS trains a latent reward model on reasoning traces by final answer correctness to estimate the quality of intermediate latent states. During inference, reward gradients provide state-specific correction directions for fragile latent states, while a reward and confidence gate restricts intervention to states the reward signal flags as fragile. Experiments on multiple reasoning LLM backbones and benchmarks show that LRS consistently improves performance over various baselines, and post-hoc analyses further indicate that LRS implicitly promotes good cognitive behaviors that fix the original reasoning errors. Code is available at: https://github.com/jiakanglee/Latent-Reward-Steering.
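
The gated reward-gradient correction described in this abstract can be illustrated with a rough sketch. This is not the authors' implementation: it assumes a hypothetical linear-logistic reward model (`w`, `b`) over a latent vector `z` so that the gradient has a closed form, and it uses only a reward threshold as the gate (the confidence part of the gate is omitted). The function name `steer_latent` and the parameters `reward_threshold` and `step_size` are illustrative assumptions.

```python
import numpy as np

def steer_latent(z, w, b, reward_threshold=0.5, step_size=0.1):
    """Nudge a latent state along the reward gradient when its predicted reward is low.

    Assumption: the reward model is sigmoid(w . z + b), a stand-in for a learned
    latent reward model. Returns (new_z, predicted_reward, was_steered).
    """
    z = np.asarray(z, dtype=float)
    w = np.asarray(w, dtype=float)
    reward = 1.0 / (1.0 + np.exp(-(w @ z + b)))  # predicted chance of a correct final answer
    if reward >= reward_threshold:
        # Gate closed: the state is not flagged as fragile, so leave it untouched.
        return z, float(reward), False
    # Gradient of sigmoid(w . z + b) with respect to z.
    grad = reward * (1.0 - reward) * w
    return z + step_size * grad, float(reward), True
```

A state whose predicted reward is already above the threshold is returned unchanged, which mirrors the idea of restricting intervention to fragile states.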

2606.00724 2026-06-02 cs.CL cs.AI

WaveFilter: Enhancing the Long-Context Capability of Diffusion LLMs via Wavelet-Guided KV Cache Filtering

WaveFilter: 通过小波引导的KV缓存过滤增强扩散型大语言模型的长上下文能力

Jinnan Yang, Yan Wang, Zhen Bi, Kehao Wu, Xiaojie Li, Jungang Lou, Zechao Li, Jing Liu

发表机构 * Nanjing University of Science and Technology(南京理工大学) Alibaba Group(阿里巴巴集团) Huzhou Normal University(湖州师范学院) Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所)

AI总结 针对扩散型大语言模型在长上下文任务中计算开销大和推理延迟高的问题,提出一种无需训练的通用缓存框架WaveFilter,利用小波变换分解长序列以精确识别关键token,构建稀疏KV缓存,从而提升现有KV缓存方法在复杂长上下文任务中的性能。

Comments
8 pages, 3 figures
AI中文摘要

扩散型大语言模型(DLMs)在各种任务中展现出显著优势。然而,受限于其多步迭代推理机制,它们在长上下文任务中的计算开销和推理延迟已成为限制其大规模部署的核心瓶颈。在处理长序列时,现有的键值(KV)缓存机制常常面临生成质量急剧下降的困境,其核心挑战在于如何在超长上下文中精确且高效地过滤关键token。受人类阅读过程的启发,我们提出了WaveFilter,一个通用的、无需训练的缓存框架。该框架创新性地引入小波变换来分解长序列,以实现关键token的精确识别,并基于此构建稀疏KV缓存以计算最终的上下文表示。实验结果表明,WaveFilter作为一个即插即用的通用框架,显著提升了现有主流KV缓存方法在复杂长上下文任务中的性能。

英文摘要

Diffusion Large Language Models (DLMs) have demonstrated significant advantages across various tasks. However, constrained by their multi-step iterative inference mechanism, their computational overhead and inference latency in long-context tasks have become core bottlenecks restricting their large-scale deployment. When processing long sequences, existing Key-Value (KV) caching mechanisms often suffer a drastic drop in generation quality; the core challenge lies in precisely and efficiently filtering critical tokens within ultra-long contexts. Inspired by the human reading process, we propose WaveFilter, a universal and training-free caching framework. This framework introduces the wavelet transform to decompose long sequences and precisely identify key tokens, from which a sparse KV cache is constructed to compute the final contextual representation. Experimental results demonstrate that WaveFilter, as a plug-and-play generic framework, significantly enhances the performance of existing mainstream KV cache methods in complex long-context tasks.

2606.00722 2026-06-02 cs.CL cs.AI

EPIC: Efficient and Parallel Inference under CFG Constraints for Diffusion Language Models

EPIC: 扩散语言模型在上下文无关文法约束下的高效并行推理

Hyundong Jin, Yo-Sub Han

发表机构 * Yonsei University(延世大学)

AI总结 提出EPIC框架,通过词法记忆化、Earley解析验证和松弛兼容子集选择,解决扩散语言模型在CFG约束解码中的低效和并行性损失问题,推理时间降低67.5%,额外开销减少90.5%。

AI中文摘要

控制语言模型输出对于确保结构有效性、可靠性和下游可用性至关重要,扩散语言模型也不例外。最近扩散语言模型解码的进展已将输出控制从常规约束扩展到上下文无关文法(CFG)约束。然而,现有方法的速度可能比无约束解码慢四倍。更重要的是,它们大大削弱了扩散语言模型相对于自回归模型的关键优势之一,即并行解码。这种减慢是因为在并行生成过程中,顺序有效性检查引入了显著开销。我们提出了一个高效的CFG约束解码框架EPIC,解决了这一限制。我们的方法通过结合词法记忆化、使用Earley风格解析(而非确定性自动机)进行验证,以及用于并行提交的松弛兼容子集选择,提高了解码效率。它减少了重复的词法分析和验证开销,同时允许多个兼容令牌一起提交。在三个基准测试上使用四个模型的实验表明,与现有的CFG约束解码方法相比,我们的方法将推理时间减少了高达67.5%,并将额外开销降低了高达90.5%。我们的实现可在https://github.com/hyundong98/EPIC-Decoding.git获取。

英文摘要

Controlling language model outputs is essential for ensuring structural validity, reliability, and downstream usability, and diffusion language models are no exception. Recent advances in diffusion language model decoding have extended output control beyond regular constraints to context-free grammar (CFG) constraints. Existing methods, however, can be up to four times slower than unconstrained decoding. More importantly, they substantially diminish one of the key advantages of diffusion language models over autoregressive models, namely parallel decoding. This slowdown arises because sequential validity checking introduces significant overhead during parallel generation. We propose an efficient CFG-constrained decoding framework, EPIC, that addresses this limitation. Our method improves decoding efficiency by combining lexing memoization, validation using Earley-style parsing instead of deterministic automata, and relaxed compatible subset selection for parallel commit. It reduces repeated lexing and validation overhead while allowing multiple compatible tokens to be committed together. Experiments on three benchmarks using four models show that our method reduces inference time by up to 67.5% and decreases the additional overhead by up to 90.5% compared with existing CFG-constrained decoding methods. Our implementation is available at https://github.com/hyundong98/EPIC-Decoding.git .
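
Of the three ingredients named in this abstract, lexing memoization is the easiest to illustrate in isolation. The sketch below is only an assumption about what such caching could look like, not the EPIC implementation: it memoizes a toy regex lexer with `functools.lru_cache`, so that a text span that reappears across decoding steps is tokenized once. The token specification `TOKEN_SPEC` and the function `lex` are hypothetical.

```python
import re
from functools import lru_cache

# Hypothetical token specification for a tiny arithmetic language.
TOKEN_SPEC = [
    ("NUM", r"\d+"),
    ("ID", r"[A-Za-z_]\w*"),
    ("OP", r"[+\-*/()]"),
    ("WS", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

@lru_cache(maxsize=None)
def lex(text):
    """Tokenize `text`; return a tuple of (kind, lexeme) pairs, or None if unlexable.

    Results are cached per input string, so repeated spans cost one dictionary lookup.
    """
    tokens, pos = [], 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            return None  # no token matches at this position
        if match.lastgroup != "WS":  # whitespace is skipped, not emitted
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tuple(tokens)
```

Calling `lex.cache_info()` reports hits and misses, which gives a quick way to check how much repeated lexing the cache absorbs.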

2606.00718 2026-06-02 cs.AI math.OC

LLM-Driven Co-Evolutionary Automated Heuristic Design for Bi-Component Coupled Combinatorial Optimization

LLM驱动的双组件耦合组合优化的协同进化自动启发式设计

Mingen Kuang, Xudong Deng, Xi Lin, Ye Fan, Jianyong Sun, Jialong Shi

发表机构 * Xi’an Jiao Tong University(西安交通大学) Northwestern Polytechnical University(西北工业大学)

AI总结 提出CoEvo-AHD框架,利用大语言模型协同进化两个算子种群,通过合作评估和联合交叉发现互补逻辑,解决旅行窃贼问题等耦合组合优化问题。

AI中文摘要

虽然大语言模型(LLMs)最近在自动启发式设计(AHD)中展现出潜力,但现有方法通常将启发式作为单一算子或搜索策略生成和进化,限制了它们在诸如旅行窃贼问题(TTP)和旅行采购问题(TPP)等问题中对多个决策子结构之间强耦合建模的能力。在这项工作中,我们提出CoEvo-AHD,一个LLM驱动的双种群协同进化框架,用于耦合组合优化中的自动启发式设计。与先前单独进化单个算子的方法不同,CoEvo-AHD利用LLMs协同进化两个紧密相关的算子种群。合作评估机制明确捕获路径和选择算子之间的交互,而成对评分和协同联合交叉有助于发现互补的算子逻辑,以在耦合决策子空间上实现联合改进。我们进一步设计了一个工具调用环境库,将常用核心操作(如局部搜索增量计算)封装为可调用函数,使LLM生成的算子能够使用标准化接口,而不是重新实现低效且易出错的问题特定循环。在TTP和TPP上的实验表明,CoEvo-AHD自动发现合作启发式组合,并达到与传统启发式竞争的解质量。

英文摘要

While Large Language Models (LLMs) have recently shown promise in Automated Heuristic Design (AHD), existing methods typically generate and evolve heuristics as a single operator or search strategy, limiting their ability to model strong coupling among multiple decision substructures in problems such as the Traveling Thief Problem (TTP) and the Traveling Purchaser Problem (TPP). In this work, we propose CoEvo-AHD, an LLM-driven dual-population co-evolutionary framework for automated heuristic design in coupled combinatorial optimization. Unlike prior methods that evolve individual heuristics in isolation, CoEvo-AHD leverages LLMs to co-evolve two closely related operator populations. A cooperative evaluation mechanism explicitly captures interactions between route and selection operators, while pairwise scoring and synergistic joint crossover help discover complementary operator logic for joint improvement across coupled decision subspaces. We further design a tool-invocation environment library that encapsulates frequently used core operations, such as local-search delta computation, into callable functions, enabling LLM-generated operators to use standardized interfaces instead of reimplementing inefficient and error-prone problem-specific loops. Experiments on TTP and TPP show that CoEvo-AHD automatically discovers cooperative heuristic combinations and achieves competitive solution quality against traditional heuristics.

2606.00717 2026-06-02 cs.LG cs.AI stat.ML

Multi-Agent Conformal Prediction with Personalized Statistical Validity

具有个性化统计有效性的多智能体共形预测

Martin V. Vejling, Christophe A. N. Biscio, Adrien Mazoyer, Petar Popovski, Shashi Raj Pandey

发表机构 * Department of Electronic Systems(电子系统系) Aalborg University(奥尔堡大学) Department of Mathematical Sciences(数学科学系) Institut de Mathématiques de Toulouse(图卢兹数学研究所) Université de Toulouse(图卢兹大学)

AI总结 提出个性化联邦加权共形预测框架,通过局部密度比加权和加权分位数聚合,在保护隐私的同时纠正数据异质性,为每个参与智能体提供渐近有效的边际和校准条件覆盖保证。

AI中文摘要

不确定性量化在高风险机器学习任务中至关重要。然而,共形预测这一原则性解决方案在局部校准数据有限、隐私约束和数据异质性下面临挑战。在多智能体设置中,现有工作无法同时令人满意地解决这些挑战,其保证要么限于智能体间的平均值,要么在异质性设置中失去有效性。因此,我们提出个性化联邦加权共形预测(PFWCP),该框架结合局部密度比加权与加权分位数聚合,以在保护隐私的同时纠正异质性。该方法为每个参与智能体提供渐近有效的边际和校准条件覆盖保证,并支持一次性通信协议。理论分析呈现了对覆盖方差的调整,该调整由有效样本量表达式控制,这在加权共形预测的背景下是必要的,并且在合成和真实数据集上的实验表明,与最先进的联邦共形基线相比,校准质量有所提高。

英文摘要

Uncertainty quantification is essential in high-stakes machine learning tasks. However, one of the principled solutions, conformal prediction, faces challenges under limited local calibration data, privacy constraints, and data heterogeneity. In multi-agent settings, existing works do not address these challenges simultaneously and satisfactorily: their guarantees are either limited to averages across agents or lose validity in heterogeneous settings. Hence, we propose personalized federated weighted conformal prediction (PFWCP), a framework that combines local density ratio weighting with weighted quantile aggregation to correct for heterogeneity while preserving privacy. The method yields asymptotically valid marginal and calibration-conditional coverage guarantees for each participating agent and supports protocols with one-shot communication. Theoretical analysis presents an adjustment to the coverage variance, governed by an effective sample size expression, which is necessary in the context of weighted conformal prediction. Experiments on synthetic and real datasets show improved calibration quality over state-of-the-art federated conformal baselines.
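
Two quantities mentioned in this abstract, a weighted quantile of calibration scores and an effective sample size, can be sketched as follows. This is a simplified illustration under stated assumptions, not the PFWCP procedure: the weights stand in for density-ratio estimates, the test-point weight used in full weighted conformal prediction is omitted, and the effective sample size shown is the common Kish formula, which may differ from the expression in the paper. Function names are hypothetical.

```python
import numpy as np

def weighted_quantile(scores, weights, level):
    """Smallest score whose normalized cumulative weight reaches `level`."""
    scores = np.asarray(scores, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(scores)
    cumulative = np.cumsum(weights[order]) / weights.sum()
    index = np.searchsorted(cumulative, level, side="left")
    # Clamp in case rounding leaves the last cumulative value slightly below `level`.
    return float(scores[order][min(index, scores.size - 1)])

def effective_sample_size(weights):
    """Kish effective sample size: (sum w)^2 / sum(w^2)."""
    weights = np.asarray(weights, dtype=float)
    return float(weights.sum() ** 2 / np.sum(weights ** 2))
```

With equal weights the quantile reduces to the usual empirical quantile and the effective sample size equals the number of calibration points; concentrating weight on a few points shrinks the effective sample size.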

2606.00716 2026-06-02 cs.LG eess.SP

Graph Transfer Learning via Shared Latent Geometry: Theory and Applications

基于共享潜在几何的图迁移学习:理论与应用

Tong Wu, Andrew Campbell, Anna Scaglione

发表机构 * University of Central Florida, USA(中佛罗里达大学) Cornell University, USA(康奈尔大学)

AI总结 提出一种非对称双路径架构,通过教师编码器从高保真模拟器学习算子多项式特征,学生编码器从稀疏数据学习相同潜在几何,实现零样本迁移并给出可证明的误差界。

AI中文摘要

在工程物理系统的推理与控制中,部署时面临高昂的物理代价:状态估计器、逆问题求解器、模型预测控制器、调度器和观测器通常没有闭式解,必须针对每个实例重新求解数值优化问题,且每次需重新提供算子。物理信息学习将这一代价转移到训练阶段,但使用单一编码器路径,其潜在几何在微调时会退化,且无法提供定量迁移保证。我们提出一种非对称双路径架构来解决这两个问题。教师编码器从高保真模拟器中获取特权密集状态,并通过在谱扰动下稳定的算子多项式特征表示系统;学生编码器从稀疏现场数据和算子描述符学习相同的潜在几何。部署时丢弃教师,冻结的学生编码器通过单次前向传播运行,并附带迁移证书。该设计关联了特权信息学习、知识蒸馏和跨模态蒸馏,但目标是跨实例迁移而非固定实例预测:拓扑和算子可以变化,而潜在任务不变。我们通过潜在律之间的Wasserstein距离建立了充分且近乎必要的迁移条件,得到了零样本误差界,并开发了一种在覆盖不完全时主动扩展的有限样本认证协议。该框架适用于任何具有可报告谱的算子的系统。在电力系统估计中,它实现了对100种未见拓扑的零样本迁移,95%的证书通过率,与拓扑感知的牛顿-拉夫逊方法相当的精度,以及亚毫秒级推理。这些结果表明,非对称路径加上算子锚定的潜在几何为认证的零样本推理与控制奠定了基础。

英文摘要

Inference and control in engineered physical systems pay a heavy physics cost at deployment: state estimators, inverse-problem solvers, model-predictive controllers, schedulers, and observers are often not closed-form and must re-solve a numerical optimization per instance, with the operator re-supplied each time. Physics-informed learning moves this cost to training, but uses a single encoder pathway whose latent geometry de-learns under fine-tuning and admits no quantitative transfer guarantee. We propose an asymmetric two-pathway architecture that resolves both issues. A teacher encoder consumes privileged dense states from a high-fidelity simulator and represents the system through operator-polynomial features stable under spectral perturbation; a student encoder learns the same latent geometry from sparse field data and operator descriptors. At deployment the teacher is discarded, and the frozen student runs in a single forward pass with a transfer certificate. The design connects to privileged-information learning, knowledge distillation, and cross-modal distillation, but targets cross-instance transfer rather than fixed-instance prediction: topology and operator may change, while the latent task does not. We establish sufficient and near-necessary transfer conditions via Wasserstein proximity between latent laws, yielding a zero-shot error bound, and develop a finite-sample certification protocol with active expansion when coverage is incomplete. The framework applies wherever a system admits an operator with reportable spectrum. On power-system estimation, it achieves zero-shot transfer to 100 unseen topologies, a 95% certificate pass rate, accuracy competitive with topology-aware Newton--Raphson, and sub-millisecond inference. These results suggest asymmetric pathways plus operator-anchored latent geometry provide a foundation for certified zero-shot inference and control.

2606.00712 2026-06-02 cs.CV

CASTLE2026 Team WDL Technical Report

CASTLE2026 团队 WDL 技术报告

Zhengyang Li, Zhenglin Du, Yi Wen, Fang Liu, Shuo Li, Xu Liu

发表机构 * Key Laboratory of Intelligent Perception and Image Understanding(智能感知与图像理解重点实验室)

AI总结 提出基于 Qwen 的证据感知多模态推理流程,通过提示路由和置信度加权投票解决长视频问答,在 CASTLE 挑战赛中排名第一。

Comments
4 pages
AI中文摘要

CASTLE 挑战赛 @ EgoVis 2026 评估基于 600 多小时多视角记录的长格式自我中心视频问答。每个四选一问题需要来自视频、转录、辅助照片、人物、天数、房间和时间上下文的证据。我们提出了一种基于 Qwen 的证据感知多模态推理流程。我们的系统解析问题提示、检索 ASR 片段、附加辅助图像、采样候选视频帧,并将问题路由到静态视觉、语音/文本、时间和混合类型,并附带专门提示。多次推理通过置信度加权投票进行聚合,并转换为官方 Codabench 格式。在消融实验中,LoRA 将得分从 0.21 提升至 0.50,更多采样帧进一步将其提升至 0.58。我们的最终系统在 CASTLE 挑战赛 @ EgoVis 2026 中排名第一。

英文摘要

The CASTLE Challenge @ EgoVis 2026 evaluates long-form egocentric video question answering over 600+ hours of multi-perspective recordings. Each four-choice question requires evidence from videos, transcripts, auxiliary photos, people, days, rooms, and temporal context. We propose an evidence-aware multimodal reasoning pipeline based on Qwen. Our system parses question hints, retrieves ASR chunks, attaches auxiliary images, samples candidate video frames, and routes questions into static visual, speech/text, temporal, and mixed types with specialized prompts. Multiple inference passes are aggregated by confidence-weighted voting and converted into the official Codabench format. In ablation, LoRA improves the score from 0.21 to 0.50, and more sampled frames further raise it to 0.58. Our final system ranks first in the CASTLE Challenge @ EgoVis 2026.
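
The aggregation step in this abstract, confidence-weighted voting over multiple inference passes, can be sketched in a few lines. The report does not give the exact rule, so this assumes the simplest variant: sum the confidences per answer choice and pick the largest total. The function name and the tie-breaking rule (smallest label wins) are assumptions.

```python
from collections import defaultdict

def confidence_weighted_vote(predictions):
    """Pick the answer with the largest summed confidence.

    `predictions` is an iterable of (choice, confidence) pairs, one per inference pass.
    Ties are broken in favor of the smallest choice label so the result is deterministic.
    """
    totals = defaultdict(float)
    for choice, confidence in predictions:
        totals[choice] += confidence
    if not totals:
        raise ValueError("no predictions to aggregate")
    return max(sorted(totals), key=totals.get)
```

For example, two moderately confident votes for one option can outweigh a single highly confident vote for another, which plain majority voting would also allow but without using the confidence values.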

2606.00709 2026-06-02 cs.RO

BEVIO: Efficient Bird's-Eye-View based Sparse-Update Visual-Inertial Odometry for Lunar Day-Night Navigation

BEVIO: 基于鸟瞰图的稀疏更新视觉-惯性里程计用于月球昼夜导航

Mohit Singh, Shehryar Khattak, Ashish Goel, Michael Paton, Kostas Alexis, Issa A. Nesnas

发表机构 * Jet Propulsion Laboratory, California Institute of Technology(喷气推进实验室,加州理工学院) Autonomous Robots Lab at the Norwegian University of Science and Technology(挪威科学技术大学自主机器人实验室)

AI总结 提出一种基于鸟瞰图的图像匹配方案,在极低视觉更新率下实现可靠的视觉-惯性里程计,适用于资源受限的月球车昼夜导航。

Comments
Accepted at the 2026 IEEE International Conference on Robotics and Automation, Vienna
AI中文摘要

视觉-惯性里程计(VIO)提供平滑、高频率的状态估计,已广泛应用于地面和行星应用的机器人导航。然而,其性能通常依赖于视觉更新的频率,这对于在极端资源约束和低帧率下运行的行星车来说是一个挑战。本文研究如何为月球车应用实现具有极稀疏视觉更新的可靠VIO,解决昼夜操作中自照明条件下特征关联特别困难的问题。我们提出了一种基于鸟瞰图(BEV)的图像匹配方案,该方案在较大的帧间运动和显著的视觉外观变化下仍能保持鲁棒性,实现更可靠的特征匹配。我们通过高保真照片级月球仿真和半比例月球车在加利福尼亚州普拉斯特城进行的长期昼夜部署实时机器人实验,广泛评估了我们提出的BEVIO方法。结果表明,我们的方法能够在低至0.25 Hz的视觉更新率下实现可靠的昼夜自照明穿越,突显了其在功耗和计算受限的月球车导航中的适用性。

英文摘要

Visual-Inertial Odometry (VIO) provides smooth, high-rate state estimates and has been widely used for robotic navigation in both terrestrial and planetary applications. However, its performance is typically dependent on the frequency of visual updates, which is a challenge for planetary rovers operating under extreme resource constraints and low frame rates. This work investigates enabling reliable VIO with very sparse visual updates for lunar rover applications, addressing both day and night-time operations where feature associations become especially difficult under self-illumination conditions. We propose a Bird's Eye View (BEV)-based image matching scheme that remains robust to larger inter-frame motions and enables more reliable feature matching despite significant visual appearance changes. We extensively evaluate our proposed approach, BEVIO, through high-fidelity photorealistic lunar simulation and real-time robotic experiments conducted with a half-scale lunar rover in a long-term day-night deployment at Plaster City, CA, USA. The results demonstrate that our method enables reliable day and nighttime self-illuminated traverses at visual update rates as low as 0.25 Hz, underscoring its suitability for navigation on power- and compute-limited lunar rovers.

2606.00704 2026-06-02 cs.CV

VICR: Visual In-Context Restoration for Real-World Image Super-Resolution

VICR: 面向真实图像超分辨率的视觉上下文恢复

Qichang Zhang, Hailong Wang, Baiang Li, Linhao Wang, Rong Fu, Erkang Cheng, Simon James Fong

发表机构 * Faculty of Science and Technology, University of Macau(澳门大学科技学院) Nullmax Hefei University of Technology(合肥工业大学) Shandong Normal University(山东师范大学)

AI总结 提出基于扩散变换器的视觉上下文恢复框架,通过解耦的视觉先验注入机制将真实图像超分辨率建模为图像补全,实现结构保真与细节合成的平衡。

Comments
28 pages, 11 figures, 9 tables
AI中文摘要

真实世界图像超分辨率(Real-ISR)需要在结构保真度(对退化观测)与逼真细节合成之间取得平衡。然而,现有的生成式Real-ISR方法通常依赖于纠缠的条件机制,导致结构漂移或语义不一致的细节。为了解决这个问题,我们提出了视觉上下文恢复(VICR),一种基于扩散变换器(DiT)的框架,将Real-ISR表述为图像补全。具体来说,我们引入了一种解耦的视觉先验注入机制,从低质量(LQ)图像中提取局部和全局线索:局部线索有助于恢复图像结构并支持高频细节合成,而全局线索指导整体生成并促进语义一致性。对于严重退化下的模糊区域,VICR采用推理时代理,利用LQ输入的视觉证据优化语义提示,同时保持模型参数固定。实验表明,VICR仅用127M可训练参数就在多个Real-ISR基准上实现了最先进的性能。

英文摘要

Real-world image super-resolution (Real-ISR) requires balancing structural fidelity to degraded observations with realistic detail synthesis. However, existing generative Real-ISR methods often rely on entangled conditioning mechanisms, leading to structural drift or semantically inconsistent details. To address this issue, we propose Visual In-Context Restoration (VICR), a Diffusion Transformer (DiT)-based framework that formulates Real-ISR as image completion. Specifically, we introduce a decoupled visual prior injection mechanism that derives local and global cues from the low-quality (LQ) image: local cues help recover image structures and support high-frequency detail synthesis, while global cues guide overall generation and promote semantic consistency. For ambiguous regions under severe degradation, VICR employs an inference-time agent to refine semantic prompts using visual evidence from the LQ input while keeping model parameters fixed. Experiments show that VICR achieves state-of-the-art performance across multiple Real-ISR benchmarks with only 127M trainable parameters.

2606.00702 2026-06-02 cs.RO cs.AI

Shape Your Body: Value Gradients for Multi-Embodiment Robot Design

塑造你的身体:用于多形态机器人设计的价值梯度

Nico Bohlinger, Jan Peters

发表机构 * Technical University of Darmstadt(达姆施塔特工业大学) Robotics Institute Germany (RIG)(德国机器人研究所) German Research Center for AI (DFKI)(德国人工智能研究中心) hessian.AI(黑森AI)

AI总结 提出将通用多形态价值函数转化为可复用模型,通过价值梯度优化机器人设计,无需为每个机器人重新进行强化学习协同设计。

AI中文摘要

我们提出将通用多形态价值函数转化为可复用的机器人设计模型。不是为每个机器人运行新的强化学习协同设计循环,而是首先在多种机器人设计上训练一个感知形态的策略和价值函数。训练后,冻结的价值函数被用作可微分的代理,通过价值梯度优化候选形态。我们在不同的机器人设计设置中评估了我们的方法,从受扰动的单个机器人到跨形态类别的保留机器人,使用在多达50个机器人和超过1100个连续形态参数的设计空间上训练的单个模型。除了优化完整形态,我们还展示了价值梯度可以识别限制性能的设计和控制参数,从而能够优化和分析新的机器人设计。

英文摘要

We propose to turn generalist multi-embodiment value functions into reusable models for robot design. Instead of running a new reinforcement learning co-design loop for each robot, we first train an embodiment-aware policy and value function across many robot designs. After training, the frozen value function is used as a differentiable surrogate to optimize candidate embodiments through value gradients. We evaluate our approach across different robot design settings, from perturbed single robots to held-out robots across morphology classes, with single models trained on up to 50 robots and design spaces of over 1100 continuous embodiment parameters. Beyond optimizing complete embodiments, we show that value gradients can identify performance-limiting design and control parameters, enabling both the optimization and the analysis of new robot designs.
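
The idea of optimizing a design by ascending the gradient of a frozen value function can be illustrated with a toy sketch. Everything here is an assumption for illustration: `toy_value` is a hypothetical stand-in for a trained value network, the gradient is taken by central finite differences rather than automatic differentiation, and the design parameters are simply clipped to a box.

```python
import numpy as np

def toy_value(design):
    """Hypothetical frozen value function that peaks at design = [0.6, 0.3]."""
    target = np.array([0.6, 0.3])
    return -np.sum((design - target) ** 2)

def value_gradient(value_fn, design, eps=1e-5):
    """Central finite-difference gradient of the value with respect to the design."""
    grad = np.zeros_like(design)
    for i in range(design.size):
        step = np.zeros_like(design)
        step[i] = eps
        grad[i] = (value_fn(design + step) - value_fn(design - step)) / (2 * eps)
    return grad

def optimize_design(value_fn, design, lr=0.1, steps=200, lower=0.0, upper=1.0):
    """Gradient ascent on the value function; the value function itself is never updated."""
    design = np.asarray(design, dtype=float).copy()
    for _ in range(steps):
        design = np.clip(design + lr * value_gradient(value_fn, design), lower, upper)
    return design
```

The same gradient vector also indicates which parameters the value is most sensitive to, which is the kind of analysis the abstract describes for identifying performance-limiting parameters.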

2606.00700 2026-06-02 cs.LG cs.AI

COPF: An Online Framework for Deployment-Stable Counterfactual Fairness in Evolving Graphs

COPF:演化图中部署稳定的反事实公平性在线框架

Sheng'en Li, Dongmian Zou

发表机构 * Shanghai Jiao Tong University(上海交通大学)

AI总结 针对演化图上的在线链接推荐,提出COPF框架,通过反事实暴露机会差距、显式探索和残差不可区分性审计,实现部署稳定的公平性监控与控制。

Comments
Accepted at ICML 2026
AI中文摘要

演化图上的在线链接推荐是表演性的:通过选择向用户展示哪些候选链接,系统会改变哪些链接形成以及后续观察到的反馈。因此,来自记录结果的公平性估计可能具有误导性,并且在推荐策略更新后部署时可能会漂移。我们引入了COPF(反事实在线表演性公平性),这是一个用于在线链接推荐中部署稳定的公平性监控和控制的决策层框架。COPF (i) 定义了暴露(展示 vs. 未展示)反事实上的群体级机会差距,(ii) 通过显式探索和记录每个候选被展示的概率(倾向性)使其可估计,以及(iii) 使用图感知双重稳健(GA-DR)估计器,在可配置的审计器族上通过残差结果不可区分性(OI)审计和控制公平性。我们提供了一个噪声传递定理,表明在时间混合和有界局部干扰下,估计的GA-DR残差上的残差OI意味着暴露反事实群体差距的界限,并实例化了一个在线多校准审计器以及一个原始-对偶控制器。在两个TGB流和一个受控的合成二分图流上的实验表明,COPF减少了暴露反事实群体差距的最坏情况峰值,同时对排序效用的影响较小。我们的代码可在 https://github.com/lsnnnnnnnn/COPF 获取。

英文摘要

Online link recommendation on evolving graphs is performative: by choosing which candidate links to show users, the system changes which links form and what feedback it later observes. Consequently, fairness estimates from logged outcomes can be misleading and may drift after deployment when the recommendation policy is updated. We introduce COPF (Counterfactual Online Performative Fairness), a decision-layer framework for deployment-stable fairness monitoring and control in online link recommendation. COPF (i) defines group-level opportunity gaps over exposure (shown vs. not shown) counterfactuals, (ii) makes them estimable by explicit exploration and by logging the probability (propensity) that each candidate is shown, and (iii) audits and controls fairness using residual outcome indistinguishability (OI) over a configurable auditor family with graph-aware doubly robust (GA-DR) estimators. We provide a noisy transfer theorem showing that Residual-OI on estimated GA-DR residuals implies bounds on exposure-counterfactual group gaps under temporal mixing and bounded local interference, and we instantiate an online multicalibration auditor together with a primal-dual controller. Experiments on two TGB streams and a controlled synthetic bipartite stream show that COPF reduces worst-case spikes in exposure-counterfactual group disparities with modest impact on ranking utility. Our code is available at https://github.com/lsnnnnnnnn/COPF.
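
The role of logged propensities in estimating exposure counterfactuals can be illustrated with a standard doubly robust estimator. This sketch is the textbook form, not the graph-aware GA-DR estimator of the paper: it ignores interference between candidates, assumes propensities strictly between 0 and 1, and takes the outcome-model predictions `mu_shown` and `mu_hidden` as given. All names are hypothetical.

```python
import numpy as np

def doubly_robust_effect(shown, propensity, outcome, mu_shown, mu_hidden):
    """Doubly robust estimate of E[Y(shown)] - E[Y(not shown)].

    shown:      1 if the candidate link was shown, else 0
    propensity: logged probability that the candidate was shown (must be in (0, 1))
    outcome:    observed outcome (for example, whether the link formed)
    mu_shown, mu_hidden: outcome-model predictions under each exposure
    """
    shown = np.asarray(shown, dtype=float)
    propensity = np.asarray(propensity, dtype=float)
    outcome = np.asarray(outcome, dtype=float)
    mu_shown = np.asarray(mu_shown, dtype=float)
    mu_hidden = np.asarray(mu_hidden, dtype=float)
    # Model prediction plus an inverse-propensity-weighted residual correction.
    dr_shown = mu_shown + shown / propensity * (outcome - mu_shown)
    dr_hidden = mu_hidden + (1.0 - shown) / (1.0 - propensity) * (outcome - mu_hidden)
    return float(np.mean(dr_shown - dr_hidden))

def group_gap(group, shown, propensity, outcome, mu_shown, mu_hidden):
    """Absolute difference in estimated exposure effect between groups 0 and 1."""
    group = np.asarray(group)
    args = [np.asarray(a, dtype=float) for a in (shown, propensity, outcome, mu_shown, mu_hidden)]
    effects = [doubly_robust_effect(*(a[group == g] for a in args)) for g in (0, 1)]
    return abs(effects[1] - effects[0])
```

The estimate stays unbiased if either the propensities or the outcome models are correct, which is why explicit exploration with logged propensities makes the gap estimable even when the outcome model is imperfect.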

2606.00694 2026-06-02 cs.CV

FROST-STA: Frozen Dense Features for the Ego4D Short-Term Object Interaction Anticipation

FROST-STA: 用于Ego4D短期物体交互预测的冻结密集特征

Chaoyang Wang, Lexuan Xu

发表机构 * Beihang University(北京航空航天大学)

AI总结 提出FROST-STA模型,利用冻结的密集图像-视频特征和对象中心解码,在Ego4D短期物体交互预测挑战中取得第二名。

AI中文摘要

第一人称视频中的短期预测需要超越对当前场景的识别:系统必须推断摄像头佩戴者将接触哪个物体、将执行什么动作以及接触将在多久后发生。本报告描述了FROST-STA,我们提交至EgoVis 2026 Ego4D短期物体交互预测(STA)挑战的方案。对于每个查询时间,模型输出一组排序的结构化假设,包含主动物体框、名词标签、动词标签、接触时间(TTC)和置信度。FROST-STA基于V-JEPA 2.1 STA评估协议,但通过使用对象中心解码、多头预测以及面向提交的训练和集成方案,使其适应挑战。我们固定V-JEPA 2.1 ViT-G骨干网络,提取两个密集token流:来自查询前缩放至384像素的短视频片段的视频token,以及来自最后观察到的最高分辨率帧的图像token。一个紧凑的对齐模块,由注意力探针和帧引导的时间池化组成,将片段表示映射到最后一帧的空间参考上,然后与图像特征融合。融合后的特征图由Faster R-CNN风格的STA头解码,估计框偏移、名词、动词、TTC值和交互质量。对于最终排行榜提交,我们使用官方训练集加上额外允许的验证标注训练25个epoch,并组合来自8个头和epoch 15-25的检查点的预测。FROST-STA在官方测试服务器上获得5.13总体Top-5 mAP,在挑战中排名第二,表明冻结的密集图像-视频特征可以作为物体级交互预测的坚实基础。

英文摘要

Short-term anticipation in egocentric video requires more than recognizing the current scene: a system must infer which object the camera wearer will contact, which action will follow, and how soon the contact will happen. This report describes FROST-STA, our submission to the Ego4D Short-Term Object Interaction Anticipation (STA) Challenge at EgoVis 2026. For each query time, the model produces a ranked set of structured hypotheses containing an active-object box, noun label, verb label, time-to-contact (TTC), and confidence. FROST-STA builds on the V-JEPA 2.1 STA evaluation protocol, but adapts it to the challenge by using object-centric decoding, multi-head prediction, and a submission-oriented training and ensembling recipe. We keep the V-JEPA 2.1 ViT-G backbone fixed and extract two dense token streams: video tokens from a short clip resized to 384 pixels before the query, and image tokens from the last observed high-resolution frame. A compact alignment module, consisting of an attentive probe and frame-guided temporal pooling, maps the clip representation onto the spatial reference of the final frame before fusing it with image features. The fused maps are decoded by Faster R-CNN-style STA heads that estimate box offsets, nouns, verbs, TTC values, and interaction quality. For the final leaderboard entry, we train for 25 epochs with the official training split plus additional permitted validation annotations, and combine predictions across eight heads and checkpoints from epochs 15-25. FROST-STA obtains 5.13 Overall Top-5 mAP on the official test server, ranking second in the challenge and showing that frozen dense image-video features can serve as a strong basis for object-level interaction forecasting.

2606.00690 2026-06-02 cs.LG

DistMatch: Adaptive Binning via Distribution Matching for Robust Sequential Conformal Prediction

DistMatch: 通过分布匹配的自适应分箱用于鲁棒序列共形预测

Enver Menadjiev, Jihyeon Seong, Jisu Yeo, Jaesik Choi

发表机构 * University of California, Berkeley(加州大学伯克利分校) University of Tokyo(东京大学)

AI总结 提出DistMatch方法,利用Kolmogorov-Smirnov统计量递归划分残差以实现近似可交换性,结合在线分位数回归进行局部自适应推理,提升序列共形预测对分布偏移的鲁棒性。

Comments
ICML 2026 (34 pages, 12 figures, 16 tables)
AI中文摘要

序列共形预测在残差可交换性假设下提供有效的不确定性量化。然而,由于时间依赖性和分布偏移,该假设在现实时间序列中常被违反。尽管近期方法尝试通过重新加权来近似可交换性,但确定最优权重仍是一个开放挑战。为解决此局限,我们提出DistMatch,一种基于分箱的方法,利用Kolmogorov-Smirnov统计量在二叉树中递归划分残差。我们从理论上证明,这种划分诱导出近似可交换的叶子节点,从而避免重新加权的需要。通过在每个叶子节点内应用在线更新的分位数回归,DistMatch实现了局部自适应推理,并提高了对分布偏移的鲁棒性。大量实验表明,DistMatch优于现有序列共形预测方法。

英文摘要

Sequential conformal prediction (CP) provides valid uncertainty quantification under the assumption of residual exchangeability. However, this assumption is often violated in real-world time series due to temporal dependencies and distributional shifts. While recent methods attempt to approximate exchangeability through reweighting, identifying optimal weights remains an open challenge. To address this limitation, we propose DistMatch, a binning-based method that recursively partitions residuals within a binary tree using the Kolmogorov-Smirnov (KS) statistic. We theoretically show that this partitioning induces approximately exchangeable leaves, thereby avoiding the need for reweighting. By applying quantile regression with online updates within each leaf, DistMatch enables locally adaptive inference and improves robustness to distributional shifts. Extensive experiments demonstrate that DistMatch outperforms existing sequential CP methods.
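
The recursive KS-based partitioning can be sketched as follows. The abstract does not say which variable the tree splits on or how it stops, so this sketch makes explicit assumptions: residuals are split along their time index at the point that maximizes the two-sample Kolmogorov-Smirnov statistic, and recursion stops when a segment is too short or the best statistic falls below a hypothetical `threshold`. The per-leaf online quantile regression is not shown.

```python
import numpy as np

def ks_statistic(a, b):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return float(np.max(np.abs(cdf_a - cdf_b)))

def partition(residuals, min_leaf=20, threshold=0.3):
    """Recursively split a residual sequence into distributionally similar segments.

    Assumption: splits are along the time index, and a split is accepted only if
    the KS statistic between the two sides is at least `threshold`.
    """
    residuals = np.asarray(residuals, dtype=float)
    n = residuals.size
    if n < 2 * min_leaf:
        return [residuals]
    candidates = range(min_leaf, n - min_leaf + 1)
    best = max(candidates, key=lambda i: ks_statistic(residuals[:i], residuals[i:]))
    if ks_statistic(residuals[:best], residuals[best:]) < threshold:
        return [residuals]
    left = partition(residuals[:best], min_leaf, threshold)
    right = partition(residuals[best:], min_leaf, threshold)
    return left + right
```

On a sequence whose distribution shifts abruptly halfway through, the first split lands at the change point and each half is kept as one leaf, since the residuals within a half look alike.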

2606.00689 2026-06-02 cs.CV

Wavelet-Fusion Diffusion Model for Multimodal Brain MRI Synthesis with Modality and Metadata Conditioning

小波融合扩散模型用于多模态脑MRI合成,具有模态和元数据条件

Muhammad Nabi Yasinzai, Remika Mito, Mangor Pedersen

发表机构 * Department of Psychology & Neuroscience, Auckland University of Technology(心理学与神经科学系,奥克兰技术大学) Department of Psychiatry, University of Melbourne(精神病学系,墨尔本大学)

AI总结 提出一种小波融合扩散模型(WFDM),结合小波融合变分自编码器(WF-VAE)和条件3D U-Net扩散模型,通过显式模态和元数据条件实现多模态脑MRI合成,解决了数据集模态覆盖不均和异质性问题,在分布对齐上优于现有方法。

Comments
51 pages, 7 figures, including supplementary material. Submitted to Imaging Neuroscience
AI中文摘要

多模态MRI为神经影像分析提供互补信息,不同成像模态捕获独特的解剖、组织和病理特征,支持下游AI应用的开发和评估。尽管大规模结构MRI资源日益可用,但公共和汇集神经影像数据集的模态覆盖往往不均匀。这种不均匀的模态覆盖因站点、扫描仪和采集协议之间的异质性,以及跨研究通常稀疏、不一致记录或不可用的人口统计学和临床变量而进一步复杂化。合成MRI生成可以通过合成目标模态体积用于数据集增强和受控合成队列创建,帮助解决这种不平衡。然而,许多现有的MRI合成方法在狭窄的模态集或相对同质的队列上训练,限制了它们对大型汇集神经影像资源的适用性,其中模态可用性、采集协议和元数据覆盖在不同数据集之间差异很大。扩散模型因其强大的样本保真度和多样性而成为MRI合成的一种有吸引力的方法,但直接在3D体素空间采样在推理时计算昂贵且缓慢。潜在扩散通过在学习的3D潜在空间中合成MRI提高了实用性,尽管生成质量取决于自编码器的重建保真度和由此产生的潜在分布。我们的方法将小波融合变分自编码器(WF-VAE)潜在压缩器与在学习的潜在空间中训练的、使用显式模态和元数据条件的条件3D U-Net扩散模型相结合。我们提出的Wavelet-Fusion Diffusion Model (WFDM) 在评估的合成MRI生成器中实现了最强的分布对齐。

英文摘要

Multimodal MRI provides complementary information for neuroimaging analysis, where different imaging modalities capture distinct anatomical, tissue, and pathological features that support the development and evaluation of downstream AI applications. Although large-scale structural MRI resources are increasingly available, their modality coverage is often uneven across public and pooled neuroimaging datasets. This uneven modality coverage is further complicated by heterogeneity across sites, scanners, and acquisition protocols, as well as demographic and clinical variables that are often sparse, inconsistently recorded, or unavailable across studies. Synthetic MRI generation can help address this imbalance by synthesizing target-modality volumes for dataset augmentation and controlled synthetic cohort creation. However, many existing MRI synthesis approaches are trained on narrow modality sets or relatively homogeneous cohorts, limiting their applicability to large pooled neuroimaging resources where modality availability, acquisition protocols, and metadata coverage vary substantially across datasets. Diffusion models have become an attractive approach for MRI synthesis because of their strong sample fidelity and diversity, but sampling directly in 3D voxel space is computationally expensive and slow at inference. Latent diffusion improves practicality by synthesizing MRI in a learned, 3D latent space, although generation quality depends on the autoencoder's reconstruction fidelity and the resulting latent distribution. Our approach combines a Wavelet-Fusion variational autoencoder (WF-VAE) latent compressor with a conditional 3D U-Net diffusion model trained in the learned latent space using explicit modality and metadata conditioning. Our proposed Wavelet-Fusion Diffusion Model (WFDM) achieved the strongest distributional alignment among the evaluated synthetic MRI generators.

2606.00688 2026-06-02 cs.CV

Shape-Prior-Based Point Cloud Completion for Single-Stage Fully Sparse 3D Object Detection

基于形状先验的点云补全用于单阶段全稀疏3D目标检测

Kaizheng Wang, Mingqian Ji, Jian Yang, Shanshan Zhang

Affiliation: School of Computer Science and Engineering, Nanjing University of Science and Technology(南京理工大学计算机科学与工程学院)

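AI summary: To address the sparsity and incompleteness of point clouds in single-stage fully sparse 3D detectors, the paper proposes a shape-prior-based point cloud completion method whose Instance Selection and Alignment-Based Point Completion modules significantly improve detection performance.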

AI Chinese abstract

单阶段全稀疏3D目标检测器依赖点云数据在自动驾驶场景中检测目标。然而,点云的稀疏性和不完整性严重限制了3D目标检测的性能。为解决此问题,本文提出一种专门针对单阶段全稀疏检测器的点云补全方法。整个基于形状先验的补全过程由两个连续步骤组成。第一步,我们设计了一个新颖的实例选择模块,即使在基线模型未生成提议的情况下,也能识别对应前景目标的点云,同时有效忽略背景区域的点云。第二步,我们引入了一种新颖的基于对齐的点补全模块,该模块将前景目标的点云在中心和朝向上与原型对齐。随后,从原型中选择点来填充前景目标的缺失部分。我们在KITTI数据集上使用两个单阶段全稀疏检测器评估了我们的方法。实验结果表明,所提方法显著提升了检测性能,证实了其有效性和泛化能力。

English abstract

Single-stage fully sparse 3D object detectors rely on point cloud data to detect objects in autonomous driving scenarios. However, the sparsity and incompleteness of point clouds significantly limit the performance of 3D object detection. To address this issue, this paper proposes a point cloud completion method specifically designed for single-stage fully sparse detectors. The shape-prior-based completion process consists of two consecutive steps. In the first step, we design a novel Instance Selection module that identifies the points belonging to foreground objects even when the baseline model does not generate proposals, while effectively ignoring points in background regions. In the second step, we introduce a novel Alignment-Based Point Completion module, which aligns the point cloud of each foreground object with a prototype in terms of both center and orientation. Points are then selected from the prototype to fill in the missing parts of the foreground object. We evaluated our method on two single-stage fully sparse detectors using the KITTI dataset. The experimental results demonstrate that the proposed method significantly improves detection performance, confirming its effectiveness and generalizability.
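
The second step described above (align to a prototype by center and orientation, then borrow prototype points for the missing parts) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes the object's center and yaw are already known, that the prototype is stored in a canonical frame (centered at the origin, yaw 0), and that "missing" means farther than a hypothetical threshold `min_dist` from every observed point. The function names and the threshold are illustrative choices, not taken from the paper.

```python
import numpy as np


def yaw_rotation(theta):
    """3x3 rotation matrix about the z-axis by angle theta (radians)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0],
                     [s,  c, 0.0],
                     [0.0, 0.0, 1.0]])


def complete_with_prototype(obj_pts, obj_center, obj_yaw, proto_pts, min_dist=0.2):
    """Fill gaps in an observed object point cloud using a shape prototype.

    obj_pts:    (N, 3) observed points of one foreground object
    obj_center: (3,) assumed-known object center
    obj_yaw:    assumed-known heading angle in radians
    proto_pts:  (M, 3) prototype points in a canonical frame
    min_dist:   hypothetical threshold for deciding a region is unobserved
    """
    # Align the prototype with the object: rotate by yaw, then shift to the center.
    proto_aligned = proto_pts @ yaw_rotation(obj_yaw).T + obj_center

    # Distance from every aligned prototype point to its nearest observed point.
    diffs = proto_aligned[:, None, :] - obj_pts[None, :, :]
    nearest = np.linalg.norm(diffs, axis=2).min(axis=1)

    # Keep only prototype points that land where nothing was observed.
    missing = nearest > min_dist
    return np.vstack([obj_pts, proto_aligned[missing]])
```

The brute-force distance matrix is fine for a single object with a few hundred points. A k-d tree would be the usual replacement at larger scale.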