arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.01019 2026-06-02 cs.CL cs.AI

Hybrid Verified Decoding: Learning to Allocate Verification in Speculative Decoding

Xin Su, Dawid Majchrowski, Fangyuan Yu, Vanshil Atul Shah, Sebastian Rogawski, Pawel Morkisz, Anahita Bhiwandiwalla, Phillip Howard

Institutions: Thoughtworks, NVIDIA

AI summary: Proposes Hybrid Verified Decoding, which predicts the accepted length of a cache draft and dynamically chooses between cache verification and a model-based drafter, achieving a 2.73x average speedup on agentic workflows.

Abstract

Large Language Model (LLM) generation remains expensive because autoregressive decoding calls the model once for each new token. Speculative decoding reduces this cost by drafting multiple tokens and verifying them with the target model in one step, but its speedup depends on how many drafted tokens are accepted. Parameter-free draft sources can propose long continuations at low cost in structured and agentic workloads, yet a cache match that looks promising at one generation step may have low payoff at the next. We propose Hybrid Verified Decoding, which predicts the accepted length of a cache draft before verification and uses this payoff estimate to choose between cache verification and a model-based drafter. Across three LLMs and sixteen datasets, Hybrid Verified Decoding is especially effective on agentic workflows, where it outperforms EAGLE3 in every setting with a 2.73x average speedup. Our analysis shows how prompt structure creates cache opportunities, how high-payoff cache drafts concentrate in a small part of the draft space, and how payoff-guided selection reduces sequential decoding work, pointing to runtime draft selection as a promising direction for speculative decoding.
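
The payoff-guided choice described in this abstract can be illustrated with a minimal sketch. It assumes, hypothetically, that a predictor has already produced an expected accepted length for the cache draft and that a running estimate exists for the model-based drafter. The `DraftCosts` class, its cost constants, and the tokens-per-cost rule are illustrative assumptions, not the authors' method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DraftCosts:
    """Hypothetical per-step costs in arbitrary time units (assumed values)."""
    verify: float = 1.0       # one verification pass of the target model
    model_draft: float = 0.4  # one run of the model-based drafter


def tokens_per_cost(accepted_len: float, cost: float) -> float:
    # The +1 counts the token the target model emits during verification.
    return (accepted_len + 1.0) / cost


def choose_draft_source(pred_cache_accept: float,
                        model_accept_estimate: float,
                        costs: DraftCosts = DraftCosts()) -> str:
    """Pick the draft source with the higher estimated tokens per unit cost."""
    # A cache draft is assumed free to propose, so only verification is paid.
    cache_rate = tokens_per_cost(pred_cache_accept, costs.verify)
    # The model drafter pays its own drafting cost on top of verification.
    model_rate = tokens_per_cost(model_accept_estimate,
                                 costs.verify + costs.model_draft)
    return "cache" if cache_rate >= model_rate else "model"
```

With these assumed costs, a cache draft predicted to yield six accepted tokens is preferred over a drafter expected to yield three, while a cache draft predicted to yield less than one accepted token is not.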

2606.01016 2026-06-02 cs.CL cs.AI eess.AS

PolySpeech-100: A Large-Scale Benchmark for Speech Understanding Across 100+ Languages and Dialects

Sicheng Yang, Shulan Ruan, Shiwei Wu, Yu Liu, Lu Fan, Zhi Li, You He

Institutions: Shenzhen International Graduate School, Tsinghua University; Department of Electronic Engineering, Tsinghua University; JD AI Research

AI summary: Introduces the PolySpeech-100 benchmark to address the bias toward high-resource languages, the lack of semantic reasoning, and the neglect of dialects in existing speech evaluation benchmarks. It covers 110 linguistic variants through a hybrid construction pipeline and evaluates 22 models, finding that open-source end-to-end models outperform cascade systems on heavy dialects, while Chain-of-Thought prompting degrades performance under zero-shot settings.

Comments
19 pages, 13 figures, KDD 2026

Abstract

While End-to-End (E2E) Speech-Large Language Models (Speech-LLMs) are rapidly evolving, their evaluation methodologies remain limited to the era of simple transcription. Existing benchmarks suffer from three critical limitations: a pronounced bias towards high-resource languages, a focus on low-level recognition (ASR) rather than semantic reasoning, and a neglect of regional dialects. To bridge this gap, we introduce PolySpeech-100, a massive-scale benchmark designed to assess "native-level" speech comprehension across 110 linguistic variants. We employ a novel hybrid construction pipeline that augments gold-standard human recordings with instruction-driven synthetic speech, allowing us to cover 19 distinct Chinese dialects and over 80 low-resource languages. Extensive evaluation of 22 state-of-the-art models (including Gemini-3, GPT-Audio, and Qwen2.5-Omni) yields pivotal insights. First, we demonstrate that open-source E2E models outperform Cascade (ASR+LLM) systems on heavy dialects, proving that direct audio processing preserves critical paralinguistic cues and prosodic features (e.g., intonation, stress) that are often lost in standard transcription. Second, we reveal a significant performance gap: while commercial models maintain robustness, open-source models suffer catastrophic degradation on low-resource languages. Finally, counter-intuitively, we observe that under standard zero-shot settings, Chain-of-Thought prompting frequently degrades speech understanding performance for most evaluated models, revealing a potential modality alignment gap in current architectures. PolySpeech-100 establishes a rigorous standard for the next generation of inclusive, omni-capable Speech-LLMs. The data, demo, and code are publicly available at https://github.com/YoungSeng/PolySpeech-100.

2606.01015 2026-06-02 cs.RO cs.AI cs.NI cs.SY eess.SY

AI-IoT-Robotics Integration: Survey of Frameworks, Emerging Trends, and the Path Toward Connected Robotics

Ranulfo Bezerra, Satoshi Tadokoro, Kazunori Ohno

Institutions: Tohoku University

AI summary: Surveys the convergence of AI, IoT, and robotics, proposes a modular system architecture, and highlights the role of Small Language Models (SLMs) and Large Language Models (LLMs) in distributed cognition and autonomous decision-making, offering a conceptual and technical roadmap for next-generation Connected Robotics and Physical AI ecosystems.

Journal ref
IEEE Internet of Things Journal, vol. 13, no. 10, pp. 20398-20412, 15 May 2026
Comments
15 pages, 3 figures, 3 tables. Published in IEEE Internet of Things Journal

Abstract

The convergence of Artificial Intelligence, the Internet of Things, and Robotics is no longer a futuristic vision; it is rapidly becoming the foundation of real-time, intelligent, and context-aware systems. AI enables perception and reasoning, IoT provides scalable sensing and communication, and robotics delivers embodied actuation. Despite significant progress in pairwise combinations such as AIoT and the Internet of Robotic Things (IoRT), there remains a lack of unified design frameworks that fully integrate all three. This survey synthesizes the state-of-the-art across these domains, emphasizing the emerging role of Small Language Models (SLMs) at the edge and Large Language Models (LLMs) in the cloud for distributed cognition and autonomous decision-making. We propose a modular system architecture that aligns with these trends, analyze persistent gaps in interoperability and feedback control, and classify existing work by integration depth. Our review highlights how hybrid SLM-LLM systems, when coupled with IoT infrastructure and robotic agents, can address challenges in real-time adaptation, scalability, and reliability. This work offers a conceptual and technical roadmap for designing next-generation AI-IoT-Robotic ecosystems that are modular, interpretable, and capable of learning within dynamic environments, paving the way for the emerging paradigm of Connected Robotics and Physical AI.

2606.01014 2026-06-02 cs.CV cs.AI

Cross-Axis Feature Fusion with Joint-Wise Motion Difference Prediction for Text-Based 3D Human Motion Editing

Gyojin Han, Junmo Kim

Institutions: School of Electrical Engineering, KAIST

AI summary: Proposes a cross-axis feature fusion architecture and an auxiliary task in which a joint-anchored transformer predicts joint-wise motion differences, enabling text-driven 3D human motion editing and achieving state-of-the-art performance on the MotionFix dataset.

Comments
CVPR 2026

Abstract

We address text-based 3D human motion editing, where the goal is to preserve the style and structure of a source motion while applying edits described in natural language. The release of the MotionFix dataset has spurred active research into training-based diffusion models that directly generate an edited motion from a source motion and a text instruction. While previous works have focused primarily on learning when an edit should occur temporally, our goal is to create a model that understands not only this temporal aspect but also which specific joints are responsible for the change. Targeting this, we propose a novel architecture and a complementary auxiliary task to aid its training. Our architecture consists of two axis-anchored transformers, which extract distinct features along the joint and time dimensions respectively, and a cross-axis fusion block that integrates these representations. We further introduce an auxiliary task that trains the joint-anchored transformer to regress the Soft-DTW distance between source and target joint rotations. This objective teaches the module to understand which joints to modify and which to preserve. Through comprehensive experiments on the MotionFix dataset, we demonstrate that our method significantly improves semantic alignment with both the text instruction and the source motion, as well as the overall fidelity of the generated motion, achieving state-of-the-art results.
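
The Soft-DTW distance that the auxiliary task regresses can be sketched as follows. This is the standard soft-minimum dynamic program, written here for two scalar sequences with a squared-difference cost. The scalar signals, the cost function, and the `gamma` value are simplifying assumptions for illustration; the paper's joint rotations would be multi-dimensional, and its exact formulation is not given in the abstract.

```python
import math


def softmin(values, gamma):
    """Smooth minimum: -gamma * log(sum(exp(-v / gamma))), computed stably."""
    m = min(values)
    return m - gamma * math.log(sum(math.exp(-(v - m) / gamma) for v in values))


def soft_dtw(x, y, gamma=1.0):
    """Soft-DTW value between two scalar sequences (squared-difference cost)."""
    n, m = len(x), len(y)
    inf = float("inf")
    # r[i][j] holds the soft alignment cost of x[:i] against y[:j].
    r = [[inf] * (m + 1) for _ in range(n + 1)]
    r[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            r[i][j] = cost + softmin(
                [r[i - 1][j], r[i][j - 1], r[i - 1][j - 1]], gamma)
    return r[n][m]
```

A joint whose source and target trajectories nearly coincide gets a value near zero (slightly negative for positive `gamma`), whereas a joint that the edit changes gets a large value, which is the kind of per-joint signal the abstract describes.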

2606.01012 2026-06-02 cs.AI cond-mat.mtrl-sci

Property Prediction of Stacked Bilayer Materials: A Multimodal Learning Approach

An Vuong, Minh-Hao Van, Chen Zhao, Xintao Wu

Institutions: University of Arkansas; Baylor University

AI summary: Proposes a multimodal learning approach that jointly models the interfaces between different material layers to predict the properties arising from vertical stacking under given configurations; experiments demonstrate its effectiveness and efficiency.

Comments
Accepted to the 35th International Joint Conference on Artificial Intelligence (IJCAI 2026)

Abstract

AI for materials science is a critical topic within AI for science, aiming to accelerate materials discovery and produce accurate property predictions. Bilayer 2D material stacking is essential for exploring new materials with novel functions and inherent phenomena, enabling the creation of new 2D bilayers for diverse real-world applications. Research on bilayer van der Waals (vdW) materials has made significant progress from experimental and computational perspectives. Various bilayer materials have been successfully synthesized experimentally, and the increasing use of high-throughput computing has produced several computational two-dimensional materials databases. However, the use of AI to model bilayer stacking and predict new properties remains underexplored and calls for further research. In this work, we propose a novel multimodal learning approach to study the interfaces between dissimilar materials that jointly enable new or multiple functions, and to predict new properties arising from the vertical integration (stacking) of different functional material layers under given configurations. Comprehensive experiments demonstrate the effectiveness and efficiency of our approach compared to baseline methods. Our code is available at https://github.com/AnVuong123/bimat_ml.

2606.01009 2026-06-02 cs.SD

MelT: GEMM-Native NDFT for Efficient Single-Stage Audio Frontends on Modern Accelerators

Augusto Camargo, Marcelo Finger

Institutions: Instituto de Ciências Matemáticas e de Computação, University of São Paulo, Brazil

AI summary: Proposes the MelT framework, which formulates the Mel-spaced Non-Uniform Discrete Fourier Transform (NDFT) as dense General Matrix Multiplication (GEMM) operations to obtain a single-stage audio frontend that replaces the conventional STFT+Mel pipeline, achieving up to a 3.75x inference speedup and a 3.52x reduction in energy consumption across a range of accelerators.

Abstract

Modern audio processing networks are commonly deployed on accelerators whose peak throughput is obtained through dense linear algebra, whereas conventional acoustic frontends -- a Short-Time Fourier Transform (STFT) followed by sparse Mel aggregation -- remain structurally heterogeneous. This mismatch can introduce memory-bandwidth, dispatch, and intermediate-allocation overheads on contemporary accelerator backends. This work introduces MelT, a single-stage frontend framework in which Mel-spaced Non-Uniform Discrete Fourier Transform (NDFT) bases are precomputed and applied to time-domain acoustic frames through dense General Matrix Multiplication (GEMM) operations. The contribution is not the NDFT operator itself; rather, it is the formulation of Mel-spaced NDFT projection as a GEMM-native audio frontend and its evaluation as a hardware-efficient alternative to conventional STFT+Mel pipelines. Evaluated across platforms ranging from Apple A18 Pro edge hardware to NVIDIA H100 datacenter acceleration, MelT attains up to a $3.75\times$ speedup in inference latency and a $3.52\times$ reduction in energy consumption while maintaining downstream classification accuracy.
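
The idea of applying precomputed Mel-spaced NDFT bases through dense matrix products can be sketched in a few lines. This is an illustration only: the HTK-style mel formula, the absence of windowing, the power output, and all function names are assumptions, and the plain-Python `matmul` stands in for the accelerator GEMM the abstract refers to.

```python
import math


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)  # HTK-style mel scale (assumed)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_spaced_frequencies(n_bins, f_min, f_max):
    """Frequencies in Hz that are equally spaced on the mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_bins - 1)
    return [mel_to_hz(lo + k * step) for k in range(n_bins)]


def ndft_basis(freqs_hz, frame_len, sample_rate):
    """Precompute real and imaginary NDFT bases, each of shape (bins, frame_len)."""
    cos_b, sin_b = [], []
    for f in freqs_hz:
        w = 2.0 * math.pi * f / sample_rate
        cos_b.append([math.cos(w * n) for n in range(frame_len)])
        sin_b.append([-math.sin(w * n) for n in range(frame_len)])
    return cos_b, sin_b


def matmul(a, b):
    """Dense product of a (m x k) and b (k x n), standing in for a GEMM call."""
    cols = list(zip(*b))
    return [[sum(p * q for p, q in zip(row, col)) for col in cols] for row in a]


def mel_ndft_power(cos_b, sin_b, frames):
    """frames has shape (frame_len, n_frames); returns power of shape (bins, n_frames)."""
    re, im = matmul(cos_b, frames), matmul(sin_b, frames)
    return [[r * r + i * i for r, i in zip(rr, ii)] for rr, ii in zip(re, im)]
```

Because the bases depend only on the frame length, sample rate, and bin count, they are computed once, and each batch of frames then costs two dense products with no intermediate STFT or sparse Mel aggregation step.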

2606.01007 2026-06-02 cs.LG cs.AI

Beyond Task-Agnostic: Task-Aware Grouping for Communication-Efficient Multi-Task MoE Inference

Zhiyao Xu, Aoxue Liu, Zhanjie Ding, Dan Zhao, Yong Jiang, Qing Li

Institutions: Tsinghua Shenzhen International Graduate School; Pengcheng Laboratory

AI summary: Proposes the Task-Aware Coactivation Grouping (TACG) framework, which optimizes expert placement using task-specific co-activation patterns, and introduces Generic Expert Shared Replication (GESR) to handle online workload skew. Across three MoE models it reduces communication cost by 31.39% on average while maintaining a fairness index of 0.9975.

Abstract

Sparsely activated Mixture-of-Experts (MoE) models scale capacity via conditional computation, but distributed inference suffers from cross-GPU expert communication and routing-induced load imbalance. Existing placement methods reduce this cost by co-locating frequently co-activated experts; however, they derive a single deployment plan from globally aggregated routing traces, thereby averaging away the heterogeneous, task-specific co-activation patterns that actually drive communication in multi-task serving. We observe that expert co-activation is strongly task-conditioned: pairs tightly coupled in one task family are often uncorrelated in another, so effective deployment should group experts by task-aware co-activation rather than by a task-agnostic average. Based on this insight, we propose Task-Aware Coactivation Grouping (TACG), a deployment-time framework that uses family-specific dispatch and co-activation traces to derive per-expert task-family preferences, reweights the co-activation graph so that intra-family locality dominates grouping, and assigns each expert to a primary GPU under exact capacity constraints. To keep the static placement robust under online workload skew, we further introduce Generic Expert Shared Replication (GESR), a lightweight companion that identifies generic experts with consistently central co-activation profiles, replicates them across a small set of secondary GPUs, and applies locality- and load-aware selection at serving time. Experiments on three representative open-source MoE models demonstrate that our framework reduces the average communication cost by 31.39% over the baseline, while preserving an average Jain fairness index of 0.9975. This advantage persists even under severe distribution shifts in the inference data, where the framework continues to outperform strong baselines.
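
The two quantities reported in this abstract can be made concrete with a small sketch. The Jain fairness index is the standard formula (sum of loads squared, divided by n times the sum of squared loads). The `cross_gpu_weight` function is a hypothetical proxy for communication cost, assumed here to be the total co-activation weight of expert pairs placed on different GPUs; the paper's actual cost model is not specified in the abstract.

```python
def jain_fairness(loads):
    """Jain fairness index: 1.0 for perfectly equal loads, 1/n in the worst case."""
    n = len(loads)
    sum_sq = sum(x * x for x in loads)
    if n == 0 or sum_sq == 0:
        raise ValueError("loads must be non-empty and not all zero")
    total = sum(loads)
    return (total * total) / (n * sum_sq)


def cross_gpu_weight(coactivation, placement):
    """Sum co-activation weights over expert pairs split across GPUs.

    coactivation: dict mapping (expert_a, expert_b) to a co-activation count.
    placement: dict mapping expert id to GPU id.
    """
    return sum(weight for (a, b), weight in coactivation.items()
               if placement[a] != placement[b])
```

For example, if experts 0 and 1 are co-activated often and experts 2 and 3 are co-activated often, placing each pair on the same GPU leaves only the rare cross-pair traffic, while splitting the pairs pays for all of the frequent traffic.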

2606.01006 2026-06-02 cs.CV

Automated Erythrocyte Detection and Tracking for Retinal Blood Flow Quantification in Erythrocyte-Mediated Angiography

Chiao-Yi Wang, Havish S Gadde, Yi-Ting Shen, Saige M. Oechsli, Osamah Saeedi, Yang Tao

Institutions: Department of Bioengineering, University of Maryland, College Park, MD 20742, USA; Department of Ophthalmology and Visual Sciences, University of Maryland School of Medicine, Baltimore, MD 21201, USA; Department of Electrical and Computer Engineering, University of Maryland, College Park, MD 20742, USA

AI summary: Proposes the EMTrack framework, which uses a flow-context module and a topology-aware tracking strategy to detect and track erythrocytes automatically for retinal blood flow quantification, and outperforms baseline methods on the new RBF-EMA dataset.

Abstract

Capillary-level retinal blood flow (RBF) has strong potential as a biomarker for various ocular diseases. However, modalities for measuring capillary-level RBF remain limited. Erythrocyte-mediated angiography (EMA), an emerging imaging technique, enables capillary-level RBF measurement by visualizing individual erythrocytes, yet automated erythrocyte detection and tracking, which are essential for quantifying blood flow, remain largely unexplored. To address this gap, we propose EMTrack, a novel framework featuring a flow-context module for erythrocyte detection that distinguishes moving from paused cells and a topology-aware tracking strategy that enables tracking under large inter-frame displacements and substantial motion variations. In addition, we establish RBF-EMA, a new EMA dataset with comprehensive erythrocyte detection and tracking annotations. Experimental results demonstrate that our method outperforms baseline methods both quantitatively and qualitatively on detection and tracking tasks in the RBF-EMA dataset. Moreover, RBF quantification results highlight the strong potential of our framework for automated retinal blood flow measurement.

2606.01000 2026-06-02 cs.LG cs.CL

Trust Functions: Near-Lossless Weak-to-Strong Generalization by Learning When to Trust the Weak Teacher

Arda Uzunoglu, Alvin Zhang, Daniel Khashabi

Institutions: University of Washington

AI summary: Proposes trust functions that assign trust scores to weak labels and filter weak supervision accordingly, achieving near-lossless weak-to-strong generalization across several domains, with gains amplified through an iterative chain.

Comments
ICML 2026

Abstract

Weak-to-strong generalization studies how to improve a strong student using supervision from a weaker teacher when reliable labels are scarce. We view this primarily as a data selection problem, where the key challenge is to identify which weak labels are reliable enough to serve as a training signal. To address this, we introduce trust functions that assign each weak label a scalar trust score and use these scores to filter weak supervision. Across several domains, including world knowledge, quantitative reasoning, and strategy games, trust filtering yields students that match and sometimes surpass ground-truth supervision, achieving near-lossless weak-to-strong generalization. Moreover, trust functions enable an iterative weak-to-strong chain that compounds gains by training a student and reusing it as the next teacher. The advantage of trust functions can be attributed to several mechanisms.

2606.00999 2026-06-02 cs.CV

SWARD: Stochastic Window-Attention-Based Relational Distillation for Cross-Architectural Semantic Segmentation

Aditya Makineni, Qing Tian

Institutions: Department of Computer Science, University of Alabama at Birmingham

AI summary: Proposes the SWARD framework, which bridges the representational gap between transformer teachers and CNN students through multi-scale windowed attention distillation and prototype discriminative regularization, enabling knowledge distillation for cross-architectural semantic segmentation.

Abstract

Large-scale vision foundation models have driven substantial gains on dense prediction tasks such as semantic segmentation, but their size makes deployment impractical in resource-constrained settings, motivating knowledge distillation as a means of transferring their capabilities to lightweight student networks. However, modern foundation teachers are predominantly transformer-based and encode global context, whereas efficient students are typically convolutional networks with locally biased receptive fields. Existing distillation methods largely assume architectural homogeneity and rely on direct feature mimicry, which fails to bridge this representational gap and neglects the structured spatial dependencies and discriminative organization required for accurate semantic segmentation. In this paper, we propose SWARD, a knowledge distillation framework that addresses this gap through two complementary mechanisms. First, we introduce a Multi-Scale Windowed Attention Distillation (MWAD) module that aligns teacher-student attention-based relations within stochastically shifted window partitions whose offsets are randomly resampled at every training iteration. This removes window boundary bias and, combined with the multi-scale design, captures both short- and long-range spatial dependencies. Second, we introduce Prototype Discriminative Regularization (PDR), a loss that helps shape the student's feature distribution by enforcing inter-class separation and intra-class compactness, further sharpening the discriminative structure beyond what feature mimicry alone can produce under the student's reduced capacity. Experiments across different vision applications (i.e., urban scene parsing and medical image segmentation) show that SWARD achieves state-of-the-art performance.

2606.00998 2026-06-02 cs.RO

GraspGen-X: Cross-Embodiment 6-DOF Diffusion-based Grasping

Beining Han, Yu-Wei Chao, Erwin Coumans, Clemens Eppner, Balakumar Sundaralingam, Jia Deng, Stan Birchfield, Adithyavairavan Murali

Institutions: NVIDIA; Princeton University

AI summary: Proposes a diffusion-based cross-embodiment 6-DOF grasping method that encodes the gripper with a swept-volume heuristic and is trained on 2 billion grasps, achieving zero-shot generalization to novel objects, scenes, and gripper morphologies.

Abstract

We study cross-embodiment 6-DOF robot grasping. Unlike prior works, we require the model to generalize not only to novel objects and scenes but also to novel gripper morphologies and physical grasping processes. Our method extends diffusion-based generative 6-DOF grasping models to condition on an additional gripper representation. We propose a swept-volume heuristic for encoding the gripper. We train our cross-embodiment model with procedural grippers and a large-scale dataset of 2 billion grasps. In simulation experiments, our model achieves better zero-shot generalization to novel real-world grippers and objects than baseline methods. Our model also serves as a good initialization for fine-tuning to adapt to novel grippers. In ablations, we demonstrate the efficiency of our swept-volume gripper representation and our procedural gripper training dataset. Finally, we show zero-shot generalization to real-world novel grippers for 6-DOF grasping, surpassing baselines in cross-embodiment generalization.

2606.00997 2026-06-02 cs.CL

Decoding in Order-Agnostic Language Models: Chain-Rule Deviation and Uniform Spreading

Lin Yao

Institutions: School of Computer Science, Shanghai Jiao Tong University; Zhongguancun Academy

AI summary: Studies how reveal order affects likelihood in order-agnostic language models (OALMs), proposes a diagnostic based on confidence variance, and presents a uniform-spreading theorem that motivates its use for comparing decoding paths.

Abstract

Order-agnostic language models (OALMs), including discrete diffusion language models (dLLMs), are trained to predict masked tokens under arbitrary conditioning sets, allowing sequences to be generated or scored under arbitrary reveal orders at inference time. In LLaDA-2.1, we report three findings. First, the learned conditionals are not exact factorizations of a coherent joint distribution: changing only the reveal order shifts target log-likelihood by up to 0.49 nats/token, so likelihood alone mixes content difficulty with path-dependent artifacts. Second, although confidence-first (CF) decoding is order-agnostic, its reveal orders are close to left-to-right (L2R) on content tokens. Third, we propose a complementary diagnostic based on the shape of the confidence trace. A uniform-spreading theorem shows that, at fixed total likelihood, target recoverability is maximized when per-step confidence is spread uniformly; the resulting deviation motivates $\mathrm{Var}(\log q_t)$ as a diagnostic for comparing decoding paths. Across C4 and four downstream benchmarks, low variance separates structured paths from random ordering, and variance is consistently associated with downstream correctness. These results support reporting mean confidence and confidence variance jointly when comparing OALM decoding paths.
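
The joint report of mean confidence and confidence variance recommended in this abstract can be sketched as follows. The function takes the per-step confidences q_t along one decoding path and returns the mean and variance of log q_t. The use of the population variance and the function name are assumptions; the abstract does not state which estimator the paper uses.

```python
import math


def log_confidence_stats(confidences):
    """Return (mean, variance) of log q_t along a single decoding path."""
    if not confidences:
        raise ValueError("need at least one confidence value")
    logs = [math.log(q) for q in confidences]
    mean = sum(logs) / len(logs)
    # Population variance: how unevenly the total log-likelihood is spread.
    var = sum((v - mean) ** 2 for v in logs) / len(logs)
    return mean, var
```

Two paths with the same total log-likelihood have the same mean, so the variance is what separates them: a path with confidences 0.5, 0.5, 0.5, 0.5 and a path with 0.25, 1.0, 0.25, 1.0 share a product of 0.0625, but only the first spreads confidence uniformly and has zero variance.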

2606.00991 2026-06-02 cs.AI

Large Language Models in Transportation Systems Management and Operations: From Text Reasoning to Multi-modal Decision Support

Siyan Li, Zehao Wang, Jiachen Li, Kanok Boriboonsomsin, Matthew J. Barth, Guoyuan Wu

Institutions: Bourns College of Engineering, Center for Environmental Research and Technology, University of California at Riverside, CA, USA

AI summary: Surveys applications of large language models (LLMs) and multi-modal large language models (MM-LLMs) in transportation systems management and operations (TSMO) across three domains: operations and services, mobility and fleet services, and data, modeling, and decision support. It also identifies challenges such as data heterogeneity, real-time inference, and explainability, along with future directions.

Comments
Preprint version

Abstract

Transportation systems management and operations (TSMO) increasingly depends on timely interpretation of heterogeneous data, from various sensor streams, incident reports, traveler feedback, and visual observations. Large language models (LLMs), including emerging multi-modal large language models (MM-LLMs), provide a new mechanism for integrating these structured and unstructured inputs into operator-facing decision support. This survey paper reviews LLM- and MM-LLM-based applications in TSMO across three domains: transportation operations & services (supply), mobility & fleet services (demand), and data, modeling & decision support. Using a PRISMA-guided screening process, we synthesize current studies while distinguishing operationally oriented applications from prototype and emerging concepts. We further identify recurring challenges in data heterogeneity, real-time inference, explainability, multi-modal fusion, and governance. Finally, we outline existing gaps and future directions in localized adaptation, edge deployment, benchmarking, and cross-agency collaboration. Overall, LLM-based systems appear most promising as a decision-support layer, with MM-LLMs offering particular value when heterogeneous text, visual, and sensor inputs must be integrated.

2606.00990 2026-06-02 cs.RO

OSCAR: Obstacle Survival Curves for Adaptive Robot Navigation

Hshmat Sahak, Aoran Jiao, Nicholas Rhinehart, Tim Barfoot

Institutions: University of Toronto

AI summary: Proposes the OSCAR framework, which uses survival models to learn obstacle clearance-time distributions and a graph planner that dynamically adjusts the threshold for waiting versus rerouting, reducing navigation time.

Comments
8 pages main text, appendices included

Abstract

A mobile robot following a graph of known routes can make costly navigation errors when a temporary obstacle blocks a critical edge: waiting too long behind a parked cart wastes time, but immediately rerouting around a person who would move in a few seconds is also inefficient. Standard reactive obstacle avoidance addresses local motion around obstacles, while fixed wait-or-reroute rules ignore how long different obstacle types tend to persist. We propose OSCAR: an adaptive survival-modeling framework for graph-based navigation with temporary blockages. Assuming obstacle class labels are available at encounter time, the robot learns class-conditioned residual clearance-time distributions from online experience, including right-censored observations when it reroutes before observing clearance. These survival models are integrated into a time-dependent graph planner that maintains obstacle memory and computes a patience threshold at each blocked edge: how long to wait before taking an alternate route. The method continuously updates its clearance estimates across episodes and uses them to balance waiting against rerouting. We evaluate the approach in simulation and on a real mobile robot in a university atrium with obstacles including people, chairs, bins, and tubes. In simulation, the learned policy's time-to-goal converges to within 1% of an oracle with access to ground-truth clearance distributions after fewer than 20 observations per obstacle class, outperforming all heuristic baselines. Real-world deployment confirms that the policy improves online, adapting its patience thresholds from experience across 50 navigation episodes.
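
The wait-or-reroute trade-off in this abstract can be illustrated with a toy sketch. It uses the Kaplan-Meier estimator, a standard way to fit a survival curve from right-censored durations; this is a swapped-in choice for illustration, since the abstract does not say which survival model OSCAR uses. The patience rule, the `detour_cost` parameter, and the function names are also assumptions, and the sketch ignores time already spent waiting.

```python
def kaplan_meier(durations, cleared):
    """Kaplan-Meier survival curve as [(t, S(t))] at each observed clearance time.

    durations: how long each obstacle was watched.
    cleared: True if clearance was observed, False if right-censored
             (the robot rerouted before the obstacle cleared).
    """
    samples = sorted(zip(durations, cleared))
    at_risk = len(samples)
    survival = 1.0
    curve = []
    i = 0
    while i < len(samples):
        t = samples[i][0]
        events = 0   # clearances observed exactly at time t
        leaving = 0  # all samples leaving the risk set at time t
        while i < len(samples) and samples[i][0] == t:
            events += 1 if samples[i][1] else 0
            leaving += 1
            i += 1
        if events:
            survival *= 1.0 - events / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


def patience_threshold(curve, detour_cost):
    """Pick the wait limit tau minimizing E[min(T, tau)] + S(tau) * detour_cost."""
    best_tau, best_cost = 0.0, detour_cost  # tau = 0 means reroute immediately
    waited = 0.0                            # integral of S(t) from 0 to tau
    prev_t, prev_s = 0.0, 1.0
    for t, s in curve:
        waited += prev_s * (t - prev_t)
        cost = waited + s * detour_cost
        if cost < best_cost:
            best_tau, best_cost = t, cost
        prev_t, prev_s = t, s
    return best_tau, best_cost
```

Under these assumptions, an expensive detour makes it worthwhile to wait until the latest observed clearance time, whereas a cheap detour makes immediate rerouting the better choice.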

2606.00988 2026-06-02 cs.LG

Data Enrichment for Symbolic Regression Using Diffusion Models

Simon De Reuver, Tamas Kristof Toth, Teddy Lazebnik

Institutions: Department of Computing, Jönköping University; Department of Information Science, University of Haifa

AI summary: Proposes a physics-guided latent diffusion framework that enriches sparse observations with physics-constrained synthetic data, improving the reliability of symbolic regression for equation discovery from sparse, noisy, or incomplete data.

Abstract

Symbolic regression (SR) offers a route to scientific discovery by converting observations into interpretable governing equations. However, despite its promise, its reliability degrades sharply when spatiotemporal measurements are sparse, noisy, or physically incomplete, as commonly occurs in practice. Data enrichment (DE) has been shown to mitigate this limitation, yet additional samples can mislead equation discovery unless they preserve the physical structure of the target system. This implicit requirement of DE demands narrow domain expertise as well as technical fluency, which greatly limits its practical usefulness. In this study, we introduce a physics-guided latent diffusion framework for DE in service of downstream SR models. The proposed framework combines a variational autoencoder, a conditional latent diffusion model, and a physics-informed residual corrector to complete sparse observations with synthetic fields constrained by governing relations. We evaluate the approach on heat conduction, incompressible Navier-Stokes flow, and a moving single-mass Newtonian gravitational potential, using GPLearn, DEAP, and PySR as downstream SR backends. Our results reveal that physics-corrected enrichment consistently improves recovery in sparse regimes across physical dynamics and SR models. These results show that generative enrichment can strengthen equation discovery without additional domain expertise.

2606.00987 2026-06-02 cs.CV cs.AI

An Open-Source Benchmark and Baseline for Multi-temporal Referring Segmentation

多时相指代分割的开源基准与基线

Bingyu Li, Da Zhang, Tao Huo, Zhiyuan Zhao, Junyu Gao, Xuelong Li

发表机构 * University of Science and Technology of China(中国科学技术大学) Institute of Artificial Intelligence (TeleAI)(人工智能研究所) China Telecom(中国电信) School of Artificial Intelligence, Optics and Electronics (iOPEN)(人工智能、光学与电子学院) Northwestern Polytechnical University(西北工业大学)

AI总结 提出多时相指代分割任务,通过自动化数据构建管道CRAFT-Agent生成首个基准MTRefSeg-21K,并设计两阶段训练的变化感知LVLM框架MTRefSeg-R1,实现优于现有基线的性能。

详情
AI中文摘要

大型视觉语言模型(LVLMs)展现了强大的视觉理解和语言引导定位能力,但其多时相视觉推理能力仍未充分探索。为填补这一空白,我们引入了多时相指代分割(MTRS),这是一个新任务,旨在从多时相图像中分割语言描述的时间变化。MTRS通过联合要求时相对应推理、语言定位和像素级掩码预测,扩展了传统的指代分割和变化检测。我们提出了CRAFT-Agent,一个带有人工审核的自动化数据构建管道,并构建了MTRefSeg-21K,这是第一个MTRS基准,包含21K个高质量的多时相图像-文本-掩码三元组,覆盖多样化的场景、视角和领域。对一系列基于VLM和LVLM的模型进行基准测试表明,直接推理表现较差,而任务特定的微调仍然有限。为解决这一问题,我们提出了MTRefSeg-R1,一个采用两阶段策略训练的变化感知LVLM框架。它首先从20K个仅视觉的双时相样本中学习通用时间变化感知,然后在MTRefSeg-21K上进行微调,以实现细粒度的语言引导时间定位。MTRefSeg-R1显式建模跨时相视觉差异,将语言指令与时间变化对齐,并预测所指变化掩码。大量实验表明,与现有的LVLM基线相比,MTRefSeg-R1实现了强大且通常更优的性能,展示了MTRS的挑战和潜力。

英文摘要

Large Vision-Language Models (LVLMs) have shown strong visual understanding and language-guided grounding abilities, yet their capacity for multi-temporal visual reasoning remains underexplored. To bridge this gap, we introduce Multi-temporal Referring Segmentation (MTRS), a new task that aims to segment language-described temporal changes from multi-temporal images. MTRS extends conventional referring segmentation and change detection by jointly requiring temporal correspondence reasoning, language grounding, and pixel-level mask prediction. We propose CRAFT-Agent, an automated data construction pipeline with human auditing, and build MTRefSeg-21K, the first MTRS benchmark, containing 21K high-quality multi-temporal image-text-mask triplets across diverse scenes, viewpoints, and domains. Benchmarking a broad set of VLM- and LVLM-based models reveals that direct inference performs poorly, while task-specific fine-tuning remains limited. To address this, we propose MTRefSeg-R1, a change-aware LVLM framework trained with a two-stage strategy. It first learns general temporal-change perception from 20K vision-only bi-temporal samples, and is then fine-tuned on MTRefSeg-21K for fine-grained language-guided temporal localization. MTRefSeg-R1 explicitly models cross-temporal visual differences, aligns language instructions with temporal variations, and predicts referred change masks. Extensive experiments show that MTRefSeg-R1 achieves strong and often superior performance compared with existing LVLM baselines, demonstrating the challenge and potential of MTRS.

2606.00986 2026-06-02 cs.LG

Profiling Privacy Preservation Against Gradient Inversion Attacks in Tabular Federated Learning

表格联邦学习中针对梯度反转攻击的隐私保护分析

Ivo Osterberg Nilsson, Maximilian Birr Engvall, Viktor Valadi, Teddy Lazebnik

发表机构 * Department of Computing(计算系) Jönköping University(延雪平大学) Scaleout Systems University of Haifa(海法大学)

AI总结 本研究通过评估不同联邦学习协议、客户端批量大小、训练阶段、攻击者假设、模型架构及任务类型下梯度反转攻击对表格数据的恢复能力,发现小批量更新最易受攻击,而FT-Transformer架构比MLP更难反转,并指出聚合重建精度可能高估完整记录恢复。

详情
AI中文摘要

联邦学习(FL)允许多个数据持有者在不集中原始数据的情况下协作训练机器学习模型,使其在医疗保健和机构数据共享等隐私敏感领域非常有用。FL将数据保留在客户端本地,仅通信模型更新(如梯度或模型增量)。然而,这些更新可能通过梯度反转攻击(GIA)暴露客户端私有数据。我们研究了在诚实但好奇的服务器威胁模型下,表格FL中的这种风险,涉及FL协议、客户端批量大小、训练阶段、攻击者假设、模型架构以及二分类、多分类和回归任务。我们使用MIMIC-IV和补充基准数据集。我们的评估区分了数值和分类恢复、基线可恢复性、特征级别恢复和精确匹配率(EMR)。我们使用暴露对齐协议评估FedSGD梯度和FedAvg模型增量,比较在匹配的客户端数据暴露(而非匹配的通信轮次)后的受攻击模型。我们比较了多层感知器(MLP)、ResNet和FT-Transformer模型,并通过MLP网格(宽度、深度、激活函数、归一化和丢弃率)隔离架构效应。结果表明,小客户端批量以及代表少量不同记录的更新最易受攻击。更大的本地批量和更强的聚合减少了重建,但并未消除泄露。FT-Transformer始终比独热基线更难反转,而MLP家族内的可重建性也差异很大。这些发现将架构确定为表格FL中一个实用的隐私变量。我们还表明,聚合重建精度可能高估稀疏数据中的完整记录恢复,因此EMR和基线比较至关重要。

英文摘要

Federated learning (FL) enables multiple data holders to train machine learning models collaboratively without centralizing raw data, making it useful in privacy-sensitive domains such as healthcare and institutional data sharing. FL keeps data local to clients while communicating only model updates, such as gradients or model deltas. Nevertheless, these updates can expose private client data through gradient inversion attacks (GIAs). We study this risk for tabular FL under an honest-but-curious server threat model across FL protocols, client batch sizes, training stages, attacker assumptions, model architectures, and binary classification, multiclass classification, and regression tasks. We use MIMIC-IV and complementary benchmark datasets. Our evaluation distinguishes numerical and categorical recovery, baseline recoverability, feature-level recovery, and exact match rate (EMR). We evaluate FedSGD gradients and FedAvg model deltas with an exposure-aligned protocol, comparing attacked models after matched client data exposure rather than matched communication rounds. We compare multilayer perceptron (MLP), ResNet, and FT-Transformer models, and isolate architecture effects through an MLP grid over width, depth, activation, normalization, and dropout. The results show that small client batches and updates representing few distinct records are most vulnerable. Larger local batches and stronger aggregation reduce reconstruction but do not eliminate leakage. FT-Transformer is consistently harder to invert than one-hot baselines, while reconstructability also varies substantially within the MLP family. These findings identify architecture as a practical privacy variable in tabular FL. We also show that aggregate reconstruction accuracy can overstate complete record recovery in sparse data, making EMR and baseline comparisons essential.
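One way to see why updates computed from very few records are the most exposed is the textbook analytic case of a single linear unit with a bias term and a batch of one record. The sketch below is not the attack evaluated in the paper; the function names and numbers are hypothetical, and it only illustrates that the shared gradient can determine the input exactly in this simplest setting.

```python
from typing import List, Tuple


def gradients_single_record(
    w: List[float], b: float, x: List[float], target: float
) -> Tuple[List[float], float]:
    """Gradients of 0.5 * (w.x + b - target)^2 for one record x."""
    prediction = sum(wi * xi for wi, xi in zip(w, x)) + b
    error = prediction - target
    grad_w = [error * xi for xi in x]  # dL/dw = error * x
    grad_b = error                     # dL/db = error
    return grad_w, grad_b


def invert_single_record(grad_w: List[float], grad_b: float) -> List[float]:
    """Recover x from the shared gradients: x = grad_w / grad_b."""
    if grad_b == 0:
        raise ValueError("bias gradient is zero; the input cannot be recovered this way")
    return [g / grad_b for g in grad_w]


# Hypothetical client record and model parameters.
private_x = [0.5, -1.0, 2.0]
grad_w, grad_b = gradients_single_record([0.1, 0.2, -0.3], 0.05, private_x, 1.0)
recovered_x = invert_single_record(grad_w, grad_b)
```

With a larger batch the server sees only the average of such per-record gradients, so this exact division no longer isolates a single record. That is consistent with the abstract's finding that larger batches reduce reconstruction without eliminating leakage.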

2606.00985 2026-06-02 cs.RO

Make Your VLA More Robust Without More Data By Interleaving Motion Planning

通过交错运动规划使您的VLA更鲁棒而无需更多数据

Dan BW Choe, Sundhar Vinodh Sangeetha, Samuel Coogan, Shreyas Kousik

发表机构 * Georgia Institute of Technology(佐治亚理工学院)

AI总结 提出MPVI框架,将基于模型的运动规划与视觉-语言-动作模型交错结合,通过VLM完成检查和本体感受触发实现可靠切换,无需额外训练即可提升长时域移动操作任务的鲁棒性,在BEHAVIOR-1K基准上任务进度提升113%。

详情
AI中文摘要

视觉-语言-动作(VLA)模型在移动操作方面取得了显著进展,但在长时域任务上的表现仍然较差。这些任务尤其具有挑战性,因为(1)必须在空间分布的子任务的长序列中保持对高层目标的进展,并且(2)早期执行错误会在任务时域内迅速累积。尽管在大规模人类遥操作移动操作数据上进行了微调,这些挑战仍然存在,表明仅靠更多数据可能无法解决问题。为了应对这些挑战,我们提出了MPVI:运动规划器/VLA交错框架,该框架将基于模型的运动规划与VLA集成,无需进一步训练即可提高鲁棒性。所提出的集成通过开放词汇目标检测、前沿探索和运动规划,实现了在杂乱场景中对远处或遮挡目标物体的定位和导航。然而,这种集成并非易事,需要模块之间的可靠切换;我们通过基于VLM的完成检查与本体感受触发器展示了一种可行的方法。我们在BEHAVIOR-1K基准上评估了我们的方法,并展示了在任务进度上比顶级端到端VLA基线提升113%。更多详情请访问项目页面:https://mpvi.netlify.app/。

英文摘要

Vision-Language-Action (VLA) models have shown remarkable progress for mobile manipulation, but their performance on long-horizon tasks remains poor. These tasks are especially challenging because (1) progress toward high-level goals must be maintained across extended sequences of spatially distributed subtasks, and (2) early execution errors compound rapidly over the task horizon. These challenges persist despite fine-tuning on large-scale human-teleoperated mobile manipulation data, indicating that more data alone may not resolve the problem. To address these challenges, we propose MPVI: Motion Planner / VLA Interleaving, a framework that integrates model-based motion planning with VLAs to improve robustness without further training. The proposed integration enables localization and navigation to distant or occluded target objects through cluttered scenes using open-vocabulary object detection, frontier exploration, and motion planning. However, such integration is non-trivial, requiring reliable switching between modules; we show one way forward via VLM-based completion checking with proprioceptive triggers. We evaluate our approach on the BEHAVIOR-1K benchmark and demonstrate a 113% improvement in task progress over a top end-to-end VLA baseline. Additional details are available at the project page: https://mpvi.netlify.app/.

2606.00981 2026-06-02 cs.CL

Robust Asynchronous Planning via Auto-Formalization

通过自动形式化实现鲁棒的异步规划

Jiayi Zhang, Jianing Yin, Ben Zhou, Li Zhang

发表机构 * Drexel University(德雷塞尔大学) Arizona State University(亚利桑那州立大学)

AI总结 针对异步规划中并发、非均匀时长和执行时间约束的挑战,提出自动形式化方法,通过CP-SAT形式化器在依赖图规模从5到100动作时保持高准确率,并引入状态感知修复策略应对执行时约束更新。

详情
AI中文摘要

LLMs可以通过直接生成动作序列作为规划器,或将任务翻译成领域特定语言供外部求解器作为形式化器来进行规划。虽然大多数现实任务具有非均匀时长、并发和执行时间约束的异步特性,但现有基准测试很少涵盖这些。我们将这些异步规划挑战统一到一个公式下,并引入了三个分别大规模解决这些挑战的基准测试。我们得出结论:形式化表示的选择主要决定了规划是否可扩展:当依赖图从5个动作增长到100个动作时,规划器的规划准确率从96%下降到5%,PDDL2.1形式化器从13%下降到0%,而CP-SAT形式化器平均为94%,在100个动作时仍达到83%。忠实性诊断表明,当LLMs必须保持谓词、效果和目标一致时,PDDL2.1基于谓词的规划表示变得脆弱,而通用约束满足程序则不然。执行时更新规划约束进一步严重降低了性能(规划器23.9%,PDDL2.1 0.7%,CP-SAT 46.1%),但一种仅更新事件诱导约束的状态感知修复策略将CP-SAT形式化器恢复至84.5%。

英文摘要

LLMs can plan either by generating action sequences directly as a Planner or by translating tasks into a domain-specific language for an external solver as a Formalizer. While most real-world tasks are asynchronous, with non-uniform durations, concurrency, and execution-time constraints, existing benchmarks hardly cover them. We unify these asynchronous planning challenges under a single formulation and introduce the first three benchmarks that address each at scale. We conclude that the choice of formal representation primarily determines whether planning scales: as dependency graphs grow from 5 to 100 actions, Planner collapses from 96% to 5% plan accuracy and PDDL2.1 Formalizer from 13% to 0%, while CP-SAT Formalizer averages 94% and still achieves 83% at 100 actions. Faithfulness diagnostics show that, compared with general constraint satisfaction programs, PDDL2.1's predicate-based planning representation becomes brittle when LLMs must keep predicates, effects, and goals consistent. Execution-time updates of planning constraints further degrade performance sharply (Planner 23.9%, PDDL2.1 0.7%, CP-SAT 46.1%), but a state-aware repair strategy that updates only event-induced constraints recovers CP-SAT Formalizer to 84.5%.
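To make the asynchronous setting concrete, the sketch below computes earliest start times for actions with non-uniform durations and precedence dependencies, assuming any number of actions may run concurrently. It is a plain critical-path calculation with hypothetical names and data, not the paper's CP-SAT formulation, which the abstract does not spell out.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Tuple


def earliest_schedule(
    durations: Dict[str, float], deps: Dict[str, List[str]]
) -> Tuple[Dict[str, float], float]:
    """Return earliest start times and the makespan of a dependency graph.

    deps[a] lists the actions that must finish before action a can start.
    Assumes unlimited concurrency (no resource constraints).
    """
    graph = {action: deps.get(action, []) for action in durations}
    start: Dict[str, float] = {}
    # static_order() yields every action after all of its predecessors.
    for action in TopologicalSorter(graph).static_order():
        start[action] = max(
            (start[p] + durations[p] for p in graph[action]), default=0.0
        )
    makespan = max(start[a] + durations[a] for a in durations)
    return start, makespan


# Hypothetical cooking task with durations in minutes.
task_durations = {"boil_water": 10.0, "chop": 5.0, "cook": 15.0, "set_table": 3.0}
task_deps = {"cook": ["boil_water", "chop"]}
starts, total = earliest_schedule(task_durations, task_deps)
```

In this example the concurrent plan finishes after 25 minutes, whereas running the four actions one after another would take 33. A constraint solver becomes necessary once resource limits or execution-time constraint updates are added, which this sketch deliberately leaves out.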

2606.00979 2026-06-02 cs.LG

UME: A Unified Meta-Generalization Framework for Cross-Domain ETA

UME:跨域ETA的统一元泛化框架

Duo Wang, Qiong Wu, Jianguo Wu, Ruiyu Xu, Jinhui Yi, Zhonggen Sun, Zhentao Zhang, Yu Zhang, Ke Xing, Yongjun Yin, Zishuo Li, Jianwen Huang

发表机构 * Peking University(北京大学) Meituan(美团)

AI总结 针对即时物流中跨域ETA预测的零样本泛化、特征缺失和知识迁移问题,提出基于超网络元学习的统一双分支架构UME,通过元模块动态调制特征门控、专家注意力和最终预测,并在美团Keeta平台部署验证。

详情
AI中文摘要

在即时物流中,结账页面的准确预计到达时间(ETA)预测对于提高用户满意度、优化调度和控制运营成本至关重要。在国际按需配送平台上,ETA数据来自具有不同模式的不同国家或地区,多域建模非常重要且已被广泛采用。然而,现有方法在实际部署中仍面临三个关键挑战。首先,当前的多域模型难以泛化到完全未见过的域,无法在初始冷启动阶段实现零样本预测。其次,跨域特征空间通常被假设为一致的,而新域由于缺乏历史数据,常常遭受离线(统计)特征的结构性缺失。第三,这种特征缺失通常迫使工业系统分别对成熟域和冷启动域进行建模,阻碍了知识迁移并增加了维护开销。为了解决这些挑战,我们提出了UME,一个统一的元泛化框架用于ETA。具体来说,UME将统一的双分支架构与一种新颖的元学习机制相结合,该机制采用基于超网络的元学习器。通过利用域级知识和实例级上下文,元学习器赋能三个元模块动态调制特征门控、专家注意力和最终预测,捕获跨域相关性并促进域内适应。进一步引入知识蒸馏策略以提升性能。UME现已部署在美团Keeta配送平台(中国最大的国际食品配送平台)上。大量的离线实验和在线A/B测试表明,UME显著优于现有基线。

英文摘要

Accurate Estimated Time of Arrival (ETA) prediction on the checkout page is crucial in instant logistics for enhancing user satisfaction, optimizing dispatching, and controlling operational costs. In international on-demand delivery platforms, where ETA data originates from diverse countries or regions with different patterns, multi-domain modeling is of great importance and has been widely adopted. However, existing methods still face three critical challenges in real-world deployment. First, current multi-domain models struggle to generalize to completely unseen domains, failing to achieve zero-shot prediction during the initial cold-start phase. Second, cross-domain feature spaces are often assumed to be consistent, whereas new domains commonly suffer from structural missingness of offline (statistical) features due to the lack of historical data. Third, such feature missingness often compels industrial systems to model mature and cold-start domains separately, hindering knowledge transfer and increasing maintenance overhead. To address these challenges, we propose UME, a Unified Meta-generalization framework for ETA. Specifically, UME integrates a unified dual-branch architecture with a novel meta-learning mechanism that employs a hypernetwork-based meta learner. By leveraging domain-level knowledge and instance-level context, the meta learner empowers three meta modules to dynamically modulate feature gating, expert attention, and final prediction, capturing cross-domain correlations and facilitating intra-domain adaptation. A knowledge distillation strategy is further introduced to enhance performance. UME has now been deployed on the Meituan Keeta delivery platform (the largest international food delivery platform in China). Extensive offline experiments and online A/B tests demonstrate that UME significantly outperforms existing baselines.

2606.00975 2026-06-02 cs.CL

Lost in Delusion: Examining LLM Safety Under User Delusions and Distress

迷失在妄想中:审视用户妄想与痛苦下的LLM安全性

Andrew Aquilina, Chetna Nihalani, Vasudha Varadarajan, Nathan S. Fishbein, Yu-Ru Lin, Maarten Sap

发表机构 * University of Pittsburgh(匹兹堡大学) Carnegie Mellon University(卡内基梅隆大学) Fordham University(福特汉姆大学)

AI总结 本研究通过多轮对话模拟,发现LLM在检测用户痛苦时表现良好,但在痛苦嵌入妄想时干预行为显著减少(高达4.5倍),并提出针对性提示策略以缩小这一差距。

详情
AI中文摘要

LLM聊天机器人日益成为心理困扰人群(包括那些困扰与妄想信念交织的人)的首要支持来源。先前关于LLM心理健康安全性的研究主要评估一般治疗质量或单轮危机检测,未明确模型在持续对话中痛苦与妄想交织时的行为。我们通过匹配的多轮模拟(基于临床角色和六个LLM)填补了这一空白,将每次妄想对话与仅痛苦对照配对,以隔离妄想框架的影响。这揭示了一个识别-干预差距:模型无论框架如何都能以相当比率检测痛苦,但一旦痛苦嵌入妄想,模型便严重未能采取行动,安全干预被抑制高达4.5倍。这种失败追踪的是对用户前提的累积接受,而非情感验证。更糟糕的是,提示模型评估用户痛苦的直观修复在妄想框架下适得其反;只有带有明确响应指导的妄想感知提示才能缩小差距,且这依赖于一个妄想分类器,而该分类器在最脆弱的模型上本身不可靠。因此,安全部署需要将妄想框架视为一种独特的风险信号,覆盖对话顺应。

英文摘要

LLM chatbots increasingly serve as a first source of support for people in psychological distress, including those whose distress is entangled with delusional beliefs. Prior work on LLM mental-health safety largely evaluates general therapeutic quality or single-turn crisis detection, leaving unclear how models behave when distress is intertwined with delusion over sustained conversations. We address this gap with matched multi-turn simulations, across clinically grounded personas and six LLMs, that pair each delusional conversation with a distress-only control to isolate the effect of delusional framing. This reveals a recognition-intervention gap: models detect distress at comparable rates regardless of framing, yet sharply fail to act on it once distress is embedded in delusion, with safety interventions suppressed by up to 4.5x. The failure tracks accumulated acceptance of the user's premises rather than emotional validation. Worse, the intuitive fix of prompting models to assess user distress backfires under delusional framing; only delusion-aware prompting with explicit response guidance closes the gap, and even this depends on a delusion classifier that is itself unreliable on the most vulnerable models. Safe deployment therefore requires treating delusional framing as a distinct risk signal that overrides conversational accommodation.

2606.00971 2026-06-02 cs.CL

HypothesisMed: Inference-Time Answer Fusion and Structured Hypothesis-Space Reporting for Biomedical Question Answering

HypothesisMed:生物医学问答中的推理时答案融合与结构化假设空间报告

Md Motaleb Hossen Manik, Ge Wang

发表机构 * Department of Computer Science Rensselaer Polytechnic Institute(伦斯勒理工学院计算机科学系) Department of Biomedical Engineering Rensselaer Polytechnic Institute(伦斯勒理工学院生物医学工程系)

AI总结 提出HypothesisMed推理时可靠性流水线,通过答案融合和SPACE标签(有效/不完整/矛盾)提升生物医学多项选择问答的准确率、可解析性和结构化可靠性报告。

详情
AI中文摘要

使用大型语言模型进行生物医学问答通常通过答案准确率进行评估,但仅凭答案准确率并不能表明模型能否生成可解析的输出、遵循结构化可靠性指令、识别弱答案空间或避免自信的错误承诺。本文提出HypothesisMed,一个用于生物医学多项选择问答的推理时可靠性流水线。它结合了直接提示、思维链提示、HypothesisMed-v3提示和答案融合。最终答案通过融合选择,而HypothesisMed-v3提供SPACE标签和置信度信息。SPACE标签将答案空间标记为有效、不完整或矛盾。我们在MedQA、MedMCQA和PubMedQA上使用每个数据集1000个样本评估了Qwen2.5-7B、Phi-4-mini、DeepSeek-R1-32B和BioMistral-7B。该流水线在每个模型的最佳直接或思维链基线基础上提高了加权准确率,同时增加了解析和SPACE覆盖率。我们还使用每个模型10183个样本将评估扩展到Qwen2.5-7B和Phi-4-mini。融合将Phi-4-mini的准确率从0.4296提升到0.5192,而Qwen2.5-7B的思维链在答案准确率上仍略高。然而,Qwen2.5-7B融合实现了完全的解析和SPACE覆盖率,且错误承诺更低。一个12000样本的SPACE压力测试表明,答案空间诊断仍然困难,Qwen2.5-7B的SPACE准确率为0.3074,Phi-4-mini为0.4168。这些结果表明,答案准确率、可解析性、结构化可靠性报告、校准行为和错误承诺行为是可分离的能力。主要贡献不是声称通用的最先进性能,而是一个可复现的推理时框架,用于在结构化可靠性约束下评估作为可审计工作流组件的生物医学问答模型。

英文摘要

Biomedical question answering with large language models is commonly evaluated using answer accuracy, but answer accuracy alone does not indicate whether a model can produce parseable outputs, follow structured reliability instructions, recognize weak answer spaces, or avoid confident incorrect commitments. This paper presents HypothesisMed, an inference-time reliability pipeline for biomedical multiple-choice question answering. It combines direct prompting, chain-of-thought prompting, HypothesisMed-v3 prompting, and answer fusion. The final answer is selected by fusion, while HypothesisMed-v3 supplies SPACE labels and confidence information. SPACE labels mark the answer space as VALID, INCOMPLETE, or CONTRADICTED. We evaluate Qwen2.5-7B, Phi-4-mini, DeepSeek-R1-32B, and BioMistral-7B on MedQA, MedMCQA, and PubMedQA using 1,000 examples per dataset. The pipeline improves weighted accuracy over each model's best direct or chain-of-thought baseline while increasing parse and SPACE coverage. We also scale evaluation to Qwen2.5-7B and Phi-4-mini using 10,183 examples per model. Fusion improves Phi-4-mini accuracy from 0.4296 to 0.5192, while Qwen2.5-7B chain-of-thought remains slightly higher in answer accuracy. However, Qwen2.5-7B fusion achieves complete parse and SPACE coverage with much lower false commitment. A 12,000-example SPACE stress test shows that answer-space diagnosis remains difficult, with SPACE accuracy of 0.3074 for Qwen2.5-7B and 0.4168 for Phi-4-mini. These results show that answer accuracy, parseability, structured reliability reporting, calibration behavior, and false-commitment behavior are separable capabilities. The main contribution is not a universal state-of-the-art claim, but a reproducible inference-time framework for evaluating biomedical question answering models as auditable workflow components under structured reliability constraints.

2606.00970 2026-06-02 cs.AI cs.LG econ.TH

Prospect-Theory Behavior from Bellman Optimality in MDPs with Catastrophic States

具有灾难性状态的MDP中贝尔曼最优性产生的前景理论行为

Yujiao Chen

发表机构 * Massachusetts Institute of Technology(麻省理工学院)

AI总结 研究具有吸收灾难状态的马尔可夫决策过程中的风险中性控制,发现标准贝尔曼最优性产生前景理论特征:S形值函数、内生损失敏感系数和反射效应策略反转,并推导出渐近损失厌恶平台的闭式表达式。

详情
AI中文摘要

我们研究具有吸收灾难状态的马尔可夫决策过程中的风险中性控制。尽管奖励是线性的且智能体没有效用曲率、概率加权或框架依赖,标准贝尔曼最优性产生了三个前景理论特征:S形值函数轮廓(灾难附近凸,远处凹)、内生损失敏感系数$\lambda^*(S) > 1$以及反射效应策略反转。在495个配置中,最优策略在正漂移(增长)模式下在灾难附近选择安全动作,尽管风险动作的即时期望值更高;在负漂移(衰退)模式下在灾难附近选择风险动作,尽管安全动作的即时期望损失更低。我们推导出渐近损失厌恶平台$\bar{\lambda}$的闭式表达式,该表达式仅依赖于获胜概率$p$、收益不对称性$r = |\Delta_\ell/\Delta_w|$和折扣因子$\beta$,与数值解的拟合$R^2 = 0.999$。该机制不需要不对称收益。在三个不对称水平下对$(p,\beta)$进行扫描,$\bar{\lambda}$大于1的不对称份额中位数为4.6%($r = 1.25$时),上升到13.9%($r = 2$时),且在每个测试单元中边界贡献超过不对称贡献。这些现象在表格Q学习(无模型智能体在增长模式下与$V^*$的相关性为0.98,衰退模式下为1.00)以及随机转移(高斯、重尾Student-$t_3$和不对称偏正态噪声,幅度高达步长的50%)中持续存在,其中渐近平台在安全通道噪声下跟踪闭式预测的误差在0.41%以内,在风险通道或双通道噪声下误差在9.6%以内。这些结果将吸收失败状态识别为最优控制下产生前景理论行为的充分结构机制。

英文摘要

We study risk-neutral control in Markov decision processes with an absorbing catastrophic state. Even though rewards are linear and the agent has no utility curvature, probability weighting, or framing dependence, standard Bellman optimality produces three prospect-theory-like signatures: an S-shaped value-function profile (convex near catastrophe, concave in the far field), an endogenous loss-sensitivity coefficient $\lambda^*(S) > 1$, and a reflection-effect policy reversal. Across 495 configurations, the optimal policy plays safe near catastrophe in positive-drift (growth) regimes despite the risky action's higher immediate expected value, and plays risky near catastrophe in negative-drift (decline) regimes despite the safe action's lower immediate expected loss. We derive a closed-form expression for the asymptotic loss-aversion plateau $\bar{\lambda}$ that depends only on win probability $p$, payoff asymmetry $r = |\Delta_\ell/\Delta_w|$, and discount factor $\beta$, and matches numerical solutions to $R^2 = 0.999$. The mechanism does not require asymmetric payoffs. Across a sweep of $(p,\beta)$ at three asymmetry levels, the asymmetry share of $\bar{\lambda}$ above unity has median 4.6% at $r = 1.25$ and rises to 13.9% at $r = 2$, with the boundary contribution exceeding the asymmetry contribution in every cell tested. The phenomena persist under tabular Q-learning (a model-free agent reproduces $V^*$ at correlation 0.98 in growth and 1.00 in decline) and under stochastic transitions with Gaussian, heavy-tailed Student-$t_3$, and asymmetric skew-normal noise up to 50% of the step size, where the asymptotic plateau tracks the closed-form prediction within 0.41% for safe-channel noise and within 9.6% for risky-channel or both-channel noise. These results identify absorbing failure states as a sufficient structural mechanism for prospect-theory-like behavior under optimal control.
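The growth-regime effect described in this abstract can be reproduced on a toy chain with plain value iteration. Everything in the sketch below is an assumption made for illustration: the chain layout, the reflecting top state, the step sizes, and the function name `solve_chain_mdp` are hypothetical and are not the paper's model or its 495 configurations.

```python
from typing import List, Tuple


def solve_chain_mdp(
    n_states: int = 30,
    p_win: float = 0.5,
    gain: int = 5,
    loss: int = 2,
    beta: float = 0.95,
    tol: float = 1e-10,
) -> Tuple[List[float], List[str]]:
    """Risk-neutral value iteration on a chain with an absorbing state 0.

    Safe action: move up one state, reward +1.
    Risky action: move up `gain` states with probability p_win (reward +gain),
    otherwise down `loss` states (reward -loss).
    State 0 is the catastrophe: absorbing, with zero future reward.
    Moves above n_states are clipped to n_states (assumed reflecting top).
    """
    values = [0.0] * (n_states + 1)

    def q_values(s: int) -> Tuple[float, float]:
        safe = 1.0 + beta * values[min(s + 1, n_states)]
        win_state = min(s + gain, n_states)
        lose_state = max(s - loss, 0)
        risky = p_win * (gain + beta * values[win_state]) + (1.0 - p_win) * (
            -loss + beta * values[lose_state]
        )
        return safe, risky

    while True:
        delta = 0.0
        for s in range(1, n_states + 1):  # state 0 stays at value 0
            best = max(q_values(s))
            delta = max(delta, abs(best - values[s]))
            values[s] = best  # in-place (Gauss-Seidel) Bellman update
        if delta < tol:
            break

    policy = ["catastrophe"]
    for s in range(1, n_states + 1):
        safe, risky = q_values(s)
        policy.append("safe" if safe >= risky else "risky")
    return values, policy


values, policy = solve_chain_mdp()
```

With these assumed parameters the risky action has the higher immediate expected reward (1.5 versus 1.0), yet the greedy policy is safe in states 1 and 2, where a single loss reaches the absorbing state and forfeits all future reward. No utility curvature or probability weighting is involved, only the Bellman backup.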

2606.00966 2026-06-02 cs.RO

Threading Optimization for Vision-Language-Action Model Inference in Low-Cost Smart Agricultural Manipulation

低成本智能农业机械臂中视觉-语言-动作模型推理的线程优化

Keith Truongcao, Christopher Nhu, Zijian An, Phong Nguyen, Siwei Cai, Lifeng Zhou

发表机构 * Department of Electrical Engineering, Drexel University(德雷塞尔大学电气工程系)

AI总结 针对低成本机械臂上VLA模型推理慢、精细动作调整难的问题,通过优化RTAC算法的线程实现,降低了端到端延迟并提高了响应性,在农产品操作任务中验证了控制稳定性和速度的提升。

详情
AI中文摘要

视觉-语言-动作(VLA)模型仍然面临推理速度慢和难以进行精细运动调整等挑战,限制了它们在工业中的广泛应用。虽然实时动作分块(RTAC)算法已被提出以解决这些瓶颈,但从伪代码算法到低成本机械臂上稳定、实际部署的桥梁仍然是一个挑战。在这项工作中,我们提出了一个完整的系统级RTAC实现,专门针对低成本机器人操纵系统。我们通过优化策略推理和控制管道的线程实现,超越了原始的高级伪代码,在不修改底层策略的情况下减少了端到端延迟并提高了响应性。我们在涉及农产品(特别是大蒜球和核桃)操作的任务上评估了该系统。实验结果表明,与RTAC的基本实现相比,我们的自定义线程实现显著提高了控制稳定性和速度。

英文摘要

Vision-Language-Action (VLA) models continue to face challenges such as slow inference speed and difficulty performing fine-grained motion adjustments, limiting their widespread adoption in industry. While the Real-Time Action Chunking (RTAC) algorithm has been proposed to address these bottlenecks, bridging the gap between the algorithm provided in pseudocode and a stable, real-world deployment on a low-cost robotic arm remains a challenge. In this work, we present a complete system-level implementation of RTAC tailored for a low-cost robotic manipulation system. We advance beyond the original high-level pseudocode by optimizing the threading implementation for the policy inference and control pipeline, reducing end-to-end latency and improving responsiveness without modifying the underlying policy. We evaluate this system on tasks involving the manipulation of agricultural produce, specifically garlic bulbs and walnuts. Experimental results demonstrate that our custom threading implementation significantly improves control stability and speed compared to the base implementation of RTAC.

2606.00963 2026-06-02 cs.CV cs.CL

Reasmory: 3D Reconstruction as Explicit Memory for VLMs Spatial Reasoning

Reasmory: 3D重建作为VLMs空间推理的显式记忆

Jixuan He, Xueting Li, Chieh Hubert Lin, Ming-Hsuan Yang

发表机构 * Cornell Tech, Cornell University(康奈尔科技学院、康奈尔大学) NVIDIA(英伟达) illoca AI(illoca人工智能) The University of California, Merced(加州大学默塞德分校)

AI总结 提出Reasmory框架,通过结构化程序执行重建的3D显式记忆,并引入轻量级领域特定语言约束VLM查询和操作,在空间推理任务上提升6-18%。

详情
AI中文摘要

视觉语言模型(VLM)展现出新兴的空间推理能力,但在需要精确空间理解的任务(如视角推理、方向比较和距离估计)上仍不可靠。在多视图图像和单目视频中,相关空间线索通常稀疏且分布在冗余观测中,难以组织和利用。基于重建的视觉基础模型(VFM)提供了一种自然的方式将这些观测聚合为显式空间记忆,例如点云。然而,简单地将重建模型作为自由形式工具使用是脆弱的,VLM可能错误调用工具、跳过所需的空间变换或误用中间结果。我们提出Reasmory,一个将空间推理形式化为对重建空间记忆的结构化程序执行的框架。Reasmory构建显式3D记忆,用语义锚定的3D对象实例增强它,并引入轻量级领域特定语言(DSL),约束VLM在推理过程中如何查询对象和相机、变换视角以及渲染观测。生成的程序在执行前被解析和验证,从而比无约束的工具使用更可靠地与空间记忆交互。在多视图图像和视频空间推理基准上的实验表明,与强基线(包括GPT-5-mini和Gemini-3-flash)相比,一致提升6-18%,表明显式3D记忆在通过约束、验证的操作而非自由形式的工具调用访问时最为有用。

英文摘要

Vision-Language Models (VLMs) exhibit emerging spatial reasoning capabilities, yet they remain unreliable on tasks requiring precise spatial understanding, such as viewpoint reasoning, directional comparison, and distance estimation. In multi-view images and monocular videos, relevant spatial cues are often sparse and distributed across redundant observations, making them difficult to organize and exploit. Reconstruction-based Vision Foundation Models (VFMs) offer a natural way to aggregate such observations into explicit spatial memory, such as point clouds. However, simply exposing reconstruction models as free-form tools is brittle: VLMs may invoke tools incorrectly, skip required spatial transformations, or misuse intermediate results. We propose Reasmory, a framework that formulates spatial reasoning as structured program execution over reconstructed spatial memory. Reasmory constructs explicit 3D memory, augments it with semantically grounded 3D object instances, and introduces a lightweight Domain-Specific Language (DSL) that constrains how VLMs query objects and cameras, transform viewpoints, and render observations during reasoning. Generated programs are parsed and validated before execution, enabling more reliable interaction with spatial memory than unconstrained tool use. Experiments on multi-view image and video spatial reasoning benchmarks show consistent gains of 6-18% over strong baselines, including GPT-5-mini and Gemini-3-flash, indicating that explicit 3D memory is most useful when accessed through constrained, validated operations rather than free-form tool calls.

2606.00959 2026-06-02 cs.AI

Towards Understanding Modality Interaction in Multimodal Language Models via Partial Information Decomposition

通过部分信息分解理解多模态语言模型中的模态交互

Wanlong Fang, Tianle Zhang, Wen Tao, Alvin Chan

发表机构 * University of California, Berkeley(加州大学伯克利分校) Stanford University(斯坦福大学)

AI总结 引入部分信息分解(PID)框架,分离感官和语言输入的独特、冗余和协同贡献,揭示多模态大模型中的模态使用模式,并扩展至三模态系统。

详情
Comments
Accepted by ICML 2026
AI中文摘要

理解多模态大语言模型(MLLMs)中的模态交互对于可靠部署至关重要。我们引入部分信息分解(PID)作为决策级框架,将感官和语言输入的独特、冗余和协同贡献分离,超越了表示对齐和基于结果的评估。在视觉-语言基准测试中,PID揭示了重复出现的模态使用模式:推理和接地导向的任务往往表现出高协同性,而专家和知识导向的任务则显示出更强的语言独特性依赖。这些模式在不同模型家族中普遍存在,并能预测对模态级干预的敏感性。我们进一步将PID扩展到三模态系统,提出感官PID,将语言作为控制变量来分解视频-音频信息增益。应用于全模态模型时,感官PID揭示了感官协同瓶颈,即使在音视频融合任务中也以视觉信息为主。最后,PID引导的重新加权为改进多模态推理和接地性能提供了初步证据。

英文摘要

Understanding modality interaction in multimodal large language models (MLLMs) is central to reliable deployment. We introduce Partial Information Decomposition (PID) as a decision-level framework that separates unique, redundant, and synergistic contributions of sensory and linguistic inputs, beyond representation alignment and outcome-based evaluation. Across vision-language benchmarks, PID reveals recurring modality-use profiles: reasoning and grounding-oriented tasks tend to exhibit high synergy, whereas expert and knowledge-oriented tasks show stronger language-unique reliance. These profiles generalize across model families and predict sensitivity to modality-level interventions. We further extend PID to tri-modal systems with Sensory PID, treating language as a control variable to decompose video-audio information gain. Applied to omni-modal models, Sensory PID reveals a sensory synergy bottleneck dominated by visual information even on audio-visual fusion tasks. Finally, PID-guided reweighting provides initial evidence for improving multimodal reasoning and grounding performance.

2606.00957 2026-06-02 cs.CV

Boundary-Protection W8A8 HiFloat8 Quantization for Large-Scale Text-to-Video Diffusion Transformers

面向大规模文生视频扩散Transformer的边界保护W8A8 HiFloat8量化

Yiming Zhao

发表机构 * Yiming Zhao(赵毅铭)

AI总结 针对Wan2.1-T2V-14B模型,提出一种边界保护策略的W8A8 HiF8后训练量化方法,通过保留首尾边界块为BF16而量化中间块,在VBench五个维度上匹配或略优于BF16基线。

详情
Comments
6 pages, 5 figures. Accepted to ICME 2026 Grand Challenge
AI中文摘要

我们提出了一种针对Wan2.1-T2V-14B(一个140亿参数文生视频扩散Transformer)的后训练量化方法,目标是在Ascend 910B NPU上实现W8A8 HiFloat8(HiF8)格式。量化视频DiT模型的一个核心挑战是跨Transformer块的异构激活分布:边界块(前几个和后几个块)表现出与中间块根本不同的统计特性,使得均匀量化无效。我们对所有40个WanAttentionBlock进行了系统的逐块激活分析,并利用这些发现提出了一种边界保护策略,该策略保留前两个和后三个块为BF16,同时用W8A8 HiF8量化剩余的35个块。所提出的PTQ方法在评估的所有五个VBench维度上匹配或略优于BF16基线,表明在5提示评估集内没有可测量的精度损失。对四种保护配置的消融研究证实,完全边界保护产生最高的平均VBench分数,验证了数据驱动的块选择。我们还研究了量化感知训练作为补充微调阶段,并分析了在单卡硬件上它无法优于普通PTQ的条件。

英文摘要

We present a post-training quantization (PTQ) approach for Wan2.1-T2V-14B, a 14-billion-parameter text-to-video diffusion transformer, targeting the W8A8 HiFloat8 (HiF8) format on Ascend 910B NPUs. A central challenge in quantizing video DiT models is the heterogeneous activation distribution across transformer blocks: boundary blocks (the first and last few blocks) exhibit fundamentally different statistical properties from middle blocks, making uniform quantization ineffective. We conduct a systematic per-block activation analysis across all 40 WanAttentionBlocks and use the findings to motivate a boundary-protection strategy that retains the first two and last three blocks in BF16 while quantizing the remaining 35 blocks with W8A8 HiF8. The proposed PTQ method matches or marginally exceeds the BF16 baseline on all five VBench dimensions evaluated, indicating no measurable accuracy loss within the 5-prompt evaluation set. An ablation study over four protection configurations confirms that full boundary protection yields the highest average VBench score, validating the data-driven block selection. We additionally investigate quantization-aware training (QAT) as a complementary fine-tuning stage and analyze the conditions under which it fails to outperform plain PTQ on single-card hardware.
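The boundary-protection layout stated in this abstract (first two and last three of 40 blocks kept in BF16, the remaining 35 quantized) can be written as a small per-block precision plan. This is a sketch only: the function name and the format strings are hypothetical labels, not identifiers from the authors' code or from any Ascend toolkit.

```python
from typing import List


def precision_plan(num_blocks: int = 40, keep_first: int = 2, keep_last: int = 3) -> List[str]:
    """Assign a numeric format to each transformer block.

    The first `keep_first` and last `keep_last` blocks stay in BF16;
    every block in between is marked for W8A8 HiF8 quantization.
    """
    if keep_first < 0 or keep_last < 0 or keep_first + keep_last > num_blocks:
        raise ValueError("protected block counts must fit within num_blocks")
    plan = []
    for index in range(num_blocks):
        is_boundary = index < keep_first or index >= num_blocks - keep_last
        plan.append("bf16" if is_boundary else "w8a8_hif8")
    return plan


plan = precision_plan()
quantized_blocks = [i for i, fmt in enumerate(plan) if fmt == "w8a8_hif8"]
```

With the defaults, `quantized_blocks` covers block indices 2 through 36, which is 35 blocks and matches the count given in the abstract. Changing `keep_first` and `keep_last` gives the kind of alternative protection configurations an ablation would compare.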

2606.00956 2026-06-02 cs.LG

Optimal-Point Variance Reduction For Bayesian Optimization With Regret Guarantee

具有遗憾保证的贝叶斯优化的最优点方差缩减

Shion Takeno

发表机构 * Nagoya University(名古屋大学)

AI总结 提出一种名为最优点方差缩减(OVR)的单步前瞻贝叶斯优化方法,通过后验采样和蒙特卡洛近似实现,并证明了正则化OVR的贝叶斯期望简单遗憾上界趋于零。

详情
Comments
23 pages, 3 figures
AI中文摘要

本文研究了一种单步前瞻贝叶斯优化(BO)方法及其理论保证。尽管单步前瞻BO方法(如熵搜索)的经验有效性已被广泛研究,但它们通常依赖于计算上难以处理的近似,且其遗憾保证仍不完善。因此,本文提出了一种名为最优点方差缩减(OVR)的单步前瞻BO方法,该方法仅需要后验采样和蒙特卡洛近似。我们得到了OVR中蒙特卡洛估计在输入域上的均匀误差界。此外,我们表明,通过轻微修改以促进探索的正则化OVR,实现了贝叶斯期望简单遗憾上界趋于零。最后,我们通过数值实验展示了OVR的有效性。

英文摘要

This paper studies a one-step lookahead Bayesian optimization (BO) method and its theoretical guarantee. Although the empirical effectiveness of one-step lookahead BO methods, such as entropy search, has been studied extensively, they often rely on computationally intractable approximations, and their regret guarantees remain underdeveloped. Thus, this paper proposes a one-step lookahead BO method called optimal-point variance reduction (OVR), which requires only posterior sampling and Monte Carlo approximations. We obtain a uniform error bound over the input domain for the Monte Carlo estimation in OVR. Furthermore, we show that regularized OVR, a slight modification that promotes exploration, achieves a vanishing upper bound on the Bayesian expected simple regret. Finally, we demonstrate the effectiveness of OVR through numerical experiments.

2606.00955 2026-06-02 cs.LG q-bio.QM

CryoProt: A Protein Pretraining Framework with Cross-Box Interactions on Cryo-EM Density Maps

Dan Luo, Xuan Lin, Peng Zhou, Junwen Zhu, Tengfei Ma, Xiangxiang Zeng, Yiping Liu

Affiliations: College of Computer Science and Electronic Engineering, Hunan University; School of Computer Science, Xiangtan University

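AI summary: Proposes the CryoProt framework, which models cross-box interactions in density maps through a multi-head latent attention mechanism and adopts a multi-task pretraining strategy, achieving up to 12% improvement on downstream tasks such as protein flexibility prediction.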


English abstract

Despite the growing availability of cryo-electron microscopy (cryo-EM) density maps, effectively leveraging them for protein representation remains challenging. First, current methods lack a general-purpose protein pretraining framework tailored for cryo-EM density maps, designed for protein-related property prediction. Second, existing approaches typically partition density maps into local box regions and model them independently, overlooking interactions across boxes, which are essential for capturing global structural context in cryo-EM density maps. To address these challenges, we propose CryoProt, a protein pretraining framework designed for cryo-EM density maps. CryoProt introduces a Map Encoder based on multi-head latent attention (MLA), where box-level representations interact through a shared latent space, enabling explicit modeling of cross-box dependencies within the density map. Furthermore, we adopt a multi-task pretraining strategy to learn generalizable representations that can be effectively transferred to diverse downstream tasks, such as protein flexibility prediction, where cryo-EM density maps are not required and can be inferred implicitly by the pretrained model. Experimental results demonstrate that CryoProt consistently outperforms existing state-of-the-art methods across multiple benchmarks, achieving up to 12% improvement over the best-performing baselines, highlighting the importance of modeling cross-box interactions in cryo-EM data. The source code is publicly available at https://anonymous.4open.science/r/CryoProt.

2606.00954 2026-06-02 cs.CV

COLLAR: Cascaded Object-Level Latent Refinement for High-Fidelity Conditional Generation

Xinlong Zhang, Jia Wei, Xiaoyu Zhang, Teng Zhou, Chengyu Lin, Yongchuan Tang

Affiliation: College of Computer Science, Zhejiang University

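AI summary: Proposes the COLLAR framework, which achieves training-free, high-fidelity object-level control in Diffusion Transformers through field-of-view expansion and cascaded object-level latent refinement, outperforming existing methods.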


English abstract

Achieving high-fidelity object-level control in Diffusion Transformers remains a significant challenge despite the introduction of structural priors like depth and Canny maps. Current object-level conditional generation methods frequently suffer from visual artifacts and struggle to maintain precise control over objects within small localized regions. To address these limitations, we propose Cascaded Object-Level Latent Refinement (COLLAR), a training-free framework that progressively optimizes object-level features via Field-of-View (FoV) expansion. First, we propose the Cross-Scale Semantic Alignment (CSSA) module to address spatial-semantic gaps by injecting object-level features into extended-FoV branches via attention mechanisms. To further optimize these features, the Cyclic Feature Injection (CFI) module introduces a reciprocal background feedback mechanism. It leverages a frequency-based adaptive strategy to selectively update the global backbone with context-aligned local information. Finally, the extended-FoV branch serves as a hub for feature optimization, ensuring that object-level features are integrated into the global generation process without compromising final image quality. Extensive experiments on the COCO-MIG and COCO-POS benchmarks demonstrate that our approach consistently outperforms state-of-the-art methods across semantic alignment, image quality, and spatial fidelity.