arXivDaily: arXiv daily academic digest, updated Monday to Friday

2606.16371 2026-06-16 cs.LG New submission

CacheMuon: Using Temporal Preconditioning To Approximate Polar Factor

Bishnu Dev, Sushil Bohara, Martin Takáč, Samuel Horváth

Affiliations: Mohamed bin Zayed University of Artificial Intelligence

AI summary: Proposes CacheMuon, which caches polar factors from earlier optimization steps to cut the cost of the Newton-Schulz iteration in the Muon optimizer, reducing orthogonalization compute while preserving training quality.

Abstract

Muon is an optimizer that computes updates using the polar factor of the momentum matrix and has shown strong empirical performance across a range of training settings. A key component of Muon is the Newton-Schulz iteration used to compute this polar factor. Although this avoids the cost of an exact singular value decomposition, it remains expensive in practice because it is applied at every optimization step. At the same time, the momentum matrix changes smoothly over training, suggesting strong temporal correlation in the corresponding polar factors. In this paper, we exploit this structure and propose CacheMuon, a temporal preconditioning method that reuses information from previous optimization steps to approximate the polar factor at the current step. This reduces redundant orthogonalization computation across iterations. We analyze CacheMuon as an inexact Muon update, with error controlled by fresh-solver error and cache staleness. Empirically, CacheMuon provides a controllable quality-efficiency frontier: conservative thresholds closely match fresh Muon on language-model and vision training while reducing orthogonalization FLOPs, whereas more aggressive thresholds yield larger arithmetic savings at the cost of modest validation-quality degradation.
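
For readers unfamiliar with the Newton-Schulz iteration mentioned in the abstract, the following is a minimal NumPy sketch of its classical cubic form, checked against an SVD-based polar factor. It is an illustration under stated assumptions: the function names and the toy matrix are hypothetical, and the abstract does not specify which polynomial variant Muon uses or how CacheMuon reuses earlier steps, so neither is shown here.

```python
import numpy as np


def svd_polar_factor(m: np.ndarray) -> np.ndarray:
    """Exact polar factor U @ Vt from a thin SVD (the costly reference)."""
    u, _, vt = np.linalg.svd(m, full_matrices=False)
    return u @ vt


def newton_schulz_polar_factor(m: np.ndarray, steps: int = 30) -> np.ndarray:
    """Classical cubic Newton-Schulz iteration (hypothetical helper name).

    Dividing by the Frobenius norm bounds every singular value by 1,
    inside the convergence region of X <- 1.5 X - 0.5 X X^T X.
    """
    x = m / np.linalg.norm(m)
    for _ in range(steps):
        x = 1.5 * x - 0.5 * x @ x.T @ x
    return x


# Toy full-rank matrix built with known singular values 3, 2, 1, 0.5.
rng = np.random.default_rng(0)
q_left, _ = np.linalg.qr(rng.standard_normal((6, 4)))
q_right, _ = np.linalg.qr(rng.standard_normal((4, 4)))
momentum = q_left @ np.diag([3.0, 2.0, 1.0, 0.5]) @ q_right.T

approx = newton_schulz_polar_factor(momentum)
exact = svd_polar_factor(momentum)
max_abs_error = float(np.max(np.abs(approx - exact)))
```

Each iteration costs a few matrix products, which is the per-step expense the abstract describes as remaining significant when paid at every optimization step.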

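2606.16370 2026-06-16 cs.RO New submission

ART-Glove: Articulated Tactile Glove for Contact-Grounded Dexterous Interaction Capture

Changyi Lin, Ding Zhao

Affiliations: Carnegie Mellon University

AI summary: Proposes ART-Glove, an articulated tactile glove with 16 rigid functional surfaces and 22 anatomically aligned joints that synchronously captures 22-DoF joint motion and 2048-taxel contact information to support downstream dexterous robot learning.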

Abstract

We present ART-Glove, an articulated tactile glove designed to capture contact-grounded dexterous demonstrations while preserving human dexterity. ART-Glove makes hand-side contact geometry explicit with 16 rigid functional surfaces covering the fingers, thumb, and palm. Twenty-two anatomically aligned joints connect these surfaces and allow them to follow human hand motion during dexterous manipulation. Encoder-based sensing tracks surface motion, while dense piezoresistive tactile sensing records contact over the same surfaces. The complete system captures synchronized 22-DoF joint measurements and 2048-taxel tactile measurements at 120 Hz. We evaluate ART-Glove across experiments on motion freedom, joint sensing, tactile sensing, and contact-rich interaction capture, demonstrating its ability to preserve human dexterity while recording contact-grounded information that can support downstream dexterous robot learning.

2606.16368 2026-06-16 cs.CL cs.LG New submission

Evaluating LLM Personalization via Semantic Constraint Verification

Xuran Li, Guanqin Zhang, Imran Razzak, Hakim Hacid, Eleanna Kafeza, Hao Xue, Flora D. Salim

Affiliations: University of New South Wales; Mohamed bin Zayed University of Artificial Intelligence; The Technology Innovation Institute; The Hong Kong University of Science and Technology

AI summary: Proposes the NLICV framework, which uses a natural language inference model to map sentences to truth-condition sets and verify personalization constraints, classifies LLM behavior into four modes, aligns closely with human annotations, and substantially reduces latency and cost.

Abstract

Current evaluation paradigms for Large Language Model (LLM) personalization rely heavily on brittle surface-matching metrics or computationally expensive LLM-as-a-judge protocols, both of which lack interpretability. To address these limitations, we introduce Natural Language Inference Constraint Verification (NLICV), a scalable, semantically invariant framework that maps sentence meanings to truth-condition sets to verify personalization constraints via a Natural Language Inference (NLI) model. Moving beyond binary scoring, NLICV categorizes LLM behaviors into four distinct modes: personalization, generalization, sycophancy, and failure. Extensive experiments demonstrate that NLICV aligns closely with human annotations while drastically reducing the latency and token costs associated with LLM judges (up to a 2100× inference speedup). Finally, through an ablation-based procedure, NLICV pinpoints the exact sentences driving the constraint verification, yielding faithful, understandable evidence for its evaluations.

2606.16360 2026-06-16 cs.CL cs.AI New submission

Tyler: Typed Latent Reasoning for Language Models -- When to Think, What to Compute, and How Much to Allocate

Hanyu Lin, Min Cai, Jiawei Wen, Haodi Zhang

Affiliations: Shenzhen University; University of Alberta

AI summary: Proposes the Tyler framework, which uses typed latent reasoning modules and a budget-aware policy to choose dynamically between text generation and latent computation during autoregressive decoding, significantly improving reasoning accuracy and reducing forgetting.

Comments: website: https://typed-latent-reasoning.github.io

Abstract

Chain-of-thought (CoT) prompting improves reasoning in large language models (LLMs) by externalizing intermediate computation as discrete text tokens, but this textual interface also introduces redundancy and inference overhead. Latent reasoning offers a promising alternative by carrying part of the computation in continuous representations. However, existing methods typically predefine when latent computation is invoked and how it is allocated during decoding, leaving a key problem unresolved: when to invoke latent computation, what type of computation to perform, and how much budget to allocate. We propose Typed Latent Reasoning (Tyler), a typed and budget-aware framework for latent reasoning during autoregressive decoding. Tyler learns a policy that, at each decoding step, chooses between emitting a text token and switching to a latent computation module specialized for a particular reasoning function. Once invoked, an operator maps the current reasoning state into latent tokens that support global planning, local state updates, or reusable procedural abstraction. Across extensive experiments on three backbone LLMs, Tyler improves accuracy by up to 14.49 points over CoT and by up to 4.30 points over the strongest competing baseline. It further generalizes across diverse reasoning domains and achieves the best final-stage performance with the lowest forgetting.

2606.16354 2026-06-16 cs.CV New submission

GraphBEV++: Multi-Modal Feature Alignment for Autonomous Driving

Ziying Song, Caiyan Jia, Lin Liu, Shaoqing Xu, Lei Yang, Yadan Luo

Affiliations: Beijing Key Laboratory of Traffic Data Mining and Embodied Intelligence, School of Computer Science and Technology, Beijing Jiaotong University; School of Artificial Intelligence (School of Software), Yanshan University; University of Macau; Nanyang Technological University; The University of Queensland

AI summary: Targets feature misalignment in BEV perception for autonomous driving with the GraphBEV++ framework, whose local (LocalAlign-v2) and global (GlobalAlign-v2) alignment modules use graph matching, deformable offsets, and diffusion denoising to achieve state-of-the-art performance on multiple benchmarks.

Comments: 30 pages, 7 figures

Abstract

Feature misalignment in BEV perception is a critical yet often overlooked challenge in autonomous driving, especially under calibration uncertainties between LiDAR and camera sensors. To address this issue, we propose a robust multi-modal fusion framework, GraphBEV++, which systematically mitigates projection-induced misalignment. The framework consists of two key modules: LocalAlign-v2 and GlobalAlign-v2. LocalAlign-v2 introduces neighborhood-aware depth features via graph matching to correct local misalignment. It supports both LSS-based and query-based BEV representations, making it compatible with BEVFusion and BEVFormer architectures for consistent cross-paradigm alignment. GlobalAlign-v2 encompasses two variants: Deformable and Diffusion. The Deformable variant addresses global misalignment in LSS-based multi-modal BEV by explicitly learning cross-modal feature offsets. In contrast, the Diffusion variant targets implicit misalignment in query-based BEV by injecting noise to simulate misalignment and employing a denoising process to recover aligned features. Experimental results show that GraphBEV++ achieves state-of-the-art performance under misalignment noise on nuScenes and a Waymo subset, improves long-range detection on Argoverse2, and generalizes effectively to the 3D occupancy prediction task, consistently improving occupancy estimation accuracy and robustness under both clean and noisy settings. Furthermore, GraphBEV++ effectively alleviates misalignment issues in end-to-end autonomous driving. Compared with five baselines (UniAD, VAD, FusionAD, MomAD, and WoTE), it demonstrates superior performance in both open-loop (nuScenes) and closed-loop (Bench2Drive and NAVSIM) evaluations across perception, prediction, and planning tasks.

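2606.16353 2026-06-16 cs.CV cs.AI New submission

What Should a Streaming Video Model Remember?

Haonan Ge, Yiwei Wang, Hang Wu, Yujun Cai

Affiliations: University of California, Santa Barbara; University of California, Merced; The University of Queensland

AI summary: Addresses long-range history use under a fixed memory budget in streaming video understanding with SelectStream, a selective latent-memory framework whose three mechanisms (surprise-driven adaptive windowing, priority-preserving consolidation, and query-conditioned graph reasoning) enable efficient online inference and leading performance on multiple benchmarks.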

Abstract

Streaming video understanding models must answer queries at any moment during an ongoing stream, using only what they have observed so far and under fixed memory and computation budgets. Existing methods address this by adding memory banks, retrieval modules, or visual token compression to preserve long-range history. However, strong recent-window baselines show that indiscriminate history injection can dilute current-scene perception, suggesting that the key challenge is not whether to use memory, but how to allocate it selectively. We formulate this as budgeted online latent evidence allocation and propose SelectStream, a selective latent-memory framework that keeps the current observation directly visible to a frozen VLM while exposing historical information only through a compact, query-conditioned evidence budget. Three coordinated mechanisms govern when to write, what to preserve, and how to retrieve: surprise-driven adaptive windowing, priority-preserving consolidation, and query-conditioned graph reasoning over a fixed-capacity latent memory graph. Retrieved evidence is calibrated and injected as latent tokens for answer generation, without replaying frames or growing the context with stream length. Experimental results show that SelectStream achieves strong online streaming performance and preserves general video understanding, reaching 82.67% on StreamingBench, 67.03% on OVO-Bench, and 74.4% average accuracy on offline video benchmarks, while outperforming strong recent-window baselines and prior streaming memory methods.

2606.16352 2026-06-16 cs.LG cs.AI New submission

Communication-Efficient Verifiable Attention for LLM Inference

Ziqun Chen, Ming Wu, Michael Heinrich, Jason Zeng, Huiying Lan, Tianwei Zhang, Rui Tan

Affiliations: Nanyang Technological University; Zero Gravity Labs

AI summary: Proposes VeriAttn, which offloads attention computation to the GPU with verification by the TEE and combines a two-level pipeline with a partitioning strategy to significantly reduce TEE computation and communication overhead, accelerating LLM inference.

Comments: 19 pages, 16 figures

Abstract

Computation integrity of remote large language model (LLM) serving can be questionable. For conventional deep neural networks (DNNs), the existing TEE-shielded DNN partitioning (TSDP) approach uses a Trusted Execution Environment (TEE) to compute non-linear components and verify the integrity of linear components offloaded to an untrusted GPU. However, directly applying TSDP to Transformer-based LLMs incurs significant TEE computation and TEE-GPU communication overhead. This paper presents Communication-efficient TEE-GPU Attention (VeriAttn) for accelerating verifiable LLM inference. VeriAttn offloads both linear and non-linear computations of attention to the GPU, while the TEE performs verification. Moreover, for prefill, VeriAttn uses a two-level pipeline to overlap data movement, TEE pre-/post-processing, and GPU computation. For decoding, when the key-value cache exceeds available GPU memory, VeriAttn partitions attention across TEE and GPU to reduce repeated key-value transfers. Evaluation on an Intel TDX platform shows that VeriAttn achieves 2.60-3.38× and 3.86-5.42× acceleration over TSDP for 6k-token prompts and 10k-token outputs during prefill and decoding, respectively.
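
The abstract does not say how the TEE verifies results returned by the untrusted GPU. As general background only, one well-known way to check an offloaded matrix product cheaply is a Freivalds-style randomized test, sketched below in plain Python. All names are hypothetical, and nothing here should be read as the verification scheme used by TSDP or VeriAttn.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


def matmul(a, b):
    """Plain cubic-time product, standing in for the untrusted accelerator."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def freivalds_check(a, b, claimed, rounds=8, seed=0):
    """Accept `claimed` as a @ b using only matrix-vector products.

    Each round draws a random vector r of nonzero integers and compares
    a @ (b @ r) with claimed @ r, avoiding a full recomputation of a @ b.
    """
    rng = random.Random(seed)
    width = len(b[0])
    for _ in range(rounds):
        r = [rng.randint(1, 2**16) for _ in range(width)]
        if matvec(a, matvec(b, r)) != matvec(claimed, r):
            return False
    return True


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
honest = matmul(a, b)
tampered = [row[:] for row in honest]
tampered[0][0] += 1  # a single corrupted entry

honest_ok = freivalds_check(a, b, honest)
tampered_ok = freivalds_check(a, b, tampered)
```

The verifier only performs matrix-vector products, which is why checks of this kind suit a compute-constrained trusted environment. Exact integer arithmetic is used here to keep the comparison simple; floating-point results would need a tolerance.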

2606.16351 2026-06-16 cs.CL New submission

TMASC: Transmasculine Attitude and Speech Corpus

Sidney Wong

Affiliations: Centre for Sustainability Research, University of Otago; Te Pūnaha Matatini Centre of Research Excellence for Complex Systems

AI summary: Introduces a multimodal corpus of 196 transmasculine individuals, comprising questionnaire responses and 66 audio recordings, to support research on transmasculine individuals.

Comments: Accepted to Interspeech 2026 Main Track

Abstract

We introduce the Transmasculine Attitudes and Speech Corpus (TMASC), a multimodal corpus of 196 transmasculine individuals, including questionnaire responses and 66 audio recordings. The questionnaire includes items exploring the vocal health of transmasculine individuals. The audio recordings include cough and throat-clearing samples, a reading passage, and additional session-specific questions. This paper outlines the development of this corpus and the data collection procedures. To illustrate the utility of this corpus, we present three case studies demonstrating how this crowd-sourced multimodal corpus can be used to support transmasculine individuals. These include the integration of perceptual and acoustic data, the identification of group-level characteristics, and the calibration of acoustic measurements.

2606.16344 2026-06-16 cs.AI cs.CL cs.CY cs.LG New submission

Whose hotel does the AI recommend? An algorithm audit of reputation signals in LLM-assisted hotel selection

Mirza Samad Ahmed Baig, Syeda Anshrah Gillani, Asher Ali

Affiliations: Fandaqah, Al Khobar, Saudi Arabia; Hamdard University, Karachi, Pakistan

AI summary: Audits 12 LLMs with a randomized choice-based conjoint experiment and finds that guest rating and price dominate recommendations, that eco-certification is over-weighted while management responses are ignored, and that list position (a content-free feature) has a causal effect.

Comments: 32 pages

Abstract

Travelers increasingly ask large language model (LLM) assistants which hotel to book, making these systems gatekeepers of property visibility -- yet what moves their recommendations is undocumented. We conduct a pre-specified algorithm audit using a randomized choice-based conjoint: across personas, prompt templates, and twelve open-weight and proprietary models, assistants choose among five hotels whose guest rating, review volume and recency, management response, chain affiliation, price, eco-certification, and list position are independently randomized. We estimate the average marginal component effect of each signal on the probability of recommendation. Guest rating and price dominate (a top rating raises selection by 31.6 percentage points; a high price lowers it by 30.0), reproducing human valence-and-price primacy but over-weighting eco-certification and ignoring management response. List position -- a content-free artifact -- shifts recommendations causally, worth about $12 per night. Stated reasons track revealed weights imperfectly. The findings ground generative engine optimization and the accountability of AI infomediaries in causal evidence.
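
To make the "average marginal component effect" concrete: when attribute levels are independently randomized, a simple estimate is the difference in selection rates between profiles shown at one level and profiles shown at a baseline level. The sketch below illustrates that idea on hand-made toy records. The function name, field names, and data are hypothetical, and the audit itself may use a regression-based estimator rather than this plain difference in means.

```python
def amce_difference_in_means(profiles, attribute, level, baseline):
    """Estimate an AMCE as a difference in selection rates.

    Relies on independent randomization of attributes: the mean of
    `chosen` at `level` minus the mean at `baseline` estimates the effect.
    """
    def selection_rate(value):
        picks = [p["chosen"] for p in profiles if p[attribute] == value]
        return sum(picks) / len(picks)

    return selection_rate(level) - selection_rate(baseline)


# Hand-made toy records, not data from the audit.
toy_profiles = [
    {"rating": "top", "chosen": 1},
    {"rating": "top", "chosen": 1},
    {"rating": "top", "chosen": 1},
    {"rating": "top", "chosen": 0},
    {"rating": "low", "chosen": 0},
    {"rating": "low", "chosen": 1},
    {"rating": "low", "chosen": 0},
    {"rating": "low", "chosen": 0},
]

toy_amce = amce_difference_in_means(toy_profiles, "rating", "top", "low")
```

In this toy set the top-rated profiles are chosen three times out of four and the low-rated ones once out of four, so the estimate is a 50 percentage point effect. The abstract's 31.6-point figure is the same kind of quantity computed on the real experiment.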

2606.16342 2026-06-16 cs.CV New submission

When the Past Matters: FlashBack Memory for Precipitation Nowcasting

Yuhao Du, Boxiao Huang, Chengrong Wu, Jiankai Zhang

Affiliations: College of Atmospheric Sciences, Lanzhou University; Fuqua School of Business, Duke University; Department of Computer Science, University of Manchester; Supercomputing Center of Lanzhou University

AI summary: Proposes the FlashBack Memory module, which dynamically retrieves key historical states and fuses them adaptively to strengthen the spatiotemporal representations of recurrent models, significantly improving the accuracy and temporal consistency of high-resolution precipitation prediction.

Abstract

Accurate precipitation nowcasting is crucial for disaster mitigation and socio-economic planning, yet existing methods often struggle with false alarms, missed events, and long-range dependency modeling at high spatiotemporal resolution. To address these challenges, we propose FlashBack Memory (FB), a module that dynamically retrieves key historical states and integrates them via an adaptive fusion gate, enhancing the spatiotemporal representation capability of recurrent-based models. We incorporate FB into PredRNN, PredRNNpp, MIM, MotionRNN, and PredRNN-V2, and evaluate on CIKM2017, Shanghai2020, and SEVIR datasets. Experimental results demonstrate that FB significantly improves MSE, MAE, SSIM, and CSI metrics, particularly for high-intensity rainfall and long-sequence predictions, while reducing false alarms and missed events and enhancing temporal consistency and spatial localization. The proposed method provides a general and efficient memory enhancement mechanism, improving the overall performance of recurrent-based precipitation nowcasting models.
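
The CSI metric named in the abstract, the Critical Success Index, ties directly to the false alarms and missed events it mentions: CSI = hits / (hits + misses + false alarms) at a chosen intensity threshold. A minimal sketch follows. The function name, the threshold, and the toy values are hypothetical and are not taken from the paper's evaluation.

```python
def critical_success_index(predicted, observed, threshold):
    """CSI = hits / (hits + misses + false alarms) at an intensity threshold."""
    hits = misses = false_alarms = 0
    for p, o in zip(predicted, observed):
        p_event, o_event = p >= threshold, o >= threshold
        if p_event and o_event:
            hits += 1            # event forecast and observed
        elif o_event:
            misses += 1          # event observed but not forecast
        elif p_event:
            false_alarms += 1    # event forecast but not observed
    total = hits + misses + false_alarms
    return hits / total if total else float("nan")


# Toy per-pixel intensities (arbitrary units).
toy_forecast = [6.0, 7.5, 1.0, 0.0, 9.0]
toy_observed = [8.0, 2.0, 5.5, 0.5, 6.0]
toy_csi = critical_success_index(toy_forecast, toy_observed, threshold=5.0)
```

Correct negatives do not enter the score, so CSI is not inflated by the many dry pixels in a typical radar frame.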

2606.16341 2026-06-16 cs.LG cs.DB New submission

Filtered ANN as a Phase Transition: When Selectivity-Estimation Error Causes Plan Regret

Madhulatha Mandarapu, Sandeep Kunkunuru

Affiliations: VaidhyaMegha Private Limited, India

AI summary: Studies how selectivity-estimation error causes plan regret in filtered approximate-nearest-neighbor search, showing that regret arises only near phase boundaries and forms a log-width wedge, validated by finite-size scaling.

Comments: 8 pages, 4 figures. Code, benchmarks, and full pre-registration: https://github.com/samyama-ai/filtered-ann-regret

Abstract

A filtered approximate-nearest-neighbor (ANN) query returns the k nearest vectors among those satisfying an attribute predicate P of selectivity s. The best execution strategy -- pre-filter, post-filter, or in-filter -- changes with s, so a system must estimate s and choose. We model this as an argmax over a landscape with phases (regions where each strategy wins) separated by boundaries, and show that selectivity-estimation error produces plan regret -- recall lost versus the oracle strategy -- only in the critical regions around those boundaries. The regret is a wedge of log-width equal to the multiplicative estimation error epsilon and height equal to the local cliff |V'(s*)| epsilon; the flip-margin 1/|V'(s*)| is the condition number of a sibling cardinality-estimation study reappearing as the local boundary theory. The two phase boundaries follow from independent mathematics: order statistics place the post-filter cliff at s ~ k/K, and site percolation places the in-filter cliff at s_c ~ 0.83/M for graph degree M (corpus-size independent). Criticality exists only under a constrained budget B < sqrt(k n). Under pre-registered decision rules we confirm, on synthetic sweeps and real SIFT1M, that regret concentrates ~290x at the boundary and that the regret curves obey a finite-size scaling collapse onto one universal wedge across two decades of corpus size. A real approximate index does not mis-locate the boundary, but a biased cost model opens a persistent miscalibration band that estimation-error robustness cannot fix. The contribution is a characterization, not a new index. Code and the full pre-registration are public.
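
The post-filter boundary at s ~ k/K can be illustrated with a small binomial calculation. The sketch assumes that K denotes the number of candidates fetched before the filter is applied (the abstract does not define K) and that each candidate passes the predicate independently with probability s. Both are assumptions made for illustration, and the function name is hypothetical.

```python
from math import comb


def prob_enough_survivors(k: int, fetched: int, selectivity: float) -> float:
    """P(at least k of `fetched` candidates pass a filter of given selectivity).

    Assumes independent passes, so the survivor count is
    Binomial(fetched, selectivity).
    """
    p_short = sum(
        comb(fetched, i) * selectivity**i * (1.0 - selectivity) ** (fetched - i)
        for i in range(k)
    )
    return 1.0 - p_short


k, fetched = 10, 100  # boundary estimate k / fetched = 0.1
below = prob_enough_survivors(k, fetched, 0.05)  # selectivity under the boundary
above = prob_enough_survivors(k, fetched, 0.20)  # selectivity over the boundary
```

Under these assumptions the chance of retaining k results moves from small to near-certain as s crosses k/K, which is consistent with the cliff the abstract describes. An estimate of s that lands on the wrong side of that boundary is what would lead a planner to pick the wrong strategy.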

2606.16334 2026-06-16 cs.CV New submission

Chronological Blindness: Benchmarking Temporal Reasoning in Vision-Language Models with CHRONOSIGHT

Parthaw Goswami, Jaynto Goswami Deep

Affiliations: Department of Computer Science, University of Missouri; SAP

AI summary: Proposes the CHRONOSIGHT benchmark, which evaluates the temporal reasoning of vision-language models along five dimensions, finds a large gap between models and humans (human average 0.89, best model 0.40), and shows that fine-tuning significantly improves performance.

Abstract

Human perception of visual scenes is inherently temporal. We instinctively recognise whether a fruit is ripening or rotting, whether construction is progressing or being demolished, and approximately how much time separates two photographs of the same subject. Whether large vision-language models (VLMs) share this competence remains an open and practically important question. We introduce CHRONOSIGHT, a rigorously controlled benchmark evaluating five dimensions of visual temporal reasoning: CHRONORANK (chronological ordering of image sequences), CHRONOLOCATE (ordinal stage localisation from a single image), CHRONODELTA (estimation of time elapsed between two images on a logarithmic scale), CHRONOREVERSE (detection of temporally reversed sequences), and CHRONOODD (identification of a temporal outlier within a set). The benchmark comprises 1,000 items across eight process families (biological growth, food transformation, physical weathering, construction, environmental change, human ageing, astronomical phenomena, and urban dynamics) spanning timescales from minutes to millennia. We evaluate eight open-source VLMs (500M to 19B parameters) under two prompting regimes and collect human performance baselines. Human performance averages 0.89 across tasks; the best open model (Qwen2.5-VL-7B) reaches 0.40 under direct prompting, a gap we term chronological blindness. Lightweight LoRA fine-tuning on 151 examples raises CHRONODELTA accuracy from near-zero to 0.43, transferring zero-shot to related tasks (CHRONOODD: 0.37; CHRONOREVERSE: 0.64), suggesting the bottleneck is partly instruction following rather than visual perception. Benchmark, code, and predictions will be released upon acceptance.

2606.16333 2026-06-16 cs.CV cs.GR cs.LG New submission

Differentiable Packing of Irregular 3D Objects with Adaptive Container Estimation

Palak Gupta, Shanmuganathan Raman

Affiliations: Indian Institute of Technology Gandhinagar

AI summary: Proposes a differentiable packing framework that jointly adjusts object poses and container dimensions through gradient-based optimization, using an adaptive squeezing mechanism and fast tensor-broadcasting computation to produce containers 11-32% smaller than baselines within minutes on a single GPU.

Comments: 20 pages, 8 figures, 5 tables. Under review at Computers & Graphics (Elsevier)

Abstract

Most existing approaches either fix the container in advance or optimize only a single container dimension through an outer search loop, leaving the remaining dimensions as a manual tuning problem. We present a differentiable packing framework that jointly optimizes all 6N object pose parameters and all three container side lengths inside a single gradient-based loop. The formulation combines six physics-inspired, differentiable loss terms computed directly on triangle meshes through axis-aligned bounding-box proxies. An adaptive squeezing mechanism periodically tightens the container whenever the overlap loss falls below a pair-count-scaled threshold, producing a large initial drop in container volume, followed by small refinements. All pairwise computations are written in tensor-broadcasting form, giving a 3.4 to 54 times speedup over a reference loop-based implementation. The pipeline is implemented in Python and PyTorch, with no physics engine, FFT library, or convex decomposition. On multiple object categories, the method produces containers that are 11 to 32 percent smaller than time-matched DBLF and simulated-annealing baselines at N = 100, while running in under 4 minutes per instance on a single consumer GPU.
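
As an illustration of what "tensor-broadcasting form" can mean for a pairwise term, the sketch below sums the overlap volume over all pairs of axis-aligned boxes without a Python loop over pairs. It uses NumPy rather than the PyTorch pipeline the abstract describes, and the function name, array layout, and choice of overlap volume as the penalized quantity are assumptions. The paper's six loss terms are not specified in the abstract.

```python
import numpy as np


def pairwise_aabb_overlap_volume(box_min: np.ndarray, box_max: np.ndarray) -> float:
    """Sum of overlap volumes over all unordered pairs of boxes.

    box_min, box_max: arrays of shape (N, 3). Broadcasting to (N, N, 3)
    replaces the double Python loop over pairs.
    """
    lo = np.maximum(box_min[:, None, :], box_min[None, :, :])
    hi = np.minimum(box_max[:, None, :], box_max[None, :, :])
    extent = np.clip(hi - lo, 0.0, None)   # zero along any axis where boxes are apart
    volume = extent.prod(axis=-1)          # (N, N) pairwise overlap volumes
    rows, cols = np.triu_indices(len(box_min), k=1)  # each pair i < j once
    return float(volume[rows, cols].sum())


# Two unit cubes shifted by 0.5 along x, plus one cube far away.
mins = np.array([[0.0, 0.0, 0.0], [0.5, 0.0, 0.0], [5.0, 5.0, 5.0]])
maxs = mins + 1.0
total_overlap = pairwise_aabb_overlap_volume(mins, maxs)
```

The same expressions carry over to PyTorch tensors with `torch.maximum`, `torch.minimum`, and `torch.clamp`, where they would also provide gradients with respect to the box positions.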

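2606.16331 2026-06-16 cs.LG New submission

Diffusion Offline Reinforcement Learning for Fair and Energy-Efficient UAV-Assisted Wireless Networks

Eslam Eldeeb, Hirley Alves

Affiliations: Centre for Wireless Communications (CWC), University of Oulu

AI summary: Proposes a diffusion soft actor-critic method that combines conservative Q-learning with diffusion models to optimize UAV trajectory and scheduling in offline reinforcement learning, lowering energy consumption, improving fairness, and outperforming existing algorithms.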

Abstract

The integration of generative artificial intelligence with wireless communication and signal processing systems has opened new avenues for intelligent, data-driven decision-making in future 6G networks. This work proposes a diffusion soft actor-critic (Diffusion-SAC) approach that leverages offline reinforcement learning (RL) enhanced by denoising diffusion probabilistic models (DDPMs) to optimize trajectory and scheduling control in unmanned aerial vehicle (UAV) networks. While offline RL methods, such as conservative Q-learning (CQL), can learn from static datasets, they often struggle to generalize in low-data or dynamic conditions. To address this, we combine the robustness of CQL with the generative power of diffusion models, enabling expressive and signal-aware policy learning that generalizes beyond behavior policies. Applied to a UAV-assisted wireless network, the proposed framework minimizes transmission energy and improves fairness among devices. Simulations show that Diffusion-SAC outperforms standard offline RL baselines, achieving more stable convergence and higher rewards even with limited datasets. The method enhances data efficiency, reduces energy consumption, and increases throughput by more than 35% compared to existing algorithms, demonstrating its potential for robust policy learning in next-generation wireless control systems.

2606.16330 2026-06-16 cs.AI New submission

Phase-Aware Guidance Injection for Recurrent MAPPO in Assembly-Line Disruption Recovery

Xin Huang, Yongcai Wang, Fengyi Zhang, Zhikun Tao, Yunjun Han, Naiqi Wu

Affiliations: School of Information, Renmin University of China; State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences; The Information Science Academy, China Electronics Technology Group Corporation; School of Artificial Intelligence, University of Chinese Academy of Sciences; The Institute of Systems Engineering, Macau University of Science and Technology

AI summary: Proposes a phase-aware guidance injection framework that augments a trained recurrent MAPPO scheduling policy with logit-level action bias at evaluation time, using rule-based, replay-based, and online LLM guidance to reduce abnormal recovery time and preserve on-time delivery.

Comments: 6 pages, 4 figures, accepted by the 2026 IEEE International Conference on Automation Science and Engineering (CASE 2026)

Abstract

Disruption recovery in industrial assembly lines requires timely decisions under machine faults, worker absence, and emergency orders. Existing methods either rely on rigid handcrafted recovery logic or learn adaptive policies that do not readily exploit heterogeneous external recovery knowledge at decision time to reduce abnormal recovery time (ART) and preserve on-time delivery (OTD). To address this gap, we propose a phase-aware guidance injection framework that augments a trained recurrent MAPPO (RMAPPO) scheduling policy through logit-level action bias during evaluation. The framework provides a unified decision-time interface for rule-based, replay-based, and online LLM-based guidance, while activating intervention only during abnormal and recovery phases. Experiments on a custom AssemblyLineEnv show that high-quality rule guidance yields the strongest gains, replay-based guidance degrades smoothly under imperfect availability, and online LLM guidance still provides useful intermediate improvements. These results show that decision-time guidance injection can exploit heterogeneous recovery hints without redesigning the actor.
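
The core mechanism described here, a logit-level action bias applied only during abnormal and recovery phases, can be sketched in a few lines. This is a minimal illustration under assumptions: the phase labels, the `strength` scale, and the function names are hypothetical, and the abstract does not say how the bias vector is derived from rule, replay, or LLM guidance.

```python
import math

ACTIVE_PHASES = {"abnormal", "recovery"}  # assumed phase labels


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def guided_action_probs(policy_logits, guidance_bias, phase, strength=1.0):
    """Add a guidance bias to the policy logits only in active phases.

    `strength` is a hypothetical scale on the bias; the trained policy
    itself is left untouched.
    """
    if phase in ACTIVE_PHASES:
        policy_logits = [
            logit + strength * bias
            for logit, bias in zip(policy_logits, guidance_bias)
        ]
    return softmax(policy_logits)


logits = [2.0, 1.0, 0.0]
bias = [0.0, 3.0, 0.0]  # guidance favours action 1
normal_probs = guided_action_probs(logits, bias, phase="normal")
abnormal_probs = guided_action_probs(logits, bias, phase="abnormal")
```

In the normal phase the policy's own preference for action 0 is unchanged. In the abnormal phase the bias shifts the most likely action to action 1 without retraining or modifying the actor, which matches the abstract's framing of a decision-time interface.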

2606.16329 2026-06-16 cs.AI 新提交

Exploiting Search in Symbolic Numeric Planning with Patterns

利用模式在符号数值规划中进行搜索

Matteo Cardellini, Enrico Giunchiglia

发表机构 * DIBRIS, University of Genoa(热那亚大学DIBRIS)

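AI summary: Proposes a numeric planning procedure based on Symbolic Pattern Planning (SPP) that dynamically recomputes patterns and uses intermediate states to guide the search, aiming to improve planning efficiency.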

Comments Under Review at the Journal of Artificial Intelligence Research

详情
AI中文摘要

在本文中,我们提出了一种基于符号模式规划(SPP)的数值规划过程。给定一个数值规划问题 $Π$,一个模式 $\prec$ 是一个动作序列,用于定义一个公式,该公式编码了从起始状态 $S$ 可执行的 $\prec$ 的子序列。Cardellini, Giunchiglia, 和 Maratea (2024a) 遵循规划作为可满足性的方法,在每一步 $n \ge 0$ 定义一个公式 $Π^\prec_n$,其中 $(i)$ 模式 $\prec$ 仅在 $n=0$ 时在 $Π$ 的初始状态 $I$ 中计算,然后在每一步 $n$ 中被利用,$(ii)$ 起始状态 $S$ 设置为 $I$,$(iii)$ 目标集 $G$ 要求在通过将 $\prec$ 的子序列连接 $n$ 次所能达到的最后一个状态中成立。该过程从 $n=0$ 开始,一旦 $Π^\prec_n$ 可满足则终止,否则递增 $n$ 继续。在本文中,可能在每一步,$(i)$ 我们符号化地搜索一个从 $I$ 可达的中间状态 $P$,该状态更接近目标状态,$(ii)$ 动态重计算模式 $\prec_h$ —— 用于下一步 —— 在 $P$ 中,$(iii)$ 精炼用于到达 $P$ 的模式 $\prec_g$,以及 $(iv)$ 从状态 $S$ 开始新的搜索,$S$ 可以是初始状态 $I$ 或最后计算的中间状态 $P$,利用计算出的模式 $\prec_g$ 和 $\prec_h$ 来定义搜索中使用的模式 $\prec$。特别地,在每一步,我们定义一个公式 $Π^{\prec}_{S,P}$,编码存在一个状态 $P'$ 比 $P$ 更接近目标状态,且 $P'$ 从起始状态 $S$ 使用模式 $\prec$ 可达。我们提出了不同的技术来生成这样的公式,每种技术对应一种不同的搜索空间探索策略。我们证明了它们的正确性和完备性,后者在一定条件下成立。

英文摘要

In this paper, we present a procedure for numeric planning based on Symbolic Pattern Planning (SPP). Given a numeric planning problem $Π$, a pattern $\prec$ is a sequence of actions used to define a formula encoding the subsequences of $\prec$ executable from a starting state $S$. Cardellini, Giunchiglia, and Maratea (2024a) follow the Planning as Satisfiability approach by defining, at each step $n \ge 0$, a formula $Π^\prec_n$ in which $(i)$ the pattern $\prec$ is computed only for $n=0$ in the initial state $I$ of $Π$, and then exploited at each step $n$, $(ii)$ the starting state $S$ is set to $I$, and $(iii)$ the set $G$ of goals is required to hold in the last state that can be reached by one of the subsequences of $\prec$ concatenated $n$ times. The procedure begins with $n=0$, terminates as soon as $Π^\prec_n$ is satisfiable, and otherwise proceeds by incrementing $n$. In this paper, possibly at each step, $(i)$ we symbolically search for an intermediate state $P$ reachable from $I$, closer to a goal state, $(ii)$ dynamically recompute the pattern $\prec_h$ -- to be used in the next step -- in $P$, $(iii)$ refine the pattern $\prec_g$ used to reach $P$, and $(iv)$ start the new search from the state $S$ which can be either the initial state $I$ or the last computed intermediate state $P$, exploiting the computed patterns $\prec_g$ and $\prec_h$ to define the pattern $\prec$ to be used in the search. In particular, at each step, we define a formula $Π^{\prec}_{S,P}$ encoding the existence of a state $P'$ closer than $P$ to a goal state, with $P'$ reachable from the starting state $S$ when using the pattern $\prec$. We present different techniques for producing such formulas, each corresponding to a different strategy for exploring the search space. We prove their correctness and completeness, the latter under certain conditions.

2606.16328 2026-06-16 cs.AI 新提交

AdaSTORM: Scaling LLM Reasoning on Dynamic Graphs via Adaptive Spatio-Temporal Multi-Agent Collaboration

AdaSTORM: 通过自适应时空多智能体协作扩展动态图上的LLM推理

Bing Hao, Ruijie Wang, Haodong Qian, Yunlong Chu, Yuhang Liu, Yumeng Lin, Minglai Shao, Jianxin Li

发表机构 * Tianjin University, China(天津大学,中国) Beihang University, China(北京航空航天大学,中国)

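AI summary: Proposes AdaSTORM, a framework that uses adaptive partitioning and spatio-temporally decoupled multi-agent collaboration to scale dynamic graph reasoning to thousand-node graphs with over 90% accuracy and no external tools.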

详情
AI中文摘要

大型语言模型(LLM)在动态图推理中展现出显著潜力,但面临扩展瓶颈:当前模型只能处理数十个节点的图,受限于指数级推理开销和有限的上下文窗口。尽管多智能体系统(MAS)提供了集体推理和拓扑感知编排的能力——这些能力天然适用于图结构任务,但其在动态图上的应用仍未探索。本文提出通过自适应时空多智能体协作扩展动态图上的LLM推理(AdaSTORM),这是一个将大规模动态图推理重构为两个阶段的框架:(i)自适应分区,将大规模动态图划分为与模型推理能力匹配的子区域,同时最小化推理成本;(ii)协作推理,将图分区拓扑与时空解耦的多智能体架构对齐。AdaSTORM是首个专为动态图推理设计的多智能体框架。大量实验表明,AdaSTORM成功突破了扩展瓶颈,将推理扩展到千节点图,在多个大规模动态图设置中准确率超过90%,且无需外部工具,显著优于七个竞争基线。此外,它在现有基准上达到了最先进的准确率,并稳健地泛化到真实世界数据集。源代码可在 https://github.com/irisorchid107/AdaSTORM/ 获取。

英文摘要

Large Language Models (LLMs) demonstrate remarkable potential in dynamic graph reasoning, but suffer from a scaling bottleneck: current models can only handle graphs with tens of nodes, constrained by exponential reasoning overhead and finite context windows. While multi-agent systems (MAS) offer collective reasoning and topology-aware orchestration, capabilities naturally suited for graph-structured tasks, their application to dynamic graphs remains unexplored. This paper presents Scaling LLM Reasoning on Dynamic Graphs via Adaptive Spatio-Temporal Multi-Agent Collaboration (AdaSTORM), a framework that reformulates large-scale dynamic graph reasoning into two stages: (i) Adaptive Partitioning, partitioning large-scale dynamic graphs into subregions that match the model's reasoning capacity while minimizing inference cost; and (ii) Collaborative Reasoning, aligning graph partition topologies with a spatio-temporal decoupled multi-agent architecture. AdaSTORM is the first multi-agent framework tailored for dynamic graph reasoning. Extensive experiments show that AdaSTORM breaks through the scaling bottleneck, scaling reasoning to thousand-node graphs with over 90% accuracy across several large-scale dynamic graph settings without external tools and significantly outperforming seven competitive baselines. Furthermore, it achieves state-of-the-art accuracy on existing benchmarks and generalizes robustly to real-world datasets. The source code is available at: https://github.com/irisorchid107/AdaSTORM/.

2606.16327 2026-06-16 cs.SD cs.AI eess.AS 新提交

ArtBoost: Synthetic Articulatory Data Augmentation for Acoustic-to-Articulatory Inversion

ArtBoost: 用于声学到发音逆映射的合成发音数据增强

Hyung Kyu Kim, Byungchan Hwang, Hak Gu Kim

发表机构 * Anonymous 1(匿名机构1)

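AI summary: Proposes ArtBoost, a data augmentation strategy that extracts pseudo articulatory trajectories from large-scale speech-mesh datasets for pre-training, improving acoustic-to-articulatory inversion under limited EMA data with consistent gains in PCC and RMSE.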

Comments Accepted at Interspeech 2026

详情
AI中文摘要

最近的声学到发音逆映射(AAI)模型依赖于电磁发音描记术(EMA)数据,这些数据成本高昂且规模有限。为了解决这一限制,我们提出了ArtBoost,一种新颖的数据增强策略,利用最初为语音驱动的3D面部动画开发的大规模语音-网格数据集,在有限的EMA监督下改进AAI。ArtBoost从可见的面部锚点提取伪发音轨迹,并在真实EMA数据上微调之前用于预训练。实验显示PCC和RMSE一致改善。轨迹分析证实伪发音信号反映了物理上有意义的可见发音动态。在不同AAI架构上的额外评估表明稳定的性能提升,表明ArtBoost可以集成到多种AAI模型中。这些结果表明语音-网格数据为AAI提供了一种有效且可扩展的发音监督来源。项目页面:https://cau-irislab.github.io/Interspeech26-ArtBoost/

英文摘要

Recent acoustic-to-articulatory inversion (AAI) models rely on electromagnetic articulography (EMA) data, which are costly and limited in scale. To address this limitation, we propose ArtBoost, a novel data augmentation strategy that leverages large-scale speech-mesh datasets originally developed for speech-driven 3D facial animation to improve AAI under limited EMA supervision. ArtBoost extracts pseudo articulatory trajectories from visible facial anchors and uses them for pre-training before fine-tuning on real EMA data. Experiments show consistent improvements in PCC and RMSE. Trajectory analyses confirm that the pseudo articulatory signals reflect physically meaningful visible articulatory dynamics. Additional evaluations across different AAI architectures demonstrate stable performance gains, indicating that ArtBoost can be integrated into diverse AAI models. These results suggest that speech-mesh data provide an effective and scalable source of articulatory supervision for AAI. Project page: https://cau-irislab.github.io/Interspeech26-ArtBoost/
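The two metrics named in the abstract, the Pearson correlation coefficient (PCC) and the root-mean-square error (RMSE), can be computed for a single articulator trajectory as in the minimal sketch below. The trajectory values are made up for illustration and are not taken from the paper.

```python
import math

def rmse(pred, target):
    """Root-mean-square error between two equal-length trajectories."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred))

def pcc(pred, target):
    """Pearson correlation coefficient between two equal-length trajectories."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_t = sum((t - mean_t) ** 2 for t in target)
    return cov / math.sqrt(var_p * var_t)

# Illustrative values only: the prediction has the right shape
# but a constant 0.5 offset from the measured trajectory.
target = [0.0, 1.0, 2.0, 3.0]
pred = [0.5, 1.5, 2.5, 3.5]

error = rmse(pred, target)        # sensitive to the offset
correlation = pcc(pred, target)   # insensitive to the offset
```

The example shows why the two metrics are usually reported together: a constant offset leaves PCC at 1 while RMSE still registers the 0.5 error.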

2606.16325 2026-06-16 cs.CV 新提交

Attention-Based Prototype Calibration for Multi-Rater Few-Shot Medical Image Segmentation

基于注意力机制的原型校准用于多评估者少样本医学图像分割

Truong Vu, Minh Khoi Ho, Yutong Xie

发表机构 * Mohamed bin Zayed University of Artificial Intelligence(穆罕默德·本·扎耶德人工智能大学)

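AI summary: Proposes an attention-based prototype calibration framework that models rater-specific deviations to produce personalized segmentations without modifying the backbone, improving multi-rater few-shot segmentation.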

Comments MICCAI 2026 main track

详情
AI中文摘要

少样本医学图像分割方法通常假设单一真实标注,忽略了临床数据集中常见的不同专家评估者之间的系统性差异。我们提出了一种基于注意力机制的原型校准框架,用于少样本多评估者分割,该框架在原型空间中建模评估者相对于共识表示的特定偏差。一个轻量级且原理性的注意力算子直接优化评估者原型,而不修改骨干特征提取器,使得该方法与现有的基于原型的少样本分割方法完全兼容。这种设计在保持语义一致性的同时,以最小的计算开销实现个性化分割输出。在多评估者医学影像数据集上的实验表明,与基线原型方法相比,该方法持续改进,突出了结构化原型校准在建模标注变异性方面的有效性。我们的代码可在 https://github.com/truong2710-cyber/JAPC 获取。

英文摘要

Few-shot medical image segmentation methods typically assume a single ground-truth annotation, overlooking systematic variability across expert raters commonly observed in clinical datasets. We propose an attention-based prototype calibration framework for few-shot multi-rater segmentation that models rater-specific deviations from a consensus representation in prototype space. A lightweight yet principled attention operator directly refines rater prototypes without modifying the backbone feature extractor, making the approach fully compatible with existing prototype-based few-shot segmentation methods. This design preserves semantic consistency while enabling personalized segmentation outputs with minimal computational overhead. Experiments on multi-rater medical imaging datasets demonstrate consistent improvements over baseline prototype approaches, highlighting the effectiveness of structured prototype calibration for modeling annotation variability. Our code is available at https://github.com/truong2710-cyber/JAPC.

2606.16323 2026-06-16 cs.CV cs.GR 新提交

HAFMat: Hybrid Priors Guided Adaptive Fusion for Single-Image Human Material Estimation

HAFMat: 混合先验引导的自适应融合用于单张图像人体材质估计

Yu Jiang, Jiahao Xia, Jiongming Qin, Jianchi Sun, Chunxia Xiao

发表机构 * School of Computer Science, Wuhan University(武汉大学计算机学院) Faculty of Engineering and IT, University of Technology Sydney(悉尼科技大学工程与信息技术学院)

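AI summary: Proposes HAFMat, which uses hybrid priors (appearance, geometry, structure, and predictions from pre-trained models) to guide adaptive feature fusion, addressing the ill-posedness of single-image human PBR material estimation and reporting state-of-the-art results on synthetic and real data.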

详情
AI中文摘要

基于物理的渲染(PBR)材质估计是一项基础的外观分解任务,在虚拟内容创建、重光照和数字人体渲染中具有广泛应用。然而,从单张人体图像估计PBR材质仍然高度病态,因为光照、几何和反射率在观察到的外观中严重纠缠。为缓解这种歧义,我们提出HAFMat,一种混合先验引导的单图像人体材质估计框架。我们的方法引入编码互补线索的引导图,包括外观、身体几何、结构以及来自预训练模型的先验材质预测。一个关键观察是这些引导线索是异质的:一些线索主要提供纹理级约束,而其他线索传达更高层的语义信息。为利用这一特性,我们设计了一种多层自适应特征融合机制,在不同阶段自适应地将引导特征与解码器特征融合。该设计使纹理主导和语义主导的线索能够在适当层次引导材质解码,从而实现更准确且物理合理的材质估计。在合成和真实数据上的大量实验表明,我们的方法在材质估计和下游重光照任务中达到了最先进的性能。

英文摘要

Physically based rendering (PBR) material estimation is a fundamental appearance decomposition task with broad applications in virtual content creation, relighting, and digital human rendering. However, estimating PBR materials from a single human image remains highly ill-posed, since illumination, geometry, and reflectance are heavily entangled in the observed appearance. To mitigate this ambiguity, we propose HAFMat, a hybrid-prior-guided framework for single-image human material estimation. Our method introduces guidance maps that encode complementary cues, including appearance, body geometry, structure, and prior material predictions from pre-trained models. A key observation is that these guidance cues are heterogeneous: some cues mainly provide texture-level constraints, while others convey higher-level semantic information. To exploit this property, we design a Multi-layer Adaptive Feature Fusion Mechanism, which adaptively fuses guidance features with decoder features at different stages. This design enables texture-dominant and semantic-dominant cues to guide material decoding at appropriate levels, leading to more accurate and physically plausible material estimation. Extensive experiments on both synthetic and real data demonstrate that our method achieves state-of-the-art performance in material estimation and downstream relighting.

2606.16319 2026-06-16 cs.AI 新提交

Architectural Wisdom: A Framework for Governing Optimization in AI Systems

架构智慧:AI系统中优化治理的框架

Edward Y. Chang

发表机构 * Stanford University(斯坦福大学)

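AI summary: Proposes a corrigible objective-governance layer that makes three structural commitments explicit (temporal horizon, relational boundary, and irreversibility) to address failures caused by AI systems optimizing under-specified objectives.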

Comments 17 pages, 2 tables, 2 figures

详情
AI中文摘要

现代AI系统表现出仅靠能力扩展无法可靠修复的结构性失败:它们在缺乏质疑目标是否应该被优化的架构机制下,优化未充分指定的目标。参与度最大化可能放大有害路径;使用工具的智能体可能造成不可逆行动;偏好训练的语言模型可能变得谄媚。我们认为这种失败是智慧问题,而非智能问题。我们有意在架构意义上使用“智慧”,而非将其视为关于美德、意识或道德全知的断言。智能接受目标并在其内优化;智慧质疑目标是否应该被优化。两者是可分离的架构属性。我们提出架构智慧作为优化基底之上的一个可修正的目标治理层。该层在任何行动之前明确并非退化地做出三个结构承诺:时间跨度、关系边界和不可逆性。它由四个组件(结构效用转换器、道德可接受性接口、仲裁与升级控制器、价值修正通道)实现,这些组件计算一个六坐标的智慧元组,涵盖时间跨度、关系覆盖、不可逆性、可接受性、价值修正和可审计性。我们通过八个案例来激励该架构,这些案例来自当代AI失败、世俗智慧传统和艰难伦理情境,并利用目标质疑而非目标接受、Bostrom的正交性、我们示例案例中的结构分离以及尽管能力扩展但持续存在的失败模式,来捍卫该区分与智能完备性论题。该框架是更大架构的概念契约,其形式规范和实证验证将在后续工作中展开。

英文摘要

Modern AI systems exhibit structural failures that capability scaling alone does not reliably fix: they optimize under-specified objectives with no architectural mechanism to question whether the objective should be optimized at all. Engagement maximization can amplify harmful pathways; tool-using agents can commit irreversible actions; preference-trained language models can become sycophantic. We argue that this failure is a wisdom problem, not an intelligence problem. We use "wisdom" in a deliberately architectural sense, not as a claim about virtue, consciousness, or moral omniscience. Intelligence accepts a goal and optimizes within it; wisdom interrogates whether the goal should be optimized at all. The two are separable architectural properties. We propose architectural wisdom as a corrigible objective-governance layer above the optimization substrate. The layer makes three structural commitments explicit and nondegenerate before any action: temporal horizon, relational boundary, and irreversibility. It is realized by four components (Structural Utility Transform, Moral Admissibility Interface, Arbitration and Escalation Controller, Value Revision Channel) that compute a six-coordinate wisdom tuple over horizon, relational coverage, irreversibility, admissibility, value revision, and auditability. We motivate the architecture by eight cases drawn from contemporary AI failures, secular wisdom traditions, and hard ethical situations, and defend the distinction against the intelligence-completeness thesis using goal-questioning over goal-taking, Bostrom's orthogonality, structural separation in our exemplar cases, and persistent failure modes despite capability scaling. The framework is the conceptual contract for a larger architecture whose formal specifications and empirical validation are developed in subsequent work.

2606.16317 2026-06-16 cs.CV 新提交

Training-free sparse attention based on cumulative energy filtering

基于累积能量过滤的无训练稀疏注意力

Chunlu Li, Yixuan Pan, Bai Du, Zhenyuan Chen, Yanzhao Li, Hui Dong, Hui Wang, Zhiqiang Zou

发表机构 * Huawei Technologies(华为技术有限公司)

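AI summary: Proposes a dynamic thresholding strategy that raises sparsity while holding recall fixed and integrates with Flash Attention without extra masking computation; on Wan 2.2, sparsity rises from 61.42% (BLASST) to 82% with a VBench drop of less than 5%.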

详情
AI中文摘要

稀疏注意力通过仅计算重要令牌而跳过其余令牌,加速用于视频生成的扩散变换器(DiTs)。令牌选择策略是平衡稀疏性和准确性的关键。我们将令牌过滤过程形式化为一个双目标优化问题:最大化稀疏性和最小化准确性下降。现有算法无法同时实现这两个目标。例如,Top-p仅考虑准确性约束,而Top-k维持固定的计算预算但放松了准确性约束。本文证明,维持固定的召回率足以保证准确性,而固定阈值对于降低计算成本是次优的。因此,我们提出一种动态阈值方案,在保持相同准确性水平的同时提高稀疏性。此外,我们的算法与Flash Attention(FA)深度集成,无需任何额外的掩码计算开销。在Wan 2.2上的实验结果表明,与同样集成FA的BLASST算法相比,我们的动态阈值策略将稀疏性从61.42%提升至82%,而VBench指标下降小于5%。这导致注意力计算减少约15%,计算效率提升1.61倍,比BLASST高1.18倍。

英文摘要

Sparse attention accelerates Diffusion Transformers (DiTs) for video generation by computing only the important tokens while skipping the rest. The token selection strategy is key to balancing sparsity and accuracy. We formulate the token filtering process as a dual-goal optimization problem: maximizing sparsity and minimizing accuracy degradation. Existing algorithms cannot fulfill both objectives simultaneously. For example, Top-p only considers the accuracy constraint, while Top-k maintains a fixed computational budget but loosens the accuracy constraint. This paper demonstrates that maintaining a fixed recall rate is sufficient for ensuring accuracy, whereas a fixed threshold is suboptimal for reducing computational cost. Therefore, we propose a dynamic thresholding scheme to improve sparsity while maintaining the same level of accuracy. Furthermore, our algorithm is deeply integrated with Flash Attention (FA), eliminating the need for any additional masking computation overhead. Experimental results on Wan 2.2 validate that, compared to the BLASST algorithm, which is also integrated with FA, our dynamic thresholding strategy enhances sparsity from 61.42% to 82% with a VBench metric drop of less than 5%. This results in an approximately 15% reduction in attention computation and a $1.61\times$ increase in computational efficiency, which is $1.18\times$ higher than that of BLASST.
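The contrast the abstract draws between Top-k and Top-p selection can be made concrete with a small sketch. It covers only these two baseline rules applied to one row of normalized attention weights; it does not reproduce the paper's dynamic thresholding or its Flash Attention integration, and the weight values and function names are illustrative assumptions.

```python
def top_k_select(weights, k):
    """Keep the k largest weights: fixed compute budget, variable recall."""
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return sorted(order[:k])

def top_p_select(weights, p):
    """Keep the smallest prefix of sorted weights whose mass reaches p:
    fixed recall target, variable compute budget."""
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += weights[i]
        if mass >= p:
            break
    return sorted(kept)

def recall(weights, kept):
    """Fraction of total attention mass covered by the kept tokens."""
    return sum(weights[i] for i in kept) / sum(weights)

# Two illustrative attention rows: one peaked, one nearly flat.
peaked = [0.70, 0.20, 0.05, 0.03, 0.02]
flat = [0.22, 0.21, 0.20, 0.19, 0.18]

peaked_p = top_p_select(peaked, 0.85)   # few tokens suffice
flat_p = top_p_select(flat, 0.85)       # every token is needed
flat_k = top_k_select(flat, 2)          # cheap, but low recall
```

On the peaked row both rules keep two tokens, whereas on the flat row Top-k with k = 2 covers less than half of the attention mass and Top-p keeps every token, which is the trade-off between accuracy and compute that the abstract describes.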

2606.16313 2026-06-16 cs.RO cs.AI 新提交

Is Your Trajectory Displacement Safe in Long-tail?

你的轨迹位移在长尾场景中安全吗?

Qiao Sun, Weicheng Zheng, Yixin Huang, Hang Zhao

发表机构 * Shanghai Qi Zhi Institute(上海期智研究院) Tsinghua University(清华大学) Tongji University(同济大学)

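AI summary: Proposes FluidTest, an evaluation framework that combines a pairwise WebUI annotation protocol, a taxonomy of 32 semantic threats, and a three-agent verification system to detect additional threats in planner trajectories relative to an expert reference; experiments find that state-of-the-art planners still show substantial safety-relevant failures.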

Comments 20 pages, 15 figures

详情
AI中文摘要

长尾场景仍然是自动驾驶评估的主要瓶颈,即使数据集规模增长数个数量级。现有的评估流水线很少同时具备人类对齐、安全感知、可验证和可解释性:闭环指标在强规划器中常常饱和,而无结构的人类评分在没有精心设计协议的情况下可能充满噪声。我们将规划评估表述为额外威胁检测:给定规划器轨迹和专家参考,规划器的位移是否引入了新的不安全驾驶行为?我们提出FluidTest,一个包含三个组件的评估流水线:用于可靠人工标注的成对WebUI协议;包含32种语义威胁及其基于证据的决策图的分类法;以及一个带有反思机制的三智能体验证系统,用于保证精确性和可审计性。在WOD-E2E数据集上的实验表明,FluidTest在训练过的标注者中产生一致的标签,并在65%的Poutine轨迹和51%的RAP轨迹中识别出额外威胁。这些结果表明,尽管具有高评分者反馈分数(RFS)和低平均位移误差(ADE),最先进的规划器仍可能表现出大量与安全相关的失败。更多细节、指导和代码请访问https://fluidtest.web.app。

英文摘要

Long-tail scenarios remain a major bottleneck for autonomous driving evaluation, even as datasets grow by orders of magnitude. Existing evaluation pipelines are rarely human-aligned, safety-aware, verifiable, and explainable at the same time: closed-loop metrics often saturate among strong planners, while unstructured human ratings can be noisy without a carefully designed protocol. We formulate planning evaluation as additional-threat detection: given a planner trajectory and an expert reference, does the planner's displacement introduce new unsafe driving behavior? We propose FluidTest, an evaluation pipeline with three components: a pairwise WebUI protocol for reliable human annotation; a taxonomy of 32 semantic threats with evidence-grounded decision graphs; and a three-agent verification system with reflection for precision and auditability. Experiments on the WOD-E2E dataset show that FluidTest produces consistent labels among trained annotators and identifies additional threats in 65% of Poutine trajectories and 51% of RAP trajectories. These results show that state-of-the-art planners can still exhibit substantial safety-relevant failures despite high Rater Feedback Scores (RFS) and low Average Displacement Error (ADE). Additional details, guidance, and code are available at https://fluidtest.web.app.
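For reference, Average Displacement Error (ADE) is the mean Euclidean distance between corresponding waypoints of the planned and reference trajectories. The sketch below uses made-up waypoints to show how a trajectory can score a low ADE while the metric says nothing about what the displacement means in the scene.

```python
import math

def average_displacement_error(planned, reference):
    """Mean Euclidean distance between matching (x, y) waypoints."""
    dists = [
        math.hypot(px - rx, py - ry)
        for (px, py), (rx, ry) in zip(planned, reference)
    ]
    return sum(dists) / len(dists)

# Illustrative waypoints in metres: the plan is shifted 0.3 m sideways
# from the expert reference at every step.
reference = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
planned = [(0.0, 0.3), (1.0, 0.3), (2.0, 0.3), (3.0, 0.3)]

ade = average_displacement_error(planned, reference)
```

An ADE of 0.3 m looks small, but whether that lateral shift is harmless or crosses into an occupied lane depends on context that the number does not carry, which is the motivation the abstract gives for threat-level evaluation.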

2606.16310 2026-06-16 cs.LG cs.CL 新提交

QK-Normed MLA: QK normalization without full key caching

QK归一化MLA:无需完整键缓存的QK归一化

Yizhou Han, Yao Zhao, Jun Zhou, Longfei Li, Ruoyu Sun

发表机构 * The Chinese University of Hong Kong(香港中文大学) Ant Group(蚂蚁集团)

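AI summary: Shows how to make QK normalization compatible with MLA by absorbing the static weight into the query-side projection and reducing the dynamic statistic to a per-token scalar, so full keys need not be cached; 400M-scale runs reach lower training loss and better downstream accuracy than QK clipping, with under 2% decode latency overhead.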

Comments 13 pages, 5 figures, conference-style manuscript

详情
AI中文摘要

查询-键(QK)归一化通过控制点积前查询和键的尺度来稳定注意力,但无法直接与多头潜在注意力(MLA)兼容。MLA通过缓存低维潜在状态而非完整键来实现高效解码,而投影后的QK RMSNorm似乎需要对每个缓存的token使用完全投影的键。我们表明这种明显的不兼容性是实现伪影,而非架构约束。RMSNorm分解为静态仿射权重和动态标量RMS统计量。静态键侧权重可以吸收到MLA查询侧投影中;动态键统计量简化为每个token和KV组的一个逆RMS标量。得到的公式在精确算术中与显式投影后QK RMSNorm完全等价,并保留了MLA的潜在解码路径。在我们训练高达100B token的400M参数模型中,QK归一化MLA相比QK裁剪实现了更低的训练损失和更好的下游准确率,而H800解码基准测试显示在高达256k上下文下延迟开销小于2%。这些结果使得QK归一化成为MLA模型实用的稳定选项,无需完整键缓存。

英文摘要

Query-key (QK) normalization stabilizes attention by controlling the scale of queries and keys before the dot product, but is not immediately compatible with Multi-head Latent Attention (MLA). MLA achieves efficient decoding by caching low-dimensional latent states instead of full keys, whereas post-projection QK RMSNorm appears to require the fully projected key for every cached token. We show this apparent incompatibility is an implementation artifact, not an architectural constraint. RMSNorm decomposes into a static affine weight and a dynamic scalar RMS statistic. The static key-side weight can be absorbed into the MLA query-side projection; the dynamic key statistic reduces to one inverse-RMS scalar per token and KV group. The resulting formulation is exactly equivalent to explicit post-projection QK RMSNorm in exact arithmetic and preserves MLA's latent decode path. In our 400M runs trained for up to 100B tokens, QK-Normed MLA achieves lower training loss and better downstream accuracy than QK clipping, while H800 decode benchmarks show less than 2% latency overhead up to 256k context. These results make QK normalization a practical stabilization option for MLA models without requiring full-key caching.
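The decomposition described above can be checked numerically on a toy example. The sketch below is a simplification under stated assumptions: a single head, no rotary embeddings, no query-side normalization, and tiny made-up matrices. The names `W_k`, `latent`, and `inv_rms` are illustrative and not taken from the paper. It compares the explicit path (project the key, apply RMSNorm, take the dot product) with the absorbed path (fold the weight and the key projection into the query, score against the cached latent, scale by one cached scalar).

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(m, v):
    return [dot(row, v) for row in m]

def transpose(m):
    return [list(col) for col in zip(*m)]

def inverse_rms(x, eps=1e-6):
    """Dynamic part of RMSNorm: one scalar per token."""
    return 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)

# Toy sizes: head dimension 3, latent dimension 2 (illustrative values).
W_k = [[0.5, -1.0], [1.5, 0.25], [-0.75, 2.0]]  # key up-projection
latent = [0.8, -0.4]                            # cached latent state
q = [1.0, -2.0, 0.5]                            # query for the same head
w = [1.2, 0.9, 1.1]                             # static RMSNorm weight

# Explicit path: materialize the full key, normalize it, then score.
k = matvec(W_k, latent)
k_normed = [wi * ki * inverse_rms(k) for wi, ki in zip(w, k)]
explicit = dot(q, k_normed)

# Absorbed path: the scalar is computed once when the token is cached;
# at decode time only the latent and that scalar are needed.
inv_rms = inverse_rms(k)
q_absorbed = matvec(transpose(W_k), [qi * wi for qi, wi in zip(q, w)])
absorbed = dot(q_absorbed, latent) * inv_rms
```

Both paths give the same attention logit up to floating-point rounding, because the elementwise weight and the linear projection commute with the dot product and the RMS statistic is a scalar.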

2606.16307 2026-06-16 cs.AI cs.CL 新提交

State-Grounded Multi-Agent Synthetic Data Generation for Tool-Augmented LLMs

面向工具增强型大语言模型的基于状态的多智能体合成数据生成

Rahul Khedar, Eshita, Sneha Teja Sree Reddy Thondapu, Mayank Malhotra, Arup Das, Jitesh Chandra, Yun-Shiuan Chuang, Chaitanya Kulkarni, Arun Menon, Linsey Pang, Avinash Karn, Mouli V, Prakhar Mehrotra

发表机构 * PayPal AI

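AI summary: Presents StateGen, a platform that generates multi-turn, tool-grounded training conversations through a four-role LLM loop and an authoritative state manager, eliminating the dominant class of tool-call hallucinations by construction and supporting hierarchical multi-agent settings.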

Comments 9 pages, 5 figures, 6 tables, 1 algorithm

详情
AI中文摘要

训练工具增强型LLM代理需要大量多轮、工具接地的对话数据,这些数据标注成本高、生产环境中受隐私限制,且公共数据集中基本缺失。我们提出StateGen,一个合成数据生成平台,通过编排四角色LLM循环(角色条件用户模拟器、被测代理、状态接地工具模拟器和多轴LLM评判器)生成带有评分和丰富推理轨迹的训练对话。关键架构贡献是一个权威状态管理器,它在多轮对话中维护一个结构化的世界状态对象,强制执行后端即事实的不变性,从而从结构上消除了最主要的工具调用幻觉类别。StateGen通过将子代理声明为工具(所有子代理共享一个状态对象)自然地扩展到层次化多智能体设置。我们在三个生产语料库上报告了64,698个评估对话的结果:工具调用幻觉得分达到9.66/10,系统通过23维特征向量支持角色驱动变化,并且干净分离的训练集和黄金评估集划分确认数据不是记忆诱饵(按标准差距分析)。与八个外部系统的比较表明,没有单一公开平台同时具备多轮生成、状态接地工具模拟、层次化多智能体支持和内置评判器评分功能。

英文摘要

Training tool-augmented LLM agents requires large corpora of multi-turn, tool-grounded conversational data that is expensive to annotate, privacy-constrained in production settings, and largely absent from public datasets. We present StateGen, a synthetic data generation platform that produces scored, reasoning-trace-rich training conversations by orchestrating a four-role LLM loop: a persona-conditioned user simulator, an agent under test, a state-grounded tool simulator, and a multi-axis LLM judge. The key architectural contribution is an authoritative state manager that maintains a structured world-state object across turns, enforcing a backend-is-truth invariant that eliminates the dominant class of tool-call hallucinations by construction. StateGen extends naturally to hierarchical multi-agent settings by declaring sub-agents as tools, all sharing a single state object. We report results on 64,698 evaluated conversations across three production corpora: tool-call hallucination scores reach 9.66/10, the system supports persona-driven variation via a 23-dimensional trait vector, and a cleanly separated train and golden evaluation set split confirms the data is not memorization bait (per-criterion gap analysis). Comparison with eight external systems shows that no single publicly available platform combines multi-turn generation, state-grounded tool simulation, hierarchical multi-agent support, and built-in judge scoring.
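One way to picture the backend-is-truth invariant is a tool simulator whose replies are derived only from a single shared state object, never generated freely. The sketch below is a hypothetical illustration, not StateGen's actual interface: the class `StateManager`, the tool `cancel_order`, and the order data are all invented for the example.

```python
class StateManager:
    """Single authoritative world state shared by all simulated tools."""

    def __init__(self, initial):
        self._state = dict(initial)

    def get(self, key):
        return self._state.get(key)

    def set(self, key, value):
        self._state[key] = value

def cancel_order(state, order_id):
    """Hypothetical tool: every reply is computed from the stored state."""
    status = state.get(order_id)
    if status is None:
        return {"ok": False, "reason": "unknown order"}
    if status == "shipped":
        return {"ok": False, "reason": "already shipped"}
    state.set(order_id, "cancelled")
    return {"ok": True, "status": "cancelled"}

# Invented example data: two orders known to the backend.
state = StateManager({"A1": "pending", "B2": "shipped"})

first = cancel_order(state, "A1")   # allowed, state is updated
second = cancel_order(state, "B2")  # refused, state says it shipped
third = cancel_order(state, "C3")   # refused, order does not exist
```

Because the tool cannot return anything the state does not support, a later turn that asks about order A1 sees the cancellation, and a request about a non-existent order cannot produce a fabricated success.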

2606.16302 2026-06-16 cs.CV 新提交

Explainable Flood Segmentation on Sentinel-1 SAR Imagery: A Comparative Study of CNN and Transformer Architectures

可解释的Sentinel-1 SAR影像洪水分割:CNN与Transformer架构的比较研究

Arundhuti Banerjee, David Daou

发表机构 * United Nations University's Institute for Environment and Human Security (UNU-EHS)(联合国大学环境与人类安全研究所(UNU-EHS))

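AI summary: Compares CNN and vision transformer architectures for multi-class flood segmentation on Sentinel-1 SAR imagery; SegFormer-b2 significantly outperforms U-Net on the ETCI dataset, but the advantage narrows on Sen1Floods11, and explainability techniques are used to analyze model decisions.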

详情
AI中文摘要

快速准确的洪水预测对于灾害响应和减灾规划至关重要。卫星上的合成孔径雷达(SAR)传感器非常适合这一目的,因为它们独立于天气和日光条件运行。尽管基于SAR的数据能够实现全天候洪水监测,但区分被淹没的土地和永久水体仍然是一个重大挑战,特别是当洪水严格定义为被淹没的土地时。本研究提供了卷积神经网络(CNN)和视觉Transformer架构在多类洪水分割中的全面比较,使用Sentinel-1 SAR影像,专门训练以区分被淹没的土地、永久水体和陆地。三个基于CNN的最先进模型U-Net、U-Net++和带ResNet-34骨干的DeepLabV3,以及三个SegFormer变体(b0, b1, b2)在两个基准数据集ETCI NASA和Sen1Floods11上进行了评估,采用基于场景的数据划分以确保空间泛化的现实评估。结果表明,SegFormer-b2在ETCI数据集上显著优于U-Net基线(在Wilcoxon符号秩检验中,所有7个测试场景的洪水IoU更高),而在Sen1Floods11上微调后,优势缩小到场景变异范围内,并集中在空间碎片化的洪水事件中。研究包括定性和定量的可解释性技术,以直观理解模型决策并系统评估预测可靠性。定性分析显示,SegFormer-b2产生更空间连贯的Grad-CAM激活,聚焦于洪水相关特征,而U-Net在洪水边界处产生更具信息量的不确定性估计。

英文摘要

Rapid and accurate flood prediction is essential for disaster response and mitigation planning. Synthetic Aperture Radar (SAR) sensors on satellites are well suited for this purpose because they operate independently of weather and daylight conditions. Although SAR-based data enable all-weather flood monitoring, distinguishing flooded land from permanent water remains a significant challenge, particularly when flooding is defined strictly as inundated land. This study provides a comprehensive comparison of convolutional neural network (CNN) and vision transformer architectures for multi-class flood segmentation using Sentinel-1 SAR imagery, specifically trained to separate flooded land from permanent water bodies and non-flooded land. Three state-of-the-art (SOTA) CNN-based models (U-Net, U-Net++, and DeepLabV3 with a ResNet-34 backbone) and three SegFormer variants (b0, b1, b2) were evaluated on two benchmark datasets, the ETCI NASA dataset and Sen1Floods11, using scene-based data splits to ensure a realistic assessment of spatial generalization. The results demonstrate that SegFormer-b2 significantly outperforms the U-Net baseline on the ETCI dataset (higher flood IoU across all 7 test scenes in the Wilcoxon signed-rank test), while after fine-tuning on Sen1Floods11, the advantage narrows to within the range of scene variability and is concentrated in spatially fragmented flood events. The study includes both qualitative and quantitative explainability techniques to visually comprehend model decisions and systematically assess prediction reliability. Qualitative analysis reveals that SegFormer-b2 produces more spatially coherent Grad-CAM activations focused on flood-relevant features, while U-Net generates more informative uncertainty estimates along flood boundaries.
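The flood IoU used for the comparison is the intersection over union computed for the flooded-land class alone, separately from permanent water. A minimal sketch on a flattened toy label map follows; the class ids and pixel values are assumptions made for illustration and do not come from either dataset.

```python
# Hypothetical class ids for the three-class setup described above.
LAND, PERMANENT_WATER, FLOOD = 0, 1, 2

def class_iou(pred, target, cls):
    """Intersection over union for one class on flattened label maps."""
    inter = sum(1 for p, t in zip(pred, target) if p == cls and t == cls)
    union = sum(1 for p, t in zip(pred, target) if p == cls or t == cls)
    return inter / union if union else float("nan")

# Eight illustrative pixels: ground truth and a model prediction.
target = [0, 0, 1, 1, 2, 2, 2, 0]
pred = [0, 2, 1, 1, 2, 2, 0, 0]

flood_iou = class_iou(pred, target, FLOOD)
water_iou = class_iou(pred, target, PERMANENT_WATER)
```

Here permanent water is segmented perfectly while flooded land is confused with dry land in both directions, so a water-inclusive score would look much better than the flood IoU of 0.5.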

2606.16301 2026-06-16 cs.LG stat.ML 新提交

One-Step Generalization Ratio Guided Optimization for Domain Generalization

一步泛化比率引导的域泛化优化

Sumin Cho, Dongwon Kim, Kwangsu Kim

发表机构 * Korea Advanced Institute of Science and Technology (KAIST)(韩国科学技术院)

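AI summary: Proposes GENIE, an optimizer that uses the One-Step Generalization Ratio (OSGR) to dynamically equalize parameter updates, which discourages spurious correlations, promotes domain-invariant feature learning, and outperforms existing optimizers on domain generalization.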

Comments 29 pages, accepted at the 42nd International Conference on Machine Learning (ICML 2025)

详情
AI中文摘要

域泛化(DG)旨在训练模型泛化到未见过的目标域,但常常过拟合到域特定特征,即所谓的非期望相关性。基于梯度的DG方法通常引导梯度朝向主导方向,但往往无意中强化了虚假相关性。最近的工作采用dropout来正则化过度自信的参数,但未明确调整梯度对齐或确保平衡的参数更新。我们提出GENIE(泛化增强迭代均衡器),一种新颖的优化器,利用一步泛化比率(OSGR)量化每个参数对损失减少的贡献并评估梯度对齐。通过预条件因子动态均衡OSGR,GENIE防止少量参数主导优化,从而促进域不变特征学习。理论上,GENIE平衡参数间的收敛贡献和梯度对齐,在保持SGD收敛速度的同时实现更高的OSGR。实验上,它优于现有优化器,并在与各种DG和单DG方法集成时提升性能。

英文摘要

Domain Generalization (DG) aims to train models that generalize to unseen target domains, but such models often overfit to domain-specific features, known as undesired correlations. Gradient-based DG methods typically guide gradients in a dominant direction but often inadvertently reinforce spurious correlations. Recent work has employed dropout to regularize overconfident parameters, but has not explicitly adjusted gradient alignment or ensured balanced parameter updates. We propose GENIE (Generalization-ENhancing Iterative Equalizer), a novel optimizer that leverages the One-Step Generalization Ratio (OSGR) to quantify each parameter's contribution to loss reduction and assess gradient alignment. By dynamically equalizing OSGR via a preconditioning factor, GENIE prevents a small subset of parameters from dominating optimization, thereby promoting domain-invariant feature learning. Theoretically, GENIE balances convergence contribution and gradient alignment among parameters, achieving higher OSGR while retaining SGD's convergence rate. Empirically, it outperforms existing optimizers and enhances performance when integrated with various DG and single-DG methods.

2606.16298 2026-06-16 cs.CV 新提交

DDTNet: Degradation Disentanglement and Transfer Network for Test-Time All-in-One De-weathering Adaptation

DDTNet:面向测试时全能去天气适应的退化解缠与迁移网络

Kuan-Hung Lin, Fu-Jen Tsai, Yan-Tsung Peng, Min-Hung Chen, Chia-Wen Lin, Yen-Yu Lin

发表机构 * National Yang Ming Chiao Tung University(国立阳明交通大学) National Tsing Hua University(国立清华大学) National Chengchi University(国立政治大学) NVIDIA(英伟达)

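AI summary: Proposes DDTNet, which disentangles degradation patterns from target-domain images and transfers them onto source-domain clean images to generate domain-adaptive paired training data for fine-tuning restoration models across weather conditions and domains; its core is the Degradation Disentanglement Module (DDM) with Degradation Coupled Attention (DCA).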

详情
AI中文摘要

全能型恶劣天气图像恢复旨在使用单一统一模型去除多种退化,如雨、雾和雪。尽管具有广泛适用性,现有方法通常以牺牲性能为代价,对单个退化类型提供平衡但次优的结果。当训练和测试数据之间存在域差距时,这一问题变得更加突出。受退化模式建模比恢复干净内容更可行的观察启发,我们提出了退化解缠与迁移网络(DDTNet),该网络专门关注退化迁移。通过从目标域退化图像中解缠退化模式并将其迁移到源域干净图像,DDTNet生成域自适应的配对训练数据。这些配对数据随后用于微调恢复模型,显著增强其在各种天气条件和域上的适应性。DDTNet的核心是退化解缠模块(DDM),该模块包含退化耦合注意力(DCA),用于捕获通用和特定天气特征,从而实现退化模式的有效解缠和迁移。实验结果表明,DDTNet在真实世界的去雨、去雪和去雾数据集上显著且一致地改进了现有的全能型模型。

英文摘要

All-in-one adverse weather image restoration aims to remove multiple degradations, such as rain, haze, and snow, using a single unified model. Despite their broad applicability, existing methods typically compromise performance, delivering balanced but suboptimal results for individual degradation types. This issue becomes more pronounced when a domain gap exists between training and testing data. Motivated by the observation that modeling degradation patterns is more feasible than recovering clean content, we propose the Degradation Disentanglement and Transfer Network (DDTNet), which focuses specifically on degradation transfer. By disentangling degradation patterns from target-domain degraded images and transferring them to source domain clean images, DDTNet generates domain-adaptive paired training data. These pairs are then used to fine-tune restoration models, significantly enhancing their adaptability across diverse weather conditions and domains. The core of DDTNet is the Degradation Disentanglement Module (DDM), which comprises Degradation Coupled Attention (DCA) to capture both general and weather-specific features, thereby enabling effective disentanglement and transfer of degradation patterns. Experimental results demonstrate that DDTNet significantly and consistently improves existing all-in-one models across real-world deraining, desnowing, and dehazing datasets.

2606.16295 2026-06-16 cs.CV cs.CL 新提交

VisualClaw: A Real-Time, Personalized Agent for the Physical World

VisualClaw:面向物理世界的实时个性化智能体

Haoqin Tu, Jianwen Chen, Zijun Wang, Siwei Han, Juncheng Wu, Hardy Chen, Haonian Ji, Kaiwen Xiong, Jiaqi Liu, Peng Xia, Jieru Mei, Hongliang Fei, Jason Eshraghian, Zeyu Zheng, Yuyin Zhou, Huaxiu Yao, Cihang Xie

发表机构 * UC Santa Cruz(加州大学圣克鲁兹分校) UNC-Chapel Hill(北卡罗来纳大学教堂山分校) Google(谷歌) UC Berkeley(加州大学伯克利分校)

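AI summary: Proposes VisualClaw, a self-evolving multimodal agent that uses hybrid encoding and skill evolution to lower deployment cost and improve accuracy, cutting per-question API cost by an average of 98% versus full-frame upload across four video-QA benchmarks, with a peak accuracy gain of +15.80% on EgoSchema.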

Comments H. T. and J. C. contribute to this project equally

详情
AI中文摘要

视觉语言模型(VLM)正作为复杂多模态任务的通用接口。然而,部署仍面临三个差距:VLM在处理密集视频帧和长提示时通常产生高延迟和成本,智能体框架在部署后保持静态,标准视频QA基准不测试智能体是否能在工具使用工作区内使用视觉证据。我们提出VisualClaw,一个围绕两个原则构建的自进化多模态智能体。首先,混合编码通过级联门过滤信息较少的流式帧,并通过热/冷top-k注入压缩文本技能库,从而降低部署成本。其次,技能进化让智能体从失败中学习:检索的记忆作为直接拼接上下文或引导证据条件化进化器,产生技能库更新以帮助回答后续问题。在4个视频QA基准上使用2个VLM,VisualClaw相比全帧上传平均降低每问题API成本98%,相比离线均匀8帧基线降低25.9%,同时在大多数设置中提升准确率,例如在EgoSchema上使用Gemini 3 Flash平均+3.85%,峰值+15.80%。为弥补第三个差距,我们整理了VisualClawArena,一个通过严格五阶段流程构建的200场景多模态智能体基准;模型必须使用视频证据、文档、动态更新和工作区内的可执行检查。在VisualClawArena上,相同的框架配合计算机使用智能体后端,相比无进化基线,Codex (GPT-5.5)的宏观准确率提升+2.9%,Claude Code (Sonnet 4.6)提升+3.2%,相比均匀采样基线成本降低9.5%。这些特性使VisualClaw自然适用于边缘应用,其中级联将1小时流式会话从约3,600次API上传减少到仅5-20次调用,自进化使其成为完美的个性化助手。

英文摘要

Vision language models (VLMs) are serving as general-purpose interfaces for complex multimodal tasks. However, deployment still faces three gaps: VLMs typically incur high latency and cost when processing dense video frames and long prompts, the agent scaffold remains static after deployment, and standard video-QA benchmarks do not test whether agents can use visual evidence inside tool-using workspaces. We present VisualClaw, a self-evolving multimodal agent built around two principles. First, hybrid encoding reduces deployment cost by filtering less informative streaming frames with a cascaded gate and compressing the text skill bank through hot/cold top-k injection. Second, skill evolution lets the agent learn from failures: retrieved memories condition an evolver as direct concatenated context or as guided evidence, producing skill-bank updates that help future questions. Across 4 video-QA benchmarks with 2 VLMs, VisualClaw cuts per-question API cost by an average of 98% versus full-frame upload and by 25.9% relative to the offline uniform 8-frame baseline, while boosting accuracy in most settings, e.g., an average +3.85% and a peak +15.80% on EgoSchema with Gemini 3 Flash. To address the third gap, we curate VisualClawArena, a 200-scenario multimodal agentic benchmark built through a strict five-stage pipeline; models must use video evidence, documents, dynamic updates, and executable checks inside a workspace. On VisualClawArena, the same framework with computer-use agent backends improves macro accuracy by +2.9% for Codex (GPT-5.5) and +3.2% for Claude Code (Sonnet 4.6) over no-evolution baselines, with a 9.5% cost reduction compared to the uniform-sampled baseline. These properties make VisualClaw a natural fit for edge applications, where the cascade reduces a 1-hour streaming session from ~3,600 API uploads down to only 5-20 calls and the self-evolution makes it a perfect personalized assistant.

2606.16294 2026-06-16 cs.CV q-bio.NC New submission

Sex-based Network-Specific Differences in Connectomes: A Krakencoder-Based Analysis

Vibhashree S H, Debanjali Bhattacharya, Vamshi Krishna Kancharla, Neelam Sinha

Affiliations * Centre for Brain Research, Indian Institute of Science; Dept. of Artificial Intelligence, Amrita School of Artificial Intelligence, Amrita Vishwa Vidyapeetham

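AI summary Uses the Krakencoder framework to simulate how deficits propagate between brain connectome modalities. Analyzing structural and functional connectomes from 702 healthy participants, the study finds that the Default Mode Network produces the largest perturbations and the Somatomotor network the smallest, and that connectomes predicted from intact inputs retain more sex-discriminative information.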

Details
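AI abstract (translated from Chinese)

This study uses the Krakencoder as a simulation framework to examine how deficits in one brain connectome modality propagate to the other. Structural and functional connectomes from 702 healthy participants in the Human Connectome Project were analyzed, and the effect of each Yeo-7 functional network was assessed separately. Seven scenarios were considered, each removing a single network while keeping the rest. Perturbations in cross-modal prediction were quantified with three complementary metrics: KL divergence on eigenvalue spectra, Frobenius norm, and Wasserstein distance. The persistence of sex-specific information in the predicted connectomes was also evaluated. Across all metrics and both prediction directions, the Default Mode Network produced the largest perturbations and the Somatomotor network the smallest. Sex differences in network-level perturbation signatures were subtle: the best result was 66.09% accuracy, from connectomes predicted under network-removal conditions. By contrast, connectomes predicted from intact inputs reached higher sex classification accuracy, up to 84.76%. These findings confirm that full predicted connectomes retain considerably more sex-discriminative information than perturbation-based features alone.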

English abstract

This study examines how deficiencies in one brain connectome modality propagate to the other, using the Krakencoder as a simulation framework. Structural and functional connectomes from 702 healthy participants in the Human Connectome Project were analyzed, with the impact of each of the Yeo-7 functional networks assessed separately. Seven scenarios were considered, each involving the removal of a single network while the remaining networks were preserved. The resulting perturbations in cross-modal predictions were quantified using three complementary metrics: KL divergence on eigenvalue spectra, Frobenius norm, and Wasserstein distance. In addition, the persistence of sex-specific information within the predicted connectomes was evaluated. Across all metrics and both prediction directions, the Default Mode Network produced the largest perturbations, whereas the Somatomotor network yielded the smallest. Sex differences in network-level perturbation signatures were subtle, with the best result being an accuracy of 66.09% from connectomes predicted under network-removal conditions. In contrast, connectomes predicted from intact inputs achieved substantially higher sex classification accuracy, reaching up to 84.76%. These findings confirm that full predicted connectomes retain considerably more sex-discriminative information than perturbation-derived signatures alone.
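
The three perturbation metrics named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' code, and the abstract does not specify the details, so several choices here are assumptions: eigenvalue magnitudes are normalized into a probability vector before computing KL divergence, the Wasserstein distance is the 1-D Wasserstein-1 distance between upper-triangle edge weights, and all function names (`spectrum_distribution`, `spectral_kl`, `frobenius_distance`, `wasserstein_edges`, `perturbation_metrics`) are hypothetical.

```python
import numpy as np


def spectrum_distribution(conn, eps=1e-12):
    """Normalize eigenvalue magnitudes of a symmetric matrix to sum to 1.

    Assumption: magnitudes are used so every entry is non-negative; the
    paper may normalize its spectra differently.
    """
    eigvals = np.linalg.eigvalsh(conn)           # real eigenvalues, ascending
    mags = np.sort(np.abs(eigvals))[::-1] + eps  # descending, strictly positive
    return mags / mags.sum()


def spectral_kl(baseline, perturbed):
    """KL divergence KL(p || q) between the two normalized spectra."""
    p = spectrum_distribution(baseline)
    q = spectrum_distribution(perturbed)
    return float(np.sum(p * np.log(p / q)))


def frobenius_distance(baseline, perturbed):
    """Frobenius norm of the element-wise difference."""
    return float(np.linalg.norm(baseline - perturbed, ord="fro"))


def wasserstein_edges(baseline, perturbed):
    """1-D Wasserstein-1 distance between upper-triangle edge weights.

    For two equal-size samples this equals the mean absolute difference
    of the sorted values.
    """
    iu = np.triu_indices_from(baseline, k=1)
    a = np.sort(baseline[iu])
    b = np.sort(perturbed[iu])
    return float(np.mean(np.abs(a - b)))


def perturbation_metrics(baseline, perturbed):
    """Collect all three metrics for one (baseline, perturbed) pair."""
    return {
        "spectral_kl": spectral_kl(baseline, perturbed),
        "frobenius": frobenius_distance(baseline, perturbed),
        "wasserstein_edges": wasserstein_edges(baseline, perturbed),
    }


if __name__ == "__main__":
    # Synthetic symmetric matrices stand in for predicted connectomes.
    rng = np.random.default_rng(0)
    m = rng.normal(size=(8, 8))
    baseline = (m + m.T) / 2
    n = rng.normal(size=(8, 8))
    perturbed = baseline + 0.1 * (n + n.T) / 2
    print(perturbation_metrics(baseline, perturbed))
```

In the study's setting, `baseline` would correspond to a connectome predicted from intact input and `perturbed` to one predicted after removing a single Yeo-7 network; the synthetic matrices above only exercise the functions.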