arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.19101 2026-05-20 cs.SD cs.LG

Heterogeneity-Aware Dataset Scheduling for Efficient Audio Large Language Model Training

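Yanru Wu, Jianning Wang, Chongxin Gan, Yang Li

Affiliations: Shenzhen International Graduate School, Tsinghua University; Independent Researcher; The Hong Kong Polytechnic University

AI summary: This paper proposes GST, a heterogeneity-aware dataset scheduling method that organizes datasets into groups and introduces them through a progressive schedule, balancing the stability of parallel training against the efficiency of sequential optimization and converging 30-40% faster across 14 AudioQA datasets.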

Abstract

Training general-purpose Audio Large Language Models (ALLMs) across diverse datasets is essential for holistic audio understanding, yet it faces significant challenges due to dataset heterogeneity, which often leads to conflicting gradients and slow convergence. Despite its impact, how to explicitly manage this heterogeneity during training remains underexplored, with current practices relying primarily on uniform mixture. In this work, we analyze multi-dataset AudioQA training from a convergence perspective and propose Grouped Sequential Training (GST). GST strategically organizes datasets into affinity-aware groups and introduces them via a progressive scheduling protocol, effectively balancing the stability of parallel training with the efficiency of sequential optimization. To ensure scalability, we develop gradient-based affinity metrics that capture inter-dataset relationships without the prohibitive cost of empirical transferability estimation. Extensive evaluations on 14 AudioQA datasets spanning speech, music, and environmental sounds demonstrate that GST achieves 30-40% faster convergence than standard parallel training while maintaining or even surpassing the performance of mix-all training. Our results provide both theoretical insights and a practical, model-agnostic framework for efficient large-scale ALLM optimization.
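One way to picture a gradient-based affinity metric is cosine similarity between per-dataset gradients, followed by a simple grouping pass. The sketch below is only an illustration under that assumption: the abstract does not specify GST's actual metric or grouping rule, and the names `gradient_affinity` and `greedy_groups`, the threshold of 0.5, and the toy gradients are all hypothetical.

```python
import numpy as np

def gradient_affinity(grads):
    """Pairwise cosine similarity between flattened per-dataset gradients."""
    g = np.asarray(grads, dtype=float)
    norms = np.linalg.norm(g, axis=1, keepdims=True)
    unit = g / np.clip(norms, 1e-12, None)  # guard against zero gradients
    return unit @ unit.T

def greedy_groups(affinity, threshold=0.5):
    """Put each dataset in the first group whose members all meet the threshold."""
    groups = []
    for i in range(affinity.shape[0]):
        for group in groups:
            if all(affinity[i, j] >= threshold for j in group):
                group.append(i)
                break
        else:
            groups.append([i])  # no compatible group found, start a new one
    return groups

# Toy gradients for four hypothetical datasets (assumed values).
grads = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-0.1, 1.0]]
groups = greedy_groups(gradient_affinity(grads))
```

With these toy gradients, datasets 0 and 1 point in nearly the same direction, as do 2 and 3, so the pass yields two groups that could then be scheduled one after another.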

2605.19099 2026-05-20 cs.AI cs.CL cs.MA

DecisionBench: A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows

Yuxuan Gao, Megan Wang, Yi Ling Yu, Zijian Carl Ma, Ao Qu

Affiliations: OpenMesh AI; University of Pennsylvania; Columbia University; Stanford University; MIT

AI summary: This paper introduces DecisionBench, a benchmark for evaluating emergent delegation in long-horizon agentic workflows; a five-condition reference sweep yields its core findings on end-task quality, routing fidelity, and the counterfactual delegation ceiling.

Comments 28 pages, 9 figures, 11 tables. Code and data: https://huggingface.co/decisionbench

Abstract

We introduce DecisionBench, a benchmark substrate for emergent delegation in long-horizon agentic workflows. The substrate fixes a task suite (GAIA, tau-bench, BFCL multi-turn), a peer-model pool (11 models, 7 vendor families), a delegation interface (call_model plus an optional read_profile channel), a deterministic skill-annotation layer, and a multi-axis metric suite covering quality, cost, latency, delegation rate, routing fidelity-at-k, vendor self-preference, and a counterfactual-delegation ceiling. The substrate is agnostic to how peer information is generated or delivered, so learned routers, richer peer memories, adaptive profile construction, and multi-step delegation can all be evaluated against it. We characterize the substrate with a five-condition reference sweep on the full pool (n=23,375 task instances). Three benchmark-level findings emerge: (i) mean end-task quality is statistically indistinguishable across the four awareness conditions (|beta| <= 0.010, p >= 0.21), so quality-only evaluation would miss the orchestration signal; (ii) routing fidelity-at-1 ranges from 7.5% to 29.5% across conditions at near-equal mean quality, with delivery channel (on-demand tool vs. preloaded description) dominating description content; (iii) a counterfactual ceiling places perfect delegation 15-31 percentage points above measured performance on every suite, locating large unrealized headroom for future orchestration methods. We release the substrate, annotation layer, reference intervention suite, analysis pipeline, and 220 per-condition run archives.

2605.19095 2026-05-20 cs.LG cs.AI stat.ML

ScheduleFree+: Scaling Learning-Rate-Free & Schedule-Free Learning to Large Language Models

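Aaron Defazio

Affiliations: FAIR at Meta Super-Intelligence Labs

AI summary: This paper presents ScheduleFree+, a learning-rate-free and schedule-free method for training large language models that greatly outperforms Warmup-Stable-Decay (WSD) schedules at scale, and shows that Schedule-Free Learning is most effective for long-duration training.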

Abstract

Schedule-Free Learning has shown promise as a practical anytime training method for machine learning, showing success across dozens of standard benchmark problems. However, strong performance for LLM training has only been demonstrated at small scales. We identify a number of fixes necessary to scale up Schedule-Free Learning to larger batch sizes and model sizes, and present a learning-rate-free and schedule-free method (ScheduleFree+) for training large language models which greatly outperforms Warmup-Stable-Decay (WSD) schedules. We also demonstrate that Schedule-Free Learning is most effective for long duration training, and at 1000 tokens per parameter, it outperforms SOTA schedules by 31%. Schedule-Free Learning provides a theoretical foundation for the use of model averaging and checkpoint merging during pretraining.

2605.19093 2026-05-20 cs.AI cs.LG

Embedding by Elicitation: Dynamic Representations for Bayesian Optimization of System Prompts

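Zhiyuan Jerry Lin, Benjamin Letham, Samuel Dooley, Maximilian Balandat, Eytan Bakshy

Affiliations: Meta

AI summary: This paper studies Bayesian optimization of system prompts when only aggregate feedback is available and proposes ReElicit, an embedding-by-elicitation method in which an LLM builds an interpretable feature space, a Gaussian process surrogate with an acquisition function selects target feature vectors, and the LLM realizes them as deployable system prompts.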

Abstract

System prompts are a central control mechanism in modern AI systems, shaping behavior across conversations, tasks, and user populations. Yet they are difficult to tune when feedback is available only as aggregate metrics rather than per-example labels, failures, or critiques. We study this aggregate feedback setting as sample-constrained black-box optimization over discrete, variable-length text. We introduce ReElicit, a Bayesian optimization framework based on embedding by elicitation. Given a task description, previously evaluated prompts, and scalar scores, an LLM elicits a compact, interpretable feature space and maps prompts into it. Leveraging a probabilistic Gaussian process surrogate, an acquisition function then selects target feature vectors, which the LLM realizes and refines into deployable system prompts. Re-eliciting the feature space as new evaluations arrive lets the representation adapt to the observed prompt-score history. We evaluate the setting using offline benchmark accuracy as a controlled aggregate proxy: the optimizer observes one scalar score per prompt and no per-example labels, errors, or critiques. Across ten system prompt optimization tasks with a total budget of 30 evaluations, ReElicit achieves the strongest aggregate performance profile among representative aggregate-only prompt-optimization baselines. These results suggest that LLMs can serve as adaptive semantic representation builders, not only prompt generators, for Bayesian optimization over natural-language artifacts.

2605.19092 2026-05-20 cs.LG cs.AI cs.CL

Counterfactual Likelihood Tests for Indirect Influence in Private Reasoning Channels

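Alexander Boesgaard Lorup

Affiliations: Openhagen

AI summary: This paper proposes a counterfactual likelihood test for measuring influence between private reasoning channels: it replaces an upstream private block with a length-matched donor block, holds the public token sequence and downstream target fixed, and measures the shift in the downstream target's negative log-likelihood to assess direct and indirect influence across private and public channels.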

Comments 12 pages, 4 figures, 5 tables

Abstract

Reasoning systems increasingly separate intermediate computation into private and public channels, creating evaluation cases that look similar in transcripts: independent co-derivation, direct access to private content, and indirect influence through public communication. This paper presents a counterfactual likelihood test for measuring influence between private reasoning channels. The method replaces an upstream private block with a length-matched donor block, holds the public token sequence and downstream target fixed, and measures the downstream target's negative-log-likelihood shift. On a 7B role-channel reasoning model used for validation, textual probes are unreliable: raw n-gram overlap overstates leakage, corrected overlap remains noisy, and canary reproduction reports no discrimination. Counterfactual likelihood separates unmasked and masked conditions, while length matching controls a RoPE positional confound. In the hardened masked validation, reverse B-to-A influence is near zero, while A-to-B influence persists through public-speech hidden states. A multi-checkpoint validation across three checkpoints, five seeds, and 13,734 valid directional contrasts replicates this asymmetry. A graph-separation control that blocks private-to-public carrier edges produces bit-identical natural and counterfactual scores across all 13,734 control evaluations, identifying the tested public-channel pathway as the complete carrier of the measured counterfactual signal under the implemented role-visibility mask. The results show that private-channel evaluation should report direct and indirect influence separately, and that counterfactual likelihood probes provide a practical default for measuring these boundaries.
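The core measurement described above, the negative-log-likelihood shift of a fixed downstream target after a donor swap, can be sketched in a few lines. This is a minimal illustration that assumes per-token log-probabilities for the target are already available from some model under both conditions; the function names, the use of a mean rather than a sum over tokens, and the toy probabilities are assumptions, not details from the paper.

```python
import math

def nll(token_logprobs):
    """Mean negative log-likelihood over the downstream target tokens."""
    return -sum(token_logprobs) / len(token_logprobs)

def nll_shift(natural_logprobs, counterfactual_logprobs):
    """NLL under the donor swap minus NLL under the natural context.

    A positive value means the fixed target became less likely once the
    upstream private block was replaced.
    """
    if len(natural_logprobs) != len(counterfactual_logprobs):
        raise ValueError("the target must be identical in both conditions")
    return nll(counterfactual_logprobs) - nll(natural_logprobs)

# Toy per-token probabilities for a two-token target (assumed values).
natural = [math.log(0.5), math.log(0.25)]
swapped = [math.log(0.25), math.log(0.125)]
shift = nll_shift(natural, swapped)
```

In this toy case every target token is half as likely after the swap, so the shift equals log 2. A shift near zero would instead indicate that the replaced block had no measurable influence on the target.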

2605.19091 2026-05-20 cs.LG

Chessformer: A Unified Architecture for Chess Modeling

Daniel Monroe, George Eilender, Philip Chalmers, Zhenwei Tang, Ashton Anderson

Affiliations: University of Toronto; Williams College

AI summary: This paper presents Chessformer, a unified architecture for chess modeling that advances all three central goals at once: maximizing playing strength, predicting human play, and enabling interpretability.

Comments International Conference on Learning Representations (2026)

Abstract

Chess has long served as a canonical testbed for artificial intelligence, but modeling approaches for its central tasks have diverged. Maximizing playing strength, predicting human play, and enabling interpretability are typically solved with disparate architectures, and these designs are often misaligned with the geometry of the domain. This raises the natural question of whether these objectives require separate modeling paradigms, or if there exists a single architecture that supports them simultaneously. We introduce Chessformer, a unified architecture that advances the state of the art on all three central goals in chess modeling. Chessformer is an encoder-only transformer that represents board squares as tokens, augments self-attention with a novel dynamic positional encoding called Geometric Attention Bias (GAB) that adapts to domain-specific geometry, and predicts actions with an attention-based source-destination policy head. We evaluate Chessformer on each front. First, we develop \maiathree, a family of models for human move prediction that reaches 57.1% move-matching accuracy, significantly surpassing the previous state of the art with fewer than a quarter of the parameters. Second, we integrate Chessformer into Leela Chess Zero, a leading open-source engine, adding over 100 Elo of playing strength and resulting in tournament victories over Stockfish in major computer chess competitions. Third, we show that Chessformer's square-token design makes attention patterns and activations directly attributable to board squares, enabling granular interpretability analyses that prior architectures do not naturally support. More broadly, our results demonstrate that aligning a model's tokenization, positional encoding, and output design with the underlying structure of a domain can yield simultaneous gains in performance, human compatibility, and interpretability.

2605.19080 2026-05-20 cs.LG cs.AI

MANGO: Meta-Adaptive Network Gradient Optimization for Online Continual Learning

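Ankita Awasthi, Marco Apolinario, Kaushik Roy

Affiliations: Purdue University; TU Delft

AI summary: This paper proposes MANGO, a framework that balances stability and plasticity in online continual learning through gradient gating and meta-learned regularization, mitigating forgetting of past tasks while learning new ones efficiently.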

Abstract

In Online Continual Learning (OCL), a neural network learns sequentially from a non-stationary data stream in a single pass, with access only to a limited memory replay buffer. This contrasts sharply with offline continual learning, where training runs for multiple epochs over large datasets. The main challenge faced by OCL is to overcome catastrophic forgetting of past tasks (stability) while learning new ones efficiently (plasticity). Existing methods counter forgetting via replay-based rehearsal, output-level distillation, fixed regularization, or meta-learning on the current data. However, these methods have limitations: rehearsal introduces a stored-sample bias; distillation operates on output distributions without modulating parameter updates; fixed regularization penalizes parameters irrespective of sensitivity; stream-only meta-learning lacks a feedback-controlled parameter update. We propose Meta-Adaptive Network Gradient Optimization (MANGO), an OCL framework that balances stability and plasticity via gradient gating and meta-learned regularization. Gradient gating scales parameter updates based on sensitivity, preventing destructive updates. Meta-learned regularization adapts stability coefficients by evaluating the effect of each parameter update on replay. In MANGO, replay acts as both a training signal and a forgetting evaluator. We evaluated our method on three standard OCL benchmark datasets. MANGO outperforms strong baselines, achieving state-of-the-art results with consistent performance across replay sizes. In domain-incremental learning on CLEAR-10 and class-incremental learning on CIFAR-100 and Tiny-ImageNet, it achieves the highest accuracy among all baselines and achieves positive Backward Transfer, overcoming forgetting on CLEAR-10.

2605.19077 2026-05-20 cs.CL cs.AI

ReacTOD: Bounded Neuro-Symbolic Agentic NLU for Zero-Shot Dialogue State Tracking

Yanjun Lin, Zimo Xiao, Kartik Natarajan, Mahesh Sankaranarayanan, Niraj Nawanit, Rakshit Parashar, Austin Zhang, Karthik Konaraddi, Rishita Mote, Wei Niu

Affiliations: Amazon

AI summary: This study proposes ReacTOD, a bounded neuro-symbolic architecture for zero-shot dialogue state tracking that reformulates NLU as discrete tool calls within a self-correcting ReAct loop governed by deterministic validation, setting a new zero-shot state of the art.

Comments Accepted at TrustNLP Workshop at ACL 2026

Abstract

Task-oriented dialogue systems -- handling transactions, reservations, and service requests -- require predictable behavior, yet the moderately-sized LLMs needed for practical latency are prone to hallucination and format errors that cascade into incorrect actions (e.g., a hotel booked for the wrong date). We propose ReacTOD, a bounded neuro-symbolic architecture that reformulates NLU as discrete tool calls within a self-correcting ReAct loop governed by deterministic validation. A bounded ReAct loop enables iterative self-correction, improving accuracy by up to 9.3 percentage points over single-pass inference on MultiWOZ. A symbolic validator enforces action compliance, schema conformance, and coreference consistency on every dialogue state update, achieving a 93.1% self-correction rate on intercepted errors and producing structured execution traces. Incremental state prediction and on-demand history retrieval keep prompts compact, empirically improving instruction adherence in parameter-constrained models. On MultiWOZ 2.1, ReacTOD achieves a new zero-shot state-of-the-art: gpt-oss-20B reaches 52.71% joint goal accuracy, surpassing the previous best by 14 percentage points, while Qwen3-8B achieves 47.34% with only 8B parameters. On the Schema-Guided Dialogue (SGD) benchmark, ReacTOD with Claude-Opus-4.6 achieves 80.68% JGA under fully end-to-end evaluation with predicted domains, and Qwen3-32B reaches 64.09% -- demonstrating cross-benchmark generalization without task-specific training data.
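The idea of a bounded self-correction loop governed by a deterministic validator can be sketched as follows. This is a minimal illustration under assumptions, not ReacTOD's implementation: the schema format, the function names `validate` and `bounded_update`, the step budget of 3, and the stub proposer standing in for an LLM tool call are all hypothetical.

```python
from typing import Callable, Dict, List, Optional, Set

Schema = Dict[str, Set[str]]   # slot name -> allowed values (assumed format)
Update = Dict[str, str]        # proposed slot-value changes

def validate(update: Update, schema: Schema) -> List[str]:
    """Deterministic schema check; returns a list of error messages."""
    errors = []
    for slot, value in update.items():
        if slot not in schema:
            errors.append(f"unknown slot: {slot}")
        elif value not in schema[slot]:
            errors.append(f"invalid value for {slot}: {value}")
    return errors

def bounded_update(propose: Callable[[List[str]], Update],
                   schema: Schema, max_steps: int = 3) -> Optional[Update]:
    """Re-ask the proposer with validator feedback, at most max_steps times."""
    feedback: List[str] = []
    for _ in range(max_steps):
        update = propose(feedback)
        feedback = validate(update, schema)
        if not feedback:
            return update   # passed validation
    return None             # budget exhausted; the caller keeps the old state

# Stub proposer: first attempt has a typo, second attempt is valid.
schema = {"hotel-day": {"monday", "tuesday"}}
attempts = iter([{"hotel-day": "mondy"}, {"hotel-day": "monday"}])
result = bounded_update(lambda feedback: next(attempts), schema)
```

Bounding the loop keeps latency predictable, and because the validator is deterministic, every rejected update leaves an error list that can be logged as part of an execution trace.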

2605.19076 2026-05-20 cs.LG physics.flu-dyn

The impact of observation density on Bayesian inversion of latent dynamics in shock-dominated flows

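Bipin Tiwari, Muhammad Abid, Omer San

Affiliations: Department of Mechanical and Aerospace Engineering, University of Tennessee, Knoxville

AI summary: This paper develops a non-intrusive reduced-order modeling framework for efficient Bayesian initial-state inversion with uncertainty quantification, combining a convolutional autoencoder with a learned latent-space forward operator to improve the accuracy and efficiency of inverting latent dynamics in shock-dominated flows.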

Abstract

Inferring unknown initial states in shock-dominated compressible flows from sparse and noisy measurements is a challenging ill-posed inverse problem due to nonlinear wave interactions and limited sensing. In this work, we develop a non-intrusive reduced-order modeling framework for efficient Bayesian initial-state inversion with uncertainty quantification. The framework combines a convolutional autoencoder with a learned latent-space forward operator. The autoencoder compresses high-dimensional flow fields into a compact nonlinear latent representation, while the forward operator predicts final-time latent states from encoded initial conditions. This AE-ROM surrogate enables rapid forward evaluations and is embedded within a No-U-Turn Sampler (NUTS) for posterior exploration. The framework is demonstrated using 500 high-fidelity Sod shock tube simulations generated through Latin hypercube sampling and solved using a fifth-order WENO scheme. The inverse problem seeks to recover unknown left and right density and pressure states from sparse noisy observations of final-time density and pressure fields. Results show that the AE-ROM accurately reconstructs key shock-tube structures, including the rarefaction wave, contact discontinuity, and shock front. A latent dimension of 32 provides an effective balance between reconstruction accuracy and reduced-space compactness, while 250 training simulations are sufficient for accurate reconstruction. Increasing observation density significantly contracts posterior uncertainty, reducing the mean posterior standard deviation by approximately 78% for density and 76% for pressure. Overall, the proposed framework provides a computationally efficient and uncertainty-aware approach for inverse analysis of shock-dominated flows, with potential extensions to multidimensional compressible-flow and digital-twin applications.

2605.19075 2026-05-20 cs.CV cs.AI

CRAFT: Critic-Refined Adaptive Key-Frame Targeting for Multimodal Video Question Answering

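Mahesh Bhosale, Abdul Wasi, Vishvesh Trivedi, Pengyu Yan, Akhil Gorugantu, David Doermann

Affiliations: University at Buffalo; New York University

AI summary: This study proposes CRAFT, which combines dynamic keyframe selection, per-video ASR with multilingual fallback, and a hybrid critic loop that iteratively verifies and repairs claims, yielding accurately attributed evidence aggregation for multimodal video question answering.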

Comments Accepted at ACL 2026 Multimodal Augmented Generation via MultimodAl Retrieval Workshop

Abstract

Grounded multi-video question answering over real-world news events requires systems to surface query-relevant evidence across heterogeneous video archives while attributing every claim to its supporting source. We introduce CRAFT (Critic-Refined Adaptive Key-Frame Targeting), a query-conditioned pipeline that combines dynamic keyframe selection, per-video ASR with multilingual fallback, and a hybrid critic loop to iteratively verify and repair claims before consolidation. The pipeline integrates UNLI temporal entailment, DeBERTa-v3 cross-claim screening, and a Llama-3.2-3B adjudicator, with a final citation-merging stage that emits each fact once with all supporting source identifiers. On MAGMaR 2026, CRAFT achieves the best overall average (0.739), reference recall (0.810), and citation F1 (0.635). We further evaluate on a MAGMaR-style conversion of WikiVideo with 52 non-overlapping event queries, where CRAFT also performs strongly (0.823 Avg), showing that its claim-centric evidence aggregation generalizes beyond MAGMaR. Ablations show that atomic claims, ASR, and the critic loop drive the main gains over the vanilla query-conditioned baseline. Code and implementation details are publicly available at https://github.com/bhosalems/CRAFT.

2605.19074 2026-05-20 cs.CV cs.AI

Learning Long-Term Temporal Dependencies in Photovoltaic Power Output Prediction Through Multi-Horizon Forecasting

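Sumit Laha, Ankit Sharma, Hassan Foroosh

Affiliations: Department of Computer Science, University of Central Florida, Orlando, Florida, United States

AI summary: This paper proposes a multi-horizon forecasting framework that jointly optimizes over a sequence of future values so that deep neural networks better capture latent inter-step temporal dependencies, improving the accuracy and robustness of photovoltaic power output prediction.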

Abstract

The rapid global expansion of solar photovoltaic (PV) capacity, which reached a record 597 GW in 2024, highlights the urgent need for robust forecasting models to mitigate the grid instability caused by the intermittent nature of solar irradiance. While deep learning-based direct forecasting using ground-based sky images (GSI) has emerged as a dominant approach, existing literature is often constrained by single-architecture evaluations and an exclusive focus on single-horizon (point) prediction. This paper proposes a transition from traditional single-horizon estimation toward a multi-horizon forecasting framework, leading to an architecture-independent improvement in accuracy. We hypothesize and demonstrate experimentally that joint optimization over a sequence of future values allows deep neural networks to better capture latent inter-step temporal dependencies by avoiding premature convergence of the network in terms of both weight gradients and filter diversity. Leveraging this architecture-independent improvement that integrates sequential sky imagery with historical PV generation data, we evaluate the models' abilities to predict power output across multiple discrete future time steps simultaneously. Our methodology is validated through a comparative analysis across diverse deep learning architectures. The results demonstrate that this multi-horizon approach significantly enhances predictive accuracy and robustness across the entire forecast horizon while maintaining computational parsimony. By achieving superior performance with negligible overhead compared to single-horizon models, this work provides a scalable and efficient solution to improve the resilience of modern power grids.
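Joint optimization over a sequence of future values can be pictured as a single loss that averages the error across all forecast horizons, rather than training on one target step. The sketch below assumes a mean-squared-error objective and a `(batch, horizons)` array layout; the abstract does not name the loss actually used, and the function name and toy values are hypothetical.

```python
import numpy as np

def multi_horizon_mse(pred, target):
    """Joint MSE over all horizons; inputs have shape (batch, horizons)."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    per_horizon = ((pred - target) ** 2).mean(axis=0)  # one error per future step
    return per_horizon.mean(), per_horizon

# Two samples, three future time steps (assumed toy values).
pred = [[1.0, 2.0, 3.0], [1.0, 2.0, 3.0]]
target = [[1.0, 2.0, 5.0], [1.0, 4.0, 3.0]]
loss, per_horizon = multi_horizon_mse(pred, target)
```

A single-horizon model corresponds to keeping only one column of `per_horizon`. Returning the per-horizon errors alongside the joint loss makes it easy to see how accuracy degrades as the forecast reaches further ahead.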

2605.19073 2026-05-20 cs.LG cs.AI

Riemannian Networks over Full-Rank Correlation Matrices

Ziheng Chen, Xiaojun Wu, Bernhard Schölkopf, Nicu Sebe

Affiliations: Department of Information Engineering and Computer Science, University of Trento, Trento, Italy; School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, China

AI summary: This paper introduces Riemannian networks over full-rank correlation matrices, extending basic layers to correlation geometries and presenting accurate backpropagation methods, and demonstrates their effectiveness against existing SPD and Grassmannian networks.

Comments Accepted to ICML 2026

Abstract

Representations on the Symmetric Positive Definite (SPD) manifold have garnered significant attention across different applications. In contrast, the manifold of full-rank correlation matrices, a normalized alternative to SPD matrices, remains largely underexplored. This paper introduces Riemannian networks over the correlation manifold, leveraging five recently developed correlation geometries. We systematically extend basic layers, including Multinomial Logistic Regression (MLR), Fully Connected (FC), and convolutional layers, to these geometries. In addition, we present methods for accurate backpropagation for two correlation geometries. Experiments comparing our approach against existing SPD and Grassmannian networks demonstrate its effectiveness.

2605.19066 2026-05-20 cs.CL

The Annotation Scarcity Paradox in Low-Resource NLP Evaluation: A Decade of Acceleration and Emerging Constraints

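Vukosi Marivate

Affiliations: DSFSI & AfriDSAI, University of Pretoria

AI summary: This paper examines the tension that insufficient annotation resources create in low-resource NLP evaluation, traces the phases of evaluation practice over the past decade, and surveys emerging responses with attention to their equity and validity trade-offs.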

Comments Under Review

Abstract

Over the past decade, low-resource natural language processing (NLP) has experienced explosive growth, propelled by cross-lingual transfer, massively multilingual models, and the rapid proliferation of benchmarks. Yet this apparent progress masks a critical, insufficiently examined tension: the deep sociolinguistic expertise required to evaluate increasingly complex generative systems is severely strained, inequitably distributed, and structurally marginalised. We present a critical narrative survey of low-resource NLP evaluation (2014-present), tracing its evolution across three phases: early heuristic optimism, the illusions of top-down benchmark scaling, and the current era of generative bottlenecks. We conceptualise the Annotation Scarcity Paradox, the structural friction arising when the technical capacity to scale models vastly outpaces the sovereign human infrastructure required to authentically evaluate them. By examining extractive data pipelines, undercompensated "ghost work", and language data flaring, we argue that this paradox threatens the epistemic validity of reported progress. We survey emerging responses (including data augmentation, model-based evaluation, participatory curation, and annotation-efficient approaches via item response theory and active learning) and assess their equity and validity trade-offs. We close with a practitioner call to action, arguing that overcoming this bottleneck requires a paradigm shift from transactional data extraction to relational, community-embedded evaluation rooted in epistemic governance, data sovereignty, and shared ownership.

2605.19063 2026-05-20 cs.LG

Mapping Uncharted Symmetries: Machine Discovery in Combinatorics

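Eugenio Cainelli, Lorenzo Luccioli, Alessandro Iraci, Michele D'Adderio, Giovanni Paolini

Affiliations: University of Bologna; Pegaso University; University of Pisa

AI summary: This paper proposes a machine-learning approach to combinatorics research that constructs simple mathematical functions satisfying exact distributional constraints, discovers a new combinatorial interpretation of the q,t-Narayana polynomials, and yields a combinatorial proof of their symmetry in a previously unsolved case.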

Comments 20 pages

Abstract

Inspired by long-standing open problems in algebraic combinatorics, we show that modern machine learning can meaningfully contribute to verifiable mathematical discoveries. In particular, we focus on the construction of simple mathematical functions under exact distributional constraints, a setting we formalize as Simple Learning Under Rigid Proportions (SLURP). We tackle this problem by introducing two methods: MapSeek-Functional, which models the desired function by alternating pseudo-labeling and supervised training steps; and MapSeek-Symbolic, designed to directly produce symbolic formulas. We successfully apply both methods to a research problem in algebraic combinatorics, discovering a new combinatorial interpretation of the $q,t$-Narayana polynomials arising from representation theory. To our knowledge, this is the first such interpretation based on noncrossing partitions. Using one discovered statistic, we find a combinatorial proof of the symmetry of these polynomials in a previously unsolved case. To streamline verification and reproducibility, we release all code, including a formalization of all the mathematical discoveries of this paper in Lean 4.

2605.19060 2026-05-20 cs.CV cs.AI eess.IV

LiFT: Lifted Inter-slice Feature Trajectories for 3D Image Generation from 2D Generators

Xinhe Zhang, Yuyang Zhang, Pengfei Jin, Arnau Marin-Llobet, Na Li, Quanzheng Li

Affiliations: School of Engineering and Applied Sciences, Harvard University; Center for Advanced Medical Computing and Analysis, Massachusetts General Hospital and Harvard Medical School; Kempner Institute, Harvard University

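AI summary: This paper proposes LiFT, a framework that factorizes 3D volume synthesis into per-slice image generation and inter-slice trajectory learning. It targets two problems in high-resolution 3D medical image generation: the high computational cost of fully volumetric models and the failure of 2D slice generators to preserve anatomical consistency along the third dimension.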

Abstract

High-resolution 3D medical image generation remains challenging because fully volumetric models are computationally expensive, while efficient 2D slice generators often fail to preserve anatomical consistency across the third dimension. We propose LiFT, a framework for Lifted inter-slice Feature Trajectories that factorizes 3D volume synthesis into per-slice image generation and inter-slice trajectory learning. Rather than modeling the volumetric distribution end-to-end, LiFT treats a volume as an ordered trajectory in feature space, capturing how anatomical structures appear, transform, and disappear across depth. A tri-planar drifting loss aligns the trajectory of generated slices with the trajectories of real volumes, enabling distributional learning over inter-slice progressions in unconditional generation; in paired translation, a bidirectional $z$-context mixer trained against the registered target supplies through-plane coherence while preserving per-slice fidelity. We evaluate LiFT on BraTS 2023 (unconditional and missing-modality MR) and SynthRAD2023 (MR-to-CT). Across these settings, LiFT preserves per-slice quality, approaches the reported cWDM missing-MR reconstruction quality at $\sim 135\times$ lower inference cost (without formal equivalence testing), and improves through-plane coherence on MR-to-CT relative to a no-mapper ablation, demonstrating that lightweight inter-slice trajectory learning is a viable route to high-resolution 3D medical synthesis.

2605.19050 2026-05-20 cs.LG physics.chem-ph q-bio.QM

Generative Pseudo-Force Fields for Molecular Generation

Stefaan Simon Pierre Hessmann, Khaled Kahouli, Stefan Gugler, Michael Plainer, Frank Noé, Klaus-Robert Müller, Niklas Wolf Andreas Gebauer

Affiliations: Machine Learning Group, Technische Universität Berlin; BIFOLD – Berlin Institute for the Foundations of Learning and Data; Department of Mathematics and Computer Science, Freie Universität Berlin; Zuse School ELIZA, Darmstadt, Germany; Department of Physics, Freie Universität Berlin; Microsoft Research AI4Science, Berlin, Germany; Department of Chemistry, Rice University, Houston, USA; Max-Planck Institute for Informatics, Saarbrücken, Germany; Department of Artificial Intelligence, Korea University, Seoul, South Korea

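AI summary: This paper proposes generative pseudo-force fields (GPFFs) to resolve the trade-off in molecular generation between energy-based relaxation and the sampling efficiency of data-driven generative models. An MLFF is trained on a quadratic pseudo-potential energy surface defined relative to reference equilibrium structures, giving efficient and stable generation of molecular conformations.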

Abstract

Generating stable molecular conformations typically forces a tradeoff between the physical realism of energy-based relaxation and the sampling efficiency of data-driven generative models. While machine learning force fields (MLFFs) can sample stable conformations by relaxing molecular geometries according to physical forces, they require costly ab-initio training data. Conversely, diffusion models (DMs) learn from equilibrium data alone but are dependent on noise schedules and time-step conditioning. In this work, we propose generative pseudo-force fields (GPFFs) to bridge these paradigms by training an MLFF on a quadratic pseudo-potential energy surface relative to reference equilibrium structures. Because no ab-initio calculations are required for the perturbed geometries, non-equilibrium training data can be generated on the fly by perturbing the equilibria with Gaussian noise. We show that GPFFs constitute a time-step-agnostic variant of variance-exploding DMs: the score comes from the predicted pseudo-forces, but because force magnitudes implicitly encode the noise level, no time-step conditioning is needed. Our GPFF can hence be used as a drop-in replacement in standard diffusion sampling (ancestral, Heun) but also facilitates more efficient, adaptive variants and an MLFF-inspired direct denoising scheme. Our proposed sampling algorithms support arbitrary structural priors and geometric constraints. On QM9, GPFF has 100% validity at 256 neural function evaluations (NFE) and over 50% at just 6 NFE, outperforming diffusion baselines across all samplers. Combined with custom priors, we showcase the fast and accurate generation process of our method in a molecular editor for a drug design setting, where a molecule is generated in real time.
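
One way to read the quadratic pseudo-potential described in this abstract is sketched below. The stiffness $k$, the exact form of $E$, the coordinate count $d$, and the omission of any rotational or translational alignment are assumptions made for illustration; the abstract does not spell them out.

```latex
% Assumed form: harmonic pseudo-potential around a reference equilibrium x_0
E(x; x_0) = \frac{k}{2}\,\lVert x - x_0 \rVert^2,
\qquad
F(x) = -\nabla_x E(x; x_0) = k\,(x_0 - x).

% Perturbed training geometry: x = x_0 + \sigma\,\varepsilon,
% with \varepsilon \sim \mathcal{N}(0, I_d)
F(x) = -k\,\sigma\,\varepsilon
\quad\Longrightarrow\quad
\mathbb{E}\,\lVert F(x) \rVert^2 = k^2 \sigma^2 d.

% Score of the Gaussian perturbation kernel (variance-exploding setting)
\nabla_x \log \mathcal{N}(x;\, x_0,\, \sigma^2 I_d)
  = \frac{x_0 - x}{\sigma^2}
  = \frac{F(x)}{k\,\sigma^2}.
```

Under these assumptions the pseudo-force points along the score and its norm grows with $\sigma$, which is consistent with the abstract's remark that force magnitudes implicitly encode the noise level.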

2605.19049 2026-05-20 cs.LG cs.AI

KVBuffer: IO-aware Serving for Linear Attention

Longwei Zou, Lin Zhong

Affiliations: Department of Computer Science

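AI summary: This paper proposes KVBuffer, an IO-aware serving mechanism for linear attention. By buffering recent keys and values, it lets serving systems compute linear attention outputs more flexibly and efficiently, reducing memory access and decoding latency and improving serving performance.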

Abstract

Linear attention has recently gained significant attention for long-context inference due to its constant decoding cost with respect to context length. However, existing serving systems typically serve linear attention by recurrently computing and updating a large linear attention state in every decoding step. Since the state is much larger than the per-token key and value, recurrent decoding incurs substantial memory access and becomes inefficient for serving linear attention. In this paper, we propose KVBuffer, an IO-aware serving mechanism for linear attention. By buffering recent keys and values, KVBuffer enables serving systems to compute linear attention outputs in more flexible and memory-efficient ways. For decoding, KVBuffer enables chunkwise computation, which reduces average memory access and decoding latency by deferring state updates and applying them in batch. For speculative decoding, KVBuffer verifies draft tokens in parallel and avoids storing temporary states. For short contexts, KVBuffer computes attention outputs directly from buffered keys and values, without creating or updating the linear attention state. We implement KVBuffer in SGLang for Qwen3-Next. Our evaluations show that KVBuffer can reduce linear attention decoding latency by up to 45.17% and increase the maximum number of serving requests by 5x for speculative decoding when verifying four draft tokens.
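
The deferred state update described in this abstract can be illustrated with a minimal sketch. It assumes plain additive, unnormalized linear attention (state update S += k v^T), without the gating or delta-rule updates that production models may use; the function names and the `chunk` parameter are hypothetical and are not taken from KVBuffer or SGLang.

```python
# Sketch only: additive linear attention with a small key/value buffer.

def outer_add(state, k, v):
    """In-place update S += k v^T for a d_k x d_v state stored as nested lists."""
    for i, ki in enumerate(k):
        row = state[i]
        for j, vj in enumerate(v):
            row[j] += ki * vj


def read(state, q):
    """Return q^T S as a list of length d_v."""
    d_v = len(state[0])
    return [sum(q[i] * state[i][j] for i in range(len(q))) for j in range(d_v)]


def decode_recurrent(qs, ks, vs):
    """Baseline: write the full state at every decoding step."""
    state = [[0.0] * len(vs[0]) for _ in range(len(ks[0]))]
    outputs = []
    for q, k, v in zip(qs, ks, vs):
        outer_add(state, k, v)
        outputs.append(read(state, q))
    return outputs


def decode_buffered(qs, ks, vs, chunk=4):
    """Buffered variant: the state is written only once every `chunk` steps."""
    state = [[0.0] * len(vs[0]) for _ in range(len(ks[0]))]
    buffer = []  # recent (k, v) pairs not yet folded into the state
    outputs = []
    for q, k, v in zip(qs, ks, vs):
        buffer.append((k, v))
        # Contribution of everything already folded into the state.
        out = read(state, q)
        # Contribution of buffered tokens, computed attention-style: (q . k_j) v_j.
        for bk, bv in buffer:
            w = sum(qi * ki for qi, ki in zip(q, bk))
            out = [o + w * x for o, x in zip(out, bv)]
        outputs.append(out)
        if len(buffer) == chunk:
            # Deferred update: fold the whole chunk into the state at once.
            for bk, bv in buffer:
                outer_add(state, bk, bv)
            buffer.clear()
    return outputs
```

Both functions produce the same outputs because q^T (S + Σ k_j v_j^T) = q^T S + Σ (q · k_j) v_j. In this sketch the saving is in state writes only; the state is still read at every step.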

2605.19042 2026-05-20 cs.AI

Interference-Aware Multi-Task Unlearning

Ying-Hua Huang, Rui Fang, Hsi-Wen Chen, Ming-Syan Chen

Affiliations: National Taiwan University

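AI summary: This paper proposes an interference-aware multi-task unlearning method that uses task-aware gradient projection and instance-level gradient orthogonalization to address the task-level and instance-level interference caused by shared parameters in multi-task settings. Experiments on multi-task computer vision benchmarks show that it reduces unlearning interference.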

Abstract

Machine unlearning aims to remove the contribution of designated training data from a trained model while preserving performance on the remaining data. Existing work mainly focuses on single-task settings, whereas modern models often operate in multi-task setups with shared backbones, where removing supervision for one task or instance can unintentionally affect others. We introduce multi-task unlearning with two settings: full-task unlearning, which removes a target instance from all tasks, and partial-task unlearning, which removes supervision only from selected tasks. We show that shared parameters couple the forget and retain sets, causing task-level interference on non-target tasks and instance-level interference on other instances. To address this issue, we propose an interference-aware framework that combines task-aware gradient projection, which constrains updates within task-specific subspaces, with instance-level gradient orthogonalization, which reduces conflicts between forget and retain signals. Experiments on two multi-task computer vision benchmarks across five tasks show that our method achieves effective unlearning while maintaining strong generalization, reducing UIS compared with the strongest baseline by 30.3% in full-task unlearning and 52.9% in partial-task unlearning.
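
The idea of orthogonalizing forget and retain gradients can be sketched with a standard vector projection. This is an illustrative assumption, not the paper's exact update rule, which the abstract does not give; the helper names `dot` and `orthogonalize` are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def orthogonalize(forget_grad, retain_grad, eps=1e-12):
    """Remove from forget_grad its component along retain_grad.

    The returned vector has zero inner product with retain_grad (up to
    rounding), so a step along it does not change the retain loss to
    first order.
    """
    denom = dot(retain_grad, retain_grad)
    if denom < eps:  # retain gradient is (numerically) zero: nothing to remove
        return list(forget_grad)
    scale = dot(forget_grad, retain_grad) / denom
    return [f - scale * r for f, r in zip(forget_grad, retain_grad)]


# Usage: the component along the retain direction [1, 0] is removed.
print(orthogonalize([1.0, 2.0], [1.0, 0.0]))  # [0.0, 2.0]
```

This sketch projects unconditionally. A variant that projects only when the two gradients conflict (negative inner product) is also common in multi-task learning.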

2605.19038 2026-05-20 cs.RO cs.LG

Guiding Neuro-Symbolic Scenario Generation with Spatio-Temporal Logic

Lorenzo Bonin, Francesco Giacomarra, Luca Bortolussi, Jyotirmoy V. Deshmukh, Francesca Cairoli

Affiliations: University of Trieste; University of Southern California

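AI summary: This work proposes STRELGen, a framework that combines a diffusion model with spatio-temporal logic specifications to efficiently generate safety-critical multi-agent driving scenarios for stress-testing the robustness of autonomous driving systems.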

Abstract

The rapid advancement of autonomous driving (AD) technologies has outpaced the development of robust safety evaluation methods. Conventional testing relies on exposing AD systems to vast numbers of real-world traffic scenes -- a brute-force approach that is prohibitively expensive and statistically ineffective at capturing the rare, safety-critical edge cases essential for validating real-world robustness. To address this fundamental limitation, we introduce STRELGen, a scalable framework for the targeted generation of safety-critical driving scenarios. STRELGen synergistically combines a multi-agent trajectory-generation diffusion model (DM) with Spatio-Temporal Logic (STREL) specifications that encode complex safety and realism properties through a highly interpretable formalism. Crucially, monitoring satisfaction levels of these specifications is differentiable, enabling gradient-based search. At inference time, we optimize directly over the DM latent space to maximize STREL formula satisfaction. The result is efficient generation of highly plausible yet safety-critical multi-agent scenarios that lie within the learned data distribution. STRELGen thus provides a flexible, interpretable, and powerful tool for stress-testing autonomous driving systems, moving beyond the limitations of brute-force data collection.

2605.19035 2026-05-20 cs.AI

Trustworthy Agent Network: Trust in Agent Networks Must Be Baked In, Not Bolted On

Yixiang Yao, Yuhang Yao, Xinyi Fan, Jiechao Gao, Jie Wang, Minjia Zhang, Srivatsan Ravi, Carlee Joe-Wong

Affiliations: University of Southern California; Carnegie Mellon University; University of Illinois Urbana-Champaign; Stanford University

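AI summary: This vision paper argues that trust in agent networks must be built in rather than added afterwards, and presents a conceptual framework that establishes trust in agent networks through four design pillars.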

Comments Accepted by SIGKDD 2026 Blue Sky Ideas Track

Abstract

The rapid advancement of Large Language Models has given rise to autonomous LLM-based agents capable of complex reasoning and execution. As these agents transition from isolated operation to collaborative ecosystems, we witness the emergence of the Agent-to-Agent (A2A) network, a paradigm where heterogeneous agents autonomously coordinate to solve multi-step tasks. While these networks may offer better task performance compared to simply using one agent to complete the entire task, they introduce systemic vulnerabilities, such as adversarial composition, semantic misalignment, and cascading operational failures, that existing agent alignment techniques cannot address. In this vision paper, we argue that the trustworthiness of A2A networks cannot be fully guaranteed via retrofitting on existing protocols that are largely designed for individual agents. Rather, it must be architected from the very beginning of the A2A coordination framework. We present a comprehensive conceptual framework that situates trust in A2A systems through four design pillars.

2605.19033 2026-05-20 cs.RO cs.AI cs.CV cs.LG cs.MA

RLFTSim: Realistic and Controllable Multi-Agent Traffic Simulation via Reinforcement Learning Fine-Tuning

Ehsan Ahmadi, Hunter Schofield, Behzad Khamidehi, Fazel Arasteh, Jinjun Shan, Lili Mou, Dongfeng Bai, Kasra Rezaee

Affiliations: University of Alberta; Huawei Technologies Canada; York University; Canada CIFAR AI Chair, Amii

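AI summary: This paper proposes RLFTSim, a framework that improves the realism of traffic simulation scenarios through reinforcement learning fine-tuning and distills controllability through goal conditioning. Experiments on the Waymo Open Motion Dataset report state-of-the-art realism with significantly fewer samples than heuristic search-based fine-tuning methods.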

Comments CVPR 2026 Highlight; Project page at https://ehsan-ami.github.io/rlftsim

Abstract

Supervised open-loop training has been widely adopted for training traffic simulation models; however, it fails to capture the inherently dynamic, multi-agent interactions common in complex driving scenarios. We introduce RLFTSim, a reinforcement-learning-based fine-tuning framework that enhances scenario realism by aligning simulator rollouts with real-world data distributions and provides a method for distilling goal-conditioned controllability in scenario generation. We instantiate RLFTSim on top of a pre-trained simulation model, design a reward that balances fidelity and controllability, and perform comprehensive experiments on the Waymo Open Motion Dataset. Our results show improvements in realism, achieving state-of-the-art performance. Compared with other heuristic search-based fine-tuning methods, RLFTSim requires significantly fewer samples due to a proposed low-variance and dense reward signal, and it directly addresses the realism alignment issue by design. We also demonstrate the effectiveness of our approach for distilling traffic simulation controllability through goal conditioning. The project page is available at https://ehsan-ami.github.io/rlftsim.

2605.19032 2026-05-20 cs.CV

Personalized Face Privacy Protection From a Single Image

Zachary Yahn, Fatih Ilhan, Tiansheng Huang, Selim Tekin, Sihao Hu, Yichang Xu, Margaret Loper, Ling Liu

Affiliations: Georgia Institute of Technology

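AI summary: This paper presents FaceCloak, a system that generates a personalized face privacy mask from a single image of a user so that facial recognition fails. Experiments on three face datasets across ten recognition models show its effectiveness compared with 29 other methods.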

Abstract

Photos of faces uploaded online are vulnerable to malicious actors who can scrape facial images from online sources and intrude on personal privacy via unauthorized use of facial recognition models. This paper presents FaceCloak, a novel personalized face privacy protection system, which can generate defensive identity-specific universal face privacy masks from a single image of a user, causing facial recognition to fail. FaceCloak introduces a three-stage personalized face perturbation learning methodology: (1) It generates a small set of high-variety synthetic face images of a person based on a single image of the person. (2) It learns face cloaking by adding more protection to key facial-identity leakage regions through iterative perturbation generation over the small set of synthetic images, effectively shifting a user's identity embedding towards a distant anchor identity and away from a similar one. (3) It generates a personalized identity-protective mask in the form of pixel-wise cloaking, which is light-weight and can be efficiently applied to any facial image of a user while maintaining good perceptual quality. Extensive experiments on three popular face datasets across ten recognition models show the effectiveness of FaceCloak compared to 29 other existing representative methods. Code is available at https://github.com/zacharyyahn/FaceCloak

2605.19029 2026-05-20 cs.RO

Distributionally Robust Control via Stein Variational Inference for Contact-Rich Manipulation

Hrishikesh Sathyanarayan, Victor Vantilborgh, Harish Ravichandar, Tom Lefebvre, Ian Abraham

Affiliations: Department of Mechanical Engineering, Yale University; Department of Electromechanical, Systems and Metal Engineering, Ghent University; School of Interactive Computing, Georgia Institute of Technology; Department of Electrical Engineering, University of Sydney

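AI summary: This paper proposes a distributionally robust control method based on Stein variational inference for contact-rich manipulation. More flexible uncertainty modeling lets the controller adapt to uncertainty while retaining performance, and experiments show up to 3$\times$ improved robustness under broad parametric uncertainty.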

Comments In Proceedings of Robotics: Science and Systems, Sydney, Australia, July 2025

Abstract

Reliable robotic manipulation requires control policies that can accurately represent and adapt to uncertainty arising from contact-rich interactions. Modern data-driven methods mitigate uncertainty through large-scale training and computation, but their performance degrades significantly when only a limited number of training samples is available. By contrast, classical model-based controllers are computationally efficient and reliable, but their limited ability to represent task-relevant uncertainty can hinder performance in contact-rich interactions. In this work, we propose to expand the capabilities of model-based manipulation control through more flexible uncertainty modeling that retains performance while exactly adapting to uncertainty. Our approach casts the manipulation problem as a distributionally robust control optimization and proposes a novel deterministic formulation based on Stein variational inference that preserves performance while explicitly modeling task-sensitive parameter uncertainty. As a result, the derived controllers are more aware of task sensitivities to uncertainty, yielding high reliability without compromising performance. Experimental results demonstrate up to 3$\times$ improved robustness across a range of contact-rich manipulation tasks under broad parametric uncertainty, outperforming existing model-based control methods.

2605.19028 2026-05-20 cs.LG

Learning When to Adapt

Ali Zindari, Xiaowen Jiang, Rotem Mulayoff, Sebastian U. Stich

Affiliations: Universität des Saarlandes; CISPA Helmholtz Center for Information Security

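AI summary: This paper proposes DISeL (Dynamic Input-Sensitive LoRA), which adds lightweight input-dependent gates to LoRA to reduce forgetting while maintaining fine-tuning accuracy, and whose gate activations provide an interpretable diagnostic view.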

Comments Preprint

Abstract

Low-rank adaptation (LoRA) is a widely used parameter-efficient fine-tuning method, yet its learned correction is static: the same low-rank update is applied to every input. This input-agnostic approach creates an inevitable compromise between adapting to the fine-tuning distribution and preserving pre-trained behavior on inputs outside that distribution, contributing to catastrophic forgetting. We introduce DISeL (Dynamic Input-Sensitive LoRA), which augments LoRA modules with lightweight input-dependent gates over individual rank-one components. The gating mechanism is designed to preserve the pre-trained model's behavior by default, while training learns to activate selected components that reduce the fine-tuning loss. DISeL adds only a small number of parameters and preserves the low-rank structure. Across RoBERTa on GLUE, and Llama and Mistral models fine-tuned for mathematical reasoning and code generation, DISeL reduces forgetting relative to LoRA and related variants while maintaining competitive fine-tuning accuracy. In addition, the learned gate activations provide an interpretable diagnostic view of which layers and rank components are most activated during fine-tuning, giving insight into where task-specific adaptation is concentrated. Code is available at https://github.com/alizindari/DISeL.
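
A gate over individual rank-one LoRA components, as this abstract describes, might look like the following minimal sketch. The sigmoid gate and its parameters `gate_w` and `gate_b` are hypothetical choices made for illustration; the abstract does not specify the gate's form, and this is not the released DISeL code.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def gated_lora_forward(W, A, B, gate_w, gate_b, x):
    """Compute y = W x + sum_i g_i(x) * (a_i . x) * b_i.

    W: frozen d_out x d_in weight. A: r x d_in (rows a_i). B: d_out x r
    (columns b_i). The assumed gate is g_i(x) = sigmoid(gate_w[i] . x + gate_b[i]).
    Returns the output vector and the list of gate values.
    """
    y = matvec(W, x)        # frozen pre-trained path
    down = matvec(A, x)     # a_i . x for each rank-one component
    gates = [sigmoid(s + b) for s, b in zip(matvec(gate_w, x), gate_b)]
    for i, (g, d) in enumerate(zip(gates, down)):
        for o in range(len(y)):
            y[o] += g * d * B[o][i]
    return y, gates
```

With a strongly negative gate bias every gate is close to 0 and the output is close to the frozen model's `W x`; with all gates at 1 the expression reduces to ordinary LoRA, `W x + B A x`.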

2605.18597 2026-05-20 cs.AI

Latent Action Reparameterization for Efficient Agent Inference

Wenhao Huang, Qingwen Zeng, Qiyue Chen, Zijie Guo, Yu Sun, Cheng Yang, Siru Ouyang, Jiri Gesi, Fang Wu, Jiayi Zhang, Huaming Chen, Bang Liu, Xiangru Tang, Chenglin Wu

Affiliations: Université de Montréal; The University of Sydney; Fudan University; Yale University; DeepWisdom; University of Illinois Urbana-Champaign; Amazon Science; Stanford University; The Hong Kong University of Science and Technology (Guangzhou); Mila - Quebec AI Institute

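AI summary: This paper proposes Latent Action Reparameterization (LAR), a framework that learns a compact latent action space to make LLM agent inference more efficient, shortening the effective action horizon while preserving the expressiveness of the original action space.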

Abstract

Large language model (LLM) agents often rely on long sequences of low-level textual actions, resulting in large effective decision horizons and high inference cost. While prior work has focused on improving inference efficiency through system-level optimizations or prompt engineering, we argue that a key bottleneck lies in the representation of the action space itself. We propose Latent Action Reparameterization (LAR), a framework that learns a compact latent action space in which each latent action corresponds to a multi-step semantic behavior. By reparameterizing agent actions into latent units, LAR enables decision making over a shorter effective horizon while preserving the expressiveness of the original action space. Unlike hand-crafted macros or hierarchical controllers, latent actions are learned from agent trajectories and integrated directly into the model, allowing both planning and execution to operate over abstract action representations. Across a range of LLM-based agent benchmarks, LAR significantly reduces the effective action horizon and improves inference efficiency under fixed compute budgets. As a consequence, our approach achieves substantial reductions in action tokens and corresponding wall-clock inference time, while maintaining or improving task success rates. These results suggest that action representation learning is a critical and underexplored factor in scaling efficient LLM agent inference, complementary to advances in model architecture and hardware.

2605.18464 2026-05-20 cs.CV

PERL: Parameter Efficient Reasoning in CLIP Latent Space

Simone Carnemolla, Salvatore Calcagno, Daniela Giordano, Concetto Spampinato, Matteo Pennisi

Affiliations: University of Catania

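AI summary: This paper proposes PERL, a framework for parameter-efficient adaptation through iterative latent reasoning in CLIP latent space. Across multiple benchmarks it achieves the best parameter-performance trade-off among the compared methods, with strong novel-class accuracy and competitive transfer performance using only about 6K trainable parameters.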

Comments Submitted to NeurIPS 2026

Abstract

Contrastively trained vision-language models such as CLIP provide strong zero-shot transfer by aligning images and text in a shared embedding space. However, adapting these models to downstream tasks without degrading their open-vocabulary generalization remains challenging. Existing parameter-efficient adaptation methods typically improve task specialization through learned prompts, adapters, or multimodal transformations, where adaptation capacity is primarily expressed through additional trainable parameters. Inspired by recent latent reasoning methods in language models, we investigate a complementary perspective: can adaptation emerge from iterative reasoning on latent representations rather than from increasing parameter count alone? We introduce PERL (Parameter-Efficient Reasoning in CLIP Latent Space), a lightweight adaptation framework that augments a frozen CLIP model with a compact shared reasoning module applied recurrently across refinement steps. At each step, PERL generates a latent reasoning token conditioned on the current representation and injects it into an intermediate encoder layer, progressively refining higher-level semantic representations while preserving CLIP's pretrained multimodal structure. Across 15 benchmarks spanning base-to-novel generalization, cross-dataset transfer, and out-of-distribution ImageNet variants, PERL achieves the best parameter-performance trade-off among the compared methods under a fast-adaptation few-shot setting, combining strong novel-class accuracy and competitive transfer performance with only about 6K trainable parameters, up to 817x fewer than the largest compared approach. Overall, our results suggest that iterative latent reasoning provides a complementary adaptation mechanism to parameter scaling in discriminative vision-language models.

2605.18445 2026-05-20 cs.CV cs.AI cs.CL cs.LG

What's Holding Back Latent Visual Reasoning?

André G. Viveiros, Nuno Gonçalves, André F. T. Martins, Matthias Lindemann

Affiliations: Instituto Superior Técnico, Universidade de Lisboa; Instituto de Telecomunicações; TransPerfect; Carnegie Mellon University

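AI summary: This study examines how existing models use latent tokens and finds that they play a minimal causal role in the final prediction. Two issues are identified: oracle latent tokens in most existing datasets add little information beyond the original image, and latent tokens generated at inference time deviate from their oracle representations. Progress requires high-quality data with informative intermediate steps and more precise latent token prediction.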

Abstract

Humans can approach complex visual problems by mentally simulating intermediate visual steps, rather than reasoning through language alone. Inspired by this, several works on Vision-Language Models have recently explored chain-of-thought reasoning with continuous latent tokens as intermediate visual imagination steps. In this work, we investigate how recent models leverage such latent tokens. Surprisingly, we find that model accuracy is unaffected when latent tokens are replaced by uninformative dummy tokens. This indicates that latent tokens play a minimal causal role in the model's final prediction. To better understand this phenomenon, we analyze both the training signal provided by oracle latent representations and the quality of the latent tokens generated at inference time. Our experiments reveal two crucial issues holding back latent visual reasoning: First, in most existing datasets, oracle latent tokens provide limited additional information beyond the original image and do not substantially simplify the task, leading models to ignore them during training and effectively bypass them at inference time. When fine-tuned on a diagnostic dataset, in which latent tokens provide sufficient support for the final prediction, we show that models can causally rely on them. Second, the latent tokens produced at inference time deviate from their corresponding oracle representations, collapsing to a narrow region and preventing benefits even when the model relies on them. Overall, our findings suggest that future progress in latent visual reasoning depends on two key pillars: high-quality datasets with informative intermediate steps and more precise latent token prediction.

2605.18431 2026-05-20 cs.CV

Seeing Together: Multi-Robot Cooperative Egocentric Spatial Reasoning with Multimodal Large Language Models

Kunyu Peng, Zhikun Zhou, Kailun Yang, Di Wen, Ruiping Liu, Yufan Chen, Junwei Zheng, Hao Shi, Yi Zhou, M. Saquib Sarfraz, Danda Pani Paudel, Luc Van Gool

Affiliations: Karlsruhe Institute of Technology; Hunan University; University of Oxford; Zhejiang University; ETH Zurich; Ant Group

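AI summary: This paper studies multi-robot cooperative dynamic spatial reasoning. It introduces CoopSR, the first benchmark for the task, and EgoTeam, a multi-robot egocentric QA dataset, and proposes the SP-CoR framework for fine-grained cooperative spatial reasoning, which outperforms the strongest fine-tuned baseline by 3.87% on Habitat and 7.12% on iGibson.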

Details
AI-translated abstract

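Multimodal large language models (MLLMs) have made substantial progress in egocentric video understanding, but their ability to reason cooperatively from multiple embodied viewpoints remains largely unexplored. We study this problem through multi-robot cooperative dynamic spatial reasoning, in which a model must answer spatial, temporal, visibility, and coordination questions by integrating synchronized egocentric videos. To this end, we introduce CoopSR, the first benchmark for this task, together with EgoTeam, a multi-robot egocentric QA dataset. EgoTeam contains 114,227 QA pairs covering 19 question types, four difficulty tiers, and three team sizes in Habitat and iGibson, plus a real-world test set of about 2,326 questions. We further propose SP-CoR (Spectral and Physics-Informed Cooperative Reasoner), an MLLM framework for fine-grained cooperative spatial reasoning. SP-CoR combines dynamics-aware multi-robot frame sampling, spectral- and physics-guided view fusion, and physics-aligned prompt distillation, so that the model benefits from privileged robot-pose supervision during training while requiring only egocentric videos at test time. Compared across 22 MLLM baselines, SP-CoR outperforms the strongest fine-tuned baseline by 3.87% on Habitat and 7.12% on iGibson. It also generalizes better to unseen team sizes and to real-world robot tests. Code is available at https://github.com/KPeng9510/seeing-together.git.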

Original English abstract

Multimodal Large Language Models (MLLMs) have made substantial progress in egocentric video understanding, but their ability to reason cooperatively from multiple embodied viewpoints remains largely unexplored. We study this problem through multi-robot cooperative dynamic spatial reasoning, where a model must answer spatial, temporal, visibility, and coordination questions by integrating synchronized egocentric videos from a team of moving robots. To support this setting, we introduce CoopSR, the first benchmark for this task, together with EgoTeam, a multi-robot egocentric QA dataset. EgoTeam contains 114,227 QA pairs spanning 19 question types, four difficulty tiers, and three team sizes in Habitat and iGibson, along with a real-world test set of around 2,326 QAs collected using two quadruped robots. We further propose SP-CoR (Spectral and Physics-Informed Cooperative Reasoner), an MLLM framework for fine-grained cooperative spatial reasoning. SP-CoR combines dynamics-aware multi-robot frame sampling, spectral- and physics-guided view fusion, and physics-aligned prompt distillation, enabling the model to benefit from privileged robot-pose supervision during training while requiring only egocentric videos at test time. Across 22 MLLM baselines, SP-CoR consistently improves cooperative reasoning, outperforming the strongest fine-tuned baseline by +3.87% on Habitat and +7.12% on iGibson. It also shows stronger generalization to unseen team sizes and real-world robot tests. Code can be found at https://github.com/KPeng9510/seeing-together.git.

2605.18413 2026-05-20 cs.CV

Cracks in the Foundation: A Civil Infrastructure Dataset to Challenge Vision Foundation Models

Nicola Farronato, Niccolo Avogaro, Thomas Frick, Mattia Rigotti, Rizwan Ullah Khan, Michele Magno, Konrad Schindler, Cristiano Malossi, Florian Scheidegger

Affiliations: IBM Research; ETH Zürich; University of Twente

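AI summary: This paper introduces the Cracks in the Foundation (CiF) dataset, whose high-resolution images challenge vision foundation models on dense image understanding for civil infrastructure and expose the limitations of existing models in real-world settings.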

Details
AI-translated abstract

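Automated structural health monitoring is essential for preventing catastrophic infrastructure failures. Precise pixel-level defect segmentation is needed to assess structural integrity accurately, but progress has been held back by an extreme scarcity of data, because annotation requires costly expert effort. The need for data is accentuated by algorithmic hurdles intrinsic to the problem, such as center bias and the need to rely more on shape when inspecting nearly textureless building materials. To remove this bottleneck, we introduce Cracks in the Foundation (CiF), the largest and most detailed civil infrastructure (instance) segmentation dataset to date, comprising about 150,000 high-resolution images carefully curated over five years in collaboration with civil engineering experts. With this unprecedented data source, we expose a blind spot of current visual AI: despite the advent of promptable foundation models (FMs) and vision-language models (VLMs), and despite the strong performance of today's specialized segmentation models, dense image understanding in the built environment is far from solved. Our evaluations show that even the most recent zero-shot FMs face significant challenges when deployed on real-world infrastructure, and that even specialized models with domain-specific supervision plateau at about 25% mAP. CiF establishes civil infrastructure inspection, an elementary and seemingly easy perceptual task, as an open challenge that reveals fundamental weaknesses of present-day models trained predominantly on internet images, highlighting, literally and figuratively, the cracks in the current foundation model paradigm.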

Original English abstract

Automated structural health monitoring is essential to prevent catastrophic infrastructure failures. Precise, pixel-level defect segmentation is needed to accurately assess structural integrity, but progress in defect segmentation for civil infrastructures has been held back by an extreme scarcity of data, which requires costly expert annotation. The need for data is accentuated by algorithmic hurdles intrinsic to the problem, including center-bias and the need to rely more on shape when inspecting nearly textureless building materials. To remove the bottleneck, we introduce Cracks in the Foundation (CiF), the largest and most detailed civil infrastructure (instance) segmentation dataset to date, comprising ≈150,000 high-resolution images meticulously curated over five years in collaboration with civil engineering experts. With the help of this unprecedented data source, we expose a blind spot of current visual AI: despite the advent of promptable Foundation Models (FMs) and Vision Language Models (VLMs), and despite the impressive abilities of today's specialised segmentation models, it turns out that dense image understanding in the built environment is nowhere near solved. Our evaluations indicate that even the most recent zero-shot FMs face significant challenges when deployed on real-world infrastructure and even the performance of specialised models with domain-specific supervision plateaus at ≈25% mAP. CiF establishes inspection of civil infrastructure, an elementary and seemingly easy perceptual task, as an open challenge that reveals fundamental weaknesses of present-day models trained predominantly on internet images, literally and figuratively highlighting cracks in the current foundation model paradigm.

2605.18396 2026-05-20 cs.CV

NEWTON: Agentic Planning for Physically Grounded Video Generation

Yuxiang Feng, Juncheng Wang, Chao Xu, Yijie Qian, Huihan Wang, Wenlong Hou, Yang Liu, Baigui Sun, Yong Liu, Shujun Wang

Affiliations: Zhejiang University; The Hong Kong Polytechnic University; IROOTECH TECHNOLOGY; Sany Group

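AI summary: This paper proposes NEWTON, which demotes video generation from the system output to one action in an agent's toolbox. A learned planner orchestrates physics-aware tools to improve the physical plausibility of generated videos, raising joint accuracy on VideoPhy-2 from 21.4% to 29.7% with LTX-Video and from 30.7% to 37.4% with Veo-3.1.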

Comments: project page: https://Newton026.github.io/newton

Details
AI-translated abstract

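Video generation models produce visually compelling results but systematically violate physical commonsense: on VideoPhy-2, the best model reaches only 32.6% joint accuracy. We identify a specification bottleneck: a text prompt is a lossy compression of the physical world that omits the parameters that fully determine dynamics, and no amount of model scaling can recover what was never specified. From this diagnosis, we derive three properties that physics conditioning must satisfy (sufficiency, dynamism, and verifiability) and show that no existing method satisfies all three. We propose NEWTON, in which video generation is demoted from the system output to one action inside an agent's toolbox: a learned planner orchestrates physics-aware tools (keyframe generation, scientific computation, prompt refinement) to construct rich conditioning inputs, and a verifier closes the loop to enable iterative re-planning. The planner is the only trainable component and is optimized on-policy with Flow-GRPO inside the live multi-turn loop. On VideoPhy-2, NEWTON raises joint accuracy from 21.4% to 29.7% with LTX-Video and from 30.7% to 37.4% with Veo-3.1, without modifying either generator. Project page: https://Newton026.github.io/newton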

Original English abstract

Video generation models produce visually compelling results but systematically violate physical commonsense -- on VideoPhy-2, the best model achieves only 32.6% joint accuracy. We identify a specification bottleneck: text prompts are lossy compression of the physical world, omitting the parameters that fully determine dynamics, and no amount of model scaling can recover what was never specified. From this diagnosis we derive three properties that physics conditioning must satisfy -- sufficiency, dynamism, and verifiability -- and show that no existing approach satisfies all three. We present NEWTON, in which video generation is demoted from the system output to one action inside an agent's toolbox: a learned planner orchestrates physics-aware tools (keyframe generation, scientific computation, prompt refinement) to construct rich conditioning, and a verifier closes the loop for iterative re-planning. The planner is the sole trainable component, optimized on-policy via Flow-GRPO inside the live multi-turn loop. On VideoPhy-2, NEWTON improves joint accuracy from 21.4% to 29.7% on LTX-Video and from 30.7% to 37.4% on Veo-3.1, without modifying either generator. Our project page: https://Newton026.github.io/newton