arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.22177 2026-05-22 cs.LG cs.CL

Maestro: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles

Jinyang Wu, Guocheng Zhai, Ruihan Jin, Yuhao Shen, Zhengxi Lu, Fan Zhang, Haoran Luo, Zheng Lian, Zhengqi Wen, Jianhua Tao

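Affiliations: Tsinghua University; Zhejiang University; The Chinese University of Hong Kong; Nanyang Technological University; Tongji University

AI summary: This paper proposes Maestro, a framework that uses reinforcement learning to orchestrate a hierarchical model-skill ensemble, improving performance on multimodal tasks with an efficient and general coordination policy.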

English abstract

The proliferation of large language models (LLMs) and modular skills has endowed autonomous agents with increasingly powerful capabilities. Existing frameworks typically rely on monolithic LLMs and fixed logic to interface with these skills. This gives rise to a critical bottleneck: different LLMs offer distinct advantages across diverse domains, yet current frameworks fail to exploit the complementary strengths of models and skills, thereby limiting their performance on downstream tasks. In this paper, we present Maestro (Multimodal Agent for Expert-Skill Targeted Reinforced Orchestration), a Reinforcement Learning (RL)-driven orchestration framework that reframes heterogeneous multimodal tasks as a sequential decision-making process over a hierarchical model-skill registry. Rather than consolidating all knowledge into a single model, Maestro trains a lightweight policy to dynamically compose ensembles of frozen expert models and a two-tier skill library, deciding at each step whether to invoke an external expert, which model-skill pair to select, and when to terminate. The policy is optimized via outcome-based RL, requiring no step-level supervision. We evaluate Maestro across ten representative multimodal benchmarks spanning mathematical reasoning, chart understanding, high-resolution perception, and domain-specific analysis. With only a 4B orchestrator, Maestro achieves an average accuracy of 70.1%, surpassing both GPT-5 (69.3%) and Gemini-2.5-Pro (68.7%). Crucially, the learned coordination policy generalizes to unseen models and skills without retraining: augmenting the registry with out-of-domain experts yields a 59.5% average on four challenging benchmarks, outperforming all closed-source baselines. Maestro further maintains high computational efficiency with low latency. The source code is available at https://github.com/jinyangwu/Maestro.

2605.22176 2026-05-22 cs.AI

LLM-Metrics: Measuring Research Impact Through Large Language Model Memory

Si Shen, Wenhua Zhao, Danhao Zhu

Affiliations: School of Economics and Management; College of Information Management; Department of Computer Science and Engineering; Nanjing Agricultural University; Nanjing University of Science and Technology; Department of Criminal Science and Technology; Jiangsu Police Institute

AI summary: This paper proposes LLM-Metrics, a research-impact metric based on the parametric memory of large language models. Using multiple-choice probes on 549 computer science papers from 2023-2024, it finds that high-impact papers receive greater exposure in the academic community and therefore leave stronger parametric memory through LLM training data, which correlates significantly with citation counts.

Comments 25 pages, 5 figures

English abstract

Citation counts remain the dominant metric for assessing research impact, yet they suffer from well-documented limitations: temporal lag, disciplinary bias, and Matthew effects. Here we propose LLM-Metrics, a research-impact assessment metric derived from the parametric memory of large language models (LLMs). The central hypothesis is that high-impact papers receive greater exposure in the academic community, that this exposure enters LLM training data in textual form, and that models consequently form stronger parametric memory of these papers. We designed four types of multiple-choice probes, covering title recognition, author recognition, method recognition, and venue recognition, and evaluated 549 computer science papers published in 2023-2024 across 17 LLMs spanning 0.5B to 72B parameters from six vendors. Of the 17 models, 15 produced positive predictions, 9 of which were significant at p less than 0.05, with an overall Spearman correlation of rho = 0.1495 and p = 0.0004 against citation counts. Three additional findings support the proposed mechanism. First, the predictive signal was stronger for 2024 papers, rho = 0.1880, whose citation counts were near zero at model-training time, reducing the plausibility of a simple reverse-causality explanation. Second, author-recognition probes showed the strongest discriminative power, consistent with an exposure-driven memory mechanism. Third, model scale and predictive power were non-monotonic: a 3B-parameter model, Llama-3.2-3B-Instruct, with rho = 0.1829, outperformed most larger models, supporting a selective-memory hypothesis in which the limited capacity of smaller models can serve as an effective information filter. LLM-Metrics offers a real-time, cross-disciplinary, citation-independent paradigm for research assessment.
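
The Spearman correlation reported above is the Pearson correlation of rank vectors. The following is a minimal sketch of that computation, not the authors' evaluation code; the toy `probe_accuracy` and `citations` values are hypothetical and chosen only to illustrate tie handling.

```python
from statistics import mean

def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # extend the group while the next sorted value ties with the current one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # average 1-based rank of the tie group
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks

def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical toy data: per-paper probe accuracy and citation counts.
probe_accuracy = [0.25, 0.50, 0.75, 0.50, 1.00]
citations = [3, 10, 40, 8, 120]
rho = spearman_rho(probe_accuracy, citations)
print(f"rho = {rho:.4f}")
```

In practice a statistics library would also supply the p-value; this sketch covers only the coefficient.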

2605.22170 2026-05-22 cs.CL

Do Factual Recall Mechanisms Carry over from Text to Speech in Multimodal Language Models?

Luca Modica, Filip Landin, Mehrdad Farahani, Livia Qian, Gabriel Skantze, Richard Johansson

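Affiliations: Zenseact; Unbox AI; Chalmers University of Technology; University of Gothenburg; KTH Royal Institute of Technology

AI summary: This study examines whether factual recall mechanisms in multimodal language models carry over between the text and speech modalities, using causal mediation analysis to reveal differences between speech-to-text and text-to-text settings in how facts are stored and recalled.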

Comments In *SEM 2026, the 15th Joint Conference on Lexical and Computational Semantics

English abstract

In recent years, several Speech Language Models (SLMs) that represent speech and written text jointly have been presented. The question then arises of how model-internal mechanisms are similar and different when operating in the two modalities. We focus on how these systems encode, store, and retrieve factual knowledge, which has previously been investigated for text-only models. To investigate mechanisms behind the storage and recall of factual associations in SLMs, we leverage Causal Mediation Analysis, a technique previously applied to text-based models. Initial results using SpiritLM, a multimodal model integrating discrete speech tokens, reveal discrepancies between text-to-text and speech-to-text results, suggesting that the emergent mechanisms for factual recall are only partially carried over from the text to the speech modality. These results advance our understanding of how internal mechanisms encode factual associations in SLMs while contributing insights for improving speech-enabled AI systems.

2605.22169 2026-05-22 cs.CV

Balancing Uncertainty and Diversity of Samples: Leveraging Diversity of Least, High Confidence Samples for Effective Active Learning

Vipul Arya, S. H. Shabbeer Basha, Srikrishna U N, Sunainha Vijay, Snehasis Mukherjee

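Affiliations: School of Computer Science and Engineering, RV University; School of Engineering & Technology, Vidyashilp University; Shiv Nadar Institution of Eminence

AI summary: This paper proposes new hybrid sampling methods that select both easy and hard samples while enforcing diversity, to make active learning more effective. Experiments show that the proposed Least Confident and Diverse (LCD) method outperforms existing methods; selecting uncertain and diverse instances helps the model learn more distinct features.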

English abstract

Deep learning models, including Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs), have achieved state-of-the-art performance on various computer vision tasks such as object classification, detection, segmentation, generation, and many more. However, these models are data-hungry as they require more training data to learn millions or billions of parameters. Especially for supervised learning tasks, curating a large number of labeled samples for model training is an expensive and time-consuming task. Active Learning (AL) has been used to address this problem for many years. Existing active learning methods aim at choosing the samples for annotation from a pool of unlabeled samples that are either diverse or uncertain. Choosing such samples may hinder the model's performance as we pool based on one dimension, i.e., either diverse or uncertain. In this paper, we propose four novel hybrid sampling methods for pooling both easy and hard samples, which are also diverse. To verify the efficacy of the proposed methods, extensive experiments are conducted using high and low-confidence samples separately. We observe from our experiments that the proposed hybrid sampling method, Least Confident and Diverse (LCD), consistently performs better compared to state-of-the-art methods. It is observed that selecting uncertain and diverse instances helps the model learn more distinct features. The codes related to this study will be available at https://github.com/XXX/LCD.
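
The abstract does not describe how LCD combines the two criteria, so the following is only one plausible sketch of "uncertain and diverse" selection, not the authors' method. It shortlists samples by least-confidence score and then picks greedily by farthest-point distance; `select_uncertain_and_diverse`, `pool_factor`, and the toy data are all hypothetical.

```python
import math

def least_confidence(probs):
    """Uncertainty score: 1 minus the top predicted class probability."""
    return 1.0 - max(probs)

def select_uncertain_and_diverse(features, probs, budget, pool_factor=3):
    """Pick up to `budget` sample indices: shortlist the most uncertain samples,
    then choose greedily so each pick is far from those already chosen."""
    ranked = sorted(range(len(probs)),
                    key=lambda i: least_confidence(probs[i]), reverse=True)
    shortlist = ranked[: budget * pool_factor]
    chosen = [shortlist[0]]  # start from the single most uncertain sample
    while len(chosen) < min(budget, len(shortlist)):
        remaining = [i for i in shortlist if i not in chosen]
        # farthest-point step: maximise distance to the nearest chosen sample
        best = max(remaining,
                   key=lambda i: min(math.dist(features[i], features[j]) for j in chosen))
        chosen.append(best)
    return chosen

# Hypothetical unlabeled pool: 2-D features and softmax outputs for 2 classes.
features = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (9.0, 0.0)]
probs = [(0.5, 0.5), (0.55, 0.45), (0.6, 0.4), (0.9, 0.1), (0.99, 0.01)]
picked = select_uncertain_and_diverse(features, probs, budget=2, pool_factor=2)
print(picked)
```

With this toy pool, the second pick comes from a different region of feature space than the first, even though a nearer sample is more uncertain; that trade-off is the point of combining the two criteria.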

2605.22168 2026-05-22 cs.AI cs.LG

Measuring Cross-Modal Synergy: A Benchmark for VLM Explainability

Joël Roman Ky, Salah Ghamizi, Maxime Cordy

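Affiliations: University of Luxembourg; Luxembourg Institute of Health (LIH)

AI summary: This paper proposes Synergistic Faithfulness as a metric of cross-modal synergy in VLMs, addressing the shortcomings of conventional unimodal evaluation of VLM explainability. Built on the Shapley Interaction Index, it evaluates multimodal synergy accurately while reducing computational cost.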

English abstract

Vision-Language Models (VLMs) map complex visual inputs to semantic spaces, but interpreting the cross-modal reasoning of VLMs currently relies on post-hoc explainers evaluated via unimodal perturbation metrics. We expose a limitation in this paradigm: because multimodal datasets contain language priors and modality biases, VLMs frequently exhibit cross-modal redundancy, allowing them to answer visual queries using text alone. Consequently, unimodal metrics penalize faithful explainers, triggering an evaluation collapse where visual and textual rankings fundamentally contradict each other (Kendall's $\tau = -0.06$). To resolve this, we introduce Synergistic Faithfulness ($\mathcal{F}_{syn}$), a scalable metric rooted in the Shapley Interaction Index that strictly isolates the joint Harsanyi dividend between modalities, serving as a highly accurate surrogate ($\rho = 0.92$) while achieving a $24\times$ computational speedup. Evaluating 8 distinct XAI methods across 3 VLM architectures and 3 benchmark datasets reveals that explainers proposed for VLMs heavily over-index on visual salience and significantly underperform adapted attention-based methods in capturing true cross-modal synergy. By decoupling visual plausibility from cross-modal faithfulness, this work provides a rigorous evaluation framework required to safely audit VLM reasoning in high-stakes deployments.
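
For intuition only: if the image and the text are treated as two whole players, the Shapley interaction index of the pair reduces to v(both) - v(image) - v(text) + v(none), which is also the Harsanyi dividend of that coalition. The abstract does not spell out how $\mathcal{F}_{syn}$ is computed, so this toy sketch is not the paper's metric; the function name and the score values are hypothetical.

```python
def pairwise_interaction(v_none, v_image, v_text, v_both):
    """Two-player Shapley interaction index, which equals the Harsanyi
    dividend of the {image, text} coalition."""
    return v_both - v_image - v_text + v_none

# Hypothetical model scores for the correct answer under each masking.
# Redundant case: either modality alone almost suffices.
redundant = pairwise_interaction(v_none=0.1, v_image=0.8, v_text=0.8, v_both=0.9)
# Synergistic case: the answer is only recoverable with both modalities.
synergistic = pairwise_interaction(v_none=0.1, v_image=0.15, v_text=0.2, v_both=0.9)
print(redundant, synergistic)
```

A negative value signals redundancy between the modalities and a positive value signals synergy, which is the distinction the abstract says unimodal perturbation metrics miss.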

2605.22164 2026-05-22 cs.LG cs.RO

Beyond Euclidean Proximity: Repairing Latent World Models with Horizon-Matched Trajectory Reachability Metrics

Liangyu Li, Shengzhi Wang, Qingwen Liu

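Affiliations: Tongji University

AI summary: This paper proposes trajectory reachability metrics (TRM), a post-hoc terminal-ranking method for fixed latent world models; training a small pairwise head improves terminal ranking and thereby performance on continuous manipulation tasks.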

Comments 26 pages, 7 figures

English abstract

Latent world models can contain the state needed for control, yet their terminal-cost interface can expose the planner to the wrong decision-relevant information. In common latent MPC, candidate sequences are ranked by Euclidean distance between predicted terminal and goal latent states; this assumes that raw latent distance weights reachability-relevant variables correctly. We propose trajectory reachability metrics (TRM), a post-hoc terminal-ranking method for fixed latent world models. TRM trains a small pairwise head from logged trajectory structure and uses it as a replacement or hybrid cost; the encoder, dynamics, sampler, optimizer, and evaluation manifests remain fixed. The key design choice is horizon-aware supervision: the metric is trained on broad, balanced temporal separations to match the long-horizon terminal candidate ranking problem. On a hard TwoRoom benchmark, raw latent planning with LeWorldModel (LeWM) reaches 7.0% success, while full-horizon TRM reaches 97.0%; shuffled temporal-label controls stay at 0.0%. The same recipe improves a PLDM baseline from 32.7% to 84.0% across three seeds, and a short-horizon TRM variant reaches only 35.0% with the 100,000 pair budget. In TwoRoom, we provide mechanistic evidence for why TRM works: XY position is linearly decodable (R^2=0.998), yet raw latent MSE misranks candidates; the XY-probe rowspace accounts for less than 1% of terminal-goal latent MSE but carries most candidate-quality signal; and SCSA audits show that TRM improves the ordering and selected endpoint seen by the planner. On PushT go50/go75, TRM-style task-state metrics improve SCSA ranking and selected final distance more cleanly than closed-loop success, motivating auxiliary hybrid costs in continuous manipulation. TRM is the planner-facing repair, and audits explain when terminal reachability metrics should replace or augment raw latent proximity.
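
The ranking issue described above can be illustrated with a toy example: when task-irrelevant latent dimensions dominate Euclidean distance, the candidate at the right position is ranked last. This is a sketch under stated assumptions, not the paper's code; the three-dimensional latent layout and the hand-written `xy_only` cost (a stand-in for a learned pairwise head) are hypothetical.

```python
import math

def rank_by_euclidean(terminal_latents, goal_latent):
    """Return candidate indices ordered from closest to farthest from the goal."""
    return sorted(range(len(terminal_latents)),
                  key=lambda i: math.dist(terminal_latents[i], goal_latent))

def rank_by_metric(terminal_latents, goal_latent, metric):
    """Same ranking with a pluggable pairwise cost (stand-in for a learned head)."""
    return sorted(range(len(terminal_latents)),
                  key=lambda i: metric(terminal_latents[i], goal_latent))

# Hypothetical latent layout: (x, y, nuisance).
goal = (1.0, 1.0, 0.0)
candidates = [
    (1.0, 1.0, 5.0),  # correct position, large nuisance offset
    (3.0, 3.0, 0.0),  # wrong position, matching nuisance
]

def xy_only(a, b):
    """Hand-written cost that looks only at the position coordinates."""
    return math.dist(a[:2], b[:2])

raw_order = rank_by_euclidean(candidates, goal)
metric_order = rank_by_metric(candidates, goal, xy_only)
print(raw_order, metric_order)
```

The raw ranking prefers the wrong-position candidate, while the position-aware cost reverses the order without changing the candidates themselves, mirroring the abstract's point that only the terminal cost needs repair.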

2605.22158 2026-05-22 cs.AI cs.CV

ST-SimDiff: Balancing Spatiotemporal Similarity and Difference for Efficient Video Understanding with MLLMs

Bingjun Luo, Tony Wang, Chaoqi Chen, Xinpeng Ding

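Affiliations: Tsinghua University; Shenzhen University; Xidian University

AI summary: This paper proposes ST-SimDiff, a framework that improves video understanding efficiency by balancing spatiotemporal similarity and difference, using a spatio-temporal graph and a dual-selection strategy to cut computational cost and improve performance.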

Comments Accepted by ICLR 2026

English abstract

Multimodal Large Language Models (MLLMs) face significant computational overhead when processing long videos due to the massive number of visual tokens required. To improve efficiency, existing methods primarily reduce redundancy by pruning or merging tokens based on importance or similarity. However, these approaches largely overlook a critical dimension of video content, i.e., changes and turning points, and they lack a collaborative model for spatio-temporal relationships. To address this, we propose a new perspective: similarity is for identifying redundancy, while difference is for capturing key events. Based on this, we design a training-free framework named ST-SimDiff. We first construct a spatio-temporal graph from the visual tokens to uniformly model their complex associations. Subsequently, we employ a parallel dual-selection strategy: 1) similarity-based selection uses community detection to retain representative tokens, compressing static information; 2) temporal difference-based selection precisely locates content-changing points to preserve tokens that capture key dynamic shifts. This allows the framework to preserve both static and dynamic content with a minimal number of tokens. Extensive experiments show that our method significantly outperforms state-of-the-art approaches while substantially reducing computational costs. Our code is available at https://github.com/bingjunluo/ST-SimDiff.
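
The temporal-difference idea can be sketched at the frame level: flag positions where consecutive features differ sharply. This is a simplified illustration, not the ST-SimDiff implementation (which operates on a token graph); the per-frame feature vectors, the cosine-distance criterion, and the `threshold` value are all assumptions.

```python
import math

def cosine_distance(a, b):
    """1 minus cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def change_points(frame_features, threshold):
    """Indices of frames whose feature differs from the previous frame
    by more than `threshold` in cosine distance."""
    return [t for t in range(1, len(frame_features))
            if cosine_distance(frame_features[t - 1], frame_features[t]) > threshold]

# Hypothetical per-frame features: a static scene, then an abrupt change.
frames = [(1.0, 0.0), (1.0, 0.0), (0.0, 1.0), (0.0, 1.0)]
kept = change_points(frames, threshold=0.5)
print(kept)
```

Only the frame at the scene change is flagged; the static frames on either side are the redundancy that the similarity-based branch would compress instead.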

2605.22156 2026-05-22 cs.LG cs.AI

One-Way Policy Optimization for Self-Evolving LLMs

Shuo Yang, Jinda Lu, Kexin Huang, Chiyu Ma, Shaohang Wei, Yuyang Liu, Guoyin Wang, Jingren Zhou, Li Yuan

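Affiliations: Shenzhen Graduate School, Peking University; Dartmouth College; Alibaba

AI summary: This paper proposes One-Way Policy Optimization, which decouples optimization direction from update magnitude to address the training instability caused by sparse verifier rewards in conventional methods, enabling continuous self-evolution of large language models.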

English abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has become a promising paradigm for scaling reasoning capabilities of Large Language Models (LLMs). However, the sparsity of binary verifier rewards often leads to low efficiency and optimization instability. To stabilize training, existing methods typically impose token-level constraints relative to a reference policy. We identify that such constraints penalize deviations indiscriminately; this can flip the verifier-determined direction when the policy attempts to outperform the reference, thereby suppressing gains. To resolve this, we propose One-Way Policy Optimization (OWPO), a method based on the principle of decoupling optimization direction from update magnitude. In OWPO, the verifier dictates the update direction, while the reference policy serves only to adjust the magnitude. Specifically, OWPO applies asymmetric reweighting: it performs Accelerated Alignment for inferior deviations (where the policy lags behind the reference) and Gain Locking for superior deviations (where the policy surpasses the reference). Furthermore, by incorporating iterative reference updates, OWPO creates a "Ratchet Effect" that continuously consolidates gains. Experimental results demonstrate that OWPO outperforms strong baselines, including DAPO, OPD, and MOPD, breaking the bottleneck of fixed priors to enable continuous self-evolution without reliance on external reference models.

2605.22155 2026-05-22 cs.LG

Algebraic Machine Learning for Small-to-Medium Datasets Is Competitive against Strong Standard Baselines

David Mendez, Fernando Martin-Maroto, Gonzalo G. de Polavieja

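Affiliations: Mathematics of Behavior and Intelligence Lab; Champalimaud Foundation

AI summary: This paper studies Algebraic Machine Learning on small-to-medium datasets and finds that it competes with strong baselines such as CNNs on image and tabular classification without requiring cross-validation.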

Comments 9 pages, 4 figures

English abstract

Symbolic methods are generally not considered competitive with strong modern learners on realistic supervised tasks. We evaluate Algebraic Machine Learning (AML), a framework that learns through subdirect decomposition of algebraic structure rather than numerical optimization, against standard baselines on image and tabular classification across varying training-set sizes. We find that AML trained only on training data without using validation or cross-validation outperforms a family of cross-validated baseline methods including CNNs on small to medium image datasets (50--2000 training examples). On tabular datasets in the same size range, XGBoost is overall the best performing method, but AML is nonetheless comparable to methods incorporating task-specific biases such as LightGBM and random forests. AML achieves this competitive performance across two very different types of datasets using a generic algebraic inductive bias, rather than the modality-specific biases built into standard baselines like CNNs for images or XGBoost for tabular data, and requires no cross validation because it has no task-dependent hyperparameters to tune.

2605.22154 2026-05-22 cs.AI

IdleSpec: Exploiting Idle Time via Speculative Planning for LLM Agents

Daewon Choi, Kyunghyun Park, Woomin Song, Saket Dingliwal, Sai Muralidhar Jayanthi, Jinwoo Shin, Aram Galstyan

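Affiliations: KAIST; Amazon AGI; Together AI

AI summary: This paper proposes IdleSpec, a method that uses idle time to improve LLM agent performance: it generates plan candidates during idle periods while keeping latency overhead low, yielding significant gains across a range of agentic scenarios.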

English abstract

Large language model (LLM)-based agents solve complex tasks by leveraging multi-step reasoning with iterative tool calls and environment interactions, which incur idle time while waiting for observations. Despite the prevalence of idle time in most agentic scenarios, existing works treat it as an unavoidable overhead or propose restricted solutions that overlook varying computational budgets across different tool calls and future observation uncertainty, thereby leading to suboptimal utilization of idle time. In this paper, we introduce IdleSpec, a scalable and generic inference approach that leverages idle-time computation to improve agent performance while minimizing latency overhead. Specifically, IdleSpec iteratively generates plan candidates during idle periods and, once observations become available, aggregates them to guide the next reasoning step. For effective plan generation under observation uncertainty, IdleSpec samples between complementary drafting strategies (i.e., progressive and recovery) from a learned distribution that is updated via posterior feedback. Our experiments demonstrate that IdleSpec significantly improves agent performance in various agentic scenarios by effectively utilizing idle time. In particular, on GAIA and FRAMES, IdleSpec achieves 55.6% average accuracy with Gemini-2.5-Flash, surpassing the vanilla baseline without idle-time usage by 5.1%. Furthermore, on MLE-Bench, which involves substantial delay from code executions, IdleSpec achieves performance gains of up to 9.1% on the Any Medal rate, highlighting its generalizability to long-horizon tasks.

2605.22148 2026-05-22 cs.AI cs.CL

Ratchet: A Minimal Hygiene Recipe for Self-Evolving LLM Agents

Xing Zhang, Yanwei Cui, Guanghui Wang, Ziyuan Li, Wei Qiu, Bing Zhu, Peiyang He

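Affiliations: AWS Generative AI Innovation Center; HSBC Holdings Plc.; HSBC Technology Center, China

AI summary: This paper proposes Ratchet, a single-agent loop in which a frozen LLM writes, retrieves, curates, and retires its own natural-language skills. It integrates four hygiene mechanisms to manage the skill library's lifecycle, significantly improving performance on the MBPP+ hard-100 dataset.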

Comments 16 pages, 2 figures, 6 tables. Extends arXiv:2605.19576 with the SWE-bench Verified evaluation and a non-divergence analysis (Proposition 1)

English abstract

Self-evolving skill libraries, pioneered by Voyager, let frozen LLM agents accumulate reusable knowledge without weight updates, yet recent evaluation shows that LLM-authored skills deliver +0.0 pp over no-skill baselines while human-curated ones deliver +16.2 pp: the bottleneck is not skill authoring but lifecycle management. We introduce Ratchet, a single-agent loop in which a frozen LLM writes, retrieves, curates, and retires its own natural-language skills. Ratchet integrates four candidate hygiene mechanisms: outcome-driven retirement, a bounded active-cap, meta-skill authoring guidance, and pattern canonicalisation. On MBPP+ hard-100 with Claude Opus 4.7, Ratchet lifts held-out pass@1 from a 0.258 ± 0.047 baseline to a late-window rolling mean of 0.584 (peak 0.658 ± 0.042) across 100 rounds and 3 seeds, a +0.328 ± 0.018 rolling-mean gain where the no-skill control drifts at +0.002 ± 0.005; the same recipe transfers to an agentic solver on SWE-bench Verified (+0.22 peak lift over 20 rounds). Eight ablations (A1-A8) reveal that the minimal working recipe is smaller than our design suggests: retirement and the meta-skill authoring prior are load-bearing, while explicit deduplication (canonicalisation, cover-guard) is subsumed by the meta-skill itself. A non-divergence proposition shows that bounded cap and retirement threshold together prevent expected performance from drifting below the no-skills floor.

2605.22147 2026-05-22 cs.CV

Flow-based Gaussian Splatting for Continuous-Scale Remote Sensing Image Super-Resolution

Jiangwei Mo, Xi Lu, Hanlin Wu

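Affiliations: School of Information Science and Technology, Beijing Foreign Studies University

AI summary: This paper proposes FlowGS, a framework that combines flow matching with Gaussian splatting to achieve arbitrary-scale super-resolution of remote sensing images, improving generation efficiency and quality.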

English abstract

High-resolution remote sensing images (RSIs) are crucial for Earth observation applications, yet acquiring them is often limited by sensor constraints and costs. In recent years, generative super-resolution (SR) methods, particularly diffusion models, have made significant progress. However, they typically require slow iterative inference with 40--1000 steps and exhibit limited flexibility in continuous-scale SR settings. To address these issues, we propose FlowGS, a generative reconstruction framework for arbitrary-scale SR of RSIs. FlowGS models the high-frequency detail representations between high- and low-resolution images and learns a continuous probability flow from noise to detail priors via flow matching (FM) constrained by shortcut consistency, thereby reducing generative complexity and improving inference efficiency. Additionally, we employ 2D Gaussian splatting to construct a continuous feature field, thereby enabling flexible reconstruction at arbitrary query locations. Experimental results show that FlowGS delivers competitive perceptual quality compared with existing methods in both continuous-scale and fixed-scale SR settings, with substantially improved inference efficiency.

2605.22144 2026-05-22 cs.CV

One Sentence, One Drama: Personalized Short-Form Drama Generation via Multi-Agent Systems

Yufei Shi, Weilong Yan, Naixuan Huang, Yucheng Chen, Chenyu Zhang, Tao He, Si Yong Yeo, Ming Li

Affiliations: MedVisAI Lab, Lee Kong Chian School of Medicine, Nanyang Technological University; National University of Singapore; Beijing Institute of Technology; Tsinghua University; University of Electronic Science and Technology of China; Guangming Laboratory

AI summary: This paper proposes a multi-agent framework that turns a user's single-sentence idea into a complete short drama through structured intermediate modules and iterative refinement, addressing narrative pacing, spatial consistency, and production quality control in short-drama generation.

English abstract

Existing approaches for digital short-drama production typically rely on one-shot LLM generated scripts and loosely coupled pipelines, which fail to satisfy three key requirements of short-drama generation: (1) narrative pacing, resulting in weak hooks, insufficient escalation, and unattractive endings; (2) spatial consistency, leading to drifting scene layouts and inconsistent character positions across clips; and (3) production-level quality control, requiring extensive manual review and correction across script and visual stages. We present One Sentence, One Drama, a hierarchical multi-agent framework that transforms a user's single-sentence idea into a fully produced short drama through structured intermediate modules and iterative refinement. Our approach is built upon three key components: (1) a multi-agent debate-based story generation module that enforces short-drama pacing and narrative coherence; (2) a 3D-grounded first-frame generation mechanism that establishes a shared spatial reference for consistent character positioning and scene layout across clips; and (3) multi-stage reviewer loops that perform comprehensive error detection and targeted revision across script, visual, and video generation stages. We also introduce scene-level BGM matching and scene transition planning to improve the audience's immersive experience. To systematically evaluate this task, we introduce Short-Drama-Bench, a benchmark that extends standard video quality metrics with short-drama-specific criteria. Experimental results demonstrate that our method significantly outperforms existing pipelines in narrative quality, cross-clip consistency, and overall viewing experience.

2605.22140 2026-05-22 cs.CL

Psy-Chronicle: A Structured Pipeline for Synthesizing Long-Horizon Campus Psychological Counseling Dialogues

Psy-Chronicle: 一个用于合成长周期校园心理辅导对话的结构化流水线

Chaogui Gou, Jiarui Liang

发表机构 * University of Science and Technology Beijing(北京科技大学)

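AI summary This paper proposes Psy-Chronicle, a structured data-generation framework for synthesizing long-horizon campus psychological counseling dialogues. Using a semester-spanning temporal stress event graph and simulated interaction between a student agent and a counselor agent, the authors build the CPCD dataset (100 student profiles, 90,000 dialogues) and evaluate long-horizon campus counseling ability with CPCD-Bench. Experiments show that CPCD improves session-level response generation and long-horizon memory recall.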

详情
AI中文摘要

近年来,大语言模型在心理支持任务中展现出显著潜力。然而,现有心理辅导数据大多依赖于单轮问答或短多轮对话,难以刻画大学生心理困扰在校园生活事件中如何积累、交互并逐渐演变的长期过程。为解决这一问题,本文提出Psy-Chronicle,一种结构化数据生成框架,用于合成长周期校园心理辅导对话。我们生成一个跨越学期的时间压力事件图,以建模校园压力事件之间的时序顺序和演变依赖关系。通过学生代理与辅导员代理之间的交互模拟,以及结构化记忆整合机制,Psy-Chronicle生成具有连续性且跨越多个辅导会话的长周期对话。基于Psy-Chronicle,我们构建并开源了CPCD,一个中文长周期对话数据集,包含100个学生档案和90,000条辅导对话。我们进一步构建CPCD-Bench,从三个维度评估模型的长周期校园辅导能力:会话级响应、长周期记忆召回和时间因果推理。实验结果表明,CPCD有效提升了具有相同基础架构的模型的会话级响应生成和长周期记忆召回能力。同时,时间因果推理的改进仍有限,表明事件链组织和因果解释是长周期心理辅导建模中的关键挑战。相关代码和数据可在:https://github.com/EdwinUSTB/Psy-Chronicle 获取。

英文摘要

In recent years, large language models have shown substantial potential in psychological support tasks. However, existing psychological counseling data mostly rely on single-turn question answering or short multi-turn dialogues, making it difficult to characterize how college students' psychological distress accumulates, interacts, and gradually evolves over long periods within campus life events. To address this issue, this paper proposes Psy-Chronicle, a structured data-generation framework for synthesizing long-horizon campus psychological counseling dialogues. We generate a semester-spanning temporal stress event graph to model the chronological order and evolutionary dependencies among campus stress events. Through interactive simulation between a student agent and a counselor agent, together with a structured memory integration mechanism, Psy-Chronicle generates long-horizon dialogues with continuity across counseling sessions. Based on Psy-Chronicle, we construct and open-source CPCD, a Chinese long-horizon dialogue dataset for college psychological counseling, containing 100 student profiles and 90,000 counseling dialogues. We further build CPCD-Bench to evaluate models' long-horizon campus counseling capabilities from three dimensions: session-level response, long-horizon memory recall, and temporal-causal reasoning. Experimental results show that CPCD effectively improves session-level response generation and long-horizon memory recall for models with the same base architecture. Meanwhile, improvements in temporal-causal reasoning remain limited, indicating that event-chain organization and causal explanation are key challenges in long-horizon psychological counseling modeling. The related code and data are available at: https://github.com/EdwinUSTB/Psy-Chronicle

2605.22139 2026-05-22 cs.CV

EventGait: Towards Robust Gait Recognition with Event Streams

EventGait: 向事件流中实现鲁棒的步态识别

Senyan Xu, Shuai Chen, Chuanfu Shen, Kean Liu, Zhijing Sun, Chengzhi Cao, Xueyang Fu

发表机构 * University of Science and Technology of China(中国科学技术大学) University of Electronic Science and Technology of China(电子科技大学)

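AI summary This paper proposes EventGait, an end-to-end dual-stream framework that uses event cameras to capture motion and shape information, improving the robustness of gait recognition under complex illumination and motion, and validates it on synthesized datasets and new benchmarks.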

详情
AI中文摘要

步态识别能够实现非侵入性和隐私保护的识别,但在不受控环境中由于传统相机的光照和运动敏感性而面临挑战。本文探讨了使用事件相机进行步态识别,事件相机提供微秒级时间分辨率和高动态范围,自然捕捉鲁棒的动态特征并抑制静态噪声。现有基于事件的方法通常将事件流聚合为事件图像,从而丢弃了对步态识别至关重要的细粒度运动动态。因此,我们提出了EventGait,一种端到端的双流框架,分别建模运动和形状,同时保留事件的优势。我们的动态流利用混合脉冲专家(MoSE)和多样化的神经元常数,以在复杂的运动和光照场景中实现稳健的动态感知,而静态流通过跨模态结构对齐(CroSA)学习密集的形状表示,使用大规模视觉基础模型。为了解决大规模基于事件的步态数据集的缺乏,我们引入了合成管道并发布了两个新的基准:SUSTech1K-E和CCGR-Mini-E。广泛的实验表明,基于事件的步态识别不仅在正常条件下实现了与基于相机的步态识别相当的结果,而且在低光场景中显著优于前者。我们的方法在合成和真实世界基于事件的步态基准上均达到了新的状态,突显了事件驱动步态分析的鲁棒性和潜力。代码和数据集已发布在https://github.com/QUEAHREN/EventGait。

英文摘要

Gait recognition enables non-intrusive, privacy-preserving identification but suffers in uncontrolled environments due to illumination and motion sensitivity of conventional cameras. In this work, we explore gait recognition using event cameras, which offer microsecond temporal resolution and high dynamic range, naturally capturing robust dynamic cues and suppressing static noise. Existing event-based approaches typically aggregate event streams into event images over long time windows, thereby discarding fine-grained motion dynamics critical for gait recognition. Therefore, we propose EventGait, an end-to-end dual-stream framework that separately models motion and shape while preserving the advantages of events. Our dynamic stream leverages a Mixture of Spiking Experts (MoSE) with diverse neuron constants for robust dynamic perception across complex motion and illumination scenes, while the static stream learns dense shape representations via Cross-modal Structure Alignment (CroSA) with large vision foundation models. To address the absence of large-scale event-based gait datasets, we introduce a synthesis pipeline and release two new benchmarks: SUSTech1K-E and CCGR-Mini-E. Extensive experiments have shown that event-based gait recognition not only achieves results comparable to camera-based gait recognition under normal conditions but also significantly outperforms it in low-light scenarios. Our approach sets a new state of the art on both synthesized and real-world event-based gait benchmarks, highlighting the robustness and potential of event-driven gait analysis. The code and datasets are released at https://github.com/QUEAHREN/EventGait.

2605.22138 2026-05-22 cs.AI cs.CL cs.LG cs.RO

Efficient Agentic Reasoning Through Self-Regulated Simulative Planning

通过自我调节模拟规划实现高效的代理推理

Mingkai Deng, Jinyu Hou, Lara Sá Neves, Varad Pimpalkhute, Taylor W. Killian, Zhengzhong Liu, Eric P. Xing

发表机构 * Institute of Foundation Models (IFM)(基础模型研究所) Carnegie Mellon University(卡内基梅隆大学)

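AI summary This paper proposes improving the efficiency of agentic reasoning by decomposing decision-making into three systems (simulative reasoning, self-regulation, and reactive execution) and reports the performance of the SR$^2$AM model across different tasks.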

Comments Code and model artifacts are available at https://github.com/sailing-lab/sr2am

详情
AI中文摘要

代理应该如何决定何时以及如何规划?主流方法将代理建模为具有自适应计算的反应策略(例如链式思考),通过端到端训练期望规划隐式地出现。由于无法控制规划的存在、结构或时间范围,这些系统显著增加了推理长度,导致无效的令牌使用,而没有可靠的准确性提升。我们主张高效的代理推理受益于将决策过程分解为三个系统:模拟推理(系统II)通过世界模型将推理根植于未来状态预测;自我调节(系统III)通过学习的配置器决定何时以及如何深入规划;以及反应执行(系统I)处理细粒度的动作。模拟推理在不同任务中提供统一的规划,而无需每个领域的工程,同时自我调节确保规划只在需要时被调用。为了测试这一点,我们开发了SR$^2$AM(Self-Regulated Simulative Reasoning Agentic LLM),在LLM的链式思考中实现这两个系统作为独立阶段,其中LLM作为世界模型。我们探索了两种实现:从提示的多模块系统中记录决策(v0.1)和从预训练推理LLM的痕迹中重建结构化计划(v1.0),通过监督学习和强化学习(RL)训练。在数学、科学、表格分析和网络信息检索中,v0.1-8B和v1.0-30B在性能上与120-355B和685B-1T参数系统相当,而v1.0-30B使用的推理令牌比同类代理LLM少25.8-95.3%。强化学习使平均规划时间增加22.8%,而规划频率仅增加2.0%,表明它学会了更远地规划而不是更频繁地规划。更广泛地说,学习的自我调节实例化了一个原则,我们预计可以扩展到代理如何管理自己的学习和适应。

英文摘要

How should an agent decide when and how to plan? A dominant approach builds agents as reactive policies with adaptive computation (e.g., chain-of-thought), trained end-to-end expecting planning to emerge implicitly. Without control over the presence, structure, or horizon of planning, these systems dramatically increase reasoning length, yielding inefficient token use without reliable accuracy gains. We argue efficient agentic reasoning benefits from decomposing decision-making into three systems: simulative reasoning (System II) grounding deliberation in future-state prediction via a world model; self-regulation (System III) deciding when and how deeply to plan via a learned configurator; and reactive execution (System I) handling fine-grained action. Simulative reasoning provides unified planning across diverse tasks without per-domain engineering, while self-regulation ensures the planner is invoked only when needed. To test this, we develop SR$^2$AM (Self-Regulated Simulative Reasoning Agentic LLM), realizing both as distinct stages within an LLM's chain-of-thought, with the LLM as world model. We explore two instantiations: recording decisions from a prompted multi-module system (v0.1) and reconstructing structured plans from traces of pretrained reasoning LLMs (v1.0), trained via supervised then reinforcement learning (RL). Across math, science, tabular analysis, and web information seeking, v0.1-8B and v1.0-30B achieve Pass@1 competitive with 120-355B and 685B-1T parameter systems respectively, while v1.0-30B uses 25.8-95.3% fewer reasoning tokens than comparable agentic LLMs. RL increases average planning horizon by 22.8% while planning frequency grows only 2.0%, showing it learns to plan further ahead rather than more often. More broadly, learned self-regulation instantiates a principle we expect to extend beyond planning to how agents govern their own learning and adaptation.

2605.22132 2026-05-22 cs.CV

Accelerating Vision Foundation Models with Drop-in Depthwise Convolution

通过可插拔的深度卷积加速视觉基础模型

Carmelo Scribano, Mohammad Mahdi, Nedyalko Prisadnikov, Yuqian Fu, Giorgia Franchini, Danda Pani Paudel, Marko Bertogna, Luc Van Gool

发表机构 * University of Modena and Reggio Emilia(摩德纳和雷吉奥艾米利亚大学) INSAIT, Sofia University ``St. Kliment Ohridski'', Sofia, Bulgaria(INSAIT,索菲亚大学『圣克莱门特·欧赫里迪斯』,索菲亚,保加利亚)

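AI summary This paper accelerates large-scale pretrained Vision Transformers by replacing some attention heads with a drop-in depthwise convolution layer while preserving feature extraction capability, achieving a 17-20% inference speedup on image classification and segmentation with minimal performance loss.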

Comments Accepted at ICPR 2026

详情
AI中文摘要

预训练的视觉基础模型在少量微调下即可在多种任务中取得优异性能。然而,其视觉Transformer(ViT)主干结构导致较高的推理开销,限制了在资源受限设备上的部署。在本文中,我们通过利用某些注意力头内在的卷积类行为,加速大规模预训练ViT的同时保持其特征提取能力。具体而言,我们引入了一个高效的基于深度卷积的层,作为这些头的可插拔替代方案。此外,我们提出了简单策略来识别可替换的头,并引入一种微调过程以恢复下游任务性能。在图像分类和分割任务中,我们的方法实现了17-20%的推理加速,且性能损失极小。我们通过详细的推导、广泛的实验和效率基准验证了该方法。参考实现已公开。

英文摘要

Pretrained vision foundation models deliver strong performance across tasks with limited fine-tuning. However, their Vision Transformer (ViT) backbones impose high inference costs, limiting deployment on resource-constrained devices. In this work, we accelerate large-scale pretrained ViTs while preserving their feature extraction capabilities by exploiting the intrinsic convolution-like behavior of some attention heads. Specifically, we introduce an efficient depthwise convolution-based layer that serves as a drop-in replacement for these heads. Additionally, we propose simple strategies to identify which heads can be replaced and introduce a fine-tuning procedure that recovers downstream task performance. Across both image classification and segmentation tasks, our method achieves a 17-20% inference speedup with minimal performance degradation. We validate the approach through detailed derivations, extensive experiments, and efficiency benchmarks. The reference implementation is publicly available.
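
To illustrate the kind of operation the abstract refers to, here is a minimal pure-Python sketch of a depthwise convolution over a grid of patch tokens, together with a rough multiply-accumulate (MAC) count for the token-mixing step. It is not the authors' layer: the function names, the 3x3 kernel, the zero padding, and the 14x14 grid with 64 channels per head are all assumptions chosen for illustration, and the cost estimate ignores projections and softmax.

```python
def depthwise_conv2d(x, kernels):
    """Depthwise 2D convolution with zero padding and stride 1.

    x:       token grid as nested lists, shape [H][W][C]
    kernels: one k x k filter per channel, shape [C][k][k] (k odd)
    """
    height, width, channels = len(x), len(x[0]), len(x[0][0])
    k = len(kernels[0])
    r = k // 2
    out = [[[0.0] * channels for _ in range(width)] for _ in range(height)]
    for i in range(height):
        for j in range(width):
            for c in range(channels):
                acc = 0.0
                for di in range(-r, r + 1):
                    for dj in range(-r, r + 1):
                        ii, jj = i + di, j + dj
                        if 0 <= ii < height and 0 <= jj < width:
                            # Each channel is filtered only by its own kernel.
                            acc += kernels[c][di + r][dj + r] * x[ii][jj][c]
                out[i][j][c] = acc
    return out


def attention_mixing_macs(num_tokens, head_dim):
    """MACs for Q @ K^T plus A @ V in a single attention head."""
    return 2 * num_tokens * num_tokens * head_dim


def depthwise_mixing_macs(num_tokens, channels, kernel_size):
    """MACs for one depthwise convolution over the same tokens."""
    return num_tokens * channels * kernel_size * kernel_size


# Toy check: on a 2x2 grid of ones, an all-ones 3x3 filter sums the
# in-bounds neighbours, which is all four cells at every position.
grid = [[[1.0], [1.0]], [[1.0], [1.0]]]
box = [[[1.0] * 3 for _ in range(3)]]
summed = depthwise_conv2d(grid, box)

# Assumed setting: 14 x 14 = 196 patch tokens, 64 channels per head.
attn_cost = attention_mixing_macs(196, 64)
conv_cost = depthwise_mixing_macs(196, 64, 3)
```

Under these assumptions the convolution needs far fewer MACs for token mixing than the head it stands in for, because its cost grows linearly rather than quadratically with the number of tokens. The end-to-end speedup reported in the abstract is much smaller than this per-head ratio, presumably because only some heads are replaced and the rest of the network is unchanged.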

2605.22126 2026-05-22 cs.CV

AesFormer: Transform Everyday Photos into Beautiful Memories

AesFormer: 将日常照片转化为美丽的记忆

Tianxiang Du, Hulingxiao He, Yuxin Peng

发表机构 * Wangxuan Institute of Computer Technology, Peking University, Beijing, China(王轩计算机技术研究所,北京大学,北京,中国)

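AI summary This paper proposes AesFormer, a framework that decouples aesthetic planning from image editing to improve a photo's aesthetic quality while preserving subject identity and scene semantics, and constructs AesRecon, a benchmark of 9,071 strictly aligned image pairs.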

Comments Accepted by ICML 2026

详情
AI中文摘要

在日常摄影中,吸引人的时刻往往受到结构缺陷(如构图、相机视角或姿势)的影响,而现有的修图和人像增强方法无法修复这些缺陷。我们提出将审美照片重建(APR)定义为通过结构重建来提高照片的审美质量,同时保持主体身份和场景语义。尽管最近的图像编辑模型使APR成为可能,但它们通常缺乏审美理解,导致编辑结果在语义上合理但审美上薄弱。为此,我们提出了AesFormer,一个两阶段框架,将审美规划与图像编辑解耦。在第一阶段,一个审美动作模型(AesThinker)分析输入沿七个渐进的摄影维度,并输出可执行的编辑动作;我们进一步应用GRPO-A来鼓励在多样化的动作计划上进行广泛探索,超越SFT。在第二阶段,一个动作条件编辑器(AesEditor)在这些动作的指导下执行结构编辑。为了支持APR,我们构建了一个基于视频的语料挖掘管道(VCMP)并构建了AesRecon,一个包含9,071对严格对齐(差,好)图像对的基准。实验表明,AesFormer显著提高了APR性能,并与Nano Banana Pro具有竞争力。代码可在https://github.com/PKU-ICST-MIPL/AesFormer_ICML2026获取。

英文摘要

In everyday photography, aesthetically appealing moments are often captured with structural flaws (e.g., composition, camera viewpoint, or pose) that existing retouching and portrait enhancement methods cannot fix. We formulate Aesthetic Photo Reconstruction (APR) as improving a photo's aesthetic quality via structural reconstruction while preserving subject identity and scene semantics. Although recent advances in image editing models make APR feasible, they often lack aesthetic understanding, yielding edits that are semantically plausible yet aesthetically weak. To address this, we propose AesFormer, a two-stage framework that decouples aesthetic planning from image editing. In Stage 1, an aesthetic action model (AesThinker) analyzes the input along seven progressive photographic dimensions and outputs executable editing actions; we further apply GRPO-A to encourage broad exploration over diverse action plans beyond SFT. In Stage 2, an action-conditioned editor (AesEditor) performs structural edits guided by these actions. To support APR, we build a video-based corpus-mining pipeline (VCMP) and construct AesRecon, a benchmark of 9,071 strictly aligned (poor, good) image pairs. Experiments show that AesFormer substantially improves APR performance and is competitive with Nano Banana Pro. Code is available at https://github.com/PKU-ICST-MIPL/AesFormer_ICML2026.

2605.22123 2026-05-22 cs.RO

Beyond Pixels: Learning Invariant Rewards for Real-World Robotics From a Few Demonstrations

超越像素:从少量示范中学习不变的奖励以实现实世界机器人学

Tengye Xu, Yangting Sun, Ziju Shen, Guanqi Chen, Zhen Fu, Chen Yizhou, Hua Chen, Jia Pan

发表机构 * School of Computing and Data Science, The University of Hong Kong(计算科学与数据科学学院,香港大学) LimX Dynamics Technology Co., Ltd(LimX动力技术有限公司) Southern University of Science and Technology(南方科技大学) Peking University(北京大学) Zhejiang University(浙江大学)

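AI summary This paper proposes a method for learning invariant rewards from a few demonstrations so that rewards generalize in real-world robotics. It improves reward design by discovering behavioral invariants, which in turn improves policy learning across multiple tasks.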

详情
AI中文摘要

设计能够超越受控实验室环境的奖励函数仍然是强化学习在机器人学中的基本挑战。在开放世界操纵问题中,单一任务可以通过不同的物体实例、位置和摄像头视角出现多种变体。最近基于视觉的奖励模型倾向于记忆特定的像素分布,并且无法超越其训练条件进行泛化。为了解决这个问题,我们提出了一种框架,该框架可以从最少的五个示范中学习不变的符号奖励函数。关键思想是将视觉特征拟合转向发现行为不变量:在多样化的视觉实例中保持不变的任务级属性。该框架有两个耦合的组件:一个结构化奖励公式,它编码任务级策略和物理约束,同时保持最优策略不变性;以及一个混合的符号-数值过程,该过程从示范中提炼这些不变量,而无需在线交互。在八个Meta-World任务和三个Franka操纵任务上的实验表明,我们的方法在过程对齐和策略展开排名能力方面优于基线方法,加速了下游策略学习。三个现实世界的出分布实验进一步表明,学习到的奖励能够零样本泛化到位置、视角和物体变体,使单一奖励表示能够在实践中重用于多种任务变体。

英文摘要

Designing reward functions that generalize beyond controlled laboratory settings remains a fundamental challenge in reinforcement learning for robotics. In open-world manipulation problems, a single task can appear in numerous variants through different object instances, positions, and camera viewpoints. Recent vision-based reward models tend to memorize specific pixel distributions and fail to generalize beyond their training conditions. To address this, we propose a framework that learns invariant symbolic reward functions from as few as five demonstrations. The insight is to shift from visual feature-fitting to the discovery of behavioral invariants: task-level properties that remain constant across diverse visual instantiations. The framework has two coupled components: a structural reward formulation that encodes task-level strategies and physical constraints while preserving optimal policy invariance, and a hybrid symbolic-numerical procedure that distills these invariants from demonstrations without online interaction. Experiments on eight Meta-World tasks and three Franka manipulation tasks demonstrate that our method achieves stronger process alignment and policy rollout ranking abilities compared to baselines, accelerating downstream policy learning. Three real-world out-of-distribution experiments further show that the same learned reward generalizes zero-shot to position, viewpoint, and object variations, enabling a single reward representation to be reused across diverse task variants in practice.
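
The phrase "preserving optimal policy invariance" can be made concrete with the classic potential-based reward shaping result (Ng, Harada, and Russell, 1999). The block below states that result only as a reference point: the abstract does not say that the paper's structural reward takes this form, and the symbols $\Phi$ (a state potential), $\gamma$ (the discount factor), and $\tilde{r}$ (the shaped reward) are introduced here for illustration.

```latex
% Shaped reward: add the discounted change in a state potential \Phi.
\tilde{r}(s, a, s') = r(s, a, s') + \gamma\,\Phi(s') - \Phi(s)

% The optimal action values shift by a term that depends on the state only,
\tilde{Q}^{*}(s, a) = Q^{*}(s, a) - \Phi(s)

% so the greedy (optimal) action in every state is unchanged.
\arg\max_{a} \tilde{Q}^{*}(s, a) = \arg\max_{a} Q^{*}(s, a)
```

Any reward term of this form can guide learning without changing which policy is optimal, which is one standard way to encode task knowledge safely.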

2605.22121 2026-05-22 cs.CV

MotionDPS: Motion-Compensated 3D Brain MRI Reconstruction

MotionDPS: 3D脑部MRI重建中的运动补偿

Antonio Ortiz-Gonzalez, Erich Kobler, Lukas Schletter, Alexander Effland

发表机构 * Life and Medical Sciences Institute, University of Bonn(波恩大学生命与医学科学研究所) Institute for Machine Learning, LIT AI Lab, Department of Virtual Morphology, Clinical Research Institute Medical AI, Johannes Kepler University Linz(林茨约翰尼斯·凯撒大学机器学习研究所、LIT AI实验室、虚拟形态部门、医学人工智能临床研究机构) German Center for Neurodegenerative Diseases (DZNE)(德国神经退行性疾病研究中心(DZNE)) Institute for Applied Mathematics, University of Bonn(波恩大学应用数学研究所)

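AI summary This paper proposes a unified Bayesian framework for motion-compensated 3D MRI reconstruction that jointly estimates the anatomical image, rigid-body motion parameters, and coil sensitivity maps directly from motion-corrupted k-space data, enabling fully unsupervised reconstruction without paired motion-free training data.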

Comments This work has been submitted to the IEEE for possible publication

详情
AI中文摘要

磁共振成像(MRI)由于其相对较长的采集时间和k空间中按顺序采集数据的事实,对患者运动高度敏感。即使是很小的患者移动也会在测量之间引入相位不一致,导致严重的伪影,如模糊、鬼影和几何失真,这些伪影可能影响诊断质量。回顾性运动补偿仍然具有挑战性,尤其是在加速采集中,由于联合重建和运动估计问题的不恰当性。在本工作中,我们提出了一种统一的贝叶斯框架,用于运动补偿的3D MRI,该框架直接从运动损坏的k空间数据中联合估计解剖图像、刚体运动参数和线圈灵敏度图。我们的方法将预训练的3D复值分数扩散模型作为表达性解剖图像先验整合到基于物理的正向模型中。通过交替扩散后验图像更新和高效的近端优化步骤进行运动和线圈灵敏度估计,实现完全无监督的重建,无需配对无运动训练数据。在模拟和真实运动脑部MRI数据集上的实验表明,所提出的方法在图像质量和运动鲁棒性方面优于最先进的经典和学习运动校正技术,特别是在存在严重运动和高加速的情况下。

英文摘要

Magnetic resonance imaging (MRI) is highly susceptible to patient motion due to its relatively long acquisition times and the fact that data are acquired sequentially in k-space. Even small patient movements introduce phase inconsistencies across measurements, leading to severe artifacts such as blurring, ghosting, and geometric distortions that can compromise diagnostic quality. Retrospective motion compensation remains challenging, particularly in accelerated acquisitions, due to the ill-posed nature of the joint reconstruction and motion estimation problem. In this work, we propose a unified Bayesian framework for motion-compensated 3D MRI that jointly estimates the anatomical image, rigid-body motion parameters, and coil sensitivity maps directly from motion-corrupted k-space data. Our approach integrates pretrained 3D complex-valued score-based diffusion models as expressive anatomical image priors within a physics-based forward model. Inference is performed by alternating diffusion posterior image updates with efficient proximal optimization steps for motion and coil sensitivity estimation, enabling fully unsupervised reconstruction without the need for paired motion-free training data. Experiments on simulated and real-motion brain MRI datasets demonstrate that the proposed method achieves improved image quality and motion robustness compared to state-of-the-art classical and learning-based motion correction techniques, particularly in the presence of severe motion and high acceleration.

2605.22111 2026-05-22 cs.LG cs.CE stat.ML

Aerodynamic force reconstruction using physics-informed Gaussian processes

利用物理信息高斯过程进行气动力重建

Gledson Rodrigo Tondo, Igor Kavrakov, Guido Morgenthal

发表机构 * Bauhaus-Universität Weimar(魏玛应用科学大学) University of Cambridge(剑桥大学)

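AI summary This paper proposes a physics-informed machine learning method for reconstructing the underlying aerodynamic loads from noisy measurements of structural dynamic responses. By avoiding overfitting and requiring no regularization scheme, it improves the model's accuracy and applicability.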

详情
AI中文摘要

准确建模气动载荷对于理解和预测复杂结构系统的响应至关重要。然而,这些模型往往依赖于真实物理力的简化,引入假设可能会限制其准确性。在存在噪声或不完整数据的情况下,验证这些模型变得特别具有挑战性。为此,我们介绍了一种概率物理信息机器学习方法,旨在从结构动态响应的噪声测量中重建底层气动载荷。该模型避免了过拟合,消除了对正则化方案的需要,并允许在训练过程中使用异质和多保真度数据。通过重建大贝尔东桥在线性非稳态假设下的气动载荷,证明了该方法的有效性。结果表明,真实和预测载荷之间有很强的一致性,特别是在均方误差、幅度、相位角和信号峰值值方面。该载荷重建方法具有广泛的应用前景,如模型验证、未来载荷估计和结构损伤预测。

英文摘要

Accurate modeling of aerodynamic loads is essential for understanding and predicting the responses of complex structural systems. However, these models often rely on simplifications of the true physical forces, introducing assumptions that can limit their accuracy. Validating such models becomes particularly challenging in the presence of noisy or incomplete data. To address this, we introduce a probabilistic physics-informed machine learning approach designed to reconstruct the underlying aerodynamic loads from noisy measurements of structural dynamic responses. The model avoids overfitting, eliminates the need for regularization schemes, and allows for the use of heterogeneous and multi-fidelity data during the training process. The efficacy of the approach is demonstrated through the reconstruction of aerodynamic loads on the Great Belt East Bridge, simulated under a linear unsteady assumption. Results show strong agreement between true and predicted loads, particularly with respect to root mean squared error, magnitude, phase angle, and peak values of the signals. The load reconstruction method is broadly applicable, for example to model validation, future load estimation, and structural damage prognosis.

2605.22109 2026-05-22 cs.AI cs.CV cs.CY

Perception or Prejudice: Can MLLMs Go Beyond First Impressions of Personality?

感知还是偏见:大语言模型能否超越个性的第一印象?

Caixin Kang, Tianyu Yan, Sitong Gong, Mingfang Zhang, Liangyang Ouyang, Ruicong Liu, Bo Zheng, Huchuan Lu, Kaipeng Zhang, Yoichi Sato, Yifei Huang

发表机构 * The University of Tokyo(东京大学) Shanda AI Research Tokyo(Shanda AI 研究所东京) Dalian University of Technology(大连理工大学)

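AI summary This paper examines how well multimodal large language models (MLLMs) perceive personality, introduces a new task, Grounded Personality Reasoning (GPR), builds a new dataset, MM-OCEAN, and uses a three-tier evaluation to expose prejudice in MLLM personality reasoning.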

详情
AI中文摘要

多模态大语言模型(MLLMs)正在越来越多地应用于需要感知个性的人类交互角色中,但现有的基准测试仅评估其对大五人格特质分数的预测能力,未能确定模型是通过行为理解真正感知个性,还是仅通过表面模式匹配进行偏见判断。我们通过三个贡献填补了这一空白:(i)一个新的任务:我们正式定义了Grounded Personality Reasoning(GPR),要求MLLMs通过一系列评分、推理和锚定过程,将每个大五评分与可观察的证据联系起来;(ii)一个新的数据集:我们发布了MM-OCEAN(1,104个视频,5,320个多项选择题),由多代理流程生成,包含时间戳行为观察、证据支持的特质分析以及七类线索锚定多项选择题;(iii)基准测试和分析:我们设计了一个三级评估体系(评分、推理、锚定)以及四个样本级失败模式指标:偏见率(PR)、编造率(CR)、整合失败率(IR)和整体锚定率(HR),并基准测试了27个MLLMs(13个封闭式,14个开放式)。分析揭示了一个显著的偏见差距:在所有正确评分中,51%的评分没有基于检索到的线索进行锚定,而整体锚定率仅在0-33.5%之间。这些发现揭示了获得正确分数与为正确原因推理之间的脱节,为MLLMs中的扎根社会认知绘制了路线图。

英文摘要

Multimodal Large Language Models (MLLMs) are increasingly deployed in human-facing roles where personality perception is critical, yet existing benchmarks evaluate this capability solely on numerical Big Five score prediction, leaving open whether models truly perceive personality through behavioral understanding or merely prejudge through superficial pattern matching. We address this gap with three contributions. (i) A new task: we formalize Grounded Personality Reasoning (GPR), which requires MLLMs to anchor each Big Five rating in observable evidence through a chain of rating, reasoning, and grounding. (ii) A new dataset: we release MM-OCEAN (1,104 videos, 5,320 MCQs), produced by a multi-agent pipeline with human verification, with timestamped behavioral observations, evidence-grounded trait analyses, and seven categories of cue-grounding MCQs. (iii) Benchmark and analysis: we design a three-tier evaluation (rating, reasoning, grounding) plus four sample-level failure-mode metrics: Prejudice Rate (PR), Confabulation Rate (CR), Integration-failure Rate (IR), and Holistic-grounding Rate (HR), and benchmark 27 MLLMs (13 closed, 14 open). The analysis uncovers a striking Prejudice Gap: across the field, 51% of correct ratings are not grounded in retrieved cues, and the Holistic-Grounding Rate spans only 0-33.5%. These findings expose a disconnect between getting the right score and reasoning for the right reason, charting a roadmap for grounded social cognition in MLLMs.
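
As a small illustration of the failure-mode metrics, the sketch below computes one plausible reading of the Prejudice Rate: among samples whose rating is correct, the share whose rating is not grounded in retrieved cues. This reading is inferred from the sentence about the Prejudice Gap. The paper's exact definition, the function name, and the input format are assumptions.

```python
def prejudice_rate(samples):
    """Share of correctly rated samples whose rating is not grounded.

    samples: iterable of (rating_correct, grounded) boolean pairs.
    Returns None when no sample has a correct rating.
    """
    grounded_flags = [grounded for rating_correct, grounded in samples if rating_correct]
    if not grounded_flags:
        return None
    ungrounded = sum(1 for grounded in grounded_flags if not grounded)
    return ungrounded / len(grounded_flags)


# Toy data: four correct ratings, two of them without grounding.
toy = [(True, True), (True, False), (True, False), (False, False), (True, True)]
rate = prejudice_rate(toy)
```

Incorrect ratings are excluded from the denominator here, so the value measures how often a model is right for the wrong reason rather than how often it is wrong.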

2605.22106 2026-05-22 cs.AI

ArborKV: Structure-Aware KV Cache Management for Scaling Tree-based LLM Reasoning

ArborKV: 一种面向树状推理的KV缓存管理方法

Yeqiu Chen, Ziyan Liu, Zhenxin Huang, Runquan Gui, Hong Wang, Lei Liu

发表机构 * University of Science(科学大学) Huawei Technologies Co., Ltd.(华为技术有限公司)

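AI summary This paper proposes ArborKV, a structure-aware KV cache management method that combines a lightweight value estimator with a tree-aware allocation policy and performs purely token-extractive eviction with lazy rehydration, reducing KV memory use while maintaining high accuracy and enabling larger tree-structured reasoning searches under a fixed hardware budget.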

详情
AI中文摘要

最近在大语言模型推理方面的进展越来越多地从单次生成转向在中间推理状态上的显式搜索。Tree-of-Thoughts (ToT) 将推理组织为具有分支和回溯的树状搜索,但显著放大了键值(KV)缓存:保留用于前沿部分轨迹的KV状态很快成为内存瓶颈,限制了吞吐量并约束了在固定硬件预算下的搜索深度和宽度。我们通过观察到ToT风格推理中的KV重用由搜索动态决定:短期解码主要依赖于活跃分支及其祖先,而无效子树具有低短期重用概率但必须保持可恢复以供回溯。受此启发,我们提出了ArborKV,一种结构感知的淘汰框架,结合轻量级值估计器和树状分配策略,并进行纯token提取式淘汰与惰性再水合以支持回溯。在ToT风格推理基准上的实验表明,ArborKV实现了高达约4倍的KV内存减少,同时保持接近完整保留的精度,使在固定设备预算下能支持更大规模的树状推理搜索。

英文摘要

Recent progress in LLM reasoning has increasingly shifted from single-pass generation to explicit search over intermediate reasoning states. Tree-of-Thoughts (ToT) organizes inference as a tree-structured search with branching and backtracking, but it substantially amplifies the Key-Value (KV) cache: retaining KV states for a frontier of partial trajectories quickly becomes a memory bottleneck that limits throughput and constrains search depth and width under fixed hardware budgets. We address this challenge by observing that KV reuse in ToT-style inference is governed by search dynamics: near-term decoding depends primarily on the active branch and its ancestors, whereas inactive subtrees have low short-term reuse probability yet must remain recoverable for backtracking. Motivated by this, we propose ArborKV, a structure-aware eviction framework that couples a lightweight value estimator with a tree-aware allocation policy, and performs purely token-extractive eviction with lazy rehydration to support revisits. Experiments on ToT-style reasoning benchmarks show that ArborKV achieves up to ~4x peak KV-memory reduction while preserving near-full-retention accuracy, enabling larger search configurations under fixed device budgets that would otherwise run out of memory.
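
The observation about search dynamics can be sketched as a toy eviction policy: protect the active branch and its ancestors, evict the KV states of the lowest-valued inactive nodes first, and recompute them only when a branch is revisited. This is a simplified illustration, not ArborKV itself. It evicts whole nodes rather than individual tokens, and the `Node` fields, the `value` scores standing in for a value estimator, and the token budget are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass
class Node:
    name: str
    parent: Optional[str]
    tokens: List[int]          # token IDs are always kept, so KV can be recomputed
    value: float               # stand-in for a learned value estimate
    kv_resident: bool = True   # whether this node's KV states are in memory


def ancestors_and_self(tree: Dict[str, Node], name: str) -> Set[str]:
    """Names on the path from `name` up to the root."""
    path: Set[str] = set()
    current: Optional[str] = name
    while current is not None:
        path.add(current)
        current = tree[current].parent
    return path


def resident_tokens(tree: Dict[str, Node]) -> int:
    """Number of tokens whose KV states are currently held."""
    return sum(len(n.tokens) for n in tree.values() if n.kv_resident)


def evict(tree: Dict[str, Node], active: str, budget: int) -> List[str]:
    """Evict inactive nodes, lowest value first, until within budget."""
    protected = ancestors_and_self(tree, active)
    candidates = sorted(
        (n for n in tree.values() if n.kv_resident and n.name not in protected),
        key=lambda n: n.value,
    )
    evicted = []
    for node in candidates:
        if resident_tokens(tree) <= budget:
            break
        node.kv_resident = False
        evicted.append(node.name)
    return evicted


def rehydrate(tree: Dict[str, Node], name: str) -> int:
    """Restore KV along a revisited path; return tokens to recompute."""
    recomputed = 0
    for node_name in ancestors_and_self(tree, name):
        node = tree[node_name]
        if not node.kv_resident:
            node.kv_resident = True
            recomputed += len(node.tokens)
    return recomputed


tree = {
    "root": Node("root", None, list(range(4)), 1.0),
    "a": Node("a", "root", list(range(6)), 0.9),
    "b": Node("b", "root", list(range(6)), 0.2),
    "a1": Node("a1", "a", list(range(5)), 0.8),
    "b1": Node("b1", "b", list(range(5)), 0.1),
}
evicted = evict(tree, active="a1", budget=16)
resident_after = resident_tokens(tree)
recomputed = rehydrate(tree, "b1")  # backtrack into the evicted subtree
```

Keeping the token IDs for evicted nodes is what makes the eviction recoverable in this sketch: backtracking pays a recomputation cost instead of holding every subtree in memory.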

2605.22104 2026-05-22 cs.CV

OPERA: An Agent for Image Restoration with End-to-End Joint Planning-Execution Optimization

OPERA: 一种用于图像修复的智能体,通过端到端联合规划-执行优化

Feng Zhu, Shuyang Xie, Yihan Zeng, Ming Liu, Wangmeng Zuo

发表机构 * Harbin Institute of Technology(哈尔滨工业大学) Huawei Noah’s Ark Lab(华为诺亚实验室)

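AI summary This work proposes OPERA, a framework that jointly optimizes restoration planning and tool execution end to end to handle complex mixed degradations in image restoration, outperforming existing methods and all-in-one models.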

详情
AI中文摘要

现实中的图像修复因复杂的、相互作用的混合退化而具有挑战性。最近的基于智能体的方法通过组合多个任务特定的修复工具来解决这个问题。然而,实证分析表明,其性能根本上受到隐式约束的规划空间和独立预训练工具之间缺乏协调的限制。为了解决这些问题,我们提出了OPERA(优化规划-执行修复智能体),一种框架,通过端到端的方式联合优化修复规划和工具执行。在规划方面,OPERA使用强化学习直接优化工具组合在一个组合计划空间上,最终修复质量作为奖励。在执行方面,OPERA引入了智能体引导的修复工具协同训练,使它们能够在顺序组合下学习合作行为。在多退化基准和真实世界数据集上的大量实验表明,OPERA在多样且复杂的退化场景中始终优于所有-in-one修复模型和现有基于智能体的方法。

英文摘要

Real-world image restoration is challenging due to complex and interacting mixed degradations. Recent agent-based approaches address this problem by composing multiple task-specific restoration tools. However, empirical analysis reveals that their performance is fundamentally limited by implicitly constrained planning spaces and the lack of coordination among independently pretrained tools. To address these issues, we propose OPERA (Optimized Planning-Execution Restoration Agent), a framework that jointly optimizes restoration planning and tool execution in an end-to-end manner. On the planning side, OPERA uses reinforcement learning to directly optimize tool composition over a combinatorial plan space, with the final restoration quality as the reward. On the execution side, OPERA introduces agent-guided co-training of restoration tools, enabling them to learn cooperative behaviors under sequential composition. Extensive experiments on multi-degradation benchmarks and real-world datasets demonstrate that OPERA consistently outperforms both all-in-one restoration models and existing agent-based methods across diverse and complex degradation scenarios.

2605.22102 2026-05-22 cs.AI

ExComm: Exploration-Stage Communication for Error-Resilient Agentic Test-Time Scaling

ExComm:探索阶段通信用于容错的代理测试时间扩展

Woomin Song, Beomjun Kim, Daewon Choi, Sai Muralidhar Jayanthi, Saket Dingliwal, Jinwoo Shin, Aram Galstyan

发表机构 * KAIST(韩国科学技术院) Amazon AGI(亚马逊人工智能实验室) Together AI

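AI summary This paper proposes ExComm, a communication protocol for exploration-stage agentic test-time scaling that periodically audits agent belief states to detect cross-agent factual conflicts and resolves them through a dedicated tool-based verification loop, making test-time scaling more resilient to errors.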

详情
AI中文摘要

在长周期代理测试时间扩展中,错误传播是一个常见的失败模式,其中中间步骤中引入的事实错误或无效推论会持续存在于代理的信念状态中,并污染后续推理。现有测试时间扩展方法对这一过程控制有限,因为它们通常依赖于代理自行检测错误、在错误轨迹中选择或仅在错误已影响推理路径后才修正解决方案。我们提出ExComm,一种用于探索阶段的代理测试时间扩展通信协议。ExComm受到经验观察的启发,即并行代理推理中的大多数中间错误会产生可检测的跨代理事实冲突。利用代理工作流的迭代结构,ExComm定期审计代理信念状态以检测此类冲突,通过专用工具验证循环解决冲突,并将简洁、针对性的反馈返回相关代理。通过软信念更新将修正纳入其中,即附加已验证的反馈而非覆盖现有信念。此外,为防止由于通信导致轨迹多样性崩溃,ExComm进一步引入轨迹多样化模块,将冗余轨迹引导至正交策略。在AIME 2024、AIME 2025和GAIA上使用Gemini-2.5-Flash-Lite和Qwen3.5-4B的实验表明,ExComm在测试时间扩展中一致优于强基线,分别在最佳基线上实现了平均性能提升5.7%和5.0%。进一步分析显示了改进的错误恢复、有利的扩展行为、比适应通信基线更强的多样性,以及在评估方法中最佳的性能-成本权衡。

英文摘要

A common failure mode in long-horizon agentic test-time scaling is error propagation, where factual errors or invalid deductions introduced at intermediate steps persist in the agent's belief state and contaminate later reasoning. Existing test-time scaling methods provide limited control over this process, as they often rely on agents to detect their own mistakes, select among flawed trajectories, or refine solutions only after errors have already shaped the reasoning path. We propose ExComm, a communication protocol for exploration-stage agentic test-time scaling. ExComm is motivated by the empirical observation that the majority of intermediate errors in parallel agentic reasoning produce detectable cross-agent factual conflicts. Leveraging the iterative structure of agentic workflows, ExComm periodically audits agent belief states to detect such conflicts, resolves them through a dedicated tool-based verification loop, and returns concise, targeted feedback to the involved agents. Corrections are incorporated through soft belief updates, which append verified feedback rather than overwriting existing beliefs. Furthermore, to prevent collapsing trajectory diversity due to communication, ExComm further introduces a trajectory diversification module that redirects redundant trajectories toward orthogonal strategies. Experiments on AIME 2024, AIME 2025, and GAIA with Gemini-2.5-Flash-Lite and Qwen3.5-4B show that ExComm consistently outperforms strong test-time scaling baselines, achieving average performance gains of 5.7% and 5.0% over the best-performing baselines, respectively. Further analyses demonstrate improved error recovery, favorable scaling behavior, stronger diversity than adapted communication baselines, and the best performance-cost trade-off among the evaluated methods.
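
The audit loop described above (detect cross-agent conflicts, verify, append feedback) can be sketched in a few lines of Python. This is a toy illustration under strong assumptions, not the ExComm protocol: beliefs are modeled as plain key-value claims, `verify` is a hypothetical stand-in for the tool-based verification loop, and the feedback strings are invented for the example.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Beliefs = Dict[str, str]  # claim key -> value the agent currently asserts


def find_conflicts(agents: Dict[str, Beliefs]) -> Dict[str, Dict[str, List[str]]]:
    """Return {claim: {value: [agents asserting it]}} for disputed claims."""
    by_claim = defaultdict(lambda: defaultdict(list))
    for agent, beliefs in agents.items():
        for claim, value in beliefs.items():
            by_claim[claim][value].append(agent)
    return {c: dict(v) for c, v in by_claim.items() if len(v) > 1}


def audit(
    agents: Dict[str, Beliefs],
    feedback: Dict[str, List[str]],
    verify: Callable[[str], str],
) -> None:
    """Resolve each conflict with `verify` and notify agents that were wrong."""
    for claim, sides in find_conflicts(agents).items():
        truth = verify(claim)
        for value, holders in sides.items():
            if value == truth:
                continue
            for agent in holders:
                # Soft update: append verified feedback and leave the
                # agent's existing belief entry untouched.
                feedback[agent].append(f"verified: {claim} = {truth} (you had {value})")


agents = {
    "A": {"2^10": "1024", "gcd(12, 18)": "6"},
    "B": {"2^10": "1000"},
    "C": {"2^10": "1024"},
}
feedback = {name: [] for name in agents}
audit(agents, feedback, verify=lambda claim: {"2^10": "1024"}[claim])
```

Only the agent holding the refuted value receives feedback, and claims asserted by a single agent are never sent to verification, which keeps the communication targeted.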

2605.22099 2026-05-22 cs.CL

A Comparative Study of Language Models for Khmer Retrieval-Augmented Question Answering

一种用于柬埔寨检索增强问答的语言模型比较研究

Sereiwathna Ros, Phannet Pov, Ratanaktepi Chhor, Kimleang Ly, Wan-Sup Cho, Saksonita Khoeurn

发表机构 * Department of Computer Science, Chungbuk National University(Chungbuk National University 计算机科学系) Department of Big Data, Chungbuk National University(Chungbuk National University 大数据系) General Department of Information and Communication Technology, Ministry of Post and Telecommunications(邮电部信息和通信技术总局) Department of Management Information Systems, Chungbuk National University(Chungbuk National University 管理信息系统系) BigDatalabs Co., Ltd(BigDatalabs 公司)

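AI summary Focusing on Khmer, a low-resource, non-Latin-script language, this paper compares several language models on retrieval-augmented question answering and finds that retriever choice is a key factor, while generators differ in which metrics they do best on.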

Comments 14 pages, 1 figure

详情
AI中文摘要

检索增强生成(RAG)作为一种将大型语言模型(LLM)输出与检索到的证据相结合的范式,已被证明可以减少幻觉并提高事实准确性。然而,其在低资源、非拉丁语种如柬埔寨语言中的有效性仍鲜有研究。本文提出了一种基于RAG的柬埔寨语言电信领域文档问答系统。我们进行了两阶段的比较评估。首先,我们基准测试了三种嵌入模型:BGE-M3(567M)、Jina-Embeddings-v3(570M)和Qwen3-Embedding(597M),用于柬埔寨文档的密集检索。BGE-M3在Hit Rate@3、File Hit Rate@3、MRR@3和Precision@3等指标上均表现最佳,显著优于其他检索器。其次,使用BGE-M3作为选定的检索器,我们在200个柬埔寨问答对的精心编纂的黄金数据集上评估了五个生成器后端:Qwen3(8B)、Qwen3.5(9B)、Sailor2-8B-Chat、SeaLLMs-v3-7B-Chat和Llama-SEA-LION-v2-8B-IT。为了量化系统性能,我们应用了六个受RAGAS启发的指标:忠实度、答案相关性、上下文相关性、事实正确性、答案相似性和答案正确性。结果表明,没有单一模型在所有指标上占据主导:Qwen3.5-9B在忠实度(0.859)和上下文相关性(0.726)上最高,Qwen3-8B在事实正确性(0.380)上最高,而SeaLLMs-v3-7B-Chat在答案相关性(0.867)、答案相似性(0.836)和答案正确性(0.599)上表现最佳。这些发现突显了检索器选择仍是柬埔寨RAG的主要瓶颈,而生成器的强项则取决于优先考虑的是 grounding、事实精度还是语义相似性。

英文摘要

Retrieval-Augmented Generation (RAG) has emerged as a promising paradigm for grounding large language model (LLM) outputs in retrieved evidence, thereby reducing hallucination and improving factual accuracy. Its efficacy, however, remains largely unexamined for low-resource, non-Latin-script languages such as Khmer. In this paper, we present a RAG-based question answering system for Khmer-language telecom-domain documents. We conduct a two-phase comparative evaluation. First, we benchmark three embedding models: BGE-M3 (567M), Jina-Embeddings-v3 (570M), and Qwen3-Embedding (597M), for dense retrieval over Khmer documents. BGE-M3 consistently performs best, achieving a Hit Rate@3 of 0.285, File Hit Rate@3 of 0.700, MRR@3 of 0.221, and Precision@3 of 0.112, substantially outperforming the other retrievers. Second, using BGE-M3 as the selected retriever, we evaluate five generator backends: Qwen3 (8B), Qwen3.5 (9B), Sailor2-8B-Chat, SeaLLMs-v3-7B-Chat, and Llama-SEA-LION-v2-8B-IT, on a curated golden dataset of 200 Khmer question-answer pairs. To quantify system performance, we apply six RAGAS-inspired metrics: faithfulness, answer relevance, context relevance, factual correctness, answer similarity, and answer correctness. The results show no single model dominates across all metrics: Qwen3.5-9B achieves the highest faithfulness (0.859) and context relevance (0.726), Qwen3-8B attains the highest factual correctness (0.380), and SeaLLMs-v3-7B-Chat performs best on answer relevance (0.867), answer similarity (0.836), and answer correctness (0.599). These findings highlight that retriever choice remains a major bottleneck for Khmer RAG, while generator strengths vary depending on whether the priority is grounding, factual precision, or semantic similarity.
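
The retrieval metrics named in the abstract can be illustrated with a minimal Python sketch. It uses the common definitions of Hit Rate@k, MRR@k, and Precision@k. The paper's exact definitions (in particular of File Hit Rate@3) are not given in the abstract, and the function names and toy chunk IDs below are hypothetical.

```python
from typing import Iterable, Sequence, Set


def hit_rate_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 3) -> float:
    """1.0 if any of the top-k retrieved IDs is relevant, else 0.0."""
    return 1.0 if any(doc in relevant for doc in ranked[:k]) else 0.0


def mrr_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 3) -> float:
    """Reciprocal rank of the first relevant ID in the top k (0.0 if none)."""
    for rank, doc in enumerate(ranked[:k], start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 3) -> float:
    """Fraction of the top-k slots filled with relevant IDs."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def mean(values: Iterable[float]) -> float:
    values = list(values)
    return sum(values) / len(values) if values else 0.0


# Two toy queries: (ranked retrieval results, set of relevant chunk IDs).
queries = [
    (["c7", "c2", "c9"], {"c2"}),  # relevant chunk found at rank 2
    (["c1", "c4", "c5"], {"c8"}),  # relevant chunk not retrieved
]
scores = {
    "hit_rate@3": mean(hit_rate_at_k(r, g) for r, g in queries),
    "mrr@3": mean(mrr_at_k(r, g) for r, g in queries),
    "precision@3": mean(precision_at_k(r, g) for r, g in queries),
}
```

With a single relevant chunk per question, Precision@3 can never exceed 1/3, so low absolute precision values do not by themselves indicate a weak retriever.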

2605.22098 2026-05-22 cs.CV cs.AI cs.LG

TextTeacher: What Can Language Teach About Images?

Tobias Christian Nauen, Stanislav Frolov, Brian Bernhard Moser, Federico Raue, Ahmed Anwar, Andreas Dengel

Affiliations * RPTU University Kaiserslautern-Landau; German Research Center for Artificial Intelligence (DFKI)

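AI summary This work proposes TextTeacher, a method that injects the semantic knowledge of a language model into image classification training, improving vision models while keeping the inference-time model simple.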

Comments Published at TMLR

Journal ref Transactions on Machine Learning Research, ISSN 2835-8856, 2026

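Details
AI abstract (translated from the Chinese)

The platonic representation hypothesis holds that sufficiently large models converge to a shared representation geometry, even across modalities. Motivated by this, we ask: can the semantic knowledge of a language model efficiently improve a vision model? To answer this, we introduce TextTeacher, a simple auxiliary objective that injects text embeddings as additional information into image classification training. TextTeacher uses readily available image captions, a pre-trained and frozen text encoder, and a lightweight projection to produce semantic anchors that efficiently guide representations during training while leaving the inference-time model unchanged. On ImageNet with standard ViT backbones, TextTeacher improves accuracy by up to +2.7 percentage points (p.p.) and yields consistent transfer gains (on average +1.0 p.p.) under the same recipe and compute. It outperforms vision knowledge distillation, being more accurate at the same compute budget or faster at similar accuracy. Our analysis indicates that TextTeacher shapes deeper layers early in training and aids generalization by supplying complementary semantic cues. TextTeacher adds little overhead, requires no costly multimodal training of the target model, and preserves the simplicity and latency of pure vision models.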

English abstract

The platonic representation hypothesis suggests that sufficiently large models converge to a shared representation geometry, even across modalities. Motivated by this, we ask: Can the semantic knowledge of a language model efficiently improve a vision model? As an answer, we introduce TextTeacher, a simple auxiliary objective that injects text embeddings as additional information into image classification training. TextTeacher uses readily available image captions, a pre-trained and frozen text encoder, and a lightweight projection to produce semantic anchors that efficiently guide representations during training while leaving the inference-time model unchanged. On ImageNet with standard ViT backbones, TextTeacher improves accuracy by up to +2.7 percentage points (p.p.) and yields consistent transfer gains (on average +1.0 p.p.) under the same recipe and compute. It outperforms vision knowledge distillation, yielding more accuracy at a constant compute budget or similar accuracy, but 33% faster. Our analysis indicates that TextTeacher acts as a feature-space preconditioner, shaping deeper layers in the first stages of training, and aiding generalization by supplying complementary semantic cues. TextTeacher adds negligible overhead, requires no costly multimodal training of the target model and preserves the simplicity and latency of pure vision models. Project page with code and captions: https://nauen-it.de/publications/text-teacher
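
To make the idea of an auxiliary text-anchor objective concrete, here is a minimal sketch. The abstract does not state the loss form, so the cosine-distance term, the `aux_weight` value, and all function names are assumptions for illustration rather than the authors' formulation.

```python
import math
from typing import List


def project(features: List[float], weights: List[List[float]]) -> List[float]:
    """Lightweight linear projection; each row of `weights` yields one output dimension."""
    return [sum(w * f for w, f in zip(row, features)) for row in weights]


def cosine_similarity(u: List[float], v: List[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms


def cross_entropy(logits: List[float], label: int) -> float:
    """Numerically stable softmax cross-entropy for a single example."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def text_anchor_loss(logits, label, image_features, text_embedding, proj_weights, aux_weight=0.5):
    """Classification loss plus an assumed cosine-distance pull toward a frozen text embedding."""
    cls_loss = cross_entropy(logits, label)
    projected = project(image_features, proj_weights)
    anchor_gap = 1.0 - cosine_similarity(projected, text_embedding)
    return cls_loss + aux_weight * anchor_gap


identity = [[1.0, 0.0], [0.0, 1.0]]
aligned = text_anchor_loss([0.0, 0.0], 0, [1.0, 0.0], [1.0, 0.0], identity)
orthogonal = text_anchor_loss([0.0, 0.0], 0, [1.0, 0.0], [0.0, 1.0], identity)
```

Only the training loss involves the text embedding and the projection. At inference the classifier uses `logits` alone, which matches the abstract's claim that the inference-time model is unchanged.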

2605.22096 2026-05-22 cs.CV

VISTA: Validation-Guided Integration of Spatial and Temporal Foundation Models with Anatomical Decoding for Rare-Pathology VCE Event Detection -- after competition results

Bo-Cheng Qiu, Fang-Ying Lin, Ming-Han Sun, Yu-Fan Lin, Chia-Ming Lee, Chih-Chung Hsu

Affiliations * National Cheng Kung University, Taiwan; National Yang Ming Chiao Tung University, Taiwan

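AI summary This paper proposes VISTA, a framework that combines spatial and temporal foundation models with anatomical decoding and uses validation-guided weighted fusion and temporal event decoding to improve rare-pathology VCE event detection; after further post-competition refinement it ranked second.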

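Details
AI abstract (translated from the Chinese)

Capsule endoscopy event detection is challenging because clinically relevant findings are sparse, visually heterogeneous, and evaluated at the event level rather than by frame accuracy. We propose VISTA, a metric-aligned multi-backbone framework for the RAREVISION task. VISTA combines EndoFM-LV for temporal context and DINOv3 ViTL/16 for frame-level visual semantics, followed by a Diverse Head Ensemble (DHE), Validation-Guided Weighted Fusion (VGWF), and Anatomy-Aware Temporal Event Decoding (ATED). The original official submission reached an mAP@0.5 of 0.3530 and an mAP@0.95 of 0.3235 on the hidden test set. After the competition, extending local threshold refinement with a global coarse search raised performance to an mAP@0.5 of 0.3726 and an mAP@0.95 of 0.3431, ranking Team ACVLab second.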

English abstract

Capsule endoscopy event detection is challenging because clinically relevant findings are sparse, visually heterogeneous, and evaluated at the event level rather than by frame accuracy. We propose VISTA, a metric-aligned multi-backbone framework for the RAREVISION task. VISTA combines EndoFM-LV for temporal context and DINOv3 ViTL/16 for frame-level visual semantics, followed by a Diverse Head Ensemble (DHE), Validation-Guided Weighted Fusion (VGWF), and Anatomy-Aware Temporal Event Decoding (ATED). The original official submission achieved hidden-test temporal mAP@0.5 of 0.3530 and mAP@0.95 of 0.3235. After the competition, extending local threshold refinement with a global coarse search improved performance to 0.3726 mAP@0.5 and 0.3431 mAP@0.95, ranking Team ACVLab second in the post-competition evaluation.
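
The fusion-then-decoding pipeline described here can be sketched in a few lines. The abstract does not specify how VGWF derives its weights or how ATED uses anatomy, so this sketch assumes weights proportional to each backbone's validation score and a plain threshold-plus-minimum-length decoder, with no anatomical prior; all names and numbers are hypothetical.

```python
from typing import List, Tuple


def validation_weights(val_scores: List[float]) -> List[float]:
    """Assumed rule: weight each backbone in proportion to its validation score."""
    total = sum(val_scores)
    return [s / total for s in val_scores]


def fuse(prob_tracks: List[List[float]], weights: List[float]) -> List[float]:
    """Weighted average of per-frame probabilities across backbones."""
    n_frames = len(prob_tracks[0])
    return [sum(w * track[t] for w, track in zip(weights, prob_tracks)) for t in range(n_frames)]


def decode_events(probs: List[float], threshold: float = 0.5, min_len: int = 2) -> List[Tuple[int, int]]:
    """Turn frame probabilities into (start, end) events, dropping runs shorter than min_len."""
    events, start = [], None
    for t, p in enumerate(probs):
        if p >= threshold and start is None:
            start = t
        elif p < threshold and start is not None:
            if t - start >= min_len:
                events.append((start, t - 1))
            start = None
    if start is not None and len(probs) - start >= min_len:
        events.append((start, len(probs) - 1))
    return events


# Toy per-frame probabilities for one pathology class from two backbones (hypothetical values).
temporal_track = [0.1, 0.8, 0.9, 0.7, 0.2, 0.9]
frame_track = [0.3, 0.4, 0.9, 0.3, 0.2, 0.1]

weights = validation_weights([0.6, 0.2])
fused = fuse([temporal_track, frame_track], weights)
events = decode_events(fused)
```

In this toy run the isolated high-probability frame at the end is discarded by the minimum-length rule, which shows why event-level evaluation rewards temporal decoding over raw frame thresholds.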

2605.22090 2026-05-22 cs.AI

A Camera-Cooperative ISAC Framework for Multimodal Non-Cooperative UAVs Sensing

Wenfeng Wu, Luping Xiang, Kun Yang

Affiliations * State Key Laboratory of Novel Software Technology, Nanjing University; Institute of Intelligent Networks and Communications (NINE); School of Intelligent Software and Engineering, Nanjing University (Suzhou Campus)

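AI summary This paper proposes a camera-cooperative ISAC framework that uses multimodal sensing for efficient UAV beam steering and tracking, improving sensing accuracy and resource efficiency.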

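Details
AI abstract (translated from the Chinese)

Detecting non-cooperative unmanned aerial vehicles (UAVs) poses significant challenges for Integrated Sensing and Communication (ISAC) systems, because single-modal perception has inherent limitations and sensing competes with communication for shared resources. To address these challenges, this paper proposes a novel Camera-Cooperative ISAC (CC-ISAC) framework that uses multimodal sensing for efficient UAV beam steering and tracking. The framework uses cameras for coarse-grained airspace monitoring and ISAC for fine-grained, high-precision sensing, forming a complementary perception loop that improves both sensing accuracy and resource efficiency. Within this framework, two key modules are developed: (1) a Vision-to-Echo Data Alignment (V2EDA) model that aligns visual and echo-domain features through cross-attention mechanisms, and (2) a Multimodal Fusion-Based Estimation (MMFE) model that integrates historical multimodal data with current observations for robust state estimation. Extensive evaluations on the DeepSense 6G dataset show that the proposed framework reduces beam steering overhead by 71% on average and tracking overhead by 1.69-11.15% while maintaining high angular estimation accuracy. The CC-ISAC framework effectively mitigates resource contention between sensing and communication, enabling reliable UAV surveillance while freeing substantial system resources for additional communication tasks, and thus represents a practical advance in ISAC system design.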

English abstract

The detection of non-cooperative unmanned aerial vehicles (UAVs) presents significant challenges for Integrated Sensing and Communication (ISAC) systems due to the inherent limitations of single-modal perception and the competition for shared communication and sensing resources. To address these challenges, this paper proposes a novel Camera-Cooperative ISAC (CC-ISAC) framework that employs multimodal sensing to enable efficient UAV beam steering and tracking. The proposed framework employs cameras for coarse-grained airspace monitoring and utilizes ISAC for fine-grained, high-precision sensing, forming a complementary perception loop that enhances both sensing accuracy and resource efficiency. Within this framework, two key modules are developed: (1) a Vision-to-Echo Data Alignment (V2EDA) model that aligns visual and echo-domain features through cross-attention mechanisms, and (2) a Multimodal Fusion-Based Estimation (MMFE) model that integrates historical multimodal data with current observations for robust state estimation. Extensive evaluations conducted on the DeepSense 6G dataset demonstrate that the proposed framework achieves an average reduction of 71% in beam steering overhead and 1.69-11.15% in tracking overhead while maintaining high angular estimation accuracy. The CC-ISAC framework effectively mitigates resource contention between sensing and communication, enabling reliable UAV surveillance while freeing substantial system resources for additional communication tasks, thereby representing a practical advancement in ISAC system design.

2605.22089 2026-05-22 cs.CV cs.AI

LVDrive: Latent Visual Representation Enhanced Vision-Language-Action Autonomous Driving Model

Xiaodong Mei, Diankun Zhang, Hongwei Xie, Guang Chen, Hangjun Ye, Dan Xu

Affiliations * The Hong Kong University of Science and Technology; Xiaomi EV

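AI summary This paper proposes LVDrive, a vision-language-action autonomous driving model that adds a future scene prediction task and learns semantically rich scene representations in a high-level latent space, improving closed-loop driving performance.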

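Details
AI abstract (translated from the Chinese)

Vision-Language-Action (VLA) models have emerged as a promising framework for end-to-end autonomous driving. However, existing VLAs typically rely on sparse action supervision, which underuses their strong scene understanding and reasoning capabilities. Recent attempts to add dense visual supervision through world modeling often overemphasize pixel-level image reconstruction and neglect semantically rich scene representation learning. In this paper, we propose LVDrive, a latent visual representation enhanced VLA framework for autonomous driving. LVDrive introduces a future scene prediction task into the VLA paradigm, in which future representations are learned entirely in a high-level latent space under auxiliary supervision from a pretrained vision backbone. Departing from inefficient autoregressive generation, we jointly model future scene and motion prediction within a unified embedding space and perform future-aware reasoning in a single forward pass. We further design a two-stage trajectory decoding strategy that explicitly uses the learned latent future representations to refine trajectory generation. Extensive experiments on the challenging Bench2Drive benchmark show that LVDrive achieves significant improvements in closed-loop driving performance, outperforming both action-supervised methods and image-reconstruction-based world model approaches.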

English abstract

Vision-Language-Action (VLA) models have emerged as a promising framework for end-to-end autonomous driving. However, existing VLAs typically rely on sparse action supervision, which underutilizes their powerful scene understanding and reasoning capabilities. Recent attempts to incorporate dense visual supervision via world modeling often overemphasize pixel-level image reconstruction, neglecting semantically meaningful scene representation learning. In this work, we propose LVDrive, a Latent Visual representation enhanced VLA framework for autonomous driving. LVDrive introduces a future scene prediction task into the VLA paradigm, where future representations are learned entirely in a high-level latent space under auxiliary supervision from a pretrained vision backbone. Departing from inefficient autoregressive generation, we jointly model future scene and motion prediction within a unified embedding space, processed in a single forward pass to conduct the future-aware reasoning. We further design a two-stage trajectory decoding strategy that explicitly leverages the learned latent future representations to refine trajectory generation. Extensive experiments on the challenging Bench2Drive benchmark demonstrate that LVDrive achieves significant improvements in closed-loop driving performance, outperforming both action supervised methods and image-reconstruction-based world model approaches.