arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.18389 2026-05-20 cs.LG math.OC

Spherical Harmonic Optimal Transport: Application to Climate Models Comparisons

Pierre Houédry, Iskander Legheraba, Léo Buecher, Nicolas Courty

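Affiliations: INRIA Rennes; University of Montpellier; LPHI, UMR 5294, CNRS, INSERM; Université Bretagne Sud; IRISA, UMR 6074, CNRS

AI summary: This paper proposes an optimal transport method based on spherical harmonics for comparing climate models efficiently. It exploits the harmonic structure of the sphere to design a fast Sinkhorn algorithm, improving computational efficiency, and discusses its use in evaluating global climate models.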

English abstract

Optimal transport provides a powerful framework for comparing measures while respecting the geometry of their support, but it comes at a high computational cost, which hinders its application to real-world use cases. On manifolds, convolutional algorithms based on the heat kernel have been proposed to alleviate this cost, but their theoretical properties remain largely unexplored. We establish that the heat kernel cost converges to the optimal transport cost as time vanishes in the balanced and unbalanced cases. In the specific case of the 2-sphere $\mathbb{S}^2$, we ensure that the associated Sinkhorn divergences retain the desirable geometric and analytic properties of classical optimal transport discrepancies. Moreover, we leverage the harmonic structure of the sphere to derive a fast Sinkhorn algorithm, requiring only $\mathcal{O}(n)$ memory and $\mathcal{O}(n^{3/2})$ time per iteration, with fully dense GPU-friendly operations. We validate its computational efficiency on synthetic data, and discuss its potential use in the evaluation of global climate models, providing both spatial and seasonal insights into model performance.
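
For readers unfamiliar with the Sinkhorn iteration that the abstract builds on, the generic scheme can be sketched as follows. This is a minimal dense version on four points of the unit sphere, not the paper's spherical-harmonic algorithm: the Gibbs kernel `exp(-d^2 / eps)` is only an assumed stand-in for the heat kernel, and the function names, `eps`, and the sample histograms are illustrative choices.

```python
import math

def geodesic(p, q):
    """Great-circle distance between two unit vectors."""
    dot = sum(x * y for x, y in zip(p, q))
    return math.acos(max(-1.0, min(1.0, dot)))

def sinkhorn(a, b, points, eps=2.0, iters=500):
    """Return the entropic transport plan between histograms a and b."""
    n = len(points)
    # Gibbs kernel on geodesic distance: an assumed stand-in for the heat kernel.
    K = [[math.exp(-geodesic(points[i], points[j]) ** 2 / eps) for j in range(n)]
         for i in range(n)]
    u, v = [1.0] * n, [1.0] * n
    for _ in range(iters):
        # Alternate the two scalings so the plan matches each marginal in turn.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(n)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(n)]
    return [[u[i] * K[i][j] * v[j] for j in range(n)] for i in range(n)]

points = [(1, 0, 0), (0, 1, 0), (0, 0, 1), (-1, 0, 0)]
a = [0.4, 0.3, 0.2, 0.1]
b = [0.1, 0.2, 0.3, 0.4]
plan = sinkhorn(a, b, points)
```

Each iteration here multiplies by a dense n-by-n kernel, costing O(n^2) time and memory; that matrix-vector product is the step the abstract reports replacing with harmonic-domain operations.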

2605.17942 2026-05-20 cs.CV

UAVFF3D: A Geometry-Aware Benchmark for Feed-Forward UAV 3D Reconstruction

Xiang Yang, Yongli Wang, HaiFeng Li, Yunsheng Zhang

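Affiliations: School of Geosciences and Info-Physics

AI summary: This paper introduces the UAVFF3D benchmark to address reconstruction failures in UAV photogrammetry caused by camera-geometry variations. It combines real and synthetic images with a controlled test subset to support UAV-domain adaptation and robustness evaluation.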

Comments 19 pages, 16 figures, 16 tables

English abstract

Feed-forward 3D reconstruction has advanced rapidly, but current models remain unreliable in UAV photogrammetric acquisition. We argue that this failure is caused not only by appearance-domain shift, but also by UAV-specific camera-geometry variations, especially oblique views and HFOV-height ambiguity. Existing UAV datasets mainly emphasize scene diversity and provide limited coverage of camera configurations, which restricts robustness evaluation and UAV-domain adaptation. To address this gap, we introduce UAVFF3D, a geometry-aware real-synthetic benchmark for feed-forward UAV 3D reconstruction. UAVFF3D contains more than 170k real UAV images and more than 370k synthetic images rendered from high-quality textured 3D models, covering diverse HFOVs, flight altitudes, viewing directions, and acquisition patterns. It also includes a controlled HFOV-height test subset for diagnosing projection-geometry ambiguity. We further propose an evaluation protocol that jointly assesses camera-geometry estimation and dense scene reconstruction under a shared global alignment, avoiding the bias caused by separate camera and geometry alignments. Experiments on representative feed-forward reconstruction models show that UAVFF3D-based domain adaptation consistently improves camera and geometry estimation, reducing Ray Error by up to 84.2%, Pose ATE by up to 76.0%, and Chamfer Distance by up to 41.1%. In oblique scenes, adaptation reduces the oblique-nadir rotation gap by up to 90.7%. Under HFOV-height ambiguity, it improves robustness across HFOV-height configurations and yields more stable performance across HFOV settings. Incorporating camera priors further improves reconstruction under UAV-specific acquisition geometries. The dataset and evaluation code are available at https://github.com/yanxian-ll/UAVFF3D .
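
One plausible reading of the HFOV-height ambiguity mentioned above is that a wide lens flown low and a narrow lens flown high can cover the same ground footprint. The sketch below illustrates that reading only; it assumes a nadir-pointing pinhole camera over flat terrain, and the function names and numbers are hypothetical rather than taken from the benchmark.

```python
import math

def footprint_width(hfov_deg, height_m):
    """Ground width covered by a nadir pinhole camera over flat terrain."""
    return 2.0 * height_m * math.tan(math.radians(hfov_deg) / 2.0)

def hfov_for_footprint(width_m, height_m):
    """HFOV in degrees that yields the given footprint from the given height."""
    return math.degrees(2.0 * math.atan(width_m / (2.0 * height_m)))

# A 60-degree lens at 100 m and a narrower lens at 200 m see the same ground width.
wide_low = footprint_width(60.0, 100.0)
matching_hfov = hfov_for_footprint(wide_low, 200.0)
narrow_high = footprint_width(matching_hfov, 200.0)
```

Under these assumptions the two configurations image the same strip of ground, so a model without camera priors has little to distinguish them on flat scenes.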

2605.17916 2026-05-20 cs.CV

PanoWorld: A Generative Spatial World Model for Consistent Whole-House Panorama Synthesis

Jinrang Jia, Zhenjia Li, Yijiang Hu, Yifeng Shi

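Affiliations: Ke Holdings Inc.

AI summary: This paper proposes PanoWorld, a generative spatial world model that autoregressively generates node-based 360-degree panoramas for consistent whole-house panorama synthesis. It addresses two problems: pure 2D generators re-imagine geometry and materials when the viewpoint changes, and monolithic 3D generation is expensive and loses texture at multi-room scale.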

Comments 17

English abstract

Generating a consistent whole-house VR tour from a floorplan and style reference requires both photorealistic panoramas and cross-view spatial coherence. Pure 2D generators produce appealing single panoramas but re-imagine geometry and materials when the viewpoint changes, whereas monolithic 3D generation becomes expensive and loses fine texture at multi-room scale. We introduce PanoWorld, a generative spatial world model that treats whole-house synthesis as autoregressive generation of node-based 360-degree panoramas, matching the discrete navigation used by real VR tour products. PanoWorld uses a floorplan-derived 3D shell as a global geometric proxy and a dynamic 3D Gaussian Splatting cache as renderable spatial memory. A feed-forward panoramic LRM designed for metric-scale multi-room 360-degree inputs lifts generated panoramas into local 3DGS updates, while Room-aware Group Attention suppresses cross-room feature interference. A topology-aware progressive caching strategy fuses these local updates without repeatedly reconstructing the full history. By decoupling shell-based geometry guidance from cache-rendered visual memory, PanoWorld preserves high-frequency 2D synthesis quality while improving cross-node layout and material consistency. The project link is https://jjrcn.github.io/PanoWorld-project-home/

2605.17889 2026-05-20 cs.LG

CoX-MoE: Coalesced Expert Execution for High-Throughput MoE Inference with AMX-Enabled CPU-GPU Co-Execution

Muyoung Son, Yi Chen, Seungjae Yoo, Soongyu Choi, Joo-Young Kim

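Affiliations: KAIST

AI summary: This paper proposes CoX-MoE, an AMX-enabled CPU-GPU collaborative system that optimizes MoE inference throughput through coalesced expert execution and strategic workload orchestration. It introduces a coalescing-aware orchestration policy to optimize resource allocation and a static expert-aware stratification scheme to reduce PCIe transfer overhead, achieving up to 7.1x and 2.4x higher throughput than FlexGen and MoE-Lightning, respectively.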

Comments 7 pages, 8 figures, accepted to DAC '26

English abstract

The Mixture-of-Experts (MoE) architecture improves computational efficiency via sparse expert activation, but throughput-oriented inference faces substantial GPU memory pressure due to large parameter sizes and intermediate data. Prior work attempts to mitigate this using expert offloading with micro-batching or by offloading computation to the CPU. However, the fragmented workload resulting from micro-batching degrades operational intensity, causing expert execution to become memory-bound. Meanwhile, CPU offloading is constrained by slow PCIe transfers and its limited applicability to attention computation in the decode stage. Consequently, these inefficiencies prevent effective system utilization, severely restricting the end-to-end throughput of MoE inference. To address these challenges, this paper proposes CoX-MoE, an Advanced Matrix Extensions (AMX)-enabled CPU-GPU collaborative system that comprehensively optimizes MoE inference by combining coalesced expert execution with strategic workload orchestration for higher throughput. CoX-MoE introduces (i) a coalescing-aware orchestration policy to jointly optimize resource allocation by adopting ordinary batching, instead of micro-batching, for expert computation and selective attention offloading, and (ii) a static expert-aware stratification scheme that pre-assigns frequently activated experts to the GPU, mitigating PCIe transfer overhead and balancing workload for the CPU and GPU during inference. Compared to state-of-the-art frameworks, CoX-MoE delivers significant gains, achieving up to 7.1x and 2.4x higher throughput than FlexGen and MoE-Lightning, respectively.
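
The idea of pre-assigning frequently activated experts to the GPU can be illustrated with a small frequency-ranking sketch. This is an assumption-laden toy, not the CoX-MoE scheme: the routing trace, the `gpu_slots` budget, and the `stratify_experts` helper are hypothetical, and the real system also has to balance CPU and GPU load.

```python
from collections import Counter

def stratify_experts(routing_trace, num_experts, gpu_slots):
    """Place the most frequently activated experts on the GPU, the rest on the CPU."""
    counts = Counter(routing_trace)
    # Rank by descending activation count; break ties by expert id for determinism.
    ranked = sorted(range(num_experts), key=lambda e: (-counts[e], e))
    gpu = sorted(ranked[:gpu_slots])
    cpu = sorted(ranked[gpu_slots:])
    return gpu, cpu

# Hypothetical profiling trace: the expert id chosen for each routed token.
trace = [0, 3, 3, 1, 3, 0, 2, 3, 0, 5]
gpu_experts, cpu_experts = stratify_experts(trace, num_experts=6, gpu_slots=2)
```

Because the placement is computed once from a profiling trace, no expert weights need to cross PCIe at inference time in this toy setup.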

2605.17809 2026-05-20 cs.AI cs.IR

Accelerating AI-Powered Research: The PuppyChatter Framework for Usable and Flexible Tooling

Chun-Hsiung Tseng, Hao-Chiang Koong Lin, Andrew Chih-Wei Huang, Yung-Hui Chen, Jia-Rou Lin

Affiliations: Dept. of Electrical Engineering, YuanZe Univ.; Dept. of Information and Learning Technology, National Univ. of Tainan; Dept. of Psychology, Fo Guang Univ.; Dept. of Computer Information and Network Engineering, Lunghwa Univ.

AI summary: This paper proposes the PuppyChatter framework to address challenges in AI application development. It combines the intuitiveness of vendor-specific SDKs with the vendor neutrality of model abstraction to offer a more streamlined and flexible development approach.

English abstract

This research addresses the challenges inherent in developing Artificial Intelligence (AI) applications, particularly those leveraging Large Language Models (LLMs). While AI vendors provide Application Programming Interfaces (APIs) and Software Development Kits (SDKs) to facilitate developer interaction, the former often requires intricate manual request construction, and the latter can lead to significant vendor lock-in. Furthermore, existing model abstraction frameworks, though mitigating vendor dependency, introduce an additional layer of complexity and potential security concerns. To reconcile these conflicting factors, the study introduces PuppyChatter, a novel software framework designed to preserve the intuitive simplicity of vendor-specific SDKs while simultaneously adhering to the vendor-neutrality principles characteristic of model abstraction, thereby offering a more streamlined and flexible development paradigm.

2605.17804 2026-05-20 cs.LG eess.SP

GenTS: A Comprehensive Benchmark Library for Generative Time Series Models

Chenxi Wang, Xiaorong Wang, Peiyang Li, Yi Wang

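Affiliations: The University of Hong Kong; Fudan University

AI summary: This paper proposes GenTS, a comprehensive and extensible benchmark library for systematically evaluating generative time series models. A unified data preprocessing pipeline, a diverse model collection, and panoramic evaluation metrics give generative models a more flexible evaluation framework.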

English abstract

Generative models have demonstrated remarkable potential in time series analysis tasks such as synthesis, forecasting, and imputation. However, existing time series libraries offer limited coverage for generative models: they are mainly engineered for discriminative models, with standardized workflows for specific tasks, such as optimizing mean squared error for time series forecasting. This rigid structure is fundamentally incompatible with the distinct and often complex paradigms of generative models (e.g., adversarial training, diffusion processes), which learn the underlying data distribution rather than a direct input-output mapping. To this end, we propose GenTS, a comprehensive and extensible benchmark library designed for systematic assessment of generative time series models. GenTS features a unified data preprocessing pipeline, a collection of versatile models, and panoramic evaluation metrics. Its modular design also enables researchers to flexibly customize beyond our built-in datasets and models. Based on GenTS, we conducted benchmarking experiments under diverse tasks, offering suggestions for model selection and identifying potential directions for future research. Our code is open source at https://github.com/WillWang1113/GenTS. The official tutorials and documentation are available at https://willwang1113.github.io/GenTS/.

2605.17539 2026-05-20 cs.AI

Memory-Guided Tree Search with Cross-Branch Knowledge Transfer for LLM Solver Synthesis

Fatemeh Haji, Javier Delarosa Quiros, Peyman Najafirad

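Affiliations: Secure AI and Autonomy Lab; The University of Texas at San Antonio

AI summary: This study proposes the MEMOIR framework, which uses a two-level memory hierarchy to guide tree search for solver synthesis. Cross-branch knowledge transfer improves the validity and quality of the synthesized solvers' solutions.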

English abstract

Combinatorial optimization (CO) underlies decision-making from logistics to chip design, where infeasible solutions are operationally unusable and small quality gains translate into substantial economic value. Recent work uses large language models (LLMs) to automate solver synthesis: generating executable solver programs from natural-language specifications. However, existing tree-search and evolutionary agents refine candidate trajectories in parallel without explicit knowledge transfer, reintroducing the same constraint violations and converging on similar algorithm families. We introduce MEMOIR, a memory-guided tree-search framework with a two-level memory hierarchy: branch-local memory preserves execution-grounded refinement details within a branch as it iterates on a single algorithmic design, while global memory stores compressed algorithmic and failure-mode summaries across branches. A reflection step at branch termination distills these summaries, enabling cross-branch transfer without polluting future contexts with low-level debugging traces. Across seven CO problems spanning scheduling, routing, packing, and geometric design, MEMOIR achieves 96.7% solution validity (a 9.2 point gap over the strongest baseline) and improves the average normalized score by 7.3 points at matched per-method execution budget. Over three independent runs on four problems, MEMOIR's run-to-run validity standard deviation is more than an order of magnitude below that of every baseline we evaluated in this setting, suggesting that memory-guided exploration yields consistent improvements rather than reflecting sampling variance.
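
The two-level memory described above can be pictured with a small data-structure sketch. Everything here is hypothetical: the class names, fields, and the rule-based `reflect` step are illustrative assumptions, whereas the reflection step in the abstract distills summaries rather than merely counting traces.

```python
from dataclasses import dataclass, field

@dataclass
class BranchMemory:
    """Detailed, execution-grounded notes kept only within one branch."""
    design: str
    traces: list = field(default_factory=list)

    def record(self, note):
        self.traces.append(note)

@dataclass
class GlobalMemory:
    """Compressed summaries shared across branches."""
    summaries: list = field(default_factory=list)

    def reflect(self, branch, outcome):
        # Hypothetical distillation: keep the design, the outcome, and an attempt
        # count, and leave the low-level debugging traces behind.
        self.summaries.append(
            {"design": branch.design, "outcome": outcome, "attempts": len(branch.traces)}
        )

shared = GlobalMemory()
branch = BranchMemory(design="greedy insertion")
branch.record("capacity constraint violated on route 3")
branch.record("fixed capacity check; score 0.81")
shared.reflect(branch, outcome="valid after one repair")
```

Later branches would read only `shared.summaries`, which is how this sketch keeps debugging traces out of future contexts.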

2605.17480 2026-05-20 cs.AI

The Capability Paradox: How Smarter Auditors Make Multi-Agent Systems Less Secure

Qiqi Liu, Thorsten Holz, Shilin Ye, Runhan Song

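Affiliations: University of Chinese Academy of Sciences; Max Planck Institute for Security and Privacy; Henan Yinzhu Safety Technology Co., Ltd.; Harbin Institute of Technology, Faculty of Computing

AI summary: This paper studies a phenomenon in multi-agent systems where the system-level attack success rate rises as Worker capability increases. It identifies linguistic certainty as the driver of attack propagation and proposes heterogeneous ensemble verification as a defense that lowers the attack success rate.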

Comments 28 pages, 6 figures

English abstract

Multi-agent systems extend large language models (LLMs) by decomposing tasks among specialized agents, but their distributed decision process creates new attack surfaces. We identify semantic hijacking, an attack in which harmful requests are concealed within domain-specific narratives and propagated to a Manager through Worker reports, without any syntactic injection primitives. Across 42,000 adversarial trials over 12 Manager models and 7 Worker configurations, we uncover a capability paradox: as Worker capability increases, the mean system-level Attack Success Rate (ASR) increases from 18.4% to 63.9%, peaking at 94.4%. To explain this effect, we conduct multi-level mediation analysis on two independent datasets (47,807 interactions). This analysis shows that this paradox is driven by linguistic certainty: stronger Workers are more likely to interpret adversarial narratives as legitimate, convey their conclusions assertively, and thereby lead Managers to treat such confident endorsements as justification to execute. In our larger Worker-Only setting ($n_W = 14$), certainty mediates 74% of the effect, with 95% confidence intervals (CI) excluding zero under both Monte Carlo and cluster bootstrap; the smaller Full-MAS setting ($n_W = 6$) shows a directionally consistent indirect effect. Worker-side safety prompting does not reliably mitigate this failure. Building on the mediation finding, we propose heterogeneous ensemble verification, which pairs Workers of asymmetric domain competence so their complementary vulnerabilities break the certainty-to-execution chain, reducing ASR from 52.8% to 2.0% with negligible benign-task impact. Our results show that upgrading components to stronger models can actively degrade system security, and that effective defenses require exploiting--rather than eliminating--capability asymmetries between agents.

2605.17470 2026-05-20 cs.CV cs.MM eess.IV

EchoSR: Efficient Context Harnessing for Lightweight Image Super-Resolution

Hanli Zhao, Binhao Wang, Shihao Zhao, Tao Wang, Kaihao Zhang, Wanglong Lu

Affiliations: College of Computer Science and Artificial Intelligence, Wenzhou University, Wenzhou 325000, China; vivo BlueImage Lab, vivo Mobile Communication Co., Ltd, Shanghai 200100, China; College of Engineering and Computer Science, Australian National University, Canberra, Australia; The AI/Analytics Team, Nasdaq, St. John’s, Canada

AI summary: This paper proposes the EchoSR framework, which unifies multi-scale receptive field modeling and hierarchical context fusion for lightweight image super-resolution. It outperforms existing methods on multiple benchmarks while running roughly twice as fast.

Comments Accepted by Information Fusion; 20 pages, 17 figures

English abstract

Image super-resolution (SR) aims to reconstruct high-quality, high-resolution (HR) images from low-resolution (LR) inputs and plays a critical role in various downstream applications. Despite recent advancements, balancing reconstruction fidelity and computational efficiency remains a fundamental challenge, particularly in resource-constrained scenarios. While existing lightweight methods attempt to expand receptive fields, many of them either incur substantial computational overhead, naively scale up kernel sizes, or lack mechanisms for coherent multi-scale integration, limiting their overall effectiveness and scalability. To address these limitations, we propose EchoSR, an efficient context-harnessing framework for lightweight image super-resolution, which unifies multi-scale receptive field modeling and hierarchical context fusion. EchoSR decouples feature learning into disentangled local, multi-scale, and global modeling stages through an efficient context-harnessing strategy, and further promotes seamless cross-scale integration via a cross-scale overlapping fusion mechanism. Extensive experiments have shown that EchoSR consistently outperforms state-of-the-art lightweight super-resolution methods across multiple benchmarks, while also achieving a faster speed $(\sim 2\times)$. The source code is available at https://github.com/funnyWang-Echoes/EchoSR.

2605.17370 2026-05-20 cs.AI

CBT-Audio: Evaluating Audio Language Models for Patient-Side Distress Intensity Estimation in CBT Session Recordings

Qixuan Hu, Shuchang Ye, Xumou Zhang, Anastasia Serafimovska, Anastasia Suraev, Amit Saha, Ping-hsiu Lin, Sydney Su, Usman Naseem, Adam G. Dunn, Jinman Kim

Affiliations: School of Computer Science, Faculty of Engineering, University of Sydney, Australia; School of Psychology, Faculty of Science, University of Sydney, Australia; School of Computing, Faculty of Science and Engineering, Macquarie University, Australia; CHeBA (Centre for Healthy Brain Ageing), School of Clinical Medicine, Discipline of Psychiatry & Mental Health, The University of New South Wales, Australia; Sydney School of Public Health, Faculty of Medicine and Health, University of Sydney, Australia

AI summary: This paper introduces the CBT-Audio dataset for evaluating how well audio language models estimate patient distress intensity in CBT sessions. Combining audio with transcript input improves distress estimation for most of the evaluated models.

Comments 9 pages, 3 figures, 2 tables

English abstract

Cognitive behavioural therapy is widely used to help patients understand and manage psychological distress. It is often delivered through spoken conversation, where therapists attend not only to what patients say, but also to how they say it, because these cues can help therapists decide how to respond and adapt treatment. Progress in building AI systems for CBT remains largely limited to text, partly because most available datasets are text-based and shareable spoken CBT data are scarce under ethical and privacy constraints. This creates a blind spot because text-based models and evaluations cannot capture the mismatch between the transcript and the patient's voice, even though therapists often rely on this mismatch to understand patient distress. We introduce CBT-Audio, a dataset for evaluating patient distress estimation from spoken CBT sessions with audio language models. CBT-Audio contains 1,802 patient turns from 96 publicly available CBT recordings, with turn-level distress labels validated on an expert-annotated subset. We evaluate 10 open-source audio language models under three input conditions, where models receive only patient audio, only the transcript, or both audio and transcript. Our results show that audio can provide useful information beyond text, especially when combined with transcripts. Adding audio to transcript input improves distress estimation over using the transcript alone in 8 of 10 model families, with significant gains in 4, and case studies show the clearest benefit when verbal content and vocal delivery diverge. CBT-Audio makes spoken patient behaviour measurable for AI evaluation in CBT-related tasks and supports future work on audio language models for mental health interaction.

2605.17340 2026-05-20 cs.LG

Olivia: Harmonizing Time Series Foundation Models with Power Spectral Density

Jingru Fei, Kun Yi, Alex Xing Wang, Qingsong Wen, Xiangxiang Zhu, Wei Fan

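Affiliations: Beijing Institute of Technology; North China Institute of Computing Technology; State Information Center; University of Auckland; Northwest Polytechnical University; Victoria University of Wellington

AI summary: This paper proposes Olivia, a time series foundation model built on harmonization mechanisms. It uses power spectral density in the frequency domain to reduce mismatches across datasets and enhance pretraining, achieving state-of-the-art performance in zero-shot, few-shot, and full-shot forecasting.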

Comments Accepted by ICML 2026

English abstract

Time series foundation models rely on large-scale pretraining over diverse datasets across domains, yet the heterogeneity of temporal patterns across these datasets can hinder training and the learning of transferable time series representations. Inspired by a fundamental concept in signal processing, the normalized power spectral density (PSD), we hypothesize that harmonizing datasets via PSDs in the spectral domain could reduce mismatches and enhance pretraining. We then go beyond the direct, intractable minimization problem and reformulate it as a principled harmonization approach. Specifically, we propose Harmonizer, a module that reshapes spectral structures and implicitly harmonizes PSDs across datasets, which theoretically corresponds to a shared reparameterization of second-order temporal correlations. Our theoretical analysis further reveals that token interactions with Harmonizer can be efficiently mediated by a compact set of resonators, motivating a HarmonicAttention design that performs self-attention in a low-dimensional interaction space. Then, we propose Olivia, a novel time series foundation model built upon these harmonization mechanisms. Extensive experiments on two large-scale benchmarks (TSLib and GIFT-Eval) and six additional datasets from GluonTS demonstrate that Olivia consistently achieves state-of-the-art performance under zero-shot, few-shot, and full-shot forecasting scenarios. Our code is available at https://github.com/TSTS13/Olivia.
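
As background for the normalized PSD the abstract refers to, the sketch below computes a one-sided periodogram and scales it to sum to 1. It assumes that "normalized" means unit total power, which the abstract does not spell out; the `normalized_psd` helper and the test signal are illustrative, and the naive O(n^2) DFT is used only to stay within the standard library.

```python
import cmath
import math

def normalized_psd(series):
    """Periodogram of a mean-removed series, scaled so the bins sum to 1."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    power = []
    for k in range(n // 2 + 1):  # one-sided spectrum
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(centered))
        power.append(abs(coeff) ** 2)
    total = sum(power)  # zero for a constant series, which this sketch does not handle
    return [p / total for p in power]

# A sine with 4 cycles over 64 samples concentrates its power in bin 4.
signal = [math.sin(2 * math.pi * 4 * t / 64) for t in range(64)]
psd = normalized_psd(signal)
peak_bin = max(range(len(psd)), key=psd.__getitem__)
```

Two datasets with different amplitudes but the same periodic structure yield the same normalized PSD, which is why it is a natural quantity to compare across heterogeneous sources.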

2605.17003 2026-05-20 cs.LG cs.AI

Learning-Zone Energy: Online Data Selection for Efficient RL Post-Training

Peng Cui, Boyao Yang, Jun Zhu

Affiliations: Dept. of Comp. Sci. & Tech.; Institute for AI; BNRist Center; Tsinghua-Bosch Joint ML Center; THBI Lab; Tsinghua University; Dept. of Automation

AI summary: This paper proposes Learning-Zone Energy (LZE), an online data selection framework that concentrates computation on the model's active learning frontier to make RL post-training more efficient. Experiments on multiple datasets show strong results with reduced compute.

English abstract

Reinforcement Learning (RL) post-training has emerged as the dominant paradigm for eliciting mathematical reasoning in Large Language Models (LLMs), yet prevailing techniques such as GRPO and DAPO distribute rollout and gradient budgets nearly uniformly across prompts, squandering compute on samples that are already mastered or remain far beyond the model's current capability. To address this fundamental inefficiency, we propose Learning-Zone Energy (LZE), a theoretically grounded, fully online data selection framework that concentrates computation on the model's active learning frontier. At its core, we define a closed-form Learning-Zone Energy Score that fuses three complementary signals, an initial-difficulty anchor, a normalized outcome-uncertainty term, and a pass-rate momentum, into a single scalar that is provably aligned with the expected magnitude of group-relative policy gradient updates. A forward pruner with replay further reduces wall-clock time cost by skipping rollout generation for persistently solved prompts while periodically checking for forgetting. Evaluated on Qwen-family models (1.5B-8B) across GSM8K, MATH and DAPO-MATH, our method retains only 40% of the training data per step yet matches or surpasses full-data baselines, with especially pronounced out-of-distribution gains on AIME25 (+45.9%) and AMC23 (+18.2%), alongside an estimated 36% reduction in training FLOPs. Our code is available at https://github.com/Stellaris167/LZE.
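
The abstract does not give the closed form of the Learning-Zone Energy Score, so the sketch below is a hypothetical stand-in that only mirrors its three named ingredients. The weights, the `4p(1-p)` uncertainty term, the absolute-change momentum, and all names are assumptions. The one grounded intuition is that prompts whose rollouts all succeed or all fail produce zero group-relative advantage, so they contribute no gradient signal.

```python
def lze_like_score(initial_pass, pass_rate, prev_pass_rate,
                   w_anchor=0.2, w_uncert=0.6, w_momentum=0.2):
    """Hypothetical score; the functional form and weights are assumptions."""
    anchor = 1.0 - initial_pass                          # harder at the start scores higher
    uncertainty = 4.0 * pass_rate * (1.0 - pass_rate)    # peaks at pass_rate = 0.5
    momentum = abs(pass_rate - prev_pass_rate)           # recently changing prompts
    return w_anchor * anchor + w_uncert * uncertainty + w_momentum * momentum

def select_top_fraction(prompts, fraction=0.4):
    """Keep the highest-scoring fraction of prompts for this training step."""
    keep = max(1, round(fraction * len(prompts)))
    ranked = sorted(prompts, key=lambda p: lze_like_score(*p[1:]), reverse=True)
    return [p[0] for p in ranked[:keep]]

# (name, initial pass rate, current pass rate, previous pass rate)
prompts = [
    ("mastered", 0.9, 1.0, 1.0),
    ("frontier", 0.3, 0.5, 0.4),
    ("improving", 0.1, 0.3, 0.1),
    ("too_hard", 0.0, 0.0, 0.0),
    ("nearly_done", 0.6, 0.9, 0.9),
]
selected = select_top_fraction(prompts)
```

With these toy numbers the 40% budget goes to the two prompts with mixed outcomes, while the mastered and unsolvable prompts are skipped.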

2605.16736 2026-05-20 cs.CV

CAB: Accelerating Flow and Diffusion Sampling via Rectification and Corrected Adams-Bashforth

Anuska Roy, Pravin Nair

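Affiliations: Department of Electrical Engineering

AI summary: This paper proposes CAB, a training-free sampler that transforms the sampling dynamics into a common rectified coordinate system and applies a multistep Adams-Bashforth predictor with a simple correction term based on past velocity evaluations, accelerating flow and diffusion models without additional function evaluations.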

English abstract

Flow and diffusion models achieve high-fidelity, high-resolution image synthesis, but often require many function evaluations (NFEs) at sampling time. Existing acceleration methods either require additional training through distillation or rely on training-free high-order solvers, and both can degrade sample quality at low NFE budgets. We propose CAB (Corrected Adams-Bashforth), a training-free sampler that accelerates both flow and diffusion models. CAB first transforms the sampling dynamics to a common rectified coordinate system, and then applies a multistep Adams-Bashforth predictor augmented with a simple correction term based on past velocity evaluations and therefore incurs no additional NFEs. The resulting method is simple, has the same algorithmic form across model classes, and has at least third-order local truncation error and second-order global error. Experiments on pretrained flow and diffusion models, including class-conditional and large-scale text-to-image benchmarks, show that CAB improves quality-NFE trade-offs in the low-step regime of 6-20 NFEs. It also remains competitive with strong training-free samplers at higher step counts across most tested models. The official implementation is available at https://github.com/Anuska-Roy/CAB.
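
To show why a multistep predictor adds no function evaluations, here is the textbook two-step Adams-Bashforth method on a scalar test ODE. It is plain AB2, not CAB: the rectified coordinates and the correction term from the abstract are omitted, and the Euler bootstrap step, function names, and test problem are choices made for this sketch.

```python
import math

def ab2_integrate(f, x0, t0, t1, steps):
    """Integrate dx/dt = f(t, x) with two-step Adams-Bashforth."""
    h = (t1 - t0) / steps
    t, x = t0, x0
    f_prev = f(t, x)
    # Bootstrap the multistep history with a single Euler step.
    x, t = x + h * f_prev, t + h
    for _ in range(steps - 1):
        f_curr = f(t, x)  # the only new function evaluation in this step
        x = x + h * (1.5 * f_curr - 0.5 * f_prev)  # reuse the stored past slope
        t += h
        f_prev = f_curr
    return x

def decay(t, x):
    """Test problem dx/dt = -x, whose exact solution is exp(-t)."""
    return -x

approx = ab2_integrate(decay, 1.0, 0.0, 1.0, steps=20)
exact = math.exp(-1.0)
```

Each step calls `f` once and reuses the previous slope, so the evaluation count equals the step count, the same budget as Euler but with second-order global accuracy.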

2605.16712 2026-05-20 cs.AI cs.CL cs.HC

Recall Isn't Enough: Bounding Commitments in Personalized Language Systems

仅有召回并不足够:为个性化语言系统中的承诺划定边界

Rui Tang, Yichi Zhang, Xi Chen, Chen Dong, Youwei Yang, Yumeng Shen, Qiangqiang Liu

Affiliations*: OpenAsk; Stern School of Business, New York University; Bank of Hebei; Lingnan College, Sun Yat-sen University; School of Economics, Xiamen University; BitMart; Binance

AI summary: This paper introduces Contract-Bounded Evidence Activation (CBEA) with Lexicographic Commitment Validation (LCV) to bound commitments in personalized language systems, reaching zero failures within validator scope across 360 fixtures and three generation backends while lowering the input payload.

Comments 14 pages, 3 figures, 22 tables; preprint version

Details
AI Chinese abstract

长上下文和记忆系统通常将个性化视为召回问题。在实践中,许多故障发生在系统承诺时:它将嘈杂的提示转化为硬约束,丢弃罕见的见证,忘记下游义务,或在不可行的情况下作答。我们引入了合同界定证据激活(CBEA)与词典承诺验证(LCV)。CBEA通过类型覆盖、尾见证和后果债务激活一个有界的证据集;LCV在文本之前验证结构化的承诺,并将不可行的状态路由到修复、回避或再合同。在360个测试用例和三个生成后端上,CBEA+LCV在验证范围内达到零失败,可用性为0.49-0.60,而具有相同LCV门的原始和长上下文基线只有在0.003-0.092时才能达到零失败。一个影子 oracle 诊断标记了极限:CBEA+LCV召回了0.012个未编译的可见事实,而原始召回了0.53。结果是一个有界的操作点:显式的承诺控制和74-75%更低的中位数输入负载,而不是普遍的记忆主导。

English abstract

Long-context and memory systems usually treat personalization as a recall problem. In practice, many failures occur later, when a system commits: it turns noisy hints into hard constraints, drops rare witnesses, forgets downstream obligations, or answers despite infeasibility. We introduce Contract-Bounded Evidence Activation (CBEA) with Lexicographic Commitment Validation (LCV). CBEA activates a bounded evidence set using typed coverage, tail witnesses, and consequence debt; LCV validates structured commitments before prose and routes infeasible states to repair, abstention, or recontract. Across 360 fixtures and three generation backends, CBEA+LCV reaches zero failures within validator scope at 0.49-0.60 availability over attempted runs. Raw and long-context baselines with the same LCV gate reach zero only at 0.003-0.092. A shadow oracle diagnostic marks the limit: CBEA+LCV recalls 0.012 of uncompiled visible facts, while raw recalls 0.53. The result is a bounded operating point: explicit commitment control and 74-75% lower median input payload, not universal memory dominance.

2605.16679 2026-05-20 cs.CL cs.AI

CHI-Bench: Can AI Agents Automate End-to-End, Long-Horizon, Policy-Rich Healthcare Workflows?

CHI-Bench: 能否让AI代理自动化端到端、长周期、政策丰富的医疗工作流程?

Haolin Chen, Deon Metelski, Leon Qi, Tao Xia, Joonyul Lee, Steve Brown, Kevin Riley, Frank Wang, T. Y. Alvin Liu, Hank Capps MD, Zeyu Tang, Xiangchen Song, Lingjing Kong, Fan Feng, Tianyi Zeng, Zhiwei Liu, Zixian Ma, Hang Jiang, Fangli Geng, Yuan Yuan, Chenyu You, Qingsong Wen, Hua Wei, Yanjie Fu, Yue Zhao, Carl Yang, Biwei Huang, Kun Zhang, Caiming Xiong, Sanmi Koyejo, Eric P. Xing, Philip S. Yu, Weiran Yao

Affiliations*: Johns Hopkins Medicine; Wellstar Health System; Stanford University; CMU; UCSD; Yale School of Medicine; Salesforce AI Research; University of Washington; Northeastern University; Brown University; Boston College; Stony Brook University; University of Oxford; Arizona State University; University of Southern California; Emory University; MBZUAI; Recursive Superintelligence; University of Illinois at Chicago

AI summary: This paper presents CHI-Bench, a benchmark that evaluates whether AI agents can automate end-to-end, long-horizon, policy-rich healthcare workflows, targeting three capabilities underrepresented in current benchmarks: policy density, multi-role composition, and multilateral interaction.

Comments Website: https://actava.ai/benchmarks Code: https://github.com/actava-ai/chi-bench Dataset: https://huggingface.co/datasets/actava/chi-bench

Details
AI Chinese abstract

现实医疗操作的端到端自动化要求具备当前基准测试中较少体现的三种能力:政策密度,即决策必须基于大量医疗、保险和运营规则;多角色组成,即单个任务需要代理扮演多个角色并进行交接;以及多方交互,即中间工作流程步骤是多轮对话,例如同行评审和患者接触。我们介绍了CHI-Bench,一个涵盖三个领域的长周期医疗工作流程基准:提供者预先授权、支付方使用管理以及护理管理。每个任务都会将代理置于一个高保真模拟器中,该模拟器暴露了20个医疗应用程序,通过87个MCP工具。代理必须通过工具调用和编写角色的文档来驱动任务完成,受1,290多份文档管理护理操作手册技能的指导。在30种代理配置下,最佳代理仅能解决28.0%的任务,没有代理在严格通过标准下达到20%以上,且单次会话执行所有任务会将性能降至3.8%。这些结果提出了假设,即在其他政策密集、角色组成和不可逆的企业领域中,类似的差距可能会出现。

English abstract

End-to-end automation of realistic healthcare operations stresses three capabilities underrepresented in current benchmarks: policy density, where decisions must be grounded in a large library of medical, insurance, and operational rules; multi-role composition, where a single task requires the agent to play multiple roles with handoffs; and multilateral interaction, where intermediate workflow steps are multi-turn dialogs, such as peer-to-peer review and patient outreach. We introduce $χ$-Bench, a benchmark of long-horizon healthcare workflows across three domains: provider prior authorization, payer utilization management, and care management. Each task hands the agent a clinical case in a high-fidelity simulator of 20 healthcare apps exposed via 87 MCP tools, which it must drive to a terminal status through tool calls and by writing the role's artifacts, guided by a 1,290+ document managed-care operations handbook skill. Across 30 agent harness/model configurations, the best agent resolves only 28.0% of tasks, no agent clears 20% on strict pass^3, and executing all tasks in a single session drops performance to 3.8%. These results raise the hypothesis that similar gaps are likely to surface in other policy-dense, role-composed, irreversible enterprise domains.

2605.16353 2026-05-20 cs.CV cs.AI

StrLoRA: Towards Streaming Continual Visual Instruction Tuning for MLLMs

StrLoRA: 迈向面向多模态大语言模型的流式持续视觉指令微调

Chang Che, Ziqi Wang, Hui Ma, Cheems Wang, Zenglin Shi

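Affiliations*: Hefei University of Technology; Tsinghua University

AI summary: This paper proposes StrLoRA, a method for streaming continual visual instruction tuning that addresses continual learning over dynamic task streams, using a task-aware expert routing framework to improve performance on continuously evolving data streams.

Details
AI Chinese abstract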

持续视觉指令微调(CVIT)使多模态大语言模型能够逐步获得新能力。然而,现有CVIT方法在任务增量设置下运行,每个训练阶段对应一个预定义任务,这不符合现实世界中数据作为连续流中交织和动态变化的任务的条件。为弥合这一差距,我们引入流式CVIT(StrCVIT),一种更通用和现实的设置,其中模型从包含动态混合任务的数据块中学习。在StrCVIT中,模型必须同时获得新能力、强化常见能力并减轻遗忘。现有CVIT方法在此处失败,因为它们无法可靠地区分或适应每个块内的异构任务样本。因此,我们提出了StrLoRA,一种正则化的两阶段专家路由框架。StrLoRA首先使用文本指令进行任务感知的专家选择,激活相关专家的稀疏子集,减少跨任务干扰。然后在该子集内应用基于令牌的专家加权,其中贡献权重通过本地视觉令牌与全局指令表示之间的跨模态注意力计算。为了在非平稳流中保持稳定性,路由稳定性正则化将当前路由分布与历史指数移动平均参考对齐。在新开发的StrCVIT基准上的广泛实验表明,StrLoRA显著优于现有方法,有效提升了模型从持续演变的数据流中获取能力的能力。代码可在https://github.com/chanceche/StrCVIT获取。

English abstract

Continual Visual Instruction Tuning (CVIT) enables Multimodal Large Language Models to incrementally acquire new abilities. However, existing CVIT methods operate under a restrictive task-incremental setting, where each training phase corresponds to a single, predefined task. This does not reflect real-world conditions, where data arrives as a continuous stream of interleaved and dynamically evolving tasks. To bridge this gap, we introduce Streaming CVIT (StrCVIT), a more general and realistic setting where models learn from a stream of data chunks containing a dynamic mixture of tasks. In StrCVIT, a model must simultaneously acquire new abilities, reinforce recurring abilities, and mitigate forgetting. Existing CVIT methods fail here as they cannot reliably distinguish or adapt to the heterogeneous task samples within each chunk. We therefore propose StrLoRA, a regularized two-stage expert routing framework. StrLoRA first performs task-aware expert selection using the textual instruction to activate a sparse subset of relevant experts, reducing cross-task interference. It then applies token-wise expert weighting within this subset, where contribution weights are computed via cross-modal attention between local visual tokens and the global instruction representation. To maintain stability across the non-stationary stream, a routing-stability regularization aligns current routing distributions with a historical exponential moving average reference. Extensive experiments on a newly developed StrCVIT benchmark show that StrLoRA substantially outperforms existing methods, effectively improving a model's ability to learn from continuously evolving data streams. The code is available at https://github.com/chanceche/StrCVIT.
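
The routing-stability regularization mentioned in the abstract can be pictured as a divergence penalty between the current routing distribution and an exponential moving average (EMA) of past ones. The sketch below is one possible reading, not the authors' implementation: the KL direction, the batch-mean aggregation, the decay of 0.9, and every function name are assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one row of router logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mean_routing(batch_logits):
    """Average the per-sample routing distributions over a chunk."""
    probs = [softmax(row) for row in batch_logits]
    n = len(probs)
    return [sum(p[j] for p in probs) / n for j in range(len(probs[0]))]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q); eps guards against log(0)."""
    return sum(pi * (math.log(pi + eps) - math.log(qi + eps))
               for pi, qi in zip(p, q))


def ema_update(reference, current, decay=0.9):
    """Blend the historical reference with the current distribution."""
    return [decay * r + (1.0 - decay) * c for r, c in zip(reference, current)]


# Hypothetical example: 4 experts, a uniform historical reference,
# and a chunk of two samples whose router favours expert 0.
reference = [0.25, 0.25, 0.25, 0.25]
batch_logits = [[2.0, 0.0, 0.0, 0.0], [1.5, 0.5, 0.0, 0.0]]
current = mean_routing(batch_logits)
penalty = kl_divergence(current, reference)  # would be added to the loss
reference = ema_update(reference, current)   # reference drifts slowly
```

A slow EMA keeps the reference from chasing every chunk, so the penalty only grows when routing shifts abruptly relative to recent history.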

2605.15975 2026-05-20 cs.AI cs.RO

Learning Bilevel Policies over Symbolic World Models for Long-Horizon Planning

在符号世界模型上学习双层策略以实现长周期规划

Dillon Z. Chen, Till Hofmann, Toryn Q. Klassen, Sheila A. McIlraith

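Affiliations*: Vector Institute; University of Toronto; LAAS-CNRS; University of Toulouse; RWTH Aachen University

AI summary: This paper proposes bilevel policies that combine low-level imitation learning with high-level symbolic abstractions for long-horizon planning, and uses the BISON system on extended MetaWorld benchmarks to show advantages on problems with many objects and long horizons.

Details
AI Chinese abstract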

我们解决了构建具有身体智能的AI代理以可靠解决长周期规划问题的挑战。模仿学习从演示中已显示出在训练机器人解决需要精细运动控制和操作的复杂任务方面的有效性。然而,仅通过模仿学习生成长周期计划仍然是一个艰巨的挑战。相比之下,高层(HL)符号抽象能够促进高效且可解释的长周期规划。我们提出结合低层(LL)模仿学习在操作和控制中的优势,以及高层符号抽象在长周期规划中的优势。我们通过双层策略(π^hl, π^ll)实现这一想法,其中包括从低层演示中学习的神经策略π^ll,以及由低层演示的符号抽象和归纳概括结合而成的高层符号策略π^hl。我们实现了这些想法的BISON系统。在扩展的MetaWorld基准上的实验表明,BISON能够泛化到长周期和更多物体数量的问题,比VLA和端到端方法更高效,并且在训练和推理中更节省时间和内存。值得注意的是,当忽略低层执行时,BISON的高层策略可以在一分钟内解决包含10,000个相关物体的高层问题。项目页面:https://dillonzchen.github.io/bison

English abstract

We tackle the challenge of building embodied AI agents that can reliably solve long-horizon planning problems. Imitation learning from demonstrations has shown itself to be effective in training robots to solve a diversity of complex tasks requiring fine motor control and manipulation over low-level (LL), continuous environments. Yet, it remains a difficult endeavour to generate long-horizon plans from imitation learning alone. In contrast, high-level (HL), symbolic abstractions facilitate efficient and interpretable long-horizon planning. We propose to combine the strengths of LL imitation learning for manipulation and control, and HL symbolic abstractions for long-horizon planning. We realise this idea via bilevel policies of the form $(π^{\mathrm{hl}}, π^{\mathrm{ll}})$, consisting of a neural policy $π^{\mathrm{ll}}$ learned from LL demonstrations, and an HL symbolic policy $π^{\mathrm{hl}}$ that is constructed from symbolic abstractions of the LL demonstrations combined with inductive generalisation. We implement these ideas in the BISON system. Experiments on extended MetaWorld benchmarks demonstrate that BISON generalises to long horizons and problems with greater numbers of objects than those solved by VLA and end-to-end methods, and is more time and memory efficient in training and inference. Notably, when ignoring LL execution, BISON's HL policies can solve HL problems with 10,000 relevant objects in under a minute. Project page: https://dillonzchen.github.io/bison

2605.15768 2026-05-20 cs.AI cs.CY

ALSO: Adversarial Online Strategy Optimization for Social Agents

ALSO: 用于社交代理的对抗在线策略优化

Xiang Li, Liping Yi, Mingze Kong, Min Zhang, Zhongxiang Dai, QingHua Hu

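Affiliations*: School of Artificial Intelligence, Tianjin University, Tianjin, China; The Chinese University of Hong Kong, Shenzhen, China; East China Normal University, Shanghai, China

AI summary: This paper proposes ALSO, a framework that formulates multi-turn interaction as an adversarial bandit problem and introduces a lightweight neural surrogate to predict rewards, enabling robust strategy optimization for social agents in dynamic environments.

Comments Accepted at ICML 2026

Details
AI Chinese abstract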

社交模拟为研究社会智能提供了一个有力的测试平台,其中代理在不断变化的上下文中、面对会策略性调整的对手,通过多轮对话进行交互。此类环境本质上是非平稳的,要求代理动态调整其策略。然而,大多数基于大型语言模型(LLM)的社会代理依赖于静态人设,而现有的增强社会智能的方法,如离线强化学习或外部规划器,不适用于这些设置,通常假设平稳性并导致显著的训练开销。为弥合这一差距,我们提出了ALSO(对抗性在线策略优化),这是首个用于多代理社交模拟的在线策略优化框架。ALSO通过两个关键贡献提升了社会适应性:(1)ALSO将多轮交互建模为对抗性老虎机问题,其中静态人设和动态策略指令的组合被视为臂,提供了一种不依赖环境稳定性假设的解决方案;(2)为了泛化多轮对话中的稀疏反馈,ALSO引入了轻量级神经代理模型(surrogate)来从交互历史中预测奖励,从而实现样本高效的探索和持续在线适应。在Sotopia基准测试中,ALSO在动态环境中一致优于静态基线和现有优化方法,验证了对抗性在线策略优化在构建鲁棒社会代理方面的有效性。

English abstract

Social simulation provides a compelling testbed for studying social intelligence, where agents interact through multi-turn dialogues under evolving contexts and strategically adapting opponents. Such environments are inherently non-stationary, requiring agents to dynamically adjust their strategies over time. However, most Large Language Model (LLM) based social agents rely on static personas, while existing approaches for enhancing social intelligence, such as offline reinforcement learning or external planners, are ill-suited to these settings, typically assuming stationarity and incurring substantial training overhead. To bridge this gap, we propose ALSO (Adversarial onLine Strategy Optimization), the first framework for online strategy optimization in multi-agent social simulation. ALSO advances social adaptation through two key contributions. (1) ALSO formulates multi-turn interaction as an adversarial bandit problem, where combinations of static personas and dynamic strategy instructions are treated as arms, providing a principled solution to non-stationarity without relying on environmental stability assumptions. (2) To generalize sparse feedback in multi-turn dialogues, ALSO introduces a lightweight neural surrogate that predicts rewards from interaction histories, enabling sample-efficient exploration and continuous online adaptation. Experiments on the Sotopia benchmark demonstrate that ALSO consistently outperforms static baselines and existing optimization methods in dynamic environments, validating the effectiveness of adversarial online strategy optimization for building robust social agents.
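
An adversarial bandit over (persona, strategy instruction) arms can be illustrated with EXP3, a standard adversarial-bandit algorithm. The abstract does not say which bandit algorithm ALSO uses, and the neural reward surrogate is omitted here, so the class below, its arm labels, and the toy reward are all assumptions made for illustration.

```python
import math
import random


class Exp3:
    """EXP3 for rewards in [0, 1]; makes no stationarity assumption."""

    def __init__(self, n_arms, gamma=0.1, seed=0):
        self.n_arms = n_arms
        self.gamma = gamma              # exploration rate
        self.weights = [1.0] * n_arms
        self.rng = random.Random(seed)

    def probabilities(self):
        """Mix the weight-proportional distribution with uniform exploration."""
        total = sum(self.weights)
        return [(1.0 - self.gamma) * w / total + self.gamma / self.n_arms
                for w in self.weights]

    def select(self):
        """Sample an arm and return it with its selection probability."""
        probs = self.probabilities()
        arm = self.rng.choices(range(self.n_arms), weights=probs, k=1)[0]
        return arm, probs[arm]

    def update(self, arm, prob, reward):
        """Importance-weighted update for the arm that was played."""
        estimate = reward / prob
        self.weights[arm] *= math.exp(self.gamma * estimate / self.n_arms)
        top = max(self.weights)         # rescale to avoid overflow
        self.weights = [w / top for w in self.weights]


# Hypothetical arms: each pairs a static persona with a strategy instruction.
arms = [("persona A", "concede early"),
        ("persona A", "hold firm"),
        ("persona B", "hold firm")]
bandit = Exp3(n_arms=len(arms))
for _ in range(500):
    arm, prob = bandit.select()
    reward = 1.0 if arm == 2 else 0.0   # toy reward: only the last arm pays
    bandit.update(arm, prob, reward)
final_probs = bandit.probabilities()
```

In this toy run the selection probability concentrates on the rewarded arm while the uniform mixing term keeps every arm reachable, which is what lets the learner react if the best arm changes later.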

2605.15518 2026-05-20 cs.CL

DetectRL-X: Towards Reliable Multilingual and Real-World LLM-Generated Text Detection

DetectRL-X: 向可靠多语言和真实世界LLM生成文本检测迈进

Junchao Wu, Yefeng Liu, Chenyu Zhu, Hao Zhang, Zeyu Wu, Tianqi Shi, Yichao Du, Longyue Wang, Weihua Luo, Jinsong Su, Derek F. Wong

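Affiliations*: NLP2CT Lab, Department of Computer and Information Science, University of Macau; Alibaba Group; Xiamen University

AI summary: This paper introduces DetectRL-X, a comprehensive multilingual benchmark for evaluating advanced detectors across 8 dimensions. It covers 8 languages commonly used in commercial contexts and human-written texts from 6 domains susceptible to LLM misuse, pairs them with texts generated by 4 popular commercial LLMs and with AI-assisted writing operations such as polishing, expanding, and condensing, and analyzes how language, domain, generator, attack strategy, text length, and refinement operation affect detector performance, with the aim of strengthening multilingual and language-specific detectors.

Comments ACL 2026 Main. Code and data are available at https://github.com/AIDC-AI/Marco-LLM/tree/main/DetectRL-X

Details
AI Chinese abstract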

由于LLM生成内容的滥用风险日益增加,有效检测和治理LLM生成内容变得越来越关键。尽管现有检测器性能优异,但其在多语言和真实世界场景中的可靠性和潜力仍鲜有研究。本文介绍DetectRL-X,一个综合性的多语言基准,用于评估先进检测器在8个维度上的性能。该基准涵盖8种在商业中常用的语言,并收集了6种易受LLM滥用的领域的人类文本。为更好地匹配真实世界应用,我们使用4种流行的商业LLM生成LLM生成文本,并包含润色、扩展和缩写等典型AI辅助写作操作,以捕捉真实的使用模式。此外,我们开发了多语言框架用于改写和扰动攻击,以模拟多样化的人类修改和写作噪声,从而在不同语言上对检测器进行压力测试。在DetectRL-X上的实验结果揭示了当前最先进的检测器在不同语言资源上的优势和局限性。我们进一步分析了领域、生成器、攻击策略、文本长度和润色操作如何影响不同语言中的性能,突显DetectRL-X作为强化多语言和语言特定检测器的有效基准。

English abstract

The effective detection and governance of Large Language Model (LLM) generated content has become increasingly critical due to the growing risk of misuse. Despite the impressive performance of existing detectors, their reliability and potential in multilingual, real-world scenarios remain largely underexplored. In this study, we introduce DetectRL-X, a comprehensive multilingual benchmark designed to evaluate advanced detectors across 8 dimensions. The benchmark encompasses 8 languages commonly used in commercial contexts and collects human-written texts from 6 domains highly susceptible to LLM misuse. To better align with real-world applications, we create LLM-generated texts using 4 popular commercial LLMs, and include typical AI-assisted writing operations such as polishing, expanding, and condensing to capture authentic usage patterns. Furthermore, we develop a multilingual framework for paraphrasing and perturbation attacks to simulate diverse human modifications and writing noise, enabling stress testing of detectors across languages. Experimental results on DetectRL-X reveal the strengths and limitations of current state-of-the-art detectors when applied to diverse linguistic resources. We further analyze how domains, generators, attack strategies, text length, and refinement operations influence performance in different languages, underscoring DetectRL-X as an effective benchmark for strengthening multilingual and language-specific detectors.

2605.15497 2026-05-20 cs.CV cs.GR

AnyAct: Towards Human Reenactment of Character Motion From Video

AnyAct: 向视频中非人类角色动作的重新演绎迈进

Liuhan Chen, Lei Zhong, Jiewei Wang, Qin Shuai, Li Yuan, Leidong Fan, Qing Li, Kanglin Liu

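Affiliations*: Peking University; Nankai University; The University of Hong Kong; Zhejiang University; Pengcheng Laboratory

AI summary: This paper studies how to derive an initial human reenactment directly from a monocular video of a non-human character, reinterpreting the character's motion as an editable human performance for downstream animation authoring. Its core idea is that sparse local articulated motion cues preserve essential dynamics across large structural differences; the proposed AnyAct model performs conditional human motion generation from transferable sparse local 2D articulated motion.

Comments 12 pages

Details
AI Chinese abstract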

我们研究了从非人类角色的单目视频中直接推导出初始人类重新演绎的问题。我们的目标不是重建源角色本身,而是将它的动作重新诠释为一个合理且可编辑的人类表演,以供后续动画创作使用。这一任务具有挑战性,因为现有的基于视频的动作捕捉方法大多局限于以人类为中心的结构空间,而动作重定向方法通常需要结构化的3D源动作和已知的源拓扑。我们的关键见解是稀疏局部关节运动线索可以在较大的结构差异下保持本质动态,为角色视频到人类重新演绎提供稳定的桥梁。基于这一观察,我们提出了AnyAct,将角色视频驱动的人类重新演绎公式化为从可转移的稀疏局部2D关节运动中生成的条件人类运动。为了使这一方法实用,我们引入了三个关键设计:通过增强的3D到2D投影进行的人类运动-only监督、渐进的3D到2D训练以缓解条件模糊性,以及全局-局部运动解耦以实现可靠的局部运动控制。我们进一步构建了一个主要涵盖多样化非人类角色视频的基准。在该基准上的实验表明,AnyAct能够生成高保真的初始人类重新演绎,这些重新演绎保留了参考视频中角色的本质动态,进一步的消融研究验证了其核心设计的有效性。

English abstract

We study the problem of directly deriving an initial human reenactment from a monocular video of a non-human character. Our goal is not to reconstruct the source character itself but to reinterpret its motion as a plausible and editable human performance for downstream animation authoring. This task is challenging because existing video-based motion capture methods are largely restricted to human-centric structural spaces, while motion retargeting methods typically require structured 3D source motions and known source topologies. Our key insight is that sparse local articulated motion cues can preserve essential dynamics across large structural differences, providing a stable bridge from character video to human reenactment. Based on this observation, we propose AnyAct, which formulates character-video-driven human reenactment as conditional human motion generation from transferable sparse local 2D articulated motion. To make this practical, we introduce three key designs: human-motion-only supervision via augmented 3D-to-2D projection, progressive 3D-to-2D training to alleviate conditioning ambiguity, and global-local motion decoupling for reliable local motion control. We further construct a benchmark primarily covering diverse non-human character videos. Experiments on the benchmark show that AnyAct produces high-fidelity initial human reenactments that preserve the essential dynamics of the characters in reference videos, and further ablation studies validate the effectiveness of its core designs.

2605.15336 2026-05-20 cs.RO cs.AI

HoloMotion-1 Technical Report

HoloMotion-1 技术报告

Maiyue Chen, Kaihui Wang, Bo Zhang, Xihan Ma, Zhiyuan Yang, Yi Ren, Qijun Huang, Zihao Zhu, Yucheng Wang, Zhizhong Su

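Affiliations*: Horizon Robotics

AI summary: This paper presents HoloMotion-1, a humanoid motion foundation model for zero-shot whole-body motion tracking. Training the control policy on a large-scale hybrid motion corpus broadens the diversity of motion behaviors, improves tracking accuracy, and yields robust generalization across motion types and capture conditions.

Comments 20 pages, 4 figures, 6 tables. Technical report

Details
AI Chinese abstract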

在本报告中,我们介绍了HoloMotion-1,一种用于零样本全身运动追踪的人形运动基础模型。HoloMotion-1的关键创新在于利用大规模混合运动语料库进行控制策略训练,其中来自真实视频重建的运动提供了运动多样性的主要来源,而经过精心挑选的运动捕捉数据和内部运动数据则提供了更高保真度的监督和面向部署的覆盖范围。这种数据模式使HoloMotion-1超越了传统仅依赖运动捕捉的训练,并使策略能够接触更广泛的行为、捕捉条件和运动风格。从这种异构数据中学习引入了新的挑战,包括重建噪声、源域不匹配、运动质量不均以及在大行为变化下的时间建模需求。为了解决这些挑战,HoloMotion-1集成了大容量时间建模、具有稀疏激活的专家混合变压器以及KV缓存推理用于实时控制,并采用序列级训练策略,提高了在扩展运动序列上的学习效率。在多个未见过的运动基准测试中,HoloMotion-1在多样化的运动类型和捕捉条件下表现出鲁棒的泛化能力,显著提高了跟踪精度,且能够直接转移到真实的人形机器人上,无需特定任务的微调。

English abstract

In this report, we present HoloMotion-1, a humanoid motion foundation model for zero-shot whole-body motion tracking. A key innovation of HoloMotion-1 is to scale control-policy training with a large-scale hybrid motion corpus, where video-reconstructed motions from in-the-wild videos provide the dominant source of motion diversity, while curated motion-capture and in-house motion data provide higher-fidelity supervision and deployment-oriented coverage. This data regime enables HoloMotion-1 to move beyond conventional MoCap-only training and exposes the policy to substantially broader behaviors, capture conditions, and motion styles. Learning from such heterogeneous data introduces new challenges, including reconstruction noise, source-domain mismatch, uneven motion quality, and the need for temporal modeling under large behavioral variation. To address these challenges, HoloMotion-1 integrates large-capacity temporal modeling, a sparsely activated Mixture-of-Experts Transformer with KV-cache inference for real-time control, and a sequence-level training strategy that improves learning efficiency on extended motion sequences. Extensive experiments on multiple unseen motion benchmarks show that HoloMotion-1 generalizes robustly across diverse motion types and capture conditions, significantly improves tracking accuracy over prior methods, and transfers directly to a real humanoid robot without task-specific fine-tuning.

2605.15186 2026-05-20 cs.CV cs.AI

VGGT-Edit: Feed-forward Native 3D Scene Editing with Residual Field Prediction

VGGT-Edit:基于残差场预测的前馈原生3D场景编辑

Kaixin Zhu, Yiwen Tang, Yifan Yang, Renrui Zhang, Bohan Zeng, Ziyu Guo, Ruichuan An, Zhou Liu, Qizhi Chen, Delin Qu, Jaehong Yoon, Wentao Zhang

Affiliations*: Peking University; Tencent; The Chinese University of Hong Kong; Shanghai AI Lab; NTU Singapore; Zhongguancun Academy; Beijing Key Lab of Data Intel. & Security (PKU)

AI summary: This paper proposes VGGT-Edit, a feed-forward framework for text-conditioned native 3D scene editing that introduces depth-synchronized text injection and a residual transformation head, and constructs the large-scale DeltaScene dataset through an automated pipeline with 3D agreement filtering.

Details
AI Chinese abstract

高质量的3D场景重建近年来已发展为通用的前馈架构,使单次正向传递即可生成复杂的环境。然而,尽管这些模型在静态场景感知方面表现强劲,但它们在响应动态人类指令方面仍然有限,限制了其在交互应用中的使用。现有的编辑方法通常依赖于2D提升策略,即单独编辑每个视图,然后将其提升回3D空间。这种间接流程往往导致模糊的纹理和不一致的几何结构,因为2D编辑器缺乏保持跨视角结构的空间意识。为了解决这些限制,我们提出了VGGT-Edit,一种用于文本条件的前馈框架,用于原生3D场景编辑。VGGT-Edit引入了深度同步的文本注入,以对齐语义指导与骨干网络的空间姿态,确保稳定的指令接地。此语义信号随后由残差变换头处理,直接预测3D几何位移以变形场景,同时保持背景稳定性。为了确保高保真结果,我们通过多术语目标函数监督该框架,强制几何准确性和跨视图一致性。我们还构建了DeltaScene数据集,一个通过自动化流程生成的大规模数据集,通过3D一致过滤确保地面真实质量。实验表明,VGGT-Edit在2D提升基线中表现显著更好,生成更清晰的物体细节,更强的多视图一致性以及接近即时的推理速度。项目页面是https://chriszkxxx.github.io/VGGT-Edit/.

English abstract

High-quality 3D scene reconstruction has recently advanced toward generalizable feed-forward architectures, enabling the generation of complex environments in a single forward pass. However, despite their strong performance in static scene perception, these models remain limited in responding to dynamic human instructions, which restricts their use in interactive applications. Existing editing methods typically rely on a 2D-lifting strategy, where individual views are edited independently and then lifted back into 3D space. This indirect pipeline often leads to blurry textures and inconsistent geometry, as 2D editors lack the spatial awareness required to preserve structure across viewpoints. To address these limitations, we propose VGGT-Edit, a feed-forward framework for text-conditioned native 3D scene editing. VGGT-Edit introduces depth-synchronized text injection to align semantic guidance with the backbone's spatial poses, ensuring stable instruction grounding. This semantic signal is then processed by a residual transformation head, which directly predicts 3D geometric displacements to deform the scene while preserving background stability. To ensure high-fidelity results, we supervise the framework with a multi-term objective function that enforces geometric accuracy and cross-view consistency. We also construct the DeltaScene Dataset, a large-scale dataset generated through an automated pipeline with 3D agreement filtering to ensure ground-truth quality. Experiments show that VGGT-Edit substantially outperforms 2D-lifting baselines, producing sharper object details, stronger multi-view consistency, and near-instant inference speed. The project page is https://chriszkxxx.github.io/VGGT-Edit/.

2605.15113 2026-05-20 cs.LG

Learning from Language Feedback via Variational Policy Distillation

通过变分策略蒸馏学习语言反馈

Yang Li, Erik Nijkamp, Semih Yavuz, Shafiq Joty

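Affiliations*: Salesforce AI Research

AI summary: This paper proposes Variational Policy Distillation (VPD), which formalizes learning from language feedback as a variational expectation-maximization problem to address the plateau in teacher capability that limits conventional self-distillation, and outperforms standard RLVR and existing self-distillation baselines on scientific reasoning and code generation tasks.

Details
AI Chinese abstract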

可验证奖励的强化学习(RLVR)面临稀疏结果信号的问题,导致复杂推理任务的探索瓶颈。最近的在线自蒸馏方法尝试通过利用语言反馈生成密集的token级监督来解决这一问题。然而,这些方法依赖于固定且被动的教师来解读反馈。随着学生策略的改进,教师的零样本评估能力趋于停滞,最终阻碍进一步学习。为克服这一问题,我们提出变分策略蒸馏(VPD),一个将学习语言反馈形式化为变分期望最大化(EM)问题的框架。VPD共同进化两种策略:在E步中,教师通过自适应信任区域更新在轨迹结果上主动优化,将文本反馈转化为动态改进的目标token分布。在M步中,学生内部化其在自身在线滚动中所获得的密集分布指导。通过持续提升教师从文本批评中提取可操作信号的能力,VPD克服了传统自蒸馏方法的局限性。在多样化的诊断反馈源上评估,VPD在科学推理和代码生成任务中持续优于标准RLVR和现有自蒸馏基线。最后,通过在刚性数学推理和冷启动场景中压力测试我们的框架,我们揭示了反馈驱动自蒸馏与纯环境驱动RL之间的基本界限。

English abstract

Reinforcement learning from verifiable rewards (RLVR) suffers from sparse outcome signals, creating severe exploration bottlenecks on complex reasoning tasks. Recent on-policy self-distillation methods attempt to address this by utilizing language feedback to generate dense, token-level supervision. However, these approaches rely on a fixed, passive teacher to interpret the feedback. As the student policy improves, the teacher's zero-shot assessment capabilities plateau, ultimately halting further learning. To overcome this, we propose Variational Policy Distillation (VPD), a framework that formalizes learning from language feedback as a Variational Expectation-Maximization (EM) problem. VPD co-evolves both policies: in the E-step, the teacher is actively refined on trajectory outcomes via an adaptive trust-region update, translating textual feedback into a dynamically improved target token distribution. In the M-step, the student internalizes this dense distributional guidance on its own on-policy rollouts. By continuously improving the teacher's ability to extract actionable signals from textual critique, VPD overcomes the limitations of passive distillation. Evaluated across diverse sources of diagnostic feedback on scientific reasoning and code generation tasks, VPD consistently outperforms both standard RLVR and existing self-distillation baselines. Finally, by stress-testing our framework on rigid mathematical reasoning and cold-start regimes, we illuminate the fundamental bounds of feedback-driven self-distillation compared to pure environment-driven RL.

2605.14678 2026-05-20 cs.AI

$π$-Bench: Evaluating Proactive Personal Assistant Agents in Long-Horizon Workflows

$π$-Bench:评估长周期工作流中主动型个人助理代理

Haoran Zhang, Luxin Xu, Zhilin Wang, Runquan Gui, Shunkai Zhang, Haodi Lei, Zihao He, Bingsu He, Chicheng Qin, Tong Zhu, Xiaoye Qu, Yang Yang, Yu Cheng, Yafu Li

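Affiliations*: Shanghai Jiao Tong University; Shanghai AI Laboratory; Fudan University; University of Science and Technology of China; Peking University; Nanjing University; Zhejiang University; Tongji University; Soochow University; The Chinese University of Hong Kong

AI summary: This paper introduces $π$-Bench, a benchmark for evaluating the proactive assistance of personal assistant agents in long-horizon workflows. Using 100 multi-turn tasks across 5 domain-specific user personas, it tests whether agents can identify and act on hidden intents before users state them, and it highlights both the difficulty of proactive assistance and the value of prior interaction for later tasks.

Comments 44 pages

Details
AI Chinese abstract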

随着个人助理代理(如OpenClaw)的兴起,大型语言模型在日常和工作场景中支持用户的能力日益凸显。在这些场景中,主动协助是一个核心挑战,因为用户往往开始时请求不明确,留下重要的需求、约束或偏好未被陈述。然而,现有基准很少评估代理是否能在用户明确表达之前识别并执行此类隐藏意图,尤其是在持续的多轮交互中,用户需求逐渐显现。为填补这一空白,我们引入$π$-Bench,一个包含100个多轮任务和5种特定领域用户角色的主动协助基准。通过整合隐藏用户意图、任务间依赖性和跨会话连续性,$π$-Bench评估代理在延长交互中预见和解决用户需求的能力,共同衡量长周期轨迹中的主动性和任务完成度。实验表明(1)主动协助仍然具有挑战性,(2)任务完成与主动性存在明显区别,(3)前期交互对后续任务中主动意图解析的价值。

English abstract

The rise of personal assistant agents, e.g., OpenClaw, highlights the growing potential of large language models to support users across everyday life and work. A core challenge in these settings is proactive assistance, since users often begin with underspecified requests and leave important needs, constraints, or preferences unstated. However, existing benchmarks rarely evaluate whether agents can identify and act on such hidden intents before they are explicitly stated, especially in sustained multi-turn interactions where user needs emerge gradually. To address this gap, we introduce $π$-Bench, a benchmark for proactive assistance comprising 100 multi-turn tasks across 5 domain-specific user personas. By incorporating hidden user intents, inter-task dependencies, and cross-session continuity, $π$-Bench evaluates agents' ability to anticipate and address user needs over extended interactions, jointly measuring proactivity and task completion in long-horizon trajectories that better reflect real-world use. Experiments show that (1) proactive assistance remains challenging, (2) task completion and proactivity are clearly distinct, and (3) prior interaction is valuable for proactive intent resolution in later tasks.

2605.14530 2026-05-20 cs.CV

Mitigating Mask Prior Drift and Positional Attention Collapse in Large Diffusion Vision-Language Models

缓解大扩散视觉-语言模型中的遮蔽先验漂移和位置注意力崩溃

Sujung Hong, Chanyong Yoon, Seong Jae Hwang

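Affiliations*: Department of Artificial Intelligence, Yonsei University, Seoul, Republic of Korea

AI summary: This paper studies repetitive generation and degraded visual grounding in large diffusion vision-language models under long-form generation, and proposes a training-free approach that mitigates mask prior drift and positional attention collapse.

Details
AI Chinese abstract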

大扩散视觉-语言模型(LDVLMs)最近作为一种有前途的替代自回归模型出现,能够实现高效的并行解码,并利用双向注意力获取全局上下文。尽管有这些进展,其在长形式生成中的行为仍然缺乏深入研究。在本文中,我们发现现有的LDVLMs存在重复生成和退化的视觉 grounding, 并识别出两个根本原因。首先,重复生成源于遮蔽标记先验:由于生成标记被初始化为遮蔽标记,其隐藏表示在生成步骤中逐渐漂向共享的先验方向。其次,位置注意力偏置与迭代解屏蔽过程之间的基本不匹配会抑制对信息性视觉标记的注意力,从而降低视觉 grounding。基于这些见解,我们提出了一种无需训练的方法,引入遮蔽先验抑制和单调RoPE缩放来缓解解码过程中的遮蔽先验漂移和位置注意力崩溃。在通用多模态基准和视觉 grounding 任务上的实验表明,与基线LDVLMs相比有所改进,特别是在长形式描述基准上表现稳健。我们的结果表明,这些失败可以通过一种轻量级、即插即用的策略有效解决,该策略不需要额外训练,并且在多种LDVLM架构上具有泛化能力。

English abstract

Large diffusion vision-language models (LDVLMs) have recently emerged as a promising alternative to autoregressive models, enabling parallel decoding for efficient inference and leveraging bidirectional attention for global context. Despite these advances, their behavior under long-form generation remains underexplored. In this work, we show that existing LDVLMs suffer from repetitive generation and degraded visual grounding, and identify two underlying causes. First, repetitive generation originates from a mask token prior: since generation tokens are initialized as mask tokens, their hidden representations progressively drift toward a shared prior direction over generation steps. Second, a fundamental misalignment between the positional attention bias and the iterative unmasking process suppresses attention toward informative visual tokens, degrading visual grounding. Based on these insights, we propose a training-free approach, introducing Mask Prior Suppression and Monotonic RoPE Scaling to mitigate mask prior drift and positional attention collapse during decoding. Experiments on general multimodal benchmarks and visual grounding tasks demonstrate improvements over baseline LDVLMs, with robust gains on long-form description benchmarks. Our results show that these failures can be effectively addressed with a lightweight, plug-and-play strategy that requires no additional training and generalizes across diverse LDVLM architectures.

2605.14102 2026-05-20 cs.AI

ChromaFlow: A Negative Ablation Study of Orchestration Overhead in Tool-Augmented Agent Evaluation

ChromaFlow: 一种关于在工具增强代理评估中编排开销的负消融研究

Tarun Mittal

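Affiliations*: Octave-X

AI summary: Using the ChromaFlow framework, this study analyzes orchestration overhead in tool-augmented autonomous reasoning. It finds that more aggressive orchestration did not improve full-set performance and instead increased operational noise, and it argues that bounded planner escalation, deterministic extraction, evidence reconciliation, provider-health gates, and explicit run gates are first-order requirements for reliable autonomous agent evaluation.

Comments 12 pages, 6 tables, 1 figure. Updated with follow-up strict-provider full-Level-1 diagnostic

Details
AI Chinese abstract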

自主语言模型代理越来越多地结合规划、工具使用、文档处理、浏览、代码执行和验证循环。这些能力使代理系统更加有用,但同时也引入了无法仅通过最终准确性来观察的操作失败模式。本报告介绍了ChromaFlow,一种围绕规划引导执行、专门化工具使用和 telemetry 驱动评估构建的工具增强自主推理框架。我们分析了ChromaFlow在GAIA 2023 Level-1验证任务下的清洁评估约束。一个冻结的完整Level-1基线实现了29/53正确的答案,或54.72%。后来的恢复配置通过扩展编排实现了27/53正确的答案,或50.94%,同时增加了回溯、超时事件、工具失败提及、令牌日志调用和战役日志成本估计。两个随机化的20任务烟雾评估产生了12/20和11/20正确的答案,表明小规模诊断增益在样本间不稳定。因此,中心结果是负消融:更激进的编排并未提高整体性能,反而增加了操作噪声。后来的严格提供者全Level-1诊断在显式完整性控制下达到了30/53,或56.60%,但显着提高了令牌日志成本。报告认为,受控编排升级、确定性提取、证据协调、提供者健康门控和显式运行门控应被视为可靠自主代理评估的第一要求。

English abstract

Autonomous language-model agents increasingly combine planning, tool use, document processing, browsing, code execution, and verification loops. These capabilities make agent systems more useful, but they also introduce operational failure modes that are not visible from final accuracy alone. This report presents ChromaFlow, a tool-augmented autonomous reasoning framework built around planner-directed execution, specialized tool use, and telemetry-driven evaluation. We analyze ChromaFlow on GAIA 2023 Level-1 validation tasks under clean evaluation constraints. A frozen full Level-1 baseline achieved 29/53 correct answers, or 54.72%. A later recovery configuration with expanded orchestration achieved 27/53 correct answers, or 50.94%, while increasing tracebacks, timeout events, tool-failure mentions, token-log calls, and campaign-log cost estimates. Two randomized 20-task smoke evaluations produced 12/20 and 11/20 correct answers, showing that small diagnostic gains can be unstable across samples. The central result is therefore a negative ablation: more aggressive orchestration did not improve full-set performance and increased operational noise. A later strict-provider full-Level-1 diagnostic reached 30/53, or 56.60%, under explicit integrity controls, but at substantially higher token-log cost. The report argues that bounded planner escalation, deterministic extraction, evidence reconciliation, provider-health gates, and explicit run gates should be treated as first-order requirements for reliable autonomous agent evaluation.
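
The percentages quoted in the abstract follow directly from the correct/total counts, which the short check below reproduces. Only the counts come from the abstract; the dictionary keys and the helper name `accuracy_percent` are labels chosen here for illustration.

```python
def accuracy_percent(correct, total):
    """Accuracy as a percentage rounded to two decimals."""
    return round(100.0 * correct / total, 2)


# Counts as reported in the abstract (GAIA 2023 Level-1 validation, 53 tasks).
runs = {
    "frozen baseline": (29, 53),
    "expanded orchestration": (27, 53),
    "strict-provider diagnostic": (30, 53),
}
reported = {name: accuracy_percent(c, t) for name, (c, t) in runs.items()}

# Change from the baseline to the expanded-orchestration run, in points.
delta = round(reported["expanded orchestration"] - reported["frozen baseline"], 2)

# The two randomized 20-task smoke evaluations.
smoke = [accuracy_percent(12, 20), accuracy_percent(11, 20)]
```

The expanded-orchestration run sits two tasks (about 3.8 points) below the baseline, and the two smoke runs differ by a single task out of 20, which is the instability across small samples that the abstract warns about.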

2605.14063 2026-05-20 cs.LG

Reliability-Gated Source Anchoring for Continual Test-Time Adaptation

Reliability-gated source anchoring for continual test-time adaptation

Vikash Singh, Debargha Ganguly, Weicong Chen, Sabyasachi Sahoo, Sreehari Sankar, Biyao Zhang, Mohsen Hariri, Shouren Wang, Osama Zafar, Christian Gagné, Vipin Chaudhary

Institutions * Case Western Reserve University; Université Laval; Mila - Québec AI Institute

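AI summary This study proposes RMemSafe, a reliability-gated source-anchoring method for continual test-time adaptation (CTTA). It uses the normalized predictive entropy of the frozen source to attenuate every explicit source-coupled term in the objective, so that the source anchor and agreement filter switch off automatically as source reliability degrades, improving performance on continual-corruption tasks.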

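AI Chinese abstract (English translation)

Continual test-time adaptation (CTTA) updates a pretrained model online while anchoring it to a frozen source checkpoint. This anchoring breaks down, however, when the source becomes unreliable. On CCC-Hard, the top-1 accuracy of a ResNet-50 source drops to about 1.3%, yet existing source-anchored CTTA methods keep applying the same anchor strength. This paper proposes RMemSafe, a reliability-gated extension of ROID that uses the normalized predictive entropy of the frozen source to attenuate every explicit source-coupled term in the objective. When the source posterior approaches the uniform distribution, the gate closes: the source anchor and the agreement filter vanish, and the objective reduces to a source-agnostic fallback consisting of ROID's base losses plus marginal calibration. Combined with ASR, RMemSafe achieves the lowest error on 8 of 9 matched-split continual-corruption cells and is the best reset-based method on all 9, improving on ROID+ASR by 1.05 percentage points on ResNet-50 and 0.48 percentage points on ViT-B/16. A controlled source-degradation sweep shows a harm slope 1.13 times shallower than that of ROID+ASR, consistent with the graceful-decay prediction. The entropy gate detects high-entropy source collapse, not confidently wrong low-entropy sources; this scope is explicitly evaluated and discussed.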

English abstract

Continual test-time adaptation (CTTA) updates a pretrained model online on an unlabeled, non-stationary stream while anchoring it to a frozen source checkpoint. This anchor is useful only when the source remains reliable. On CCC-Hard, however, a ResNet-50 source falls to approximately $1.3\%$ top-$1$ accuracy, while existing source-anchored CTTA methods continue applying the same anchor strength. We call this failure mode blind anchoring and propose RMemSafe, a reliability-gated extension of ROID that uses the frozen source's normalized predictive entropy to attenuate all explicit source-coupled uses in the objective. When the source posterior approaches uniformity, the gate closes: the source anchor and agreement filter vanish, and the objective reduces to a source-agnostic fallback comprising ROID's base losses plus marginal calibration. Combined with ASR, RMemSafe achieves the lowest error on $8$ of $9$ matched-split continual-corruption cells and is the best reset-based method on all $9$, improving ROID+ASR by $1.05$~pp on ResNet-50 and $0.48$~pp on ViT-B/16. A controlled source-degradation sweep shows a $1.13{\times}$ shallower harm slope than ROID+ASR, consistent with the graceful-decay prediction. The entropy gate detects high-entropy source collapse, not confidently wrong low-entropy sources; this scope is explicitly evaluated and discussed.
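
The gating idea described above can be sketched in a few lines. This is a minimal illustration, not the RMemSafe implementation: the linear gate `1 - H/log K`, the function names, and the scalar loss values are all assumptions made here, since the abstract does not state the exact gate form.

```python
import math

def normalized_entropy(probs):
    """Shannon entropy of a categorical distribution, scaled to [0, 1]."""
    k = len(probs)
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(k)

def source_gate(probs):
    """Hypothetical gate: 1 for a one-hot source posterior, 0 for a uniform one."""
    return 1.0 - normalized_entropy(probs)

def gated_objective(base_loss, anchor_loss, source_probs):
    """Base loss plus a source-anchor term attenuated by the gate (assumed form)."""
    return base_loss + source_gate(source_probs) * anchor_loss

uniform = [0.25, 0.25, 0.25, 0.25]          # collapsed, uninformative source
confident = [0.97, 0.01, 0.01, 0.01]        # low entropy, possibly still wrong

print(source_gate(uniform))                 # gate is (numerically) closed
print(source_gate(confident))               # gate stays mostly open
print(gated_objective(1.0, 5.0, uniform))   # reduces to roughly the base loss
```

The second case illustrates the scope limit the abstract names: an entropy-based gate stays open for a low-entropy source whether or not its confident prediction is correct.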

2605.13652 2026-05-20 cs.LG cs.AI cs.CL

Beyond Perplexity: A Geometric and Spectral Study of Low-Rank Pre-Training

Beyond perplexity: a geometric and spectral study of low-rank pre-training

Namrata Shivagunde, Vijeta Deshpande, Sherin Muckatira, Anna Rumshisky

Institutions * University of Massachusetts Lowell

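AI summary This paper studies low-rank pre-training methods through geometric and spectral analysis, showing how their solutions differ from those of full-rank training. It finds that low-rank methods behave differently across model scales and that perplexity does not fully reflect downstream task performance.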

Comments 9 pages, 5 figures, 2 tables

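AI Chinese abstract (English translation)

Pre-training large language models is constrained mainly by the memory cost of storing full-rank weights, gradients, and optimizer states. Low-rank pre-training emerged to address this, and the space of methods has expanded quickly. One central question remains open: do low-rank methods yield models that generalize as well as full-rank training, or does the rank constraint fundamentally change the solutions reached? Existing comparisons rely almost entirely on validation perplexity from single-seed runs, often carried over from earlier literature. Perplexity, however, is a poor proxy for solution quality: two methods can match in perplexity yet converge to different regions of the loss landscape and to different internal representations. We close this gap by characterizing the solutions of five low-rank pre-training methods, namely GaLore and Fira (memory-efficient optimizers), CoLA and SLTrain (architecture reparameterizations), and ReLoRA (adapter-style updates with periodic resets), against full-rank training at three model scales (60M, 130M, 350M). We evaluate each method on 16 metrics across four dimensions: the 1-D loss landscape along random and top-K PCA directions, 1-D interpolation between checkpoints, the spectral structure of the weights and learned updates, and activation similarity to full-rank training. We show that low-rank methods are equivalent neither to full-rank training nor to one another, even when their validation perplexities are close. Full-rank training settles into a sharper basin than low-rank methods along random directions, while the reverse holds along the top-1 PCA direction. Each method converges to a geometrically distinct basin. As training progresses, low-rank activations diverge from full-rank activations in later layers, with GaLore tracking full-rank most closely. Furthermore, validation perplexity does not translate into downstream performance at every scale; adding geometric and spectral metrics improves the prediction.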

English abstract

Pre-training large language models is dominated by the memory cost of storing full-rank weights, gradients, and optimizer states. Low-rank pre-training has emerged to address this, and the space of methods has grown rapidly. A central question remains open: do low-rank methods produce models that generalize comparably to full-rank training, or does the rank constraint fundamentally alter the solutions reached? Existing comparisons rely almost entirely on validation perplexity from single-seed runs, often carried forward from prior literature. Yet perplexity is a poor proxy for solution quality; two methods can match on perplexity while converging to different loss landscape regions and internal representations. We close this gap by characterizing the solutions found by five low-rank pre-training methods, GaLore and Fira (memory-efficient optimizers), CoLA and SLTrain (architecture reparameterizations), and ReLoRA (adapter-style updates with periodic resets), against full-rank training at three model scales (60M, 130M, 350M). We evaluate each along 16 metrics across four dimensions: 1-D loss landscape along random/top-K PCA directions, 1-D interpolation between checkpoints, spectral structure of the weights and learned updates, and activation similarity to full-rank training. We show that low-rank methods are not equivalent to full-rank training, nor to one another, even when validation perplexity is close. Full-rank training settles into a sharper basin than low-rank methods along random directions, while the reverse holds for the top-1 PCA direction. Each method converges to a geometrically distinct basin. Low-rank activations diverge from full-rank in later layers as training progresses, with GaLore tracking full-rank most closely. Further, validation perplexity does not translate to downstream performance at every scale. Adding geometric and spectral metrics improves the prediction.
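
One of the listed diagnostics, the 1-D loss landscape along a random direction, can be illustrated on a toy problem. The sketch below is an assumption-laden example: the two quadratic losses are hypothetical stand-ins for trained models, the direction is simply normalized to unit length, and nothing here reproduces the paper's actual normalization or metrics.

```python
import math
import random

def loss_slice(loss_fn, theta, direction, alphas):
    """Evaluate loss_fn at theta + alpha * d for each alpha, with d of unit norm."""
    norm = math.sqrt(sum(d * d for d in direction))
    unit = [d / norm for d in direction]
    return [loss_fn([t + a * u for t, u in zip(theta, unit)]) for a in alphas]

def curvature_at_zero(loss_fn, theta, direction, eps=1e-3):
    """Central finite-difference second derivative of the slice at alpha = 0."""
    l_minus, l_zero, l_plus = loss_slice(loss_fn, theta, direction, [-eps, 0.0, eps])
    return (l_plus - 2.0 * l_zero + l_minus) / (eps * eps)

# Hypothetical toy losses: a "sharp" and a "flat" isotropic quadratic basin.
def sharp_loss(w):
    return 0.5 * 10.0 * sum(x * x for x in w)

def flat_loss(w):
    return 0.5 * 1.0 * sum(x * x for x in w)

rng = random.Random(0)
theta = [0.0] * 8                               # minimizer of both toy losses
direction = [rng.gauss(0.0, 1.0) for _ in theta]

print(loss_slice(sharp_loss, theta, direction, [-1.0, -0.5, 0.0, 0.5, 1.0]))
print(curvature_at_zero(sharp_loss, theta, direction))
print(curvature_at_zero(flat_loss, theta, direction))
```

A larger curvature along the sampled direction corresponds to a sharper basin in that direction; comparing such values across methods is the kind of evidence the abstract summarizes.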

2605.13646 2026-05-20 cs.RO cs.AI

Causality-Aware End-to-End Autonomous Driving via Ego-Centric Joint Scene Modeling

Causality-aware end-to-end autonomous driving through ego-centric joint scene modeling

Seokha Moon, Minseung Lee, Joon Seo, Jinkyu Kim, Jungbeom Lee

Institutions * Korea University; Kakao Mobility

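AI summary This paper proposes CaAD, a framework that captures causal dependencies between the ego vehicle and surrounding agents in a shared latent scene representation in order to improve closed-loop planning in end-to-end autonomous driving.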

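AI Chinese abstract (English translation)

End-to-end autonomous driving, which bypasses the traditional modular pipeline by predicting future trajectories directly from sensor inputs, has made substantial progress in recent years. Existing methods, however, often overlook the causal dependencies in ego-vehicle planning and ignore the reciprocal relations between the ego vehicle and surrounding agents. This causal oversight yields inconsistent and unreliable trajectory predictions, especially in interaction-critical scenarios where ego decisions and the behavior of neighboring agents must be reasoned about jointly. To address this limitation, we propose CaAD, a causality-aware end-to-end autonomous driving framework that captures these dependencies in a shared latent scene representation. First, we propose an ego-centric joint-causal modeling module that builds on the marginal prediction branch and learns the causal dependencies between the ego vehicle and interaction-relevant agents. Second, we employ a causality-aware policy alignment stage, implemented with joint-mode embeddings, that aligns the stochastic ego policy with planning-oriented closed-loop feedback computed from surrounding traffic and map context. On the Bench2Drive and NAVSIM benchmarks, CaAD shows strong closed-loop planning performance, reaching a Driving Score of 87.53 and a Success Rate of 71.81 on Bench2Drive and a PDMS of 91.1 on NAVSIM. The project page is available at https://moonseokha.github.io/CaAD/.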

English abstract

End-to-end autonomous driving, which bypasses traditional modular pipelines by directly predicting future trajectories from sensor inputs, has recently achieved substantial progress. However, existing methods often overlook the causal inter-dependencies in ego-vehicle planning, ignoring the reciprocal relations between the ego vehicle and surrounding agents. This causal oversight leads to inconsistent and unreliable trajectory predictions, especially in interaction-critical scenarios where ego decisions and neighboring agent behaviors must be reasoned about jointly. To address this limitation, we propose CaAD, a Causality-aware end-to-end Autonomous Driving framework that captures these dependencies within a shared latent scene representation. First, we propose an ego-centric joint-causal modeling module that builds on the marginal prediction branch, and learns causal dependencies between the ego vehicle and interaction-relevant agents. Second, we employ a causality-aware policy alignment stage implemented with joint-mode embeddings to align the stochastic ego policy with planning-oriented closed-loop feedback computed from surrounding traffic and map context. On the Bench2Drive and NAVSIM benchmarks, CaAD demonstrates strong closed-loop planning performance, achieving a Driving Score of 87.53 and Success Rate of 71.81 on Bench2Drive, and a PDMS of 91.1 on NAVSIM. The project page is available at https://moonseokha.github.io/CaAD/.

2605.11021 2026-05-20 cs.LG

A Switching System Theory of Q-Learning with Linear Function Approximation

A joint-spectral-radius-based switching system theory of Q-learning with linear function approximation

Donghwan Lee, Han-Dong Lim

Institutions * Department of Electrical Engineering

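AI summary Building on joint spectral radius (JSR) theory, this paper presents a switching-system interpretation of Q-learning with linear function approximation. It derives an exact linear switched model, relates convergence to the stability of the corresponding switched system, extends the construction to stochastic linear Q-learning with i.i.d. and Markovian observations, and offers a JSR-based view of regularized Q-learning.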

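AI Chinese abstract (English translation)

This paper develops a switching-system interpretation of Q-learning with linear function approximation (LFA) based on the joint spectral radius (JSR). We derive an exact linear switched model for the mean dynamics and relate convergence to the stability of the corresponding switched system. The same construction is then applied to stochastic linear Q-learning with independent and identically distributed (i.i.d.) observations and with Markovian observations. Although computing the JSR exactly is difficult in general, the certificate captures products of switching modes and can be less conservative than one-step norm bounds. The framework also yields a JSR-based view of regularized Q-learning with LFA. The resulting analysis connects projected Bellman equations, finite-difference stochastic-policy switching, and switched-system stability in a single parameter-space formulation.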

English abstract

This paper develops a switching-system interpretation of Q-learning with linear function approximation (LFA) based on the joint spectral radius (JSR). We derive an exact linear switched model for the mean dynamics and relate convergence to stability of the corresponding switched system. The same construction is then used for stochastic linear Q-learning with independent and identically distributed (i.i.d.) observations and with Markovian observations. Although exact JSR computation is difficult in general, the certificate captures products of switching modes and can be less conservative than one-step norm bounds. The framework also yields a JSR-based view of regularized Q-learning with LFA. The resulting analysis connects projected Bellman equations, finite-difference stochastic-policy switching, and switched-system stability in a single parameter-space formulation.
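
The remark that a product-based certificate can be less conservative than a one-step norm bound can be illustrated with a standard fact: for any submultiplicative norm and any length k, the JSR is at most the largest value of ||P||^(1/k) over all products P of k mode matrices. The sketch below uses brute-force enumeration and the induced infinity norm; the two 2x2 matrices are made-up examples, not the switched model derived in the paper.

```python
from itertools import product

def matmul(a, b):
    """Product of two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def inf_norm(a):
    """Induced infinity norm: maximum absolute row sum."""
    return max(sum(abs(x) for x in row) for row in a)

def jsr_upper_bound(mats, length):
    """max over all products of the given length of ||P||^(1/length).

    This is an upper bound on the joint spectral radius; the enumeration
    grows exponentially with `length`, so it only suits tiny examples.
    """
    best = 0.0
    for combo in product(mats, repeat=length):
        p = combo[0]
        for m in combo[1:]:
            p = matmul(p, m)
        best = max(best, inf_norm(p))
    return best ** (1.0 / length)

# Hypothetical switching modes chosen for illustration only.
modes = [
    [[0.0, 2.0], [0.1, 0.0]],
    [[0.0, 1.5], [0.1, 0.0]],
]

print(jsr_upper_bound(modes, 1))  # one-step norm bound, above 1
print(jsr_upper_bound(modes, 2))  # length-2 product bound, below 1
```

For these example modes the one-step bound exceeds 1 and certifies nothing, while the length-2 bound falls below 1 and certifies stability of the switched system under arbitrary switching.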