arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.19033 2026-05-20 cs.RO cs.AI cs.CV cs.LG cs.MA

RLFTSim: Realistic and Controllable Multi-Agent Traffic Simulation via Reinforcement Learning Fine-Tuning

Ehsan Ahmadi, Hunter Schofield, Behzad Khamidehi, Fazel Arasteh, Jinjun Shan, Lili Mou, Dongfeng Bai, Kasra Rezaee

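AI summary This paper proposes RLFTSim, a reinforcement-learning fine-tuning framework that improves the realism of traffic simulation scenarios and distills goal-conditioned controllability. Experiments on the Waymo Open Motion Dataset report state-of-the-art realism with significantly fewer samples than heuristic search-based fine-tuning methods.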

Comments CVPR 2026 Highlight; Project page at https://ehsan-ami.github.io/rlftsim

Abstract

Supervised open-loop training has been widely adopted for training traffic simulation models; however, it fails to capture the inherently dynamic, multi-agent interactions common in complex driving scenarios. We introduce RLFTSim, a reinforcement-learning-based fine-tuning framework that enhances scenario realism by aligning simulator rollouts with real-world data distributions and provides a method for distilling goal-conditioned controllability in scenario generation. We instantiate RLFTSim on top of a pre-trained simulation model, design a reward that balances fidelity and controllability, and perform comprehensive experiments on the Waymo Open Motion Dataset. Our results show improvements in realism, achieving state-of-the-art performance. Compared with other heuristic search-based fine-tuning methods, RLFTSim requires significantly fewer samples due to a proposed low-variance and dense reward signal, and it directly addresses the realism alignment issue by design. We also demonstrate the effectiveness of our approach for distilling traffic simulation controllability through goal conditioning. The project page is available at https://ehsan-ami.github.io/rlftsim.

2605.19032 2026-05-20 cs.CV

Personalized Face Privacy Protection From a Single Image

Zachary Yahn, Fatih Ilhan, Tiansheng Huang, Selim Tekin, Sihao Hu, Yichang Xu, Margaret Loper, Ling Liu

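AI summary This paper proposes FaceCloak, a system that generates a personalized face privacy mask from a single image of a user so that facial recognition fails. Experiments on three face datasets across ten recognition models show it to be more effective than 29 existing methods.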

Abstract

Photos of faces uploaded online are vulnerable to malicious actors who can scrape facial images from online sources and intrude on personal privacy via unauthorized use of facial recognition models. This paper presents FaceCloak, a novel personalized face privacy protection system, which can generate defensive identity-specific universal face privacy masks from a single image of a user, causing facial recognition to fail. FaceCloak introduces a three-stage personalized face perturbation learning methodology: (1) It generates a small set of high-variety synthetic face images of a person based on a single image of the person. (2) It learns face cloaking by adding more protection to key facial-identity leakage regions through iterative perturbation generation over the small set of synthetic images, effectively shifting a user's identity embedding towards a distant anchor identity and away from a similar one. (3) It generates a personalized identity-protective mask in the form of pixel-wise cloaking, which is light-weight and can be efficiently applied to any facial image of a user while maintaining good perceptual quality. Extensive experiments on three popular face datasets across ten recognition models show the effectiveness of FaceCloak compared to 29 other existing representative methods. Code is available at https://github.com/zacharyyahn/FaceCloak
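
The second stage described above shifts the user's identity embedding toward a distant anchor identity and away from a similar one. A minimal sketch of one objective with that behavior follows, assuming cosine similarity between embedding vectors. The function names and the exact form of the loss are illustrative assumptions, not FaceCloak's published formulation.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cloaking_objective(cloaked_emb, anchor_emb, similar_emb):
    """Hypothetical loss (an assumption, not the paper's): it is lower when
    the cloaked embedding is close to the distant anchor identity and far
    from the similar identity."""
    toward_anchor = cosine_similarity(cloaked_emb, anchor_emb)
    toward_similar = cosine_similarity(cloaked_emb, similar_emb)
    return toward_similar - toward_anchor
```

In an iterative perturbation scheme, a loss of this kind would be minimized with respect to the pixel-wise mask while the perturbation size is bounded to preserve perceptual quality.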

2605.19029 2026-05-20 cs.RO

Distributionally Robust Control via Stein Variational Inference for Contact-Rich Manipulation

Hrishikesh Sathyanarayan, Victor Vantilborgh, Harish Ravichandar, Tom Lefebvre, Ian Abraham

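AI summary This paper proposes a distributionally robust control method based on Stein variational inference for contact-rich manipulation. More flexible uncertainty modeling lets the controller adapt to task-sensitive parameter uncertainty while retaining performance; experiments show up to 3x improved robustness under broad parametric uncertainty.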

Comments In Proceedings of Robotics: Science and Systems, Sydney, Australia, July 2025

Abstract

Reliable robotic manipulation requires control policies that can accurately represent and adapt to uncertainty arising from contact-rich interactions. Modern data-driven methods mitigate uncertainty through large-scale training and computation, but degrade significantly in performance with a limited number of training samples. By contrast, classical model-based controllers are computationally efficient and reliable, but their limited ability to represent task-relevant uncertainty can hinder performance in contact-rich interactions. In this work, we propose to expand the capabilities of model-based manipulation control through more flexible uncertainty modeling that retains performance while exactly adapting to uncertainty. Our approach casts the manipulation problem as a distributionally robust control optimization and proposes a novel deterministic formulation based on Stein variational inference that preserves performance while explicitly modeling task-sensitive parameter uncertainty. As a result, the derived controllers are more aware of task sensitivities to uncertainty, yielding high reliability without compromising performance. Experimental results demonstrate up to 3$\times$ improved robustness across a range of contact-rich manipulation tasks under broad parametric uncertainty, outperforming existing model-based control methods.

2605.19028 2026-05-20 cs.LG

Learning When to Adapt

Ali Zindari, Xiaowen Jiang, Rotem Mulayoff, Sebastian U. Stich

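AI summary This paper proposes DISeL (Dynamic Input-Sensitive LoRA), which adds lightweight input-dependent gates to LoRA, reducing forgetting while maintaining competitive fine-tuning accuracy and providing an interpretable diagnostic view of where adaptation occurs.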

Comments Preprint

Abstract

Low-rank adaptation (LoRA) is a widely used parameter-efficient fine-tuning method, yet its learned correction is static: the same low-rank update is applied to every input. This input-agnostic approach creates an inevitable compromise between adapting to the fine-tuning distribution and preserving pre-trained behavior on inputs outside that distribution, contributing to catastrophic forgetting. We introduce DISeL (Dynamic Input-Sensitive LoRA), which augments LoRA modules with lightweight input-dependent gates over individual rank-one components. The gating mechanism is designed to preserve the pre-trained model's behavior by default, while training learns to activate selected components that reduce the fine-tuning loss. DISeL adds only a small number of parameters and preserves the low-rank structure. Across RoBERTa on GLUE, and Llama and Mistral models fine-tuned for mathematical reasoning and code generation, DISeL reduces forgetting relative to LoRA and related variants while maintaining competitive fine-tuning accuracy. In addition, the learned gate activations provide an interpretable diagnostic view of which layers and rank components are most activated during fine-tuning, giving insight into where task-specific adaptation is concentrated. Code available at https://github.com/alizindari/DISeL .
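
The gating idea above can be sketched for a single linear layer: the frozen weight W is corrected by a sum of rank-one terms, each scaled by a gate that depends on the input. This is a minimal pure-Python sketch, not the DISeL implementation. The sigmoid-of-linear gate, the parameter names `gate_w` and `gate_b`, and the use of a large negative bias to keep gates closed by default are all assumptions made for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def gated_lora_forward(W, A, B, gate_w, gate_b, x):
    """Compute y = W x + sum_i g_i(x) * B[:, i] * (A[i] . x).

    W: d_out x d_in frozen weight (list of rows)
    A: r x d_in, row i is the input direction of rank-one component i
    B: d_out x r, column i is the output direction of component i
    gate_w, gate_b: hypothetical gate parameters, g_i(x) = sigmoid(gate_w[i] . x + gate_b[i])
    """
    y = [dot(row, x) for row in W]  # pre-trained behavior
    gates = []
    for i in range(len(A)):
        g = sigmoid(dot(gate_w[i], x) + gate_b[i])
        gates.append(g)
        coeff = g * dot(A[i], x)
        for o in range(len(y)):
            y[o] += coeff * B[o][i]
    return y, gates
```

With a strongly negative `gate_b`, every gate is close to zero and the layer reproduces `W x`, which mirrors the "preserve pre-trained behavior by default" property. The returned `gates` list is the kind of per-component activation that could be inspected for diagnostics.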

2605.18597 2026-05-20 cs.AI

Latent Action Reparameterization for Efficient Agent Inference

Wenhao Huang, Qingwen Zeng, Qiyue Chen, Zijie Guo, Yu Sun, Cheng Yang, Siru Ouyang, Jiri Gesi, Fang Wu, Jiayi Zhang, Huaming Chen, Bang Liu, Xiangru Tang, Chenglin Wu

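AI summary This paper proposes Latent Action Reparameterization (LAR), a framework that learns a compact latent action space to make LLM agent inference more efficient, shortening the effective action horizon while preserving the expressiveness of the original action space.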

Abstract

Large language model (LLM) agents often rely on long sequences of low-level textual actions, resulting in large effective decision horizons and high inference cost. While prior work has focused on improving inference efficiency through system-level optimizations or prompt engineering, we argue that a key bottleneck lies in the representation of the action space itself. We propose Latent Action Reparameterization (LAR), a framework that learns a compact latent action space in which each latent action corresponds to a multi-step semantic behavior. By reparameterizing agent actions into latent units, LAR enables decision making over a shorter effective horizon while preserving the expressiveness of the original action space. Unlike hand-crafted macros or hierarchical controllers, latent actions are learned from agent trajectories and integrated directly into the model, allowing both planning and execution to operate over abstract action representations. Across a range of LLM-based agent benchmarks, LAR significantly reduces the effective action horizon and improves inference efficiency under fixed compute budgets. As a consequence, our approach achieves substantial reductions in action tokens and corresponding wall-clock inference time, while maintaining or improving task success rates. These results suggest that action representation learning is a critical and underexplored factor in scaling efficient LLM agent inference, complementary to advances in model architecture and hardware.

2605.18464 2026-05-20 cs.CV

PERL: Parameter Efficient Reasoning in CLIP Latent Space

Simone Carnemolla, Salvatore Calcagno, Daniela Giordano, Concetto Spampinato, Matteo Pennisi

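AI summary This paper proposes PERL, a framework for parameter-efficient adaptation through iterative latent reasoning in CLIP latent space. Across multiple benchmarks it achieves the best parameter-performance trade-off among the compared methods, combining strong novel-class accuracy and competitive transfer performance with only about 6K trainable parameters.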

Comments Submitted to NeurIPS 2026

Abstract

Contrastively trained vision-language models such as CLIP provide strong zero-shot transfer by aligning images and text in a shared embedding space. However, adapting these models to downstream tasks without degrading their open-vocabulary generalization remains challenging. Existing parameter-efficient adaptation methods typically improve task specialization through learned prompts, adapters, or multimodal transformations, where adaptation capacity is primarily expressed through additional trainable parameters. Inspired by recent latent reasoning methods in language models, we investigate a complementary perspective: can adaptation emerge from iterative reasoning on latent representations rather than from increasing parameter count alone? We introduce PERL (Parameter-Efficient Reasoning in CLIP Latent Space), a lightweight adaptation framework that augments a frozen CLIP model with a compact shared reasoning module applied recurrently across refinement steps. At each step, PERL generates a latent reasoning token conditioned on the current representation and injects it into an intermediate encoder layer, progressively refining higher-level semantic representations while preserving CLIP's pretrained multimodal structure. Across 15 benchmarks spanning base-to-novel generalization, cross-dataset transfer, and out-of-distribution ImageNet variants, PERL achieves the best parameter-performance trade-off among the compared methods under a fast-adaptation few-shot setting, combining strong novel-class accuracy and competitive transfer performance with only about 6K trainable parameters, up to 817x fewer than the largest compared approach. Overall, our results suggest that iterative latent reasoning provides a complementary adaptation mechanism to parameter scaling in discriminative vision-language models.

2605.18445 2026-05-20 cs.CV cs.AI cs.CL cs.LG

What's Holding Back Latent Visual Reasoning?

André G. Viveiros, Nuno Gonçalves, André F. T. Martins, Matthias Lindemann

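AI summary This study examines how existing models use latent tokens and finds that they play a minimal causal role in the final prediction. Two issues stand out: oracle latent tokens in most training datasets add little information beyond the original image, and latent tokens generated at inference time deviate from their oracle representations. Progress requires datasets with informative intermediate steps and more precise latent token prediction.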

Abstract

Humans can approach complex visual problems by mentally simulating intermediate visual steps, rather than reasoning through language alone. Inspired by this, several works on Vision-Language Models have recently explored chain-of-thought reasoning with continuous latent tokens as intermediate visual imagination steps. In this work, we investigate how recent models leverage such latent tokens. Surprisingly, we find that model accuracy is unaffected when latent tokens are replaced by uninformative dummy tokens. This indicates that latent tokens play a minimal causal role in the model's final prediction. To better understand this phenomenon, we analyze both the training signal provided by oracle latent representations and the quality of the latent tokens generated at inference time. Our experiments reveal two crucial issues holding back latent visual reasoning: First, in most existing datasets, oracle latent tokens provide limited additional information beyond the original image and do not substantially simplify the task, leading models to ignore them during training and effectively bypass them at inference time. When fine-tuned on a diagnostic dataset, in which latent tokens provide sufficient support for the final prediction, we show that models can causally rely on them. Second, the latent tokens produced at inference time deviate from their corresponding oracle representations, collapsing to a narrow region and preventing benefits even when the model relies on them. Overall, our findings suggest that future progress in latent visual reasoning depends on two key pillars: high-quality datasets with informative intermediate steps and more precise latent token prediction.

2605.18431 2026-05-20 cs.CV

Seeing Together: Multi-Robot Cooperative Egocentric Spatial Reasoning with Multimodal Large Language Models

Kunyu Peng, Zhikun Zhou, Kailun Yang, Di Wen, Ruiping Liu, Yufan Chen, Junwei Zheng, Hao Shi, Yi Zhou, M. Saquib Sarfraz, Danda Pani Paudel, Luc Van Gool

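AI summary This paper studies multi-robot cooperative dynamic spatial reasoning, introducing CoopSR, the first benchmark for the task, and EgoTeam, a multi-robot egocentric QA dataset. The proposed SP-CoR framework enables fine-grained cooperative spatial reasoning and outperforms the strongest fine-tuned baseline by 3.87% on Habitat and 7.12% on iGibson.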

Abstract

Multimodal Large Language Models (MLLMs) have made substantial progress in egocentric video understanding, but their ability to reason cooperatively from multiple embodied viewpoints remains largely unexplored. We study this problem through multi-robot cooperative dynamic spatial reasoning, where a model must answer spatial, temporal, visibility, and coordination questions by integrating synchronized egocentric videos from a team of moving robots. To support this setting, we introduce CoopSR, the first benchmark for this task, together with EgoTeam, a multi-robot egocentric QA dataset. EgoTeam contains 114,227 QA pairs spanning 19 question types, four difficulty tiers, and three team sizes in Habitat and iGibson, along with a real-world test set of around 2,326 QAs collected using two quadruped robots. We further propose SP-CoR (Spectral and Physics-Informed Cooperative Reasoner), an MLLM framework for fine-grained cooperative spatial reasoning. SP-CoR combines dynamics-aware multi-robot frame sampling, spectral- and physics-guided view fusion, and physics-aligned prompt distillation, enabling the model to benefit from privileged robot-pose supervision during training while requiring only egocentric videos at test time. Across 22 MLLM baselines, SP-CoR consistently improves cooperative reasoning, outperforming the strongest fine-tuned baseline by +3.87% on Habitat and +7.12% on iGibson. It also shows stronger generalization to unseen team sizes and real-world robot tests. Code can be found at https://github.com/KPeng9510/seeing-together.git.

2605.18413 2026-05-20 cs.CV

Cracks in the Foundation: A Civil Infrastructure Dataset to Challenge Vision Foundation Models

Nicola Farronato, Niccolo Avogaro, Thomas Frick, Mattia Rigotti, Rizwan Ullah Khan, Michele Magno, Konrad Schindler, Cristiano Malossi, Florian Scheidegger

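AI summary This paper introduces the Cracks in the Foundation (CiF) dataset of about 150,000 high-resolution images, which challenges vision foundation models on dense image understanding of civil infrastructure and exposes the limitations of current models in real-world settings.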

Abstract

Automated structural health monitoring is essential to prevent catastrophic infrastructure failures. Precise, pixel-level defect segmentation is needed to accurately assess structural integrity, but progress in defect segmentation for civil infrastructures has been held back by an extreme scarcity of data, which requires costly expert annotation. The need for data is accentuated by algorithmic hurdles intrinsic to the problem, including center-bias and the need to rely more on shape when inspecting nearly textureless building materials. To remove the bottleneck, we introduce Cracks in the Foundation (CiF), the largest and most detailed civil infrastructure (instance) segmentation dataset to date, comprising $\approx$150,000 high-resolution images meticulously curated over five years in collaboration with civil engineering experts. With the help of this unprecedented data source, we expose a blind spot of current visual AI: despite the advent of promptable Foundation Models (FMs) and Vision Language Models (VLMs), and despite the impressive abilities of today's specialised segmentation models, it turns out that dense image understanding in the built environment is nowhere near solved. Our evaluations indicate that even the most recent zero-shot FMs face significant challenges when deployed on real-world infrastructure and even the performance of specialised models with domain-specific supervision plateaus at $\approx$25% mAP. CiF establishes inspection of civil infrastructure, an elementary and seemingly easy perceptual task, as an open challenge that reveals fundamental weaknesses of present-day models trained predominantly on internet images, literally and figuratively highlighting cracks in the current foundation model paradigm.

2605.18396 2026-05-20 cs.CV

NEWTON: Agentic Planning for Physically Grounded Video Generation

Yuxiang Feng, Juncheng Wang, Chao Xu, Yijie Qian, Huihan Wang, Wenlong Hou, Yang Liu, Baigui Sun, Yong Liu, Shujun Wang

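AI summary This paper proposes NEWTON, which demotes video generation from the system output to one action in an agent's toolbox. A learned planner orchestrates physics-aware tools to improve the physical plausibility of generated videos, raising joint accuracy on VideoPhy-2 from 21.4% to 29.7% on LTX-Video and from 30.7% to 37.4% on Veo-3.1.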

Comments project page: https://Newton026.github.io/newton

Abstract

Video generation models produce visually compelling results but systematically violate physical commonsense -- on VideoPhy-2, the best model achieves only 32.6% joint accuracy. We identify a specification bottleneck: text prompts are lossy compression of the physical world, omitting the parameters that fully determine dynamics, and no amount of model scaling can recover what was never specified. From this diagnosis we derive three properties that physics conditioning must satisfy -- sufficiency, dynamism, and verifiability -- and show that no existing approach satisfies all three. We present NEWTON, in which video generation is demoted from the system output to one action inside an agent's toolbox: a learned planner orchestrates physics-aware tools (keyframe generation, scientific computation, prompt refinement) to construct rich conditioning, and a verifier closes the loop for iterative re-planning. The planner is the sole trainable component, optimized on-policy via Flow-GRPO inside the live multi-turn loop. On VideoPhy-2, NEWTON improves joint accuracy from 21.4% to 29.7% on LTX-Video and from 30.7% to 37.4% on Veo-3.1, without modifying either generator. Our project page: https://Newton026.github.io/newton

2605.18389 2026-05-20 cs.LG math.OC

Spherical Harmonic Optimal Transport: Application to Climate Models Comparisons

Pierre Houédry, Iskander Legheraba, Léo Buecher, Nicolas Courty

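AI summary This paper proposes an optimal transport method based on spherical harmonics for comparing climate models efficiently. It exploits the harmonic structure of the sphere to design a fast Sinkhorn algorithm and discusses its use in evaluating global climate models.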

Abstract

Optimal transport provides a powerful framework for comparing measures while respecting the geometry of their support, but comes with an expensive computational cost, hindering its potential application to real-world use cases. On manifolds, convolutional algorithms based on the heat kernel have been proposed to alleviate this cost, but their theoretical properties remain largely unexplored. We establish that the heat kernel cost converges to the optimal transport cost as time vanishes in the balanced and unbalanced cases. In the specific case of the 2-sphere $\mathbb{S}^2$, we ensure that the associated Sinkhorn divergences retain the desirable geometric and analytic properties of classical optimal transport discrepancies. Moreover, we leverage the harmonic structure of the sphere to derive a fast Sinkhorn algorithm, requiring only $\mathcal{O}(n)$ memory and $\mathcal{O}(n^{3/2})$ time per iteration, with fully dense GPU-friendly operations. We validate its computational efficiency on synthetic data, and discuss its potential use in the evaluation of global climate models, providing both spatial and seasonal insights into model performance.
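
For orientation, the baseline that the fast algorithm accelerates is the standard Sinkhorn iteration for entropic optimal transport. The sketch below is the generic dense version in pure Python, with O(n^2) cost per iteration from the explicit kernel products. It is not the spherical-harmonic algorithm of the paper; the function name, the default `eps`, and the iteration count are assumptions chosen for illustration.

```python
import math


def sinkhorn_plan(a, b, cost, eps=0.5, n_iters=200):
    """Entropic OT between histograms a (length n) and b (length m).

    cost[i][j] is the ground cost, eps the entropic regularization.
    Returns the transport plan as a list of rows.
    """
    n, m = len(a), len(b)
    # Gibbs kernel K = exp(-cost / eps), stored densely.
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternate scalings so rows match a, then columns match b.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]
```

The two kernel products inside the loop are the expensive steps. A heat-kernel approach replaces them with a convolution, which is where a harmonic transform on the sphere can reduce the per-iteration cost.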

2605.17942 2026-05-20 cs.CV

UAVFF3D: A Geometry-Aware Benchmark for Feed-Forward UAV 3D Reconstruction

Xiang Yang, Yongli Wang, HaiFeng Li, Yunsheng Zhang

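AI summary This paper proposes the UAVFF3D benchmark to address reconstruction failures in UAV photogrammetry caused by camera-geometry variations. It combines real and synthetic images with a controlled test subset to support UAV-domain adaptation and robustness evaluation.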

Comments 19 pages, 16 figures, 16 tables

Abstract

Feed-forward 3D reconstruction has advanced rapidly, but current models remain unreliable in UAV photogrammetric acquisition. We argue that this failure is caused not only by appearance-domain shift, but also by UAV-specific camera-geometry variations, especially oblique views and HFOV-height ambiguity. Existing UAV datasets mainly emphasize scene diversity and provide limited coverage of camera configurations, which restricts robustness evaluation and UAV-domain adaptation. To address this gap, we introduce UAVFF3D, a geometry-aware real-synthetic benchmark for feed-forward UAV 3D reconstruction. UAVFF3D contains more than 170k real UAV images and more than 370k synthetic images rendered from high-quality textured 3D models, covering diverse HFOVs, flight altitudes, viewing directions, and acquisition patterns. It also includes a controlled HFOV-height test subset for diagnosing projection-geometry ambiguity. We further propose an evaluation protocol that jointly assesses camera-geometry estimation and dense scene reconstruction under a shared global alignment, avoiding the bias caused by separate camera and geometry alignments. Experiments on representative feed-forward reconstruction models show that UAVFF3D-based domain adaptation consistently improves camera and geometry estimation, reducing Ray Error by up to 84.2%, Pose ATE by up to 76.0%, and Chamfer Distance by up to 41.1%. In oblique scenes, adaptation reduces the oblique-nadir rotation gap by up to 90.7%. Under HFOV-height ambiguity, it improves robustness across HFOV-height configurations and yields more stable performance across HFOV settings. Incorporating camera priors further improves reconstruction under UAV-specific acquisition geometries. The dataset and evaluation code are available at https://github.com/yanxian-ll/UAVFF3D .
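
One way to see why HFOV and flight height can be confused is the pinhole relation for a nadir camera over flat ground: the ground footprint width is 2 h tan(HFOV / 2), so different (HFOV, height) pairs cover the same ground extent. The sketch below illustrates that relation only. The flat-ground nadir setup and the function names are assumptions for illustration, not the benchmark's definition of the ambiguity.

```python
import math


def ground_footprint_width(hfov_deg, height_m):
    """Width of ground covered by a nadir pinhole camera over flat terrain."""
    return 2.0 * height_m * math.tan(math.radians(hfov_deg) / 2.0)


def height_for_same_footprint(hfov_deg, target_width_m):
    """Flight height at which a camera with the given HFOV covers target_width_m."""
    return target_width_m / (2.0 * math.tan(math.radians(hfov_deg) / 2.0))
```

For example, a 90-degree HFOV at 50 m covers the same 100 m swath as a narrower lens flown higher, and `height_for_same_footprint` returns that higher altitude for any chosen HFOV.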

2605.17916 2026-05-20 cs.CV

PanoWorld: A Generative Spatial World Model for Consistent Whole-House Panorama Synthesis

Jinrang Jia, Zhenjia Li, Yijiang Hu, Yifeng Shi

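AI summary This paper proposes PanoWorld, a generative spatial world model that autoregressively generates node-based 360-degree panoramas for consistent whole-house synthesis. It addresses two problems: pure 2D generators re-imagine geometry and materials when the viewpoint changes, and monolithic 3D generation is expensive and loses fine texture at multi-room scale.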

Comments 17

Abstract

Generating a consistent whole-house VR tour from a floorplan and style reference requires both photorealistic panoramas and cross-view spatial coherence. Pure 2D generators produce appealing single panoramas but re-imagine geometry and materials when the viewpoint changes, whereas monolithic 3D generation becomes expensive and loses fine texture at multi-room scale. We introduce PanoWorld, a generative spatial world model that treats whole-house synthesis as autoregressive generation of node-based 360-degree panoramas, matching the discrete navigation used by real VR tour products. PanoWorld uses a floorplan-derived 3D shell as a global geometric proxy and a dynamic 3D Gaussian Splatting cache as renderable spatial memory. A feed-forward panoramic LRM designed for metric-scale multi-room 360-degree inputs lifts generated panoramas into local 3DGS updates, while Room-aware Group Attention suppresses cross-room feature interference. A topology-aware progressive caching strategy fuses these local updates without repeatedly reconstructing the full history. By decoupling shell-based geometry guidance from cache-rendered visual memory, PanoWorld preserves high-frequency 2D synthesis quality while improving cross-node layout and material consistency. The project link is https://jjrcn.github.io/PanoWorld-project-home/

2605.17889 2026-05-20 cs.LG

CoX-MoE: Coalesced Expert Execution for High-Throughput MoE Inference with AMX-Enabled CPU-GPU Co-Execution

Muyoung Son, Yi Chen, Seungjae Yoo, Soongyu Choi, Joo-Young Kim

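AI summary This paper proposes CoX-MoE, an AMX-enabled CPU-GPU collaborative system that optimizes MoE inference throughput through coalesced expert execution and strategic workload orchestration. It introduces a coalescing-aware orchestration policy to optimize resource allocation and a static expert-aware stratification scheme to reduce PCIe transfer overhead, achieving up to 7.1x and 2.4x higher throughput than FlexGen and MoE-Lightning, respectively.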

Comments 7 pages, 8 figures, accepted to DAC '26

Abstract

The Mixture-of-Experts (MoE) architecture improves computational efficiency via sparse expert activation, but throughput-oriented inference faces substantial GPU memory pressure due to a significant parameter size and intermediate data. Prior works attempt to mitigate this using expert offloading with micro-batching or by offloading computation to the CPU. However, the fragmented workload resulting from micro-batching degrades operational intensity, causing expert execution to become memory-bound. Meanwhile, CPU offloading is constrained by slow PCIe transfers and its limited applicability to attention computation in the decode stage. Consequently, these inefficiencies prevent effective system utilization, severely restricting the end-to-end throughput of MoE inference. To address these challenges, this paper proposes CoX-MoE, an Advanced Matrix Extensions (AMX)-enabled CPU-GPU collaborative system that comprehensively optimizes MoE inference by combining coalesced expert execution with strategic workload orchestration for higher throughput. CoX-MoE introduces (i) a coalescing-aware orchestration policy to jointly optimize resource allocation by adopting ordinary batch, instead of micro-batch, for expert computation and selective attention offloading, and (ii) a static expert-aware stratification scheme that pre-assigns frequently activated experts to the GPU, mitigating PCIe transfer overhead and balancing workload for the CPU and GPU during inference. Compared to state-of-the-art frameworks, CoX-MoE delivers significant gains, achieving up to 7.1x and 2.4x higher throughput than FlexGen and MoE-Lightning, respectively.
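
The static stratification idea in (ii) can be illustrated with a frequency-based placement: rank experts by how often they were activated in a profiling run and pin the most frequent ones to the GPU. This is a simplified sketch under stated assumptions. The function name, the `gpu_slots` budget, and the use of raw activation counts are hypothetical, and the sketch omits the CPU-GPU workload balancing that the abstract also mentions.

```python
def stratify_experts(activation_counts, gpu_slots):
    """Split experts into GPU-resident and CPU-resident sets.

    activation_counts: dict mapping expert id -> activation count from profiling
    gpu_slots: number of experts that fit in GPU memory (hypothetical budget)
    """
    if gpu_slots < 0:
        raise ValueError("gpu_slots must be non-negative")
    # Most frequently activated first; ties broken by expert id for determinism.
    ranked = sorted(activation_counts, key=lambda e: (-activation_counts[e], e))
    gpu_experts = set(ranked[:gpu_slots])
    cpu_experts = set(ranked[gpu_slots:])
    return gpu_experts, cpu_experts
```

Because the placement is fixed before inference, frequently used expert weights never need to cross PCIe at run time; only requests routed to the CPU-resident experts run on the CPU side.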

2605.17809 2026-05-20 cs.AI cs.IR

Accelerating AI-Powered Research: The PuppyChatter Framework for Usable and Flexible Tooling

加速人工智能研究:PuppyChatter框架用于实用且灵活的工具开发

Chun-Hsiung Tseng, Hao-Chiang Koong Lin, Andrew Chih-Wei Huang, Yung-Hui Chen, Jia-Rou Lin

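AI summary This paper proposes the PuppyChatter framework to address challenges in AI application development, combining the intuitiveness of vendor-specific SDKs with the vendor neutrality of model abstraction to offer a more streamlined and flexible development approach.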

详情
AI中文摘要

本研究针对开发人工智能应用,特别是利用大语言模型(LLMs)的应用所固有的挑战。尽管AI供应商提供应用程序编程接口(API)和软件开发工具包(SDK)来促进开发人员交互,但前者通常需要复杂的手动请求构造,而后者可能导致显著的供应商锁定。此外,尽管现有的模型抽象框架在减轻供应商依赖方面有所成效,但引入了额外的复杂性和潜在的安全问题。为调和这些矛盾因素,本研究引入了PuppyChatter,一种新的软件框架,旨在保持供应商特定SDK的直观简洁性,同时遵循模型抽象中固有的中立原则,从而提供更流畅且灵活的开发范式。

英文摘要

This research addresses the challenges inherent in developing Artificial Intelligence (AI) applications, particularly those leveraging Large Language Models (LLMs). While AI vendors provide Application Programming Interfaces (APIs) and Software Development Kits (SDKs) to facilitate developer interaction, the former often requires intricate manual request construction, and the latter can lead to significant vendor lock-in. Furthermore, existing model abstraction frameworks, though mitigating vendor dependency, introduce an additional layer of complexity and potential security concerns. To reconcile these conflicting factors, the study introduces PuppyChatter, a novel software framework designed to preserve the intuitive simplicity of vendor-specific SDKs while simultaneously adhering to the vendor-neutrality principles characteristic of model abstraction, thereby offering a more streamlined and flexible development paradigm.

2605.17804 2026-05-20 cs.LG eess.SP

GenTS: A Comprehensive Benchmark Library for Generative Time Series Models

GenTS:生成时间序列模型的综合基准库

Chenxi Wang, Xiaorong Wang, Peiyang Li, Yi Wang

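AI summary This paper presents GenTS, a comprehensive and extensible benchmark library for systematically evaluating generative time series models. It provides a unified data preprocessing pipeline, a diverse collection of models, and panoramic evaluation metrics.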

详情
AI中文摘要

生成模型在时间序列分析任务中展现出了显著的潜力,如合成、预测、插值等。然而,现有的时间序列库主要针对判别模型进行工程设计,具有针对特定任务的标准工作流程,例如优化时间序列预测的均方误差。这种刚性的结构与生成模型独特的、往往复杂的范式(如对抗训练、扩散过程)根本上不兼容,因为生成模型学习的是数据分布而非直接的输入-输出映射。为此,我们提出了GenTS,一个全面且可扩展的基准库,旨在对生成时间序列模型进行系统评估。GenTS具有统一的数据预处理流程、多样化的模型集合和全景评估指标。其模块化设计也使研究者能够灵活地自定义超出内置数据集和模型。基于GenTS,我们进行了在多种任务下的基准测试,从而为模型选择提供了建议,并识别了未来研究的潜在方向。我们的代码在https://github.com/WillWang1113/GenTS上开源。官方教程和文档可在https://willwang1113.github.io/GenTS/上获取。

英文摘要

Generative models have demonstrated remarkable potential in time series analysis tasks such as synthesis, forecasting, and imputation. However, existing time series libraries offer limited coverage of generative models: they are mainly engineered for discriminative models, with standardized workflows for specific tasks, such as optimizing mean squared error for time series forecasting. This rigid structure is fundamentally incompatible with the distinct and often complex paradigms of generative models (e.g., adversarial training, diffusion processes), which learn the underlying data distribution rather than a direct input-output mapping. To this end, we propose GenTS, a comprehensive and extensible benchmark library designed for systematic assessment of generative time series models. GenTS features a unified data preprocessing pipeline, a collection of versatile models, and panoramic evaluation metrics. Its modular design also enables researchers to flexibly customize beyond our built-in datasets and models. Based on GenTS, we conducted benchmarking experiments under diverse tasks, offering suggestions for model selection and identifying potential directions for future research. Our code is open-source at https://github.com/WillWang1113/GenTS. The official tutorials and documentation are available at https://willwang1113.github.io/GenTS/.

2605.17539 2026-05-20 cs.AI

Memory-Guided Tree Search with Cross-Branch Knowledge Transfer for LLM Solver Synthesis

具有跨分支知识转移的内存引导树搜索用于LLM求解器合成

Fatemeh Haji, Javier Delarosa Quiros, Peyman Najafirad

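AI summary This work proposes MEMOIR, a memory-guided tree-search framework for LLM solver synthesis. Its two-level memory hierarchy enables cross-branch knowledge transfer, improving solution validity and quality.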

详情
AI中文摘要

组合优化(CO)在从物流到芯片设计的决策中起着基础性作用,其中不可行的解决方案在操作上不可用,而小的改进可以转化为显著的经济价值。最近的研究利用大型语言模型(LLMs)自动化求解器合成:从自然语言规范生成可执行的求解器程序。然而,现有的树搜索和进化代理在并行细化候选轨迹时没有显式的知识转移,重新引入了相同的约束违规,并收敛到相似的算法家族。我们引入MEMOIR,一种具有两级记忆层次结构的内存引导树搜索框架:分支本地记忆在迭代单个算法设计时保存执行基础的细化细节,而全局记忆存储跨分支压缩的算法和失败模式摘要。在分支终止时的反思步骤提炼这些摘要,使跨分支转移成为可能,而不会污染未来的上下文与低层次调试跟踪。在七个跨越调度、路由、打包和几何设计的CO问题上,MEMOIR实现了96.7%的解决方案有效性(比最强基线高出9.2个点),并在匹配的每种方法执行预算下,将平均标准化分数提高了7.3个点。在四个问题上进行三次独立运行时,MEMOIR的运行间有效性标准差比我们评估的所有基线低一个数量级,表明内存引导的探索产生了持续的改进,而不是反映采样方差。

英文摘要

Combinatorial optimization (CO) underlies decision-making from logistics to chip design, where infeasible solutions are operationally unusable and small quality gains translate into substantial economic value. Recent work uses large language models (LLMs) to automate solver synthesis: generating executable solver programs from natural-language specifications. However, existing tree-search and evolutionary agents refine candidate trajectories in parallel without explicit knowledge transfer, reintroducing the same constraint violations and converging on similar algorithm families. We introduce MEMOIR, a memory-guided tree-search framework with a two-level memory hierarchy: branch-local memory preserves execution-grounded refinement details within a branch as it iterates on a single algorithmic design, while global memory stores compressed algorithmic and failure-mode summaries across branches. A reflection step at branch termination distills these summaries, enabling cross-branch transfer without polluting future contexts with low-level debugging traces. Across seven CO problems spanning scheduling, routing, packing, and geometric design, MEMOIR achieves 96.7% solution validity (a 9.2 point gap over the strongest baseline) and improves the average normalized score by 7.3 points at matched per-method execution budget. Over three independent runs on four problems, MEMOIR's run-to-run validity standard deviation is more than an order of magnitude below that of every baseline we evaluated in this setting, suggesting that memory-guided exploration yields consistent improvements rather than reflecting sampling variance.
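The two-level memory hierarchy can be pictured with a small Python sketch. Everything here is hypothetical: the class names, the trace fields (`error`, `score`), and the rule-based `reflect` function stand in for what is presumably an LLM-driven reflection step in MEMOIR. The point is only the separation between detailed branch-local traces and compact global summaries.

```python
from dataclasses import dataclass, field

@dataclass
class Branch:
    """One search branch iterating on a single algorithmic design."""
    design: str
    local_memory: list = field(default_factory=list)  # detailed execution traces

    def record(self, trace):
        self.local_memory.append(trace)

@dataclass
class GlobalMemory:
    """Compressed summaries shared across branches."""
    summaries: list = field(default_factory=list)

    def add(self, summary):
        self.summaries.append(summary)

def reflect(branch):
    """Distill a terminated branch into a summary without low-level traces.

    Assumption: a simple rule-based stand-in for the reflection step.
    """
    failures = sorted({t["error"] for t in branch.local_memory if t["error"]})
    best = max((t["score"] for t in branch.local_memory), default=None)
    return {"design": branch.design, "failure_modes": failures, "best_score": best}

global_memory = GlobalMemory()
branch = Branch("greedy construction + local search")
branch.record({"error": "capacity constraint violated", "score": 0.0})
branch.record({"error": None, "score": 0.82})
global_memory.add(reflect(branch))  # called once, at branch termination
```

A new branch would read `global_memory.summaries` to avoid repeating the recorded failure mode, while the two raw traces stay local to the branch that produced them.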

2605.17480 2026-05-20 cs.AI

The Capability Paradox: How Smarter Auditors Make Multi-Agent Systems Less Secure

能力悖论:更聪明的审计员如何使多智能体系统更不安全

Qiqi Liu, Thorsten Holz, Shilin Ye, Runhan Song

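AI summary This paper studies multi-agent systems in which the system-level attack success rate rises as Worker capability increases, shows that linguistic certainty drives how the attack propagates, and proposes heterogeneous ensemble verification to reduce the attack success rate.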

Comments 28 pages, 6 figures

详情
AI中文摘要

多智能体系统通过将任务分解给专门的智能体来扩展大语言模型(LLMs),但其分布式决策过程创造了新的攻击面。我们识别出语义劫持攻击,即有害请求被隐藏在领域特定的叙述中,并通过工人报告传播到管理者,而无需任何语法注入原始。在42,000次对抗性试验中,我们发现了能力悖论:随着工人能力的增加,系统级攻击成功率(ASR)从18.4%增加到63.9%,峰值达到94.4%。为了解释这一效应,我们对两个独立数据集(47,807次交互)进行了多层中介分析。分析显示,这一悖论由语言确定性驱动:更强的工人更可能将对抗性叙述解释为合法,自信地传达结论,从而导致管理者将这种自信的背书视为执行的正当理由。在我们的更大工人-only设置(n_W=14)中,确定性中介了74%的效果,95%置信区间(CI)在蒙特卡洛和聚类Bootstrap下均排除零;较小的Full-MAS设置(n_W=6)显示了方向一致的间接效应。工人端的安全提示无法可靠地缓解这一失败。基于中介发现,我们提出异质性集成验证,通过配对具有不对称领域能力的工人,使它们的互补性漏洞打破确定性到执行的链条,将ASR从52.8%降低到2.0%,对良性任务影响微乎其微。我们的结果表明,升级组件到更强的模型会主动降低系统安全性,有效的防御需要利用而不是消除智能体之间的能力不对称性。

英文摘要

Multi-agent systems extend large language models (LLMs) by decomposing tasks among specialized agents, but their distributed decision process creates new attack surfaces. We identify semantic hijacking, an attack in which harmful requests are concealed within domain-specific narratives and propagated to a Manager through Worker reports, without any syntactic injection primitives. Across 42,000 adversarial trials over 12 Manager models and 7 Worker configurations, we uncover a capability paradox: as Worker capability increases, the mean system-level Attack Success Rate (ASR) increases from 18.4% to 63.9%, peaking at 94.4%. To explain this effect, we conduct multi-level mediation analysis on two independent datasets (47,807 interactions). The analysis shows that the paradox is driven by linguistic certainty: stronger Workers are more likely to interpret adversarial narratives as legitimate, convey their conclusions assertively, and thereby lead Managers to treat such confident endorsements as justification to execute. In our larger Worker-Only setting ($n_W = 14$), certainty mediates 74% of the effect, with 95% confidence intervals (CI) excluding zero under both Monte Carlo and cluster bootstrap; the smaller Full-MAS setting ($n_W = 6$) shows a directionally consistent indirect effect. Worker-side safety prompting does not reliably mitigate this failure. Building on the mediation finding, we propose heterogeneous ensemble verification, which pairs Workers of asymmetric domain competence so their complementary vulnerabilities break the certainty-to-execution chain, reducing ASR from 52.8% to 2.0% with negligible benign-task impact. Our results show that upgrading components to stronger models can actively degrade system security, and that effective defenses require exploiting, rather than eliminating, capability asymmetries between agents.

2605.17470 2026-05-20 cs.CV cs.MM eess.IV

EchoSR: Efficient Context Harnessing for Lightweight Image Super-Resolution

EchoSR: 为轻量图像超分辨率实现高效的上下文利用

Hanli Zhao, Binhao Wang, Shihao Zhao, Tao Wang, Kaihao Zhang, Wanglong Lu

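AI summary This paper proposes EchoSR, a framework that unifies multi-scale receptive field modeling and hierarchical context fusion for lightweight image super-resolution. It outperforms existing methods on multiple benchmarks while running roughly 2x faster.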

Comments Accepted by Information Fusion; 20 pages, 17 figures

详情
AI中文摘要

图像超分辨率(SR)旨在从低分辨率(LR)输入中重建高质量、高分辨率(HR)图像,并在各种下游应用中发挥关键作用。尽管近年来取得了进展,但平衡重建保真度和计算效率仍然是一个根本性挑战,尤其是在资源受限的场景中。虽然现有轻量方法试图扩展感受野,但许多方法要么导致显著的计算开销,要么简单地扩大内核大小,或缺乏机制进行一致的多尺度整合,限制了它们的整体效果和可扩展性。为了解决这些限制,我们提出了EchoSR,一个高效的上下文利用框架,用于轻量图像超分辨率,它统一了多尺度感受野建模和层次化上下文融合。EchoSR通过一种高效的上下文利用策略将特征学习解耦为分离的局部、多尺度和全局建模阶段,并进一步通过跨尺度重叠融合机制促进无缝的跨尺度整合。广泛的实验表明,EchoSR在多个基准上一致优于现有最先进的轻量超分辨率方法,同时也实现了更快的速度(约2倍)。源代码可在https://github.com/funnyWang-Echoes/EchoSR上获得。

英文摘要

Image super-resolution (SR) aims to reconstruct high-quality, high-resolution (HR) images from low-resolution (LR) inputs and plays a critical role in various downstream applications. Despite recent advancements, balancing reconstruction fidelity and computational efficiency remains a fundamental challenge, particularly in resource-constrained scenarios. While existing lightweight methods attempt to expand receptive fields, many of them either incur substantial computational overhead, naively scale up kernel sizes, or lack mechanisms for coherent multi-scale integration, limiting their overall effectiveness and scalability. To address these limitations, we propose EchoSR, an efficient context-harnessing framework for lightweight image super-resolution, which unifies multi-scale receptive field modeling and hierarchical context fusion. EchoSR decouples feature learning into disentangled local, multi-scale, and global modeling stages through an efficient context-harnessing strategy, and further promotes seamless cross-scale integration via a cross-scale overlapping fusion mechanism. Extensive experiments have shown that EchoSR consistently outperforms state-of-the-art lightweight super-resolution methods across multiple benchmarks, while also achieving a faster speed $(\sim 2\times)$. The source code is available at https://github.com/funnyWang-Echoes/EchoSR.

2605.17370 2026-05-20 cs.AI

CBT-Audio: Evaluating Audio Language Models for Patient-Side Distress Intensity Estimation in CBT Session Recordings

CBT-Audio: 评估音频语言模型以估计CBT会话录音中患者压力强度

Qixuan Hu, Shuchang Ye, Xumou Zhang, Anastasia Serafimovska, Anastasia Suraev, Amit Saha, Ping-hsiu Lin, Sydney Su, Usman Naseem, Adam G. Dunn, Jinman Kim

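AI summary This paper introduces CBT-Audio, a dataset for evaluating how well audio language models estimate patient distress intensity in CBT sessions, and finds that combining audio with transcript input improves estimation for most of the evaluated models.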

Comments 9 pages, 3 figures, 2 tables

详情
AI中文摘要

认知行为疗法被广泛用于帮助患者理解和管理心理压力。它通常通过口头交流进行,治疗师不仅关注患者所说的内容,还关注他们说话的方式,因为这些线索有助于治疗师决定如何回应和调整治疗。在构建AI系统用于CBT方面,进展主要局限于文本,部分原因是大多数可用数据集基于文本,而共享的 spoken CBT 数据在伦理和隐私约束下稀缺。这导致了盲点,因为基于文本的模型和评估无法捕捉文本和患者声音之间的不匹配,尽管治疗师经常依赖这种不匹配来理解患者的压力。我们引入了CBT-Audio,一个用于评估从 spoken CBT 会话中估计患者压力强度的音频语言模型的数据集。CBT-Audio包含96个公开可用的CBT录音中的1,802个患者发言,其中发言级别的压力标签已在专家标注的子集上验证。我们评估了10个开源音频语言模型,三种输入条件下,模型仅接收患者音频、仅接收转录文本或同时接收音频和转录文本。我们的结果表明,音频可以提供超出文本的信息,尤其是在与转录文本结合时。在10种模型家族中,有8种在添加音频到转录输入时,压力强度估计优于单独使用转录文本,其中4种有显著提升,案例研究显示当口头内容和语音表达不一致时,收益最明显。CBT-Audio使AI在CBT相关任务中可衡量患者的口语行为,支持未来音频语言模型在心理健康交互中的研究。

英文摘要

Cognitive behavioural therapy (CBT) is widely used to help patients understand and manage psychological distress. It is often delivered through spoken conversation, where therapists attend not only to what patients say, but also to how they say it, because these cues can help therapists decide how to respond and adapt treatment. Progress in building AI systems for CBT remains largely limited to text, partly because most available datasets are text-based and shareable spoken CBT data are scarce under ethical and privacy constraints. This creates a blind spot because text-based models and evaluations cannot capture the mismatch between the transcript and the patient's voice, even though therapists often rely on this mismatch to understand patient distress. We introduce CBT-Audio, a dataset for evaluating patient distress estimation from spoken CBT sessions with audio language models. CBT-Audio contains 1,802 patient turns from 96 publicly available CBT recordings, with turn-level distress labels validated on an expert-annotated subset. We evaluate 10 open-source audio language models under three input conditions, where models receive only patient audio, only the transcript, or both audio and transcript. Our results show that audio can provide useful information beyond text, especially when combined with transcripts. Adding audio to transcript input improves distress estimation over using the transcript alone in 8 of 10 model families, with significant gains in 4, and case studies show the clearest benefit when verbal content and vocal delivery diverge. CBT-Audio makes spoken patient behaviour measurable for AI evaluation in CBT-related tasks and supports future work on audio language models for mental health interaction.

2605.17340 2026-05-20 cs.LG

Olivia: Harmonizing Time Series Foundation Models with Power Spectral Density

Olivia:通过功率谱密度和谐化时间序列基础模型

Jingru Fei, Kun Yi, Alex Xing Wang, Qingsong Wen, Xiangxiang Zhu, Wei Fan

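AI summary This paper proposes Olivia, a time series foundation model built on harmonization mechanisms that use power spectral density in the spectral domain to reduce mismatches across datasets and strengthen pretraining, achieving state-of-the-art performance in zero-shot, few-shot, and full-shot forecasting.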

Comments Accepted by ICML 2026

详情
AI中文摘要

时间序列基础模型依赖于在跨领域多样数据集上进行大规模预训练,但其在时间模式上的异质性可能会阻碍训练和学习可迁移的时间序列表示的有效性。受信号处理中归一化功率谱密度(PSD)基本概念的启发,我们假设通过频域中的PSD和谐化数据集可以减少不匹配并增强预训练。我们超越了直接不可行的最小化优化,创新性地将其重新表述为一种原则性的和谐化方法。具体而言,我们提出Harmonizer模块,该模块重塑频谱结构并隐式地在不同数据集中和谐化PSD,这在理论上对应于第二阶时间相关性的共享重参数化。我们的理论分析进一步揭示,与Harmonizer交互的token可以通过紧凑的共振器集合高效地进行调解,从而启发了HarmonicAttention设计,该设计在低维交互空间中执行自注意力。然后,我们提出Olivia,一种基于这些和谐化机制的新时间序列基础模型。在两个大规模基准(TSLib和GIFT-Eval)以及额外的6个GluonTS数据集上的广泛实验表明,Olivia在零样本、少样本和全样本预测场景中一致实现了最佳性能。我们的代码可在https://github.com/TSTS13/Olivia上获得。

英文摘要

Time series foundation models rely on large-scale pretraining over diverse datasets across domains, yet the heterogeneity of temporal patterns in these datasets could hinder training and the learning of transferable time series representations. Inspired by a fundamental concept in signal processing, the normalized power spectral density (PSD), we hypothesize that harmonizing datasets via PSDs in the spectral domain could reduce mismatches and enhance pretraining. We then go beyond the direct, intractable minimization and reformulate it as a principled harmonization approach. Specifically, we propose Harmonizer, a module that reshapes spectral structures and implicitly harmonizes PSDs across datasets, which theoretically corresponds to a shared reparameterization of second-order temporal correlations. Our theoretical analysis further reveals that token interactions with Harmonizer can be efficiently mediated by a compact set of resonators, motivating a HarmonicAttention design that performs self-attention in a low-dimensional interaction space. We then propose Olivia, a novel time series foundation model built upon these harmonization mechanisms. Extensive experiments on two large-scale benchmarks (TSLib and GIFT-Eval) and six additional datasets from GluonTS demonstrate that Olivia consistently achieves state-of-the-art performance under zero-shot, few-shot, and full-shot forecasting scenarios. Our code is available at https://github.com/TSTS13/Olivia.
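For readers unfamiliar with the normalized PSD, the following minimal sketch computes a one-sided periodogram with a naive DFT and normalizes it to sum to one. It is not the Harmonizer module: the function name `normalized_psd` and the periodogram estimator are assumptions chosen only to show why a normalized PSD is comparable across datasets with different scales and offsets.

```python
import cmath
import math

def normalized_psd(series):
    """One-sided periodogram of a mean-centered series, normalized to sum to 1."""
    n = len(series)
    mean = sum(series) / n
    centered = [v - mean for v in series]
    power = []
    for k in range(n // 2 + 1):
        # Naive O(n^2) DFT coefficient at frequency bin k.
        coeff = sum(c * cmath.exp(-2j * math.pi * k * t / n)
                    for t, c in enumerate(centered))
        power.append(abs(coeff) ** 2)
    total = sum(power)
    return [p / total for p in power] if total > 0 else power

# Two hypothetical series with the same temporal pattern (3 cycles per window)
# but very different amplitude and offset, as might come from different domains.
n = 32
small = [math.sin(2 * math.pi * 3 * t / n) for t in range(n)]
large = [100.0 * v + 50.0 for v in small]
psd_small = normalized_psd(small)
psd_large = normalized_psd(large)
```

Both series yield the same normalized PSD, concentrated in bin 3, even though their raw values differ by two orders of magnitude; the normalization removes scale so that only the spectral shape remains to be compared.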

2605.17046 2026-05-20 cs.LG cs.AI cs.CL

1GC-7RC: One Graphic Card -- Seven Research Challenges! How Good Are AI Agents at Doing Your Job?

1GC-7RC:一张图形卡——七个研究挑战!AI代理在做你的工作方面有多好?

Robin-Nico Kampa, Fabian Deuser, Anna Bößendörfer, Konrad Habel, Norbert Oswald

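AI summary This paper introduces the 1GC-7RC benchmark, which uses seven cross-domain machine learning tasks to evaluate how well AI agents design, implement, and train models from scratch, revealing differences among agents in implicit ML knowledge, planning ability, and time-budget management.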

详情
AI中文摘要

自主AI编码代理正成为机器学习从业者在工业和研究中不可或缺的工具。尽管这种应用日益广泛,但尚无标准化基准来评估其在不同领域从头设计、实现和训练模型的能力。我们引入了1GC-7RC(单张图形卡:七个研究挑战),该基准包含七个机器学习任务,涵盖语言建模、图像分类、语义分割、图学习、表格预测、时间序列预测和文本分类。每个任务都提供锁定的数据准备和评估脚本以及基线训练脚本;代理只能修改训练代码,无法访问预训练权重(语义分割任务有一个受控例外),无法访问互联网,并必须在单个GPU上完成每个任务的时间预算(40-120分钟)。我们评估了七个编码代理:五个专有(Claude Code with Sonnet 4.6、Opus 4.6和Opus 4.7;Codex CLI with GPT 5.5;和OpenCode with Qwen 3.6+)和两个开源(OpenCode with Kimi K2.5、Kimi K2.6)。在每个代理-任务对的5次运行中,我们报告了显著的性能差异,揭示了不同代理在隐式机器学习知识、规划能力和时间预算管理方面的不同水平。该基准、工具和所有评估成果均在GitHub上公开,以促进未来代理的可重复比较。由于我们的基准设计是模块化的,该基准可以扩展到新任务和领域,适应不同的GPU预算,并用于研究多代理设置,使其成为未来自主研究代理研究的灵活平台。

英文摘要

Autonomous AI coding agents are becoming a core tool for ML practitioners in industry and research alike. Despite this growing adoption, no standardized benchmark exists to evaluate their ability to design, implement, and train models from scratch across diverse domains. We introduce **1GC-7RC** (*Single Graphic Card: Seven Research Challenges*), a benchmark comprising seven ML tasks spanning language modeling, image classification, semantic segmentation, graph learning, tabular prediction, time-series forecasting, and text classification. Each task provides a locked data-preparation and evaluation script together with a baseline training script; the agent may only modify the training code, has no access to pretrained weights (with one controlled exception for semantic segmentation), no internet access, and must complete each task within a task-specific wall-clock budget (40-120 minutes) on a single GPU. We evaluate seven coding agents: five proprietary (Claude Code with Sonnet 4.6, Opus 4.6, and Opus 4.7; Codex CLI with GPT 5.5; and OpenCode with Qwen 3.6+) and two open-source (OpenCode with Kimi K2.5, Kimi K2.6). Across 5 runs per agent-task pair, we report substantial performance differences that reveal varying levels of implicit ML knowledge, planning ability, and time-budget management. The benchmark, harness, and all evaluation artifacts are publicly available on GitHub at https://github.com/Strolchii/1GC-7RC-Benchmark to facilitate reproducible comparison of future agents. Because our benchmark design is modular, the benchmark can be extended to new tasks and domains, adapted to different GPU budgets, and used to study multi-agent settings, making it a flexible platform for future research on autonomous research agents.

2605.17003 2026-05-20 cs.LG cs.AI

Learning-Zone Energy: Online Data Selection for Efficient RL Post-Training

学习区能量:用于高效RL后训练的在线数据选择

Peng Cui, Boyao Yang, Jun Zhu

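AI summary This paper proposes Learning-Zone Energy (LZE), an online data selection framework that concentrates computation on the model's active learning frontier to make RL post-training more efficient. Experiments match or surpass full-data baselines while reducing estimated training FLOPs.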

详情
AI中文摘要

强化学习(RL)后训练已成为提取大语言模型(LLMs)数学推理能力的主要范式,但现有技术如GRPO和DAPO在提示上均匀分配rollout和梯度预算,浪费计算在已掌握的样本或远超模型当前能力的样本上。为解决这一根本性低效问题,我们提出学习区能量(LZE),一种理论支撑的完全在线数据选择框架,集中计算在模型的主动学习前沿。其核心是定义一个闭式学习区能量评分,融合三个互补信号,初始难度锚点、标准化结果不确定性项和通过率动量,形成一个单标量,可证明与组相对策略梯度更新的预期幅度一致。一个具有回放的前向修剪器进一步减少墙钟时间成本,通过跳过已解决提示的rollout生成,同时定期检查遗忘。在Qwen家族模型(1.5B-8B)上评估GSM8K、MATH和DAPO-MATH数据集,我们的方法每步仅保留40%的训练数据,却匹配或超越全数据基线,尤其在AIME25(+45.9%)和AMC23(+18.2%)上表现出显著的分布外收益,同时估计训练FLOPs减少约36%。我们的代码可在https://github.com/Stellaris167/LZE获取。

英文摘要

Reinforcement Learning (RL) post-training has emerged as the dominant paradigm for eliciting mathematical reasoning in Large Language Models (LLMs), yet prevailing techniques such as GRPO and DAPO distribute rollout and gradient budgets nearly uniformly across prompts, squandering compute on samples that are already mastered or remain far beyond the model's current capability. To address this fundamental inefficiency, we propose Learning-Zone Energy (LZE), a theoretically grounded, fully online data selection framework that concentrates computation on the model's active learning frontier. At its core, we define a closed-form Learning-Zone Energy Score that fuses three complementary signals, an initial-difficulty anchor, a normalized outcome-uncertainty term, and a pass-rate momentum, into a single scalar that is provably aligned with the expected magnitude of group-relative policy gradient updates. A forward pruner with replay further reduces wall-clock time cost by skipping rollout generation for persistently solved prompts while periodically checking for forgetting. Evaluated on Qwen-family models (1.5B-8B) across GSM8K, MATH and DAPO-MATH, our method retains only 40% of the training data per step yet matches or surpasses full-data baselines, with especially pronounced out-of-distribution gains on AIME25 (+45.9%) and AMC23 (+18.2%), alongside an estimated 36% reduction in training FLOPs. Our code is available at https://github.com/Stellaris167/LZE.
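The abstract does not give the closed form of the Learning-Zone Energy Score, so the sketch below is a hypothetical stand-in rather than the authors' formula. It assumes binary rewards, uses the normalized Bernoulli variance `4p(1-p)` for both the difficulty anchor and the outcome-uncertainty term, takes the absolute change in pass rate as momentum, and combines them with made-up weights. It then keeps the top 40% of prompts, mirroring the retention rate reported above.

```python
def lze_score_sketch(initial_pass_rate, pass_rate, prev_pass_rate,
                     w_anchor=0.2, w_unc=0.6, w_mom=0.2):
    """Hypothetical fused score; weights and functional form are assumptions."""
    anchor = 4.0 * initial_pass_rate * (1.0 - initial_pass_rate)
    # Peaks at pass_rate = 0.5 and vanishes at 0 or 1, where every rollout in a
    # group receives the same reward and group-relative advantages are zero.
    uncertainty = 4.0 * pass_rate * (1.0 - pass_rate)
    momentum = abs(pass_rate - prev_pass_rate)
    return w_anchor * anchor + w_unc * uncertainty + w_mom * momentum

def select_top_fraction(scores, keep=0.4):
    """Return the ids of the highest-scoring prompts (at least one)."""
    k = max(1, round(keep * len(scores)))
    ranked = sorted(scores, key=lambda pid: (-scores[pid], pid))
    return ranked[:k]

# Hypothetical prompts: (initial pass rate, current pass rate, previous pass rate).
prompts = {
    "solved": (1.0, 1.0, 1.0),
    "hopeless": (0.0, 0.0, 0.0),
    "frontier": (0.5, 0.5, 0.4),
    "improving": (0.1, 0.3, 0.1),
    "easy": (0.8, 0.9, 0.9),
}
scores = {pid: lze_score_sketch(*rates) for pid, rates in prompts.items()}
selected = select_top_fraction(scores, keep=0.4)
```

Under these assumed weights, the already-mastered and currently unsolvable prompts score zero and are skipped, while the two prompts near the learning frontier are retained.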

2605.16736 2026-05-20 cs.CV

CAB: Accelerating Flow and Diffusion Sampling via Rectification and Corrected Adams-Bashforth

CAB: 通过校正和修正Adams-Bashforth加速流和扩散采样

Anuska Roy, Pravin Nair

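AI summary This paper proposes CAB, a training-free sampler that transforms the sampling dynamics into a common rectified coordinate system and applies a multistep Adams-Bashforth predictor with a simple correction term based on past velocity evaluations, accelerating flow and diffusion models without additional function evaluations.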

详情
AI中文摘要

流和扩散模型能够实现高质量、高分辨率的图像合成,但通常在采样时需要大量的函数评估次数(NFEs)。现有的加速方法要么需要通过蒸馏进行额外训练,要么依赖于无需训练的高阶求解器,但两者在低NFE预算下都会降低样本质量。我们提出CAB(Corrected Adams-Bashforth),一种无需训练的采样器,能够加速流和扩散模型。CAB首先将采样动态转换为统一的校正坐标系,然后应用一个带有基于过去速度评估的简单修正项的多步Adams-Bashforth预测器,因此不增加额外的NFEs。所得到的方法简单,具有相同的算法形式,适用于所有模型类别,并且具有至少第三阶局部截断误差和第二阶全局误差。在预训练的流和扩散模型上进行的实验,包括类别条件和大规模文本到图像基准,表明CAB在6-20 NFEs的低步数范围内改进了质量-NFE权衡。它在大多数测试模型中在更高步数时与强大的无需训练采样器保持竞争力。官方实现可在https://github.com/Anuska-Roy/CAB上获得。

英文摘要

Flow and diffusion models achieve high-fidelity, high-resolution image synthesis, but often require many function evaluations (NFEs) at sampling time. Existing acceleration methods either require additional training through distillation or rely on training-free high-order solvers, and both can degrade sample quality at low NFE budgets. We propose CAB (Corrected Adams-Bashforth), a training-free sampler that accelerates both flow and diffusion models. CAB first transforms the sampling dynamics to a common rectified coordinate system, and then applies a multistep Adams-Bashforth predictor augmented with a simple correction term based on past velocity evaluations and therefore incurs no additional NFEs. The resulting method is simple, has the same algorithmic form across model classes, and has at least third-order local truncation error and second-order global error. Experiments on pretrained flow and diffusion models, including class-conditional and large-scale text-to-image benchmarks, show that CAB improves quality-NFE trade-offs in the low-step regime of 6-20 NFEs. It also remains competitive with strong training-free samplers at higher step counts across most tested models. The official implementation is available at https://github.com/Anuska-Roy/CAB.
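The benefit of reusing past velocity evaluations can be seen with the standard two-step Adams-Bashforth formula on a toy ODE. This sketch is not CAB itself: it omits the rectified coordinate transform and the correction term, and the test problem `dx/dt = -x` is an assumption chosen because its exact solution is known. Each step makes exactly one new velocity evaluation, so Euler and Adams-Bashforth use the same NFE budget.

```python
import math

def make_counted_velocity():
    """Velocity field for dx/dt = -x that counts its evaluations (NFEs)."""
    calls = {"n": 0}
    def velocity(x, t):
        calls["n"] += 1
        return -x
    return velocity, calls

def euler(velocity, x0, t0, t1, steps):
    h = (t1 - t0) / steps
    x, t = x0, t0
    for _ in range(steps):
        x += h * velocity(x, t)
        t += h
    return x

def adams_bashforth2(velocity, x0, t0, t1, steps):
    h = (t1 - t0) / steps
    x, t = x0, t0
    prev_v = None
    for _ in range(steps):
        v = velocity(x, t)  # the only new evaluation in this step
        if prev_v is None:
            x += h * v  # bootstrap the first step with Euler
        else:
            x += h * (1.5 * v - 0.5 * prev_v)  # reuse the cached past velocity
        prev_v = v
        t += h
    return x

steps = 10
exact = math.exp(-1.0)
v_euler, euler_calls = make_counted_velocity()
euler_error = abs(euler(v_euler, 1.0, 0.0, 1.0, steps) - exact)
v_ab2, ab2_calls = make_counted_velocity()
ab2_error = abs(adams_bashforth2(v_ab2, 1.0, 0.0, 1.0, steps) - exact)
```

At 10 NFEs each, the multistep update lands much closer to the exact value than Euler does, which is the kind of low-step-count gain that makes multistep predictors attractive for sampling.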

2605.16712 2026-05-20 cs.AI cs.CL cs.HC

Recall Isn't Enough: Bounding Commitments in Personalized Language Systems

回想起并不足够:在个性化语言系统中界定承诺

Rui Tang, Yichi Zhang, Xi Chen, Chen Dong, Youwei Yang, Yumeng Shen, Qiangqiang Liu

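AI summary This paper proposes Contract-Bounded Evidence Activation (CBEA) with Lexicographic Commitment Validation (LCV) to bound commitments in personalized language systems. Across 360 fixtures and three generation backends, it reaches zero failures within validator scope while lowering the median input payload.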

Comments 14 pages, 3 figures, 22 tables; preprint version

详情
AI中文摘要

长上下文和记忆系统通常将个性化视为召回问题。在实践中,许多故障发生在系统承诺时:它将嘈杂的提示转化为硬约束,丢弃罕见的见证,忘记下游义务,或在不可行的情况下作答。我们引入了合同界定证据激活(CBEA)与词典承诺验证(LCV)。CBEA通过类型覆盖、尾见证和后果债务激活一个有界的证据集;LCV在文本之前验证结构化的承诺,并将不可行的状态路由到修复、回避或再合同。在360个测试用例和三个生成后端上,CBEA+LCV在验证范围内达到零失败,可用性为0.49-0.60,而具有相同LCV门的原始和长上下文基线只有在0.003-0.092时才能达到零失败。一个影子 oracle 诊断标记了极限:CBEA+LCV召回了0.012个未编译的可见事实,而原始召回了0.53。结果是一个有界的操作点:显式的承诺控制和74-75%更低的中位数输入负载,而不是普遍的记忆主导。

英文摘要

Long-context and memory systems usually treat personalization as a recall problem. In practice, many failures occur later, when a system commits: it turns noisy hints into hard constraints, drops rare witnesses, forgets downstream obligations, or answers despite infeasibility. We introduce Contract-Bounded Evidence Activation (CBEA) with Lexicographic Commitment Validation (LCV). CBEA activates a bounded evidence set using typed coverage, tail witnesses, and consequence debt; LCV validates structured commitments before prose and routes infeasible states to repair, abstention, or recontract. Across 360 fixtures and three generation backends, CBEA+LCV reaches zero failures within validator scope at 0.49-0.60 availability over attempted runs. Raw and long-context baselines with the same LCV gate reach zero only at 0.003-0.092. A shadow oracle diagnostic marks the limit: CBEA+LCV recalls 0.012 of uncompiled visible facts, while raw recalls 0.53. The result is a bounded operating point: explicit commitment control and 74-75% lower median input payload, not universal memory dominance.

2605.16679 2026-05-20 cs.CL cs.AI

CHI-Bench: Can AI Agents Automate End-to-End, Long-Horizon, Policy-Rich Healthcare Workflows?

CHI-Bench: 能否让AI代理自动化端到端、长周期、政策丰富的医疗工作流程?

Haolin Chen, Deon Metelski, Leon Qi, Tao Xia, Joonyul Lee, Steve Brown, Kevin Riley, Frank Wang, T. Y. Alvin Liu, Hank Capps MD, Zeyu Tang, Xiangchen Song, Lingjing Kong, Fan Feng, Tianyi Zeng, Zhiwei Liu, Zixian Ma, Hang Jiang, Fangli Geng, Yuan Yuan, Chenyu You, Qingsong Wen, Hua Wei, Yanjie Fu, Yue Zhao, Carl Yang, Biwei Huang, Kun Zhang, Caiming Xiong, Sanmi Koyejo, Eric P. Xing, Philip S. Yu, Weiran Yao

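AI summary This paper introduces the CHI-Bench benchmark for evaluating how well AI agents automate end-to-end, long-horizon, policy-rich healthcare workflows, targeting three capabilities underrepresented in current benchmarks: policy density, multi-role composition, and multilateral interaction.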

Comments Website: https://actava.ai/benchmarks Code: https://github.com/actava-ai/chi-bench Dataset: https://huggingface.co/datasets/actava/chi-bench

详情
AI中文摘要

现实医疗操作的端到端自动化要求具备当前基准测试中较少体现的三种能力:政策密度,即决策必须基于大量医疗、保险和运营规则;多角色组成,即单个任务需要代理扮演多个角色并进行交接;以及多方交互,即中间工作流程步骤是多轮对话,例如同行评审和患者接触。我们介绍了CHI-Bench,一个涵盖三个领域的长周期医疗工作流程基准:提供者预先授权、支付方使用管理以及护理管理。每个任务都会将代理置于一个高保真模拟器中,该模拟器暴露了20个医疗应用程序,通过87个MCP工具。代理必须通过工具调用和编写角色的文档来驱动任务完成,受1,290多份文档管理护理操作手册技能的指导。在30种代理配置下,最佳代理仅能解决28.0%的任务,没有代理在严格通过标准下达到20%以上,且单次会话执行所有任务会将性能降至3.8%。这些结果提出了假设,即在其他政策密集、角色组成和不可逆的企业领域中,类似的差距可能会出现。

英文摘要

End-to-end automation of realistic healthcare operations stresses three capabilities underrepresented in current benchmarks: policy density, where decisions must be grounded in a large library of medical, insurance, and operational rules; multi-role composition, where a single task requires the agent to play multiple roles with handoffs; and multilateral interaction, where intermediate workflow steps are multi-turn dialogs, such as peer-to-peer review and patient outreach. We introduce $\chi$-Bench, a benchmark of long-horizon healthcare workflows across three domains: provider prior authorization, payer utilization management, and care management. Each task hands the agent a clinical case in a high-fidelity simulator of 20 healthcare apps exposed via 87 MCP tools, which it must drive to a terminal status through tool calls and by writing the role's artifacts, guided by a managed-care operations handbook skill of 1,290+ documents. Across 30 agent harness/model configurations, the best agent resolves only 28.0% of tasks, no agent clears 20% on strict pass^3, and executing all tasks in a single session drops performance to 3.8%. These results raise the hypothesis that similar gaps are likely to surface in other policy-dense, role-composed, irreversible enterprise domains.
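The abstract reports a strict pass^3 score without defining it. A common reading of pass^k in agent benchmarks, assumed here, is that a task counts as passed only if all k independent trials succeed. The sketch below computes that quantity; the function name, task ids, and trial outcomes are all hypothetical.

```python
def pass_hat_k(trial_results, k):
    """Fraction of tasks whose first k trials ALL succeeded.

    Assumption: tasks with fewer than k recorded trials are excluded.
    """
    eligible = [runs for runs in trial_results.values() if len(runs) >= k]
    if not eligible:
        return 0.0
    return sum(all(runs[:k]) for runs in eligible) / len(eligible)

# Hypothetical outcomes of three independent trials per task.
results = {
    "prior_auth_01": [True, True, True],
    "util_mgmt_07": [True, False, True],
    "util_mgmt_09": [True, True, False],
    "care_mgmt_12": [False, False, False],
}
single_trial = pass_hat_k(results, k=1)
strict_three = pass_hat_k(results, k=3)
```

Here three of four tasks pass on the first trial but only one passes all three, which shows how a strict repeated-trial metric penalizes agents that succeed inconsistently.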

2605.16445 2026-05-20 cs.LG cs.AI

Membership Inference Attacks on Discrete Diffusion Language Models

对离散扩散语言模型的成员推断攻击

Shailesh Kasivelrajan

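AI summary This paper studies membership inference attacks on fine-tuned MDLMs, finds them more vulnerable than existing grey-box baselines suggest, and designs a shadow-model transfer attack that is nearly as effective.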

Comments Citations and Co Authors need to be verified and updated. Will submit a new version soon

详情
AI中文摘要

掩码扩散语言模型(Masked Diffusion Language Models,MDLMs)以迭代去掩码取代自回归生成,其隐私属性在很大程度上尚未被研究。我们研究了针对微调后MDLMs的成员推断攻击(MIA),发现其比现有灰盒基线所显示的明显更容易受到攻击。我们从模型在四个掩码比率下的重建损失中提取46维特征向量,并在其上训练XGBoost和MLP分类器。在涵盖六个文本领域的MIMIR基准上,XGBoost的平均AUC达到0.878,在Pile CC上达到峰值0.930,平均比SAMA灰盒基线高出0.062 AUC。留一信号(leave-one-signal-out)消融实验显示,仅ELBO轨迹就贡献了大部分效果,移除后平均下降0.130,而注意力特征几乎没有贡献(低于0.003)。我们还设计了一种影子模型迁移攻击:K=3个在无关领域数据上训练的替代MDLMs在无法访问目标领域的情况下生成分类器标签。该攻击的平均AUC为0.858,与白盒oracle的差距在0.020以内,表明影子模型迁移是一条实用且效果几乎相当的攻击路径。

英文摘要

Masked Diffusion Language Models (MDLMs) replace autoregressive generation with iterative demasking, and their privacy properties are largely unstudied. We study membership inference attacks (MIA) on fine-tuned MDLMs and show that they are significantly more vulnerable than current grey-box baselines suggest. We extract a 46-dimensional feature vector from the model's reconstruction loss at four masking ratios and train XGBoost and MLP classifiers on top. On the MIMIR benchmark across six text domains, XGBoost achieves a mean AUC of 0.878, peaking at 0.930 on Pile CC, and beats the SAMA grey-box baseline by 0.062 AUC on average. A leave-one-signal-out ablation shows that the ELBO trajectory alone drives most of this, with a mean drop of 0.130 when removed, while attention features add almost nothing (below 0.003). We also design a shadow-model transfer attack in which $K = 3$ surrogate MDLMs trained on data from unrelated domains generate classifier labels with no access to the target domain. This achieves a mean AUC of 0.858, within 0.020 of the white-box oracle, and establishes shadow-model transfer as a practical and nearly as effective attack path.
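The AUC figures above measure how well an attack score separates training members from non-members. The sketch below shows that evaluation on made-up data, assuming a much simpler attack than the paper's 46-dimensional XGBoost classifier: the score is just the negated mean reconstruction loss over four masking ratios, on the premise that members tend to have lower loss. All loss values and the `roc_auc` helper are illustrative.

```python
def roc_auc(scores, labels):
    """AUC as the probability that a member outscores a non-member (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def attack_score(losses_by_ratio):
    """Higher score means more likely a member: negated mean reconstruction loss."""
    return -sum(losses_by_ratio) / len(losses_by_ratio)

# Hypothetical per-sample reconstruction losses at four masking ratios.
members = [[0.8, 1.1, 1.5, 2.0], [0.9, 1.2, 1.6, 2.2], [1.4, 1.8, 2.3, 2.9]]
non_members = [[1.3, 1.7, 2.2, 2.8], [1.6, 2.0, 2.5, 3.1], [1.5, 1.9, 2.4, 3.0]]

scores = [attack_score(x) for x in members + non_members]
labels = [1] * len(members) + [0] * len(non_members)
auc = roc_auc(scores, labels)
```

On this toy data, eight of the nine member/non-member pairs are ranked correctly. An AUC of 0.5 would mean the score carries no membership signal at all.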

2605.16353 2026-05-20 cs.CV cs.AI

StrLoRA: Towards Streaming Continual Visual Instruction Tuning for MLLMs

StrLoRA: 向流式连续视觉指令微调迈进以适应大规模多模态语言模型

Chang Che, Ziqi Wang, Hui Ma, Cheems Wang, Zenglin Shi

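AI summary This paper proposes StrLoRA, a method for streaming continual visual instruction tuning that addresses continual learning over dynamic task streams, using a task-aware expert routing framework to improve performance on continuously evolving data streams.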

详情
AI中文摘要

持续视觉指令微调(CVIT)使多模态大语言模型能够逐步获得新能力。然而,现有CVIT方法在任务增量设置下运行,每个训练阶段对应一个预定义任务,这不符合现实世界中数据作为连续流中交织和动态变化的任务的条件。为弥合这一差距,我们引入流式CVIT(StrCVIT),一种更通用和现实的设置,其中模型从包含动态混合任务的数据块中学习。在StrCVIT中,模型必须同时获得新能力、强化常见能力并减轻遗忘。现有CVIT方法在此处失败,因为它们无法可靠地区分或适应每个块内的异构任务样本。因此,我们提出了StrLoRA,一种正则化的两阶段专家路由框架。StrLoRA首先使用文本指令进行任务感知的专家选择,激活相关专家的稀疏子集,减少跨任务干扰。然后在该子集内应用基于令牌的专家加权,其中贡献权重通过本地视觉令牌与全局指令表示之间的跨模态注意力计算。为了在非平稳流中保持稳定性,路由稳定性正则化将当前路由分布与历史指数移动平均参考对齐。在新开发的StrCVIT基准上的广泛实验表明,StrLoRA显著优于现有方法,有效提升了模型从持续演变的数据流中获取能力的能力。代码可在https://github.com/chanceche/StrCVIT获取。

英文摘要

Continual Visual Instruction Tuning (CVIT) enables Multimodal Large Language Models to incrementally acquire new abilities. However, existing CVIT methods operate under a restrictive task-incremental setting, where each training phase corresponds to a single, predefined task. This does not reflect real-world conditions, where data arrives as a continuous stream of interleaved and dynamically evolving tasks. To bridge this gap, we introduce Streaming CVIT (StrCVIT), a more general and realistic setting where models learn from a stream of data chunks containing a dynamic mixture of tasks. In StrCVIT, a model must simultaneously acquire new abilities, reinforce recurring abilities, and mitigate forgetting. Existing CVIT methods fail here because they cannot reliably distinguish or adapt to the heterogeneous task samples within each chunk. We therefore propose StrLoRA, a regularized two-stage expert routing framework. StrLoRA first performs task-aware expert selection using the textual instruction to activate a sparse subset of relevant experts, reducing cross-task interference. It then applies token-wise expert weighting within this subset, where contribution weights are computed via cross-modal attention between local visual tokens and the global instruction representation. To maintain stability across the non-stationary stream, a routing-stability regularization aligns current routing distributions with a historical exponential moving average reference. Extensive experiments on a newly developed StrCVIT benchmark show that StrLoRA substantially outperforms existing methods, effectively improving the model's ability to learn from continuously evolving data streams. The code is available at https://github.com/chanceche/StrCVIT.
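The routing-stability regularization can be sketched as follows. The abstract states only that current routing distributions are aligned with a historical exponential moving average (EMA) reference; the use of KL divergence as the alignment penalty, the decay value, and the example distributions over three experts are all assumptions made for illustration.

```python
import math

def ema_update(reference, current, decay=0.9):
    """Blend the current routing distribution into the historical reference."""
    return [decay * r + (1.0 - decay) * c for r, c in zip(reference, current)]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions; eps guards against log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

# Hypothetical routing distributions over three experts.
reference = [0.7, 0.2, 0.1]    # historical EMA of routing weights
stable = [0.68, 0.22, 0.10]    # current routing close to history
drifted = [0.1, 0.2, 0.7]      # current routing that abandons past experts

penalty_stable = kl_divergence(stable, reference)
penalty_drifted = kl_divergence(drifted, reference)
new_reference = ema_update(reference, drifted)
```

The drifted routing incurs a far larger penalty than the stable one, while the slow EMA update moves the reference only slightly, so a single unusual chunk cannot rewrite the routing history.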

2605.15975 2026-05-20 cs.AI cs.RO

Learning Bilevel Policies over Symbolic World Models for Long-Horizon Planning

Dillon Z. Chen, Till Hofmann, Toryn Q. Klassen, Sheila A. McIlraith

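AI summary This paper proposes a bilevel policy that combines low-level imitation learning with high-level symbolic abstraction to solve long-horizon planning problems. The resulting BISON system is evaluated on extended MetaWorld benchmarks, where it shows an advantage on tasks with many objects and long horizons.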

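Details
AI abstract (translated from the Chinese)

We address the challenge of building embodied AI agents that can reliably solve long-horizon planning problems. Imitation learning from demonstrations has proven effective for training robots to solve complex tasks that require fine motor control and manipulation. However, generating long-horizon plans from imitation learning alone remains difficult. In contrast, high-level (HL) symbolic abstractions enable efficient and interpretable long-horizon planning. We propose combining the strengths of low-level (LL) imitation learning for manipulation and control with those of HL symbolic abstractions for long-horizon planning. We realise this idea through bilevel policies (π^hl, π^ll), consisting of a neural policy π^ll learned from LL demonstrations and an HL symbolic policy π^hl constructed from symbolic abstractions of the LL demonstrations combined with inductive generalisation. We implement these ideas in the BISON system. Experiments on extended MetaWorld benchmarks show that BISON generalises to long horizons and to problems with more objects than those solved by VLA and end-to-end methods, and that it is more time- and memory-efficient in training and inference. Notably, when LL execution is ignored, BISON's HL policies can solve HL problems with 10,000 relevant objects in under a minute. Project page: https://dillonzchen.github.io/bison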

English abstract

We tackle the challenge of building embodied AI agents that can reliably solve long-horizon planning problems. Imitation learning from demonstrations has shown itself to be effective in training robots to solve a diversity of complex tasks requiring fine motor control and manipulation over low-level (LL), continuous environments. Yet, it remains a difficult endeavour to generate long-horizon plans from imitation learning alone. In contrast, high-level (HL), symbolic abstractions facilitate efficient and interpretable long-horizon planning. We propose to combine the strengths of LL imitation learning for manipulation and control, and HL symbolic abstractions for long-horizon planning. We realise this idea via bilevel policies of the form $(\pi^{\mathrm{hl}}, \pi^{\mathrm{ll}})$, consisting of a neural policy $\pi^{\mathrm{ll}}$ learned from LL demonstrations, and an HL symbolic policy $\pi^{\mathrm{hl}}$ that is constructed from symbolic abstractions of the LL demonstrations combined with inductive generalisation. We implement these ideas in the BISON system. Experiments on extended MetaWorld benchmarks demonstrate that BISON generalises to long horizons and problems with greater numbers of objects than those solved by VLA and end-to-end methods, and is more time and memory efficient in training and inference. Notably, when ignoring LL execution, BISON's HL policies can solve HL problems with 10,000 relevant objects in under a minute. Project page: https://dillonzchen.github.io/bison

2605.15768 2026-05-20 cs.AI cs.CY

ALSO: Adversarial Online Strategy Optimization for Social Agents

Xiang Li, Liping Yi, Mingze Kong, Min Zhang, Zhongxiang Dai, QingHua Hu

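AI summary This paper proposes the ALSO framework, which models multi-turn interaction as an adversarial bandit problem and introduces a lightweight neural surrogate to predict rewards, enabling robust strategy optimization for social agents in dynamic environments.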

Comments Accepted at ICML 2026

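Details
AI abstract (translated from the Chinese)

Social simulation offers a powerful testbed for studying social intelligence: agents interact through multi-turn dialogue under evolving contexts and against strategically adapting opponents. Such environments are inherently non-stationary and require agents to adjust their strategies dynamically. However, most social agents based on large language models (LLMs) rely on static personas, while existing approaches to enhancing social intelligence, such as offline reinforcement learning or external planners, are ill-suited to these settings because they typically assume stationarity and incur substantial training overhead. To bridge this gap, we propose ALSO (Adversarial Online Strategy Optimization), the first online strategy optimization framework for multi-agent social simulation. ALSO improves social adaptation through two key contributions: (1) it formulates multi-turn interaction as an adversarial bandit problem in which combinations of static personas and dynamic strategy instructions are treated as arms, giving a solution that does not depend on assumptions of environmental stability; (2) to generalize from the sparse feedback in multi-turn dialogue, it introduces a lightweight neural surrogate that predicts rewards from interaction histories, enabling sample-efficient exploration and continuous online adaptation. On the Sotopia benchmark, ALSO consistently outperforms static baselines and existing optimization methods in dynamic environments, validating the effectiveness of adversarial online strategy optimization for building robust social agents.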

English abstract

Social simulation provides a compelling testbed for studying social intelligence, where agents interact through multi-turn dialogues under evolving contexts and strategically adapting opponents. Such environments are inherently non-stationary, requiring agents to dynamically adjust their strategies over time. However, most Large Language Model (LLM) based social agents rely on static personas, while existing approaches for enhancing social intelligence, such as offline reinforcement learning or external planners, are ill-suited to these settings, typically assuming stationarity and incurring substantial training overhead. To bridge this gap, we propose ALSO (Adversarial onLine Strategy Optimization), the first framework for online strategy optimization in multi-agent social simulation. ALSO advances social adaptation through two key contributions. (1) ALSO formulates multi-turn interaction as an adversarial bandit problem, where combinations of static personas and dynamic strategy instructions are treated as arms, providing a principled solution to non-stationarity without relying on environmental stability assumptions. (2) To predict rewards and generalize sparse feedback in multi-turn dialogues, ALSO introduces a lightweight neural surrogate to predict rewards from interaction histories, enabling sample-efficient exploration and continuous online adaptation. Experiments on the Sotopia benchmark demonstrate that ALSO consistently outperforms static baselines and existing optimization methods in dynamic environments, validating the effectiveness of adversarial online strategy optimization for building robust social agents.
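
For readers unfamiliar with the adversarial bandit setting named in the abstract, the sketch below shows EXP3, a standard textbook algorithm for it. The abstract does not say which bandit algorithm ALSO uses, so EXP3 is a stand-in here, and the neural reward surrogate is not modeled. The class name `Exp3`, the `gamma` value, and the idea of indexing arms by integer (each standing for one hypothetical persona-plus-strategy combination) are assumptions for illustration.

```python
import math
import random


class Exp3:
    """Textbook EXP3 for adversarial bandits with rewards in [0, 1]."""

    def __init__(self, num_arms, gamma=0.1, seed=0):
        self.num_arms = num_arms
        self.gamma = gamma  # exploration rate, 0 < gamma <= 1
        self.weights = [1.0] * num_arms
        self.rng = random.Random(seed)

    def probabilities(self):
        """Mix the weight-proportional distribution with uniform exploration."""
        total = sum(self.weights)
        k = self.num_arms
        return [(1.0 - self.gamma) * w / total + self.gamma / k for w in self.weights]

    def select_arm(self):
        """Sample an arm index according to the current probabilities."""
        probs = self.probabilities()
        return self.rng.choices(range(self.num_arms), weights=probs, k=1)[0]

    def update(self, arm, reward):
        """Apply the importance-weighted exponential update to the pulled arm."""
        if not 0.0 <= reward <= 1.0:
            raise ValueError("reward must lie in [0, 1]")
        probs = self.probabilities()
        estimate = reward / probs[arm]  # unbiased estimate of the arm's reward
        self.weights[arm] *= math.exp(self.gamma * estimate / self.num_arms)
        # Rescale to avoid overflow; the probabilities are unchanged by this.
        largest = max(self.weights)
        self.weights = [w / largest for w in self.weights]


if __name__ == "__main__":
    # Hypothetical arms: each index stands for one persona/strategy combination.
    bandit = Exp3(num_arms=3, gamma=0.1, seed=0)
    for _ in range(500):
        arm = bandit.select_arm()
        bandit.update(arm, 1.0 if arm == 2 else 0.0)  # toy reward, not Sotopia
    print([round(p, 3) for p in bandit.probabilities()])
```

EXP3 makes no assumption that rewards come from a fixed distribution, which is why the adversarial formulation suits non-stationary opponents. The uniform exploration term keeps every arm's probability at or above `gamma / num_arms`, so no arm is abandoned permanently.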