arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.09639 2026-06-09 cs.CV New submission

CineDance: Towards Next-Generation Multi-Shot Long-Form Cinematic Audio-Video Generation

Yuheng Chen, Teng Hu, Yuji Wang, Qingdong He, Zhucun Xue, Qianyu Zhou, Xiangtai Li, Lizhuang Ma, Jiangning Zhang, Dacheng Tao

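Institutions: Shanghai Jiao Tong University; University of Electronic Science and Technology of China; Zhejiang University; The University of Tokyo; Nanyang Technological University

AI summary: Introduces CineDance-1M, a large-scale dataset for multi-shot, long-form audio-video generation, built with a three-stage curation pipeline and paired with the CineBench evaluation suite, to enable high-quality joint generation.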

English abstract

The fidelity and structural diversity of training datasets fundamentally determine the capabilities of video generation models. While commercial systems show remarkable ability to generate cinematic narratives, the progress of open-source models remains limited by the scarcity of high-quality training data. To bridge this gap, we introduce CineDance-1M, a large-scale, open research Text-to-Audio-Video (T2AV) dataset designed specifically for multi-shot, long-form joint audio-video generation. Averaging 92.8 seconds and 24.2 continuous shots per video, it provides configurable, structured annotations for both audio and video modalities. This exceptional quality is achieved through a rigorous three-stage curation pipeline: i) diverse sourcing and comprehensive cleansing, ii) film-theory-inspired narrative parsing, and iii) hierarchical dual-modal captioning. For a comprehensive assessment, we propose CineBench, featuring a diverse prompt suite and a six-dimensional, human-aligned metric system tailored for complex narrative audio-video evaluation. Furthermore, we adapt LTX-2.3 into CineDance, which demonstrates exceptional single-modality quality alongside precise audio-video alignment and robust subject and environment consistency, effectively validating our curation strategy and the high quality of CineDance-1M. We anticipate that this work will serve as a solid foundation for accelerating future research in multi-shot, long-form joint audio-video generation. Our project page is available at https://aliothchen.github.io/projects/CineDance/.

2606.09638 2026-06-09 cs.LG cs.SC math-ph math.MP physics.comp-ph stat.AP New submission

Data-driven discovery of governing differential equations across physical systems

Siyu Lou, Hao Xu, Wenguan Wang, Lu Lu, Hao Sun, Yang Liu, Linfeng Zhang, Dongxiao Zhang, Yuntian Chen

Institutions: School of Computer Science, Shanghai Jiao Tong University; Ningbo Key Laboratory of Advanced Manufacturing Simulation, Eastern Institute of Technology; The State Key Lab of Brain-Machine Intelligence, Zhejiang University; Department of Statistics and Data Science, Yale University; Department of Chemical and Environmental Engineering, Yale University; Gaoling School of Artificial Intelligence, Renmin University of China; School of Engineering Sciences, University of Chinese Academy of Sciences; DP Technology

AI summary: Proposes a problem-oriented perspective that organizes equation discoverability in a two-dimensional phase diagram and introduces the representation-evaluation-optimization (REO) framework as an abstraction of the discovery process, with the aim of inferring physical laws from data and supporting theory revision and the formation of new concepts.

English abstract

Differential equations play a critical role in scientific discovery because they provide a mathematical framework to describe the behaviour of physical phenomena. As a promising alternative to traditional first principles, data-driven differential equation discovery has attracted increasing attention for its ability to infer governing laws directly from experimental or simulated data, especially when the underlying physics is unclear. However, the field has expanded rapidly along diverse methodological directions, particularly with the emergence of AI-based approaches, and still lacks a clear organizing perspective. In this Review, we propose a problem-oriented perspective on data-driven differential equation discovery. We first introduce a two-dimensional phase diagram of equation discoverability, where discovery problems are organized according to structural complexity and coefficient complexity. This phase diagram shows how the field has moved from the discovery of sparse equations with simple coefficients toward more complex governing laws with richer structures and more flexible parameterizations. It also clarifies why different methodological families succeed or fail in different problem settings. We then present the representation-evaluation-optimization (REO) framework as a fundamental abstraction of the discovery process. By identifying the core problems of equation discovery that persist across algorithmic variations, REO shifts the discussion from individual algorithms to the fundamental principles that determine discoverability. We connect these perspectives to applications across physics and adjacent sciences, and argue that the next challenge is not merely recovering equations, but using them to revise existing theories, distil mechanisms and form new scientific concepts.

2606.09632 2026-06-09 cs.CL New submission

Civil Court Simulation with Large Language Models

Yifan Chen, Haitao Li, Kaiyuan Zhang, Yueyue Wu, Qingyao Ai, Yiqun Liu

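Institutions: Beijing University of Posts and Telecommunications; Tsinghua University

AI summary: Proposes a multi-agent civil court simulation framework that uses a five-stage trial procedure, a memory module, and statute retrieval to produce reliable judgments, with strong results on liability allocation and multi-item adjudication.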

English abstract

Court simulation bridges legal education and judicial practice, yet human-based simulations are costly and difficult to scale. Large language models (LLMs) offer a scalable alternative, but existing court-simulation research mainly focuses on criminal cases. Civil litigation is more common in practice and harder to simulate because its claims, liability, and remedies are more flexible. We present a multi-agent court simulation framework for Chinese civil cases. The framework organizes role-based interaction through a five-stage civil trial procedure and integrates a memory module and statute retrieval to support long-process adjudication. Experiments show that the framework produces reliable civil judgments, with clear strengths in liability allocation and multi-item adjudication. Further experiments show that memory quality substantially affects downstream simulation quality. Through a five-layer factor framework, we analyze how legal grounding, information conditions, judicial capability and role orientation, organizational pressure, and social context affect the framework's reliability and behavior. These results support the effectiveness of the proposed framework for civil court simulation. The dataset and code are available at https://github.com/foggpoy/Civil-Court.

2606.09630 2026-06-09 cs.RO cs.AI cs.LG New submission

ReCoVLA: VLM-Guided Reward Compilation for Failure Recovery in Vision-Language-Action Policies

Haodi Hu, Chung-Ta Huang, Jing Liu, Ye Wang, Kei Suzuki, Matthew Brand, Toshiaki Koike-Akino

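Institutions: University of Southern California; Mitsubishi Electric Research Laboratories (MERL); Harvard University

AI summary: Proposes ReCoVLA, a framework that keeps a pretrained VLA policy frozen, uses an external VLM to infer failure modes and compile structured rewards, and trains residual recovery policies that deploy zero-shot from simulation to the real world, improving success rates across several manipulation tasks.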

Comments
19 pages, 7 figures

English abstract

Vision-language-action (VLA) policies provide strong priors for language-conditioned manipulation, but remain brittle in off-nominal states requiring targeted recovery. We propose ReCoVLA -- a failure-conditioned residual recovery framework that keeps a pretrained VLA policy frozen, uses an external vision-language model (VLM) to infer the failure mode and recovery stage, and compiles a structured reward from task-relevant components. Rather than using the VLM to generate actions or rewards directly, ReCoVLA uses it as a semantic reward selector: it predicts a recovery descriptor and reward mask for in-simulation residual-policy training, followed by zero-shot sim-to-real deployment of the trained recovery policies. This decouples high-level failure understanding from low-level corrective control to support different VLAs. Experiments across short-horizon, long-horizon, and contact-rich manipulation tasks show that ReCoVLA outperforms the tested baselines on average. In simulation, our reward compiler improves average success from 36.7% for the fine-tuned $π_{0.5}$ baseline to 66.7%. In physical zero-shot sim-to-real experiments, ReCoVLA achieves the best average performance, with 61.7% success.
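
The "semantic reward selector" idea in this abstract can be illustrated with a small sketch. This is not the authors' implementation: the component names, the state fields, the mask values, and the function `compile_reward` are all hypothetical, chosen only to show how a predicted mask could select and weight task-relevant reward terms.

```python
from typing import Callable, Dict

# Hypothetical task-relevant reward components (assumed for illustration).
# Each maps a simulator state dictionary to a scalar reward term.
REWARD_COMPONENTS: Dict[str, Callable[[dict], float]] = {
    "reach": lambda s: -s["gripper_to_object_dist"],
    "grasp": lambda s: 1.0 if s["object_grasped"] else 0.0,
    "lift": lambda s: s["object_height"],
}


def compile_reward(mask: Dict[str, float]) -> Callable[[dict], float]:
    """Build a reward function as the mask-weighted sum of known components."""
    unknown = set(mask) - set(REWARD_COMPONENTS)
    if unknown:
        raise KeyError(f"unknown reward components: {sorted(unknown)}")

    def reward(state: dict) -> float:
        # Components with zero weight are skipped entirely.
        return sum(
            weight * REWARD_COMPONENTS[name](state)
            for name, weight in mask.items()
            if weight != 0.0
        )

    return reward


# A mask such as a VLM might predict for a re-grasp recovery stage (assumed values).
regrasp_reward = compile_reward({"reach": 1.0, "grasp": 1.0, "lift": 0.0})
state = {"gripper_to_object_dist": 0.25, "object_grasped": True, "object_height": 0.5}
print(regrasp_reward(state))
```

In this sketch the model never outputs a reward value or an action; it only chooses which fixed terms are active, which is the separation between failure understanding and low-level control that the abstract describes.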

2606.09623 2026-06-09 cs.LG New submission

Constrained user-item allocation for e-commerce marketing campaigns

Maja Lindström, Natalija Glisovic, Jan von Pichowski, Tommy Löfstedt, Martin Rosvall

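Institutions: Umeå University; KTH Royal Institute of Technology; University of Würzburg

AI summary: Proposes auto-targeting, which jointly selects users and items to build multiple non-overlapping marketing campaigns using constrained spectral biclustering, greedy local search, and a multi-armed bandit framework; the methods outperform simulated annealing on synthetic data, Amazon Reviews, and commercial data.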

English abstract

When running marketing campaigns, retailers must decide which products to promote and which users to target. These decisions are inherently coupled: effective campaigns match users and items with strong mutual affinity into non-overlapping groups of predefined sizes. However, existing approaches assume predefined campaign structure or decouple item selection from user assignment, and cannot discover campaign groupings directly from joint interaction patterns. We therefore formalize this campaign problem as auto-targeting: jointly selecting users and items to construct multiple disjoint campaigns. To solve this combinatorial problem, we propose three complementary strategies: (i) constrained spectral biclustering to find dense regions in the user-item affinity matrix, (ii) greedy local search with pairwise swaps for combinatorial refinement, and (iii) a multi-armed bandit framework to escape local optima through exploration. We evaluate these methods on a synthetic dataset, the Amazon Reviews benchmarks, and large-scale proprietary commercial data, and compare the results to simulated annealing as a baseline. The results show that biclustering consistently achieves the highest campaign quality, lift, and fairness scores. While biclustering runs efficiently on smaller datasets, its runtime increases substantially on very large ones, where bandit-based methods instead offer a scalable alternative.
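
Strategy (ii), greedy local search with pairwise swaps, can be sketched in a simplified setting. The assumptions here are mine, not the paper's: each campaign's item set is held fixed, only users are swapped, and campaign quality is the plain sum of user-item affinities. The function names are hypothetical.

```python
from itertools import combinations


def campaign_score(affinity, users, items):
    """Sum of user-item affinities inside one campaign."""
    return sum(affinity[u][i] for u in users for i in items)


def greedy_swap_search(affinity, user_groups, item_groups):
    """Swap users between campaigns for as long as total affinity improves.

    Group sizes never change, so the predefined campaign sizes are preserved.
    """
    groups = [list(g) for g in user_groups]
    improved = True
    while improved:
        improved = False
        for a, b in combinations(range(len(groups)), 2):
            for ia in range(len(groups[a])):
                for ib in range(len(groups[b])):
                    ua, ub = groups[a][ia], groups[b][ib]
                    before = (campaign_score(affinity, [ua], item_groups[a])
                              + campaign_score(affinity, [ub], item_groups[b]))
                    after = (campaign_score(affinity, [ub], item_groups[a])
                             + campaign_score(affinity, [ua], item_groups[b]))
                    if after > before:  # accept strictly improving swaps only
                        groups[a][ia], groups[b][ib] = ub, ua
                        improved = True
    return groups


# Toy affinity matrix: rows are users, columns are items.
affinity = [
    [5, 4, 0, 0],  # user 0 prefers items 0 and 1
    [0, 1, 5, 5],  # user 1 prefers items 2 and 3
    [4, 5, 1, 0],  # user 2 prefers items 0 and 1
    [0, 0, 4, 5],  # user 3 prefers items 2 and 3
]
item_groups = [[0, 1], [2, 3]]
result = greedy_swap_search(affinity, [[0, 1], [2, 3]], item_groups)
print(result)
```

Because only strictly improving swaps are accepted, the search stops at a local optimum, which is the limitation the abstract's bandit-based exploration is meant to address.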

2606.09620 2026-06-09 cs.RO cs.SY eess.SY New submission

Motion planning for hundreds of floating robots

Jan Kamm, Antonio Terpin, Raffaello D'Andrea, Aswin Ramachandran

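Institutions: Institute for Dynamic Systems and Control, ETH Zürich

AI summary: Addresses collision-free motion planning for large fleets of floating robots with a scalable pipeline that uses a collision graph to decompose the problem into independent subproblems solved in parallel, validated in simulations with 500 robots and in real-world demonstrations.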

English abstract

Planning collision-free motion for large robot fleets is difficult because collision avoidance induces strong inter-agent coupling that grows rapidly with team size. We consider omnidirectional floating robots on water, where choreographies are specified by sparse keyframes and an interactive tool must generate trajectories within seconds, even when transitions span minutes and thousands of time steps. We propose a scalable pipeline that builds a collision graph from an initialization, decomposes the coupled problem into interaction clusters, and solves clusters independently (and in parallel) with robustness mechanisms for common decomposition pathologies. We validate the approach in simulations with up to 500 robots. The synthesized trajectories have also been deployed in two real-world demonstrations, on Lake Zürich with a fleet of 24 Way of Water crafts and at the Time Space Existence 2025 Venice Biennale.
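
The decomposition step can be sketched as follows. This is one plausible reading of "collision graph" and "interaction clusters", not the paper's algorithm: I assume an initial guess of time-sampled 2D trajectories, connect two robots if they come within a safety distance at the same time step, and take connected components as clusters. The function name and the brute-force pairwise check are illustrative choices.

```python
import math
from itertools import combinations


def collision_clusters(trajectories, safety_dist):
    """Group robots into clusters of mutually interacting trajectories.

    trajectories[i] is a list of (x, y) positions of robot i, one per time step.
    """
    n = len(trajectories)
    parent = list(range(n))  # union-find forest over robot indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # An edge exists when two robots are too close at any shared time step.
    for i, j in combinations(range(n), 2):
        if any(math.dist(p, q) < safety_dist
               for p, q in zip(trajectories[i], trajectories[j])):
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


trajectories = [
    [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)],        # robot 0 moves right
    [(2.0, 0.5), (1.0, 0.5), (0.0, 0.5)],        # robot 1 moves left, passing robot 0
    [(10.0, 10.0), (10.0, 11.0), (10.0, 12.0)],  # robot 2 stays far away
]
clusters = collision_clusters(trajectories, safety_dist=1.0)
print(clusters)
```

Each resulting cluster could then be handed to a separate solver instance. The all-pairs check costs O(n² T) for n robots and T time steps, so a spatial index would likely be needed at the fleet sizes the abstract mentions.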

2606.09615 2026-06-09 cs.RO cs.CV New submission

DexPIE: Stable Dexterous Policy Improvement from Real-World Experience

Ruizhe Liao, Wenrui Chen, Liangji Zeng, Haoran Lin, Fan Yang, Kailun Yang, Yaonan Wang

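Institutions: Hunan University

AI summary: Proposes DexPIE, a post-training framework that combines a dexterous-hand-adapted intervention system, multi-stage DAgger-style data collection, asynchronous inference in the relative action space, and conditioning on a continuous optimality indicator, improving success rate by 37% on three real-world dexterous manipulation tasks.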

Comments
Project website: https://siiuuuuuu.github.io/DexPIE

English abstract

Dexterous manipulation presents substantial challenges for imitation learning due to its high-dimensional action space and complex contact-rich dynamics. Policies trained purely from demonstrations often suffer from compounding errors during deployment and require large amounts of expert data to achieve reliable performance. To move beyond the limitations of demonstration data, in this work, we propose DexPIE, a post-training framework for dexterous policy improvement from experience collected through real-world deployment. First, DexPIE enables effective exploration coverage through a dexterous-hand-adapted intervention system and multi-stage DAgger-style data collection across initial and intermediate task stages, providing reliable supervision for accurate policy evaluation. To reduce temporal noise between post-training rollouts and demonstration data, we introduce asynchronous inference in the relative action space, which better aligns rollout data with demonstrated behavior and allows the critic to learn a value function induced by a more consistent underlying policy. Finally, DexPIE improves the policy through conditioning on a continuous optimality indicator, allowing the policy to leverage the quality of data in a more fine-grained manner. Across three challenging real-world dexterous manipulation tasks, DexPIE achieves a 37% improvement in success rate over the demonstration-based reference policy, outperforming all baseline methods and demonstrating stronger robustness. The source code and dataset will be made publicly available.

2606.09613 2026-06-09 cs.CL cs.AI New submission

AGENTSERVESIM: A Hardware-aware Simulator for Multi-Turn LLM Agent Serving

Rakibul Hasan Rajib, Mengxin Zheng, Qian Lou

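Institutions: University of Central Florida

AI summary: Proposes AGENTSERVESIM, a simulator that evaluates multi-turn LLM agent serving policies at program granularity through modules including a program orchestrator, a tool simulator, a session-aware router, and a KV residency model, reproducing real-system behavior within 6% error while running on CPUs.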

Comments
Preprint

English abstract

Multi-turn LLM agents interleave model calls with external tool invocations, shifting serving from stateless request processing to stateful program execution. Serving these workloads requires scheduling, KV-cache management, and routing policies that use program-level context, including turn dependencies, tool-induced gaps, and reusable KV state. Evaluating such policies directly on real systems is costly, since each design point may require dedicated accelerator time across arrival rates, model scales, serving-instance counts, and memory hierarchies. Simulation offers a scalable alternative, but existing LLM serving simulators target stateless request-level workloads and therefore omit the core dynamics of agent serving: multi-turn program execution, cross-turn cache locality, and KV-cache residency during tool gaps. We present AGENTSERVESIM, a hardware-aware simulator for multi-turn LLM agent serving. AGENTSERVESIM evaluates serving policies at program granularity through composable modules: a Program Orchestrator preserves program identity and turn order, a Tool Simulator materializes tool-induced gaps, a Session-Aware Router maintains program-to-instance affinity for cache-aware dispatch, and a KV Residency Model tracks policy-defined KV placement across HBM, host DRAM/CXL, and eviction. Across real serving deployments and hardware configurations, AGENTSERVESIM reproduces real-system behavior within 6% error across key performance metrics while running entirely on commodity CPUs. These results show that AGENTSERVESIM enables controlled, repeatable exploration of agent-serving policies without requiring exhaustive deployment on costly accelerators.
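
Program-to-instance affinity, the job the abstract assigns to the Session-Aware Router, can be illustrated with a toy router. The class below is a hypothetical sketch and not AGENTSERVESIM's API: it pins each program to the least-loaded instance on its first turn and sends every later turn to the same instance, so that KV state cached there could be reused.

```python
class SessionAwareRouter:
    """Toy router that keeps all turns of a program on one serving instance."""

    def __init__(self, num_instances):
        self.load = [0] * num_instances  # number of programs pinned to each instance
        self.affinity = {}               # program_id -> instance index

    def route(self, program_id):
        """Return the instance for this turn, pinning new programs on first sight."""
        if program_id not in self.affinity:
            # Least-loaded placement; ties go to the lowest instance index.
            target = min(range(len(self.load)), key=self.load.__getitem__)
            self.affinity[program_id] = target
            self.load[target] += 1
        return self.affinity[program_id]

    def finish(self, program_id):
        """Release the pin once the program has completed its final turn."""
        target = self.affinity.pop(program_id)
        self.load[target] -= 1


router = SessionAwareRouter(num_instances=2)
first = router.route("prog-A")   # new program, placed on an instance
other = router.route("prog-B")   # new program, placed on the other instance
again = router.route("prog-A")   # later turn, same instance as before
print(first, other, again)
```

A stateless request-level router would treat the second turn of `prog-A` as a fresh request, which is the cross-turn cache locality that the abstract says existing simulators omit.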

2606.09610 2026-06-09 cs.RO cs.AI New submission

Shape Formation for the Cooperative Transportation of Arbitrary Objects Using Multi-Agent Reinforcement Learning

Mohamed Sayed, Wolfram Burgard, Tanja Katharina Kaiser

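Institutions: University of Technology Nuremberg

AI summary: Proposes a multi-agent reinforcement learning approach that lets a multi-robot system autonomously form formations supporting objects of arbitrary shape and non-uniform mass distribution while avoiding obstacles, enabling reliable cooperative transportation that generalizes.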

English abstract

Cooperative object transportation is essential in numerous domains, ranging from industrial to domestic services. A popular transportation strategy is to carry objects on top of multi-robot systems. The corresponding task is typically solved by decomposing it into three interconnected subproblems: formation control, cooperative navigation, and collision avoidance. A particular challenge posed by real-world objects is their potentially arbitrary shape and non-uniform mass distribution, necessitating robot formations that securely support the object. In this work, we address the challenge of pattern formation control for transporting such real-world objects by proposing a novel multi-agent reinforcement learning approach. Our approach enables a multi-robot system to autonomously position itself underneath an object to support its weight while avoiding obstacles during the formation process. Our evaluations with diverse environments and varying numbers of robots show that our approach leads to policies that reliably produce balanced formations and generalize to cluttered scenes and objects with complex geometry and non-uniform mass distribution.

2606.09608 2026-06-09 cs.CV New submission

TUDSR: Twice Upsampling-Diffusion for Higher Super-Resolution

Zhiqiang Wu, Yitong Dong, Xian Wei

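Institutions: East China Normal University; Zhejiang University

AI summary: Proposes the TUDSR framework, which uses two-stage training (at $R$-resolution and $NR$-resolution) with a looped chunk-based strategy on top of SD2.1 to achieve image super-resolution at $1024^2$ and $2048^2$, significantly outperforming existing methods.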

English abstract

Diffusion-based generative models have achieved remarkable success in real-world image super-resolution (SR). With tiled diffusion techniques, these models can produce high-resolution images that exceed their native-supported resolution. However, the quality of such high-resolution (e.g., $2048^2$) outputs often remains extremely poor, primarily due to two factors we consider: the image upsampling ratio (e.g., $\times8$) exceeding the model's native-supported upsampling ratio (e.g., $\times4$), and the model's native-supported resolution. In practice, training a native high-resolution model requires larger architectures, which incur significant computational overhead and GPU memory costs, making it impractical on resource-limited hardware. Thus, we present TUDSR, a Twice Upsampling-Diffusion framework for higher SR. The TUDSR framework mainly consists of two stages: the first involves training at $R$-resolution, and the second introduces a looped chunk-based training strategy at $NR$-resolution. Each stage adopts a one-step GAN architecture comprising a generator and a discriminator. Based on SD2.1-base, we develop TUDSR-S, which achieves state-of-the-art performance across multiple benchmarks. Extensive experiments further demonstrate that TUDSR-S generates high-quality images at the resolutions of $1024^2$ and even $2048^2$, significantly outperforming existing approaches. Code is available at https://github.com/wuer5/TUDSR.

2606.09605 2026-06-09 cs.AI New submission

Next-Token Prediction Learns Generalisable Representations of Sleep Physiology

Jonathan F. Carter, Lionel Tarassenko

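Institutions: Institute of Biomedical Engineering, University of Oxford

AI summary: Proposes Hypnos, a model that learns generalisable representations from multi-modal physiological signals with a next-token prediction objective and significantly outperforms existing foundation models on tasks such as sleep stage classification and atrial fibrillation detection.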

English abstract

Foundation models offer a promising route to compress multi-modal physiological signals into compact representations of human health, with broad applications across sleep medicine, cardiology, neurology and other healthcare domains. Existing models have typically been trained with masked-reconstruction or contrastive objectives. However, masked reconstruction may be poorly suited to the stochastic nature of these signals, while contrastive approaches rely on positive-pair definitions despite the semantic invariances of physiological signals being poorly understood. In this work, we show that next-token prediction is a simple and scalable alternative. We develop Hypnos, a multi-modal sleep foundation model trained using eight different sensing modalities (e.g. EEG, ECG, respiratory signals) drawn from over 20,000 overnight polysomnography recordings. We tokenize each modality into streams of discrete tokens using residual vector quantization, then train a large auto-regressive RQ-Transformer to jointly predict the next token across all modalities in parallel. After training, Hypnos can be applied to continuous streams of sensor data from any subset of supported modalities, generating embeddings for downstream tasks. Across a range of benchmarks, Hypnos significantly outperforms existing foundation models. In sleep stage classification, we match the performance of strong supervised baselines on held-out test sets whilst using \(100\times\) less labelled data. Hypnos even generalises to daytime physiology, surpassing a dedicated ECG foundation model at detecting atrial fibrillation. Our results demonstrate that next-token prediction is a strong self-supervised objective for representation learning from multi-modal physiological signals.
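
Residual vector quantization, which the abstract uses to turn each modality into discrete tokens, works by quantizing a vector with a first codebook and then quantizing the leftover residual with each subsequent codebook. The sketch below shows only that general mechanism; the tiny hand-written codebooks are assumptions for illustration, whereas a real tokenizer such as the one described would learn its codebooks from data.

```python
def nearest(codebook, vec):
    """Index of the codebook entry closest to vec in squared Euclidean distance."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((c - v) ** 2 for c, v in zip(codebook[k], vec)),
    )


def rvq_encode(vec, codebooks):
    """Return one token per codebook, each quantizing the previous residual."""
    residual = list(vec)
    tokens = []
    for codebook in codebooks:
        k = nearest(codebook, residual)
        tokens.append(k)
        residual = [r - c for r, c in zip(residual, codebook[k])]
    return tokens


def rvq_decode(tokens, codebooks):
    """Reconstruct the vector as the sum of the selected codebook entries."""
    out = [0.0] * len(codebooks[0][0])
    for k, codebook in zip(tokens, codebooks):
        out = [o + c for o, c in zip(out, codebook[k])]
    return out


# Two illustrative codebooks: a coarse one and a fine one for the residual.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, -0.25]],
]
tokens = rvq_encode([1.25, 1.0], codebooks)
print(tokens, rvq_decode(tokens, codebooks))
```

Each input frame therefore becomes a short stack of tokens, one per codebook level, rather than a single token. That stacked structure is presumably what the RQ-Transformer is there to model, although the abstract does not spell out the architecture.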

2606.09585 2026-06-09 cs.AI New submission

Optical Reasoning: Rethinking Images as an Expressive Reasoning Medium Beyond Text

Yutong Bian, Dongjie Cheng, Heming Xia, Yongqi Li, Wenjie Li

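Institutions: The Hong Kong Polytechnic University

AI summary: Proposes optical reasoning, which treats images as a standalone reasoning medium and is realized in typographic and graphical variants; it matches or exceeds text reasoning on language and multimodal tasks while using fewer reasoning tokens.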

English abstract

Chain-of-Thought (CoT) improves the performance of Large Language Models (LLMs) and has been extended to Multimodal Large Language Models (MLLMs). More recent work further moves from text-based multimodal reasoning toward interleaved-modal reasoning, where intermediate steps can incorporate both textual rationales and visual evidence. In this work, we propose a bolder and more ambitious idea: could images alone serve as the reasoning medium for both language and multimodal tasks? To explore this, we propose optical reasoning, which treats images as a standalone reasoning medium. We instantiate this concept with two variants: typographic-based optical reasoning, which optimizes visual layouts for compact rationale rendering, and graphical-based optical reasoning, which composes text and graphical elements into structured visual rationales. Across mathematical, scientific, and interleaved-modal reasoning benchmarks, optical reasoning can match or even exceed traditional text reasoning while reducing reasoning tokens by an average of 28.57% on language tasks and 16% on multimodal tasks, achieving 1.96 times the token efficiency of text reasoning. These results show that images can effectively and efficiently encode rationales while providing a unified visual canvas for reasoning.

2606.09578 2026-06-09 cs.AI cs.CL cs.IR New submission

TABVERSE: Benchmarking Cross-Format Table Understanding in LLMs and VLMs

Momina Ahsan, Sarfraz Ahmad, Ming Shan Hee, Roy Ka-Wei Lee, Preslav Nakov

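Institutions: Mohamed bin Zayed University of Artificial Intelligence (MBZUAI); Singapore University of Technology and Design (SUTD)

AI summary: Proposes the TABVERSE benchmark, which holds table content fixed across multiple structural formats (HTML, Markdown, LaTeX) and rendered images to systematically evaluate LLMs and VLMs on question answering, structural understanding, and structure reconstruction, finding that representation format significantly affects table understanding.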

Comments
24 pages, 18 tables, 16 figures, Submitted to ARR May 2026

English abstract

Large Language Models (LLMs) and Vision-Language Models (VLMs) are increasingly evaluated on table reasoning tasks, but the role of table representation remains under-explored. In practice, the same table content may appear in different structural formats, such as HTML, Markdown, and LaTeX, or as rendered images. However, existing evaluations often let content, format, layout, and modality vary together, making it difficult to isolate representation effects. We introduce TABVERSE, a controlled multimodal table benchmark that aligns the same table content across multiple structural formats and rendered images, with question category and difficulty tags. This design enables systematic evaluation of representation effects while holding table content fixed. We evaluate LLMs and VLMs across three tasks: Question Answering (QA), Structural Understanding Capability (SUC), and Structure Reconstruction (SR). Our results show that representation choice substantially affects table understanding. Models generally perform better with structured text than with rendered images, but the size of this gap depends on the task, model, and format. HTML is often the most robust text format, while row-sensitive structural tasks and syntactically usable LaTeX reconstruction remain challenging. These findings show that table representation is a key factor in reliable table evaluation.
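
Holding table content fixed while varying only the representation can be made concrete with a small sketch that serializes one table into the three text formats the abstract names. The helper functions are hypothetical and are not taken from the benchmark's code; they also skip escaping of special characters, which any real pipeline would need.

```python
def to_markdown(header, rows):
    """Render the table as a pipe-delimited Markdown table."""
    lines = ["| " + " | ".join(header) + " |",
             "| " + " | ".join("---" for _ in header) + " |"]
    lines += ["| " + " | ".join(row) + " |" for row in rows]
    return "\n".join(lines)


def to_html(header, rows):
    """Render the table as a single-line HTML table element."""
    head = "<tr>" + "".join(f"<th>{h}</th>" for h in header) + "</tr>"
    body = "".join(
        "<tr>" + "".join(f"<td>{cell}</td>" for cell in row) + "</tr>"
        for row in rows
    )
    return f"<table>{head}{body}</table>"


def to_latex(header, rows):
    """Render the table as a LaTeX tabular with left-aligned columns."""
    spec = "l" * len(header)
    lines = [f"\\begin{{tabular}}{{{spec}}}",
             " & ".join(header) + " \\\\",
             "\\hline"]
    lines += [" & ".join(row) + " \\\\" for row in rows]
    lines.append("\\end{tabular}")
    return "\n".join(lines)


header = ["model", "score"]
rows = [["A", "0.71"], ["B", "0.64"]]
for render in (to_markdown, to_html, to_latex):
    print(render(header, rows), end="\n\n")
```

All three outputs carry identical cell values, so any difference in a model's answers across them would be attributable to the format rather than the content, which is the controlled comparison the abstract describes.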

2606.09577 2026-06-09 cs.CL cs.LG cs.SE New submission

Code Is More Than Text: Uncertainty Estimation for Code Generation

Yuling Shi, Caiqi Zhang, Yuexian Li, Haopeng Wang, Yeheng Chen, Nigel Collier, Xiaodong Gu

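Institutions: Shanghai Jiao Tong University; University of Cambridge

AI summary: To address the reliability problem posed by incorrect programs in code generation, proposes uncertainty estimation along three orthogonal axes (lexical, algorithmic, and functional), improving AUROC by 8.1 points across five code LLMs.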

English abstract

Large language models (LLMs) are increasingly deployed as code generators, where silently wrong programs pose real safety and reliability risks. Reliable uncertainty estimation (UE) is essential for selective prediction, human-in-the-loop review, and downstream agentic decisions. Yet most existing code UE methods are inherited from natural language (NL) generation and ignore properties that make code distinct. We argue that code differs from NL in three ways: a single wrong token can break an entire program (token fragility); algorithmic intent and concrete implementation can disagree independently (intent-code gap); and programs can be executed (executability). We instantiate these properties as three orthogonal uncertainty axes: lexical (Top-K token entropy), algorithmic (pseudo-code consistency), and functional (behavioral consistency). Across five code LLMs, our three-axis ensemble improves average AUROC from 0.696 for the strongest NL-derived baseline to 0.776 (+8.1 points). Notably, on Qwen3-14B, our single-pass Top-K token entropy matches the strongest multi-pass baseline while being over 3x cheaper; across models, it remains a competitive low-cost signal. These results suggest that code UE deserves code-specific design rather than direct NL ports.
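
The lexical axis, Top-K token entropy, can be sketched from per-token log-probabilities of a single generation pass. The abstract does not give the exact definition, so the choices below are assumptions: the top-k probabilities are renormalized before computing Shannon entropy, and the sequence score is the mean over tokens. Both function names are hypothetical.

```python
import math


def topk_entropy(logprobs, k=5):
    """Shannon entropy (in nats) of the renormalized top-k token distribution."""
    top = sorted(logprobs, reverse=True)[:k]
    probs = [math.exp(lp) for lp in top]
    total = sum(probs)
    return -sum((p / total) * math.log(p / total) for p in probs if p > 0)


def sequence_uncertainty(per_token_logprobs, k=5):
    """Mean top-k entropy over all generated tokens (assumed aggregation)."""
    scores = [topk_entropy(lps, k) for lps in per_token_logprobs]
    return sum(scores) / len(scores)


# One confident token followed by one token split evenly between two candidates.
confident = [math.log(0.98), math.log(0.01), math.log(0.01)]
uncertain = [math.log(0.5), math.log(0.5)]
print(sequence_uncertainty([confident, uncertain], k=3))
```

Because this needs only the log-probabilities already produced during decoding, it requires no extra sampling, which is consistent with the abstract's description of it as a single-pass, low-cost signal.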

2606.09572 2026-06-09 cs.RO cs.AI 新提交

CT-VAM: A Cerebello-Thalamic-Inspired Vision-Action Model for Efficient Visuomotor Control

CT-VAM: 一种小脑-丘脑启发的视觉-动作模型用于高效视觉运动控制

Jiacheng Li, Yize Guo, Jiabin Guo, Qingchen Liu, Jiahu Qin

发表机构 * University of Science and Technology of China(中国科学技术大学) AIRLab, Department of Automation(自动化系AIRLab)

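AI summary: Proposes CT-VAM, which fuses heterogeneous inputs through the TARS conditional attention decoder; with 68M parameters it achieves LIBERO success rates competitive with large VLA models while reducing inference latency and supporting high-frequency control.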

详情
AI中文摘要

视觉-语言-动作模型在机器人操作中展现出强大潜力,然而原始语言主要用于指定任务意图,而非在高频低层执行过程中反复处理。受此分离的启发,我们提出了一种小脑-丘脑启发的视觉-动作模型(CT-VAM),用于高效的任务条件视觉运动控制。CT-VAM作为一个紧凑的局部执行策略,从双视角视觉观察、本体感觉和轻量级任务条件中预测动作块,从而可能实现一种实用的云-边缘范式,其中高层语义推理由大模型处理,而快速闭环控制在本地硬件上运行。为了有效融合异构输入,CT-VAM引入了TARS(丘脑动作路由流),一种流分离的条件注意力解码器,独立路由动作、视觉和任务流,防止密集的感官标记淹没紧凑的任务相关条件。仅凭68M参数,CT-VAM在LIBERO上取得了与更大规模VLA模型竞争的成功率,同时降低了推理延迟。结合用于异步块执行的流一致修补,CT-VAM支持高频控制,并在资源受限的机器人平台上展示了鲁棒的实时部署能力。

英文摘要

Vision-language-action models have shown strong promise for robot manipulation, yet raw language is primarily needed to specify task intent rather than to be repeatedly processed during high-frequency low-level execution. Motivated by this separation, we propose a cerebello-thalamic-inspired vision-action model (CT-VAM) for efficient task-conditioned visuomotor control. CT-VAM acts as a compact local execution policy that predicts action chunks from dual-view visual observations, proprioception, and a lightweight task condition, potentially enabling a practical cloud-edge paradigm in which high-level semantic reasoning can be handled by large models while fast closed-loop control runs on local hardware. To fuse heterogeneous inputs effectively, CT-VAM introduces TARS (Thalamic Action Routing Stream), a stream-separated conditional attention decoder that independently routes action, visual, and task streams, preventing dense sensory tokens from overwhelming compact task-relevant conditions. With only 68M parameters, CT-VAM achieves LIBERO success rates competitive with substantially larger VLA models, while reducing inference latency. Together with flow-consistent inpainting for asynchronous chunk execution, CT-VAM supports high-frequency control and demonstrates robust real-world deployment on resource-constrained robotic platforms.

2606.09569 2026-06-09 cs.RO cs.CV 新提交

Efficient Minimal Solvers for Relative Pose Estimation in Autonomous Driving Applications

自动驾驶应用中相对位姿估计的高效最小求解器

Tao Li, Liang Liu, Jianli Han, Weimin Lv

发表机构 * College of Aerospace Science and Engineering, Naval Aviation University(海军航空大学航空航天科学与工程学院)

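AI summary: Proposes a unified framework based on a new translation parameterization and a first-order rotation approximation, with three minimal solvers (using the IMU vertical direction, the rotation axis direction during steering, and a planar-motion assumption) that reduce the number of point correspondences and the algebraic complexity, speed up hypothesis generation in RANSAC, and balance speed and accuracy.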

详情
AI中文摘要

随着视觉传感系统的进步,计算机视觉在自动驾驶和机器人导航中扮演着越来越重要的角色。多相机系统中的相对位姿估计对于精确的车辆定位和环境感知至关重要,要求高实时性和鲁棒性。然而,现有方法通常涉及高计算成本并严重依赖丰富的特征匹配,限制了它们在时间敏感驾驶场景中的适用性。为解决这些限制,本文引入了一个基于新颖平移参数化和一阶旋转近似的统一框架,用于高效相对位姿估计。在该框架内,我们提出了三种专门为自动驾驶车辆设计的高效最小求解器。第一个求解器集成了惯性测量单元(IMU)的垂直方向先验,第二个在转向操作期间利用旋转轴方向先验,第三个专为平面运动设计——这是结构化道路上地面车辆的现实假设。通过减少最小点对应数量和代数复杂度,我们的方法能够在基于RANSAC的流程中更快地生成假设,提高对实时系统的适用性。在合成数据集和KITTI自动驾驶基准上的大量实验表明,与现有最先进算法相比,所提出的求解器在速度和精度之间实现了有利的平衡。

英文摘要

With the advancement of visual sensing systems, computer vision is playing an increasingly important role in autonomous driving and robot navigation. Relative pose estimation in multi-camera systems is essential for accurate vehicle localization and environment perception, demanding high real-time performance and robustness. Existing methods, however, often involve high computational costs and rely heavily on abundant feature matches, limiting their applicability in time-sensitive driving scenarios. To address these limitations, this paper introduces a unified framework for efficient relative pose estimation, built upon a novel translation parameterization and first-order rotation approximation. Within this framework, we propose three efficient minimal solvers specifically designed for autonomous vehicles. The first solver integrates the vertical direction prior from Inertial Measurement Units (IMUs), the second utilizes the rotation axis direction prior during steering maneuvers, and the third is designed for planar motion - a realistic assumption for ground vehicles operating on structured roads. By reducing both the minimal number of point correspondences and the algebraic complexity, our methods enable faster hypothesis generation within RANSAC-based pipelines, improving suitability for real-time systems. Extensive experiments on synthetic datasets and the KITTI autonomous driving benchmark demonstrate that the proposed solvers achieve a favorable balance between speed and accuracy compared to existing state-of-the-art algorithms.
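Why a smaller minimal sample speeds up RANSAC can be seen from the standard iteration-count formula N = log(1 - p) / log(1 - w^s), where s is the minimal sample size, w the inlier ratio, and p the desired confidence. The sketch below evaluates it; the sample sizes 5, 3, and 2 and the inlier ratio 0.5 are illustrative assumptions, not the correspondence counts of the proposed solvers.

```python
import math

def ransac_iterations(sample_size, inlier_ratio, confidence=0.99):
    """Iterations needed to draw at least one all-inlier minimal sample with the given confidence."""
    if not 0.0 < inlier_ratio <= 1.0:
        raise ValueError("inlier_ratio must be in (0, 1]")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be in (0, 1)")
    p_all_inliers = inlier_ratio ** sample_size
    if p_all_inliers >= 1.0:
        return 1  # every sample is outlier-free
    # log1p(-x) computes log(1 - x) accurately for small x
    return math.ceil(math.log(1.0 - confidence) / math.log1p(-p_all_inliers))

# Hypothetical minimal sample sizes at a 50% inlier ratio.
for s in (5, 3, 2):
    print(s, ransac_iterations(s, inlier_ratio=0.5))
```

The required iteration count grows roughly exponentially with the sample size, so removing even one or two correspondences from the minimal problem cuts the number of hypotheses substantially.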

2606.09568 2026-06-09 cs.AI 新提交

Self-Explainability in Self-Adaptive and Self-Organising Systems: Status and Research Directions

自适应与自组织系统中的自解释性:现状与研究方向

Tom Beyer, Svea Wisy, Sven Tomforde

发表机构 * Kiel University(基尔大学)

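AI summary: Through a systematic literature review, this article defines Self-Explainability (SX), builds a taxonomy, and proposes a Levels of Self-Explainability framework; it finds that most approaches remain conceptual and that evaluation standards are lacking.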

详情
Comments
Under review as a regular paper at ACM Transactions on Autonomous and Adaptive Systems (TAAS)
AI中文摘要

随着人工智能(AI)的进步,自适应和自组织系统的复杂性日益增加,使其越来越难以理解和信任。虽然可解释AI旨在提供对AI决策的洞察,但更高级的目标是让系统自我解释——这种能力称为自解释性(SX)。本文对SX进行了系统文献综述,分析了现有方法,包括其领域、目标和评估方法。综述提出了SX的统一定义和分类法,并引入了自解释性层次,为定位当前和未来研究提供了框架。我们的结果表明,大多数SX方法仍处于概念阶段,实际实现很少。此外,目前没有评估SX的正式或事实标准,突出了一个主要研究空白。因此,这项工作为推进复杂系统中的自解释性奠定了基础和路线图。

英文摘要

The growing complexity of self-adaptive and self-organising systems, fuelled by advances in Artificial Intelligence (AI), has made them increasingly difficult to understand and trust. While Explainable AI aims to provide insight into AI decision-making, a more advanced goal is for systems to explain themselves - an ability referred to as Self-Explainability (SX). This article presents a systematic literature review on SX, analysing existing approaches, including their domains, targets, and evaluation methods. The review develops a unified definition and taxonomy of SX and introduces Levels of Self-Explainability, providing a framework for positioning current and future research. Our results show that most SX approaches remain conceptual, with few practical implementations. Moreover, there is currently no formal or de facto standard for evaluating SX, highlighting a major research gap. This work thus establishes a foundation and roadmap for advancing Self-Explainability in complex systems.

2606.09563 2026-06-09 cs.AI cs.LG 新提交

PRISM: Recovering Instruction Sets from Language Model Activations

PRISM:从语言模型激活中恢复指令集

Gilad Gressel, Rahul Pankajakshan, Julia Diament, Efim Hudis, Krishnashree Achuthan, Yisroel Mirsky

发表机构 * Center for Cybersecurity Systems & Networks, Amrita Vishwa Vidyapeetham(阿姆里塔·维什瓦·维迪亚佩瑟姆网络安全系统与网络中心) Microsoft(微软) Ben-Gurion University of the Negev(内盖夫本-古里安大学)

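AI summary: Proposes PRISM, which recovers the active instruction set from the hidden states of a frozen target model through activation-conditioned decoding, trained with judge-guided GRPO, and outperforms baselines across multiple settings.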

详情
Comments
Under Review
AI中文摘要

随着LLM被部署为智能体,可靠的监控不仅需要知道它们输出了什么,还需要知道哪些指令在引导它们的行为。当模型推断出非预期的子目标、遵循上下文线索或受到提示注入和隐藏目标的影响时,这变得困难。虽然激活到语言的方法表明隐藏状态可以揭示自然语言信息,但现有方法并非设计用于恢复智能体设置中同时活跃的完整指令、约束、禁止和子目标集。我们将此问题形式化为指令集检索,并引入PRISM,一个激活条件的解释器,将冻结目标模型的隐藏状态解码为活跃指令的忠实项目符号列表。与先前的激活到语言方法不同,PRISM直接训练以恢复指令集,使用法官引导的GRPO来奖励覆盖的指令并惩罚不支持的指令。在良性、受限、提示注入和隐藏目标设置中,PRISM优于激活到语言基线,特别是在安全相关目标上。

英文摘要

As LLMs are deployed as agents, reliable monitoring requires knowing not only what they output, but which instructions are steering their behavior. This is difficult when models infer unintended subgoals, follow contextual cues, or are influenced by prompt injections and hidden objectives. While activation-to-language methods suggest that hidden states can reveal natural-language information, existing approaches are not designed to recover the full set of simultaneous instructions, constraints, prohibitions, and subgoals active in agentic settings. We formalize this problem as instruction set retrieval and introduce PRISM, an activation-conditioned interpreter that decodes hidden states from a frozen target model into a faithful bullet list of active instructions. Unlike prior activation-to-language methods, PRISM is trained to recover instruction sets directly, using judge-guided GRPO to reward covered instructions and penalize unsupported ones. Across benign, constrained, prompt-injection, and hidden-objective settings, PRISM outperforms activation-to-language baselines, especially on security-relevant objectives.

2606.09559 2026-06-09 cs.LG cs.AI cs.CR cs.RO 新提交

Safe-RULE: Safe Reinforcement UnLEarning

Safe-RULE:安全强化反学习

Shixiong Jiang, Taozheng Zhu, Fanxin Kong

发表机构 * University of Notre Dame(圣母大学)

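AI summary: To address the vulnerability of offline safe reinforcement learning to data poisoning attacks, proposes the Safe-RULE framework, which uses unlearning to remove the influence of malicious samples without retraining from scratch or accessing the original environment; experiments show it effectively improves safety.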

详情
Comments
20 pages, 3 figures
AI中文摘要

离线安全强化学习(Safe RL)使得无需在线交互即可进行策略学习,适用于机器人系统等安全关键系统。然而,其对静态数据集的依赖使离线Safe RL面临数据投毒攻击,攻击者注入恶意样本以破坏安全性并诱导不安全策略行为。在这项工作中,我们提出了一种新的学习范式,称为安全强化反学习(Safe-RULE),作为一种防御框架,用于在不从头重新训练或需要访问原始训练环境的情况下移除中毒数据的影响。我们进一步将强化反学习扩展到离线Safe RL,通过在反学习过程中明确考虑任务性能和安全约束。跨基准Safe RL任务的实验表明,我们的方法能有效增强针对数据投毒攻击的安全性能。

英文摘要

Offline safe reinforcement learning (Safe RL) enables policy learning without online interactions, making it suitable for safety-critical systems such as robotic systems. However, its reliance on static datasets exposes offline Safe RL to data poisoning attacks, where adversaries inject malicious samples that compromise safety and induce unsafe policy behavior. In this work, we propose a new learning paradigm, named safe reinforcement unlearning (Safe-RULE), a defense framework that removes the influence of poisoned data without retraining from scratch or requiring access to the original training environment. We further extend reinforcement unlearning to offline Safe RL by explicitly accounting for both task performance and safety constraints during the unlearning process. Experiments across benchmark Safe RL tasks demonstrate that our approach effectively enhances safety performance against data poisoning attacks.

2606.09556 2026-06-09 cs.AI 新提交

AI Scientists Are Only as Good as Their Evidence: A Stratified Ablation of Proprietary Data and Reasoning Skills in Drug-Asset Valuation

AI科学家的能力取决于其证据:药物资产估值中专有数据与推理技能的分层消融研究

Yinan Wang

发表机构 * Noah AI Research(Noah AI研究)

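AI summary: Through a stratified ablation, finds that in drug-asset valuation the ceiling on an AI Scientist's decisions is set by the proprietary evidence it can access rather than by reasoning scaffolds alone; adding proprietary data substantially improves decision quality.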

详情
Comments
Preprint; 2 figures, 5 tables
AI中文摘要

AI科学家智能体通常被评估时,仿佛能力主要取决于模型质量、提示或推理框架。我们在药物资产估值中测试了一个不同的假设:对于知识密集型的科学决策,限制因素往往是智能体能够访问的证据基础。我们在一个生产级估值智能体上进行了三臂对照消融实验:A是仅使用网络的普通LLM分析师,B增加了公共结构化工具以及14维估值剧本、验证器、客观性策略和红队,C增加了专有的Noah AI语料库,包含精选的管线、试验和交易情报。在包含13个资产的分层基准测试中,B改善了校准和审计纪律:层级内准确率从0.80提高到0.89,客观性从3.16提高到3.30。但B并未消除事实上限。在能力超集核算下,A和B仅恢复了精选黄金竞争记录的0.25和0.38,而C恢复了0.96;在精选长尾子集上,C达到0.93,而A/B为0.26/0.30。原始盲审决策质量A和B相似(7.01 vs 6.96),因此我们引入了完整性感知决策效用:知情决策质量 = 决策质量 × 黄金覆盖率。在此指标上,C达到7.43,而A/B为1.76/2.57。即使一个完美的非专有数据报告,其B的覆盖率上限也仅为3.83。结果并非推理框架不重要;它们改善了校准和纪律。相反,专有证据集设定了AI科学家所能知道并因此决策的上限。

英文摘要

AI Scientist agents are often evaluated as if capability were mainly a function of model quality, prompting, or reasoning scaffolds. We test a different hypothesis in drug-asset valuation: for knowledge-intensive scientific decisions, the limiting factor is often the evidence substrate the agent can access. We run a controlled three-arm ablation on a production valuation agent: A is a plain web-only LLM analyst, B adds public structured tools plus a 14-dimension valuation playbook, verifier, objectivity policy and red-team, and C adds the proprietary Noah AI corpus of curated pipeline, trial and deal intelligence. Across a 13-asset stratified benchmark, B improves calibration and audit discipline: tier-in-range accuracy rises from 0.80 to 0.89 and objectivity from 3.16 to 3.30. But B does not remove the factual ceiling. Under capability-superset accounting, A and B recover only 0.25 and 0.38 of the curated gold competitive record, while C recovers 0.96; on the curated long-tail subset, C reaches 0.93 vs. 0.26/0.30. Raw blind-panel decision quality is similar for A and B (7.01 vs. 6.96), so we introduce completeness-aware decision utility: informed decision-quality = decision-quality x gold-coverage. On this metric, C reaches 7.43 vs. 1.76/2.57 for A/B. Even a perfect non-proprietary-data report would be capped at 3.83 by B's coverage. The result is not that reasoning scaffolds are unimportant; they improve calibration and discipline. Rather, proprietary evidence sets the upper bound of what the AI Scientist can know and therefore decide.
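The completeness-aware metric defined above is a plain product, sketched below. The function names and the 10-point quality scale are assumptions; the scale is inferred from the stated cap of 3.83 at B's coverage of roughly 0.38. The abstract does not say how per-asset scores are aggregated, so products of the rounded averages (for example 7.01 x 0.25) need not reproduce the reported values exactly.

```python
def informed_decision_quality(decision_quality, gold_coverage):
    """Completeness-aware utility: decision quality scaled by the fraction of gold facts recovered."""
    if not 0.0 <= gold_coverage <= 1.0:
        raise ValueError("gold_coverage must be in [0, 1]")
    return decision_quality * gold_coverage

def coverage_cap(gold_coverage, max_quality=10.0):
    """Best reachable utility at a given coverage, assuming a 10-point quality scale."""
    return informed_decision_quality(max_quality, gold_coverage)

# Rounded aggregate figures from the abstract for arm A, used only as an illustration.
print(informed_decision_quality(7.01, 0.25))
print(coverage_cap(0.38))
```

Under this metric a report cannot score above its coverage cap however well it reasons, which is the sense in which the evidence base sets the upper bound.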

2606.09547 2026-06-09 cs.CV cs.LG 新提交

Streaming Interventions: Can Video Large Language Models Correct Mistakes as They Occur?

流式干预:视频大语言模型能否在错误发生时即时纠正?

Apratim Bhattacharyya, Shweta Mahajan, Sanjay Haresh, Rajeev Yasarla, Reza Pourreza, Litian Liu, Risheek Garrepalli, Roland Memisevic

发表机构 * Qualcomm AI Research(高通人工智能研究院) York University(约克大学) Vector Institute for AI(向量人工智能研究所)

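AI summary: Proposes the Ego-MC-Bench benchmark to evaluate the real-time intervention ability of video LLMs in cooking scenarios, and builds the counterfactual synthetic dataset Ego-CoMist to improve the performance of smaller models.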

详情
Comments
Qualcomm Interactive Cooking: Ego-MC-Bench -- available at https://huggingface.co/datasets/neuripsedtracksub/ego-mistake-corrections and Ego-CoMist -- available at https://huggingface.co/datasets/neuripsedtracksub/ego-counterfactual-mistakes
AI中文摘要

学习日常技能(如烹饪一道菜)越来越依赖于教学媒体,例如在线视频。这为使用视频(和多模态)大语言模型(LLMs)作为任务指导助手打开了大门。一个潜在的任务指导助手在现实世界中成功的关键能力是,它能够在错误一出现时就主动干预以引导用户。为了评估这一关键能力,我们引入了Ego-MC-Bench(错误纠正),这是一个用于评估在现实烹饪场景中反应性、逐步任务指导的基准。大量实验表明,Ego-MC-Bench对于最先进的视频LLMs具有高度挑战性。我们认为一个关键原因是用于在此任务上微调模型的训练数据有限。尽管存在广泛的烹饪视频数据集,但现有数据集缺乏错误示例以及适当时间的干预。为了帮助解决这一数据限制,我们还引入了Ego-CoMist,这是一个反事实合成数据集,通过将非交互式烹饪视频转换为显示主动干预的监督训练示例而创建。我们表明,在Ego-CoMist上进行微调可以带来性能提升,特别是对于更适合在边缘设备上提供帮助的更小、更高效的视频LLMs。

英文摘要

Learning everyday skills, like cooking a dish, relies increasingly on instructional media such as online videos. This opens the door to the use of video (and multimodal) large language models (LLMs) as task guidance assistants. A crucial capability for the real-world success of a prospective task guidance assistant is its ability to intervene proactively as soon as a mistake is apparent in order to guide the user. To evaluate this crucial capability, we introduce Ego-MC-Bench (Mistake Corrections), a benchmark for evaluating reactive, step-by-step task guidance in realistic cooking scenarios. Extensive experiments show that Ego-MC-Bench is highly challenging for state-of-the-art video LLMs. We argue that a key reason is the limited availability of training data for fine-tuning models on this task. Although there exists a wide range of cooking video datasets, existing datasets lack examples of mistakes along with appropriately timed interventions. To help address this data limitation, we also introduce Ego-CoMist, a counterfactual synthetic dataset created by transforming non-interactive cooking videos into supervised training examples showing proactive interventions. We show that fine-tuning on Ego-CoMist yields performance gains, especially for smaller and more efficient video LLMs that are well suited for delivering assistance on edge devices.

2606.09542 2026-06-09 cs.CV 新提交

A VideoMAE-v2 Approach to Zero-Shot Traffic Accident Anticipation

一种用于零样本交通事故预警的VideoMAE-v2方法

Siyuan Li, Xiaoyang Bi, Mengshi Qi

发表机构 * State Key Laboratory of Networking and Switching Technology(网络与交换技术国家重点实验室) Beijing University of Posts and Telecommunications(北京邮电大学)

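AI summary: Proposes a VideoMAE-v2-based framework that, using a sliding-window protocol and a per-frame prediction head, generalizes from coarsely labelled data to unseen dashcam videos in a zero-shot setting for traffic accident anticipation.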

详情
AI中文摘要

交通事故预警——在行车记录仪视频的每一帧预测即将发生碰撞的可能性——对于安全至关重要,但难以规模化,因为为每个部署场景收集域内标注的事故视频成本过高。我们在零样本设置下研究此任务,即没有目标域训练数据可用:模型必须仅从公开的二元标注驾驶事故数据集中学习,并泛化到未见过的行车记录仪视频。我们提出一个框架,通过将VideoMAE-v2骨干网络与滑动窗口协议下的逐帧预测头相结合,弥合帧级时间风险估计任务与粗粒度标注二元事故数据集之间的差距。我们的方法在2026年CVPR@AUTOPILOT零样本交通事故预警竞赛中获得第二名。代码可在https://github.com/TimeSouth/zero-shot-taa-solution获取。

英文摘要

Traffic accident anticipation -- predicting the likelihood of an imminent collision at every frame of a dashcam video -- is safety-critical yet difficult to scale, because collecting in-domain annotated accident footage for every deployment scenario is prohibitively expensive. We study this task under a zero-shot setting where no target-domain training data is available: the model must learn exclusively from a publicly available binary-labelled driving-accident dataset and generalise to unseen dashcam footage. We propose a framework that bridges the gap between the frame-level temporal risk estimation task and coarsely labelled binary accident datasets by coupling a VideoMAE-v2 backbone with a per-frame prediction head under a sliding-window protocol. Our method achieves 2nd place in the 2026 CVPR@AUTOPILOT Zero-Shot Accident Anticipation competition. Code is available at https://github.com/TimeSouth/zero-shot-taa-solution.
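A sliding-window protocol for per-frame risk can be sketched as follows. The abstract does not specify the window length, stride, or how window outputs map to frames, so the causal window ending at each frame, the default `window=16`, and the `toy_score` stand-in for the backbone and prediction head are all assumptions for illustration.

```python
def sliding_window_scores(frames, score_window, window=16):
    """Assign each frame the score of the causal window that ends at that frame."""
    scores = []
    for t in range(len(frames)):
        start = max(0, t - window + 1)  # shorter windows at the start of the video
        scores.append(score_window(frames[start:t + 1]))
    return scores

def toy_score(clip):
    """Hypothetical stand-in for backbone plus head: mean of per-frame hazard values."""
    return sum(clip) / len(clip)

# Invented per-frame hazard values standing in for video frames.
hazard = [0.0, 0.0, 0.1, 0.2, 0.6, 0.9]
risk = sliding_window_scores(hazard, toy_score, window=3)
print(risk)
```

Using only frames up to the current one keeps the estimate usable online, where the risk score must be produced before the collision is visible.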

2606.09539 2026-06-09 cs.LG 新提交

Efficient Traffic Prediction at Scale: A Systematic Study of STGCN Architectural Depth

大规模高效交通预测:STGCN架构深度的系统研究

Soban Nasir Lone, Mohamed Abouelela, Taeyoung Yu, Jiwon Kim, Constantinos Antoniou

发表机构 * Technical University of Munich(慕尼黑工业大学) The University of Queensland(昆士兰大学)

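AI summary: Systematically studies how STGCN architectural depth affects traffic prediction performance and computational efficiency, finding that a single-block architecture achieves optimal or near-optimal performance on most datasets at substantially lower computational cost than the standard two-block architecture.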

详情
Comments
Accepted for publication in IEEE ITSC (2026)
AI中文摘要

时空图神经网络(STGNNs)已成为交通预测的主流方法,但其计算需求对智能交通系统(ITS)的实际部署构成挑战。尽管近期研究提出了STGNNs的高效替代方案,但一个基本问题仍未探索:这些架构本身是否过参数化?我们使用该领域最广泛采用的模型之一——时空图卷积网络(STGCN)来研究这一问题。通过在四个不同的交通数据集上进行系统实验,我们比较了1块、2块(标准)和3块STGCN变体。我们的发现表明,单块架构在三个数据集上实现了短期预测(10分钟)的最优性能,而在更长预测时长上仅带来边际退化(相对误差≤1.8%)。关键的是,与单块相比,双块变体导致CPU推理延迟增加61%,吞吐量降低37%——这对于资源受限的ITS部署是巨大的开销。三块架构没有提供有利的权衡,计算成本增加一倍以上,而相对改进小于0.5%。这些结果表明,默认的双块STGCN在许多应用中可能过参数化,这对部署交通预测系统的从业者和基准测试效率方法的 researchers 都有影响。

英文摘要

Spatio-temporal graph neural networks (STGNNs) have become the dominant approach for traffic prediction, yet their computational requirements pose challenges for practical deployment in intelligent transportation systems (ITS). While recent work has proposed efficient alternatives to STGNNs, a fundamental question remains unexplored: are these architectures themselves over-parameterised? We examine this question using the Spatio-Temporal Graph Convolutional Network (STGCN), one of the most widely adopted models in this domain. Through systematic experiments across four diverse traffic datasets, we compare 1-block, 2-block (standard), and 3-block STGCN variants. Our findings reveal that the single-block architecture achieves optimal performance for short-term prediction (10 minutes) on three of four datasets, while incurring only marginal degradation (≤1.8% relative error) at longer horizons. Crucially, the 2-block variant incurs 61% higher CPU inference latency and 37% lower throughput relative to 1-block -- substantial overhead for resource-constrained ITS deployment. The 3-block architecture offers no favourable tradeoff, more than doubling computational cost for <0.5% relative improvement. These results suggest that the default 2-block STGCN may be over-parameterised for many applications, with implications for both practitioners deploying traffic prediction systems and researchers benchmarking efficiency-focused methods.

2606.09535 2026-06-09 cs.CL cs.SD 新提交

Overcoming Decoder Inconsistencies in Whisper for Dravidian and Low-Resource Languages

克服Whisper在达罗毗荼语系和低资源语言中的解码器不一致性

Chowdam Venkata Kumar, Kumud Tripathi, Pankaj Wasnik

发表机构 * Sony Research India(索尼印度研究院)

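AI summary: To address Whisper's high word error rate on Dravidian languages, linguistic and dataset analysis identifies lexical sparsity and character-level substitution errors; two decoder enhancements, Weighted-Attention and Self-Conditioning, are proposed and consistently reduce WER for low-resource and agglutinative languages.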

详情
Comments
Accepted at INTERSPEECH 2026, 5 pages, 1 figure, 5 tables
AI中文摘要

多语言ASR模型如Whisper在高资源语言上表现良好,但在达罗毗荼语系上的词错误率(WER)显著高于印度-雅利安语系。通过语言学和数据集分析,我们发现达罗毗荼语系具有更长的单词、更高的词汇多样性和更低的重复率,导致标记分布稀疏和频繁的字符级替换错误。基线微调进一步揭示了自注意力(语言上下文)和交叉注意力(声学线索)之间的解码器不平衡。尽管合成标记重复实验表明潜在收益,但实际不可行。受这些观察启发,我们引入了两种解码器级增强:加权注意力(自适应平衡注意力来源)和自条件化(重新注入中间预测以提高标记一致性)。实验表明,对于低资源和黏着语言,WER持续降低。

英文摘要

Multilingual ASR models such as Whisper perform well on high-resource languages but exhibit substantially higher Word Error Rates (WER) for Dravidian languages compared to Indo-Aryan ones. Through linguistic and dataset analysis, we show that Dravidian languages have longer words, higher vocabulary diversity, and lower repetition, resulting in sparse token distributions and frequent character-level substitution errors. Baseline fine-tuning further reveals decoder imbalance between self-attention (linguistic context) and cross-attention (acoustic cues). Although synthetic token-repetition experiments indicate potential gains, they are impractical. Motivated by these observations, we introduce two decoder-level enhancements: Weighted-Attention, which adaptively balances attention sources, and Self-Conditioning, which reinjects intermediate predictions to improve token consistency. Experiments demonstrate consistent WER reductions for low-resource and agglutinative languages.

2606.09525 2026-06-09 cs.CL cs.AI 新提交

Emergence of Context Characteristics Sensitivity in Large Language Models

大型语言模型中上下文特征敏感性的涌现

Nadya Yuki Wangsajaya, Haeun Yu, Isabelle Augenstein

发表机构 * Nanyang Technological University(南洋理工大学) University of Copenhagen(哥本哈根大学)

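AI summary: By measuring three stages (supervised fine-tuning, direct preference optimization, and reinforcement learning with verifiable rewards), finds that LLMs' sensitivity to context characteristics shifts during instruction fine-tuning: supervised fine-tuning biases models toward easy-to-understand contexts, while later stages may reinforce or alter this preference.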

详情
AI中文摘要

在指令微调(IFT)过程中,大型语言模型(LLMs)通过使用提供的上下文来回答问题,从而学会遵循指令。虽然先前的工作已经研究了上下文特征如何与LLM的上下文使用相关,但这种分析仅限于推理时间,尚未揭示这些关系最初是如何获得的。在这里,我们测量了模型对这些特征的敏感性在连续的IFT阶段(监督微调(SFT)、直接偏好优化(DPO)和可验证奖励强化学习(RLVR))中如何变化。跨四个模型和三个数据集的实验表明,SFT使模型更倾向于使用易于理解的上下文,例如包含高长度、上下文-查询相似性和流畅性的上下文。SFT后的动态可能根据训练数据集强化或解决这些偏好。我们的发现揭示了上下文使用在每个IFT阶段都被积极重塑,并且设计平衡的IFT数据集对于确保指令微调模型稳健的上下文利用至关重要。

英文摘要

During instruction fine-tuning (IFT), large language models (LLMs) learn to follow instructions by using the provided context to answer a query. While prior work has studied how context characteristics correlate with context usage by the LLM, this analysis has been limited to inference time, leaving open how these relationships are acquired in the first place. Here, we measure how models' sensitivity to such characteristics shifts across successive IFT stages: supervised fine-tuning (SFT), direct preference optimization (DPO), and reinforcement learning with verifiable rewards (RLVR). Experiments across four models and three datasets show that SFT makes models more likely to use contexts that are easy to understand, such as those with greater length, higher context-query similarity, and higher fluency. Post-SFT dynamics may either reinforce or resolve these preferences depending on the training dataset. Our findings reveal that context usage is actively reshaped at each IFT stage, and that designing a balanced IFT dataset is important for ensuring robust context utilization in instruction-tuned models.

2606.09517 2026-06-09 cs.LG 新提交

Investigating Calibration Challenges in Probabilistic Electricity Price Forecasting

研究概率电价预测中的校准挑战

Jan Niklas Lettner, Hadeer El Ashhab, Benjamin Schäfer

发表机构 * Institute for Automation and Applied Informatics(自动化与应用信息学研究所) Karlsruhe Institute of Technology(卡尔斯鲁厄理工学院)

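AI summary: Points out that scoring rules currently used in probabilistic electricity price forecasting favor sharpness over calibration, leading to overconfident estimates, and calls for future research to move toward calibration-aware objectives and architectures.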

详情
Comments
Presented at the ACM Sustainability Week Companion 2026, Banff, AB, Canada
AI中文摘要

随着可再生能源整合增加市场波动性,概率电价预测已成为有效风险管理的关键。然而,当前的适当评分规则往往优先考虑预测锐度而牺牲校准,导致过度自信且统计上不可靠的不确定性估计。本文强调了理论评分与实际校准之间的关键差距,证明当可靠性被忽视时,模型可能成为确定性预测的代理。我们得出结论,未来的研究必须转向校准感知的目标和架构,以确保能源市场预测的分布完整性。

英文摘要

As renewable energy integration increases market volatility, probabilistic electricity price forecasting has become essential for effective risk management. However, current proper scoring rules often prioritize forecast sharpness at the expense of calibration, leading to overconfident and statistically unreliable uncertainty estimates. This work highlights the critical gap between theoretical scoring and practical calibration, demonstrating that models can become mere proxies for deterministic forecasts when reliability is neglected. We conclude that future research must shift toward calibration-aware objectives and architectures to ensure the distributional integrity of energy market forecasts.
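The tension between sharpness and calibration can be made concrete with two standard diagnostics: empirical coverage of a prediction interval and its mean width. The sketch below uses invented prices and a nominal 90% interval; the data, the interval half-widths, and the function names are illustrative assumptions rather than anything reported in the paper.

```python
def interval_coverage(lower, upper, observed):
    """Fraction of observations that fall inside their prediction interval."""
    hits = sum(1 for lo, hi, y in zip(lower, upper, observed) if lo <= y <= hi)
    return hits / len(observed)

def mean_width(lower, upper):
    """Average interval width, a simple sharpness measure (narrower is sharper)."""
    return sum(hi - lo for lo, hi in zip(lower, upper)) / len(lower)

# Invented hourly prices and point forecasts.
observed = [40, 55, 30, 80, 45, 60, 20, 70, 50, 65]
point = [42, 50, 35, 70, 47, 58, 28, 66, 50, 61]

# Two nominal 90% intervals around the same point forecast.
sharp = ([p - 3 for p in point], [p + 3 for p in point])
wide = ([p - 9 for p in point], [p + 9 for p in point])

print(mean_width(*sharp), interval_coverage(*sharp, observed))  # 6.0 0.4
print(mean_width(*wide), interval_coverage(*wide, observed))    # 18.0 0.9
```

The narrow interval is three times sharper but covers only 40% of outcomes against a nominal 90%, which is the overconfident behavior the abstract warns about; reporting coverage alongside a score makes that failure visible.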

2606.09516 2026-06-09 cs.CV 新提交

SwiftVR: Real-Time One-Step Generative Video Restoration

SwiftVR:实时一步生成式视频恢复

Jiaqi Yan, Xiangyu Chen, Xinlin Zhong, Haibin Huang, Chi Zhang, Jie Liu, Jiantao Zhou, Xuelong Li

发表机构 * State Key Laboratory of Internet of Things for Smart City, Department of Computer and Information Science, University of Macau(澳门大学智慧城市物联网国家重点实验室) Institute of Artificial Intelligence (TeleAI), China Telecom(中国电信人工智能研究院(TeleAI)) State Key Laboratory for Novel Software Technology, Nanjing University(南京大学计算机软件新技术国家重点实验室)

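AI summary: Proposes SwiftVR, a streaming one-step generative video restoration framework that uses a causal chunk-wise protocol, mask-free shifted-window self-attention, and a lightweight Restoration-aware Autoencoder to achieve real-time high-definition video restoration on consumer-grade GPUs.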

详情
AI中文摘要

实时视频恢复(VR)用于直播流,需要在严格的每帧延迟约束下输出高分辨率结果。现有的一步扩散式VR模型由于两个主要瓶颈难以部署在消费级GPU上:高分辨率下的二次空间注意力以及大型视频自编码器的延迟-内存开销。我们提出SwiftVR,一种流式一步生成式VR框架,在因果分块协议下减少这两个瓶颈。对于注意力,无掩码移位窗口自注意力通过确定性索引将每个空间窗口聚合成密集张量,所有注意力调用都在密集缩放点积注意力路径上,无需掩码、循环移位、填充或硬件特定的稀疏核。由于SwiftVR仅使用标准密集SDPA调用,训练好的模型无需重新训练或自定义核即可迁移到消费级GPU。对于自编码,轻量级恢复感知自编码器在保持重建质量的同时实现快速分块解码。在单个H100上,SwiftVR在2560x1440分辨率下维持31 FPS,在3840x2160下维持14 FPS,而所有对比的扩散式VR基线在4K下均超出内存限制。在消费级RTX 5090上,SwiftVR在1920x1080下达到26 FPS。据我们所知,SwiftVR是首个在消费级GPU上实现实时1080p流媒体的生成式VR模型,同时以更低的推理成本获得强大的无参考感知质量。项目地址:https://h-oliday.github.io/SwiftVR。

英文摘要

Real-time video restoration (VR) for live streams requires high-resolution outputs under strict per-frame latency constraints. Existing one-step diffusion-based VR models remain difficult to deploy on consumer-grade GPUs due to two main bottlenecks: quadratic spatial attention at high resolutions and the latency-memory overhead of large video autoencoders. We present SwiftVR, a streaming one-step generative VR framework that reduces both bottlenecks under a causal chunk-wise protocol. For attention, mask-free shifted-window self-attention gathers each spatial window into a dense tensor via deterministic indexing, keeping all attention calls on the dense scaled dot-product attention path without masks, cyclic shifts, padding, or hardware-specific sparse kernels. Because SwiftVR uses only standard dense SDPA calls, the trained model transfers to consumer GPUs without retraining or custom kernels. For autoencoding, a lightweight Restoration-aware Autoencoder enables fast chunk-wise decoding while preserving reconstruction quality. On a single H100, SwiftVR sustains 31 FPS at 2560x1440 and 14 FPS at 3840x2160, whereas all compared diffusion-based VR baselines exceed the memory limit at 4K. On a consumer RTX 5090, SwiftVR reaches 26 FPS at 1920x1080. To our knowledge, SwiftVR is the first generative VR model to achieve real-time 1080p streaming on a consumer-grade GPU, while attaining strong no-reference perceptual quality with lower inference cost. The project page is available at https://h-oliday.github.io/SwiftVR.

2606.09511 2026-06-09 cs.CV 新提交

Securing Self-supervised Data Curation for Foundation Models Robustness

保障基础模型鲁棒性的自监督数据筛选

Sandeep Gupta, Roberto Passerone

发表机构 * Queen's University Belfast(贝尔法斯特女王大学) University of Trento(特伦托大学)

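AI summary: To address the data poisoning risk in self-supervised data curation, proposes PDD, an active defense mechanism built on ImageBind and traditional classifiers, which effectively detects poisoned data under multiple attacks; SVM-PDD performs best.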

详情
Comments
22 pages
AI中文摘要

自监督数据筛选为扩展和提升机器学习模型的泛化能力提供了一条途径。通过利用自监督学习(SSL)进行数据筛选,可以有效满足基础模型对大规模训练数据集的需求。SSL极大地减轻了与标注和人工数据集筛选相关的成本,同时最小化了对人工监督的需求。然而,必须严格检查SSL筛选数据集的完整性,因为依赖匿名且未经审查的外部来源会显著增加数据投毒的风险。在本文中,我们提出了一种中毒数据检测器(PDD),这是一种主动防御机制,旨在在基础模型训练之前确保SSL筛选数据集的完整性。PDD使用预训练的ImageBind模型与传统分类器(包括随机森林(RF)、k近邻(KNN)、朴素贝叶斯(NB)和支持向量机(SVM))的组合进行设计。我们使用来自三个不同数据集的176,200张图像以及三种不同的对抗攻击(涵盖分布内和分布外场景)严格评估了PDD。值得注意的是,SVM-PDD在分布内(Set3-Set5)和分布外(TrueFace和140K RealFace)数据集上均实现了优越的性能。我们的设计表现出强大的可扩展性,并通过集成方法实现了新对抗攻击检测器的快速集成。

英文摘要

Self-supervised data curation provides a pathway to scaling and improving the generalization capabilities of machine learning models. By leveraging self-supervised learning (SSL) for data curation, the demand for massive training datasets required by foundation models can be effectively met. SSL greatly alleviates the costs associated with annotation and manual dataset curation while minimizing the need for human oversight. However, the integrity of SSL-curated datasets must be rigorously checked, as reliance on anonymous and unvetted external sources can substantially increase the risk of data poisoning. In this paper, we propose a Poisoned Data Detector (PDD), an active defense mechanism designed to ensure the integrity of SSL-curated datasets prior to foundation model training. PDDs are designed using a combination of the pretrained ImageBind model and traditional classifiers, including Random Forest (RF), k-Nearest Neighbors (KNN), Naive Bayes (NB), and Support Vector Machines (SVM). We rigorously evaluated PDDs using 176,200 images from three diverse datasets and three different adversarial attacks encompassing both in-distribution and out-of-distribution scenarios. Notably, SVM-PDD achieves superior performance for both in-distribution (Set3-Set5) and out-of-distribution (TrueFace and 140K RealFace) datasets. Our design demonstrates strong scalability and enables the rapid integration of new adversarial attack detectors through an ensemble approach.

2606.09508 2026-06-09 cs.AI cs.CL 新提交

From Rigid to Dynamic: Entropy-Guided Adaptive Inference for Long-Context LLMs

从刚性到动态:面向长上下文LLM的熵引导自适应推理

Zhanchao Xu, Haoyang Li, Qingfa Xiao, Fei Teng, Chen Jason Zhang, Lei Chen, Qing Li

发表机构 * Department of Computing, PolyU(香港理工大学计算学系) DSA, HKUST(GZ)(香港科技大学(广州)数据科学与分析学域) CSE, HKUST(香港科技大学计算机科学与工程学系)

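AI summary: Proposes the EntropyInfer framework, which uses attention entropy to adaptively allocate compute during prefilling and compresses the KV cache during decoding using generated tokens, enabling efficient inference for long-context LLMs.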

详情
AI中文摘要

现有的用于长上下文LLM推理的稀疏注意力和KV缓存压缩方法通常应用固定的稀疏模式或跨所有注意力头的统一预算,忽略了头和上下文之间注意力行为的显著变化。我们观察到注意力头之间存在两种不同的熵模式:刚性头,其熵在输入段中保持接近零;动态头,其熵显著波动。至关重要的是,这些类型的分布是上下文相关的,无法离线预先确定。因此,我们提出了EntropyInfer,一个无需训练框架,在预填充期间使用注意力熵在单个头和段的粒度上自适应分配计算。对于解码,我们引入了一种潜在KV缓存压缩方案,该方案利用生成的输出令牌(而非仅预填充令牌)来识别和保留最关键的缓存条目。在Llama、Qwen和openPangu模型系列上的大量实验表明,EntropyInfer在包括SnapKV、AdaKV和CritiPrefill在内的基线上持续取得优势,在超过100k令牌的情况下实现了高达2.39倍的端到端加速,同时与全注意力相比质量下降最小。代码已发布在https://github.com/SHA-4096/EntropyInfer。

英文摘要

Existing sparse attention and KV cache compression methods for long-context LLM inference typically apply fixed sparsity patterns or uniform budgets across all attention heads, overlooking the substantial variation in attention behavior among heads and contexts. We observe two distinct entropy patterns among attention heads: Rigid Heads, whose entropy stays near zero across input segments, and Dynamic Heads, whose entropy fluctuates significantly. Crucially, the distribution of these types is context-dependent and cannot be predetermined offline. We therefore propose EntropyInfer, a training-free framework that uses attention entropy to adaptively allocate compute at the granularity of individual heads and segments during prefilling. For decoding, we introduce a latent KV cache compression scheme that leverages generated output tokens, rather than prefill tokens alone, to identify and retain the most critical cache entries. Extensive experiments on Llama, Qwen and openPangu model series show that EntropyInfer consistently outperforms baselines including SnapKV, AdaKV, and CritiPrefill, achieving up to 2.39× end-to-end speedup beyond 100k tokens with minimal quality degradation compared to full attention. The code is released at https://github.com/SHA-4096/EntropyInfer.
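The distinction between Rigid Heads and Dynamic Heads can be sketched by computing the Shannon entropy of a head's attention weights per input segment. The thresholds `low` and `spread`, the `"other"` fallback label, and the toy attention rows below are hypothetical; the abstract does not state how EntropyInfer sets its decision rule.

```python
import math

def attention_entropy(weights):
    """Shannon entropy in nats of one normalised attention distribution."""
    return -sum(w * math.log(w) for w in weights if w > 0.0)

def classify_head(segment_attn, low=0.1, spread=0.5):
    """Label a head from its per-segment entropies (both thresholds are assumed values)."""
    entropies = [attention_entropy(w) for w in segment_attn]
    if max(entropies) < low:
        return "rigid"    # entropy stays near zero in every segment
    if max(entropies) - min(entropies) > spread:
        return "dynamic"  # entropy fluctuates strongly across segments
    return "other"

# Toy attention rows, one per input segment, for two hypothetical heads.
head_a = [[0.99, 0.01, 0.0, 0.0], [0.995, 0.005, 0.0, 0.0]]
head_b = [[1.0, 0.0, 0.0, 0.0], [0.25, 0.25, 0.25, 0.25]]

print(classify_head(head_a), classify_head(head_b))  # rigid dynamic
```

Since the label depends on the attention weights of the current input, it has to be computed at inference time, which matches the abstract's point that the head types cannot be fixed offline.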

2606.09507 2026-06-09 cs.CV 新提交

Prisma-World: Camera-Controllable Multi-Agent Video World Model

Prisma-World: 相机可控的多智能体视频世界模型

Huiqiang Sun, Zhan Peng, Size Wu, Kun Wang, Kang Liao, Dianyi Wang, Xingyu Zeng, Sheng Jin, Yangguang Li, Zhiguo Cao, Ziwei Liu, Wei Li

发表机构 * School of AIA, HUST(华中科技大学人工智能与自动化学院) S-Lab, NTU(南洋理工大学S-Lab) SenseTime Research(商汤科技研究院) FDU(复旦大学) SUAT(深圳大学) HKU(香港大学) CUHK(香港中文大学)

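AI summary: Proposes Prisma-World, which achieves cross-view consistency in multi-agent video generation through a joint geometry-aware denoising process and supports flexible agent counts and camera control.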

详情
Comments
Project page: https://huiqiang-sun.github.io/prisma-world/
AI中文摘要

视频世界模型在生成可控视觉体验方面取得了快速进展,但大多数模型仍从单一观察者模拟世界。将此类模型扩展到多个智能体面临一个核心挑战:如果每个智能体的未来状态是独立生成的,重叠视角可能会实例化同一场景的不同版本,导致智能体间的物体、布局和外观不一致。传统的相机条件控制单个轨迹,但并未显式耦合在共享场景几何下应一致的视图生成。我们引入了Prisma-World,一个相机可控的多智能体世界模型,它将多智能体生成形式化为一个联合几何感知去噪过程,以实现跨视角一致性。Prisma-World在一个全注意力序列中处理所有智能体视频,使用多智能体RoPE设计来区分智能体身份同时保持同步的时间坐标,并将相对相机几何注入注意力中,使重叠视角偏向共享场景证据。为了进一步增强多视角一致性并提升全局空间感知,我们通过重叠衰减课程训练范式以及小地图条件结构指导来增强我们的框架。为了促进多智能体模型的训练和评估,我们引入了PrismaDataset,这是一个大规模UE5数据集,包含跨多样场景的全景采集、可组合的多智能体视角组(具有灵活的智能体数量和复杂的相机轨迹),以及用于一致性训练和评估的精确相机/动作标注。实验表明,单个Prisma-World模型可以生成高保真度的多智能体视频,具有灵活的智能体数量、相机可控性、改进的跨视角一致性以及在小地图引导下的空间定位。

英文摘要

Video world models have made rapid progress in generating controllable visual experiences, but most of them still simulate the world from a single observer. Extending such models to multiple agents raises a central challenge: if each agent's future state is generated independently, overlapping views may instantiate different versions of the same scene, leading to inconsistent objects, layouts, and appearances across agents. Conventional camera conditioning controls individual trajectories, but it does not explicitly couple the generation of views that should agree under shared scene geometry. We introduce Prisma-World, a camera-controllable multi-agent world model that formulates multi-agent generation as a joint geometry-aware denoising process for cross-view consistency. Prisma-World processes all agent videos within one full-attention sequence, uses a multi-agent RoPE design to distinguish agent identities while preserving synchronized temporal coordinates, and injects relative camera geometry into attention to bias overlapping viewpoints toward shared scene evidence. To further strengthen multi-view consistency and enhance global spatial perception, we augment our framework with an overlap-decaying curriculum training paradigm alongside minimap-conditioned structural guidance. To facilitate the training and evaluation of multi-agent models, we introduce PrismaDataset, a large-scale UE5 dataset with panoramic acquisition across diverse scenes, composable multi-agent view groups with flexible agent counts and complex camera trajectories, and precise camera/action annotations for consistency training and evaluation. Experiments show that a single Prisma-World model can generate high-fidelity multi-agent videos with flexible agent numbers, camera controllability, improved cross-view consistency, and spatial grounding under minimap guidance.