arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.15728 2026-05-18 cs.CV cs.AI

DecomPose: Disentangling Cross-Category Optimization Contention for Category-Level 6D Object Pose Estimation

Yifan Gao, Lu Zou, Zhangjin Huang, Guoping Wang

Affiliations Hubei Key Laboratory of Intelligent Robot, Wuhan Institute of Technology, Wuhan, Hubei, China; University of Science; Peking University, Beijing, China

AI summary This paper proposes DecomPose, a framework that disentangles cross-category optimization contention through a data-driven difficulty proxy and an asymmetric branching strategy, improving category-level 6D pose estimation.

Abstract

Category-level 6D object pose estimation is typically formulated as a multi-category joint learning problem with fully shared model parameters. However, pronounced geometric heterogeneity across categories entangles incompatible optimization signals in shared modules, resulting in gradient conflicts and negative transfer during training. To address this challenge, we first introduce gradient-based diagnostics to quantify module-level cross-category contention. Building on these diagnostic results, we propose DecomPose, a difficulty-aware decomposition framework that mitigates optimization contention via: (1) difficulty-aware gradient decoupling, which groups categories using a data-driven difficulty proxy and routes each instance to a group-specific correspondence branch to isolate incompatible updates; and (2) stability-driven asymmetric branching, which assigns higher-capacity branches to structurally simple categories as stable optimization anchors while constraining complex categories with lightweight branches to suppress noisy updates and alleviate negative transfer. Extensive experiments on REAL275, CAMERA25, and HouseCat6D demonstrate that DecomPose effectively reduces cross-category optimization contention and delivers superior pose estimation performance across multiple benchmarks.
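
The abstract does not say which gradient statistic the diagnostics use. A common way to quantify conflict between tasks or categories is the cosine similarity between their gradients on a shared module, with a negative value meaning the two updates pull in opposing directions. The sketch below illustrates that generic idea only; the function names, the category names, and the toy gradient values are all assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def pairwise_gradient_cosine(grads):
    """Cosine similarity between flattened per-category gradients of one shared module."""
    names = sorted(grads)
    sims = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            ga, gb = grads[a].ravel(), grads[b].ravel()
            denom = np.linalg.norm(ga) * np.linalg.norm(gb) + 1e-12
            sims[(a, b)] = float(ga @ gb / denom)
    return sims

def conflict_rate(sims):
    """Fraction of category pairs whose gradients have negative cosine similarity."""
    values = list(sims.values())
    return sum(v < 0 for v in values) / len(values)

# Hypothetical per-category gradients for a single shared module.
example_grads = {
    "bowl": np.array([1.0, 0.0]),
    "mug": np.array([1.0, 1.0]),
    "camera": np.array([-1.0, 0.0]),
}
example_sims = pairwise_gradient_cosine(example_grads)
example_rate = conflict_rate(example_sims)  # 2 of the 3 pairs conflict
```

Computing this per module, rather than over all parameters at once, is what would make such a diagnostic "module-level": modules with a high conflict rate are candidates for group-specific branches.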

2605.15726 2026-05-18 cs.AI cs.CL

Nudging Beyond the Comfort Zone: Efficient Strategy-Guided Exploration for RLVR

Chanuk Lee, Sangwoo Park, Minki Kang, Sung Ju Hwang

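Affiliations KAIST

AI summary This paper proposes NudgeRL, a framework that uses Strategy Nudging for structured, diversity-driven exploration, improving RLVR performance on math benchmarks more efficiently than standard GRPO and oracle-guided methods.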

Comments 28 pages, 7 figures

Abstract

Reinforcement learning with verifiable rewards (RLVR) has emerged as a scalable paradigm for improving the reasoning capabilities of large language models. However, its effectiveness is fundamentally limited by exploration: the policy can only improve on trajectories it has already sampled. While increasing the number of rollouts alleviates this issue, such brute-force scaling is computationally expensive, and existing approaches that modify the optimization objective provide limited control over what is explored. In this work, we propose NudgeRL, a framework for structured and diversity-driven exploration in RLVR. Our approach introduces Strategy Nudging, which conditions each rollout on lightweight, strategy-level contexts to induce diverse reasoning trajectories without relying on expensive oracle supervision. To effectively learn from such structured exploration, we further propose a unified objective, which decomposes the reward signal into inter- and intra-context components and incorporates a distillation objective to transfer discovered behaviors back to the base policy. Empirically, NudgeRL outperforms standard GRPO even when GRPO is given up to 8 times larger rollout budgets, and it outperforms an oracle-guided RL baseline on average across five challenging math benchmarks. These results demonstrate that structured, context-driven exploration can serve as an efficient and scalable alternative to both brute-force rollout scaling and feasibility-oriented methods based on privileged information. Our code is available at https://github.com/tally0818/NudgeRL.
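
The abstract does not give the form of the inter- and intra-context decomposition. One natural reading, assumed here purely for illustration, is a two-level baseline: the inter-context term compares each strategy context's mean reward with the global mean, and the intra-context term compares each rollout with the mean of its own context. The function name and the toy rewards below are hypothetical, and the actual NudgeRL objective may differ.

```python
import numpy as np

def decompose_advantages(rewards):
    """Split centered rewards into inter- and intra-context parts.

    rewards[c, i] is the verifiable reward of rollout i sampled under
    strategy context c (assumed layout: equal rollouts per context).
    """
    global_mean = rewards.mean()
    context_mean = rewards.mean(axis=1, keepdims=True)
    # Inter-context: how good is this strategy relative to all strategies?
    inter = np.broadcast_to(context_mean - global_mean, rewards.shape)
    # Intra-context: how good is this rollout relative to its own strategy?
    intra = rewards - context_mean
    return inter, intra

# Two hypothetical strategy contexts, four rollouts each, binary rewards.
example_rewards = np.array([[1.0, 1.0, 0.0, 0.0],
                            [0.0, 0.0, 0.0, 1.0]])
inter_adv, intra_adv = decompose_advantages(example_rewards)
```

By construction the two parts sum to the globally centered reward, so this particular split changes how credit is attributed, not how much of it there is.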

2605.15725 2026-05-18 cs.CV cs.AI cs.RO

DiLA: Disentangled Latent Action World Models

Tianqiu Zhang, Muyang Lyu, Yufan Zhang, Fang Fang, Si Wu

Affiliations Peking-Tsinghua Center for Life Sciences, Academy for Advanced Interdisciplinary Studies, IDG/McGovern Institute for Brain Research, Peking University; Center of Quantitative Biology, Peking University; School of Psychological and Cognitive Sciences, Key Laboratory of Machine Perception (Ministry of Education), Peking University

AI summary DiLA resolves the trade-off between action abstraction and generation fidelity through content-structure disentanglement, enabling high-quality video generation and action transfer.

Comments Project Page: http://disentangled-latent-action-world-models.github.io

Abstract

Latent Action Models (LAMs) enable the learning of world models from unlabeled video by inferring abstract actions between consecutive frames. However, LAMs face a fundamental trade-off between action abstraction and generation fidelity. Existing methods typically circumvent this issue by using two-stage training with pre-trained world models or by limiting predictions to optical flow. In this paper, we introduce DiLA, a novel Disentangled Latent Action world model that aims to resolve this trade-off via content-structure disentanglement. Our key insight is that disentanglement and latent action learning are co-evolving: the predictive bottleneck inherent in latent action learning serves as a driving force for disentanglement, compelling the model to distill spatial layouts into the structure pathway while offloading visual details to a separate content pathway for generation. This synergy yields a continuous, semantically structured latent action space without compromising generative quality. DiLA achieves superior results in video generation quality, action transfer, visual planning, and manifold interpretability. These findings establish DiLA as a unified framework that simultaneously achieves high-level action abstraction and high-fidelity generation, advancing the frontier of self-supervised world model learning.

2605.15723 2026-05-18 cs.LG cs.CV

GOMA: Toward Structure-Driven Multimodal Alignment from a Graph Signal Smoothing Perspective

Xu Wang, Xunkai Li, Yinlin Zhu, Rong-Hua Li, Guoren Wang

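Affiliations School of Airspace Science and Engineering, Shandong University; Department of Computer Science, Beijing Institute of Technology; School of Computer Science and Engineering, Sun Yat-sen University

AI summary GOMA uses a unified design to address topological barriers, smoothing control, and information preservation in multimodal alignment, achieving state-of-the-art or tied retrieval performance on seven multimodal attributed graph benchmarks while remaining stable.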

Abstract

Multimodal alignment is commonly learned from isolated image-text pairs via CLIP-style dual encoders, leaving the relational context among entities largely unused. Multimodal attributed graphs (MAGs), where nodes carry multimodal attributes and edges encode corpus structure, provide a natural setting for refining frozen vision-language embeddings. This refinement is challenging: visual, textual, and cross-modal relations often induce different neighborhood geometries, while unrestricted graph propagation can quickly over-smooth retrieval representations. Effectively leveraging graph context therefore requires simultaneously breaking modality-specific topological barriers, controlling the smoothing regime, and preserving informative smoothing before semantic boundaries collapse. We propose Graph-Optimized Multimodal Alignment (GOMA), a structure-driven post-alignment framework that views frozen multimodal embeddings as graph signals and addresses these requirements through a unified retrieval-oriented design. GOMA decouples three key design choices: where messages should flow, how multimodal evidence should propagate, and which smoothing depth should be retained. Concretely, it learns modality-aware propagation operators, performs finite-step coupled smoothing without diagonal cross-modal shortcuts, and adaptively reads out node-specific smoothing trajectories to preserve useful smoothing before collapse. All experiments follow a transductive MAG retrieval protocol where the graph serves only as unlabeled context and diagonal self-pair edges are removed. On seven MAG benchmarks, GOMA achieves state-of-the-art or tied state-of-the-art retrieval and remains substantially more stable than the strongest graph competitor, demonstrating that MAG structure can serve as an effective post-encoder for frozen multimodal embeddings.
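
To make "finite-step smoothing" and "node-specific readout of smoothing trajectories" concrete, here is a minimal single-graph sketch. It is only an illustration under simplifying assumptions: GOMA learns modality-aware propagation operators and couples modalities, whereas this sketch uses one fixed row-normalized adjacency and takes the per-node depth logits as given. The function names and the toy graph are hypothetical.

```python
import numpy as np

def smoothing_trajectory(adj, x, steps):
    """Return the stack [X_0, ..., X_steps] with X_k = P @ X_{k-1}."""
    a = adj.astype(float).copy()
    np.fill_diagonal(a, 0.0)                  # no self-pair edges
    deg = a.sum(axis=1, keepdims=True)
    # Row-normalize; isolated nodes keep an all-zero row.
    p = np.divide(a, deg, out=np.zeros_like(a), where=deg > 0)
    traj = [np.asarray(x, dtype=float)]
    for _ in range(steps):
        traj.append(p @ traj[-1])
    return np.stack(traj)                     # shape (steps + 1, n, d)

def node_specific_readout(traj, depth_logits):
    """Mix each node's trajectory with its own softmax weights over depths."""
    w = np.exp(depth_logits - depth_logits.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)         # shape (n, steps + 1)
    return np.einsum("nk,knd->nd", w, traj)

# Toy path graph 0 - 1 - 2 with a one-dimensional embedding per node.
path_adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
frozen_emb = np.array([[1.0], [0.0], [0.0]])
trajectory = smoothing_trajectory(path_adj, frozen_emb, steps=2)
refined = node_specific_readout(trajectory, np.zeros((3, 3)))
```

With learned logits, a node near a semantic boundary could put most of its weight on shallow depths, while a node deep inside a cluster could use deeper smoothing; uniform logits, as in the toy call, simply average the trajectory.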

2605.15722 2026-05-18 cs.LG cs.AI cs.CV eess.SP

Bidirectional Fusion Guided by Cardiac Patterns for Semi-Supervised ECG Segmentation

Jeonghwa Lim, Minje Park, Sunghoon Joo

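Affiliations VUNO Inc.

AI summary This paper proposes CardioMix, a framework that improves ECG segmentation with a bidirectional CutMix strategy guided by cardiac patterns; experiments show it outperforms existing methods across multiple datasets and labeled ratios.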

Comments 11 pages, 6 figures, 6 tables

Abstract

Accurate electrocardiogram (ECG) delineation, the segmentation of meaningful waveform features, is crucial for cardiovascular diagnostics. However, the scarcity of annotated data poses a significant challenge for training deep learning models. Conventional semi-supervised semantic segmentation (SemiSeg) methods primarily focus on consistency from unlabeled data, underutilizing the information exchange possible between labeled and unlabeled sets. To address this, we introduce CardioMix, a framework built on a bidirectional CutMix strategy guided by cardiac patterns for ECG segmentation. This approach enriches the labeled set with realistic variations from unlabeled data while simultaneously applying stronger supervisory signals to the unlabeled set, as the cardiac pattern-guided mixing ensures all augmented samples remain physiologically meaningful. Our framework is designed as a plug-and-play module that is highly compatible with various SemiSeg algorithms. Extensive experiments on SemiSegECG, a public multi-dataset benchmark for ECG delineation, demonstrate that CardioMix consistently outperforms existing CutMix-based fusion strategies across diverse datasets and labeled ratios.

2605.15721 2026-05-18 cs.CL

Contexting as Recommendation: Evolutionary Collaborative Filtering for Context Engineering

Jiachen Zhu, Zhuoying Ou, Congmin Zheng, Yuxiang Chen, Zeyu Zheng, Rong Shan, Lingyu Yang, Lionel Z. Wang, Weiwen Liu, Yong Yu, Weinan Zhang, Jianghao Lin

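Affiliations Shanghai Jiao Tong Univ.; Univ. College London; Carnegie Mellon Univ.; Hong Kong Polytechnic Univ.

AI summary This paper frames context engineering as a recommendation problem and introduces the Neural Collaborative Context Engineering framework, which performs dynamic instance-wise routing to improve personalized LLM context engineering.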

Abstract

Large Language Models (LLMs) are highly sensitive to their input contexts, motivating the development of automated context engineering. However, existing methods predominantly treat this as a global search problem, seeking a single context strategy that maximizes average performance across a dataset. This restrictive assumption overlooks the fact that different inputs often require distinct guidance, leaving substantial instance-level performance gains untapped. In this paper, we propose a paradigm shift by formulating context engineering as a recommendation problem. We introduce Neural Collaborative Context Engineering (NCCE), a framework that transitions optimization from a static global search to dynamic, instance-wise routing. NCCE first bootstraps a diverse catalog of anchor contexts and then employs a novel Context-CF Co-Evolution mechanism. This stage establishes a synergistic feedback loop: a lightweight Neural Collaborative Filtering (NCF) model learns instance-context preferences to guide the generation of specialized context variants, while the newly evaluated contexts continuously refine the NCF model's understanding of latent preferences. At inference time, the trained NCF model acts as a context router, dynamically assigning the most suitable context strategy to each unseen instance. Theoretical proofs and comprehensive experiments demonstrate that by matching individual inputs with their optimal contexts, NCCE significantly improves task accuracy, highlighting the critical importance of personalization in LLM context engineering.

2605.15720 2026-05-18 cs.CV cs.LG

Semi-MedRef: Semi-Supervised Medical Referring Image Segmentation with Cross-Modal Alignment

Yuchen Li, Zhen Zhao, Yi Liu, Luping Zhou

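Affiliations The University of Sydney; Shanghai Artificial Intelligence Laboratory; Changzhou University

AI summary This paper proposes Semi-MedRef, a framework that maintains consistency between medical images and positional language through three components; experiments show it outperforms other methods in low-label regimes.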

Abstract

Medical referring image segmentation (MRIS) requires pixel-level masks aligned with textual descriptions of anatomical locations, making annotation costly in low-label regimes. Semi-supervised learning (SSL) can mitigate this burden by leveraging unlabeled data, but its success hinges on maintaining reliable image-text alignment under perturbations. Most existing SSL-based referred segmentation methods use either independent or simplistic multi-modal perturbations (e.g., left-right flips), without fully addressing cross-modal alignment under strong augmentation, while CutMix, highly effective in single-modal SSL, remains underexplored in multi-modal settings due to its tendency to disrupt image-text coherence. We propose Semi-MedRef, a teacher-student SSL framework designed to explicitly maintain consistency between medical images and positional language through three alignment-preserving components: T-PatchMix, a cross-modal CutMix-style augmentation that synchronizes patch mixing with referring expressions via position-constrained and probability-driven rules; PosAug, a position-aware text augmentation that masks or fuzzes anatomical phrases; and ITCL, a position-guided image-text contrastive learning module, which leverages positional pseudo-labels to construct soft anatomical positives and strengthen medically grounded cross-modal alignment. Experiments on QaTa-COV19 and MosMedData+ demonstrate that Semi-MedRef consistently outperforms both fully supervised and semi-supervised baselines across all label regimes.

2605.15713 2026-05-18 cs.RO cs.AI

Learning Dynamic Pick-and-Place for a Legged Manipulator

Moonkyu Jung, Jiseong Lee, Zhengmao He, Donghoon Youm, Juhyeok Mun, HyeongJun Kim, Hyunsik Oh, Donghyuk Choi, Jungwoo Hur, Jie Song, Jemin Hwangbo

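Affiliations Robotics and Artificial Intelligence Lab, KAIST

AI summary This paper proposes a hierarchical reinforcement learning framework for dynamic pick-and-place with a legged manipulator, and validates its high success rate across payloads and workspaces in simulation and real-world experiments.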

Comments Accepted to IEEE Robotics and Automation Letters 2026

Journal ref IEEE Robotics and Automation Letters, vol. 11, no. 6, pp. 7652-7659, 2026

Abstract

Legged manipulators extend robotic capabilities beyond static manipulation by integrating agile locomotion with versatile arm control. However, achieving precise manipulation while maintaining coordinated locomotion remains a major challenge. This work presents a hierarchical reinforcement learning framework for dynamic pick-and-place tasks using a quadruped equipped with a 6-DOF robotic arm. The framework incorporates an explicit mass estimation module enabling adaptive whole-body control for objects with varying weights. In simulation, the system achieves an 86.05% success rate with payloads up to 2.3 kg. The approach is further validated through real-world experiments across six representative scenarios with controlled variations in object physical properties (size and mass) and task heights. Specifically, within a wide vertical workspace ranging from ground level to 1.1 m high tabletops, the system demonstrates an average success rate of 73.3% for payloads up to 1.3 kg, with an average execution time of 4.06 s. Unlike prior works that handle lightweight objects and execute pick-and-place with slow, piecewise motions, the proposed framework exploits concurrent locomotion and manipulation for dynamic, continuous execution. These results demonstrate the potential of quadrupedal mobile manipulators for adaptive, whole-body pick-and-place with heavier payloads and extended workspaces.

2605.15711 2026-05-18 cs.CV

EntropyScan: Towards Model-level Backdoor Detection in LVLMs via Visual Attention Entropy

Xuanyu Ge, Zhongqi Wang, Jie Zhang, Shiguang Shan, Xilin Chen

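Affiliations China University of Geosciences; University of the Chinese Academy of Sciences; Institute of Computing Technology, Chinese Academy of Sciences

AI summary This paper proposes EntropyScan, a lightweight, trigger-agnostic method for model-level backdoor detection that identifies backdoored models by quantifying structural distortions in visual attention distributions; across two LVLM architectures and three advanced attack scenarios it achieves an F1 score of 98.5% and an AUC of 96.6%.

Comments 20 pages, 6 figures, 8 tables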

Abstract

Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities across various tasks, yet they remain vulnerable to backdoor attacks. Existing defense methods predominantly focus on sample-level defense, which relies on knowledge of the training data or triggers. However, identifying whether a given model is backdoored remains a critical but unexplored task. To fill this gap, we propose EntropyScan, a lightweight and trigger-agnostic method for model-level backdoor detection in LVLMs. We first observe that backdoor injection disrupts the cross-modal alignment, resulting in pronounced structural anomalies in visual attention allocation on benign samples. Based on this insight, EntropyScan detects backdoored models by quantifying such attention deviations. Specifically, it extracts visual attention distributions from the initial layers of the Large Language Model (LLM) and applies Tsallis entropy to capture these structural distortions. By employing a reference-anchored Z-score normalization on a small set of benign samples, it effectively identifies the backdoored model. Extensive experiments across two LVLM architectures and three advanced attack scenarios show that EntropyScan achieves an F1 score of 98.5% on average and an AUC of 96.6%. Our code will be publicly available soon.
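
The two quantities named in the abstract have standard definitions, sketched below. Tsallis entropy of a distribution p with index q (q not equal to 1) is S_q(p) = (1 - sum_i p_i^q) / (q - 1), and a reference-anchored Z-score standardizes a candidate model's statistic against the mean and standard deviation of a clean reference. Everything else here is an assumption for illustration: the choice q = 2, the averaging over samples, and the toy numbers are not taken from the paper.

```python
import numpy as np

def tsallis_entropy(p, q=2.0):
    """Tsallis entropy S_q(p) = (1 - sum_i p_i^q) / (q - 1), for q != 1."""
    p = np.asarray(p, dtype=float)
    p = p / p.sum()                      # normalize attention weights
    return float((1.0 - np.sum(p ** q)) / (q - 1.0))

def reference_z_score(candidate_scores, reference_scores):
    """Standardize the candidate's mean score against a clean reference."""
    ref = np.asarray(reference_scores, dtype=float)
    return float((np.mean(candidate_scores) - ref.mean()) / (ref.std() + 1e-12))

# Attention spread evenly over four visual tokens versus fully concentrated.
uniform_entropy = tsallis_entropy([0.25, 0.25, 0.25, 0.25])
peaked_entropy = tsallis_entropy([1.0, 0.0, 0.0, 0.0])

# Hypothetical per-sample entropies for a clean reference and a suspect model.
z = reference_z_score([0.40, 0.44], [0.70, 0.74, 0.72, 0.76])
```

A detector built this way would flag a model when the magnitude of `z` exceeds some threshold; the threshold and the direction of the shift caused by a backdoor are not stated in the abstract.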

2605.15710 2026-05-18 cs.CL

SMMBench: A Benchmark for Source-Distributed Multimodal Agent Memory

Huacan Chai, Yukai Wang, Yingxuan Yang, Dan Peng, Yuanyi Song, Zhihui Fu, Weiwen Liu, Jianghao Lin, Jun Wang, Weinan Zhang

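Affiliations Shanghai Jiao Tong University, China; OPPO, China

AI summary SMMBench evaluates whether agents can perform multimodal reasoning, conflict resolution, and action prediction over evidence distributed across multiple sources, revealing the shortcomings of current systems when handling fragmented, heterogeneous data.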

Abstract

Existing benchmarks for multimodal memory reasoning largely evaluate systems within pre-assembled contexts, but under-evaluate whether agents can use evidence distributed across independently originated sources. We argue that source-distributed memory composition is an important and under-examined bottleneck in multimodal agent memory, especially when relevant evidence is fragmented across heterogeneous artifacts such as conversations, profiles, screenshots, tables, images, and documents. To address this gap, we introduce the Source-distributed Multimodal Memory Benchmark (SMMBench), which measures whether agents can retrieve, align, and compose multimodal evidence scattered across multiple sources rather than reason within a single curated context. SMMBench evaluates four core capabilities: (1) cross-source multimodal reasoning; (2) conflict resolution; (3) preference reasoning; (4) memory-grounded action prediction. The benchmark contains 1877 samples grounded in 264 sources. Experiments on representative memory-style and retrieval-based baselines show that current systems still struggle on these capabilities, positioning source-distributed multimodal memory as an important and still under-evaluated challenge for multimodal agents. Our data are available at https://huggingface.co/datasets/HuacanChai/SMMBench.

2605.15705 2026-05-18 cs.RO cs.AI

Feedback World Model Enables Precise Guidance of Diffusion Policy

Tuo An, Jindou Jia, Gen Li, Jingliang Li, Chuhao Zhou, Pengfei Liu, Bofan Lyu, Jiaqi Bai, Xinying Guo, Geng Li, Jianfei Yang

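Affiliations MARS Lab, Nanyang Technological University

AI summary This paper proposes the feedback world model, which corrects prediction errors with real-time feedback to improve robotic decision making; experiments show substantial gains in prediction accuracy and policy performance under distribution shift.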

Comments 21 pages, 9 figures

Abstract

World models aim to improve robotic decision making by predicting the consequences of actions. However, in practice, their predictions often become unreliable once the robot encounters states outside the training distribution, limiting their effectiveness at deployment. We observe that execution itself provides a natural but underutilized signal: after each action, the robot directly observes the true next state, revealing the mismatch between predicted and actual outcomes. Building on this insight, we propose the feedback world model, a new paradigm that closes the loop between prediction and observation at inference time. Instead of treating the world model as a static open-loop predictor, our method maintains a lightweight feedback state that is updated online to iteratively correct future predictions, compensating for model errors using real-time observations without additional training data or parameter updates. We show that this process can be interpreted as a latent-space observer and admits convergence guarantees under mild conditions. We further introduce action-aware guidance to better translate corrected predictions into control by emphasizing action-controllable components while suppressing irrelevant variations. Experiments on LIBERO-Plus, Robomimic, and real-world manipulation tasks demonstrate that our method substantially improves both prediction accuracy and policy performance under distribution shift. In particular, it reduces world model prediction error by up to 76.4% and improves out-of-distribution (OOD) success rate by 30%. These results show that incorporating real-time feedback at inference time provides a simple yet powerful alternative to static world modeling.
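
The observer interpretation can be illustrated with a toy linear system. The sketch assumes, for illustration only, that the frozen model predicts `a @ z` while the true dynamics add a constant unmodeled bias, and that the feedback state is an additive correction updated with a fixed gain from the prediction residual. The paper's latent-space formulation, its gain, and its convergence conditions are not given in the abstract, so every name and value below is hypothetical.

```python
import numpy as np

def run_feedback_correction(a, bias, z0, gain=0.5, steps=50):
    """Correct a frozen linear model online, without touching its weights."""
    z = np.asarray(z0, dtype=float)
    c = np.zeros_like(z)                  # feedback state, starts at zero
    residual_norms = []
    for _ in range(steps):
        predicted = a @ z + c             # frozen model output plus correction
        observed = a @ z + bias           # true next state seen after acting
        residual = observed - predicted   # mismatch revealed by execution
        residual_norms.append(float(np.linalg.norm(residual)))
        c = c + gain * residual           # observer-style update
        z = observed
    return c, residual_norms

a_model = 0.9 * np.eye(2)                 # hypothetical frozen dynamics
true_bias = np.array([0.3, -0.2])         # hypothetical distribution shift
final_c, residual_norms = run_feedback_correction(a_model, true_bias, z0=[1.0, 0.0])
```

In this toy setting the correction error shrinks by a factor of `1 - gain` per step, so any gain strictly between 0 and 2 converges; that is the kind of "mild condition" a convergence guarantee for such an update would typically rest on.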

2605.15701 2026-05-18 cs.CL cs.AI

H-Mem: A Novel Memory Mechanism for Evolving and Retrieving Agent Memory via a Hybrid Structure

Jiawei Yu, Yixiang Fang, Xilin Liu, Yuchi Ma

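Affiliations The Chinese University of Hong Kong, Shenzhen; Huawei Cloud Computing Technologies Co., Ltd.

AI summary H-Mem uses a hybrid structure to model the long-term evolution of agent memory and retrieve memory data efficiently, improving performance on question-answering tasks.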

Abstract

Memory data are ubiquitous in Large Language Model (LLM)-based agents (e.g., OpenClaw and Manus). A few recent works have attempted to exploit agents' memory to improve their performance on the question-answering (QA) task, but they lack a principled mechanism for modeling how memory data evolves over time and for retrieving memory data effectively, leading to poor memory utilization. To fill this gap, we present H-Mem, a novel memory mechanism via a hybrid structure that can not only effectively model the evolution of agent memory over a long period of time, but also provide an efficient memory retrieval approach. In particular, H-Mem builds a temporal and semantic tree structure that allows the short-term memory data to evolve progressively into long-term memory data, where the latter provides summarized information about the former, while simultaneously constructing a knowledge graph to capture the relationships between entities in memory. Moreover, it offers an effective memory retrieval approach by exploiting the hybrid of the tree and graph structures. Extensive experiments on three agent memory benchmarks show that H-Mem achieves state-of-the-art performance on the QA task.

2605.15700 2026-05-18 cs.LG

AGOP-IxG: A Gradient Covariance Filter for Local Feature Attribution on Tabular Data, with a Controlled Benchmark

Raj Kiran Gupta Katakam

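Affiliations Credit Karma

AI summary This paper proposes AGOP-IxG, a fast per-sample attribution method for tabular classifiers that pre-multiplies the per-sample gradient by a top-K rank-truncated Average Gradient Outer Product matrix, and evaluates it against four widely used baselines on a controlled tabular benchmark designed for AutoML practitioners.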

Comments 12 pages, 2 figures, 3 tables. Submitted to AutoML Conference 2026 (ABCD Track)

Abstract

Automated machine learning pipelines increasingly produce models whose predictions must be explained to end users, auditors, and downstream decision systems. The most widely used feature attribution methods (SHAP, Integrated Gradients, LIME) are typically chosen by convention rather than measured fidelity, because rigorous evaluation is impeded by the absence of ground-truth attribution on real data. We propose AGOP-IxG, a fast per-sample attribution method for tabular classifiers that pre-multiplies the per-sample gradient by a top-$K$ rank-truncated Average Gradient Outer Product matrix, and evaluate it against four widely-used baselines on a controlled tabular benchmark designed for AutoML practitioners. In Part 1, we construct three synthetic multi-class tabular tasks (linear, sparse nonlinear, interaction-based) where ground-truth attribution per sample is analytically or numerically derivable, and compare five methods: AGOP-IxG, SHAP (DeepExplainer), Integrated Gradients, InputXGradient, and LIME. AGOP-IxG leads on Spearman rank correlation and noise feature mass on all three synthetic datasets, and on top-$k$ precision on the interaction dataset. Across all settings, AGOP-IxG is approximately $350\times$ to $1{,}650\times$ faster than SHAP. In Part 2, we evaluate global faithfulness on Adult Income and Credit Card Default using the ROAR protocol; the methods cluster within $\sim 1.7\%$ relative AUC, consistent with AGOP-IxG being optimized for per-sample local attribution rather than global feature ranking.
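
The core operation described above can be sketched in a few lines of NumPy. The Average Gradient Outer Product over n samples is (1/n) sum_i g_i g_i^T; the sketch truncates it to its top-K eigenpairs, filters a sample's gradient through the result, and multiplies elementwise by the input, as the "IxG" (InputXGradient) suffix suggests. Details such as any normalization of the matrix, which model output the gradients are taken from, and the exact order of operations are assumptions here; the function names and toy gradients are hypothetical.

```python
import numpy as np

def truncated_agop(grads, k):
    """Top-k eigen-truncation of the AGOP matrix (1/n) * sum_i g_i g_i^T."""
    agop = grads.T @ grads / grads.shape[0]
    eigvals, eigvecs = np.linalg.eigh(agop)      # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:k]          # indices of the k largest
    return (eigvecs[:, top] * eigvals[top]) @ eigvecs[:, top].T

def agop_ixg(x, grad, agop_k):
    """Input times AGOP-filtered gradient, for a single sample."""
    return x * (agop_k @ grad)

# Toy gradients: feature 0 dominates, feature 1 is weaker, feature 2 is inert.
example_grads = np.array([[2.0, 0.0, 0.0],
                          [0.0, 1.0, 0.0],
                          [2.0, 0.0, 0.0],
                          [0.0, 1.0, 0.0]])
agop_1 = truncated_agop(example_grads, k=1)
example_attr = agop_ixg(np.ones(3), np.ones(3), agop_1)
```

The rank truncation acts as a filter: gradient components outside the dominant directions of variation across the dataset are zeroed, which in the toy case removes the attribution that plain InputXGradient would have assigned to features 1 and 2.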

2605.15692 2026-05-18 cs.LG stat.ML

Tighter Regret Bounds for Contextual Action-Set Reinforcement Learning

Zijun Chen, Zihan Zhang

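Affiliations Department of Computer Science and Engineering; Hong Kong University of Science and Technology

AI summary This paper studies episodic reinforcement learning with fixed reward and transition functions but episode-dependent action sets. Using the MVP algorithm, it establishes tighter regret bounds for adversarial and stochastic contexts and derives sample-complexity and gap-dependent regret bounds.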

Abstract

We study episodic reinforcement learning with fixed reward and transition functions, but with episode-dependent admissible action sets that are observed at the start of each episode. Performance is measured by cumulative regret against the episode-wise optimal value, $\sum_{k=1}^K [V^{*,M^k} - V^{\pi^k,M^k}]$, where $M^k$ represents the action context in the $k$-th episode. We show that the MVP algorithm naturally extends to this framework and enjoys strong theoretical guarantees. In particular, we establish a minimax regret bound of $\widetilde{O}(\sqrt{SAH^3K\log L})$ for adversarial contexts, where $L$ denotes the number of possible contexts. This result implies a regret bound of $\widetilde{O}(\sqrt{SAH^3K})$ for stochastic contexts. We further translate the stochastic regret guarantee into a sample complexity bound of $\widetilde{O}(SAH^3/\epsilon^2)$ for a fixed context distribution. In addition, we derive a gap-dependent regret bound of \[ \widetilde O\left( \inf_{p\in [0,1)} \left( \frac{1}{\Delta_{\min}^{p}} + pK\Delta_{\min}^{p} \right)\log K \cdot \mathrm{poly}(S,A,H) \right), \] where $\Delta_{\min}^{p}$ is the global $p$-trimmed positive-gap floor over suboptimal $(h,s,a)$ triples. This bound can substantially improve upon the minimax rate when the relevant suboptimality gaps are large.
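
The step from the stochastic regret bound to the sample complexity bound follows the usual online-to-batch pattern. The derivation below is a sketch under stated assumptions, not the paper's proof: it assumes the regret bound holds in expectation over contexts drawn from the fixed distribution, suppresses constants and logarithmic factors, and outputs a policy drawn uniformly at random from $\pi^1, \dots, \pi^K$.

```latex
% Average suboptimality of a uniformly drawn policy after K episodes
% (assumption: the stochastic-context regret bound holds in expectation).
\begin{align*}
\mathbb{E}\left[\frac{1}{K}\sum_{k=1}^{K}\left(V^{*,M^k}-V^{\pi^k,M^k}\right)\right]
  &\le \frac{1}{K}\,\widetilde{O}\left(\sqrt{SAH^3K}\right)
   = \widetilde{O}\left(\sqrt{\frac{SAH^3}{K}}\right). \\
% Requiring the right-hand side to be at most \epsilon and solving for K:
\widetilde{O}\left(\sqrt{\frac{SAH^3}{K}}\right) \le \epsilon
  &\quad\text{once}\quad
  K = \widetilde{O}\left(\frac{SAH^3}{\epsilon^2}\right).
\end{align*}
```

The $\log L$ factor from the adversarial bound does not appear here because the sample complexity statement concerns stochastic contexts, where the abstract reports a regret bound without that factor.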

2605.15689 2026-05-18 cs.CV

How to Choose Your Teacher for Fine Grained Image Recognition

如何为细粒度图像识别选择教师

Oswin Gosal, Edwin Arkel Rios, Augusto Christian Surya, Fernando Mikael, Bo-Cheng Lai, Min-Chun Hu

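Affiliations: National Tsing Hua University, Taiwan; National Yang Ming Chiao Tung University, Taiwan

AI summary: This paper proposes Ratio 1-2, a teacher-selection metric for knowledge distillation in fine-grained image recognition. Across more than one thousand experiments it improves teacher selection by 18% over previous methods, enabling small student models to gain up to 17% in accuracy.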

Comments Accepted to The 13th Workshop on Fine-Grained Visual Categorization (FGVC13) @ CVPR 2026. Main: 6 pages, 3 figures, 4 tables

详情
AI中文摘要

细粒度图像识别用于分类如鸟类物种或汽车型号等子类别。尽管最先进的模型准确率高,但往往资源消耗过大,难以部署在受限设备上。知识蒸馏通过将大教师模型的知识转移到小学生模型中解决此问题。选择合适的教师模型是关键挑战,本文引入Ratio 1-2指标,基于教师预测比例进行评估。对超过1000次实验的分析显示,该指标比先前方法提升18%,使小模型在细粒度图像识别中达到17%的准确率提升。实验代码库可在https://github.com/arkel23/FGIR-KD-Teacher获取。

英文摘要

Fine-grained image recognition classifies subcategories such as bird species or car models. While state-of-the-art (SOTA) models are accurate, they are often too resource-intensive for deployment on constrained devices. Knowledge distillation addresses this by transferring knowledge from a large teacher model to a smaller student model. A key challenge is selecting the right teacher, as it heavily impacts student performance. This paper introduces a teacher selection metric, Ratio 1-2, based on teacher prediction ratios. Extensive analysis of over one thousand experiments across 3 students, 8 teachers, and 8 datasets under 4 training strategies demonstrates that our metric improves teacher selection by 18% over previous methods, enabling small student models to achieve up to 17% accuracy gains. The experiment codebase is available at https://github.com/arkel23/FGIR-KD-Teacher.

2605.15684 2026-05-18 cs.CV

ElasticDiT: Efficient Diffusion Transformers via Elastic Architecture and Sparse Attention for High-Resolution Image Generation on Mobile Devices

ElasticDiT:通过弹性架构和稀疏注意力实现高效扩散变换器,用于移动设备上的高分辨率图像生成

Kunpeng Du, Haizhen Xie, Sen Lu, Lei Yu, Binglei Bao, Huaao Tang, Chuntao Liu, Hao Wu, Yang Zhao, Zhicai Huang, Heyuan Gao, Zhijun Tu, Jie Hu, Xinghao Chen

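Affiliations: Huawei Technologies

AI summary: This paper proposes ElasticDiT, an efficient diffusion transformer for mobile devices that combines an elastic architecture with sparse attention to balance image quality against computational cost while reducing memory footprint.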

详情
AI中文摘要

扩散变换器(DiT)架构是高保真图像生成的最新范式,支撑如Stable Diffusion-3和FLUX.1等模型。然而,将这些模型部署到资源受限的移动设备上会带来极高的计算和内存开销。尽管效率驱动的方法如Linear-DiT和静态剪枝缓解了瓶颈,但通常会带来质量下降。不同于云环境,移动约束要求一种单模型范式,能够动态平衡保真度和延迟。我们引入ElasticDiT,通过调整空间压缩比和DiT块深度实现这种动态权衡。通过整合Shift Sparse Block Attention(SSBA)和Tiny DWT-Distilled VAE(T-DVAE),ElasticDiT在保持图像质量的同时减少了推理延迟和内存占用。实验表明,ElasticDiT能够在一个参数集内覆盖广泛的保真度-延迟权衡范围。通过联合调整压缩和深度,单个ElasticDiT模型可以动态重新配置以超越任务特定的基线。具体而言,我们的flex lite变体实现了32.87的HPS,超过了Flux模型,同时通过SSBA保持84.16%的平均稀疏度质量。此外,插件式的T-DVAE仅需标准VAEs的1/8计算成本即可实现SD3级的重建,而Flow-GRPO提升了语义对齐(GenEval: 66.93到73.62)。这些结果表明,ElasticDiT提供了一种多功能、硬件适应性的解决方案,消除了对多个专用模型的需求,为未来移动设备上的高分辨率图像生成提供了有前景的路径。

英文摘要

The Diffusion Transformer (DiT) architecture is the state-of-the-art paradigm for high-fidelity image generation, underpinning models like Stable Diffusion-3 and FLUX.1. However, deploying these models on resource-constrained mobile devices entails prohibitive computational and memory overhead. While efficiency-driven approaches like Linear-DiT and static pruning alleviate bottlenecks, they often incur quality degradation. Unlike cloud environments, mobile constraints require a single-model paradigm that dynamically balances fidelity and latency. We introduce ElasticDiT, which achieves this dynamic trade-off by adjusting spatial compression ratios and DiT block depths. By integrating Shift Sparse Block Attention (SSBA) and a Tiny DWT-Distilled VAE (T-DVAE), ElasticDiT reduces inference latency and memory footprint while maintaining image quality. Experiments confirm that ElasticDiT effectively covers a wide range of fidelity-latency trade-offs within a single set of parameters. By jointly adjusting compression and depth, a single ElasticDiT model can be reconfigured on the fly to outperform task-specific baselines. Specifically, our flex lite variant achieves an HPS of 32.87, surpassing the Flux model, while maintaining competitive quality at 84.16% average sparsity through SSBA. Furthermore, the plug-and-play T-DVAE provides SD3-level reconstruction at only 1/8 the computational cost of standard VAEs, and Flow-GRPO boosts semantic alignment (GenEval: 66.93 to 73.62). These results demonstrate that ElasticDiT offers a versatile, hardware-adaptive solution that eliminates the need for multiple specialized models, providing a promising path for future high-resolution image generation on mobile devices.

2605.15682 2026-05-18 cs.CV

DreamSR: Towards Ultra-High-Resolution Image Super-Resolution via a Receptive-Field Enhanced Diffusion Transformer

DreamSR:通过增强感受野的扩散变换器实现超高清图像超分辨率

Qingji Dong, Hang Dong, Mingqin Chen, Rui Zhang, Yitong Wang

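Affiliations: ByteDance Inc.

AI summary: DreamSR uses a dual-branch MM-ControlNet and a Receptive-Field Enhancement strategy to address local over-generation and weak fine-detail synthesis in super-resolution, recovering high-quality details.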

详情
AI中文摘要

大规模预训练扩散模型因强大的生成先验通过文本引导被广泛应用于实际图像超分辨率。然而,当使用基于补丁的推理策略超分辨率处理高分辨率图像时,现有扩散基超分辨率方法常因LR图像全局提示与每次推理步骤中局部补丁不完整语义信息之间的不匹配而产生过生成问题。另一方面,现有方法由于网络设计和训练策略过度强调全局生成能力,也难以在局部补丁中生成细节纹理。为了解决这个问题,我们提出了DreamSR,一种新的超分辨率模型,通过抑制局部过生成并提高细节合成,从而实现具有超高质量细节的视觉忠实结果。具体来说,我们提出了一个双分支MM-ControlNet,其中ControlNet使用补丁级提示生成局部文本特征,而预训练的DiT使用全局提示生成全局文本特征,从而缓解过生成并确保补丁间的语义一致性。我们还设计了全面的训练策略,包含阶段特定的数据处理管道和增强感受野策略,增强模型捕捉补丁信息和有效恢复局部纹理的能力。广泛的实验表明,DreamSR优于最先进的方法,提供高质量的超分辨率结果。代码和模型可在https://github.com/jerrydong0219/DreamSR上获得。

英文摘要

Large-scale pre-trained diffusion models have been extensively adopted for real-world image super-resolution (SR) because of their powerful generative priors through textual guidance. However, when super-resolving high-resolution images with a patch-wise inference strategy, most existing diffusion-based SR methods tend to suffer from over-generation, due to the misalignment between the global prompt from the low-resolution (LR) image and the incomplete semantic information of local patches during each inference step. On the other hand, most existing methods also fail to generate detailed texture in local patches due to the overemphasis on global generation capabilities in network designs and training strategies. To address these issues, we present DreamSR, a novel SR model that suppresses local over-generation and improves fine-detail synthesis, thereby achieving visually faithful results with ultra-high-quality details. Specifically, we propose a dual-branch MM-ControlNet, where the ControlNet generates local textual features with patch-level prompts while the pre-trained DiT provides global textual features with global prompts, thereby mitigating over-generation and ensuring semantic consistency across patches. We also design a comprehensive training strategy with stage-specific data processing pipelines and a Receptive-Field Enhancement strategy, enhancing the model's capability to capture patch information and effectively restore local textures. Extensive experiments demonstrate that DreamSR outperforms state-of-the-art methods, providing high-quality SR results. Code and model are available at https://github.com/jerrydong0219/DreamSR.

2605.15677 2026-05-18 cs.CL cs.CV

VCG-Bench: Towards A Unified Visual-Centric Benchmark for Structured Generation and Editing

VCG-Bench:迈向统一的视觉导向基准,用于结构化生成与编辑

Xiaoyan Su, Peijie Dong, Zhenheng Tang, Song Tang, Yuyao Zhai, Kaitao Lin, Liang Chen, Gai Yuhang, Yuyu Luo, Qiang Wang, Xiaowen Chu

Affiliations: The Hong Kong University of Science and Technology (GuangZhou); Huawei Technologies Co., Ltd; Harbin Institute of Technology, Shenzhen; South China University of Technology

AI summary: This paper proposes VCG-Bench, a unified benchmark for visual-centric mxGraph tasks. It uses symbolic logic and XML for precise diagram generation and editing, addressing the limits of pixel-based methods on structured tasks.

Comments Accepted by ICML2026, 37 pages, 10 figures

详情
AI中文摘要

尽管视觉语言模型(VLMs)迅速发展,但在处理专业工作流程中至关重要的结构化、可控图表任务方面仍存在关键差距。现有方法主要依赖像素级合成,其在可编辑性和保真度上存在固有限制。本文提出一种新的图表即代码范式,利用mxGraph可扩展标记语言(XML)进行精确的图表生成与编辑。我们提出了VCG-Bench,一个统一的视觉导向mxGraph任务基准。VCG-Bench包括:(1)一个包含1,449种不同图表的分类数据集,涵盖6个领域和15个子领域;(2)一种整合生成(视觉到代码)和可编辑性(代码到代码)的范式定义;(3)一种定制的评估协议,采用多维指标,如mxGraph执行成功率、风格一致性分数(SCS)等。实验结果突显了当前最先进(SOTA)VLMs在结构保真度和指令合规性方面的挑战,反映了其视觉和推理能力。

英文摘要

Despite the rapid advancements in Vision-Language Models (VLMs), a critical gap remains in their ability to handle structured, controllable diagrammatic tasks essential for professional workflows. Existing methods predominantly rely on pixel-based synthesis, which operates in probabilistic pixel spaces and is inherently limited in editability and fidelity. Instead, we propose a new Diagram-as-Code paradigm with symbolic logic that leverages mxGraph Extensible Markup Language (XML) for precise diagram generation and editing. We present VCG-Bench, a unified benchmark for visual-centric mxGraph tasks. VCG-Bench comprises: (1) a taxonomized dataset of 1,449 diverse diagrams spanning 6 domains and 15 sub-domains, (2) a paradigm definition that integrates Generation (Vision-to-Code) and Editability (Code-to-Code), and (3) a Tailored Evaluation Protocol employing multi-dimensional metrics such as mxGraph Execution Success Rate and Style Consistency Score (SCS). Experimental results highlight the challenges faced by current State-of-the-Art (SOTA) VLMs in structured fidelity and instruction compliance, reflecting their vision and reasoning capabilities.
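As a rough illustration of what "diagram as code" means, the sketch below parses a tiny mxGraph XML document with Python's standard library, lists its vertices and edges, and applies a label edit. The sample diagram and the helper names `summarize` and `rename_vertex` are hypothetical, and a successful parse is only a crude stand-in for the benchmark's execution-based metrics, whose definitions are not given in the abstract.

```python
import xml.etree.ElementTree as ET

# Minimal mxGraph model: cells "0" and "1" are the usual root and default layer.
DIAGRAM = """<mxGraphModel>
  <root>
    <mxCell id="0"/>
    <mxCell id="1" parent="0"/>
    <mxCell id="2" value="Start" vertex="1" parent="1">
      <mxGeometry x="40" y="40" width="80" height="40" as="geometry"/>
    </mxCell>
    <mxCell id="3" value="End" vertex="1" parent="1">
      <mxGeometry x="200" y="40" width="80" height="40" as="geometry"/>
    </mxCell>
    <mxCell id="4" edge="1" source="2" target="3" parent="1">
      <mxGeometry relative="1" as="geometry"/>
    </mxCell>
  </root>
</mxGraphModel>"""

def summarize(xml_text):
    """Return ({vertex id: label}, [(source, target), ...]), or None if the XML is malformed."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return None
    cells = list(root.iter("mxCell"))
    vertices = {c.get("id"): c.get("value", "") for c in cells if c.get("vertex") == "1"}
    edges = [(c.get("source"), c.get("target")) for c in cells if c.get("edge") == "1"]
    return vertices, edges

def rename_vertex(xml_text, cell_id, new_label):
    """A code-to-code edit: change one label and leave every other cell untouched."""
    root = ET.fromstring(xml_text)
    for cell in root.iter("mxCell"):
        if cell.get("id") == cell_id:
            cell.set("value", new_label)
    return ET.tostring(root, encoding="unicode")

vertices, edges = summarize(DIAGRAM)
edited = rename_vertex(DIAGRAM, "3", "Done")
```

Because the edit touches a single attribute, geometry and connectivity are preserved exactly, which is the editability advantage the abstract attributes to symbolic representations over pixel-based synthesis.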

2605.15676 2026-05-18 cs.CL

Dynamic Chunking for Diffusion Language Models

扩散语言模型的动态分块

Yichen Zhu, Xiaoming Shi, Peng Zhao, Weiyu Chen, Debing Zhang, James Kwok

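Affiliations: CSE, HKUST; Xiaohongshu Inc.; Alibaba group; CityUHK

AI summary: This paper proposes the Dynamic Chunking Diffusion Model, which replaces fixed positional blocks with content-defined semantic chunks to make better use of sequence structure, and outperforms baselines on downstream benchmarks at parameter scales up to 1.5B.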

详情
AI中文摘要

块离散扩散语言模型将序列自回归地分解为固定大小的位置块,将块内并行去噪与块间条件解耦。我们认为这种刚性划分浪费了序列中已有的结构:以位置而非内容定义的块将语义连贯的token分开,将不相关的token分组。我们引入动态分块扩散模型(DCDM),用内容定义的语义分块替代位置块。其核心是Chunking Attention,一个可微层,将token路由到由可学习子空间参数化的K个聚类中,并通过扩散目标端到端塑造形状。所得聚类分配诱导出一个chunk因果注意力掩码,在此掩码下,离散扩散去噪器将序列似然自回归地分解为语义分块,严格推广块离散扩散。在参数规模达1.5B的下游任务中,DCDM在无结构和位置块扩散基线中均表现更优,优势在不同规模和训练早期均稳定。

英文摘要

Block discrete diffusion language models factorize a sequence autoregressively over fixed-size positional blocks, decoupling within-block parallel denoising from across-block conditioning. We argue that this rigid partition wastes structure already present in the sequence: blocks defined by position rather than by content separate semantically coherent tokens and group unrelated ones together. We introduce the Dynamic Chunking Diffusion Model (DCDM), which replaces positional blocks with content-defined semantic chunks. At its core is Chunking Attention, a differentiable layer that routes tokens into $K$ clusters parameterized by learnable subspaces and shaped end-to-end by the diffusion objective. The resulting cluster assignments induce a chunk-causal attention mask under which a discrete diffusion denoiser factorizes the sequence likelihood autoregressively over semantic chunks, strictly generalizing block discrete diffusion. On downstream benchmarks at parameter scales up to 1.5B, DCDM consistently improves over both unstructured and positional-block diffusion baselines, with the advantage stable across scales and visible early in training.
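The idea of a chunk-causal mask can be sketched in a few lines. The version below assumes that each token carries an integer chunk index and that chunks are ordered by that index; how DCDM actually orders its clusters or handles noised tokens is not stated in the abstract, so the function names and the ordering rule are illustrative only.

```python
import numpy as np

def chunk_causal_mask(chunk_ids):
    """mask[i, j] is True when token i may attend to token j, i.e. when
    token j belongs to the same chunk as token i or to an earlier chunk."""
    c = np.asarray(chunk_ids)
    return c[None, :] <= c[:, None]

def positional_chunks(seq_len, block_size):
    """Fixed-size positional blocks expressed as chunk ids: 0,0,...,1,1,..."""
    return np.arange(seq_len) // block_size

# Content-defined chunks need not be contiguous: tokens 0 and 2 share a chunk.
semantic_mask = chunk_causal_mask([0, 1, 0, 1])

# With positional chunk ids the same rule gives a block-causal mask.
block_mask = chunk_causal_mask(positional_chunks(4, 2))
```

Under this reading, the positional-block case is just one particular choice of `chunk_ids`, which is consistent with the abstract's claim that semantic chunking generalizes block discrete diffusion.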

2605.15675 2026-05-18 cs.LG cs.AI

Interaction-Aware Influence Functions for Group Attribution

群体属性中的交互感知影响函数

Jaeseung Heo, Kyeongheung Yun, Youngbin Choi, Sehyun Hwang, Jungseul Ok, Dongwoo Kim

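Affiliations: GSAI, POSTECH; CSE, POSTECH

AI summary: This paper proposes an interaction-aware influence function that improves group attribution by accounting for interactions between examples; experiments show it outperforms first-order influence across several settings.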

详情
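AI abstract

Influence functions approximate how removing a training example changes a quantity of interest, such as a held-out loss. To estimate the influence of a group of examples, the usual practice is to sum the individual influences. That sum cannot capture joint effects: a pair of examples may be redundant or complementary, and the sum cannot tell these cases apart. We propose an interaction-aware influence function. Expanding the target to second order around the trained parameters yields an estimator that adds to the standard sum a pairwise interaction term capturing the alignment between two examples' effects on the target. On six dataset-model pairs, the estimator tracks leave-group-out retraining substantially better than first-order influence. When used as a greedy selection rule for instruction-tuning data on Llama-3.1-8B, it beats prior influence-based and representation-similarity baselines on five of seven downstream tasks, in a regime where standard influence-based selection underperforms random selection.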

英文摘要

Influence functions approximate how removing a training example changes a quantity of interest, called the target function, such as a held-out loss. To estimate the influence of a group of examples, the standard practice is to sum the individual influences of its members. However, this sum does not capture how examples jointly affect the target: a pair of examples may be redundant or complementary, but the sum cannot distinguish these cases. We propose an interaction-aware influence function that characterizes how interactions between examples influence the target. By expanding the target to second order around the trained parameters, we obtain an estimator that augments the standard sum with a pairwise interaction term that captures the alignment between two examples' effects on the target. We empirically evaluate our estimator in two settings. First, on six dataset-model pairs spanning logistic regression, MLPs, and ResNet-9, our estimator tracks leave-group-out retraining substantially better than first-order influence across all settings. Second, when used as a greedy selection rule for instruction-tuning data on Llama-3.1-8B, it beats prior influence-based and representation-similarity baselines on five of seven downstream tasks, in a regime where standard influence-based selection underperforms random selection.
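Why a second-order expansion produces a pairwise term can be seen from a generic Taylor argument. The sketch below is not the paper's estimator: it assumes a target with gradient `g` and symmetric Hessian `B` at the trained parameters, and assumes that removing a group shifts the parameters by the sum of hypothetical per-example shifts `deltas`. Under those assumptions the group effect splits into a sum of individual effects plus cross terms $\Delta_i^\top B \Delta_j$.

```python
import numpy as np
from itertools import combinations

def target_change(g, B, delta):
    """Second-order change of the target for a parameter shift delta."""
    return g @ delta + 0.5 * delta @ B @ delta

def group_effect(g, B, deltas):
    """Split the group effect into (sum of individual effects, pairwise interactions).

    Requires B to be symmetric, so each unordered pair contributes d_i^T B d_j once.
    """
    individual = sum(target_change(g, B, d) for d in deltas)
    pairwise = sum(di @ B @ dj for di, dj in combinations(deltas, 2))
    return individual, pairwise

g = np.array([1.0, 0.0])
B = np.eye(2)

# Two examples that push the parameters in the same direction.
aligned = [np.array([1.0, 0.0]), np.array([1.0, 0.0])]
ind_a, pair_a = group_effect(g, B, aligned)
joint_a = target_change(g, B, sum(aligned))

# Two examples whose shifts cancel each other.
opposed = [np.array([1.0, 0.0]), np.array([-1.0, 0.0])]
ind_o, pair_o = group_effect(g, B, opposed)
joint_o = target_change(g, B, sum(opposed))
```

In both cases the plain sum of individual effects misses the joint effect by exactly the pairwise term, positive for the aligned pair and negative for the opposed pair, which is the kind of distinction a first-order sum cannot make.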

2605.15672 2026-05-18 cs.CV cs.AI

VLMs Trace Without Tracking: Diagnosing Failures in Visual Path Following

VLMs 跟踪无需跟踪:诊断视觉路径跟随中的失败

Hyesoo Hong, Minsoo Kim, Wonje Jeung, Sangyeon Yoon, Dongjae Jeon, Albert No

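Affiliations: Yonsei University

AI summary: This study examines VLMs on visual path-following tasks and finds that they tend to switch paths when nearby distractors look locally similar, attributing the failures to local competition.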

详情
AI中文摘要

视觉-语言模型(VLMs)在多模态基准测试中表现优异,但可能仍缺乏对基本视觉操作的鲁棒控制。我们研究了路径跟随任务,其中模型必须通过连续的局部延续跟随选定的视觉路径。为隔离这一能力,我们设计了受控的路径跟随任务,引入附近的竞争者并减少语义和拓扑模糊性,如交叉和重叠。在这些任务中,即使是最先进的VLMs也频繁失去目标路径并切换到附近的替代路径,尤其是在这些替代路径在局部上相似时。行为干预和内部分析表明,这些失败源于局部竞争:附近的相似干扰者会将模型拉离真正的延续。标准解决方案无法消除这一瓶颈:模型大小扩展只能提供有限的收益,推理部分通过成本高昂的替代策略补偿,而显式路径指示未能恢复稳定的路径跟随。最后,在复杂的电缆场景和地铁地图上测试表明,相同的路径切换失败在受控设置之外仍然存在。

英文摘要

Vision-language models (VLMs) achieve strong performance on multimodal benchmarks, but may still lack robust control over basic visual operations. We study line tracing, where a model must follow a selected visual path through successive local continuations. To isolate this ability, we design controlled tracing tasks that introduce nearby competitors while reducing semantic and topological ambiguity such as crossings and overlaps. Across these tasks, even state-of-the-art VLMs frequently lose the target path and switch to nearby alternatives, especially when those alternatives look locally similar to the target. Behavioral interventions and internal analyses indicate that these failures arise from local competition: nearby similar distractors pull the model away from the true continuation. Standard remedies do not remove this bottleneck: model-size scaling provides only limited gains, reasoning partially compensates through costly substitute strategies, and explicit tracing instructions fail to recover stable path following. Finally, tests on tangled-cable scenes and metro maps with richer visual complexity show that the same path-switching failure persists beyond our controlled settings.

2605.15669 2026-05-18 cs.LG

Rule2DRC: Benchmarking LLM Agents for DRC Script Synthesis with Execution-Guided Test Generation

Rule2DRC:用于DRC脚本合成的LLM代理基准测试

Jinuk Kim, Junsoo Byun, Donghwi Hwang, Seong-Jin Park, Hyun Oh Song

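Affiliations: Department of Computer Science and Engineering, Seoul National University; Neural Processing Research Center; Samsung Electronics Co., Ltd

AI summary: Rule2DRC is a large-scale benchmark for evaluating DRC script generation agents, with 1,000 rule-to-script tasks and 13,921 evaluation chip layouts for execution-based scoring. It provides an evaluation pipeline that measures functional correctness through DRC execution outcomes, and introduces SplitTester, which generates discriminative test cases to improve Best-of-N selection.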

Comments ICML 2026

详情
AI中文摘要

可制造的芯片布局必须满足成千上万的基于几何的设计规则,设计规则检查(DRC)通过在布局上运行可执行的DRC脚本来强制执行这些规则。将自然语言规则转换为正确的DRC脚本是劳动密集型的,需要专门的专家知识,这促使了LLM代理用于DRC脚本合成和调试。然而,现有的基准测试集较小,且通常通过代码相似性而不是执行正确性来评估脚本,而先前基于机器学习的方法要么忽略执行反馈,要么要求标签化的测试布局作为代理的输入。为此,我们引入了Rule2DRC,一个大规模的DRC脚本编码代理基准测试,包含1000个规则到脚本任务和13921个用于执行评分的评估芯片布局。Rule2DRC提供了一个评估流程,通过DRC执行结果衡量功能正确性,而无需将评估布局作为代理的输入。我们还提出了SplitTester,一个用于程序选择的测试代理,利用执行反馈生成区分性测试用例,显著提高了该领域的Best-of-N选择性能。我们已发布代码至https://github.com/snu-mllab/Rule2DRC。

英文摘要

Manufacturable chip layouts must satisfy thousands of geometry-based design rules, and design rule checking (DRC) enforces them by running executable DRC scripts on layouts. Translating natural language rules into correct DRC scripts is labor-intensive and requires specialized expertise, motivating LLM agents for DRC script synthesis and debugging. However, existing benchmarks have small evaluation sets and often evaluate scripts by code similarity rather than execution correctness, and prior machine learning-based methods either ignore execution feedback or require labeled test layouts as the agent's input. To this end, we introduce Rule2DRC, a large-scale benchmark for DRC script coding agents with 1,000 rule-to-script tasks and 13,921 evaluation chip layouts for execution-based scoring. Rule2DRC provides an evaluation pipeline that measures functional correctness via DRC execution outcomes without requiring evaluation layouts as input to the agent. We also propose SplitTester, a tester agent for program selection that uses execution feedback to generate discriminative test cases and separate previously indistinguishable candidate scripts, substantially improving Best-of-N selection performance in this domain. We release the code at https://github.com/snu-mllab/Rule2DRC.

2605.15666 2026-05-18 cs.CV

ChronoEarth-492K: A Large Scale and Long Horizon Spatiotemporal Hyperspectral Earth Observation Dataset and Benchmark

ChronoEarth-492K:一个大规模且长时域的时空超光谱地球观测数据集和基准

Haozhe Si, Yuxuan Wan, Yuqing Wang, Minh Do, Han Zhao

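Affiliations: Department of Electrical and Computer Engineering; Siebel School of Computing and Data Science; University of Illinois Urbana-Champaign

AI summary: This paper presents ChronoEarth-492K, a large-scale, temporally calibrated spatiotemporal hyperspectral dataset built from NASA's EO-1 Hyperion mission. It supports short- and long-horizon analysis and comes with a unified evaluation benchmark to advance spatiotemporal hyperspectral representation learning.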

详情
AI中文摘要

超光谱成像(HSI)为地球表面提供了密集的光谱信息,使土地覆盖和生态系统动态在材料层面得以理解。尽管近年来在超光谱自监督学习(SSL)方面取得了进展,但现有数据集仍然时间较浅,限制了长时间域时空建模的发展。为解决这一差距,我们引入ChronoEarth-492K,这是首个大规模、时间校准的超光谱SSL数据集,基于NASA的EO-1 Hyperion任务,目前是世界上持续时间最长的超光谱档案(2001-2017)。ChronoEarth-492K包含492,354个辐射校准的块,覆盖185,398个全球地点17年,其中28,786个地点包含多时间序列(≥3次观测),可支持短时间域和长时间域的分析。在此基础上,我们建立了ChronoEarth基准,一个涵盖静态、短时间域和长时间域任务的统一评估套件,由六个开源地理空间产品组成,涵盖土地覆盖、作物类型、森林动态和土壤特性。我们进一步提出了一套标准化的评估协议,并在最先进的超光谱基础模型上报告了广泛的基线结果。共同而言,ChronoEarth和基准提供了首个大规模、时间校准的平台,用于系统性的时空超光谱表示学习。

英文摘要

Hyperspectral imaging (HSI) provides dense spectral information for the Earth's surface, enabling material-level understanding of land cover and ecosystem dynamics. Despite recent progress in hyperspectral self-supervised learning (SSL), existing datasets remain temporally shallow, limiting the development of long-horizon spatiotemporal modeling. To address this gap, we introduce ChronoEarth-492K, the first large-scale, temporally calibrated hyperspectral SSL dataset built upon NASA's EO-1 Hyperion mission, the world's longest continuous hyperspectral archive to date (2001-2017). ChronoEarth-492K comprises 492,354 radiometrically harmonized patches across 185,398 global locations over 17 years, with 28,786 sites containing multi-temporal sequences ($\geq 3$ observations) that enable both short- and long-horizon temporal analysis. Building on this foundation, we establish the ChronoEarth-Benchmark, a unified evaluation suite spanning static, short-horizon, and long-horizon temporal tasks, constructed from six open-source geospatial products covering land cover, crop type, forest dynamics, and soil properties. We further introduce a standardized evaluation protocol and report extensive baseline results across state-of-the-art hyperspectral foundation models. Together, ChronoEarth-492K and the ChronoEarth-Benchmark provide the first large-scale, temporally grounded platform for systematic spatiotemporal hyperspectral representation learning.

2605.15663 2026-05-18 cs.LG

On the Power of Adaptivity for $\varepsilon$-Best Arm Identification in Linear Bandits

在线性老虎机中ε-最佳臂识别的适应性功率研究

Arnab Maiti, Yunbei Xu, Kevin Jamieson

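Affiliations: University of Washington; National University of Singapore

AI summary: This paper studies the minimax sample complexity of ε-best arm identification in linear bandits. It presents a non-adaptive fixed-design method, then analyzes adaptive sampling, showing that the benefit of adaptivity differs across action sets.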

Comments Accepted at COLT 2026

详情
AI中文摘要

我们研究了在线性老虎机中ε-最佳臂识别的最小样本复杂度。给定一个覆盖R^d的紧凑动作集X和未知奖励向量θ∈R^d,目标是输出一个动作x̂∈X,使得⟨x̂,θ⟩≥max_{x∈X}⟨x,θ⟩-ε,以概率至少1-δ使用尽可能少的样本。首先,我们提出一个非适应性固定设计方法,其样本复杂度为O(d log(1/δ)/ε² + w(X)²/ε²),其中w(X)是依赖于X的高斯宽度项,并证明了所有非适应性固定设计方法的匹配下界Ω(d log(1/δ)/ε² + w(X)²/ε²)。然后,我们转向适应性采样。我们提出一个重要的结构性问题:除了标准基底外,是否存在结构化的动作集,使得适应性仅在最优非适应性速率上提供对数因子的改进?我们对几种自然的动作集,即超立方体、l2球、m集和多任务多臂老虎机,给出了肯定回答。最后,我们提供了第一个构造的动作集X,其中适应性在每种非适应性算法上提供了多项式因子的改进。这一分离的关键成分是一个l2范数估计子程序:我们设计了一个适应性算法,使用O(d log(1/δ)/ε²)个样本从R^d中的单位l2球中输出一个估计值r̂,满足| r̂ - ||θ||_2 | ≤ ε,以概率至少1-δ,其中θ是未知奖励向量。

英文摘要

We study the minimax sample complexity of $\varepsilon$-best arm identification in linear bandits. Given a compact action set $\mathcal{X}$ that spans $\mathbb{R}^d$ and an unknown reward vector $θ\in\mathbb{R}^d$, the goal is to output an arm $\widehat{x}\in\mathcal{X}$ such that $\langle \widehat{x},θ\rangle \ge \max_{x\in\mathcal{X}} \langle x,θ\rangle - \varepsilon$ with probability at least $1-δ$, using as few samples as possible. First, we present a non-adaptive fixed-design method with sample complexity $\mathcal{O}\!\left(\frac{d\log(1/δ)}{\varepsilon^2}+\frac{w(\mathcal{X})^2}{\varepsilon^2}\right)$, where $w(\mathcal{X})$ is a Gaussian width term dependent on $\mathcal{X}$, and we prove a matching lower bound $Ω\!\left(\frac{d\log(1/δ)}{\varepsilon^2}+\frac{w(\mathcal{X})^2}{\varepsilon^2}\right)$ for all non-adaptive fixed-design methods. We then turn to adaptive sampling. We raise an important structural question: beyond the canonical basis, are there structured action sets for which adaptivity yields only logarithmic-factor improvements over the optimal non-adaptive rate? We answer in the affirmative for several natural action sets, namely the hypercube, the $\ell_2$ ball, $m$-sets, and multi-task multi-armed bandits. Finally, we provide the first construction of an action set $\mathcal{X}$ for which adaptivity yields a polynomial-factor improvement over every non-adaptive algorithm. A key ingredient behind this separation is an $\ell_2$-norm estimation subroutine: we design an adaptive algorithm that uses $\mathcal{O}\!\left(\frac{d\log(1/δ)}{\varepsilon^2}\right)$ samples from the unit $\ell_2$ ball in $\mathbb{R}^d$ and outputs an estimate $\widehat r$ satisfying $|\widehat r-\|θ\|_2|\le \varepsilon$ with probability at least $1-δ$, where $θ$ is the unknown reward vector.
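For readers unfamiliar with the non-adaptive setting, here is a minimal sketch of a fixed-design procedure: all pulls are chosen before any reward is seen, the reward vector is fitted by least squares, and the empirically best arm is returned. The design used here (repeating the basis arms) is an arbitrary assumption for illustration and is not the design analyzed in the paper; the function and variable names are likewise hypothetical.

```python
import numpy as np

def fixed_design_best_arm(arms, design, theta, noise_std, rng):
    """Pull every row of `design` once, fit theta by least squares,
    and return (index of the empirically best arm, estimated theta).

    `design` is fixed in advance, so no pull depends on earlier rewards.
    """
    noise = noise_std * rng.standard_normal(len(design))
    rewards = design @ theta + noise
    theta_hat, *_ = np.linalg.lstsq(design, rewards, rcond=None)
    return int(np.argmax(arms @ theta_hat)), theta_hat

arms = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.6]])   # toy action set
theta = np.array([1.0, 0.2])                            # unknown in practice
design = np.tile(np.eye(2), (50, 1))                    # 100 pulls of arms 0 and 1
rng = np.random.default_rng(0)

best, theta_hat = fixed_design_best_arm(arms, design, theta, noise_std=0.0, rng=rng)
```

With `noise_std=0.0` the estimate is exact and the procedure returns arm 0; with noisy rewards the number of pulls needed for an $\varepsilon$-optimal answer is what the sample-complexity bounds in the abstract quantify.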

2605.15661 2026-05-18 cs.CV cs.AI

VAGS: Velocity Adaptive Guidance Scale for Image Editing and Generation

VAGS:图像编辑与生成的速率自适应引导尺度

Yan Luo, Ahmadou Aidara, Jingyi Lu, Jeremy Moebel, Kai Han, Mengyu Wang

Affiliations: Harvard AI and Robotics Lab; Harvard University; School of Computing and Data Science; The University of Hong Kong; Kempner Institute for the Study of Natural and Artificial Intelligence

AI summary: VAGS adapts the guidance scale to improve structural fidelity and generation quality in image editing and generation, without fine-tuning or extra forward passes.

详情
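AI abstract

Classifier-free guidance (CFG) is the primary control over how strongly text semantics steer a flow-based sampler, yet standard practice fixes the guidance scale along the entire ODE trajectory. This is a fundamental mismatch: early steps are noise-dominated and carry weak semantic signal, whereas late steps commit image structure and need stronger directional commitment. More critically, the value of any guidance strength depends on whether the guided velocity agrees with or works against the model's current dynamics. This paper proposes Velocity-Adaptive Guidance Scale (VAGS), a training-free replacement that multiplies the nominal scale by a bounded factor combining a temporal signal-level term with the cosine similarity between task-relevant velocity fields. For inversion-free editing, VAGS measures the alignment between source- and target-guided velocities, so the edit strength at each step reflects the local compatibility between preservation and transformation. For generation, VAGS-Gen uses the alignment between unconditional and conditional velocities as the analogous signal. Neither variant needs fine-tuning, auxiliary networks, or extra forward passes, and fixed CFG is recovered as a special case. On PIE-Bench and DIV2K for editing, and on COCO17, CUB-200, and Flickr30K for generation, VAGS improves structural fidelity and generation quality over fixed CFG and recent training-free guidance variants. Code is publicly available at https://github.com/Harvard-AI-and-Robotics-Lab/Velocity_Adaptive_Guidance_Scale.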

英文摘要

Classifier-free guidance (CFG) is the primary control over how strongly text semantics move a flow-based sampler, yet standard practice holds its scale fixed across the entire ODE trajectory. This is a fundamental mismatch: early steps are noise-dominated and carry weak semantic signal, while late steps commit image structure and demand stronger directional commitment; more critically, the value of any guidance strength depends on whether the guided velocity is consistent with the model's current dynamics or working against them. We propose Velocity-Adaptive Guidance Scale (VAGS), a training-free replacement that multiplies the nominal scale by a bounded factor combining a temporal signal-level term with the cosine similarity between task-relevant velocity fields. For inversion-free editing, VAGS measures the alignment between source- and target-guided velocities, so edit strength at each step reflects local compatibility between preservation and transformation. For generation, VAGS-Gen uses the alignment between unconditional and conditional velocities as the analogous signal. Neither variant requires fine-tuning, auxiliary networks, or extra forward passes, and fixed CFG is recovered as a special case. On PIE-Bench and DIV2K for editing, and COCO17, CUB-200, and Flickr30K for generation, VAGS consistently improves structural fidelity and generation quality over fixed CFG and recent training-free guidance variants. The code is publicly available at https://github.com/Harvard-AI-and-Robotics-Lab/Velocity_Adaptive_Guidance_Scale.
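The abstract does not give the VAGS formula, so the following is only a sketch of the general shape it describes: a nominal CFG scale multiplied by a bounded factor built from a signal-level weight and a cosine similarity between two velocity fields. The specific rule `1 + strength * signal_level * cos`, the `strength` parameter, and all function names are assumptions, not the authors' method.

```python
import numpy as np

def cosine(u, v, eps=1e-8):
    """Cosine similarity between two flattened velocity fields."""
    u, v = np.ravel(u), np.ravel(v)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + eps))

def adaptive_scale(nominal, v_a, v_b, signal_level, strength=0.5):
    """Hypothetical adaptive scale.

    signal_level is an assumed weight in [0, 1] (0 = noise-dominated step).
    The factor stays within [1 - strength, 1 + strength], and strength = 0
    gives back the fixed nominal scale.
    """
    factor = 1.0 + strength * signal_level * cosine(v_a, v_b)
    return nominal * factor

def guided_velocity(v_uncond, v_cond, scale):
    """Standard classifier-free guidance combination of two velocities."""
    return v_uncond + scale * (v_cond - v_uncond)

v_uncond = np.array([3.0, 4.0])
v_cond = np.array([3.0, 4.0])        # perfectly aligned with v_uncond
v_orth = np.array([-4.0, 3.0])       # orthogonal to v_uncond

s_aligned = adaptive_scale(5.0, v_uncond, v_cond, signal_level=1.0)
s_orth = adaptive_scale(5.0, v_uncond, v_orth, signal_level=1.0)
s_fixed = adaptive_scale(5.0, v_uncond, v_orth, signal_level=1.0, strength=0.0)
v = guided_velocity(v_uncond, v_orth, s_orth)
```

Because the factor depends only on velocities that a CFG sampler already computes at each step, a rule of this shape needs no additional forward passes, which matches the cost claim in the abstract.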

2605.15660 2026-05-18 cs.CV

MaTe: Images Are All You Need for Material Transfer via Diffusion Transformer

MaTe:仅需图像进行材料迁移的扩散变换器

Nisha Huang, Henglin Liu, Yizhou Lin, Kaer Huang, Chubin Chen, Jie Guo, Tong-Yee Lee, Xiu Li

Affiliations: Tsinghua University; PengCheng Laboratory; Lenovo Research; National Cheng-Kung University

AI summary: MaTe performs material transfer through multi-modal attention, with no text guidance or auxiliary networks, improving generation quality and efficiency.

详情
AI中文摘要

最近的基于扩散的方法在材料迁移中依赖图像微调或复杂的架构,但面临文本依赖、额外计算成本和特征对齐挑战。为此,我们提出了MaTe,一种简化扩散框架,消除了文本指导和参考网络。MaTe在token层面整合输入图像,通过共享潜在空间中的多模态注意力实现统一处理。此设计无需额外适配器、ControlNet、反转采样或模型微调。大量实验表明,MaTe在零样本、无训练范式下实现了高质量的材料生成。它在视觉质量和效率上优于现有方法,同时保持精确的细节对齐,显著简化了推理前提。

英文摘要

Recent diffusion-based methods for material transfer rely on image fine-tuning or complex architectures with assistive networks, but face challenges including text dependency, extra computational costs, and feature misalignment. To address these limitations, we propose MaTe, a streamlined diffusion framework that eliminates textual guidance and reference networks. MaTe integrates input images at the token level, enabling unified processing via multi-modal attention in a shared latent space. This design removes the need for additional adapters, ControlNet, inversion sampling, or model fine-tuning. Extensive experiments demonstrate that MaTe achieves high-quality material generation under a zero-shot, training-free paradigm. It outperforms state-of-the-art methods in both visual quality and efficiency while preserving precise detail alignment, significantly simplifying inference prerequisites.

2605.15654 2026-05-18 cs.RO

PCASim: Promptable Closed-loop Adversarial Simulation for Urban Traffic Environment

PCASim:可提示的闭环对抗模拟用于城市交通环境

Chuancheng Zhang, Zhenhao Wang, Kaizheng Li, Yaran Lin, Qiang Guo, Bin Jiang

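Affiliations: Shenzhen Research Institute of Shandong University; School of Airspace Science and Engineering; School of Computer Science and Technology

AI summary: This paper proposes PCASim, a framework that couples adversarial scenario generation with safety-agent training to improve safety and robustness in urban traffic environments. Experiments show gains in domain-specific language generation accuracy, scenario transformation success rate, and obstacle avoidance.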

详情
AI中文摘要

现实中的自动驾驶,特别是在城市环境中存在大量边缘案例,需要严格测试以确保产品安全性和鲁棒性。然而,很少有研究探讨将对抗场景生成与安全代理在闭环测试中的训练相结合,以实现高效共演和相互增强。为了解决这一挑战,通过应用基于规则的过滤对开源数据集进行处理,并结合针对模拟环境定制的知识检索模块,构建了一个对抗行为知识库。大型语言模型(LLM)被用于整合知识驱动、数据驱动和对抗驱动的方法,生成定制化的安全关键交通场景。此外,在评估生成的场景时,使用强化学习模型训练不同类型的车辆行为,从而在不牺牲现实性的情况下丰富场景多样性。实验结果表明,所提出的框架将领域特定语言生成的准确性提高了12%。此外,新生成场景转换的成功率提高了8%,避障能力提高了30%。完整手稿请参考:https://zhenhaooo.github.io/PCASim.github.io/

英文摘要

Real-world autonomous driving, particularly in urban environments with numerous corner cases, requires rigorous testing to ensure product safety and robustness. However, few studies have explored integrating adversarial scenario generation with the training of safety agents in closed-loop testing, enabling efficient co-evolution and mutual enhancement of both. To address this challenge, an adversarial behavior knowledge repository is constructed by applying rule-based filtering to an open-source dataset, combined with knowledge retrieval modules tailored for simulation environments. A large language model (LLM) is employed to integrate knowledge-, data-, and adversarial-driven approaches, generating safety-critical traffic scenarios customized to user needs. Additionally, while evaluating the generated scenarios, we employ reinforcement learning models to train the behaviors of different types of vehicles, thereby enriching scenario diversity beyond existing datasets while preserving realism. Experimental results demonstrate that the proposed framework improves the accuracy of domain-specific language generation by 12%. Moreover, the success rate of newly generated scenario transformations increases by 8%, while obstacle-avoidance capability is enhanced by 30%. For the complete manuscript, please refer to: https://zhenhaooo.github.io/PCASim.github.io/

2605.15651 2026-05-18 cs.LG cs.AI cs.GT

Sharp Spectral Thresholds for Logit Fixed Points

Logit固定点的尖锐谱阈值

Tongxi Wang

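Affiliations: Southeast University

AI summary: This work studies the stability of logit feedback systems and proves a sharp Euclidean threshold that extends stability guarantees and identifies the point of phase transition.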

详情
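AI abstract

Softmax feedback systems are the mathematical core of entropy-regularized reinforcement learning, logit game dynamics, population choice, and mean-field variational updates. Their central stability question is simple: when does a softmax system produce a unique and globally predictable outcome? Classical theory gives a conservative answer. By treating softmax as a unit-scale response, it certifies stability only in a strongly randomized regime. We prove that the classical approach misses an entire stable regime and does not identify where the qualitative change truly occurs. For finite-dimensional affine logit systems, the sharp dimension-free Euclidean threshold is $$β\|ΠWΠ\|_{\mathcal T\to\mathcal T}<2,$$ rather than the previously used condition, which certifies stability only while the softmax system remains safely over-regularized. Our theorem fills the previously missing pre-bifurcation regime, extending stability guarantees for affine softmax feedback systems to reward-responsive yet globally predictable systems. It enlarges the certified stability boundary and identifies where the model genuinely undergoes a phase transition.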

英文摘要

Softmax feedback systems are a common mathematical core of entropy-regularized reinforcement learning, logit game dynamics, population choice, and mean-field variational updates. Their central stability question is simple: when does a self-reinforcing softmax system produce a unique and globally predictable outcome? Classical theory gives a conservative answer. By treating softmax as a unit-scale response, it certifies stability only in a strongly randomized regime. We prove that the classical approach misses an entire stable regime and does not identify the point at which the qualitative change truly occurs. For finite-dimensional affine logit systems, the sharp dimension-free Euclidean threshold is $$β\|ΠWΠ\|_{\mathcal T\to\mathcal T}<2,$$ rather than the previously used condition, which certifies stability only while the softmax system remains safely over-regularized. Our theorem fills the previously missing pre-bifurcation regime, extending stability guarantees for affine softmax feedback systems to reward-responsive yet globally predictable systems. It enlarges the certified stability boundary for these systems and identifies where the model genuinely undergoes a phase transition.
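One plausible way to read the constant 2, not necessarily the paper's argument: the softmax Jacobian $\mathrm{diag}(p) - pp^\top$ has spectral norm at most $1/2$, so the map $x \mapsto \mathrm{softmax}(\beta(Wx+b))$ is a Euclidean contraction on the simplex whenever $\beta\|\Pi W\Pi\| < 2$. The sketch below evaluates that quantity numerically. It assumes, beyond what the abstract states, that $\mathcal{T}$ is the subspace of zero-sum vectors and that $\Pi$ is the orthogonal projector onto it; the function names are hypothetical.

```python
import numpy as np

def softmax_jacobian(p):
    """Jacobian of softmax with respect to its logits, at output distribution p."""
    p = np.asarray(p, dtype=float)
    return np.diag(p) - np.outer(p, p)

def threshold_value(W, beta):
    """beta * ||Pi W Pi||_2, with Pi the projector onto zero-sum vectors (assumed T)."""
    n = W.shape[0]
    Pi = np.eye(n) - np.ones((n, n)) / n
    return beta * np.linalg.norm(Pi @ W @ Pi, 2)   # ord=2: largest singular value

def certified_stable(W, beta):
    """True when the sharp threshold from the abstract is satisfied."""
    return threshold_value(W, beta) < 2.0

W = np.eye(3)                                   # toy self-reinforcing interaction
jac_norm = np.linalg.norm(softmax_jacobian([0.5, 0.5]), 2)
inside = certified_stable(W, beta=1.5)
outside = certified_stable(W, beta=2.5)
```

For this toy `W`, a condition of the form $\beta\|W\| < 1$, which is what treating softmax as a unit-scale response would suggest, rejects $\beta = 1.5$, while the threshold from the abstract accepts it.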

2605.15650 2026-05-18 cs.RO

MyoChallenge 2025: A New Benchmark for Human Athletic Intelligence

MyoChallenge 2025: 人类运动智能的新基准

Cheryl Wang, Chun Kwang Tan, Balint K. Hodossy, Eric Lyu, Jun Guo, Wentao Zhao, Huaping Liu, Chengkun Li, Merkourios Simos, Bianca Ziliotto, Alexander Mathis, Siyuan Liu, Jiahao Chen, Shanlin Zhong, Bo Jiang, Ci Song, Yaoye Zhu, Chenhui Zuo, Yanan Sui, Mohamed Irfan Refai, Massimo Sartori, Guillaume Durandau, Vikash Kumar, Vittorio Caggiano

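Affiliations: McGill University; National University of Singapore; Imperial College London; King’s College London; Tsinghua University; EPFL; CASIA; University of Twente; MyoLab

AI summary: This paper presents the MyoChallenge 2025 benchmark, which combines high-fidelity musculoskeletal models with machine learning algorithms to advance research on motor control intelligence. It comprises two tasks, a table tennis rally and a soccer penalty kick, and promotes interdisciplinary research.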

详情
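AI abstract

Athletic performance represents the pinnacle of human motor intelligence, demanding rapid decisions, precise control, agility, and coordination. Current artificial intelligence and robotic systems struggle to replicate these capabilities. To bridge this gap in understanding, MyoChallenge 2025 established a new benchmark for motor control intelligence, combining physics simulation with machine learning algorithms. The competition has an upper-limb track and a lower-limb track, built around a table tennis rally task and a soccer penalty kick task respectively. The event attracted almost 70 teams and over 560 submissions and advanced control algorithms for musculoskeletal systems. By integrating standardized tasks with physiologically realistic models, it provides a reproducible testbed for interdisciplinary research.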

英文摘要

Athletic performance represents the pinnacle of human motor intelligence, demanding rapid choices, precise control, agility, and coordinated physical execution. Replicating this seamless combination of capabilities remains elusive in current artificial intelligence and robotic systems. Concurrently, understanding the biological mastery of these movements is hindered because complex muscle coordination is rarely measured in vivo due to the limitations of physical equipment. To bridge this fundamental gap in understanding, MyoChallenge at NeurIPS 2025 established a pioneering benchmark for motor control intelligence in sports, leveraging high-fidelity musculoskeletal models within physics simulation combined with machine learning-driven algorithms. The competition introduces two distinct tracks emphasizing either upper- or lower-limb control: a table tennis rally task using a biomechanical upper-limb model composed of an arm with a hand and a trunk; and a soccer penalty kick using a biomechanical model of legs and a trunk. Marking the fourth iteration of the MyoChallenge series, this event attracted almost 70 teams and over 560 submissions globally, uniting a diverse community ranging from physicians and neuroscientists to machine learning experts. The competition facilitated the development of several state-of-the-art control algorithms for a musculoskeletal system capable of sports agility, leveraging techniques such as physics-based motion planners, on-policy behaviour cloning, hierarchical planning, and muscle synergies. By integrating standardized tasks and physiologically realistic models into the open-source framework of MyoSuite, MyoChallenge'25 serves as a reproducible and reusable testbed to accelerate interdisciplinary research across machine learning, biomechanics, sports science, and neuroscience. Project page: https://www.myosuite.org//myochallenge/myochallenge-2025.

2605.15649 2026-05-18 cs.LG cs.NE

Towards Code-Oriented LM Embeddings for Surrogate-Assisted Neural Architecture Search


Pranav Somu, Advay Balakrishnan, Stepan Kravtsov, Aaron McDaniel, Jason Zutty

Affiliations: Georgia Institute of Technology; Georgia Tech Research Institute

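AI summary: This paper proposes a low-cost, code-oriented language model embedding strategy that leverages the inductive bias of language models to obtain effective architecture feature extractors without fine-tuning. Experiments show that it outperforms other encodings on the NAS-Bench-201 and einspace search spaces.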

Comments This is an extended version of work accepted to GECCO 2026. Our code is available at https://github.com/pcsom/cole/tree/v1.0

Details
AI-generated abstract (translated from Chinese)

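Developing effective surrogates (performance predictors) typically requires expensive fine-tuning or complex representation engineering. We propose a low-cost embedding strategy that leverages the inductive bias of language models to eliminate these overheads. By representing architectures as PyTorch class definition text, we show that off-the-shelf LMs can serve as competitive feature extractors without NAS-specific fine-tuning. The final predictor is built by passing the extracted Code-Oriented LM Embeddings (COLE) to a lightweight regression head. We also investigate strategies to improve embedding quality and utilization. Experiments on the NAS-Bench-201 and einspace search spaces show that, with frozen LMs, raw code inputs yield higher predictive performance than other text-based encodings (e.g., ONNX-to-text encodings). We also observe that COLE performs better in surrogate-assisted search with the BANANAS algorithm on NAS-Bench-201. When optimizing for CIFAR-100 performance, replacing structural path encodings with COLE reduces by 34% the evaluation budget required to reach within 1% of the test accuracy of the best architecture in the search space. Because any neural architecture can be represented as code, these findings establish COLE as a versatile and efficient foundation for advancing NAS.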

English abstract

Developing effective surrogates (performance predictors) for Neural Architecture Search (NAS) typically requires expensive fine-tuning or the engineering of complex representations. We propose a low-cost embedding strategy that leverages the inductive bias of Language Models (LMs) to eliminate these overheads. By representing architectures as PyTorch class definition text, we demonstrate that off-the-shelf LMs act as competitive feature extractors without NAS-specialized fine-tuning. The final predictor is constructed by passing the extracted Code-Oriented LM Embeddings (COLE) through a lightweight regression head. We also investigate strategies to improve embedding quality and utilization. Our experiments on the NAS-Bench-201 and einspace search spaces reveal that raw code inputs yield higher predictive performance than other text-based encodings (e.g., ONNX-to-text encodings) when using frozen LMs. We also observe that COLE drives superior surrogate-assisted search using the BANANAS algorithm in NAS-Bench-201. When optimizing for CIFAR-100 performance, replacing structural path encodings with COLE for architecture representation allows for a 34% decrease in the evaluation budget required to reach within 1% of the fittest architecture in the search space (by test accuracy). As any neural architecture can be represented as code, these findings establish COLE as a versatile and efficient foundation for advancing NAS.
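The pipeline described in the abstract (architecture as code text, frozen embedding, lightweight regression head) might be sketched roughly as follows. This is an illustrative toy under stated assumptions, not the authors' implementation: the `embed` function is a hashed bag-of-tokens stand-in for a frozen LM, and `arch_source`, `fit_linear_head`, `predict`, the toy architectures, and the accuracy values are all hypothetical names and numbers invented for the example.

```python
import hashlib
import math


def arch_source(width, depth):
    """Render a hypothetical architecture as PyTorch class-definition text."""
    layers = "\n".join(
        f"        self.conv{i} = nn.Conv2d({3 if i == 0 else width}, {width}, 3)"
        for i in range(depth)
    )
    return (
        "class Net(nn.Module):\n"
        "    def __init__(self):\n"
        "        super().__init__()\n"
        f"{layers}\n"
    )


def embed(code, dim=64):
    """Stand-in for a frozen LM embedding (assumption): hashed bag of
    whitespace-separated tokens, L2-normalised. A real setup would call an
    off-the-shelf LM instead; nothing here is fine-tuned either way."""
    vec = [0.0] * dim
    for tok in code.split():
        bucket = int(hashlib.sha256(tok.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def fit_linear_head(X, y, lr=0.1, epochs=500, l2=1e-3):
    """Lightweight regression head: ridge-regularised linear model trained
    with full-batch gradient descent."""
    n, dim = len(X), len(X[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for xi, yi in zip(X, y):
            err = sum(wj * xj for wj, xj in zip(w, xi)) + b - yi
            for j in range(dim):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * (g / n + l2 * wj) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x):
    """Predicted score for one embedded architecture."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


# Toy data: invented architectures and invented accuracies, for illustration only.
TOY_ARCHS = [arch_source(16, 1), arch_source(32, 2), arch_source(64, 3)]
TOY_ACCS = [0.61, 0.68, 0.72]

if __name__ == "__main__":
    X = [embed(code) for code in TOY_ARCHS]
    w, b = fit_linear_head(X, TOY_ACCS)
    candidate = arch_source(48, 2)  # unseen hypothetical architecture
    print("predicted score:", predict(w, b, embed(candidate)))
```

In a surrogate-assisted search loop, a predictor of this shape would rank candidate architectures so that only the most promising ones are trained and evaluated; the embedding step is the part that COLE replaces with frozen LM features.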