arXivDaily: arXiv daily academic digest, updated Monday through Friday
2403.10629 2026-05-19 cs.RO cs.SY eess.SY

Virtual Elastic Tether: a New Approach for Multi-agent Navigation in Confined Aquatic Environments

Kanzhong Yao, Xueliang Cheng, Keir Groves, Barry Lennox, Ognjen Marjanovic, Simon Watson

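AI summary: This paper proposes the Virtual Elastic Tether (VET) approach to address the challenges of multi-agent navigation in underwater environments, achieving more stable navigation under incomplete state measurements.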

Comments This work has been submitted to Wiley for possible publication

Abstract

Underwater navigation is a challenging area in the field of mobile robotics due to inherent constraints in self-localisation and communication in underwater environments. Some of these challenges can be mitigated by using collaborative multi-agent teams. However, when applied underwater, the robustness of traditional multi-agent collaborative control approaches is highly limited due to the unavailability of reliable measurements. In this paper, the concept of a Virtual Elastic Tether (VET) is introduced in the context of incomplete state measurements, which represents an innovative approach to underwater navigation in confined spaces. The concept of VET is formulated and validated using the Cooperative Aquatic Vehicle Exploration System (CAVES), which is a sim-to-real multi-agent aquatic robotic platform. Within this framework, a vision-based Autonomous Underwater Vehicle-Autonomous Surface Vehicle leader-follower formulation is developed. Experiments were conducted both in simulation and on a physical platform, benchmarked against a traditional Image-Based Visual Servoing approach. Results indicate that the formation of the baseline approach fails under discrete disturbances when induced distances between the robots exceed 0.6 m in simulation and 0.3 m in the real world. In contrast, the VET-enhanced system recovers to pre-perturbation distances within 5 seconds. Furthermore, results illustrate the successful navigation of VET-enhanced CAVES in a confined water pond where the baseline approach fails to perform adequately.
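
One way to picture the tether idea is as a virtual spring-damper acting between leader and follower. The abstract does not give the VET formulation, so the force law, the gains `k` and `c`, and the `rest_length` below are illustrative assumptions rather than the authors' controller.

```python
import math

def tether_force(leader_xy, follower_xy, follower_vel_xy,
                 rest_length=0.5, k=2.0, c=0.5):
    """Hypothetical virtual spring-damper pulling the follower toward the leader.

    The tether is slack (zero force) while the separation is within
    rest_length, and pulls in proportion to the stretch beyond it.
    """
    dx = leader_xy[0] - follower_xy[0]
    dy = leader_xy[1] - follower_xy[1]
    dist = math.hypot(dx, dy)
    if dist <= rest_length:
        return (0.0, 0.0)
    ux, uy = dx / dist, dy / dist  # unit vector from follower to leader
    stretch = dist - rest_length
    # Speed of the follower along the tether direction; damping opposes it.
    closing_speed = follower_vel_xy[0] * ux + follower_vel_xy[1] * uy
    magnitude = k * stretch - c * closing_speed
    return (magnitude * ux, magnitude * uy)
```

With these assumed gains, a stationary follower 1 m from the leader feels a pull of 1.0 along the tether, and the pull drops as the follower starts closing the gap.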

2308.06197 2026-05-19 cs.CV cs.AI cs.LG

Complex Facial Expression Recognition Using Deep Knowledge Distillation of Basic Features

Angus Maiden, Bahareh Nakisa

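AI summary: This paper proposes a continual learning method that uses knowledge distillation and a novel Predictive Sorting Memory Replay to achieve state-of-the-art complex facial expression recognition, accurately recognising new compound expression classes from few samples.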

Comments 13 pages, 9 figures, 6 tables, 3 algorithms. Code available at https://github.com/AngusMaiden/complex-FER

Abstract

Complex emotion recognition is a cognitive task that has so far eluded the same excellent performance of other tasks that are at or above the level of human cognition. Emotion recognition through facial expressions is particularly difficult due to the complexity of emotions expressed by the human face. For a machine to approach the same level of performance in complex facial expression recognition as a human, it may need to synthesise knowledge and understand new concepts in real-time, as humans do. Humans are able to learn new concepts using only few examples by distilling important information from memories. Inspired by human cognition and learning, we propose a novel continual learning method for complex facial expression recognition that can accurately recognise new compound expression classes using few training samples, by building on and retaining its knowledge of basic expression classes. In this work, we also use GradCAM visualisations to demonstrate the relationship between basic and compound facial expressions. Our method leverages this relationship through knowledge distillation and a novel Predictive Sorting Memory Replay, to achieve the current state-of-the-art in continual learning for complex facial expression recognition, with 74.28% Overall Accuracy on new classes. We also demonstrate that using continual learning for complex facial expression recognition achieves far better performance than non-continual learning methods, improving on state-of-the-art non-continual learning methods by 13.95%. Our work is also the first to apply few-shot learning to complex facial expression recognition, achieving the state-of-the-art with 100% accuracy using only a single training sample per class.
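
For readers unfamiliar with knowledge distillation, the sketch below shows the widely used temperature-softened formulation, in which a student is penalised for diverging from a teacher's softened class probabilities. The abstract does not specify the loss used in this paper, so treat the function names and the temperature value as assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(pt * math.log(pt / ps)
             for pt, ps in zip(p_teacher, p_student) if pt > 0)
    return temperature ** 2 * kl
```

The loss is zero when student and teacher logits agree and grows as the student's predictions drift, which is what lets a model retain knowledge of basic expression classes while learning new ones.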

2307.12405 2026-05-19 cs.LG

Optimal Control of Multiclass Fluid Queueing Networks: A Machine Learning Approach

Dimitris Bertsimas, Cheol Woo Kim

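AI summary: This paper proposes a machine learning approach to the optimal control of multiclass fluid queueing networks (MFQNETs) that yields explicit and insightful control policies. It proves that a piecewise constant optimal policy exists and learns the policy with OCT-H; in experiments on large networks, the learned policies reach 100% accuracy on the test set.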

Abstract

We propose a machine learning approach to the optimal control of multiclass fluid queueing networks (MFQNETs) that provides explicit and insightful control policies. We prove that a piecewise constant optimal policy exists for MFQNET control problems, with segments separated by hyperplanes passing through the origin. We use Optimal Classification Trees with hyperplane splits (OCT-H) to learn an optimal control policy for MFQNETs. We use numerical solutions of MFQNET control problems as a training set and apply OCT-H to learn explicit control policies. Furthermore, we show that both the theoretical results and the proposed algorithm extend to robust MFQNETs with uncertain service and arrival rates. We report experimental results with up to 33 servers and 99 classes that demonstrate that the learned policies achieve 100% accuracy on the test set. While the offline training of OCT-H can take days in large networks, the online application takes milliseconds.
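
The structural result above can be illustrated with a toy tree whose splits are hyperplanes through the origin: each node tests only the sign of `w . x`, with no intercept. The tree, its weights, and the action labels below are hypothetical and are not taken from the paper.

```python
def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

# Hypothetical tree for a two-class network; x holds the two queue levels.
# No intercept term appears, so every separating hyperplane contains the origin.
TREE = {
    "w": (1.0, -2.0),
    "pos": {"action": "serve_class_1"},
    "neg": {
        "w": (-1.0, 1.0),
        "pos": {"action": "serve_class_2"},
        "neg": {"action": "split_effort"},
    },
}

def policy(x, node=TREE):
    """Walk the tree until a leaf is reached and return its constant action."""
    while "action" not in node:
        node = node["pos"] if dot(node["w"], x) >= 0 else node["neg"]
    return node["action"]
```

Because the sign of `w . x` is unchanged when `x` is multiplied by a positive constant, such a policy depends only on the proportions of the queue levels, not on their overall scale: `policy((3, 2))` and `policy((6, 4))` return the same action.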

2303.11675 2026-05-19 cs.CV

ReBaR: Reference-Based Reasoning for Robust Pose Estimation from Monocular Images

Yongkang Cheng, Mingjiang Liang, Jifeng Ning, Gaoge Han, Wei Liu, Shaoli Huang

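AI summary: This paper proposes ReBaR, which learns reference features to address occlusion and depth ambiguity, enabling robust human pose and shape estimation from monocular images.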

Comments Accepted by Pattern Recognition

Abstract

ReBaR (Reference-Based Reasoning for Robust Human Pose and Shape Estimation) is designed to estimate human body shape and pose from single-view images. ReBaR effectively addresses the challenges of occlusions and depth ambiguity by learning reference features for part regression reasoning. Our approach starts by extracting features from both body and part regions using an attention-guided mechanism. Subsequently, these features are used to encode additional part-body dependencies for individual part regression, with part features serving as queries and the body feature as a reference. This reference-based reasoning allows our network to infer the spatial relationships of occluded parts with the body, utilizing visible parts and body reference information. ReBaR outperforms contemporary methods on three benchmark datasets and still maintains competitive advantages among recent new approaches, demonstrating significant improvement in handling depth ambiguity and occlusion. These results strongly support the effectiveness of our reference-based framework for estimating human body shape and pose from single-view images.

2204.01611 2026-05-19 cs.AI

A Machine With Human-Like Memory Systems

Taewoon Kim, Michael Cochez, Vincent Francois-Lavet, Mark Neerincx, Piek Vossen

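AI summary: This paper proposes an agent with both semantic and episodic memory, shows that the dual-memory system outperforms a single memory system, and evaluates it in the authors' own environment, "the Room".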

Comments Submitted to Human-Centered Design of Symbiotic Hybrid Intelligence 2022 (https://ii.tudelft.nl/humancenteredsymbioticHI/)

Abstract

Inspired by the cognitive science theory, we explicitly model an agent with both semantic and episodic memory systems, and show that it is better than having just one of the two memory systems. In order to show this, we have designed and released our own challenging environment, "the Room", compatible with OpenAI Gym, where an agent has to properly learn how to encode, store, and retrieve memories to maximize its rewards. The Room environment allows for a hybrid intelligence setup where machines and humans can collaborate. We show that two agents collaborating with each other results in better performance than one agent acting alone.

2605.17368 2026-05-19 cs.CV

RadGenome-Anatomy: A Large-Scale Anatomy-Labeled Chest Radiograph Dataset via Physically Grounded Volumetric Projection

Shuchang Ye, Mingyuan Meng, Hao Wang, Usman Naseem, Jinman Kim

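AI summary: This paper presents RadGenome-Anatomy, a large-scale anatomy-labeled chest radiograph dataset with over 10 million segmentation masks, built via physically grounded volumetric projection to support medical image segmentation and diagnostic tasks.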

Abstract

Anatomical structure labels for chest radiographs are essential for medical image segmentation and a broad range of downstream diagnostic tasks. However, annotating anatomy directly on 2D chest radiographs is labor-intensive and intrinsically ambiguous, as 3D anatomical structures are projected onto a single 2D plane where boundaries may overlap, be occluded, or appear only partially visible. Consequently, existing anatomy-labeled chest radiograph datasets remain limited in scale, anatomy coverage, and label reliability. To address these limitations, we introduce RadGenome-Anatomy, the largest anatomy-labeled chest radiograph dataset, containing over 10 million segmentation masks across 210 anatomical structures in 25,692 studies. It is constructed by projecting large-scale 3D anatomical masks from CT volumes into 2D radiographic space through canonical radiographic geometry. This shifts annotation from directly tracing uncertain 2D boundaries to defining anatomy in volumetric space, where structures that overlap or become partially invisible in radiographs remain spatially separable. As a result, each 2D mask represents the physically grounded projected footprint of a volumetrically defined structure. The scale and broad anatomical coverage of RadGenome-Anatomy, including structures that are overlapping, partially visible, or difficult to delineate directly, enable research on geometric measurements as explicit evidence for chest radiograph interpretation. We demonstrate this by training XAnatomy to predict structure-specific masks and derive clinically relevant measurements, achieving diagnostic accuracies of 96.4%, 95.6%, and 89.2% for cardiomegaly, kyphosis, and scoliosis, respectively.
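
The projection idea can be sketched in a few lines: a 3D mask is collapsed along the ray direction into a 2D footprint, and a geometric measurement is then read off the footprints. The sketch assumes a parallel-beam simplification rather than the canonical radiographic geometry the abstract mentions, and the cardiothoracic ratio is used here only as a familiar example of a width-based measurement; all function names are hypothetical.

```python
def project_mask(volume):
    """Collapse a 3D binary mask indexed [depth][row][col] into a 2D footprint.

    A pixel is set if any voxel along the depth (ray) axis is set.
    This is a parallel-beam simplification.
    """
    depth, rows, cols = len(volume), len(volume[0]), len(volume[0][0])
    return [[int(any(volume[d][r][c] for d in range(depth)))
             for c in range(cols)] for r in range(rows)]

def horizontal_extent(mask2d):
    """Width in pixels of the bounding box of all set pixels (0 if empty)."""
    cols = [c for row in mask2d for c, v in enumerate(row) if v]
    return max(cols) - min(cols) + 1 if cols else 0

def cardiothoracic_ratio(heart3d, thorax3d):
    """Ratio of projected heart width to projected thorax width."""
    heart_w = horizontal_extent(project_mask(heart3d))
    thorax_w = horizontal_extent(project_mask(thorax3d))
    return heart_w / thorax_w
```

Because each structure is projected from its own volumetric mask, two footprints may overlap in 2D while remaining separate labels, which is the property the abstract highlights.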

2605.17367 2026-05-19 cs.CV

Bridging Data Trials and Task Barriers: A Unified Framework for Sketch Biometric Identification

Decheng Liu, Bin Hu, Xinbo Gao, Dawei Zhou, Chunlei Peng, Nannan Wang, Ruimin Hu

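AI summary: This paper proposes a unified framework that addresses the cross-modality and cross-task challenges of sketch biometric identification, combining efficient synthetic sketch generation with task-sequential continual learning to improve model robustness and generalization.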

Comments The source code and models are publicly available at https://github.com/sHanbIgsUn/UFSB

Abstract

Different from existing cross-modality identification tasks (e.g., heterogeneous face recognition, sketch re-identification, etc.), we introduce a novel yet practical setting for these related identification tasks, named sketch biometric identification, which aims to continually train a unified model across different data domains and even across diverse identification tasks. Sketch biometric identification faces challenges, including scarce real sketch data, high annotation costs, privacy risks, and insufficient generalization ability of cross-task models. Existing methods usually rely on limited real data or single-task optimization, making it difficult to effectively address the joint cross-modality and cross-task challenges. This paper proposes a unified framework that integrates efficient synthetic sketch generation and task-sequential continual learning. First, we design an efficient pipeline to generate large-scale, high-quality synthetic person and face sketch data, which significantly reduces costs and avoids privacy risks. Meanwhile, we enhance the model's robustness by fusing real data. Second, we construct a universal unified framework for sketch biometric identification, which adopts a task-sequential training strategy: the model first completes sketch person re-identification learning on the person dataset; subsequently, it maintains the acquired person recognition capability through a trusted sample replay technique and seamlessly performs incremental training on the face dataset. This enables a single model to handle multiple sketch biometric identification tasks simultaneously. To support the study of sketch biometric identification, we built a new large-scale benchmark, SketchUnified-BioID, with several practical evaluation protocols.

2605.17365 2026-05-19 cs.CV

Memory-Augmented Query Intent Understanding for Efficient Chat-based Image Retrieval

Xianke Chen, Daizong Liu, Yushuo Lou, Xin Tan, Xun Yang, Shuhui Wang, Xun Wang, Jianfeng Dong

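AI summary: This paper proposes MAQIU, an efficient memory-augmented query intent understanding framework for chat-based image retrieval. It dynamically aggregates and evolves a semantic representation of query intent, prevents intent forgetting, and strengthens long-term semantic integrity, achieving substantial performance gains while maintaining high computational efficiency.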

Abstract

Different from traditional text-to-image retrieval tasks, chat-based image retrieval allows the human-interactive system to iteratively clarify and refine user intent through multi-round dialogue, thereby achieving more fine-grained retrieval results. The key challenge in this task lies in dynamically understanding and updating the user's query intent across dialogue rounds. Although existing works have achieved great performance on this new task, they simply handle history query information either by directly concatenating all previous queries into a long textual sequence or by relying on large language models to reconstruct the current query from history. Such strategies are computationally redundant and easily lead to inconsistent intent representations as the dialogue progresses. To alleviate these issues, this paper proposes a novel and efficient memory-based user intent updating framework for the chat-based image retrieval task, called Memory-Augmented Query Intent Understanding (MAQIU). It introduces a lightweight memorization module that dynamically aggregates and evolves the semantic representation of query intent across dialogues, while a memory recall mechanism is further employed to prevent intent forgetting and enhance long-term semantic integrity. In addition, MAQIU also integrates historical image retrieval results as visual guidance, allowing the model to strengthen cross-round correlations and refine current visual understanding. Extensive experiments demonstrate that MAQIU achieves substantial performance gains while maintaining high computational efficiency, reducing dialogue encoding FLOPs by 86.4% compared with the prior baseline ChatIR. Source code is available at https://github.com/HuiGuanLab/MAQIU.

2605.17364 2026-05-19 cs.CL cs.IR

NewsLens: A Multi-Agent Framework for Adversarial News Bias Navigation

Joy Bose

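AI summary: This paper proposes NewsLens, a multi-agent framework in which five agents collaborate to deconstruct news articles and expose ideological omissions, rhetorical manipulation, and framing boundaries. It is evaluated with Qwen2.5-3B-Instruct and Mistral 7B across several geopolitical event clusters.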

Comments 17 pages, 2 figures, 7 tables, 1 appendix

Abstract

Media bias detection has predominantly been framed as a classification task: assign a political label to an article or outlet. We argue this framing is too shallow: it identifies that bias exists but not where, how, or crucially, what is structurally omitted. We present NewsLens, a five-agent adversarial pipeline for structured news bias navigation. A Fact Verifier, Progressive Framing Analyst, Conservative Framing Analyst, Propaganda Detector, and Neutral Summarizer collaborate to deconstruct articles into interpretable framing maps, exposing ideological omissions, rhetorical manipulation, and framing boundaries. The system is evaluated on 15 articles across four geopolitical event clusters (India-Pakistan Kashmir, Gaza, Climate Policy, Ukraine) using Qwen2.5-3B-Instruct (4-bit quantised, Google Colab T4), with cross-model validation using Mistral 7B on the Kashmir cluster. Center outlets show the highest mean Perspective Divergence Score (PDS: Qwen 0.907, Mistral 0.729 on Kashmir subset); conservative-framing outlets show the highest mean Manipulation Index (MI: 0.600 across both models). Cross-model comparison shows high consistency for high-propaganda content (Republic World delta-PDS=0.125, MI=0.8 both models) and greater variance for nuanced reporting. Mann-Whitney U tests find no statistically significant between-group differences at n=15, reported honestly as a sample-size limitation confirmed by post-hoc power analysis. A partial ablation removing the Propaganda Detector shows degraded omission precision in the Neutral Summarizer output. The architecture extends prior lexical-geometric bias work to agentic LLM reasoning, and is fully reproducible using open-weight models without API keys.
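
To see why a Mann-Whitney U test has little power at small sample sizes, it helps to compute the statistic by hand. The sketch below counts pairwise wins between two groups and gives the smallest exact two-sided p-value that is attainable when there are no ties. The scores in the usage note are made up for illustration and are not the paper's data.

```python
import math

def mann_whitney_u(a, b):
    """Return (U_a, U_b): pairs where a beats b, with ties counting 0.5."""
    u_a = 0.0
    for x in a:
        for y in b:
            if x > y:
                u_a += 1.0
            elif x == y:
                u_a += 0.5
    u_b = len(a) * len(b) - u_a
    return u_a, u_b

def smallest_two_sided_p(n_a, n_b):
    """Smallest exact two-sided p-value attainable without ties.

    Only one of the C(n_a + n_b, n_a) rank arrangements yields each extreme
    value of U, so the two-sided p-value cannot fall below 2 / C(...).
    """
    return 2 / math.comb(n_a + n_b, n_a)
```

For example, `mann_whitney_u([0.9, 0.8, 0.7], [0.6, 0.75])` gives `(5.0, 1.0)`, and with groups of 3 and 2 the p-value can never drop below 0.2, so no difference could reach the usual 0.05 level regardless of effect size.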

2605.17362 2026-05-19 cs.LG

Learning Fill-in Reduction Ordering via Graph Policy Optimization for Sparse Matrices

Ziwei Li, Shuzi Niu, Huiyuan Li, Tao Yuan, Wenjia Wu

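AI summary: This paper proposes a graph policy optimization method that models fill-in from global and local views to reduce fill-in and memory use in sparse matrix solvers; experiments on the SuiteSparse Matrix Collection show substantial improvements.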

Comments Accepted by ICASSP 2026

Abstract

Matrix reordering in large sparse solvers seeks a permutation that minimizes factorization fill-in to reduce memory and computation. Because the minimum fill-in ordering problem is NP-complete and fill-in is implicit in the sparsity pattern, graph-theoretic heuristics are used. Existing reinforcement learning methods either ignore sparsity patterns--missing the global fill-in--or lack local exact fill-in feedback. We propose a graph policy optimization method, modeling fill-ins from global and local views: both the policy and value networks use a multi-hop graph neural backbone to embed global fill-in; the policy further interacts with symbolic factorization over graphs to extract local, step-level fill-ins, and the resulting feedback is aligned with the value network via an adaptive saturation function to improve convergence. On the SuiteSparse Matrix Collection, our method achieves mean reductions of 29.3 in fill-ins and 31.3 in peak memory usage over state-of-the-art baselines.
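
The dependence of fill-in on the elimination order can be shown with the standard graph model of symbolic factorization for a symmetric sparsity pattern: eliminating a vertex connects all of its remaining neighbours, and each newly created edge is one fill-in. The sketch below is a plain counting routine with a hypothetical function name, not the paper's learned policy.

```python
def count_fill_in(edges, order):
    """Count fill edges created by eliminating vertices in the given order."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    fill = 0
    for v in order:
        nbrs = adj.pop(v, set())
        for n in nbrs:
            adj[n].discard(v)
        # Eliminating v turns its remaining neighbours into a clique.
        nbrs = sorted(nbrs)
        for i, a in enumerate(nbrs):
            for b in nbrs[i + 1:]:
                if b not in adj[a]:
                    adj[a].add(b)
                    adj[b].add(a)
                    fill += 1
    return fill
```

On a star graph with centre 0 and leaves 1 to 4, eliminating the centre first creates 6 fill edges, while eliminating the leaves first creates none. A reordering method searches for orders of the second kind.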

2605.17361 2026-05-19 cs.LG cs.AI

MasFACT: Continual Multi-Agent Topology Learning via Geometry-Aware Posterior Transfer

Xuefei Wang, Jialu Wang, Fengbo Zhang, Yihan Hu, Di Zhang, Yutong Ye, Yikun Ban, Jun Han, Ruijie Wang

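AI summary: This paper proposes MasFACT, a geometry-aware posterior transfer framework that addresses topology forgetting caused by adapting multi-agent systems to new tasks, improving accuracy and topology stability in continual learning settings.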

Abstract

Multi-agent systems (MAS) powered by large language models (LLMs) have emerged as a powerful paradigm for complex problem solving, where performance critically depends on the underlying inter-agent communication topology. However, existing topology generation methods mainly optimize for isolated tasks, while real-world deployments involve streams of evolving tasks, requiring previously effective collaboration patterns to be retained and reused rather than rediscovered or overwritten. We identify a previously underexplored failure mode, topology forgetting, in which adapting to new tasks shifts the topology generator away from communication structures required by earlier tasks. This issue stems from cross-task misalignment in both agent-level functional semantics and relational communication structures. To address this challenge, we propose MasFACT, a geometry-aware posterior transfer framework that preserves and reuses historical collaboration knowledge as transferable topology priors. We transfer these priors across task-specific agent spaces through Fused Gromov-Wasserstein optimal transport and perform PAC-Bayes-guided conservative posterior adaptation to balance task-specific plasticity with structural stability. Experiments across class-, domain-, and task-level continual settings demonstrate that MasFACT consistently improves average accuracy while reducing topology forgetting compared to strong topology generation and replay-based baselines, and can be seamlessly integrated with different MAS topology generators.

2605.17360 2026-05-19 cs.CV

Omni-DuplexEval: Evaluating Real-time Duplex Omni-modal Interaction

Chaoqun He, Mingyang Xiang, Yingjing Xu, Bokai Xu, Junbo Cui, Jie Zhou, Yuan Yao, Lijie Wen

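AI summary: This paper proposes the Omni-DuplexEval benchmark for systematically evaluating real-time duplex interaction. Two complementary scenarios assess a model's ability to generate continuous responses and proactive reminders, revealing the limitations of existing models in balancing response timeliness with content coherence.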

Comments 22 pages, 6 figures

Abstract

Real-time duplex interaction is essential for multimodal AI systems operating in real-world scenarios, where models must continuously process streaming inputs and respond at appropriate moments. However, most existing multimodal large language models (MLLMs) are evaluated in offline settings, where the entire video input is processed before any response is generated. While recent work has started to explore real-time duplex MLLMs, there is still no comprehensive benchmark or automatic evaluation method for this setting. To address this gap, we propose Omni-DuplexEval, a benchmark for systematically evaluating real-time duplex interaction. The benchmark consists of two complementary scenarios: (1) Real-Time Description, which evaluates the ability to generate continuous, time-aligned responses that track evolving multimodal inputs, and (2) Proactive Reminder, which evaluates the ability to identify salient events and respond at appropriate moments. Omni-DuplexEval contains 660 videos with fine-grained, human-annotated labels and precise temporal metadata, spanning 9 tasks grounded in real-world scenarios, where all questions are formulated as open-ended queries. We further introduce an automatic evaluation framework based on LLM-as-a-Judge, which enables systematic assessment by jointly evaluating response-content alignment and response timing through timestamp-aware and sequential reasoning, achieving strong alignment with human judgments. Experiments on state-of-the-art duplex MLLMs reveal substantial limitations. The best-performing model achieves only 39.6% overall, while scoring only 20.0% on Proactive Reminder. Our analysis identifies two key challenges: models struggle to balance timely responses with coherent, holistic content generation, and they often fail to determine both when to respond and what to produce. We hope our work facilitates further progress in MLLMs.

2605.17359 2026-05-19 cs.CL

Learning Transferable Topology Priors for Multi-Agent LLM Collaboration Across Domains

Taolin Zhang, Zijie Zhou, Jiuheng Wan, Tingyuan Hu, Chengyu Wang, Xiaofeng He, Richang Hong

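AI summary: This paper proposes TopoPrior, a framework that learns transferable topology priors to make cross-domain multi-agent LLM collaboration more efficient, reducing online search overhead and improving scalability.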

Abstract

Large language model (LLM)-based multi-agent systems have shown strong potential for complex reasoning by coordinating specialized agents through structured communication. However, existing topology-evolution methods typically construct or optimize a collaboration topology for each query from scratch, leading to substantial online search overhead, high inference-time token consumption, and limited scalability in multi-domain settings. We propose TopoPrior, a framework for learning transferable topology priors for multi-agent LLM collaboration across domains. Rather than repeatedly searching for effective collaboration structures online, TopoPrior learns reusable topology priors from reference collaboration graphs collected offline from multiple domains and uses them to generate query-conditioned initial collaboration graphs for downstream refinement. By shifting part of topology search from per-query online optimization to offline prior learning, TopoPrior amortizes search cost while remaining compatible with existing topology-evolution backbones. Technically, TopoPrior contains two key components. First, a transferable topology prior learning module employs a conditional variational graph framework to capture reusable structural regularities across domains in a latent space. Second, a query-conditioned latent adaptation module introduces adversarial alignment to reduce unnecessary domain discrepancy while preserving query-relevant structural variation. Experiments on multi-domain reasoning benchmarks show that TopoPrior consistently improves several heterogeneous topology-evolution backbones while reducing online inference-time token usage, with only modest additional trainable parameters. These results suggest that transferable topology initialization is an effective and lightweight mechanism for improving the efficiency of multi-agent LLM collaboration across domains.

2605.17356 2026-05-19 cs.CV

UniPPTBench: A Unified Benchmark for Presentation Generation Across Diverse Input Settings

Bo Zhao, Maosheng Pang, Chen Zhang, Huan Yang, Yixin Cao, Wei Ji

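AI summary: This paper presents UniPPTBench, a unified presentation-generation benchmark covering four representative input settings: vague-prompt, long-document, multimodal-document, and multi-source generation. It also introduces the UniPPTEval protocol, which combines shared metrics for cross-setting comparison with metrics tailored to each setting's core requirements.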

Abstract

Existing works typically focus on presentation generation under isolated input settings, whereas real-world use cases span diverse scenarios, including vague user prompts, long documents, multimodal materials, and multiple heterogeneous sources. Moreover, current evaluations are often insufficiently scenario-specific. They mainly rely on generic presentation-quality criteria, such as visual appeal, layout quality, and overall coherence, but fail to assess the core capabilities required by different input settings, including grounded compression, visual-text alignment, and cross-source synthesis. Consequently, the field lacks a unified benchmark and a scenario-aware evaluation framework for faithfully diagnosing presentation-generation systems across diverse real-world settings. We present UniPPTBench, a unified benchmark for presentation generation across four representative input settings: vague-prompt, long-document, multimodal-document, and multi-source generation. We further introduce UniPPTEval, a scenario-aware evaluation protocol that combines shared metrics for cross-setting comparison with scenario-specific metrics tailored to the core requirements of each setting. We also provide transparent reference baselines to support reproducible comparison. Experiments on UniPPTBench reveal substantial performance variation across settings and recurring failure modes in content grounding, multimodal integration, and cross-source synthesis. In particular, strong performance on generic presentation-quality metrics does not necessarily imply strong task fulfillment in grounded scenarios. Together, UniPPTBench and UniPPTEval provide a faithful and diagnostic foundation for evaluating presentation generation across diverse real-world scenarios. Code and data will be publicly available.

2605.17355 2026-05-19 cs.AI cs.CL

HyperPersona: A Multi-Level Hypergraph Framework for Text-Based Automatic Personality Prediction

HyperPersona: 一种多级超图框架用于基于文本的自动人格预测

Sina Heydari, Majid Ramezani

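AI summary This paper proposes HyperPersona, a framework that explicitly models the hierarchical structure of text with a hypergraph and uses a Transformer-based graph encoder to learn interactions across linguistic levels, yielding more accurate personality prediction without relying on traditional psychometric assessments.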

Comments Preprint. Submitted to Artificial Intelligence (Elsevier)

详情
AI中文摘要

作为一种现代商品,语言已成为一个庞大的社会和心理重要特质和概念的存储库,反映了人们如何将思维模式、行为和情感的模式编码成词语。基于文本的自动人格预测(APP)旨在从语言行为中推断人格,提供了一种可扩展的替代传统心理测量法的方案。尽管文本本质上是层次化的,文档级捕捉全局特征,句子级编码局部语义,词级提供细粒度的词汇信息,但大多数现有方法依赖于浅层、顺序或单级表示,忽略了书面语言的多级结构。为了解决这个问题,我们提出了HyperPersona,一个框架,通过超图结构显式建模文本的层次组织(文档、句子和词),其中文档及其句子表示为超边,词表示为节点,从而实现对文本全局、局部和词汇依赖关系的联合建模。随后通过基于Transformer的图编码器学习这些语言层内的交互,产生上下文敏感且结构基础的特征表示用于人格预测。在Big Five人格维度上的实验表明,仅依赖文本的情况下,HyperPersona有效整合了多级语言线索,相比最先进的基线方法实现了更优的性能。这些发现强调了文本层次结构在从自然语言中推进类人人格推断中的关键作用。

英文摘要

As a modern commodity, language has become a vast repository of socially and psychologically significant traits and concepts, reflecting the ways people encode patterns of thought, behavior, and emotion into words. Text-based Automatic Personality Prediction (APP) seeks to infer personality from linguistic behavior, offering a scalable alternative to traditional psychometric assessments. Although text is inherently hierarchical, with the document level capturing global features, the sentence level encoding local semantics, and the word level providing fine-grained lexical information, most existing approaches rely on shallow, sequential, or single-level representations that ignore the multi-level structure of written language. To address this, we propose HyperPersona, a framework that explicitly models the hierarchical organization of text (document, sentence, and word) through a hypergraph structure, where a document and its sentences are represented as hyperedges and the words are represented as nodes, enabling joint modeling of global, local, and lexical dependencies of text. A transformer-based graph encoder then learns interactions within and across these linguistic layers, yielding context-sensitive and structurally grounded feature representations for personality prediction. Experiments on the Big Five personality dimensions show that, while relying solely on text, HyperPersona effectively integrates multi-level linguistic cues, achieving superior performance compared to state-of-the-art baselines. These findings underscore the critical role of textual hierarchy in advancing human-like personality inference from natural language.
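The hypergraph layout described in the abstract (words as nodes, each sentence and the whole document as hyperedges) can be sketched in a few lines. This is only an illustrative construction: the function names `build_hypergraph` and `incidence`, the regex-based sentence splitting, and the lowercase tokenization are assumptions, not the authors' preprocessing.

```python
import re
from collections import defaultdict


def build_hypergraph(document: str):
    """Return (nodes, hyperedges) for one document.

    nodes: sorted list of unique lowercase word tokens.
    hyperedges: dict mapping an edge name to the set of words it covers,
    one edge per sentence plus one edge for the whole document.
    """
    # Naive sentence split after ., ! or ? (an assumption for illustration).
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", document.strip()) if s]
    hyperedges = {}
    for i, sentence in enumerate(sentences):
        words = set(re.findall(r"[a-z']+", sentence.lower()))
        if words:
            hyperedges[f"sent_{i}"] = words
    # The document-level hyperedge spans every word node.
    all_words = set().union(*hyperedges.values())
    hyperedges["doc"] = all_words
    return sorted(all_words), hyperedges


def incidence(nodes, hyperedges):
    """Map each word node to the sorted names of the hyperedges containing it."""
    membership = defaultdict(list)
    for name, words in hyperedges.items():
        for word in words:
            membership[word].append(name)
    return {word: sorted(membership[word]) for word in nodes}


nodes, edges = build_hypergraph("I love quiet evenings. I avoid loud parties!")
print(incidence(nodes, edges)["i"])  # ['doc', 'sent_0', 'sent_1']
```

A word that recurs across sentences, such as "i" here, belongs to several hyperedges at once, which is the property that lets a hypergraph encoder pass information between the sentence and document levels through shared word nodes.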

2605.17354 2026-05-19 cs.CV

GeoHand: Unlocking Prior Geometry Knowledge for Monocular 3D Hand Reconstruction

GeoHand: 解锁先验几何知识以实现单目3D手部重建

Weiquan Lin, Yaoqing Hu, Liangchen Dai, Xu Tang, Xingyu Chen

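AI summary This paper proposes GeoHand, a framework that unlocks high-quality geometric priors from the frozen foundational monocular geometry estimator MoGe2 and combines a map-level GeoAdapter with a gated cross-modal token fusion strategy to achieve high-accuracy hand reconstruction, performing especially well under severe occlusion and hand-object interaction.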

详情
AI中文摘要

单目3D手形重建本质上是一个几何问题,然而仅依靠RGB外观特征往往难以解决由自遮挡和手-物体相互作用引起的严重歧义。虽然引入深度可以显式提供空间线索,但原始传感器捕获的深度图存在大量噪声和不完整性,限制了其在细粒度手形重建中的应用。为弥合这一差距,我们提出GeoHand,一种新颖的框架,能够从冻结的基础单目几何估计器MoGe2中解锁高质量几何先验。认识到这些先验偏向于通用场景,我们引入地图级GeoAdapter来重新校准空间特征,特别是适应于细节丰富的手形重建。此外,为了系统地整合这些适应后的先验而不过度干扰固有的RGB外观线索,我们采用门控跨模态token融合策略。最后,为了确保精确的局部运动,我们设计了关键点查询迭代细化器(KQIR),利用投影的关节位置查询几何感知的图像特征以进行空间修正。通过在统一管道中结合全局几何消歧和局部细化,GeoHand在FreiHAND、DexYCB和HO3Dv3上实现了最先进的性能,特别是在严重遮挡和手-物体交互场景中。

英文摘要

Monocular 3D hand reconstruction is intrinsically a geometric problem, yet RGB appearance features alone often struggle to resolve severe ambiguities caused by self-occlusions and hand-object interactions. While introducing depth can explicitly provide spatial cues, raw sensor-captured depth maps are extensively noisy and incomplete, limiting their usefulness for fine-grained hand reconstruction. To bridge this gap, we propose GeoHand, a novel framework that unlocks high-quality geometric priors from a frozen foundational monocular geometry estimator (MoGe2). Recognizing that these priors are oriented toward general scenes, we introduce a map-level GeoAdapter to recalibrate the spatial features, specifically adapting them for detailed hand reconstruction. Furthermore, to systematically integrate these adapted priors without overwhelming intrinsic RGB appearance cues, we employ a gated cross-modal token fusion strategy. Finally, to secure precise local articulation, we design a Keypoint-Queried Iterative Refiner (KQIR) that uses projected joint locations to query geometry-aware image features for spatial correction. By combining global geometric disambiguation with local refinement in a unified pipeline, GeoHand achieves state-of-the-art performance on FreiHAND, DexYCB, and HO3Dv3, especially under severe occlusions and hand-object interactions.

2605.17352 2026-05-19 cs.CL

AMATA: Adaptive Multi-Agent Trajectory Alignment for Knowledge-Intensive Question Answering

AMATA: 适应性多智能体轨迹对齐用于知识密集型问答

Taolin Zhang, Dongyang Li, Chen Chen, Qizhou Chen, Jiuheng Wan, Xiaofeng He, Chengyu Wang, Richang Hong

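AI summary This study proposes AMATA, a framework that dynamically integrates external knowledge to improve the interpretability and factual accuracy of responses in knowledge-intensive question answering. Six specialized agents collaborate on complex question reasoning, and the framework introduces two innovations: Intra-Trajectory Preference Learning and Inter-Agent Dependency Learning.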

详情
AI中文摘要

尽管大语言模型(LLMs)取得了显著进展,但在知识密集型问答中生成事实一致的响应仍然具有挑战性。这些困难主要是由于幻觉和LLMs在长尾知识缺口上的局限性。为此,我们提出了AMATA,一种自适应多智能体轨迹对齐框架,通过动态整合外部知识来提高响应的可解释性和事实基础性。我们的架构利用六个专门化的智能体,协同执行结构化动作进行复杂问题推理。我们将多智能体协作与外部工具的协作形式化为轨迹偏好对齐问题,结合问题感知的智能体定制和智能体间偏好和谐化。AMATA引入了两种主要创新:(1)轨迹内偏好学习,学习以目标为导向的偏好以优先考虑关键智能体;(2)智能体间依赖学习,通过一种新颖的依赖感知直接偏好优化技术捕获跨智能体工具依赖性。实验证明,AMATA在五个已建立的知识密集型QA基准上一致优于基线方法、知识增强框架和基于LLM的轨迹系统。进一步分析显示,我们的方法在减少token消耗方面具有效率。

英文摘要

Despite substantial advances in large language models (LLMs), generating factually consistent responses for knowledge-intensive question answering remains challenging. These difficulties are primarily due to hallucinations and the limitations of LLMs in bridging long-tail knowledge gaps. To address this, we propose AMATA, an Adaptive Multi-Agent Trajectory Alignment framework that dynamically integrates external knowledge to improve response interpretability and factual grounding. Our architecture leverages six specialized agents that collaboratively perform structured actions for complex question reasoning. We formalize multi-agent collaboration with external tools as a trajectory preference alignment problem, incorporating question-aware agent customization and inter-agent preference harmonization. AMATA introduces two principal innovations: (1) Intra-Trajectory Preference Learning, which learns objective-oriented preferences to prioritize critical agents, and (2) Inter-Agent Dependency Learning, which captures cross-agent tool dependencies through a novel dependency-aware direct preference optimization technique. Empirical results show that AMATA consistently outperforms baseline approaches, knowledge-augmented frameworks, and LLM-based trajectory systems on five established knowledge-intensive QA benchmarks. Further analysis demonstrates the efficiency of our method in reducing token consumption.
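For orientation, the standard direct preference optimization (DPO) loss that preference-alignment methods of this kind build on can be written as a short function. The abstract does not spell out the dependency-aware variant, so the sketch below shows only plain DPO on a single preferred/rejected trajectory pair; the function name, argument names, and the default `beta` are assumptions.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair (not the paper's variant).

    Each argument is the total log-probability of a whole trajectory under
    the trained policy (logp_*) or the frozen reference policy (ref_logp_*).
    """
    # Implicit reward margin: how much more the policy favours the chosen
    # trajectory over the rejected one, relative to the reference policy.
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# A policy that agrees with the preference gets a lower loss than one that inverts it.
print(dpo_loss(-1.0, -3.0, -2.0, -2.0) < dpo_loss(-3.0, -1.0, -2.0, -2.0))  # True
```

When the policy and the reference assign identical log-probabilities, the margin is zero and the loss equals log 2, which is a convenient sanity check when wiring such a loss into a training loop.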

2605.17348 2026-05-19 cs.CL

Taming "Zombie" Agents: A Markov State-Aware Framework for Resilient Multi-Agent Evolution

驯服“僵尸”代理:一种面向鲁棒多代理演化的马尔可夫状态感知框架

Taolin Zhang, Pukun Zhao, Qizhou Chen, Jiuheng Wan, Chen Chen, Xiaofeng He, Chengyu Wang, Richang Hong

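AI summary This paper proposes AgentRevive, a framework that dynamically manages agent collaboration and applies state-aware edge optimization to address the premature discarding of valuable agents over transient issues in multi-agent systems, improving robustness and efficiency.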

详情
AI中文摘要

近年来,基于大语言模型的多代理系统在复杂任务中的协作能力有了显著提升。为提高整体效率,现有方法常依赖于代理间的激进图演化(例如节点或边剪枝),这可能导致因临时问题(如幻觉或暂时知识缺口)而过早丢弃有价值的代理。然而,这种硬剪枝忽略了“僵尸”代理在后续讨论轮次中恢复和贡献的潜力。本文提出AgentRevive,一种面向鲁棒多代理演化的马尔可夫状态感知框架。我们的方法通过软状态转移动态管理代理协作,通过两个关键组件实现:(1)状态感知策略学习:将代理状态分为“活跃”、“待命”和“终止”状态,根据代理记忆选择性传播消息。策略利用风险估计器通过评估幻觉风险优化代理状态转移,最小化不可靠节点的影响,同时保护有价值节点。(2)状态感知边优化:根据策略学习到的状态剪枝子图边,永久移除“终止”节点,并保留“待命”节点以供后续轮次评估其潜在的未来贡献。在通用推理、领域特定和幻觉挑战任务上的广泛实验表明,我们的方法在性能上始终优于强基线,并通过状态感知的代理调度显著减少了令牌消耗。

英文摘要

Recent advancements in LLM-based multi-agent systems have demonstrated remarkable collaborative capabilities across complex tasks. To improve overall efficiency, existing methods often rely on aggressive graph evolution among agents (e.g., node or edge pruning), which risks prematurely discarding valuable agents due to transient issues such as hallucinations or temporary knowledge gaps. However, such hard pruning overlooks the potential for "zombie" agents to recover and contribute in subsequent discussion rounds. In this paper, we propose AgentRevive, a Markov state-aware framework for resilient multi-agent evolution. Our approach dynamically manages agent collaboration through soft state transitions, implemented via two key components: (1) State-Aware Policy Learning: Agent states are divided into "Active", "Standby", and "Terminated" states, selectively propagating messages based on agent memory. The policy employs a risk estimator to optimize agent state transitions by assessing hallucination risk, minimizing the influence of unreliable nodes while safeguarding valuable ones. (2) State-Aware Edge Optimization: Subgraph edges are pruned according to states learned from the policy, permanently removing "Terminated" nodes and retaining "Standby" nodes for subsequent rounds to assess their potential future contributions. Extensive experiments on general reasoning, domain-specific, and hallucination challenge tasks show that our method consistently outperforms strong baselines and significantly reduces token consumption through state-aware agent scheduling.
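The three-state scheme can be pictured as a small state machine. In the paper the transitions come from a learned policy with a risk estimator; the fixed thresholds below (`standby_at`, `terminate_at`), the function names, and the rule that only Active agents exchange messages are hypothetical stand-ins used purely to show the soft-pruning idea.

```python
from enum import Enum


class AgentState(Enum):
    ACTIVE = "Active"
    STANDBY = "Standby"
    TERMINATED = "Terminated"


def next_state(state, risk, standby_at=0.5, terminate_at=0.9):
    """Pick the next state from an estimated hallucination risk in [0, 1].

    Thresholds are illustrative; the paper learns this decision.
    """
    if state is AgentState.TERMINATED:
        return state  # Terminated is absorbing: the node is removed for good.
    if risk >= terminate_at:
        return AgentState.TERMINATED
    if risk >= standby_at:
        return AgentState.STANDBY  # Soft pruning: kept for later rounds.
    return AgentState.ACTIVE  # A Standby ("zombie") agent can recover.


def active_edges(edges, states):
    """Keep only edges whose two endpoints are both Active this round."""
    return [
        (u, v)
        for u, v in edges
        if states[u] is AgentState.ACTIVE and states[v] is AgentState.ACTIVE
    ]


states = {"a": AgentState.ACTIVE, "b": AgentState.STANDBY, "c": AgentState.ACTIVE}
print(active_edges([("a", "b"), ("a", "c")], states))  # [('a', 'c')]
```

The contrast with hard pruning is in the middle branch: an agent whose risk is elevated but not extreme is parked rather than deleted, so a later low-risk estimate can return it to the Active set.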

2605.17345 2026-05-19 cs.CV

VoxShield: Protecting 3D Medical Datasets from Unauthorized Training via Frequency-Aware Inter-Slice Disruption

VoxShield: 通过频率感知的跨切片扰动保护3D医学数据集免受未经授权的训练

Xinyao Liu, Zhipeng Deng, Wenhan Jiang, Haolin Wang, Xun Lin, Yafei Ou, Yefeng Zheng

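AI summary This paper proposes VoxShield, which uses a frequency-aware inter-slice disruption mechanism to target the volumetric inductive biases that 3D networks exploit on 3D medical image segmentation datasets, effectively degrading 3D segmentation performance while preserving visual quality.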

Comments Submitted version to MICCAI 2026 (Provisional Accept)

详情
AI中文摘要

公开3D医学图像分割(MIS)数据集的发布加速了临床研究,但同时也提高了未经授权的AI模型训练的风险。尽管不可学习的例子(UE)通过注入不可察觉的扰动来防止有效模型学习,但现有方法主要针对2D场景。它们忽略了3D医学体积中固有的体积空间相关性和跨切片解剖一致性,这些是3D分割网络的关键学习先验。为弥合这一差距,我们提出了VoxShield,一种UE框架,专门针对3D网络的体积归纳偏差。我们的核心见解是通过系统性地破坏3D架构依赖的跨切片连续性,可以根本破坏其空间聚合过程。具体来说,我们引入了一种跨切片频率一致性扰动机制,最大化相邻切片之间的频谱差异,沿z轴注入结构不一致性。此外,还加入了语义预测扰动模块。通过最大化干净和扰动logits之间的ℓ₁差异,它迫使注入的噪声穿透整个网络并破坏最终的语义映射。在BraTS19和FLARE21上的实验表明,VoxShield成功降低了3D分割性能,将DSC从80.0%降至接近0.0%,从88.6%降至6.8%。所有保护都通过最小扰动(ε=4/255)实现,以保持高质量的视觉保真度。代码可在https://github.com/KK266299/VoxShield上获得。

英文摘要

The release of public 3D medical image segmentation (MIS) datasets accelerates clinical research but simultaneously heightens risks of unauthorized AI model training. While Unlearnable Examples (UE) offer protection by injecting imperceptible perturbations to prevent effective model learning, existing methods primarily target 2D scenarios. They neglect the volumetric spatial correlations and inter-slice anatomical consistency inherent in 3D medical volumes, which serve as critical learning priors for 3D segmentation networks. To bridge this gap, we propose VoxShield, a UE framework that explicitly targets the volumetric inductive biases of 3D networks. Our core insight is that by systematically dismantling the cross-slice continuity that 3D architectures rely on, we can fundamentally impair their spatial aggregation process. Specifically, we introduce an Inter-Slice Frequency Consistency Disruption mechanism that maximizes the spectral divergence between adjacent slices, injecting structural incoherence along the $z$-axis. Complementing this structural attack, a Semantic Prediction Disruption module is incorporated. By maximizing the $\ell_1$ divergence between clean and perturbed logits, it forces the injected noise to penetrate the entire network and corrupt the final semantic mapping. Experiments on BraTS19 and FLARE21 demonstrate that VoxShield successfully degrades 3D segmentation performance, reducing the DSC from 80.0% to near 0.0% and from 88.6% to 6.8%, respectively. All protections are achieved with minimal perturbation ($ε=4/255$) to preserve high visual fidelity. The code is available at https://github.com/KK266299/VoxShield.

2605.17343 2026-05-19 cs.CV

GraphMAR: Geometry-Aware Graph Learning Framework for Spatially Adaptive CT Metal Artifact Reduction

GraphMAR: 一种基于几何的图学习框架用于空间自适应的CT金属伪影减少

Zilong Li, Chenglong Ma, Yiming Lei, Yuanlin Li, Jing Han, Jiannan Liu, Huidong Xie, Junping Zhang, Yi Zhang, Hongming Shan

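AI summary This paper proposes GraphMAR, a geometry-aware graph learning framework for spatially adaptive CT metal artifact reduction in the image domain; it introduces graph-based geometric modeling to identify artifacts explicitly and improve restoration quality and interpretability.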

详情
AI中文摘要

计算断层扫描(CT)金属伪影减少(MAR)旨在减少由金属植入物和其他高密度物体引起的严重条纹伪影。有效的MAR通常需要准确的伪影定位和去除。sinogram域方法可以利用显式的几何线索,如金属痕迹,来识别金属损坏的测量,但需要原始投影数据,这在临床和实际场景中往往不可用。图像域方法更加灵活且广泛适用,但通常缺乏可比的几何指导,限制了它们定位伪影的能力,导致结果不理想。为了解决这一限制,我们提出了GraphMAR,一种用于显式伪影识别和图像域中空间自适应MAR的几何意识学习框架。关键思想是引入基于图的几何建模作为sinogram金属痕迹的图像域类比。具体来说,我们首先从金属掩模中构建几何图,并推导出一个几何密度图,根据植入物之间的几何关系粗略定位伪影易发区域。然后我们设计了GraphMoE,一个基于图的混合专家模块,该模块在特征空间中构建极坐标伪影图,并适应性地将不同专家路由到不同的空间区域进行MAR。通过将学习到的路由图与几何密度图对齐,GraphMAR在提供显式和可解释的伪影定位的同时,实现了区域自适应的伪影减少。在模拟和真实世界数据集上的实验表明,GraphMAR在现有方法上实现了更优的MAR性能。据我们所知,这是首次引入基于图的建模用于CT MAR,并在图像域中实现显式的伪影识别,提高了恢复质量和可解释性。

英文摘要

Computed tomography (CT) metal artifact reduction (MAR) aims to reduce the severe streaking artifacts induced by metallic implants and other high-density objects. Effective MAR generally requires both accurate artifact localization and artifact removal. Sinogram-domain methods can exploit explicit geometric cues, such as metal traces, to identify metal-corrupted measurements, while requiring raw projection data, which is often unavailable in clinical and practical scenarios. Image-domain methods are more flexible and widely applicable, yet they usually lack comparable geometric guidance, limiting their ability to localize artifacts and leading to suboptimal results. To address this limitation, we propose GraphMAR, a geometry-aware learning framework for explicit artifact identification and spatially adaptive MAR in the image domain. The key idea is to introduce graph-based geometric modeling as an image-domain analogue of sinogram metal traces. Specifically, we first construct a geometric graph from the metal mask and derive a geometric density graph that coarsely localizes artifact-prone regions according to inter-implant geometry. We then design GraphMoE, a graph-routed mixture-of-experts module that builds a polar-coordinate artifact graph in feature space and adaptively routes different experts to different spatial regions for MAR. By aligning the learned routing maps with the geometric density graph, GraphMAR provides explicit and interpretable artifact localization while enabling region-adaptive artifact reduction. Experiments on both simulated and real-world datasets demonstrate that GraphMAR achieves superior MAR performance compared with existing methods. To the best of our knowledge, this is the first work to introduce graph-based modeling for CT MAR and to enable explicit artifact identification in the image domain, improving both restoration quality and interpretability.

2605.17342 2026-05-19 cs.CL cs.AI

Transitivity Meets Cyclicity: Explicit Preference Decomposition for Dynamic Large Language Model Alignment

传递性与循环性:动态大语言模型对齐的显式偏好分解

Yucong Huang, Xiucheng Li, Kaiqi Zhao, Jing Li

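AI summary This paper proposes the Hybrid Reward-Cyclic model, which uses a game-theoretic decomposition to separate transitive and cyclic preferences explicitly and combines it with a dynamic self-play optimization method to improve large language model alignment; experiments show a structural advantage and higher accuracy in mixed transitive-cyclic settings.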

Comments Accepted by ICML 2026

详情
AI中文摘要

标准的RLHF依赖于传递性的标量奖励,无法捕捉人类偏好的循环性质。尽管一些方法如通用偏好模型(GPM)试图解决这一问题,但其隐式公式将层次结构与循环性结合在一起,未能保证主导解。为此,我们提出了混合奖励-循环(HRC)模型,利用博弈论分解将偏好显式分解为正交的传递性(标量)和循环性(向量)组件。此外,我们引入了动态自我博弈偏好优化(DSPPO),将对齐视为随时间变化的游戏,逐步引导策略向纳什均衡发展。合成数据实验进一步验证了HRC在混合传递-循环设置中的结构优势,其中HRC收敛速度更快且准确率更高。在RewardBench 2上的实验表明,HRC在BT和GPM基线基础上持续改进(例如,在Gemma-2B-it上提升1.23%)。特别是,其在Ties领域中的优越表现验证了模型在处理复杂非严格偏好时的鲁棒性。对AlpacaEval 2.0、Arena-Hard-v0.1和MT-Bench的广泛下游评估确认了我们框架的有效性。值得注意的是,当使用Gemma-2B-it作为基础偏好模型时,HRC+DSPPO在AlpacaEval 2.0上达到峰值长度控制下的胜率44.75%,在Arena-Hard-v0.1上达到46.8%,显著优于使用BT或GPM训练的SPPO基线。我们的代码在https://github.com/lab-klc/Hybrid-Reward-Cyclic上公开可用。

英文摘要

Standard RLHF relies on transitive scalar rewards, failing to capture the cyclic nature of human preferences. While some approaches like the General Preference Model (GPM) address this, we identify a theoretical limitation: their implicit formulation entangles hierarchy with cyclicity, failing to guarantee dominant solutions. To address this, we propose the Hybrid Reward-Cyclic (HRC) model, which utilizes game-theoretic decomposition to explicitly disentangle preferences into orthogonal transitive (scalar) and cyclic (vector) components. Complementing this, we introduce Dynamic Self-Play Preference Optimization (DSPPO), which treats alignment as a time-varying game to progressively guide the policy toward the Nash equilibrium. Synthetic data experiments further validate HRC's structural superiority in mixed transitive--cyclic settings, where HRC converges faster and achieves higher accuracy than GPM. Experiments on RewardBench 2 demonstrate that HRC consistently improves over both BT and GPM baselines (e.g., +1.23% on Gemma-2B-it). In particular, its superior performance in the Ties domain empirically validates the model's robustness in handling complex, non-strict preferences. Extensive downstream evaluations on AlpacaEval 2.0, Arena-Hard-v0.1, and MT-Bench confirm the efficacy of our framework. Notably, when using Gemma-2B-it as the base preference model, HRC+DSPPO achieves a peak length-controlled win-rate of 44.75% on AlpacaEval 2.0 and 46.8% on Arena-Hard-v0.1, significantly outperforming SPPO baselines trained with BT or GPM. Our code is publicly available at https://github.com/lab-klc/Hybrid-Reward-Cyclic.
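To see why a scalar reward alone cannot express a preference cycle, and how an added skew-symmetric term can, consider the toy model below. The functional form (scalar difference plus a 2-D cross product of per-item vectors) is an assumption chosen for illustration and is not claimed to be the HRC parameterization; the item names and helper names are hypothetical.

```python
import math


def preference_logit(r_a, r_b, v_a, v_b):
    """Transitive part (scalar difference) plus cyclic part (2-D cross product)."""
    transitive = r_a - r_b
    # Skew-symmetric term: its sign flips when a and b are swapped.
    cyclic = v_a[0] * v_b[1] - v_a[1] * v_b[0]
    return transitive + cyclic


def prob_prefers(r_a, r_b, v_a, v_b):
    """Probability that a is preferred to b under a logistic link."""
    return 1.0 / (1.0 + math.exp(-preference_logit(r_a, r_b, v_a, v_b)))


# Rock-paper-scissors: equal scalar rewards, unit vectors 120 degrees apart.
items = {
    name: (0.0, (math.cos(angle), math.sin(angle)))
    for name, angle in [
        ("rock", 0.0),
        ("scissors", 2 * math.pi / 3),
        ("paper", 4 * math.pi / 3),
    ]
}


def prefers(a, b):
    (r_a, v_a), (r_b, v_b) = items[a], items[b]
    return prob_prefers(r_a, r_b, v_a, v_b)


for a, b in [("rock", "scissors"), ("scissors", "paper"), ("paper", "rock")]:
    print(a, "over", b, round(prefers(a, b), 3))
```

All three printed probabilities exceed 0.5 even though every scalar reward is zero, so the cycle lives entirely in the vector component. Setting all vectors to zero recovers a Bradley-Terry-style model driven by the scalar difference alone.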

2605.17341 2026-05-19 cs.CV cs.AI

Single-Sample Black-Box Membership Inference Attack against Vision-Language Models via Cross-modal Semantic Alignment

通过跨模态语义对齐实现面向视觉-语言模型的单样本黑盒成员推断攻击

Jiaqing Li, Yajuan Lu, Xiaochuan Shi, Gang Wu, ZhongYuan Wang, Chao Liang

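AI summary This paper proposes a membership inference attack framework based on cross-modal semantic alignment for assessing the data security risks of vision-language models in single-sample, black-box settings; by quantifying alignment in a joint embedding space, it significantly improves attack performance.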

详情
AI中文摘要

视觉-语言模型(VLMs)虽取得了显著成功,但其依赖大规模数据集和意外记忆训练数据,带来了重大数据安全风险。成员推断攻击(MIAs)旨在通过确定数据样本是否包含在模型训练集中来评估这些风险。然而,现有针对VLMs的MIAs方法面临关键瓶颈:灰盒方法依赖于内部logits,通常在实际应用程序接口(APIs)中受限,而黑盒方法依赖于大规模统计分布,在单样本场景中表现不佳。为此,我们从跨模态语义对齐的角度研究MIAs,并观察到成员图像由于训练记忆表现出显著更强的图像-描述对齐,而生成的非成员描述可能偏离原始视觉内容。基于这一洞察,我们提出了一种针对严格黑盒和单样本场景的新MIAs框架,该框架在联合嵌入空间中量化此类对齐,从而绕过这些不现实的假设。我们在三个开源和两个闭源VLMs上进行了广泛实验。在VL-MIA/Flicker数据集上,我们的方法在LLaVA-1.5上实现了0.821的AUC,显著优于现有基线。此外,它在各种图像扰动下仍保持稳健,突显了其实用性。

英文摘要

Vision-Language Models (VLMs) have achieved remarkable success, yet their reliance on massive datasets and unintended memorization of training data raise significant data security risks. Membership Inference Attacks (MIAs) aim to assess these risks by determining whether a data sample was included in a model's training set. However, existing MIA methods against VLMs face critical bottlenecks: gray-box methods rely on internal logits that are typically restricted in real-world Application Programming Interfaces (APIs), while black-box methods depend on large-scale statistical distributions and struggle in single-sample scenarios. To this end, we investigate MIAs from the perspective of cross-modal semantic alignment and observe that member images exhibit significantly stronger image-caption alignment due to training memorization, whereas generated captions for non-members may deviate from the original visual content. Leveraging this insight, we propose a novel MIA framework designed for strict black-box and single-sample settings that quantifies such alignment within a joint embedding space, thereby bypassing these unrealistic assumptions. We conducted extensive experiments on three open-source and two closed-source VLMs. On the VL-MIA/Flicker dataset, our method achieves an AUC of 0.821 against LLaVA-1.5, significantly outperforming existing baselines. Furthermore, it remains robust under diverse image perturbations, highlighting its practicality.
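The decision rule implied by the abstract (score how well the caption a model generates aligns with the image in a joint embedding space, then compare against a cut-off) might look roughly like this. The embeddings are assumed to come from some joint image-text encoder that is not shown; `is_member`, the cosine score, and the `threshold` value are hypothetical choices, not the paper's scoring function.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def is_member(image_embedding, caption_embedding, threshold=0.8):
    """Flag a sample as a likely training member when alignment is high.

    image_embedding: embedding of the query image.
    caption_embedding: embedding of the caption the target VLM generated for it.
    threshold: illustrative cut-off that would need calibrating in practice.
    """
    return cosine_similarity(image_embedding, caption_embedding) >= threshold


print(is_member([0.9, 0.1, 0.0], [0.8, 0.2, 0.1]))  # True: strongly aligned
print(is_member([0.9, 0.1, 0.0], [0.0, 0.2, 0.9]))  # False: caption drifts away
```

Only the generated caption is needed from the target model, which is what makes a rule of this shape compatible with a strict black-box, single-sample setting.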

2605.17339 2026-05-19 cs.LG

Bridging the Gap between Sparse Matrix Reordering and Factorization: A Deep Learning Framework for Fill-in Reduction

弥合稀疏矩阵重排与分解之间的差距:一种用于填充减少的深度学习框架

Ziwei Li, Tao Yuan, Shuzi Niu, Huiyuan Li

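AI summary This paper proposes a deep learning framework that minimizes a fill-in surrogate function via spectral embedding, bridging the gap between sparse matrix reordering and factorization; experiments show performance competitive with traditional graph-theoretic algorithms and deep learning methods.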

Comments Accepted by DASFAA 2025

详情
AI中文摘要

稀疏矩阵重排可以显著减少矩阵分解过程中的填充量,从而降低稀疏矩阵计算中的计算和存储需求。寻找最小填充量的重排顺序已知是NP难问题。此外,存在一个悖论:矩阵重排在矩阵分解之前进行,但重排方法旨在减少的填充是由矩阵分解产生的。为了弥合重排与分解之间的差距,我们提出了一种深度学习框架,基于谱嵌入最小化填充代理函数。首先,我们采用多网格-like GNN架构来学习近似其图拉普拉斯矩阵的最小特征向量,即谱嵌入,并捕捉矩阵的全局结构信息。然后,另一个多网格-like GNN架构用于基于秩分布最小化潜在的填充空间。实验结果表明,我们的方法在传统图论算法和深度学习方法中表现具有竞争力。

英文摘要

Sparse matrix reordering can significantly reduce the fill-in during matrix factorization, thereby decreasing the computational and storage requirements in sparse matrix computations. Finding a minimal fill-in ordering is known to be an NP-hard problem. Moreover, there is a paradox: reordering is applied before factorization, yet the fill-in that reordering methods aim to reduce is generated only during factorization. To bridge the gap between reordering and factorization, we propose a deep learning framework to minimize a fill-in surrogate function based on spectral embedding. First, we employ a multi-grid-like GNN architecture that learns to approximate the eigenvectors associated with the smallest eigenvalues of the matrix's graph Laplacian, i.e., the spectral embedding, and captures the global structural information of the matrix. Then, another multi-grid-like GNN architecture is used to minimize the potential space where fill-in can occur based on the rank distribution. Experimental results indicate that our approach achieves competitive performance compared with traditional graph-theoretic algorithms and deep learning methods.
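The quantity being minimized can be made concrete with the classical elimination-graph model for symmetric factorization: eliminating a vertex connects all of its remaining neighbours, and every edge added that way is one fill-in entry. The sketch below counts fill-in for a given ordering; it illustrates the objective only and is unrelated to the paper's GNN. The function name and the adjacency-dict input format are assumptions.

```python
def count_fill_in(adjacency, order):
    """Count fill edges created by eliminating vertices in the given order.

    adjacency: dict mapping each vertex to the set of its neighbours
               (symmetric, no self-loops), i.e. the sparsity graph of the matrix.
    order: sequence containing every vertex exactly once.
    """
    graph = {v: set(nbrs) for v, nbrs in adjacency.items()}  # work on a copy
    fill = 0
    for v in order:
        neighbours = list(graph.pop(v))
        for u in neighbours:
            graph[u].discard(v)
        # Eliminating v turns its remaining neighbours into a clique.
        for i, a in enumerate(neighbours):
            for b in neighbours[i + 1:]:
                if b not in graph[a]:
                    graph[a].add(b)
                    graph[b].add(a)
                    fill += 1
    return fill


# Arrow-shaped matrix: vertex 0 is coupled to every other vertex.
star = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}
print(count_fill_in(star, [0, 1, 2, 3, 4]))  # 6: hub first fills the whole block
print(count_fill_in(star, [1, 2, 3, 4, 0]))  # 0: hub last creates no fill
```

The two orderings act on the same matrix yet differ by six fill entries, which shows in miniature why the ordering must anticipate what factorization will do.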

2605.17336 2026-05-19 cs.RO cs.CV eess.SP

Tactile-based Multimodal Fusion in Embodied Intelligence: A Survey of Vision, Language, and Contact-Driven Paradigms

基于触觉的多模态融合在具身智能中的应用:视觉、语言和接触驱动范式的综述

Zhixiang Cao, Di Tian, Runwei Guan, Yanzhou Mu, Xiaolou Sun, Shaofeng Liang, Daizong Liu, Tao Huang, Yutao Yue, Henghui Ding, Bin Fang, Alex Zhou, Qing-Long Han, Hui Xiong

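AI summary This paper surveys multimodal tactile fusion in embodied intelligence, examining how integrating vision, language, and touch connects physical interaction with semantic reasoning; it proposes a hierarchical taxonomy and summarizes current challenges and future directions.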

Comments 20 pages, 8 figures

详情
AI中文摘要

触觉感知是具身智能中的基本模态,能够提供关于接触几何、材料属性和交互动态的独特且直接反馈,这无法被远程传感器所替代。然而,单一的触觉感知在空间覆盖稀疏和缺乏全局语义上下文方面存在固有局限。随着深度学习和大语言模型的迅速发展,将触觉与视觉和语言相结合已成为连接物理交互与语义推理的关键。尽管进展迅速,现有研究仍分散在不同的数据集、传感模态和任务中,缺乏统一的理论框架。为解决这一差距,本文提供了截至2026年第一季度的多模态触觉融合研究的全面综述。我们提出了一种分层的分类体系,将该领域分为两个主要维度:多模态数据集和多模态方法。在数据方面,我们对从触觉-视觉数据集、触觉-语言数据集、触觉-视觉-语言数据集以及触觉-视觉-其他数据集等资源进行了分类。在方法方面,我们把先前的工作分为三个核心支柱:(1)多模态感知与识别,专注于物体理解和抓取预测;(2)跨模态生成,专注于触觉、视觉和文本之间的双向翻译;(3)多模态交互,强调反馈控制和语言引导的操作。此外,我们总结了代表性的触觉传感硬件,回顾了常用的评估指标和基准设置,并讨论了当前的挑战和有前途的未来方向。

英文摘要

Tactile sensing is a fundamental modality for embodied intelligence, offering unique and direct feedback on contact geometry, material properties, and interaction dynamics that remote sensors cannot replace. However, unimodal tactile perception is inherently limited by its sparse spatial coverage and lack of global semantic context. With the recent explosion in deep learning and large language models, integrating tactile sensing with vision and language has become essential to bridge physical interaction with semantic reasoning, leading to the emergence of Multimodal Tactile Fusion. Despite rapid progress, existing research remains fragmented across disparate datasets, sensing modalities, and tasks, lacking a unified theoretical framework. To address this gap, this paper provides a comprehensive survey of multimodal tactile fusion research up to the first quarter of 2026. We propose a hierarchical taxonomy that organizes the field into two primary dimensions: multimodal datasets and multimodal methods. On the data side, we categorize resources into Tactile-Vision, Tactile-Language, Tactile-Vision-Language, and Tactile-Vision-Other datasets. On the method side, we structure prior work into three core pillars: (1) Multimodal Perception and Recognition, which focuses on object understanding and grasp prediction; (2) Cross-Modal Generation, which focuses on bidirectional translation between tactile, vision, and text; and (3) Multimodal Interaction, which emphasizes feedback control and language-guided manipulation. Furthermore, we summarize representative tactile sensing hardware, review commonly used evaluation metrics and benchmark settings, and discuss current challenges and promising future directions.

2605.17327 2026-05-19 cs.RO cs.AI cs.CV

Efficient Feature-Free Initialization for Monocular Visual-Inertial Systems Using a Feed-Forward 3D Model

基于前馈3D模型的单目视觉-惯性系统高效无特征初始化

Yuantai Zhang, Jiaqi Yang, Huajian Zeng, Changhao Chen, Haoang Li, Liang Li, Dezhen Song, Xingxing Zuo

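AI summary This paper proposes an initialization framework that needs no visual feature tracking and instead uses point clouds predicted by a feed-forward 3D model, improving the reliability and efficiency of initialization for monocular visual-inertial navigation systems; experiments show a success rate above 90% with a substantially shorter required data duration.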

详情
AI中文摘要

快速且可靠的初始化对于单目视觉-惯性导航系统(VINS)至关重要,因为它为后续的状态估计建立了初始条件。尽管已有显著进展,但大多数现有方法仍依赖于视觉特征对应关系,并需要3-4秒的传感器数据才能成功初始化,这限制了它们的应用性和效率。随着前馈3D模型的出现,这些模型可以直接从图像预测点云,我们重新从简洁的角度审视视觉-惯性初始化问题。在本文中,我们提出了一种特征-free初始化框架,利用前馈3D模型预测的点云,从而避免了视觉特征跟踪和估计的需要。这种设计显著降低了系统复杂性并提高了初始化的可靠性。在公开数据集上的实验表明,所提出的特征-free初始化方法实现了最高成功率,超过90%,并且显著减少了成功初始化所需的数据持续时间,通常降至1.2秒以下。我们进一步在自采集的数据集上验证了我们的方法,覆盖了各种室内和室外场景,展示了鲁棒性能,特别是在现有方法常失败的视觉退化环境中。代码和数据集可在https://github.com/Yuantai-Z/FF-VIO-Init获取。

英文摘要

Fast and reliable initialization is critical for monocular visual-inertial navigation systems (VINS), as it establishes the starting conditions for subsequent state estimation. Despite steady progress, most existing methods heavily rely on visual feature correspondences and require 3-4 seconds of sensory data for successful initialization, which limits their applicability and efficiency. With the advent of feed-forward 3D models that can directly predict point clouds from images, we revisit the visual-inertial initialization problem from a concise perspective. In this work, we propose a feature-free initialization framework that leverages up-to-scale point clouds predicted by a feed-forward 3D model, thereby obviating the need for visual feature tracking and estimation. This design substantially reduces system complexity and improves the reliability of initialization. Experiments on public datasets demonstrate that the proposed feature-free initialization method achieves the highest success rate, exceeding 90%, and significantly reduces the data duration required for successful initialization, typically to under 1.2 s. We further validate our method on a self-collected dataset covering various indoor and outdoor scenarios, demonstrating robust performance, particularly in visually degraded environments where existing methods often fail. The code and dataset are available at https://github.com/Yuantai-Z/FF-VIO-Init.

2605.17316 2026-05-19 cs.LG cs.AI

Learning Higher-Order Structure from Incomplete Spatiotemporal Data: Multi-Scale Hypergraph Laplacians with Neural Refinement

从不完整时空数据中学习高阶结构:具有神经细化的多尺度超图拉普拉斯算子

Keshu Wu, Sixu Li, Zihao Li, Zhiwen Fan, Xiaopeng Li, Yang Zhou

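AI summary This paper proposes Multi-Scale Hypergraph Laplacians (MSHL), a two-stage framework for learning higher-order structure from incomplete spatiotemporal observations. A Discovery stage builds a multi-scale hypergraph, and a Refinement stage adds a hypergraph-conditioned residual network that exploits residual features, yielding more accurate missing-data imputation on traffic networks.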

详情
AI中文摘要

传感器网络日益成为现代基础设施的核心,然而标准填补基准所假设的均匀随机缺失模式往往不适用于实际场景。环形检测器在校准期间会断线,路边柜子会沉默附近传感器的集群,而新安装的仪器则无法提供历史数据。这些故障会产生结构化的缺失,其值受传感器组之间的高阶关系约束,而非仅仅是成对接近性。现有低秩和图方法往往无法捕捉这种集体结构,当缺失性变得一致时可能会失效。本文引入多尺度超图拉普拉斯(MSHL),一种两阶段框架,用于从不完整的时空观测中学习高阶结构。发现阶段通过互补的拓扑和残差相关证据构建多尺度超图,并采用仅基于观测的选取器,适应支持的交互尺度。细化阶段添加一个小型超图条件残差网络,其安全性由构造保证:在存在信息残差特征时学习非线性修正,在不存在时则退化为线性估计。我们证明MSHL可以表示无法被成对图先验捕捉的组内守恒模式,能够适应最佳固定尺度,至多一个对数因子,将这种优势转移到验证的填补误差中,并允许单侧细化保证。在两个真实交通网络上评估,针对散落单元缺失、连续块断电和整个传感器黑箱在五种速率下,MSHL在高阶结构可识别时优于成对图基线,否则在采样噪声范围内匹配。结果表明,可靠的基础设施学习存在更广泛的原则:缺失数据不应被视为孤立的填补条目,而应视为发现结构的证据。

英文摘要

Sensor networks increasingly govern modern infrastructure, yet the data they lose are rarely missing in the uniform-random patterns assumed by standard imputation benchmarks. Loop detectors go offline during calibration, roadside cabinets silence clusters of nearby sensors, and newly installed instruments provide no history. Such failures create structured absences whose values are constrained by higher-order relations among groups of sensors, not merely by pairwise proximity. Existing low-rank and graph-based methods often miss this collective structure and can fail when missingness becomes coherent. We introduce Multi-Scale Hypergraph Laplacians (MSHL), a two-stage framework for learning higher-order structure from incomplete spatiotemporal observations. The Discovery stage builds a multi-scale hypergraph from complementary topology and residual-correlation evidence, with an observation-only selector that adapts to the supported interaction scale. The Refinement stage adds a small hypergraph-conditioned residual network that is safe by construction: it learns nonlinear corrections where informative residual features exist and defers to the linear estimate where they do not. We prove that MSHL represents group-conservation patterns inaccessible to pairwise graph priors, adapts to the best fixed scale up to a logarithmic factor, transfers this advantage to held-out imputation error, and admits a one-sided refinement guarantee. On two real traffic networks evaluated across scattered cell missingness, contiguous block outages, and whole-sensor blackouts at five rates, MSHL improves over a pairwise-graph baseline whenever higher-order structure is identifiable and otherwise matches it within sampling noise. The results point to a broader principle for reliable infrastructure learning: missing data should be treated not as isolated entries to fill, but as evidence of structure to discover.

2605.17314 2026-05-19 cs.CL cs.AI cs.LG

Weak-to-Strong Elicitation via Mismatched Wrong Drafts

通过不匹配的错误草稿实现弱到强的引导

Wei Deng

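AI summary This paper studies whether mismatched wrong drafts from a smaller, weaker model can elicit capability in a stronger learner, finds that the strategy performs well on MATH-500 and AIME 2025/2026, and contributes an effective training recipe.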

详情
AI中文摘要

我们考虑是否可以利用较小、较弱模型的离线经验来引导更强的学习者,使其在在线策略学习(如GRPO)无法达到的能力。我们发现,将数学上错误但更领域训练的较小模型生成的草稿注入更强学习者的GRPO上下文,能一致优于标准在线GRPO在MATH-500和离分布AIME 2025/2026上。具体来说,我们使用Mathstral-7B作为学习者,Qwen2.5-Math-1.5B作为草稿模型,8.8K Level 3--5 MATH问题(其中MATH-500被排除),并使用Dr. GRPO进行训练。不匹配是关键成分:在保持其他条件不变的情况下,将草稿洗牌到不匹配的问题中,使MATH-500的greedy pass@1提升+1.62pp(n=10种子,p=0.0015,Welch's t检验)。事实上,不匹配-错误变体在MATH-500上所有测试的变体中均优于。在离分布AIME 2025和2026上,不匹配-错误变体在每个样本预算从k=1到k=1024的所有年份中,均将pass@k提升到Mathstral-7B(其原生[INST]格式)和Qwen2.5-Math-1.5B草稿模型之上。所有变体在测试时使用相同的提示,没有草稿注入。该配方——在单个GPU上训练,无需SFT、奖励模型、合成数据和无produce-critique-revise内循环——在Mathstral-7B-v0.1上达到了71.98%的MATH-500成绩,这是目前该模型的最高已发表结果,超过了WizardMath流程在完整MATH上的70.9%(SFT + PPO加过程/指令奖励模型)。

英文摘要

We consider whether off-policy experience from a smaller, weaker model can elicit capability in a stronger learner that on-policy RL fine-tuning (e.g., GRPO) does not reach. We find that injecting mathematically wrong drafts from a smaller but more domain-trained model -- mismatched to the current problem -- into a stronger learner's GRPO context consistently outperforms standard on-policy GRPO on held-out MATH-500 and out-of-distribution AIME 2025/2026. Concretely, we use Mathstral-7B as the learner, Qwen2.5-Math-1.5B as the draft model, 8.8K Level 3--5 MATH problems (with MATH-500 held out), and train with Dr. GRPO. Mismatch is an active ingredient: shuffling drafts to mismatched problems while holding everything else constant yields $+1.62$pp on MATH-500 (greedy pass@1) over the matched-wrong variant ($n=10$ seeds, $p=0.0015$, Welch's $t$). In fact, the mismatched-wrong variant leads all other variants we tested on MATH-500 across both greedy pass@1 and sampling pass@$k$. On out-of-distribution AIME 2025 and 2026, the mismatched-wrong variant uniquely lifts pass@$k$ above both Mathstral-7B (in its native [INST] format) and the Qwen2.5-Math-1.5B draft model at every sample budget from $k=1$ to $k=1024$ across 2 seeds ($+14.2$pp on 2025 and $+9.0$pp on 2026 at pass@1024 over Mathstral-7B), and at pass@1024 also leads no-draft, matched-wrong, and mismatched-correct variants on both years. All variants use the same prompt with no draft injection at test time. The recipe -- trained on a single GPU with no SFT, no reward models, no synthesized data, and no produce-critique-revise inner loop -- reaches 71.98% MATH-500 on Mathstral-7B-v0.1, the highest published result on this model to our knowledge, surpassing the heavier WizardMath pipeline at 70.9% on full MATH (SFT + PPO with process/instruction reward models).
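The abstract reports pass@$k$ for budgets from $k=1$ to $k=1024$ but does not say how it is estimated. A widely used choice is the unbiased estimator $1 - \binom{n-c}{k} / \binom{n}{k}$ computed from $n$ samples of which $c$ are correct; the sketch below assumes that estimator, and the function name is hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples with c correct ones.

    Returns the probability that at least one of k samples drawn without
    replacement from the n generated samples is correct.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        return 1.0  # Fewer than k incorrect samples: every draw hits a correct one.
    return 1.0 - comb(n - c, k) / comb(n, k)


print(pass_at_k(n=4, c=1, k=2))  # 0.5
print(pass_at_k(n=5, c=5, k=2))  # 1.0
```

For $k=1$ the estimator reduces to the plain fraction of correct samples $c/n$. A benchmark-level score would average this per-problem value over all problems.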

2605.17312 2026-05-19 cs.CV

VISTA: Triplet-Supervised Video Style Transfer with Diffusion Transformers

VISTA: 基于扩散变换器的三元组监督视频风格迁移

Yiren Song, Wangzi Yao, Haofan Wang, Mike Zheng Shou

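AI summary This paper proposes VISTA, which introduces large-scale triplet data and a diffusion-transformer-based framework to address the joint modeling and disentanglement of style, content, and motion in video style transfer, achieving high-quality style transfer results.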

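Details
AI abstract (translated from Chinese)

Video style transfer aims to render a video in a target artistic style while preserving its content, structure, and motion. Although image stylization has advanced rapidly, video stylization remains challenging because of temporal inconsistency. Most existing methods stylize frames or keyframes and enforce consistency through heuristic temporal propagation, which is prone to drift and flickering artifacts under occlusion, disocclusion, and long-term motion. We introduce VISTA-1000, a dataset with 1,000 styles and motion-aligned triplets (style reference, clean video, stylized video), and propose a diffusion-transformer-based in-context video style transfer framework with a lightweight style adapter for robust style extraction. Extensive experiments show that the method achieves state-of-the-art performance in style fidelity, temporal consistency, and content preservation.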

English abstract

Video style transfer aims to render videos in a target artistic style while preserving content, structure, and motion. While image stylization has advanced rapidly, video stylization remains challenging due to temporal inconsistency. Most existing methods stylize frames or keyframes and enforce consistency via heuristic temporal propagation, which is brittle under occlusions, disocclusions, and long-term motion, leading to drift and flickering artifacts. We argue that a fundamental bottleneck lies in the lack of large-scale triplet data and a principled training paradigm that jointly models and disentangles style, content, and motion. To address this, we introduce VISTA-1000, a synthetic dataset with 1,000 styles and motion-aligned triplets of style reference, clean video, and stylized video, and propose a diffusion-transformer-based in-context video style transfer framework with a lightweight style adapter for robust style extraction. Extensive experiments demonstrate SOTA performance in style fidelity, temporal consistency, and content preservation.

2605.17311 2026-05-19 cs.CV

SpecSem-Net: Integrating Spectral and Semantic Features for Robust AI-generated Video Detection

SpecSem-Net: 通过融合频谱和语义特征实现鲁棒的AI生成视频检测

Zixi Wei, Huixuaun Zhang, Xiaojun Wan

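AI summary This paper proposes the SpecSem-Net framework, which introduces a semantic-guided spectral denoising mechanism to detect high-fidelity AI-generated videos; experiments show accuracies of 87.25% and 95.59% on the authors' benchmark and on public datasets, respectively.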

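Details
AI abstract (translated from Chinese)

The remarkable visual fidelity of recent commercial video generation models such as Sora and Veo makes robust AI-generated video detection essential, to keep synthetic content from becoming indistinguishable from real video and being used for disinformation. However, existing detectors often fail because they rely too heavily on increasingly realistic semantic features and overlook subtle spectral artifacts. This paper proposes SpecSem-Net, the first framework to introduce a semantic-guided spectral denoising mechanism specifically for detecting high-fidelity AI-generated video. Specifically, we design a spectral module that extracts high-frequency features through Fourier-transform-based filtering. To reduce misjudgments caused by spectral noise, we adopt a gated merging mechanism that adaptively fuses semantic context. In addition, to evaluate detectors on the latest top-tier generative models, we construct a comprehensive benchmark covering 5 top commercial generators. Extensive experiments show that SpecSem-Net outperforms existing methods on both our benchmark and public datasets, reaching accuracies of 87.25% and 95.59%, respectively.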

English abstract

The remarkable visual fidelity of recent commercial video generative models, such as Sora and Veo, renders robust AI-generated video detection increasingly essential to prevent synthetic content from being indistinguishable from real videos and exploited for disinformation. However, existing detectors often fail due to an over-reliance on increasingly realistic semantic features, neglecting subtle spectral artifacts. In this paper, we propose SpecSem-Net, the first framework to introduce a semantic-guided spectral denoising mechanism specifically for high-fidelity AI-generated video detection. Specifically, we design a spectral module to extract high-frequency features via Fourier-Transform based filtering. Furthermore, to reduce misjudgments arising from spectral noise, we employ a Gated Merging Mechanism to adaptively fuse semantic context, effectively mitigating spectral noise. Additionally, to evaluate detector performance on the latest top-tier generative models, we construct a comprehensive benchmark comprising 5 SOTA commercial generators. Extensive experiments demonstrate that SpecSem-Net outperforms existing methods, achieving accuracies of 87.25% and 95.59% on our benchmark and public datasets, respectively.

2605.17310 2026-05-19 cs.CV cs.AI

Attention Hijacking: Response Manipulation Across Queries in Vision-Language Models

注意力劫持:跨查询的视觉-语言模型响应操控

Zhiqiang Wang, Dongrui Liu, Yan Li, Zonghao Ying, Wei Xue, Wenhan Luo, Yike Guo

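AI summary This paper studies cross-query response manipulation in vision-language models and proposes Attention Hijacking, a new adversarial attack that steers internal attention distributions to keep an image-dominant pattern, improving attack effectiveness across different queries.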

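Details
AI abstract (translated from Chinese)

Existing adversarial attacks on vision-language models (VLMs) can steer model outputs toward an attacker-specified target response, but their effectiveness often degrades when the same perturbed input is paired with different text queries. This paper studies cross-query response manipulation, in which a single adversarial example is expected to remain effective across diverse user queries. We first analyze the limitations of existing attacks and find that successful transfer is closely associated with maintaining an image-dominant attention pattern during response generation. Motivated by this observation, we propose Attention Hijacking, a new adversarial attack that explicitly steers internal attention distributions toward a persistent image-dominant pattern. By amplifying the influence of visual tokens on target response tokens while suppressing the competing influence of text tokens, the method reduces the manipulated output's dependence on the specific wording of the query. Extensive experiments on widely used VLMs show that Attention Hijacking substantially improves cross-query transferability across diverse target responses and unseen queries. The method also extends effectively to multiple attack scenarios, offering new insight into the role of attention stability in transferable response manipulation for VLMs.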

English abstract

Existing adversarial attacks on vision-language models (VLMs) can steer model outputs toward attacker-specified target responses, but their effectiveness often degrades when the same perturbed input is paired with different textual queries. This paper studies cross-query response manipulation, where a single adversarial example is expected to remain effective across diverse user queries. We first analyze the limitations of existing attacks and find that successful transfer is closely associated with preserving an image-dominant attention pattern during response generation. Motivated by the observation, we propose Attention Hijacking, a novel adversarial attack that explicitly steers internal attention distributions toward a persistent image-dominant pattern. By amplifying the influence of visual tokens on target response tokens while suppressing the competing influence of textual tokens, our method reduces the dependence of the manipulated output on the specific wording of the query. Extensive experiments on widely used VLMs show that Attention Hijacking substantially improves cross-query transferability across diverse target responses and unseen queries. The method also extends effectively to multiple attack scenarios, offering new insights into the role of attention stability in transferable response manipulation for VLMs.