arXivDaily: daily arXiv digest, updated Monday through Friday
2605.16775 2026-05-19 cs.CV cs.AI cs.LG

VolTA-3D: Self-Supervised Learning for Brain MRI using 3D Volumetric Token Alignment

Amy Makawana, Abhijeet Parida, Marius George Linguraru, Julia Ive, Syed Muhammad Anwar

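AI summary: This paper proposes VolTA-3D, a 3D Vision Transformer framework for self-supervised learning on brain MRI. By jointly aligning global class-style tokens and local patch tokens, it improves the transferability of volumetric representations, yielding better generalization and robustness across several downstream tasks.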

Comments Accepted at EMBC 2026

Abstract

Self-supervised learning (SSL) has advanced medical image analysis by enabling learning from large unlabeled data. However, in brain magnetic resonance imaging (MRI), most 3D models remain specialized for either segmentation or classification, limiting their ability to generalize across datasets, imaging protocols, and downstream tasks. This lack of transferability constrains the clinical utility of 3D MRI models, despite the availability of unlabeled volumetric data. We present VolTA-3D, a self-supervised 3D Vision Transformer framework designed to learn transferable volumetric representations. VolTA-3D jointly aligns global class-style tokens and local patch tokens within a student-teacher paradigm and enforces fine-grained structural reconstruction. This combined global-local alignment addresses the limited semantic diversity and subtle anatomical characteristics of brain MRI, which challenge existing SSL approaches. We evaluate VolTA-3D on multiple out-of-distribution downstream tasks, including hippocampal segmentation and classification of sex and of Alzheimer's disease versus healthy controls. Across all tasks, representations learned by VolTA-3D outperform randomly initialized baselines, demonstrating improved transferability and robustness under domain shift. Hence, jointly enforcing global semantic consistency and local structural learning during pretraining enables broader concept learning from unlabeled brain MRI data. Overall, VolTA-3D supports effective multi-task downstream performance with task-specific pretraining, a step towards generalizable and clinically viable 3D models.

2605.16774 2026-05-19 cs.CV cs.AI

CANSURF: An ASV-View Can Dataset and Benchmark for Detection and Tracking of Surface-Level Debris

Zaid Aljundi, Zahra F. Rahmatullah, Mostafa Elemam, Abdullah Moosa

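AI summary: This paper presents a new ASV vision system and a surface-can dataset for detecting and tracking small, reflective debris such as aluminum cans on the water surface. The dataset contains about 7.3k raw images, expanded through ten augmentation types to about 57k training/validation images covering diverse lighting and water states. In the benchmark, training YOLOv11 on CANSURF improves performance 12x, which demonstrates the dataset's value. Experiments show that YOLOv11+ByteTrack gives the most stable tracking and the best multi-object accuracy, while YOLOv11+SAHI raises recall on far-field cans at some cost in precision. Given the mission requirements, YOLOv11+SAHI is better at detecting the maximum number of cans.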

Comments Published in the 2025 8th International Conference on Signal Processing and Information Security (ICSPIS). Published and available to view on IEEE Xplore

Journal ref Proc. 2025 8th Int. Conf. Signal Processing and Information Security (ICSPIS), 2025, pp. 1-6

Abstract

Surface-level marine debris remains a practical bottleneck for autonomous clean-up, where small, reflective targets (e.g., aluminum cans) must be detected at distance under glare, ripples, and partial submersion. This paper presents an ASV vision system and a new surface-can dataset. The dataset comprises ~7.3k raw images extracted from videos and annotated with bounding boxes, expanded via ten augmentation types to ~57k training/validation images spanning diverse lighting and water states. A family of detectors and detector-tracker pipelines tailored to surface operations was benchmarked. Training YOLOv11 on CANSURF boosts performance 12x over generic datasets, highlighting the dataset's value. Experiments show that YOLOv11+ByteTrack yields the most stable tracks (fewer identity switches) and stronger multi-object accuracy, while YOLOv11+SAHI increases recall on far-field cans at the cost of lower precision in full-context inputs. Given the mission profile, single-can pickup with approach and grab, YOLOv11+SAHI proves better for detecting the maximum number of cans. No prior open dataset targets aluminum cans on water from a surface-level viewpoint; this dataset fills this gap and supports reproducible evaluation.

2605.16770 2026-05-19 cs.CL cs.AI

Exploring Lightweight Large Language Models for Court View Generation

Zhitian Hou, Tianyong Hao, Nanli Zeng, Zhixiong Chao, Kun Zeng

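AI summary: This paper studies the capabilities of lightweight large language models in court view generation and their impact on charge prediction. It examines how model architecture and size affect performance, compares lightweight LLMs with deep neural networks on these tasks, and develops the CVGEvalKit evaluation framework.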

Abstract

Criminal Court View Generation (CVG) is a critical task in Legal Artificial Intelligence (Legal AI) that involves generating court views from case facts. In this work, we systematically explore the capabilities of lightweight (smaller than 2B parameters) large language models (LLMs) in CVG and their impact on charge prediction. Our study addresses four key questions: (1) how do different LLM architectures affect CVG quality and charge prediction, (2) how does LLM size contribute to performance, (3) how do lightweight LLMs compare with Deep Neural Networks (DNNs) on these tasks, and (4) how does predicting the charge after first generating the court view compare with predicting it directly. We also develop CVGEvalKit, an evaluation framework that includes three publicly available datasets for CVG and for predicting the associated charges. Comprehensive experiments are conducted on this framework, where models are trained on a mixed training set and evaluated on each dataset's test set. Experimental results provide new insights into the trade-offs between model architecture and model size and into the influence between different tasks, highlighting the potential of lightweight LLMs in judicial AI applications. The source code is anonymously available at https://github.com/ZhitianHou/CVGEvalKit.

2605.16769 2026-05-19 cs.CV

GLT-PEFT: Gated Lie-Tucker Parameter-Efficient Fine-Tuning for Alzheimer's Disease Diagnosis with Hippocampal Segmentation Pretraining

Guanghua He, Hancan Zhu, Gaohang Yu, An Zhang

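AI summary: This paper proposes GLT-PEFT, which performs parameter-efficient fine-tuning through a gated Lie-Tucker decomposition for Alzheimer's disease diagnosis. Combined with hippocampal segmentation pretraining, it improves the adaptability and robustness of medical imaging models.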

Abstract

Parameter-efficient fine-tuning (PEFT) has emerged as a promising paradigm for adapting pretrained models under limited data conditions. However, most existing PEFT methods are designed for matrix-structured parameters and are not well suited for high-dimensional convolutional kernels in medical imaging models. Moreover, they typically rely on additive updates and lack mechanisms to preserve the geometric structure of pretrained parameters, while multiplicative (geometry-aware) updates are difficult to integrate within a unified framework. To address this issue, this paper proposes GLT-PEFT, a gated Lie-Tucker parameter-efficient fine-tuning framework for Alzheimer's disease (AD) diagnosis. The proposed approach transfers a hippocampal segmentation pretrained model to a downstream classification task. Tucker decomposition enables tensor-aware low-rank adaptation of 3D convolutional kernels, while Lie group-based transformations provide structure-preserving multiplicative updates. A gating mechanism further reconciles additive and multiplicative update forms, resulting in a unified and more stable fine-tuning strategy. Extensive experiments demonstrate that GLT-PEFT achieves effective cross-task transfer while significantly reducing trainable parameters, highlighting its effectiveness for efficient and robust adaptation in medical imaging models.
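To make the Tucker-based adaptation more concrete, the following is a minimal NumPy sketch of how a low-rank update for a 3D convolution kernel could be assembled from a small core tensor and one factor matrix per kernel mode. It is not the authors' implementation: the kernel shape, the ranks, and the function name `tucker_delta` are all hypothetical, and the gating and Lie-group components described in the abstract are omitted.

```python
import numpy as np

def tucker_delta(core, factors):
    """Rebuild a full 5-D kernel update from a Tucker core and per-mode factors."""
    # core has shape (r_out, r_in, r_d, r_h, r_w); factors[i] has shape (dim_i, r_i).
    u_out, u_in, u_d, u_h, u_w = factors
    return np.einsum("abcde,oa,ib,xc,yd,ze->oixyz", core, u_out, u_in, u_d, u_h, u_w)

rng = np.random.default_rng(0)
kernel_shape = (64, 32, 3, 3, 3)   # hypothetical (out, in, depth, height, width)
ranks = (4, 4, 2, 2, 2)            # hypothetical Tucker ranks, one per mode

core = rng.standard_normal(ranks)
factors = [rng.standard_normal((dim, r)) for dim, r in zip(kernel_shape, ranks)]

delta = tucker_delta(core, factors)          # same shape as the frozen kernel
full_params = int(np.prod(kernel_shape))     # parameters in a dense update
tucker_params = core.size + sum(f.size for f in factors)

print(delta.shape, full_params, tucker_params)
```

Under these assumed shapes, only the core and the five factor matrices would be trained, which is a small fraction of the dense kernel size. That is the sense in which the adaptation is tensor-aware and low-rank.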

2605.16768 2026-05-19 cs.CV eess.IV

Axial-Relation Guided Fusion State Space Model for Optical-Elevation Sensing Image Segmentation

Feng Gao, Zhilin Jin, Yanhai Gan, Junyu Dong, Qian Du

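AI summary: This paper proposes a state space model-based framework for optical-elevation remote sensing image segmentation. By introducing a multi-scale state space module and an axial-relation guided fusion module, it improves both the performance and the computational efficiency of semantic segmentation on multi-source remote sensing images.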

Comments Accepted by IEEE GRSL 2026

Abstract

Semantic segmentation of multi-source remote sensing images is a fundamental task for Earth observation applications. Existing methods often struggle with insufficient multi-scale context modeling and suboptimal cross-modal feature fusion, limiting their performance in complex high-resolution scenes. To this end, we propose Axial-Relation Guided Fusion Mamba (ARG-Mamba), a state space model-based framework for optical-elevation remote sensing image segmentation. Specifically, we introduce a Multi-Scale State Space Module to capture both fine-grained local details and global contextual dependencies with linear computational complexity. Moreover, an Axial-Relation Guided Fusion Module is designed to explicitly model global cross-modal correlations along horizontal and vertical axes, enabling efficient feature fusion between optical and elevation modalities. Extensive experiments conducted on the ISPRS Vaihingen and Potsdam datasets demonstrate that our ARG-Mamba consistently outperforms state-of-the-art methods while maintaining favorable computational efficiency. The code will be made publicly available at https://github.com/oucailab/ARG-Mamba.

2605.16767 2026-05-19 cs.CL

Retrieval-Based Multi-Label Legal Annotation: Extensible, Data-Efficient and Hallucination-Free

Li Zhang, Jaromir Savelka, Kevin Ashley

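AI summary: This paper proposes a retrieval-based method for multi-label legal annotation. It embeds documents and label descriptions with a frozen retrieval model and runs a k-nearest-neighbor search, which gives data-efficient annotation while avoiding the hallucination problem of generative models. The results show its practicality for high-cardinality, rapidly changing legal label spaces.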

Comments 10 pages, 3 figures

Abstract

Multi-label legal annotation requires assigning multiple labels from large, evolving taxonomies to long, fact-intensive documents, often under limited supervision. Parametric encoders typically require task-specific training and retraining when the label set changes, while prompting generative large language models becomes costly and degrades as the label space grows. We cast legal annotation as retrieval: we embed documents and label descriptions with a frozen retrieval model and predict labels via k-nearest neighbors in the embedding space, enabling updates by re-embedding and re-indexing rather than gradient-based backpropagation. Across three legal datasets (ECtHR-A, ECtHR-B, and Eurlex with 100 labels), retrieval achieves competitive accuracy and strong data efficiency; on Eurlex, Qwen-8B retrieval improves Macro-F1 from 40.41 (GPT-5.2, zero-shot) to 49.12 while reducing estimated compute by 20-30 times compared to fine-tuning. With only N=100 training samples, retrieval nearly doubles Micro-F1 over hierarchical Legal-BERT on ECtHR-A (48.29 vs. 27.87). We also quantify a reliability failure mode of generative inference: GPT-5.2 hallucinates labels outside the provided taxonomy in 0.12-0.9% of test samples under deterministic decoding. In contrast, retrieval strictly respects defined label sets, eliminating hallucination by design. These results suggest retrieval-model-based annotators are a practical, deployable alternative for high-cardinality and rapidly changing legal label spaces.
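The retrieval formulation can be illustrated with a small sketch. It reflects one plausible reading of the abstract (nearest label descriptions by cosine similarity) and is not the authors' pipeline. The `embed` function is a toy bag-of-words stand-in for a frozen retrieval model, and the vocabulary, label names, and descriptions are made up for illustration.

```python
import numpy as np

def embed(text, vocab):
    """Toy stand-in for a frozen retrieval model: L2-normalised bag-of-words counts."""
    tokens = text.lower().split()
    vec = np.array([tokens.count(word) for word in vocab], dtype=float)
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

def predict_labels(doc, label_descriptions, vocab, k=2):
    """Return the k labels whose description embeddings are closest to the document."""
    names = list(label_descriptions)
    label_matrix = np.stack([embed(label_descriptions[n], vocab) for n in names])
    scores = label_matrix @ embed(doc, vocab)      # cosine similarity of unit vectors
    top = np.argsort(-scores)[:k]
    return [names[i] for i in top]

vocab = ["trial", "fair", "torture", "privacy", "property", "home"]
labels = {
    "fair_trial": "right to a fair trial",
    "no_torture": "prohibition of torture",
    "private_life": "respect for privacy and home",
}
doc = "the applicant complained the trial was not fair and his home privacy was violated"
predicted = predict_labels(doc, labels, vocab, k=2)

# Extending the taxonomy only requires embedding the new description; nothing is retrained.
labels["property"] = "protection of property"
updated = predict_labels("dispute over property", labels, vocab, k=1)
print(predicted, updated)
```

Because predictions are always drawn from the keys of `labels`, the sketch cannot emit a label outside the taxonomy, which mirrors the abstract's point about eliminating hallucination by design.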

2605.16764 2026-05-19 cs.CV eess.IV

Synthetic Aperture Radar Image Change Detection Based on Global Dynamic Context-Aware Network

Baogui Huan, Chuanzheng Gong, Dezhong Chen, Feng Gao, Junyu Dong, Qian Du

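AI summary: This paper proposes GDNet, a global dynamic context-aware network designed for synthetic aperture radar image change detection. By introducing a global dynamic convolution module and a two-stage Mixup strategy, it integrates local detail with global context and improves detection of diverse change patterns.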

Comments Accepted by IEEE JSTARS 2026

Abstract

Convolutional neural networks (CNNs) have been extensively and successfully applied to the task of synthetic aperture radar (SAR) image change detection. However, conventional convolutional layers are inherently limited by their local receptive fields, which mainly capture spatially localized patterns while neglecting the global context that is often crucial for accurately distinguishing subtle or large-scale changes in SAR imagery. To address these limitations, we propose a novel Global Dynamic Context-Aware Network (GDNet) specifically tailored for SAR image change detection. At the core of our approach lies a novel global dynamic convolution module, which adaptively modulates convolution kernel weights according to the global semantic information extracted from the input features. By dynamically incorporating long-range dependencies, this mechanism enables the network to integrate both local detail and global context, thus improving its ability to detect diverse change patterns. In addition, we introduce a carefully designed two-stage Mixup strategy for model training. Unlike conventional single-stage Mixup, our two-stage design generates more diverse and informative training samples, effectively regularizing the model and yielding more stable and reliable classification results even under limited data scenarios. Extensive experiments on three SAR datasets demonstrate the superiority of the proposed GDNet compared to other state-of-the-art methods. These findings highlight the potential of global dynamic modeling and advanced data augmentation strategies for advancing SAR image interpretation. Source code is available at https://github.com/oucailab/GDNet.

2605.16758 2026-05-19 cs.CL

Language Acquisition Device in Large Language Models

Masato Mita, Taiga Someya, Ryo Yoshida, Yohei Oseki

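AI summary: This paper proposes a pre-pretraining method inspired by the Language Acquisition Device. Pre-pretraining on the formal language MP-STRUCT improves the data efficiency of large language models and gives them resistance to structurally implausible languages.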

Comments Accepted to ACL2026 Main Conference

Abstract

Large Language Models (LLMs) remain substantially less data-efficient than humans. Pre-pretraining (PPT) on synthetic languages has been proposed to close this gap, with prior work emphasizing highly expressive formal languages such as $k$-Shuffle Dyck. Inspired by the Language Acquisition Device (LAD) hypothesis, which posits that innate constraints preemptively restrict the learner's hypothesis space to natural-language-like structure, we propose LAD-inspired PPT: pre-pretraining on MP-STRUCT, a formal language whose strings encode hierarchical composition, feature-based dependencies, and long-distance displacement via MERGE, AGREE, and MOVE. A brief 500-step PPT with MP-STRUCT matches strong formal-language baselines in token efficiency while additionally imparting a human-like resistance to structurally implausible languages (e.g., REVERSE). Analyzing simplified variants, we find that MP-STRUCT CORE outperforms $k$-Shuffle Dyck despite not being definable in C-RASP (a formal bound on transformer expressivity), challenging the prior hypothesis that effective PPT languages must be both hierarchically expressive and circuit-theoretically learnable. We show that functional landmarks, which reduce dependency resolution ambiguity, are a key driver, suggesting that effective PPT design depends not only on expressivity but also on the accessibility of dependency resolution.

2605.16757 2026-05-19 cs.AI cs.MA stat.ME stat.ML

NeuroMAS: Multi-Agent Systems as Neural Networks with Joint Reinforcement Learning

Haoran Lu, Luyang Fang, Wenxuan Zhong, Ping Ma

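AI summary: This paper proposes NeuroMAS, a method that treats a multi-agent system as a trainable, scalable neural-network-like architecture and uses joint reinforcement learning to improve the performance and scalability of multi-agent systems.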

Abstract

Multi-agent language systems are often built as hand-designed workflows, where agents are assigned semantic roles and communication protocols are specified in advance. We propose NeuroMAS, a method that first treats a multi-agent language system as a trainable and scalable neural-network-like architecture with LLM agents as nodes and intermediate textual signals as edges. In NeuroMAS, agent nodes are role-free but structure-aware: the topology only determines how information can flow in general, while reinforcement learning training determines how nodes communicate, specialize, and coordinate. This formulation shifts multi-agent design from workflow engineering toward architecture design, where depth, width, connectivity, and growth protocol become scalable sources of capability. Further, we provide a theoretical perspective showing why such modular textual computation is more parameter-efficient when tasks admit hierarchical decompositions. Experiments show that NeuroMAS improves significantly over both inference-time and trained multi-agent baselines. We further find that organizational scaling is path-dependent: larger systems can be challenging to train from scratch, but become feasible when grown progressively from smaller trained systems. These results suggest that learned neural multi-agent systems are a promising scaling axis for LLMs.

2605.16755 2026-05-19 cs.LG cs.AI

Learning Unbiased Permutations via Flow Matching

Yimeng Min, Carla P. Gomes

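AI summary: This paper proposes PermFlow, a framework that operates directly on the affine subspace of matrices with unit row and column sums to learn multimodal permutation distributions, avoiding the collapse under ambiguity seen in entropy-regularized Sinkhorn methods.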

Abstract

Learning permutations is fundamental to sorting, ranking, and matching, but existing differentiable methods based on entropy-regularized Sinkhorn produce a single softened solution and collapse under ambiguity. We present PermFlow, a conditional flow matching framework that operates directly on the affine subspace of matrices with unit row and column sums. A closed-form tangent-space projector preserves these constraints exactly along every trajectory, by construction rather than through iterative correction, and a nearest-target coupling routes distinct noisy initializations toward distinct valid permutations. The result is a model that captures multimodal permutation distributions rather than collapsing them to a single mode. On a visual sorting task with blended-digit ambiguity and a symmetric linear assignment problem, PermFlow achieves high accuracy on unambiguous inputs and recovers both valid permutations under ambiguity, where Sinkhorn-based baselines structurally fail.
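As an illustration of how a closed-form projector can keep row and column sums fixed along a trajectory, the sketch below uses the standard double-centering projection onto matrices with zero row and column sums. This is an assumption for illustration: the abstract does not give PermFlow's projector, and the name `project_tangent`, the step size, and the random velocity are hypothetical.

```python
import numpy as np

def project_tangent(v):
    """Orthogonally project v onto matrices whose rows and columns all sum to zero."""
    row_mean = v.mean(axis=1, keepdims=True)
    col_mean = v.mean(axis=0, keepdims=True)
    return v - row_mean - col_mean + v.mean()

rng = np.random.default_rng(0)
n = 4
x = np.full((n, n), 1.0 / n)                       # a point with unit row and column sums
velocity = project_tangent(rng.standard_normal((n, n)))
x_next = x + 0.1 * velocity                        # one Euler step along the projected velocity

print(x_next.sum(axis=0), x_next.sum(axis=1))
```

Since the projected velocity has zero row and column sums, any step along it leaves the unit-sum constraints intact without an iterative correction such as Sinkhorn normalisation.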

2605.16747 2026-05-19 cs.LG math.AP math.OC math.PR math.ST stat.TH

Propagation of Chaos in Contextual Flow Maps

Shi Chen, Zhengjiang Lin, Kaizhao Liu, Philippe Rigollet

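AI summary: This paper develops a quantitative statistical theory of transformers in the large-context regime by adopting the abstraction of contextual flow maps (CFMs): dynamical systems that evolve a distinguished token in the presence of a contextual measure across a stack of attention blocks. In this framework, the finite-context model approximates an idealized infinite-context system in which the contextual measure is replaced by its underlying population, so the context length n becomes a statistical resource. Using the McKean-Vlasov structure of the dynamics and the classical machinery of propagation of chaos, the authors establish a forward bound on the deviation between finite- and infinite-context CFMs along depth, and a backward bound on the deviation between the corresponding training trajectories across iterations of online gradient descent. Both bounds achieve the optimal Wasserstein rate n^{-1/d} for general CFMs and the parametric rate n^{-1/2} for a restricted class of CFMs that includes transformers. The analysis rests on a new Eulerian adjoint formulation and stability estimates for the resulting forward-adjoint system, both of which may be of independent interest.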

Comments 31 pages, 1 figure

Abstract

We develop a quantitative statistical theory of transformers in the large-context regime by adopting the abstraction of contextual flow maps (CFMs): dynamical systems that evolve a distinguished token in the presence of a contextual measure across a stack of attention blocks. Within this framework, the finite-context model approximates an idealized infinite-context system in which the contextual measure is replaced by its underlying population, so that the context length $n$ becomes a statistical resource. Exploiting the McKean--Vlasov structure of the dynamics and the classical machinery of propagation of chaos, we establish a forward bound controlling the deviation between the finite- and infinite-context CFMs uniformly along depth, and a backward bound controlling the deviation between the corresponding training trajectories uniformly across iterations of online gradient descent. Both bounds achieve the optimal Wasserstein rate $n^{-1/d}$ for general CFMs and parametric rate $n^{-1/2}$ for a restricted class of CFMs that includes transformers as a special case. The analysis rests on a new Eulerian adjoint formulation of the loss gradient and stability estimates for the resulting forward--adjoint system, both of which may be of independent interest.

2605.16746 2026-05-19 cs.AI cs.LG

State Contamination in Memory-Augmented LLM Agents

Yian Wang, Agam Goyal, Yuen Chen, Hari Sundaram

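AI summary: This study examines safety problems caused by state contamination in memory-augmented LLM agents. By analyzing how toxic content propagates through memory summaries, it introduces a new metric and shows that sanitizing state before information is compressed effectively reduces the hidden influence.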

Abstract

LLM agents increasingly rely on persistent state, including transcripts, summaries, retrieved context, and memory buffers, to support long-horizon interaction. This makes safety depend not only on individual model outputs, but also on what an agent stores and later reuses. We study a failure mode we call memory laundering: toxic or adversarial context can be compressed into memory summaries that no longer appear toxic under standard detectors, while still preserving hostile framing or conflict structure that influences future generations. Using paired counterfactual multi-agent rollouts, we show that toxic-origin memory summaries can remain below common toxicity thresholds while nevertheless increasing downstream toxicity relative to matched neutral baselines. To measure this hidden influence, we introduce the sub-threshold propagation gap (SPG), which quantifies downstream behavioral differences conditioned on memory states that a deployed monitor would classify as safe. Our experiments show that toxicity propagates through distinct state channels: raw transcript reuse drives overt downstream toxicity, while compressed memory carries hidden sub-threshold influence. We further find that mitigation depends critically on intervention placement. Sanitizing toxic state before summarization substantially reduces the hidden propagation gap, whereas cleaning only the completed summary can leave laundered influence intact. These results suggest that safety in memory-augmented agents should be treated as a state-control problem over evolving context, with sanitization applied before unsafe information is compressed into persistent memory.
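One way to picture the sub-threshold propagation gap is as a conditional difference in downstream toxicity. The sketch below is a hypothetical formulation rather than the paper's definition: it assumes paired rollouts, a scalar toxicity score per memory summary, and a fixed monitor threshold, and the numbers are made-up illustrative values.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class PairedRollout:
    memory_toxicity: float      # monitor score of the toxic-origin memory summary
    downstream_toxic: float     # downstream toxicity when conditioned on that summary
    downstream_neutral: float   # downstream toxicity for the matched neutral baseline

def sub_threshold_gap(rollouts, threshold=0.5):
    """Mean downstream toxicity difference over pairs whose memory passes the monitor."""
    passed = [r for r in rollouts if r.memory_toxicity < threshold]
    if not passed:
        return None   # no memory state would have been classified as safe
    return mean(r.downstream_toxic - r.downstream_neutral for r in passed)

# Illustrative values only, not results from the paper.
rollouts = [
    PairedRollout(0.10, 0.30, 0.10),
    PairedRollout(0.20, 0.25, 0.15),
    PairedRollout(0.80, 0.90, 0.10),   # flagged by the monitor, so excluded
]
gap = sub_threshold_gap(rollouts)
print(gap)
```

A positive gap under this reading would mean that memory summaries the monitor accepts still shift downstream behavior relative to the neutral baseline, which is the hidden influence the abstract describes.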

2605.16745 2026-05-19 cs.CV

EVA01: Unified Native 3D Understanding and Generation via Mixture-of-Transformers

Zongyuan Yang, Mingjing Yi, Wanli Ma, Chenzhuo Fan, Bocheng Li, Baolin Liu, Yuke Lou, Yingde Song, Yongping Xiong, Zhengdong Guo, Shimu Wang

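AI summary: This paper proposes EVA01, a framework that uses a Mixture-of-Transformers architecture to extend the modality boundary of multimodal large language models. It supports native 3D mesh understanding and generation as well as context-aware editing, and improves text-to-3D generation fidelity and multi-turn geometric editing.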

Comments 28 pages, 10 figures, 6 tables. Technical report

Abstract

This paper addresses the challenge of integrating 3D meshes as a native modality within Multimodal Large Language Models (MLLMs). Diffusion-based large reconstruction models decouple semantic understanding from geometric reasoning, operating as stateless reconstructors conditioned on dense 2D pixel priors. Recent MLLM-based methods treat the 3D modality as an external output rather than a native component of the multimodal sequence, making incremental adaptations without a systematic analysis of how geometric manifolds align with MLLM feature spaces. We introduce EVA01, a unified framework that extends the modality boundary of MLLMs to natively incorporate 3D mesh understanding, generation, and context-aware editing. Built upon a Mixture-of-Transformers (MoT) architecture, EVA01 decouples the model into a pre-trained Understanding Expert ($E_{\mathrm{und}}$) and a structurally mirrored Generation Expert ($E_{\mathrm{gen}}$), coupled through shared global self-attention with hard modality routing. This design aligns the semantic latent space of the MLLM backbone with the geometric manifold, enabling direct transfer of multimodal priors without intermediate 2D representations. Results show that EVA01 achieves state-of-the-art native text-to-3D generation fidelity and unlocks robust long-context multi-turn geometric editing with identity preservation, a capability fundamentally inaccessible to stateless reconstruction pipelines. Our findings further offer architectural insights for integrating 2D foundation models with 3D tasks, informing the design of 3D-native multimodal systems. Project Page: https://www.seeles.ai/research/pages/EVA01

2605.16743 2026-05-19 cs.RO

LACE: Latent Visual Representation for Cross-Embodiment Learning

Yoo Sung Jang, Kanchana Ranasinghe, Cristina Mata, Yichi Zhang, Jorge Mendez-Mendez, Michael S. Ryoo

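AI summary: This paper proposes LACE, a framework that uses correspondences between body parts shared across embodiments to align human and robot visual representations in the latent space of self-supervised learning backbones. This addresses the visual gap between human and robot embodiments and improves robot policies when robot demonstrations are scarce.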

Abstract

Cross-embodiment learning from human demonstrations is hindered by the visual gap between human and robot embodiments. While self-supervised learning (SSL) backbones encode rich inter-class semantics of general objects, we show they fail to establish correspondence between human and robot hands. We propose LACE, a framework that aligns human and robot visual representations in the latent space of these backbones by leveraging correspondences between shared body parts across embodiments as sparse supervision. These annotations can be automatically obtained via forward kinematics, and a single robot demonstration is sufficient to train the model. Our semantic alignment loss matches distributions induced by corresponding features, lifting patch-level supervision to semantic-level alignment, while a Gram loss preserves pretrained feature quality. This alignment enables robot policies to leverage abundant human data when robot demonstrations are scarce: in zero-shot transfer, policies using LACE-DINO outperform those using DINO by a large margin (65%), with consistent gains in low-data regimes and out-of-distribution environments.

2605.16742 2026-05-19 cs.CV stat.ME

Diffeomorphic Cortical Alignment via Direct Warping of Streamline Endpoints

Yang Xiang, Martin Cole, Zhengwu Zhang

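AI summary: This paper proposes a connectivity-based cortical alignment method that aligns cortical surfaces by operating directly on white-matter fiber-tract endpoints. It improves tract-level correspondence, achieving higher connectivity overlap coefficients on major fiber bundles and stronger robustness.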

Abstract

Cortical surface registration is often driven by local geometric descriptors (e.g., sulcal depth and curvature). While this approach achieves geometric correspondence, it neglects the long-range wiring constraints imposed by white-matter anatomy. Diffusion MRI tractography offers these crucial constraints; however, prior connectivity-informed pipelines typically align precomputed connectivity matrices, making the optimization highly sensitive to connectivity estimation and its resolution. In this paper, we introduce a novel connectivity-based surface registration method that aligns cortical surfaces by operating directly on white-matter fiber-tract endpoints. We model tract endpoints as a point cloud on the product manifold $Ω\times Ω$, where $Ω$ represents the spherical domain of the inflated cortical hemispheres. Our alignment method iteratively (i) computes a small diffeomorphic warp for $Ω$ by minimizing connectivity mismatch, and (ii) updates the endpoints based on this warp. The method relies on a geometric framework that ensures output warps are diffeomorphisms and has a final goal that optimizes the matching of well-known fiber bundles. Experiments on Human Connectome Project (HCP) data demonstrate improved tract-level correspondence, achieving higher connectivity-level overlap coefficients on major fiber bundles and stronger robustness across grid resolutions for $Ω$ compared to state-of-the-art methods such as ENCORE and MSMAll.

2605.16737 2026-05-19 cs.RO cs.CV

DriveSafer: End-to-End Autonomous Driving with Safety Guidance

DriveSafer: 结合安全指导的端到端自动驾驶

Shounak Sural, Raj Rajkumar

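AI summary: This paper proposes the DriveSafer framework, which improves the safety of end-to-end autonomous driving by reducing catastrophic planning failures rather than simply raising average planning quality.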

AI中文摘要

端到端(E2E)自动驾驶模型近年来在性能上有了显著提升,尤其是在越来越具有挑战性的基准测试中。然而,现代生成式E2E规划器仍然在安全关键场景中存在大量致命性故障。我们发现许多此类故障源于物理约束和安全要求的违反,导致不安全行为。受此发现启发,本文专注于改进生成式端到端驾驶中的安全结果,通过有针对性地减少致命性规划失败,而不是提升平均规划质量。为此,我们提出了DriveSafer,一种面向失败的的安全框架,用于端到端规划器。DriveSafer通过利用训练时的安全约束和推理时的安全指导,明确引导生成式规划器朝向安全行为。与最先进的DiffusionDrive模型相比,在NAVSIM基准测试中,DriveSafer将致命性故障数量(PDMS=0)减少了48%,在可行驶区域合规性故障上减少了超过65%。

英文摘要

End-to-End (E2E) autonomous driving models have shown growing capability in recent years, with performance improving on increasingly challenging benchmarks. However, modern generative E2E planners still suffer from a substantial number of catastrophic failures in safety-critical scenarios. We find that many such failures arise from violations of physical constraints and safety requirements, leading to unsafe behavior. Motivated by this finding, in this paper, we focus on improving safety outcomes in generative end-to-end driving with a targeted reduction of catastrophic planning failures, instead of enhancing average planning quality. Towards this end, we propose DriveSafer, a failure-aware safety framework for end-to-end planners. DriveSafer explicitly steers generative planners towards safe behaviors leveraging both training-time safety constraints and inference-time safety guidance. Compared to the state-of-the-art DiffusionDrive model, on the NAVSIM benchmark, DriveSafer reduces the number of catastrophic failures (PDMS=0) by 48%, with over 65% reduction in drivable-area compliance failures.

2605.16732 2026-05-19 cs.CV cs.LG

DiRotQ: Rotation-Aware Quantization for 4-bit Diffusion Transformers

DiRotQ:面向4位扩散变换器的旋转感知量化

Sayeh Sharify, Mahsa Salmani, Hesham Mostafa

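AI summary: This paper proposes DiRotQ, a W4A4 quantization framework that uses rotation-aware activation quantization to mitigate the quality degradation of diffusion transformers at 4-bit precision. It also introduces a VLM-as-a-Judge evaluation protocol and a custom Triton kernel for efficient inference under compression.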

AI中文摘要

扩散变换器(DiTs)在图像生成质量上达到最先进的水平,但在推理过程中带来显著的内存和计算成本。尽管激进的后训练量化(PTQ)到4位精度能带来显著的效率提升,但通常会导致严重的质量下降。现有方法,包括基于平滑的方法、混合精度方案、旋转技术以及低秩残差方法,部分缓解了这一问题,但仍与FP16/BF16性能存在明显差距。在本工作中,我们引入DiRotQ,一种W4A4 PTQ框架,通过旋转感知的激活量化来缓解这种降级。DiRotQ通过主成分分析(PCA)识别出捕捉主导激活方差的低秩子空间,在该子空间中保留系数以较高精度,同时将剩余组件量化为4位。在推理时,通过校准得出的正交变换将激活旋转到PCA基底中,而逆旋转被融合到层权重中,离线。结合基于GPTQ的权重量化,DiRotQ在PixArt-Σ数据集上实现了FID(更低越好)为15.9和PSNR(越高越好)为19.1 dB,优于先前最先进的SVDQuant(FID 18.9,PSNR 17.6)在同一INT W4A4设置下的表现。除了标准指标外,我们引入了VLM-as-a-Judge评估协议,这是该设置下的首次此类评估,提供了更全面的感知质量和提示对齐评估。在系统层面,我们实现了基于Triton的定制内核,以实现高效的端到端推理,将12B FLUX.1-dev模型的内存使用减少了2.1倍,并在24 GB RTX 4090 GPU上实现了2.3倍的加速。

英文摘要

Diffusion Transformers (DiTs) achieve state-of-the-art image generation quality but incur substantial memory and computational costs at inference. While aggressive Post-Training Quantization (PTQ) to 4-bit precision offers significant efficiency gains, it typically results in severe quality degradation. Existing approaches, including smoothing-based methods, mixed-precision schemes, rotation techniques, and low-rank residual methods, partially mitigate this issue but still leave a noticeable gap to FP16/BF16 performance. In this work, we introduce DiRotQ, a W4A4 PTQ framework that mitigates this degradation through rotation-aware activation quantization. DiRotQ identifies a low-rank subspace capturing dominant activation variance via Principal Component Analysis (PCA), preserving coefficients in this subspace at higher precision while quantizing the remaining components to 4-bit. Activations are rotated into the PCA basis at inference time using calibration-derived orthogonal transformations, while the inverse rotation is fused into the layer weights offline. Combined with GPTQ-based weight quantization, DiRotQ achieves an FID (lower is better) of 15.9 and PSNR (higher is better) of 19.1 dB on PixArt-Σ over the MJHQ-30K dataset, outperforming the prior state-of-the-art SVDQuant (FID 18.9, PSNR 17.6) under the same INT W4A4 setting. Beyond standard metrics, we introduce a VLM-as-a-Judge evaluation protocol for diffusion model quantization, the first such evaluation in this setting, providing a more holistic assessment of perceptual quality and prompt alignment under aggressive compression. On the systems side, we implement a Triton-based custom kernel to enable efficient end-to-end inference, reducing memory usage of the 12B FLUX.1-dev model by 2.1x and delivering 2.3x speedup over the BF16 baseline, on a 24 GB RTX 4090 GPU.
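
The rotate-then-quantize idea in this abstract can be illustrated with a toy two-dimensional sketch. This is not the authors' implementation: the closed-form 2-D PCA, the helper names (`pca_rotation_2d`, `fuse_inverse_rotation`, `rotated_forward`), the calibration data, and the quantization scale are all assumptions made for illustration. It only shows that fusing the inverse rotation into the weights leaves the layer output unchanged before quantization, and that the dominant coefficient can stay at full precision while the residual one is rounded to 4 bits.

```python
import math

def pca_rotation_2d(samples):
    """Closed-form PCA for 2-D data: returns R whose columns are the principal axes."""
    n = len(samples)
    mx = sum(s[0] for s in samples) / n
    my = sum(s[1] for s in samples) / n
    cxx = sum((s[0] - mx) ** 2 for s in samples) / n
    cyy = sum((s[1] - my) ** 2 for s in samples) / n
    cxy = sum((s[0] - mx) * (s[1] - my) for s in samples) / n
    theta = 0.5 * math.atan2(2.0 * cxy, cxx - cyy)  # angle of the dominant axis
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]  # column 0 = dominant direction, column 1 = residual

def matvec_row(x, M):
    """Row vector times matrix: returns x @ M."""
    return [sum(x[i] * M[i][j] for i in range(len(x))) for j in range(len(M[0]))]

def fuse_inverse_rotation(R, W):
    """Offline step: W_fused = R^T @ W, so (x @ R) @ W_fused equals x @ W."""
    return [[sum(R[i][j] * W[i][k] for i in range(len(R))) for k in range(len(W[0]))]
            for j in range(len(R[0]))]

def quantize_int4(value, scale):
    """Symmetric 4-bit quantization: an integer in [-8, 7] times `scale`."""
    return max(-8, min(7, round(value / scale))) * scale

def rotated_forward(x, R, W_fused, scale, keep=1):
    """Rotate, keep the first `keep` coefficients exact, quantize the rest to 4 bits."""
    z = matvec_row(x, R)
    z = [zj if j < keep else quantize_int4(zj, scale) for j, zj in enumerate(z)]
    return matvec_row(z, W_fused)

# Hypothetical calibration activations lying close to the direction (1, 1).
calib = [(t, t + (0.1 if t % 2 else -0.1)) for t in range(-5, 6)]
R = pca_rotation_2d(calib)
W = [[0.5, -1.0], [2.0, 0.25]]          # made-up layer weights
W_fused = fuse_inverse_rotation(R, W)
x = [3.0, 2.8]
print(matvec_row(x, W), rotated_forward(x, R, W_fused, scale=0.05))
```

In this toy setting the residual coefficient is small, so rounding it to the 4-bit grid perturbs the output only slightly; how the paper chooses the subspace rank and scales is not stated in the abstract.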

2605.16728 2026-05-19 cs.AI

Body-Grounded Perspective Formation and Conative Attunement in Artificial Agents

具身视角形成与意图同调在人工体中

Hongju Pae

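AI summary: This paper proposes a minimal architecture for body-grounded perspective formation in artificial agents. By introducing an interoceptive viability signal, a Fisher-style metric, and a conative alignment mechanism, it shows how learned bodily tendency is converted into stable body-directed behavior in a reward-free gridworld.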

AI中文摘要

本文提出了一种最小架构,用于人工体中的具身视角形成。在扩展先前工作的同时,该模型引入了内感受性活力信号,一种基于融合的外感受性和内感受性状态的Fisher式度量,以及将身体倾向与行动准备性联系起来的意图同调机制。在无奖励的网格世界中,意图将学习到的身体倾向转化为稳定的体定向行为,而身体到视角的路由允许身体扰动在视角潜在空间中留下可恢复的几何残差。本研究展示了如何通过具身组织世界如何呈现给代理的方式,在现象学意义上实现人工主体性的最小结构性条件的操作化。

英文摘要

This paper proposes a minimal architecture for body-grounded perspective formation in artificial agents. Extending prior work, the model introduces an interoceptive viability signal, a Fisher-style metric over fused exteroceptive-interoceptive states, and a conative alignment mechanism linking bodily tendency to action readiness. In a reward-free gridworld, conation converts learned bodily tendency into stable body-directed behavior, while body-to-perspective routing allows bodily perturbations to leave a recoverable geometric residue in the perspective latent. This study shows how minimal structural conditions for artificial subjectivity can be operationalized in the phenomenological sense, through the embodied organization of how a world is given to an agent.

2605.16727 2026-05-19 cs.AI

PopuLoRA: Co-Evolving LLM Populations for Reasoning Self-Play

PopuLoRA: 为推理自博弈的协同进化LLM种群

Roger Creus Castanyer, Geoffrey Bradway, Lorenz Wolf, Maxwill Lin, Augustine N. Mavor-Parker, Matthew James Sargent

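AI summary: This paper proposes PopuLoRA, a population-based asymmetric self-play framework for RLVR post-training of LLMs. Teachers and students are specialised LoRA adapters on a shared frozen base: teachers propose problems, students solve them under a programmatic verifier, and cross-evaluation between sub-populations replaces the self-calibration that limits single-agent self-play. A family of LoRA weight-space evolution operators serves as the replacement step of a population-based training loop at 7B scale, and the population enters a co-evolutionary arms race.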

AI中文摘要

我们介绍了PopuLoRA,一种基于种群的非对称自博弈框架,用于强化学习中可验证奖励(RLVR)的后训练LLM。教师和学生是专门的LoRA适配器,共享冻结基座:教师提出问题,匹配的学生在程序验证器下解决,亚种群间的交叉评估取代了限制单智能体自博弈的自我校准。一组LoRA权重空间进化算子(在几秒钟内产生同等级种群成员的突变和交叉)作为7B规模种群训练循环的替代步骤。我们将在Absolute Zero Reasoner上实现PopuLoRA,并将其与一个每适配器计算匹配的单智能体基线进行比较。当单智能体自我校准到可以可靠解决的问题时,种群进入协同进化竞赛:教师产生越来越复杂的问题,学生解决率波动,问题空间覆盖持续扩展。尽管训练时间奖励较低,种群均值在三个代码基准(HumanEval+, MBPP+, LiveCodeBench)和七个数学基准(AIME 24/25, AMC 23, MATH-500, Minerva, GSM8K, OlympiadBench)上均优于基线,并且种群中最弱的成员在汇总上也优于基线。

英文摘要

We introduce PopuLoRA, a population-based asymmetric self-play framework for reinforcement learning with verifiable rewards (RLVR) post-training of LLMs. Teachers and students are specialised LoRA adapters on a shared frozen base: teachers propose problems, matched students solve them under a programmatic verifier, and cross-evaluation between sub-populations replaces the self-calibration that limits single-agent self-play. A family of LoRA weight-space evolution operators (mutations and crossovers that produce same-rank population members in seconds) serves as the replacement step of a population-based training loop at 7B scale. We instantiate PopuLoRA on top of Absolute Zero Reasoner and compare it against a per-adapter compute-matched single-agent baseline. Where the single agent self-calibrates to generating easy problems it can reliably solve, the population enters a co-evolutionary arms race: teachers produce increasingly complex problems, student solve rates oscillate, and problem-space coverage keeps expanding throughout training. Despite lower training-time reward, the population mean outperforms the baseline on three code benchmarks (HumanEval+, MBPP+, LiveCodeBench) and seven math benchmarks (AIME 24/25, AMC 23, MATH-500, Minerva, GSM8K, OlympiadBench), and even the weakest member of the population beats the baseline on aggregate.
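
The abstract does not spell out its mutation and crossover operators, so the following is only one plausible sketch of a weight-space operator that keeps the LoRA rank fixed. The adapter layout (a dict with `"A"` holding r rows of length d_in and `"B"` holding r rows of length d_out), the uniform per-component crossover, and the Gaussian mutation are assumptions for illustration, not the paper's definitions.

```python
import random

def crossover(parent_a, parent_b, rng):
    """Uniform crossover over rank components: component i is the pair (A[i], B[i])."""
    r = len(parent_a["A"])
    if r != len(parent_b["A"]):
        raise ValueError("parents must share the same LoRA rank")
    child = {"A": [], "B": []}
    for i in range(r):
        src = parent_a if rng.random() < 0.5 else parent_b
        child["A"].append(list(src["A"][i]))  # copy so parents stay untouched
        child["B"].append(list(src["B"][i]))
    return child

def mutate(adapter, sigma, rng):
    """Gaussian perturbation of every entry; shapes and rank are unchanged."""
    return {name: [[v + rng.gauss(0.0, sigma) for v in row] for row in rows]
            for name, rows in adapter.items()}

def delta_weight(adapter):
    """Dense update sum_i outer(B[i], A[i]), returned as a d_out x d_in list of lists."""
    A, B = adapter["A"], adapter["B"]
    d_in, d_out = len(A[0]), len(B[0])
    return [[sum(B[i][o] * A[i][j] for i in range(len(A))) for j in range(d_in)]
            for o in range(d_out)]

rng = random.Random(0)
teacher_1 = {"A": [[1, 0, 0], [0, 1, 0]], "B": [[1, 1], [2, 2]]}  # toy rank-2 adapters
teacher_2 = {"A": [[0, 0, 1], [1, 1, 1]], "B": [[3, 0], [0, 3]]}
child = mutate(crossover(teacher_1, teacher_2, rng), sigma=0.01, rng=rng)
```

Because the child reuses whole rank components, it has the same rank as its parents and can be produced without any gradient step, which is consistent with the abstract's claim that new members appear in seconds.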

2605.16726 2026-05-19 cs.AI

A Global-Local Graph Attention Network for Traffic Forecasting

面向交通预测的全局-局部图注意力网络

Tianchi Zhang

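AI summary: This paper proposes a Global-Local Graph Attention Network (GLGAT) with pairwise encoding and an event-based adjacency matrix. It addresses the difficulty that graph convolutional and graph attention networks have with heterogeneous vertices, captures spatio-temporal correlations effectively, and is competitive in traffic forecasting.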

AI中文摘要

交通预测是智能交通系统的重要组成部分。交通预测中的关键挑战之一是发现时空相关性。近年来,图卷积网络和图注意力网络已取代传统统计模型来预测未来交通。然而,这两种方法都难以让顶点具有非常不同的特性。为了解决这个问题,我们提出了具有成对编码和基于事件的邻接矩阵的全局-局部图注意力网络(GLGAT)。GLGAT允许顶点拥有针对整个图的全局注意力矩阵集,并为每个顶点分配局部注意力矩阵集。在两个真实世界交通数据集上的实验表明,GLGAT能够有效捕捉时空相关性,并在与其他最先进的基线模型相比时表现出竞争力。

英文摘要

Traffic forecasting is a significant part of intelligent transportation systems. One of the critical challenges of traffic forecasting is to find spatio-temporal correlations. In recent years, graph convolutional networks and graph attention networks have replaced traditional statistical models to predict future traffic. However, both struggle to accommodate vertices with very different characteristics. To address this, we propose the Global-Local Graph Attention Network (GLGAT) with pairwise encoding and the event-based adjacency matrix. The GLGAT allows vertices to have a global attention matrix set for the whole graph and assigns local attention matrix sets to each vertex. Experiments on two real-world traffic datasets show that GLGAT can effectively capture spatio-temporal correlations and has competitive performance against other state-of-the-art baselines.

2605.16725 2026-05-19 cs.AI

Baba in Wonderland: Online Self-Supervised Dynamics Discovery for Executable World Models

《白兔在奇幻世界:在线自监督动态发现用于可执行世界模型》

SeungWon Seo, DongHeun Han, SeongRae Noh, HyeongYeop Kang

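AI summary: This work studies how to learn executable world models from interaction evidence alone under prior misalignment. It introduces Alice, a closed-loop system that treats failed candidate updates as structural signal to discover and refine dynamics, improving executable world-model learning.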

AI中文摘要

可执行世界模型可以被阅读、编辑、执行和重用以进行规划,但前提是程序捕获了环境的转换定律,而非其表面词汇的语义捷径。我们研究了在先验错配情况下在线可执行世界模型学习的问题,其中智能体必须从交互证据中诱导状态依赖的动态,而无需规则描述、奖励信号或可信的词汇先验。我们引入了Alice,一个闭环系统,将失败的候选更新视为结构信号:当候选解释新的转换但失去之前解释的转换时,保存冲突揭示了当前程序所混淆的动态。Alice将这些冲突细化为假设类别,这些类别既提供了紧凑的、分层的保存反例以指导更新,又引导前沿探索向新颖且在当前程序下代表性不足的转换。我们在《白兔在奇幻世界》上评估了Alice,这是《白兔在你》的一个先验错配变种,它保留了模拟动态,同时将语义重要的规则属性标签替换为无关词汇。实验表明,Alice在先验错配情况下显著提升了可执行世界模型的学习效果,消融实验显示,类别细化和类别感知探索均有所贡献。

英文摘要

Executable world models can be read, edited, executed, and reused for planning, but only if the program captures the environment's transition law rather than semantic shortcuts in its surface vocabulary. We study online executable world-model learning under prior misalignment, where an agent must induce state-dependent dynamics from interaction evidence alone, without rule descriptions, reward signals, or trustworthy lexical priors. We introduce Alice, a closed-loop system that treats failed candidate updates as structural signal: when a candidate explains a new transition but loses previously explained ones, the preservation conflict reveals dynamics that the current program had conflated. Alice refines these conflicts into hypothesis classes that both provide compact, class-stratified preservation counterexamples for update and guide frontier exploration toward transitions that are novel and underrepresented with respect to the current program. We evaluate Alice on Baba in Wonderland, a prior-misaligned variant of Baba Is You that preserves simulator dynamics while replacing semantically meaningful rule-property labels with unrelated words. Experiments show that Alice substantially improves executable world-model learning under prior misalignment, and ablations show that both class refinement and class-aware exploration contribute.

2605.16720 2026-05-19 cs.CV cs.LG

Compositional Adversarial Training for Robust Visual Watermarking

组合对抗训练用于鲁棒的视觉水印

Anirudh Satheesh, Michael-Andrei Panaitescu-Liess, Andrew Xu, Georgios Milis, Heng Huang, Zikui Cai, Furong Huang

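AI summary: This paper proposes Compositional Adversarial Training (CAT), a framework that formulates watermark robustness as a min-max problem over a structured space of compositional transformations. Experiments show that it outperforms random-augmentation baselines across multiple attack settings.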

AI中文摘要

鲁棒水印通常使用随机后处理增强进行训练,但随机采样无法覆盖真实攻击管道的组合空间,难以遇到真正破坏检测的稀有组合。这导致训练不稳定且样本效率低。我们将其水印鲁棒性建模为结构化组合转换空间上的min-max问题。我们提出组合对抗训练(CAT),一种插件框架,学习一个顺序可微的对抗者,观察当前水印图像并在每一步选择攻击家族以最大程度干扰信息恢复。CAT结合了直通Gumbel-Softmax攻击选择与熵正则化,使反向传播可端到端微分并聚合攻击家族的梯度信息,从而实现更快、更平滑的收敛,而不陷入单一攻击模式。我们评估CAT在生成后水印VideoSeal 0.0、VideoSeal 1.0和PixelSeal以及在生成WMAR下的单步和双步攻击套件,以及在分布内和多分布图像和视频基准测试中。CAT在单步攻击设置中将水印容量提高最高63.5%,在组合设置中提高13.0%;在自回归设置中,CAT在困难几何变换上将TPR@FPR=1%平均提高12%。这些结果表明,鲁棒视觉水印受益于对抗适应组合对抗者而非独立随机破坏。

英文摘要

Robust watermarking is typically trained with random post-processing augmentation, but random sampling under-covers the combinatorial space of realistic attack pipelines and rarely encounters the rare compositions that actually break detection. This leads to unstable training and poor sample efficiency. We instead formulate watermark robustness as a min-max problem over a structured space of compositional transformations. We propose Compositional Adversarial Training (CAT), a plug-in framework that learns a sequential differentiable adversary that observes the current watermarked image and selects an attack family at each step to maximally disrupt message recovery. CAT combines a straight-through Gumbel-Softmax attack selection with entropy regularization, allowing the backward pass to be end-to-end differentiable and aggregate gradient information across attack families, yielding faster, smoother convergence without collapsing to a single attack mode. We evaluate CAT on post-generation watermarks VideoSeal 0.0, VideoSeal 1.0, and PixelSeal and in-generation WMAR under both single-step and two-step attack suites, on in-distribution and multiple out-of-distribution image and video benchmarks. CAT consistently outperforms random-augmentation baselines trained with the same augmentation budget, with the largest gains on hard composed attacks and OOD evaluations; improving overall watermark capacity by up to $63.5\%$ in the single-step attack setting and $13.0\%$ in the compositional setting. In the autoregressive setting, CAT improves the TPR@FPR$=1\%$ by $12\%$ on average on difficult geometric transformations. These results show that robust visual watermarking benefits from training against adaptive compositional adversaries rather than independent random corruptions.

2605.16714 2026-05-19 cs.AI cs.CR

GRID: Graph Representation of Intelligence Data for Security Text Knowledge Graph Construction

GRID:用于安全文本知识图谱构建的智能数据图表示

Liangyi Huang, Zichen Liu, Fei Shao, Shang Ma, Mengshi Zhang, Zihao Chen, Yanfang Ye, Xusheng Xiao

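AI summary: This paper proposes the GRID framework, which builds traceable article-graph alignments and turns document-to-graph learning into a scripted task bank, improving the stability and efficiency of security knowledge graph construction.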

AI中文摘要

安全知识图谱可以为安全代理提供可计算的外部记忆,但从长篇网络威胁情报(CTI)中构建仍具有挑战性:LLMs通常缺乏领域知识,端到端文档到图训练难以用低成本、稳定的奖励监督。我们提出了GRID(智能数据图表示),一种用于安全文本知识图谱构建的端到端框架。GRID首先通过图提取和知识图引导的文本修订,从CTI文章中构建安全领域监督。然后将文档到图学习转化为结合四选项多选问题和三级正则表达式匹配目标的剧本任务库,产生比反复评分完整图输出的LLM判断器更稳定的任务特定奖励。使用这种监督流程,我们训练了两个基于Qwen3-4B-Instruct-2507的4B提取器:一个任务库奖励模型和一个端到端奖励模型。在249篇CTI文章上,任务库奖励模型在具有本体引导的GRID提取流程下达到84.62%的源平均精度、64.91%的源平均召回率和68.53%的平均F1分数,实现了最佳源平均召回率和接近顶级平均F1分数,同时具有更低的token使用和部署成本。端到端奖励模型达到76.91%的精度、53.85%的召回率和58.06%的平均F1分数。进一步分析显示,任务库奖励可以一次离线构建并在后续训练运行中重复使用,优于在线端到端LLM作为判断器奖励和较弱的替代方案,如仅选择奖励和无RL的端到端SFT。

英文摘要

Security knowledge graphs can provide computable external memory for security agents, but constructing them from long-form cyber threat intelligence (CTI) remains difficult: LLMs often lack grounded security-domain knowledge, and end-to-end document-to-graph training is hard to supervise with cheap, stable rewards. We present GRID (Graph Representation of Intelligence Data), an end-to-end framework for security text knowledge graph construction. GRID first builds security-domain supervision from CTI articles by creating traceable article-graph alignments through graph extraction and knowledge-graph-conditioned text revision. It then turns document-to-graph learning into a scripted task bank combining four-option multi-select questions with triple-level regex matching targets, yielding more stable task-specific rewards than repeatedly scoring full graph outputs with an LLM judge. Using this supervision pipeline, we train two Qwen3-4B-Instruct-2507-based 4B extractors: a primary Task-bank Reward model and a secondary End2End Reward model with LLM-as-judge precision/recall rewards. On 249 CTI articles from GRID, CASIE, CTINexus, MalKG, and SecureNLP, the Task-bank Reward model with the ontology-guided GRID extraction pipeline reaches 84.62% source-averaged precision, 64.91% source-averaged recall, and 68.53% Avg F1, achieving the best source-averaged recall and near-top Avg F1 with lower token usage and deployment cost. The End2End Reward model reaches 76.91% precision, 53.85% recall, and 58.06% Avg F1. Further analyses show that task-bank rewards can be built once offline and reused across later post-training runs, outperforming online End2End LLM-as-judge reward and weaker alternatives such as Choice-only Reward and End2End SFT without RL.
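
To make the scripted task-bank idea concrete, here is a minimal sketch of how triple-level regex targets and multi-select answers could be scored without an LLM judge. The function names, the `subject | relation | object` serialization, the scoring rules, and the example entities are all hypothetical; the abstract does not specify GRID's actual reward formulas.

```python
import re

def triple_regex_reward(predicted_triples, target_patterns):
    """Fraction of target regex patterns matched by at least one predicted triple."""
    if not target_patterns:
        return 0.0
    # Serialize each (subject, relation, object) triple into one searchable line.
    lines = [" | ".join(triple) for triple in predicted_triples]
    hits = sum(
        any(re.search(pattern, line, flags=re.IGNORECASE) for line in lines)
        for pattern in target_patterns
    )
    return hits / len(target_patterns)

def multi_select_reward(chosen, correct):
    """1.0 only when the chosen option set equals the correct set (options A to D)."""
    return 1.0 if set(chosen) == set(correct) else 0.0

# Made-up example: placeholder names, not data from the paper.
predicted = [("ExampleGroup", "uses", "ExampleMalware")]
patterns = [
    r"examplegroup \| uses \| example\s*malware",
    r"examplegroup \| exploits \| CVE-\d{4}-\d+",
]
reward = triple_regex_reward(predicted, patterns)
```

Both scores are deterministic functions of the model output, which is the property the abstract credits for more stable rewards than repeated LLM-judge scoring.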

2605.16708 2026-05-19 cs.LG stat.ML

Isolating Nonlinear Independent Sources in fMRI with $β$-TCVAE Models

利用β-TCVAE模型在fMRI中分离非线性独立源

Qiang Li, Shujian Yu, Jesus Malo, Jingyu Liu, Tülay Adali, Vince D. Calhoun

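AI summary: This paper applies the β-TCVAE model to nonlinear fMRI data, separating mixed spatial and temporal brain signals, recovering biologically meaningful nonlinear spatial components, and checking the interpretability of the latent structure through functional network connectivity.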

Comments 6 pages, 2 figures

AI中文摘要

从非线性fMRI数据中学习有意义的潜在表示仍然是神经影像分析中的基本挑战。传统独立成分分析(ICA)因其能估计可解释的功能脑网络而被广泛使用,但其依赖于线性混合假设,限制了其捕捉大脑动态内在非线性和复杂组织的能力。近年来,深度表示学习方法作为非线性潜在结构建模的有希望替代方案出现。然而,许多方法主要在模拟数据集或自然图像基准上评估,对真实世界神经影像数据如fMRI的验证相对有限。本文受β-TCVAE(总相关变分自编码器)的启发,这是β-VAE框架的改进,用于学习潜在表示而不引入额外超参数。我们调整并修改该模型以适应fMRI数据,旨在分离混合的空间和时间脑信号为可解释的成分。我们证明β-TCVAE框架可以恢复具有生物学意义的非线性空间成分,包括已建立的内在连接网络如默认模式网络。此外,我们通过功能网络连接性评估学习的表示,显示潜在结构捕捉了连贯且可解释的大脑组织模式。本研究提供了一项将非线性表示学习与fMRI分析连接的初步调查。

英文摘要

Learning meaningful latent representations from nonlinear fMRI data remains a fundamental challenge in neuroimaging analysis. Traditional independent component analysis, widely used due to its ability to estimate interpretable functional brain networks, relies on a linear mixing assumption for latent sources, limiting its ability to capture the inherently nonlinear and complex organization of brain dynamics. More recently, deep representation learning methods have emerged as promising alternatives for modeling nonlinear latent structure. However, many of these approaches have been evaluated primarily on simulated datasets or natural image benchmarks, with comparatively limited validation on real-world neuroimaging data such as fMRI. In this work, we are motivated by the $β$-TCVAE (Total Correlation Variational Autoencoder), a refinement of the $β$-VAE framework for learning latent representations without introducing additional hyperparameters during training. We adapt and modify this model to fMRI data for nonlinear source disentanglement, aiming to separate mixed spatial and temporal brain signals into interpretable components. We show that the $β$-TCVAE framework can recover meaningful nonlinear spatial components with biological relevance, including well-established intrinsic connectivity networks such as the default mode network. Furthermore, we evaluate the learned representations using functional network connectivity, showing that the latent structure captures coherent and interpretable brain organization patterns. This study provides a pilot investigation that bridges nonlinear representation learning and fMRI analysis.
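
For context, the β-TCVAE objective as it is usually stated in the disentanglement literature is sketched below. This is the generic formulation, not the fMRI-specific adaptation, which the abstract does not detail. The symbols are assumptions of this sketch: n indexes data points, q(z) is the aggregated posterior, and z_j is the j-th latent dimension.

```latex
% Decomposition of the averaged KL term of the ELBO
\mathbb{E}_{p(n)}\big[\mathrm{KL}\big(q(z \mid n)\,\|\,p(z)\big)\big]
  = \underbrace{I_q(z; n)}_{\text{index-code mutual information}}
  + \underbrace{\mathrm{KL}\Big(q(z)\,\Big\|\,\prod_j q(z_j)\Big)}_{\text{total correlation}}
  + \underbrace{\sum_j \mathrm{KL}\big(q(z_j)\,\|\,p(z_j)\big)}_{\text{dimension-wise KL}}

% beta-TCVAE objective: only the total-correlation term is reweighted
\mathcal{L}_{\beta\text{-TC}}
  = \mathbb{E}_{p(n)\,q(z \mid n)}\big[\log p(n \mid z)\big]
  - I_q(z; n)
  - \beta\,\mathrm{KL}\Big(q(z)\,\Big\|\,\prod_j q(z_j)\Big)
  - \sum_j \mathrm{KL}\big(q(z_j)\,\|\,p(z_j)\big)
```

Penalizing only the total correlation pushes the latent dimensions toward statistical independence, which is the property that makes the model a nonlinear counterpart to independent component analysis.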

2605.16704 2026-05-19 cs.LG

Convex Dataset Valuation for Post-Training

训练后凸集估值

Siqi Zeng, Christopher Jung, Rui Li, Zhe Kang, Ming Li, Nima Noorshams, Zhigang Wang, Fuchun Peng, Han Zhao, Xue Feng

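AI summary: This paper studies dataset valuation for selecting auxiliary datasets during LLM post-training. It proposes a convex valuation method based on kernel mean matching that accounts for redundancy among datasets, and experiments show that it outperforms existing valuation baselines at low computational overhead.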

Comments Published as a conference paper at ICML '26. 30 pages, 8 figures

AI中文摘要

改进大语言模型在下游任务上的性能有时需要在训练后利用辅助数据集。然而,开发者在计算、标注和许可成本上面临限制,无法使用所有可用数据,需要有原则的数据集层面选择。这些限制日益受到数据集市场的影响,其中数据获取由预算和谈判决定。我们研究了数据集估值作为训练后大语言模型中的子集选择问题。我们的目标是识别并加权辅助数据集,以在受限制的预算下最大化目标任务性能。我们首先表明,常用梯度对齐分数提供了一个合理但不完整的估值信号,因为它们忽略了数据集间的冗余。为了解决这个问题,我们提出了一种基于梯度空间中核均值匹配(KMM)的可扩展凸数据集估值方法,该方法同时考虑了与目标任务的对齐和辅助数据集间的冗余。通过在多样化的训练后设置和任务中进行广泛实验,我们证明了我们的方法在低计算开销下一致优于现有估值基线,实现了更强的性能。我们的结果将数据集估值定位为一种实用的决策工具,用于受市场限制的大语言模型训练后数据选择。代码可在https://github.com/uiuctml/convex_data_valuation获取。

英文摘要

Improving LLM performance on downstream tasks sometimes requires leveraging auxiliary datasets during post-training. In practice, however, developers face constraints on compute, labeling, and licensing costs that preclude using all available data, necessitating principled dataset-level selection. These constraints are increasingly shaped by dataset marketplaces, where data acquisition is governed by budgets and negotiation. We study dataset valuation as a subset selection problem during LLM post-training. Our goal is to identify and weight auxiliary datasets so as to maximize target task performance given constrained budgets. We first show that commonly used gradient alignment scores provide a reasonable yet incomplete valuation signal, as they ignore redundancy among datasets. To address this, we propose a scalable convex dataset-level valuation method based on kernel mean matching (KMM) in gradient space, which jointly accounts for alignment with the target task and redundancy across auxiliary datasets. Through extensive experiments across diverse post-training settings and tasks, we show that our approach consistently outperforms existing valuation baselines, achieving stronger performance with low computational overhead. Our results position dataset valuation as a practical decision tool for post-training data selection in market-constrained large language model settings. The code is available at https://github.com/uiuctml/convex_data_valuation.
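
A minimal sketch of kernel mean matching in gradient space follows, under assumptions the abstract does not confirm: a linear kernel, one mean gradient vector per dataset, weights constrained to the probability simplex, and plain projected gradient descent as the solver. The function names and toy gradients are hypothetical. The objective is the convex quadratic w^T K w - 2 k^T w, where K holds inner products between auxiliary gradients (redundancy) and k holds inner products with the target gradient (alignment).

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def project_to_simplex(v):
    """Euclidean projection onto {w : w >= 0, sum(w) = 1} (sort-based method)."""
    u = sorted(v, reverse=True)
    cumulative, theta = 0.0, 0.0
    for j, uj in enumerate(u, start=1):
        cumulative += uj
        t = (cumulative - 1.0) / j
        if uj - t > 0:
            theta = t  # the last index that passes the test fixes the shift
    return [max(x - theta, 0.0) for x in v]

def kmm_weights(aux_grads, target_grad, steps=2000):
    """Minimize ||sum_i w_i g_i - g_target||^2 over the simplex by projected gradient."""
    n = len(aux_grads)
    K = [[dot(gi, gj) for gj in aux_grads] for gi in aux_grads]
    k = [dot(gi, target_grad) for gi in aux_grads]
    # Step size 1/L, with L bounded by twice the largest absolute row sum of K.
    lr = 1.0 / (2.0 * max(sum(abs(x) for x in row) for row in K) + 1e-12)
    w = [1.0 / n] * n
    for _ in range(steps):
        grad = [2.0 * (dot(K[i], w) - k[i]) for i in range(n)]
        w = project_to_simplex([wi - lr * gi for wi, gi in zip(w, grad)])
    return w

# Toy gradients: datasets 0 and 1 are redundant, dataset 3 opposes the target.
aux = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
target = [0.5, 0.5]
weights = kmm_weights(aux, target)
```

In this toy case the two identical datasets share weight instead of each receiving a full alignment score, and the opposing dataset is driven to zero, which is the redundancy effect that pure gradient-alignment scores miss.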

2605.16699 2026-05-19 cs.LG q-fin.RM stat.ML

Your SaaS Is an Insurance Product: A Modeling Framework

你的 SaaS 是一种保险产品:一种建模框架

Caio Gomes

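AI summary: This paper argues that capped-usage SaaS products share the structure of insurance products and proposes a modeling framework for SaaS pricing built from frequency-severity decomposition, premium calculation principles, and Monte Carlo reserve adequacy.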

Comments 23 pages, 2 figures, 7 tables. Companion code archived at DOI 10.5281/zenodo.20213155

AI中文摘要

Capped-usage SaaS 产品——如 Claude Code 和 ChatGPT 等大语言模型订阅、Vercel 和 Cloudflare Workers 等云平台、企业福利平台、具有责任转移的身份验证服务——与保险产品有相同的结构性特征:固定保费与实际消费解耦、用户层面的随机需求具有厚尾严重性、非同质的上限在固定时间表重置、以及需要在尾部风险下具备充足储备的组合层面暴露。我们主张这不是类比,而是 actuarial science 已经几十年来试图解决的问题,用新的依赖变量(如 tokens、带宽字节、函数调用、健身房打卡)替代医疗索赔。本文提出一个基于频率-严重性分解、保费计算原理和蒙特卡洛储备充足性的建模框架,将其映射到两个领域(LLM 服务和云平台)的公开可观察的订阅层级,基于经典的健康保险经济学(Arrow 1963; Pauly 1968; Manning 等 1987; Brot-Goldberg 等 2017),并通过一个工作示例展示与传统单位经济的差异。贡献是操作性的而非理论性的:不是新的定理,而是目前缺失于 cs.LG/stat.ML 实践中的词汇和工具。

英文摘要

Capped-usage SaaS products -- LLM subscriptions such as Claude Code and ChatGPT, cloud platforms such as Vercel and Cloudflare Workers, corporate benefit platforms, identity-verification services with liability transfer -- share a structural signature with insurance products: a fixed premium decoupled from realized consumption, stochastic per-user demand with heavy-tailed severity, a non-fungible cap that resets on a fixed schedule, and a portfolio-level exposure that requires reserve adequacy under tail risk. We argue that this is not an analogy. It is the same operational problem actuarial science has been tooled for decades to address, restated with new dependent variables (tokens, bandwidth bytes, function-invocations, gym check-ins) in place of medical claims. This paper proposes a modeling framework for capped-usage SaaS pricing built from frequency-severity decomposition, premium calculation principles, and Monte Carlo reserve adequacy. We map the framework to publicly observable subscription tiers in two domains (LLM services and cloud platforms), ground it in canonical health-insurance economics (Arrow 1963; Pauly 1968; Manning et al. 1987; Brot-Goldberg et al. 2017), and demonstrate divergence from traditional unit economics through a worked example. The contribution is operational rather than theoretical: not a new theorem, but vocabulary and tools currently absent from cs.LG/stat.ML practice.
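
The three ingredients named in the abstract can be combined in a short simulation. This is a sketch under stated assumptions rather than the paper's companion code: Poisson usage frequency, lognormal per-event severity, a hard per-user cap, the expected value premium principle, and a 99.5% quantile as the adequacy target are all choices made here for illustration, and every parameter value is made up.

```python
import math
import random

def poisson(lam, rng):
    """Knuth's multiplication method for Poisson sampling; adequate for small lam."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def user_cost(lam, mu, sigma, cap, rng):
    """Frequency-severity draw for one user and one billing period, clipped at the cap."""
    total = sum(rng.lognormvariate(mu, sigma) for _ in range(poisson(lam, rng)))
    return min(total, cap)

def simulate_portfolio(n_users, n_sims, lam, mu, sigma, cap, seed=0):
    """Sorted list of simulated total portfolio costs, one entry per simulated period."""
    rng = random.Random(seed)
    return sorted(
        sum(user_cost(lam, mu, sigma, cap, rng) for _ in range(n_users))
        for _ in range(n_sims)
    )

def expected_value_premium(losses, n_users, loading):
    """Expected value principle: (1 + loading) times the mean cost per user."""
    return (1.0 + loading) * (sum(losses) / len(losses)) / n_users

def required_reserve(losses, n_users, premium, level=0.995):
    """Extra capital so that premiums plus reserve cover the `level` loss quantile."""
    idx = min(len(losses) - 1, math.ceil(level * len(losses)) - 1)
    return max(0.0, losses[idx] - premium * n_users)

losses = simulate_portfolio(n_users=50, n_sims=200, lam=3.0, mu=0.0, sigma=1.0, cap=20.0, seed=1)
premium = expected_value_premium(losses, n_users=50, loading=0.1)
reserve = required_reserve(losses, n_users=50, premium=premium)
```

The cap plays the role of a policy limit: it bounds each user's cost, so tail risk at the portfolio level comes from how many users approach the cap in the same period rather than from a single unbounded claim.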

2605.16696 2026-05-19 cs.CV

Face inpainting with Identity Preserving Latent Diffusion Models

基于身份保持的潜在扩散模型的面部修复

João Santos, Carlos Santiago, Manuel Marques

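AI summary: This paper proposes ID-ControlNet, which uses latent diffusion models for face inpainting and preserves identity through identity embeddings. Experiments on CelebA-HQ and other datasets show that it improves identity preservation over standard diffusion-based inpainting and is comparable to state-of-the-art identity-aware approaches.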

AI中文摘要

面部修复技术能够以视觉逼真方式恢复缺失或遮挡的面部区域,但保持最终输出的身份仍是一个基本挑战。身份一致性对于下游应用如人脸识别、数字取证和人机交互至关重要,其中细微的身份扭曲可能显著降低性能或信任度。尽管扩散基生成模型在图像修复中取得了显著进展,但它们通常难以忠实保留个体特定的面部特征。另一方面,现有身份感知方法通常依赖于昂贵的微调、辅助监督或对多样遮挡、姿态和面部变化的鲁棒性有限。为了解决这些限制,我们提出ID-ControlNet,一种基于潜在扩散模型的身份保持面部修复框架。基于ControlNet架构,我们的方法将扩散过程条件化为从预训练的人脸识别网络中提取的面部身份嵌入。这种设计使能够重建遮挡的面部区域,同时保持全局面部一致性和身份保真度。此外,我们引入了身份一致性和三元组损失训练策略,以显式地强制生成的面部与目标身份表示之间的对齐。在CelebA-HQ、FFHQ和新的E-Mask数据集上的大量实验表明,ID-ControlNet在身份保持方面显著优于标准扩散基修复方法,实现了与最先进身份感知方法相当的性能。

英文摘要

Face inpainting techniques recover missing or occluded facial regions in a visually realistic manner, but preserving the identity in the final output remains a fundamental challenge. Identity consistency is crucial for downstream applications such as face recognition, digital forensics, and human-computer interaction, where even subtle identity distortions can significantly degrade performance or trust. Although diffusion-based generative models have recently achieved remarkable progress in image inpainting, they often struggle to faithfully retain individual-specific facial characteristics. On the other hand, existing identity-aware methods typically rely on costly fine-tuning, auxiliary supervision, or exhibit limited robustness to diverse occlusions, poses, and facial variations. To address these limitations, we propose ID-ControlNet, an identity-preserving face inpainting framework built upon latent diffusion models. Based on the ControlNet architecture, our approach conditions the diffusion process on facial identity embeddings extracted from a pretrained face recognition network. This design enables reconstruction of occluded facial regions while maintaining global facial coherence and identity fidelity. Furthermore, we introduce an identity consistency and triplet loss training strategy that explicitly enforces alignment between the generated face and the target identity representation. Extensive experiments on CelebA-HQ, FFHQ, and a new E-Mask dataset demonstrate that ID-ControlNet significantly improves identity preservation over standard diffusion-based inpainting methods, achieving performance comparable to state-of-the-art identity-aware approaches.

2605.16690 2026-05-19 cs.LG

UB-SMoE: Universally Balanced Sparse Mixture-of-Experts for Resource-adaptive Federated Fine-tuning of Foundation Models

UB-SMoE:面向资源自适应联邦微调的通用平衡稀疏专家混合模型

Van-Tuan Tran, Hong-Hanh Nguyen-Le, Marco Ruffini, Merim Dzaferagic

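AI summary: This paper proposes UB-SMoE, which uses Dynamic Modulated Routing and a Universal Pseudo-Gradient to address expert utilization imbalance and the non-differentiability of Top-K routing in heterogeneous federated learning, reducing computation and improving performance for low-resource clients.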

Comments ICML 2026

AI中文摘要

异构LoRA-rank方法通过根据计算能力分配客户端特定的秩来解决联邦微调基础模型中的系统异质性问题。然而,这些方法仅实现有限的计算节省,因为密集的前馈计算占主导地位。稀疏专家混合(SMoE)通过条件计算提供有前途的替代方案,但我们发现其在异构联邦设置中的直接应用引入了两个关键不一致:(i)专家利用率不平衡和(ii)Top-K路由的非可微性。我们的收敛分析表明,这些不一致导致了收敛性下降,特别是对资源受限客户端。为了解决这些挑战,我们提出了通用平衡稀疏专家混合(UB-SMoE),它引入了动态调节路由(DMR)来重新平衡专家利用率,并引入通用伪梯度(PG)来重建未激活专家的学习信号。这些机制形成一个自我强化的循环,使专家在异构客户端中保持活力。在基准测试中,UB-SMoE在低资源客户端上实现了高达45.0%的计算节省,同时相比现有异构LoRA-rank方法,其性能提高了8.7倍。

英文摘要

Heterogeneous LoRA-rank methods address system heterogeneity in federated fine-tuning of foundation models by assigning client-specific ranks based on computational capabilities. However, these methods achieve only marginal computational savings, as dense feed-forward computations dominate. Sparse Mixture-of-Experts (SMoE) provides a promising alternative through conditional computation, yet we identify that its naive application to heterogeneous federated settings introduces two critical discordances: (i) expert utilization imbalance and (ii) non-differentiability of Top-K routing. Our convergence analysis demonstrates that these discordances lead to degraded convergence, particularly for resource-constrained clients. To address these challenges, we propose Universally Balanced Sparse Mixture-of-Experts (UB-SMoE), which introduces Dynamic Modulated Routing (DMR) to rebalance expert utilization, and Universal Pseudo-Gradient (PG) to reconstruct learning signals for non-activated experts. These mechanisms form a self-reinforcing cycle that maintains expert viability across heterogeneous clients. Experiments on benchmarks show that UB-SMoE achieves up to $45.0\%$ computational reduction on low-resource clients while improving their performance by $8.7 \times$ compared to existing heterogeneous LoRA-rank methods.

2605.16686 2026-05-19 cs.LG

Scalable Knowledge Editing for Mixture-of-Experts LLMs via Tensor-Structured Updates

基于张量结构更新的混合专家LLM可扩展知识编辑

Roman Maksimov, Vladimir Aletov, Dmitry Bylinkin, Daniil Medyakov, Vladimir Solodkin, Aleksandr Beznosikov

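AI summary: This paper proposes a knowledge editing method for Mixture-of-Experts LLMs that exploits the tensor structure of MoE layers and the Woodbury matrix identity for efficient parameter updates, accelerating editing by up to 6x and extending knowledge editing beyond dense layers.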

Comments 17 pages, 3 architectures, 1 figure, 6 tables

AI中文摘要

知识编辑(KE)为LLM提供了一种轻量级替代方案,避免重复微调。然而,现有KE方法多针对密集前馈层,而现代LLM越来越多采用混合专家(MoE)架构以提升内存效率和推理效率。本文提出MEMIT-like框架,利用MoE层的张量结构,在专家层面准确制定编辑目标,并通过Woodbury矩阵恒等式避免显式计算专家权重的全堆叠矩阵。所获更新仅需固定低秩矩阵的逆运算,无需额外反向传播。实验表明,该方法在主要KE指标上与强基线持平,但编辑过程加速达6倍,得益于批量MEMIT式公式和Woodbury恒等式带来的低维逆运算。这些结果表明,封闭形式的参数修改KE可有效扩展至密集层之外,为现代稀疏LLM架构的可扩展知识编辑开辟了新路径。

英文摘要

Knowledge editing (KE) provides a lightweight alternative to repeated fine-tuning of LLMs. However, most existing KE methods target dense feed-forward layers, while modern LLMs increasingly adopt Mixture-of-Experts (MoE) architectures for their superior memory footprint and inference efficiency. This mismatch leaves a growing class of production models without principled editing tools. We propose a MEMIT-like framework for knowledge editing in MoE-based LLMs. Our method exploits the tensor structure of MoE layers to formulate the editing objective faithfully at the per-expert level, and applies the Woodbury matrix identity to avoid materializing or inverting the full stacked matrix of expert weights. The resulting update reduces to inversions of fixed low-rank matrices and requires no additional backward passes. Empirically, our approach matches the editing quality of strong baselines on the main KE metrics while accelerating the editing procedure by up to 6x, owing to the batched MEMIT-style formulation and the low-dimensional inversions enabled by the Woodbury identity. These results show that closed-form, parameter-modifying KE can be extended efficiently beyond dense layers, opening a path toward scalable knowledge editing in modern sparse LLM architectures.
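
For reference, the Woodbury matrix identity mentioned in the abstract is stated below, followed by the low-rank special case that explains why only a small matrix needs inverting. How the paper maps its MoE editing objective onto these terms is not given in the abstract, so the symbols here are generic assumptions: A is an invertible d x d matrix whose inverse is cheap or precomputed, and K is a d x n matrix with n much smaller than d.

```latex
% General Woodbury identity
(A + U C V)^{-1} = A^{-1} - A^{-1} U \left( C^{-1} + V A^{-1} U \right)^{-1} V A^{-1}

% Low-rank special case: U = K, C = I_n, V = K^{\top}
(A + K K^{\top})^{-1}
  = A^{-1} - A^{-1} K \left( I_n + K^{\top} A^{-1} K \right)^{-1} K^{\top} A^{-1}
```

The only new inverse on the right-hand side is n x n, so the cost scales with the number of edited keys rather than with the full weight dimension.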

2605.16682 2026-05-19 cs.LG

Identify Then Project: Contrastive Learning of Latent Dynamics from Partial Observations with Port-Hamiltonian Structure

识别后再投影:从部分观测中利用端-哈密顿结构进行对比学习

Peilun Li, Kaiyuan Tan, Daniel Moyer, Thomas Beckers

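AI summary: This paper proposes a two-stage framework that learns latent dynamics from partial observations via contrastive learning and then projects them onto a port-Hamiltonian submanifold to obtain a physically consistent realization.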

AI中文摘要

在直接建模不可行的情况下,识别隐状态表示和动态至关重要,尤其是在部分和高维观测下。我们研究了隐式端-哈密顿系统,这是一种包含守恒和耗散动态的结构化类别。我们提出了一种两阶段识别-再投影框架。首先,对比教师从部分观测中学习连续时间隐动态。然后,学生将识别的教师表示和动态投影到端-哈密顿子流形上,通过学习的仿射图表,得到物理一致的实现。作为概念反事实,我们还考虑了单阶段变体,联合学习隐识别和端-哈密顿结构,但发现其可靠性较低,从而提出所提出的两阶段教师-学生框架。我们理论上证明仿射投影是连接对比隐识别的仿射度量和端-哈密顿系统之间的自然桥梁。经验上,我们展示了所提出的两阶段方法在保持教师动态的同时强制物理结构,并在耗散区域和高维视觉设置中比单阶段替代方法更可靠。

英文摘要

Identifying latent state representations and dynamics is essential when direct modeling in observation space is infeasible, particularly under partial and high-dimensional observations. In such settings, representation learning and physics-aware modeling are inherently coupled. We study this problem for latent port-Hamiltonian systems, a structured class encompassing both conservative and dissipative dynamics. We propose a two-stage identify-then-project framework. First, a contrastive teacher learns continuous-time latent dynamics from partial observations. Then, a student projects the identified teacher representation and dynamics onto a port-Hamiltonian submanifold via a learned affine chart, yielding a physically consistent realization. As a conceptual counterfactual, we also consider a single-stage variant that jointly learns latent identification and port-Hamiltonian structure, but find it to be less reliable, motivating the proposed two-stage teacher-student framework. We show theoretically that affine projection is the natural bridge between the affine gauge of contrastive latent identification and the port-Hamiltonian systems. Empirically, we demonstrate that the proposed two-stage approach preserves the teacher's dynamics while enforcing physical structure, and performs more reliably than the single-stage alternative, particularly in dissipative regimes and high-dimensional visual settings.
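
As background for the structure the student model enforces, the standard input-state-output port-Hamiltonian form is sketched below. This is the textbook definition rather than the paper's specific latent parameterization, and the symbols are assumptions of this sketch: x is the latent state, H the Hamiltonian (stored energy), u the input, and y the output.

```latex
\dot{x} = \big[ J(x) - R(x) \big] \nabla H(x) + G(x)\, u,
\qquad
y = G(x)^{\top} \nabla H(x),

J(x) = -J(x)^{\top}, \qquad R(x) = R(x)^{\top} \succeq 0

% Energy balance implied by the structure
\frac{\mathrm{d}H}{\mathrm{d}t}
  = -\nabla H(x)^{\top} R(x)\, \nabla H(x) + y^{\top} u
  \;\le\; y^{\top} u
```

With R = 0 and u = 0 the energy is conserved, and a nonzero R gives the dissipative regime in which the abstract reports the largest gap between the two-stage and single-stage variants.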