arXivDaily: daily arXiv academic digest, updated Monday through Friday
2508.03668 2026-06-08 cs.CL Updated version

CTR-Sink: Attention Sink for Language Models in Click-Through Rate Prediction

Zixuan Li, Binzong Geng, Jing Xiong, Yong He, Yuxuan Hu, Jian Chen, Dingwei Chen, Xiyu Chang, Ngai Wong, Liang Zhang, Linjian Mo, Chengming Li, Chuan Yuan, Zhenan Sun

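Affiliations NLPR, Institute of Automation, Chinese Academy of Sciences; Ant Group; The University of Hong Kong; City University of Hong Kong; Sun Yat-sen University; Shenzhen MSU-BIT University

AI summary To address the semantic fragmentation caused by the structural gap between user behavior sequences and the text that language models are pre-trained on, the paper proposes the CTR-Sink framework, which introduces behavior-level attention sinks and dynamically regulates attention aggregation to improve click-through rate prediction.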

Abstract

Click-Through Rate (CTR) prediction, a core task in recommendation systems, estimates user click likelihood using historical behavioral data. Modeling user behavior sequences as text to leverage Language Models (LMs) for this task has gained traction, owing to LMs' strong semantic understanding and contextual modeling capabilities. However, a critical structural gap exists: user behavior sequences consist of discrete actions connected by semantically empty separators, differing fundamentally from the coherent natural language in LM pre-training. This mismatch causes semantic fragmentation, where LM attention scatters across irrelevant tokens instead of focusing on meaningful behavior boundaries and inter-behavior relationships, degrading prediction performance. To address this, we propose CTR-Sink, a novel framework introducing behavior-level attention sinks tailored for recommendation scenarios. Inspired by attention sink theory, it constructs attention focus sinks and dynamically regulates attention aggregation via external information. Specifically, we insert sink tokens between consecutive behaviors, incorporating recommendation-specific signals such as temporal distance to serve as stable attention sinks. To enhance generality, we design a two-stage training strategy that explicitly guides LM attention toward sink tokens and an attention sink mechanism that amplifies inter-sink dependencies to better capture behavioral correlations. Experiments on one industrial dataset and two open-source datasets (MovieLens, Kuairec), alongside visualization results, validate the method's effectiveness across scenarios.
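
The sink-token insertion described in this abstract can be pictured with a small sketch. It is only an illustration under stated assumptions: the `Behavior` record, the `[SINK:...]` token format, and the time-gap bucket edges are hypothetical names and values chosen here, not the paper's implementation.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Behavior:
    text: str        # textual description of one user action
    timestamp: int   # seconds since some reference time


def gap_bucket(seconds: int) -> str:
    """Map a time gap to a coarse label (bucket edges are assumed)."""
    if seconds < 3600:
        return "within_hour"
    if seconds < 86400:
        return "within_day"
    return "over_day"


def build_sequence(behaviors: list[Behavior]) -> str:
    """Join behaviors, placing one sink token between consecutive ones.

    Each sink token carries the temporal distance between its neighbors,
    replacing a semantically empty separator.
    """
    if not behaviors:
        return ""
    parts = [behaviors[0].text]
    for prev, cur in zip(behaviors, behaviors[1:]):
        gap = cur.timestamp - prev.timestamp
        parts.append(f"[SINK:{gap_bucket(gap)}]")
        parts.append(cur.text)
    return " ".join(parts)
```

In a real pipeline the sink tokens would presumably be added to the tokenizer vocabulary so that each maps to a single token; that step is omitted here.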

2508.02039 2026-06-08 cs.LG stat.ML Updated version

Model Recycling Framework for Multi-Source Data-Free Supervised Transfer Learning

Sijia Wang, Ricardo Henao

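Affiliations Department of Electrical and Computer Engineering; Duke University

AI summary Proposes a model recycling framework that, without access to source data, identifies subsets of related source models for parameter-efficient transfer learning in both white-box and black-box settings, supporting multi-source data-free supervised transfer learning.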

Abstract

Increasing concerns for data privacy and other difficulties associated with retrieving source data for model training have created the need for source-free transfer learning, in which one only has access to pre-trained models instead of data from the original source domains. This setting introduces many challenges, as many existing transfer learning methods typically rely on access to source data, which limits their direct applicability to scenarios where source data is unavailable. Further, practical concerns make it more difficult, for instance efficiently selecting models for transfer without information on source data, and transferring without full access to the source models. So motivated, we propose a model recycling framework for parameter-efficient training of models that identifies subsets of related source models to reuse in both white-box and black-box settings. Consequently, our framework makes it possible for Model as a Service (MaaS) providers to build libraries of efficient pre-trained models, thus creating an opportunity for multi-source data-free supervised transfer learning.

2507.12927 2026-06-08 cs.LG cs.IT math.IT Updated version

Trace Reconstruction with Language Models

Franziska Weindel, Michael Girsch, Reinhard Heckel

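Affiliations School of Computation, Information and Technology, Technical University of Munich; Munich Center for Machine Learning

AI summary Proposes TReconLM, a decoder-only transformer that treats trace reconstruction as a next-token prediction task; it is pretrained on synthetic data, fine-tuned on real data, and substantially outperforms existing algorithms.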

Abstract

The general trace reconstruction problem seeks to recover an original sequence from its noisy copies independently corrupted by insertions, deletions, and substitutions. This problem arises in applications such as DNA data storage, a promising storage medium due to its high information density and longevity. However, errors introduced during DNA synthesis, storage, and sequencing require correction through algorithms and codes, with trace reconstruction often used as part of data retrieval. In this work, we propose TReconLM, a decoder-only transformer that solves trace reconstruction as a next-token prediction task. TReconLM outperforms state-of-the-art trace reconstruction algorithms, including prior deep-learning approaches, recovering a substantially higher fraction of sequences without error. We pretrain on synthetic data generated from a simple error model and fine-tune on real-world data to adapt to technology-specific error patterns. Code is available at https://github.com/MLI-lab/TReconLM.
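
To make the problem setup concrete, the sketch below generates noisy traces of a DNA sequence through an independent insertion/deletion/substitution channel. It is a minimal illustration, assuming a per-position error model with made-up default probabilities; the function names are hypothetical and the paper's actual synthetic error model may differ.

```python
import random

ALPHABET = "ACGT"


def corrupt(seq, p_ins, p_del, p_sub, rng):
    """Pass `seq` through an independent insertion/deletion/substitution channel."""
    out = []
    for base in seq:
        if rng.random() < p_ins:
            out.append(rng.choice(ALPHABET))  # insert a random base before this one
        r = rng.random()
        if r < p_del:
            continue                          # delete this base
        if r < p_del + p_sub:
            # substitute with a different base
            base = rng.choice([b for b in ALPHABET if b != base])
        out.append(base)
    return "".join(out)


def make_traces(seq, num_traces, p_ins=0.01, p_del=0.01, p_sub=0.01, seed=0):
    """Draw `num_traces` independently corrupted copies of `seq`."""
    rng = random.Random(seed)
    return [corrupt(seq, p_ins, p_del, p_sub, rng) for _ in range(num_traces)]
```

A reconstruction algorithm receives only the list returned by `make_traces` and must output the original `seq`.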

2505.05232 2026-06-08 cs.AI Updated version

ChemQuests: A Curated Chemistry Question-Answer Database Extracted from ChemRxiv papers

Mahmoud Amiri, Thomas Bocklitz

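Affiliations University of California, Berkeley; Stanford University

AI summary Presents the ChemQuests dataset of 952 high-quality question-answer pairs extracted from 155 ChemRxiv papers across 17 chemistry subfields, intended for chemistry NLP research.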

Abstract

The rapid expansion of chemistry literature poses significant challenges for researchers seeking to efficiently access domain-specific knowledge. To support advancements in chemistry-focused natural language processing (NLP), we present ChemQuests, a curated dataset of 952 high-quality question-answer (QA) pairs derived from 155 ChemRxiv papers across 17 subfields of chemistry. Each QA pair is explicitly linked to its source text segment to ensure traceability and contextual accuracy. ChemQuests was constructed using an automated pipeline that combines optical character recognition (OCR), QA generation using GPT-4o, and fuzzy-search verification. The dataset emphasizes conceptual, mechanistic, applied, and synthetic or experimental questions, enabling applications in retrieval-based QA systems, search engine development, and fine-tuning of domain-adapted large language models. We analyze the dataset's structure, coverage, and limitations, and outline future directions for expansion and expert validation. ChemQuests provides a foundational resource for chemistry NLP research, education, and tool development.

2506.02622 2026-06-08 cs.RO cs.HC Updated version

HORUS: A Mixed Reality Interface for Managing Teams of Mobile Robots

Omotoye Shamsudeen Adekoya, Antonio Sgorbissa, Carmine Tommaso Recchiuto

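Affiliations DIBRIS Department, RICE Laboratory, University of Genoa

AI summary Presents HORUS, a Mixed Reality interface that supports monitoring and task assignment for multi-robot teams through a Mini-Map and several teleoperation modes; a user study validates its coordination effectiveness in a search-and-rescue task.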

Comments 7 pages, 7 figures, conference paper submitted to UR 2026

Abstract

Mixed Reality (MR) interfaces have been extensively explored for controlling mobile robots, but there is limited research on their application to managing teams of robots. This paper presents HORUS: Holistic Operational Reality for Unified Systems, a Mixed Reality interface offering a comprehensive set of tools for managing multiple mobile robots simultaneously. HORUS enables operators to monitor individual robot statuses, visualize sensor data projected in real time, and assign tasks to single robots, subsets of the team, or the entire group, all from a Mini-Map (Ground Station). The interface also provides different teleoperation modes: a mini-map mode that allows teleoperation while observing the robot model and its transform on the mini-map, and a semi-immersive mode that offers a flat, screen-like view in either single or stereo view (3D). We conducted a user study in which participants used HORUS to manage a team of mobile robots tasked with finding clues in an environment, simulating search and rescue tasks. This study compared HORUS's full-team management capabilities with individual robot teleoperation. The experiments validated the versatility and effectiveness of HORUS in multi-robot coordination, demonstrating its potential to advance human-robot collaboration in dynamic, team-based environments.

2506.01850 2026-06-08 cs.CV cs.AI cs.LG cs.MM Updated version

MoDA: Modulation Adapter for Fine-Grained Visual Grounding in Instructional MLLMs

Wayner Barrios, Andrés Villa, Juan León Alcázar, SouYoung Jin, Bernard Ghanem

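Affiliations University of California, Berkeley

AI summary Proposes MoDA, a modulation adapter that strengthens fine-grained visual grounding through instruction-guided, channel-wise multiplicative modulation; it yields consistent gains on 12 benchmarks across three MLLM architectures with minimal computational overhead.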

Comments Accepted at ICML 2026. Code is available at https://github.com/waybarrios/MoDA

Abstract

Multimodal Large Language Models (MLLMs) have achieved remarkable success in instruction-following tasks by integrating pretrained visual encoders with large language models (LLMs). However, existing approaches often struggle with fine-grained visual grounding due to semantic entanglement in visual patch representations, where individual patches blend multiple distinct visual elements, making it difficult for models to focus on instruction-relevant details. To address this challenge, we propose MoDA (Modulation Adapter), a lightweight module that enhances visual grounding through instruction-guided channel-wise modulation. Unlike token-level methods such as Q-Former that perform additive feature selection, MoDA operates at the channel level through multiplicative modulation on already-aligned features, enabling fine-grained control over which embedding dimensions are relevant for each instruction. Following the standard LLaVA training protocol, MoDA applies cross-attention between language instructions and pre-aligned visual features, generating dynamic modulation masks without architectural modifications or additional supervision. We evaluate MoDA across 12 benchmarks spanning visual question answering, vision-centric reasoning, and hallucination detection, including recent 2024 benchmarks (MMVP, CV-Bench, MMStar, RealWorldQA), on three distinct MLLM architectures: LLaVA-1.5, LLaVA-MoRE (2025), and Qwen3-VL (2025). MoDA delivers consistent gains across all three families, with +12.0 points on MMVP for the LLaVA-1.5 family and +4.8 points on ScienceQA for the LLaVA-MoRE family, and +4.9 ScienceQA, +4.1 RealWorldQA, and +3.8 GQA on Qwen3-VL, confirming that the gains generalize beyond CLIP-based encoders with minimal overhead (<1% FLOPs). Code is available at https://github.com/waybarrios/MoDA.

2505.23437 2026-06-08 cs.LG cs.AI cs.IR Updated version

Bounded-Abstention Pairwise Learning to Rank

Antonio Ferrara, Andrea Pugnana, Francesco Bonchi, Salvatore Ruggieri

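Affiliations Intesa Sanpaolo AI Research; University of Trento; University of Pisa

AI summary Proposes an abstention method for pairwise ranking based on thresholding the conditional risk, characterizes the optimal strategy theoretically, designs a model-agnostic plug-in algorithm, and validates its effectiveness experimentally.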

Comments KDD 2026

Abstract

Ranking systems influence decision-making in high-stakes domains like health, education, and employment, where they can have substantial economic and social impacts. This makes the integration of safety mechanisms essential. One such mechanism is abstention, which enables algorithmic decision-making systems to defer uncertain or low-confidence decisions to human experts. While abstention has been predominantly explored in the context of classification tasks, its application to other machine learning paradigms remains underexplored. In this paper, we introduce a novel method for abstention in pairwise learning-to-rank tasks. Our approach is based on thresholding the ranker's conditional risk: the system abstains from making a decision when the estimated risk exceeds a predefined threshold. Our contributions are threefold: a theoretical characterization of the optimal abstention strategy, a model-agnostic, plug-in algorithm for constructing abstaining ranking models, and a comprehensive empirical evaluation across multiple datasets, demonstrating the effectiveness of our approach.
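
The thresholding rule in this abstract can be sketched in a few lines. This is an illustration under assumptions, not the paper's algorithm: it assumes a 0-1 pairwise loss, so that the conditional risk of predicting the more likely order is `min(p, 1 - p)`, and it takes the estimated probability `p` as given by some upstream ranker. The function names are hypothetical.

```python
from typing import Optional


def conditional_risk(p_i_before_j: float) -> float:
    """Estimated risk of predicting the more likely order (assumes 0-1 pairwise loss)."""
    return min(p_i_before_j, 1.0 - p_i_before_j)


def rank_pair_or_abstain(p_i_before_j: float, threshold: float) -> Optional[bool]:
    """Return True if item i is ranked before j, False if after, None to abstain.

    The system abstains when the estimated conditional risk exceeds `threshold`,
    deferring the pair to a human expert.
    """
    if conditional_risk(p_i_before_j) > threshold:
        return None
    return p_i_before_j >= 0.5
```

Lowering `threshold` makes the ranker abstain on more pairs; how the threshold is chosen to respect a bound on the abstention rate is not covered by this sketch.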

2505.23131 2026-06-08 cs.LG cs.DC Updated version

DOPPLER: Dual-Policy Learning for Device Assignment in Asynchronous Dataflow Graphs

Xinyu Yao, Daniel Bourgeois, Abhinav Jain, Yuxin Tang, Jiawen Yao, Zhimin Ding, Arlei Silva, Chris Jermaine

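Affiliations Rice University; Rice Ken Kennedy Institute

AI summary Proposes the Doppler framework, which uses dual-policy networks (SEL to select operations, PLC to place them on devices) to optimize device assignment in asynchronous dataflow graphs, reducing execution time and improving sampling efficiency.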

Comments 32 pages, 19 figures

Journal ref Proceedings of the International Conference on Learning Representations (ICLR), 2026

Abstract

We study the problem of assigning operations in a dataflow graph to devices to minimize execution time in a work-conserving system, with emphasis on complex machine learning workloads. Prior learning-based methods often struggle due to three key limitations: (1) reliance on bulk-synchronous systems like TensorFlow, which under-utilize devices due to barrier synchronization; (2) lack of awareness of the scheduling mechanism of underlying systems when designing learning-based methods; and (3) exclusive dependence on reinforcement learning, ignoring the structure of effective heuristics designed by experts. In this paper, we propose Doppler, a three-stage framework for training dual-policy networks consisting of 1) a $\mathsf{SEL}$ policy for selecting operations and 2) a $\mathsf{PLC}$ policy for placing chosen operations on devices. Our experiments show that Doppler outperforms all baseline methods across tasks by reducing system execution time and additionally demonstrates sampling efficiency by reducing per-episode training time.

2505.14289 2026-06-08 cs.AI Updated version

EVA: Evolving Semantic Adversaries for Red-Teaming GUI Agents Against Environmental Injection Attacks

Yijie Lu, Manman Zhao, Tianjie Ju, Zihe Yan, Xinbei Ma, Yuan Guo, Daizong Ding, Gongshen Liu, Zhuosheng Zhang

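Affiliations School of Cyber Science and Engineering, Wuhan University; School of Computer Science, Shanghai Jiao Tong University; Independent Researcher

AI summary Proposes the EVA framework, which attacks GUI agents driven by multimodal large language models by evolving semantic adversarial payloads; it shows that semantic deception is the key to attack success and reaches an attack success rate of up to 85% with rapid convergence.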

Comments Accepted by

Abstract

Graphical User Interface (GUI) agents powered by Multimodal Large Language Models (MLLMs) are increasingly deployed yet vulnerable to Environmental Injection Attacks (EIAs). However, current red-teaming methods are hindered by prohibitive computational costs and limited adaptability. A fundamental question remains unaddressed: does the bottleneck of attack success lie in visual perception or semantic understanding? Through controlled experiments, we observe that semantic deception, rather than visual appearance, serves as the primary determinant of attack success. Based on this insight, we introduce EVA, an evolutionary framework that evolves adversarial payloads exclusively within the semantic dimension. EVA employs a discovery-deployment framework to mine linguistic vulnerability patterns and distill them into generalizable rules. Experimental results across five representative victim agents demonstrate that EVA achieves up to 85% attack success rate, evolving benign seeds into successful attacks within only 1.18 to 1.71 iterations. This rapid convergence uncovers a dense semantic attack space in the model's latent representation, unveiling a critical alignment paradox: the instruction-following capabilities reinforced by alignment training render agents inherently susceptible to authoritative, semantically deceptive environmental cues.

2505.12239 2026-06-08 cs.LG cs.AI cs.CR Updated version

Towards Efficient and Exact Forgetting Services in Pre-Trained-Model-based Continual Learning

Yajiang Huang, Jianheng Tang, Kejia Fan, Huiping Zhuang, Anfeng Liu, Tian Wang, Yunhuai Liu, Mianxiong Dong, Houbing Herbert Song

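Affiliations Department of Information Systems, University of Maryland, Baltimore County (UMBC)

AI summary To address sequential forgetting requests in continual learning, proposes Analytic Continual Unlearning (ACU), which recursively derives closed-form solutions via least squares for efficient and exact forgetting while protecting historical data privacy.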

Abstract

In Continual Learning (CL), using a Pre-Trained Model (PTM) as the feature extractor has become a popular practice. Accompanied by analytic classifiers, the PTM-based methods have achieved state-of-the-art performance in CL, in pursuit of the non-forgetting goal. Meanwhile, actively forgetting specific knowledge acquired during the CL phase is also essential in most service construction paradigms, for example, Mobile Crowd Sensing (MCS), where mobile edge nodes continuously collect sensory data and demand not only non-forgetting adaptation but also specific knowledge forgetting for privacy preservation. Thus, a unique problem, called Continual Unlearning (CU), arises when forgetting requests arrive sequentially in CL. However, existing unlearning methods focus on single-shot joint forgetting and prove highly inadequate when applied to CU: they (1) violate historical data privacy in CL and (2) are vulnerable to being overwhelmed or degraded by adversarially frequent requests. To handle the challenges of CU, we propose a gradient-free approach, called Analytic Continual Unlearning (ACU), for efficient and exact forgetting with historical data privacy preservation in PTM-based CL. In response to each unlearning request, our ACU recursively derives the analytical (i.e., closed-form) solutions via least squares in an interpretable manner. By meticulous design, our ACU is compatible with both sample-level and class-level unlearning requests. The theoretical and experimental evaluations validate our ACU's superiority in unlearning effectiveness, model fidelity, and system efficiency.

2505.10892 2026-06-08 cs.LG Updated version

Multi-Objective Preference Optimization: Improving Human Alignment of Generative Models

Akhil Agnihotri, Rahul Jain, Deepak Ramachandran, Zheng Wen

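Affiliations University of California, Berkeley; Stanford University

AI summary Noting that RLHF and preference optimization methods assume a single objective, proposes Multi-Objective Preference Optimization (MOPO), a constrained, KL-regularized framework that maximizes a primary objective while guaranteeing lower bounds on secondary objectives, and attains Pareto-optimal policies on synthetic benchmarks and human-preference data.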

Abstract

Post-training LLMs with RLHF and preference optimization methods (e.g., DPO, IPO) has greatly improved alignment, yet these approaches assume a single objective. In reality, humans express multiple, often conflicting objectives, such as helpfulness and harmlessness, with no natural scalarization. We study the multi-objective preference alignment problem, where a policy must balance several objectives simultaneously. We propose Multi-Objective Preference Optimization (MOPO), a constrained KL-regularized framework that maximizes a primary objective while enforcing lower bounds on secondary objectives via tunable safety thresholds. MOPO operates directly on pairwise preferences without point-wise rewards, and admits simple closed-form iterative updates. Empirically, MOPO recovers Pareto-optimal policies on synthetic benchmarks and, when fine-tuned on human-preference data, yields multi-billion parameter models that achieve higher rewards and Pareto-dominate baselines, with stable and robust optimization dynamics.

2504.10102 2026-06-08 cs.RO cs.SY eess.SY Updated version

A Human-Sensitive Controller: Adapting to Human Musculoskeletal Disorder-Related Constraints via Reinforcement Learning

Vitor Martins, Sara M. Cerqueira, Mercedes Balcells, Elazer R Edelman, Cristina P. Santos

Affiliations Fundação para a Ciência e Tecnologia; Centro de Microssistemas Eletromecânicos da Universidade do Minho; Massachusetts Institute of Technology; Brigham and Women’s Hospital, Harvard Medical School; GEVAB, IQS School of Engineering; LABBELS-Associate Laboratory, University of Minho

AI summary Proposes a reinforcement-learning-based, human-sensitive robot control strategy that uses Q-Learning and Deep Q-Networks to optimize the ergonomics of collaborative robots, shortening task completion time by 38% on average while maintaining zero pain risk.

Abstract

Work-Related Musculoskeletal Disorders continue to be a major challenge in industrial environments, leading to reduced workforce participation, increased healthcare costs, and long-term disability. This study introduces a human-sensitive robotic system aimed at reintegrating individuals with a history of musculoskeletal disorders into standard job roles, while simultaneously optimizing ergonomic conditions for the broader workforce. This research leverages reinforcement learning (RL) to develop a human-aware control strategy for collaborative robots, focusing on optimizing ergonomic conditions and preventing pain during task execution. Two RL approaches, Q-Learning and Deep Q-Network (DQN), were implemented and tested to personalize control strategies based on individual user characteristics. Although experimental results revealed a simulation-to-real gap, a fine-tuning phase successfully adapted the policies to real-world conditions. DQN outperformed Q-Learning by completing tasks faster while maintaining zero pain risk and safe ergonomic levels, achieving on average 38% shorter task completion times across all tested anthropometries. The structured testing protocol confirmed the system's adaptability to diverse human anthropometries, underscoring the potential of RL-driven cobots to enable safer, more inclusive workplaces.

2504.00613 2026-06-08 cs.AI cs.IT cs.NE math.IT Updated version

LLM-Guided Search for Deletion-Correcting Codes

Franziska Weindel, Reinhard Heckel

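Affiliations School of Computation, Information and Technology, Technical University of Munich

AI summary Targeting the open problem of the maximum size of deletion-correcting codes, applies FunSearch, an LLM-guided evolutionary search, to discover functions that construct deletion-correcting codes at short code lengths; for a single deletion the discovered function provably constructs the conjectured-optimal Varshamov-Tenengolts code, while for multiple deletions and quaternary edit codes the functions improve on existing constructions but offer no new theoretical insight.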

Abstract

Finding deletion-correcting codes of maximum size has been an open problem for over 70 years, even for a single deletion. We adapt FunSearch, a large language model (LLM)-guided evolutionary search, to discover functions that construct deletion-correcting codes at short code lengths. For a single deletion, our search finds a function that we prove constructs the conjectured-optimal Varshamov-Tenengolts code. For multiple deletions and quaternary edit codes, the discovered functions improve on prior explicit, search-based, and neural constructions but remain empirical heuristics without new theoretical insights. We study design choices for LLM-guided evolutionary search and find that, for our problem, compute is better allocated to sampling more functions than to longer reasoning traces per function, and that co-evolving natural language descriptions with code hurts search quality. We propose deduplicating logically identical functions during evolution, which we find critical for search diversity. Our results demonstrate the potential of LLM-guided evolutionary search for information theory and code design and represent the first application of such methods for constructing error-correcting codes. However, in our current formulation, evaluating a function scales exponentially with code length, limiting the approach to short codes.
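
For readers unfamiliar with the Varshamov-Tenengolts (VT) code mentioned in this abstract, the sketch below builds the binary code VT_a(n), the set of length-n words x with sum(i * x_i) congruent to a modulo n + 1, and checks by brute force that no two codewords share a length-(n-1) subsequence. This is the textbook construction, not the function discovered by the search; the helper names are chosen here for illustration.

```python
from itertools import product


def vt_code(n, a=0):
    """All binary words x of length n with sum(i * x_i) % (n + 1) == a (1-based i)."""
    return [
        x for x in product((0, 1), repeat=n)
        if sum(i * bit for i, bit in enumerate(x, start=1)) % (n + 1) == a
    ]


def deletion_ball(word):
    """All words obtained from `word` by deleting exactly one symbol."""
    return {word[:i] + word[i + 1:] for i in range(len(word))}


def corrects_single_deletion(code):
    """True if the single-deletion balls of all codewords are pairwise disjoint."""
    seen = set()
    for word in code:
        ball = deletion_ball(word)
        if seen & ball:
            return False
        seen |= ball
    return True
```

The enumeration visits all 2^n words, which mirrors the exponential evaluation cost the abstract names as a limit on code length.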

2502.16531 2026-06-08 cs.RO cs.SY eess.SY Updated version

Efficient Coordination and Synchronization of Multi-Robot Systems Under Recurring Linear Temporal Logic

Davide Peron, Victor Nan Fernandez-Ayala, Eleftherios E. Vlahakis, Dimos V. Dimarogonas

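Affiliations Department of Information Engineering, University of Padova; Division of Decision and Control Systems, School of Electrical Engineering and Computer Science, KTH Royal Institute of Technology

AI summary Proposes a bottom-up approach that combines offline plan synthesis with online coordination, adjusts plans dynamically through real-time communication, and adds a synchronization mechanism to handle action delays, yielding a scalable coordination and synchronization framework for multi-robot systems.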

Journal ref Proc. IEEE ICRA, 2025, pp. 10194-10200

Abstract

We consider multi-robot systems under recurring tasks formalized as linear temporal logic (LTL) specifications. To solve the planning problem efficiently, we propose a bottom-up approach combining offline plan synthesis with online coordination, dynamically adjusting plans via real-time communication. To address action delays, we introduce a synchronization mechanism ensuring coordinated task execution, leading to a multi-agent coordination and synchronization framework that is adaptable to a wide range of multi-robot applications. The software package is developed in Python and ROS2 for broad deployment. We validate our findings through lab experiments involving nine robots showing enhanced adaptability compared to previous methods. Additionally, we conduct simulations with up to ninety agents to demonstrate the reduced computational complexity and the scalability features of our work.

2412.09119 2026-06-08 cs.LG cs.CR math.OC Updated version

The Utility and Complexity of in- and out-of-Distribution Machine Unlearning

Youssef Allouah, Joshua Kazdan, Rachid Guerraoui, Sanmi Koyejo

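Affiliations EPFL; Stanford University

AI summary Analyzes the utility, time, and space complexity trade-offs of approximate machine unlearning; shows that empirical risk minimization with output perturbation achieves tight trade-offs for in-distribution unlearning, and proposes a robust, noisy gradient descent variant that amortizes time complexity for out-of-distribution unlearning.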

Abstract

Machine unlearning, the process of selectively removing data from trained models, is increasingly crucial for addressing privacy concerns and knowledge gaps post-deployment. Despite this importance, existing approaches are often heuristic and lack formal guarantees. In this paper, we analyze the fundamental utility, time, and space complexity trade-offs of approximate unlearning, providing rigorous certification analogous to differential privacy. For in-distribution forget data -- data similar to the retain set -- we show that a surprisingly simple and general procedure, empirical risk minimization with output perturbation, achieves tight unlearning-utility-complexity trade-offs, addressing a previous theoretical gap on the separation from unlearning "for free" via differential privacy, which inherently facilitates the removal of such data. However, such techniques fail with out-of-distribution forget data -- data significantly different from the retain set -- where unlearning time complexity can exceed that of retraining, even for a single sample. To address this, we propose a new robust and noisy gradient descent variant that provably amortizes unlearning time complexity without compromising utility.

2502.00527 2026-06-08 cs.LG cs.CL 版本更新

PolarQuant: Leveraging Polar Transformation for Efficient Key Cache Quantization and Decoding Acceleration

PolarQuant: 利用极坐标变换实现高效键缓存量化和解码加速

Songhao Wu, Ang Lv, Xiao Feng, Yufei Zhang, Xun Zhang, Guojun Yin, Wei Lin, Rui Yan

发表机构 * Gaoling School of Artificial Intelligence, Renmin University of China(中国人民大学北京校区人工智能学院) ShanghaiTech University(上海科技大学) Meituan(美团)

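AI summary Proposes PolarQuant, which groups key vectors into two-dimensional sub-vectors and encodes each as a quantized radius and polar angle. This addresses the outlier problem in key cache quantization, accelerates decoding through table lookup, and maintains the performance of full-precision models.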

Comments NeurIPS 2025 version with minor revisions to the methodology

详情
AI中文摘要

大型语言模型中的KV缓存是内存使用的主要因素,限制了其更广泛的适用性。将缓存量化到更低的位宽是减少计算成本的有效方法;然而,先前的方法由于异常值的存在,难以量化键向量,导致过高的开销。我们提出了一种名为PolarQuant的新型量化方法,有效解决了异常值挑战。我们观察到,异常值通常只出现在两个维度中的一个,当应用旋转位置嵌入时,这两个维度会一起旋转特定角度。当表示为二维向量时,这些维度展现出结构良好的模式,半径和角度在极坐标中平滑分布。这减轻了异常值对逐通道量化的挑战,使其非常适合量化。因此,PolarQuant将键向量分为二维子向量组,将其编码为相应的量化半径和极角,而不是直接量化原始键向量。PolarQuant在KV缓存量化中实现了卓越的效率,并通过将查询-键内积转化为查表操作来加速解码过程,同时保持全精度模型的下游性能。

英文摘要

The KV cache in large language models is a dominant factor in memory usage, limiting their broader applicability. Quantizing the cache to lower bit widths is an effective way to reduce computational costs; however, previous methods struggle with quantizing key vectors due to outliers, resulting in excessive overhead. We propose a novel quantization approach called PolarQuant, which efficiently addresses the outlier challenge. We observe that outliers typically appear in only one of two dimensions, which are rotated together by a specific angle when rotary position embeddings are applied. When represented as two-dimensional vectors, these dimensions exhibit well-structured patterns, with radii and angles smoothly distributed in polar coordinates. This alleviates the challenge that outliers pose for per-channel quantization, making these dimensions well suited to quantization. Thus, PolarQuant divides key vectors into groups of two-dimensional sub-vectors, encoding each as a quantized radius and polar angle rather than quantizing the original key vectors directly. PolarQuant achieves superior efficiency in KV cache quantization and accelerates decoding by turning the query-key inner product into a table lookup, all while maintaining the downstream performance of full-precision models.
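The polar encoding and table-lookup idea described above can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration rather than the paper's implementation: adjacent dimensions are paired, both quantizers are uniform, the radius is scaled by a per-vector maximum, and the function names (`polar_encode`, `polar_decode`, `build_query_tables`, `lookup_score`) are hypothetical.

```python
import math

TWO_PI = 2.0 * math.pi


def polar_encode(key, r_bits=4, t_bits=4):
    """Encode each 2D sub-vector of `key` as (radius code, angle code)."""
    # Assumption: adjacent dimensions form a pair; len(key) must be even.
    pairs = [(key[i], key[i + 1]) for i in range(0, len(key), 2)]
    r_max = max(math.hypot(x, y) for x, y in pairs) or 1.0
    r_levels, t_levels = (1 << r_bits) - 1, 1 << t_bits
    codes = []
    for x, y in pairs:
        theta = math.atan2(y, x) % TWO_PI
        r_code = round(math.hypot(x, y) / r_max * r_levels)
        t_code = int(theta / TWO_PI * t_levels) % t_levels
        codes.append((r_code, t_code))
    return codes, r_max


def polar_decode(codes, r_max, r_bits=4, t_bits=4):
    """Reconstruct an approximate key from its polar codes (bin-center angles)."""
    r_levels, t_levels = (1 << r_bits) - 1, 1 << t_bits
    out = []
    for r_code, t_code in codes:
        r = r_code / r_levels * r_max
        theta = (t_code + 0.5) * TWO_PI / t_levels
        out.extend([r * math.cos(theta), r * math.sin(theta)])
    return out


def build_query_tables(query, t_bits=4):
    """Per query pair, precompute q . (cos a, sin a) for every angle bin."""
    t_levels = 1 << t_bits
    angles = [(t + 0.5) * TWO_PI / t_levels for t in range(t_levels)]
    return [[query[i] * math.cos(a) + query[i + 1] * math.sin(a) for a in angles]
            for i in range(0, len(query), 2)]


def lookup_score(tables, codes, r_max, r_bits=4):
    """Approximate q . k using only table lookups and radius scaling."""
    r_levels = (1 << r_bits) - 1
    return sum(r_code / r_levels * r_max * table[t_code]
               for table, (r_code, t_code) in zip(tables, codes))
```

In this sketch the tables are built once per query and reused for every cached key, so scoring a key touches only its integer codes. The lookup score equals the inner product of the query with the decoded key, not with the original key, so its accuracy depends on the chosen bit widths.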

2501.15768 2026-06-08 cs.RO cs.SY eess.SY 版本更新

Error-State LQR Formulation for Quadrotor UAV Trajectory Tracking

四旋翼无人机轨迹跟踪的误差状态LQR公式

Micah Reich

发表机构 * Carnegie Mellon University(卡内基梅隆大学)

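AI summary Proposes a quadrotor UAV trajectory tracking method based on an error-state linear quadratic regulator, using exponential coordinates to represent orientation errors and combining full-state feedback with a cascaded body-rate controller for robust control.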

详情
AI中文摘要

本文提出了一种用于四旋翼无人机(UAV)鲁棒轨迹跟踪的误差状态线性二次型调节器(LQR)公式。该方法利用误差状态动力学,并采用指数坐标表示姿态误差,从而实现用于实时控制的线性化系统表示。控制策略集成了基于LQR的全状态反馈控制器用于轨迹跟踪,并结合级联体速率控制器来处理执行器动力学。提供了误差状态动力学、线性化过程以及控制器设计的详细推导,突出了该方法在动态环境中实现精确稳定四旋翼控制的适用性。

英文摘要

This article presents an error-state Linear Quadratic Regulator (LQR) formulation for robust trajectory tracking in quadrotor Unmanned Aerial Vehicles (UAVs). The proposed approach leverages error-state dynamics and employs exponential coordinates to represent orientation errors, enabling a linearized system representation for real-time control. The control strategy integrates an LQR-based full-state feedback controller for trajectory tracking, combined with a cascaded body-rate controller to handle actuator dynamics. Detailed derivations of the error-state dynamics, the linearization process, and the controller design are provided, highlighting the applicability of the method for precise and stable quadrotor control in dynamic environments.
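For orientation, the generic shape of an error-state LQR can be written out as follows. This is the textbook continuous-time form under assumed notation, not the article's own derivation: the error convention $R_{\mathrm{ref}}^{\top} R$, the symbols $A$, $B$, $Q$, $W$, $P$, $K$, and the infinite-horizon cost are all assumptions made here.

```latex
% Orientation error in exponential coordinates (error convention assumed):
\delta\boldsymbol{\theta} = \operatorname{Log}\!\left(R_{\mathrm{ref}}^{\top} R\right)^{\vee} \in \mathbb{R}^{3}

% Error dynamics linearized about the reference trajectory:
\delta\dot{\mathbf{x}} = A\,\delta\mathbf{x} + B\,\delta\mathbf{u}

% Infinite-horizon quadratic cost and the resulting full-state feedback law:
J = \int_{0}^{\infty} \left( \delta\mathbf{x}^{\top} Q\,\delta\mathbf{x}
    + \delta\mathbf{u}^{\top} W\,\delta\mathbf{u} \right) dt,
\qquad \delta\mathbf{u} = -K\,\delta\mathbf{x},
\quad K = W^{-1} B^{\top} P

% P solves the continuous-time algebraic Riccati equation:
A^{\top} P + P A - P B W^{-1} B^{\top} P + Q = 0
```

Here $W$ denotes the input weight to avoid a clash with the rotation matrix $R$.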

2406.05670 2026-06-08 cs.LG cs.CR cs.CV 版本更新

Certified Robustness to Data Poisoning in Gradient-Based Training

基于梯度的训练中对数据投毒的认证鲁棒性

Philip Sosnin, Mark N. Müller, Maximilian Baader, Calvin Tsay, Matthew Wicker

发表机构 * Department of Computing, Imperial College London, United Kingdom(帝国理工学院伦敦分校计算机系) Department of Computer Science, ETH Zurich, Switzerland(苏黎世联邦理工学院计算机科学系) LogicStar.ai, Switzerland(LogicStar.ai公司) The Alan Turing Institute, United Kingdom(艾伦·图灵研究所)

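AI summary Proposes the first framework that uses convex relaxations to over-approximate the set of possible parameter updates, providing provable robustness guarantees against untargeted poisoning, targeted poisoning, and backdoor attacks for models trained with gradient-based learning algorithms.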

Comments 21 pages, 8 figures

详情
AI中文摘要

现代机器学习流程利用大量公共数据,使得保证数据质量变得不可行,并使模型容易受到投毒和后门攻击。在攻击下可证明地约束模型行为仍然是一个开放问题。在这项工作中,我们通过开发第一个框架来应对这一挑战,该框架在不修改模型或学习算法的情况下,为使用可能被操纵的数据训练的模型的行为提供可证明的保证。特别是,我们的框架针对训练输入和标签的有界和无界操纵,认证了对无目标和有目标投毒以及后门攻击的鲁棒性。我们的方法利用凸松弛来过度近似给定投毒威胁模型下所有可能的参数更新集,从而允许我们为任何基于梯度的学习算法约束所有可达参数的集合。给定这个参数集,我们提供了最坏情况行为的界限,包括模型性能和后门成功率。我们在多个真实世界数据集上展示了我们的方法,这些数据集来自能源消耗、医学成像和自动驾驶等应用。

英文摘要

Modern machine learning pipelines leverage large amounts of public data, making it infeasible to guarantee data quality and leaving models open to poisoning and backdoor attacks. Provably bounding model behavior under such attacks remains an open problem. In this work, we address this challenge by developing the first framework providing provable guarantees on the behavior of models trained with potentially manipulated data without modifying the model or learning algorithm. In particular, our framework certifies robustness against untargeted and targeted poisoning, as well as backdoor attacks, for bounded and unbounded manipulations of the training inputs and labels. Our method leverages convex relaxations to over-approximate the set of all possible parameter updates for a given poisoning threat model, allowing us to bound the set of all reachable parameters for any gradient-based learning algorithm. Given this set of parameters, we provide bounds on worst-case behavior, including model performance and backdoor success rate. We demonstrate our approach on multiple real-world datasets from applications including energy consumption, medical imaging, and autonomous driving.
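A toy illustration of bounding all reachable parameters under label poisoning is sketched below. It uses plain interval arithmetic, which is only one simple form of relaxation and is swapped in here for illustration; it is not the framework's actual method. The scalar model `w * x`, the squared loss, and the helper names (`interval_mul`, `interval_sub`, `certified_step`, `certified_training`) are all assumptions.

```python
def interval_mul(a, b):
    """Product of two intervals given as (low, high) tuples."""
    products = [a[0] * b[0], a[0] * b[1], a[1] * b[0], a[1] * b[1]]
    return (min(products), max(products))


def interval_sub(a, b):
    """Difference a - b of two intervals."""
    return (a[0] - b[1], a[1] - b[0])


def certified_step(w_bounds, x, y, eps, lr):
    """Bound w - lr * (w * x - y') * x over all w in w_bounds and |y' - y| <= eps."""
    x_iv, y_iv = (x, x), (y - eps, y + eps)
    residual = interval_sub(interval_mul(w_bounds, x_iv), y_iv)
    grad = interval_mul(residual, x_iv)
    return interval_sub(w_bounds, interval_mul((lr, lr), grad))


def certified_training(data, eps, lr, epochs, w0=0.0):
    """Propagate parameter bounds through per-example gradient steps."""
    bounds = (w0, w0)
    for _ in range(epochs):
        for x, y in data:
            bounds = certified_step(bounds, x, y, eps, lr)
    return bounds
```

Every weight reachable by training on labels shifted by at most `eps` lies inside the returned interval, so any property checked for the whole interval holds for every such poisoned run. Naive interval arithmetic treats the two occurrences of `w` in each update as independent, so the bounds are sound but widen quickly as the number of steps grows.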

2408.15344 2026-06-08 cs.LG math.DS 版本更新

Conformal Disentanglement and Latent-Space Curation: A Neural Framework for Perspective Synthesis, Differentiation and Targeted Generation

共形解缠与潜在空间策展:面向视角合成、区分和定向生成的神经框架

George A. Kevrekidis, Eleni D. Koronaki, Dimitris G. Giovanis, Yannis G. Kevrekidis

发表机构 * Department of Applied Mathematics and Statistics, Johns Hopkins University(应用数学与统计学系,约翰霍普金斯大学) Los Alamos National Laboratory(洛斯阿拉莫斯国家实验室) Faculty of Science, Technology and Medicine, University of Luxembourg(科学、技术与医学学院,卢森堡大学) Department of Civil and Systems Engineering, Johns Hopkins University(土木与系统工程系,约翰霍普金斯大学) Department of Chemical and Biomolecular Engineering, Johns Hopkins University(化学与生物分子工程系,约翰霍普金斯大学)

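AI summary Proposes a neural autoencoder framework that separates shared and sensor-specific latent variables from multi-sensor data through structural constraints and orthogonality-based regularization, and uses the disentangled latent subspaces for targeted generation and cross-sensor inference.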

详情
AI中文摘要

许多科学和工程问题涉及通过多个异构传感器或测量模态观察同一现象。此类观测通常包含跨传感器共享的信息(反映底层系统)以及来自测量过程或环境效应的传感器特定或外部成分。当传感器独立观测不可用时,解缠这些贡献至关重要。我们提出一种神经自编码器框架,从多传感器数据中显式分离共享和传感器特定的潜在变量。该架构通过结构约束和基于正交的正则化强制潜在组件之间的几何独立性,产生可解释且解缠的表示。基于此表示,我们引入一种潜在空间生成方法,其中生成模型在选定的解缠潜在子空间上被调谐/“限制”;然后我们建设性地组合解缠的观测潜在变量,通过训练的解码器条件合成新样本。这使得能够生成具有指定共享(或传感器特定)特征的一致数据。它还通过一致地采样未观测模态中合理测量的分布来支持跨传感器推断。我们在多个计算示例上展示了该方法,显示了在异构传感设置中的有效解缠、定向数据生成和模态插补。

英文摘要

Many scientific and engineering problems involve observing a common phenomenon through multiple heterogeneous sensors or measurement modalities. Such observations typically contain both information shared across sensors, reflecting the underlying system, and sensor-specific or extraneous components arising from measurement processes or environmental effects. Disentangling these contributions is essential when sensor-independent observations are unavailable. We propose a neural autoencoder framework that explicitly separates shared and sensor-specific latent variables from multi-sensor data. The architecture enforces geometric independence between latent components through structural constraints and orthogonality-based regularization, yielding interpretable and disentangled representations. Building on this representation, we then introduce a latent-space generative methodology in which generative models are tuned/"restricted" on selected disentangled latent subspaces; we then constructively combine disentangled observed latent variables to conditionally synthesize new samples via trained decoders. This enables consistent data generation with prescribed shared (or sensor-specific) characteristics. It also supports cross-sensor inference by consistently sampling distributions over plausible measurements in unobserved modalities. We demonstrate the approach on several computational examples, showing effective disentanglement, targeted data generation, and modality imputation in heterogeneous sensing settings.

2406.00636 2026-06-08 cs.CV 版本更新

T2LM: Long-Term 3D Human Motion Generation from Multiple Sentences

T2LM:基于多句子的长期3D人体运动生成

Taeryung Lee, Fabien Baradel, Thomas Lucas, Kyoung Mu Lee, Gregory Rogez

发表机构 * IPAI & ASRI(IPAI与ASRI) Dept. of ECE, Seoul National University(电子工程系,首尔国立大学) NAVER LABS Europe(NAVER欧洲实验室)

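AI summary Proposes T2LM, a framework that uses a 1D-convolutional VQ-VAE and a Transformer-based text encoder to generate continuous long-term 3D human motion from multiple sentences without sequential training data, outperforming prior long-term methods while remaining competitive with SOTA single-action models.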

Comments CVPR 2024 HuMoGen Workshop

详情
AI中文摘要

本文解决了长期3D人体运动生成的挑战性问题。具体而言,我们旨在从多个句子(即段落)流中生成平滑连接的长时间动作序列。先前的长期运动生成方法大多基于循环方法,使用先前生成的运动块作为下一步的输入。然而,这种方法有两个缺点:1)依赖顺序数据集,成本高昂;2)这些方法在每一步生成的运动之间产生不切实际的间隙。为了解决这些问题,我们引入了简单而有效的T2LM,一个无需顺序数据即可训练的连续长期生成框架。T2LM包含两个组件:一个1D卷积VQVAE,训练将运动压缩为潜在向量序列;以及一个基于Transformer的文本编码器,根据输入文本预测潜在序列。在推理时,一个句子序列被翻译成连续的潜在向量流,然后由VQVAE解码器解码为运动;使用具有局部时间感受野的1D卷积避免了训练序列和生成序列之间的时间不一致性。VQ-VAE上的这个简单约束使其仅用短序列训练即可产生更平滑的过渡。T2LM优于先前的长期生成模型,同时克服了需要顺序数据的限制;它也与最先进的单动作生成模型具有竞争力。

英文摘要

In this paper, we address the challenging problem of long-term 3D human motion generation. Specifically, we aim to generate a long sequence of smoothly connected actions from a stream of multiple sentences (i.e., a paragraph). Previous long-term motion generation approaches were mostly based on recurrent methods, using previously generated motion chunks as input for the next step. However, this approach has two drawbacks: 1) it relies on sequential datasets, which are expensive; 2) it yields unrealistic gaps between the motions generated at each step. To address these issues, we introduce T2LM, a simple yet effective continuous long-term generation framework that can be trained without sequential data. T2LM comprises two components: a 1D-convolutional VQ-VAE, trained to compress motion into sequences of latent vectors, and a Transformer-based text encoder that predicts a latent sequence given an input text. At inference, a sequence of sentences is translated into a continuous stream of latent vectors, which the VQ-VAE decoder then decodes into a motion; the use of 1D convolutions with a local temporal receptive field avoids temporal inconsistencies between training and generated sequences. This simple constraint on the VQ-VAE allows it to be trained with short sequences only and produces smoother transitions. T2LM outperforms prior long-term generation models while overcoming the constraint of requiring sequential data; it is also competitive with SOTA single-action generation models.

2403.10318 2026-06-08 cs.LG 版本更新

pTNAS: Progressive Neural Architecture Search for Tabular Data

pTNAS: 面向表格数据的渐进式神经架构搜索

Naili Xing, Shaofeng Cai, Lingze Zeng, Jiaqi Zhu, Peng Lu, Jian Pei, Beng Chin Ooi

发表机构 * National University of Singapore(新加坡国立大学) University of Science and Technology of China(中国科学技术大学)

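AI summary Proposes pTNAS, the first progressive neural architecture search method for tabular data. It adopts a filter-and-refine optimization strategy that combines a zero-cost proxy with a fixed-budget scheduling algorithm, identifying a viable architecture quickly and improving as more budget becomes available, and reaches the globally best architecture up to 82.75× faster than other NAS approaches.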

详情
AI中文摘要

最近的进展已将表格学习的范式转向表格基础模型,但其准确性依赖于随着上下文大小扩展而性能不佳的高推理成本。当配备精心设计的架构时,深度神经网络仍然是一种极具竞争力且更高效的建模范式;然而,以数据自适应和预算感知的方式识别此类架构仍然具有挑战性。我们提出了pTNAS,这是首个针对表格数据定制的渐进式神经架构搜索(NAS)方法,它能够快速识别可行的架构,并在更多预算可用时持续提高其搜索性能。pTNAS采用了一种过滤-精炼优化策略,结合了高效的免训练和有效的基于训练的架构评估。在过滤阶段,我们引入了pTProxy,这是一种专为表格网络设计的新型零成本代理,它联合捕捉架构的可训练性和表达能力,从而能够快速过滤大型架构搜索空间。在精炼阶段,pTNAS采用固定预算调度算法,从一小批有希望的候选架构中准确识别出性能最佳的架构。我们进一步提出了一种预算感知协调器来整体优化预算分配。实验表明,与其他NAS方法相比,pTNAS将达到全局最佳架构的时间缩短了高达82.75倍,实现了最佳的平均预测排名,并且与TabPFN相比,端到端效率提高了高达4.78倍。

英文摘要

Recent advances have shifted the paradigm of tabular learning toward tabular foundation models, yet their accuracy relies on a heavy inference cost that scales poorly with context size. Deep neural networks remain a highly competitive and more efficient modeling paradigm when equipped with well-designed architectures; however, identifying such architectures in a data-adaptive and budget-aware manner remains challenging. We propose pTNAS, the first progressive neural architecture search (NAS) approach tailored for tabular data, which enables fast identification of a viable architecture and continuously improves its search performance as more budget becomes available. pTNAS adopts a filter-and-refine optimization strategy that combines efficient training-free and effective training-based architecture evaluation. In the filtering phase, we introduce pTProxy, a novel zero-cost proxy specifically designed for tabular networks that jointly captures architectural trainability and expressivity, enabling fast filtering of large architecture search spaces. In the refinement phase, pTNAS employs a fixed-budget scheduling algorithm to accurately identify the best-performing architecture from a small set of promising candidates. We further propose a budget-aware coordinator to optimize budget allocation holistically. Experiments show that pTNAS reduces the time to reach the globally best architecture by up to 82.75× compared with other NAS approaches, achieves the best average predictive rank, and improves end-to-end efficiency by up to 4.78× compared with TabPFN.

2403.05532 2026-06-08 cs.LG cs.CV 版本更新

Twin: Tuning Learning Rate and Weight Decay of Deep Homogeneous Classifiers without Validation

Twin: 无需验证的深度同质分类器学习率和权重衰减调优

Lorenzo Brigato, Stavroula Mougiakakou

发表机构 * ARTORG Center, University of Bern(伯恩大学ARTORG中心)

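AI summary Proposes Twin, which uses the margin-maximization dynamics of homogeneous networks and an empirical scaling law linking training and test losses to tune learning rate and weight decay without a validation set, achieving a mean absolute error of 1.28% relative to an Oracle baseline across 37 image classification configurations.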

Comments Accepted at TMLR

详情
AI中文摘要

我们介绍了Tune without Validation (Twin),一种简单有效的管道,用于调优同质分类器的学习率和权重衰减,无需验证集,消除了保留数据的需求并避免了两步过程。Twin利用了同质网络的边界最大化动态以及连接超参数配置下训练和测试损失的经验缩放定律。这种数学建模产生了一个依赖于区域的、无需验证的选择规则:在不可分离区域,训练损失在测试损失中是单调的,因此可以预测泛化;而在可分离区域,由于边界最大化,参数的范数成为泛化的可靠指标。在37个图像分类的数据集-架构配置中,我们证明Twin与使用测试准确率选择超参数的Oracle基线相比,平均绝对误差为1.28%。我们展示了Twin在验证数据稀缺的场景(如小数据 regime)或难以且昂贵收集的场景(如医学成像)中的优势。代码可在 https://github.com/lorenzobrigato/twin 获取。

英文摘要

We introduce Tune without Validation (Twin), a simple and effective pipeline for tuning learning rate and weight decay of homogeneous classifiers without validation sets, eliminating the need to hold out data and avoiding the two-step process. Twin leverages the margin-maximization dynamics of homogeneous networks and an empirical scaling law that links training and test losses across hyper-parameter configurations. This mathematical modeling yields a regime-dependent, validation-free selection rule: in the non-separable regime, training loss is monotonic in test loss and therefore predictive of generalization, whereas in the separable regime, the parameters' norm becomes a reliable indicator of generalization due to margin maximization. Across 37 dataset-architecture configurations for image classification, we demonstrate that Twin achieves a mean absolute error of 1.28% compared to an Oracle baseline that selects hyper-parameters using test accuracy. We demonstrate Twin's benefits in scenarios where validation data is scarce, such as small-data regimes, or difficult and costly to collect, as in medical imaging. Code is available at https://github.com/lorenzobrigato/twin.

2206.08598 2026-06-08 cs.LG stat.ML 版本更新

Characterizing Learning Dynamics under Relative Reparameterization of Singular Models

奇异模型相对重参数化下的学习动态表征

Pascal Mattia Esser, Frank Nielsen

发表机构 * Ludwig-Maximilians-Universität München(慕尼黑路易斯-马克西米利安大学) Sony Computer Science Laboratories Inc.(索尼计算机科学实验室)

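AI summary Addresses the slow convergence of singular models, whose parameter space does not map one-to-one onto the model space, with a relative reparameterization technique that extracts regular sub-models, and analyzes gradient descent convergence rates on Gaussian mixture models and neural networks.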

详情
AI中文摘要

分析统计模型学习的一种常见方法是考虑模型参数空间中的操作,但当参数空间与底层统计模型空间之间不存在一一映射时,这变得具有挑战性。这种“奇异模型”经常出现,并且由于吸引子行为,学习轨迹的收敛速度会特征性地降低。在这项工作中,我们考虑了参数空间的相对重参数化技术,该技术提供了一种从奇异模型中提取正则子模型的通用方法。以高斯混合模型和神经网络为例,我们从理论和数值上分析了两种参数化下梯度下降的收敛率。通过分析二阶方法和Fisher信息矩阵的显式性质,我们区分了由算法和内在信息几何方面引起的收敛行为差异。

英文摘要

A common way to analyze the learning of statistical models is to consider operations in the model's parameter space; however, this becomes challenging when there is no one-to-one mapping between the parameter space and the underlying statistical model space. Such "singular models" occur frequently and exhibit a characteristic decrease in the convergence speed of learning trajectories due to attractor behaviors. In this work, we consider a relative reparameterization technique of the parameter space, which yields a general method for extracting regular sub-models from singular models. Using Gaussian Mixture Models and Neural Networks as examples, we theoretically and numerically analyze the convergence rate of Gradient Descent under both parameterizations. By analyzing second-order methods and explicit properties of the Fisher Information Matrix, we distinguish between differences in convergence behavior arising from algorithmic and intrinsic information-geometric aspects.

2606.05759 2026-06-08 cs.CV

Physics-Guided Deep Unfolding for Blind Cross-Sensor Spectral Super-Resolution via Learning the Spectral Transformation Function

物理引导的深度展开网络用于盲跨传感器光谱超分辨率:通过学习光谱变换函数

Zhaolin Li, Jinsong Chen, Shanxin Guo, Tuo Zhang, Xinglong Zhang, Pan Chen

发表机构 * Center for Geo-Spatial Information, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences(地理信息中心,深圳先进技术研究院,中国科学院) University of Chinese Academy of Sciences(中国科学院大学) Shenzhen Engineering Laboratory of Ocean Environmental Big Data Analysis and Application(深圳海洋环境大数据分析与应用工程实验室)

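AI summary Proposes PGU-Net, a physics-guided deep unfolding network that addresses blind cross-sensor spectral super-resolution by jointly estimating the hyperspectral image and a learnable spectral transformation function through alternating optimization.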

详情
AI中文摘要

高光谱成像为定量遥感提供丰富的光谱信息,然而高光谱传感器成本高昂,因此在许多无人机部署中不可用。光谱超分辨率旨在从多光谱图像重建高光谱图像。大多数现有的SSR方法假设固定且已知的光谱响应函数,因此仅限于单传感器设置。在实际的跨传感器场景中,从HSI到MSI的光谱退化是未知的,并且随传感器特性和场景内容变化,这使得HSI重建病态。本文提出一种物理引导的深度展开网络,称为PGU-Net,通过联合估计HSI和可学习的光谱变换函数来解决盲跨传感器SSR。PGU-Net将交替优化过程展开为端到端可训练的多阶段架构,每个阶段依次更新HSI和STF。两个模块结合了可学习的近端网络和可微的闭式求解器,在保持强表示能力的同时实现物理可解释性。在具有多个SRF的基准数据集(CAVE和NTIRE 2022)上的实验表明,STF(退化算子)的准确恢复以及相对于最先进SSR方法的重建性能提升。此外,在真实无人机跨传感器数据集(Headwall Nano HSI和DJI P4多光谱MSI)上的评估验证了PGU-Net在真正盲条件下的有效性和鲁棒性,并表明估计的STF可能表现出与土地覆盖相关的差异。

英文摘要

Hyperspectral imaging provides rich spectral information for quantitative remote sensing, yet hyperspectral sensors remain costly and thus unavailable in many UAV deployments. Spectral super-resolution (SSR) seeks to reconstruct hyperspectral images (HSIs) from multispectral images (MSIs). Most existing SSR methods assume a fixed and known spectral response function (SRF) and are therefore limited to single-sensor settings. In practical cross-sensor scenarios, the spectral degradation from HSI to MSI is unknown and varies with sensor characteristics and scene content, which renders HSI reconstruction ill-posed. This paper proposes a physics-guided deep unfolding network, termed PGU-Net, to address blind cross-sensor SSR by jointly estimating the HSI and a learnable spectral transformation function (STF). PGU-Net unrolls an alternating optimization procedure into an end-to-end trainable multi-stage architecture, where each stage sequentially updates the HSI and the STF. Both modules combine learnable proximal networks with differentiable closed-form solvers, enabling physical interpretability while retaining strong representation capacity. Experiments on benchmark datasets (CAVE and NTIRE 2022) with multiple SRFs demonstrate accurate recovery of the STF (degradation operator) and improved reconstruction performance over state-of-the-art SSR methods. Furthermore, evaluations on a real UAV cross-sensor dataset (Headwall Nano HSI and DJI P4 Multispectral MSI) verify the effectiveness and robustness of PGU-Net under truly blind conditions, and suggest that the estimated STF may exhibit land-cover-related differences.

2510.17568 2026-06-08 cs.CV

PAGE-4D: VGGT-4D Perception via Disentangled Pose and Geometry Estimation

PAGE-4D: 通过解耦姿态与几何估计实现VGGT-4D感知

Kaichen Zhou, Yuhan Wang, Grace Chen, Xinhai Chang, Gaspard Beaudouin, Fangneng Zhan, Paul Pu Liang, Mengyu Wang

发表机构 * Harvard AI and Robotics Lab, Harvard University(哈佛人工智能与机器人实验室,哈佛大学) Media Lab and Electrical Engineering and Computer Science, Massachusetts Institute of Technology(媒体实验室和电气工程与计算机科学,麻省理工学院) Department of Computing, Imperial College London(计算系,帝国理工学院) Kempner Institute for the Study of Natural and Artificial Intelligence, Harvard University(哈佛大学自然与人工智能研究学院)

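AI summary Proposes PAGE-4D, which extends VGGT to dynamic scenes and uses a dynamics-aware aggregator to disentangle static and dynamic information, improving camera pose estimation, depth prediction, and point cloud reconstruction.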

Comments ICLR 2026, VGGT-4D, Dynamic VGGT

详情
AI中文摘要

最近的3D前馈模型,如视觉几何基础变换器(VGGT),在推断静态场景的3D属性方面表现出强大的能力。然而,由于这些模型通常在静态数据集上训练,它们在涉及复杂动态元素的现实场景中(例如移动的人或可变形物体如雨伞)往往表现不佳。为了解决这一限制,我们引入了PAGE-4D,一种将VGGT扩展到动态场景的前馈模型,能够实现相机姿态估计、深度预测和点云重建——全部无需后处理。多任务4D重建的一个核心挑战是任务之间的固有冲突:准确的相机姿态估计需要抑制动态区域,而几何重建则需要对其进行建模。为了解决这一矛盾,我们提出了一种动态感知聚合器,通过预测动态感知掩码来解耦静态和动态信息——抑制姿态估计的运动线索,同时放大几何重建的运动线索。大量实验表明,PAGE-4D在动态场景中始终优于原始VGGT,在相机姿态估计、单目和视频深度估计以及密集点图重建方面取得了更优的结果。必要的代码和额外演示可在链接:https://page4d.github.io/ 获取,包括训练和推理掩码变体以及仅训练掩码变体(=推理时的VGGT架构)。关键词:VGGT-4D,4D感知,动态场景重建。

英文摘要

Recent 3D feed-forward models, such as the Visual Geometry Grounded Transformer (VGGT), have shown strong capability in inferring 3D attributes of static scenes. However, since they are typically trained on static datasets, these models often struggle in real-world scenarios involving complex dynamic elements, such as moving humans or deformable objects like umbrellas. To address this limitation, we introduce PAGE-4D, a feed-forward model that extends VGGT to dynamic scenes, enabling camera pose estimation, depth prediction, and point cloud reconstruction, all without post-processing. A central challenge in multitask 4D reconstruction is the inherent conflict between tasks: accurate camera pose estimation requires suppressing dynamic regions, while geometry reconstruction requires modeling them. To resolve this tension, we propose a dynamics-aware aggregator that disentangles static and dynamic information by predicting a dynamics-aware mask, suppressing motion cues for pose estimation while amplifying them for geometry reconstruction. Extensive experiments show that PAGE-4D consistently outperforms the original VGGT in dynamic scenarios, achieving superior results in camera pose estimation, monocular and video depth estimation, and dense point map reconstruction. Code and additional demos are available at https://page4d.github.io/, including both the training-and-inference masking variant and the training-only masking variant (which has the same architecture as VGGT at inference). Keywords: VGGT-4D, 4D Perception, Dynamic Scene Reconstruction.

2510.21122 2026-06-08 cs.CV

NoisyGRPO: Incentivizing Multimodal CoT Reasoning via Noise Injection and Bayesian Estimation

NoisyGRPO:通过噪声注入和贝叶斯估计激励多模态CoT推理

Longtian Qiu, Shan Ning, Jiaxuan Sun, Xuming He

发表机构 * ShanghaiTech University(上海科技大学) Shanghai Engineering Research Center of Intelligent Vision and Imaging(上海智能视觉与成像工程研究中心) Lingang Laboratory(临港实验室)

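AI summary NoisyGRPO injects controllable noise into visual inputs to enhance exploration and models advantage estimation with a Bayesian framework, improving the generalization and robustness of multimodal large language models, especially small-scale ones.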

Comments Accepted by NeurIPS 2025. Project page is available at https://artanic30.github.io/project_pages/NoisyGRPO/

Journal ref Advances in Neural Information Processing Systems 38 (2026) 124239-124267

详情
AI中文摘要

强化学习(RL)在增强多模态大语言模型(MLLMs)的链式推理能力方面展现出潜力。然而,当应用于提升通用链式推理时,现有RL框架往往难以超越训练分布。为此,我们提出NoisyGRPO,一种系统化的多模态RL框架,通过在视觉输入中引入可控噪声以增强探索,并通过贝叶斯框架显式建模优势估计过程。具体而言,NoisyGRPO通过(1)噪声注入探索策略:用高斯噪声扰动视觉输入以鼓励探索更广泛的视觉场景;以及(2)贝叶斯优势估计:将优势估计建模为一个原理性的贝叶斯推断问题,其中注入的噪声水平作为先验,观察到的轨迹奖励作为似然。这种贝叶斯建模融合了两种信息源,以计算轨迹优势的稳健后验估计,有效引导MLLMs偏好视觉支撑的轨迹而非噪声轨迹。在标准链式推理质量、通用能力和幻觉基准测试中,NoisyGRPO显著提高了泛化能力和鲁棒性,尤其是在小规模MLLMs如Qwen2.5-VL 3B的RL设置中。项目页面可在https://artanic30.github.io/project_pages/NoisyGRPO/上获取。

英文摘要

Reinforcement learning (RL) has shown promise in enhancing the general Chain-of-Thought (CoT) reasoning capabilities of multimodal large language models (MLLMs). However, when applied to improve general CoT reasoning, existing RL frameworks often struggle to generalize beyond the training distribution. To address this, we propose NoisyGRPO, a systematic multimodal RL framework that introduces controllable noise into visual inputs for enhanced exploration and explicitly models the advantage estimation process via a Bayesian framework. Specifically, NoisyGRPO improves RL training by: (1) Noise-Injected Exploration Policy: Perturbing visual inputs with Gaussian noise to encourage exploration across a wider range of visual scenarios; and (2) Bayesian Advantage Estimation: Formulating advantage estimation as a principled Bayesian inference problem, where the injected noise level serves as a prior and the observed trajectory reward as the likelihood. This Bayesian modeling fuses both sources of information to compute a robust posterior estimate of trajectory advantage, effectively guiding MLLMs to prefer visually grounded trajectories over noisy ones. Experiments on standard CoT quality, general capability, and hallucination benchmarks demonstrate that NoisyGRPO substantially improves generalization and robustness, especially in RL settings with small-scale MLLMs such as Qwen2.5-VL 3B. The project page is available at https://artanic30.github.io/project_pages/NoisyGRPO/.
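The idea of fusing a noise-dependent prior with an observed reward can be made concrete with a conjugate-Gaussian sketch. The abstract does not give the actual formulas, so everything below is an assumption chosen for illustration: the prior mean `1 - noise_level`, the fixed variances, the group normalization step, and the function names `posterior_quality` and `group_advantages`.

```python
import statistics


def posterior_quality(noise_level, reward, prior_var=0.25, reward_var=0.25):
    """Posterior mean of trajectory quality under a Gaussian prior and likelihood."""
    # Assumption: more injected noise implies a lower prior expectation of quality.
    prior_mean = 1.0 - noise_level
    precision = 1.0 / prior_var + 1.0 / reward_var
    # Precision-weighted average of the prior mean and the observed reward.
    return (prior_mean / prior_var + reward / reward_var) / precision


def group_advantages(noise_levels, rewards):
    """Normalize posterior means within a group of sampled trajectories."""
    post = [posterior_quality(n, r) for n, r in zip(noise_levels, rewards)]
    mean = statistics.fmean(post)
    std = statistics.pstdev(post) or 1.0  # avoid dividing by zero for identical values
    return [(p - mean) / std for p in post]
```

With equal variances the posterior is the midpoint of the prior mean and the reward, so a trajectory that earns a high reward under heavy noise is credited less than one that earns the same reward from a clean input.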

2601.14637 2026-06-08 cs.CV cs.AI cs.CL cs.HC

Forest-Chat: Adapting Vision-Language Agents for Interactive Forest Change Analysis

Forest-Chat: 为交互式森林变化分析适应视觉-语言代理

James Brock, Ce Zhang, Nantheera Anantrasirichai

发表机构 * School of Computer Science, University of Bristol(布里斯托尔大学计算机科学学院) School of Geographical Sciences, University of Bristol(布里斯托尔大学地理科学学院)

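AI summary This paper proposes Forest-Chat, an LLM-driven agent for forest change analysis that supports natural language querying across multiple tasks, enabling accessible and interpretable forest change detection and semantic change interpretation.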

Comments 28 pages, 9 figures, 12 tables, Submitted to Ecological Informatics

详情
AI中文摘要

高分辨率卫星影像的普及与深度学习的进步为森林监测提供了新机遇。本文提出Forest-Chat,一种基于大语言模型的视觉-语言代理,支持多任务的交互式森林变化分析,包括变化检测、图像描述、对象计数、森林砍伐特征识别和变化推理。Forest-Chat基于多级变化解释(MCI)视觉-语言框架,结合零样本变化检测和多模态零样本变化描述与优化。引入Forest-Change数据集,包含双时相卫星影像、像素级变化掩码和语义变化描述。在Forest-Change数据集上,Forest-Chat在mIoU和BLEU-4指标上达到67.10%和40.17%,在LEVIR-MCI-Trees子集上达到88.13%和34.41%。零样本测试中,其在Forest-Change数据集上达到60.15%和34.00%,在LEVIR-MCI-Trees子集上达到47.32%和18.23%。进一步实验表明,描述优化能注入地理领域知识,但标签域迁移有限。这些发现表明,交互式、基于LLM的系统能支持可访问和可解释的森林变化分析。

英文摘要

The increasing availability of high-resolution satellite imagery, together with advances in deep learning, creates new opportunities for forest monitoring workflows. Two central challenges in this domain are pixel-level change detection and semantic change interpretation, particularly for complex forest dynamics. While large language models (LLMs) are increasingly adopted for data exploration, their integration with vision-language models (VLMs) for remote sensing image change interpretation (RSICI) remains underexplored, especially beyond urban environments. This paper introduces Forest-Chat, an LLM-driven agent for forest change analysis, enabling natural language querying across multiple RSICI tasks, including change detection and captioning, object counting, deforestation characterisation, and change reasoning. Forest-Chat builds upon a multi-level change interpretation (MCI) vision-language backbone with LLM-based orchestration, incorporating zero-shot change detection via AnyChange and multimodal LLM-based zero-shot change captioning and refinement. To support adaptation and evaluation in forest environments, we introduce the Forest-Change dataset, comprising bi-temporal satellite imagery, pixel-level change masks, and semantic change captions via human annotation and rule-based methods. Forest-Chat achieves mIoU and BLEU-4 scores of 67.10% and 40.17% on Forest-Change, and 88.13% and 34.41% on LEVIR-MCI-Trees, a tree-focused subset of LEVIR-MCI. In a zero-shot capacity, it achieves 60.15% and 34.00% on Forest-Change, and 47.32% and 18.23% on LEVIR-MCI-Trees. Further experiments demonstrate the value of caption refinement for injecting geographic domain knowledge into supervised captions, and the system's limited label domain transfer onto JL1-CD-Trees. These findings demonstrate that interactive, LLM-driven systems can support accessible and interpretable forest change analysis.

2505.19888 2026-06-08 cs.LG

Generalized and Personalized Federated Learning with Black-Box Foundation Models via Orthogonal Transformations

基于正交变换的联邦学习与个性化方法:通过黑盒基础模型

Eun Gyung Kong, Je Won Yeom, Yonghoon Jeon, Taesup Kim

发表机构 * Seoul National University(首尔国立大学) Mobilint, Inc.(Mobilint公司) Kakao Healthcare Corp.(Kakao医疗公司)

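AI summary This paper proposes FedOT, a framework that uses orthogonal transformations to achieve robust generalization and effective personalization in federated learning, improving performance in heterogeneous environments and outperforming baseline methods.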

Comments 31 pages, 5 figures

Journal ref Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 24567-24576, 2026

详情
AI中文摘要

联邦学习(FL)在保护数据隐私的同时促进去中心化模型训练。然而,在异构(非iid)环境中同时实现鲁棒泛化和有效个性化仍是一个严峻挑战。此外,基础模型(FMs)的广泛使用要求双重隐私保护:(a)保护敏感客户端数据和(b)保护服务器的知识产权。这需要严格黑盒访问FMs。为解决这些挑战,我们引入FedOT,一种针对黑盒FMs优化的联邦学习框架。FedOT采用共享的全局任务依赖分类器,同时通过客户端特定的正交变换实现本地适应,该变换应用于FMs嵌入之外。这种架构本质上保证FMs内部参数保持不可访问和未修改。通过强制正交性,FedOT有效缓解了跨不同客户端的梯度冲突,理论上有界,保持FMs表示的语义完整性,并在显著的数据异质性下实现稳健性能。全局和本地参数的协同优化最佳平衡了泛化和个性化,显著优于基线FL方法。广泛的实证分析,包括严格多种子验证和可扩展性评估,证实了FedOT的鲁棒性、效率和优越性能。

英文摘要

Federated Learning (FL) facilitates decentralized model training while preserving data privacy. However, achieving both robust generalization and effective personalization simultaneously in heterogeneous (non-IID) environments remains a formidable challenge. Furthermore, the widespread adoption of proprietary Foundation Models (FMs) introduces a critical requirement for dual privacy: (a) protecting sensitive client data and (b) securing the server's valuable intellectual property. This mandates strictly black-box access to the FM. To address these multifaceted challenges, we introduce FedOT, a novel FL framework optimized for black-box FMs. FedOT employs a shared global task-dependent classifier while facilitating local adaptation through client-specific orthogonal transformations applied externally to the FM embeddings. This architecture inherently guarantees that the FM's internal parameters remain inaccessible and unmodified. By enforcing orthogonality, FedOT effectively mitigates gradient conflicts across diverse clients (which are theoretically bounded), preserves the semantic integrity of the FM representations, and achieves robust performance under significant data heterogeneity. The synergy of global and local parameters optimally balances generalization and personalization, markedly outperforming baseline FL methods across diverse benchmarks. Extensive empirical analysis, including rigorous multi-seed validation and scalability assessments, substantiates the robustness, efficiency, and superior performance of FedOT.
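A small sketch shows why an orthogonal map applied outside the model preserves the geometry of its embeddings. The abstract does not say how orthogonality is enforced, so the Givens-rotation parameterization used here is an assumption, as are the function names `apply_givens` and `client_logits` and the plain linear classifier.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def apply_givens(vec, angles):
    """Rotate `vec` by a product of Givens rotations on coordinate pairs (i, i + 1)."""
    out = list(vec)
    for i, angle in enumerate(angles):
        c, s = math.cos(angle), math.sin(angle)
        x, y = out[i], out[i + 1]
        out[i], out[i + 1] = c * x - s * y, s * x + c * y
    return out


def client_logits(embedding, client_angles, shared_weights):
    """Client-specific rotation of a frozen embedding, then a shared linear head."""
    z = apply_givens(embedding, client_angles)
    return [dot(w, z) for w in shared_weights]
```

Each Givens rotation is orthogonal, so their product is too. Norms and pairwise inner products of the embeddings are therefore unchanged for any choice of angles, and only the client's angles need to be learned locally while the embedding model itself is never touched.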

2502.21123 2026-06-08 cs.LG cs.AI

Causality Is Key to Understand and Balance Multiple Goals in Trustworthy ML and Foundation Models

因果关系是理解和平衡可信机器学习与基础模型中多个目标的关键

Ruta Binkyte, Ivaxi Sheth, Zhijing Jin, Mohammad Havaei, Bernhard Schölkopf, Mario Fritz

发表机构 * CISPA Helmholtz Center for Information Security(CISPA海德堡信息安全中心) Max Planck Institute for Intelligent Systems, Tübingen(马克斯·普朗克智能系统研究所(图宾根)) Google Research(谷歌研究) ETH Zürich(苏黎世联邦理工学院) University of Toronto(多伦多大学)

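AI summary This paper advocates integrating causal methods into machine learning to balance the trade-offs among trustworthiness principles such as fairness, privacy, robustness, accuracy, and explainability, and examines how causality can be practically integrated into foundation models.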

详情
AI中文摘要

确保机器学习系统的可信度至关重要,因为它们日益嵌入高风险领域。本文主张将因果方法集成到机器学习中,以应对可信机器学习关键原则(包括公平性、隐私、鲁棒性、准确性和可解释性)之间的权衡。虽然这些目标理想情况下应同时满足,但它们通常被孤立地处理,导致冲突和次优解决方案。借鉴因果在ML中成功协调目标(如公平性与准确性,或隐私与鲁棒性)的现有应用,本文认为因果方法对于平衡可信ML和基础模型中的多个竞争目标至关重要。除了强调这些权衡,我们考察了如何将因果实际集成到ML和基础模型中,提供增强其可靠性和可解释性的解决方案。最后,我们讨论了采用因果框架的挑战、局限性和机遇,为更负责任和合乎伦理的AI系统铺平道路。

英文摘要

Ensuring trustworthiness in machine learning (ML) systems is crucial as they become increasingly embedded in high-stakes domains. This paper advocates for integrating causal methods into machine learning to navigate the trade-offs among key principles of trustworthy ML, including fairness, privacy, robustness, accuracy, and explainability. While these objectives should ideally be satisfied simultaneously, they are often addressed in isolation, leading to conflicts and suboptimal solutions. Drawing on existing applications of causality in ML that successfully align goals such as fairness and accuracy or privacy and robustness, this paper argues that a causal approach is essential for balancing multiple competing objectives in both trustworthy ML and foundation models. Beyond highlighting these trade-offs, we examine how causality can be practically integrated into ML and foundation models, offering solutions to enhance their reliability and interpretability. Finally, we discuss the challenges, limitations, and opportunities in adopting causal frameworks, paving the way for more accountable and ethically sound AI systems.

2505.13140 2026-06-08 cs.CV

CacheFlow: Fast Human Motion Prediction by Cached Normalizing Flow

CacheFlow: 通过缓存归一化流实现快速的人体运动预测

Takahiro Maeda, Jinkun Cao, Norimichi Ukita, Kris Kitani

Affiliations * Toyota Technological Institute; Robotics Institute, Carnegie Mellon University

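AI summary CacheFlow achieves fast 3D human motion prediction by precomputing and caching the outputs of a normalizing-flow generative model, running markedly faster than prior methods while preserving prediction accuracy and model expressiveness.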

Comments Accepted at Transactions on Machine Learning Research (TMLR). See https://openreview.net/forum?id=icq5659pQt

Journal ref Transactions on Machine Learning Research, 2026

Details
AI Chinese abstract

许多用于3D人体运动预测的密度估计技术需要大量推理时间,通常超过预测时间范围。为满足更快密度估计的需求,我们提出了一种新的基于流的人体运动预测方法,称为CacheFlow。与时间效率较低的以往条件生成模型不同,CacheFlow利用无条件的流生成模型,将高斯混合转化为未来运动的密度。流生成模型的计算结果可以预先计算并缓存。然后,对于条件预测,我们寻求一个从历史轨迹到高斯混合中样本的映射。该映射可由一个轻量得多的模型完成,因此相比典型的条件流模型节省了显著的计算开销。通过这种两阶段方法和缓存慢速流模型的计算结果,我们构建了CacheFlow,且不损失预测精度和模型表达能力。此推理过程大约在1毫秒内完成,在Human3.6M和AMASS等标准基准上比之前的VAE方法快4倍,比之前的基于扩散的方法快30倍。此外,我们的方法在Human3.6M数据集上展示了改进的密度估计精度,并与一种SOTA方法具有可比的预测精度。我们的代码和模型可在 https://github.com/meaten/CacheFlow 获取。

English abstract

Many density estimation techniques for 3D human motion prediction require a significant amount of inference time, often exceeding the duration of the predicted time horizon. To address the need for faster density estimation for 3D human motion prediction, we introduce a novel flow-based method for human motion prediction called CacheFlow. Unlike previous conditional generative models that suffer from poor time efficiency, CacheFlow takes advantage of an unconditional flow-based generative model that transforms a Gaussian mixture into the density of future motions. The results of the computation of the flow-based generative model can be precomputed and cached. Then, for conditional prediction, we seek a mapping from historical trajectories to samples in the Gaussian mixture. This mapping can be done by a much more lightweight model, thus saving significant computation overhead compared to a typical conditional flow model. In such a two-stage fashion and by caching results from the slow flow model computation, we build our CacheFlow without loss of prediction accuracy and model expressiveness. This inference process is completed in approximately one millisecond, making it 4 times faster than previous VAE methods and 30 times faster than previous diffusion-based methods on standard benchmarks such as Human3.6M and AMASS datasets. Furthermore, our method demonstrates improved density estimation accuracy and comparable prediction accuracy to a SOTA method on Human3.6M. Our code and models are available at https://github.com/meaten/CacheFlow.