arXivDaily arXiv每日学术速递 周一至周五更新
重置
全部学科分类 2094
热门方向导航
2510.18383 2026-06-19 cs.CL cs.AI 版本更新

MENTOR: Reinforcement Learning via Flexible Teacher-Optimized Rewards for Tool-Use Distillation

MENTOR: 通过灵活的教师优化奖励进行工具使用蒸馏的强化学习

ChangSu Choi, Hoyun Song, Dongyeon Kim, WooHyeon Jung, Minkyung Cho, Sunjin Park, NohHyeob Bae, Seona Yu, KyungTae Lim

发表机构 * Seoul National University of Science and Technology(首尔科学技术大学) Korea Advanced Institute of Science and Technology(韩国科学技术院) LG CNS

AI总结 提出MENTOR方法,通过灵活的教师优化奖励结构,平衡行为对齐与下游性能,提升小模型在工具使用任务中的域外泛化能力。

详情
AI中文摘要

将大型语言模型(LLMs)的工具使用能力蒸馏到小型语言模型(SLMs)中对其实际应用至关重要。主要方法监督微调(SFT)由于与静态教师轨迹的刚性对齐,导致域外(OOD)泛化性能较差。虽然强化学习(RL)提供了一种替代方案,但SLMs的能力限制带来了严峻的困境:稀疏的结果奖励提供的指导不足,而严格的轨迹匹配施加了过于严格的约束。为了弥合这一能力驱动的差距,我们提出了MENTOR,它引入了一种灵活且过程感知的奖励结构。MENTOR不强制执行刚性复制,而是利用教师的参考来指导工具使用行为,平衡行为对齐与下游性能。在可控可执行工具基准上的大量实验表明,与SFT和严格RL基线相比,MENTOR提高了OOD工具使用性能。我们的研究结果表明,在可验证的工具使用环境中,灵活的工具使用对齐比严格的轨迹复制为开发适应性小模型提供了更有效的方法。

英文摘要

Distilling the tool-use capabilities of large language models (LLMs) into small language models (SLMs) is essential for their practical application. The predominant approach, supervised fine-tuning (SFT), suffers from poor out-of-domain (OOD) generalization due to its rigid alignment with static teacher trajectories. While reinforcement learning (RL) offers an alternative, the capacity limitations of SLMs pose a severe dilemma: sparse outcome rewards provide insufficient guidance, whereas strict trajectory matching imposes overly restrictive constraints. To bridge this capacity-driven gap, we propose MENTOR, which introduces a flexible yet process-aware reward structure. Instead of enforcing rigid replication, MENTOR uses the teacher's reference to guide tool-use behavior, balancing behavioral alignment with downstream performance. Extensive experiments on controlled executable-tool benchmarks demonstrate that MENTOR improves OOD tool-use performance compared to SFT and strict RL baselines. Our findings suggest that within verifiable tool-use environments, flexible tool-use alignment offers a more effective approach than strict trajectory replication for developing adaptable small models.

2510.21978 2026-06-19 cs.LG cs.AI 版本更新

Beyond Reasoning Gains: Mitigating General-Capability Forgetting in Large Reasoning Models

超越推理增益:缓解大型推理模型中的通用能力遗忘

Hoang Phan, Xianjun Yang, Yuanshun Yao, Jingyu Zhang, Shengjie Bi, Xiaocheng Tang, Madian Khabsa, Lijuan Liu, Deren Lei

发表机构 * Meta Superintelligence Labs(Meta超智能实验室) New York University(纽约大学) Johns Hopkins University(约翰霍普金斯大学)

AI总结 针对强化学习训练导致推理模型遗忘基础能力的问题,提出RECAP重放策略,通过动态目标重加权在线调整训练重点,在保持通用能力的同时提升推理性能。

详情
AI中文摘要

基于可验证奖励的强化学习(RLVR)在数学和多模态推理方面取得了显著进展,并已成为当代语言和视觉-语言模型的标准后训练范式。然而,RLVR方法引入了能力退化的重大风险,即模型在长时间训练后,若未采用正则化策略,会遗忘基础技能。我们通过实验证实了这一担忧,观察到开源推理模型在感知和忠实性等核心能力上出现性能下降。虽然施加KL散度等正则化项有助于防止偏离基础模型,但这些项是在当前任务上计算的,因此不能保证保留更广泛的知识。同时,跨异构领域的经验回放使得决定每个目标应获得多少训练权重变得困难。为解决这一问题,我们提出RECAP——一种具有动态目标重加权的重放策略,用于通用知识保留。我们的重加权机制利用短期收敛和不稳定信号在线自适应,将后训练焦点从饱和目标转移到表现不佳或不稳定的目标。我们的方法是端到端的,可直接应用于现有RLVR流程,无需训练额外模型或进行繁重调优。在Qwen2.5-VL-3B和Qwen2.5-VL-7B上的广泛实验证明了我们方法的有效性,该方法不仅保留了通用能力,还通过实现任务内奖励的更灵活权衡提升了推理性能。

英文摘要

Reinforcement learning with verifiable rewards (RLVR) has delivered impressive gains in mathematical and multimodal reasoning and has become a standard post-training paradigm for contemporary language and vision-language models. However, the RLVR recipe introduces a significant risk of capability regression, in which models forget foundational skills after prolonged training without employing regularization strategies. We empirically confirm this concern, observing that open-source reasoning models suffer performance degradation on core capabilities such as perception and faithfulness. While imposing regularization terms like KL divergence can help prevent deviation from the base model, these terms are computed on the current task and therefore do not guarantee preservation of broader knowledge. Meanwhile, commonly used experience replay across heterogeneous domains makes it nontrivial to decide how much training emphasis each objective should receive. To address this, we propose RECAP-a replay strategy with dynamic objective reweighting for general knowledge preservation. Our reweighting mechanism adapts online using short-horizon signals of convergence and instability, shifting the post-training focus away from saturated objectives and toward underperforming or volatile ones. Our method is end-to-end and readily applicable to existing RLVR pipelines without training additional models or heavy tuning. Extensive experiments on benchmarks using Qwen2.5-VL-3B and Qwen2.5-VL-7B demonstrate the effectiveness of our method, which not only preserves general capabilities but also improves reasoning by enabling more flexible trade-offs among in-task rewards.

2510.20454 2026-06-19 cs.LG 版本更新

Capturing Intransitive Dominance in Tennis Forecasting: A Graph Neural Network Approach

网球预测中非传递性优势的捕捉:一种图神经网络方法

Lawrence Clegg, John Cartlidge

发表机构 * School of Engineering Mathematics and Technology, University of Bristol(布里斯托大学工程数学与技术学院)

AI总结 针对网球中常见的非传递性优势(A胜B,B胜C,C胜A),提出图神经网络模型,通过时间有向图建模历史比赛结果,捕捉被传递性评级系统忽略的预测信号,与加权Elo结合后显著提升预测性能。

Comments 41 pages, 7 figures. Major revision reframing the paper from betting-market inefficiency toward intransitivity analysis, forecast complementarity, and robustness. Added forecast-encompassing tests, new intransitivity measures, robustness analyses, and expanded appendices

详情
AI中文摘要

非传递性球员优势(即球员A击败B,B击败C,但C击败A)在竞技网球中很常见。然而,很少有已知的尝试将其纳入预测方法中。我们通过一种图神经网络方法来解决这个问题,该方法通过时间有向图显式建模这些非传递性关系,其中球员作为节点,他们的历史比赛结果作为有向边。我们的模型(准确率65.7%,Brier分数0.214)与加权Elo等已建立的评级系统相比具有竞争力。尽管它在无条件准确性上没有超越基线,但一项预测包含测试表明它携带了互补信息。组合预测显著优于加权Elo,并且有迹象表明,在我们的模型针对的非传递性对决中,增益增长更强烈。因此,基于图的球员交互表示捕捉了传递性评级系统丢弃的预测信号,即使在没有共同对手的球员之间也是如此。

英文摘要

Intransitive player dominance, where player A beats B, B beats C, but C beats A, is common in competitive tennis. Yet, there are few known attempts to incorporate it within forecasting methods. We address this problem with a graph neural network approach that explicitly models these intransitive relationships through temporal directed graphs, with players as nodes and their historical match outcomes as directed edges. Our model (65.7% accuracy, 0.214 Brier score) forecasts competitively with established rating systems such as Weighted Elo. Although it does not improve on the baseline in unconditional accuracy, a forecast-encompassing test shows that it carries complementary information. A combined forecast significantly outperforms Weighted Elo, and there is some indication that the gain grows more strongly on the intransitive matchups our model targets. A graph-based representation of player interactions thus captures a forecasting signal that transitive rating systems discard, even between players who share no common opponents.

2510.19893 2026-06-19 cs.LG 版本更新

EQPO: Equitable Group Relative Policy Optimization for Clinical Reasoning

EQPO: 面向临床推理的公平群体相对策略优化

Shiqi Dai, Wei Dai, Jiaee Cheong, Paul Pu Liang

发表机构 * MIT(麻省理工学院) Harvard University(哈佛大学)

AI总结 提出EQPO分层强化学习方法,通过自适应重加权样本促进异质临床人群的均衡学习,在7个诊断基准上降低F1标准差43.9%,缩小预测公平差距27.2%。

Comments Accepted as Oral on NeurIPS 2025 GenAI4Health Workshop

详情
AI中文摘要

医疗AI系统展示了令人印象深刻的诊断性能,但它们在不同人口统计群体之间通常表现出不均匀的准确性,使代表性不足的人群处于不利地位。尽管多模态推理基础模型推动了临床诊断的发展,基于强化学习的后训练倾向于吸收并放大多数主导训练语料中存在的偏见。我们提出公平群体相对策略优化(EQPO),一种分层强化学习方法,通过根据子群表示、任务难度和数据来源自适应地重新加权样本,鼓励跨异质临床人群的平衡学习。由于人口统计注释在真实临床数据中经常缺失,EQPO还在不可用时应用无监督聚类来恢复潜在子群。在覆盖5种模态(X射线、CT、皮肤镜、乳腺X线摄影、超声)的7个诊断基准上,EQPO在QoQ-Med3-8B上相比原始GRPO将F1标准差降低43.9%,最大跨群体F1差距降低42.7%,并在MedGemma-4B上将预测公平差距缩小27.2%(相比有偏减轻的RL基线),同时即使没有任何人口统计标签也将F1提高12.5%。检查训练轨迹显示,EQPO在优化过程中稳步提高公平性,而基线方法的公平性随训练进行而下降,并且发现的隐式群体保持稳定并与掩蔽的人口统计属性对齐。我们进一步发布了EquiMedGemma-4B和EquiQoQ-Med3-8B,这两种具有公平意识的临床VLLM在显著缩小人口统计差距的同时达到了最先进的准确性。

英文摘要

Medical AI systems demonstrated impressive diagnostic performance, yet they routinely show uneven accuracy across demographic groups, disadvantaging underrepresented populations. Although multimodal reasoning foundation models have pushed clinical diagnosis forward, reinforcement learning-based post-training tends to absorb and magnify the biases present in majority-dominated training corpora. We propose Equitable Group Relative Policy Optimization (EQPO), a hierarchical reinforcement learning method that encourages balanced learning across heterogeneous clinical populations by adaptively reweighting samples according to subgroup representation, task difficulty, and data source. As demographic annotations are frequently missing in real-world clinical data, EQPO additionally applies unsupervised clustering to recover latent subpopulations when they are unavailable. On 7 diagnostic benchmarks covering 5 modalities (X-ray, CT, dermoscopy, mammography, ultrasound), EQPO reduces F1 standard deviation by 43.9% and the maximum cross-group F1 gap by 42.7% on QoQ-Med3-8B over vanilla GRPO, and narrows predictive parity gaps by 27.2% on MedGemma-4B over bias-mitigated RL baselines while raising F1 by 12.5% even without any demographic labels. Examining the training trajectory shows that EQPO steadily improves fairness over the course of optimization, in contrast to baseline methods whose fairness degrades as training proceeds, and the discovered implicit groups remain stable and align with masked demographic attributes. We further release EquiMedGemma-4B and EquiQoQ-Med3-8B, equitability-aware clinical VLLMs that attain state-of-the-art accuracy with markedly smaller demographic gaps.

2510.16311 2026-06-19 cs.LG 版本更新

Toward General Digraph Contrastive Learning: A Dual Spatial Perspective

面向一般有向图对比学习:双空间视角

Zhengyu Wu, Daohan Su, Yang Zhang, Xunkai Li, Rong-Hua Li, Guoren Wang

发表机构 * National University of Singapore(新加坡国立大学) University of Science and Technology of China(中国科学技术大学)

AI总结 提出S2-DiGCL框架,从复数域和实数域双空间视角对有向图进行对比学习,通过磁拉普拉斯自适应调制和路径子图增强,在节点分类和链接预测任务上分别提升4.41%和4.34%。

详情
AI中文摘要

图对比学习(GCL)已成为一种从图中提取一致表示而无需标签信息的强大工具。然而,现有方法主要关注无向图,忽略了在实际网络(如社交网络和推荐系统)中基础且不可或缺的关键方向信息。本文提出了S2-DiGCL,一种新颖的框架,强调从复杂域和实数域视角对有向图进行对比学习的空间洞察。从复数域视角,S2-DiGCL在磁拉普拉斯中引入个性化扰动,以自适应地调制边相位和方向语义。从实数域视角,它采用基于路径的子图增强策略,捕捉细粒度的局部不对称性和拓扑依赖性。通过联合利用这两个互补的空间视图,S2-DiGCL构建了高质量的正负样本,从而实现更通用和鲁棒的有向图对比学习。在7个真实有向图数据集上的大量实验证明了我们方法的优越性,在监督和无监督设置下,节点分类和链接预测分别实现了4.41%和4.34%的性能提升,达到了最先进水平。

英文摘要

Graph Contrastive Learning (GCL) has emerged as a powerful tool for extracting consistent representations from graphs, independent of labeled information. However, existing methods predominantly focus on undirected graphs, disregarding the pivotal directional information that is fundamental and indispensable in real-world networks (e.g., social networks and recommendations).In this paper, we introduce S2-DiGCL, a novel framework that emphasizes spatial insights from complex and real domain perspectives for directed graph (digraph) contrastive learning. From the complex-domain perspective, S2-DiGCL introduces personalized perturbations into the magnetic Laplacian to adaptively modulate edge phases and directional semantics. From the real-domain perspective, it employs a path-based subgraph augmentation strategy to capture fine-grained local asymmetries and topological dependencies. By jointly leveraging these two complementary spatial views, S2-DiGCL constructs high-quality positive and negative samples, leading to more general and robust digraph contrastive learning. Extensive experiments on 7 real-world digraph datasets demonstrate the superiority of our approach, achieving SOTA performance with 4.41% improvement in node classification and 4.34% in link prediction under both supervised and unsupervised settings.

2510.08807 2026-06-19 cs.RO cs.LG 版本更新

Humanoid Everyday: A Comprehensive Robotic Dataset for Open-World Humanoid Manipulation

Humanoid Everyday:面向开放世界人形机器人操作的综合机器人数据集

Zhenyu Zhao, Hongyi Jing, Xiawei Liu, Jiageng Mao, Abha Jha, Hanwen Yang, Rong Xue, Sergey Zakharov, Vitor Guizilini, Yue Wang

发表机构 * University of Southern California(南加州大学) Toyota Research Institute(丰田研究院)

AI总结 提出Humanoid Everyday数据集,包含10.3k轨迹、260个任务的多模态数据,用于人形机器人灵巧操作、人机交互和移动操作研究,并配套云评估平台。

详情
AI中文摘要

从运动到灵巧操作,人形机器人在展示复杂的全身能力方面取得了显著进展。然而,当前大多数机器人学习数据集和基准主要关注固定机器人臂,少数现有人形数据集要么局限于固定环境,要么任务多样性有限,通常缺乏人机交互和下肢运动。此外,缺乏用于在人形数据上对基于学习的策略进行基准测试的标准化评估平台。在这项工作中,我们提出了Humanoid Everyday,一个大规模且多样化的人形操作数据集,其特点是涉及灵巧物体操作、人机交互、运动集成动作等广泛的任务多样性。利用高效的人工监督遥操作流水线,Humanoid Everyday聚合了高质量的多模态感官数据,包括RGB、深度、LiDAR和触觉输入,以及自然语言注释,包含10.3k条轨迹和超过300万帧数据,涵盖7个大类共260个任务。此外,我们对数据集上的代表性策略学习方法进行了分析,提供了它们在不同任务类别中的优势和局限性的见解。为了标准化评估,我们引入了一个基于云的评估平台,允许研究人员在我们的受控环境中无缝部署他们的策略并接收性能反馈。通过发布Humanoid Everyday以及我们的策略学习分析和标准化的基于云的评估平台,我们旨在推进通用人形操作的研究,并为现实世界中更有能力和具身化的机器人代理奠定基础。我们的数据集、数据收集代码和云评估网站在我们的项目网站上公开发布。

英文摘要

From loco-motion to dextrous manipulation, humanoid robots have made remarkable strides in demonstrating complex full-body capabilities. However, the majority of current robot learning datasets and benchmarks mainly focus on stationary robot arms, and the few existing humanoid datasets are either confined to fixed environments or limited in task diversity, often lacking human-humanoid interaction and lower-body locomotion. Moreover, there are a few standardized evaluation platforms for benchmarking learning-based policies on humanoid data. In this work, we present Humanoid Everyday, a large-scale and diverse humanoid manipulation dataset characterized by extensive task variety involving dextrous object manipulation, human-humanoid interaction, locomotion-integrated actions, and more. Leveraging a highly efficient human-supervised teleoperation pipeline, Humanoid Everyday aggregates high-quality multimodal sensory data, including RGB, depth, LiDAR, and tactile inputs, together with natural language annotations, comprising 10.3k trajectories and over 3 million frames of data across 260 tasks across 7 broad categories. In addition, we conduct an analysis of representative policy learning methods on our dataset, providing insights into their strengths and limitations across different task categories. For standardized evaluation, we introduce a cloud-based evaluation platform that allows researchers to seamlessly deploy their policies in our controlled setting and receive performance feedback. By releasing Humanoid Everyday along with our policy learning analysis and a standardized cloud-based evaluation platform, we intend to advance research in general-purpose humanoid manipulation and lay the groundwork for more capable and embodied robotic agents in real-world scenarios. Our dataset, data collection code, and cloud evaluation website are made publicly available on our project website.

2509.13972 2026-06-19 cs.RO 版本更新

BIM Informed Visual SLAM for Construction Environments

BIM 引导的视觉 SLAM 在建筑环境中的应用

Asier Bikandi-Noya, Miguel Fernandez-Cortizas, Muhammad Shaheer, Ali Tourani, Holger Voos, Jose Luis Sanchez-Lopez

发表机构 * Automation and Robotics Research Group, Interdisciplinary Centre for Security, Reliability, and Trust (SnT), University of Luxembourg(自动化与机器人研究组,安全、可靠与信任跨学科研究中心(SnT),卢森堡大学)

AI总结 针对建筑环境中视觉SLAM轨迹漂移问题,提出利用建筑信息模型(BIM)的结构先验增强RGB-D SLAM系统,通过墙面对应与几何约束优化减少漂移,提升全局一致性,实验显示轨迹误差降低25.23%,地图精度提升7.14%。

Comments 9 pages, 7 tables, 4 figures

详情
AI中文摘要

监测建筑施工现场需要将计划设计与实际建造状态进行比较,而同步定位与地图构建(SLAM)技术可以实时估计实际状态。然而,视觉SLAM在建筑环境中容易产生轨迹漂移,生成的地图在几何上与实际环境不准确。为解决这一局限,我们利用从建筑信息模型(BIM)导出的结构先验增强现有的RGB-D SLAM系统。该系统将检测到的墙面与BIM中的对应墙面关联,并将这些对应关系作为几何约束加入后端优化,从而减少漂移并增强全局一致性。所提方法实时运行,并在多个真实建筑工地上验证,与最先进的基线相比,平均轨迹误差降低25.23%,地图精度提升7.14%。鲁棒性分析进一步表明,该方法对不完整的BIM数据以及计划模型与实际环境之间的几何差异具有韧性。

英文摘要

Monitoring building construction sites requires comparing the as-planned design with the as-built state, which can be estimated in real time using Simultaneous Localization and Mapping (SLAM) techniques. However, visual SLAM is prone to trajectory drift in construction environments, producing maps that are geometrically inaccurate with the actual environment. To address this limitation, we augment an existing RGB-D SLAM system with structural priors derived from the Building Information Model (BIM). The system associates detected walls with their BIM counterparts and includes these correspondences as geometric constraints in the back-end optimization, reducing drift and enhancing global consistency. The proposed method operates in real time and is validated on multiple real construction sites, achieving an average trajectory error reduction of 25.23% and a 7.14% improvement in map accuracy over state-of-the-art baselines. Robustness analyses further demonstrate resilience to incomplete BIM data and geometric discrepancies between as-planned models and the as-built environment.

2510.00831 2026-06-19 cs.AI cs.LG eess.SP 版本更新

Controlled Comparison of Machine Learning Models for Fault Classification and Localization in Power System Protection

电力系统保护中故障分类与定位的机器学习模型受控比较

Julian Oelhaf, Georg Kordowich, Changhun Kim, Paula Andrea Pérez-Toro, Christian Bergler, Andreas Maier, Johann Jäger, Siming Bayer

发表机构 * Department of Electrical Engineering, Media and Computer Science, Ostbayerische Technische Hochschule Amberg-Weiden(奥贝格-魏登应用技术大学电气工程、媒体与计算机科学系)

AI总结 在统一电磁暂态数据集和10-50ms决策窗口下,对比机器学习模型在故障分类与定位中的性能,发现分类在10ms时F1>0.98,定位误差稳定在约10%线路长度。

Comments Accepted at IEEE PES Innovative Smart Grid Technologies Europe 2026 (ISGT Europe 2026). Pre-camera-ready author version; final proceedings version may differ

详情
AI中文摘要

现代电力系统因逆变器基和分布式能源的集成而日益复杂,挑战了传统保护方案的可靠性,并推动了机器学习在保护任务中的应用。然而,由于不同研究中的数据集、传感假设和决策时域各异,已发表的结果往往难以比较。本文在相同的传感、时序和验证条件下,基于公共电磁暂态数据集,使用10-50ms的决策窗口以反映保护相关时间尺度,对故障分类(FC)和故障定位(FL)的机器学习模型进行了受控比较。对于FC,性能最佳的非线性模型在10ms时F1分数已超过0.98,而低容量模型在较短时域下性能下降,但随窗口延长而改善,表明相关故障类型信息在最早暂态中已存在。对于FL,顶级模型在所有评估时域下达到约10%归一化线路长度的稳定定位误差,而较弱模型形成明显分离的第二性能层级。线路解析分析显示,定位精度随电网段变化,表明存在拓扑依赖的难度而非仅时间上下文不足。这些发现为比较两个信息需求根本不同的保护任务中的机器学习模型提供了受控参考。

英文摘要

The increasing complexity of modern power systems, driven by the integration of inverter-based and distributed energy resources, challenges the reliability of conventional protection schemes and motivates the use of machine learning for protection tasks. However, published results are often difficult to compare because datasets, sensing assumptions, and decision horizons vary across studies. This paper presents a controlled comparison of machine learning models for fault classification (FC) and fault localization (FL) under identical sensing, timing, and validation conditions on a common electromagnetic transient dataset, using decision windows of 10-50 ms to reflect protection-relevant time scales. For FC, the best-performing nonlinear models achieve F1 scores above 0.98 already at 10 ms, while lower-capacity models degrade at shorter horizons but improve with longer windows, indicating that relevant fault-type information is already present in the earliest transient. For FL, the top-performing models reach a stable localization error of about 10 % of normalized line length across all evaluated horizons, while weaker models form a clearly separated second performance tier. Line-resolved analysis shows that localization accuracy varies across grid segments, indicating topology-dependent difficulty rather than insufficient temporal context alone. These findings provide a controlled reference for comparing machine learning models across two protection tasks with fundamentally different information requirements.

2509.25148 2026-06-19 cs.AI 版本更新

AAPA: Adversarially Anchored Preference Alignment for Post-Training of Large Language Models

AAPA:用于大型语言模型后训练的对抗锚定偏好对齐

Faqiang Qian, Kang An, Weikun Zhang, Ziliang Wang, Xuhui Zheng, Liangjian Wen, Yong Dai, Mengya Gao, Yichao Wu

发表机构 * Southwest University of Finance and Economics(西南财经大学)

AI总结 提出AAPA框架,通过固定轻量判别器对策略输出与专家响应进行句子级对抗锚定,增强SFT、GRPO等后训练目标,在指令遵循基准上持续提升性能。

详情
AI中文摘要

大型语言模型的后训练对齐通常结合了专家演示上的监督微调(SFT)和来自偏好或可验证反馈的强化学习(RL)。SFT提供了有用的行为锚点,但可能过拟合静态演示,而RL鼓励探索但可能偏离专家行为或利用不完美的奖励。我们提出\textbf{AAPA}(\emph{对抗锚定偏好对齐}),这是一个插件式框架,通过句子级对抗锚定信号增强现有的后训练目标。AAPA使用固定的轻量判别器将策略生成结果与离线预收集的专家响应进行比较,因此在策略优化期间既不需要在线教师推理,也不需要判别器协同训练。相同的锚定项可以添加到SFT、GRPO和CHORD中,同时保留其原始训练流程。在指令遵循基准上的实验表明,AAPA在不同模型规模上一致地改善了相应的基础目标。特别是,分阶段的AAPA配置在\texttt{Qwen3-0.6B}上比强GRPO基线提高了5.77%,在\texttt{Qwen3-4B}上提高了3.75%。对响应长度、对数概率分布和判别器变体的进一步分析表明,对抗锚定为偏好优化提供了稳定的语义基础信号。代码可在\url{this https URL}获取。

英文摘要

Post-training alignment of large language models often combines supervised fine-tuning (SFT) on expert demonstrations with reinforcement learning (RL) from preference or verifiable feedback. SFT provides a useful behavioral anchor but can overfit to static demonstrations, whereas RL encourages exploration but may drift from expert behavior or exploit imperfect rewards. We propose \textbf{AAPA} (\emph{Adversarially Anchored Preference Alignment}), a plug-in framework that augments existing post-training objectives with a sentence-level adversarial anchoring signal. AAPA compares policy rollouts with offline, pre-collected expert responses using a fixed lightweight discriminator, and therefore requires neither online teacher inference nor discriminator co-training during policy optimization. The same anchoring term can be added to SFT, GRPO, and CHORD while preserving their original training pipelines. Experiments on instruction-following benchmarks show that AAPA consistently improves the corresponding base objectives across model scales. In particular, the staged AAPA configuration improves over a strong GRPO baseline by 5.77\% on \texttt{Qwen3-0.6B} and 3.75\% on \texttt{Qwen3-4B}. Further analyses on response length, log-probability distributions, and discriminator variants suggest that adversarial anchoring provides a stable semantic grounding signal for preference optimization. Code is available at \url{https://github.com/IsFaqq/AAPA}.

2509.19658 2026-06-19 cs.RO cs.AI 版本更新

RoboSSM: Scalable In-context Imitation Learning via State-Space Models

RoboSSM: 基于状态空间模型的可扩展上下文模仿学习

Youngju Yoo, Jiaheng Hu, Yifeng Zhu, Bo Liu, Qiang Liu, Roberto Martín-Martín, Peter Stone

发表机构 * The University of Texas at Austin(德克萨斯大学奥斯汀分校) KAIST(韩国科学技术院) FAIR at Meta(元宇宙FAIR) Amazon(亚马逊) Sony AI(索尼人工智能)

AI总结 提出RoboSSM,用状态空间模型替代Transformer实现上下文模仿学习,在LIBERO基准上对未见和长时任务泛化更优,首次证明SSM是ICIL高效可扩展的骨干网络。

Comments IROS 2026

详情
AI中文摘要

上下文模仿学习(ICIL)使机器人能够从仅包含少量演示的提示中学习任务。通过消除部署时参数更新的需求,该范式支持对新任务的少样本适应。然而,最近的ICIL方法依赖于Transformer,其计算能力有限,并且在处理比训练时更长的提示时往往表现不佳。在这项工作中,我们引入了RoboSSM,一种基于状态空间模型(SSM)的可扩展上下文模仿学习方案。具体来说,RoboSSM用Longhorn(一种最先进的SSM)替代Transformer,该模型提供线性时间推理和强大的外推能力,非常适合长上下文提示。通过在LIBERO基准上的多样化实验,我们证明了将SSM应用于ICIL的有效性,通过处理测试时更长的上下文,实现了比基于Transformer的ICIL方法对未见和长时任务更好的泛化。这些结果首次表明,SSM是ICIL高效且可扩展的骨干网络。我们的代码可在此网址获取。

英文摘要

In-context imitation learning (ICIL) enables robots to learn tasks from prompts consisting of just a handful of demonstrations. By eliminating the need for parameter updates at deployment time, this paradigm supports few-shot adaptation to novel tasks. However, recent ICIL methods rely on Transformers, which have computational limitations and tend to underperform when handling longer prompts than those seen during training. In this work, we introduce RoboSSM, a scalable recipe for in-context imitation learning based on state-space models (SSM). Specifically, RoboSSM replaces Transformers with Longhorn -- a state-of-the-art SSM that provides linear-time inference and strong extrapolation capabilities, making it well-suited for long-context prompts. Through diverse experiments on the LIBERO benchmark, we demonstrate the effectiveness of applying SSMs to ICIL, achieving improved generalization to both unseen and long-horizon tasks than Transformer-based ICIL methods by handling longer contexts at test-time. These results show for the first time that SSMs are an efficient and scalable backbone for ICIL. Our code is available at https://github.com/youngjuY/RoboSSM.

2509.10416 2026-06-19 cs.RO 版本更新

TASC: Task-Aware Shared Control for Relational Telemanipulation

TASC:面向关系遥操作的任务感知共享控制

Ze Fu, Pinhao Song, Yutong Hu, Renaud Detry

发表机构 * KU Leuven, Dept. Mechanical Engineering, Research unit Robotics, Automation and Mechatronics(KU莱顿机械工程系,机器人、自动化与机电一体化研究单位) KU Leuven, Dept. Electrical Engineering, Research unit Processing Speech and Images(KU莱顿电气工程系,语音与图像处理研究单位)

AI总结 提出TASC框架,通过视觉构建开放词汇交互图推断任务级用户意图,并基于空间约束提供共享控制辅助,提升关系遥操作效率与泛化能力。

Comments Accepted to IROS 2026

详情
AI中文摘要

我们提出了TASC,一个面向关系遥操作的任务感知共享控制框架,该框架从仅运动输入中推断任务级用户意图并提供辅助。为了在没有预定义模板的情况下支持抓取关系任务,TASC从视觉输入构建一个开放词汇的交互图来表示功能性物体关系,并据此推断用户意图。然后,共享控制策略在抓取和物体交互过程中提供辅助,该辅助由视觉语言模型预测的空间约束引导。我们的方法解决了共享控制下关系遥操作的两个关键挑战:(1)从低级运动命令中推断任务级意图,以及(2)跨不同物体和任务的泛化辅助。在仿真和真实世界的实验表明,与先前方法相比,TASC提高了任务效率并减少了用户输入努力,同时实现了跨多种关系遥操作任务的零样本泛化。支持我们实验的代码在此https URL公开提供。

英文摘要

We present TASC, a Task-Aware Shared Control framework for relational telemanipulation that infers task-level user intent and provides assistance from motion-only input. To support prehensile relational tasks without predefined templates, TASC constructs an open-vocabulary interaction graph from visual input to represent functional object relationships, and infers user intent accordingly. A shared control policy then provides assistance during both grasping and object interaction, guided by spatial constraints predicted by a vision-language model. Our method addresses two key challenges in relational telemanipulation under shared control: (1) task-level intent inference from low-level motion commands, and (2) generalizable assistance across diverse objects and tasks. Experiments in both simulation and the real world demonstrate that TASC improves task efficiency and reduces user input effort compared to prior methods, while enabling zero-shot generalization across diverse relational telemanipulation tasks. The code that supports our experiments is publicly available at https://github.com/fitz0401/tasc.

2504.11171 2026-06-19 cs.CV cs.AI 版本更新

TerraMind: Large-Scale Generative Multimodality for Earth Observation

TerraMind:面向地球观测的大规模生成式多模态模型

Johannes Jakubik, Felix Yang, Benedikt Blumenstiel, Erik Scheurer, Rocco Sedona, Stefano Maurogiovanni, Jente Bosmans, Nikolaos Dionelis, Valerio Marsocci, Niklas Kopp, Rahul Ramachandran, Paolo Fraccaro, Thomas Brunschwiler, Gabriele Cavallaro, Juan Bernabe-Moreno, Nicolas Longépé

发表机构 * IBM Research – Europe(IBM欧洲研究院) ETH Zurich(苏黎世联邦理工学院) Forschungszentrum Jülich(尤利希研究中心) European Space Agency(欧洲航天局) Φ \Phi -Lab(Φ实验室) NASA IMPACT University of Iceland(爱沙尼亚大学)

AI总结 提出首个任意到任意生成式多模态基础模型TerraMind,通过双尺度表示(token级和像素级)预训练,实现零样本/少样本应用,并引入“模态思考”能力,在PANGAEA等基准上达到领先性能。

Comments Accepted at ICCV'25

详情
AI中文摘要

我们提出了TerraMind,这是首个面向地球观测(EO)的任意到任意生成式多模态基础模型。与其他多模态模型不同,TerraMind在跨模态的双尺度表示(结合token级和像素级数据)上进行预训练。在token级别,TerraMind编码高层上下文信息以学习跨模态关系;在像素级别,TerraMind利用细粒度表示捕捉关键空间细节。我们在一个全球大规模数据集的九种地理空间模态上预训练了TerraMind。在本文中,我们证明:(i)TerraMind的双尺度早期融合方法为地球观测解锁了一系列零样本和少样本应用;(ii)TerraMind引入了“模态思考”(TiM)——在微调和推理过程中生成额外人工数据以改善模型输出的能力;(iii)TerraMind在PANGAEA等社区标准的地球观测基准上达到了超越现有最优的性能。预训练数据集、模型权重和我们的代码均在宽松许可下开源。

英文摘要

We present TerraMind, the first any-to-any generative, multimodal foundation model for Earth observation (EO). Unlike other multimodal models, TerraMind is pretrained on dual-scale representations combining both token-level and pixel-level data across modalities. On a token level, TerraMind encodes high-level contextual information to learn cross-modal relationships, while on a pixel level, TerraMind leverages fine-grained representations to capture critical spatial nuances. We pretrained TerraMind on nine geospatial modalities of a global, large-scale dataset. In this paper, we demonstrate that (i) TerraMind's dual-scale early fusion approach unlocks a range of zero-shot and few-shot applications for Earth observation, (ii) TerraMind introduces "Thinking-in-Modalities" (TiM) -- the capability of generating additional artificial data during finetuning and inference to improve the model output -- and (iii) TerraMind achieves beyond state-of-the-art performance in community-standard benchmarks for EO like PANGAEA. The pretraining dataset, the model weights, and our code are open-sourced under a permissive license.

2506.14990 2026-06-19 cs.AI 版本更新

MEAL: A Benchmark for Continual Multi-Agent Reinforcement Learning

MEAL: 持续多智能体强化学习基准

Tristan Tomilin, Luka van den Boogaard, Samuel Garcin, Constantin Ruhdorfer, Bram Grooten, Fabrice Kusters, Yali Du, Andreas Bulling, Mykola Pechenizkiy, Meng Fang

发表机构 * Eindhoven University of Technology, The Netherlands(埃因霍温理工大学,荷兰) University of Edinburgh, UK(爱丁堡大学,英国) University of Stuttgart, Germany(斯图加特大学,德国) King's College London, UK(伦敦国王学院,英国) University of Liverpool, UK(利物浦大学,英国)

AI总结 提出MEAL基准,利用JAX和GPU加速实现100任务序列训练,揭示长序列中出现的失败模式。

Comments To be published in the International Conference on Machine Learning (ICML) 2026

详情
AI中文摘要

基准在强化学习(RL)研究中扮演核心角色,但其计算约束往往塑造了研究内容。尽管有终身学习的动机,大多数持续RL论文仅考虑3-10个顺序任务,因为CPU密集型环境使得更长的序列不切实际。同时,合作多智能体环境中的持续学习仍基本未被探索。为弥补这些空白,我们引入MEAL(多智能体自适应学习环境),这是首个持续多智能体RL基准。通过利用JAX和GPU加速,MEAL能够在单个GPU上几小时内训练100个任务的序列。我们发现,长任务序列揭示了在较小规模下不会出现的失败模式。

英文摘要

Benchmarks play a central role in reinforcement learning (RL) research, yet their computational constraints often shape what is studied. Despite the motivation of lifelong learning, most continual RL papers consider only 3-10 sequential tasks, as CPU-bound environments make longer sequences impractical. Meanwhile, continual learning in cooperative multi-agent settings remains largely unexplored. To address these gaps, we introduce MEAL (Multi-agent Environments for Adaptive Learning), the first benchmark for continual multi-agent RL. By leveraging JAX and GPU acceleration, MEAL enables training on sequences of 100 tasks in a few hours on a single GPU. We find that long task sequences reveal failure modes that do not appear at smaller scales.

2509.00271 2026-06-19 cs.RO 版本更新

Learn from What We HAVE: History-Aware VErifier that Reasons about Past Interactions Online

从我们所拥有的学习:在线推理过去交互的历史感知验证器

Yishu Li, Xinyi Mao, Ying Yuan, Kyutae Sim, Ben Eisner, David Held

发表机构 * Robotics Institute, Carnegie Mellon University(卡内基梅隆大学机器人研究所) Computer Science and Technology, Tsinghua University(清华大学计算机科学与技术系)

AI总结 提出历史感知验证器HAVE,通过解耦动作生成与验证,利用历史交互在线消除歧义,理论证明其提升期望动作质量,在多个模拟和真实环境中验证有效性。

Comments CoRL 2025

详情
AI中文摘要

我们引入了一种新颖的历史感知验证器(HAVE),通过利用过去的交互来在线消除不确定场景中的歧义。机器人经常遇到视觉上模糊的物体,这些物体的操作结果直到物理交互之前都是不确定的。虽然仅凭生成模型理论上可以适应这种模糊性,但在实践中,即使在以动作历史为条件的情况下,它们在模糊情况下也会获得次优性能。为了解决这个问题,我们提出明确地将动作生成与验证解耦:我们使用无条件的基于扩散的生成器来提出多个候选动作,并采用我们的历史感知验证器通过推理过去的交互来选择最有希望的动作。通过理论分析,我们证明了使用验证器显著提高了期望动作质量。在多个模拟和真实环境(包括铰接物体、多模态门和不均匀物体拾取)中的实证评估和分析证实了我们方法的有效性以及对基线的改进。我们的项目网站位于:this https URL

英文摘要

We introduce a novel History-Aware VErifier (HAVE) to disambiguate uncertain scenarios online by leveraging past interactions. Robots frequently encounter visually ambiguous objects whose manipulation outcomes remain uncertain until physically interacted with. While generative models alone could theoretically adapt to such ambiguity, in practice they obtain suboptimal performance in ambiguous cases, even when conditioned on action history. To address this, we propose explicitly decoupling action generation from verification: we use an unconditional diffusion-based generator to propose multiple candidate actions and employ our history-aware verifier to select the most promising action by reasoning about past interactions. Through theoretical analysis, we demonstrate that employing a verifier significantly improves expected action quality. Empirical evaluations and analysis across multiple simulated and real-world environments including articulated objects, multi-modal doors, and uneven object pick-up confirm the effectiveness of our method and improvements over baselines. Our project website is available at: https://liy1shu.github.io/HAVE_CoRL25/

2507.22524 2026-06-19 cs.LG 版本更新

HGCN(O): A Self-Tuning GCN HyperModel Toolkit for Outcome Prediction in Event-Sequence Data

HGCN(O):一种用于事件序列数据结果预测的自调优GCN超模型工具包

Fang Wang, Paolo Ceravolo, Ernesto Damiani

发表机构 * College of Computing and Mathematical Sciences, Khalifa University(哈立发大学计算与数学科学学院) Department of Computer Science, University of Milan(米兰大学计算机科学系)

AI总结 提出HGCN(O)工具包,集成四种GCN架构和多种图表示,通过自调优优化预测准确性和稳定性,在平衡和不平衡数据集上表现优异,优于传统方法。

Comments 38 pages, 2 figures

详情
AI中文摘要

我们提出了HGCN(O),一个使用图卷积网络(GCN)模型进行事件序列预测的自调优工具包。该工具包包含四种GCN架构(O-GCN、T-GCN、TP-GCN、TE-GCN),基于GCNConv和GraphConv层,集成了事件序列的多种图表示,具有不同的节点级和图级属性选择,并通过边权重捕获时间依赖性,优化了平衡和不平衡数据集的预测准确性和稳定性。大量实验表明,GCNConv模型在不平衡数据上表现优异,而所有模型在平衡数据上表现一致。实验还证实了HGCN(O)相对于传统方法的优越性能。应用包括预测性业务流程监控(PBPM),即基于事件日志预测业务流程的未来事件或状态。

英文摘要

We propose HGCN(O), a self-tuning toolkit using Graph Convolutional Network (GCN) models for event sequence prediction. Featuring four GCN architectures (O-GCN, T-GCN, TP-GCN, TE-GCN) across the GCNConv and GraphConv layers, our toolkit integrates multiple graph representations of event sequences with different choices of node- and graph-level attributes and in temporal dependencies via edge weights, optimising prediction accuracy and stability for balanced and unbalanced datasets. Extensive experiments show that GCNConv models excel on unbalanced data, while all models perform consistently on balanced data. Experiments also confirm the superior performance of HGCN(O) over traditional approaches. Applications include Predictive Business Process Monitoring (PBPM), which predicts future events or states of a business process based on event logs.

2507.21460 2026-06-19 cs.CV 版本更新

An Angular-Temporal Interaction Network for Light Field Object Tracking in Low-Light Scenes

用于低光场景光场目标跟踪的角-时交互网络

Mianzhao Wang, Fan Shi, Xu Cheng, Feifei Zhang, Shengyong Chen

发表机构 * Engineering Research Center of Learning-Based Intelligent System (Ministry of Education)(教育部学习驱动智能系统工程研究中心) key Laboratory of Computer Vision and System (Ministry of Education)(教育部计算机视觉与系统重点实验室) School of Computer Science and Engineering, Tianjin University of Technology(天津工业大学计算机科学与工程学院)

AI总结 提出一种光场极线平面结构图像表示和角-时交互网络,通过显式建模几何结构和自监督优化,在低光场景下实现高效目标跟踪,性能达到最优。

详情
AI中文摘要

高质量的四维光场表示结合高效的角特征建模对于场景感知至关重要,因为它可以提供判别性的空间-角度线索来识别移动目标。然而,近期的发展仍然难以在时间域中提供可靠的角建模,尤其是在复杂的低光场景中。在本文中,我们提出了一种新颖的光场极线平面结构图像(ESI)表示,该表示显式定义了光场内的几何结构。通过利用极线平面内光线角度的突变,这种表示可以增强低光场景中的视觉表达,并减少高维光场的冗余。我们进一步提出了一种用于光场目标跟踪的角-时交互网络(ATINet),该网络从光场的几何结构线索和角-时交互线索中学习角感知表示。此外,ATINet还可以通过自监督方式进行优化,以增强时间域上的几何特征交互。最后,我们引入了一个大规模的光场低光数据集用于目标跟踪。大量实验表明,ATINet在单目标跟踪中达到了最先进的性能。此外,我们将所提方法扩展到多目标跟踪,这也显示了高质量光场角-时建模的有效性。

英文摘要

High-quality 4D light field representation with efficient angular feature modeling is crucial for scene perception, as it can provide discriminative spatial-angular cues to identify moving targets. However, recent developments still struggle to deliver reliable angular modeling in the temporal domain, particularly in complex low-light scenes. In this paper, we propose a novel light field epipolar-plane structure image (ESI) representation that explicitly defines the geometric structure within the light field. By capitalizing on the abrupt changes in the angles of light rays within the epipolar plane, this representation can enhance visual expression in low-light scenes and reduce redundancy in high-dimensional light fields. We further propose an angular-temporal interaction network (ATINet) for light field object tracking that learns angular-aware representations from the geometric structural cues and angular-temporal interaction cues of light fields. Furthermore, ATINet can also be optimized in a self-supervised manner to enhance the geometric feature interaction across the temporal domain. Finally, we introduce a large-scale light field low-light dataset for object tracking. Extensive experimentation demonstrates that ATINet achieves state-of-the-art performance in single object tracking. Furthermore, we extend the proposed method to multiple object tracking, which also shows the effectiveness of high-quality light field angular-temporal modeling.

2505.18726 2026-06-19 cs.SD cs.LG eess.AS 版本更新

Bioacoustic Geolocation: Species Sounds as Geographic Signals

生物声学地理定位:物种声音作为地理信号

Mustafa Chasmai, Wuao Liu, Subhransu Maji, Grant Van Horn

发表机构 * University of Massachusetts, Amherst(马萨诸塞大学阿姆赫斯特分校)

AI总结 本文研究仅通过声音进行全球尺度地理定位,利用生物声学信号中的物种地理分布线索,提出结合物种范围预测与检索的地理定位方法,并验证多模态融合的潜力。

Comments Accepted to ICML 26

详情
AI中文摘要

我们能否仅通过听到的声音确定某人的地理位置?声学信号是否足以定位到国家、州甚至城市?在这项工作中,我们应对全球尺度音频地理定位的挑战,特别关注野生动物和自然声音。我们假设生物声学信号包含信息丰富的地理定位线索,因为物种具有明确的地理分布范围。为了验证这一假设,我们对图像地理定位和声景映射方法进行基准测试,设计预言机和以物种为中心的基线,并提出一种结合物种范围预测与基于检索的地理定位的混合方法。我们进一步探究地理定位是否随着物种多样性记录和跨邻近样本的时空聚合而改善。最后,我们将研究扩展到多模态地理定位,通过结合音频和视觉内容的电影案例研究。我们的结果突出了将生物声学信号纳入地理空间任务的潜力,为物种识别和音频地理定位的未来工作提供了动力。

英文摘要

Can we determine someone's geographic location solely from the sounds they hear? Are acoustic signals enough to localize within a country, state, or even city? In this work, we tackle the challenge of global-scale audio geolocation, with a particular focus on wildlife and natural sounds. We posit that bioacoustic signals contain informative geolocation cues because of well-defined geographic ranges of species. To test this hypothesis, we benchmark image geolocation and soundscape mapping methods, design oracles and species-centric baselines, and propose a hybrid approach that combines species range prediction with retrieval-based geolocation. We further ask whether geolocation improves with species-diverse recordings and spatiotemporal aggregation across neighboring samples. Finally, we extend our study to multimodal geolocation with case studies from movies that combine both audio and visual content. Our results highlight the potential of incorporating bioacoustic signals into geospatial tasks, motivating future work on species recognition and audio geolocation.

2507.15584 2026-06-19 cs.LG 版本更新

We Need to Rethink Benchmarking in Anomaly Detection

我们需要重新思考异常检测中的基准测试

Philipp Röchner, Simon Klüttermann, Kevin Kammler, Franz Rothlauf, Emmanuel Müller, Daniel Schlör

发表机构 * University of Mainz(马尔堡大学) TU Dortmund(杜伊斯堡-艾森大学) University of Würzburg(维尔茨堡大学)

AI总结 本文指出当前异常检测基准测试导致进展停滞,提出基于场景分类的评估框架以改进算法选择和性能评估。

详情
AI中文摘要

尽管不断有新的异常检测算法提出且基准测试工作广泛,但进展似乎停滞不前,既有基线与新算法之间仅存在微小的性能差异。在这篇立场论文中,我们认为这种停滞源于我们评估异常检测算法的方式存在局限性。在当前的基准测试中,一个仅检查单个特征极端值的平凡算法与最先进的深度学习方法竞争激烈,尽管它在简单案例(如正常点环内的异常)上失败。此外,现有基准测试未能充分反映异常检测应用的多样性,使得从业者难以可靠地为其应用选择算法。因此,我们需要重新思考异常检测中的基准测试。我们认为,异常检测应通过使用场景来研究,这些场景将共享相关特征的应用分组,并通过通用分类法定义。场景内的基准测试能够实现预处理、度量和模型选择的场景特定选择,明确哪些进展在相似应用间迁移,并为从业者在其特定上下文中提供可靠指导。

英文摘要

Despite the continuous proposal of new anomaly detection algorithms and extensive benchmarking efforts, progress seems to stagnate, with only minor performance differences between established baselines and new algorithms. In this position paper, we argue that this stagnation is due to limitations in how we evaluate anomaly detection algorithms. In current benchmarks, a trivial algorithm that only checks for extreme values in individual features performs competitively with state-of-the-art deep learning methods, despite failing on simple cases such as anomalies within an annulus of normal points. Moreover, existing benchmarks do not adequately reflect the diversity of anomaly detection applications, making it difficult for practitioners to reliably select algorithms for their applications. Consequently, we need to rethink benchmarking in anomaly detection. In our opinion, anomaly detection should be studied using scenarios that group applications sharing relevant characteristics, defined through a common taxonomy. Benchmarking within scenarios enables scenario-specific choices for preprocessing, metrics, and model selection, clarifying which advances transfer across similar applications and providing practitioners with reliable guidance for their specific contexts.

2506.06952 2026-06-19 cs.CV 版本更新

LaTtE-Flow: Layerwise Timestep-Expert Flow-based Transformer

LaTtE-Flow: 基于层间时间步专家流的Transformer

Ying Shen, Zhiyang Xu, Jiuhai Chen, Shizhe Diao, Jiaxin Zhang, Yuguang Yao, Joy Rimchala, Ismini Lourentzou, Lifu Huang

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) University of Maryland(马里兰大学) Nvidia(英伟达) Salesforce AI Research(Salesforce AI研究) Intuit AI Research(Intuit AI研究)

AI总结 提出LaTtE-Flow,一种基于预训练视觉语言模型的高效统一架构,通过层间时间步专家流和条件残差注意力机制,实现图像理解与生成,生成速度提升约6倍。

Comments Unified multimodal model, Flow-matching

详情
AI中文摘要

多模态基础模型在统一图像理解与生成方面取得了最新进展,为在单一框架内处理广泛的视觉-语言任务开辟了令人兴奋的途径。尽管取得了进展,现有的统一模型通常需要大量的预训练,并且与专门针对每项任务的模型相比,难以达到相同的性能水平。此外,许多这些模型存在图像生成速度慢的问题,限制了它们在实时或资源受限环境中的实际部署。在这项工作中,我们提出了基于层间时间步专家流的Transformer(LaTtE-Flow),一种新颖且高效的架构,可在单个多模态模型中统一图像理解与生成。LaTtE-Flow建立在强大的预训练视觉语言模型(VLM)之上,以继承强大的多模态理解能力,并通过新颖的层间时间步专家流架构扩展它们,以实现高效的图像生成。LaTtE-Flow将流匹配过程分布到专门的Transformer层组中,每组负责不同的时间步子集。这种设计通过在每个采样时间步仅激活一小部分层,显著提高了采样效率。为了进一步提升性能,我们提出了一种时间步条件残差注意力机制,用于跨层高效的信息重用。实验表明,LaTtE-Flow在多模态理解任务上取得了强劲的性能,同时与最近的统一多模态模型相比,实现了具有竞争力的图像生成质量,推理速度提高了约6倍。

英文摘要

Recent advances in multimodal foundation models unifying image understanding and generation have opened exciting avenues for tackling a wide range of vision-language tasks within a single framework. Despite progress, existing unified models typically require extensive pretraining and struggle to achieve the same level of performance compared to models dedicated to each task. Additionally, many of these models suffer from slow image generation speeds, limiting their practical deployment in real-time or resource-constrained settings. In this work, we propose Layerwise Timestep-Expert Flow-based Transformer (LaTtE-Flow), a novel and efficient architecture that unifies image understanding and generation within a single multimodal model. LaTtE-Flow builds upon powerful pretrained Vision-Language Models (VLMs) to inherit strong multimodal understanding capabilities, and extends them with a novel Layerwise Timestep Experts flow-based architecture for efficient image generation. LaTtE-Flow distributes the flow-matching process across specialized groups of Transformer layers, each responsible for a distinct subset of timesteps. This design significantly improves sampling efficiency by activating only a small subset of layers at each sampling timestep. To further enhance performance, we propose a Timestep-Conditioned Residual Attention mechanism for efficient information reuse across layers. Experiments demonstrate that LaTtE-Flow achieves strong performance on multimodal understanding tasks, while achieving competitive image generation quality with around 6x faster inference speed compared to recent unified multimodal models.

2505.22829 2026-06-19 cs.LG cs.AI 版本更新

Bridging Distribution Shift and AI Safety: Conceptual and Methodological Synergies

弥合分布偏移与AI安全:概念与方法论的协同

Chenruo Liu, Kenan Tang, Yao Qin, Qi Lei

发表机构 * Center for Data Science, New York University New York New York USA Computer Science Department, University of California, Santa Barbara Santa Barbara California USA Department of Electrical Computer Engineering, University of California, Santa Barbara Santa Barbara California USA Courant Institute for Mathematical Sciences \& Center for Data Science, New York University New York New York USA Center for Data Science, New York University Computer Science Department, University of California, Santa Barbara Computer Engineering, University of California, Santa Barbara Courant Institute for Mathematical Sciences \& Center for Data Science, New York University

AI总结 本文通过分析分布偏移与AI安全之间的概念和方法论协同,建立了特定偏移类型与细粒度安全问题之间的两种联系,促进了两领域研究的深度融合。

Comments 35 pages

详情
AI中文摘要

本文通过全面分析分布偏移与AI安全之间的概念和方法论协同,弥合了这两者之间的鸿沟。虽然先前的讨论通常关注狭隘的案例或非正式的类比,但我们建立了特定分布偏移原因与细粒度AI安全问题之间的两种联系:(1) 解决特定偏移类型的方法可以帮助实现相应的安全目标,或 (2) 某些偏移和安全问题可以形式化地相互归约,从而使它们的方法能够相互适应。我们的发现提供了一个统一的视角,鼓励分布偏移与AI安全研究之间更深入的整合。

英文摘要

This paper bridges distribution shift and AI safety through a comprehensive analysis of their conceptual and methodological synergies. While prior discussions often focus on narrow cases or informal analogies, we establish two types connections between specific causes of distribution shift and fine-grained AI safety issues: (1) methods addressing a specific shift type can help achieve corresponding safety goals, or (2) certain shifts and safety issues can be formally reduced to each other, enabling mutual adaptation of their methods. Our findings provide a unified perspective that encourages deeper integration between distribution shift and AI safety research.

2505.18201 2026-06-19 cs.RO cs.LG 版本更新

Reinforcement Twinning for Hybrid Control of Flapping-Wing Drones

强化孪生用于扑翼无人机的混合控制

Romain Poletti, Lorenzo Schena, Lilla Koloszar, Joris Degroote, Miguel Alfonso Mendez

发表机构 * Environmental and Applied Fluid Dynamics, von Karman Institute for Fluid Dynamics(环境与应用流体动力学,冯·卡门流体动力学研究所) Department of Mechanical Engineering, Vrije Universiteit Brussel(机械工程系,自由大学布鲁塞尔) Department of Electromechanical, Systems and Metal Engineering, Ghent University(机电系统与金属工程系,根特大学) Aero-Thermo-Mechanics Laboratory, École Polytechnique de Bruxelles, Université Libre de Bruxelles(航空热力学力学实验室,布鲁塞尔理工学院,自由大学布鲁塞尔) Experimental Aerodynamics and Propulsion Lab, Universidad Carlos III de Madrid(实验空气动力学与推进实验室,马德里卡洛斯三世大学)

AI总结 提出一种混合无模型/基于模型的扑翼无人机控制方法,通过强化孪生算法结合强化学习与自适应数字孪生,利用迁移学习和策略裁判提升样本效率与控制鲁棒性。

详情
AI中文摘要

控制扑翼无人机需要能够处理来自不完整、有噪声传感器数据的时变、非线性、欠驱动动力学的控制器。人工智能的最新进展,特别是强化学习,通过从环境交互中进行数据驱动的策略优化,为解决此类复杂控制问题开辟了新视角。然而,纯数据驱动方法样本效率低,需要大量甚至不安全的探索,尤其是在缺乏引导物理模型的情况下。这激发了混合人工智能-物理框架。本文提出了一种使用强化孪生算法的混合无模型/基于模型的飞行控制方法。基于模型的组件使用伴随公式和从实时轨迹中连续识别的自适应数字孪生;无模型组件使用强化学习。两个智能体通过迁移学习、模仿学习以及真实环境与数字孪生之间的共享经验来共享知识,并由一个策略裁判协调,该裁判根据数字孪生性能和真实到虚拟一致性比率选择哪个智能体在现实中行动。该框架针对扑翼无人机的纵向控制进行了评估,该无人机被建模为由准稳态气动力驱动的非线性时变系统。混合策略在三种自适应模型初始化下进行了测试:(1)从现有数据进行离线识别,(2)随机初始化并进行完全在线识别,以及(3)使用有偏参数进行离线预训练,然后进行在线自适应。在所有情况下,混合框架在性能、鲁棒性和样本效率方面均优于纯无模型和纯基于模型的方法。

英文摘要

Controlling flapping-wing drones requires controllers that handle time-varying, nonlinear, underactuated dynamics from incomplete, noisy sensor data. Recent advances in artificial intelligence (AI), particularly reinforcement learning (RL), have opened new perspectives for addressing such complex control problems through data-driven policy optimization from interaction with the environment. Yet purely data-driven methods are sample-inefficient, demanding extensive, sometimes unsafe exploration, especially without guiding physical models. This motivates hybrid AI-physics frameworks. This article proposes a hybrid model-free/model-based flight-control approach using the reinforcement twinning algorithm. The model-based (MB) component uses an adjoint formulation and an adaptive digital twin continuously identified from live trajectories; the model-free (MF) component uses RL. The two agents share knowledge via transfer learning, imitation learning, and shared experience between the real environment and the digital twin, coordinated by a policy referee that selects which agent acts in reality based on digital-twin performance and a real-to-virtual consistency ratio. The framework is evaluated for the longitudinal control of a flapping-wing drone, modelled as a nonlinear time-varying system driven by quasi-steady aerodynamic forces. The hybrid strategy is tested under three adaptive-model initializations: (1) offline identification from existing data, (2) random initialization with fully online identification, and (3) offline pre-training with biased parameters followed by online adaptation. In all cases, the hybrid framework improves performance, robustness, and sample efficiency over purely model-free and purely model-based approaches.

2505.16319 2026-06-19 cs.LG 版本更新

FreshRetailNet-50K: A Stockout-Annotated Censored Demand Dataset for Latent Demand Recovery and Forecasting in Fresh Retail

FreshRetailNet-LT:面向生鲜零售中潜在需求恢复与预测的缺货标注删失需求数据集

Yangyang Wang, Jiawei Gu, Li Long, Xin Li, Li Shen, Zhouyu Fu, Xiangjun Zhou, Xu Jiang

发表机构 * Fresh Retail, Inc.(新鲜零售公司)

AI总结 针对生鲜零售中缺货导致的销售数据删失问题,提出首个大规模基准数据集FreshRetailNet-50K,包含50,000条高时间分辨率小时级销售序列及缺货标注,并展示了两阶段需求建模方法,将预测准确率提升2.73%,需求低估偏差从7.37%降至近零。

详情
AI中文摘要

准确的需求估计对于零售业务指导易腐产品的库存和定价策略至关重要。然而,它面临缺货期间删失销售数据的根本挑战,其中未观察到的需求会造成系统性政策偏差。现有数据集缺乏解决这种删失效应所需的时间分辨率和标注。为填补这一空白,我们提出了FreshRetailNet-50K,这是首个用于删失需求估计的大规模基准。它包含来自18个主要城市898家商店的50,000条商店-产品时间序列的详细小时级销售数据,涵盖863个易腐SKU,并精心标注了缺货事件。该数据集独有的小时级库存状态记录,结合丰富的上下文协变量(包括促销折扣、降水和时间特征),使得超越现有解决方案的创新研究成为可能。我们展示了一个两阶段需求建模的用例:首先,利用精确的小时级标注重建缺货期间的潜在需求;然后,利用恢复的需求在第二阶段训练鲁棒的需求预测模型。实验结果表明,该方法将预测准确率提高了2.73%,同时将系统性需求低估从7.37%降至接近零偏差。凭借前所未有的时间粒度和全面的真实世界信息,FreshRetailNet-50K在需求插补、易腐库存优化和因果零售分析方面开辟了新的研究方向。该数据集独特的标注质量和规模解决了零售AI中长期存在的局限性,提供了即时解决方案和未来方法论创新的平台。数据(此 https URL )和代码(此 https URL )已公开。

英文摘要

Accurate demand estimation is critical for the retail business in guiding the inventory and pricing policies of perishable products. However, it faces fundamental challenges from censored sales data during stockouts, where unobserved demand creates systemic policy biases. Existing datasets lack the temporal resolution and annotations needed to address this censoring effect. To fill this gap, we present FreshRetailNet-50K, the first large-scale benchmark for censored demand estimation. It comprises 50,000 store-product time series of detailed hourly sales data from 898 stores in 18 major cities, encompassing 863 perishable SKUs meticulously annotated for stockout events. The hourly stock status records unique to this dataset, combined with rich contextual covariates, including promotional discounts, precipitation, and temporal features, enable innovative research beyond existing solutions. We demonstrate one such use case of two-stage demand modeling: first, we reconstruct the latent demand during stockouts using precise hourly annotations. We then leverage the recovered demand to train robust demand forecasting models in the second stage. Experimental results show that this approach achieves a 2.73% improvement in prediction accuracy while reducing the systematic demand underestimation from 7.37% to near-zero bias. With unprecedented temporal granularity and comprehensive real-world information, FreshRetailNet-50K opens new research directions in demand imputation, perishable inventory optimization, and causal retail analytics. The unique annotation quality and scale of the dataset address long-standing limitations in retail AI, providing immediate solutions and a platform for future methodological innovation. The data (https://huggingface.co/datasets/Dingdong-Inc/FreshRetailNet-50K) and code (https://github.com/Dingdong-Inc/frn-50k-baseline}) are openly released.

2504.15535 2026-06-19 cs.RO 版本更新

VibeCheck: Using Active Acoustic Tactile Sensing for Contact-Rich Manipulation

VibeCheck: 使用主动声学触觉传感进行接触丰富的操作

Kaidi Zhang, Do-Gon Kim, Eric T. Chang, Hua-Hsuan Liang, Zhanpeng He, Kathryn Lampo, Philippe Wu, Ioannis Kymissis, Matei Ciocarlie

发表机构 * Dept. of Mechanical Engineering(机械工程系) Dept. of Computer Science(计算机科学系) Dept. of Electrical Engineering(电气工程系) Columbia University(哥伦比亚大学)

AI总结 本文构建了带有两个压电手指的主动声学传感夹爪,通过物体传递声学振动来感知其声学特性和接触状态,用于物体分类、抓取位置估计、内部结构姿态估计以及外部接触类型分类,并基于接触分类模型实现了鲁棒的插销任务。

Comments Published at IROS 2025. 8 pages, 7 figures

详情
AI中文摘要

物体的声学响应可以揭示其全局状态,例如材料属性或与外界的外部接触。在这项工作中,我们构建了一个主动声学传感夹爪,配备两个压电手指:一个用于生成信号,另一个用于接收信号。通过将一个手指的声学振动通过物体传递到另一个手指,我们能够洞察物体的声学特性和接触状态。我们使用该系统进行物体分类、估计抓取位置、估计内部结构的姿态,以及分类物体与环境的外部接触类型。利用我们的接触类型分类模型,我们解决了一个标准的长时域操作问题:插销插入。我们基于传感器的性能使用一个简单的模拟转移模型来训练一个模仿学习策略,该策略对分类器的不完美预测具有鲁棒性。最后,我们在UR5机器人上演示了该策略,仅使用主动声学传感作为反馈。视频可在此 https URL 找到。

英文摘要

The acoustic response of an object can reveal a lot about its global state, for example its material properties or the extrinsic contacts it is making with the world. In this work, we build an active acoustic sensing gripper equipped with two piezoelectric fingers: one for generating signals, the other for receiving them. By sending an acoustic vibration from one finger to the other through an object, we gain insight into an object's acoustic properties and contact state. We use this system to classify objects, estimate grasping position, estimate poses of internal structures, and classify the types of extrinsic contacts an object is making with the environment. Using our contact type classification model, we tackle a standard long-horizon manipulation problem: peg insertion. We use a simple simulated transition model based on the performance of our sensor to train an imitation learning policy that is robust to imperfect predictions from the classifier. We finally demonstrate the policy on a UR5 robot with active acoustic sensing as the only feedback. Videos can be found at https://roamlab.github.io/vibecheck .

2305.14985 2026-06-19 cs.CV cs.CL 版本更新

IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models

IdealGPT: 通过大型语言模型迭代分解视觉与语言推理

Haoxuan You, Rui Sun, Zhecan Wang, Long Chen, Gengyu Wang, Hammad A. Ayyubi, Kai-Wei Chang, Shih-Fu Chang

发表机构 * Columbia University(哥伦比亚大学) HKUST(香港科技大学) University of California, Los Angeles(加州大学洛杉矶分校)

AI总结 提出IdealGPT框架,利用大型语言模型迭代分解视觉语言推理任务,通过子问题生成、子答案获取和最终答案推理的循环过程,在零样本设置下显著提升多步推理性能。

Comments 13 pages, 5 figures

详情
AI中文摘要

视觉与语言(VL)理解领域通过端到端的大型预训练VL模型(VLM)取得了前所未有的进展。然而,它们在需要多步推理的零样本推理任务中仍存在不足。为了实现这一目标,先前的工作采用了分而治之的流程。本文认为,先前的工作存在几个固有的缺点:1)它们依赖于特定领域的子问题分解模型。2)即使子问题或子答案提供的信息不足,它们也强制模型预测最终答案。我们通过IdealGPT框架解决了这些局限性,该框架利用大型语言模型(LLM)迭代分解VL推理。具体来说,IdealGPT使用一个LLM生成子问题,一个VLM提供相应的子答案,另一个LLM进行推理以得出最终答案。这三个模块迭代地执行分而治之的过程,直到模型对主问题的最终答案有信心。我们在零样本设置下对多个具有挑战性的VL推理任务评估了IdealGPT。特别是,我们的IdealGPT在VCR上比现有最好的GPT-4类模型绝对提高了10%,在SNLI-VE上提高了15%。代码可在以下网址获取:此 https URL

英文摘要

The field of vision-and-language (VL) understanding has made unprecedented progress with end-to-end large pre-trained VL models (VLMs). However, they still fall short in zero-shot reasoning tasks that require multi-step inferencing. To achieve this goal, previous works resort to a divide-and-conquer pipeline. In this paper, we argue that previous efforts have several inherent shortcomings: 1) They rely on domain-specific sub-question decomposing models. 2) They force models to predict the final answer even if the sub-questions or sub-answers provide insufficient information. We address these limitations via IdealGPT, a framework that iteratively decomposes VL reasoning using large language models (LLMs). Specifically, IdealGPT utilizes an LLM to generate sub-questions, a VLM to provide corresponding sub-answers, and another LLM to reason to achieve the final answer. These three modules perform the divide-and-conquer procedure iteratively until the model is confident about the final answer to the main question. We evaluate IdealGPT on multiple challenging VL reasoning tasks under a zero-shot setting. In particular, our IdealGPT outperforms the best existing GPT-4-like models by an absolute 10% on VCR and 15% on SNLI-VE. Code is available at https://github.com/Hxyou/IdealGPT

2504.02885 2026-06-19 cs.CL 版本更新

Med-R2: Perception and Reflection-driven Complex Reasoning for Medical Report Generation

Med-R2:面向医学报告生成的感知与反思驱动复杂推理

Hao Wang, Shuchang Ye, Jinghao Lin, Usman Naseem, Jinman Kim

发表机构 * The School of Computer Science, The University of Sydney(悉尼大学计算机科学学院) The School of Computing, Macquarie University(麦考瑞大学计算机学院) Doubao Medical Group, ByteDance(字节跳动 doubao 医疗集团)

AI总结 提出Med-R2微调策略,通过引入感知驱动的长推理过程和放射学知识指导,并加入反思机制修正感知错误,提升LVLMs在医学报告生成中的病理特征感知和诊断准确性。

Comments 28 pages, 3 figures, 1 table

详情
AI中文摘要

自动化医学报告生成(MRG)越来越多地被用于减轻人工报告负担和辅助决策。大型视觉语言模型(LVLMs)因其细粒度的图像-文本对齐和先进的文本生成能力,在自动化MRG中展现出巨大潜力。目前,最先进的MRG主要专注于通过直接监督微调(SFT)来适应预训练的LVLMs,这是一种使用医学图像-报告对的微调策略。然而,有几个因素限制了这些LVLMs的性能。首先,直接SFT使LVLMs能够直接生成医学报告,而无需经过病理特征感知和诊断推理的中间思考过程。这导致可能无法感知病理特征,从而引起误诊。其次,直接SFT缺乏放射学特定知识的指导,导致LVLMs误解感知到的病理特征并做出错误诊断。为了解决这些问题,我们提出了一种名为Med-R2的新型微调策略。我们引入了一个感知驱动的长推理过程,该过程在报告生成之前进行,并融入放射学特定知识作为指导。此外,为了减轻复杂推理中潜在的感知错误,引入了一种反思机制来细化病理特征的感知和生成的报告。我们的实验表明,Med-R2通过微调LVLMs有效增强了MRG的病理特征感知能力和诊断准确性。

英文摘要

Automated medical report generation (MRG) is increasingly used to reduce the burden of manual reporting and for decision support. Large vision-language models (LVLMs) hold great promise for automated MRG due to their fine-grained image-text alignment and advanced text-generation capabilities. Currently, state-of-the-art MRGs primarily focus on adapting pre-trained LVLMs with direct supervised fine-tuning (SFT), a fine-tuning strategy with medical image-report pairs. However, several factors limit the performance of these LVLMs. Firstly, direct SFT enables LVLMs to generate medical reports directly without an intermediate thinking process of pathological feature perception and diagnostic reasoning. This causes a potential failure to perceive pathological features and thus leads to misdiagnosis. Secondly, direct SFT lacks the incorporation of radiology-specific knowledge guidance, causing LVLMs to misinterpret perceived pathological features and make incorrect diagnoses. To address these gaps, we propose a novel fine-tuning strategy named Med-R2. We introduce a perception-driven long reasoning process that precedes report generation and incorporates radiology-specific knowledge as guidance. Additionally, to alleviate potential perceptual errors in complex reasoning, a reflection mechanism is introduced to refine the perception of pathological features and the generated report. Our experiments demonstrate that Med-R2 effectively enhances the capability of pathological features perception and diagnosis accuracy for MRG via fine-tuned LVLMs.

2411.10077 2026-06-19 cs.CV 版本更新

Hierarchical mutual distillation for multi-view fusion: Learning from all possible view combinations

多视角融合的分层互蒸馏:从所有可能的视角组合中学习

Jiwoong Yang, Haejun Chung, Ikbeom Jang

发表机构 * Hanyang University(翰阳大学) Hankuk University of Foreign Studies(韩国民法大学)

AI总结 本文提出一种新颖的多视角不确定性加权互蒸馏方法,通过分层互蒸馏提升预测一致性,有效利用各视角信息并缓解不确定预测的影响。

详情
Journal ref
Pattern Recognition 178 (2026) 113432
AI中文摘要

多视角学习常面临有效利用不同角度和位置拍摄图像的挑战,尤其是在处理视角间不一致性和不确定性时更为突出。本文提出了一种新颖的多视角不确定性加权互蒸馏(MV-UWMD)方法。我们的方法通过在所有可能的视角组合中进行分层互蒸馏来增强预测一致性,包括单视角、部分多视角和全多视角预测。这引入了一种基于不确定性的加权机制,通过互蒸馏有效利用每个视角的独特信息,同时减轻不确定预测的影响。我们扩展了CNN-Transformer混合架构以促进在多个视角组合中的稳健特征学习和整合。我们使用了一个大规模、非结构化的数据集进行广泛实验,该数据集来自多样且非固定视角的拍摄。结果表明,MV-UWMD相比现有多视角学习方法在预测准确性和一致性方面有所提升。

英文摘要

Multi-view learning often struggles to effectively leverage images captured from diverse angles and locations. Learning methods for unstructured multi-view images remain largely underexplored. We propose a novel Hierarchical Mutual Distillation for Multi-View Fusion (HMDMV) method, which can handle both structured and unstructured multi-view scenarios. It makes predictions utilizing all possible view combinations: single view, partial multi-view, and full multi-view. The method generates predictions for each view combination and then applies hierarchical mutual distillation to enhance inter-view consistency. An uncertainty-based weighting mechanism further refines the fusion process by adjusting the influence of each view combination according to its prediction confidence, reducing the impact of low-confidence views. Extensive experiments on large-scale structured and unstructured datasets demonstrate that HMDMV consistently achieves state-of-the-art classification accuracy. Another unique advantage of HMDMV is that it provides improved flexibility in inference, allowing for more or fewer view counts in inference than those used in training without additional processing. We also provide a light version with reduced training cost by designing an efficient strategy that randomly samples subsets of view combinations during each training iteration. These results highlight HMDMV's robustness in real-world settings where view availability is variable or incomplete. The code is available at https://github.com/labhai/HMDMV.

2502.03227 2026-06-19 cs.LG cs.CV 版本更新

Adversarial Dependence Minimization

对抗性依赖最小化

Pierre-François De Plaen, Tinne Tuytelaars, Marc Proesmans, Luc Van Gool

发表机构 * CVL, ETH Zürich, Switzerland(CVL,苏黎世联邦理工学院,瑞士) INSAIT, Sofia University, Bulgaria(INSAIT,索菲亚大学,保加利亚)

AI总结 提出ADM算法,通过对抗博弈最小化特征维度间的统计依赖性,证明全局最优时达到相互独立,并应用于非线性去相关、图像分类泛化提升和自监督学习维度坍塌预防。

详情
AI中文摘要

最小冗余表示通常通过最小化特征协方差来学习。然而,基于协方差的方法无法消除所有依赖/冗余,因为线性不相关的变量仍可能表现出非线性关系。为了解决这个问题,我们引入了ADM,一种可微分的算法,通过对抗博弈最小化特征维度之间的统计依赖性:辅助网络识别依赖关系,而编码器去除它们。我们证明了在全局最优时实现了相互独立,经验验证了收敛性,并研究了三个潜在应用:将PCA扩展到非线性去相关、提高图像分类的泛化能力以及防止自监督学习中的维度坍塌。通过促进统计独立的表示,ADM为在多种应用中学习更鲁棒、更压缩和更泛化的表示铺平了道路。

英文摘要

Minimally redundant representations are typically learned by minimizing feature covariance. However, covariance-based methods fail to eliminate all dependencies/redundancies, as linearly uncorrelated variables can still exhibit nonlinear relationships. To address this, we introduce ADM, a differentiable algorithm that minimizes statistical dependence between feature dimensions through an adversarial game: auxiliary networks identify dependencies, while the encoder removes them. We prove that mutual independence is achieved at the global optimum, empirically verify convergence, and study three potential applications: extending PCA to nonlinear decorrelation, improving generalization in image classification, and preventing dimensional collapse in self-supervised learning. By promoting statistically independent representations, ADM paves the way for learning more robust, compressed, and generalizable representations across diverse applications.

2502.06866 2026-06-19 cs.LG cs.AI econ.EM stat.AP stat.ML 版本更新

Global Ease of Living Index: a machine learning framework for longitudinal analysis of major economies

全球生活便利指数:面向主要经济体纵向分析的机器学习框架

Arun Kumar Selvaraj, Tanay Panat, Rohitash Chandra

发表机构 * Transitional Artificial Intelligence Research Group, School of Mathematics and Statistics(过渡人工智能研究组,数学与统计学学院) Centre for Artificial Intelligence and Innovation(人工智能与创新中心) Pingla Institute(Pingla研究所)

AI总结 提出全球生活便利指数,结合社会经济和基础设施因素,利用机器学习处理缺失数据,并通过主成分分析和因子分析降维,为政策制定者提供改善生活质量的可操作工具。

详情
AI中文摘要

全球经济、地缘政治条件以及COVID-19疫情等破坏性事件对生活成本和生活质量产生了巨大影响。理解主要经济体中生活成本和生活质量的长期影响至关重要。一个透明且全面的生活指数必须包含生活条件的多个维度。在本研究中,我们提出了一种通过全球生活便利指数量化生活质量的方法,该指数将各种社会经济和基础设施因素整合为一个单一综合得分。我们的指数利用定义生活水平的经济指标,这有助于针对特定领域进行干预改进。我们提出了一个机器学习框架来处理特定国家某些经济指标的数据缺失问题。然后,我们整理并更新数据,并使用降维方法(主成分分析和因子分析)创建自1970年以来主要经济体的生活便利指数。我们的工作通过为政策制定者提供识别需要改进领域(如医疗系统、就业机会和公共安全)的实用工具,显著丰富了相关文献。我们的方法使用开放数据和代码,易于复现并适用于各种情境,为生活质量评估的持续研究和政策制定提供了透明度和可访问性。

英文摘要

The drastic changes in the global economy, geopolitical conditions, and disruptions such as the COVID-19 pandemic have impacted the cost of living and quality of life. It is essential to comprehend the long-term implications of the cost of living and quality of life in major economies. A transparent and comprehensive living index must include multiple dimensions of living conditions. In this study, we present an approach to quantifying the quality of life through the Global Ease of Living Index that combines various socio-economic and infrastructural factors into a single composite score. Our index utilises economic indicators that define living standards, which could help in targeted interventions to improve specific areas. We present a machine learning framework to address missing data for certain economic indicators in specific countries. We then curate and update the data and use a dimensionality reduction approach (Principal Component Analysis and Factor Analysis) to create the Ease of Living Index for major economies since 1970. Our work significantly adds to the literature by offering a practical tool for policymakers to identify areas needing improvement, such as healthcare systems, employment opportunities, and public safety. Our approach with open data and code can be easily reproduced and applied to various contexts, providing transparency and accessibility for ongoing research and policy development in quality-of-life assessment.

2501.18322 2026-06-19 cs.LG math.AP 版本更新

A Unified Perspective on the Dynamics of Deep Transformers

深度Transformer动力学的统一视角

Valérie Castin, Pierre Ablin, José Antonio Carrillo, Gabriel Peyré

发表机构 * CNRS and Ecole Normale Supérieure PSL(CNRS和巴黎高等师范大学) Apple(苹果公司) Mathematical Institute, University of Oxford(牛津大学数学学院)

AI总结 提出Transformer PDE作为注意力层迭代的均场极限,证明其适定性并分析高斯初始数据下的各向异性演化与聚类现象。

详情
AI中文摘要

Transformer在大多数机器学习任务中是最先进的,它将数据表示为称为token的向量序列。然后通过注意力函数利用这种表示,该函数学习token之间的依赖关系,是Transformer成功的关键。然而,跨层迭代应用注意力会导致复杂的动力学,这些动力学尚未被完全理解。为了分析这些动力学,我们将每个输入序列识别为一个概率测度,并将其演化建模为称为Transformer PDE的Vlasov方程,其速度场在概率测度中是非线性的。我们的第一组贡献聚焦于紧支撑初始数据。我们证明Transformer PDE是适定的,并且是相互作用粒子系统的均场极限,从而将先前的分析推广并扩展到自注意力的几种变体:多头注意力、L2注意力、Sinkhorn注意力、Sigmoid注意力和掩码注意力——利用条件Wasserstein框架。在第二组贡献中,我们首次研究非紧支撑初始条件,聚焦于高斯初始数据。再次针对不同类型的注意力,我们证明Transformer PDE保持高斯测度空间,这使我们能够从理论上和数值上分析高斯情况以识别典型行为。这种高斯分析捕捉了通过深度Transformer的数据各向异性演化。特别地,我们强调了与先前在非归一化离散情况下的结果平行的聚类现象。

英文摘要

Transformers, which are state-of-the-art in most machine learning tasks, represent the data as sequences of vectors called tokens. This representation is then exploited by the attention function, which learns dependencies between tokens and is key to the success of Transformers. However, the iterative application of attention across layers induces complex dynamics that remain to be fully understood. To analyze these dynamics, we identify each input sequence with a probability measure and model its evolution as a Vlasov equation called Transformer PDE, whose velocity field is non-linear in the probability measure. Our first set of contributions focuses on compactly supported initial data. We show the Transformer PDE is well-posed and is the mean-field limit of an interacting particle system, thus generalizing and extending previous analysis to several variants of self-attention: multi-head attention, L2 attention, Sinkhorn attention, Sigmoid attention, and masked attention--leveraging a conditional Wasserstein framework. In a second set of contributions, we are the first to study non-compactly supported initial conditions, by focusing on Gaussian initial data. Again for different types of attention, we show that the Transformer PDE preserves the space of Gaussian measures, which allows us to analyze the Gaussian case theoretically and numerically to identify typical behaviors. This Gaussian analysis captures the evolution of data anisotropy through a deep Transformer. In particular, we highlight a clustering phenomenon that parallels previous results in the non-normalized discrete case.

2501.17015 2026-06-19 cs.AI cs.MA cs.RO 版本更新

UniMM: A Unified Mixture Model Framework for Multi-Agent Simulation

UniMM:一种用于多智能体仿真的统一混合模型框架

Longzhong Lin, Xuewu Lin, Kechun Xu, Haojian Lu, Lichao Huang, Rong Xiong, Yue Wang

发表机构 * Zhejiang University(浙江大学) Horizon Robotics

AI总结 提出UniMM框架统一回归混合模型与离散NTP模型,通过闭环样本生成缓解分布偏移,并在WOSAC基准上取得最优性能。

Comments Accepted author manuscript. The version of record has been published in IEEE Transactions on Pattern Analysis and Machine Intelligence

详情
Journal ref
IEEE Transactions on Pattern Analysis and Machine Intelligence, Early Access, 2026
AI中文摘要

仿真在评估自动驾驶系统中起着关键作用,其中生成逼真的多智能体行为是一个关键方面。在多智能体仿真中,主要挑战包括行为多模态性和闭环分布偏移。在本研究中,我们提出了一个统一的混合模型(UniMM)框架,用于生成多模态智能体行为,该框架涵盖了主流方法,包括基于回归的混合模型和离散NTP模型。此外,我们引入了一种针对混合模型的闭环样本生成方法,以缓解分布偏移。在UniMM框架内,我们从模型和数据角度识别了关键配置。我们对各种模型配置进行了系统检查,并全面描述了它们的效果。此外,我们对数据配置的研究强调了闭环样本在实现逼真仿真中的关键作用。为了将闭环样本的优势扩展到更广泛的混合模型中,我们进一步引入了一种时间解缠和对齐机制,以解决捷径学习和离策略学习问题。利用我们探索的见解,UniMM框架内提出的不同变体,包括离散模型、无锚模型和基于锚点的模型,均在WOSAC基准上取得了最先进的性能。

英文摘要

Simulation plays a crucial role in assessing autonomous driving systems, where the generation of realistic multi-agent behaviors is a key aspect. In multi-agent simulation, the primary challenges include behavioral multimodality and closed-loop distributional shifts. In this study, we formulate a unified mixture model (UniMM) framework for generating multimodal agent behaviors, which can cover the mainstream methods including regression-based mixture models and discrete NTP models. Furthermore, we introduce a closed-loop sample generation approach tailored for mixture models to mitigate distributional shifts. Within the UniMM framework, we recognize critical configurations from both the model and data perspectives. We conduct a systematic examination of various model configurations, and comprehensively characterize their effects. Moreover, our investigation into the data configuration highlights the pivotal role of closed-loop samples in achieving realistic simulations. To extend the benefits of closed-loop samples across a broader range of mixture models, we further introduce a temporal disentanglement-and-alignment mechanism to address the shortcut learning and off-policy learning issues. Leveraging insights from our exploration, the distinct variants proposed within the UniMM framework, including discrete, anchor-free, and anchor-based models, all achieve state-of-the-art performance on the WOSAC benchmark.