arXivDaily: arXiv daily academic digest, updated Monday to Friday
2603.27355 2026-05-22 cs.AI cs.CL cs.SE

LLM Readiness Harness: Evaluation, Observability, and CI Gates for LLM/RAG Applications

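Alexandre Cristovão Maiorano

AI summary This paper proposes a readiness harness for LLM and RAG applications that turns evaluation into a deployment decision workflow by combining automated benchmarks, OpenTelemetry observability, and CI quality gates. It computes scenario-weighted readiness scores with Pareto frontiers and reports results on ticket-routing workflows and BEIR grounding tasks.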

Comments 19 pages, 4 figures, 15 tables

English abstract

We present a readiness harness for LLM and RAG applications that turns evaluation into a deployment decision workflow. The system combines automated benchmarks, OpenTelemetry observability, and CI quality gates under a minimal API contract, then aggregates workflow success, policy compliance, groundedness, retrieval hit rate, cost, and p95 latency into scenario-weighted readiness scores with Pareto frontiers. We evaluate the harness on ticket-routing workflows and BEIR grounding tasks (SciFact and FiQA) with full Azure matrix coverage (162/162 valid cells across datasets, scenarios, retrieval depths, seeds, and models). Results show that readiness is not a single metric: on FiQA under sla-first at k=5, gpt-4.1-mini leads in readiness and faithfulness, while gpt-5.2 pays a substantial latency cost; on SciFact, models are closer in quality but still separable operationally. Ticket-routing regression gates consistently reject unsafe prompt variants, demonstrating that the harness can block risky releases instead of merely reporting offline scores. The result is a reproducible, operationally grounded framework for deciding whether an LLM or RAG system is ready to ship.
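To make the aggregation step concrete, here is a minimal sketch of a scenario-weighted readiness score feeding a pass/fail gate. Only the six metric names and the `sla-first` scenario label come from the abstract; the weights, budgets, threshold, and the names `RunMetrics`, `readiness`, and `passes_gate` are illustrative assumptions, not the paper's contract.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class RunMetrics:
    workflow_success: float    # fraction in [0, 1]
    policy_compliance: float   # fraction in [0, 1]
    groundedness: float        # fraction in [0, 1]
    retrieval_hit_rate: float  # fraction in [0, 1]
    cost_usd: float            # cost per request (lower is better)
    p95_latency_s: float       # 95th-percentile latency (lower is better)


# Hypothetical weights for one scenario; they sum to 1.
SCENARIO_WEIGHTS: Dict[str, Dict[str, float]] = {
    "sla-first": {
        "workflow_success": 0.20,
        "policy_compliance": 0.20,
        "groundedness": 0.15,
        "retrieval_hit_rate": 0.05,
        "cost": 0.10,
        "latency": 0.30,
    },
}


def readiness(m: RunMetrics, weights: Dict[str, float],
              cost_budget_usd: float = 0.01,
              latency_budget_s: float = 5.0) -> float:
    """Weighted sum of per-metric scores, each mapped to [0, 1]."""
    parts = {
        "workflow_success": m.workflow_success,
        "policy_compliance": m.policy_compliance,
        "groundedness": m.groundedness,
        "retrieval_hit_rate": m.retrieval_hit_rate,
        # Lower-is-better metrics: 1 at zero, 0 at or beyond the budget.
        "cost": max(0.0, 1.0 - m.cost_usd / cost_budget_usd),
        "latency": max(0.0, 1.0 - m.p95_latency_s / latency_budget_s),
    }
    return sum(weights[name] * parts[name] for name in weights)


def passes_gate(score: float, threshold: float = 0.8) -> bool:
    """CI-style gate: block the release when the score is below threshold."""
    return score >= threshold
```

Under these assumed weights, two candidates that differ only in p95 latency can land on opposite sides of the gate, which is the sense in which readiness is not a single quality metric.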

2603.25958 2026-05-22 cs.LG

Cluster-Adaptive Feature Extraction and its Theoretical Foundation with Minkowski Weighted k-Means

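Renato Cordeiro de Amorim, Vladimir Makarenkov

AI summary This paper proposes Cluster-Adaptive Feature Extraction (CAFE), built on Minkowski weighted k-means. A theoretical analysis characterises the structure of the feature weights and proves that the method suppresses high-dispersion features and amplifies informative ones.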

English abstract

The Minkowski weighted $k$-means ($mwk$-means) algorithm extends classical $k$-means by incorporating feature weights and a Minkowski distance. We first show that the $mwk$-means objective can be expressed as a power-mean aggregation of within-cluster dispersions, with the order determined by the Minkowski exponent $p$. This formulation reveals how $p$ controls the transition between selective and uniform use of features. Using this representation, we derive bounds for the objective function and characterise the structure of the feature weights, showing that they depend only on relative dispersion and follow a power-law relationship with dispersion ratios. This leads to explicit guarantees on the suppression of high-dispersion features, and we establish convergence of the algorithm. Building on these theoretical results, we introduce Cluster-Adaptive Feature Extraction (CAFE), a method that uses the $mwk$-means feature weights to rescale the data prior to unsupervised feature extraction. We prove that this rescaling reverses the within-cluster dispersion ordering, suppressing noisy features and amplifying informative ones. Numerous experiments conducted under controlled within-cluster noise show that CAFE consistently improves the results of traditional feature extraction methods.
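The power-law dependence on dispersion ratios can be illustrated with a short sketch. It assumes the closed-form weight update commonly given for Minkowski weighted k-means, w_v = 1 / sum_u (D_v / D_u)^(1/(p-1)) for p > 1, applied within one cluster; the abstract does not show the paper's own notation, and the function name `mwk_feature_weights` is hypothetical.

```python
from typing import List, Sequence


def mwk_feature_weights(dispersions: Sequence[float], p: float) -> List[float]:
    """Feature weights for one cluster from its within-cluster dispersions.

    Assumed update: w_v = 1 / sum_u (D_v / D_u) ** (1 / (p - 1)), with p > 1
    and strictly positive dispersions D_v.
    """
    if p <= 1:
        raise ValueError("the Minkowski exponent p must exceed 1")
    if any(d <= 0 for d in dispersions):
        raise ValueError("dispersions must be strictly positive")
    exponent = 1.0 / (p - 1.0)
    return [
        1.0 / sum((d_v / d_u) ** exponent for d_u in dispersions)
        for d_v in dispersions
    ]
```

Because only ratios D_v / D_u appear, rescaling every dispersion by the same constant leaves the weights unchanged, and the ratio of two weights equals the inverse dispersion ratio raised to 1/(p-1). Larger p therefore flattens the weights towards uniform use of features.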

2603.20405 2026-05-22 cs.LG cs.CL cs.LO

Putnam 2025 Problems in Rocq using Opus 4.6 and Rocq-MCP

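Guillaume Baudart, Marc Lelarge, Tristan Stérin, Jules Viennot

AI summary This study reports that Opus 4.6, equipped with Rocq-MCP tools, autonomously proved 10 of the 12 problems from the 2025 Putnam Mathematical Competition. It demonstrates an approach to automated theorem proving built on the Model Context Protocol (MCP), and the proofs are publicly available.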

English abstract

We report on an experiment in which Claude Opus 4.6, equipped with a suite of Model Context Protocol (MCP) tools for the Rocq proof assistant, autonomously proved 10 of 12 problems from the 2025 Putnam Mathematical Competition. The MCP tools, designed with Claude by analyzing logs from a prior experiment on miniF2F-Rocq, encode a "compile-first, interactive-fallback" strategy. Running on an isolated VM with no internet access, the agent deployed 141 subagents over 17.7 hours of active compute (51.6h wall-clock), consuming approximately 1.9 billion tokens. All proofs are publicly available.

2603.18003 2026-05-22 cs.CV

Universal Skeleton Understanding via Differentiable Rendering and MLLMs

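Ziyi Wang, Peiming Li, Xinshun Wang, Yang Tang, Kai-Kuang Ma, Mengyuan Liu

AI summary This paper proposes SkeletonLLM, which achieves universal skeleton understanding by using differentiable rendering to translate arbitrary skeleton sequences into the native visual modality of a multimodal LLM. A cooperative training strategy strengthens reasoning, and the model generalizes strongly in open-vocabulary action recognition while extending to motion captioning and question answering across heterogeneous skeleton formats.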

Comments Accepted by ICML 2026

English abstract

Multimodal large language models (MLLMs) exhibit strong visual-language reasoning, yet cannot process structured, non-visual data such as human skeletons. Existing methods either compress skeleton dynamics into lossy feature vectors for text alignment, or quantize motion into discrete tokens that generalize poorly across heterogeneous skeleton formats. We present SkeletonLLM, which achieves universal skeleton understanding by translating arbitrary skeleton sequences into the MLLM's native visual modality. At its core is DrAction, a differentiable, format-agnostic renderer that converts skeletal kinematics into compact image sequences. Because the pipeline is end-to-end differentiable, MLLM gradients can directly guide the rendering to produce task-informative visual tokens. To further enhance reasoning capabilities, we introduce a cooperative training strategy: Causal Reasoning Distillation transfers structured, step-by-step reasoning from a teacher model, while Discriminative Finetuning sharpens decision boundaries between confusable actions. SkeletonLLM demonstrates strong generalization in open-vocabulary action recognition, while its learned reasoning capabilities naturally extend to motion captioning and question answering across heterogeneous skeleton formats, suggesting a viable path for applying MLLMs to non-native modalities. Code: https://github.com/wangzy01/SkeletonLLM.

2603.16672 2026-05-22 cs.AI cs.CL cs.CY

CritiSense: Critical Digital Literacy and Resilience Against Misinformation

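Firoj Alam, Fatema Ahmad, Ali Ezzat Shahroor, Mohamed Bayan Kmainasi, Elisa Sartori, Giovanni Da San Martino, Abul Hasnat, Raian Ali

AI summary This study presents CritiSense, a multilingual, modular mobile media-literacy app that uses short, interactive challenges to improve users' ability to recognize manipulation tactics. It serves as a prebunking platform and as a testbed for measuring the effect of microlearning on resilience to misinformation.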

Comments resilience, disinformation, misinformation, fake news, propaganda

English abstract

Misinformation on social media undermines informed decision-making and public trust. Prebunking offers a proactive complement by helping users recognize manipulation tactics before they encounter them in the wild. We present CritiSense, a mobile media-literacy app that builds these skills through short, interactive challenges with instant feedback. It is the first multilingual (supporting nine languages) and modular platform, designed for rapid updates across topics and domains. We report a usability study with 93 users: 83.9% expressed overall satisfaction and 90.1% rated the app as easy to use. Qualitative feedback indicates that CritiSense helps improve digital literacy skills. Overall, it provides a multilingual prebunking platform and a testbed for measuring the impact of microlearning on misinformation resilience. Over 6 months, we have reached 500+ active users. It is freely available to all users on the Apple App Store (https://apps.apple.com/us/app/critisense/id6749675792) and Google Play Store (https://play.google.com/store/apps/details?id=com.critisense&hl=en).

2603.08403 2026-05-22 cs.CV

SPIRAL: Self-Evolving Action-Conditioned Video Generation via Reflective Planning Agents

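Yu Yang, Yue Liao, Jianbiao Mei, Baisen Wang, Xuemeng Yang, Licheng Wen, Jiangning Zhang, Xiangtai Li, Liang Lv, Hanlin Chen, Botian Shi, Yong Liu, Shuicheng Yan, Gim Hee Lee

AI summary This paper proposes SPIRAL, a closed-loop framework in which reflective planning agents drive long-horizon action-conditioned video generation. It targets the failure modes of open-loop, single-shot generators, and its closed-loop design and self-evolution mechanism improve the consistency and accuracy of the generated video.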

Comments 42 Pages, 21 Figures, Project page at https://yuyang-cloud.github.io/spiral

English abstract

Long-horizon action-conditioned video generation aims to synthesize temporally coherent videos that follow complex action instructions over extended horizons, requiring procedural ordering, persistent action execution, and scene consistency beyond conventional TI2V's short-term fidelity. Existing single-shot video generation models typically operate in an open-loop manner, leading to incomplete action execution, hallucinated motions, and temporal drift. To address this, we propose SPIRAL, a closed-loop framework that performs sequential planning and iterative reflection for action-conditioned long-horizon video generation. Specifically, SPIRAL instantiates a think-act-reflect process: a PlanAgent decomposes high-level goals into sub-actions, which condition a VideoGenerator to synthesize each segment alongside a memory context, while a CriticAgent evaluates intermediate video segments to provide corrective feedback for iterative refinement. This closed-loop design further supports self-evolution by utilizing PlanAgent-proposed actions and CriticAgent-derived rewards for GRPO-based post-training to enhance the video generator's long-horizon consistency. Moreover, we introduce ActVideoGen-Dataset for task-specific training, and establish ActVideoGen-Bench as a dedicated evaluation suite for measuring action quality and temporal coherence. Experiments across multiple TI2V backbones alongside the self-evolving strategy show consistent gains on ActVideoGen-Bench and VBench, demonstrating the effectiveness of SPIRAL.
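The think-act-reflect loop described above can be sketched as plain control flow. This is only a sketch under stated assumptions: the three callables stand in for the PlanAgent, VideoGenerator, and CriticAgent, and their signatures, the string-typed segments, and the `max_refinements` budget are hypothetical choices, not the paper's interfaces.

```python
from typing import Callable, List, Tuple


def think_act_reflect(
    goal: str,
    plan: Callable[[str], List[str]],                  # stand-in for PlanAgent
    generate: Callable[[str, List[str], str], str],    # stand-in for VideoGenerator
    critique: Callable[[str, str], Tuple[bool, str]],  # stand-in for CriticAgent
    max_refinements: int = 2,
) -> List[str]:
    """Generate one segment per sub-action, refining it on critic feedback."""
    memory: List[str] = []  # segments produced so far, passed back as context
    for sub_action in plan(goal):  # think: decompose the goal
        feedback = ""
        segment = ""
        for _ in range(max_refinements + 1):
            segment = generate(sub_action, memory, feedback)    # act
            accepted, feedback = critique(sub_action, segment)  # reflect
            if accepted:
                break
        # Keep the last attempt even if the refinement budget ran out.
        memory.append(segment)
    return memory
```

The loop is closed in the sense that each rejected segment is regenerated with the critic's feedback, and every later segment sees the earlier ones through `memory`.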

2603.03454 2026-05-22 cs.LG

[Re] FairDICE: A Fair Tradeoff in Multi-objective Offline RL

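Peter Adema, Karim Galliamov, Aleksey Evstratovskiy, Ross Geurts

AI summary This replication study examines FairDICE, a multi-objective offline RL algorithm that learns objective weights automatically to reach a fair compromise. It finds that a code error reduces FairDICE to standard behaviour cloning in continuous environments and that important hyperparameters were underspecified, and it concludes that the experimental justification requires significant revision.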

Comments 12 pages, 8 figures in main text. Code at https://github.com/p-adema/re-fairdice. Reviewed at https://openreview.net/forum?id=Tr6MBt0hAj

Journal ref
Published 05/2026 in Transactions on Machine Learning Research

English abstract

Offline Reinforcement Learning (RL) is an emerging field of RL in which policies are learned solely from demonstrations. Within offline RL, some environments involve balancing multiple objectives, but existing multi-objective offline RL algorithms do not provide an efficient way to find a fair compromise. FairDICE (see arXiv:2506.08062v2) seeks to fill this gap by adapting OptiDICE (an offline RL algorithm) to automatically learn weights for multiple objectives to e.g. incentivise fairness among objectives. As this would be a valuable contribution, this replication study examines the replicability of claims made regarding FairDICE. We find that many theoretical claims hold, but an error in the code reduces FairDICE to standard behaviour cloning in continuous environments, and many important hyperparameters were originally underspecified. After rectifying this, we show in experiments extending the original paper that FairDICE can scale to complex environments and high-dimensional rewards, though it can be reliant on (online) hyperparameter tuning. We conclude that FairDICE is a theoretically interesting method, but the experimental justification requires significant revision.

2603.02938 2026-05-22 cs.LG cs.AI

Beyond One-Size-Fits-All: Adaptive Subgraph Denoising for Zero-Shot Graph Learning with Large Language Models

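Fengzhi Li, Liang Zhang, Yuan Zuo, Ruiqing Zhao, YanSong Liu, Yunfei Ma, Fanyu Meng, Junlan Feng

AI summary This paper proposes GraphSSR, a framework for adaptive subgraph extraction and denoising. It removes the structural noise introduced by one-size-fits-all subgraph extraction and improves zero-shot graph reasoning with large language models.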

English abstract

Graph-based tasks in the zero-shot setting remain a significant challenge due to data scarcity and the inability of traditional Graph Neural Networks (GNNs) to generalize to unseen domains or label spaces. While recent advancements have transitioned toward leveraging Large Language Models (LLMs) as predictors to enhance GNNs, these methods often suffer from cross-modal alignment issues. A recent paradigm (i.e., Graph-R1) overcomes the aforementioned architectural dependencies by adopting a purely text-based format and utilizing LLM-based graph reasoning, showing improved zero-shot generalization. However, it employs a task-agnostic, one-size-fits-all subgraph extraction strategy, which inevitably introduces significant structural noise--irrelevant neighbors and edges--that distorts the LLMs' receptive field and leads to suboptimal predictions. To address this limitation, we introduce GraphSSR, a novel framework designed for adaptive subgraph extraction and denoising in zero-shot LLM-based graph reasoning. Specifically, we propose the SSR pipeline, which dynamically tailors subgraph extraction to specific contexts through a "Sample-Select-Reason" process, enabling the model to autonomously filter out task-irrelevant neighbors and overcome the one-size-fits-all issue. To internalize this capability, we develop SSR-SFT, a data synthesis strategy that generates high-quality SSR-style graph reasoning traces for supervised fine-tuning of LLMs. Furthermore, we propose SSR-RL, a two-stage reinforcement learning framework that explicitly regulates sampling and selection operations within the proposed SSR pipeline designed for adaptive subgraph denoising. By incorporating Authenticity-Reinforced and Denoising-Reinforced RL, we guide the model to achieve accurate predictions using parsimonious, denoised subgraphs for reasoning.

2602.23231 2026-05-22 cs.CV

Skarimva: Skeleton-based Action Recognition is a Multi-view Application

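Daniel Bermuth, Alexander Poeppel, Wolfgang Reif

AI summary This paper argues that skeleton-based action recognition should be treated as a multi-view application. Triangulating more accurate 3D skeletons from multiple camera views significantly improves existing action recognition models, which indicates that input data quality is a limiting factor, and the authors recommend multi-view setups as the standard for future research.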

English abstract

Human action recognition plays an important role when developing intelligent interactions between humans and machines. While there is a lot of active research on improving the machine learning algorithms for skeleton-based action recognition, not much attention has been given to the quality of the input skeleton data itself. This work demonstrates that by making use of multiple camera views to triangulate more accurate 3D skeletons, the performance of state-of-the-art action recognition models can be improved significantly. This suggests that the quality of the input data is currently a limiting factor for the performance of these models. Based on these results, it is argued that the cost-benefit ratio of using multiple cameras is very favorable in most practical use-cases, therefore future research in skeleton-based action recognition should consider multi-view applications as the standard setup.
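As an illustration of the triangulation step, the sketch below recovers one 3D joint from its pixel coordinates in several calibrated views using the standard direct linear transform (DLT). The abstract does not say which triangulation method the authors use, so DLT is an assumption here, and the helper names `project` and `triangulate_point` are hypothetical.

```python
import numpy as np


def project(P: np.ndarray, X: np.ndarray) -> np.ndarray:
    """Project a 3D point X with a 3x4 camera matrix P to pixel coordinates."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]


def triangulate_point(projections, pixels) -> np.ndarray:
    """Linear (DLT) triangulation of one point from two or more views.

    Each view contributes two rows, u * P[2] - P[0] and v * P[2] - P[1],
    to a homogeneous system A @ X_h = 0, solved in the least-squares sense
    by the right singular vector with the smallest singular value.
    """
    rows = []
    for P, (u, v) in zip(projections, pixels):
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    A = np.stack(rows)
    _, _, vt = np.linalg.svd(A)
    X_h = vt[-1]
    return X_h[:3] / X_h[3]  # back from homogeneous coordinates
```

Applying `triangulate_point` to every joint in every frame yields a 3D skeleton sequence; adding views adds rows to `A`, which is what makes the estimate more robust to noise in any single camera.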

2602.22719 2026-05-22 cs.LG

Interpreting and Steering State-Space Models via Activation Subspace Bottlenecks

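Vamshi Sunku Mohan, Kaustubh Gupta, Aneesha Das, Chandan Singh

AI summary This paper identifies activation subspace bottlenecks in Mamba-family state-space models and introduces a test-time steering intervention that multiplies the bottleneck activations by a scalar. The intervention improves performance across multiple models and benchmarks, and further experiments confirm that the bottlenecks hinder performance.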

English abstract

State-space models (SSMs) have emerged as an efficient strategy for building powerful language models, avoiding the quadratic complexity of computing attention in transformers. Despite their promise, the interpretability and steerability of modern SSMs remain relatively underexplored. We take a major step in this direction by identifying activation subspace bottlenecks in the Mamba family of SSM models using tools from mechanistic interpretability. We then introduce a test-time steering intervention that simply multiplies the activations of the identified bottlenecks by a scalar. Across 7 SSMs and 6 diverse benchmarks, this intervention improves performance by an average of 8.27%, without requiring any task-specific tuning. Finally, we validate that the identified bottlenecks are indeed hindering performance by modifying them to yield an architecture we call Stable-Mamba, which achieves long-context performance gains when retrained from scratch.
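The steering intervention is simple enough to sketch in a few lines. The sketch assumes the bottleneck is given as a set of channel indices along the last axis of an activation array; how the paper actually identifies and parameterizes the subspace is not stated in the abstract, and `steer_bottleneck` is a hypothetical name.

```python
import numpy as np


def steer_bottleneck(activations, bottleneck_channels, scale: float) -> np.ndarray:
    """Return a copy of `activations` with the selected channels multiplied by `scale`.

    `activations` has shape (..., channels); all other channels are untouched.
    """
    steered = np.array(activations, dtype=float, copy=True)
    steered[..., bottleneck_channels] *= scale
    return steered
```

In a real model this would run inside the forward pass at the layer where the bottleneck sits, with no retraining; the scalar is the only quantity to choose.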

2602.22270 2026-05-22 cs.LG q-bio.PE

Prior Knowledge-enhanced Spatio-temporal Epidemic Forecasting

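Sijie Ruan, Jinyu Li, Jia Wei, Zenghao Xu, Jie Bao, Junshi Xu, Junyang Qiu, Shuliang Wang, Xiaoxiao Wang, Hanning Yuan

AI summary This paper proposes STOEP, a hybrid framework that integrates implicit spatio-temporal priors with explicit expert priors. It improves spatio-temporal epidemic forecasting by dynamically adjusting regional dependencies, amplifying weak signals, and applying mechanistic forecasting.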

Comments 12 pages, 10 figures, accepted to IJCAI 2026

English abstract

Spatio-temporal epidemic forecasting is critical for public health management, yet existing methods often struggle with insensitivity to weak epidemic signals, over-simplified spatial relations, and unstable parameter estimation. To address these challenges, we propose the Spatio-Temporal priOr-aware Epidemic Predictor (STOEP), a novel hybrid framework that integrates implicit spatio-temporal priors and explicit expert priors. STOEP consists of three key components: (1) Case-aware Adjacency Learning (CAL), which dynamically adjusts mobility-based regional dependencies using historical infection patterns; (2) Space-informed Parameter Estimating (SPE), which employs learnable spatial priors to amplify weak epidemic signals; and (3) Filter-based Mechanistic Forecasting (FMF), which uses an expert-guided adaptive thresholding strategy to regularize epidemic parameters. Extensive experiments on real-world COVID-19 and influenza datasets demonstrate that STOEP outperforms the best baseline by 11.1% in RMSE. The system has been deployed at a provincial CDC in China to facilitate downstream applications.

2602.20845 2026-05-22 cs.CV

FLIM Networks with Bag of Feature Points

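João Deltregia Martinelli, Marcelo Luis Rodrigues Filho, Felipe Crispim da Rocha Salvagnini, Gilson Junior Soares, Jefersson A. dos Santos, Alexandre X. Falcão

AI summary This paper proposes FLIM-BoFP, a faster filter estimation method for FLIM networks, and evaluates its efficiency, effectiveness, and generalization against FLIM-Cluster and other state-of-the-art baselines for parasite detection in optical microscopy images.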

Comments Accepted at the 28th Iberoamerican Congress on Pattern Recognition (CIARP 2025). To appear in Lecture Notes in Computer Science (LNCS), Springer

English abstract

Convolutional networks require extensive image annotation, which can be costly and time-consuming. Feature Learning from Image Markers (FLIM) tackles this challenge by estimating encoder filters (i.e., kernel weights) from user-drawn markers on discriminative regions of a few representative images without traditional optimization. Such an encoder combined with an adaptive decoder comprises a FLIM network fully trained without backpropagation. Prior research has demonstrated their effectiveness in Salient Object Detection (SOD), being significantly lighter than existing lightweight models. This study revisits FLIM SOD and introduces FLIM-Bag of Feature Points (FLIM-BoFP), a considerably faster filter estimation method. The previous approach, FLIM-Cluster, derives filters through patch clustering at each encoder's block, leading to computational overhead and reduced control over filter locations. FLIM-BoFP streamlines this process by performing a single clustering at the input block, creating a bag of feature points, and defining filters directly from mapped feature points across all blocks. The paper evaluates the benefits in efficiency, effectiveness, and generalization of FLIM-BoFP compared to FLIM-Cluster and other state-of-the-art baselines for parasite detection in optical microscopy images.

2602.18141 2026-05-22 cs.LG

Geometry-Induced Diffusion on Graphs: A Learnable Weighted Laplacian for Spectral GNNs

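Mia Zosso, Ali Hariri, Victor Kawasaki-Borruat, Pierre-Gabriel Berlureau, Pierre Vandergheynst

AI summary This paper proposes mu-ChebNet, a simple spectral GNN that learns a node-wise weight function mu to modify the graph Laplacian, changing the propagation geometry without altering the graph topology. The learned geometry promotes preferred routes for information propagation and helps long-range signals avoid highly contractive bottlenecks without repeated layer stacking.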

English abstract

Long-range graph tasks are challenging for Graph Neural Networks (GNNs): global mechanisms such as attention or rewiring schemes can be computationally expensive, while deep local propagation is prone to vanishing gradients, oversmoothing, and oversquashing. The introduced mu-ChebNet architecture is a simple spectral GNN that learns a node-wise weight function mu before applying ChebNet-style filters. The learned weighting mu induces a modified graph Laplacian which effectively changes the propagation geometry without altering the graph topology. This task-dependent geometry promotes preferred routes for information propagation, thereby helping long-range signals avoid highly contractive bottlenecks, and obviating the need for repeated layer stacking. In practice, we replace the fixed graph Laplacian L by a learned operator L_mu, keeping the proposed mu-ChebNet architecture lightweight while making propagation task-adaptive. Furthermore, we provide a spectral analysis demonstrating how mu modulates propagation dynamics, and empirically observe improved performance on both synthetic long-range reasoning tasks and real-world graph benchmarks. The learned weight function is not only interpretable, but also offers a lightweight alternative to attention and rewiring for adaptive graph propagation.

2602.17517 2026-05-22 cs.CV

Depth Augmented and FE Free 3D/2D Liver Registration for Laparoscopic Liver AR

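Hanyuan Zhang, Lucas He, Runlong He, Weixi Yi, Abdolrahim Kadkhodamohammadi, Danail Stoyanov, Brian R. Davidson, Evangelos B. Mazomenos, Matthew J. Clarkson

AI summary This study proposes a depth-augmented, finite-element-free 3D/2D liver registration method that combines robust rigid initialization with patient-specific non-rigid refinement to improve 3D-to-2D registration for AR guidance in laparoscopic liver surgery.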

English abstract

Augmented reality (AR) guidance in laparoscopic liver surgery requires accurate registration of preoperative 3D models to intraoperative 2D video, but remains challenging due to partial visibility, specularities, and tissue deformation. Existing methods often rely on contour-based rigid initialization and finite-element (FE) models for deformable registration, increasing modeling and engineering complexity. We present a depth-augmented, FE-free 3D-2D registration pipeline that combines robust rigid initialization with patient-specific non-rigid refinement. For rigid alignment, we adapt the RefineNet module of FoundationPose to laparoscopic liver scenes by using multi-class contour maps and monocular depth for relative pose refinement. For deformable alignment, we construct a patient-specific statistical deformation model from non-rigid ICP (NICP) correspondences and optimize pose and shape parameters using a coarse-to-fine L-BFGS-B strategy. On a public clinical laparoscopic liver dataset, the proposed method achieves a mean target registration error (TRE) of 14.73 mm under a controlled manual-contour setting designed to isolate registration performance. Ablation studies show that monocular depth improves rigid initialization over contour-only inputs, while tumor-mapping analysis indicates that good surface alignment does not necessarily translate into lower target localization error. On an external dataset without ground truth, the method produces visually plausible overlays for qualitative assessment. These results suggest that depth-augmented pose refinement and FE-free statistical deformation modeling provide a promising alternative to FE-based pipelines for controlled 3D-2D liver registration in surgical AR.
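For readers unfamiliar with the reported metric, the sketch below computes a mean target registration error as the average Euclidean distance between registered target points and their ground-truth positions. This is the usual definition; the abstract does not spell out which targets the authors use, and `mean_tre` is a hypothetical helper name.

```python
import math
from typing import Sequence, Tuple

Point3D = Tuple[float, float, float]


def mean_tre(registered: Sequence[Point3D], ground_truth: Sequence[Point3D]) -> float:
    """Mean Euclidean distance between paired 3D targets, in the input units (e.g. mm)."""
    if len(registered) != len(ground_truth) or not registered:
        raise ValueError("need two non-empty point lists of equal length")
    distances = [math.dist(r, g) for r, g in zip(registered, ground_truth)]
    return sum(distances) / len(distances)
```

Because TRE is measured at internal targets rather than on the visible surface, a registration can fit the surface well and still score poorly, which is consistent with the tumor-mapping observation in the abstract.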

2602.17385 2026-05-22 cs.AI

Dataless Weight Disentanglement in Task Arithmetic via Kronecker-Factored Approximate Curvature

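Angelo Porrello, Pietro Buzzega, Felix Dangel, Thomas Sommariva, Riccardo Salami, Lorenzo Bonicelli, Simone Calderara

AI summary This paper proposes a dataless method that frames regularization against representation drift as a curvature matrix approximation problem. It addresses cross-task interference between task vectors in task arithmetic and achieves state-of-the-art results in task addition and negation.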

Comments Accepted to ICLR 2026

English abstract

Task Arithmetic yields a modular, scalable way to adapt foundation models. Combining multiple task vectors, however, can lead to cross-task interference, causing representation drift and degraded performance. Representation drift regularization provides a natural remedy to disentangle task vectors; however, existing approaches typically require external task data, conflicting with modularity and data availability constraints (e.g., privacy requirements). We propose a dataless approach by framing regularization against representation drift as a curvature matrix approximation problem. This allows us to leverage well-established techniques; in particular, we adopt Kronecker-Factored Approximate Curvature and obtain a practical regularizer that achieves state-of-the-art results in task addition and negation. Our method has constant complexity in the number of tasks and promotes robustness to task vector rescaling, eliminating the need for held-out tuning.
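The appeal of a Kronecker-factored curvature is that a quadratic penalty on a weight update never needs the full curvature matrix. The sketch below checks the underlying identity for one linear layer, assuming the curvature block is approximated as A ⊗ G with symmetric factors A (input side) and G (output side). How the paper builds its regularizer from these factors is not given in the abstract, and both function names are hypothetical.

```python
import numpy as np


def kfac_penalty(delta_w: np.ndarray, a_factor: np.ndarray, g_factor: np.ndarray) -> float:
    """vec(dW)^T (A kron G) vec(dW), computed as trace(dW^T G dW A).

    delta_w: (out, in); a_factor: (in, in); g_factor: (out, out).
    """
    return float(np.trace(delta_w.T @ g_factor @ delta_w @ a_factor))


def dense_penalty(delta_w: np.ndarray, a_factor: np.ndarray, g_factor: np.ndarray) -> float:
    """Same quantity via the explicit (in*out) x (in*out) Kronecker product."""
    v = delta_w.flatten(order="F")  # column-stacking vec()
    return float(v @ np.kron(a_factor, g_factor) @ v)
```

The factored form costs a few small matrix products, while the dense form materializes a matrix whose side length is the number of weights in the layer, so only the former scales to real models.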

2602.13372 2026-05-22 cs.AI cs.LG

MoralityGym: A Benchmark for Evaluating Hierarchical Moral Alignment in Sequential Decision-Making Agents

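Simon Rosen, Siddarth Singh, Ebenezer Gelo, Helen Sarah Robertson, Ibrahim Suder, Victoria Williams, Benjamin Rosman, Geraud Nangue Tasse, Steven James

AI summary This paper introduces MoralityGym, a benchmark of 98 ethical-dilemma problems for evaluating hierarchical moral alignment in sequential decision-making agents. Moral norms are represented as ordered deontic constraints, and the evaluation integrates insights from psychology and philosophy.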

Comments Accepted at AAMAS 2026

Journal ref
Proc of the 25th International Conference on Autonomous Agents and Multiagent Systems AAMAS 2026, Paphos, Cyprus, May 25 to 29, 2026, IFAAMAS

English abstract

Evaluating moral alignment in agents navigating conflicting, hierarchically structured human norms is a critical challenge at the intersection of AI safety, moral philosophy, and cognitive science. We introduce Morality Chains, a novel formalism for representing moral norms as ordered deontic constraints, and MoralityGym, a benchmark of 98 ethical-dilemma problems presented as trolley-dilemma-style Gymnasium environments. By decoupling task-solving from moral evaluation and introducing a novel Morality Metric, MoralityGym allows the integration of insights from psychology and philosophy into the evaluation of norm-sensitive reasoning. Baseline results with Safe RL methods reveal key limitations, underscoring the need for more principled approaches to ethical decision-making. This work provides a foundation for developing AI systems that behave more reliably, transparently, and ethically in complex real-world contexts.
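One way to read "ordered deontic constraints" is as a lexicographic preference: any violation of a higher-priority norm outweighs any number of violations of lower-priority ones. The sketch below implements that reading. It is an assumption for illustration only; the norm names are hypothetical, and the paper's Morality Chains formalism and Morality Metric are not specified in the abstract.

```python
from typing import Dict, Tuple

# Hypothetical norms, listed from highest to lowest priority.
NORM_PRIORITY: Tuple[str, ...] = ("do_not_kill", "do_not_injure", "keep_promise")


def violation_profile(violations: Dict[str, int]) -> Tuple[int, ...]:
    """Violation counts ordered by norm priority; missing norms count as 0."""
    return tuple(violations.get(norm, 0) for norm in NORM_PRIORITY)


def preferred(a: Dict[str, int], b: Dict[str, int]) -> Dict[str, int]:
    """Return the trajectory with the lexicographically smaller violation profile.

    Python compares tuples element by element, so an earlier (higher-priority)
    norm always dominates later ones. Ties return `a`.
    """
    return a if violation_profile(a) <= violation_profile(b) else b
```

Keeping this comparison separate from the task reward mirrors the decoupling of task-solving from moral evaluation that the abstract describes.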

2602.12952 2026-05-22 cs.LG cs.AI cs.CV

Transporting Task Vectors across Different Architectures without Training

Filippo Rinaldi, Aniello Panariello, Giacomo Salici, Angelo Porrello, Simone Calderara

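AI summary This paper proposes Theseus, a method that transports task updates across models of different widths via functional matching, without training or backpropagation, and reports improvements on vision and language models.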

Comments Accepted at the International Conference on Machine Learning (ICML), 2026

Abstract

Adapting large pre-trained models to downstream tasks often produces task-specific parameter updates that are expensive to relearn for every model variant. While recent work has shown that such updates can be transferred between models with identical architectures, transferring them across models of different widths remains unexplored. In this work, we introduce Theseus, a training-free method for transporting task updates across heterogeneous-width models. Rather than matching parameters, we characterize a task update by the functional effect it induces on intermediate representations. We formalize task-vector transport as a functional matching problem on observed activations and show that, after aligning representation spaces via orthogonal Procrustes analysis, it admits a stable closed-form solution that preserves the geometry of the update. We evaluate Theseus on vision and language models across different widths, showing consistent improvements over baselines without additional training or backpropagation. Our results show that task updates can be meaningfully transferred across architectures when task identity is defined functionally rather than parametrically. Code is available at https://github.com/apanariello4/merge-and-rebase.
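
The alignment step named in the abstract is orthogonal Procrustes analysis. The following is a minimal sketch of the classical square case only, assuming NumPy and two activation matrices of equal width; the paper's different-width setting and its closed-form transport are not reproduced here, and `procrustes_align` is a hypothetical helper name.

```python
import numpy as np

def procrustes_align(source, target):
    """Return the orthogonal R that minimizes ||source @ R - target||_F."""
    # Classical solution: take the SVD of the cross-covariance source^T target
    # and discard the singular values.
    u, _, vt = np.linalg.svd(source.T @ target)
    return u @ vt

# Synthetic check: hide a known orthogonal map and recover it.
rng = np.random.default_rng(0)
source = rng.normal(size=(50, 4))             # 50 activations of width 4
q, _ = np.linalg.qr(rng.normal(size=(4, 4)))  # random orthogonal matrix
target = source @ q
rotation = procrustes_align(source, target)
```

Because the solution is orthogonal, it preserves norms and angles of whatever is mapped through it, which is one plausible reading of the abstract's remark about preserving the geometry of the update.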

2602.12506 2026-05-22 cs.LG

On Robustness and Chain-of-Thought Consistency of RL-Finetuned VLMs

Rosie Zhao, Anshul Shah, Xiaoyu Zhu, Xinke Deng, Zhongyu Jiang, Yang Yang, Joerg Liebelt, Arnab Mondal

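AI summary This paper studies the robustness and chain-of-thought (CoT) consistency of RL-finetuned VLMs on visual reasoning tasks. It finds that textual perturbations, such as misleading captions or incorrect CoT traces, substantially reduce robustness and confidence, while closed models maintain greater robustness and reasoning consistency, suggesting that the gap stems from shortcomings in current open-source RL finetuning rather than an inherent limitation of the task.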

Comments ICML 2026

Abstract

Reinforcement learning (RL) finetuning has become a key technique for enhancing large language models (LLMs) on reasoning-intensive tasks, motivating its extension to vision-language models (VLMs). While RL-tuned VLMs improve on visual reasoning benchmarks, they remain vulnerable to weak visual grounding, hallucinations, and over-reliance on textual cues. We show that simple, controlled textual perturbations, including misleading captions or incorrect chain-of-thought (CoT) traces, cause substantial drops in robustness and confidence, and that these effects are more pronounced when CoT consistency is taken into account across open-source multimodal reasoning models. In contrast, closed models exhibit similar failure modes but maintain markedly greater robustness and reasoning consistency, suggesting that the gap reflects a shortcoming in current open-source RL finetuning rather than an inherent limitation of the task. To better understand these vulnerabilities, we further analyze RL finetuning dynamics and uncover an accuracy-faithfulness trade-off: finetuning raises benchmark accuracy, but can simultaneously erode the reliability of the accompanying CoT and its robustness to contextual shifts. Although adversarial augmentation improves robustness, it does not by itself prevent faithfulness drift. Incorporating a faithfulness-aware reward can restore alignment between answers and reasoning, but when paired with augmentation, training risks collapsing onto shortcut strategies and robustness remains elusive. Together, these findings highlight the limitations of accuracy-only evaluations and motivate training and assessment protocols that jointly emphasize correctness, robustness, and the faithfulness of visually grounded reasoning.

2602.10894 2026-05-22 cs.LG cs.AI

Revisiting Regularized Policy Optimization for Stable and Efficient Reinforcement Learning in Two-Player Games

Kazuki Ota, Takayuki Osa, Motoki Omura, Tatsuya Harada

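AI summary This paper revisits policy optimization with reverse Kullback-Leibler regularization and entropy regularization, analyzing the combination theoretically and empirically in two-player zero-sum settings. It provides new convergence guarantees verified by numerical experiments on synthetic games, derives a practical model-free reinforcement learning algorithm from the regularized policy optimization, and validates its training efficiency through comprehensive experiments on five board games.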

Comments Accepted at ICML 2026

Abstract

Two-player games such as board games have long been used as traditional benchmarks for reinforcement learning. This work revisits a policy optimization method with reverse Kullback-Leibler regularization and entropy regularization and analyzes this combination in two-player zero-sum settings from theoretical and empirical perspectives. From a theoretical perspective, we investigate the stability of the policy update rule in two theoretical settings: game-theoretic normal-form games and finite-length games. We provide novel convergence guarantees and verify our theoretical results through numerical experiments on synthetic games. From an empirical perspective, we derive a practical model-free reinforcement learning algorithm based on the regularized policy optimization. We validate the training efficiency of our algorithm through comprehensive experiments on five board games: Animal Shogi, Gardner Chess, Go, Hex, and Othello. Experimental results show that our agent learns more efficiently than existing methods across environments.
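
To make the combination of reverse KL and entropy regularization concrete, here is a minimal sketch on a normal-form zero-sum game. It assumes a generic one-step objective, maximize <p, q> + tau * H(p) - KL(p || p_old) / eta, whose closed-form solution is used below; the paper's exact update rule, step sizes, and games may differ, and `regularized_step`, `eta`, and `tau` are illustrative names.

```python
import numpy as np

def regularized_step(policy, q_values, eta, tau):
    """Closed-form maximizer of <p, q> + tau*H(p) - KL(p || policy)/eta."""
    logits = (np.log(policy) + eta * q_values) / (1.0 + eta * tau)
    logits -= logits.max()  # subtract the max for numerical stability
    new_policy = np.exp(logits)
    return new_policy / new_policy.sum()

# Rock-paper-scissors payoff for the row player (the column player gets the negative).
payoff = np.array([[0.0, -1.0, 1.0],
                   [1.0, 0.0, -1.0],
                   [-1.0, 1.0, 0.0]])
row = np.array([0.6, 0.3, 0.1])
col = np.array([0.2, 0.2, 0.6])
for _ in range(500):
    row_q = payoff @ col      # action values of the row player
    col_q = -payoff.T @ row   # action values of the column player
    # Both players update simultaneously from the previous policies.
    row, col = (regularized_step(row, row_q, eta=0.1, tau=0.5),
                regularized_step(col, col_q, eta=0.1, tau=0.5))
```

In this symmetric game the entropy-regularized equilibrium is the uniform policy, so both `row` and `col` should approach (1/3, 1/3, 1/3). The KL term keeps each step close to the previous policy, while the entropy term damps the cycling that unregularized updates exhibit in rock-paper-scissors.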

2602.10085 2026-05-22 cs.AI

CODE-SHARP: Continuous Open-ended Discovery and Evolution of Skills as Hierarchical Reward Programs

Richard Bornemann, Pierluigi Vito Amadori, Antoine Cully

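AI summary This work proposes the CODE-SHARP framework, in which foundation models autonomously discover and evolve skills as hierarchical reward programs, enabling a generalist agent policy to be trained from scratch with reinforcement learning, without pre-defined rewards, and to learn long-horizon skills efficiently.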

Comments Preprint

Abstract

A core quality of general intelligence is the ability to open-endedly expand and evolve its set of mastered skills autonomously. While recent Foundation Model (FM) driven approaches have shown promising results towards this goal, they typically rely on significant human-in-the-loop engineering, limiting their transferability to novel environments. To address this, we introduce Continuous Open-ended Discovery and Evolution of Skills as Hierarchical Reward Programs (CODE-SHARP), a framework that leverages FMs to open-endedly grow and evolve an archive of Python programs encoding skills to train a generalist agent policy entirely from scratch via reinforcement learning, directly from source code. These programs, termed Skills as Hierarchical Reward Programs (SHARPs), each encode a local success condition and a set of prerequisites delegated to previously discovered SHARPs. At runtime, SHARPs dynamically route the agent through their prerequisite chain based on the current state, rewarding each completion along the way, requiring the agent to learn only the marginal behaviour each new SHARP introduces, enabling efficient learning of long-horizon skills without any pre-defined rewards. On Craftax-Classic and XLand, agents trained fully autonomously by CODE-SHARP outperform previous works by 6x and 2.6x in median performance and are the only agents capable of crafting iron tools and mining diamonds. Scaled to Craftax-Extended, CODE-SHARP trains a generalist agent on over 90 discovered SHARPs, enabling the agent to solve challenging long-horizon tasks zero-shot, matching agents trained on ground-truth rewards.
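
The routing behaviour described above (a local success condition plus prerequisites delegated to earlier skills) can be sketched as a small data structure. Everything here is hypothetical: the `Sharp` class, the dictionary-based state, and the example skills are illustrative assumptions, not the authors' program format, and reward emission is omitted.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

State = Dict[str, int]  # hypothetical state: inventory counts

@dataclass
class Sharp:
    name: str
    success: Callable[[State], bool]  # local success condition
    prerequisites: List["Sharp"] = field(default_factory=list)

    def route(self, state: State) -> "Sharp":
        """Return the skill the agent should pursue in the given state."""
        if self.success(state):
            return self  # already complete, nothing further to route to
        for prereq in self.prerequisites:
            if not prereq.success(state):
                # Delegate to the first unmet prerequisite, recursively.
                return prereq.route(state)
        return self  # all prerequisites met: pursue this skill itself

wood = Sharp("collect_wood", lambda s: s.get("wood", 0) >= 1)
table = Sharp("place_table", lambda s: s.get("table", 0) >= 1, [wood])
pickaxe = Sharp("make_pickaxe", lambda s: s.get("pickaxe", 0) >= 1, [wood, table])
```

With an empty inventory, `pickaxe.route({})` resolves to `collect_wood`; once wood and a table are present it resolves to `make_pickaxe` itself, so only the marginal behaviour of the newest skill remains to be learned.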

2602.10009 2026-05-22 cs.AI cs.HC

Discovering High Level Patterns from Simulation Traces

Sean Memery, Kartic Subr

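AI summary This paper proposes an unsupervised learning method based on program synthesis that translates simulation traces into a sparse representation of high-level patterns, improving the ability of large language models to reason about physical systems.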

Abstract

Large Language Models (LLMs) are unable to reliably reason about specific physical systems. Attempts to imbue LLMs with knowledge of the necessary physics concepts have shown great promise, but explainability and validation remain open challenges. An emerging alternative is tooling, where LLMs can query physical simulators and use the resulting simulation traces as context for validation. This approach suffers from poor scalability since simulation traces contain large volumes of fine-grained numerical and semantic data. We show that translating simulation traces to a sparse representation of "high-level" structural patterns leads to more effective interpretation by LLMs. We propose an unsupervised learning scheme to perform this translation, or annotation, via program synthesis. Our learning results in a library of programs that act as pattern detectors which can translate simulation traces to sparse, annotated pattern sequences. The detected patterns may optionally be guided by human experts via string labels (rigid collision, stretching spring, etc.). We show, using a recent physics benchmark, that such annotated representations are more amenable to natural language reasoning about specific physical systems. The synthesized programs serve as transparent, explainable functions that map system states to a sparse and efficient annotation space. As an example application, we show how goals within physical systems that are specified in natural language may be converted to reward programs which are maximized to find solutions.

2602.09851 2026-05-22 cs.LG

CoFEH: LLM-driven Feature Engineering Empowered by Collaborative Bayesian Hyperparameter Optimization

Beicheng Xu, Keyao Ding, Wei Liu, Yupeng Lu, Bin Cui

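AI summary This paper proposes CoFEH, a framework that interleaves LLM-driven feature engineering (FE) with Bayesian hyperparameter optimization (HPO) for robust end-to-end AutoML. It addresses the rigid search spaces and lack of domain awareness of traditional methods, and introduces a mutual conditioning mechanism that improves the synergy between FE and HPO.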

Comments Accepted at KDD 2026. Extended version with full appendices

Abstract

Feature Engineering (FE) is pivotal in automated machine learning (AutoML) but remains a bottleneck for traditional methods, which operate within rigid search spaces and lack domain awareness. While Large Language Models (LLMs) offer a promising alternative to generate unbounded operators with semantic reasoning, existing methods focus on isolated subtasks such as feature generation, falling short of free-form FE pipelines. Moreover, they are rarely coupled with hyperparameter optimization (HPO) of the downstream ML model, leading to greedy "FE-then-HPO" workflows that cannot capture strong FE-HPO interactions. In this paper, we present CoFEH, a collaborative framework that interleaves LLM-based FE and Bayesian HPO for robust end-to-end AutoML. CoFEH uses an LLM-driven FE optimizer powered by Tree of Thought (TOT) to explore flexible FE pipelines, a Bayesian optimization (BO) module to solve HPO, and a dynamic optimizer selector that adaptively interleaves FE and HPO steps. Crucially, we introduce a mutual conditioning mechanism that shares context between LLM and BO, enabling mutually informed decisions. Experiments show that CoFEH outperforms both traditional and LLM-based baselines in both standalone FE and joint FE+HPO settings.

2602.08064 2026-05-22 cs.LG cs.AI cs.CL

SiameseNorm: Breaking the Barrier to Reconciling Pre/Post-Norm

Tianyu Li, Dongchen Han, Zixuan Cao, Haofeng Huang, Mengyu Zhou, Ming Chen, Erchao Zhao, Xiaoxi Jiang, Guanjun Jiang, Gao Huang

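AI summary This paper proposes SiameseNorm, a two-stream architecture that couples Pre-Norm and Post-Norm streams through shared residual blocks, improving model performance while maintaining training stability across a range of architectures and modalities.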

Comments Accepted to ICML 2026; camera-ready version; revised presentation and added additional experimental results

Abstract

The long-standing tension between Pre- and Post-Norm remains an open problem in Transformer architecture, reflecting a fundamental trade-off between training stability and representational capacity. Prior attempts to combine their strengths have made progress, but often show limited robustness across training settings, restricting their broader applicability. We revisit this dilemma, showing that single-stream architectures struggle to reconcile Pre-Norm's stable identity-gradient propagation with Post-Norm's normalization of the main residual path. To address this structural tension, we propose SiameseNorm, a simple yet effective two-stream architecture that remains compatible with Pre-Norm training recipes. SiameseNorm couples Pre-Norm-like and Post-Norm-like streams through shared residual blocks, allowing each residual block to receive optimization signals from both pathways with negligible overhead. Extensive experiments on 400M and 1.3B dense language models, 15B MoE models, Vision Transformers, and Diffusion Transformers show that SiameseNorm consistently improves performance while maintaining strong training stability across architectures and modalities. Code is available at https://github.com/Qwen-Applications/SiameseNorm.
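
The abstract contrasts Pre-Norm's identity path with Post-Norm's normalization of the main residual path. Below is a minimal NumPy sketch of the two standard placements only. It does not implement SiameseNorm's two-stream coupling, which the abstract does not specify; the toy sublayer, shapes, and function names are illustrative assumptions, and the learnable scale and bias of layer normalization are omitted.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each row to zero mean and (approximately) unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def pre_norm_block(x, sublayer):
    # The identity path is untouched; only the sublayer input is normalized.
    return x + sublayer(layer_norm(x))

def post_norm_block(x, sublayer):
    # The residual sum itself is normalized.
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
weights = rng.normal(size=(8, 8)) * 0.1
sublayer = lambda h: np.tanh(h @ weights)  # toy stand-in for attention or an MLP
x = rng.normal(size=(4, 8)) * 5.0          # deliberately large-scale input
pre_out = pre_norm_block(x, sublayer)
post_out = post_norm_block(x, sublayer)
```

In the Pre-Norm block the input passes through unchanged apart from the bounded sublayer output, so its scale can keep growing with depth. In the Post-Norm block every output row is renormalized, which also rescales the signal carried by the residual path.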

2602.07340 2026-05-22 cs.LG

Revisiting Robustness for LLM Safety Alignment via Selective Geometry Control

Yonghui Yang, Wenjian Tao, Jilong Liu, Xingyu Zhu, Junfeng Fang, Weibiao Huang, Le Wu, Richang Hong, Tat-Seng Chua

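AI summary This paper revisits the robustness of LLM safety alignment from an optimization geometry perspective and proposes ShaPO, a framework that enforces worst-case alignment objectives via selective geometry control over the alignment-critical parameter subspace, improving safety robustness.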

Abstract

Safety alignment of large language models remains brittle under domain shift and noisy preference supervision. Most existing robust alignment methods focus on uncertainty in alignment data, while overlooking optimization-induced fragility in preference-based objectives. In this work, we revisit robustness for LLM safety alignment from an optimization geometry perspective, and argue that robustness failures cannot be addressed by data-centric methods alone. We propose ShaPO, a geometry-aware preference optimization framework that enforces worst-case alignment objectives via selective geometry control over alignment-critical parameter subspace. By avoiding uniform geometry constraints, ShaPO mitigates the over-regularization that can harm robustness under distribution shift. We instantiate ShaPO at two levels: token-level ShaPO stabilizes likelihood-based surrogate optimization, while reward-level ShaPO enforces reward-consistent optimization under noisy supervision. Across diverse safety benchmarks and noisy preference settings, ShaPO consistently improves safety robustness over popular preference optimization methods. Moreover, ShaPO composes cleanly with data-robust objectives, yielding additional gains and empirically supporting the proposed optimization-geometry perspective. The code is available at https://github.com/liujilong0116/ShaPO.

2602.06995 2026-05-22 cs.RO cs.CV cs.IT cs.MA math.IT

When Simultaneous Localization and Mapping Meets Wireless Communications: A Survey

Konstantinos Gounis, Sotiris A. Tegos, Dimitrios Tyrovolas, Panagiotis D. Diamantoulakis, George K. Karagiannidis

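AI summary This paper surveys recent advances at the intersection of SLAM and wireless communications, focusing on the bidirectional impact of visual SLAM (V-SLAM) integration. It reviews key concepts such as wireless signal propagation, geometric channel modeling, and radio frequency (RF)-based localization and sensing, describes how image processing techniques can detect landmarks and predict optimal paths for wireless channels, and analyzes the techniques, challenges, and future directions of the field.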

Abstract

This paper surveys the state-of-the-art in the nexus of SLAM and Wireless Communications, attributing the bidirectional impact of each with a focus on visual SLAM (V-SLAM) integration. We provide an overview of key concepts related to wireless signal propagation, geometric channel modeling, and radio frequency (RF)-based localization and sensing. In addition to this, we show image processing techniques that can detect landmarks, proactively predicting optimal paths for wireless channels. Several dimensions are considered, including the prerequisites, techniques, background, and future directions and challenges of the intersection between SLAM and wireless communications. We analyze estimation and control approaches such as Bayesian filters, feature-based pose estimation, perception-aware motion control, spatial methods for signal processing such as vector fields, and key technological aspects. We expose techniques and items towards enabling a highly effective retrieval of the autonomous robot state. Among other interesting findings, we observe that monocular V-SLAM would benefit from RF relevant information, as the latter can serve as a proxy for the scale ambiguity resolution. Conversely, we find that wireless communications in the context of 5G and beyond can potentially benefit from visual odometry that is central in SLAM. Moreover, we examine other sources besides the camera for SLAM and describe the twofold relation with wireless communications. Finally, integrated solutions performing joint communications and SLAM appear to be in their infancy: theoretical and practical advancements are required to add higher-level localization and semantic perception capabilities to RF and multi-antenna technologies.

2602.06676 2026-05-22 cs.CV

Can We Build a Monolithic Model for Fake Image Detection? SICA: Semantic-Induced Constrained Adaptation for Unified-Yet-Discriminative Artifact Feature Space Reconstruction

Bo Du, Xiaochen Ma, Xuekang Zhu, Zhe Yang, Chaoqun Niu, Chenfan Qu, Mingqi Fang, Zhenming Wang, Jingjing Liu, Jian Liu, Ji-Zhe Zhou

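AI summary This paper proposes SICA, a new monolithic fake image detection paradigm that uses semantic-induced constrained adaptation to resolve the tension between unity and discriminability in artifact feature space reconstruction. Experiments show that it outperforms 15 existing methods.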

Abstract

Fake Image Detection (FID), aiming at unified detection across four image forensic subdomains, is critical in real-world forensic scenarios. Compared with ensemble approaches, monolithic FID models are theoretically more promising, but to date, consistently yield inferior performance in practice. In this work, we identify the intrinsic distinctness of artifacts across subdomains, a critical barrier we term the "Ji-Zhe phenomenon". Driven by this phenomenon, we diagnose the cause of this underperformance for the first time: the collapse of the artifact feature space. The core challenge for developing a practical monolithic FID model thus boils down to the "unified-yet-discriminative" reconstruction of the artifact feature space. To address this paradoxical challenge, we hypothesize that high-level semantics can serve as a structural prior for the reconstruction, and further propose Semantic-Induced Constrained Adaptation (SICA), the first monolithic FID paradigm. Extensive experiments on our OpenMMSec dataset demonstrate that SICA outperforms 15 state-of-the-art methods and reconstructs the target unified-yet-discriminative artifact feature space in a near-orthogonal manner, thus firmly validating our hypothesis. The code and dataset are available at: https://github.com/venus-guangjian/SICA_OpenMMSec.

2602.05873 2026-05-22 cs.LG

Large-scale Score-based Variational Posterior Inference for Bayesian Deep Neural Networks

Minyoung Kim

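AI summary This paper proposes a variational posterior inference method for large-scale Bayesian deep neural networks that combines a score matching loss with a proximal penalty term, avoiding reparametrized sampling and making training scalable to large neural networks.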

Abstract

Bayesian (deep) neural networks (BNN) are often more attractive than the vanilla point-estimate deep learning in various aspects including uncertainty quantification, robustness to noise, resistance to overfitting, and more. The variational inference (VI) is one of the most widely adopted approximate inference methods. Whereas the ELBO-based variational free energy method is a dominant choice in the literature, in this paper we introduce a score-based alternative for BNN variational inference. Score-based VI can address the known issue of mode collapsing in ELBO-based VI. Although several score-based VI methods have been proposed in the community, most are not adequate for large-scale BNNs for various computational and technical reasons. We propose a novel scalable VI method where the learning objective combines the score matching loss and the proximal penalty term in iterations, which helps our method avoid the reparametrized sampling, and allows for noisy unbiased mini-batch scores through stochastic gradients. This in turn makes our method scalable to large-scale neural networks including Vision Transformers. On several benchmarks including visual recognition and time-series forecasting with large-scale deep networks, we empirically show the effectiveness of our approach.

2602.05536 2026-05-22 cs.LG cs.AI cs.CL cs.CV

When Shared Knowledge Hurts: Spectral Over-Accumulation in Model Merging

Yayuan Li, Ze Peng, Jian Zhang, Jintao Guo, Yue Duan, Yinghuan Shi

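AI summary This paper studies the over-accumulation of shared knowledge in model merging and proposes SVC, a method that calibrates singular values to restore spectral balance, improving the performance of model merging and Task Arithmetic.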

Comments Accepted by ICML 2026

Abstract

Model merging combines multiple fine-tuned models into a single model by adding their weight updates, providing a lightweight alternative to retraining. Existing methods primarily target resolving conflicts between task updates, leaving the failure mode of over-counting shared knowledge unaddressed. We show that when tasks share aligned spectral directions (i.e., overlapping singular vectors), a simple linear combination repeatedly accumulates these directions, inflating the singular values and biasing the merged model toward shared subspaces. To mitigate this issue, we propose Singular Value Calibration (SVC), a training-free and data-free post-processing method that quantifies subspace overlap and rescales inflated singular values to restore a balanced spectrum. Across vision and language benchmarks, SVC consistently improves strong merging baselines and achieves state-of-the-art performance. Furthermore, by modifying only the singular values, SVC improves the performance of Task Arithmetic by 13.0%. Code is available at https://github.com/lyymuwu/SVC.
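
The failure mode described above can be reproduced on synthetic weight updates. This sketch only illustrates the inflation of a shared singular direction under plain summation. It is not the SVC calibration procedure, which the abstract does not detail; the matrix sizes, singular values, and `task_vector` helper are assumptions chosen so the spectrum is easy to read.

```python
import numpy as np

rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.normal(size=(6, 4)))  # orthonormal left directions
V, _ = np.linalg.qr(rng.normal(size=(5, 4)))  # orthonormal right directions

def task_vector(k):
    # Every task shares direction 0 (singular value 1.0) and adds its own
    # task-specific direction k (singular value 0.5).
    shared = 1.0 * np.outer(U[:, 0], V[:, 0])
    specific = 0.5 * np.outer(U[:, k], V[:, k])
    return shared + specific

merged = sum(task_vector(k) for k in (1, 2, 3))  # plain linear combination
single_spectrum = np.linalg.svd(task_vector(1), compute_uv=False)
merged_spectrum = np.linalg.svd(merged, compute_uv=False)
```

Each individual task vector has leading singular values 1.0 and 0.5, while the merged update has 3.0 followed by three values of 0.5: the shared direction is counted once per task and comes to dominate the spectrum, while the task-specific directions keep their original magnitude.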

2602.05304 2026-05-22 cs.LG cs.SY eess.SY math.OC

A Short and Unified Convergence Analysis of the SAG, SAGA, and IAG Algorithms

Feng Zhu, Robert W. Heath, Aritra Mitra

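AI summary This paper presents a unified convergence analysis for the SAG, SAGA, and IAG algorithms. It bounds delays using simple concentration tools and designs a new Lyapunov function, yielding high-probability bounds that extend to non-convex objectives and Markov sampling.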

Comments To appear at the 43rd International Conference on Machine Learning (ICML)

Abstract

Stochastic variance-reduced algorithms such as Stochastic Average Gradient (SAG) and SAGA, and their deterministic counterparts like the Incremental Aggregated Gradient (IAG) method, have been extensively studied in large-scale machine learning. Despite their popularity, existing analyses for these algorithms are disparate, relying on different proof techniques tailored to each method. Furthermore, the original proof of SAG is known to be notoriously involved, requiring computer-aided analysis. Focusing on finite-sum optimization with smooth and strongly convex objective functions, our main contribution is to develop a single unified convergence analysis that applies to all three algorithms: SAG, SAGA, and IAG. Our analysis features two key steps: (i) establishing a bound on delays due to stochastic sub-sampling using simple concentration tools, and (ii) carefully designing a novel Lyapunov function that accounts for such delays. The resulting proof is short and modular, providing the first high-probability bounds for SAG and SAGA that can be seamlessly extended to non-convex objectives and Markov sampling. As an immediate byproduct of our new analysis technique, we obtain the best known rates for the IAG algorithm, significantly improving upon prior bounds.
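
For readers unfamiliar with the algorithms being analyzed, here is a minimal sketch of the standard SAGA update on a small least-squares problem. The problem, the step size 1/(3L), and the name `saga_least_squares` are illustrative assumptions and are not taken from the paper. IAG differs in that it visits components in cyclic order and steps along the table average alone.

```python
import numpy as np

def saga_least_squares(A, b, steps=20000, seed=0):
    """Minimize (1/n) * sum_i 0.5 * (a_i @ x - b_i)**2 with SAGA."""
    n, d = A.shape
    rng = np.random.default_rng(seed)
    x = np.zeros(d)
    table = (A @ x - b)[:, None] * A     # stored gradient of each component
    avg = table.mean(axis=0)             # running average of the table
    L = np.max(np.sum(A ** 2, axis=1))   # largest component smoothness constant
    step = 1.0 / (3.0 * L)
    for _ in range(steps):
        i = rng.integers(n)              # uniformly sampled component
        g_new = (A[i] @ x - b[i]) * A[i]
        # Variance-reduced direction: fresh gradient minus its stale copy,
        # plus the average of all stored gradients.
        x = x - step * (g_new - table[i] + avg)
        avg = avg + (g_new - table[i]) / n
        table[i] = g_new
    return x

rng = np.random.default_rng(1)
A = rng.normal(size=(20, 3))
b = A @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=20)
x_saga = saga_least_squares(A, b)
x_star = np.linalg.lstsq(A, b, rcond=None)[0]
```

The stored gradients in `table` are stale by the number of steps since each component was last sampled. These staleness gaps are the delays that the abstract's analysis bounds.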

2602.04768 2026-05-22 cs.LG cs.AI

Billion-Scale Graph Foundation Models

Maya Bechler-Speicher, Yoel Gottlieb, Andrey Isakov, David Abensur, Ami Tavory, Daniel Haimovich, Ido Guy, Udi Weinsberg

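AI summary This paper proposes GraphBFF, an end-to-end recipe for building billion-parameter graph foundation models for large-scale heterogeneous graphs. It introduces the GraphBFF Transformer architecture, presents neural scaling laws for heterogeneous graphs, and reports superior performance on multiple downstream tasks.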

Abstract

Graph-structured data underpins many critical applications. While foundation models have transformed language and vision via large-scale pretraining and lightweight adaptation, extending this paradigm to general, real-world graphs is challenging. In this work, we present Graph Billion-Foundation-Fusion (GraphBFF): an end-to-end recipe for building billion-parameter Graph Foundation Models (GFMs) for large-scale heterogeneous graphs. Central to the recipe is the GraphBFF Transformer, a flexible and scalable architecture designed for practical billion-scale GFMs. Using the GraphBFF, we present neural scaling laws for heterogeneous graphs and show that loss decreases predictably as either model capacity or training data scales, depending on which factor is the bottleneck. The GraphBFF framework provides concrete methodologies for data batching, pretraining, and fine-tuning for building GFMs at scale. We demonstrate the effectiveness of the framework over a real-world billion-scale graph, with an evaluation of a billion-parameter GraphBFF Transformer following the proposed recipe. Across ten diverse, real-world downstream tasks on graphs unseen during training, spanning node- and link-level classification and regression, GraphBFF consistently outperforms baselines, with large margins of up to 31 PRAUC points, including in few-shot settings. Finally, we discuss key challenges and open opportunities for making GFMs a practical and principled foundation for graph learning at industrial scale.