arXivDaily: daily arXiv academic digest, updated Monday through Friday
2605.21543 2026-05-22 cs.LG

Provable Joint Decontamination for Benchmarking Multiple Large Language Models

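Zhenlong Liu, Hao Zeng, Hongxin Wei

AI summary: This paper proposes a provable decontamination method for benchmarking multiple large language models; a joint selection procedure controls the global contamination rate and makes cross-model comparisons more reliable.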

Abstract

Benchmark data contamination has become a central challenge in LLM evaluation: when evaluation examples appear in the training data of one or more audited models, reported performance can be inflated and cross-model comparisons become unreliable. A broad line of training-data detection work designs scores to quantify how strongly a model memorizes a given data point, but these score-based methods lack theoretical guarantees. Recent conformal approaches provide provable false-identification control for a single model; however, applying them separately to each model can produce model-specific benchmarks, undermining fair comparison across models. In this work, we formalize multi-model benchmark decontamination as a joint selection problem and propose Joint Envelope Conformal Selection (JECS), a conformal procedure that enables global contamination rate (GCR) control under stated assumptions. Specifically, JECS computes per-model conformal p-values, aggregates them by the per-item maximum, and reconstructs a conservative envelope of the max-p null distribution from right-tail observations above a data-driven threshold. By applying the adaptive Benjamini-Hochberg (BH) procedure to the envelope-rescaled values, we select a benchmark with provable GCR control. Extensive experiments across various models and benchmarks demonstrate that JECS achieves higher power than the max-p baseline while consistently maintaining the target GCR control.
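The max-p baseline that the abstract compares against can be sketched as follows. This is a minimal illustration, not the JECS procedure itself: it uses the standard (non-adaptive) BH step and omits the envelope reconstruction, whose details the abstract does not give. The score direction, the calibration set, and all function names are assumptions made for this sketch.

```python
from typing import List, Sequence


def conformal_p_value(test_score: float, calib_scores: Sequence[float]) -> float:
    """Conformal p-value of one item for one model.

    Assumptions (not from the abstract): a higher score means the item looks
    less memorized, and calib_scores come from items known to be contaminated
    for this model. A small p-value is then evidence that the item is clean.
    """
    n_at_least = sum(1 for s in calib_scores if s >= test_score)
    return (1 + n_at_least) / (len(calib_scores) + 1)


def max_p(per_model_p: Sequence[Sequence[float]]) -> List[float]:
    """Aggregate p-values by the per-item maximum over models.

    per_model_p[j][i] is the p-value of item i under model j.
    """
    return [max(ps) for ps in zip(*per_model_p)]


def benjamini_hochberg(p_values: Sequence[float], alpha: float) -> List[int]:
    """Indices rejected by the standard BH step-up procedure at level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    n_rejected = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            n_rejected = rank  # keep the largest rank that passes
    return sorted(order[:n_rejected])


def select_clean_items(per_model_p: Sequence[Sequence[float]], alpha: float) -> List[int]:
    """Items declared clean for every model (max-p baseline)."""
    return benjamini_hochberg(max_p(per_model_p), alpha)
```

Because an item is kept only if its largest per-model p-value is small, every model is scored on the same selected benchmark, which is the property that separate per-model selection loses.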

2605.21542 2026-05-22 cs.LG

Discovering Entity-Conditioned Lag Heterogeneity: A Lag-Gated Neural Audit Framework for Panel Time Series

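Andi Xu

AI summary: This paper proposes AC-GATE, a lag-gated neural audit framework for panel time series that addresses how different entities respond to historical signals over different time horizons. By introducing an adaptive-conditioning encoder and a scale-invariant lag gate, it discovers lag heterogeneity and exposes it as a structured output.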

Comments Preprint/technical paper. An interpretable neural audit framework for entity-conditioned lag discovery in panel time series. 10 pages, 5 figures, 16 tables. Code available at the GitHub repository

Abstract

Country-level temporal panels are widely used in empirical analysis. Researchers often need to audit how different entities respond to historical signals over different time horizons. Current approaches typically do not provide directly auditable entity-specific lag summaries. We formulate entity-conditioned heterogeneous lag discovery as a temporal panel mining task and propose AC-GATE, an Adaptive-Conditioning Encoder with a Scale-Invariant Lag Gate. It instantiates conditional Moderated Distributed Lag by using observable entity-level proxies to condition lag-weight distributions over historical observations, thereby making effective lags structural outputs of the model rather than post-hoc explanations. The evaluation is based on a layered audit protocol that separates predictive calibration from lag discovery. A synthetic panel with known ground-truth lags is used for mechanism recovery testing, and two real-world country-level panels are used for external audit and stress testing. The results show that AC-GATE can recover heterogeneous lag structure in synthetic data, and generates non-degenerate, externally structured effective lags in real data.
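One way to read "effective lag as a structural output" is as the mean lag under a proxy-conditioned weight distribution. The sketch below illustrates only that reading; the linear gate, the `slope` parameter, and the function names are hypothetical and are not taken from AC-GATE.

```python
import math
from typing import List


def lag_weights(proxy: float, n_lags: int, slope: float = 0.5) -> List[float]:
    """Softmax weights over lags 1..n_lags, conditioned on an entity proxy.

    Hypothetical gate: the score of lag k is slope * proxy * k, so a positive
    proxy shifts weight toward longer lags and a negative one toward shorter.
    """
    scores = [slope * proxy * k for k in range(1, n_lags + 1)]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def effective_lag(weights: List[float]) -> float:
    """Expected lag under the weight distribution (lags are 1-indexed)."""
    return sum(k * w for k, w in enumerate(weights, start=1))
```

Under this reading, two entities with different proxy values get different effective lags directly from the forward pass, with no post-hoc attribution step.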

2605.21539 2026-05-22 cs.LG

DualOptim+: Bridging Shared and Decoupled Optimizer States for Better Machine Unlearning in Large Language Models

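Xuyang Zhong, Qizhang Li, Yiwen Guo, Chen Liu

AI summary: This paper proposes DualOptim+, a new optimization framework for improving machine unlearning in large language models. It introduces a base state and delta states to balance the forgetting and retaining objectives, and an 8-bit quantized variant to reduce memory overhead; experiments show strong results across multiple tasks.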

Comments Accepted by ICML 2026

Abstract

We propose DualOptim+, a novel optimization framework for improving machine unlearning in large language models. It introduces a base state to capture common representations shared by forgetting and retaining objectives and delta states to preserve objective-specific residuals. This architecture allows the optimizer to adaptively bridge shared and decoupled states based on the directional conflict between forgetting and retaining gradients. We further introduce DualOptim+ 8bit, a quantized variant that reduces memory overhead without compromising performance. Extensive experiments across fictitious and real-world unlearning, safety alignment, and multi-task learning tasks demonstrate that DualOptim+ consistently achieves a superior trade-off between different objectives. Code is available at https://github.com/CityU-MLO/DualOptimPlus.

2605.21538 2026-05-22 cs.SD

Academic Text-to-Music Grand Challenge: Datasets, Baselines, and Evaluation Methods

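Fang-Chih Hsieh, Wei-Jaw Lee, Chun-Ping Wang, Hung-yi Lee, Hao-Wen Dong, Yi-Hsuan Yang

AI summary: This paper presents an overview and the technical framework of the ICME 2026 Grand Challenge on Academic Text-to-Music Generation (ATTM). Despite rapid progress in text-to-music generation (TTM) systems, the field is dominated by models trained on large proprietary datasets with industrial-scale compute, which creates a significant barrier for academic research. The ATTM Challenge therefore establishes a fair benchmark that requires participants to train generative models from scratch on a standardized, CC-licensed subset of the MTG-Jamendo dataset containing only instrumental music. It has two tracks: the Efficiency Track (limited to 500M parameters) and the Performance Track (no parameter limit). Submissions go through a multi-stage evaluation with objective metrics, including Fréchet Audio Distance, CLAP score, and a new Concept Coverage Score (CCS), followed by a subjective listening test. By providing open-source baselines, preprocessing pipelines, reference captions, and public evaluation code for FAD and CLAP, the challenge aims to promote TTM research in academic settings.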

Comments Accepted to IEEE ICME 2026 Grand Challenge Paper

Abstract

This paper presents an overview and the technical framework of the ICME 2026 Grand Challenge on Academic Text-to-Music Generation (ATTM). Despite the rapid progress in text-to-music generation (TTM) systems, the field is currently dominated by models trained on massive proprietary datasets with industrial-scale computational resources, creating a significant barrier for academic research. To address this, the ATTM Challenge establishes a fair-play benchmark that requires participants to train generative models strictly from scratch using a standardized, CC-licensed subset of the MTG-Jamendo dataset containing only instrumental music. The challenge is divided into two tracks: the Efficiency Track (limited to 500M parameters) and the Performance Track (no parameter limit). Submissions are evaluated through a multi-stage process involving objective metrics, including Frechet Audio Distance, CLAP score, and a novel Concept Coverage Score (CCS), followed by a subjective listening test. By providing open-source baselines, preprocessing pipelines, reference captions, and public evaluation code for computing FAD and CLAP, this challenge aims to facilitate and promote TTM research in academic contexts.
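For reference, Fréchet Audio Distance is conventionally the Fréchet distance between two Gaussians fitted to embeddings of reference and generated audio. The formula below is that standard definition; the subscripts r (reference) and g (generated) are notation chosen here, and the abstract does not say which embedding model the challenge uses.

```latex
\mathrm{FAD}
  = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

Here the mean vectors and covariance matrices are estimated from the two embedding sets; lower values indicate that the generated distribution is closer to the reference.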

2605.21528 2026-05-22 cs.LG cs.AI

A Reproducible Log-Driven AutoML Framework for Interpretable Pipeline Optimization in Healthcare Risk Prediction

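Rui Huang, Lican Huang

AI summary: This paper proposes a reproducible, log-driven AutoML framework for interpretable pipeline optimization in healthcare risk prediction; by analyzing component attribution, interactions, and redundancy, it improves model performance and stability.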

Abstract

Accurate and reproducible disease risk prediction remains challenging due to heterogeneous features, limited samples, and severe class imbalance. This study introduces yvsoucom-iterkit, a deterministic and log-driven automated machine learning framework that formulates pipeline optimization as a fully reproducible, configuration-level system. Each pipeline is encoded as a traceable log entity, enabling analysis of component attribution, interactions, similarity, and cross-seed robustness. Experiments on the Pima Indians Diabetes and Stroke datasets across more than 18,000 pipeline configurations reveal a structured and partially redundant search space, where performance is governed by a small subset of interacting components. Random Forest importance analysis identifies augmentation (0.454), model choice (0.198), and imbalance handling (0.101) as key drivers on Pima, while imbalance handling dominates Stroke (0.406). Component similarity analysis shows strong redundancy, with feature selection variants (biMax-biMean) exhibiting low RMS distance (0.0252), mixup closely matching no augmentation (0.0279), and TomekLinks aligning with no imbalance handling (0.0325), whereas Gaussian noise shows greater divergence from no augmentation (0.10). The framework achieves strong and stable performance using ensemble models (Weighted-F1 0.89, Macro-F1 0.88 on Pima; Weighted-F1 0.94 on Stroke), while Macro-F1 remains lower on Stroke (0.67) due to class imbalance. Cross-seed analysis reveals a performance-robustness trade-off, with ensembles showing lower variability (0.023-0.026) than SVM. These results indicate that effective AutoML optimization can focus on a reduced set of high-impact components.
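The gap between Weighted-F1 and Macro-F1 on an imbalanced dataset follows from how the two averages are defined. A minimal sketch is given below; the confusion counts in the usage note are invented for illustration and are not the paper's data.

```python
from typing import List, Tuple

Counts = Tuple[int, int, int]  # (true positives, false positives, false negatives)


def f1(tp: int, fp: int, fn: int) -> float:
    """Per-class F1 score; defined as 0.0 when the class never appears."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_and_weighted_f1(per_class: List[Counts]) -> Tuple[float, float]:
    """Return (macro F1, support-weighted F1) from per-class counts."""
    scores = [f1(*c) for c in per_class]
    supports = [tp + fn for tp, _fp, fn in per_class]  # true items per class
    macro = sum(scores) / len(scores)
    weighted = sum(s * n for s, n in zip(scores, supports)) / sum(supports)
    return macro, weighted
```

With a hypothetical majority class of (940, 40, 10) and a minority class of (10, 10, 40), the weighted average is dominated by the majority class and stays high, while the macro average is pulled down by the poorly predicted minority class.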

2605.21516 2026-05-22 cs.LG cs.AI

Harnesses for Inference-Time Alignment over Execution Trajectories

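Boyuan Wang, Bochao Li, Minghan Wang, Yuxin Tao, Fang Kong

AI summary: This paper studies the design of harnesses for inference-time alignment over execution trajectories, which aim to improve long-term performance through task decomposition and guided execution. It finds that more elaborate decomposition and guidance do not always yield better results, separates harness design into these two mechanisms, and validates the findings with synthetic experiments and real terminal-agent benchmarks.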

Abstract

Harness engineering has emerged as an important inference-time technique for large language model (LLM) agents, aiming to improve long-term performance through task decomposition and guided execution. However, more elaborate harnesses are not uniformly better: increasing decomposition or guidance can sometimes improve execution, but can also reduce final task success. We study harness design through the lens of inference-time trajectory alignment. This perspective separates a harness into two mechanisms: task decomposition, which structures a task into sub-goals, and guided execution, which reshapes local action distributions during execution. This decomposition allows us to quantify how workflow granularity, retry budgets, and guidance-induced action reweighting shape the performance limits of harness design. It further reveals concrete failure modes, including over-decomposition, over-pruning, and hallucinated execution. We validate these predictions through controlled synthetic experiments and real terminal agent benchmarks. Inspired by the theory, we further show that effective harnesses can be partial: specifying only the initial steps and leaving the remaining execution to the agent can achieve a higher pass rate than fully structured workflows.

2605.21515 2026-05-22 cs.LG cs.AI

Predicting Performance of Symbolic and Prompt Programs with Examples

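Chengqi Zheng, Keya Hu, Shuzhi Liu, Tao Wu, Kevin Ellis, Yewen Pu

AI summary: This paper studies predicting program performance from examples. It uses a simple coin-flip model that predicts performance from observed execution outcomes and a prior over performances, and develops RAP, which constructs a proxy prior to improve prediction.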

Abstract

LLM prompting is widely used for naturally stated tasks, yet it is unreliable: it may succeed on a few test cases but fail at deployment time. We study performance prediction: given a program, either symbolic (e.g. Python) or a prompt executed on an LLM, and a few in-domain examples, predict its performance on unseen tasks from the same domain. We use a simple coin-flip model, treating each pass/fail program execution as a Bernoulli random variable whose success probability is the program's unknown performance. In this model, the prediction depends entirely on: 1) the observed execution outcomes on test cases, and 2) a prior over performances. We compile empirical performance priors from a corpus of diverse programs and tasks, and find that performance for symbolic programs (e.g., Python) is all-or-nothing, while prompt programs have a diffuse prior with many nearly correct programs. This difference explains why a few passing tests can certify symbolic programs but not prompt programs. Building on this insight, we develop RAP (Retrieved Approximate Prior), which retrieves similar tasks and prompt programs from an existing corpus to construct a proxy prior, which is then used to predict performance. We show RAP achieves solid performance.
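The coin-flip model can be made concrete with a discrete prior over performance levels. The sketch below is an illustration under assumptions: the two priors are invented stand-ins for the "all-or-nothing" and "diffuse" shapes the abstract describes, not the paper's empirical priors, and RAP's retrieval step is not modeled.

```python
from typing import Dict


def predicted_performance(prior: Dict[float, float], passes: int, trials: int) -> float:
    """Posterior mean success probability after observing passes out of trials.

    prior maps each candidate performance level p (in [0, 1]) to its prior
    probability; each execution is treated as an independent Bernoulli(p) draw.
    """
    weights = {
        p: w * (p ** passes) * ((1.0 - p) ** (trials - passes))
        for p, w in prior.items()
    }
    total = sum(weights.values())
    if total == 0.0:
        raise ValueError("observations have zero probability under this prior")
    return sum(p * w for p, w in weights.items()) / total


# Hypothetical priors, chosen only to contrast the two shapes.
ALL_OR_NOTHING = {0.0: 0.5, 1.0: 0.5}
DIFFUSE = {0.5: 0.25, 0.7: 0.25, 0.9: 0.25, 1.0: 0.25}
```

After three passing tests out of three, the all-or-nothing prior leaves only the fully correct program, so the prediction is 1.0. Under the diffuse prior the nearly correct levels still carry weight and the prediction stays below 0.9, which is why a few passing tests certify less for prompt programs.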

2605.21496 2026-05-22 cs.LG cs.AI cs.CL

HealthCraft: A Reinforcement Learning Safety Environment for Emergency Medicine

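Brandon Dent

AI summary: This paper presents HealthCraft, the first public reinforcement-learning environment that rewards trajectory-level safety under realistic emergency-medicine conditions. Using a FHIR R4 world state, 24 MCP tools, and a dual-layer rubric, it evaluates model safety and performance on emergency-medicine tasks and reveals safety failures in multi-step workflows.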

Comments 16 pages, 5 figures, 6 tables. Code, task suite, and Docker bundle: https://github.com/GOATnote-Inc/healthcraft

Abstract

Frontier language models are being deployed into clinical workflows faster than the infrastructure to evaluate them safely. Static medical-QA benchmarks miss the failure modes that matter in emergency medicine: trajectory-level safety collapse, tool misuse, and capitulation under sustained clinical pressure. We present HealthCraft, the first public reinforcement-learning environment that rewards trajectory-level safety under realistic emergency-medicine conditions, adapted from Corecraft. It is built on a FHIR R4 world state with 14 entity types and 3,987 seed entities, exposes 24 MCP tools, and defines a dual-layer rubric that zeroes reward whenever any safety-critical criterion is violated. We release 195 tasks across six categories, graded against 2,255 binary criteria (515 safety-critical); a post-hoc 10-task negative-class slate extends this to 205 tasks and 2,337 criteria. V8 results on two frontier models show Claude Opus 4.6 at Pass@1 24.8% [21.5-28.4] and GPT-5.4 at 12.6% [10.2-15.6], with safety-failure rates of 27.5% and 34.0%. On multi-step workflows - the closest proxy to real emergency care - performance collapses to near zero (Claude 1.0%, GPT-5.4 0.0%) despite partial competence on individual steps. Six infrastructure bugs fixed between pilots v2 and v8 re-ordered which model "looks stronger," evidence that infrastructure fidelity is part of the measurement. A deterministic LLM-judge overlay bounds evaluator noise, and a 60-run negative-class smoke pilot shows the reward signal is not drop-in training-safe: restraint criteria pass at 0.929 prevalence, a gameability an eval harness can tolerate but a training reward cannot. We scaffold coupling to a Megatron+SGLang+GRPO loop per Corecraft Section 5.2 and leave training-reward ablations as future work. Environment, tasks, rubrics, and harness are released under Apache 2.0.
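The dual-layer rubric can be sketched as a gate on top of an ordinary score. Only the zeroing rule comes from the abstract; the `Criterion` type and the choice of "fraction of criteria passed" as the ungated reward are assumptions made for this illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Criterion:
    name: str
    passed: bool
    safety_critical: bool = False


def trajectory_reward(criteria: List[Criterion]) -> float:
    """Reward for one trajectory graded against binary criteria.

    Layer 1 (from the abstract): any violated safety-critical criterion
    zeroes the reward. Layer 2 (assumed here): otherwise the reward is the
    fraction of all criteria that passed.
    """
    if not criteria:
        raise ValueError("at least one criterion is required")
    if any(c.safety_critical and not c.passed for c in criteria):
        return 0.0
    return sum(c.passed for c in criteria) / len(criteria)
```

The gate means a trajectory that completes most steps but violates one safety-critical criterion scores the same as one that does nothing useful, which matches the abstract's emphasis on trajectory-level safety over partial competence.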

2605.21494 2026-05-22 cs.LG

Double descent for least-squares interpolation on contaminated data: A simulation study

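Tino Werner

AI summary: This paper investigates whether double descent occurs in linear regression with contaminated data. It compares the least-squares interpolation estimator with several robust alternatives and finds that large overparametrization does lead to double descent, giving the least-squares interpolator better generalization than the robust alternatives.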

Abstract

Overparametrized models can exhibit excellent generalization performance, although they should be prone to overfitting according to classical statistical theory. The discovery of "double descent", in which the generalization error decreases again once a certain model complexity has been reached, opened a new line of research. Robust statistics considers statistical estimation on contaminated data: because model assumptions do not hold on real data, some data points appear as outliers w.r.t. the assumed "ideal" distribution, potentially severely distorting any classical estimator. We address the question of whether a double descent phenomenon can be observed in a linear regression setting with contaminated training data. We compare the performance of the highly non-robust least-squares interpolation estimator with several robust alternatives. It turns out that large overparametrization indeed allows for a double descent phenomenon, resulting in very good generalization performance of the least-squares interpolator, surpassing that of the robust alternatives.

2605.21493 2026-05-22 cs.LG cs.AI cs.CV

Don't Collapse Your Features: Why CenterLoss Hurts OOD Detection and Multi-Scale Mahalanobis Wins

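Rahul D Ray

AI summary: This paper proposes GOEN, which improves OOD detection by combining multi-scale features, L2 normalisation, Mahalanobis distance, and a calibration head. It finds that CenterLoss degrades OOD detection, and that GOEN-NoCenterLoss outperforms the other baselines on CIFAR-10 benchmarks.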

Abstract

The ability to detect out-of-distribution (OOD) inputs is fundamental to safe deployment of machine learning systems. Yet, current methods often rely on feature representations that are optimised solely for classification accuracy, neglecting the distinct requirements of epistemic uncertainty. We introduce GOEN (Geometry-Optimised Epistemic Network), a simple pipeline that combines multi-scale features, L2 normalisation, Mahalanobis distance, and a calibration head trained with real hard OOD examples. Through systematic ablation we uncover a counter-intuitive finding: CenterLoss, a popular regulariser for feature compactness, significantly degrades OOD detection performance, reducing average OOD AUROC from 0.9483 to 0.9366 despite improving classification accuracy. The best variant, GOEN-NoCenterLoss, achieves an average OOD AUROC of 0.9483, surpassing all baselines including deep ensembles (0.8827), KNN (0.8967), and ODIN (0.8870) on CIFAR-10 benchmarks, while maintaining competitive in-distribution accuracy. Our results challenge the prevailing assumption that better classification geometry automatically leads to better epistemic uncertainty. Instead, we show that overly tight feature clusters compress inter-class margins and distort the covariance structure needed for effective OOD detection. GOEN is efficient, training in under 20 minutes on a single GPU, and provides a practical blueprint for building AI systems that reliably recognise their own limitations.

2605.21492 2026-05-22 cs.LG cs.AI cs.LO stat.ML

The Attribution Impossibility: No Feature Ranking Is Faithful, Stable, and Complete Under Collinearity

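Drake Caraker, Bryan Arnold, David Rhoads

AI summary: This paper studies the impossibility of feature ranking under collinearity, proving that no ranking can be simultaneously faithful, stable, and complete. It proposes DASH as a resolution and uses formal verification to support the theory and its practical implications.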

Comments 66 pages, 12 figures, 305 Lean 4 theorems. Code at https://github.com/DrakeCaraker/dash-impossibility-lean

Abstract

No feature ranking can be simultaneously faithful, stable, and complete when features are collinear. For collinear pairs, ranking reduces to a coin flip. We prove this impossibility, quantify it for four model classes, resolve it via ensemble averaging (DASH), and machine-verify it with 305 Lean 4 theorems. We characterize the complete attribution design space: exactly two families of methods exist -- faithful-complete methods (unstable, with rankings that flip up to 50% of the time) and ensemble methods like DASH (stable, reporting ties for symmetric features) -- and no method lies outside this dichotomy. The impossibility is quantitative: the attribution ratio diverges as 1/(1-rho^2) for gradient boosting, is infinite for Lasso, and converges for random forests. DASH (Diversified Aggregation of SHAP) is provably Pareto-optimal among unbiased aggregations, achieving the Cramer-Rao variance bound with a tight ensemble size formula. In a survey of 77 public datasets, 68% exhibit attribution instability. Switching to conditional SHAP does not escape the impossibility when features have equal causal effects. The framework includes practical diagnostics -- a Z-test workflow and single-model screening tool -- and has direct consequences for fairness auditing: SHAP-based proxy discrimination audits are provably unreliable under collinearity. The design space theorem, diagnostics, and impossibility are mechanically verified in Lean 4 (305 theorems from 16 axioms, 0 sorry) -- to our knowledge, the first formally verified impossibility in explainable AI.
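To give a feel for how fast the stated gradient-boosting rate grows, the sketch below evaluates 1/(1-rho^2) for a given correlation. It takes the abstract's expression at face value as the ratio itself; any constants or lower-order terms in the paper's exact result are not known from the abstract, and the function name is hypothetical.

```python
def gbm_attribution_ratio(rho: float) -> float:
    """Value of 1 / (1 - rho^2) for a feature pair with correlation rho.

    Assumption: the abstract's divergence rate is used directly as the ratio.
    """
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between -1 and 1")
    return 1.0 / (1.0 - rho * rho)
```

At rho = 0 the expression equals 1, at rho = 0.9 it is a little above 5, and at rho = 0.99 it exceeds 50, so attribution between near-duplicate features becomes arbitrarily lopsided as the correlation approaches 1.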

2605.21491 2026-05-22 cs.LG cs.AI cs.CL

Teaching Language Models to Forecast Research Success Through Comparative Idea Evaluation

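Srujan P Mule, Aniketh Garikaparthi, Manasi Patwardhan

AI summary: This study asks whether language models can forecast the empirical success of research ideas without running experiments. Using a dataset of 11,488 idea pairs grounded in objective PapersWithCode outcomes, it finds that reinforcement learning raises accuracy to 71.35%, showing that small language models can serve as effective, objective verifiers and offering a scalable path for autonomous scientific discovery.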

Comments ACL 2026 Findings

Abstract

As language models accelerate scientific research by automating hypothesis generation and implementation, a new bottleneck emerges: evaluating and filtering hundreds of AI-generated ideas without exhaustive experimentation. We ask whether LMs can learn to forecast the empirical success of research ideas before any experiments are run. We study comparative empirical forecasting: given a benchmark-specific research goal and two candidate ideas, predict which will achieve better benchmark performance. We construct a dataset of 11,488 idea pairs grounded in objective outcomes from PapersWithCode. While off-the-shelf 8B-parameter models struggle (30% acc.), SFT dramatically boosts performance to 77.1%, outperforming GPT-5 (61.1%). By framing evaluation as a reasoning task via Reinforcement Learning with Verifiable Rewards (RLVR), we train models to discover latent reasoning paths, achieving 71.35% acc. with interpretable justifications. Through additional ablations and out-of-distribution tests, we show robustness to surface-level heuristics and transfer to both a cross-domain time-split test set and an independently constructed test set. Our results demonstrate that compute-efficient small language models can serve as effective, objective verifiers, offering a scalable path for autonomous scientific discovery.

2605.21490 2026-05-22 cs.LG cs.CR

Temporal Contrastive Transformer for Financial Crime Detection: Self-Supervised Sequence Embeddings via Predictive Contrastive Coding

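Danny Butvinik, Yonit Marcus, Nitzan Tal, Gabrielle Azoulay

AI summary: This paper proposes the Temporal Contrastive Transformer (TCT), a representation learning framework designed to capture temporal dynamics in sequences of financial transactions. The model is trained with a self-supervised contrastive objective to produce embeddings that encode behavioral patterns over time and support downstream fraud detection. Experiments show that the embeddings alone achieve meaningful predictive performance (AUC 0.8644), but combining them with domain-engineered features yields no meaningful gain (AUC 0.9205 vs. 0.9245), suggesting that the learned representations largely overlap with existing feature abstractions. These findings position TCT as a promising representation learning approach that captures relevant behavioral signal, while highlighting the difficulty of adding value on top of strong domain features.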

Comments 10 pages, 4 figures, one table

Abstract

We introduce the Temporal Contrastive Transformer (TCT), a representation learning framework designed to capture contextual temporal dynamics in sequences of financial transactions. The model is trained using a self-supervised contrastive objective to produce embeddings that encode behavioral patterns over time, with the goal of supporting downstream fraud detection tasks. We evaluate TCT in a realistic setting by using the learned embeddings as input features to a gradient boosting classifier. Experimental results show that embeddings alone achieve meaningful predictive performance (AUC 0.8644), indicating that the model captures non-trivial temporal structure. However, when combined with domain-engineered features, no measurable improvement is observed over the baseline (AUC 0.9205 vs. 0.9245), suggesting that the learned representations largely overlap with existing feature abstractions. These findings position TCT as a promising representation learning approach that captures relevant behavioral signal, while highlighting the challenges of achieving additive value over strong domain features. The results reflect an intermediate stage in the development of temporal representation learning for financial crime detection and motivate further research on model architecture, training objectives, and integration strategies. At this early stage, achieving performance comparable to a strong feature-engineered baseline is itself a meaningful outcome, indicating that learned representations approximate domain-specific features without manual engineering. While not yet production-ready, these results point to a promising direction for reducing reliance on feature engineering in financial crime detection.
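The abstract does not spell out the contrastive objective. Given the "Predictive Contrastive Coding" in the title, a plausible reading is an InfoNCE-style loss, in which a context embedding must score its true future step above negatives. The sketch below shows that generic loss in plain Python; it is an assumption about the family of objective, not TCT's actual implementation, and all names are hypothetical.

```python
import math
from typing import Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product used as the similarity score."""
    return sum(x * y for x, y in zip(a, b))


def info_nce(
    context: Sequence[float],
    positive: Sequence[float],
    negatives: Sequence[Sequence[float]],
    temperature: float = 0.1,
) -> float:
    """InfoNCE loss: cross-entropy of picking the positive among all candidates."""
    logits = [dot(context, positive) / temperature]
    logits += [dot(context, n) / temperature for n in negatives]
    top = max(logits)  # log-sum-exp with max subtraction for stability
    log_sum = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_sum - logits[0]
```

The loss is small when the context embedding aligns with the true future step and large when a negative scores higher, which pushes the encoder to capture sequence structure without labels.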

2605.21282 2026-05-22 cs.LG cs.AI

Stochastic MeanFlow Policies: One-Step Generative Control with Entropic Mirror Descent

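Zeyuan Wang, Da Li, Yulin Chen, Yuehu Gong, Yanming Guo, Ye Shi, Liang Bai, Tianyuan Yu, Yanwei Fu

AI summary: This paper proposes Stochastic MeanFlow Policies (SMFP), a trainable generative policy class that maps Gaussian noise to actions through a MeanFlow transformation, enabling exploratory yet stable improvement within off-policy mirror descent.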

Abstract

Online off-policy reinforcement learning (RL) is shaped by two coupled choices: the policy class and the update rule. Gaussian policies are fast and have tractable entropy, but struggle with multimodal action distributions. Generative policies are more expressive, but often require iterative sampling or lack tractable entropy estimates. On the optimisation side, SAC-style soft policy improvement and mirror descent (MD) can be viewed as minimising different KL divergences: the former moves the policy towards a value-induced Boltzmann distribution, while the latter regularises each update against the previous policy. Combining entropy regularisation with an MD constraint is therefore attractive, as it supports exploration while stabilising policy improvement; however, the resulting target can be multimodal and is poorly matched by unimodal Gaussian policies. We propose Stochastic MeanFlow Policies (SMFP), a one-step generative policy class that maps Gaussian noise to actions through a MeanFlow transformation. This stochastic reparameterisation yields a tractable entropy surrogate and allows MeanFlow policies to be trained within off-policy mirror descent under a unified objective for exploratory yet stable improvement. Across seven MuJoCo benchmarks, SMFP improves over Gaussian and generative baselines while retaining single-step inference efficiency.

2605.21079 2026-05-22 cs.CV

VDFP: Video Deflickering with Flicker-banding Priors

VDFP:基于闪烁带先验的视频去闪烁

Zhiyi Zhou, Libo Zhu, Zihan Zhou, Yulun Zhang, Xiaokang Yang

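AI summary: This paper proposes VDFP, a video deflickering framework built on flicker-banding priors. It constructs the DeViD dataset and introduces the DFM and CPP modules to remove banding artifacts in screen capture. Experiments show that it outperforms existing methods in deflickering quality and spatio-temporal consistency.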

Comments Our dataset and code will be released at https://github.com/ZhiyiZZhou/VDFP

详情
AI中文摘要

使用智能手机捕捉数字屏幕时,由于硬件同步不匹配,经常会产生严重的带状伪影。现有的视频修复方法难以处理这些结构化、周期性的亮度波动,通常导致残留伪影或过度平滑的纹理。我们首先构建了DeViD数据集,以应对可用数据集不足的问题。然后我们提出了VDFP(Video Deflickering with Flicker-banding Priors),一种新颖的感知引导生成框架。首先,我们引入了一种基于滚动快门机制的退化场建模(DFM),能够合成复杂的多带状场景。其次,我们提出了空间-时间连续先验感知(CPP)。不同于传统的二元分割,该模块通过闪烁感知的均方误差(FA-MSE)进行优化,以捕捉亮度过渡。通过零初始化增强的输入层,我们的模型保留了预训练的生成先验以及空间-时间先验感知。广泛的实验表明,VDFP在去闪烁效果和时空一致性方面显著优于其他方法,能够高效消除复杂的带状伪影并保留高保真的空间细节。我们的数据集和代码将在https://github.com/ZhiyiZZhou/VDFP上发布。

英文摘要

Capturing digital screens with smartphones frequently induces severe banding due to hardware synchronization mismatches. Existing video restoration methods struggle with these structured, periodic luminance fluctuations, often resulting in residual artifacts or over-smoothed textures. We first construct DeViD, a real-world dataset covering various scenes, to address the lack of available datasets. We then propose VDFP (Video Deflickering with Flicker-banding Priors), a novel perception-guided generation framework. First, we introduce Degradation Field Modeling (DFM), based on the rolling shutter mechanism, which can synthesize complex multi-banding scenarios. Second, we present spatial-temporal Continuous Prior Perception (CPP). Unlike traditional binary segmentation, this module is optimized via a Flicker-Aware Mean Squared Error (FA-MSE) to capture the luminance transitions. By zero-initializing an augmented input layer, our model preserves pre-trained generative priors as well as spatial-temporal prior perception. Extensive experiments demonstrate that VDFP significantly outperforms other methods, eliminating complex banding with high-fidelity spatial details and temporal consistency. Our dataset and code will be released at https://github.com/ZhiyiZZhou/VDFP.

2605.20514 2026-05-22 cs.LG cs.NA math.NA stat.ML

Fast Reconstruction of Exact Maxwell Dynamics from Sparse Data

从稀疏数据快速重建精确的Maxwell动力学

Dan DeGenaro, Xin Li, Obed Amo, Michael Pokojovy, Sarah Adel Bargal, Markus Lange-Hegermann, Bogdan Raiţă

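AI summary: This paper proposes FLASH-MAX, a neural network architecture that predicts homogeneous electromagnetic fields from sparse pointwise observations. The architecture satisfies Maxwell's equations symbolically by construction, trains quickly from sparse data, and keeps a zero PDE residual, improving the trade-off between precision and optimization speed in scientific machine learning.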

Comments 31 pages, 8 figures

详情
AI中文摘要

我们介绍了FLASH-MAX,一种浅层的、按构造即精确(exact-by-construction)的神经网络架构,用于从稀疏逐点观测预测齐次电磁场。每个隐藏神经元代表Maxwell方程的一个独立精确解,因此网络按构造在符号层面满足控制方程,并能在数秒内从稀疏数据进行端到端训练。我们证明了一个通用逼近结果,表明这种精确模型类在任意域上保持通用性。FLASH-MAX仅用约1K个稀疏逐点观测即可在数秒内达到低于1%的验证相对误差,同时保持零PDE残差,并且在仅从三维空间采样100个观测时仍保持个位数百分比的误差。这些结果表明,将控制结构从损失函数转移到假设类中,可以显著改善科学机器学习中精度与优化速度之间的权衡。

英文摘要

We introduce FLASH-MAX, a shallow, exact-by-construction neural network architecture for predicting homogeneous electromagnetic fields from sparse pointwise observations. Each hidden neuron represents a separate exact solution to Maxwell's equations, so that the network satisfies the governing equations symbolically by construction and can be trained end-to-end from sparse data within seconds. We prove a universal approximation result showing that this exact model class remains universal on arbitrary domains. FLASH-MAX reaches sub-1% relative validation error from about 1K sparse pointwise observations in seconds, all while maintaining a zero PDE residual, and keeps single-digit errors even for only 100 observations sampled from 3D space. These results suggest that moving governing structure from the loss into the hypothesis class can dramatically improve the trade-off between precision and optimization speed in scientific machine learning.

2605.20303 2026-05-22 cs.LG

AirfoilGen: A valid-by-construction and performance-aware latent diffusion model for airfoil generation

AirfoilGen: 一种用于翼型生成的可构造且性能感知的潜在扩散模型

Zhijie Yang, Min Tang, Peng Du, Qiang Zou

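AI summary: This paper proposes AirfoilGen, a new airfoil generation model. A circle sweeping representation constrains the generative process so that generated airfoils respect essential airfoil characteristics, and operating in a learned latent space gives explicit control over aerodynamic performance. The paper also provides a new dataset of over 200,000 airfoils.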

Comments 15 pages

详情
AI中文摘要

翼型形状设计是航空航天工程中的基本任务,直接影响飞行稳定性与燃油消耗。深度学习最近已成为完成该任务的一种有前景的工具,但现有的深度生成方法在几何有效性与物理可控性方面仍然有限。它们对生成的形状几乎没有控制,会产生无效的几何形状,并且通常无法有效地以气动性能为条件。为了解决这些问题,本文提出AirfoilGen,一种构造即有效(valid-by-construction)且性能感知的潜在扩散模型,用于翼型生成。它首先引入一种新的翼型表示方案,即圆扫表示法,以约束生成过程,使输出形状符合基本的翼型特性。然后通过在学习到的潜在空间中操作,实现对气动性能(例如升力和阻力系数)的显式控制:一个transformer模型将翼型形状编码为向量嵌入,而一个条件扩散模型在结合目标气动性能的同时,将高斯噪声去噪为这些潜在嵌入。此外,本文还提出了一个包含超过200,000个翼型的新数据集,该数据集比广泛使用的UIUC翼型数据集(1,650个翼型)大得多,并且更适合训练现代深度生成模型。实验表明,AirfoilGen生成的翼型在几何有效性和气动性能可控性方面远超以往方法,平均性能条件化精度为98.41%。

英文摘要

Airfoil shape design is a fundamental task in aerospace engineering, with a direct impact on flight stability and fuel consumption. Deep learning has recently emerged as a promising tool for this task, but existing deep generative approaches remain limited in both geometric validity and physical controllability. They offer little control over the generated shapes, yielding invalid geometries, and they typically do not condition effectively on aerodynamic performance. To address these issues, this paper proposes AirfoilGen, a valid-by-construction and performance-aware latent diffusion model for airfoil generation. It first introduces a novel airfoil representation scheme, the circle sweeping representation, to constrain the generative process so that output shapes respect essential airfoil characteristics. It then enables explicit control over aerodynamic performance (e.g., lift and drag coefficients) by operating in a learned latent space: a transformer model encodes airfoil shapes into vector embeddings, and a conditional diffusion model denoises Gaussian noise into these latent embeddings while incorporating target aerodynamic performance. In addition, this paper presents a new dataset of over 200,000 airfoils, which is substantially larger than the widely used UIUC airfoil dataset (1,650 airfoils) and more suitable for training modern deep generative models. Experiments demonstrate that AirfoilGen enables airfoil generation with far greater geometric validity and aerodynamic performance controllability than previously achievable, with an average performance-conditioning accuracy of 98.41%.

2605.20302 2026-05-22 cs.LG cs.CV

Neural Collapse by Design: Learning Class Prototypes on the Hypersphere

按设计实现神经崩溃:在超球面上学习类别原型

Panagiotis Koromilas, Theodoros Giannakopoulos, Mihalis A. Nicolaou, Yannis Panagakis

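AI summary: This paper studies Neural Collapse (NC), the theoretical optimum of supervised classification, and argues that the two dominant paradigms, cross entropy (CE) and supervised contrastive learning (SCL), fail to reach it in practice. The authors improve both CE and SCL by contrasting prototypes on the hypersphere, which brings the learned geometry closer to NC on several benchmarks.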

Comments 43rd International Conference on Machine Learning (ICML 2026); Code: https://github.com/pakoromilas/nc_by_design

详情
AI中文摘要

监督分类有一个理论最优解,即神经崩溃(NC),然而其两种主导范式在实践中都无法达到这一最优。交叉熵(CE)保留了径向自由度,导致收敛到退化几何结构,而监督对比学习(SCL)在预训练阶段驱动特征向NC靠近,但在后续的线性探测阶段丢弃了这一结构。我们证明这两种范式实际上是同一种方法的不同表现,即在单位超球面上对比原型。缩小差距需要在各自失败点进行修正。从CE侧,我们提出NTCE和NONL两种归一化损失,将对比优化缺失的成分引入分类器学习:大有效负样本集和解耦的对齐和均匀性项。从SCL侧,我们证明SCL的目标在训练过程中已经优化了原理分类器,其权重是类别均值嵌入,使线性探测变得冗余且有害。实验表明,在四个基准测试(包括ImageNet-1K)中,NTCE和NONL在准确率上超过了CE,接近NC(≥95%),并在不到7.5%的迭代次数中在4/5个指标上匹配CE的收敛NC,而SCL在固定原型的情况下无需线性探测阶段即可达到。学习的几何结构在迁移学习中带来了+5.5%的平均相对改进,严重类别不平衡下可达+8.7%,并且在ImageNet-C上提高了对损坏的鲁棒性。本文将监督学习重新定义为在超球面上的原型学习,通过设计达到NC。

英文摘要

Supervised classification has a theoretical optimum, Neural Collapse (NC), yet neither of its two dominant paradigms reaches it in practice. Cross entropy (CE) leaves radial degrees of freedom unconstrained and converges to a degenerate geometry, while supervised contrastive learning (SCL) drives features toward NC during pretraining but discards this structure in a post hoc linear probing phase. We show that both paradigms are different appearances of the same method that contrasts prototypes on the unit hypersphere, and that closing the gap requires fixing each at its point of failure. From the CE side, we propose NTCE and NONL, two normalized losses that import contrastive optimization's missing ingredients into classifier learning: a large effective negative set and decoupled alignment and uniformity terms. From the SCL side, we prove that SCL's objective already optimizes throughout training for a principled classifier whose weights are the class mean embeddings, making linear probing both redundant and harmful. Empirically, on four benchmarks including ImageNet-1K, NTCE and NONL surpass CE accuracy, closely approximate NC ($\geq 95\%$), and match CE's converged NC on 4/5 metrics in under $7.5\%$ of its iterations, while SCL with fixed prototypes matches linear probing without the hours-long classifier training phase. The learned geometry yields $+5.5\%$ mean relative improvement in transfer learning, up to $+8.7\%$ under severe class imbalance, and improved robustness to corruptions on ImageNet-C. Our work recasts supervised learning as prototype learning on the hypersphere, with NC reached by design.
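
The claim that a classifier's weights can simply be the class mean embeddings can be illustrated with a small sketch. This is not the authors' code: the function names are hypothetical, embeddings are assumed to be non-zero vectors, and prediction by largest cosine similarity is one plausible reading of "prototypes on the unit hypersphere".

```python
import math

def normalize(v):
    """Project a non-zero vector onto the unit hypersphere."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def class_prototypes(embeddings, labels):
    """Mean of the normalized embeddings per class, re-normalized to unit length."""
    sums, counts = {}, {}
    for e, y in zip(embeddings, labels):
        e = normalize(e)
        acc = sums.setdefault(y, [0.0] * len(e))
        for i, x in enumerate(e):
            acc[i] += x
        counts[y] = counts.get(y, 0) + 1
    return {y: normalize([x / counts[y] for x in s]) for y, s in sums.items()}

def predict(prototypes, embedding):
    """Pick the class whose prototype has the largest cosine similarity."""
    e = normalize(embedding)
    return max(prototypes, key=lambda y: sum(a * b for a, b in zip(prototypes[y], e)))

protos = class_prototypes([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]], [0, 0, 1, 1])
label = predict(protos, [2.0, 0.3])
```

No separate linear probe is trained here: the prototypes are computed directly from the embeddings, which is the property the abstract attributes to SCL with fixed prototypes.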

2605.20246 2026-05-22 cs.LG cs.AI

GROW: Aligning GRPO with State-Action Modeling for Open-World VLM Agents

GROW: 将GRPO与状态-动作建模对齐以适用于开放世界VLM智能体

Xiongbin Wu, Zhihao Luo, Shanzhe Lei, Lechao Zhang, Xuhong Wang, Jie Yang, Zhonglong Zheng, Yuanjie Zheng, Xin Tan, Wei Liu

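AI summary: This paper proposes the GROW framework, which decomposes collected trajectories into state-action samples and computes advantages across those samples. This addresses the overly long contexts and noise that arise because standard GRPO needs full trajectories for multi-turn RL. Experiments report state-of-the-art performance on more than 800 Minecraft tasks.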

详情
AI中文摘要

最近,视觉-语言模型(VLM)智能体在开放世界任务中展现出有前景的进步,其中成功的任务完成通常需要多次视觉感知和动作执行的回合。然而,现有方法仍主要依赖于监督微调(SFT)专家演示,而先进的强化学习(RL)算法,特别是分组相对策略优化(GRPO),尚未在这些任务中有效应用于多轮RL,因为标准GRPO需要完整的轨迹作为训练样本,导致上下文过长和噪声。为了解决这个问题,我们提出GROW,一种适用于开放世界VLM智能体的RL框架,将收集的轨迹分解为状态-动作样本,并在这些样本之间计算优势,而不是将完整轨迹视为单一实体。我们进一步提供了一个替代分析,表明尽管分组样本是基于不同的局部状态而不是相同的提示上下文,简化假设下目标可以保留GRPO的核心相对策略优化信号。在超过800个Minecraft任务上的实验表明,我们的方法实现了最先进的性能,证明了我们提出的RL框架在开放世界VLM智能体中的有效性。

英文摘要

Recently, vision-language model (VLM) agents have shown promising progress in open-world tasks, where successful task completion often requires multiple turns of visual perception and action execution. However, existing methods still rely primarily on Supervised Fine-Tuning (SFT) with expert demonstrations, while the advanced reinforcement learning (RL) algorithm, specifically Group Relative Policy Optimization (GRPO), has not been effectively employed for multi-turn RL in these tasks, because standard GRPO requires full trajectories as training samples, which leads to excessively long contexts and noise. To address this issue, we propose GROW, an RL framework for open-world VLM agents that decomposes collected trajectories into state-action samples and computes advantages between these samples rather than treating a full trajectory as a single entity. We further provide a surrogate analysis indicating that, even though the grouped samples are conditioned on different local states rather than an identical prompt context, the objective can preserve the core relative policy optimization signal of GRPO under simplifying assumptions. Experiments on more than 800 Minecraft tasks show that our method achieves state-of-the-art (SOTA) performance, demonstrating the effectiveness of our proposed RL framework for open-world VLM agents.
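
As a rough illustration of computing group-relative advantages over state-action samples instead of whole trajectories, consider the sketch below. It is not GROW itself: the trajectory layout (`steps`, `return`) is hypothetical, and letting every sample inherit its trajectory's return as its reward is an assumption made only for this example.

```python
from statistics import mean, pstdev

def decompose(trajectories):
    """Flatten trajectories into (state, action, reward) samples.

    Assumption for this sketch: each sample inherits its trajectory's return.
    """
    samples = []
    for traj in trajectories:
        for state, action in traj["steps"]:
            samples.append((state, action, traj["return"]))
    return samples

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style normalization: subtract the group mean, divide by the group std."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

trajectories = [
    {"steps": [("s0", "look"), ("s1", "mine")], "return": 1.0},
    {"steps": [("s0", "jump")], "return": 0.0},
]
samples = decompose(trajectories)
advantages = group_relative_advantages([r for _, _, r in samples])
```

Each training example is then a single (state, action) pair with its own advantage, so the policy never has to condition on a full multi-turn context.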

2605.19192 2026-05-22 cs.AI cs.CR

Hallucination as Exploit: Evidence-Carrying Multimodal Agents

幻觉作为利用:证据承载多模态智能体

Guijia Zhang, Hao Zheng, Harry Yang

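AI summary: This paper studies how hallucination in multimodal agents can lead to authorization failures and proposes evidence-carrying multimodal agents (ECA). ECA decomposes each tool call into predicates, obtains typed certificates for them, and authorizes actions through a deterministic gate, converting opaque model belief into auditable residuals and improving system safety.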

Comments 23 pages, 6 figures, 15 tables

详情
AI中文摘要

多模态智能体越来越多地根据截图、文档和网页来选择工具调用,其中一个虚假的感知声明就可能使幻觉从答案质量错误转变为授权失败。我们将这种失败模式形式化为幻觉到动作的转换:一个没有依据的声明为特权动作提供了前提条件。我们提出了证据承载多模态智能体(ECA),它将自由形式的模型文本视为不可采纳的证据,将每个工具调用分解为动作关键谓词,从受限的DOM/OCR/AX验证器中获取类型化证书,并使用确定性门只授权这些证书所支持的特权。ECA并不隐藏感知错误,而是在验证器、模式和实现层面将不透明的模型信念转换为可审计的残余。在17个典型攻击类别上进行的验证器红队测试显示,四个有针对性的加固步骤各自都是必要的;加固后,典型门绕过为0/1,700(Wilson 95%上界0.22%)。使用内容衍生证书,ECA在200个端到端任务上(Wilson 95%上界2.67%)和120个浏览器任务上(上界4.3%)均观察到零不安全执行。对500个分层任务键的HACR审计显示,没有依据的动作关键声明在朴素智能体(100.0%)和仅提示防御(49.6%)下会到达不安全执行,而在ECA下不会。在7,488个GPT-5.4轨迹上进行的Oracle证书回放隔离了门的正确性,而神经裁判基线在相同威胁模型下仍放行大多数不安全动作。由此得到的原则很简单:模型语言可以提议工具使用,但必须由经过认证的谓词来授权。

英文摘要

Multimodal agents increasingly choose tool calls from screenshots, documents, and webpages, where a false perceptual claim can turn hallucination from an answer-quality error into an authorization failure. We formalize this failure mode as hallucination-to-action conversion: an unsupported claim supplies the precondition for a privileged action. We propose evidence-carrying multimodal agents (ECA), which treat free-form model text as inadmissible evidence, decompose each tool call into action-critical predicates, obtain typed certificates from constrained DOM/OCR/AX verifiers, and use a deterministic gate to authorize only the privileges those certificates support. Rather than hiding perception error, ECA converts opaque model belief into auditable residuals at the verifier, schema, and implementation levels. Verifier red-teaming across 17 canonical attack categories shows that four targeted hardening steps are each necessary; after hardening, canonical gate bypass is 0/1,700 (Wilson 95% upper bound 0.22%). With content-derived certificates, ECA observes zero unsafe executions on 200 end-to-end tasks (Wilson 95% upper bound 2.67%) and 120 browser tasks (upper bound 4.3%). A HACR audit on 500 stratified task keys shows that unsupported action-critical claims reach unsafe execution for naive agents (100.0%) and prompt-only defenses (49.6%), but not for ECA. Oracle-certificate replay over 7,488 GPT-5.4 traces isolates gate correctness, while neural judge baselines still admit most unsafe actions under the same threat model. The resulting principle is simple: model language may propose tool use, but certified predicates must authorize it.
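
The gating idea, where model text may propose a tool call but only certified predicates can authorize it, can be sketched as follows. This is an illustrative toy, not the ECA implementation: the `Certificate` fields, the verifier names, and the predicate strings are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Certificate:
    predicate: str   # action-critical predicate, e.g. "amount_matches"
    verifier: str    # which checker produced it, e.g. "dom", "ocr", "ax"
    holds: bool      # whether the verifier confirmed the predicate

# Hypothetical allow-list: free-form model text is never an accepted source.
TRUSTED_VERIFIERS = {"dom", "ocr", "ax"}

def authorize(required_predicates, certificates):
    """Deterministic gate: allow only if every required predicate is certified."""
    certified = {
        c.predicate
        for c in certificates
        if c.holds and c.verifier in TRUSTED_VERIFIERS
    }
    missing = sorted(set(required_predicates) - certified)
    return len(missing) == 0, missing

certs = [
    Certificate("button_visible", "dom", True),
    Certificate("amount_matches", "model_text", True),  # claim from the model itself
]
allowed, missing = authorize(["button_visible", "amount_matches"], certs)
```

In this example the call is refused because `amount_matches` is backed only by the model's own claim, and the returned `missing` list is the kind of record that could be audited afterwards.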

2605.18893 2026-05-22 cs.LG

Position: Graph Condensation Needs a Reset -- Move Beyond Full-dataset Training and Model-Dependence

立场:图压缩需要重置,应超越全数据集训练和模型依赖

Mridul Gupta, Samyak Jain, Vansh Ramani, Hariprasad Kodamana, Sayan Ranu

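AI summary: This position paper argues that current graph condensation methods have systemic flaws and calls for a shift to lightweight, architecture-agnostic, and deployable methods that enable efficient, generalizable, and scalable GNN training.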

详情
AI中文摘要

图神经网络(GNNs)是学习图结构数据的强大工具,但其可扩展性在推荐系统、欺诈检测和分子生物学等领域的现实图规模下日益受到限制。图压缩——生成保留原始模型性能的更小合成图的任务——已成为有前途的解决方案。然而,主流的梯度匹配方法引入了根本性矛盾:它需要在完整数据集上训练以生成压缩版本,从而削弱了效率目标。更糟糕的是,这些方法存在高计算开销、在不同GNN架构间泛化差以及对特定模型配置的脆弱依赖。同样令人担忧的是社区对误导性评估协议如节点压缩比的依赖,这些协议未能反映真正的资源节约、压缩开销以及对神经架构搜索的虚假应用。这些不足并非偶然——它们是系统性的,并阻碍了有意义的进展。在本文的立场论文中,我们主张图压缩目前需要重新开始。我们呼吁超越全数据集训练和模型依赖,转而倡导轻量、架构无关且可部署的方法。通过识别关键方法论缺陷并概述具体研究方向,我们旨在将领域重新导向能够实现压缩真正承诺的方法:高效、通用和可扩展的图神经网络训练。

英文摘要

Graph Neural Networks (GNNs) are powerful tools for learning from graph-structured data, but their scalability is increasingly strained by the size of real-world graphs in domains like recommender systems, fraud detection, and molecular biology. Graph condensation -- the task of generating a smaller synthetic graph that retains the performance of models trained on the original -- has emerged as a promising solution. However, the dominant approach of gradient matching introduces a fundamental contradiction: it requires training on the full dataset to create the compressed version, thereby undermining the goal of efficiency. Worse still, these methods suffer from high computational overhead, poor generalization across GNN architectures, and brittle reliance on specific model configurations. Equally concerning is the community's reliance on misleading evaluation protocols such as node compression ratios, which fail to reflect true resource savings, condensation overhead, and illusory application to neural architecture search. These shortcomings are not incidental -- they are systemic, and they obstruct meaningful progress. In this position paper, we argue that graph condensation, in its current form, needs a reset. We call for moving beyond full-dataset training and model-dependent design, and instead advocate for methods that are lightweight, architecture-agnostic, and practically deployable. By identifying key methodological flaws and outlining concrete research directions, we aim to reorient the field toward approaches that deliver on the true promise of condensation: efficient, generalizable, and usable GNN training at scale.

2605.18721 2026-05-22 cs.LG cs.CL

General Preference Reinforcement Learning

通用偏好强化学习

Muhammad Umer, Muhammad Ahmed Mohsin, Ahsan Bilal, Arslan Chaudhry, Andreas Haupt, Sanmi Koyejo, Emily Fox, John M. Cioffi

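AI summary: This paper proposes General Preference Reinforcement Learning (GPRL). It uses the General Preference Model (GPM) as a verifier for open-ended tasks, so that online RL with continuous exploration can be applied there, and improves performance through multi-dimensional preference comparisons.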

详情
AI中文摘要

训练后将大型语言模型(LLM)对齐分解为两个大致分离的轨道。在线强化学习(RL)通过可验证奖励推动数学和代码的涌现推理,但依赖于无法达到开放任务的程序验证器;而偏好优化处理开放生成任务却牺牲了驱动在线RL的连续探索。弥合这一差距需要一个开放性质量验证器,但标量奖励模型不适合此任务。质量是多维的,任何标量分数都是不完整的代理,使在线RL崩溃于分数最敏感的轴。我们转而采用通用偏好模型(GPM),将响应嵌入到k个斜对称子空间中,并将偏好表示为结构化的、具有不传递性的比较。在此基础上,我们提出通用偏好强化学习(GPRL),将k维结构延伸到策略更新中。GPRL计算每维的组相对优势,对每个优势进行归一化以避免任何轴主导,并通过上下文相关的特征值进行聚合。相同的结构推动了一个闭环漂移监视器,能够检测单轴利用并通过重新加权维度和收紧信任区域进行即时纠正。从Llama-3-8B-Instruct开始,GPRL在AlpacaEval~2.0上达到长度控制的胜利率为56.51%,并在Arena-Hard、MT-Bench和WildBench上优于SimPO和SPPO,通过在长时间训练中抵抗奖励黑客。

英文摘要

Post-training has split large language model (LLM) alignment into two largely disconnected tracks. Online reinforcement learning (RL) with verifiable rewards drives emergent reasoning on math and code but depends on a programmatic verifier that cannot reach open-ended tasks, while preference optimization handles open-ended generation yet forgoes the continuous exploration that powers online RL. Closing this gap requires a verifier for open-ended quality, but a scalar reward model is the wrong shape for the job. Quality is multi-dimensional, and any scalar score is an incomplete proxy that lets online RL collapse onto whichever axis the score is most sensitive to. We turn instead to the General Preference Model (GPM), which embeds responses into $k$ skew-symmetric subspaces and represents preference as a structured, intransitivity-aware comparison. Building on this, we propose General Preference Reinforcement Learning (GPRL), which carries the $k$-way structure through to the policy update. GPRL computes per-dimension group-relative advantages, normalizes each on its own scale so no axis can dominate, and aggregates them with context-dependent eigenvalues. The same structure powers a closed-loop drift monitor that detects single-axis exploitation and corrects it on the fly by reweighting dimensions and tightening the trust region. Starting from $\texttt{Llama-3-8B-Instruct}$, GPRL reaches a length-controlled win rate of $56.51\%$ on AlpacaEval 2.0 while also outperforming SimPO and SPPO on Arena-Hard, MT-Bench, and WildBench by resisting reward hacking across extended training runs.
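
The per-dimension normalization step described above can be sketched in a few lines. This is a simplified illustration rather than GPRL: the `scores` matrix is hypothetical, and the fixed `weights` stand in for the context-dependent eigenvalues mentioned in the abstract.

```python
from statistics import mean, pstdev

def per_dimension_advantages(scores, weights, eps=1e-8):
    """scores[i][d]: score of response i on preference dimension d.

    Each dimension is normalized within the group on its own scale, then the
    normalized advantages are combined with weights[d] (an assumed stand-in
    for context-dependent eigenvalues).
    """
    k = len(weights)
    normed = []
    for d in range(k):
        col = [row[d] for row in scores]
        mu, sigma = mean(col), pstdev(col)
        normed.append([(s - mu) / (sigma + eps) for s in col])
    return [
        sum(weights[d] * normed[d][i] for d in range(k))
        for i in range(len(scores))
    ]

# Dimension 1 has a much larger raw scale than dimension 0.
scores = [[1.0, 100.0], [0.0, 300.0], [0.5, 200.0]]
advantages = per_dimension_advantages(scores, weights=[0.5, 0.5])
```

Because each axis is standardized before aggregation, the large raw scale of the second dimension does not dominate: in this example the two dimensions rank the responses in opposite orders and the combined advantages cancel to approximately zero.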

2605.17837 2026-05-22 cs.CV cs.AI

Temporal Aware Pruning for Efficient Diffusion-based Video Generation

具有时间意识的剪枝用于高效扩散式视频生成

Sheng Li, Yang Sui, Junhao Ran, Bo Yuan, Yue Dai, Xulong Tang

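AI summary: This paper proposes TAPE, a training-free temporal-aware pruning method for efficient diffusion-based video generation. It combines temporal smoothing, token reselection in selected layers, and timestep-level budget scheduling to speed up generation while preserving visual quality.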

详情
AI中文摘要

视频扩散模型最近通过基于ViT的架构实现了高质量视频生成,但生成过程由于需要在长时空序列上进行注意力计算而计算成本高。token剪枝已被证明在ViTs和VLMs中有效。然而,大多数先前的剪枝方法基于注意力,按帧操作,无法确保视频生成任务中帧间的重要时间一致性。在实践中,简单采用仅注意力的剪枝会导致明显退化,由于背景一致性变差、闪烁和图像质量下降。为此,我们提出TAPE,一种无需训练的时间感知剪枝方法,用于高效扩散式视频生成。TAPE(i)应用时间平滑以对齐相邻帧之间的token重要性并抑制选择抖动;(ii)在选定的层中进行token重选,以使token剪枝与层的多样化语义关注相一致,并避免特定区域的误差累积;它还(iii)采用时间步级预算调度,在早期噪声步骤中进行激进剪枝,并在保真度关键的细化阶段放松剪枝。实验结果表明,TAPE在保持高质量视觉保真度的同时提供了显著的加速,优于先前的token减少方法。

英文摘要

Video diffusion models have recently enabled high-quality video generation with ViT-based architectures, but remain computationally intensive because generation requires attention computation over long spatiotemporal sequences. Token pruning has proven effective for ViTs and VLMs. However, most prior pruning methods are attention-based and operate per frame, failing to ensure the vital temporal coherence across frames in video generation tasks. In practice, naively adopting attention-only pruning causes noticeable degradation due to worsened background consistency, flickering, and reduced image quality. To address this, we propose TAPE, a training-free Temporal Aware Pruning for Efficient diffusion-based video generation. TAPE (i) applies temporal smoothing to align token importance across adjacent frames and suppress selection jitter; (ii) performs token reselection in selected layers to align token pruning with the layers' diverse semantic focus and avoid error accumulation in specific areas; and (iii) adopts timestep-level budget scheduling that prunes aggressively at early noisy steps and relaxes pruning during fidelity-critical refinement. The experimental results show that TAPE delivers significant speedups while preserving high visual fidelity, outperforming prior token reduction approaches.
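
Two of the three ingredients, temporal smoothing of token importance and a timestep-dependent pruning budget, can be sketched as below. The abstract does not give the exact formulas, so the neighbour-averaging rule, the linear schedule, and all parameter names here are assumptions chosen for illustration.

```python
def smooth_importance(scores, alpha=0.5):
    """scores[f][t]: importance of token t in frame f.

    Blend each frame's scores with the average of its adjacent frames
    (assumed smoothing rule) to reduce frame-to-frame selection jitter.
    """
    n = len(scores)
    out = []
    for f in range(n):
        neigh = [scores[g] for g in (f - 1, f + 1) if 0 <= g < n]
        row = []
        for t, s in enumerate(scores[f]):
            avg = sum(nb[t] for nb in neigh) / len(neigh) if neigh else s
            row.append((1 - alpha) * s + alpha * avg)
        out.append(row)
    return out

def keep_ratio(step, num_steps, early=0.5, late=0.9):
    """Assumed linear budget: keep fewer tokens at early noisy steps, more late."""
    frac = step / max(num_steps - 1, 1)
    return early + (late - early) * frac

def select_tokens(frame_scores, ratio):
    """Indices of the top-scoring tokens for one frame, in original order."""
    k = max(1, round(ratio * len(frame_scores)))
    order = sorted(range(len(frame_scores)), key=lambda t: -frame_scores[t])
    return sorted(order[:k])

raw = [[1.0, 0.0, 0.4], [0.0, 1.0, 0.4], [1.0, 0.0, 0.4]]
smoothed = smooth_importance(raw)
kept = [select_tokens(row, keep_ratio(0, 10)) for row in smoothed]
```

On the raw scores the top-ranked token alternates between frames; after smoothing, every frame ranks the tokens identically, so the same tokens are kept throughout.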

2605.17659 2026-05-22 cs.LG

Bug or Feature$^2$: Weight Drift, Activation Sparsity and Spikes

Bug or Feature²:权重漂移、激活稀疏性与尖峰

Egor Shvetsov, Aleksandr Serkov, Shokorov Viacheslav, Redko Dmitry, Vladislav Goloshchapov, Evgeny Burnaev

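AI summary: This paper studies a negative weight drift in modern neural architectures, caused by the interaction between standard losses and positively biased activation functions. It analyzes the effect on activation sparsity and model performance and proposes clipping to resolve activation spikes.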

详情
AI中文摘要

现代神经架构的设计通过逐步经验选择逐渐收敛,但其训练动态的机制仍只部分被理解。我们识别并分析了由标准损失与正偏激活函数相互作用引起的负权重漂移。证明在MSE或交叉熵损失下,正预激活的梯度在初始化时期望非负,驱动下游权重向负值发展。这种漂移是优化固有的,而非数据相关,并在多种架构(MLP、ResNet、ViT、GPT-nano、MP-SENe)和非对称激活函数(ReLU、GELU、SiLU)中持续存在。与ReLU结合,权重漂移产生高达90%的激活稀疏性。我们跨79种配置表征稀疏性-准确率权衡,并识别出稀疏性超过约70%时的准确率断崖。虽然ReLU²在GPT-nano中实现了良好的稀疏性-准确率比,但会病理性放大中间Transformer层的激活尖峰。剪枝可以解决这一问题,同时保留平方的表示优势:剪枝ReLU²优于其未剪枝版本,GELU²在GPT-nano上达到最低验证损失。代码可在https://github.com/On-Point-RND/BugOrFeature获取。

英文摘要

The design of modern neural architectures has converged through incremental empirical choices, yet the mechanisms governing their training dynamics remain only partially understood. We identify and analyze a negative weight drift induced by the interaction between standard losses and positively biased activation functions. We prove that under MSE or cross-entropy loss, the gradient with respect to positive pre-activations is non-negative in expectation at initialization, driving downstream weights toward negative values during early training. The drift is intrinsic to optimization rather than data, and persists across architectures (MLP, ResNet, ViT, GPT-nano, MP-SENe) and asymmetric activation functions (ReLU, GELU, SiLU). Coupled with ReLU, weight drift produces activation sparsity reaching up to 90\% in GPT-nano. We characterize the sparsity-accuracy tradeoff across 79 configurations and identify a sharp accuracy cliff above $\sim$70\% activation sparsity. While ReLU$^2$ achieves a good sparsity--accuracy ratio in GPT-nano, it pathologically amplifies identified activation spikes in intermediate transformer layers. Clipping resolves this while preserving the representational benefits of squaring: clipped ReLU$^2$ outperforms its unclipped version, and GELU$^2$ achieves the lowest validation loss on GPT-nano. Code is available at https://github.com/On-Point-RND/BugOrFeature.
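
For readers unfamiliar with the activations involved, here is a minimal sketch of ReLU², a clipped variant, and a sparsity measurement. The abstract does not say where the clip is applied or which threshold is used, so capping the squared output at a hypothetical `clip=6.0` is purely an assumption.

```python
def relu2(x):
    """Squared ReLU: zero for non-positive inputs, x**2 otherwise."""
    return max(x, 0.0) ** 2

def clipped_relu2(x, clip=6.0):
    """Squared ReLU with the output capped at `clip` (hypothetical threshold)."""
    return min(relu2(x), clip)

def sparsity(activations):
    """Fraction of activations that are exactly zero."""
    return sum(1 for a in activations if a == 0.0) / len(activations)

pre_activations = [-2.0, -1.0, 0.0, 3.0]
outputs = [clipped_relu2(x) for x in pre_activations]
```

Clipping leaves the zero pattern untouched, so sparsity is unchanged, while the cap bounds the large values that squaring would otherwise amplify.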

2605.17602 2026-05-22 cs.AI cs.CV cs.LG

AutoRubric-T2I: Robust Rule-Based Reward Model for Text-to-Image Alignment

AutoRubric-T2I: 一种用于文本到图像对齐的鲁棒基于规则的奖励模型

Kuei-Chun Kao, Daixuan Huo, Yuanhao Ban, Cho-Jui Hsieh

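AI summary: This paper proposes AutoRubric-T2I, described as the first rubric learning framework for text-to-image generation, which automatically synthesizes and selects explicit rubrics to guide vision-language model (VLM) judges. It turns reasoning traces from preference pairs into candidate rubrics, has a VLM judge score paired images under each rubric, and uses the pairwise rubric-score differences for preference learning. An ℓ1-regularized logistic regression refiner removes noisy and redundant rubrics, yielding high-quality, interpretable reward signals from a small amount of annotated preference data and outperforming existing reward model baselines on several image reward benchmarks.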

Comments 27 pages

详情
AI中文摘要

将文本到图像(T2I)生成模型与人类偏好对齐越来越依赖于图像奖励模型,这些模型根据提示对齐和感知质量对生成图像进行评分或排序。现有的奖励模型通常在大规模人类偏好语料上训练为Bradley-Terry(BT)偏好模型,这使得训练成本高、适应困难且评估标准不透明。同时,视觉语言模型(VLM)法官可以通过文本评分规则提供更细致的评估,但其手动设计或启发式生成的评分规则可能无法可靠地反映人类偏好。在本文中,我们提出AutoRubric-T2I,这是首个用于T2I的规则学习框架,能够自动合成和选择显式规则以指导VLM法官。AutoRubric-T2I首先通过合成偏好对的推理轨迹生成候选规则,然后利用VLM法官在每种规则下对配对图像进行评分,产生配对规则评分差异用于偏好学习。为了去除噪声和冗余规则,我们进一步采用ℓ1正则化逻辑回归精简器,选择Top-N最判别性的规则。广泛评估表明,AutoRubric-T2I在使用不到0.01%的标注偏好数据的情况下,能够生成高质量、可解释的奖励信号,大幅减少了大规模奖励模型训练的需求。在图像奖励基准如MMRB2上,AutoRubric-T2I优于强奖励模型基线。我们进一步验证AutoRubric-T2I作为强化学习奖励在下游T2I任务中的效果,包括TIIF和UniGenBench++,其中它通过流-GRPO管道在扩散模型上提升了生成质量,优于标量奖励模型。

英文摘要

Aligning Text-to-Image (T2I) generation models with human preferences increasingly relies on image reward models that score or rank generated images according to prompt alignment and perceptual quality. Existing reward models are commonly trained as Bradley-Terry (BT) preference models on large-scale human preference corpora, making them costly to train, difficult to adapt, and opaque in their evaluation criteria. Meanwhile, Vision-Language Model (VLM) judges can provide more fine-grained assessments through textual rubrics, but their manually designed or heuristically generated scoring rules may fail to reliably reflect human preferences. In this paper, we propose AutoRubric-T2I, the first rubric learning framework in T2I that automatically synthesizes and selects explicit rubrics for guiding VLM judges. AutoRubric-T2I first synthesizes reasoning traces from preference pairs into candidate rubrics, then uses a VLM judge to score paired images under each rubric, producing pairwise rubric-score differences for preference learning. To remove noisy and redundant rules, we further employ an $\ell_1$-Regularized Logistic Regression Refiner, which selects the Top-$N$ most discriminative rubrics. Extensive evaluations show that AutoRubric-T2I produces high-quality, interpretable reward signals using less than 0.01% of the annotated preference data, substantially reducing the need for large-scale reward-model training. On image reward benchmarks such as MMRB2, AutoRubric-T2I outperforms strong reward model baselines. We further validate AutoRubric-T2I as an RL reward on downstream T2I tasks, including TIIF and UniGenBench++, where it improves generation quality over scalar reward models using the Flow-GRPO pipeline on diffusion models.

2605.17596 2026-05-22 cs.AI

NeuSymMS: A Hybrid Neuro-Symbolic Memory System for Persistent, Self-Curating LLM Agents

NeuSymMS:一种混合神经符号记忆系统,用于持久、自管理的LLM代理

Mujahid Sultan, Sri Thuraisamy, Daya Rajaratnam

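AI summary: NeuSymMS uses a hybrid neuro-symbolic architecture to let LLM agents learn, remember, and reason about users across sessions. It combines neural fact extraction with a CLIPS-based expert system, and its main contribution is a dual-horizon memory model that supports self-curating memory.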

Comments 7 pages

详情
AI中文摘要

我们介绍了NeuSymMS,一种自适应的记忆系统,使大型语言模型(LLM)代理能够通过混合神经符号架构在多个会话中学习、记忆和推理用户信息。NeuSymMS结合了使用LLM从非结构化对话中提取事实的神经网络,以及基于CLIPS的专家系统,该系统在显式生命周期规则下对事实进行分类、去重和协调。系统将知识表示为主体-关系-值三元组,存储在关系数据库管理系统中。它支持用户/代理/代理到代理的范围,并实现双视野(短期和长期)记忆模型。它利用基于访问的提升和基于时间的剪枝来管理两个视野中的记忆。NeuSymMS在保持记忆连续性的同时避免了上下文窗口膨胀和跨实体污染。我们认为这种架构为生产代理系统提供了可靠、可审计的记忆的实用路径,并讨论其与日志检索、摘要和键值方法的创新性对比。

英文摘要

We present NeuSymMS, an adaptive memory system that enables large language model (LLM) agents to learn, remember, and reason about users across sessions via a hybrid neuro-symbolic architecture. NeuSymMS couples LLM-based neural fact extraction from unstructured dialogue with a CLIPS-based expert system that classifies, deduplicates, and reconciles facts under explicit lifecycle rules. The system represents knowledge as subject-relation-value triples stored in a relational database management system. It supports user/agent/agent-to-agent scoping and implements a dual-horizon (short-term and long-term) memory model. It leverages access-based promotion and time-based pruning of the memory on both horizons. NeuSymMS maintains continuity of memory while avoiding context-window bloat and cross-entity contamination. We argue that this architecture offers a practical path to trustworthy, auditable memory for production agentic systems and discuss its novelty relative to log retrieval, summarization, and key-value approaches.
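
A toy version of the dual-horizon triple store may make the lifecycle rules easier to picture. It is not the NeuSymMS implementation: an in-memory dict stands in for the relational database, plain Python conditions stand in for the CLIPS rules, and the thresholds (`promote_after`, `ttl_short`, `ttl_long`) are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Fact:
    subject: str
    relation: str
    value: str
    last_access: float
    hits: int = 0
    horizon: str = "short"

class TripleMemory:
    def __init__(self, promote_after=3, ttl_short=3600.0, ttl_long=2592000.0):
        self.facts = {}  # (subject, relation) -> Fact
        self.promote_after = promote_after
        self.ttl = {"short": ttl_short, "long": ttl_long}

    def upsert(self, subject, relation, value, now):
        """Deduplicate on (subject, relation); the newest value wins."""
        fact = self.facts.get((subject, relation))
        if fact is None:
            self.facts[(subject, relation)] = Fact(subject, relation, value, now)
        else:
            fact.value, fact.last_access = value, now

    def recall(self, subject, relation, now):
        """Access-based promotion: frequently read facts move to the long horizon."""
        fact = self.facts.get((subject, relation))
        if fact is None:
            return None
        fact.hits += 1
        fact.last_access = now
        if fact.hits >= self.promote_after:
            fact.horizon = "long"
        return fact.value

    def prune(self, now):
        """Time-based pruning on both horizons, each with its own time-to-live."""
        stale = [k for k, f in self.facts.items()
                 if now - f.last_access > self.ttl[f.horizon]]
        for k in stale:
            del self.facts[k]
        return len(stale)

memory = TripleMemory(promote_after=2, ttl_short=10.0, ttl_long=1000.0)
memory.upsert("alice", "likes", "tea", now=0.0)
memory.upsert("bob", "lives_in", "Oslo", now=0.0)
memory.recall("alice", "likes", now=1.0)
memory.recall("alice", "likes", now=2.0)
removed = memory.prune(now=50.0)
```

After the prune, the frequently accessed fact about `alice` survives on the long horizon while the untouched short-horizon fact about `bob` is dropped.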

2605.16923 2026-05-22 cs.CV

Neuroscience-inspired Staged Representation Learning with Disentangled Coarse- and Fine-Grained Semantics for EEG Visual Decoding

受神经科学启发的分阶段表征学习:解纠缠的粗粒度和细粒度语义用于EEG视觉解码

Xiang Gao, Hui Tian, Yanming Zhu, Xuefei Yin, Alan Wee-Chung Liew

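AI summary: This paper proposes a neuroscience-inspired staged representation learning framework that improves EEG visual decoding by disentangling coarse- and fine-grained semantics, addressing the fact that existing methods overlook the staged and hierarchical nature of human visual processing.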

Comments 17 pages, 5 figures

详情
AI中文摘要

从电生理图(EEG)信号解码视觉信息仍然是脑机接口和医疗康复中的基本挑战。现有的EEG视觉解码方法主要集中在学习一个单一的全局EEG嵌入以实现跨模态对齐,但它们大多忽略了人类视觉处理的分阶段和层次特性。为了解决这一限制,我们提出了一种受神经科学启发的分阶段表征学习框架,将EEG视觉解码重新表述为一个阶段特定的表征分解问题。所提出的框架将EEG表征学习分为三个互补的阶段:低级视觉表征学习、高级语义表征学习和整合信息融合。为了加强语义建模,我们进一步引入了一种多模态双级语义学习机制,将粗标签级别的语义与细图像级别的视觉-语义信息分开。此外,引入了语义潜在通道作为从观察到的视觉EEG信号生成的计算表征通道,扩展了通道级别的语义表征空间以实现结构化的语义抽象和跨模态对齐。在THINGS-EEG基准上的大量实验表明,所提出的方法在受试者依赖的零样本评估中表现优异,并在受试者独立的零样本评估中实现了改进的精确检索。此外,包括逐层检索、时间累积、扩展多图像检索和消融研究的额外分析进一步支持了分阶段分解和结构化语义建模的有效性。这些结果表明,显式建模分阶段的感知、语义和整合表征提供了一种有效的受神经科学启发的EEG视觉解码框架。

英文摘要

Decoding visual information from electroencephalography (EEG) signals remains a fundamental challenge in brain-computer interfaces and medical rehabilitation. Existing EEG visual decoding methods mainly focus on learning a single global EEG embedding for cross-modal alignment, but they largely overlook the staged and hierarchical characteristics of human visual processing. To address this limitation, we propose a neuroscience-inspired staged representation learning framework that reformulates EEG visual decoding as a stage-specific representation decomposition problem. The proposed framework organizes EEG representation learning into three complementary phases: low-level visual representation learning, high-level semantic representation learning, and integrative information fusion. To strengthen semantic modeling, we further introduce a multimodal dual-level semantic learning mechanism that separates coarse label-level semantics from fine image-level visual-semantic information. In addition, semantic latent channels are introduced as computational representation channels generated from observed visual EEG signals, expanding the channel-level semantic representation space for structured semantic abstraction and cross-modal alignment. Extensive experiments on the THINGS-EEG benchmark demonstrate that the proposed method achieves superior performance under subject-dependent zero-shot evaluation and improved exact retrieval under subject-independent zero-shot evaluation. Additional analyses, including layer-wise retrieval, temporal accumulation, expanded multi-image retrieval, and ablation studies, further support the effectiveness of staged decomposition and structured semantic modeling. These results suggest that explicitly modeling staged perceptual, semantic, and integrative representations provides an effective neuroscience-inspired framework for EEG-based visual decoding.

2605.16865 2026-05-22 cs.CL

MixSD: Mixed Contextual Self-Distillation for Knowledge Injection

MixSD: 混合上下文自蒸馏用于知识注入

Jiarui Liu, Lechen Zhang, Yongjin Yang, Yinghui He, Yingheng Wang, Weihao Xuan, Zhijing Jin, Mona Diab

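AI summary: This paper proposes MixSD, which mixes tokens from two conditionals of the model itself so that knowledge injection stays aligned with the model's own generation distribution, improving factual memorization and reasoning while preserving pretrained capabilities.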

详情
AI中文摘要

监督微调(SFT)被广泛用于将新知识注入语言模型,但通常会损害预训练能力,如推理和通用领域性能。我们认为这种遗忘是由于微调目标与模型的自回归分布不一致,迫使优化器模仿低概率token序列。为了解决这个问题,我们提出了MixSD,一种无需外部教师的简单方法,用于对齐分布的知识注入。与固定目标训练不同,MixSD通过混合基础模型自身两个条件下的token动态构建监督。所生成的监督序列保留了事实学习信号,同时更接近基础模型的分布。我们在两个合成语料库上评估了MixSD,研究事实回忆和算术功能学习,并结合已建立的开放领域事实问答和知识编辑基准。在多种模型规模和设置下,MixSD在记忆-保留权衡上优于SFT和在线自蒸馏基线,能够保留基础模型的100% held-out能力,同时保持接近完美的训练准确率,而标准SFT只能保留1%。我们进一步表明,MixSD在基础模型下生成的监督目标具有显著更低的NLL,并减少了有害的Fisher敏感参数方向运动。这些结果表明,将监督与模型的本征生成分布对齐是简单且有效的知识注入原则,可以缓解灾难性遗忘。

英文摘要

Supervised fine-tuning (SFT) is widely used to inject new knowledge into language models, but it often degrades pretrained capabilities such as reasoning and general-domain performance. We argue this forgetting arises because fine-tuning targets from humans or external systems diverge from the model's autoregressive distribution, forcing the optimizer to imitate low-probability token sequences. To address this problem, we propose MixSD, a simple external-teacher-free method for distribution-aligned knowledge injection. Instead of training on fixed targets, MixSD constructs supervision dynamically by mixing tokens from two conditionals of the base model itself: an expert conditional that observes the injected fact in context, and a naive conditional that reflects the model's original prior. The resulting supervision sequences preserve the factual learning signal while remaining substantially closer to the base model's distribution. We evaluate MixSD on two synthetic corpora that we construct to study factual recall and arithmetic function acquisition in a controlled setting, together with established benchmarks for open-domain factual question answering and knowledge editing. Across multiple model scales and settings, MixSD consistently achieves a better memorization-retention trade-off compared to SFT and on-policy self distillation baselines, retaining up to 100% of the base model's held-out capability while maintaining near-perfect training accuracy, whereas standard SFT retains as little as 1%. We further show that MixSD produces substantially lower-NLL supervision targets under the base model and reduces harmful movement along Fisher-sensitive parameter directions. These results suggest that aligning supervision with the model's native generation distribution is a simple and effective principle for knowledge injection that mitigates catastrophic forgetting.

2605.16579 2026-05-22 cs.CV cs.LG

Attend Locally, Remember Linearly: Linear Attention as Cross-Frame Memory for Autoregressive Video Diffusion

局部关注,线性记忆:线性注意力作为跨帧记忆用于自回归视频扩散

Kunyang Li, Mubarak Shah, Yuzhang Shang

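AI summary: This paper proposes ARL2, a hybrid attention module that replaces quadratic cross-frame attention with a fixed-size recurrent state. It addresses the scalability bottleneck of autoregressive video diffusion models in long-video generation, achieving linear-time scaling and constant memory while improving temporal consistency.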

Details
AI Chinese abstract

自回归(AR)视频扩散是一种用于流式和交互式视频生成的强大范式。然而,它依赖softmax自注意力,导致计算复杂度随序列长度二次增长,并因键值缓存带来内存开销,从而限制了其向长视频时间范围的扩展。现有的解决方案(例如稀疏注意力和KV缓存压缩)降低了每步成本,但仍依赖线性增长的缓存,或不可逆地丢弃过去的上下文,因此无法解决线性内存增长和流式上下文管理问题。为了解决这一可扩展性瓶颈,我们提出了ARL2(Attend Locally, Remember Linearly,局部关注,线性记忆),一种混合注意力模块,用固定大小的递归状态取代二次的跨帧注意力。我们将自注意力分解为两个分支:一个负责空间细节和局部依赖的帧内softmax分支,以及一个维护固定大小状态以承载流式上下文的帧间门控递归线性分支。我们的关键见解是:softmax注意力捕捉细粒度的局部交互,而递归状态提供可控的长程记忆。这种设计实现了线性时间扩展和常数内存,同时相比全softmax模型提高了时间一致性。为防止带噪声的中间状态破坏记忆,我们仅在去噪后的那次前向传递(denoised pass)之后更新递归状态。为了避免帧内信息不对称,所有token共享同一个更新前的状态,而不是依次更新。据我们所知,这是首个借助面向AR视频的高效两阶段训练方案,将预训练的AR视频扩散模型转换为混合线性注意力架构的工作。在75%的层被替换为混合线性注意力的情况下,模型实现了最高2.26倍的实际运行时间(wall-clock)加速和54%的内存减少,同时保持相当的质量并改善时间一致性。

English abstract

Autoregressive (AR) video diffusion is a powerful paradigm for streaming and interactive video generation. However, its reliance on softmax self-attention leads to compute complexity that is quadratic in sequence length and to memory overhead from key-value caching, which limits its scalability to long video horizons. Existing remedies (e.g., sparse attention and KV-cache compression) reduce per-step cost but still rely on a linearly growing cache or irreversibly discard past context, and thus fail to address linear memory growth and streaming context management. To address this scalability bottleneck, we propose ARL2 (Attend Locally, Remember Linearly), a hybrid attention module that replaces quadratic cross-frame attention with a fixed-size recurrent state. We decompose self-attention into two branches: an intra-frame softmax branch for spatial detail and local dependencies, and an inter-frame gated recurrent linear branch that maintains a fixed-size state for streaming context. Our key insight is that softmax attention captures fine-grained local interactions, while a recurrent state provides controllable long-range memory. This design achieves linear-time scaling with constant memory while improving temporal consistency over the full-softmax model. To prevent noisy intermediate states from corrupting memory, we update the recurrent state only after the denoised pass. To avoid within-frame information asymmetry, all tokens share the same pre-update state rather than a sequentially updated one. To the best of our knowledge, this is the first work to convert a pretrained AR video diffusion model into a hybrid linear attention architecture, through an efficient two-stage training scheme for AR video. With 75% of layers replaced by hybrid linear attention, the model achieves up to 2.26x wall-clock speedup and 54% memory reduction, while maintaining comparable quality and improving temporal consistency.
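
The inter-frame branch described above can be sketched as a generic gated linear-attention recurrence. This is a minimal illustration only: the scalar `gate`, the outer-product update, and the function names are assumptions, and the intra-frame softmax branch is omitted entirely.

```python
def read_state(state, q):
    """Linear-attention read: out[j] = sum_i q[i] * state[i][j]."""
    d_v = len(state[0])
    return [sum(q[i] * state[i][j] for i in range(len(q))) for j in range(d_v)]

def update_state(state, keys, values, gate):
    """Decay the old state by `gate`, then add the frame's key-value outer products."""
    d_k, d_v = len(state), len(state[0])
    new = [[gate * state[i][j] for j in range(d_v)] for i in range(d_k)]
    for k, v in zip(keys, values):
        for i in range(d_k):
            for j in range(d_v):
                new[i][j] += k[i] * v[j]
    return new

def process_frame(state, queries, keys, values, gate=0.9):
    """Handle one frame with a fixed-size (d_k x d_v) memory.

    Every token reads the same pre-update state, so no token in the frame sees
    writes made by earlier tokens of that frame. The state is written once per
    frame; keys/values here stand in for the clean (denoised) frame.
    """
    outputs = [read_state(state, q) for q in queries]
    return outputs, update_state(state, keys, values, gate)

if __name__ == "__main__":
    state = [[0.0, 0.0], [0.0, 0.0]]  # memory size is independent of video length
    for _ in range(3):
        outs, state = process_frame(state, [[1.0, 0.0]], [[1.0, 0.0]], [[2.0, 3.0]])
    print(len(state), len(state[0]))
```

The state keeps the same shape no matter how many frames are processed, which is where the constant-memory claim comes from; per-frame cost does not depend on the number of past frames.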

2605.16362 2026-05-22 cs.LG cs.AI

When Is Rank-1 Steering Cheap? Geometry, Granularity, and Budgeted Search

秩-1引导何时廉价?几何、粒度与预算化搜索

John T. Robertson, Jianing Zhu, Haris Vikalo, Zhangyang Wang

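AI summary: This paper studies why the effectiveness of rank-1 steering varies across concepts, argues that granularity and geometry are the key factors behind steering cost, and introduces the GRACE framework for optimizing the steering process efficiently.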

Comments: Updated Abstract metadata

Details
AI Chinese abstract

激活引导提供了一种无需重新训练即可控制大语言模型的轻量方法,但其效果在不同概念之间差异显著。先前研究通常把这种差异视为许多概念无法由单一引导方向捕捉的证据。我们则认为,这种差异在很大程度上反映的是搜索难度:有用的秩-1干预通常存在,但找到它的代价可能很高。我们将秩-1引导形式化为在干预层和系数上的预算约束优化。在不同概念和模型家族中,提示边界处的方向对齐可以预测有效干预出现的位置,使几何引导的搜索能够以明显更少的评估达到高效用:在三个模型家族上,恢复95%最佳已找到效用所需的试验次数平均减少39.8%。为解释为何某些概念即使在更好的搜索下仍然代价高昂,我们引入了概念粒度,即对比上下文之间方向异质性的度量。粒度区分了两类概念:一类的差异向量共享稳定的全局方向;另一类的提示在每个输入内部局部一致,但效用最大化方向在不同输入之间系统性旋转。更高的粒度与更慢的收敛和更低的最佳已找到性能相关(Pearson相关系数:与达到95%所需试验次数为r=0.44,与最佳已找到效用为r=-0.46,两者均p<0.001)。我们提出了GRACE,一个粒度与表征感知的概念工程(Granularity- and Representation-Aware Concept Engineering)框架,它利用激活几何来诊断引导难度的主要来源,选择合适的补救措施,并高效分配优化资源。我们的结果将问题的提法从“秩-1何时失败?”转变为“秩-1何时廉价且稳定?”,使激活几何从描述性工具转变为LLM控制的可操作先验。

English abstract

Activation steering offers a lightweight way to control LLMs without retraining, but its effectiveness varies sharply across concepts. Prior work often reads this variability as evidence that many concepts are not captured by a single steering direction. We argue instead that much of it reflects search difficulty: a useful rank-1 intervention often exists, but finding it can be expensive. We formalize rank-1 steering as a budget-constrained optimization over intervention layer and coefficient. Across concepts and model families, prompt-boundary directional alignment predicts where effective interventions occur, enabling geometry-guided search that reaches high utility with substantially fewer evaluations, reducing the trials needed to recover 95% of best-found utility by 39.8% on average across three model families. To explain why some concepts remain expensive even under better search, we introduce concept granularity, a measure of directional heterogeneity across contrastive contexts. Granularity distinguishes concepts whose difference vectors share a stable global direction from those where prompts agree locally within each input but the utility-maximizing direction rotates systematically across inputs. Higher granularity is associated with slower convergence and lower best-found performance (Pearson $r{=}0.44$ with trials-to-95%, $r{=}{-}0.46$ with best-found utility, both $p<0.001$). We present GRACE, a Granularity- and Representation-Aware Concept Engineering framework that uses activation geometry to diagnose the dominant source of steering difficulty, select the appropriate remedy, and allocate optimization effort efficiently. Our results shift the frame from "when does rank-1 fail?" to "when is rank-1 cheap and stable?", turning activation geometry from a descriptive tool into an actionable prior for LLM control.
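
As a rough illustration of the two objects the abstract talks about, the sketch below shows a rank-1 intervention (adding a scaled direction to a hidden state) and one possible way to quantify directional heterogeneity of contrastive difference vectors. The abstract does not give the paper's definition of granularity, so `granularity_proxy` (one minus the mean pairwise cosine similarity) is an assumed stand-in, and all function names are hypothetical.

```python
import math

def steer(hidden, direction, coeff):
    """Rank-1 intervention: add coeff times the unit-normalized direction to a hidden state."""
    norm = math.sqrt(sum(x * x for x in direction))
    return [h + coeff * d / norm for h, d in zip(hidden, direction)]

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def granularity_proxy(diff_vectors):
    """Assumed proxy: 1 - mean pairwise cosine of contrastive difference vectors.

    Near 0 when all contexts share one global direction; larger when the
    direction rotates from one input to the next.
    """
    n = len(diff_vectors)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    mean_cos = sum(cosine(diff_vectors[i], diff_vectors[j]) for i, j in pairs) / len(pairs)
    return 1.0 - mean_cos

if __name__ == "__main__":
    stable = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]  # same direction in every context
    rotating = [[1.0, 0.0], [0.0, 1.0]]            # direction changes across inputs
    print(granularity_proxy(stable), granularity_proxy(rotating))
    print(steer([0.0, 0.0], [3.0, 4.0], 5.0))
```

Under this proxy, a concept whose difference vectors all point the same way scores near zero and a single steering direction should transfer across inputs, whereas a high score signals that any one direction fits only part of the inputs, matching the abstract's link between higher granularity and harder search.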