arXivDaily: arXiv daily academic digest, updated Monday through Friday
2602.07906 2026-05-08 cs.LG cs.AI

AceGRPO: Adaptive Curriculum Enhanced Group Relative Policy Optimization for Autonomous Machine Learning Engineering

Yuzhu Cai, Zexi Liu, Xinyu Zhu, Cheng Wang, Yanfeng Wang, Siheng Chen

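Affiliations: School of Computer Science and Engineering, Beihang University; School of Artificial Intelligence, Shanghai Jiao Tong University

AI summary: This paper proposes AceGRPO, which uses an evolving data buffer and a Learnability Potential function to improve sustained iterative optimization in autonomous machine learning engineering; experiments show a 100% valid submission rate on MLE-Bench-Lite.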

Comments 18 pages, 5 figures

Abstract

Autonomous Machine Learning Engineering (MLE) requires agents to perform sustained, iterative optimization over long horizons. While recent LLM-based agents show promise, current prompt-based agents for MLE suffer from behavioral stagnation due to frozen parameters. Although Reinforcement Learning (RL) offers a remedy, applying it to MLE is hindered by prohibitive execution latency and inefficient data selection. Recognizing these challenges, we propose AceGRPO with two core components: (1) Evolving Data Buffer that continuously repurposes execution traces into reusable training tasks, and (2) Adaptive Sampling guided by a Learnability Potential function, which dynamically prioritizes tasks at the agent's learning frontier to maximize learning efficiency. Leveraging AceGRPO, our trained Ace-30B model achieves a 100% valid submission rate on MLE-Bench-Lite, approaches the performance of proprietary frontier models, and outperforms larger open-source baselines (e.g., DeepSeek-V3.2), demonstrating robust capability for sustained iterative optimization. Code is available at https://github.com/yuzhu-cai/AceGRPO.
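
As a rough illustration of the two ingredients the abstract names, the sketch below computes group-relative advantages in the usual GRPO style (reward minus the group mean, divided by the group standard deviation) and samples tasks in proportion to a learnability score. The abstract does not define the Learnability Potential function, so `learnability` here (success rate times failure rate, which peaks at 50%) is purely a hypothetical stand-in, as are all function names and the task dictionary.

```python
import random
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def learnability(success_rate):
    """Hypothetical potential: highest for tasks solved about half the time."""
    return success_rate * (1.0 - success_rate)


def sample_tasks(success_rates, k, rng):
    """Draw k task names with probability proportional to learnability.

    success_rates maps a task name to the agent's current success rate.
    A small floor keeps fully solved or unsolved tasks reachable.
    """
    names = list(success_rates)
    weights = [learnability(success_rates[n]) + 1e-6 for n in names]
    return rng.choices(names, weights=weights, k=k)


if __name__ == "__main__":
    print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
    rates = {"easy": 0.95, "frontier": 0.5, "hard": 0.02}
    print(sample_tasks(rates, k=5, rng=random.Random(0)))
```

Under this stand-in, a task the agent solves half the time is weighted far above one it almost always or almost never solves, which is one plausible reading of "tasks at the agent's learning frontier".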

2602.07322 2026-05-08 cs.RO cs.AI

Action-to-Action Flow Matching

Jindou Jia, Gen Li, Xiangyu Chen, Tuo An, Yuxuan Hu, Jingliang Li, Xinying Guo, Jianfei Yang

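Affiliations: MARS Lab, Nanyang Technological University

AI summary: This paper proposes A2A, which initializes action generation from the previous proprioceptive action rather than random noise, avoiding the high latency of noise sampling and improving the efficiency and robustness of action generation.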

Comments 20 pages, 19 figures

Abstract

Diffusion-based policies have recently achieved remarkable success in robotics by formulating action prediction as a conditional denoising process. However, the standard practice of sampling from random Gaussian noise often requires multiple iterative steps to produce clean actions, leading to high inference latency that incurs a major bottleneck for real-time control. In this paper, we challenge the necessity of uninformed noise sampling and propose Action-to-Action flow matching (A2A), a novel policy paradigm that shifts from random sampling to initialization informed by the previous proprioceptive action. Unlike existing methods that treat proprioceptive action feedback as static conditions, A2A leverages historical proprioceptive sequences, embedding them into a high-dimensional latent space as the starting point for action generation. This design bypasses costly iterative denoising while effectively capturing the robot's physical dynamics and temporal continuity. Extensive experiments demonstrate that A2A exhibits high training efficiency, fast inference speed, and improved generalization. Notably, A2A enables high-quality action generation in as few as a single inference step, and exhibits superior robustness to visual perturbations and enhanced generalization to unseen configurations. Lastly, we also extend A2A to video generation, demonstrating its broader versatility in temporal modeling. Project site: https://lorenzo-0-0.github.io/A2A_Flow_Matching.

2602.05896 2026-05-08 cs.LG cs.AI

Parity, Sensitivity, and Transformers

Alexander Kozachinskiy, Tomasz Steifer, Przemysław Wałȩga

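Affiliations: CENIA; IPPT PAN; Queen Mary University of London

AI summary: This paper studies the minimal number of layers a transformer needs to compute PARITY, proves that a one-layer transformer cannot solve the task while a two-layer transformer can, and gives an improved construction that removes impractical assumptions.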

Comments 15 pages. Version 2 -- lower bound extended from 1-layer 1-head to 1-layer O(1)-head transformers

Abstract

Understanding what neural architectures can and cannot compute is a central challenge in the theory of AI. One of the fundamental problems in this context is the PARITY task, which asks whether the number of 1s in a binary input sequence is even or odd. PARITY is one of the central tasks studied in the theory of computation, yet it remains surprisingly unclear under which conditions transformers can or cannot solve it. In this paper, we show that the minimal number of layers a transformer needs to compute PARITY is two. In particular, we solve the open problem asking whether a one-layer transformer can compute PARITY. We answer it negatively by showing that the average sensitivity of a one-layer transformer grows slower than that of PARITY. Furthermore, we show a new construction for a transformer that computes PARITY, which improves on the existing constructions by removing a number of impractical assumptions. In particular, the existing transformers for PARITY rely on such impractical assumptions as length-dependent positional encoding, hardmax, layernorm without a regularisation parameter, or incompatibility with causal masking. We show that these assumptions can be removed, at the cost of increasing the number of layers from two to four. Specifically, we show that PARITY can be computed by a four-layer transformer that uses softmax attention, length-independent and polynomially bounded positional encoding, and no layernorm, and that is compatible with both causal and non-causal masking.
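
To make the quantity behind the lower bound concrete, the brute-force sketch below computes the average sensitivity of a Boolean function: the expected number of input bits whose flip changes the output, taken over uniformly random inputs. For PARITY every flip changes the output, so the value is exactly n. The helper names are illustrative, and `majority` is included only as a contrasting function, not something the abstract discusses.

```python
from itertools import product


def parity(bits):
    """Return 1 if the number of 1s is odd, else 0."""
    return sum(bits) % 2


def majority(bits):
    """Return 1 if strictly more than half of the bits are 1."""
    return int(2 * sum(bits) > len(bits))


def average_sensitivity(f, n):
    """Average, over all 2**n inputs, of the number of sensitive bits."""
    total = 0
    for x in product((0, 1), repeat=n):
        fx = f(x)
        for i in range(n):
            flipped = x[:i] + (1 - x[i],) + x[i + 1:]
            if f(flipped) != fx:
                total += 1
    return total / 2 ** n


if __name__ == "__main__":
    for n in (2, 4, 6):
        print(n, average_sensitivity(parity, n), average_sensitivity(majority, n))
```

The enumeration is exponential in n, so it is only meant for small sanity checks of this kind.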

2602.02493 2026-05-08 cs.CV cs.AI

PixelGen: Improving Pixel Diffusion with Perceptual Supervision

Zehong Ma, Ruihan Xu, Shiliang Zhang

Affiliations: State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University

AI summary: PixelGen augments x-prediction with LPIPS and P-DINO losses and applies a noise-gating strategy to improve image quality, achieving an FID of 5.11 on ImageNet-256 and strong results in text-to-image generation.

Comments Project Pages: https://zehong-ma.github.io/PixelGen/

Abstract

Pixel diffusion generates images directly in pixel space, avoiding the VAE artifacts and representational bottlenecks of two-stage latent diffusion. The recent JiT further simplifies pixel diffusion with x-prediction, where the model predicts clean images rather than velocity. However, the standard pixel-wise diffusion loss treats all pixels equally, spending model capacity on perceptually insignificant signals and often leading to blurry samples. We propose PixelGen, an end-to-end pixel diffusion framework that augments x-prediction with perceptual supervision. Specifically, PixelGen introduces two complementary perceptual losses on top of x-prediction: an LPIPS loss for local textures and a P-DINO loss for global semantics. To preserve sample coverage, PixelGen further proposes a noise-gating strategy that applies these losses only at lower-noise timesteps. On ImageNet-256 without classifier-free guidance, PixelGen achieves an FID of 5.11 in 80 training epochs, surpassing the latent diffusion baselines. Moreover, PixelGen scales efficiently to text-to-image generation, reaching a GenEval score of 0.79 with only 6 days of training on 8xH800 GPUs. These results show that perceptual supervision substantially narrows the gap between pixel and latent diffusion while preserving a simple one-stage pipeline. Code is available at https://github.com/Zehong-Ma/PixelGen.

2602.02288 2026-05-08 cs.LG cs.AI

AROpt: An Optimization Method for Autoregressive Time Series Forecasting

Zheng Li, Jerry Cheng, Huanying Gu

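Affiliations: Department of Computer Science, New York Institute of Technology

AI summary: This paper proposes a training method that enforces prediction errors to grow with the forecasting horizon, improving the accuracy and flexibility of autoregressive time-series forecasting; experiments report new state-of-the-art results on multiple benchmarks.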

Comments 16 pages, 5 figures, 3 tables

Abstract

Current time-series forecasting models are primarily based on transformer-style neural networks. These models achieve long-term forecasting mainly by scaling up the model size rather than through genuinely autoregressive (AR) rollout. From the perspective of large language model training, traditional time-series forecasting model training ignores the monotonic error-growth heuristic. In this paper, we propose a novel training method for time-series forecasting that enforces two key properties: (1) AR prediction errors should increase with the forecasting horizon, and violations of this trend are interpreted as rollout inconsistency and softly penalized during training; and (2) models can concatenate short-term AR predictions to form flexible long-term forecasts. Empirical results demonstrate that our method establishes a new state-of-the-art across multiple benchmarks, achieving an MSE reduction of more than 10% compared to iTransformer and other recent strong baselines. Furthermore, it enables short-horizon forecasting models to perform reliable long-term predictions at horizons over 7.5 times longer. Code is available at https://github.com/LizhengMathAi/AROpt.
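
The first property, a soft penalty whenever the error at a later rollout step drops below the error at an earlier one, can be sketched as a hinge term added to the usual forecasting loss. This is only one plausible reading of the abstract: the hinge form, the `margin` and `weight` parameters, and both function names are assumptions made for illustration, not the authors' implementation.

```python
def monotonic_growth_penalty(step_errors, margin=0.0):
    """Sum of hinge penalties for every step whose error falls below the previous one.

    step_errors[h] is the prediction error at autoregressive rollout step h.
    A non-decreasing sequence incurs zero penalty.
    """
    pairs = zip(step_errors, step_errors[1:])
    return sum(max(0.0, earlier - later + margin) for earlier, later in pairs)


def training_loss(step_errors, weight=0.1):
    """Mean per-step error plus the weighted monotonicity penalty."""
    base = sum(step_errors) / len(step_errors)
    return base + weight * monotonic_growth_penalty(step_errors)


if __name__ == "__main__":
    print(training_loss([0.10, 0.20, 0.30]))  # errors grow: no penalty
    print(training_loss([0.10, 0.30, 0.20]))  # error drops at step 3: penalized
```

Because the penalty is a hinge rather than a hard constraint, a rollout that violates the trend is discouraged in proportion to the size of the drop instead of being rejected outright.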

2602.01150 2026-05-08 cs.LG cs.AI cs.CR cs.CV math.OC

SMI: Statistical Membership Inference for Reliable Unlearned Model Auditing

Jialong Sun, Zeming Wei, Jiaxuan Zou, Jiacheng Gong, Jie Fu, Chengyang Dong, Heng Xu, Jialong Li, Bo Liu

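Affiliations: Shenzhen University of Advanced Technology; Peking University; Xi’an Jiaotong University; Heilongjiang University; Stevens Institute of Technology

AI summary: This paper proposes SMI, which estimates the forgetting rate of an unlearned model through a statistical membership mixture proportion, requires no shadow-model training, and addresses the limitations of conventional membership inference attacks.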

Abstract

Machine unlearning (MU) is essential for enforcing the right to be forgotten in machine learning systems. A key challenge of MU is how to reliably audit whether a model has truly forgotten specified training data. Membership Inference Attacks (MIAs) are widely used for unlearned model auditing, where samples that evade membership detection are regarded as successfully forgotten. We show this assumption is fundamentally flawed: failed membership inference does not imply true forgetting. We prove that unlearned samples occupy fundamentally different positions in the feature space than non-member samples, making this alignment bias unavoidable and unobservable, which leads to systematically optimistic evaluations of unlearning performance. Meanwhile, training shadow models for MIA incurs substantial computational overhead. To address both limitations, we propose Statistical Membership Inference (SMI), a training-free auditing framework that reformulates auditing as estimating the non-member mixture proportion in the unlearned feature distribution. Beyond estimating the forgetting rate, SMI also provides bootstrap reference ranges for quantified auditing reliability. Extensive experiments show that SMI consistently outperforms all MIA-based baselines, with no shadow model training required. Overall, SMI establishes a principled and efficient alternative to MIA-based auditing methods, with both theoretical guarantees and strong empirical performance.

2602.01124 2026-05-08 cs.LG

ChronoSpike: An Adaptive Spiking Graph Neural Network for Dynamic Graphs

Md Abrar Jahin, Taufikur Rahman Fuad, Jay Pujara, Craig Knoblock

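Affiliations: University of Southern California; Islamic University of Technology

AI summary: This paper proposes ChronoSpike, which combines learnable LIF neurons, multi-head spatially-attentive aggregation, and a lightweight Transformer temporal encoder for efficient representation learning on dynamic graphs, outperforming existing methods while training faster.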

Abstract

Dynamic graph representation learning requires capturing both structural relations and temporal evolution, yet existing approaches face a core trade-off: attention-based methods offer expressiveness at $O(T^2)$ complexity, while recurrent architectures suffer from gradient pathologies and dense state storage. Spiking neural networks provide event-driven efficiency but are constrained by sequential propagation, binary information loss, and local aggregation that lacks global context. We propose ChronoSpike, an adaptive spiking graph neural network that integrates learnable LIF neurons with per-channel membrane dynamics, multi-head spatially-attentive aggregation over continuous features, and a lightweight Transformer temporal encoder. This design enables fine-grained local modeling and long-range dependency capture with $O(T \cdot d)$ activation/state memory and an additional $O(T^2)$ per-node attention term that remains small for the horizons evaluated here. ChronoSpike outperforms twelve state-of-the-art baselines on three large benchmarks by 2.0% Macro-F1 and 2.4% Micro-F1 on average while achieving $3$-$10\times$ faster training than recurrent methods with a constant 105K-parameter budget independent of graph size. We provide theoretical guarantees for membrane potential boundedness, gradient flow stability under contraction factor $\rho<1$, and BIBO stability; interpretability analyses reveal heterogeneous temporal receptive fields and a learned primacy effect with 83-88% sparsity.
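
For readers unfamiliar with leaky integrate-and-fire (LIF) neurons, the sketch below runs a textbook discrete-time LIF update with a separate decay and threshold per channel, which is the general idea behind "per-channel membrane dynamics". It is not ChronoSpike's layer: the hard reset to zero, the plain-list representation, and the name `lif_forward` are assumptions, and a trainable version would treat `decay` and `threshold` as learned parameters with a surrogate gradient for the spike.

```python
def lif_forward(inputs, decay, threshold):
    """Run a leaky integrate-and-fire layer over a sequence.

    inputs:    list over time steps, each a list of per-channel input currents
    decay:     per-channel leak factor in (0, 1]; values below 1 forget faster
    threshold: per-channel firing threshold
    Returns a list over time steps of per-channel binary spikes.
    """
    channels = len(decay)
    membrane = [0.0] * channels
    spikes = []
    for current in inputs:
        step = []
        for c in range(channels):
            # Leak the previous potential, then integrate the new input.
            membrane[c] = decay[c] * membrane[c] + current[c]
            fired = membrane[c] >= threshold[c]
            step.append(1 if fired else 0)
            if fired:
                membrane[c] = 0.0  # hard reset (an assumption of this sketch)
        spikes.append(step)
    return spikes


if __name__ == "__main__":
    drive = [[0.6, 0.6]] * 3
    print(lif_forward(drive, decay=[1.0, 0.5], threshold=[1.0, 1.0]))
```

With identical input, the channel with the stronger leak needs one more time step to reach threshold, which shows how per-channel decay gives channels different temporal receptive fields.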

2602.00407 2026-05-08 cs.LG

Fed-Listing: Federated Label Distribution Inference in Graph Neural Networks

Suprim Nakarmi, Junggab Son, Yue Zhao, Zuobin Xiong

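Affiliations: Department of Computer Science, University of Nevada Las Vegas; Department of Computer Science, University of Southern California

AI summary: This paper proposes Fed-Listing, a gradient-based attack that infers the private label statistics of target clients in federated graph neural networks, and demonstrates its superior performance under non-i.i.d. scenarios.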

Comments 9 pages, 3 figures, and 4 tables

Abstract

Federated Graph Neural Networks (FedGNNs) facilitate collaborative learning across multiple clients with graph-structured data while preserving user privacy. However, emerging research indicates that within this setting, shared model updates, particularly gradients, can unintentionally leak sensitive information of local users. Numerous privacy inference attacks have been explored in traditional federated learning and extended to graph settings, but the problem of label distribution inference in FedGNNs remains largely underexplored. In this work, we introduce Fed-Listing (Federated Label Distribution Inference in GNNs), a novel gradient-based attack designed to infer the private label statistics of target clients in FedGNNs without access to raw data or node features. Fed-Listing only leverages the final-layer gradients exchanged during training to uncover statistical patterns that reveal class proportions in a stealthy manner. Extensive experiments on four benchmark datasets and three GNN architectures show that Fed-Listing significantly outperforms existing baselines, including random guessing and Decaf, even under challenging non-i.i.d. scenarios. Moreover, existing defense mechanisms can barely reduce the attack performance of Fed-Listing, unless the model's utility is severely degraded. The code implementation and Supplementary materials are available here: https://github.com/suprimnakarmi/Fed-Listing.

2601.21187 2026-05-08 cs.CV cs.LG

FRISM: Fine-Grained Reasoning Injection via Subspace-Level Model Merging for Vision-Language Models

Chenyu Huang, Peng Ye, Xudong Tan, Jinhan Mu, Shenghe Zheng, Li Shen, Tao Chen

Affiliations: College of Future Information Technology, Fudan University, Shanghai, China; Shanghai Innovation Institute, China; The Chinese University of Hong Kong, China; Harbin Institute of Technology, China; Shanghai Artificial Intelligence Laboratory, China; Sun Yat-Sen University, Shenzhen, China

AI summary: FRISM performs fine-grained reasoning injection through subspace-level model merging, improving the reasoning capabilities of vision-language models while preserving their visual capabilities.

Comments Accepted by ICML 2026

Abstract

Efficiently enhancing the reasoning capabilities of Vision-Language Models (VLMs) by merging them with Large Reasoning Models (LRMs) has emerged as a promising direction. However, existing methods typically operate at a coarse-grained layer level, which often leads to a trade-off between injecting reasoning capabilities and preserving visual capabilities. To address this limitation, we propose FRISM (Fine-grained Reasoning Injection via Subspace-level model Merging), a fine-grained reasoning injection framework based on subspace-level model merging. Observing that different SVD subspaces contribute differently to reasoning and perception, FRISM decomposes LRM task vectors via Singular Value Decomposition (SVD) and adaptively tunes the scaling coefficients of each subspace through learning to realize fine-grained reasoning injection. Furthermore, we introduce a label-free self-distillation learning strategy with dual-objective optimization using common vision-language perception datasets. Extensive experiments demonstrate that FRISM effectively improves reasoning capabilities while largely preserving the model's visual capabilities by consistently achieving strong performance across diverse visual-language reasoning benchmarks.

2601.20375 2026-05-08 cs.LG cs.AI cs.CL

LLM-AutoDP: Automatic Data Processing via LLM Agents for Model Fine-tuning

Wei Huang, Anda Cheng, Yinggui Wang, Lei Wang, Tao Wei

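Affiliations: Ant Group

AI summary: This paper proposes LLM-AutoDP, a framework that uses LLM agents to automatically generate and optimize data processing strategies, improving processing quality through an iterative learning mechanism while reducing privacy risk and labor cost.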

Comments Accepted by VLDB2026

Abstract

Large Language Models (LLMs) can be fine-tuned on domain-specific data to enhance their performance in specialized fields. However, such data often contains numerous low-quality samples, necessitating effective data processing (DP). In practice, DP strategies are typically developed through iterative manual analysis and trial-and-error adjustment. These processes inevitably incur high labor costs and may lead to privacy issues in high-privacy domains like healthcare due to direct human access to sensitive data. Thus, achieving automated data processing without exposing the raw data has become a critical challenge. To address this challenge, we propose LLM-AutoDP, a novel framework that leverages LLMs as agents to automatically generate and optimize data processing strategies. Our method generates multiple candidate strategies and iteratively refines them using feedback signals and comparative evaluations. This iterative in-context learning mechanism enables the agent to converge toward high-quality processing pipelines without requiring direct human intervention or access to the underlying data. To further accelerate strategy search, we introduce three key techniques: Distribution Preserving Sampling, which reduces data volume while maintaining distributional integrity; Processing Target Selection, which uses a binary classifier to identify low-quality samples for focused processing; and Cache-and-Reuse Mechanism, which minimizes redundant computations by reusing prior processing results. Results show that models trained on data processed by our framework achieve over 80% win rates against models trained on unprocessed data. Compared to AutoML baselines based on LLM agents, LLM-AutoDP achieves approximately a 65% win rate. Moreover, our acceleration techniques reduce the total search time by up to 10 times, demonstrating both effectiveness and efficiency.

2601.15395 2026-05-08 cs.CL cs.AI cs.HC

Beyond Fixed Psychological Personas: State Beats Trait, but Language Models are State-Blind

Tamunotonye Harry, Ivoline Ngong, Chima Nweke, Yuanyuan Feng, Joseph Near

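Affiliations: University of Vermont; Independent Researcher

AI summary: Using the Chameleon dataset, this paper finds that language models are state-blind while reward models react inconsistently to user state, supporting research on affective computing and personalized dialogue.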

Comments Accepted to Findings of ACL 2026

Abstract

User interactions with language models vary due to static properties of the user (trait) and the specific context of the interaction (state). However, existing persona datasets (e.g., PersonaChat, PANDORA) capture only trait and ignore the impact of state. We introduce Chameleon, a dataset of 5,001 contextual psychological profiles from 1,667 Reddit users, each measured across multiple contexts. Using the Chameleon dataset, we present three key findings. First, inspired by Latent State-Trait theory, we decompose variance and find that 74% is within-person (state) while only 26% is between-person (trait). Second, we find that LLMs are state-blind: they focus on trait only, and produce similar responses regardless of state. Third, we find that reward models react to user state, but inconsistently: different models favor or penalize the same users in opposite directions. We release Chameleon to support research on affective computing, personalized dialogue, and RLHF alignment.
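
The within-person versus between-person split can be illustrated with a plain sum-of-squares decomposition: total variation around the grand mean equals variation of each user's scores around that user's own mean (state) plus variation of user means around the grand mean (trait). The sketch below is a simplified one-way decomposition on made-up numbers, not the Latent State-Trait model the authors use; the function name and the toy data are assumptions.

```python
from statistics import mean


def variance_shares(scores_by_user):
    """Return (within_share, between_share) of the total sum of squares.

    scores_by_user maps a user id to that user's scores across contexts.
    """
    all_scores = [s for scores in scores_by_user.values() for s in scores]
    grand_mean = mean(all_scores)
    # Between-person: how far each user's mean sits from the grand mean.
    between = sum(
        len(scores) * (mean(scores) - grand_mean) ** 2
        for scores in scores_by_user.values()
    )
    # Within-person: how far each score sits from its own user's mean.
    within = sum(
        (s - mean(scores)) ** 2
        for scores in scores_by_user.values()
        for s in scores
    )
    total = between + within
    return within / total, between / total


if __name__ == "__main__":
    toy = {"user_a": [1.0, 3.0], "user_b": [2.0, 4.0]}
    print(variance_shares(toy))
```

In the toy data each user swings by two points across contexts while the two users' means differ by only one point, so most of the variation lands in the within-person share.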

2601.14594 2026-05-08 cs.CV

LFS: Learnable Frame Selector for Event-Aware and Temporally Diverse Video Captioning

Lianying Chao, Linfeng Yin, Peiyu Ren, Yifan Jiang, Qiaoyu Ren, Dingcheng Shan, Jing-cheng Pang, Sijie Wu, Xubin Li, Kai Zhang, Xin Chen

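Affiliations: GTS, AI Data Department, Huawei Technologies Co., Ltd.

AI summary: This paper proposes LFS, which learns to select temporally diverse and event-relevant frames to improve detailed video captioning, with notable gains on two benchmarks and ICH-CC.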

Abstract

Video captioning models convert frames into visual tokens and generate descriptions with large language models (LLMs). Since encoding all frames is prohibitively expensive, uniform sampling is the default choice, but it enforces equal temporal coverage while ignoring the uneven distribution of events. This motivates a Learnable Frame Selector (LFS) that selects temporally diverse and event-relevant frames. LFS explicitly models temporal importance to balance temporal diversity and event relevance, and employs a stratified strategy to ensure temporal coverage while avoiding clustering. Crucially, LFS leverages caption feedback from frozen video-LLMs to learn frame selection that directly optimizes downstream caption quality. Additionally, we identify a gap between existing benchmarks and human cognition. We therefore introduce ICH-CC, built from questions carefully designed by annotators to reflect human-consistent understanding of video. Experiments indicate that LFS consistently improves detailed video captioning across two representative community benchmarks and ICH-CC, achieving up to 2.0% gains on VDC and over 4% gains on ICH-CC. Moreover, we observe that captions enhanced with LFS lead to improved performance on video question answering. Overall, LFS provides an effective and easy-to-integrate solution for detailed video captioning.
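
The stratified idea, covering the whole timeline while still preferring important frames, can be sketched by splitting the frame indices into k contiguous strata and keeping the highest-scoring frame in each. This assumes per-frame importance scores are already available (in the paper they would come from the learned selector); the function name, the arg-max rule within each stratum, and the toy scores are all illustrative assumptions.

```python
def stratified_select(scores, k):
    """Pick one frame index per stratum, choosing the top-scoring frame in each.

    scores: per-frame importance scores, in temporal order
    k:      number of frames to keep (1 <= k <= len(scores))
    """
    n = len(scores)
    if not 0 < k <= n:
        raise ValueError("k must be between 1 and the number of frames")
    picked = []
    for j in range(k):
        # Integer boundaries split the n frames into k non-empty strata.
        start = j * n // k
        end = (j + 1) * n // k
        best = max(range(start, end), key=lambda i: scores[i])
        picked.append(best)
    return picked


if __name__ == "__main__":
    importance = [0.1, 0.9, 0.8, 0.2, 0.3, 0.7]
    print(stratified_select(importance, k=3))
```

A pure top-k rule on the same scores would keep frames 1, 2, and 5 as well in this toy case, but on clustered scores it could take every frame from one burst; the per-stratum rule guarantees one pick per time segment.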

2601.12355 2026-05-08 cs.LG

Tree-Structured Synergy of Large Language Models and Bayesian Optimization for Efficient CASH

Beicheng Xu, Weitong Qian, Lingching Tung, Yupeng Lu, Bin Cui

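Affiliations: School of Computer Science, Peking University

AI summary: This paper proposes LB-MCTS, a framework that integrates large language models and Bayesian optimization through a tree structure to address the cold-start problem in high-dimensional CASH; experiments show it outperforms conventional methods.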

Abstract

To lower the expertise barrier in machine learning, the AutoML community has focused on the CASH problem, which jointly automates algorithm selection and hyperparameter tuning. While traditional methods like Bayesian Optimization (BO) struggle with cold-start issues, Large Language Models (LLMs) can mitigate these through semantic priors. However, existing LLM-based optimizers generalize poorly to high-dimensional, structured CASH spaces. In this paper, we propose LB-MCTS, a trajectory-structured optimization framework that uses a Monte Carlo Tree Search tree as a shared state for algorithm selection, hyperparameter refinement, and BO-LLM proposer synergy. Within this shared state, BO provides algorithm-specific surrogate modeling for quantitative search, while the LLM exploits path-aware selective memory to generate semantic proposals and reflections. As the surrogate model improves, a reliability-aware proposer policy adaptively shifts from LLM-driven to BO-driven proposals within a unified search trajectory. Experiments on 104 AMLB datasets demonstrate that LB-MCTS consistently outperforms BO-based, LLM-based, and hybrid baselines.

2601.09298 2026-05-08 cs.CV

Multi-Modal LLM based Image Captioning in ICT: Bridging the Gap Between General and Industry Domain

Lianying Chao, Kai Zhang, Haoran Cai, Sijie Wu, Xubin Li, Xin Chen

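Affiliations: GTS, Huawei Technologies Co., Ltd.

AI summary: This paper proposes a multi-stage training strategy to build DICModel, an image captioning model for the ICT domain, using synthesized image-text pairs to improve performance; experiments show it outperforms existing models on both BLEU and accuracy.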

Journal ref 2025 CCF BigData

Abstract

In the information and communications technology (ICT) industry, training a domain-specific large language model (LLM) or constructing a retrieval-augmented generation system requires a substantial amount of high-value domain knowledge. However, the knowledge is not only hidden in the textual modality but also in the image modality. Traditional methods can parse text from domain documents but do not have image captioning ability. Multi-modal LLMs (MLLMs) can understand images, but they do not have sufficient domain knowledge. To address the above issues, this paper proposes a multi-stage progressive training strategy to train a Domain-specific Image Captioning Model (DICModel) in ICT, and constructs a standard evaluation system to validate the performance of DICModel. Specifically, this work first synthesizes about 7K image-text pairs by combining the Mermaid tool and LLMs, which are used for the first-stage supervised fine-tuning (SFT) of DICModel. Then, ICT-domain experts manually annotate about 2K image-text pairs for the second-stage SFT of DICModel. Finally, experts and LLMs jointly synthesize about 1.5K visual question answering samples for the instruction-based SFT. Experimental results indicate that our DICModel with only 7B parameters performs better than other state-of-the-art models with 32B parameters. Compared to the SOTA models with 7B and 32B parameters, our DICModel increases the BLEU metric by approximately 56.8% and 20.8%, respectively. On the objective questions constructed by ICT domain experts, our DICModel outperforms Qwen2.5-VL 32B by 1% in terms of accuracy rate. In summary, this work can efficiently and accurately extract the logical text from images, which is expected to promote the development of multimodal models in the ICT domain.

2601.08403 2026-05-08 cs.AI

Owen-Shapley Policy Optimization: A Principled RL Algorithm for Generative Search LLMs

Abhijnan Nath, Alireza Bagheri Garakani, Tianchen Zhou, Fan Yang, Yan Gao, Nikhil Krishnaswamy

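Affiliations: Amazon Science; Situated Grounding and Natural Language (SIGNAL) Lab, Colorado State University

AI summary: This paper proposes Owen-Shapley Policy Optimization, which redistributes sequence-level advantages according to tokens' marginal contributions to address the credit assignment problem caused by sparse sequence-level rewards in generative search LLMs; experiments show strong results on the Amazon ESCI and H&M Fashion datasets.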

Comments Added additional experiments, computational analysis and further revisions

详情
AI中文摘要

大型语言模型越来越多地通过强化学习进行训练,以实现个性化推荐任务,但标准方法如GRPO依赖于稀疏的序列级奖励。这些奖励模糊了哪些令牌实际上对高质量输出有贡献,导致信用分配问题。当模型必须从未指定的语言中推断潜在用户意图时,这一问题尤为突出,这在预训练期间很少见,但在部署时却很常见。我们引入Owen-Shapley策略优化(OSPO),一种框架,通过基于令牌边际贡献的序列优势再分配,将任务反馈转化为基于潜力的奖励塑造。OSPO通过Shapley-Owen归因将段级信用分配,同时保持最优策略,而无需参数化的价值模型。通过形成语义上连贯的单元(例如描述产品属性的短语或捕捉偏好的句子),OSPO确定哪些响应部分驱动性能。在Amazon ESCI和H&M Fashion数据集上的实验,包括受控生成任务,显示OSPO在基线模型上表现一致,并在训练期间未见过的分布外检索器上表现出显著的测试时间鲁棒性。

英文摘要

Large language models are increasingly trained via reinforcement learning for personalized recommendation tasks, but standard methods like GRPO rely on sparse, sequence-level rewards. These obscure which tokens actually contribute to high-quality outputs, creating a credit assignment gap. This gap is especially problematic when models must infer latent user intent from under-specified language without ground truth labels, which is a reasoning pattern rarely seen during pretraining but commonly required in deployment. We introduce Owen-Shapley Policy Optimization (OSPO), a framework that redistributes sequence-level advantages based on tokens' marginal contributions to outcomes. OSPO transforms task feedback into potential-based reward shaping via Shapley-Owen attributions to assign segment-level credit while preserving the optimal policy, all without parametric value models. By forming coalitions of semantically coherent units (e.g., phrases describing product attributes or sentences capturing preferences), OSPO identifies which response parts drive performance. Experiments on Amazon ESCI and H&M Fashion datasets including controlled generation tasks show consistent gains over baselines and notable test-time robustness to out-of-distribution retrievers unseen during training.
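
The segment-level attribution idea in the abstract above can be illustrated with plain Shapley values. The sketch below is not OSPO itself: it computes exact Shapley values over a handful of response segments for a toy reward, and it ignores the Owen coalition structure and the reward-shaping step. The names `shapley_values` and `toy_reward`, the segment strings, and the reward definition are all hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(segments, reward):
    """Exact Shapley value of each segment under a set-valued reward.

    `reward` maps a frozenset of segment indices to a float. The cost is
    exponential in len(segments), so this only suits a few segments.
    """
    n = len(segments)
    values = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for coalition in combinations(others, size):
                without = frozenset(coalition)
                with_i = without | {i}
                values[i] += weight * (reward(with_i) - reward(without))
    return values


# Hypothetical response split into three segments.
SEGMENTS = ["red cotton dress", "for summer", "thanks for asking"]


def toy_reward(coalition):
    """Toy reward (an assumption): product attributes matter most, the
    season helps only alongside them, and the filler adds nothing."""
    score = 0.0
    if 0 in coalition:
        score += 0.6
    if 0 in coalition and 1 in coalition:
        score += 0.3
    return score


phi = shapley_values(SEGMENTS, toy_reward)
```

In this toy setup the filler segment receives zero credit and the attributions sum to the reward of the full response, which is the efficiency property that makes Shapley-style credit a natural way to split a single sequence-level signal.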

2601.06320 2026-05-08 cs.LG physics.geo-ph

Sensoformer: Robust Sim-to-Real Inference on Variable-Geometry Sensor Sets via Physics-Structured Randomization

Sensoformer:通过物理结构化随机化实现变量几何传感器集的鲁棒仿真到现实推理

Zhe Jia, Xiaotian Zhang, Junpeng Li

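Affiliations * Institute for Geophysics, University of Texas at Austin

AI Summary Sensoformer uses physics-structured domain randomization to address sim-to-real inference on sparse sensor arrays, infers high-dimensional physical states, and outperforms MPNNs and neural operators on seismic source inversion.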

详情
AI中文摘要

从稀疏、非标准传感器阵列推断高维物理状态是人工智能科学和工业物联网中的基本挑战。标准机器学习架构在这些领域挣扎,因为传感器几何形状不规则、变量基数以及未建模的物理异质性导致的仿真到现实分布偏移。为解决这些挑战,我们提出了Sensoformer,一种集成了物理结构化领域随机化(PSDR)的集合注意力框架。通过显式随机化底层物理动态(例如传播介质、极端噪声和网络可用性丢弃),而不是仅仅视觉特征,PSDR强制学习领域不变的物理算子。使用地震源反演作为严格的现实世界测试平台,Sensoformer在10万合成数据上预训练,并在高度复杂的现实世界目录上评估。我们证明Sensoformer实现了最先进的精度,并优于Message Passing Neural Networks(MPNNs)和Neural Operators(例如DeepONet),后者在极端空间稀疏性和混合模态输入方面挣扎。此外,可解释性分析揭示注意力机制自动发现最优实验设计原则,动态优先选择稀疏正交传感器以克服信息瓶颈。

英文摘要

Inferring high-dimensional physical states from sparse, ad-hoc sensor arrays is a fundamental challenge across AI for Science and industrial IoT. Standard machine learning architectures struggle in these domains due to irregular, variable-cardinality sensor geometries and the profound sim-to-real distribution shift caused by unmodeled physical heterogeneities. To address these challenges, we propose Sensoformer, a set-attention framework integrated with Physics-Structured Domain Randomization (PSDR). By explicitly randomizing the underlying physical dynamics (e.g., propagation media, extreme noise, and network availability dropout) rather than just visual features, PSDR enforces the learning of domain-invariant physical operators. Using seismic source inversion as a rigorous real-world testbed, Sensoformer is pre-trained on 100,000 synthetics and evaluated on a highly complex real-world catalog. We demonstrate that Sensoformer achieves state-of-the-art precision and outperforms Message Passing Neural Networks (MPNNs) and Neural Operators (e.g., DeepONet) which struggle with extreme spatial sparsity and mixed-modality inputs. Furthermore, interpretability analysis reveals that the attention mechanism autonomously discovers optimal experimental design principles, dynamically prioritizing sparse orthogonal sensors to overcome information bottlenecks.

2601.01400 2026-05-08 cs.CL

EternalMath: A Living Benchmark of Frontier Mathematics that Evolves with Human Discovery

EternalMath: 一个与人类发现同步演化的前沿数学基准

Jicheng Ma, Guohua Wang, Xinhua Feng, Yiming Liu, Zhichao Hu, Yuhong Liu

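Affiliations * School of Mathematics, Renmin University of China; Tencent

AI Summary This paper proposes an automated framework for evaluating mathematical reasoning that converts recent mathematical literature into executable tasks, yielding the extensible EternalMath benchmark and revealing performance gaps of LLMs on frontier mathematics.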

详情
AI中文摘要

当前大型语言模型的数学推理评估主要依赖静态基准,导致研究级数学覆盖有限且性能迅速饱和。本文提出一个完全自动化的、基于定理的评估流程,直接将最新同行评审数学文献转化为可执行且可验证的推理任务。该流程识别构造性或定量结果,将其实例化为参数化问题模板,并通过执行验证生成确定性解决方案,从而实现可扩展、可重复和持续更新的评估,无需依赖大规模专家编写。通过设计,该方法支持时间扩展性、内在正确性检查和数学子领域的定制化。应用该流程得到EternalMath,一个从当代研究论文中衍生的演进评估套件。对最新LLM的实验揭示了显著的性能差距,表明前沿数学推理仍远未饱和,凸显了需要与人类数学发现同步进化的评估方法的重要性。

英文摘要

Current evaluations of mathematical reasoning in large language models (LLMs) are dominated by static benchmarks, either derived from competition-style problems or curated through costly expert effort, resulting in limited coverage of research-level mathematics and rapid performance saturation. We propose a fully automated, theorem-grounded pipeline for evaluating frontier mathematical reasoning, which directly transforms recent peer-reviewed mathematical literature into executable and verifiable reasoning tasks. The pipeline identifies constructive or quantitative results, instantiates them into parameterized problem templates, and generates deterministic solutions through execution-based verification, enabling scalable, reproducible, and continuously updatable evaluation without reliance on large-scale expert authoring. By design, this approach supports temporal extensibility, intrinsic correctness checking, and domain-specific customization across mathematical subfields. Applying this pipeline yields EternalMath, an evolving evaluation suite derived from contemporary research papers. Experiments with state-of-the-art LLMs reveal substantial performance gaps, indicating that mathematical reasoning at the research frontier remains far from saturated and underscoring the need for evaluation methodologies that evolve in step with human mathematical discovery.

2601.00655 2026-05-08 cs.LG cs.AI

Interpretability-Guided Bi-objective Optimization: Aligning Accuracy and Explainability

可解释性引导的双目标优化:协调精度与可解释性

Kasra Fouladi, Hamta Rahmani

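Affiliations * Department of Computer Science, Iran University of Science and Technology

AI Summary This paper proposes the IGBO framework, which trains interpretable models by incorporating domain knowledge through a bi-objective formulation; it encodes feature importance as a DAG, measures importance with TIG, introduces a relative importance score $H_k(X, \theta)$, and proves convergence to Pareto-stationary points.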

Comments 12 pages

详情
AI中文摘要

本文介绍可解释性引导的双目标优化(IGBO),通过双目标方法整合领域知识训练可解释模型。IGBO通过基于中心极限定理的构造将特征重要性层级编码为有向无环图(DAG),并利用时间整合梯度(TIG)测量特征重要性。框架提出新的相对重要性分数Hk(X,θ),量化特征随时间的归一化累积归因。我们提出几何投影映射P结合任务和可解释性梯度,并证明收敛到帕累托 stationary 点。为解决TIG计算中的分布外问题,我们概述了最优路径Oracle架构,留待未来工作。基于中心极限定理的可解释性DAG构造提供了关于无环性和传递性的统计保证,中位数阈值有无条件保证,更高置信水平有条件保证。

英文摘要

This paper introduces Interpretability-Guided Bi-objective Optimization (IGBO), a framework that trains interpretable models by incorporating structured domain knowledge via a bi-objective formulation. IGBO encodes feature importance hierarchies as a Directed Acyclic Graph (DAG) via a Central Limit Theorem-based construction and uses Temporal Integrated Gradients (TIG) to measure feature importance. The framework employs a novel Relative Importance Score $H_k(X, \theta)$ that quantifies the normalized cumulative attribution of each feature over time. We propose a geometric projection mapping $P$ for combining task and interpretability gradients, and prove convergence to Pareto-stationary points. To address the out-of-distribution problem in TIG computation, we outline an Optimal Path Oracle architecture, which we leave for future work. The Central Limit Theorem-based construction of the interpretability DAG provides statistical guarantees on acyclicity and transitivity, with an unconditional guarantee for the median threshold and conditional guarantees for higher confidence levels.

2512.22991 2026-05-08 cs.LG

Fusion or Confusion? Multimodal Complexity Is Not All You Need

融合还是混淆?多模态复杂性并不都是你需要的

Tillmann Rheude, Roland Eils, Benjamin Wild

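Affiliations * Berlin Institute of Health, Charité - Universitätsmedizin Berlin; Intelligent Medicine Institute, Fudan University; Department of Mathematics and Computer Science, Freie Universität Berlin

AI Summary Through large-scale experiments, this paper challenges the assumption that complex architectures improve performance in multimodal learning, finds that added complexity often yields confusion rather than effective fusion, and argues for a shift toward methodological rigor.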

详情
AI中文摘要

多模态学习已成为重要研究领域,通过跨模态信息融合可能带来显著性能提升。然而,模型发展趋向于更复杂的深度学习架构,基于多模态特定方法能提升性能的假设。本文通过重新实现19种高影响力多模态方法,在九个包含最多23种模态的多样化数据集上进行大规模实证研究。在标准化实验条件下,包括超参数调优、权重初始化、交叉验证和统计检验,增加多模态复杂性往往导致混淆而非有效数据模态融合。因此,复杂多模态架构并不总能优于单模态基线和简单多模态学习基线(SimBaMM)。通过聚焦案例研究,进一步展示了顶级多模态学习出版物中的具体方法论缺陷,强调了标准化评估实践的必要性。总之,本文呼吁多模态学习研究转向:远离架构创新的追求,转向方法论严谨性。

英文摘要

Multimodal learning has become a prominent research area, with the potential of substantial performance gains by combining information across modalities. At the same time, model development has trended toward increasingly complex deep learning architectures, motivated by the assumption that multimodal-specific methods improve performance. We challenge this assumption through a large-scale empirical study by reimplementing 19 high-impact multimodal methods across nine diverse datasets with up to 23 modalities. Under standardized experimental conditions, including hyperparameter tuning, weight initialization, cross-validation, and statistical testing, increased multimodal complexity often yields confusion rather than effective fusion of data modalities. Accordingly, complex multimodal architectures do not reliably outperform unimodal baselines and a Simple Baseline for Multimodal Learning (SimBaMM). Through a focused case study, we further demonstrate concrete methodological shortcomings even in top-tier multimodal learning publications, underscoring the need for standardized evaluation practices. In summary, we argue for a shift in focus for multimodal learning: away from the pursuit of architectural novelty and toward methodological rigor.

2512.20854 2026-05-08 cs.CL cs.IR

How important is Recall for Measuring Retrieval Quality?

召回在衡量检索质量时有多重要?

Shelly Schwartz, Oleg Vasilyev, Randy Sawaya

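Affiliations * Primer Technologies Inc.

AI Summary This paper evaluates how to measure retrieval quality in large, evolving knowledge bases against LLM-based judgments of response quality, and introduces a simple measure that does not require knowing the total number of relevant documents.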

Comments Dataset: https://huggingface.co/datasets/primer-ai/retrieval-response

详情
AI中文摘要

在大型动态知识库的现实检索设置中,查询相关的总文档数通常未知,因此无法计算召回率。本文评估了几种处理这一限制的已建立策略,通过测量检索质量指标与基于LLM的响应质量判断之间的相关性来处理这一问题。我们跨多个数据集进行了实验,这些数据集中的相关文档数量相对较少(2-15)。我们还介绍了一种简单的检索质量度量方法,该方法在不需知道总相关文档数的情况下表现良好。

英文摘要

In realistic retrieval settings with large and evolving knowledge bases, the total number of documents relevant to a query is typically unknown, and recall cannot be computed. In this paper, we evaluate several established strategies for handling this limitation by measuring the correlation between retrieval quality metrics and LLM-based judgments of response quality, where responses are generated from the retrieved documents. We conduct experiments across multiple datasets with a relatively low number of relevant documents (2-15). We also introduce a simple retrieval quality measure that performs well without requiring knowledge of the total number of relevant documents.
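
To make the limitation concrete, the sketch below contrasts recall@k, which needs the total number of relevant documents, with precision@k, which needs only relevance labels for the retrieved ones. These are the standard textbook definitions, not the new measure the paper introduces; the function names and document IDs are hypothetical.

```python
def precision_at_k(retrieved, relevant_retrieved, k):
    """Fraction of the top-k retrieved documents that are judged relevant."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for doc in top_k if doc in relevant_retrieved)
    return hits / len(top_k)


def recall_at_k(retrieved, relevant_retrieved, k, total_relevant):
    """Fraction of ALL relevant documents that appear in the top k.

    `total_relevant` is the quantity that is typically unknown in a large,
    evolving knowledge base; without it recall cannot be computed.
    """
    if total_relevant is None or total_relevant <= 0:
        raise ValueError("recall needs the total number of relevant documents")
    hits = sum(1 for doc in retrieved[:k] if doc in relevant_retrieved)
    return hits / total_relevant


# Hypothetical ranking; only the retrieved documents have been judged.
ranking = ["d3", "d7", "d1", "d9", "d4"]
judged_relevant = {"d3", "d1"}

p5 = precision_at_k(ranking, judged_relevant, k=5)                    # 2 of 5 retrieved
r5 = recall_at_k(ranking, judged_relevant, k=5, total_relevant=4)     # only if 4 is known
```

Precision here depends only on judgments of what was returned, whereas the recall value changes with whatever total is assumed, which is why recall-free measures are attractive in this setting.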

2512.18181 2026-05-08 cs.CV

MACE-Dance: Motion-Appearance Cascaded Experts for Music-Driven Dance Video Generation

MACE-Dance: 基于动作-外观级联专家的音乐驱动舞蹈视频生成

Kaixing Yang, Jiashu Zhu, Xulong Tang, Ziqiao Peng, Xiangyue Zhang, Puwei Wang, Jiahong Wu, Xiangxiang Chu, Hongyan Liu, Jun He

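Affiliations * Renmin University of China; AMAP, Alibaba Group; Wuhan University; Tsinghua University

AI Summary This paper proposes the MACE-Dance framework, which uses cascaded Mixture-of-Experts for high-quality music-driven dance video generation: a Motion Expert generates 3D motion and an Appearance Expert synthesizes the video, improving visual consistency and motion realism.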

Comments Accepted by SIGGRAPH 2026

详情
AI中文摘要

随着在线舞蹈视频平台的兴起和AI生成内容(AIGC)的快速发展,音乐驱动的舞蹈生成已成为有吸引力的研究方向。尽管在音乐驱动的3D舞蹈生成、姿态驱动的图像动画和音频驱动的谈话头合成等领域的进展显著,现有方法无法直接应用于此任务。此外,该领域的有限研究仍难以同时实现高质量的视觉外观和逼真的人体运动。因此,我们提出了MACE-Dance,一种具有级联混合专家(MoE)的音乐驱动舞蹈视频生成框架。动作专家负责音乐到3D动作生成,强制执行运动学合理性和艺术表现力,而外观专家负责动作和参考条件下的视频合成,保持视觉身份与时空一致性。具体而言,动作专家采用具有BiMamba-Transformer混合架构和无引导训练(GFT)策略的扩散模型,实现了3D舞蹈生成的最先进性能。外观专家采用解耦的运动-美学微调策略,实现了姿态驱动图像动画的最先进性能。为了更好地评估此任务,我们整理了一个大规模且多样化的数据集,并设计了动作-外观评估协议。基于此协议,MACE-Dance也实现了最先进性能。代码可在https://github.com/AMAP-ML/MACE-Dance获取。

英文摘要

With the rise of online dance-video platforms and rapid advances in AI-generated content (AIGC), music-driven dance generation has emerged as a compelling research direction. Despite substantial progress in related domains such as music-driven 3D dance generation, pose-driven image animation, and audio-driven talking-head synthesis, existing methods cannot be directly adapted to this task. Moreover, the limited studies in this area still struggle to jointly achieve high-quality visual appearance and realistic human motion. Accordingly, we present MACE-Dance, a music-driven dance video generation framework with cascaded Mixture-of-Experts (MoE). The Motion Expert performs music-to-3D motion generation while enforcing kinematic plausibility and artistic expressiveness, whereas the Appearance Expert carries out motion- and reference-conditioned video synthesis, preserving visual identity with spatiotemporal coherence. Specifically, the Motion Expert adopts a diffusion model with a BiMamba-Transformer hybrid architecture and a Guidance-Free Training (GFT) strategy, achieving state-of-the-art (SOTA) performance in 3D dance generation. The Appearance Expert employs a decoupled kinematic-aesthetic fine-tuning strategy, achieving state-of-the-art (SOTA) performance in pose-driven image animation. To better benchmark this task, we curate a large-scale and diverse dataset and design a motion-appearance evaluation protocol. Based on this protocol, MACE-Dance also achieves state-of-the-art performance. Code is available at https://github.com/AMAP-ML/MACE-Dance.

2512.18034 2026-05-08 cs.AI

Accelerating Discrete Facility Layout Optimization: A Hybrid CDCL and CP-SAT Architecture

加速离散设施布局优化:一种混合CDCL和CP-SAT架构

Joshua Gibson, Kapil Dhakal

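Affiliations * University of Alabama in Huntsville

AI Summary This paper proposes a hybrid CDCL and CP-SAT architecture that uses CDCL's fast feasibility detection to speed up discrete layout optimization, with CP-SAT providing exact optimization.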

详情
AI中文摘要

离散设施布局设计涉及将物理实体放置以最小化搬运成本,同时遵守严格的安全和空间约束。这是一个组合优化问题,通常通过混合整数线性规划(MILP)或约束编程(CP)来解决,但这些方法在约束密度增加时往往面临可扩展性挑战。本文系统评估了冲突驱动子句学习(CDCL)与VSIDS启发式方法作为离散布局问题替代计算引擎的潜力。通过统一的基准测试工具,我们对CDCL、CP-SAT和MILP在不同网格大小和约束密度下的进行了受控比较。实验结果揭示出性能上的明显二元性:虽然CDCL由于成本盲目的分支策略在优化目标上表现不佳,但它在可行性检测上展现出无与伦比的主导地位,能够以比其他方法快数个数量级的速度解决高度约束的实例。基于这一发现,我们开发了一种新的"Warm-Start"混合架构,利用CDCL快速生成有效的可行性提示,然后将其注入到CP-SAT优化器中。我们的结果证实,这种分层方法成功地加速了精确优化,通过SAT驱动的剪枝来弥合快速可满足性和证明最优性之间的差距。

英文摘要

Discrete facility layout design involves placing physical entities to minimize handling costs while adhering to strict safety and spatial constraints. This combinatorial problem is typically addressed using Mixed Integer Linear Programming (MILP) or Constraint Programming (CP), though these methods often face scalability challenges as constraint density increases. This study systematically evaluates the potential of Conflict-Driven Clause Learning (CDCL) with VSIDS heuristics as an alternative computational engine for discrete layout problems. Using a unified benchmarking harness, we conducted a controlled comparison of CDCL, CP-SAT, and MILP across varying grid sizes and constraint densities. Experimental results reveal a distinct performance dichotomy: while CDCL struggles with optimization objectives due to cost-blind branching, it demonstrates unrivaled dominance in feasibility detection, solving highly constrained instances orders of magnitude faster than competing paradigms. Leveraging this finding, we developed a novel "Warm-Start" hybrid architecture that utilizes CDCL to rapidly generate valid feasibility hints, which are then injected into a CP-SAT optimizer. Our results confirm that this layered approach successfully accelerates exact optimization, using SAT-driven pruning to bridge the gap between rapid satisfiability and proven optimality.

2512.13281 2026-05-08 cs.CV

VideoASMR-Bench: Can AI-Generated ASMR Videos Fool VLMs and Humans?

VideoASMR-Bench: AI生成的ASMR视频能否欺骗视觉语言模型和人类?

Jiaqi Wang, Weijia Wu, Yi Zhan, Rui Zhao, Ming Hu, James Cheng, Wei Liu, Philip Torr, Kevin Qinghong Lin

Affiliations * The Chinese University of Hong Kong; National University of Singapore; Peking University; Monash University; Video Rebirth; University of Oxford

AI Summary VideoASMR-Bench evaluates the detection of AI-generated ASMR videos through fine-grained audio-visual perception and sensory immersion, revealing that current VLMs fall short at identifying AI-generated ASMR videos while VGMs can generate realistic ones.

Comments Code is at https://github.com/video-reality-test/video-reality-test, page is at https://video-reality-test.github.io/

详情
AI中文摘要

随着AI生成的视频越来越难以与现实区分,当前的基准主要关注广义语义对齐和基本物理一致性,提供有限的判别能力。为此,我们引入VideoASMR-Bench,一个基于自主感官脉冲反应(ASMR)视频的基准,强调细粒度音频视觉感知和感官沉浸性。该基准旨在回答两个关键问题:(i)当今的视频理解模型(VLMs)是否足够敏感,能够通过识别细微的视觉、物理或听觉瑕疵来检测AI生成的ASMR视频?(ii)当今的视频生成模型(VGMs)能否生成具有沉浸体验的ASMR视频?该基准包含来自社交媒体精心挑选的1500个高质量真实ASMR视频,以及由九个VGMs生成的2235个合成视频。此外,我们开源了一套可扩展的提示和参考图像,使基准能够动态扩展以适应未来视频模型。此外,我们设计了一个自动理解-生成评估框架,使VGMs试图生成逼真假视频以欺骗VLMs,而VLMs则试图检测它们,形成双方之间的对抗游戏。在VideoASMR-Bench上的评估表明,即使最先进的VLMs,如Gemini-3-Pro,也未能可靠地检测AI生成的ASMR视频。同时,当前前沿的视频生成模型能够生成难以被VLMs区分的ASMR视频,而人类仍能相对容易地识别它们。

英文摘要

With AI-generated videos increasingly indistinguishable from reality, current benchmarks primarily focus on broad semantic alignment and basic physical consistency, offering limited discriminative power for evaluating them. To address this, we introduce VideoASMR-Bench, a benchmark based on Autonomous Sensory Meridian Response (ASMR) videos that emphasizes fine-grained audio-visual perception and sensory immersion. This benchmark aims to answer two key questions: (i) Are today's video understanding models (VLMs) sensitive enough to detect AI-generated ASMR videos by recognizing minor visual, physical, or auditory artifacts? (ii) Can today's video generation models (VGMs) produce convincing ASMR videos with immersive experiences? This benchmark comprises a diverse set of 1,500 high-quality real ASMR videos curated from social media, alongside 2,235 synthetic counterparts generated by nine VGMs. Additionally, we open-source an extensible suite of prompts and reference images, enabling the benchmark to scale dynamically with future video models. Moreover, we design an automatic understanding-generation evaluation framework between VGMs and VLMs, where VGMs aim to produce realistic fake videos to fool the VLMs, while the VLMs seek to detect them, forming an adversarial game between the two parties. Our evaluation on VideoASMR-Bench reveals that even state-of-the-art VLMs, such as Gemini-3-Pro, fail to reliably detect AI-generated ASMR videos. Meanwhile, current frontier video generation models can produce ASMR videos that are difficult for VLMs to distinguish from real ones, while humans can still identify them relatively easily.

2512.10248 2026-05-08 cs.CV cs.AI

RobustSora: De-Watermarked Benchmark for Robust AI-Generated Video Detection

RobustSora:用于鲁棒AI生成视频检测的去水印基准

Zhuo Wang, Xiliang Liu, Ligang Sun

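Affiliations * College of Computer Science, Beijing University of Technology

AI Summary This paper proposes the RobustSora benchmark, which uses 6,500 manually verified videos to evaluate the robustness of AI-generated video detection, finds that watermark manipulation affects detection accuracy, and stresses the importance of watermark-aware training.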

详情
AI中文摘要

AI生成视频模型的普及对信息完整性和数字信任提出了新挑战。然而,一个关键问题仍未解决:商业生成器嵌入可见叠加水印以追踪来源,但现有基准无法控制此变量,无法确定检测器是学习真实生成 artefacts 还是仅仅将水印模式与AI生成标签关联。我们提出了RobustSora,包含6500个手动验证的视频,分为四个类别:Authentic-Clean (A-C)、Generated-Watermarked (G-W)、Generated-DeWatermarked (G-DeW) 和 Authentic-Spoofed (A-S),来源包括Vript、DVF和UltraVideo(真实)以及Sora、Sora 2、Pika、Open-Sora 2和KLing(生成)。两个评估任务隔离水印影响:任务I(水印擦除鲁棒性)测试检测水印移除的AI视频;任务II(水印欺骗鲁棒性)测量在真实视频中注入假水印的误报率。在十个模型上,水印操控导致准确率变化为-9.4至+1.6个百分点(平均6.6个百分点;p<0.01对于7/10模型)。一个安慰剂控制将填充 artefact 混淆限制在≤2个百分点,而水印意识训练增强可恢复3-4个百分点,共同提供因果证据表明检测器主动依赖水印线索。每种生成器的分析显示,Sora 2导致-11至-14个百分点的下降,而Pika和Open-Sora 2则为-3至-6个百分点,表明水印的显著性而非检测器架构是依赖性的主要驱动因素。这些结果支持在AIGC视频检测中进行水印意识的评估和训练。数据集、评估代码和预训练检查点将被发布。

英文摘要

The proliferation of AI-generated video models poses new challenges to information integrity and digital trust. A key confound, however, remains unaddressed: commercial generators embed visible overlay watermarks for provenance tracking, yet no existing benchmark controls for this variable, leaving open whether detectors learn genuine generation artefacts or merely associate watermark patterns with AI-generated labels. We present RobustSora, a benchmark of 6,500 manually verified videos in four categories: Authentic-Clean (A-C), Generated-Watermarked (G-W), Generated-DeWatermarked (G-DeW), and Authentic-Spoofed (A-S), sourced from Vript, DVF, and UltraVideo (authentic) and from Sora, Sora 2, Pika, Open-Sora 2, and KLing (generated). Two evaluation tasks isolate watermark effects: Task-I (Watermark Erasure Robustness) tests detection on watermark-removed AI videos; Task-II (Watermark Spoofing Robustness) measures false-alarm rates on authentic videos injected with fake watermarks. Across ten models spanning specialized detectors, transformer classifiers, and MLLMs, watermark manipulation induces accuracy changes of $-9.4$ to $+1.6$ pp (mean 6.6 pp; $p < 0.01$ for 7/10 models on each task). A placebo control bounds inpainting-artefact confounds at $\le 2$ pp, and a watermark-aware training augmentation recovers 3-4 pp on both tasks, together providing causal evidence that detectors actively rely on watermark cues. Per-generator breakdown shows that Sora 2 induces drops of $-11$ to $-14$ pp versus $-3$ to $-6$ pp for Pika and Open-Sora 2, indicating that watermark prominence, rather than detector architecture, is the principal driver of dependency. These results argue for watermark-aware evaluation and training in AIGC video detection. Dataset, evaluation code, and pretrained checkpoints will be released.

2512.06370 2026-05-08 cs.LG stat.ML

Greedy Alignment Principle for Optimizer Selection

贪心对齐原则用于优化器选择

Jaerin Lee, Kyoung Mu Lee

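Affiliations * Computer Vision Lab, ASRI, Seoul National University

AI Summary This paper proposes a greedy principle based on gradient-update alignment for optimizer selection and tuning, formulates selection as maximizing the expected rate of loss decrease, and validates the resulting dynamic momentum rules on several tasks.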

Comments 34 pages, 4 figures

详情
AI中文摘要

近期研究表明,梯度更新对齐是调节优化器更新的强大信号,通常能加快训练速度。本文将此启发式方法推广为数学上严谨的原则,用于选择和调整优化器超参数。通过将梯度和更新视为信号,优化器视为因果滤波器,将优化器选择建模为在给定优化器家族中最大化损失下降率的期望。证明该目标是优化器滤波器与梯度自相关函数的内积,并证明在估计梯度统计量扰动下存在贪心最优解且具有稳定性界。针对动量类优化器,理论推导出SGD+Momentum和Adam/AdamW的简单动态动量选择规则。在图像分类、语言模型微调和视觉变换器微调中实验表明,所得动态动量规则匹配或优于手动扫描找到的最佳固定超参数,减少对全面动量扫描的需求。代码可在https://github.com/ironjr/gap获取。

英文摘要

Recent works have shown that gradient-update alignment is a powerful signal for modulating optimizer updates, often leading to faster training. We promote this update-wise heuristic to a mathematically grounded principle for selecting and tuning optimizer hyperparameters. By treating gradients and updates as signals and an optimizer as a causal filter that maps between them, we formulate optimizer selection as maximizing the expected drop rate in loss over a prescribed family of optimizers. We show that this objective is exactly the inner product between the optimizer filter and the gradient autocorrelation, and prove that a greedy optimum exists and has a stability bound under perturbations of the estimated gradient statistics. Specializing to momentum-based optimizers, the theory yields simple dynamic momentum selection rules for both SGD+Momentum and Adam/AdamW. Experiments across image classification, language model fine-tuning, and vision transformer fine-tuning show that the resulting dynamic momentum rules match or improve upon the best fixed hyperparameters found via manual sweeps, reducing the need for exhaustive momentum sweeps. Code is available at https://github.com/ironjr/gap.
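
The "filter times autocorrelation" objective can be sketched for a momentum filter. To first order, an update $u_t = -\eta \sum_k h_k g_{t-k}$ changes the loss by $g_t \cdot u_t$, whose expectation is $-\eta \sum_k h_k R(k)$ with $R(k) = \mathbb{E}[g_t \cdot g_{t-k}]$. The code below is an illustration under assumptions, not the paper's rule: the geometric filter $h_k \propto \beta^k$, the unit-norm constraint that keeps candidates comparable, the truncation at `max_lag`, and all function names are choices made here.

```python
from math import sqrt


def autocorrelation(grads, max_lag):
    """Empirical R(k): mean over t of the dot product <g_t, g_{t-k}>.

    Requires max_lag < len(grads) so every lag has at least one sample.
    """
    result = []
    for lag in range(max_lag + 1):
        dots = [
            sum(a * b for a, b in zip(grads[t], grads[t - lag]))
            for t in range(lag, len(grads))
        ]
        result.append(sum(dots) / len(dots))
    return result


def alignment_score(beta, autocorr):
    """Inner product of a unit-norm momentum filter with R(k).

    The filter taps are beta**k, rescaled to unit L2 norm (an assumption
    made here so that larger beta is not favored merely for its gain).
    """
    taps = [beta**k for k in range(len(autocorr))]
    norm = sqrt(sum(h * h for h in taps))
    return sum(h * r for h, r in zip(taps, autocorr)) / norm


def pick_momentum(grads, candidates, max_lag):
    """Greedy choice: the candidate beta with the largest alignment score."""
    autocorr = autocorrelation(grads, max_lag)
    return max(candidates, key=lambda beta: alignment_score(beta, autocorr))


# Hypothetical gradient histories: one steady, one flipping sign each step.
steady = [[1.0, 0.5] for _ in range(20)]
oscillating = [[1.0, 0.5] if t % 2 == 0 else [-1.0, -0.5] for t in range(20)]

beta_steady = pick_momentum(steady, [0.0, 0.5, 0.9], max_lag=10)
beta_oscillating = pick_momentum(oscillating, [0.0, 0.5, 0.9], max_lag=10)
```

Under these assumptions, persistent gradients favor heavy momentum while sign-flipping gradients favor none, which matches the intuition that momentum helps only when past gradients still point in a useful direction.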

2511.22812 2026-05-08 cs.CV

LC4-DViT: Land-cover Creation for Land-cover Classification with Deformable Vision Transformer

LC4-DViT:基于变形视觉Transformer的土地覆盖创建用于土地覆盖分类

Kai Wang, Siyi Chen, Weicong Pang, Chenchen Zhang, Renjun Gao, Ziru Chen, Cheng Li, Dasa Gu, Rui Huang, Alexis Kai Hon Lau

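Affiliations * HKUST; SSE, The Chinese University of Hong Kong (Shenzhen); JHU; NUS; MUST

AI Summary This paper proposes the LC4-DViT framework, which combines generative data creation with a deformation-aware Vision Transformer to improve high-resolution land-cover mapping accuracy, outperforming a vanilla ViT baseline and conventional networks.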

Comments This work has been submitted to the IEEE for possible publication. The project is available at https://github.com/weicongpang/LVC2-DViT.git

详情
AI中文摘要

土地覆盖支撑生态系统服务、水文调节、灾害风险降低和基于证据的土地规划;及时准确的土地覆盖地图对环境管理至关重要。基于遥感的土地覆盖分类提供了一种可扩展的制图途径,但受限于稀缺且不平衡的标注和高分辨率场景中的几何失真。我们提出LC4-DViT(Land-cover Creation for Land-cover Classification with Deformable Vision Transformer),一种结合生成数据创建与变形感知视觉Transformer的框架。文本引导的扩散管道利用GPT-4o生成的场景描述和超分辨率示例合成类平衡、高保真训练图像,而DViT结合DCNv4变形卷积骨干网络与视觉Transformer编码器,共同捕捉细尺度几何和全局上下文。在Aerial Image Dataset(AID)-Beach、Bridge、Desert、Forest、Mountain、Pond、Port和River八类数据上,DViT达到0.9572整体准确率、0.9576宏F1分数和0.9510 Cohen's Kappa,优于基线ViT(0.9274 OA,0.9300宏F1,0.9169 Kappa)并优于ResNet50、MobileNetV2和FlashInternImage。跨数据集实验在三类SIRI-WHU子集(Harbor、Pond、River)上获得0.9333整体准确率、0.9316宏F1分数和0.8989 Kappa,表明良好的迁移能力。基于LLM的评分器使用GPT-4o评分Grad-CAM热图进一步显示,DViT的注意力最符合水文有意义的结构。这些结果表明,描述驱动的生成增强结合变形感知Transformer是高分辨率土地覆盖制图的有前途的方法。

英文摘要

Land cover underpins ecosystem services, hydrologic regulation, disaster-risk reduction, and evidence-based land planning; timely, accurate land-cover maps are therefore critical for environmental stewardship. Remote sensing-based land-cover classification offers a scalable route to such maps but is hindered by scarce and imbalanced annotations and by geometric distortions in high-resolution scenes. We propose LC4-DViT (Land-cover Creation for Land-cover Classification with Deformable Vision Transformer), a framework that combines generative data creation with a deformation-aware Vision Transformer. A text-guided diffusion pipeline uses GPT-4o-generated scene descriptions and super-resolved exemplars to synthesize class-balanced, high-fidelity training images, while DViT couples a DCNv4 deformable convolutional backbone with a Vision Transformer encoder to jointly capture fine-scale geometry and global context. On eight classes from the Aerial Image Dataset (AID), namely Beach, Bridge, Desert, Forest, Mountain, Pond, Port, and River, DViT achieves 0.9572 overall accuracy, 0.9576 macro F1-score, and 0.9510 Cohen's Kappa, improving over a vanilla ViT baseline (0.9274 OA, 0.9300 macro F1, 0.9169 Kappa) and outperforming ResNet50, MobileNetV2, and FlashInternImage. Cross-dataset experiments on a three-class SIRI-WHU subset (Harbor, Pond, River) yield 0.9333 overall accuracy, 0.9316 macro F1, and 0.8989 Kappa, indicating good transferability. An LLM-based judge using GPT-4o to score Grad-CAM heatmaps further shows that DViT's attention aligns best with hydrologically meaningful structures. These results suggest that description-driven generative augmentation combined with deformation-aware transformers is a promising approach for high-resolution land-cover mapping.
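
For readers less familiar with the three metrics reported above, the sketch below computes overall accuracy, macro F1, and Cohen's Kappa from a confusion matrix using their standard definitions. The function name and the two-class matrix are hypothetical and unrelated to the paper's data.

```python
def classification_metrics(confusion):
    """Overall accuracy, macro F1 and Cohen's kappa from a square confusion
    matrix whose rows are true classes and columns are predicted classes."""
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n_classes))
    overall_accuracy = correct / total

    row_sums = [sum(row) for row in confusion]  # TP + FN per class
    col_sums = [
        sum(confusion[r][c] for r in range(n_classes)) for c in range(n_classes)
    ]  # TP + FP per class

    f1_scores = []
    for i in range(n_classes):
        denom = row_sums[i] + col_sums[i]  # equals 2*TP + FP + FN
        f1_scores.append(2 * confusion[i][i] / denom if denom else 0.0)
    macro_f1 = sum(f1_scores) / n_classes

    # Agreement expected by chance from the marginal distributions.
    # Kappa is undefined when this equals 1 (a single class everywhere).
    expected = sum(r * c for r, c in zip(row_sums, col_sums)) / total**2
    kappa = (overall_accuracy - expected) / (1 - expected)
    return overall_accuracy, macro_f1, kappa


# Hypothetical two-class confusion matrix.
oa, macro_f1, kappa = classification_metrics([[8, 2], [1, 9]])
```

Kappa discounts the agreement that would occur by chance, so it sits below overall accuracy whenever chance agreement is nonzero, as in the paper's reported figures.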

2511.22038 2026-05-08 cs.CL

Early Risk Prediction with Temporally and Contextually Grounded Clinical Language Processing

利用时序和情境 grounding 的临床语言处理进行早期风险预测

Rochana Chaturvedi, Yue Zhou, Andrew D. Boyd, Brian T. Layden, Mudassir Rashid, Lu Cheng, Ali Cinar, Barbara Di Eugenio

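Affiliations * Kellogg School of Management, Northwestern University; University of Illinois Chicago; Illinois Institute of Technology

AI Summary This paper proposes two methods for early screening of Type 2 Diabetes from clinical notes: HiTGNN models patient trajectories with a hierarchical temporal graph neural network and performs more equitably across subgroups, while ReVeAL is a lightweight test-time framework that distills LLM reasoning into smaller verifier models and improves sensitivity to true cases.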

详情
AI中文摘要

电子健康记录中的临床笔记捕捉了事件、医生推理和生活方式因素的丰富时序信息,这些信息在结构化数据中往往缺失。利用这些笔记进行预测建模可以及时识别慢性疾病。然而,它们带来了自然语言处理的核心挑战:长文本、事件分布不规则、复杂的时序依赖、隐私限制和资源限制。我们提出两种互补的方法,用于从纵向笔记中进行时序和情境 grounded 的风险预测。首先,我们引入HiTGNN,一种层次化时序图神经网络,整合了笔记内的时间事件结构、就诊间动态和医学知识,以精细的时间粒度建模患者轨迹。其次,我们提出ReVeAL,一种轻量级的测试时框架,将大语言模型的推理提炼成较小的验证器模型。应用于利用从私人和公共医院语料中编纂的现实时序队列进行2型糖尿病(T2D)的偶然筛查,HiTGNN实现了最高的预测准确性,尤其是在短期风险方面,同时保持隐私并限制对大型专有模型的依赖。ReVeAL增强了对真实T2D病例的敏感性并保留了解释性推理。我们的消融实验确认了时序结构和知识增强的价值,公平性分析显示HiTGNN在子群体中表现更加公平。

英文摘要

Clinical notes in Electronic Health Records (EHRs) capture rich temporal information on events, clinician reasoning, and lifestyle factors often missing from structured data. Leveraging them for predictive modeling can be impactful for timely identification of chronic diseases. However, they present core natural language processing (NLP) challenges: long text, irregular event distribution, complex temporal dependencies, privacy constraints, and resource limitations. We present two complementary methods for temporally and contextually grounded risk prediction from longitudinal notes. First, we introduce HiTGNN, a hierarchical temporal graph neural network that integrates intra-note temporal event structures, inter-visit dynamics, and medical knowledge to model patient trajectories with fine-grained temporal granularity. Second, we propose ReVeAL, a lightweight test-time framework that distills LLMs' reasoning into smaller verifier models. Applied to opportunistic screening for Type 2 Diabetes (T2D) using temporally realistic cohorts curated from private and public hospital corpora, HiTGNN achieves the highest predictive accuracy, especially for near-term risk, while preserving privacy and limiting reliance on large proprietary models. ReVeAL enhances sensitivity to true T2D cases and retains explanatory reasoning. Our ablations confirm the value of temporal structure and knowledge augmentation, and fairness analysis shows HiTGNN performs more equitably across subgroups.

2511.21471 2026-05-08 cs.AI

SpatialBench: Benchmarking Multimodal Large Language Models for Spatial Cognition

SpatialBench: 多模态大语言模型空间认知的基准测试

Peiran Xu, Sudong Wang, Yao Zhu, Jianing Li, Gege Qi, Yunjian Zhang

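Affiliations * Sun Yat-Sen University; HKUST (GZ); Zhejiang University; Peking University; CAICT; UCAS; CUC

AI Summary This paper proposes the SpatialBench benchmark, which evaluates the spatial abilities of multimodal large language models through a five-level spatial cognition framework and reveals a performance gap between perception and symbolic reasoning.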

详情
AI中文摘要

空间认知是现实世界多模态智能的基础,使模型能有效与物理环境交互。尽管多模态大语言模型(MLLMs)取得显著进展,现有基准往往简化空间认知为单一维度指标,无法捕捉空间能力的层次结构和相互依赖性。为此,我们提出一个分层空间认知框架,将空间智能分解为五个逐步复杂的层次,从基本观察到高级规划。基于此分类,我们构建了覆盖15个任务的大型精细基准SpatialBench。为进一步统一评估异质任务,我们引入了一个高阶能力导向的度量标准,可靠评估模型的整体空间推理能力。大量实验表明,模型在感知方面表现强劲,但在符号推理、因果推理和规划方面受限。此外,人类测试显示,人类表现出选择性、目标导向的抽象能力,而MLLMs则倾向于过度关注表面细节,缺乏连贯的空间意图。本文建立了首个系统框架,用于衡量多模态大语言模型的分层空间认知,为未来空间智能系统奠定基础。

英文摘要

Spatial cognition is fundamental to real-world multimodal intelligence, allowing models to effectively interact with the physical environment. While multimodal large language models (MLLMs) have made significant strides, existing benchmarks often oversimplify spatial cognition, reducing it to a single-dimensional metric that fails to capture the hierarchical structure and interdependence of spatial abilities. To address this gap, we propose a hierarchical spatial cognition framework that decomposes spatial intelligence into five progressively complex levels, from basic observation to high-level planning. Building upon this taxonomy, we construct SpatialBench, a large-scale, fine-grained benchmark covering 15 tasks aligned with these cognitive levels. To provide a unified evaluation across heterogeneous tasks, we further introduce a high-level capability-oriented metric that reliably assesses a model's overall spatial reasoning ability. Extensive experiments across a large number of MLLMs reveal distinct performance stratification across cognitive levels: models exhibit strong perceptual grounding yet remain limited in symbolic reasoning, causal inference, and planning. Additional human tests demonstrate that humans perform selective, goal-directed abstraction, while MLLMs tend to over-attend to surface details without coherent spatial intent. Our work establishes the first systematic framework for measuring hierarchical spatial cognition in MLLMs, laying the foundation for future spatially intelligent systems.

2511.19972 2026-05-08 cs.CV

Boosting Reasoning in Large Multimodal Models via Activation Replay

通过激活回放提升大多模态模型的推理能力

Yun Xing, Xiaobin Hu, Qingdong He, Jiangning Zhang, Shuicheng Yan, Shijian Lu, Yu-Gang Jiang

发表机构 * Nanyang Technological University(南洋理工大学) National University of Singapore(国立新加坡大学) Tencent Youtu Lab(腾讯云图实验室) Zhejiang University(浙江大学) Fudan University(复旦大学)

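AI summary This paper uses Activation Replay to improve the multimodal reasoning of large models, manipulating low-entropy activations to strengthen reasoning, and validates the method on mathematics, visual agent, and video reasoning scenarios.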

Comments CVPR 2026

Details
AI-generated abstract (translated from Chinese)

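Recently, Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an effective way to incentivize reasoning in Large Multimodal Models (LMMs), but its underlying mechanism remains unclear. We examine how input activations are affected by RLVR through the logit lens. A systematic study of multiple post-trained LMMs shows that RLVR unexpectedly shifts low-entropy activations, while high-entropy activations are less affected. Controlled experiments further show that these phenomena are associated with LMM reasoning, suggesting that modulating low-entropy activations may be beneficial. To this end, we propose Activation Replay, a novel and effective training-free method that improves the multimodal reasoning of post-trained LMMs without expensive policy optimization. Our design manipulates visual tokens at test time, replaying low-entropy activations from the input context of base LMMs to regulate their RLVR counterparts. Activation Replay leads to better reasoning across diverse scenarios, including mathematics, o3-like visual agents, and video reasoning. We further show that Activation Replay improves Pass@K and mitigates the narrower reasoning coverage of RLVR. We compare our design against alternatives, such as replaying high-entropy activations instead of low-entropy ones, or intervening directly across models instead of manipulating input tokens, and the comparison favors our implementation. Code is publicly available at https://github.com/latentcraft/replay.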

English abstract

Recently, Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an effective approach to incentivizing reasoning capability in Large Multimodal Models (LMMs), while the underlying mechanisms behind this post-training paradigm are poorly understood. We begin by exploring how input activations are affected by RLVR through the perspective of the logit lens. Our systematic investigations across multiple post-trained LMMs suggest that RLVR shifts low-entropy activations unexpectedly, while high-entropy ones are less affected. We further demonstrate that such phenomena are associated with LMM reasoning by controlled experiments, suggesting a potentially beneficial role of modulating low-entropy activations. To this end, we propose Activation Replay, a novel, simple yet effective training-free approach that boosts multimodal reasoning of post-trained LMMs without requiring expensive policy optimization. Our design involves manipulation of visual tokens at test time, replaying low-entropy activations from the input context of base LMMs to regulate the RLVR counterparts. Activation Replay triggers better reasoning across diverse scenarios, including mathematics, o3-like visual agents, and video reasoning. We further show that Activation Replay boosts Pass@K and mitigates the narrower reasoning coverage of RLVR. Our design is compared against alternative choices, such as replaying high-entropy activations instead of low-entropy ones, or direct cross-model intervention instead of manipulating input tokens, demonstrating the superiority of our implementation. Code is publicly available at https://github.com/latentcraft/replay.
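
The abstract separates low-entropy from high-entropy activations under a logit-lens view. As a rough illustration of that distinction only, and not the authors' implementation, the sketch below computes the Shannon entropy of a softmax distribution at each position and picks the lowest-entropy positions. The function names, the toy logits, and the choice of `k` are all assumptions made for this example; the abstract does not specify how the paper thresholds entropy or which layers it reads.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def low_entropy_positions(logits_per_position, k):
    """Return the indices of the k lowest-entropy positions, plus all entropies.

    Hypothetical helper: each row stands in for the vocabulary logits that a
    logit-lens projection might produce for one input position.
    """
    ents = [entropy(softmax(row)) for row in logits_per_position]
    order = sorted(range(len(ents)), key=lambda i: ents[i])
    return sorted(order[:k]), ents

# Toy data (assumed): four positions over a four-token vocabulary.
toy_logits = [
    [8.0, 0.0, 0.0, 0.0],  # sharply peaked, low entropy
    [1.0, 1.0, 1.0, 1.0],  # uniform, maximum entropy log(4)
    [5.0, 4.9, 0.0, 0.0],  # two near-equal candidates, medium entropy
    [0.0, 9.0, 0.0, 0.0],  # sharply peaked, low entropy
]
selected, ents = low_entropy_positions(toy_logits, k=2)
print(selected)
```

In this toy setting the two peaked rows are selected. How the selected activations would then be replayed from a base model into a post-trained one is not described in the abstract and is left out here.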

2511.00751 2026-05-08 cs.AI cs.CL

Self-Consistency Is Losing Its Edge: Diminishing Returns and Rising Costs in Modern LLMs

Self-Consistency Is Losing Its Advantage: Diminishing Marginal Returns and Rising Costs in Modern Large Language Models

Chiyan Loo

Affiliations * Chiyan Loo

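AI summary The study argues that as models grow stronger, self-consistency becomes inefficient and may even degrade performance. Experiments show that increasing the number of reasoning paths yields limited accuracy gains while costs rise substantially, and the author recommends multi-path sampling only for problems that exceed a model's single-pass reliability.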

Comments 7 pages, 3 figures

Details
AI-generated abstract (translated from Chinese)

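Self-consistency (sampling multiple reasoning paths and selecting the most frequent answer) was originally designed for an era when language models made frequent and unpredictable errors. This study argues that as models become more capable, the technique grows increasingly wasteful and may degrade performance on problems that modern models already solve reliably. In experiments with Gemini 2.5 models on HotpotQA and MATH-500, increasing the number of sampled paths yields very small accuracy gains (0.4% on HotpotQA across 20 samples, and 1.6% on MATH-500), while token cost grows almost linearly with the number of samples. The key finding is that performance plateaus early and, in some configurations, declines at high sample counts, indicating that when a model already solves a problem reliably, additional paths introduce noise rather than signal. As inference costs rise with model scale, indiscriminate use of self-consistency is hard to justify. We recommend reserving multi-path sampling for problems that clearly exceed a model's single-pass reliability.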

English abstract

Self-consistency -- sampling multiple reasoning paths and selecting the most frequent answer -- was designed for an era when language models made frequent, unpredictable errors. This study argues that the technique has become increasingly wasteful as models grow stronger, and may degrade performance on problems that modern models already solve reliably. Using Gemini 2.5 models on HotpotQA and MATH-500, we show that accuracy gains from increasing the number of sampled reasoning paths are minimal -- 0.4% on HotpotQA across 20 samples, and 1.6% on MATH-500 -- while token costs scale nearly linearly with sample count. Critically, performance plateaued early and in some configurations declined at high sample counts, suggesting that additional paths introduce noise rather than signal when models already solve problems reliably. As inference costs rise with model scale, indiscriminate self-consistency is difficult to justify. We recommend reserving multi-path sampling for problems that demonstrably exceed a model's single-pass reliability.
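
For readers unfamiliar with the technique, self-consistency as described above reduces to a majority vote over sampled final answers, with token cost growing in proportion to the number of paths. The sketch below is a minimal illustration under stated assumptions: the sampled answers and the per-path token count are made-up values, the function names are hypothetical, and no model is called.

```python
from collections import Counter

def self_consistency(answers):
    """Return the most frequent answer and its share of the votes."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    winner, votes = counts.most_common(1)[0]
    return winner, votes / len(answers)

def total_tokens(tokens_per_path, n_paths):
    """Cost model (assumed): every path costs about the same number of tokens."""
    return tokens_per_path * n_paths

# Made-up final answers from five sampled reasoning paths.
sampled = ["42", "42", "41", "42", "17"]
answer, share = self_consistency(sampled)
print(answer, share)

# Under this cost model, 20 paths cost 20 times as much as one path.
print(total_tokens(500, 20), total_tokens(500, 1))
```

The vote share gives a crude signal of how often single passes agree; when it is already close to 1 on a class of problems, extra paths add cost without changing the answer, which is the situation the abstract warns about.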