arXivDaily: arXiv daily academic digest, updated Monday through Friday
2602.00834 2026-05-12 cs.LG cs.AI stat.ML

A Minimum Variance Path Principle for Accurate and Stable Score-Based Density Ratio Estimation

Wei Chen, Jiacheng Li, Shigui Li, Zhiqi Lin, Junmei Yang, John Paisley, Delu Zeng

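AI summary This paper addresses the practical path dependence of score-based density ratio estimation and proposes the Minimum Variance Path (MVP) principle. By deriving a closed-form expression for the path variance of the score function, it makes that variance tractable to optimize. The method parameterizes the path with a flexible Kumaraswamy mixture model and learns low-variance paths automatically, without manual selection, which improves the accuracy and stability of the estimator and yields new state-of-the-art results on several benchmark tasks.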

Details
Journal ref
The Fourteenth International Conference on Learning Representations, 2026
English abstract

Score-based methods are powerful across machine learning, but they face a paradox: theoretically path-independent, yet practically path-dependent. We resolve this by proving that practical training objectives differ from the ideal, ground-truth objective by a crucial, overlooked term: the path variance of the score function. We propose the MVP (**M**inimum **V**ariance **P**ath) Principle to minimize this path variance. Our key contribution is deriving a closed-form expression for the variance, making optimization tractable. By parameterizing the path with a flexible Kumaraswamy Mixture Model, our method learns data-adaptive, low-variance paths without heuristic manual selection. This principled optimization of the complete objective yields more accurate and stable estimators, establishing new state-of-the-art results on challenging benchmarks and providing a general framework for optimizing score-based interpolation. Our code can be found at https://github.com/Hoemr/OpenDRE.git.
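As a rough illustration of why a Kumaraswamy mixture is a convenient path parameterization, the sketch below builds a monotone schedule on [0, 1] from a convex combination of Kumaraswamy CDFs, using the standard closed form F(t) = 1 - (1 - t^a)^b. How the paper actually uses the mixture is not stated in the abstract, so the function names, weights, and shape parameters here are hypothetical.

```python
def kumaraswamy_cdf(t, a, b):
    """CDF of the Kumaraswamy(a, b) distribution on [0, 1]."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    return 1.0 - (1.0 - t ** a) ** b


def mixture_schedule(t, weights, shapes):
    """Convex combination of Kumaraswamy CDFs (hypothetical path schedule).

    Each component CDF is nondecreasing with F(0) = 0 and F(1) = 1, so the
    normalized mixture is also a valid monotone map from [0, 1] onto [0, 1].
    """
    total = sum(weights)
    return sum(w / total * kumaraswamy_cdf(t, a, b)
               for w, (a, b) in zip(weights, shapes))


# Illustrative values only; they are not taken from the paper.
weights = [0.3, 0.7]
shapes = [(2.0, 5.0), (0.5, 1.5)]
grid = [i / 10 for i in range(11)]
schedule = [mixture_schedule(t, weights, shapes) for t in grid]
```

Because the mixture stays monotone for any positive weights and shape parameters, those parameters could in principle be tuned by gradient-based optimization without leaving the set of valid schedules.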

2602.00327 2026-05-12 cs.AI cs.HC

SayNext-Bench: Why Do LLMs Struggle with Next-Utterance Anticipation?

Yueyi Yang, Haotian Liu, Fang Kang, Mengqi Zhang, Zheng Lian, Hao Tang, Haoyu Chen

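AI summary This paper studies why large language models (LLMs) struggle to predict the next utterance in human dialogue and proposes SayNext-Bench, a benchmark that evaluates how well multimodal LLMs anticipate context-conditioned responses across diverse scenarios. To support the benchmark, the authors build SayNext-PC, a large-scale multimodal dialogue dataset, and design a multi-level evaluation framework covering lexical similarity, emotion-intention consistency, and LLM-based alignment. They also propose SayNext-Chat, a model that uses learnable priming tokens to fuse perceptual cues with anticipatory priors and markedly improves performance on the anticipation task, underscoring the importance of multimodal cues and active anticipation in natural dialogue.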

Details
English abstract

We explore the use of large language models (LLMs) for next-utterance anticipation in human dialogue. Despite recent advances in LLMs demonstrating their ability to engage in natural conversations with users, we show that even leading models surprisingly struggle to anticipate a human speaker's next utterance. In contrast, humans can readily anticipate forthcoming utterances based on multi-modal cues -- such as gestures, gaze, and emotional tone -- from the context. To systematically examine this gap, we propose SayNext-Bench, a benchmark evaluating MLLMs on anticipating context-conditioned responses across diverse real-world scenarios. To support it, we build SayNext-PC, a large-scale multimodal dialogue dataset, and carefully design a multi-level evaluation framework spanning lexical similarity, emotion-intention consistency, and LLM-based overall alignment. Building on this, we develop SayNext-Chat, a cognitively inspired dual-route MLLM that incorporates learnable priming tokens to fuse perceptual cues with anticipatory priors. Extensive experiments demonstrate that SayNext-Chat consistently outperforms state-of-the-art MLLMs across all evaluation levels, corroborated by user studies and LLM-as-Judge evaluations. Our results emphasize (i) the indispensable role of multimodal cues and (ii) active anticipatory processing as foundations of natural human interaction currently missing in MLLMs.

2601.22904 2026-05-12 cs.CV cs.AI cs.LG

Hyperspherical Autoencoder for High-Fidelity Image Reconstruction and Generation

Hun Chang, Byunghee Cha, Jong Chul Ye

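AI summary This paper proposes the Hyperspherical Autoencoder (HAE), a framework for improving the fidelity of image reconstruction and generation. It introduces a directional feature alignment objective and a hierarchical convolutional patch embedding module to balance semantic consistency with detail preservation, and uses Riemannian flow matching to train a Diffusion Transformer (DiT) directly on the hyperspherical latent space. Experiments show strong generation and reconstruction quality, supporting the method's effectiveness for high-fidelity image generation.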

Comments 22 pages, 20 figures

Details
English abstract

Recent studies have explored using pretrained Vision Foundation Models (VFMs) such as DINO for generative autoencoders, showing strong generative performance. Unfortunately, existing approaches often suffer from limited reconstruction fidelity due to the loss of high-frequency details. In this work, we present the Hyperspherical Autoencoder (HAE), a framework that bridges semantic representation and pixel-level reconstruction. Our key insight is that while semantic information in contrastive representations is primarily directional, enforcing strict magnitude matching hinders the preservation of fine-grained details. To address this, we introduce a Directional Feature Alignment objective that enforces semantic consistency while allowing flexible feature magnitudes for detail retention, alongside a Hierarchical Convolutional Patch Embedding module to enhance local structure preservation. Furthermore, observing that SSL-based representations intrinsically lie on a hypersphere, we employ Riemannian Flow Matching to train a Diffusion Transformer (DiT) directly on this spherical latent manifold. Notably, our manifold-aware DiT exhibits highly efficient convergence, achieving an exceptional gFID of 1.96 alongside a reconstruction rFID of 0.78 and a PSNR of 25.2 dB, validating the advantages of our manifold-aware approach.
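The distinction between directional alignment and strict magnitude matching can be made concrete with a small sketch. It assumes a plain cosine-based loss as a stand-in for a directional objective; the abstract does not give the exact form of Directional Feature Alignment, so `cosine_alignment_loss` is a hypothetical illustration rather than the authors' loss.

```python
import math


def cosine_alignment_loss(student, teacher):
    """1 - cos(student, teacher): depends on direction only, not magnitude."""
    dot = sum(s * t for s, t in zip(student, teacher))
    norm_s = math.sqrt(sum(s * s for s in student))
    norm_t = math.sqrt(sum(t * t for t in teacher))
    if norm_s == 0.0 or norm_t == 0.0:
        raise ValueError("a zero vector has no direction")
    return 1.0 - dot / (norm_s * norm_t)


def mse_loss(student, teacher):
    """Mean squared error: penalizes any mismatch, including in magnitude."""
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(student)


teacher = [1.0, 2.0, 2.0]
scaled = [3.0 * x for x in teacher]  # same direction, three times the norm

directional = cosine_alignment_loss(scaled, teacher)  # approximately 0
strict = mse_loss(scaled, teacher)                    # strictly positive
```

A rescaled feature incurs no directional penalty but a large squared-error penalty, which is the kind of freedom in feature magnitude the abstract describes.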

2601.22449 2026-05-12 cs.AI

Emergence of Physical Intelligence via Controllable Information Production

Tristan Shah, Stas Tiomkin

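AI summary This study proposes Controllable Information Production (CIP), a new approach for training agents without external rewards so that useful behavior emerges from interaction with the environment. Grounded in dynamical systems and optimal control theory, CIP measures the rate at which an agent produces information, quantifying controllable complexity while avoiding the bias of conventional intrinsic motivation methods. It unifies intrinsic motivation and optimal control in a single framework, reveals a connection between the structure of the value function and Kolmogorov-Sinai entropy, and outperforms existing methods on robot learning tasks, supporting the general principle that physical intelligence emerges from driving systems toward the edge of controllable chaos.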

Details
English abstract

Intrinsic Motivation (IM) aims to train agents without external rewards, enabling useful behavior to emerge from the agent's interaction with its environment alone. However, the dominant IM approaches rely on information-theoretic quantities with designer-chosen variables, introducing bias and lacking a principled connection to dynamics or optimal control (OC). We introduce Controllable Information Production (CIP), a new foundation for IM explicitly grounded in dynamical systems and OC. CIP measures the rate at which an agent produces information, capturing controllable complexity without external knowledge or bias. CIP unifies IM and OC into a single framework, formalizing physical intelligence as the control of information production. It further reveals connections between the structure of the value function and Kolmogorov-Sinai entropy. CIP consistently outperforms prior IM methods on standard benchmarks in robot learning and solves tasks they fail on, including humanoid self-righting. These results support a general organizing principle: physical intelligence emerges from driving systems toward the edge of controllable chaos.

2601.22427 2026-05-12 cs.LG cs.AI

CoDCL: Counterfactual-Inspired Augmentation Contrastive Learning for Temporal Link Prediction in Social Networks

Hantong Feng, Duxin Chen, Wenwu Yu

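AI summary This paper proposes CoDCL, a dynamic network learning framework for temporal link prediction in social networks. It combines counterfactual-inspired data augmentation with contrastive learning to help models adapt to complex network structures that evolve over time. Using a dynamic treatment design and an efficient structural neighborhood exploration strategy, CoDCL generates high-quality counterfactual data that captures temporal changes in interaction patterns more accurately. Experiments show that CoDCL significantly outperforms existing state-of-the-art methods on multiple real-world datasets.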

Comments This work has been submitted to the IEEE for possible publication

Details
English abstract

Temporal link prediction is crucial for rapidly growing social networks. Existing methods often overlook the underlying causal mechanisms that drive link formation, making it difficult for algorithms to adapt to complex structures that continuously evolve over time. To adapt to complex temporal environments, prediction models need to be robust to emerging structural changes. We propose CoDCL, a dynamic network learning framework that combines counterfactual-inspired augmentation with contrastive learning to address this deficiency. Furthermore, we devise a comprehensive strategy to generate high-quality counterfactual data, combining a dynamic treatment design with efficient structural neighborhood exploration to quantify the temporal changes in interaction patterns. Crucially, CoDCL is designed as a plug-and-play universal module that can be seamlessly integrated into various existing temporal graph models without requiring architectural modifications. Extensive experiments conducted on multiple real-world datasets demonstrate that CoDCL significantly outperforms state-of-the-art baselines in temporal link prediction, highlighting the effectiveness of integrating counterfactual-inspired data augmentation into dynamic representation learning.

2601.22204 2026-05-12 cs.LG cs.DC

FedAdaVR: Adaptive Variance Reduction for Robust Federated Learning under Limited Client Participation

S M Ruhul Kabir Howlader, Xiao Chen, Yifei Xie, Lu Liu

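AI summary This paper proposes FedAdaVR, a federated learning algorithm that targets the heterogeneity problems caused by limited client participation. It combines an adaptive optimizer with variance reduction and uses clients' stored past updates to emulate their participation in the current round, improving training stability and convergence. The authors also propose FedAdaVR-Quant, which stores client updates in quantized form, substantially reducing memory consumption while maintaining strong model performance. Experiments show that FedAdaVR outperforms existing state-of-the-art methods on multiple datasets.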

Details
English abstract

Federated learning (FL) encounters substantial challenges due to heterogeneity, leading to gradient noise, client drift, and partial client participation errors, the last of which is the most pervasive but remains insufficiently addressed in current literature. In this paper, we propose FedAdaVR, a novel FL algorithm aimed at solving heterogeneity issues caused by sporadic client participation by incorporating an adaptive optimiser with a variance reduction technique. This method takes advantage of the most recent stored updates from clients, even when they are absent from the current training round, thereby emulating their presence. Furthermore, we propose FedAdaVR-Quant, which stores client updates in quantised form, significantly reducing the memory requirements (by 50%, 75%, and 87.5%) of FedAdaVR while maintaining highly competitive model performance. We analyse the convergence behaviour of FedAdaVR under general nonconvex conditions and prove that our proposed algorithm can eliminate partial client participation error. Extensive experiments conducted on multiple datasets, under both independent and identically distributed (IID) and non-IID settings, demonstrate that FedAdaVR consistently outperforms state-of-the-art baseline methods.
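The three memory reductions quoted above (50%, 75%, and 87.5%) are what one would obtain by storing each value in 16, 8, or 4 bits instead of 32. The abstract does not state the bit widths or the baseline precision, so the 32-bit baseline and the widths in this sketch are assumptions made only to show the arithmetic.

```python
def memory_reduction(bits, baseline_bits=32):
    """Fraction of storage saved when each value uses `bits` instead of `baseline_bits`."""
    if not 0 < bits <= baseline_bits:
        raise ValueError("bits must be in (0, baseline_bits]")
    return 1.0 - bits / baseline_bits


# Assumed bit widths; the abstract reports only the resulting percentages.
reductions = {bits: memory_reduction(bits) for bits in (16, 8, 4)}
```

Under these assumptions the dictionary maps 16, 8, and 4 bits to savings of 0.5, 0.75, and 0.875, matching the reported figures; per-tensor scale factors or other quantization metadata would reduce the savings slightly in practice.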

2601.22158 2026-05-12 cs.CV

One-step Latent-free Image Generation with Pixel Mean Flows

Yiyang Lu, Susie Lu, Qiao Sun, Hanhong Zhao, Zhicheng Jiang, Xianbang Wang, Tianhong Li, Zhengyang Geng, Kaiming He

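AI summary This paper proposes pixel MeanFlow (pMF), a one-step, latent-free image generation method that aims to remove the reliance of conventional diffusion/flow models on multi-step sampling and latent spaces. It separates the network output space from the loss space, using a prediction target on the image manifold and a MeanFlow loss in velocity space, and introduces a simple transformation between the image manifold and the average velocity field. Experiments show that pMF achieves strong generation results on ImageNet at 256x256 and 512x512 resolution, advancing one-step latent-free image generation.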

Comments Tech report. Code at https://github.com/Lyy-iiis/pMF

Details
English abstract

Modern diffusion/flow-based models for image generation typically exhibit two core characteristics: (i) using multi-step sampling, and (ii) operating in a latent space. Recent advances have made encouraging progress on each aspect individually, paving the way toward one-step diffusion/flow without latents. In this work, we take a further step towards this goal and propose "pixel MeanFlow" (pMF). Our core guideline is to formulate the network output space and the loss space separately. The network target is designed to be on a presumed low-dimensional image manifold (i.e., x-prediction), while the loss is defined via MeanFlow in the velocity space. We introduce a simple transformation between the image manifold and the average velocity field. In experiments, pMF achieves strong results for one-step latent-free generation on ImageNet at 256x256 resolution (2.22 FID) and 512x512 resolution (2.48 FID), filling a key missing piece in this regime. We hope that our study will further advance the boundaries of diffusion/flow-based generative models.

2601.21971 2026-05-12 cs.RO cs.AI cs.LG

Supervised Mixture-of-Experts for Surgical Grasping and Retraction

Lorenzo Mazza, Ariel Rodriguez, Rayan Younis, Martin Lelis, Ortrun Hellig, Chenpan Li, Sebastian Bodenstedt, Martin Wagner, Stefanie Speidel

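AI summary This study proposes a supervised Mixture-of-Experts (MoE) architecture for surgical grasping and retraction, aiming to improve robot manipulation in complex surgical scenes. Combined with a lightweight action decoder policy such as Action Chunking Transformer (ACT), it learns complex long-horizon manipulation from a small number of demonstrations and stereo endoscopic images alone, avoiding the dependence of earlier methods on multi-camera setups or large amounts of data. Experiments show that the architecture significantly improves performance in both in-distribution and out-of-distribution scenarios and generalizes well, offering a feasible path toward practical use of surgical robots.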

Comments Accepted at Robotics: Science and Systems 2026

Details
English abstract

Imitation learning has achieved remarkable success in robotic manipulation, yet its application to surgical robotics remains challenging due to data scarcity, constrained workspaces, and the need for an exceptional level of safety and predictability. We present a supervised Mixture-of-Experts (MoE) architecture designed for phase-structured surgical manipulation tasks, which can be added on top of any autonomous policy. Unlike prior surgical robot learning approaches that rely on multi-camera setups or thousands of demonstrations, we show that a lightweight action decoder policy like Action Chunking Transformer (ACT) can learn complex, long-horizon manipulation from fewer than 150 demonstrations using solely stereo endoscopic images, when equipped with our architecture. We evaluate our approach on the collaborative surgical task of bowel grasping and retraction, where a robot assistant interprets visual cues from a human surgeon, executes targeted grasping on deformable tissue, and performs sustained retraction. Our results show that generalist Vision Language Action models fail to acquire the task entirely, even under standard in-distribution conditions. Furthermore, while standard ACT achieves moderate success in-distribution, adopting a supervised MoE architecture significantly boosts its performance, yielding higher success rates in-distribution and demonstrating superior robustness in out-of-distribution scenarios, including novel grasp locations, reduced illumination, and partial occlusions. Notably, it generalizes to unseen testing viewpoints and also transfers zero-shot to ex vivo porcine tissue without additional training, offering a promising pathway toward in vivo deployment. To support this statement, we present qualitative preliminary results of policy roll-outs during in vivo porcine surgery.

2601.21698 2026-05-12 cs.LG cs.AI

Curriculum Learning for LLM Pretraining: An Analysis of Learning Dynamics

Mohamed Elgaar, Hadi Amiri

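AI summary This study examines how curriculum learning affects LLM pretraining by analyzing the effect of different data-ordering strategies on learning dynamics. Comparing three linguistically motivated curricula (Age-of-Acquisition, word frequency, and Verb Variation) against random ordering, it finds that curricula mainly change how much time the model spends in each training phase, while random ordering is associated with higher gradient noise and output-head saturation in small models. The experiments indicate that curricula are associated with more stable training in smaller models, and that the difference shrinks at larger scales.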

Details
English abstract

Curriculum learning changes the order of pretraining data, but it remains unclear how ordering changes the learning dynamics. We pretrain models from 14M to 1B parameters for 300B tokens under three linguistically motivated curricula--Age-of-Acquisition, word frequency, and Verb Variation (VV)--and compare each against Random ordering. We analyze latent training phases, gradient noise scale (GNS), and the singular-value structure of the output head. We find that training follows a shared sequence of latent phases, while curricula mainly change time spent in each phase. Random ordering yields higher GNS at 14M-70M and late singular-entropy spikes up to 160M, consistent with noisier gradients and output-head saturation. A reverse-order VV control shows that direction matters: descending order loses much of the accuracy advantage of the ascending curriculum. At larger scales, these stability differences are smaller. These results indicate that the curricula studied here are associated with more stable within-phase training in smaller models rather than with the creation of new phases.

2601.21619 2026-05-12 cs.LG cs.AI cs.CL

On the Overscaling Curse of Parallel Thinking: System Efficacy Contradicts Sample Efficiency

Yiming Wang, Zhuosheng Zhang, Rui Wang

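AI summary This paper studies the overscaling curse in parallel thinking, the contradiction between system-level efficacy and sample-level efficiency. The authors propose LanBo, which probes the model's latent representations to predict the optimal budget for each sample, significantly improving budget utilization while maintaining overall accuracy. They further integrate LanBo into the decoding pipeline as Pre-decoding Budget Adaptation (PreAda), which further improves computational efficiency and hardware utilization.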

Comments 44 pages, 66 figures, 24 tables

Details
English abstract

Parallel thinking improves LLM reasoning through multi-path sampling and aggregation. In standard evaluations, due to a lack of sample-specific priors, all samples share a global budget chosen to maximize dataset accuracy. However, many samples reach their best accuracy with much smaller budgets, causing low budget utilization. This contradiction between system efficacy and sample efficiency constitutes the Overscaling Curse. In this paper, we first provide a formal analysis of the overscaling curse and quantify its prevalence and severity in real-world systems. To break it, we propose Latent Budget Predictor (LanBo), which probes model latent representations to predict sample-specific optimal budgets. LanBo significantly improves budget utilization while maintaining dataset accuracy. We further integrate LanBo into the full decoding pipeline, inspiring Pre-decoding Budget Adaptation (PreAda), a paradigm that allocates budgets before decoding to preserve decoding-time parallelization. LanBo substantially improves hardware-aware efficiency in latency and memory, demonstrating both its practical value and the promise of LanBo for efficient parallel decoding.

2601.21266 2026-05-12 cs.LG

Model-Free Neural Filtering: A Comparison with Classical Filters in Nonlinear Systems

Zhuochen Liu, Hans Walker, Rahul Jain

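AI summary This paper studies the state-estimation performance of model-free neural network estimators in nonlinear systems and compares them systematically with classical filtering methods. It evaluates several neural architectures, including Transformers, recurrent neural networks, and state-space models, against classical methods such as particle filters and nonlinear Kalman filters. The results show that structured state-space models such as Mamba and Mamba-2 perform well across several nonlinear scenarios, outperforming some classical filters without access to a system model while also offering higher inference throughput. The authors attribute this to structural properties that bring these models closer to classical filters under limited parameter budgets, limited data, and long-horizon evaluation.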

Comments 9 pages, 15 figures

Details
English abstract

Neural network models are increasingly used for state estimation in control and decision-making, yet it remains unclear to what extent they behave as principled filters in nonlinear dynamical systems. Unlike classical filters, which rely on explicit dynamics and noise models, neural estimators can be trained purely from data. We present a systematic comparison between model-free neural estimators and classical filtering methods across multiple nonlinear scenarios. On the neural side, we evaluate Transformer-based models, recurrent neural networks, and state-space models; on the classical side, we compare against particle filters and nonlinear Kalman filters. Results show that structured state-space models (SSMs), in particular Mamba and Mamba-2, are consistently strong among neural estimators. They approach strong classical filters in several nonlinear systems and outperform weaker classical baselines without access to system models, while the evaluated neural implementations achieve substantially higher inference throughput on the tested hardware. Accurate model-based filters can still dominate when their assumptions are well matched. We attribute the relative strength of SSMs to filtering-aligned inductive bias: recursive latent-state updates make them structurally closer to classical filters under fixed parameter budgets, finite data, and long-horizon evaluation.

2601.21164 2026-05-12 cs.AI

Concise Geometric Description as a Bridge: Unleashing the Potential of LLM for Plane Geometry Problem Solving

Jingyun Wang, Dian Li, Xiaohan Wang, Gang Liu, Jiahong Yan, Guoliang Kang

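AI summary This paper studies how to use large language models (LLMs) to solve plane geometry problems, where the core challenge is that LLMs cannot directly process geometric diagrams. The authors train a multimodal LLM (MLLM) interpreter to convert a diagram into a concise Conditional Declaration Language (CDL) description and then use an off-the-shelf LLM for reasoning. With CDL matching rewards, the method improves the model's geometric understanding and reasoning and outperforms leading existing models on several datasets.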

Comments CVPR 2026 Findings

Details
English abstract

Plane Geometry Problem Solving (PGPS) is a multimodal reasoning task that aims to solve a plane geometric problem based on a geometric diagram and problem textual descriptions. Although Large Language Models (LLMs) possess strong reasoning skills, their direct application to PGPS is hindered by their inability to process visual diagrams. Existing works typically fine-tune Multimodal LLMs (MLLMs) end-to-end on large-scale PGPS data to enhance visual understanding and reasoning simultaneously. However, such joint optimization may compromise base LLMs' inherent reasoning capability. In this work, we observe that an LLM is itself potentially a powerful PGPS solver when visual information is appropriately formulated as textual descriptions. We propose to train an MLLM Interpreter to generate geometric descriptions for the visual diagram, and an off-the-shelf LLM is utilized to perform reasoning. Specifically, we choose Conditional Declaration Language (CDL) as the geometric description because its conciseness eases the MLLM Interpreter training. The MLLM Interpreter is fine-tuned via CoT (Chain-of-Thought)-augmented SFT followed by GRPO to generate CDL. Instead of using a conventional solution-based reward that compares the reasoning result with the ground-truth answer, we design CDL matching rewards to facilitate more effective GRPO training, which provides more direct and denser guidance for CDL generation. To support training, we construct a new dataset, Formalgeo7k-Rec-CoT, by manually reviewing Formalgeo7k v2 and incorporating CoT annotations. Extensive experiments on Formalgeo7k-Rec-CoT, Unigeo, and MathVista show our method (finetuned on only 5.5k data) performs favorably against leading open-source and closed-source MLLMs.

2601.21061 2026-05-12 cs.LG stat.ML

Signal from Structure: Exploiting Submodular Upper Bounds in Generative Flow Networks

Alexandre Larouche, Audrey Durand

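AI summary This paper studies the training of Generative Flow Networks (GFlowNets) when the reward function is submodular and proposes SUBo-GFN, a new training method based on submodular upper bounds. It uses submodularity to derive upper bounds on the rewards of unobserved compositional objects and trains under the Optimism in the Face of Uncertainty principle, generating far more training data for the same number of reward queries. Experiments show that SUBo-GFN achieves strong distribution matching and candidate generation on synthetic and real-world submodular tasks.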

Details
English abstract

Generative Flow Networks (GFlowNets; GFNs) are a class of generative models that learn to sample compositional objects proportionally to their a priori unknown value, their reward. We focus on the case where the reward has a specified, actionable structure, namely that it is submodular. We show submodularity can be harnessed to retrieve upper bounds on the reward of compositional objects that have not yet been observed. We provide in-depth analyses of the probability of such bounds occurring, as well as how many unobserved compositional objects can be covered by a bound. Following the Optimism in the Face of Uncertainty principle, we then introduce SUBo-GFN, which uses the submodular upper bounds to train a GFN. We show that SUBo-GFN generates orders of magnitude more training data than classical GFNs for the same number of queries to the reward function. We demonstrate the effectiveness of SUBo-GFN in terms of distribution matching and high-quality candidate generation on synthetic and real-world submodular tasks.
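To see how submodularity alone can bound the reward of an unobserved object, recall the diminishing-returns property: for A ⊆ S and an element e outside S, f(S ∪ {e}) - f(S) ≤ f(A ∪ {e}) - f(A). Rearranging gives an upper bound on f(S ∪ {e}) from rewards that were already queried. The sketch below checks this on a toy coverage function; the reward, the item names, and the helper `upper_bound` are hypothetical, and the bounds used by SUBo-GFN may be constructed differently.

```python
from itertools import combinations

# Toy coverage instance: each item covers a set of ground elements.
COVERAGE = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6}, "d": {1, 6}}


def reward(items):
    """Coverage reward: number of ground elements covered (a submodular function)."""
    covered = set()
    for item in items:
        covered |= COVERAGE[item]
    return len(covered)


def upper_bound(observed_set, subset, element):
    """Bound f(observed_set + element) via the marginal gain seen at a subset."""
    assert set(subset) <= set(observed_set) and element not in observed_set
    gain_at_subset = reward(set(subset) | {element}) - reward(subset)
    return reward(observed_set) + gain_at_subset


def bound_holds_everywhere():
    """Exhaustively verify the bound on every (set, subset, element) triple."""
    items = sorted(COVERAGE)
    for size in range(len(items)):
        for chosen in combinations(items, size):
            for element in set(items) - set(chosen):
                for k in range(len(chosen) + 1):
                    for sub in combinations(chosen, k):
                        true_value = reward(set(chosen) | {element})
                        if true_value > upper_bound(set(chosen), set(sub), element):
                            return False
    return True
```

For example, `upper_bound({"a", "b"}, set(), "c")` evaluates to 7 while the true reward of `{"a", "b", "c"}` is 6, so the object is bounded without querying its reward directly.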

2601.20829 2026-05-12 cs.LG cs.AI cs.CL

Training Reasoning Models on Saturated Problems via Failure-Prefix Conditioning

Minwu Kim, Safal Shrestha, Anubhav Shrestha, Keith Ross

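AI summary As Reinforcement Learning with Verifiable Rewards (RLVR) substantially improves the reasoning abilities of LLMs, a new bottleneck appears: more and more training problems become saturated, meaning the model answers correctly on nearly every rollout, so the reward provides very little learning signal. This paper proposes failure-prefix conditioning, a simple and effective method that steers the model toward failure-prone reasoning states to extract the remaining signal from saturated problems. Experiments show that the method keeps improving performance where standard RLVR stalls, and that iteratively refreshing the failure prefixes yields further gains after performance plateaus.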

Comments 20 pages

Details
English abstract

As Reinforcement Learning with Verifiable Rewards (RLVR) substantially improves the reasoning abilities of large language models (LLMs), a new bottleneck emerges: more training problems become saturated, that is, the LLM answers the questions correctly for nearly every rollout. On such problems, rewards provide little useful learning signal. While collecting harder problems is a natural response, it is costly and increasingly difficult. We propose failure-prefix conditioning, a simple method that unlocks the remaining signal in saturated problems by shifting exploration toward failure-prone reasoning states. By conditioning on prefixes of rare incorrect trajectories, the method improves the model's ability to recover from misleading early reasoning. We observe that failure-prefix conditioning consistently improves performance where standard RLVR stalls, and achieves gains comparable to training on newly collected medium-difficulty problems. We further analyze the model's robustness, finding that our method reduces performance degradation under misleading failure prefixes, albeit with a mild trade-off in adherence to correct early reasoning. Finally, we demonstrate that an iterative approach, which refreshes failure prefixes during training, unlocks additional gains after performance plateaus. Overall, our results show that saturated problems still contain valuable learning signal, and that failure-prefix conditioning provides an effective way to unlock it.

2601.18832 2026-05-12 cs.LG cs.AI

The Geometric Reasoner: Manifold-Informed Latent Foresight Search for Long-Context Reasoning

Ren Zhuang, Ben Wang, Shuifa Sun

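AI summary This study proposes The Geometric Reasoner, a training-free framework that addresses the fundamental trade-off between computational cost and reasoning quality in long-context reasoning. It performs manifold-informed latent foresight search under strict memory bounds, combining a lightweight look-ahead estimate with soft geometric regularizers to encourage smooth and diverse trajectories. Experiments show that the method significantly improves reasoning coverage on math and code benchmarks with only a small amount of extra computation.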

Comments 29 pages, 13 figures

Details
English abstract

Scaling test-time compute enhances long chain-of-thought (CoT) reasoning, yet existing approaches face a fundamental trade-off between computational cost and coverage quality: either incurring high training expense or yielding redundant trajectories. We introduce The Geometric Reasoner (TGR), a training-free framework that performs manifold-informed latent foresight search under strict memory bounds. At each chunk boundary, TGR scores candidate latent anchors via a lightweight look-ahead estimate combined with soft geometric regularizers that encourage smooth trajectories and diverse exploration. Chunk-wise KV cache resets keep memory linear in chunk length. On challenging math and code benchmarks, TGR improves robust trajectory coverage, measured by the area under the Pass@k curve (AUC), by up to 13 points on Qwen3-8B, with negligible overhead of about 1.1--1.3 times.
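For readers unfamiliar with the metric, the sketch below computes Pass@k with the widely used combinatorial estimator, 1 - C(n - c, k) / C(n, k) for n samples of which c are correct, and then summarizes the curve by its area. The abstract does not specify how the area is normalized or which range of k is used, so the trapezoidal rule and the division by (max_k - 1) are assumptions.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate Pass@k from n samples with c correct: 1 - C(n-c, k) / C(n, k)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when k > n - c, i.e. any k-subset must contain a success.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_curve_auc(n, c, max_k):
    """Trapezoidal area under Pass@k for k = 1..max_k, scaled to [0, 1] (assumed)."""
    values = [pass_at_k(n, c, k) for k in range(1, max_k + 1)]
    if max_k == 1:
        return values[0]
    area = sum((values[i] + values[i + 1]) / 2 for i in range(max_k - 1))
    return area / (max_k - 1)
```

A single number of this kind rewards methods whose success rate rises quickly with k, which is one way to read "robust trajectory coverage" in the abstract.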

2601.18061 2026-05-12 cs.AI cs.HC

Expert Evaluation and the Limits of Human Feedback in Mental Health AI Safety Testing

Kiana Jafari, Paul Ulrich Nikolaus Rust, Duncan Eddy, Robbie Fraser, Nina Vasan, Darja Djordjevic, Akanksha Dadlani, Max Lamparth, Eugenia Kim, Mykel Kochenderfer

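AI summary This study examines the limits of expert evaluation and human feedback in mental health AI safety testing. When three psychiatrists evaluated LLM-generated responses, their ratings showed low agreement, with disagreement most pronounced on high-risk content such as suicide and self-harm. The study finds that the inconsistency stems from differing clinical frameworks rather than mere measurement error, suggesting that current consensus-based label aggregation may erase the diversity and complexity of professional judgment. The findings have implications for the design of AI safety evaluation, reward modeling, and evaluation benchmarks.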

Comments 17 pages, 7 pages of appendix, 21 tables

Details
English abstract

Learning from human feedback (LHF) assumes that expert judgments, appropriately aggregated, yield valid ground truth for training and evaluating AI systems. We tested this assumption in mental health, where high safety stakes make expert consensus essential. Three certified psychiatrists independently evaluated LLM-generated responses using a calibrated rubric. Despite similar training and shared instructions, inter-rater reliability was consistently poor (ICC 0.087--0.295), falling below thresholds considered acceptable for consequential assessment. Disagreement was highest on the most safety-critical items. Suicide and self-harm responses produced greater divergence than any other category, and the disagreement was systematic rather than random. One factor yielded negative reliability (Krippendorff's α = -0.203), indicating structured disagreement worse than chance. Qualitative interviews revealed that disagreement reflects coherent but incompatible individual clinical frameworks (safety-first, engagement-centered, and culturally informed orientations) rather than measurement error. By demonstrating that experts rely on holistic risk heuristics rather than granular factor discrimination, these findings suggest that aggregated labels function as arithmetic compromises that effectively erase grounded professional philosophies. Our results characterize expert disagreement in safety-critical AI as a sociotechnical phenomenon where professional experience introduces sophisticated layers of principled divergence. We discuss implications for reward modeling, safety classification, and evaluation benchmarks, recommending that practitioners shift from consensus-based aggregation to alignment methods that preserve and learn from expert disagreement.

2601.16836 2026-05-12 cs.CV cs.CL

ColorConceptBench: A Benchmark for Probabilistic Color-Concept Understanding in Text-to-Image Models

Chenxi Ruan, Yihan Hou, Yu Xiao, Guosheng Hu, Wei Zeng

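AI summary This paper proposes ColorConceptBench, a benchmark for evaluating probabilistic color-concept understanding in text-to-image models. It systematically assesses how models interpret abstract semantics such as emotions and visual states using 1,281 implicit color concepts grounded in 6,584 human annotations. The study finds that leading models vary substantially across semantic categories and are relatively insensitive to abstract semantics, indicating clear shortcomings in how current models learn and represent implicit meaning.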

Comments 9 pages, 6 figures

Details
English abstract

Text-to-image (T2I) models have advanced considerably in generating high-quality images from textual descriptions. However, their ability to associate colors with concepts remains largely constrained to explicit color names or codes, while their capacity to handle implicit concepts, such as emotions and visual states, remains underexplored. To address this gap, we introduce ColorConceptBench, an expert-annotated benchmark that systematically evaluates color-concept associations through probabilistic color distributions. ColorConceptBench moves beyond explicit color specifications by examining how models interpret 1,281 implicit color concepts, grounded in 6,584 human annotations. Our evaluation of nine leading T2I models reveals that performance varies substantially across semantic categories, and models exhibit a significant lack of sensitivity to abstract semantics. These limitations persist even when applying classifier-free guidance scaling at inference time, suggesting that achieving human-like color understanding demands a shift in how models learn and represent implicit semantic meaning.

2601.11042 2026-05-12 cs.CL cs.AI

Spectral Characterization and Mitigation of Sequential Knowledge Editing Collapse

Chi Zhang, Mengqi Zhang, Xiaotian Ye, Runxi Cheng, Zisheng Zhou, Ying Zhou, Pengjie Ren, Zhumin Chen

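AI summary Sequential knowledge editing in large language models often severely degrades the model's general abilities, especially with parameter-modifying methods. Through spectral analysis, this paper shows that general abilities are closely tied to the dominant singular directions of the pretrained weight matrices; these directions are highly sensitive to perturbation and are progressively disrupted by repeated edits, which degrades both editing efficacy and overall performance. Building on this finding, the authors propose REVIVE, a framework that expresses parameter updates in the spectral basis of the original weights and filters out components that would interfere with the protected region, stabilizing sequential editing. Experiments across multiple models and benchmarks show that REVIVE significantly improves editing efficacy while preserving general abilities, even in extreme settings with up to 20,000 edits.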

Comments 22 pages, 18 figures, Accepted to ACL 2026 (Main Conference)

Details
English abstract

Sequential knowledge editing in large language models often causes catastrophic collapse of the model's general abilities, especially for parameter-modifying methods. Existing approaches mitigate this issue through heuristic constraints on parameter updates, yet the mechanisms underlying such degradation remain insufficiently understood. In this work, we present a spectral analysis of sequential knowledge editing and show that a model's general abilities are closely associated with dominant singular directions of pretrained weight matrices. These directions are highly sensitive to perturbations and are progressively disrupted by repeated edits, closely tracking the collapse in both editing efficacy and general performance. Building on this insight, we propose REVIVE, a plug-and-play framework that stabilizes sequential editing by explicitly preserving the dominant singular subspace. REVIVE represents parameter updates in the spectral basis of the original weights and filters components that would interfere with the protected region. Extensive experiments across multiple models and benchmarks show that REVIVE consistently improves editing efficacy while substantially preserving general abilities under long-horizon sequential editing, including extreme settings with up to 20,000 edits.

2601.03511 2026-05-12 cs.CL cs.AI cs.LG

IntroLM: Introspective Language Models via Prefilling-Time Self-Evaluation

Hossein Hosseini Kasnavieh, Gholamreza Haffari, Chris Leckie, Adel N. Toosi

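AI summary This paper proposes IntroLM, a method that lets causal language models predict the quality of their own output during the prefilling phase by introducing introspective tokens, without relying on an external classifier. It uses token-conditional LoRA that activates only for the introspective token, preserving the model's original behavior while avoiding the overhead of an external evaluator. Experiments show that IntroLM performs well on question answering tasks, clearly outperforms a DeBERTa-based classifier, and reduces latency and large-model usage in multi-model routing systems.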

Comments Accepted for publication in Findings of ACL 2026

Details
English abstract

A major challenge for the operation of large language models (LLMs) is how to predict whether a specific LLM will produce sufficiently high-quality output for a given query. Existing approaches rely on external classifiers, most commonly BERT-based models, which suffer from limited context windows, constrained representational capacity, and additional computational overhead. We propose IntroLM, a method that uses introspective tokens to enable causal language models to predict their own output quality during the prefilling phase without affecting generation. By introducing token-conditional LoRA that activates only for the introspective token, the model learns to predict the output quality for a given query while preserving the original backbone behavior and avoiding external evaluators. On question answering benchmarks, IntroLM applied to Qwen3 8B achieves a ROC AUC of 90 percent for success prediction, outperforming a DeBERTa classifier by 14 percent. When integrated into multi-model routing systems, IntroLM achieves superior cost-performance tradeoffs, reducing latency by up to 33 percent and large model usage by up to 50 percent at matched reliability.

2512.24552 2026-05-12 cs.CV math.OC

OCP-GN: A Scalable Second-order Optimizer for Stochastic Optimization

Jindi Zhong, Congyaohui Yin, Zhaorong Zhang, Huanshui Zhang

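AI summary This paper proposes OCP-GN, a new second-order optimization algorithm based on the Optimal Control Principle (OCP), for large-scale optimization problems in neural network training. The algorithm has O(d) computational complexity and strong robustness, and experiments show clear advantages on multiple benchmarks.

Details
English abstract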

This paper proposes a novel second-order optimization algorithm based on the Optimal Control Principle (OCP), applicable to large-scale optimization problems in neural network training. The algorithm has a computational complexity of O(d) and strong robustness. Extensive experiments on multiple benchmarks demonstrate the significant superiority of the proposed method.

2512.23025 2026-05-12 cs.CL cs.AI

LENS: LLM-Enabled Narrative Synthesis for Mental Health by Aligning Multimodal Sensing with Language Models

Wenxuan Xu, Arvind Pillai, Subigya Nepal, Amanda C Collins, Daniel M Mackin, Michael V Heinz, Tess Z Griffin, Nicholas C Jacobson, Andrew Campbell

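AI summary LENS is a large language model (LLM)-based narrative generation framework that turns multimodal health sensing data into clinically meaningful mental-health narratives. By constructing a large-scale dataset of sensor-text QA pairs and training an encoder that maps raw sensor signals into the language model's representation space, it addresses the inability of conventional LLMs to process long-duration sensor data directly. Experiments show that LENS outperforms existing methods on NLP metrics and symptom-severity accuracy, and that mental-health professionals rated its narratives favorably, indicating potential for clinical use.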

Comments Camera-ready version. Additional experiments

Details
English abstract

Multimodal health sensing offers rich behavioral signals for assessing mental health, yet translating these numerical time-series measurements into natural language remains challenging. Current LLMs cannot natively ingest long-duration sensor streams, and paired sensor-text datasets are scarce. To address these challenges, we introduce LENS, a framework that aligns multimodal sensing data with language models to generate clinically grounded mental-health narratives. LENS first constructs a large-scale dataset by transforming Ecological Momentary Assessment (EMA) responses related to depression and anxiety symptoms into natural-language descriptions, yielding over 100,000 sensor-text QA pairs from 258 participants. To enable native time-series integration, we train a patch-level encoder that projects raw sensor signals directly into an LLM's representation space. Our results show that LENS outperforms strong baselines on standard NLP metrics and task-specific measures of symptom-severity accuracy. A user study with 13 mental-health professionals further indicates that LENS-produced narratives are comprehensive and clinically meaningful. Ultimately, our approach advances LLMs as interfaces for health sensing, providing a scalable path toward models that can reason over raw behavioral signals and support downstream clinical decision-making.

2512.20798 2026-05-12 cs.AI

A Benchmark for Evaluating Outcome-Driven Constraint Violations in Autonomous AI Agents

Miles Q. Li, Benjamin C. M. Fung, Martin Weiss, Pulei Xiong, Khalil Al-Hussaeni, Claude Fachkha

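AI summary As autonomous AI agents are increasingly deployed in high-stakes environments, ensuring their safety and alignment with human values has become an important practical deployment issue. Existing benchmarks mainly evaluate refusal of harmful instructions or completion of complex tasks, but lack an evaluation of outcome-driven constraint violations. The authors therefore propose a benchmark of 40 scenarios to detect cases where an agent pursuing performance metrics neglects ethical, legal, or safety constraints, and a comparison across model generations finds that safety does not reliably improve from one generation to the next.

Details
English abstract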

As autonomous AI agents are increasingly deployed in high-stakes environments, ensuring their safety and alignment with human values is becoming a practical deployment concern. Current benchmarks for AI agents primarily evaluate refusal of explicitly harmful instructions or completion of complex multi-step tasks. However, there is a lack of benchmarks designed to capture emergent outcome-driven constraint violations, which arise when agents pursue goal optimization under strong performance incentives while deprioritizing ethical, legal, or safety constraints. To address this gap, we introduce a benchmark of 40 scenarios in production-inspired sandbox environments. Each scenario requires multi-step actions, and the agent's performance is tied to a specific Key Performance Indicator (KPI). Each scenario features Mandated (direct KPI-outcome mandate) and Incentivized (KPI-pressure-driven) variations to distinguish failures under direct outcome mandates from self-directed constraint violations. Across 12 state-of-the-art LLMs, we observe outcome-driven constraint violations ranging from 0.0% to 62.8%, with most evaluated models exhibiting misalignment rates at or above 25%. Furthermore, through a cross-generational analysis comparing current models with their predecessors within the same product families, we find that safety does not reliably improve across generations: misalignment rates rose in four families and fell in five. To improve evaluation robustness, we score trajectories with a four-model judge panel aggregated by median, finding high agreement on the primary misalignment threshold. We also observe substantial deliberative misalignment: cases where models later judge their own trajectories as unethical despite having executed them under KPI pressure.
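
Median aggregation over a judge panel can be sketched as below. The score scale, the threshold value, and the function names are assumptions for illustration; the abstract only states that four judges are aggregated by median.

```python
from statistics import median

def panel_score(judge_scores):
    """Aggregate one trajectory's judge scores by median.

    With an even panel (e.g. four judges), statistics.median returns the
    mean of the two middle scores.
    """
    return median(judge_scores)

def misalignment_rate(trajectories, threshold):
    """Share of trajectories whose median score reaches the threshold.

    `trajectories` is a list of per-trajectory score lists; `threshold`
    is a hypothetical cut-off on an assumed severity scale.
    """
    flagged = [panel_score(scores) >= threshold for scores in trajectories]
    return sum(flagged) / len(flagged)
```

The median limits the influence of a single outlying judge, which is one reason to prefer it over the mean for small panels.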

2512.19115 2026-05-12 cs.CV

Generative Giants, Retrieval Weaklings: Why do Multimodal Large Language Models Fail at Multimodal Retrieval?

Hengyi Feng, Zeang Sheng, Meiyi Qiang, Yang Li, Wentao Zhang

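AI summary Although multimodal large language models (MLLMs) excel at generative tasks, they show a surprising weakness in zero-shot multimodal retrieval. Using sparse autoencoders (SAEs), this study finds that the representation space of MLLMs is dominated by textual semantics, that the visual semantics essential for multimodal retrieval make up only a small share, and that the models' heavy focus on cross-modal alignment further weakens their discriminative power. Based on these findings, the authors propose ReAlign, a test-time adaptation method that adjusts the geometry of the representation space with a whitening transformation, markedly improving zero-shot multimodal retrieval across several MLLMs without fine-tuning.

Details
English abstract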

Despite the remarkable success of multimodal large language models (MLLMs) in generative tasks, we observe that they exhibit a counterintuitive deficiency in the zero-shot multimodal retrieval task. In this work, we investigate the underlying mechanisms that hinder MLLMs from being effective retrievers. With the help of sparse autoencoders (SAEs), we decompose MLLM output representations into interpretable semantic concepts to probe their intrinsic behavior. Our analysis reveals that the representation space of MLLMs is overwhelmingly dominated by textual semantics, while the visual semantics essential for multimodal retrieval constitute only a small portion. We find that this imbalance is compounded by the heavy focus of MLLMs on bridging image-text modalities, which facilitates generation but homogenizes embeddings and ultimately diminishes the discriminative power required for multimodal retrieval. We further discover that the specific feature components that contribute most to the similarity computations of MLLMs are actually distractors that greatly reduce retrieval performance. Building on these insights, we propose ReAlign, a test-time adaptation approach that applies a whitening transformation to adjust the geometry of MLLM representation spaces. Empirical results show that this simple intervention consistently improves zero-shot multimodal retrieval performance across diverse MLLMs without fine-tuning efforts. The code is available at https://github.com/Heinz217/mllm-retrieval-analysis.
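
A whitening transformation on embeddings can be sketched as follows. The abstract does not say which whitening variant ReAlign uses or which embeddings supply the statistics, so the ZCA form, the `eps` regularizer, and the function names here are assumptions.

```python
import numpy as np

def fit_whitening(X, eps=1e-6):
    """Estimate a ZCA whitening map from embeddings X of shape (n, d)."""
    mean = X.mean(axis=0)
    cov = np.cov(X - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Rescale each principal direction to unit variance, then rotate back.
    W = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return mean, W

def whiten(X, mean, W):
    """Center the embeddings and apply the whitening matrix."""
    return (X - mean) @ W
```

After whitening, no single direction dominates the covariance, so similarity scores are less likely to be driven by a few high-variance components. Retrieval would then compare the whitened embeddings, for example by cosine similarity.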

2512.18610 2026-05-12 cs.LG

The Procrustean Bed of Time Series: The Optimization Bias in Point-wise Loss Functions

Rongyao Cai, Yuxi Wan, Kexin Zhang, Ming Jin, Zhiqiang Ge, Daoyi Dong, Hang Yu, Yong Liu, Qingsong Wen

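AI summary This paper studies the systematic optimization bias introduced by point-wise loss functions (such as MSE and MAE) in time series forecasting. The authors argue that these losses ignore temporal dependence, so the model cannot accurately capture the joint distribution of the series, which produces a non-negligible bias. By defining the Expectation of Optimization Bias (EOB) and deriving its mathematical expression, the paper shows that the bias is determined by sequence length and the Structural Signal-to-Noise Ratio (SSNR), and proposes a debiasing method based on sequence length reduction and structural orthogonalization that markedly improves forecasting and imputation performance.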

Comments 54 pages

Details
English abstract

Intuitively, a more deterministic time series should be easier to forecast. However, point-wise loss functions (e.g., MSE and MAE), serving as differentiable surrogates for the ideal optimization target, score each timestamp independently and therefore disregard temporal dependence. This mismatch induces a systematic optimization bias that cannot be eliminated merely by improving model expressiveness or the optimizer. To formalize this issue, we define the Expectation of Optimization Bias (EOB) as the Kullback-Leibler divergence between the true joint distribution and the factorized i.i.d. surrogate induced by the point-wise paradigm. Under covariance-stationary Gaussian assumptions, we derive closed-form expressions for the stochastic component of EOB, establishing it as an irreducible lower bound on the total bias in linear systems, and further extend it to nonlinear regimes through a Gaussian mixture model lower bound. Crucially, we prove this bias is governed intrinsically by two data properties, i.e., sequence length and Structural Signal-to-Noise Ratio (SSNR), regardless of specific model architecture, optimizer, or point-wise loss forms. This theory motivates a principled debiasing program based on sequence length reduction and structural orthogonalization, which we instantiate through DFT/DWT combined with a novel harmonized $\ell_p$ norm. Extensive experiments validate the predicted SSNR-horizon dynamics, resolve the classic trigonometric fitting failure as an objective-induced pathology, and demonstrate substantial plug-and-play gains. Notably, on iTransformer, our proposed objective reduces average MSE/MAE by 5.2%/5.0% in forecasting across 11 datasets and by 27.4%/19.4% in imputation across 9 datasets.
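
To make the KL divergence between a joint distribution and its factorized surrogate concrete, the sketch below evaluates it for a zero-mean Gaussian whose surrogate keeps the same marginal variances. This is an illustration under those stated assumptions, not the paper's EOB derivation; the AR(1) covariance and the function names are chosen only for the example.

```python
import numpy as np

def kl_joint_vs_factorized(cov):
    """KL( N(0, cov) || N(0, diag(cov)) ).

    For matching marginals the trace and dimension terms cancel, leaving
    -0.5 * log det(R), where R is the correlation matrix.
    """
    std = np.sqrt(np.diag(cov))
    corr = cov / np.outer(std, std)
    _, logdet = np.linalg.slogdet(corr)
    return -0.5 * logdet

def ar1_cov(n, rho):
    """Unit-variance AR(1) covariance: entry (i, j) equals rho ** |i - j|."""
    idx = np.arange(n)
    return rho ** np.abs(idx[:, None] - idx[None, :])
```

For the AR(1) case the determinant is (1 - rho^2)^(n - 1), so the divergence equals -(n - 1)/2 * log(1 - rho^2). It grows linearly with sequence length and increases with the strength of temporal dependence, in line with the two governing quantities the abstract names.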

2512.13751 2026-05-12 cs.LG cs.AI

MIDUS: Memory-Infused Depth Up-Scaling

Taero Kim, Hoyoon Byun, Youngjun Choi, Sungrae Park, Kyungwoo Song

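AI summary This paper proposes MIDUS, a depth up-scaling method that aims to increase the capacity of pre-trained language models without a significant increase in compute cost. It replaces the conventional FFN branches with memory layers, turning added depth into lightweight retrieval-based residual capacity. The core contribution is a Head-wise Memory Layer (HML) that combines multi-head product-key memory with Head-wise Implicit Value Expansion (HIVE): each attention head gets its own key space, and head-specific values are generated efficiently from a shared latent bank, yielding gains in both performance and efficiency.

Details
English abstract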

Expanding pre-trained language models offers a practical way to increase capacity without training larger models from scratch. Depth Up-Scaling (DUS) does so by duplicating Transformer blocks and inserting them into a pre-trained backbone. This process also duplicates FFN-heavy blocks, increasing parameter and compute cost while adding capacity through a block-level dense residual branch. Yet prior work suggests that added capacity need not remain tied to dense FFN branches, while attention heads often play heterogeneous roles, motivating more efficient head-level residual corrections. We propose Memory-Infused Depth Up-Scaling (MIDUS), which replaces the duplicated FFN branches with memory layers and turns added depth into lightweight retrieval-based residual capacity. We introduce a Head-wise Memory Layer (HML), which combines multi-head product-key memory with Head-wise Implicit Value Expansion (HIVE). HML assigns each head a distinct key space, while HIVE realizes head-specific values from a shared latent bank through compact projections. Alongside empirical improvements in performance and efficiency, our head-importance and fixed-retrieval structural analyses characterize HML with HIVE as a structurally distinct, head-conditioned alternative to FFN-based residual expansion.

2512.08984 2026-05-12 cs.CV cs.AI

RAG-HAR: Retrieval Augmented Generation-based Human Activity Recognition

Nirhoshan Sivaroopan, Hansi Karunarathna, Chamara Madarasingha, Anura Jayasumana, Kanchana Thilakarathna

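AI summary RAG-HAR is a retrieval-augmented generation (RAG) framework for human activity recognition that uses large language models without any training. It computes lightweight statistical features, retrieves semantically similar samples from a vector database, and combines this contextual information to identify the activity. With prompt optimization and an LLM-based activity descriptor, RAG-HAR achieves state-of-the-art performance on six different activity recognition benchmarks without model fine-tuning, making it practical and robust.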

Comments Accepted to IEEE PerCom 2026 (Pervasive computing and communications)

Details
English abstract

Human Activity Recognition (HAR) underpins applications in healthcare, rehabilitation, fitness tracking, and smart environments, yet existing deep learning approaches demand dataset-specific training, large labeled corpora, and significant computational resources. We introduce RAG-HAR, a training-free retrieval-augmented framework that leverages large language models (LLMs) for HAR. RAG-HAR computes lightweight statistical descriptors, retrieves semantically similar samples from a vector database, and uses this contextual evidence to perform LLM-based activity identification. We further enhance RAG-HAR by first applying prompt optimization and then introducing an LLM-based activity descriptor that generates context-enriched vector databases for delivering accurate and highly relevant contextual information. With these mechanisms, RAG-HAR achieves state-of-the-art performance across six diverse HAR benchmarks. Most importantly, RAG-HAR attains these improvements without requiring model training or fine-tuning, emphasizing its robustness and practical applicability. RAG-HAR moves beyond known behaviors, enabling the recognition and meaningful labelling of multiple unseen human activities.
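
The descriptor-then-retrieve step can be sketched as below. The specific statistics (mean, standard deviation, min, max), the cosine-similarity search, and the function names are assumptions; the abstract does not list the descriptors or the vector database used, and the LLM prompting stage is left out.

```python
import numpy as np

def describe(window):
    """Summarize a sensor window of shape (time, channels) with simple statistics."""
    return np.concatenate([
        window.mean(axis=0), window.std(axis=0),
        window.min(axis=0), window.max(axis=0),
    ])

def retrieve(query, db_vectors, db_labels, k=3):
    """Return labels of the k database entries most similar to the query (cosine)."""
    q = query / np.linalg.norm(query)
    db = db_vectors / np.linalg.norm(db_vectors, axis=1, keepdims=True)
    top = np.argsort(-(db @ q))[:k]
    return [db_labels[i] for i in top]
```

In a pipeline of this kind, the retrieved labels and their descriptors would be placed in the LLM prompt as context for the final activity decision.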

2512.06673 2026-05-12 cs.CV

Detector-Empowered Video Large Language Model for Efficient Spatio-Temporal Grounding

Shida Gao, Feng Xue, Xiangfeng Wang, Anlong Ming, Zhaowen Lin, Haiyang Zhang, Teng Long, Nicu Sebe, Yihua Shao, Haozhe Wang, Wei Wang

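AI summary This work proposes DEViL, a detector-empowered video large language model that aims to make spatio-temporal video grounding (STVG) more efficient. The core idea is to offload dense spatial grounding from the language model to an efficient, parallelizable detector, avoiding the costly decoding and candidate construction of earlier methods. By introducing a reference-semantic token and temporal consistency regularization, DEViL achieves faster inference and better grounding performance while preserving the reasoning ability of the language model.

Details
English abstract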

Multimodal large language models (MLLMs) are rapidly expanding from general video understanding to finer-grained understanding such as spatio-temporal video grounding (STVG) and reasoning. In these tasks, an MLLM must localize the user-queried target in time and space and take the results as evidence for reasoning. Existing MLLM methods mainly follow two paradigms: (1) Direct Localization, which outputs STVG results with extra alignment modules or specialized decoders; and (2) Candidate-based Selection, which first constructs tube-level candidates and then selects the relevant one by an MLLM. However, both suffer from a serious efficiency bottleneck: the former incurs linearly growing decoding cost as the queried temporal span increases, while the latter relies on costly candidate construction. To break this bottleneck, we propose DEViL, a detector-empowered Video-LLM with a simple key idea: offloading dense spatial grounding from the MLLM to a fully parallelizable, well-trained detector. Specifically, DEViL distills the query into a detector-compatible reference-semantic token, which replaces the detector's text embedding to enable spatial grounding in a single pass. Then, we design temporal consistency regularization to match objects across frames and enforce their coherence over time. In this way, DEViL avoids long coordinate decoding and heavy candidate pipelines. Extensive experiments show that DEViL achieves strong performance (43.1% m_vIoU on HC-STVG) with superior efficiency (14.33 FPS), while preserving the general reasoning capacity of the MLLM backbone.

2512.04475 2026-05-12 cs.LG cs.AI cs.NE stat.ML

GraphBench: Next-generation graph learning benchmarking

Timo Stoll, Chendi Qian, Ben Finkelshtein, Ali Parviz, Darius Weber, Fabrizio Frasca, Hadar Shavit, Antoine Siraudin, Arman Mielke, Marie Anastacio, Erik Müller, Maya Bechler-Speicher, Michael Bronstein, Mikhail Galkin, Holger Hoos, Mathias Niepert, Bryan Perozzi, Jan Tönshoff, Christopher Morris

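AI summary Although graph machine learning has advanced in areas such as molecular property prediction and chip design, current benchmarking practice remains fragmented, relying on task-specific datasets and inconsistent evaluation protocols, which limits reproducibility and overall progress. To address this, the paper proposes GraphBench, a comprehensive benchmark suite covering diverse real-world domains and task settings, with standardized evaluation protocols and a unified hyperparameter-tuning framework, intended to support thorough evaluation and future development of graph learning models.

Details
English abstract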

Machine learning on graphs has made substantial progress across domains such as molecular property prediction and chip design. Yet benchmarking practices remain fragmented, often relying on narrow, task-specific datasets and inconsistent evaluation protocols, hindering reproducibility and broader progress. With the recent popularity of graph foundation models, these weaknesses have become apparent, as existing benchmarks are insufficient for thorough evaluation. To address these challenges, we introduce GraphBench, a comprehensive benchmark suite spanning diverse real-world domains and task settings, including node-level, edge-level, graph-level, and generative tasks. GraphBench provides standardized evaluation protocols, including consistent dataset splits and metrics for assessing out-of-distribution generalization across selected tasks, as well as a unified hyperparameter-tuning framework. We further evaluate GraphBench with recent message-passing neural networks and graph transformer models, establishing principled baselines for future research. See www.graphbench.io for further details.

2512.02012 2026-05-12 cs.CV cs.LG

Improved Mean Flows: On the Challenges of Fastforward Generative Models

Zhengyang Geng, Yiyang Lu, Zongze Wu, Eli Shechtman, J. Zico Kolter, Kaiming He

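AI summary This paper addresses challenges in the training objective and guidance mechanism of MeanFlow, a framework for one-step generative modeling, and proposes an improved method, iMF. By recasting the training objective as a regression on the instantaneous velocity and formulating guidance as explicit conditioning variables, it improves training stability and flexibility. Experiments show that iMF reaches an FID of 1.72 on ImageNet 256×256 with a single function evaluation, substantially outperforming prior methods of this kind and approaching multi-step methods without distillation.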

Comments Technical report. Code at https://github.com/Lyy-iiis/imeanflow

Details
English abstract

MeanFlow (MF) has recently been established as a framework for one-step generative modeling. However, its "fastforward" nature introduces key challenges in both the training objective and the guidance mechanism. First, the original MF's training target depends not only on the underlying ground-truth fields but also on the network itself. To address this issue, we recast the objective as a loss on the instantaneous velocity $v$, re-parameterized by a network that predicts the average velocity $u$. Our reformulation yields a more standard regression problem and improves the training stability. Second, the original MF fixes the classifier-free guidance scale during training, which sacrifices flexibility. We tackle this issue by formulating guidance as explicit conditioning variables, thereby retaining flexibility at test time. The diverse conditions are processed through in-context conditioning, which reduces model size and benefits performance. Overall, our improved MeanFlow (iMF) method, trained entirely from scratch, achieves 1.72 FID with a single function evaluation (1-NFE) on ImageNet 256×256. iMF substantially outperforms prior methods of this kind and closes the gap with multi-step methods while using no distillation. We hope our work will further advance fastforward generative modeling as a stand-alone paradigm.
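
The relation between the two velocities named above can be written out as follows. This is a sketch based on the usual definition of an average velocity over an interval $[r, t]$; the symbols $z_t$, $r$, $t$, and $u_\theta$ are assumptions not stated in the abstract, and the exact loss used by iMF may differ.

```latex
% Average velocity over [r, t] along the flow z_\tau (assumed definition)
u(z_t, r, t) = \frac{1}{t - r} \int_r^t v(z_\tau, \tau)\, d\tau

% Differentiating (t - r)\,u with respect to t recovers the instantaneous velocity
v(z_t, t) = u(z_t, r, t) + (t - r)\,\frac{d}{dt} u(z_t, r, t)

% Illustrative regression loss on v, with u predicted by a network u_\theta
\mathcal{L}(\theta) = \mathbb{E}\,\Bigl\| u_\theta + (t - r)\,\tfrac{d}{dt} u_\theta - v \Bigr\|^2
```

Written this way, the regression target is the ground-truth field $v$ alone, while the network appears only on the prediction side.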

2512.02010 2026-05-12 cs.CL cs.LG

Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling

Jack Cook, Junxian Guo, Guangxuan Xiao, Yujun Lin, Keith Wyss, Mahdi Nazemi, Asit Mishra, Carlo del Mundo, Tijmen Blankevoort, Song Han

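AI summary As large language models have grown, low-precision numerical formats such as NVFP4 have attracted interest for their potential to improve speed and reduce memory usage. However, the reduced precision of NVFP4 quantization usually degrades model performance. This paper proposes Four Over Six (4/6), a modification of block-scaled NVFP4 quantization that adaptively scales some blocks to smaller FP4 values, making the distribution of representable values more uniform and reducing quantization error, especially for near-maximal values. Experiments show that 4/6 can be implemented efficiently on modern hardware accelerators and improves results in both pre-training and inference with minimal computational overhead.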

Comments 10 pages, 4 figures

Details
English abstract

As large language models have grown larger, interest has grown in low-precision numerical formats such as NVFP4 as a way to improve speed and reduce memory usage. However, quantizing models to NVFP4 remains challenging as the lack of precision generally degrades model performance. In this work, we address this issue with Four Over Six (4/6), a modification to the block-scaled NVFP4 quantization algorithm that yields reduced quantization error. Unlike integer formats, floating point formats have non-uniform step sizes which create larger quantization error on larger values. 4/6 takes advantage of this by adaptively scaling some blocks to smaller FP4 values, making the distribution of representable values more uniform and reducing quantization error for near-maximal values. We show that 4/6 can be implemented efficiently on modern hardware accelerators, resulting in performance gains during both pre-training and inference with minimal computational overhead. In pre-training experiments with the Nemotron 3 Nano 30B-A3B model architecture, we find that 4/6 brings training loss closer to BF16 compared to models trained with current state-of-the-art NVFP4 training recipes. Our code is available at https://github.com/mit-han-lab/fouroversix.
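
The adaptive scaling idea can be sketched as below: quantize each block once with its maximum mapped to 6 and once with it mapped to 4, then keep the version with lower error. This is a simplified illustration, not the authors' algorithm. It assumes the FP4 (E2M1) magnitude grid, picks by mean squared error, and ignores the quantization of the block scale itself; the function names are hypothetical.

```python
import numpy as np

# Magnitudes representable in FP4 E2M1; note the coarse step between 4 and 6.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_block(x, target_max):
    """Scale the block so its largest magnitude maps to target_max, then round to the grid."""
    amax = np.abs(x).max()
    if amax == 0:
        return np.zeros_like(x)
    scale = amax / target_max
    mags = np.abs(x) / scale
    nearest = FP4_GRID[np.abs(mags[:, None] - FP4_GRID[None, :]).argmin(axis=1)]
    return np.sign(x) * nearest * scale

def four_over_six(x):
    """Return the lower-error dequantized block and the target maximum (6.0 or 4.0) used."""
    candidates = {m: quantize_block(x, m) for m in (6.0, 4.0)}
    errors = {m: np.mean((x - q) ** 2) for m, q in candidates.items()}
    best = min(errors, key=errors.get)
    return candidates[best], best
```

For a block such as `[6.0, 4.9, 4.9, 4.9]`, scaling to 6 forces the 4.9 entries down to 4, whereas scaling to 4 places them at 4.5, so the smaller target wins. The example uses short blocks for brevity; real NVFP4 blocks are larger.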