arXivDaily: arXiv daily academic digest, updated Monday through Friday
2507.16806 2026-05-18 cs.LG cs.AI cs.CL

Beyond Binary Rewards: Training LMs to Reason About Their Uncertainty

Mehul Damani, Isha Puri, Stewart Slocum, Idan Shenfeld, Leshem Choshen, Yoon Kim, Jacob Andreas

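AI summary: This paper studies how to use reinforcement learning to train language models to better assess their own uncertainty when generating reasoning chains. Conventional approaches use binary reward functions that score only output correctness, which leaves models prone to wrong answers when they are uncertain. The authors propose a new training method, RLCR, which combines a binary correctness reward with a Brier score to jointly optimize accuracy and confidence calibration. Experiments show that RLCR substantially improves calibration across multiple datasets without sacrificing accuracy, outperforming both ordinary RL and post-hoc confidence calibration methods.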

Abstract

When language models (LMs) are trained via reinforcement learning (RL) to generate natural language "reasoning chains", their performance improves on a variety of difficult question answering tasks. Today, almost all successful applications of RL for reasoning use binary reward functions that evaluate the correctness of LM outputs. Because such reward functions do not penalize guessing or low-confidence outputs, they often have the unintended side-effect of degrading calibration and increasing the rate at which LMs generate incorrect responses (or "hallucinate") in other problem domains. This paper describes RLCR (Reinforcement Learning with Calibration Rewards), an approach to training reasoning models that jointly improves accuracy and calibrated confidence estimation. During RLCR, LMs generate both predictions and numerical confidence estimates after reasoning. They are trained to optimize a reward function that augments a binary correctness score with a Brier score -- a scoring rule for confidence estimates that incentivizes calibrated prediction. We first prove that this reward function (or any reward function that uses a bounded, proper scoring rule) yields models whose predictions are both accurate and well-calibrated. We next show that across diverse datasets, RLCR substantially improves calibration with no loss in accuracy, on both in-domain and out-of-domain evaluations -- outperforming both ordinary RL training and classifiers trained to assign post-hoc confidence scores. While ordinary RL hurts calibration, RLCR improves it. Finally, we demonstrate that verbalized confidence can be leveraged at test time to improve accuracy and calibration via confidence-weighted scaling methods. Our results show that explicitly optimizing for calibration can produce more generally reliable reasoning models. Code, models, and further information are available at https://rl-calibration.github.io/.
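As a rough illustration of the reward described in this abstract, the sketch below combines a binary correctness score with a Brier penalty on the stated confidence. The abstract does not give the exact formula, so the unweighted difference used here and the names `calibration_reward` and `expected_reward` are assumptions made for illustration only.

```python
def calibration_reward(correct: bool, confidence: float) -> float:
    """Binary correctness score minus a Brier penalty (assumed form)."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    y = 1.0 if correct else 0.0
    brier_penalty = (confidence - y) ** 2  # squared error of the confidence
    return y - brier_penalty


def expected_reward(p_correct: float, confidence: float) -> float:
    """Expected reward when the answer is right with probability p_correct."""
    return (p_correct * calibration_reward(True, confidence)
            + (1.0 - p_correct) * calibration_reward(False, confidence))


if __name__ == "__main__":
    # Scan confidences on a grid; the best one should match the true accuracy.
    grid = [i / 100 for i in range(101)]
    best = max(grid, key=lambda q: expected_reward(0.7, q))
    print("best confidence for p_correct=0.7:", best)
```

Under this assumed form, the expected reward for a fixed answer works out to p^2 - (q - p)^2, where p is the probability of being correct and q the stated confidence, so reporting q = p is optimal. This is the sense in which a proper scoring rule rewards calibrated confidence.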

2507.15778 2026-05-18 cs.CL

Stabilizing Knowledge, Promoting Reasoning: Dual-Token Constraints for RLVR

Jiakang Wang, Runze Liu, Fuzheng Zhang, Xiu Li, Guorui Zhou, Ling Pan

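AI summary: This work addresses Reinforcement Learning with Verifiable Rewards (RLVR) for improving the reasoning ability of large language models and proposes a new framework, Archer. Archer introduces a dual-token constraint mechanism that applies different optimization strategies to high-entropy (reasoning-related) and low-entropy (knowledge-related) tokens. While preserving the sequential dependencies of generation, the method controls update strength differently for the two token types. It outperforms existing methods on mathematical reasoning and code generation tasks, supporting its effectiveness for fine-grained optimization strategy design.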

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has become an effective post-training method for improving the reasoning abilities of Large Language Models (LLMs). However, existing methods mainly apply uniform optimization constraints across all tokens, ignoring their heterogeneous roles. Prior work shows that high-entropy tokens are closely tied to reasoning, while low-entropy tokens primarily encode factual knowledge, and recent approaches attempt to exploit this distinction by isolating token updates via masking or asynchronous training. We argue that such isolation breaks the sequential dependency structure of autoregressive generation, leading to suboptimal learning. To address this, we propose Archer, an entropy-aware RLVR framework with dual-token constraints that preserves joint optimization while modulating update strength across token types. Our method introduces response-level entropy normalization for stable token classification and applies differentiated clipping ranges and KL regularization to encourage exploration on reasoning tokens while preserving knowledge tokens. Experiments on mathematical reasoning and code generation benchmarks show that Archer consistently outperforms strong baselines across multiple model scales, improving both pass@1 and pass@K performance. These results highlight the importance of respecting sequence-level dependencies when designing fine-grained RL optimization strategies for LLMs.
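The idea of differentiated clipping ranges can be sketched as follows. This is not the Archer implementation: the per-response quantile threshold, the two clip values, and the function names `token_clip_ranges` and `clipped_surrogate` are all assumptions chosen for illustration, and the KL regularization mentioned in the abstract is omitted.

```python
import numpy as np


def token_clip_ranges(entropies, quantile=0.8,
                      eps_low_entropy=0.2, eps_high_entropy=0.4):
    """Give the highest-entropy tokens of one response a wider clip range.

    The threshold is computed within the response (a stand-in for the
    response-level normalization described in the abstract).
    """
    entropies = np.asarray(entropies, dtype=float)
    threshold = np.quantile(entropies, quantile)
    is_high_entropy = entropies >= threshold
    return np.where(is_high_entropy, eps_high_entropy, eps_low_entropy)


def clipped_surrogate(ratio, advantage, eps):
    """PPO-style clipped objective with a per-token clip range eps."""
    ratio = np.asarray(ratio, dtype=float)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return np.minimum(ratio * advantage, clipped * advantage)


if __name__ == "__main__":
    entropies = [0.1, 0.2, 2.0, 0.1, 0.3]  # one high-entropy token
    eps = token_clip_ranges(entropies)
    print("clip ranges:", eps)
    print("objective:", clipped_surrogate(np.full(5, 1.5), 1.0, eps))
```

All tokens stay in one joint objective; only the allowed update size differs per token, which is the contrast the abstract draws with masking-based isolation.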

2507.01679 2026-05-18 cs.LG cs.AI cs.CL

Blending Supervised and Reinforcement Fine-Tuning with Prefix Sampling

Zeyu Huang, Tianhao Cheng, Zihan Qiu, Zili Wang, Yinghui Xu, Edoardo M. Ponti, Ivan Titov

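AI summary: This paper studies how to combine supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT) in LLM post-training and proposes Prefix-RFT, a hybrid strategy that uses prefix sampling to learn jointly from demonstration data and exploration. The method performs well on mathematical reasoning tasks, outperforming standalone SFT and RFT as well as other hybrid strategies. The results confirm that SFT and RFT are complementary and show that the method is robust to changes in the quality and quantity of demonstration data.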

Comments ICML 2026

Abstract

Existing LLM post-training techniques are broadly categorized into supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT). Each paradigm presents a distinct trade-off: (1) SFT excels at mimicking demonstration data, but can lead to problematic generalization as a form of behavior cloning. (2) Conversely, RFT can significantly enhance a model's performance but is prone to learning unexpected behaviors, and its performance is sensitive to the initial policy. In this paper, we propose a unified view of these methods and introduce Prefix-RFT, a hybrid approach that synergizes learning from both demonstration and exploration. Using mathematical reasoning problems as a test bed, we empirically demonstrate that Prefix-RFT is simple yet effective. Not only does it surpass the performance of standalone SFT and RFT, but it also outperforms parallel mixed-policy RFT methods. Our analysis highlights the complementary nature of SFT and RFT, validating that Prefix-RFT effectively harmonizes them. Further ablation studies confirm the method's robustness to variations in the quality and quantity of demonstration data.

2507.01201 2026-05-18 cs.LG cs.CV

Escaping Plato's Cave: JAM for Aligning Independently Trained Vision and Language Models

Lauren Hyoseo Yoon, Yisong Yue, Been Kim

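AI summary: This paper studies how to align independently trained vision and language models and proposes a method called JAM, which achieves cross-modal alignment by jointly training modality-specific autoencoders. JAM introduces a multimodal Spread Loss that improves alignment, and the paper systematically analyzes how the alignment objective, layer depth, and foundation model scale affect representational convergence. The work offers theoretical insight into the structure of shared semantics as well as practical guidance for building specialist multimodal models.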

Abstract

Independently trained vision and language models inhabit disjoint representational spaces, shaped by their respective modalities, objectives, and architectures. The Platonic Representation Hypothesis (PRH) suggests these models may nonetheless converge toward a shared statistical model of reality. This raises a fundamental question: can we move beyond post-hoc detection of such alignment and explicitly optimize for it? We argue this challenge is most critical in fine-grained contextual distinctions, where multiple descriptions share global semantics but differ in subtle compositional details. We address this with the Joint Autoencoder Modulator (JAM), which aligns frozen unimodal models by jointly training modality-specific autoencoders with coordinated reconstruction and cross-modal alignment objectives. We systematically evaluate JAM across three design axes: (i) alignment objectives, introducing our multimodal Spread Loss that outperforms classic contrastive methods; (ii) the layer depth at which alignment is most effective; and (iii) the role of foundation model scale in representational convergence. Our findings show that JAM reliably induces alignment even across independently trained representations, offering both theoretical insight into the structure of shared semantics and practical guidance for transforming generalist unimodal foundations into specialist multimodal models.

2507.00275 2026-05-18 cs.LG cs.AI

Deep Double Q-learning

Prabhat Nagarajan, Martha White, Marlos C. Machado

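AI summary: This paper proposes a deep reinforcement learning algorithm, Deep Double Q-learning (DDQL), aimed at the overestimation problem in deep Q-networks (DQN). The method explicitly trains two Q-functions and stabilizes training with techniques such as lower replay ratios and longer target network update intervals. Across 57 Atari 2600 games, DDQL improves aggregate performance over Double DQN, outperforming it on 47 games, and further reduces overestimation.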

Comments 44 pages

Abstract

Double Q-learning is a classical control algorithm that mitigates the maximization bias of Q-learning. To do so, it explicitly trains two independent action-value functions and uses them to decouple action-selection and action-evaluation when computing bootstrap targets. Double DQN adapts target bootstrap decoupling to deep reinforcement learning (RL), but explicitly trains only a single action-value function and does not fully decouple its estimators. Consequently, the two estimators remain correlated, and overestimation persists. In this paper, we introduce Deep Double Q-learning (DDQL), a deep RL algorithm that explicitly trains two Q-functions through Double Q-learning. DDQL stabilizes training through a combination of techniques, including lower replay ratios, longer target network update intervals, and shared layers. Across 57 Atari 2600 games, DDQL improves aggregate performance over Double DQN, outperforming it on 47 games while further reducing overestimation. In addition, we study key design choices when adapting Double Q-learning to deep RL, including the network architecture, replay ratio, and minibatch sampling strategies.
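The decoupling of action selection and action evaluation in classical Double Q-learning can be sketched in tabular form. This is the textbook tabular update, not DDQL itself; the function name `double_q_update`, the step size, and the discount factor are illustrative assumptions.

```python
import numpy as np


def double_q_update(q_a, q_b, s, a, r, s_next, done, rng,
                    alpha=0.1, gamma=0.99):
    """One tabular Double Q-learning step; updates q_a or q_b in place."""
    # Randomly choose which table to update; the other one evaluates.
    if rng.random() < 0.5:
        q_sel, q_eval = q_a, q_b
    else:
        q_sel, q_eval = q_b, q_a
    if done:
        target = r
    else:
        best_action = int(np.argmax(q_sel[s_next]))       # action selection
        target = r + gamma * q_eval[s_next, best_action]  # action evaluation
    q_sel[s, a] += alpha * (target - q_sel[s, a])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q_a = np.zeros((2, 2))
    q_b = np.zeros((2, 2))
    q_a[1] = [1.0, 5.0]
    q_b[1] = [2.0, 3.0]
    double_q_update(q_a, q_b, s=0, a=0, r=1.0, s_next=1, done=False, rng=rng)
    print(q_a[0, 0], q_b[0, 0])
```

Because the table that picks the greedy action is not the one that scores it, an action that looks good only through estimation noise in one table is less likely to inflate the bootstrap target.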

2506.23552 2026-05-18 cs.CV cs.SD eess.AS

JAM-Flow: Joint Audio-Motion Synthesis with Flow Matching

Mingi Kwon, Joonghyuk Shin, Jaeseok Jung, Jaesik Park, Youngjung Uh

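AI summary: This paper proposes JAM-Flow, a unified framework for jointly generating facial motion and speech, addressing the conventional practice of treating talking head synthesis and speech synthesis as separate tasks. The method combines flow matching with a new Multi-Modal Diffusion Transformer (MM-DiT) architecture that enables cross-modal interaction through selective joint attention layers while preserving the strengths of each modality. JAM-Flow supports multiple conditioning inputs in a single model, such as text, reference audio, and reference motion, enabling tasks such as synchronized talking head generation from text and audio-driven animation.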

Comments Project page: https://joonghyuk.com/jamflow-web (under review; preprint published on arXiv)

Abstract

The intrinsic link between facial motion and speech is often overlooked in generative modeling, where talking head synthesis and text-to-speech (TTS) are typically addressed as separate tasks. This paper introduces JAM-Flow, a unified framework to simultaneously synthesize and condition on both facial motion and speech. Our approach leverages flow matching and a novel Multi-Modal Diffusion Transformer (MM-DiT) architecture, integrating specialized Motion-DiT and Audio-DiT modules. These are coupled via selective joint attention layers and incorporate key architectural choices, such as temporally aligned positional embeddings and localized joint attention masking, to enable effective cross-modal interaction while preserving modality-specific strengths. Trained with an inpainting-style objective, JAM-Flow supports a wide array of conditioning inputs, including text, reference audio, and reference motion, facilitating tasks such as synchronized talking head generation from text, audio-driven animation, and much more, within a single, coherent model. JAM-Flow significantly advances multi-modal generative modeling by providing a practical solution for holistic audio-visual synthesis. Project page: https://joonghyuk.com/jamflow-web

2506.06739 2026-05-18 cs.AI cs.LG

Honey, I shrunk the hypothesis space (through logical preprocessing)

Andrew Cropper, Filipe Gouveia, David M. Cerna

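AI summary: This work proposes a method for shrinking the hypothesis space of inductive logic programming (ILP) through logical preprocessing. Using background knowledge, the method removes, before learning, rules that cannot appear in an optimal hypothesis regardless of the training data, for example rules that violate relationships such as "even numbers cannot be odd". Experiments show that the approach substantially reduces learning time while maintaining predictive accuracy; for instance, with only 10 seconds of preprocessing it cuts a learning time of over 10 hours to just 2 seconds.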

Comments Published in JAIR

Journal ref
Journal of Artificial Intelligence Research, Vol. 85 (2026)
Abstract

Inductive logic programming (ILP) is a form of logical machine learning. The goal is to search a hypothesis space for a hypothesis that generalises training examples and background knowledge. We introduce an approach that 'shrinks' the hypothesis space before an ILP system searches it. Our approach uses background knowledge to find rules that cannot be in an optimal hypothesis regardless of the training examples. For instance, our approach discovers relationships such as "even numbers cannot be odd" and "prime numbers greater than 2 are odd". It then removes violating rules from the hypothesis space. We implement our approach using answer set programming and use it to shrink the hypothesis space of a constraint-based ILP system. Our experiments on multiple domains, including visual reasoning and game playing, show that our approach can substantially reduce learning times whilst maintaining predictive accuracies. For instance, given just 10 seconds of preprocessing time, our approach can reduce learning times from over 10 hours to only 2 seconds.

2506.05878 2026-05-18 cs.LG

A projection-based framework for gradient-free and parallel learning

Andreas Bergmeister, Manish Krishan Lal, Stefanie Jegelka, Suvrit Sra

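AI summary: This paper proposes a projection-based framework for training neural networks. Unlike conventional gradient descent, it recasts training as a large-scale feasibility problem and uses iterative projection algorithms to find network parameters that satisfy local constraints. Because the projection operators act locally, the method supports parallel computation and can handle non-differentiable operations. The authors develop PJAX, a software framework that implements this approach with GPU/TPU acceleration, and validate it on several network architectures, showing advantages in parallelism and generality.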

Abstract

We present a feasibility-seeking approach to neural network training. This mathematical optimization framework is distinct from conventional gradient-based loss minimization and uses projection operators and iterative projection algorithms. We reformulate training as a large-scale feasibility problem: finding network parameters and states that satisfy local constraints derived from its elementary operations. Training then involves projecting onto these constraints, a local operation that can be parallelized across the network. We introduce PJAX, a JAX-based software framework that enables this paradigm. PJAX composes projection operators for elementary operations, automatically deriving the solution operators for the feasibility problems (akin to autodiff for derivatives). It inherently supports GPU/TPU acceleration, provides a familiar NumPy-like API, and is extensible. We train diverse architectures (MLPs, CNNs, RNNs) on standard benchmarks using PJAX, demonstrating its functionality and generality. Our results show that this approach is a compelling alternative to gradient-based training, with clear advantages in parallelism and the ability to handle non-differentiable operations.
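To make the notion of feasibility seeking by iterative projection concrete, here is a toy sketch that alternates Euclidean projections onto two convex sets, a hyperplane and a box, until it reaches a point in both. It does not use PJAX and is not a neural network training procedure; the sets, function names, and iteration count are assumptions for illustration.

```python
import numpy as np


def project_hyperplane(x, a, b):
    """Euclidean projection of x onto the set {z : a @ z = b}."""
    return x - (a @ x - b) / (a @ a) * a


def project_box(x, lo, hi):
    """Euclidean projection of x onto the box [lo, hi]^n."""
    return np.clip(x, lo, hi)


def alternating_projections(x0, a, b, lo, hi, iters=200):
    """Alternate the two projections; assumes the sets intersect."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = project_box(project_hyperplane(x, a, b), lo, hi)
    return x


if __name__ == "__main__":
    a = np.array([1.0, 2.0])
    x = alternating_projections([3.0, 3.0], a, b=2.0, lo=0.0, hi=1.0)
    print("feasible point:", x, "constraint residual:", a @ x - 2.0)
```

Each projection touches only its own constraint, which hints at why a formulation built from many local constraints lends itself to parallel execution.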

2505.21698 2026-05-18 cs.CV

Adapting Foundation Vision-Language Models to Medical Diagnosis via Query-Driven Expert Bridging

Yitong Li, Morteza Ghahremani, Christian Wachinger

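AI summary: This work addresses the difficulty of applying vision-language foundation models to medical image diagnosis and proposes MedBridge, a lightweight adaptation framework that combines domain alignment, resolution preservation, and multi-label reasoning to mitigate the domain gap between medical and natural images. MedBridge uses pretrained vision-language models as multi-view query encoders, introduces learnable query tokens for non-destructive domain adaptation, and dynamically integrates heterogeneous models through a mixture-of-experts architecture for multi-label diagnosis, improving performance on both cross-domain and in-domain tasks. Experiments show that the method outperforms existing approaches on several chest X-ray diagnosis benchmarks and that it is model-agnostic and extensible.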

Abstract

Vision-language foundation models achieve promising performance in natural image classification, yet their direct application to medical imaging is limited by severe domain shifts, resolution mismatches, and the multi-label nature of clinical diagnosis. Training dedicated medical foundation models from scratch, however, is costly and data-intensive. Here, we propose MedBridge, a lightweight adaptation framework that opens a new direction in domain-gap mitigation by jointly combining domain alignment, resolution preservation, and multi-label reasoning via complementary VLM experts for medical image diagnosis. Specifically, MedBridge transforms pretrained VLMs into multi-view query encoders that inject a compact set of learnable query tokens into intermediate layers, enabling non-destructive domain alignment while preserving fine-grained pathological cues via multi-view high-resolution sampling. These query tokens further act as routing signals for a mixture-of-experts, dynamically integrating heterogeneous foundation models for multi-label reasoning without requiring a shared representation space. We evaluated MedBridge on five chest radiograph benchmarks in three key adaptation tasks. MedBridge demonstrates superior performance in both cross-domain generalization (out-of-distribution transfer) and in-domain specialization (same-distribution tuning) settings, yielding a significant 6-15% AUC improvement over state-of-the-art adaptation methods for multi-label thoracic disease diagnosis. Furthermore, MedBridge is model-agnostic and demonstrates broad extensibility across eight diverse VLMs (e.g., CLIP, LLaVA, Qwen-VL, MedGemma), highlighting its ability to flexibly adapt arbitrary foundation models into a powerful medical diagnostic tool. Our code will be released upon acceptance.

2505.21535 2026-05-18 cs.CV cs.AI cs.LG

FAR: Function-preserving Attention Replacement for IMC-friendly Inference

Yuxin Ren, Maxwell D Collins, Miao Hu, Huanrui Yang

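AI summary: This paper proposes FAR, a Function-preserving Attention Replacement framework that targets the inefficiency of transformer inference on ReRAM-based in-memory computing (IMC) devices. FAR replaces the attention mechanism in pretrained DeiT models with a multi-head bidirectional LSTM structure compatible with IMC dataflows and combines block-wise knowledge distillation with structured pruning, retaining functional equivalence while reducing latency and parameter count. Experiments show that FAR maintains accuracy comparable to the original models on ImageNet and several downstream tasks, indicating its potential for efficient transformer deployment on edge devices.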

Comments 7 pages main paper, 6 figures; accepted by GLSVLSI 2026

Abstract

While transformers dominate modern vision and language models, their attention mechanism remains poorly suited for in-memory computing (IMC) devices due to intensive activation-to-activation multiplications and non-local memory access, leading to substantial latency and bandwidth overhead on ReRAM-based accelerators. To address this mismatch, we propose FAR, a Function-preserving Attention Replacement framework that substitutes all attention in pretrained DeiTs with sequential modules inherently compatible with IMC dataflows. Specifically, FAR replaces self-attention with a multi-head bidirectional LSTM architecture via block-wise distillation to retain functional equivalence while enabling linear-time computation and localized weight reuse. We further incorporate structured pruning on FAR models, enabling flexible adaptation to resource-constrained IMC arrays while maintaining functional fidelity. Evaluations on the DeiT family demonstrate that FAR maintains comparable accuracy to the original attention-based models on ImageNet and multiple downstream tasks with reduced parameters and latency. Further analysis shows that FAR preserves the semantic token relationships learned by attention while improving computational efficiency, highlighting its potential for energy-efficient transformer inference on IMC-based edge accelerators.

2505.19241 2026-05-18 cs.LG cs.AI

ActiveDPO: Active Direct Preference Optimization for Sample-Efficient Alignment

Xiaoqiang Lin, Arun Verma, Zhongxiang Dai, Daniela Rus, See-Kiong Ng, Bryan Kian Hsiang Low

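AI summary: This paper proposes ActiveDPO, an active direct preference optimization method that aims to improve sample efficiency in large language model alignment. The method relies on a theoretically grounded data selection criterion that applies to non-linear reward functions, and it uses the LLM being aligned to parameterize the reward model, which guides data selection more effectively. Experiments show that ActiveDPO outperforms existing methods across various models and real-world preference datasets.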

Comments Accepted at ICLR 2026

Abstract

The recent success in using human preferences to align large language models (LLMs) has significantly improved their performance in various downstream tasks, such as question answering, mathematical reasoning, and code generation. However, achieving effective LLM alignment depends on high-quality datasets of human preferences. Collecting these datasets requires human preference annotation, which is costly and resource-intensive, necessitating efficient active data selection methods. Existing methods either lack a strong theoretical foundation or depend on restrictive assumptions about the reward function, such as linear latent reward functions. To this end, we propose an algorithm, ActiveDPO, that uses a theoretically grounded data selection criterion for non-linear reward functions while directly leveraging the LLM itself to parameterize the reward model used for active data selection. As a result, ActiveDPO explicitly accounts for the LLM's influence on data selection, unlike methods that select data without considering the LLM that is being aligned, thereby leading to more effective and efficient data collection. Our extensive experiments demonstrate that ActiveDPO outperforms existing methods across various models and real-world preference datasets.

2505.18511 2026-05-18 cs.LG math.AP physics.comp-ph

SPDEBench: An Extensive Benchmark for Learning Stochastic PDEs

Yuantu Zhu, Zheyan Li, Dai Shi, Luke Thompson, Oliver Nash, Jose Miguel Lara Rangel, Siran Li, Bingguang Chen, Rongchan Zhu, Qi Meng, Hao Ni

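AI summary: This paper introduces SPDEBench, the first unified benchmark for learning stochastic partial differential equations (SPDEs), which addresses the current lack of standardized datasets and evaluation protocols in this area. The benchmark covers physically and mathematically significant SPDEs on 1-3D domains with periodic or Dirichlet boundary conditions, including both regular and singular SPDEs, and provides several machine learning baselines together with seven evaluation metrics. Experiments show that SPDE-aware models generally outperform generic operator-learning methods in accuracy and generalization, and SPDEBench offers a reproducible and extensible resource for related research.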

Abstract

Stochastic Partial Differential Equations (SPDEs) driven by random noise play a central role in modeling physical processes with rough spatio-temporal dynamics, such as turbulent flows, superconductors, and quantum dynamics. Although machine learning (ML)-based surrogate models have shown promise for efficiently approximating such dynamics, progress remains limited by the lack of a unified benchmark with controlled data generation and comprehensive evaluation. This gap is particularly significant for singular SPDEs, for which benchmark datasets are largely unavailable and reliable simulation requires numerically delicate schemes based on renormalization. Moreover, subtle differences in data-generation procedures, such as noise approximation, basis choice, and the inclusion of renormalization, can significantly affect the resulting datasets and, consequently, model evaluation. We introduce SPDEBench, the first unified benchmark for ML-based SPDE learning. SPDEBench provides ready-to-use datasets for physically and mathematically significant SPDEs on 1-3D domains with periodic or Dirichlet boundary conditions. Both regular and singular SPDEs are taken into consideration. SPDEBench also incorporates representative ML baselines in operator learning, together with 7 evaluation metrics, including Sobolev and distributional metrics beyond the standard $L^2$-error. Supported by SPDEBench, we conduct systematic evaluations of model accuracy, robustness, and out-of-distribution generalization under controlled data variations. Our numerical results show that SPDE-aware architectures generally achieve stronger performance than generic operator-learning baselines. These findings establish SPDEBench as a reproducible and extensible resource, paving the way for principled benchmarking and architecture design for stochastic spatio-temporal dynamics.

2505.18134 2026-05-18 cs.AI cs.CL cs.CV

VideoGameBench: Can Vision-Language Models complete popular video games?

Alex L. Zhang, Thomas L. Griffiths, Karthik R. Narasimhan, Ofir Press

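AI summary: VideoGameBench is a benchmark for evaluating whether vision-language models (VLMs) can complete popular video games. It contains 10 classic games from the 1990s, with which models interact in real time using only raw visual inputs and a description of the objectives. The study shows that current frontier VLMs perform poorly on real-time game tasks and struggle to complete full games, with inference latency as a major limitation. The authors also introduce VideoGameBench Lite to ease the real-time constraint and report that even the best current models complete only a very small fraction of the benchmark.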

Comments 10 pages, 38 pages including supplementary

Abstract

Vision-language models (VLMs) have achieved strong results on coding and math benchmarks that are challenging for humans, yet their ability to perform tasks that come naturally to humans--such as perception, spatial navigation, and memory management--remains understudied. Real video games are crafted to be intuitive for humans to learn and master by leveraging innate inductive biases, making them an ideal testbed for evaluating such capabilities in VLMs. To this end, we introduce VideoGameBench, a benchmark consisting of 10 popular video games from the 1990s that VLMs directly interact with in real-time. VideoGameBench challenges models to complete entire games with access to only raw visual inputs and a high-level description of objectives and controls, a significant departure from existing setups that rely on game-specific scaffolding and auxiliary information. We keep three of the games secret to encourage solutions that generalize to unseen environments. Our experiments show that frontier vision-language models struggle to progress beyond the beginning of each game. We find inference latency to be a major limitation of frontier models in the real-time setting; therefore, we introduce VideoGameBench Lite, a setting where the game pauses while waiting for the LM's next action. The best performing models, Gemini 2.5 Pro and Claude 3.7 Sonnet, complete only 0.48% of VideoGameBench and 1.6% of VideoGameBench Lite. We hope that the formalization of the human skills mentioned above into this benchmark motivates progress in these research directions.

2505.13350 2026-05-18 cs.RO

Approximating Global Contact-Implicit MPC via Sampling and Local Complementarity

Sharanya Venkatesh, Bibit Bianchini, Alp Aydinoglu, William Yang, Michael Posa

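AI summary: To achieve general-purpose dexterous manipulation, robots must rapidly plan and execute contact-rich behaviors. Existing model-based controllers cannot globally optimize in real time over the exponential number of possible contact sequences, and contact-implicit control methods, although they use simpler models, make only local approximations, which limits exploration of the space of contacts. This paper proposes a new approach that combines local complementarity-based control with global sampling: in every control loop it first samples targets for a contact-free stage, then runs contact-rich local model predictive control from each sample. The result is a globally informed contact-implicit controller that performs precise non-prehensile manipulation of non-convex objects in real time.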

Comments S.V. and B.B. contributed equally to this work. Accepted to RA-L 2025; presented at ICRA 2026. Project page: https://approximating-global-ci-mpc.github.io

Journal ref
IEEE Robotics and Automation Letters, volume 10, number 11, pages 12117-12124, September 2025
Abstract

To achieve general-purpose dexterous manipulation, robots must rapidly devise and execute contact-rich behaviors. Existing model-based controllers are incapable of globally optimizing in real-time over the exponential number of possible contact sequences. Instead, recent progress in contact-implicit control has leveraged simpler models that, while still hybrid, make local approximations. However, the use of local models inherently limits the controller to only exploit nearby interactions, potentially requiring intervention to richly explore the space of possible contacts. We present a novel approach which leverages the strengths of local complementarity-based control in combination with low-dimensional, but global, sampling of possible end-effector locations. Our key insight is to consider a contact-free stage preceding a contact-rich stage at every control loop. Our algorithm, in parallel, samples end effector locations to which the contact-free stage can move the robot, then considers the cost predicted by contact-rich MPC local to each sampled location. The result is a globally-informed, contact-implicit controller capable of real-time dexterous manipulation. We demonstrate our controller on precise, non-prehensile manipulation of non-convex objects using a Franka Panda arm. Project page: https://approximating-global-ci-mpc.github.io

2505.12601 2026-05-18 cs.LG

Rethinking Predictive Modeling for LLM Routing: When Simple kNN Beats Complex Learned Routers

Yang Li

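AI summary: As large language models (LLMs) grow in scale and specialization, efficiently selecting the most suitable model for a given input has become a key problem. This paper revisits LLM routing and finds that a well-tuned k-Nearest Neighbors (kNN) approach not only performs well across diverse tasks but often outperforms state-of-the-art learned routers. The study introduces a suite of standardized routing benchmarks and the first multi-modal routing dataset, and shows that the locality of model performance in embedding space gives non-parametric methods an advantage in sample complexity, challenging the current trend toward complex architectures.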

Abstract

As large language models (LLMs) grow in scale and specialization, routing--selecting the best model for a given input--has become essential for efficient and effective deployment. While recent methods rely on complex learned routing strategies, their dependence on disparate training data and evaluation setups makes comparison and generalization difficult. In this work, we revisit LLM routing through the lens of simplicity. We show that a well-tuned k-Nearest Neighbors (kNN) approach not only matches but often outperforms state-of-the-art learned routers across diverse tasks. To support systematic evaluation, we introduce a suite of standardized routing benchmarks spanning instruction-following, question-answering, and reasoning tasks, as well as the first multi-modal routing dataset involving visual inputs. Our findings reveal that the locality properties of model performance in embedding space enable simple non-parametric methods to achieve strong routing decisions with lower sample complexity than parametric approaches. This challenges the prevailing trend toward sophisticated architectures and highlights the importance of thoroughly evaluating simple baselines before investing in complex solutions. To support reproducibility and further exploration, we will release all benchmarks and code upon publication.
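A kNN router of the general kind discussed in this abstract could look like the sketch below. It assumes that query embeddings and per-model quality scores on past queries are already available; the function name `knn_route`, the use of cosine similarity, and the mean-score vote are illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def knn_route(query_emb, train_embs, train_scores, k=3):
    """Pick the model with the best mean score on the k nearest past queries.

    train_embs:   (n_queries, dim) embeddings of past queries
    train_scores: (n_queries, n_models) observed quality of each model
    """
    train_embs = np.asarray(train_embs, dtype=float)
    train_scores = np.asarray(train_scores, dtype=float)
    q = np.asarray(query_emb, dtype=float)
    # Cosine similarity between the new query and every past query.
    norms = np.linalg.norm(train_embs, axis=1) * np.linalg.norm(q) + 1e-12
    sims = (train_embs @ q) / norms
    nearest = np.argsort(-sims)[:k]
    return int(np.argmax(train_scores[nearest].mean(axis=0)))


if __name__ == "__main__":
    embs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    scores = [[1, 0], [1, 0], [0, 1], [0, 1]]  # columns: model 0, model 1
    print(knn_route([1.0, 0.05], embs, scores, k=2))
    print(knn_route([0.05, 1.0], embs, scores, k=2))
```

The router has no trained parameters: if nearby queries in embedding space tend to favor the same model, averaging neighbor scores is enough to make a routing decision.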

2505.07322 2026-05-18 cs.CV

RealRep: Generalized SDR-to-HDR Conversion via Attribute-Disentangled Representation Learning

Li Xu, Siqi Wang, Kepeng Xu, Gang He, Lin Zhang, Weiran Wang, Yu-Wing Tai

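AI summary: This paper proposes RealRep, a generalized SDR-to-HDR conversion framework that improves robustness to diverse real-world SDR content by learning disentangled representations of luminance and chrominance attributes. Its core components are disentangled representation learning, a degradation-aware negative exemplar generation strategy, and DDACMNet, a lightweight two-stage mapping network that dynamically adjusts the mapping according to degradation conditions. Experiments show that RealRep outperforms existing methods in both generalization and perceptually faithful HDR color reconstruction.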

Comments Published at AAAI'26 (Oral): The Annual AAAI Conference on Artificial Intelligence

Abstract

High-Dynamic-Range Wide-Color-Gamut (HDR-WCG) technology is becoming increasingly widespread, driving a growing need for converting Standard Dynamic Range (SDR) content to HDR. Existing methods primarily rely on fixed tone mapping operators, which struggle to handle the diverse appearances and degradations commonly present in real-world SDR content. To address this limitation, we propose a generalized SDR-to-HDR framework that enhances robustness by learning attribute-disentangled representations. Central to our approach is Realistic Attribute-Disentangled Representation Learning (RealRep), which explicitly disentangles luminance and chrominance components to capture intrinsic content variations across different SDR distributions. Furthermore, we design a Luma-/Chroma-aware negative exemplar generation strategy that constructs degradation-sensitive contrastive pairs, effectively modeling tone discrepancies across SDR styles. Building on these attribute-level priors, we introduce the Degradation-Domain Aware Controlled Mapping Network (DDACMNet), a lightweight, two-stage framework that performs adaptive hierarchical mapping guided by a control-aware normalization mechanism. DDACMNet dynamically modulates the mapping process via degradation-conditioned features, enabling robust adaptation across diverse degradation domains. Extensive experiments demonstrate that RealRep consistently outperforms state-of-the-art methods in both generalization and perceptually faithful HDR color gamut reconstruction.

2505.06982 2026-05-18 cs.CV

Decentralized LoRA augmented transformer with multi-scale feature learning for secured eye diagnosis

Md. Naimur Asif Borno, Md Sakib Hossain Shovon, MD Hanif Sikder, Iffat Firozy Rimi, Tahani Jaser Alahmadi, Mohammad Ali Moni

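AI summary: This paper proposes a decentralized eye disease diagnosis framework based on the Data-efficient Image Transformer (DeiT), aimed at the challenges of data imbalance, privacy protection, spatial feature diversity, and clinical interpretability in ophthalmic image diagnosis. The method combines multi-scale feature learning, Low-Rank Adaptation (LoRA), knowledge distillation, and federated learning to improve computational efficiency, data privacy, and diagnostic performance. Experiments show that the framework outperforms traditional convolutional neural networks and existing transformer models on benchmark datasets, and Grad-CAM++ visualizations provide interpretable support for its predictions, laying a foundation for secure and scalable AI diagnosis in ophthalmology.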

Comments Published at Knowledge-Based Systems

Abstract

Accurate and privacy-preserving diagnosis of ophthalmic diseases remains a critical challenge in medical imaging, particularly given the limitations of existing deep learning models in handling data imbalance, data privacy concerns, spatial feature diversity, and clinical interpretability. This paper proposes a novel Data-efficient Image Transformer (DeiT)-based framework that integrates context-aware multi-scale patch embedding, Low-Rank Adaptation (LoRA), knowledge distillation, and federated learning to address these challenges in a unified manner. The proposed model effectively captures both local and global retinal features by leveraging multi-scale patch representations with local and global attention mechanisms. LoRA integration enhances computational efficiency by reducing the number of trainable parameters, while federated learning ensures secure, decentralized training without compromising data privacy. A knowledge distillation strategy further improves generalization in data-scarce settings. Comprehensive evaluations on two benchmark datasets, OCTDL and the Eye Disease Image Dataset, demonstrate that the proposed framework consistently outperforms both traditional CNNs and state-of-the-art transformer architectures across key metrics including AUC, F1 score, and precision. Furthermore, Grad-CAM++ visualizations provide interpretable insights into model predictions, supporting clinical trust. This work establishes a strong foundation for scalable, secure, and explainable AI applications in ophthalmic diagnostics.
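How LoRA reduces the number of trainable parameters can be shown with a small NumPy sketch of one low-rank-adapted linear layer. This illustrates the general LoRA idea rather than the paper's model; the class name `LoRALinear`, the rank, the scaling factor, and the layer size are assumptions.

```python
import numpy as np


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update scale * B @ A."""

    def __init__(self, weight, rank=4, alpha=8.0, seed=0):
        rng = np.random.default_rng(seed)
        self.weight = np.asarray(weight, dtype=float)  # frozen, (d_out, d_in)
        d_out, d_in = self.weight.shape
        self.A = rng.normal(0.0, 0.01, size=(rank, d_in))  # trainable
        self.B = np.zeros((d_out, rank))  # trainable, zero at initialization
        self.scale = alpha / rank

    def __call__(self, x):
        # Base output plus the low-rank correction; x has shape (batch, d_in).
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T

    def trainable_params(self):
        return self.A.size + self.B.size


if __name__ == "__main__":
    layer = LoRALinear(np.ones((768, 768)), rank=4)
    print("full weight parameters:", layer.weight.size)
    print("LoRA trainable parameters:", layer.trainable_params())
```

For a d_out by d_in layer, the adapter trains rank * (d_in + d_out) values instead of d_in * d_out, and because B starts at zero the adapted layer initially reproduces the frozen layer's output.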

2504.21850 2026-05-18 cs.CV

Visual Compositional Tuning

Xindi Wu, Hee Seung Hwang, Polina Kirichenko, Esin Tureci, Olga Russakovsky

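AI summary: This paper studies how sample complexity affects informativeness in visual instruction tuning (VIT) datasets and proposes COMPACT, a synthetic data generation recipe that improves data efficiency by combining multiple atomic visual capabilities in a single training example. Experiments show that COMPACT reduces the training data budget by 90% while matching or exceeding the performance of training on the full data, with strong results on several vision-language benchmarks. The method offers a scalable way to improve training efficiency for vision-language tasks.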

Comments See the project website at this [URL](https://princetonvisualai.github.io/compact/)

Details
English abstract

Visual instruction tuning (VIT) datasets have grown rapidly in scale, yet the informativeness of individual training samples has largely been overlooked. Recent dataset selection methods have shown that a small fraction of such datasets enriched with informative samples can lead to efficient finetuning of Multimodal Large Language Models. In this work, we explore the impact of sample complexity on informative data curation and introduce COMPACT (COMPositional Atomic-to-complex Visual Compositional Tuning), a compositional VIT data recipe that scales training sample complexity by combining multiple atomic visual capabilities in a single training example. Concretely, we synthesize rich and informative text questions for each image, allowing us to significantly reduce the number of training examples required for effective VIT. COMPACT demonstrates superior data efficiency compared to existing data reduction methods. When applied to the LLaVA-665K VIT dataset, COMPACT reduces the data budget by 90% while still achieving 100.2% of the full VIT performance (compared to only 97.5% by the state-of-the-art method) across eight multimodal benchmarks. Furthermore, training on the COMPACT data outperforms training on the full-scale VIT data on particularly complex benchmarks such as MM-Vet (+8.6%) and MMStar (+2.9%). COMPACT offers a scalable and efficient synthetic data generation recipe to improve on vision-language tasks.

2504.09544 2026-05-18 cs.LG cs.CE cs.CV

Integrating chemical structures as treatments improves representations of microscopy images for morphological profiling

Yemin Yu, Emre Hayir, Neil Tenenholtz, Lester Mackey, Ying Wei, David Alvarez-Melis, Ava P. Amini, Alex X. Lu

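AI summary This study proposes MICON, a new framework that incorporates chemical structure information into self-supervised pre-training to improve representations of high-throughput microscopy images for more accurate morphological profiling. The authors argue that modeling compound structures as "treatments" that induce changes in cell phenotypes significantly outperforms classical hand-crafted features and existing deep learning methods. Experiments show that representation learning with chemical information performs better at identifying drug effects across experimental replicates and data sources, pointing to a new direction for representation learning on multimodal microscopy screening data.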

Comments 24 pages

Details
English abstract

Recent advances in self-supervised deep learning have improved our ability to quantify cellular morphological changes in high-throughput microscopy screens, a process known as morphological profiling. However, most current methods only learn from images, despite many screens being inherently multimodal, as they involve both a chemical or genetic perturbation as well as an image-based readout. We hypothesized that incorporating chemical compound structures during self-supervised pre-training could improve learned representations of images from high-throughput microscopy screens. We introduce a representation learning framework, MICON (Molecular-Image Contrastive Learning), that models chemical compounds as treatments that induce transformations of cell phenotypes. MICON significantly outperforms classical hand-crafted features such as CellProfiler and existing deep-learning-based representation learning methods in challenging evaluation settings where models must identify reproducible effects of drugs across independent replicates and data-generating centers. We demonstrate that incorporating chemical compound information into the learning process provides small, but consistent improvements in performance and that modeling compounds specifically as treatments outperforms approaches that directly align images and compounds in a single representation space. Our findings point to a new direction for representation learning in morphological profiling, suggesting that methods should explicitly account for the multimodal nature of microscopy screening data.

2504.08300 2026-05-18 cs.CL cs.AI

Large Language Models Could Be Rote Learners

Yuyang Xu, Renjun Hu, Haochao Ying, Jian Wu, Xing Shi, Wei Lin

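AI summary This paper examines whether the benchmark performance of large language models (LLMs) is affected by training-data contamination and argues that current benchmark-based evaluation may overestimate models' true capabilities. The authors propose TrinEval, a new evaluation framework that reformulates multiple-choice questions to reduce reliance on memorization and thus assess genuine capability more accurately. Experiments show that mainstream LLMs rely on rote memorization, rather than genuine understanding and reasoning, for about 19.6% of knowledge points across multiple datasets.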

Comments Work in Progress

Details
English abstract

Benchmark-based evaluation, e.g., multiple-choice questions (MCQs) and open-ended questions (OEQs), is widely used for evaluating Large Language Models (LLMs), yet its reliability is undermined by benchmark contamination. When pre-exposed to the testing benchmark during training, less capable LLMs have been found to achieve inflated performance, thereby yielding erroneous results in LLM evaluation. In this study, we reframe contamination as an inherent aspect of learning and seek to disentangle genuine capability acquisition from superficial memorization in LLM evaluation. First, by analyzing model performance under different memorization conditions of MCQs, we uncover a counterintuitive trend: LLMs perform worse on memorized benchmarks than on non-memorized ones, indicating the coexistence of two learning phenomena, i.e., rote memorization and genuine capability learning. To disentangle them, we propose TrinEval, a novel evaluation framework that reformulates MCQs into an alternative knowledge-centric trinity format, reducing memorization while preserving inherent knowledge and enabling the evaluation of genuine capability in the presence of memorization. Extensive experiments validate the effectiveness and robustness of TrinEval in reformulating benchmarks, and the evaluation results further reveal that mainstream LLMs rely on rote memorization for an average of 19.6% of knowledge points across the MMLU and GSM8K datasets.

2504.05451 2026-05-18 cs.CV

ViewBridge: Curriculum Knowledge Distillation for Activity View-Invariance Under Extreme Viewpoint Changes

Arjun Somayazulu, Efi Mavroudi, Changan Chen, Lorenzo Torresani, Kristen Grauman

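AI summary ViewBridge is a framework for learning view-invariant activity representations, designed for the extreme viewpoint changes found in in-the-wild videos. It uses knowledge distillation to preserve action semantics and a curriculum learning strategy that gradually increases viewpoint difficulty for smooth adaptation. Experiments show that ViewBridge outperforms existing methods on two tasks across multiple datasets.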

Details
English abstract

Traditional methods for view-invariant learning rely on controlled multi-view training data with minimal scene clutter. However, they struggle with in-the-wild videos that exhibit extreme viewpoint differences and share little visual content. We introduce ViewBridge, a framework for learning rich video representations in the presence of severe view-occlusions. We introduce a knowledge distillation objective that preserves action-centric semantics, together with a novel curriculum learning procedure that pairs incrementally more challenging views over time, thereby allowing smooth adaptation to extreme viewpoint differences. To sort training video segments for the proposed curriculum, we define a geometry-based metric that reflects their likely occlusion level. While training leverages multi-view data, at inference time, the input is an uncalibrated, single-viewpoint video. Evaluating our approach on two tasks -- temporal keystep grounding and fine-grained keystep recognition -- we outperform SOTA approaches across three datasets (Ego-Exo4D, LEMMA, EPFL-Smart-Kitchen-30). Project page: https://vision.cs.utexas.edu/projects/learning_view_distill/ .

2503.16589 2026-05-18 cs.LG cs.ET math.ST stat.TH

A Statistical Analysis for Per-Instance Evaluation of Stochastic Optimizers: Avoiding Unreliable Conclusions

Moslem Noori, Elisabetta Valiante, Thomas Van Vaerenbergh, Masoud Mohseni, Ignacio Rozada

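AI summary This paper addresses the performance evaluation of stochastic optimizers and proposes a statistical analysis that avoids unreliable conclusions caused by poor experiment design. It analyzes the confidence intervals of common performance metrics and their relationship to the number of experiment repeats, and derives a lower bound on the number of repeats needed to guarantee a given metric accuracy. Based on this bound, the authors propose an algorithm that adaptively adjusts the number of repeats to improve the accuracy and reliability of the evaluation. Experimental results confirm the method's usefulness for benchmarking and hyperparameter tuning.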

Details
Journal ref
Physical Review Applied 25, no. 3 (2026): 034081
English abstract

A key trait of stochastic optimizers is that multiple runs of the same optimizer in attempting to solve the same problem can produce different results. As a result, their performance is evaluated over several repeats, or runs, on the problem. However, the accuracy of the estimated performance metrics depends on the number of runs and should be studied using statistical tools. We present a statistical analysis of the common metrics, and develop guidelines for experiment design to measure the optimizer's performance using these metrics to a high level of confidence and accuracy. To this end, we first discuss the confidence interval of the metrics and how they are related to the number of runs of an experiment. We then derive a lower bound on the number of repeats in order to guarantee achieving a given accuracy in the metrics. Using this bound, we propose an algorithm to adaptively adjust the number of repeats needed to ensure the accuracy of the evaluated metric. Our simulation results demonstrate the utility of our analysis and how it allows us to conduct reliable benchmarking as well as hyperparameter tuning and prevent us from drawing premature conclusions regarding the performance of stochastic optimizers.
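
The link between the number of runs and the accuracy of an estimated metric can be illustrated with a textbook bound. The sketch below is not the paper's derivation: it assumes the metric is a success probability (each run either reaches the target or not) and uses Hoeffding's inequality, which is distribution-free and therefore conservative. The function names are hypothetical.

```python
import math


def hoeffding_min_runs(epsilon: float, delta: float) -> int:
    """Smallest n with P(|p_hat - p| >= epsilon) <= delta under Hoeffding's bound.

    Assumes each run yields an outcome in [0, 1] (e.g. success/failure) and
    that runs are independent. This is an illustrative textbook bound, not
    the bound derived in the paper.
    """
    if not (0.0 < epsilon < 1.0 and 0.0 < delta < 1.0):
        raise ValueError("epsilon and delta must lie strictly between 0 and 1")
    return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))


def hoeffding_half_width(n: int, delta: float) -> float:
    """Half-width of the (1 - delta) confidence interval after n runs."""
    if n <= 0:
        raise ValueError("n must be positive")
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))


if __name__ == "__main__":
    runs = hoeffding_min_runs(epsilon=0.05, delta=0.05)
    print(runs, hoeffding_half_width(runs, delta=0.05))
```

Because the half-width shrinks only as the inverse square root of the number of runs, halving the error margin roughly quadruples the required repeats, which is why a handful of runs rarely supports a confident comparison.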

2503.07518 2026-05-18 cs.CL cs.AI cs.LG

TokenButler: Token Importance is Predictable

Yash Akhauri, Ahmed F AbouElhamayed, Yifei Gao, Chi-Chih Chang, Sameh Gobriel, Nilesh Jain, Mohamed S. Abdelfattah

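AI summary Large language models rely on the key-value cache (KV-Cache) to store token history during decoding, but as the cache grows it becomes a memory and computation bottleneck. This paper proposes TokenButler, a high-granularity, query-aware token-importance predictor that dynamically selects critical tokens under a fixed budget while preserving the full KV cache. The method learns to predict low-dimensional importance queries and combines them with a projection of the cached keys for cheap scoring; experiments show strong performance on long-context tasks and a substantial inference speedup.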

Details
English abstract

Large Language Models (LLMs) rely on the Key-Value (KV) Cache to store token history, enabling efficient decoding of tokens. As the KV-Cache grows, it becomes a major memory and computation bottleneck. However, there is an opportunity to alleviate this bottleneck: prior research has shown that only a small subset of tokens contribute meaningfully to each decoding step. A key challenge in finding these critical tokens is that they are dynamic and heavily dependent on the input query. Existing methods either risk quality by evicting tokens permanently, or retain the full KV-Cache but rely on retrieving chunks of tokens; many existing KV-Cache sparsity methods also rely on inaccurate proxies for token importance. To address these limitations, we introduce TokenButler, a high-granularity, query-aware predictor that learns to identify these critical tokens. TokenButler predicts low-dimensional importance queries at a fixed depth stride, and combines them with a learned projection of the real KV-cache keys to score tokens cheaply, enabling dynamic per-token selection under a fixed budget while preserving the full KV cache. We train TokenButler by distilling the model's masked causal attention distributions, optimizing a lightweight predictor with minimal parameter overhead. We evaluate TokenButler on a novel synthetic small-context co-referential retrieval task, demonstrating near-oracle accuracy where existing methods fail. Furthermore, TokenButler achieves competitive or superior performance on long-context benchmarks (RULER, LongBench), up to ≈1.6× on-GPU speedup using our proposed *prediction interval with neighbor fetching*, which amortizes predictor cost while maintaining accuracy within ≈1.1%, and up to a 7.6× reduction in latency compared to Dense Attention with CPU offloading. Code is available: https://github.com/abdelfattah-lab/TokenButler
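
The scoring-and-selection step described in the abstract can be sketched as follows. This is a minimal illustration, not the TokenButler implementation: it assumes the low-dimensional importance query and the projected keys are already available as plain lists, scores each cached token by a dot product, and keeps the highest-scoring tokens within the budget. All names are hypothetical.

```python
import heapq
from typing import List, Sequence


def score_tokens(importance_query: Sequence[float],
                 projected_keys: Sequence[Sequence[float]]) -> List[float]:
    """Dot-product score of every cached token against the importance query."""
    return [sum(q * k for q, k in zip(importance_query, key))
            for key in projected_keys]


def select_tokens(importance_query: Sequence[float],
                  projected_keys: Sequence[Sequence[float]],
                  budget: int) -> List[int]:
    """Indices of the `budget` highest-scoring tokens, in cache order.

    The full cache is left untouched; only the returned indices would take
    part in attention for this decoding step.
    """
    scores = score_tokens(importance_query, projected_keys)
    budget = max(0, min(budget, len(scores)))
    top = heapq.nlargest(budget, range(len(scores)), key=scores.__getitem__)
    return sorted(top)


if __name__ == "__main__":
    keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
    print(select_tokens([1.0, 0.5], keys, budget=2))
```

Because selection is recomputed for every query and nothing is evicted, a token skipped at one step can still be chosen at a later one.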

2503.02597 2026-05-18 cs.CV cs.AI

Seeing is Understanding: Unlocking Causal Attention into Modality-Mutual Attention for Multimodal LLMs

Wei-Yao Wang, Zhao Wang, Helen Suzuki, Yoshiyuki Kobayashi

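AI summary Recent multimodal large language models (MLLMs) have made notable progress in understanding and reasoning over multimodal information, but alignment between the vision and language modalities remains a key challenge. Working at the architecture level, this paper proposes modality-mutual attention (MMA), which extends causal attention so that image tokens can attend to text tokens, improving the model's understanding of its inputs. The method achieves strong results on multiple multimodal understanding benchmarks without adding parameters, and is designed to be generic and scalable.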

Comments ICML 2026. Code is available at https://github.com/sony/aki

Details
English abstract

Recent Multimodal Large Language Models (MLLMs) have demonstrated significant progress in perceiving and reasoning over multimodal inquiries, ushering in a new research era for foundation models. However, vision-language misalignment in MLLMs has emerged as a critical challenge, where the textual responses generated by these models are not factually aligned with the given text-image inputs. Existing efforts to address vision-language misalignment have focused on developing specialized vision-language connectors or leveraging visual instruction tuning from diverse domains. In this paper, we tackle this issue from a fundamental yet unexplored perspective by revisiting the core architecture of MLLMs. Most MLLMs are built on decoder-only LLMs with a causal attention mechanism, which limits the ability of the earlier modalities (e.g., images) to incorporate information from the later modalities (e.g., text). To address this problem, we propose an MLLM that unlocks causal attention into our proposed modality-mutual attention (MMA) to enable image tokens to attend to text tokens. This simple yet effective design allows MMA to achieve state-of-the-art performance in 12 multimodal understanding benchmarks (+6.2% on average across 3 LLM backbones) without introducing additional parameters. Our MMA design is intended to be generic, allowing for applications across various modalities, and scalable to accommodate diverse multimodal scenarios.
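
The difference between causal attention and the modality-mutual variant described above can be shown with a boolean attention mask. The sketch below rests on assumptions that are not stated in the abstract: image tokens come first in the sequence, text tokens follow, and every image token may attend to every text token. The function name is hypothetical.

```python
from typing import List


def modality_mutual_mask(num_image_tokens: int,
                         num_text_tokens: int) -> List[List[bool]]:
    """mask[i][j] is True when query position i may attend to key position j.

    Layout assumed here: positions [0, num_image_tokens) are image tokens,
    the remaining positions are text tokens.
    """
    total = num_image_tokens + num_text_tokens
    mask = []
    for i in range(total):
        row = []
        for j in range(total):
            causal = j <= i  # standard decoder-only rule
            # Extra permission: an image query may look ahead at text keys.
            image_to_text = i < num_image_tokens and j >= num_image_tokens
            row.append(causal or image_to_text)
        mask.append(row)
    return mask


if __name__ == "__main__":
    for row in modality_mutual_mask(2, 2):
        print(["x" if allowed else "." for allowed in row])
```

Text rows keep the ordinary causal pattern, so only the image rows change, and the change is purely in the mask, which is consistent with the abstract's claim that no parameters are added.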

2502.12187 2026-05-18 cs.CL cs.FL cs.LG math.ST stat.ML stat.TH

Hallucinations are inevitable but can be made statistically negligible

Atsushi Suzuki, Yulan He, Feng Tian, Zhongyuan Wang

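AI summary This paper examines the inevitability of "hallucination" in language models, that is, the generation of nonfactual content. Although prior work has shown from a computability-theoretic perspective that any language model will hallucinate on an infinite set of inputs, this paper argues from a probabilistic perspective that hallucinations can be made statistically negligible as long as the training data is sufficient in quality and quantity. The authors argue that while the computability-theoretic result is of theoretical interest, the probabilistic result better reflects practical needs and provides a new theoretical basis for mitigating hallucinations.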

Details
English abstract

Hallucinations, a phenomenon where a language model (LM) generates nonfactual content, pose a significant challenge to the practical deployment of LMs. While many empirical methods have been proposed to mitigate hallucinations, recent studies established a computability-theoretic result showing that any LM will inevitably generate hallucinations on an infinite set of inputs, regardless of the quality and quantity of training datasets and the choice of the language model architecture and training and inference algorithms. Although the computability-theoretic result may seem pessimistic, its significance from a practical viewpoint has remained unclear. This paper claims that those "innate" inevitability results from computability theory and the diagonal argument cannot, in principle, explain practical issues of LLMs. We demonstrate this claim by presenting a positive theoretical result from a probabilistic perspective. Specifically, we prove that hallucinations can be made statistically negligible, provided that the quality and quantity of the training data are sufficient. Interestingly, our positive result coexists with the computability-theoretic result, implying that while hallucinations on an infinite set of inputs cannot be entirely eliminated, their probability can always be reduced by improving algorithms and training data. By evaluating the two seemingly contradictory results through the lens of information theory, we argue that our probability-theoretic positive result better reflects practical considerations than the computability-theoretic negative result.

2501.19128 2026-05-18 cs.LG cs.AI

Shaping Sparse Rewards in Reinforcement Learning: A Semi-supervised Approach

Wenyun Li, Wenjie Huang, Chen Sun

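AI summary In reinforcement learning, sparse reward signals make it hard to learn a reward function. This paper proposes a semi-supervised approach that combines non-zero-reward transitions with data augmentation, using the large number of zero-reward transitions to learn trajectory representations and thereby improve reward shaping. Experiments show that the method outperforms supervised approaches on Atari and robotic manipulation tasks; in sparser-reward environments its peak score reaches up to twice that of supervised baselines.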

Details
English abstract

In many real-world scenarios, reward signals for agents are exceedingly sparse, making it challenging to learn an effective reward function for reward shaping. To address this issue, the proposed approach performs reward shaping not only by utilizing non-zero-reward transitions but also by employing Semi-Supervised Learning (SSL) combined with a novel data augmentation to learn trajectory space representations from the majority of transitions, i.e., zero-reward transitions, thereby improving the efficacy of reward shaping. Experimental results in Atari and robotic manipulation demonstrate that our method outperforms supervised-based approaches in reward inference, leading to higher agent scores. Notably, in sparser-reward environments, our method achieves up to twice the peak scores of supervised baselines. The proposed double entropy data augmentation enhances performance, showing a 15.8% increase in best score over other augmentation methods.

2501.17116 2026-05-18 cs.LG cs.CL

Optimizing Large Language Model Training Using FP4 Quantization

Ruizhe Wang, Yeyun Gong, Xiao Liu, Guoshuai Zhao, Ziyue Yang, Baining Guo, Zhengjun Zha, Peng Cheng

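AI summary As the computational demands of training large language models (LLMs) keep growing, improving training efficiency has become a key problem. This paper presents the first FP4-quantized training framework for LLMs. It uses a differentiable quantization estimator and an outlier clamping and compensation strategy to address the large quantization errors and limited representational capacity of FP4, and combines mixed-precision training with vector-wise quantization for stability. Experiments show that the framework achieves accuracy close to BF16 and FP8 while scaling effectively to very large models.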

Details
Journal ref
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:62937-62957, 2025
English abstract

The growing computational demands of training large language models (LLMs) necessitate more efficient methods. Quantized training presents a promising solution by enabling low-bit arithmetic operations to reduce these costs. While FP8 precision has demonstrated feasibility, leveraging FP4 remains a challenge due to significant quantization errors and limited representational capacity. This work introduces the first FP4 training framework for LLMs, addressing these challenges with two key innovations: a differentiable quantization estimator for precise weight updates and an outlier clamping and compensation strategy to prevent activation collapse. To ensure stability, the framework integrates a mixed-precision training scheme and vector-wise quantization. Experimental results demonstrate that our FP4 framework achieves accuracy comparable to BF16 and FP8, with minimal degradation, scaling effectively to 13B-parameter LLMs trained on up to 100B tokens. With the emergence of next-generation hardware supporting FP4, our framework sets a foundation for efficient ultra-low precision training.
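
Why outlier clamping matters for vector-wise FP4 quantization can be seen in a small numerical sketch. The code below is an illustration under stated assumptions, not the paper's framework: it assumes the E2M1 value grid for FP4, absmax scaling per vector, a clamping threshold passed in directly, and a residual kept at higher precision as the "compensation". The differentiable quantization estimator is not covered. All function names are hypothetical.

```python
from typing import List, Tuple

# Magnitudes representable in FP4 with an E2M1 layout (assumed format).
FP4_E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def round_to_fp4(value: float) -> float:
    """Round to the nearest grid magnitude, keeping the sign (ties go low)."""
    magnitude = min(FP4_E2M1_GRID, key=lambda g: abs(g - abs(value)))
    return -magnitude if value < 0 else magnitude


def quantize_vector(vec: List[float]) -> Tuple[List[float], float]:
    """Vector-wise quantization: one absmax-derived scale per vector."""
    absmax = max((abs(v) for v in vec), default=0.0)
    if absmax == 0.0:
        return [0.0] * len(vec), 1.0
    scale = absmax / FP4_E2M1_GRID[-1]
    return [round_to_fp4(v / scale) for v in vec], scale


def dequantize(codes: List[float], scale: float) -> List[float]:
    return [c * scale for c in codes]


def clamp_with_residual(vec: List[float],
                        threshold: float) -> Tuple[List[float], List[float]]:
    """Clamp to [-threshold, threshold]; the residual holds what was cut off."""
    clamped = [max(-threshold, min(threshold, v)) for v in vec]
    residual = [v - c for v, c in zip(vec, clamped)]
    return clamped, residual


if __name__ == "__main__":
    activations = [0.1, -0.2, 0.3, 12.0]  # one large outlier
    print(dequantize(*quantize_vector(activations)))
    clamped, residual = clamp_with_residual(activations, threshold=0.3)
    print(dequantize(*quantize_vector(clamped)), residual)
```

Without clamping, the single outlier sets the scale and the three small entries all round to zero; after clamping, the small entries survive quantization and the outlier's excess is carried separately in the residual.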

2412.02271 2026-05-18 cs.CL

The MediaSpin Dataset: Post-Publication News Headline Edits Annotated for Media Bias

Preetika Verma, Kokil Jaidka

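AI summary This paper introduces the MediaSpin dataset, a large-scale language resource recording how major news outlets modify headlines after publication, together with the companion MediaSpin-in-the-Wild dataset for analyzing social media engagement with the revised headlines. The dataset contains 78,910 headline pairs annotated for 13 types of media bias, covering subjective and objective forms, with annotation performed by an expert-validated large language model pipeline. The study demonstrates applications in cross-national analysis, bias classification, and social media behavior analysis, revealing regional framing asymmetries, measurable linguistic markers, and higher engagement with biased content.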

Comments 8 pages, 3 figures, 8 tables. Accepted at AAAI ICWSM 2026. We updated the paper title from "MediaSpin: Exploring Media Bias Through Fine-Grained Analysis of News Headlines" to "The MediaSpin Dataset: Post-Publication News Headline Edits Annotated for Media Bias".

Details
English abstract

We present MediaSpin, a large-scale language resource capturing how major news outlets modify headlines after publication, and MediaSpin-in-the-Wild, a complementary dataset linking these revised headlines to their downstream engagement on social media. The increasing editability of online news headlines offers new opportunities to study linguistic framing and bias through the lens of editorial revisions. The dataset contains 78,910 headline pairs annotated for 13 types of media bias, grounded in established media-bias taxonomies, covering both subjective (e.g., sensationalism, spin) and objective (e.g., omission, slant) forms, with annotation conducted through a human-supervised large-language-model pipeline with expert validation and quality control. We describe the annotation schema and demonstrate three downstream applications: (1) cross-national analysis of how country references are added or removed during editing, (2) transformer-based bias classification at both binary and fine-grained levels, and (3) behavioral analysis of biased headlines on X (Twitter) using 180,786 news-related tweets from 819 consenting users. The results reveal regional asymmetries in representational framing, measurable linguistic markers, and consistently higher engagement with biased content. MediaSpin and MediaSpin-in-the-Wild together provide a reproducible benchmark for bias detection and the study of editorial and behavioral dynamics in contemporary media ecosystems.

2410.01990 2026-05-18 cs.LG cs.CE

Deep Learning Alternatives of the Kolmogorov Superposition Theorem

Leonardo Ferreira Guilhoto, Paris Perdikaris

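AI summary This paper explores alternative formulations of the Kolmogorov Superposition Theorem (KST) as a foundation for neural network design. The original KST is mathematically elegant but poses practical challenges because it offers limited insight into the structure of the inner and outer functions and introduces a large number of unknown variables. The authors propose ActNet, a scalable deep learning model that overcomes many drawbacks of the original KST, and evaluate it within the Physics-Informed Neural Network (PINN) framework. Results show that ActNet outperforms KST-based Kolmogorov-Arnold Networks on tasks such as simulating partial differential equations and is competitive with traditional multilayer perceptrons.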

Details
Journal ref
Guilhoto, Leonardo Ferreira, and Paris Perdikaris. "Deep Learning Alternatives Of The Kolmogorov Superposition Theorem." The Thirteenth International Conference on Learning Representations (ICLR 2025)
English abstract

This paper explores alternative formulations of the Kolmogorov Superposition Theorem (KST) as a foundation for neural network design. The original KST formulation, while mathematically elegant, presents practical challenges due to its limited insight into the structure of inner and outer functions and the large number of unknown variables it introduces. Kolmogorov-Arnold Networks (KANs) leverage KST for function approximation, but they have faced scrutiny due to mixed results compared to traditional multilayer perceptrons (MLPs) and practical limitations imposed by the original KST formulation. To address these issues, we introduce ActNet, a scalable deep learning model that builds on the KST and overcomes many of the drawbacks of Kolmogorov's original formulation. We evaluate ActNet in the context of Physics-Informed Neural Networks (PINNs), a framework well-suited for leveraging KST's strengths in low-dimensional function approximation, particularly for simulating partial differential equations (PDEs). In this challenging setting, where models must learn latent functions without direct measurements, ActNet consistently outperforms KANs across multiple benchmarks and is competitive against the current best MLP-based approaches. These results present ActNet as a promising new direction for KST-based deep learning applications, particularly in scientific computing and PDE simulation tasks.

2409.11022 2026-05-18 cs.CL cs.AI

DynamicNER: A Dynamic, Multilingual, and Fine-Grained Dataset for LLM-based Named Entity Recognition

Hanjun Luo, Yingbin Jin, Xinfeng Li, Xuecheng Liu, Ruizhe Chen, Tong Shang, Kun Wang, Qingsong Wen, Zuozhu Liu

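AI summary As large language models (LLMs) are increasingly applied to named entity recognition (NER), existing datasets no longer meet the needs of LLM-based methods in corpus selection and design logic. This paper proposes DynamicNER, a dynamic, multilingual, fine-grained NER dataset designed for LLMs, in which the same entity can take different entity types in different contexts; it covers 8 languages and 155 entity types across a wide range of domains. The paper also proposes CascadeNER, which uses a two-stage strategy and lightweight LLMs for more efficient fine-grained recognition. Experiments show that DynamicNER is an effective benchmark for LLM-based NER.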

Comments Accepted to the EMNLP 2025 Main Conference

Details
English abstract

Advances in Large Language Models (LLMs) have spurred growing interest in their application to Named Entity Recognition (NER). However, existing datasets are primarily designed for traditional machine learning methods and are inadequate for LLM-based methods in terms of corpus selection and overall dataset design logic. Moreover, the prevalent fixed and relatively coarse-grained entity categorization in existing datasets fails to adequately assess the superior generalization and contextual understanding capabilities of LLM-based methods, thereby hindering a comprehensive demonstration of their broad application prospects. To address these limitations, we propose DynamicNER, the first NER dataset designed for LLM-based methods with dynamic categorization, introducing various entity types and entity type lists for the same entity in different contexts, which better exploits the generalization ability of LLM-based NER. The dataset is also multilingual and multi-granular, covering 8 languages and 155 entity types, with corpora spanning a diverse range of domains. Furthermore, we introduce CascadeNER, a novel NER method based on a two-stage strategy and lightweight LLMs, achieving higher accuracy on fine-grained tasks while requiring fewer computational resources. Experiments show that DynamicNER serves as a robust and effective benchmark for LLM-based NER methods. We also analyze traditional and LLM-based methods on our dataset. Our code and dataset are openly available at https://github.com/Astarojth/DynamicNER.