arXivDaily: arXiv daily academic digest, updated Monday to Friday
2602.06136 2026-06-02 cs.LG cs.CV

Tempora: Characterising the Time-Contingent Utility of Online Test-Time Adaptation

Sudarshan Sreeram, Young D. Kwon, Cecilia Mascolo

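Affiliations: University of Bristol

AI summary: Proposes the Tempora framework, which uses temporal scenarios, evaluation protocols, and time-contingent utility metrics to systematically evaluate the accuracy-latency trade-off of test-time adaptation methods under latency constraints, and shows that conventional rankings break down under temporal pressure.

Comments: Accepted to ICML 2026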

Abstract

Test-time adaptation (TTA) offers a compelling remedy for machine learning (ML) models that degrade under domain shifts, improving generalisation on-the-fly with only unlabelled samples. This flexibility suits real deployments, yet conventional evaluations unrealistically assume unbounded processing time, overlooking the accuracy-latency trade-off. As ML increasingly underpins latency-sensitive and user-facing use-cases, temporal pressure constrains the viability of adaptable inference; predictions arriving too late to act on are futile. We introduce Tempora, a framework for evaluating TTA under this pressure. It consists of temporal scenarios that model deployment constraints, evaluation protocols that operationalise measurement, and time-contingent utility metrics that quantify the accuracy-latency trade-off. We instantiate the framework with three such metrics: (1) discrete utility for asynchronous streams with hard deadlines, (2) continuous utility for interactive settings where value decays with latency, and (3) amortised utility for budget-constrained deployments. By applying Tempora to 11 TTA methods, we find that rank instability persists across 750+ temporal evaluations spanning diverse datasets, models, and hardware platforms; i.e., conventional rankings do not predict rankings under temporal pressure. The highest-utility method varies with the shift and temporal pressure, with no clear winner. By enabling systematic evaluation across diverse temporal constraints for the first time, Tempora reveals when and why rankings change, offering practitioners a lens for method selection and researchers a target for deployable adaptation. Code: https://github.com/sudotensor/tempora.
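
The three metric families named in the abstract can be pictured with a toy sketch. The formulas below are assumptions chosen for illustration (the paper defines the actual metrics): discrete utility counts a prediction only if it is correct and meets a hard deadline, continuous utility discounts a correct prediction by `exp(-latency / tau)`, and amortised utility divides correct predictions by total processing time. All function names and measurements are hypothetical.

```python
import math
from typing import Sequence


def discrete_utility(correct: Sequence[bool], latency_s: Sequence[float], deadline_s: float) -> float:
    """Fraction of predictions that are correct AND arrive by the deadline (assumed form)."""
    on_time_hits = sum(1 for c, lat in zip(correct, latency_s) if c and lat <= deadline_s)
    return on_time_hits / len(correct)


def continuous_utility(correct: Sequence[bool], latency_s: Sequence[float], tau_s: float) -> float:
    """Mean value when a correct answer is worth exp(-latency / tau) (assumed form)."""
    value = sum(math.exp(-lat / tau_s) for c, lat in zip(correct, latency_s) if c)
    return value / len(correct)


def amortised_utility(correct: Sequence[bool], latency_s: Sequence[float]) -> float:
    """Correct predictions per second of total processing time (assumed form)."""
    return sum(correct) / sum(latency_s)


# Hypothetical measurements: a fast non-adaptive baseline and a slower adaptive method.
fast = ([True, False, True, False], [0.01, 0.01, 0.01, 0.01])
slow = ([True, True, True, False], [0.05, 0.05, 0.05, 0.05])

for deadline in (0.02, 0.10):
    print(deadline, discrete_utility(*fast, deadline), discrete_utility(*slow, deadline))
```

With a 0.02 s deadline the fast baseline scores 0.5 and the adaptive method 0.0; relaxing the deadline to 0.10 s lifts the adaptive method to 0.75. This is the kind of ranking flip under temporal pressure that the abstract describes, reproduced here only on made-up numbers.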

2602.06033 2026-06-02 cs.LG

Can Vision Language Models Learn Intuitive Physics from Interaction?

Luca M. Schulze Buschoff, Konstantinos Voudouris, Can Demircan, Eric Schulz

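AI summary: Trains vision language models through reinforcement-learning interaction with an environment and finds that learning from interaction improves within-task performance but does not yield generalizable physical intuitions.

Comments: Updated accepted version for ICML'26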

Abstract

Pre-trained vision language models do not have good intuitions about the physical world. Recent work has shown that supervised fine-tuning can improve model performance on simple physical tasks. However, fine-tuned models do not appear to learn robust physical rules that can generalize to new contexts. Based on research in cognitive science, we hypothesize that models need to interact with an environment to properly learn its physical dynamics. We train models that learn through interaction with a simulated environment using reinforcement learning. While learning from interaction allows models to improve their within-task performance, it fails to produce models with generalizable physical intuitions. We find that models trained on one task do not reliably generalize to related tasks, even if the tasks share visual statistics and physical principles, and regardless of whether the models are trained through interaction.

2602.05970 2026-06-02 cs.LG cs.AI math.DS stat.ML

Inverse Depth Scaling From Most Layers Being Similar

Yizhou Liu, Sara Kangaslahti, Ziming Liu, Jeff Gore

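Affiliations: University of California, Berkeley

AI summary: By analysing large language models and toy residual networks, finds that loss scales inversely with depth, attributes this to functionally similar layers reducing error through ensemble averaging rather than compositional learning or discretizing smooth dynamics, and suggests that architectural innovations are needed to encourage compositional use of depth.

Comments: Camera-ready version, ICML 2026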

Abstract

Neural scaling laws relate loss to model size in large language models (LLMs), yet depth and width may contribute to performance differently, requiring more detailed studies. Here, we quantify how depth affects loss via analysis of LLMs and toy residual networks. We find that loss scales inversely with depth in LLMs, probably because functionally similar layers reduce error through ensemble averaging rather than through compositional learning or discretizing smooth dynamics. This regime is inefficient yet robust and may arise from the architectural bias of residual networks and target functions incompatible with smooth dynamics. The findings suggest that improving LLM efficiency may require architectural innovations to encourage compositional use of depth.
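
The ensemble-averaging intuition can be checked on a toy problem: averaging L independent, unbiased, equally noisy estimates gives a mean squared error of sigma^2 / L, so error falls inversely with L. The simulation below is only a sketch of that statistical fact, under the assumption that each "layer" contributes independent Gaussian noise; it is not the paper's experiment, and `ensemble_mse` is a hypothetical helper.

```python
import random


def ensemble_mse(num_layers: int, noise_std: float = 1.0, trials: int = 20000, seed: int = 0) -> float:
    """Mean squared error of the average of `num_layers` independent noisy estimates of 0."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        avg = sum(rng.gauss(0.0, noise_std) for _ in range(num_layers)) / num_layers
        total += avg ** 2
    return total / trials


for depth in (1, 4, 16):
    mse = ensemble_mse(depth)
    print(f"depth={depth:2d}  mse={mse:.4f}  depth*mse={depth * mse:.3f}")
```

The product `depth * mse` stays close to 1 across depths, which is the inverse scaling one would expect if layers act like ensemble members rather than composing.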

2602.05951 2026-06-02 cs.CV cs.AI cs.LG

Better Source, Better Flow: Learning Condition-Dependent Source Distribution for Flow Matching

Junwan Kim, Jiho Park, Seonghu Jeon, Seungryong Kim

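Affiliations: New York University; KAIST AI

AI summary: Proposes learning a condition-dependent source distribution within the flow matching framework, using variance regularization and source-target directional alignment, significantly improving the speed and quality of text-to-image generation.

Comments: Project Page: https://junwankimm.github.io/CSFM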

Abstract

Flow matching has recently emerged as a promising alternative to diffusion-based generative models, particularly for text-to-image generation. Despite its flexibility in allowing arbitrary source distributions, most existing approaches rely on a standard Gaussian distribution, a choice inherited from diffusion models, and rarely consider the source distribution itself as an optimization target in such settings. In this work, we show that principled design of the source distribution is not only feasible but also beneficial at the scale of modern text-to-image systems. Specifically, we propose learning a condition-dependent source distribution under the flow matching objective that better exploits rich conditioning signals. We identify key failure modes that arise when directly incorporating conditioning into the source, including distributional collapse and instability, and show that appropriate variance regularization and directional alignment between source and target are critical for stable and effective learning. We further analyze how the choice of target representation space impacts flow matching with structured sources, revealing regimes in which such designs are most effective. Extensive experiments across multiple text-to-image benchmarks demonstrate consistent and robust improvements, including up to 3x faster convergence in FID, highlighting the practical benefits of a principled source distribution design for conditional flow matching.

2602.05435 2026-06-02 cs.CV

Stable Velocity: A Variance Perspective on Flow Matching

Donglin Yang, Yongxing Zhang, Xin Yu, Liang Hou, Xin Tao, Pengfei Wan, Xiaojuan Qi, Renjie Liao

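Affiliations: University of Science and Technology of China

AI summary: To address the high-variance training targets caused by single-sample conditional velocities in flow matching, proposes the Stable Velocity framework, which characterizes the variance to identify high- and low-variance regimes and introduces an unbiased variance-reduction objective (StableVM), variance-aware representation alignment (VA-REPA), and finetuning-free accelerated sampling (StableVS), achieving better training efficiency and more than 2x faster sampling on several large-scale models.

Comments: ICML 2026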

Abstract

While flow matching is elegant, its reliance on single-sample conditional velocities leads to high-variance training targets that destabilize optimization and slow convergence. By explicitly characterizing this variance, we identify 1) a high-variance regime near the prior, where optimization is challenging, and 2) a low-variance regime near the data distribution, where conditional and marginal velocities nearly coincide. Leveraging this insight, we propose Stable Velocity, a unified framework that improves both training and sampling. For training, we introduce Stable Velocity Matching (StableVM), an unbiased variance-reduction objective, along with Variance-Aware Representation Alignment (VA-REPA), which adaptively strengthens auxiliary supervision in the low-variance regime. For inference, we show that dynamics in the low-variance regime admit closed-form simplifications, enabling Stable Velocity Sampling (StableVS), a finetuning-free acceleration. Extensive experiments on ImageNet $256\times256$ and large pretrained text-to-image and text-to-video models, including SD3.5, Flux, Qwen-Image, and Wan2.2, demonstrate consistent improvements in training efficiency and more than $2\times$ faster sampling within the low-variance regime without degrading sample quality. Our code is available at https://github.com/linYDTHU/StableVelocity.
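
The two variance regimes can be seen on a one-dimensional toy problem. The sketch below assumes the common linear path x_t = (1 - t) x0 + t x1 with a standard Gaussian prior x0 and a two-point "dataset" {-1, +1}; the paper's path, time convention, and estimators may differ, and this does not implement StableVM. Under these assumptions the single-sample target is u = x1 - x0 = (x1 - x_t) / (1 - t), the marginal velocity is its posterior mean given x_t, and the function returns the average posterior variance of u.

```python
import math
import random

DATA = [-1.0, 1.0]  # toy 1-D dataset (an assumption for illustration)


def conditional_velocity_variance(t: float, num_samples: int = 5000, seed: int = 0) -> float:
    """Average Var[u | x_t] for x_t = (1 - t) * x0 + t * x1, x0 ~ N(0, 1), 0 <= t < 1."""
    rng = random.Random(seed)
    sigma = 1.0 - t
    total = 0.0
    for _ in range(num_samples):
        x_t = sigma * rng.gauss(0.0, 1.0) + t * rng.choice(DATA)
        # Posterior over the data point behind x_t: p(d | x_t) is proportional to N(x_t; t*d, sigma^2).
        log_w = [-((x_t - t * d) ** 2) / (2 * sigma ** 2) for d in DATA]
        top = max(log_w)
        w = [math.exp(lw - top) for lw in log_w]
        norm = sum(w)
        w = [wi / norm for wi in w]
        # Conditional velocity for each candidate data point d.
        u = [(d - x_t) / sigma for d in DATA]
        mean_u = sum(wi * ui for wi, ui in zip(w, u))  # marginal velocity at x_t
        total += sum(wi * (ui - mean_u) ** 2 for wi, ui in zip(w, u))
    return total / num_samples


for t in (0.05, 0.95):
    print(f"t={t:.2f}  mean Var[u | x_t] = {conditional_velocity_variance(t):.4f}")
```

In this toy setting the variance is of order one near the prior (small t) and essentially zero near the data (t close to 1), where the conditional target and the marginal velocity nearly coincide.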

2504.15371 2026-06-02 cs.CV cs.NE

Event2Vec: Processing Neuromorphic Events Directly by Representations in Vector Space

Wei Fang, Priyadarshini Panda

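Affiliations: University of California, Berkeley

AI summary: Proposes the Event2Vec representation, which lets Transformers process sparse asynchronous event data directly, achieving high accuracy, low latency, and high throughput on multiple benchmarks.

Comments: Accepted at ICML 2026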

Abstract

Neuromorphic event cameras possess superior temporal resolution, power efficiency, and dynamic range compared to traditional cameras. However, their asynchronous and sparse data format poses a significant challenge for conventional deep learning methods. Most existing methods either densify events into frames, sacrificing their sparse asynchronous nature, or use irregular models that are less compatible with GPU acceleration. Inspired by word-to-vector models, we propose event2vec, a novel representation that allows Transformers to process events directly. We demonstrate the effectiveness of event2vec on the DVS Gesture, ASL-DVS, and DVS-Lip benchmarks, showing that event2vec is remarkably parameter-efficient, features high throughput and low latency, and achieves high accuracy even with an extremely low number of events or low spatial resolutions. These results show that sparse asynchronous event data can be directly integrated into high-throughput Transformer architectures, offering an efficient paradigm for real-time neuromorphic vision. The code is provided at https://github.com/Intelligent-Computing-Lab-Panda/event2vec.

2602.05293 2026-06-02 cs.CV

Fast-SAM3D: 3Dfy Anything in Images but Faster

Weilun Feng, Mingqiang Wu, Zhiliang Chen, Chuanguang Yang, Haotong Qin, Yuqi Li, Xiaokun Liu, Guoxin Fan, Libo Huang, Yulun Zhang, Michele Magno, Yongjun Xu, Zhulin An

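Affiliations: University of Science and Technology of China; Tsinghua University

AI summary: Proposes Fast-SAM3D, a training-free acceleration framework for 3D reconstruction that uses multi-level heterogeneity-aware mechanisms (modality-aware step caching, joint spatiotemporal token carving, and spectral-aware token aggregation) to achieve up to 2.67x end-to-end speedup with negligible fidelity loss.

Comments: Accepted by ICML 2026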

Abstract

SAM3D enables scalable, open-world 3D reconstruction from complex scenes, yet its deployment is hindered by prohibitive inference latency. In this work, we conduct the first systematic investigation into its inference dynamics, revealing that generic acceleration strategies are brittle in this context. We demonstrate that these failures stem from neglecting the pipeline's inherent multi-level heterogeneity: the kinematic distinctiveness between shape and layout, the intrinsic sparsity of texture refinement, and the spectral variance across geometries. To address this, we present Fast-SAM3D, a training-free framework that dynamically aligns computation with instantaneous generation complexity. Our approach integrates three heterogeneity-aware mechanisms: (1) Modality-Aware Step Caching to decouple structural evolution from sensitive layout updates; (2) Joint Spatiotemporal Token Carving to concentrate refinement on high-entropy regions; and (3) Spectral-Aware Token Aggregation to adapt decoding resolution. Extensive experiments demonstrate that Fast-SAM3D delivers up to 2.67$\times$ end-to-end speedup with negligible fidelity loss, establishing a new Pareto frontier for efficient single-view 3D generation. Our code is released at https://github.com/wlfeng0509/Fast-SAM3D.

2602.05217 2026-06-02 cs.CV

Cross-Domain Few-Shot Segmentation via Multi-view Progressive Adaptation

Jiahao Nie, Guanqiao Fu, Wenbin An, Yap-Peng Tan, Alex C. Kot, Shijian Lu

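Affiliations: Interdisciplinary Graduate Programme, Nanyang Technological University; Nanyang Technological University; Xi’an Jiaotong University; VinUniversity; SMBU

AI summary: Proposes Multi-view Progressive Adaptation, which uses Hybrid Progressive Augmentation and Dual-chain Multi-view Prediction to progressively adapt few-shot capability to the target domain from both data and strategy perspectives, significantly improving cross-domain few-shot segmentation performance.

Comments: Accepted by CVPR 2026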

Abstract

Cross-Domain Few-Shot Segmentation aims to segment categories in data-scarce domains conditioned on a few exemplars. Typical methods first establish few-shot capability in a large-scale source domain and then adapt it to target domains. However, due to the limited quantity and diversity of target samples, existing methods still exhibit constrained performance. Moreover, the source-trained model's initially weak few-shot capability in target domains, coupled with substantial domain gaps, severely hinders the effective utilization of target samples and further impedes adaptation. To this end, we propose Multi-view Progressive Adaptation, which progressively adapts few-shot capability to target domains from both data and strategy perspectives. (i) From the data perspective, we introduce Hybrid Progressive Augmentation, which progressively generates more diverse and complex views through cumulative strong augmentations, thereby creating increasingly challenging learning scenarios. (ii) From the strategy perspective, we design Dual-chain Multi-view Prediction, which fully leverages these progressively complex views through sequential and parallel learning paths under extensive supervision. By jointly enforcing prediction consistency across diverse and complex views, MPA achieves both robust and accurate adaptation to target domains. Extensive experiments demonstrate that MPA effectively adapts few-shot capability to target domains, outperforming state-of-the-art methods by a large margin (+7.0%).

2511.16886 2026-06-02 cs.CL cs.AI cs.LG

Latent Reasoning in TRMs is Secretly a Policy Improvement Operator

Arip Asadulaev, Rayan Banerjee, Fakhri Karray, Martin Takac

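AI summary: Formalizes latent recursive reasoning as a policy improvement algorithm to explain when recursive steps effectively improve performance, and proposes training schemes that combine reinforcement learning and diffusion methods, reducing forward passes by 18x on the Tiny Recursive Model while maintaining performance.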

Abstract

Recently, small models with latent recursion have obtained promising results on complex reasoning tasks. These results are typically explained by the theory that such recursion increases a network's depth, allowing it to compactly emulate the capacity of larger models. However, the performance of recursively added layers remains behind the capabilities of one-pass models with the same feed-forward depth. This means that in the looped version, not every recursive step effectively contributes to depth. This raises the question: when and why does latent reasoning improve performance, and when does it result in dead compute? In our work, we demonstrate that latent recursive reasoning provides an answer to this question. We show that latent recursive reasoning can be formalized as a policy improvement algorithm. Building on these insights, we propose to use training schemes from reinforcement learning and diffusion methods for latent reasoning models. Using the Tiny Recursive Model as our testbed, we show that with our modifications we can avoid dead compute steps and reduce the total number of forward passes by 18x while maintaining performance. Broadly speaking, we show how a policy improvement perspective on recursive steps can explain model behavior and provide insights for further improvements.

2602.04861 2026-06-02 cs.LG cond-mat.mtrl-sci cs.AI physics.chem-ph

From Evaluation to Design: Using Potential Energy Surface Smoothness Metrics to Guide Machine Learning Interatomic Potential Architectures

Ryan Liu, Eric Qu, Tobias Kreiman, Samuel M. Blau, Aditi S. Krishnapriyan

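Affiliations: University of California, Berkeley; Stanford University

AI summary: Proposes the Bond Smoothness Characterization Test (BSCT) as an efficient metric for the smoothness of the potential energy surface of machine learning interatomic potentials (MLIPs); the metric correlates strongly with molecular dynamics stability and guides model design to reduce artifacts.

Comments: Accepted at the International Conference on Machine Learning (ICML) 2026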

Abstract

Machine Learning Interatomic Potentials (MLIPs) sometimes fail to reproduce the physical smoothness of the quantum potential energy surface (PES), leading to erroneous behavior in downstream simulations that standard energy and force regression evaluations can miss. Existing evaluations, such as microcanonical molecular dynamics (MD), are computationally expensive and primarily probe near-equilibrium states. To improve evaluation metrics for MLIPs, we introduce the Bond Smoothness Characterization Test (BSCT). This efficient benchmark probes the PES via controlled bond deformations and detects non-smoothness, including discontinuities, artificial minima, and spurious forces, both near and far from equilibrium. We show that BSCT correlates strongly with MD stability while requiring a fraction of the cost of MD. To demonstrate how BSCT can guide iterative model design, we utilize an unconstrained Transformer backbone as a testbed, illustrating how refinements such as a new differentiable $k$-nearest neighbors algorithm and temperature-controlled attention reduce artifacts identified by our metric. By optimizing model design systematically based on BSCT, the resulting MLIP simultaneously achieves a low conventional E/F regression error, stable MD simulations, and robust atomistic property predictions. Our results establish BSCT both as a validation metric for practitioners to assess MLIP utility and as an "in-the-loop" model design proxy that alerts MLIP developers to physical challenges that cannot be efficiently evaluated by current MLIP benchmarks. The BSCT dataset and evaluation are available at https://github.com/ryanliu30/bsct.git.
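
The kind of check the abstract describes (scan a bond, then look for artificial minima and force discontinuities) can be sketched on a one-dimensional energy curve. This is not the BSCT implementation: the Morse potential stands in for a smooth reference curve, the two defects are injected by hand, and the function names, grid, and defect sizes are all assumptions for illustration.

```python
import math
from typing import List


def morse(r: float, depth: float = 1.0, a: float = 1.5, r_eq: float = 1.0) -> float:
    """Morse potential, used here as a stand-in for a smooth bond-stretch curve."""
    return depth * (1.0 - math.exp(-a * (r - r_eq))) ** 2


def count_local_minima(energies: List[float]) -> int:
    """Number of interior grid points strictly lower than both neighbours."""
    return sum(
        1 for i in range(1, len(energies) - 1)
        if energies[i] < energies[i - 1] and energies[i] < energies[i + 1]
    )


def max_force_jump(energies: List[float], dr: float) -> float:
    """Largest change between consecutive central-difference forces F = -dE/dr."""
    forces = [-(energies[i + 1] - energies[i - 1]) / (2 * dr) for i in range(1, len(energies) - 1)]
    return max(abs(f1 - f0) for f0, f1 in zip(forces, forces[1:]))


dr = 0.01
grid = [0.5 + dr * i for i in range(251)]  # bond lengths from 0.5 to 3.0
smooth = [morse(r) for r in grid]
dipped = [morse(r) - 0.3 * math.exp(-((r - 2.0) / 0.05) ** 2) for r in grid]  # spurious well
stepped = [morse(r) + (0.2 if r > 2.0 else 0.0) for r in grid]                # energy jump

print(count_local_minima(smooth), count_local_minima(dipped))
print(max_force_jump(smooth, dr), max_force_jump(stepped, dr))
```

The smooth curve has a single minimum and small force changes between grid points, while the hand-made defects show up as a second minimum and a force jump more than an order of magnitude larger. A scan like this needs only a few hundred energy evaluations per bond, which is the sense in which such a probe is cheap compared with an MD run.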

2602.04343 2026-06-02 cs.CV

Finding NeMO: A Geometry-Aware Representation of Template Views for Few-Shot Perception

Sebastian Jung, Leonard Klüpfel, Rudolph Triebel, Maximilian Durner

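Affiliations: German Aerospace Center (DLR)

AI summary: Proposes the NeMO (Neural Memory Object) representation, which encodes a few RGB template views into a sparse point cloud to enable detection, segmentation, and 6DoF pose estimation of unseen objects without retraining.

Comments: 17 pages including supplement, published in 3DV 2026, Project website: https://sebastian-jung.github.io/nemo/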

Journal ref: Proceedings of the International Conference on 3D Vision (3DV), 2026

Abstract

We present Neural Memory Object (NeMO), a novel object-centric representation that can be used to detect, segment and estimate the 6DoF pose of objects unseen during training using RGB images. Our method consists of an encoder that requires only a few RGB template views depicting an object to generate a sparse object-like point cloud using a learned UDF containing semantic and geometric information. Next, a decoder takes the object encoding together with a query image to generate a variety of dense predictions. Through extensive experiments, we show that our method can be used for few-shot object perception without requiring any camera-specific parameters or retraining on target data. Our proposed concept of outsourcing object information in a NeMO and using a single network for multiple perception tasks enhances interaction with novel objects, improving scalability and efficiency by enabling quick object onboarding without retraining or extensive pre-processing. We report competitive and state-of-the-art results on various datasets and perception tasks of the BOP benchmark, demonstrating the versatility of our approach. https://github.com/DLR-RM/nemo

2602.04094 2026-06-02 cs.CV

VideoBrain: Learning Adaptive Frame Sampling for Long Video Understanding

Junbo Zou, Ziheng Huang, Shengjie Zhang, Liwen Zhang, Weining Shen

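Affiliations: Stanford University

AI summary: Proposes the VideoBrain framework, which uses a dual-agent strategy (a CLIP-based agent and a uniform sampling agent) to let vision-language models adaptively acquire key frames, improving long-video understanding accuracy by 3.5% to 9.0% while using 30-40% fewer frames.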

Abstract

Long-form video understanding remains challenging for Vision-Language Models (VLMs) due to the inherent tension between computational constraints and the need to capture information distributed across thousands of frames. Existing approaches either sample frames uniformly (risking information loss) or select keyframes in a single pass (with no recovery from poor choices). We propose VideoBrain, an end-to-end framework that enables VLMs to adaptively acquire visual information through learned sampling policies. Our approach features dual complementary agents: a CLIP-based agent for semantic retrieval across the video and a Uniform agent for dense temporal sampling within intervals. Unlike prior agent-based methods that rely on text-only LLMs orchestrating visual tools, our VLM directly perceives frames and reasons about information sufficiency. To prevent models from invoking agents indiscriminately to maximize rewards, we introduce a behavior-aware reward function coupled with a data classification pipeline that teaches the model when agent invocation is genuinely beneficial. Experiments on four long video benchmarks demonstrate that VideoBrain achieves +3.5% to +9.0% improvement over the baseline while using 30-40% fewer frames, with strong cross-dataset generalization to short video benchmarks. The code is available at https://github.com/junbo-zou/VideoBrain.
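
The two sampling primitives mentioned in the abstract can be sketched in a few lines. This is an illustrative guess at what such tools might compute, not the VideoBrain code: `uniform_frames` spaces frame indices evenly inside an interval, and `top_k_by_cosine` ranks frames by cosine similarity between a query embedding and per-frame embeddings, which are assumed to come from a CLIP-like encoder that is not run here. Both function names are hypothetical, and embeddings are assumed to be non-zero vectors.

```python
import math
from typing import List, Sequence


def uniform_frames(start: int, end: int, count: int) -> List[int]:
    """`count` evenly spaced frame indices in the closed interval [start, end]."""
    if count == 1:
        return [(start + end) // 2]
    step = (end - start) / (count - 1)
    return [round(start + i * step) for i in range(count)]


def top_k_by_cosine(query: Sequence[float], frame_embeddings: Sequence[Sequence[float]], k: int) -> List[int]:
    """Indices of the k frames most similar to the query, returned in temporal order."""
    def cosine(a: Sequence[float], b: Sequence[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

    ranked = sorted(range(len(frame_embeddings)),
                    key=lambda i: cosine(query, frame_embeddings[i]),
                    reverse=True)
    return sorted(ranked[:k])


print(uniform_frames(0, 100, 5))
print(top_k_by_cosine([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]], k=2))
```

Semantic retrieval finds frames anywhere in the video that match a query, while uniform sampling densifies coverage of an interval already known to matter; a learned policy would decide when to call which.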

2602.03318 2026-06-02 cs.CL

MIRROR: A Multi-Agent Framework with Iterative Adaptive Revision and Hierarchical Retrieval for Optimization Modeling in Operations Research

Yifan Shi, Jiayi Wang, Minyi Wu, Ye Fan, Jialong Shi, Jianyong Sun

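Affiliations: Xi’an Jiaotong University; Northwestern Polytechnical University

AI summary: Proposes MIRROR, a fine-tuning-free multi-agent framework that uses execution-driven iterative adaptive revision and hierarchical retrieval to translate natural language optimization problems directly into mathematical models and solver code, outperforming existing methods on standard operations research benchmarks.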

Abstract

Operations Research (OR) relies on expert-driven modeling, a slow and fragile process ill-suited to novel scenarios. While large language models (LLMs) can automatically translate natural language into optimization models, existing approaches either rely on costly post-training or employ multi-agent frameworks, yet most still lack reliable collaborative error correction and task-specific retrieval, often leading to incorrect outputs. We propose MIRROR, a fine-tuning-free, end-to-end multi-agent framework that directly translates natural language optimization problems into mathematical models and solver code. MIRROR integrates two core mechanisms: (1) execution-driven iterative adaptive revision for automatic error correction, and (2) hierarchical retrieval to fetch relevant modeling and coding exemplars from a carefully curated exemplar library. Experiments show that MIRROR outperforms existing methods on standard OR benchmarks, with notable results on complex industrial datasets such as IndustryOR and Mamo-ComplexLP. By combining precise external knowledge infusion with systematic error correction, MIRROR provides non-expert users with an efficient and reliable OR modeling solution, overcoming the fundamental limitations of general-purpose LLMs in expert optimization tasks.

2509.23782 2026-06-02 cs.CL

Bridging the Knowledge-Prediction Gap in LLMs on Multiple-Choice Questions

Yoonah Park, Haesung Pyun, Yohan Jo

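Affiliations: KAIST

AI summary: Analyzes knowledge and prediction subspaces in hidden representations and proposes KAPPA, a lightweight inference-time intervention that aligns the two to reduce the knowledge-prediction gap of LLMs on multiple-choice questions.

Comments: Accepted to ICML 2026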

Abstract

While large language models (LLMs) perform strongly on diverse tasks, their trustworthiness is limited by erratic behavior that is unfaithful to their internal knowledge. In particular, LLMs often fail on multiple-choice questions (MCQs) even if they encode correct answers in their hidden representations, revealing a misalignment between internal knowledge and output behavior. We investigate and mitigate this knowledge-prediction gap on MCQs through a three-step analysis of hidden representations. First, we quantify the prevalence and magnitude of the gap across models and datasets. Second, we provide a geometric interpretation by identifying distinct knowledge and prediction subspaces in the residual stream. Third, we introduce KAPPA, a lightweight inference-time intervention that aligns the two subspaces within the residual stream to reduce the knowledge-prediction gap. Our results provide a geometric and interpretable explanation of the knowledge-prediction gap in LLMs. Furthermore, KAPPA effectively reduces the gap across diverse MCQ benchmarks and models, and generalizes to free-form settings.

2501.18649 2026-06-02 cs.CL cs.AI cs.IR cs.LG

Fake News Detection After LLM Laundering: Measurement and Explanation

Rupak Kumar Das, Jonathan Dodge

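Affiliations: College of IST, Pennsylvania State University

AI summary: Measures how effectively detectors identify LLM-paraphrased fake news, finds that detectors struggle with it, and uses LIME explanations to identify sentiment shift as one reason for detection failures.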

Abstract

With their advanced capabilities, Large Language Models (LLMs) can generate highly convincing and contextually relevant fake news, which can contribute to disseminating misinformation. Though there is much research on fake news detection for human-written text, the field of detecting LLM-generated fake news is still under-explored. This research measures the efficacy of detectors in identifying LLM-paraphrased fake news, in particular, determining whether adding a paraphrase step in the detection pipeline helps or impedes detection. This study contributes the following: (1) detectors struggle more to detect LLM-paraphrased fake news than human-written text; (2) we find which models excel at which tasks (evading detection, paraphrasing to evade detection, and paraphrasing for semantic similarity); (3) via LIME explanations, we discover a possible reason for detection failures: sentiment shift; (4) we discover a worrisome trend for paraphrase quality measurement: samples that exhibit sentiment shift despite a high BERTSCORE; (5) we provide a pair of datasets augmenting existing datasets with paraphrase outputs and scores. The dataset is available on GitHub.

2602.03719 2026-06-02 cs.CL

BranPO: Scalable Contrastive Branch Sampling for Long-Horizon Agentic Reinforcement Learning

Yubao Zhao, Weiquan Huang, Sudong Wang, Ruochen Zhao, Chen Chen, Yao Shu, Chengwei Qin

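Affiliations: The Hong Kong University of Science and Technology (Guangzhou); Nanyang Technological University

AI summary: To address sparse trajectory-level rewards and difficult credit assignment in long-horizon agentic reinforcement learning, proposes BranPO, a value-free method that truncates trajectories and resamples continuations to build contrastive branches, providing localized contrastive supervision without dense rewards.

Comments: 26 pages, 5 figures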

Abstract

Agentic reinforcement learning enables large language models to perform multi-turn planning and tool use, but long-horizon training remains challenging under sparse trajectory-level rewards, where a single outcome is uniformly assigned to all decisions. Prior methods introduce finer-grained supervision via tree-based exploration or process-level evaluation, but often incur high cost or produce noisy credit signals. In agentic trajectories, early mistakes may still be corrected by later actions, while seemingly promising intermediate states can fail due to poor subsequent decisions. We call this property non-monotonic correctness, which makes outcome rewards or state values insufficient for guiding what actions should be taken from each state. To address this, we propose Branching Relative Policy Optimization (BranPO), a value-free method that constructs localized contrastive supervision without dense rewards. BranPO truncates trajectories at intermediate prefixes and resamples continuations to form contrastive branches that share the same prefix but diverge in final outcomes, thereby isolating decisions that drive success or failure. We further introduce difficulty-aware branch sampling and Redundant Step Masking to improve sampling efficiency and suppress redundant updates. Experiments show that BranPO consistently outperforms diverse baseline categories across multiple multi-hop QA benchmarks without additional training cost, and generalizes to broader long-horizon agentic tasks with consistent improvements. Our code is available at https://github.com/YubaoZhao/BranPO.
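
The truncate-and-resample idea can be sketched in a few lines. This is a toy illustration, not the authors' implementation: the helper names (`branch_advantages`, `sample_continuation`, `reward_fn`), the one-action toy policy, and the mean-of-siblings baseline are all assumptions made for the example.

```python
import random

def branch_advantages(prefix, sample_continuation, reward_fn, num_branches, rng):
    """Resample continuations from a shared prefix and score each branch
    against the mean outcome of its siblings (assumed baseline)."""
    branches = [prefix + sample_continuation(prefix, rng) for _ in range(num_branches)]
    rewards = [reward_fn(branch) for branch in branches]
    baseline = sum(rewards) / len(rewards)
    return [(branch, reward - baseline) for branch, reward in zip(branches, rewards)]

def sample_continuation(prefix, rng):
    # Toy policy (hypothetical): one more action, chosen uniformly at random.
    return [rng.choice(["search", "answer_right", "answer_wrong"])]

def reward_fn(trajectory):
    # Toy outcome reward: 1 only if the trajectory ends with the right answer.
    return 1.0 if trajectory[-1] == "answer_right" else 0.0

rng = random.Random(0)
full_trajectory = ["plan", "search", "read", "answer_wrong"]
prefix = full_trajectory[:2]  # truncate at an intermediate prefix
scored = branch_advantages(prefix, sample_continuation, reward_fn, 4, rng)
```

Because every branch shares the same prefix, any difference in the signed scores is attributable to the resampled continuation rather than to earlier decisions.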

2602.03685 2026-06-02 cs.LG cs.AI stat.ML

Universal One-third Time Scaling in Learning Peaked Distributions

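Yizhou Liu, Ziming Liu, Cengiz Pehlevan, Jeff Gore

Affiliations: University of California, Berkeley

AI summary: Through theoretical analysis and experiments, shows that learning peaked distributions with softmax and cross-entropy makes losses and gradients vanish as power laws, producing a universal bottleneck in which the loss scales with time with exponent 1/3, and offers a mechanistic explanation for neural scaling.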

Comments Camera-ready version, ICML 2026

Abstract

Training large language models (LLMs) is computationally expensive, partly because the loss exhibits slow power-law convergence whose origin remains debatable. Through systematic analysis of toy models and empirical evaluation of LLMs, we show that this behavior can arise intrinsically from the use of softmax and cross-entropy. When learning peaked probability distributions, e.g., next-token distributions, these components generically yield power-law vanishing losses and gradients, regardless of many microscopic details, creating a fundamental optimization bottleneck. This ultimately leads to power-law time scaling of the loss with a universal exponent of $1/3$. Our results provide a mechanistic explanation for observed neural scaling and suggest new directions for improving LLM training efficiency.
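
To make the phrase "power-law time scaling with exponent 1/3" concrete, the sketch below fits the slope of log-loss against log-step by least squares. The synthetic curve `t ** (-1/3)` and the helper `fit_power_law_exponent` are assumptions for illustration only; no data from the paper is used.

```python
import math

def fit_power_law_exponent(steps, losses):
    """Least-squares slope of log(loss) vs log(step).

    Returns alpha such that loss is approximately proportional to step ** (-alpha).
    """
    xs = [math.log(t) for t in steps]
    ys = [math.log(l) for l in losses]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return -cov / var

# Synthetic loss curve that follows the claimed scaling exactly (an assumption).
steps = [10 ** k for k in range(1, 7)]
losses = [t ** (-1.0 / 3.0) for t in steps]
alpha = fit_power_law_exponent(steps, losses)
```

On a real training run the same fit would only be meaningful over the range of steps where the log-log curve is close to a straight line.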

2602.03670 2026-06-02 cs.LG cs.AI cs.NE math.DS physics.class-ph

Equilibrium Propagation for Non-Conservative Systems

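Antonino Emanuele Scurria, Dimitri Vanden Abeele, Bortolo Matteo Mognetti, Serge Massar

Affiliations: University of Amsterdam; Institute for Advanced Study

AI summary: Proposes a framework that extends Equilibrium Propagation to non-conservative systems, including feedforward networks, by adding a term proportional to the non-reciprocal part of the interaction during the learning phase so that the exact gradient of the cost function is obtained; numerical experiments show better performance and faster learning.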

Comments 23 pages

Abstract

Equilibrium Propagation (EP) is a physics-inspired learning algorithm that uses stationary states of a dynamical system both for inference and learning. In its original formulation it is limited to conservative systems, i.e., to dynamics which derive from an energy function. Given their applications, it is important to extend EP to non-conservative systems, i.e., systems with non-reciprocal interactions. Previous attempts to generalize EP to such systems failed to compute the exact gradient of the cost function. Here we propose a framework that extends EP to arbitrary non-conservative systems, including feedforward networks. We keep the key property of equilibrium propagation, namely the use of stationary states both for inference and learning. However, we modify the dynamics in the learning phase by a term proportional to the non-reciprocal part of the interaction so as to obtain the exact gradient of the cost function. This algorithm can also be derived using a variational formulation that generates the learning dynamics through an energy function defined over an augmented state space. Numerical experiments show that this algorithm achieves better performance and learns faster than previous proposals.

2602.03619 2026-06-02 cs.CL

Learning Query-Specific Rubrics from Human Preferences for DeepResearch Report Generation

Changze Lv, Jie Zhou, Wentao Zhao, Jingwen Xu, Shihan Dou, Zisu Huang, Muzhao Tian, Xiaohua Wang, Yang Liu, Pluto Zhou, Tao Gui, Le Tian, Xiao Zhou, Xiaoqing Zheng, Xuanjing Huang, Jie Zhou

Affiliations: Tencent; Fudan University; Tsinghua University

AI summary: Proposes a pipeline that trains query-specific rubric generators with reinforcement learning to address the lack of verifiable reward signals in DeepResearch long-form report generation, with substantial gains on a human-preference test set and in downstream training.

Abstract

Nowadays, developing reliable DeepResearch-style long-form report generation remains challenging, as training and evaluation lack verifiable reward signals. Accordingly, rubric-based evaluation has become a common practice. However, existing approaches either rely on coarse, pre-defined rubrics that lack sufficient granularity or depend on manually constructed query-specific rubrics that are costly and difficult to scale. In this paper, we propose a pipeline to train preference-grounded query-specific rubric generators tailored for DeepResearch report generation. We first construct a dataset of DeepResearch-style queries annotated with human preferences over paired reports, and train rubric generators via reinforcement learning with a hybrid reward combining preference consistency, format validity, and LLM-based rubric evaluation. We evaluate the resulting rubric generators in two stages. First, on a held-out human-preference test set, the learned rubrics discriminate preferred from rejected reports more effectively than generic, prompted, or SFT-trained rubric alternatives. Second, when used as reward signals to train DeepResearch systems, our rubric generators yield substantial performance gains under both a simple single-agent ReAct framework and a complex multi-agent workflow on the DeepResearch Bench.

2602.03554 2026-06-02 cs.LG cs.AI cs.CE cs.CL

When Single Answer Is Not Enough: Rethinking Single-Step Retrosynthesis Benchmarks for LLMs

Bogdan Zagribelnyy, Ivan Ilin, Maksim Kuznetsov, Nikita Bondarev, Mathieu Reymond, Roman Schutski, Thomas MacDougall, Rim Shayakhmetov, Zulfat Miftakhutdinov, Mikolaj Mizera, Vladimir Aladinskiy, Alex Aliper, Alex Zhavoronkov

Affiliations: DeepMind, London, UK

AI summary: To overcome the reliance of existing retrosynthesis benchmarks on a single ground-truth answer, proposes an evaluation framework built on ChemCensor, a chemical-plausibility metric, and builds the CREED dataset to train a better-performing model.

Abstract

Recent progress has expanded the use of large language models (LLMs) in drug discovery, including synthesis planning. However, objective evaluation of retrosynthesis performance remains limited. Existing benchmarks and metrics typically rely on published synthetic procedures and Top-K accuracy based on a single ground truth, which does not capture the open-ended nature of real-world synthesis planning. We propose a new benchmarking framework for single-step retrosynthesis that evaluates both general-purpose and chemistry-specialized LLMs using ChemCensor, a novel metric for chemical plausibility. By emphasizing plausibility over exact match, this approach better aligns with human synthesis planning practices. We also introduce CREED, a novel dataset comprising millions of ChemCensor-validated reaction records for LLM training, and use it to train a model that improves over the LLM baselines under this benchmark.
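
The limitation of single-reference Top-K accuracy is easy to see in a small sketch. The function below is the standard exact-match definition; the placeholder strings (`route_A`, and so on) stand in for reactant sets and are hypothetical, not real chemistry or benchmark data.

```python
def top_k_accuracy(ranked_predictions, references, k):
    """Fraction of targets whose single reference answer appears in the top k."""
    hits = sum(
        1 for preds, ref in zip(ranked_predictions, references) if ref in preds[:k]
    )
    return hits / len(references)

ranked_predictions = [
    ["route_A", "route_B"],  # exact match with the recorded answer at rank 1
    ["route_C", "route_D"],  # possibly valid alternatives, but not the recorded one
]
references = ["route_A", "route_E"]
acc = top_k_accuracy(ranked_predictions, references, k=2)
```

The second target counts as a miss regardless of whether `route_C` or `route_D` would be chemically sensible, which is the gap a plausibility-based metric is meant to close.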

2602.03211 2026-06-02 cs.LG cs.AI

Lookahead Sample Reward Guidance for Test-Time Scaling of Diffusion Models

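Yeongmin Kim, Donghyeok Shin, Byeonghu Na, Minsang Park, Richard Lee Kim, Il-Chul Moon

Affiliations: KAIST

AI summary: Proposes LiDAR sampling, an efficient test-time scaling method that uses few-step lookahead sampling and an accurate solver to guide particles toward high-reward regions without backpropagation, matching the latest gradient guidance method on GenEval with a 9.5x speedup.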

Comments ICML 2026 Spotlight

Abstract

Diffusion models have demonstrated strong generative performance; however, generated samples often fail to fully align with human intent. This paper studies an efficient test-time scaling method for sampling from regions with higher human-aligned reward values. Existing methods for computing the expected future reward (EFR) face important limitations: backward rollout incurs prohibitively high sampling costs, while Tweedie-based approaches, including Sequential Monte Carlo and gradient guidance, suffer from bias and inherent sampling issues. We show that the EFR at any $\mathbf{x}_t$ can be computed using only marginal samples from a pre-trained diffusion model, enabling closed-form reward guidance without neural backpropagation. To further improve efficiency, we introduce a few-step lookahead sampling and an accurate solver that guides particles toward high-reward lookahead samples. We refer to this sampling scheme as LiDAR sampling. LiDAR achieves the same GenEval performance as the latest gradient guidance method for SDXL with a 9.5x speedup. We release the code at https://github.com/aailab-kaist/Diffusion-LiDAR-Sampling.

2602.03203 2026-06-02 cs.CL cs.LG

ForesightKV: Optimizing KV Cache Eviction for Reasoning Models by Learning Long-Term Contribution

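Zican Dong, Peiyu Liu, Junyi Li, Zhipeng Chen, Han Peng, Shuo Wang, Wayne Xin Zhao

Affiliations: Zhejiang University

AI summary: Proposes ForesightKV, a framework that uses supervised and reinforcement learning to predict the importance of KV pairs during long-text generation and outperforms existing methods with only half the cache budget.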

Comments ICML 2026

Abstract

Recently, large language models (LLMs) have shown remarkable reasoning abilities by producing long reasoning traces. However, as the sequence length grows, the key-value (KV) cache expands linearly, incurring significant memory and computation costs. Existing KV cache eviction methods mitigate this issue by discarding less important KV pairs, but often fail to capture complex KV dependencies, resulting in performance degradation. To better balance efficiency and performance, we introduce ForesightKV, a training-based KV cache eviction framework that learns to predict which KV pairs to evict during long-text generations. We first design the Golden Eviction algorithm, which identifies the optimal eviction KV pairs at each step using future attention scores. These traces and the scores at each step are then distilled via supervised training with a Pairwise Ranking Loss. Furthermore, we formulate cache eviction as a Markov Decision Process and apply the GRPO algorithm to mitigate the significant language modeling loss increase on low-entropy tokens. Experiments on AIME2024 and AIME2025 benchmarks of three reasoning models demonstrate that ForesightKV consistently outperforms prior methods under only half the cache budget, while benefiting synergistically from both supervised and reinforcement learning approaches. Code is available at https://github.com/RUCAIBox/ForesightKV.
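
A one-step simplification of eviction guided by future attention might look like the sketch below. It is not the Golden Eviction algorithm itself: the function name, the `future_attention[q][k]` layout, and the choice to rank by summed future attention are assumptions made for illustration.

```python
def golden_eviction(future_attention, cached, budget):
    """Keep the `budget` cached positions with the largest total future attention.

    future_attention[q][k] is the attention weight that future query q places
    on cached position k (hypothetical layout chosen for this sketch).
    """
    totals = {k: sum(row[k] for row in future_attention) for k in cached}
    ranked = sorted(cached, key=lambda k: totals[k], reverse=True)
    kept = sorted(ranked[:budget])
    evicted = sorted(ranked[budget:])
    return kept, evicted

# Two future queries attending over four cached positions (toy numbers).
future_attention = [
    [0.50, 0.05, 0.30, 0.15],
    [0.40, 0.10, 0.45, 0.05],
]
kept, evicted = golden_eviction(future_attention, cached=[0, 1, 2, 3], budget=2)
```

Future attention is only available offline, which is why the abstract describes distilling such decisions into a model that can predict them during generation.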

2602.03024 2026-06-02 cs.LG cs.AI

Consistency Deep Equilibrium Models

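Junchao Lin, Zenan Ling, Jingwen Xu, Robert C. Qiu

AI summary: Proposes the Consistency Deep Equilibrium Model (C-DEQ), which uses consistency distillation and treats DEQ iterative inference as evolution along an ODE trajectory, training the model to map intermediate states directly to the fixed point. This enables few-step inference that preserves performance and supports multi-step evaluation to trade computation for accuracy; experiments report 2-20x accuracy improvements under the same few-step inference budget.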

Abstract

Deep Equilibrium Models (DEQs) have emerged as a powerful paradigm in deep learning, offering the ability to model infinite-depth networks with constant memory usage. However, DEQs incur significant inference latency due to the iterative nature of fixed-point solvers. In this work, we introduce the Consistency Deep Equilibrium Model (C-DEQ), a novel framework that leverages consistency distillation to accelerate DEQ inference. We cast the DEQ iterative inference process as evolution along a fixed ODE trajectory toward the equilibrium. Along this trajectory, we train C-DEQs to consistently map intermediate states directly to the fixed point, enabling few-step inference while preserving the performance of the teacher DEQ. At the same time, it facilitates multi-step evaluation to flexibly trade computation for performance gains. Extensive experiments across various domain tasks demonstrate that C-DEQs achieve consistent 2-20$\times$ accuracy improvements over implicit DEQs under the same few-step inference budget. Our code is available at https://github.com/landrarwolf/CDEQ.
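
The inference latency mentioned above comes from solving for a fixed point iteratively. The sketch below shows a naive fixed-point iteration on a scalar toy layer; the layer `tanh(0.5 * z + x)`, the tolerance, and the helper name are assumptions for illustration and are unrelated to the solvers or models used in the paper.

```python
import math

def solve_fixed_point(f, z0, tol=1e-8, max_iter=1000):
    """Naive iteration z <- f(z); returns the estimate and the number of steps used."""
    z = z0
    for step in range(1, max_iter + 1):
        z_next = f(z)
        if abs(z_next - z) < tol:
            return z_next, step
        z = z_next
    return z, max_iter

x = 1.0

def layer(z):
    # Toy implicit layer; its slope is at most 0.5, so the iteration contracts.
    return math.tanh(0.5 * z + x)

z_star, n_steps = solve_fixed_point(layer, z0=0.0)
```

Each step costs one evaluation of the layer, so a model that maps an intermediate state close to `z_star` in one or two evaluations replaces the whole loop.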

2602.03018 2026-06-02 cs.LG

From Zero to Hero: Advancing Zero-Shot Foundation Models for Tabular Outlier Detection

Xueying Ding, Haomin Wen, Simon Klüttermann, Leman Akoglu

AI summary: Proposes OUTFORMER, which combines a mixture of synthetic priors with self-evolving curriculum training for zero-shot tabular outlier detection, reaching state-of-the-art performance on AdBench and on new benchmarks.

Comments 41 Pages, ICML 2026

Abstract

Outlier detection (OD) is widely used in practice, but its effective deployment on new tasks is hindered by the lack of labeled outliers, which makes algorithm and hyperparameter selection notoriously hard. Foundation models (FMs) have transformed ML, and OD is no exception: Shen et al. (2025) introduced FoMo-0D, the first FM for OD, achieving remarkable performance against numerous baselines. This work introduces OUTFORMER, which advances FoMo-0D with (1) a mixture of synthetic priors and (2) self-evolving curriculum training. OUTFORMER is pretrained solely on synthetic labeled datasets and infers test labels of a new task by using its training data as in-context input. Inference is fast and zero-shot, requiring merely a forward pass and no labeled outliers. Thanks to in-context learning, it requires zero additional work (no OD model training or bespoke model selection), enabling truly plug-and-play deployment. OUTFORMER achieves state-of-the-art performance on the prominent AdBench, as well as two new large-scale OD benchmarks that we introduce, comprising over 1,500 datasets, while maintaining speedy inference.

2602.02886 2026-06-02 cs.LG cs.AI

Mixture of Concept Bottleneck Experts

Francesco De Santis, Gabriele Ciravegna, Giovanni De Felice, Arianna Casanova, Francesco Giannini, Michelangelo Diligenti, Johannes Schneider, Danilo Giordano, Mateo Espinosa Zarlenga, Pietro Barbiero

Affiliations: University of Padua

AI summary: Proposes Mixture of Concept Bottleneck Experts (M-CBE), which introduces multiple expert expressions with flexible functional forms to improve predictive accuracy and adaptability while preserving interpretability.

Abstract

Concept Bottleneck Models (CBMs) promote interpretability by grounding predictions in human-understandable concepts. However, existing CBMs typically constrain their task predictor to a single expression whose functional form is set a priori, limiting both predictive accuracy and adaptability to diverse user needs. We propose Mixture of Concept Bottleneck Experts (M-CBE), a framework that generalizes existing CBMs along two dimensions: the number of expressions, referred to as experts, employed by the task predictor to map concepts to the task, and the functional form each expression takes, thus exposing an underexplored region of this design space. We investigate this region by instantiating two novel models: Linear M-CBE, which learns a finite set of linear expressions, and Symbolic M-CBE, which leverages symbolic regression to discover expert functions from data subject to user-specified operator vocabularies. Empirical evaluation demonstrates that varying the number of expressions and their functional form provides a robust framework for navigating the accuracy-interpretability trade-off.
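
A minimal sketch of what "a finite set of linear expressions over concepts" could look like follows. The two experts, the concept names, and the hand-written selector are hypothetical; how Linear M-CBE actually learns and selects its experts is not specified in the abstract.

```python
def linear_mcbe_predict(concepts, experts, select_expert):
    """Evaluate the linear expression picked for this input on its concept vector."""
    weights, bias = experts[select_expert(concepts)]
    return sum(w * c for w, c in zip(weights, concepts)) + bias

# Two hypothetical experts over the concepts (has_wings, has_fur).
experts = [
    ([2.0, 0.0], -1.0),  # expert 0: depends only on has_wings
    ([0.0, 3.0], 0.5),   # expert 1: depends only on has_fur
]

def select_expert(concepts):
    # Hypothetical rule: use expert 0 whenever has_wings is active.
    return 0 if concepts[0] > 0.5 else 1

score_bird = linear_mcbe_predict([1.0, 0.0], experts, select_expert)
score_cat = linear_mcbe_predict([0.0, 1.0], experts, select_expert)
```

Each prediction is still a single readable linear expression over concepts, so adding experts changes which expression applies without making any one of them harder to inspect.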

2602.02823 2026-06-02 cs.CL

R2-Router: A New Paradigm for LLM Routing with Reasoning

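Jiaqi Xue, Qian Lou, Jiarong Xing, Heng Huang

Affiliations: University of Science and Technology of China

AI summary: Existing LLM routers ignore how output length affects quality and cost; R2-Router treats the output length budget as a controllable variable and jointly selects the best LLM and budget, achieving high-quality routing at low cost.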

Comments Accepted by ICML 2026

Abstract

As LLMs proliferate with diverse capabilities and costs, LLM routing has emerged by learning to predict each LLM's quality and cost for a given query, then selecting the one with high quality and low cost. However, existing routers implicitly assume a single fixed quality and cost per LLM for each query, ignoring that the same LLM's quality varies with its output length. This causes routers to exclude powerful LLMs when their estimated cost exceeds the budget, missing the opportunity that these LLMs could still deliver high quality at reduced cost with shorter outputs. To address this, we introduce R2-Router, which treats output length budget as a controllable variable and jointly selects the best LLM and length budget, enforcing the budget via length-constrained instructions. This enables R2-Router to discover that a powerful LLM with constrained output can outperform a weaker LLM at comparable cost-efficient configurations invisible to prior methods. Together with the router framework, we construct R2-Bench, the first routing dataset capturing LLM behavior across diverse output length budgets. Experiments show that R2-Router achieves state-of-the-art performance at 4-5$\times$ lower cost compared with existing routers. This work opens a new direction: routing as reasoning, where routers evolve from reactive selectors to deliberate reasoners that explore which LLM to use and at what cost budget. The code is publicly available at https://github.com/UCF-ML-Research/R2-Router.
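
Joint selection over (LLM, length budget) pairs can be illustrated with a toy scoring rule. The objective `quality - lam * cost`, the candidate numbers, and the field names below are assumptions for this sketch; the abstract does not state the selection rule R2-Router actually uses.

```python
def route(candidates, lam):
    """Pick the (model, length budget) pair maximizing predicted quality - lam * cost."""
    return max(candidates, key=lambda c: c["quality"] - lam * c["cost"])

# Hypothetical predictions: one entry per (LLM, output-length budget) pair.
candidates = [
    {"model": "strong", "budget": 1024, "quality": 0.90, "cost": 8.0},
    {"model": "strong", "budget": 256, "quality": 0.82, "cost": 2.0},
    {"model": "weak", "budget": 1024, "quality": 0.70, "cost": 1.5},
]
choice = route(candidates, lam=0.05)
```

A router that only saw the first and third rows would have to choose between an expensive strong model and a cheap weak one; exposing the middle row is what treating the length budget as a variable adds.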

2602.02557 2026-06-02 cs.LG cs.AI cs.SD

The Alignment Curse: Modality Alignment Supercharges Audio Attacks via Text Transfer

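Yupeng Chen, Junchi Yu, Aoxi Liu, Baoyuan Wu, Philip Torr, Adel Bibi

Affiliations: University of Oxford

AI summary: Introduces and validates the "Alignment Curse", the principle that stronger text-audio modality alignment makes text attacks transfer to audio more effectively; black-box experiments show that text-transferred audio attacks match or outperform native audio attacks, revealing a fundamental tension between capability and safety.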

Comments 23 pages, 5 figures

Abstract

Recent advances in end-to-end trained omni-models have substantially improved audio capabilities by strengthening text-audio modality alignment. However, whether such alignment inadvertently facilitates the transfer of safety vulnerabilities across modalities remains underexplored. This question is critical as text-based jailbreak attacks are considerably more mature than audio-based ones; if they transfer systematically, current audio safety evaluations may underestimate risks originating from the text modality. In this paper, we introduce the Alignment Curse, a formally characterized and empirically validated principle showing that stronger modality alignment enables more effective transfer of attacks from text to audio, revealing a fundamental tension between capability and safety. Motivated by this principle, we conduct a comprehensive black-box evaluation of three attack categories on recent omni-models (e.g., Qwen2.5-Omni, Qwen3-Omni): text attacks, text-transferred audio attacks, and audio attacks. We find that text-transferred audio attacks perform comparably to, and often better than, audio-based attacks, exhibiting a clear advantage under audio-only access. This suggests that text-based vulnerabilities play a pivotal role in shaping audio safety risks. Finally, we empirically analyze the relationship between modality alignment and transfer effectiveness across attack methods and models, observing consistent support for the Alignment Curse: tighter modality alignment leads to more effective cross-modality attack transfer.

2602.02547 2026-06-02 cs.LG cs.AI

naPINN: Noise-Adaptive Physics-Informed Neural Networks for Recovering Physics from Corrupted Measurement

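Hankyeol Kim, Pilsung Kang

Affiliations: Department of Industrial Engineering, Seoul National University

AI summary: Proposes the Noise-Adaptive Physics-Informed Neural Network (naPINN), which embeds an energy-based model to learn the residual distribution and adaptively filter outliers, robustly recovering physical solutions from measurements corrupted by non-Gaussian noise and outliers.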

Abstract

Physics-Informed Neural Networks (PINNs) are effective methods for solving inverse problems and discovering governing equations from observational data. However, their performance degrades significantly under complex measurement noise and gross outliers. To address this issue, we propose the Noise-Adaptive Physics-Informed Neural Network (naPINN), which robustly recovers physical solutions from corrupted measurements without prior knowledge of the noise distribution. naPINN embeds an energy-based model into the training loop to learn the latent distribution of prediction residuals. Leveraging the learned energy landscape, a trainable reliability gate adaptively filters data points exhibiting high energy, while a rejection cost regularization prevents trivial solutions where valid data are discarded. We demonstrate the efficacy of naPINN on various benchmark partial differential equations corrupted by non-Gaussian noise and varying rates of outliers. The results show that naPINN significantly outperforms existing robust PINN baselines, successfully isolating outliers and accurately reconstructing the dynamics under severe data corruption.
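
The interplay between a reliability gate and a rejection cost can be shown with a hand-set gate. In this sketch the gate values are fixed by hand rather than learned from an energy landscape, and the loss form (gate-weighted squared residuals plus a penalty proportional to `1 - gate`) is an assumed stand-in for the paper's regularizer.

```python
def gated_data_loss(residuals, gates, rejection_cost):
    """Mean gate-weighted squared residual plus a penalty for rejecting points.

    gates[i] in [0, 1] is the reliability assigned to point i; the penalty
    rejection_cost * (1 - gate) discourages gating every point out.
    """
    n = len(residuals)
    fit = sum(g * r ** 2 for g, r in zip(gates, residuals)) / n
    penalty = rejection_cost * sum(1.0 - g for g in gates) / n
    return fit + penalty

residuals = [0.1, -0.2, 5.0]  # the last point is a gross outlier
keep_all = gated_data_loss(residuals, [1.0, 1.0, 1.0], rejection_cost=1.0)
drop_outlier = gated_data_loss(residuals, [1.0, 1.0, 0.0], rejection_cost=1.0)
drop_all = gated_data_loss(residuals, [0.0, 0.0, 0.0], rejection_cost=1.0)
```

With these numbers, rejecting only the outlier gives the lowest loss, while rejecting everything is penalized more than rejecting one point, which is the trivial solution the regularizer is meant to prevent.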

2602.01753 2026-06-02 cs.CV

ObjEmbed: Towards Universal Multimodal Object Embeddings

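Shenghao Fu, Yukun Su, Fengyun Rao, Jing Lyu, Xiaohua Xie, Wei-Shi Zheng

Affiliations: University of Science and Technology of China

AI summary: Proposes ObjEmbed, which decomposes an image into multiple regional embeddings, one per object, and generates complementary semantic and IoU embeddings for fine-grained vision-language alignment, performing strongly on visual grounding and on local and global image retrieval.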

Comments Accepted by ICML 2026

Abstract

Aligning objects with corresponding textual descriptions is a fundamental challenge and a realistic requirement in vision-language understanding. While recent multimodal embedding models excel at global image-text alignment, they often struggle with fine-grained alignment between image regions and specific phrases. In this work, we present ObjEmbed, a novel MLLM embedding model that decomposes the input image into multiple regional embeddings, each corresponding to an individual object, along with global embeddings. It supports a wide range of visual understanding tasks like visual grounding, local image retrieval, and global image retrieval. ObjEmbed enjoys three key properties: (1) Object-Oriented Representation: It captures both semantic and spatial aspects of objects by generating two complementary embeddings for each region: an object embedding for semantic matching and an IoU embedding that predicts localization quality. The final object matching score combines semantic similarity with the predicted IoU, enabling more accurate retrieval. (2) Versatility: It seamlessly handles both region-level and image-level tasks. (3) Efficient Encoding: All objects in an image, along with the full image, are encoded in a single forward pass for high efficiency. Superior performance on 18 diverse benchmarks demonstrates its strong semantic discrimination.
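
One simple way to combine semantic similarity with a predicted IoU is a product, sketched below. The product form, the two-dimensional toy embeddings, and the function names are assumptions; the abstract says only that the final score combines the two quantities.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def match_score(object_emb, text_emb, predicted_iou):
    """Semantic similarity scaled by predicted localization quality (assumed product)."""
    return cosine(object_emb, text_emb) * predicted_iou

text = [1.0, 0.0]
tight_box = match_score([0.9, 0.1], text, predicted_iou=0.9)
loose_box = match_score([1.0, 0.0], text, predicted_iou=0.4)
```

Here a slightly less similar region with a well-localized box outranks a perfectly similar region with a poor box, which is the behaviour a localization-aware score is meant to produce.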

2510.06048 2026-06-02 cs.LG

BLISS: A Lightweight Bilevel Influence Scoring Method for Data Selection in Language Model Pretraining

Jie Hao, Rui Yu, Wei Zhang, Huixia Wang, Jie Xu, Mingrui Liu

Affiliations: Department of Computer Science, George Mason University, USA; IBM T.J. Watson Research Center, USA; Department of Statistics, Rice University; Department of System Engineering & Operations Research, George Mason University, USA

AI summary: Proposes BLISS, a lightweight data selection method that needs no external pretrained model and uses bilevel optimization with a proxy model to estimate the long-term influence of training samples; pretraining models of several sizes on selected C4 subsets converges faster and improves downstream task performance.

Abstract

Effective data selection is essential for pretraining large language models (LLMs), enhancing efficiency and improving generalization to downstream tasks. However, existing approaches often require leveraging external pretrained models, making it difficult to disentangle the effects of data selection from those of the external pretrained models. In addition, they often overlook the long-term impact of selected data if the model is trained to convergence, primarily due to the prohibitive cost of full-scale LLM pretraining. In this paper, we introduce BLISS (BiLevel Influence Scoring method for data Selection): a lightweight data selection method that operates entirely from scratch, without relying on any external pretrained oracle models, while explicitly accounting for the long-term impact of selected data. BLISS leverages a small proxy model as a surrogate for the LLM and employs a score model to estimate the long-term influence of training samples if the proxy model is trained to convergence. We formulate data selection as a bilevel optimization problem, where the upper-level objective optimizes the score model to assign importance weights to training samples, ensuring that minimizing the lower-level objective (i.e., training the proxy model over the weighted training loss until convergence) leads to best validation performance. Once optimized, the trained score model predicts influence scores for the dataset, enabling efficient selection of high-quality samples for LLM pretraining. We validate BLISS by pretraining 410M/1B/2.8B Pythia and LLaMA-0.5B models on selected subsets of the C4 dataset. Notably, under the 1B model setting, BLISS achieves $1.7\times$ speedup in reaching the same performance as the state-of-the-art method, demonstrating superior performance across multiple downstream tasks.
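
The bilevel problem described in words above can be written compactly. The notation is an assumption introduced here, since the abstract gives none: $\phi$ denotes the score-model parameters, $\theta$ the proxy-model parameters, $w_{\phi}(x_i)$ the importance weight assigned to training sample $x_i$, $\ell$ the per-sample training loss, and $\mathcal{L}_{\mathrm{val}}$ the validation loss.

```latex
\min_{\phi}\ \mathcal{L}_{\mathrm{val}}\bigl(\theta^{*}(\phi)\bigr)
\quad \text{s.t.} \quad
\theta^{*}(\phi) \in \arg\min_{\theta} \sum_{i=1}^{N} w_{\phi}(x_i)\, \ell(\theta; x_i)
```

The lower level trains the proxy model to convergence on the weighted loss, and the upper level adjusts the weights so that the converged proxy performs best on validation data.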