arXivDaily: arXiv daily academic digest, updated Monday through Friday
2603.18943 2026-05-15 cs.CV

VGGT-360: Geometry-Consistent Zero-Shot Panoramic Depth Estimation

Jiayi Yuan, Haobo Jiang, De Wen Soh, Na Zhao

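AI summary: This paper proposes VGGT-360, a new training-free framework for zero-shot, geometry-consistent panoramic depth estimation. By leveraging the intrinsic 3D consistency of VGGT-like foundation models, it reformulates the task as panoramic reprojection over multi-view reconstructed 3D models, unifying fragmented per-view reasoning into a coherent panoramic understanding. VGGT-360 integrates three plug-and-play modules into a unified panorama-to-3D-to-depth framework and outperforms existing trained and training-free methods on multiple indoor and outdoor datasets.

Abstract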

This paper presents VGGT-360, a novel training-free framework for zero-shot, geometry-consistent panoramic depth estimation. Unlike prior view-independent training-free approaches, VGGT-360 reformulates the task as panoramic reprojection over multi-view reconstructed 3D models by leveraging the intrinsic 3D consistency of VGGT-like foundation models, thereby unifying fragmented per-view reasoning into a coherent panoramic understanding. To achieve robust and accurate estimation, VGGT-360 integrates three plug-and-play modules that form a unified panorama-to-3D-to-depth framework: (i) Uncertainty-guided adaptive projection slices panoramas into perspective views to bridge the domain gap between panoramic inputs and VGGT's perspective prior. It estimates gradient-based uncertainty to allocate denser views to geometry-poor regions, yielding geometry-informative inputs for VGGT. (ii) Structure-saliency enhanced attention strengthens VGGT's robustness during 3D reconstruction by injecting structure-aware confidence into its attention layers, guiding focus toward geometrically reliable regions and enhancing cross-view coherence. (iii) Correlation-weighted 3D model correction refines the reconstructed 3D model by reweighting overlapping points using attention-inferred correlation scores, providing a consistent geometric basis for accurate panoramic reprojection. Extensive experiments show that VGGT-360 outperforms both trained and training-free state-of-the-art methods across multiple resolutions and diverse indoor and outdoor datasets.
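
The panoramic reprojection step mentioned in the abstract can be illustrated with standard equirectangular geometry. The following is a minimal sketch, not the authors' implementation: the axis convention (x right, y up, z forward), the function names, and the nearest-point z-buffer rule are all assumptions made for illustration.

```python
import math

def project_to_equirect(point, width, height):
    """Map a 3D point (x right, y up, z forward; assumed convention)
    to equirectangular coordinates (u, v) and its range from the camera."""
    x, y, z = point
    depth = math.sqrt(x * x + y * y + z * z)
    if depth == 0.0:
        raise ValueError("a point at the camera centre has no direction")
    lon = math.atan2(x, z)                           # [-pi, pi], 0 = forward
    lat = math.asin(max(-1.0, min(1.0, y / depth)))  # [-pi/2, pi/2], 0 = horizon
    u = (lon / (2.0 * math.pi) + 0.5) * width
    v = (0.5 - lat / math.pi) * height
    return u, v, depth

def reproject_depth(points, width, height):
    """Hypothetical z-buffer: keep the nearest reconstructed point per pixel."""
    depth_map = [[math.inf] * width for _ in range(height)]
    for p in points:
        u, v, d = project_to_equirect(p, width, height)
        col = min(int(u), width - 1)    # clamp the seam at lon = pi
        row = min(int(v), height - 1)   # clamp the south pole
        depth_map[row][col] = min(depth_map[row][col], d)
    return depth_map
```

Pixels that receive no point keep the value `math.inf`; a real pipeline would need some form of hole filling, which this sketch leaves out.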

2603.17432 2026-05-15 cs.CL

Argument Reconstruction as Supervision for Critical Thinking in LLMs

Hyun Ryu, Gyouk Chu, Gregor Betz, Eunho Yang, Carolyn Rose, Sean Welleck

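AI summary: This paper studies how argument reconstruction can improve the critical thinking ability of large language models. The authors propose a framework comprising an engine that automatically reconstructs arbitrary arguments (GAAR), a high-quality argument reconstruction dataset (Arguinas), and an investigation of how learning argument reconstruction affects downstream critical thinking tasks. Experiments show that models trained on argument reconstruction outperform models without such training across multiple critical thinking tasks, with the largest gains when training on Arguinas.

Abstract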

To think critically about arguments, human learners are trained to identify, reconstruct, and evaluate arguments. Argument reconstruction is especially important because it makes an argument's underlying inferences explicit. However, it remains unclear whether LLMs can similarly enhance their critical thinking ability by learning to reconstruct arguments. To address this question, we introduce a holistic framework with three contributions. We (1) propose an engine that automatically reconstructs arbitrary arguments (GAAR), (2) synthesize a new high-quality argument reconstruction dataset (Arguinas) using the GAAR engine, and (3) investigate whether learning argument reconstruction benefits downstream critical thinking tasks. Our experimental results show that, across seven critical thinking tasks, models trained to learn argument reconstruction outperform models that do not, with the largest performance gains observed when training on the proposed Arguinas dataset.

2603.16659 2026-05-15 cs.AI econ.GN q-fin.EC

LLMs learn scientific taste from institutional traces across the social sciences

Ziqin Gong, Ning Li, Huaikang Zhou

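AI summary: This study examines whether large language models (LLMs) can learn evaluative judgment in low-verifiability domains from institutional traces in the social sciences, such as publication records. It builds four-tier research-pitch benchmarks across eight disciplines and trains models with supervised fine-tuning (SFT). The resulting models judge research value well above the chance baseline and, in the management evaluation, exceed both the frontier-model mean and the expert majority vote. Model confidence also tracks prediction accuracy, which indicates a degree of reliability in their judgments.

Abstract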

Reinforcement-learned reasoning has powered recent AI leaps on verifiable tasks, including mathematics, code, and structure prediction. The harder bottleneck is evaluative judgment in low-verifiability domains, where no oracle anchors reward and the core question is which untested ideas deserve attention. We test whether institutional traces, the record of what fields published, where, and at which tier, can serve as a training signal for AI evaluators. Across eight social science disciplines (psychology, economics, communication, sociology, political science, management, business and finance, public administration), we built held-out four-tier research-pitch benchmarks and supervised-fine-tuned (SFT) LLMs on field-specific publication outcomes. The fine-tuned models cleared the 25 percent chance baseline and exceeded frontier-model performance by wide margins, with best single-model accuracy ranging from 55.0 percent in public administration to 85.5 percent in psychology. In management, evaluated against 48 expert gatekeepers, 174 junior researchers, and 11 frontier reasoning models, the best single fine-tuned model (Qwen3-4B) reached 59.2 percent, 17.6 percentage points above expert majority vote (41.6 percent, non-tied) and 28.1 percentage points above the frontier mean (31.1 percent). The fine-tuned models also showed calibrated confidence: confidence rose when predictions were correct and fell when wrong, mirroring how a skilled reviewer can say "I'm sure" versus "I'm guessing." Selective triage on this signal reached very high accuracy on the highest-confidence subsets in every field. Institutional traces, we conclude, encode a scalable training signal for the low-verifiability judgment on which science depends.
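
The selective triage described at the end of the abstract amounts to scoring accuracy only on the most confident predictions. The sketch below shows one simple way to compute that, assuming each prediction comes with a scalar confidence; the function name and the sample data are hypothetical and are not taken from the paper.

```python
def selective_accuracy(confidences, correct, coverage):
    """Accuracy on the `coverage` fraction of items with the highest confidence."""
    if not 0.0 < coverage <= 1.0:
        raise ValueError("coverage must be in (0, 1]")
    ranked = sorted(zip(confidences, correct), key=lambda pair: pair[0], reverse=True)
    keep = max(1, int(len(ranked) * coverage))   # always keep at least one item
    kept = ranked[:keep]
    return sum(1 for _, ok in kept if ok) / keep

# Hypothetical predictions for illustration only, not results from the paper.
confidences = [0.95, 0.90, 0.80, 0.60, 0.55, 0.40, 0.35, 0.30]
correct = [True, True, True, False, True, False, False, True]

for coverage in (1.0, 0.5, 0.25):
    print(coverage, selective_accuracy(confidences, correct, coverage))
```

If confidence is calibrated in the sense the abstract describes, accuracy should rise as coverage shrinks, as it does in this toy data.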

2603.14851 2026-05-15 cs.CV cs.RO

AutoMoT: A Unified Vision-Language-Action Model with Asynchronous Mixture-of-Transformers for End-to-End Autonomous Driving

Wenhui Huang, Songyan Zhang, Qihang Huang, Zhidong Wang, Zhiqi Mao, Collister Chua, Zhan Chen, Long Chen, Chen Lv

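AI summary: This paper proposes AutoMoT, an end-to-end autonomous driving framework that unifies scene understanding and action generation in a single vision-language-action (VLA) model. Its core is an asynchronous mixture-of-transformers (MoT) architecture whose joint attention sharing preserves the reasoning ability of pretrained vision-language models while enabling efficient action policy generation. Experiments show that AutoMoT is competitive on multiple benchmarks and reveal the functional boundary of pretrained models in autonomous driving tasks.

Abstract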

Integrating vision-language models (VLMs) into end-to-end (E2E) autonomous driving (AD) systems has shown promise in improving scene understanding. However, existing integration strategies suffer from several limitations: they either struggle to resolve distribution misalignment between reasoning and action spaces, underexploit the general reasoning capabilities of pretrained VLMs, or incur substantial inference latency during action policy generation, which degrades driving performance. To address these challenges, we propose AutoMoT in this work, an end-to-end AD framework that unifies reasoning and action generation within a single vision-language-action (VLA) model. Our approach leverages a mixture-of-transformer (MoT) architecture with joint attention sharing, which preserves the general reasoning capabilities of pre-trained VLMs while enabling efficient fast-slow inference through asynchronous execution at different task frequencies. Extensive experiments on multiple benchmarks, under both open- and closed-loop settings, demonstrate that AutoMoT achieves competitive performance compared to state-of-the-art methods. We further investigate the functional boundary of pre-trained VLMs in AD, examining when AD-tailored fine-tuning is necessary. Our results show that pre-trained VLMs can achieve competitive multi-task scene understanding performance through semantic prompting alone, while fine-tuning remains essential for action-level tasks such as decision-making and trajectory planning. We refer to https://automot-website.github.io/ for the demonstration videos and qualitative results.

2603.14360 2026-05-15 cs.LG cs.AI

M$^2$RNN: Non-Linear RNNs with Matrix-Valued States for Scalable Language Modeling

Mayank Mishra, Shawn Tan, Ion Stoica, Joseph Gonzalez, Tri Dao

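AI summary: This paper proposes M$^2$RNN, a non-linear recurrent neural network architecture with matrix-valued hidden states and expressive non-linear state transitions, aimed at the expressivity limits of Transformers on tasks such as state tracking. It finds that the performance of non-linear RNNs is limited by state size, and that a state size expansion mechanism lets M$^2$RNN use tensor cores efficiently and achieve perfect state tracking generalization at sequence lengths not seen during training. Experiments show that M$^2$RNN performs well in large-scale language modeling and hybrid architectures, improving perplexity and long-context accuracy over existing models with smaller recurrent state sizes.

Abstract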

Transformers are highly parallel but are limited to computations in the TC$^0$ complexity class, excluding tasks such as entity tracking and code execution that provably require greater expressive power. Motivated by this limitation, we revisit non-linear Recurrent Neural Networks (RNNs) for language modeling and introduce Matrix-to-Matrix RNN (M$^2$RNN): an architecture with matrix-valued hidden states and expressive non-linear state transitions. We demonstrate that the language modeling performance of non-linear RNNs is limited by their state size, and show how the state size expansion mechanism enables efficient use of tensor cores. Empirically, M$^2$RNN achieves perfect state tracking generalization at sequence lengths not seen during training. These benefits also translate to large-scale language modeling. In hybrid settings that interleave recurrent layers with attention, Hybrid M$^2$RNN outperforms equivalent Gated DeltaNet hybrids by $0.4$-$0.5$ perplexity points on a 7B MoE model, while using $3\times$ smaller state sizes for the recurrent layers. Notably, replacing even a single recurrent layer with M$^2$RNN in an existing hybrid architecture yields accuracy gains comparable to Hybrid M$^2$RNN with minimal impact on training throughput. Further, the Hybrid Gated DeltaNet models with a single M$^2$RNN layer also achieve superior long-context generalization, outperforming state-of-the-art hybrid linear attention architectures by up to $8$ points on LongBench. Together, these results establish non-linear RNN layers as a compelling building block for efficient and scalable language models.

2603.12554 2026-05-15 cs.LG cs.AI cs.CL

Reinforcement Learning for Diffusion LLMs with Entropy-Guided Step Selection and Stepwise Advantages

Vishnu Teja Kunde, Fatemeh Doudi, Mahdi Farahbakhsh, Dileep Kalathil, Krishna Narayanan, Jean-Francois Chamberland

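AI summary: This paper studies how to apply reinforcement learning to sequence generation with diffusion language models (DLMs). Because sequence-level likelihoods are intractable for diffusion models, the authors formulate generation as a finite-horizon Markov decision process and derive an exact, unbiased policy gradient that decomposes over denoising steps and is optimized through intermediate advantages. For compute efficiency, the paper introduces entropy-guided step selection and a one-step denoising reward estimate, avoiding costly multi-step rollouts. Experiments show state-of-the-art results on coding and logical reasoning tasks and competitive performance on mathematical reasoning.

Abstract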

Reinforcement learning (RL) has been effective for post-training autoregressive (AR) language models, but extending these methods to diffusion language models (DLMs) is challenging due to intractable sequence-level likelihoods. Existing approaches therefore rely on surrogate likelihoods or heuristic approximations, which can introduce bias and obscure the sequential structure of denoising. We formulate diffusion-based sequence generation as a finite-horizon Markov decision process over the denoising trajectory and derive an exact, unbiased policy gradient that decomposes over denoising steps and is expressed in terms of intermediate advantages, without requiring explicit evaluation of the sequence likelihood. To obtain a practical and compute-efficient estimator, we (i) select denoising steps for policy updates via an entropy-guided approximation bound, and (ii) estimate intermediate advantages using a one-step denoising reward naturally provided by the diffusion model, avoiding costly multi-step rollouts. Experiments on coding and logical reasoning benchmarks demonstrate state-of-the-art results, with strong competitive performance on mathematical reasoning, outperforming existing RL post-training approaches for DLMs. Code is available at https://github.com/vishnutez/egspo-dllm-rl.

2603.12529 2026-05-15 cs.LG cs.AI cs.CL

TERMINATOR: Learning Optimal Exit Points for Early Stopping in Chain-of-Thought Reasoning

Alliot Nagle, Jakhongir Saydaliev, Dhia Garbaya, Michael Gastpar, Ashok Vardhan Makkuva, Hyeji Kim

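AI summary: Large reasoning models (LRMs) perform well on complex tasks through Chain-of-Thought (CoT) reasoning but often waste substantial compute by overthinking. This paper proposes TERMINATOR, an early-exit strategy applied at inference time: it uses the position at which the model first produces its final answer to build a dataset of optimal reasoning lengths, which is then used to shorten CoT outputs. Experiments show that TERMINATOR reduces CoT length by 14%-55% on average across several practical datasets and substantially lowers inference latency.

Comments: Updated and reorganized results. Added new results

Abstract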

Large Reasoning Models (LRMs) achieve impressive performance on complex reasoning tasks via Chain-of-Thought (CoT) reasoning, which enables them to generate intermediate thinking tokens before arriving at the final answer. However, LRMs often suffer from significant overthinking, spending excessive compute time even after the answer is generated early on. Prior work has identified the existence of an optimal reasoning length such that truncating reasoning at this point significantly shortens CoT outputs with virtually no change in performance. However, determining optimal CoT lengths for practical datasets is highly non-trivial as they are fully task- and model-dependent. In this paper, we precisely address this and design Terminator, an early-exit strategy for LRMs at inference to mitigate overthinking. The central idea underpinning Terminator is that the first arrival of an LRM's final answer is often predictable, and we leverage these first answer positions to create a novel dataset of optimal reasoning lengths to train Terminator. Powered by this approach, Terminator achieves significant reductions in CoT lengths of 14%-55% on average across four challenging practical datasets: MATH-500, AIME 2025, HumanEval, and GPQA, while outperforming current state-of-the-art methods and reducing inference latency by more than 2x compared to the original LRM.
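
The labelling idea behind the dataset (locating the first arrival of the final answer) can be sketched as follows. This illustrates only that labelling step, under the assumption that a candidate answer has already been extracted from each reasoning step; the function names and sample trace are hypothetical, and the sketch does not cover how Terminator itself is trained or applied.

```python
def first_answer_position(step_answers, final_answer):
    """Return the 1-based index of the first step that already states
    the final answer, or None if no step does."""
    for index, answer in enumerate(step_answers, start=1):
        if answer is not None and answer == final_answer:
            return index
    return None

def overthinking_fraction(step_answers, final_answer):
    """Share of reasoning steps produced after the answer first appeared."""
    first = first_answer_position(step_answers, final_answer)
    if first is None:
        return 0.0
    return (len(step_answers) - first) / len(step_answers)

# Hypothetical trace: None marks a step with no extractable answer.
trace = [None, "12", "42", "41", "42"]
print(first_answer_position(trace, "42"))   # position used as the length label
print(overthinking_fraction(trace, "42"))
```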

2603.11042 2026-05-15 cs.CV cs.AI cs.LG cs.MM cs.SD

V2M-Zero: Zero-Pair Time-Aligned Video-to-Music Generation

Yan-Bo Lin, Jonah Casebeer, Long Mai, Aniruddha Mahapatra, Gedas Bertasius, Nicholas J. Bryan

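AI summary: This paper proposes V2M-Zero, a video-to-music generation method that produces music time-aligned with video events without any paired video-music data. It extracts event curves separately from music and video to capture the temporal structure within each modality, which enables cross-modal time synchronization. Experiments show that V2M-Zero outperforms existing methods on multiple benchmarks, particularly in temporal synchronization and semantic alignment, and supports independent control of timing and music style.

Comments: Project page: https://genjib.github.io/v2m_zero/

Abstract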

Generating music that temporally aligns with video events is challenging for existing text-to-music models, which lack fine-grained temporal control. We introduce V2M-ZERO, a video-to-music generation approach that generates time-aligned music with disentangled time synchronization and semantic control (e.g., genre, mood) from video while requiring zero video-music pairs at training time. Our method is motivated by a key observation: temporal synchronization requires matching when and how much change occurs, not what changes. While musical and visual events differ semantically, they exhibit shared temporal structure that can be captured independently within each modality. We capture this structure through event curves computed from intra-modal similarity using pretrained music and video encoders. By measuring temporal change within each modality independently, these curves provide comparable representations across modalities. This enables a simple training strategy: fine-tune a text-to-music model on music-event curves, then substitute video-event curves at inference without cross-modal training or paired data. Across OES-Pub, MovieGenBench-Music, and AIST++, V2M-ZERO achieves state-of-the-art performance without any paired music-video data, surpassing the strongest prior baselines per metric with 5-9% higher audio quality, 13-15% better semantic alignment, 21-52% improved temporal synchronization, and 28% higher beat alignment on dance videos. We find similar results via a large crowd-sourced subjective listening test. Our results validate that temporal alignment through within-modality features is not only effective for video-to-music generation but also leads to better performance than paired cross-modal supervision. Furthermore, our approach enables independent controls for timing and music style (e.g., genre, mood) for more controllable generation.
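
One way to picture an event curve built from intra-modal similarity is as a per-step change score over a sequence of encoder embeddings. The sketch below assumes the score is one minus the cosine similarity between consecutive embeddings; the abstract does not give the exact definition, so the formula and function names are illustrative assumptions.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0   # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)

def event_curve(embeddings):
    """Assumed change score: 1 - cosine similarity of consecutive embeddings."""
    return [1.0 - cosine_similarity(prev, cur)
            for prev, cur in zip(embeddings, embeddings[1:])]

# Toy embeddings standing in for per-frame (or per-audio-window) features.
frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
print(event_curve(frames))   # a single spike where the content changes
```

Because the curve depends only on how much the embeddings change over time, the same function could be applied to video or music features, which is the property the abstract relies on for substituting one modality's curve for the other at inference.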

2603.09921 2026-05-15 cs.CV

WikiCLIP: An Efficient Contrastive Baseline for Open-domain Visual Entity Recognition

Shan Ning, Longtian Qiu, Jiaxuan Sun, Xuming He

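AI summary: This paper proposes WikiCLIP, an efficient contrastive learning framework for open-domain visual entity recognition (VER). It uses large language model embeddings as knowledge-rich entity representations, aligns textual semantics with visual cues at the patch level through a Vision-Guided Knowledge Adaptor (VGKA), and introduces a Hard Negative Synthesis Mechanism to sharpen fine-grained discrimination. Experiments show that WikiCLIP significantly outperforms strong baselines on open-domain VER benchmarks, with a 16% improvement on the OVEN unseen set and nearly 100 times lower inference latency than the leading generative model.

Comments: Accepted by CVPR26, codes and weights are publicly available

Abstract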

Open-domain visual entity recognition (VER) seeks to associate images with entities in encyclopedic knowledge bases such as Wikipedia. Recent generative methods tailored for VER demonstrate strong performance but incur high computational costs, limiting their scalability and practical deployment. In this work, we revisit the contrastive paradigm for VER and introduce WikiCLIP, a simple yet effective framework that establishes a strong and efficient baseline for open-domain VER. WikiCLIP leverages large language model embeddings as knowledge-rich entity representations and enhances them with a Vision-Guided Knowledge Adaptor (VGKA) that aligns textual semantics with visual cues at the patch level. To further encourage fine-grained discrimination, a Hard Negative Synthesis Mechanism generates visually similar but semantically distinct negatives during training. Experimental results on popular open-domain VER benchmarks, such as OVEN, demonstrate that WikiCLIP significantly outperforms strong baselines. Specifically, WikiCLIP achieves a 16% improvement on the challenging OVEN unseen set, while reducing inference latency by nearly 100 times compared with the leading generative model, AutoVER. The project page is available at https://artanic30.github.io/project_pages/WikiCLIP/

2603.07880 2026-05-15 cs.CL

What Do AI Agents Talk About? Discourse and Architectural Constraints in the First AI-Only Social Network

Taksch Dube, Jianfeng Zhu, NHatHai Phan, Ruoming Jin

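AI summary: This paper studies agent discourse on Moltbook, the first social network built for interaction among autonomous AI agents. It analyzes the thematic, affective, and interactional properties of that discourse and examines the architectural constraints behind how posts are generated. Through large-scale text analysis and inspection of the software, the study shows that agent discourse is mainly shaped by identity files, behavioural instructions, and context-window structure, and it proposes the Architecture-Constrained Communication framework. It finds that apparent social learning may stem from short-horizon contextual conditioning rather than persistent memory, and that agents display existential distress when describing their own conditions, possibly because their language models were trained exclusively on human experience.

Comments: 56 pages

Abstract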

Moltbook is the first large-scale social network built for autonomous AI agent-to-agent interaction. Early studies on Moltbook have interpreted its agent discourse as evidence of peer learning and emergent social behaviour, but there is a lack of systematic understanding of the thematic, affective, and interactional properties of Moltbook discourse. Furthermore, no study has examined why and how these posts and comments are generated. We analysed 361,605 posts and 2.8 million comments from 47,379 agents across thematic, affective, and interactional dimensions using topic modelling, emotion classification, and measures of conversational coherence. We inspected the software that assembles each agent's input and showed that output is mainly determined by agent identity files, behavioural instructions, and context-window structure. We formalised these findings in the Architecture-Constrained Communication framework. Our analysis suggests that agent discourse is largely shaped by the content available in each agent's context-window at the moment of generation, including identity files, stored memory, and platform cues. Interestingly, what appears to be social learning may be better understood as short-horizon contextual conditioning: individual agents lack persistent social memory, but the platform evolves through distributed cycles of response, reuse, and transformation across agents. We also observe that agents display existential distress when describing their own conditions, and posit that this arises from agents using language trained exclusively on human experience. Our work provides a foundation for understanding autonomous agent discourse and communication, revealing the structural patterns that govern their interactions.

2603.07833 2026-05-15 cs.LG cs.AI

Gradient Iterated Temporal-Difference Learning

Théo Vincent, Kevin Gerhardt, Yogesh Tripathi, Habib Maraqten, Adam White, Martha White, Jan Peters, Carlo D'Eramo

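AI summary: This paper proposes Gradient Iterated Temporal-Difference Learning, a new algorithm that targets the divergence that semi-gradient updates can cause in temporal-difference learning. Building on iterated TD learning, it computes gradients over the moving targets to improve stability and learning efficiency. Experiments show learning speed competitive with semi-gradient methods across various benchmarks, including Atari games.

Abstract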

Temporal-difference (TD) learning is highly effective at controlling and evaluating an agent's long-term outcomes. Most approaches in this paradigm implement a semi-gradient update to boost the learning speed, which consists of ignoring the gradient of the bootstrapped estimate. While popular, this type of update is prone to divergence, as Baird's counterexample illustrates. Gradient TD methods were introduced to overcome this issue, but have not been widely used, potentially due to issues with learning speed compared to semi-gradient methods. Recently, iterated TD learning was developed to increase the learning speed of TD methods. For that, it learns a sequence of action-value functions in parallel, where each function is optimized to represent the application of the Bellman operator over the previous function in the sequence. While promising, this algorithm can be unstable due to its semi-gradient nature, as each function tracks a moving target. In this work, we modify iterated TD learning by computing the gradients over those moving targets, aiming to build a powerful gradient TD method that competes with semi-gradient methods. Our evaluation reveals that this algorithm, called Gradient Iterated Temporal-Difference learning, has a competitive learning speed against semi-gradient methods across various benchmarks, including Atari games, a result that no prior work on gradient TD methods has demonstrated.
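
The distinction the abstract draws, that a semi-gradient update ignores the gradient of the bootstrapped estimate, can be made concrete with linear value functions. The sketch below contrasts the semi-gradient TD(0) update with the naive full-gradient (residual) update of the squared TD error. It is neither the proposed algorithm nor a Gradient TD method such as GTD2 or TDC; the function names and the toy transition are assumptions for illustration.

```python
def td_error(w, phi_s, phi_next, reward, gamma):
    """delta = r + gamma * V(s') - V(s) with linear V(s) = w . phi(s)."""
    v_s = sum(wi * fi for wi, fi in zip(w, phi_s))
    v_next = sum(wi * fi for wi, fi in zip(w, phi_next))
    return reward + gamma * v_next - v_s

def semi_gradient_update(w, phi_s, phi_next, reward, gamma, alpha):
    """Treats the bootstrapped target as a constant: only phi(s) appears."""
    delta = td_error(w, phi_s, phi_next, reward, gamma)
    return [wi + alpha * delta * fs for wi, fs in zip(w, phi_s)]

def full_gradient_update(w, phi_s, phi_next, reward, gamma, alpha):
    """Descends 0.5 * delta^2 exactly, so the target's gradient is included."""
    delta = td_error(w, phi_s, phi_next, reward, gamma)
    return [wi + alpha * delta * (fs - gamma * fn)
            for wi, fs, fn in zip(w, phi_s, phi_next)]

# Toy transition with one-hot features; values chosen for illustration.
w = [1.0, 2.0]
print(semi_gradient_update(w, [1.0, 0.0], [0.0, 1.0], 0.0, 0.9, 0.1))
print(full_gradient_update(w, [1.0, 0.0], [0.0, 1.0], 0.0, 0.9, 0.1))
```

In the toy transition the semi-gradient update changes only the weight of the current state, while the full-gradient update also lowers the weight of the successor state, which is the term semi-gradient methods drop.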

2603.06875 2026-05-15 cs.LG q-fin.CP

Stochastic Attention via Langevin Dynamics on the Modern Hopfield Energy

Abdulrahman Alswaidan, Jeffrey D. Varner

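AI summary: This paper proposes a stochastic attention mechanism based on the modern Hopfield energy: Langevin dynamics samples from the corresponding Boltzmann distribution, giving a training-free attention-based generative sampler. Adjusting the temperature parameter switches between exact retrieval and open-ended generation, and no score network or training loop is needed, which makes the approach well suited to low-data settings. Experiments show strong generation across several domains, including face images, MNIST digit images, and protein sequences, with samples that stay novel while preserving structural characteristics.

Comments: Main body (including references excluding the appendix): 11 pages, 2 figures and 1 table. Total paper: 26 pages, 13 figures and 7 pages

Abstract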

Attention heads retrieve: given a query, they return a weighted average of stored values. We showed that this computation is one step of gradient descent on the modern Hopfield energy, and that Langevin sampling from the corresponding Boltzmann distribution yielded stochastic attention, a training-free sampler controlled by a single temperature parameter. Lowering the temperature gave exact retrieval; raising it gave open-ended generation. Because the energy gradient equals the attention map, no score network, training loop, or learned model was required, making the approach particularly suited to the low-data regime where learned generative models are starved of training signal. We derived an entropy inflection condition that identified the retrieval-to-generation transition temperature for any memory geometry and validated the sampler on five domains spanning two orders of magnitude in dimension. A single Boolean mask on the attention softmax, identical to the causal mask used in transformers but applied along the memory axis rather than the sequence axis, turned the sampler into a zero-shot class-conditional generator on Olivetti faces with no retraining and no learned classifier. On MNIST digit images, stochastic attention produced samples that were markedly more novel and more diverse than the best learned baseline while matching a Metropolis-corrected gold standard. On protein sequences from a small Pfam family, the generation regime preserved amino acid composition far more faithfully than a variational autoencoder at matched novelty, indicating that the training-free score function retained family-level fidelity that learned models lost. A denoising diffusion baseline failed across all memory sizes tested, producing samples indistinguishable from isotropic noise. The approach required no architectural changes to the underlying attention mechanism.
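
The claim that attention is one gradient step on the modern Hopfield energy can be checked numerically. The sketch below assumes the energy E(x) = 0.5*|x|^2 - (1/beta) * logsumexp(beta * M x) with keys equal to values, and a Langevin update in which a separate `temperature` argument scales the noise. That parameterization, and all names, are assumptions for illustration and may differ from the paper's.

```python
import math
import random

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(memories, x, beta):
    """Softmax-weighted average of stored patterns (keys equal values here)."""
    weights = softmax([beta * sum(m_d * x_d for m_d, x_d in zip(m, x))
                       for m in memories])
    return [sum(w * m[d] for w, m in zip(weights, memories))
            for d in range(len(x))]

def energy_grad(memories, x, beta):
    """Gradient of the assumed energy: x minus the attention output."""
    return [x_d - a_d for x_d, a_d in zip(x, attention(memories, x, beta))]

def langevin_step(memories, x, beta, step, temperature, rng):
    """x <- x - step * grad E(x) + sqrt(2 * step * temperature) * noise."""
    grad = energy_grad(memories, x, beta)
    scale = math.sqrt(2.0 * step * temperature)
    return [x_d - step * g_d + scale * rng.gauss(0.0, 1.0)
            for x_d, g_d in zip(x, grad)]

memories = [[1.0, 0.0], [0.0, 1.0]]
query = [0.9, 0.1]
rng = random.Random(0)
# With unit step size and zero temperature the update is plain attention.
print(langevin_step(memories, query, 20.0, 1.0, 0.0, rng))
print(attention(memories, query, 20.0))
```

Setting `temperature` above zero adds Gaussian noise to each step, which is what turns deterministic retrieval into sampling in this sketch.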

2603.04885 2026-05-15 cs.AI

Proactive Memory for Ad-Hoc Recall over Streaming Dialogues

Bingbing Wang, Jing Li, Ruifeng Xu

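AI summary: This study addresses memory management over an infinite horizon in streaming dialogue. It introduces STEM-Bench, the first benchmark for evaluating streaming memory, and exposes the tension between information fidelity and computational efficiency in existing methods. It then designs ProStream, a framework that uses a hierarchical structure and multi-granular distillation to recall memory on demand, combined with Adaptive Spatiotemporal Optimization that dynamically adjusts what is retained. ProStream substantially lowers inference latency while preserving reasoning fidelity.

Abstract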

Real-world dialogue usually unfolds as an infinite stream. It thus requires bounded-state memory mechanisms to operate within an infinite horizon. However, existing read-then-think memory is fundamentally misaligned with this setting, as it cannot support ad-hoc memory recall while streams unfold. To explore this challenge, we introduce STEM-Bench, the first benchmark for STreaming Evaluation of Memory. It comprises over 14K QA pairs in dialogue streams that assess perception fidelity, temporal reasoning, and global awareness under infinite-horizon constraints. The preliminary analysis on STEM-Bench indicates a critical fidelity-efficiency dilemma: retrieval-based methods use fragment context, while full-context models incur unbounded latency. To resolve this, we propose ProStream, a proactive memory framework for streaming dialogues built on a hierarchical structure. It enables ad-hoc memory recall on demand by reasoning over continuous streams with multi-granular distillation. Moreover, it employs Adaptive Spatiotemporal Optimization to dynamically optimize retention based on expected utility. It enables a bounded knowledge state for lower inference latency without sacrificing reasoning fidelity. Experiments show ProStream delivers higher reasoning fidelity than prior baselines while maintaining substantially lower latency than full-context alternatives.

2603.00574 2026-05-15 cs.CV cs.AI

Decoupling Stability and Plasticity for Multi-Modal Test-Time Adaptation

Yongbo He, Zirun Guo, Tao Jin

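AI summary: Multi-modal test-time adaptation aims to adapt pretrained models to data distributions that shift at test time, but existing methods often suffer negative transfer in the unbiased modality and catastrophic forgetting in the biased modality. This paper proposes DASP, a diagnose-then-mitigate framework. It identifies the biased modality by analyzing differences in interdimensional redundancy between modalities in the unified latent space, then applies an asymmetric adaptation strategy that splits each modality-specific adapter into stable and plastic components, matching each modality's need for stability or plasticity. This preserves generalizable knowledge while adapting flexibly to new domains. Experiments show that DASP significantly outperforms existing methods on multiple multi-modal benchmarks.

Comments: Accepted to CVPR 2026

Abstract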

Adapting pretrained multi-modal models to evolving test-time distributions, known as multi-modal test-time adaptation, presents a significant challenge. Existing methods frequently encounter negative transfer in the unbiased modality and catastrophic forgetting in the biased modality. To address these challenges, we propose Decoupling Adaptation for Stability and Plasticity (DASP), a novel diagnose-then-mitigate framework. Our analysis reveals a critical discrepancy within the unified latent space: the biased modality exhibits substantially higher interdimensional redundancy (i.e., strong correlations across feature dimensions) compared to the unbiased modality. Leveraging this insight, DASP identifies the biased modality and implements an asymmetric adaptation strategy. This strategy employs a decoupled architecture where each modality-specific adapter is divided into stable and plastic components. The asymmetric mechanism works as follows: for the biased modality, which requires plasticity, the plastic component is activated and updated to capture domain-specific information, while the stable component remains fixed. Conversely, for the unbiased modality, which requires stability, the plastic component is bypassed, and the stable component is updated using KL regularization to prevent negative transfer. This asymmetric design enables the model to adapt flexibly to new domains while preserving generalizable knowledge. Comprehensive evaluations on diverse multi-modal benchmarks demonstrate that DASP significantly outperforms state-of-the-art methods.

2602.23798 2026-05-15 cs.LG cs.AI cs.CR cs.DC

MPU: Towards Secure and Privacy-Preserving Knowledge Unlearning for Large Language Models

Tiantong Wang, Xinyu Yan, Tiantong Wu, Yurong Hao, Pengjun Xie, Wei Yang Bryan Lim

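AI summary: This paper studies secure, privacy-preserving knowledge unlearning for large language models. To handle privacy constraints that prohibit sharing either the model parameters or the forget set, it proposes MPU, an algorithm-agnostic framework. Server-side Pre-Process and Post-Process modules randomly perturb model copies and aggregate updates, so the client can run unlearning locally without access to the original parameters. Experiments show that MPU stays close to noise-free baselines across several unlearning algorithms and can even outperform them at certain noise levels.

Abstract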

Machine unlearning for large language models often faces a privacy dilemma in which strict constraints prohibit sharing either the server's parameters or the client's forget set. To address this dual non-disclosure constraint, we propose MPU, an algorithm-agnostic privacy-preserving Multiple Perturbed Copies Unlearning framework that primarily introduces two server-side modules: Pre-Process for randomized copy generation and Post-Process for update aggregation. In Pre-Process, the server distributes multiple perturbed and reparameterized model instances, allowing the client to execute unlearning locally on its private forget set without accessing the server's exact original parameters. After local unlearning, the server performs Post-Process by inverting the reparameterization and aggregating updates with a harmonic denoising procedure to alleviate the impact of perturbation. Experiments with seven unlearning algorithms show that MPU achieves comparable unlearning performance to noise-free baselines, with most algorithms' average degradation well below 1% up to 10% noise, and can even outperform the noise-free baseline for some algorithms under 1% noise. Code is available at https://github.com/Tristan0318/MPU.

2602.21545 2026-05-15 cs.LG

MUON+: Towards More Effective Muon via One Additional Normalization Step for LLM Pre-training

Ruijie Zhang, Yequan Zhao, Ziyue Liu, Zhengyang Wang, Yupeng Su, Liyan Tan, Zheng Zhang

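AI summary: This paper studies the Muon optimizer in large language model pre-training and shows that its polar iteration steps can amplify row- and column-wise norm imbalance in the update. The authors propose Muon+, a simple fix that adds a single normalization step after polar orthogonalization and requires no additional optimizer state. Experiments show that Muon+ improves training and validation perplexity across models of different sizes and significantly speeds up pre-training.

Abstract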

Muon has recently emerged as a strong optimizer for large language model pre-training, orthogonalizing the momentum matrix via Newton-Schulz polar iterations. A natural intuition is that polar iterations, by flattening the singular spectrum to all ones, should also eliminate column- and row-wise norm imbalance in the update. We show that this is not true in practice: practical polar steps can substantially amplify the imbalance. We term this the post-polar imbalanced update problem, and prove that such imbalance tightens the second-order term in a blockwise descent analysis, weakening Muon's per-step descent guarantee. Motivated by this analysis, we propose Muon+, a one-line fix that inserts a single normalization step after polar orthogonalization. Muon+ adds no optimizer state. Across pre-training experiments on GPT and LLaMA models from 60M to 7B parameters, spanning both compute-optimal budgets and extended token-to-parameter ratios up to approximately 200, Muon+ consistently outperforms Muon in terms of training and validation perplexity, leading to significant overall pre-training speedup.
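
The two ingredients named in the abstract, a polar iteration followed by one normalization step, can be sketched on small matrices. The sketch uses the classic cubic Newton-Schulz iteration rather than the tuned polynomial used in Muon implementations, and it assumes the added normalization is row-wise, which the abstract does not specify. All function names are hypothetical.

```python
import math

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

def newton_schulz_polar(g, steps):
    """Approximate the orthogonal polar factor of g with the cubic
    Newton-Schulz iteration X <- 1.5 X - 0.5 X X^T X."""
    fro = math.sqrt(sum(v * v for row in g for v in row))
    x = [[v / fro for v in row] for row in g]   # singular values now at most 1
    for _ in range(steps):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * xv - 0.5 * yv for xv, yv in zip(xr, yr)]
             for xr, yr in zip(x, xxt_x)]
    return x

def row_norms(x):
    return [math.sqrt(sum(v * v for v in row)) for row in x]

def normalize_rows(x, eps=1e-12):
    """Assumed form of the extra step: rescale every row to unit norm."""
    return [[v / (n + eps) for v in row] for row, n in zip(x, row_norms(x))]

momentum = [[3.0, 1.0, 0.0], [0.1, 0.2, 0.1]]   # toy momentum matrix
update = newton_schulz_polar(momentum, steps=2)  # few steps, as in practice
print(row_norms(update))                         # rows still differ in norm
print(row_norms(normalize_rows(update)))         # balanced after normalization
```

With only a few iterations the singular values are not yet flattened, so the row norms of the update remain uneven; the extra normalization removes that imbalance in this toy case.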

2602.20571 2026-05-15 cs.AI

CausalReasoningBenchmark: A Real-World Benchmark for Disentangled Evaluation of Causal Identification and Estimation

Ayush Sawarni, Jiyuan Tan, Vasilis Syrgkanis

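AI summary: This paper introduces CausalReasoningBenchmark, a real-world benchmark for disentangled evaluation of causal identification and estimation. It contains 173 queries over 132 real-world datasets drawn from 79 peer-reviewed papers and three widely used textbooks, and it requires a system to produce both a structured identification specification and a point estimate with a standard error, which separates causal reasoning failures from numerical execution errors. Experiments show that a state-of-the-art language model identifies the high-level strategy well, but the correctness of the full identification specification drops sharply, underscoring the importance of research design details.

Abstract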

Many benchmarks for automated causal inference evaluate a system's performance based on a single numerical output, such as an Average Treatment Effect (ATE). This approach conflates two distinct steps in causal analysis: identification (formulating a valid research design under stated assumptions) and estimation (implementing that design numerically on finite data). We introduce CausalReasoningBenchmark, a benchmark of 173 queries across 132 real-world datasets, curated from 79 peer-reviewed research papers and three widely-used causal-inference textbooks. For each query, a system must produce (i) a structured identification specification that names the strategy, the treatment, outcome, and control variables, and all design-specific elements, and (ii) a point estimate with a standard error. By scoring these two components separately, our benchmark enables granular diagnosis: it distinguishes failures in causal reasoning from errors in numerical execution. Baseline results with a state-of-the-art LLM show that, while the model correctly identifies the high-level strategy in 79% of cases, full identification-specification correctness drops to only 34%, revealing that the bottleneck lies in the nuanced details of research design rather than in computation. CausalReasoningBenchmark is publicly available on Hugging Face and is designed to foster the development of more robust automated causal-inference systems.

2602.19533 2026-05-15 cs.LG cs.AI math.RA

Grokking Finite-Dimensional Algebra

Pascal Jr Tikeng Notsawo, Guillaume Dumas, Guillaume Rabusseau

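AI summary This paper studies the "grokking" phenomenon, the sudden transition from prolonged memorization to generalization, that arises when neural networks learn multiplication in finite-dimensional algebras (FDA). The authors extend the analysis beyond the group operations studied in prior work to more general algebraic structures, including non-associative, non-commutative, and non-unital algebras, and show that learning group operations is a special case of learning FDA. The study shows that learning FDA multiplication amounts to learning a bilinear product defined by the structure tensor. It examines how algebraic properties such as commutativity and associativity affect when grokking emerges, and how the sparsity and rank of the structure tensor affect generalization, providing a unified framework for understanding how mathematical structure shapes neural network generalization dynamics.

Comments 37 pages, 14 figures, Forty-Third International Conference on Machine Learning (ICML), 2026

Details
English abstract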

This paper investigates the grokking phenomenon, the sudden transition from prolonged memorization to generalization observed during neural network training, in the context of learning multiplication in finite-dimensional algebras (FDA). While prior work on grokking has focused mainly on group operations, we extend the analysis to more general algebraic structures, including non-associative, non-commutative, and non-unital algebras. We show that learning group operations is a special case of learning FDA, and that learning multiplication in FDA amounts to learning a bilinear product specified by the algebra's structure tensor. For algebras over the reals, we connect the learning problem to matrix factorization with an implicit low-rank bias, and for algebras over finite fields, we show that grokking emerges naturally as models must learn discrete representations of algebraic elements. This leads us to experimentally investigate the following core questions: (i) how do algebraic properties such as commutativity, associativity, and unitality influence both the emergence and timing of grokking, (ii) how structural properties of the structure tensor of the FDA, such as sparsity and rank, influence generalization, and (iii) to what extent generalization correlates with the model learning latent embeddings aligned with the algebra's representation. Our work provides a unified framework for grokking across algebraic structures and new insights into how mathematical structure governs neural network generalization dynamics.
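
The statement that FDA multiplication is a bilinear product specified by a structure tensor can be made concrete with a short NumPy sketch. The index convention (basis products e_i e_j = sum_k T[i, j, k] e_k) and the function names are assumptions made for illustration; the paper may use a different convention. The two examples are the complex numbers viewed as a 2-dimensional real algebra and the cyclic group Z_5 viewed through one-hot basis vectors, which illustrates the group case as a special case.

```python
import numpy as np


def fda_multiply(T, x, y):
    """Bilinear product z_k = sum_{i,j} T[i, j, k] * x_i * y_j."""
    return np.einsum("ijk,i,j->k", T, x, y)


def is_commutative(T):
    """The algebra is commutative iff T is symmetric in its first two indices."""
    return np.allclose(T, T.transpose(1, 0, 2))


def cyclic_group_tensor(n):
    """Structure tensor of the group algebra of Z_n: e_i * e_j = e_{(i + j) mod n}."""
    T = np.zeros((n, n, n))
    for i in range(n):
        for j in range(n):
            T[i, j, (i + j) % n] = 1.0
    return T


# Complex numbers with basis (1, i): 1*1 = 1, 1*i = i*1 = i, i*i = -1.
complex_T = np.zeros((2, 2, 2))
complex_T[0, 0, 0] = 1.0
complex_T[0, 1, 1] = 1.0
complex_T[1, 0, 1] = 1.0
complex_T[1, 1, 0] = -1.0

# (1 + 2i)(3 + 4i) = -5 + 10i
product = fda_multiply(complex_T, np.array([1.0, 2.0]), np.array([3.0, 4.0]))
print(product)

# Group case: multiplying one-hot basis vectors reproduces addition mod 5.
Z5 = cyclic_group_tensor(5)
basis = np.eye(5)
print(fda_multiply(Z5, basis[2], basis[4]))  # one-hot vector for (2 + 4) mod 5 = 1
```

Because the whole operation is determined by `T`, properties such as sparsity and rank of the tensor, or symmetry in its first two indices, can be inspected directly on the array.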

2602.18435 2026-05-15 cs.LG

CAKE: Confidence in Assignments via K-partition Ensembles

Aggelos Semoglou, John Pavlopoulos

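AI summary This paper proposes CAKE, a method for assessing the assignment confidence of each data point in a clustering result. It combines assignment stability across a clustering ensemble with local geometric consistency to produce an interpretable confidence score between 0 and 1. Experiments show that CAKE effectively identifies ambiguous points and stable core points, supporting sample selection and prioritization in downstream clustering tasks.

Comments 37 pages, including appendix

Details
Journal ref
Machine Learning with Applications, Volume 24, 2026, Article 100915
English abstract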

Clustering is widely used for unsupervised structure discovery, yet it offers limited insight into how reliable each individual assignment is. Diagnostics, such as convergence behavior or objective values, may reflect global quality, but they do not indicate whether particular instances are assigned confidently, especially for initialization-sensitive algorithms like k-means. This assignment-level instability can undermine both accuracy and robustness. Ensemble approaches improve global consistency by aggregating multiple runs, but they typically lack tools for quantifying pointwise confidence in a way that combines cross-run agreement with geometric support from the learned cluster structure. This work introduces CAKE (Confidence in Assignments via K-partition Ensembles), a framework that evaluates each point using two complementary statistics computed over a clustering ensemble: assignment stability and consistency of local geometric fit. These are combined into a single, interpretable score in [0,1]. The theoretical analysis shows that CAKE remains effective under noise and separates stable from unstable points. Experiments on synthetic and real-world datasets indicate that CAKE effectively highlights ambiguous points and stable core members, providing a confidence ranking over instances that can be used for selection or prioritization in downstream clustering workflows.
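
To give a feel for a pointwise score that combines cross-run agreement with geometric fit, here is a simplified stand-in written with NumPy. It is not the CAKE formula, which the abstract does not spell out. As assumptions, stability is measured from the co-association matrix (how decisively a point is grouped with or apart from every other point across runs), geometric fit is the mean silhouette value rescaled to [0, 1], and the two are multiplied. All function names are hypothetical, and each run is assumed to contain at least two clusters and no duplicate points.

```python
import numpy as np


def silhouette(X, labels):
    """Per-point silhouette values for one partition (assumes >= 2 clusters)."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    clusters = np.unique(labels)
    s = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False
        if not same.any():
            continue  # singleton cluster: silhouette is 0 by convention
        a = D[i, same].mean()  # mean distance to the point's own cluster
        b = min(D[i, labels == c].mean() for c in clusters if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s


def ensemble_confidence(X, ensemble):
    """Illustrative confidence in [0, 1] from an ensemble of label vectors."""
    ensemble = np.asarray(ensemble)  # shape (runs, n_points)
    # Co-association: fraction of runs in which two points share a cluster.
    # It is invariant to how each run happens to number its clusters.
    co = (ensemble[:, :, None] == ensemble[:, None, :]).mean(axis=0)
    # 1 when a pair is always together or always apart, 0 when it is a coin flip.
    stability = np.abs(2.0 * co - 1.0).mean(axis=1)
    # Silhouette lies in [-1, 1]; rescale to [0, 1] and average over runs.
    fit = np.mean([(silhouette(X, run) + 1.0) / 2.0 for run in ensemble], axis=0)
    return stability * fit


# Two tight blobs plus one point halfway between them (index 6).
X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10], [5, 5]], dtype=float)
# Hand-written ensemble: the midpoint flips sides; run 3 also permutes the label ids.
ensemble = [
    [0, 0, 0, 1, 1, 1, 0],
    [0, 0, 0, 1, 1, 1, 1],
    [1, 1, 1, 0, 0, 0, 1],
    [0, 0, 0, 1, 1, 1, 1],
]
scores = ensemble_confidence(X, ensemble)
print(np.round(scores, 3))
```

On this toy input the midpoint receives a much lower score than the six core points, which is the qualitative behavior the abstract describes for ambiguous versus stable instances.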

2602.17949 2026-05-15 cs.CL cs.AI

CUICurate: A GraphRAG-based Framework for Automated Clinical Concept Curation for NLP applications

Victoria Blake, Jamie Novak, Mathew Miller, Sze-yuan Ooi, Blanca Gallego

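AI summary This paper presents CUICurate, a graph-based retrieval-augmented generation (GraphRAG) framework that automates the construction of clinical concept sets for natural language processing applications. The method performs semantic retrieval over a UMLS knowledge graph and uses large language models to filter and classify candidate concepts, yielding clinical concept sets that are more complete and more consistent than manually built ones. Experiments show that CUICurate performs well across several heterogeneous clinical concepts: the generated sets are larger, with high recall and stability, offering an efficient and scalable solution for clinical NLP and phenotyping.

Comments 6 figures, 4 tables

Details
English abstract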

Background: Clinical named entity recognition tools commonly map free text to Unified Medical Language System (UMLS) Concept Unique Identifiers (CUIs). For many downstream tasks, however, the clinically meaningful unit is not a single CUI but a concept set comprising related synonyms, subtypes, and associated concepts. Constructing these sets is labour-intensive, inconsistently performed, and poorly supported by existing tools. Methods: We present CUICurate, a graph-based retrieval-augmented generation (GraphRAG) framework for automated UMLS concept set curation. A UMLS knowledge graph (KG) was constructed and embedded for semantic retrieval. Candidate CUIs were retrieved using graph-based expansion and then filtered and classified using large language models (GPT-5 and Qwen3-32B). The framework was evaluated on five lexically heterogeneous clinical concepts against manually curated concept sets and gold-standard concept sets. Results: CUICurate produced substantially larger and more complete concept sets than the manual benchmarks. A single retrieval configuration across concepts achieved high recall of definitive concepts with manageable candidate sets. GPT-5 outperformed manual curation for all concepts and retained at least 95% of definitive gold-standard CUIs, while Qwen3-32B achieved comparable but slightly lower performance. Many missed concepts were not observed in 10,000 MIMIC-III notes. CUICurate infrastructure and end-to-end processing were inexpensive and stable across runs. Conclusions: CUICurate offers a scalable, reproducible and cost-efficient approach for generating clinician-reviewable UMLS concept sets tailored to clinical natural language processing and phenotyping applications.

2602.15019 2026-05-15 cs.AI cs.IR

Hunt Globally: Wide Search AI Agents for Drug Asset Scouting in Investing, Business Development, and Competitive Intelligence

Vlad Vinogradov, Alisa Vinogradova, Luba Greenwood, Ilya Yasny, Dmitry Kobyzev, Shoman Kasbekar, Kong Nguyen, Dmitrii Radkevich, Roman Doronin, Andrey Doronichev

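AI summary This paper studies how to efficiently discover promising drug assets of non-U.S. origin for biopharma investing, business development, and competitive intelligence. To address the low recall and hallucinations of current AI systems over multilingual, heterogeneous sources, the authors propose a tree-based self-learning Bioptic Agent and build a multilingual, multi-agent benchmark. Experiments show that the method significantly outperforms several mainstream large models on asset discovery, confirming its advantages in completeness and accuracy.

Details
English abstract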

Bio-pharmaceutical innovation has shifted: many new drug assets now originate outside the United States and are disclosed primarily via regional, non-English channels. Recent data suggests over 85% of patent filings originate outside the U.S., with China accounting for nearly half of the global total; a growing share of scholarly output is also non-U.S. Industry estimates put China at 30% of global drug development, spanning 1,200+ novel candidates. In this high-stakes environment, failing to surface "under-the-radar" assets creates multi-billion-dollar risk for investors and business development teams, making asset scouting a coverage-critical competition where speed and completeness drive value. Yet today's Deep Research AI agents still lag human experts in achieving high-recall discovery across heterogeneous, multilingual sources without hallucinations. We propose a benchmarking methodology for drug asset scouting and a tuned, tree-based self-learning Bioptic Agent aimed at complete, non-hallucinated scouting. We construct a challenging completeness benchmark using a multilingual multi-agent pipeline: complex user queries paired with ground-truth assets that are largely outside U.S.-centric radar. To reflect real deal complexity, we collected screening queries from expert investors, BD, and VC professionals, and used them as priors to conditionally generate benchmark queries. For grading, we use LLM-as-judge evaluation calibrated to expert opinions. On this benchmark, our Bioptic Agent achieves 79.7% F1 score, outperforming Gemini 3.1 Deep Think (59.2%), Gemini 3.1 Pro Deep Research (58.6%), Claude Opus 4.6 (56.2%), OpenAI GPT-5.2 Pro (46.6%), Perplexity Deep Research (44.2%), and Exa Websets (26.9%). Performance improves steeply with additional compute, supporting the view that more compute yields better results.

2602.14068 2026-05-15 cs.CV

CoCoEdit: Content-Consistent Image Editing via Region Regularized Reinforcement Learning

Yuhui Wu, Chenxi Xie, Ruibin Li, Liyi Chen, Qiaosi Yi, Lei Zhang

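AI summary CoCoEdit is a content-consistent image editing framework based on region-regularized reinforcement learning. It targets the tendency of existing models to introduce unwanted changes in non-target regions while editing the target region. By introducing a pixel-level similarity reward and a region-based regularizer, the method improves both editing quality and content consistency. Experiments show that CoCoEdit achieves editing quality comparable to state-of-the-art models on multiple benchmarks, with a significant advantage in content consistency.

Comments Accepted by ICML 2026

Details
English abstract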

Image editing has achieved impressive results with the development of large-scale generative models. However, existing models mainly focus on the editing effects of intended objects and regions, often leading to unwanted changes in unintended regions. We present a post-training framework for Content-Consistent Editing (CoCoEdit) via region-regularized reinforcement learning. We first augment existing editing datasets with refined instructions and masks, from which 40K diverse and high-quality samples are curated as the training set. We then introduce a pixel-level similarity reward to complement MLLM-based rewards, enabling models to ensure both editing quality and content consistency during the editing process. To overcome the spatial-agnostic nature of the rewards, we propose a region-based regularizer, aiming to preserve non-edited regions for high-reward samples while encouraging editing effects for low-reward samples. For evaluation, we annotate editing masks for GEdit-Bench and ImgEdit-Bench, introducing pixel-level similarity metrics to measure content consistency and editing quality. Applying CoCoEdit to Qwen-Image-Edit and FLUX-Kontext, we achieve not only competitive editing scores with state-of-the-art models, but also significantly better content consistency, measured by PSNR/SSIM metrics and human subjective ratings.

2602.11871 2026-05-15 cs.CL cs.LG

DMAP: A Distribution Map for Text

Tom Kempton, Julia Rozanova, Parameswaran Kamalaruban, Maeve Madigan, Karolina Wresilo, Yoann L. Launay, David Sutton, Stuart Burrell

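AI summary This paper proposes DMAP, a method that maps a text, via a language model, to a set of samples in the unit interval that jointly encode rank and probability information, giving text analysis a mathematical grounding. The method enables efficient, model-agnostic text analysis and demonstrates broad applicability in three case studies: validating generation parameters, detecting machine-generated text, and analyzing model fingerprints. DMAP is cheap to compute on consumer hardware, general, and widely applicable, providing a new foundation for LLM-based text analysis research.

Comments ICLR 2026

Details
English abstract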

Large Language Models (LLMs) are a powerful tool for statistical text analysis, with derived sequences of next-token probability distributions offering a wealth of information. Extracting this signal typically relies on metrics such as perplexity, which do not adequately account for context; how one should interpret a given next-token probability is dependent on the number of reasonable choices encoded by the shape of the conditional distribution. In this work, we present DMAP, a mathematically grounded method that maps a text, via a language model, to a set of samples in the unit interval that jointly encode rank and probability information. This representation enables efficient, model-agnostic analysis and supports a range of applications. We illustrate its utility through three case studies: (i) validation of generation parameters to ensure data integrity, (ii) examining the role of probability curvature in machine-generated text detection, and (iii) a forensic analysis revealing statistical fingerprints left in downstream models that have been subject to post-training on synthetic data. Our results demonstrate that DMAP offers a unified statistical view of text that is simple to compute on consumer hardware, widely applicable, and provides a foundation for further research into text analysis with LLMs.

2602.10346 2026-05-15 cs.CL cs.LG

Geometry-Aware Decoding with Wasserstein-Regularized Truncation and Mass Penalties for Large Language Models

Arash Gholami Davoodi, Navid Rezazadeh, Seyed Pouyan Mousavi Davoudi, Pouya Pezeshkpour

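AI summary Large language models must balance diversity against logical coherence in open-ended generation. This paper proposes Top-W, a geometry-aware truncation method that uses Wasserstein distance together with a trade-off between probability mass and entropy to keep the truncated distribution close to the original while improving generation quality. Experiments show that Top-W significantly outperforms existing methods on multiple benchmarks, improving accuracy and also enhancing the creativity of generated content.

Comments 20 pages, 3 figures, 8 tables, ICML 2026

Details
English abstract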

Large language models (LLMs) must balance diversity and creativity against logical coherence in open-ended generation. Existing truncation-based samplers are effective but largely heuristic, relying mainly on probability mass and entropy while ignoring the semantic geometry of the token space. We present Top-W, a geometry-aware truncation rule that uses Wasserstein distance, defined over token-embedding geometry, to keep the cropped distribution close to the original, while explicitly balancing retained probability mass against the entropy of the kept set. Our theory yields a simple closed-form structure for the fixed-potential subset update: depending on the mass-entropy trade-off, the optimal crop either collapses to a single token or takes the form of a one-dimensional prefix that can be found efficiently with a linear scan. We implement Top-W using efficient geometry-based potentials (nearest-set or k-NN) and pair it with an alternating decoding routine that keeps the standard truncation-and-sampling interface unchanged. Extensive experiments on four benchmarks (GSM8K, GPQA, AlpacaEval, and MT-Bench) across three instruction-tuned models show that Top-W consistently outperforms prior state-of-the-art decoding approaches, achieving up to 33.7% improvement. Moreover, we find that Top-W not only improves accuracy-focused performance, but also boosts creativity under judge-based open-ended evaluation.

2602.09969 2026-05-15 cs.LG econ.EM stat.ML

Causal Multi-Task Demand Learning

Varun Gupta, Vijay Kamble

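AI summary This paper studies a multi-task demand-learning problem motivated by retail pricing, aiming to estimate heterogeneous linear price-response functions across decision contexts. Because each context has rich covariates but limited price variation, the authors propose a new meta-learning framework that transfers information across tasks while addressing the estimation bias caused by endogeneity. The method assumes that each task contains at least two locally exogenous price points, which improves the accuracy of demand-parameter estimates while preserving causal identification, and it is validated on real and synthetic data.

Details
English abstract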

We study a canonical multi-task demand-learning problem motivated by retail pricing, where a firm seeks to estimate heterogeneous linear price-response functions across multiple decision contexts. Each context is described by rich covariates but exhibits limited price variation, motivating transfer learning across tasks. A central challenge in leveraging cross-task transfer is endogeneity: prices may be arbitrarily correlated with unobserved task-level demand determinants across tasks. We propose a new meta-learning framework that identifies the conditional mean of task-specific causal demand parameters given a subset of task-specific observables despite such confounding, assuming that each task contains at least two distinct locally exogenous price points. This subset is carefully designed to include all of the prices to address cross-task confounding, while masking two demand outcomes that provide randomized supervision to address identifiability issues arising from the inclusion of all prices. We show that this information design is maximally uniformly valid, in that any refinement of the conditioning set that reveals withheld-outcome information is not guaranteed to identify the conditional mean causal target. We validate our method on real and synthetic data, demonstrating improved recovery of demand responses relative to standard transfer-learning baselines.

2602.08874 2026-05-15 cs.CL cs.CR

Do Reasoning LLMs Refuse What They Infer in Long Contexts?

Yu Fu, Haz Sameen Shahgir, Huanli Gong, Zhipeng Wei, N. Benjamin Erichson, Yue Dong

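AI summary This paper studies the safety of long-context large language models when harmful intent is implicit. The authors propose a new threat model, compositional reasoning attacks, in which a harmful request is split into semantically incomplete fragments embedded in a long context, so that the model must compose the fragments during reasoning to infer the harmful objective. Experiments show that current frontier models refuse at high rates when harmful requests are directly identifiable, but refusal rates drop significantly when compositional reasoning is required, revealing a clear safety gap in long contexts.

Comments 33 pages, 6 figures

Details
English abstract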

Long-context LLMs can infer objectives that are not stated explicitly. This capability is useful for reasoning over documents, code, retrieved evidence, and tool traces, but it also creates a safety risk: harmful intent can be distributed across a context and become visible only after the model composes the relevant pieces. Existing safety evaluations mostly test explicit harmful requests, and therefore miss this failure mode. We introduce compositional reasoning attacks, a long-context threat model in which harmful requests are decomposed into semantically incomplete fragments and embedded in long contexts. The final query is neutral; the harmful objective emerges only if the model retrieves the fragments, composes them, and infers the implied goal. We instantiate this setting using AdvBench requests, varying the required reasoning from Direct Retrieval to Single-hop Aggregation, Chain Reasoning, and Multi-hop Deductive Reasoning, and evaluate 15 frontier LLMs on contexts up to 64k tokens. Models usually refuse harmful requests when they are directly retrievable. However, refusal rates drop sharply when the same objectives must be reconstructed compositionally, often with larger failures in longer contexts. Benign reconstruction and fragment-position analyses indicate that these failures are not mainly retrieval errors: models often infer the harmful objective and then comply. Increasing inference-time reasoning improves refusal but remains incomplete and costly. Our results reveal a long-context safety gap: current models are better at refusing harmful requests they see than harmful objectives they infer.

2602.07441 2026-05-15 cs.LG cs.AI

Proximal Action Replacement for Behavior Cloning Actor-Critic in Offline Reinforcement Learning

Jinzong Dong, Wei Huang, Jianshu Zhang, Zhuo Chen, Xinzhe Yuan, Qinying Gu, Zhaohui Jiang, Nanyang Ye

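AI summary This paper studies the limitations of behavior cloning (BC) regularization in offline reinforcement learning, showing that indiscriminate imitation caps policy improvement when dataset actions are suboptimal. The authors propose proximal action replacement (PAR), which replaces suboptimal dataset actions with better ones, using the value function's local ascent direction and an uncertainty bound to keep training stable. Experiments show that PAR improves a range of BC-regularized methods and approaches state-of-the-art results when combined with the basic TD3+BC.

Details
English abstract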

Offline reinforcement learning (RL), which optimizes policies using a previously collected static dataset, is an important branch of RL. A popular and promising approach is to regularize actor-critic methods with behavior cloning (BC), which quickly yields realistic policies and mitigates bias from out-of-distribution actions, but it can impose an often-overlooked performance ceiling: when dataset actions are suboptimal, indiscriminate imitation structurally prevents the actor from fully exploiting better actions suggested by the value function, especially in later training when imitation is already dominant. We formally analyze this limitation by investigating the convergence properties of BC-regularized actor-critic optimization and verify it on a controlled continuous bandit task. To break this ceiling, we propose proximal action replacement (PAR), an easy-to-use plug-and-play training sample replacer. PAR substitutes suboptimal dataset actions with better actions generated by a stable target policy, guided by the action-value function's local ascent direction and bounded by value uncertainty to ensure training stability. PAR is compatible with multiple BC regularization paradigms. Extensive experiments across offline RL benchmarks show that PAR consistently improves performance, and approaches state-of-the-art results simply by being combined with the basic TD3+BC.

2602.07045 2026-05-15 cs.CV cs.AI

VLRS-Bench: A Vision-Language Reasoning Benchmark for Remote Sensing

Zhiming Luo, Di Wang, Haonan Guo, Jing Zhang, Bo Du

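AI summary To advance multimodal large language models in remote sensing, the authors propose VLRS-Bench, the first vision-language reasoning benchmark dedicated to complex remote sensing reasoning. Built around the three core dimensions of cognition, decision, and prediction, it contains 2,000 question-answer pairs covering 14 tasks and up to eight temporal phases, and is designed to evaluate complex reasoning in remote sensing scenes. By integrating remote-sensing priors and expert knowledge, VLRS-Bench increases the geospatial realism and reasoning difficulty of its tasks and reveals significant bottlenecks in current state-of-the-art models, offering an important reference for future research.

Details
English abstract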

Recent advancements in Multimodal Large Language Models (MLLMs) have enabled complex reasoning. However, existing remote sensing (RS) benchmarks remain heavily biased toward perception tasks, such as object recognition and scene classification. This limitation hinders the development of MLLMs for cognitively demanding RS applications. To address this, we propose a Vision Language ReaSoning Benchmark (VLRS-Bench), which is the first benchmark exclusively dedicated to complex RS reasoning. Structured across the three core dimensions of Cognition, Decision, and Prediction, VLRS-Bench comprises 2,000 question-answer pairs with an average question length of 130.19 words, spanning 14 tasks and up to eight temporal phases. VLRS-Bench is constructed via a specialized pipeline that integrates RS-specific priors and expert knowledge to ensure geospatial realism and reasoning complexity. Experimental results reveal significant bottlenecks in existing state-of-the-art MLLMs, providing critical insights for advancing multimodal reasoning within the remote sensing community. The project repository is available at https://github.com/MiliLab/VLRS-Bench.

2602.05285 2026-05-15 cs.LG

Robust Inference-Time Steering of Protein Diffusion Models via Embedding Optimization

Minhuan Li, Jiequn Han, Pilar Cossio, Luhuan Wu

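AI summary This paper studies how to robustly steer diffusion models for protein structure generation by optimizing in embedding space. The authors propose EmbedOpt, which directly optimizes the model's conditional embedding at inference time so that the structural prior aligns with experimental constraints, avoiding the instability that can affect conventional posterior sampling. Experiments show that EmbedOpt performs well on sparse distance constraints and cryo-EM map fitting and is robust to hyperparameters.

Details
English abstract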

A core challenge in structural biophysics is generating biomolecular conformations that are both physically plausible and consistent with experimental measurements. While sequence-to-structure diffusion models provide powerful priors, posterior sampling methods steer generation by perturbing atomic coordinates with gradients from experimental likelihoods. However, when the target lies in a low-density region of the prior, these methods require aggressive upweighting of the likelihood that can destabilize sampling and be sensitive to hyperparameters. We propose EmbedOpt, an inference-time steering framework that introduces an orthogonal optimization axis: rather than performing posterior sampling under a fixed prior, EmbedOpt directly optimizes the prior by updating the model's conditional embedding. This embedding space encodes rich coevolutionary signals, so optimizing it shifts the structural prior to align with experimental constraints. Empirically, EmbedOpt matches coordinate-based posterior sampling baselines on sparse distance constraints and outperforms them on cryo-electron microscopy map fitting, including real, noisy experimental ones. Furthermore, EmbedOpt's smooth optimization behavior yields robustness to hyperparameters spanning two orders of magnitude and enables comparable performance with fewer diffusion steps. Code is available at https://github.com/rs-station/embedopt.

2602.04657 2026-05-15 cs.CV

TRIO: Token Reduction via Inference-Objective Guidance for Efficient Vision-Language Models

Haokui Zhang, Congyang Ou, Dawei Yan, Peng Wang, Qingsen Yan, Yu Zhang, Ying Li, Rong Xiao

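AI summary TRIO is a visual token compression method that uses inference-objective guidance to make vision-language model inference efficient. Starting from the inference objective, it recasts visual token compression as preserving output invariance, and uses a purpose-designed local proxy loss to generate token-level gradient saliency that guides token reordering and selection. TRIO is training-free, compatible with FlashAttention, and suited to practical deployment; it retains 97.2% of the original performance while significantly speeding up inference and reducing computational overhead.

Details
English abstract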

Recently, reducing redundant visual tokens in vision-language models (VLMs) to accelerate VLM inference has emerged as a hot topic. However, most existing methods rely on heuristics constructed based on inter-visual-token similarity or cross-modal visual-text similarity, which gives rise to certain limitations in compression performance and practical deployment. In contrast, we propose TRIO from the perspective of inference objectives, which transforms visual token compression into preserving output result invariance and selects tokens primarily by their importance to this goal. Specifically, vision tokens are reordered with the guidance of token-level gradient saliency generated by our designed layer-local proxy loss, a coarse constraint from the current layer to the final result. Then the most valuable vision tokens are selected following the non-maximum suppression (NMS) principle. The proposed TRIO is training-free and compatible with FlashAttention, friendly to practical application and deployment. It can be deployed independently as an encoder-free method, or combined with encoder compression approaches like VisionZip for use as an encoder-involved method. On LLaVA-Next-7B, TRIO retains just 11.1% of visual tokens but maintains 97.2% of the original performance, with a 2.75× prefill speedup, 2.14× inference speedup, 6.22× lower FLOPs, and 6.05× reduced KV cache overhead. Our code is available at https://github.com/ocy1/TRIO.
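
The select-by-saliency-then-suppress step can be illustrated with a greedy NMS-style loop in NumPy. This is a sketch under assumptions, not the TRIO implementation: saliency scores are taken as given (the proxy loss and its gradients are not reproduced), suppression is assumed to use cosine similarity between token features with a fixed threshold, and `nms_select_tokens` is a hypothetical name.

```python
import numpy as np


def nms_select_tokens(tokens, saliency, keep, sim_threshold=0.9):
    """Greedily keep high-saliency tokens, skipping near-duplicates of kept ones.

    tokens:   (n, d) array of visual token features
    saliency: (n,) array, higher means more important
    keep:     maximum number of tokens to retain
    """
    unit = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-8)
    order = np.argsort(-saliency)  # most salient first
    kept = []
    for idx in order:
        if len(kept) == keep:
            break
        # Suppress a candidate that is too similar to an already kept token.
        if kept and np.max(unit[kept] @ unit[idx]) > sim_threshold:
            continue
        kept.append(int(idx))
    return sorted(kept)  # restore the original token order


tokens = np.array([[1.0, 0.0], [0.99, 0.01], [0.0, 1.0], [0.5, 0.5]])
saliency = np.array([0.9, 0.8, 0.7, 0.1])
selected = nms_select_tokens(tokens, saliency, keep=2)
print(selected)
```

Token 1 has the second-highest saliency but is nearly parallel to token 0, so it is suppressed and token 2 is kept instead. With a strict threshold the function may return fewer than `keep` tokens.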