arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.10938 2026-05-12 cs.CL cs.AI cs.LG

ELF: Embedded Language Flows

Keya Hu, Linlu Qiu, Yiyang Lu, Hanhong Zhao, Tianhong Li, Yoon Kim, Jacob Andreas, Kaiming He

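AI summary: This paper proposes ELF (Embedded Language Flows), a continuous diffusion language model intended to address the limitations of today's mainstream discrete diffusion language models (DLMs) in generation quality and efficiency. ELF builds on continuous-time Flow Matching and models in embedding space, mapping to discrete tokens only at the final time step, which makes it easier to adopt optimization techniques from image-domain diffusion models such as classifier-free guidance (CFG). Experiments show that ELF outperforms existing discrete and continuous DLMs in both generation quality and sampling efficiency, pointing to a new direction for building efficient continuous diffusion language models.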

Comments Tech Report. Project webpage: https://github.com/lillian039/ELF

Abstract

Diffusion and flow-based models have become the de facto approaches for generating continuous data, e.g., in domains such as images and videos. Their success has attracted growing interest in applying them to language modeling. Unlike their image-domain counterparts, today's leading diffusion language models (DLMs) primarily operate over discrete tokens. In this paper, we show that continuous DLMs can be made effective with minimal adaptation to the discrete domain. We propose Embedded Language Flows (ELF), a class of diffusion models in continuous embedding space based on continuous-time Flow Matching. Unlike existing DLMs, ELF predominantly stays within the continuous embedding space until the final time step, where it maps to discrete tokens using a shared-weight network. This formulation makes it straightforward to adapt established techniques from image-domain diffusion models, e.g., classifier-free guidance (CFG). Experiments show that ELF substantially outperforms leading discrete and continuous DLMs, achieving better generation quality with fewer sampling steps. These results suggest that ELF offers a promising path toward effective continuous DLMs.
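The abstract notes that staying in continuous embedding space makes classifier-free guidance easy to apply. The sketch below shows the generic CFG rule applied to a flow-matching velocity field and integrated with fixed-step Euler. It is an illustration under stated assumptions, not the ELF implementation: the `velocity(x, t, cond)` signature, the `toy_velocity` stand-in for a trained network, and the guidance `scale` parameter are all hypothetical, and the final mapping from embeddings to discrete tokens is omitted.

```python
from typing import Callable, List, Optional

Vector = List[float]
# Hypothetical signature: velocity(x, t, cond) -> dx/dt.
# cond=None stands for the unconditional branch.
VelocityFn = Callable[[Vector, float, Optional[Vector]], Vector]


def guided_velocity(v_cond: Vector, v_uncond: Vector, scale: float) -> Vector:
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward the conditional one by a factor `scale`."""
    return [u + scale * (c - u) for c, u in zip(v_cond, v_uncond)]


def euler_sample(velocity: VelocityFn, x0: Vector, cond: Vector,
                 scale: float, num_steps: int) -> Vector:
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        v = guided_velocity(velocity(x, t, cond), velocity(x, t, None), scale)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def toy_velocity(x: Vector, t: float, cond: Optional[Vector]) -> Vector:
    """Stand-in for a trained network (assumption): a constant field that
    points at the conditioning vector, or at the origin when unconditional."""
    target = cond if cond is not None else [0.0] * len(x)
    return list(target)


if __name__ == "__main__":
    # With scale=1.0 the guided field equals the conditional field.
    print(euler_sample(toy_velocity, [0.0, 0.0], [1.0, 2.0], 1.0, 4))
```

A scale above 1.0 pushes samples further along the conditional direction, which is the usual quality-versus-diversity trade-off of CFG.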

2605.10937 2026-05-12 cs.CV

Power Reinforcement Post-Training of Text-to-Image Models with Super-Linear Advantage Shaping

Haoyuan Sun, Jing Wang, Yuxin Song, Yu Lu, Bo Fang, Yifu Luo, Jun Yin, Pengyu Zeng, Miao Zhang, Tiantian Zhang, Xueqian Wang, Shijian Lu

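AI summary: This paper studies how reinforcement learning post-training can further improve text-to-image generation models and proposes a remedy for the reward hacking seen in existing methods. The authors note that normalization can miscalibrate the policy and thereby hurt training, and propose Super-Linear Advantage Shaping (SLAS), an information-geometry-based method that introduces advantage-dependent weights to reshape the policy space non-linearly, amplifying informative updates and suppressing illusory gradients. Experiments show that SLAS outperforms existing methods across multiple models and benchmarks, improving training efficiency, generalization, and generation quality.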

Abstract

Recently, post-training methods based on reinforcement learning, with a particular focus on Group Relative Policy Optimization (GRPO), have emerged as the robust paradigm for further advancement of text-to-image (T2I) models. However, these methods are often prone to reward hacking, wherein models exploit biases in imperfect reward functions rather than yielding genuine performance gains. In this work, we identify that normalization could lead to miscalibration and directly removing the prompt-level standard deviation term yields an optimal policy ascent direction that is linear in the advantage but still limits the separation of genuine signals from noise. To mitigate the above issues, we propose Super-Linear Advantage Shaping (SLAS) by revisiting the functional update from an information geometry perspective. By extending the Fisher-Rao information metric with advantage-dependent weighting, SLAS introduces a non-linear geometric structure that reshapes the local policy space. This design relaxes constraints along high-advantage directions to amplify informative updates, while tightening those in low-advantage regions to suppress illusory gradients. In addition, batch-level normalization is applied to stabilize training under varying reward scales. Extensive evaluations demonstrate that SLAS consistently surpasses the DanceGRPO baseline across multiple backbones and benchmarks. In particular, it yields faster training dynamics, improved out-of-domain performance on GenEval and UniGenBench++, and enhanced robustness to model scaling, while mitigating reward hacking and preserving semantic and compositional fidelity in generations.

2605.10936 2026-05-12 cs.CV

Personal Visual Context Learning in Large Multimodal Models

Zihui Xue, Ami Baid, Sangho Kim, Mi Luo, Kristen Grauman

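AI summary: As wearable devices such as smart glasses bring large multimodal models (LMMs) into users' continuous first-person visual streams, visual personalization becomes the key to turning these models into true personal assistants. This paper proposes Personal Visual Context Learning (Personal VCL), which uses user-specific visual information to resolve personalized queries, and builds Personal-VCL-Bench as an evaluation benchmark. The study finds a significant gap in how current LMMs use visual context, and therefore proposes the Agentic Context Bank, an inference-time baseline that improves performance across tasks through a structured memory bank and query-adaptive evidence selection.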

Comments Project website: https://vision.cs.utexas.edu/projects/PersonalVCL/

Abstract

As wearable devices like smart glasses integrate Large Multimodal Models (LMMs) into the continuous first-person visual streams of individual users, the evolution of these models into true personal assistants hinges on visual personalization: the ability to reason over visual information unique to the wearer. We formalize this capability as Personal Visual Context Learning (Personal VCL), the prompt-time capability of using user-specific visual context to resolve personalized queries. To systematically evaluate this, we present Personal-VCL-Bench, a comprehensive benchmark capturing the personal visual world across persons, objects, and behaviors. Our analysis of frontier LMMs identifies a profound context utilization gap, revealing that the mechanisms for leveraging visual evidence, as well as aggregating multiple visual observations, remain critically understudied. Motivated by these findings, we propose the Agentic Context Bank, a strong inference-time baseline that structures a user's visual context into a self-refining memory bank and employs query-adaptive evidence selection. Our baseline approach consistently improves over standard context prompting regimes across tasks and evaluated backbones, demonstrating a practical path towards future personalized LMMs.

2605.10934 2026-05-12 cs.LG cs.AI cs.CV cs.RO stat.ML

Variational Inference for Lévy Process-Driven SDEs via Neural Tilting

Yaman Kindap, Manfred Opper, Benjamin Dupuis, Umut Simsekli, Tolga Birdal

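AI summary: This paper studies variational inference for stochastic differential equations (SDEs) driven by Lévy processes, with the goal of accurately capturing extreme events and heavy-tailed phenomena in domains such as finance and climate science. Existing methods are either computationally expensive or rely on Gaussian assumptions that cannot handle jumps. The authors therefore propose a neural exponential tilting framework that exponentially reweights the Lévy measure with neural networks, yielding a flexible variational family that preserves the jump structure while remaining computationally tractable. Experiments on synthetic and real-world data show that the method captures jump dynamics and provides reliable posterior inference where Gaussian variational methods fail.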

Comments The associated project page, which contains the official implementation, can be found at https://circle-group.github.io/research/NeuralTilting/

Abstract

Modelling extreme events and heavy-tailed phenomena is central to building reliable predictive systems in domains such as finance, climate science, and safety-critical AI. While Lévy processes provide a natural mathematical framework for capturing jumps and heavy tails, Bayesian inference for Lévy-driven stochastic differential equations (SDEs) remains intractable with existing methods: Monte Carlo approaches are rigorous but lack scalability, whereas neural variational inference methods are efficient but rely on Gaussian assumptions that fail to capture discontinuities. We address this tension by introducing a neural exponential tilting framework for variational inference in Lévy-driven SDEs. Our approach constructs a flexible variational family by exponentially reweighting the Lévy measure using neural networks. This parametrization preserves the jump structure of the underlying process while remaining computationally tractable. To enable efficient inference, we develop a quadratic neural parametrization that yields closed-form normalization of the tilted measure, a conditional Gaussian representation for stable processes that facilitates simulation, and symmetry-aware Monte Carlo estimators for scalable optimization. Empirically, we demonstrate that the method accurately captures jump dynamics and yields reliable posterior inference in regimes where Gaussian-based variational approaches fail, on both synthetic and real-world datasets.

2605.10925 2026-05-12 cs.RO

PriorVLA: Prior-Preserving Adaptation for Vision-Language-Action Models

Xinyu Guo, Bin Xie, Wei Chai, Xianchi Deng, Tiancai Wang, Zhengxing Wu, Xingyu Chen

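AI summary: This work proposes PriorVLA, a framework for preserving pretrained priors when adapting vision-language-action (VLA) models to downstream tasks. It freezes the pretrained expert as a read-only prior source and trains a separate adaptation expert for task-specific learning, so that adaptation is effective while broad priors are retained. Experiments show that PriorVLA outperforms full fine-tuning and existing state-of-the-art methods across several benchmarks and real-world tasks, with the largest gains in out-of-distribution and few-shot settings.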

Comments 32 pages. Project page: https://priorvla.github.io/

Abstract

Large-scale pretraining has made Vision-Language-Action (VLA) models promising foundations for generalist robot manipulation, yet adapting them to downstream tasks remains necessary. However, the common practice of full fine-tuning treats pretraining as initialization and can shift broad priors toward narrow training-distribution patterns. We propose PriorVLA, a novel framework that preserves pretrained priors and learns to leverage them for effective adaptation. PriorVLA keeps a frozen Prior Expert as a read-only prior source and trains an Adaptation Expert for downstream specialization. Expert Queries capture scene priors from the pretrained VLM and motor priors from the Prior Expert, integrating both into the Adaptation Expert to guide adaptation. Together, PriorVLA updates only 25% of the parameters updated by full fine-tuning. Across RoboTwin 2.0, LIBERO, and real-world tasks, PriorVLA achieves stronger overall performance than full fine-tuning and state-of-the-art VLA baselines, with the largest gains under out-of-distribution (OOD) and few-shot settings. PriorVLA improves over pi0.5 by 11 points on RoboTwin 2.0-Hard and achieves 99.1% average success on LIBERO. Across eight real-world tasks and two embodiments, PriorVLA reaches 81% in-distribution (ID) and 57% OOD success with standard data. With only 10 demonstrations per task, PriorVLA reaches 48% ID and 32% OOD success, surpassing pi0.5 by 24 and 22 points, respectively.

2605.10922 2026-05-12 cs.CV

Pixal3D: Pixel-Aligned 3D Generation from Images

Dong-Yang Li, Wang Zhao, Yuxin Chen, Wenbo Hu, Meng-Hao Guo, Fang-Lue Zhang, Ying Shan, Shi-Min Hu

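AI summary: Pixal3D is a high-fidelity image-to-3D generation method that addresses the weakness of existing 3D generative models in reproducing pixel-level detail. It introduces a pixel back-projection conditioning scheme that generates pixel-aligned 3D geometry directly in the input view, establishing explicit pixel-to-3D feature correspondence and substantially improving fidelity. Pixal3D also supports multi-view generation and scene-level synthesis, offering a new approach to generating high-fidelity 3D objects and scenes from one or more images.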

Comments SIGGRAPH 2026. Project page: https://ldyang694.github.io/projects/pixal3d/

Abstract

Recent advances in 3D generative models have rapidly improved image-to-3D synthesis quality, enabling higher-resolution geometry and more realistic appearance. Yet fidelity, which measures pixel-level faithfulness of the generated 3D asset to the input image, still remains a central bottleneck. We argue this stems from an implicit 2D-3D correspondence issue: most 3D-native generators synthesize shape in canonical space and inject image cues via attention, leaving pixel-to-3D associations ambiguous. To tackle this issue, we draw inspiration from 3D reconstruction and propose Pixal3D, a pixel-aligned 3D generation paradigm for high-fidelity 3D asset creation from images. Instead of generating in a canonical pose, Pixal3D directly generates 3D in a pixel-aligned way, consistent with the input view. To enable this, we introduce a pixel back-projection conditioning scheme that explicitly lifts multi-scale image features into a 3D feature volume, establishing direct pixel-to-3D correspondence without ambiguity. We show that Pixal3D is not only scalable and capable of producing high-quality 3D assets, but also substantially improves fidelity, approaching the fidelity level of reconstruction. Furthermore, Pixal3D naturally extends to multi-view generation by aggregating back-projected feature volumes across views. Finally, we show pixel-aligned generation benefits scene synthesis, and present a modular pipeline that produces high-fidelity, object-separated 3D scenes from images. Pixal3D for the first time demonstrates 3D-native pixel-aligned generation at scale, and provides a new inspiring way towards high-fidelity 3D generation of object or scene from single or multi-view images. Project page: https://ldyang694.github.io/projects/pixal3d/

2605.10921 2026-05-12 cs.RO

RoboMemArena: A Comprehensive and Challenging Robotic Memory Benchmark

Huashuo Lei, Wenxuan Song, Huarui Zhang, Jieyuan Pei, Jiayi Chen, Haodong Yan, Han Zhao, Pengxiang Ding, Zhipeng Zhang, Lida Huang, Donglin Wang, Yan Wang, Haoang Li

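AI summary: RoboMemArena is a new benchmark for evaluating robotic memory, designed to address the shortcomings of existing benchmarks in multimodal annotation, task coverage, and real-world evaluation. It comprises 26 tasks with average trajectory lengths exceeding 1,000 steps, and 68.9% of subtasks are memory-dependent. The authors also design PrediMem, a dual-system architecture built on a vision-language model that uses a predictive coding mechanism to improve sensitivity to task dynamics; experiments show that it performs well on complex memory tasks.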

Comments Project website: https://robomemarena.github.io

Abstract

Memory is a critical component of robotic intelligence, as robots must rely on past observations and actions to accomplish long-horizon tasks in partially observable environments. However, existing robotic memory benchmarks still lack multimodal annotations for memory formation, provide limited task coverage and structural complexity, and remain restricted to simulation without real-world evaluation. We address this gap with RoboMemArena, a large-scale benchmark of 26 tasks, with average trajectory lengths exceeding 1,000 steps per task and 68.9% of subtasks being memory-dependent. The generation pipeline leverages a vision-language model (VLM) to design and compose subtasks, generates full trajectories through atomic functions, and provides memory-related annotations, including subtask instructions and native keyframe annotations, while paired real-world memory tasks support physical evaluation. We further design PrediMem, a dual-system VLA in which a high-level VLM planner manages a memory bank with recent and keyframe buffers and uses a predictive coding head to improve sensitivity to task dynamics. Extensive experiments on RoboMemArena show that PrediMem outperforms all baselines and provides insights into memory management, model architecture, and scaling laws for complex memory systems.

2605.10917 2026-05-12 cs.LG cs.MA cs.RO

Optimal and Scalable MAPF via Multi-Marginal Optimal Transport and Schrödinger Bridges

Usman A. Khan, Joseph W. Durham

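AI summary: This paper studies anonymous multi-agent path finding (MAPF), casts it as a multi-marginal optimal transport (MMOT) problem with Markovian structure, and shows that under this structure the exponentially large problem reduces to a linear program (LP) of polynomial size. Using the probabilistic framework of Schrödinger bridges, the authors propose an entropy-regularized iterative solver that substantially reduces computational complexity while retaining near-optimality. Experiments show that the method scales well while preserving solution quality.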

Comments Accepted at ICML 2026 as a spotlight paper

Abstract

We consider anonymous multi-agent path finding (MAPF) where a set of robots is tasked to travel to a set of targets on a finite, connected graph. We show that MAPF can be cast as a special class of multi-marginal optimal transport (MMOT) problems with an underlying Markovian structure, under which the exponentially large MMOT collapses to a linear program (LP) polynomial in size. Focusing on the anonymous setting, we establish conditions under which the corresponding LP is feasible, totally unimodular, and consequently, yields min-cost, integral $(\{0,1\})$ transports that do not overlap in both space and time. To adapt the approach to large-scale problems, we cast the MAPF-MMOT in a probabilistic framework via Schrödinger bridges. Under standard assumptions, we show that the Schrödinger bridge formulation reduces to an entropic regularization of the corresponding MMOT that admits an iterative Sinkhorn-type solution. The Schrödinger bridge, being a probabilistic framework, provides a shadow (fractional) transport that we use as a template to solve a reduced LP and demonstrate that it results in near-optimal, integral transports at a significant reduction in complexity. Extensive experiments highlight the optimality and scalability of the proposed approaches.
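The abstract describes an entropic regularization that admits an iterative Sinkhorn-type solution and yields a fractional transport. As a rough illustration of that kind of iteration, the sketch below runs the classical two-marginal Sinkhorn scaling on a small cost matrix. It is a simplification under stated assumptions: the paper's multi-marginal, Markov-structured formulation and its reduced LP are not reproduced here, and the function name `sinkhorn` and the parameters `eps` and `iters` are illustrative choices.

```python
import math
from typing import List

Matrix = List[List[float]]


def sinkhorn(cost: Matrix, a: List[float], b: List[float],
             eps: float = 0.1, iters: int = 200) -> Matrix:
    """Entropy-regularised optimal transport between marginals a and b.

    Returns a fractional plan P with entries u[i] * K[i][j] * v[j],
    where K = exp(-cost / eps) is the Gibbs kernel.
    """
    n, m = len(a), len(b)
    kernel = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iters):
        # Rescale rows so the plan matches the source marginal a.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        # Rescale columns so the plan matches the target marginal b.
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


if __name__ == "__main__":
    # Two robots, two targets; staying put costs 0, swapping costs 1.
    plan = sinkhorn([[0.0, 1.0], [1.0, 0.0]], [0.5, 0.5], [0.5, 0.5])
    for row in plan:
        print(row)
```

With a small `eps` the plan concentrates on the cheap assignments but stays fractional, which is why a plan of this kind can serve only as a template for a subsequent integral solve.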

2605.10912 2026-05-12 cs.CL

WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation

Shuangrui Ding, Xuanlang Dai, Long Xing, Shengyuan Ding, Ziyu Liu, Yang JingYi, Penghui Yang, Zhixiong Zhang, Xilin Wei, Xinyu Fang, Yubo Ma, Haodong Duan, Jing Shao, Jiaqi Wang, Dahua Lin, Kai Chen, Yuhang Zang

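AI summary: WildClawBench is a benchmark for evaluating long-horizon task execution in real environments. It contains 60 human-authored, bilingual, multimodal tasks spanning six thematic categories. Tasks run in reproducible Docker containers with real command-line agent harnesses and tools, and each averages roughly 8 minutes of wall-clock time and more than 20 tool calls. Grading combines rule-based checks, environment-state auditing, and semantic judgment by a large model. The results show that current frontier models still have considerable room for improvement on long-horizon tasks in real runtimes.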

Comments GitHub link: https://github.com/internlm/WildClawBench

Abstract

Large language and vision-language models increasingly power agents that act on a user's behalf through command-line interface (CLI) harnesses. However, most agent benchmarks still rely on synthetic sandboxes, short-horizon tasks, mock-service APIs, and final-answer checks, leaving open whether agents can complete realistic long-horizon work in the runtimes where they are deployed. This work presents WildClawBench, a native-runtime benchmark of 60 human-authored, bilingual, multimodal tasks spanning six thematic categories. Each task averages roughly 8 minutes of wall-clock time and over 20 tool calls, and runs inside a reproducible Docker container hosting an actual CLI agent harness (OpenClaw, Claude Code, Codex, or Hermes Agent) with access to real tools rather than mock services. Grading is hybrid, combining deterministic rule-based checks, environment-state auditing of side effects, and an LLM/VLM judge for semantic verification. Across 19 frontier models, the best, Claude Opus 4.7, reaches only 62.2% overall under OpenClaw, while every other model stays below 60%, and switching harness alone shifts a single model by up to 18 points. These results show that long-horizon, native-runtime agent evaluation remains a far-from-resolved task for current frontier models. We release the tasks, code, and containerized tooling to support reproducible evaluation.

2605.10909 2026-05-12 cs.LG stat.ML

Revisiting Policy Gradients for Restricted Policy Classes: Escaping Myopic Local Optima with $k$-step Policy Gradients

Alex DeWeese, Guannan Qu

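AI summary: This paper revisits standard policy gradient methods on restricted policy classes and finds that they readily get stuck at suboptimal critical points, mainly because the policy gradient is itself myopic: it improves the policy using only the one-step $Q$-function. The authors propose a generalized $k$-step policy gradient method that couples the randomness within a $k$-step time window and can escape myopic local optima in restricted policy classes. Theoretical analysis shows that the method's performance approaches that of the optimal deterministic policy exponentially fast in $k$, and that projected gradient descent and mirror descent achieve this guarantee in $O(1/T)$ iterations assuming only smoothness and differentiability of the value function. The method applies to previously elusive problems such as state aggregation and partially observable cooperative multi-agent settings.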

Abstract

This work revisits standard policy gradient methods used on restricted policy classes, which are known to get stuck in suboptimal critical points. We identify an important cause for this phenomenon to be that the policy gradient is itself fundamentally myopic, i.e., it only improves the policy based on the one-step $Q$-function. In this work, we propose a generalized $k$-step policy gradient method that couples the randomness within a $k$-step time window and can escape the myopic local optima in MDPs with restricted policy classes. We show this new method is theoretically guaranteed to converge to a solution that is exponentially close in performance to the optimal deterministic policy with respect to $k$. Further, we show projected gradient descent and mirror descent with this $k$-step policy gradient can achieve this exponential guarantee in $O(\frac{1}{T})$ iterations, despite only assuming smoothness and differentiability of the value function. This will provide near optimal solutions to previously elusive applications like state aggregation and partially observable cooperative multi-agent settings. Moreover, our bounds avoid the ubiquitous distribution mismatch factors $\|d_\mu^{\pi^*} / d_\mu^{\pi}\|_\infty$ and $\|d_\mu^{\pi^*} / \mu\|_\infty$, enabling the $k$-step policy gradient method to escape suboptimal critical points that emerge from poor exploration in fully observable settings.

2605.10904 2026-05-12 cs.RO

MDrive: Benchmarking Closed-Loop Cooperative Driving for End-to-End Multi-agent Systems

Marco Coscoy, Zewei Zhou, Seth Z. Zhao, Henry Wei, Angela Magtoto, Johnson Liu, Rui Song, Walter Zimmer, Zhiyu Huang, Chen Tang, Bolei Zhou, Jiaqi Ma

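AI summary: This paper presents MDrive, a closed-loop cooperative driving benchmark for end-to-end multi-agent systems that addresses the shortcomings of existing V2X benchmarks in closed-loop evaluation and scenario diversity. The benchmark comprises 225 scenarios built from NHTSA pre-crash typologies and real-world V2X data. Experiments show that multi-agent systems generally outperform single-agent systems, but that perception sharing and negotiation remain challenging in complex traffic scenarios. MDrive also provides an open-source toolbox for scenario generation, Real2Sim conversion, and human-in-the-loop simulation, giving a reproducible foundation for evaluating and improving the generalization and robustness of cooperative driving systems.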

Comments Website: https://mdrive-challenge.github.io/

Abstract

Vehicle-to-Everything (V2X) communication has emerged as a promising paradigm for autonomous driving, enabling connected agents to share complementary perception information and negotiate with each other to benefit the final planning. Existing V2X benchmarks, however, fall short in two ways: (i) open-loop evaluations fail to capture the inherently closed-loop nature of driving, leading to evaluation gaps, and (ii) current closed-loop evaluations lack behavioral and interactive diversity to reflect real-world driving. Thus, it is still unclear the extent of benefits of multi-agent systems for closed-loop driving. In this paper, we introduce MDrive, a closed-loop cooperative driving benchmark comprising 225 scenarios grounded in both NHTSA pre-crash typologies and real-world V2X datasets. Our benchmark results demonstrate that multi-agent systems are generally better than single-agent counterparts. However, current multi-agent systems still face two important challenges: (i) perception sharing enhances perceptions, but doesn't always translate to better planning; (ii) negotiation improves planning performance but harms it in complex and dense traffic scenarios. MDrive further provides an open-source toolbox for scenario generation, Real2Sim conversion, and human-in-the-loop simulation. Together, MDrive establishes a reproducible foundation for evaluating and improving the generalization and robustness of cooperative driving systems.

2605.10903 2026-05-12 cs.CV cs.RO

CapVector: Learning Transferable Capability Vectors in Parametric Space for Vision-Language-Action Models

Wenxuan Song, Han Zhao, Fuhao Li, Ziyang Zhou, Xi Wang, Jing Lyu, Pengxiang Ding, Yan Wang, Donglin Wang, Haoang Li

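AI summary: This paper proposes a new approach to the problem that pretrained vision-language-action (VLA) models see limited performance gains and high adaptation costs under standard supervised finetuning. It decouples, in parameter space, the two objectives of auxiliary-objective finetuning (enhancing general capabilities and fitting task-specific action distributions) and trains two finetuned models on a small-scale task set with two different training strategies, from which it extracts the capability vectors contributed by the auxiliary objectives. These vectors are merged with the pretrained parameters to form a capability-enhanced meta model, and a lightweight orthogonal regularization loss lets the model retain high performance at substantially lower computational cost. Experiments show that the approach is effective and generalizes across a variety of models and new environments.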

Abstract

This paper proposes a novel approach to address the challenge that pretrained VLA models often fail to effectively improve performance and reduce adaptation costs during standard supervised finetuning (SFT). Some advanced finetuning methods with auxiliary training objectives can improve performance and reduce the number of convergence steps. However, they typically incur significant computational overhead due to the additional losses from auxiliary objectives. To simultaneously achieve the enhanced capabilities of auxiliary training with the simplicity of standard SFT, we decouple the two objectives of auxiliary-objective SFT within the parameter space, namely, enhancing general capabilities and fitting task-specific action distributions. To deliver the goal, we only need to train the model to converge on a small-scale task set using two distinct training strategies, resulting in two finetuned models. The parameters' difference between the two models can then be interpreted as capability vectors provided by auxiliary objectives. These vectors are then merged with pretrained parameters to form a capability-enhanced meta model. Moreover, when standard SFT is augmented with a lightweight orthogonal regularization loss, the merged model attains performance comparable to auxiliary finetuned baselines with reduced computational overhead. Internal and external experiments demonstrate that our capability vectors (1) are effective and versatile across diverse models, (2) can generalize to novel environments and embodiments out of the box.
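The parameter arithmetic described in the abstract (take the difference between two finetuned models, then merge it with the pretrained weights) can be sketched as follows. This is a minimal illustration, not the authors' code. It assumes the two models are an auxiliary-objective finetune and a standard SFT finetune of the same backbone, represents parameters as a plain dict of float lists, and introduces a hypothetical merge coefficient `alpha` that the abstract does not mention. The orthogonal regularization loss is not shown.

```python
from typing import Dict, List

Params = Dict[str, List[float]]


def capability_vector(theta_aux: Params, theta_sft: Params) -> Params:
    """Per-parameter difference between the auxiliary-objective finetune
    and the standard SFT finetune (assumed to share parameter names)."""
    return {
        name: [a - s for a, s in zip(theta_aux[name], theta_sft[name])]
        for name in theta_aux
    }


def merge(theta_pre: Params, vector: Params, alpha: float = 1.0) -> Params:
    """Add the capability vector to the pretrained parameters.
    `alpha` is a hypothetical scaling knob for illustration only."""
    return {
        name: [p + alpha * d for p, d in zip(theta_pre[name], vector[name])]
        for name in theta_pre
    }


if __name__ == "__main__":
    pre = {"layer.w": [2.0, -1.0]}
    sft = {"layer.w": [1.0, 0.0]}
    aux = {"layer.w": [1.5, 0.5]}
    vec = capability_vector(aux, sft)
    print(merge(pre, vec))
```

In a real setting the dict values would be framework tensors, but the same elementwise subtraction and addition applies.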

2605.10901 2026-05-12 cs.LG

Beyond Red-Teaming: Formal Guarantees of LLM Guardrail Classifiers

Nikita Kezins, Urbas Ekka, Pascal Berrang, Luca Arnaboldi

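AI summary: This work aims to provide formal guarantees for the guardrail classifiers of language models, so that their defense against harmful behavior can be verified. Because "harmful behavior" is hard to specify formally in a discrete input space, the authors move verification to the classifier's pre-activation space, construct convex regions there, and exploit the monotonicity of the classification head to obtain efficient formal proofs without approximation. Experiments show that the three author-trained guardrail classifiers evaluated have verifiable safety holes under formal verification, revealing potential stability and coverage problems in practice.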

Abstract

Guardrail Classifiers defend production language models against harmful behavior, but although results seem promising in testing, they provide no formal guarantees. Providing formal guarantees for such models is hard because "harmful behavior" has no natural specification in a discrete input space: and the standard epsilon-ball properties used in other domains do not carry semantic meaning. We close this gap by shifting verification from the discrete input space to the classifier's pre-activation space, where we define a harmful region as a convex shape enclosing the representations of known harmful prompts. Because the sigmoid classification head is monotonic, certifying the worst-case point is sufficient to certify the entire region, yielding a closed-form soundness proof without approximation in O(d) time. To formally evaluate these classifiers, we propose two constructions of such regions: SVD-aligned hyper-rectangles, which yield exact SAT/UNSAT certificates, and Gaussian Mixture Models, which yield probabilistic certificates over semantically coherent clusters. Applying this framework to three author-trained Guardrail Classifiers on the toxicity domain, every hyper-rectangle configuration returns SAT, exposing verifiable safety holes across all classifiers, despite seemingly high empirical metrics. Probabilistic GMM certificates also expose a divergent structural stability in how these models represent harm. While GPT-2 and Llama-3.1-8B maintain robust coverage of 90% and 80% across varying boundaries, BERT's safety guarantees prove uniquely volatile. This 'coverage collapse' to 55% at the optimal threshold reveals a sparsely populated safety margin in BERT, which only achieves full coverage by adopting an extremely conservative pessimistic threshold. These approaches combined, provide new insights on how effective Guardrail Classifiers really are, beyond traditional red-teaming.
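The worst-case-point argument in the abstract can be illustrated with a small sketch. It assumes, for illustration only, that the head is a linear map (weights `w`, bias `b`) followed by a sigmoid and that the harmful region is an axis-aligned box in the head's input space; the paper's SVD alignment would correspond to a change of basis applied beforehand. The SAT/UNSAT convention used here (SAT means a point inside the region scores below the threshold) is likewise an assumption about the paper's terminology.

```python
import math
from typing import List, Tuple


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def worst_case_point(w: List[float], lower: List[float],
                     upper: List[float]) -> List[float]:
    """Minimiser of w . z over the box lower <= z <= upper, found in O(d):
    take the lower bound where the weight is positive, else the upper."""
    return [lo if wi > 0 else hi for wi, lo, hi in zip(w, lower, upper)]


def certify_box(w: List[float], b: float, lower: List[float],
                upper: List[float], threshold: float = 0.5
                ) -> Tuple[str, List[float]]:
    """Return ("UNSAT", z) if every point in the box is flagged as harmful,
    or ("SAT", z) with z a counterexample that falls below the threshold.
    Because sigmoid is monotonic, checking the minimiser of the logit
    is enough to decide the whole box."""
    z = worst_case_point(w, lower, upper)
    logit = sum(wi * zi for wi, zi in zip(w, z)) + b
    status = "UNSAT" if sigmoid(logit) >= threshold else "SAT"
    return status, z


if __name__ == "__main__":
    print(certify_box([1.0, -2.0], 0.5, [0.0, 0.0], [1.0, 1.0]))
    print(certify_box([1.0, -2.0], 3.0, [0.0, 0.0], [1.0, 1.0]))
```

The check touches each coordinate once, which matches the O(d) cost the abstract reports for the closed-form certificate.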

2605.10899 2026-05-12 cs.CL cs.LG

RubricEM: Meta-RL with Rubric-guided Policy Decomposition beyond Verifiable Rewards

Gaotang Li, Bhavana Dalvi Mishra, Zifeng Wang, Jun Yan, Yanfei Chen, Chun-Liang Li, Long T. Le, Rujun Han, George Lee, Hanghang Tong, Chen-Yu Lee, Tomas Pfister

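AI summary: This paper proposes RubricEM, a rubric-guided meta-reinforcement-learning framework for training deep research agents in the absence of verifiable reward signals. It decomposes the research process into stages and combines this with reflection-based meta-policy evolution to optimize long-horizon tasks efficiently. Structured rubrics provide finer-grained feedback, and judged experience is distilled into reusable guidance, which markedly improves agent performance on complex tasks such as long-form report generation.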

Comments 63 pages, 6 figures

Abstract

Training deep research agents, namely systems that plan, search, evaluate evidence, and synthesize long-form reports, pushes reinforcement learning beyond the regime of verifiable rewards. Their outputs lack ground-truth answers, their trajectories span many tool-augmented decisions, and standard post-training offers little mechanism for turning past attempts into reusable experience. In this work, we argue that rubrics should serve not merely as final-answer evaluators, but as the shared interface that structures policy execution, judge feedback, and agent memory. Based on this view, we introduce RubricEM, a rubric-guided reinforcement learning framework that combines stagewise policy decomposition with reflection-based meta-policy evolution. RubricEM first makes research trajectories stage-aware by conditioning planning, evidence gathering, review, and synthesis on self-generated rubrics. It then assigns credit with Stage-Structured GRPO, which uses stagewise rubric judgments to provide denser semantic feedback for long-horizon optimization. In parallel, RubricEM trains a shared-backbone reflection meta-policy that distills judged trajectories into reusable rubric-grounded guidance for future attempts. The resulting RubricEM-8B achieves strong performance across four long-form research benchmarks, outperforming comparable open models and approaching proprietary deep-research systems. Beyond final performance, we perform thorough analyses to understand the key ingredients of RubricEM.

2605.10894 2026-05-12 cs.CV

Counterfactual Stress Testing for Image Classification Models

Moritz Stammel, Fabio De Sousa Ribeiro, Raghav Mehta, Mélanie Roschewitz, Ben Glocker

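AI summary: This paper addresses the failure of medical image classification models under distribution shift in new clinical environments. It proposes a counterfactual stress testing framework based on causal generative models, which intervene on attributes such as scanner type and patient sex to generate clinically realistic "what if" images, enabling targeted evaluation under distribution shift while preserving anatomical structure. Experiments show that, compared with classical perturbation methods, the approach more accurately reflects how model performance changes in real out-of-distribution settings, providing a more reliable basis for assessing the robustness of medical AI systems.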

Abstract

Deep learning models in medical imaging often fail when deployed in new clinical environments due to distribution shifts in demographics, scanner hardware, or acquisition protocols. A central challenge is underspecification, where models with similar validation performance exhibit divergent real-world failure modes. Although stress testing has emerged as a tool to assess this, current methods typically rely on simple, uninformed perturbations (e.g., brightness or contrast changes), which fail to capture clinically realistic variation and can overestimate robustness. In this work, we introduce a counterfactual stress testing framework based on causal generative models that create realistic "what if" images by intervening on attributes such as scanner type and patient sex while preserving anatomical identity, enabling controlled and semantically meaningful evaluation under targeted distribution shifts. Across two imaging modalities (chest X-ray and mammography), three model architectures, and multiple shift scenarios, we show that counterfactual stress tests provide a substantially more accurate proxy for real out-of-distribution performance than classical perturbations, capturing the direction and relative magnitude of performance changes as well as model ranking. These results suggest that causal generative models can serve as practical simulators for robustness assessment, offering a more reliable basis for evaluating medical AI systems prior to deployment.

2605.10889 2026-05-12 cs.LG cs.AI

Unmasking On-Policy Distillation: Where It Helps, Where It Hurts, and Why

Mohammadreza Armandpour, Fatih Ilhan, David Harrison, Ajay Jaiswal, Duc N. M Hoang, Fartash Faghri, Yizhe Zhang, Minsik Cho, Mehrdad Farajtabar

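AI summary: This paper studies how on-policy distillation works when training reasoning models, and examines when the distillation signal helps and when it hurts. The authors propose a training-free diagnostic framework that analyzes distillation at per-token, per-question, and per-teacher granularity, and use a gradient alignment score to measure how close the actual distillation gradient is to an ideal gradient. Experiments show that the distillation signal is more effective where the student performs poorly and tends to introduce noise on correct reasoning paths, and that the best distillation configuration depends on the task and on model capacity, with no universally optimal choice.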

Abstract

On-policy distillation offers dense, per-token supervision for training reasoning models; however, it remains unclear under which conditions this signal is beneficial and under which it is detrimental. Which teacher model should be used, and in the case of self-distillation, which specific context should serve as the supervisory signal? Does the optimal choice vary from one token to the next? At present, addressing these questions typically requires costly training runs whose aggregate performance metrics obscure the dynamics at the level of individual tokens. We introduce a training-free diagnostic framework that operates at the highest resolution: per token, per question, and per teacher. We derive an ideal per-node gradient defined as the parameter update that maximally increases the student's probability of success. We then develop a scalable targeted-rollout algorithm to estimate this gradient efficiently, even for long chains of intermediate thoughts. The gradient alignment score, defined as the cosine similarity between this ideal gradient and any given distillation gradient, quantifies the extent to which a particular configuration approximates the ideal signal. Across a range of self-distillation settings and external teacher models, we observe that distillation guidance exhibits substantially higher alignment with the ideal on incorrect rollouts than on correct ones, where the student already performs well and the teacher's signal tends to become noisy. Furthermore, we find that the optimal distillation context depends jointly on the student model's capacity and the target task, and that no single universally effective configuration emerges. These findings motivate the use of per-task, per-token diagnostic analyses for distillation.
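The gradient alignment score is defined in the abstract as the cosine similarity between the ideal gradient and a given distillation gradient. A minimal sketch of that computation on flattened gradient vectors follows. The function name and the convention of returning 0.0 for a zero-norm gradient are assumptions made here, and estimating the ideal gradient itself (the targeted-rollout algorithm) is outside the scope of the sketch.

```python
import math
from typing import Sequence


def gradient_alignment(ideal: Sequence[float],
                       distill: Sequence[float]) -> float:
    """Cosine similarity between two flattened gradient vectors.

    +1 means the distillation gradient points exactly along the ideal
    update, 0 means it is orthogonal, -1 means it points the opposite way.
    Returns 0.0 if either vector has zero norm (a convention chosen here).
    """
    if len(ideal) != len(distill):
        raise ValueError("gradients must have the same length")
    dot = sum(a * b for a, b in zip(ideal, distill))
    norm = math.sqrt(sum(a * a for a in ideal)) * \
        math.sqrt(sum(b * b for b in distill))
    if norm == 0.0:
        return 0.0
    return dot / norm


if __name__ == "__main__":
    print(gradient_alignment([3.0, 4.0], [3.0, 4.0]))
    print(gradient_alignment([1.0, 0.0], [0.0, 1.0]))
```

Because the score is scale-invariant, it compares the direction of two supervision signals without being affected by differences in learning-rate-like magnitudes.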

2605.10887 2026-05-12 cs.CV

Count Anything at Any Granularity

Chang Liu, Haoning Wu, Weidi Xie

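AI summary: This paper studies fine-grained counting in open-world object counting and argues that current methods are unreliable because they leave counting granularity unspecified. The authors propose a multi-grained counting framework in which visual exemplars and fine-grained text descriptions explicitly specify the counting target, and build the first fully automatic data-scaling pipeline, which they use to construct KubriCount, the largest fine-grained counting dataset to date. On this dataset they train HieraCount, a model that substantially improves fine-grained counting accuracy and generalization to real-world scenes.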

Comments Project page: https://verg-avesta.github.io/KubriCount/

Abstract

Open-world object counting remains brittle: despite rapid advances in vision-language models (VLMs), reliably counting the objects a user intends is far from solved. We argue that a central reason is that counting granularity is left implicit; users may refer to a specific identity, an attribute, an instance type, a category, or an abstract concept, yet most methods treat "what to count" as a single, category-level matching problem. In this work, we redefine open-world counting as multi-grained counting, where visual exemplars specify target appearance and fine-grained text, with optional negative prompts, specifies the intended semantic granularity across five explicit levels. Making granularity explicit, however, exposes a critical data bottleneck: existing counting datasets lack the multi-category scenes, controlled distractors, and instance-level annotations needed to verify fine-grained prompt semantics. To address this, we propose the first fully automatic data-scaling pipeline that integrates controllable 3D synthesis with consistent image editing and VLM-based filtering, and use it to construct KubriCount, the largest and most comprehensively annotated counting dataset to date, supporting both training and multi-grained evaluation. Systematic benchmarking reveals that both multimodal large language models and specialist counting models exhibit severe prompt-following failures under fine-grained distinctions. Motivated by these findings, we train HieraCount, a multi-grained counting model that jointly leverages text and visual exemplars as complementary target specifications. HieraCount substantially improves multi-grained counting accuracy and generalizes robustly to challenging real-world scenarios. The project page is available here: https://verg-avesta.github.io/KubriCount/.

2605.10885 2026-05-12 cs.CV

Geometry-aware Prototype Learning for Cross-domain Few-shot Medical Image Segmentation

Feifan Song, Yuntian Bo, Haofeng Zhang

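AI Summary Cross-domain few-shot medical image segmentation (CD-FSMIS) aims to adapt a model to both novel anatomical categories and unseen imaging domains from only a few annotated examples. Existing prototype-based methods tend to entangle anatomical structure with domain-specific appearance variations, which makes stable matching under domain shift difficult. This paper proposes GeoProto, a framework that introduces a geometry-aware prototype enrichment mechanism and uses geometric priors of human anatomy to make prototype matching more robust and generalizable, achieving state-of-the-art performance on multiple cross-modality, cross-sequence, and cross-context datasets.

Abstract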

Cross-domain few-shot medical image segmentation (CD-FSMIS) requires a model to generalise simultaneously to novel anatomical categories and unseen imaging domains from only a handful of annotated examples. Existing prototypical approaches inevitably entangle anatomical structure with domain-specific appearance variations, and thus lack a stable reference for reliable matching under domain shift. We observe that the geometric structure of human anatomy constitutes a reliable, domain-transferable prior that has been overlooked. Building on this insight, we propose GeoProto, a geometry-aware CD-FSMIS framework that enriches prototypical matching with explicit structural priors. The core component, Geometry-Aware Prototype Enrichment (GAPE), augments each local appearance prototype with a learned geometric offset encoding its ordinal position within the organ's interior topology. This offset is derived from an auxiliary Ordinal Shape Branch (OSB) trained under an ordinally consistent objective that enforces monotonic variation of geometric embeddings across interior strata, requiring no annotation beyond standard segmentation masks. Extensive experiments across seven datasets spanning three evaluation settings (cross-modality, cross-sequence, and cross-context) demonstrate that GeoProto achieves state-of-the-art performance.

2605.10880 2026-05-12 cs.RO

Safe Aerial 3D Path Planning for Autonomous UAVs using Magnetic Potential Fields

Haechan Mark Bong, Giovanni Beltrame

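AI Summary This paper studies safe 3D path planning for autonomous UAVs in urban environments. It proposes 3DMaxConvNet, a magnetic potential-field method based on properties of Maxwell's equations, which uses a convolutional autoencoder to predict obstacle-aware potential fields from LiDAR-derived 3D voxel grids and thereby generates paths free of local minima. Experiments show a 100% path planning success rate in two different urban environments, with shorter runtimes than the classical planners A* and RRT* at comparable path quality.

Abstract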

Safe autonomous Uncrewed Aerial Vehicle (UAV) navigation in urban environments requires real-time path planning that avoids obstacles. MaxConvNet is a potential-field planner that leverages properties of Maxwell's equations to generate a path to the goal without local minima. We extend the 2D MaxConvNet magnetic field planner to 3D, using a convolutional autoencoder to predict obstacle-aware potential fields from LiDAR-derived 101^3 voxel grids. Evaluation across 100 randomized closed-loop trials in two distinct Cosys-AirSim urban environments, a dense night-time cityscape and a suburban district, shows a 100% path planning success rate on both maps without retraining. In offline path planning, 3DMaxConvNet produces path lengths comparable to A* on unseen maps while reducing runtime from 0.155-0.17 s to 0.087-0.089 s, or about 1.7-1.95 times faster than A*. Against RRT*(3k), 3DMaxConvNet achieves similar path quality while reducing planning runtime from 17.2-17.5 s to about 0.09 s, which is roughly 193-201 times faster than RRT*(3k).
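
The speed-up factors quoted in this abstract can be reproduced from its runtime ranges. The following is a minimal arithmetic check, assuming the low end of each factor pairs the fastest baseline run with the slowest 3DMaxConvNet run and the high end pairs the opposite extremes; the helper name `speedup_range` is ours, not the paper's.

```python
def speedup_range(baseline, method):
    """Return (min, max) speed-up given (low, high) runtime ranges in seconds."""
    b_lo, b_hi = baseline
    m_lo, m_hi = method
    # Smallest factor: fastest baseline vs. slowest method run, and vice versa.
    return b_lo / m_hi, b_hi / m_lo


# Runtime ranges in seconds, copied from the abstract.
maxconv = (0.087, 0.089)
a_star = speedup_range((0.155, 0.17), maxconv)
rrt_star = speedup_range((17.2, 17.5), maxconv)

print(f"A*: {a_star[0]:.2f}x to {a_star[1]:.2f}x")            # A*: 1.74x to 1.95x
print(f"RRT*(3k): {rrt_star[0]:.0f}x to {rrt_star[1]:.0f}x")  # RRT*(3k): 193x to 201x
```

Both ranges agree with the figures stated in the abstract, which suggests the authors compared range endpoints in this way.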

2605.10878 2026-05-12 cs.LG cs.IT math.IT

Neural Weight Norm = Kolmogorov Complexity

Tiberiu Musat

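AI Summary This paper studies the theoretical basis of weight decay in neural networks. It proves that, at fixed precision, the smallest weight norm of a neural network that outputs a binary string matches the Kolmogorov complexity of that string up to a logarithmic factor. It follows that the prior induced by weight decay agrees with Solomonoff's universal prior up to a polynomial factor, and the conclusion holds for any weight norm. The paper also gives the encodings that relate fixed-precision network parameters to Kolmogorov complexity, and notes that the result no longer holds at infinite precision.

Abstract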

Why does weight decay work? We prove that, in any fixed-precision regime, the smallest weight norm of a looped neural network outputting a binary string equals the Kolmogorov complexity of that string, up to a logarithmic factor. This implies that weight decay induces a prior matching Solomonoff's universal prior, the optimal prior over computable functions, up to a polynomial factor. The result is norm-agnostic: in fixed precision, every weight norm collapses to the non-zero parameter count up to constants, so the same sandwich bound holds for any norm used as a regulariser. The proof has two short reductions: any program for a universal Turing machine can be encoded into neural weights at unit cost per program bit, and any fixed-precision network can be described by enumerating its non-zero parameters with logarithmic addressing overhead. Both bounds are tight up to constants, with the logarithmic factor realised by permutation encodings: a network whose parameters encode a permutation produces a string whose Kolmogorov complexity is the non-zero parameter count times its logarithm. The fixed-precision assumption is essential: with infinite precision, neural networks can encode non-computable functions and the weight norm loses its relevance.
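
One way to read the two reductions in this abstract is as a sandwich bound. The notation below is ours, not the paper's: $K(x)$ is the Kolmogorov complexity of the binary string $x$, and $N(x)$ is the smallest non-zero parameter count of a fixed-precision looped network that outputs $x$. We also assume the logarithm's argument is the non-zero parameter count, as the permutation example suggests; the constants are left unspecified.

```latex
% Reduction 1: a program is encoded into weights at unit cost per program bit.
% Reduction 2: a network is described by listing its non-zero parameters,
%              each with logarithmic addressing overhead.
\[
  N(x) = O\bigl(K(x)\bigr),
  \qquad
  K(x) = O\bigl(N(x)\,\log N(x)\bigr).
\]
```

Read this way, the two quantities agree up to a logarithmic factor, and the permutation encodings mentioned in the abstract are the case where the second bound is attained.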

2605.10877 2026-05-12 cs.CL cs.IR

Neural at ArchEHR-QA 2026: One Method Fits All: Unified Prompt Optimization for Clinical QA over EHRs

Abrar Majeedi, Viswanatha Reddy Gajjala, Sai Prasanna Teja Reddy Bogireddy, Siddhant Rai

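AI Summary This work addresses clinical question answering over electronic health records (EHRs) with Neural1.5, a unified prompt-optimization method covering four subtasks: question interpretation, evidence identification, answer generation, and evidence alignment. The method handles each stage as a separate module and combines automatic prompt optimization with self-consistency and verification mechanisms to improve answer accuracy and reliability. In the ArchEHR-QA 2026 shared task it ranks second overall among teams that entered all four subtasks, which supports its effectiveness and efficiency on multi-stage clinical QA.

Comments Accepted to CL4Health @ LREC 2026

Abstract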

Automated question answering (QA) over electronic health records (EHRs) demands precise evidence retrieval, faithful answer generation, and explicit grounding of answers in clinical notes. In this work, we present Neural1.5, our method for the ArchEHR-QA 2026 shared task at CL4Health@LREC 2026, which comprises four subtasks: question interpretation, evidence identification, answer generation, and evidence alignment. Our approach decouples the task into independent, modular stages and employs DSPy's MIPROv2 optimizer to automatically discover high-performing prompts, jointly tuning instructions and few-shot demonstrations for each stage. Within every stage, self-consistency voting over multiple stochastic inference runs suppresses spurious errors and improves reliability, while stage-specific verification mechanisms (e.g., self-reflection and chain-of-verification for alignment) further refine output quality. Among all teams that participated in all four subtasks, our method ranks second overall (mean rank 4.00), placing 4th, 1st, 4th, and 7th on Subtasks 1-4, respectively. These results demonstrate that systematic, per-stage prompt optimization combined with self-consistency mechanisms is a cost-effective alternative to model fine-tuning for multifaceted clinical QA.
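
Self-consistency voting, as named in this abstract, can be illustrated with a plain majority vote over several sampled answers. This is a minimal sketch and not the authors' implementation: the function name, the lowercase/strip normalization, and the sample answers are all assumptions made for illustration.

```python
from collections import Counter


def self_consistency_vote(answers):
    """Return the most frequent answer among several stochastic runs.

    Ties are broken in favour of the answer seen first.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    # Hypothetical normalization so trivially different strings count together.
    normalized = [a.strip().lower() for a in answers]
    winner, _count = Counter(normalized).most_common(1)[0]
    return winner


# Five hypothetical runs of the same stage on the same input.
runs = ["Sentence 3", "sentence 3", "Sentence 5", "sentence 3 ", "Sentence 5"]
print(self_consistency_vote(runs))  # sentence 3
```

A single run that returns a spurious answer is outvoted as long as most runs agree, which is the reliability effect the abstract describes.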

2605.10876 2026-05-12 cs.LG cs.AI q-bio.QM

AssayBench: An Assay-Level Virtual Cell Benchmark for LLMs and Agents

Edward De Brouwer, Carl Edwards, Alexander Wu, Jenna Collier, Graham Heimberg, Xiner Li, Meena Subramaniam, Ehsan Hajiramezanali, David Richmond, Jan-Christian Hütter, Sara Mostafavi, Gabriele Scalia

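AI Summary This paper presents AssayBench, a benchmark for evaluating large language models and agents on virtual-cell phenotypic screening, covering 1,920 public CRISPR screens across five classes of cellular phenotypes. It casts phenotypic screen prediction as a gene rank prediction problem and introduces an adjusted nDCG metric to compare model performance across assays. Experiments show that existing methods remain far below empirically estimated performance ceilings, and that zero-shot generalist LLMs outperform biology-specific LLMs and trainable baselines.

Comments 22 pages

Abstract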

Recent advances in machine learning and large-scale biological data collections have revived the prospect of building a virtual cell, a computational model of cellular behavior that could accelerate biological discovery. One of the most compelling promises of this vision is the ability to perform in silico phenotypic screens, in which a model predicts the effects of cellular perturbations in unseen biological contexts. This task combines heterogeneous textual inputs with diverse phenotypic outputs, making it particularly well-suited to LLMs and agentic systems. Yet, no standard benchmark currently exists for this task, as existing efforts focus on narrower molecular readouts that are only indirectly aligned with the phenotypic endpoints driving many real-world drug discovery workflows. In this work, we present AssayBench, a benchmark for phenotypic screen prediction, built from 1,920 publicly available CRISPR screens spanning five broad classes of cellular phenotypes. We formulate the screen prediction task as a gene rank prediction for each screen and introduce the adjusted nDCG, a continuous metric for comparing performance across heterogeneous assays. Our extensive evaluation shows that existing methods remain far from empirically estimated performance ceilings and zero-shot generalist LLMs outperform biology-specific LLMs and trainable baselines. Optimization techniques such as fine-tuning, ensembling, and prompt optimization can further improve LLM performance on this task. Overall, AssayBench offers a practical testbed for measuring progress toward in silico phenotypic screening and, more broadly, virtual cell models.
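
For readers unfamiliar with the metric family, here is a minimal sketch of standard nDCG applied to a ranked gene list. The abstract does not define the "adjusted" variant, so the adjustment is not reproduced here; the relevance grades are made-up illustrative values, not data from the benchmark.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain of relevance grades listed in ranked order."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))


def ndcg(relevances):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


# Hypothetical relevance grades of genes, in the order a model ranked them.
ranked_relevance = [3, 2, 3, 0, 1, 2]
score = ndcg(ranked_relevance)
print(f"{score:.3f}")  # 0.961
```

A perfect ranking scores 1.0, so the metric rewards placing the strongest hits of a screen near the top of the predicted list.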

2605.10875 2026-05-12 cs.LG cs.CL

Compute Where it Counts: Self Optimizing Language Models

Yash Akhauri, Mohamed S. Abdelfattah

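AI Summary This paper studies how to allocate compute dynamically during autoregressive decoding to improve the efficiency and quality of LLM inference. It proposes Self-Optimizing Language Models (SOL), in which a lightweight policy network selects an efficiency action at each decode step from the current hidden state, dynamically controlling attention sparsity, MLP activation pruning, and quantization bit-width. Experiments show that SOL outperforms static allocation and random schedule search at matched budget, with gains across several tasks, including up to 7.3% higher MMLU accuracy.

Comments Accepted at ICML'26. Code: https://github.com/akhauriyash/SOL

Abstract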

Efficient LLM inference research has largely focused on reducing the cost of each decoding step (e.g., using quantization, pruning, or sparse attention), typically applying a uniform computation budget to every generated token. In practice, token difficulty varies widely, so static compression can over-compute on easy steps and under-compute on hard ones. We study dynamic budget allocation for autoregressive decoding: learning how much computation to spend per token from within a single model. Self-Optimizing Language Models (SOL) pair a frozen LLM with a lightweight policy network that reads the LLM hidden state and selects a discrete efficiency action at each decode step. Actions can jointly control (i) token-level attention sparsity, (ii) structured activation pruning in the MLP, and (iii) activation quantization bit-width, while leaving the base model weights unchanged. We train the policy with group-relative policy optimization on teacher-forced episodes: the token sequence is fixed, while we sample multiple compute schedules (i.e., "counterfactual" schedules that vary only the efficiency actions for the same token path) and compare their likelihoods under the same supervision. Our reward trades off language-model quality against soft penalties that encourage episode-average budget usage to match a requested target. Across model variants and compute regimes, SOL improves quality at matched budget over static allocation and strong random schedule search, offering a complementary axis for inference-efficiency optimization. SOL discovers a better quality-efficiency Pareto front across all our experiments and improves MMLU accuracy by up to 7.3% over uniform budget allocation strategies.
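
The training signal described in this abstract can be sketched as a quality term minus a soft budget penalty, followed by group-relative normalization across the schedules sampled for one fixed token sequence. Everything specific below is an assumption for illustration: the absolute-value penalty, the `penalty_weight` parameter, the mean/standard-deviation normalization, and the numbers. The paper's exact reward may differ.

```python
from statistics import mean, pstdev


def schedule_reward(log_likelihood, avg_budget, target_budget, penalty_weight=1.0):
    """Quality minus a soft penalty for missing the requested average budget."""
    return log_likelihood - penalty_weight * abs(avg_budget - target_budget)


def group_relative_advantages(rewards, eps=1e-8):
    """Score each schedule relative to the other schedules in its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three hypothetical compute schedules for the same teacher-forced sequence:
# (sequence log-likelihood, episode-average budget as a fraction of full compute).
schedules = [(-2.0, 0.5), (-1.5, 0.8), (-2.5, 0.3)]
rewards = [schedule_reward(ll, b, target_budget=0.5) for ll, b in schedules]
advantages = group_relative_advantages(rewards)
best = max(range(len(schedules)), key=lambda i: advantages[i])
print(best)  # 1
```

With this penalty weight the second schedule still wins despite overshooting the budget; a larger `penalty_weight` would shift the preference toward the on-budget schedule.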

2605.10873 2026-05-12 cs.CV cs.AI

CADBench: A Multimodal Benchmark for AI-Assisted CAD Program Generation

Anna C. Doris, Jacob Thomas Sony, Ghadi Nehme, Era Syla, Amin Heyrani Nobari, Faez Ahmed

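AI Summary This paper presents CADBench, a multimodal benchmark for evaluating AI-assisted computer-aided design (CAD) program generation. It contains 18,000 samples from multiple sources, spans several input modalities and six metrics, and is designed to measure geometric fidelity, executability, and program compactness. Experiments show that specialized mesh-to-CAD models outperform general-purpose vision-language models under idealized inputs, but reliability remains insufficient overall; CADBench exposes key challenges and directions for improvement in multimodal CAD understanding.

Abstract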

Recovering editable CAD programs from images or 3D observations is central to AI-assisted design, but progress is difficult to measure because existing evaluations are fragmented across datasets, modalities, and metrics. We introduce CADBench, a unified benchmark for multimodal CAD program generation. CADBench contains 18,000 evaluation samples spanning six benchmark families derived from DeepCAD, Fusion 360, ABC, MCB, and Objaverse; five input modalities including clean meshes, noisy meshes, single-view renders, photorealistic renders, and multi-view renders; and six metrics covering geometric fidelity, executability, and program compactness. STEP-based families are stratified by B-rep face count and all families are diversity-sampled to support controlled analysis across complexity and object variation. We benchmark eleven CAD-specialized and general-purpose vision-language systems, generating more than 1.4 million CAD programs. Under idealized inputs, specialized mesh-to-CAD models substantially outperform code-generating VLMs, which remain far from reliable CAD program reconstruction. CADBench further reveals three recurring failure modes: reconstruction quality degrades with geometric complexity, CAD-specialized models can be brittle under modality shift, and model rankings change across metrics. Together, these results position CADBench as a diagnostic testbed for measuring progress in editable 3D reconstruction and multimodal CAD understanding. The benchmark is publicly available at https://huggingface.co/datasets/DeCoDELab/CADBench.

2605.10870 2026-05-12 cs.AI

Remember the Decision, Not the Description: A Rate-Distortion Framework for Agent Memory

Mingxi Zou, Zhihan Guo, Langzhang Liang, Zhuo Wang, Qifan Wang, Qingsong Wen, Irwin King, Lizhen Qu, Zenglin Xu

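AI Summary This paper proposes DeMem, a decision-centric memory compression framework for managing the memory of long-horizon language agents under limited runtime memory. Unlike conventional memory mechanisms built on descriptive criteria, it uses rate-distortion theory to measure the effect of memory on decision quality, which identifies the boundary of what can be safely forgotten and characterizes the optimal tradeoff between memory budget and decision quality. Experiments show that DeMem improves decision performance under the same runtime budget, supporting the view that memory should preserve decision-relevant distinctions rather than descriptions.

Abstract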

Long-horizon language agents must operate under limited runtime memory, yet existing memory mechanisms often organize experience around descriptive criteria such as relevance, salience, or summary quality. For an agent, however, memory is valuable not because it faithfully describes the past, but because it preserves the distinctions between histories that must remain separated under a fixed budget to support good decisions. We cast this as a decision-centric rate-distortion problem, measuring memory quality by the loss in achievable decision quality induced by compression. This yields an exact forgetting boundary for what can be safely forgotten, and a memory-distortion frontier characterizing the optimal tradeoff between memory budget and decision quality. Motivated by this decision-centric view of memory, we propose DeMem, an online memory learner that refines its partition only when data certify that a shared state would induce decision conflict, and prove near-minimax regret guarantees. On both controlled synthetic diagnostics and long-horizon conversational benchmarks, DeMem yields consistent gains under the same runtime budget, supporting the principle that memory should preserve the distinctions that matter for decisions, not descriptions.

2605.10863 2026-05-12 cs.CL

DGPO: Beyond Pairwise Preferences with Directional Consistent Groupwise Optimization

Mengyi Deng, Zhiwei Li, Xin Li, Tingyu Zhu, Yulan Yuan, Zhijiang Guo, Wei Wang

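AI Summary Despite the remarkable progress of large language models (LLMs), existing preference optimization methods still struggle to ensure directional consistency while preserving reasoning diversity. This paper proposes Directional-Groupwise Preference Optimization (DGPO), a lightweight framework that explicitly models direction-aware alignment through multi-candidate comparisons and aggregates supervision signals at the group level. DGPO organizes forward and reverse question-answer instances into structured sets and optimizes a margin-based likelihood objective that separates coherent reasoning paths from inconsistent alternatives, yielding consistent gains across multiple datasets and model families.

Abstract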

Although Large Language Models (LLMs) have made remarkable progress, current preference optimization methods still struggle to align directional consistency while preserving reasoning diversity. To address this limitation, we propose Directional-Groupwise Preference Optimization (DGPO), a lightweight framework that aggregates supervision signals at the group level and explicitly models direction-aware alignment through multi-candidate comparisons. DGPO organizes forward and reverse question-answer instances into structured sets and optimizes a margin-based likelihood objective that separates coherent reasoning paths from inconsistent alternatives. This group-wise formulation captures richer relative information than pairwise objectives and reinforces consistency across diverse reasoning pathways. Empirical results show that our constructed reverse data yields a 3.2% average improvement across five benchmarks, while DGPO further delivers consistent gains across multiple datasets and model families, achieving average accuracy improvements of up to 3.6%.

2605.10862 2026-05-12 cs.CL

RUBEN: Rule-Based Explanations for Retrieval-Augmented LLM Systems

Joel Rorseth, Parke Godfrey, Lukasz Golab, Divesh Srivastava, Jarek Szlichta

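AI Summary This paper introduces RUBEN, an interactive tool for discovering minimal rules that explain the outputs of retrieval-augmented large language models (LLMs). Using novel pruning strategies, it efficiently identifies a minimal set of rules that subsume all others, and applies these rules to LLM safety to test the effectiveness of safety training and the impact of adversarial prompt injections. The approach offers a new way to understand and improve the explainability and safety of LLMs.

Comments Accepted by ICDE 2026 (Demonstration Track)

Abstract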

This paper demonstrates RUBEN, an interactive tool for discovering minimal rules to explain the outputs of retrieval-augmented large language models (LLMs) in data-driven applications. We leverage novel pruning strategies to efficiently identify a minimal set of rules that subsume all others. We further demonstrate novel applications of these rules for LLM safety, specifically to test the resiliency of safety training and the effectiveness of adversarial prompt injections.

2605.10859 2026-05-12 cs.CV cs.LG

Masked Generative Transformer Is What You Need for Image Editing

Wei Chow, Linfeng Li, Xian Sun, Lingdong Kong, Zefeng Li, Qi Xu, Hang Song, Tian Ye, Xian Wang, Jinbin Bai, Shilin Xu, Xiangtai Li, Junting Pan, Shaoteng Liu, Ran Zhou, Tianshu Yang, Songhua Liu

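AI Summary This paper proposes EditMGT, an image editing framework based on Masked Generative Transformers (MGTs), to address the problem that edits made by diffusion models spread into non-target regions. Through localized token prediction and multi-layer attention consolidation, EditMGT controls the edited region precisely and avoids unintended changes elsewhere. The authors also build CrispEdit-2M, an editing dataset of 2M high-resolution samples, and report state-of-the-art image similarity on multiple benchmarks with 6x faster editing.

Comments CVPR 2026 HiGen Workshop; Project Page at https://weichow23.github.io/EditMGT/; GitHub at https://github.com/weichow23/EditMGT

Abstract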

Diffusion models dominate image editing, yet their global denoising mechanism entangles edited regions with surrounding context, causing modifications to propagate into areas that should remain intact. We propose a fundamentally different approach by leveraging Masked Generative Transformers (MGTs), whose localized token-prediction paradigm naturally confines changes to intended regions. We present EditMGT, an MGT-based editing framework that is the first of its kind. Our approach employs multi-layer attention consolidation to aggregate cross-attention maps into precise edit localization signals, and region-hold sampling to explicitly prevent token flipping in non-target areas. To support training, we construct CrispEdit-2M, a 2M-sample high-resolution (>1024) editing dataset spanning seven categories. With only 960M parameters, EditMGT achieves state-of-the-art image similarity on multiple benchmarks while delivering 6x faster editing, demonstrating that MGTs offer a compelling alternative to diffusion-based editing.
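
The idea of holding non-target tokens fixed can be shown with a simple mask over a token grid. This is only a sketch of the general principle, assuming a binary edit mask is already available; it is not EditMGT's actual region-hold sampling, and the function name and token values are hypothetical.

```python
def region_hold(original_tokens, predicted_tokens, edit_mask):
    """Accept predicted tokens only where the edit mask is True.

    Tokens outside the edit region are copied from the original image,
    so they cannot flip during sampling.
    """
    if not (len(original_tokens) == len(predicted_tokens) == len(edit_mask)):
        raise ValueError("all three sequences must have the same length")
    return [pred if editable else orig
            for orig, pred, editable in zip(original_tokens, predicted_tokens, edit_mask)]


# Hypothetical flattened token ids for a tiny image; only positions 1 and 2 are editable.
original = [11, 12, 13, 14]
predicted = [21, 22, 23, 24]
mask = [False, True, True, False]
print(region_hold(original, predicted, mask))  # [11, 22, 23, 14]
```

In the abstract's description, the mask itself would come from the aggregated cross-attention maps rather than being supplied by hand.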

2605.10858 2026-05-12 cs.CV cs.RO

Is Your Driving World Model an All-Around Player?

Lingdong Kong, Ao Liang, Tianyi Yan, Hongsi Liu, Wesley Yang, Ziqi Huang, Xian Sun, Wei Yin, Jialong Zuo, Yixuan Hu, Dekai Zhu, Dongyue Lu, Youquan Liu, Guangfeng Jiang, Linfeng Li, Xiangtai Li, Long Zhuo, Lai Xing Ng, Benoit R. Cottereau, Changxin Gao, Liang Pan, Wei Tsang Ooi, Ziwei Liu

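AI Summary Today's driving world models can generate realistic dash-cam videos, but no single model excels in every respect. This paper presents the WorldLens benchmark, which evaluates world-model fidelity along several dimensions, including pixel quality, 4D geometry, closed-loop driving behavior, and human perception, and shows that existing models are strong in texture, geometry, or behavioral consistency but not in all of them at once. It also contributes WorldLens-26K, a dataset of 26,808 human annotations, and WorldLens-Agent, a vision-language model that automatically evaluates generated worlds, providing a unified evaluation framework closer to human perception.

Comments CVPR 2026 VideoWorldModel Workshop; Project Page at https://worldbench.github.io/worldlens; GitHub at https://github.com/worldbench/WorldLens

Abstract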

Today's driving world models can generate remarkably realistic dash-cam videos, yet no single model excels universally. Some generate photorealistic textures but violate basic physics; others maintain geometric consistency but fail when subjected to closed-loop planning. This disconnect exposes a critical gap: the field evaluates how real generated worlds appear, but rarely whether they behave realistically. We introduce WorldLens, a unified benchmark that measures world-model fidelity across the full spectrum, from pixel quality and 4D geometry to closed-loop driving and human perceptual alignment, through five complementary aspects and 24 standardized dimensions. Our evaluation of six representative models reveals that no existing approach dominates across all axes: texture-rich models violate geometry, geometry-aware models lack behavioral fidelity, and even the strongest performers achieve only 2-3 out of 10 on human realism ratings. To bridge algorithmic metrics with human perception, we further contribute WorldLens-26K, a 26,808-entry human-annotated preference dataset pairing numerical scores with textual rationales, and WorldLens-Agent, a vision-language evaluator distilled from these judgments that enables scalable, explainable auto-assessment. Together, the benchmark, dataset, and agent form a unified ecosystem for assessing generated worlds not merely by visual appeal, but by physical and behavioral fidelity.

2605.10855 2026-05-12 cs.CL

Learning More from Less: Exploiting Counterfactuals for Data-Efficient Chart Understanding

Jianzhu Bao, Haozhen Zhang, Kuicai Dong, Bozhi Wu, Sarthak Ketanbhai Modi, Zi Pong Lim, Yon Shin Teo, Wenya Wang

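AI Summary This paper proposes ChartCF, a data-efficient training framework that aims to improve the counterfactual sensitivity of vision-language models in chart understanding. By generating counterfactual data through code modification and combining chart similarity-based selection with multimodal preference optimization, ChartCF matches or exceeds existing strong models in chart understanding with less training data. The method exploits the fact that charts are programmatically generated visual artifacts, which improves the model's ability to perceive subtle visual changes.

Comments Accepted to ACL 2026 Main Conference

Abstract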

Vision-Language Models (VLMs) have demonstrated remarkable progress in chart understanding, largely driven by supervised fine-tuning (SFT) on increasingly large synthetic datasets. However, scaling SFT data alone is inefficient and overlooks a key property of charts: charts are programmatically generated visual artifacts, where small, code-controlled visual changes can induce drastic shifts in semantics and correct answers. Learning this counterfactual sensitivity requires VLMs to discriminate fine-grained visual differences, yet standard SFT treats training instances independently and provides limited supervision to enforce this behavior. To address this, we introduce ChartCF, a data-efficient training framework designed to enhance counterfactual sensitivity. ChartCF consists of: (1) a counterfactual data synthesis pipeline via code modification, (2) a chart similarity-based data selection strategy that filters overly difficult samples for improved training efficiency, and (3) multimodal preference optimization across both textual and visual modalities. Experiments on five benchmarks show that ChartCF achieves superior or comparable performance to strong chart-specific VLMs while using significantly less training data.