arXivDaily arXiv Daily Academic Digest, updated Monday to Friday
2605.10941 2026-05-12 cs.CC

Average-Case Hardness of Binary-Encoded Clique in Proof and Communication Complexity

Susanna F. de Rezende, David Engström, Yassine Ghannane, Duri Andrea Janett, Artur Riazanov

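AI summary This paper studies the average-case hardness, in proof complexity and communication complexity, of proving that a graph has no large clique. Analyzing the binary encoding of clique formulas on randomly sampled dense graphs, the authors prove exponential lower bounds on the length of cutting planes and bounded-depth resolution over parities refutations, and show that the randomized communication complexity of finding a falsified clause in these formulas is polynomial. The results indicate that, in the average case, the difficulty of this problem differs markedly across computational models.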

Comments Full version of a paper to appear at ICALP 2026

Abstract

We study the average-case hardness of establishing that a graph does not have a large clique in both proof and communication complexity. We show exponential lower bounds on the length of cutting planes and bounded-depth resolution over parities refutations of the binary encoding of clique formulas on randomly sampled dense graphs. Moreover, we show that the randomized communication complexity of finding a falsified clause in these formulas is polynomial.

2605.10938 2026-05-12 cs.CL cs.AI cs.LG

ELF: Embedded Language Flows

Keya Hu, Linlu Qiu, Yiyang Lu, Hanhong Zhao, Tianhong Li, Yoon Kim, Jacob Andreas, Kaiming He

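AI summary This paper proposes ELF (Embedded Language Flows), a continuous diffusion language model intended to address the limits in generation quality and efficiency of today's mainstream discrete diffusion language models (DLMs). ELF builds on continuous-time Flow Matching and models in embedding space, mapping to discrete tokens only at the final time step, which makes it easier to adopt techniques from image-domain diffusion models such as classifier-free guidance (CFG). Experiments show that ELF outperforms existing discrete and continuous DLMs in both generation quality and sampling efficiency, offering a new direction for building effective continuous diffusion language models.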

Comments Tech Report. Project webpage: https://github.com/lillian039/ELF

Abstract

Diffusion and flow-based models have become the de facto approaches for generating continuous data, e.g., in domains such as images and videos. Their success has attracted growing interest in applying them to language modeling. Unlike their image-domain counterparts, today's leading diffusion language models (DLMs) primarily operate over discrete tokens. In this paper, we show that continuous DLMs can be made effective with minimal adaptation to the discrete domain. We propose Embedded Language Flows (ELF), a class of diffusion models in continuous embedding space based on continuous-time Flow Matching. Unlike existing DLMs, ELF predominantly stays within the continuous embedding space until the final time step, where it maps to discrete tokens using a shared-weight network. This formulation makes it straightforward to adapt established techniques from image-domain diffusion models, e.g., classifier-free guidance (CFG). Experiments show that ELF substantially outperforms leading discrete and continuous DLMs, achieving better generation quality with fewer sampling steps. These results suggest that ELF offers a promising path toward effective continuous DLMs.
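
The combination described above (a flow in embedding space, classifier-free guidance, and a single discrete decoding step at the end) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the embedding table, the closed-form velocity field toward a single target, the unconditional target chosen as the mean embedding, and the nearest-embedding decoder, which stands in for the shared-weight network mentioned in the abstract. It is not the ELF implementation.

```python
import math

# Hypothetical 2-D embedding table for a 3-token vocabulary (illustrative values).
EMBEDDINGS = {"cat": (1.0, 0.0), "dog": (0.0, 1.0), "fish": (-1.0, -1.0)}


def velocity(x, target, t):
    """Velocity that carries x to `target` at t = 1 along a straight path."""
    return tuple((e - xi) / (1.0 - t) for xi, e in zip(x, target))


def sample(cond_token, guidance=1.0, steps=8, x0=(0.3, -0.2)):
    """Euler-integrate the guided velocity from t = 0 to t = 1 in embedding space."""
    cond_target = EMBEDDINGS[cond_token]
    # Toy unconditional target: the mean of all embeddings (an assumption).
    n = len(EMBEDDINGS)
    uncond_target = tuple(sum(e[d] for e in EMBEDDINGS.values()) / n for d in range(2))
    x, dt = x0, 1.0 / steps
    for i in range(steps):
        t = i * dt
        v_cond = velocity(x, cond_target, t)
        v_uncond = velocity(x, uncond_target, t)
        # Classifier-free guidance: extrapolate from unconditional toward conditional.
        v = tuple(u + guidance * (c - u) for c, u in zip(v_cond, v_uncond))
        x = tuple(xi + dt * vi for xi, vi in zip(x, v))
    return x


def decode(x):
    """Map the final continuous state to a token (nearest embedding, a stand-in)."""
    return min(EMBEDDINGS, key=lambda tok: math.dist(x, EMBEDDINGS[tok]))


print(decode(sample("dog")))
print(sample("cat", guidance=2.0))
```

The state stays continuous for every step and is discretized only once, after the last Euler update. A guidance scale above 1 pushes the final state past the conditional target, away from the unconditional one.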

2605.10937 2026-05-12 cs.CV

Power Reinforcement Post-Training of Text-to-Image Models with Super-Linear Advantage Shaping

Haoyuan Sun, Jing Wang, Yuxin Song, Yu Lu, Bo Fang, Yifu Luo, Jun Yin, Pengyu Zeng, Miao Zhang, Tiantian Zhang, Xueqian Wang, Shijian Lu

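AI summary This paper studies how reinforcement-learning post-training can further improve text-to-image generation models and proposes a remedy for the reward hacking seen in existing methods. The authors point out that normalization can cause policy miscalibration, which hurts training, and propose Super-Linear Advantage Shaping (SLAS), an information-geometry-based method that introduces advantage-dependent weights to reshape the policy space non-linearly, amplifying informative updates and suppressing illusory gradients. Experiments show that SLAS consistently surpasses the DanceGRPO baseline across multiple backbones and benchmarks, improving training efficiency, generalization, and generation quality.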

Abstract

Recently, post-training methods based on reinforcement learning, with a particular focus on Group Relative Policy Optimization (GRPO), have emerged as the robust paradigm for further advancement of text-to-image (T2I) models. However, these methods are often prone to reward hacking, wherein models exploit biases in imperfect reward functions rather than yielding genuine performance gains. In this work, we identify that normalization could lead to miscalibration and directly removing the prompt-level standard deviation term yields an optimal policy ascent direction that is linear in the advantage but still limits the separation of genuine signals from noise. To mitigate the above issues, we propose Super-Linear Advantage Shaping (SLAS) by revisiting the functional update from an information geometry perspective. By extending the Fisher-Rao information metric with advantage-dependent weighting, SLAS introduces a non-linear geometric structure that reshapes the local policy space. This design relaxes constraints along high-advantage directions to amplify informative updates, while tightening those in low-advantage regions to suppress illusory gradients. In addition, batch-level normalization is applied to stabilize training under varying reward scales. Extensive evaluations demonstrate that SLAS consistently surpasses the DanceGRPO baseline across multiple backbones and benchmarks. In particular, it yields faster training dynamics, improved out-of-domain performance on GenEval and UniGenBench++, and enhanced robustness to model scaling, while mitigating reward hacking and preserving semantic and compositional fidelity in generations.

2605.10936 2026-05-12 cs.CV

Personal Visual Context Learning in Large Multimodal Models

Zihui Xue, Ami Baid, Sangho Kim, Mi Luo, Kristen Grauman

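AI summary As wearable devices such as smart glasses bring Large Multimodal Models (LMMs) into users' continuous first-person visual streams, visual personalization becomes the key to turning these models into true personal assistants. This paper proposes Personal Visual Context Learning (Personal VCL), which uses user-specific visual information to resolve personalized queries, and builds Personal-VCL-Bench as an evaluation benchmark. The study finds a significant gap in how current LMMs use visual context and proposes the Agentic Context Bank, an inference-time baseline that improves performance across tasks through a structured memory bank and query-adaptive evidence selection.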

Comments Project website: https://vision.cs.utexas.edu/projects/PersonalVCL/

Abstract

As wearable devices like smart glasses integrate Large Multimodal Models (LMMs) into the continuous first-person visual streams of individual users, the evolution of these models into true personal assistants hinges on visual personalization: the ability to reason over visual information unique to the wearer. We formalize this capability as Personal Visual Context Learning (Personal VCL), the prompt-time capability of using user-specific visual context to resolve personalized queries. To systematically evaluate this, we present Personal-VCL-Bench, a comprehensive benchmark capturing the personal visual world across persons, objects, and behaviors. Our analysis of frontier LMMs identifies a profound context utilization gap, revealing that the mechanisms for leveraging visual evidence, as well as aggregating multiple visual observations, remain critically understudied. Motivated by these findings, we propose the Agentic Context Bank, a strong inference-time baseline that structures a user's visual context into a self-refining memory bank and employs query-adaptive evidence selection. Our baseline approach consistently improves over standard context prompting regimes across tasks and evaluated backbones, demonstrating a practical path towards future personalized LMMs.

2605.10934 2026-05-12 cs.LG cs.AI cs.CV cs.RO stat.ML

Variational Inference for Lévy Process-Driven SDEs via Neural Tilting

Yaman Kindap, Manfred Opper, Benjamin Dupuis, Umut Simsekli, Tolga Birdal

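AI summary This paper studies variational inference for stochastic differential equations (SDEs) driven by Lévy processes, with the aim of accurately capturing extreme events and heavy-tailed phenomena in domains such as finance and climate science. Existing methods are either computationally expensive or rely on Gaussian assumptions that cannot handle jumps. The authors propose a neural exponential tilting framework that exponentially reweights the Lévy measure with neural networks, yielding a flexible variational family that preserves the jump structure while remaining computationally tractable. Experiments on synthetic and real-world data show that the method captures jump dynamics and provides reliable posterior inference in regimes where Gaussian variational methods fail.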

Comments The associated project page, which contains the official implementation, can be found at https://circle-group.github.io/research/NeuralTilting/

Abstract

Modelling extreme events and heavy-tailed phenomena is central to building reliable predictive systems in domains such as finance, climate science, and safety-critical AI. While Lévy processes provide a natural mathematical framework for capturing jumps and heavy tails, Bayesian inference for Lévy-driven stochastic differential equations (SDEs) remains intractable with existing methods: Monte Carlo approaches are rigorous but lack scalability, whereas neural variational inference methods are efficient but rely on Gaussian assumptions that fail to capture discontinuities. We address this tension by introducing a neural exponential tilting framework for variational inference in Lévy-driven SDEs. Our approach constructs a flexible variational family by exponentially reweighting the Lévy measure using neural networks. This parametrization preserves the jump structure of the underlying process while remaining computationally tractable. To enable efficient inference, we develop a quadratic neural parametrization that yields closed-form normalization of the tilted measure, a conditional Gaussian representation for stable processes that facilitates simulation, and symmetry-aware Monte Carlo estimators for scalable optimization. Empirically, we demonstrate that the method accurately captures jump dynamics and yields reliable posterior inference in regimes where Gaussian-based variational approaches fail, on both synthetic and real-world datasets.

2605.10931 2026-05-12 math.AP cs.LG math.DS

Quantifying Concentration Phenomena of Mean-Field Transformers in the Low-Temperature Regime

Albert Alcalde, Leon Bungert, Konstantin Riedl, Tim Roith

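AI summary This paper studies the evolution of the token distribution in deep encoder-only transformers in the low-temperature limit, described by a mean-field continuity equation. Drawing on ideas from the convergence analysis of multi-particle systems, it proves that the token distribution rapidly concentrates onto the push-forward of the initial distribution under a projection map induced by the key, query, and value matrices, and remains metastable for moderate times. It also quantifies how the Wasserstein distance depends on the temperature parameter and the inference time. Numerical experiments confirm the theory and show that, for finite temperature and long times, the dynamics enter a different phase dominated by the spectrum of the value matrix.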

Comments 30 pages, 10 figures

Abstract

Transformers with self-attention modules as their core components have become an integral architecture in modern large language and foundation models. In this paper, we study the evolution of tokens in deep encoder-only transformers at inference time which is described in the large-token limit by a mean-field continuity equation. Leveraging ideas from the convergence analysis of interacting multi-particle systems, with particles corresponding to tokens, we prove that the token distribution rapidly concentrates onto the push-forward of the initial distribution under a projection map induced by the key, query, and value matrices, and remains metastable for moderate times. Specifically, we show that the Wasserstein distance of the two distributions scales like $\sqrt{{\log(β+1)}/β}\exp(Ct)+\exp(-ct)$ in terms of the temperature parameter $β^{-1}\to 0$ and inference time $t\geq 0$. For the proof, we establish Lyapunov-type estimates for the zero-temperature equation, identify its limit as $t\to\infty$, and employ a stability estimate in Wasserstein space together with a quantitative Laplace principle to couple the two equations. Our result implies that for time scales of order $\logβ$ the token distribution concentrates at the identified limiting distribution. Numerical experiments confirm this and, beyond that, complement our theory by showing that for finite $β$ and large $t$ the dynamics enter a different terminal phase, dominated by the spectrum of the value matrix.
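
To see why time scales of order $\log β$ are the relevant ones, the stated rate $\sqrt{\log(β+1)/β}\exp(Ct)+\exp(-ct)$ can be evaluated numerically. The sketch below is only an illustration of that formula: the abstract does not give the constants, so $C = c = 1$ and the choice $t = a\logβ$ with $a = 1/4$ are assumptions, and the function names are hypothetical.

```python
import math


def wasserstein_bound(beta, t, C=1.0, c=1.0):
    """Evaluate sqrt(log(beta + 1) / beta) * exp(C t) + exp(-c t).

    C and c are placeholder constants; the abstract does not specify them.
    """
    growth_term = math.sqrt(math.log(beta + 1.0) / beta) * math.exp(C * t)
    decay_term = math.exp(-c * t)
    return growth_term + decay_term


def bound_at_log_time(beta, a=0.25, C=1.0, c=1.0):
    """Evaluate the bound at t = a * log(beta), a time of order log(beta)."""
    return wasserstein_bound(beta, a * math.log(beta), C, c)


for beta in (1e2, 1e4, 1e8):
    print(beta, bound_at_log_time(beta))
```

With $t = a\logβ$ the first term behaves like $\sqrt{\log(β+1)}\,β^{aC-1/2}$ and the second like $β^{-ac}$, so both vanish as $β\to\infty$ whenever $0 < a < 1/(2C)$. For fixed $β$ and large $t$ the first term grows without bound, so the estimate says nothing about long times, which is the regime the numerical experiments in the abstract explore.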

2605.10929 2026-05-12 math.NA cs.NA

Efficient Admissible Set Projection in Optimization-based Invariant-Domain-Preserving Limiters for Ideal MHD

Chen Liu, Chi-Wang Shu, Xiangxiong Zhang

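AI summary This paper studies how to project efficiently onto the admissible set in optimization-based invariant-domain-preserving limiters for the ideal magnetohydrodynamics (MHD) equations. To obtain physically meaningful and computationally robust numerical solutions, the authors propose an optimization-based limiter that enforces admissibility while preserving global conservation and accuracy. Slicing the admissible set by magnetic energy reduces the high-dimensional projection to a one-dimensional minimization that the Brent method solves efficiently. A splitting method solves the global minimization, and the Zhang-Shu positivity-preserving limiter further enforces pointwise admissibility.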

Abstract

Preserving the admissible set of the ideal magnetohydrodynamics (MHD) equations is important not only for producing physically meaningful numerical solutions, but more importantly for achieving robust computations. In this paper, we develop an optimization-based limiter to enforce admissibility while preserving global conservation and accuracy. For an easy and efficient projection, we decompose the admissible set into slices parameterized by the magnetic energy, so that the MHD projection reduces to a one-dimensional minimization, which can be solved efficiently by the Brent method. The splitting method can be used to efficiently solve the global minimization problem of the optimization-based limiter, which can be used to enforce cell average admissibility in discontinuous Galerkin (DG) schemes, and pointwise admissibility can be further enforced by the Zhang-Shu positivity-preserving limiter. We apply the limiter to high-order DG schemes and present numerical results for a few representative MHD problems.

2605.10927 2026-05-12 cs.DS

Chasing Small Sets Optimally Against Adaptive Adversaries

Christian Coester, Alexa Tudose

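AI summary This paper studies deterministic online algorithms for chasing sets of at most $k$ elements in a metric space, a problem also known as metrical service systems or width-$k$ layered graph traversal. The authors give an $O(2^k)$-competitive deterministic algorithm, closing a 30-year-old gap, and show that this bound is optimal even among randomized algorithms against adaptive adversaries. They also improve the deterministic lower bound and give a matching upper bound for $k=3$. The results have implications for distributed asynchronous collective tree exploration and for the $k$-taxi problem.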

Comments 32 pages

Abstract

We study deterministic online algorithms for the problem of chasing sets of cardinality at most $k$ in a metric space, also known as metrical service systems and equivalent to width-$k$ layered graph traversal. We resolve the 30-year-old gap of $Ω(2^k)\cap O(k2^k)$ on the competitive ratio of this problem by giving an $O(2^k)$-competitive deterministic algorithm. This bound is optimal even among randomized algorithms against adaptive adversaries. We also (slightly) improve the deterministic lower bound to $D_k$, defined recursively by $D_1=1$ and $D_{k+1}=2D_k+\sqrt{8+8D_k}+3$, which we conjecture to be exactly tight. For $k=3$, we provide a matching upper bound of $D_3$. Our results imply slightly improved upper and lower bounds for distributed asynchronous collective tree exploration and for the $k$-taxi problem, respectively. Our algorithm generalizes the classical doubling strategy, previously known to be optimal for $k=2$. The previous best bound for general $k$ was achieved by the generalized work function algorithm (WFA), and was known to be tight for WFA. Our improved bound therefore implies that WFA is sub-optimal for chasing small sets.
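
The recursion for $D_k$ quoted above is easy to evaluate, which gives a feel for how the lower bound compares with $2^k$ and with the earlier upper bound of order $k2^k$. The sketch below only iterates the stated recurrence; the function name is hypothetical, and the comparison columns ignore the constants hidden in the $O(\cdot)$ and $Ω(\cdot)$ notation.

```python
import math


def lower_bound_D(k):
    """Iterate D_1 = 1, D_{k+1} = 2 D_k + sqrt(8 + 8 D_k) + 3."""
    d = 1.0
    for _ in range(k - 1):
        d = 2.0 * d + math.sqrt(8.0 + 8.0 * d) + 3.0
    return d


# D_2 = 2*1 + sqrt(16) + 3 = 9 and D_3 = 21 + sqrt(80).
for k in range(1, 7):
    print(k, round(lower_bound_D(k), 3), 2 ** k, k * 2 ** k)
```

The value $D_2 = 9$ follows directly from the recurrence, and $D_3 = 21 + \sqrt{80} \approx 29.94$ is the quantity the abstract reports a matching upper bound for.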

2605.10925 2026-05-12 cs.RO

PriorVLA: Prior-Preserving Adaptation for Vision-Language-Action Models

Xinyu Guo, Bin Xie, Wei Chai, Xianchi Deng, Tiancai Wang, Zhengxing Wu, Xingyu Chen

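AI summary This work proposes PriorVLA, a framework that preserves pretrained priors when adapting Vision-Language-Action (VLA) models to downstream tasks. It keeps a frozen pretrained expert as a read-only prior source and trains an adaptation expert for task-specific learning, achieving effective adaptation while retaining broad priors. Experiments show that PriorVLA outperforms full fine-tuning and existing state-of-the-art methods across multiple benchmarks and real-world tasks, with the largest gains in out-of-distribution and few-shot settings.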

Comments 32 pages. Project page: https://priorvla.github.io/

Abstract

Large-scale pretraining has made Vision-Language-Action (VLA) models promising foundations for generalist robot manipulation, yet adapting them to downstream tasks remains necessary. However, the common practice of full fine-tuning treats pretraining as initialization and can shift broad priors toward narrow training-distribution patterns. We propose PriorVLA, a novel framework that preserves pretrained priors and learns to leverage them for effective adaptation. PriorVLA keeps a frozen Prior Expert as a read-only prior source and trains an Adaptation Expert for downstream specialization. Expert Queries capture scene priors from the pretrained VLM and motor priors from the Prior Expert, integrating both into the Adaptation Expert to guide adaptation. Together, PriorVLA updates only 25% of the parameters updated by full fine-tuning. Across RoboTwin 2.0, LIBERO, and real-world tasks, PriorVLA achieves stronger overall performance than full fine-tuning and state-of-the-art VLA baselines, with the largest gains under out-of-distribution (OOD) and few-shot settings. PriorVLA improves over pi0.5 by 11 points on RoboTwin 2.0-Hard and achieves 99.1% average success on LIBERO. Across eight real-world tasks and two embodiments, PriorVLA reaches 81% in-distribution (ID) and 57% OOD success with standard data. With only 10 demonstrations per task, PriorVLA reaches 48% ID and 32% OOD success, surpassing pi0.5 by 24 and 22 points, respectively.

2605.10922 2026-05-12 cs.CV

Pixal3D: Pixel-Aligned 3D Generation from Images

Dong-Yang Li, Wang Zhao, Yuxin Chen, Wenbo Hu, Meng-Hao Guo, Fang-Lue Zhang, Ying Shan, Shi-Min Hu

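AI summary Pixal3D is an image-based, high-fidelity 3D generation method that addresses the weakness of existing 3D generative models at reproducing pixel-level detail. It introduces a pixel back-projection conditioning scheme and generates pixel-aligned 3D geometry directly in the input view, establishing explicit pixel-to-3D feature correspondence and substantially improving fidelity. Pixal3D also supports multi-view generation and scene-level synthesis, offering a new way to generate high-fidelity 3D objects and scenes from one or more images.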

Comments SIGGRAPH 2026. Project page: https://ldyang694.github.io/projects/pixal3d/

Abstract

Recent advances in 3D generative models have rapidly improved image-to-3D synthesis quality, enabling higher-resolution geometry and more realistic appearance. Yet fidelity, which measures pixel-level faithfulness of the generated 3D asset to the input image, still remains a central bottleneck. We argue this stems from an implicit 2D-3D correspondence issue: most 3D-native generators synthesize shape in canonical space and inject image cues via attention, leaving pixel-to-3D associations ambiguous. To tackle this issue, we draw inspiration from 3D reconstruction and propose Pixal3D, a pixel-aligned 3D generation paradigm for high-fidelity 3D asset creation from images. Instead of generating in a canonical pose, Pixal3D directly generates 3D in a pixel-aligned way, consistent with the input view. To enable this, we introduce a pixel back-projection conditioning scheme that explicitly lifts multi-scale image features into a 3D feature volume, establishing direct pixel-to-3D correspondence without ambiguity. We show that Pixal3D is not only scalable and capable of producing high-quality 3D assets, but also substantially improves fidelity, approaching the fidelity level of reconstruction. Furthermore, Pixal3D naturally extends to multi-view generation by aggregating back-projected feature volumes across views. Finally, we show pixel-aligned generation benefits scene synthesis, and present a modular pipeline that produces high-fidelity, object-separated 3D scenes from images. Pixal3D for the first time demonstrates 3D-native pixel-aligned generation at scale, and provides a new inspiring way towards high-fidelity 3D generation of object or scene from single or multi-view images. Project page: https://ldyang694.github.io/projects/pixal3d/

2605.10921 2026-05-12 cs.RO

RoboMemArena: A Comprehensive and Challenging Robotic Memory Benchmark

Huashuo Lei, Wenxuan Song, Huarui Zhang, Jieyuan Pei, Jiayi Chen, Haodong Yan, Han Zhao, Pengxiang Ding, Zhipeng Zhang, Lida Huang, Donglin Wang, Yan Wang, Haoang Li

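AI summary RoboMemArena is a new robotic memory benchmark that addresses the shortcomings of existing benchmarks in multimodal annotation, task coverage, and real-world evaluation. It contains 26 tasks with average trajectory lengths exceeding 1,000 steps, and 68.9% of its subtasks are memory-dependent. The authors also design PrediMem, a dual-system architecture built around a vision-language model, which uses a predictive coding mechanism to improve sensitivity to task dynamics. Experiments on RoboMemArena show that PrediMem outperforms all baselines.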

Comments Project website: https://robomemarena.github.io

Abstract

Memory is a critical component of robotic intelligence, as robots must rely on past observations and actions to accomplish long-horizon tasks in partially observable environments. However, existing robotic memory benchmarks still lack multimodal annotations for memory formation, provide limited task coverage and structural complexity, and remain restricted to simulation without real-world evaluation. We address this gap with RoboMemArena, a large-scale benchmark of 26 tasks, with average trajectory lengths exceeding 1,000 steps per task and 68.9% of subtasks being memory-dependent. The generation pipeline leverages a vision-language model (VLM) to design and compose subtasks, generates full trajectories through atomic functions, and provides memory-related annotations, including subtask instructions and native keyframe annotations, while paired real-world memory tasks support physical evaluation. We further design PrediMem, a dual-system VLA in which a high-level VLM planner manages a memory bank with recent and keyframe buffers and uses a predictive coding head to improve sensitivity to task dynamics. Extensive experiments on RoboMemArena show that PrediMem outperforms all baselines and provides insights into memory management, model architecture, and scaling laws for complex memory systems.

2605.10920 2026-05-12 cs.SE

Using Logs to support Programming Education

Gilmar Gomes do Nascimento, Maria Claudia F. P Emer, Adolfo Gustavo Serra Seca Neto, Laudelino Cordeiro Bastos

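AI summary This study aims to address the lack of quantitative assessment in programming education by analyzing real-time code logs produced while students learn to program. It proposes a new plugin for a widely used code editor that records students' fine-grained interactions during programming and documentation, producing a detailed dataset of errors, progress, and timestamps. The approach gives educators a data-driven learning analytics tool that supports research on teaching methods, identification of learning difficulties, and personalized instruction, moving programming education toward data-driven, empirical practice.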

Comments Author version of the paper accepted for publication at XX Conferência Latino-Americana de Tecnologias de Aprendizagem - LACLO 2025

Abstract

Software developers use metrics to evaluate code quality and productivity, but these practices are still rare in programming education. This project bridges the gap by collecting real-time learning analytics from individual student and whole-class code development logs. This granular, quantitative data provides educators with qualitative insights into the learning process. It allows them to evaluate student comprehension, identify common challenges, and critically assess whether the allocated time for exercises and algorithms is sufficient for mastery. Unlike traditional Learning Management Systems, we propose a novel approach: a plugin for a widely used code editor that captures granular interactions during programming and documentation. The resulting dataset logs coding behaviors, errors, and progress, enabling evidence-based analysis of learning patterns and educational benchmarking. By structuring this real-time programming trail, we support research on teaching methodologies, learner challenges, and skill acquisition. Quantitative metrics complement qualitative assessment by evaluating code, exercise progress, and timestamp logs. Our goal is to provide an open-access database for educators and researchers, fostering data-driven insights to enhance instruction and personalize learning experiences. This work aligns industrial best practices with pedagogical innovation, advancing measurable, empirical approaches to programming education.

2605.10917 2026-05-12 cs.LG cs.MA cs.RO

Optimal and Scalable MAPF via Multi-Marginal Optimal Transport and Schrödinger Bridges

Usman A. Khan, Joseph W. Durham

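AI summary This paper studies anonymous multi-agent path finding (MAPF), modeling it as a multi-marginal optimal transport (MMOT) problem with Markovian structure and proving that, under this structure, the exponentially large problem reduces to a linear program (LP) of polynomial size. Using the probabilistic framework of Schrödinger bridges, the authors propose an iterative solution based on entropic regularization that yields near-optimal transports at significantly lower computational complexity. Experiments show that the approach scales well while maintaining solution quality.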

Comments Accepted in ICML 2026 as a spotlight paper

Abstract

We consider anonymous multi-agent path finding (MAPF) where a set of robots is tasked to travel to a set of targets on a finite, connected graph. We show that MAPF can be cast as a special class of multi-marginal optimal transport (MMOT) problems with an underlying Markovian structure, under which the exponentially large MMOT collapses to a linear program (LP) polynomial in size. Focusing on the anonymous setting, we establish conditions under which the corresponding LP is feasible, totally unimodular, and consequently, yields min-cost, integral $(\{0,1\})$ transports that do not overlap in both space and time. To adapt the approach to large-scale problems, we cast the MAPF-MMOT in a probabilistic framework via Schrödinger bridges. Under standard assumptions, we show that the Schrödinger bridge formulation reduces to an entropic regularization of the corresponding MMOT that admits an iterative Sinkhorn-type solution. The Schrödinger bridge, being a probabilistic framework, provides a shadow (fractional) transport that we use as a template to solve a reduced LP and demonstrate that it results in near-optimal, integral transports at a significant reduction in complexity. Extensive experiments highlight the optimality and scalability of the proposed approaches.
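
The phrase "iterative Sinkhorn-type solution" can be made concrete with the classical two-marginal case. The sketch below is a generic entropic optimal transport solver, not the multi-marginal, Markov-structured formulation of the paper. The robot and target positions, the squared-distance cost, the value of `eps`, and the row-wise argmax rounding (a simple stand-in for solving a reduced LP) are all assumptions chosen for illustration.

```python
import math


def sinkhorn(cost, a, b, eps=0.05, iters=500):
    """Entropic OT between marginals a and b via Sinkhorn scaling iterations."""
    n, m = len(a), len(b)
    # Gibbs kernel K_ij = exp(-cost_ij / eps).
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the two marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


# Hypothetical instance: three robots and three interchangeable targets on a line.
robots = [0.0, 1.0, 2.0]
targets = [0.1, 1.2, 1.9]
cost = [[(r - t) ** 2 for t in targets] for r in robots]

plan = sinkhorn(cost, [1.0] * 3, [1.0] * 3)
# Round the fractional plan by sending each robot to its heaviest target.
assignment = [max(range(3), key=lambda j: row[j]) for row in plan]
print(assignment)
```

The plan returned by `sinkhorn` is fractional, which mirrors the role of the "shadow" transport in the abstract: it is a soft template that a second, exact step turns into an integral assignment. Smaller `eps` makes the plan sharper but can underflow the kernel, so practical solvers usually work in the log domain.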

2605.10912 2026-05-12 cs.CL

WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation

Shuangrui Ding, Xuanlang Dai, Long Xing, Shengyuan Ding, Ziyu Liu, Yang JingYi, Penghui Yang, Zhixiong Zhang, Xilin Wei, Xinyu Fang, Yubo Ma, Haodong Duan, Jing Shao, Jiaqi Wang, Dahua Lin, Kai Chen, Yuhang Zang

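AI summary WildClawBench is a benchmark for evaluating long-horizon task execution in real environments. It contains 60 human-authored, bilingual, multimodal tasks spanning six thematic categories. The benchmark runs in reproducible Docker containers with real command-line agent harnesses and tools, and tasks average about 8 minutes and more than 20 tool calls. Grading combines rule-based checks, environment-state auditing, and semantic judgment by large models. The results show that current frontier models still have substantial room for improvement on long-horizon tasks in real runtimes.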

Comments GitHub link: https://github.com/internlm/WildClawBench

Abstract

Large language and vision-language models increasingly power agents that act on a user's behalf through command-line interface (CLI) harnesses. However, most agent benchmarks still rely on synthetic sandboxes, short-horizon tasks, mock-service APIs, and final-answer checks, leaving open whether agents can complete realistic long-horizon work in the runtimes where they are deployed. This work presents WildClawBench, a native-runtime benchmark of 60 human-authored, bilingual, multimodal tasks spanning six thematic categories. Each task averages roughly 8 minutes of wall-clock time and over 20 tool calls, and runs inside a reproducible Docker container hosting an actual CLI agent harness (OpenClaw, Claude Code, Codex, or Hermes Agent) with access to real tools rather than mock services. Grading is hybrid, combining deterministic rule-based checks, environment-state auditing of side effects, and an LLM/VLM judge for semantic verification. Across 19 frontier models, the best, Claude Opus 4.7, reaches only 62.2% overall under OpenClaw, while every other model stays below 60%, and switching harness alone shifts a single model by up to 18 points. These results show that long-horizon, native-runtime agent evaluation remains a far-from-resolved task for current frontier models. We release the tasks, code, and containerized tooling to support reproducible evaluation.

2605.10911 2026-05-12 math.PR cs.CC cs.DS math.CO math.ST stat.TH

The stochastic block model has the overlap gap property for modularity

Shankar Bhamidi, David Gamarnik, Remco van der Hofstad, Nelly Litvak, Pawel Pralat, Fiona Skerman, Yasmin Tousinejad

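AI summary This paper studies the theoretical limits of modularity-based clustering algorithms on the stochastic block model (SBM), showing that modularity exhibits the overlap gap property (OGP) on the SBM. This property implies that modularity-based local algorithms have difficulty recovering the hidden community structure and that a related Markov chain mixes slowly. The work extends a result of Bickel and Chen, proving that, with high probability, any partition with near-optimal modularity is close to the hidden community partition, which provides a theoretical basis for understanding algorithmic bottlenecks on the SBM.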

Comments 28 pages, 2 figures

Abstract

The overlap gap property (OGP) is a statement about the geometry of near-optimal solutions. Exhibiting OGP implies failure of a class of local algorithms, and it has been observed to coincide with conjectured algorithmic limits in problems with a statistical-computational gap. We consider the Stochastic Block Model (SBM), where the graph has a planted partition with $k$ equal-size blocks which form the 'communities', and where, for parameters $p>q$, vertices within the same community connect with probability $p$, while vertices in different communities connect with probability $q$, independently across pairs of vertices. Modularity-based clustering algorithms have become ubiquitous in applications. This article studies theoretical limits of local algorithms based on the modularity score on the SBM. We establish that modularity exhibits OGP on the SBM. This rules out a class of local algorithms based on modularity for recovery in the SBM, and shows slow mixing time for a related Markov chain. Theoretically, this is one of the few instances where OGP has been established for a 'planted' model, as most such analyses to date consider the 'null' model. As part of our analysis, we extend a result by Bickel and Chen (2009), who established that with high probability, the modularity optimal partition of SBM is $o(n)$ local moves away from the planted partition, where $n$ is the graph size. We show that, with high probability, any partition with modularity score sufficiently near the optimal value is close to the planted partition.
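
For readers unfamiliar with the two objects in this abstract, the sketch below samples a graph from the SBM as described (equal-size blocks, within-block probability $p$, between-block probability $q$) and scores a partition with the standard Newman-Girvan modularity, $\sum_c \left[ e_c/m - (\mathrm{vol}_c/2m)^2 \right]$. The function names and the example parameters are hypothetical, and the paper may use a different normalization of the score.

```python
import random


def sample_sbm(n, k, p, q, seed=0):
    """Sample an SBM graph with k equal-size planted blocks (k should divide n)."""
    rng = random.Random(seed)
    labels = [i * k // n for i in range(n)]
    edges = [
        (u, v)
        for u in range(n)
        for v in range(u + 1, n)
        if rng.random() < (p if labels[u] == labels[v] else q)
    ]
    return edges, labels


def modularity(edges, labels):
    """Newman-Girvan modularity of the partition given by `labels`."""
    m = len(edges)
    internal, volume = {}, {}
    for u, v in edges:
        # Each edge adds one to the degree total (volume) of both endpoints' parts.
        volume[labels[u]] = volume.get(labels[u], 0) + 1
        volume[labels[v]] = volume.get(labels[v], 0) + 1
        if labels[u] == labels[v]:
            internal[labels[u]] = internal.get(labels[u], 0) + 1
    return sum(
        internal.get(c, 0) / m - (vol / (2 * m)) ** 2 for c, vol in volume.items()
    )


# Deterministic check: two triangles joined by one edge.
two_triangles = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
print(modularity(two_triangles, [0, 0, 0, 1, 1, 1]))  # 5/14
print(modularity(two_triangles, [0, 0, 0, 0, 0, 0]))  # 0.0

edges, planted = sample_sbm(n=40, k=2, p=0.8, q=0.05)
print(modularity(edges, planted))
```

On the two-triangle graph the natural split scores $2(3/7 - 1/4) = 5/14$, while putting every vertex in one part scores 0. The OGP statement in the abstract concerns how partitions with near-optimal scores of this kind are arranged relative to the planted one.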

2605.10910 2026-05-12 quant-ph cs.LG

Equivariant Reinforcement Learning for Clifford Quantum Circuit Synthesis

Richie Yeung, Aleks Kissinger, Rob Cornish

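AI summary This paper studies the synthesis of Clifford quantum circuits for devices with all-to-all qubit connectivity, casting it as a reinforcement learning task in which an agent learns a sequence of elementary Clifford gates that reduces a given symplectic matrix representation to the identity. It proposes a new neural network architecture that is equivariant to qubit relabelings and applies to systems of different sizes without network reparameterization. Experiments show near-optimal results on six-qubit circuits and scaling to unseen Clifford tableaus with up to thirty qubits, where the agent achieves lower two-qubit gate counts than Qiskit's Aaronson-Gottesman and greedy Clifford synthesizers.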

Abstract

We consider the problem of synthesizing Clifford quantum circuits for devices with all-to-all qubit connectivity. We approach this task as a reinforcement learning problem in which an agent learns to discover a sequence of elementary Clifford gates that reduces a given symplectic matrix representation of a Clifford circuit to the identity. This formulation permits a simple learning curriculum based on random walks from the identity. We introduce a novel neural network architecture that is equivariant to qubit relabelings of the symplectic matrix representation, and which is size-agnostic, allowing a single learned policy to be applied across different qubit counts without circuit splicing or network reparameterization. On six-qubit Clifford circuits, the largest regime for which optimal references are available, our agent finds circuits within one two-qubit gate of optimality in milliseconds per instance, and finds optimal circuits in 99.2% of instances within seconds per instance. After continued training on ten-qubit instances, the agent scales to unseen Clifford tableaus with up to thirty qubits, including targets generated from circuits with over a thousand Clifford gates, where it achieves lower average two-qubit gate counts than Qiskit's Aaronson-Gottesman and greedy Clifford synthesizers.
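
The "reduce a symplectic matrix to the identity" formulation can be made concrete with a small sketch. It assumes one common convention, which the abstract does not specify: an $n$-qubit Clifford is represented, up to Pauli factors and phases, by a $2n \times 2n$ binary matrix acting on column vectors ordered as $(x_1,\dots,x_n,z_1,\dots,z_n)$ over GF(2). The gate constructors and helper names are hypothetical, and the learned policy and equivariant network of the paper are not reproduced here.

```python
def identity(n):
    """2n x 2n identity matrix over GF(2)."""
    return [[int(i == j) for j in range(2 * n)] for i in range(2 * n)]


def matmul(A, B):
    """Matrix product over GF(2)."""
    size = len(A)
    return [
        [sum(A[i][k] * B[k][j] for k in range(size)) % 2 for j in range(size)]
        for i in range(size)
    ]


def hadamard(n, q):
    """H on qubit q swaps its x and z coordinates."""
    M = identity(n)
    M[q][q] = M[n + q][n + q] = 0
    M[q][n + q] = M[n + q][q] = 1
    return M


def phase(n, q):
    """S on qubit q adds x_q into z_q."""
    M = identity(n)
    M[n + q][q] = 1
    return M


def cnot(n, c, t):
    """CNOT adds x_c into x_t and z_t into z_c."""
    M = identity(n)
    M[t][c] = 1
    M[n + c][n + t] = 1
    return M


def is_symplectic(M):
    """Check M^T Omega M = Omega for Omega = [[0, I], [I, 0]] over GF(2)."""
    n = len(M) // 2
    omega = [[int(abs(i - j) == n) for j in range(2 * n)] for i in range(2 * n)]
    Mt = [list(row) for row in zip(*M)]
    return matmul(matmul(Mt, omega), M) == omega


# Build the matrix of a small circuit, then undo it gate by gate.
gates = [hadamard(2, 0), cnot(2, 0, 1), phase(2, 1)]
target = identity(2)
for g in gates:
    target = matmul(g, target)

state = target
for g in reversed(gates):  # each of these matrices is its own inverse over GF(2)
    state = matmul(g, state)
print(is_symplectic(target), state == identity(2))
```

In the setting of the abstract the gate sequence is not known in advance: the agent sees only the target matrix and must choose gates that drive it to the identity, and the number of two-qubit gates it uses is the quantity compared against the reference synthesizers.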

2605.10909 2026-05-12 cs.LG stat.ML

Revisiting Policy Gradients for Restricted Policy Classes: Escaping Myopic Local Optima with $k$-step Policy Gradients

Alex DeWeese, Guannan Qu

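AI summary This paper revisits standard policy gradient methods on restricted policy classes and finds that they easily get stuck at suboptimal critical points, mainly because the policy gradient is itself myopic: it improves the policy using only the one-step $Q$-function. The authors propose a generalized $k$-step policy gradient method that couples the randomness within a $k$-step time window and can escape myopic local optima in restricted policy classes. The analysis shows that the method converges to a solution whose performance is exponentially close, in $k$, to that of the optimal deterministic policy, and that projected gradient descent and mirror descent achieve this guarantee in $O(1/T)$ iterations while assuming only smoothness and differentiability of the value function. This makes the method applicable to previously elusive problems such as state aggregation and partially observable cooperative multi-agent settings.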

Abstract

This work revisits standard policy gradient methods used on restricted policy classes, which are known to get stuck in suboptimal critical points. We identify an important cause for this phenomenon to be that the policy gradient is itself fundamentally myopic, i.e. it only improves the policy based on the one-step $Q$-function. In this work, we propose a generalized $k$-step policy gradient method that couples the randomness within a $k$-step time window and can escape the myopic local optima in MDPs with restricted policy classes. We show this new method is theoretically guaranteed to converge to a solution that is exponentially close in performance to the optimal deterministic policy with respect to $k$. Further, we show projected gradient descent and mirror descent with this $k$-step policy gradient can achieve this exponential guarantee in $O(\frac{1}{T})$ iterations, despite only assuming smoothness and differentiability of the value function. This will provide near optimal solutions to previously elusive applications like state aggregation and partially observable cooperative multi-agent settings. Moreover, our bounds avoid the ubiquitous distribution mismatch factors $||d_μ^{π^*} / d_μ^π||_\infty$ and $||d_μ^{π^*} / μ||_\infty$ enabling the $k$-step policy gradient method to escape suboptimal critical points that emerge from poor exploration in fully observable settings.

2605.10904 2026-05-12 cs.RO

MDrive: Benchmarking Closed-Loop Cooperative Driving for End-to-End Multi-agent Systems

Marco Coscoy, Zewei Zhou, Seth Z. Zhao, Henry Wei, Angela Magtoto, Johnson Liu, Rui Song, Walter Zimmer, Zhiyu Huang, Chen Tang, Bolei Zhou, Jiaqi Ma

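AI summary This paper presents MDrive, a closed-loop cooperative driving benchmark for end-to-end multi-agent systems, designed to address the shortcomings of existing V2X benchmarks in closed-loop evaluation and scenario diversity. The benchmark comprises 225 scenarios built from NHTSA pre-crash typologies and real-world V2X data. Experiments show that multi-agent systems generally outperform single-agent ones, but perception sharing does not always translate into better planning, and negotiation can harm planning in complex, dense traffic. MDrive also provides an open-source toolbox for scenario generation, Real2Sim conversion, and human-in-the-loop simulation, offering a reproducible foundation for evaluating and improving the generalization and robustness of cooperative driving systems.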

Comments Website: https://mdrive-challenge.github.io/

Abstract

Vehicle-to-Everything (V2X) communication has emerged as a promising paradigm for autonomous driving, enabling connected agents to share complementary perception information and negotiate with each other to benefit the final planning. Existing V2X benchmarks, however, fall short in two ways: (i) open-loop evaluations fail to capture the inherently closed-loop nature of driving, leading to evaluation gaps, and (ii) current closed-loop evaluations lack the behavioral and interactive diversity needed to reflect real-world driving. Thus, the extent to which multi-agent systems benefit closed-loop driving remains unclear. In this paper, we introduce MDrive, a closed-loop cooperative driving benchmark comprising 225 scenarios grounded in both NHTSA pre-crash typologies and real-world V2X datasets. Our benchmark results demonstrate that multi-agent systems are generally better than single-agent counterparts. However, current multi-agent systems still face two important challenges: (i) perception sharing enhances perception but does not always translate to better planning; (ii) negotiation improves planning performance but harms it in complex and dense traffic scenarios. MDrive further provides an open-source toolbox for scenario generation, Real2Sim conversion, and human-in-the-loop simulation. Together, MDrive establishes a reproducible foundation for evaluating and improving the generalization and robustness of cooperative driving systems.

2605.10903 2026-05-12 cs.CV cs.RO

CapVector: Learning Transferable Capability Vectors in Parametric Space for Vision-Language-Action Models

Wenxuan Song, Han Zhao, Fuhao Li, Ziyang Zhou, Xi Wang, Jing Lyu, Pengxiang Ding, Yan Wang, Donglin Wang, Haoang Li

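AI summary This paper proposes a new method to address the limited performance gains and high adaptation cost of pretrained vision-language-action (VLA) models under standard supervised finetuning. The method decouples, in parameter space, the two objectives of auxiliary-objective finetuning, namely enhancing general capabilities and fitting task-specific action distributions, and trains two finetuned models on a small-scale task set using two different training strategies, from which it extracts the capability vectors contributed by the auxiliary objectives. These capability vectors are merged with the pretrained parameters to form a capability-enhanced meta model, and a lightweight orthogonal regularization loss is introduced so that the model retains high performance while significantly reducing computational overhead. Experiments show that the method is effective and generalizes well across a variety of models and novel environments.

Details
English abstract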

This paper proposes a novel approach to address the challenge that pretrained VLA models often fail to effectively improve performance and reduce adaptation costs during standard supervised finetuning (SFT). Some advanced finetuning methods with auxiliary training objectives can improve performance and reduce the number of convergence steps. However, they typically incur significant computational overhead due to the additional losses from auxiliary objectives. To combine the enhanced capabilities of auxiliary training with the simplicity of standard SFT, we decouple the two objectives of auxiliary-objective SFT within the parameter space, namely, enhancing general capabilities and fitting task-specific action distributions. To achieve this, we only need to train the model to convergence on a small-scale task set using two distinct training strategies, resulting in two finetuned models. The parameter difference between the two models can then be interpreted as capability vectors provided by auxiliary objectives. These vectors are then merged with pretrained parameters to form a capability-enhanced meta model. Moreover, when standard SFT is augmented with a lightweight orthogonal regularization loss, the merged model attains performance comparable to auxiliary finetuned baselines with reduced computational overhead. Internal and external experiments demonstrate that our capability vectors (1) are effective and versatile across diverse models, and (2) generalize to novel environments and embodiments out of the box.
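
The parameter-space arithmetic described in this abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: parameters are modeled as plain dictionaries of float lists, the function names `capability_vector` and `merge` and the scaling factor `alpha` are hypothetical, and the subtraction direction (auxiliary-objective model minus standard-SFT model) is assumed rather than taken from the paper.

```python
from typing import Dict, List

Params = Dict[str, List[float]]  # hypothetical stand-in for a model state dict


def capability_vector(theta_aux: Params, theta_sft: Params) -> Params:
    """Per-parameter difference between two finetuned models (assumed: aux - sft)."""
    return {
        name: [a - s for a, s in zip(theta_aux[name], theta_sft[name])]
        for name in theta_aux
    }


def merge(theta_pre: Params, vector: Params, alpha: float = 1.0) -> Params:
    """Add the (optionally scaled) capability vector to the pretrained parameters."""
    return {
        name: [p + alpha * v for p, v in zip(theta_pre[name], vector[name])]
        for name in theta_pre
    }


# Toy parameters, invented for illustration only.
theta_pre = {"w": [0.0, 1.0], "b": [0.5]}
theta_sft = {"w": [0.25, 1.0], "b": [0.5]}   # finetuned with standard SFT
theta_aux = {"w": [0.5, 1.5], "b": [0.75]}   # finetuned with an auxiliary objective

vec = capability_vector(theta_aux, theta_sft)
theta_meta = merge(theta_pre, vec)
print(theta_meta)
```

The merged dictionary plays the role of the capability-enhanced meta model, which would then be finetuned further with standard SFT.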

2605.10901 2026-05-12 cs.LG

Beyond Red-Teaming: Formal Guarantees of LLM Guardrail Classifiers

Nikita Kezins, Urbas Ekka, Pascal Berrang, Luca Arnaboldi

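AI summary This study aims to provide formal guarantees for the guardrail classifiers of language models, to ensure they effectively defend against harmful behavior. Because a formal specification of "harmful behavior" is hard to define in a discrete input space, the authors shift verification to the classifier's pre-activation space, where constructing convex regions and exploiting the monotonicity of the classification head yields efficient formal proofs without approximation. Experiments show that existing guardrail classifiers have verifiable safety holes under formal verification, revealing potential stability and coverage problems in practice.

Details
English abstract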

Guardrail Classifiers defend production language models against harmful behavior, but although results seem promising in testing, they provide no formal guarantees. Providing formal guarantees for such models is hard because "harmful behavior" has no natural specification in a discrete input space, and the standard epsilon-ball properties used in other domains do not carry semantic meaning. We close this gap by shifting verification from the discrete input space to the classifier's pre-activation space, where we define a harmful region as a convex shape enclosing the representations of known harmful prompts. Because the sigmoid classification head is monotonic, certifying the worst-case point is sufficient to certify the entire region, yielding a closed-form soundness proof without approximation in O(d) time. To formally evaluate these classifiers, we propose two constructions of such regions: SVD-aligned hyper-rectangles, which yield exact SAT/UNSAT certificates, and Gaussian Mixture Models, which yield probabilistic certificates over semantically coherent clusters. Applying this framework to three author-trained Guardrail Classifiers on the toxicity domain, every hyper-rectangle configuration returns SAT, exposing verifiable safety holes across all classifiers, despite seemingly high empirical metrics. Probabilistic GMM certificates also expose a divergent structural stability in how these models represent harm. While GPT-2 and Llama-3.1-8B maintain robust coverage of 90% and 80% across varying boundaries, BERT's safety guarantees prove uniquely volatile. This 'coverage collapse' to 55% at the optimal threshold reveals a sparsely populated safety margin in BERT, which only achieves full coverage by adopting an extremely conservative pessimistic threshold. Combined, these approaches provide new insights into how effective Guardrail Classifiers really are, beyond traditional red-teaming.
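
The worst-case-point argument for hyper-rectangles can be illustrated with a short sketch. It assumes, beyond what the abstract states, that the head has the form sigmoid(w·z + b) and that the box is axis-aligned in the coordinates where the head is linear (the paper's SVD alignment is not reproduced). The names `worst_case_point` and `certify_box`, the default threshold of 0.5, and the reading of SAT as "a counterexample exists" are assumptions made for this example.

```python
import math
from typing import List, Tuple


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def worst_case_point(w: List[float], lo: List[float], hi: List[float]) -> List[float]:
    """Corner of the box [lo, hi] that minimizes the linear logit w.z + b."""
    # Each coordinate contributes independently, so pick the bound that
    # lowers its term: lo_i when w_i >= 0, hi_i otherwise. One pass, O(d).
    return [l if wi >= 0 else h for wi, l, h in zip(w, lo, hi)]


def certify_box(w, b, lo, hi, threshold=0.5) -> Tuple[str, List[float], float]:
    """Return ("UNSAT", ...) if every point in the box scores >= threshold,
    otherwise ("SAT", counterexample, score)."""
    z = worst_case_point(w, lo, hi)
    # Sigmoid is monotonic, so the minimum logit also gives the minimum score.
    score = sigmoid(sum(wi * zi for wi, zi in zip(w, z)) + b)
    verdict = "UNSAT" if score >= threshold else "SAT"
    return verdict, z, score


w, b = [2.0, -1.0], 0.0  # toy head weights, invented for illustration
result_ok = certify_box(w, b, lo=[1.0, 0.0], hi=[2.0, 1.0])    # worst logit is 1.0
result_hole = certify_box(w, b, lo=[-1.0, 0.0], hi=[2.0, 1.0])  # worst logit is -3.0
print(result_ok[0], result_hole[0])
```

Because only one point is evaluated per region, the check needs no solver and no relaxation; a SAT verdict comes with the concrete pre-activation vector that the classifier would fail to flag.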

2605.10900 2026-05-12 cs.GT

Effective, Efficient, and General Information Abstraction for Imperfect-Information Extensive-Form Games

Boning Li, Longbo Huang

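AI summary This study proposes an information abstraction method called WEVA to reduce the computational cost of solving imperfect-information games. The method runs a small number of Counterfactual Regret Minimization (CFR) iterations to produce per-hand expected value features, combines them into a depth-weighted multi-node feature vector, and applies k-means++ clustering to abstract information sets, requiring neither domain knowledge nor pre-training. Experiments show that WEVA outperforms traditional equity-based and rank-based methods across several structurally different games, significantly improving solving efficiency and reducing the exploitability of the resulting strategies.

Comments 17 pages, 6 figures

Details
English abstract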

Information abstraction reduces the computational cost of solving imperfect-information games by clustering information sets into a smaller number of buckets. Existing methods either rely on domain-specific features such as rank or equity, which are inapplicable to games with non-standard payoff structures, or require expensive offline neural-network training on billions of samples. We propose Warm-up Expected Value-based Abstraction (WEVA), a simple yet effective alternative: run a small number of Counterfactual Regret Minimization (CFR) iterations on the full game as a warm-up phase, extract per-hand expected value features at every decision node, form a depth-weighted multi-node feature vector, and apply $k$-means++ clustering to obtain the abstraction mapping. WEVA requires no domain knowledge, no pre-training, and incurs only a small overhead on top of the abstract-game solve. Experiments on three structurally diverse games, with different bucket numbers and CFR variants, show that WEVA consistently outperforms equity-based and rank-based abstractions, reducing exploitability by over 80% in the best case. Surprisingly, as few as $W=10$ warm-up iterations already produce abstractions that outperform existing information abstraction methods in most settings. These results establish WEVA as an effective, efficient, and general approach to information abstraction in imperfect-information extensive-form games.
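
The feature-building and clustering steps of this pipeline can be sketched in a few lines. The sketch skips the CFR warm-up entirely and starts from made-up per-hand expected values; the geometric weighting `decay ** depth` and the names `weighted_features` and `kmeans_pp` are assumptions for illustration, since the abstract does not give the exact weighting formula. It also assumes the input contains at least `k` distinct points.

```python
import random
from typing import List, Sequence


def weighted_features(ev: Sequence[float], depths: Sequence[int], decay: float = 0.5) -> List[float]:
    """Scale each node's expected value by decay**depth (an assumed weighting)."""
    return [v * decay ** d for v, d in zip(ev, depths)]


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp(points: List[List[float]], k: int, iters: int = 20, seed: int = 0) -> List[int]:
    """k-means with k-means++ seeding; returns a bucket index per point."""
    rng = random.Random(seed)
    centers = [list(rng.choice(points))]
    while len(centers) < k:
        # Sample the next center with probability proportional to the squared
        # distance to the nearest center chosen so far.
        d2 = [min(sq_dist(p, c) for c in centers) for p in points]
        centers.append(list(rng.choices(points, weights=d2, k=1)[0]))
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center for every point.
        labels = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points]
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a bucket is empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy expected values for four hands at two decision nodes (invented numbers).
ev_per_hand = [[1.0, 0.8], [0.9, 1.0], [-1.0, -0.6], [-0.8, -1.0]]
depths = [0, 1]
features = [weighted_features(ev, depths) for ev in ev_per_hand]
labels = kmeans_pp(features, k=2)
print(labels)
```

On this toy input the two strong hands end up in one bucket and the two weak hands in the other; which bucket gets which index depends on the random seed.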

2605.10899 2026-05-12 cs.CL cs.LG

RubricEM: Meta-RL with Rubric-guided Policy Decomposition beyond Verifiable Rewards

Gaotang Li, Bhavana Dalvi Mishra, Zifeng Wang, Jun Yan, Yanfei Chen, Chun-Liang Li, Long T. Le, Rujun Han, George Lee, Hanghang Tong, Chen-Yu Lee, Tomas Pfister

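AI summary This paper presents RubricEM, a rubric-guided meta-reinforcement learning framework designed to address the problem of training deep research agents without clear reward signals. By decomposing the research process into multiple stages and combining this with reflection-based meta-policy evolution, the method enables efficient optimization of long-horizon tasks. RubricEM uses structured rubrics to provide finer-grained feedback and turns evaluation experience into reusable guidance, significantly improving agent performance on complex tasks such as long-form generation.

Comments 63 pages, 6 figures

Details
English abstract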

Training deep research agents, namely systems that plan, search, evaluate evidence, and synthesize long-form reports, pushes reinforcement learning beyond the regime of verifiable rewards. Their outputs lack ground-truth answers, their trajectories span many tool-augmented decisions, and standard post-training offers little mechanism for turning past attempts into reusable experience. In this work, we argue that rubrics should serve not merely as final-answer evaluators, but as the shared interface that structures policy execution, judge feedback, and agent memory. Based on this view, we introduce RubricEM, a rubric-guided reinforcement learning framework that combines stagewise policy decomposition with reflection-based meta-policy evolution. RubricEM first makes research trajectories stage-aware by conditioning planning, evidence gathering, review, and synthesis on self-generated rubrics. It then assigns credit with Stage-Structured GRPO, which uses stagewise rubric judgments to provide denser semantic feedback for long-horizon optimization. In parallel, RubricEM trains a shared-backbone reflection meta-policy that distills judged trajectories into reusable rubric-grounded guidance for future attempts. The resulting RubricEM-8B achieves strong performance across four long-form research benchmarks, outperforming comparable open models and approaching proprietary deep-research systems. Beyond final performance, we perform thorough analyses to understand the key ingredients of RubricEM.

2605.10898 2026-05-12 cs.HC

How Creatives Approach GenAI Image Generation: Tensions Between Structured Guidance, Self-Experimentation, and Creative Autonomy

Haidan Liu, Isabelle Kwan, Taiga Okuma, Jeffrey Loverock, Nicholas Vincent, Parmit K Chilana

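AI summary This study examines how creatives learn generative AI image tools and what support they need, finding that they often explore the tools through self-experimentation or tutorials but are frequently troubled by confusing terminology. It also reveals a tension between accepting structured guidance and preserving creative autonomy: although guidance helps in understanding AI, many creatives still prefer self-experimentation to protect their creative freedom. These findings offer an important reference for designing AI assistance tools that better match creatives' needs.

Comments Accepted at ACM Creativity & Cognition 2026

Details
English abstract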

As generative AI tools increasingly influence creative practice, they raise longstanding HCI questions about how creatives learn complex software and how they can be better supported. We conducted an interview study with artists and hobbyists (n=8) and a follow-up survey (n=159) to understand how this population approaches and seeks guidance for GenAI image tools. We found that creatives commonly use either self-experimentation or tutorials to explore GenAI tools, yet many struggle with confusing AI terminology. To gain further insight into creatives' learning experiences, we developed a research probe to elicit creatives' perceptions of structured guidance. Our user study with 17 creatives revealed that, even when creatives described the guidance as helpful for understanding AI, many still preferred self-experimentation, feeling that guidance could limit their creativity. Our findings highlight a central tension in supporting AI literacy for creatives: balancing guidance and promoting literacy while preserving creative freedom.

2605.10895 2026-05-12 cs.DS cs.CG

FPT Approximation Schemes for Min-Sum Radii and Min-Sum Diameters Clustering

Fabrizio Grandoni, Anupam Gupta, Jatin Yadav

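AI summary This paper studies the classical Min-Sum Radii (MSR) and Min-Sum Diameters (MSD) clustering problems, which partition a point set into clusters so as to minimize the sum of the cluster radii or diameters. The authors present parameterized approximation algorithms for both problems: fixed-parameter tractable (FPT) approximation schemes for the parameter $k$ that achieve approximate solutions of arbitrary precision in favorable running time. The result resolves a long-standing open problem on FPT approximation algorithms for these two problems and is an important advance in the area.

Details
English abstract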

In the classical Min-Sum Radii problem (MSR) we are given a set $X$ of $n$ points in a metric space and a positive integer $k\in [n]$. Our goal is to partition $X$ into $k$ subsets (the clusters) so as to minimize the sum of the radii of these clusters. The Min-Sum Diameters problem (MSD) is defined analogously, where instead of the radii of the clusters we consider their diameters. For both problems we present FPT approximation schemes for the natural parameter $k$. Specifically, given $ε>0$, we show how to compute $(1+ε)$-approximations for both MSD and MSR in time $(1/ε)^kn^{O(1)}$ and $(1/ε)^{O(k/ε\log 1/ε)}n^{poly(1/ε)}$ respectively. The previous best FPT approximation algorithms for these problems have approximation factors $4+ε$ and $2+ε$, respectively, and finding FPT approximation schemes for these problems had been outstanding open problems.
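
To make the two objectives concrete, the following sketch evaluates them by brute force on a tiny instance. It is not the paper's algorithm, which is an approximation scheme; it is an exponential-time exhaustive search meant only to illustrate the definitions. It assumes cluster centers are chosen among the cluster's own points (definitions vary on this), allows fewer than `k` non-empty clusters, and uses hypothetical names `radius`, `diameter`, and `best_partition`.

```python
from itertools import product
from typing import Callable, List, Sequence, Tuple

Metric = Callable[[float, float], float]


def radius(cluster: Sequence[float], dist: Metric) -> float:
    """Smallest r such that some center in the cluster is within r of every member."""
    return min(max(dist(c, p) for p in cluster) for c in cluster)


def diameter(cluster: Sequence[float], dist: Metric) -> float:
    """Largest pairwise distance inside the cluster."""
    return max(dist(p, q) for p in cluster for q in cluster)


def best_partition(points: Sequence[float], k: int, cost, dist: Metric) -> Tuple[float, List[List[float]]]:
    """Exhaustive search over all assignments of points to k labels (toy sizes only)."""
    best = None
    for labels in product(range(k), repeat=len(points)):
        clusters = [[p for p, lab in zip(points, labels) if lab == j] for j in range(k)]
        clusters = [c for c in clusters if c]  # drop empty clusters
        total = sum(cost(c, dist) for c in clusters)
        if best is None or total < best[0]:
            best = (total, clusters)
    return best


def line_metric(a: float, b: float) -> float:
    return abs(a - b)


points = [0.0, 1.0, 2.0, 10.0, 11.0]  # points on a line, invented for illustration
msr_cost, msr_clusters = best_partition(points, 2, radius, line_metric)
msd_cost, msd_clusters = best_partition(points, 2, diameter, line_metric)
print(msr_cost, msd_cost)
```

On this instance both objectives pick the split {0, 1, 2} and {10, 11}: the radii sum to 2 and the diameters sum to 3.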

2605.10894 2026-05-12 cs.CV

Counterfactual Stress Testing for Image Classification Models

Moritz Stammel, Fabio De Sousa Ribeiro, Raghav Mehta, Mélanie Roschewitz, Ben Glocker

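AI summary This paper studies the failure of medical image classification models in new clinical environments due to distribution shift, and proposes a counterfactual stress testing framework based on causal generative models. By intervening on attributes such as scanner type and patient sex, it generates clinically realistic "what if" images, enabling targeted distribution-shift evaluation while keeping anatomical structure unchanged. Experiments show that, compared with traditional perturbation methods, the approach more accurately reflects changes in model performance in real out-of-distribution settings, providing a more reliable basis for assessing the robustness of medical AI systems.

Details
English abstract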

Deep learning models in medical imaging often fail when deployed in new clinical environments due to distribution shifts in demographics, scanner hardware, or acquisition protocols. A central challenge is underspecification, where models with similar validation performance exhibit divergent real-world failure modes. Although stress testing has emerged as a tool to assess this, current methods typically rely on simple, uninformed perturbations (e.g., brightness or contrast changes), which fail to capture clinically realistic variation and can overestimate robustness. In this work, we introduce a counterfactual stress testing framework based on causal generative models that create realistic "what if" images by intervening on attributes such as scanner type and patient sex while preserving anatomical identity, enabling controlled and semantically meaningful evaluation under targeted distribution shifts. Across two imaging modalities (chest X-ray and mammography), three model architectures, and multiple shift scenarios, we show that counterfactual stress tests provide a substantially more accurate proxy for real out-of-distribution performance than classical perturbations, capturing the direction and relative magnitude of performance changes as well as model ranking. These results suggest that causal generative models can serve as practical simulators for robustness assessment, offering a more reliable basis for evaluating medical AI systems prior to deployment.

2605.10890 2026-05-12 cs.SE

CppPerf: An Automated Pipeline and Dataset for Performance-Improving C++ Commits

Tommy Ho, Khashayar Etemadi, Zhendong Su

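AI summary To meet the need for realistic, executable benchmarks in automated repair of performance bugs, the study presents CppPerf-Mine, a configurable automated pipeline for mining execution-time-improving patches from open-source C++ repositories on GitHub. The pipeline combines structural commit filtering, an LLM-based commit classifier, and a containerized build and test stage, producing reproducible Docker images. Using this pipeline, the authors build the CppPerf-DB benchmark of 347 manually verified patches, 39% of which are multi-file patches, which can be used to evaluate repository-level repair tools. Preliminary experiments show that the existing tool OpenHands correctly fixes only 13.5% of the patches, highlighting that real-world C++ performance repair remains challenging.

Details
English abstract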

Recent progress in automated repair of performance bugs demands realistic, executable benchmarks. However, existing C++ performance benchmarks are largely built from competitive programming submissions, and recent real-world benchmarks predominantly target Python and .NET. To fill this gap, we present CppPerf-Mine, a configurable pipeline that mines execution-time-improving patches from open-source C++ repositories on GitHub by combining structural commit filtering, an LLM-based commit classifier, and a containerized build & test stage that produces fully reproducible Docker images for each patch. Using CppPerf-Mine, we build CppPerf-DB, a benchmark comprising 347 manually verified patches from 42 mature C++ repositories, 39% of which are multi-file, enabling the evaluation of repository-level repair tools. In our preliminary study, OpenHands correctly fixes only 13.5% of the patches in CppPerf-DB, confirming that real-world C++ performance repair remains an open challenge. CppPerf-Mine and CppPerf-DB are open-source and publicly available at: https://doi.org/10.5281/zenodo.20097425. In addition, a demonstration video is available at: https://www.youtube.com/watch?v=nixlupIgSdM.

2605.10889 2026-05-12 cs.LG cs.AI

Unmasking On-Policy Distillation: Where It Helps, Where It Hurts, and Why

Mohammadreza Armandpour, Fatih Ilhan, David Harrison, Ajay Jaiswal, Duc N. M Hoang, Fartash Faghri, Yizhe Zhang, Minsik Cho, Mehrdad Farajtabar

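AI summary This paper studies the mechanism of on-policy distillation in training reasoning models, examining when the distillation signal is beneficial and when it is harmful. The authors propose a training-free diagnostic framework that analyzes distillation at per-token, per-question, and per-teacher granularity, and use a gradient alignment score to measure how close the actual distillation gradient is to the ideal gradient. Experiments show that the distillation signal is more effective when the student performs poorly and tends to introduce noise on correct reasoning paths, and that the best distillation configuration depends on the task and model capacity, with no universally optimal choice.

Details
English abstract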

On-policy distillation offers dense, per-token supervision for training reasoning models; however, it remains unclear under which conditions this signal is beneficial and under which it is detrimental. Which teacher model should be used, and in the case of self-distillation, which specific context should serve as the supervisory signal? Does the optimal choice vary from one token to the next? At present, addressing these questions typically requires costly training runs whose aggregate performance metrics obscure the dynamics at the level of individual tokens. We introduce a training-free diagnostic framework that operates at the highest resolution: per token, per question, and per teacher. We derive an ideal per-node gradient defined as the parameter update that maximally increases the student's probability of success. We then develop a scalable targeted-rollout algorithm to estimate this gradient efficiently, even for long chains of intermediate thoughts. The gradient alignment score, defined as the cosine similarity between this ideal gradient and any given distillation gradient, quantifies the extent to which a particular configuration approximates the ideal signal. Across a range of self-distillation settings and external teacher models, we observe that distillation guidance exhibits substantially higher alignment with the ideal on incorrect rollouts than on correct ones, where the student already performs well and the teacher's signal tends to become noisy. Furthermore, we find that the optimal distillation context depends jointly on the student model's capacity and the target task, and that no single universally effective configuration emerges. These findings motivate the use of per-task, per-token diagnostic analyses for distillation.
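
The gradient alignment score is defined in the abstract as a cosine similarity, which can be written down directly. The sketch below covers only that final scoring step on flattened gradient vectors; it does not attempt the targeted-rollout estimation of the ideal gradient. The function name `gradient_alignment` and the convention of returning 0.0 for a zero gradient are assumptions for this example.

```python
import math
from typing import Sequence


def gradient_alignment(ideal: Sequence[float], distill: Sequence[float]) -> float:
    """Cosine similarity between two flattened gradient vectors."""
    dot = sum(a * b for a, b in zip(ideal, distill))
    norm = math.sqrt(sum(a * a for a in ideal)) * math.sqrt(sum(b * b for b in distill))
    if norm == 0.0:
        return 0.0  # assumed convention when either gradient vanishes
    return dot / norm


ideal = [3.0, 4.0]  # toy gradient, invented for illustration
aligned = gradient_alignment(ideal, [6.0, 8.0])      # same direction
orthogonal = gradient_alignment(ideal, [-4.0, 3.0])  # perpendicular
opposed = gradient_alignment(ideal, [-3.0, -4.0])    # opposite direction
print(aligned, orthogonal, opposed)
```

A score near 1 means the distillation update points the same way as the ideal update, a score near 0 means it is largely unrelated, and a negative score means it pushes the student away from the ideal direction.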

2605.10887 2026-05-12 cs.CV

Count Anything at Any Granularity

Chang Liu, Haoning Wu, Weidi Xie

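AI summary This paper studies fine-grained counting in open-world object counting, noting that current methods count unreliably because the counting granularity is left unspecified. The authors propose a multi-grained counting framework that explicitly specifies the counting target through visual exemplars and fine-grained text descriptions, and build the first automated data-scaling pipeline, generating KubriCount, the largest fine-grained counting dataset to date. Using this dataset, they train the HieraCount model, which significantly improves fine-grained counting accuracy and generalization to real-world scenes.

Comments Project page: https://verg-avesta.github.io/KubriCount/

Details
English abstract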

Open-world object counting remains brittle: despite rapid advances in vision-language models (VLMs), reliably counting the objects a user intends is far from solved. We argue that a central reason is that counting granularity is left implicit; users may refer to a specific identity, an attribute, an instance type, a category, or an abstract concept, yet most methods treat "what to count" as a single, category-level matching problem. In this work, we redefine open-world counting as multi-grained counting, where visual exemplars specify target appearance and fine-grained text, with optional negative prompts, specifies the intended semantic granularity across five explicit levels. Making granularity explicit, however, exposes a critical data bottleneck: existing counting datasets lack the multi-category scenes, controlled distractors, and instance-level annotations needed to verify fine-grained prompt semantics. To address this, we propose the first fully automatic data-scaling pipeline that integrates controllable 3D synthesis with consistent image editing and VLM-based filtering, and use it to construct KubriCount, the largest and most comprehensively annotated counting dataset to date, supporting both training and multi-grained evaluation. Systematic benchmarking reveals that both multimodal large language models and specialist counting models exhibit severe prompt-following failures under fine-grained distinctions. Motivated by these findings, we train HieraCount, a multi-grained counting model that jointly leverages text and visual exemplars as complementary target specifications. HieraCount substantially improves multi-grained counting accuracy and generalizes robustly to challenging real-world scenarios. The project page is available here: https://verg-avesta.github.io/KubriCount/.

2605.10885 2026-05-12 cs.CV

Geometry-aware Prototype Learning for Cross-domain Few-shot Medical Image Segmentation

Feifan Song, Yuntian Bo, Haofeng Zhang

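AI summary Cross-domain few-shot medical image segmentation (CD-FSMIS) aims to make a model adapt to both novel anatomical categories and unseen imaging domains from only a few annotated samples. Existing prototype-based methods tend to entangle anatomical structure with domain-specific appearance variation, making stable matching under domain shift difficult. This paper proposes the GeoProto framework, which introduces a geometry-aware prototype enrichment mechanism that exploits geometric priors of human anatomy to improve the robustness and generalization of prototype matching, and achieves state-of-the-art performance on multiple cross-modality, cross-sequence, and cross-context datasets.

Details
English abstract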

Cross-domain few-shot medical image segmentation (CD-FSMIS) requires a model to generalise simultaneously to novel anatomical categories and unseen imaging domains from only a handful of annotated examples. Existing prototypical approaches inevitably entangle anatomical structure with domain-specific appearance variations, and thus lack a stable reference for reliable matching under domain shift. We observe that the geometric structure of human anatomy constitutes a reliable, domain-transferable prior that has been overlooked. Building on this insight, we propose GeoProto, a geometry-aware CD-FSMIS framework that enriches prototypical matching with explicit structural priors. The core component, Geometry-Aware Prototype Enrichment (GAPE), augments each local appearance prototype with a learned geometric offset encoding its ordinal position within the organ's interior topology. This offset is derived from an auxiliary Ordinal Shape Branch (OSB) trained under an ordinally consistent objective that enforces monotonic variation of geometric embeddings across interior strata, requiring no annotation beyond standard segmentation masks. Extensive experiments across seven datasets spanning three evaluation settings (cross-modality, cross-sequence, and cross-context) demonstrate that GeoProto achieves state-of-the-art performance.

2605.10880 2026-05-12 cs.RO

Safe Aerial 3D Path Planning for Autonomous UAVs using Magnetic Potential Fields

Haechan Mark Bong, Giovanni Beltrame

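AI summary This paper studies safe 3D path planning for autonomous UAVs in urban environments. It proposes 3DMaxConvNet, a magnetic potential field method based on properties of Maxwell's equations, which uses a convolutional autoencoder to predict obstacle-aware potential fields from LiDAR-derived 3D voxel grids and thereby generates paths free of local minima. Experiments show that the method achieves a 100% path planning success rate in two different urban environments and runs faster than traditional algorithms such as A* and RRT* with comparable path quality.

Details
English abstract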

Safe autonomous Uncrewed Aerial Vehicle (UAV) navigation in urban environments requires real-time path planning that avoids obstacles. MaxConvNet is a potential-field planner that leverages properties of Maxwell's equations to generate a path to the goal without local minima. We extend the 2D MaxConvNet magnetic field planner to 3D, using a convolutional autoencoder to predict obstacle-aware potential fields from LiDAR-derived $101^3$ voxel grids. Evaluation across 100 randomized closed-loop trials in two distinct Cosys-AirSim urban environments, a dense night-time cityscape and a suburban district, shows a 100% path planning success rate on both maps without retraining. In offline path planning, 3DMaxConvNet produces path lengths comparable to A* on unseen maps while reducing runtime from 0.155--0.17s to 0.087--0.089s, or about 1.7--1.95 times faster than A*. Against RRT*(3k), 3DMaxConvNet achieves similar path quality while reducing planning runtime from 17.2--17.5s to about 0.09s, which is roughly 193--201 times faster than RRT*(3k).
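
The quoted speedup ranges can be checked with simple arithmetic on the runtimes given in the abstract. The abstract does not say which runtimes are paired, so the sketch below assumes the lower end of each range divides the fastest baseline by the slowest planner run and the upper end does the reverse; the function name `speedup` is introduced only for this example.

```python
def speedup(baseline_s: float, planner_s: float) -> float:
    """How many times faster the planner is than the baseline."""
    return baseline_s / planner_s


# Runtime ranges in seconds, as quoted in the abstract.
a_star = (0.155, 0.17)
rrt_star = (17.2, 17.5)
planner = (0.087, 0.089)

# Assumed pairing: (fastest baseline / slowest planner) and (slowest baseline / fastest planner).
a_star_range = (round(speedup(a_star[0], planner[1]), 2), round(speedup(a_star[1], planner[0]), 2))
rrt_range = (round(speedup(rrt_star[0], planner[1])), round(speedup(rrt_star[1], planner[0])))
print(a_star_range, rrt_range)
```

Under this pairing the ratios come out at about 1.74 to 1.95 for A* and 193 to 201 for RRT*(3k), consistent with the ranges stated in the abstract.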