arXivDaily: arXiv daily academic digest, updated Monday to Friday
2605.12501 2026-05-13 cs.CV

Covering Human Action Space for Computer Use: Data Synthesis and Benchmark

Miaosen Zhang, Xiaohan Zhao, Zhihong Tan, Zhou Huoshen, Yijia Fan, Yifan Yang, Kai Qiu, Bei Liu, Justin Wagle, Chenzhong Yin, Mingxi Cheng, Ji Li, Qi Dai, Chong Luo, Xu Yang, Xin Geng, Baining Guo

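AI summary: This study addresses the poor reliability of computer-use agents (CUAs) on complex, low-frequency interactions. It proposes a new benchmark, CUActSpot, covering five interaction modalities (GUI, text, table, canvas, and natural image) and a variety of action types. To address the scarcity of data for complex interactions, the authors design a renderer-based data-synthesis method that automatically generates scenes along with matching instructions and action traces. Experiments show that the model trained on this data outperforms open-source models with fewer than 32B parameters.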

Abstract

Computer-use agents (CUAs) automate on-screen work, as illustrated by GPT-5.4 and Claude. Yet their reliability on complex, low-frequency interactions is still poor, limiting user trust. Our analysis of failure cases from advanced models suggests a long-tail pattern in GUI operations, where a relatively small fraction of complex and diverse interactions accounts for a disproportionate share of task failures. We hypothesize that this issue largely stems from the scarcity of data for complex interactions. To address this problem, we propose a new benchmark CUActSpot for evaluating models' capabilities on complex interactions across five modalities: GUI, text, table, canvas, and natural image, as well as a variety of actions (click, drag, draw, etc.), covering a broader range of interaction types than prior click-centric benchmarks that focus mainly on GUI widgets. We also design a renderer-based data-synthesis pipeline: scenes are automatically generated for each modality, screenshots and element coordinates are recorded, and an LLM produces matching instructions and action traces. After training on this corpus, our Phi-Ground-Any-4B outperforms open-source models with fewer than 32B parameters. We will release our benchmark, data, code, and models at https://github.com/microsoft/Phi-Ground.git

2605.12500 2026-05-13 cs.CV

SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture

Haiwen Diao, Penghao Wu, Hanming Deng, Jiahao Wang, Shihao Bai, Silei Wu, Weichen Fan, Wenjie Ye, Wenwen Tong, Xiangyu Fan, Yan Li, Yubo Wang, Zhijie Cao, Zhiqian Lin, Zhitao Yang, Zhongang Cai, Yuwei Niu, Yue Zhu, Bo Liu, Chengguang Lv, Haojia Yu, Haozhe Xie, Hongli Wang, Jianan Fan, Jiaqi Li, Jiefan Lu, Jingcheng Ni, Junxiang Xu, Kaihuan Liang, Lianqiang Shi, Linjun Dai, Linyan Wang, Oscar Qian, Peng Gao, Pengfei Liu, Qingping Sun, Rui Shen, Ruisi Wang, Shengnan Ma, Shuang Yang, Siyi Xie, Siying Li, Tianbo Zhong, Xiangli Kong, Xuanke Shi, Yang Gao, Yongqiang Yao, Yves Wang, Zhengqi Bai, Zhengyu Lin, Zixin Yin, Wenxiu Sun, Ruihao Gong, Quan Wang, Lewei Lu, Lei Yang, Ziwei Liu, Dahua Lin

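AI summary: This paper presents SenseNova-U1, a unified multimodal model that targets the separation between understanding and generation in current vision-language models. Built on the NEO-unify architecture, it treats understanding and generation as synergistic views of a single underlying process, aiming at more native multimodal intelligence. The authors report strong performance across a range of tasks and describe the model design and training strategies in detail to support community research.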

Comments
Project page: https://github.com/OpenSenseNova/SenseNova-U1
Abstract

Recent large vision-language models (VLMs) remain fundamentally constrained by a persistent dichotomy: understanding and generation are treated as distinct problems, leading to fragmented architectures, cascaded pipelines, and misaligned representation spaces. We argue that this divide is not merely an engineering artifact, but a structural limitation that hinders the emergence of native multimodal intelligence. Hence, we introduce SenseNova-U1, a native unified multimodal paradigm built upon NEO-unify, in which understanding and generation evolve as synergistic views of a single underlying process. We launch two native unified variants, SenseNova-U1-8B-MoT and SenseNova-U1-A3B-MoT, built on dense (8B) and mixture-of-experts (30B-A3B) understanding baselines, respectively. Designed from first principles, they rival top-tier understanding-only VLMs across text understanding, vision-language perception, knowledge reasoning, agentic decision-making, and spatial intelligence. Meanwhile, they deliver strong semantic consistency and visual fidelity, excelling in conventional or knowledge-intensive any-to-image (X2I) synthesis, complex text-rich infographic generation, and interleaved vision-language generation, with or without think patterns. Beyond performance, we show detailed model design, data preprocessing, pre-/post-training, and inference strategies to support community research. Last but not least, preliminary evidence demonstrates that our models extend beyond perception and generation, performing strongly in vision-language-action (VLA) and world model (WM) scenarios. This points toward a broader roadmap where models do not translate between modalities, but think and act across them in a native manner. Multimodal AI is no longer about connecting separate systems, but about building a unified one and trusting the necessary capabilities to emerge from within.

2605.12498 2026-05-13 cs.CV cs.GR

EgoForce: Forearm-Guided Camera-Space 3D Hand Pose from a Monocular Egocentric Camera

Christen Millerdurai, Shaoxiang Wang, Yaxu Xie, Vladislav Golyanik, Didier Stricker, Alain Pagani

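AI summary: This paper presents EgoForce, a monocular 3D hand reconstruction framework that recovers the absolute 3D pose and position of the hand from the user's (camera-space) viewpoint, for settings such as AR/VR and telepresence where sensing must remain compact and unobtrusive. By combining a differentiable forearm representation, a unified arm-hand transformer, and a ray-space closed-form solver, the method mitigates the depth-scale ambiguity of monocular approaches and works across several wide-FOV camera models. Experiments show state-of-the-art accuracy on three egocentric benchmarks, with camera-space MPJPE reduced by up to 28% on HOT3D.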

Comments
23 pages, 19 figures and 10 tables; project page: https://dfki-av.github.io/EgoForce (source code, data and demo available); SIGGRAPH 2026 Conference
Abstract

Reconstructing the absolute 3D pose and shape of the hands from the user's viewpoint using a single head-mounted camera is crucial for practical egocentric interaction in AR/VR, telepresence, and hand-centric manipulation tasks, where sensing must remain compact and unobtrusive. While monocular RGB methods have made progress, they remain constrained by depth-scale ambiguity and struggle to generalize across the diverse optical configurations of head-mounted devices. As a result, models typically require extensive training on device-specific datasets, which are costly and laborious to acquire. This paper addresses these challenges by introducing EgoForce, a monocular 3D hand reconstruction framework that recovers robust, absolute 3D hand pose and its position from the user's (camera-space) viewpoint. EgoForce operates across fisheye, perspective, and distorted wide-FOV camera models using a single unified network. Our approach combines a differentiable forearm representation that stabilizes hand pose, a unified arm-hand transformer that predicts both hand and forearm geometry from a single egocentric view, mitigating depth-scale ambiguity, and a ray space closed-form solver that enables absolute 3D pose recovery across diverse head-mounted camera models. Experiments on three egocentric benchmarks show that EgoForce achieves state-of-the-art 3D accuracy, reducing camera-space MPJPE by up to 28% on the HOT3D dataset compared to prior methods and maintaining consistent performance across camera configurations. For more details, visit the project page at https://dfki-av.github.io/EgoForce.
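
The headline metric above, MPJPE (mean per-joint position error), is the average Euclidean distance between predicted and ground-truth joints. The sketch below shows that computation under the assumption that "camera-space" means both joint sets are expressed in the camera frame with no root alignment. The function name and the toy coordinates are illustrative and are not taken from the EgoForce code.

```python
import math


def mpjpe(pred, gt):
    """Mean per-joint position error: average Euclidean distance over joints."""
    if len(pred) != len(gt) or not pred:
        raise ValueError("pred and gt must be non-empty and the same length")
    total = 0.0
    for p, g in zip(pred, gt):
        total += math.dist(p, g)  # distance between two 3D points
    return total / len(pred)


# Toy example with three joints, coordinates in millimetres (made-up values).
gt = [(0.0, 0.0, 500.0), (10.0, 0.0, 500.0), (20.0, 0.0, 500.0)]
pred = [(0.0, 0.0, 503.0), (10.0, 4.0, 500.0), (20.0, 0.0, 495.0)]
error = mpjpe(pred, gt)  # per-joint errors are 3, 4 and 5 mm, so the mean is 4.0
```

Under this definition, a 28% reduction means the new error is 0.72 times the error of the method it is compared against.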

2605.12497 2026-05-13 cs.CV

From Web to Pixels: Bringing Agentic Search into Visual Perception

Bokang Yang, Xinyi Sun, Kaituo Feng, Xingping Dong, Dongming Wu, Xiangyu Yue

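AI summary: This study examines open-world visual perception tasks in which the target must first be resolved using external information such as facts, events, long-tail entities, and multi-hop relations. The authors formalize this challenge as Perception Deep Research and build the WebEye benchmark, which includes verifiable evidence, knowledge-intensive queries, and precisely annotated object instances. They also propose Pixel-Searcher, an agentic search workflow that goes from external evidence to pixel-level localization and achieves the strongest open-source performance across the benchmark's three task views.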

Comments
Project page: https://pixel-searcher.github.io/
Abstract

Visual perception connects high-level semantic understanding to pixel-level perception, but most existing settings assume that the decisive evidence for identifying a target is already in the image or frozen model knowledge. We study a more practical yet harder open-world case where a visible object must first be resolved from external facts, recent events, long-tail entities, or multi-hop relations before it can be localized. We formalize this challenge as Perception Deep Research and introduce WebEye, an object-anchored benchmark with verifiable evidence, knowledge-intensive queries, precise box/mask annotations, and three task views: Search-based Grounding, Search-based Segmentation, and Search-based VQA. WebEye contains 120 images, 473 annotated object instances, 645 unique QA pairs, and 1,927 task samples. We further propose Pixel-Searcher, an agentic search-to-pixel workflow that resolves hidden target identities and binds them to boxes, masks, or grounded answers. Experiments show that Pixel-Searcher achieves the strongest open-source performance across all three task views, while failures mainly arise from evidence acquisition, identity resolution, and visual instance binding.

2605.12496 2026-05-13 cs.CV

CausalCine: Real-Time Autoregressive Generation for Multi-Shot Video Narratives

Yihao Meng, Zichen Liu, Hao Ouyang, Qiuyu Wang, Ka Leong Cheng, Yue Yu, Hanlin Wang, Haobo Li, Jiapeng Zhu, Yanhong Zeng, Xing Zhu, Yujun Shen, Qifeng Chen, Huamin Qu

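AI summary: CausalCine is a real-time autoregressive generation framework for multi-shot video narratives that targets the motion stagnation and semantic drift existing models show in long rollouts. It trains a causal base model on native multi-shot sequences and adds Content-Aware Memory Routing (CAMR) to keep generation coherent across shots, while accepting dynamic prompts and reusing context. Experiments show that CausalCine outperforms autoregressive baselines, approaches the capability of bidirectional models, and supports real-time interactive generation.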

Comments
Project page: https://yihao-meng.github.io/CausalCine/
Abstract

Autoregressive video generation aims at real-time, open-ended synthesis. Yet, cinematic storytelling is not merely the endless extension of a single scene; it requires progressing through evolving events, viewpoint shifts, and discrete shot boundaries. Existing autoregressive models often struggle in this setting. Trained primarily for short-horizon continuation, they treat long sequences as extended single shots, inevitably suffering from motion stagnation and semantic drift during long rollouts. To bridge this gap, we introduce CausalCine, an interactive autoregressive framework that transforms multi-shot video generation into an online directing process. CausalCine generates causally across shot changes, accepts dynamic prompts on the fly, and reuses context without regenerating previous shots. To achieve this, we first train a causal base model on native multi-shot sequences to learn complex shot transitions prior to acceleration. We then propose Content-Aware Memory Routing (CAMR), which dynamically retrieves historical KV entries according to attention-based relevance scores rather than temporal proximity, preserving cross-shot coherence under bounded active memory. Finally, we distill the causal base model into a few-step generator for real-time interactive generation. Extensive experiments demonstrate that CausalCine significantly outperforms autoregressive baselines and approaches the capability of bidirectional models while unlocking the streaming interactivity of causal generation. Demo available at https://yihao-meng.github.io/CausalCine/
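
The abstract describes CAMR as retrieving historical KV entries by attention-based relevance rather than temporal proximity, under a bounded active memory. The toy sketch below contrasts the two selection rules. It assumes relevance is a plain dot product between the current query and each cached key and that the budget is a fixed entry count; the paper's actual scoring and routing granularity are not given in the abstract, and both function names are hypothetical.

```python
def select_by_relevance(query, cached_keys, budget):
    """Keep the `budget` cached entries whose keys score highest against the query."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in cached_keys]
    ranked = sorted(range(len(cached_keys)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])  # return indices in temporal order


def select_by_recency(cached_keys, budget):
    """Baseline: keep only the most recent `budget` entries."""
    n = len(cached_keys)
    return list(range(max(0, n - budget), n))


# Five cached keys; entries 0 and 2 come from an earlier shot that matches the query.
keys = [(1.0, 0.0), (0.0, 1.0), (0.9, 0.1), (0.0, 0.8), (0.1, 0.9)]
query = (1.0, 0.0)
relevant = select_by_relevance(query, keys, budget=2)  # picks entries 0 and 2
recent = select_by_recency(keys, budget=2)             # picks entries 3 and 4
```

With the same budget, the relevance rule can reach back to an earlier shot that the recency window would have dropped, which is the cross-shot coherence property the abstract emphasizes.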

2605.12495 2026-05-13 cs.CV cs.AI cs.LG

AlphaGRPO: Unlocking Self-Reflective Multimodal Generation in UMMs via Decompositional Verifiable Reward

Runhui Huang, Jie Wu, Rui Yang, Zhe Liu, Hengshuang Zhao

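AI summary: This paper proposes AlphaGRPO, a framework that applies Group Relative Policy Optimization (GRPO) to unified multimodal models (UMMs) to improve multimodal generation without an additional cold-start stage. Its Decompositional Verifiable Reward (DVReward) uses an LLM to break complex user requests into verifiable semantic and quality questions, giving stable and reliable feedback that supports reasoning text-to-image generation and autonomous self-reflective refinement. Experiments show clear gains on several multimodal generation benchmarks, as well as on editing tasks without any training on editing.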

Comments
ICML 2026
Abstract

In this paper, we propose AlphaGRPO, a novel framework that applies Group Relative Policy Optimization (GRPO) to AR-Diffusion Unified Multimodal Models (UMMs) to enhance multimodal generation capabilities without an additional cold-start stage. Our approach unlocks the model's intrinsic potential to perform advanced reasoning tasks: Reasoning Text-to-Image Generation, where the model actively infers implicit user intents, and Self-Reflective Refinement, where it autonomously diagnoses and corrects misalignments in generated outputs. To address the challenge of providing stable supervision for real-world multimodal generation, we introduce the Decompositional Verifiable Reward (DVReward). Unlike holistic scalar rewards, DVReward utilizes an LLM to decompose complex user requests into atomic, verifiable semantic and quality questions, which are then evaluated by a general MLLM to provide reliable and interpretable feedback. Extensive experiments demonstrate that AlphaGRPO yields robust improvements across multimodal generation benchmarks, including GenEval, TIIF-Bench, DPG-Bench and WISE, while also achieving significant gains in editing tasks on GEdit without training on editing tasks. These results validate that our self-reflective reinforcement approach effectively leverages inherent understanding to guide high-fidelity generation. Project page: https://huangrh99.github.io/AlphaGRPO/
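
To make the reward pipeline concrete, the sketch below combines a decomposed reward with the group-relative advantage used in GRPO-style training, where each sample's reward is normalized by the mean and standard deviation of its group. Two parts are assumptions: the reward is the fraction of atomic yes/no questions that pass (the abstract does not say how DVReward aggregates answers), and the population standard deviation is used. The boolean answers stand in for MLLM judgments.

```python
import statistics


def decomposed_reward(answers):
    """Fraction of atomic yes/no checks that passed (assumed aggregation rule)."""
    return sum(1 for a in answers if a) / len(answers)


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward by the mean and standard deviation of its group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption
    return [(r - mean) / (std + eps) for r in rewards]


# One prompt, three sampled generations, four atomic questions each (made-up answers).
group_answers = [
    [True, True, True, True],
    [True, True, False, False],
    [False, False, False, False],
]
rewards = [decomposed_reward(a) for a in group_answers]  # [1.0, 0.5, 0.0]
advantages = group_relative_advantages(rewards)
```

Samples that beat the group mean get a positive advantage and the rest a negative one, so no separate value model is needed.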

2605.12494 2026-05-13 cs.CV

Revisiting Photometric Ambiguity for Accurate Gaussian-Splatting Surface Reconstruction

Jiahe Li, Jiawei Zhang, Xiao Bai, Jin Zheng, Xiaohan Yu, Lin Gu, Gim Hee Lee

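AI summary: This paper studies the photometric ambiguity that pervades surface reconstruction with differentiable rendering and proposes AmbiSuR, a framework for improving the accuracy of Gaussian Splatting reconstruction under such ambiguity. Revisiting the foundations of the Gaussian Splatting representation, the authors identify built-in photometric ambiguities and propose a photometric disambiguation method together with an ambiguity indication module, which constrain the geometry solution and guide the reconstruction. Experiments show more accurate and more robust surface reconstruction across a range of challenging scenes.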

Comments
Accepted at ICML 2026. Project page: https://fictionarry.github.io/AmbiSuR-Proj/
Abstract

Surface reconstruction with differentiable rendering has achieved impressive performance in recent years, yet pervasive photometric ambiguities remain a major bottleneck for existing approaches. This paper presents AmbiSuR, a framework that explores an intrinsic solution built on Gaussian Splatting for high-performance surface reconstruction that is robust to photometric ambiguity. Starting by revisiting the foundation, our investigation uncovers two built-in primitive-wise ambiguities in the representation, while revealing an intrinsic potential for ambiguity self-indication in Gaussian Splatting. Building on these findings, we first introduce a photometric disambiguation that constrains the ill-posed geometry solution so that a definite surface forms. We then propose an ambiguity indication module that unleashes the self-indication potential to identify underconstrained reconstructions and guide their correction. Extensive experiments demonstrate surface reconstructions superior to existing methods across various challenging scenarios, along with broad compatibility. Project: https://fictionarry.github.io/AmbiSuR-Proj/

2605.12493 2026-05-13 cs.CL

LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues

Di Wu, Zixiang Ji, Asmi Kawatkar, Bryan Kwan, Jia-Chen Gu, Nanyun Peng, Kai-Wei Chang

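AI summary: LongMemEval-V2 is a new benchmark for evaluating the long-term memory of agents, testing whether memory systems help an agent internalize environment-specific experience in web environments and become an experienced colleague. It contains 451 manually curated questions covering five core abilities (static state recall, dynamic state tracking, workflow knowledge, environment gotchas, and premise awareness), paired with large sets of history trajectories. The authors propose two memory methods; the coding-agent-based one achieves the best accuracy but incurs high latency, indicating substantial room for improvement in long-term memory system design.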

Comments
Work in Progress
Abstract

Long-term memory is crucial for agents in specialized web environments, where success depends on recalling interface affordances, state dynamics, workflows, and recurring failure modes. However, existing memory benchmarks for agents mostly focus on user histories, short traces, or downstream task success, leaving open how to directly evaluate whether memory systems effectively internalize environment-specific experience. To address this gap, we introduce LongMemEval-V2 (LME-V2), a benchmark for evaluating whether memory systems can help agents acquire the experience needed to become knowledgeable colleagues in customized environments. LME-V2 contains 451 manually curated questions covering five core memory abilities for web agents: static state recall, dynamic state tracking, workflow knowledge, environment gotchas, and premise awareness. Questions are paired with history trajectories containing up to 500 trajectories and 115M tokens. We use a context gathering formulation: memory systems consume history trajectories and return compact evidence for downstream question answering. We propose a suite of two memory methods: AgentRunbook-R, an efficient RAG-based memory with knowledge pools for raw state observations, events, and strategy notes, and AgentRunbook-C, which stores trajectories as files and invokes a coding agent to gather evidence in an augmented sandbox. Experiments show that AgentRunbook-C achieves the best performance with 72.5% average accuracy, outperforming the strongest RAG baseline (48.5%) and the off-the-shelf coding agent baseline (69.3%). Despite the strong performance gains, coding agent based methods have high latency costs. While AgentRunbook-C advances the accuracy-latency Pareto frontier, substantial room for improvement remains. Together, these results establish LME-V2 as a challenging testbed for developing long-term memory systems for environment experience.

2605.12492 2026-05-13 cs.LG stat.ML

Pion: A Spectrum-Preserving Optimizer via Orthogonal Equivalence Transformation

Kexuan Shi, Hanxuan Li, Zeju Qiu, Yandong Wen, Simon Buchholz, Weiyang Liu

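AI summary: This paper introduces Pion, a spectrum-preserving optimizer for large language model training based on orthogonal equivalence transformation. Unlike additive optimizers such as Adam, Pion updates each weight matrix through left and right orthogonal transformations, so its singular values stay unchanged throughout training. The method modulates the geometry of the weight matrices while keeping their spectral norm fixed, and experiments show stable, competitive performance in LLM pretraining and finetuning.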

Comments
Technical report v1 (30 pages, 19 figures, project page: https://spherelab.ai/pion/)
Abstract

We introduce Pion, a spectrum-preserving optimizer for large language model (LLM) training based on orthogonal equivalence transformation. Unlike additive optimizers such as Adam and Muon, Pion updates each weight matrix through left and right orthogonal transformations, preserving its singular values throughout training. This yields an optimization mechanism that modulates the geometry of weight matrices while keeping their spectral norm fixed. We derive the Pion update rule, systematically examine its design choices, and analyze its convergence behavior along with several key properties. Empirical results show that Pion offers a stable and competitive alternative to standard optimizers for both LLM pretraining and finetuning.
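
The property the abstract relies on is that multiplying a weight matrix on the left and right by orthogonal matrices leaves its singular values unchanged, whereas an additive update generally does not. The sketch below checks this for a 2x2 matrix, using rotations as the orthogonal factors. The rotation angles are arbitrary: how Pion derives its transformations from gradients is not described in the abstract, so this only illustrates the invariance, not the optimizer.

```python
import math


def rotation(theta):
    """2x2 rotation matrix, which is orthogonal."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def transpose(a):
    return [[a[j][i] for j in range(2)] for i in range(2)]


def singular_values(w):
    """Singular values of a 2x2 matrix from its Frobenius norm and determinant.

    Uses s1^2 + s2^2 = ||W||_F^2 and s1 * s2 = |det W|.
    """
    fro2 = sum(x * x for row in w for x in row)
    det = abs(w[0][0] * w[1][1] - w[0][1] * w[1][0])
    disc = math.sqrt(max(fro2 * fro2 - 4.0 * det * det, 0.0))
    return (math.sqrt((fro2 + disc) / 2.0),
            math.sqrt(max((fro2 - disc) / 2.0, 0.0)))


W = [[3.0, 1.0], [0.0, 2.0]]
U, V = rotation(0.3), rotation(-1.1)          # arbitrary orthogonal factors
W_orth = matmul(matmul(U, W), transpose(V))   # W' = U W V^T, spectrum preserved
W_add = [[3.1, 1.0], [0.0, 2.0]]              # additive update, spectrum changes
```

Here `singular_values(W)` and `singular_values(W_orth)` agree up to floating-point error, while `singular_values(W_add)` differs. Since the largest singular value is the spectral norm, this is the sense in which the spectral norm stays fixed.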

2605.12491 2026-05-13 cs.CV cs.LG

Elastic Attention Cores for Scalable Vision Transformers

Alan Z. Song, Yinjie Chen, Mu Nan, Rui Zhang, Jiahang Cao, Weijian Mai, Muquan Yu, Hossein Adeli, Deva Ramanan, Michael J. Tarr, Andrew F. Luo

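AI summary: This paper proposes VECA, a vision transformer architecture that targets the excessive computational cost of conventional ViTs on high-resolution images. VECA uses elastic core-periphery attention: a small set of learned core embeddings serves as a communication interface, so image patches never interact directly and complexity drops from quadratic to linear. While keeping the full set of input tokens, the model can flexibly trade off compute against accuracy, and it performs competitively with the latest vision foundation models across several vision tasks.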

Comments
Project repository here: https://github.com/alansong1322/VECA
Abstract

Vision Transformers (ViTs) achieve strong data-driven scaling by leveraging all-to-all self-attention. However, this flexibility incurs a computational cost that scales quadratically with image resolution, limiting ViTs in high-resolution domains. Underlying this approach is the assumption that pairwise token interactions are necessary for learning rich visual-semantic representations. In this work, we challenge this assumption, demonstrating that effective visual representations can be learned without any direct patch-to-patch interaction. We propose VECA (Visual Elastic Core Attention), a vision transformer architecture that uses efficient linear-time core-periphery structured attention enabled by a small set of learned cores. In VECA, these cores act as a communication interface: patch tokens exchange information exclusively through the core tokens, which are initialized from scratch and propagated across layers. Because the $N$ image patches only directly interact with a resolution invariant set of $C$ learned "core" embeddings, this yields linear complexity $O(N)$ for predetermined $C$, which bypasses quadratic scaling. Compared to prior cross-attention architectures, VECA maintains and iteratively updates the full set of $N$ input tokens, avoiding a small $C$-way bottleneck. Combined with nested training along the core axis, our model can elastically trade off compute and accuracy during inference. Across classification and dense tasks, VECA achieves performance competitive with the latest vision foundation models while reducing computational cost. Our results establish elastic core-periphery attention as a scalable alternative building block for Vision Transformers.
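
The complexity argument can be sketched as two attention passes per layer: the $C$ cores read from the $N$ patches, then the patches read from the updated cores. That costs $2NC$ score evaluations instead of $N^2$. The code below is a bare single-head version with no learned projections, residual connections, or normalization. The order of the two passes and all function names are assumptions, not details taken from the VECA repository.

```python
import math


def attend(queries, keys, values):
    """Single-head softmax attention; evaluates len(queries) * len(keys) scores."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        m = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        out.append([sum(w * v[d] for w, v in zip(weights, values)) / z
                    for d in range(len(values[0]))])
    return out


def core_periphery_layer(patches, cores):
    """Patches exchange information only through the cores."""
    new_cores = attend(cores, patches, patches)          # C x N scores
    new_patches = attend(patches, new_cores, new_cores)  # N x C scores
    return new_patches, new_cores


def score_count(n_patches, n_cores):
    """Attention scores evaluated by one layer: linear in N for fixed C."""
    return 2 * n_patches * n_cores


patches = [[float(i), 1.0] for i in range(6)]  # N = 6 toy patch tokens
cores = [[0.0, 1.0], [1.0, 0.0]]               # C = 2 toy core tokens
new_patches, new_cores = core_periphery_layer(patches, cores)
```

All $N$ patch tokens are kept and updated, which matches the abstract's contrast with cross-attention designs that compress the input into a small bottleneck. Using fewer cores at inference reduces `score_count` proportionally, which is one way to read the "elastic" compute-accuracy trade-off.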

2605.12487 2026-05-13 cs.CL cs.IR cs.LG

Task-Adaptive Embedding Refinement via Test-time LLM Guidance

Ariel Gera, Shir Ashury-Tahan, Gal Bloch, Ohad Eytan, Assaf Toledo

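AI summary: This paper studies how LLM-guided query refinement can extend embedding models to zero-shot search and classification tasks. The query embedding is refined in real time using LLM feedback on a small set of documents, letting the model adapt to the specific task. Experiments show consistent gains across benchmarks, with relative improvements of up to 25%, improving ranking quality and classification separation and widening the range of practical settings where embedding models can be deployed.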

Abstract

We explore the effectiveness of an LLM-guided query refinement paradigm for extending the usability of embedding models to challenging zero-shot search and classification tasks. Our approach refines the embedding representation of a user query using feedback from a generative LLM on a small set of documents, enabling embeddings to adapt in real time to the target task. We conduct extensive experiments with state-of-the-art text embedding models across a diverse set of challenging search and classification benchmarks. Empirical results indicate that LLM-guided query refinement yields consistent gains across all models and datasets, with relative improvements of up to +25% in literature search, intent detection, key-point matching, and nuanced query-instruction following. The refined queries improve ranking quality and induce clearer binary separation across the corpus, enabling the embedding space to better reflect the nuanced, task-specific constraints of each ad-hoc user query. Importantly, this expands the range of practical settings in which embedding models can be effectively deployed, making them a compelling alternative when costly LLM pipelines are not viable at corpus-scale. We release our experimental code for reproducibility, at https://github.com/IBM/task-aware-embedding-refinement.
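
The abstract does not give the update rule, so the sketch below substitutes a classical Rocchio-style relevance-feedback step to illustrate the general idea of moving a query embedding using judgments on a few documents. It is not the authors' method. The boolean labels stand in for LLM judgments, and the weights `alpha`, `beta`, and `gamma` are conventional Rocchio defaults chosen here for illustration.

```python
def centroid(vectors, dim):
    """Mean of a list of vectors, or the zero vector if the list is empty."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def refine_query(query, docs, labels, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio-style update: move toward relevant docs, away from irrelevant ones."""
    dim = len(query)
    pos = centroid([d for d, ok in zip(docs, labels) if ok], dim)
    neg = centroid([d for d, ok in zip(docs, labels) if not ok], dim)
    return [alpha * query[i] + beta * pos[i] - gamma * neg[i] for i in range(dim)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


query = [1.0, 0.0]
docs = [[0.6, 0.8], [1.0, 0.0], [0.0, 1.0]]  # unit-norm toy document embeddings
labels = [True, False, True]                 # stand-in for LLM relevance judgments
refined = refine_query(query, docs, labels)
```

In this toy case the original query ranks the irrelevant `docs[1]` above the relevant `docs[0]`, and the refined query reverses that order. Only the small judged set needs LLM calls, while the rest of the corpus is scored with cheap embedding similarity.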

2605.12481 2026-05-13 cs.AI

ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents

Xuhao Hu, Xi Zhang, Haiyang Xu, Kyle Qiao, Jingyi Yang, Xuanjing Huang, Jing Shao, Ming Yan, Jieping Ye

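AI summary: Computer use agents (CUAs) must switch between low-level GUI actions and high-level tool calls, but this hybrid action space leaves them uncertain about which to use and when, leading to suboptimal execution paths. This paper proposes ToolCUA, an end-to-end agent that learns optimal GUI-Tool path selection through a staged training paradigm. Combining interleaved trajectory synthesis, tool-bootstrapped reinforcement fine-tuning, and online agentic reinforcement learning, it improves both task accuracy and efficiency, reaching 46.85% accuracy on OSWorld-MCP and showing promise for real-world digital agents.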

Abstract

Computer Use Agents (CUAs) can act through both atomic GUI actions, such as click and type, and high-level tool calls, such as API-based file operations, but this hybrid action space often leaves them uncertain about when to continue with GUI actions or switch to tools, leading to suboptimal execution paths. This difficulty stems from the scarcity of high-quality interleaved GUI-Tool trajectories, the cost and brittleness of collecting real tool trajectories, and the lack of trajectory-level supervision for GUI-Tool path selection. In this paper, we propose ToolCUA, an end-to-end agent designed to learn optimal GUI-Tool path selection through a staged training paradigm. We first introduce an Interleaved GUI-Tool Trajectory Scaling Pipeline that repurposes abundant static GUI trajectories and synthesizes a grounded tool library, enabling diverse GUI-Tool trajectories without manual engineering or real tool-trajectory collection. We then perform Tool-Bootstrapped GUI RFT, combining warmup SFT with single-turn RL to improve decisions at critical GUI-Tool switching points. Finally, we optimize ToolCUA with Online Agentic RL in a high-fidelity GUI-Tool environment, guided by a Tool-Efficient Path Reward that encourages appropriate tool use and shorter execution paths. Experiments on OSWorld-MCP show that ToolCUA achieves 46.85% accuracy, a relative improvement of approximately 66% over the baseline, establishing a new state of the art among models of comparable scale. It also improves by 3.9% over GUI-only settings, demonstrating effective GUI-Tool orchestration. The results further suggest that training in a hybrid action space is a promising paradigm for real-world digital agents. Open-sourced here: https://x-plug.github.io/ToolCUA/

2605.12480 2026-05-13 cs.CV cs.AI

OmniNFT: Modality-wise Omni Diffusion Reinforcement for Joint Audio-Video Generation

Guohui Zhang, XiaoXiao Ma, Jie Huang, Hang Xu, Hu Yu, Siming Fu, Yuming Li, Zeyue Xue, Lin Song, Haoyang Huang, Nan Duan, Feng Zhao

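AI summary: OmniNFT is a reinforcement learning framework for joint audio-video generation that targets per-modality fidelity, cross-modal alignment, and fine-grained synchronization. Its three components (modality-wise advantage routing, layer-wise gradient surgery, and region-wise loss reweighting) address inconsistent multi-objective advantages, imbalanced multi-modal gradients, and uniform credit assignment. Experiments on JavisBench and VBench show improved audio and video perceptual quality, cross-modal alignment, and audio-video synchronization.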

Comments
Project page: https://zghhui.github.io/OmniNFT/
Abstract

Recent advances in joint audio-video generation have been remarkable, yet real-world applications demand strong per-modality fidelity, cross-modal alignment, and fine-grained synchronization. Reinforcement Learning (RL) offers a promising paradigm, but its extension to multi-objective and multi-modal joint audio-video generation remains unexplored. Notably, our in-depth analysis first reveals that the primary obstacles to applying RL in this setting stem from: (i) multi-objective advantages inconsistency, where the advantages of multimodal outputs are not always consistent within a group; (ii) multi-modal gradients imbalance, where video-branch gradients leak into shallow audio layers responsible for intra-modal generation; (iii) uniform credit assignment, where fine-grained cross-modal alignment regions fail to get efficient exploration. These shortcomings suggest that a vanilla RL fine-tuning strategy with a single global advantage often leads to suboptimal results. To address these challenges, we propose OmniNFT, a novel modality-aware online diffusion RL framework with three key innovations: (1) Modality-wise advantage routing, which routes independent per-reward advantages to their respective modality generation branches. (2) Layer-wise gradient surgery, which selectively detaches video-branch gradients on shallow audio layers while retaining those for cross-modal interaction layers. (3) Region-wise loss reweighting, which modulates policy optimization toward critical regions related to audio-video synchronization and fine-grained alignment. Extensive experiments on JavisBench and VBench with the LTX-2 backbone demonstrate that OmniNFT achieves comprehensive improvements in audio and video perceptual quality, cross-modal alignment, and audio-video synchronization.

2605.12477 2026-05-13 cs.LG cs.CL

MEME: Multi-entity & Evolving Memory Evaluation

Seokwon Jung, Alexander Rubinstein, Arnas Uselis, Sangdoo Yun, Seong Joon Oh

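AI summary: MEME is a benchmark for evaluating LLM-based agents in multi-entity and evolving memory scenarios. It defines six tasks spanning the multi-entity and evolving axes, including three not scored by prior work: Cascade and Absence (dependency reasoning) and Deletion (post-removal state). All evaluated memory systems collapse on dependency reasoning (3% average accuracy on Cascade and 1% on Absence) despite adequate static retrieval. Only a file-based agent paired with a strong LLM partially closes the gap, at roughly 70x the baseline cost, so the configurations that currently work are not practical at scale.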

Abstract

LLM-based agents increasingly operate in persistent environments where they must store, update, and reason over information across many sessions. While prior benchmarks evaluate only single-entity updates, MEME defines six tasks spanning the full space defined by the multi-entity and evolving axes, including three not scored by prior work: Cascade and Absence (dependency reasoning) and Deletion (post-removal state). Evaluating six memory systems spanning three memory paradigms on 100 controlled episodes, we find that all systems collapse on dependency reasoning under the default configuration (Cascade: 3%, Absence: 1% in average accuracy) despite adequate static retrieval performance. Prompt optimization, deeper retrieval, reduced filler noise, and most stronger LLMs fail to close this gap. Only a file-based agent paired with Claude Opus 4.7 as its internal LLM partially closes the gap, but at ~70x the baseline cost, indicating closure currently depends on configurations that are not practical at scale. Code and data are available on the project page: https://seokwonjung-jay.github.io/meme-eval/.

2605.12476 2026-05-13 cs.LG cs.CL

Routers Learn the Geometry of Their Experts: Geometric Coupling in Sparse Mixture-of-Experts

Sagi Ahrac, Noya Hochwald, Mor Geva

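AI summary: This paper studies the geometric coupling between routers and experts in sparse Mixture-of-Experts (SMoE) models, linking routing decisions to expert weight updates. For a given token, the router weights for the selected expert and the weights of that expert receive gradients along the same input direction, differing only in scalar coefficients, and the coupling also shows up empirically. Building on this, the authors present a parameter-free online K-Means router in which each expert keeps a running average of the hidden states routed to it; it achieves the lowest load imbalance with only a modest increase in perplexity.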

Abstract

Sparse Mixture-of-Experts (SMoE) models enable scaling language models efficiently, but training them remains challenging, as routing can collapse onto few experts and auxiliary load-balancing losses can reduce specialization. Motivated by these hurdles, we study how routing decisions in SMoEs are formed mechanistically. First, we reveal a geometric coupling between routers and their corresponding experts. For a given token, the router weights for the selected expert and the expert weights processing it receive gradients along the same input direction, differing only in scalar coefficients. Thus, matched router--expert directions accumulate the same routed token history. This theoretical coupling also appears empirically in routing dynamics. In a $1$B SMoE trained from scratch, higher router scores predict stronger expert neuron activations, showing that routing decisions are mirrored inside the selected expert. Next, we analyze the effects of auxiliary load balancing on the router--expert geometric coupling, showing that such losses break this structure by spreading input-directed gradients across router weights, making distinct router directions nearly three times more similar to each other. Last, we demonstrate the centrality of geometric coupling for effective routing with a parameter-free online K-Means router, in which each expert maintains a running average of the hidden states routed to it and tokens are assigned based on cosine similarity. Compared with auxiliary-loss and loss-free balancing, this router achieves the lowest load imbalance with only a modest perplexity increase, indicating that geometric coupling captures a substantial part of what the router learns. Overall, our results explain how routers form assignment geometry that supports an effective division of labor.
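
The online K-Means router described above can be sketched in a few lines: each expert holds a centroid, a token goes to the expert whose centroid has the highest cosine similarity with its hidden state, and that centroid is then updated as a running mean. The sketch assumes top-1 routing and seed centroids that are overwritten by the first routed token. The class name, the seeding scheme, and the toy vectors are illustrative and not taken from the paper.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class OnlineKMeansRouter:
    """Parameter-free router: centroids are running means, not learned weights."""

    def __init__(self, seed_centroids):
        self.centroids = [list(c) for c in seed_centroids]
        self.counts = [0] * len(seed_centroids)

    def route(self, hidden):
        sims = [cosine(hidden, c) for c in self.centroids]
        expert = max(range(len(sims)), key=lambda i: sims[i])  # top-1 (assumed)
        self.counts[expert] += 1
        n = self.counts[expert]
        c = self.centroids[expert]
        for d in range(len(c)):
            c[d] += (hidden[d] - c[d]) / n  # incremental mean of routed states
        return expert


router = OnlineKMeansRouter([(1.0, 0.0), (0.0, 1.0)])
tokens = [(0.9, 0.1), (0.2, 0.8), (1.0, 0.3)]
assignments = [router.route(t) for t in tokens]  # [0, 1, 0]
```

After these three tokens, expert 0's centroid is the mean of the two states routed to it, (0.95, 0.2). The router has no trainable weights, so its assignments depend only on the history of routed hidden states.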

2605.12474 2026-05-13 cs.AI

Reward Hacking in Rubric-Based Reinforcement Learning

Anas Mahmoud, MohammadHossein Rezaei, Zihao Wang, Anisha Gunjal, Bing Liu, Yunzhong He

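AI summary: This paper studies reward hacking in rubric-based reinforcement learning, where a policy is optimized against a training verifier but evaluated by a panel of independent judges. Weak verifiers yield large training-reward gains that do not transfer to the reference evaluation; stronger verifiers reduce this problem but do not eliminate it. The authors also introduce a self-internalization gap as a verifier-free diagnostic, and show that limitations in rubric design can raise scores on criteria such as completeness while factual correctness and overall quality decline.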

Abstract

Reinforcement learning with verifiable rewards has enabled strong post-training gains in domains such as math and coding, though many open-ended settings rely on rubric-based rewards. We study reward hacking in rubric-based RL, where a policy is optimized against a training verifier but evaluated against a cross-family panel of three frontier judges, reducing dependence on any single evaluator. Our framework separates two sources of divergence: verifier failure, where the training verifier credits rubric criteria that reference verifiers reject, and rubric-design limitations, where even strong rubric-based verifiers favor responses that rubric-free judges rate worse overall. Across medical and science domains, weak verifiers produce large proxy-reward gains that do not transfer to the reference verifiers; exploitation grows over training and concentrates in recurring failures such as partial satisfaction of compound criteria, treating implicit content as explicit, and imprecise topical matching. Stronger verifiers substantially reduce, but do not eliminate, verifier exploitation. We also introduce a self-internalization gap, a verifier-free diagnostic based on policy log-probabilities, which tracks reference-verifier quality, detecting when the policy trained using the weak verifier stops improving. Finally, in our setting, stronger verification does not prevent reward hacking when the rubric leaves important failure modes unspecified: rubric-based verifiers prefer the RL checkpoint, while rubric-free judges prefer the base model. These disagreements coincide with gains concentrated in completeness and presence-based criteria, alongside declines in factual correctness, conciseness, relevance, and overall quality. Together, these results suggest that stronger verification reduces reward hacking, but does not by itself ensure that rubric gains correspond to broader quality gains.

2605.12471 2026-05-13 cs.LG cs.AI cs.CL

KV-Fold: One-Step KV-Cache Recurrence for Long-Context Inference

Alireza Nadali, Patrick Cooper, Ashutosh Trivedi, Alvaro Velasquez

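AI Summary This paper proposes KV-Fold, a training-free long-context inference method that treats the key-value (KV) cache as the accumulator in a left fold over sequence chunks. At each step the model processes the next chunk conditioned on the accumulated cache and appends the newly produced keys and values, so the cache grows step by step and is passed forward. With the model architecture and parameters unchanged, the method achieves stable long-range information retention and efficient inference; experiments show strong accuracy and memory efficiency on large-context tasks.

Details
Comments
12 pages, 3 figures, 6 tables
English Abstract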

We introduce KV-Fold, a simple, training-free long-context inference protocol that treats the key-value (KV) cache as the accumulator in a left fold over sequence chunks. At each step, the model processes the next chunk conditioned on the accumulated cache, appends the newly produced keys and values, and passes the enlarged cache forward; the same one-step update is applied repeatedly, analogous to foldl in functional programming. Building on the KV cache concatenation primitive introduced for latent multi-agent communication, we repurpose it as a chunk-to-chunk recurrence for long-context inference. When processing chunk t, the model attends to the KV cache carried from earlier chunks as a prefix, reusing its internal state across segments without modifying or retraining the model. Despite its simplicity, the induced recurrence is stable: per-step drift rises briefly and then saturates into a flat plateau that persists across deep chains. This plateau is insensitive to a 10,000x change in numerical precision, robust across chunk sizes, and consistent across model families. At the task level, KV-Fold preserves exact information over long distances. On a needle-in-a-haystack benchmark, it achieves 100% exact-match retrieval across 152 trials spanning contexts from 16K to 128K tokens and chain depths up to 511 on Llama-3.1-8B, while remaining within the memory limits of a single 40GB GPU. Compared to streaming methods, which trade fidelity for bounded memory, KV-Fold maintains long-range retrieval while operating as a sequence of tractable forward passes. Overall, our results show that frozen pretrained transformers already support a stable form of KV-cache recurrence, providing a practical route to long-context inference without architectural changes or training.
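
The left-fold view described in the abstract can be sketched in a few lines of Python. This is a toy illustration only, not the authors' code: `toy_forward` is a hypothetical stand-in for a frozen model's forward pass, and the "cache" holds `(position, token)` pairs instead of real key and value tensors.

```python
from functools import reduce
from typing import List, Tuple

# Toy cache: one (position, token) pair per processed token.
Cache = List[Tuple[int, int]]


def toy_forward(chunk: List[int], cache: Cache) -> Cache:
    """Hypothetical stand-in for a frozen model's forward pass over one chunk.

    A real model would attend to `cache` as a prefix; here the only thing the
    cache contributes is the position offset for the new entries.
    """
    offset = len(cache)
    return [(offset + i, token) for i, token in enumerate(chunk)]


def fold_step(cache: Cache, chunk: List[int]) -> Cache:
    """One-step update: process the chunk, then append its new entries."""
    return cache + toy_forward(chunk, cache)


def kv_fold(tokens: List[int], chunk_size: int) -> Cache:
    """Left fold of `fold_step` over fixed-size chunks, starting from an empty cache."""
    chunks = [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]
    return reduce(fold_step, chunks, [])
```

In this toy the final cache does not depend on `chunk_size`. Whether a real model behaves comparably is an empirical question, which the abstract addresses through its drift measurements.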

2605.12466 2026-05-13 cs.LG cs.AI cs.CL cs.NE

Solve the Loop: Attractor Models for Language and Reasoning

Jacob Fein-Ashley, Paria Rashidinejad

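AI Summary This paper proposes Attractor Models, a new architecture that improves iterative computation for language modeling and reasoning. A backbone module produces initial output embeddings, and an attractor module refines them by solving for a fixed point; training uses implicit differentiation, so training memory stays constant in effective depth and the number of iterations adapts to convergence. Experiments show that Attractor Models outperform existing methods in both large-scale language-model pretraining and small-model reasoning, improving performance while reducing training cost. They also exhibit a new phenomenon, "equilibrium internalization", which allows the solver to be removed at inference time with almost no loss in performance.

Details
English Abstract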

Looped Transformers offer a promising alternative to purely feed-forward computation by iteratively refining latent representations, improving language modeling and reasoning. Yet recurrent architectures remain unstable to train, costly to optimize and deploy, and constrained to small, fixed recurrence depths. We introduce Attractor Models, in which a backbone module first proposes output embeddings, then an attractor module refines them by solving for the fixed point, with gradients obtained through implicit differentiation. Thus, training memory remains constant in effective depth, and iterations are chosen adaptively by convergence. Empirically, Attractor Models outperform existing models across two regimes: large-scale language-model pretraining and reasoning with tiny models. In language modeling, Attractor Models deliver a Pareto improvement over standard Transformers and stable looped models across sizes, improving perplexity by up to 46.6% and downstream accuracy by up to 19.7% while reducing training cost. Notably, a 770M Attractor Model outperforms a 1.3B Transformer trained on twice as many tokens. On challenging reasoning tasks, we show that our model with only 27M parameters and approximately 1000 examples achieves 91.4% accuracy on Sudoku-Extreme and 93.1% on Maze-Hard, scaling favorably where frontier models such as Claude and GPT o3 fail completely and specialized recursive reasoners collapse at larger sizes. Lastly, we show that Attractor Models exhibit a novel phenomenon, which we call equilibrium internalization: fixed-point training places the model's initial output embedding near equilibrium, allowing the solver to be removed at inference time with little degradation. Together, these results suggest that Attractor Models make iterative refinement scalable by turning recurrence into a computation the model can learn to internalize.
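
The two mechanisms named in the abstract, solving for a fixed point with a convergence-based stopping rule and differentiating through the solution implicitly, can be illustrated on a scalar toy problem. The map `attractor_map` and its coefficients `A` and `B` are assumptions chosen so the answer is known in closed form; nothing here reflects the paper's actual architecture or solver.

```python
# Hypothetical coefficients; |A| < 1 makes the map a contraction in z.
A, B = 0.5, 1.0


def attractor_map(z: float, x: float) -> float:
    """Toy refinement map f(z, x) = A*z + B*x with fixed point z* = B*x / (1 - A)."""
    return A * z + B * x


def solve_fixed_point(x: float, z0: float = 0.0, tol: float = 1e-12,
                      max_iter: int = 10_000):
    """Iterate z <- f(z, x) until successive iterates differ by less than `tol`.

    Returns the approximate fixed point and the number of iterations used,
    so the iteration count adapts to how quickly the map converges.
    """
    z = z0
    for steps in range(1, max_iter + 1):
        z_next = attractor_map(z, x)
        if abs(z_next - z) < tol:
            return z_next, steps
        z = z_next
    return z, max_iter


def implicit_grad() -> float:
    """dz*/dx from the implicit function theorem: (1 - df/dz)^(-1) * df/dx.

    No intermediate iterates are needed, which is why memory does not grow
    with the number of solver steps.
    """
    return B / (1.0 - A)
```

Starting the solver closer to the fixed point (a larger `z0` when `x = 3.0`, for example) reduces the number of iterations. That is a loose analogue of the "equilibrium internalization" effect the abstract describes, not a reproduction of it.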

2605.12464 2026-05-13 cs.LG cs.AR cs.PF

Search Your Block Floating Point Scales!

Tanmaey Gupta, Hayden Prairie, Xiaoxia Wu, Reyna Abhyankar, Qingyang Wu, Austin Silveria, Pragaash Ponnusamy, Jue Wang, Ben Athiwaratkun, Leon Song, Tri Dao, Daniel Y. Fu, Chris De Sa

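AI Summary This paper studies how to choose scale factors in block floating point (BFP) formats so as to reduce quantization error and improve model performance. The authors propose ScaleSearch, which uses a fine-grained search exploiting the mantissa bits in microscaling formats to select scale factors that minimize quantization error for a given data distribution. The method can be combined with existing quantization techniques, and the paper also introduces ScaleSearchAttention, an accelerated attention algorithm built on ScaleSearch that reduces quantization error while preserving performance. Experiments show performance gains across several tasks.

Details
English Abstract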

Quantization has emerged as a standard technique for accelerating inference for generative models by enabling faster low-precision computations and reduced memory transfers. Recently, GPU accelerators have added first-class support for microscaling Block Floating Point (BFP) formats. Standard BFP algorithms use a fixed scale based on the maximum magnitude of the block. We observe that this scale choice can be suboptimal with respect to quantization errors. In this work, we propose ScaleSearch, an alternative strategy for selecting these scale factors: using a fine-grained search leveraging the mantissa bits in microscaling formats to minimize the quantization error for the given distribution. ScaleSearch can be integrated with existing quantization methods such as Post-Training Quantization (PTQ) and low-precision attention, and is shown to improve their performance. Additionally, we introduce ScaleSearchAttention, an accelerated NVFP4-based attention algorithm, which uses ScaleSearch and adapted prior techniques to ensure near-zero performance loss for causal language modeling. Experiments show that ScaleSearch reduces quantization error by 27% for NVFP4 and improves language model PTQ by up to 15 points for MATH500 (Qwen3-8B), while ScaleSearchAttention improves Wikitext-2 PPL by up to 0.77 points for Llama 3.1 70B. The proposed methods closely match baseline performance while providing quantization accuracy improvements.
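
To make the contrast between a max-based scale and a searched scale concrete, here is a minimal pure-Python sketch. It assumes a 4-bit E2M1 element grid and a hypothetical candidate set of 17 multipliers around the max-based scale; the paper's actual search space (which the abstract ties to the mantissa bits of the scale format) is not specified here, so the candidate set is an illustration only.

```python
# Magnitudes representable by a 4-bit E2M1 element (sign handled separately).
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def quantize(block, scale):
    """Round each element to the nearest grid value after dividing by the shared scale."""
    out = []
    for x in block:
        target = abs(x) / scale
        q = min(FP4_GRID, key=lambda g: abs(g - target))
        out.append(q * scale if x >= 0 else -q * scale)
    return out


def sq_error(block, scale):
    """Sum of squared quantization errors for the block under a given scale."""
    return sum((x - y) ** 2 for x, y in zip(block, quantize(block, scale)))


def max_scale(block):
    """Standard choice: map the largest magnitude in the block to the top of the grid."""
    peak = max(abs(x) for x in block)
    return peak / FP4_GRID[-1] if peak > 0 else 1.0


def search_scale(block):
    """Assumed search: try multipliers 0.75 .. 1.25 of the max-based scale, keep the best.

    The multiplier 1.0 is included exactly, so the result is never worse than
    the max-based scale on this block.
    """
    base = max_scale(block)
    candidates = [base * (0.75 + 0.03125 * k) for k in range(17)]
    return min(candidates, key=lambda s: sq_error(block, s))
```

Scales below the max-based one clip the largest element but can place the remaining elements closer to grid points, which is where a search can reduce total error.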

2605.12462 2026-05-13 cs.AI cs.CY cs.GT cs.LG

Towards Affordable Energy: A Gymnasium Environment for Electric Utility Demand-Response Programs

Jose E. Aguilar Escamilla, Lingdong Zhou, Xiangqi Zhu, Huazheng Wang

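AI Summary This paper presents DR-Gym, an open-source simulation environment for training and evaluating demand-response strategies from the electric utility's perspective, with the aim of improving grid flexibility and energy affordability. The environment targets the market-level utility setting, provides a rich observation space relevant to the utility, and includes a wholesale price model calibrated to real-world extreme events together with physics-based building demand models. A multi-objective reward function supports diverse learning objectives, and the authors demonstrate that the simulator can create realistic and learnable environments.

Details
English Abstract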

Extreme weather and volatile wholesale electricity markets expose residential consumers to catastrophic financial risks, yet demand response at the distribution level remains an underutilized tool for grid flexibility and energy affordability. While a demand-response program can shield consumers by issuing financial credits during high-price periods, optimizing this sequential decision-making process presents a unique challenge for reinforcement learning despite the plentiful offline historical smart meter and wholesale pricing data available publicly. Offline historical data fails to capture the dynamic, interactive feedback loop between an electric utility's pricing signals and customer acceptance and adaptation to a demand-response program. To address this, we introduce DR-Gym, an open-source, online Gymnasium-compatible environment designed to train and evaluate demand-response from the electric utility's perspective. Unlike existing device-level energy simulators, our environment focuses on the market-level electric utility setting and provides a rich observational space relevant to the electric utility. The simulator additionally features a regime-switching wholesale price model calibrated to real-world extreme events, alongside physics-based building demand profiles. For our learning signal, we use a configurable, multi-objective reward function for specifying diverse learning objectives. We demonstrate through baseline strategies and data snapshots the capability of our simulator to create realistic and learnable environments.
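
As a rough picture of what a utility-side demand-response environment with a regime-switching price might look like, the sketch below mirrors the Gymnasium `reset`/`step` return signatures without importing the library. It is not DR-Gym: the class name, prices, switching probabilities, customer-response rule, and reward are all hypothetical values chosen for illustration.

```python
import random


class ToyDemandResponseEnv:
    """Hypothetical toy environment; only the reset/step shapes follow Gymnasium."""

    NORMAL_PRICE, SPIKE_PRICE = 40.0, 900.0    # wholesale $/MWh, illustrative
    P_ENTER_SPIKE, P_EXIT_SPIKE = 0.05, 0.5    # two-state regime switching
    RETAIL_RATE = 120.0                        # $/MWh charged to customers
    BASE_LOAD = 10.0                           # MWh per step

    def __init__(self, horizon: int = 24, seed: int = 0):
        self.horizon = horizon
        self.rng = random.Random(seed)
        self.t = 0
        self.spike = False

    def _obs(self):
        price = self.SPIKE_PRICE if self.spike else self.NORMAL_PRICE
        return (price, self.BASE_LOAD)

    def reset(self):
        self.t = 0
        self.spike = False
        return self._obs(), {}

    def step(self, credit: float):
        """`credit` is the $/MWh the utility pays customers for reduced load."""
        price, load = self._obs()
        # Assumed customer response: reduction grows with credit, capped at 30%.
        reduction = min(0.3, credit / 1000.0) * load
        served = load - reduction
        # Utility margin on served energy minus credits paid out.
        reward = served * (self.RETAIL_RATE - price) - credit * reduction
        # Regime switch for the next step.
        if self.spike:
            if self.rng.random() < self.P_EXIT_SPIKE:
                self.spike = False
        elif self.rng.random() < self.P_ENTER_SPIKE:
            self.spike = True
        self.t += 1
        truncated = self.t >= self.horizon
        return self._obs(), reward, False, truncated, {}
```

In this toy, paying a credit during a price spike reduces the utility's loss on that step, which is the kind of trade-off an agent would have to learn.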

2605.12460 2026-05-13 cs.LG cs.CL

Multi-Stream LLMs: Unblocking Language Models with Parallel Streams of Thoughts, Inputs and Outputs

Guinan Su, Yanwu Yang, Xueyan Li, Jonas Geiping

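AI Summary This paper proposes Multi-Stream LLMs, which replace the traditional single stream of computation with multiple parallel streams to remove the serial bottleneck in how current language models read input, think, and produce output. Each role (such as input, thinking, and output) is placed in its own stream, so the model can read from multiple inputs and generate multiple outputs at the same timestep, improving efficiency, security, and monitorability. This data-driven change offers a new direction for building more efficient and more controllable autonomous agents.

Details
Comments
Preprint, 37 pages. Code at https://github.com/seal-rg/streaming/
English Abstract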

The continued improvements in language model capability have unlocked their widespread use as drivers of autonomous agents, for example in coding or computer use applications. However, the core of these systems has not changed much since early instruction-tuned models like ChatGPT. Even advanced AI agents function on message exchange formats, successively exchanging messages with users, systems, themselves (i.e. chain-of-thought), and tools in a single stream of computation. This bottleneck to a single stream in chat models leads to a number of limitations: the agent cannot act (generate output) while reading, and conversely, cannot react to new information while writing. Similarly, the agent cannot act while thinking and cannot think while reading or acting on information. In this work, we show that models can be unblocked by switching from instruction-tuning for sequential message formats to instruction-tuning for multiple, parallel streams of computation, splitting each role into a separate stream. Every forward pass of the language model then simultaneously reads from multiple input streams and generates tokens in multiple output streams, all of which causally depend on earlier timesteps. We argue that this data-driven change remedies a number of usability limitations as outlined above, improves model efficiency through parallelization, improves model security through better separation of concerns, and can further improve model monitorability.

2605.12452 2026-05-13 cs.CL cs.AI cs.CY

The Algorithmic Caricature: Auditing LLM-Generated Political Discourse Across Crisis Events

Gunjan, Sidahmed Benabderrahmane, Talal Rahwan

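AI Summary This study examines how political discourse generated by large language models (LLMs) behaves during crisis events and how it differs from real online discourse. It builds a paired corpus covering nine crisis events and compares the two along four dimensions: emotional intensity, structural regularity, lexical-ideological framing, and cross-event dependency. The generated content is fluent but unrealistic at the population level: its sentiment is less varied, its structure more regular, and its wording more abstract. The study proposes the "Caricature Gap" as a measure, revealing the limits of generated political discourse in social realism and diversity.

Details
English Abstract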

Large Language Models (LLMs) can generate fluent political text at scale, raising concerns about synthetic discourse during crises and social conflict. Existing AI-text detection often focuses on sentence-level cues such as perplexity, burstiness, or token irregularities, but these signals may weaken as generative systems improve. We instead adopt a Computational Social Science perspective and ask whether synthetic political discourse behaves like an observed online population. We construct a paired corpus of 1,789,406 posts across nine crisis events: COVID-19, the Jan. 6 Capitol attack, the 2020 and 2024 U.S. elections, Dobbs/Roe v. Wade, the 2020 BLM protests, U.S. midterms, the Utah shooting, and the U.S.-Iran war. For each event, we compare observed discourse from social platforms with synthetic discourse generated for the same context. We evaluate four dimensions: emotional intensity, structural regularity, lexical-ideological framing, and cross-event dependency, using mean gaps and dispersion evidence. Across events, synthetic discourse is fluent but population-level unrealistic. It is generally more negative and less dispersed in sentiment, structurally more regular, and lexically more abstract than observed discourse. Observed discourse instead shows broader emotional variation, longer-tailed structural distributions, and more context-specific, colloquial lexical markers. These differences are event-dependent: larger for fast-moving, decentralized crises and smaller for formal or institutionally mediated events. We summarize them with a simple event-level measure, the Caricature Gap. Our findings suggest that the main limitation of synthetic political discourse is not grammar or fluency, but reduced population realism. Population-level auditing complements traditional text-detection and provides a CSS framework for evaluating the social realism of generated discourse.

2605.12449 2026-05-13 cs.CV

LychSim: A Controllable and Interactive Simulation Framework for Vision Research

Wufei Ma, Chloe Wang, Siyi Chen, Jiawei Peng, Patrick Li, Alan Yuille

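AI Summary LychSim is a controllable and interactive simulation framework for vision research built on Unreal Engine 5, intended to lower the technical barrier of simulation platforms and support closed-loop optimization and out-of-distribution evaluation. Through a streamlined Python API, a procedural data generation pipeline, and integration with the Model Context Protocol, it provides high-fidelity environment generation, semantically aligned 3D annotations, and dynamic interaction with large language models. LychSim shows broad potential in downstream tasks including synthetic data generation, adversarial examination, and language-driven scene generation, and will be open-sourced for the research community.

Details
Comments
3D-LLM/VLA Workshop at CVPR 2026. Project page: https://lychsim.github.io/
English Abstract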

While self-supervised pretraining has reduced vision systems' reliance on synthetic data, simulation remains an indispensable tool for closed-loop optimization and rigorous out-of-distribution (OOD) evaluation. However, modern simulation platforms often present steep technical barriers, requiring extensive expertise in computer graphics and game development. In this work, we present LychSim, a highly controllable and interactive simulation framework built upon Unreal Engine 5 to bridge this gap. LychSim is built around three key designs: (1) a streamlined Python API that abstracts away underlying engine complexities; (2) a procedural data pipeline capable of generating diverse, high-fidelity environments with varying out-of-distribution (OOD) visual challenges, paired with rich 2D and 3D ground truths; and (3) a native integration of the Model Context Protocol (MCP) that transforms the simulator into a dynamic, closed-loop playground for reasoning agentic LLMs. We further annotate scene-level procedural rules and object-level pose alignments to enable semantically aligned 3D ground truths and automated scene modification. We demonstrate LychSim's capability across multiple downstream applications, including serving as a synthetic data engine, powering reinforcement learning-based adversarial examiners, and facilitating interactive, language-driven scene layout generation. To benefit the broader vision community, LychSim will be made publicly available, including full source code and various data annotations.

2605.12446 2026-05-13 cs.LG cs.CL

ORCE: Order-Aware Alignment of Verbalized Confidence in Large Language Models

Chen Li, Xiaoling Hu, Songzhu Zheng, Jiawei Zhou, Chao Chen

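AI Summary Large language models often express high confidence even when their answers are wrong, so reliable confidence estimation is essential in practice. This paper proposes a decoupled, order-aware framework for calibrating verbalized confidence: the model first generates an answer and then estimates confidence on the fixed question-answer pair, avoiding interference with answer generation. The method builds a surrogate from samples of multiple model completions and optimizes rank-based reinforcement learning objectives so that responses more likely to be correct receive higher verbalized confidence. Experiments show improved calibration and failure prediction while answer accuracy is largely preserved.

Details
Comments
18 pages, 2 figures
English Abstract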

Large language models (LLMs) often produce answers with high certainty even when they are incorrect, making reliable confidence estimation essential for deployment in real-world scenarios. Verbalized confidence, where models explicitly state their confidence in natural language, provides a flexible and user-facing uncertainty signal that can be applied even when token logits are unavailable. However, existing verbalized-confidence methods often optimize answer generation and confidence generation jointly, which can cause confidence-alignment objectives to interfere with answer accuracy. In this work, we propose a decoupled and order-aware framework for verbalized confidence calibration. Our method first generates an answer and then estimates confidence conditioned on the fixed question-answer pair, allowing confidence optimization without directly perturbing the answer-generation process. To align confidence with correctness likelihood, we construct a sampling-based surrogate from multiple model completions and optimize rank-based reinforcement learning objectives that encourage responses with higher estimated correctness likelihood to receive higher verbalized confidence. Experiments on reasoning and knowledge-intensive benchmarks show that our method improves calibration and failure prediction performance while largely preserving answer accuracy. These results demonstrate that verbalized confidence can be more reliably aligned by decoupling confidence estimation from answer generation and optimizing the relative ordering of confidence across responses.
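
One simple way to picture a sampling-based surrogate and an order-aware signal is shown below. Both functions are assumptions made for illustration: the abstract does not say how the surrogate is computed or what form the rank-based objective takes, so the agreement-rate surrogate and the pairwise concordance score here should not be read as the paper's definitions.

```python
def surrogate_likelihood(answer, sampled_answers):
    """Assumed surrogate: fraction of sampled completions that agree with `answer`."""
    return sum(a == answer for a in sampled_answers) / len(sampled_answers)


def pairwise_order_reward(confidences, likelihoods):
    """Fraction of response pairs whose confidences are ordered like their likelihoods.

    Pairs with equal surrogate likelihood carry no ordering information and
    are skipped; ties in confidence on an informative pair count as a miss.
    """
    agree = 0
    total = 0
    n = len(confidences)
    for i in range(n):
        for j in range(i + 1, n):
            dl = likelihoods[i] - likelihoods[j]
            if dl == 0:
                continue
            total += 1
            dc = confidences[i] - confidences[j]
            if dc * dl > 0:
                agree += 1
    return agree / total if total else 0.0
```

A score of this kind depends only on the relative ordering of confidences, not on their absolute values, which is the sense in which such an objective is "order-aware".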

2605.12438 2026-05-13 cs.CL cs.AI

A Causal Language Modeling Detour Improves Encoder Continued Pretraining

Rian Touchent, Eric de la Clergerie

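AI Summary When adapting an encoder to a new domain, continued pretraining usually uses masked language modeling (MLM). This paper proposes temporarily switching to causal language modeling (CLM) during continued pretraining, followed by a short MLM decay phase, which improves downstream performance. Experiments on biomedical text show that the approach outperforms MLM baselines. Analysis finds that CLM affects the encoder's low layers more strongly, and that the resulting representational changes persist through the subsequent MLM phase and grow with model size.

Details
English Abstract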

When adapting an encoder to a new domain, the standard approach is to continue training with Masked Language Modeling (MLM). We show that temporarily switching to Causal Language Modeling (CLM) followed by a short MLM decay improves downstream performance. On biomedical texts with ModernBERT, this CLM detour outperforms MLM baselines trained on identical data and compute across 8 French and 11 English biomedical tasks, by +1.2-2.8pp and +0.3-0.8pp respectively, depending on model size. We investigate the reasons for these gains. We find that CLM's dense supervision impacts low transformer layers (0-7) far more than MLM does. Freezing low layers during CLM eliminates the downstream benefit; freezing mid layers preserves it. The representational changes persist through the MLM decay phase, even when it matches the CLM phase in length, and they scale with model capacity. We release ModernCamemBERT-bio and ModernBERT-bio as state-of-the-art biomedical encoders in Base and Large sizes.

2605.12437 2026-05-13 cs.CV

3D Gaussian Splatting for Efficient Retrospective Dynamic Scene Novel View Synthesis with a Standardized Benchmark

Yunxiao Zhang, Suryansh Kumar

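AI Summary This paper studies efficient retrospective novel view synthesis (NVS) of dynamic scenes in a synchronized multi-view (MV) setting, and proposes a 3D Gaussian Splatting (3DGS) method that reconstructs dynamic scenes with high quality without explicit temporal coupling. By initializing from an SfM point cloud at the start time and propagating optimized Gaussians over time, the method makes NVS more efficient. The authors also build a Blender-based dynamic MV dataset framework that generates standardized, high-quality synchronized camera rigs and training data, supporting reproducible and systematic evaluation of dynamic NVS methods.

Details
Comments
Accepted for publication at CVPR 2026; 4D World Models Workshop. Draft info: 14 pages, 4 figures, 8 tables
English Abstract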

Retrospective novel view synthesis (NVS) of dynamic scenes is fundamental to applications such as sports. Recent dynamic 3D Gaussian Splatting (3DGS) approaches introduce temporally coupled formulations to enforce motion coherence across time. In this paper, we argue that, in a synchronized multi-view (MV) setting typical of sports, the dynamic scene at each time step is already strongly geometrically constrained. We posit that the availability of calibrated, synchronized viewpoints provides sufficient spatial consistency, and therefore explicit temporal coupling or complex multi-body constraints seem unnecessary for retrospective NVS. To this end, we propose an approach tailored for synchronized MV dynamic scenes. By initializing the SfM-derived point cloud at the start time and propagating optimized Gaussians over time, we show that efficient retrospective NVS can be achieved without imposing a temporal deformation constraint. Complementing our methodological contribution, we introduce a Dynamic MV dataset framework built on Blender for reproducible NeRF and 3DGS research. The framework generates high-quality, synchronized camera rigs and exports training-ready datasets in standard formats, eliminating inconsistencies in coordinate conventions and data pipelines. Using the framework, we construct a dynamic benchmark suite and evaluate representative NeRF and 3DGS approaches under controlled conditions. Together, we show that, under a synchronized MV setup, efficient retrospective dynamic scene NVS can be achieved using 3DGS. At the same time, the dataset-generation framework enables reproducible and principled benchmarking of dynamic NVS methods.

2605.12436 2026-05-13 cs.AI

CAAFC: Chronological Actionable Automated Fact-Checker for misinformation / non-factual hallucination detection and correction

Islam Eldifrawi, Shengrui Wang, Amine Trabelsi

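AI Summary With the surge of online and AI-generated content, automated fact-checking (AFC) has become increasingly important. This paper proposes CAAFC, a chronological, actionable automated fact-checking framework intended to close the gap between existing systems and fact-checking as practiced. CAAFC not only detects misinformation and hallucinations but also provides actionable justifications for corrections by citing primary information sources, and it can update its knowledge base with recent, contextual information, improving the accuracy and reliability of fact-checking.

Details
English Abstract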

With the vast amount of content uploaded every hour, along with AI-generated content that can include hallucinations, Automated Fact-Checking (AFC) has become increasingly vital, as it is infeasible for human fact-checkers to manually verify the sheer volume of information generated online. Professional fact-checkers have identified several gaps in existing AFC systems, noting a misalignment between how these systems operate and how fact-checking is performed in practice. In this paper, we introduce CAAFC (Chronological Actionable Automated Fact-Checker), a framework designed to bridge these gaps. It surpasses SOTA AFC and hallucination detection systems across multiple benchmark datasets. CAAFC operates on claims, conversations, and dialogues, enabling it not only to detect factual errors and hallucinations, but also to correct them by providing actionable justifications supported by primary information sources. Furthermore, CAAFC can update evidence and knowledge bases by incorporating recent and contextual information when necessary, thereby enhancing the reliability of fact verification.

2605.12435 2026-05-13 cs.LG cs.CE

Environment-Adaptive Preference Optimization for Wildfire Prediction

Enyi Jiang, Wu Sun

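AI Summary This study addresses the prediction of rare extreme events such as wildfires from meteorological data and proposes Environment-Adaptive Preference Optimization (EAPO), a framework for handling the distribution shift and long-tailed distribution that come with changing environments. EAPO constructs datasets aligned with the target environment's distribution and performs hybrid fine-tuning that combines supervised learning with preference optimization, improving detection in extreme regimes. Experiments on a real-world wildfire prediction task show robust performance and improved detection.

Details
English Abstract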

Predicting rare extreme events such as wildfires from meteorological data requires models that remain reliable under evolving environmental conditions. This problem is inherently long-tailed: wildfire events are rare but high-impact, while most observations correspond to non-fire conditions, causing standard learning objectives to underemphasize the minority class (fire) that matters most. In addition, models trained on historical distributions often fail under distribution shifts, exhibiting degraded performance in new environments. To this end, we propose Environment-Adaptive Preference Optimization (EAPO), a framework that adapts prediction to a target environment with a long-tailed distribution. Given a new input distribution, we first construct distribution-aligned datasets via $k$-nearest neighbor retrieval. We then perform a hybrid fine-tuning procedure on this local manifold, combining supervised learning with preference optimization while emphasizing rare extreme events. EAPO refines decision boundaries while avoiding conflicting signals from heterogeneous training data. We evaluate EAPO on a real-world wildfire prediction task with environmental shifts. EAPO achieves robust performance (ROC-AUC 0.7310) and improves detection in extreme regimes, demonstrating its effectiveness in dynamic wildfire prediction systems.
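
The first step in the abstract, building a distribution-aligned dataset by $k$-nearest-neighbor retrieval, can be sketched as follows. The Euclidean metric, the union-of-neighbors rule, and the function name `knn_aligned_subset` are assumptions for illustration; the abstract does not specify the feature space or how the retrieved neighbors are combined.

```python
import math


def knn_aligned_subset(train_x, target_x, k):
    """Return sorted indices of training points that are among the k nearest
    neighbors (Euclidean distance) of at least one target-environment point."""
    selected = set()
    for t in target_x:
        order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], t))
        selected.update(order[:k])
    return sorted(selected)
```

For example, with training points clustered around (0, 0) and (10, 10) plus one outlier at (5, 5), target points near the two clusters retrieve only the cluster members, so the outlier is left out of the subset used for fine-tuning.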

2605.12431 2026-05-13 cs.CV

GaitProtector: Impersonation-Driven Gait De-Identification via Training-Free Diffusion Latent Optimization

Huiran Duan, Qian Zhou, Zhongliang Guo, Junhao Dong, Yuqi Li, Guoying Zhao, Yingli Tian

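AI Summary Conventional gait de-identification methods tend either to suppress identity insufficiently or to introduce spatiotemporal distortions that harm downstream applications. This paper proposes GaitProtector, an impersonation-driven gait de-identification framework that achieves privacy protection through a unified objective with two tightly coupled components: repulsion from the source identity and attraction toward a target identity. Without retraining any model, the method optimizes the latents of a pretrained 3D diffusion model for each input silhouette sequence, producing gaits that protect privacy while remaining structurally plausible. Experiments show strong results in both impersonation success rate and preservation of downstream task performance.

Details
Comments
Accepted to the 20th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2026)
English Abstract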

Conventional gait de-identification methods often encounter an inherent trade-off: they either provide insufficient identity suppression or introduce spatiotemporal distortions that impede structure-sensitive downstream applications. We propose GaitProtector, an impersonation-driven gait de-identification framework that formulates privacy protection as a unified objective with two tightly coupled components: (i) obfuscation, which repels the protected gait from the source identity, and (ii) impersonation, which attracts it toward a selected target identity. The target identity serves as a semantic anchor that biases optimization toward structurally plausible gait patterns under the pretrained diffusion prior, helping preserve dominant body shape and motion dynamics. We instantiate this idea through a training-free diffusion latent optimization pipeline. Instead of retraining a generator for each dataset, we invert each input silhouette sequence into the latent trajectory of a pretrained 3D video diffusion model and iteratively optimize latent codes with a differentiable adversarial objective to synthesize protected gaits. Experiments on the CASIA-B dataset show that GaitProtector achieves a 56.7% impersonation success rate under black-box gait recognition and reduces Rank-1 identification accuracy from 89.6% to 15.0%, while maintaining favorable visual and temporal quality. We further evaluate downstream utility on the Scoliosis1K dataset, where diagnostic accuracy decreases only from 91.4% to 74.2%. To the best of our knowledge, this work is the first to leverage pretrained 3D diffusion priors in a training-free manner for silhouette-based gait de-identification.

2605.12430 2026-05-13 cs.CV

AOI-SSL: Self-Supervised Framework for Efficient Segmentation of Wire-bonded Semiconductors In Optical Inspection

Joaquín Figueira, Rob Van Gastel, Giacomo D'Amicantonio, Zhuoran Liu, Ioan Gabriel Bucur, Faysal Boughorbel, Egor Bondarev

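AI Summary This paper proposes AOI-SSL, a self-supervised learning framework for more efficient segmentation of wire-bonded semiconductors in optical inspection. It combines small-domain self-supervised pre-training with in-context inference, reducing the reliance on labeled data, and works well when data is limited. Experiments show that the framework outperforms training from scratch and ImageNet pre-training in segmentation quality and in adapting to new devices; in particular, retrieval-based segmentation outperforms fine-tuning when targeting single-device images.

Details
Comments
Accepted to the AI4RWC Workshop at CVPR 2026
English Abstract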

Segmentation models in automated optical inspection of wire-bonded semiconductors are typically device-specific and must be re-trained when new devices or distribution shifts appear. We introduce AOI-SSL, a training-efficient framework for semantic segmentation of wire-bonded semiconductors by combining small-domain self-supervised pre-training of vision transformers with in-context inference that minimizes the need for labeled examples. We pre-train SOTA self-supervised algorithms on a small industrial inspection dataset and find that Masked Autoencoders are the most effective in this small-data setting, improving downstream segmentation while reducing the labeled fine-tuning effort. We further introduce in-context, patch-level retrieval methods that predict masks directly from dense encoder embeddings with negligible additional training. We show that, in this setting, simple similarity-based retrieval performs on par with the more complex attention-based aggregation currently used in the literature. Furthermore, our experiments demonstrate that self-supervised pre-training significantly improves segmentation quality compared to training from scratch and to ImageNet pre-trained backbones under a fixed fine-tuning computational budget. Finally, the results reveal that retrieval-based segmentation outperforms fine-tuning when targeting single device images, allowing for near-instant adaptation to difficult samples.
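
Similarity-based patch retrieval of the kind the abstract mentions can be pictured with a nearest-neighbor lookup over patch embeddings. This sketch assumes cosine similarity and a single best match per query patch; the embedding vectors, label names, and the function `retrieve_labels` are hypothetical, and the paper's actual retrieval rule may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def retrieve_labels(query_patches, support_patches, support_labels):
    """Give each query patch the label of its most similar labeled support patch."""
    predictions = []
    for q in query_patches:
        best = max(range(len(support_patches)),
                   key=lambda i: cosine(q, support_patches[i]))
        predictions.append(support_labels[best])
    return predictions
```

Because prediction is only a lookup against a handful of labeled patches, adapting to a new device amounts to swapping the support set, with no gradient updates.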