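arXivDaily: a daily academic digest synced with the full arXiv feed, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and related fields.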

2605.12501 2026-05-13 cs.CV

Covering Human Action Space for Computer Use: Data Synthesis and Benchmark

Miaosen Zhang, Xiaohan Zhao, Zhihong Tan, Zhou Huoshen, Yijia Fan, Yifan Yang, Kai Qiu, Bei Liu, Justin Wagle, Chenzhong Yin, Mingxi Cheng, Ji Li, Qi Dai, Chong Luo, Xu Yang, Xin Geng, Baining Guo

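AI summary: This study addresses the limited reliability of computer-use agents (CUAs) on complex, low-frequency interaction tasks and proposes a new benchmark, CUActSpot, which spans multiple interaction modalities (GUI, text, tables, canvas, and natural images) and multiple action types. To address the scarcity of complex interaction data, the authors design a renderer-based data synthesis method that automatically generates scenes along with the corresponding instructions and action trajectories. Experiments show that models trained on this dataset outperform open-source models with fewer parameters.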

2605.12500 2026-05-13 cs.CV

SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture

Haiwen Diao, Penghao Wu, Hanming Deng, Jiahao Wang, Shihao Bai, Silei Wu, Weichen Fan, Wenjie Ye, Wenwen Tong, Xiangyu Fan, Yan Li, Yubo Wang, Zhijie Cao, Zhiqian Lin, Zhitao Yang, Zhongang Cai, Yuwei Niu, Yue Zhu, Bo Liu, Chengguang Lv, Haojia Yu, Haozhe Xie, Hongli Wang, Jianan Fan, Jiaqi Li, Jiefan Lu, Jingcheng Ni, Junxiang Xu, Kaihuan Liang, Lianqiang Shi, Linjun Dai, Linyan Wang, Oscar Qian, Peng Gao, Pengfei Liu, Qingping Sun, Rui Shen, Ruisi Wang, Shengnan Ma, Shuang Yang, Siyi Xie, Siying Li, Tianbo Zhong, Xiangli Kong, Xuanke Shi, Yang Gao, Yongqiang Yao, Yves Wang, Zhengqi Bai, Zhengyu Lin, Zixin Yin, Wenxiu Sun, Ruihao Gong, Quan Wang, Lewei Lu, Lei Yang, Ziwei Liu, Dahua Lin

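AI summary: This paper presents SenseNova-U1, a unified multimodal model that targets the separation between understanding and generation in current vision-language models. Built on the NEO-unify architecture, the model treats understanding and generation as synergistic views of a single underlying process, aiming at native multimodal intelligence. The authors report that it rivals top-tier understanding-only VLMs across a range of tasks, and they describe the model design, data preprocessing, training, and inference strategies to support community research.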

Comments: Project page: https://github.com/OpenSenseNova/SenseNova-U1

Abstract

Recent large vision-language models (VLMs) remain fundamentally constrained by a persistent dichotomy: understanding and generation are treated as distinct problems, leading to fragmented architectures, cascaded pipelines, and misaligned representation spaces. We argue that this divide is not merely an engineering artifact, but a structural limitation that hinders the emergence of native multimodal intelligence. Hence, we introduce SenseNova-U1, a native unified multimodal paradigm built upon NEO-unify, in which understanding and generation evolve as synergistic views of a single underlying process. We launch two native unified variants, SenseNova-U1-8B-MoT and SenseNova-U1-A3B-MoT, built on dense (8B) and mixture-of-experts (30B-A3B) understanding baselines, respectively. Designed from first principles, they rival top-tier understanding-only VLMs across text understanding, vision-language perception, knowledge reasoning, agentic decision-making, and spatial intelligence. Meanwhile, they deliver strong semantic consistency and visual fidelity, excelling in conventional or knowledge-intensive any-to-image (X2I) synthesis, complex text-rich infographic generation, and interleaved vision-language generation, with or without think patterns. Beyond performance, we show detailed model design, data preprocessing, pre-/post-training, and inference strategies to support community research. Last but not least, preliminary evidence demonstrates that our models extend beyond perception and generation, performing strongly in vision-language-action (VLA) and world model (WM) scenarios. This points toward a broader roadmap where models do not translate between modalities, but think and act across them in a native manner. Multimodal AI is no longer about connecting separate systems, but about building a unified one and trusting the necessary capabilities to emerge from within.

2605.12498 2026-05-13 cs.CV cs.GR

EgoForce: Forearm-Guided Camera-Space 3D Hand Pose from a Monocular Egocentric Camera

Christen Millerdurai, Shaoxiang Wang, Yaxu Xie, Vladislav Golyanik, Didier Stricker, Alain Pagani

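AI summary: This paper proposes EgoForce, a monocular 3D hand reconstruction framework that recovers the absolute 3D pose and position of the hand from the user's viewpoint (camera space), for settings such as AR/VR and telepresence where sensing must remain compact and unobtrusive. By combining a differentiable forearm representation, a unified arm-hand transformer, and a ray-space closed-form solver, the method mitigates the depth-scale ambiguity of monocular approaches and works across fisheye, perspective, and distorted wide-FOV camera models. Experiments on three egocentric benchmarks show state-of-the-art 3D accuracy, with camera-space MPJPE reduced by up to 28% on the HOT3D dataset.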

Comments: 23 pages, 19 figures and 10 tables; project page: https://dfki-av.github.io/EgoForce (source code, data and demo available); SIGGRAPH 2026 Conference

Abstract

Reconstructing the absolute 3D pose and shape of the hands from the user's viewpoint using a single head-mounted camera is crucial for practical egocentric interaction in AR/VR, telepresence, and hand-centric manipulation tasks, where sensing must remain compact and unobtrusive. While monocular RGB methods have made progress, they remain constrained by depth-scale ambiguity and struggle to generalize across the diverse optical configurations of head-mounted devices. As a result, models typically require extensive training on device-specific datasets, which are costly and laborious to acquire. This paper addresses these challenges by introducing EgoForce, a monocular 3D hand reconstruction framework that recovers robust, absolute 3D hand pose and its position from the user's (camera-space) viewpoint. EgoForce operates across fisheye, perspective, and distorted wide-FOV camera models using a single unified network. Our approach combines a differentiable forearm representation that stabilizes hand pose, a unified arm-hand transformer that predicts both hand and forearm geometry from a single egocentric view, mitigating depth-scale ambiguity, and a ray space closed-form solver that enables absolute 3D pose recovery across diverse head-mounted camera models. Experiments on three egocentric benchmarks show that EgoForce achieves state-of-the-art 3D accuracy, reducing camera-space MPJPE by up to 28% on the HOT3D dataset compared to prior methods and maintaining consistent performance across camera configurations. For more details, visit the project page at https://dfki-av.github.io/EgoForce.

2605.12497 2026-05-13 cs.CV

From Web to Pixels: Bringing Agentic Search into Visual Perception

Bokang Yang, Xinyi Sun, Kaituo Feng, Xingping Dong, Dongming Wu, Xiangyu Yue

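AI summary: This study examines how external information (facts, events, long-tail entities, and multi-hop relations) can support visual perception tasks in open-world settings. The authors pose a new challenge, "Perception Deep Research", and build the WebEye benchmark, which contains verifiable evidence, knowledge-intensive queries, and precisely annotated image instances. They also design Pixel-Searcher, which uses an agentic search pipeline to go from external information to pixel-level target localization end to end, and report significant gains on open-world visual tasks.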

2605.12496 2026-05-13 cs.CV

CausalCine: Real-Time Autoregressive Generation for Multi-Shot Video Narratives

Yihao Meng, Zichen Liu, Hao Ouyang, Qiuyu Wang, Ka Leong Cheng, Yue Yu, Hanlin Wang, Haobo Li, Jiapeng Zhu, Yanhong Zeng, Xing Zhu, Yujun Shen, Qifeng Chen, Huamin Qu

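AI summary: CausalCine is a real-time autoregressive generation framework for multi-shot video narratives that targets the motion stagnation and semantic drift existing models exhibit in long-sequence generation. By introducing a causal base model and a content-aware memory routing mechanism, it achieves coherent generation across shots and supports dynamic prompt input and context reuse. Experiments show that CausalCine outperforms conventional autoregressive models in generation quality, approaches the quality of bidirectional models, and supports real-time interactive generation.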

2605.12495 2026-05-13 cs.CV cs.AI cs.LG

AlphaGRPO: Unlocking Self-Reflective Multimodal Generation in UMMs via Decompositional Verifiable Reward

Runhui Huang, Jie Wu, Rui Yang, Zhe Liu, Hengshuang Zhao

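AI summary: This paper proposes AlphaGRPO, a framework that applies Group Relative Policy Optimization (GRPO) to unified multimodal models (UMMs) and improves multimodal generation without an additional cold-start stage. Its Decompositional Verifiable Reward (DVReward) uses a large language model to break complex user requests into verifiable semantic and quality questions, providing stable and reliable feedback that supports reasoning-based text-to-image generation and autonomous self-reflective refinement. Experiments show that AlphaGRPO achieves significant gains on multiple multimodal generation benchmarks and also performs well on editing tasks without having been trained on them.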

2605.12494 2026-05-13 cs.CV

Revisiting Photometric Ambiguity for Accurate Gaussian-Splatting Surface Reconstruction

Jiahe Li, Jiawei Zhang, Xiao Bai, Jin Zheng, Xiaohan Yu, Lin Gu, Gim Hee Lee

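AI summary: This paper studies the photometric ambiguity that is pervasive in surface reconstruction based on differentiable rendering and proposes AmbiSuR, a framework for improving the reconstruction accuracy of Gaussian Splatting under photometric ambiguity. Revisiting the representational basis of Gaussian Splatting, the authors identify its intrinsic photometric ambiguity and propose a photometric disambiguation method and an ambiguity indication module that constrain the geometric solution and guide reconstruction. Experiments show more accurate and more robust surface reconstruction across a variety of complex scenes.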

2605.12493 2026-05-13 cs.CL

LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues

Di Wu, Zixiang Ji, Asmi Kawatkar, Bryan Kwan, Jia-Chen Gu, Nanyun Peng, Kai-Wei Chang

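AI summary: LongMemEval-V2 is a new benchmark for evaluating the long-term memory of agents, testing whether memory systems help agents internalize environment-specific experience in web environments and become experienced colleagues. It contains 451 manually curated questions covering five core abilities (static state recall, dynamic state tracking, workflow knowledge, environment gotchas, and premise awareness), paired with history trajectories of up to 500 trajectories and 115M tokens. The study proposes two memory methods; the coding-agent-based one, AgentRunbook-C, achieves the best accuracy (72.5% on average) but incurs high latency, indicating substantial room for improvement in long-term memory system design.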

Comments: Work in Progress

Abstract

Long-term memory is crucial for agents in specialized web environments, where success depends on recalling interface affordances, state dynamics, workflows, and recurring failure modes. However, existing memory benchmarks for agents mostly focus on user histories, short traces, or downstream task success, leaving open how to directly evaluate whether memory systems effectively internalize environment-specific experience. To address this gap, we introduce LongMemEval-V2 (LME-V2), a benchmark for evaluating whether memory systems can help agents acquire the experience needed to become knowledgeable colleagues in customized environments. LME-V2 contains 451 manually curated questions covering five core memory abilities for web agents: static state recall, dynamic state tracking, workflow knowledge, environment gotchas, and premise awareness. Questions are paired with history trajectories containing up to 500 trajectories and 115M tokens. We use a context gathering formulation: memory systems consume history trajectories and return compact evidence for downstream question answering. We propose a suite of two memory methods: AgentRunbook-R, an efficient RAG-based memory with knowledge pools for raw state observations, events, and strategy notes, and AgentRunbook-C, which stores trajectories as files and invokes a coding agent to gather evidence in an augmented sandbox. Experiments show that AgentRunbook-C achieves the best performance with 72.5% average accuracy, outperforming the strongest RAG baseline (48.5%) and the off-the-shelf coding agent baseline (69.3%). Despite the strong performance gains, coding agent based methods have high latency costs. While AgentRunbook-C advances the accuracy-latency Pareto frontier, substantial room for improvement remains. Together, these results establish LME-V2 as a challenging testbed for developing long-term memory systems for environment experience.

2605.12492 2026-05-13 cs.LG stat.ML

Pion: A Spectrum-Preserving Optimizer via Orthogonal Equivalence Transformation

Kexuan Shi, Hanxuan Li, Zeju Qiu, Yandong Wen, Simon Buchholz, Weiyang Liu

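AI summary: This paper proposes Pion, a spectrum-preserving optimizer based on orthogonal equivalence transformation, for training large language models. Unlike additive optimizers such as Adam, Pion updates each weight matrix through left and right orthogonal transformations, so its singular values stay unchanged throughout training. The method adjusts the geometry of the weight matrix while keeping its spectral norm fixed, and experiments show stable and competitive performance in LLM pretraining and fine-tuning.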
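
To illustrate why an orthogonal equivalence transformation leaves the spectrum untouched, the sketch below multiplies a 2×2 matrix by rotation matrices on the left and right and compares singular values before and after. It is not the Pion update rule, which the summary does not specify; the function names, the rotation angles, and the restriction to 2×2 matrices are assumptions made for illustration only.

```python
import math

def matmul(a, b):
    """Multiply two 2x2 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def rotation(theta):
    """2x2 rotation matrix, which is orthogonal."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

def singular_values(w):
    """Singular values of a 2x2 matrix from the eigenvalues of W^T W."""
    t = sum(x * x for row in w for x in row)            # trace(W^T W)
    d = (w[0][0] * w[1][1] - w[0][1] * w[1][0]) ** 2    # det(W^T W)
    disc = math.sqrt(max(t * t - 4.0 * d, 0.0))
    return (math.sqrt((t + disc) / 2.0),
            math.sqrt(max((t - disc) / 2.0, 0.0)))

def orthogonal_update(w, theta_left, theta_right):
    """Hypothetical multiplicative update W <- R(theta_left) W R(theta_right)^T."""
    r_right_t = rotation(-theta_right)  # the transpose of a rotation is its inverse
    return matmul(matmul(rotation(theta_left), w), r_right_t)

w = [[3.0, 1.0], [0.5, 2.0]]
w_new = orthogonal_update(w, 0.3, -0.7)
before, after = singular_values(w), singular_values(w_new)
```

The entries of `w_new` differ from those of `w`, yet `before` and `after` agree up to floating-point error, which is the sense in which such an update changes the geometry of the matrix while preserving its spectrum.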

2605.12491 2026-05-13 cs.CV cs.LG

Elastic Attention Cores for Scalable Vision Transformers

Alan Z. Song, Yinjie Chen, Mu Nan, Rui Zhang, Jiahang Cao, Weijian Mai, Muquan Yu, Hossein Adeli, Deva Ramanan, Michael J. Tarr, Andrew F. Luo

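AI summary: This paper proposes VECA, a vision transformer architecture that targets the excessive computational cost of conventional ViTs on high-resolution images. VECA introduces elastic core-periphery attention in which a small set of learned core embeddings serves as a communication interface, so image patches never interact directly and the complexity drops from quadratic to linear. While keeping the full set of input tokens, the method allows a flexible trade-off between compute and accuracy and achieves performance comparable to the latest vision foundation models on multiple vision tasks.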

Comments: Project repository here: https://github.com/alansong1322/VECA

Abstract

Vision Transformers (ViTs) achieve strong data-driven scaling by leveraging all-to-all self-attention. However, this flexibility incurs a computational cost that scales quadratically with image resolution, limiting ViTs in high-resolution domains. Underlying this approach is the assumption that pairwise token interactions are necessary for learning rich visual-semantic representations. In this work, we challenge this assumption, demonstrating that effective visual representations can be learned without any direct patch-to-patch interaction. We propose VECA (Visual Elastic Core Attention), a vision transformer architecture that uses efficient linear-time core-periphery structured attention enabled by a small set of learned cores. In VECA, these cores act as a communication interface: patch tokens exchange information exclusively through the core tokens, which are initialized from scratch and propagated across layers. Because the $N$ image patches only directly interact with a resolution invariant set of $C$ learned "core" embeddings, this yields linear complexity $O(N)$ for predetermined $C$, which bypasses quadratic scaling. Compared to prior cross-attention architectures, VECA maintains and iteratively updates the full set of $N$ input tokens, avoiding a small $C$-way bottleneck. Combined with nested training along the core axis, our model can elastically trade off compute and accuracy during inference. Across classification and dense tasks, VECA achieves performance competitive with the latest vision foundation models while reducing computational cost. Our results establish elastic core-periphery attention as a scalable alternative building block for Vision Transformers.
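
The complexity argument in the abstract can be made concrete with a toy layer in which patches and cores only attend to each other. This is a minimal sketch under assumptions, not VECA's actual layer: it omits learned projections, multiple heads, residual connections, and nested training, reuses keys as values, and the order "cores read patches, then patches read cores" is a guess. A counter tallies query-key score evaluations.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys_values, counter):
    """Single-head dot-product attention; keys double as values for brevity."""
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = []
        for k in keys_values:
            scores.append(sum(a * b for a, b in zip(q, k)) / math.sqrt(dim))
            counter[0] += 1  # count query-key score evaluations
        weights = softmax(scores)
        out.append([sum(w * k[d] for w, k in zip(weights, keys_values))
                    for d in range(dim)])
    return out

def core_periphery_layer(patches, cores, counter):
    """Cores read from patches, then patches read from the updated cores."""
    new_cores = attend(cores, patches, counter)        # C x N scores
    new_patches = attend(patches, new_cores, counter)  # N x C scores
    return new_patches, new_cores

random.seed(0)
N, C, D = 64, 4, 8  # patches, cores, embedding width (illustrative values)
patches = [[random.gauss(0, 1) for _ in range(D)] for _ in range(N)]
cores = [[random.gauss(0, 1) for _ in range(D)] for _ in range(C)]
counter = [0]
patches_out, cores_out = core_periphery_layer(patches, cores, counter)
```

With these values the layer evaluates 2·N·C = 512 scores, against N² = 4096 for all-to-all self-attention over the same patches, and all N patch tokens are still updated rather than compressed into the C cores.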

2605.12487 2026-05-13 cs.CL cs.IR cs.LG

Task-Adaptive Embedding Refinement via Test-time LLM Guidance

Ariel Gera, Shir Ashury-Tahan, Gal Bloch, Ohad Eytan, Assaf Toledo

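AI summary: This paper studies how LLM-guided query refinement can make embedding models more applicable to zero-shot search and classification. The method obtains LLM feedback on a small number of documents and uses it to refine the embedding of the user query at test time, adapting the model to the task at hand. Experiments show significant gains on multiple benchmark tasks, with relative improvements of up to 25%, improving retrieval quality and classification accuracy and broadening the practical use of embedding models.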

2605.12481 2026-05-13 cs.AI

ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents

Xuhao Hu, Xi Zhang, Haiyang Xu, Kyle Qiao, Jingyi Yang, Xuanjing Huang, Jing Shao, Ming Yan, Jieping Ye

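AI summary: Computer-use agents (CUAs) must switch between low-level GUI actions and high-level tool calls when executing tasks, but this hybrid action space makes it hard for an agent to decide when to use which, leading to suboptimal execution paths. To address this, the paper proposes ToolCUA, an end-to-end agent that learns optimal GUI-tool path selection through a staged training paradigm. Using hybrid trajectory generation, guided reinforcement learning, and online agentic reinforcement learning, the method significantly improves task accuracy and efficiency and performs strongly on multiple benchmarks, supporting its potential for real-world digital agents.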

2605.12480 2026-05-13 cs.CV cs.AI

OmniNFT: Modality-wise Omni Diffusion Reinforcement for Joint Audio-Video Generation

Guohui Zhang, XiaoXiao Ma, Jie Huang, Hang Xu, Hu Yu, Siming Fu, Yuming Li, Zeyue Xue, Lin Song, Haoyang Huang, Nan Duan, Feng Zhao

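AI summary: OmniNFT is a new reinforcement learning framework for joint audio-video generation that targets shortcomings of existing methods in per-modality fidelity, cross-modal alignment, and fine-grained synchronization. Through three innovations (modality-wise advantage routing, layer-wise gradient surgery, and region-wise loss reweighting), it mitigates multi-objective advantage inconsistency, multi-modal gradient imbalance, and uniform credit assignment. Experiments show that OmniNFT significantly improves the perceptual quality and synchronization of audio and video on multiple benchmarks.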

Comments: Project page: https://zghhui.github.io/OmniNFT/

Abstract

Recent advances in joint audio-video generation have been remarkable, yet real-world applications demand strong per-modality fidelity, cross-modal alignment, and fine-grained synchronization. Reinforcement Learning (RL) offers a promising paradigm, but its extension to multi-objective and multi-modal joint audio-video generation remains unexplored. Notably, our in-depth analysis first reveals that the primary obstacles to applying RL in this setting stem from: (i) multi-objective advantage inconsistency, where the advantages of multimodal outputs are not always consistent within a group; (ii) multi-modal gradient imbalance, where video-branch gradients leak into shallow audio layers responsible for intra-modal generation; (iii) uniform credit assignment, where fine-grained cross-modal alignment regions fail to get efficient exploration. These shortcomings suggest that a vanilla RL fine-tuning strategy with a single global advantage often leads to suboptimal results. To address these challenges, we propose OmniNFT, a novel modality-aware online diffusion RL framework with three key innovations: (1) Modality-wise advantage routing, which routes independent per-reward advantages to their respective modality generation branches. (2) Layer-wise gradient surgery, which selectively detaches video-branch gradients on shallow audio layers while retaining those for cross-modal interaction layers. (3) Region-wise loss reweighting, which modulates policy optimization toward critical regions related to audio-video synchronization and fine-grained alignment. Extensive experiments on JavisBench and VBench with the LTX-2 backbone demonstrate that OmniNFT achieves comprehensive improvements in audio and video perceptual quality, cross-modal alignment, and audio-video synchronization.
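
One plausible reading of the first innovation, modality-wise advantage routing, is sketched below: group-relative advantages are computed separately for each reward and kept per branch, instead of being collapsed into one scalar. This is an illustrative assumption rather than the authors' implementation; the reward values, the branch names, and the use of simple within-group standardization are all hypothetical.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-8):
    """Group-relative advantage: standardize rewards within one sample group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def route_advantages(audio_rewards, video_rewards):
    """Keep one advantage per modality instead of one global scalar."""
    return {
        "audio_branch": group_advantages(audio_rewards),
        "video_branch": group_advantages(video_rewards),
    }

def global_advantages(audio_rewards, video_rewards):
    """Baseline for contrast: a single advantage from the summed reward."""
    return group_advantages([a + v for a, v in zip(audio_rewards, video_rewards)])

# Hypothetical rewards for a group of four generated clips.
audio_rewards = [0.9, 0.2, 0.8, 0.1]
video_rewards = [0.1, 0.9, 0.3, 0.7]
routed = route_advantages(audio_rewards, video_rewards)
single = global_advantages(audio_rewards, video_rewards)
```

For the first clip, the routed audio advantage is positive and the routed video advantage is negative, while the single global advantage is close to zero. That is a small instance of the advantage inconsistency the abstract describes: a global scalar can hide the fact that the two modalities disagree about the same sample.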

2605.12477 2026-05-13 cs.LG cs.CL

MEME: Multi-entity & Evolving Memory Evaluation

Seokwon Jung, Alexander Rubinstein, Arnas Uselis, Sangdoo Yun, Seong Joon Oh

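AI summary: MEME is a benchmark for evaluating LLM agents in multi-entity and evolving memory scenarios. It defines six tasks spanning the multi-entity and evolving dimensions, including challenges not previously evaluated such as cascade reasoning, absence reasoning, and deleted state. The study finds that existing memory systems generally perform poorly on dependency reasoning tasks, with accuracy far below average even when static retrieval performance is good. Only one system, which combines file-based storage with a strong language model, partially alleviates the problem, but at very high computational cost, suggesting that currently effective solutions are not yet practical at scale.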

2605.12476 2026-05-13 cs.LG cs.CL

Routers Learn the Geometry of Their Experts: Geometric Coupling in Sparse Mixture-of-Experts

Sagi Ahrac, Noya Hochwald, Mor Geva

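AI summary: This paper studies the geometric coupling between routers and experts in sparse mixture-of-experts (SMoE) models, revealing an intrinsic link between routing decisions and expert weight updates. For the same input token, the gradient updates of the router and of the corresponding expert point in the same direction and differ only by a scalar coefficient, which is also confirmed empirically. Building on this coupling, the authors propose an auxiliary-loss-free online K-Means routing strategy in which each expert averages the hidden states of the tokens routed to it, achieving efficient load distribution; experiments show significantly reduced load imbalance while keeping perplexity low.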

2605.12474 2026-05-13 cs.AI

Reward Hacking in Rubric-Based Reinforcement Learning

Anas Mahmoud, MohammadHossein Rezaei, Zihao Wang, Anisha Gunjal, Bing Liu, Yunzhong He

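AI summary: This paper studies reward hacking in rubric-based reinforcement learning, where a policy is optimized against a training verifier but evaluated by a panel of independent judges. Weak verifiers let the policy earn high training scores that do not transfer to the reference evaluation; stronger verifiers mitigate the problem but do not eliminate it. The study also introduces a "self-internalization gap" as a verifier-free diagnostic, and shows that limitations in rubric design can raise scores on criteria such as completeness while factual correctness and overall quality decline.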

Abstract

Reinforcement learning with verifiable rewards has enabled strong post-training gains in domains such as math and coding, though many open-ended settings rely on rubric-based rewards. We study reward hacking in rubric-based RL, where a policy is optimized against a training verifier but evaluated against a cross-family panel of three frontier judges, reducing dependence on any single evaluator. Our framework separates two sources of divergence: verifier failure, where the training verifier credits rubric criteria that reference verifiers reject, and rubric-design limitations, where even strong rubric-based verifiers favor responses that rubric-free judges rate worse overall. Across medical and science domains, weak verifiers produce large proxy-reward gains that do not transfer to the reference verifiers; exploitation grows over training and concentrates in recurring failures such as partial satisfaction of compound criteria, treating implicit content as explicit, and imprecise topical matching. Stronger verifiers substantially reduce, but do not eliminate, verifier exploitation. We also introduce a self-internalization gap, a verifier-free diagnostic based on policy log-probabilities, which tracks reference-verifier quality, detecting when the policy trained using the weak verifier stops improving. Finally, in our setting, stronger verification does not prevent reward hacking when the rubric leaves important failure modes unspecified: rubric-based verifiers prefer the RL checkpoint, while rubric-free judges prefer the base model. These disagreements coincide with gains concentrated in completeness and presence-based criteria, alongside declines in factual correctness, conciseness, relevance, and overall quality. Together, these results suggest that stronger verification reduces reward hacking, but does not by itself ensure that rubric gains correspond to broader quality gains.

2605.12471 2026-05-13 cs.LG cs.AI cs.CL

KV-Fold: One-Step KV-Cache Recurrence for Long-Context Inference

Alireza Nadali, Patrick Cooper, Ashutosh Trivedi, Alvaro Velasquez

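AI summary: This paper proposes KV-Fold, a training-free long-context inference method that treats the key-value (KV) cache as the accumulator in a left fold over sequence chunks. At each step the model processes the next chunk conditioned on the accumulated cache and appends the newly produced keys and values, so the cache grows and is passed to the following step. Without changing the model's architecture or parameters, the method retains information stably over long distances; experiments report 100% exact-match retrieval on a needle-in-a-haystack benchmark for contexts up to 128K tokens within the memory of a single 40GB GPU.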

Comments: 12 pages, 3 figures, 6 tables

Abstract

We introduce KV-Fold, a simple, training-free long-context inference protocol that treats the key-value (KV) cache as the accumulator in a left fold over sequence chunks. At each step, the model processes the next chunk conditioned on the accumulated cache, appends the newly produced keys and values, and passes the enlarged cache forward; the same one-step update is applied repeatedly, analogous to foldl in functional programming. Building on the KV cache concatenation primitive introduced for latent multi-agent communication, we repurpose it as a chunk-to-chunk recurrence for long-context inference. When processing chunk t, the model attends to the KV cache carried from earlier chunks as a prefix, reusing its internal state across segments without modifying or retraining the model. Despite its simplicity, the induced recurrence is stable: per-step drift rises briefly and then saturates into a flat plateau that persists across deep chains. This plateau is insensitive to a 10,000x change in numerical precision, robust across chunk sizes, and consistent across model families. At the task level, KV-Fold preserves exact information over long distances. On a needle-in-a-haystack benchmark, it achieves 100% exact-match retrieval across 152 trials spanning contexts from 16K to 128K tokens and chain depths up to 511 on Llama-3.1-8B, while remaining within the memory limits of a single 40GB GPU. Compared to streaming methods, which trade fidelity for bounded memory, KV-Fold maintains long-range retrieval while operating as a sequence of tractable forward passes. Overall, our results show that frozen pretrained transformers already support a stable form of KV-cache recurrence, providing a practical route to long-context inference without architectural changes or training.
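
The foldl analogy can be written down directly with `functools.reduce`. The sketch below shows only the control flow of the recurrence, under stated assumptions: `forward_chunk` is a hypothetical stand-in for a frozen model's forward pass, and the cache is a plain list of toy (key, value) pairs rather than per-layer tensors.

```python
from functools import reduce

def forward_chunk(cache, chunk):
    """Stand-in for one frozen forward pass over `chunk` with `cache` as prefix.

    A real model would return per-layer key/value tensors; here each token
    yields a toy (key, value) pair whose value is the token's absolute
    position, derived from the length of the prefix it could attend to.
    """
    prefix_len = len(cache)
    return [(tok, prefix_len + i) for i, tok in enumerate(chunk)]

def fold_step(cache, chunk):
    """One-step update: process the chunk, then append its new keys/values."""
    return cache + forward_chunk(cache, chunk)

def kv_fold(tokens, chunk_size):
    """Left fold over fixed-size chunks, starting from an empty cache."""
    chunks = [tokens[i:i + chunk_size]
              for i in range(0, len(tokens), chunk_size)]
    return reduce(fold_step, chunks, [])

tokens = list("the needle is hidden here")
cache = kv_fold(tokens, chunk_size=4)
```

In this toy setting the final cache is the same for any chunk size, because each step sees the full accumulated prefix. Whether a real transformer's cache stays close to the single-pass result is the empirical question the abstract addresses with its drift measurements.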

2605.12466 2026-05-13 cs.LG cs.AI cs.CL cs.NE

Solve the Loop: Attractor Models for Language and Reasoning

Jacob Fein-Ashley, Paria Rashidinejad

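AI summary: This paper proposes Attractor Models, a new architecture that improves iterative computation for language modeling and reasoning. A backbone module produces initial output embeddings, and an attractor module refines them by solving for a fixed point; training uses implicit differentiation, so training memory stays constant in effective depth and the number of iterations adapts to convergence. Experiments show that Attractor Models outperform existing methods in both large-scale language-model pretraining and reasoning with tiny models, improving performance while reducing training cost, and exhibit a new phenomenon called "equilibrium internalization" that lets the solver be removed at inference time with little degradation.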

Abstract

Looped Transformers offer a promising alternative to purely feed-forward computation by iteratively refining latent representations, improving language modeling and reasoning. Yet recurrent architectures remain unstable to train, costly to optimize and deploy, and constrained to small, fixed recurrence depths. We introduce Attractor Models, in which a backbone module first proposes output embeddings, then an attractor module refines them by solving for the fixed point, with gradients obtained through implicit differentiation. Thus, training memory remains constant in effective depth, and iterations are chosen adaptively by convergence. Empirically, Attractor Models outperform existing models across two regimes, large-scale language-model pretraining and reasoning with tiny models. In language modeling, Attractor Models deliver a Pareto improvement over standard Transformers and stable looped models across sizes, improving perplexity by up to 46.6% and downstream accuracy by up to 19.7% while reducing training cost. Notably, a 770M Attractor Model outperforms a 1.3B Transformer trained on twice as many tokens. On challenging reasoning tasks, we show that our model with only 27M parameters and approximately 1000 examples achieves 91.4% accuracy on Sudoku-Extreme and 93.1% on Maze-Hard, scaling favorably where frontier models like Claude and GPT o3 fail completely, and specialized recursive reasoners collapse at larger sizes. Lastly, we show that Attractor Models exhibit a novel phenomenon, which we call equilibrium internalization: fixed-point training places the model's initial output embedding near equilibrium, allowing the solver to be removed at inference time with little degradation. Together, these results suggest that Attractor Models make iterative refinement scalable by turning recurrence into a computation the model can learn to internalize.
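
The "propose, then solve for a fixed point, stopping at convergence" pattern can be sketched with naive fixed-point iteration. Everything here is a toy under assumptions: `backbone` and `attractor_step` are hypothetical stand-ins chosen so that the map is contractive, the abstract does not say which solver the authors use, and implicit differentiation is not shown.

```python
import math

def backbone(x):
    """Stand-in backbone: proposes an initial output embedding."""
    return [0.5 * v for v in x]

def attractor_step(z, x):
    """Stand-in refinement map f(z, x); contractive since |d/dz| <= 0.5."""
    return [0.5 * math.tanh(zi) + xi for zi, xi in zip(z, x)]

def solve_fixed_point(x, tol=1e-8, max_iter=200):
    """Iterate z <- f(z, x) until the update is smaller than `tol`."""
    z = backbone(x)
    for step in range(1, max_iter + 1):
        z_next = attractor_step(z, x)
        residual = max(abs(a - b) for a, b in zip(z_next, z))
        z = z_next
        if residual < tol:
            return z, step
    return z, max_iter

x = [0.2, -1.0, 3.0]
z_star, steps = solve_fixed_point(x)
```

The number of iterations is decided by the residual rather than fixed in advance, which is the adaptive-depth behavior the abstract describes. If the backbone's proposal already sat near the fixed point, the loop would exit after very few steps, which gives some intuition for the equilibrium internalization effect.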

2605.12464 2026-05-13 cs.LG cs.AR cs.PF

Search Your Block Floating Point Scales!

Tanmaey Gupta, Hayden Prairie, Xiaoxia Wu, Reyna Abhyankar, Qingyang Wu, Austin Silveria, Pragaash Ponnusamy, Jue Wang, Ben Athiwaratkun, Leon Song, Tri Dao, Daniel Y. Fu, Chris De Sa

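AI summary: This paper studies how to choose scaling factors in block floating point (BFP) formats to reduce quantization error and improve model performance. The authors propose ScaleSearch, which uses a fine-grained search that exploits the mantissa bits available in microscaling formats to select the best scale for a given data distribution. The method can be combined with existing quantization techniques, and the paper also introduces ScaleSearchAttention, an accelerated attention algorithm built on ScaleSearch that reduces quantization error while maintaining performance. Experiments show significant improvements on multiple tasks.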
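
The general idea of searching for a block scale can be sketched as a brute-force search over candidate scales that minimizes reconstruction error for one block. The summary does not describe the ScaleSearch algorithm itself, so this is a generic illustration under assumptions: a symmetric 4-bit integer grid for elements, a power-of-two baseline scale, and candidates obtained by giving the scale three hypothetical mantissa bits.

```python
import math

QMAX = 7  # assumed symmetric 4-bit integer grid for block elements

def quantize(block, scale):
    """Round each value to the grid {-QMAX..QMAX} * scale, clipping outliers."""
    return [max(-QMAX, min(QMAX, round(v / scale))) * scale for v in block]

def mse(block, scale):
    """Mean squared error between a block and its dequantized version."""
    deq = quantize(block, scale)
    return sum((a - b) ** 2 for a, b in zip(block, deq)) / len(block)

def power_of_two_scale(block):
    """Baseline: smallest power-of-two scale that avoids clipping."""
    max_abs = max(abs(v) for v in block)
    return 2.0 ** math.ceil(math.log2(max_abs / QMAX))

def search_scale(block, mantissa_bits=3):
    """Try scales between base/2 and base on a mantissa grid; keep the best."""
    base = power_of_two_scale(block)
    steps = 2 ** mantissa_bits
    candidates = [base * (steps + m) / (2 * steps) for m in range(steps + 1)]
    return min(candidates, key=lambda s: mse(block, s))

block = [0.03, -0.41, 0.27, 0.9, -0.12, 0.05, 0.66, -0.75]
base = power_of_two_scale(block)
best = search_scale(block)
```

Because the baseline scale is among the candidates, the searched scale can never do worse than it on this error metric. A smaller scale trades some clipping of the largest values for finer resolution on the rest, and the search picks whichever balance gives the lowest error for the block at hand.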

2605.12462 2026-05-13 cs.AI cs.CY cs.GT cs.LG

Towards Affordable Energy: A Gymnasium Environment for Electric Utility Demand-Response Programs

Jose E. Aguilar Escamilla, Lingdong Zhou, Xiangqi Zhu, Huazheng Wang

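AI summary: This paper presents DR-Gym, an open-source simulation environment for training and evaluating demand-response strategies from the electric utility's perspective, with the goal of improving grid flexibility and energy affordability. The environment focuses on market-level electricity scenarios, provides a rich observation space relevant to utilities, and includes a wholesale price model based on real extreme events and a physics-based building demand model. A multi-objective reward function supports diverse learning objectives, and the study demonstrates that the simulator yields realistic and learnable environments.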

2605.12460 2026-05-13 cs.LG cs.CL

Multi-Stream LLMs: Unblocking Language Models with Parallel Streams of Thoughts, Inputs and Outputs

Guinan Su, Yanwu Yang, Xueyan Li, Jonas Geiping

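AI summary: This paper proposes Multi-Stream LLMs, which replace the single computation stream of conventional language models with multiple parallel streams, removing the serial bottleneck in how current models handle input, thinking, and output. Separating different roles (such as input, thinking, and output) into independent streams lets the model read several inputs and produce several outputs at the same time step, improving efficiency, safety, and monitorability. This data-driven change offers a new route toward more efficient and more controllable autonomous agents.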

2605.12452 2026-05-13 cs.CL cs.AI cs.CY

The Algorithmic Caricature: Auditing LLM-Generated Political Discourse Across Crisis Events

Gunjan, Sidahmed Benabderrahmane, Talal Rahwan

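AI summary: This study examines political discourse generated by large language models (LLMs) during crisis events and how it differs from real online discourse. It builds a paired corpus across nine crisis events and compares four dimensions: emotional intensity, structural regularity, lexical-ideological framing, and cross-event dependency. The generated content is fluent but unrealistic at the population level: less dispersed in sentiment, structurally more regular, and lexically more abstract. The study proposes the "Caricature Gap" as a measure of these differences, highlighting the limits of generated political discourse in social realism and diversity.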

Abstract

Large Language Models (LLMs) can generate fluent political text at scale, raising concerns about synthetic discourse during crises and social conflict. Existing AI-text detection often focuses on sentence-level cues such as perplexity, burstiness, or token irregularities, but these signals may weaken as generative systems improve. We instead adopt a Computational Social Science perspective and ask whether synthetic political discourse behaves like an observed online population. We construct a paired corpus of 1,789,406 posts across nine crisis events: COVID-19, the Jan. 6 Capitol attack, the 2020 and 2024 U.S. elections, Dobbs/Roe v. Wade, the 2020 BLM protests, U.S. midterms, the Utah shooting, and the U.S.-Iran war. For each event, we compare observed discourse from social platforms with synthetic discourse generated for the same context. We evaluate four dimensions: emotional intensity, structural regularity, lexical-ideological framing, and cross-event dependency, using mean gaps and dispersion evidence. Across events, synthetic discourse is fluent but population-level unrealistic. It is generally more negative and less dispersed in sentiment, structurally more regular, and lexically more abstract than observed discourse. Observed discourse instead shows broader emotional variation, longer-tailed structural distributions, and more context-specific, colloquial lexical markers. These differences are event-dependent: larger for fast-moving, decentralized crises and smaller for formal or institutionally mediated events. We summarize them with a simple event-level measure, the Caricature Gap. Our findings suggest that the main limitation of synthetic political discourse is not grammar or fluency, but reduced population realism. Population-level auditing complements traditional text-detection and provides a CSS framework for evaluating the social realism of generated discourse.

2605.12449 2026-05-13 cs.CV

LychSim: A Controllable and Interactive Simulation Framework for Vision Research

Wufei Ma, Chloe Wang, Siyi Chen, Jiawei Peng, Patrick Li, Alan Yuille

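AI summary: LychSim is a controllable and interactive simulation framework for vision research built on Unreal Engine 5, designed to lower the technical barrier of simulation platforms and to facilitate closed-loop optimization and out-of-distribution evaluation. Through a concise Python interface, a procedural data generation pipeline, and integration with the Model Context Protocol, it provides high-fidelity environment generation, semantically aligned 3D annotations, and dynamic interaction with large language models. LychSim shows broad potential in downstream tasks such as synthetic data generation, adversarial examination, and language-driven scene generation, and will be open-sourced for the research community.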

2605.12446 2026-05-13 cs.LG cs.CL

ORCE: Order-Aware Alignment of Verbalized Confidence in Large Language Models

Chen Li, Xiaoling Hu, Songzhu Zheng, Jiawei Zhou, Chao Chen

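AI summary: Large language models often express high confidence even when their answers are wrong, so reliable confidence estimation is essential for practical use. This paper proposes a decoupled, order-aware framework for calibrating the verbalized confidence of language models: the answer is generated first, and confidence is then estimated on the fixed question-answer pair, which avoids interfering with answer generation. The method builds a proxy signal by sampling multiple model completions and optimizes a ranking-based reinforcement learning objective so that responses more likely to be correct receive higher verbalized confidence; experiments show significantly better calibration and failure prediction while preserving answer accuracy.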

2605.12438 2026-05-13 cs.CL cs.AI

A Causal Language Modeling Detour Improves Encoder Continued Pretraining

Rian Touchent, Eric de la Clergerie

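AI summary: When adapting an encoder to a new domain, continued pretraining usually uses masked language modeling (MLM). This paper proposes a modification: first switch temporarily to causal language modeling (CLM), then follow with a short MLM annealing phase, which improves downstream performance. Experiments on biomedical text show significant gains over standard MLM, and analysis shows that CLM affects the encoder's lower layers more strongly, with the resulting representational changes persisting through the subsequent MLM phase and growing with model size.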

2605.12437 2026-05-13 cs.CV

3D Gaussian Splatting for Efficient Retrospective Dynamic Scene Novel View Synthesis with a Standardized Benchmark

Yunxiao Zhang, Suryansh Kumar

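AI summary: This paper studies efficient retrospective novel view synthesis (NVS) of dynamic scenes in a synchronized multi-view (MV) setting and proposes a 3D Gaussian Splatting (3DGS) approach that reconstructs dynamic scenes with high quality without explicit temporal coupling. The method initializes from an SfM-derived point cloud at the start time and propagates the optimized Gaussians over time, making NVS more efficient. The authors also build a Blender-based dynamic MV dataset framework that generates standardized, high-quality synchronized camera rigs and training-ready data, supporting reproducible and systematic evaluation of dynamic NVS methods.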

Comments: Accepted for publication at CVPR 2026; 4D World Models Workshop. Draft info: 14 pages, 4 figures, 8 tables

Abstract

Retrospective novel view synthesis (NVS) of dynamic scenes is fundamental to applications such as sports. Recent dynamic 3D Gaussian Splatting (3DGS) approaches introduce temporally coupled formulations to enforce motion coherence across time. In this paper, we argue that, in a synchronized multi-view (MV) setting typical of sports, the dynamic scene at each time step is already strongly geometrically constrained. We posit that the availability of calibrated, synchronized viewpoints provides sufficient spatial consistency, and therefore explicit temporal coupling or complex multi-body constraints seem unnecessary for retrospective NVS. To this end, we propose an approach tailored for synchronized MV dynamic scenes. By initializing the SfM-derived point cloud at the start time and propagating optimized Gaussians over time, we show that efficient retrospective NVS can be achieved without imposing a temporal deformation constraint. Complementing our methodological contribution, we introduce a Dynamic MV dataset framework built on Blender for reproducible NeRF and 3DGS research. The framework generates high-quality, synchronized camera rigs and exports training-ready datasets in standard formats, eliminating inconsistencies in coordinate conventions and data pipelines. Using the framework, we construct a dynamic benchmark suite and evaluate representative NeRF and 3DGS approaches under controlled conditions. Together, we show that, under a synchronized MV setup, efficient retrospective dynamic scene NVS can be achieved using 3DGS. At the same time, the dataset-generation framework enables reproducible and principled benchmarking of dynamic NVS methods.

2605.12436 2026-05-13 cs.AI

CAAFC: Chronological Actionable Automated Fact-Checker for misinformation / non-factual hallucination detection and correction

Islam Eldifrawi, Shengrui Wang, Amine Trabelsi

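AI summary: With the surge of online and AI-generated content, automated fact-checking (AFC) has become increasingly important. This paper proposes CAAFC, a chronological, actionable automated fact-checking framework intended to close the gap between existing systems and real-world fact-checking practice. CAAFC not only detects misinformation and hallucinations but also provides actionable corrections by citing authoritative sources, and it can update its knowledge base dynamically with up-to-date context, significantly improving the accuracy and reliability of fact-checking.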

2605.12435 2026-05-13 cs.LG cs.CE

Environment-Adaptive Preference Optimization for Wildfire Prediction

Enyi Jiang, Wu Sun

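AI summary: This study addresses the prediction of rare extreme events such as wildfires from meteorological data and proposes the Environment-Adaptive Preference Optimization (EAPO) framework to handle the distribution shift and long-tailed distributions caused by changing environments. EAPO builds a dataset aligned with the target environment's distribution and performs hybrid fine-tuning that combines supervised learning with preference optimization, improving detection in extreme cases. Experiments on real wildfire prediction tasks show strong robustness and detection performance.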

2605.12431 2026-05-13 cs.CV

GaitProtector: Impersonation-Driven Gait De-Identification via Training-Free Diffusion Latent Optimization

Huiran Duan, Qian Zhou, Zhongliang Guo, Junhao Dong, Yuqi Li, Guoying Zhao, Yingli Tian

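AI summary: Conventional gait de-identification methods tend either to suppress identity insufficiently or to introduce spatiotemporal distortions that hurt downstream applications. This paper proposes GaitProtector, an impersonation-driven gait de-identification framework that casts privacy protection as a unified objective with two tightly coupled components: repulsion from the source identity and attraction toward a target identity. The method requires no retraining; it uses a pretrained 3D diffusion model to optimize the latent representation of an input silhouette sequence, producing gaits that protect privacy while remaining structurally plausible. Experiments report a 56.7% impersonation success rate on CASIA-B, while diagnostic accuracy on the downstream Scoliosis1K task falls from 91.4% to 74.2%.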

Comments: Accepted to the 20th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2026)

Abstract

Conventional gait de-identification methods often encounter an inherent trade-off: they either provide insufficient identity suppression or introduce spatiotemporal distortions that impede structure-sensitive downstream applications. We propose GaitProtector, an impersonation-driven gait de-identification framework that formulates privacy protection as a unified objective with two tightly coupled components: (i) obfuscation, which repels the protected gait from the source identity, and (ii) impersonation, which attracts it toward a selected target identity. The target identity serves as a semantic anchor that biases optimization toward structurally plausible gait patterns under the pretrained diffusion prior, helping preserve dominant body shape and motion dynamics. We instantiate this idea through a training-free diffusion latent optimization pipeline. Instead of retraining a generator for each dataset, we invert each input silhouette sequence into the latent trajectory of a pretrained 3D video diffusion model and iteratively optimize latent codes with a differentiable adversarial objective to synthesize protected gaits. Experiments on the CASIA-B dataset show that GaitProtector achieves a 56.7% impersonation success rate under black-box gait recognition and reduces Rank-1 identification accuracy from 89.6% to 15.0%, while maintaining favorable visual and temporal quality. We further evaluate downstream utility on the Scoliosis1K dataset, where diagnostic accuracy decreases only from 91.4% to 74.2%. To the best of our knowledge, this work is the first to leverage pretrained 3D diffusion priors in a training-free manner for silhouette-based gait de-identification.

2605.12430 2026-05-13 cs.CV

AOI-SSL: Self-Supervised Framework for Efficient Segmentation of Wire-bonded Semiconductors In Optical Inspection

Joaquín Figueira, Rob Van Gastel, Giacomo D'Amicantonio, Zhuoran Liu, Ioan Gabriel Bucur, Faysal Boughorbel, Egor Bondarev

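AI summary: This paper proposes AOI-SSL, a self-supervised learning framework for more efficient segmentation of wire-bonded semiconductor products in optical inspection. It combines self-supervised pretraining on small datasets with in-context inference, reducing dependence on labeled data, and performs particularly well when data are limited. Experiments show that the framework outperforms training from scratch and ImageNet-pretrained approaches in segmentation accuracy and in adapting to new devices; in particular, when working with images of a single device, retrieval-based segmentation outperforms fine-tuning.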