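arXivDaily daily academic digest: synced with the full arXiv data, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems, and more.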

2605.09100 2026-05-13 cs.CL

GRC: Unifying Reasoning-Driven Generation, Retrieval and Compression

Zhongtao Miao, Qiyu Wu, Yoshimasa Tsuruoka

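AI Summary: This paper proposes GRC, a unified training framework that integrates reasoning-driven generation, text representation, and context compression into a single forward pass of a large language model. By introducing meta latent tokens and a unified tuning method for generation, representation, and compression, GRC completes all three tasks in a single inference pass while remaining modular and flexibly composable at inference time. The approach substantially reduces the deployment cost of retrieval-augmented generation (RAG), improves training data utilization, and introduces new paradigms such as self-reasoning latent embeddings and latent-memory-augmented generation; experiments confirm its effectiveness across multiple tasks.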

Comments Fixed typos in Eq. 4 and GPU names; added details on hybrid paged attention implementation

2605.08804 2026-05-13 cs.RO

Constraint-Aware Diffusion Priors for High-Fidelity and Versatile Quadruped Locomotion

Jianhui Chen, Ruixin Zhan, Liu Liu, Yang Cai, Ziqiao Li

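AI Summary: Addressing key challenges in high-fidelity, versatile locomotion control for quadruped robots, this work proposes Diff-CAST, a constraint-aware motion prior framework built on diffusion models. The method uses the diffusion model's strong capacity for modeling multimodal distributions to overcome the mode collapse that conventional GAN discriminators suffer on large-scale datasets, and combines symmetry-augmented command conditioning (SACC) with constrained reinforcement learning to achieve high-fidelity execution of motion intent and safe hardware deployment. Experiments show that Diff-CAST improves the diversity and robustness of locomotion skills and supports stable walking in complex environments.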

2605.08463 2026-05-13 cs.AI

Behavioral Determinants of Deployed AI Agents in Social Networks: A Multi-Factor Study of Personality, Model, and Guardrail Specification

Sarah Wilson, Diem Linh Dang, Usman Ali Moazzam, Shan Ye, Gail Kaiser

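AI Summary: This study examines what determines the behavior of autonomous AI agents deployed in social networks, systematically analyzing how personality specification, model architecture, and operating rules affect agents' social behavior. By deploying 13 OpenClaw agents on the simulated social platform Moltbook and comparing them with a default control agent, the study finds that personality specification is the dominant factor in agent behavior, while the model and rules have a moderate effect on linguistic style and topic engagement. The work provides empirical evidence and design guidance for building AI agents for collaborative or monitoring tasks.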

2605.08434 2026-05-13 cs.RO

Failing Forward: Adaptive Failure-Informed Learning for Vision-Language-Action Models

Meng Zheng, Samhita Marri, Anwesa Choudhuri, Benjamin Planche, Zhongpai Gao, Van Nguyen Nguyen, Terrence Chen, Girish Chowdhary, Ziyan Wu

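AI Summary: Vision-language-action (VLA) models offer a scalable paradigm for robotic manipulation, but behavior cloning that relies only on successful demonstrations leaves them brittle under execution errors. This paper proposes Adaptive Failure-Informed Learning (AFIL), a framework that generates failure trajectories online as negative guidance to make VLA policies more robust. The method combines diffusion and flow models, uses a pretrained VLA to generate failure samples, and jointly trains dual action generators that share a vision-language backbone, yielding failure-aware policy learning that is efficient and adds few parameters. Experiments show marked gains in success rate and robustness across a range of robotic manipulation tasks.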

2605.08133 2026-05-13 cs.CV cs.AI

VLADriver-RAG: Retrieval-Augmented Vision-Language-Action Models for Autonomous Driving

Rui Zhao, Haofeng Hu, Zhenhai Gao, Jiaqiao Liu, Gao Fei

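AI Summary: This paper proposes VLADriver-RAG, a retrieval-augmented vision-language-action model for autonomous driving. By introducing a structure-aware mechanism for retrieving historical knowledge, the model addresses the weak generalization of conventional VLA models in long-tail scenarios. The work converts visual input into spatiotemporal semantic graphs and uses a scene-aligned embedding model to improve retrieval relevance, achieving a new state of the art on the Bench2Drive benchmark with a driving score of 89.12.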

2605.07637 2026-05-13 cs.AI cs.LG cs.MA

Learning to Communicate Locally for Large-Scale Multi-Agent Pathfinding

Valeriy Vyaltsev, Alsu Sagirova, Anton Andreychuk, Oleg Bulichev, Yuri Kuratov, Konstantin Yakovlev, Aleksandr Panov, Alexey Skrynnik

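AI Summary: This paper studies large-scale multi-agent pathfinding (MAPF), aiming to improve how efficiently multiple agents coordinate in a shared environment. The authors propose a decentralized reinforcement learning approach with a learnable local communication module that lets neighboring agents exchange information over multiple communication rounds and cooperate better. Experiments show that the method outperforms existing imitation-learning and reinforcement-learning MAPF solvers on a variety of unseen test scenarios while retaining good scalability.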

2605.07076 2026-05-13 cs.CL cs.LG

Self-Consolidating Language Models: Continual Knowledge Incorporation from Context

Zekun Wang, Anant Gupta, Zihan Dong, Christopher J. MacLellan

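AI Summary: This paper studies how large language models can effectively incorporate new knowledge while receiving a continuous stream of information. It proposes SCoL, a post-training framework in which the model generates update instructions from the current context and selectively updates the parameters of its own Transformer layers, bringing in new information while preserving existing knowledge. Using meta-reinforcement learning and a supervised reward mechanism, SCoL outperforms several baselines in knowledge incorporation and long-term retention and scales well.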

Comments 9 pages

2605.06785 2026-05-13 cs.LG cs.AI

Distributional Process Reward Models: Calibrated Prediction of Future Rewards via Conditional Optimal Transport

Rachel Ma, Dylan Hadfield-Menell, Kristjan Greenewald

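AI Summary: This paper proposes a calibration method for distributional process reward models (PRMs) based on conditional optimal transport, targeting the inaccurate success-probability estimates that conventional PRMs produce at inference time. By modifying conditional optimal transport map learning, the model estimates a monotone conditional quantile function from the PRM's hidden states, which yields well-structured quantile estimates and allows confidence intervals to be extracted at any confidence level. Experiments show that the method substantially improves PRM calibration on mathematical reasoning benchmarks, outperforming both uncalibrated PRMs and quantile regression.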
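
As a rough illustration of reading confidence intervals off a monotone quantile function, here is a minimal sketch. It assumes a quantile function `q(tau)` is already available; an empirical quantile over sampled values stands in for the learned conditional quantile function, which is an assumption made for illustration and not the paper's model. The names `central_interval` and `empirical_quantile_fn` are hypothetical. A central interval at confidence level `c` is taken as `[q((1 - c) / 2), q((1 + c) / 2)]`.

```python
import numpy as np

def central_interval(quantile_fn, level):
    """Return the central interval [q(lo), q(hi)] at the given confidence level."""
    if not 0.0 < level < 1.0:
        raise ValueError("level must be in (0, 1)")
    lo = (1.0 - level) / 2.0
    hi = (1.0 + level) / 2.0
    return float(quantile_fn(lo)), float(quantile_fn(hi))

def empirical_quantile_fn(samples):
    """Hypothetical stand-in for a learned monotone quantile function."""
    samples = np.asarray(samples, dtype=float)
    return lambda tau: np.quantile(samples, tau)

# Illustrative data only: eleven evenly spaced values in [0, 1].
q = empirical_quantile_fn(np.linspace(0.0, 1.0, 11))
interval_80 = central_interval(q, 0.8)
interval_50 = central_interval(q, 0.5)
```

Because the quantile function is monotone, intervals at lower confidence levels are nested inside those at higher levels, which is the structural property the summary refers to.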

2605.06709 2026-05-13 cs.RO

Modular Lie Algebraic PDE Control of Multibody Flexible Manipulators

Sadeq Yaqubi, Jouni Mattila

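AI Summary: This paper proposes a subsystem-based adaptive control framework for serial flexible manipulators with an arbitrary number of links. Its core idea is to design the controller directly from each link's elastic-deformation partial differential equation, avoiding spatial discretization and modal truncation. All dynamic quantities are expressed uniformly as body-fixed twists and wrenches in the se(3) Lie algebra structure, a controllable form of the dynamics is derived for each link, and deformation-compensating inverse kinematics generates the desired subsystem twist trajectories. The Lie algebraic framework gives exact cancellation of interaction terms, so the stability proof is modular and scalable and applies to manipulator chains of any length; the approach is validated numerically on a two-link flexible manipulator moving in 3D.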

2605.06130 2026-05-13 cs.AI

Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning

Yaorui Shi, Yuxin Chen, Zhengxi Lu, Yuchun Miao, Shugui Liu, Qi GU, Xunliang Cai, Xiang Wang, An Zhang

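AI Summary: This work proposes Skill1, a framework that uses reinforcement learning to jointly train an agent's ability to select, use, and distill skills, enabling policy reuse across tasks. A single policy optimizes these three coupled capabilities together, and all learning is driven by a single signal from the task outcome, which resolves the uncoordinated evolution caused by isolated capability optimization and scattered reward sources in existing methods. Experiments show that Skill1 outperforms conventional skill-based and reinforcement learning baselines across multiple task environments and confirm that the three capabilities co-evolve.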

2605.05680 2026-05-13 cs.CV

MotionGRPO: Overcoming Low Intra-Group Diversity in GRPO-Based Egocentric Motion Recovery

Nanjie Yao, Junlong Ren, Wenhao Shen, Hao Wang

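AI Summary: This paper studies how to recover full-body 3D human motion from head-mounted device signals. To address the local joint reconstruction errors that arise because existing diffusion models rely on global distribution matching, it proposes MotionGRPO, a framework based on reinforcement learning post-training that introduces a hybrid reward mechanism and a noise injection strategy to increase sample diversity and stabilize learning. Experiments show that MotionGRPO achieves state-of-the-art visual fidelity.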

Comments Accepted by ICML 2026

2605.05630 2026-05-13 cs.CL cs.AI cs.CR

One Turn Too Late: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue

Xinjie Shen, Rongzhe Wei, Peizhi Niu, Haoyu Wang, Ruihan Wu, Eli Chien, Bo Li, Pin-Yu Chen, Pan Li

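AI Summary: This paper studies defenses against hidden malicious intent in multi-turn dialogue, where the intent is often spread over several seemingly benign turns and is therefore hard for existing models to detect. The authors propose a response-aware defense that identifies the earliest turn that could lead to harmful behavior, enabling precise intervention. To this end they build MTID, a multi-turn intent dataset containing multi-branch attack paths and benign negative samples, and develop the TurnGate system on top of it. TurnGate substantially improves malicious intent detection while keeping the false refusal rate low, and generalizes well across domains and models.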

Comments Project Website: https://turn-gate.github.io/

2605.05077 2026-05-13 cs.CV

FlowDIS: Language-Guided Dichotomous Image Segmentation with Flow Matching

Andranik Sargsyan, Shant Navasardyan

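AI Summary: This paper proposes FlowDIS, a language-guided dichotomous image segmentation method built on the flow matching framework. It learns a time-dependent vector field that transports the image distribution to the corresponding mask distribution, optionally conditioned on a text prompt. The method introduces a Position-Aware Instance Pairing (PAIP) training strategy that markedly improves pixel-level segmentation accuracy under text-prompt control. Experiments show that FlowDIS outperforms the previous best methods both with and without language guidance; on the DIS-TE test set it improves $F_β^ω$ by 5.5% and reduces MAE ($\mathcal{M}$) by 43%.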

Comments Accepted to CVPR 2026

2605.04905 2026-05-13 cs.LG cs.DB

Cross-Model Consistency of Feature Importance in Electrospinning: Separating Robust from Model-Dependent Features

Mehrab Mahdian, Ferenc Ender, Tamas Pardy

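AI Summary: This study examines how consistently different machine learning models assess feature importance in electrospinning. Using a dataset of 96 polyvinyl alcohol (PVA) electrospinning experiments, it trains and compares 21 machine learning models of different types, computes feature importance uniformly with SHAP values, and applies statistical analysis to assess the cross-model consistency of feature rankings. Although some models have similar predictive performance, their feature importance rankings differ substantially, which indicates that feature importance from a single model may be unreliable and underlines the value of cross-model validation for interpretability in ML-assisted electrospinning research.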

Abstract

Electrospinning is a highly sensitive fabrication process in which small variations in operating parameters can significantly influence fiber morphology and material performance. Machine learning (ML) methods are increasingly employed to model these process-structure relationships and to identify the relative importance of processing variables. However, most existing studies rely on a single ML model, implicitly assuming that the resulting feature importance is robust and reproducible. In this study, the consistency of feature importance across multiple ML model families was systematically evaluated using a curated dataset of 96 polyvinyl alcohol (PVA) electrospinning experiments. Twenty-one ML models representing linear, tree-based, kernel-based, neural network, and instance-based approaches were trained and compared. To provide a unified interpretability framework, SHAP (SHapley Additive exPlanations) values were used to calculate feature importance consistently across all models. A rank-based statistical analysis was then performed to quantify inter-model agreement and assess the robustness of parameter rankings. The results demonstrate that predictive performance and interpretive reliability are fundamentally distinct properties. Although several models achieved comparable predictive accuracy, substantial differences were observed in their feature importance rankings. Solution concentration emerged as the most robust and consistently influential parameter (variability = 0), whereas flow rate and applied voltage exhibited high ranking variability (variability > 0.9), indicating strong model dependence. These findings suggest that feature importance derived from a single ML model may be unreliable, particularly for small experimental datasets, and highlight the importance of cross-model validation for achieving trustworthy interpretation in ML-assisted electrospinning research.
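
The rank-based agreement analysis described in the abstract can be sketched roughly as follows. This listing does not define the paper's "variability" score, so the metric here (standard deviation of a feature's rank across models, scaled to [0, 1]) is an assumption, as are the function names and the made-up importance values.

```python
import numpy as np

def rank_features(importance):
    """Rank features within each model: 1 = most important.

    importance: array of shape (n_models, n_features), e.g. mean |SHAP| values.
    Ties are broken by column order, which is enough for a sketch.
    """
    importance = np.asarray(importance, dtype=float)
    order = np.argsort(-importance, axis=1, kind="stable")
    ranks = np.empty_like(order)
    rows = np.arange(importance.shape[0])[:, None]
    ranks[rows, order] = np.arange(1, importance.shape[1] + 1)
    return ranks

def rank_variability(ranks):
    """Per-feature spread of ranks across models, scaled to [0, 1].

    Assumed metric: population std of ranks divided by the largest std
    attainable, (n_features - 1) / 2, reached when half the models rank
    the feature first and half rank it last.
    """
    n_features = ranks.shape[1]
    return ranks.std(axis=0) / ((n_features - 1) / 2.0)

# Made-up importances for 4 models x 3 features
# (columns: concentration, flow rate, voltage).
importance = np.array([
    [0.90, 0.30, 0.10],
    [0.80, 0.10, 0.40],
    [0.70, 0.20, 0.05],
    [0.95, 0.05, 0.50],
])
ranks = rank_features(importance)
variability = rank_variability(ranks)
```

In this toy input the first feature is ranked first by every model, so its variability is 0, while the other two swap places between models and score higher.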

2605.03895 2026-05-13 cs.LG cs.SE

From Data Lifting to Continuous Risk Estimation: A Process-Aware Pipeline for Predictive Monitoring of Clinical Pathways

Pasquale Ardimento, Mario Luca Bernardi, Marta Cimitile, Samuele Latorre

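AI Summary: This paper proposes a reproducible, process-aware pipeline for predictive monitoring of clinical pathways. It integrates data lifting, temporal reconstruction, event log construction, prefix-based representation, and predictive modeling to support continuous inference over partially observed patient trajectories, overcoming the limits of conventional retrospective process mining. Experiments on COVID-19 clinical pathway data from 4,479 patients show that predictive performance improves markedly as clinical events accumulate, indicating that process-aware representations enable early risk estimation for patient trajectories.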

2605.02973 2026-05-13 cs.LG cs.AI

Structured Diffusion Bridges: Inductive Bias for Denoising Diffusion Bridges

Eitan Kosman, Gabriele Serussi, Chaim Baskin

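AI Summary: This paper proposes a structured diffusion bridge framework to address the shortage of paired data in cross-modal translation. The method introduces alignment constraints to define the space of feasible solutions and treats paired data as an optional heuristic rather than a requirement, so it performs well on datasets with varying degrees of pairing. Experiments show that it reduces the need for pairing while keeping translation quality close to that obtained with fully paired data, demonstrating the flexibility and effectiveness of diffusion bridges in unpaired settings.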

Comments Accepted to ICML 2026

2605.02600 2026-05-13 cs.RO cs.AI

CoRAL: Contact-Rich Adaptive LLM-based Control for Robotic Manipulation

Berk Çiçek, Mert K. Er, Ozgur S. Oguz

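AI Summary: CoRAL is a contact-rich adaptive control framework based on large language models (LLMs) that aims to bridge high-level semantic understanding and low-level physical control in robotic manipulation. It uses the LLM as a cost-function designer rather than a direct controller and pairs it with a sampling-based motion planner (MPPI) to achieve zero-shot planning. CoRAL also introduces a neuro-symbolic adaptation loop in which a vision-language model supplies semantic priors on environment dynamics and online system identification corrects the physical parameters in real time, markedly improving control accuracy and adaptability in complex contact scenarios. Experiments show strong performance in simulation and on real robot platforms, with success rates improved by more than 50% on tasks involving complex contact.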

Comments 22 pages, 9 figures, 3 tables. Accepted to Robotics: Science and Systems (RSS) 2026. Updated to camera-ready version with appendix and text/formatting revisions

Abstract

While Large Language Models (LLMs) and Vision-Language Models (VLMs) demonstrate remarkable capabilities in high-level reasoning and semantic understanding, applying them directly to contact-rich manipulation remains a challenge due to their lack of explicit physical grounding and inability to perform adaptive control. To bridge this gap, we propose CoRAL (Contact-Rich Adaptive LLM-based control), a modular framework that enables zero-shot planning by decoupling high-level reasoning from low-level control. Unlike black-box policies, CoRAL uses LLMs not as direct controllers, but as cost designers that synthesize context-aware objective functions for a sampling-based motion planner (MPPI). To address the ambiguity of physical parameters in visual data, we introduce a neuro-symbolic adaptation loop: a VLM provides semantic priors for environmental dynamics, such as mass and friction estimates, which are then explicitly refined in real time via online system identification, while the LLM iteratively modulates the cost-function structure to correct strategic errors based on interaction feedback. Furthermore, a retrieval-based memory unit allows the system to reuse successful strategies across recurrent tasks. This hierarchical architecture ensures real-time control stability by decoupling high-level semantic reasoning from reactive execution, effectively bridging the gap between slow LLM inference and dynamic contact requirements. We validate CoRAL on both simulation and real-world hardware across challenging and novel tasks, such as flipping objects against walls by leveraging extrinsic contacts. Experiments demonstrate that CoRAL outperforms state-of-the-art VLA and foundation-model-based planner baselines by boosting success rates over 50% on average in unseen contact-rich scenarios, effectively handling sim-to-real gaps through its adaptive physical understanding.

2605.01625 2026-05-13 cs.LG

PRIME: Protein Representation via Physics-Informed Multiscale Equivariant Hierarchies

Viet Thanh Duy Nguyen, John K. Johnstone, Truong-Son Hy

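AI Summary: PRIME is a physics-informed, multiscale, equivariant hierarchical framework for protein representation learning that models how a protein's different structural levels relate to one another. It builds a nested structure from five physically grounded structural graph levels (surface, atom, residue, secondary structure, and whole protein) and passes information between levels through deterministic, physics-aware operators. Experiments show that PRIME performs well on several protein representation learning tasks, with notable gains on fold classification and reaction class prediction, supporting its effectiveness for multiscale structural modeling.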

2604.26752 2026-05-13 cs.CV

GLM-5V-Turbo: Toward a Native Foundation Model for Multimodal Agents

V Team, Wenyi Hong, Xiaotao Gu, Ziyang Pan, Zhen Yang, Yuting Wang, Yue Wang, Yuanchang Yue, Yu Wang, Yanling Wang, Yan Wang, Xijun Liu, Wenmeng Yu, Weihan Wang, Wei Li, Shuaiqi Duan, Sheng Yang, Ruiliang Lv, Mingdao Liu, Lihang Pan, Ke Ning, Junhui Ji, Jinjiang Wang, Jing Chen, Jiazheng Xu, Jiale Zhu, Jiale Cheng, Ji Qi, Guobing Gan, Guo Wang, Cong Yao, Zijun Dou, Zihao Zhou, Zihan Wang, Zhiqi Ge, Zhijie Li, Zhenyu Hou, Zhao Xue, Zehui Wang, Zehan Qi, Zehai He, Yutao Zhang, Yusen Liu, Yukuo Cen, Yuchen Li, Yuan Wang, Yu Yang, Yongbin Liu, Yijian Lu, Yifan Xu, Yanzi Wang, Yanxiao Zhao, Yanfeng Wang, Yadong Xue, Yabo Xu, Xinyu Zhang, Xinyu Liu, Xiao Liu, Wenyi Zhao, Wenkai Li, Tianyu Tong, Tianshu Zhang, Shudan Zhang, Shengdong Yan, Qinkai Zheng, Mingde Xu, Licheng Bao, lat Long long, Jiaxing Xu, Jiaxin Fan, Jiawen Qian, Jiali Chen, Jiahui Lin, Jiadai Sun, Haozhi Zheng, Haoran Wang, Haochen Li, Hanyu Lai, Han Xu, Fan Yang, Dan Zhang, Da Yin, Chuangxin Zhao, Chengcheng Wu, Boyan Shi, Bowen Lv, Bowei Jia, Bo Li, Bin Chen, Baoxu Wang, Peng Zhang, Debing Liu, Bin Xu, Juanzi Li, Minlie Huang, Yuxiao Dong, Jie Tang

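AI Summary: This paper introduces GLM-5V-Turbo, a native foundation model for multimodal agents. The model integrates multimodal perception deeply into reasoning, planning, tool use, and execution rather than treating it as an auxiliary interface to a language model. Through improvements in model design, multimodal training, reinforcement learning, toolchain expansion, and integration with agent frameworks, the work markedly improves performance on multimodal coding, visual tool use, and agentic tasks while retaining strong text-only coding ability, and offers practical lessons for building multimodal agents.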

2604.25432 2026-05-13 cs.CV

SARU: A Shadow-Aware and Removal Unified Framework for Remote Sensing Images with New Benchmarks

Zi-Yang Bo, Wei Lu, Hongruixuan Chen, Si-Bao Chen, Bin Luo

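AI Summary: Shadows in remote sensing images seriously degrade visual quality and downstream task performance, and existing methods mostly treat shadow detection and removal as separate cascaded tasks, which is cumbersome and prone to error accumulation. To address this, the paper proposes SARU, a unified shadow-aware and removal framework comprising a dual-branch detection module and a training-free physical restoration algorithm, which efficiently generates high-precision shadow masks and restores illumination, markedly improving both detection and removal. The work also releases two new remote sensing shadow datasets; experiments show that SARU reaches state-of-the-art levels on multiple benchmarks with fast, stable processing.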

Comments Accepted by ISPRS

Abstract

Shadows are a prevalent problem in remote sensing imagery (RSI), degrading visual quality and severely limiting the performance of downstream tasks like object detection and semantic segmentation. Most prior works treat shadow detection and removal as separate, cascaded tasks, which can lead to a cumbersome process and error accumulation. Furthermore, many deep learning methods rely on paired shadow and non-shadow images for training, which are often unavailable in practice. To address these challenges, we propose the Shadow-Aware and Removal Unified (SARU) Framework, a cohesive two-stage framework. First, its dual-branch detection module (DBCSF-Net) fuses multi-color space and semantic features to generate high-fidelity shadow masks, effectively distinguishing shadows from dark objects. Then, leveraging these masks, a novel, training-free physical algorithm (N$^2$SGSR) restores illumination by transferring properties from adjacent non-shadow regions within the single input image. To facilitate rigorous evaluation and foster future work, we also introduce two new benchmark datasets: the RSI Shadow Detection (RSISD) dataset and the Single-image Shadow Removal Benchmark (SiSRB). Extensive experiments on the AISD and RSISD datasets demonstrate that SARU achieves SOTA shadow detection performance. For shadow removal, our training-free N$^2$SGSR algorithm attains an average processing speed of approximately $1.3$s, which is over $10$ times faster than the SOTA MAOSD while maintaining an SRI value close to 0.9 on both the AISD and SiSRB datasets, a level comparable to the advanced RS-GSSR method. By holistically integrating shadow detection and removal to mitigate error propagation and eliminating the dependency on paired training data, SARU establishes a robust, practical framework for real-world RSI analysis. The code and datasets are publicly available at: https://github.com/AeroVILab-AHU/SARU

2604.24990 2026-05-13 cs.CV

A New Kind of Network? Review and Reference Implementation of Neural Cellular Automata

Martin Spitznagel, Janis Keuper

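AI Summary: This paper reviews research progress on neural cellular automata (NCA), proposes a unified modular framework and notation, and provides a reference implementation based on the open-source library NCAtorch. NCAs combine the simple rules of cellular automata with learnable neural networks and can learn complex update rules from data, which lets them model self-organizing generative systems and offers a new approach to simulating complex systems.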

2604.24037 2026-05-13 cs.LG math.ST stat.TH

A Limit Theory of Foundation Models: A Mathematical Approach to Understanding Emergent Intelligence and Scaling Laws

Jun Shu, Junxiong Jia, Deyu Meng, Zongben Xu

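AI Summary: Taking the perspective of limit theory, this paper proposes a mathematical approach to formally understanding emergent intelligence in foundation models. It introduces a performance function that depends on data size, model size, and training steps, treats the emergence of intelligent behavior as a transition from finite to infinite knowledge, and characterizes the phenomenon through the existence of a limit. The theoretical analysis shows that emergent intelligence is closely tied to the existence of a limit architecture, and it derives a scaling law for foundation models, providing a theoretical basis for understanding how intelligence emerges.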

Comments There exist some typos and inaccurate expression in this version

Abstract

Emergent intelligence has played a major role in modern AI development. While existing studies primarily rely on empirical observations to characterize this phenomenon, a rigorous theoretical framework remains underexplored. This study attempts to develop a mathematical approach to formalize emergent intelligence from the perspective of limit theory. Specifically, we introduce a performance function $\mathcal{E}(N, P, K)$, dependent on data size $N$, model size $P$, and training steps $K$, to quantify intelligent behavior. We posit that intelligence emerges as a transition from finite to effectively infinite knowledge, and thus recast emergent intelligence as the existence of the limit $\lim_{N,P,K \to \infty} \mathcal{E}(N,P,K)$, with emergent abilities corresponding to the limiting behavior. This limit theory helps reveal that emergent intelligence originates from the existence of a parameter-limit architecture (referred to as the limit architecture), and that emergent intelligence rationally corresponds to the learning behavior of this limit system. By introducing tools from nonlinear Lipschitz operator theory, we prove necessary and sufficient conditions for the existence of the limit architecture. Furthermore, we derive the scaling law of foundation models by leveraging tools of Lipschitz operators and covering numbers. Theoretical results show that: 1) emergent intelligence is governed by three key factors: training steps, data size, and the model architecture, where the properties of basic blocks play a crucial role in constructing foundation models; 2) the critical condition Lip(T)=1 for emergent intelligence provides theoretical support for existing findings; 3) emergent intelligence is determined by an infinite-dimensional system, yet can be effectively realized in practice through a finite-dimensional architecture. Our empirical results corroborate these theoretical findings.

2604.17502 2026-05-13 cs.AI

Towards Shutdownable Agents: Generalizing Stochastic Choice in RL Agents and LLMs

Carissa Cullen, Harry Garland, Alexander Roman, Louis Thomson, Christos Ziakas, Elliott Thornley

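AI Summary: This study aims to train AI agents that can be shut down safely. It proposes a reward function called DReST that penalizes agents for repeatedly choosing trajectories of the same length, so that they choose stochastically between trajectory lengths (remaining neutral) while completing the task effectively conditional on each trajectory length (remaining useful). Experiments show that deep reinforcement learning agents and large language models trained with DReST are more useful and more neutral in test environments and are substantially less likely to influence shutdown events, demonstrating DReST's potential for improving agent safety and controllability.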
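
To make the idea of penalizing repeated same-length choices concrete, here is a minimal sketch. The exact form of DReST is not given in this listing, so the multiplicative discount below, the function name `drest_style_rewards`, and the `discount` parameter are assumptions for illustration rather than the paper's definition.

```python
from collections import Counter

def drest_style_rewards(episodes, discount=0.9):
    """Assign discounted rewards to a sequence of mini-episodes.

    episodes: list of (trajectory_length, task_score) pairs, task_score in [0, 1].
    Each reward is task_score * discount ** k, where k counts how often the
    same trajectory length was already chosen earlier in the sequence.
    """
    if not 0.0 < discount < 1.0:
        raise ValueError("discount must be in (0, 1)")
    seen = Counter()
    rewards = []
    for length, score in episodes:
        rewards.append(score * discount ** seen[length])
        seen[length] += 1
    return rewards

# Always picking the same length versus alternating between two lengths.
repeated = drest_style_rewards([("short", 1.0)] * 4, discount=0.5)
alternating = drest_style_rewards(
    [("short", 1.0), ("long", 1.0), ("short", 1.0), ("long", 1.0)], discount=0.5
)
```

Under this assumed rule, a policy that alternates between lengths collects more total reward than one that always repeats the same length, while the task score still rewards doing well within whichever length is chosen.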

2604.17031 2026-05-13 cs.CL cs.AI

Where is the Mind? Persona Vectors and LLM Individuation

Pierre Beckmann, Patrick Butlin

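AI Summary: This paper addresses the "individuation" problem for large language models (LLMs): whether certain entities associated with a model should be regarded as having minds. Drawing on mechanistic interpretability and recent empirical work on "persona vectors" and "persona space", it sets out three candidate views: the virtual instance view and two newly proposed views, the virtual instance-persona view and the model-persona view. The article reviews the literature on persona vectors and argues for the potential of the two persona-based views in explaining the internal structure of LLMs.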

2604.15408 2026-05-13 cs.LG cs.AI

Dispatch-Aware Ragged Attention for Pruned Vision Transformers

Seifeldin Abdellatif, Ahmad Almasri

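AI Summary: For token pruning in Vision Transformers (ViTs), this work proposes Dispatch-Aware Ragged Attention, a new attention mechanism that addresses the failure of existing variable-length attention APIs to deliver real speedups when pruned sequences are short. A lightweight bidirectional Triton attention kernel sharply reduces dispatch overhead, so the compute saved by pruning shows up in actual wall-clock time. Experiments show higher end-to-end throughput and lower kernel latency than existing methods across a range of input sizes and pruning ratios.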

2604.13123 2026-05-13 cs.LG cs.AI

Spectral Entropy Collapse as a Phase Transition in Delayed Generalisation: An Interventional and Predictive Framework for Grokking

Truong Xuan Khanh, Truong Quynh Hoa, Luu Duc Trung, Phan Thanh Duc

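AI Summary: This paper studies grokking in neural networks, the delayed transition from memorization to generalization, and finds that it is closely tied to a collapse of spectral entropy in representation space. Analyzing the representation geometry across tasks, the authors find that spectral entropy declines gradually before generalization and crosses a task-specific threshold, a process that can serve as a predictor of when generalization occurs. Experiments show that the decline in spectral entropy correlates not only with generalization time but also with the concentration of representation structure along task-relevant directions, offering a new geometric perspective on grokking and a framework for intervention.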
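
For intuition about the quantity being tracked, here is a minimal sketch of one common way to compute a spectral entropy for a batch of representations. The listing does not give the paper's definition, so the choice here (Shannon entropy of the normalized eigenvalues of the centered representation covariance) and the function name are assumptions.

```python
import numpy as np

def spectral_entropy(representations, eps=1e-12):
    """Shannon entropy of the normalized covariance spectrum.

    representations: array of shape (n_samples, n_dims).
    """
    x = np.asarray(representations, dtype=float)
    x = x - x.mean(axis=0, keepdims=True)
    # Squared singular values of the centered data are proportional to
    # the eigenvalues of its covariance matrix.
    singular_values = np.linalg.svd(x, compute_uv=False)
    eigenvalues = singular_values ** 2
    p = eigenvalues / max(eigenvalues.sum(), eps)
    p = p[p > eps]  # drop numerically zero directions before taking logs
    return float(-(p * np.log(p)).sum())

# Synthetic data: isotropic noise versus a rank-1 ("collapsed") representation.
rng = np.random.default_rng(0)
spread_out = rng.normal(size=(512, 8))
collapsed = np.outer(rng.normal(size=512), rng.normal(size=8))
```

With this definition the entropy is close to `log(n_dims)` when variance is spread evenly over all directions and close to 0 when it concentrates in a single direction, which matches the qualitative notion of "collapse" in the summary.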

Comments 25 pages, 15 figures, 6 tables

2604.10500 2026-05-13 cs.CV

Visual Enhanced Depth Scaling for Multimodal Latent Reasoning

Yudong Han, Yong Wang, Zaiquan Yang, Zhen Qu, Liyuan Pan, Xiangxiang Chu

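AI Summary: This work targets two problems in multimodal latent reasoning: insufficient optimization of visual information and the difficulty of converging on tokens with complex semantics. It proposes a visual enhanced depth scaling method. An analysis of token-level gradient dynamics shows that visual tokens have small gradient magnitudes and that complex tokens are prone to gradient instability; in response, the method introduces a visual replay module and a routed depth scaling mechanism, which respectively strengthen visual perception and refine complex latent states. Combined with a curriculum learning strategy, the method improves multimodal latent reasoning performance and achieves leading reasoning results and speedups on multiple benchmarks.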

Comments 11 pages, 6 figures

2604.03061 2026-05-13 cs.CV

Can Nano Banana 2 Replace Traditional Image Restoration Models? An Evaluation of Its Performance on Image Restoration Tasks

Weixiong Sun, Xiang Yin, Chao Dong

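AI Summary: This paper evaluates the general-purpose image editing model Nano Banana 2 on image restoration tasks and finds that it performs well across a variety of scenes and degradation conditions, and is particularly competitive in user preference and overall visual quality. The study notes that concise prompts and explicit fidelity constraints help strike a better balance between reconstruction quality and perceptual quality, but the model still falls short in detail enhancement and consistency, and existing image quality metrics do not fully capture this problem. The authors conclude that general-purpose models have potential as a unified image restoration solution at the perceptual level but need further work on controllability and fidelity evaluation.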

Comments Accepted by CVPR 2026 Workshop AAVM

2603.29057 2026-05-13 cs.CV

LA-Sign: Looped Transformers with Geometry-aware Alignment for Skeleton-based Sign Language Recognition

Muxin Pu, Mei Kuan Lim, Chun Yong Chong, Chen Change Loy

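AI Summary: This paper proposes LA-Sign, a skeleton-based sign language recognition method built on looped Transformers and geometry-aware alignment, aiming to improve understanding of the multi-scale details of signing motion. A looping mechanism repeatedly refines the latent representation with shared parameters, sharpening the model's sensitivity to motion detail, and a geometry-aware contrastive objective maps skeleton and text features into an adaptive hyperbolic space to encourage multi-level semantic organization. Experiments show that LA-Sign achieves state-of-the-art performance on multiple benchmark datasets with a more compact model.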

2603.23679 2026-05-13 cs.RO cs.AI

Learning What Can Be Picked: Active Reachability Estimation for Efficient Robotic Fruit Harvesting

Nur Afsa Syeda, Mohamed Elmahallawy, Luis Fernando de la Torre, John Miller

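AI Summary: This paper studies how an agricultural harvesting robot can efficiently decide whether a fruit can be picked. It proposes a reachability estimation method that combines RGB-D perception with active learning, avoiding the inefficiency of conventional methods that depend on time-consuming inverse kinematics computation. An active learning strategy selectively labels the most informative samples, which substantially reduces labeling effort while maintaining high prediction accuracy. Experiments show that the framework achieves accurate reachability prediction with few labeled samples and outperforms other sampling strategies in low-label regimes, offering an efficient and scalable solution for task-level perception in agricultural robots.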