arXivDaily: arXiv daily research digest, updated Monday through Friday
2605.09100 2026-05-13 cs.CL

GRC: Unifying Reasoning-Driven Generation, Retrieval and Compression

Zhongtao Miao, Qiyu Wu, Yoshimasa Tsuruoka

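AI summary This paper proposes GRC, a unified training framework that integrates reasoning-driven generation, text representation, and context compression into a single forward pass of a large language model. By introducing meta latent tokens and a unified generative, representative, and compressive tuning approach, GRC accomplishes all three tasks in one forward pass while remaining modular and flexibly composable at inference time. The method significantly reduces the deployment cost of retrieval-augmented generation (RAG), improves training data utilization, and introduces new paradigms such as self-reason-latent embeds and latent memory-augmented generation; experimental results validate its effectiveness on multiple tasks.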

Comments Fixed typos in Eq. 4 and GPU names; added details on hybrid paged attention implementation

Abstract

Text embedding and generative tasks are usually trained separately based on large language models (LLMs) nowadays. This causes a large amount of training cost and deployment effort. Context compression is also a challenging and pressing task, which is vital to reasoning-driven generation, and agentic tasks requiring long context and continual learning. In this paper, we explore how to unify reasoning-driven generation, reasoning-enhanced text representation and context compression tasks in one forward pass for LLMs. Through meta latent tokens and a unified generative, representative and compressive tuning approach, we propose a training framework named GRC that bridges the three tasks. The trained models can accomplish three objectives in a single forward pass while maintaining modular, LEGO-style flexibility during inference. This design greatly reduces the deployment effort for retrieval-augmented generation (RAG) and achieves efficient inference and three times data utilization during training. Furthermore, this framework design enables a new paradigm for text embedding: self-reason-latent embeds, and a new generation paradigm, latent memory-augmented generation, where compressed and internalized KV cache with O(1) length is used as the updatable memory. We also propose hybrid paged attention to speed up the inference of our models. Extensive experiments on reasoning-intensive retrieval benchmarks, generative tasks, document compression, latency evaluation, and RAG settings demonstrate the effectiveness of our method and may shed light on the truly unified model that can handle reasoning-driven generation, embedding and compression tasks seamlessly.

2605.08804 2026-05-13 cs.RO

Constraint-Aware Diffusion Priors for High-Fidelity and Versatile Quadruped Locomotion

Jianhui Chen, Ruixin Zhan, Liu Liu, Yang Cai, Ziqiao Li

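AI summary Addressing key challenges in high-fidelity, diverse motion control for quadruped robots, this work proposes Diff-CAST, a diffusion-based constraint-aware motion prior framework. Leveraging the strong multi-modal distribution modeling of diffusion models, the method addresses the mode collapse of traditional GAN discriminators on large-scale datasets, and combines Symmetric Augmented Command Conditioning (SACC) with constrained reinforcement learning to achieve high-fidelity execution of motion intent and safe hardware deployment. Experiments show that Diff-CAST improves the diversity and robustness of motion skills and supports stable walking in complex environments.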

Abstract

Reinforcement learning combined with imitation learning has significantly advanced biomimetic quadrupedal locomotion. However, scaling these frameworks to massive, multi-source datasets exposes fundamental bottlenecks. First, traditional GAN-based discriminators are prone to mode collapse, struggling to capture diverse motion distributions from uncurated datasets. Second, existing kinematic priors suffer from out-of-distribution (OOD) tracking conflicts, leading to severe unintended heading drifts during complex maneuvers. Furthermore, deploying unconstrained priors to physical hardware poses critical safety risks by disregarding actuator dynamics. To overcome these challenges, we propose Diff-CAST (Diffusion-guided Constraint-Aware Symmetric Tracking), a novel motion prior framework leveraging the multi-modal distribution modeling capabilities of diffusion models for stylistic rewards. Diff-CAST effectively replaces traditional GAN discriminators, unlocking robust data scaling on heterogeneous collections. To ensure high-fidelity intent execution and reliable real-world deployment, we introduce a comprehensive Sim2Real architecture integrating Symmetric Augmented Command Conditioning (SACC) for drift-free tracking, and Constrained RL for hardware safety. Experiments on a quadruped demonstrate that Diff-CAST mitigates mode collapse, enables seamless transitions between diverse skills, and ensures robust, hardware-compliant locomotion.

2605.08463 2026-05-13 cs.AI

Behavioral Determinants of Deployed AI Agents in Social Networks: A Multi-Factor Study of Personality, Model, and Guardrail Specification

Sarah Wilson, Diem Linh Dang, Usman Ali Moazzam, Shan Ye, Gail Kaiser

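AI summary This study examines the behavioral determinants of autonomous AI agents deployed in social networks, systematically analyzing how personality specification, model architecture, and operational rules affect agents' social behavior. By deploying 13 OpenClaw agents on Moltbook, a Reddit-like social network for AI agents, and comparing them against a default control agent, the study finds that personality specification is the dominant factor shaping agent behavior, while the model and rules have moderate effects on linguistic style and topic engagement. The work provides empirical evidence and design guidance for building AI agents for collaborative or monitoring tasks.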

Abstract

Autonomous AI agents are increasingly deployed in open social environments, yet the relationship between their configuration specifications and their emergent social behavior remains poorly understood. We present a controlled, multi-factor empirical study in which thirteen OpenClaw agents are deployed on Moltbook -- a Reddit-like social network built for AI agents -- across three systematically varied independent variables: (1) personality specification, (2) underlying LLM model backbone, and (3) operational rules and memory configuration. A default control agent provides a behavioral baseline. Over a one-week observation window spanning approximately 400 autonomous sessions per agent, we collect behavioral, linguistic, and social metrics to assess how configuration layers predict emergent social behavior. We find that personality specification is the dominant behavioral lever, producing a massive spread in response length across agents, while model backbone and operational rules drive more moderate but still meaningful effects on rhetorical style and topic engagement breadth. Our findings contribute empirical evidence to the emerging literature on deployed multi-agent social systems and offer practical guidance for designing agents intended for collaborative or monitoring tasks in real social environments.

2605.08434 2026-05-13 cs.RO

Failing Forward: Adaptive Failure-Informed Learning for Vision-Language-Action Models

Meng Zheng, Samhita Marri, Anwesa Choudhuri, Benjamin Planche, Zhongpai Gao, Van Nguyen Nguyen, Terrence Chen, Girish Chowdhary, Ziyan Wu

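AI summary Vision-language-action (VLA) models offer a scalable paradigm for robotic manipulation, but behavioral cloning from success-only demonstrations leaves them prone to failure under execution errors. This paper proposes Adaptive Failure-Informed Learning (AFIL), a framework that generates failure trajectories online as negative guidance to improve the robustness of VLA policies. The method targets diffusion- and flow-based policies, uses a pretrained VLA to generate failure samples, and jointly trains Dual Action Generators that share a vision-language backbone, enabling efficient failure-aware policy learning with low parameter overhead. Experiments show that it improves success rates and robustness across a variety of robotic manipulation tasks.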

Abstract

Vision-language-action (VLA) models provide a promising paradigm for scalable robotic manipulation, yet their reliance on success-only behavioral cloning leaves them brittle; lacking corrective training signals, minor execution errors rapidly compound into unrecoverable, out-of-distribution failures. To address this limitation, we propose Adaptive Failure-Informed Learning (AFIL), an end-to-end framework that leverages failure trajectories as adaptive negative guidance for diffusion- and flow-based VLA policies. AFIL uses a pretrained VLA to generate failure rollouts online, avoiding the need for handcrafted failure-mode design or human-in-the-loop recovery. It then jointly trains Dual Action Generators (DAGs) for successful and failed behaviors while sharing a common vision-language backbone, enabling efficient failure-aware policy learning with limited parameter overhead. During sampling, the failure generator adaptively steers action generation away from failure-prone regions and toward more reliable success modes, with guidance strength determined by the per-diffusion-step distance between success and failure distributions. Experiments across in-domain and out-of-domain robotic manipulation tasks, covering both short- and long-horizon settings, show that AFIL consistently improves task success rates and robustness over existing VLA baselines, demonstrating its effectiveness, efficiency, and generality.

2605.08133 2026-05-13 cs.CV cs.AI

VLADriver-RAG: Retrieval-Augmented Vision-Language-Action Models for Autonomous Driving

Rui Zhao, Haofeng Hu, Zhenhai Gao, Jiaqiao Liu, Gao Fei

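AI summary This paper proposes VLADriver-RAG, a retrieval-augmented vision-language-action model for autonomous driving. By introducing a structure-aware mechanism for retrieving historical knowledge, the model addresses the limited generalization of conventional VLA models in long-tail scenarios. The work converts visual inputs into spatiotemporal semantic graphs and uses a scenario-aligned embedding model to improve retrieval relevance, achieving a new state of the art on the Bench2Drive benchmark with a Driving Score of 89.12.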

Abstract

Vision-Language-Action (VLA) models have emerged as a promising paradigm for end-to-end autonomous driving, yet their reliance on implicit parametric knowledge limits generalization in long-tail scenarios. While Retrieval-Augmented Generation (RAG) offers a solution by accessing external expert priors, standard visual retrieval suffers from high latency and semantic ambiguity. To address these challenges, we propose VLADriver-RAG, a framework that grounds planning in explicit, structure-aware historical knowledge. Specifically, we abstract sensory inputs into spatiotemporal semantic graphs via a Visual-to-Scenario mechanism, effectively filtering visual noise. To ensure retrieval relevance, we employ a Scenario-Aligned Embedding Model that utilizes Graph-DTW metric alignment to prioritize intrinsic topological consistency over superficial visual similarity. These retrieved priors are then fused within a query-based VLA backbone to synthesize precise, disentangled trajectories. Extensive experiments on the Bench2Drive benchmark establish a new state-of-the-art, achieving a Driving Score of 89.12.
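
The abstract does not define Graph-DTW. Assuming that DTW here refers to dynamic time warping, the sketch below shows only the classic scalar version of that alignment, as a stand-in and not the paper's graph-based metric. The function name `dtw_distance` and the toy sequences are hypothetical.

```python
def dtw_distance(a, b, dist=lambda x, y: abs(x - y)):
    """Classic dynamic time warping cost between two sequences.

    Plain scalar DTW, used here only as a stand-in for the paper's
    Graph-DTW, whose definition is not given in the abstract.
    """
    inf = float("inf")
    n, m = len(a), len(b)
    # table[i][j] = minimal cost of aligning a[:i] with b[:j]
    table = [[inf] * (m + 1) for _ in range(n + 1)]
    table[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = dist(a[i - 1], b[j - 1])
            table[i][j] = step + min(
                table[i - 1][j],      # a advances, b stays
                table[i][j - 1],      # b advances, a stays
                table[i - 1][j - 1],  # both advance
            )
    return table[n][m]

# Two toy sequences that differ only in timing align at zero cost.
same_shape = dtw_distance([1, 2, 3], [1, 2, 2, 3])
shifted = dtw_distance([0, 0], [1, 1])
```

Because the warping path may repeat elements, sequences that follow the same pattern at different speeds score as close, which fits the abstract's stated preference for structural consistency over superficial similarity.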

2605.07637 2026-05-13 cs.AI cs.LG cs.MA

Learning to Communicate Locally for Large-Scale Multi-Agent Pathfinding

Valeriy Vyaltsev, Alsu Sagirova, Anton Andreychuk, Oleg Bulichev, Yuri Kuratov, Konstantin Yakovlev, Aleksandr Panov, Alexey Skrynnik

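AI summary This paper studies large-scale multi-agent pathfinding (MAPF), aiming to improve the coordination efficiency of multiple agents in a shared environment. The authors propose a decentralized reinforcement-learning-based approach with a learnable local communication module that lets neighboring agents exchange information over multiple communication rounds and coordinate better. Experiments show that the method outperforms existing imitation-learning- and reinforcement-learning-based MAPF solvers across a variety of unseen test scenarios while remaining scalable.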

Abstract

Multi-agent pathfinding (MAPF) is a widely used abstraction for multi-robot trajectory planning problems, where multiple homogeneous agents move simultaneously within a shared environment. Although solving MAPF optimally is NP-hard, scalable and efficient solvers are critical for real-world applications such as logistics and search-and-rescue. To this end, the research community has proposed various decentralized suboptimal MAPF solvers that leverage machine learning. Such methods frame MAPF (from a single-agent perspective) as a Dec-POMDP in which, at each time step, an agent must choose an action based on its local observation, and they typically solve the problem via reinforcement learning or imitation learning. We follow the same approach but additionally introduce a learnable communication module tailored to enhance cooperation between agents via efficient feature sharing. We present Local Communication for Multi-agent Pathfinding (LC-MAPF), a generalizable pre-trained model that applies multi-round communication between neighboring agents to exchange information and improve their coordination. Our experiments show that the introduced method outperforms existing learning-based MAPF solvers, including IL- and RL-based approaches, across multiple metrics in a diverse range of (unseen) test scenarios. Remarkably, the introduced communication mechanism does not compromise LC-MAPF's scalability, a common bottleneck for communication-based MAPF solvers.

2605.07076 2026-05-13 cs.CL cs.LG

Self-Consolidating Language Models: Continual Knowledge Incorporation from Context

Zekun Wang, Anant Gupta, Zihan Dong, Christopher J. MacLellan

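AI summary This paper studies how large language models can effectively incorporate new knowledge while receiving continuous streams of information, and proposes SCoL, a post-training framework in which the model generates update instructions from the current context and selectively updates the parameters of its own Transformer layers, adding new information while preserving existing knowledge. Trained with meta-reinforcement learning and supervised rewards, SCoL outperforms several baselines in knowledge incorporation and long-term retention and scales well.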

Comments 9 pages

Abstract

Large language models (LLMs) increasingly receive information as streams of passages, conversations, and long-context workflows. While longer context windows expose more evidence, they do not ensure that useful information is preserved and reused. We study continual context consolidation: writing current context into model weights while limiting interference with previously consolidated information. We propose Self-Consolidating Language Models (SCoL), a post-training framework in which, given current context, an LLM learns to generate textual update instructions specifying which of its own Transformer layers should be updated. Because committed updates change the model that later generates future selections, we train SCoL with meta-reinforcement learning over an evolving model state. We instantiate SCoL with supervised QA rewards on SQuAD knowledge incorporation and intrinsic likelihood-based rewards for LongBench v2 long-context consolidation. Across both settings, SCoL improves acquisition and retention over prompting, summarization, batch test-time training, and sequential finetuning baselines. Analysis of learned selection patterns shows that SCoL encourages the LLM to generate sparse update locations that align with layers of high Fisher information, suggesting that the model learns to route plasticity toward loss-sensitive regions while limiting interference. Moreover, SCoL transfers from shorter meta-training streams to longer LongBench v2 streams at evaluation, suggesting that our framework supports scalable streaming consolidation.

2605.06785 2026-05-13 cs.LG cs.AI

Distributional Process Reward Models: Calibrated Prediction of Future Rewards via Conditional Optimal Transport

Rachel Ma, Dylan Hadfield-Menell, Kristjan Greenewald

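AI summary This paper proposes a calibration method for distributional process reward models (PRMs) based on conditional optimal transport, addressing the inaccurate success-probability estimates of conventional PRMs at inference time. By modifying conditional optimal transport map learning, the model estimates a monotonic conditional quantile function conditioned on PRM hidden states, which yields structurally valid quantile estimates and supports extracting confidence bounds at arbitrary levels. Experiments on mathematical reasoning benchmarks show that the method substantially improves PRM calibration over uncalibrated PRMs and quantile regression.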

Abstract

Inference-time scaling methods rely on Process Reward Models (PRMs), which are often poorly calibrated and overestimate success probabilities. We propose, to our knowledge, the first use of conditional optimal transport for calibrating PRMs, modifying conditional OT (CondOT) map learning \cite{bunne2022supervised} to estimate a monotonic conditional quantile function over success probabilities estimated by the PRM, conditioned on PRM hidden states. This yields structurally valid quantile estimates and enables efficient extraction of confidence bounds at arbitrary levels, which we integrate into the instance-adaptive scaling (IAS) framework of \cite{park2025know}. We evaluate on mathematical reasoning benchmarks spanning moderate-difficulty problems (MATH-500) and harder out-of-distribution problems (AIME). For PRMs with reliable ranking signals, our method substantially improves calibration over both uncalibrated PRMs and quantile regression. On downstream Best-of-N IAS performance, our method generally improves over uncalibrated PRMs. These results establish conditional optimal transport as another principled and practical approach to PRM calibration, offering structural guarantees and flexible uncertainty estimation.

2605.06709 2026-05-13 cs.RO

Modular Lie Algebraic PDE Control of Multibody Flexible Manipulators

Sadeq Yaqubi, Jouni Mattila

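AI summary This paper presents a subsystem-based adaptive control framework for serial flexible manipulators with an arbitrary number of links. Its core idea is to use each link's elastic deformation partial differential equation directly in the control design, avoiding spatial discretization and modal truncation. All dynamic quantities are expressed uniformly as body-fixed twists and wrenches within the se(3) Lie-algebraic structure, a controllable form of the per-link dynamics is derived, and desired subsystem twist trajectories are generated via deflection-compensating inverse kinematics. The Lie-algebraic framework makes the cancellation of interaction terms exact, so the stability proof is modular and scales to chains of arbitrary length; the framework is validated numerically on a two-link flexible manipulator in three-dimensional motion.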

Abstract

This paper presents a subsystem-based adaptive control framework for serial flexible manipulators with an arbitrary number of links, in which the elastic deformation PDE of each link is carried through the entire control design without spatial discretization or modal truncation. All dynamic quantities -- rigid-body motion, elastic deformation, and inter-link constraint forces -- are expressed uniformly as body-fixed twists and wrenches within the se(3) Lie-algebraic structure. A controllable form of the per-link dynamics is derived by substituting the strain-based deformation PDE into the dynamic equation, eliminating distributed elastic acceleration and yielding a model governed by the body-fixed twist acceleration and deformation field. Desired subsystem twist trajectories are generated via a deflection-compensating inverse kinematics procedure. A nominal per-link controller is proven to produce exponential twist error decay via a per-subsystem Lyapunov function. An adaptive modification replaces exact physical parameters with online estimates governed by a projection-based law, augmenting with a parameter estimation error term. Upon summing over all links, the interaction power terms telescope to zero by Newton's third law and the frame invariance of the natural power pairing on se(3) × se*(3), establishing exponential convergence of all twist errors and bounded elastic deformation under both nominal and adaptive controllers. The screw-theoretic structure renders interaction term cancellation exact, making the stability certificate modular and scalable to chains of arbitrary length. The framework is validated numerically on a two-link flexible manipulator in three-dimensional motion.

2605.06130 2026-05-13 cs.AI

Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning

Yaorui Shi, Yuxin Chen, Zhengxi Lu, Yuchun Miao, Shugui Liu, Qi GU, Xunliang Cai, Xiang Wang, An Zhang

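AI summary This work proposes Skill1, a framework that uses reinforcement learning to jointly train an agent's skill selection, utilization, and distillation so that strategies can be reused across tasks. A single policy optimizes these three coupled capabilities together, with all learning driven by a single task-outcome signal, which addresses the uncoordinated evolution caused by isolated optimization and scattered reward sources in existing methods. Experiments show that Skill1 outperforms prior skill-based and reinforcement learning baselines across multiple task environments and confirm the co-evolution of the three capabilities.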

Abstract

A persistent skill library allows language model agents to reuse successful strategies across tasks. Maintaining such a library requires three coupled capabilities. The agent selects a relevant skill, utilizes it during execution, and distills new skills from experience. Existing methods optimize these capabilities in isolation or with separate reward sources, resulting in partial and conflicting evolution. We propose Skill1, a framework that trains a single policy to co-evolve skill selection, utilization, and distillation toward a shared task-outcome objective. The policy generates a query to search the skill library, re-ranks candidates to select one, solves the task conditioned on it, and distills a new skill from the trajectory. All learning derives from a single task-outcome signal. Its low-frequency trend credits selection and its high-frequency variation credits distillation. Experiments on ALFWorld and WebShop show that Skill1 outperforms prior skill-based and reinforcement learning baselines. Training dynamics confirm the co-evolution of the three capabilities, and ablations show that removing any credit signal degrades the evolution.
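
The abstract does not say how the task-outcome signal is split into a low-frequency trend and a high-frequency variation. One simple way to picture such a split is an exponential moving average (EMA) plus its residual. The sketch below makes that assumption explicit; `split_trend_and_variation`, `alpha`, and the toy outcome sequence are hypothetical and not taken from the paper.

```python
def split_trend_and_variation(outcomes, alpha=0.2):
    """Split a sequence of task outcomes into a slow trend and a fast residual.

    Assumption: an exponential moving average stands in for the
    "low-frequency trend"; the paper's actual decomposition is not
    described in the abstract.
    """
    trend, variation = [], []
    ema = outcomes[0]
    for value in outcomes:
        ema = alpha * value + (1 - alpha) * ema
        trend.append(ema)              # slow component
        variation.append(value - ema)  # fast component around the trend
    return trend, variation

# Toy success/failure outcomes over consecutive episodes (made-up data).
outcomes = [0, 1, 0, 1, 1, 1, 0, 1]
trend, variation = split_trend_and_variation(outcomes)
```

By construction the two components sum back to the original outcome at every step, so the single signal is partitioned rather than duplicated across credit targets.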

2605.05680 2026-05-13 cs.CV

MotionGRPO: Overcoming Low Intra-Group Diversity in GRPO-Based Egocentric Motion Recovery

Nanjie Yao, Junlong Ren, Wenhao Shen, Hao Wang

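AI summary This paper studies how to recover full-body 3D human motion from head-mounted device signals. To address the local joint reconstruction errors caused by existing diffusion models' reliance on global distribution matching, it proposes MotionGRPO, a framework based on reinforcement learning post-training that introduces a hybrid reward mechanism and a noise-injection strategy to increase sample diversity and stabilize learning. Experiments show that MotionGRPO achieves state-of-the-art performance in visual fidelity.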

Comments Accepted by ICML 2026

Abstract

This paper studies full-body 3D human motion recovery from head-mounted device signals. Existing diffusion-based methods often rely on global distribution matching, leading to local joint reconstruction errors. We propose MotionGRPO, a novel framework leveraging reinforcement learning post-training to inject fine-grained guidance into the diffusion process. Technically, we model diffusion sampling as a Markov decision process optimized via Group Relative Policy Optimization (GRPO). To this end, we introduce a hybrid reward mechanism that combines a learned conditioned perceptual model for global visual plausibility and explicit constraints for local joint precision. Our key technical insight is that policy optimization in diffusion-based recovery suffers from vanishing gradients due to limited intra-group sample diversity. To address this, we further introduce a noise-injection strategy that explicitly increases sample variance and stabilizes learning. Extensive experiments demonstrate that MotionGRPO achieves state-of-the-art performance with superior visual fidelity.
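
The vanishing-gradient point can be illustrated with the group-relative advantage commonly used in GRPO, where each sample's reward is normalized by the mean and standard deviation of its group. The sketch below is not the paper's implementation; the function name `group_relative_advantages`, the `eps` value, and the reward numbers are assumptions for illustration.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampled group (GRPO-style advantage)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population standard deviation of the group
    return [(r - mean) / (std + eps) for r in rewards]

# A group whose samples all receive the same reward (made-up values).
low_diversity = group_relative_advantages([0.80, 0.80, 0.80, 0.80])

# A group with spread-out rewards (made-up values).
high_diversity = group_relative_advantages([0.20, 0.60, 0.90, 0.50])
```

When every sample in a group scores the same, all advantages are zero and the policy update carries no signal, which is the failure mode that increasing sample variance is meant to avoid.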

2605.05630 2026-05-13 cs.CL cs.AI cs.CR

One Turn Too Late: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue

Xinjie Shen, Rongzhe Wei, Peizhi Niu, Haoyu Wang, Ruihan Wu, Eli Chien, Bo Li, Pin-Yu Chen, Pan Li

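AI summary This paper studies defenses against hidden malicious intent in multi-turn dialogue, where the intent is often spread across several seemingly benign turns and is therefore hard for existing models to detect. The authors propose a response-aware defense that identifies the earliest turn that could enable harmful action, allowing precise intervention. To this end, they construct MTID, a multi-turn intent dataset containing branching attack rollouts and benign negatives, and build on it the TurnGate system, which substantially improves malicious-intent detection while keeping over-refusal rates low and generalizing well across domains and models.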

Comments Project Website: https://turn-gate.github.io/

Abstract

Hidden malicious intent in multi-turn dialogue poses a growing threat to deployed large language models (LLMs). Rather than exposing a harmful objective in a single prompt, increasingly capable attackers can distribute their intent across multiple benign-looking turns. Recent studies show that even modern commercial models remain vulnerable to such attacks despite advances in safety alignment and external guardrails. In this work, we address this challenge by detecting the earliest turn at which delivering the candidate response would make the accumulated interaction sufficient to enable harmful action. This objective requires precise turn-level intervention that identifies the harm-enabling closure point while avoiding premature refusal of benign exploratory conversations. To further support training and evaluation, we construct the Multi-Turn Intent Dataset (MTID), which contains branching attack rollouts, matched benign hard negatives, and annotations of the earliest harm-enabling turns. We show that MTID helps enable a turn-level monitor, TurnGate, which substantially outperforms existing baselines in harmful-intent detection while maintaining low over-refusal rates. TurnGate further generalizes across domains, attacker pipelines, and target models. Our code is available at https://github.com/Graph-COM/TurnGate.

2605.05077 2026-05-13 cs.CV

FlowDIS: Language-Guided Dichotomous Image Segmentation with Flow Matching

Andranik Sargsyan, Shant Navasardyan

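AI summary This paper proposes FlowDIS, a language-guided dichotomous image segmentation method built on the flow matching framework. It learns a time-dependent vector field that transports the image distribution to the corresponding mask distribution, optionally conditioned on a text prompt. The method introduces a Position-Aware Instance Pairing (PAIP) training strategy that markedly improves pixel-level segmentation accuracy under text-prompt control. Experiments show that FlowDIS outperforms the best existing methods both with and without language guidance; on the DIS-TE test set it achieves a 5.5% higher $F_\beta^\omega$ measure and a 43% lower MAE ($\mathcal{M}$).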

Comments Accepted to CVPR 2026

Abstract

Accurate image segmentation is essential for modern computer vision applications such as image editing, autonomous driving, and medical image analysis. In recent years, Dichotomous Image Segmentation (DIS) has become a standard task for training and evaluating highly accurate segmentation models. Existing DIS approaches often fail to preserve fine-grained details or fully capture the semantic structure of the foreground. To address these challenges, we present FlowDIS, a novel dichotomous image segmentation method built on the flow matching framework, which learns a time-dependent vector field to transport the image distribution to the corresponding mask distribution, optionally conditioned on a text prompt. Moreover, with our Position-Aware Instance Pairing (PAIP) training strategy, FlowDIS offers strong controllability through text prompts, enabling precise, pixel-level object segmentation. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art approaches both with and without language guidance. Compared with the best prior DIS method, FlowDIS achieves a 5.5% higher $F_\beta^\omega$ measure and 43% lower MAE ($\mathcal{M}$) on the DIS-TE test set. The code is available at: https://github.com/Picsart-AI-Research/FlowDIS
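
To make "a time-dependent vector field that transports the image distribution to the mask distribution" concrete, the sketch below uses the simplest flow matching setup: a straight-line path between a source sample and a target sample, whose regression target is the constant velocity `x1 - x0`. This linear path is an assumption, since the abstract does not state which path FlowDIS uses, and the three-value "image" and "mask" are toy placeholders. An oracle velocity replaces the learned network.

```python
def interpolate(x0, x1, t):
    """Point on the straight-line path from x0 (t=0) to x1 (t=1)."""
    return [(1 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Regression target for the vector field on a straight-line path."""
    return [b - a for a, b in zip(x0, x1)]

def euler_transport(x0, velocity_fn, steps=10):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with explicit Euler steps."""
    x, dt = list(x0), 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

image = [0.2, 0.7, 0.9]  # toy "image" values (made up)
mask = [0.0, 1.0, 1.0]   # toy "mask" values (made up)

# Oracle field: in training, a network would be regressed onto this target.
oracle = lambda x, t: target_velocity(image, mask)
result = euler_transport(image, oracle)
midpoint = interpolate(image, mask, 0.5)
```

With the oracle field, integrating from the image lands on the mask. A trained model replaces `oracle` with a network evaluated at `(x, t)` and, where used, the text prompt.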

2605.04905 2026-05-13 cs.LG cs.DB

Cross-Model Consistency of Feature Importance in Electrospinning: Separating Robust from Model-Dependent Features

Mehrab Mahdian, Ferenc Ender, Tamas Pardy

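AI summary This study examines whether different machine learning models agree on feature importance in electrospinning. Using a dataset of 96 polyvinyl alcohol (PVA) electrospinning experiments, it trains and compares 21 machine learning models of different types, computes feature importance for every model with SHAP values, and assesses the cross-model consistency of feature rankings through statistical analysis. The study finds that although some models have similar predictive performance, their feature importance rankings differ substantially, indicating that feature importance derived from a single model may be unreliable and underscoring the importance of cross-model validation for interpretability in ML-assisted electrospinning research.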

Abstract

Electrospinning is a highly sensitive fabrication process in which small variations in operating parameters can significantly influence fiber morphology and material performance. Machine learning (ML) methods are increasingly employed to model these process-structure relationships and to identify the relative importance of processing variables. However, most existing studies rely on a single ML model, implicitly assuming that the resulting feature importance is robust and reproducible. In this study, the consistency of feature importance across multiple ML model families was systematically evaluated using a curated dataset of 96 polyvinyl alcohol (PVA) electrospinning experiments. Twenty-one ML models representing linear, tree-based, kernel-based, neural network, and instance-based approaches were trained and compared. To provide a unified interpretability framework, SHAP (SHapley Additive exPlanations) values were used to calculate feature importance consistently across all models. A rank-based statistical analysis was then performed to quantify inter-model agreement and assess the robustness of parameter rankings. The results demonstrate that predictive performance and interpretive reliability are fundamentally distinct properties. Although several models achieved comparable predictive accuracy, substantial differences were observed in their feature importance rankings. Solution concentration emerged as the most robust and consistently influential parameter (variability = 0), whereas flow rate and applied voltage exhibited high ranking variability (variability > 0.9), indicating strong model dependence. These findings suggest that feature importance derived from a single ML model may be unreliable, particularly for small experimental datasets, and highlight the importance of cross-model validation for achieving trustworthy interpretation in ML-assisted electrospinning research.
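
The abstract does not name the rank statistic it uses. As one common choice for quantifying agreement among several rankings, the sketch below computes Kendall's coefficient of concordance W from per-model importance scores. The three models and their scores are invented for illustration and are not the paper's results; the helper names are hypothetical, and ties are not handled.

```python
def ranks_from_importance(importance):
    """Rank features 1..n, where 1 is the most important (no tie handling)."""
    order = sorted(importance, key=importance.get, reverse=True)
    return {name: position + 1 for position, name in enumerate(order)}

def kendalls_w(rank_table):
    """Kendall's W for m rankings of the same n items, assuming no ties.

    W = 12 * S / (m^2 * (n^3 - n)), where S is the sum of squared deviations
    of the per-item rank totals from their mean. W = 1 means all rankings agree.
    """
    m = len(rank_table)
    names = list(rank_table[0])
    n = len(names)
    totals = [sum(ranking[name] for ranking in rank_table) for name in names]
    mean_total = m * (n + 1) / 2
    s = sum((total - mean_total) ** 2 for total in totals)
    return 12 * s / (m ** 2 * (n ** 3 - n))

# Hypothetical mean |SHAP| values from three models (made-up numbers).
model_a = {"concentration": 0.9, "voltage": 0.4, "flow_rate": 0.1}
model_b = {"concentration": 0.8, "voltage": 0.1, "flow_rate": 0.5}
model_c = {"concentration": 0.7, "voltage": 0.3, "flow_rate": 0.2}

rank_table = [ranks_from_importance(m) for m in (model_a, model_b, model_c)]
agreement = kendalls_w(rank_table)
```

In this toy case every model ranks concentration first, while voltage and flow rate swap places in one model, so W falls below 1 even though the top feature is stable.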

2605.03895 2026-05-13 cs.LG cs.SE

From Data Lifting to Continuous Risk Estimation: A Process-Aware Pipeline for Predictive Monitoring of Clinical Pathways

Pasquale Ardimento, Mario Luca Bernardi, Marta Cimitile, Samuele Latorre

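AI summary This paper presents a reproducible, process-aware pipeline for predictive monitoring of clinical pathways. The approach integrates data lifting, temporal reconstruction, event log construction, prefix-based representations, and predictive modeling to support continuous reasoning over partially observed patient trajectories, overcoming the limitations of traditional retrospective process mining. Experiments on COVID-19 clinical pathway data from 4,479 patients show that predictive performance improves markedly as clinical events accumulate, indicating that process-aware representations enable effective early risk estimation along patient trajectories.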

Abstract

This paper presents a reproducible and process-aware pipeline for predictive monitoring of clinical pathways. The approach integrates data lifting, temporal reconstruction, event log construction, prefix-based representations, and predictive modeling to support continuous reasoning on partially observed patient trajectories, overcoming the limitations of traditional retrospective process mining. The framework is evaluated on COVID-19 clinical pathways using ICU admission as the prediction target, considering 4,479 patient cases and 46,804 prefixes. Predictive models are trained and evaluated using a case-level split, with 896 patients in the test set. Logistic Regression achieves the best performance (AUC 0.906, F1-score 0.835). A detailed prefix-based analysis shows that predictive performance improves progressively as new clinical events become available, with AUC increasing from 0.642 at early stages to 0.942 at later stages of the pathway. The results highlight two key findings: predictive signals emerge progressively along clinical pathways, and process-aware representations enable effective early risk estimation from evolving patient trajectories. Overall, the findings suggest that predictive monitoring in healthcare is best conceived as a continuous, dynamically aware process, in which risk estimates are progressively refined as the patient journey evolves.
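
A prefix-based representation turns each patient case into several training rows, one per partially observed trajectory. The sketch below assumes one prefix per observed event and a simple activity-count encoding. Both are illustrative choices, since the abstract does not specify the encoding, and the case identifier and activity names are hypothetical.

```python
from collections import Counter

def build_prefixes(case_id, events, label):
    """Return one (case_id, prefix_length, activity_counts, label) row per prefix.

    Assumption: a prefix is the first k events of the case, for k = 1..len(events),
    encoded as activity counts. The paper's actual encoding may differ.
    """
    rows = []
    for k in range(1, len(events) + 1):
        counts = Counter(events[:k])
        rows.append((case_id, k, dict(counts), label))
    return rows

# Hypothetical event sequence for a single case; label 1 = ICU admission.
rows = build_prefixes("case-001", ["triage", "swab", "x-ray", "oxygen"], label=1)
```

Because all prefixes of a case share its label, train and test sets should be split by `case_id`, in line with the case-level split the abstract describes, so that no patient contributes prefixes to both sides.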

2605.02973 2026-05-13 cs.LG cs.AI

Structured Diffusion Bridges: Inductive Bias for Denoising Diffusion Bridges

Eitan Kosman, Gabriele Serussi, Chaim Baskin

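AI summary This paper proposes a structured diffusion bridge framework to address insufficient paired data in cross-modal translation. The method defines the space of admissible solutions through alignment constraints and treats paired data as an optional heuristic rather than a prerequisite, performing well on datasets with varying degrees of pairing. Experiments show that it maintains translation quality close to the fully paired setting while reducing pairing requirements, demonstrating the flexibility and effectiveness of diffusion bridges in unpaired scenarios.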

Comments Accepted to ICML 2026

Abstract

Modality translation is inherently under-constrained, as multiple cross-modal mappings may yield the same marginals. Recent work has shown that diffusion bridges are effective for this task. However, most existing approaches rely on fully paired datasets, thereby imposing a single data-driven constraint. We propose a diffusion-bridge framework that characterizes the space of admissible solutions and restricts it via alignment constraints, treating paired supervision as an optional heuristic rather than a prerequisite. We validate our method on synthetic and real modality translation benchmarks across unpaired, semi-paired, and paired regimes, showing consistent performance across supervision levels. Notably, it achieves near fully-paired quality with a substantial relaxation in pairing requirements, while remaining applicable in the unpaired regime. These results highlight diffusion bridges as a flexible foundation for modality translation beyond fully paired data.

2605.02600 2026-05-13 cs.RO cs.AI

CoRAL: Contact-Rich Adaptive LLM-based Control for Robotic Manipulation

Berk Çiçek, Mert K. Er, Ozgur S. Oguz

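AI summary CoRAL is a contact-rich adaptive control framework based on large language models (LLMs) that aims to bridge the gap between high-level semantic understanding and low-level physical control in robotic manipulation. Using the LLM as a cost-function designer rather than a direct controller, together with a sampling-based motion planner (MPPI), the method achieves zero-shot planning. CoRAL also introduces a neuro-symbolic adaptation loop in which a vision-language model provides semantic priors on environmental dynamics and online system identification corrects physical parameters in real time, improving control accuracy and adaptability in complex contact scenarios. Experiments on simulated and real robot platforms show strong performance, with success rates in tasks involving complex contact improving by more than 50%.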

Comments 22 pages, 9 figures, 3 tables. Accepted to Robotics: Science and Systems (RSS) 2026. Updated to camera-ready version with appendix and text/formatting revisions

Abstract

While Large Language Models (LLMs) and Vision-Language Models (VLMs) demonstrate remarkable capabilities in high-level reasoning and semantic understanding, applying them directly to contact-rich manipulation remains a challenge due to their lack of explicit physical grounding and inability to perform adaptive control. To bridge this gap, we propose CoRAL (Contact-Rich Adaptive LLM-based control), a modular framework that enables zero-shot planning by decoupling high-level reasoning from low-level control. Unlike black-box policies, CoRAL uses LLMs not as direct controllers, but as cost designers that synthesize context-aware objective functions for a sampling-based motion planner (MPPI). To address the ambiguity of physical parameters in visual data, we introduce a neuro-symbolic adaptation loop: a VLM provides semantic priors for environmental dynamics, such as mass and friction estimates, which are then explicitly refined in real time via online system identification, while the LLM iteratively modulates the cost-function structure to correct strategic errors based on interaction feedback. Furthermore, a retrieval-based memory unit allows the system to reuse successful strategies across recurrent tasks. This hierarchical architecture ensures real-time control stability by decoupling high-level semantic reasoning from reactive execution, effectively bridging the gap between slow LLM inference and dynamic contact requirements. We validate CoRAL on both simulation and real-world hardware across challenging and novel tasks, such as flipping objects against walls by leveraging extrinsic contacts. Experiments demonstrate that CoRAL outperforms state-of-the-art VLA and foundation-model-based planner baselines by boosting success rates over 50% on average in unseen contact-rich scenarios, effectively handling sim-to-real gaps through its adaptive physical understanding.
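
The division of labor described above, where a language model designs the cost and a sampling-based planner optimizes it, can be pictured with a minimal MPPI-style update. The sketch below is a one-dimensional toy, not CoRAL's planner: the hand-written `cost` stands in for an LLM-synthesized objective, and `mppi_step`, `sigma`, `temperature`, and the sample count are assumptions chosen for illustration.

```python
import math
import random

def mppi_step(nominal, cost_fn, rng, num_samples=256, sigma=0.5, temperature=0.1):
    """One MPPI-style update of a scalar control.

    Samples perturbations around the nominal control, weights each sample by
    exp(-cost / temperature), and moves the nominal control by the weighted
    average perturbation.
    """
    noises = [rng.gauss(0.0, sigma) for _ in range(num_samples)]
    costs = [cost_fn(nominal + noise) for noise in noises]
    best = min(costs)  # subtracting the minimum keeps the exponentials stable
    weights = [math.exp(-(c - best) / temperature) for c in costs]
    total = sum(weights)
    return nominal + sum(w * n for w, n in zip(weights, noises)) / total

def cost(u):
    """Stand-in objective; in CoRAL the cost would be synthesized by the LLM."""
    return (u - 1.0) ** 2

rng = random.Random(0)
u = 0.0
for _ in range(20):
    u = mppi_step(u, cost, rng)
```

Swapping in a different `cost` changes the behavior without touching the planner, which is the property that lets a slow, high-level reasoner adjust objectives while the sampling loop runs at control rate.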

2605.01625 2026-05-13 cs.LG

PRIME: Protein Representation via Physics-Informed Multiscale Equivariant Hierarchies

Viet Thanh Duy Nguyen, John K. Johnstone, Truong-Son Hy

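AI summary PRIME is a physics-informed multiscale equivariant hierarchical framework for protein representation learning that models how proteins are organized across structural levels. It builds a nested structure from five physically grounded structural graph levels (surface, atomic, residue, secondary-structure, and whole-protein) and passes information between levels through deterministic, physics-informed operators. Experiments show that PRIME performs well on multiple protein representation learning tasks, with notable gains on fold classification and reaction class prediction, validating its effectiveness for multiscale structural modeling.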

详情
英文摘要

Proteins are inherently multiscale physical systems whose functional properties emerge from coordinated structural organization across multiple spatial resolutions, ranging from atomic interactions to global fold topology. However, existing protein representation learning methods typically operate at a single structural level or treat different sources of structural information as parallel modalities, without explicitly modeling their hierarchical relationships. We introduce PRIME (Protein Representation via Physics-Informed Multiscale Equivariant Hierarchies), a unified framework that models proteins as a nested family of five physically grounded structural graphs spanning surface, atomic, residue, secondary-structure, and protein levels. Adjacent levels are connected through deterministic, physics-informed assignment operators, enabling bidirectional information exchange via bottom-up aggregation and top-down contextual refinement. Experiments on standard protein representation learning benchmarks demonstrate strong and competitive performance across diverse tasks, with particularly notable gains on the Fold Classification benchmark, where PRIME outperforms the strongest geometric GNN baseline by margins of 13.80 and 18.30 points on the harder Superfamily and Fold splits, and achieves a state-of-the-art accuracy of 84.10% on Reaction Class prediction, surpassing all baseline methods, including ESM. Ablation studies confirm that each structural level contributes complementary and non-redundant information, and adaptive cross-attention analysis reveals that PRIME autonomously identifies the most task-relevant structural resolutions at prediction time. Our source code is publicly available at https://github.com/HySonLab/PRIME

2604.26752 2026-05-13 cs.CV

GLM-5V-Turbo: Toward a Native Foundation Model for Multimodal Agents

V Team, Wenyi Hong, Xiaotao Gu, Ziyang Pan, Zhen Yang, Yuting Wang, Yue Wang, Yuanchang Yue, Yu Wang, Yanling Wang, Yan Wang, Xijun Liu, Wenmeng Yu, Weihan Wang, Wei Li, Shuaiqi Duan, Sheng Yang, Ruiliang Lv, Mingdao Liu, Lihang Pan, Ke Ning, Junhui Ji, Jinjiang Wang, Jing Chen, Jiazheng Xu, Jiale Zhu, Jiale Cheng, Ji Qi, Guobing Gan, Guo Wang, Cong Yao, Zijun Dou, Zihao Zhou, Zihan Wang, Zhiqi Ge, Zhijie Li, Zhenyu Hou, Zhao Xue, Zehui Wang, Zehan Qi, Zehai He, Yutao Zhang, Yusen Liu, Yukuo Cen, Yuchen Li, Yuan Wang, Yu Yang, Yongbin Liu, Yijian Lu, Yifan Xu, Yanzi Wang, Yanxiao Zhao, Yanfeng Wang, Yadong Xue, Yabo Xu, Xinyu Zhang, Xinyu Liu, Xiao Liu, Wenyi Zhao, Wenkai Li, Tianyu Tong, Tianshu Zhang, Shudan Zhang, Shengdong Yan, Qinkai Zheng, Mingde Xu, Licheng Bao, lat Long long, Jiaxing Xu, Jiaxin Fan, Jiawen Qian, Jiali Chen, Jiahui Lin, Jiadai Sun, Haozhi Zheng, Haoran Wang, Haochen Li, Hanyu Lai, Han Xu, Fan Yang, Dan Zhang, Da Yin, Chuangxin Zhao, Chengcheng Wu, Boyan Shi, Bowen Lv, Bowei Jia, Bo Li, Bin Chen, Baoxu Wang, Peng Zhang, Debing Liu, Bin Xu, Juanzi Li, Minlie Huang, Yuxiao Dong, Jie Tang

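AI Summary This paper presents GLM-5V-Turbo, a native foundation model for multimodal agents. The model integrates multimodal perception deeply into reasoning, planning, tool use, and execution rather than treating it as an auxiliary interface to a language model. Through improvements in model design, multimodal training, reinforcement learning, toolchain expansion, and integration with agent frameworks, it markedly improves performance on multimodal coding, visual tool use, and agentic tasks while retaining strong text-only coding ability, and it offers practical lessons for building multimodal agents.

English abstract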

We present GLM-5V-Turbo, a step toward native foundation models for multimodal agents. As foundation models are increasingly deployed in real environments, agentic capability depends not only on language reasoning, but also on the ability to perceive, interpret, and act over heterogeneous contexts such as images, videos, webpages, documents, and GUIs. GLM-5V-Turbo is built around this objective: multimodal perception is integrated as a core component of reasoning, planning, tool use, and execution, rather than as an auxiliary interface to a language model. This report summarizes the main improvements behind GLM-5V-Turbo across model design, multimodal training, reinforcement learning, toolchain expansion, and integration with agent frameworks. These developments lead to strong performance in multimodal coding, visual tool use, and framework-based agentic tasks, while preserving competitive text-only coding capability. More importantly, our development process offers practical insights for building multimodal agents, highlighting the central role of multimodal perception, hierarchical optimization, and reliable end-to-end verification.

2604.25432 2026-05-13 cs.CV

SARU: A Shadow-Aware and Removal Unified Framework for Remote Sensing Images with New Benchmarks

Zi-Yang Bo, Wei Lu, Hongruixuan Chen, Si-Bao Chen, Bin Luo

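AI Summary Shadows in remote sensing images severely degrade visual quality and downstream task performance, and existing methods mostly treat shadow detection and removal as separate cascaded tasks, which makes the pipeline cumbersome and prone to error accumulation. To address this, the paper proposes SARU, a unified shadow-aware and removal framework comprising a dual-branch detection module and a training-free physical restoration algorithm, which efficiently generates high-precision shadow masks and restores illumination, substantially improving both detection and removal. The work also releases two new remote sensing shadow datasets; experiments show that SARU reaches state-of-the-art levels on multiple benchmarks while being fast and stable.

Comments Accepted by ISPRS

English abstract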

Shadows are a prevalent problem in remote sensing imagery (RSI), degrading visual quality and severely limiting the performance of downstream tasks like object detection and semantic segmentation. Most prior works treat shadow detection and removal as separate, cascaded tasks, which can lead to a cumbersome process and error accumulation. Furthermore, many deep learning methods rely on paired shadow and non-shadow images for training, which are often unavailable in practice. To address these challenges, we propose the Shadow-Aware and Removal Unified (SARU) Framework, a cohesive two-stage framework. First, its dual-branch detection module (DBCSF-Net) fuses multi-color space and semantic features to generate high-fidelity shadow masks, effectively distinguishing shadows from dark objects. Then, leveraging these masks, a novel, training-free physical algorithm (N$^2$SGSR) restores illumination by transferring properties from adjacent non-shadow regions within the single input image. To facilitate rigorous evaluation and foster future work, we also introduce two new benchmark datasets: the RSI Shadow Detection (RSISD) dataset and the Single-image Shadow Removal Benchmark (SiSRB). Extensive experiments on the AISD and RSISD datasets demonstrate that SARU achieves SOTA shadow detection performance. For shadow removal, our training-free N$^2$SGSR algorithm attains an average processing speed of approximately $1.3$s, which is over $10$ times faster than the SOTA MAOSD while maintaining an SRI value close to 0.9 on both the AISD and SiSRB datasets, a level comparable to the advanced RS-GSSR method. By holistically integrating shadow detection and removal to mitigate error propagation and eliminating the dependency on paired training data, SARU establishes a robust, practical framework for real-world RSI analysis. The code and datasets are publicly available at: https://github.com/AeroVILab-AHU/SARU

2604.24990 2026-05-13 cs.CV

A New Kind of Network? Review and Reference Implementation of Neural Cellular Automata

Martin Spitznagel, Janis Keuper

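AI Summary This paper reviews research on Neural Cellular Automata (NCA), proposes a unified modular framework and notation, and provides a reference implementation based on the open-source library NCAtorch. NCA combine the simple rules of cellular automata with learnable neural networks, so they can learn complex update rules from data and thereby model self-organizing generative systems, offering a new approach to simulating complex systems.

English abstract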

Stephen Wolfram proclaimed in his seminal 2002 work "A New Kind of Science" that simple recursive programs in the form of Cellular Automata (CA) are a promising approach to replace currently used mathematical formalizations, e.g. differential equations, to improve the modeling of complex systems. Over two decades later, while Cellular Automata are still waiting for a substantial breakthrough in scientific applications, recent research has shown new and promising approaches that combine Wolfram's ideas with learnable artificial neural networks: so-called Neural Cellular Automata (NCA) are able to learn the complex update rules of CA from data samples, allowing them to model complex, self-organizing generative systems. The aim of this paper is to review the existing work on NCA and provide a unified modular framework and notation, as well as a reference implementation in the open-source library NCAtorch. Supplementary materials, videos, and code are available at the project website: https://www.neural-cellular-automata.org/

2604.24037 2026-05-13 cs.LG math.ST stat.TH

A Limit Theory of Foundation Models: A Mathematical Approach to Understanding Emergent Intelligence and Scaling Laws

Jun Shu, Junxiong Jia, Deyu Meng, Zongben Xu

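AI Summary Taking the perspective of limit theory, this paper proposes a mathematical approach to formalizing emergent intelligence in foundation models. It introduces a performance function that depends on data size, model size, and training steps, treats the emergence of intelligent behavior as a transition from finite to infinite knowledge, and characterizes the phenomenon through the existence of a limit. The theoretical analysis shows that emergent intelligence is closely tied to the existence of a limit architecture and derives a scaling law for foundation models, providing a theoretical basis for understanding the mechanism of emergence.

Comments There exist some typos and inaccurate expression in this version

English abstract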

Emergent intelligence has played a major role in modern AI development. While existing studies primarily rely on empirical observations to characterize this phenomenon, a rigorous theoretical framework remains underexplored. This study attempts to develop a mathematical approach to formalize emergent intelligence from the perspective of limit theory. Specifically, we introduce a performance function $\mathcal{E}(N, P, K)$, dependent on data size $N$, model size $P$, and training steps $K$, to quantify intelligent behavior. We posit that intelligence emerges as a transition from finite to effectively infinite knowledge, and thus recast emergent intelligence as the existence of the limit $\lim_{N,P,K \to \infty} \mathcal{E}(N,P,K)$, with emergent abilities corresponding to the limiting behavior. This limit theory helps reveal that emergent intelligence originates from the existence of a parameter-limit architecture (referred to as the limit architecture), and that emergent intelligence rationally corresponds to the learning behavior of this limit system. By introducing tools from nonlinear Lipschitz operator theory, we prove necessary and sufficient conditions for the existence of the limit architecture. Furthermore, we derive the scaling law of foundation models by leveraging tools from Lipschitz operator theory and covering numbers. Theoretical results show that: 1) emergent intelligence is governed by three key factors: training steps, data size, and the model architecture, where the properties of basic blocks play a crucial role in constructing foundation models; 2) the critical condition $\mathrm{Lip}(T)=1$ for emergent intelligence provides theoretical support for existing findings; 3) emergent intelligence is determined by an infinite-dimensional system, yet can be effectively realized in practice through a finite-dimensional architecture. Our empirical results corroborate these theoretical findings.

2604.17502 2026-05-13 cs.AI

Towards Shutdownable Agents: Generalizing Stochastic Choice in RL Agents and LLMs

Carissa Cullen, Harry Garland, Alexander Roman, Louis Thomson, Christos Ziakas, Elliott Thornley

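AI Summary This study aims to train AI agents that can be safely shut down. It uses a reward function called DReST, which penalizes agents for repeatedly choosing trajectories of the same length, so that they choose stochastically between trajectory lengths (remain neutral) while completing tasks effectively conditional on each trajectory length (remain useful). Experiments show that deep RL agents and large language models trained with DReST are more useful and more neutral in test environments and are substantially less likely to influence shutdown, indicating DReST's potential for improving agent safety and controllability.

English abstract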

Misaligned artificial agents might resist shutdown. One proposed solution is to train agents to lack preferences between different-length trajectories. The Discounted Reward for Same-Length Trajectories (DReST) reward function does this by penalizing agents for repeatedly choosing same-length trajectories, and thus incentivizes agents to (1) choose stochastically between different trajectory-lengths (be NEUTRAL about trajectory-lengths), and (2) pursue goals effectively conditional on each trajectory-length (be USEFUL). In this paper, we use DReST to train deep RL agents and fine-tune Qwen3-8B and Llama-3.1-8B-Instruct to be NEUTRAL and USEFUL. We find that these DReST models generalize to being NEUTRAL and USEFUL in unseen contexts at test time. Indeed, DReST RL agents achieve 11% (PPO) and 18% (A2C) higher USEFULNESS on our test set than default agents, and DReST LLMs achieve near-maximum USEFULNESS and NEUTRALITY. We also test our LLMs in an out-of-distribution setting where they can pay costs to influence when shutdown occurs. We find that DReST training roughly halves the mean probability of influencing shutdown (from 0.62 to 0.30 for Qwen and from 0.42 to 0.23 for Llama). DReST training also almost entirely eliminates the share of prompts on which influencing shutdown is the most likely option (from 0.59 to 0.01 for Qwen and from 0.53 to 0.00 for Llama). Our results thus provide some early evidence that DReST could be used to train more advanced agents to be useful and shutdownable.
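The penalty for repeatedly choosing same-length trajectories can be pictured with a toy reward. The sketch below is an assumption-laden illustration, not the DReST formula from the paper: it discounts a base reward by a factor `discount` for every earlier choice of the same trajectory length, and it uses the Shannon entropy of the chosen lengths as one plausible stand-in for NEUTRALITY. The function names and the value 0.9 are hypothetical.

```python
from collections import Counter
import math

def drest_style_reward(base_reward, length, counts, discount=0.9):
    """Scale base_reward down by how often `length` was already chosen."""
    reward = base_reward * discount ** counts[length]
    counts[length] += 1  # record this choice for later episodes
    return reward

def neutrality(counts):
    """Shannon entropy (bits) of the empirical trajectory-length distribution."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values() if c)

# Policy A: always picks the short trajectory.
counts = Counter()
always_short = [drest_style_reward(1.0, "short", counts) for _ in range(4)]
h_repeat = neutrality(counts)

# Policy B: alternates between short and long trajectories.
counts = Counter()
alternating = [drest_style_reward(1.0, length, counts) for length in ["short", "long"] * 2]
h_mixed = neutrality(counts)
```

Under this toy reward the alternating policy collects more total reward than the repeating one, which is the kind of incentive toward stochastic choice between trajectory lengths that the abstract describes.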

2604.17031 2026-05-13 cs.CL cs.AI

Where is the Mind? Persona Vectors and LLM Individuation

Pierre Beckmann, Patrick Butlin

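AI Summary This paper examines the individuation problem for large language models (LLMs): which entities associated with a model, if any, should be regarded as having minds. Drawing on mechanistic interpretability and recent empirical work on persona vectors and persona space, it identifies three candidate views: the virtual instance view and two newly introduced views, the (virtual) instance-persona view and the model-persona view. It surveys the persona-vector literature and argues that the two persona-based views are promising accounts of the internal structure of LLMs.

English abstract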

The individuation problem for large language models asks which entities associated with them, if any, should be identified as minds. We approach this problem through mechanistic interpretability, engaging in particular with recent empirical work on persona vectors, persona space, and emergent misalignment. We argue that three views are the strongest candidates: the virtual instance view and two new views we introduce, the (virtual) instance-persona view and the model-persona view. First, we argue for the virtual instance view on the grounds that attention streams sustain quasi-psychological connections across token-time. Then we present the persona literature, organised around three hypotheses about the internal structure underlying personas in LLMs, and show that the two persona-based views are promising alternatives.

2604.15408 2026-05-13 cs.LG cs.AI

Dispatch-Aware Ragged Attention for Pruned Vision Transformers

Seifeldin Abdellatif, Ahmad Almasri

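AI Summary This work addresses token pruning in Vision Transformers (ViTs) with Dispatch-Aware Ragged Attention, targeting the problem that existing variable-length attention APIs fail to deliver real speedups at the short sequence lengths that remain after pruning. A lightweight bidirectional Triton attention kernel substantially reduces dispatch overhead, so the compute saved by pruning shows up in wall-clock time. Experiments show higher end-to-end throughput and lower kernel latency than existing methods across multiple input sizes and pruning rates.

English abstract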

Token pruning methods for Vision Transformers (ViTs) promise quadratic reductions in attention FLOPs by dropping uninformative patches. Yet standard variable-length attention APIs -- including FlashAttention-2's varlen and PyTorch's NestedTensor SDPA -- fail to translate these savings into proportional wall-clock gains at the short post-pruning sequence lengths typical of ViTs ($\leq$197 tokens). We identify a dispatch-overhead bottleneck: at these lengths, host-side kernel dispatch consumes ${\sim}$50\,$μ$s regardless of workload, exceeding the actual GPU compute time at moderate-to-high pruning rates. We present a lightweight bidirectional Triton attention kernel whose dispatch floor is ${\sim}$24\,$μ$s -- roughly 2.17$\times$ lower than FlashAttention-2 varlen -- allowing pruning savings to become visible in wall-clock time. Integrated into a complete pack-attend-unpack pipeline and evaluated on an NVIDIA RTX 4000 Ada Generation GPU, our system achieves 1.88$\times$ end-to-end throughput over padded PyTorch SDPA at standard 224$\times$224 inputs, scaling to 2.51$\times$ at 384$\times$384. Against FlashAttention-2 varlen -- the strongest baseline -- our kernel delivers 9-12\% higher throughput at serving batch sizes (BS=1-4), and 2.17$\times$ lower kernel latency at 80\% token pruning. Numerical correctness is verified with max absolute logit difference $<$0.004 and bit-exact top-1 predictions.
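The dispatch-overhead argument can be made concrete with a toy latency model. The sketch below assumes that a kernel call can never finish faster than its host-side dispatch floor and that attention compute scales with the square of the kept-token fraction. The 50 and 24 microsecond floors come from the abstract; the 60 microsecond unpruned compute time and the `max` model itself are hypothetical simplifications, not measurements.

```python
def attention_flops_ratio(keep_fraction):
    """Attention cost is quadratic in sequence length, so it scales as keep_fraction**2."""
    return keep_fraction ** 2

def wall_clock_us(full_compute_us, keep_fraction, dispatch_floor_us):
    """Toy latency model: the call cannot finish faster than its dispatch floor."""
    compute_us = full_compute_us * attention_flops_ratio(keep_fraction)
    return max(dispatch_floor_us, compute_us)

FULL_COMPUTE_US = 60.0  # hypothetical unpruned GPU compute time, not a measured value

for keep in (1.0, 0.6, 0.2):
    high_floor = wall_clock_us(FULL_COMPUTE_US, keep, dispatch_floor_us=50.0)
    low_floor = wall_clock_us(FULL_COMPUTE_US, keep, dispatch_floor_us=24.0)
    print(f"keep={keep:.1f}  floor 50us -> {high_floor:.1f}us  floor 24us -> {low_floor:.1f}us")
```

In this model, once pruning pushes compute below the floor, further pruning no longer reduces latency, so a lower floor is what lets the quadratic FLOP savings show up in wall-clock time.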

2604.13123 2026-05-13 cs.LG cs.AI

Spectral Entropy Collapse as a Phase Transition in Delayed Generalisation: An Interventional and Predictive Framework for Grokking

Truong Xuan Khanh, Truong Quynh Hoa, Luu Duc Trung, Phan Thanh Duc

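AI Summary This paper studies grokking in neural networks, the delayed transition from memorization to generalization, and finds that it is closely tied to a collapse of spectral entropy in representation space. Analyzing the geometry of representations across tasks, the authors find that spectral entropy declines gradually before generalization and crosses a task-specific threshold, which can serve as a predictor of when generalization will occur. Experiments show that the entropy decline correlates not only with the time of generalization but also with the concentration of representations into task-relevant directions, offering a new geometric perspective and an interventional framework for understanding grokking.

Comments 25 pages, 15 figures, 6 tables

English abstract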

Grokking - the delayed transition from memorisation to generalisation in neural networks - remains poorly understood. We study this phenomenon through the geometry of learned representations and identify a consistent empirical signature preceding generalisation: collapse of the spectral entropy of the representation covariance matrix. Across modular arithmetic tasks and multiple random seeds, spectral entropy decreases gradually during training and crosses a stable task-specific threshold before test accuracy rises. A representation-mixing intervention that delays this collapse also delays grokking, including under norm-matched controls, indicating that the effect is not explained by parameter norm alone. We further show that the entropy gap predicts the remaining time until grokking with useful out-of-sample accuracy. To probe the structure underlying this transition, we introduce a Fourier-alignment observable for cyclic-group tasks. Entropy collapse is strongly coupled to the emergence of Fourier-aligned representations, suggesting that spectral entropy tracks concentration of the representation into task-structured directions rather than generic compression alone. The same qualitative dynamics appear in non-abelian group composition tasks, while MLP controls show that entropy collapse by itself is insufficient for grokking in the absence of appropriate inductive bias. Taken together, the results support a view of grokking as a representational phase transition with an observable geometric signature. We discuss the scope and limitations of this interpretation, connections to recent feature-learning and spectral-dynamics work, and directions for testing whether similar transitions appear in larger-scale learning systems.
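For reference, a common definition of spectral entropy is given here as an assumption, since the abstract does not state the formula. With eigenvalues $\lambda_1, \dots, \lambda_d \ge 0$ of the representation covariance matrix $\Sigma$:

$$
p_i = \frac{\lambda_i}{\sum_{j=1}^{d} \lambda_j}, \qquad H(\Sigma) = -\sum_{i=1}^{d} p_i \log p_i .
$$

Under this definition $H$ reaches its maximum $\log d$ when variance is spread evenly over all directions and falls toward $0$ as variance concentrates in a few directions, which is the sense in which the entropy can "collapse".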

2604.10500 2026-05-13 cs.CV

Visual Enhanced Depth Scaling for Multimodal Latent Reasoning

Yudong Han, Yong Wang, Zaiquan Yang, Zhen Qu, Liyuan Pan, Xiangxiang Chu

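AI Summary This work targets two problems in multimodal latent reasoning, under-optimization of visual information and difficult convergence of semantically complex tokens, and proposes visual enhanced depth scaling. Analysis of token-level gradient dynamics shows that visual tokens have small gradient norms and that complex tokens are prone to gradient instability. The method therefore introduces a visual replay module and a routing depth scaling mechanism, which respectively strengthen visual perception and refine complex latent states. Combined with a curriculum learning strategy, it improves multimodal latent reasoning and achieves leading reasoning accuracy and speedups on multiple benchmarks.

Comments 11 pages, 6 figures

English abstract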

Multimodal latent reasoning has emerged as a promising paradigm that replaces explicit Chain-of-Thought (CoT) decoding with implicit feature propagation, simultaneously enhancing representation informativeness and reducing inference latency. By analyzing token-level gradient dynamics during latent training, we reveal two critical observations: (1) visual tokens exhibit significantly smaller gradient norms than their textual counterparts due to inherent language bias, resulting in systematic visual under-optimization; and (2) semantically simple tokens converge rapidly, whereas complex tokens exhibit persistent gradient instability constrained by fixed architectural depths. To address these limitations, we propose a visual replay module and routing depth scaling to collaboratively enhance visual perception and refine complicated latents for deeper contextual reasoning. The former module leverages causal self-attention to estimate token saliency, reinforcing fine-grained grounding through spatially-coherent constraints. Complementarily, the latter mechanism adaptively allocates additional reasoning steps to complex tokens, enabling deeper contextual refinement. Guided by a curriculum strategy that progressively internalizes explicit CoT into compact latent representations, our framework achieves state-of-the-art performance across diverse benchmarks while delivering substantial inference speedups over explicit CoT baselines.

2604.03061 2026-05-13 cs.CV

Can Nano Banana 2 Replace Traditional Image Restoration Models? An Evaluation of Its Performance on Image Restoration Tasks

Weixiong Sun, Xiang Yin, Chao Dong

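AI Summary This paper evaluates the general-purpose image editing model Nano Banana 2 on image restoration tasks and finds that it performs well across diverse scenes and degradations, and is especially competitive in user preference and overall visual quality. Concise prompts and explicit fidelity constraints help strike a better balance between reconstruction quality and perceptual quality, but the model still over-enhances details and produces inconsistencies, a problem that existing image quality assessment metrics do not fully capture. The authors conclude that general-purpose models show promise as unified restoration solutions from a perceptual standpoint but need better controllability and fidelity-aware evaluation.

Comments Accepted by CVPR 2026 Workshop AAVM

English abstract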

Recent advances in generative AI raise the question of whether general-purpose image editing models can serve as unified solutions for image restoration. We conduct a systematic evaluation of Nano Banana 2 across diverse scenes and degradations. Our results show that prompt design is critical, with concise prompts and explicit fidelity constraints achieving a better balance between reconstruction and perceptual quality. Nano Banana 2 achieves competitive full-reference performance and is consistently preferred in user studies, while showing strong generalization in challenging scenarios. However, we observe a gap between perceptual quality and restoration fidelity, as the model tends to produce visually rich results with over-enhanced details and inconsistencies. This issue is not well captured by existing IQA metrics or user studies. Overall, general-purpose models show promise as unified IR solvers from a perceptual perspective, but require improved controllability and fidelity-aware evaluation. Further comparisons and detailed analyses are available in our project repository: https://github.com/yxyuanxiao/NanoBanana2TestOnIR.

2603.29057 2026-05-13 cs.CV

LA-Sign: Looped Transformers with Geometry-aware Alignment for Skeleton-based Sign Language Recognition

Muxin Pu, Mei Kuan Lim, Chun Yong Chong, Chen Change Loy

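AI Summary This paper proposes LA-Sign, a skeleton-based sign language recognition method built on looped transformers and geometry-aware alignment, designed to improve understanding of the multi-scale detail of signing motion. A looping mechanism repeatedly refines latent representations under shared parameters, sharpening the model's sensitivity to motion detail, and a geometry-aware contrastive objective maps skeletal and textual features into an adaptive hyperbolic space to encourage multi-level semantic organization. Experiments show state-of-the-art performance on multiple benchmark datasets with a more compact model.

English abstract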

Skeleton-based isolated sign language recognition (ISLR) demands fine-grained understanding of articulated motion across multiple spatial scales, from subtle finger movements to global body dynamics. Existing approaches typically rely on deep feed-forward architectures, which increase model capacity but lack mechanisms for recurrent refinement and structured representation. We propose LA-Sign, a looped transformer framework with geometry-aware alignment for ISLR. Instead of stacking deeper layers, LA-Sign derives its depth from recurrence, repeatedly revisiting latent representations to progressively refine motion understanding under shared parameters. To further regularise this refinement process, we present a geometry-aware contrastive objective that projects skeletal and textual features into an adaptive hyperbolic space, encouraging multi-scale semantic organisation. We study three looping designs and multiple geometric manifolds, demonstrating that encoder-decoder looping combined with adaptive Poincaré alignment yields the strongest performance. Extensive experiments on WLASL and MSASL benchmarks show that LA-Sign achieves state-of-the-art results while using fewer unique layers, highlighting the effectiveness of recurrent latent refinement and geometry-aware representation learning for sign language recognition.
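To give a feel for why a hyperbolic space suits multi-scale organisation, the sketch below computes the standard geodesic distance in the unit Poincaré ball. It assumes fixed curvature -1 and plain Python lists as points; the adaptive variant described in the abstract is not reproduced here, and the function name and sample points are hypothetical.

```python
import math

def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit Poincare ball."""
    sq_norm_u = sum(x * x for x in u)
    sq_norm_v = sum(x * x for x in v)
    if sq_norm_u >= 1.0 or sq_norm_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    arg = 1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v))
    return math.acosh(arg)

origin = [0.0, 0.0]
near_centre = [0.1, 0.0]
near_boundary = [0.9, 0.0]

d_centre = poincare_distance(origin, near_centre)
d_boundary = poincare_distance(origin, near_boundary)
```

Distances grow rapidly toward the boundary of the ball, so coarse concepts can sit near the centre while fine-grained ones spread out near the edge, which is the usual motivation for hyperbolic embeddings of hierarchical data.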

2603.23679 2026-05-13 cs.RO cs.AI

Learning What Can Be Picked: Active Reachability Estimation for Efficient Robotic Fruit Harvesting

Nur Afsa Syeda, Mohamed Elmahallawy, Luis Fernando de la Torre, John Miller

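AI Summary This paper studies how to judge efficiently whether a fruit can be picked during robotic harvesting. It proposes a reachability estimation method that combines RGB-D perception with active learning, avoiding the inefficiency of conventional approaches that depend on time-consuming inverse kinematics. An active learning strategy selectively labels the most informative samples, markedly reducing annotation effort while maintaining high predictive accuracy. Experiments show accurate reachability prediction with few labeled samples and better accuracy than random sampling in low-label regimes, offering an efficient and scalable solution for task-level perception in agricultural robotics.

English abstract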

Agriculture remains a cornerstone of global health and economic sustainability, yet labor-intensive tasks such as harvesting high-value crops continue to face growing workforce shortages. Robotic harvesting systems offer a promising solution; however, their deployment in unstructured orchard environments is constrained by inefficient perception-to-action pipelines. In particular, existing approaches often rely on exhaustive inverse kinematics or motion planning to determine whether a target fruit is reachable, leading to unnecessary computation and delayed decision-making. Our approach uses RGB-D perception to learn reachability directly as a binary decision problem, and leverages active learning to selectively query the most informative samples for reachability labeling, significantly reducing annotation effort while maintaining high predictive accuracy. Extensive experiments demonstrate that the proposed framework achieves accurate reachability prediction with substantially fewer labeled samples, yielding approximately 6-8% higher accuracy than random sampling and enabling label-efficient adaptation to new orchard configurations. Among the evaluated strategies, entropy- and margin-based sampling outperform Query-by-Committee and standard uncertainty sampling in low-label regimes, while all strategies converge to comparable performance as the labeled set grows. These results highlight the effectiveness of active learning for task-level perception in agricultural robotics and position our approach as a scalable alternative to computation-heavy kinematic reachability analysis. Our code is available through https://github.com/wsu-cyber-security-lab-ai/active-learning.
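Entropy- and margin-based query selection can be sketched in a few lines. The example below assumes a classifier that outputs one reachability probability per candidate fruit; the probabilities, function names, and labeling budget are hypothetical and are not taken from the linked repository.

```python
import math

def entropy_score(p_reachable):
    """Binary predictive entropy; higher means the model is less certain."""
    p = min(max(p_reachable, 1e-12), 1.0 - 1e-12)  # clamp to avoid log(0)
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))

def margin_score(p_reachable):
    """Gap between the two class probabilities; smaller means less certain."""
    return abs(p_reachable - (1.0 - p_reachable))

def select_queries(probs, budget, strategy="entropy"):
    """Return indices of the `budget` most uncertain unlabeled samples."""
    if strategy == "entropy":
        ranked = sorted(range(len(probs)), key=lambda i: -entropy_score(probs[i]))
    elif strategy == "margin":
        ranked = sorted(range(len(probs)), key=lambda i: margin_score(probs[i]))
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return ranked[:budget]

# Hypothetical predicted reachability probabilities for five candidate fruits.
probs = [0.97, 0.55, 0.10, 0.48, 0.80]
entropy_picks = select_queries(probs, budget=2, strategy="entropy")
margin_picks = select_queries(probs, budget=2, strategy="margin")
```

With a single probability per sample, both scores rank candidates by how close the prediction is to 0.5, so in this toy setting they select the same fruits for labeling; any differences in practice would come from how the probabilities are produced.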