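arXivDaily: a daily academic digest synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems, and more.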

2601.09495 2026-05-19 cs.LG

Parallelizable memory recurrent units

Florent De Geeter, Gaspard Lambrechts, Damien Ernst, Guillaume Drion

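AI summary: This paper proposes memory recurrent units (MRUs), a new family of recurrent neural networks that combines the persistent memory of nonlinear recurrent networks with the parallel computation of state-space models. Multistability provides persistent memory, transient dynamics are removed for efficiency, and the units are shown to be effective on tasks with long-range dependencies.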

Comments 19 pages, 12 figures. This work has been the subject of patent applications (Numbers: EP26151077 and EP26175248.9)

English abstract

With the emergence of massively parallel processing units, parallelization has become a desirable property for new sequence models. The ability to parallelize the processing of sequences with respect to the sequence length during training is one of the main factors behind the rise of the Transformer architecture. However, Transformers lack efficiency at sequence generation, as they need to reprocess all past timesteps at every generation step. Recently, state-space models (SSMs) emerged as a more efficient alternative. These new kinds of recurrent neural networks (RNNs) keep the efficient update of RNNs while gaining parallelization by getting rid of nonlinear dynamics (or recurrence). SSMs can reach state-of-the-art performance through the efficient training of potentially very large networks, but still suffer from limited representation capabilities. In particular, SSMs cannot exhibit persistent memory, that is, the capacity to retain information for an infinite duration, because of their monostability. In this paper, we introduce a new family of RNNs, the memory recurrent units (MRUs), that combine the persistent memory capabilities of nonlinear RNNs with the parallelizable computations of SSMs. These units leverage multistability as a source of persistent memory, while getting rid of transient dynamics for efficient computations. We then derive a specific implementation as proof-of-concept: the bistable memory recurrent unit (BMRU). This new RNN is compatible with the parallel scan algorithm. We show that BMRU achieves good results in tasks with long-term dependencies, and can be combined with state-space models to create hybrid networks that are parallelizable and have transient dynamics as well as persistent memory.
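
The compatibility with a parallel scan mentioned above can be illustrated with a toy unit. The sketch below is not the authors' BMRU: it assumes a hypothetical hysteresis-style update in which the state is overwritten when the input magnitude exceeds a threshold and held otherwise. Each step is then an affine map h -> a * h + b, and composing affine maps is associative, so all prefix states can be computed with a scan. The names `to_affine`, `combine`, and the threshold value are illustrative assumptions.

```python
from typing import List, Tuple

Affine = Tuple[float, float]  # (a, b) represents the map h -> a * h + b


def to_affine(x: float, threshold: float = 0.5) -> Affine:
    """Hypothetical bistable step: write sign(x) if |x| > threshold, else hold."""
    if abs(x) > threshold:
        return (0.0, 1.0 if x > 0 else -1.0)  # overwrite the state
    return (1.0, 0.0)  # hold the previous state


def combine(first: Affine, second: Affine) -> Affine:
    """Compose two steps: apply `first`, then `second` (an associative operation)."""
    a1, b1 = first
    a2, b2 = second
    return (a2 * a1, a2 * b1 + b2)


def sequential_states(xs: List[float], h0: float = 0.0) -> List[float]:
    """Reference implementation: plain step-by-step recurrence."""
    states, h = [], h0
    for x in xs:
        a, b = to_affine(x)
        h = a * h + b
        states.append(h)
    return states


def scan_states(xs: List[float], h0: float = 0.0) -> List[float]:
    """Hillis-Steele inclusive scan; the combines within a round are independent."""
    elems = [to_affine(x) for x in xs]
    shift = 1
    while shift < len(elems):
        elems = [
            combine(elems[i - shift], elems[i]) if i >= shift else elems[i]
            for i in range(len(elems))
        ]
        shift *= 2
    return [a * h0 + b for a, b in elems]


xs = [0.1, 0.9, 0.2, -0.1, -0.8, 0.3, 0.0]
assert scan_states(xs) == sequential_states(xs)
```

In this toy, sub-threshold inputs leave the state untouched for arbitrarily long, which is the kind of persistent memory the abstract attributes to multistability. The scan here runs its rounds in ordinary Python; on parallel hardware each round's combines could run concurrently.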

2601.09413 2026-05-19 cs.SD cs.AI cs.CL cs.MA eess.AS

Speech-Hands: A Self-Reflection Voice Agentic Approach to Speech Recognition and Audio Reasoning with Omni Perception

Zhen Wan, Chao-Han Huck Yang, Jinchuan Tian, Hanrong Ye, Ankita Pasad, Szu-wei Fu, Arushi Goel, Ryo Hachiuma, Shizhe Diao, Kunal Dhawan, Sreyan Ghosh, Yusuke Hirota, Zhehuai Chen, Rafael Valle, Chenhui Chu, Shinji Watanabe, Yu-Chiang Frank Wang, Boris Ginsburg

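AI summary: This paper proposes Speech-Hands, a framework that uses a self-reflection decision mechanism to address the question of which source to trust in speech recognition and external sound understanding, improving accuracy and robustness in multi-task audio reasoning.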

Comments Accepted to ACL 2026. Oral Presentation. Code: https://github.com/YukinoWan/Speech-Hands OpenClaw Branch: https://github.com/openclaw/openclaw/pull/69073

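Toward Seamless Physical Human-Humanoid Interaction: Insights from Control, Intent, and Modeling, and an Outlook on Future Developments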

Gustavo A. Cardona, Shubham S. Kumbhar, Panagiotis Artemiadis

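AI summary: This review examines three core pillars of physical human-humanoid interaction (control, intent estimation, and computational human models), summarizes the state of the art, open challenges, and limitations, and proposes pathways for cross-pillar integration toward more robust, safe, and intuitive physical interaction.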

Comments 60 pages, 5 figures, 3 tables

DOI: 10.1007/s10846-026-02401-0

English abstract

Physical Human-Humanoid Interaction (pHHI) is a rapidly advancing field with significant implications for deploying robots in unstructured, human-centric environments. In this review, we examine the current state of the art in pHHI through three core pillars: (i) humanoid modeling and control, (ii) human intent estimation, and (iii) computational human models. For each pillar, we survey representative approaches, identify open challenges, and analyze current limitations that hinder robust, scalable, and adaptive interaction. These include the need for whole-body control strategies capable of handling uncertain human dynamics, real-time intent inference under limited sensing, and modeling techniques that account for variability in human physical states. Although significant progress has been made within each domain, integration across pillars remains limited. We propose pathways for unifying methods across these areas to enable cohesive interaction frameworks. This structure enables us not only to map the current landscape but also to propose concrete directions for future research that aim to bridge these domains. Additionally, we introduce a unified taxonomy of interaction types based on modality, distinguishing between direct interactions (e.g., physical contact) and indirect interactions (e.g., object-mediated), and on the level of robot engagement, ranging from assistance to cooperation and collaboration. For each category in this taxonomy, we provide the three core pillars that highlight opportunities for cross-pillar unification. Our goal is to suggest avenues to advance robust, safe, and intuitive physical interaction, providing a roadmap for future research that will allow humanoid systems to effectively understand, anticipate, and collaborate with human partners in diverse real-world settings.

2512.05136 2026-05-19 cs.CV cs.AI

Evo-Memory: Benchmarking Test-Time Learning of LLM Agents with Self-Evolving Memory

Tianxin Wei, Noveen Sachdeva, Benjamin Coleman, Zhankui He, Yuanchen Bei, Xuying Ning, Mengting Ai, Yunzhe Li, Jingrui He, Ed H. Chi, Chi Wang, Shuo Chen, Fernando Pereira, Wang-Cheng Kang, Derek Zhiyuan Cheng

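AI summary: This paper proposes Evo-Memory, a comprehensive streaming benchmark and framework for evaluating self-evolving memory in LLM agents. It structures datasets into sequential task streams that require the LLM to search, adapt, and evolve its memory after each interaction, and evaluates more than ten representative memory modules on 10 diverse multi-turn goal-oriented and single-turn reasoning and QA datasets.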

English abstract

Statefulness is essential for large language model (LLM) agents to perform long-term planning and problem-solving. This makes memory a critical component, yet its management and evolution remain largely underexplored. Existing evaluations mostly focus on static conversational settings, where memory is passively retrieved from dialogue to answer queries, overlooking the dynamic ability to accumulate and reuse experience across evolving task streams. In real-world environments such as interactive problem assistants or embodied agents, LLMs are required to handle continuous task streams, yet often fail to learn from accumulated interactions, losing valuable contextual insights, a limitation that calls for test-time evolution, where LLMs retrieve, integrate, and update memory continuously during deployment. To bridge this gap, we introduce Evo-Memory, a comprehensive streaming benchmark and framework for evaluating self-evolving memory in LLM agents. Evo-Memory structures datasets into sequential task streams, requiring LLMs to search, adapt, and evolve memory after each interaction. We unify and implement over ten representative memory modules and evaluate them across 10 diverse multi-turn goal-oriented and single-turn reasoning and QA datasets. To better benchmark experience reuse, we provide a baseline method, ExpRAG, for retrieving and utilizing prior experience, and further propose ReMem, an action-think-memory refine pipeline that tightly integrates reasoning, task actions, and memory updates to achieve continual improvement.
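
The search, act, and update cycle over a task stream described above might be sketched as follows. This is a minimal illustration under stated assumptions, not the Evo-Memory framework or the ExpRAG baseline: `ExperienceMemory`, `run_stream`, and `toy_solver` are hypothetical names, token-overlap similarity stands in for a real retriever, and the solver is a placeholder for an LLM call.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Experience = Tuple[str, str]  # (task description, outcome)


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; a stand-in for a learned retriever."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


@dataclass
class ExperienceMemory:
    entries: List[Experience] = field(default_factory=list)

    def search(self, task: str, k: int = 2) -> List[Experience]:
        """Return the k stored experiences most similar to the new task."""
        ranked = sorted(self.entries, key=lambda e: jaccard(task, e[0]), reverse=True)
        return ranked[:k]

    def update(self, task: str, outcome: str) -> None:
        """Store the latest interaction so later tasks can reuse it."""
        self.entries.append((task, outcome))


def run_stream(tasks: List[str],
               solve: Callable[[str, List[Experience]], str]) -> List[int]:
    """Process tasks in order; return how many experiences each task retrieved."""
    memory = ExperienceMemory()
    retrieved_counts = []
    for task in tasks:
        context = memory.search(task)      # search prior experience
        outcome = solve(task, context)     # act, conditioned on what was retrieved
        memory.update(task, outcome)       # evolve memory after the interaction
        retrieved_counts.append(len(context))
    return retrieved_counts


def toy_solver(task: str, context: List[Experience]) -> str:
    """Placeholder for an LLM call."""
    return f"solved '{task}' using {len(context)} prior experiences"


counts = run_stream(["sort a list", "sort a list of pairs", "parse a date"], toy_solver)
```

The point of the loop is that memory is written after every task rather than being a fixed store queried at the end, which is the distinction the abstract draws against static conversational evaluations.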

2511.11654 2026-05-19 cs.LG cs.AI cs.MA

Convergence of Multiagent Learning Systems for Traffic control

Sayambhu Sen, Shalabh Bhatnagar

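AI summary: This paper studies the convergence of multi-agent reinforcement learning for traffic signal control, analyzes the learning dynamics with stochastic approximation methods, and proves that the algorithm converges under specific conditions.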

Comments 14 pages, 2 figures

English abstract

Rapid urbanization in cities like Bangalore has led to severe traffic congestion, making efficient Traffic Signal Control (TSC) essential. Multi-Agent Reinforcement Learning (MARL), often modeling each traffic signal as an independent agent using Q-learning, has emerged as a promising strategy to reduce average commuter delays. While prior work by Prashant L A et al. has empirically demonstrated the effectiveness of this approach, a rigorous theoretical analysis of its stability and convergence properties in the context of traffic control has not been carried out. This paper bridges that gap by focusing squarely on the theoretical basis of this multi-agent algorithm. We investigate the convergence problem inherent in using independent learners for the cooperative TSC task. Utilizing stochastic approximation methods, we formally analyze the learning dynamics. The primary contribution of this work is a proof that the specific multi-agent reinforcement learning algorithm for traffic control converges under the given conditions, extending single-agent convergence proofs for asynchronous value iteration.
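
For readers unfamiliar with the independent-learner setup, the sketch below shows textbook tabular Q-learning with one learner per traffic signal. The abstract does not give the exact algorithm or the conditions under which convergence is proven, so the state encoding, the reward (for example a negative queue length), and the class name `IndependentQLearner` are all assumptions made for illustration.

```python
import random
from collections import defaultdict
from typing import Dict, Hashable, List, Tuple


class IndependentQLearner:
    """One learner per traffic signal; each keeps its own Q-table."""

    def __init__(self, actions: List[int], alpha: float = 0.1, gamma: float = 0.9):
        self.actions = actions
        self.alpha = alpha      # step size
        self.gamma = gamma      # discount factor
        self.q: Dict[Tuple[Hashable, int], float] = defaultdict(float)

    def act(self, state: Hashable, epsilon: float, rng: random.Random) -> int:
        """Epsilon-greedy action selection over this signal's own Q-values."""
        if rng.random() < epsilon:
            return rng.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def update(self, state: Hashable, action: int,
               reward: float, next_state: Hashable) -> None:
        """Standard Q-learning update; other agents are treated as part of the environment."""
        best_next = max(self.q[(next_state, a)] for a in self.actions)
        td_target = reward + self.gamma * best_next
        self.q[(state, action)] += self.alpha * (td_target - self.q[(state, action)])


# Two signals learning side by side from their own local observations.
agents = [IndependentQLearner(actions=[0, 1]) for _ in range(2)]
agents[0].update(state="queue_high", action=0, reward=-4.0, next_state="queue_low")
```

Because each learner only updates the state-action pairs it visits, the updates are asynchronous, which is why single-agent results on asynchronous value iteration are a natural starting point for the analysis the abstract describes.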

2511.07288 2026-05-19 cs.LG cs.AI

Enabling Off-Policy Imitation Learning with Deep Actor Critic Stabilization

Sayambhu Sen, Shalabh Bhatnagar

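AI summary: This paper proposes an adversarial imitation learning algorithm that incorporates off-policy learning, using double Q-network stabilization and value learning (without reward function inference) to improve sample efficiency and match expert behavior more efficiently.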

Comments 14 pages and 4 images

2511.06316 2026-05-19 cs.AI

RADRON: Cooperative Localization of Ionizing Radiation Sources by Micro Aerial Vehicles Equipped with Compton Cameras

Petr Stibinger, Tomas Baca, Daniela Doubravova, Jan Rusnak, Jaroslav Solc, Jan Jakubek, Petr Stepan, Martin Saska

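AI summary: This study proposes a new method for cooperatively localizing radioactive material with micro aerial vehicles, using Compton cameras to estimate the radiation source position in real time and achieving highly sensitive detection even from sparse measurements.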

Comments 8 pages, 9 figures, submitted for review to IEEE RA-L

DOI: 10.1109/LRA.2026.3688053

English abstract

We present a novel approach to localizing radioactive material by cooperating Micro Aerial Vehicles (MAVs). Our approach utilizes a state-of-the-art single-detector Compton camera as a highly sensitive, yet miniature detector of ionizing radiation. The detector's exceptionally low weight (40 g) opens up new possibilities of radiation detection by a team of cooperating agile MAVs. We propose a new fundamental concept of fusing the Compton camera measurements to estimate the position of the radiation source in real time even from extremely sparse measurements. The data readout and processing are performed directly onboard and the results are used in a dynamic feedback to drive the motion of the vehicles. The MAVs are stabilized in a tightly cooperating swarm to maximize the information gained by the Compton cameras, rapidly locate the radiation source, and even track a moving radiation source.

2510.24680 2026-05-19 cs.RO

DocReward: A Document Reward Model for Structuring and Stylizing Documents

Junpeng Liu, Yuzhong Zhao, Bowen Cao, Jiayu Ding, Yilin Jia, Tengchao Lv, Yupan Huang, Wenshan Wu, Shaohan Huang, Nan Yang, Li Dong, Lei Cui, Tao Ge, Xun Wang, Huitian Jiao, Sun Mao, FNU Kartik, Si-Qing Chen, Wai Lam, Furu Wei

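AI summary: This paper proposes DocReward, a reward model that evaluates document structure and style. It is trained with the Bradley-Terry loss on DocPair, a dataset of 117,000 paired documents, and effectively improves the structural and stylistic professionalism of generated documents.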

English abstract

Recent agentic workflows automate professional document generation but focus narrowly on textual quality, overlooking structural and stylistic professionalism, which is equally critical for readability. This gap stems mainly from a lack of effective reward models capable of guiding agents toward producing documents with high structural and stylistic professionalism. We introduce DocReward, a document reward model that evaluates documents based on their structure and style. To achieve this, we propose a textual-quality-agnostic framework that ensures assessments are not confounded by content quality, and construct DocPair, a dataset of 117K paired documents covering 32 domains and 267 types. Each pair shares identical content but differs in structural and stylistic professionalism. DocReward is trained using the Bradley-Terry loss. On a manually annotated benchmark, DocReward outperforms GPT-5 by 14.6 percentage points in the same setting. Reinforcement learning experiments further show that DocReward effectively guides agents toward generating documents with consistently higher structural and stylistic professionalism, highlighting its practical utility.
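
The Bradley-Terry loss named above has a standard pairwise form: for each pair, the loss is the negative log-sigmoid of the score margin between the preferred and the rejected item. The sketch below computes it for scalar reward scores. How DocReward produces those scores from a document is not described in the abstract, so the inputs here are plain hypothetical numbers.

```python
import math
from typing import Sequence


def bradley_terry_loss(preferred: Sequence[float], rejected: Sequence[float]) -> float:
    """Mean of -log(sigmoid(r_preferred - r_rejected)) over all pairs."""
    if len(preferred) != len(rejected) or not preferred:
        raise ValueError("need equal-length, non-empty score lists")
    total = 0.0
    for r_pos, r_neg in zip(preferred, rejected):
        margin = r_pos - r_neg
        # Numerically stable softplus(-margin) = log(1 + exp(-margin)).
        total += max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
    return total / len(preferred)


# Hypothetical scores: each pair shares content, the first item is the more professional layout.
good_ranking = bradley_terry_loss(preferred=[2.0, 1.5], rejected=[-1.0, 0.0])
bad_ranking = bradley_terry_loss(preferred=[-1.0, 0.0], rejected=[2.0, 1.5])
```

The loss depends only on score differences within a pair, which fits a dataset where the two documents in a pair share identical content and differ only in structure and style.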

2510.10930 2026-05-19 cs.CL cs.AI

Evaluating Language Models' Evaluations of Games

Katherine M. Collins, Cedegao E. Zhang, Graham Todd, Lance Ying, Mauricio Barba da Costa, Ryan Liu, Prafull Sharma, Adrian Weller, Ionatan Kuperwajs, Lionel Wong, Joshua B. Tenenbaum, Thomas L. Griffiths

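AI summary: This paper studies how well language models evaluate games by comparing the evaluations of modern language models with those of people and symbolic computational agents. Reasoning models are closer to people in their game evaluations, but their fit to human data weakens as they approach game-theoretic optimality, and they vary more when assessing funness.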

English abstract

Reasoning is not just about solving problems -- it is also about evaluating which problems are worth solving at all. Evaluations of artificial intelligence (AI) systems have primarily focused on problem solving, historically by studying how models play games such as chess and Go. In this paper, we advocate for a new paradigm that assesses AI systems' evaluation of games. First, we introduce a formalism for evaluating such evaluations. We then leverage a large-scale dataset of over 100 novel board games and over 450 human judgments to compare evaluations produced by modern language and reasoning models against those of people and symbolic computational agents. We consider two kinds of evaluative queries: assessing the payoff (or fairness) and the funness of games. These queries span two dimensions relevant to the design of evaluations of AI evaluations: how complex a query is to compute and how difficult a query is to quantify. Our results show that reasoning models are generally more aligned to people in their evaluations of games than non-reasoning language models. However, we observe a non-monotonic relationship: as models get closer to game-theoretic optimal, their fit to human data weakens. We also observe more "jaggedness" across models for assessing funness, in line with the greater difficulty of quantifying this query. Across queries and games, reasoning models show highly variable and unpredictable resource usage when assessing queries, pointing to the importance of imbuing more resource-rational meta-reasoning in language and reasoning models.

2510.08141 2026-05-19 cs.LG

SCOPE-RL: Stable and Quantitative Control of Policy Entropy in RL Post-Training

Chen Wang, Zhaochun Li, Jionghao Bai, Hexuan Deng, Ge Lan, Yue Wang

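AI summary: This paper proposes SCOPE-RL, a framework that stabilizes and quantitatively controls policy entropy in RL post-training through a regularization term constructed from temperature-adaptive positive samples. Experiments show that it outperforms existing baselines on Pass@1 and Pass@$k$.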

English abstract

Reinforcement learning (RL) is a key paradigm for post-training large language models (LLMs), but the widely used Group Relative Policy Optimization (GRPO) often suffers from entropy collapse: exploration quickly disappears, policies converge prematurely, and sample diversity declines, ultimately harming training effectiveness. Existing remedies, including entropy bonuses and clip-based methods, rarely keep entropy within a stable exploration regime and often introduce oscillatory entropy or reward degradation. In this work, we identify a previously overlooked asymmetry in entropy dynamics: under high-temperature sampling, positive and negative samples have opposite effects on policy entropy. Specifically, high-temperature positive samples promote entropy growth, whereas negative samples suppress it. We provide a theoretical explanation for this phenomenon: when entropy decreases during policy updates, its derivative with respect to temperature is strictly positive under positive-sample updates, indicating that high-temperature positive samples can counteract entropy decay, thereby slowing entropy collapse and potentially reversing it. Motivated by this insight, we propose SCOPE-RL, a stable and quantitative entropy control framework through a regularization term constructed from temperature-adaptive positive samples. Extensive experiments show that SCOPE-RL consistently outperforms strong RL baselines on both Pass@1 and Pass@$k$. Our results provide evidence that escaping entropy collapse can improve reasoning performance, while also showing that the benefit is non-monotonic, with an optimal level of exploration for RL post-training in reasoning LLMs.
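
As background for the role of sampling temperature, the sketch below computes the entropy of a softmax distribution at several temperatures. It only illustrates the basic fact that a higher temperature yields a higher-entropy sampling distribution for non-uniform logits. It does not reproduce the paper's result on how positive and negative samples move policy entropy during updates, nor the SCOPE-RL regularizer. The logits are hypothetical.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Temperature-scaled softmax with the usual max-subtraction for stability."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: List[float]) -> float:
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


logits = [2.0, 1.0, 0.2, -1.0]  # hypothetical next-token logits
temperatures = (0.5, 1.0, 1.5, 2.0)
entropies = [entropy(softmax(logits, t)) for t in temperatures]
for t, h in zip(temperatures, entropies):
    print(f"T={t:.1f}  entropy={h:.4f}")
```

Entropy collapse corresponds to this quantity shrinking toward zero over training, and the upper bound for four outcomes is log(4).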

2510.06809 2026-05-19 cs.CV

VA-Adapter: Adapting Ultrasound Foundation Model to Echocardiography Probe Guidance

Teng Wang, Haojun Jiang, Yuxuan Wang, Zhenguo Sun, Yujiao Deng, Shiji Song, Gao Huang

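AI summary: This paper proposes VA-Adapter, which equips an ultrasound foundation model with the ability to understand individual three-dimensional structures, improving the accuracy and efficiency of echocardiography probe guidance. Experiments show that it outperforms existing models with fewer parameters.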

Comments MICCAI2026 Early Accept Paper

English abstract

Echocardiography is a critical tool for detecting heart diseases, yet its steep operational difficulty causes a shortage of skilled personnel. Probe guidance systems, which assist in acquiring high-quality images, offer a promising solution to lower this operational barrier. However, robust probe guidance remains challenging due to significant individual variability. This variability manifests as differences in low-level features within two-dimensional (2D) images, which complicate image feature understanding, and differences in individual three-dimensional (3D) structures, which pose challenges for precise navigation. To address these challenges, we first propose leveraging the robust image representations learned by ultrasound foundation models from vast datasets. Yet, applying these models to probe navigation is non-trivial due to their lack of understanding of individual 3D structures. To this end, we meticulously design a Vision-Action Adapter (VA-Adapter) to inject the capability of understanding individual 3D structures online. Specifically, by embedding the VA-Adapter into the foundation model's image encoder, the model can infer cardiac anatomy from historical vision-action sequences, mimicking the cognitive process of a sonographer. Extensive experiments on a dataset with over 1.31M samples demonstrate that the VA-Adapter outperforms strong probe guidance models while requiring approximately 33 times fewer trained parameters. Code is available at https://github.com/LeapLabTHU/VA-Adapter.

2510.04930 2026-05-19 cs.LG

Egalitarian Gradient Descent: A Simple Approach to Accelerated Grokking

Ali Saheb Pasand, Elvis Dohmatob

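AI summary: This paper proposes egalitarian gradient descent (EGD), which normalizes gradients so that the dynamics along all principal directions evolve at the same speed, accelerating grokking and eliminating the plateau in test performance.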

English abstract

Grokking is the phenomenon whereby, unlike the training performance, which peaks early in the training process, the test/generalization performance of a model stagnates over arbitrarily many epochs and then suddenly jumps to usually close to perfect levels. In practice, it is desirable to reduce the length of such plateaus, that is, to make the learning process "grok" faster. In this work, we provide new insights into grokking. First, we show both empirically and theoretically that grokking can be induced by asymmetric speeds of (stochastic) gradient descent along different principal (i.e., singular) directions of the gradients. We then propose a simple modification that normalizes the gradients so that the dynamics along all principal directions evolve at exactly the same speed. Then, we establish that this modified method, which we call egalitarian gradient descent (EGD) and which can be seen as a carefully modified form of natural gradient descent, groks much faster. In fact, in some cases the stagnation is completely removed. Finally, we empirically show that on classical arithmetic problems such as modular addition and sparse parity, where this stagnation has been widely observed and intensively studied, our proposed method eliminates the plateaus.
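
The effect of unequal speeds along principal directions can be seen in a toy problem. The sketch below is not the authors' EGD: it assumes a diagonal quadratic loss already expressed in its principal directions, where dividing each gradient component by its curvature (which coincides with a Newton step in this toy) makes every direction shrink at the same rate. The curvature values, learning rate, and function name are illustrative assumptions.

```python
from typing import List


def steps_to_converge(curvatures: List[float], lr: float, egalitarian: bool,
                      tol: float = 1e-3, max_steps: int = 100_000) -> List[int]:
    """Per-direction step counts until |error| < tol on a diagonal quadratic loss."""
    errors = [1.0 for _ in curvatures]   # initial error along each direction
    done = [-1 for _ in curvatures]      # -1 means "not converged yet"
    for step in range(1, max_steps + 1):
        for i, lam in enumerate(curvatures):
            grad = lam * errors[i]       # gradient of 0.5 * lam * error**2
            if egalitarian:
                grad /= lam              # equalize the speed across directions
            errors[i] -= lr * grad
            if done[i] < 0 and abs(errors[i]) < tol:
                done[i] = step
        if all(d >= 0 for d in done):
            break
    return done


curvatures = [1.0, 0.1, 0.001]  # hypothetical, widely spread curvatures
plain = steps_to_converge(curvatures, lr=0.5, egalitarian=False)
equal = steps_to_converge(curvatures, lr=0.5, egalitarian=True)
print("plain gradient descent:", plain)
print("equalized speeds:      ", equal)
```

With plain gradient descent the low-curvature direction lags far behind the others, a simple analogue of a long plateau; with equalized speeds all directions finish together.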

2510.02590 2026-05-19 cs.LG

Use the Online Network If You Can: Towards Fast and Stable Reinforcement Learning

Ahmed Hendawy, Henrik Metternich, Théo Vincent, Mahdi Kallel, Jan Peters, Carlo D'Eramo

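AI summary: This paper proposes a new update rule that computes the target from the minimum estimate between the target network and the online network, improving value function learning and yielding faster and more stable reinforcement learning.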

Comments Accepted at the Fourteenth International Conference on Learning Representations (ICLR 2026)

English abstract

The use of target networks is a popular approach for estimating value functions in deep Reinforcement Learning (RL). While effective, the target network remains a compromise solution that preserves stability at the cost of slowly moving targets, thus delaying learning. Conversely, using the online network as a bootstrapped target is intuitively appealing, albeit well-known to lead to unstable learning. In this work, we aim to obtain the best of both worlds by introducing a novel update rule that computes the target using the MINimum estimate between the Target and Online network, giving rise to our method, MINTO. Through this simple, yet effective modification, we show that MINTO enables faster and stable value function learning, by mitigating the potential overestimation bias of using the online network for bootstrapping. Notably, MINTO can be seamlessly integrated into a wide range of value-based and actor-critic algorithms with a negligible cost. We evaluate MINTO extensively across diverse benchmarks, spanning online and offline RL, as well as discrete and continuous action spaces. Across all benchmarks, MINTO consistently improves performance, demonstrating its broad applicability and effectiveness.
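
The update rule described above can be sketched for a discrete-action, Q-learning-style target. The abstract only states that the target uses the minimum estimate between the target and online networks. The ordering assumed here (a per-action minimum followed by a greedy maximum over actions) and the function name `minto_style_target` are assumptions, and the Q-values are hypothetical numbers rather than network outputs.

```python
from typing import Sequence


def minto_style_target(reward: float, gamma: float,
                       q_online_next: Sequence[float],
                       q_target_next: Sequence[float],
                       done: bool = False) -> float:
    """Bootstrapped target built from the smaller of the online and target estimates."""
    if done:
        return reward  # no bootstrapping past a terminal state
    # Assumption: take the minimum per action, then act greedily on the result.
    pessimistic = [min(qo, qt) for qo, qt in zip(q_online_next, q_target_next)]
    return reward + gamma * max(pessimistic)


q_online = [2.0, 5.0]   # fresher estimate, possibly overestimated for action 1
q_target = [3.0, 1.5]   # lagging estimate
y_min = minto_style_target(1.0, 0.99, q_online, q_target)
y_online_only = 1.0 + 0.99 * max(q_online)
y_target_only = 1.0 + 0.99 * max(q_target)
```

In this example the combined target is lower than either single-network target, which shows how taking the minimum can damp an optimistic online estimate while still using the online network's information where it is the smaller value.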

2509.23183 2026-05-19 cs.LG cs.NI

ZeroSiam: An Efficient Asymmetry for Test-Time Entropy Optimization without Collapse

Guohao Chen, Shuaicheng Niu, Deyu Chen, Jiahao Yang, Zitian Zhang, Mingkui Tan, Pengcheng Wu, Zhiqi Shen

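AI summary: This paper proposes ZeroSiam, an efficient asymmetric architecture for test-time entropy minimization. It prevents collapse through asymmetric divergence alignment, implemented with a learnable predictor and a stop-gradient operator. Experiments and theory show that it prevents collapse and regularizes biased learning signals, improving performance and remaining stable on collapse-prone small models.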

English abstract

Test-time entropy minimization helps adapt a model to novel environments and incentivize its reasoning capability, unleashing the model's potential during inference by allowing it to evolve and improve in real-time using its own predictions, achieving promising performance. However, pure entropy minimization can favor non-generalizable shortcuts, such as inflating the logit norm and driving all predictions to a dominant class to reduce entropy, risking collapsed solutions (e.g., constant one-hot outputs) that trivially minimize the objective without meaningful learning. In this paper, we reveal asymmetry as a key mechanism for collapse prevention and introduce ZeroSiam--an efficient asymmetric Siamese architecture tailored for test-time entropy minimization. ZeroSiam prevents collapse through asymmetric divergence alignment, efficiently achieved by a learnable predictor and a stop-gradient operator before the classifier. We provide empirical and theoretical evidence that ZeroSiam not only prevents collapse, but also regularizes biased learning signals, enhancing performance even when no collapse occurs. Despite its simplicity, extensive results show that ZeroSiam performs more stably than prior methods with negligible overhead, demonstrating efficacy on both vision adaptation and large language model reasoning tasks across challenging test scenarios and diverse models, including particularly collapse-prone tiny models.
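
The collapse failure mode described above can be made concrete with a small diagnostic. The sketch below does not implement ZeroSiam (no predictor or stop-gradient is shown). It only illustrates, on hypothetical prediction batches, that the mean per-sample entropy cannot tell a confident and diverse batch apart from a batch of constant one-hot outputs, whereas the entropy of the batch-averaged prediction can.

```python
import math
from typing import List


def entropy(p: List[float]) -> float:
    """Shannon entropy in nats."""
    return -sum(x * math.log(x) for x in p if x > 0.0)


def mean_entropy(batch: List[List[float]]) -> float:
    """The quantity that test-time entropy minimization drives down."""
    return sum(entropy(p) for p in batch) / len(batch)


def marginal_entropy(batch: List[List[float]]) -> float:
    """Entropy of the batch-averaged prediction; zero signals a single-class collapse."""
    num_classes = len(batch[0])
    mean_pred = [sum(p[i] for p in batch) / len(batch) for i in range(num_classes)]
    return entropy(mean_pred)


confident = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # diverse predictions
collapsed = [[1.0, 0.0, 0.0]] * 3                                # constant one-hot output

for name, batch in (("confident", confident), ("collapsed", collapsed)):
    print(name, mean_entropy(batch), marginal_entropy(batch))
```

Both batches minimize the entropy objective equally well, so the objective alone offers no protection against the collapsed solution; that is the gap an additional mechanism has to close.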
