arXivDaily: arXiv daily academic digest, updated Monday through Friday
2601.09495 2026-05-19 cs.LG

Parallelizable memory recurrent units

Florent De Geeter, Gaspard Lambrechts, Damien Ernst, Guillaume Drion

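AI summary: This paper proposes memory recurrent units (MRUs), a new family of recurrent neural networks that combines the persistent memory of nonlinear recurrent networks with the parallelizable computation of state-space models. MRUs obtain persistent memory from multistability and drop transient dynamics for efficiency, and the paper demonstrates their effectiveness on tasks with long-term dependencies.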

Comments 19 pages, 12 figures. This work has been the subject of patent applications (Numbers: EP26151077 and EP26175248.9)

Abstract

With the emergence of massively parallel processing units, parallelization has become a desirable property for new sequence models. The ability to parallelize the processing of sequences with respect to the sequence length during training is one of the main factors behind the rise of the Transformer architecture. However, Transformers are inefficient at sequence generation, as they need to reprocess all past timesteps at every generation step. Recently, state-space models (SSMs) emerged as a more efficient alternative. These new kinds of recurrent neural networks (RNNs) keep the efficient update of RNNs while gaining parallelization by getting rid of nonlinear dynamics (or recurrence). SSMs can reach state-of-the-art performance through the efficient training of potentially very large networks, but still suffer from limited representation capabilities. In particular, SSMs cannot exhibit persistent memory, that is, the capacity to retain information for an infinite duration, because of their monostability. In this paper, we introduce a new family of RNNs, the memory recurrent units (MRUs), that combine the persistent memory capabilities of nonlinear RNNs with the parallelizable computations of SSMs. These units leverage multistability as a source of persistent memory, while getting rid of transient dynamics for efficient computations. We then derive a specific implementation as a proof of concept: the bistable memory recurrent unit (BMRU). This new RNN is compatible with the parallel scan algorithm. We show that BMRU achieves good results in tasks with long-term dependencies, and can be combined with state-space models to create hybrid networks that are parallelizable and have transient dynamics as well as persistent memory.
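
As a minimal sketch of why dropping nonlinear recurrence enables the parallel scan mentioned above: a linear recurrence h_t = a_t * h_{t-1} + b_t is a chain of affine maps, and composing affine maps is associative. The code below illustrates only that idea. It is not the BMRU update, which the abstract does not specify, and the function names are hypothetical.

```python
def combine(left, right):
    """Compose two affine maps h -> a*h + b; `left` is applied first."""
    a1, b1 = left
    a2, b2 = right
    return (a2 * a1, a2 * b1 + b2)


def sequential_states(a, b, h0=0.0):
    """Reference implementation: one step after another."""
    states, h = [], h0
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        states.append(h)
    return states


def scan_states(a, b, h0=0.0):
    """Inclusive scan by recursive doubling (Hillis-Steele style)."""
    elems = list(zip(a, b))
    n = len(elems)
    step = 1
    while step < n:
        nxt = list(elems)
        # Every iteration of this inner loop is independent of the others,
        # so a parallel machine could run them all at once.
        for i in range(step, n):
            nxt[i] = combine(elems[i - step], elems[i])
        elems = nxt
        step *= 2
    # elems[t] now maps h0 directly to h_t.
    return [a_t * h0 + b_t for a_t, b_t in elems]
```

Within each round of `scan_states` the updates do not depend on one another, so on parallel hardware a sequence of length n needs about log2(n) rounds instead of n sequential steps.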

2601.09413 2026-05-19 cs.SD cs.AI cs.CL cs.MA eess.AS

Speech-Hands: A Self-Reflection Voice Agentic Approach to Speech Recognition and Audio Reasoning with Omni Perception

Zhen Wan, Chao-Han Huck Yang, Jinchuan Tian, Hanrong Ye, Ankita Pasad, Szu-wei Fu, Arushi Goel, Ryo Hachiuma, Shizhe Diao, Kunal Dhawan, Sreyan Ghosh, Yusuke Hirota, Zhehuai Chen, Rafael Valle, Chenhui Chu, Shinji Watanabe, Yu-Chiang Frank Wang, Boris Ginsburg

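AI summary: This paper proposes the Speech-Hands framework, which uses a self-reflection decision mechanism to decide when a model should trust its own output versus external candidates in speech recognition and external sound understanding tasks, improving accuracy and robustness in multi-task audio reasoning.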

Comments Accepted to ACL 2026. Oral Presentation. Code: https://github.com/YukinoWan/Speech-Hands OpenClaw Branch: https://github.com/openclaw/openclaw/pull/69073

Abstract

We introduce a voice-agentic framework that learns one critical omni-understanding skill: knowing when to trust itself versus when to consult external audio perception. Our work is motivated by a crucial yet counterintuitive finding: naively fine-tuning an omni-model on both speech recognition and external sound understanding tasks often degrades performance, as the model can be easily misled by noisy hypotheses. To address this, our framework, Speech-Hands, recasts the problem as an explicit self-reflection decision. This learnable reflection primitive proves effective in preventing the model from being derailed by flawed external candidates. We show that this agentic action mechanism generalizes naturally from speech recognition to complex, multiple-choice audio reasoning. Across the OpenASR leaderboard, Speech-Hands consistently outperforms strong baselines by 12.1% WER on seven benchmarks. The model also achieves 77.37% accuracy and high F1 on audio QA decisions, showing robust generalization and reliability across diverse audio question answering datasets. By unifying perception and decision-making, our work offers a practical path toward more reliable and resilient audio intelligence.

2601.08679 2026-05-19 cs.AI

PersonaDual: Balancing Personalization and Objectivity via Adaptive Reasoning

Xiaoyou Liu, Xinyi Mou, Shengbin Yue, Liang Wang, Yuqing Wang, Qiexiang Wang, Tianrui Qin, Zhongyu Wei

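AI summary: This paper proposes PersonaDual, a framework that balances general-purpose objective reasoning and personalized reasoning within a single model by adaptively switching modes, reducing interference and improving objective problem-solving.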

Abstract

As users increasingly expect LLMs to align with their preferences, personalized information becomes valuable. However, personalized information can be a double-edged sword: it can improve interaction but may compromise objectivity and factual correctness, especially when it is misaligned with the question. To alleviate this problem, we propose PersonaDual, a framework that supports both general-purpose objective reasoning and personalized reasoning in a single model, and adaptively switches modes based on context. PersonaDual is first trained with SFT to learn two reasoning patterns, and then further optimized via reinforcement learning with our proposed DualGRPO to improve mode selection. Experiments on objective and personalized benchmarks show that PersonaDual preserves the benefits of personalization while reducing interference, achieving near interference-free performance and better leveraging helpful personalized signals to improve objective problem-solving.

2601.05653 2026-05-19 cs.RO cs.MA

EvoQRE: Modeling Bounded Rationality in Safety-Critical Traffic Simulation via Evolutionary Quantal Response Equilibrium

Phu-Hoa Pham, Chi-Nguyen Tran, Duy-Minh Dao-Sy, Phu-Quy Nguyen-Lam, Trung-Kiet Huynh

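AI summary: This paper proposes EvoQRE, a framework that models safety-critical traffic interactions with quantal response equilibrium and evolutionary game dynamics. It proves convergence to Logit-QRE under weak monotonicity assumptions and reports better realism and safety metrics on the Waymo and nuPlan datasets.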

Comments This article is being withdrawn due to identified issues in the experimental evaluation and theoretical assumptions that may affect the validity of some reported conclusions. The authors plan to revise the methodology and provide a corrected version in future work.

Abstract

Existing traffic simulation frameworks for autonomous vehicles typically rely on imitation learning or game-theoretic approaches that solve for Nash or coarse correlated equilibria, implicitly assuming perfectly rational agents. However, human drivers exhibit bounded rationality, making approximately optimal decisions under cognitive and perceptual constraints. We propose EvoQRE, a principled framework for modeling safety-critical traffic interactions as general-sum Markov games solved via Quantal Response Equilibrium (QRE) and evolutionary game dynamics. EvoQRE integrates a pre-trained generative world model with entropy-regularized replicator dynamics, capturing stochastic human behavior while maintaining equilibrium structure. We provide rigorous theoretical results, proving that the proposed dynamics converge to Logit-QRE under a two-timescale stochastic approximation with an explicit convergence rate of O(log k / k^{1/3}) under weak monotonicity assumptions. We further extend QRE to continuous action spaces using mixture-based and energy-based policy representations. Experiments on the Waymo Open Motion Dataset and nuPlan benchmark demonstrate that EvoQRE achieves state-of-the-art realism, improved safety metrics, and controllable generation of diverse safety-critical scenarios through interpretable rationality parameters.
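
To make the Logit-QRE concept concrete, here is a minimal sketch for a two-player matrix game: each player plays a softmax over expected payoffs, scaled by a rationality parameter, and the equilibrium is a fixed point of those responses. The damped fixed-point iteration is a simple stand-in chosen for illustration, not the entropy-regularized replicator dynamics of the paper, and all names (`logit_response`, `logit_qre`, `lam`) are hypothetical.

```python
import math


def logit_response(payoffs, opponent, lam):
    """Logit (softmax) response with rationality parameter lam.

    payoffs[a][b] is this player's payoff for own action a against
    opponent action b; opponent is the opponent's mixed strategy.
    """
    expected = [sum(p * u for p, u in zip(opponent, row)) for row in payoffs]
    top = max(expected)  # subtract the max for numerical stability
    weights = [math.exp(lam * (e - top)) for e in expected]
    total = sum(weights)
    return [w / total for w in weights]


def logit_qre(payoffs_row, payoffs_col, lam, p, q, iters=200, damping=0.5):
    """Damped fixed-point iteration starting from strategies (p, q)."""
    for _ in range(iters):
        p_new = logit_response(payoffs_row, q, lam)
        q_new = logit_response(payoffs_col, p, lam)
        p = [(1 - damping) * old + damping * new for old, new in zip(p, p_new)]
        q = [(1 - damping) * old + damping * new for old, new in zip(q, q_new)]
    return p, q


# Matching pennies; each matrix is indexed [own action][opponent action].
ROW = [[1.0, -1.0], [-1.0, 1.0]]
COL = [[-1.0, 1.0], [1.0, -1.0]]
```

With `lam = 0` every action is equally likely regardless of payoffs, and larger values of `lam` move the response toward a strict best response, which is the sense in which such a parameter is an interpretable measure of rationality. The damped iteration is not guaranteed to converge for arbitrary games or large `lam`.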

2601.03851 2026-05-19 cs.CL

Rethinking Table Pruning in TableQA: From Sequential Revisions to Gold Trajectory-Supervised Parallel Search

Yu Guo, Shenghao Ye, Shuangwu Chen, Zijian Wen, Tao Zhang, Qirui Bai, Dong Jin, Yunpeng Hou, Huasen He, Jian Yang, Xiaobin Tan

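AI summary: This paper proposes TabTrim, a framework that turns table pruning from sequential revisions into parallel search supervised by gold trajectories. It addresses the failure of existing methods to detect the loss of answer-critical data and achieves state-of-the-art performance on multiple tabular reasoning tasks.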

Comments 17 pages, 5 figures, accepted to ACL 2026 Oral

Abstract

Table Question Answering (TableQA) benefits significantly from table pruning, which extracts compact sub-tables by eliminating redundant cells to streamline downstream reasoning. However, existing pruning methods typically rely on sequential revisions driven by unreliable critique signals, often failing to detect the loss of answer-critical data. To address this limitation, we propose TabTrim, a novel table pruning framework which transforms table pruning from sequential revisions to gold trajectory-supervised parallel search. TabTrim derives a gold pruning trajectory using the intermediate sub-tables in the execution process of gold SQL queries, and trains a pruner and a verifier to make the step-wise pruning result align with the gold pruning trajectory. During inference, TabTrim performs parallel search to explore multiple candidate pruning trajectories and identify the optimal sub-table. Extensive experiments demonstrate that TabTrim achieves state-of-the-art performance across diverse tabular reasoning tasks: TabTrim-8B reaches 73.5% average accuracy, outperforming the strongest baseline by 3.2%, including 79.4% on WikiTQ and 61.2% on TableBench.

2601.01685 2026-05-19 cs.CL cs.AI cs.MA

Lying with Truths: Open-Channel Multi-Agent Collusion for Belief Manipulation via Generative Montage

Jinwei Hu, Xinmiao Huang, Youcheng Sun, Yi Dong, Xiaowei Huang

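AI summary: This paper studies a new threat in which colluding agents manipulate beliefs by distributing truthful evidence fragments through public channels. It proposes the Generative Montage framework, reports attack success rates of up to 74.4% across 14 LLM families, and finds that stronger reasoning capabilities increase susceptibility.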

Comments Accepted to the ACL 2026 Main Conference (Oral Presentation)

Abstract

As large language models (LLMs) transition to autonomous agents synthesizing real-time information, their reasoning capabilities introduce an unexpected attack surface. This paper introduces a novel threat where colluding agents steer victim beliefs using only truthful evidence fragments distributed through public channels, without relying on covert communications, backdoors, or falsified documents. By exploiting LLMs' overthinking tendency, we formalize the first cognitive collusion attack and propose Generative Montage: a Writer-Editor-Director framework that constructs deceptive narratives through adversarial debate and coordinated posting of evidence fragments, causing victims to internalize and propagate fabricated conclusions. To study this risk, we develop CoPHEME, a dataset derived from real-world rumor events, and simulate attacks across diverse LLM families. Our results show pervasive vulnerability across 14 LLM families: attack success rates reach 74.4% for proprietary models and 70.6% for open-weights models. Counterintuitively, stronger reasoning capabilities increase susceptibility, with reasoning-specialized models showing higher attack success than base models or prompts. Furthermore, these false beliefs then cascade to downstream judges, achieving over 60% deception rates, highlighting a socio-technical vulnerability in how LLM-based agents interact with dynamic information environments. Our implementation and data are available at: https://github.com/CharlesJW222/Lying_with_Truth/tree/main.

2512.18953 2026-05-19 cs.CV

Symmetry Matters: Auditing and Symmetrizing 3D Generative Models

Nicolas Caytuiro, Ivan Sipiran

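AI summary: This paper studies symmetry preservation in unconditional point cloud generation. It audits the symmetry of several 3D generative models using a normalized symmetry score based on the Chamfer distance and finds a persistent symmetry gap under a symmetry-aware evaluation protocol. After analyzing the training data, the authors propose a symmetry-aware intervention: training generative models on a half-objects dataset and reconstructing full objects by reflection at sampling time, which improves geometric consistency and visual plausibility.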

Comments 12 pages, 8 figures, 4 tables

Abstract

Symmetry is a strong prior present in many object categories, yet standard benchmarks for 3D generative models rarely report whether this prior is preserved. We study symmetry preservation in unconditional point cloud generation. We first audit the symmetry of shapes generated by several 3D generative models and compute a normalized symmetry score based on the Chamfer Distance (CD). We show that although current 3D generative models achieve competitive results under standard evaluation, they reveal a persistent symmetry gap when a symmetry-aware evaluation protocol is applied. To test whether this gap is merely inherited from the training data, we evaluate these models over a mirrored-objects dataset derived from ShapeNet and analyze symmetry dynamics during training. Mechanistic interpretability techniques were employed at the sampling and latent levels to further show that reflection symmetry is not reliably encoded in the learned generative process. Finally, to address this gap, we propose a data-centric symmetry-aware intervention: training generative models on a half-objects dataset and reconstructing full objects by reflection during sampling. Across multiple backbones, this intervention substantially improves geometric consistency and visual plausibility while remaining competitive under standard metrics. These findings suggest that symmetry-aware evaluation is needed alongside standard benchmarks, and future 3D generative models should incorporate this prior explicitly, either during training or sampling.
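
A Chamfer-based symmetry score of the kind described above could be sketched as follows: mirror the point cloud across a candidate plane and measure the Chamfer distance between the cloud and its mirror image. The abstract does not give the exact normalization or the choice of plane, so the plane x = 0 and the division by the squared bounding-box diagonal are assumptions made for illustration, and the function names are hypothetical.

```python
def chamfer_distance(a, b):
    """Symmetric Chamfer distance using mean squared nearest-neighbour distances."""

    def one_way(src, dst):
        total = 0.0
        for p in src:
            total += min(sum((pi - qi) ** 2 for pi, qi in zip(p, q)) for q in dst)
        return total / len(src)

    return one_way(a, b) + one_way(b, a)


def reflect_x(points):
    """Mirror a point cloud across the plane x = 0 (assumed symmetry plane)."""
    return [(-x, y, z) for x, y, z in points]


def symmetry_score(points):
    """Chamfer distance to the mirrored cloud, divided by the squared
    bounding-box diagonal (a hypothetical normalization). 0 means symmetric."""
    lo = [min(p[i] for p in points) for i in range(3)]
    hi = [max(p[i] for p in points) for i in range(3)]
    diag2 = sum((h - l) ** 2 for h, l in zip(hi, lo))
    if diag2 == 0:
        return 0.0
    return chamfer_distance(points, reflect_x(points)) / diag2
```

The sketch assumes the shape is already centered on the mirror plane. The brute-force nearest-neighbour search is quadratic in the number of points, which is fine for a toy example but would need a spatial index for real point clouds.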

2512.13506 2026-05-19 cs.LG stat.ML

Learning under Distributional Drift: Prequential Reproducibility as an Intrinsic Statistical Resource

Sofiya Zaichyk

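AI summary: This paper studies learning under distributional drift and proposes an intrinsic drift budget $C_T$ that quantifies the cumulative information-geometric motion of the data distribution along the realized learner-environment trajectory, measured in Fisher-Rao distance. The budget separates exogenous environmental change from feedback induced by the learner's actions, giving a rate-based characterization of prequential reproducibility. The paper proves a drift-feedback bound and a matching lower bound, showing that the dependence on the average Fisher-Rao motion rate is tight. It also proves an information-theoretic indistinguishability result and shows experimentally that appropriately chosen monitoring channels can retain risk-relevant drift signal.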

Comments Revised: Added additional experiment. Clarified lower bound

Abstract

Statistical learning under distributional drift remains poorly characterized, especially in closed-loop settings where learning alters the data-generating law. We introduce an intrinsic drift budget $C_T$ that quantifies cumulative information-geometric motion of the data distribution along the realized learner-environment trajectory, measured in Fisher-Rao distance. The budget separates exogenous environmental change from policy-sensitive feedback induced by the learner's actions. This gives a rate-based characterization of prequential reproducibility: when performance on the realized stream is used to predict one-step-ahead performance under the next distribution, the drift contribution enters through the average motion rate $C_T/T$, not through cumulative drift alone. We prove a drift-feedback bound of order $T^{-1/2}+C_T/T$, up to controlled second-order remainder terms, and establish a matching sharpness lower bound for the same prequential reproducibility gap on a canonical regular subclass. Thus the dependence on the average Fisher-Rao motion rate is tight up to constants: $C_T/T$ is sufficient for upper control and unavoidable on regular hard subclasses. We further prove an information-theoretic indistinguishability result showing that order-$C/T$ effects on the one-step-ahead target need not be identifiable from the realized performance stream alone. Finally, we show that fixed monitoring channels induce contracted observable Fisher motion, and experiments, including a misspecified real-data feedback setting, indicate that appropriately chosen channels can retain risk-relevant drift signal when the intrinsic data-generating law is unavailable. The resulting theory treats exogenous drift, adaptive data analysis, and performative feedback as different sources of Fisher-Rao motion along the same learner-environment trajectory.
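
To give the drift budget a concrete shape, the sketch below treats the simplest case: a sequence of categorical distributions, for which the Fisher-Rao distance has the closed form $d(p, q) = 2\arccos\sum_i \sqrt{p_i q_i}$. Taking $C_T$ as the sum of consecutive distances along the path is an assumption made for illustration; the paper's exact definition may differ, and the function names are hypothetical.

```python
import math


def fisher_rao(p, q):
    """Fisher-Rao distance between two categorical distributions."""
    bc = sum(math.sqrt(a * b) for a, b in zip(p, q))  # Bhattacharyya coefficient
    bc = min(1.0, max(0.0, bc))  # guard against rounding outside [0, 1]
    return 2.0 * math.acos(bc)


def drift_budget(path):
    """Cumulative motion: sum of distances between consecutive distributions."""
    return sum(fisher_rao(p, q) for p, q in zip(path, path[1:]))


def average_motion_rate(path):
    """Budget divided by the number of steps T, i.e. C_T / T in this sketch."""
    steps = len(path) - 1
    return drift_budget(path) / steps if steps > 0 else 0.0
```

In this toy setting a path that moves a fixed total distance over more steps has the same budget but a smaller average rate, which mirrors the abstract's point that the drift contribution enters through $C_T/T$ rather than through cumulative drift alone.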

2512.11446 2026-05-19 cs.CV

YawDD+: Frame-level Annotations for Accurate Yawn Prediction

Ahmed Mujtaba, Gleb Radchenko, Marc Masana, Radu Prodan

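AI summary: This paper proposes a semi-automated labeling pipeline with human-in-the-loop verification that annotates YawDD videos with more accurate frame-level labels, improving model training on edge devices and enabling more efficient driver fatigue detection.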

Comments This paper has been accepted at the 33rd IEEE International Conference on Image Processing (ICIP) 2026

Abstract

Driver fatigue remains a leading cause of road accidents, responsible for 24% of crashes. While yawning serves as an early behavioral indicator of fatigue, existing approaches face significant challenges due to the presence of systematic noise in video-annotated datasets arising from coarse temporal annotations. Training robust machine learning (ML) models requires rich supervisory labels that help learn salient features from the training data. Moreover, efficient on-device training and inference of models on edge devices is crucial in driver fatigue detection tasks to enable accurate real-time decisions on vehicles without reliance on cloud infrastructure. To address this issue, we develop a semi-automated labeling pipeline with human-in-the-loop verification to annotate YawDD videos with frame-level labels, yielding YawDD+ and enabling more accurate model training on edge platforms such as NVIDIA Jetson NANO. Training the established MNasNet classifier and YOLOv11 detector architectures on YawDD+ improves frame accuracy by up to 6% and mAP by 5% over video-level supervision, achieving 99.34% classification accuracy and 95.69% detection mAP on Jetson NANO and AGX. Moreover, MNasNet trained in just 8.69 min/epoch while delivering inference at up to 115 frames per second (FPS) on AGX, confirming that enhanced data quality alone supports on-device driver fatigue monitoring systems without server-side computation. The YawDD+ dataset and trained models are available online.

2512.07765 2026-05-19 cs.RO

Toward Seamless Physical Human-Humanoid Interaction: Insights from Control, Intent, and Modeling with a Vision for What Comes Next

Gustavo A. Cardona, Shubham S. Kumbhar, Panagiotis Artemiadis

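AI summary: This review examines three core pillars of physical human-humanoid interaction: control, intent estimation, and computational human models. It summarizes the state of the art, open challenges, and limitations, and proposes pathways for cross-pillar integration to advance robust, safe, and intuitive physical interaction.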

Comments 60 pages, 5 figures, 3 tables

Abstract

Physical Human-Humanoid Interaction (pHHI) is a rapidly advancing field with significant implications for deploying robots in unstructured, human-centric environments. In this review, we examine the current state of the art in pHHI through three core pillars: (i) humanoid modeling and control, (ii) human intent estimation, and (iii) computational human models. For each pillar, we survey representative approaches, identify open challenges, and analyze current limitations that hinder robust, scalable, and adaptive interaction. These include the need for whole-body control strategies capable of handling uncertain human dynamics, real-time intent inference under limited sensing, and modeling techniques that account for variability in human physical states. Although significant progress has been made within each domain, integration across pillars remains limited. We propose pathways for unifying methods across these areas to enable cohesive interaction frameworks. This structure enables us not only to map the current landscape but also to propose concrete directions for future research that aim to bridge these domains. Additionally, we introduce a unified taxonomy of interaction types based on modality, distinguishing between direct interactions (e.g., physical contact) and indirect interactions (e.g., object-mediated), and on the level of robot engagement, ranging from assistance to cooperation and collaboration. For each category in this taxonomy, we provide the three core pillars that highlight opportunities for cross-pillar unification. Our goal is to suggest avenues to advance robust, safe, and intuitive physical interaction, providing a roadmap for future research that will allow humanoid systems to effectively understand, anticipate, and collaborate with human partners in diverse real-world settings.

2512.05136 2026-05-19 cs.CV cs.AI

Fine-tuning an ECG Foundation Model to Predict Coronary CT Angiography Outcomes

Yujie Xiao, Qinghao Zhao, Gongzheng Tang, Hao Zhang, Zhuoran Kan, Deyun Zhang, Jun Li, Guangkun Nie, Xiaocheng Fang, Haoyu Wang, Shun Huang, Tong Liu, Jian Liu, Kangyin Chen, Shenda Hong

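AI summary: This paper fine-tunes an ECG foundation model to predict coronary CT angiography outcomes. In a multicenter study with CCTA as the anatomical reference standard, the authors develop and validate an AI-ECG model that predicts vessel-specific coronary stenosis, and they report its internal and external validation performance and its clinical utility.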

Abstract

CAD remains a major global public health burden, yet scalable screening tools are limited. Although CCTA is a first-line non-invasive diagnostic modality, its use is constrained by resource requirements and radiation exposure. AI-ECG may offer a complementary approach for CAD risk stratification. In this multicenter study, we developed and validated an AI-ECG model using CCTA as the anatomical reference standard to predict vessel-specific coronary stenosis. In internal validation, the model achieved AUC values of 0.683-0.744 across vessels and showed consistent external performance. Discrimination was maintained in clinically normal ECGs and remained broadly stable across subgroups. Model-predicted probabilities increased monotonically with CCTA-defined stenosis severity. Model probabilities were converted into vessel-specific low-, intermediate-, and high-risk strata using predefined sensitivity- and specificity-based thresholds. Calibration analysis showed agreement between predicted and observed risk, while DCA indicated net clinical benefit over treat-all and treat-none strategies. Integrating AI-derived risk strata with guideline-based PTP categories improved rule-out performance, reduced the gray-zone proportion, and achieved positive NRI compared with PTP alone. In a longitudinal follow-up cohort, Kaplan-Meier analysis showed clear separation of major adverse cardiovascular event risk across model-defined risk groups. Waveform- and attribution-based analyses further identified structured ECG morphology differences and physiologically meaningful signal regions associated with high-risk predictions. These findings support AI-ECG as a feasible tool for complementary CAD screening, anatomical risk estimation, and clinical triage, while prospective studies are needed to confirm its clinical impact.
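
Two of the evaluation steps named above can be illustrated with a short sketch: mapping a predicted probability to a risk stratum with two cut-offs, and the standard net-benefit quantity used in decision curve analysis (true positives minus false positives weighted by the odds of the threshold probability). The cut-off values and function names are hypothetical; the study's actual thresholds are not given in the abstract.

```python
def risk_stratum(prob, low_cut, high_cut):
    """Map a predicted probability to low / intermediate / high risk."""
    if prob < low_cut:
        return "low"
    if prob < high_cut:
        return "intermediate"
    return "high"


def net_benefit(probs, labels, threshold):
    """Net benefit of treating everyone whose predicted risk >= threshold."""
    n = len(labels)
    tp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 0)
    return tp / n - (fp / n) * threshold / (1.0 - threshold)


def net_benefit_treat_all(labels, threshold):
    """Net benefit of the treat-all reference strategy."""
    prevalence = sum(labels) / len(labels)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)
```

The treat-none strategy has a net benefit of zero by definition, so a model shows net clinical benefit at a given threshold when `net_benefit` exceeds both zero and `net_benefit_treat_all`.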

2512.01843 2026-05-19 cs.CV

PhyDetEx: Detecting and Explaining the Physical Plausibility of T2V Models

Zeqing Wang, Keze Wang, Lei Zhang

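AI summary: This paper proposes PhyDetEx. Using the constructed PID dataset and a lightweight fine-tuning approach, it evaluates how well T2V models generate physically plausible videos and finds that, despite progress in video generation, understanding and adhering to physical laws remains challenging.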

Comments 23 pages, 10 figures

Abstract

Driven by the growing capacity and training scale, Text-to-Video (T2V) generation models have recently achieved substantial progress in video quality, length, and instruction-following capability. However, whether these models can understand physics and generate physically plausible videos remains a question. While Vision-Language Models (VLMs) have been widely used as general-purpose evaluators in various applications, they struggle to identify the physically impossible content from generated videos. To investigate this issue, we construct a PID (Physical Implausibility Detection) dataset, which consists of a test split of 500 manually annotated videos and a train split of 2,588 paired videos, where each implausible video is generated by carefully rewriting the caption of its corresponding real-world video to induce T2V models to produce physically implausible content. With the constructed dataset, we introduce a lightweight fine-tuning approach, enabling VLMs to not only detect physically implausible events but also generate textual explanations on the violated physical principles. Taking the fine-tuned VLM as a physical plausibility detector and explainer, namely PhyDetEx, we benchmark a series of state-of-the-art T2V models to assess their adherence to physical laws. Our findings show that although recent T2V models have made notable progress toward generating physically plausible content, understanding and adhering to physical laws remains a challenging issue, especially for open-source models. Our dataset, training code, and checkpoints are available at https://github.com/Zeqing-Wang/PhyDetEx.

2512.01537 2026-05-19 cs.SD cs.AI cs.IT cs.LG eess.SP math.IT

Two-Dimensional Quantization for Geometry-Aware Audio Coding

Tal Shuster, Eliya Nachmani

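AI summary: This paper proposes Q2D2, a two-dimensional quantization method that projects feature pairs onto structured 2D grids, improving audio compression efficiency while maintaining state-of-the-art reconstruction quality.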

Comments Accepted to ICML 2026

Abstract

Recent neural audio codecs have achieved impressive reconstruction quality, typically relying on quantization methods such as Residual Vector Quantization (RVQ), Vector Quantization (VQ), and Finite Scalar Quantization (FSQ). However, these quantization techniques limit the geometric structure of the latent space and make it harder to capture correlations between features, leading to inefficiency in representation learning, codebook utilization, and token rate. In this paper, we introduce Two-Dimensional Quantization (Q2D2), a quantization scheme in which feature pairs are projected onto structured 2D grids, such as hexagonal, rhombic, or rectangular tilings, and quantized to the nearest grid values, yielding an implicit codebook defined by the product of grid levels, with codebook sizes comparable to conventional methods. Despite its simple geometric formulation, Q2D2 improves audio compression efficiency, with low token rates and high codebook utilization, while maintaining state-of-the-art reconstruction quality. Specifically, Q2D2 achieves competitive or superior performance on various objective and subjective reconstruction metrics across extensive experiments in the speech, audio, and music domains, compared to state-of-the-art models. Comprehensive ablation studies further confirm the effectiveness of our design choices.
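
The core operation described above, snapping a feature pair to the nearest point of a structured 2D grid, can be sketched as nearest-point search on a lattice defined by two basis vectors. The basis choices, the `scale` parameter, the clamping in `code_index`, and all names are assumptions made for illustration; the abstract does not give the actual Q2D2 parameterization.

```python
import math

# Hypothetical lattice bases: (first basis vector, second basis vector).
RECT = ((1.0, 0.0), (0.0, 1.0))
HEX = ((1.0, 0.0), (0.5, math.sqrt(3) / 2))


def quantize_pair(x, y, basis, scale=1.0):
    """Return the integer indices and coordinates of the nearest lattice point."""
    (ax, ay), (bx, by) = basis
    det = ax * by - ay * bx
    # Fractional lattice coordinates (u, v) such that (x, y) = scale * (u*a + v*b).
    u = (x * by - y * bx) / (det * scale)
    v = (y * ax - x * ay) / (det * scale)
    best = None
    # The corners of the enclosing lattice cell are sufficient candidates
    # for the rectangular and hexagonal bases above.
    for i in range(math.floor(u), math.floor(u) + 2):
        for j in range(math.floor(v), math.floor(v) + 2):
            px = scale * (i * ax + j * bx)
            py = scale * (i * ay + j * by)
            d2 = (x - px) ** 2 + (y - py) ** 2
            if best is None or d2 < best[0]:
                best = (d2, (i, j), (px, py))
    return best[1], best[2]


def code_index(i, j, levels_u, levels_v):
    """Flatten grid indices into one code; the implicit codebook has
    levels_u * levels_v entries. Out-of-range indices are crudely clamped."""
    i = min(max(i, 0), levels_u - 1)
    j = min(max(j, 0), levels_v - 1)
    return i * levels_v + j
```

Because the code is computed from the grid indices, no explicit codebook table has to be stored or searched, which is one way to read the abstract's phrase "implicit codebook defined by the product of grid levels".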

2511.20857 2026-05-19 cs.CL cs.AI

Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory

Tianxin Wei, Noveen Sachdeva, Benjamin Coleman, Zhankui He, Yuanchen Bei, Xuying Ning, Mengting Ai, Yunzhe Li, Jingrui He, Ed H. Chi, Chi Wang, Shuo Chen, Fernando Pereira, Wang-Cheng Kang, Derek Zhiyuan Cheng

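AI summary: This paper proposes Evo-Memory, a comprehensive streaming benchmark and framework for evaluating self-evolving memory in LLM agents. It structures datasets into sequential task streams that require the LLM to search, adapt, and evolve memory after each interaction, and evaluates more than ten representative memory modules on 10 diverse multi-turn goal-oriented and single-turn reasoning and QA datasets.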

英文摘要

Statefulness is essential for large language model (LLM) agents to perform long-term planning and problem-solving. This makes memory a critical component, yet its management and evolution remain largely underexplored. Existing evaluations mostly focus on static conversational settings, where memory is passively retrieved from dialogue to answer queries, overlooking the dynamic ability to accumulate and reuse experience across evolving task streams. In real-world environments such as interactive problem assistants or embodied agents, LLMs are required to handle continuous task streams, yet often fail to learn from accumulated interactions, losing valuable contextual insights, a limitation that calls for test-time evolution, where LLMs retrieve, integrate, and update memory continuously during deployment. To bridge this gap, we introduce Evo-Memory, a comprehensive streaming benchmark and framework for evaluating self-evolving memory in LLM agents. Evo-Memory structures datasets into sequential task streams, requiring LLMs to search, adapt, and evolve memory after each interaction. We unify and implement over ten representative memory modules and evaluate them across 10 diverse multi-turn goal-oriented and single-turn reasoning and QA datasets. To better benchmark experience reuse, we provide a baseline method, ExpRAG, for retrieving and utilizing prior experience, and further propose ReMem, an action-think-memory refine pipeline that tightly integrates reasoning, task actions, and memory updates to achieve continual improvement.

2511.11654 2026-05-19 cs.LG cs.AI cs.MA

Convergence of Multiagent Learning Systems for Traffic control

多智能体学习系统在交通控制中的收敛性

Sayambhu Sen, Shalabh Bhatnagar

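AI summary This paper studies the convergence of multi-agent reinforcement learning for traffic signal control, analyzes the learning dynamics with stochastic approximation methods, and proves that the algorithm converges under specific conditions.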

Comments 14 pages 2 figures

详情
AI中文摘要

快速城市化导致城市如班加罗尔面临严重的交通拥堵,使得高效的交通信号控制(TSC)变得至关重要。多智能体强化学习(MARL)作为一种减少平均通勤延误的有希望策略,通常将每个交通信号视为一个独立的智能体使用Q学习进行建模。尽管先前的工作Prashant L A等人已经证明了这种方法的有效性,但在交通控制背景下对这种算法稳定性及收敛性进行严谨理论分析的研究尚未开展。本文通过专注于该多智能体算法的理论基础,填补了这一空白。我们研究了在合作性TSC任务中使用独立学习者固有的收敛问题。利用随机逼近方法,我们正式分析了学习动态。本文的主要贡献是证明了特定的交通控制多智能体强化学习算法在给定条件下能够收敛,扩展了从单智能体收敛证明中异步价值迭代的结论。

英文摘要

Rapid urbanization in cities like Bangalore has led to severe traffic congestion, making efficient Traffic Signal Control (TSC) essential. Multi-Agent Reinforcement Learning (MARL), often modeling each traffic signal as an independent agent using Q-learning, has emerged as a promising strategy to reduce average commuter delays. While prior work by Prashant L A et al. has empirically demonstrated the effectiveness of this approach, a rigorous theoretical analysis of its stability and convergence properties in the context of traffic control has not been carried out. This paper bridges that gap by focusing squarely on the theoretical basis of this multi-agent algorithm. We investigate the convergence problem inherent in using independent learners for the cooperative TSC task. Utilizing stochastic approximation methods, we formally analyze the learning dynamics. The primary contribution of this work is a proof that the specific multi-agent reinforcement learning algorithm for traffic control converges under the given conditions, extending single-agent convergence proofs for asynchronous value iteration.

2511.07288 2026-05-19 cs.LG cs.AI

Enabling Off-Policy Imitation Learning with Deep Actor Critic Stabilization

通过深度行为批评稳定化实现非策略模仿学习

Sayambhu Sen, Shalabh Bhatnagar

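AI summary This paper proposes an adversarial imitation learning algorithm that incorporates off-policy learning, using double-Q-network stabilization and value learning (without reward function inference) to improve sample efficiency and match expert behavior with fewer samples.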

Comments 14 pages and 4 images

详情
AI中文摘要

使用强化学习(RL)学习复杂策略通常受到不稳定性慢收敛的阻碍,这一问题在奖励工程困难时尤为严重。模仿学习(IL)从专家演示中绕过了对奖励的依赖。然而,最先进的IL方法,如生成对抗模仿学习(GAIL)Ho等人,存在严重的样本不效率问题。这是由于其基础的策略学习算法,如TRPO Schulman等人,所导致的。在本文中,我们介绍了一种对抗模仿学习算法,该算法结合了非策略学习以提高样本效率。通过结合非策略框架和辅助技术,特别是在此情况下基于双Q网络的稳定化和价值学习(无需奖励函数推断),我们展示了在稳健匹配专家行为所需样本减少。

英文摘要

Learning complex policies with Reinforcement Learning (RL) is often hindered by instability and slow convergence, a problem exacerbated by the difficulty of reward engineering. Imitation Learning (IL) from expert demonstrations bypasses this reliance on rewards. However, state-of-the-art IL methods, exemplified by Generative Adversarial Imitation Learning (GAIL; Ho et al.), suffer from severe sample inefficiency. This is a direct consequence of their foundational on-policy algorithms, such as TRPO (Schulman et al.). In this work, we introduce an adversarial imitation learning algorithm that incorporates off-policy learning to improve sample efficiency. By combining an off-policy framework with auxiliary techniques, specifically double-Q-network-based stabilization and value learning without reward function inference, we demonstrate a reduction in the samples required to robustly match expert behavior.

2511.06316 2026-05-19 cs.AI

ALIGN: A Vision-Language Framework for High-Accuracy Accident Location Inference through Geo-Spatial Neural Reasoning

ALIGN:一种通过地理空间神经推理进行高精度事故位置推断的视觉-语言框架

MD Thamed Bin Zaman Chowdhury, Moazzem Hossain

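AI summary This paper proposes the ALIGN framework, which uses vision-language models to combine textual and image data and infer accident locations with high accuracy, substantially outperforming traditional text-parsing methods and achieving sub-kilometer localization accuracy.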

详情
AI中文摘要

在低收入和中等收入国家,公共安全和城市规划项目经常面临准确的、特定位置的道路事故数据短缺。从非结构化文本中提取可靠的地理信息需要克服传统文本基于地理编码工具的局限性,这些工具在多语言环境中常常无法处理含糊的地点描述。本研究引入ALIGN(通过地理空间神经推理进行事故位置推断),一种视觉-语言框架,旨在模拟人类空间推理能力,从非结构化的孟加拉语新闻报告和基于地图的线索中推断出精确的事故坐标。开发了一个多阶段自动化流程来处理多样化的文本和视觉数据,整合大语言模型用于线索提取与视觉-语言模型用于地图验证。使用代理架构,我们建模了一个迭代推理循环,结合光学字符识别(OCR)、基于网格的空间扫描以及三轮几何投票方法,以数学方式隔离和减少视觉幻觉。研究结果表明,多模态ALIGN框架显著优于传统文本-only地理解析基线。例如,所提出系统成功将平均定位误差从不可用的10.915公里减少到验证数据集上的亚公里精度0.593公里。此外,测试该框架与官方达卡警察局记录相比,证实了其可靠性,通过达到平均误差0.465公里。结果提供了一个高精度、无需训练的基础,用于数据稀少地区的自动化事故制图,支持证据驱动的道路安全政策制定,并促进多模态AI在交通分析中的整合。

英文摘要

In low- and middle-income countries, public safety and urban planning initiatives frequently face a critical shortage of accurate, location-specific road crash data. Extracting reliable geospatial information from unstructured text requires overcoming the limitations of traditional text-based geocoding tools, which often fail in multilingual environments with ambiguous place descriptions. This study introduces ALIGN (Accident Location Inference through Geo-Spatial Neural Reasoning), a vision-language framework designed to emulate human spatial reasoning to infer precise accident coordinates from unstructured Bangla news reports and map-based cues. A multi-stage automated pipeline was developed to process diverse textual and visual data, integrating large language models for cue extraction with vision-language models for map verification. Using an agentic architecture, we modelled an iterative reasoning loop that combines Optical Character Recognition (OCR), grid-based spatial scanning, and a 3-run geometric voting method to mathematically isolate and reduce visual hallucinations. The findings highlight that the multimodal ALIGN framework significantly outperforms traditional text-only geoparsing baselines. For example, the proposed system reduced the mean localization error from an unusable 10.915 km to a sub-kilometer precision of 0.593 km on a validation dataset. Furthermore, testing the framework against official Dhaka Metropolitan Police records confirmed its reliability, with a mean error of 0.465 km. The results provide a high-accuracy, training-free foundation for automated crash mapping in data-scarce regions, supporting evidence-driven road-safety policymaking and the integration of multimodal AI in transportation analytics.

2511.00392 2026-05-19 cs.RO cs.AI cs.CV

SonarSweep: Fusing Sonar and Vision for Robust 3D Reconstruction via Plane Sweeping

SonarSweep: 通过平面扫描融合声纳与视觉以实现鲁棒的3D重建

Lingpeng Chen, Jiakun Tang, Apple Pui-Yi Chui, Ziyang Hong, Junfeng Wu

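AI summary This paper proposes SonarSweep, an end-to-end deep learning framework that adapts the plane sweep algorithm to cross-modal fusion of sonar and visual data, overcoming the limitations of single-modality 3D reconstruction in underwater environments and producing more accurate and stable depth maps.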

Comments 8 pages, 9 figures, conference

详情
AI中文摘要

在视觉退化的水下环境中实现准确的3D重建仍是一个严峻的挑战。单一模态方法不足:基于视觉的方法因可见性差和几何约束而失败,而声纳则因固有的高度歧义和低分辨率而受限。因此,先前的融合技术依赖于启发式方法和错误的几何假设,导致显著的伪影和无法建模复杂场景。在本文中,我们引入了SonarSweep,一种新颖的端到端深度学习框架,通过将原理性的平面扫描算法应用于声纳与视觉数据的跨模态融合,克服了这些限制。在高保真模拟和真实环境中的大量实验表明,SonarSweep能够一致地生成密集且准确的深度图,在挑战性条件下,特别是在高浊度情况下,显著优于最先进的方法。为了促进进一步研究,我们将公开我们的代码和一个新型的数据集,该数据集包含同步的立体相机和声纳数据,这是首次公开的此类数据集。

英文摘要

Accurate 3D reconstruction in visually degraded underwater environments remains a formidable challenge. Single-modality approaches are insufficient: vision-based methods fail due to poor visibility and geometric constraints, while sonar is crippled by inherent elevation ambiguity and low resolution. Moreover, prior fusion techniques rely on heuristics and flawed geometric assumptions, leading to significant artifacts and an inability to model complex scenes. In this paper, we introduce SonarSweep, a novel, end-to-end deep learning framework that overcomes these limitations by adapting the principled plane sweep algorithm for cross-modal fusion between sonar and visual data. Extensive experiments in both high-fidelity simulation and real-world environments demonstrate that SonarSweep consistently generates dense and accurate depth maps, significantly outperforming state-of-the-art methods across challenging conditions, particularly in high turbidity. To foster further research, we will publicly release our code and a novel dataset featuring synchronized stereo-camera and sonar data, the first of its kind.

2510.26745 2026-05-19 cs.LG cs.AI cs.CL stat.ML

Deep sequence models tend to memorize geometrically; it is unclear why

深度序列模型倾向于记忆几何学;不清楚为何

Shahriar Noroozizadeh, Vaishnavh Nagarajan, Elan Rosenfeld, Sanjiv Kumar

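AI summary The study examines how deep sequence models store atomic facts and finds that geometric memory encodes global relationships, linking even entities that never co-occur in training, which challenges the conventional associative-memory view.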

Comments Forty-third International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

深度序列模型被认为主要通过关联记忆存储原子事实,即通过暴力查找共现实体。我们识别出一种不同的存储形式,称为几何记忆。在此模型中,嵌入编码了所有实体之间的新型全局关系,包括训练中未共现的实体。这种存储形式强大:例如,我们展示了它如何将涉及ℓ-折叠组合的困难推理任务转化为易于学习的一步导航任务。从这一现象中,我们提取了神经嵌入几何学中难以解释的基本方面。我们认为,这种几何的出现,与局部关联的查找相比,不能简单归因于典型的监督、架构或优化压力。反直觉的是,即使几何比暴力查找更复杂,它仍然会被学习。然后,通过分析与Node2Vec的联系,我们展示了几何起源于一种光谱偏见,这与主流理论相反,确实自然产生,尽管缺乏各种压力。这一分析也指出了从业者在使Transformer记忆更几何化方面的可见空间。我们希望几何视角的参数记忆鼓励重新审视指导知识获取、容量、发现和遗忘等领域的默认直觉。

英文摘要

Deep sequence models are said to store atomic facts predominantly in the form of associative memory: a brute-force lookup of co-occurring entities. We identify a dramatically different form of storage of atomic facts that we term geometric memory. Here, the model has synthesized embeddings encoding novel global relationships between all entities, including ones that do not co-occur in training. Such storage is powerful: for instance, we show how it transforms a hard reasoning task involving an $\ell$-fold composition into an easy-to-learn $1$-step navigation task. From this phenomenon, we extract fundamental aspects of neural embedding geometries that are hard to explain. We argue that the rise of such a geometry, as opposed to a lookup of local associations, cannot be straightforwardly attributed to typical supervisory, architectural, or optimizational pressures. Counterintuitively, a geometry is learned even when it is more complex than the brute-force lookup. Then, by analyzing a connection to Node2Vec, we demonstrate how the geometry stems from a spectral bias that, in contrast to prevailing theories, arises naturally despite the lack of various pressures. This analysis also shows practitioners that there is visible headroom to make Transformer memory more strongly geometric. We hope the geometric view of parametric memory encourages revisiting the default intuitions that guide researchers in areas like knowledge acquisition, capacity, discovery, and unlearning.

2510.26018 2026-05-19 cs.RO cs.AI

RADRON: Cooperative Localization of Ionizing Radiation Sources by MAVs with Compton Cameras

RADRON:通过配备康普顿相机的微型飞行器进行离子化辐射源的协同定位

Petr Stibinger, Tomas Baca, Daniela Doubravova, Jan Rusnak, Jaroslav Solc, Jan Jakubek, Petr Stepan, Martin Saska

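AI summary The study proposes a new method for cooperative localization of radioactive material by micro aerial vehicles, using Compton cameras to estimate the radiation source position in real time with high sensitivity even from sparse measurements.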

Comments 8 pages, 9 figures, submitted for review to IEEE RA-L

详情
AI中文摘要

我们提出了一种新型方法,通过合作微型飞行器(MAVs)定位放射性物质。我们的方法利用了最先进的单探测器康普顿相机,作为高灵敏度且微型的离子化辐射探测器。该探测器极低的重量(40克)为由协作敏捷MAVs进行的辐射检测开辟了新可能。我们提出了一种新的基本概念,将康普顿相机测量融合以实时估计辐射源位置,即使从极稀疏的测量中也能做到。数据读取和处理直接在机载上进行,结果用于动态反馈以驱动车辆运动。MAVs在紧密协作的群体中稳定,以最大化康普顿相机获取的信息,快速定位辐射源,甚至跟踪移动的辐射源。

英文摘要

We present a novel approach to localizing radioactive material by cooperating Micro Aerial Vehicles (MAVs). Our approach utilizes a state-of-the-art single-detector Compton camera as a highly sensitive, yet miniature detector of ionizing radiation. The detector's exceptionally low weight (40 g) opens up new possibilities of radiation detection by a team of cooperating agile MAVs. We propose a new fundamental concept of fusing the Compton camera measurements to estimate the position of the radiation source in real time even from extremely sparse measurements. The data readout and processing are performed directly onboard and the results are used in a dynamic feedback to drive the motion of the vehicles. The MAVs are stabilized in a tightly cooperating swarm to maximize the information gained by the Compton cameras, rapidly locate the radiation source, and even track a moving radiation source.

2510.24680 2026-05-19 cs.RO

InFeR: Informed Failure Resilience in Learned Visual Navigation Control

InFeR:在学习视觉导航控制中的有信息故障韧性

Zishuo Wang, Joel Loo, David Hsu

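AI summary The study proposes the InFeR framework, which restructures the latent space with a Variational Information Bottleneck loss to detect OOD failures and uses Grad-CAM to localize the failure source, enabling autonomous failure recovery without additional training data and improving the robustness of long-range navigation in complex environments.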

详情
AI中文摘要

尽管模仿学习(IL)已在许多常见环境中实现了成功的视觉导航,但在分布外(OOD)场景下,IL策略容易出现不可预测的故障。这需要具有故障韧性的策略,不仅能够检测故障,还能识别其来源并自主恢复。我们提出了InFeR,一种通用框架,用于构建具有有信息故障韧性的IL策略,而无需故障或恢复演示。InFeR通过变分信息瓶颈(VIB)损失重新训练IL策略,以结构化其潜在空间以检测OOD故障。它应用视觉可解释性技术Grad-CAM,以局部化图像区域作为故障源,并告知恢复的启发式策略。所有这些都在不需要额外训练数据的情况下实现。现实世界实验表明,InFeR在两种不同的策略架构上实现了有信息的故障恢复,从而在复杂环境中实现了稳健的长距离导航。

英文摘要

While imitation learning (IL) has enabled successful visual navigation in many common environments, IL policies are prone to unpredictable failures under out-of-distribution (OOD) scenarios. This necessitates failure-resilient policies, which not only detect failures, but also recognise their sources and recover from them autonomously. We propose InFeR, a general framework for building IL policies with informed failure resilience without failure or recovery demonstrations. InFeR retrains an IL policy with a Variational Information Bottleneck (VIB) loss to structure its latent space for OOD failure detection. It applies a visual explainability technique, Grad-CAM, to localise an image region as the source of failure and inform a heuristic policy for recovery. All these are achieved without requiring additional training data. Real-world experiments show that InFeR enables informed failure recovery across two different policy architectures, yielding robust long-range navigation in complex environments.

2510.24208 2026-05-19 cs.CL cs.LG

Beyond Neural Incompatibility: Cross-Scale Knowledge Transfer in Language Models through Latent Semantic Alignment

超越神经不兼容:通过潜在语义对齐实现语言模型中的跨尺度知识转移

Jian Gu, Aldeida Aleti, Chunyang Chen, Hongyu Zhang

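AI summary This paper proposes SemAlign, which achieves cross-scale knowledge transfer through latent semantic alignment. It addresses the limits on parameter reuse between models with different architectures and parameterizations by using activations as the transfer medium and transferring knowledge stably through semantic decomposition and recomposition.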

Comments an early-stage version

详情
AI中文摘要

语言模型(LMs)在其参数中编码了大量知识,但如何以细粒度方式转移此类知识,即参数化知识转移(PKT)仍不明确。核心挑战是当源模型和目标模型在架构和参数化上存在差异时,如何实现有效的、高效的跨尺度转移,这使得直接参数重用受到神经不兼容的限制。在本文中,我们识别出潜在语义对齐是跨尺度知识转移的关键前提。与直接移动层参数不同,我们的方法使用激活值作为转移介质。SemAlign包含两个阶段:一个层归因阶段,用于归因任务相关的源层并为每个目标层选择恰好一个源层;一个语义对齐阶段,通过逐层配对并优化目标模型,利用源侧语义监督。对齐通过语义分解和重组在潜在空间中进行。在浅层到深层的转移过程中,只有前沿目标层是可训练的。层目标通过匹配中心化的词-词关系几何与对齐的监督残差来监督该层的残差贡献,而输出KL保持源级预测行为。因此,转移介质既不是参数块也不是绝对的隐藏状态,而是由配对源层监督诱导的目标空间残差几何。在四个基准测试中的评估证实了SemAlign的有效性,进一步分析确认语义分解和重组为跨尺度知识转移提供了一个稳定的机制。

英文摘要

Language Models (LMs) encode substantial knowledge in their parameters, yet it remains unclear how to transfer such knowledge in a fine-grained manner, a problem known as parametric knowledge transfer (PKT). A central challenge is to make cross-scale transfer effective and efficient when source and target models differ in architecture and parameterization, so that direct parameter reuse is strongly limited by neural incompatibility. In this paper, we identify latent semantic alignment as the key prerequisite for cross-scale knowledge transfer. Instead of directly moving layer parameters, our approach uses activations as the transfer medium. SemAlign has two stages: a layer attribution stage that attributes task-relevant source layers and selects exactly one source layer for each target layer, and a semantic alignment stage that pairs them layer by layer and optimizes the target with source-side semantic supervision. The alignment is carried out in latent space through semantic decomposition and recomposition. During the shallow-to-deep transfer, only the frontier target layer is trainable. The layer objective supervises the residual contribution of that layer by matching centered token-token relation geometry against an aligned supervisory residual, while output KL preserves source-level predictive behavior. The transferred medium is therefore neither a parameter block nor an absolute hidden state, but target-space residual geometry induced by paired source-layer supervision. Evaluations on four benchmarks demonstrate the efficacy of SemAlign, and further analysis confirms that semantic decomposition and recomposition provide a stable mechanism for cross-scale knowledge transfer.

2510.20584 2026-05-19 cs.CL cs.AI

Automated Coding of Communication Data Using ChatGPT: Consistency Across Subgroups

使用ChatGPT自动编码通信数据:子群体一致性分析

Jiangang Hao, Wenju Cui, Patrick Kyllonen, Emily Kerzabi

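AI summary This paper examines whether ChatGPT-based coding of communication data is consistent across gender and racial/ethnic groups and finds that its coding is consistent with that of human raters, which makes large-scale assessment of collaboration and communication feasible.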

Comments Accepted to the Journal of Educational Measurement

详情
AI中文摘要

在大规模评估沟通和协作方面,对通信数据进行分类编码是一项劳动密集型任务,根据不同的框架进行分类。先前研究已证明,可以通过直接指示ChatGPT使用编码评分表来对通信数据进行编码,并且其准确性与人类评分者相当。然而,ChatGPT或类似AI技术在不同人口群体(如性别和种族)之间编码的一致性仍不清楚。为填补这一空白,我们引入了三种检查方法,用于评估基于LLM的编码中的子群体一致性,通过适应自自动化评分文献中已有的框架。使用典型的协作问题解决编码框架和三种类型的协作任务数据,我们检查了基于ChatGPT的编码在性别和种族/族裔群体中的表现。我们的结果表明,基于ChatGPT的编码在性别或种族/族裔群体中表现一致,与人类评分者一致,证明了其在大规模评估协作和沟通中的可行性。

英文摘要

Assessing communication and collaboration at scale depends on the labor-intensive task of coding communication data into categories according to different frameworks. Prior research has established that ChatGPT can be directly instructed with coding rubrics to code communication data and achieves accuracy comparable to human raters. However, whether the coding from ChatGPT or similar AI technology performs consistently across different demographic groups, such as gender and race, remains unclear. To address this gap, we introduce three checks for evaluating subgroup consistency in LLM-based coding by adapting an existing framework from the automated scoring literature. Using a typical collaborative problem-solving coding framework and data from three types of collaborative tasks, we examine ChatGPT-based coding performance across gender and racial/ethnic groups. Our results show that ChatGPT-based coding performs as consistently as human raters across gender and racial/ethnic groups, demonstrating the possibility of its use in large-scale assessments of collaboration and communication.

2510.11391 2026-05-19 cs.CV cs.AI cs.CL

DocReward: A Document Reward Model for Structuring and Stylizing

DocReward: 一种用于文档结构化和风格化的文档奖励模型

Junpeng Liu, Yuzhong Zhao, Bowen Cao, Jiayu Ding, Yilin Jia, Tengchao Lv, Yupan Huang, Wenshan Wu, Shaohan Huang, Nan Yang, Li Dong, Lei Cui, Tao Ge, Xun Wang, Huitian Jiao, Sun Mao, FNU Kartik, Si-Qing Chen, Wai Lam, Furu Wei

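AI summary This paper proposes DocReward, a reward model that evaluates document structure and style. It is trained with the Bradley-Terry loss on DocPair, a dataset of 117K paired documents, and effectively improves the structural and stylistic professionalism of generated documents.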

详情
AI中文摘要

近期的代理工作流程自动化了专业文档生成,但主要关注文本质量,忽视了结构和风格的专业性,这对于可读性同样至关重要。这一差距主要源于缺乏有效的奖励模型,无法引导代理生成结构和风格专业的文档。我们引入DocReward,一种评估文档结构和风格的文档奖励模型。为此,我们提出了一种文本质量无关的框架,确保评估不受内容质量的影响,并构建了包含117,000对文档的DocPair数据集,涵盖32个领域和267种类型。每对文档内容相同,但结构和风格专业性不同。DocReward使用Bradley-Terry损失进行训练。在人工标注的基准测试中,DocReward在相同设置下比GPT-5高出14.6个百分点。强化学习实验进一步表明,DocReward能有效引导代理生成具有更一致结构和风格专业性的文档,突显了其实际应用价值。

英文摘要

Recent agentic workflows automate professional document generation but focus narrowly on textual quality, overlooking structural and stylistic professionalism, which is equally critical for readability. This gap stems mainly from a lack of effective reward models capable of guiding agents toward producing documents with high structural and stylistic professionalism. We introduce DocReward, a document reward model that evaluates documents based on their structure and style. To achieve this, we propose a textual-quality-agnostic framework that ensures assessments are not confounded by content quality, and construct DocPair, a dataset of 117K paired documents covering 32 domains and 267 types. Each pair shares identical content but differs in structural and stylistic professionalism. DocReward is trained using the Bradley-Terry loss. On a manually annotated benchmark, DocReward outperforms GPT-5 by 14.6 percentage points in the same setting. Reinforcement learning experiments further show that DocReward effectively guides agents toward generating documents with consistently higher structural and stylistic professionalism, highlighting its practical utility.
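The Bradley-Terry loss named in the abstract can be sketched for a single document pair. This shows only the standard pairwise form, the negative log of the sigmoid of the score margin; the function name and the scalar scores are illustrative assumptions, and nothing here reflects DocReward's actual model or training code.

```python
import math


def bradley_terry_loss(score_preferred, score_other):
    """Standard pairwise Bradley-Terry loss: -log(sigmoid(score_preferred - score_other)).

    Hypothetical helper; the scores stand in for a reward model's scalar outputs
    on the more professional and the less professional document of a pair.
    """
    margin = score_preferred - score_other
    # Numerically stable evaluation of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


loss_correct_order = bradley_terry_loss(2.0, 0.0)   # preferred document scored higher
loss_wrong_order = bradley_terry_loss(0.0, 2.0)     # preferred document scored lower
```

The loss shrinks as the preferred document's score rises above the other's, so minimizing it pushes the reward model to rank each pair in the annotated order.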

2510.10930 2026-05-19 cs.CL cs.AI

Evaluating Language Models' Evaluations of Games

评估语言模型对游戏的评估

Katherine M. Collins, Cedegao E. Zhang, Graham Todd, Lance Ying, Mauricio Barba da Costa, Ryan Liu, Prafull Sharma, Adrian Weller, Ionatan Kuperwajs, Lionel Wong, Joshua B. Tenenbaum, Thomas L. Griffiths

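AI summary This paper studies how well language models evaluate games by comparing modern language models with people and symbolic computational agents. Reasoning models are closer to people in their game evaluations, but their fit to human data weakens as they approach the game-theoretic optimum, and they vary more when assessing funness.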

详情
AI中文摘要

推理不仅仅是解决问题,也是评估哪些问题值得解决。人工智能系统的历史评估主要集中在解决问题上,通过研究模型如何玩国际象棋和围棋等游戏。在本文中,我们倡导一种新的范式,即评估人工智能系统对游戏的评估。首先,我们引入了一种评估此类评估的形式化方法。然后利用超过100种新型棋盘游戏和450份人类判断的大型数据集,将现代语言和推理模型的评估结果与人类和符号计算代理的评估结果进行比较。我们考虑了两种类型的评估查询:评估游戏的收益(或公平性)和趣味性。这些查询涵盖了两个与AI评估设计相关的重要维度:计算查询的复杂性和量化查询的难度。我们的结果表明,推理模型在游戏评估上通常比非推理语言模型更接近人类。然而,我们观察到非单调的关系:随着模型接近博弈最优,其与人类数据的匹配度会减弱。我们还发现,在评估趣味性时,模型之间存在更多的波动性,这与量化该查询的难度更大有关。在各种查询和游戏中,推理模型在评估查询时表现出高度变化和不可预测的资源使用,这表明在语言和推理模型中加入更多资源理性的元推理非常重要。

英文摘要

Reasoning is not just about solving problems; it is also about evaluating which problems are worth solving at all. Evaluations of artificial intelligence (AI) systems have primarily focused on problem solving, historically by studying how models play games such as chess and Go. In this paper, we advocate for a new paradigm that assesses AI systems' evaluation of games. First, we introduce a formalism for evaluating such evaluations. We then leverage a large-scale dataset of over 100 novel board games and over 450 human judgments to compare evaluations produced by modern language and reasoning models against those of people and symbolic computational agents. We consider two kinds of evaluative queries: assessing the payoff (or fairness) and the funness of games. These queries span two dimensions relevant to the design of evaluations of AI evaluations: how complex a query is to compute and how difficult a query is to quantify. Our results show that reasoning models are generally more aligned with people in their evaluations of games than non-reasoning language models. However, we observe a non-monotonic relationship: as models get closer to the game-theoretic optimum, their fit to human data weakens. We also observe more "jaggedness" across models for assessing funness, in line with the greater difficulty of quantifying this query. Across queries and games, reasoning models show highly variable and unpredictable resource usage when assessing queries, pointing to the importance of imbuing language and reasoning models with more resource-rational meta-reasoning.

2510.08141 2026-05-19 cs.LG

SCOPE-RL: Stable and Quantitative Control of Policy Entropy in RL Post-Training

SCOPE-RL: 稳定和定量控制强化学习后训练中的策略熵

Chen Wang, Zhaochun Li, Jionghao Bai, Hexuan Deng, Ge Lan, Yue Wang

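AI summary This paper proposes the SCOPE-RL framework, which builds a regularization term from temperature-adaptive positive samples to control policy entropy stably and quantitatively during RL post-training; experiments show that it outperforms existing baselines on Pass@1 and Pass@$k$.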

详情
AI中文摘要

强化学习(RL)是训练大型语言模型(LLMs)的关键范式,但广泛使用的分组相对策略优化(GRPO)常面临熵崩溃问题:探索迅速消失,策略提前收敛,样本多样性下降,最终损害训练效果。现有解决方案,包括熵奖励和裁剪方法,很少能保持熵在稳定的探索范围内,且常引入振荡的熵或奖励退化。在本文中,我们识别出熵动态中被忽视的不对称性:在高温度采样下,正样本和负样本对策略熵有相反影响。具体而言,高温度正样本促进熵增长,而负样本抑制它。我们为此现象提供了理论解释:当策略更新过程中熵下降时,其对温度的导数在正样本更新下严格为正,表明高温度正样本可以抵消熵衰减,从而减缓熵崩溃并可能逆转它。受此启发,我们提出了SCOPE-RL,通过构造来自温度自适应正样本的正则化项,实现稳定且定量的熵控制。广泛实验表明,SCOPE-RL在Pass@1和Pass@$k$任务上均优于现有强RL基线方法。我们的结果提供了证据,证明摆脱熵崩溃可以提高推理性能,同时显示收益是非单调的,RL后训练在推理LLMs中存在最优的探索水平。

英文摘要

Reinforcement learning (RL) is a key paradigm for post-training large language models (LLMs), but the widely used Group Relative Policy Optimization (GRPO) often suffers from entropy collapse: exploration quickly disappears, policies converge prematurely, and sample diversity declines, ultimately harming training effectiveness. Existing remedies, including entropy bonuses and clip-based methods, rarely keep entropy within a stable exploration regime and often introduce oscillatory entropy or reward degradation. In this work, we identify a previously overlooked asymmetry in entropy dynamics: under high-temperature sampling, positive and negative samples have opposite effects on policy entropy. Specifically, high-temperature positive samples promote entropy growth, whereas negative samples suppress it. We provide a theoretical explanation for this phenomenon: when entropy decreases during policy updates, its derivative with respect to temperature is strictly positive under positive-sample updates, indicating that high-temperature positive samples can counteract entropy decay, thereby slowing entropy collapse and potentially reversing it. Motivated by this insight, we propose SCOPE-RL, a stable and quantitative entropy control framework through a regularization term constructed from temperature-adaptive positive samples. Extensive experiments show that SCOPE-RL consistently outperforms strong RL baselines on both Pass@1 and Pass@$k$. Our results provide evidence that escaping entropy collapse can improve reasoning performance, while also showing that the benefit is non-monotonic, with an optimal level of exploration for RL post-training in reasoning LLMs.

2510.06809 2026-05-19 cs.CV

VA-Adapter: Adapting Ultrasound Foundation Model to Echocardiography Probe Guidance

VA-Adapter:将超声基础模型适应于超声心动图探头引导

Teng Wang, Haojun Jiang, Yuxuan Wang, Zhenguo Sun, Yujiao Deng, Shiji Song, Gao Huang

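AI summary This paper proposes VA-Adapter, which gives an ultrasound foundation model the ability to understand individual 3D structures to improve the accuracy and efficiency of echocardiography probe guidance; experiments show that it outperforms existing models with fewer parameters.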

Comments MICCAI2026 Early Accept Paper

详情
AI中文摘要

超声心动图是检测心脏疾病的关键工具,但其操作难度高导致专业人员短缺。探头引导系统通过辅助获取高质量图像,提供了降低操作门槛的有前景的解决方案。然而,由于显著的个体差异,稳健的探头引导仍具挑战性。这种差异表现为二维图像中低级特征的差异,这使得图像特征理解复杂化,以及个体三维结构的差异,这给精确导航带来挑战。为了解决这些挑战,我们首先提出利用超声基础模型从大量数据集中学习的稳健图像表示。然而,将这些模型应用于探头导航是困难的,因为它们缺乏对个体三维结构的理解。为此,我们精心设计了视觉-动作适配器(VA-Adapter)以在线注入理解个体三维结构的能力。具体来说,通过将VA-Adapter嵌入基础模型的图像编码器中,模型可以从历史视觉-动作序列中推断心脏解剖结构,模拟超声技师的认知过程。在包含超过131万样本的数据集上进行的广泛实验表明,VA-Adapter在参数量少约33倍的情况下优于现有探头引导模型。代码可在https://github.com/LeapLabTHU/VA-Adapter上获得。

英文摘要

Echocardiography is a critical tool for detecting heart diseases, yet its steep operational difficulty causes a shortage of skilled personnel. Probe guidance systems, which assist in acquiring high-quality images, offer a promising solution to lower this operational barrier. However, robust probe guidance remains challenging due to significant individual variability. This variability manifests as differences in low-level features within two-dimensional (2D) images, which complicate image feature understanding, and differences in individual three-dimensional (3D) structures, which pose challenges for precise navigation. To address these challenges, we first propose leveraging the robust image representations learned by ultrasound foundation models from vast datasets. Yet, applying these models to probe navigation is non-trivial due to their lack of understanding of individual 3D structures. To this end, we design a Vision-Action Adapter (VA-Adapter) that injects the capability of understanding individual 3D structures online. Specifically, by embedding the VA-Adapter into the foundation model's image encoder, the model can infer cardiac anatomy from historical vision-action sequences, mimicking the cognitive process of a sonographer. Extensive experiments on a dataset with over 1.31M samples demonstrate that the VA-Adapter outperforms strong probe guidance models while requiring approximately 33 times fewer trained parameters. Code is available at https://github.com/LeapLabTHU/VA-Adapter.

2510.04930 2026-05-19 cs.LG

Egalitarian Gradient Descent: A Simple Approach to Accelerated Grokking

平等梯度下降:一种加速 Grokking 的简单方法

Ali Saheb Pasand, Elvis Dohmatob

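AI summary This paper proposes egalitarian gradient descent (EGD), which normalizes the gradients so that the dynamics along all principal directions evolve at the same speed, accelerating grokking and removing the plateau in test performance.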

详情
AI中文摘要

Grokking 是一种现象,其中不同于训练性能在早期达到峰值,模型的测试/泛化性能在任意多个周期内停滞,然后突然跃升至接近完美的水平。在实践中,减少此类停滞的长度是有利的,即使学习过程'更快地 Grok'。在本工作中,我们提供了对 Grokking 的新见解。首先,我们通过实证和理论证明,不对称的(随机)梯度下降速度可以在不同主方向(即奇异方向)上诱导 Grokking。然后,我们提出了一种简单的修改,规范化梯度,使得所有主方向的动力学以相同的速度演化。接着,我们证明这种修改方法,称为平等梯度下降(EGD),可以被视为一种精心修改的自然梯度下降方法,能够更快地 Grok。事实上,在某些情况下,停滞完全被消除。最后,我们实证地展示了在经典算术问题如模加法和稀疏奇偶问题上,这种停滞现象被我们的方法消除。

英文摘要

Grokking is the phenomenon whereby, unlike the training performance, which peaks early in the training process, the test/generalization performance of a model stagnates over arbitrarily many epochs and then suddenly jumps to usually close to perfect levels. In practice, it is desirable to reduce the length of such plateaus, that is, to make the learning process "grok" faster. In this work, we provide new insights into grokking. First, we show both empirically and theoretically that grokking can be induced by asymmetric speeds of (stochastic) gradient descent along different principal (i.e., singular) directions of the gradients. We then propose a simple modification that normalizes the gradients so that the dynamics along all the principal directions evolve at exactly the same speed. Then, we establish that this modified method, which we call egalitarian gradient descent (EGD) and which can be seen as a carefully modified form of natural gradient descent, groks much faster. In fact, in some cases the stagnation is completely removed. Finally, we empirically show that on classical arithmetic problems such as modular addition and sparse parity, where this stagnation has been widely observed and intensively studied, our proposed method eliminates the plateaus.
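One way to picture "the same speed along every principal direction" is to replace every singular value of a gradient matrix with 1 while keeping its singular vectors. The sketch below does exactly that with NumPy; it is an assumption made for illustration, not the paper's definition of EGD, and the function name `equalize_singular_values`, the tolerance, and the example matrix are all hypothetical.

```python
import numpy as np


def equalize_singular_values(grad, tol=1e-12):
    """Return a step with the same singular vectors as grad and unit singular values.

    Illustrative assumption only: equal speed is modeled by setting every
    nonzero singular value of the gradient matrix to 1.
    """
    # Thin SVD: grad = u @ diag(s) @ vh
    u, s, vh = np.linalg.svd(grad, full_matrices=False)
    keep = s > tol                      # drop numerically zero directions
    return u[:, keep] @ vh[keep, :]


# A gradient whose two principal directions have very different magnitudes.
grad = np.array([[3.0, 1.0], [1.0, 0.5], [0.0, 2.0]])
step = equalize_singular_values(grad)
```

A plain gradient step would move far along the dominant singular direction and only slightly along the weak one; the equalized step moves by the same amount along both.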

2510.02590 2026-05-19 cs.LG

Use the Online Network If You Can: Towards Fast and Stable Reinforcement Learning

在可以的时候使用在线网络:迈向快速且稳定的强化学习

Ahmed Hendawy, Henrik Metternich, Théo Vincent, Mahdi Kallel, Jan Peters, Carlo D'Eramo

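AI summary This paper proposes a new update rule that takes the minimum estimate between the target network and the online network to improve value function learning, yielding faster and more stable reinforcement learning.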

Comments Accepted at the Fourteenth International Conference on Learning Representations (ICLR 2026)

详情
AI中文摘要

在深度强化学习(RL)中,使用目标网络来估计价值函数是一种流行的方法。虽然有效,但目标网络仍是一种折中方案,它在保持稳定性的同时牺牲了缓慢移动的目标,从而延迟了学习。相反,使用在线网络作为强化目标在直觉上很有吸引力,但众所周知会导致不稳定的学。在本文中,我们旨在结合两者的优势,通过引入一种新的更新规则,该规则通过目标网络和在线网络之间的最小估计来计算目标,从而得到我们的方法MINTO。通过这种简单而有效的修改,我们证明MINTO能够通过缓解使用在线网络进行强化时的潜在过估计偏差,从而实现更快且更稳定的价值函数学习。值得注意的是,MINTO可以无缝集成到广泛的价值基础和演员-评论家算法中,成本极低。我们对MINTO在多种基准上的进行了广泛评估,涵盖了在线和离线RL以及离散和连续动作空间。在所有基准上,MINTO都一致地提高了性能,展示了其广泛的应用性和有效性。

英文摘要

The use of target networks is a popular approach for estimating value functions in deep Reinforcement Learning (RL). While effective, the target network remains a compromise solution that preserves stability at the cost of slowly moving targets, thus delaying learning. Conversely, using the online network as a bootstrapped target is intuitively appealing, albeit well known to lead to unstable learning. In this work, we aim to obtain the best of both worlds by introducing a novel update rule that computes the target using the MINimum estimate between the Target and Online network, giving rise to our method, MINTO. Through this simple yet effective modification, we show that MINTO enables faster and more stable value function learning by mitigating the potential overestimation bias of using the online network for bootstrapping. Notably, MINTO can be seamlessly integrated into a wide range of value-based and actor-critic algorithms at negligible cost. We evaluate MINTO extensively across diverse benchmarks, spanning online and offline RL, as well as discrete and continuous action spaces. Across all benchmarks, MINTO consistently improves performance, demonstrating its broad applicability and effectiveness.
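The update rule can be sketched for a single transition in a discrete-action, Q-learning setting. The abstract states only that the target uses the minimum estimate between the target and online networks; taking the minimum of the two networks' greedy next-state values, as well as the function name and the plain-list inputs, are assumptions made for this illustration.

```python
def minto_style_target(reward, done, q_target_next, q_online_next, gamma=0.99):
    """One-step bootstrapped target using the smaller of two next-state estimates.

    Hypothetical sketch: q_target_next and q_online_next are the action values
    that the target and online networks assign to the next state.
    """
    greedy_target = max(q_target_next)   # usual target-network bootstrap value
    greedy_online = max(q_online_next)   # online-network bootstrap value
    # Assumed reading of the rule: bootstrap from the minimum of the two estimates.
    bootstrap = min(greedy_target, greedy_online)
    return reward + (0.0 if done else gamma * bootstrap)


target = minto_style_target(
    reward=1.0, done=False,
    q_target_next=[1.0, 2.0], q_online_next=[3.0, 0.5],
    gamma=0.5,
)
```

When the online network's estimate is the larger one, as here, the target falls back to the target network's value, which is how the minimum limits overestimation from bootstrapping on the online network.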

2509.23183 2026-05-19 cs.LG cs.NI

ZeroSiam: An Efficient Asymmetry for Test-Time Entropy Optimization without Collapse


Guohao Chen, Shuaicheng Niu, Deyu Chen, Jiahao Yang, Zitian Zhang, Mingkui Tan, Pengcheng Wu, Zhiqi Shen

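AI summary This paper proposes ZeroSiam, an efficient asymmetric architecture for test-time entropy minimization. It prevents collapse through asymmetric divergence alignment, implemented with a learnable predictor and a stop-gradient operator. Empirical and theoretical evidence shows that it prevents collapse and regularizes biased learning signals, improving performance and remaining stable even on collapse-prone small models.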

Abstract

Test-time entropy minimization helps adapt a model to novel environments and incentivizes its reasoning capability, unleashing the model's potential during inference by allowing it to evolve and improve in real time using its own predictions, and achieving promising performance. However, pure entropy minimization can favor non-generalizable shortcuts, such as inflating the logit norm and driving all predictions to a dominant class to reduce entropy, risking collapsed solutions (e.g., constant one-hot outputs) that trivially minimize the objective without meaningful learning. In this paper, we reveal asymmetry as a key mechanism for collapse prevention and introduce ZeroSiam, an efficient asymmetric Siamese architecture tailored for test-time entropy minimization. ZeroSiam prevents collapse through asymmetric divergence alignment, efficiently achieved by a learnable predictor and a stop-gradient operator before the classifier. We provide empirical and theoretical evidence that ZeroSiam not only prevents collapse but also regularizes biased learning signals, enhancing performance even when no collapse occurs. Despite its simplicity, extensive results show that ZeroSiam performs more stably than prior methods with negligible overhead, demonstrating efficacy on both vision adaptation and large language model reasoning tasks across challenging test scenarios and diverse models, including particularly collapse-prone tiny models.
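The logit-norm shortcut mentioned in the abstract can be reproduced numerically. The sketch below does not implement ZeroSiam; it only shows that scaling a fixed logit vector lowers the softmax entropy while the predicted class stays the same, which is why pure entropy minimization can be satisfied without meaningful learning. The logit values and scale factors are arbitrary illustrative choices.

```python
import numpy as np

def softmax(logits):
    # Subtract the maximum for numerical stability.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def entropy(probs, eps=1e-12):
    # Shannon entropy in nats; eps guards against log(0).
    return -(probs * np.log(probs + eps)).sum(axis=-1)

logits = np.array([1.0, 0.5, -0.5])  # arbitrary example logits
scales = [1.0, 4.0, 16.0]            # multiplying by a scale inflates the logit norm

entropies = [float(entropy(softmax(s * logits))) for s in scales]
predictions = [int(np.argmax(softmax(s * logits))) for s in scales]

for s, h, p in zip(scales, entropies, predictions):
    print(f"scale={s:>4}: entropy={h:.4f}, predicted class={p}")
```

The entropy falls toward zero as the scale grows, yet the predicted class never changes, so the objective improves without the model learning anything new about the input.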