arXivDaily arXiv每日学术速递 周一至周五更新
重置
全部学科分类 2370
2510.02480 2026-05-29 cs.AI cs.LG

Controlling the Risk of Corrupted Contexts for Language Models via Early-Exiting

通过早退机制控制语言模型中有害上下文的风险

Andrea Wynn, Metod Jazbec, Charith Peris, Rinat Khaziev, Anqi Liu, Daniel Khashabi, Eric Nalisnick

AI总结 提出一种结合动态早退预测与无分布风险控制的方法,限制有害上下文对语言模型性能的退化,并在有益上下文中实现计算效率提升。

Comments Accepted to ICML 2026

详情
AI中文摘要

大型语言模型(LLMs)可能受到有害或不相关上下文的影响,这会显著损害模型在下游任务上的性能。这促使我们设计具有内置机制的原则性方案,以防范此类“垃圾进,垃圾出”场景。我们提出一种新颖方法,限制有害上下文对模型性能的退化程度。首先,我们定义模型的基线“安全”行为——即无任何上下文(零样本)时的模型性能。接着,我们应用无分布风险控制(DFRC)来控制用户提供的上下文将性能降至该安全零样本基线以下的程度。我们通过利用动态早退预测实现这一点,忽略那些最关注不安全输入的后注意力头。最后,我们提出对DFRC的修改,使其既能控制有害输入的风险,又能利用有益输入的性能和效率提升。我们在涵盖上下文学习和开放式问答的9项任务上展示了理论和实证结果,表明我们的方法能有效控制有害上下文的风险,同时在使用有益上下文时实现显著的计算效率提升。

英文摘要

Large language models (LLMs) can be influenced by harmful or irrelevant context, which can significantly harm model performance on downstream tasks. This motivates principled designs in which LLM systems include built-in mechanisms to guard against such "garbage in, garbage out" scenarios. We propose a novel approach to limit the degree to which harmful context can degrade model performance. First, we define a baseline "safe" behavior for the model -- the model's performance given no context at all (zero-shot). Next, we apply distribution-free risk control (DFRC) to control the extent to which the user-provided context can decay performance below this safe zero-shot baseline. We achieve this by leveraging dynamic early exit prediction, ignoring later attention heads that attend the most to the unsafe inputs. Finally, we propose modifications to DFRC that allow it to both control risk for harmful inputs \textit{and} leverage performance and efficiency gains on helpful inputs. We present both theoretical and empirical results across 9 tasks spanning in-context learning and open-ended question answering, showing that our approach can effectively control risk for harmful context and simultaneously achieve substantial computational efficiency gains with helpful context.

2510.00777 2026-05-29 cs.LG

In-Place Feedback: Reliable Refinement for Multi-Turn Expert-LLM Collaboration

原地反馈:多轮专家-LLM协作的可靠精炼方法

Youngbin Choi, Minjong Lee, Saemi Moon, Seunghyuk Cho, Chaehyeon Chung, MoonJeong Park, Dongwoo Kim

AI总结 提出原地反馈交互范式,通过用户直接编辑模型先前响应并让模型从编辑上下文继续生成,在五个推理密集型基准上优于标准多轮反馈且更省token,用户研究证实其能提高最终输出满意度并降低疲劳。

Comments 42pages

详情
AI中文摘要

LLM生成的草稿常包含细微的事实或逻辑错误,但先前研究表明模型难以可靠地整合旨在修正这些错误的多轮反馈。我们提出原地反馈,一种交互范式,其中用户直接编辑模型先前的响应,模型从编辑后的上下文继续生成。在五个推理密集型基准上,原地反馈始终优于标准多轮反馈,同时需要更少的token,我们的细粒度分析表明,它能更可靠地应用修正并将修正传播到后续推理中。一项由领域专家精炼LLM生成摘要的用户研究证实了这些发现:参与者报告了更高的最终输出满意度和显著更低的疲劳感,而结合原地反馈和多轮反馈的混合策略在每个测量维度上得分最高。这些结果表明,直接编辑错误是专家-LLM协作的更有效范式。

英文摘要

LLM-generated drafts often contain subtle factual or logical errors, yet prior work shows that models struggle to reliably integrate multi-turn feedback aimed at fixing them. We propose in-place feedback, an interaction paradigm in which the user directly edits the model's previous response and the model continues generation from the edited context. In-place feedback consistently outperforms standard multi-turn feedback across five reasoning-intensive benchmarks while requiring fewer tokens, and our fine-grained analysis shows that it applies corrections more reliably and propagates them to subsequent reasoning. A user study with domain experts refining LLM-generated summaries corroborates these findings: participants report higher final-output satisfaction and substantially lower fatigue with in-place feedback, and a mixed strategy combining in-place and multi-turn feedback scores highest on every measured dimension. These results suggest that editing errors directly is a more effective paradigm for expert-LLM collaboration.

2509.18377 2026-05-29 cs.CL

Interactive In-Meeting Speaker Correction with Human Feedback

交互式会议中基于人类反馈的说话人修正

Xinlu He, Yiwen Guan, Badrivishal Paurana, Pitipat Kongsomjit, Zilin Dai, Jacob Whitehill

AI总结 提出一种LLM辅助的会议内说话人修正系统,通过用户简短反馈修正说话人归属错误,结合流式ASR、说话人日志、LLM摘要和在线注册机制,在AMI数据集上实现DER降低31.99%、说话人替换错误降低52.68%。

详情
AI中文摘要

大多数自动语音处理系统以“开环”模式运行,没有用户关于谁说了什么的反馈,然而人在回路的工作流程有可能实现更高的准确性。我们提出了一种LLM辅助的会议内说话人修正系统,允许用户通过简短纠正性反馈来修复说话人归属错误。在执行流式ASR和说话人日志后,系统呈现简洁的LLM生成的摘要,帮助用户识别重要的说话人错误,并通过更新带说话人标注的转录文本和添加在线说话人注册来整合用户反馈。为了使该工作流程在语音处理、LLM分析和用户反馈存在错误的情况下仍然有效,我们开发了多种机制来更精确地识别预期的修正。此外,我们构建了一个LLM驱动的用户反馈模拟,以评估工作流程的可复现性和可扩展性。应用于AMI头戴式麦克风测试集,我们的系统相对于流式基线(Google ASR + ECAPA)显著降低了31.99%的DER和52.68%的说话人替换错误。

英文摘要

Most automatic speech processing systems operate in ``open loop'' mode without user feedback about who said what, yet human-in-the-loop workflows can potentially enable higher accuracy. We propose an LLM-assisted in-meeting speaker correction system that lets users fix speaker attribution errors through brief corrective feedback. After performing streaming ASR and diarization, the system presents concise LLM-generated summaries to help users identify important speaker errors, and it incorporates user feedback by updating the speaker-attributed transcript and adding online speaker enrollments. To make this workflow effective despite errors in speech processing, LLM analysis, and user feedback, we developed several mechanisms to identify the intended correction more precisely. Further, we built an LLM-driven user feedback simulation to evaluate the workflow reprodubilty and at scale. Applied to the AMI headset test set, our system substantially reduces the DER from a streaming baseline (Google ASR + ECAPA) by 31.99% and speaker substitution error by 52.68%.

2509.17208 2026-05-29 cs.LG physics.atm-clus

Active Learning for Machine Learning Driven Molecular Dynamics

主动学习驱动的分子动力学机器学习

Kevin Bachelor, Sanya Murdeshwar, Daniel Sabo, Razvan Marinescu

AI总结 针对机器学习粗粒化势函数在模拟中因采样不足而性能退化的问题,提出基于RMSD的主动学习框架,通过在线查询Oracle生成数据,在保持粗粒化效率的同时修正覆盖缺口,使Chignolin蛋白模型的TICA空间W1指标提升33.05%。

Comments 9 pages, 4 figures, for Neurips Workshop: Machine Learning and the Physical Sciences 2025

详情
AI中文摘要

机器学习粗粒化(CG)势函数速度快,但当模拟到达欠采样的生物分子构象时性能会随时间退化,而生成广泛的全原子(AA)数据来应对这一问题在计算上不可行。我们提出了一种用于分子动力学(MD)中CG神经网络势函数的新型主动学习(AL)框架。该方法基于CGSchNet模型,采用从MD模拟中基于均方根偏差(RMSD)的帧选择,通过在神经网络势函数训练过程中查询预言机来实时生成数据。该框架在保持CG级效率的同时,在RMSD识别的精确覆盖缺口处修正模型。通过训练粗粒化神经网络势函数CGSchNet,我们实验证明该框架探索了先前未见过的构型,并在构象空间中未探索的区域上训练模型。我们的主动学习框架使得在Chignolin蛋白上训练的CGSchNet模型在内部基准测试套件上的时间滞后独立成分分析(TICA)空间中,Wasserstein-1(W1)指标提升了33.05%。

英文摘要

Machine-learned coarse-grained (CG) potentials are fast, but degrade over time when simulations reach under-sampled bio-molecular conformations, and generating widespread all-atom (AA) data to combat this is computationally infeasible. We propose a novel active learning (AL) framework for CG neural network potentials in molecular dynamics (MD). Building on the CGSchNet model, our method employs root mean squared deviation (RMSD)-based frame selection from MD simulations in order to generate data on-the-fly by querying an oracle during the training of a neural network potential. This framework preserves CG-level efficiency while correcting the model at precise, RMSD-identified coverage gaps. By training CGSchNet, a coarse-grained neural network potential, we empirically show that our framework explores previously unseen configurations and trains the model on unexplored regions of conformational space. Our active learning framework enables a CGSchNet model trained on the Chignolin protein to achieve a 33.05\% improvement in the Wasserstein-1 (W1) metric in Time-lagged Independent Component Analysis (TICA) space on an in-house benchmark suite.

2509.15629 2026-05-29 cs.SD eess.AS

An Extensive Analysis of the Singing Voice Conversion Challenge 2025 Evaluation Results

歌唱声音转换挑战2025评估结果的深入分析

Lester Phillip Violeta, Xueyao Zhang, Jiatong Shi, Yusuke Yasuda, Wen-Chin Huang, Zhizheng Wu, Tomoki Toda

AI总结 本文对2025年歌唱声音转换挑战赛的评估结果进行了深入分析,通过新数据库、两个任务、开源基线和大规模众包测试,比较了33个系统在歌手身份和歌唱风格转换上的表现,发现顶级系统在身份相似性上接近真实样本,但风格建模(如气息、滑音、颤音)仍具挑战,且现有客观指标无法完全替代主观评分。

Comments Submitted to IEEE TASLP

详情
AI中文摘要

我们呈现了对最新一届歌唱声音转换挑战赛结果的分析,该科学活动旨在在受控环境中比较和理解不同的声音转换系统。与以往仅关注转换歌手身份的迭代相比,今年我们还关注了转换歌手的歌唱风格。为了创建受控环境和进行彻底评估,我们开发了一个新的挑战数据库,引入了两个任务,开源了基线系统,并进行了大规模的众包听力测试和客观评估。该挑战赛持续了两个月,我们总共评估了33个不同的系统。大规模众包听力测试的结果表明,顶级系统在歌手身份评分上与真实样本相当。然而,建模歌唱风格并因此实现高自然度仍然是该任务中的一个挑战,主要原因是难以对气息、滑音和颤音歌唱风格中的动态信息进行建模。对挑战赛的进一步分析还讨论了传统相似性测试和动态偏好测试在评估歌唱风格相似性方面的局限性。此外,计算斯皮尔曼秩相关系数表明,依赖客观指标(如色度对齐)和非匹配指标(如说话人嵌入)与主观评分相关性最高,但仍未达到可被视为真正替代主观评分的水平。

英文摘要

We present a thorough analysis of the findings of the latest iteration of the Singing Voice Conversion Challenge, a scientific event aiming to compare and understand different voice conversion systems in a controlled environment. Compared to previous iterations which solely focused on converting the singer identity, this year we also focused on converting the singing style of the singer. To create a controlled environment and thorough evaluations, we developed a new challenge database, introduced two tasks, open-sourced baselines, and conducted large-scale crowd-sourced listening tests and objective evaluations. The challenge was run for two months and in total we evaluated 33 different systems. The results of the large-scale crowd-sourced listening test showed that top systems had comparable singer identity scores to ground truth samples. However, modeling the singing style and consequently achieving high naturalness still remains a challenge in this task, primarily due to the difficulty in modeling dynamic information in breathy, glissando, and vibrato singing styles. Further analyses of the challenge also discuss the limitations of both the traditional similarity test and the dynamic preference test in evaluating singing style similarity. Moreover, calculating Spearman's rank correlation coefficient shows that dependent objective metrics such as chroma-alignment and non-match metrics such as speaker embeddings are the most correlated to subjective scores, but are still not at a level where it could be considered as a true replacement for subjective scores.

2509.08194 2026-05-29 cs.LG stat.ML

Prescribe-then-Select: Adaptive Policy Selection for Contextual Stochastic Optimization

先规定后选择:面向情境随机优化的自适应策略选择

Caio de Prospero Iglesias, Kimberly Villalobos Carballo, Dimitris Bertsimas

AI总结 针对情境随机优化中候选策略在协变量空间表现异质的问题,提出Prescribe-then-Select模块化框架,通过构建可行策略库并基于最优策略树集成学习元策略实现数据驱动的自适应选择,在单阶段报童和两阶段运输规划问题中优于单一最优策略。

详情
AI中文摘要

我们解决了情境随机优化中的策略选择问题,其中协变量作为情境信息可用,且决策必须满足严格的可行性约束。在许多情境随机优化场景中,来自不同建模范式的多个候选策略在协变量空间上表现出异质性能,没有单一策略能够统一占优。我们提出了Prescribe-then-Select(PS)模块化框架,该框架首先构建一个可行候选策略库,然后学习一个元策略来为观测到的协变量选择最佳策略。我们使用在训练集上通过交叉验证训练的最优策略树集成来实现元策略,使策略选择完全数据驱动。在两个基准情境随机优化问题——单阶段报童和两阶段运输规划中,PS在协变量空间的异质区域始终优于最佳单一策略,并在不存在这种异质性时收敛到占优策略。所有重现结果的代码可在https://anonymous.4open.science/r/Prescribe-then-Select-TMLR获取。

英文摘要

We address the problem of policy selection in contextual stochastic optimization (CSO), where covariates are available as contextual information and decisions must satisfy hard feasibility constraints. In many CSO settings, multiple candidate policies--arising from different modeling paradigms--exhibit heterogeneous performance across the covariate space, with no single policy uniformly dominating. We propose Prescribe-then-Select (PS), a modular framework that first constructs a library of feasible candidate policies and then learns a meta-policy to select the best policy for the observed covariates. We implement the meta-policy using ensembles of Optimal Policy Trees trained via cross-validation on the training set, making policy choice entirely data-driven. Across two benchmark CSO problems--single-stage newsvendor and two-stage shipment planning--PS consistently outperforms the best single policy in heterogeneous regimes of the covariate space and converges to the dominant policy when such heterogeneity is absent. All the code to reproduce the results can be found at https://anonymous.4open.science/r/Prescribe-then-Select-TMLR.

2508.16873 2026-05-29 cs.CV cs.SI

Multimodal LLMs See Sentiment

多模态大语言模型感知情感

Neemias B. da Silva, John Harrison, Rodrigo Minetto, Myriam R. Delgado, Bogdan T. Nassu, Thiago H. Silva

AI总结 本文通过系统评估研究,探讨多模态大语言模型在图像情感分析中的三种方法,发现基于MLLM描述的两阶段流水线在微调后性能显著优于传统基线。

Comments 24 pages, 7 figures

详情
AI中文摘要

理解视觉内容如何传达情感在以图像为主导的数字环境中日益重要。然而,情感感知依赖于复杂的场景级语义,这对计算模型而言是一项具有挑战性的任务。本文通过一项系统性的、以评估为导向的研究,从三个视角考察多模态大语言模型如何执行图像情感分析:(i) 使用MLLM直接从图像进行情感分类;(ii) 使用预训练LLM对MLLM生成的描述进行情感分析;(iii) 在情感标注的描述上微调这些LLM以评估性能和泛化能力。在最新基准上的实验表明,两阶段MLLM描述中介流水线在多种评估设置下能显著提高预测准确性,尤其是当LLM组件被微调时。在不同的一致性阈值和情感粒度下,该流水线的最强配置在基准测试中分别优于基于词典、CNN和Transformer的基线高达30.9%、64.8%和42.4%。在跨数据集评估中,所提出的流水线——无需在目标数据集上进行训练或微调——仍比最佳域内基线高出8%以上。总体而言,本研究提供了对MLLM描述中介情感分析的综合评估,阐明了其有效的条件、失败的场景以及与基于传统视觉方法的比较,同时为未来研究提供了可复现的基准资源。

英文摘要

Understanding how visual content conveys sentiment is increasingly important in a digital landscape dominated by imagery. However, sentiment perception depends on complex scene-level semantics, making this a challenging task for computational models. This paper examines how Multimodal Large Language Models (MLLMs) perform sentiment analysis in images through a systematic, evaluation-driven study encompassing three perspectives: (i) direct sentiment classification from images using MLLMs; (ii) sentiment analysis on MLLM-generated descriptions using pre-trained LLMs; and (iii) fine-tuning these LLMs on sentiment-labeled descriptions to assess performance and generalization. Experiments on a recent benchmark show that a two-stage MLLM description-mediated pipeline can substantially improve prediction accuracy under several evaluation settings, particularly when the LLM component is fine-tuned. Across different agreement thresholds and sentiment granularities, the strongest configurations of this pipeline outperform lexicon-, CNN-, and Transformer-based baselines in our benchmark by up to 30.9%, 64.8%, and 42.4%, respectively. In cross-dataset evaluation, the proposed pipeline - without training or fine-tuning on the target dataset - still surpasses the best in-domain baseline by over 8%. Overall, the study provides a comprehensive assessment of MLLM description-mediated sentiment analysis, clarifying the conditions under which it is effective, the scenarios in which it fails, and its comparison with traditional vision-based approaches, while also providing a reproducible benchmark resource for future research.

2508.15371 2026-05-29 cs.CL cs.AI cs.LG

Confidence-Modulated Speculative Decoding for Large Language Models

置信度调节的推测解码用于大型语言模型

Jaydip Sen, Subhasis Dasgupta, Hetvi Waghela

AI总结 本文提出一种基于置信度调节的推测解码框架,通过熵和边际不确定性度量动态调整草稿长度与验证过程,在机器翻译和摘要任务上实现加速并保持或提升BLEU和ROUGE分数。

Comments This is the preprint of the paper, which has been accepted for oral presentation and publication in the proceedings of IEEE INDISCON 2025. The conference will be organized at the National Institute of Technology, Rourkela, India, from August 21 to 23, 2025. The paper is 10 pages long, and it contains 2 figures and 5 tables

详情
AI中文摘要

推测解码已成为一种通过草稿-验证范式并行化令牌生成来加速自回归推理的有效方法。然而,现有方法依赖静态草稿长度和刚性验证标准,限制了其在不同模型不确定性和输入复杂性下的适应性。本文提出一种基于置信度调节草稿的信息论推测解码框架。通过利用草稿模型输出分布上的熵和边际不确定性度量,所提方法在每次迭代中动态调整推测生成的令牌数量。这种自适应机制减少了回滚频率,提高了资源利用率,并保持了输出保真度。此外,验证过程使用相同的置信度信号进行调节,使得在不牺牲生成质量的情况下更灵活地接受草稿令牌。在机器翻译和摘要任务上的实验表明,与标准推测解码相比,该方法在保持或提升BLEU和ROUGE分数的同时实现了显著加速。所提方法提供了一种原则性的即插即用方法,用于在不确定性变化条件下实现大型语言模型的高效且鲁棒的解码。

英文摘要

Speculative decoding has emerged as an effective approach for accelerating autoregressive inference by parallelizing token generation through a draft-then-verify paradigm. However, existing methods rely on static drafting lengths and rigid verification criteria, limiting their adaptability across varying model uncertainties and input complexities. This paper proposes an information-theoretic framework for speculative decoding based on confidence-modulated drafting. By leveraging entropy and margin-based uncertainty measures over the drafter's output distribution, the proposed method dynamically adjusts the number of speculatively generated tokens at each iteration. This adaptive mechanism reduces rollback frequency, improves resource utilization, and maintains output fidelity. Additionally, the verification process is modulated using the same confidence signals, enabling more flexible acceptance of drafted tokens without sacrificing generation quality. Experiments on machine translation and summarization tasks demonstrate significant speedups over standard speculative decoding while preserving or improving BLEU and ROUGE scores. The proposed approach offers a principled, plug-in method for efficient and robust decoding in large language models under varying conditions of uncertainty.

2508.12176 2026-05-29 cs.CV cs.AI eess.SP

Scalable RF Simulation in Generative 4D Worlds

生成式4D世界中的可扩展射频仿真

Zhiwei Zheng, Dongyin Hu, Mingmin Zhao

AI总结 提出WaveVerse框架,通过语言引导的4D世界生成器和物理信号模拟器实现可扩展的射频信号仿真,在相位敏感基准上表现高保真度,并有效提升下游任务性能。

Comments Accepted to ICML 2026

详情
AI中文摘要

射频(RF)感知已成为一种强大的、保护隐私的替代视觉方法,用于各种感知任务。然而,在动态和多样化的环境中构建高质量的RF数据集仍然是一个重大挑战。为了解决这一问题,我们引入了WaveVerse,一个基于提示的可扩展框架,该框架从生成的室内场景中模拟真实的RF信号,并包含由空间路径引导的人体运动,从而无需手动轨迹设计即可实现多样且可行的行为。WaveVerse具有语言引导的4D世界生成器和基于物理的信号模拟器,能够在多样化的环境中实现RF信号的逼真模拟。它采用了一个相位相干光线追踪器,保留了空间和时间上的相位一致性。模拟信号在相位敏感基准上显示出高保真度,并且与真实世界收集的测量数据以及来自专有电磁求解器的模拟结果高度一致。当用于数据增强时,WaveVerse在RF成像和人类活动识别等下游任务中持续提升性能,其增益随模拟数据量的增加而增长,并超越了现有方法。代码和附加材料可在网页上获取。

英文摘要

Radio Frequency (RF) sensing has emerged as a powerful, privacy-preserving alternative to vision-based methods for various perception tasks. However, building high-quality RF datasets in dynamic and diverse environments remains a major challenge. To address this, we introduce WaveVerse, a prompt-based, scalable framework that simulates realistic RF signals from generated indoor scenes with human motions guided by spatial paths, enabling diverse and feasible behaviors without manual trajectory design. WaveVerse features a language-guided 4D world generator and a physics-based signal simulator that enables realistic simulation of RF signals in diverse environments. It employs a phase-coherent ray tracer that preserves both spatial and temporal phase consistency. The simulated signals show high fidelity on phase-sensitive benchmarks, and closely align with both real-world collected measurements and simulations from a proprietary electromagnetic solver. When used for data augmentation, WaveVerse consistently improves performance in downstream tasks like RF imaging and human activity recognition, with gains that grow with the amount of simulated data and surpass existing methods. Code and additional materials are available on the webpage.

2508.10566 2026-05-29 cs.CV

HM-Talker: Hybrid Motion Modeling for High-Fidelity Talking Head Synthesis

HM-Talker:用于高保真说话头合成的混合运动建模

Shiyu Liu, Kui Jiang, Junjun Jiang, Xianming Liu, Xiaocheng Feng, Hongxun Yao, Qi Tian

AI总结 提出HM-Talker框架,通过混合显式发音线索与隐式韵律特征,结合交叉模态映射和随机特征配对策略,解决说话头生成中个性化与泛化的权衡问题,在视觉真实感和唇同步精度上超越现有方法。

详情
AI中文摘要

音频驱动的说话头生成面临个性化与泛化之间的基本权衡,限制了其实际应用。隐式模型通常以结构不一致为代价实现泛化,导致不稳定的头部运动和不准确的唇同步。而显式方法引入了几何和解剖先验,如参数化面部几何的3D可变形模型(3DMM)或编码面部肌肉运动的动作单元(AU),但它们往往产生过度中性的表情或泛化能力有限。为解决这一矛盾,我们提出了HM-Talker,一个音频驱动的说话头框架,它协同整合显式发音线索与隐式韵律特征,以刻画身份特定动态,同时实现音频驱动的泛化。其显著特点可概括为:i) 跨模态映射模块(CMMM),从音频和视频中提取全面的运动线索词汇表;ii) 混合运动建模模块(HMMM),采用随机特征配对(SFP)策略,动态融合配对的隐式和显式特征以进行运动合成。该设计促进了下半部分面部运动的迭代优化,在身份特定目标与身份无关(仅音频)目标之间交替进行。大量实验表明,HM-Talker在多种设置下的视觉真实感和唇同步精度方面均优于最先进方法。

英文摘要

Audio-driven talking head generation faces a fundamental trade-off between personalization and generalization, limiting its practical application. Implicit models often achieve generalization at the cost of structural incoherence, resulting in unstable head motion and inaccurate lip synchronization. While explicit methods incorporate geometric and anatomical priors such as 3D Morphable Models (3DMMs), which parameterize facial geometry, or Action Units (AUs), which code facial muscle movements--they tend to produce overly neutral expressions or suffer from limited generalization. To resolve this conflict, we present HM-Talker, an audio-driven talking head framework that synergistically integrates explicit articulatory cues with implicit prosodic features to characterize identity-specific dynamics while enabling audio-driven generalization. Its distinctive features can be summarized as: i) the Cross-Modal Mapping Module (CMMM) that extracts a comprehensive vocabulary of motion cues from audio and video, and ii) the Hybrid Motion Modeling Module (HMMM) that employs a Stochastic Feature Pairing (SFP) strategy to dynamically merge paired implicit and explicit features for motion synthesis. This design facilitates an iterative optimization of the lower face motion, alternating between identity-specific and identity-agnostic (audio-only) objectives. Extensive experiments demonstrate that HM-Talker outperforms state-of-the-art methods in both visual realism and lip-sync accuracy across diverse settings.

2508.09976 2026-05-29 cs.RO

Masquerade: Learning from In-the-wild Human Videos using Data-Editing

Masquerade: 利用数据编辑从真实世界人类视频中学习

Marion Lepert, Jiaying Fang, Jeannette Bohg

AI总结 提出Masquerade方法,通过编辑真实世界第一人称人类视频(估计3D手部姿态、修复手臂、叠加渲染双臂机器人)弥合视觉具身差距,并利用编辑后的视频预训练视觉编码器、微调扩散策略头,在三个长时程双臂厨房任务中实现比基线高5-6倍的泛化性能。

Comments Project website at https://masquerade-robot.github.io/

详情
Journal ref
2026 IEEE International Conference on Robotics and Automation (ICRA), 2026
AI中文摘要

机器人操作研究仍然面临严重的数据稀缺问题:即使是最大的机器人数据集,其规模和多样性也比推动语言和视觉领域近期突破的数据集小几个数量级。我们提出Masquerade,一种编辑真实世界第一人称人类视频以弥合人类与机器人之间视觉具身差距,并利用这些编辑后的视频学习机器人策略的方法。我们的流程通过以下步骤将每段人类视频转化为机器人化演示:(i) 估计3D手部姿态,(ii) 修复人类手臂,(iii) 叠加一个追踪恢复的末端执行器轨迹的渲染双臂机器人。在67.5万帧编辑后的视频片段上预训练一个视觉编码器以预测未来的2D机器人关键点,并在每个任务仅使用50个机器人演示微调扩散策略头时继续该辅助损失,所得到的策略泛化能力显著优于先前工作。在三个分别于三个未见场景中评估的长时程、双臂厨房任务中,Masquerade的性能比基线高出5-6倍。消融实验表明,机器人叠加和联合训练均不可或缺,且性能随编辑后人类视频数量呈对数增长。这些结果表明,明确弥合视觉具身差距能够解锁来自人类视频的庞大、现成数据源,可用于改进机器人策略。

英文摘要

Robot manipulation research still suffers from significant data scarcity: even the largest robot datasets are orders of magnitude smaller and less diverse than those that fueled recent breakthroughs in language and vision. We introduce Masquerade, a method that edits in-the-wild egocentric human videos to bridge the visual embodiment gap between humans and robots and then learns a robot policy with these edited videos. Our pipeline turns each human video into robotized demonstrations by (i) estimating 3-D hand poses, (ii) inpainting the human arms, and (iii) overlaying a rendered bimanual robot that tracks the recovered end-effector trajectories. Pre-training a visual encoder to predict future 2-D robot keypoints on 675K frames of these edited clips, and continuing that auxiliary loss while fine-tuning a diffusion policy head on only 50 robot demonstrations per task, yields policies that generalize significantly better than prior work. On three long-horizon, bimanual kitchen tasks evaluated in three unseen scenes each, Masquerade outperforms baselines by 5-6x. Ablations show that both the robot overlay and co-training are indispensable, and performance scales logarithmically with the amount of edited human video. These results demonstrate that explicitly closing the visual embodiment gap unlocks a vast, readily available source of data from human videos that can be used to improve robot policies.

2508.08677 2026-05-29 cs.LG cs.CV

Multi-level Collaborative Distillation Meets Global Workspace Model: A Unified Framework for OCIL

多级协作蒸馏遇见全局工作空间模型:面向OCIL的统一框架

Shibin Su, Guoqiang Liang, De Cheng, Shizhou Zhang, Lingyan Ran

AI总结 提出一种结合全局工作空间模型和多级协作蒸馏的统一框架,通过融合多学生模型参数形成共享隐式记忆并周期性广播,以及跨学生一致性和历史知识对齐机制,有效平衡在线类增量学习中的稳定性与可塑性。

Comments 15 pages, 8 figures

详情
AI中文摘要

在线类增量学习(OCIL)使模型能够从非独立同分布的数据流中持续学习。由于数据流中的样本只能被看到一次,因此与离线学习相比,它更适用于现实场景。然而,这一约束加剧了OCIL在维持稳定性与可塑性之间适当平衡的挑战。此外,在现实世界中更严格的内存缓冲区约束下,当前基于重放的方法效果较差。虽然集成方法提高了可塑性,但它们常常在稳定性上遇到困难。受全局工作空间理论(GWT)启发,我们提出了一种新颖方法,通过全局工作空间模型(GWM)——一种共享的隐式记忆,指导多个学生模型的学习——来增强集成学习。GWM通过在每个训练批次中融合所有学生的参数形成,捕获历史学习轨迹,并作为知识巩固的动态锚点。类似于GWT的广播机制,GWM定期重新分发给学生,稳定学习并促进跨任务一致性。此外,我们引入了一种多级协作蒸馏机制。它强制学生之间保持对等一致性,并通过将每个学生与GWM对齐来保留历史知识。因此,学生模型在保持先前所学知识的同时,仍能适应新任务,在稳定性与可塑性之间实现更好的平衡。在三个标准OCIL基准上的大量实验表明,我们的方法在各种内存预算下为多个OCIL模型带来了显著的性能提升。代码可在https://github.com/susususushi/GWM获取。

英文摘要

Online Class-Incremental Learning (OCIL) enables models to learn continuously from non-i.i.d. data streams. Since samples of the data streams can be seen only once, it is more suitable for real-world scenarios compared to offline learning. However, this constraint intensifies the challenge for OCIL in maintaining an appropriate balance between stability and plasticity. Moreover, under stricter memory buffer constraints in real world, current replay-based methods are less effective. While ensemble methods improve plasticity, they often struggle with stability. Inspired by the Global Workspace Theory (GWT), we propose a novel approach that enhances ensemble learning through a Global Workspace Model (GWM)-a shared, implicit memory that guides the learning of multiple student models. The GWM is formed by fusing the parameters of all students within each training batch, capturing the historical learning trajectory and serving as a dynamic anchor for knowledge consolidation. Like the broadcasting mechanism of GWT, the GWM is redistributed periodically to students, stabilizing learning and promoting cross-task consistency. In addition, we introduce a multi-level collaborative distillation mechanism. It enforces peer-to-peer consistency among students and preserves historical knowledge by aligning each student with the GWM. As a result, student models remain adaptable to new tasks while maintaining previously learned knowledge, striking a better balance between stability and plasticity. Extensive experiments on three standard OCIL benchmarks show that our method delivers significant performance improvement for several OCIL models across various memory budgets. The code is available at https://github.com/susususushi/GWM.

2508.05614 2026-05-29 cs.CL cs.AI

GroundAct: Can LLM Agents Ground Actions in Environmental States?

GroundAct:LLM智能体能否在环境状态中实现动作落地?

Zixuan Wang, Dingming Li, Hongxing Li, Yanrui Miao, Shuo Chen, Yuchen Yan, Wenqi Zhang, Yongliang Shen, Weiming Lu, Jun Xiao, Yueting Zhuang

AI总结 本研究提出GroundAct基准,通过1500个场景和16592个任务实例评估15个LLM,发现动作落地能力是多维挑战,不能仅通过模型规模解决。

Comments Project Page: https://zju-real.github.io/OmniEmbodied Code: https://github.com/ZJU-REAL/OmniEmbodied

详情
AI中文摘要

LLM智能体在指令完全指定动作的任务上成功率达到85-96%,但当动作可行性取决于指令未提及的环境状态时,成功率降至29-53%。我们认为这一差距反映了一种缺失的能力:动作落地,即从结构化环境状态推断动作是否可行、缺少哪些前提条件以及是否超出个体能力的能力。我们引入GroundAct,这是一个包含1500个场景和16592个任务实例的基准,基于文本的交互式环境涵盖11个领域,任务按认知复杂度层级组织为七个类别。评估15个LLM(3B-671B)后,我们发现三种诊断模式:(i)属性推理与工具和协作推理弱相关,产生不同的模型轮廓;(ii)完整环境图在工具使用与隐式协作之间产生高达+27.6/-22.9%的差异,区分了搜索边界与约束过滤瓶颈;(iii)监督微调将Qwen2.5-3B在直接命令上的性能从0.6%提升至76.3%,但在隐式协作上仅从1.5%提升至5.5%。这些结果表明动作落地是一个多维挑战,不能仅通过规模扩展解决。

英文摘要

LLM agents achieve 85-96% success on tasks where instructions fully specify the action, but drop to 29-53% when action feasibility depends on environmental state that the instruction does not mention. We argue that this gap reflects a missing capability: action grounding, the ability to infer from structured environmental state whether an action is feasible, what prerequisites it lacks, and whether it exceeds individual capacity. We introduce GroundAct, a benchmark of 1,500 scenarios and 16,592 task instances in text-based interactive environments spanning 11 domains, with tasks organized into seven categories along a cognitive complexity hierarchy. Evaluating 15 LLMs (3B-671B), we find three diagnostic patterns: (i) attribute reasoning is weakly correlated with tool and coordination reasoning, producing distinct model profiles; (ii) complete environment graphs yield up to +27.6/-22.9% on tool use vs. implicit collaboration, separating search-bound from constraint-filtering bottlenecks; and (iii) supervised fine-tuning lifts Qwen2.5-3B from 0.6% to 76.3% on direct command but only 1.5% to 5.5% on implicit collaboration. These results establish action grounding as a multi-dimensional challenge irreducible to scaling.

2508.03726 2026-05-29 cs.CL

Hierarchical Verification of Speculative Beams for Accelerating LLM Inference

分层验证投机波束以加速大语言模型推理

Jaydip Sen, Harshitha Puvvala, Subhasis Dasgupta

AI总结 提出分层验证树(HVT)框架,通过优先验证高似然草稿并早期剪枝次优候选,以分层方式重构投机波束解码,从而在不重训练或修改架构下显著降低推理时间和能耗。

Comments This paper was accepted for oral presentation and publication in the 3rd International Conference on Data Science and Network Engineering (ICDSNE 2025), organized at NIT, Agartala, India, from July 25 to 26, 2025. The paper is 12 pages long, and it contains 3 tables and 4 figures. This is NOT the final paper, which will be published in the Springer-published proceedings

详情
AI中文摘要

大语言模型(LLMs)在多种自然语言处理任务中取得了显著成功,但由于其自回归特性,在推理效率方面面临持续挑战。尽管投机解码和波束采样带来了显著改进,传统方法按顺序验证草稿序列且无优先级区分,导致不必要的计算开销。本文提出分层验证树(HVT),一种通过优先处理高似然草稿并实现次优候选早期剪枝来重构投机波束解码的新框架。我们开发了理论基础和形式化的验证-剪枝算法以确保正确性和效率。该方法无需重训练或架构修改即可集成到标准LLM推理流程中。跨多个数据集和模型的实验评估表明,HVT始终优于现有投机解码方案,在维持或提升输出质量的同时,实现了推理时间和能耗的大幅降低。研究结果凸显了分层验证策略作为加速大语言模型推理新方向的潜力。

英文摘要

Large language models (LLMs) have achieved remarkable success across diverse natural language processing tasks but face persistent challenges in inference efficiency due to their autoregressive nature. While speculative decoding and beam sampling offer notable improvements, traditional methods verify draft sequences sequentially without prioritization, leading to unnecessary computational overhead. This work proposes the Hierarchical Verification Tree (HVT), a novel framework that restructures speculative beam decoding by prioritizing high-likelihood drafts and enabling early pruning of suboptimal candidates. Theoretical foundations and a formal verification-pruning algorithm are developed to ensure correctness and efficiency. Integration with standard LLM inference pipelines is achieved without requiring retraining or architecture modification. Experimental evaluations across multiple datasets and models demonstrate that HVT consistently outperforms existing speculative decoding schemes, achieving substantial reductions in inference time and energy consumption while maintaining or enhancing output quality. The findings highlight the potential of hierarchical verification strategies as a new direction for accelerating large language model inference.

2508.02537 2026-05-29 cs.LG

Solved in Unit Domain: JacobiNet for Differentiable Coordinate-Transformed PINNs

在单位域中求解:用于可微坐标变换PINNs的JacobiNet

Xi Chen, Jianchuan Yang, Junjie Zhang, Runnan Yang, Xu Liu, Hong Wang, Tinghui Zheng, Ziyu Ren, Wenqi Hu

AI总结 提出JacobiNet,一种基于学习的可微坐标变换PINN框架,通过端到端可微架构统一域映射与PDE求解,解决不规则边界域中PINNs的归一化、边界强制和损失项不平衡问题,显著提升精度和效率。

Comments Accepted by Journal of Computational Physics

详情
AI中文摘要

物理信息神经网络(PINNs)通过将物理定律嵌入学习过程,为求解偏微分方程提供了强大框架。然而,当应用于具有不规则边界的域时,PINNs常遭受不稳定和收敛缓慢的问题,这源于(1)几何各向异性导致的不一致归一化,(2)不精确的边界强制,以及(3)损失项竞争的不平衡。常见的解决方法是将该域映射到规则空间。然而,传统的映射方法依赖于特定情况的网格,在预指定的固定节点上定义雅可比矩阵,并通过链式法则重新表述PDE——使其与现代自动微分和张量框架不兼容。为弥合这一差距,我们提出了JacobiNet,一种基于学习的坐标变换PINN框架,在端到端可微架构中统一了域映射和PDE求解。JacobiNet通过自动梯度实现直接的雅可比矩阵计算,与下游PINNs共享计算图,从而避免了特定情况的网格划分、显式的雅可比矩阵计算/存储以及手动的PDE重新表述,同时解锁了几何编辑操作。通过将物理建模与几何复杂性分离,JacobiNet(1)解决了原始各向异性坐标中的归一化挑战,(2)促进了边界条件的硬性强制,以及(3)缓解了长期存在的损失项间不平衡问题。在各种PDE上的评估表明,JacobiNet将相对L2误差从0.11-0.73降低到0.01-0.09,平均精度提升了15.6倍。在具有变化形状的血管状域中,JacobiNet实现了对未见几何形状的毫秒级映射推理,平均预测精度提升了3.65倍,同时提供了超过10倍的加速——展示了强大的泛化能力、精度和效率。

英文摘要

Physics-Informed Neural Networks (PINNs) offer a powerful framework for solving PDEs by embedding physical laws into the learning process. However, when applied to domains with irregular boundaries, PINNs often suffer from instability and slow convergence, which stems from (1) inconsistent normalization due to geometric anisotropy, (2) inaccurate boundary enforcement, and (3) imbalanced loss term competition. A common workaround is to map the domain to a regular space. Yet, conventional mapping methods rely on case-specific meshes, define Jacobians at pre-specified fixed nodes, reformulate PDEs via the chain rule-making them incompatible with modern automatic differentiation, tensor-based frameworks. To bridge this gap, we propose JacobiNet, a learning-based coordinate-transformed PINN framework that unifies domain mapping and PDE solving within an end-to-end differentiable architecture. JacobiNet enables direct Jacobian computation via autograd, shares computation graph with downstream PINNs, thereby avoiding case-specific meshing, explicit Jacobian computation/storage, and manual PDE reformulation while unlocking geometric-editing operations. Separating physical modeling from geometric complexity, JacobiNet (1) addresses normalization challenges in the original anisotropic coordinates, (2) facilitates the hard enforcement of boundary conditions, and (3) mitigates the long-standing imbalance among loss terms. Evaluated on various PDEs, JacobiNet reduces the relative L2 error from 0.11-0.73 to 0.01-0.09, achieving an average 15.6x improvement in accuracy. In vessel-like domains with varying shapes, JacobiNet enables millisecond-level mapping inference for unseen geometries, improves prediction accuracy by an average of 3.65x, while delivering over 10x speedup-demonstrating strong generalization, accuracy, and efficiency.

2507.09574 2026-05-29 cs.CV cs.AI cs.CL

MENTOR: Efficient Multimodal-Conditioned Tuning for Autoregressive Vision Generation Models

MENTOR: 面向自回归视觉生成模型的高效多模态条件微调

Haozhe Zhao, Zefan Cai, Shuzheng Si, Liang Chen, Jiuxiang Gu, Wen Xiao, Minjia Zhang, Junjie Hu

AI总结 提出MENTOR框架,通过两阶段训练范式实现自回归图像生成器与多模态输入的细粒度token级对齐,无需辅助适配器或交叉注意力模块,在DreamBench++上取得优异性能。

Comments Findings of ACL 2026

详情
AI中文摘要

最近的文本到图像模型能够生成高质量结果,但在精确视觉控制、平衡多模态输入以及需要大量训练以实现复杂多模态图像生成方面仍存在困难。为解决这些局限,我们提出MENTOR,一种新颖的自回归(AR)框架,用于高效的多模态条件微调以实现自回归多模态图像生成。MENTOR将AR图像生成器与两阶段训练范式相结合,无需依赖辅助适配器或交叉注意力模块,即可实现多模态输入与图像输出之间的细粒度、token级对齐。两阶段训练包括:(1)多模态对齐阶段,建立稳健的像素级和语义级对齐;随后是(2)多模态指令微调阶段,平衡多模态输入的整合并增强生成可控性。尽管模型规模适中、基础组件非最优且训练资源有限,MENTOR在DreamBench++基准测试上仍取得了强劲性能,在概念保持和提示遵循方面优于竞争基线。此外,与基于扩散的方法相比,我们的方法具有更优的图像重建保真度、广泛的任务适应性以及更高的训练效率。数据集、代码和模型可在 https://github.com/HaozheZhao/MENTOR 获取。

英文摘要

Recent text-to-image models produce high-quality results but still struggle with precise visual control, balancing multimodal inputs, and requiring extensive training for complex multimodal image generation. To address these limitations, we propose MENTOR, a novel autoregressive (AR) framework for efficient Multimodal-conditioned Tuning for Autoregressive multimodal image generation. MENTOR combines an AR image generator with a two-stage training paradigm, enabling fine-grained, token-level alignment between multimodal inputs and image outputs without relying on auxiliary adapters or cross-attention modules. The two-stage training consists of: (1) a multimodal alignment stage that establishes robust pixel- and semantic-level alignment, followed by (2) a multimodal instruction tuning stage that balances the integration of multimodal inputs and enhances generation controllability. Despite modest model size, suboptimal base components, and limited training resources, MENTOR achieves strong performance on the DreamBench++ benchmark, outperforming competitive baselines in concept preservation and prompt following. Additionally, our method delivers superior image reconstruction fidelity, broad task adaptability, and improved training efficiency compared to diffusion-based methods. Dataset, code, and models are available at: https://github.com/HaozheZhao/MENTOR

2507.03318 2026-05-29 cs.LG cs.AI

Structure-Aware Compound-Protein Affinity Prediction via Graph Neural Network with Group Lasso Regularization

基于图神经网络与组套索正则化的结构感知化合物-蛋白质亲和力预测

Zanyu Shi, Yang Wang, Pathum Weerawarna, Jie Zhang, Timothy Richardson, Yijie Wang, Kun Huang

AI总结 提出利用图神经网络结合组套索和稀疏组套索正则化,从活性悬崖分子对中学习结构信息以预测化合物-蛋白质亲和力(IC50),并提升模型可解释性。

Comments 15 pages, 7 figures

详情
Journal ref
Comput Struct Biotechnol J. 2026;35:0012
AI中文摘要

可解释人工智能(XAI)方法越来越多地被应用于药物发现中,以学习分子表示并识别驱动性质预测的子结构。然而,为化合物性质预测构建结构-活性关系(SAR)建模的端到端可解释模型面临诸多挑战,例如特定蛋白质靶标的化合物-蛋白质相互作用活性数据有限,以及分子构型位点的细微变化会显著影响分子性质。我们利用具有活性悬崖的分子对,这些分子共享骨架但在取代基位点不同,其特征是对特定蛋白质靶标具有较大的效力差异。我们提出一个框架,通过实现图神经网络(GNN)来利用活性悬崖对的性质和结构信息,以预测化合物-蛋白质亲和力(即半数最大抑制浓度,IC50)。为了增强模型性能和可解释性,我们使用结构感知损失函数训练GNN,采用组套索和稀疏组套索正则化,这些正则化方法能够剪枝并突出与活性差异相关的分子子图。我们将该框架应用于针对三种原癌基因酪氨酸蛋白激酶Src蛋白(PDB ID:1O42、2H8H、4MXO)的分子活性悬崖数据。我们的方法通过稀疏组套索整合公共和私有节点信息,改进了性质预测,这体现在均方根误差(RMSE)降低和皮尔逊相关系数(PCC)提高上。应用正则化还通过提升图级全局方向分数和改进原子级着色精度,增强了GNN的特征归因能力。这些进展增强了药物发现流程中模型的可解释性,特别是在先导化合物优化中识别关键分子子结构方面。

英文摘要

Explainable artificial intelligence (XAI) approaches have been increasingly applied in drug discovery to learn molecular representations and identify substructures driving property predictions. However, building end-to-end explainable models for structure-activity relationship (SAR) modeling for compound property prediction faces many challenges, such as the limited number of compound-protein interaction activity data for specific protein targets, and plenty of subtle changes in molecular configuration sites significantly affecting molecular properties. We exploit pairs of molecules with activity cliffs that share scaffolds but differ at substituent sites, characterized by large potency differences for specific protein targets. We propose a framework by implementing graph neural networks (GNNs) to leverage property and structure information from activity cliff pairs to predict compound-protein affinity (i.e., half maximal inhibitory concentration, IC50). To enhance model performance and explainability, we train GNNs with structure-aware loss functions using group lasso and sparse group lasso regularizations, which prune and highlight molecular subgraphs relevant to activity differences. We applied this framework to activity cliff data of molecules targeting three proto-oncogene tyrosine-protein kinase Src proteins (PDB IDs: 1O42, 2H8H, 4MXO). Our approach improved property prediction by integrating common and uncommon node information with sparse group lasso, as reflected in reduced root mean squared error (RMSE) and improved Pearson's correlation coefficient (PCC). Applying regularizations also enhances feature attribution for GNN by boosting graph-level global direction scores and improving atom-level coloring accuracy. These advances strengthen model interpretability in drug discovery pipelines, particularly for identifying critical molecular substructures in lead optimization.

2506.12508 2026-05-29 cs.AI

AgentOrchestra: Orchestrating Multi-Agent Intelligence with the Tool-Environment-Agent(TEA) Protocol

AgentOrchestra:使用工具-环境-智能体(TEA)协议编排多智能体智能

Wentao Zhang, Liang Zeng, Yuzhen Xiao, Yongcong Li, Ce Cui, Yilei Zhao, Rui Hu, Yang Liu, Yahui Zhou, Bo An

AI总结 提出TEA协议和AgentOrchestra框架,通过统一抽象和分层编排实现多智能体系统的生命周期感知协调,在GAIA测试集上达到89.04%的准确率。

详情
AI中文摘要

最近基于LLM的智能体系统在复杂、长时任务上显示出潜力,但现有智能体协议(如A2A和MCP)未能充分支持跨智能体、工具和环境的生命周期感知协调。为解决这一局限,我们引入了 extbf{工具-环境-智能体}(TEA)协议,这是一种统一的抽象,将这些组件建模为具有显式生命周期的一等版本化资源。TEA支持端到端的上下文和版本管理,提高了可追溯性和可复现性,同时实现了智能体相关组件的持续自我进化 ootnote{除非另有说明,\emph{智能体相关组件}包括提示、记忆/工具/智能体/环境代码以及智能体输出(解决方案)。}。基于TEA,我们提出了\projectname,这是一个分层多智能体框架,其中中央规划器协调专门的子智能体,并在执行过程中动态扩展能力。在四个具有挑战性的基准测试(涵盖专家级智能体任务和科学/数学推理)上的实验表明,AgentOrchestra始终优于强基线;特别是在GAIA测试集上达到89.04%,据我们所知,这使其跻身领先方法之列。这些结果凸显了显式协议设计和分层编排对于构建更鲁棒、更自适应的多智能体系统的价值。

英文摘要

Recent advances in LLM-based agent systems have shown promise on complex, long-horizon tasks, but existing agent protocols (e.g., A2A and MCP) do not adequately support lifecycle-aware coordination across agents, tools, and environments. To address this limitation, we introduce the \textbf{Tool-Environment-Agent} (TEA) protocol, a unified abstraction that models these components as first-class, versioned resources with explicit lifecycles. TEA supports end-to-end context and version management, improving traceability and reproducibility, while also enabling continual self-evolution of agent-associated components\footnote{Unless otherwise specified, \emph{agent-associated components} include prompts, memory/tool/agent/environment code, and agent outputs (solutions).}. Building on TEA, we present \projectname, a hierarchical multi-agent framework in which a central planner coordinates specialized sub-agents and dynamically extends capabilities during execution. Experiments on four challenging benchmarks, spanning expert-level agent tasks and scientific/mathematical reasoning, show that AgentOrchestra consistently outperforms strong baselines; in particular, it achieves 89.04\% on the GAIA Test set, placing it among the leading methods to the best of our knowledge. These results highlight the value of explicit protocol design and hierarchical orchestration for building more robust and adaptive multi-agent systems.

2506.08354 2026-05-29 cs.CL cs.AI cs.IR

Position: Text Embeddings Should Capture Implicit Semantics, Not Just Surface Meaning

立场:文本嵌入应捕获隐含语义,而不仅仅是表面意义

Yiqun Sun, Qiang Huang, Anthony K. H. Tung, Jun Yu

AI总结 本文主张文本嵌入研究应从表面意义转向隐含语义,通过试点研究揭示现有模型在隐含语义任务上的局限,并提出范式转变以优先发展语言学基础训练数据、深层语义基准和核心建模目标。

Comments To appear in ICML 2026

详情
AI中文摘要

这篇立场论文主张,文本嵌入研究应超越表面意义,将隐含语义作为核心建模目标。文本嵌入是现代自然语言处理的基础组件,支撑着广泛的应用并推动持续的研究进展。尽管进展迅速,大多数嵌入模型仍局限于表面层次的语义,而语言学理论强调人类意义的大部分是隐含的,由语用学、说话者意图和社会文化语境塑造。当前模型通常在缺乏此类深度的数据集上训练,并使用奖励表面相似性的基准进行评估。因此,它们在需要解释性推理、立场识别或社会性理解的任务中表现不佳。我们的试点研究明确揭示了这一局限性,表明即使在探测隐含语义的任务上,最先进的嵌入相比简单的词汇基线也仅取得边际改进。因此,我们呼吁范式转变:嵌入研究应优先考虑具有语言学基础且多样化的训练数据,开发探测更深层语义理解的基准,并将隐含意义作为核心建模目标,以更好地使嵌入与现实世界的语言复杂性对齐。代码可在 http://github.com/dukesun99/Implicit-Embeddings 获取。

英文摘要

This position paper argues that text embedding research should move beyond surface meaning and embrace implicit semantics as a central modeling objective. Text embeddings are a foundational component of modern NLP, underpinning a wide range of applications and driving sustained research progress. Despite rapid progress, most embedding models remain narrowly focused on surface-level semantics, whereas linguistic theory emphasizes that much of human meaning is implicit, shaped by pragmatics, speaker intent, and sociocultural context. Current models are typically trained on datasets that lack such depth and evaluated using benchmarks that reward surface similarity. As a result, they struggle with tasks that require interpretive reasoning, stance recognition, or socially grounded understanding. Our pilot study makes this limitation explicit, showing that even state-of-the-art embeddings achieve only marginal improvements over simple lexical baselines on tasks probing implicit semantics. We therefore call for a paradigm shift: embedding research should prioritize linguistically grounded and diverse training data, develop benchmarks that probe deeper semantic understanding, and treat implicit meaning as a core modeling objective to better align embeddings with real-world language complexity. The code is available at http://github.com/dukesun99/Implicit-Embeddings.

2506.06254 2026-05-29 cs.AI cs.CL cs.LG

PersonaAgent: Bridging Memory and Action for Personalized LLM Agents

PersonaAgent:弥合个性化LLM智能体的记忆与行动

Weizhi Zhang, Xinyang Zhang, Chenwei Zhang, Liangwei Yang, Jingbo Shang, Zhepei Wei, Henry Peng Zou, Zijie Huang, Zhengyang Wang, Yifan Gao, Xiaoman Pan, Lian Xiong, Jingguo Liu, Philip S. Yu, Xian Li

AI总结 提出PersonaAgent框架,通过整合个性化记忆模块(情景与语义记忆)和行动模块,并利用角色提示作为中介实现记忆与行动的协同,以解决LLM智能体的个性化任务。

Comments Accepted in ACL 2026

详情
AI中文摘要

由大型语言模型驱动的智能体近期作为先进范式出现,在广泛领域和任务中展现出令人印象深刻的能力。尽管潜力巨大,当前LLM智能体常采用一刀切方法,缺乏响应用户不同需求和偏好的灵活性。这一局限促使我们开发PersonaAgent——首个旨在处理多样化个性化任务的个性化LLM智能体框架。具体而言,PersonaAgent整合了两个互补组件:一个包含情景记忆和语义记忆机制的个性化记忆模块;一个使智能体能够执行针对用户定制的工具行动的个性化行动模块。核心在于,角色(定义为每位用户独特的系统提示)充当中间件:它利用来自个性化记忆的洞察来控制智能体行动,而这些行动的结果反过来又优化记忆。基于该框架,我们提出一种测试时用户偏好对齐策略,该策略模拟最近的n次交互以优化角色提示,通过模拟响应与真实响应之间的文本损失反馈确保实时用户偏好对齐。实验评估表明,PersonaAgent不仅有效个性化行动空间,还能在测试时实际应用中扩展,显著优于其他基线方法。这些结果证明了我们的方法在提供定制化、动态用户体验方面的可行性和潜力。

英文摘要

Large Language Model (LLM) empowered agents have recently emerged as advanced paradigms that exhibit impressive capabilities in a wide range of domains and tasks. Despite their potential, current LLM agents often adopt a one-size-fits-all approach, lacking the flexibility to respond to users' varying needs and preferences. This limitation motivates us to develop PersonaAgent, the first personalized LLM agent framework designed to address versatile personalization tasks. Specifically, PersonaAgent integrates two complementary components - a personalized memory module that includes episodic and semantic memory mechanisms; a personalized action module that enables the agent to perform tool actions tailored to the user. At the core, the persona (defined as unique system prompt for each user) functions as an intermediary: it leverages insights from personalized memory to control agent actions, while the outcomes of these actions in turn refine the memory. Based on the framework, we propose a test-time user-preference alignment strategy that simulate the latest n interactions to optimize the persona prompt, ensuring real-time user preference alignment through textual loss feedback between simulated and ground-truth responses. Experimental evaluations demonstrate that PersonaAgent significantly outperforms other baseline methods by not only personalizing the action space effectively but also scaling during test-time real-world applications. These results underscore the feasibility and potential of our approach in delivering tailored, dynamic user experiences.

2506.06095 2026-05-29 cs.LG

Accelerating Sparse Transformer Inference on GPU

加速GPU上的稀疏Transformer推理

Wenhao Dai, Haodong Deng, Mengfei Rong, Xinyu Yang, Hongyu Liu, Fangxin Liu, Hailong Yang, Qianwen Cao, Qingxiao Sun

AI总结 针对稀疏Transformer推理加速问题,提出STOF框架,通过分析建模将多头注意力映射为行式或块式核并采用独特存储格式,结合两阶段搜索的算子融合方案,在GPU上实现高达1.6倍的多头注意力计算加速和1.4倍的端到端推理加速。

详情
AI中文摘要

大型语言模型(LLMs)因其强大的理解能力在全球广受欢迎。作为LLMs的核心组件,通过并行化加速Transformer逐渐成为研究热点。掩码层向Transformer引入稀疏性以减少计算量。然而,以往的工作很少关注稀疏Transformer的性能优化。此外,当前的静态算子融合方案无法适应多样化的应用场景。为解决上述问题,我们提出STOF,一个针对稀疏Transformer的优化框架,能够在GPU上实现灵活的掩码和算子融合。对于多头注意力(MHA)结构,STOF根据分析建模将计算映射为具有独特存储格式的行式或块式核。对于下游算子,STOF将融合方案映射到编译模板,并通过两阶段搜索确定最优运行配置。实验结果表明,与最先进的工作相比,STOF在MHA计算中实现了最高1.6倍的加速,在端到端推理中实现了最高1.4倍的加速。

英文摘要

Large language models (LLMs) are popular around the world due to their powerful understanding capabilities. As the core component of LLMs, accelerating Transformer through parallelization has gradually become a hot research topic. Mask layers introduce sparsity into Transformer to reduce calculations. However, previous works rarely focus on the performance optimization of sparse Transformer. In addition, current static operator fusion schemes fail to adapt to diverse application scenarios. To address the above problems, we propose STOF, a framework that incorporates optimizations for Sparse Transformer that enables flexible masking and Operator Fusion on GPU. For multi-head attention (MHA) structure, STOF maps the computation to row-wise or blockwise kernels with unique storage formats according to analytical modeling. For downstream operators, STOF maps the fusion scheme to compilation templates and determines the optimal running configuration through two-stage searching. The experimental results show that compared to the stateof-the-art work, STOF achieves maximum speedups of 1.6x in MHA computation and 1.4x in end-to-end inference.

2506.05985 2026-05-29 cs.LG cs.RO

Dynamic Mixture of Progressive Parameter-Efficient Expert Library for Lifelong Robot Learning

动态渐进式参数高效专家库混合用于终身机器人学习

Yuheng Lei, Sitong Mao, Shunbo Zhou, Hongyuan Zhang, Xuelong Li, Ping Luo

AI总结 针对终身学习中任务标识不可用和知识隔离问题,提出动态渐进式参数高效专家库混合(DMPEL),通过构建低秩专家库和轻量路由器实现灵活的前向迁移,并引入专家系数回放缓解遗忘,在LIBERO基准上以最少可训练参数和存储超越现有方法。

Comments Accepted to Transactions on Machine Learning Research (TMLR) at https://openreview.net/forum?id=MHVBrjS8cG . Code is available at https://github.com/HarryLui98/DMPEL

详情
AI中文摘要

一个通用智能体必须在其生命周期中持续学习和适应,实现高效的前向迁移,同时最小化灾难性遗忘。先前在主导的预训练-微调范式中的工作探索了用于单任务适应的参数高效微调,通过少量参数有效引导冻结的预训练模型。然而,在终身学习背景下,这些方法依赖于测试时任务标识符这一不切实际的假设,并限制了孤立适配器之间的知识共享。为解决这些限制,我们提出了用于终身机器人学习的动态渐进式参数高效专家库混合(DMPEL)。DMPEL逐步构建一个低秩专家库,并采用轻量路由器将专家动态组合成端到端策略,从而实现灵活高效的终身前向迁移。此外,通过利用微调参数的模块化结构,我们引入了专家系数回放,引导路由器准确检索先前遇到任务的冻结专家。该技术缓解了遗忘,同时相比对整个策略进行经验回放,显著节省存储和计算。在终身机器人学习基准LIBERO上的大量实验表明,我们的框架在持续适应过程中的成功率上优于最先进的终身学习方法,同时使用了最少的可训练参数和存储。

英文摘要

A generalist agent must continuously learn and adapt throughout its lifetime, achieving efficient forward transfer while minimizing catastrophic forgetting. Previous work within the dominant pretrain-then-finetune paradigm has explored parameter-efficient fine-tuning for single-task adaptation, effectively steering a frozen pretrained model with a small number of parameters. However, in the context of lifelong learning, these methods rely on the impractical assumption of a test-time task identifier and restrict knowledge sharing among isolated adapters. To address these limitations, we propose Dynamic Mixture of Progressive Parameter-Efficient Expert Library (DMPEL) for lifelong robot learning. DMPEL progressively builds a low-rank expert library and employs a lightweight router to dynamically combine experts into an end-to-end policy, enabling flexible and efficient lifelong forward transfer. Furthermore, by leveraging the modular structure of the fine-tuned parameters, we introduce expert coefficient replay, which guides the router to accurately retrieve frozen experts for previously encountered tasks. This technique mitigates forgetting while being significantly more storage- and computation-efficient than experience replay over the entire policy. Extensive experiments on the lifelong robot learning benchmark LIBERO demonstrate that our framework outperforms state-of-the-art lifelong learning methods in success rates during continual adaptation, while utilizing minimal trainable parameters and storage.

2505.21996 2026-05-29 cs.CV cs.AI

VRAG: Learning World Models for Interactive Video Generation

VRAG:面向交互式视频生成的世界模型学习

Taiye Chen, Xun Hu, Zihan Ding, Chi Jin

AI总结 针对自回归视频生成中累积误差和记忆机制不足的问题,提出视频检索增强生成(VRAG)方法,通过显式全局状态条件降低长期累积误差并提升时空一致性。

Comments Published at NeurIPS 2025. Project page: https://sites.google.com/view/vrag

详情
AI中文摘要

基础世界模型必须既具有交互性,又能保持时空连贯性,以便通过动作选择进行有效的未来规划。然而,当前的长时间视频生成模型由于两个主要挑战而具有有限的内在世界建模能力:累积误差和记忆机制不足。我们通过额外的动作条件和自回归框架增强了图像到视频模型的交互能力,并揭示了在自回归视频生成中累积误差本质上是不可约的,而记忆机制不足则导致世界模型的不连贯。我们提出了带有显式全局状态条件的视频检索增强生成(VRAG),它显著减少了长期累积误差并提高了世界模型的时空一致性。相比之下,具有扩展上下文窗口的朴素自回归生成和检索增强生成在视频生成中被证明效果较差,这主要是由于当前视频模型有限的上下文学习能力。我们的工作阐明了视频世界模型中的基本挑战,并为改进具有内在世界建模能力的视频生成模型建立了全面的基准。

英文摘要

Foundational world models must be both interactive and preserve spatiotemporal coherence for effective future planning with action choices. However, present models for long video generation have limited inherent world modeling capabilities due to two main challenges: compounding errors and insufficient memory mechanisms. We enhance image-to-video models with interactive capabilities through additional action conditioning and autoregressive framework, and reveal that compounding error is inherently irreducible in autoregressive video generation, while insufficient memory mechanism leads to incoherence of world models. We propose video retrieval augmented generation (VRAG) with explicit global state conditioning, which significantly reduces long-term compounding errors and increases spatiotemporal consistency of world models. In contrast, naive autoregressive generation with extended context windows and retrieval-augmented generation prove less effective for video generation, primarily due to the limited in-context learning capabilities of current video models. Our work illuminates the fundamental challenges in video world models and establishes a comprehensive benchmark for improving video generation models with internal world modeling capabilities.

2505.21876 2026-05-29 cs.CV cs.AI

EPiC: Efficient Video Camera Control Learning with Precise Anchor-Video Guidance

EPiC: 基于精确锚点视频引导的高效视频摄像机控制学习

Zun Wang, Jaemin Cho, Jialu Li, Han Lin, Jaehong Yoon, Yue Zhang, Mohit Bansal

AI总结 提出EPiC框架,通过基于首帧可见性掩码构建精确对齐的锚点视频,并引入轻量模块Anchor-ControlNet,以极低参数实现高效、精确的3D摄像机控制,在RealEstate10K和MiraData上达到最先进性能。

Comments Accepted to ICML 2026. Project website: https://zunwang1.github.io/Epic

详情
AI中文摘要

近期带摄像机控制的视频生成方法通常通过从估计的点云沿摄像机轨迹渲染,创建锚点视频(即近似所需摄像机运动的渲染视频),以作为结构化先验引导扩散模型。然而,点云和摄像机轨迹估计中的误差常导致不准确的锚点视频,并带来更高的训练成本和低效率,因为模型被迫补偿渲染错位。为解决这些局限,我们提出EPiC,一种高效且精确的摄像机控制学习框架,无需摄像机姿态或点云估计即可构建良好对齐的训练锚点视频。具体而言,我们通过基于首帧可见性掩码掩蔽源视频来创建高精度锚点视频,这确保了强对齐,消除了对摄像机/点云估计的需求,因此可轻松应用于任意野外视频。此外,我们引入Anchor-ControlNet,一种轻量模块,将可见区域中的锚点视频引导集成到预训练视频扩散模型中,仅增加不到1%的额外参数。EPiC以显著更少的参数、训练步骤和数据实现高效训练,并在测试时对使用点云制作的锚点视频具有鲁棒泛化能力,从而实现精确的3D感知摄像机控制。EPiC在RealEstate10K和MiraData上的I2V摄像机控制任务中达到最先进性能。值得注意的是,EPiC还展现出对视频到视频(V2V)场景的强零样本泛化能力。

英文摘要

Recent approaches for video generation with camera control often create anchor videos (i.e., rendered videos that approximate desired camera motions) to guide diffusion models as a structured prior, by rendering from estimated point clouds following camera trajectories. However, errors in point cloud and camera trajectory estimation often lead to inaccurate anchor videos with higher training cost and low efficiency, as the model is forced to compensate for rendering misalignments. To address these limitations, we introduce EPiC, an efficient and precise camera control learning framework that constructs well-aligned training anchor videos without the need for camera pose or point cloud estimation. Concretely, we create highly precise anchor videos by masking source videos based on first-frame visibility, which ensures strong alignment, eliminates the need for camera/point cloud estimation, and thus can be readily applied to any in-the-wild video. Furthermore, we introduce Anchor-ControlNet, a lightweight module that integrates anchor video guidance in visible regions to pretrained video diffusion models, with less than 1% of additional parameters. EPiC achieves efficient training with substantially fewer parameters, training steps, and less data, and generalizes robustly to anchor videos made with point clouds at test time, enabling precise 3D-informed camera control. EPiC achieves SoTA performance on RealEstate10K and MiraData for I2V camera control task. Notably, EPiC also exhibits strong zero-shot generalization to video-to-video (V2V) scenarios.

2505.20634 2026-05-29 cs.LG stat.ML

Explaining Concept Shift with Interpretable Feature Attribution

用可解释的特征归因解释概念漂移

Ruiqi Lyu, Alistair Turcan, Bryan Wilder

AI总结 提出SGShift方法,通过将概念漂移建模为特征选择任务,利用广义加性模型、敲除和吸收等统计工具识别导致源域与目标域模型性能差异的稀疏漂移特征。

详情
AI中文摘要

当特征条件标签分布在域间发生变化时,就会发生概念漂移,这可能导致即使调优良好的机器学习模型在新域上校准失效。识别这些漂移特征可以独特地揭示域间特征-标签关系如何不同,考虑到这种差异可能跨越科学相关的维度(如时间、疾病状态、人群等)。在本文中,我们提出SGShift,一种将表格数据中概念漂移导致的性能下降归因于稀疏漂移特征集的方法。我们将概念漂移框架化为特征选择任务,以学习能够解释源域和目标域模型间性能差异的特征。该框架使SGShift能够适应强大的统计工具,如广义加性模型、敲除和吸收,以识别这些漂移特征。我们在各种机器学习模型的合成数据和真实数据上进行了广泛实验,发现SGShift比基线方法更准确地识别漂移特征,在漂移域中所需样本少,并且对复杂的概念漂移情况具有鲁棒性。

英文摘要

Concept shift occurs when the distribution of labels conditioned on the features changes between domains, which can make even a well-tuned ML model miscalibrated on a new domain. Identifying these shifted features provides unique insight into how feature-label relationships differ between domains, considering the difference may be across a scientifically relevant dimension, such as time, disease status, population, etc. In this paper, we propose SGShift, a method for attributing performance degradation under concept shift in tabular data to a sparse set of shifted features. We frame concept shift as a feature selection task to learn the features that can explain performance differences between models in the source and target domain. This framework enables SGShift to adapt powerful statistical tools such as generalized additive models, knockoffs, and absorption towards identifying these shifted features. We conduct extensive experiments in synthetic and real data across various ML models and find SGShift can identify shifted features much more accurately than baseline methods, requires few samples in the shifted domain, and is robust to complex cases of concept shift.

2505.18744 2026-05-29 cs.CL

LogicCat: A Chain-of-Thought Text-to-SQL Benchmark for Complex Reasoning

LogicCat:面向复杂推理的思维链文本到SQL基准测试

Tao Liu, Xutao Mao, Hongying Zan, Dixuan Zhang, Yifan Li, Haixin Liu, Lulu Kong, Jiaming Hou, Rui Li, YunLong Li, aoze zheng, Zhiqiang Zhang, Luo Zhewei, Kunli Zhang, Min Peng

AI总结 提出首个针对复杂推理和思维链解析的Text-to-SQL基准数据集LogicCat,涵盖物理、算术、常识和假设推理场景,通过4038个问题与12114条思维链步骤显著提升任务难度,现有模型执行准确率最高仅33.20%。

Comments 9 pages, 5 figures

详情
Journal ref
Proceedings of the AAAI Conference on Artificial Intelligence, 40(36): 29958-29966, 2026
AI中文摘要

文本到SQL是自然语言处理中的关键任务,旨在将自然语言问题转化为准确且可执行的SQL查询。在现实场景中,这些推理任务通常伴随复杂的数学计算、领域知识和假设推理场景。然而,现有大规模文本到SQL数据集通常聚焦于业务逻辑和任务逻辑,忽略了垂直领域知识、复杂数学推理和假设推理等关键因素,而这些因素对于真实反映实际应用中的推理需求并完成数据查询与分析至关重要。为弥补这一空白,我们引入了LogicCat,这是首个专门为复杂推理和思维链解析设计的文本到SQL基准数据集,涵盖物理、算术、常识和假设推理场景。LogicCat包含4038个英文问题,配有12114条详细的思维链推理步骤,跨越45个不同领域的数据库,在复杂性上显著超越现有数据集。实验结果表明,LogicCat将当前最先进模型的任务难度大幅提升至最高33.20%的执行准确率,表明该任务仍然极具挑战性。LogicCat的进步代表了向开发适用于真实企业数据分析和自主查询生成的系统迈出的关键一步。我们已在https://github.com/Ffunkytao/LogicCat发布了数据集代码。

英文摘要

Text-to-SQL is a critical task in natural language processing that aims to transform natural language questions into accurate and executable SQL queries. In real-world scenarios, these reasoning tasks are often accompanied by complex mathematical computations, domain knowledge, and hypothetical reasoning scenarios. However, existing large-scale Text-to-SQL datasets typically focus on business logic and task logic, neglecting critical factors such as vertical domain knowledge, complex mathematical reasoning, and hypothetical reasoning, which are essential for realistically reflecting the reasoning demands in practical applications and completing data querying and analysis. To bridge this gap, we introduce LogicCat, the first Text-to-SQL benchmark dataset specifically designed for complex reasoning and chain-of-thought parsing, encompassing physics, arithmetic, commonsense, and hypothetical reasoning scenarios. LogicCat comprises 4,038 English questions paired 12,114 detailed chain-of-thought reasoning steps, spanning 45 databases across diverse domains, significantly surpassing existing datasets in complexity. Experimental results demonstrate that LogicCat substantially increases the task difficulty for current state-of-the-art models to at most 33.20% execution accuracy, indicating that this task remains exceptionally challenging. The advancement of LogicCat represents a crucial step toward developing systems suitable for real-world enterprise data analysis and autonomous query generation. We have released our dataset code at https://github.com/Ffunkytao/LogicCat.

2505.16178 2026-05-29 cs.CL

Understanding Fact Recall in Language Models: Why Two-Stage Training Encourages Memorization but Mixed Training Teaches Knowledge

理解语言模型中的事实回忆:为什么两阶段训练鼓励记忆而混合训练教授知识

Ying Zhang, Benjamin Heinzerling, Dongyuan Li, Kentaro Inui

AI总结 通过比较2.8~4B语言模型中的两阶段训练与混合训练,发现混合训练通过联合优化目标实现存储与查询格式间的梯度一致性,驱动表征一致性并建立格式不变的检索过程,从而泛化回忆未见查询中的事实。

详情
AI中文摘要

虽然微调是将事实知识注入大型语言模型(LLM)的标准方法,但通过未见查询实现可靠事实回忆的机制仍鲜为人知。常见的两阶段训练策略依次对事实存储和查询格式进行训练,往往导致死记硬背。相比之下,混合训练联合优化两种格式,展现出更优的泛化回忆能力。我们通过比较2.8∼4B LLM中的两种范式来研究这一成功机制,并识别出核心机制:混合训练中的联合优化目标诱导了存储格式与查询格式之间的梯度一致性。这进而驱动两种格式之间的表征一致性,建立了一个格式不变的检索过程,将未见查询映射到存储的事实。相反,两阶段训练中缺乏这种目标导致表征不一致和回忆失败。这种一致性进一步定位于由两种格式共同更新的参数,在混合训练下该参数集远大于两阶段训练。在输入层面,一致性留下了可解释的特征:混合训练从主语-关系标记(查询中可用的相同成分)以存储格式编码事实,而两阶段训练则依赖完整上下文。我们的发现刻画了事实回忆的机制,并为优化LLM中的知识注入提供了机理基础。

英文摘要

While fine-tuning is the standard for injecting factual knowledge into large language models (LLMs), the mechanisms enabling reliable fact recall via unseen queries remain poorly understood. Common two-stage training strategies, which sequentially train on fact storage and query formats, often cause rote memorization. In contrast, mixed training jointly optimizes both formats and exhibits superior generalized recall. We investigate this success by comparing the two paradigms across 2.8$\sim$4B LLMs and identify the core mechanism: the joint optimization objective in mixed training induces gradient consistency across storage and query formats. This in turn drives the representation consistency between the two formats, establishing a format-invariant retrieval process that maps unseen queries to stored facts. In contrast, the lack of such an objective in two-stage training results in inconsistent representations and failed recall. The consistency further localizes to the parameters updated by both formats, a set that is substantially larger under mixed training than under two-stage training. At the input level, the consistency leaves an interpretable signature: mixed training encodes facts in storage format from subject-relation tokens, the same components available in queries, while two-stage training relies on the full context. Our findings characterize the mechanisms of fact recall and offer mechanistic foundation for optimizing knowledge injection in LLMs.

2505.10975 2026-05-29 cs.CL cs.AI cs.SD eess.AS

Survey of End-to-End Multi-Speaker Automatic Speech Recognition for Monaural Audio

单声道音频的端到端多说话人自动语音识别综述

Xinlu He, Jacob Whitehill

AI总结 本文系统综述了端到端多说话人自动语音识别的神经架构范式(SIMO与SISO)、近期改进方法及长语音扩展策略,并通过标准基准评估比较了各类方法。

Comments Accepted for publication in Computer Speech & Language (CSL)

详情
AI中文摘要

单声道多说话人自动语音识别(ASR)由于数据稀缺以及识别并将词语归因于单个说话人的内在困难(尤其是在重叠语音中)仍然具有挑战性。最近的进展推动了从级联系统向端到端(E2E)架构的转变,这减少了错误传播并更好地利用了语音内容与说话人身份之间的协同作用。尽管端到端多说话人ASR取得了快速进展,但该领域缺乏对近期发展的全面综述。本综述为多说话人ASR的端到端神经方法提供了一个系统的分类法,突出了近期进展和比较分析。具体而言,我们分析了:(1)用于预分割音频的架构范式(SIMO与SISO),分析了它们的不同特征和权衡;(2)基于这两种范式的近期架构和算法改进;(3)对长语音的扩展,包括分割策略和说话人一致性的假设拼接。此外,我们(4)在标准基准上评估和比较了各种方法。最后,我们讨论了构建鲁棒且可扩展的多说话人ASR所面临的开放挑战和未来研究方向。

英文摘要

Monaural multi-speaker automatic speech recognition (ASR) remains challenging due to data scarcity and the intrinsic difficulty of recognizing and attributing words to individual speakers, particularly in overlapping speech. Recent advances have driven the shift from cascade systems to end-to-end (E2E) architectures, which reduce error propagation and better exploit the synergy between speech content and speaker identity. Despite rapid progress in E2E multi-speaker ASR, the field lacks a comprehensive review of recent developments. This survey provides a systematic taxonomy of E2E neural approaches for multi-speaker ASR, highlighting recent advances and comparative analysis. Specifically, we analyze: (1) architectural paradigms (SIMO vs.~SISO) for pre-segmented audio, analyzing their distinct characteristics and trade-offs; (2) recent architectural and algorithmic improvements based on these two paradigms; (3) extensions to long-form speech, including segmentation strategy and speaker-consistent hypothesis stitching. Further, we (4) evaluate and compare methods across standard benchmarks. We conclude with a discussion of open challenges and future research directions towards building robust and scalable multi-speaker ASR.

2505.05968 2026-05-29 cs.LG cs.MA

Offline Multi-agent Reinforcement Learning via Sequential Score Decomposition

离线多智能体强化学习通过序列得分分解

Dan Qiao, Wenhao Li, Shanchao Yang, Hongyuan Zha, Baoxiang Wang

AI总结 针对离线合作多智能体强化学习中联合动作空间高维和异质行为数据导致的策略分布偏移问题,提出序列得分函数分解方法,利用扩散模型从多模态离线数据中学习每个智能体的正则化信号,指导策略更新至高分、分布内区域,在多个粒子环境和多智能体MuJoCo基准上实现最先进性能。

Comments ICML 2026 Accepted

详情
Journal ref
Forty-Third International Conference on Machine Learning, 2026
AI中文摘要

离线合作多智能体强化学习(MARL)因分布偏移面临独特挑战,尤其源于联合动作空间的高维性和分布外联合动作选择的存在。在这项工作中,我们强调离线MARL的一个基本挑战来自合作任务的多均衡性质,这诱导了高度多模态的联合行为策略空间与异质质量行为数据的耦合。这使得个体策略正则化难以与一致的协调模式对齐,导致策略分布偏移问题。为应对这一挑战,我们设计了一种序列得分函数分解方法,从联合行为策略中提炼每个智能体的正则化信号,在分散执行约束下诱导协调模态选择。然后我们利用灵活的基于扩散的生成模型从多模态离线数据中学习这些得分函数,并将其集成到联合动作评论家中,以在共享团队奖励下引导策略更新朝向高分、分布内区域。我们的方法在多个粒子环境和多智能体MuJoCo基准上一致实现了最先进性能。据我们所知,这是首个明确解决离线与在线MARL之间分布差距的工作,为更可泛化的基于离线策略的MARL方法铺平了道路。

英文摘要

Offline cooperative multi-agent reinforcement learning (MARL) faces unique challenges due to distributional shifts, particularly stemming from the high dimensionality of joint action spaces and the presence of out-of-distribution joint action selections. In this work, we highlight that a fundamental challenge in offline MARL arises from the multi-equilibrium nature of cooperative tasks, which induces a highly multimodal joint behavior policy space coupled with heterogeneous-quality behavior data. This makes it difficult for individual policy regularization to align with a consistent coordination pattern, leading to the policy distribution shift problems. To tackle this challenge, we design a sequential score function decomposition method that distills per-agent regularization signals from the joint behavior policy, which induces coordinated modality selection under decentralized execution constraints. Then we leverage a flexible diffusion-based generative model to learn these score functions from multimodal offline data, and integrate them into joint-action critics to guide policy updates toward high-reward, in-distribution regions under a shared team reward. Our approach achieves state-of-the-art performance across multiple particle environments and Multi-agent MuJoCo benchmarks consistently. To the best of our knowledge, this is the first work to explicitly address the distributional gap between offline and online MARL, paving the way for more generalizable offline policy-based MARL methods.

2505.02743 2026-05-29 cs.LG stat.ML

Cooperative Variance Estimation and Bayesian Neural Networks for Disentangling Aleatoric and Epistemic Uncertainties

合作方差估计与贝叶斯神经网络用于分离偶然不确定性和认知不确定性

Jiaxiang Yi, Miguel A. Bessa

AI总结 提出通过合作训练方差估计网络与贝叶斯神经网络,实现偶然不确定性与认知不确定性的分离,并提升均值估计性能。

Comments 38 pages, 26 figures

详情
AI中文摘要

真实世界的数据包含偶然不确定性——由不完美的测量或对数据生成过程的不完全了解引起的不可约噪声。均值-方差估计网络可以学习这种类型的不确定性,但需要即兴的正则化策略以避免过拟合,并且无法预测认知不确定性(模型不确定性)。相反,贝叶斯神经网络可以预测认知不确定性,但由于贝叶斯推断的近似性质,它们以难以训练而著称。我们提出合作训练一个方差估计网络与一个贝叶斯神经网络,并通过实验证明,所得模型在改善均值估计的同时分离了偶然不确定性和认知不确定性。我们展示了该方法在多种数据集上的有效性和可扩展性,包括我们创建的一个时间依赖异方差回归数据集,其中偶然不确定性是已知的。所提出的方法易于实现、鲁棒,并且适用于各种模型架构。

英文摘要

Real-world data contains aleatoric uncertainty - irreducible noise arising from imperfect measurements or from incomplete knowledge about the data generation process. Mean-variance estimation networks can learn this type of uncertainty but require ad-hoc regularization strategies to avoid overfitting and are unable to predict epistemic uncertainty (model uncertainty). Conversely, Bayesian neural networks predict epistemic uncertainty but are notoriously difficult to train due to the approximate nature of Bayesian inference. We propose to cooperatively train a variance estimation network with a Bayesian neural network and empirically demonstrate that the resulting model disentangles aleatoric and epistemic uncertainties while improving the mean estimation. We demonstrate the effectiveness and scalability of this method across a diverse range of datasets, including a time-dependent heteroscedastic regression dataset we created where the aleatoric uncertainty is known. The proposed method is straightforward to implement, robust, and adaptable to various model architectures.