arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.08935 2026-06-03 cs.AI cs.LG

PnP-Corrector: A Universal Correction Framework for Coupled Spatiotemporal Forecasting

Hao Wu, Fan Xu, Yuxu Lu, Penghao Zhao, Fan Zhang, Hao Jia, Yuxuan Liang, Ruijian Gou, Qingsong Wen, Xian Wu, Xiaomeng Huang, Yuan Gao

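Affiliations: University of Science and Technology of China; Tsinghua University

AI summary: To address the mutual amplification of errors in coupled systems, which causes long-range forecasts to collapse, the paper proposes PnP-Corrector, a plug-and-play correction framework that freezes the physics simulation engines and trains a correction agent to actively counteract systematic biases, substantially improving the stability and accuracy of long-range forecasts.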

Abstract

Coupled spatiotemporal forecasting is important for predicting the future evolution of multiple interacting dynamical systems, such as in climate models. However, existing methods are severely constrained by the persistent bottleneck of compounding errors. In coupled systems, errors from each subsystem simulator propagate and amplify one another, a phenomenon we term Reciprocal Error Amplification, leading to a rapid collapse of long-range predictions. To address this challenge, we propose a universal framework called PnP-Corrector (Plug-and-Play Corrector). The core idea of our framework is to decouple the physical simulation from the error correction process: it freezes pre-trained physics simulation engines and exclusively trains a correction agent to proactively counteract the systematic biases emerging from the coupled system. Furthermore, we design an efficient predictive model architecture, DSLCast, to serve as the backbone of this framework. Extensive experiments demonstrate that our method significantly enhances the long-term stability and accuracy of coupled forecasting systems. For instance, in the challenging task of a 300-day global ocean-atmosphere coupled forecast, our PnP-Corrector framework reduces the prediction error of the baseline model by 28% and surpasses state-of-the-art models on several key metrics.
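
The decoupling idea in the abstract (frozen simulators, a separately trained corrector) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the two-variable linear dynamics, the fixed simulator bias, and the constant-offset corrector are hypothetical stand-ins, not the paper's engines or its DSLCast backbone.

```python
from typing import List, Tuple

State = Tuple[float, float]  # (ocean-like variable, atmosphere-like variable)


def true_step(s: State) -> State:
    """Hypothetical ground-truth coupled dynamics (each variable feeds the other)."""
    x, y = s
    return (0.9 * x + 0.1 * y, 0.1 * x + 0.9 * y)


def frozen_simulator(s: State) -> State:
    """Stand-in for the frozen pre-trained engines: the true step plus a fixed bias."""
    x, y = true_step(s)
    return (x + 0.05, y - 0.03)


def fit_corrector(states: List[State]) -> State:
    """Fit the only trainable part: a constant offset equal to the mean one-step residual."""
    rx = ry = 0.0
    for s in states:
        tx, ty = true_step(s)
        px, py = frozen_simulator(s)
        rx += tx - px
        ry += ty - py
    return (rx / len(states), ry / len(states))


def rollout(s: State, steps: int, correction: State = (0.0, 0.0)) -> State:
    """Autoregressive rollout; the simulator is never updated, only its output is corrected."""
    for _ in range(steps):
        px, py = frozen_simulator(s)
        s = (px + correction[0], py + correction[1])
    return s


def true_rollout(s: State, steps: int) -> State:
    for _ in range(steps):
        s = true_step(s)
    return s


def error(a: State, b: State) -> float:
    """Sum of absolute errors over both variables."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


start = (1.0, 0.0)
truth = true_rollout(start, 300)
corrector = fit_corrector([(1.0, 0.0), (0.5, 0.5), (0.0, 1.0)])
print("uncorrected error:", error(rollout(start, 300), truth))
print("corrected error:  ", error(rollout(start, 300, corrector), truth))
```

In this toy setting the uncorrected bias accumulates step after step through the coupling, while the corrected rollout stays close to the truth. A real corrector would presumably be a learned, state-dependent model rather than a constant offset.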

2605.11954 2026-06-03 cs.AI

Assessing and Mitigating Miscalibration in LLM-Based Social Science Measurement

Jinyuan Wang, Ningyuan Deng, Yi Yang

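Affiliations: The Hong Kong University of Science and Technology

AI summary: Studies the calibration of LLMs used for social science measurement and proposes a soft-label distillation method that trains a small classifier, reducing ECE by 43.2% and the Brier score by 34.0%.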

Abstract

Large language models (LLMs) are increasingly used in social science as scalable measurement tools for converting unstructured text into variables that can enter standard empirical designs. Measurement validity demands more than high average accuracy: it also requires well-calibrated confidence that faithfully reflects the empirical probability of each measurement being correct. This paper studies model miscalibration in LLM-based social science measurement. We begin with a case study on FOMC and show that confidence-based filtering can change downstream regression estimates when LLM confidence is miscalibrated. We then audit calibration across 14 social science constructs, covering both proprietary models (including GPT-5-mini and DeepSeek-V3.2) and open-source models. Across tasks and model families, reported confidence is poorly aligned with tolerance-based correctness. As a simple mitigation, we propose a soft-label distillation pipeline for calibrating BERT with an LLM. The method converts an LLM score and its verbalized confidence into a soft target distribution, then trains a smaller discriminative classifier on encoder models for these targets. Averaged across datasets, this approach reduces ECE by 43.2% and the Brier score by 34.0%. These results suggest that LLM-based social science pipelines should treat calibration as part of measurement validity, rather than as an optional post-processing concern.
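
For readers unfamiliar with the two metrics reported above, the following is a minimal sketch of how expected calibration error (ECE) and the Brier score are commonly computed for binary correctness. The equal-width binning and the bin count are assumptions, since the abstract does not state the paper's exact settings, and the 0/1 correctness labels are assumed to have been derived beforehand (for example, by the tolerance-based criterion the abstract mentions).

```python
from typing import List, Sequence, Tuple


def brier_score(confidences: Sequence[float], correct: Sequence[int]) -> float:
    """Mean squared gap between stated confidence and the 0/1 correctness outcome."""
    return sum((c - y) ** 2 for c, y in zip(confidences, correct)) / len(confidences)


def expected_calibration_error(
    confidences: Sequence[float], correct: Sequence[int], n_bins: int = 10
) -> float:
    """Weighted average of |mean confidence - accuracy| over equal-width confidence bins."""
    bins: List[List[Tuple[float, int]]] = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 falls in the last bin
        bins[idx].append((c, y))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue  # empty bins contribute nothing
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(y for _, y in b) / len(b)
        ece += (len(b) / n) * abs(avg_conf - accuracy)
    return ece


# An overconfident model: it always reports 0.9 but is right only half the time.
conf = [0.9, 0.9, 0.9, 0.9]
hits = [1, 1, 0, 0]
print("ECE:  ", expected_calibration_error(conf, hits))
print("Brier:", brier_score(conf, hits))
```

Lower is better for both metrics. ECE measures only the gap between confidence and accuracy, whereas the Brier score also penalizes confidence values that fail to separate correct from incorrect cases.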

2605.11170 2026-06-03 cs.LG cs.CR

Unlearning with Asymmetric Sources: Improved Unlearning-Utility Trade-off with Public Data

Ahmed Mehdi Inane, Vincent Quirion, Gintare Karolina Dziugaite, Ioannis Mitliagkas

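Affiliations: University of Waterloo

AI summary: Proposes the Asymmetric Langevin Unlearning (ALU) framework, which injects public data to reduce the noise cost of certified unlearning by a factor of O(1/n_pub^2) while maintaining high utility under distribution shift.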

Abstract

Noise-based certified machine unlearning currently faces a hard ceiling: the noise magnitude required to certify unlearning typically destroys model utility, particularly for large-scale deletion requests. While leveraging public data is a standard technique in differential privacy to relax this tension, its role in unlearning remains unexplored. We address this gap by introducing Asymmetric Langevin Unlearning (ALU), a framework that uses public data to mitigate privacy costs. We prove that public data injection suppresses the unlearning cost by a factor of $O(1/n_{\mathrm{pub}}^2)$, guaranteeing a strict computational advantage over retraining. This establishes a new control mechanism: practitioners can mitigate the need for high noise, and the associated utility loss, by increasing the volume of public data. Crucially, we analyze the realistic setting of distribution mismatch, explicitly characterizing how shifts between public and private sources impact utility. We show that ALU enables mass unlearning of constant dataset fractions -- a regime where standard symmetric methods become impractical -- while maintaining high utility. Empirical evaluations using variational Rényi divergence and membership inference attacks confirm that ALU effectively thwarts privacy attacks while preserving utility under reasonable distribution shifts.

2602.22480 2026-06-03 cs.AI cs.CL cs.LG

VeRO: A Harness for Agents to Optimize Agents

Varun Ursekar, Apaar Shanker, Veronica Chatrath, Yuan Xue, Samuel Marc Denton

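Affiliations: arXiv

AI summary: Proposes the VeRO harness and the VeRO-Bench benchmark, which optimize agent code through versioned snapshots, budget-controlled evaluation, and structured execution traces, and experimentally compares how much different optimizers improve target agents.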

Comments Accepted to the Forty-Third International Conference on Machine Learning (ICML), 2026

Abstract

An important emerging application of coding agents is agent harness optimization: the iterative improvement of a target agent by editing and evaluating its code. Despite its relevance, the community lacks a systematic understanding of coding agent performance on this task. Harness optimization differs from conventional software engineering: agent harnesses interleave deterministic code with stochastic LLM completions, requiring structured capture of both intermediate execution traces and downstream outcomes. To address these challenges, we introduce (1) VeRO (Versioning, Rewards, and Observations), an outer harness that provides versioned snapshots, budget-controlled evaluation, and structured execution traces of target harnesses, and (2) VeRO-Bench, a benchmark suite of target agents and tasks with reference evaluation procedures. Using VeRO, we conduct an empirical study comparing optimizers across tasks and analyzing which modifications reliably improve target agent harnesses. We release VeRO to support research on agent optimization as a core capability for coding agents. Code is available at https://github.com/scaleapi/vero.

2605.09233 2026-06-03 cs.CV cs.AI

Towards Robust Sequential Decomposition for Complex Image Editing

Zilai Zeng, Mingdeng Cao, Zijie Li, Xiaochen Lian, Yichun Shi, Peihao Zhu, Chen Sun, Peng Wang

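Affiliations: Brown University; ByteDance Seed; The University of Tokyo

AI summary: Proposes breaking complex editing tasks into simple steps through sequential decomposition and training models on synthetic data; under a unified in-context editing framework, it balances the benefits of decomposition against error accumulation, achieving robust improvements and sim-to-real generalization.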

Comments CVPR 2026

Abstract

Recent advances in visual generative models have enabled high-fidelity image editing guided by human instructions. However, these models often struggle with complex instructions involving combinatorial editing operations or inter-step dependencies. This difficulty stems from the limitations of two canonical paradigms: (1) single-turn editing, which attempts to apply all instructed edits in one pass, often fails to parse the complex instruction accurately and causes undesired edits; and (2) sequential editing, which decomposes the task into simpler steps, suffers from compounding errors introduced by the sequential execution, leading to low-fidelity results. To derive a robust solution for complex image editing, we examine editing behaviors of different paradigms under a unified in-context editing framework, and study how the benefits of sequential decomposition can be balanced against its error-accumulation drawbacks. We further develop a synthetic data pipeline that constructs editing tasks of varying instruction complexity, allowing us to curate a large-scale editing dataset with high-quality decomposed sequences. By finetuning on synthetic data, we find that with properly designed editing paradigms, sequential decomposition yields robust improvements even as task complexity increases. Furthermore, the decomposition skills learned from synthetic tasks can transfer to real images by co-training with real-world editing data, demonstrating the promise of sim-to-real generalization for tackling complex image editing across broader domains.

2605.08767 2026-06-03 cs.AI

From Holo Pockets to Electron Density: GPT-style Drug Design with Density

Jiahao Chen, Letian Gao, Yanhao Zhu, Wenbiao Zhou, Bing Su, Zhi John Lu, Bo Huang

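Affiliations: University of Science and Technology of China

AI summary: Proposes EDMolGPT, an autoregressive framework for de novo drug design that uses low-resolution electron density as a physical condition; it generates molecules from density point clouds, mitigating structural bias and producing 3D conformations.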

Comments Published as a conference paper in ICML 2026

Abstract

Recent advances in generative modeling have enabled significant progress in structure-based drug design (SBDD). Existing methods typically condition molecule generation on empty binding pockets from holo complexes, overlooking informative components such as the filler (ligands and solvent). Here, we leverage low-resolution electron density (ED) derived from the filler as a physically grounded condition for de novo drug design. We consider two types of ED, calculated and cryo-EM/X-ray, obtainable from computational or experimental sources, supporting unified pre-training and experimental integration. Compared with rigid pocket representations, experimental ED naturally captures conformational flexibility and provides a more faithful description of the binding environment. Based on this, we introduce EDMolGPT, a decoder-only autoregressive framework that generates molecules from low-resolution ED point clouds. By grounding generation in physically meaningful density signals, EDMolGPT mitigates structural bias and produces molecules with 3D conformations. Evaluations on 101 biological targets verify its effectiveness. Our project page: https://jiahaochen1.github.io/EDMolGPT_Page/.

2605.01386 2026-06-03 cs.CL

MemORAI: Memory Organization and Retrieval via Adaptive Graph Intelligence for LLM Conversational Agents

Hung Pham Van, Nguyen Manh Hieu, Khang Pham Tran Tuan, Nam Le Hai, Linh Ngo Van, Nguyen Thi Ngoc Diep, Trung Le

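Affiliations: Independent Researcher; Hanoi University of Science and Technology; VNU University of Engineering and Technology; Monash University

AI summary: Proposes the MemORAI framework, which uses selective memory filtering, provenance-enriched multi-relational graph storage, and query-adaptive subgraph retrieval to address missing memory, information dilution, and imprecise retrieval in long-term LLM conversations, achieving state-of-the-art performance on the LOCOMO and LongMemEval benchmarks.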

Comments ACL Findings

Abstract

Large Language Models (LLMs) lack persistent memory for long-term personalized conversations. Existing graph-based memory systems suffer from information dilution, absent provenance tracking, and uniform retrieval that ignores query context. We introduce MemORAI (Memory Organization and Retrieval via Adaptive Graph Intelligence), a framework that integrates three innovations: selective memory filtering with dual-layer compression to retain user-persona-relevant content, a provenance-enriched multi-relational graph tracking factual origins at the turn level, and query-adaptive subgraph retrieval with Dynamic Weighted PageRank that applies query-conditioned edge weighting. Evaluated on LOCOMO and LongMemEval benchmarks, MemORAI achieves state-of-the-art performance in memory retrieval and personalized response generation, demonstrating that selective storage, enriched representation, and adaptive retrieval are essential for coherent, personalized LLM agents.
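
The retrieval step described above relies on PageRank over a graph whose edge weights depend on the query. As a rough illustration only, the sketch below runs a standard weighted, personalized PageRank by power iteration. The graph, the seed nodes, and the edge weights are hypothetical, and the weights are assumed to have already been conditioned on the query; the paper's Dynamic Weighted PageRank may differ in its details.

```python
from typing import Dict

# node -> {neighbor: edge weight}; every neighbor must also appear as a key
Graph = Dict[str, Dict[str, float]]


def weighted_pagerank(
    graph: Graph, seeds: Dict[str, float], damping: float = 0.85, iters: int = 50
) -> Dict[str, float]:
    """Personalized PageRank where rank flows along edges in proportion to their weight."""
    nodes = list(graph)
    total_seed = sum(seeds.values())
    # Restart distribution concentrated on the query's seed nodes.
    restart = {n: seeds.get(n, 0.0) / total_seed for n in nodes}
    rank = dict(restart)
    for _ in range(iters):
        nxt = {n: (1.0 - damping) * restart[n] for n in nodes}
        for src in nodes:
            out_total = sum(graph[src].values())
            if out_total == 0.0:
                # Dangling node: send its mass back to the restart distribution.
                for n in nodes:
                    nxt[n] += damping * rank[src] * restart[n]
                continue
            for dst, weight in graph[src].items():
                nxt[dst] += damping * rank[src] * weight / out_total
        rank = nxt
    return rank


# Hypothetical memory graph: the query node links more strongly to memory "a" than to "b".
memory_graph: Graph = {"query": {"a": 3.0, "b": 1.0}, "a": {}, "b": {}}
scores = weighted_pagerank(memory_graph, seeds={"query": 1.0})
print(sorted(scores, key=scores.get, reverse=True))
```

Changing the edge weights per query changes which memories accumulate rank, which is the sense in which retrieval over a fixed graph can adapt to the query.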

2605.01374 2026-06-03 cs.CL

MTA: Multi-Granular Trajectory Alignment for Large Language Model Distillation

Pham Khanh Chi, Quoc Phong Dao, Thuat Nguyen, Linh Ngo Van, Trung Le, Thanh Hong Nguyen

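Affiliations: Hanoi University of Science and Technology; Monash University; University of Oregon

AI summary: Proposes the Multi-Granular Trajectory Alignment (MTA) framework, which uses a layer-adaptive strategy to align the layer-wise transformation trajectories of teacher and student models and combines a Dynamic Structural Alignment loss with a Hidden Representation Alignment loss to improve knowledge distillation.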

Comments ACL 2026

Abstract

Knowledge distillation is a key technique for compressing large language models (LLMs), but most existing methods align representations at fixed layers or token-level outputs, ignoring how representations evolve across depth. As a result, the student is only weakly guided to capture the teacher's internal relational structure during distillation, which limits knowledge transfer. To address this limitation, we propose Multi-Granular Trajectory Alignment (MTA), a framework that aligns teacher and student representations along their layer-wise transformation trajectory. MTA adopts a layer-adaptive strategy: lower layers are aligned at the word level to preserve lexical information, while higher layers operate on phrase-level spans (e.g., noun and verb phrases) to capture compositional semantics. We instantiate this idea through a Dynamic Structural Alignment loss that matches the relative geometry among semantic units within each layer. This design is motivated by empirical findings that Transformer representations become increasingly abstract with depth, and is also consistent with linguistic views in which higher-level meaning emerges through the composition of lower-level lexical units. We further incorporate a Hidden Representation Alignment loss to directly align selected teacher-student layers. Experiments show that MTA consistently outperforms state-of-the-art baselines on standard benchmarks, with ablations confirming the contribution of each component.
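
One way to read "matching the relative geometry among semantic units" is to compare pairwise similarity matrices rather than raw vectors, which also sidesteps the difference in hidden size between teacher and student. The sketch below takes that reading; the use of cosine similarity and a mean-squared-error penalty is an assumption for illustration, not the paper's stated Dynamic Structural Alignment loss.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)


def similarity_matrix(units: Sequence[Vector]) -> List[List[float]]:
    """Pairwise cosine similarities among the semantic units (words or phrase spans) of one layer."""
    return [[cosine(a, b) for b in units] for a in units]


def relative_geometry_loss(
    teacher_units: Sequence[Vector], student_units: Sequence[Vector]
) -> float:
    """Mean squared difference between the teacher's and the student's similarity matrices."""
    t = similarity_matrix(teacher_units)
    s = similarity_matrix(student_units)
    n = len(t)
    return sum((t[i][j] - s[i][j]) ** 2 for i in range(n) for j in range(n)) / (n * n)


# The teacher uses 3-d vectors and the student 2-d vectors; only relations between units are compared.
teacher = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
aligned_student = [[2.0, 0.0], [0.0, 3.0]]    # the two units are orthogonal, as in the teacher
collapsed_student = [[1.0, 0.0], [1.0, 0.0]]  # the two units have become identical
print(relative_geometry_loss(teacher, aligned_student))
print(relative_geometry_loss(teacher, collapsed_student))
```

Under the layer-adaptive strategy in the abstract, the units passed in would be word-level vectors for lower layers and pooled phrase-level spans for higher layers.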

2605.01205 2026-06-03 cs.CL

SRA: Span Representation Alignment for Large Language Model Distillation

Quoc Phong Dao, Hoang Son Nguyen, Pham Khanh Chi, Tung Nguyen, Linh Ngo Van, Nguyen Thi Ngoc Diep, Trung Le

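Affiliations: Hanoi University of Science and Technology; VNU University of Engineering and Technology; Monash University

AI summary: Proposes the SRA framework, which shifts the unit of distillation alignment from tokens to center-of-mass representations of spans and adds geometric regularization and aligned span logit distillation, significantly improving cross-tokenizer knowledge distillation.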

Comments ACL 2026

Abstract

Cross-Tokenizer Knowledge Distillation (CTKD) enables knowledge transfer between a large language model and a smaller student, even when they employ different tokenizers. While existing approaches mainly focus on token-level alignment strategies, which are often brittle and sensitive to discrepancies between tokenizers, we argue that the method of aggregating tokens into more robust representations before distillation is of equal importance. In this paper, we introduce SRA (Span Representation Alignment for Large Language Model Distillation), a novel framework that reframes CTKD through the physical lens of Multi-Particle Dynamical Systems. SRA shifts the fundamental unit of alignment from tokens to robust, tokenizer-agnostic spans. We model each span as a cluster of particles and represent its state by its Center of Mass (CoM), an attention-weighted average that captures rich semantic information. We leverage the concept of span centers of mass with attention-derived weighting to prioritize the most salient spans. In addition, we employ a geometric regularizer to preserve the structural integrity of the representation space and introduce aligned span logit distillation to enhance knowledge transfer across models. In challenging cross-architecture distillation experiments, SRA consistently and significantly outperforms state-of-the-art CTKD baselines, validating our physically-grounded approach.
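
The center-of-mass representation mentioned above is an attention-weighted average of the token vectors inside a span. A minimal sketch follows; the function name, the toy vectors, and the weights are hypothetical, and how the attention weights are obtained from the model is left outside the sketch.

```python
from typing import List, Sequence


def span_center_of_mass(
    token_vectors: Sequence[Sequence[float]], attention_weights: Sequence[float]
) -> List[float]:
    """Attention-weighted average of a span's token vectors (weights act as particle masses)."""
    total = sum(attention_weights)
    dim = len(token_vectors[0])
    return [
        sum(w * vec[d] for w, vec in zip(attention_weights, token_vectors)) / total
        for d in range(dim)
    ]


# One span tokenized into two pieces; the second piece receives three times the attention.
tokens = [[0.0, 0.0], [2.0, 4.0]]
weights = [1.0, 3.0]
print(span_center_of_mass(tokens, weights))  # [1.5, 3.0]
```

Because the result is a single vector per span regardless of how many tokens the span contains, a teacher and a student that split the same text differently still yield one vector each for that span, which is what makes span-level comparison possible across tokenizers.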

2604.27660 2026-06-03 cs.AI

From Context to Skills: Can Language Models Learn from Context Skillfully?

Shuzheng Si, Haozhe Zhao, Yu Lei, Qingyi Wang, Dingwei Chen, Zhitong Wang, Zhenhailong Wang, Kangyang Luo, Zheng Wang, Gang Chen, Fanchao Qi, Minjia Zhang, Maosong Sun

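Affiliations: THU (Tsinghua University); DeepLang AI; UIUC (University of Illinois Urbana-Champaign); FDU (Fudan University); CUHK (The Chinese University of Hong Kong)

AI summary: Proposes the Ctx2Skill framework, which uses multi-agent self-play and a Cross-time Replay mechanism to automatically discover, refine, and select skills from context, improving language models' ability to learn from complex contexts.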

Abstract

Many real-world tasks require language models (LMs) to reason over complex contexts that exceed their parametric knowledge. This calls for context learning, where LMs directly learn relevant knowledge from the given context. An intuitive solution is inference-time skill augmentation: extracting the rules and procedures from context into natural-language skills. However, constructing such skills for context learning scenarios faces two challenges: the prohibitive cost of manual skill annotation for long, technically dense contexts, and the lack of external feedback for automated skill construction. In this paper, we propose Ctx2Skill, a self-evolving framework that autonomously discovers, refines, and selects context-specific skills without human supervision or external feedback. At its core, a multi-agent self-play loop has a Challenger that generates probing tasks and rubrics, a Reasoner that attempts to solve them guided by an evolving skill set, and a neutral Judge that provides binary feedback. Crucially, both the Challenger and the Reasoner evolve through accumulated skills: dedicated Proposer and Generator agents analyze failure cases and synthesize them into targeted skill updates for both sides, enabling automated skill discovery and refinement. To prevent adversarial collapse caused by increasingly extreme task generation and over-specialized skill accumulation, we further introduce a Cross-time Replay mechanism that identifies the skill set achieving the best balance across representative cases for the Reasoner side, ensuring robust and generalizable skill evolution. The resulting skills can be plugged into any language model to obtain better context learning capability. Evaluated on four context learning tasks from CL-bench, Ctx2Skill consistently improves solving rates across backbone models.

2604.27232 2026-06-03 cs.CL

Targeted Linguistic Analysis of Sign Language Models with Minimal Translation Pairs

Serpil Karabüklü, Kanishka Misra, Shester Gueuwou, Diane Brentari, Greg Shakhnarovich, Karen Livescu

Affiliations: Toyota Technological Institute at Chicago; Linguistics Department, The University of Texas at Austin; Linguistics Department, The University of Chicago

AI summary: For sign language translation models, builds the ASL Minimal Translation Pairs (ASL-MTP) dataset and ablates input cues to analyze how well a model captures different sign language phenomena, especially non-manual cues.

Comments Accepted to the CVPR 2026 Workshop GenSign: Generative AI for Sign Language

Abstract

Models of sign language have historically lagged behind those for spoken language (text and speech). Recent work has greatly improved their performance on tasks like sign language translation and isolated sign recognition. However, it remains unclear to what extent existing models capture various linguistic phenomena of sign language, and how well they use cues from the multiple articulators used in sign language (hands, upper body, face). We introduce a new benchmark dataset for American Sign Language, ASL Minimal Translation Pairs (ASL-MTP), divided into multiple types of sign language phenomena and corresponding minimal pairs of translations, for performing such linguistic analyses. As a case study, we use ASL-MTP to analyze a state-of-the-art ASL-to-English translation model. We conduct a targeted analysis of the model by ablating various input cues during training and inference and evaluating on the phenomena in ASL-MTP. Our results show that, while the model performs above chance level on most of the phenomena, it relies strongly on manual cues while often missing crucial non-manual cues.

2604.25928 2026-06-03 cs.CL

CogRAG: Tackling Heterogeneous Cognitive Demands in RAG via Stratified Retrieval and Reasoning

Xudong Wang, Zilong Wang, Kui Su, Zhaoyan Ming

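Affiliations: School of Computer and Computing Science, Hangzhou City University; Innovation Center of Yangtze River Delta, Zhejiang University; Institute of Digital Twin, Eastern Institute of Technology

AI summary: Proposes the CogRAG framework, which predicts a query's cognitive load based on Bloom's Taxonomy and coordinates Cognition-Adaptive Evidence Refinement with Cognition-Stratified Structured Reasoning to handle the heterogeneous cognitive demands of different tasks in RAG; on the Registered Dietitian qualification examination it raises Qwen3-8B accuracy to 85.8% in single-choice mode.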

Abstract

Retrieval-Augmented Generation (RAG) frameworks typically process all queries through a one-size-fits-all pipeline, ignoring the heterogeneous cognitive demands of different tasks. This cognitive-blind approach causes two failure modes: cascading errors when low-level factual gaps trigger hallucinated reasoning, and reasoning-answer inconsistency in higher-order analytical tasks. We introduce CogRAG, a training-free, domain-agnostic framework that tackles these heterogeneous cognitive demands via stratified retrieval and reasoning. Inspired by Bloom's Taxonomy, CogRAG uses the predicted cognitive load of a query as a central control signal that coordinates two modules: Cognition-Adaptive Evidence Refinement supplements missing context via fact-centric or option-centric paths, and Cognition-Stratified Structured Reasoning replaces unconstrained chain-of-thought with cognition-aligned reasoning templates. We evaluate CogRAG on a demanding professional testbed, the Registered Dietitian qualification examination. CogRAG effectively reduces early-stage factual errors and eliminates reasoning-answer inconsistency, raising Qwen3-8B accuracy from 73.4% to 85.8% in single-choice mode and from 63.3% to 80.5% in scenario mode. These results highlight cognitive-stratified control as an effective, generalizable paradigm for reliable complex reasoning in large language models.

2604.24374 2026-06-03 cs.CL

MIPIC: Matryoshka Representation Learning via Self-Distilled Intra-Relational and Progressive Information Chaining

Phung Gia Huy, Hai An Vu, Minh-Phuc Truong, Thang Duc Tran, Linh Ngo Van, Thanh Hong Nguyen, Trung Le

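Affiliations: Hanoi University of Science and Technology; University of Oregon; Monash University

AI summary: Proposes the MIPIC framework, which uses Self-Distilled Intra-Relational Alignment and Progressive Information Chaining to give nested embeddings cross-dimensional structural consistency and depth-wise semantic consolidation, with significant gains at low dimensionality.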

Comments ACL Findings

Abstract

Representation learning is fundamental to NLP, but building embeddings that work well at different computational budgets is challenging. Matryoshka Representation Learning (MRL) offers a flexible inference paradigm through nested embeddings; however, learning such structures requires explicit coordination of how information is arranged across embedding dimensionality and model depth. In this work, we propose MIPIC (Matryoshka Representation Learning via Self-Distilled Intra-Relational Alignment and Progressive Information Chaining), a unified training framework designed to produce structurally coherent and semantically compact Matryoshka representations. MIPIC promotes cross-dimensional structural consistency through Self-Distilled Intra-Relational Alignment (SIA), which aligns token-level geometric and attention-driven relations between full and truncated representations using top-k CKA self-distillation. Complementarily, it enables depth-wise semantic consolidation via Progressive Information Chaining (PIC), a scaffolded alignment strategy that incrementally transfers mature task semantics from deeper layers into earlier layers. Extensive experiments on STS, NLI, and classification benchmarks (spanning models from TinyBERT to BGEM3 and Qwen3) demonstrate that MIPIC yields Matryoshka representations that are highly competitive across all capacities, with significant performance advantages observed in extremely low-dimensional settings.
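
The "nested embeddings" that Matryoshka Representation Learning provides mean that a prefix of the full vector is itself used as a lower-dimensional embedding. The sketch below shows only that inference-time truncation, with hypothetical toy vectors; it does not implement MIPIC's SIA or PIC training objectives.

```python
import math
from typing import List, Sequence


def truncate_and_normalize(embedding: Sequence[float], dim: int) -> List[float]:
    """Keep the first `dim` coordinates and rescale them to unit length."""
    prefix = list(embedding[:dim])
    norm = math.sqrt(sum(x * x for x in prefix))
    return [x / norm for x in prefix] if norm > 0.0 else prefix


def cosine_at_dim(a: Sequence[float], b: Sequence[float], dim: int) -> float:
    """Cosine similarity computed on the first `dim` coordinates only."""
    ta = truncate_and_normalize(a, dim)
    tb = truncate_and_normalize(b, dim)
    return sum(x * y for x, y in zip(ta, tb))


# Two toy 4-d embeddings that agree on their first two coordinates but differ afterwards.
a = [1.0, 0.0, 0.0, 1.0]
b = [1.0, 0.0, 1.0, 0.0]
for dim in (2, 4):
    print(dim, cosine_at_dim(a, b, dim))
```

In this toy pair the truncated and full embeddings give different similarity judgments, which illustrates why training has to coordinate how information is arranged across dimensions if the short prefixes are to remain useful.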

2604.23099 2026-06-03 cs.LG cs.AI stat.ML

ProEval: Proactive Failure Discovery and Efficient Performance Estimation for Generative AI Evaluation

Yizheng Huang, Wenjun Zeng, Aditi Kumaresan, Zi Wang

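Affiliations: Google DeepMind

AI summary: Proposes the ProEval framework, which uses pre-trained Gaussian Processes for Bayesian quadrature and superlevel set sampling to enable efficient performance estimation and proactive failure discovery; on reasoning, safety alignment, and classification benchmarks it reaches estimates within 1% of the ground truth with 8-65x fewer samples.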

Comments Our open-sourced code and data can be found at https://github.com/google-deepmind/proeval

Journal ref
International Conference on Machine Learning, 2026

Abstract

Evaluating generative AI models is increasingly resource-intensive due to slow inference, expensive raters, and a rapidly growing landscape of models and benchmarks. We propose ProEval, a proactive evaluation framework that leverages transfer learning to efficiently estimate performance and identify failure cases. ProEval employs pre-trained Gaussian Processes (GPs) as surrogates for the performance score function, mapping model inputs to metrics such as the severity of errors or safety violations. By framing performance estimation as Bayesian quadrature (BQ) and failure discovery as superlevel set sampling, we develop uncertainty-aware decision strategies that actively select or synthesize highly informative inputs for testing. Theoretically, we prove that our pre-trained GP-based BQ estimator is unbiased and bounded. Empirically, extensive experiments on reasoning, safety alignment, and classification benchmarks demonstrate that ProEval is significantly more efficient than competitive baselines. It requires 8-65x fewer samples to achieve estimates within 1% of the ground truth, while simultaneously revealing more diverse failure cases under a stricter evaluation budget.

2507.05519 2026-06-03 cs.AI cs.LO

Modeling Deontic Modal Logic in the s(CASP) Goal-directed Predicate Answer Set Programming System

Gopal Gupta, Abhiramon Rajasekharan, Alexis R. Tudor, Elmer Salazar, Joaquín Arias

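Affiliations: The University of Texas at Dallas; CETINIA, Universidad Rey Juan Carlos

AI summary: Uses default negation and strong negation in answer set programming to express deontic modal operators directly and represents obligations, prohibitions, and permissions with global constraints, resolving the classic paradoxes of deontic modal logic and supporting knowledge representation of conditional obligations and conditional prohibitions.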

Comments Will appear as a Technical Communication in the 42nd International Conference on Logic Programming (ICLP 2026)

英文摘要

We consider the problem of implementing deontic modal logic. We show how (deontic) modal operators can be elegantly and directly expressed using default negation (negation-as-failure) and strong negation present in answer set programming (ASP). We propose using global constraints of ASP to represent obligations, prohibitions, and permissions in deontic modal logic. We show that our proposed representation results in the various decades-old paradoxes of deontic modal logic being simply and elegantly resolved. Our method also serves as a means for modeling conditional obligations and conditional prohibitions in knowledge representation.

2508.06165 2026-06-03 cs.CL cs.AI

UR$^2$: Unify RAG and Reasoning through Reinforcement Learning

UR$^2$:通过强化学习统一检索增强生成与推理

Weitao Li, Boran Xiang, Xiaolong Wang, Zhinan Gou, Weizhi Ma, Yang Liu

发表机构 * Dept. of Comp. Sci. & Tech., Institute for AI, Tsinghua University, Beijing, China(计算机科学与技术系,人工智能研究院,清华大学,北京,中国) Institute for AI Industry Research (AIR), Tsinghua University, Beijing, China(人工智能产业研究机构(AIR),清华大学,北京,中国) School of Management Science & Information Engineering, Hebei University of Economics and Business, Hebei, China(管理科学与信息工程学院,河北经贸大学,河北,中国)

AI总结 提出UR$^2$框架,通过强化学习动态协调检索与推理,结合难度感知课程和混合知识访问策略,在开放域问答、MMLU-Pro、医学和数学推理任务上优于现有基线,性能接近GPT-4o-mini和GPT-4.1-mini。

详情
AI中文摘要

大型语言模型(LLM)通过两种互补范式展现了强大能力:用于知识基础的检索增强生成(RAG)和用于复杂推理的可验证奖励强化学习(RLVR)。然而,现有统一这些范式的尝试范围狭窄,通常局限于具有固定检索设置的开放域问答,限制了向更广泛领域的泛化。为解决这一局限,我们提出UR$^2$(统一RAG与推理),一个通用的强化学习框架,动态协调检索与推理。UR$^2$引入了两个关键设计:一个难度感知课程,仅对困难实例选择性调用检索;以及一个混合知识访问策略,结合领域特定的离线语料库和即时生成的LLM摘要。这些组件共同缓解了检索与推理之间的不平衡,并提高了对噪声信息的鲁棒性。在开放域问答、MMLU-Pro、医学和数学推理任务上的实验表明,基于Qwen-2.5-3/7B和LLaMA-3.1-8B构建的UR$^2$持续优于现有RAG和RL基线,并在多个基准上达到与GPT-4o-mini和GPT-4.1-mini相当的性能。我们的代码可在https://github.com/Tsinghua-dhy/UR2获取。

英文摘要

Large Language Models (LLMs) have shown strong capabilities through two complementary paradigms: Retrieval-Augmented Generation (RAG) for knowledge grounding and Reinforcement Learning from Verifiable Rewards (RLVR) for complex reasoning. However, existing attempts to unify these paradigms remain narrow in scope, typically limited to open-domain QA with fixed retrieval settings, which constrains generalization to broader domains. To address this limitation, we propose UR$^2$ (Unified RAG and Reasoning), a general reinforcement learning framework that dynamically coordinates retrieval and reasoning. UR$^2$ introduces two key designs: a difficulty-aware curriculum that selectively invokes retrieval only for challenging instances, and a hybrid knowledge access strategy that combines domain-specific offline corpora with on-the-fly LLM-generated summaries. Together, these components mitigate the imbalance between retrieval and reasoning and improve robustness to noisy information. Experiments on open-domain QA, MMLU-Pro, medical, and mathematical reasoning tasks show that UR$^2$, built on Qwen-2.5-3/7B and LLaMA-3.1-8B, consistently outperforms existing RAG and RL baselines, and achieves performance comparable to GPT-4o-mini and GPT-4.1-mini on several benchmarks. Our code is available at https://github.com/Tsinghua-dhy/UR2.

2604.20316 2026-06-03 cs.LG

R2IF: Aligning Reasoning with Decisions via Composite Rewards for Interpretable LLM Function Calling

R2IF: 通过复合奖励对齐推理与决策以实现可解释的LLM函数调用

Aijia Cheng, Kailong Wang, Ling Shi, Yongxin Zhao

发表机构 * Shanghai Key Laboratory of Trustworthy Computing, East China Normal University, Shanghai, China(上海可信计算实验室,华东师范大学,上海,中国) Huazhong University of Science and Technology(华中科技大学) Nanyang Technological University(南洋理工大学)

AI总结 提出R2IF框架,通过复合奖励(格式/正确性约束、思维链有效性奖励和规范-修改-价值奖励)和GRPO优化,对齐推理过程与工具调用决策,在BFCL/ACEBench上提升函数调用准确性和可解释性。

详情
AI中文摘要

函数调用使大型语言模型(LLM)能够与外部工具交互,但现有的基于强化学习的方法存在推理过程与工具调用决策之间的错位。我们提出了R2IF,一种面向可解释函数调用的推理感知强化学习框架,采用复合奖励,整合格式/正确性约束、思维链有效性奖励(CER)和规范-修改-价值(SMV)奖励,并通过GRPO进行优化。在BFCL/ACEBench上的实验表明,R2IF在性能上优于基线方法,最高提升34.62%(Llama3.2-3B在BFCL上),同时平均思维链有效性为正(Llama3.2-3B为0.05),增强了函数调用的准确性和可解释性,为可靠的工具增强型LLM部署提供了支持。

英文摘要

Function calling empowers large language models (LLMs) to interface with external tools, yet existing RL-based approaches suffer from misalignment between reasoning processes and tool-call decisions. We propose R2IF, a reasoning-aware RL framework for interpretable function calling, adopting a composite reward integrating format/correctness constraints, Chain-of-Thought Effectiveness Reward (CER), and Specification-Modification-Value (SMV) reward, optimized via GRPO. Experiments on BFCL/ACEBench show R2IF outperforms baselines by up to 34.62% (Llama3.2-3B on BFCL) with positive Average CoT Effectiveness (0.05 for Llama3.2-3B), enhancing both function-calling accuracy and interpretability for reliable tool-augmented LLM deployment.
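The abstract above names a composite reward optimized via GRPO. As a minimal sketch only, the snippet below shows how several reward terms might be combined and then turned into group-relative advantages, the normalization step commonly associated with GRPO. The function names, the weights, and the use of the population standard deviation are illustrative assumptions; the abstract does not state how R2IF weights or normalizes its terms.

```python
from statistics import mean, pstdev


def composite_reward(format_ok, correct, cer, smv, weights=(1.0, 1.0, 0.5, 0.5)):
    """Weighted sum of four reward terms.

    The weights are hypothetical placeholders, not values from the paper.
    """
    w_fmt, w_cor, w_cer, w_smv = weights
    return w_fmt * float(format_ok) + w_cor * float(correct) + w_cer * cer + w_smv * smv


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps).

    Using the population standard deviation is an assumption of this sketch.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four hypothetical rollouts for the same prompt.
    group = [
        composite_reward(True, True, 0.2, 0.1),
        composite_reward(True, False, 0.0, 0.0),
        composite_reward(False, False, -0.1, 0.0),
        composite_reward(True, True, 0.0, 0.3),
    ]
    print(group_relative_advantages(group))
```

Because advantages are centered within each group, a rollout is rewarded only for doing better than its siblings on the same prompt, which removes the need for a separate value network.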

2604.20183 2026-06-03 cs.CL

Dual-Cluster Memory Agent: Resolving Multi-Paradigm Ambiguity in Optimization Problem Solving

双簇记忆智能体:解决优化问题求解中的多范式歧义

Xinyu Zhang, Yuchen Wan, Boxuan Zhang, Zesheng Yang, Lingling Zhang, Bifan Wei, Jun Liu

发表机构 * School of Computer Science and Technology, Xi’an Jiaotong University(西安交通大学计算机科学与技术学院) Ministry of Education Key Laboratory of Intelligent Networks and Network Security, China(教育部智能网络与网络安全重点实验室) Shaanxi Province Key Laboratory of Big Data Knowledge Engineering, China(陕西省大数据知识工程重点实验室)

AI总结 提出双簇记忆智能体(DCM-Agent),通过无训练方式利用历史解决方案构建双簇记忆,并采用记忆增强推理动态导航解路径,在七个优化基准上平均性能提升11%-21%。

详情
AI中文摘要

大型语言模型(LLMs)在优化问题中常面临结构歧义,即单个问题存在多个相关但冲突的建模范式,阻碍了有效解的生成。为解决此问题,我们提出双簇记忆智能体(DCM-Agent),以无训练方式利用历史解决方案提升性能。其核心是双簇记忆构建:该智能体将历史解决方案分配到建模和编码两个簇中,然后将每个簇的内容提炼为三种结构化类型:方法、检查表和陷阱。该过程推导出可泛化的指导知识。此外,该智能体引入记忆增强推理,以动态导航解路径、检测并修复错误,并利用结构化知识自适应切换推理路径。在七个优化基准上的实验表明,DCM-Agent平均性能提升11%-21%。值得注意的是,我们的分析揭示了“知识继承”现象:由更大模型构建的记忆可以引导较小模型获得更优性能,凸显了该框架的可扩展性和效率。

英文摘要

Large Language Models (LLMs) often struggle with structural ambiguity in optimization problems, where a single problem admits multiple related but conflicting modeling paradigms, hindering effective solution generation. To address this, we propose Dual-Cluster Memory Agent (DCM-Agent) to enhance performance by leveraging historical solutions in a training-free manner. Central to this is Dual-Cluster Memory Construction. This agent assigns historical solutions to modeling and coding clusters, then distills each cluster's content into three structured types: Approach, Checklist, and Pitfall. This process derives generalizable guidance knowledge. Furthermore, this agent introduces Memory-augmented Inference to dynamically navigate solution paths, detect and repair errors, and adaptively switch reasoning paths with structured knowledge. The experiments across seven optimization benchmarks demonstrate that DCM-Agent achieves an average performance improvement of 11%-21%. Notably, our analysis reveals a "knowledge inheritance" phenomenon: memory constructed by larger models can guide smaller models toward superior performance, highlighting the framework's scalability and efficiency.

2604.19005 2026-06-03 cs.CL

Debating the Unspoken: Role-Anchored Multi-Agent Reasoning for Half-Truth Detection

辩论未尽之言:基于角色锚定的多智能体推理用于半真半假检测

Yixuan Tang, Yirui Zhang, Hang Feng, Anthony K. H. Tung

发表机构 * National University of Singapore(新加坡国立大学) Shanghai Jiao Tong University(上海交通大学)

AI总结 提出RADAR框架,通过角色锚定的多智能体辩论(政治家与科学家对抗推理,法官中立裁决)和双阈值早停控制,在噪声检索下有效检测因省略上下文而误导的半真半假陈述。

详情
AI中文摘要

半真半假,即事实正确但因省略上下文而具有误导性的主张,对于专注于显式虚假的事实核查系统而言仍然是一个盲点。解决这种基于省略的操纵需要不仅推理所说内容,还要推理未说内容。我们提出RADAR,一个角色锚定的多智能体辩论框架,用于在现实噪声检索下进行省略感知的事实核查。RADAR为政治家和科学家分配互补角色,他们在共享检索证据上进行对抗性推理,并由中立法官主持。双阈值早停控制器自适应地决定何时达到足够推理以做出裁决。实验表明,RADAR在数据集和骨干网络上始终优于强单智能体和多智能体基线,提高了遗漏检测准确性,同时降低了推理成本。这些结果表明,具有自适应控制的角色锚定、基于检索的辩论是揭示事实核查中缺失上下文的有效且可扩展的框架。

英文摘要

Half-truths, claims that are factually correct yet misleading due to omitted context, remain a blind spot for fact verification systems focused on explicit falsehoods. Addressing such omission-based manipulation requires reasoning not only about what is said, but also about what is left unsaid. We propose RADAR, a role-anchored multi-agent debate framework for omission-aware fact verification under realistic, noisy retrieval. RADAR assigns complementary roles to a Politician and a Scientist, who reason adversarially over shared retrieved evidence, moderated by a neutral Judge. A dual-threshold early termination controller adaptively decides when sufficient reasoning has been reached to issue a verdict. Experiments show that RADAR consistently outperforms strong single- and multi-agent baselines across datasets and backbones, improving omission detection accuracy while reducing reasoning cost. These results demonstrate that role-anchored, retrieval-grounded debate with adaptive control is an effective and scalable framework for uncovering missing context in fact verification.

2604.18572 2026-06-03 cs.CV cs.AI cs.LG

Back into Plato's Cave: Examining Cross-modal Representational Convergence at Scale

回到柏拉图的洞穴:大规模检验跨模态表示收敛性

A. Sophia Koepke, Daniil Zverev, Shiry Ginosar, Alexei A. Efros

发表机构 * UC Berkeley(加州大学伯克利分校) Technical University Munich, MCML(慕尼黑工业大学,MCML) University of Tübingen, Tübingen AI Center(图宾根大学,图宾根人工智能中心) Toyota Technological Institute at Chicago(芝加哥丰田技术研究所)

AI总结 本文通过大规模数据集实验,质疑了柏拉图表示假说中跨模态表示收敛的证据,发现对齐度随数据规模增大而显著下降,且仅反映粗粒度语义重叠。

Comments Project page: http://akoepke.github.io/cave_umwelten/

详情
AI中文摘要

柏拉图表示假说认为,在不同模态(例如文本和图像)上训练的神经网络会趋向于对齐并最终收敛到相同的现实表示。如果该假说成立,将对模态选择是否重要产生重大影响。我们表明,该假说的实验证据是脆弱的,且关键依赖于评估方式。对齐度通过小数据集(约1000个样本)上的互最近邻测量,当数据集扩展到数百万样本时,对齐度显著下降。在文本-音频和文本-视频对齐中也观察到相同行为。模型表示之间剩余的对齐反映的是粗粒度语义重叠,而非一致的细粒度结构。此外,Huh等人的评估是在一对一图像-标题设置中进行的,这种约束在现实的多对多设置中失效,进一步降低了测量的对齐度。我们还发现,更强的语言模型与视觉对齐度增加的趋势似乎不适用于较新的模型。总体而言,我们的发现表明,当前跨模态表示收敛的证据比后续工作所认为的要弱得多。在不同模态上训练的模型可能学习到同样丰富的世界表示,但并非相同的表示。

英文摘要

The Platonic Representation Hypothesis suggests that neural networks trained on different modalities (e.g., text and images) align and eventually converge toward the same representation of reality. If true, this has significant implications for whether modality choice matters at all. We show that the experimental evidence for this hypothesis is fragile and depends critically on the evaluation regime. Alignment is measured using mutual nearest neighbors on small datasets ($\approx$1K samples) and degrades substantially as the dataset is scaled to millions of samples. The same behavior is observed beyond text-image, for text-audio and text-video alignment. The alignment that remains between model representations reflects coarse semantic overlap rather than consistent fine-grained structure. Moreover, the evaluations in Huh et al. are done in a one-to-one image-caption setting, a constraint that breaks down in realistic many-to-many settings and further reduces measured alignment. We also find that the reported trend of stronger language models increasingly aligning with vision does not appear to hold for newer models. Overall, our findings suggest that the current evidence for cross-modal representational convergence is considerably weaker than subsequent works have taken it to be. Models trained on different modalities may learn equally rich representations of the world, just not the same one.
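The abstract above says alignment is measured with mutual nearest neighbors. As a rough sketch of that kind of metric, the snippet below computes, for each paired sample, the overlap between its k nearest neighbors in two representation spaces and averages the result. The function names and the Euclidean distance are assumptions of this sketch; the paper's exact distance, normalization, and value of k are not given in the abstract.

```python
import math


def knn_indices(features, k):
    """For each row, return the set of indices of its k nearest other rows."""
    n = len(features)
    neighbors = []
    for i in range(n):
        # Distance to every other row, sorted ascending (ties broken by index).
        dists = sorted(
            (math.dist(features[i], features[j]), j) for j in range(n) if j != i
        )
        neighbors.append({j for _, j in dists[:k]})
    return neighbors


def mutual_knn_alignment(feats_a, feats_b, k):
    """Average fraction of shared k-nearest neighbors across two feature spaces.

    feats_a[i] and feats_b[i] must describe the same underlying sample
    (for example, an image embedding and its caption embedding).
    """
    if len(feats_a) != len(feats_b):
        raise ValueError("feature lists must be paired one-to-one")
    nn_a = knn_indices(feats_a, k)
    nn_b = knn_indices(feats_b, k)
    overlaps = [len(a & b) / k for a, b in zip(nn_a, nn_b)]
    return sum(overlaps) / len(overlaps)


if __name__ == "__main__":
    image_feats = [[0.0], [1.0], [10.0], [11.0]]
    text_feats = [[0.0], [10.0], [1.0], [11.0]]
    print(mutual_knn_alignment(image_feats, image_feats, k=1))
    print(mutual_knn_alignment(image_feats, text_feats, k=1))
```

With a fixed k, adding more samples fills in the space around every point, so a neighbor match requires agreement on finer structure. This is one intuition for why a score of this kind can drop as the dataset grows.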

2604.08782 2026-06-03 cs.CL

MT-OSC: Path for LLMs that Get Lost in Multi-Turn Conversation

MT-OSC:解决大语言模型在多轮对话中迷失的路径

Jyotika Singh, Fang Tu, Miguel Ballesteros, Weiyi Sun, Sandip Ghoshal, Michelle Yuan, Yassine Benajiba, Sujith Ravi, Dan Roth

发表机构 * Oracle AI

AI总结 提出MT-OSC框架,通过后台自动压缩对话历史(Condenser Agent)减少token量,提升多轮对话性能,在13个LLM上验证有效性。

详情
AI中文摘要

大型语言模型(LLMs)在用户指令和上下文分布在多个对话轮次中时,性能会显著下降,然而多轮(MT)交互主导着聊天界面。将完整聊天历史附加到提示中的常规方法会迅速耗尽上下文窗口,导致延迟增加、计算成本升高,并且随着对话延长收益递减。我们引入了MT-OSC,一种一次性顺序压缩框架,可以在不干扰用户体验的情况下,高效自动地在后台压缩聊天历史。MT-OSC采用了一个压缩代理,该代理使用基于少样本推理的压缩器和轻量级决策器,选择性地保留必要信息,在10轮对话中减少高达72%的token数量。在13个最先进的LLM和多样化的多轮基准测试中评估,MT-OSC持续缩小了多轮性能差距——在数据集上保持或提高了准确性,同时对干扰项和无关轮次保持鲁棒性。我们的结果确立了MT-OSC作为多轮聊天的可扩展解决方案,能够在受限的输入空间内实现更丰富的上下文,降低延迟和运营成本,同时平衡性能。

英文摘要

Large language models (LLMs) suffer significant performance degradation when user instructions and context are distributed over multiple conversational turns, yet multi-turn (MT) interactions dominate chat interfaces. The routine approach of appending full chat history to prompts rapidly exhausts context windows, leading to increased latency, higher computational costs, and diminishing returns as conversations extend. We introduce MT-OSC, a One-off Sequential Condensation framework that efficiently and automatically condenses chat history in the background without disrupting the user experience. MT-OSC employs a Condenser Agent that uses a few-shot inference-based Condenser and a lightweight Decider to selectively retain essential information, reducing token counts by up to 72% in 10-turn dialogues. Evaluated across 13 state-of-the-art LLMs and diverse multi-turn benchmarks, MT-OSC consistently narrows the multi-turn performance gap - yielding improved or preserved accuracy across datasets while remaining robust to distractors and irrelevant turns. Our results establish MT-OSC as a scalable solution for multi-turn chats, enabling richer context within constrained input spaces, reducing latency and operational cost, while balancing performance.

2604.17708 2026-06-03 cs.AI

Co-evolving Agent Architectures and Interpretable Reasoning for Automated Optimization

协同进化智能体架构与可解释推理用于自动化优化

Jiahao Huang, Peilan Xu, Xiaoya Nan, Wenjian Luo

发表机构 * School of Artificial Intelligence, Nanjing University of Information Science and Technology(南京信息工程大学人工智能学院) Guangdong Provincial Key Laboratory of Novel Security Intelligence Technologies, Institute of Cyberspace Security, School of Computer Science and Technology, Harbin Institute of Technology(哈尔滨工业大学计算机科学与技术学院)

AI总结 提出EvoOR-Agent协同进化框架,通过将智能体工作流表示为活动边网络,并利用图介导的路径条件重组、多粒度语义变异和精英种群更新,实现自动化优化中的自适应协调与可解释推理。

详情
AI中文摘要

使用大语言模型(LLM)自动化运筹学(OR)仍受限于手工设计的推理-执行工作流。复杂的OR任务需要问题解释、数学建模、求解器选择、代码生成和迭代调试之间的自适应协调。为解决这一限制,我们提出了EvoOR-Agent,一个用于自动化优化的协同进化框架。该框架将智能体工作流表示为活动边(AOE)风格网络,使工作流拓扑、执行依赖和替代推理路径显式化。在此表示上,框架维护一个架构图,并通过图介导的路径条件重组、多粒度语义变异和精英种群更新来进化推理个体种群。一个基于知识库的经验获取模块进一步将可重用的OR实践注入初始化和语义变异。在异构OR基准上的实验结果表明,所提框架一致优于零样本LLM、固定流水线OR智能体和代表性进化智能体框架。案例研究和消融分析进一步表明,显式架构进化和图支持的推理轨迹搜索有助于性能提升和结构可解释性。这些结果表明,将智能体架构和推理轨迹视为可进化对象,为自适应和可解释的自动化优化提供了有效途径。

英文摘要

Automating operations research (OR) with large language models (LLMs) remains limited by hand-crafted reasoning-execution workflows. Complex OR tasks require adaptive coordination among problem interpretation, mathematical formulation, solver selection, code generation, and iterative debugging. To address this limitation, we propose EvoOR-Agent, a co-evolutionary framework for automated optimization. The framework represents agent workflows as activity-on-edge (AOE)-style networks, making workflow topology, execution dependencies, and alternative reasoning paths explicit. On this representation, the framework maintains an architecture graph and evolves a population of reasoning individuals through graph-mediated path-conditioned recombination, multi-granularity semantic mutation, and elitist population update. A knowledge-base-assisted experience-acquisition module further injects reusable OR practices into initialization and semantic variation. Empirical results on heterogeneous OR benchmarks show that the proposed framework consistently improves over zero-shot LLMs, fixed-pipeline OR agents, and representative evolutionary agent frameworks. Case studies and ablation analyses further indicate that explicit architecture evolution and graph-supported reasoning-trajectory search contribute to both performance improvement and structural interpretability. These results suggest that treating agent architectures and reasoning trajectories as evolvable objects provides an effective route toward adaptive and interpretable automated optimization.

2604.16808 2026-06-03 cs.CV

BioLip: Language-Generalizable Lip-Sync Deepfake Detection via Biomechanical Constraint Violation Modeling

BioLip: 通过生物力学约束违反建模实现语言泛化的唇同步深度伪造检测

Hao Chen, Junnan Xu

发表机构 * Independent Researcher(独立研究者)

AI总结 针对现有检测方法在生成器或语言迁移下失效的问题,提出基于唇部运动生物力学约束的轻量级三分支网络,仅利用地标坐标检测唇同步伪造,在零样本设置下对未见生成器和多种语言表现鲁棒。

Comments 13 pages, 5 figures. Keywords: Deepfake detection, lip-sync forgery, biomechanical constraints, landmark kinematics, cross-lingual generalization, video forensics, privacy-preserving inference, compression robustness

详情
AI中文摘要

现有的唇同步深度伪造检测器依赖于像素伪影或视听对应关系,两者在生成器或语言迁移下均会失效,因为它们学习的特征与训练分布绑定。我们采用不同的方法。真实的唇部运动受到组织力学和神经肌肉带宽的约束;当前的生成器通常不施加这些约束,产生的轨迹在速度、加速度和加加速度上具有升高的方差,而真实语音不会表现出这些特征。我们利用这一信号,称为时间唇部抖动,通过从64个口周地标在短滑动窗口上计算运动学统计量,并将其输入一个轻量级三分支网络。该模型仅使用地标坐标:无像素、无音频、无声纹数据。我们仅在英语数据上训练,并在零样本设置下对五个未见生成器和七种语言进行测试。

英文摘要

Existing lip-sync deepfake detectors rely on pixel artifacts or audio-visual correspondence, and both fail under generator or language shift because the features they learn are tied to the training distribution. We take a different approach. Authentic lip motion is constrained by tissue mechanics and neuromuscular bandwidth; current generators typically do not impose these constraints, producing trajectories with elevated variance in velocity, acceleration, and jerk that real speech does not exhibit. We exploit this signal, which we term temporal lip jitter, by computing kinematic statistics from 64 perioral landmarks over short sliding windows and feeding them into a lightweight three-branch network. The model uses only landmark coordinates: no pixels, no audio, and no voiceprint data. We train only on English data and test in a zero-shot setting on five unseen generators and seven languages.
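The abstract above describes computing kinematic statistics (variance of velocity, acceleration, and jerk) from landmark coordinates over short sliding windows. The snippet below is a minimal sketch of that idea for a single landmark coordinate, using finite differences. The function names, the frame rate, and the window size are hypothetical; the paper's actual feature set and windowing are not specified in the abstract.

```python
from statistics import pvariance


def finite_diff(seq, dt):
    """First-order finite difference of a 1-D sequence."""
    return [(b - a) / dt for a, b in zip(seq, seq[1:])]


def jitter_stats(coords, fps=25.0):
    """Variance of velocity, acceleration, and jerk for one coordinate track.

    coords needs at least 4 samples so that the jerk sequence is non-empty.
    The default fps is an assumed value.
    """
    dt = 1.0 / fps
    vel = finite_diff(coords, dt)
    acc = finite_diff(vel, dt)
    jerk = finite_diff(acc, dt)
    return {
        "vel_var": pvariance(vel),
        "acc_var": pvariance(acc),
        "jerk_var": pvariance(jerk),
    }


def sliding_windows(seq, size, stride):
    """Split a sequence into overlapping windows of fixed size."""
    return [seq[i:i + size] for i in range(0, len(seq) - size + 1, stride)]


if __name__ == "__main__":
    smooth = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]   # steady motion
    jittery = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]  # frame-to-frame flicker
    for name, track in (("smooth", smooth), ("jittery", jittery)):
        for window in sliding_windows(track, size=5, stride=1):
            print(name, jitter_stats(window, fps=1.0))
```

In a full pipeline, these per-window statistics would be computed for every landmark and passed to the classifier. Only coordinates are needed, which matches the abstract's point that no pixels or audio are used.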

2505.24037 2026-06-03 cs.AI

Leave it to the Specialist: Repair Sparse LLMs with Sparse Fine-Tuning via Sparsity Evolution

交给专家:通过稀疏性演化进行稀疏微调修复稀疏大语言模型

Qiao Xiao, Alan Ansell, Boqian Wu, Lu Yin, Mykola Pechenizkiy, Shiwei Liu, Decebal Constantin Mocanu

发表机构 * Eindhoven University of Technology(埃因霍温理工大学) University of Cambridge(剑桥大学) University of Luxembourg(卢森堡大学) University of Twente(特温特大学) University of Surrey(萨里大学) Tübingen AI Center(图宾根人工智能中心) Max Planck Institute for Intelligent Systems(智能系统马克斯·普朗克研究所) ELLIS Institute Tübingen(图宾根ELLIS研究所)

AI总结 提出稀疏演化微调(SEFT)框架,通过周期性重分配稀疏任务特定更新和重新激活有益剪枝权重,在保持稀疏性效率优势的同时实现稀疏大语言模型的有效下游任务适配。

详情
AI中文摘要

稀疏大语言模型为高效部署提供了有吸引力的方向,但将其适配到下游任务仍然具有挑战性。核心困难在于在不牺牲稀疏性效率优势的情况下实现有效的任务适配。现有的微调方法不适用于这种设置,因为它们要么引入额外的密集参数,要么假设固定的稀疏拓扑,限制了它们与稀疏大语言模型的兼容性。在本文中,我们提出了稀疏演化微调(SEFT),这是一个专门为稀疏大语言模型设计的微调框架。SEFT允许稀疏结构在微调过程中演化,通过周期性重分配稀疏任务特定更新,并在有益时重新激活先前剪枝的权重。同时,SEFT通过基于参数重要性的拓扑适配保留了稀疏性的效率优势。在LLaMA、DeepSeek和Mistral模型上的多个基准实验表明,与现有基线相比,SEFT在提供更强性能的同时,具有更优的内存和时间效率。我们的代码公开在:https://github.com/QiaoXiao7282/SEFT。

英文摘要

Sparse large language models (LLMs) offer an attractive direction toward efficient deployment, but adapting them to downstream tasks remains challenging. The central difficulty is to enable effective task adaptation without sacrificing the efficiency advantages of sparsity. Existing fine-tuning methods are not well-suited to this setting, as they either introduce additional dense parameters or assume a fixed sparse topology, limiting their compatibility with sparse LLMs. In this paper, we propose Sparsity Evolution Fine-Tuning (SEFT), a fine-tuning framework designed specifically for sparse LLMs. SEFT allows sparse structure to evolve during fine-tuning by periodically reallocating sparse task-specific updates and reactivating previously pruned weights when beneficial. At the same time, SEFT preserves the efficiency advantages of sparsity through topology adaptation based on parameter importance. Experiments on LLaMA, DeepSeek, and Mistral models across multiple benchmarks show that SEFT delivers stronger performance while offering superior memory and time efficiency compared to existing baselines. Our code is publicly available at: https://github.com/QiaoXiao7282/SEFT.

2604.16029 2026-06-03 cs.CL cs.LG

Cut Your Losses! Learning to Prune Paths Early for Efficient Parallel Reasoning

减少损失!学习早期剪枝路径以实现高效并行推理

Jiaxi Bi, Tongxu Luo, Wenyu Du, Zhengyang Tang, Benyou Wang

发表机构 * The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)) Shenzhen Loop Area Institute(深圳河套学院) USTB(北京科技大学) DualityRL

AI总结 提出一种基于可学习内部信号的路径剪枝方法STOP,通过前缀级剪枝减少并行推理中的无效路径,显著提升大型推理模型的效率与性能。

Comments 9 pages, 7 figures

详情
AI中文摘要

并行推理增强了大型推理模型(LRMs),但由于早期错误导致的无效路径,其成本高昂。为了缓解这一问题,前缀级的路径剪枝至关重要,然而现有研究缺乏标准化框架,较为零散。在这项工作中,我们提出了第一个系统的路径剪枝分类法,根据信号来源(内部与外部)和可学习性(可学习与不可学习)对方法进行分类。这种分类揭示了可学习内部方法的未开发潜力,促使我们提出STOP(用于剪枝的超级令牌)。在参数规模从1.5B到20B的LRMs上的广泛评估表明,与现有基线相比,STOP在效果和效率上均表现出优越性。此外,我们在不同计算预算下严格验证了STOP的可扩展性——例如,在固定计算预算下,将GPT-OSS-20B在AIME25上的准确率从84%提升至近90%。最后,我们将发现提炼为形式化的经验指南,以促进最优的实际部署。代码、数据和模型可在 https://bijiaxihh.github.io/STOP 获取。

英文摘要

Parallel reasoning enhances Large Reasoning Models (LRMs) but incurs prohibitive costs due to futile paths caused by early errors. To mitigate this, path pruning at the prefix level is essential, yet existing research remains fragmented without a standardized framework. In this work, we propose the first systematic taxonomy of path pruning, categorizing methods by their signal source (internal vs. external) and learnability (learnable vs. non-learnable). This classification reveals the unexplored potential of learnable internal methods, motivating our proposal of STOP (Super TOken for Pruning). Extensive evaluations across LRMs ranging from 1.5B to 20B parameters demonstrate that STOP achieves superior effectiveness and efficiency compared to existing baselines. Furthermore, we rigorously validate the scalability of STOP under varying compute budgets - for instance, boosting GPT-OSS-20B accuracy on AIME25 from 84% to nearly 90% under fixed compute budgets. Finally, we distill our findings into formalized empirical guidelines to facilitate optimal real-world deployment. Code, data and models are available at https://bijiaxihh.github.io/STOP

2604.15744 2026-06-03 cs.CL

Language, Place, and Social Media: Geographic Dialect Alignment in New Zealand

语言、地点与社交媒体:新西兰的地理方言对齐

Sidney Wong

发表机构 * Computer Science and Software Engineering, University of Canterbury(坎特伯雷大学计算机科学与软件工程系) Department of Linguistics, University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校语言学系) School of Psychology, Speech and Hearing, University of Canterbury(坎特伯雷大学心理学、言语与听力学院) Department of Linguistics, University of Canterbury(坎特伯雷大学语言学系)

AI总结 本研究通过整合用户感知的定性分析与计算方法,探究新西兰Reddit社区中地理方言对齐现象,发现地点相关社区形成连续言语社区,且高级语言建模揭示了语义变异与变化。

Comments PhD thesis

详情
AI中文摘要

本论文研究了基于地点的社交媒体社区中的地理方言对齐现象,重点关注新西兰相关的Reddit社区。通过将用户感知的定性分析与计算方法相结合,研究考察了语言使用如何反映地点身份,以及基于用户提供的词汇、形态句法和语义变量的语言变异与变化模式。研究结果表明,用户通常将语言与地点联系起来,地点相关社区形成了一个连续的言语社区,尽管地理方言社区与地点相关社区之间的对齐仍然复杂。包括静态和历时Word2Vec语言嵌入在内的高级语言建模揭示了基于地点的社区之间的语义变异,以及新西兰英语中有意义的语义变化。研究创建了一个包含42.6亿未处理单词的语料库,为未来研究提供了宝贵资源。总体而言,结果凸显了社交媒体作为社会语言学自然实验室的潜力。

英文摘要

This thesis investigates geographic dialect alignment in place-informed social media communities, focussing on New Zealand-related Reddit communities. By integrating qualitative analyses of user perceptions with computational methods, the study examines how language use reflects place identity and patterns of language variation and change based on user-informed lexical, morphosyntactic, and semantic variables. The findings show that users generally associate language with place, and place-related communities form a contiguous speech community, though alignment between geographic dialect communities and place-related communities remains complex. Advanced language modelling, including static and diachronic Word2Vec language embeddings, revealed semantic variation across place-based communities and meaningful semantic shifts within New Zealand English. The research involved the creation of a corpus containing 4.26 billion unprocessed words, which offers a valuable resource for future study. Overall, the results highlight the potential of social media as a natural laboratory for sociolinguistic inquiry.

2604.12176 2026-06-03 cs.AI

Evaluating Relational Reasoning in LLMs with REL

使用REL评估大语言模型中的关系推理能力

Lukas Fesser, Yasha Ektefaie, Ada Fang, Sham M. Kakade, Marinka Zitnik

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 本文通过关系复杂度(RC)定义推理难度,构建涵盖代数、化学和生物学的生成式基准REL,发现前沿大语言模型在RC增加时性能持续下降,表明模型在高元关系绑定上存在固有局限。

Comments ICML 2026

详情
AI中文摘要

关系推理是推断同时绑定多个实体、属性或变量的关系的能力。这种能力对科学推理至关重要,但现有对大语言模型关系推理的评估通常侧重于结构化输入(如表格、图或合成任务),并未分离高元关系绑定带来的困难。我们通过关系复杂度(RC)来研究这个问题,将其定义为应用一个关系时必须同时绑定的独立实体或操作数的最小数量。RC提供了一种原则性的方式来改变推理难度,同时控制输入大小、词汇和表示选择等混杂因素。基于RC,我们引入了REL,一个涵盖代数、化学和生物学的生成式基准框架,在每个领域内变化RC。在前沿大语言模型中,当RC增加时,性能持续且单调下降,即使实体总数保持不变。这种失败模式在增加测试时计算量和上下文学习时仍然存在,表明这一限制与所需关系绑定的元数有关,而非推理步骤不足或缺乏示例暴露。我们的结果识别了当前模型难以应对的高元推理场景,并促使通过关系复杂度的视角重新审视基准测试。

英文摘要

Relational reasoning is the ability to infer relations that jointly bind multiple entities, attributes, or variables. This ability is central to scientific reasoning, but existing evaluations of relational reasoning in large language models often focus on structured inputs such as tables, graphs, or synthetic tasks, and do not isolate the difficulty introduced by higher-arity relational binding. We study this problem through the lens of Relational Complexity (RC), which we define as the minimum number of independent entities or operands that must be simultaneously bound to apply a relation. RC provides a principled way to vary reasoning difficulty while controlling for confounders such as input size, vocabulary, and representational choices. Building on RC, we introduce REL, a generative benchmark framework spanning algebra, chemistry, and biology that varies RC within each domain. Across frontier LLMs, performance degrades consistently and monotonically as RC increases, even when the total number of entities is held fixed. This failure mode persists with increased test-time compute and in-context learning, suggesting a limitation tied to the arity of the required relational binding rather than to insufficient inference steps or lack of exposure to examples. Our results identify a regime of higher-arity reasoning in which current models struggle, and motivate re-examining benchmarks through the lens of relational complexity.

2604.10169 2026-06-03 cs.AI cs.LG

MAVEN-T: Reinforced Heterogeneous Distillation for Real-Time Multi-Agent Trajectory Prediction

MAVEN-T:用于实时多智能体轨迹预测的强化异构蒸馏

Wenchang Duan, Zhenguo Gao, Jinguo Xian, Yi Shi

发表机构 * School of Mathematical Sciences, Shanghai Jiao Tong University(上海交通大学数学科学学院) Bio-X Institutes, Key Laboratory for the Genetics of Developmental and Neuropsychiatric Disorders, Shanghai Jiao Tong University(上海交通大学Bio-X研究院、发育与神经精神疾病遗传学重点实验室) Shanghai Key Laboratory of Psychotic Disorders, Brain Science and Technology Research Center, Shanghai Jiao Tong University(上海精神疾病重点实验室、脑科学与技术研究中心,上海交通大学)

AI总结 提出MAVEN-T框架,通过高容量教师模型和紧凑学生模型的异构蒸馏,结合强化学习优化,实现实时多智能体轨迹预测,在多个数据集上达到高精度与低延迟。

详情
AI中文摘要

轨迹预测是自动驾驶系统的关键组成部分,因为未来运动直接影响碰撞检查、行为规划和控制。在密集交互、异构行为、多模态未来和有限车载计算条件下,该任务仍然具有挑战性。现有的图、注意力和生成式预测器改进了交互推理或不确定性建模,但其高容量设计通常成本高昂,难以实时部署。轻量级预测器和传统蒸馏降低了推理成本,但通常依赖静态模仿,并未明确纠正与安全相关的教师偏差。本文提出了MAVEN-T,一种用于实时多智能体轨迹预测的强化异构蒸馏框架。高容量教师模型通过环绕感知图编码器建模有向局部交互,结合高效时间滤波与移位窗口空间注意力,并通过稀疏混合专家头解码特定机动未来。紧凑的GRU-挤压激励学生模型配备低秩自适应策略头,通过特征级、注意力级和语义级蒸馏进行训练。为了与下游行为对齐,学生模型进一步通过近端策略优化奖励进行细化,奖励包括碰撞避免、舒适性和进度,同时复杂度感知课程和弹性权重巩固稳定了分阶段训练。在NGSIM、HighD、MoCAD、Argoverse 2和Waymo开放运动数据集上的实验评估了准确性、效率、泛化性、鲁棒性和闭环安全性。学生模型在NVIDIA Jetson AGX Orin上实现了6.2倍参数压缩、3.7倍推理加速和14.6毫秒延迟,同时保持竞争性准确性。

英文摘要

Trajectory prediction is a key component of autonomous driving systems because future motions directly affect collision checking, behavior planning, and control. The task remains challenging under dense interactions, heterogeneous behaviors, multimodal futures, and limited on-board computation. Existing graph, attention, and generative predictors improve interaction reasoning or uncertainty modeling, but their high-capacity designs are often costly for real-time deployment. Lightweight predictors and conventional distillation reduce inference cost, yet usually rely on static imitation and do not explicitly correct safety-relevant teacher bias. This paper proposes MAVEN-T, a reinforced heterogeneous distillation framework for real-time multi-agent trajectory prediction. A high-capacity teacher models directed local interactions with a surround-aware graph encoder, combines efficient temporal filtering with shifted-window spatial attention, and decodes maneuver-specific futures through a sparse Mixture-of-Experts head. A compact GRU-Squeeze-and-Excitation student with a Low-Rank Adapted policy head is trained by feature-, attention-, and semantic-level distillation. To align prediction with downstream behavior, the student is further refined by Proximal Policy Optimization rewards for collision avoidance, comfort, and progress, while a complexity-aware curriculum and Elastic Weight Consolidation stabilize stage-wise training. Experiments on NGSIM, HighD, MoCAD, Argoverse 2, and the Waymo Open Motion Dataset evaluate accuracy, efficiency, generalization, robustness, and closed-loop safety. The student achieves 6.2$\times$ parameter compression, 3.7$\times$ inference acceleration, and 14.6 ms latency on an NVIDIA Jetson AGX Orin while maintaining competitive accuracy.

2510.02779 2026-06-03 cs.LG

Optimal Rates for Generalization of Gradient Descent for Deep ReLU Classification

深度ReLU分类中梯度下降泛化的最优速率

Yuanfan Li, Yunwen Lei, Zheng-Chu Guo, Yiming Ying

发表机构 * School of Mathematical Sciences, Zhejiang University(浙江大学数学科学学院) Department of Mathematics, The University of Hong Kong(香港大学数学系) School of mathematics and statistics, University of Sydney(悉尼大学数学与统计学学院)

AI总结 针对深度ReLU网络,通过权衡优化与泛化误差,在NTK可分离假设下证明了梯度下降的泛化误差率为$\widetilde{O}(L^6 / (n γ^2))$,与SVM最优率仅差深度相关因子,关键技术是控制参考模型附近的激活模式以得到更紧的Rademacher复杂度界。

Comments Published in NeurIPS 2025

详情
AI中文摘要

近期进展显著提升了我们对深度神经网络中梯度下降(GD)方法泛化性能的理解。一个自然且基本的问题是:GD能否达到核方法中建立的最小最大最优速率?现有结果要么给出次优的$O(1/\sqrt{n})$速率,要么关注具有光滑激活函数的网络,导致对网络深度$L$的指数依赖。本文通过仔细权衡优化与泛化误差,为深度ReLU网络的GD建立了最优泛化速率,仅对深度有多项式依赖。具体地,在数据以间隔$γ$为NTK可分离的假设下,我们证明了过风险率为$\widetilde{O}(L^6 / (n γ^2))$,这与最优SVM型速率$\widetilde{O}(1 / (n γ^2))$仅差深度相关因子。一项关键的技术贡献是我们对参考模型附近激活模式的新颖控制,从而为梯度下降训练的深度ReLU网络获得了更紧的Rademacher复杂度界。

英文摘要

Recent advances have significantly improved our understanding of the generalization performance of gradient descent (GD) methods in deep neural networks. A natural and fundamental question is whether GD can achieve generalization rates comparable to the minimax optimal rates established in the kernel setting. Existing results either yield suboptimal rates of $O(1/\sqrt{n})$, or focus on networks with smooth activation functions, incurring exponential dependence on network depth $L$. In this work, we establish optimal generalization rates for GD with deep ReLU networks by carefully trading off optimization and generalization errors, achieving only polynomial dependence on depth. Specifically, under the assumption that the data are NTK-separable with margin $γ$, we prove an excess risk rate of $\widetilde{O}(L^6 / (n γ^2))$, which aligns with the optimal SVM-type rate $\widetilde{O}(1 / (n γ^2))$ up to depth-dependent factors. A key technical contribution is our novel control of activation patterns near a reference model, enabling a sharper Rademacher complexity bound for deep ReLU networks trained with gradient descent.

2604.07366 2026-06-03 cs.LG

Flow Learners for PDEs: Toward a Physics-to-Physics Paradigm for Scientific Computing

PDE的流学习器:迈向科学计算的物理到物理范式

Yilong Dai, Shengyu Chen, Xiaowei Jia, Runlong Yu

发表机构 * The University of Alabama(阿拉巴马大学) University of Pittsburgh(匹兹堡大学)

AI总结 本文提出流学习器(flow learners)范式,通过参数化传输向量场并积分生成轨迹,将PDE求解从状态预测转向物理上允许的未来传输建模,实现连续时间预测、不确定性量化及物理感知求解器设计。

详情
AI中文摘要

偏微分方程(PDE)支配着科学与工程中几乎所有的物理过程,但大规模求解仍然代价高昂。生成式AI已经改变了语言、视觉和蛋白质科学,但学习的PDE求解器尚未经历类似的转变。现有范式各自捕捉了问题的一部分。物理信息神经网络嵌入残差结构,尽管在刚性、多尺度或大区域情况下通常难以优化。神经算子跨实例进行摊销,尽管它们通常继承快照预测的求解视图,并可能在长滚动中退化。基于扩散的求解器对不确定性建模,尽管它们通常建立在仍以状态回归为中心的求解器模板上。我们认为核心问题是用于训练学习求解器的抽象。许多模型被要求预测状态,而许多科学设置需要建模不确定性如何在约束动力学中移动。相关对象是物理上允许的未来上的传输。这激发了流学习器:参数化传输向量场并通过积分生成轨迹的模型,呼应定义PDE演化的连续动力学。这种物理到物理的对齐支持连续时间预测、原生不确定性量化以及物理感知求解器设计的新机会。我们解释了为什么基于传输的学习为学习的PDE求解提供了更强的组织原则,并概述了从这一转变中产生的研究议程。

英文摘要

Partial differential equations (PDEs) govern nearly every physical process in science and engineering, but solving them at scale remains prohibitively expensive. Generative AI has transformed language, vision, and protein science, but learned PDE solvers have not undergone a comparable shift. Existing paradigms each capture part of the problem. Physics-informed neural networks embed residual structure, although they are often difficult to optimize in stiff, multiscale, or large-domain regimes. Neural operators amortize across instances, although they commonly inherit a snapshot-prediction view of solving and can degrade over long rollouts. Diffusion-based solvers model uncertainty, although they are often built on a solver template that still centers on state regression. We argue that the core issue is the abstraction used to train learned solvers. Many models are asked to predict states, while many scientific settings require modeling how uncertainty moves through constrained dynamics. The relevant object is transport over physically admissible futures. This motivates flow learners: models that parameterize transport vector fields and generate trajectories through integration, echoing the continuous dynamics that define PDE evolution. This physics-to-physics alignment supports continuous-time prediction, native uncertainty quantification, and new opportunities for physics-aware solver design. We explain why transport-based learning offers a stronger organizing principle for learned PDE solving and outline the research agenda that follows from this shift.
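The abstract above characterizes flow learners as models that parameterize a transport vector field and generate trajectories through integration. As a toy sketch of the integration half of that idea, the snippet below rolls a state forward under a vector field with explicit Euler steps. The `decay_field` function is a hand-written stand-in for a learned network, and the choice of Euler integration is an assumption of this sketch, not something the abstract specifies.

```python
import math


def integrate(vector_field, x0, t0, t1, steps):
    """Explicit Euler rollout of dx/dt = vector_field(x, t) from t0 to t1.

    Returns the full trajectory as a list of states (steps + 1 entries).
    """
    dt = (t1 - t0) / steps
    x, t = list(x0), t0
    trajectory = [list(x)]
    for _ in range(steps):
        v = vector_field(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
        t += dt
        trajectory.append(list(x))
    return trajectory


def decay_field(x, t):
    """Hypothetical stand-in for a learned field: dx/dt = -x."""
    return [-xi for xi in x]


if __name__ == "__main__":
    traj = integrate(decay_field, x0=[1.0, 2.0], t0=0.0, t1=1.0, steps=1000)
    print(traj[-1])                       # numerical state at t = 1
    print([math.exp(-1.0), 2.0 * math.exp(-1.0)])  # exact solution for comparison
```

Because the state is defined at every intermediate time of the rollout, the same field can be queried at any time horizon, which is the sense in which this style of model supports continuous-time prediction.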