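arXivDaily daily academic digest: synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems, and related fields.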

2606.09365 2026-06-16 cs.AI cs.CL 版本更新

Experience Makes Skillful: Enabling Generalizable Medical Agent Reasoning via Self-Evolving Skill Memory

经验造就熟练：通过自进化技能记忆实现可泛化的医疗智能体推理

Haoran Sun, Wenjie Li, Yujie Zhang, Zekai Lin, Fanrui Zhang, Kaitao Chen, Xingqi He, Yichen Li, Mianxin Liu, Lei Liu, Yankai Jiang

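Affiliations: Fudan University; Shanghai Artificial Intelligence Laboratory; Shanghai Innovation Institute; Huazhong University of Science and Technology

AI summary: Proposes SkeMex, a framework that uses skill memory to let medical agents self-evolve after deployment without updating model weights; it outperforms existing memory-based agents on clinical tasks.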

AI中文摘要

医疗智能体系统越来越期望支持交互式临床决策，而不仅仅是静态问答。在这种设置中，有效的智能体必须跨演化病例重用先前经验，然而现有的记忆机制通常保留原始历史轨迹，这些轨迹冗余、嘈杂且难以管理。更重要的是，它们很少区分哪些记忆对未来推理真正有用。这限制了它们积累紧凑且可靠的经验以进行长期临床推理的能力。为弥补这一差距，我们提出SkeMex，一种部署后自进化框架，通过基于技能的记忆改进医疗智能体，无需更新模型权重。SkeMex将信息丰富的交互轨迹提炼为结构化技能，编码可重用的程序性知识，并将其组织成涵盖通用、任务特定和行动级经验的多分支存储库。为确定哪些记忆应被重用和保留，SkeMex从环境反馈中估计上下文相关的效用，并用其指导价值感知的检索和存储库治理。闭环的“读-写-评估-治理”生命周期通过写入新技能、更新效用、促进有用记忆和移除有害条目进一步支持持续进化。跨不同临床任务的实验表明，SkeMex在离线和在线设置中均持续优于代表性记忆型智能体。它还能跨模型骨干泛化并支持可迁移的技能记忆。所有数据和代码将公开发布。

英文摘要

Medical agent systems are increasingly expected to support interactive clinical decision making rather than only static question answering. In such settings, effective agents must reuse prior experience across evolving cases, yet existing memory mechanisms often retain raw historical traces that are redundant, noisy, and difficult to govern. More importantly, they rarely distinguish which memories are truly useful for future reasoning. This limits their ability to accumulate compact and reliable experience for long-horizon clinical reasoning. To close this gap, we propose SkeMex, a post-deployment self-evolution framework that improves medical agents through a skill-based memory without updating model weights. SkeMex distills informative interaction trajectories into structured skills that encode reusable procedural knowledge, and organizes them into a multi-branch repository spanning general, task-specific, and action-level experience. To determine which memories should be reused and retained, SkeMex estimates context-dependent utility from environment feedback and uses it to guide value-aware retrieval and repository governance. A closed-loop "Read-Write-Assess-Govern" lifecycle further supports continual evolution by writing new skills, updating utilities, promoting useful memories, and removing harmful entries. Experiments across diverse clinical tasks show that SkeMex consistently outperforms representative memory-based agents in both offline and online settings. It also generalizes across model backbones and supports transferable skill memory. All data and code will be released publicly.
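The assess, read, and govern steps named in the abstract can be illustrated with a small sketch. It is not the SkeMex implementation: the `Skill` class, the running-average update, the additive retrieval score, and every threshold are assumptions made for illustration, and utility is a single scalar per skill here rather than the context-dependent estimate the abstract describes.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Skill:
    name: str
    utility: float = 0.0  # running estimate of how helpful the skill has been
    uses: int = 0         # number of feedback signals folded in so far


def assess(skill: Skill, reward: float, lr: float = 0.5) -> None:
    """Move the utility estimate toward the observed environment reward."""
    skill.utility += lr * (reward - skill.utility)
    skill.uses += 1


def read(skills: List[Skill], similarity: Dict[str, float],
         top_k: int = 2, weight: float = 0.5) -> List[Skill]:
    """Value-aware retrieval: rank by query similarity plus weighted utility."""
    ranked = sorted(
        skills,
        key=lambda s: (-(similarity.get(s.name, 0.0) + weight * s.utility), s.name),
    )
    return ranked[:top_k]


def govern(skills: List[Skill], floor: float = -0.2, min_uses: int = 2) -> List[Skill]:
    """Drop skills that have been tried enough times and still look harmful."""
    return [s for s in skills if s.uses < min_uses or s.utility >= floor]
```

In this toy version a skill with high textual similarity can still lose its retrieval slot if feedback has repeatedly been negative, which is the behavior the phrase "value-aware retrieval" suggests.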

2606.07226 2026-06-16 cs.LG cs.AI cs.CL 版本更新

DEFINED: A Data-Efficient Computational Framework for Fine-Grained Creativity Assessment in Debate Scenarios

DEFINED: 辩论场景中细粒度创造力评估的数据高效计算框架

Tongzhou Yu, Mingjia Li, Hong Qian, Wenkai Wang, Zongbao Zhang, Yaoyu Jiang, Xiangfeng Wang, Aimin Zhou, Jiajun Guo

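Affiliations: Nanjing University; Shanghai Innovation Institute; East China Normal University

AI summary: Proposes DEFINED, a framework that combines a hierarchical eight-dimensional metric system, a pre-trained language model, and a mixed-granularity training strategy to deliver data-efficient, fine-grained automatic creativity assessment in debate scenarios, outperforming existing methods.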

Comments Accepted by KDD 2026

DOI: 10.1145/3770855.3817874

AI中文摘要

人类创造力已成为大语言模型时代的关键能力。在复杂、开放环境中评估创造力是数据挖掘领域的一大挑战，目前受限于对标准化简单任务的依赖以及细粒度专家数据的稀缺。作为生态有效的评估场景，辩论反映了创造力的多个维度，涵盖发散思维和收敛思维。此外，辩论是一个数据丰富的领域，拥有大量公开可获取的材料。当前主流的自动评分方法难以适应辩论等复杂场景，因此仍然依赖昂贵的人工评估。为此，本文提出DEFINED，一种数据高效的计算框架，用于辩论场景中的细粒度创造力评估。DEFINED通过层次化的八维指标体系操作化辩论创造力，采用预训练自回归语言模型，并配备支持细粒度和粗粒度评估的层次化评分头。从真实辩论比赛中获取陈述及其相关专家评分，并采用约束数据增强策略以解决原始数据中的精英偏差。DEFINED采用混合粒度训练策略，能够从训练有素的研究生专家提供的有限细粒度监督中实现鲁棒学习。为严格验证超越合成基准的生态效度，我们纳入了一项针对辩论新手参与者的实证研究，利用这些真实数据作为中低水平人群的定性案例研究。在我们的评估协议中，评分模型实现了准确且稳定的评分，优于基于提示的大语言模型评估器和现有的辩论评分方法。

英文摘要

Human creativity has emerged as a critical competency in the era of large language models. Assessing creativity in complex, open-ended environments is a grand challenge in data mining, currently hindered by a reliance on standardized simple tasks and the scarcity of fine-grained expert data. As an ecologically valid assessment context, debate reflects multiple dimensions of creativity, encompassing both divergent thinking and convergent thinking. Moreover, debate is a data-rich domain, with a large volume of publicly accessible materials. Current mainstream automated scoring methods are poorly suited to complex settings such as debate, and therefore still rely on costly human evaluation. To this end, this paper proposes DEFINED, a data-efficient computational framework for fine-grained creativity assessment in debate scenarios. DEFINED operationalizes debate creativity through a hierarchical eight-dimensional metric system, implemented via a pre-trained autoregressive language model with a hierarchical scoring head that supports both fine-grained and coarse-grained evaluation. Statements and their associated expert scores were obtained from authentic debate competitions, and a constrained data augmentation strategy was employed to address the elite bias inherent in the original data. DEFINED adopts a mixed-granularity training strategy enabling robust learning from limited fine-grained supervision annotated by trained graduate experts. To rigorously validate ecological validity beyond synthetic benchmarks, we incorporate an empirical study with debate-naive participants, utilizing these authentic data to serve as a qualitative case study for mid-to-low proficiency populations. Across our evaluation protocol, our scoring model achieves accurate and stable scoring, outperforming prompt-based large language model evaluators and existing debate scoring methods.

2606.07082 2026-06-16 cs.LG cs.AI 版本更新

On the Geometry of On-Policy Distillation

论在线策略蒸馏的几何结构

Zhennan Shen, Yanshu Li, Qingyu Yin, Chak Tou Leong, Zhilin Wang, Yanxu Chen, Rongduo Han, Sunbowen Lee, Yi R. Fung

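Affiliations: HKUST (Hong Kong University of Science and Technology); UT Austin (University of Texas at Austin); Zhejiang University; Hong Kong PolyU (The Hong Kong Polytechnic University); USTC (University of Science and Technology of China); BUPT (Beijing University of Posts and Telecommunications); Nankai University; BIT (Beijing Institute of Technology)

AI summary: Using parameter-space diagnostics, the paper shows that on-policy distillation (OPD) update trajectories have distinctive geometric properties, including a relaxed off-principal regime and subspace locking, indicating that OPD is not merely an intermediate point between SFT and RLVR.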

Comments 17 pages, 8 figures

AI中文摘要

在线策略蒸馏（OPD）越来越多地被用于改进大型语言模型的推理能力，但其训练动态仍鲜为人知。我们刻画了OPD更新在参数空间中的轨迹，并将其与监督微调（SFT）和可验证奖励强化学习（RLVR）进行了比较。一套参数空间诊断一致地将OPD置于松弛的离主成分区域：与SFT相比，其更新影响更少的权重，并更强烈地避开主方向；而与RLVR相比，其约束更宽松。除了这种静态定位外，OPD还表现出子空间锁定：其累积更新迅速进入一个狭窄的低维通道。将训练限制在早期形成的更新子空间内能保持OPD的性能，但会严重降低SFT，表明该锁定子空间对OPD在功能上是充分的。控制实验进一步表明，稀疏化更新令牌和将rollout生成移至离策略能保持秩动态，而将OPD目标与RLVR混合则会改变它们。总体而言，这些结果表明OPD不仅仅是SFT和RLVR之间的中间点，而是在参数空间中诱导出自身独特的更新几何结构。

英文摘要

On-policy distillation (OPD) is increasingly used to improve large language model reasoning, but its training dynamics remain poorly understood. We characterize the trajectory of OPD updates in parameter space and compare it with supervised fine-tuning (SFT) and reinforcement learning with verifiable rewards (RLVR). A suite of parameter-space diagnostics consistently places OPD in a relaxed off-principal regime: compared with SFT, its updates affect fewer weights and avoid principal directions more strongly, while compared with RLVR, they remain less tightly constrained. Beyond this static localization, OPD exhibits subspace locking: its cumulative updates rapidly enter a narrow low-dimensional channel. Constraining training to the update subspace formed early in training preserves OPD performance but substantially degrades SFT, indicating that the locked subspace is functionally sufficient for OPD. Control experiments further show that sparsifying the update tokens and shifting rollout generation off-policy preserve the rank dynamics, whereas mixing the OPD objective with RLVR changes them. Overall, these results suggest that OPD is not merely an intermediate point between SFT and RLVR, but induces its own update geometry in parameter space.

2606.06302 2026-06-16 cs.LG cs.SE 版本更新

Tangram: Unlocking Non-Uniform KV Cache Compression for Efficient Multi-turn LLM Serving

Tangram: 解锁非均匀KV缓存压缩以实现高效的多轮LLM服务

Hyungmin Kim, Minsoo Kim, Hongseok Kim, Jungwook Choi

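Affiliations: Hanyang University; Rebellions, Republic of Korea

AI summary: To relieve the GPU memory and bandwidth pressure caused by linear KV cache growth in multi-turn LLM serving, proposes Tangram, a system that manages non-uniform KV caches efficiently through three techniques (deterministic budget allocation, head-group paging, and ahead-of-time load balancing) and improves throughput by up to 2.6×.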

Comments 13 pages, 15 figures

AI中文摘要

多轮大语言模型（LLM）服务对于一致的用户体验至关重要，但键值（KV）缓存的线性增长给GPU内存和带宽带来了巨大压力。非均匀KV压缩通过考虑每个KV缓存的重要性来有效保留更多信息。然而，这种KV缓存的异质性带来了各种系统挑战——包括内存碎片、调度复杂性和内核利用率降低——这些共同导致现有LLM服务系统的显著低效。为了克服这些挑战，我们提出了Tangram，一种新颖的服务系统，旨在使非均匀KV缓存变得实用。Tangram通过三种核心技术解决系统低效问题：（1）确定性预算分配根据每个头的内在模式为其分配静态内存占用，完全消除动态调度开销和预填充停滞；（2）头组页面将具有相似保留需求的注意力头聚类，并使用独立的向量化页表进行管理，从而最大化物理内存回收；（3）提前（AOT）负载均衡利用静态预算配置文件确保均匀的GPU利用率，无需运行时开销。实验结果表明，与现有基线相比，Tangram在完全保持模型准确性的同时，吞吐量提升高达2.6倍。我们的实现已在https://github.com/aiha-lab/TANGRAM公开。

英文摘要

Multi-turn LLM serving accumulates dialogue history whose Key-Value (KV) cache grows with every turn and every user, quickly exceeding the model weights themselves and making memory -- not compute -- the binding constraint on throughput. Non-uniform KV compression, which allocates heterogeneous budgets across attention heads, preserves accuracy far better than uniform schemes, yet remains impractical: modern serving stacks assume identical KV lengths across heads, so heterogeneity traps freed memory as page fragmentation, spends up to 25% of prefill time reclaiming scattered pages, and skews GPU workloads that inflate decode latency by up to $1.7\times$ or burn 15--20% of each decode step on re-planning. We observe that this heterogeneity need not be discovered at runtime: head-wise retention follows a two-level structural regularity -- an input-invariant head ranking with narrowly bounded per-head ratios -- that can be calibrated offline from as few as 50 samples. Building on this insight, we present Tangram, a serving framework that statically resolves what prior systems handle dynamically: Budget Reservation fixes each head's post-compression footprint at scheduling time, eliminating page reclamation; Ragged Paging clusters similar-budget heads into independent page tables, turning fragmentation into reclaimable memory; and Ahead-of-Time Load Balancing precomputes balanced GPU partitions with zero runtime planning. Implemented on vLLM, Tangram serves as a drop-in substrate for existing non-uniform compression methods, matching their accuracy while improving end-to-end throughput by up to $2.6\times$ over the full-KV baseline. Our implementation is publicly available at https://github.com/aiha-lab/TANGRAM.
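Because per-head budgets are fixed before serving, balanced partitions can be computed once, offline. The abstract does not say how Tangram computes them; the sketch below swaps in the classic greedy longest-processing-time (LPT) heuristic as a stand-in, and the function name, the budget dictionary, and the unit of "budget" (retained KV tokens per head) are all hypothetical.

```python
import heapq
from typing import Dict, List


def balance_heads(head_budgets: Dict[int, int], num_partitions: int) -> List[List[int]]:
    """Greedy LPT: place heads, largest budget first, on the lightest partition."""
    # Min-heap of (current load, partition index).
    heap = [(0, p) for p in range(num_partitions)]
    heapq.heapify(heap)
    partitions: List[List[int]] = [[] for _ in range(num_partitions)]
    # Sort by descending budget; break ties by head id for determinism.
    for head, budget in sorted(head_budgets.items(), key=lambda kv: (-kv[1], kv[0])):
        load, p = heapq.heappop(heap)
        partitions[p].append(head)
        heapq.heappush(heap, (load + budget, p))
    return partitions


# Hypothetical static budgets (retained KV tokens) for six attention heads.
budgets = {0: 8, 1: 7, 2: 6, 3: 5, 4: 4, 5: 2}
plan = balance_heads(budgets, num_partitions=2)
```

Since the budgets never change at runtime in this setup, the resulting plan can be stored and reused for every request, which is the sense in which planning cost at decode time drops to zero.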

2606.06176 2026-06-16 cs.CV 版本更新

RQUL-UIE: Revitalizing Quality-Unstable Labels for Underwater Image Enhancement via In-Dataset Self-Supervision

RQUL-UIE: 通过数据集内自监督重振质量不稳定标签用于水下图像增强

Haochen Hu, Yanrui Bin, Chih-yung Wen, Bing Wang

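Affiliations: The Hong Kong Polytechnic University

AI summary: Proposes a diffusion-based, in-dataset self-supervised learning strategy that scores label quality, quantizes the scores into noise levels for level-wise denoising supervision, and adds a Fourier-based refinement network, making effective use of quality-unstable labels to improve underwater image enhancement.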

AI中文摘要

水下图像增强对于减轻水介质引起的退化至关重要。尽管基于学习的方法取得了显著进展，但大多数依赖于具有不稳定标签质量的配对数据集，这限制了模型性能。本文提出了一种基于扩散的数据集内自监督学习策略，旨在利用训练标签的质量分布。具体地，我们通过预训练扩散模型的语义感知嵌入以无需训练的方式评估标签质量。这些质量分数随后被量化为噪声级别索引，指导多步去噪过程以进行级别监督。该机制防止低质量标签降低模型性能，同时最大化其在训练中的效用。此外，引入基于傅里叶的细化网络以显式重建高频分量。大量评估表明，我们的方法在恢复质量上始终优于最先进的方法。代码和预训练模型将在接收后提供链接。

英文摘要

Underwater Image Enhancement (UIE) is essential for mitigating degradations caused by the water medium. Although learning-based methods have advanced significantly, most rely on paired datasets with unstable label quality, which bottlenecks model performance. This paper proposes a diffusion-based, in-dataset self-supervised learning strategy designed to exploit the quality distribution of training labels. Specifically, we evaluate label quality via semantic perception embeddings from a pre-trained diffusion model in a training-free manner. These quality scores are subsequently quantized into noise-level indices, guiding a multi-step denoising process for level-wise supervision. This mechanism prevents low-quality labels from degrading the model while maximizing their utility during training. Furthermore, a Fourier-based refinement network is incorporated to explicitly reconstruct high-frequency components. Extensive evaluations demonstrate that our method consistently outperforms SOTA approaches in restoration quality. The code and pre-trained model will be released upon acceptance.

2606.06007 2026-06-16 cs.LG 版本更新

Diffusion Models for Adaptive Sequential Data Generation

自适应序列数据生成的扩散模型

Haoyang Cao, Minshuo Chen, Yinbin Han, Renyuan Xu

Affiliations: Department of Applied Mathematics and Statistics, Data Science and AI Institute, and Mathematical Institute for Data Science, Johns Hopkins University; Department of Industrial Engineering and Management Sciences, Northwestern University; Department of Management Science and Engineering, Stanford University

AI summary: Proposes a sequential forward-backward diffusion framework for generating adapted time series data: noise is injected and removed progressively along the sequence, and generation is conditioned on the history to ensure adaptedness. A new score-matching objective enables efficient parallel training, and the method is validated on synthetic data and on mean-variance optimal portfolio construction.

Comments 38 pages

AI中文摘要

生成逼真的合成序列数据在运筹学、金融、医疗、能源系统和科学计算等实际应用中至关重要，这些领域使用时间索引观测进行预测、模拟、风险评估和数据驱动决策。虽然扩散模型在生成静态数据方面取得了显著成功，但其直接扩展到序列设置往往无法捕捉时间依赖性和信息结构。设计能够以自适应方式模拟序列数据且不预知未来信息的扩散模型仍然是一个开放挑战。在这项工作中，我们提出了一种用于自适应时间序列生成的顺序前向后向扩散框架。我们的方法沿序列逐步注入和去除噪声，并基于先前生成的历史进行条件化以确保自适应性。引入了一种新的分数匹配目标以实现高效的并行训练。我们在一个通用框架下推导了严格的统计保证，然后以ReLU网络作为具体实例建立了分数逼近、分数估计和分布估计结果。在实验上，我们在合成数据（包括ARMA模型和高斯过程）上验证了我们的方法，并展示了其在构建均值-方差最优投资组合中的有效性。

英文摘要

Generating realistic synthetic sequential data is critical in real-world applications across operations research, finance, healthcare, energy systems, and scientific computing, where time-indexed observations are used for prediction, simulation, risk assessment, and data-driven decision-making. While diffusion models have achieved remarkable success in generating static data, their direct extensions to sequential settings often fail to capture temporal dependence and information structure. Designing diffusion models that can simulate sequential data in an adapted manner, and hence without anticipation of future information, therefore remains an open challenge. In this work, we propose a sequential forward-backward diffusion framework for adapted time series generation. Our approach progressively injects and removes noise along the sequence, conditioning on the previously generated history to ensure adaptiveness. A novel score-matching objective is introduced for efficient parallel training. We derive rigorous statistical guarantees under a generic framework, then establish score approximation, score estimation, and distribution estimation results with ReLU networks serving as a concrete instance. Empirically, we validate our method on synthetic data, including ARMA models and Gaussian processes, and demonstrate its effectiveness in constructing mean-variance optimal portfolios.

2606.05878 2026-06-16 cs.LG 版本更新

TS-ICL: A Flexible Time-Indexed Foundation Model for Time Series via In-Context Learning

TS-ICL: 一种基于上下文学习的灵活时间索引时间序列基础模型

Etienne Le Naour, Tahar Nabil, Adrien Petralia

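Affiliations: EDF R&D

AI summary: Proposes TS-ICL, a probabilistic in-context learning encoder-regressor Transformer that unifies time series forecasting and imputation, sets a new state of the art on imputation, and performs especially well when forecasting from partially observed look-back windows.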

AI中文摘要

基础模型标志着时间序列建模的深刻范式转变，任务特定模型正被通用零样本模型取代。然而，当前方法主要关注预测，而现实世界的时间序列通常是不规则和部分观测的，需要模型能够联合预测、插补缺失值并处理降采样条件。为应对这些挑战，我们引入了TS-ICL，一种新颖的基于概率上下文学习的编码器-回归器Transformer，统一了预测和插值。TS-ICL将时间序列任务表述为时间戳对齐的回归，并通过训练从新颖的因果数据先验生成的合成依赖结构自然地纳入协变量。实验上，TS-ICL在插值任务上达到了新的最优，同时在单变量和协变量感知基准上与领先的预测基础模型保持竞争力。它在部分观测回溯窗口的预测中表现出特别强的性能。

英文摘要

Foundation models mark a profound paradigm shift in time series modeling, with task-specific models being superseded by general-purpose zero-shot models. Yet, current approaches primarily focus on forecasting, while real-world time series are often irregularly and partially observed, requiring models that can jointly forecast, impute missing values, and handle degraded sampling conditions. To address these challenges, we introduce TS-ICL, a novel probabilistic In-Context Learning encoder--regressor Transformer that unifies forecasting and imputation. TS-ICL formulates time series tasks as timestamp-aligned regression and naturally incorporates covariates by training on synthetic dependency structures generated from a novel causal data prior. Empirically, TS-ICL achieves a new state-of-the-art in imputation, while remaining competitive with leading forecasting foundation models across both univariate and covariate-aware benchmarks. It shows particularly strong performance in forecasting with partially observed look-back windows.

2606.05742 2026-06-16 cs.CL 版本更新

AdaPLD: Adaptive Retrieval and Reuse for Efficient Model-Free Speculative Decoding

AdaPLD: 自适应检索与重用实现高效无模型推测解码

Runheng Liu, Jincheng Xie, Wen Hu, Xingchen Xiao, Heyan Huang

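Affiliations: School of Computer Science and Technology, Beijing Institute of Technology; Department of Mathematical Sciences, Tsinghua University; JDT AI Infra

AI summary: Existing reuse-based speculative decoding methods suffer from low recall when lexical matching fails and from brittle deterministic copying. The paper proposes AdaPLD, a training-free adaptive method that recovers reuse opportunities through semantic similarity and builds branched hypotheses, achieving up to 3.10× decoding speedup.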

AI中文摘要

推测解码通过在单次目标模型前向传播中验证多个草拟令牌来加速生成，减少了顺序解码迭代。无模型变体通过重用生成过程中已有的文本和模型状态来避免辅助草稿模型，但其加速效果取决于构建的草稿的可靠性。我们指出现有基于重用的方法存在两个局限性：基于词汇锚定的检索在表面形式变化下召回率有限，以及当检索上下文不能唯一确定续写时，确定性跨度复制可能脆弱。我们提出AdaPLD，一种无需训练的方法，自适应地改进检索和草稿构建。AdaPLD保留高精度的词汇重用，同时利用语义相似性在词汇匹配失败时恢复额外的重用机会。它进一步构建分支重用假设以考虑续写的不确定性，而不是依赖单个复制的跨度。在多个基准测试中，AdaPLD减少了目标模型前向传播次数，并实现了高达$3.10\times$的解码加速。

英文摘要

Speculative decoding accelerates generation by verifying multiple drafted tokens in a single target-model forward pass, reducing sequential decoding iterations. Model-free variants avoid auxiliary draft models by reusing text and model states already available during generation, but their speedup depends on the reliability of the constructed drafts. We identify two limitations of existing reuse-based methods: lexically anchored retrieval has limited recall under surface-form variation, and deterministic span copying can be brittle when the retrieved context does not uniquely determine the continuation. We propose AdaPLD, a training-free method that adaptively improves both retrieval and draft construction. AdaPLD preserves high-precision lexical reuse while using semantic similarity to recover additional reuse opportunities when lexical matching fails. It further constructs branched reuse hypotheses to account for continuation uncertainty, rather than relying on a single copied span. Across diverse benchmarks, AdaPLD reduces target-model forward passes and achieves up to $3.10\times$ decoding speedup.
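For readers unfamiliar with the reuse-based baseline the abstract criticizes, the following sketch shows lexically anchored retrieval with deterministic span copying, followed by prefix verification. It illustrates the baseline only, not AdaPLD's semantic retrieval or branched hypotheses; the function names and default parameters are hypothetical.

```python
from typing import List, Sequence


def lookup_draft(tokens: Sequence[int], max_ngram: int = 3, draft_len: int = 4) -> List[int]:
    """Copy the continuation of the most recent earlier match of the current suffix."""
    n_tokens = len(tokens)
    # Prefer longer n-grams: they are more specific anchors.
    for n in range(min(max_ngram, n_tokens - 1), 0, -1):
        suffix = list(tokens[n_tokens - n:])
        # Scan right-to-left so the most recent earlier occurrence wins.
        for start in range(n_tokens - n - 1, -1, -1):
            if list(tokens[start:start + n]) == suffix:
                return list(tokens[start + n:start + n + draft_len])
    return []  # no lexical match: nothing to draft


def accepted_prefix(draft: Sequence[int], verified: Sequence[int]) -> int:
    """Count draft tokens that agree with the target model's own tokens."""
    count = 0
    for d, v in zip(draft, verified):
        if d != v:
            break
        count += 1
    return count


history = [1, 2, 3, 4, 5, 1, 2, 3]
draft = lookup_draft(history)
```

The empty-draft case is where the first limitation shows up: a paraphrased context produces no exact n-gram match, so the baseline falls back to ordinary one-token decoding.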

2606.05693 2026-06-16 cs.LG cs.IR 版本更新

MolE-RAG: Molecular Structure-Enhanced Retrieval-Augmented Generation for Chemistry

MolE-RAG：面向化学的分子结构增强检索增强生成

Joey Chan, Wonbin Kweon, Ashley Shin, Niharika Bhattacharjee, Pengcheng Jiang, Yue Guo, Jiawei Han

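Affiliations: University of Illinois Urbana-Champaign; University of California, San Diego

AI summary: Proposes MolE-RAG, a training-free, molecule-centric retrieval-augmented generation framework that combines three kinds of context (retrieved literature, molecule-specific information, and structurally similar molecules) and substantially improves LLM performance on molecular property prediction tasks.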

AI中文摘要

大型语言模型（LLM）在分子性质预测方面展现出潜力，但其对化学结构的推理能力仍然有限，因为分子表示（如SMILES）与LLM主要训练的自然语言存在显著差异。为弥合这一语义和化学知识鸿沟，我们提出MolE-RAG，一种无需训练的、以分子为中心的检索增强生成框架，用于基于LLM的分子性质预测。MolE-RAG通过三种互补的推理时上下文来源增强每次预测：检索的化学文献、分子特定信息（包括化合物同义词、标识符、官能团注释和物理化学描述符），以及从训练集中检索的结构相似分子。我们使用专有、化学专用和开源LLM在九个分子性质预测任务上评估MolE-RAG。在通用LLM上，相比仅使用SMILES的基线，MolE-RAG在分类任务上将ROC-AUC提升最多28个百分点，并将回归RMSE降低最多67%。我们进一步发现，每种上下文来源的效用因模型和任务而异，不同模型分别从文本检索、分子上下文或结构检索中获益最多。这些结果表明，以分子为中心的检索可以在无需模型微调的情况下改进基于LLM的分子性质预测，同时为在推理时整合异构化学知识提供灵活框架。

英文摘要

Large language models (LLMs) have shown promise for molecular property prediction, but their ability to reason over chemical structures remains limited, as molecular representations such as SMILES differ substantially from the natural language on which LLMs are primarily trained. To bridge this semantic and chemical knowledge gap, we propose MolE-RAG, a training-free, molecule-centric retrieval-augmented generation framework for LLM-based molecular property prediction. MolE-RAG augments each prediction with three complementary sources of inference-time context: retrieved chemistry literature, molecule-specific information including compound synonyms, identifiers, functional group annotations, and physicochemical descriptors, and structurally similar molecules retrieved from the training set. We evaluate MolE-RAG across nine molecular property prediction tasks using proprietary, chemistry-specialized, and open-source LLMs. Across general-purpose LLMs, MolE-RAG improves ROC-AUC by up to 28 percentage points on classification tasks and reduces regression RMSE by up to 67% relative to a SMILES-only baseline. We further find that the utility of each context source varies across models and tasks, with different models benefiting most from textual retrieval, molecular context, or structural retrieval. These results suggest that molecule-centric retrieval can improve LLM-based molecular property prediction without model fine-tuning while providing a flexible framework for integrating heterogeneous chemical knowledge at inference time.
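Retrieving structurally similar molecules from a training set is commonly done with Tanimoto similarity over fingerprint bit sets. The abstract does not state which similarity measure MolE-RAG uses, so the sketch below is an assumption; the fingerprints are made-up sets of integer bit positions rather than output from a cheminformatics library.

```python
from typing import Dict, List, Set


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto (Jaccard) similarity between two sets of fingerprint bits."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(a | b)


def nearest_molecules(query: Set[int], train: Dict[str, Set[int]], k: int = 2) -> List[str]:
    """Return the k training molecules most similar to the query fingerprint."""
    ranked = sorted(train.items(), key=lambda kv: (-tanimoto(query, kv[1]), kv[0]))
    return [name for name, _ in ranked[:k]]


# Hypothetical fingerprints keyed by placeholder molecule names.
train_set = {"mol_a": {1, 2, 3}, "mol_b": {1, 2}, "mol_c": {7}}
neighbors = nearest_molecules({1, 2, 3}, train_set)
```

The retrieved neighbors, together with their known labels, would then be placed in the prompt alongside the literature and molecule-specific context.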

2606.05692 2026-06-16 cs.LG cs.AI 版本更新

Benchmarking Counterfactual Prediction in Epidemic Time Series with Time-Varying Interventions

具有时变干预的流行病时间序列中的反事实预测基准测试

Wenhao Mu, Facundo Yan, Anik Mumssen, Marisa Eisenberg, Alexander Rodríguez

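Affiliations: University of Michigan Computer Science and Engineering; University of Michigan Epidemiology & Complex Systems

AI summary: To address the lack of realistic benchmarks with observable counterfactual outcomes, builds a large-scale benchmark for counterfactual prediction in epidemic time series from a calibrated agent-based model; it supports static and time-varying treatments as well as single-policy and multi-policy interventions, and is used to evaluate a range of causal inference methods.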

Comments To appear in Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2026)

DOI: 10.1145/3770855.3817522

AI中文摘要

深度学习在时间序列因果推断方面取得了显著进展，但由于缺乏具有可观测反事实结果的现实基准，进展仍然受到限制。现有数据集要么依赖没有真实反事实的真实世界观测，要么依赖无法捕捉复杂因果动态的简化模拟。为了解决这一差距，我们开发了一个大规模基准，用于动态干预下流行病时间序列的反事实预测。与现有基准不同，它支持静态和时变治疗，以及单策略和多策略干预设置，从而能够在广泛的因果推断场景中评估因果推断方法。利用基于真实世界人口、流动性、流行病学和政策数据校准的基于智能体的模型，我们生成了跨越美国150多个县的真实反事实轨迹。使用该基准，我们评估了广泛使用和最先进的因果推断方法，揭示了显著的性能差异，并突出了现实时间序列因果推理的挑战。

英文摘要

Deep learning has enabled significant advances in time-series causal inference, yet progress remains constrained by the lack of realistic benchmarks with observable counterfactual outcomes. Existing datasets either rely on real-world observations without ground-truth counterfactuals or on simplified simulations that fail to capture complex causal dynamics. To address this gap, we develop a large-scale benchmark for counterfactual prediction in epidemic time series under dynamic interventions. Unlike existing benchmarks, it supports static and time-varying treatments, as well as both single-policy and multi-policy intervention settings, enabling evaluation of causal inference methods across a broad range of causal inference scenarios. Leveraging a calibrated agent-based model grounded in real-world demographic, mobility, epidemiological, and policy data, we generate realistic counterfactual trajectories across more than 150 U.S. counties. Using this benchmark, we evaluate widely used and state-of-the-art causal inference methods, revealing substantial performance differences and highlighting the challenges of realistic time-series causal reasoning.

2606.05014 2026-06-16 cs.CL 版本更新

Depth-Attention: Cross-Layer Value Mixing for Language Models

深度注意力：语言模型的跨层值混合

Boyi Zeng, Yiqin Hao, Zitong Wang, Shixiang Song, He Li, Feichen Song, Yifan Liu, Ziwei He, Xinbing Wang, Zhouhan Lin

Affiliations: LUMIA Lab; School of Artificial Intelligence; Shanghai Jiao Tong University; Shanghai AI Laboratory; Sun Yat-sen University; Shanghai Innovation Institute

AI summary: Proposes Depth-Attention, which performs cross-layer value mixing inside the attention module, adds no parameters or extra inference state, and improves language model performance.

Comments 21 pages, 4 figures, 9 tables

AI中文摘要

自注意力机制可以在序列中自由选择信息，但在深度方向上，Transformer仅将每一层的输出加到残差流中，因此后续层无法选择性重用早期层的表示。最近的跨层方法改善了这种流动，但在注意力之外的隐藏状态上操作，在推理时增加了键值缓存之外的状态——随着现代LLM使用分组查询和多头潜在注意力压缩缓存，这一成本日益显著。我们引入深度注意力，它在注意力模块内部执行这种选择：在一层对序列进行注意力之前，其查询在同一token位置上对早期层的键进行注意力，并将它们的值混合到自注意力随后读取的值中。由于深度注意力重用标准的注意力查询、键和值缓存槽，将深度混合后的值替换原始值，因此它不增加参数，也不引入超出标准键值缓存的持久推理状态——缓存大小与普通解码器相同，且小于基于隐藏状态的跨层方法。在1.5B和3B参数的Qwen3风格解码器上，深度注意力取得了最低的困惑度和最高的平均下游准确率，相比普通Transformer提升高达2.3个准确率点，在困惑度和平均准确率上超越了强跨层基线，同时仅增加不到0.01%的额外算术FLOPs，且无额外持久推理状态。这些增益在360M到3B参数范围内保持一致，并扩展到循环Transformer。

英文摘要

Self-attention selects information freely across the sequence, but across depth, Transformers merely add each layer's output to the residual stream, so later layers cannot selectively reuse earlier-layer representations. Recent cross-layer methods improve this flow but operate on hidden states outside attention, adding state beyond the key-value cache at inference--a cost that becomes increasingly salient as modern LLMs compress the cache with grouped-query and multi-head latent attention. We introduce Depth-Attention, which performs this selection inside the attention module itself: before a layer attends over the sequence, its query attends over the keys of earlier layers at the same token position and mixes their values into the value that self-attention then reads. Because Depth-Attention reuses the standard attention queries, keys, and value-cache slots, storing depth-mixed values in place of the original values, it adds no parameters and introduces no persistent inference state beyond the standard key-value cache--the same cache size as a vanilla decoder and less than hidden-state-based cross-layer methods. On Qwen3-style decoders at 1.5B and 3B parameters, Depth-Attention attains the lowest perplexity and the highest average downstream accuracy, improving over the vanilla Transformer by up to 2.3 accuracy points and surpassing strong cross-layer baselines in perplexity and average accuracy, while adding under 0.01% extra arithmetic FLOPs and no additional persistent inference state. The gains hold from 360M to 3B parameters and extend to looped Transformers.
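The depth-wise step described above, in which a query attends over per-layer keys at one token position and mixes the corresponding values, can be sketched for a single head and a single token. This is an illustration under assumptions rather than the paper's implementation: the scaled dot-product form, the softmax normalization, and which layers are included in the key and value lists are not specified in the abstract.

```python
import math
from typing import List


def depth_mix(query: List[float],
              layer_keys: List[List[float]],
              layer_values: List[List[float]]) -> List[float]:
    """Softmax-weight per-layer values at one token position by query-key score."""
    d = len(query)
    # Scaled dot-product score between the query and each layer's key.
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in layer_keys]
    # Numerically stable softmax over the depth axis.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the per-layer value vectors.
    dim = len(layer_values[0])
    return [sum(w * v[i] for w, v in zip(weights, layer_values)) for i in range(dim)]


# Two layers with identical keys get equal weight, so the result is the mean value.
mixed = depth_mix([1.0, 0.0], [[1.0, 0.0], [1.0, 0.0]], [[0.0, 2.0], [4.0, 6.0]])
```

In the scheme the abstract describes, the mixed value is stored in the existing value-cache slot, which is why no additional persistent inference state is needed.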

2606.04907 2026-06-16 cs.RO 版本更新

WAM-Nav: Asymmetric Latent World-Action Modeling for Unified Visual Navigation

WAM-Nav：面向统一视觉导航的非对称潜在世界-动作建模

Ning Yang, Yan Huang, Kaiwen Peng, Ziheng He, Kai Wang, Cui Miao, Kailin Lyu, Guo Li, Xiaofeng Wang, Zheng Zhu, Jing Liu, Nianfeng Liu

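Affiliations: Nanjing University; Institute of Automation, Chinese Academy of Sciences; University of Chinese Academy of Sciences; FiveAges; National University of Defense Technology; Tsinghua University; GigaAI

AI summary: Proposes WAM-Nav, an asymmetric Diffusion Transformer model that jointly learns action generation and latent visual prediction. A shared Diffusion Transformer jointly diffuses long-horizon actions and short-horizon visual predictions, and a dual-stream contextual conditioning mechanism plus a goal alignment module let a single policy support Image-Goal, Point-Goal, and No-Goal navigation. On the ClutterScenes and InternScenes benchmarks, success rates improve by 15.7% on Image-Goal and 3.3% on Point-Goal navigation, and real-world deployment reaches an average 85% task success rate.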

AI中文摘要

视觉导航需要在复杂的几何和物理约束下生成平滑且无碰撞的轨迹。现有的反应式策略直接将观测映射到动作，缺乏预期推理能力，限制了其主动避障的能力。虽然视觉想象提供了预测性前瞻，但传统的模块化方法将场景预测与策略学习分离，常常导致误差累积和推理效率低下。为了解决这些限制，我们提出了WAM-Nav，一种用于具身视觉导航的潜在世界-动作模型，它联合学习动作生成和潜在视觉预测，从而在不影响推理效率的情况下实现更鲁棒和更具前瞻性的导航决策。具体来说，WAM-Nav利用共享的扩散Transformer进行非对称联合扩散，同时生成长时程动作和短时程视觉预测，减少了多步自回归展开中固有的推理延迟和视觉误差累积。为了进一步促进平滑且一致的轨迹生成，我们引入了一种双流上下文条件机制，将情节级别的自运动历史与顺序视觉观测相结合。结合统一的目标对齐模块，该模块在不同目标类型间保持平衡表示，WAM-Nav在单一策略下自然支持图像目标、点目标和无目标探索。在具有挑战性的ClutterScenes和InternScenes基准上的大量实验证明了WAM-Nav的强大泛化能力，特别是在图像目标和点目标导航中，成功率分别提高了15.7%和3.3%。真实世界部署进一步验证了有效的零样本模拟到现实迁移，在多样化的室内和室外环境中实现了平均85%的任务成功率。

英文摘要

Visual navigation requires generating smooth and collision-free trajectories under complex geometric and physical constraints. Existing reactive policies that directly map observations to actions lack anticipatory reasoning, limiting their ability to proactively avoid obstacles. While visual imagination offers predictive foresight, conventional modular approaches separate scene prediction from policy learning, often leading to error accumulation and inefficient inference. To address these limitations, we propose WAM-Nav, a Latent World-Action Model for embodied visual navigation that jointly learns action generation and latent visual foresight, enabling more robust and foresighted navigation decisions without compromising inference efficiency. Specifically, WAM-Nav utilizes a shared Diffusion Transformer for asymmetric joint diffusion to concurrently generate long-horizon actions and short-horizon visual foresight, reducing the inference latency and visual error accumulation inherent in multi-step autoregressive rollouts. To further encourage smooth and consistent trajectory generation, we introduce a dual-stream contextual conditioning mechanism that integrates episode-level ego-motion history with sequential visual observations. Combined with a unified goal alignment module that preserves balanced representations across goal types, WAM-Nav naturally supports Image-Goal, Point-Goal, and No-Goal exploration within a single policy. Extensive experiments on the challenging ClutterScenes and InternScenes benchmarks demonstrate strong generalization of WAM-Nav, particularly on Image-Goal and Point-Goal navigation, where it improves success rates by 15.7% and 3.3%, respectively. Real-world deployment further validates effective zero-shot sim-to-real transfer, achieving an average 85% task success rate across diverse indoor and outdoor environments.

2606.04678 2026-06-16 cs.LG 版本更新

EvalStop：利用世界反馈检测和纠正多租户RLHF平台中的奖励过度优化

Guilin Zhang, Chuanyi Sun, Kai Zhao, Xu Chu, Shahryar Sarkani, John M. Fossaceca

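Affiliations: DeepMind, London, UK; University of Cambridge, UK; University of Washington, USA

AI summary: Proposes EvalStop, a scheduling primitive that terminates a job when its eval score declines k consecutive times, releases its GPUs, and preserves the best checkpoint, in order to counter reward overoptimization; on RLHF workloads it detects overoptimization with high precision and improves JCT.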

AI中文摘要

云LLM微调平台越来越多地服务于RLHF工作负载，其中学习到的奖励模型作为人类质量的代理被优化。正如Gao等人(2023)所示，在持续优化压力下，该代理与世界反馈（下游评估指标）发生偏离，这种现象称为奖励过度优化。现有的平台调度器忽略这种偏离：非预见性调度器优化JCT而不考虑任何质量信号，SLAQ式质量感知调度器使用训练损失（一个单调下降的较弱代理，可通过黑客攻击降低），而经典的每作业早停需要人工监控且不释放共享GPU。我们提出EvalStop，一个可组合的调度原语，它在连续k次评估分数下降时终止作业，释放GPU，保留最佳检查点，并委托给任何基础调度器。我们将调度器级别的早停视为检测问题，并在一个离散事件模拟器中评估它，该模拟器的RLHF工作负载混合了奖励黑客攻击和结构健康运行，真实标签对调度器隐藏。在RLHF密集型负载（80% RLHF，64 GPU）上，EvalStop实现了精确率98%、召回率99%、假阳性率1.5%，同时相比SRTF-Est将JCT提高了9%，将浪费的计算减少了22%（p<0.05）。简单的固定进度和损失平台竞争对手要么在健康RLHF上产生65%的假阳性率，要么错过超过一半的真实黑客攻击案例。增益在所有测试的基础调度器上均成立（JCT提升9-25%），且检测质量在评估噪声（噪声标准差≤0.05时精确率至少91%）和黑客攻击基础率（黑客攻击比例20-80%时精确率至少89%）下保持稳定。

英文摘要

Cloud LLM fine-tuning platforms increasingly serve RLHF workloads, where a learned reward model is optimized as a proxy for human quality. As Gao et al. (2023) showed, this proxy diverges from world feedback (downstream eval metrics) under sustained optimization pressure, a phenomenon known as reward overoptimization. Existing platform schedulers ignore this divergence: non-clairvoyant schedulers optimize JCT without any quality signal, SLAQ-style quality-aware schedulers use training loss (a weaker proxy that drops monotonically through hacking), and classical per-job early stopping requires human monitoring and does not free shared GPUs. We propose EvalStop, a composable scheduling primitive that terminates jobs on k consecutive eval-score declines, releases GPUs, preserves the best checkpoint, and delegates to any base scheduler. We frame scheduler-level early stopping as a detection problem and evaluate it in a discrete-event simulator whose RLHF workload mixes reward-hacking and structurally healthy runs, with ground-truth labels hidden from schedulers. On RLHF-heavy workloads (80% RLHF, 64 GPUs), EvalStop achieves precision 98% / recall 99% / FPR 1.5% while improving JCT by 9% and cutting wasted compute by 22% over SRTF-Est (p<0.05). Trivial fixed-progress and loss-plateau competitors either incur 65% FPR on healthy RLHF or miss over half of true hacking cases. Gains compose across every base scheduler tested (9-25% JCT) and detection quality stays stable under eval noise (precision at least 91% at noise std <= 0.05) and hacking base rate (precision at least 89% across 20-80% hacking fractions).
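The stopping rule itself (terminate on k consecutive eval-score declines while remembering the best checkpoint) is simple enough to sketch. The class and function names below are hypothetical, the default k = 3 is an assumption since the abstract gives no value, and GPU release and hand-off to a base scheduler are left out.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass
class EvalStopMonitor:
    """Tracks eval scores for one job and flags k consecutive declines."""
    k: int = 3                       # assumed patience; not given in the abstract
    best_score: float = float("-inf")
    best_step: Optional[int] = None  # step of the checkpoint to preserve
    declines: int = 0                # current run of consecutive declines
    last_score: Optional[float] = None

    def observe(self, step: int, score: float) -> bool:
        """Record one eval result; return True if the job should be terminated."""
        if score > self.best_score:
            self.best_score, self.best_step = score, step
        if self.last_score is not None and score < self.last_score:
            self.declines += 1
        else:
            self.declines = 0  # any non-decline resets the run
        self.last_score = score
        return self.declines >= self.k


def run_job(scores: Sequence[float], k: int = 3) -> Tuple[Optional[int], Optional[int]]:
    """Return (stop step or None, best checkpoint step) for a stream of eval scores."""
    monitor = EvalStopMonitor(k=k)
    for step, score in enumerate(scores):
        if monitor.observe(step, score):
            return step, monitor.best_step
    return None, monitor.best_step
```

A run whose score dips once and then recovers is not stopped, because the decline counter resets; that reset is what separates this rule from a fixed-progress cutoff.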

2606.03788 2026-06-16 cs.CV 版本更新

SLU-2K: A Question-Based Benchmark for Semantic Evaluation of Sign Language Translation

Zeno Testa, Antonino Furnari, Lorenzo Baraldi, Natalia Díaz-Rodríguez

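Affiliations: University of Modena and Reggio Emilia; University of Catania; University of Granada; CITIC & DaSCI Institute

AI summary: Proposes the SLU-2K benchmark, which evaluates semantic understanding in sign language translation through 2,350 video question-answer pairs and reveals the shortcomings of current systems in semantic correctness.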

Comments Accepted at the GenSign Workshop, CVPR 2026

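Details

AI Chinese abstract (English translation)

Sign Language Translation (SLT) is usually evaluated with surface-form metrics such as BLEU and ROUGE, which reward lexical overlap but do not directly measure whether a translation preserves the meaning of the source sign sequence. This is at odds with the ultimate goal of integrating SLT into assistive technology. In this work, we shift the focus from SLT to Sign Language Understanding (SLU), with particular emphasis on semantic understanding. Specifically, we evaluate systems by their ability to correctly recover, from the input video, key semantic aspects of the original sentence, such as the actions taking place and facts about people and objects. To make this evaluation systematic, we propose SLU-2K, a dataset of 2,350 closed-ended video question-answer pairs built on the popular PHOENIX-2014T and CSL-Daily datasets. To obtain SLU-2K, we propose and extensively evaluate an automated data generation pipeline that produces questions in 7 categories: actions, locations, numbers, objects, people, time, and weather conditions. We demonstrate the potential of SLU-2K by evaluating popular Multimodal Large Language Models (MLLMs) and two representative state-of-the-art systems, MMSTL and SpaMo. Our results show that MLLMs reach near-random performance, highlighting the need for more systematic integration of SLU in current AI systems. Moreover, state-of-the-art translation systems carefully fine-tuned on in-domain data still show a substantial semantic gap, with results ranging from 56.7% to 75.2%. These findings suggest that current SLT evaluation protocols overestimate true understanding, and that future progress should be measured not only by fluency and n-gram overlap but also by semantic correctness. Code, prompts, and benchmark files are available at https://github.com/ZenoTsT/SLU-2K

English abstract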

Sign Language Translation (SLT) is typically evaluated with surface-form metrics such as BLEU and ROUGE, which reward lexical overlap but do not directly measure whether a translation preserves the meaning of the source sign sequence. This is in contrast with the final objective of integrating SLT in assistive technology. In this work, we shift the focus from Sign Language Translation (SLT) to Sign Language Understanding (SLU), with particular emphasis on semantic understanding. Specifically, we evaluate systems based on their ability to correctly recover, from the input video, key semantic aspects of the original sentence, such as actions taking place and facts about people and objects. To enable this evaluation systematically, we propose SLU-2K, a dataset of 2,350 closed-ended video question-answer pairs based on the popular PHOENIX-2014T and CSL-Daily datasets. To obtain SLU-2K, we propose and extensively evaluate an automated data generation pipeline which produces questions across 7 categories, namely actions, locations, numbers, objects, people, time, and weather conditions. We show the potential of SLU-2K by evaluating popular Multimodal Large Language Models (MLLMs) and two representative state-of-the-art systems, MMSTL and SpaMo. Our results show that MLLMs reach near-random performance, highlighting the need for a more systematic integration of SLU in current AI systems. Furthermore, state-of-the-art translation systems carefully fine-tuned on in-domain data still exhibit a substantial semantic gap, with results ranging from 56.7% to 75.2%. These findings suggest that current SLT evaluation protocols overestimate true understanding and that future progress should be measured not only by fluency and n-gram overlap, but also by semantic correctness. Code, prompts, and benchmark files are available at https://github.com/ZenoTsT/SLU-2K

2606.03654 2026-06-16 cs.CV cs.NA math.NA Updated version

Estimating Mutual Information between Time Series and Temporal Event Sequences Across Diverse Analysis Tasks

Haoji Hu, Huaqing Mao, Yijun Lin, Xiaowei Jia, Jinwei Zhou, Minoh Jeong, Yao-Yi Chiang

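Affiliations: University of Minnesota - Twin Cities; University of Pittsburgh; Inha University

AI summary: Proposes a nonparametric mutual information estimator that directly measures dependence between continuous time series and discrete event sequences without data transformation or discretization, yielding a robust, unified framework by handling quantization artifacts and event redundancy.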

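Details

DOI: 10.1145/3770855.3817693

AI Chinese abstract (English translation)

Pairwise dependence measures such as correlation and causality are fundamental to temporal data mining, yet there is still no principled and robust way to quantify dependence between heterogeneous data types, in particular between continuous time series and discrete temporal event sequences. Existing approaches rely on ad hoc transformations or on mutual information estimators that are highly sensitive to quantization, repeated values, and event redundancy, which leads to biased or unstable results in practice. We propose a nonparametric mutual information estimator that directly measures the dependence between time series and event sequences without data transformation, learning, or ad hoc discretization. Our method models the continuous-discrete duality of real-world time series to handle quantization and repeated-value artifacts, and introduces a latent event clustering strategy to reduce the bias caused by event co-occurrence and redundancy. Together, these form a robust and unified framework that bridges discrete and continuous mutual information. We evaluate the proposed estimator on four representative tasks: discrete-continuous time-delayed mutual information for causality analysis, global and local temporal repetition discovery, discrete covariate selection for time series forecasting, and continuous feature selection for classification. Experiments on synthetic and real-world datasets show consistent improvements over existing methods in accuracy, robustness, and interpretability, making the approach a general-purpose dependence operator for heterogeneous temporal data, analogous to Pearson correlation for homogeneous time series. Code is available at: https://github.com/HaojiHu/Multimodal-Temporal-Data-Quantification

English abstract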

Pairwise dependence measures such as correlation and causality are fundamental to temporal data mining, yet there is still no principled and robust way to quantify dependence between heterogeneous data types, especially between continuous time series and discrete temporal event sequences. Existing approaches rely on ad hoc transformations or mutual-information estimators that are highly sensitive to quantization, repeated values, and event redundancy, leading to biased or unstable results in practice. We propose a nonparametric mutual information estimator that directly measures the dependence between time series and event sequences without data transformation, learning, or ad hoc discretization. Our method models the continuous-discrete duality of real-world time series to handle quantization and repeated-value artifacts and introduces a latent event clustering strategy to mitigate bias from event co-occurrence and redundancy. Together, these yield a robust and unified framework that bridges discrete and continuous mutual information. We evaluate the proposed estimator on four representative tasks: discrete-continuous time-delayed mutual information for causality analysis, global and local temporal repetition discovery, discrete covariate selection for time series forecasting, and continuous feature selection for classification. Experiments on synthetic and real-world datasets show consistent improvements over existing methods in accuracy, robustness, and interpretability, positioning our approach as a general-purpose dependence operator for heterogeneous temporal data, similar to Pearson correlation for homogeneous time series. Code available at: https://github.com/HaojiHu/Multimodal-Temporal-Data-Quantification

2606.01561 2026-06-16 cs.AI cs.LG Updated version

S-SPPO: Semantic-Calibrated Self-Play Preference Optimization

Xiwen Chen, Wenhui Zhu, Jingjing Wang, Peijie Qiu, Zhipeng Wang, Huayu Li, ZhengXiao He, Xuanzhao Dong, Prayag Tiwari, Mingkun Xu, Yujian Xiong, Feng Luo, Abolfazl Razi, Brendan Hogan Rappazzo, Anderson Schneider, Yuriy Nevmyvaka

Affiliations: University of Arizona, USA; Arizona State University, USA; Now at Google LLC, work done at Rice University; Clemson University, USA; Washington University in St. Louis, USA; Halmstad University, Sweden; Guangdong Institute of Intelligence Science and Technology, China

AI summary: Targets policy degeneration in Self-Play Preference Optimization (SPPO) caused by overconfident preference predictions and proposes S-SPPO, a dual-space semantic calibration framework that combines supervision calibration via semantic gating with representation calibration via latent repulsion, improving alignment performance while preserving the game structure.

Comments Accepted by ICML2026

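Details

AI Chinese abstract (English translation)

Aligning Large Language Models (LLMs) with human preferences is often done through Direct Preference Optimization (DPO). However, the standard Bradley-Terry instantiation of DPO is limited in modeling the departures from transitivity that are common in human preferences. To address this, recent work introduced Self-Play Preference Optimization (SPPO), which iteratively refines the policy by training on self-generated win-lose pairs. Our investigation, however, finds a critical instability in SPPO: when the preference predictor assigns overly confident wins to semantically indistinguishable responses, the optimization is prone to policy degeneration. To mitigate this, we propose S-SPPO, a dual-space semantic calibration framework comprising: i) supervision calibration via semantic gating, which anneals win-rate targets toward the maximum-entropy baseline as semantic overlap increases; and ii) representation calibration via latent repulsion, which enforces geometric diversity to prevent manifold collapse and to maintain latent diversity between chosen and rejected samples. Theoretically, we show that the calibration preserves the constant-sum game structure, which facilitates convergence to a Nash equilibrium. Empirically, S-SPPO avoids the performance degradation seen in prior methods, reaching a 52.19% win rate and a 47.46% length-controlled win rate on AlpacaEval 2.0 with Llama-3-8B, without using additional human-annotated preferences during training. The code will be available at https://github.com/xiwenc1/s-sppo.

English abstract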

Aligning Large Language Models (LLMs) with human preferences is often formulated via Direct Preference Optimization (DPO). However, the standard Bradley-Terry instantiation of DPO is limited in modeling common departures from transitivity in human preferences. To address this, recent work has introduced Self-Play Preference Optimization (SPPO), which iteratively refines the policy by training on self-generated win-lose pairs. Our investigation, however, reveals a critical instability in SPPO: the optimization is prone to policy degeneration when the preference oracle assigns overly confident wins to semantically indistinguishable responses. To mitigate this, we propose S-SPPO, a dual-space semantic calibration framework comprising: i) Supervision Calibration via semantic gating, which anneals win rate targets toward the maximum-entropy baseline as semantic overlap increases; and ii) Representation Calibration via latent repulsion to enforce geometric diversity to prevent manifold collapse and maintain latent diversity between chosen and rejected samples. Theoretically, we show that the calibration preserves the constant-sum game structure, facilitating convergence to a Nash Equilibrium. Empirically, S-SPPO avoids the performance degradation seen in prior methods, achieving 52.19% win rate and 47.46% length-controlled win rate on AlpacaEval 2.0 with Llama-3-8B, without using additional human-annotated preferences during training. The code will be available at https://github.com/xiwenc1/s-sppo.
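
The semantic gating idea (pulling win-rate targets toward the maximum-entropy value as two responses become semantically closer) can be illustrated with a small sketch. The abstract does not give the gating function, so the linear interpolation below is an assumption made purely for illustration, as are the function name and the convention that `overlap` lies in [0, 1]. For a binary preference the maximum-entropy target is 0.5.

```python
def gated_win_rate(p_win: float, overlap: float) -> float:
    """Shrink a win-rate target toward 0.5 as semantic overlap grows.

    p_win:   raw win probability from the preference oracle, in [0, 1].
    overlap: semantic overlap between the two responses, in [0, 1]
             (hypothetical score; 1.0 means indistinguishable).
    Assumption: a linear gate; the paper's actual gate may differ.
    """
    if not 0.0 <= p_win <= 1.0:
        raise ValueError("p_win must be in [0, 1]")
    if not 0.0 <= overlap <= 1.0:
        raise ValueError("overlap must be in [0, 1]")
    return 0.5 + (1.0 - overlap) * (p_win - 0.5)
```

Under this particular gate the targets for a pair still sum to one, since `gated_win_rate(p, s) + gated_win_rate(1 - p, s)` equals 1 for any `s`, which is one simple way a calibration can leave a constant-sum structure intact.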

2606.01365 2026-06-16 cs.AI Updated version

Early Diagnosis of Wasted Computation in Multi-Agent LLM Systems via Failure-Aware Observability

Xianyou Li, Weiran Yan, Yichao Wu, Penghao Liang, Mengwei Yuan, Jianan Liu, Jing Yang

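Affiliations: University of Science and Technology of China

AI summary: Proposes a failure-aware observability framework that diagnoses wasted computation in multi-agent LLM systems from online trace signals, evaluates it on the GAIA validation set, and reveals distinct failure mechanisms and their relationship to resource consumption.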

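Details

AI Chinese abstract (English translation)

Tool-using multi-agent large language model (LLM) systems consume computation through model tokens, tool calls, retries, and code execution before producing an answer. When a run fails, final-answer evaluation reveals the endpoint but usually not the point at which the trajectory stopped making recoverable progress. This paper introduces a failure-aware observability framework for diagnosing wasted computation in multi-agent LLM trajectories. The framework maps recurring failure modes to online trace signals, including tool reliability, execution recovery, orchestration loops, evidence availability, information change, and budget pressure. We instantiate the framework in a three-agent question-answering system and evaluate it on 165 GAIA validation traces under identical execution caps. Operational failures remain common: 22/53 Level-1 runs, 33/86 Level-2 runs, and 12/26 Level-3 runs fail to produce a usable final answer. The traces reveal distinct mechanisms behind these outcomes, including insufficient evidence, repeated-action loops, max-step termination, streaks of tool failures, and calls that execute successfully but return no useful output. Average token usage rises from 8,152 tokens at Level 1 to 16,389 tokens at Level 3, while evidence availability and sentence-level support diverge. A cached 10-trace LLM-as-judge grounding audit shows that cheap online signals and deeper semantic metrics capture complementary aspects of failure. The results position failure-aware observability as a diagnostic layer between raw execution logs and final-answer accuracy.

English abstract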

Failure-aware observability diagnoses wasted computation in multi-agent LLM systems before final-answer evaluation can explain what went wrong. We propose a trace-based framework for a three-agent architecture -- orchestrator, search agent, and execution agent -- that converts structured events into online signals for loops, budget pressure, low information gain, and tool instability, then adds offline semantic grounding metrics and selective LLM-as-judge evaluation. On 165 GAIA validation traces under identical caps, 98 runs produce usable final answers and 67 fail or stop without one. Among warned failed runs, 58.1% of tokens are spent after the first warning on average, indicating substantial opportunity for intervention. A 10-task Level-2 pilot uses warnings to diversify search or require evidence, reducing post-warning token fraction from 0.638 in the baseline to 0.304. The results support a layered design: cheap online signals help the orchestrator redirect or halt redundant behavior, while deeper semantic checks identify whether completed answers are grounded enough to trust.
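
The post-warning token fraction quoted above (58.1% on average for warned failed runs, and 0.638 versus 0.304 in the pilot) can be computed from a trace roughly as follows. This is a sketch under assumptions: the event schema (a sequence of `(tokens, warned)` pairs) and the function name are hypothetical, and tokens are counted as "after the first warning" only for events strictly later than the first warned event, which may differ from the paper's accounting.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical trace event: (tokens consumed by the event, whether it raised a warning)
Event = Tuple[int, bool]


def post_warning_token_fraction(events: Sequence[Event]) -> Optional[float]:
    """Fraction of a run's tokens spent after its first warning.

    Returns None when the run has no warning or consumed no tokens,
    since the fraction is undefined in those cases.
    """
    total = sum(tokens for tokens, _ in events)
    first = next((i for i, (_, warned) in enumerate(events) if warned), None)
    if first is None or total == 0:
        return None
    # Assumption: only events strictly after the first warned event count.
    after = sum(tokens for tokens, _ in events[first + 1:])
    return after / total
```

A run that spends 100 and 200 tokens up to and including its first warning and 700 tokens afterwards would score 0.7; a high value suggests that an earlier intervention could have saved most of the budget.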

2606.00558 2026-06-16 cs.LG Updated version

Semi-Supervised Noise Adaptation: Transferring Knowledge from Noise Domain

Yuan Yao, Jin Song, Huixia Li, Tongtong Yuan, Jiaqi Wu, Yu Zhang

Affiliations: Guangdong Laboratory of Artificial Intelligence and Digital Economy; Nanjing University of Posts and Telecommunications; Beijing Jiaotong University; Beijing University of Technology; Tsinghua University; Southern University of Science and Technology

AI summary: Formulates the Semi-Supervised Noise Adaptation (SSNA) problem, which uses a synthetic noise domain as the source domain, and improves target-domain generalization with the Noise Adaptation Framework (NAF).

Comments Accepted by ICML 2026

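Details

AI Chinese abstract (English translation)

Transfer learning aims to facilitate learning in a target domain by transferring knowledge from a source domain. The source domain usually contains semantically meaningful samples (e.g., images) so that knowledge can be transferred effectively. However, a recent study observes that a noise domain constructed from simple distributions (e.g., Gaussian distributions) can serve as a surrogate source domain in the semi-supervised setting, where only a small proportion of target samples are labeled and most remain unlabeled. Based on this surprising observation, we formulate a new problem called Semi-Supervised Noise Adaptation (SSNA), which aims to use a synthetic noise domain to improve generalization in the target domain. To address this problem, we first establish a generalization bound that characterizes the effect of the noise domain on generalization, and on this basis we propose the Noise Adaptation Framework (NAF). Extensive experiments show that NAF effectively uses the noise domain to tighten the generalization bound of the target domain, which improves performance. The code is available at https://github.com/AIResearch-Group/SSNA.

English abstract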

Transfer learning aims to facilitate the learning of a target domain by transferring knowledge from a source domain. The source domain typically contains semantically meaningful samples (*e.g.*, images) to facilitate effective knowledge transfer. However, a recent study observes that the noise domain constructed from simple distributions (*e.g.*, Gaussian distributions) can serve as a surrogate source domain in the semi-supervised setting, where only a small proportion of target samples are labeled while most remain unlabeled. Based on this surprising observation, we formulate a novel problem termed *Semi-Supervised Noise Adaptation* (SSNA), which aims to leverage a synthetic noise domain to improve the generalization of the target domain. To address this problem, we first establish a generalization bound characterizing the effect of the noise domain on generalization, based on which we propose a Noise Adaptation Framework (NAF). Extensive experiments demonstrate that NAF effectively leverages the noise domain to tighten the generalization bound of the target domain, leading to improved performance. The codes are available at https://github.com/AIResearch-Group/SSNA.

2606.00435 2026-06-16 cs.CV cs.AI Updated version

Detect Before You Leap: Mirage Detection in Vision-Language Models

Sayeed Shafayet Chowdhury, Md. Shaown Miah, S. M. Taiabul Haque, Syed Ishtiaque Ahmed

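Affiliations: Indiana University Indianapolis; Bangladesh University of Engineering and Technology

AI summary: Addresses mirage, where vision-language models give confident but ungrounded answers when visual evidence is missing, and proposes Text-Conditioned Layer-wise Internal Alignment, which analyzes the alignment trajectory between patch tokens at each vision-encoder layer and the question embedding and combines it with pixel statistics, zero-shot domain routing, and structured self-assessment for high-accuracy pre-response mirage detection.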

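Details

AI Chinese abstract (English translation)

Vision-language models (VLMs) can produce confident visual answers even when the required visual evidence is missing, blank, or unrelated to the question. This failure mode, called mirage (Asadi et al. 2026), is especially concerning in medical and document visual question answering, where a plausible but visually ungrounded response may be mistaken for image-based evidence. We study pre-release mirage detection: given an image-question pair, the goal is to decide whether the VLM should answer or abstain before it generates a response. We propose Text-Conditioned Layer-wise Internal Alignment (TC-LIA), a model-agnostic method that probes patch-token representations across the layers of a CLIP ViT-H/14 vision encoder. TC-LIA projects layer-wise image patch tokens into the final CLIP embedding space and measures their similarity to the question embedding, tracking whether question-relevant visual evidence emerges across the vision layers. The resulting alignment trajectory is summarized by the final image-text cosine similarity, late-layer top-k patch-text alignment, early-to-late gain, and layer-wise slope. These features are combined with pixel-statistic blank/noise detection, zero-shot domain routing, and structured VLM self-assessment to form an ensemble system. Across five VQA domains, three input conditions, and twelve VLM backbones, the best systems reach about 94.6-94.7% three-class detection accuracy with mirage rates of at most 3.0%, whereas baseline mirage rates range from 21.7% to 66.6%.

English abstract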

Vision-language models (VLMs) can produce confident visual answers even when the required visual evidence is missing, blank, or unrelated to the question. This failure mode, recently described as mirage (Asadi et al., 2026), is especially concerning in medical and document VQA, where a plausible but visually ungrounded answer may be mistaken for image-based evidence. We study the complementary problem of pre-release mirage detection: given an image-question pair, determine whether the VLM should answer or abstain before generation. To that end, we propose a novel model-agnostic Text-Conditioned Layer-wise Internal Alignment (TC-LIA) method that probes patch-token representations across the layers of a CLIP ViT-H/14 vision encoder. The key idea is to project layer-wise image patch tokens into the final CLIP embedding space and measure their similarity with the question embedding, thereby tracking whether question-relevant visual evidence emerges across vision layers. TC-LIA summarizes this alignment trajectory using final image-text cosine similarity, late-layer top-k patch-text alignment, early-to-late gain, and layer-wise slope. These features are combined with pixel-statistic based blank/noise detection, zero-shot domain routing, and structured VLM self-assessment in an ensemble. Across five VQA domains with related, unrelated-real, and blank/noise inputs, and across twelve VLM backbones, Qwen2.5-VL-32B achieves the highest three-class detection accuracy of 94.7% with a 3.0% mirage rate, while Qwen2.5-VL-72B achieves 94.6% accuracy with a lower 2.8% mirage rate. Baseline mirage rates span 21.7-66.6%.
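
To make the trajectory summary concrete, the sketch below computes three of the named features (late-layer top-k patch-text alignment, early-to-late gain, and layer-wise slope) from per-layer patch embeddings that are assumed to be already projected into the same space as the question embedding. It is an illustration only, not the authors' code: all names are hypothetical, "early" and "late" are assumed to mean the first and last third of the layers, the per-layer score is assumed to be the mean of the k highest patch-question cosines, and the slope is an ordinary least-squares fit over layer index. The final image-text cosine feature is omitted because it needs the pooled image embedding.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def topk_alignment(patches: Sequence[Vector], question: Vector, k: int) -> float:
    """Mean of the k highest patch-question cosine similarities in one layer."""
    if not patches or k < 1:
        raise ValueError("need at least one patch and k >= 1")
    top = sorted((cosine(p, question) for p in patches), reverse=True)[:k]
    return sum(top) / len(top)


def ols_slope(ys: Sequence[float]) -> float:
    """Least-squares slope of ys against the layer index 0..n-1."""
    n = len(ys)
    x_bar, y_bar = (n - 1) / 2, sum(ys) / n
    num = sum((i - x_bar) * (y - y_bar) for i, y in enumerate(ys))
    den = sum((i - x_bar) ** 2 for i in range(n))
    return num / den


def summarize_trajectory(layers: Sequence[Sequence[Vector]],
                         question: Vector, k: int = 2) -> Dict[str, float]:
    """Summarize the layer-wise alignment trajectory (illustrative feature set)."""
    if len(layers) < 2:
        raise ValueError("need at least two layers to measure a trend")
    scores: List[float] = [topk_alignment(p, question, k) for p in layers]
    third = max(1, len(scores) // 3)   # assumption: early/late = first/last third
    early = sum(scores[:third]) / third
    late = sum(scores[-third:]) / third
    return {"late_topk": late, "gain": late - early, "slope": ols_slope(scores)}
```

A trajectory whose alignment rises toward the late layers (positive gain and slope) is the pattern one would expect when question-relevant evidence is present, whereas a flat or low trajectory would be a candidate signal for abstaining.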

Experience Makes Skillful: Enabling Generalizable Medical Agent Reasoning via Self-Evolving Skill Memory

DEFINED: A Data-Efficient Computational Framework for Fine-Grained Creativity Assessment in Debate Scenarios

On the Geometry of On-Policy Distillation

Tangram: Unlocking Non-Uniform KV Cache Compression for Efficient Multi-turn LLM Serving

RQUL-UIE: Revitalizing Quality-Unstable Labels for Underwater Image Enhancement via In-Dataset Self-Supervision

Diffusion Models for Adaptive Sequential Data Generation

TS-ICL: A Flexible Time-Indexed Foundation Model for Time Series via In-Context Learning

AdaPLD: Adaptive Retrieval and Reuse for Efficient Model-Free Speculative Decoding

MolE-RAG: Molecular Structure-Enhanced Retrieval-Augmented Generation for Chemistry

Benchmarking Counterfactual Prediction in Epidemic Time Series with Time-Varying Interventions

Depth-Attention: Cross-Layer Value Mixing for Language Models

WAM-Nav: Asymmetric Latent World-Action Modeling for Unified Visual Navigation

Test-Time Compute Scaling for ASR with Depth-Conditioned Looped Transformers

MeshFlow: Efficient Artistic Mesh Generation via MeshVAE and Flow-based Diffusion Transformer

GroupToM-Bench: Benchmarking Group Theory of Mind and Nonlinear Social Emergence in MLLMs

EvalStop: Using World Feedback to Detect and Correct Reward Overoptimization in Multi-Tenant RLHF Platforms

SLU-2K: A Question-Based Benchmark for Semantic Evaluation of Sign Language Translation

Graph Regularized Non-negative Reduced Biquaternion Matrix Factorization for Color Image Recognition

Bayesian Tensor Decomposition with Diffusion Model Prior

Fast-dLLM++: Fréchet Profile Decoding for Faster Diffusion LLM Inference

Pathway-Structured Privileged Distillation for Deployable Computational Pathology

Anomalies in Multivariate Time Series Benchmarks Are Mostly Univariate

Question-Aware Evidence Ledgers for Video Relational Reasoning

Not What, But How: A Framework for Auditing LLM Responses across Positioning, Generalization, Anthropomorphism, and Maxims

Auteur: Language-Driven Cinematographic Framing for Human-Centric Video Generation

Estimating Mutual Information between Time Series and Temporal Event Sequences Across Diverse Analysis Tasks

S-SPPO: Semantic-Calibrated Self-Play Preference Optimization

Early Diagnosis of Wasted Computation in Multi-Agent LLM Systems via Failure-Aware Observability

Semi-Supervised Noise Adaptation: Transferring Knowledge from Noise Domain

Detect Before You Leap: Mirage Detection in Vision-Language Models