arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.26252 2026-05-27 cs.AI cs.DB

Is Agent Memory a Database? Rethinking Data Foundations for Long-Term AI Agent Memory

Abdelghny Orogat, Essam Mansour

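AI summary This paper proposes treating long-term AI agent memory as a new data-management workload. It formalizes the Governed Evolving Memory (GEM) framework, which replaces record-level operations with four state-level operators, argues that record-level systems cannot satisfy its correctness conditions, validates feasibility with the MemState prototype, and points to future research directions.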

Abstract

Long-running AI agents need persistent memory. Memory supports learning across sessions, reduces repeated context injection, and enables auditing of past decisions. Current agent memory systems and database paradigms treat memory as storage. They localize correctness at records, embeddings, or edges. Each supplies only some of the capabilities that long-term memory requires. The result is four recurring failure modes: unregulated growth, missing semantic revision, capacity-driven forgetting, and read-only retrieval. In our vision, long-term agent memory is a new data-management workload. Its correctness is a property of the state trajectory, not of individual records. We formalize this as Governed Evolving Memory (GEM). GEM replaces record-level database operations with four state-level operators: ingestion, revision, forgetting, and retrieval. Six correctness conditions govern how the state evolves. Three structural observations establish that no record-level system can satisfy these conditions, regardless of the storage model. We realize the abstraction in MemState, a prototype on a property-graph backend. MemState validates feasibility and exposes the gap to a native engine. We outline three research directions that define memory-centric data management as a workload.
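
To make the four state-level operators concrete, the following is a toy sketch, not the paper's GEM formalism or the MemState prototype. The class name `ToyMemory`, its method names, and the choice to log a full state snapshot per operation are all assumptions made for illustration; the abstract only names the operators (ingestion, revision, forgetting, retrieval) and says correctness is a property of the state trajectory.

```python
class ToyMemory:
    """Hypothetical in-memory store that records its own state trajectory."""

    def __init__(self):
        self.facts = {}        # current memory state
        self.trajectory = []   # list of (operator, key, state snapshot)

    def _log(self, op, key):
        # Snapshot the whole state so properties can be checked over time.
        self.trajectory.append((op, key, dict(self.facts)))

    def ingest(self, key, value):
        self.facts[key] = value
        self._log("ingest", key)

    def revise(self, key, value):
        # Revision only applies to something already remembered.
        if key not in self.facts:
            raise KeyError(key)
        self.facts[key] = value
        self._log("revise", key)

    def forget(self, key):
        self.facts.pop(key, None)
        self._log("forget", key)

    def retrieve(self, key):
        # Retrieval is logged too, so it is not invisible to the trajectory.
        value = self.facts.get(key)
        self._log("retrieve", key)
        return value
```

Because every operator appends to `trajectory`, a checker could in principle assert properties over the whole sequence of states rather than over single records, which is the shift in perspective the abstract describes.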

2605.26248 2026-05-27 cs.LG cs.AI cs.NE

Unified Neural Scaling Laws

Ethan Caballero, Priyank Jaini, David Krueger, Irina Rish

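AI summary Proposes a Unified Neural Scaling Law (UNSL) functional form that accurately models and predicts the scaling behavior of deep neural networks as multiple dimensions (model parameters, training data size, training steps, inference steps, compute, and hyperparameters) vary simultaneously. It applies to a variety of architectures and tasks and extrapolates scaling behavior more accurately on large-scale vision, language, math, and reinforcement learning tasks.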

Abstract

We present a functional form (that we refer to as a Unified Neural Scaling Law (UNSL)) that accurately models and extrapolates the scaling behaviors of deep neural networks as multiple dimensions all vary simultaneously (i.e. how the evaluation metric of interest varies as one simultaneously varies the number of model parameters, training dataset size, number of training steps, number of inference steps, amount of compute, and various hyperparameters) for various architectures and for each of various tasks within a varied set of upstream and downstream tasks. This set includes large-scale vision, language, math, and reinforcement learning. When compared to other functional forms for neural scaling, this functional form yields extrapolations of scaling behavior that are considerably more accurate on this set.

2605.26246 2026-05-27 cs.LG

The Bridge-Garden Dilemma in LLM Distillation: Why Mixing Hard and Soft Labels Works

Guanghui Wang, Kaiwen Lv Kacuila, Zhiyong Yang, Zitai Wang, Jin-Wen Wu, Longtao Huang, Qianqian Xu, Qingming Huang

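AI summary For knowledge distillation of large language models with mixed hard and soft labels, proposes the Bridge-Garden Decomposition theory to explain how mixing reduces exposure bias, and develops adaptive hybrid supervision methods that improve performance across multiple models while reducing training cost by 9.7x.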

Comments Accepted at ICML 2026

Abstract

Knowledge distillation (KD) transfers knowledge from a large teacher model to a smaller student. In language modeling, the student is trained either on tokens sampled from the teacher (hard labels) or on the teacher's full next-token distribution (soft labels). Although soft labels appear strictly richer, we find that mixing hard and soft labels consistently yields better results. Crucially, we show that this gain cannot be explained by closer teacher matching during training. Instead, it comes from reduced exposure bias, the mismatch between training and inference distributions. To explain this phenomenon, we introduce the Bridge-Garden Decomposition theory, which categorizes generation steps into two types: Bridges, where the next token must be exact, and Gardens, where it can be flexible. We show that hard-only KD excels in Bridges by avoiding risky deviations, while soft-only KD preserves diversity in Gardens. A hybrid strategy handles both cases and, as a result, reduces exposure bias across the sequence. Guided by this theory, we develop a family of Bridge-Garden hybrid supervision methods that adaptively balance hard and soft labels. Across a primary suite of seven teacher-student pairs (including Qwen, Llama, Gemma, and DeepSeek) and benchmarks in reasoning and coding, our approach outperforms divergence-based and on-policy KD baselines while reducing training cost by 9.7x, enabling efficient model compression. Code is available at https://github.com/ghwang-s/bridge_garden_hybrid_kd_release.
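
The difference between hard-label, soft-label, and mixed supervision at a single generation step can be sketched as below. This is a minimal illustration, not the released code: the function names and the fixed scalar `alpha` are assumptions, and the abstract describes methods that balance the two terms adaptively rather than with a constant.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def hard_label_loss(student_logits, sampled_token):
    """Cross-entropy against one token sampled from the teacher."""
    return -math.log(softmax(student_logits)[sampled_token])

def soft_label_loss(student_logits, teacher_probs):
    """Cross-entropy against the teacher's full next-token distribution."""
    q = softmax(student_logits)
    return -sum(p * math.log(qi) for p, qi in zip(teacher_probs, q) if p > 0)

def hybrid_loss(student_logits, teacher_probs, sampled_token, alpha):
    """Mix the two terms; alpha=1 is hard-only KD, alpha=0 is soft-only KD.

    alpha is a hypothetical fixed mixing weight used only for illustration.
    """
    hard = hard_label_loss(student_logits, sampled_token)
    soft = soft_label_loss(student_logits, teacher_probs)
    return alpha * hard + (1.0 - alpha) * soft
```

In the abstract's terms, a step-dependent weight would lean toward the hard term at Bridge steps and toward the soft term at Garden steps; how that weight is chosen is not specified here.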

2605.26244 2026-05-27 cs.CV cs.MM cs.SD

LongAV-Compass: Towards Unified Evaluation of Minute-Scale Audio-Visual Generation Across T2AV, I2AV, and V2AV

Tengfei Liu, Yang Shi, Xuanyu Zhu, Jiafu Tang, Liu Yang, Qixun Wang, Zhuoran Zhang, Yuqi Tang, Fengxiang Wang, Yuhao Dong, Xinlong Chen, Bozhou Li, Bohan Zeng, Yue Ding, Xiaohan Zhang, Jialu Chen, Haotian Wang, Yuanxing Zhang, Pengfei Wan, Leye Wang

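AI summary To address evaluation protocols that are confined to short clips, proposes the LongAV-Compass benchmark, which uses 284 test cases and a unified evaluation framework to systematically assess the quality, consistency, and alignment of minute-scale audio-visual generation under text, image, and video conditioning.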

Abstract

Audio-visual generation is rapidly advancing from short clips to minute-long content, while existing evaluation protocols remain largely confined to short-form settings. Existing benchmarks primarily focus on 5-10 second text-conditioned generation and rarely support unified evaluation across text, image, and video conditioning modalities. Moreover, they provide limited insight into how identity consistency, narrative coherence, and audio-visual alignment degrade over extended temporal horizons. To bridge this gap, we introduce LongAV-Compass, a systematic benchmark for minute-long audio-visual generation. LongAV-Compass contains 284 curated test cases spanning text-to-audio-video (T2AV), image-to-audio-video (I2AV), and video-to-audio-video (V2AV), organized by application scenario and generation complexity. The benchmark combines taxonomy-guided benchmark construction with a unified evaluation framework that integrates MLLM-assisted assessment with complementary perceptual and multimodal metrics, including DINO-v2, ArcFace, CLIP, and ImageBind. The framework evaluates more than 20 fine-grained dimensions covering within-segment quality, cross-segment consistency, global narrative coherence, semantic alignment, and audio-visual synchronization. Through experiments on 11 representative models together with human-alignment validation, LongAV-Compass provides a diagnostic testbed for analyzing the limitations of current systems in sustaining coherent, semantically aligned, and temporally consistent minute-scale audio-visual generation across diverse input modalities.

2605.26243 2026-05-27 cs.LG

Provably Communication-Efficient and Privacy-Preserving Federated Graph Neural Networks

Zhishuai Guo, Wenhan Wu, Chen Chen, Lei Zhang, Olivera Kotevska, Ravi K Madduri

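AI summary Proposes the CE-FedGNN framework, which handles cross-client dependency through infrequent exchange of aggregated node representations and a moving-average estimator, combines this with metric differential privacy for communication efficiency and privacy protection, and proves a convergence rate and privacy guarantees.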

Abstract

Graph neural networks (GNNs) achieve strong performance on relational data, but real-world graphs are often distributed across organizations that cannot share raw data due to privacy and policy constraints. Existing federated GNN methods either ignore cross-client links, leading to degraded accuracy, or require frequent embedding exchanges, incurring substantial communication and privacy costs. We propose CE-FedGNN, a communication-efficient and privacy-preserving federated GNN framework for learning over such coupled graphs. Our approach avoids sharing raw data or per-round embeddings by infrequently exchanging aggregated node representations. To handle cross-client dependency and staleness, we introduce a moving-average estimator that continuously tracks node representations and enables their stable reuse across rounds. To provide formal privacy guarantees for the released representations, we adopt the metric differential privacy (metric-DP) framework, which measures privacy with respect to distances in the learned embedding space rather than worst-case input perturbations. This yields meaningful guarantees at noise levels where standard differential privacy becomes overly conservative. We establish convergence to a stationary point at a rate of $O(1/\sqrt{T})$ with $O(T^{3/4})$ communication complexity. In addition, we derive $(\varepsilon,\delta)$-metric-DP guarantees via Rényi differential privacy composition under a public-cohort threat model. Experiments on synthetic interbank anti-money laundering benchmarks and citation networks demonstrate that CE-FedGNN achieves strong performance while significantly reducing communication and maintaining robustness under privacy-preserving noise.
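
The idea of reusing a smoothed estimate between infrequent exchanges can be sketched as follows. This is an illustrative guess at the general pattern, not CE-FedGNN itself: the function names, the parameter `beta`, and the fixed `exchange_every` schedule are assumptions, and the sketch omits the privacy noise entirely.

```python
def ema_update(estimate, fresh, beta):
    """Blend a freshly received representation into the running estimate."""
    return [(1.0 - beta) * e + beta * f for e, f in zip(estimate, fresh)]

def run_rounds(fresh_per_round, exchange_every, beta):
    """Simulate a client that only receives a neighbor's aggregated
    representation every `exchange_every` rounds and reuses its
    moving-average estimate in between.

    Returns the estimate used in each round and the number of exchanges.
    """
    estimate = list(fresh_per_round[0])
    exchanges = 0
    history = []
    for round_idx, fresh in enumerate(fresh_per_round):
        if round_idx % exchange_every == 0:
            # Communication happens only on these rounds.
            estimate = ema_update(estimate, fresh, beta)
            exchanges += 1
        history.append(list(estimate))
    return history, exchanges
```

With five rounds and `exchange_every=2`, only three exchanges occur, and the rounds in between train against the stale but smoothed estimate.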

2605.26242 2026-05-27 cs.AI

Can LLMs Introspect? A Reality Check

Shashwat Singh, Tal Linzen, Shauli Ravfogel

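AI summary Drawing on lessons from human metacognition research, this paper questions whether large language models can genuinely introspect. Re-examining two evaluation paradigms, it finds that current evidence is insufficient to establish that LLMs have metacognitive monitoring ability.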

Abstract

Can large language models detect and report their own internal states? A number of studies have argued that the answer to this question is yes. We argue, based on lessons from human metacognition research, that this conclusion may be premature: to be convinced of this conclusion we need to distinguish genuine introspection from pattern matching based on surface-level cues. Furthermore, we argue that behavioral evidence alone is inherently insufficient to establish strong introspective claims. We re-examine two recently introduced evaluation paradigms in light of this consideration. In the first paradigm, models are expected to detect whether their internal states have been tampered with. We find that models cannot reliably distinguish such interventions on their internal states from manipulations of the input, suggesting that their success in the original studies reflects their ability to detect anomalies more generally, as opposed to interventions on their internal states in particular. In the second paradigm we examine, models are tasked with predicting labels derived from their own hidden states. Here, we find that classifiers that only have access to the input achieve equivalent performance to the model's own in-context predictions, indicating that the original results do not conclusively demonstrate that the model has privileged access to its internal representations. We further introduce a relabeled control setting, where models cannot rely on the semantics of the task to solve it, and instead must rely on the internal representation; models perform closer to chance on this better-controlled version of the task. Taken together, these results indicate that current evidence is insufficient to establish that LLMs display metacognitive monitoring.

2605.26241 2026-05-27 cs.CV

RoMo: A Large-Scale, Richly Organized Dataset and Semantic Taxonomy for Human Motion Generation

Jiahao Zhang, Joseph Liu, Young-Yoon Lee, Seonghyeon Moon, Victor Zordan, Guy Tevet, Karen Liu, Stephen Gould, Oren Jacob, Haomiao Jiang, Mubbasir Kapadia, Yizhak Ben-Shabat

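AI summary Proposes the RoMo dataset, which ensures quality through a taxonomy-aware filtering pipeline and organizes data with a three-level semantic taxonomy, so that trained models achieve state-of-the-art fidelity and diversity while better understanding complex text prompts.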

Comments Accepted to CVPR'26

Abstract

Success in generative modeling across language, image, and video demonstrates that large, well-curated datasets are the key driver for building capable models. 3D human motion, however, has lagged behind, constrained by an unsatisfying choice between small, high-fidelity motion capture datasets and large-scale in-the-wild collections dominated by static or low-quality sequences. We introduce RoMo, a rich, large-scale, carefully curated dataset of in-the-wild human motions that resolves these tradeoffs. To ensure quality, we introduce a taxonomy-aware filtering pipeline that aggressively removes static and artifact-prone sequences. Every sequence is annotated with detailed captions and organized by a novel three-level semantic taxonomy. This hierarchical structure enables fine-grained, per-category evaluation that reveals model strengths and weaknesses obscured by global metrics. We demonstrate that models trained on RoMo achieve state-of-the-art fidelity and diversity while gaining a superior understanding of complex, subtle text prompts. Finally, we release the Motion Toolbox to standardize metrics, data conversion, and visualization, establishing a foundation for reproducible and interpretable motion generation research.

2605.26239 2026-05-27 cs.CV cs.MA

Sentinel: Embodied Cooperative Spatial Reasoning and Planning

Xiangye Lin, Hongxin Zhang, Ruxi Deng, Qinhong Zhou, Chuang Gan

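AI summary Proposes the Sentinel Challenge and the CoSaR framework, which combines natural-language communication with spatial navigation algorithms to address cooperative spatial reasoning and planning for multiple agents in city-scale outdoor environments.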

Comments The first two authors contributed equally

Abstract

In this work, we study Cooperative Spatial Intelligence, the ability of decentralized embodied agents to coordinate effectively under dynamic environmental constraints across city-scale outdoor domains. We introduce Sentinel Challenge, a benchmark where multiple decentralized embodied agents must communicate in natural language to agree on a mutually safe and convenient meeting point within large, city-scale outdoor environments. Each agent must then navigate safely while avoiding dynamic sentinels patrolling the area, using a tool that provides coarse spatial information. To address this, we propose CoSaR (Cooperative Spatial Reasoning and Planning), a framework that bridges the high-level communication and planning abilities of foundation models with the precision of classical spatial navigation algorithms. CoSaR enables agents to exchange situational updates, reason over evolving spatial constraints, and collaboratively replan trajectories. Evaluated across 14 city-level scenes with 3-5 agents, CoSaR consistently leads to faster gathering, shorter path lengths, and improved safety. Our results demonstrate that integrating dynamic communication with spatial reasoning is essential for robust multi-agent cooperation. By formalizing this new setting and providing a scalable benchmark, we aim to build a foundation for advancing cooperative spatial intelligence in embodied multi-agent systems. Code and challenge are available at https://github.com/UMass-Embodied-AGI/Sentinel.

2605.26232 2026-05-27 cs.CV

Not All Modalities Are Equal: Instruction-Aware Gating for Multimodal Videos

Bonan Ding, Umair Nawaz, Ufaq Khan, Abdelrahman M. Shaker, Muhammad Haris Khan, Jiale Cao, Jin Xie, Fahad Shahbaz Khan

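AI summary Proposes the UniMVU framework, which performs multimodal video understanding through instruction-aware dynamic gating at the inner-modality and modality levels, and outperforms static-fusion methods on six benchmarks.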

Comments 19 pages, 8 figures, 7 tables, preprint

Abstract

Pre-trained video large language models excel at visual reasoning. However, they struggle when videos arrive with auxiliary streams, such as audio, depth maps, or dense temporal evidence. In such scenarios, uniform fusion induces modality interference, allowing irrelevant channels to distract the model. To address this issue, we present a unified multimodal video understanding framework, named UniMVU, that performs instruction-aware fusion across video, audio, depth map, or any other modality inputs via two levels of dynamic gating: inner-modality gates emphasize salient regions within each modality, whereas modality-level gates re-weight whole streams; both are conditioned on the text instruction to adaptively balance modality importance. UniMVU combines cross-modal self-attention with an instruction-driven inner-modality gating module and a modality-level gating module with a control token; for time-aligned streams, we further adopt a fast-to-slow fusion scheme that reduces redundancy. Across six benchmarks (AVQA, AVSD, Music-AVQA, ScanQA, SQA3D, and MVBench), UniMVU achieves consistent gains over static-fusion baselines, with gains as high as 13.5 on the CIDEr metric. Further, our analysis shows that the gating mechanism aligns with human-interpretable modality relevance, and ablations show the contributions of inner-modality and modality-level gating. UniMVU provides a simple, unified recipe for instruction-aware multimodal video understanding that scales to diverse modalities without hand-crafted fusion rules.

2605.26230 2026-05-27 cs.CV

Geometry-Aware Representation Denoising for Robust Multi-view 3D Reconstruction

Jin Hyeon Kim, Jaeeun Lee, Claire Kim, Kyoungjin Oh, Paul Hyunbin Cho, Jaewon Min, Yeji Choi, Jihye Park, Hyunhee Park, Minkyu Park, Seungryong Kim

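AI summary Proposes the Geometry-Aware Representation Denoising (GARD) framework, which performs diffusion-based multi-view restoration in the feature space of a feed-forward 3D reconstruction model, recovers scene geometry and high-quality RGB images simultaneously, and is validated on the DA3 benchmark.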

Abstract

Multi-view 3D reconstruction has achieved remarkable progress with the advent of feed-forward 3D reconstruction models. However, these models are typically trained and evaluated under ideal, degradation-free imaging conditions, whereas real-world observations often contain degradations that differ significantly from such settings. Improving robustness for multi-view 3D reconstruction under degraded conditions therefore remains an important challenge. We present Geometry-Aware Representation Denoising (GARD), a novel framework that performs diffusion-based multi-view restoration directly in the feature space of a feed-forward 3D reconstruction model. This design exploits the geometry-aware feature representations of the 3D reconstructor to effectively recover accurate scene geometry. Furthermore, by employing an additional RGB image decoder, the refined representations can also be used to restore high-quality RGB images, thereby enabling the simultaneous recovery of 3D scene geometry and high-quality imagery. Comprehensive experiments on the Depth Anything 3 (DA3) benchmark demonstrate the effectiveness of the proposed GARD framework.

2605.26222 2026-05-27 cs.LG stat.ML

From Privacy to Generalization: Linear Max-Information Bounds for DP-SGD

Christoph H. Lampert, Hossein Zakerinia

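AI summary This paper proves a finite-sample bound on the approximate max-information of DP-SGD that is linear in the dataset size, and from it derives a PAC-Bayes generalization bound and an explicit generalization bound for DP-SGD-trained models.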

Comments 22 pages

Abstract

Understanding the relationship between generalization and privacy remains a central challenge in modern machine learning theory, particularly for deep networks trained by variants of differentially private stochastic gradient descent (DP-SGD). In this work we make progress on this persistent open problem by proving a finite-sample bound on the approximate max-information of DP-SGD that exhibits scaling properties comparable with the classic result of Dwork et al. (2015) for $\varepsilon$-differentially private algorithms, namely at most linear in the dataset size. From our result we obtain a general-purpose PAC-Bayes generalization bound in which the necessary prior distribution can be learned by DP-SGD, as well as a generalization bound for DP-SGD-trained models themselves, with a complexity term that is fully explicit and controlled by the optimization hyperparameters.
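
For readers unfamiliar with the quantity being bounded, the standard definition of approximate max-information from the adaptive data analysis literature, together with the classic pure-DP bound the abstract refers to, can be written as below. This is background recalled from that literature, not the paper's new bound; the symbols $X$, $Z$, $\beta$, $\mathcal{O}$, $\mathcal{A}$, $S$, and $n$ are notation chosen here.

```latex
% beta-approximate max-information between jointly distributed X and Z.
% X \otimes Z denotes independent copies of X and Z with the same marginals.
I_\infty^{\beta}(X;Z) \;=\; \log_2
  \sup_{\mathcal{O}\,:\,\Pr[(X,Z)\in\mathcal{O}]>\beta}
  \frac{\Pr[(X,Z)\in\mathcal{O}]-\beta}{\Pr[X\otimes Z\in\mathcal{O}]}

% Classic result for an \varepsilon-differentially private algorithm \mathcal{A}
% run on a dataset S of n samples: max-information is at most linear in n.
I_\infty\bigl(S;\mathcal{A}(S)\bigr) \;\le\; \log_2(e)\,\varepsilon\,n
```

A bound of $k$ bits on max-information means that any event over the pair (dataset, output) is at most $2^k$ times more likely than it would be if the output were independent of the dataset, which is what turns a privacy guarantee into a generalization guarantee.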

2605.26192 2026-05-27 cs.LG cs.AI q-bio.BM

Co-folding model guided by structural proteomics

Alon Shtrikman, Nitzan Simchi, Michal Ran Shchory, Sagie Brodsky, Eran Seger, Kirill Pevzner

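AI summary Proposes the AIMS-Fold framework, which integrates XL-MS and HDX-MS experimental data with a diffusion model to guide the generation of protein-complex conformations at inference time, improving prediction accuracy on induced-proximity targets.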

Abstract

Protein structure generative models excel at predicting single protein static structures from sequence, but routinely fail to capture the correct conformational state of protein complexes, which is critical for protein design and induced proximity modalities such as antibodies and PROTACs. While structural proteomics techniques like Cross-Linking Mass Spectrometry (XL-MS) and Hydrogen-Deuterium Exchange Mass Spectrometry (HDX-MS) offer valuable spatial and dynamic insights, integrating these sparse, heterogeneous measurements into these models remains an open challenge. Here, we bridge this gap by combining structural proteomics data with the rich biophysical priors learned by pretrained diffusion models. We introduce AIMS-Fold, an inference-time guided-diffusion framework that actively steers the generative sampling trajectory using differentiable physical potentials derived from XL-MS spatial restraints and HDX-MS solvent accessibility profiles. We demonstrate that these structural methods individually enhance predictive accuracy, and their integration yields synergistic improvement. Crucially, by leveraging these experimental restraints, AIMS-Fold achieves higher accuracy on challenging induced proximity targets than purely computational, unguided state-of-the-art models like Boltz-2. This establishes our framework as a powerful, integrative computational approach for the structure-based drug design of induced proximity drugs. Evaluation code will be made publicly available upon publication.

2605.26191 2026-05-27 cs.LG cs.AI

Modeling Dynamic Mixtures of Time-Delay Systems from Streaming Time Series

Ren Fujiwara, Yasuko Matsubara, Yasushi Sakurai

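AI summary Proposes DelayMix, an online framework that treats streaming time series as dynamic mixtures of time-delay systems. It summarizes past regimes with a fixed-length representation and uses a tensor of Markov parameters to capture dynamics and delays, enabling rapid adaptation to environmental changes with lower memory usage.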

Comments Accepted by IJCAI 2026

Abstract

This research addresses the problem of adaptive modeling in time-series data streams with clear input-output relationships. This problem is challenging because rapid system changes (regime shifts) caused by environmental factors or input delay changes degrade model performance, and a trade-off among accuracy, robustness, and memory usage arises when using multiple small models for each time-series pattern. To address these issues, this paper presents DelayMix, an online framework that treats streaming time series as dynamic mixtures of time-delay systems. The framework maintains robustness of model tracking and reduces memory usage by summarizing past regimes using a fixed-length representation that captures both the system dynamics and input-output delays. Concretely, the approach constructs a summary system tensor using the system's Markov parameter series, capturing both dynamic behavior and delay characteristics. If necessary, a tensor decomposition algorithm extracts relevant past models from the tensor and helps select the system that best fits the current regime. The method enables rapid adaptation to environmental changes and is computationally efficient. Tests on real datasets show that DelayMix consistently outperforms other methods, achieving superior forecast accuracy and faster adaptation to delays, especially for highly non-stationary data.
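
Why a Markov parameter series can capture both dynamics and delay is easiest to see on a scalar linear system. The sketch below is textbook background, not the DelayMix algorithm: it assumes a hypothetical system $x_{t+1} = a x_t + b u_{t-d}$, $y_t = c x_t$, computes its impulse response (whose entries are the Markov parameters), and reads the delay $d$ off the number of leading zeros.

```python
def impulse_response(a, b, c, delay, length):
    """Markov parameters of x[t+1] = a*x[t] + b*u[t-delay], y[t] = c*x[t],
    obtained by simulating a unit impulse applied at t = 0."""
    x = 0.0
    outputs = []
    for t in range(length):
        outputs.append(c * x)
        # The impulse u[0] = 1 reaches the state update when t == delay.
        u_delayed = 1.0 if t == delay else 0.0
        x = a * x + b * u_delayed
    return outputs

def estimate_delay(markov_params, tol=1e-12):
    """Input delay implied by the leading zeros of the Markov parameters.

    Even with no delay the first output is zero (the state needs one step
    to react), hence the minus one. Returns None if all entries are zero.
    """
    for k, h in enumerate(markov_params):
        if abs(h) > tol:
            return k - 1
    return None
```

For example, `impulse_response(0.5, 1.0, 2.0, 2, 6)` yields three leading zeros followed by a geometric decay: the zeros encode the delay and the decay rate encodes the dynamics, so a fixed-length window of these values summarizes both.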

2605.26190 2026-05-27 cs.LG cs.AI eess.SP

HRVConformer: Neonatal Hypoxic-Ischemic Encephalopathy Classification from the Heart Rate signals

Shuwen Yu, William P Marnane, Geraldine B. Boylan, Gordon Lightbody

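AI summary Proposes HRVConformer, a hybrid convolution-Transformer deep learning architecture that classifies neonatal hypoxic-ischemic encephalopathy end to end directly from raw heart rate signals, reaching 83.23% AUC and 74.56% accuracy on the test set and outperforming Transformer, ResNet50, and other baselines.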

Comments Paper submitted to Journal of Engineering Applications of Artificial Intelligence

Abstract

This paper presents the HRVConformer, a novel deep learning architecture for the classification of hypoxic-ischemic encephalopathy (HIE) using the instantaneous heart rate (HR) signal. Unlike conventional approaches that rely on handcrafted features, HRVConformer directly processes raw HR signals in an end-to-end manner, capturing both local and long-range dependencies through a hybrid Convolution-Transformer framework. By integrating convolutional layers for local feature extraction and Transformer-based attention mechanisms for global context modelling, the architecture effectively enhances signal representation and classification performance. The model was trained using supervised learning on a large HR dataset consisting of 1,573 one-hour epochs, including 259 one-hour expert-annotated epochs and a substantial set of weakly labelled data. A 314-hour validation set provided a robust performance estimation, while an independent 215-hour dataset with expert annotations was reserved for final testing. HR signals were extracted from electrocardiogram (ECG) recordings using an improved Pan-Tompkins algorithm, which significantly enhanced both signal quality and data availability. Experimental results demonstrate that the HRVConformer achieves an AUC of 83.23% and accuracy of 74.56% on the test set. These results surpass the performance of the Transformer, ResNet50, and fully convolutional network baselines, highlighting the advantages of integrating convolutional and Transformer-based components for HR-based HIE classification. The proposed method provides a promising step toward a more accurate and automated assessment of HIE using HR signals. The code is available at: https://github.com/syu-kylin/HRVConformer.
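
The instantaneous HR signal that the model consumes is conventionally derived from the intervals between successive R-peaks. The sketch below shows only that final conversion step and assumes R-peak times are already available; it is not the paper's improved Pan-Tompkins detector, and the function name is hypothetical.

```python
def instantaneous_hr(r_peak_times_s):
    """Convert R-peak timestamps (seconds) to instantaneous heart rate (bpm).

    Each RR interval of `rr` seconds corresponds to 60 / rr beats per minute.
    """
    rates = []
    for t0, t1 in zip(r_peak_times_s, r_peak_times_s[1:]):
        rr = t1 - t0
        if rr <= 0:
            raise ValueError("R-peak times must be strictly increasing")
        rates.append(60.0 / rr)
    return rates
```

For instance, peaks at 0.0, 0.5, 1.0, and 2.0 seconds give rates of 120, 120, and 60 bpm. In practice the resulting series is unevenly sampled and is usually resampled before being fed to a network, a step not shown here.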

2605.26184 2026-05-27 cs.LG cs.AI

GAC: Noise-Aware Adaptive Mixing for Hybrid SFT-RL Post-Training

Yuelin Hu, Zhenbo Yu, Zhengxue Cheng, Wei Liu, Li Song

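AI summary Proposes GAC, a noise-aware controller that adaptively adjusts the mixing weight from online estimates of gradient variance and of disagreement between the two training signals, to improve hybrid post-training performance.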

Comments 15 pages, 3 figures, 22 tables

Abstract

Hybrid post-training usually combines supervised fine-tuning and reinforcement learning, but fixed mixing schedules cannot adapt when the relative noise of the two signals changes over time. We propose GAC, a noise-aware controller that derives an adaptive mixing weight from online estimates of gradient variance and disagreement between the two training signals. The method adds smoothing, prior guidance, and bounded updates while reusing existing training tensors. Experiments on math, code, science, and logic benchmarks show that GAC consistently improves hybrid post-training over strong fixed and rule-based baselines, with larger gains at larger model scales and less than 1% training overhead.

2605.26182 2026-05-27 cs.AI cs.GR

BrickAnything: Geometry-Conditioned Buildable Brick Generation with Structure-Aware Tokenization

BrickAnything: 基于几何条件的可构建砖块生成与结构感知标记化

Zhengyang Ni, Feng Yan, Yu Guo, Fei Wang

AI总结 提出BrickAnything,一个基于几何条件的自回归框架,通过结构感知树标记化生成满足装配约束和结构稳定性的砖块结构。

详情
AI中文摘要

从3D形状生成物理可构建的砖块结构不仅需要几何重建,输出还必须满足离散零件约束和结构稳定性。现有的砖块生成方法要么依赖启发式优化,当目标3D形状在预定义约束下无法实现可行结构时可能失败;要么生成砖块序列而不显式建模底层3D几何和装配关系。在这项工作中,我们提出了BrickAnything,一个基于几何条件的自回归框架,用于从多样的3D表示生成可构建的砖块结构。BrickAnything使用点云作为统一的几何接口,并预测在装配约束下重建目标形状的砖块序列。为了建模砖块之间的结构依赖性,我们引入了结构感知树标记化,通过局部附着关系表示砖块结构。这种公式使序列生成更符合物理构建过程,并减少无效中间状态。我们进一步引入了基于偏好的对齐后训练、有效性约束解码和自适应回滚,以改善可构建性目标,如稳定性和几何保真度。大量实验表明,BrickAnything生成几何忠实且物理可实现的砖块结构,并且与传统的排序策略相比,所提出的标记化有效减少了回滚和重新生成。

英文摘要

Generating physically buildable brick structures from 3D shapes requires more than geometric reconstruction: the output must also satisfy discrete part constraints and structural stability. Existing brick generation methods either rely on heuristic optimization, which can break down when the target 3D shape does not admit a feasible structure under predefined constraints, or generate brick sequences without explicitly modeling the underlying 3D geometry and assembly relations. In this work, we present BrickAnything, a geometry-conditioned autoregressive framework for generating buildable brick structures from diverse 3D representations. BrickAnything uses point clouds as a unified geometric interface and predicts brick sequences that reconstruct the target shape under assembly constraints. To model structural dependencies among bricks, we introduce a structure-aware tree tokenization, which represents brick structures through local attachment relations. This formulation makes sequence generation more consistent with the physical construction process, and reduces invalid intermediate states. We further introduce preference-based alignment post-training, validity-constrained decoding and adaptive rollback to improve buildability objectives such as stability and geometric fidelity. Extensive experiments demonstrate that BrickAnything produces geometrically faithful and physically realizable brick structures, and that the proposed tokenization effectively reduces rollback and regeneration compared with conventional ordering strategies.

2605.26176 2026-05-27 cs.SD cs.AI

PitchBench: Measuring Pitch Hearing in Audio-Language Models

PitchBench: 测量音频-语言模型中的音高听觉能力

Milan Liessens Dujardin, Song-Ze Yu, Craver Corbyn Thomas-Smith, David M. Chan, Karina Nguyen

AI总结 提出PitchBench评估套件,通过28个实验系统测量音频-语言模型在绝对和相对音高感知上的表现,发现当前模型在不同声源、音长和格式下音高感知不可靠。

Comments Preprint

详情
AI中文摘要

音频-语言模型(ALMs)越来越多地用于需要理解音乐的实际应用,从音乐辅导和转录到字幕、推荐系统和音乐制作。更广泛地说,它们正在成为多模态AI系统的重要组成部分,这些系统必须从感官输入而非仅文本进行推理。这使得可靠的音乐感知成为关键前提:如果模型无法准确听到声音的结构,就不能信任它来推理、教学、转录或对现实世界中的音频采取行动。然而,现有的基准测试很少评估这种感知背后最基本的音乐能力之一:音高听觉。当前的评估往往通过更高层次的任务间接探测音高听觉,且通常采用多项选择格式,这留下了ALMs在不同乐器、声学条件和响应格式下识别细粒度音高的可靠性问题。我们引入了PitchBench,一个系统测量ALMs音高听觉的评估套件。PitchBench包含28个实验,涵盖序列和和弦中的绝对和相对音高感知,同时变化响度、音符时长、声源、时间拉伸、背景噪声和其他声学条件。任务范围从识别孤立音高到在四声部音乐织体中跟踪旋律线。评估前沿ALMs,我们发现音高听觉仍然非常不可靠:模型在不同设置下表现持续不佳,准确率随声源、音符时长和记谱格式急剧变化。当前的ALMs尚未具备稳定的音高感知,即使对于受控的合成和乐器刺激也是如此。除了基准测试,我们还发布了PitchBench作为Python包,包含评估数据和数据生成工具,以支持未来关于音高感知音频-语言建模的工作。

英文摘要

Audio-language models (ALMs) are increasingly used in real-world applications that require understanding music, from music tutoring and transcription to captioning, recommendation systems, and music production. More broadly, they are becoming an important component of multimodal AI systems that must reason from sensory input rather than text alone. This makes reliable musical perception a critical prerequisite: if a model cannot accurately hear the structure of sound, it cannot be trusted to reason about, teach, transcribe, or act on audio in the real world. Yet existing benchmarks rarely assess one of the most fundamental musical abilities underlying such perception: pitch hearing. Current evaluations tend to probe pitch hearing only indirectly, through higher-level tasks and often in multiple-choice formats, leaving open how reliably ALMs identify fine-grained pitch across instruments, acoustic conditions, and response formats. We introduce PitchBench, an evaluation suite that systematically measures pitch hearing in ALMs. PitchBench comprises 28 experiments spanning absolute and relative pitch perception within sequences and chords, while varying loudness, note duration, sound source, time stretching, background noise, and other acoustic conditions. Tasks range from identifying individual pitches in isolation to tracking a melodic line within a four-part musical texture. Evaluating frontier ALMs, we find that pitch hearing remains highly unreliable: models perform consistently poorly across settings, with accuracy varying sharply by sound source, note duration, and notation format. Current ALMs do not yet possess stable pitch perception, even for controlled synthetic and instrumental stimuli. Alongside the benchmark, we release PitchBench as a Python package containing the evaluation data and data generation tools to support future work on pitch-aware audio-language modeling.
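
For readers unfamiliar with the distinction between absolute and relative pitch, the standard equal-temperament relations can be written down in a few lines. This sketch assumes the common A4 = 440 Hz reference and MIDI note numbering; the function names are illustrative and are not taken from the PitchBench package.

```python
import math

A4_MIDI = 69      # MIDI note number of A4
A4_HZ = 440.0     # assumed tuning reference


def midi_to_hz(note: int) -> float:
    """Absolute pitch: frequency of a MIDI note in 12-tone equal temperament."""
    return A4_HZ * 2.0 ** ((note - A4_MIDI) / 12.0)


def interval_semitones(f_low: float, f_high: float) -> int:
    """Relative pitch: distance between two frequencies, rounded to semitones."""
    return round(12.0 * math.log2(f_high / f_low))


print(midi_to_hz(60))                                   # middle C
print(interval_semitones(midi_to_hz(60), midi_to_hz(67)))  # a perfect fifth
```

An absolute-pitch task asks for the note itself, while a relative-pitch task asks only for the interval, which is invariant to transposition.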

2605.26175 2026-05-27 cs.LG cs.AI

InfoQuant: Shaping Activation Distributions for Low-Bit LLM Quantization

InfoQuant:为低比特LLM量化塑造激活分布

Ke Li, Dong An, Xiaoling Zang, Can Ye, Liang Xie, Qibo Qiu, Chen Shen, Xiaofei He, Wenxiao Wang

AI总结 针对低比特激活量化中分布与量化器不匹配的问题,提出基于信息论的分析和无需训练的峰值抑制正交变换(PSOT)方法,显著提升量化精度。

详情
AI中文摘要

低比特激活量化仍然是高效大语言模型(LLM)部署的主要瓶颈。难点不仅在于激活值包含异常值,还在于其分布通常与低比特均匀量化器不匹配。现有的训练后量化(PTQ)方法抑制峰值、平衡通道或最小化重建误差,但很少明确说明什么样的激活分布实际上易于离散化。因此,激活值可能在数值上更平滑,但仍会产生较大的量化误差,因为量化范围仍然很宽,或者大多数值坍缩到均值附近的几个水平。我们将激活变换重新表述为面向量化器的分布设计,并从信息论角度分析量化误差。我们的分析表明,有利于量化的激活值应同时具有较小的数值范围和在该范围内的足够分散性。在此分析指导下,我们提出InfoQuant,一种无需训练的方法,采用峰值抑制正交变换(PSOT)将激活值塑造成更有利于量化的分布。我们进一步引入自适应异常值标记选择,以提高PSOT在优化过程中的鲁棒性。在多个LLM家族中,InfoQuant始终优于先前的PTQ和端到端训练基线。在W4A4KV4下,它平均保留了97%的浮点精度,并将LLaMA-2 13B的性能差距较先前最先进方法缩小了42%。代码可在[https://github.com/LLIKKE/InfoQuant](https://github.com/LLIKKE/InfoQuant)获取。

英文摘要

Low-bit activation quantization remains a major bottleneck in efficient large language model (LLM) deployment. The difficulty is not only that activations contain outliers, but that their distributions are often poorly matched to a low-bit uniform quantizer. Existing post-training quantization (PTQ) methods suppress peaks, balance channels, or minimize reconstruction error, yet they rarely specify what activation distribution is actually easy to discretize. As a result, activations may appear numerically smoother while still incurring large quantization error because the quantization range remains wide or most values collapse into a few levels near the mean. We recast activation transformation as quantizer-facing distribution design and analyze quantization error from an information-theoretic perspective. Our analysis shows that quantization-friendly activations should jointly have a smaller numerical range and sufficient dispersion within that range. Guided by this analysis, we propose InfoQuant, a train-free method that employs Peak Suppression Orthogonal Transformation (PSOT) to shape activations into more quantization-friendly distributions. We further introduce adaptive outlier-token selection to improve the robustness of PSOT during optimization. Across multiple LLM families, InfoQuant consistently outperforms prior PTQ and end-to-end training baselines. Under W4A4KV4, it preserves 97% of floating-point accuracy on average and reduces the LLaMA-2 13B performance gap by 42% over the previous state of the art. Code is available at [https://github.com/LLIKKE/InfoQuant](https://github.com/LLIKKE/InfoQuant)
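
The failure mode named in the abstract, where a wide range collapses most values into a few levels, is easy to reproduce with a plain min-max uniform quantizer. The sketch below is a toy illustration only: it does not implement PSOT or any part of InfoQuant, and the helper names are hypothetical.

```python
def uniform_quantize(values, bits=4):
    """Min-max uniform quantization; returns dequantized values and levels used."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return list(values), 1
    scale = (hi - lo) / (2 ** bits - 1)
    codes = [round((v - lo) / scale) for v in values]
    dequantized = [lo + c * scale for c in codes]
    return dequantized, len(set(codes))


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Well-dispersed activations in a narrow range [-1, 1].
even = [i / 50 - 1 for i in range(101)]
# The same activations plus a single large outlier that widens the range.
with_outlier = even + [50.0]

deq_even, used_even = uniform_quantize(even)
deq_out, used_out = uniform_quantize(with_outlier)
print("levels used:", used_even, "vs", used_out)
print("MSE:", mse(even, deq_even), "vs", mse(with_outlier, deq_out))
```

With the outlier present, nearly all values share the lowest one or two codes, so the error grows even though only one element changed. This is the range-versus-dispersion trade-off the abstract describes.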

2605.26172 2026-05-27 cs.LG

ARBITER: Reasoning Trajectory Basins and Majority Vote Failures in Test-Time Sampling

ARBITER:测试时采样中的推理轨迹盆地与多数投票失败

Meng Cai, Lars Kulik, Farhana Choudhury

AI总结 本文发现语言模型测试时采样的推理轨迹会聚集成少数“推理盆地”,导致多数投票选择最稳定而非最准确的盆地,并提出ARBITER方法通过保守加性证据修正共识,从样本池中恢复部分正确性。

Comments Preprint. 34 pages, 2 figures

详情
AI中文摘要

当语言模型使用测试时采样时,它们会生成多个推理轨迹并通过多数投票选择答案。我们证明这些轨迹并非独立:对于给定问题,它们会聚集成少数几个簇,即推理盆地,每个盆地由归一化的最终答案和达到该答案的解决方案定义。因此,多数投票选择的是最稳定的盆地而非最准确的盆地,这导致错误多数失败,即正确答案存在但被否决。我们提出ARBITER,一种模型无关的方法,仅使用基础模型自身的采样输出、隐藏状态和派生证据来建模盆地之间的交互。大多数直接纠正策略失败;ARBITER则在共识之上使用保守的加性证据。在其最简单的无参数形式中,ARBITER-Δ将同模型证据添加到多数先验中,而ARBITER-Enc则通过来自完整解决方案的隐藏状态的有界残差信号增强这一过程。在GSM8K上使用Qwen3-4B,K=24个样本的共识达到约94%中段,而同池top-2 oracle达到约96%中段。ARBITER在不使用外部信息的情况下恢复了这些案例的一个子集。在三个模型系列和三个数学基准上,它带来了一致的提升,且没有净负例;例如,在Llama-3.1-8B MMLU-HS-Math上,它将准确率从约78%中段提高到约82%中段,恢复了约22%的可用oracle余量,表明该余量可以从样本池本身部分恢复。

英文摘要

When language models use test-time sampling, they generate multiple reasoning trajectories and select an answer by majority vote. We show that these trajectories are not independent: for a given question, they concentrate into a small number of clusters, or reasoning basins, each defined by a normalized final answer and the solutions that reach it. A majority vote therefore selects the most stable basin rather than the most accurate one, which creates wrong-majority failures where the correct answer is present but outvoted. We introduce ARBITER, a model-agnostic approach that models interactions between basins using only the base model's own sampled outputs, hidden states, and derived evidence. Most direct correction strategies fail; ARBITER instead uses conservative additive evidence on top of consensus. In its simplest parameter-free form, ARBITER-Δ adds same-model evidence to the majority prior, while ARBITER-Enc augments this with bounded residual signals from hidden states over complete solutions. On GSM8K with Qwen3-4B, consensus over K=24 samples achieves around the mid-94% range, while a same-pool top-2 oracle reaches around the mid-96% range. ARBITER recovers a subset of these cases using zero external information. Across three model families and three math benchmarks, it yields consistent gains with no net-negative cases; for example, on Llama-3.1-8B MMLU-HS-Math, it improves accuracy from the mid-78% range to the mid-82% range, recovering about 22% of the available oracle headroom, indicating that this headroom can be partially recovered from the sample pool itself.
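
The wrong-majority failure and the top-2 oracle mentioned above can be made concrete with a short sketch. It groups sampled answers into basins by normalized final answer only, which is a simplification of the abstract's definition, and it does not implement ARBITER's evidence model. All function names are hypothetical.

```python
from collections import Counter


def normalize(answer: str) -> str:
    """Crude answer normalization (an assumption; real pipelines do more)."""
    return answer.strip().lower().rstrip(".")


def basins(answers):
    """Count how many sampled trajectories end in each normalized answer."""
    return Counter(normalize(a) for a in answers)


def majority_vote(answers) -> str:
    return basins(answers).most_common(1)[0][0]


def top2_oracle_hit(answers, gold: str) -> bool:
    """True if the gold answer is among the two largest basins."""
    top2 = [a for a, _ in basins(answers).most_common(2)]
    return normalize(gold) in top2


samples = ["18", "18", "18.", "20", "20 "]
gold = "20"
print(majority_vote(samples) == normalize(gold))  # majority vote is wrong here
print(top2_oracle_hit(samples, gold))             # but the answer is in the pool
```

Cases where the vote fails but the oracle succeeds are exactly the headroom that a same-pool method can hope to recover.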

2605.26171 2026-05-27 cs.LG

When Rule Violations Are Rare: Chimera Training for Logical Anomaly Detection

当规则违反罕见时:用于逻辑异常检测的嵌合体训练

Alejandro Ascarate, Leo Lebrat, Rodrigo Santa Cruz, Clinton Fookes, Olivier Salvado

AI总结 针对规则违反样本稀少的逻辑异常检测,提出嵌合体训练方法,通过特征级操作数反事实构造生成监督信号,提升规则级异常检测性能。

Comments 9+30 pages, 4+4 figures, under review

详情
AI中文摘要

许多实际异常不仅仅是罕见的输入,而是语义约束的违反:对象以结构化方式共现,动作蕴含前提条件,事件满足时间或关系规律。我们研究这种设置下的异常检测,其中约束以学习到的视觉概念上的逻辑规则形式给出,但训练期间真实规则违反罕见或缺失。我们提出一种神经规则评估器,将每个约束编译成有向无环图,并为其内部逻辑运算符学习特征感知的子树MLP门。每个门将子特征和边级否定映射到父表示和规则满足概率,并通过基于真实概念标签的精确布尔传播获得中间监督。关键困难在于同图像训练数据通常无法提供信息性真值配置的充分覆盖,并允许捷径解。为解决此问题,我们引入嵌合体训练:在特征级别进行操作数级反事实构造。我们不混合输入图像,而是连接来自不同样本的子树特征;每个操作数保留其来源样本的硬真值标签,并通过将节点的逻辑运算符应用于这些继承标签来获得嵌合体目标。这提供了监督逻辑反例,而无需真实异常图像。在CLEVRER、OpenImages和VidOR上,所得到的评估器在规则级异常AUROC上优于独立事件和同图像语义训练基线,特别是对于组合和关系规则。该方法产生标量异常分数和规则级归因。

英文摘要

Many practical anomalies are not merely rare inputs, but violations of semantic constraints: objects co-occur in structured ways, actions imply preconditions, and events satisfy temporal or relational regularities. We study anomaly detection in this setting, where constraints are given as logical rules over learned visual concepts, but real rule violations are rare or absent during training. We propose a neural rule evaluator that compiles each constraint into a directed acyclic graph and learns feature-aware subtree MLP gates for its internal logical operators. Each gate maps child features and edge-level negations to a parent representation and a rule-satisfaction probability, with intermediate supervision obtained from exact Boolean propagation over ground-truth concept labels. The key difficulty is that same-image training data often provide insufficient coverage of informative truth configurations and also allow shortcut solutions. To address this, we introduce chimera training: an operand-level counterfactual construction at the feature level. Instead of mixing input images, we concatenate subtree features from different samples; each operand keeps the hard truth label of the sample it came from, and the chimera target is obtained by applying the node's logical operator to those inherited labels. This supplies supervised logical counterexamples without requiring real anomalous images. Across CLEVRER, OpenImages, and VidOR, the resulting evaluator improves rule-level anomaly AUROC over independent-events and same-image semantic-training baselines, especially for compositional and relational rules. The method yields both scalar anomaly scores and rule-level attributions.
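
The chimera construction for a single binary gate can be sketched as follows. This is a minimal illustration assuming two operands and list-valued features; the function name, the `negate_right` flag, and the returned tuple layout are hypothetical and not taken from the paper's code.

```python
OPERATORS = {
    "and": lambda a, b: a and b,
    "or": lambda a, b: a or b,
}


def make_chimera(left, right, op, negate_right=False):
    """Build one chimera training example for a binary logical gate.

    `left` and `right` are (feature_vector, truth_label) pairs for operand
    subtrees that may come from different samples. Features are concatenated;
    the target is the operator applied to the inherited hard labels.
    """
    left_feat, left_true = left
    right_feat, right_true = right
    effective_right = (not right_true) if negate_right else right_true
    features = list(left_feat) + list(right_feat)
    target = OPERATORS[op](left_true, effective_right)
    return features, negate_right, target


# Operand A is true in sample 1, operand B is false in sample 2.
operand_a = ([0.1, 0.2], True)
operand_b = ([0.9], False)
print(make_chimera(operand_a, operand_b, "and"))
print(make_chimera(operand_a, operand_b, "and", negate_right=True))
```

Because operands are paired across samples, truth configurations that never co-occur in a single image, such as a violated conjunction, still receive supervised targets.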

2605.26167 2026-05-27 cs.LG cs.AI math.DS math.RA

Planning Neural Dynamics with Lie Group Embedding through Supervised Projective Manifold Learning

通过监督投影流形学习进行李群嵌入的神经动力学规划

Tianwei Wang, Bryan Chen, Qian Zuo, Qiyue Xia, Xin Li, Wei Pang

AI总结 提出李群嵌入动力神经网络(LieEDNN),通过梯度下降和流形上的度量投影实现可学习且稳定的动力学,解决李群与神经网络加法不兼容及非线性表示空间中的演化问题,并在SE(3)伸缩机械臂上验证。

Comments Preprint. Under review

详情
AI中文摘要

我们提出了李群嵌入动力神经网络(LieEDNN)以及基于梯度下降和光滑流形上度量投影的相应学习算法,其中我们将李群视为流形几何连续对称性的内在表示。因此,我们在底层流形上实现了可学习且稳定的动力学,适用于一般李群,并且能够利用李群(如SO(3)和SE(3))强大的表示能力来解决机器人、图形和控制等领域的实际工程问题。两个核心挑战是:(i)一般李群与加法运算不兼容,而加法是神经网络交互所必需的。(ii)动力学在特殊代数的非线性表示空间中演化,而非正常的欧几里得空间,这违反了常见神经常微分方程的范式。为了解决这两个挑战,我们首先引入李代数上的伴随李群作用,它诱导出一个线性映射并转移到权重矩阵的分块结构,使得加法可以在李代数上作为向量空间进行运算。然后我们将李代数和伴随作用参数化为线性变换,从而使架构与神经网络感知器对齐。明确地说,这种嵌入表现为权重上的分块流形约束,我们开发了学习算法,以确保时间神经网络动力学的平衡态具有稳定性保证。我们在特定李群SE(3)上进行了实验,应用场景为伸缩机械臂。

英文摘要

We propose Lie group embedded dynamical neural networks (LieEDNN) and corresponding learning algorithms based on gradient descent and metric projection on smooth manifolds, where we treat a Lie group as an intrinsic representation of the continuous symmetry of manifold geometry. We thereby achieve learnable and stable dynamics on the underlying manifold for general Lie groups, and we can exploit the representational power of Lie groups such as SO(3) and SE(3) to solve real-world engineering problems in areas such as robotics, graphics, and control. Two core challenges are: (i) general Lie groups are incompatible with addition, which is necessary for neural network interactions; (ii) the dynamics evolve in the nonlinear representation space of special algebra rather than the usual Euclidean space, which violates the paradigm of common neural ODEs. To address these two challenges, we first introduce the adjoint Lie group action on the Lie algebra, which induces a linear mapping and transfers to the block-wise structure of the weight matrices, so that addition can operate on the Lie algebra as a vector space. We then parameterize the Lie algebra and the adjoint action as linear transformations so that the architecture is aligned with neural network perceptrons. Explicitly, this embedding appears as block-wise manifold constraints on the weights, and we develop algorithms to learn the equilibrium of the temporal neural network dynamics with stability guarantees. Experiments are conducted on the specific Lie group SE(3), with telescopic manipulators as the application scenario.

2605.26162 2026-05-27 cs.LG cs.AI

On the Push-Based Asynchronous Federated Learning: A Bias-Correction Aggregation Approach

基于推送的异步联邦学习:一种偏差校正聚合方法

Jiahui Bai, Hai Dong, A. K. Qin

AI总结 提出PushCen-ADFL框架,通过中心表示空间中的平均保持推-求和混合与轻量级中心正则化,解决异步去中心化联邦学习中的通信开销、聚合偏差和模型漂移问题。

Comments Accepted at the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2026). This is the extended version with full appendix

详情
AI中文摘要

异步去中心化联邦学习(ADFL)消除了中央协调和全局同步,使其在大规模和异构系统中具有吸引力。然而,频繁的点对点通信、有向拓扑上的异步更新以及非独立同分布数据共同导致了过高的通信开销、有偏聚合和严重的模型漂移。我们提出了PushCen-ADFL,一种通信高效的ADFL框架,能够在非对称通信和延迟客户端参与下实现稳定训练。PushCen-ADFL在共享中心表示空间中耦合了通信、聚合和局部稳定化,形成了压缩与优化之间的闭环。客户端交换中心形式的消息,应用平均保持的推-求和混合来校正聚合偏差,并使用锚定在同一中心空间的轻量级中心正则化来减轻异构性和陈旧性下的漂移。一个有界、发送者去重的缓冲区进一步提高了在异步到达不规则情况下的鲁棒性。在视觉数据集上的实验表明,PushCen-ADFL在数据异构性下将准确率提高了最多6%,同时将每次推送的通信成本降低了80%以上,实现了良好的准确率-通信权衡。

英文摘要

Asynchronous decentralized federated learning (ADFL) eliminates central coordination and global synchronization, making it attractive for large-scale and heterogeneous systems. However, frequent peer-to-peer communication, asynchronous updates on directed topologies, and non-IID data jointly lead to excessive communication overhead, biased aggregation and severe model drift. We propose PushCen-ADFL, a communication-efficient ADFL framework that enables stable training under asymmetric communication and delayed client participation. PushCen-ADFL couples communication, aggregation, and local stabilization in a shared centroid representation space, forming a closed loop between compression and optimization. Clients exchange centroid-form messages, apply average-preserving push-sum mixing to correct aggregation bias, and use a lightweight centroid regularization anchored in the same centroid space to mitigate drift under heterogeneity and staleness. A bounded, sender-deduplicated buffer further improves robustness under irregular asynchronous arrivals. Experiments on vision datasets demonstrate that PushCen-ADFL improves accuracy under data heterogeneity by up to 6% while reducing per-push communication cost by more than 80%, achieving a favorable accuracy-communication trade-off.
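
Push-sum mixing is what lets nodes on a directed graph agree on the true average even though naive averaging would be biased toward nodes with more incoming links. The sketch below shows the classical synchronous push-sum iteration on scalars; it is a simplification for illustration and does not model the asynchrony, centroid-form messages, or buffering of PushCen-ADFL. The graph and function name are hypothetical.

```python
def push_sum(values, out_neighbors, rounds=200):
    """Classical push-sum averaging on a directed graph.

    Each node holds a value x and a weight w, splits both equally among its
    out-neighbors and itself, and estimates the average as x / w.
    """
    n = len(values)
    x = list(values)
    w = [1.0] * n
    for _ in range(rounds):
        new_x = [0.0] * n
        new_w = [0.0] * n
        for i in range(n):
            targets = out_neighbors[i] + [i]  # the node keeps one share itself
            share = 1.0 / len(targets)
            for j in targets:
                new_x[j] += share * x[i]
                new_w[j] += share * w[i]
        x, w = new_x, new_w
    return [xi / wi for xi, wi in zip(x, w)]


# Directed, strongly connected graph: 0 -> 1, 1 -> 2, 2 -> 0 and 2 -> 1.
graph = [[1], [2], [0, 1]]
print(push_sum([1.0, 2.0, 6.0], graph))
```

The sums of `x` and of `w` are preserved in every round, so the ratio `x / w` at each node tends to the global mean; dividing by the weight is what removes the bias introduced by the asymmetric topology.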

2605.26161 2026-05-27 cs.LG cs.AI

TSFMAudit: Data Contamination Auditing in Forecasting Time Series Foundation Models

TSFMAudit: 时间序列基础模型中的数据污染审计

Hongkai Li, Shifeng Xie, Lefei Shen, Zhuo Li, Mouxiang Chen, Xiaobin Zhang, Han Fu, Jianling Sun, Xiaoxue Ren, Chenghao Liu

AI总结 针对时间序列基础模型(TSFMs)预训练数据污染问题,提出基于探针适应动力学的审计方法TSFMAudit,通过检测微调后损失下降更快且骨干网络移动更小的异常现象来识别污染数据集。

Comments 22 pages, 7 figures, 9 tables

详情
AI中文摘要

时间序列基础模型(TSFMs)越来越多地在大型语料库上进行预训练,这引发了评估数据集可能在预训练期间被暴露从而导致过于乐观的性能估计的担忧。在时间序列中审计此类污染具有挑战性,因为信号是连续且异质的,并且通常缺乏语料库文档。据我们所知,这是第一个研究TSFMs预训练污染审计的工作。我们形式化了TSFMs的预训练污染审计问题,并提出了TSFMAudit,一种基于探针适应动力学的方法。我们的关键直觉是,污染表现为异常高效的适应:在微调探针后,受污染的数据集往往表现出更快的损失减少和更小的骨干网络移动。我们在6个TSFMs和187个数据集上评估了TSFMAudit,使用文档化的训练来源证据作为监督,并与从LLM文献中改编的10个竞争基线进行了比较。

英文摘要

Time series foundation models (TSFMs) are increasingly pretrained on large corpora, raising concerns that evaluation datasets may have been exposed during pretraining and thus yield overly optimistic performance estimates. Auditing such contamination is challenging in time series because signals are continuous and heterogeneous, and corpus documentation is often lacking. To the best of our knowledge, this is the first work to study pretraining contamination auditing for TSFMs. We formalize the problem of pretraining contamination auditing for TSFMs and propose TSFMAudit, a method based on probe adaptation dynamics. Our key intuition is that contamination manifests as unusually efficient adaptation: after a fine-tuning probe, contaminated datasets tend to exhibit faster loss reduction with smaller backbone movement. We evaluate TSFMAudit on 6 TSFMs and 187 datasets using documented training source evidence as supervision, and compare against 10 competitive baselines adapted from the LLM literature.

2605.26155 2026-05-27 cs.RO cs.AI cs.LG

When Does Adaptive Guidance Help? Belief-Aware Privileged Distillation for Autonomous Driving Under Partial Observability

自适应引导何时有帮助?部分可观测条件下自动驾驶的信念感知特权蒸馏

Mehmet Haklidir

AI总结 本文提出信念感知GSAC(BA-GSAC),通过集成分歧动态调节蒸馏系数,系统研究自适应引导在部分可观测自动驾驶中的有效性,发现严重遮挡下系数过早崩溃,并揭示可观测性盲区问题。

Comments 9 pages, 3 figures, 7 tables. Accepted at CVPR 2026 Workshop on Autonomous Driving (WAD)

详情
AI中文摘要

引导软演员-评论家(GSAC)将来自特权全状态教师的知识蒸馏给部分可观测的学生,用于自动驾驶,但使用固定的蒸馏系数λ,而不考虑智能体的不确定性。我们提出信念感知GSAC(BA-GSAC),通过集成分歧调节λ,并将其作为系统实证研究的测试平台,探究:自适应引导何时真正有帮助?在Highway-Env上评估五种策略(固定λ∈{0.01, 0.1}、自适应、线性衰减和普通SAC)在三个POMDP难度级别下,我们发现初步的单种子运行表明在轻度和中度部分可观测性下有收益,但在严重遮挡下(所有方法使用3个种子评估),自适应系数在大约3K步内坍缩到λ_min。我们将其归因于可观测性盲区现象:由于集成预测部分观测,即使在严重遮挡下也能达到低分歧,建模了可见部分但无法检测缺失部分。我们诊断了根本原因并提出了架构修复(使用引导演员的特权访问在完整状态预测上训练集成);虽然此处未验证,但我们表明即使存在当前限制,预热阶段也提供了可测量的稳定性(CV=13.3% vs. 常数λ=0.01的29.8%)。实际上,简单的确定性线性衰减计划在所有指标上实现了最佳的严重POMDP性能(均值116.5,CV=8.9%),表明稳定性收益来自调度效应而非集成。这些发现为设计不确定性感知的师生框架提供了实用指导,并强调了集成预测目标是一个重要的设计选择。

英文摘要

Guided Soft Actor-Critic (GSAC) distills knowledge from a privileged full-state teacher to a partial-observation student for autonomous driving, but uses a fixed distillation coefficient lambda regardless of the agent's uncertainty. We present Belief-Aware GSAC (BA-GSAC), which modulates lambda via ensemble disagreement, and use it as a testbed for a systematic empirical study asking: when does adaptive guidance actually help? Evaluating five strategies (fixed lambda in {0.01, 0.1}, adaptive, linear decay, and vanilla SAC) across three POMDP difficulty levels on Highway-Env, we find that preliminary single-seed runs suggest benefits under mild and moderate partial observability, but under severe occlusion (evaluated with 3 seeds for all methods) the adaptive coefficient collapses to lambda_min within about 3K steps. We trace this to an observability blindness phenomenon: because the ensemble predicts partial observations, it achieves low disagreement even under heavy occlusion, modeling what is visible but unable to detect what is missing. We diagnose the root cause and propose an architectural fix (training the ensemble on full-state predictions using the guiding actor's privileged access); while not validated here, we show that even with current limitations, the warmup phase provides measurable stabilization (CV=13.3% vs. 29.8% for constant lambda=0.01). In fact, a simple deterministic linear decay schedule achieves the best severe-POMDP performance across all metrics (mean 116.5, CV=8.9%), suggesting that the scheduling effect, not the ensemble, drives the stability benefit. These findings provide practical guidance for designing uncertainty-aware teacher-student frameworks and highlight ensemble prediction targets as an important design choice.
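
The two quantities the abstract relies on, a linear decay schedule for lambda and the coefficient of variation (CV) used to report stability, can be sketched as below. The endpoints `lam_start` and `lam_min` are placeholder values, and the use of the sample standard deviation in the CV is an assumption, since the abstract specifies neither.

```python
import statistics


def linear_decay(step, total_steps, lam_start=0.1, lam_min=0.01):
    """Deterministic schedule: interpolate lambda from lam_start to lam_min."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return lam_start + frac * (lam_min - lam_start)


def coefficient_of_variation(returns):
    """CV = standard deviation / mean, here with the sample standard deviation."""
    return statistics.stdev(returns) / statistics.mean(returns)


for step in (0, 50_000, 100_000):
    print(step, round(linear_decay(step, 100_000), 4))

# Hypothetical per-seed returns, only to show how a CV figure is obtained.
print(round(100 * coefficient_of_variation([100.0, 110.0, 120.0]), 1), "%")
```

Unlike the ensemble-driven coefficient, this schedule cannot collapse early, because it depends only on the step count and not on a disagreement signal.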

2605.26147 2026-05-27 cs.LG

Neural Bayesian Sequential Routing

神经贝叶斯序列路由

Yongchao Huang

AI总结 提出神经贝叶斯序列路由(NBSR)框架,将神经推理建模为有向无环图上的主动证据累积,通过狄利克雷-分类共轭框架实现不确定性量化、早期退出和资源理性推理。

Comments 71 pages

详情
AI中文摘要

人类决策是序列化的且具有不确定性意识,然而标准神经网络通常依赖静态、密集的前向计算,对证据获取、不确定性演化或何时停止计算的可视性有限。我们引入了神经贝叶斯序列路由(NBSR),这是一个将神经推理建模为层次化有向无环图(DAG)上的主动证据累积的框架。在狄利克雷-分类共轭框架内,神经专家查询一个持久的全局知识预言机以提取正证据向量,这些向量作为伪计数,通过精确共轭加法更新狄利克雷信念状态。结合Gumbel-Softmax直通估计器,该更新实现了硬性、路径依赖的路由,同时保留用于端到端训练的代理梯度。由此产生的狄利克雷精度和熵为不确定性量化、基于熵的早期退出、分布外(OOD)弃权以及成本感知的证据获取提供了机制。我们证明,在严格正证据提取下,总狄利克雷精度沿任何有效轨迹单调增加,边际预测方差有界,形式化了序列“假设锐化”;在理想容量和优化假设下,终端狄利克雷期望恢复贝叶斯最优条件分布。在视觉分类、结构化医学诊断、语言建模、部分可观测控制以及成本感知贝叶斯实验设计上的实证评估表明,NBSR在提供透明的路由轨迹、路径依赖的证据归因、不确定性感知的决策控制以及资源理性推理的同时,实现了具有竞争力的预测性能。总体而言,NBSR为可解释、模块化和资源理性的智能体AI提供了一个数学上坚实的框架。

英文摘要

Human decision-making is sequential and uncertainty-aware, yet standard neural networks often rely on static, dense forward computation with limited visibility into evidence acquisition, uncertainty evolution, or when computation should stop. We introduce Neural Bayesian Sequential Routing (NBSR), a framework that models neural inference as active evidence accumulation over a hierarchical Directed Acyclic Graph (DAG). Within a Dirichlet-Categorical conjugate framework, neural experts query a persistent global knowledge oracle to extract positive evidence vectors, which act as pseudo-counts and update a Dirichlet belief state by exact conjugate addition. Coupled with a Gumbel-Softmax Straight-Through estimator, this update enables hard, path-dependent routing while preserving surrogate gradients for end-to-end training. The resulting Dirichlet precision and entropy provide mechanisms for uncertainty quantification, entropy-based early exiting, OOD abstention, and cost-aware evidence acquisition. We prove that, under strictly positive evidence extraction, total Dirichlet precision increases monotonically along any valid trajectory and marginal predictive variance is bounded, formalizing sequential "hypothesis sharpening"; under idealized capacity and optimization assumptions, the terminal Dirichlet expectation recovers the Bayes-optimal conditional distribution. Empirical evaluations across visual categorization, structured medical diagnosis, language modeling, partially observable control, and cost-aware Bayesian experimental design show that NBSR achieves competitive predictive performance while providing transparent routing traces, path-dependent evidence attribution, uncertainty-aware decision control, and resource-rational inference. Overall, NBSR offers a mathematically grounded framework for interpretable, modular, and resource-rational agentic AI.
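
The conjugate update at the core of this framework is plain addition of pseudo-counts, which is why the total precision can only grow when evidence is strictly positive. The sketch below illustrates that update together with an entropy-based early exit. As a simplification it thresholds the entropy of the expected categorical distribution rather than the Dirichlet entropy itself, and the threshold, function names, and evidence values are hypothetical.

```python
import math


def update(alpha, evidence):
    """Exact conjugate addition: Dirichlet(alpha) -> Dirichlet(alpha + evidence)."""
    if any(e <= 0 for e in evidence):
        raise ValueError("evidence must be strictly positive")
    return [a + e for a, e in zip(alpha, evidence)]


def expected_probs(alpha):
    total = sum(alpha)  # total Dirichlet precision
    return [a / total for a in alpha]


def predictive_entropy(alpha):
    """Entropy (in nats) of the mean categorical distribution."""
    return -sum(p * math.log(p) for p in expected_probs(alpha))


def route(alpha, evidence_stream, entropy_threshold=0.7):
    """Accumulate evidence until the belief is sharp enough, then exit early."""
    steps = 0
    for evidence in evidence_stream:
        if predictive_entropy(alpha) < entropy_threshold:
            break
        alpha = update(alpha, evidence)
        steps += 1
    return alpha, steps


belief, used = route([1.0, 1.0, 1.0], [[5.0, 0.1, 0.1]] * 4)
print("experts consulted:", used)
print("precision:", sum(belief))
print("mean prediction:", [round(p, 3) for p in expected_probs(belief)])
```

Each call to `update` raises `sum(alpha)` by the sum of the evidence vector, which mirrors the monotone-precision property stated in the abstract.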

2605.26136 2026-05-27 cs.SD cs.AI

Eroding Trust in Real Speech: A Large-Scale Study of Human Audio Deepfake Perception

侵蚀对真实语音的信任:人类音频深度伪造感知的大规模研究

Nicolas M. Müller, Wei Herng Choong

AI总结 通过大规模听辨实验(1768名参与者,35532次判断),发现音频深度伪造导致人类对真实语音的信任下降(准确率从72.7%降至64.1%),而非检测伪造能力下降。

详情
AI中文摘要

音频深度伪造近期发展迅速,但其对人类信任真实语音的影响尚未被研究。我们进行了迄今为止最大规模的音频深度伪造感知听辨研究,收集了来自1768名参与者对138个文本转语音和语音转换系统的35532次判断。我们的核心发现是怀疑偏移:与2021年的基线相比,人类对伪造样本的准确率几乎没有变化(72.9%降至71.2%),但对真实样本的准确率从72.7%降至64.1%。参与者并非更难以检测合成伪影,而是越来越不信任真实的语音。由商业和自回归语言模型系统生成的样本最难检测(61.3-65.9%),而传统seq2seq和流匹配模型生成的样本仍然较易识别(75.4-76.8%)。作为参考的机器学习检测器在所有条件下保持超过94.5%的准确率。我们的结果表明,现代深度伪造的主要威胁可能不仅仅是欺骗,而是对真实语音信任的侵蚀。

英文摘要

Audio deepfakes have improved rapidly recently, yet their effect on human trust in real speech remains unstudied. We present the largest listening study on audio deepfake perception to date, collecting 35,532 judgments from 1,768 participants across 138 text-to-speech and voice conversion systems. Our central finding is a skepticism shift: compared to a 2021 baseline, human accuracy on fake samples barely changed (72.9% to 71.2%), but accuracy on real samples dropped from 72.7% to 64.1%. Participants are not worse at detecting synthesis artifacts; rather, they increasingly distrust authentic speech. Samples generated by commercial and autoregressive language model systems proved hardest to detect (61.3 - 65.9%), while those from traditional seq2seq and flow-matching models remain easier to spot (75.4 - 76.8%). An ML detector that served as a reference point maintained over 94.5% accuracy across all conditions. Our results suggest that the primary threat posed by modern deepfakes may not be mere deception, but the erosion of trust in genuine audio.

2605.26135 2026-05-27 cs.LG

SilIF: Silhouette-Augmented Isolation Forest for Unsupervised Transaction Fraud Detection

SilIF:基于轮廓增强的隔离森林用于无监督交易欺诈检测

Venkatakrishnan Gopalakrishnan

AI总结 提出SilIF方法,通过添加基于轮廓得分的层次增强隔离森林,在IEEE-CIS欺诈检测基准上平均AUC-PR提升0.0080,并在五个种子中均优于原始隔离森林。

Comments 5 pages, 1 figure, 5 tables. Code: https://github.com/venkat15vk/silif-anomaly-detection

详情
AI中文摘要

无监督异常检测广泛应用于标签稀缺的交易欺诈检测中。隔离森林(IF)因其可扩展性和易于部署而成为最流行的经典方法之一。我们提出了SilIF,一种隔离森林的增强方法,它在森林树诱导的表示空间中添加了一个基于轮廓得分的计算层。对于每个点,我们提取每棵树路径长度的向量,将这些“指纹”聚类成结构组,并计算轮廓得分,衡量该点与其分配组的匹配程度相对于最近替代组。轮廓信号通过单个超参数alpha与基础IF得分结合。在IEEE-CIS欺诈检测基准(约59万笔交易,3.5%欺诈)上,alpha=1.0的SilIF在五个种子上平均AUC-PR比普通隔离森林提高0.0080,且SilIF在所有五个种子上获胜(配对t检验p=0.046)。我们还在合成信用卡数据集(Sparkov)上报告了结果,其中轮廓增强并未优于普通IF,并描述了区分两种结果的条件。本文提出了SilIF作为隔离森林的一种可调、易于部署的增强方法,并诚实报告了其何时有效何时无效。代码见https://github.com/venkat15vk/silif-anomaly-detection。

英文摘要

Unsupervised anomaly detection is widely used in transaction fraud detection where labels are scarce. Isolation Forest (IF) is among the most popular classical methods due to its scalability and ease of deployment. We propose SilIF, an augmentation of Isolation Forest that adds a silhouette-based scoring layer computed in a representation space induced by the trees of the forest. For each point, we extract a vector of per-tree path lengths, cluster these "fingerprints" into structural groups, and compute a silhouette score that measures how well the point fits its assigned group versus the nearest alternative. The silhouette signal is combined with the base IF score via a single hyperparameter alpha. On the IEEE-CIS Fraud Detection benchmark (~590K transactions, 3.5% fraud), SilIF with alpha=1.0 improves over plain Isolation Forest by +0.0080 AUC-PR on average across five seeds, with SilIF winning on all five seeds (paired t-test p=0.046). We also report results on a synthetic credit-card dataset (Sparkov) where the silhouette augmentation does not improve over plain IF, and we characterize the conditions that distinguish the two outcomes. The paper presents SilIF as a tunable, easy-to-deploy enhancement to Isolation Forest with honest reporting of when it helps and when it does not. Code at https://github.com/venkat15vk/silif-anomaly-detection.
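
The silhouette layer can be illustrated on toy fingerprints without any forest at all. In the sketch below the fingerprints are hand-written vectors standing in for per-tree path lengths, and the way the silhouette is folded into the base score (`combined_score`) is an assumed form chosen for illustration; the abstract only states that a single hyperparameter alpha controls the combination.

```python
import math


def mean_distance(point, group):
    return sum(math.dist(point, q) for q in group) / len(group)


def silhouette(point, own_group, other_groups):
    """Standard silhouette: (b - a) / max(a, b).

    `own_group` holds the other members of the point's cluster; `other_groups`
    is a list of the remaining clusters.
    """
    a = mean_distance(point, own_group)
    b = min(mean_distance(point, g) for g in other_groups)
    denom = max(a, b)
    return 0.0 if denom == 0 else (b - a) / denom


def combined_score(base_score, sil, alpha=1.0):
    """Assumed combination: a poor fit (low silhouette) raises the anomaly score."""
    return base_score + alpha * (1.0 - sil) / 2.0


fingerprint = (0.0, 0.0)            # toy stand-in for per-tree path lengths
own = [(0.0, 1.0), (1.0, 0.0)]      # other members of the assigned group
others = [[(3.0, 4.0)]]             # nearest alternative group
s = silhouette(fingerprint, own, others)
print(s, combined_score(0.5, s, alpha=1.0))
```

Setting `alpha=0` recovers the plain Isolation Forest score, which makes the augmentation easy to ablate.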

2605.26133 2026-05-27 cs.CL cs.AI cs.LG

Pretraining Data Exposure in Large Language Models: A Survey of Membership Inference, Data Contamination, and Security Implications

大型语言模型中的预训练数据暴露:成员推断、数据污染及安全影响综述

Ziyi Tong, Feifei Sun, Le Minh Nguyen

AI总结 本文首次统一综述了大型语言模型中的预训练数据暴露问题,涵盖成员推断和数据污染,形式化定义了暴露级别,回顾了攻击与防御方法,并总结了实证发现及未来研究方向。

Comments accepted by NLDB 2025

详情
AI中文摘要

大型语言模型(LLMs)已成为NLP中的主导范式,推动了研究和工业的发展。随着模型规模和预训练数据的增长,由于训练数据集的规模和不可见性,对预训练数据暴露(PDE)的担忧也在增加。PDE指的是确定特定数据是否出现在LLM的预训练语料库中。它对于确保评估完整性和保护隐私至关重要,涉及两个关键领域:数据污染和成员推断。尽管概念上相关,但这些领域通常被孤立研究。本文首次在PDE框架下对两者进行了统一综述。我们形式化了跨暴露级别的PDE,回顾了攻击和防御方法,综合了实证发现,并强调了开放的挑战和未来的研究方向。

英文摘要

Large Language Models (LLMs) have become the predominant paradigm in NLP, advancing both research and industry. As model sizes and pretraining data grow, concerns about Pretraining Data Exposure (PDE) increase due to the scale and opacity of training datasets. PDE refers to determining whether specific data appeared in an LLM's pretraining corpus. It is critical for ensuring evaluation integrity and protecting privacy, intersecting two key areas: data contamination and membership inference. Though conceptually related, these areas have often been studied in isolation. This paper offers the first unified survey of both under the PDE framework. We formalize PDE across exposure levels, review attack and defense methods, synthesize empirical findings, and highlight open challenges and future research directions.

2605.26132 2026-05-27 cs.CL cs.LG

Self-Verified Distillation: Your Language Model Is Secretly Its Own Synthetic Data Pipeline

自验证蒸馏:你的语言模型秘密地就是它自己的合成数据管道

Tony Lee, Percy Liang

AI总结 提出自验证蒸馏算法,让大语言模型仅用无标注种子问题,通过自生成、自验证和自训练提升推理能力,在数学、科学和编程任务上取得显著提升。

详情
AI中文摘要

经过后训练的大语言模型能否仅使用无标注提示,在没有外部教师或工具反馈的情况下进一步提升自己?我们在三个推理领域(数学、科学和编程)中研究这一设置,仅从没有真实解的无标注种子问题开始。我们提出自验证蒸馏,一种简单的后训练精炼算法,其中模型生成这些种子问题的候选解,使用基于提示的自验证进行过滤,并在由此产生的自策展数据集上进行训练。受UQ基准使用多个验证器筛选困难未解问题候选答案的启发,我们将这种基于验证的过滤思想应用于自训练:模型通过三级级联的循环一致性、事实性和正确性检查来过滤自己生成的解,仅当解通过所有阶段且获得一致判断时才被接受。我们发现,在训练数据构建过程中采样更多候选生成并使用更大的验证预算,可以产生更高质量的自策展数据,进而得到更好的推理模型。然后,我们使用自验证蒸馏训练多个规模的Qwen3模型,并在所有三个领域获得收益。对于Qwen3-4B,我们的方法在数学(AIME26和HMMT)上将聚合保留pass@1提升了+16.7个百分点,在科学(GPQA Diamond和HLE)上提升了+11.1个百分点,在编程(LCBv5和LCBv6)上提升了+8.3个百分点,这些收益也扩展到0.6B和8B模型。与我们的仅测试时基线(UQ-TTC)相比,后者通过在推理时花费额外计算来提升性能,自验证蒸馏在大多数设置下实现了更好的性能,同时仅在测试时进行一次推理调用。

英文摘要

Can post-trained large language models (LLMs) further improve themselves using only unlabeled prompts, without external teachers or feedback from tools? We study this setting starting only from unlabeled seed questions with no ground-truth solutions, across three reasoning domains: math, science, and coding. We propose Self-Verified Distillation, a simple post-training refinement algorithm in which the model generates candidate solutions to these seed questions, filters them using prompt-based self-verification, and trains on the resulting self-curated dataset. Inspired by the UQ benchmark's use of multiple validators to screen candidate answers to hard unsolved questions, we adapt this validation-based filtering idea to self-training: the model filters its own generated solutions through a three-stage cascade of cycle-consistency, factuality, and correctness checks, accepting a solution only if it passes all stages with unanimous judge votes. We find that sampling more candidate generations and using a larger verification budget during training data construction produces higher-quality self-curated data and, in turn, better reasoning models. We then train Qwen3 models at multiple scales with Self-Verified Distillation and obtain gains across all three domains. For Qwen3-4B, our method improves aggregate held-out pass@1 by +16.7 points in math (AIME26 and HMMT), +11.1 points in science (GPQA Diamond and HLE), and +8.3 points in coding (LCBv5 and LCBv6), with gains also extending to 0.6B and 8B models. Compared to our test-time-only baseline (UQ-TTC), which improves performance by spending extra compute at inference time, Self-Verified Distillation achieves better performance in most settings while requiring only a single inference call at test time.
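
The acceptance rule described above, a cascade of stages in which every judge must agree, reduces to a nested `all`. The sketch below shows that control flow with toy judges; in the paper the judges are prompt-based calls to the model itself, which are not reproduced here, and all names are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Judge = Callable[[str, str], bool]  # (question, solution) -> accept?


def passes_cascade(question: str, solution: str,
                   stages: Sequence[Sequence[Judge]]) -> bool:
    """Accept only if every judge in every stage votes yes (unanimous)."""
    # all() short-circuits, so later stages run only if earlier ones pass.
    return all(all(judge(question, solution) for judge in judges)
               for judges in stages)


def self_curate(question: str, candidates: Sequence[str],
                stages: Sequence[Sequence[Judge]]) -> List[Tuple[str, str]]:
    """Keep the (question, solution) pairs that survive the full cascade."""
    return [(question, s) for s in candidates
            if passes_cascade(question, s, stages)]


# Toy stand-ins for the cycle-consistency, factuality, and correctness checks.
def mentions_question(q: str, s: str) -> bool:
    return q.split()[0].lower() in s.lower()

def is_non_empty(q: str, s: str) -> bool:
    return bool(s.strip())

def ends_with_answer(q: str, s: str) -> bool:
    return s.strip().endswith("4")

stages = [[mentions_question], [is_non_empty, is_non_empty], [ends_with_answer]]
kept = self_curate("Compute 2 + 2", ["Compute: 2 + 2 = 4", "It is 5", ""], stages)
print(kept)
```

Raising the number of judges per stage trades recall for precision, which matches the abstract's observation that a larger verification budget yields cleaner training data.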

2605.26130 2026-05-27 cs.LG physics.ao-ph

AirCast-SR: A Foundation Model for Kilometer-Scale Atmospheric Super-Resolution via Latent Consistency Diffusion

AirCast-SR: 基于潜在一致性扩散的千米级大气超分辨率基础模型

Somnath Luitel, Manmeet Singh, Joshua Durkee, Abdullah Al Fahad, Naveen Sudharsan, Prabhjot Singh, Cenlin He, Harsh Kamath, Zong-Liang Yang, Krishnagopal Halder, Sandeep Juneja, Parthasarathi Mukhopadhyay, Saptarishi Dhanuka, Amit Kumar Srivastava

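AI summary: Proposes AirCast-SR, a foundation model that uses a latent consistency diffusion framework to downscale global AI weather forecasts from 0.25 degree to 1 km resolution, achieving near-zero bias and zero-shot transfer across regions.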

Comments: Somnath Luitel and Manmeet Singh are equal-contribution co-first authors; Manmeet Singh (manmeet.singh@wku.edu) is the corresponding author.

AI中文摘要

千米尺度的业务天气预报对于传统数值天气预报(NWP)模型而言仍然计算成本过高,限制了需要精细时空细节的能源、农业和灾害管理等应用对预报的获取。本文介绍AirCast-SR,一种用于大气超分辨率的基础模型,将全球AI天气预报从0.25度(约28公里)降尺度至1公里水平分辨率,时间分辨率为每小时,同时生成八个耦合地表变量的67小时预报。AirCast-SR采用三维U-Net,在潜在一致性模型(LCM)扩散框架内进行条件化,使用基于图块(patch)的样本在美国本土(CONUS)上训练,以GraphCast预报为输入,NOAA的校准记录分析(AORC)为目标。该模型在所有变量和预报时效上实现接近零偏差,其径向功率谱密度分析表明,在10公里至100公里波长范围内,精细大气结构得以保留,而较粗模型在此范围内会损失谱功率。我们通过涵盖冬季、夏季和春季的三个CONUS案例研究验证了AirCast-SR,并利用独立地面站观测数据,在无需任何重新训练或微调的情况下,展示了在印度和德国上的零样本全球迁移能力。作为一个开放权重的基础模型,AirCast-SR为千米级AI天气预报建立了新范式,并为区域微调、蒸馏以及气候服务和灾害预报中的下游应用提供了平台。

English abstract

Operational weather prediction at kilometer scales remains computationally prohibitive for traditional numerical weather prediction (NWP) models, limiting forecast access for applications in energy, agriculture, and disaster management that require fine-grained spatiotemporal detail. Here we introduce AirCast-SR, a foundation model for atmospheric super-resolution that downscales global AI weather forecasts from 0.25 degree (~28 km) to 1 km horizontal resolution at hourly temporal resolution, producing 67-hour forecasts of eight coupled surface variables simultaneously. AirCast-SR employs a three-dimensional U-Net conditioned within a Latent Consistency Model (LCM) diffusion framework, trained on patch-based samples over the contiguous United States (CONUS) using GraphCast forecasts as input and NOAA's Analysis of Record for Calibration (AORC) as the target. The model achieves near-zero bias across all variables and lead times, and its radial power spectral density analysis demonstrates preservation of fine-scale atmospheric structure at wavelengths of 10 km to 100 km where coarser models lose spectral power. We validate AirCast-SR across three CONUS case studies spanning winter, summer, and spring seasons, and demonstrate zero-shot global transferability over India and Germany using independent surface station observations without any retraining or fine-tuning. As an open-weights foundation model, AirCast-SR establishes a new paradigm for kilometer-scale AI weather prediction and provides a platform for regional fine-tuning, distillation, and downstream applications in climate services and hazard forecasting.
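
The abstract's radial power spectral density diagnostic can be illustrated with a generic NumPy sketch. The paper's exact procedure is not given here, so the function name `radial_psd`, the choice of annular bins, and the mean removal are assumptions rather than the authors' implementation. The idea is to average the 2D power spectrum over rings of constant wavenumber, so that a loss of fine-scale structure shows up as reduced power at short wavelengths.

```python
import numpy as np


def radial_psd(field: np.ndarray, dx_km: float):
    """Radially averaged power spectrum of a 2D field on a uniform grid.

    Returns (wavelength_km, power), one value per radial wavenumber bin.
    """
    ny, nx = field.shape
    # 2D power spectrum of the mean-removed field.
    power2d = np.abs(np.fft.fft2(field - field.mean())) ** 2
    # Wavenumber magnitude (cycles per km) at every FFT coefficient.
    kx = np.fft.fftfreq(nx, d=dx_km)
    ky = np.fft.fftfreq(ny, d=dx_km)
    k = np.hypot(*np.meshgrid(kx, ky))
    # Annular bins centred on multiples of the fundamental wavenumber,
    # up to the Nyquist wavenumber 1 / (2 * dx_km); corner coefficients
    # beyond Nyquist are left out of the average.
    dk = 1.0 / (max(nx, ny) * dx_km)
    edges = np.arange(0.5 * dk, 0.5 / dx_km + dk, dk)
    total, _ = np.histogram(k.ravel(), bins=edges, weights=power2d.ravel())
    count, _ = np.histogram(k.ravel(), bins=edges)
    centres = 0.5 * (edges[:-1] + edges[1:])
    keep = count > 0
    return 1.0 / centres[keep], total[keep] / count[keep]
```

Comparing the curves returned for a downscaled field and a coarser model's field over the 10 km to 100 km wavelength band would show where the coarser field loses spectral power.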