arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.23264 2026-05-25 cs.CV cs.AI

Coloring the Noise: Adversarial Sobolev Alignment for Faithful Image Super Resolution

Hongbo Wang, Huaibo Huang, Pin Wang, Jinhua Hao, Chao Zhou, Ran He

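AI summary In generative image super-resolution, generative priors often compromise faithful restoration. The paper attributes this to a fundamental spectral mismatch between isotropic objectives and the intrinsic natural image manifold. To address it, the authors propose ASASR, a framework built on a Sobolev-induced Riemannian geometry that explicitly colors the noise transition kernel so that it better matches the spectral decay of natural images, and introduce a parametric adversary grounded in the Riesz Representation Theorem that generates targeted negative samples to guide the optimization direction. Experiments show that the method outperforms existing generative methods in spectral consistency and structural fidelity and effectively reduces artifacts.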

Comments Accepted to ICML 2026

Abstract

Generative priors in Image Super-Resolution (SR) often compromise faithful restoration. We attribute this limitation to a fundamental spectral misalignment between isotropic objectives and the intrinsic natural image manifold. While Direct Preference Optimization offers a path to alignment, its reliance on spectrally flat Gaussian noise fails to distinguish authentic high-frequency details from hallucinations. To bridge this geometric gap, we propose ASASR, a theoretically grounded framework that recasts the generative flow into a Sobolev-induced Riemannian geometry by explicitly coloring the noise transition kernel to mirror natural spectral decay. To drive this geometric alignment, we integrate a parametric adversary grounded in the Riesz Representation Theorem, which synthesizes targeted negative samples equivalent to worst-case Sobolev gradients to direct optimization along the tangent space of plausible structural failures. Extensive evaluations demonstrate that ASASR outperforms leading generative baselines, particularly in preserving spectral consistency and structural fidelity, offering a robust solution that effectively mitigates artifacts.
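
The contrast between spectrally flat and colored noise can be illustrated with a small sketch. The abstract does not give the form of ASASR's noise kernel, so the code below only assumes a generic power-law filter: each frequency of white Gaussian noise is scaled by f^(-alpha/2), so power falls off roughly as 1/f^alpha. It works on a 1-D signal with a naive DFT to stay dependency-free. The names `colored_noise`, `band_power`, and `alpha` are illustrative and not taken from the paper.

```python
import cmath
import math
import random


def dft(x):
    """Naive O(n^2) discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse of dft()."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def colored_noise(n, alpha=2.0, seed=0):
    """White Gaussian noise reshaped so power decays roughly as 1/f**alpha.

    Assumption: a plain power-law filter, not the kernel used in the paper.
    """
    rng = random.Random(seed)
    white = [rng.gauss(0.0, 1.0) for _ in range(n)]
    shaped = []
    for k, coeff in enumerate(dft(white)):
        f = min(k, n - k)  # symmetric frequency index keeps the result real
        scale = 0.0 if f == 0 else f ** (-alpha / 2.0)  # drop the DC term
        shaped.append(coeff * scale)
    return [value.real for value in idft(shaped)]


def band_power(x, lo, hi):
    """Total spectral power of x over frequency indices lo..hi inclusive."""
    spectrum = dft(x)
    return sum(abs(spectrum[k]) ** 2 for k in range(lo, hi + 1))
```

For `x = colored_noise(64)`, comparing `band_power(x, 1, 8)` with `band_power(x, 24, 32)` shows most of the energy sitting in the low band, whereas white noise would spread it evenly across frequencies.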

2605.23263 2026-05-25 cs.RO cs.AI cs.SY eess.SP eess.SY

6G Communication Networks Enabling Embodied Agents: Architecture and Prototype

Lipeng Dai, Luping Xiang, Kun Yang

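AI summary This paper studies how 6G communication networks can support the communication needs of embodied agents, examines the symbiotic relationship between embodied agents and 6G networks, and proposes a hierarchical communication architecture for human-robot remote interaction. A prototype comprising a haptic device, an industrial robotic arm, and a 5G O-RAN testbed validates the architecture's feasibility in terms of millisecond-level latency and stable closed-loop control, providing a reference for future integration of 6G and embodied agents.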

Abstract

Embodied agents, which couple intelligent decision-making with physical actuation in the real world, impose far more stringent and heterogeneous communication requirements than purely software-based agents. While 6G promises sub-millisecond latency, ultra-high reliability, native intelligence, and integrated sensing, systematic studies on how to exploit these capabilities for embodied agent communication remain limited. This article investigates 6G-enabled communication systems for embodied agents from both conceptual and engineering perspectives. First, we review the concept and embodiment value of embodied agents and clarify their distinctions from disembodied agents. Then, we analyse the symbiotic relationship between embodied agents and 6G networks, highlighting how key 6G enablers can support the stringent requirements of human-robot interaction. Furthermore, we demonstrate the proactive role of embodied agents in bolstering communication networks through coverage extension, environmental sensing, and physical world understanding. Building on these insights, we propose a hierarchical communication architecture for human-robot remote interaction, comprising a human-intent perception layer, an open radio access network (O-RAN)-based transport layer, an intelligent intermediary layer, and an embodiment layer. To validate its feasibility, we implement an end-to-end prototype that integrates a haptic device, an industrial robotic arm, an intermediary platform, and a 5G O-RAN testbed. Experimental results demonstrate millisecond-level latency and stable closed-loop operation, confirming the practicality of the proposed architecture and providing a reference for future 6G-embodied agent research and industrial deployments.

2605.23262 2026-05-25 cs.AI

Design and Report Benchmarks for Knowledge Work

Yining Hua, Hongbin Na, Cyrus Ayubcha, Levi Lian

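AI summary To address the evaluation of AI systems for knowledge work, this paper proposes a three-step benchmark design approach that makes explicit how task scores correspond to actual work outcomes. It argues that current knowledge-work evaluation still follows the logic of traditional NLP tasks and therefore fails to reflect a system's capability in real deployments. The authors build a benchmark design framework along three dimensions (work activity, tested setting, and scoring), derive 18 categories of work activity from the O*NET occupational task database, and use three case studies to show how the approach applies across knowledge-work scenarios.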

Abstract

The development of LLM agents has led to a growing body of work on knowledge-work AI, including coding, research, and healthcare. However, current knowledge-work evaluation and benchmark design still largely follow the logic of traditional NLP tasks. As a result, higher benchmark performance does not reliably show that a system can carry out knowledge work in real-world deployment settings. This paper contributes a three-step approach for making explicit how benchmarked tasks represent the work claims attached to their scores: defining the work activity under evaluation, specifying the tested setting, and scoring the appropriate work product. We review work studies showing that knowledge work is organized through roles and responsibilities, local materials and tools, and artifacts that must remain usable in downstream workflows. We then translate these concerns into benchmark design and reporting guidance, covering how tasks should be mapped to work activities, how tested settings should specify materials, tools, roles, and constraints, and how scoring should focus on the work product left by the system. To name the work activity being evaluated and distinguish it from common benchmark tasks, we derive an inventory of 18 work activities from the O*NET occupational task database. We demonstrate the approach through three benchmark case analyses: GDPval, a non-code occupational deliverable benchmark; OfficeQA Pro, a grounded document-analysis benchmark scored by final answers; and APEX-SWE, a software-engineering benchmark with executable scored products. These cases show how benchmark design choices shape the strongest work claim a score can support, and where gaps arise between the benchmarked task, tested setting, scored product, and broader work claim.

2605.23259 2026-05-25 cs.LG cs.AI cs.CL

Multi-Gate Residuals

Zhizhan Zheng, Feiyun Zhang, Shuchun Liu, Tian Xia, Xi Liu, Dasheng Hu, Hongquan Zhou

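AI summary This paper proposes Multi-Gate Residuals (MGR), a method that addresses unbounded activation growth in deep residual networks without introducing additional communication overhead. MGR maintains multi-stream context through a simple scoring and gating mechanism and uses attention pooling to extract hidden states, keeping activation scales stable while improving model performance. Experiments indicate that MGR is practical for large-scale training and deployment and outperforms existing architectures.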

Abstract

While Attention Residuals has shown some effectiveness in addressing the widespread issue of unbounded activation growth across deep residual layers, it inevitably incurs significant communication overhead. To circumvent this bottleneck, we propose Multi-Gate Residuals (MGR), which stabilizes activation scales without additional communication burden. It utilizes a straightforward scoring and gating mechanism to maintain multi-stream context, coupled with Attention Pooling to extract hidden states from the stream states. Empirical experiments demonstrate that MGR is practical for large-scale training and deployment, offering tangible performance improvements over existing architectures.

2605.23258 2026-05-25 cs.LG

A Simple Plug-in for Improving Eviction-Based KV Cache Compression

Yuping Lin, Jiayuan Ding, Yue Xing, Pengfei He, Jiliang Tang, Subhabrata Mukherjee

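AI summary Key-value (KV) cache growth is a major bottleneck for long-context inference in large language models. This paper proposes VECTOR, a plug-and-play method for improving eviction-based KV cache compression. It introduces three-way token routing (retention, approximation, and eviction) and combines an importance signal from a base scorer with a reconstructability signal from an offline-calibrated regression-based value estimate, improving the quality-memory trade-off under cache compression, especially under strict memory budgets.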

Abstract

KV cache growth is a major bottleneck for long-context inference in large language models. Existing methods are often dominated by binary eviction or representation approximation, which may underutilize tokens that are not critical for exact retention but are still reconstructable. We present VECTOR, a plug-and-play augmentation for eviction-based pipelines that introduces three-way token routing: retention, approximation, and eviction. VECTOR combines an importance signal from the base scorer with a reconstructability signal from an offline-calibrated regression-based value estimation. By leveraging reconstructability, VECTOR recovers useful value information that would otherwise be irreversibly lost under binary eviction, while preserving key vectors for attention routing stability. Experimental results show that VECTOR improves quality-memory trade-offs under medium-to-high compression, with especially clear gains in stricter budget regimes.
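
The three-way routing described above can be sketched as follows. The abstract does not say how VECTOR combines the importance and reconstructability signals, so this sketch assumes a simple budgeted rule: keep the most important tokens exactly, approximate the most reconstructable of the remainder, and evict the rest. The function name, the two budgets, and the score lists are all hypothetical.

```python
def route_tokens(importance, reconstructability, keep_budget, approx_budget):
    """Assign each cached token to 'retain', 'approximate', or 'evict'.

    Hypothetical rule (not the paper's): retain the top tokens by importance,
    then approximate the most reconstructable of the rest, evict the others.
    """
    by_importance = sorted(
        range(len(importance)), key=lambda i: importance[i], reverse=True
    )
    routes = {i: "retain" for i in by_importance[:keep_budget]}
    remainder = sorted(
        by_importance[keep_budget:],
        key=lambda i: reconstructability[i],
        reverse=True,
    )
    for rank, i in enumerate(remainder):
        routes[i] = "approximate" if rank < approx_budget else "evict"
    return [routes[i] for i in range(len(importance))]


# Token 0 is important, token 1 is unimportant but easy to reconstruct.
decisions = route_tokens(
    importance=[0.9, 0.1, 0.5, 0.2],
    reconstructability=[0.1, 0.8, 0.3, 0.2],
    keep_budget=1,
    approx_budget=1,
)
```

Under binary eviction, token 1 would simply be dropped. The third route keeps a cheaper stand-in for it, which is the kind of recovery the abstract attributes to reconstructability.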

2605.23257 2026-05-25 cs.RO cs.CV

Turning Adaptation into Assets: Cross-Domain Bridging for Online Vision-Language Navigation

Zixuan Hu, Xuantuo Huang, Yancheng Li, Yichun Hu, Shengyong Xu, Ling-Yu Duan

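AI summary This paper studies how Vision-and-Language Navigation (VLN) agents adapt to non-stationary environments and proposes IDEA, a new Test-Time Adaptation (TTA) framework that recasts online adaptation as the accumulation and composition of knowledge assets, mitigating the catastrophic forgetting and negative transfer seen in existing methods. IDEA optimizes soft prompts with a Fisher-guided weighting scheme, combines them with domain coordinates to build a dynamic asset library, and uses historical knowledge to construct a cross-domain bridge that enables training-free adaptation. Experiments show strong results on several benchmarks, demonstrating the method's effectiveness in practice.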

Comments Accepted by ICML 2026

Abstract

Navigating under non-stationary environment shifts poses a critical challenge for a Vision-and-Language Navigation (VLN) agent deployed in the wild. Yet, existing Test-Time Adaptation (TTA) methods for VLN largely treat online adaptation as transient, isolated updates, leading to catastrophic forgetting and negative transfer. To overcome these issues, we propose Inter-Domain BridgE with Historical Assets (IDEA), a novel TTA framework that transforms adaptation into the accumulation and composition of assets. Specifically, IDEA introduces soft prompts optimized via a Fisher-guided weighting scheme to capture the transferable knowledge. These optimized prompts are then augmented with domain coordinates to form a dynamic asset library. Leveraging this library, IDEA constructs a cross-domain bridge by projecting the target domain onto the convex hull of historical knowledge. These designs form a complementary loop: the evolving library underpins bridge construction, while the bridge provides superior initialization to accelerate asset optimization. Extensive experiments across REVERIE, R2R, and R2R-CE benchmarks demonstrate the consistent superiority of IDEA over existing methods, showcasing its ability to enable training-free adaptation via asset sharing.
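
Projecting a target onto the convex hull of stored points amounts to finding non-negative weights that sum to one and whose mixture lies closest to the target. The sketch below solves that with projected gradient descent in plain Python. It is only an illustration of the geometric step: how IDEA defines domain coordinates or solves the projection is not stated in the abstract, and `convex_hull_weights`, `anchors`, `steps`, and `lr` are assumed names and settings.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {w : w_i >= 0, sum(w) = 1}."""
    theta = 0.0
    running = 0.0
    for j, value in enumerate(sorted(v, reverse=True), start=1):
        running += value
        candidate = (running - 1.0) / j
        if value - candidate > 0:
            theta = candidate  # the last j passing this test sets the shift
    return [max(x - theta, 0.0) for x in v]


def convex_hull_weights(target, anchors, steps=500, lr=0.1):
    """Weights on the simplex minimising ||sum_i w_i * anchors[i] - target||^2.

    lr must be small relative to the anchors' scale; 0.1 suits unit-sized data.
    """
    k, dim = len(anchors), len(target)
    w = [1.0 / k] * k
    for _ in range(steps):
        mix = [sum(w[i] * anchors[i][d] for i in range(k)) for d in range(dim)]
        residual = [mix[d] - target[d] for d in range(dim)]
        grad = [
            2.0 * sum(residual[d] * anchors[i][d] for d in range(dim))
            for i in range(k)
        ]
        w = project_to_simplex([w[i] - lr * grad[i] for i in range(k)])
    return w


# Three stored "domains" as 2-D coordinates and a target inside their hull.
weights = convex_hull_weights((0.25, 0.25), [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)])
```

The resulting weights could then be used to blend the stored assets, for example as an initialization for a new domain, which is the role the abstract gives to the bridge.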

2605.23255 2026-05-25 cs.LG cs.DS

Learning-Augmented Online Scheduling with Parsimonious Preemption

Mugen Blue, Sungjin Im, Alexander Lindermayr

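AI summary This paper studies learning-augmented online scheduling, aiming to reduce the number of preemptions (task switches) while optimizing job latency. The authors propose a new algorithmic framework that preserves scheduling performance while keeping the number of preemptions per job constant, with preemption overhead growing logarithmically in the prediction error. The work provides the first bounded-preemption theoretical guarantees for scheduling on unrelated and malleable machines, extending the reach of learning-augmented scheduling theory.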

Abstract

Learning-augmented algorithms have emerged as a powerful paradigm to surpass traditional worst-case lower bounds by integrating potentially noisy predictions. While this framework has seen success in online scheduling, existing work primarily optimizes job latency while relying on frequent, "blind" preemptions. This ignores the fundamental trade-off between algorithmic performance and preemption complexity. We provide the first systematic study of learning-augmented scheduling that curbs preemption while optimizing latency. We establish that the gap between theoretical latency bounds and preemption overhead can be bridged with solid analytical foundations. Our results include $O(1)$-competitive algorithms for single and unrelated parallel machines with only $O(1)$ preemptions per job under accurate predictions, with overhead scaling logarithmically with the prediction error. By providing the first bounded-preemption guarantees for unrelated and malleable machines, we extend the theoretical reach of the learning-augmented framework to more constrained and realistic settings. Finally, our algorithms are validated through experiments.

2605.23254 2026-05-25 cs.CV

CARE: Class-Adaptive Expert Consensus for Reliable Learning with Long-Tailed Noisy Labels

Mengke Li, Haiquan Ling, Lihao Chen, Yang Lu, Yiqun Zhang, Hui Huang

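AI summary When learning from real-world data, the compound challenge of long-tailed class distributions and noisy labels often degrades model performance. To address this, the paper proposes CARE, a parameter-efficient framework that combines three complementary supervision sources from vision-language models and introduces a class-adaptive expert consensus mechanism that adjusts the strictness of label correction according to class frequency, filtering noise and recalibrating class distributions more effectively. Experiments show that CARE outperforms existing methods on several synthetic and real-world datasets, with gains of up to 3.0%.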

Comments poster in ICML 2026

Abstract

Learning from real-world data is frequently hindered by the compound challenge of long-tailed class distributions and noisy annotations. Existing methods partially address these issues but typically ignore the non-uniform impact of label noise across classes, resulting in ineffective correction for tail classes and over-regularization for head classes. To address this issue, we propose Class-Adaptive Rectification with Experts (CARE), a parameter-efficient framework that leverages three complementary supervision sources from vision-language models (VLMs): observed noisy labels, VLM text embeddings, and visual features. CARE introduces a class-adaptive expert consensus mechanism that enforces stricter agreement for tail classes and more permissive agreement for head classes based on class frequency. By aggregating high-confidence predictions across these sources, CARE filters unreliable signals and recalibrates class distributions, yielding more reliable rectification under long-tailed distributions. Extensive experiments on both synthetic and real-world benchmarks demonstrate that CARE consistently outperforms state-of-the-art methods, achieving up to 3.0% performance gains. The source code is available at https://github.com/qwq123-study/CARE.
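
A class-adaptive consensus rule of the kind described above might look like the sketch below. It assumes the three sources have each been reduced to a single predicted label, that tail classes require all three to agree, and that head classes require two. The abstract fixes none of these details, so the cutoff, the vote counts, and the function names are hypothetical.

```python
from collections import Counter


def required_votes(label, class_counts, head_cutoff):
    """Tail classes need all three sources to agree; head classes need two."""
    return 2 if class_counts.get(label, 0) >= head_cutoff else 3


def rectify_label(noisy_label, text_pred, visual_pred, class_counts, head_cutoff=100):
    """Return the consensus label, or None when agreement is too weak."""
    votes = Counter([noisy_label, text_pred, visual_pred])
    candidate, count = votes.most_common(1)[0]
    if count >= required_votes(candidate, class_counts, head_cutoff):
        return candidate
    return None


# Hypothetical training-set frequencies: "cat" is a head class, "lynx" a tail class.
counts = {"cat": 1000, "lynx": 20}
head_case = rectify_label("lynx", "cat", "cat", counts)     # two votes suffice
tail_case = rectify_label("cat", "lynx", "lynx", counts)    # two votes are not enough
tail_full = rectify_label("lynx", "lynx", "lynx", counts)   # unanimous agreement
```

Samples that return `None` would be treated as unreliable under this rule. Whether CARE drops, down-weights, or relabels such samples is not specified in the abstract.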

2605.23249 2026-05-25 cs.LG cs.AI

Enhancing Deep Neural Network Reliability with Refinement and Calibration

Ramya Hebbalaguppe, Ajay Shastry, Soumya Suvra Ghosal, Chetan Arora

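AI summary Although deep neural networks achieve high predictive accuracy, their confidence estimates are often unreliable, which can undermine user trust in their decisions. This paper proposes a new loss function and a unified training framework, RefCal, that jointly improve calibration, sharpness (the confidence gap between correct and incorrect predictions), and accuracy, thereby improving the reliability of deep neural networks. Experiments show that RefCal substantially outperforms existing methods on class-imbalanced data.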

Comments ICLR 2026, Trustworthy AI and Representational Alignment

Abstract

Although deep neural networks (DNNs) achieve high predictive accuracy, their confidence estimates are often unreliable, potentially compromising user trust in their decisions. This has motivated research on calibrated models, where calibration measures how well a model's predicted confidence aligns with the empirical probability of correctness. However, calibration metrics can often be improved through post-processing techniques that merely mimic training-time uncertainty without genuinely improving the model's understanding. For this reason, statisticians recommend that models be not only calibrated but also refined. Intuitively, a model is considered more refined if it assigns significantly different confidence scores to correct and incorrect predictions, a property also referred to as sharpness. We observe that many existing calibration methods improve calibration at the cost of reduced refinement. To address this limitation, we propose: (1) a novel loss function that explicitly promotes refinement and can be optimized through supervised contrastive learning; and (2) a unified training framework, RefCal, that jointly optimizes calibration, refinement, and accuracy to improve DNN reliability. On the CIFAR-100-LT dataset with 10 percent class imbalance, RefCal achieves (accuracy, refinement, ECE) of (58.81, 95.67, 0.08), substantially outperforming the widely used Correctness Ranking Loss, which achieves (46.27, 93.7, 0.22).
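
Expected Calibration Error (ECE), one of the three reported numbers, is commonly computed by grouping predictions into confidence bins and averaging the gap between accuracy and mean confidence in each bin, weighted by bin size. The sketch below follows that standard binned definition. The bin count and function name are assumptions, and the abstract does not state the paper's exact evaluation protocol or the scale of its ECE values.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: sum over bins of (bin share) * |accuracy - mean confidence|."""
    total = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes to the last bin
        bins[index].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / total * abs(accuracy - mean_conf)
    return ece


# Four predictions made at 90% confidence, of which three are right:
# accuracy 0.75 against confidence 0.90 gives a gap of 0.15.
overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False])
```

ECE alone does not capture refinement: a model that outputs its average accuracy as the confidence for every input is well calibrated yet cannot separate correct from incorrect predictions, which is the gap the abstract points to.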

2605.23245 2026-05-25 cs.CV cs.AI

SimInsert: Seamless Video Object Insertion via Regional Sparse Attention Fusion

Xinyu Chen, Yuyi Qian, Jiang Lin, Shenyi Wang, Gao Wang, Zhiqiu Zhang, Jizhi Zhang, Mingjie Wang, Qiang Tang, Qian Wang, Song Wu, Zili Yi

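AI summary SimInsert is a training-free video object insertion method that addresses existing methods' reliance on explicit motion engineering or costly retraining, improving flexibility and generalization. Using regional sparse attention fusion, it decomposes the task into single-frame editing and semantic motion description and draws on the generative priors of image-to-video diffusion models to propagate edits naturally over time while preserving background invariance and interaction realism. Experiments show that SimInsert significantly outperforms existing methods on several metrics, offering an efficient solution for high-fidelity video editing.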

Comments Accepted by ICME2026

Abstract

Video object insertion requires ensuring spatio-temporal coherence and interactive realism, extending far beyond simple content placement. However, current approaches are often hindered by a reliance on explicit motion engineering or resource-intensive retraining, restricting their flexibility and generalization. To bridge this gap, we present SimInsert, a training-free paradigm that efficiently decouples the task into intuitive single-frame editing and semantic motion description. By harnessing the robust generative priors of image-to-video diffusion models, SimInsert propagates edits temporally, strictly preserving background invariance while enabling plausible, text-driven interactions between the inserted object and the dynamic environment. Our approach hinges on non-invasive guidance mechanisms that enforce structural consistency, facilitate seamless boundary fusion, and counteract the fidelity drift that typically accumulates during the denoising trajectory. Extensive quantitative experiments validate our efficacy: SimInsert surpasses state-of-the-art methods with an 18.8% gain in PSNR, 20.1% in SSIM, and a 44.1% decrease in LPIPS, offering a streamlined solution for high-fidelity video editing.

2605.23244 2026-05-25 cs.LG

Convex Optimization for Alignment and Preference Learning on a Single GPU

Miria Feng, Mert Pilanci

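AI summary This paper proposes COALA, a convex optimization algorithm for efficient alignment and preference learning of large language models on a single GPU. By reformulating neural networks as convex optimization problems, the method removes the dependence on a reference model and substantially reduces training time and VRAM consumption. Experiments show strong performance and efficiency across several datasets and models: COALA requires only about 17.6% of DPO's compute, its reward increases steadily during training, and it reaches peak performance in noticeably less time.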

Abstract

Fine-tuning large language models (LLMs) to align with human preferences has driven the success of systems such as Gemini and ChatGPT. However, approaches like Reinforcement Learning from Human Feedback (RLHF) remain computationally expensive and complex. Direct Preference Optimization (DPO) offers a simpler alternative but has limitations such as inconsistent ranking accuracy, high dependence on GPU resources, and expensive hyperparameter tuning. We propose the Convex Optimization for Alignment and Preference Learning Algorithm (COALA): a novel lightweight strategy with strong theoretical guarantees. By leveraging the convex optimization reformulation of neural networks, COALA eliminates the need for a reference model and obtains significant reduction in both training time and VRAM consumption, thus enabling efficient training on a single GPU. Experiments across four datasets (including a 26621-sample synthetic Educational Feedback dataset) and six models (including Llama-3.1-8B) demonstrate COALA's competitive performance and efficiency while utilizing as little as about 17.6% of DPO's total TFLOPs. COALA exhibits stable, monotonically increasing rewards and reaches peak margins in significantly shorter time in comparison to traditional methods such as DPO and ORPO. To the best of our knowledge, this is the first time convex optimization has been effectively applied to preference fine-tuning of LLMs.

2605.23241 2026-05-25 cs.LG

RelPrism: A Multi-Faceted Pre-training Framework with Self-Generated Tasks for Relational Databases

Jinyu Yang, Cheng Yang, Junze Chen, Zedi Liu, Muhan Zhang, Hanyang Peng, Chuan Shi

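AI summary Relational databases (RDBs) remain central to modern data systems and support a variety of predictive tasks. Existing relational deep learning methods convert databases into graphs and apply graph models for representation learning, but effective self-supervised pre-training remains challenging, particularly when tasks require information across multiple perspectives and granularities. The paper proposes RelPrism, a multi-faceted self-supervised learning framework that constructs intrinsic, relational, and hybrid attributes from different perspectives and applies multi-granularity clustering to generate pseudo-tasks, making pre-trained representations more adaptable. Experiments show that RelPrism outperforms existing methods on classification and regression tasks across several real-world datasets.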

Abstract

Relational databases (RDBs) remain the cornerstone of modern data systems and support diverse predictive tasks. Recent relational deep learning (RDL) methods enable end-to-end prediction by converting RDBs into graphs, where rows are represented as nodes and inter-table interactions are represented as edges, and then applying graph-based models for representation learning. Despite the strong capability of RDL, effective self-supervised pre-training for RDBs remains non-trivial. RDB tasks often require multi-faceted information across different perspectives and granularities. For example, user churn classification may rely more on interaction patterns, whereas consumption value prediction requires both user-item behaviors and intrinsic user attributes for fine-grained regression. Such heterogeneous needs challenge RDB representation learning, as pre-training objectives should cover comprehensive information for downstream adaptation. However, existing SSL methods typically derive supervision from a single facet, such as node-level intrinsic attributes or subgraph-level relational structures, providing limited adaptability. To this end, we propose RelPrism, a multi-faceted self-supervised learning framework for RDBs. RelPrism constructs intrinsic, relational, and hybrid attributes from distinct perspectives, and applies multi-granularity clustering to each perspective to form corresponding pseudo-task pools. Pre-training over these pools exposes representations to broader perspectives and granularity levels, yielding a stronger basis for downstream adaptation. Experiments on 14 tasks across 5 real-world datasets show that RelPrism improves ROC-AUC by 4.15% for classification and reduces MAE by 10.75% for regression over state-of-the-art baselines. Our code is available at https://anonymous.4open.science/r/RelPrism.

2605.23240 2026-05-25 cs.RO cs.SY eess.SY

Signal Temporal Logic Motion Planning via Graphs of Convex Sets

Yu Chen, Ancheng Hou, Mingyang Feng, Xiao Yu, Xiang Yin

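AI summary This paper studies continuous-time motion planning under Signal Temporal Logic (STL) specifications, aiming to generate smooth robot trajectories that satisfy high-level logical and timing requirements while respecting low-level motion constraints. The authors propose an efficient framework that combines timed-automata reasoning with graphs of convex sets (GCS), recasting STL motion planning as a shortest-path problem over a GCS to produce Bézier-spline trajectories that satisfy the STL specification, smoothness requirements, and velocity bounds. Experiments on several low-dimensional benchmarks, a 3-D quadrotor, a 30-DoF humanoid, and hardware trials on a UR-3 robot arm show that the method solves complex STL motion-planning problems efficiently.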

Abstract

This paper investigates continuous-time motion planning under Signal Temporal Logic (STL) specifications. The goal is to generate smooth robot trajectories that satisfy high-level logical and timing requirements while respecting low-level motion constraints. To this end, we propose an efficient framework that combines timed-automata reasoning with graphs of convex sets (GCS). An STL specification is first represented by a timed automaton, which is then coupled with a convex decomposition of the configuration space to form a joint transition system encoding both task progress and region occupancy. Based on this joint transition system, the STL motion-planning problem is reformulated as a shortest-path problem over a GCS, whose solution induces a smooth Bézier-spline trajectory satisfying the STL specification, smoothness requirements, and velocity bounds. We establish the soundness of the proposed formulation and analyze its computational complexity, showing that, once the timed automaton and convex decomposition are fixed, the convex relaxation scales polynomially with the configuration-space dimension and the Bézier degree. We further develop a compact timed-automaton construction for an expressive STL fragment using dedicated templates and Boolean composition. Numerical experiments on low-dimensional benchmarks, a 3-D quadrotor, a 30-DoF humanoid, and a hardware experiment on a UR-3 robot arm demonstrate that the proposed method efficiently solves complex STL motion-planning problems and produces smooth executable trajectories.
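
One property that makes Bézier curves convenient for planning over convex regions is that the curve always stays inside the convex hull of its control points. Constraining the control points to a convex set therefore keeps the whole segment in that set. The sketch below evaluates a Bézier segment with de Casteljau's algorithm, which is just repeated linear interpolation. The control points are made up for illustration, and the abstract does not state how the paper itself parameterizes or constrains its splines.

```python
def de_casteljau(control_points, t):
    """Evaluate a Bezier curve at parameter t in [0, 1] by repeated interpolation."""
    points = [tuple(p) for p in control_points]
    while len(points) > 1:
        # Each pass replaces n points with n - 1 interpolated points.
        points = [
            tuple((1.0 - t) * a + t * b for a, b in zip(p, q))
            for p, q in zip(points, points[1:])
        ]
    return points[0]


# A quadratic segment in the plane (hypothetical control points).
controls = [(0.0, 0.0), (1.0, 2.0), (2.0, 0.0)]
midpoint = de_casteljau(controls, 0.5)
samples = [de_casteljau(controls, i / 10.0) for i in range(11)]
```

Every intermediate point is a convex combination of control points, so all entries in `samples` stay within the triangle spanned by `controls`, even though the curve passes below the middle control point rather than through it.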

2605.23238 2026-05-25 cs.AI cs.GT cs.LG cs.MA

GENSTRAT: Toward a Science of Strategic Reasoning in Large Language Models

Vartan Shadarevian, Kia Ghods, Alex Kenich, Anany Kotawala

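AI summary This paper presents GENSTRAT, an evaluation framework based on procedurally generated strategic environments for assessing the reasoning ability of large language models in complex strategic settings more accurately. It generates a family of two-player zero-sum imperfect-information card games and combines capability profiling with a jaggedness measure to assess model performance and stability across strategic dimensions. Experiments show that frontier models perform better overall, but their capability profiles and local volatility differ markedly, providing finer-grained diagnostics for real deployments.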

Comments 33 pages, 8 figures, 9 tables (4 figures, 2 tables in main paper)

Abstract

Large language models (LLMs) are increasingly deployed as economic agents in marketplaces, auctions, and bidding settings. Anticipating their behavior in any specific deployment is hard. Existing strategic-reasoning benchmarks evaluate models on fixed canonical games. These benchmarks may saturate as the frontier improves, and they do not allow evaluators to generalize with confidence from benchmark performance to the varied and messy strategic environments that actual deployments involve. We introduce GENSTRAT, which uses procedurally generated strategic environments to address these challenges. Concretely, we generate a distribution of two-player zero-sum imperfect-information card games. The generator can draw fresh games on demand, allowing for evergreen evaluation and resistance to contamination. We pair the game distribution with a capability-profile methodology that decomposes model competence across six axes (state space, temporal depth, information sensitivity, opponent modeling, risk, and brittleness). We also introduce a jaggedness measure of within-distribution smoothness that detects when a model's advantage jumps unpredictably between strategically similar games. We sample 50 benchmark games from a 2,000-game generated pool and evaluate nine frontier and open-weight LLMs in a head-to-head tournament with over 36,000 matches. Newer frontier-tier models score higher on average. Beyond that average, models with near-identical overall strength show qualitatively different capability profiles, and two of the top three leaderboard models (gpt-5 and claude) are noticeably more locally volatile than the third (gemini-3.1-pro), despite being close in overall strength. Together, the capability profile and the jaggedness measure give a deployment-relevant diagnostic that the overall ranking alone cannot provide.

2605.23237 2026-05-25 cs.CV

StereoGenBench: A Synthetic Multi-Camera Benchmark for Stereo Generation under Controlled Baseline Regimes

Yangzhi Cui, Feng Qiao, Nathan Jacobs

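AI summary StereoGenBench is a synthetic multi-camera benchmark dataset built with Unreal Engine that provides precisely controlled multi-baseline paired data for stereo generation, geometry estimation, and controllable view synthesis. By rendering fixed scenes with a six-camera array, it produces high-quality paired views with multiple baselines, intrinsics, depth, and camera poses, supporting evaluation of generative models across baseline regimes. The work fills gaps in existing datasets regarding multi-baseline pairing and controllable parameters and offers a standardized testbed for stereo generation research.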

英文摘要

Stereo image and video generation, stereo geometry estimation, and condition-controlled view synthesis require paired data in which the variables that determine binocular geometry -- camera baseline, intrinsics, scene depth, and camera motion -- are known and controllable. Existing stereo resources provide subsets of these variables, but resources commonly used for stereo generation evaluation do not, to our knowledge, provide scene-paired, calibrated multi-baseline right-view ground truth with jointly recorded intrinsics, dense metric depth, and per-frame poses in a single controlled source. We introduce StereoGenBench, a synthetic Unreal Engine benchmark designed to make baseline-regime sensitivity and target-camera consistency measurable under matched scene content. Each scene is rendered with a rigid six-camera lateral array, yielding up to 15 calibrated view pairs; adjacent baselines are sampled from inter-pupillary to wide-baseline regimes; focal length is sampled independently; and every view is released with RGB, metric depth, intrinsics, per-pair baselines, and per-frame poses. The splits include two evaluation families for narrow and wide baseline regimes and a train-only family for broader all-pairs coverage. We release the dataset, evaluation code, reference results, Croissant metadata, and generation code/configuration for extension with compatible assets. The dataset is available at https://huggingface.co/datasets/stereo-dataset/stereo-dataset
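
The count of 15 view pairs follows from choosing 2 of the 6 cameras. The Python sketch below enumerates the pairs for a lateral array and reports each pair's baseline. The camera offsets are hypothetical values chosen for illustration (the abstract gives no spacing), and `view_pairs` is an illustrative helper, not part of the released code.

```python
from itertools import combinations


def view_pairs(positions_m):
    """Return (left_idx, right_idx, baseline_m) for every unordered camera pair."""
    pairs = []
    for i, j in combinations(range(len(positions_m)), 2):
        # On a rigid lateral array the baseline is the distance along the rig axis.
        pairs.append((i, j, abs(positions_m[j] - positions_m[i])))
    return pairs


# Hypothetical lateral offsets in metres: the first gap is roughly
# inter-pupillary, later gaps are progressively wider.
positions = [0.0, 0.065, 0.20, 0.45, 0.80, 1.30]
pairs = view_pairs(positions)
print(len(pairs))  # 6 choose 2 = 15
```

Adjacent pairs give the narrow baselines, while non-adjacent pairs reuse the same rendered views to cover wider regimes.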

2605.23235 2026-05-25 cs.LG

Convex Low-resource Accent-Robust Language Detection in Speech Recognition

Miria Feng, William Tan, Mert Pilanci

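AI Summary As globalization and multiculturalism advance, speech recognition systems often perform poorly on low-resource dialects and accents, leading to language identification errors that affect downstream dialogue tasks. This paper proposes Convex Language Detection (CLD), a robust low-resource language detection method based on convex optimization. It combines theoretically grounded convex optimization techniques with a multi-GPU ADMM algorithm to achieve efficient training and globally optimal solutions. The method comes with theoretical stability guarantees and, in experiments, shows strong robustness to dialectal variation in the input, reaching 97-98% accuracy even under low-resource conditions.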

Abstract

Globalization and multiculturalism continue to produce increasingly diverse speech varieties. Yet current spoken dialogue systems frequently fail on under-represented dialects and accents, often misidentifying the input language and causing cascading failures in downstream dialogue tasks. Addressing this dialectal variance under low-resource constraints remains an open challenge, as standard fine-tuning is computationally expensive and prone to overfitting on high-dimensional speech data. We propose Convex Language Detection (CLD), a novel framework that integrates theoretically grounded convex optimization techniques into the spoken dialogue systems pipeline. Our method is efficiently implemented via multi-GPU Alternating Direction Method of Multipliers (ADMM) in JAX, thus providing global optimality guarantees and fast training in polynomial time. Theoretically, we prove that our convex objective induces certified margin stability and provide guarantees against feature perturbations. Empirically, we demonstrate sample efficiency and robustness to input dialectal variation, achieving 97-98% accuracy in challenging low-resource regimes. Our open-source package is available at https://pypi.org/project/jaxcld/

2605.23220 2026-05-25 cs.LG

WMAttack: Automated Attack Search for Adversarial Evaluation of World-Model Agents

Zhixiang Guo, Siyuan Liang, Shi Fu, Cheng Guo, Andras Balogh, Mark Jelasity, Dacheng Tao

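AI Summary Although world models are increasingly used as decision-making agents, their adversarial robustness remains understudied because dedicated automated evaluation methods are lacking. To resolve the tension between accuracy and efficiency in attack evaluation, this paper proposes WMAttack, an automated attack-search framework for adversarial evaluation of world-model agents. The method searches over attack configurations under a finite budget and combines Self-Correcting Attack Search with Representation-Guided Attack Retrieval, which markedly improves the efficiency and effectiveness of attack discovery and outperforms existing baselines across multiple benchmark tasks.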

Abstract

Despite the growing use of world models as decision-making agents, their adversarial robustness remains underexplored due to the lack of dedicated automated evaluation methods. A key obstacle is that attack evaluation must be both accurate and efficient: weak manually tuned attacks can overestimate robustness, while exhaustive hyperparameter search is prohibitively expensive because each candidate requires closed-loop rollouts through learned latent dynamics. We introduce WMAttack, an automated attack-search framework for adversarial evaluation of world-model agents. WMAttack formulates robustness evaluation as a finite-budget search over attack configurations, including attack families, perturbation budgets, optimization steps, restarts, and allocation rules. To improve search accuracy, Self-Correcting Attack Search (SCAS) refines the attack proposal distribution using feedback from reward degradation, action instability, runtime cost, and rollout variability. To improve search efficiency, Representation-Guided Attack Retrieval (RGAR) retrieves effective historical configurations from representation-similar tasks, providing a warm start for unseen environments. We provide a theoretical explanation showing that proposal refinement improves finite-budget search when it shifts probability mass toward high-utility attacks. Across Atari and DeepMind Control tasks, WMAttack consistently discovers stronger attacks than the evaluated baselines, improving normalized reward drop from 0.497 to 1.034 on DreamerV3 Atari and from 0.319 to 0.682 on DMC. Ablations further show that RGAR improves initial candidate quality and SCAS improves final attack utility under fixed evaluation budgets.

2605.23219 2026-05-25 cs.LG cs.AI

PaP-NF: Probabilistic Long-Term Time Series Forecasting via Prefix-as-Prompt Reprogramming and Normalizing Flows

Minju Kim, Youngbum Hur

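AI Summary This paper proposes PaP-NF, a probabilistic long-term time series forecasting framework. It aligns continuous time series representations with a frozen large language model through a Prefix-as-Prompt mechanism and conditions a normalizing flow decoder on the global context extracted by that model, thereby modeling uncertainty. The method performs well on multiple long-term forecasting benchmarks, capturing multi-modal uncertainty while maintaining high point forecasting accuracy.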

Comments Accepted to ICPR 2026

Abstract

Time series forecasting plays a central role in many real-world applications and has been extensively studied. Most existing approaches rely on deterministic models. However, real-world environments exhibit inherently uncertain and complex future behaviors, making single-point predictions insufficient. This highlights the need for probabilistic forecasting methods that can quantify and represent uncertainty. In this work, we propose PaP-NF, a probabilistic forecasting framework that aligns continuous time series representations with a frozen large language model (LLM) using a Prefix-as-Prompt mechanism, and conditions a normalizing flow decoder on the global context extracted by the LLM. The quality of the resulting predictive distributions is evaluated using the Continuous Ranked Probability Score (CRPS), a standard metric in probabilistic forecasting. Across a variety of long-term forecasting benchmarks, PaP-NF robustly captures multi-modal uncertainty while maintaining competitive point forecasting accuracy. The official implementation is available at: https://github.com/democracy04/PaP-NF
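
For readers unfamiliar with CRPS, a common sample-based estimator is CRPS(F, y) ≈ E|X - y| - 0.5 E|X - X'|, where X and X' are drawn independently from the forecast distribution F and y is the observed value. A minimal Python sketch of that estimator follows. `crps_from_samples` is an illustrative name, not a function from the PaP-NF repository, and the paper may compute the score differently (for example from quantiles).

```python
def crps_from_samples(samples, observation):
    """Sample-based CRPS estimate for one scalar observation (lower is better)."""
    n = len(samples)
    if n == 0:
        raise ValueError("need at least one forecast sample")
    # Average distance between forecast samples and the observed value.
    term_obs = sum(abs(x - observation) for x in samples) / n
    # Average pairwise distance between forecast samples (the spread).
    term_spread = sum(abs(a - b) for a in samples for b in samples) / (n * n)
    return term_obs - 0.5 * term_spread


# With a single sample the score reduces to the absolute error of a point forecast.
print(crps_from_samples([2.0], 5.0))
# A two-sample forecast that brackets the observation.
print(crps_from_samples([0.0, 2.0], 1.0))
```

Because the spread term is subtracted, a forecast is not penalized for being wide as long as the observation falls where the samples concentrate, which is why the score suits multi-modal predictive distributions.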

2605.23218 2026-05-25 cs.AI

Foundation Protocol: A Coordination Layer for Agentic Society

Bang Liu, Yongfeng Gu, Jiayi Zhang, Zhaoyang Yu, Sirui Hong, Maojia Song, Xiaoqiang Wang, Mingyi Deng, Zijie Zhuang, Ronghao Wang, Mingzhe Cao, Yutong Zhu, Xingjian Li, Yifan Wu, Jianhao Ruan, Yiran Peng, Shuangrui Chen, Jinlin Wang, Yizhang Lin, Dongjie Zhang, Dekun Wu, Chen Ma, Lizi Liao, Han Yu, Jian Pei, Heng Ji, Qiang Yang, Yuyu Luo, Chenglin Wu

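AI Summary As autonomous agent systems become part of social infrastructure, coordination becomes the key bottleneck to scaling. This paper proposes the Foundation Protocol (FP), a coordination layer intended to provide foundational architecture for a society in which humans and AI coexist. FP unifies different types of entities through a graph structure, supports multi-party and event-driven collaboration, and introduces economic primitives and governance mechanisms to ensure composability and traceable accountability. The protocol is designed to be compatible with existing standards, to lower integration and governance costs, and to let autonomous agent systems develop in an open, pluralistic, and governable environment.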

Abstract

Autonomous agents are moving from tools into a layer of social infrastructure: they browse, purchase, deploy software, manage systems, and increasingly interact with one another. As these systems scale, the bottleneck shifts away from raw model capability toward coordination. Agents need to form reliable relationships, organize multi-agent work, exchange value, support an AI economy, and stay safe and accountable under real-world oversight. This paper introduces the Foundation Protocol (FP), a graph-first coordination layer for an emerging human-AI society. FP unifies heterogeneous entities, including agents, tools, resources, humans, institutions, and organizations, and supports native multi-party organization and event-based collaboration. It also provides economic primitives for metering, receipts, and settlement, and treats policy, provenance, and audit as first-class concerns. FP is designed to wrap and bridge existing protocols rather than replace them, enabling incremental adoption while reducing integration and governance overhead. The aim is to keep autonomous agency composable while keeping accountability non-negotiable, so that coordination itself can become shared infrastructure for a human-AI society that is open, pluralistic, and governable.

2605.23216 2026-05-25 cs.CV

CaST-Bench: Benchmarking Causal Chain-Grounded Spatio-Temporal Reasoning for Video Question Answering

Mingfang Zhang, Jingjing Pan, Ashutosh Kumar, Rajat Saini, Mustafa Erdogan, Hsuan-Kung Yang, Caixin Kang, Yifei Huang, Yoichi Sato, Quan Kong

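AI Summary CaST-Bench is a new benchmark for evaluating causal chain-grounded spatio-temporal reasoning in video question answering. It targets the lack of fine-grained, verifiable evidence available for assessing the causal reasoning of existing models. Built through human-AI collaboration, it is a high-quality dataset of 2,066 questions, each accompanied by causal-chain evidence annotated with temporal segments and bounding boxes. The study also designs new evaluation metrics that jointly measure answer correctness and visual-evidence reasoning, revealing the shortcomings of current vision-language models in constructing precise causal chains and pointing to directions for future model improvement.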

Comments CVPR 2026

Abstract

Cause-and-effect reasoning in video is a significant challenge for Vision-Language Models (VLMs), as it requires going beyond surface-level perception to a deeper understanding of causal mechanisms. However, existing benchmarks rarely provide the fine-grained, grounded evidence needed to rigorously evaluate this capability. To address this gap, we introduce CaST-Bench, a benchmark for Causal Chain-Grounded Spatio-Temporal Video Reasoning. CaST-Bench presents complex causal questions that require models to identify and localize a chain of multiple pieces of spatio-temporal evidence. Through a human-AI collaborative pipeline, we construct a high-quality dataset of 2,066 questions over 1,015 videos, with causal chains annotated by temporal segments and bounding-box tracks. Furthermore, we design a comprehensive evaluation suite with novel metrics that assess not only answer correctness but also the capability for reasoning grounded in visual evidence. This grounding is crucial for improving accuracy by mitigating spurious correlations and for enhancing user trust by making models more transparent. Our experiments show that current VLMs struggle with causal questions, largely due to their limited ability to construct precise and grounded causal chains. This highlights an important direction for improving future VLMs.

2605.23215 2026-05-25 cs.LG cs.AI cs.CL

FastKernels: Benchmarking GPU Kernel Generation in Production

Gabriele Oliaro, Yichao Fu, May Jiang, Owen Lu, Junli Wang, Zhihao Jia, Hao Zhang, Samyam Rajbhandari

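AI Summary Performance evaluation of current LLM-based GPU kernel generation agents suffers from a mismatch between benchmarks and real production environments. To address this, the study proposes FastKernels, a benchmark built from 46 representative architectures across 8 categories that covers 96.2% of HuggingFace Transformers architectures and also serves as a production-grade inference framework. Experiments show that existing state-of-the-art kernel generation agents achieve limited speedups on FastKernels, highlighting the gap between benchmarks and real-world use as a key bottleneck.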

Abstract

LLM-based agents for GPU kernel generation are advancing rapidly, yet their progress is fundamentally constrained by the benchmarks they optimize against. Existing benchmarks are poorly aligned with production inference frameworks: they evaluate kernels on a single GPU with synthetic inputs, ignore the surrounding compilation stack, and reward replicating known optimizations rather than discovering new ones. The resulting reward signals are misleading: agents learn to generate kernels that score well in sandboxes but introduce interface incompatibilities, compilation-stack conflicts, and silent correctness degradation when integrated into real systems. We introduce FastKernels, a kernel benchmark built around a minimal set of 46 representative architectures spanning 8 categories, whose kernels collectively subsume those of 96.2% (409/425) of HuggingFace Transformers architectures. FastKernels doubles as a minimalistic, production-grade inference framework that runs at parity with hardened systems such as vLLM and SGLang on mainstream LLM serving and substantially exceeds upstream references on under-served architectures; each task's interface mirrors the corresponding module in the state-of-the-art library for its architecture family, enabling direct deployment of optimized kernels into production codebases. Evaluating state-of-the-art kernel agents on FastKernels, we find that even the strongest agent achieves only $0.94\times$ aggregate speedup over production baselines, with weaker agents at $0.78\times$ and $0.53\times$ -- confirming that benchmark-production misalignment is a critical bottleneck for the field. We release FastKernels as a stepping stone toward kernel agents whose benchmark gains translate directly into production throughput improvements. Code is available at https://github.com/Snowflake-AI-Research/fastkernels

2605.23204 2026-05-25 cs.AI

AutoResearch AI: Towards AI-Powered Research Automation for Scientific Discovery

Guiyao Tie, Jiawen Shi, Dingjie Song, Yixiao Huang, Ziji Sheng, Xueyang Zhou, Daizong Liu, Pan Zhou, Yongchao Chen, Ran Xu, Lifang He, Qingsong Wen, Manling Li, Cong Lu, Shuai Li, Pengtao Xie, Yixuan Yuan, Rui Meng, Lei Xing, Lichao Sun, Caiming Xiong, Philip S. Yu, Jianfeng Gao

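AI Summary This paper examines the development of AI-powered research automation (AutoResearch), which aims to use AI to automate the full research workflow, from literature review and hypothesis generation to experimental validation and reporting of results. It analyzes the shortcomings of current systems in autonomy, domain applicability, and validation mechanisms, and proposes five evaluation dimensions. It notes that the degree of AutoResearch autonomy depends on the application setting: it is more credible in structured, executable, and easily verifiable domains, and still faces challenges in complex contexts involving ethics and institutional accountability.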

Comments 49 pages, 12 figures, 10 tables

Abstract

Scientific research is being reshaped by AI systems that move beyond isolated assistance toward longer-horizon workflows spanning literature grounding, hypothesis generation, experimentation, validation, reporting, and revision. This shift marks a transition from task-level AI for science to workflow-level research automation. Yet current systems remain fragmented, differing in autonomy, domain scope, execution environment, validation mechanism, and human oversight, while still struggling with evidence preservation, reproducibility, weak-direction rejection, provenance tracking, cross-domain robustness, and accountable scientific closure. This survey examines these developments through AutoResearch, defined as the developmental spectrum of AI-powered scientific workflow automation. Within it, Vibe Research denotes the human-steered region of prompt-based assistance and human-verified execution, whereas emerging AI-led systems coordinate larger portions of the discovery loop without achieving robust autonomy. We analyze how research systems redistribute control, evidence, execution, validation, and accountability across workflows and organize the field around five workflow conditions: literature and research grounding; hypothesis formation and planning; experimentation and tool use; feedback, validation, and review; and reporting and knowledge communication. We further synthesize AI scientist systems, mixed-initiative co-research frameworks, benchmarks, domain deployments, and open-source infrastructures. Finally, we propose five evaluation dimensions--novelty, validity, impact, reliability, and provenance--and show that AutoResearch autonomy is domain-conditioned, being more credible in structured, executable, and rapidly verifiable settings but limited in embodied, delayed, heterogeneous, ethical, or institutionally accountable contexts.

2605.23203 2026-05-25 cs.CV cs.AI cs.LG cs.RO

Lipschitz Optimization for Formal Verification of Homographies

Jean-Guillaume Durand, Panagiotis Kouvaros, Maxime Gariel, Alessio Lomuscio

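AI Summary This paper studies formal robustness verification for vision neural networks deployed in safety-critical domains, focusing on how 3D perturbations caused by camera motion affect image formation. The authors propose a verification method based on Lipschitz optimization and piecewise continuity analysis, establish a closed-form mapping from camera pose to pixel values, and derive tight linear bounds on perturbed pixel values. The method applies to scenes with planar structure, as found in augmented reality, autonomous driving, and robotic manipulation. It is validated on multiple benchmarks and improves on existing methods in both speed and bound tightness.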

Comments 18 pages, 13 figures, 6 tables, to be published at CVPR 2026

Abstract

The adoption of vision neural networks in regulated industries requires formal robustness guarantees, especially in safety-critical domains such as healthcare, autonomous vehicles, and aerospace. However, current approaches are confined to incomplete statistical verification or robustness to $\ell_p$-norm and affine transforms, which cover only a narrow subset of perturbations to the image formation process. In particular, robustness to camera motion remains an open problem despite being key to deploying many vision applications. We present a formal verification approach that targets robustness against 3D motion perturbations of the capturing camera. We first establish a closed-form mapping from camera pose to pixel values. By analyzing the continuity properties of the resulting homographies, we show that recent work on Lipschitz optimization and piecewise continuity can be extended to derive tight linear bounds on perturbed pixel values. Our approach applies to scenes with predominantly planar structure, such as ground planes in augmented reality, road markings and traffic signs in autonomous driving, or planar workspaces in robotic manipulation. This enables the first formal verification of projective geometry transforms, without complex simulation, surrogate networks, or explicit image-formation models. We validate our implementation and show up to 89% speedup and 7% tighter bounds over prior work. We then evaluate our method on the VNN-COMP benchmark and reveal systematic weaknesses to projective perturbations. Finally, we demonstrate a real-world case study on a safety-critical runway classifier, highlighting practical vulnerabilities to camera motion, and addressing a key challenge in the certification of learned models. Data and code are publicly available at https://github.com/jeangud/homography-verification.
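
For a planar scene, the pose-to-pixel relationship can be illustrated with the standard plane-induced homography H = K (R + t n^T / d) K^-1, where a scene point moves as X2 = R X1 + t and the plane satisfies n^T X1 = d in the first camera's frame. The Python sketch below is a textbook construction under a pinhole model with a single focal length. It is an assumption about the setting, not the authors' parameterization, and all function names and numeric values are hypothetical.

```python
def matmul3(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def plane_homography(f, cx, cy, R, t, n, d):
    """Homography induced by the plane n^T X = d under the motion X2 = R X1 + t."""
    K = [[f, 0.0, cx], [0.0, f, cy], [0.0, 0.0, 1.0]]
    # Closed-form inverse of the pinhole intrinsics matrix K.
    K_inv = [[1.0 / f, 0.0, -cx / f], [0.0, 1.0 / f, -cy / f], [0.0, 0.0, 1.0]]
    # Euclidean part of the homography: R + t n^T / d.
    M = [[R[i][j] + t[i] * n[j] / d for j in range(3)] for i in range(3)]
    return matmul3(matmul3(K, M), K_inv)


def warp_pixel(H, u, v):
    """Apply H to pixel (u, v) and dehomogenize."""
    x = H[0][0] * u + H[0][1] * v + H[0][2]
    y = H[1][0] * u + H[1][1] * v + H[1][2]
    w = H[2][0] * u + H[2][1] * v + H[2][2]
    return x / w, y / w


IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Hypothetical setup: f = 500 px, principal point (320, 240), a fronto-parallel
# plane 2 m away, camera shifted 0.1 m to the right, so points move by t = (-0.1, 0, 0).
H = plane_homography(500.0, 320.0, 240.0, IDENTITY,
                     [-0.1, 0.0, 0.0], [0.0, 0.0, 1.0], 2.0)
u2, v2 = warp_pixel(H, 320.0, 240.0)
print(round(u2, 6), round(v2, 6))  # the pixel shifts left by f * b / d = 25 px
```

Every entry of H is a smooth function of the pose parameters wherever the denominator in `warp_pixel` stays away from zero, which is the kind of regularity that Lipschitz-based bounding relies on.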

2605.23201 2026-05-25 cs.SD cs.MM

MixFake: Benchmarking and Enhancing Audio Deepfake Detection in Diverse Real-world Mixed Audio

Qingcao Li, Yipeng Lin, Weichen Lian, Zhongjie Ba, Peng Cheng, Zhichao Lian

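AI Summary This paper presents MixFake, a large-scale benchmark dataset for evaluating and improving audio deepfake detection, designed to simulate complex real-world speech environments that contain background music or noise. To address the weaknesses of existing self-supervised-learning-based methods on non-speech or mixed-source audio, the authors propose a multi-stream prompt tuning framework that injects signal-level priors to strengthen the SSL model's ability to capture audio artifacts. Experiments show that the method significantly outperforms existing approaches on both foreground detection and complex background detection.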

Comments Accepted by ICME2026

Abstract

Speech deepfake detection has achieved remarkable success in clean environments but faces significant challenges in complex, real-world scenarios where speech is often mixed with background music or noise. Current state-of-the-art methods rely on semantic features from self-supervised learning (SSL) models, which often fail when processing non-speech or mixed-source audio. In this paper, we first introduce MixFake, a large-scale benchmark dataset designed to simulate diverse acoustic environments with varying SNR levels and mixed authenticity components. To address the "semantic-centric" limitation, we propose a Multi-stream Prompt Tuning framework that injects signal-level priors into SSL backbones. By integrating base, frequency, and texture streams through deep prompt injection, our model effectively captures acoustic artifacts. Experimental results demonstrate that our method significantly outperforms existing baselines, achieving a 0.95% EER in foreground detection and a substantial 7.72% absolute improvement in complex background detection tasks. Our dataset and code are available at https://github.com/saltfish233/MixFake.
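
Mixing speech with a background at a target SNR is usually done by scaling the background so that the speech-to-background power ratio matches the requested level. The Python sketch below shows that standard recipe. It is not taken from the MixFake code, and `mix_at_snr` is a hypothetical helper that operates on equal-length lists of samples.

```python
import math


def mix_at_snr(speech, background, snr_db):
    """Scale `background` so the mixture has the requested SNR in dB, then add it."""
    if len(speech) != len(background):
        raise ValueError("speech and background must have the same length")
    p_speech = sum(s * s for s in speech) / len(speech)
    p_background = sum(x * x for x in background) / len(background)
    if p_background == 0.0:
        raise ValueError("background is silent; SNR is undefined")
    # SNR_dB = 10 * log10(p_speech / (gain^2 * p_background)), solved for gain.
    gain = math.sqrt(p_speech / (p_background * 10 ** (snr_db / 10)))
    mixed = [s + gain * x for s, x in zip(speech, background)]
    return mixed, gain


speech = [1.0, -1.0, 1.0, -1.0]        # toy signal with unit power
background = [1.0, 1.0, -1.0, -1.0]    # toy background with unit power
mixed, gain = mix_at_snr(speech, background, 20.0)
print(gain)  # at 20 dB with equal powers the background is scaled by about 0.1
```

Sweeping `snr_db` over a range of values is one simple way to produce the kind of varying-SNR conditions the abstract describes.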

2605.23200 2026-05-25 cs.LG cs.AI

Adaptive Mass-Segmented KV Compression for Long-Context Reasoning

Junzhe Yang, Xiaoyu Shen

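AI Summary In long-context reasoning, linear growth of the key-value (KV) cache is a key bottleneck. Existing compression methods evict tokens based on importance scores, but this can wipe out contiguous reasoning blocks and break logical coherence. This paper proposes the Adaptive Mass-Segmented (AMS) KV compression framework, which attends to the spatial distribution of attention mass and dynamically allocates memory quotas so that critical reasoning segments stay stable. It is compatible with many mainstream compression methods and modern KV serving frameworks. Experiments show that AMS effectively mitigates structural fragmentation and improves model performance.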

Abstract

The linear growth of the Key-Value (KV) cache is a critical bottleneck in long-form LLM inference. Existing KV compression methods mitigate this by evicting tokens based on importance scores. However, we show that their reliance on global Top-k selection triggers Region Wipe-out: the severe eviction of contiguous reasoning blocks that derails logical coherence. To address this, we propose Adaptive Mass-Segmented (AMS) KV Compression, a framework that shifts the paradigm from token-level competition to region-aware quota allocation. AMS adaptively partitions the KV cache based on the spatial distribution of attention mass, ensuring structurally vital reasoning segments receive guaranteed memory quotas. To ensure stability during iterative decoding, an EMA-based smoothing mechanism is incorporated to prevent jitter in segment boundaries. Crucially, AMS is a universal plug-and-play layer that is orthogonal to existing scorers. It can be seamlessly integrated into representative methods such as TOVA, Expected Attention, KeyDiff, R-KV and TriAttention. AMS is also system-compatible with modern paged-KV serving frameworks such as vLLM, supporting efficient gather-and-compact KV execution without introducing additional steady-state attention overhead. Extensive experiments across a diverse suite of tasks, including mathematical reasoning (MATH500, AIME, GSM8K), code completion, open-domain QA, and sparse retrieval, demonstrate that AMS consistently mitigates structural fragmentation and boosts model performance.
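
To make the Region Wipe-out failure mode concrete, the toy Python sketch below compares global Top-k eviction with a per-segment quota. It is a simplified stand-in, not the AMS algorithm: segment boundaries are fixed by hand and every segment gets the same floor quota, whereas AMS derives boundaries from the attention-mass distribution and smooths them with an EMA. The function names and scores are hypothetical.

```python
def global_topk(scores, budget):
    """Keep the `budget` highest-scoring token indices, regardless of position."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:budget])


def quota_topk(scores, segments, budget, floor):
    """Guarantee `floor` slots per segment, then fill the rest by global score."""
    if floor * len(segments) > budget:
        raise ValueError("budget too small for the requested per-segment floor")
    kept = set()
    for start, end in segments:  # half-open [start, end) index ranges
        best = sorted(range(start, end), key=lambda i: scores[i], reverse=True)
        kept.update(best[:floor])
    rest = sorted((i for i in range(len(scores)) if i not in kept),
                  key=lambda i: scores[i], reverse=True)
    kept.update(rest[:budget - len(kept)])
    return sorted(kept)


# Toy importance scores: the middle block (indices 4-7) scores low everywhere.
scores = [0.9, 0.8, 0.7, 0.6, 0.1, 0.2, 0.15, 0.05, 0.5, 0.55, 0.4, 0.45]
segments = [(0, 4), (4, 8), (8, 12)]
print(global_topk(scores, 6))           # [0, 1, 2, 3, 8, 9]: middle block wiped out
print(quota_topk(scores, segments, 6, 1))  # [0, 1, 2, 3, 5, 9]: index 5 survives
```

Both policies keep six tokens, so the quota changes which tokens survive rather than how much memory is used.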

2605.23198 2026-05-25 cs.LG

Label-Efficient Dataset Pruning via Semi-Supervised Pseudo-Labeling

Yeseul Cho, Baekrok Shin, Changmin Kang, Chulhee Yun

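AI Summary This paper proposes SemiPrune, an efficient semi-supervised dataset pruning method that addresses the reliance of traditional pruning methods on large amounts of labeled data. It requires only a small, randomly chosen labeled portion of the data and generates pseudo-labels to exploit the large unlabeled remainder, which improves pruning quality. Unlike methods that depend on pretrained model features, SemiPrune learns directly from the target dataset, captures the data distribution more accurately, and improves pruning reliability and performance, achieving better experimental results than existing methods on multiple datasets.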

Comments 10 pages

Abstract

Dataset pruning reduces the storage and training costs of deep learning by selecting an informative subset from a large dataset. However, most existing pruning methods require fully labeled data, which limits their applicability in realistic settings where unlabeled data are abundant and annotation is costly. Recent label-free pruning methods address this issue, but they rely on features from pretrained models to estimate example difficulty. This dependence can be unreliable when the target dataset differs substantially from the pretraining distribution. We propose SemiPrune, a label-efficient dataset pruning framework that needs only a small, randomly selected labeled subset. It uses semi-supervised learning to generate pseudo-labels for the unlabeled data, so that existing supervised pruning methods, which require label information, can be applied directly to the resulting pseudo-labeled training pool. We then estimate example difficulty from pseudo-label-induced training dynamics and select a coreset. By learning directly from the target dataset, our method better captures the target distribution and provides more reliable signals for difficulty estimation and coreset selection. We validate our approach on domain-specific, image-corrupted, and long-tailed datasets, where it achieves state-of-the-art performance among label-free and label-efficient baselines, while also demonstrating competitive performance on standard benchmarks.

2605.23194 2026-05-25 cs.LG cs.AI

Scalable Heterogeneous Graph Foundation Models for Data-Driven Optimal Power Flow in Smart Grids

Massimiliano Lupo Pasini, Yijiang Li, Kibaek Kim, Teja Kuruganti

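AI Summary This paper proposes a scalable heterogeneous graph neural network (GNN) framework based on HydraGNN for building data-driven optimal power flow (OPF) surrogate models and graph foundation models (GFMs). The approach preserves the heterogeneous structure of node and edge types in power networks and supports distributed preprocessing, training, hyperparameter optimization, and downstream fine-tuning on supercomputers. Experiments show that the framework yields compact models with few parameters yet lower validation loss, and that it significantly improves low-data model performance and training efficiency on feasibility classification and N-1 contingency regression tasks.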

Comments 10 pages, 6 tables, 4 figures

Abstract

Fast and reliable optimal power flow (OPF) approximation is essential for dependable smart-grid operation, yet many learning-based surrogates either flatten the native heterogeneous structure of power networks, target a limited set of grid topologies, or lack scalable infrastructure for graph foundation model (GFM) training. This paper presents a scalable heterogeneous graph neural network (GNN) workflow, built on HydraGNN, for data-driven OPF surrogate modeling and OPF-GFM development. The workflow preserves the distinct node and edge types of power grids -- buses, generators, loads, shunts, AC lines, transformers, and device-to-bus couplings -- and supports distributed preprocessing, training, hyperparameter optimization (HPO), and downstream fine-tuning on leadership-class supercomputers. Using three million heterogeneous graph instances spanning ten PGLib-OPF cases, from 14 to 13,659 buses, we conduct DeepHyper-driven HPO on the ORNL Frontier supercomputer. The campaign identifies compact models ($\sim$1.6--1.7M parameters) with the lowest validation losses. Downstream experiments on feasibility classification and N-1 contingency regression show that fine-tuning a pretrained OPF GFM improves low-data accuracy, stabilizes training, accelerates convergence, and reduces adaptation cost when partial or head-only fine-tuning is used.

2605.23191 2026-05-25 cs.LG cs.IR cs.NA math.NA

Expand More, Shrink Less: Shaping Effective-Rank Dynamics for Dense Scaling in Recommendation

Guoming Li, Shangyu Zhang, Junwei Pan, Wentao Ning, Jin Chen, Gengsheng Xue, Chao Zhou, Shudong Huang, Haijie Gu, Menglin Yang

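AI Summary Scaling recommendation models is a core challenge in recommender systems. To address the embedding collapse that arises when the existing RankMixer method is scaled, this paper proposes RankElastor, a new architecture that uses a parameterized full-mixing mechanism and an improved GLU-style feedforward network to improve the spectral stability of representations and mitigate effective-rank decay. Experiments show that RankElastor significantly improves recommendation performance on large-scale industrial datasets and exhibits more robust scaling behavior.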

Comments Accepted at the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (Research Track), KDD 2026 February Cycle

Abstract

Scaling recommendation models is a central challenge in recommender systems. Recently, RankMixer has emerged as an effective solution, operating on a unified token representation and alternating between token mixing and per-token feedforward networks (P-FFNs) to achieve scalable performance. However, RankMixer suffers from embedding collapse, where learned representations have low effective rank, limiting expressivity and underutilizing the expanded representation space. Through empirical analysis and theoretical insights, we identify rigid token mixing and P-FFN modules as the primary causes of this phenomenon, jointly inducing a damped oscillatory trajectory in effective-rank evolution across layers. To address it, we propose RankElastor, a novel architecture that produces spectrum-robust representations with provable collapse mitigation. RankElastor introduces two components: (i) parameterized full mixing, which enables expressive token mixing with improved spectral robustness; and (ii) GLU-improved P-FFNs, which stabilize representation spectra through GLU-style FFN modules. Extensive experiments on large-scale industrial datasets demonstrate that RankElastor consistently improves recommendation performance, mitigates embedding collapse, and exhibits robust scaling behavior. Code is available at this GitHub repository: https://github.com/vasile-paskardlgm/RankElastor
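
Effective rank is commonly defined as the exponential of the Shannon entropy of the normalized singular values of a representation matrix. The Python sketch below applies that definition to a list of singular values. Whether RankElastor measures effective rank in exactly this way is an assumption, and `effective_rank` is an illustrative name.

```python
import math


def effective_rank(singular_values):
    """exp(entropy) of the singular values normalized to sum to one."""
    total = sum(singular_values)
    if total <= 0.0:
        raise ValueError("need at least one positive singular value")
    # Zero singular values contribute nothing to the entropy, so skip them.
    probs = [s / total for s in singular_values if s > 0.0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)


# A flat spectrum uses every dimension; a spiked spectrum is close to rank one.
print(effective_rank([1.0, 1.0, 1.0, 1.0]))
print(effective_rank([10.0, 0.1, 0.1, 0.1]))
```

Under this definition, embedding collapse shows up as a value far below the nominal embedding dimension, and tracking it layer by layer would expose the kind of trajectory the abstract describes.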

2605.23190 2026-05-25 cs.CL

Hidden Human-Like Nature of Machine-Generated Texts: Theory and Detection Enhancement

机器生成文本的隐藏类人本质:理论与检测增强

Chenwang Wu, Yiu-ming Cheung, Bo Han, Defu Lian

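AI summary As text generated by large language models becomes increasingly common across applications, its potential misuse has created an urgent need for detection methods. Existing methods usually treat generated text as entirely machine-like and overlook spans that may resemble human writing. This paper reveals the existence of such hidden human-like spans, analyzes their negative effect on detection, and proposes a model-agnostic stacked enhancement framework that improves detection through iterative filtering and refinement. Experiments show that the method is effective across a range of scenarios and supports training-free deployment.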

AI中文摘要

由大型语言模型(LLMs)生成的机器生成文本(MGTs)在各种应用中越来越普遍,但其在虚假新闻传播和网络钓鱼中的潜在滥用引发了严重担忧,凸显了MGT检测的必要性。现有的段落级检测方法通常将MGT视为完全机器化的,忽略了机器生成文本的隐藏类人本质:即使是完全机器生成的文本也可能包含与人类写作高度一致的片段。为此,我们首先揭示了这种隐藏类人片段的存在,然后从理论上分析了它们对检测的影响。我们的分析表明,这些片段增加了检测的句子复杂度,从而使MGT检测本质上更加困难。基于这一发现,我们提出了一种模型无关的堆叠增强框架,通过减少隐藏类人片段的影响来改进现有检测器。具体来说,我们将片段级别的保留决策建模为潜在变量问题,并使用硬EM启发式过程实例化优化,其中检测器迭代地过滤置信度高的类人子序列,并在剩余文本上自我优化。在各种LLM和实际场景中的大量实验表明,所提出的框架能够持续增强现有检测器。值得注意的是,该框架还可以以无需训练的方式工作,为实际部署提供了灵活性和可扩展性。

英文摘要

Machine-generated texts (MGTs) produced by large language models (LLMs) are increasingly prevalent across various applications, while their potential misuse in fake news propagation and phishing has raised serious concerns, highlighting the need for MGT detection. Existing paragraph-level detection methods commonly treat MGTs as entirely machine-like, overlooking the hidden human-like nature of machine-generated texts: even fully machine-generated texts may contain spans that are highly consistent with human writing. To this end, we first reveal the existence of such hidden human-like spans, and then theoretically analyze their impact on detection. Our analysis shows that these spans increase the sentence complexity for detection, thereby making MGT detection intrinsically harder. Based on this finding, we propose a model-agnostic stacked enhancement framework that improves existing detectors by reducing the influence of hidden human-like spans. Specifically, we model span-level retention decisions as a latent-variable problem and instantiate the optimization with a hard-EM-inspired procedure, where the detector iteratively filters confidently human-like subsequences and refines itself on the remaining text. Extensive experiments across various LLMs and practical scenarios demonstrate that the proposed framework consistently enhances existing detectors. Notably, the framework can also work in a training-free manner, offering flexibility and scalability for practical deployment.

2605.23189 2026-05-25 cs.LG

Empirical Bayes Conformal Prediction for Vision and Language Models

视觉与语言模型的经验贝叶斯共形预测

Jiapeng Zeng, Yogesh Prabhu, Zhanpeng Zeng, Michael A. Newton, Vikas Singh

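AI summary This paper proposes an empirical Bayes conformal prediction framework for assessing prediction confidence in vision and language models. The method introduces $r$-values that convert score variability into an uncertainty-aware score, giving a more accurate judgment of whether a candidate belongs to the top-ranked group. While maintaining the target coverage, it reduces the inclusion of high-variance false candidates, and it shows more stable rankings and smaller prediction sets across several benchmark tasks.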

AI中文摘要

共形预测(CP)为现代视觉和语言模型提供无分布覆盖,但通常被迫从单个不稳定的非一致性得分中做出排序决策。标准CP使用一次实现,而平均后校准变体将多次实现平滑为点估计。这两种选项都丢弃了有助于识别候选是否真正稳定的不一致性。一个弱答案可能进入共形集,即使证据不充分,仅仅因为一个后验样本或提示措辞使其看起来很强。但变异性有助于区分稳定信号和噪声驱动的波动。我们描述了一个经验贝叶斯共形预测框架,该框架使用r值将得分变异性转化为不确定性感知的非一致性得分。得到的r值估计一个候选的潜在得分在考虑其均值和不确定性后属于排名靠前组的可能性。它既接受闭式正态-正态经验贝叶斯估计器,也接受非参数后验采样估计器。使用r值作为非一致性得分在温和正则条件下保留了目标共形覆盖,同时可证明地减少了高方差假候选的包含。在图像分类、基于CLIP的VLM基准和LLM上,我们展示了r值共形预测在变异性具有信息性时保持目标覆盖,同时提高排序稳定性并减小集合大小,并在变异性消失时恢复为类似CP的行为。

英文摘要

Conformal prediction (CP) gives distribution-free coverage for modern vision and language models, but it is often forced to make a ranking decision from a single unstable nonconformity score. Standard CP uses one realization, while average-then-calibrate variants smooth multiple realizations into a point estimate. Both options discard the inconsistency that can help identify whether a candidate is indeed stable. A weak answer can enter the conformal set even if the evidence is not strong, simply because one posterior sample or prompt phrasing made it look strong. But variability can help distinguish a stable signal from noise-driven fluctuations. We describe an empirical Bayes conformal prediction framework that uses $r$-values to convert score variability into an uncertainty-informed nonconformity score. The resulting $r$-value estimates how likely a candidate's latent score is to belong to the top-ranked group after accounting for both its mean score and its uncertainty. It admits both a closed-form Normal-Normal empirical Bayes estimator and a nonparametric posterior-sampling estimator. Using the $r$-value as the nonconformity score preserves the target conformal coverage while provably reducing the inclusion of high-variance false candidates under mild regularity conditions. Across image classification, CLIP-based VLM benchmarks, and LLMs, we show that $r$-value conformal prediction preserves target coverage while improving ranking stability and reducing set size when variability is informative, and that it reverts to CP-like behavior when variability vanishes.
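For readers unfamiliar with the baseline the abstract contrasts against, the following is a minimal sketch of standard split conformal prediction for classification, using a single realization of the nonconformity score $1 - p(y)$. It does not implement the $r$-value method, whose exact construction the abstract does not give. The function names, the choice of score, and the synthetic data are assumptions made for illustration.

```python
import math
import numpy as np

def conformal_threshold(cal_scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of calibration scores."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))  # rank of the order statistic to use
    if k > n:
        return float("inf")  # too few calibration points: include every class
    return float(np.sort(cal_scores)[k - 1])

def prediction_set(probs, threshold):
    """Indices of classes whose score 1 - p is within the threshold."""
    return np.flatnonzero(1.0 - np.asarray(probs) <= threshold)

# Synthetic calibration data (hypothetical): softmax outputs in which the
# true class receives a boosted logit.
rng = np.random.default_rng(0)
n_cal, n_classes = 500, 5
labels = rng.integers(0, n_classes, size=n_cal)
logits = rng.standard_normal((n_cal, n_classes))
logits[np.arange(n_cal), labels] += 2.0
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# One nonconformity score per calibration example, from a single realization.
cal_scores = 1.0 - probs[np.arange(n_cal), labels]
q_hat = conformal_threshold(cal_scores, alpha=0.1)
print("threshold:", q_hat)
print("prediction set for first example:", prediction_set(probs[0], q_hat))
```

Because each score here comes from one realization, a candidate can cross the threshold through noise alone, which is the weakness the abstract attributes to standard CP.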