arXivDaily: arXiv daily academic digest, updated Monday to Friday
2605.24862 2026-05-26 cs.LG

Unifying Value Alignment and Assignment in Cross-Domain Offline Reinforcement Learning with Heterogeneous Datasets

Zhongjian Qiao, Jiafei Lyu, Chenjia Bai, Peisong Wang, Siyang Gao, Shuang Qiu

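Affiliations City University of Hong Kong; Tencent; Institute of Artificial Intelligence (TeleAI), China Telecom; Institute of Automation, Chinese Academy of Sciences

AI summary To address value misassignment in heterogeneous cross-domain offline RL, the paper proposes V2A, which unifies dynamics alignment, value alignment, and value assignment through temporally consistent modality representation learning and modality-aware advantage learning, significantly improving policy performance.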

Comments Accepted at ICML 2026

Abstract

Cross-domain offline reinforcement learning (RL) aims to learn a policy in the target domain with a limited target domain dataset and a source domain dataset that exhibits a dynamics shift. Training directly on the original source dataset typically leads to performance collapse. Recent studies perform data filtering from the perspective of dynamics alignment or value alignment to enable efficient policy transfer. However, these studies are typically validated on single-domain or single-behavior-policy source datasets. In this work, we explore a more general heterogeneous cross-domain offline RL setting, where the source datasets may be collected from multiple source domains by diverse behavior policies. We first uncover a critical yet overlooked issue in this setting: value misassignment. Empirically and theoretically, we demonstrate that value misassignment can undermine value alignment, mislead data filtering toward selecting suboptimal samples, and loosen the suboptimality gap, thereby degrading the agent's performance. To address this issue, we propose V2A, which integrates dynamics alignment, value alignment, and value assignment. V2A first employs temporally-consistent modality representation learning to extract dynamics modalities from the source dataset, followed by modality-aware advantage learning to rectify value alignment. Finally, it adopts a data filtering paradigm to selectively share source data for policy learning. Empirical results show that V2A significantly outperforms strong baseline methods under general heterogeneous cross-domain offline RL settings.

2605.24856 2026-05-26 cs.LG cs.AI

The Concept Allocation Zone: Tracking How Concepts Form Across Transformer Depth

James Henry

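Affiliations Independent Researcher

AI summary Proposes the Concept Allocation Zone (CAZ) framework, which uses layer-wise metrics (Separation, Concept Coherence, Concept Velocity) to detect the depth interval over which a concept gradually forms in the residual stream, and validates on 34 models that separation curves are multimodal and that gentle CAZes are causally active.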

Comments 34 models, 8 architectural families, 7 concepts. Companion papers: GEM (arXiv forthcoming), CAZ Validation (arXiv forthcoming), PRH Validation (arXiv forthcoming). Code: https://github.com/jamesrahenry/Rosetta_Tools

Abstract

Concept formation in transformer language models is depth-extended, not a single-layer event: concepts emerge gradually across a contiguous region of the residual stream. Mechanistic interpretability methods identify the single layer of peak class separation -- the "best layer" -- capturing a snapshot rather than the process itself. We introduce the Concept Allocation Zone (CAZ): the depth interval within which a concept becomes measurably separable, the region allocated to its geometric expression. We formalize the CAZ through three layer-wise metrics (Separation, Concept Coherence, Concept Velocity) and derive principled boundary detection without manual layer sweeps. A CAZ is not a concept: it is the depth region within which the model organizes its geometry to make a concept separable. A single concept typically participates in multiple CAZes; multiple concepts may share one. Empirical validation across 34 models from 8 architectural families and 7 concepts reveals that the separation curve S(l) is frequently multimodal. A scored detector uncovers "gentle CAZes" -- subtle allocation regions invisible to standard peak detection but causally active in 93-100% of cases under ablation (16 of 34 models; 26 in the companion validation paper). The framework generates seven testable predictions; four yield clear verdicts (two not supported, one partially supported, one supported), one had its precondition invalidated by the data, and two are underpowered -- with cross-architecture alignment confirmed as depth-matched rather than monolithic under leave-one-concept-out cross-validation. Reference implementation: rosetta_tools v1.3.1 (doi:10.5281/zenodo.20361433).
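The idea of a multimodal separation curve S(l) can be illustrated with a small sketch. The abstract does not give the paper's definitions, so the Fisher-style separation score, the peak rule, and the example curve below are all assumptions chosen for illustration; this is not the rosetta_tools implementation.

```python
from statistics import mean, pstdev

def separation(pos, neg):
    """Assumed metric: mean gap between two classes over their pooled spread."""
    pooled = (pstdev(pos) ** 2 + pstdev(neg) ** 2) ** 0.5
    return abs(mean(pos) - mean(neg)) / pooled if pooled > 0 else 0.0

def local_peaks(curve):
    """Indices of strict interior local maxima of a per-layer curve."""
    return [i for i in range(1, len(curve) - 1)
            if curve[i] > curve[i - 1] and curve[i] > curve[i + 1]]

# Hypothetical separation curve over 8 layers with two bumps.
S = [0.1, 0.4, 0.9, 0.5, 0.6, 1.2, 0.8, 0.3]
peaks = local_peaks(S)  # more than one peak means the curve is multimodal
```

A "best layer" analysis would report only the global maximum of such a curve, while the second, lower peak is the kind of region a zone-based view is meant to keep.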

2605.24852 2026-05-26 cs.LG cs.SY eess.SY

T2S-MPC: Time-Embedded Online Adaptive Model Predictive Control for Time-Varying Dynamics

Zeyu Shen, Zhuoyuan Wang, Laixi Shi

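Affiliations Department of Applied Mathematics and Statistics, Johns Hopkins University, MD, USA; Department of Electrical and Computer Engineering, Carnegie Mellon University, PA, USA

AI summary Proposes T2S-MPC, a framework that learns a residual dynamics model online using a time embedding and two-timescale updates, enabling adaptive model predictive control in fast time-varying environments; it outperforms classical and neural MPC methods on quadrotor tasks.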

Abstract

Recent advances in learning-based model predictive control (MPC) have leveraged neural networks for online model learning, achieving strong performance when nonstationary system dynamics deviate from nominal models. However, existing approaches primarily address specific or relatively structured forms of dynamical variation, leaving more general, unknown, and unpredictable time-varying dynamics insufficiently handled. To tackle this challenge, we propose T2S-MPC, a framework that adaptively learns a residual dynamics model online and integrates it with the nominal model within the MPC framework to enable fast-evolving online planning. To make the model time-aware, we explicitly encode temporal information through a structured time embedding and employ a two-timescale update scheme, allowing the controller to capture nonstationary dynamics while balancing rapid adaptation with stable learning. We evaluate the proposed method on a 2D quadrotor across stabilization and trajectory tracking tasks under diverse time-varying disturbances, including linear drifting and periodic perturbations. Experimental results show that T2S-MPC consistently outperforms classical MPC, neural MPC, and ablated variants in control performance, while also demonstrating strong robustness across a wide range of disturbance conditions without additional tuning. The source code is publicly available at https://github.com/Zeyuu0920/T2S_MPC
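As a rough illustration of a time-aware residual model, the sketch below adds a linear residual on sine/cosine time features to a nominal model and updates two weight sets with different step sizes. Everything here is an assumption: the abstract does not specify the embedding, the model class, or what "two-timescale" means in T2S-MPC, and the single-integrator nominal model, the periods, and the learning rates are hypothetical.

```python
import math

def time_embedding(t, periods=(1.0, 10.0)):
    """Sine/cosine features of time at a few hypothetical periods (seconds)."""
    feats = []
    for p in periods:
        omega = 2.0 * math.pi / p
        feats += [math.sin(omega * t), math.cos(omega * t)]
    return feats

def predict(x, u, t, w_fast, w_slow, dt=0.1):
    """Nominal model plus a learned, time-dependent residual."""
    nominal = x + dt * u  # hypothetical nominal dynamics: single integrator
    phi = time_embedding(t)
    residual = sum((wf + ws) * f for wf, ws, f in zip(w_fast, w_slow, phi))
    return nominal + residual

def update(x, u, t, x_next, w_fast, w_slow, lr_fast=0.1, lr_slow=0.01):
    """One gradient step on half the squared prediction error, two step sizes."""
    err = predict(x, u, t, w_fast, w_slow) - x_next
    phi = time_embedding(t)
    new_fast = [w - lr_fast * err * f for w, f in zip(w_fast, phi)]
    new_slow = [w - lr_slow * err * f for w, f in zip(w_slow, phi)]
    return new_fast, new_slow
```

The fast weights react quickly to a new disturbance, while the slow weights change little per step and so retain what has been stable over a longer window.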

2605.24850 2026-05-26 cs.CL cs.IT math.IT stat.AP

Repeated Sequences Reveal Gaps between Large Language Models and Natural Language

Kumiko Tanaka-Ishii

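Affiliations Waseda University

AI summary By analyzing the distribution of repeated subsequences and its relation to higher-order Rényi entropies, the paper proposes a framework for evaluating the long-range statistical organization of LLM-generated text, and finds that GPT-generated text differs systematically from natural language in its entropy-growth patterns.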

Comments ACL 2026

Abstract

Evaluating whether large language models (LLMs) capture the structure of natural language beyond local fluency remains an open challenge. Existing evaluation methods, largely based on task performance or short-context behavior, provide limited insight into the long-range statistical organization of generated text. We propose a complementary evaluation framework based on repeated subsequences. By analyzing their distribution across scales and relating it to higher-order Rényi entropies, we probe how texts reuse previously established structure under finite-length conditions. Experiments on human-written texts and length-matched GPT-generated texts show that, while power-law models can describe restricted ranges of block length, the observed entropy growth is often equally or better characterized by logarithmic--power forms. Across datasets, natural language exhibits stable entropy-growth patterns over accessible ranges, with consistent average behavior despite variability across individual texts. In contrast, GPT-generated texts show systematic and statistically significant shifts in estimated exponents with model size. These results demonstrate that repeated-subsequence entropy provides a quantitative structural diagnostic that reveals systematic differences in long-range organization, distinguishing natural language from state-of-the-art LLM outputs beyond surface-level fluency.
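The link between repeated subsequences and Rényi entropy can be made concrete with a plug-in estimate. The sketch below uses the standard definition H_alpha = log(sum_i p_i^alpha) / (1 - alpha) over overlapping character blocks; for alpha = 2 the inner sum is the probability that two randomly drawn blocks coincide, so more repetition means lower entropy. The estimator and the use of character blocks are assumptions for illustration, not the paper's procedure.

```python
import math
from collections import Counter

def renyi_entropy(text, block_len, alpha=2.0):
    """Plug-in Rényi entropy (in nats) of overlapping blocks of length block_len."""
    if alpha == 1.0:
        raise ValueError("alpha = 1 is the Shannon limit; use alpha != 1 here")
    blocks = [text[i:i + block_len] for i in range(len(text) - block_len + 1)]
    counts = Counter(blocks)
    total = len(blocks)
    power_sum = sum((c / total) ** alpha for c in counts.values())
    return math.log(power_sum) / (1.0 - alpha)

# Entropy as a function of block length for a short, highly repetitive string.
growth = [renyi_entropy("abababababab", n) for n in range(1, 5)]
```

Fitting such a growth curve against block length with competing functional forms is the kind of comparison the abstract describes.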

2605.24845 2026-05-26 cs.AI math.CO

Solving Combinatorial Counting Problems with Weighted First-Order Model Counting

Yuanhong Wang, Juhua Pu, Yuxu Zhou, Yuyi Wang, Ondřej Kuželka

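Affiliations School of Artificial Intelligence, Jilin University, Changchun, China; State Key Laboratory of Complex & Critical Software Environment, Beihang University, China; National Research Center for Educational Materials, China; Tengen Intelligence Institute, China; Czech Technical University in Prague, Prague, Czech Republic

AI summary Proposes Cofola, a typed declarative language with a compilation pipeline to weighted first-order model counting (WFOMC), giving a unified way to solve combinatorial counting problems over sets, multisets, permutations, partitions, and related objects.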

Comments 47 pages, 9 figures

Abstract

Combinatorial counting problems pervade artificial intelligence, statistics, and discrete mathematics. Whether the task is enumerating subsets, multisets, permutations, partitions, or compositions under structural and arithmetic constraints, solving it remains a stubbornly manual exercise. Closed-form derivations are powerful but brittle, while naive encodings to propositional model counting or constraint satisfaction destroy the exchangeability that makes counting tractable in the first place. We present Cofola (COmbinatorial counting LAnguage with First-Order logic), a typed declarative language whose primitives are the combinatorial objects that recur in everyday counting questions, including sets, bags, tuples, sequences, circles, partitions, and compositions, together with natural relational and arithmetic constraints over them. A denotational semantics maps every Cofola program to a well-defined combinatorial counting problem, and a three-phase compilation pipeline (preprocessing, decomposition, and symmetry-preserving encoding) reduces this problem to a weighted first-order model counting (WFOMC) instance augmented with coefficient-extraction constraints. To stay inside known domain-liftable fragments whenever possible, the encoding groups indistinguishable entities, breaks the symmetry of unordered groupings lexicographically, and encodes sequences and circles via order axioms. On a suite of representative combinatorial counting problems, ranging from textbook math problems to multi-object scenarios that the closest prior framework cannot express, Cofola produces concise specifications and a uniform solving pipeline that is practical end-to-end.
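To make the WFOMC target concrete, the sketch below brute-forces a weighted model count for one fixed sentence: a binary relation E that is irreflexive and symmetric over a domain of n elements, which corresponds to counting undirected simple graphs. This is not Cofola syntax or its encoding; the sentence, the function name, and the weights are illustrative assumptions. The enumeration visits 2^(n^2) interpretations, which is the cost that lifted, symmetry-exploiting algorithms are designed to avoid.

```python
from itertools import product

def wfomc_bruteforce(n, w_true=1, w_false=1):
    """Sum of weights of all models of: forall x. not E(x,x), and E symmetric."""
    pairs = [(x, y) for x in range(n) for y in range(n)]
    total = 0
    for bits in product([False, True], repeat=len(pairs)):
        E = dict(zip(pairs, bits))
        irreflexive = all(not E[(x, x)] for x in range(n))
        symmetric = all(E[(x, y)] == E[(y, x)] for x, y in pairs)
        if irreflexive and symmetric:
            k = sum(bits)  # number of true ground atoms
            total += w_true ** k * w_false ** (len(pairs) - k)
    return total
```

With unit weights the result for n = 3 is the number of undirected graphs on three labeled vertices, 2^3 = 8; with a weight of 2 on true atoms each edge contributes a factor of 4, giving (1 + 4)^3 = 125.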

2605.24844 2026-05-26 cs.AI cs.CL

Geo-Expert: Towards Expert-Level Geological Reasoning via Parameter-Efficient Fine-Tuning

Chenyou Guo, Zongqi Liu, Yizhou Zhang, Zhaorui Jiang, Ze Liu

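Affiliations Ocean University of China; Peking University; Monash University

AI summary Proposes Geo-Expert, which fine-tunes small language models with parameter-efficient fine-tuning (LoRA) on a custom high-quality instruction dataset; on the dedicated geological reasoning benchmark Geo-Eval, the 8B model outperforms 70B general-purpose models and GPT-4o, and the 32B model approaches frontier reasoning models.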

Comments 11 pages, 1 figure, 3 tables. Accepted at ICML 2026 AI for Science Workshop

Abstract

While general-purpose Large Language Models (LLMs) applied to Geology often hallucinate when reasoning about subsurface structures and deep-time evolution, current AI in Earth sciences predominantly targets surface remote sensing and GIS. To bridge this gap, we introduce Geo-Expert, a family of parameter-efficient geological LLMs fine-tuned on a custom-curated, high-quality instruction dataset processed using our custom instruction synthesis pipeline. We investigate the impact of model scaling and architecture by fine-tuning three base models, Qwen3-8B, Qwen3-32B, and Gemma-3-27B, using Low-Rank Adaptation (LoRA). Our extensive evaluation on a novel domain-specific benchmark, Geo-Eval, reveals that a domain-aligned 8B model can outperform open-weight 70B generalists and proprietary GPT-4o on specialized geological reasoning, while a 32B variant approaches frontier reasoning models. The optimized 8B model further offers a competitive cost-performance ratio for deployment. This work provides a reproducible recipe for democratizing scientific LLMs and establishes a baseline for geological artificial intelligence.
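For readers unfamiliar with LoRA, the sketch below shows the general low-rank update in plain Python: a frozen weight W is combined with a trainable product B·A scaled by alpha / r. It illustrates the technique in general, not the Geo-Expert training setup; the matrix sizes, rank, and alpha are hypothetical.

```python
def matmul(X, Y):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def lora_weight(W, A, B, alpha):
    """Effective weight W + (alpha / r) * B @ A, with B of shape d x r and A of shape r x k."""
    r = len(A)
    scale = alpha / r
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

def lora_params(d, k, r):
    """Trainable parameters in the adapter, versus d * k for full fine-tuning."""
    return r * (d + k)
```

For a hypothetical 4096 x 4096 projection with rank 8, the adapter has 65,536 trainable parameters against 16,777,216 in the full matrix, which is what makes this kind of fine-tuning cheap.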

2605.24843 2026-05-26 cs.CV cs.AI

Adversarial Error Correction for Visual Autoregressive Generation

Ligong Bi, Tao Huang, Jianyuan Guo, Chang Xu

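Affiliations Shanghai Jiao Tong University; City University of Hong Kong; The University of Sydney

AI summary Proposes AID-VAR, a framework that corrects cascading errors in visual autoregressive models through an Adversarially Injected Diagnosis mechanism, improving generation quality.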

Abstract

Visual Autoregressive (VAR) models have emerged as a powerful paradigm for image synthesis by performing hierarchical next-scale prediction. However, VAR models are inherently prone to cascading error propagation, where subtle coarse-scale mispredictions are amplified across the hierarchy, ultimately distorting the final synthesis. To mitigate this, we propose AID-VAR, a plug-and-play framework that enhances pre-trained VARs through Adversarially Injected Diagnosis. Instead of standard passive generation, AID-VAR introduces a proactive error-correction mechanism inspired by the adversarial feedback in GANs. We deploy a discriminator to diagnose fidelity gaps at each scale transition, coupled with a lightweight guidance injector. This module operates as a non-invasive adapter that refines the feature manifold of a frozen VAR backbone, effectively steering the generation toward the distribution of real images without destabilizing the pre-trained latent space. Furthermore, to rigorously evaluate this cross-scale progression, we introduce the Inter-Scale Consistency Score (ISCS), a novel metric that quantifies the fidelity and structural alignment between consecutive resolution scales. Experimental results across various backbones demonstrate that AID-VAR delivers sharper textural details and fewer structural distortions with negligible overhead. For instance, AID-VAR-d20 achieves a 16% improvement in FID with only a 3% increase in parameters. These results establish AID-VAR as a highly efficient and scalable pathway for upgrading large-scale VAR generators, enhancing global coherence and local detail without altering training data, base architectures, or sampling schedules. Code is available at https://github.com/bijiw515/AID-VAR.

2605.24842 2026-05-26 cs.CL cs.CY

Translators as Invisible Teachers of AI: Copyright, Translation Memory, and the Political Economy of Linguistic Data

Masaru Yamada

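Affiliations College and Graduate School of Intercultural Communication, Rikkyo University

AI summary Examines how translators' labour is transformed into foundational data capital for AI, proposes two concepts, "appropriation without consumption" and "the invisible teacherisation of translators", and explores the data supply chain under copyright frameworks and directions for redistributive design.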

Comments 13 pages; comments welcome

Abstract

This paper examines how the labour of translators has been transformed into foundational data capital for the age of artificial intelligence (AI). Translation memories (TM) and parallel corpora preserve a one-to-one correspondence between source and target text and therefore constitute extraordinarily valuable supervised training data for machine translation. The development of statistical machine translation (SMT), neural machine translation (NMT), the Transformer architecture, and multilingual large language models (LLMs) cannot be disentangled from the accumulation of such translation data. And yet, translators' renditions have been bought as deliverables under contract, segmented as technical objects, and processed as "information analysis" data under copyright law -- losing their moral, creative, and economic attribution to the translators who produced them. The paper develops two concepts to capture this process. The first is appropriation without consumption: a mode of use in which works are not read, viewed, or listened to, but only mined for statistical features -- a use that is legitimated under Article 30-4 of the Japanese Copyright Act. The second is the invisible teacherisation of translators: the process by which translators, through the construction of translation memories, post-editing, and quality assessment, have functioned as teachers of AI without recognition as such. Drawing on the data supply chain that runs from translators through language service providers (LSPs) and platforms to model developers, on a comparative reading of Japanese, European, and United States legal frameworks, on the distinction between open and proprietary AI models, and on the premium status that human-generated data has acquired in the era of model collapse, the paper asks what translators are actually afraid of, and points toward concrete directions for redistributive design.

2605.24841 2026-05-26 cs.LG

DriftingMol: Decoder-Coupled Drift for One-Pass Property-Conditional Molecular Generation

Jiangjie Qiu, Yijun Li, Wentao Li, Xiaonan Wang

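Affiliations Beijing Key Laboratory of Artificial Intelligence for Advanced Chemical Engineering Materials

AI summary Proposes DriftingMol, a two-stage framework that adapts drifting models to a SELFIES latent molecular space via decoder-coupled drift, achieving property-conditional molecular generation with low sampling cost, high validity, and diversity.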

Comments 9 pages, 5 figures

Abstract

Property-conditional molecular generation should produce valid, diverse molecules while responding to continuous target values at low sampling cost. We introduce DriftingMol, a two-stage framework that adapts drifting models to a SELFIES latent molecular space. A frozen SELFIES beta-VAE provides the latent space, and the hidden representation of its decoder serves as the drift feature map. In decoder-coupled drift, decoder weights remain fixed, but drift gradients are backpropagated through the decoder feature map to a DiT generator, inducing a pullback metric aligned with molecular decoding. On ZINC250K, the default setting achieves QED Spearman correlation 0.493 with 94.7% uniqueness, while the strongest decoder-coupled condition reaches 0.510. Under protocol-matched four-property conditioning, decoder-coupled drift reaches mean Spearman correlation up to 0.598. Across 15 controlled variants, models that preserve the gradient path through decoder features achieve higher correlations than the tested latent-space, random-feature, and external-feature drift variants, while detached or stop-gradient decoder controls yield near-zero QED correlation and very low uniqueness. These results indicate that decoder-coupled drift is a useful low-cost mechanism for property-biased molecular generation, requiring one generator evaluation and one frozen decoder pass.
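The headline numbers above are Spearman correlations between target and achieved property values. As a reminder of what that statistic measures, here is a minimal pure-Python sketch: Pearson correlation computed on average ranks. It shows the standard statistic only; the paper's evaluation protocol is not given in the abstract, and the example values in the usage note are hypothetical.

```python
def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))
```

For example, `spearman([0.1, 0.5, 0.9], [0.2, 0.3, 0.8])` is 1 because the ordering is fully preserved, so a value near 0.5 means generated molecules follow the requested property ordering only partially.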

2605.24831 2026-05-26 cs.CV cs.AI

Multiscale Real-Time Object Detection in the NMS-Free Era: A Comparative Performance Evaluation of YOLOv8 and YOLO26

Chidera G. Oguine, Kanyifeechukwu J. Oguine, Obiozor M. Oguine, Ozioma C. Oguine

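Affiliations University of Abuja; Vanderbilt University; University of Notre Dame

AI summary On the Pascal VOC and VisDrone datasets, the paper systematically compares NMS-based YOLOv8 with NMS-free YOLO26 across multiple scales in terms of accuracy, localization, model size, computation, and latency. It finds that YOLO26 detects better with lower model complexity at most scales, but its advantage narrows in dense small-object scenes, and YOLOv8 remains competitive in GPU latency.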

Comments 11 pages, 6 tables, 9 figures

Abstract

Non-Maximum Suppression (NMS) remains a key post-processing step in many real-time object detection pipelines, but it can introduce latency variation and deployment complexity in resource-constrained settings. Recent NMS-free designs such as YOLO26 aim to reduce this dependence through end-to-end detection, yet their performance relative to established NMS-based models such as YOLOv8 remains underexplored beyond standard benchmarks. This paper compares YOLOv8 and YOLO26 on Pascal VOC and VisDrone, representing general object detection and dense aerial small-object detection, respectively. Both model families are evaluated across five scales using accuracy, localization, model size, GFLOPs, and CPU/GPU latency. Results show that YOLO26 achieves stronger detection performance and lower model complexity on Pascal VOC across most scales, while the performance gap narrows on VisDrone, where both models struggle with dense small targets. YOLOv8 remains competitive in GPU latency, showing that NMS-free design does not guarantee universal deployment superiority. Overall, the study shows that detector selection depends on dataset characteristics, object scale, model capacity, and hardware constraints.
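For context on the post-processing step being compared, the sketch below is a minimal greedy NMS over axis-aligned boxes in (x1, y1, x2, y2) form. It is the textbook algorithm, not the code used by either YOLO release, and the default threshold is an assumption.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring boxes, dropping any that overlap a kept box too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep
```

The number of IoU comparisons depends on how many candidate boxes the detector emits for a given image, which is one source of the latency variation the abstract mentions.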

2605.24823 2026-05-26 cs.AI

Agent Manufacturing: Foundation-Model Agents as First-Class Industrial Entities

Yilei Zhang

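Affiliations University of Canterbury

AI summary Proposes the Agent Manufacturing paradigm, in which foundation-model agents coordinate production by interpreting open-ended goals, planning over long horizons, invoking tools and machines, and negotiating with other agents and humans, thereby automating the coordinative cognitive work that humans perform in industry.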

Abstract

Manufacturing has passed through four widely recognized paradigms - mechanization, electrification, programmable automation, and Smart Manufacturing - each defined by the kind of work it shifted from humans to machines. In every case, one layer of industrial work remained fundamentally human: the coordinative cognition of production, comprising the interpretive, allocative, diagnostic, negotiative, and governance work exercised by engineers, planners, and operational managers. We argue that a fifth transition is now underway in which this layer, rather than the physical or routine-cognitive layers below it, is what foundation-model-based autonomous agents primarily redistribute. We name this paradigm Agent Manufacturing and define it operationally: a manufacturing system is an instance of Agent Manufacturing when its principal coordination mechanism is reasoning performed by foundation-model agents that can interpret open-ended goals, plan over long horizons, invoke tools and machines, and negotiate with other agents and humans. This is a narrower and more falsifiable definition than the existing literature on cognitive manufacturing or Industry 5.0 provides, and it distinguishes the paradigm sharply from classical multi-agent manufacturing systems, which were autonomous only within closed protocol spaces.

2605.24816 2026-05-26 cs.CV

AOEPT: Breaking the Implicit Modality-Reduction Bottleneck in Modality-Missing Prompt Tuning

Jian Lang, Rongpei Hong, Ting Zhong, Fan Zhou

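Affiliations University of Electronic Science and Technology of China; Intelligent Digital Media Technology Key Laboratory of Sichuan Province

AI summary Proposes AOEPT, which uses Modal-Contextualized Prompts (MCPs) to distill global modality priors that serve as latent information sources for missing modalities, restoring the reasoning scope of multimodal Transformers and addressing the implicit modality-reduction bottleneck in modality-missing scenarios.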

Comments 20 pages, Accepted by ICML 2026, Code is available from https://github.com/Jian-Lang/AOEPT

Abstract

Deploying multimodal systems in real-world environments often entails handling modality-missing scenarios, where one or more modalities are unavailable. While recent studies address this challenge for the general Multimodal Transformer (MT) architecture via prompt tuning, we identify a fundamental limitation in these methods: the Implicit Modality-Reduction bottleneck. By conditioning prompts solely on the observed modalities, they inadvertently restrict the reasoning scope of MTs to the modality-reduced subspace, cutting off access to the latent information sources of the missing modalities. To overcome this limitation, we propose AOEPT, which pioneers a novel modal-contextualized prompting fashion. Specifically, we introduce lightweight Modal-Contextualized Prompts (MCPs) that distill global modality-wise priors from training data, serving as latent repositories of the information sources for missing modalities. Conditioned on the remaining modalities, these MCPs are instantiated into instance-aware prompts that selectively augment missing-modality information for each sample, thereby restoring the reasoning scope of MTs beyond the observed-modality-only subspace. Experiments across various multimodal benchmarks and backbones confirm the strong performance of AOEPT, with minimal computational overhead.

2605.24813 2026-05-26 cs.RO cs.SY eess.SY

Manifold-Constrained MPPI: Real-Time Sampling-Based Control Under Hard Constraints

Seulchan Lee, Sanghyun Kim

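Affiliations School of Mechanical Engineering, Kyung Hee University; Advanced Institute of Convergence Technology

AI summary Proposes Manifold-Constrained MPPI (MC-MPPI), which uses a variational autoencoder to learn a low-dimensional representation of the constraint manifold and combines it with a quadratic programming controller to satisfy hard constraints in real time.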

Comments International Journal of Control, Automation, and Systems

Abstract

Sampling-based model predictive control methods, such as Model Predictive Path Integral (MPPI), offer derivative-free optimization and robustness in complex robotic systems. However, standard MPPI relies on cost-based soft penalties that cannot guarantee hard-constraint satisfaction, severely limiting its applicability to highly constrained tasks such as closed-chain manipulation. To address this, we propose Manifold-Constrained MPPI (MC-MPPI), a real-time sampling-based control framework that enforces manifold-based equality constraints while preserving the computational advantages of MPPI. The key idea is to decouple the constrained optimal control problem into latent-space planning and execution-level correction. At the planning stage, a Variational Autoencoder (VAE) learns a low-dimensional latent representation of the constraint manifold, enabling MPPI to efficiently generate near-feasible candidate trajectories without per-sample modification. Since this reference enables accurate linearization of the equality constraints, an execution-level Quadratic Programming (QP) controller resolves the residual manifold mismatch in a single solve rather than through iterative projection. Experiments on a 14-DoF closed-chain dual-arm system in both simulation and real-world settings demonstrate that MC-MPPI operates stably at 100 Hz, reliably navigates dynamic environments while effectively maintaining hard equality constraints, and significantly outperforms baseline methods in tracking accuracy. Supplementary videos and implementation details are available at https://rcilab.github.io/mcmppi.
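The baseline that MC-MPPI builds on can be sketched in a few lines. The code below is a minimal standard MPPI update, not the manifold-constrained variant: it samples Gaussian perturbations of a control sequence, scores each rollout, and averages the perturbations with weights exp(-cost / lambda). The 1-D single-integrator task, horizon, noise scale, and temperature are all hypothetical. In this form a constraint could only enter through the cost, which is the soft-penalty limitation the abstract points to.

```python
import math
import random

def mppi_step(u_nominal, rollout_cost, num_samples=64, sigma=0.5, lam=1.0, rng=None):
    """One MPPI update: exponentially weighted average of sampled perturbations."""
    rng = rng or random.Random(0)
    horizon = len(u_nominal)
    noises = [[rng.gauss(0.0, sigma) for _ in range(horizon)]
              for _ in range(num_samples)]
    costs = [rollout_cost([u + e for u, e in zip(u_nominal, eps)]) for eps in noises]
    best = min(costs)  # subtracting the minimum keeps the exponentials stable
    weights = [math.exp(-(c - best) / lam) for c in costs]
    total = sum(weights)
    return [u + sum(w * eps[t] for w, eps in zip(weights, noises)) / total
            for t, u in enumerate(u_nominal)]

def cost(u_seq, dt=0.1, target=1.0):
    """Hypothetical task: drive a 1-D single integrator from 0 to the target."""
    x, total = 0.0, 0.0
    for u in u_seq:
        x += dt * u
        total += (x - target) ** 2
    return total

rng = random.Random(0)
u = [0.0] * 10
initial_cost = cost(u)
for _ in range(30):
    u = mppi_step(u, cost, rng=rng)
final_cost = cost(u)
```

No gradient of the cost is ever computed, which is the derivative-free property the abstract refers to.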

2605.24812 2026-05-26 cs.AI

CoRe-Code: Collaborative Reinforcement Learning for Code Generation

Zhihao Dou, Qinjian Zhao, Zhongwei Wan, Xiaoyu Xia, Sumon Biswas

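Affiliations The Ohio State University; Royal Melbourne Institute of Technology

AI summary Proposes CoRe-Code, a framework that uses a Planner-Coder paradigm and GRPO-based collaboration-aware reinforcement learning to strengthen coordination and specialization among agents, improving the accuracy and efficiency of code generation.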

Abstract

Large language models (LLMs) have achieved strong performance in code generation, but most methods rely on autoregressive decoding without global planning, often leading to locally coherent yet globally suboptimal solutions (e.g., failing test cases or inefficient complexity). While recent approaches such as Chain-of-Thought (CoT) and multi-agent systems (MAS) introduce planning, their limited role specialization and coordination hinder performance on complex tasks. To address the challenges of coordination and specialization in multi-agent code generation, we propose Collaborative Reinforcement Code (CoRe-Code), a framework for role-specialized LLM agents that enhances inter-agent coordination to generate more accurate and efficient code. CoRe-Code adopts a simple Planner-Coder paradigm, where the Planner produces high-level plans and the Coder executes them to generate code. We further introduce a collaboration-aware reinforcement learning stage based on Group Relative Policy Optimization (GRPO) to enhance role specialization and alignment. Experiments show that CoRe-Code outperforms a wide range of existing RL-based and multi-agent methods. In addition, we demonstrate that CoRe-Code can generalize to other multi-agent frameworks (e.g., Retrieval and Debugging agents), highlighting its flexibility and scalability. We evaluate CoRe-Code on multiple benchmarks of varying difficulty using three base models. Compared to existing baselines, the results show consistent improvements in accuracy and higher efficiency in execution time and memory usage, demonstrating the effectiveness and practicality of CoRe-Code.
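The group-relative signal at the core of GRPO can be sketched in a few lines. This shows only the generic within-group reward normalization; how CoRe-Code defines its collaboration-aware reward is not stated in the abstract, so the reward values in the usage note are hypothetical, and the use of the population standard deviation is an assumption (implementations differ).

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples drawn for the same prompt.

    Each sample's advantage is its reward minus the group mean, divided by the
    group standard deviation, so no learned value function is needed.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population standard deviation of the group
    return [(r - mean) / (std + eps) for r in rewards]
```

For example, with hypothetical rewards `[1.0, 0.0, 0.0, 1.0]` (say, whether each of four generated programs passed its tests), the two passing samples get a positive advantage and the two failing ones a negative advantage of equal magnitude.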

2605.24810 2026-05-26 cs.LG cs.AI cs.RO stat.AP

Cross-Domain Energy-Guided Diffusion Generation for Off-Dynamics Reinforcement Learning

跨域能量引导扩散生成用于动态偏移强化学习

Yu Yang, Yihong Guo, Anqi Liu, Pan Xu

发表机构 * Duke University(杜克大学) Johns Hopkins University(约翰霍普金斯大学)

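AI summary: Proposes CEDGE, a framework that uses an energy-guided diffusion model to generate target-domain trajectories, addressing domain adaptation for offline reinforcement learning under dynamics shift.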

Comments 29 pages, 3 figures, and 14 tables

详情
AI中文摘要

离动态离线强化学习旨在从大规模源数据集和有限目标数据集中学习目标域策略,但面临转移动态不匹配的问题。现有方法如奖励增强和数据过滤受限于源数据集,无法合成新的目标行为以改善超出收集源轨迹的覆盖范围。虽然近期基于模型的方法尝试通过学习目标感知动态来解决此问题,但生成的体验仅在转移层面构建,导致长时域上的累积误差。这些限制促使离动态离线RL转向轨迹级生成。我们提出CEDGE,一种跨域能量引导扩散生成框架。CEDGE在源域轨迹上训练轨迹扩散模型,并通过能量引导将生成样本适应到目标域。该引导通过最小化源域与期望目标域轨迹之间的分布不匹配得到,并分解为回报、域和行为能量成分。得到的能量引导轨迹既可用于直接规划,也可作为策略学习的合成数据。由于目标适应通过能量引导而非重新训练扩散模型实现,与先前方法相比,CEDGE能高效适应新的目标动态。在ODRL基准上的实验表明,轨迹级能量引导生成改善了动态偏移下的扩散规划,并产生提升下游目标策略学习的合成数据。

英文摘要

Off-dynamics offline reinforcement learning seeks to learn a target-domain policy from a large source dataset and a limited target dataset under mismatched transition dynamics. Existing approaches such as reward augmentation and data filtering are constrained to the source dataset and cannot synthesize new target behavior to improve coverage beyond the collected source trajectories. While recent model-based methods attempt to address this by learning target-aware dynamics, the generated experience is constructed only at the transition level, which leads to accumulated errors over long horizons. These limitations necessitate a shift toward trajectory-level generation for off-dynamics offline RL. We propose CEDGE, a Cross-domain Energy-guided Diffusion GEneration framework. CEDGE trains a trajectory diffusion model on source-domain trajectories and adapts the generated samples to the target domain through energy guidance. This guidance is derived by minimizing the distribution mismatch between the source and desired target-domain trajectories and is decomposed into return, domain, and behavior energy components. The resulting energy-guided trajectories are useful both for direct planning and as synthetic data for policy learning. Since target adaptation is achieved via energy guidance rather than retraining the diffusion model, CEDGE can be efficiently adapted to new target dynamics compared to previous methods. Experiments on the ODRL benchmark demonstrate that trajectory-level energy-guided generation improves diffusion planning under dynamics shifts and produces synthetic data that improves downstream target policy learning.

2605.24808 2026-05-26 cs.LG cs.AI

Disentangled Double Machine Learning for Accurate Causal Effect Estimation

解缠双机器学习用于精确因果效应估计

Guodu Xiang, Kui Yu, Yujie Wang, Richang Hong, Fuyuan Cao, Jiye Liang

发表机构 * School of Computer Science and Information Engineering, Hefei University of Technology(合肥工业大学计算机科学与信息工程学院) School of Computer and Information Technology, Shanxi University(山西大学计算机与信息学院)

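AI summary: Proposes Disentangled Double Machine Learning (DDML), which combines causal role disentanglement with residual dependence orthogonalization to address the bias and instability that arise in double machine learning when confounders are not disentangled under high-dimensional or finite-sample conditions; it outperforms 13 baseline methods on synthetic, semi-synthetic, and real-world datasets.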

Comments 15 pages, 9 figures

详情
AI中文摘要

混淆偏差是从观测数据中估计因果效应的一个关键挑战。双机器学习(DML)通过估计治疗和结果 nuisance 函数、构建治疗和结果残差,并从残差中估计因果效应来解决这一问题。然而,DML 在高维或有限样本场景中常常产生有偏和不稳定的估计。一个原因是 DML 使用所有协变量估计 nuisance 函数,而没有解缠不同的潜在因子,导致不可靠的 nuisance 函数估计。另一个原因是不精确的 nuisance 估计进一步引入了治疗残差与剩余结果误差之间的残差依赖,破坏了因果效应估计的准确性。为了解决这些问题,本文提出解缠双机器学习(DDML),一种整合两种关键策略的新算法。首先,因果角色解缠策略将协变量分解为混淆因子、治疗特有因子和结果特有因子,以实现可靠的 nuisance 函数估计。其次,残差依赖正交化策略减轻由 nuisance 估计误差引起的残差依赖,以增强因果效应估计的精度。在合成、半合成和真实数据集上的实验结果表明,DDML 在 MAE 和 RMSE 上均显著优于 13 种最先进的基线算法。

英文摘要

Confounding bias is a key challenge in causal effect estimation from observational data. Double Machine Learning (DML) addresses this issue by estimating treatment and outcome nuisance functions, constructing treatment and outcome residuals, and estimating causal effects from the residuals. However, DML often produces biased and unstable estimates in high-dimensional or finite-sample scenarios. One reason is that DML estimates nuisance functions using all covariates without disentangling distinct latent factors, resulting in unreliable nuisance function estimation. Another is that imprecise nuisance estimation introduces residual dependence between the treatment residual and the remaining outcome error, undermining the accuracy of causal effect estimates. To address these issues, we propose Disentangled Double Machine Learning (DDML), a novel algorithm that integrates two key strategies. First, a causal role disentanglement strategy decomposes covariates into confounders, treatment-specific factors, and outcome-specific factors to enable reliable nuisance function estimation. Second, a residual dependence orthogonalization strategy mitigates the residual dependence caused by nuisance estimation errors to enhance the precision of causal effect estimates. Experimental results on synthetic, semi-synthetic, and real-world datasets demonstrate that DDML significantly outperforms 13 state-of-the-art baseline algorithms in both MAE and RMSE.
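The baseline DML procedure described at the start of this abstract (fit nuisance functions, form residuals, regress residual on residual) can be sketched as below. This is plain cross-fitted DML with linear nuisance models and a single covariate, not DDML; the data-generating process, coefficients, and function names are made up for illustration.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def dml_partialling_out(x, t, y, n_folds=2):
    """Cross-fitted residual-on-residual estimate of the effect of t on y."""
    n = len(x)
    t_res, y_res = [0.0] * n, [0.0] * n
    for fold in range(n_folds):
        test = [i for i in range(n) if i % n_folds == fold]
        train = [i for i in range(n) if i % n_folds != fold]
        # Nuisance models E[T|X] and E[Y|X], fitted on the other folds only.
        a_t, b_t = fit_line([x[i] for i in train], [t[i] for i in train])
        a_y, b_y = fit_line([x[i] for i in train], [y[i] for i in train])
        for i in test:
            t_res[i] = t[i] - (a_t + b_t * x[i])
            y_res[i] = y[i] - (a_y + b_y * x[i])
    # Final stage: regress outcome residuals on treatment residuals (no intercept).
    return sum(tr * yr for tr, yr in zip(t_res, y_res)) / sum(tr * tr for tr in t_res)

def simulate(n=2000, theta=2.0, seed=0):
    """Synthetic data with one confounder x and true effect theta (made up)."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]
    t = [0.8 * xi + rng.gauss(0, 1) for xi in x]
    y = [theta * ti + 1.5 * xi + rng.gauss(0, 1) for ti, xi in zip(t, x)]
    return x, t, y
```

Calling `dml_partialling_out(*simulate())` should recover a value near the true effect of 2.0, whereas regressing `y` on `t` directly is biased upward by the confounder. The abstract's critique targets the nuisance-fitting step, where every covariate enters both nuisance models regardless of its causal role.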

2605.24807 2026-05-26 cs.CV

CLIP-Guided SAM: Parameter-Efficient Semantic Conditioning for Promptable Segmentation

CLIP引导的SAM:用于可提示分割的参数高效语义条件

Shayan Jalilian, Abdul Bais

发表机构 * University of Regina, Regina, SK, Canada(里贾纳大学)

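AI summary: Proposes the CLIP-Guided SAM framework, which injects CLIP features into SAM's image encoder through lightweight multi-modal semantic adapters to achieve internal semantic conditioning, improving segmentation with little labeled data and supporting both a manual and a semi-automatic mode.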

详情
AI中文摘要

可提示基础模型如分割一切模型(SAM)能生成高质量掩码,但语义上仍存在盲区,依赖外部提示来指定类别。现有的视觉-语言方法通过外部提示耦合来解决这一限制,即视觉-语言模型作为独立阶段为SAM生成空间提示。我们提出CLIP引导的SAM,一种基于内部语义条件的参数高效分割框架。我们不是仅使用语义信号来生成提示,而是通过轻量级多模态语义适配器将CLIP派生的文本、视觉和相似性特征直接注入SAM的图像编码器。这些适配器调节SAM的内部特征表示,使得语义信息能够影响掩码预测,同时保留SAM原有的可提示接口。我们的框架专为低标注数据场景设计,适用于通用领域基准和专门的下游任务。它支持两种操作模式:手动模式(用于同时使用文本和空间提示的交互式分割)和半自动纯文本模式(用于仅需文本输入的概念特定分割应用)。我们表明,鲁棒性取决于训练与推理时使用的提示类型是否一致,使得训练-测试提示一致性成为重要的设计原则。通过大量实验和消融研究,我们评估了我们的方法,与无语义条件的SAM+PEFT基线、视觉-语言+SAM流水线、SAM 3以及依赖大量无标注数据的强半监督分割方法进行比较。在这些设置中,CLIP引导的SAM在训练和部署中均保持参数高效的同时,始终取得优越或具有竞争力的性能。

英文摘要

Promptable foundation models such as the Segment Anything Model (SAM) produce high-quality masks but remain semantically blind, relying on external prompts to specify categories. Existing vision-language approaches address this limitation by using external prompt coupling, where a vision-language model generates spatial prompts for SAM as a separate stage. We propose CLIP-Guided SAM, a parameter-efficient segmentation framework built on internal semantic conditioning. Instead of using semantic signals only to generate prompts, we inject CLIP-derived text, vision, and similarity features directly into SAM's image encoder through lightweight multi-modal semantic adapters. These adapters condition SAM's internal feature representations, allowing semantic information to influence mask prediction while preserving SAM's original promptable interface. Our framework is designed for low labeled-data settings and applies to both general-domain benchmarks and specialized downstream tasks. It supports two operating modes: Manual mode, for interactive segmentation with both text and spatial prompts, and Semi-Automatic text-only mode, for applications that require concept-specific segmentation using only textual input. We show that robustness depends on aligning training with the type of prompts used at inference, making train-test prompt consistency an important design principle. Through extensive experiments and ablations, we evaluate our method against SAM+PEFT baselines without semantic conditioning, vision-language + SAM pipelines, SAM 3, and strong semi-supervised segmentation methods that rely on large amounts of unlabeled data. Across these settings, CLIP-Guided SAM consistently achieves superior or competitive performance while remaining parameter-efficient in both training and deployment.

2605.24806 2026-05-26 cs.SD cs.AI eess.AS

Zero-Shot Parkinson's Disease Detection from Speech: Comparing Large Audio and Language Models

零样本帕金森病语音检测:比较大型音频和语言模型

Muhammad Ashad Kabir, Sirajam Munira

发表机构 * School of Computing, Mathematics and Engineering, Charles Sturt University(查尔斯·斯图尔特大学计算、数学与工程学院) Department of Computer Science, Rensselaer Polytechnic Institute(伦斯勒理工学院计算机科学系)

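AI summary: Compares two input modalities, handcrafted acoustic features and raw audio waveforms, to study how zero-shot Parkinson's disease detection performance varies across languages; handcrafted features are more stable in a low-resource language, while audio input yields dataset-dependent gains.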

Comments 6 pages

详情
AI中文摘要

大型音频和语言模型最近在各个领域展示了零样本推理能力。然而,尚不清楚音频输入的形式——无论是从语音中提取的手工声学特征还是原始音频波形——如何影响不同语言中帕金森病(PD)检测的性能。在本研究中,我们系统地比较了两种零样本PD检测的输入模态:(i)由通用LLM分析的从语音记录中提取的手工声学特征,以及(ii)由音频能力模型分析的直接波形输入。在四种语言的PD语音数据集上的实验表明,性能因输入模态、语音任务和语言而异。手工声学特征在低资源语言(例如孟加拉语)中提供更稳定的性能,而音频输入带来数据集依赖的增益。这些发现突显了输入模态对零样本语音PD检测的影响。

英文摘要

Large audio and language models have recently demonstrated zero-shot reasoning capabilities across various domains. However, it remains unclear how the form of audio input, whether handcrafted acoustic features extracted from speech or the raw audio waveform itself, affects performance for Parkinson's disease (PD) detection across different languages. In this study, we systematically compare two input modalities for zero-shot PD detection: (i) handcrafted acoustic features extracted from speech recordings analyzed by a general-purpose LLM, and (ii) direct waveform input analyzed by audio-capable models. Experiments on PD speech datasets in four languages show that performance varies across input modalities, speech tasks, and languages. Handcrafted acoustic features provide more stable performance in a low-resource language (e.g., Bengali), whereas audio input yields dataset-dependent gains. These findings highlight the impact of input modality on zero-shot PD detection from speech.

2605.24805 2026-05-26 cs.CV

Fishbone: From One 3D Asset to a Million Controllable Edits

Fishbone: 从一个3D资产到百万可控编辑

Yumeng He, Xiaoying Wang, Peihao Li, Yanjia Huang, Joe Masterjohn, Jiajun Wu, Leonidas Guibas, Yin Yang, Ying Jiang, Chenfanfu Jiang

发表机构 * UCLA(加州大学洛杉矶分校) USC(南加州大学) UC Berkeley(加州大学伯克利分校) TRI Stanford(斯坦福大学) Utah(犹他大学)

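AI summary: Proposes Fishbone, a unified rib-spine representation that supports controllable parametric deformation, reduced-space dynamics, and animation for general meshes, and builds the Fishbone-136K dataset, with applications including controllable 3D generation and data augmentation for robot learning.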

Comments 20 pages, 19 figures

详情
AI中文摘要

大规模可控3D资产对于计算机图形学、具身AI、机器人和交互式内容创作至关重要,但由于手动建模和绑定的高成本,创建多样化的3D资产仍然具有挑战性。形状变形提供了一种从现有网格生成变体的自然方式,但现有的数据驱动方法通常依赖稀疏的用户输入,而参数化编辑框架需要手动设计的控制结构和特定类别的配置。受自然生物启发,其中中央脊柱控制全局形状,横截面肋骨控制局部变化,我们引入了Fishbone,一种统一的脊-肋表示,适用于通用形状,支持可控参数化网格变形、降阶动力学和动画。给定输入网格,Fishbone使用自适应热方法计算测地标量场,提取等值线作为横截面肋骨,通过肋骨中心构建光滑的几何感知脊柱,并使用高斯加权蒙皮将表面顶点与附近的肋骨和脊柱结构关联。由此产生的表示支持实时和可预测的变形:肋骨控制局部轮廓,如厚度、方向和横截面变化,而脊柱控制全局弯曲、扭转和拉伸。相同的结构还支持降阶模拟和关键帧动画。我们进一步通过用脊-肋结构增强Hunyuan3D构建了Fishbone-136K,并展示了在可控3D生成、基于变形的机器人学习数据增强、交互式网格编辑和智能体生成中的应用。实验证明了所提出框架的有效性、效率和通用性。

英文摘要

Large-scale controllable 3D assets are critical for computer graphics, embodied AI, robotics, and interactive content creation, yet creating diverse 3D assets remains challenging due to the high cost of manual modeling and rigging. Shape deformation offers a natural way to generate variations from existing meshes, but existing data-driven methods often rely on sparse user inputs, while parametric editing frameworks require manually designed control structures and category-specific configurations. Inspired by natural creatures, where a central spine governs global shape and cross-sectional ribs control local variation, we introduce Fishbone, a unified rib-spine representation for general shapes that supports controllable parametric mesh deformation, reduced-space dynamics, and animation. Given an input mesh, Fishbone computes a geodesic scalar field with an adaptive heat method, extracts iso-contours as cross-sectional ribs, constructs a smooth geometry-aware spine through rib centers, and associates surface vertices with nearby rib and spine structures using Gaussian-weighted skinning. The resulting representation enables real-time and predictable deformation: ribs control local profiles such as thickness, orientation, and cross-sectional variation, while the spine controls global bending, twisting, and stretching. The same structure also supports reduced-space simulation and keyframe animation. We further construct Fishbone-136K by augmenting Hunyuan3D with rib-spine structures, and demonstrate applications in controllable 3D generation, deformation-based data augmentation for robot learning, interactive mesh editing, and agentic generation. Experiments demonstrate the effectiveness, efficiency, and versatility of the proposed framework.
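The Gaussian-weighted skinning step mentioned in this abstract can be illustrated with a small sketch. The abstract does not give the actual formulation, so everything here is an assumption: Euclidean distance to rib centers, a single bandwidth `sigma`, and translation-only rib offsets.

```python
import math

def gaussian_skinning_weights(vertex, rib_centers, sigma=0.5):
    """Normalized Gaussian weights tying one vertex to each rib center."""
    sq_dists = [
        sum((v - c) ** 2 for v, c in zip(vertex, center)) for center in rib_centers
    ]
    d_min = min(sq_dists)  # shift so the nearest rib has exponent 0 (stability)
    unnorm = [math.exp(-(d - d_min) / (2.0 * sigma ** 2)) for d in sq_dists]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def deform_vertex(vertex, rib_centers, rib_offsets, sigma=0.5):
    """Move a vertex by the weighted blend of per-rib translation offsets."""
    w = gaussian_skinning_weights(vertex, rib_centers, sigma)
    return tuple(
        v + sum(w_k * off[axis] for w_k, off in zip(w, rib_offsets))
        for axis, v in enumerate(vertex)
    )
```

Because the weights fall off smoothly with distance, moving one rib mostly affects nearby vertices, which matches the abstract's description of ribs controlling local profiles.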

2605.24803 2026-05-26 cs.LG

Active Learning for Stochastic Contextual Linear Bandits

随机上下文线性老虎机的主动学习

Emma Brunskill, Ishani Karmarkar, Zhaoqi Li

发表机构 * Stanford University(斯坦福大学)

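AI summary: Proposes an algorithm that learns a near-optimal policy by actively sampling rewards of context-action pairs, proves that active context sampling can improve over the minimax rate by up to a factor of √d, and verifies the gain in sample efficiency on warfarin dose prediction and joke recommendation tasks.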

详情
AI中文摘要

随机上下文线性老虎机的一个关键目标是高效学习近最优策略。现有算法通过策略性地采样动作来学习策略,但被动地从底层上下文分布中采样上下文。然而,在许多实际场景中——包括在线内容推荐、调查研究、临床试验——从业者可以根据上下文分布的先前知识主动采样或招募上下文。尽管有这种主动学习的潜力,但策略性上下文采样在随机上下文线性老虎机中的作用尚未被充分探索。我们提出一种算法,通过策略性地采样上下文-动作对的奖励来学习近最优策略。我们证明了实例相关的理论保证,表明我们的主动上下文采样策略可以将最小最大率改进最多√d倍,其中d是线性维度。我们通过实验证明,我们的算法在学习近最优策略所需的样本数量上有所减少,例如在华法林剂量预测和笑话推荐任务中。

英文摘要

A key goal in stochastic contextual linear bandits is to efficiently learn a near-optimal policy. Prior algorithms for this problem learn a policy by strategically sampling actions but passively sampling contexts from the underlying context distribution. However, in many practical scenarios, including online content recommendation, survey research, and clinical trials, practitioners can actively sample or recruit contexts based on prior knowledge of the context distribution. Despite this potential for active learning, the role of strategic context sampling in stochastic contextual linear bandits is underexplored. We propose an algorithm that learns a near-optimal policy by strategically sampling rewards of context-action pairs. We prove instance-dependent theoretical guarantees demonstrating that our active context sampling strategy can improve over the minimax rate by up to a factor of $\sqrt{d}$, where $d$ is the linear dimension. We show empirically that our algorithm reduces the number of samples needed to learn a near-optimal policy, in tasks such as warfarin dose prediction and joke recommendation.

2605.24799 2026-05-26 cs.CV cs.AI

Divide-and-Conquer Inference for Large-Scale Visual Recognition with Multimodal Large Language Models

面向大规模视觉识别的多模态大语言模型分治推理

Zhipeng Ye, Jiaqi Huang, Feng Jiang, Qiufeng Wang, Yikang Duan, Dawei Wang, Xihang Zhou, Qian Qiao

发表机构 * Taizhou Institute of Science and Technology, Nanjing University of Science and Technology(南京理工大学泰州科技学院) Department of Intelligence Science, Xi’an Jiaotong-Liverpool University(西交利物浦大学智能科学系) School of Computer Science and Technology, Soochow University(苏州大学计算机科学与技术学院) Department of Statistical Sciences, University of Toronto(多伦多大学统计科学系)

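AI summary: To address the performance collapse of multimodal large language models in long-sequence recognition, proposes the Divide-and-Conquer Inference (DCI) strategy, which raises the signal-to-noise ratio and classification accuracy through recursive task decomposition and dynamic pruning.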

详情
AI中文摘要

多模态大语言模型(MLLMs)在广泛的视觉语言任务中展现了强大的能力。然而,当应用于大规模图像分类时,随着标签空间的扩大,其性能显著下降——我们将这一现象定义为长序列识别中的性能崩溃。通过信息论分析,我们揭示了这种崩溃源于不断增长的信息熵与注意力机制中显著的注意力稀释和衰减之间的根本冲突,这损害了模型在处理极长提示时维持足够信噪比的能力。为缓解这一问题,我们提出了分治推理(DCI),一种用于MLLMs视觉识别的新型测试时扩展策略。DCI递归地将复杂的全局分类任务分解为多个更简单的局部子问题,并采用动态剪枝机制压缩搜索空间。该方法通过缓解长序列推理中固有的权重稀释问题,有效提高了局部信噪比和模型精度。此外,传统自注意力具有难以承受的二次计算复杂度,而DCI在大规模分类场景中实现了更有利的扩展行为并显著加速推理。在ImageNet-1K和ImageNet-21K等基准上的大量实验表明,DCI持续提高了分类精度。这使得轻量级开源模型无需任何额外训练或微调即可与甚至超越前沿闭源巨头。作为一种模型无关、即插即用的范式,DCI为在大规模场景中扩展MLLMs的推理精度提供了一种高效方法。

英文摘要

Multimodal Large Language Models (MLLMs) have demonstrated strong capabilities across a wide range of vision-language tasks. However, when applied to large-scale image classification, their performance degrades significantly as the label space expands, a phenomenon we define as Performance Collapse in Long Sequence Recognition. Through an information-theoretic analysis, we reveal that this collapse stems from a fundamental conflict between the escalating information entropy and the prominent attention dilution and decay within attention mechanisms, which impairs the model's ability to maintain a sufficient signal-to-noise ratio when processing extremely long prompts. To mitigate this, we propose Divide-and-Conquer Inference (DCI), a novel test-time scaling strategy for visual recognition with MLLMs. DCI recursively decomposes complex global classification tasks into multiple simpler, localized subproblems and employs a dynamic pruning mechanism to compress the search space. This method effectively improves the local signal-to-noise ratio and model accuracy by mitigating the inherent weight dilution issues in long-sequence inference. Moreover, while traditional self-attention incurs a prohibitive quadratic computational complexity, DCI achieves more favorable scaling behavior and substantially accelerates inference in large-scale classification scenarios. Extensive experiments on benchmarks such as ImageNet-1K and ImageNet-21K demonstrate that DCI consistently improves classification accuracy. This enables lightweight open-source models to rival or even surpass frontier closed-source giants without any additional training or fine-tuning. As a model-agnostic, plug-and-play paradigm, DCI offers an efficient approach for scaling the inferential precision of MLLMs in large-scale scenarios.
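The recursive decomposition with pruning described above can be pictured as a tournament over small label groups. The sketch below is only one plausible reading of the abstract, not the paper's algorithm: `pick_top` is a hypothetical stand-in for a single MLLM call, and `group_size` and `keep` are assumed parameters.

```python
def divide_and_conquer_classify(image, labels, pick_top, group_size=4, keep=1):
    """Recursively narrow a large label set by querying small groups.

    pick_top(image, candidates, k) is a hypothetical stand-in for one MLLM call
    that returns the k most plausible labels among a short candidate list.
    """
    if not 1 <= keep < group_size:
        raise ValueError("need 1 <= keep < group_size to guarantee progress")
    candidates = list(labels)
    if not candidates:
        raise ValueError("labels must not be empty")
    while len(candidates) > 1:
        # Once everything fits in one prompt, ask for a single final answer.
        final_round = len(candidates) <= group_size
        survivors = []
        for start in range(0, len(candidates), group_size):
            group = candidates[start:start + group_size]
            k = 1 if final_round else min(keep, len(group))
            # Prune: only the local winners move on to the next round.
            survivors.extend(pick_top(image, group, k))
        candidates = survivors
    return candidates[0]
```

Each call sees at most `group_size` labels, so no prompt grows with the size of the label space; the trade-off is that a correct label eliminated in an early group cannot be recovered, which is why keeping more than one survivor per group may help.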

2605.24797 2026-05-26 cs.CV

HCL-FF: Hierarchical and Contrastive Learning for Forward-Forward Algorithm

HCL-FF:用于前向-前向算法的分层对比学习

Jie-En Yao, Hong-En Chen, C. -C. Jay Kuo

发表机构 * University of Southern California(南加州大学)

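AI summary: To address the Forward-Forward algorithm's lack of hierarchical coordination and its semantically ambiguous features, proposes the HCL-FF framework, which combines a coarse-to-fine hierarchical learning strategy with a supervised contrastive objective and achieves the best performance among FF methods on CIFAR-10 and other datasets.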

Comments Accepted by CVPR 2026. Code: https://github.com/JNNNNYao/HCL-FF

详情
AI中文摘要

使用反向传播训练的深度神经网络在视觉任务中取得了显著性能,但仍存在生物不可解释、计算要求高和难以解释的问题。前向-前向(FF)算法通过局部目标函数独立训练每一层,提供了一种有前景的替代方案。然而,其纯局部优化缺乏跨层的分层协调,且将 goodness 与特征解耦导致表示无约束且语义模糊。我们提出分层对比学习FF框架(HCL-FF)来解决这些限制。HCL-FF引入了(1)一种从粗到细的分层学习策略,引导表示从低级线索到高级语义,以及(2)一种监督对比目标,在 goodness 解耦后强制类别判别性对齐。在CIFAR-10、CIFAR-100和Tiny-ImageNet上的实验表明,HCL-FF在基于FF的方法中取得了新的最佳性能,准确率分别提升了+5.46%、+17.00%和+12.51%。

英文摘要

Deep neural networks trained with backpropagation have achieved outstanding performance in vision tasks but remain biologically implausible, computationally demanding, and difficult to interpret. The Forward-Forward (FF) algorithm offers a promising alternative by training each layer independently through local goodness objectives. However, its purely local optimization lacks hierarchical coordination across layers, and the decoupling of goodness from features leaves the representations unconstrained and semantically ambiguous. We propose a Hierarchical and Contrastive Learning FF framework (HCL-FF) to address these limitations. HCL-FF introduces (1) a coarse-to-fine hierarchical learning strategy that guides representations from low-level cues to high-level semantics, and (2) a supervised contrastive objective that enforces class-discriminative alignment after goodness decoupling. Experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet demonstrate that HCL-FF achieves new state-of-the-art performance among FF-based methods, with notable accuracy gains of +5.46%, +17.00%, and +12.51%, respectively.
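For context, the local goodness objective of the original Forward-Forward algorithm (the baseline that HCL-FF modifies) can be sketched as below. This follows the commonly described formulation with goodness as the sum of squared activations and a logistic loss around a threshold; it is not HCL-FF's decoupled or contrastive objective, and the threshold value is an arbitrary assumption.

```python
import math

def goodness(activations):
    """Layer goodness as the sum of squared activations."""
    return sum(a * a for a in activations)

def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def ff_layer_loss(pos_acts, neg_acts, threshold=2.0):
    """Local loss: push positive goodness above the threshold, negative below it."""
    return softplus(threshold - goodness(pos_acts)) + softplus(
        goodness(neg_acts) - threshold
    )
```

Each layer minimizes this loss using only its own activations, which is the purely local optimization the abstract describes as lacking coordination across layers.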

2605.24794 2026-05-26 cs.CV cs.CL

DUEL: Adversarial Self-Play for Multimodal Reasoning

DUEL: 用于多模态推理的对抗性自我对弈

Lin Qiu, Hanqing Zeng, Yao Liu, Bingjun Sun, Guangdeng Liao, Ji Liu

发表机构 * Meta AI

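AI summary: Proposes the DUEL framework, which generates supervision signals from a pretrained VLM through adversarial self-play and adds a length-normalized log-likelihood reward, improving visual reasoning and discrimination without human annotation.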

详情
AI中文摘要

强化学习已成为提升视觉语言模型推理能力的有效范式。然而,基于RL的优化通常依赖于昂贵且难以扩展的高质量标注。现有的无监督替代方案可能因弱视觉基础和缺乏可靠验证信号而偏向有偏解。我们提出一个自我进化的训练后框架DUEL,其中监督信号源于从同一预训练VLM初始化的两个策略之间的对抗性交互。挑战者生成一个基于图像的真实声明及其最小扰动的难负样本,而求解者验证两个声明与图像的一致性,从而在近邻语义下鼓励细粒度视觉判别。为了稳定优化,我们引入长度归一化的对数似然奖励,在二元结果监督之外保留信息性优化信号,并在稀疏反馈下提高学习稳定性。实验表明,DUEL在无需额外人工标注、外部奖励模型或图像编辑工具的情况下,持续提升视觉推理和鲁棒判别能力。

英文摘要

Reinforcement learning (RL) has emerged as an effective paradigm for improving the reasoning capability of vision-language models (VLMs). However, RL-based optimization typically depends on costly high-quality annotations that are difficult to scale. Existing unsupervised alternatives may drift toward biased solutions due to weak visual grounding and the lack of reliable verification signals. We propose a self-evolving post-training framework, DUEL, where supervision emerges from adversarial interactions between two policies initialized from the same pretrained VLM. A Challenger generates an image-grounded true claim together with a minimally perturbed hard-negative counterpart, while a Solver verifies both claims against the image, encouraging fine-grained visual discrimination under near-neighbor semantics. To stabilize optimization, we introduce a length-normalized log-likelihood reward that preserves informative optimization signals beyond binary outcome supervision and improves learning stability under sparse feedback. Experiments show that DUEL consistently improves visual reasoning and robust discrimination without additional human annotations, external reward models, or image editing tools.
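The length normalization behind the reward mentioned above amounts to averaging per-token log-probabilities. The sketch below shows only that quantity; how DUEL maps it into the final reward is not specified in the abstract, and the function name is an assumption.

```python
def length_normalized_loglik(token_logprobs):
    """Average per-token log-probability of a sequence."""
    if not token_logprobs:
        raise ValueError("need at least one token")
    return sum(token_logprobs) / len(token_logprobs)
```

Unlike the raw sum of log-probabilities, this average does not systematically penalize longer sequences: two answers with the same per-token confidence score the same regardless of length, and the value varies continuously rather than being a binary correct/incorrect signal.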

2605.24793 2026-05-26 cs.CL

Beyond the Target: From Imitation to Collaboration in Speculative Decoding

超越目标:从模仿到协作的推测解码

Jinze Li, Yixing Xu, Guanchen Li, Jinfeng Xu, Shuo Yang, Yang Zhang, Xuanwu Yin, Dong Li, Edith C. H. Ngai, Emad Barsoum

发表机构 * Advanced Micro Devices, Inc.(超威半导体公司) The University of Hong Kong(香港大学) University of North Texas(北德克萨斯大学)

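AI summary: Proposes Collaborative Speculative Decoding (CoSpec), which trains an arbitration policy with reinforcement learning to choose flexibly between draft-model and target-model tokens during speculative decoding, surpassing target-only performance while retaining the speedup.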

Comments under review

详情
AI中文摘要

推测解码(SPD)通过让较小的草稿模型提出多个未来令牌,并由较大的目标模型并行验证,从而加速大型语言模型(LLM)推理。主流的SPD范式将目标模型视为唯一可靠的教师,仅当草稿令牌与目标预测完全匹配时才接受它。这种设计隐含地假设目标在每个位置都是更好的选择。在实践中,这一假设并不成立。尽管草稿模型整体上较弱,但在令牌级别上并非均匀地劣于目标。在草稿与目标不一致的情况中,有相当一部分是草稿的选择导向了正确的最终答案。受此启发,我们引入了协作推测解码(CoSpec),这是SPD的一种泛化,不再将目标模型视为唯一的令牌级权威。CoSpec通过强化学习训练一个仲裁策略,以决定是接受来自草稿还是目标模型的令牌,在不匹配时选择性地接受草稿令牌,如果这样做可能产生正确的最终答案。实验结果表明,CoSpec在保持显著加速的同时,超越了仅使用目标模型的性能。通过将重点从模仿转向协作,CoSpec为推测解码提供了新的视角。

英文摘要

Speculative decoding (SPD) accelerates large language model (LLM) inference by letting a smaller draft model propose multiple future tokens that are verified in parallel by a larger target model. The dominant SPD paradigm treats the target model as the sole reliable teacher, accepting a draft token only when it exactly matches the target prediction. This design implicitly assumes that the target is always the better choice at every position. In practice, this assumption does not hold. Although the draft is the weaker model overall, it is not uniformly inferior at the token level. In a meaningful fraction of cases where draft and target disagree, the draft's choice is the one that leads to the correct final answer. Inspired by this, we introduce Collaborative Speculative Decoding (CoSpec), a generalization of SPD that no longer treats the target model as the sole token-level authority. CoSpec trains an arbitration policy via reinforcement learning to decide whether to accept tokens from the draft or target model, selectively accepting draft tokens at mismatches when doing so is likely to yield a correct final answer. Experimental results show that CoSpec maintains substantial speedups while surpassing target-only performance. By shifting the emphasis from imitation to collaboration, CoSpec suggests a new perspective on speculative decoding.
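The exact-match acceptance rule that this abstract calls the dominant paradigm can be sketched for greedy decoding as follows. This is the standard baseline only; CoSpec's learned arbitration policy is not shown, and the function name and token representation (plain integers) are assumptions.

```python
def verify_draft_greedy(draft_tokens, target_tokens):
    """Exact-match verification used by standard greedy speculative decoding.

    draft_tokens:  k tokens proposed by the draft model
    target_tokens: k + 1 greedy tokens from the target model, where
                   target_tokens[i] is its choice given the prefix plus
                   draft_tokens[:i] (all computed in one parallel pass)
    Returns the tokens to append to the output.
    """
    if len(target_tokens) != len(draft_tokens) + 1:
        raise ValueError("target must provide one more token than the draft")
    accepted = []
    for i, tok in enumerate(draft_tokens):
        if tok != target_tokens[i]:
            # First mismatch: discard the rest of the draft, keep the target's token.
            accepted.append(target_tokens[i])
            return accepted
        accepted.append(tok)
    # Every draft token matched, so the target's extra token comes for free.
    accepted.append(target_tokens[-1])
    return accepted
```

The mismatch branch always sides with the target model. That is the point where, according to the abstract, CoSpec instead consults an arbitration policy that may keep the draft token.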

2605.24792 2026-05-26 cs.CV cs.AI

Parameter-Efficient VLMs for Gastrointestinal Endoscopy: Medical Image Generation and Clinical Visual Question Answering

用于胃肠内窥镜的参数高效视觉语言模型:医学图像生成与临床视觉问答

Ojonugwa Oluwafemi Ejiga Peter, Frederick Akor Ejiga, Fahmi Khalifa, Md Mahmudur Rahman

发表机构 * Computer Science Department, Morgan State University(摩根州立大学计算机科学系) International Organization for Migration (IOM)(国际移民组织) Electrical & Computer Engineering Department, Morgan State University(摩根州立大学电气与计算机工程系)

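AI summary: Proposes a dual-pipeline parameter-efficient fine-tuning model that pairs Florence-2 with LoRA-adapted Stable Diffusion to address clinical visual question answering and privacy-preserving synthetic data generation, respectively, achieving high ROUGE and BLEU scores on the Kvasir-VQA dataset while substantially reducing computational cost.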

详情
AI中文摘要

胃肠内窥镜AI系统的主要局限性源于标注数据短缺、严格的隐私政策以及传统模型微调中的显著瓶颈。这些限制阻碍了复杂AI模型在临床实践中的成功应用,尤其影响了诊断的可靠性和可扩展性。在本文中,我们提出了一种双流水线PEFT模型,解决了两个基本问题:医学视觉问答(VQA)和隐私保护合成数据的生成。对于临床VQA,我们采用Florence-2视觉语言模型。利用PEFT增强了模型的可解释性,同时大幅降低了训练的计算成本。同时,我们使用低秩适应(LoRA)与Stable Diffusion 2.1生成高质量的胃肠图像,在不违反患者隐私的情况下增强训练数据库。本研究使用了Kvasir-VQA数据集。我们的Florence-2 VQA模型实现了ROUGE-1为0.92,ROUGE-L为0.91,BLEU分数从0.08提升到0.24。在私有数据集上的微调始终优于在公共数据集上的微调。秩为4的LoRA合成达到了最优性能,保真度得分为0.290,一致性得分为0.730,Frechet BiomedCLIP距离(FBD)为1450,计算成本降低了近90%。该框架提高了AI在胃肠内窥镜中的临床潜力。与FLUX、MSDM和Kandinsky 2.2相比,我们的模型表现出更优的FBD和强语义对齐。虽然其他模型在保真度或一致性上领先,但我们更低的FBD表明更好的图像-文本一致性。这些结果确立了我们的方法作为增强临床AI中VQA和合成数据生成的稳健解决方案。

英文摘要

The major limitations of gastrointestinal (GI) endoscopy AI systems arise from a shortage of annotated data, strict privacy policies, and significant bottlenecks in conventional model fine-tuning. Such limitations impede the successful application of sophisticated AI models in clinical practice, particularly affecting the reliability and scalability of diagnosis. In this paper, we present a dual-pipeline PEFT model that addresses two fundamental problems: medical Visual Question Answering (VQA) and the generation of privacy-preserving synthetic data. For clinical VQA, we adopt the Florence-2 vision-language model. Leveraging PEFT enhances model interpretability while substantially reducing the computational cost of training. Simultaneously, we employ Low-Rank Adaptation (LoRA) with Stable Diffusion 2.1 to generate high-quality GI images that enhance training databases without violating patient privacy. This research utilized the Kvasir-VQA dataset. Our Florence-2 VQA model achieved ROUGE-1 of 0.92, ROUGE-L of 0.91, and BLEU score improvements from 0.08 to 0.24. Fine-tuning on private datasets consistently showed better results than fine-tuning on public datasets. The rank-4 LoRA synthesis achieved optimal performance with a fidelity score of 0.290, an agreement score of 0.730, and a Frechet BiomedCLIP Distance (FBD) of 1450, reducing computational costs by almost 90 percent. This framework improves the clinical potential of AI in GI endoscopy. Compared to FLUX, MSDM, and Kandinsky 2.2, our model demonstrates superior FBD and strong semantic alignment. While other models lead in Fidelity or Agreement, our lower FBD indicates better image-text coherence. These results establish our approach as a robust solution for enhancing VQA and synthetic data generation in clinical AI.

2605.24789 2026-05-26 cs.CV eess.IV

Self-Supervised Contrastive Learning for Cardiac MR Sequence Classification

自监督对比学习用于心脏磁共振序列分类

Yuli Wang, Hyewon Jung, Dongshen Peng, Yuwei Dai, Jing Wu, Haoyue Guan, Yoko Kato, Zhicheng Jiao, Yu Sun, Ihab Kamel, Joao Lima, Cheng Ting Lin, Harrison Bai

发表机构 * Department of Radiology and Radiological Science, Johns Hopkins University School of Medicine(约翰霍普金斯大学医学院放射科与放射科学系) Department of Electrical and Computer Engineering, Johns Hopkins University(约翰霍普金斯大学电气与计算机工程系) Department of Computer Science, University of North Carolina at Chapel Hill(北卡罗来纳大学教堂山分校计算机科学系) Department of Radiology, University of Colorado Denver Anschutz Medical Campus(科罗拉多大学丹佛分校安舒茨医学校区放射科) Department of Radiology, Second Xiangya Hospital, Central South University(中南大学湘雅二医院放射科) Department of Cardiology, Johns Hopkins University School of Medicine(约翰霍普金斯大学医学院心脏病科) Department of Diagnostic Imaging, Brown University Health(布朗大学健康中心诊断影像科)

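AI summary: To address the poor transfer of pretrained ViT features to the cardiac MR domain, proposes an adaptation strategy based on image-level self-supervised contrastive learning that outperforms supervised training on an in-house dataset, generalizes to external MR datasets, and reaches a classification AUC above 0.75 on the four most common sequences.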

详情
AI中文摘要

利用自注意力机制的视觉Transformer(ViT)模型在各种视觉任务(包括图像分类)中展现出强大的泛化能力。然而,这些通常在通用公共数据集上预训练的模型往往缺乏医学成像应用所需的专门领域知识。在本研究中,我们使用内部数据集调查了ViT模型对心脏磁共振(MR)图像的适应情况。我们发现预训练的ViT特征不能有效地迁移到心脏MR领域。为了克服这一限制,我们引入了一种利用基于图像的自监督对比学习的适应策略,与传统的监督训练方法相比,表现出优越的性能。此外,我们适应的ViT模型对外部MR数据集(如BraTS和ADNI)表现出强大的泛化能力。通过消融研究,我们进一步研究了批次大小和数据集规模对性能的影响。最终,我们的适应模型在四种最常见的心脏MR序列上实现了超过0.75的分类AUC。

英文摘要

Vision Transformer (ViT) models, utilizing self-attention mechanisms, have demonstrated robust generalization capabilities across various vision tasks, including image classification. However, these models, typically pretrained on general public datasets, often lack the specialized domain knowledge necessary for medical imaging applications. In this study, we investigate the adaptation of ViT models, specifically for cardiac magnetic resonance (MR) images, using an in-house dataset. We found that pretrained ViT features do not effectively transfer to the cardiac MR domain. To overcome this limitation, we introduce an adaptation strategy that utilizes image-based self-supervised contrastive learning, demonstrating superior performance compared to traditional supervised training approaches. Moreover, our adapted ViT model exhibits strong generalization to external MR datasets such as BraTS and ADNI. Through ablation studies, we further investigate the impact of batch size and dataset scale on performance. Ultimately, our adapted model achieves classification AUC exceeding 0.75 across the four most common cardiac MR sequences.

2605.24786 2026-05-26 cs.LG cs.AI

CONF-KV: Confidence-Aware KV Cache Eviction with Mixed-Precision Storage for Long-Horizon LLM

CONF-KV:面向长序列LLM的置信度感知KV缓存淘汰与混合精度存储

Yubo Li, Yidi Miao

发表机构 * Carnegie Mellon University(卡内基梅隆大学)

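AI summary: Proposes CONF-KV, which uses the model's current uncertainty (confidence) to adjust the KV-cache budget dynamically and combines it with mixed-precision storage and blockwise online-softmax attention, substantially reducing GPU memory use in long-horizon inference while maintaining high accuracy.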

详情
AI中文摘要

长序列LLM推理使键值(KV)缓存成为GPU内存的主要消耗者,并使每个token的注意力计算越来越昂贵。许多常见的淘汰策略使用静态的最近窗口或历史注意力,忽略了每个解码步骤中计算出的一个信号:模型当前的不确定性。我们引入CONF-KV,一个KV缓存管理器,它将下一个token分布转换为标量置信度分数,并用它来选择每步缓存预算,在模型不确定时保留更多上下文,在模型确定时积极剪枝。在每个预算内,token根据累积注意力质量和最近性的组合进行排序,同时一个受保护的最近窗口保持局部连贯性。我们将该策略与分块在线softmax注意力、混合FP16/INT8存储以及金字塔式逐层预算变体相结合。在四个模型家族和生成长度高达4K的情况下,CONF-KV的显存占用接近固定的512 token滑动窗口,同时与完整KV相比,困惑度差异保持在1.5-2.1点以内。在长达32K token的“大海捞针”测试中,CONF-KV的检索准确率达到91.4%,而滑动窗口为53.8%,H2O为80.6%;在75个VisualWebArena任务中,它以2.8倍的峰值内存降低保留了完整KV成功率的95.3%。

英文摘要

Long-horizon LLM inference turns the key-value (KV) cache into the dominant GPU memory consumer and makes per-token attention increasingly expensive. Many common eviction policies use static recency windows or historical attention, leaving unused a signal computed on every decoding step: the model's current uncertainty. We introduce CONF-KV, a KV-cache manager that converts the next-token distribution into a scalar confidence score and uses it to choose the per-step cache budget, retaining more context when the model is uncertain and pruning aggressively when it is confident. Within each budget, tokens are ranked by a composite of accumulated attention mass and recency, while a protected recent window preserves local coherence. We combine the policy with blockwise online-softmax attention, mixed FP16/INT8 storage, and a pyramidal per-layer budget variant. Across four model families and generated lengths up to 4K, CONF-KV stays near the footprint of a fixed 512-token sliding window while remaining within 1.5 to 2.1 perplexity points of full KV. On Needle-in-a-Haystack up to 32K tokens, CONF-KV reaches 91.4% retrieval accuracy versus 53.8% for sliding windows and 80.6% for H2O; on 75 VisualWebArena tasks it retains 95.3% of full-KV success at 2.8 times lower peak memory.
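The blockwise online-softmax attention mentioned above is a general technique that can be shown independently of CONF-KV. The sketch below handles a single query in pure Python and omits the usual 1/sqrt(d) score scaling for brevity; it is an illustration of the technique, not the paper's kernel, and all names are assumptions.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention_full(q, keys, values):
    """Reference single-query softmax attention over all keys at once."""
    scores = [dot(q, k) for k in keys]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    return [sum(wi * v[d] for wi, v in zip(w, values)) / z
            for d in range(len(values[0]))]

def attention_blockwise(q, keys, values, block_size=2):
    """Same result, computed one block at a time with a running max and sum."""
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * len(values[0])  # unnormalized weighted sum of values
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [dot(q, k) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated under the old maximum.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * vd for a, vd in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]
```

Only one block of keys and values needs to be resident at a time, and the output matches full attention up to floating-point error, which makes the approach a natural companion to a cache whose blocks may be stored at different precisions.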

2605.24784 2026-05-26 cs.AI

GRAIL: AI translation for scientists application workflow on satellite data

Zhuocheng Shang, Ahmed Eldawy

Affiliation University of California, Riverside

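AI summary Proposes GRAIL, a system that uses a LangGraph pipeline to translate Python geospatial workflows into scalable Spark programs, without requiring scientists to learn a new framework.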

Abstract

Domain scientists increasingly write Python scripts to analyze satellite imagery, but these scripts do not scale to large datasets. This paper demonstrates GRAIL, an agentic translation system that converts Python geospatial workflows into executable Spark-based programs without requiring scientists to learn a new framework. Rather than fine-tuning a specialized LLM, GRAIL adapts RDPro, a Scala library for satellite data analysis, to make it LLM-ready using structured documentation, API alias functions, and repair-oriented error logs. Translation is structured as a LangGraph pipeline that decomposes code generation into explicit sections with guided inputs and outputs, enabling targeted repair without regenerating the full program. We demonstrate GRAIL on real-world geospatial workflows and showcase the correctness and scalability of the translated code.
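The idea of section-wise generation with targeted repair can be illustrated with a small, framework-free sketch. It deliberately does not use the LangGraph or RDPro APIs: `translate_by_sections`, `toy_generate`, `toy_check`, and the section names are all hypothetical stand-ins, and the "generated code" is a placeholder string rather than real Scala.

```python
from typing import Callable, Dict, List, Optional


def translate_by_sections(
    section_names: List[str],
    generate: Callable[[str, Optional[str]], str],
    check: Callable[[str, str], Optional[str]],
    max_repairs: int = 2,
) -> Dict[str, str]:
    """Generate each section once; regenerate only sections whose check fails."""
    program: Dict[str, str] = {}
    for name in section_names:
        code = generate(name, None)
        error = check(name, code)
        attempts = 0
        while error is not None and attempts < max_repairs:
            # Feed the error log back in and retry this section only.
            code = generate(name, error)
            error = check(name, code)
            attempts += 1
        if error is not None:
            raise RuntimeError(f"section {name!r} still failing: {error}")
        program[name] = code
    return program


# Toy stand-ins for the LLM call and the compile/run check (hypothetical).
calls: List[str] = []


def toy_generate(name: str, error: Optional[str]) -> str:
    calls.append(name)
    if name == "transform" and error is None:
        return "<broken transform code>"
    return f"<{name} code>"


def toy_check(name: str, code: str) -> Optional[str]:
    return "compile error" if "broken" in code else None


program = translate_by_sections(["load", "transform", "write"], toy_generate, toy_check)
```

In this toy run only the failing `transform` section is generated a second time; `load` and `write` are each generated once, which is the property the abstract attributes to the sectioned pipeline.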

2605.24779 2026-05-26 cs.LG cs.AI math.CO

Complement Submodular Information Measures for Balanced and Robust Data Selection

Rishabh Iyer

Affiliation The University of Texas at Dallas

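AI summary Proposes Complement Submodular Information (CSI) objectives, which model the structural information shared between a subset and its complement to achieve balanced and robust data selection; proves approximate monotonicity and greedy approximation guarantees, and shows experimentally that CSI outperforms classical submodular objectives on robust hidden-slice-aware subset selection.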

Abstract

Submodular optimization has become a fundamental paradigm for data selection, retrieval, summarization, and representation learning due to its ability to model coverage, diversity, and representativeness. However, classical submodular objectives optimize only the selected subset and do not explicitly preserve structural information between the selected subset and the remaining data. In many modern machine learning applications, including train/validation/test splitting, benchmark construction, and robust subset selection, the quality of a selection depends critically on preserving balanced structure across both the selected subset and its complement. In this work, we introduce Complement Submodular Information (CSI), a new class of complement-aware submodular objectives that quantify shared structural information between a subset and its complement. Our framework induces complement-aware variants of several classical submodular functions, including Facility Location, Graph Cut, LogDet, Saturated Coverage, Set Cover, Probabilistic Set Cover, and Feature-Based Functions. We analyze the theoretical properties of CSI objectives and show that they exhibit approximate monotonicity under bounded curvature conditions, leading to near-$(1-1/e)$ greedy approximation guarantees. Empirically, CSI objectives consistently outperform standard submodular objectives on robust hidden-slice-aware subset selection. In particular, CSI objectives significantly improve preservation of coherent rare/tail semantic structure while simultaneously suppressing noisy and isolated outliers, leading to substantially improved downstream predictive performance. Synthetic experiments further illustrate how different CSI instantiations capture complementary notions of representativeness, diversity, connectivity, and balanced neighborhood preservation.
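For orientation, the classical baseline that CSI extends can be sketched as greedy maximization of the standard Facility Location objective, f(S) = sum over points i of the maximum similarity between i and any selected j. This is the classical objective only; the abstract does not give the CSI definitions, so no complement-aware variant is attempted here. The function names and the toy similarity matrix `SIM` are illustrative, and similarities are assumed non-negative.

```python
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def facility_location(sim: Matrix, selected: Sequence[int]) -> float:
    """Classical objective: each point is credited with its best selected match."""
    if not selected:
        return 0.0
    return sum(max(row[j] for j in selected) for row in sim)


def greedy_facility_location(sim: Matrix, k: int) -> List[int]:
    """Greedy selection of up to k points by largest marginal gain."""
    n = len(sim)
    cover = [0.0] * n  # cover[i] = best similarity of point i to the selected set
    selected: List[int] = []
    for _ in range(min(k, n)):
        best_j, best_gain = -1, float("-inf")
        for j in range(n):
            if j in selected:
                continue
            gain = sum(max(sim[i][j] - cover[i], 0.0) for i in range(n))
            if gain > best_gain:
                best_j, best_gain = j, gain
        selected.append(best_j)
        cover = [max(cover[i], sim[i][best_j]) for i in range(n)]
    return selected


# Toy data: two clusters, {0, 1, 2} and {3, 4, 5}, with no cross-cluster similarity.
SIM = [
    [1.0, 0.9, 0.8, 0.0, 0.0, 0.0],
    [0.9, 1.0, 0.9, 0.0, 0.0, 0.0],
    [0.8, 0.9, 1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 0.8, 0.7],
    [0.0, 0.0, 0.0, 0.8, 1.0, 0.8],
    [0.0, 0.0, 0.0, 0.7, 0.8, 1.0],
]
```

On this toy matrix, `greedy_facility_location(SIM, 2)` picks the two cluster centers, `[1, 4]`. Because Facility Location with non-negative similarities is monotone submodular, greedy selection under a cardinality constraint carries the classical $(1-1/e)$ guarantee that the abstract's "near-$(1-1/e)$" result refers back to.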

2605.24777 2026-05-26 cs.RO

MR-LiDAR: A Multi-Resolution Roadside LiDAR Benchmark for Perception Diagnostics and Deployment Guidance

Shunlai Cui, Peng Cao, Yuan Zhu, Yongjiang He, Jiacheng Yin, Xiao Huo, Gang Cao, Xiaobo Liu

Affiliation Intelligent Comprehensive Transportation Key Laboratory of Sichuan Province

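AI summary To address the lack of empirical benchmarks for LiDAR selection, proposes MR-LiDAR, a multi-resolution benchmark that controls variables such as beam count and beam distribution, systematically analyzes their effect on perception performance, and provides selection guidance.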

Comments 9 pages, 6 figures

Abstract

LiDAR model selection is a critical issue in roadside sensing systems, as it directly determines both perception capability and deployment cost. However, the lack of empirical benchmarks for comparing perception performance across different LiDAR configurations has greatly constrained scientific sensor selection and deployment planning. To address this gap, we present MR-LiDAR, a controlled multi-resolution LiDAR benchmark for roadside perception diagnostics. Using 16-, 32-, 80-, and 128-beam LiDARs in identical roadside scenarios, we collect point clouds and ground-truth annotations for diverse traffic participants, including vehicles and vulnerable road users (VRUs), across varying distances. This controlled design isolates intrinsic LiDAR specifications, particularly beam count and beam distribution, as the key variables for precise performance diagnostics. Based on MR-LiDAR, we conduct systematic empirical analyses to examine how beam count, beam distribution, target distance, object category, and vehicle occlusion affect LiDAR perception performance. The results reveal that all of these factors have substantial impacts. In particular, contrary to the common assumption that higher beam counts always yield better perception, we show that an 80-beam LiDAR with optimized beam distribution can match or even outperform a 128-beam LiDAR with uniform beam distribution. In addition, we provide a practical reference guide for LiDAR selection, including target point-count statistics and detection performance comparisons based on two widely used detection algorithms. This work offers a diagnostic benchmark and practical guidance for determining cost-effective LiDAR configurations in roadside perception applications.