2605.18013 2026-05-19 cs.CV cs.AI

TinySAM 2: Extreme Memory Compression for Efficient Track Anything Model

Zhaoyuan Ding, Yijing Yang, Han Shu, Xinghao Chen

Affiliations: Huawei Noah’s Ark Lab

AI summary: This paper proposes TinySAM 2, a lightweight video segmentation model. By introducing a memory quality management mechanism and joint spatial-temporal token compression, it reduces memory storage and computational cost, reaching 90% of the performance of SAM 2.1 on challenging datasets such as DAVIS and SA-V while using only 7% of the memory tokens and 3% of the training data.

Comments 12 pages, 6 figures

Abstract

Segment Anything Model 2 (SAM 2) serves as a core foundation model in the field of video segmentation. Building upon the original SAM model, it introduces a memory bank mechanism and demonstrates outstanding performance in tasks such as semi-supervised video object segmentation and tracking anything. However, the computational complexity of SAM 2's multi-stage image encoder and memory module raises the barrier to deploying the model in practical applications. To address this issue, we propose TinySAM 2, a lightweight video segmentation model that balances performance and efficiency. First, a memory quality management mechanism is introduced to select and retain highly informative historical frames as the memory. In addition, a joint spatial-temporal token compression method is proposed to reduce the memory storage and computational cost. Specifically, average pooling is first employed to compress redundant tokens in the spatial domain. In the temporal domain, informative tokens are selected across frames in the memory bank based on token-level similarity measurement. We also adopt RepViT as the lightweight image encoder, which further reduces the model parameters. Extensive experiments on challenging datasets such as DAVIS and SA-V demonstrate that TinySAM 2 achieves 90% of the performance of SAM 2.1 with only 7% of the memory tokens and 3% of the training data. This study effectively alleviates the bottlenecks in parameter count, computational load, and deployment costs associated with SAM 2, providing a resource-efficient solution for the widespread application of video segmentation models on devices.
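
The two compression steps named in the abstract (spatial average pooling, then similarity-based token selection across memory frames) can be illustrated with a minimal sketch. This is not the authors' implementation: the 2x2 pooling window, the cosine-similarity measure, the 0.95 threshold, and the greedy keep-or-drop rule are all assumptions made for illustration, and tokens are plain Python lists rather than encoder features.

```python
import math


def avg_pool_2x2(grid):
    """Average-pool an H x W grid of token vectors with a 2x2 window (H and W even)."""
    pooled = []
    for r in range(0, len(grid), 2):
        row = []
        for c in range(0, len(grid[0]), 2):
            window = [grid[r][c], grid[r][c + 1], grid[r + 1][c], grid[r + 1][c + 1]]
            dim = len(window[0])
            row.append([sum(tok[d] for tok in window) / 4.0 for d in range(dim)])
        pooled.append(row)
    return pooled


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def select_tokens(frames, threshold=0.95):
    """Hypothetical temporal selection: keep a token only if it is not too
    similar (cosine >= threshold) to any token that was already kept."""
    kept = []
    for frame in frames:
        for row in frame:
            for tok in row:
                if all(cosine(tok, k) < threshold for k in kept):
                    kept.append(tok)
    return kept


# Toy memory bank: two 4x4 frames of 2-dimensional tokens.
frame_a = [[[1.0, 0.0] for _ in range(4)] for _ in range(4)]
frame_b = ([[[1.0, 0.0] for _ in range(4)] for _ in range(2)]
           + [[[0.0, 1.0] for _ in range(4)] for _ in range(2)])

pooled_frames = [avg_pool_2x2(frame_a), avg_pool_2x2(frame_b)]
kept = select_tokens(pooled_frames, threshold=0.95)
print(len(kept), "tokens kept out of", 2 * 4 * 4)
```

On this toy input the 32 raw tokens shrink to 8 after pooling, and the selection step keeps one token per distinct direction. The actual selection criterion and compression ratios in TinySAM 2 may differ.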

2605.18012 2026-05-19 cs.CV cs.AI cs.LG

SAS: Semantic-aware Sampling for Generative Dataset Distillation

Mingzhuo Li, Guang Li, Linfeng Ye, Jiafeng Mao, Takahiro Ogawa, Konstantinos N. Plataniotis, Miki Haseyama

Affiliations: Hokkaido University; University of Toronto; The University of Tokyo

AI summary: This paper proposes a semantic-aware dataset distillation method that uses CLIP as a semantic prior and designs three semantic scoring functions to quantify class relevance, inter-class separability, and intra-set diversity, yielding compact and semantically discriminative datasets.

Comments Published as a journal paper in IEEE OJSP

Abstract

Deep neural networks have achieved impressive performance across a wide range of tasks, but this success often comes with substantial computational and storage costs due to large-scale training data. Dataset distillation addresses this challenge by constructing compact yet informative datasets that enable efficient model training while maintaining downstream performance. However, most existing approaches primarily emphasize matching data distributions or downstream training statistics, with limited attention to preserving high-level semantic information in the distilled data. In this work, we introduce a semantic-aware perspective for dataset distillation by leveraging Contrastive Language-Image Pretraining (CLIP) as a semantic prior for post-sampling. Our goal is to obtain distilled datasets that are not only compact but also semantically class-discriminative and diverse. To this end, we design three semantic scoring functions that quantify class relevance, inter-class separability, and intra-set diversity in a pretrained semantic space. Based on image pools generated by existing distillation methods, we further develop a two-stage strategy for effective sampling: the first stage filters semantically discriminative samples to form a reliable candidate set, and the second stage performs a dynamic diversity-aware selection to reduce redundancy while preserving semantic coverage. Extensive experiments across multiple datasets, image pools, and downstream models demonstrate consistent performance gains, highlighting the effectiveness of incorporating semantic information into dataset distillation.
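
The two-stage sampling strategy in the abstract (filter discriminative samples, then select for diversity) can be sketched as follows. Everything specific here is an assumption for illustration: the toy 3-dimensional vectors stand in for CLIP embeddings, `discriminative_score` is a hypothetical margin combining class relevance and inter-class separability, and the greedy farthest-first rule is only one possible diversity-aware selection, not the paper's scoring functions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def discriminative_score(img, own_text, other_texts):
    """Hypothetical score: similarity to the image's own class text embedding
    minus the strongest similarity to any other class text embedding."""
    return cosine(img, own_text) - max(cosine(img, t) for t in other_texts)


def two_stage_sample(pool, own_text, other_texts, keep_ratio=0.5, budget=2):
    # Stage 1: keep the most class-discriminative fraction of the pool.
    ranked = sorted(
        pool,
        key=lambda e: discriminative_score(e, own_text, other_texts),
        reverse=True,
    )
    candidates = ranked[:max(budget, int(len(ranked) * keep_ratio))]
    if not candidates:
        return []
    # Stage 2: greedy diversity-aware selection. Start from the top-ranked
    # candidate, then repeatedly add the candidate least similar to the
    # samples already chosen.
    chosen = [0]
    while len(chosen) < min(budget, len(candidates)):
        remaining = [i for i in range(len(candidates)) if i not in chosen]
        nxt = min(
            remaining,
            key=lambda i: max(cosine(candidates[i], candidates[j]) for j in chosen),
        )
        chosen.append(nxt)
    return [candidates[i] for i in chosen]


# Toy "embeddings" for one class and one competing class.
own_text = [1.0, 0.0, 0.0]
other_texts = [[0.0, 1.0, 0.0]]
pool = [
    [1.0, 0.0, 0.1],   # strongly on-class
    [1.0, 0.0, 0.12],  # near-duplicate of the first sample
    [1.0, 0.1, 1.0],   # on-class but visually different
    [0.5, 0.9, 0.0],   # closer to the competing class
]
selected = two_stage_sample(pool, own_text, other_texts, keep_ratio=0.75, budget=2)
print(selected)
```

In this toy run, stage 1 drops the sample that sits closer to the competing class, and stage 2 prefers the visually different sample over the near-duplicate, which mirrors the stated goal of reducing redundancy while preserving semantic coverage.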

2605.18010 2026-05-19 cs.CV cs.GR

Bridging the Gap: Converting Read Text to Conversational Dialogue

Parshav Singla, Agnik Banerjee, Aaditya Arora, Shruti Aggarwal, Anil Kumar Verma, Vikram C M, Raj Prakash Gohil, Gopal Kumar Agarwal

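Affiliations: Samsung Research and Development Institute, Bangalore, India

AI summary: This paper proposes a new method named PACC that uses deep neural networks to analyze and modify prosodic features such as intonation, stress, and rhythm, converting read speech into more natural conversational speech and improving the naturalness and accuracy of speech conversion in virtual assistants, customer service, and language-learning tools.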

Comments 11 pages, 4 figures. Published in ICICC 2025, Springer Lecture Notes in Networks and Systems

Journal ref Innovative Computing and Communications (ICICC 2025), Lecture Notes in Networks and Systems, Springer Nature, 2025, pp. 543-556

DOI: 10.1007/978-981-96-6681-2_38

Predictive Prefetching for Retrieval-Augmented Generation

Wuyang Zhang, Shichao Pei

Affiliations: Department of Computer Science, University of Massachusetts Boston

AI summary: This paper proposes an advanced asynchronous retrieval framework that predicts when retrieval should be triggered and what information is needed, reducing latency and improving generation efficiency while maintaining answer quality.

Comments Accepted by Forty-third International Conference on Machine Learning ICML 2026

Abstract

Retrieval-Augmented Generation (RAG) improves factual grounding in large language models but suffers from substantial latency due to synchronous retrieval. While recent work explores asynchronous retrieval, existing approaches rely on heuristic coordination between retrieval and generation and assume stable information demands during decoding that often break in complex, multi-domain settings. In this paper, we propose an advanced asynchronous retrieval framework that enables predictive prefetching aligned with evolving information needs. The framework explicitly predicts when retrieval should be triggered and what information should be retrieved using three components, a retrieval predictor, a context monitor, and a query generator, by exploiting semantic precursors in generation dynamics that emerge several tokens before uncertainty becomes critical. Experiments on multiple benchmarks demonstrate up to 43.5% end-to-end latency reduction and 62.4% improvement in time-to-first-token, while maintaining answer quality comparable to synchronous RAG baselines.

2605.17985 2026-05-19 cs.LG cs.AI

SAFE-SVD: Sensitivity-Aware Fidelity-Enforcing SVD for Physics Foundation Models

Chengjie Hong, Feixiang He, Yiheng Zeng, Lulu Kang, He Wang

Affiliations: AI Centre, University College London; University College London; Central South University; University of Massachusetts at Amherst

AI summary: This paper proposes a new method for compressing physics foundation models that explicitly models loss-aware layer sensitivity during compression to preserve accuracy and physical fidelity; experiments show substantial compression gains across multiple models and datasets.

Abstract

We propose a new method for compressing physics foundation models (PFMs) which is a new trend in AI for Science. While model compression is essential for reducing memory use and accelerating inference in large foundation models, it remains under-explored for PFMs, where preserving physical fidelity is crucial. The challenge lies in the functional nature of physics data, where partial derivatives encode spatiotemporal dynamics and exhibit high sensitivity to compression. Conventional compression methods ignore this structure, often causing severe performance degradation or failure. To address this, we introduce a sensitivity-aware fidelity-enforcing compression framework that explicitly models loss-aware layer sensitivity in the output function space during compression. This provides a new route to compressing scientific foundation models while preserving accuracy and physical fidelity. Experiments show substantial gains over existing methods across multiple models and datasets, achieving significantly higher compression ratios while maintaining accuracy, in some cases by orders of magnitude. More broadly, the work potentially leads to a new subfield of efficient, deployable, and sustainable scientific foundation models in AI for Science.

2605.17980 2026-05-19 cs.CV

Learning to Balance: Decoupled Siamese Diffusion Transformer for Reference-Based Remote Sensing Image Super-Resolution

Bin Luo, Runmin Dong, Zhaoyang Luo, Jinxiao Zhang, Jiyao Zhao, Fan Wei, Haohuan Fu

Affiliations: Tsinghua Shenzhen International Graduate School, Shenzhen, China; Sun Yat-sen University, Zhuhai, China; National Supercomputing Center in Shenzhen, Shenzhen, China; Tsinghua University, Beijing, China

AI summary: This paper proposes DS-DiT, a decoupled siamese diffusion transformer that decouples low-resolution and reference interactions at the attention level, addressing both over-reliance on and underutilization of reference information in reference-based super-resolution and improving generation quality.

Abstract

Diffusion-based methods demonstrate significant potential for remote sensing image super-resolution at large scaling factors, particularly in reference-based super-resolution (RefSR) where high-resolution reference images provide critical fine-grained texture priors. However, existing methods often suffer from a trade-off between over-reliance on reference information, which leads to texture artifacts, and underutilization, which results in insufficient detail recovery. To address these issues, we propose DS-DiT, a Decoupled Siamese Diffusion Transformer method that decouples low-resolution and reference interactions at the attention level. By enabling low-resolution structural priors and reference texture information to interact independently with the noisy latent, the framework effectively mitigates inter-source competition. Furthermore, to compensate for the limited local modeling ability of global attention, we introduce a Patch-Level Weights (PLW) module that adaptively modulates the fusion of conditional sources. In addition, this siamese architecture facilitates an autoguidance strategy during inference, which enhances reconstruction by exploiting the prediction discrepancy between strong and weak reference conditions. This approach boosts generation quality without additional training. Experimental results across multiple datasets and scaling factors demonstrate that DS-DiT outperforms existing methods in both quantitative metrics and visual fidelity.

2605.17978 2026-05-19 cs.CL

AutoVecCoder: Teaching LLMs to Generate Explicitly Vectorized Code

Shangzhan Li, Xinyu Yin, Xuanyu Jin, Ye He, Yuxin Zhou, Yuxuan Li, Xu Han, Wanxiang Che, Qi Shi, Ting Liu, Maosong Sun

Affiliations: Harbin Institute of Technology; Xiamen University; Tsinghua University

AI summary: This paper proposes the AutoVecCoder framework, whose VecPrompt and VecRL components enable large language models to perform automated explicit vectorization, achieving state-of-the-art performance on the SSE and AVX subsets of SimdBench and, in some cases, surpassing traditional compiler auto-vectorization.

Abstract

Vectorization via Single Instruction, Multiple Data (SIMD) architectures is a cornerstone of high-performance computing. To fully exploit hardware potential, developers often resort to explicit vectorization using intrinsics, as compiler-based auto-vectorization frequently yields suboptimal results due to conservative static analysis. While Large Language Models (LLMs) have demonstrated remarkable proficiency in general code generation, they struggle with explicit vectorization due to the scarcity of high-quality corpora and the strict semantic constraints of low-level hardware instructions. In this paper, we propose AutoVecCoder, a novel framework designed to empower LLMs with the capability of automated explicit vectorization. AutoVecCoder integrates two core components: VecPrompt, an automated data synthesis pipeline to inject domain-specific intrinsic knowledge; and VecRL, a reinforcement learning framework that aligns code generation with execution efficiency. AutoVecCoder-8B trained by this framework achieves state-of-the-art performance on the SSE and AVX subsets of SimdBench and, in some cases, generates implementations surpassing standard -O3 optimizations, effectively overcoming the inherent bottlenecks of traditional automated vectorization.

2605.17976 2026-05-19 cs.AI math.OC

Unleashing LLMs in Bayesian Optimization: Preference-Guided Framework for Scientific Discovery

Xinzhe Yuan, Zhuo Chen, Jianshu Zhang, Huan Xiong, Nanyang Ye, Yuqiang Li, Qinying Gu

Affiliations: Shanghai Artificial Intelligence Laboratory; Harbin Institute of Technology; Shanghai Jiao Tong University

AI summary: This paper proposes LGBO, an LLM-guided Bayesian optimization framework that continuously integrates the semantic reasoning of large language models into the optimization loop, improving optimization efficiency and convergence speed in scientific discovery.

Comments Published as a conference paper at ICLR 2026. 10 pages main paper, 21 pages appendix, 26 figures

Journal ref International Conference on Learning Representations (ICLR), 2026

Abstract

Scientific discovery is increasingly constrained by costly experiments and limited resources, underscoring the need for efficient optimization in AI for science. Bayesian Optimization (BO), though widely adopted for balancing exploration and exploitation, often exhibits slow cold-start performance and poor scalability in high-dimensional settings, limiting its applicability in real-world scientific problems. To overcome these challenges, we propose LLM-Guided Bayesian Optimization (LGBO), the first LLM preference-guided BO framework that continuously integrates the semantic reasoning of large language models (LLMs) into the optimization loop. Unlike prior works that use LLMs only for warm-start initialization or candidate generation, LGBO introduces a region-lifted preference mechanism that embeds LLM-driven preferences into every iteration, shifting the surrogate mean in a stable and controllable way. Theoretically, we prove that LGBO does not perform significantly worse than standard BO in the worst case, while achieving significantly faster convergence when preferences align with the objective. Empirically, LGBO consistently outperforms existing methods across diverse dry benchmarks in physics, chemistry, biology, and materials science. Most notably, in a new wet-lab optimization of Fe-Cr battery electrolytes, LGBO attains 90% of the best observed value within 6 iterations, whereas standard BO and existing LLM-augmented baselines require more than 10. Together, these results suggest that LGBO offers a promising direction for integrating LLMs into scientific optimization workflows.

2605.17969 2026-05-19 cs.CV

Generation Navigator: A State-Aware Agentic Framework for Image Generation

Jinming Liu, Ruoyu Feng, Yuqi Wang, Wenjun Zeng, Xin Jin

Affiliations: Shanghai Jiao Tong University; Eastern Institute of Technology; Independent

AI summary: This paper proposes Generation Navigator, a state-aware agentic framework for image generation. It reformulates image generation as a state-conditioned action-making problem and uses the PRE-GRPO algorithm to address the credit assignment problem that arises in reinforcement learning training, improving generation quality and reasoning accuracy.

Abstract

Despite rapid advances in text-to-image generation, faithfully realizing user intent remains challenging, often requiring manual multi-turn trial and error. To automate this process, existing systems rely on either simple prompt rewriting or closed-loop agents driven by hand-crafted rules, rather than learning to adapt actions to the evolving generation process. In this paper, we reformulate image generation as a state-conditioned action-making problem and propose Generation Navigator, a multi-turn T2I agent that learns to dynamically steer the generation trajectory and output the next action. However, training this agent via reinforcement learning introduces a critical credit assignment challenge: naively rewarding a trajectory based solely on a single state assigns equal credit to all actions in the rollout, ignores the quality dynamics across turns, and fails to distinguish actions that improve the trajectory from those that degrade it or waste turns without progress. We resolve this with PRE-GRPO (Peak-Retention-Efficiency Group Relative Policy Optimization), a trajectory-level reinforcement learning objective that explicitly rewards discovering a high-quality image (Peak), avoiding subsequent quality degradation across turns (Retention), and minimizing unnecessary turns (Efficiency). Experiments show substantial improvements across benchmarks, reaching a WISE score of 0.90 and 79.06% reasoning accuracy on T2I-ReasonBench.

2605.17968 2026-05-19 cs.LG

Function graph transformers universally approximate operators between function spaces

Takashi Furuya, David Mis, Ivan Dokmanić, Maarten V. de Hoop, Matti Lassas

Affiliations: Doshisha University; RIKEN AIP; Rice University; University of Basel; Simons Chair in Computational and Applied Mathematics and Earth Science; University of Helsinki

AI summary: This paper studies the approximation of nonlinear operators between function spaces by transformers. It proposes function graph transformers, based on a graph metric, whose outputs take the form of single-valued functions, and proves their universality in approximating general nonlinear operators.

SkyNative: A Native Multimodal Framework for Remote Sensing Visual Evidence Reasoning

Xiao Yang, Ronghao Fu, Zhiwen Lin, Zhuoran Duan, Jiashun Zhu, Jiasen Hu, Lang Sun, Weipeng Zhang, Jiaqi Liu, Xu Na, Haoran Liu, Weijie Zhang, Bo Yang

Affiliations: College of Computer Science and Technology, Jilin University, China; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education

AI summary: This paper proposes SkyNative, a native multimodal framework that removes the pretrained visual backbone and represents images directly as raw patch tokens in the language-model token space, improving fine-grained spatial reasoning on remote sensing imagery.

Abstract

Remote sensing vision-language models commonly rely on pretrained visual encoders to convert images into semantic features before language-model reasoning. While effective for scene-level understanding, this pipeline may prematurely compress local visual evidence, making fine-grained spatial reasoning vulnerable to language priors, especially in ultra-high-resolution remote sensing imagery. We present SkyNative, a native multimodal framework for remote sensing that adopts an encoder-free architecture, removing the pretrained visual backbone to directly represent images as raw patch tokens in the language-model token space. To reconcile low-level visual patches with textual tokens, SkyNative introduces a modality-aware decoupling mechanism that uses modality-specific parameters within a unified autoregressive backbone. We further introduce a visual reliance benchmark that diagnoses whether models ground their answers in image evidence through progressive visual degradation and misleading textual prompts. Across standard remote sensing understanding tasks and large-format spatial reasoning evaluations, SkyNative shows stronger image-grounded perception and improved robustness against prompt-induced language priors. These results suggest that native patch-level multimodal modeling is a promising direction for reliable remote sensing vision-language reasoning.

2605.17938 2026-05-19 cs.LG cs.AI stat.ML

Training data attribution in diffusion models via mirrored unlearning and noise-consistent skew

Joan Serrà, Dipam Goswami, Fabio Morreale, Wei-Hsiang Liao, Yuki Mitsufuji

Affiliations: Sony AI

AI summary: This paper proposes a method based on mirrored unlearning and noise-consistent skew to improve the reliability and robustness of training data attribution for diffusion models. It outperforms existing methods by a large margin on three different datasets, and the paper also analyzes the overlap of influential instances across generated items and the relevance of the findings to tasks that compare diffusion losses.

Comments 21 pages, 5 figures, 9 tables (includes appendix)

Abstract

Training data attribution (TDA) should enable generative model interpretability and foster a variety of related downstream tasks. Nonetheless, current TDA approaches lack reliability and robustness, preventing their adoption in real-world setups. In this paper, we take a decisive step towards more reliable and robust TDA for diffusion models. We propose to perform TDA with mirrored unlearning and noise-consistent skew (MUCS). The idea is to fine-tune a second model with bounded mirrored gradient ascent, and to measure the normalized skew of this model with respect to the original one using consistent noise samples. We show that, while being conceptually simple and generic, MUCS systematically outperforms existing methods on three different datasets by a large margin. We additionally study the effect that core design choices have on final performance, and analyze novel aspects regarding the overlap of influential instances across generated items and the potential of ensembling TDA approaches. We believe that our findings may have broader implications for more general unlearning setups, as well as for tasks requiring the comparison of diffusion losses.

2605.17933 2026-05-19 cs.CV

AtlasVA: Self-Evolving Visual Skill Memory for Teacher-Free VLM Agents

Pan Wang, Yihao Hu, Xiujin Liu, Jingchu Yang, Hang Wang, Zhihao Wen

Affiliations: Ant Group; University of Science and Technology of China; Westlake University; University of Michigan - Ann Arbor; Sun Yat-sen University

AI summary: This study proposes AtlasVA, a teacher-free visual skill memory framework with three layers (spatial heatmaps, visual exemplars, and symbolic text skills) that unifies perception, memory, and optimization and improves reinforcement learning performance without external LLM supervision.

Abstract

Vision-language model (VLM) agents increasingly rely on memory-augmented reinforcement learning to reuse experience across long-horizon tasks, yet most existing frameworks store memory as text and depend on proprietary teacher models to summarize or refine it. This design is poorly matched to spatial decision making: geometric priors are compressed into lossy language, and sparse interaction is often supervised through delayed textual feedback rather than dense visually grounded signals. We argue that reusable experience for VLM agents should remain visually grounded. Based on this insight, we propose AtlasVA, a teacher-free visual skill memory framework that organizes memory into three complementary layers: spatial heatmaps, visual exemplars, and symbolic text skills. AtlasVA further evolves danger and affinity atlases directly from trajectory statistics and lightweight grid heuristics, and reuses these self-evolving atlases as potential-based shaping rewards for reinforcement learning. This unifies perception, memory, and optimization without external LLM supervision. Experiments on Sokoban, FrozenLake, 3D embodied navigation, and 3D robotic manipulation benchmarks show that AtlasVA consistently outperforms text-centric memory baselines and competitive VLM agents, with especially strong gains on spatially intensive tasks. Homepage: https://wangpan-ustc.github.io/AtlasvaWeb
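
The abstract mentions reusing danger and affinity atlases as potential-based shaping rewards. A minimal sketch of the standard potential-based shaping form, r + gamma * phi(s') - phi(s), is given below. How AtlasVA actually builds its potential is not stated in the abstract; defining it as affinity minus danger per grid cell, and storing atlases as dictionaries keyed by cell, are assumptions made purely for illustration.

```python
def potential(cell, affinity, danger):
    """Hypothetical potential: affinity minus danger at a grid cell.
    Cells missing from an atlas contribute 0.0."""
    return affinity.get(cell, 0.0) - danger.get(cell, 0.0)


def shaped_reward(env_reward, cell, next_cell, affinity, danger, gamma=0.99):
    """Standard potential-based shaping: r + gamma * phi(s') - phi(s)."""
    return (env_reward
            + gamma * potential(next_cell, affinity, danger)
            - potential(cell, affinity, danger))


# Toy atlases on a small grid: one attractive cell and one dangerous cell.
affinity = {(0, 1): 1.0}
danger = {(1, 0): 1.0}

toward_goal = shaped_reward(0.0, (0, 0), (0, 1), affinity, danger, gamma=1.0)
toward_hole = shaped_reward(0.0, (0, 0), (1, 0), affinity, danger, gamma=1.0)
print(toward_goal, toward_hole)
```

Even with a zero environment reward, the step toward the high-affinity cell receives a positive signal and the step toward the dangerous cell a negative one, which is the kind of dense feedback the abstract contrasts with delayed textual supervision.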

2605.17932 2026-05-19 cs.CL cs.AI

Prompt Compression in Diffusion Large Language Models: Evaluating LLMLingua-2 on LLaDA

Sterling Huang, Abigayle Brown, Jiyoo Noh, Jiakang Xu, Wantong Huo, Kaung Myat Kyaw, Jonathan Chan

Affiliations: University of Toronto; King Mongkut’s University of Technology Thonburi

AI summary: This paper studies whether prompt compression is effective for diffusion large language models by evaluating LLMLingua-2 on LLaDA. Prompt compression performs poorly on mathematical reasoning while summarization remains relatively robust, indicating that prompt compression methods designed for autoregressive models do not transfer uniformly to diffusion models.

Abstract

Prompt compression reduces inference cost and context length in large language models, but prior evaluations focus primarily on autoregressive architectures. This study investigates whether prompt compression transfers effectively to diffusion large language models (DLLMs) using LLMLingua-2, specifically the 8B-parameter DLLM LLaDA. We evaluate compression performance on GSM8K, DUC2004, and ShareGPT using 250 prompts per dataset at an approximate 2$\times$ compression ratio, across mathematical reasoning, prompt reconstruction, and summarization tasks. Outputs generated from original prompts, compressed prompts, reconstructed prompts, and reconstructed-prompt reasoning were compared using exact-match accuracy, BLEU, ROUGE, and BERTScore. Results show that semantic preservation does not necessarily imply stable downstream behavior in diffusion models. Summarization tasks remained comparatively robust under compression, while mathematical reasoning degraded substantially despite high semantic similarity scores. Reconstruction experiments further showed that semantically similar prompts may still omit reasoning-critical information required for stable denoising. Across tasks, BERTScore recall was consistently lower than precision, suggesting that compression failures are primarily driven by information omission rather than semantic drift. These findings indicate that prompt compression methods designed for autoregressive models do not transfer uniformly to diffusion large language models and motivate the development of diffusion-aware compression strategies.
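
The observation that BERTScore recall falls below precision when information is omitted can be illustrated with a small sketch of BERTScore-style greedy matching. The one-hot "embeddings" below are toy stand-ins for contextual token embeddings, and treating the compressed text as the candidate and the original as the reference is an assumption for illustration, not a description of the paper's exact evaluation setup.

```python
def dot(a, b):
    """Dot product; equals cosine similarity when both vectors are unit length."""
    return sum(x * y for x, y in zip(a, b))


def bertscore_like(candidate, reference):
    """Greedy-matching precision, recall, and F1 over unit-length token vectors."""
    precision = sum(max(dot(c, r) for r in reference) for c in candidate) / len(candidate)
    recall = sum(max(dot(r, c) for c in candidate) for r in reference) / len(reference)
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1


# Toy example: the reference has four distinct tokens, and the compressed
# candidate keeps only two of them (pure omission, no semantic drift).
e1, e2, e3, e4 = [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]
reference = [e1, e2, e3, e4]
candidate = [e1, e3]

precision, recall, f1 = bertscore_like(candidate, reference)
print(precision, recall, f1)
```

Every kept token has an exact match in the reference, so precision stays at its maximum, while half of the reference tokens have no counterpart, which pulls recall down. This is the omission pattern the abstract describes, as opposed to drift, where precision would also fall.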

2605.17930 2026-05-19 cs.LG

InfoFlow: A Framework for Multi-Layer Transformer Analysis

Penghao Yu, Haotian Jiang, Zeyu Bao, Qianxiao Li

Affiliations: Department of Mathematics; National University of Singapore; Institute for Functional Intelligent Materials

AI summary: This study analyzes the approximation capabilities of multi-layer Transformers, shows that they differ fundamentally from those of single-layer Transformers, and proposes the InfoFlow framework for reasoning about the approximation efficiency of multi-layer Transformers.

Comments 36 pages

Abstract

While the approximation properties of single-layer Transformer architectures have been studied in recent works, a rigorous theoretical understanding of the multi-layer setting remains limited. In this work, we establish that multi-layer Transformers possess fundamentally different approximation capabilities from single-layer ones: for certain retrieval tasks, any single-layer Transformer requires at least $\Omega(\varepsilon^{-k})$ parameters to achieve precision $\varepsilon$, where $k$ grows linearly with sequence length $T$, whereas a two-layer Transformer with a single head per layer achieves the same approximation precision with at most $O(\varepsilon^{-1})$ parameters. To understand this separation, we identify two structural mechanisms underlying multi-layer approximation. Specifically, softmax attention can only efficiently retrieve the token attaining the maximum attention score, incurring exponential-in-length parameter cost for $k$-th largest retrieval with $k \geq 2$. Moreover, the parameter cost of decoding coupled information scales with the size of the retrieved token set. Motivated by these findings, we propose InfoFlow, a framework for multi-layer Transformers. The framework tracks an information set of accessible input positions at each token and layer, assigning an explicit approximation rate to each mode of information propagation. This abstraction recovers known approximation bounds, remains consistent with experimental observations on trained networks, and yields concrete predictions in settings where direct theoretical analysis is currently intractable. Our results provide a principled framework for reasoning about the approximation efficiency of multi-layer Transformers.
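
The statement that softmax attention efficiently retrieves only the maximum-score token can be made concrete with a toy numerical sketch. It does not reproduce the paper's construction or bounds: the scalar values, the three attention scores, and the sharpening factor `beta` are hypothetical quantities chosen for illustration.

```python
import math


def softmax(scores, beta):
    """Numerically stable softmax over beta-scaled scores."""
    m = max(scores)
    exps = [math.exp(beta * (s - m)) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def attend(scores, values, beta):
    """Single-query attention output: softmax-weighted average of scalar values."""
    weights = softmax(scores, beta)
    return sum(w * v for w, v in zip(weights, values))


scores = [0.2, 0.9, 0.7]     # token 1 has the largest score, token 2 the second largest
values = [10.0, 20.0, 30.0]  # scalar payload carried by each token

soft = attend(scores, values, beta=1.0)     # a blend of all three values
sharp = attend(scores, values, beta=200.0)  # concentrates on the top-scoring token
print(soft, sharp)
```

Scaling the scores up drives the output toward the value of the top-scoring token, so the maximum is easy to isolate. No comparable rescaling of a single softmax isolates the second-largest token, which is consistent with the abstract's point that $k$-th largest retrieval with $k \geq 2$ is costly for one layer.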

Unveiling Memorization-Generalization Coexistence: A Case Study on Arithmetic Tasks with Label Noise

Federated Learning by Utility-Constrained Stochastic Aggregation for Improving Rational Participation

See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding

LogRouter: Adaptive Two-Level LLM Routing for Log Question Answering in Big Data Systems

TinySAM 2: Extreme Memory Compression for Efficient Track Anything Model

SAS: Semantic-aware Sampling for Generative Dataset Distillation

Functionalization via Structure Completion and Motion Rectification

Uncertainty Reliability Under Domain Shift: An Investigation for Data-Driven Blood Pressure Estimation in Photoplethysmography

Semantic Reranking at Inference Time for Hard Examples in Rhetorical Role Labeling

Scalable Decision-Focused Learning through Cost-Sensitive Regression

RL4RLA: Teaching ML to Discover Randomized Linear Algebra Algorithms Through Curriculum Design and Graph-Based Search

Bridging the Gap: Converting Read Text to Conversational Dialogue

Shared Backbone PPO for Multi-UAV Communication Coverage with Connection Preservation

MARR: Module-Adaptive Residual Reconstruction for Low-Bit Post-Training Quantization

Low Latency Gaze Tracking via Latent Optical Sensing

Predictive Prefetching for Retrieval-Augmented Generation

SAFE-SVD: Sensitivity-Aware Fidelity-Enforcing SVD for Physics Foundation Models

Learning to Balance: Decoupled Siamese Diffusion Transformer for Reference-Based Remote Sensing Image Super-Resolution

AutoVecCoder: Teaching LLMs to Generate Explicitly Vectorized Code

Unleashing LLMs in Bayesian Optimization: Preference-Guided Framework for Scientific Discovery

Generation Navigator: A State-Aware Agentic Framework for Image Generation

Function graph transformers universally approximate operators between function spaces

Reconciling Contradictory Views on the Effectiveness of SFT in LLMs: An Interaction Perspective

Enhancing the Code Reasoning Capabilities of LLMs via Consistency-based Reinforcement Learning

A More Word-like Image Tokenization for MLLMs

SkyNative: A Native Multimodal Framework for Remote Sensing Visual Evidence Reasoning

Training data attribution in diffusion models via mirrored unlearning and noise-consistent skew

AtlasVA: Self-Evolving Visual Skill Memory for Teacher-Free VLM Agents

Prompt Compression in Diffusion Large Language Models: Evaluating LLMLingua-2 on LLaDA

InfoFlow: A Framework for Multi-Layer Transformer Analysis