arXivDaily: daily arXiv academic digest, updated Monday through Friday
2605.18022 2026-05-19 cs.LG cs.AI stat.ML

Unveiling Memorization-Generalization Coexistence: A Case Study on Arithmetic Tasks with Label Noise

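Linyu Liu, Pinyan Lu

Affiliations Taylor Lab, Huawei Technologies Co., Ltd.; Key Laboratory of Interdisciplinary Research of Computation and Economics, Shanghai University of Finance and Economics

AI summary This paper studies how highly over-parameterized models can memorize noisy labels and generalize at the same time. Experiments on modular arithmetic tasks show that, under appropriate optimization and model configurations, larger models generalize better and noisy labels are memorized faster. Over-parameterized models form an internal generalization structure, but its expression in the output is suppressed by the need to fit noisy labels. Extracting this structure with frequency-based methods yields high accuracy. The authors also propose a task-agnostic method that partitions the network into generalization and memorization components; the resulting subnetwork improves generalization but remains limited compared with frequency-based extraction, indicating that the generalization structure is distributed across neurons and that new tools are needed to retrieve generalizable knowledge from over-parameterized networks.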

Comments 27 pages, 32 figures

Abstract

Highly over-parameterized models can simultaneously memorize noisy labels and generalize well, yet how these behaviors coexist remains poorly understood. In this work, we investigate the underlying mechanisms of this coexistence using modular arithmetic tasks under heavy label noise. Through extensive experiments on two-layer neural networks, we find that larger models tend to generalize better under appropriate optimization and model configurations, while noisy labels are memorized faster than clean data. Over-parameterized models internally form a generalization structure, but its expression in the output is suppressed by the need to fit noisy labels. Remarkably, even with 80% label noise, near-perfect test accuracy can be achieved by extracting this internal structure using frequency-based methods. We further propose a task-agnostic method to partition networks into generalization and memorization components. Although this subnetwork improves generalization, it is limited compared with frequency-based extraction, indicating that the generalization structure is distributed across neurons and motivating the development of new tools to retrieve generalizable knowledge from over-parameterized networks.
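Illustrative sketch (not from the paper): the setting above can be pictured as a small modular-arithmetic dataset in which a fixed fraction of labels is corrupted. The choice of modular addition, the modulus `p`, the uniform wrong-label noise model, and the function name `make_noisy_modular_dataset` are all assumptions made for illustration.

```python
import random

def make_noisy_modular_dataset(p=7, noise_rate=0.8, seed=0):
    """Return (a, b, observed_label, clean_label) for every pair a, b in Z_p."""
    rng = random.Random(seed)
    pairs = [(a, b) for a in range(p) for b in range(p)]
    # Choose exactly round(noise_rate * N) examples to corrupt.
    noisy_idx = set(rng.sample(range(len(pairs)), round(noise_rate * len(pairs))))
    data = []
    for i, (a, b) in enumerate(pairs):
        clean = (a + b) % p
        if i in noisy_idx:
            # Assumption: a corrupted label is a uniformly random wrong class.
            observed = rng.choice([c for c in range(p) if c != clean])
        else:
            observed = clean
        data.append((a, b, observed, clean))
    return data
```

Keeping the clean label next to the observed one makes it easy to measure separately how well a trained model fits the noisy labels and how well it recovers the underlying rule.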

2605.18020 2026-05-19 cs.LG

Federated Learning by Utility-Constrained Stochastic Aggregation for Improving Rational Participation

M Yashwanth, Arunabh Singh, Ashok Nayak, Sai Kiran Bulusu, Anirban Chakraborty

Affiliations Indian Institute of Science; Indian Institute of Technology Bombay; IIIT Hyderabad

AI summary This paper proposes FedUCA, a framework that formalizes the server's role as an optimizer that maximizes global model performance by sustaining client participation, improving both client retention and global model performance.

Comments Federated Learning, Rational Clients, Endogenous Participation, and Aggregation

Abstract

Federated Learning (FL) algorithms implicitly assume that clients passively comply with server-side orchestration by sharing local model updates upon server request. However, this overlooks an important aspect in real-world cross-silo environments: clients are often rational agents who may prioritize their utilities such as local model performance over that of the global model. In settings with significant statistical heterogeneity, rational clients may opt out of the federation if the perceived benefits of collaboration fail to meet their local utility thresholds. Such attrition degrades the global model performance and can lead to the collapse of the federated training process. In this work, we introduce FedUCA (Federated Learning by Utility-Constrained Stochastic Aggregation for Improving Rational Participation), a framework that formalizes the server's role as an optimizer seeking to maximize global model performance by sustaining client participation. We substantiate our framework through extensive experiments on standard datasets, demonstrating that by prioritizing participation feasibility, FedUCA achieves significantly higher client retention and, consequently, superior global model performance.

2605.18018 2026-05-19 cs.CV cs.AI cs.HC

See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding

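Boyuan Sun, Bowen Yin, Yuanming Li, Xihan Wei, Qibin Hou

Affiliations VCIP, CS, Nankai University; Tongyi Lab, Alibaba Group; NKIARI, Shenzhen Futian

AI summary This paper proposes SWIM, which aligns vision and language representations to achieve fine-grained object understanding from textual prompts alone, removing the explicit visual prompts that existing methods require. It builds the NL-Refer dataset and supervises multi-layer cross-attention maps to improve text-visual alignment.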

Journal ref CVPR 2026

Abstract

We present SWIM (See What I Mean), a novel training strategy that aligns vision and language representations to enable fine-grained object understanding solely from textual prompts. Unlike existing approaches that require explicit visual prompts, such as masks or points, SWIM leverages mask supervision only during training to guide cross-modal attention, allowing the model to automatically attend to the user-specified object at inference. Our cross-attention analysis of pretrained multimodal large language models (MLLMs) reveals a systematic discrepancy: attribute words produce sharp, localized activations in the visual modality, whereas object nouns yield diffuse and scattered patterns due to semantic reference bias and distributed high-level representations. To address this misalignment, we construct NL-Refer, an enriched dataset in which each object mask is paired with a precise natural language referring expression. SWIM extracts multi-layer cross-attention maps from object nouns and enforces spatial consistency with ground-truth masks. Experimental results demonstrate that SWIM substantially improves text-visual alignment and achieves superior performance over visual-prompt-based methods on fine-grained object understanding benchmarks. The code and data are available at https://github.com/HumanMLLM/SWIM.

2605.18015 2026-05-19 cs.LG cs.DB cs.SE

LogRouter: Adaptive Two-Level LLM Routing for Log Question Answering in Big Data Systems

Mert Coskuner, Merve Zeybel, Melik Mert Dolan

Affiliations TUBITAK BILGEM

AI summary This paper proposes LogRouter, an adaptive two-level LLM routing system for log question answering in big data systems. It combines a PySpark-based Drain3 ingestion pipeline, GPU-accelerated embeddings, and dual-index storage in Apache Druid and PostgreSQL with pgvector to process log queries efficiently.

Abstract

Production log analytics in self-hosted, resource-constrained environments requires natural-language access to massive log streams without the cost of routing every query through a large language model. We present LogRouter, an end-to-end log question-answering system deployed on TUBITAK BILGEM's national big data platform that combines a PySpark-based Drain3 ingestion pipeline, GPU-accelerated embeddings, and dual-index storage in Apache Druid and PostgreSQL with pgvector. A two-level cost-aware router dispatches each query along one of four execution paths: direct response, Druid keyword search, template lookup with SQL generation, and pgvector semantic retrieval, while a Level-2 router selects either a 14B-class or 32B-class generator for the semantic path. A dedicated coder LLM handles text-to-SQL generation. We evaluate the system on four LogHub datasets (Linux, Apache, Windows, and Mac; 70 questions in total) under both an online full-pipeline configuration and an offline configuration that isolates the generator. The router reaches 88.4% mean accuracy across datasets and 94.7% on Linux, while the full pipeline attains a mean ROUGE-1 of 0.373, BERTScore of 0.879, RAGAS Faithfulness of 0.779, and an end-to-end latency of 18.6 s. In an apples-to-apples offline comparison, the routed system reduces mean latency by 55% versus a Fixed-32B baseline (46.3 s vs. 102.1 s) while preserving Answer Correctness within 5.8 points and exceeding a Fixed-14B baseline on RAGAS Faithfulness across every dataset. Cost-aware dispatching is therefore a practical mechanism for production log QA: routing recovers most of the quality of an always-32B configuration at less than half the latency, and the L1 keyword vocabulary makes that routing decision with high precision without a learned classifier.
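Illustrative sketch (not from the paper): a two-level router of the kind described above could be approximated with plain keyword matching at Level 1 and a simple size rule at Level 2. The vocabularies, the word-count threshold, and the function names `route_level1`, `route_level2`, and `route` are hypothetical; the abstract does not give the real keyword lists or the Level-2 criterion.

```python
import re

# Hypothetical vocabularies; the paper's actual L1 keyword lists are not given here.
DIRECT_TERMS = {"hello", "hi", "thanks"}
TEMPLATE_PHRASES = ("how many", "count", "per hour")
KEYWORD_TERMS = {"error", "failed", "timeout", "denied"}

def route_level1(query):
    """Pick one of four execution paths using keyword matching only."""
    tokens = re.findall(r"[a-z0-9]+", query.lower())
    padded = " " + " ".join(tokens) + " "  # lets us match whole phrases
    if tokens and set(tokens) <= DIRECT_TERMS:
        return "direct"
    if any(f" {phrase} " in padded for phrase in TEMPLATE_PHRASES):
        return "template_sql"
    if KEYWORD_TERMS & set(tokens):
        return "keyword_search"
    return "semantic"

def route_level2(query, max_short_words=12):
    # Assumption: longer questions go to the larger generator.
    return "32B" if len(query.split()) > max_short_words else "14B"

def route(query):
    """Return (path, generator); generator is None off the semantic path."""
    path = route_level1(query)
    return path, (route_level2(query) if path == "semantic" else None)
```

For example, `route("How many failed logins per hour?")` takes the template path and never invokes a generator, which is where the latency saving of such a design would come from.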

2605.18013 2026-05-19 cs.CV cs.AI

TinySAM 2: Extreme Memory Compression for Efficient Track Anything Model

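Zhaoyuan Ding, Yijing Yang, Han Shu, Xinghao Chen

Affiliations Huawei Noah's Ark Lab

AI summary This paper proposes TinySAM 2, a lightweight video segmentation model. A memory quality management mechanism and joint spatial-temporal token compression reduce memory storage and computational cost. On challenging datasets such as DAVIS and SA-V, it reaches 90% of the performance of SAM 2.1 using only 7% of the memory tokens and 3% of the training data.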

Comments 12 pages, 6 figures

Abstract

Segment Anything Model 2 (SAM 2) serves as a core foundation model in the field of video segmentation. Building upon the original SAM model, it introduces a memory bank mechanism and demonstrates outstanding performance in tasks such as semi-supervised video object segmentation and tracking anything. However, the complex computational characteristics of SAM 2's multi-stage image encoder and memory module have raised the barrier to the model's deployment in practical applications. To address this issue, we propose TinySAM 2, a lightweight video segmentation model that balances performance and efficiency. First, a memory quality management mechanism is introduced to select and retain highly informative historical frames as the memory. In addition, a joint spatial-temporal token compression method is proposed that reduces the memory storage and computational cost. Specifically, average pooling is employed to first compress redundant tokens in the spatial domain. In the temporal domain, informative tokens are selected across frames in the memory bank based on token-level similarity measurement. Besides, we take RepViT as the lightweight image encoder, which further reduces the model parameters. Extensive experiments on challenging datasets such as DAVIS and SA-V demonstrate that TinySAM 2 achieves 90% of the performance of SAM 2.1, with only 7% of the memory tokens and 3% of the training data. This study effectively alleviates the bottlenecks in parameter count, computational load, and deployment costs associated with SAM 2, providing a resource-efficient solution for the widespread application of video segmentation models on devices.
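Illustrative sketch (not from the paper): the two compression steps named above can be pictured as average pooling over a token grid followed by a similarity-based filter across frames. The greedy newest-first rule, the cosine threshold, and the function names are assumptions; the abstract does not specify how TinySAM 2 actually scores token similarity.

```python
import math

def avg_pool_tokens(grid, k=2):
    """Spatial step: average each k x k block of D-dimensional tokens."""
    h, w, d = len(grid), len(grid[0]), len(grid[0][0])
    pooled = []
    for i in range(0, h - h % k, k):
        row = []
        for j in range(0, w - w % k, k):
            block = [grid[i + di][j + dj] for di in range(k) for dj in range(k)]
            row.append([sum(tok[c] for tok in block) / len(block) for c in range(d)])
        pooled.append(row)
    return pooled

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0

def select_temporal_tokens(frames, threshold=0.95):
    """Temporal step (assumed rule): scan frames newest first and keep a
    token only if it is not near-duplicate of a token already kept."""
    kept = []  # (frame_index, token_index, token)
    for f in range(len(frames) - 1, -1, -1):
        for t, tok in enumerate(frames[f]):
            if all(cosine(tok, other) < threshold for _, _, other in kept):
                kept.append((f, t, tok))
    return [(f, t) for f, t, _ in kept]
```

With 2 x 2 pooling the spatial step alone cuts the token count by a factor of four; the temporal step then removes tokens that repeat across memory frames.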

2605.18012 2026-05-19 cs.CV cs.AI cs.LG

SAS: Semantic-aware Sampling for Generative Dataset Distillation

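Mingzhuo Li, Guang Li, Linfeng Ye, Jiafeng Mao, Takahiro Ogawa, Konstantinos N. Plataniotis, Miki Haseyama

Affiliations Hokkaido University; University of Toronto; The University of Tokyo

AI summary This paper proposes a semantic-aware dataset distillation method that uses CLIP as a semantic prior. Three semantic scoring functions quantify class relevance, inter-class separability, and intra-set diversity, producing distilled datasets that are compact and semantically class-discriminative.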

Comments Published as a journal paper in IEEE OJSP

Abstract

Deep neural networks have achieved impressive performance across a wide range of tasks, but this success often comes with substantial computational and storage costs due to large-scale training data. Dataset distillation addresses this challenge by constructing compact yet informative datasets that enable efficient model training while maintaining downstream performance. However, most existing approaches primarily emphasize matching data distributions or downstream training statistics, with limited attention to preserving high-level semantic information in the distilled data. In this work, we introduce a semantic-aware perspective for dataset distillation by leveraging Contrastive Language-Image Pretraining (CLIP) as a semantic prior for post-sampling. Our goal is to obtain distilled datasets that are not only compact but also semantically class-discriminative and diverse. To this end, we design three semantic scoring functions that quantify class relevance, inter-class separability, and intra-set diversity in a pretrained semantic space. Based on image pools generated by existing distillation methods, we further develop a two-stage strategy for effective sampling: the first stage filters semantically discriminative samples to form a reliable candidate set, and the second stage performs a dynamic diversity-aware selection to reduce redundancy while preserving semantic coverage. Extensive experiments across multiple datasets, image pools, and downstream models demonstrate consistent performance gains, highlighting the effectiveness of incorporating semantic information into dataset distillation.
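Illustrative sketch (not from the paper): the two-stage strategy above can be approximated by first ranking candidates with a relevance-minus-confusion score and then picking greedily for diversity. The scoring formula, the parameters `keep` and `budget`, and the use of plain lists in place of CLIP embeddings are assumptions made for illustration.

```python
import math

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0

def two_stage_select(image_vecs, class_vecs, target, keep=3, budget=2):
    """Select `budget` image indices for class `target` from an image pool."""
    # Stage 1: discriminative filtering. Score = similarity to the target
    # class embedding minus the highest similarity to any other class.
    scored = []
    for idx, vec in enumerate(image_vecs):
        relevance = cosine(vec, class_vecs[target])
        confusion = max(cosine(vec, c) for j, c in enumerate(class_vecs) if j != target)
        scored.append((relevance - confusion, idx))
    candidates = [idx for _, idx in sorted(scored, reverse=True)[:keep]]

    # Stage 2: diversity-aware selection. Start from the best candidate and
    # repeatedly add the candidate least similar to those already chosen.
    selected = [candidates[0]]
    while len(selected) < min(budget, len(candidates)):
        rest = [i for i in candidates if i not in selected]
        nxt = min(rest, key=lambda i: max(cosine(image_vecs[i], image_vecs[s])
                                          for s in selected))
        selected.append(nxt)
    return selected
```

In practice the embeddings would come from a pretrained vision-language encoder; the toy vectors used to exercise this sketch only stand in for them.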

2605.18010 2026-05-19 cs.CV cs.GR

Functionalization via Structure Completion and Motion Rectification

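Mingrui Zhao, Sai Raj Kishore Perla, Kai Wang, Sauradip Nag, Duc Anh Nguyen, Jiayi Peng, Ruiqi Wang, Angel X. Chang, Manolis Savva, Ali Mahdavi-Amiri, Hao Zhang

Affiliations Simon Fraser University; ShanghaiTech University

AI summary This paper introduces object functionalization, a new task that transforms visually plausible but non-functional 3D models into functional, physically operable ones. Functionalization is formulated as a graph completion problem over a new functional graph representation; the neural Graph Functionalizer (GraFu) completes the incomplete graph, which then drives the generation of 3D geometry and rectifies erroneous human-annotated and predicted motions.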

Abstract

Acquisition and creation of 3D assets have been largely view- or appearance-driven. As a result, existing digital 3D models often lack the requisite structural components to function as intended, such as joints, supports, interiors, or interaction elements. At the same time, even human-annotated motions are frequently error-prone, leading to physically implausible behavior. We introduce object functionalization, a novel task aimed at transforming visually plausible but non-functional 3D models into functional and physically operable ones. We formulate functionalization as a graph completion problem over a new functional graph representation, where labeled nodes represent object parts, labeled edges encode functional and contact relations, and movable nodes carry motion attributes, so that structural functional deficiencies manifest as missing nodes or incorrect edges. We develop a neural Graph Functionalizer (GraFu) to complete an incomplete graph representing a non-functional 3D object. The completed graph then drives a geometry realization stage that instantiates predicted connectors and structural elements in 3D, with the compelling side effect of rectifying erroneous human-annotated and predicted motions. To support training and evaluation, focusing on furniture as a rich and challenging target category, we introduce FurFun-233, a dataset of 233 paired non-functional and functionalized furniture models. On PartNet-Mobility ("zero-shot") and HSSD test sets, our method matches state-of-the-art methods in motion prediction accuracy while substantially improving functionality in terms of collision and connectivity.

2605.18008 2026-05-19 cs.LG stat.ML

Uncertainty Reliability Under Domain Shift: An Investigation for Data-Driven Blood Pressure Estimation in Photoplethysmography

Mohammad Moulaeifard, Ciaran Bench, Philip J. Aston, Nils Strodthoff

Affiliations AI4Health Department, University of Oldenburg; Department of Data Science and AI, National Physical Laboratory; School of Mathematics and Physics, University of Surrey

AI summary This paper studies the reliability of uncertainty estimates under domain shift for deep learning-based blood pressure estimation from photoplethysmography signals. It compares deep ensembles with Monte Carlo dropout and examines the importance of uncertainty recalibration.

Comments 23 pages, 2 figures

Abstract

Uncertainty quantification (UQ) is critical for safety-critical domains like healthcare, yet it is rarely evaluated under realistic out-of-distribution (OOD) conditions. Here, we assessed predictive performance and uncertainty reliability for deep learning-based blood pressure (BP) estimation from photoplethysmography (PPG) signals under both in-distribution (ID) and OOD settings. Using an XResNet1D-50 trained on PulseDB and tested on four external datasets, we compared deep ensembles (DE) and Monte Carlo dropout (MCD) with Gaussian negative log-likelihood (GNLL) and mean squared error (MSE) losses, optionally followed by post-hoc recalibration via conformal prediction (CP), temperature scaling (TS), and isotonic regression (IR). The key findings of our study are as follows: (1) DE provides stronger predictive robustness under domain shift than MCD, an advantage that becomes clear primarily under external shift. (2) Recalibrated GNLL-based methods yield the best uncertainty calibration (e.g., GNLL+DE+CP for systolic blood pressure (SBP), GNLL+DE+TS for diastolic blood pressure (DBP)), while MSE-based uncertainty requires recalibration to become practically useful. (3) Across settings, CP and TS offer the most consistent gains, with IR remaining competitive in several cases. Overall, our results identify DE-based methods as most robust for predictive performance under domain shift, GNLL as strongest for native UQ, and recalibration as essential for making MSE-based uncertainty practical. These findings highlight the need to jointly assess predictive accuracy and calibration on external data for trustworthy cuffless BP estimation.
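Illustrative sketch (not from the paper): two of the ingredients named above are the Gaussian negative log-likelihood and post-hoc recalibration by conformal prediction. The code below shows the per-sample GNLL and a standard split-conformal recipe with normalized residuals; whether the paper uses this exact conformal variant is an assumption, and the function names are hypothetical.

```python
import math

def gaussian_nll(y, mu, sigma):
    """Per-sample Gaussian negative log-likelihood for prediction (mu, sigma)."""
    return 0.5 * (math.log(2 * math.pi * sigma ** 2) + (y - mu) ** 2 / sigma ** 2)

def conformal_scale(cal_y, cal_mu, cal_sigma, alpha=0.1):
    """Split conformal prediction with normalized residuals |y - mu| / sigma.
    Returns the factor q so that mu +/- q * sigma targets 1 - alpha coverage."""
    scores = sorted(abs(y - m) / s for y, m, s in zip(cal_y, cal_mu, cal_sigma))
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-indexed order statistic
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return scores[rank - 1]

def conformal_interval(mu, sigma, q):
    """Recalibrated prediction interval for a new sample."""
    return mu - q * sigma, mu + q * sigma
```

The calibration set must be held out from training. Under domain shift the exchangeability assumption behind conformal coverage no longer holds, which is one reason evaluation on external data matters.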

2605.18007 2026-05-19 cs.CL

Semantic Reranking at Inference Time for Hard Examples in Rhetorical Role Labeling

Anas Belfathi, Nicolas Hernandez, Laura Monceaux, Warren Bonnard, Richard Dufour

Affiliations Nantes Université, École Centrale Nantes, CNRS, LS2N, UMR 6004; University of Lorraine

AI summary This paper proposes RISE, a framework that uses label semantics at inference time to rerank predictions on hard examples in rhetorical role labeling, improving the accuracy and robustness of model predictions.

Comments Accepted at ACL 2026 (Main Conference)

Abstract

Rhetorical Role Labeling (RRL) assigns a functional role to each sentence in a document and is widely used in legal, medical, and scientific domains. While language models (LMs) achieve strong average performance, they remain unreliable on hard examples, where prediction confidence is low. Existing approaches typically handle uncertainty implicitly and treat labels as discrete identifiers, overlooking the semantic information encoded in label names. We introduce RISE, an inference-time semantic reranking framework that leverages label semantics to refine predictions on hard instances. RISE automatically identifies low-confidence predictions and reranks model outputs using contrastively learned label representations, without retraining or modifying the underlying model. Experiments on eight domain-specific RRL datasets with seven LMs, including encoder-based and causal architectures, show an average gain of +9.15 macro-F1 points on hard examples. For explainability, we further propose manual hardness annotations to study difficulty from both model and human perspectives, revealing a moderate agreement with Cohen's kappa = 0.40.
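Illustrative sketch (not from the paper): inference-time reranking of low-confidence predictions can be pictured as follows. A prediction is left alone when the top probability is high; otherwise the top candidates are reordered by similarity between the sentence embedding and label embeddings. The confidence threshold, the `top_k` cutoff, and the function name `rerank_if_uncertain` are assumptions, and the label embeddings are taken as given rather than contrastively learned.

```python
import math

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0

def rerank_if_uncertain(probs, sentence_vec, label_vecs, threshold=0.6, top_k=2):
    """Return the predicted label index, reranking only hard examples."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if probs[ranked[0]] >= threshold:
        return ranked[0]  # confident: keep the model's prediction
    # Hard example: choose among the top-k labels by semantic similarity.
    candidates = ranked[:top_k]
    return max(candidates, key=lambda i: cosine(sentence_vec, label_vecs[i]))
```

Because the base model's probabilities are only read, never changed, a step like this needs no retraining.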

2605.18005 2026-05-19 cs.LG stat.ML

Scalable Decision-Focused Learning through Cost-Sensitive Regression

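Noah Schutte, Senne Berden, Tias Guns, Krzysztof Postek, Neil Yorke-Smith

Affiliations Delft University of Technology; KU Leuven; Independent Researcher

AI summary This paper proposes a cost-sensitive multi-output regression approach to combinatorial optimization problems with multiple uncertain parameters. Cost-sensitive loss function components make decision-focused learning more efficient and scalable.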

Comments 12 pages, 7 figures

Abstract

Many real-world combinatorial problems involve uncertain parameters, which can be predicted given contextual features and historical data. These 'predict-then-optimize' or 'contextual optimization' problems have gained significant attention: end-to-end training methods can now minimize the downstream task cost rather than the predictive error. However, despite their effectiveness, these decision-focused learning (DFL) approaches often rely on repeated solving of the underlying combinatorial optimization problem during training, making them computationally expensive and difficult to scale. We reframe the learning problem as a cost-sensitive multi-output regression problem: multi-output due to the combinatorial problem having multiple uncertain parameters, and cost-sensitive due to the downstream task cost being the real target. Our technical contribution is the formalization of multiple loss function components that follow from this reframing: cost-insensitive normalization, decision-aware asymmetric penalization of over- and underpredictions, and instance-based costs that mimic the true downstream task-based loss locally. These components require zero or one solve per training data instance, while requiring no further solves during training. Experiments show that the combination of loss components achieves comparable downstream task quality to the state of the art, while being significantly more efficient, enabling scaling to problem sizes that have not been tackled before with DFL.
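Illustrative sketch (not from the paper): one way to read "asymmetric penalization of over- and underpredictions" is a squared-error loss whose weight depends on the sign of the error, per predicted parameter. The weights `w_over` and `w_under` are hypothetical placeholders; in the paper they would presumably be derived from the downstream decision problem, which this sketch does not attempt.

```python
def asymmetric_squared_loss(pred, true, w_over, w_under):
    """Mean squared error with separate weights for over- and underprediction.

    pred, true: predicted and true values of the uncertain parameters.
    w_over[j], w_under[j]: assumed per-parameter penalty weights.
    """
    total = 0.0
    for p, t, wo, wu in zip(pred, true, w_over, w_under):
        err = p - t
        weight = wo if err > 0 else wu  # overprediction vs. underprediction
        total += weight * err * err
    return total / len(pred)
```

A loss of this shape needs no solver call during training, which is the property the abstract highlights.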

2605.18004 2026-05-19 cs.LG

RL4RLA: Teaching ML to Discover Randomized Linear Algebra Algorithms Through Curriculum Design and Graph-Based Search

Jinglong Xiong, Xiaotian Liu, Ruoxin Wang, Zihang Liu, Yefan Zhou, Yujun Yan, Yaoqing Yang

Affiliations Pratt School of Engineering, Duke University, Durham, NC, USA; Department of Computer Science, Dartmouth College, Hanover, NH, USA

AI summary This paper proposes RL4RLA, a framework that uses curriculum design and graph-based search to automate the discovery of interpretable, symbolic randomized linear algebra algorithms. It rediscovers state-of-the-art methods and produces algorithms optimized for specific performance trade-offs.

Comments Accepted at the 43rd International Conference on Machine Learning (ICML 2026). 9 pages main text; 21 pages total

Abstract

Randomized linear algebra (RLA) algorithms are a modern class of numerical linear algebra techniques that play an essential role in scientific computing and machine learning, with broad and growing adoption. However, their discovery remains mostly a manual process that requires deep expert knowledge and inspiration. While Reinforcement Learning (RL) offers a pathway to automation, standard approaches struggle with sparse reward landscapes and vast search spaces inherent to high-performing RLA algorithms. In this paper, we present RL4RLA, a general RL framework that automates the discovery of interpretable, symbolic RLA algorithms. Unlike black-box approaches, our method builds explicit algorithms from basic linear algebra primitives, ensuring verifiable and implementable representations. To enable efficient discovery, we introduce: (1) a numerical curriculum that progressively increments problem difficulty to encode inductive bias specific to the RLA domain; (2) Monte Carlo Graph Search, which optimizes exploration by identifying and merging equivalent partial algorithms. We demonstrate that RL4RLA rediscovers state-of-the-art methods, including sketch-and-precondition solvers, Randomized Kaczmarz, and Newton Sketch, and can be targeted to produce algorithms optimized for specific trade-offs between accuracy, speed, and stability. Code is available at https://github.com/Tim-Xiong/RL4RLA.
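Illustrative sketch (not from the paper or its repository): Randomized Kaczmarz, one of the methods the abstract says RL4RLA rediscovers, is short enough to show in full. This is the textbook form for a consistent system Ax = b, sampling each row with probability proportional to its squared norm; the function name and defaults are chosen here for illustration.

```python
import random

def randomized_kaczmarz(A, b, iters=2000, seed=0):
    """Approximately solve a consistent linear system A x = b.

    A: list of rows (lists of floats); b: list of floats.
    Each step projects x onto the hyperplane of one randomly chosen row.
    """
    rng = random.Random(seed)
    n = len(A[0])
    row_norms = [sum(v * v for v in row) for row in A]  # squared row norms
    x = [0.0] * n
    for _ in range(iters):
        # Sample row i with probability proportional to its squared norm.
        i = rng.choices(range(len(A)), weights=row_norms, k=1)[0]
        row = A[i]
        residual = b[i] - sum(r * xj for r, xj in zip(row, x))
        step = residual / row_norms[i]
        x = [xj + step * r for xj, r in zip(x, row)]
    return x
```

Each iteration touches a single row, so the cost per step is linear in the number of columns regardless of how many equations the system has.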

2605.18001 2026-05-19 cs.CL

Bridging the Gap: Converting Read Text to Conversational Dialogue

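Parshav Singla, Agnik Banerjee, Aaditya Arora, Shruti Aggarwal, Anil Kumar Verma, Vikram C M, Raj Prakash Gohil, Gopal Kumar Agarwal

Affiliations Samsung Research and Development Institute, Bangalore, India

AI summary This paper proposes PACC, a method that uses deep neural networks to analyze and modify prosodic features such as intonation, stress, and rhythm, converting read speech into more natural conversational speech. It targets improved naturalness and accuracy of speech conversion for virtual assistants, customer service, and language learning tools.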

Comments 11 pages, 4 figures. Published in ICICC 2025, Springer Lecture Notes in Networks and Systems

Journal ref Innovative Computing and Communications (ICICC 2025), Lecture Notes in Networks and Systems, Springer Nature, 2025, pp. 543-556

Abstract

In recent advancements within speech processing, converting read speech to conversational speech has gained significant attention. The primary challenge in this domain is maintaining naturalness and intelligibility while minimizing computational overhead for real-time applications. Traditional read speech often lacks the nuanced prosodic variation essential for natural conversational interactions, posing challenges for applications in virtual assistants, customer service, and language learning tools. This paper introduces a novel approach, Prosodic Adjustment with Conversational Context (PACC), aimed at converting read speech into natural conversational speech used in various modern applications. PACC utilizes advanced deep neural networks to analyze and modify prosodic features such as intonation, stress, and rhythm. Unlike conventional methods, our approach uses High-Fidelity Generative Adversarial Networks (HiFi-GAN) for speech synthesis. Our experimental results demonstrate significant improvements in speech conversion, enhancing naturalness and achieving better model accuracy with additional training on speech datasets. This research establishes new benchmarks in speech conversion tasks and Mean Opinion Score (MOS) evaluation for testing model accuracy, and we show that our approach can be successfully extended to other speech conversion applications.

2605.17999 2026-05-19 cs.AI

Shared Backbone PPO for Multi-UAV Communication Coverage with Connection Preservation

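Z. Jiang

Affiliations Zhejiang University

AI summary This paper proposes a Shared Backbone PPO algorithm that shares the base module between the Actor and Critic networks for efficient training and improved performance. The algorithm is applied to a connectivity-preserving multi-UAV swarm communication coverage task and compared with standard PPO; experiments show superior performance. A graph information aggregation module is also integrated to accommodate the communication conditions among agents. With this module the algorithm remains effective, and the trained swarm exhibits a higher level of cooperation.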

Abstract

This paper proposes a Shared Backbone Proximal Policy Optimization (Shared Backbone PPO) algorithm. By sharing the base module between the Actor and Critic networks, the algorithm achieves efficient training and improved performance. The algorithm is implemented in a connectivity-preserving multi-UAV swarm communication coverage task and compared with the standard PPO algorithm. Experimental results demonstrate that the proposed method achieves superior performance. Furthermore, a graph information aggregation module is incorporated into the model architecture to accommodate the communication conditions among agents. With the integration of this module, the algorithm remains effective, and the trained agent swarm exhibits a higher level of cooperation.

2605.17997 2026-05-19 cs.LG cs.AI cs.CV

MARR: Module-Adaptive Residual Reconstruction for Low-Bit Post-Training Quantization

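Le Su, Xing Luo, Zhi Jin

Affiliations Peng Cheng Laboratory

AI summary This paper proposes MARR, a module-adaptive residual reconstruction method that assigns each module its own scaling coefficient to balance residual-related HA bias against accumulated-error correction, improving performance under low-bit quantization.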

Abstract

Recently, residual reconstruction-based model quantization methods have achieved promising performance in low-bit post-training quantization (PTQ) by introducing cross-layer residuals to reduce error accumulated from previous layers. However, these residuals may also introduce additional bias arising from the Hessian-approximation (HA) assumption underlying reconstruction-based PTQ, leading to suboptimal quantization performance. In this work, our analysis shows that multiplying the residual term by a scaling coefficient provides a direct way to mitigate the HA bias associated with residual strength, while preserving accumulated-error correction. More importantly, we observe that this trade-off is module-dependent, making a single global residual strength insufficient to balance effective correction and residual-related bias across modules. Based on these observations, we propose Module-Adaptive Residual Reconstruction (MARR), which assigns a module-specific scaling coefficient to adaptively balance accumulated-error correction and residual-related HA bias for each module. To avoid expensive per-module coefficient search and obtain a stable coefficient estimate, we design a Proportional-Integral-Derivative (PID)-based adaptive update strategy that uses reconstruction error as feedback to progressively refine this coefficient. Experiments on several typical large language models (LLMs) and vision transformers (ViTs) demonstrate the effectiveness of MARR under low-bit quantization (less than or equal to 4-bit), achieving up to 20.2% performance gains on LLMs and up to 4.6% relative gains on ViTs over state-of-the-art residual reconstruction methods. Code will be made publicly available upon acceptance.

2605.17990 2026-05-19 cs.CV cs.HC

Low Latency Gaze Tracking via Latent Optical Sensing

Yidan Zheng, Matheus Souza, Kaizhang Kang, Qiang Fu, Hadi Amata, Wolfgang Heidrich

Affiliations: King Abdullah University of Science and Technology

AI summary: This paper presents a real-time gaze tracking system that acquires task-relevant latent features directly through a fully passive optical encoder. A microlens array with a co-designed binary chromium mask performs spatially multiplexed optical encoding, producing a compact set of measurements sufficient to estimate gaze direction, which reduces computational overhead and latency.

Abstract

We present a real-time gaze tracking system that directly acquires task-relevant latent features using a fully passive optical encoder. Instead of forming and processing full-resolution images, our approach leverages a microlens array with a co-designed binary chromium mask to perform spatially multiplexed optical encoding, producing a compact set of measurements sufficient for gaze estimation. By integrating sensing and feature extraction in the optical domain, the proposed system eliminates the need for high-bandwidth image readout and substantially reduces computational overhead. The encoded measurements are captured by a 4 x 4 phototransistor array and mapped to gaze direction using a lightweight neural network. Our proof-of-concept prototype enables an end-to-end sensing-to-inference latency of 3.4 ms, outperforming published research systems. We demonstrate the effectiveness of our approach on both simulated and real-world data, achieving competitive gaze estimation accuracy while significantly improving latency and energy efficiency compared to conventional camera-based pipelines. This work highlights the potential of task-driven optical sensing for ultra-low-latency, computationally efficient human-computer interaction systems.

2605.17989 2026-05-19 cs.CL cs.AI

Predictive Prefetching for Retrieval-Augmented Generation

Wuyang Zhang, Shichao Pei

Affiliations: Department of Computer Science, University of Massachusetts Boston

AI summary: This paper proposes an asynchronous retrieval framework that predicts when retrieval should be triggered and what information is needed, reducing latency and improving generation efficiency while maintaining answer quality.

Comments: Accepted by the Forty-third International Conference on Machine Learning (ICML 2026)

Abstract

Retrieval-Augmented Generation (RAG) improves factual grounding in large language models but suffers from substantial latency due to synchronous retrieval. While recent work explores asynchronous retrieval, existing approaches rely on heuristic coordination between retrieval and generation and assume stable information demands during decoding that often break in complex, multi-domain settings. In this paper, we propose an advanced asynchronous retrieval framework that enables predictive prefetching aligned with evolving information needs. The framework explicitly predicts when retrieval should be triggered and what information should be retrieved using three components, a retrieval predictor, a context monitor, and a query generator, by exploiting semantic precursors in generation dynamics that emerge several tokens before uncertainty becomes critical. Experiments on multiple benchmarks demonstrate up to 43.5% end-to-end latency reduction and 62.4% improvement in time-to-first-token, while maintaining answer quality comparable to synchronous RAG baselines.
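
To make the latency argument concrete, here is a toy sketch of overlapping retrieval with decoding. It is not the authors' framework: `retrieve`, `should_prefetch`, the trigger word "according", and the `<need_docs>` marker are hypothetical stand-ins for the retrieval predictor, context monitor, and query generator named in the abstract.

```python
import asyncio
from typing import List, Optional


async def retrieve(query: str, delay: float = 0.05) -> str:
    """Stand-in for a retrieval call; the delay simulates index/network latency."""
    await asyncio.sleep(delay)
    return f"docs for '{query}'"


def should_prefetch(tokens: List[str]) -> bool:
    """Hypothetical retrieval predictor: fires on a toy 'semantic precursor'."""
    return bool(tokens) and tokens[-1] == "according"


async def generate(stream: List[str]) -> List[str]:
    """Toy decoding loop that starts retrieval before the documents are needed."""
    output: List[str] = []
    pending: Optional[asyncio.Task] = None
    for token in stream:
        await asyncio.sleep(0.01)  # simulated per-token decoding time
        output.append(token)
        if pending is None and should_prefetch(output):
            # Assumed query generator: use the most recent context as the query.
            query = " ".join(output[-3:])
            pending = asyncio.create_task(retrieve(query))
        elif token == "<need_docs>" and pending is not None:
            # Documents are consumed here; part of the wait has already elapsed.
            output[-1] = await pending
            pending = None
    return output
```

Because the task is created a few tokens before `<need_docs>`, the retrieval delay overlaps with decoding instead of blocking it. A real system would also need a synchronous fallback for the case where no prefetch was triggered, which this sketch omits.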

2605.17985 2026-05-19 cs.LG cs.AI

SAFE-SVD: Sensitivity-Aware Fidelity-Enforcing SVD for Physics Foundation Models

Chengjie Hong, Feixiang He, Yiheng Zeng, Lulu Kang, He Wang

Affiliations: AI Centre, University College London; University College London; Central South University; University of Massachusetts at Amherst

AI summary: This paper proposes a new method for compressing physics foundation models that explicitly models loss-aware layer sensitivity during compression to preserve accuracy and physical fidelity. Experiments report substantial compression gains across multiple models and datasets.

Abstract

We propose a new method for compressing physics foundation models (PFMs), a new trend in AI for Science. While model compression is essential for reducing memory use and accelerating inference in large foundation models, it remains under-explored for PFMs, where preserving physical fidelity is crucial. The challenge lies in the functional nature of physics data, where partial derivatives encode spatiotemporal dynamics and exhibit high sensitivity to compression. Conventional compression methods ignore this structure, often causing severe performance degradation or failure. To address this, we introduce a sensitivity-aware fidelity-enforcing compression framework that explicitly models loss-aware layer sensitivity in the output function space during compression. This provides a new route to compressing scientific foundation models while preserving accuracy and physical fidelity. Experiments show substantial gains over existing methods across multiple models and datasets, achieving significantly higher compression ratios while maintaining accuracy, in some cases by orders of magnitude. More broadly, the work potentially leads to a new subfield of efficient, deployable, and sustainable scientific foundation models in AI for Science.

2605.17980 2026-05-19 cs.CV

Learning to Balance: Decoupled Siamese Diffusion Transformer for Reference-Based Remote Sensing Image Super-Resolution

Bin Luo, Runmin Dong, Zhaoyang Luo, Jinxiao Zhang, Jiyao Zhao, Fan Wei, Haohuan Fu

Affiliations: Tsinghua Shenzhen International Graduate School, Shenzhen, China; Sun Yat-sen University, Zhuhai, China; National Supercomputing Center in Shenzhen, Shenzhen, China; Tsinghua University, Beijing, China

AI summary: This paper proposes DS-DiT, a Decoupled Siamese Diffusion Transformer that decouples low-resolution and reference interactions at the attention level. It addresses both over-reliance on and underutilization of reference information in reference-based super-resolution and improves generation quality.

Abstract

Diffusion-based methods demonstrate significant potential for remote sensing image super-resolution at large scaling factors, particularly in reference-based super-resolution (RefSR) where high-resolution reference images provide critical fine-grained texture priors. However, existing methods often suffer from a trade-off between over-reliance on reference information, which leads to texture artifacts, and underutilization, which results in insufficient detail recovery. To address these issues, we propose DS-DiT, a Decoupled Siamese Diffusion Transformer method that decouples low-resolution and reference interactions at the attention level. By enabling low-resolution structural priors and reference texture information to interact independently with the noisy latent, the framework effectively mitigates inter-source competition. Furthermore, to compensate for the limited local modeling ability of global attention, we introduce a Patch-Level Weights (PLW) module that adaptively modulates the fusion of conditional sources. In addition, this siamese architecture facilitates an autoguidance strategy during inference, which enhances reconstruction by exploiting the prediction discrepancy between strong and weak reference conditions. This approach boosts generation quality without additional training. Experimental results across multiple datasets and scaling factors demonstrate that DS-DiT outperforms existing methods in both quantitative metrics and visual fidelity.
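
The autoguidance step is described only as exploiting the discrepancy between strong-reference and weak-reference predictions. One common way to use such a discrepancy is linear extrapolation, sketched below. The function name `autoguide`, the `scale` parameter, and the extrapolation rule itself are assumptions, since the abstract does not state the exact formula.

```python
from typing import List


def autoguide(pred_strong: List[float], pred_weak: List[float], scale: float = 1.5) -> List[float]:
    """Extrapolate from the weak-reference prediction toward the strong one.

    scale = 1.0 returns the strong prediction unchanged; scale > 1.0 pushes
    further along the direction in which the two predictions disagree.
    """
    return [w + scale * (s - w) for s, w in zip(pred_strong, pred_weak)]
```

Since both predictions would come from the same siamese network under different conditioning, a rule of this kind needs no extra training, which is consistent with the abstract's claim.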

2605.17978 2026-05-19 cs.CL

AutoVecCoder: Teaching LLMs to Generate Explicitly Vectorized Code

Shangzhan Li, Xinyu Yin, Xuanyu Jin, Ye He, Yuxin Zhou, Yuxuan Li, Xu Han, Wanxiang Che, Qi Shi, Ting Liu, Maosong Sun

Affiliations: Harbin Institute of Technology; Xiamen University; Tsinghua University

AI summary: This paper proposes the AutoVecCoder framework, whose VecPrompt and VecRL components enable large language models to perform automated explicit vectorization. The resulting model achieves state-of-the-art performance on the SSE and AVX subsets of SimdBench and in some cases outperforms traditional auto-vectorization.

Abstract

Vectorization via Single Instruction, Multiple Data (SIMD) architectures is a cornerstone of high-performance computing. To fully exploit hardware potential, developers often resort to explicit vectorization using intrinsics, as compiler-based auto-vectorization frequently yields suboptimal results due to conservative static analysis. While Large Language Models (LLMs) have demonstrated remarkable proficiency in general code generation, they struggle with explicit vectorization due to the scarcity of high-quality corpora and the strict semantic constraints of low-level hardware instructions. In this paper, we propose AutoVecCoder, a novel framework designed to empower LLMs with the capability of automated explicit vectorization. AutoVecCoder integrates two core components: VecPrompt, an automated data synthesis pipeline to inject domain-specific intrinsic knowledge; and VecRL, a reinforcement learning framework that aligns code generation with execution efficiency. AutoVecCoder-8B trained by this framework achieves state-of-the-art performance on the SSE and AVX subsets of SimdBench and, in some cases, generates implementations surpassing standard -O3 optimizations, effectively overcoming the inherent bottlenecks of traditional automated vectorization.

2605.17976 2026-05-19 cs.AI math.OC

Unleashing LLMs in Bayesian Optimization: Preference-Guided Framework for Scientific Discovery

Xinzhe Yuan, Zhuo Chen, Jianshu Zhang, Huan Xiong, Nanyang Ye, Yuqiang Li, Qinying Gu

Affiliations: Shanghai Artificial Intelligence Laboratory; Harbin Institute of Technology; Shanghai Jiao Tong University

AI summary: This paper proposes LGBO, an LLM-guided Bayesian optimization framework that continuously integrates the semantic reasoning of large language models into the optimization loop, improving optimization efficiency and convergence speed in scientific discovery.

Comments: Published as a conference paper at ICLR 2026. 10 pages main paper, 21 pages appendix, 26 figures

Journal ref: International Conference on Learning Representations (ICLR), 2026

Abstract

Scientific discovery is increasingly constrained by costly experiments and limited resources, underscoring the need for efficient optimization in AI for science. Bayesian Optimization (BO), though widely adopted for balancing exploration and exploitation, often exhibits slow cold-start performance and poor scalability in high-dimensional settings, limiting its applicability in real-world scientific problems. To overcome these challenges, we propose LLM-Guided Bayesian Optimization (LGBO), the first LLM preference-guided BO framework that continuously integrates the semantic reasoning of large language models (LLMs) into the optimization loop. Unlike prior works that use LLMs only for warm-start initialization or candidate generation, LGBO introduces a region-lifted preference mechanism that embeds LLM-driven preferences into every iteration, shifting the surrogate mean in a stable and controllable way. Theoretically, we prove that LGBO does not perform significantly worse than standard BO in the worst case, while achieving significantly faster convergence when preferences align with the objective. Empirically, LGBO consistently outperforms existing methods across diverse dry benchmarks in physics, chemistry, biology, and materials science. Most notably, in a new wet-lab optimization of Fe-Cr battery electrolytes, LGBO attains 90% of the best observed value within 6 iterations, whereas standard BO and existing LLM-augmented baselines require more than 10. Together, these results suggest that LGBO offers a promising direction for integrating LLMs into scientific optimization workflows.

2605.17969 2026-05-19 cs.CV

Generation Navigator: A State-Aware Agentic Framework for Image Generation

Jinming Liu, Ruoyu Feng, Yuqi Wang, Wenjun Zeng, Xin Jin

Affiliations: Shanghai Jiao Tong University; Eastern Institute of Technology; Independent

AI summary: This paper proposes Generation Navigator, a state-aware agentic framework that reformulates image generation as a state-conditioned action-making problem. Its PRE-GRPO algorithm addresses the credit assignment problem that arises when training the agent with reinforcement learning, improving generation quality and reasoning accuracy.

Abstract

Despite rapid advances in text-to-image generation, faithfully realizing user intent remains challenging, often requiring manual multi-turn trial and error. To automate this process, existing systems rely on either simple prompt rewriting or closed-loop agents driven by hand-crafted rules, rather than learning to adapt actions to the evolving generation process. In this paper, we reformulate image generation as a state-conditioned action-making problem and propose Generation Navigator, a multi-turn T2I agent that learns to dynamically steer the generation trajectory and output the next action. However, training this agent via reinforcement learning introduces a critical credit assignment challenge: naively rewarding a trajectory based solely on a single state assigns equal credit to all actions in the rollout, ignores the quality dynamics across turns, and fails to distinguish actions that improve the trajectory from those that degrade it or waste turns without progress. We resolve this with PRE-GRPO (Peak-Retention-Efficiency Group Relative Policy Optimization), a trajectory-level reinforcement learning objective that explicitly rewards discovering a high-quality image (Peak), avoiding subsequent quality degradation across turns (Retention), and minimizing unnecessary turns (Efficiency). Experiments show substantial improvements across benchmarks, reaching a WISE score of 0.90 and 79.06% reasoning accuracy on T2I-ReasonBench.
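
The three PRE terms can be illustrated with a toy trajectory-level reward followed by the group-relative normalization that gives GRPO its name. This is a sketch under stated assumptions: the function names, the weights, and the specific forms of the retention and efficiency penalties are hypothetical, as the abstract only names the three goals.

```python
from statistics import mean, pstdev
from typing import List


def pre_reward(qualities: List[float], w_peak: float = 1.0,
               w_ret: float = 0.5, w_eff: float = 0.05) -> float:
    """Score one rollout from the per-turn image quality scores."""
    peak = max(qualities)                      # Peak: best image found
    retention_penalty = peak - qualities[-1]   # Retention: quality lost by the end
    turns = len(qualities)                     # Efficiency: number of turns used
    return w_peak * peak - w_ret * retention_penalty - w_eff * turns


def group_relative_advantages(rewards: List[float]) -> List[float]:
    """Normalize rewards within a group of rollouts for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]
```

Under these assumed weights, a rollout that reaches quality 0.9 and stops scores higher than one that reaches 0.9 and then degrades to 0.7 in an extra turn, which is the distinction a single-state reward cannot make.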

2605.17968 2026-05-19 cs.LG

Function graph transformers universally approximate operators between function spaces

Takashi Furuya, David Mis, Ivan Dokmanić, Maarten V. de Hoop, Matti Lassas

Affiliations: Doshisha University; RIKEN AIP; Rice University; University of Basel; Simons Chair in Computational and Applied Mathematics and Earth Science; University of Helsinki

AI summary: This paper studies the approximation of nonlinear operators between function spaces by transformers. It introduces function graph transformers, built on graph measures, whose outputs remain single-valued functions, and proves their universality for approximating broad classes of nonlinear operators.

Abstract

We study the approximation of nonlinear operators between function spaces by transformers. Our approach is to lift functions to measures supported on their graphs and leverage a recently introduced measure-theoretic view of transformers. A function $h$ is represented by its graph measure $\gamma_h$, with finite tokens $\{(x_j,h(x_j))\}_{j=1}^N$ being its empirical approximations. We show that this framework elegantly models discretization refinement via convergence of measures and provides a natural setting for operator learning. Within this framework, we introduce function graph transformers, a graph-preserving subclass of measure-theoretic transformers that maps graph measures to graph measures, which is to say that outputs remain single-valued functions. Crucially, this additional structure does not reduce generality: we prove that the resulting graph-preserving maps can be approximated by finite compositions of standard softmax self-attention layers and pointwise MLPs, yielding universal approximation results for broad classes of nonlinear operators. Unlike existing theoretical approaches to operator learning with transformers, the measure-theoretic framework also accommodates regularized negative-order Sobolev inputs for which discretization invariance is particularly challenging, as well as query points on different output domains. Overall, function graph transformers provide a continuum viewpoint and mathematical toolkit for transformer-based operator learning, clarifying the roles of positional encodings, graph structure, regularization, and ensuring consistency across discretizations.
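
The statement that finite tokens $\{(x_j,h(x_j))\}_{j=1}^N$ are empirical approximations of the graph measure can be checked numerically. The sketch below assumes a uniform base measure on [0, 1] and midpoint sampling, neither of which is stated in the abstract, and the helper names are hypothetical.

```python
from typing import Callable, List, Tuple

Token = Tuple[float, float]


def graph_tokens(h: Callable[[float], float], n: int) -> List[Token]:
    """Sample n tokens (x_j, h(x_j)) at midpoints of a uniform grid on [0, 1]."""
    return [((j + 0.5) / n, h((j + 0.5) / n)) for j in range(n)]


def empirical_integral(tokens: List[Token], f: Callable[[float, float], float]) -> float:
    """Integrate f against the empirical measure (equal weight 1/N per token)."""
    return sum(f(x, y) for x, y in tokens) / len(tokens)


def square(x: float) -> float:
    return x * x


# Integrating f(x, y) = y against the graph measure of h(x) = x^2 gives 1/3.
coarse = empirical_integral(graph_tokens(square, 4), lambda x, y: y)
fine = empirical_integral(graph_tokens(square, 400), lambda x, y: y)
```

Refining the discretization from 4 to 400 tokens moves the empirical value closer to 1/3, a small instance of the convergence of measures that the abstract uses to model discretization refinement.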

2605.17967 2026-05-19 cs.AI

Reconciling Contradictory Views on the Effectiveness of SFT in LLMs: An Interaction Perspective

Junpeng Zhang, Lei Cheng, Guoxi Zhang, Hua Cai, Qing Xu, Quanshi Zhang

Affiliations: Shanghai Jiao Tong University; Beijing Institute for General Artificial Intelligence; UniDT

AI summary: This paper examines the inconsistent effectiveness of SFT in LLMs from an interaction perspective. It finds that SFT mainly removes noise-like interactions but rarely acquires reliable new ones, and that this denoising stage is brief, after which continued fine-tuning tends to introduce overfitted interactions.

Abstract

This paper explores a scientific question in supervised fine-tuning (SFT): why SFT is broadly effective for small-scale deep neural networks, yet can produce inconsistent or even detrimental effects when applied to large language models (LLMs). Recent advances in interaction-based explanations suggest that interactions between words/tokens provide a faithful metric for quantifying the inference patterns encoded by LLMs. We find that the evolution of interactions during SFT can effectively explain the inconsistent effectiveness of SFT for LLMs. Specifically, we find that (1) SFT primarily removes noise-like interactions while rarely acquiring reliable new interactions, and (2) this denoising stage is extremely brief, after which continued fine-tuning tends to introduce overfitted interactions. We validate these findings across multiple LLMs and datasets. Our findings provide new insights into early stopping and offer practical guidance for LLM training.

2605.17958 2026-05-19 cs.LG cs.PL

Enhancing the Code Reasoning Capabilities of LLMs via Consistency-based Reinforcement Learning

Zhanyue Qin, Jia Feng, Yibo Lyu, Yun Peng, Dianbo Sui, Cuiyun Gao, Qing Liao

Affiliations: Harbin Institute of Technology; The Chinese University of Hong Kong

AI summary: This paper proposes CodeThinker, a consistency-driven reinforcement learning framework that improves the code reasoning capability of large language models. Experiments show state-of-the-art results on multiple benchmarks and accuracy gains on downstream mathematical reasoning and code reasoning tasks.

Comments: Under review

Abstract

Code reasoning refers to the task of predicting the output of a program given its source code and specific inputs. It can measure the reasoning capability of large language models (LLMs) and also benefit downstream tasks such as code generation and mathematical reasoning. Existing work has verified the effectiveness of reinforcement learning on the task. However, these methods design rewards solely based on final outputs or coarse-grained signals, and neglect the inherent consistency of the stepwise reasoning process in the task. As a result, they often suffer from sparse rewards or reward hacking, which prevents reinforcement learning from reaching its full potential. To alleviate these issues, we propose CodeThinker, a consistency-driven reinforcement learning framework for code reasoning. Specifically, CodeThinker has three key components: (1) a stepwise reasoning-aware model training module, which utilizes a consistency tracing paradigm as a template to synthesize training data that captures the stepwise reasoning process; (2) a dynamic beam sampling strategy, which aims to improve the quality of sampled outputs under a fixed sampling budget; and (3) a consistency reward mechanism that can effectively alleviate reward hacking. Experiments on three popular benchmarks show that CodeThinker achieves state-of-the-art performance across multiple LLMs. For instance, it outperforms the strongest baseline by 4.3% in accuracy when deployed on Qwen2.5-Coder-7B-Instruct. We also validate the effectiveness of CodeThinker on downstream tasks. Results show that, without additional training, CodeThinker obtains average accuracy gains of 5.33 and 3.11 percentage points on mathematical reasoning and code reasoning tasks covering 17 programming languages, respectively.
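
A reward that checks intermediate steps as well as the final output might look like the sketch below. It is illustrative only: `execution_trace`, `consistency_reward`, and the equal 0.5 weights are hypothetical, and the abstract does not describe how CodeThinker actually scores consistency.

```python
from typing import Any, Dict, List


def execution_trace(statements: List[str], watch: str) -> List[Any]:
    """Run statements one by one and record the watched variable after each."""
    env: Dict[str, Any] = {}
    trace = []
    for stmt in statements:
        exec(stmt, {}, env)  # toy example: only run trusted code this way
        trace.append(env.get(watch))
    return trace


def consistency_reward(predicted: List[Any], actual: List[Any]) -> float:
    """Combine step-level agreement with final-answer correctness."""
    if not actual:
        return 0.0
    matches = sum(p == a for p, a in zip(predicted, actual))
    step_score = matches / len(actual)
    final_ok = len(predicted) == len(actual) and predicted[-1] == actual[-1]
    return 0.5 * step_score + 0.5 * float(final_ok)
```

With a reward of this shape, a model that guesses the right final value through wrong intermediate states earns less than one whose whole trace matches execution, which is one way to discourage reward hacking on the final answer alone.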

2605.17954 2026-05-19 cs.CV cs.AI cs.LG

A More Word-like Image Tokenization for MLLMs

Hyun Lee, Hyemin Jeong, Yejin Kim, Hyungwook Choi, Hyunsoo Cho, Soo Kyung Kim, Joonseok Lee

Affiliations: Seoul National University; Ewha Womans University

AI summary: This paper proposes Disentangled Visual Tokenization (DiVT), which clusters patch embeddings into semantic units so that each token corresponds to a distinct visual concept, improving the performance and efficiency of multimodal large language models.

Journal ref: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026

Abstract

Modern multimodal large language models (MLLMs) typically keep the language model fixed and train a visual projector that maps pixels into a sequence of tokens in its embedding space, so that images can be presented in essentially the same form as text. However, the language model has been optimized to operate on discrete, semantically meaningful tokens, while prevailing visual projectors transform an image into a long stream of continuous and highly correlated embeddings. This causes the visual tokens to behave differently from the word-like units that LLMs are originally trained to understand. We propose a novel Disentangled Visual Tokenization (DiVT) that clusters patch embeddings into coherent semantic units, so each token corresponds to a distinct visual concept instead of a rigid grid cell. DiVT further adapts its token budget to image complexity, providing an explicit accuracy-compute trade-off while modifying neither the vision encoder nor the language model. Across diverse multimodal benchmarks, DiVT matches or surpasses baselines with significantly fewer visual tokens, demonstrating robustness under limited token budgets and significantly reducing memory cost and latency while making visual inputs more compatible with LLMs. Our code is available at https://github.com/snuviplab/DiVT.
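
The idea of merging patch embeddings into a variable number of tokens can be sketched with a simple greedy clustering. This is a stand-in technique chosen for brevity, not DiVT's algorithm, which the abstract does not specify; the function names and the cosine threshold of 0.9 are assumptions.

```python
import math
from typing import List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def centroid(vectors: List[Vector]) -> Vector:
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def cluster_patches(patches: List[Vector], threshold: float = 0.9) -> List[Vector]:
    """Greedily merge similar patch embeddings; return one token per cluster."""
    clusters: List[List[Vector]] = []
    for patch in patches:
        for members in clusters:
            if cosine(patch, centroid(members)) >= threshold:
                members.append(patch)
                break
        else:
            # No existing cluster is similar enough, so start a new one.
            clusters.append([patch])
    return [centroid(members) for members in clusters]
```

A visually uniform image collapses to a single token, while an image with two distinct regions yields two, so the token count tracks image complexity rather than grid size.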

2605.17949 2026-05-19 cs.CV

SkyNative: A Native Multimodal Framework for Remote Sensing Visual Evidence Reasoning

Xiao Yang, Ronghao Fu, Zhiwen Lin, Zhuoran Duan, Jiashun Zhu, Jiasen Hu, Lang Sun, Weipeng Zhang, Jiaqi Liu, Xu Na, Haoran Liu, Weijie Zhang, Bo Yang

Affiliations: College of Computer Science and Technology, Jilin University, China; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education

AI summary: This paper proposes SkyNative, a native multimodal framework that removes the pretrained visual backbone and represents images directly as raw patch tokens in the language-model token space, improving fine-grained spatial reasoning on remote sensing imagery.

Abstract

Remote sensing vision-language models commonly rely on pretrained visual encoders to convert images into semantic features before language-model reasoning. While effective for scene-level understanding, this pipeline may prematurely compress local visual evidence, making fine-grained spatial reasoning vulnerable to language priors, especially in ultra-high-resolution remote sensing imagery. We present SkyNative, a native multimodal framework for remote sensing that adopts an encoder-free architecture, removing the pretrained visual backbone to directly represent images as raw patch tokens in the language-model token space. To reconcile low-level visual patches with textual tokens, SkyNative introduces a modality-aware decoupling mechanism that uses modality-specific parameters within a unified autoregressive backbone. We further introduce a visual reliance benchmark that diagnoses whether models ground their answers in image evidence through progressive visual degradation and misleading textual prompts. Across standard remote sensing understanding tasks and large-format spatial reasoning evaluations, SkyNative shows stronger image-grounded perception and improved robustness against prompt-induced language priors. These results suggest that native patch-level multimodal modeling is a promising direction for reliable remote sensing vision-language reasoning.

2605.17938 2026-05-19 cs.LG cs.AI stat.ML

Training data attribution in diffusion models via mirrored unlearning and noise-consistent skew

Joan Serrà, Dipam Goswami, Fabio Morreale, Wei-Hsiang Liao, Yuki Mitsufuji

Affiliations: Sony AI

AI summary: This paper proposes MUCS, a method based on mirrored unlearning and noise-consistent skew that makes training data attribution for diffusion models more reliable and robust. It outperforms existing methods by a large margin on three datasets, and the authors also analyze the overlap of influential instances across generated items and suggest broader relevance to tasks that compare diffusion losses.

Comments: 21 pages, 5 figures, 9 tables (includes appendix)

Abstract

Training data attribution (TDA) should enable generative model interpretability and foster a variety of related downstream tasks. Nonetheless, current TDA approaches lack reliability and robustness, preventing their adoption in real-world setups. In this paper, we take a decisive step towards more reliable and robust TDA for diffusion models. We propose to perform TDA with mirrored unlearning and noise-consistent skew (MUCS). The idea is to fine-tune a second model with bounded mirrored gradient ascent, and to measure the normalized skew of this model with respect to the original one using consistent noise samples. We show that, while being conceptually simple and generic, MUCS systematically outperforms existing methods on three different datasets by a large margin. We additionally study the effect that core design choices have on final performance, and analyze novel aspects regarding the overlap of influential instances across generated items and the potential of ensembling TDA approaches. We believe that our findings may have broader implications for more general unlearning setups, as well as for tasks requiring the comparison of diffusion losses.

2605.17933 2026-05-19 cs.CV

AtlasVA: Self-Evolving Visual Skill Memory for Teacher-Free VLM Agents

Pan Wang, Yihao Hu, Xiujin Liu, Jingchu Yang, Hang Wang, Zhihao Wen

Affiliations: Ant Group; University of Science and Technology of China; Westlake University; University of Michigan - Ann Arbor; Sun Yat-sen University

AI summary: This work proposes AtlasVA, a teacher-free visual skill memory framework that organizes memory into three layers (spatial heatmaps, visual exemplars, and symbolic text skills), unifying perception, memory, and optimization and improving reinforcement learning performance without external LLM supervision.

Abstract

Vision-language model (VLM) agents increasingly rely on memory-augmented reinforcement learning to reuse experience across long-horizon tasks, yet most existing frameworks store memory as text and depend on proprietary teacher models to summarize or refine it. This design is poorly matched to spatial decision making: geometric priors are compressed into lossy language, and sparse interaction is often supervised through delayed textual feedback rather than dense visually grounded signals. We argue that reusable experience for VLM agents should remain visually grounded. Based on this insight, we propose AtlasVA, a teacher-free visual skill memory framework that organizes memory into three complementary layers: spatial heatmaps, visual exemplars, and symbolic text skills. AtlasVA further evolves danger and affinity atlases directly from trajectory statistics and lightweight grid heuristics, and reuses these self-evolving atlases as potential-based shaping rewards for reinforcement learning. This unifies perception, memory, and optimization without external LLM supervision. Experiments on Sokoban, FrozenLake, 3D embodied navigation, and 3D robotic manipulation benchmarks show that AtlasVA consistently outperforms text-centric memory baselines and competitive VLM agents, with especially strong gains on spatially intensive tasks. Homepage: https://wangpan-ustc.github.io/AtlasvaWeb
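
Potential-based shaping adds the term gamma * Phi(s') - Phi(s) to the environment reward. The sketch below shows how danger and affinity atlases could supply the potential Phi. Defining the potential as affinity minus danger, storing the atlases as dictionaries keyed by grid cell, and the function names are all assumptions, as the abstract does not give these details.

```python
from typing import Dict, Tuple

Cell = Tuple[int, int]


def potential(cell: Cell, affinity: Dict[Cell, float], danger: Dict[Cell, float]) -> float:
    """Assumed potential: attractive cells raise it, dangerous cells lower it."""
    return affinity.get(cell, 0.0) - danger.get(cell, 0.0)


def shaped_reward(env_reward: float, state: Cell, next_state: Cell,
                  affinity: Dict[Cell, float], danger: Dict[Cell, float],
                  gamma: float = 0.99) -> float:
    """Environment reward plus the potential-based shaping term."""
    shaping = gamma * potential(next_state, affinity, danger) - potential(state, affinity, danger)
    return env_reward + shaping
```

With gamma set to 1.0, the shaping terms along any path that returns to its starting cell sum to zero, so the agent gains nothing by circling near attractive cells. This is the usual reason for choosing the potential-based form.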

2605.17932 2026-05-19 cs.CL cs.AI

Prompt Compression in Diffusion Large Language Models: Evaluating LLMLingua-2 on LLaDA

在扩散大型语言模型中进行提示压缩:在LLDA上评估LLMLingua-2

Sterling Huang, Abigayle Brown, Jiyoo Noh, Jiakang Xu, Wantong Huo, Kaung Myat Kyaw, Jonathan Chan

Affiliations: University of Toronto; King Mongkut’s University of Technology Thonburi

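AI summary: This paper studies whether prompt compression remains effective for diffusion large language models by evaluating LLMLingua-2 on LLaDA. Compression degrades mathematical reasoning while summarization stays comparatively robust, indicating that prompt compression methods designed for autoregressive models do not transfer uniformly to diffusion models.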

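Details
AI-generated abstract (translated from Chinese)

Prompt compression can reduce the inference cost and context length of large language models, but earlier evaluations concentrated on autoregressive architectures. This study examines whether prompt compression with LLMLingua-2 transfers effectively to diffusion large language models (DLLMs), specifically LLaDA, a DLLM with 8B parameters. Compression is evaluated on the GSM8K, DUC2004, and ShareGPT datasets, using 250 prompts per dataset at a compression ratio of roughly 2x, across mathematical reasoning, prompt reconstruction, and summarization tasks. Outputs generated from original prompts, compressed prompts, reconstructed prompts, and reconstructed-prompt reasoning are compared using exact-match accuracy, BLEU, ROUGE, and BERTScore. The results show that preserving semantics does not necessarily mean stable downstream behavior in diffusion models. Summarization is comparatively robust under compression, whereas mathematical reasoning degrades substantially even when semantic similarity scores are high. The reconstruction experiments further show that semantically similar prompts can still omit the reasoning-critical information needed for stable denoising. Across all tasks, BERTScore recall is consistently lower than precision, suggesting that compression failures are driven mainly by information omission rather than semantic drift. These findings indicate that prompt compression methods designed for autoregressive models do not transfer uniformly to diffusion large language models, and they motivate compression strategies designed with diffusion models in mind.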

English abstract

Prompt compression reduces inference cost and context length in large language models, but prior evaluations focus primarily on autoregressive architectures. This study investigates whether prompt compression with LLMLingua-2 transfers effectively to diffusion large language models (DLLMs), specifically the 8B-parameter DLLM LLaDA. We evaluate compression performance on GSM8K, DUC2004, and ShareGPT using 250 prompts per dataset at an approximate 2$\times$ compression ratio, across mathematical reasoning, prompt reconstruction, and summarization tasks. Outputs generated from original prompts, compressed prompts, reconstructed prompts, and reconstructed-prompt reasoning were compared using exact-match accuracy, BLEU, ROUGE, and BERTScore. Results show that semantic preservation does not necessarily imply stable downstream behavior in diffusion models. Summarization tasks remained comparatively robust under compression, while mathematical reasoning degraded substantially despite high semantic similarity scores. Reconstruction experiments further showed that semantically similar prompts may still omit reasoning-critical information required for stable denoising. Across tasks, BERTScore recall was consistently lower than precision, suggesting that compression failures are primarily driven by information omission rather than semantic drift. These findings indicate that prompt compression methods designed for autoregressive models do not transfer uniformly to diffusion large language models and motivate the development of diffusion-aware compression strategies.
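
The recall-below-precision observation can be made concrete with a much simpler stand-in for BERTScore. The sketch below computes unigram precision and recall between a compressed prompt and its original; it is not BERTScore (which matches contextual embeddings), the paper reports the pattern on its own evaluation rather than on this toy comparison, and the example prompts and function name are assumptions made for illustration.

```python
from collections import Counter


def token_precision_recall(candidate, reference):
    """Unigram precision and recall of candidate against reference."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped token matches
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    return precision, recall


# Hypothetical prompts: compression by deletion drops the number "4".
original = "Tom has 3 apples and buys 4 more apples how many apples now"
compressed = "Tom 3 apples buys more how many"

precision, recall = token_precision_recall(compressed, original)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Because the compressed prompt only deletes tokens, everything it keeps matches the original and precision stays high, while recall drops with each omitted token. Here one of the omitted tokens is an operand of the arithmetic question, which is the kind of omission the abstract describes as reasoning-critical.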

2605.17930 2026-05-19 cs.LG

InfoFlow: A Framework for Multi-Layer Transformer Analysis

InfoFlow: 多层Transformer分析的框架

Penghao Yu, Haotian Jiang, Zeyu Bao, Qianxiao Li

Affiliations: Department of Mathematics; National University of Singapore; Institute for Functional Intelligent Materials

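AI summary: By analyzing the approximation capabilities of multi-layer Transformers, this work shows that they differ fundamentally from single-layer Transformers, and it proposes the InfoFlow framework for analyzing the approximation efficiency of multi-layer Transformers.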

Comments 36 pages

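Details
AI-generated abstract (translated from Chinese)

Although recent work has studied the approximation properties of single-layer Transformer architectures, rigorous theoretical understanding of the multi-layer setting remains limited. This paper proves that multi-layer Transformers have fundamentally different approximation capabilities from single-layer ones: for certain retrieval tasks, any single-layer Transformer needs at least $\Omega(\varepsilon^{-k})$ parameters to reach precision $\varepsilon$, where $k$ grows linearly with the sequence length $T$, whereas a two-layer Transformer with one head per layer reaches the same precision with at most $O(\varepsilon^{-1})$ parameters. To explain this separation, the authors identify two structural mechanisms behind multi-layer approximation. First, softmax attention can efficiently retrieve only the token with the maximum attention score, so retrieving the $k$-th largest token with $k \geq 2$ incurs a parameter cost that is exponential in the sequence length. Second, the parameter cost of decoding coupled information scales with the size of the retrieved token set. Motivated by these findings, the authors propose InfoFlow, a framework for multi-layer Transformers. It tracks, at each token and layer, the set of input positions that are accessible, and assigns an explicit approximation rate to each mode of information propagation. This abstraction recovers known approximation bounds, is consistent with experimental observations on trained networks, and yields concrete predictions in settings where direct theoretical analysis is currently intractable. The results provide a principled framework for analyzing the approximation efficiency of multi-layer Transformers.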

English abstract

While the approximation properties of single-layer Transformer architectures have been studied in recent works, a rigorous theoretical understanding of the multi-layer setting remains limited. In this work, we establish that multi-layer Transformers possess fundamentally different approximation capabilities from single-layer ones: for certain retrieval tasks, any single-layer Transformer requires at least $\Omega(\varepsilon^{-k})$ parameters to achieve precision $\varepsilon$, where $k$ grows linearly with sequence length $T$, whereas a two-layer Transformer with a single head per layer achieves the same approximation precision with at most $O(\varepsilon^{-1})$ parameters. To understand this separation, we identify two structural mechanisms underlying multi-layer approximation. Specifically, softmax attention can only efficiently retrieve the token attaining the maximum attention score, incurring exponential-in-length parameter cost for $k$-th largest retrieval with $k \geq 2$. Moreover, the parameter cost of decoding coupled information scales with the size of the retrieved token set. Motivated by these findings, we propose InfoFlow, a framework for multi-layer Transformers. The framework tracks an information set of accessible input positions at each token and layer, assigning an explicit approximation rate to each mode of information propagation. This abstraction recovers known approximation bounds, remains consistent with experimental observations on trained networks, and yields concrete predictions in settings where direct theoretical analysis is currently intractable. Our results provide a principled framework for reasoning about the approximation efficiency of multi-layer Transformers.
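
The claim that softmax attention efficiently retrieves only the maximum-score token can be illustrated with a small numerical sketch. The code below is not the paper's construction. It only shows, with assumed scores, values, and a hypothetical sharpness parameter `beta`, that scaling the attention scores concentrates the output on the highest-scoring token and never isolates the second-highest one.

```python
import math


def softmax(scores, beta=1.0):
    """Numerically stable softmax of beta * scores."""
    top = max(scores)
    exps = [math.exp(beta * (s - top)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(scores, values, beta=1.0):
    """Attention output: softmax-weighted average of scalar values."""
    weights = softmax(scores, beta)
    return sum(w * v for w, v in zip(weights, values))


# Assumed toy inputs: token 1 has the largest score, token 2 the second largest.
scores = [0.2, 1.0, 0.7]
values = [10.0, 20.0, 30.0]

soft = attend(scores, values, beta=1.0)    # a blend of all three values
sharp = attend(scores, values, beta=50.0)  # close to values[1]
print(soft, sharp)
```

Raising `beta` drives the output toward the value of the top-scoring token (20.0 here). The value attached to the second-largest score (30.0) is suppressed by the same sharpening, which gives some intuition for why $k$-th largest retrieval with $k \geq 2$ is harder for a single softmax layer.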