arXivDaily arXiv Daily Academic Digest, updated Monday through Friday
2509.22244 2026-05-19 cs.CV

FlashEdit: Decoupling Speed, Structure, and Semantics for Precise Image Editing

Junyi Wu, Zhiteng Li, Haotong Qin, Yulun Zhang, Xiaokang Yang

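AI summary This paper proposes FlashEdit, an efficient localized image editing framework that decouples speed, structure, and semantics to achieve precise edits; experiments show a good balance between preservation and efficiency.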

Comments Our code will be made publicly available at https://github.com/JunyiWuCode/FlashEdit

Abstract

Text-guided image editing with diffusion models has achieved remarkable quality but often suffers from prohibitive latency. We introduce FlashEdit, a real-time localized image editing framework for the standard inversion-based editing setting. Its efficiency and precision stem from three key innovations: (1) a Cycle-Consistent One-Step Inversion (COSI) pipeline that encourages manifold-aligned one-step inversion through cycle consistency; (2) a Background Shield (BG-Shield) technique that improves preservation of non-edited regions via structural self-attention intervention; and (3) a Sparsified Spatial Cross-Attention (SSCA) mechanism that promotes precise edits by suppressing semantic leakage. Experiments on PIE-Bench demonstrate a strong preservation-efficiency trade-off, with edits completed in under 0.2 seconds and an over 150× speedup over DDIM-based multi-step editing. Our code will be made publicly available at https://github.com/JunyiWuCode/FlashEdit.

2509.17680 2026-05-19 cs.CL

When TableQA Meets Noise: A Dual Denoising Framework for Complex Questions and Large-scale Tables

Shenghao Ye, Yu Guo, Dong Jin, Yikai Shen, Yunpeng Hou, Shuangwu Chen, Jian Yang, Xiaofeng Jiang

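AI summary This paper proposes EnoTab, a dual denoising framework that improves relevance filtering and table pruning to handle the noise introduced by complex questions and large-scale tables, improving TableQA performance.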

Comments 24 pages, 24 figures, accepted to ACL 2026 Main

Abstract

Table question answering (TableQA) is a fundamental task in natural language processing (NLP). The strong reasoning capabilities of large language models (LLMs) have brought significant advances in this field. However, as real-world applications involve increasingly complex questions and larger tables, substantial noisy data is introduced, which severely degrades reasoning performance. To address this challenge, we focus on improving two core capabilities: Relevance Filtering, which identifies and retains information truly relevant to reasoning, and Table Pruning, which reduces table size while preserving essential content. Based on these principles, we propose EnoTab, a dual denoising framework for complex questions and large-scale tables. Specifically, we first perform Evidence-based Question Denoising by decomposing the question into minimal semantic units and filtering out those irrelevant to answer reasoning based on consistency and usability criteria. Then, we propose Evidence Tree-guided Table Denoising, which constructs an explicit and transparent table pruning path to remove irrelevant data step by step. At each pruning step, we observe the intermediate state of the table and apply a post-order node rollback mechanism to handle abnormal table states, ultimately producing a highly reliable sub-table for final answer reasoning. Finally, extensive experiments show that EnoTab achieves outstanding performance on TableQA tasks with complex questions and large-scale tables, confirming its effectiveness.

2509.14004 2026-05-19 cs.CL

Early Stopping Chain-of-thoughts in Large Language Models

Minjia Mao, Bowen Yin, Yu Zhu, Xiao Fang

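AI summary This paper proposes ES-CoT, an inference-time method that shortens chain-of-thought generation by detecting answer convergence and stopping early, lowering inference cost while maintaining accuracy comparable to standard CoT.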

Abstract

Reasoning large language models (LLMs) have demonstrated superior capacities in solving complicated problems by generating long chain-of-thoughts (CoT), but such a lengthy CoT incurs high inference costs. Previous methods on inference-stage efficient reasoning either require white-box models to monitor the reasoning process or are not reliable through direct prompting. In response, we introduce ES-CoT, an inference-time method that shortens CoT generation by detecting answer convergence and stopping early with almost no performance loss. When observing a linguistic marker (such as "wait") in the reasoning process, we prompt the LLM to output its current final answer, denoted as a step answer. We then track the run length of consecutive identical step answers as a measure of answer convergence. We show both empirically and theoretically that step answers steadily converge to the final answer, and large run-length jumps reliably mark this convergence. Experiments on six reasoning datasets across three LLMs show that ES-CoT reduces the number of inference tokens by 16.08% on average while maintaining accuracy comparable to standard CoT.
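
The run-length idea in this abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `early_stop_index` and the fixed threshold `min_run` are assumptions, and the abstract's actual stopping rule (based on large run-length jumps) is not reproduced here.

```python
from typing import Iterable, Optional, Tuple


def early_stop_index(
    step_answers: Iterable[str], min_run: int = 3
) -> Tuple[Optional[int], Optional[str]]:
    """Return (index, answer) where a run of identical step answers first reaches min_run.

    `min_run` is a hypothetical fixed threshold used only for illustration.
    """
    run_length = 0
    previous = None
    for index, answer in enumerate(step_answers):
        # Extend the run if the answer repeats, otherwise start a new run.
        run_length = run_length + 1 if answer == previous else 1
        previous = answer
        if run_length >= min_run:
            return index, answer
    # No run was long enough: the caller would let the full CoT finish.
    return None, None


# Step answers collected each time a marker such as "wait" is observed.
answers = ["12", "15", "15", "14", "14", "14", "14"]
print(early_stop_index(answers))
```

In this toy trace the third consecutive "14" triggers the stop, so the remaining reasoning tokens would not need to be generated.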

2509.06984 2026-05-19 cs.LG cs.AI

FediLoRA: Practical Federated Fine-Tuning of Foundation Models Under Missing-Modality Constraints

Lishan Yang, Wei Emma Zhang, Nam Kha Nguygen, Po Hu, Yanjun Shu, Weitong Chen, Mong Yuan Sim

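AI summary This paper proposes FediLoRA, a lightweight federated LoRA aggregation framework that addresses missing modalities in heterogeneous federated learning. It combines simple averaging with structured editing to improve both global and personalized models, and achieves strong performance on multiple general-domain and medical-domain benchmark datasets.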

Comments 8 pages, 7 figures

Abstract

Federated Learning with LoRA fine-tuning offers an efficient and privacy-aware solution for institutions to collaboratively leverage their large datasets to train VLLMs. However, participating institutions often possess heterogeneous computational resources, resulting in imbalanced LoRA ranks, which pose a major challenge for effective collaboration. In addition, real-world applications in domains such as healthcare and transportation frequently suffer from missing modalities due to user mistakes or device failures, which significantly degrade global model performance in federated settings. To the best of our knowledge, no prior work has addressed these two challenges simultaneously in federated VLLMs. To tackle these issues, we propose FediLoRA, a lightweight federated LoRA aggregation framework that effectively mitigates the impact of missing modalities in heterogeneous environments. FediLoRA is explicitly motivated by the observation that simple averaging and structured editing can jointly benefit both global and personalized models. Our approach achieves strong performance across multiple general-domain and medical-domain benchmark datasets. Additional experiments on healthcare data further demonstrate that FediLoRA is well-suited for practical, real-world deployment scenarios. Our code is released at https://github.com/gotobcn8/FediLoRA.

2508.17431 2026-05-19 cs.CV cs.AI cs.LG

FedKLPR: KL-Guided Pruning-Aware Federated Learning for Person Re-Identification

Po-Hsien Yu, Yu-Syuan Tseng, Shao-Yi Chien

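AI summary This paper proposes FedKLPR, a framework that uses KL-divergence-guided training, unstructured pruning, and cross-round recovery to address statistical heterogeneity and communication overhead in federated person re-identification; experiments show it outperforms existing methods in both communication cost and accuracy.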

Comments 10 pages, 3 figures, 5 tables, submitted to IEEE Transactions on Multimedia

Abstract

Person re-identification (re-ID) is a fundamental task in intelligent surveillance and public safety. Federated learning (FL) provides a privacy-preserving paradigm for collaborative model training without centralized data collection. However, deploying FL in real-world re-ID systems remains challenging due to statistical heterogeneity caused by non-IID client data and the substantial communication overhead incurred by frequent transmission of large-scale models. To address these challenges, we propose FedKLPR, a lightweight and communication-efficient federated learning framework for person re-ID. FedKLPR consists of three key components. First, KL-Divergence-Guided training, including the KL-Divergence Regularization Loss (KLL) and KL-Divergence-aggregation Weight (KLAW), is introduced to mitigate statistical heterogeneity and improve convergence stability under non-IID settings. Second, unstructured pruning is incorporated to reduce communication overhead, and the Pruning-ratio-aggregation Weight (PRAW) is proposed to measure the relative importance of client parameters after pruning. Together with KLAW, PRAW forms KL-Divergence-Prune Weighted Aggregation (KLPWA), enabling effective aggregation of pruned local models under heterogeneous data distributions. Third, Cross-Round Recovery (CRR) adaptively controls pruning across communication rounds to prevent excessive compression and preserve model accuracy. Experiments on eight benchmark datasets demonstrate that FedKLPR achieves substantial communication savings while maintaining competitive accuracy. Compared with state-of-the-art methods, FedKLPR reduces communication cost by 40% to 42% on ResNet-50 while achieving better overall performance.

2508.16663 2026-05-19 cs.CV cs.AI cs.LG

The Loupe: A Plug-and-Play Attention Module for Amplifying Discriminative Features in Vision Transformers

Naren Sengodan

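AI summary This paper proposes The Loupe, a lightweight plug-and-play spatial gating module inserted at an intermediate feature stage of a Vision Transformer. A small CNN predicts a single-channel spatial mask that reweights feature activations during end-to-end training with a cross-entropy objective and an l1 sparsity term, improving fine-grained visual classification.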

Abstract

Fine-Grained Visual Classification (FGVC) requires models to focus on subtle, task-relevant regions rather than broad object context. We present The Loupe, a lightweight plug-and-play spatial gating module for hierarchical Vision Transformers. The module is inserted at an intermediate feature stage, predicts a single-channel spatial mask with a small CNN, and uses that mask to reweight feature activations during end-to-end training with a cross-entropy objective and an l1 sparsity term. On CUB-200-2011, The Loupe improves Swin-Base from 88.36% to 91.72% and Swin-Tiny from 85.14% to 88.61%, with under 0.1% additional parameters. Ablations show that the improvement depends on the insertion point and the sparsity regularizer, suggesting that controlled spatial gating is more effective than naive multi-scale masking in this setting. Qualitative results indicate that the learned masks often align with discriminative bird parts, although the module is not a substitute for part-level supervision and can fail under occlusion or fine-grained intra-part differences.
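
The gating step described here (one spatial mask reweighting every channel, plus an l1 penalty on the mask) can be illustrated with plain Python lists. This is a sketch under stated assumptions: the mask logits are supplied directly instead of being predicted by a small CNN, the sigmoid squashing is an assumption, and `sparsity_weight` is a hypothetical coefficient rather than a value from the paper.

```python
import math
from typing import List, Tuple

Grid = List[List[float]]


def sigmoid(value: float) -> float:
    return 1.0 / (1.0 + math.exp(-value))


def apply_spatial_gate(features: List[Grid], mask_logits: Grid) -> Tuple[List[Grid], Grid]:
    """Reweight every channel of a C x H x W feature map by one H x W mask."""
    # Assumption: logits are squashed to (0, 1) with a sigmoid.
    mask = [[sigmoid(v) for v in row] for row in mask_logits]
    gated = [
        [[value * gate for value, gate in zip(feat_row, mask_row)]
         for feat_row, mask_row in zip(channel, mask)]
        for channel in features
    ]
    return gated, mask


def l1_sparsity(mask: Grid) -> float:
    """Mean absolute mask value; smaller means a sparser gate."""
    cells = [abs(v) for row in mask for v in row]
    return sum(cells) / len(cells)


def total_loss(cross_entropy: float, mask: Grid, sparsity_weight: float = 0.01) -> float:
    # sparsity_weight is a hypothetical coefficient, not taken from the paper.
    return cross_entropy + sparsity_weight * l1_sparsity(mask)


features = [[[2.0, 4.0], [6.0, 8.0]]]   # one channel, 2 x 2
mask_logits = [[0.0, 0.0], [0.0, 0.0]]  # sigmoid(0) = 0.5 everywhere
gated, mask = apply_spatial_gate(features, mask_logits)
print(gated, l1_sparsity(mask))
```

The sparsity term pushes most mask values toward zero, so only a few spatial positions keep their full activation.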

2508.14769 2026-05-19 cs.LG cs.DC

Federated Distillation on Edge Devices: Efficient Client-Side Filtering for Non-IID Data

Ahmed Mujtaba, Gleb Radchenko, Radu Prodan, Marc Masana

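AI summary This paper proposes EdgeFD, an efficient federated distillation method for edge devices in which clients use a KMeans-based density ratio estimator to filter in-distribution and out-of-distribution proxy data, reducing computational complexity and improving the quality of knowledge sharing under non-IID data.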

Comments This paper was accepted at the International Conference on Federated Learning Technologies and Applications, 2025. The final version is available at IEEE Xplore

Abstract

Federated distillation has emerged as a promising collaborative machine learning approach, offering enhanced privacy protection and reduced communication compared to traditional federated learning by exchanging model outputs (soft logits) rather than full model parameters. However, existing methods employ complex selective knowledge-sharing strategies that require clients to identify in-distribution proxy data through computationally expensive statistical density ratio estimators. Additionally, server-side filtering of ambiguous knowledge introduces latency to the process. To address these challenges, we propose a robust, resource-efficient EdgeFD method that reduces the complexity of the client-side density ratio estimation and removes the need for server-side filtering. EdgeFD introduces an efficient KMeans-based density ratio estimator for effectively filtering both in-distribution and out-of-distribution proxy data on clients, significantly improving the quality of knowledge sharing. We evaluate EdgeFD across diverse practical scenarios, including strong non-IID, weak non-IID, and IID data distributions on clients, without requiring a pre-trained teacher model on the server for knowledge distillation. Experimental results demonstrate that EdgeFD outperforms state-of-the-art methods, consistently achieving accuracy levels close to IID scenarios even under heterogeneous and challenging conditions. The significantly reduced computational overhead of the KMeans-based estimator is suitable for deployment on resource-constrained edge devices, thereby enhancing the scalability and real-world applicability of federated distillation. The code is available online for reproducibility.

2508.10678 2026-05-19 cs.CV

HyperTea: A Hypergraph-based Temporal Enhancement and Alignment Network for Moving Infrared Small Target Detection

Zhaoyuan Qi, Weihua Gao, Wenlong Niu, Jie Tang, Yun Li, Xiaodong Peng

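AI summary This paper proposes HyperTea, a network that integrates global and local temporal perspectives to model high-order spatiotemporal correlations of features. It is presented as the first work to combine CNNs, RNNs, and HGNNs for MIRSTD, and it significantly improves detection performance.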

Comments Accepted by Knowledge-Based Systems

Abstract

In practical application scenarios, moving infrared small target detection (MIRSTD) remains highly challenging due to the target's small size, weak intensity, and complex motion pattern. Existing methods typically only model low-order correlations between feature nodes and perform feature extraction and enhancement within a single temporal scale. Although hypergraphs have been widely used for high-order correlation learning, they have received limited attention in MIRSTD. To explore the potential of hypergraphs and enhance multi-timescale feature representation, we propose HyperTea, which integrates global and local temporal perspectives to effectively model high-order spatiotemporal correlations of features. HyperTea consists of three modules: the global temporal enhancement module (GTEM) realizes global temporal context enhancement through semantic aggregation and propagation; the local temporal enhancement module (LTEM) captures local motion patterns between adjacent frames and then enhances local temporal context; and a temporal alignment module (TAM) addresses potential cross-scale feature misalignment. To the best of our knowledge, HyperTea is the first work to integrate convolutional neural networks (CNNs), recurrent neural networks (RNNs), and hypergraph neural networks (HGNNs) for MIRSTD, significantly improving detection performance. Experiments on DAUB and IRDST demonstrate its state-of-the-art (SOTA) performance. Our source code is available at https://github.com/Lurenjia-LRJ/HyperTea.

2508.08080 2026-05-19 cs.LG cs.NE stat.AP

Symbolic Quantile Regression for the Interpretable Prediction of Conditional Quantiles

Cas Oude Hoekstra, Floris den Hengst

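AI summary This paper proposes Symbolic Quantile Regression (SQR), a method that predicts conditional quantiles and explains how predictive variables affect the outcome. An airline fuel usage case study that compares models of extreme and central outcomes illustrates its usefulness in high-stakes applications.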

Journal ref
Transactions on Machine Learning Research, May 2026, https://openreview.net/pdf?id=x9OYbyPJOG

Abstract

Symbolic Regression (SR) is a well-established framework for generating interpretable or white-box predictive models. Although SR has been successfully applied to create interpretable estimates of the average of the outcome, it is currently not well understood how it can be used to estimate the relationship between variables at other points in the distribution of the target variable. Such estimates of e.g. the median or an extreme value provide a fuller picture of how predictive variables affect the outcome and are necessary in high-stakes, safety-critical application domains. This study introduces Symbolic Quantile Regression (SQR), an approach to predict conditional quantiles with SR. In an extensive evaluation, we find that SQR outperforms transparent models and performs comparably to a strong black-box baseline without compromising transparency. We also show how SQR can be used to explain differences in the target distribution by comparing models that predict extreme and central outcomes in an airline fuel usage case study. We conclude that SQR is suitable for predicting conditional quantiles and understanding interesting feature influences at varying quantiles.
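
To make "predicting conditional quantiles" concrete, the sketch below scores a candidate expression with the pinball (quantile) loss, the standard objective in quantile regression. The abstract does not state which fitness function SQR uses, so treating the pinball loss as the scoring rule is an assumption, and `score_expression` and `candidate` are hypothetical names.

```python
from typing import Callable, Sequence


def pinball_loss(y_true: Sequence[float], y_pred: Sequence[float], tau: float) -> float:
    """Mean pinball loss for quantile level tau in (0, 1)."""
    total = 0.0
    for target, prediction in zip(y_true, y_pred):
        diff = target - prediction
        # Under-prediction costs tau per unit, over-prediction costs (1 - tau).
        total += tau * diff if diff >= 0 else (tau - 1.0) * diff
    return total / len(y_true)


def score_expression(
    expression: Callable[[float], float],
    xs: Sequence[float],
    ys: Sequence[float],
    tau: float,
) -> float:
    """Score a candidate symbolic expression at one quantile level."""
    return pinball_loss(ys, [expression(x) for x in xs], tau)


def candidate(x: float) -> float:
    # Hypothetical symbolic expression produced by a search procedure.
    return 2.0 * x


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.5, 6.0, 9.0]
print(score_expression(candidate, xs, ys, tau=0.9))
print(score_expression(candidate, xs, ys, tau=0.1))
```

Because this candidate never over-predicts, it is penalized more heavily at the 0.9 quantile than at the 0.1 quantile, which is the asymmetry that lets a search find different expressions for extreme and central outcomes.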

2508.07292 2026-05-19 cs.AI cs.CL cs.CV

EndoCogniAgent: Closed-Loop Agentic Reasoning with Self-Consistency Validation for Endoscopic Diagnosis

Yi Tang, Kai-Ni Wang, Yang Chen, Xiaopu He, Guangquan Zhou

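AI summary This study proposes EndoCogniAgent, a framework that improves the accuracy and reliability of endoscopic diagnosis through closed-loop agentic reasoning and self-consistency validation. Its core idea is to model diagnosis as a controlled state update process, and it introduces the EndoAgentBench benchmark for evaluation.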

Comments 10 pages, 8 figures, 2 tables. Revised version with major updates on methodology and extended evaluation on EndoAgentBench. Code and data are available at https://github.com/Tyyds-ai/EndoCogniAgent

Abstract

Endoscopic diagnosis is an iterative process in which clinicians progressively acquire, compare, and verify local visual evidence before reaching a conclusion. Current AI systems do not adequately support this process because fine-grained evidence acquisition and multi-step reasoning remain weakly coupled. This gives rise to two failure modes, hallucinated evidence and uncorrected error accumulation, that undermine diagnostic reliability. We propose EndoCogniAgent, a closed-loop agentic framework that formulates endoscopic diagnosis as a controlled state update process. At each reasoning round, a central planner selects the next evidence acquisition action, specialized expert tools extract the corresponding observation, and a self-consistency validation mechanism examines the observation along two dimensions, knowledge consistency against the input image and temporal consistency with prior validated findings, before updating the diagnostic state. Validated observations are admitted into the evolving state to condition subsequent planning, while insufficiently supported findings are retained with corrective feedback that redirects the planner toward additional verification. We further introduce EndoAgentBench, a workflow-oriented benchmark comprising 6,132 question-answer pairs from 11 endoscopic datasets, designed to evaluate diagnostic agents across a comprehensive diagnostic chain, from fine-grained visual perception to high-level diagnostic reasoning. Experiments show that EndoCogniAgent achieves 85.23% average accuracy on perception tasks and a 71.13% clinical acceptance rate on reasoning tasks, with ablation analysis confirming that self-consistency validation and episodic state maintenance are individually critical to these gains.

2508.06974 2026-05-19 cs.CL

Rethinking 1-bit Optimization Leveraging Pre-trained Large Language Models

Zhijun Tu, Jian Li, Yuanyuan Xi, Siqi Liu, Chuanjian Liu, Hanting Chen, Jie Hu, Yunhe Wang

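AI summary This paper proposes a consistent progressive training method that gradually converts full-precision weights into binarized weights to improve 1-bit large language models, and uses binary-aware initialization and dual-scaling compensation to ease training.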

Comments 15 pages, 7 figures

Abstract

1-bit LLM quantization offers significant advantages in reducing storage and computational costs. However, existing methods typically train 1-bit LLMs from scratch, failing to fully leverage pre-trained models. This results in high training costs and notable accuracy degradation. We identify that the large gap between full-precision and 1-bit representations makes naive adaptation difficult. In this paper, we introduce a progressive training scheme that is consistent across the forward and backward passes and smoothly converts the full-precision weights into binarized ones. Additionally, we incorporate binary-aware initialization and dual-scaling compensation to reduce the difficulty of progressive training and improve performance. Experimental results on LLMs of various sizes demonstrate that our method outperforms existing approaches. Our results show that high-performance 1-bit LLMs can be achieved using pre-trained models, eliminating the need for expensive training from scratch.
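
One way to picture a smooth conversion from full-precision to binarized weights is a linear blend controlled by a progress value. The sketch below is an illustration only: the abstract does not give the schedule, so the linear interpolation, the `progress` parameter, and the mean-absolute-value scale are assumptions, and the backward-pass consistency, binary-aware initialization, and dual-scaling compensation are not modeled.

```python
from typing import List


def binarize(weights: List[float]) -> List[float]:
    """Scaled sign binarization: alpha * sign(w), with alpha = mean(|w|)."""
    alpha = sum(abs(w) for w in weights) / len(weights)
    # Zero is mapped to +alpha by convention in this sketch.
    return [alpha if w >= 0 else -alpha for w in weights]


def progressive_weights(weights: List[float], progress: float) -> List[float]:
    """Blend full-precision and binarized weights; progress runs from 0 to 1."""
    binary = binarize(weights)
    return [(1.0 - progress) * w + progress * b for w, b in zip(weights, binary)]


weights = [0.5, -1.5, 1.0]
for progress in (0.0, 0.5, 1.0):
    print(progress, progressive_weights(weights, progress))
```

At progress 0 the pre-trained weights are used unchanged, and at progress 1 every weight has magnitude alpha, so the model is fully 1-bit up to a shared scale.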

2508.06038 2026-05-19 cs.CV cs.AI

Fourier Compressor: Frequency-Domain Visual Token Compression for Vision-Language Models

Huanyu Wang, Jushi Kai, Haoli Bai, Lu Hou, Bo Jiang, Ziwei He, Zhouhan Lin

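AI summary This paper proposes a frequency-domain visual token compression strategy that uses the Fourier transform to reduce computational overhead and improve efficiency while preserving semantic fidelity; experiments show strong results on both image and video tasks.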

Abstract

Vision-Language Models (VLMs) incur substantial computational overhead and inference latency due to the large number of vision tokens introduced by high-resolution image and video inputs. Existing parameter-free token compression methods typically rely on token selection or merging, yet they risk discarding substantial visual information or distorting the original representation distribution, resulting in pronounced performance degradation at high compression ratios. In response, we aim to explore a more effective and efficient visual token compression strategy, with a promising direction in the frequency domain. Motivated by the success of frequency-domain transforms in image compression (e.g., JPEG), we systematically analyze the frequency redundancy in visual representations and uncover a non-uniform distribution of semantic information across frequency bands. Building upon this, we introduce Fourier Compressor, an effective, parameter-free, and highly generalizable module that removes redundancy from visual representations within the frequency domain. Implemented via FFT with $\mathcal{O}(n^2 \log n)$ complexity and no additional parameters, Fourier Compressor introduces negligible computational overhead while preserving semantic fidelity. Extensive experiments on image-based benchmarks demonstrate that our method achieves a favorable performance-efficiency trade-off, retaining over 96% of the original accuracy while reducing inference FLOPs by up to 83.8% and boosting generation speed by 31.2%. It consistently outperforms existing parameter-free methods and even surpasses some parameterized approaches. Importantly, Fourier Compressor generalizes consistently across both LLaVA and Qwen-VL architectures, and further extends to video understanding tasks, highlighting its practical applicability for efficient VLMs.
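
The general idea of parameter-free compression in the frequency domain can be sketched as a low-pass truncation along the token axis: transform, keep the lowest frequencies, and transform back to a shorter sequence. Note the swapped-in technique: this sketch uses a naive orthonormal DCT instead of the FFT named in the abstract, so that it stays real-valued and dependency-free. The 1D token axis, the `keep` parameter, and the amplitude gain are all assumptions, not details from the paper.

```python
import math
from typing import List


def dct(signal: List[float]) -> List[float]:
    """Orthonormal DCT-II, computed naively in O(n^2)."""
    n = len(signal)
    return [
        math.sqrt((1.0 if k == 0 else 2.0) / n)
        * sum(x * math.cos(math.pi * (i + 0.5) * k / n) for i, x in enumerate(signal))
        for k in range(n)
    ]


def idct(coeffs: List[float]) -> List[float]:
    """Inverse of the orthonormal DCT-II above (a DCT-III)."""
    n = len(coeffs)
    return [
        sum(
            math.sqrt((1.0 if k == 0 else 2.0) / n) * c * math.cos(math.pi * (i + 0.5) * k / n)
            for k, c in enumerate(coeffs)
        )
        for i in range(n)
    ]


def compress_tokens(tokens: List[List[float]], keep: int) -> List[List[float]]:
    """Shrink N token vectors to `keep` vectors by dropping high frequencies."""
    n, dim = len(tokens), len(tokens[0])
    gain = math.sqrt(keep / n)  # keeps amplitudes comparable after shortening
    columns = []
    for d in range(dim):
        low = dct([token[d] for token in tokens])[:keep]
        columns.append([gain * value for value in idct(low)])
    return [[columns[d][i] for d in range(dim)] for i in range(keep)]


tokens = [[1.0, -2.0] for _ in range(8)]  # 8 identical 2-dimensional tokens
print(compress_tokens(tokens, keep=4))
```

No parameters are learned here: the compression ratio is set entirely by `keep`, which matches the parameter-free framing of the abstract even though the transform differs.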

2508.00901 2026-05-19 cs.LG cs.CL

Provable Knowledge Acquisition and Extraction in One-Layer Transformers

Ruichen Xu, Kexin Chen

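AI summary This paper studies the mechanisms of knowledge acquisition and extraction in one-layer transformers. Through theoretical analysis and experiments, it characterizes how knowledge stored during pre-training relates to extraction after fine-tuning, and how low-rank fine-tuning can recover pre-trained factual knowledge.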

Abstract

Large language models may encounter factual knowledge during pre-training yet fail to reliably use that knowledge after fine-tuning. Despite growing empirical evidence that MLP layers store factual associations and fine-tuning affects factual recall, the training-dynamics mechanisms linking next-token pre-training, knowledge storage, and post-fine-tuning extraction remain poorly understood. We study this problem in a stylized one-layer transformer with self-attention and MLP modules, trained by next-token prediction and subsequently fine-tuned on question-answering data. Under suitable regularity conditions, we first prove that the model reaches near-optimal pre-training loss while learning structured attention patterns and relation-specific feature directions, giving a mechanism for factual knowledge acquisition. We then show that fine-tuning can turn the Q&A prompt format into a trigger for pre-trained relation features, enabling the model to extract facts that are not revisited during fine-tuning. Our analysis yields a relation-covering characterization of knowledge extraction: fine-tuning need not revisit every stored subject-answer pair, but it must cover enough latent relation-template directions through which facts were encoded during pre-training. Consequently, extraction improves with pre-training multiplicity and fine-tuning coverage, but becomes harder as the relation-template universe grows. Conversely, insufficient coverage leads to a failure regime in which facts may be stored but remain inaccessible, providing a stylized mechanism for hallucination. The theory applies to both full and low-rank fine-tuning, offering insight into why low-rank adaptation can recover pre-trained factual knowledge when relation coverage is sufficient. Experiments on synthetic data and PopQA-based GPT-2/Llama models support the predicted trends.

2507.22136 2026-05-19 cs.CV

Color as the Impetus: Transforming Few-Shot Learner

Chaofei Qi, Zhitai Liu, Jianbin Qiu

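AI summary This paper proposes a few-shot learning framework based on color perception mechanisms that emphasizes distinct color information across channels to improve feature extraction and classification, and introduces a knowledge distillation method to strengthen meta-learning.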

Comments This work is currently being redone. It requires significant revisions and polishing. Additionally, the title will also be revised. Therefore, this version is no longer needed.

Abstract

Humans possess innate meta-learning capabilities, partly attributable to their exceptional color perception. In this paper, we pioneer an innovative viewpoint on few-shot learning by simulating human color perception mechanisms. We propose the ColorSense Learner, a bio-inspired meta-learning framework that capitalizes on inter-channel feature extraction and interactive learning. By strategically emphasizing distinct color information across different channels, our approach effectively filters irrelevant features while capturing discriminative characteristics. Color information represents the most intuitive visual feature, yet conventional meta-learning methods have predominantly neglected this aspect, focusing instead on abstract feature differentiation across categories. Our framework bridges the gap via synergistic color-channel interactions, enabling better intra-class commonality extraction and larger inter-class differences. Furthermore, we introduce a meta-distiller based on knowledge distillation, ColorSense Distiller, which incorporates prior teacher knowledge to augment the student network's meta-learning capacity. We conduct comprehensive coarse-grained, fine-grained, and cross-domain experiments on eleven few-shot benchmarks for validation. The experiments show that our methods have strong generalization ability, robustness, and transferability, and effortlessly handle few-shot classification from the perspective of color perception.

2507.22057 2026-05-19 cs.CV

MetaLab: Few-Shot Game Changer for Image Recognition

Chaofei Qi, Zhitai Liu, Jianbin Qiu

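AI summary This paper proposes MetaLab, an efficient few-shot image recognition method built on a CIELab-guided coherent meta-learning framework, reporting high accuracy, robustness, and effective generalization that approach human-level recognition.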

Comments This work is currently being redone. It requires significant revisions and polishing. Additionally, the title will also be revised. Therefore, this version is no longer needed.

详情
AI中文摘要

困难的少样本图像识别具有显著的应用前景,但与传统大规模图像识别相比仍存在显著的技术差距。本文提出了一种高效的原生方法,称为CIELab引导的相干元学习(MetaLab)。结构上,我们的MetaLab由两个协作的神经网络组成:LabNet,能够对CIELab颜色空间进行域转换并提取丰富的分组特征,以及相干LabGNN,能够促进亮度图和颜色图之间的相互学习。为了充分验证,我们在四个粗粒度基准、四个细粒度基准和四个跨域少样本基准上进行了广泛的比较研究。具体而言,我们的方法在每个类别仅使用一个样本时能够实现高准确率、鲁棒性能和有效的泛化能力。总体而言,所有实验都表明,我们的MetaLab可以达到99%的准确率,接近人类识别水平,仅需少量的视觉偏差。

英文摘要

Difficult few-shot image recognition has significant application prospects, yet substantial technical gaps remain relative to conventional large-scale image recognition. In this paper, we propose an efficient original method for few-shot image recognition, called CIELab-Guided Coherent Meta-Learning (MetaLab). Structurally, MetaLab comprises two collaborative neural networks: LabNet, which performs domain transformation for the CIELab color space and extracts rich grouped features, and coherent LabGNN, which facilitates mutual learning between the lightness graph and the color graph. For thorough validation, we have conducted extensive comparative studies on four coarse-grained benchmarks, four fine-grained benchmarks, and four cross-domain few-shot benchmarks. Specifically, our method achieves high accuracy, robust performance, and effective generalization capability with one sample per class. Overall, all experiments have demonstrated that our MetaLab can approach 99\% $\uparrow\downarrow$ accuracy, reaching the human recognition ceiling with little visual deviation.
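
The abstract does not say how LabNet performs the CIELab domain transformation, so the following is only a minimal sketch of the standard sRGB to CIELab conversion (D65 white point) that such a step might start from. The function names `srgb_to_lab` and `split_lightness_color` are hypothetical and are not taken from the paper; the split into a lightness channel and a color pair merely mirrors the "lightness graph" and "color graph" wording above.

```python
def srgb_to_lab(r, g, b):
    """Convert one 8-bit sRGB pixel to CIELab using the D65 reference white."""
    def to_linear(c):
        # Undo the sRGB gamma curve.
        c = c / 255.0
        return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

    rl, gl, bl = to_linear(r), to_linear(g), to_linear(b)

    # Linear sRGB -> CIE XYZ (D65).
    x = 0.4124564 * rl + 0.3575761 * gl + 0.1804375 * bl
    y = 0.2126729 * rl + 0.7151522 * gl + 0.0721750 * bl
    z = 0.0193339 * rl + 0.1191920 * gl + 0.9503041 * bl

    # D65 reference white.
    xn, yn, zn = 0.95047, 1.0, 1.08883

    def f(t):
        delta = 6.0 / 29.0
        if t > delta ** 3:
            return t ** (1.0 / 3.0)
        return t / (3.0 * delta ** 2) + 4.0 / 29.0

    fx, fy, fz = f(x / xn), f(y / yn), f(z / zn)
    lightness = 116.0 * fy - 16.0
    a = 500.0 * (fx - fy)
    b_star = 200.0 * (fy - fz)
    return lightness, a, b_star


def split_lightness_color(pixel):
    """Hypothetical helper: separate the lightness channel from the (a, b) color pair."""
    lightness, a, b_star = srgb_to_lab(*pixel)
    return lightness, (a, b_star)
```

For a pure white pixel `(255, 255, 255)` the lightness comes out at approximately 100 with both color components near zero, and pure black gives a lightness of 0.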

2507.22041 2026-05-19 cs.CV

Shallow Deep Learning Can Still Excel in Fine-Grained Few-Shot Learning

浅层深度学习仍能在细粒度少样本学习中表现出色

Chaofei Qi, Chao Ye, Zhitai Liu, Weiyang Lin, Jianbin Qiu

AI总结 本文研究了浅层深度网络在细粒度少样本学习中的表现,提出了一种位置感知星座网络(LCN-4),通过改进的特征聚类模块有效减少损失,验证了浅层网络在该任务中的有效性。

Comments This work is currently being redone. It requires significant revisions and polishing. Additionally, the title will also be revised. Therefore, this version is no longer needed.

详情
AI中文摘要

深度学习已在广泛领域得到广泛应用,包括依赖深度骨干网络的细粒度少样本学习(FGFSL)。然而,较浅的深度骨干网络如ConvNet-4不常被选用,因为它们倾向于提取大量非抽象的视觉属性。本文重新评估了网络深度与完全编码少样本实例能力之间的关系,并探讨浅层深度架构是否能实现与主流深度骨干网络相当或更优的性能。受Vanilla ConvNet-4的启发,我们提出了一种位置感知星座网络(LCN-4),配备先进的位置感知特征聚类模块。该模块能够高效编码和整合空间特征融合、特征聚类和隐蔽特征位置,从而显著减少整体损失。具体而言,我们创新性地提出了一种通用网格位置编码补偿,有效解决特定普通卷积在特征提取过程中位置信息缺失的问题。此外,我们进一步提出了一种通用频域位置嵌入技术,以补偿聚类特征中的位置损失。我们在三个代表性的细粒度少样本基准上进行了验证。相关实验表明,LCN-4显著优于基于ConvNet-4的最新方法,并实现了与大多数ResNet12方法相当或更优的性能,证实了我们的猜想。

英文摘要

Deep learning has been widely used across a broad spectrum of domains, including fine-grained few-shot learning (FGFSL), which heavily depends on deep backbones. Nonetheless, shallower deep backbones such as ConvNet-4 are not commonly preferred because they are prone to extracting a larger quantity of non-abstract visual attributes. In this paper, we first re-evaluate the relationship between network depth and the ability to fully encode few-shot instances, and investigate whether a shallow deep architecture can achieve performance comparable or superior to mainstream deep backbones. Inspired by the vanilla ConvNet-4, we introduce a location-aware constellation network (LCN-4), equipped with a cutting-edge location-aware feature clustering module. This module can proficiently encode and integrate spatial feature fusion, feature clustering, and recessive feature location, thereby significantly minimizing the overall loss. Specifically, we propose a general grid position encoding compensation to effectively address the loss of positional information during feature extraction by specific ordinary convolutions. Additionally, we propose a general frequency-domain location embedding technique to compensate for the location loss in clustering features. We have validated our approach on three representative fine-grained few-shot benchmarks. The experiments show that LCN-4 notably outperforms state-of-the-art ConvNet-4-based methods and achieves performance on par with or superior to most ResNet12-based methods, confirming our conjecture.

2507.18406 2026-05-19 cs.CL cs.DB cs.DL cs.IR

Factual Inconsistencies in Multilingual Wikipedia Tables

多语言维基百科表格中的事实不一致

Silvia Cappa, Lingxiao Kong, Pille-Riin Peet, Fanfu Wei, Yuchen Zhou, Jan-Christoph Kalo

AI总结 本研究探讨了多语言维基百科结构化内容中的跨语言不一致问题,特别是表格数据,通过开发方法收集、对齐和分析多语言维基百科文章中的表格,定义不一致的类别,并应用定量和定性指标评估多语言对齐,为事实验证、多语言知识交互和可靠AI系统设计提供启示。

Comments 11 pages, 7 figures, White Paper for RTF Work at ISWS Summer School 2025

详情
AI中文摘要

维基百科作为全球可访问的知识源,包含超过300种语言的内容。尽管覆盖相同主题,不同版本的维基百科是独立编写和更新的。这导致了事实不一致,可能影响百科全书和依赖维基百科作为主要训练数据的AI系统中立性和可靠性。本研究调查了维基百科结构化内容中的跨语言不一致,重点是表格数据。我们开发了一种方法来收集、对齐和分析维基百科多语言文章中的表格,定义不一致的类别。我们应用各种定量和定性指标来评估多语言对齐,使用样本数据集。这些见解对事实验证、多语言知识交互和设计利用维基百科内容的可靠AI系统具有影响。

英文摘要

Wikipedia serves as a globally accessible knowledge source with content in over 300 languages. Despite covering the same topics, the different versions of Wikipedia are written and updated independently. This leads to factual inconsistencies that can impact the neutrality and reliability of the encyclopedia and AI systems, which often rely on Wikipedia as a main training source. This study investigates cross-lingual inconsistencies in Wikipedia's structured content, with a focus on tabular data. We developed a methodology to collect, align, and analyze tables from Wikipedia multilingual articles, defining categories of inconsistency. We apply various quantitative and qualitative metrics to assess multilingual alignment using a sample dataset. These insights have implications for factual verification, multilingual knowledge interaction, and design for reliable AI systems leveraging Wikipedia content.

2507.17798 2026-05-19 cs.LG

Wasserstein GAN-Based Precipitation Downscaling with Optimal Transport for Enhancing Perceptual Realism

基于Wasserstein GAN与最优传输的降水降尺度以增强感知真实性

Kenta Shiraishi, Yuka Muto, Atsushi Okazaki, Shunji Kotsuki

AI总结 本文提出利用Wasserstein GAN与最优传输成本进行降水降尺度,以提高降水预测的感知真实性,尽管WGAN在传统评估指标上略逊,但其生成的降水场在视觉上更真实,且能有效识别不真实输出和参考数据中的潜在伪影。

详情
Journal ref
Progress in Earth and Planetary Science, 13, 29, 2026
AI中文摘要

高分辨率(HR)降水预测对于减少停滞性和局地性强降雨造成的损害至关重要;然而,使用过程驱动的数值天气预报模型进行HR降水预报仍然具有挑战性。本研究提出利用Wasserstein生成对抗网络(WGAN)结合最优传输成本进行降水降尺度。与使用均方误差训练的传统神经网络相比,WGAN生成了具有精细结构、视觉上逼真的降水场,尽管WGAN在传统评估指标上略逊。WGAN学习到的评判器(critic)与人类的感知真实性高度相关。基于案例的分析表明,评判器分数的显著差异有助于识别不真实的WGAN输出和参考数据中的潜在伪影。这些发现表明,WGAN框架不仅提高了降水降尺度的感知真实性,还为降水数据集的评估和质量控制提供了新的视角。

英文摘要

High-resolution (HR) precipitation prediction is essential for reducing damage from stationary and localized heavy rainfall; however, HR precipitation forecasting with process-driven numerical weather prediction models remains challenging. This study proposes using a Wasserstein Generative Adversarial Network (WGAN) to perform precipitation downscaling with an optimal transport cost. In contrast to a conventional neural network trained with mean squared error, the WGAN generated visually realistic precipitation fields with fine-scale structures, even though the WGAN exhibited slightly lower performance on conventional evaluation metrics. The learned critic of the WGAN correlated well with human perceptual realism. Case-based analysis revealed that large discrepancies in critic scores can help identify both unrealistic WGAN outputs and potential artifacts in the reference data. These findings suggest that the WGAN framework not only improves perceptual realism in precipitation downscaling but also offers a new perspective for evaluating and quality-controlling precipitation datasets.

2507.05482 2026-05-19 cs.LG stat.ML

Stein Diffusion Guidance: Training-Free Posterior Correction for Sampling Beyond High-Density Regions

Van Khoa Nguyen, Lionel Blondé, Alexandros Kalousis

AI总结 本文提出了一种基于Stein扩散引导的训练自由后验校正方法,用于在高密度区域之外进行采样。该方法结合了随机最优控制和Stein变分推断,通过引入新的理论界和运行成本函数,实现了在低密度区域的有效引导。

Comments Revised version accepted to the ICML 2026 main track; prior version accepted to two ICLR 2026 workshops: ReALM-GEN and DeLTa

详情
AI中文摘要

Training-free diffusion guidance offers a flexible framework for leveraging off-the-shelf classifiers without additional training. Yet, current approaches hinge on posterior approximations via Tweedie's formula, which often yield unreliable guidance, particularly in low-density regions. Stochastic optimal control (SOC), in contrast, enables principled posterior sampling but remains computationally prohibitive for efficient inference. In this work, we reconcile the strengths of these paradigms by introducing Stein Diffusion Guidance (SDG), a novel training-free framework grounded in a surrogate SOC objective. We establish a new theoretical bound on the SOC value function, revealing the necessity of correcting approximate posteriors to reflect true diffusion dynamics. Building on Stein variational inference, SDG computes the steepest descent direction that minimizes the Kullback-Leibler divergence between approximate and true posteriors. By integrating a principled Stein correction mechanism along with a novel running cost functional, SDG enables effective guidance in low-density regions. Our experiments on diverse image-guidance tasks and on challenging small-ligand sampling for protein docking suggest that SDG consistently outperforms standard training-free guidance methods and highlights its potential for broader posterior sampling problems beyond high-density regimes.

英文摘要

Training-free diffusion guidance offers a flexible framework for leveraging off-the-shelf classifiers without additional training. Yet, current approaches hinge on posterior approximations via Tweedie's formula, which often yield unreliable guidance, particularly in low-density regions. Stochastic optimal control (SOC), in contrast, enables principled posterior sampling but remains computationally prohibitive for efficient inference. In this work, we reconcile the strengths of these paradigms by introducing Stein Diffusion Guidance (SDG), a novel training-free framework grounded in a surrogate SOC objective. We establish a new theoretical bound on the SOC value function, revealing the necessity of correcting approximate posteriors to reflect true diffusion dynamics. Building on Stein variational inference, SDG computes the steepest descent direction that minimizes the Kullback-Leibler divergence between approximate and true posteriors. By integrating a principled Stein correction mechanism along with a novel running cost functional, SDG enables effective guidance in low-density regions. Our experiments on diverse image-guidance tasks and on challenging small-ligand sampling for protein docking suggest that SDG consistently outperforms standard training-free guidance methods and highlights its potential for broader posterior sampling problems beyond high-density regimes.

2506.16042 2026-05-19 cs.AI cs.LG cs.OS

OSWorld-Human: Benchmarking the Efficiency of Computer-Use Agents

OSWorld-Human: 评估计算机使用代理的效率基准

Reyna Abhyankar, Qi Qi, Yiying Zhang

AI总结 本文研究了计算机使用代理在OSWorld基准上的时间性能,发现大模型调用导致高延迟,并构建了包含人类轨迹的OSWorld Human数据集,评估发现最佳代理仍需更多步骤。

详情
AI中文摘要

生成式AI正被用于解决涉及桌面应用的多种计算机使用任务。最先进的系统只专注于提高在主流基准上的准确性。然而,对于人类通常只需几分钟即可完成的任务,这些系统的端到端延迟极高(例如数十分钟),因而实际上不可用。为了理解其原因并指导未来计算机代理的发展,我们首次研究了计算机使用代理在OSWorld(计算机使用AI领域的旗舰基准)上的时间性能。我们发现,用于规划、反思和判断的大模型调用占总延迟的大部分,并且随着代理使用更多步骤来完成任务,每个后续步骤的耗时可比任务开始时的步骤长3倍。随后我们构建了OSWorld Human,即原始OSWorld数据集的人工标注版本,其中包含每个任务由人类确定的轨迹。我们使用OSWorld Human评估了16个代理的效率,发现即使是最佳代理,所用步骤也比必要步骤多2.7-4.3倍。

英文摘要

Generative AI is being leveraged to solve a variety of computer-use tasks involving desktop applications. State-of-the-art systems have focused solely on improving accuracy on leading benchmarks. However, these systems are practically unusable due to extremely high end-to-end latency (e.g., tens of minutes) for tasks that typically take humans just a few minutes to complete. To understand the cause behind this and to guide future developments of computer agents, we conduct the first study on the temporal performance of computer-use agents on OSWorld, the flagship benchmark in computer-use AI. We find that large model calls for planning, reflection, and judging account for most of the overall latency, and as an agent uses more steps to complete a task, each successive step can take 3x longer than steps at the beginning of a task. We then construct OSWorld Human, a manually annotated version of the original OSWorld dataset that contains a human-determined trajectory for each task. We evaluate 16 agents on their efficiency using OSWorld Human and find that even the best agents take 2.7-4.3x more steps than necessary.

2506.15588 2026-05-19 cs.LG

Memory-Efficient Differentially Private Training with Gradient Random Projection

内存高效的差分隐私训练与梯度随机投影

Alex Mulrooney, Devansh Gupta, James Flemings, Huanyu Zhang, Murali Annavaram, Meisam Razaviyayn, Xinwei Zhang

AI总结 本文提出DP-GRAPE方法,通过随机高斯矩阵替代SVD子空间,减少内存使用并保持与一阶DP方法相当的效用,同时消除了昂贵的SVD计算需求,显著提升内存效率和模型性能。

详情
AI中文摘要

差分隐私(DP)在神经网络训练中保护敏感数据,但DP-Adam等标准方法由于逐样本梯度裁剪而产生高内存开销,限制了可扩展性。我们引入DP-GRAPE(Gradient RAndom ProjEction,梯度随机投影),一种差分隐私训练方法,能显著减少内存使用,同时保持与一阶DP方法相当的效用。DP-GRAPE的动机来自我们的发现:隐私化会使梯度奇异值谱变平,使基于SVD的投影(如GaLore(Zhao等人,2024))变得不必要。因此,DP-GRAPE采用三个关键组件:(1)用随机高斯矩阵替代基于SVD的子空间;(2)在投影后对梯度进行隐私化;(3)在反向传播期间应用投影。这些贡献消除了昂贵的SVD计算需求,实现了显著的内存节省,并提高了效用。尽管在较低维子空间中运行,我们的理论分析表明,DP-GRAPE在隐私-效用权衡上与DP-SGD相当。大量实验表明,DP-GRAPE可以显著减少DP训练的内存占用,而不牺牲准确性或训练时间。特别是,与DP-Adam相比,DP-GRAPE在预训练视觉Transformer时将内存使用减少超过63%,在微调RoBERTa-Large时减少超过70%,同时实现相近的性能。我们进一步证明,DP-GRAPE能够扩展到微调大型模型,如参数量高达67亿的OPT,而DP-Adam在这一规模下会因内存限制而失败。我们的代码可在https://github.com/alexmul1114/DP_GRAPE获得。

英文摘要

Differential privacy (DP) protects sensitive data during neural network training, but standard methods like DP-Adam suffer from high memory overhead due to per-sample gradient clipping, limiting scalability. We introduce DP-GRAPE (Gradient RAndom ProjEction), a DP training method that significantly reduces memory usage while maintaining utility on par with first-order DP approaches. DP-GRAPE is motivated by our finding that privatization flattens the gradient singular value spectrum, making SVD-based projections (as in GaLore (Zhao et al., 2024)) unnecessary. Consequently, DP-GRAPE employs three key components: (1) random Gaussian matrices replace SVD-based subspaces, (2) gradients are privatized after projection, and (3) projection is applied during backpropagation. These contributions eliminate the need for costly SVD computations, enable substantial memory savings, and lead to improved utility. Despite operating in lower-dimensional subspaces, our theoretical analysis shows that DP-GRAPE achieves a privacy-utility tradeoff comparable to DP-SGD. Our extensive empirical experiments show that DP-GRAPE can significantly reduce the memory footprint of DP training without sacrificing accuracy or training time. In particular, DP-GRAPE reduces memory usage by over 63% when pre-training Vision Transformers and over 70% when fine-tuning RoBERTa-Large as compared to DP-Adam, while achieving similar performance. We further demonstrate that DP-GRAPE scales to fine-tuning large models such as OPT with up to 6.7 billion parameters, a scale at which DP-Adam fails due to memory constraints. Our code is available at https://github.com/alexmul1114/DP_GRAPE.
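
The three components listed in the abstract can be illustrated with a toy sketch. It is a simplification under stated assumptions and not the authors' implementation: gradients are flat Python lists rather than per-layer tensors, the projection is applied to already-computed per-sample gradients instead of during backpropagation, and the function name `dp_projected_gradient` and its parameters (`r`, `clip_norm`, `noise_mult`, `seed`) are hypothetical. No privacy accounting is performed.

```python
import math
import random


def dp_projected_gradient(per_sample_grads, r, clip_norm, noise_mult, seed=0):
    """Project per-sample gradients to r dimensions, clip, add noise, project back."""
    rng = random.Random(seed)
    d = len(per_sample_grads[0])

    # (1) Random Gaussian projection matrix P of shape (d, r), entries ~ N(0, 1/r).
    P = [[rng.gauss(0.0, 1.0 / math.sqrt(r)) for _ in range(r)] for _ in range(d)]

    total = [0.0] * r
    for g in per_sample_grads:
        # Project the sample gradient: z = P^T g (length r).
        z = [sum(P[i][j] * g[i] for i in range(d)) for j in range(r)]
        # (2) Privatize after projection: clip the projected gradient to clip_norm.
        norm = math.sqrt(sum(v * v for v in z))
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for j in range(r):
            total[j] += scale * z[j]

    # Add Gaussian noise to the clipped sum, then average over the batch.
    n = len(per_sample_grads)
    noisy = [(t + rng.gauss(0.0, noise_mult * clip_norm)) / n for t in total]

    # Map the privatized low-dimensional gradient back to parameter space: P z.
    return [sum(P[i][j] * noisy[j] for j in range(r)) for i in range(d)]
```

In this sketch the memory saving comes from clipping and storing vectors of length `r` per sample rather than length `d`; the full-size update is only rebuilt once per batch.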

2506.08244 2026-05-19 cs.LG cs.AI stat.ML

Algebraic Priors for Approximately Equivariant Networks

代数先验用于近似等变网络

Riccardo Ali, Pietro Liò, Jamie Vicary

AI总结 本文提出了一种无需参数的代数方法,利用群表示理论来构建等变网络的先验,通过实验验证该方法在多个任务中表现优异,甚至在无限群情况下也优于专门设计的模型。

详情
AI中文摘要

等变神经网络通过群作用来整合对称性,将其作为归纳偏差以提高性能。现有方法在潜在空间中学习等变作用,或设计具有等变结构的架构。这些方法通常能获得良好的经验结果,但可能涉及架构特定的约束、大量参数和高计算成本。我们挑战复杂等变架构范式,提出一种无参数的方法,基于群表示理论。我们证明,对于有限群上的等变编码器,潜在空间几乎必然包含每个线性无关数据轨道的一个副本,我们通过多个实验证明这一点。利用这一基础的代数洞察,我们通过辅助损失将群的正则表示作为归纳偏差,不增加可学习参数。我们的广泛评估显示,该方法在多个任务中表现优异,甚至在无限群情况下也优于专门设计的模型。我们进一步通过消融研究验证了正则表示的选择,显示其在所有情况下均优于定义和平凡群表示的基线模型。

英文摘要

Equivariant neural networks incorporate symmetries through group actions, embedding them as an inductive bias to improve performance. Existing methods learn an equivariant action on the latent space, or design architectures that are equivariant by construction. These approaches often deliver strong empirical results but can involve architecture-specific constraints, large parameter counts, and high computational cost. We challenge the paradigm of complex equivariant architectures with a parameter-free approach grounded in group representation theory. We prove that for an equivariant encoder over a finite group, the latent space must almost surely contain one copy of its regular representation for each linearly independent data orbit, which we explore with a number of empirical studies. Leveraging this foundational algebraic insight, we impose the group's regular representation as an inductive bias via an auxiliary loss, adding no learnable parameters. Our extensive evaluation shows that this method matches or outperforms specialized models in several cases, even those for infinite groups. We further validate our choice of the regular representation through an ablation study, showing it consistently outperforms defining and trivial group representation baselines.

2505.24438 2026-05-19 cs.LG

Weisfeiler and Leman Follow the Arrow of Time: Expressive Power of Message Passing in Temporal Event Graphs

Weisfeiler和Leman跟随时间之箭:时间事件图中消息传递的表达能力

Franziska Heeg, Jonas Sauer, Petra Mutzel, Ingo Scholtes

AI总结 研究探讨了时间事件图中消息传递方法的表达能力,提出了一种基于一致事件图同构的扩展Weisfeiler-Leman算法,以区分非同构的时间图。

详情
AI中文摘要

时间图的一个重要特征是时间箭头如何影响其因果拓扑,即哪些节点可能通过时间尊重的路径因果地相互影响。由此产生的模式常被时间图神经网络(TGNNs)忽视。为了正式分析TGNNs的表达能力,我们缺乏一个将图同构扩展到时间图的一般化方法,以完全捕捉其因果拓扑。针对这一缺口,我们引入了一致事件图同构的概念,该概念利用了时间图中时间尊重路径的时间展开表示。我们比较了这一定义与现有时间图同构的概念。我们展示了并突出了我们方法的优势,并开发了一个时间图的Weisfeiler-Leman算法的扩展,以启发式地区分非同构的时间图。基于这一理论基础,我们推导出一种新的消息传递方案,用于时间图神经网络,该方案在时间图的事件图表示上运行。实验评估显示,我们的方法在时间图分类实验中表现良好。

英文摘要

An important characteristic of temporal graphs is how the directed arrow of time influences their causal topology, i.e., which nodes can possibly influence each other causally via time-respecting paths. The resulting patterns are often neglected by temporal graph neural networks (TGNNs). To formally analyze the expressive power of TGNNs, we lack a generalization of graph isomorphism to temporal graphs that fully captures their causal topology. Addressing this gap, we introduce the notion of consistent event graph isomorphism, which utilizes a time-unfolded representation of time-respecting paths in temporal graphs. We compare this definition with existing notions of temporal graph isomorphisms. We illustrate and highlight the advantages of our approach and develop a temporal generalization of the Weisfeiler-Leman algorithm to heuristically distinguish non-isomorphic temporal graphs. Building on this theoretical foundation, we derive a novel message passing scheme for temporal graph neural networks that operates on the event graph representation of temporal graphs. An experimental evaluation shows that our approach performs well in a temporal graph classification experiment.
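
To make the idea of time-respecting paths concrete, here is a minimal sketch of a simple event graph: each timestamped edge becomes a node, and one event points to another only if the second starts where the first ended and happens strictly later. This is a common simplified construction and may differ from the paper's consistent event graph; the function name `event_graph` and the maximum waiting time `delta` are assumptions introduced for illustration.

```python
def event_graph(events, delta):
    """Build successor lists between timestamped edges (u, v, t).

    Event i links to event j when i ends at the node where j starts and
    j occurs strictly after i, at most `delta` time units later.
    """
    adjacency = {i: [] for i in range(len(events))}
    for i, (_, v1, t1) in enumerate(events):
        for j, (u2, _, t2) in enumerate(events):
            if v1 == u2 and 0 < t2 - t1 <= delta:
                adjacency[i].append(j)
    return adjacency


# a -> b at time 1 can be continued by b -> c at time 2, so a can influence c.
forward = event_graph([("a", "b", 1), ("b", "c", 2)], delta=5)

# With the timestamps swapped, the same static edges carry no causal path.
backward = event_graph([("a", "b", 2), ("b", "c", 1)], delta=5)
```

The two inputs have identical static topology but different event graphs, which is the kind of distinction a temporal notion of isomorphism has to capture.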

2505.21893 2026-05-19 cs.LG cs.AI

SIPO: Stabilized and Improved Preference Optimization for Aligning Diffusion Models

SIPO: 用于对齐扩散模型的人类偏好优化的稳定与改进方法

Xiaomeng Yang, Mengping Yang, Junyan Wang, Zhijian Zhou, Zhiyu Tan, Hao Li

AI总结 本研究提出SIPO框架,通过时间步感知的重要性重新加权和梯度稳定技术,解决扩散模型对齐中训练不稳定和策略偏差问题,提升了对齐效果和稳定性。

Comments This version supplements with more detailed content on reasoning and proof, additional experimental results, and ablation studies

详情
AI中文摘要

偏好学习作为一种有效技术,已被广泛用于将扩散模型与人类偏好对齐在视觉生成中。然而,现有对齐方法如Diffusion-DPO面临两个根本性挑战:由于各个时间步的高梯度方差导致的训练不稳定以及由于优化数据与策略模型分布之间的差异引起的策略偏差。我们的第一项贡献是对不同时间步的扩散轨迹进行系统分析,发现不稳定性主要源于早期时间步的低重要性权重。为了解决这些问题,我们提出了SIPO,即一种用于将扩散模型与人类偏好对齐的稳定和改进的偏好优化框架。具体而言,引入了一个关键梯度,即DPO-C&M,通过裁剪和屏蔽无信息的时间步来稳定训练。随后,采用时间步感知的重要性重新加权范式以缓解策略偏差并在对齐过程中强调信息更新。在各种基线模型上进行的广泛实验,包括图像生成模型SD1.5、SDXL和视频生成模型CogVideoX-2B/5B、Wan2.1-1.3B,表明我们的SIPO在稳定训练和性能方面均优于现有对齐方法。总体而言,这些结果表明了时间步感知对齐的重要性,并为改进扩散模型的偏好优化提供了有价值的指导。

英文摘要

Preference learning has garnered extensive attention as an effective technique for aligning diffusion models with human preferences in visual generation. However, existing alignment approaches such as Diffusion-DPO suffer from two fundamental challenges: training instability caused by high gradient variances at various timesteps and high parameter sensitivities, and off-policy bias arising from the discrepancy between the optimization data and the policy models' distribution. Our first contribution is a systematic analysis of diffusion trajectories across different timesteps, identifying that the instability primarily originates from early timesteps with low importance weights. To address these issues, we propose \textbf{SIPO}, a \textbf{S}tabilized and \textbf{I}mproved \textbf{P}reference \textbf{O}ptimization framework for aligning diffusion models with human preferences. Concretely, a key gradient, \emph{i.e.,} DPO-C\&M, is introduced to stabilize training by clipping and masking uninformative timesteps. This is followed by a timestep-aware importance-reweighting paradigm to mitigate off-policy bias and emphasize informative updates throughout the alignment process. Extensive experiments on various baseline models, including the image generation models SD1.5 and SDXL and the video generation models CogVideoX-2B/5B and Wan2.1-1.3B, demonstrate that our SIPO consistently promotes stabilized training and outperforms existing alignment methods, even those with meticulously tuned parameters. Overall, these results suggest the importance of timestep-aware alignment and provide valuable guidelines for improved preference optimization in aligning diffusion models.

2505.20218 2026-05-19 cs.LG

Fine-grained List-wise Alignment for Generative Medication Recommendation

细粒度列表级对齐用于生成性药物推荐

Chenxiao Fan, Chongming Gao, Wentao Shi, Yaxin Gong, Zihao Zhao, Fuli Feng

AI总结 本文提出FLAME框架,通过细粒度列表级对齐方法,利用大语言模型生成药物列表,以提高药物推荐的准确性和安全性,同时考虑药物间的相互作用和潜在不良反应。

Comments NeurIPS 2025 Spotlight

详情
AI中文摘要

准确且安全的药物推荐对于有效的临床决策至关重要,尤其是在多病共存的情况下。然而,现有系统依赖于逐点预测范式,忽略了药物间的协同效应和潜在的不良药物-药物相互作用(DDIs)。我们提出FLAME,一种面向大语言模型(LLMs)的细粒度列表级对齐框架,能够逐药生成药物列表。FLAME将推荐建模为一个序贯决策过程,每一步添加或移除一种药物。为了提供细粒度的学习信号,我们设计了采用基于势函数的奖励塑形的逐步组相对策略优化(GRPO),显式建模DDIs并优化每种药物对整体处方的贡献。此外,FLAME通过将结构化临床知识和协同信息整合到LLMs的表示空间中来增强患者建模。在基准数据集上的实验表明,FLAME实现了最先进的性能,具有更高的准确性、可控的安全性-准确性权衡,以及在多样化临床场景中的强大泛化能力。我们的代码可在https://github.com/cxfann/Flame获取。

英文摘要

Accurate and safe medication recommendations are critical for effective clinical decision-making, especially in multimorbidity cases. However, existing systems rely on point-wise prediction paradigms that overlook synergistic drug effects and potential adverse drug-drug interactions (DDIs). We propose FLAME, a fine-grained list-wise alignment framework for large language models (LLMs), enabling drug-by-drug generation of drug lists. FLAME formulates recommendation as a sequential decision process, where each step adds or removes a single drug. To provide fine-grained learning signals, we devise step-wise Group Relative Policy Optimization (GRPO) with potential-based reward shaping, which explicitly models DDIs and optimizes the contribution of each drug to the overall prescription. Furthermore, FLAME enhances patient modeling by integrating structured clinical knowledge and collaborative information into the representation space of LLMs. Experiments on benchmark datasets demonstrate that FLAME achieves state-of-the-art performance, delivering superior accuracy, controllable safety-accuracy trade-offs, and strong generalization across diverse clinical scenarios. Our code is available at https://github.com/cxfann/Flame.
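
Potential-based reward shaping, mentioned above, adds the term γΦ(s') − Φ(s) to the reward of each step. The sketch below illustrates only that formula for a step that adds or removes one drug; it does not implement GRPO. The potential used here (the negative count of interacting pairs in the current list), the `ddi_pairs` set, and the drug names are hypothetical choices for illustration, not the definitions used in FLAME.

```python
def potential(drugs, ddi_pairs):
    """Hypothetical potential: minus the number of interacting drug pairs in the list."""
    ordered = sorted(drugs)
    interacting = sum(
        1
        for i, first in enumerate(ordered)
        for second in ordered[i + 1:]
        if frozenset((first, second)) in ddi_pairs
    )
    return -interacting


def shaped_reward(base_reward, state, next_state, ddi_pairs, gamma=1.0):
    """Reward for one add/remove step plus the shaping term gamma*phi(s') - phi(s)."""
    return base_reward + gamma * potential(next_state, ddi_pairs) - potential(state, ddi_pairs)


# Assumed toy interaction table: drugs "A" and "B" interact.
ddi = {frozenset(("A", "B"))}

add_step = shaped_reward(0.0, {"A"}, {"A", "B"}, ddi)      # adding B creates a DDI
remove_step = shaped_reward(0.0, {"A", "B"}, {"A"}, ddi)   # removing B resolves it
```

With `gamma=1.0` the shaping terms telescope over a trajectory, so here the step that introduces the interaction is penalized by 1 and the step that removes it earns that 1 back.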

2505.18991 2026-05-19 cs.CV

Fast Kernel-Space Diffusion for Remote Sensing Pansharpening

快速核空间扩散用于遥感全色锐化

Hancong Jin, Zihan Cao, Liang-jian Deng, Jingjing Li

AI总结 本文提出KSDiff框架,通过整合低秩核心张量生成器和统一因子生成器,利用结构感知的多头注意力机制生成增强全局上下文的卷积核,以提升全色锐化质量并加速推理,实验表明其在性能和效率上均优于现有方法。

Comments CVPR 2026 Findings

详情
AI中文摘要

全色锐化旨在将高分辨率全色(PAN)图像和低分辨率多光谱(LRMS)图像融合为一幅具有精细空间细节和丰富光谱信息的单一图像。尽管深度学习方法取得了进展,但现有方法往往无法捕捉遥感数据分布中固有的全局先验。基于扩散模型的方法因强大的分布映射能力而成为有前途的解决方案,但它们存在推理延迟大的问题。我们引入KSDiff,一种快速核空间扩散框架,通过整合低秩核心张量生成器和统一因子生成器,利用结构感知的多头注意力机制生成增强全局上下文的卷积核,以提升全色锐化质量并加速推理。我们进一步提出一种针对全色锐化的两阶段训练策略,便于集成到现有全色锐化架构中。实验表明,KSDiff在性能上优于最近的有前途的方法,并且在扩散基线全色锐化方法上实现了超过500倍的推理速度提升。消融研究、可视化和进一步评估证实了我们方法的有效性。代码将在可能接受时发布。

英文摘要

Pansharpening seeks to fuse high-resolution panchromatic (PAN) and low-resolution multispectral (LRMS) images into a single image with both fine spatial and rich spectral detail. Despite progress in deep learning-based approaches, existing methods often fail to capture global priors inherent in remote sensing data distributions. Diffusion-based models have recently emerged as promising solutions due to their powerful distribution mapping capabilities; however, they suffer from heavy inference latency. We introduce KSDiff, a fast kernel-space diffusion framework that generates convolutional kernels enriched with global context to enhance pansharpening quality and accelerate inference. Specifically, KSDiff constructs these kernels through the integration of a low-rank core tensor generator and a unified factor generator, orchestrated by a structure-aware multi-head attention mechanism. We further introduce a two-stage training strategy tailored for pansharpening, facilitating integration into existing pansharpening architectures. Experiments show that KSDiff achieves superior performance compared to recent promising methods, with over $500 \times$ faster inference than diffusion-based pansharpening baselines. Ablation studies, visualizations, and further evaluations substantiate the effectiveness of our approach. Code will be released upon possible acceptance.

2505.17138 2026-05-19 cs.LG cs.AI

RAP: Runtime Adaptive Pruning for LLM Inference

RAP: 用于大语言模型推理的运行时自适应剪枝

Huanrong Liu, Chunlin Tian, Xuyang Wei, Qingbiao Li, Li Li

AI总结 本文提出RAP,一种基于强化学习的弹性剪枝框架,通过动态调整压缩策略来适应运行时内存变化和异构KV缓存需求,首次在推理过程中同时考虑模型权重和KV缓存。

详情
AI中文摘要

大语言模型(LLMs)在语言理解和生成方面表现出色,但其巨大的计算和内存需求限制了部署。压缩提供了一种潜在的解决方案来缓解这些约束。然而,大多数现有方法依赖于固定的启发式方法,因此无法适应运行时内存变化或来自多样化用户请求的异构KV缓存需求。为了解决这些限制,我们提出了RAP,一种由强化学习(RL)驱动的弹性剪枝框架,能够以运行时感知的方式动态调整压缩策略。具体而言,RAP动态跟踪实际执行过程中模型参数与KV缓存之间的演变比例。认识到前馈网络(FFNs)包含大部分参数,而参数轻量的注意力层主导KV缓存的形成,RL代理只保留那些在当前内存预算内最大化效用的组件,基于即时的工作负载和设备状态。广泛的实验结果表明,RAP优于最先进的基线方法,标志着首次在推理过程中同时考虑模型权重和KV缓存。

英文摘要

Large language models (LLMs) excel at language understanding and generation, but their enormous computational and memory requirements hinder deployment. Compression offers a potential solution to mitigate these constraints. However, most existing methods rely on fixed heuristics and thus fail to adapt to runtime memory variations or heterogeneous KV-cache demands arising from diverse user requests. To address these limitations, we propose RAP, an elastic pruning framework driven by reinforcement learning (RL) that dynamically adjusts compression strategies in a runtime-aware manner. Specifically, RAP dynamically tracks the evolving ratio between model parameters and KV-cache across practical execution. Recognizing that FFNs house most parameters, whereas parameter-light attention layers dominate KV-cache formation, the RL agent retains only those components that maximize utility within the current memory budget, conditioned on instantaneous workload and device state. Extensive experimental results demonstrate that RAP outperforms state-of-the-art baselines and is the first approach to jointly consider model weights and KV-cache on the fly.

2505.16278 2026-05-19 cs.CV cs.AI cs.RO

DriveMoE: Mixture-of-Experts for Vision-Language-Action Model in End-to-End Autonomous Driving

DriveMoE:面向端到端自动驾驶的视觉-语言-动作混合专家模型

Zhenjie Yang, Yilin Chai, Xiaosong Jia, Qifeng Li, Yuqian Shao, Xuekai Zhu, Haisheng Su, Junchi Yan

AI总结 本文提出DriveMoE,一种基于混合专家架构的端到端自动驾驶框架,通过场景专用的视觉混合专家和技能专用的动作混合专家,实现了对复杂驾驶场景的有效处理,展示了在自动驾驶任务中结合视觉和动作混合专家的有效性。

Comments Accepted by CVPR 2026, Project Page: https://thinklab-sjtu.github.io/DriveMoE/

详情
AI中文摘要

端到端自动驾驶(E2E-AD)需要有效处理多视角传感器数据和稳健处理多样且复杂的驾驶场景,特别是罕见的激进转弯等场景。最近混合专家(MoE)架构在大语言模型(LLMs)中的成功表明,参数的专业化能够实现强大的可扩展性。在本工作中,我们提出了DriveMoE,一种新的基于MoE的E2E-AD框架,包含场景专用的视觉MoE和技能专用的动作MoE。DriveMoE基于我们$π_0$视觉-语言-动作(VLA)基线(最初来自具身AI领域),称为Drive-$π_0$。具体而言,我们通过训练一个路由器,根据驾驶上下文动态选择相关摄像头,将视觉MoE添加到Drive-$π_0$中。这种设计模仿了人类驾驶认知,即司机选择性地关注关键视觉线索,而不是穷尽处理所有视觉信息。此外,我们通过训练另一个路由器来激活针对不同驾驶行为的专用专家模块,通过显式的行为专业化,DriveMoE能够处理多样化的场景而不受现有模型中模式平均的困扰。在Bench2Drive闭环评估实验中,DriveMoE实现了最先进的性能,证明了在自动驾驶任务中结合视觉和动作MoE的有效性。我们将发布DriveMoE和Drive-$π_0$的代码和模型。

英文摘要

End-to-end autonomous driving (E2E-AD) demands effective processing of multi-view sensory data and robust handling of diverse and complex driving scenarios, particularly rare maneuvers such as aggressive turns. The recent success of the Mixture-of-Experts (MoE) architecture in Large Language Models (LLMs) demonstrates that specialization of parameters enables strong scalability. In this work, we propose DriveMoE, a novel MoE-based E2E-AD framework, with a Scene-Specialized Vision MoE and a Skill-Specialized Action MoE. DriveMoE is built upon our $π_0$ Vision-Language-Action (VLA) baseline (originally from the embodied AI field), called Drive-$π_0$. Specifically, we add Vision MoE to Drive-$π_0$ by training a router to dynamically select relevant cameras according to the driving context. This design mirrors human driving cognition, where drivers selectively attend to crucial visual cues rather than exhaustively processing all visual information. In addition, we add Action MoE by training another router to activate specialized expert modules for different driving behaviors. Through explicit behavioral specialization, DriveMoE is able to handle diverse scenarios without suffering from mode averaging as existing models do. In Bench2Drive closed-loop evaluation experiments, DriveMoE achieves state-of-the-art (SOTA) performance, demonstrating the effectiveness of combining vision and action MoE in autonomous driving tasks. We will release our code and models of DriveMoE and Drive-$π_0$.

2504.13217 2026-05-19 cs.CL cs.AI

Sustainability via LLM Right-sizing

通过LLM规模适配实现可持续性

Jennifer Haase, Finn Klessascheck, Jan Mendling, Sebastian Pokutta

AI总结 本文研究了在现实应用中,小型本地可部署模型是否足够好,通过评估十一种LLM在日常职业任务中的表现,提出了一种基于可持续性的评估方法,强调在成本、本地部署和隐私方面的需求。

Comments 21 pages, 2 Figures, 6 Tables

详情
AI中文摘要

大型语言模型(LLMs)日益融入组织工作流程,引发了对其能源消耗、财务成本和数据主权的担忧。尽管性能基准常常推崇前沿模型,但实际部署决策需要更广泛的视角:小型、可本地部署的模型何时“足够好”?本研究通过评估十一种专有和开放权重LLM在十种日常职业任务(包括文本摘要、日程生成以及起草邮件和提案)中的表现,给出实证答案。我们使用基于双LLM的评估框架,自动执行任务,并按照与输出质量、事实准确性和伦理责任相关的十项标准进行标准化评估。结果显示,GPT-4o的表现始终最优,但成本和环境足迹显著更高。值得注意的是,Gemma-3和Phi-4等较小的模型在大多数任务上取得了强劲且可靠的结果,表明它们在需要成本效率、本地部署或隐私的场景中是可行的。聚类分析揭示了三类模型:高端全能型、胜任的通用型,以及有限但安全的执行型,凸显了质量、控制和可持续性之间的权衡。重要的是,任务类型会影响模型的有效性:概念性任务对大多数模型构成挑战,而聚合和转换任务的表现更好。我们主张从追求性能最大化的基准转向任务和情境感知的充分性评估,以更好地反映组织的优先事项。我们的方法提供了一种从可持续性视角评估AI模型的可扩展方法,并为实践中负责任的LLM部署提供了可行的指导。

英文摘要

Large language models (LLMs) have become increasingly embedded in organizational workflows. This has raised concerns over their energy consumption, financial costs, and data sovereignty. While performance benchmarks often celebrate cutting-edge models, real-world deployment decisions require a broader perspective: when is a smaller, locally deployable model "good enough"? This study offers an empirical answer by evaluating eleven proprietary and open-weight LLMs across ten everyday occupational tasks, including summarizing texts, generating schedules, and drafting emails and proposals. Using a dual-LLM-based evaluation framework, we automated task execution and standardized evaluation across ten criteria related to output quality, factual accuracy, and ethical responsibility. Results show that GPT-4o delivers consistently superior performance but at a significantly higher cost and environmental footprint. Notably, smaller models like Gemma-3 and Phi-4 achieved strong and reliable results on most tasks, suggesting their viability in contexts requiring cost-efficiency, local deployment, or privacy. A cluster analysis revealed three model groups -- premium all-rounders, competent generalists, and limited but safe performers -- highlighting trade-offs between quality, control, and sustainability. Significantly, task type influenced model effectiveness: conceptual tasks challenged most models, while aggregation and transformation tasks yielded better performances. We argue for a shift from performance-maximizing benchmarks to task- and context-aware sufficiency assessments that better reflect organizational priorities. Our approach contributes a scalable method to evaluate AI models through a sustainability lens and offers actionable guidance for responsible LLM deployment in practice.

2503.14346 2026-05-19 cs.CV

3D Densification for Multi-Map Monocular VSLAM in Endoscopy

3D致密化用于内窥镜多地图单目视觉SLAM

X. Anadón, Javier Rodríguez-Puigvert, J. M. M. Montiel

AI总结 本文提出了一种方法,通过去除异常值和增强地图密度,改进了内窥镜多地图单目视觉SLAM中的3D环境表示,实现了在临床应用中更精确的3D地图重建。

详情
AI中文摘要

将多地图稀疏单目视觉同时定位与建图(VSLAM)应用于单目内窥镜序列,已被证明能够在内窥镜检查中因运动模糊、暂时遮挡、器械交互或水流喷射而频繁丢失跟踪后稳健地恢复跟踪。稀疏多地图足以实现稳健的相机定位,但在环境表示方面表现很差:它们噪声大,不准确重建的3D点比例高,包含显著的异常值,更重要的是,其密度低到临床应用无法接受。我们提出了一种方法,用于去除异常值并致密化最先进的稀疏内窥镜多地图系统CudaSIFT-SLAM的地图。我们借助对伪数据鲁棒的LMedS,将神经网络LightDepth给出的尺度待定的稠密深度预测与稀疏的CudaSIFT子地图对齐。我们的系统在过滤异常值的同时缓解了单目深度估计中固有的尺度模糊问题,从而得到可靠的致密3D地图。我们在C3VD结肠体模数据集上提供了实验证据,表明致密地图精度良好,RMS精度为4.15毫米,且计算时间可以接受。我们还报告了在Endomapper数据集的真实结肠镜检查上的定性结果。

英文摘要

Multi-map sparse monocular visual Simultaneous Localization and Mapping (VSLAM) applied to monocular endoscopic sequences has proven effective at robustly recovering tracking after the frequent losses in endoscopy due to motion blur, temporal occlusion, tool interaction, or water jets. The sparse multi-maps are adequate for robust camera localization; however, they are very poor for environment representation: they are noisy, with a high percentage of inaccurately reconstructed 3D points, including significant outliers, and, more importantly, their density is unacceptably low for clinical applications. We propose a method to remove outliers and densify the maps of CudaSIFT-SLAM, the state of the art for sparse endoscopy multi-map VSLAM. The dense up-to-scale depth predictions of the LightDepth neural network are aligned with the sparse CudaSIFT submaps by means of LMedS, which is robust to spurious data. Our system mitigates the inherent scale ambiguity in monocular depth estimation while filtering outliers, leading to reliable densified 3D maps. We provide experimental evidence of accurate densified maps, with 4.15 mm RMS accuracy at affordable computing time, on the C3VD phantom colon dataset. We also report qualitative results on real colonoscopies from the Endomapper dataset.
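
The alignment step can be pictured with a toy least-median-of-squares (LMedS) scale fit. The sketch assumes that each sparse map point comes with a depth from the SLAM map and a corresponding up-to-scale predicted depth, and that a single scale factor relates the two; the actual system may estimate a richer transformation. The names `lmeds_scale` and `inlier_indices` and the tolerance `tol` are hypothetical.

```python
import statistics


def lmeds_scale(sparse_depths, predicted_depths):
    """Pick the scale s that minimizes the median of squared residuals (sparse - s * predicted)."""
    pairs = [(z, d) for z, d in zip(sparse_depths, predicted_depths) if d > 0]
    best_scale, best_median = None, float("inf")
    for z, d in pairs:
        s = z / d  # hypothesis generated from a single correspondence
        med = statistics.median((zi - s * di) ** 2 for zi, di in pairs)
        if med < best_median:
            best_scale, best_median = s, med
    return best_scale, best_median


def inlier_indices(sparse_depths, predicted_depths, scale, tol):
    """Indices of correspondences whose absolute residual is within tol (assumed threshold)."""
    return [
        i
        for i, (z, d) in enumerate(zip(sparse_depths, predicted_depths))
        if abs(z - scale * d) <= tol
    ]


predicted = [1.0, 2.0, 3.0, 4.0, 5.0]   # up-to-scale network depths
sparse = [2.0, 4.0, 6.0, 8.0, 50.0]     # map depths; the last point is an outlier
scale, _ = lmeds_scale(sparse, predicted)
kept = inlier_indices(sparse, predicted, scale, tol=0.5)
```

Because the median ignores up to half of the residuals, the outlier at index 4 does not affect the estimate: the recovered scale is 2.0 and only indices 0 to 3 are kept.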