arXivDaily: daily arXiv digest, updated Monday through Friday
2508.03578 2026-05-21 cs.CV

RadProPoser: Probabilistic Radar Tensor Human Pose Estimation That Knows Its Limits

Jonas Leo Mueller, Lukas Engel, Eva Dorschky, Daniel Krauss, Ingrid Ullmann, Martin Vossiek, Bjoern M. Eskofier

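AI summary This paper presents RadProPoser, an end-to-end probabilistic framework that predicts 3D body joints with per-joint uncertainties from raw radar tensor data. It achieves a mean per-joint position error of 6.425 cm on a new benchmark dataset, and isotonic recalibration yields calibrated total uncertainty.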

Comments Accepted at IJCNN 2026 (WCCI, Maastricht)

Abstract

Radar-based human pose estimation enables privacy-preserving motion tracking for ambient intelligence, yet the noisy nature of radar sensing makes uncertainty quantification essential. We present RadProPoser, an end-to-end probabilistic framework that predicts three-dimensional body joints with per-joint uncertainties from raw radar tensor data. Using a variational encoder-decoder with spectral attention that fuses real and imaginary radar components across temporal frames, we model aleatoric uncertainty through learnable Gaussian and Laplace distributions. Trained on a new benchmark dataset with optical motion-capture ground truth, our method achieves 6.425 cm mean per-joint position error. The model outputs per-joint aleatoric uncertainties, and isotonic recalibration yields calibrated total uncertainty with expected calibration error of 0.027. Since spectral attention operates on individual radar tensor components, extending to multi-radar configurations requires only concatenating additional input streams. On the HuPR benchmark with dual orthogonal radars, this achieves 5.042 cm MPJPE. The framework runs at 89 frames per second (FPS) on an NVIDIA RTX 3090, exceeding the 15 Hz radar frame rate.
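The metric and the two likelihoods named in the abstract can be written down in a few lines. The following is a minimal sketch in plain Python, not the authors' implementation: the function names, the per-coordinate formulation, and the two-joint example values are assumptions made for illustration.

```python
import math

def mpjpe(pred, target):
    """Mean per-joint position error: average Euclidean distance over joints."""
    dists = [math.dist(p, t) for p, t in zip(pred, target)]
    return sum(dists) / len(dists)

def gaussian_nll(mu, sigma, y):
    """Negative log-likelihood of y under N(mu, sigma^2) for one coordinate."""
    return 0.5 * math.log(2 * math.pi * sigma ** 2) + (y - mu) ** 2 / (2 * sigma ** 2)

def laplace_nll(mu, b, y):
    """Negative log-likelihood of y under Laplace(mu, b) for one coordinate."""
    return math.log(2 * b) + abs(y - mu) / b

# Hypothetical two-joint example, coordinates in centimetres.
pred = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
target = [(3.0, 4.0, 0.0), (10.0, 0.0, 0.0)]
error_cm = mpjpe(pred, target)  # (5 + 0) / 2
```

Training with either likelihood lets the network trade a larger predicted scale for a smaller penalty on large errors, which is how a learned scale can act as a per-joint aleatoric uncertainty.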

2508.02291 2026-05-21 cs.LG cs.AI

FAIR-Pruner: A Flexible Framework for Automatic Layer-Wise Pruning via Tolerance of Difference

Chenqing Lin, Mostafa Hussien, Chengyao Yu, Bingyi Jing, Ruixing Ming, Kim Khoa Nguyen, Mohamed Cheriet

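AI summary This paper presents FAIR-Pruner, a search-free framework for adaptive layer-wise structured pruning that introduces Tolerance of Difference (ToD) to induce non-uniform per-layer pruning depths, achieving strong accuracy-compression trade-offs across multiple datasets and models.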

Comments Submitted to IEEE Transactions on Pattern Analysis and Machine Intelligence

Abstract

Structured pruning is a standard tool for compressing deep neural networks, but its practical performance depends on how sparsity is allocated across layers. We propose FAIR-Pruner, a search-free framework for adaptive layer-wise structured pruning. FAIR-Pruner uses two within-layer rankings: a removal-oriented signal that proposes candidate units and a protection-oriented signal that identifies task-sensitive units. Its core component, Tolerance of Difference (ToD), measures the overlap between the removal prefix and the protected tail, and uses a shared tolerance level to induce non-uniform pruning depths across layers. As a default vision instantiation, FAIR-Pruner combines a Wasserstein-based U-Score for class-conditional unit separability with a Taylor-based R-Score for task-level sensitivity; the same ToD allocation rule can also be paired with alternative removal signals. Theoretically, we analyze ToD through the population R-Score, derive rank-based control of the high-R-Score mass entering the pruning set, and identify an additive exchange condition for same-budget comparison with uniform pruning. Experiments on CIFAR-10, CIFAR-100, SVHN, and ImageNet across VGG, ResNet, DenseNet, ConvNeXt, and DeiT show strong accuracy-compression trade-offs. Prune-only experiments on routed-expert Qwen1.5-MoE-A2.7B-Chat further examine architectural extensibility under matched expert budgets. FAIR-Pruner is released as a pip-installable open-source package.
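One way to picture how a shared tolerance can produce different pruning depths per layer is sketched below. This is an illustration only: the abstract does not give the exact ToD formula, so the overlap ratio, the choice of an equally sized protected set, and the name `prune_depth` are all assumptions, and this is not the API of the released package.

```python
def prune_depth(removal_score, protect_score, tolerance):
    """Largest k whose removal prefix overlaps the top-k protected units by at most `tolerance`.

    Assumed convention: higher removal_score means "more removable",
    higher protect_score means "more task-sensitive".
    """
    n = len(removal_score)
    by_removal = sorted(range(n), key=lambda i: removal_score[i], reverse=True)
    by_protect = sorted(range(n), key=lambda i: protect_score[i], reverse=True)
    best = 0
    for k in range(1, n + 1):
        overlap = len(set(by_removal[:k]) & set(by_protect[:k]))
        if overlap / k <= tolerance:
            best = k
    return best, by_removal[:best]

# Two hypothetical layers evaluated under the same tolerance.
layer_a = prune_depth([0.9, 0.8, 0.1, 0.2], [0.1, 0.2, 0.9, 0.8], tolerance=0.25)
layer_b = prune_depth([0.9, 0.8, 0.7, 0.6], [0.9, 0.1, 0.2, 0.3], tolerance=0.25)
```

In this toy setup the two rankings of layer A disagree, so two units can be pruned, while in layer B the most removable unit is also the most protected one and nothing is pruned at the same tolerance.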

2507.23313 2026-05-21 cs.CV

The Cow of Rembrandt - Analyzing Artistic Prompt Interpretation in Text-to-Image Models

Alfio Ferrara, Sergio Picascia, Elisabetta Rocchetti

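AI summary This paper studies how text-to-image diffusion models interpret the concepts of content and style when generating artworks. It uses cross-attention heatmaps to attribute pixels in generated images to specific prompt tokens, shows how the degree of content-style separation varies across artistic prompts and styles, and offers new insight into how large-scale generative models internally represent complex artistic concepts.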

Comments to be published in: Applications of AI in the Analysis of Cultural and Artistic Heritage, organized within the 35th IEEE International Workshop on Machine Learning for Signal Processing (MLSP) 2025

Abstract

Text-to-image diffusion models have demonstrated remarkable capabilities in generating artistic content by learning from billions of images, including popular artworks. However, the fundamental question of how these models internally represent concepts, such as content and style in paintings, remains unexplored. Traditional computer vision assumes content and style are orthogonal, but diffusion models receive no explicit guidance about this distinction during training. In this work, we investigate how transformer-based text-to-image diffusion models encode content and style concepts when generating artworks. We leverage cross-attention heatmaps to attribute pixels in generated images to specific prompt tokens, enabling us to isolate image regions influenced by content-describing versus style-describing tokens. Our findings reveal that diffusion models demonstrate varying degrees of content-style separation depending on the specific artistic prompt and style requested. In many cases, content tokens primarily influence object-related regions while style tokens affect background and texture areas, suggesting an emergent understanding of the content-style distinction. These insights contribute to our understanding of how large-scale generative models internally represent complex artistic concepts without explicit supervision. We share the code and dataset, together with an exploratory tool for visualizing attention maps at https://github.com/umilISLab/artistic-prompt-interpretation.
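A simple way to quantify how much a content token's heatmap and a style token's heatmap cover the same pixels is the intersection over union of their thresholded masks. The sketch below assumes heatmaps are already extracted and normalized to [0, 1]; the threshold, the metric, and the toy values are assumptions for illustration and are not taken from the authors' repository.

```python
def binarize(heatmap, threshold=0.5):
    """Set of (row, col) positions whose attention value reaches the threshold."""
    return {(r, c) for r, row in enumerate(heatmap)
            for c, v in enumerate(row) if v >= threshold}

def iou(mask_a, mask_b):
    """Intersection over union of two pixel sets; 0.0 when both are empty."""
    union = mask_a | mask_b
    return len(mask_a & mask_b) / len(union) if union else 0.0

# Hypothetical 2x3 heatmaps for one content token and one style token.
content_map = [[0.9, 0.8, 0.1],
               [0.7, 0.2, 0.1]]
style_map = [[0.1, 0.6, 0.9],
             [0.1, 0.8, 0.7]]
overlap = iou(binarize(content_map), binarize(style_map))
```

A low value indicates that the two tokens influence mostly different regions, which is the kind of content-style separation the abstract describes.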

2507.09180 2026-05-21 cs.CV cs.RO

Multimodal Fusion for Sim2real Transfer in Visual Reinforcement Learning

Zichun Xu, Jingdong Zhao, Chenyu Guo, Qianxue Zhang, Liao Zhang, Xiao Zhang, Yiming Ren, Lian Zhang, Zengren Zhao

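AI summary This paper proposes a vision-transformer-based multimodal fusion framework that fuses RGB and depth information to improve generalization, together with a contrastive learning scheme and a curriculum-based domain randomization scheme that improve sample efficiency and transfer performance. Experiments show that the method performs well on real-world tasks.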

Abstract

Depth information is robust to scene appearance variations and inherently carries 3D spatial details. Thus, a visual backbone based on the vision transformer is proposed to fuse RGB and depth modalities for enhancing generalization in this paper. Different modalities are first processed by separate CNN stems, and the combined convolutional features are delivered to the scalable vision transformer to obtain visual representations. Moreover, a contrastive learning scheme is designed with masked and unmasked tokens to enhance the sample efficiency and generalization performance. A curriculum-based domain randomization scheme is used to flexibly stabilize the training process. Finally, simulation results demonstrate that our fusion scheme outperforms the other baselines. The feasibility of our model is validated to perform real-world manipulation tasks via zero-shot transfer.

2506.21039 2026-05-21 cs.LG cs.AI

Strict Subgoal Execution: Reliable Long-Horizon Planning in Hierarchical Reinforcement Learning

Jaebak Hwang, Sanghyeon Lee, Jeongmo Kim, Seungyul Han

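AI summary This paper proposes the Strict Subgoal Execution (SSE) framework, which uses Frontier Experience Replay (FER) to separate unreachable from admissible subgoals and streamline high-level decision making, yielding more reliable planning in long-horizon tasks.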

Comments 10 pages for main, 26 pages for total, Accepted to ICLR 2026

Journal ref
International Conference on Learning Representations (ICLR), 2026
Abstract

Long-horizon goal-conditioned tasks pose fundamental challenges for reinforcement learning (RL), particularly when goals are distant and rewards are sparse. While hierarchical and graph-based methods offer partial solutions, their reliance on conventional hindsight relabeling often fails to correct subgoal infeasibility, leading to inefficient high-level planning. To address this, we propose Strict Subgoal Execution (SSE), a graph-based hierarchical RL framework that integrates Frontier Experience Replay (FER) to separate unreachable from admissible subgoals and streamline high-level decision making. FER delineates the reachability frontier using failure and partial-success transitions, which identifies unreliable subgoals, increases subgoal reliability, and reduces unnecessary high-level decisions. Additionally, SSE employs a decoupled exploration policy to cover underexplored regions of the goal space and a path refinement that adjusts edge costs using observed low-level failures. Experimental results across diverse long-horizon benchmarks show that SSE consistently outperforms existing goal-conditioned and hierarchical RL methods in both efficiency and success rate. Our code is available at https://jaebak1996.github.io/SSE/
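The idea of adjusting edge costs from observed low-level failures can be illustrated with a shortest-path search over a small subgoal graph. This is a sketch under stated assumptions, not the authors' code: the multiplicative penalty, the function names, and the toy graph are hypothetical.

```python
import heapq

def shortest_path(graph, start, goal):
    """Dijkstra search; graph maps node -> {neighbor: cost}."""
    queue = [(0.0, start, [start])]
    seen = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if node in seen:
            continue
        seen.add(node)
        for nxt, weight in graph.get(node, {}).items():
            if nxt not in seen:
                heapq.heappush(queue, (cost + weight, nxt, path + [nxt]))
    return float("inf"), []

def refine_costs(graph, failures, penalty=2.0):
    """Inflate each edge cost by the number of observed failures on that edge (assumed rule)."""
    return {u: {v: w * (1.0 + penalty * failures.get((u, v), 0))
                for v, w in nbrs.items()}
            for u, nbrs in graph.items()}

# Hypothetical subgoal graph: the route via "a" is shorter but has failed once.
graph = {"s": {"a": 1.0, "b": 1.5}, "a": {"g": 1.0}, "b": {"g": 1.5}, "g": {}}
before = shortest_path(graph, "s", "g")
after = shortest_path(refine_costs(graph, {("a", "g"): 1}), "s", "g")
```

After one recorded failure on the edge from "a" to "g", the planner switches to the longer but more reliable route through "b".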

2506.17631 2026-05-21 cs.LG cs.AI

Time-Prompt: Integrated Heterogeneous Prompts for Unlocking LLMs in Time Series Forecasting

Zesen Wang, Lijuan Lan, Yonggang Li

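AI summary This paper proposes the Time-Prompt framework, which improves time series forecasting by constructing a unified prompt paradigm, designing a semantic space embedding and cross-modal alignment module, and efficiently fine-tuning the LLM's parameters, and validates it on carbon emission datasets.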

Comments Accepted at IJCNN 2026

Abstract

Time series forecasting aims to model temporal dependencies among variables for future state inference, holding significant importance and widespread applications in real-world scenarios. Although deep learning-based methods have achieved remarkable progress, they still exhibit suboptimal performance in long-term forecasting. Recent research demonstrates that large language models (LLMs) achieve promising performance in time series forecasting, but this progress is still met with skepticism about whether LLMs are truly useful for this task. To address this, we propose Time-Prompt, a framework for activating LLMs for time series forecasting. Specifically, we first construct a unified prompt paradigm with learnable soft prompts to guide the LLM's behavior and textualized hard prompts to enhance the time series representations. Second, to enhance the LLM's comprehensive understanding of the forecasting task, we design a semantic space embedding and cross-modal alignment module to achieve fusion of temporal and textual data. Finally, we efficiently fine-tune the LLM's parameters using time series data. Furthermore, we focus on carbon emissions, aiming to provide a modest contribution to global carbon neutrality. Comprehensive evaluations on 6 public datasets and 3 carbon emission datasets demonstrate that Time-Prompt is a powerful framework for time series forecasting.

2505.19075 2026-05-21 cs.AI cs.CL cs.LG

Universal Reasoner: A Single, Composable Plug-and-Play Reasoner for Frozen LLMs

Jaemin Kim, Hangeol Chang, Hyunmin Hwang, Choonghan Kim, Jong Chul Ye

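AI summary This paper proposes Universal Reasoner, a composable, plug-and-play reasoning module that provides specialized reasoning capabilities to frozen large language models through a shared or aligned token space and exhibits weak-to-strong generalization. Experiments show that it surpasses existing fine-tuning methods on mathematical reasoning and machine translation.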

Comments ICML 2026

Abstract

Large Language Models (LLMs) have demonstrated remarkable general capabilities, but enhancing skills such as reasoning often demands substantial computational resources and may compromise generalization. While Parameter-Efficient Fine-Tuning (PEFT) methods offer a more resource-conscious alternative, they typically require retraining for each LLM backbone due to architectural dependencies. To address these challenges, we propose Universal Reasoner (UniR), a modular, composable, and plug-and-play reasoning module that can be used with larger frozen LLMs to provide specialized reasoning capabilities with a shared or aligned token space. Specifically, UniR decomposes the reward into a standalone reasoning module trained in a decoupled manner using verifiable rewards, effectively translating trajectory-level signals into token-level guidance. Once trained, UniR is combined with frozen LLMs at inference time by simply adding its output logits to those of the backbone. This additive structure enables modular composition: multiple UniR modules trained for different tasks can be jointly applied by summing their logits, enabling complex reasoning via composition. Furthermore, UniR demonstrates weak-to-strong generalization, where reasoning modules trained on smaller models effectively guide much larger LLMs in the same model family, and generalize across domains such as in vision language models and medical reasoning. Experiments on mathematical reasoning and machine translation show that UniR surpasses existing fine-tuning methods. Code is open-sourced at https://github.com/hangeol/UniR.
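The additive combination described in the abstract amounts to summing logit vectors over a shared vocabulary before the softmax. The sketch below shows that arithmetic on a three-token toy vocabulary; the function names and the logit values are made up for illustration and are not taken from the UniR repository.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def guided_distribution(backbone_logits, module_logits_list):
    """Add each reasoning module's logits to the frozen backbone's logits, then normalize."""
    combined = list(backbone_logits)
    for module_logits in module_logits_list:
        combined = [c + m for c, m in zip(combined, module_logits)]
    return softmax(combined)

# Hypothetical next-token logits over a shared three-token vocabulary.
backbone = [2.0, 1.0, 0.0]
math_module = [0.0, 2.0, 0.0]
plain = guided_distribution(backbone, [])
guided = guided_distribution(backbone, [math_module])
```

Because the combination is a plain sum, passing several module logit vectors in the list composes them without touching the backbone's parameters, provided all of them index the same token space.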

2505.14654 2026-05-21 cs.CV cs.AI cs.CL

Beyond Words: Multimodal LLM Knows When to Speak

Zikai Liao, Yi Ouyang, Yi-Lun Lee, Chen-Ping Yu, Yi-Hsuan Tsai, Zhaozheng Yin

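AI summary This paper proposes a multimodal strategy that uses synchronized video, audio, and text cues to improve awareness of response timing in conversation, improving how accurately large language models decide when to respond.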

Comments Project page: https://github.com/lzk901372/MM-When2Speak

Abstract

Chatbots via large language models (LLMs) generate fluent responses but often struggle with when to speak, especially for brief, timely listener reactions during ongoing dialogue. We present a multimodal strategy for LLMs, which leverages synchronized video, audio, and text cues to improve conversational timing awareness. The strategy reformulates response timing as a dense response-type prediction task, enabling an agent to decide whether to remain silent, produce a short reaction, or start a full response under streaming constraints. Therefore, we introduce a curated multimodal dataset from real-world dyadic conversational videos with temporally aligned modalities and fine-grained reaction type annotations. Moreover, we design a multimodal strategy, MM-When2Speak, with a multimodal integration module on top of an LLM backbone. Experiments across various modality settings and strong LLM baselines show that MM-When2Speak achieves up to a 3x improvement in response type prediction performance, highlighting the importance of multimodal perception for natural and engaging conversational interaction.

2504.19584 2026-05-21 cs.CV

ShowMak3r: Compositional TV Show Reconstruction

Sangmin Kim, Seunguk Do, Daeun Lee, Jaesik Park

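AI summary This paper presents ShowMak3r, a comprehensive pipeline for dynamic reconstruction of TV show scenes that allows scenes to be edited much as clips are cut in a production control room, addressing challenges in dynamic radiance field reconstruction such as occlusion, cluttered stages, and viewpoint changes.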

Comments Project page: https://nstar1125.github.io/showmak3r

Abstract

Reconstructing dynamic radiance fields from video clips is challenging, especially when entertainment videos like TV shows are given. Reconstruction is difficult because of (1) actors occluding each other and having diverse facial expressions, (2) cluttered stages, and (3) small baseline views or sudden shot changes. To address these issues, we present ShowMak3r, a comprehensive reconstruction pipeline that allows scenes to be edited in the way video clips are made in a production control room. In ShowMak3r, a 3DLocator module locates recovered actors on the stage using a depth prior and estimates unseen human poses via interpolation. The proposed ShotMatcher module then tracks the actors under shot changes. Furthermore, ShowMak3r introduces a face-fitting network that dynamically recovers the actors' expressions. Experiments on the Sitcoms3D dataset show that our pipeline can reassemble TV show scenes with new cameras at different timestamps. We also demonstrate that ShowMak3r enables interesting applications such as synthetic shot-making, actor relocation, insertion, deletion, and pose manipulation. Project page: https://nstar1125.github.io/showmak3r

2504.13109 2026-05-21 cs.CV

UniEdit-Flow: Unleashing Inversion and Editing in the Era of Flow Models

Guanlong Jiao, Biqing Huang, Kuan-Chieh Wang, Renjie Liao

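AI summary This paper proposes a predictor-corrector framework for inversion and editing in flow models, with Uni-Inv for accurate reconstruction and Uni-Edit for region-aware image editing. The method is tuning-free, general, and efficient, and experiments show strong results across a range of generative models.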

Comments ICLR 2026. Project Page: https://uniedit-flow.github.io/

Abstract

Flow matching models have emerged as a strong alternative to diffusion models, but existing inversion and editing methods designed for diffusion are often ineffective or inapplicable to them. The straight-line, non-crossing trajectories of flow models pose challenges for diffusion-based approaches but also open avenues for novel solutions. In this paper, we introduce a predictor-corrector-based framework for inversion and editing in flow models. First, we propose Uni-Inv, an effective inversion method designed for accurate reconstruction. Building on this, we extend the concept of delayed injection to flow models and introduce Uni-Edit, a region-aware, robust image editing approach. Our methodology is tuning-free, model-agnostic, efficient, and effective, enabling diverse edits while ensuring strong preservation of edit-irrelevant regions. Extensive experiments across various generative models demonstrate the superiority and generalizability of Uni-Inv and Uni-Edit, even under low-cost settings. Project page: https://uniedit-flow.github.io/
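To see why a predictor-corrector step can help inversion, it is enough to integrate a toy one-dimensional flow backward and then forward again and compare the round-trip error with plain Euler steps. The sketch below uses the generic Heun scheme, which is a textbook predictor-corrector method and not the authors' Uni-Inv; the velocity field, the time convention (data at t = 1, noise at t = 0), and the step count are assumptions.

```python
def velocity(x, t):
    """Toy stand-in for a learned velocity field."""
    return -x

def euler_step(x, t, h):
    return x + h * velocity(x, t)

def heun_step(x, t, h):
    x_pred = x + h * velocity(x, t)  # predictor: plain Euler guess
    # corrector: average the slopes at the start point and at the predicted point
    return x + 0.5 * h * (velocity(x, t) + velocity(x_pred, t + h))

def integrate(x, t0, t1, steps, step_fn):
    h = (t1 - t0) / steps  # negative when integrating backward in time
    t = t0
    for _ in range(steps):
        x = step_fn(x, t, h)
        t += h
    return x

def round_trip_error(step_fn, x0=1.0, steps=100):
    latent = integrate(x0, 1.0, 0.0, steps, step_fn)    # inversion
    recon = integrate(latent, 0.0, 1.0, steps, step_fn)  # reconstruction
    return abs(recon - x0)

euler_error = round_trip_error(euler_step)
heun_error = round_trip_error(heun_step)
```

On this toy field the Heun round trip returns much closer to the starting point than the Euler round trip at the same number of steps, which is the kind of reconstruction accuracy an inversion method aims for.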

2504.06925 2026-05-21 cs.CV cs.AI

Are Vision-Language Models Ready for Dietary Assessment? Exploring the Next Frontier in AI-Powered Food Image Recognition

Sergio Romero-Tapiador, Ruben Tolosana, Blanca Lacruz-Pleguezuelos, Laura Judith Marcos Zambrano, Guadalupe X. Bazán, Isabel Espinosa-Salinas, Julian Fierrez, Javier Ortega-Garcia, Enrique Carrillo de Santa Pau, Aythami Morales

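AI summary This paper evaluates the food recognition capabilities of six state-of-the-art vision-language models at different levels, proposes a new evaluation metric, and demonstrates the potential of the FoodNExTDB database for dietary assessment.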

Comments Accepted at IEEE/CVF Computer Vision and Pattern Recognition Conference workshops 2025 (CVPRw) 10 pages, 4 figures, 2 tables

Journal ref
2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025, pp. 1-10
Abstract

Automatic dietary assessment based on food images remains a challenge, requiring precise food detection, segmentation, and classification. Vision-Language Models (VLMs) offer new possibilities by integrating visual and textual reasoning. In this study, we evaluate six state-of-the-art VLMs (ChatGPT, Gemini, Claude, Moondream, DeepSeek, and LLaVA), analyzing their capabilities in food recognition at different levels. For the experimental framework, we introduce the FoodNExTDB, a unique food image database that contains 9,263 expert-labeled images across 10 categories (e.g., "protein source"), 62 subcategories (e.g., "poultry"), and 9 cooking styles (e.g., "grilled"). In total, FoodNExTDB includes 50k nutritional labels generated by seven experts who manually annotated all images in the database. Also, we propose a novel evaluation metric, Expert-Weighted Recall (EWR), that accounts for the inter-annotator variability. Results show that closed-source models outperform open-source ones, achieving over 90% EWR in recognizing food products in images containing a single product. Despite their potential, current VLMs face challenges in fine-grained food recognition, particularly in distinguishing subtle differences in cooking styles and visually similar food items, which limits their reliability for automatic dietary assessment. The FoodNExTDB database is publicly available at https://github.com/AI4Food/FoodNExtDB.

2503.08292 2026-05-21 cs.CL cs.AI

Do LLMs Triage Like Clinicians? A Dynamic Study of Outpatient Referral

Xiaoxiao Liu, Qingying Xiao, Bingquan Zhang, Junying Chen, Xiangyi Feng, Ziniu Li, Xiang Wan, Jian Chang, Guangjun Yu, Yan Hu, Benyou Wang

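AI summary This paper studies how large language models perform in dynamic outpatient referral and finds that they outperform traditional classifiers in dynamic scenarios by asking effective questions that reduce uncertainty, while offering limited advantages in static scenarios.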

Abstract

Outpatient referral (OR) is a core clinical workflow that assigns patients to hospital departments under incomplete and evolving information, yet it is commonly simplified as a static classification problem despite being inherently interactive in practice. In this work, we study outpatient referral as a dynamic process driven by information acquisition and uncertainty reduction. We analyze both static scenarios based on fixed patient information and dynamic scenarios involving multi-turn dialogue, to test whether large language models (LLMs) improve referral outcomes through better prediction or more effective questioning. Our findings show that LLMs offer limited advantages over traditional classifiers in static referral accuracy, but consistently outperform them in dynamic settings by asking discriminative follow-up questions that reduce uncertainty over candidate departments. These results suggest that the primary value of LLMs in outpatient referral lies not in static prediction, but in supporting interactive, uncertainty-aware clinical decision-making.
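What it means for a follow-up question to be discriminative can be made concrete with expected information gain over candidate departments. The sketch below is an illustration of that idea and not the paper's evaluation protocol: the two departments, the yes/no question model, and the likelihood values are hypothetical.

```python
import math

def entropy(dist):
    """Shannon entropy in bits of a {department: probability} distribution."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def posterior(prior, likelihood_yes, answer):
    """Bayes update after a yes/no answer; likelihood_yes[d] = P(yes | department d)."""
    unnorm = {d: prior[d] * (likelihood_yes[d] if answer else 1 - likelihood_yes[d])
              for d in prior}
    total = sum(unnorm.values())
    return {d: v / total for d, v in unnorm.items()}

def expected_information_gain(prior, likelihood_yes):
    """Expected drop in entropy over departments from asking one yes/no question."""
    p_yes = sum(prior[d] * likelihood_yes[d] for d in prior)
    gain = entropy(prior)
    for answer, p_answer in ((True, p_yes), (False, 1 - p_yes)):
        if p_answer > 0:
            gain -= p_answer * entropy(posterior(prior, likelihood_yes, answer))
    return gain

prior = {"cardiology": 0.5, "gastroenterology": 0.5}
chest_pain = {"cardiology": 0.9, "gastroenterology": 0.1}  # discriminative question
fever = {"cardiology": 0.5, "gastroenterology": 0.5}       # uninformative question
gain_chest_pain = expected_information_gain(prior, chest_pain)
gain_fever = expected_information_gain(prior, fever)
```

A question whose answer is equally likely under every department leaves the distribution unchanged and gains nothing, whereas a question that splits the departments removes a large share of the uncertainty in a single turn.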

2502.18915 2026-05-21 cs.CL cs.AI

END: Early Noise Dropping for Efficient and Effective Context Denoising

Hongye Jin, Pei Chen, Jingfeng Yang, Zhengyang Wang, Fangran Mo, Jinghan Zhang, Meng Jiang, Yifan Gao, Binxuan Huang, Xinyang Zhang, Zheng Li, Tianyi Liu, Huasheng Li, Bing Yin

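AI summary This paper proposes END, which segments the input sequence into chunks and applies a linear probe at early layers to identify and drop noisy chunks, improving the performance and efficiency of LLMs across tasks while deepening understanding of how LLMs reason over context internally.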

Abstract

Large Language Models (LLMs) have demonstrated remarkable performance across a wide range of natural language processing tasks. However, they are often distracted by irrelevant or noisy context in input sequences that degrades output quality. This problem affects both long- and short-context scenarios, such as retrieval-augmented generation, table question-answering, and in-context learning. We reveal that LLMs can implicitly identify whether input sequences contain useful information at early layers, prior to token generation. Leveraging this insight, we introduce Early Noise Dropping (END), a novel approach to mitigate this issue without requiring fine-tuning of the LLMs. END segments input sequences into chunks and employs a linear prober on the early layers of LLMs to differentiate between informative and noisy chunks. By discarding noisy chunks early in the process, END preserves critical information, reduces distraction, and lowers computational overhead. Extensive experiments demonstrate that END significantly improves both performance and efficiency across different LLMs on multiple evaluation datasets. Furthermore, by investigating LLMs' implicit understanding of the input with the prober, this work also deepens understanding of how LLMs reason over context internally.
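The chunk-then-probe procedure can be sketched without a real model by standing in made-up feature vectors for the early-layer hidden states. Everything specific here is an assumption for illustration: the chunk size, the two-dimensional features, the probe weights, and the zero threshold are hypothetical, and a real probe would be trained on pooled hidden states from the LLM.

```python
def chunk(tokens, size):
    """Split a token list into consecutive chunks of at most `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def probe_score(features, weights, bias):
    """Linear probe: dot product of chunk features and weights, plus bias."""
    return sum(f * w for f, w in zip(features, weights)) + bias

def drop_noisy_chunks(chunks, chunk_features, weights, bias, threshold=0.0):
    """Keep only chunks whose probe score exceeds the threshold."""
    return [c for c, f in zip(chunks, chunk_features)
            if probe_score(f, weights, bias) > threshold]

tokens = ["q1", "q2", "n1", "n2", "a1", "a2"]
chunks = chunk(tokens, 2)
# Hypothetical pooled early-layer features, one vector per chunk.
features = [[1.0, 0.2], [-0.8, 0.1], [0.6, 0.4]]
kept = drop_noisy_chunks(chunks, features, weights=[1.0, 0.5], bias=-0.2)
```

Only the kept chunks would be passed on to the remaining layers, which is where the reduction in distraction and compute would come from.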

2502.12120 2026-05-21 cs.LG cs.AI cs.CL

LLMs on the Line: Data Determines Loss-to-Loss Scaling Laws

Prasanna Mayilvahanan, Thaddäus Wiedemer, Sayak Mallick, Matthias Bethge, Wieland Brendel

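AI summary This study examines which factors most influence loss-to-loss scaling laws for LLMs and finds that the pretraining data determines the scaling trend, whereas model size, optimization hyperparameters, tokenizer, and architectural differences have limited impact. Pretraining data should therefore be curated carefully for the best downstream performance.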

Comments ICML 2025 camera-ready version

Abstract

Scaling laws guide the development of large language models (LLMs) by offering estimates for the optimal balance of model size, tokens, and compute. More recently, loss-to-loss scaling laws that relate losses across pretraining datasets and downstream tasks have emerged as a powerful tool for understanding and improving LLM performance and generalization. In this work, we investigate which factors most strongly influence loss-to-loss scaling. Our experiments reveal that the pretraining data determines the scaling trend. In contrast, model size, optimization hyperparameters, tokenizer and even significant architectural differences, such as between transformer-based models like Llama and state-space models like Mamba, generally have limited impact. Consequently, practitioners should carefully curate suitable pretraining datasets for optimal downstream performance, while architectures and other settings can be freely optimized for training efficiency.

2502.03752 2026-05-21 cs.LG cs.AI

Self-Improving Skill Learning for Robust Skill-based Meta-Reinforcement Learning

Sanghyeon Lee, Sangjun Bae, Yisak Park, Seungyul Han

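AI summary This paper proposes Self-Improving Skill Learning (SISL), which performs self-guided skill refinement with decoupled high-level and skill improvement policies and prioritizes skills via maximum return relabeling, achieving robust and stable adaptation under noisy and suboptimal data and outperforming other skill-based meta-RL methods.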

Comments 10 pages main, 27 pages appendix with reference. Accepted to ICLR 2026

Journal ref
International Conference on Learning Representations (ICLR), 2026
Abstract

Meta-reinforcement learning (Meta-RL) facilitates rapid adaptation to unseen tasks but faces challenges in long-horizon environments. Skill-based approaches tackle this by decomposing state-action sequences into reusable skills and employing hierarchical decision-making. However, these methods are highly susceptible to noisy offline demonstrations, leading to unstable skill learning and degraded performance. To address this, we propose Self-Improving Skill Learning (SISL), which performs self-guided skill refinement using decoupled high-level and skill improvement policies, while applying skill prioritization via maximum return relabeling to focus updates on task-relevant trajectories, resulting in robust and stable adaptation even under noisy and suboptimal data. By mitigating the effect of noise, SISL achieves reliable skill learning and consistently outperforms other skill-based meta-RL methods on diverse long-horizon tasks. Our code is available at https://epsilog.github.io/SISL.

2502.02844 2026-05-21 cs.LG cs.AI cs.CR cs.MA

Wolfpack Adversarial Attack for Robust Multi-Agent Reinforcement Learning

Sunwoo Lee, Jaebak Hwang, Yonghyeon Jo, Seungyul Han

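AI summary This paper proposes the Wolfpack Adversarial Attack framework, a coordinated attack on cooperative multi-agent reinforcement learning (MARL) that targets an initial agent and its assisting agents, and introduces the Wolfpack-Adversarial Learning for MARL (WALL) framework to train robust MARL policies that defend against it.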

Comments 9 pages main, 23 pages appendix with reference. Accepted by ICML 2025

Journal ref
Proceedings of Machine Learning Research (PMLR), ICML 2025
English abstract

Traditional robust methods in multi-agent reinforcement learning (MARL) often struggle against coordinated adversarial attacks in cooperative scenarios. To address this limitation, we propose the Wolfpack Adversarial Attack framework, inspired by wolf hunting strategies, which targets an initial agent and its assisting agents to disrupt cooperation. Additionally, we introduce the Wolfpack-Adversarial Learning for MARL (WALL) framework, which trains robust MARL policies to defend against the proposed Wolfpack attack by fostering systemwide collaboration. Experimental results underscore the devastating impact of the Wolfpack attack and the significant robustness improvements achieved by WALL. Our code is available at https://github.com/sunwoolee0504/WALL.

2502.02834 2026-05-21 cs.LG cs.AI

Task-Aware Virtual Training: Enhancing Generalization in Meta-Reinforcement Learning for Out-of-Distribution Tasks

Jeongmo Kim, Yisak Park, Minung Kim, Seungyul Han

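AI summary This paper proposes Task-Aware Virtual Training (TAVT), which uses metric-based representation learning to improve the generalization of meta-reinforcement learning to out-of-distribution tasks, preserving task characteristics in virtual tasks and applying state regularization to reduce overestimation errors in state-varying environments.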

Comments 9 pages main paper, 20 pages appendices with reference. Accepted to ICML 2025

Journal ref
Proceedings of Machine Learning Research (PMLR), ICML 2025
English abstract

Meta reinforcement learning aims to develop policies that generalize to unseen tasks sampled from a task distribution. While context-based meta-RL methods improve task representation using task latents, they often struggle with out-of-distribution (OOD) tasks. To address this, we propose Task-Aware Virtual Training (TAVT), a novel algorithm that accurately captures task characteristics for both training and OOD scenarios using metric-based representation learning. Our method successfully preserves task characteristics in virtual tasks and employs a state regularization technique to mitigate overestimation errors in state-varying environments. Numerical results demonstrate that TAVT significantly enhances generalization to OOD tasks across various MuJoCo and MetaWorld environments. Our code is available at https://github.com/JM-Kim-94/tavt.git.

2501.15151 2026-05-21 cs.CV

SpikeDet: Better Firing Patterns for Accurate and Energy-Efficient Object Detection with Spiking Neural Networks

Yimeng Fan, Changsong Liu, Mingyang Li, Dongze Liu, Yuting Su, Yanyan Liu, Wei Zhang

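AI summary This paper proposes SpikeDet, a spiking neural network object detector that optimizes firing patterns for more accurate and energy-efficient detection. It designs the spiking backbone MDSNet, which adjusts the membrane synaptic input distribution at each layer for better spiking feature extraction; introduces the Spiking Multi-direction Fusion Module (SMFM) for multi-direction fusion and stronger multi-scale detection; and proposes the Local Firing Saturation Index (LFSI) to quantify local firing saturation. On COCO 2017 it reaches 52.2% AP, 3.3% AP above previous SNN-based methods, at half the energy consumption.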

English abstract

Spiking Neural Networks (SNNs) are the third generation of neural networks. They have gained widespread attention in object detection due to their low energy consumption and biological interpretability. However, existing SNN-based object detection methods suffer from local firing saturation, where adjacent neurons concurrently reach maximum firing rates, especially in object-centric regions. This abnormal neuron firing pattern reduces the feature discrimination capability and detection accuracy, while also increasing the firing rates that prevent SNNs from achieving their potential energy efficiency. To address this problem, we propose SpikeDet, a novel spiking object detector that optimizes firing patterns for accurate and energy-efficient detection. Specifically, we design a spiking backbone network, MDSNet, which effectively adjusts the membrane synaptic input distribution at each layer, achieving better neuron firing patterns during spiking feature extraction. For the neck, to better utilize and preserve these high-quality backbone features, we introduce the Spiking Multi-direction Fusion Module (SMFM), which realizes multi-direction fusion of spiking features, enhancing the multi-scale detection capability of the model. Furthermore, we propose the Local Firing Saturation Index (LFSI) to quantitatively measure local firing saturation. Experimental results validate the effectiveness of our method. On the COCO 2017 dataset, it achieves 52.2% AP, outperforming previous SNN-based methods by 3.3% AP while requiring only half the energy consumption. On object detection sub-tasks, including event-based GEN1, underwater URPC 2019, low-light ExDARK, and dense scene CrowdHuman datasets, SpikeDet also achieves the best performance.

2501.02407 2026-05-21 cs.CL cs.CR cs.LG

Towards the Anonymization of the Language Modeling

Antoine Boutet, Lucas Magnana, Juliette Sénéchal

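AI summary This paper presents a privacy-preserving language modeling approach, with a Masking Language Modeling (MLM) methodology for BERT-like models and a Causal Language Modeling (CLM) methodology for GPT-like models, to address language model anonymization and thereby promote model sharing. Evaluated on a medical dataset, both approaches avoid memorizing direct and indirect identifiers while maintaining high privacy and high utility.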

English abstract

Rapid advances in Natural Language Processing (NLP) have revolutionized many fields, including healthcare. However, these advances raise significant privacy concerns, especially when pre-trained models that are fine-tuned and specialized on sensitive data memorize and later expose personal information. This paper presents a privacy-preserving language modeling approach to address the problem of language model anonymization and thus promote model sharing. Specifically, we propose both a Masking Language Modeling (MLM) methodology to specialize a BERT-like language model and a Causal Language Modeling (CLM) methodology to specialize a GPT-like model, each designed to prevent the model from memorizing direct and indirect identifying information present in the training data. We have comprehensively evaluated our approaches using a medical dataset and compared them against different baselines. Our results indicate that by avoiding memorizing both direct and indirect identifiers during model specialization, our masking and causal language modeling schemes offer a good tradeoff for maintaining high privacy while retaining high utility.

2412.14738 2026-05-21 cs.LG

Spectrally unstable nodes drive reliability failures in graph learning

Yongyu Wang

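AI summary This study examines how spectrally unstable nodes drive reliability failures in graph learning and proposes a reliability-aware intervention that isolates these nodes, improving robustness under both adversarial and intrinsic noise.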

English abstract

Graph-learning algorithms can fail when graph structure is adversarially perturbed, intrinsically noisy or constructed from imperfect observations. Here we show that some nodes bear much greater responsibility than others for allowing adversarial perturbations and intrinsic noise to harm graph-learning algorithms. Building on graph-spectral distortion analysis, we identify these failure-driving nodes and introduce a reliability-aware intervention that isolates them from the main learning step. The target algorithm is applied to a stable induced subgraph, and predictions for isolated nodes are recovered through topology- or centroid-based propagation. Across graph neural networks under targeted and non-targeted structural attacks, spectral hypergraph clustering and multi-view spectral clustering, this principle improves reliability under both adversarial and intrinsic noise. These results suggest that node-level spectral instability provides a common mechanism for understanding and mitigating reliability failures in graph learning.

2412.01944 2026-05-21 cs.CV eess.IV

A Comparative Study of Transformer and Convolutional Models for Crop Segmentation from Satellite Image Time Series

Mattia Gatti, Ignazio Gallo, Nicola Landro, Christian Loschiavo, Anwar Ur Rehman, Mirco Boschetti, Riccardo La Grassa

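AI summary This paper compares transformer and convolutional models for crop segmentation from satellite image time series, finding that TSViT achieves the best overall results while VistaFormer offers a strong efficiency-performance trade-off.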

Comments This version corrects an error in the evaluation pipeline affecting previously reported metrics. Results have been recomputed, leading to updated values and a revised conclusion: the adapted Swin UNETR model does not outperform CNN baselines. Tables, figures, and comparisons have been updated, and the analysis has been extended to include additional transformer-based models

English abstract

Crop segmentation from satellite image time series (SITS) is a fundamental task for agricultural monitoring and land-use analysis. While convolutional neural networks (CNNs) have been widely used, transformer-based architectures offer alternative mechanisms for representing spatial and temporal dependencies in multispectral data. This paper presents a comparative study of CNN and transformer-based segmentation models for crop mapping from Sentinel-2 time series, including 3D U-Net, 3D FPN, 3D DeepLabv3, and three transformer architectures: Swin UNETR, TSViT, and VistaFormer, which adopt different strategies for capturing temporal dependencies. Experiments on the Munich and Lombardia datasets show that TSViT achieves the best overall results, slightly surpassing 3D U-Net, which remains a strong CNN baseline. VistaFormer offers the best efficiency, while Swin UNETR performs competitively but is less effective than transformers that explicitly model temporal dynamics. These results highlight that temporal modelling is critical for SITS: TSViT outperforms CNNs and approaches that treat time as an additional spatial dimension, while VistaFormer provides a strong efficiency-performance trade-off.

2411.01141 2026-05-21 cs.CL

Dictionary Insertion Prompting for Multilingual Reasoning on Multilingual Large Language Models

Hongyuan Lu, Zixuan Li, Wai Lam

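AI summary This paper proposes Dictionary Insertion Prompting (DIP), which looks up words in a dictionary and inserts their English counterparts into the prompt to improve multilingual reasoning, with clear gains in experiments covering 10 to 200 languages.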

Comments ACL *SEM 2026

English abstract

The current Large Language Model (LLM) era has two shortcomings. The first is a shortage of multilingual models: most LLMs are English-centric, and their performance on multilingual reasoning is limited. The second concerns where external knowledge is placed: most retrieved knowledge is prepended to the user query, which may be sub-optimal. This paper presents a novel, simple yet effective method called Dictionary Insertion Prompting (DIP). Given a non-English prompt, DIP looks up a word dictionary and inserts the words' English counterparts into the middle of the prompt for the LLM. This enables better translation into English and better English reasoning steps, which lead to clearly better results. We experiment with 10 to 200 languages from FLORES-200 (the number of languages varies by dataset; we experiment with 200 languages on GSM8K, as described in the appendix). Since there are no adequate datasets, we use the NLLB translator to create synthetic multilingual benchmarks from four existing English reasoning benchmarks such as GSM8K and AQuA. The synthetic benchmarks are translated back into English for quality assurance with manual annotation. Interestingly, where the dictionary is injected is an important factor in the performance gains: interleaving the dictionary with the original words performs better than prepending or appending it, given the same constructed dictionary.
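The contrast between interleaving and prepending dictionary entries can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `(english)` and `word=english` formats, and the toy Spanish-to-English dictionary are all assumptions made for illustration, and a real pipeline would need tokenization and dictionary lookup appropriate to each language.

```python
def interleave_dictionary(prompt_words, dictionary):
    """Insert each word's English counterpart right after the word itself."""
    out = []
    for word in prompt_words:
        out.append(word)
        english = dictionary.get(word)
        if english is not None:
            out.append(f"({english})")
    return " ".join(out)


def prepend_dictionary(prompt_words, dictionary):
    """Baseline: list all dictionary entries before the unchanged prompt."""
    entries = [f"{w}={dictionary[w]}" for w in prompt_words if w in dictionary]
    return "; ".join(entries) + " | " + " ".join(prompt_words)


# Hypothetical toy data (Spanish -> English); not taken from the paper.
toy_dict = {"compra": "buys", "manzanas": "apples"}
words = ["Ana", "compra", "tres", "manzanas"]

interleaved = interleave_dictionary(words, toy_dict)
prepended = prepend_dictionary(words, toy_dict)
print(interleaved)
print(prepended)
```

In the interleaved form each English hint sits next to the word it glosses, which is the placement the abstract reports as performing better than prepending or appending.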

2410.04155 2026-05-21 cs.CL

Toxic Subword Pruning for Dialogue Response Generation on Large Language Models

Hongyuan Lu, Wai Lam

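AI summary This paper proposes ToxPrune, an algorithm that prunes the subwords contained in toxic words to prevent large language models from generating toxic content, while also improving the dialogue response generation performance of non-toxic models.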

Comments ACL *SEM 2026

English abstract

How to defend large language models (LLMs) from generating toxic content is an important research area. Yet most research has focused on model training techniques that remediate LLMs by updating their weights; a typical related research area is safety alignment. Such training is often costly and tedious, and it can expose the model to further problems such as catastrophic forgetting if it is not carefully handled by experienced NLP practitioners. We thus propose a simple yet effective and novel algorithm, Toxic Subword Pruning (ToxPrune), which prunes the subwords contained in toxic words from BPE in trained LLMs. In contrast to previous work showing that pruning BPE tokens harms machine translation, we surprisingly find it useful for preventing LLMs from generating toxic content. Our findings also suggest that ToxPrune clearly improves the toxic language model NSFW-3B on the task of dialogue response generation, and, surprisingly, that it can even clearly improve the official Llama-3.1-6B on the metric of dialogue diversity. Extensive automatic results and human evaluation indicate that ToxPrune could be helpful both for remediating toxic LLMs and for improving non-toxic LLMs on the task of dialogue response generation. We plan to release the resources to facilitate future work.
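One plausible way to read subword pruning at decoding time is to mask the logits of banned token ids so they can never be sampled. The sketch below is only that reading, not the ToxPrune algorithm itself: the helper names, the `min_len` cutoff (added here so that very short pieces are not banned wholesale), the toy vocabulary, and the placeholder word list are all assumptions.

```python
import math


def toxic_token_ids(vocab, toxic_words, min_len=3):
    """Return ids of subwords that occur inside any listed word.

    vocab maps token id -> subword string. min_len is a hypothetical
    safeguard against banning one- or two-character pieces.
    """
    banned = set()
    for token_id, subword in vocab.items():
        piece = subword.strip().lower()
        if len(piece) >= min_len and any(piece in w.lower() for w in toxic_words):
            banned.add(token_id)
    return banned


def prune_logits(logits, banned):
    """Set banned positions to -inf so they get zero probability after softmax."""
    return [-math.inf if i in banned else x for i, x in enumerate(logits)]


# Toy vocabulary and placeholder word list; purely illustrative.
toy_vocab = {0: "hello", 1: "bad", 2: "word", 3: "ba", 4: "nice"}
banned = toxic_token_ids(toy_vocab, ["badword"])
pruned = prune_logits([0.1, 2.0, 1.5, 0.3, 0.9], banned)
best = max(range(len(pruned)), key=lambda i: pruned[i])
print(banned, pruned, best)
```

Because the mask is applied to the output distribution, no weights are updated, which matches the abstract's motivation of avoiding retraining.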

2410.03296 2026-05-21 cs.CL cs.AI

A Systematic Comparison between Extractive Self-Explanations and Human Rationales in Text Classification

Stephanie Brandl, Oliver Eberle

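AI summary This paper compares extractive self-explanations with human rationales in text classification across tasks and languages, finding that the alignment between them depends strongly on text length and task complexity.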

Comments accepted to the Trustworthy NLP Workshop, co-located with ACL 2026

English abstract

Instruction-tuned LLMs are able to provide an explanation about their output to users by generating self-explanations, without requiring the application of complex interpretability techniques. In this paper, we analyse whether this ability results in a good explanation. We evaluate self-explanations in the form of input rationales with respect to their plausibility to humans. We study three text classification tasks: sentiment classification, forced labour detection and claim verification. We include Danish and Italian translations of the sentiment classification task and compare self-explanations to human annotations. For this, we collected human rationale annotations for Climate-Fever, a claim verification dataset. We furthermore evaluate the faithfulness of human and self-explanation rationales with respect to correct model predictions, and extend the study by incorporating post-hoc attribution-based explanations. We analyse four open-weight LLMs and find that alignment between self-explanations and human rationales highly depends on text length and task complexity. Nevertheless, self-explanations yield faithful subsets of token-level rationales, whereas post-hoc attribution methods tend to emphasize structural and formatting tokens, reflecting fundamentally different explanation strategies.

2409.18272 2026-05-21 cs.LG

SLIDE: A machine-learning based method for forced dynamic response estimation of multibody systems

Peter Manzl, Alexander Humer, Qasim Khadim, Johannes Gerstmayr

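AI summary This paper proposes SLIDE (SLiding-window Initially-truncated Dynamic-response Estimator), a machine-learning method for estimating output sequences of mechanical or multibody systems; it truncates the output window according to the decay of initial effects, approximated via complex eigenvalues, and delivers large simulation speedups that exceed real-time performance.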

Comments Paper currently in submission for journal publication

Journal ref
Mechanics Based Design of Structures and Machines 54(1), 2026
English abstract

In computational engineering, enhancing the simulation speed and efficiency is a perpetual goal. To fully take advantage of neural network techniques and hardware, we present the SLiding-window Initially-truncated Dynamic-response Estimator (SLIDE), a deep learning-based method designed to estimate output sequences of mechanical or multibody systems with primarily, but not exclusively, forced excitation. A key advantage of SLIDE is its ability to estimate the dynamic response of damped systems without requiring the full system state, making it particularly effective for flexible multibody systems. The method truncates the output window based on the decay of initial effects due to damping, which is approximated by the complex eigenvalues of the system's linearized equations. In addition, a second neural network is trained to provide an error estimation, further enhancing the method's applicability. The method is applied to a diverse selection of systems, including the Duffing oscillator, a flexible slider-crank system, and an industrial 6R manipulator mounted on a flexible socket. Our results demonstrate significant speedups over simulation, by factors of up to several million, substantially exceeding real-time performance.
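The link between complex eigenvalues and the truncation length can be made concrete with a small sketch. It assumes, beyond what the abstract states, that a mode with eigenvalue λ decays like exp(Re(λ) t), that the slowest-decaying mode sets the window, and that "decayed" means falling below a relative tolerance `tol`; the function name and default values are hypothetical.

```python
import math


def truncation_steps(eigenvalues, dt, tol=1e-2):
    """Number of leading time steps until initial effects fall below tol.

    eigenvalues: complex eigenvalues of the linearized system.
    dt: sampling interval of the output sequence in seconds.
    """
    decay_rates = [-ev.real for ev in eigenvalues]
    slowest = min(decay_rates)
    if slowest <= 0:
        raise ValueError("undamped or unstable mode: initial effects never decay")
    # exp(-slowest * t) = tol  =>  t = ln(1 / tol) / slowest
    t_decay = math.log(1.0 / tol) / slowest
    return math.ceil(t_decay / dt)


# Damped oscillator x'' + 2*zeta*omega*x' + omega^2*x = 0 with
# omega = 10 rad/s and zeta = 0.1, so lambda = -1 +/- i*sqrt(99).
eigs = [complex(-1.0, math.sqrt(99.0)), complex(-1.0, -math.sqrt(99.0))]
steps = truncation_steps(eigs, dt=0.01, tol=1e-2)
print(steps)
```

With a decay rate of 1 per second and a 1% tolerance, the initial effects take about ln(100) ≈ 4.6 s to fade, which at a 10 ms step corresponds to 461 samples.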

2409.14839 2026-05-21 cs.AI cs.ET cs.HC

Explainable and Human-Grounded AI for Decision Support Systems: The Theory of Epistemic Quasi-Partnerships

John Dorsch, Maximilian Moll

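AI summary This paper proposes a new theoretical framework, the theory of epistemic quasi-partnerships (EQP), to guide the development of AI decision support systems that provide human-grounded explanations (reasons, counterfactuals, and confidence) and thereby meet the demands of ethical and explainable AI (XAI).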

Comments 20 pages

Journal ref
Philosophy of Artificial Intelligence. Synthese Library, vol 533. Springer. 2026
English abstract

In the context of AI decision support systems (AI-DSS), we argue that meeting the demands of ethical and explainable AI (XAI) is about developing AI-DSS to provide human decision-makers with three types of human-grounded explanations: reasons, counterfactuals, and confidence, an approach we refer to as the RCC approach. We begin by reviewing current empirical XAI literature that investigates the relationship between various methods for generating model explanations (e.g., LIME, SHAP, Anchors), the perceived trustworthiness of the model, and end-user accuracy. We demonstrate how current theories about what constitutes good human-grounded reasons either do not adequately explain this evidence or do not offer sound ethical advice for development. Thus, we offer a novel theory of human-machine interaction: the theory of epistemic quasi-partnerships (EQP). Finally, we motivate adopting EQP and demonstrate how it explains the empirical evidence, offers sound ethical advice, and entails adopting the RCC approach.

2409.04777 2026-05-21 cs.LG math.OC

Optimization Hyper-parameter Laws for Large Language Models

Xingyu Xie, Kuangyu Ding, Shuicheng Yan, Kim-Chuan Toh, Tianwen Wei

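AI summary This paper proposes the Opt-Laws framework, which builds on SDE-based convergence and escape analyses to predict final training loss, enabling learning-rate schedules to be pre-selected from small-scale experiments and improving the accuracy of hyper-parameter selection.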

English abstract

Large Language Models have driven significant AI advancements, yet their training is resource-intensive and highly sensitive to hyper-parameter selection. While scaling laws provide valuable guidance on model size and data requirements, they fall short in choosing dynamic hyper-parameters, such as learning-rate (LR) schedules, that evolve during training. To bridge this gap, we present Optimization Hyper-parameter Laws (Opt-Laws), a framework that predicts final training loss as a function of LR schedule, model size, and data size. Grounded in SDE-based convergence and escape analyses, Opt-Laws yield interpretable convergence and escape features that predict final training loss across model scales, enabling schedule pre-selection from small-scale experiments. Empirically, Opt-Laws achieve a 94% Top-2 hit rate for identifying near-optimal schedule candidates on held-out configurations, correctly identify the best-performing schedule family in all five evaluated out-of-family settings, and detect training divergence with F1 = 0.92.

2408.08812 2026-05-21 cs.LG

TRAM: Test-Time Risk Adaptation with Mixture of Agents

Mohamad Fares El Hajj Chehade, Amrit Singh Bedi, Amy Zhang, Hao Zhu

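AI summary This paper studies zero-update adaptation at deployment time and proposes TRAM, which uses a mixture of agents to evaluate risk-adjusted scores of source policies, reducing deployment risk while preserving reward.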

English abstract

Deployed reinforcement learning agents often face safety requirements that are specified only after training, such as new hazard maps, revised risk thresholds, or behavioral alignment constraints. We study zero-update deployment-time adaptation, where a fixed library of risk-neutral source policies is reused under a newly specified reward-risk tradeoff. We propose TRAM (Test-Time Risk Adaptation via Mixture of Agents), a source-scored composition rule that evaluates each source policy under the target reward and an occupancy-based deployment risk, then selects actions using risk-adjusted source scores. Unlike training-time risk-sensitive methods tied to a fixed surrogate such as return variance, TRAM supports spatial barrier exposure, divergence from a reference behavior, and local volatility risks specified at test time. We explicitly characterize TRAM as a surrogate method: it does not solve the full occupancy-control problem of the stitched policy, but admits a measurable source-hull mismatch term connecting source-scored risk to realized risk. Experiments in gridworlds, MuJoCo Reacher, Safety-Gymnasium, and an LLM alignment setting show that TRAM reduces deployment risk while preserving reward, without requiring any parameter updates at test time.
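The idea of selecting actions from risk-adjusted source scores can be sketched in a few lines. This is a simplified illustration, not the TRAM rule from the paper: the linear score `reward_value - risk_weight * risk`, the dictionary layout of a source, the name `select_action`, and the toy numbers are all assumptions, and the paper's occupancy-based risk estimates are replaced by fixed callables.

```python
def select_action(state, sources, risk_weight):
    """Pick the action proposed by the source with the best risk-adjusted score.

    Each source is a dict with three callables of the state:
    'policy' (proposed action), 'reward_value' (estimated target reward),
    and 'risk' (estimated deployment risk). No parameters are updated.
    """
    best_action, best_score = None, float("-inf")
    for source in sources:
        score = source["reward_value"](state) - risk_weight * source["risk"](state)
        if score > best_score:
            best_action, best_score = source["policy"](state), score
    return best_action, best_score


# Two hypothetical frozen source policies with made-up estimates.
toy_sources = [
    {"policy": lambda s: "cut_through", "reward_value": lambda s: 10.0, "risk": lambda s: 4.0},
    {"policy": lambda s: "go_around", "reward_value": lambda s: 8.0, "risk": lambda s: 0.5},
]

risk_neutral_choice = select_action("start", toy_sources, risk_weight=0.0)
risk_averse_choice = select_action("start", toy_sources, risk_weight=1.0)
print(risk_neutral_choice, risk_averse_choice)
```

Changing only `risk_weight` at test time flips the choice from the faster source to the safer one, which mirrors the zero-update setting described in the abstract.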

2406.14978 2026-05-21 cs.CV

E2GS: Event Enhanced Gaussian Splatting

Hiroyuki Deguchi, Mana Masuda, Takuya Nakabayashi, Hideo Saito

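AI summary This paper proposes E2GS, which combines event data with Gaussian Splatting to improve image deblurring and high-quality novel view synthesis; experiments on synthetic and real-world datasets show visually appealing renderings with faster training and rendering (140 FPS).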

Comments 7 pages, Accepted at ICIP 2024

English abstract

Event cameras, known for their high dynamic range, absence of motion blur, and low energy usage, have recently found a wide range of applications thanks to these attributes. In the past few years, the field of event-based 3D reconstruction saw remarkable progress, with the Neural Radiance Field (NeRF) based approach demonstrating photorealistic view synthesis results. However, the volume rendering paradigm of NeRF necessitates extensive training and rendering times. In this paper, we introduce Event Enhanced Gaussian Splatting (E2GS), a novel method that incorporates event data into Gaussian Splatting, which has recently made significant advances in the field of novel view synthesis. Our E2GS effectively utilizes both blurry images and event data, significantly improving image deblurring and producing high-quality novel view synthesis. Our comprehensive experiments on both synthetic and real-world datasets demonstrate our E2GS can generate visually appealing renderings while offering faster training and rendering speed (140 FPS). Our code is available at https://github.com/deguchihiroyuki/E2GS.

2312.01386 2026-05-21 cs.LG stat.ML

On the Suboptimality of GP-UCB under Polynomial Effective Optimism

Wenjia Wang, Xiaowei Zhang

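AI summary This paper studies the suboptimality of GP-UCB under polynomial effective optimism. Defining the effective optimism level as the product of the exploration coefficient and the regularization parameter in kernel ridge regression, it proves, under a uniform confidence assumption, a new regret lower bound for GP-UCB with Matérn kernels, showing that polynomial growth of the effective optimism level rules out the minimax-optimal regret rate and identifying an obstacle to proving minimax optimality for standard GP-UCB.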

English abstract

Gaussian process upper confidence bound (GP-UCB) is widely used for sequential optimization of expensive black-box functions. Although many upper bounds on its cumulative regret have been established in the literature, whether GP-UCB is minimax optimal remains open. We study this question through the effective optimism level, defined as the product of the exploration coefficient and the regularization parameter in kernel ridge regression. Under a uniform confidence assumption, we prove a new regret lower bound for GP-UCB with Matérn kernels. The bound shows that polynomial growth of the effective optimism level, up to logarithmic factors, rules out the minimax-optimal regret rate. Since this is the regime covered by most existing analyses, our result identifies a concrete obstacle to proving minimax optimality for standard GP-UCB. More broadly, it suggests that the gap between current upper bounds and minimax lower bounds may reflect a real limitation of the algorithm, not only of the analysis.