arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.17504 2026-05-19 cs.CV cs.AI

A Distributional View for Visual Mechanistic Interpretability: KL-Minimal Soft-Constraint Principle

Guancheng Zhou, Yisi Luo, Zhengfu He, Zhenyu Jin, Xuyang Ge, Wentao Shu, Deyu Meng, Xipeng Qiu

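AI summary: This paper proposes a distribution-based approach to visual mechanistic interpretability that balances interpretability and model faithfulness through a KL-minimal optimization problem, implements it with energy-guided diffusion posterior sampling, and validates it on the DINOv3 model.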

Abstract

Most current paradigms in visual mechanistic interpretability (MI) remain confined to interpreting internal units of the vision model via heuristic methods (e.g., top-$K$ activation retrieval or optimization with regularization). In this work, we establish a theoretical distributional view for visual MI, which models the influence of a feature activation on the natural image distribution, thereby formulating a Kullback-Leibler (KL)-minimal optimization problem to model the MI task. Under this framework, statistical biases are identified within previous MI paradigms, which reveal that they may either be perceptually uninterpretable to humans (i.e., deviate from the natural image distribution), or mechanistically unfaithful to the vision models (i.e., unable to activate model features). To resolve the biases under the distributional view, we propose a model with a KL-minimal soft-constraint principle for visual MI that theoretically balances interpretability and faithfulness. We realize this principle via energy-guided diffusion posterior sampling. Extensive experiments validate the theoretical soundness of the proposed distributional view and demonstrate the practical effectiveness of our paradigm on the DINOv3 vision model.
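
As a rough illustration of what a KL-minimal soft constraint can look like, the derivation below uses one standard form of such a problem. The notation is assumed here and does not come from the abstract: p is the natural image distribution, a(x) is the activation of the feature being interpreted, and λ is a hypothetical trade-off weight. The paper's exact objective may differ.

```latex
\begin{aligned}
q^\star &= \arg\min_{q}\; \mathrm{KL}(q \,\|\, p) \;-\; \lambda\, \mathbb{E}_{x \sim q}\big[a(x)\big], \qquad \lambda > 0, \\
q^\star(x) &= \frac{p(x)\, e^{\lambda a(x)}}{Z(\lambda)}, \qquad Z(\lambda) = \mathbb{E}_{x \sim p}\big[e^{\lambda a(x)}\big] < \infty .
\end{aligned}
```

Under this assumed form, the minimizer is an exponential tilt of p with energy E(x) = -λ a(x). A small λ keeps samples close to natural images, while a large λ favors strong activation, which is one way to read the interpretability versus faithfulness balance and the use of energy-guided sampling.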

2605.17503 2026-05-19 cs.AI cs.CL cs.HC

RAG-based EEG-to-Text Translation Using Deep Learning and LLMs

Enrico Collautti, Xiaopeng Mao, Luca Tonin, Stefano Tortora, Sadasivan Puthusserypady

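AI summary: This paper proposes a retrieval-augmented generation (RAG)-based EEG-to-text decoding method that combines an EEG encoder, a vector retrieval stage, and a large language model to improve sentence-level decoding accuracy, and validates it on the ZuCo dataset.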

Comments 6 pages, 2 figures. Submitted to the 2026 IEEE International Conference on Systems, Man, and Cybernetics

Abstract

The decoding of linguistic information from electroencephalography (EEG) signals remains an extremely challenging problem in brain-computer interface (BCI) research. In particular, sentence-level decoding from EEG is difficult due to the low signal-to-noise ratio of these recordings. Previous studies tackling this problem have typically failed to surpass random baseline performance unless teacher forcing is used during the inference phase. In this work, we propose a retrieval-augmented generation (RAG)-based sentence-level EEG-to-text decoding pipeline that combines an EEG encoder aligned with semantic sentence embeddings, a vector retrieval stage, and a large language model (LLM) to refine retrieved sentences into coherent output. Experiments are conducted on the Zurich Cognitive Language Processing Corpus (ZuCo) dataset, which contains single-trial EEG recordings collected during silent reading. To evaluate whether the system extracts meaningful information from these EEG signals, the results are compared with a random baseline. Across nine subjects, the proposed pipeline outperforms the random baseline, achieving a mean cosine similarity of 0.181 ± 0.022 compared to 0.139 ± 0.029 for the baseline, corresponding to a relative improvement of 30.45%. Statistical analysis further confirms that this improvement is significant, following a strict evaluation workflow where inference is performed without access to ground-truth labels.
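
The vector retrieval stage mentioned in this abstract can be pictured with a minimal sketch. It assumes the EEG encoder has already produced an embedding in the same space as the sentence embeddings; the function name `retrieve_top_k` and the toy vectors are hypothetical, and this is not the authors' implementation.

```python
import numpy as np

def retrieve_top_k(query, corpus, k=3):
    """Return indices and cosine similarities of the k corpus rows closest to query."""
    query = np.asarray(query, dtype=float)
    corpus = np.asarray(corpus, dtype=float)
    # Normalize so that a dot product equals cosine similarity.
    q = query / np.linalg.norm(query)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    scores = c @ q
    order = np.argsort(-scores)[:k]  # highest similarity first
    return order, scores[order]

# Toy data (hypothetical): three sentence embeddings and one EEG-derived embedding.
sentence_embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
eeg_embedding = np.array([1.0, 0.1])
indices, similarities = retrieve_top_k(eeg_embedding, sentence_embeddings, k=2)
```

In a pipeline of the kind the abstract describes, the retrieved sentences would then be passed to an LLM for refinement; that step is omitted here.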

2605.17500 2026-05-19 cs.LG cs.CV

The Silent Brush: Evaluating Artistic Style Leakage in AI Art Generation

Ninad Joshi, Ashutosh Ranjan, Vivek Srivastava, Shirish Karande

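AI summary: This paper studies unintended style reproduction in AI art generation, which arises when models learn and reproduce artistic styles, and proposes Art Arena, an evaluation method that measures how strongly artworks are encoded, how they interact, and how often their stylistic traits reappear without explicit prompting.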

Abstract

Generative text-to-image models are typically trained on large-scale web-scraped datasets that include diverse visual content such as copyrighted and stylistically distinctive artworks, raising concerns about ownership, attribution, and the unintended reuse of protected visual expressions. A key issue is that models can learn stylistic patterns from this data and reproduce them in generated outputs without any explicit reference in the prompt. We refer to this phenomenon as The Silent Brush, where such learned styles reappear even when they are not requested. Existing evaluation methods mainly focus on near-duplicate retrieval or membership inference and do not account for this form of unintended stylistic resurfacing across prompts. To address these gaps, we first formulate guiding principles for evaluation of The Silent Brush. We then introduce Art Arena, an evaluation protocol that measures how strongly artworks are encoded, how they interact, and how frequently their stylistic traits reappear in generated outputs without explicit mention in prompts. We evaluate Art Arena on widely used text-to-image diffusion models, including Stable Diffusion v1.5, Stable Diffusion XL (SDXL), and SANA-1.5, and design it to generalize across text-to-image generative systems. Our results show that The Silent Brush arises from differences in representational strength and interaction dynamics between artworks, leading to asymmetric blending in model generations. Code and evaluation resources are available at: https://anonymous.4open.science/r/ArtArena-EBE4.

2605.17499 2026-05-19 cs.LG

T-GEMs: Text-Guided Exit Modules for Decreasing CLIP Image Encoder

Alberto Presta, Grzegorz Stefanski, Michal Byra, Krzysztof Arendt

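AI summary: This paper proposes T-GEMs, text-guided exit modules that exploit the semantic content distributions in the encoder's intermediate layers to reduce the computational cost of the CLIP image encoder while maintaining cross-modal understanding performance.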

Comments Accepted at ICASSP 2026

Abstract

Multimodal deep neural networks enhance deep comprehension by integrating diverse data modalities. Data from different modalities are typically projected into a shared latent space for similarity computation, but this process is resource intensive due to large image encoders and equal processing of test data during prediction. Early exit methods reduce computational load by utilizing intermediate layers, saving time and memory. However, developing such methods is challenging for multimodal data like image-text pairs. This study investigates the semantic content distributions present in intermediate layers of encoders such as CLIP, which can be derived from textual descriptions. We introduce Text-Guided Exit Modules (T-GEMs) and a rate-based regularizer to control encoder usage costs while maintaining cross-modal understanding performance.

2605.17497 2026-05-19 cs.LG

Self-Supervised On-Policy Distillation for Reasoning Language Models

Zhiquan Tan, Yinrong Hong

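AI summary: This paper proposes Self-Supervised On-Policy Distillation (SSOPD), which improves reasoning language models by contrasting process signals from correct and wrong completions; experiments show it outperforms the GRPO and OPSD baselines on multiple benchmarks.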

Abstract

GRPO-style RLVR trains reasoning models from multiple on-policy attempts per prompt, but typically uses these attempts only through terminal rewards. We show that a mixed group contains a richer process signal: a correct completion is a self-generated witness of how the current policy can solve the problem, while a wrong completion provides on-policy prefixes where the policy needs correction. We introduce Self-Supervised On-Policy Distillation (SSOPD), which distills a teacher distribution conditioned on the shortest correct completion into prefixes of the longest wrong completion. This converts intra-group correct-wrong contrast into dense process supervision without external solution traces. A stopping-time view motivates the shortest-correct / longest-wrong rule as a finite-group approximation to editing persistent failures toward fast-success actions, and a prompt-level frontier weight concentrates the auxiliary loss where correct and wrong branches coexist. Across AIME 2024, AIME 2025, and HMMT 2025, SSOPD improves over GRPO in all nine model-benchmark settings. On Qwen3-8B, it reaches a macro Avg@12 of 65.6, outperforming GRPO by 1.6 points and the solution-conditioned OPSD baseline by 0.8 points. Code will be released at https://github.com/tzq1999/SSOPD.
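
The shortest-correct / longest-wrong rule named in this abstract can be sketched in a few lines. This is only an illustration under assumptions: completions are represented as token lists paired with a correctness flag, and the names `select_distillation_pair` and `wrong_prefixes` are hypothetical. The teacher distribution, the distillation loss, and the frontier weight are not shown because the abstract does not specify them.

```python
def select_distillation_pair(group):
    """Pick (shortest correct, longest wrong) from a group of on-policy attempts.

    group: list of (tokens, is_correct) tuples for one prompt.
    Returns None when the group is not mixed, since no contrast is available.
    """
    correct = [tokens for tokens, ok in group if ok]
    wrong = [tokens for tokens, ok in group if not ok]
    if not correct or not wrong:
        return None
    return min(correct, key=len), max(wrong, key=len)

def wrong_prefixes(tokens):
    """Enumerate the non-empty prefixes of a wrong completion (illustrative choice)."""
    return [tokens[:i] for i in range(1, len(tokens) + 1)]

# Toy group (hypothetical): two correct and two wrong attempts.
group = [
    (["a", "b", "c"], True),
    (["a"], True),
    (["x", "y"], False),
    (["x", "y", "z", "w"], False),
]
pair = select_distillation_pair(group)
```

Returning `None` for all-correct or all-wrong groups reflects the abstract's point that the extra signal lives in mixed groups.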

2605.17493 2026-05-19 cs.LG cs.AI cs.CV physics.ao-ph

Beyond Linear Superposition: Discovering Climate Features in AI Weather Models with KAN-SAE

Minjong Cheon

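AI summary: This paper proposes KAN-SAE, a sparse autoencoder based on Kolmogorov-Arnold Networks that uses nonlinear activation functions to uncover climate features in weather prediction models, yielding 72% more alive features and 20% lower feature redundancy than a linear baseline.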

Abstract

Deep learning weather prediction models achieve remarkable predictive skill yet remain largely opaque: we know little about how they represent physical climate phenomena internally. Mechanistic interpretability through Sparse Autoencoders (SAEs) offers a principled route to decomposing these representations, but existing SAEs assume strictly linear feature superposition - a constraint ill-suited for the highly nonlinear atmospheric dynamics encoded in modern transformers. We introduce KAN-SAE, a sparse autoencoder whose encoder replaces the standard ReLU with learnable per-feature B-spline activations drawn from Kolmogorov-Arnold Networks (KANs), allowing each latent dimension to develop its own nonlinear gating profile. Applied to Sonny, KAN-SAE discovers 975 alive features (vs. 566 for a linear baseline, a 72% improvement) with 20% lower inter-feature redundancy and comparable reconstruction fidelity. Without any climate supervision, KAN-SAE identifies an interpretable European heatwave feature spatially concentrated over western Europe, and a western Pacific typhoon tracker confirmed by causal steering experiments. Our results demonstrate that nonlinear activations are essential for mechanistic interpretability of deep learning weather prediction models, recovering climate features that remain invisible to linear baselines.

2605.17489 2026-05-19 cs.CV

Employing Vision-Language Models for Face Image Quality Assessment

Erdi Sarıtaş, Eren Onaran, Vitomir Štruc, Hazım Kemal Ekenel

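AI summary: This paper investigates the potential of off-the-shelf vision-language models (VLMs) for face image quality assessment (FIQA) in a zero-shot setting, evaluating VLM performance with a comprehensive framework. It finds that model architecture significantly affects biometric utility performance and that VLM outputs align with traditional methods, although generated scores may vary across prompts.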

Abstract

Face Image Quality Assessment (FIQA) is a crucial control step in biometric pipelines. It ensures only reliable samples are processed to maintain system accuracy. State-of-the-art FIQA methods achieve high utility but typically operate as "black boxes." They produce scalar scores without human-interpretable justifications. This lack of transparency limits their effectiveness in human-in-the-loop scenarios, such as automated border control, where actionable feedback is essential. In this paper, we investigate the potential of off-the-shelf Vision-Language Models (VLMs) to bridge this gap by performing FIQA in a zero-shot setting. We present a comprehensive evaluation framework for assessing VLM performance. This involves benchmarking traditional FIQA methods through error-versus-reject curves. Additionally, using a diverse set of datasets, ranging from surveillance-oriented to synthetically generated, we analyze their interpretability, consistency, and robustness to prompt changes. Our results show that biometric utility performance depends significantly on architecture, not merely on parameter count. Most VLMs' outputs align with those of traditional methods. We also find that VLM ranking performance and the generated scores may vary across prompts. Our synthetic ablation study shows that while increasing the parameter count can improve internal consistency, it yields worse degradation-detection performance than smaller models. These findings suggest that zero-shot FIQA score estimation using VLMs is promising and could effectively complement conventional FIQA pipelines as an interpretability module. The code is available at https://github.com/ThEnded32/VLM4FIQA.git.
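
For readers unfamiliar with error-versus-reject curves, the sketch below shows the basic idea in simplified form: discard the lowest-quality fraction of samples and measure the error rate on what remains. It assumes a per-sample error indicator, whereas biometric evaluations typically compute the error over comparison pairs at a fixed threshold. The function name and toy data are hypothetical.

```python
import numpy as np

def error_versus_reject(quality, is_error, reject_fractions):
    """Error rate of the retained samples after rejecting the lowest-quality fraction."""
    quality = np.asarray(quality, dtype=float)
    is_error = np.asarray(is_error, dtype=float)
    order = np.argsort(quality, kind="stable")  # lowest quality first
    sorted_errors = is_error[order]
    n = len(sorted_errors)
    curve = []
    for fraction in reject_fractions:
        n_reject = int(np.floor(fraction * n))
        kept = sorted_errors[n_reject:]
        # NaN marks the degenerate case where every sample was rejected.
        curve.append(float(kept.mean()) if len(kept) else float("nan"))
    return curve

# Toy data (hypothetical): the two low-quality samples are the ones that fail.
quality_scores = [0.1, 0.2, 0.8, 0.9]
errors = [1, 1, 0, 0]
curve = error_versus_reject(quality_scores, errors, [0.0, 0.25, 0.5])
```

A quality measure is useful in this sense when the curve falls quickly as the reject fraction grows, as it does in the toy data.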

2605.17488 2026-05-19 cs.CV cs.MM cs.SD

Omni-Customizer: End-to-End MultiModal Customization for Joint Audio-Video Generation

Yuheng Chen, Qingdong He, Teng Hu, Yuji Wang, Yabiao Wang, Lizhuang Ma, Jiangning Zhang

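AI summary: This paper proposes Omni-Customizer, an end-to-end multimodal customization framework for precise binding and seamless fusion of multimodal identity information, which introduces an Omni-Context Fusion module and a Masked TTS Cross-Attention mechanism to improve multimodal customized generation.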

Abstract

The landscape of joint audio and video generation has been fundamentally transformed by the advent of powerful foundation models. Despite these strides, achieving cohesive multimodal customization for the simultaneous preservation of visual identities and vocal timbres across multiple interacting subjects remains largely underexplored. To bridge this gap, we present Omni-Customizer, an end-to-end framework targeted at the precise binding and seamless fusion of multimodal identity information. Specifically, we introduce an Omni-Context Fusion (OCF) module that effectively enriches the base textual prompt with dense, multimodal identity cues, along with a Masked TTS Cross-Attention (MTP-CA) mechanism explicitly designed to prevent the severe "speech leakage" problem. Within this architecture, we propose Semantic-Anchored Multimodal RoPE (SA-MRoPE) to anchor visual and audio reference tokens, along with TTS embeddings, to their corresponding semantic descriptions, enabling structured multimodal fusion and robust identity binding. Furthermore, we devise a comprehensive training strategy that incorporates interleaved audio-video scheduling to rapidly adapt the audio branch to multilingual scenarios without degrading foundational priors, and a progressive in-pair to cross-pair curriculum to facilitate the learning of high-level and robust identity features. Extensive experiments demonstrate that Omni-Customizer achieves state-of-the-art performance in dual-modal customized generation, excelling across visual identity similarity, timbre consistency, precise audio-video synchronization, and overall video-audio fidelity.

2605.17486 2026-05-19 cs.RO cs.LG

DyGRO-VLA: Cross-Task Scaling of Vision-Language-Action Models via Dynamic Grouped Residual Optimization

Sixu Lin, Yunpeng Qing, Litao Liu, Ming Zhou, Ruixing Jin, Xiaoyi Fan, Guiliang Liu

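AI summary: This paper proposes DyGRO-VLA, a two-stage optimization framework that scales vision-language-action models across tasks through dynamic grouped residual optimization, with the aim of improving generalization.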

Abstract

Recent progress in Reinforcement Learning (RL) provides a principled approach to optimizing Vision-Language-Action (VLA) models, facilitating a shift from trajectory imitation to active learning in the task environment. Despite improvements in control precision, most RL optimizers remain task-specific, which reduces VLA models from generalist controllers to policies that overfit to a narrow set of tasks. In this study, we conduct an in-depth analysis of this phenomenon and highlight the importance of cross-task feature representations for improving the generalizability of VLA models. Motivated by this finding, we introduce DyGRO-VLA, a two-stage optimization framework that 1) effectively captures cross-task latent representations based on information-theoretic principles, and 2) dynamically refines policy optimization via a mixture-of-RL-residuals. DyGRO-VLA enables the RL optimizer to exploit task-relevant latent information while strategically mitigating adverse interference on the learned representations throughout the optimization process. We evaluate our approach on the LIBERO and RoboTwin2 benchmarks and further validate it in the real world, demonstrating consistent improvements over strong baselines under multi-task training and distribution shift.

2605.17483 2026-05-19 cs.CV

On Applicability of Synthetic Datasets for Facial Expression Recognition

Ali Azmoudeh, Erdi Sarıtaş, Ömer Yıldırım, Hazım Kemal Ekenel

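AI summary: This paper studies the use of synthetic datasets for facial expression recognition, proposes three privacy-preserving strategies for building balanced datasets, and experimentally validates the effectiveness of synthetic data in mitigating class imbalance and privacy constraints.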

Abstract

Facial Expression Recognition faces two core challenges. The first is class imbalance in public datasets, which skews the learning process and weakens generalization. The second is related to privacy and data collection constraints, which limit the sharing of facial images and restrict the creation of large, balanced datasets. To address these issues, we examine three complementary strategies for constructing privacy-preserving FER datasets in the standard seven discrete facial expression classes setting. Our strategies are: (i) pseudo-labeling large unlabeled face collections with a teacher model under a confidence-thresholding scheme, (ii) prompt-driven synthesis using diffusion models conditioned on demographic attributes, and (iii) task-aware GAN-based expression editing that modifies facial expression while preserving identity and realism. For training and evaluation, we employed widely adopted datasets, including AffectNet, RAF-DB, and FER2013. We utilized the synthetic datasets DigiFace, DCFace, and EmoNet-Face BIG as unlabeled sources for pseudo-labeling. Additionally, we utilized the FFHQ dataset as the source for generative synthesis. The main experiments are conducted using a classic CNN backbone, IR50, and we also explore a more complex architecture, POSTERv1, to assess its feasibility and robustness. Using cross-dataset evaluations, we analyze the trade-offs each strategy presents in curated datasets. The findings demonstrate how synthetic data can effectively substitute or be combined with real datasets to mitigate imbalance and privacy limitations. Code and generated datasets: https://www.github.com/AliAZ98/SyntFER
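
Strategy (i), pseudo-labeling under a confidence threshold, can be illustrated with a short sketch. It assumes the teacher model outputs a row of class probabilities per image; the function name `pseudo_label`, the 0.9 threshold, and the three-class toy data are hypothetical choices, and the paper's setting uses seven expression classes.

```python
import numpy as np

def pseudo_label(probabilities, threshold=0.9):
    """Keep samples whose top teacher probability reaches the threshold.

    probabilities: array of shape (n_samples, n_classes) from a teacher model.
    Returns (kept_indices, kept_labels).
    """
    probabilities = np.asarray(probabilities, dtype=float)
    confidence = probabilities.max(axis=1)   # teacher's top probability per sample
    labels = probabilities.argmax(axis=1)    # predicted class per sample
    keep = np.flatnonzero(confidence >= threshold)
    return keep, labels[keep]

# Toy teacher outputs (hypothetical): the middle sample is too uncertain to keep.
teacher_probs = [
    [0.95, 0.03, 0.02],
    [0.50, 0.30, 0.20],
    [0.01, 0.01, 0.98],
]
kept_indices, kept_labels = pseudo_label(teacher_probs, threshold=0.9)
```

Raising the threshold trades dataset size for label reliability, which matters when the goal is a balanced dataset.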

2605.17481 2026-05-19 cs.CL

Hybrid Feature Combinations with CNN for Bangla Fake News Classification

Md Gulzar Hussain, Babe Sultana, Md Rinku Ali

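AI summary: This paper studies how different feature combinations (semantic, statistical, and character-level) affect Bangla fake news classification with a CNN model on the BanFakeNews-2.0 dataset, and finds that combining multiple features significantly improves recall and F1-score.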

Comments Accepted and presented at the 3rd International Conference on Big Data, IoT and Machine Learning (BIM 2025)

Abstract

Nowadays, people in Bangladesh frequently rely on the internet and social media for daily news instead of traditional newspapers. However, the spread of false Bangla news through these platforms poses risks and challenges to the credibility of authentic media. Although several studies have been conducted on detecting Bangla fake news, there is still significant room for improvement in this area. To assist people, this research explores the effectiveness of feature selection approaches in identifying appropriate features, such as semantic, statistical, and character-level features, or their combinations, on the BanFakeNews-2.0 dataset for detecting Bangla fake news using a CNN model. The key findings reveal that combining multiple features significantly improves recall and F1-scores compared to using individual features alone. The code for this research is available at https://github.com/gulzar09/Bn_FNews_H.Feature.

2605.17478 2026-05-19 cs.CV

Mamba-VGGT: Persistent Long-Sequence Video Geometry Grounded Transformer via External Sliding Window Mamba Memory

Tianchen Deng, Zhenxiang Xiong, Nailin Wang, Fangjinhua Wang, Jiuming Liu, Jianfei Yang, Hesheng Wang

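AI summary: This paper proposes the Mamba-VGGT framework, which introduces a sliding-window Mamba memory module to address geometric forgetting and accumulated drift in conventional VGGT on long video sequences, improving the accuracy and stability of 3D scene reconstruction.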

Abstract

Visual Geometry Grounded Transformers (VGGT) have set new benchmarks in high-fidelity 3D scene reconstruction. However, as the sequence length increases, these models suffer from catastrophic geometric forgetting and accumulation drift, primarily due to the quadratic complexity of global attention which necessitates truncated temporal windows. To overcome the resulting geometric drift, we present Mamba-VGGT, an enhanced VGGT framework capable of persistent long-range reasoning. Our key contribution is a Sliding Window Mamba (SWM) memory module that maintains an explicit external memory token across temporal windows. This module leverages selective state-space modeling to distill and propagate global geometric priors, effectively bypassing the memory constraints of traditional transformers. To integrate these long-term temporal cues without disrupting the highly optimized spatial features of the pre-trained VGGT, we propose a Zero-Init Spatial Memory Injector. Utilizing zero-convolutional layers, this injector adaptively fuses persistent memory into the patch token stream, ensuring structural stability and seamless feature alignment. Extensive experiments demonstrate that our approach significantly outperforms existing VGGT-based methods in maintaining spatial consistency and reducing trajectory accumulation errors. Our work provides a scalable, linear-complexity solution for geometry-grounded world modeling in extensive 3D environments.
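
The reason a zero-initialized injector does not disturb pre-trained features can be shown with a small sketch. It models the zero-convolutional layer as a per-token linear projection (the role a 1x1 convolution plays) whose weights start at zero, so the patch tokens pass through unchanged before any training. The class name, shapes, and additive fusion are assumptions for illustration, not the paper's architecture.

```python
import numpy as np

class ZeroInitInjector:
    """Adds a projected memory token to every patch token; the projection starts at zero."""

    def __init__(self, memory_dim, token_dim):
        # Zero weights and bias: the injected update is exactly zero at initialization.
        self.weight = np.zeros((memory_dim, token_dim))
        self.bias = np.zeros(token_dim)

    def __call__(self, patch_tokens, memory_token):
        update = memory_token @ self.weight + self.bias  # shape (token_dim,)
        return patch_tokens + update                     # broadcast over all patches

# Toy shapes (hypothetical): 5 patch tokens of width 8, one memory token of width 4.
rng = np.random.default_rng(0)
patch_tokens = rng.normal(size=(5, 8))
memory_token = rng.normal(size=4)
injector = ZeroInitInjector(memory_dim=4, token_dim=8)
output_at_init = injector(patch_tokens, memory_token)
```

Once training moves the weights away from zero, the memory token starts to influence the patch stream gradually rather than all at once.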

2605.17477 2026-05-19 cs.RO

Rapid Vibration Suppression and Trajectory Tracking of a Serial Manipulator with Multi-Flexible Links

Chengyi Wang, Yilong Huang, Ji Wang

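AI summary: This paper proposes a backstepping-based output-feedback framework for rapid vibration suppression and tip tracking of multi-link serial flexible manipulators, using a DeepONet approximation for real-time deployment and scalability.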

Abstract

Flexible robotic manipulators (FRMs) offer advantages in lightweight design and large workspace, but their structural flexibility induces vibrations, accelerates fatigue, degrades tracking performance, and limits operational speed. These challenges are further amplified in multi-link serial manipulators, where increased overall length leads to greater structural flexibility. This article presents a backstepping output-feedback framework for fast vibration suppression and tip tracking of an n-degree-of-freedom serial flexible manipulator robot (nDSFMR), with a DeepONet-based approximation for practical deployment. Each link-joint is modeled as a Timoshenko beam coupled with an ODE and transformed into a canonical hyperbolic PDE with boundary dynamics. A backstepping-based boundary controller at the joint is developed to equivalently inject distributed damping along the beam, enabling rapid vibration suppression and trajectory tracking, only using available boundary measurements. To enable real-time implementation and scalability, a DeepONet neural operator is introduced to approximate the backstepping kernels, significantly reducing computational cost and facilitating fast controller updates under varying operating conditions. Experiments on a two-link flexible manipulator demonstrate faster vibration suppression and convergence of the end-effector to the desired trajectory, compared with a linear quadratic regulator (LQR) with feedforward control.

2605.17467 2026-05-19 cs.CL

VerifyMAS: Hypothesis Verification for Failure Attribution in LLM Multi-Agent Systems

Hezhe Qiao, Hanghang Tong, Ee-Peng Lim, Bing Liu, Guansong Pang

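AI summary: This paper proposes VerifyMAS, a framework that attributes failures in LLM multi-agent systems by verifying hypotheses, addressing the shortcomings of existing methods in global failure identification and fine-grained attribution; experiments show it outperforms existing methods across multiple models.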

Comments 22 pages

Abstract

Large language model-driven multi-agent systems (LLM-MAS) excel at complex tasks, yet unreliable agents remain a key bottleneck to system-level reliability. Automatic failure attribution is therefore critical, but existing approaches, such as direct prediction of agent-error pairs and agent-first failure attribution, rely on local logs of agents and miss global failures that only manifest over full interaction trajectories, such as cross-step inconsistencies and inter-agent coordination errors. Moreover, directly predicting failures induces a large combinatorial search space, hindering fine-grained attribution. To address these challenges, we propose VerifyMAS, a hypothesis verification framework for agent failure attribution. Instead of directly predicting faulty agents and error types, VerifyMAS formulates and verifies failure hypotheses against full trajectories. This verification-based approach decomposes attribution into trajectory-level error validation and fine-grained agent localization, providing an error-first attribution approach that captures global failure patterns while substantially reducing the search space. We further introduce a hypothesis-based data construction strategy grounded in a structured error taxonomy and fine-tune a specialized LLM verifier model for trajectory-level failure verification and agent attribution. Experiments on Aegis-Bench and Who&When show that VerifyMAS consistently improves diverse backbone models, including open-source Qwen and API-based GPT models, outperforming prior methods without sacrificing inference efficiency for long multi-agent trajectories.

2605.17465 2026-05-19 cs.LG

TriOpt: A Scalable Algorithm for Linear Causal Discovery

Rafat Ashraf Joy, Elena Zheleva

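AI summary: This paper proposes the TriOpt algorithm, which addresses scalability in high-dimensional linear causal discovery by integrating ordering-based and continuous optimization methods, achieving substantial speedups while maintaining high accuracy.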

Abstract

Learning causal relations from observational data is challenging because the graph search space grows super-exponentially with the number of variables. Ordering-based methods reduce this space by first identifying the topological ordering, whereas continuous optimization methods explore most likely regions of the space by casting DAG learning as a differentiable objective with an acyclicity constraint. Despite their conceptual appeal, both paradigms face significant scalability limitations in high-dimensional settings, restricting their practical applicability. In this work, we introduce a new formulation for linear causal discovery that tightly integrates these two paradigms to achieve substantial gains in scalability without sacrificing accuracy. Our approach, TriOpt, decomposes the problem into two efficient stages. First, it recovers the topological ordering by exploiting the Sherman-Morrison rank-1 downdate together with the additive structure of linear kernels, enabling fast and scalable ordering estimation. Second, given this ordering, we reformulate structure learning as a convex continuous optimization problem that entirely avoids the need for enforcing costly acyclicity constraints. We theoretically show that, under the true ordering, TriOpt exactly recovers the underlying linear DAG. Empirically, across synthetic, semi-synthetic, and real-world datasets, TriOpt achieves orders-of-magnitude speedups over state-of-the-art linear causal discovery methods in high-dimensional regimes, while maintaining comparable or superior accuracy.
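
The Sherman-Morrison rank-1 downdate named in this abstract is a standard identity: (A - u vᵀ)⁻¹ = A⁻¹ + (A⁻¹ u vᵀ A⁻¹) / (1 - vᵀ A⁻¹ u). The sketch below implements only that identity; how TriOpt applies it during ordering estimation is not described in the abstract, so the function name and the toy matrix are illustrative assumptions.

```python
import numpy as np

def sherman_morrison_downdate(a_inv, u, v):
    """Return inv(A - outer(u, v)) given a_inv = inv(A), without a fresh inversion."""
    a_inv_u = a_inv @ u
    v_a_inv = v @ a_inv
    denom = 1.0 - v @ a_inv_u
    if abs(denom) < 1e-12:
        raise ValueError("downdated matrix is singular or nearly singular")
    return a_inv + np.outer(a_inv_u, v_a_inv) / denom

# Toy example (hypothetical): remove one sample's contribution from a regularized Gram matrix.
rng = np.random.default_rng(0)
samples = rng.normal(size=(6, 3))
gram = samples.T @ samples + np.eye(3)
gram_inv = np.linalg.inv(gram)
x = samples[0]
downdated_inv = sherman_morrison_downdate(gram_inv, x, x)
```

The update costs O(d²) for a d by d matrix instead of the O(d³) of a full inversion, which is the kind of saving that makes repeated updates affordable in high dimensions.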

2605.17458 2026-05-19 cs.LG

ClaHF: A Human Feedback-inspired Reinforcement Learning Framework for Improving Classification Tasks

ClaHF:一种基于人类反馈的强化学习框架,用于改进分类任务

Tianxiang Xu, Xiaoyan Zhu, Xin Lai, Jiayin Wang

AI总结 本文提出ClaHF,一种基于人类反馈的强化学习框架,用于改进文本分类任务,通过整合偏好建模和强化学习优化,无需额外人工标注,在分类流程中提升分类性能和置信度校准。

详情
AI中文摘要

文本分类模型通常通过监督微调(SFT)进行训练。然而,SFT本质上是从实例级标签进行行为克隆,因此无法充分捕捉样本之间的相对偏好关系,这限制了模型塑造决策边界和校准预测置信度的能力。在本文中,我们提出ClaHF,一种受人类反馈启发的强化学习(RL)框架,用于文本分类,该框架在分类流程中整合了偏好建模和RL优化,而无需额外的人工标注。与以往仅依赖实例级监督的工作不同,ClaHF同时构建多个候选预测及其相对排名关系,并在奖励模型(RM)中联合建模Top-1偏好以及非最优候选之间的顺序。这种设计将传统的标签监督转换为可以直接应用于策略优化的偏好信号。我们在八个分类任务上进行了系统评估,涵盖三种场景类别。结果表明,ClaHF在各种语言模型(LMs)上一致提升了分类性能和置信度校准。数据和代码可在https://anonymous.4open.science/r/ClaHF上获取。

英文摘要

Text classification models are typically trained via supervised fine-tuning (SFT). However, SFT essentially performs behavior cloning from instance-wise labels and thus fails to adequately capture relative preference relations among samples, which limits the model's ability to shape decision boundaries and calibrate predictive confidence. In this paper, we propose ClaHF, a human feedback-inspired reinforcement learning (RL) framework for text classification that integrates preference modeling and RL optimization into the classification pipeline without requiring additional human annotations. Unlike prior work that relies solely on instance-wise supervision, ClaHF constructs multiple candidate predictions together with their relative ranking relations, and jointly models the Top-1 preference and the ordering among non-optimal candidates within a reward model (RM). This design converts conventional label supervision into preference signals that are directly applicable to policy optimization. We conduct systematic evaluations on eight classification tasks spanning three categories of scenarios. Results demonstrate that ClaHF consistently improves both classification performance and confidence calibration across diverse language models (LMs). The data and code are available at https://anonymous.4open.science/r/ClaHF.

2605.17456 2026-05-19 cs.CV cs.AI

GCE-MIL: Faithful and Recoverable Evidence for Multiple Instance Learning in Whole-Slide Imaging

GCE-MIL: 全切片成像多实例学习中可信且可恢复的证据

Xiangyu Li, Ran Su

AI总结 该研究提出GCE-MIL方法,通过直接优化充分性、必要性和可恢复性(S/N/R)标准,提升全切片成像多实例学习的预测性能和证据质量,提高了Macro-F1和C-index,并缩小了连续-离散差距。

Comments 10 pages, 17 figures, 24 tables

详情
AI中文摘要

多实例学习(MIL)是全切片图像(WSI)分类和生存预测的标准方法,其中基于注意力的模型将图像块特征聚合为切片级预测。这些模型将注意力权重视为预测的证据,但注意力是为分类而优化的,而不是为了识别真正支持诊断的图像块。这种混淆导致三种失败:所选图像块不充分(仅保留它们会使Macro-F1下降0.078)、不必要(移除它们几乎不改变预测)以及不可恢复(连续的注意力分数与推理时使用的离散图像块子集不一致)。核心前提是,证据质量应通过显式标准,即充分性、必要性和可恢复性(S/N/R),直接优化,而不是作为分类的副产品获得。GCE-MIL是一种与骨干网络无关的封装器,通过三种注入模式和三个证据组件实现:将选择与领域特定概念对齐的grounding机制,作为干预式证据搜索的可微代理的noisy-OR覆盖,以及通过边际引导修复将连续选择器转换为离散子集的“阈值加修复”恢复。在9个骨干网络和9个数据集(共81种配置)上,GCE-MIL将平均Macro-F1提高了0.024,C-index提高了0.014,将连续-离散差距缩小了4-7,并将补集退化提高了2-4。在离散恢复之后可选地进行图像块预过滤,推理速度最高可提升5倍,同时保留0.989的完整包效用。

英文摘要

Multiple instance learning (MIL) is the standard approach for whole-slide image (WSI) classification and survival prediction, where attention-based models aggregate patch features into slide-level predictions. These models treat attention weights as evidence for their predictions, but attention is optimized for classification, not for identifying which patches actually support the diagnosis. This conflation leads to three failures: selected patches are insufficient (keeping them alone drops Macro-F1 by 0.078), unnecessary (removing them barely changes the prediction), and unrecoverable (continuous attention scores disagree with discrete patch subsets used at inference). The central premise is that evidence quality should be optimized directly through explicit criteria, namely Sufficiency, Necessity, and Recoverability (S/N/R), rather than inherited as a byproduct of classification. GCE-MIL is a backbone-agnostic wrapper implemented through three injection modes and three evidence components: a grounding mechanism that aligns selection with domain-specific concepts, noisy-OR coverage that acts as a differentiable proxy for interventional evidence search, and threshold-plus-repair recovery that converts continuous selectors into discrete subsets through marginal-guided repair. Across 9 backbones and 9 datasets (81 configurations), GCE-MIL improves average Macro-F1 by 0.024 and C-index by 0.014, reduces the continuous-discrete gap by 4-7, and increases complement degradation by 2-4. With optional tile prefiltering after discrete recovery, inference runs up to 5× faster while retaining 0.989 full-bag utility.
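The noisy-OR coverage mentioned above rests on a standard formula: if each selected patch independently "covers" a concept with probability p_i, the concept is covered with probability 1 - ∏(1 - p_i). The sketch below shows only that generic formula; how GCE-MIL defines the per-patch probabilities is not stated in the abstract, so the `noisy_or` helper and the sample values are assumptions.

```python
import math


def noisy_or(probabilities):
    """Probability that at least one independent event occurs.

    Each entry is the (assumed) probability that one selected patch
    covers the concept; the result is 1 - prod(1 - p_i).
    """
    for p in probabilities:
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    return 1.0 - math.prod(1.0 - p for p in probabilities)


# Hypothetical per-patch coverage probabilities for one concept.
coverage = noisy_or([0.5, 0.5])  # two weak patches together cover more than either alone
```

Because the expression is a smooth product of the inputs, it can stand in for a discrete "is the evidence present?" check during gradient-based training, which is what makes it usable as a differentiable proxy.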

2605.17454 2026-05-19 cs.AI

Multi-Party Multi-Objective Optimization as Consensus Search: Runtime Analysis of Cross-Party Recombination

多方多目标优化作为共识搜索:跨方重组的运行时间分析

Xiaolei Fang, Peilan Xu, Wenjian Luo

AI总结 本文研究了多方多目标优化问题中的跨方重组,通过分析MP-JCG和BPBOMST问题,证明了基于收益引导变异的基线方法在跨越间隙时存在瓶颈,而分析型CPR-NSGA-II变体能够在O(n log n)的期望评估次数内找到两个共同帕累托最优解,并推导了使用边并集重组和均匀修复的实例参数化期望运行时间界。

Comments 40 pages, 7 figures

详情
AI中文摘要

多方多目标优化问题(MPMOPs)要求自主决策者之间达成共识,因此不同于扁平化的高维多目标表述。现有多目标进化算法的运行时间理论大多针对单方帕累托前沿近似,无法直接解释MPMOPs中的共同解搜索。我们在两种代表性场景中研究了跨方重组。在MP-JCG(一个带有显式间隙区域的伪布尔基准)上,我们证明了基于收益引导变异的基线方法面临跨越间隙的瓶颈,需要Θ(n²)次期望适应度评估。相比之下,一个分析型CPR-NSGA-II变体通过直接组装分布在各方种群中的互补前缀和后缀模板,能够在O(n log n)次期望评估内找到两个共同帕累托最优解。与扁平化的四目标表述F-JCG相比较,我们的全前沿覆盖分析说明了扁平化带来的额外覆盖负担。BPBOMST是多方多目标最小生成树问题在双方、每方两个目标情形下的特例,针对它我们提出了分层支持覆盖分析。对于每个共同帕累托目标向量,对称平均投影会导出一个辅助的双目标MST实例,而合适的支持代表可以给出一个2λ-共同近似覆盖,其中λ∈[1,2]。我们进一步针对一个使用边并集重组和均匀修复的代表池CPR-NSGA-II变体,推导了实例参数化的期望运行时间界。该界区分了局部辅助前沿填充、跨方重组捷径以及边并集修复歧义各自的影响。

英文摘要

Multi-party multi-objective optimization problems (MPMOPs) require consensus among autonomous decision makers and therefore differ from flattened many-objective formulations. Existing runtime theory for multi-objective evolutionary algorithms is largely tailored to single-party Pareto-front approximation and does not directly explain common-solution search in MPMOPs. We investigate cross-party recombination in two representative settings. On MP-JCG, a pseudo-Boolean benchmark with an explicit gap region, we prove that a payoff-guided mutation baseline faces a gap-crossing bottleneck requiring \(Θ(n^2)\) expected fitness evaluations. In contrast, an analytical CPR-NSGA-II variant discovers both common Pareto-optimal solutions in \(O(n\log n)\) expected evaluations by directly assembling complementary prefix and suffix templates distributed across party populations. Comparing this with the flattened four-objective formulation F-JCG, our full-front coverage analysis illustrates the additional coverage burden introduced by flattening. For BPBOMST, the bi-party, two-objective-per-party specialization of the multi-party multi-objective minimum spanning tree problem, we develop a layered support-cover analysis. For each common Pareto objective vector, the symmetric average projection induces an auxiliary bi-objective MST instance, and suitable support representatives yield a \(2λ\)-common approximation cover with \(λ\in[1,2]\). We further derive an instance-parameterized expected runtime bound for a representative-pool CPR-NSGA-II variant using edge-union recombination and uniform repair. This bound separates the effects of local auxiliary-front filling, cross-party recombination shortcuts, and edge-union repair ambiguity.

2605.17451 2026-05-19 cs.CV

DeTrack: A Benchmark and Altitude-Aware Dual World Model for Drone-embodied Tracking

DeTrack:一种无人机具身跟踪的基准及海拔感知双世界模型

Guyue Hu, Haoming Liu, Siyuan Song, Chenglong Li, Feng Chen, Jin Tang

AI总结 本文提出DeTrack任务,要求无人机在交互式3D环境中利用在线自体观察和主动飞行控制进行目标跟踪,并提出AaDWorlds框架以解决海拔相关的可见性与飞行安全矛盾。

详情
AI中文摘要

空中目标跟踪在公共安全、应急救援、野生动物监测等领域有广泛应用。然而,现有空中跟踪基准主要基于固定摄像头位置或预设飞行路径的被动2D视频序列,其中无人机被视为被动相机而非具身代理,无法主动感知、交互和控制其在动态3D场景中的运动。本文定义了新的无人机具身跟踪任务DeTrack,要求无人机利用在线自体观察和主动飞行控制在闭环中跟踪目标。我们构建了一个包含11,368条目标轨迹的大型基准,涵盖多样化的场景、渲染条件、语义区域和移动干扰物,并提供了针对目标可见性、跟踪准确性和轨迹成功的评估指标。我们进一步提出了AaDWorlds,一种用于无人机具身跟踪的海拔感知双世界模型框架。AaDWorlds包含一个海拔感知感知模块和双世界模型,分别在高海拔和低海拔环境下预测未来状态。通过结合伪海拔感知观察和预测的未来状态,AaDWorlds缓解了目标可见性与飞行安全之间的固有矛盾。在DeTrack基准上的实验表明,AaDWorlds在所有评估指标上均提升了闭环跟踪性能。

英文摘要

Aerial object tracking has broad applications in public safety, emergency rescue, wildlife monitoring, and related fields. However, existing aerial tracking benchmarks are mainly based on passive 2D video sequences captured from fixed camera locations or predefined flight paths, where drones are treated as passive cameras rather than embodied agents that actively perceive, interact, and control their motion in dynamic 3D scenes. In this paper, we define a new drone-embodied tracking task, termed DeTrack, which requires a drone to track a target in interactive 3D environments using online egocentric observations and active flight control in a closed loop. We build a large-scale benchmark containing 11,368 target trajectories across diverse scenes, rendering conditions, semantic regions, and moving distractors, together with evaluation metrics for target visibility, tracking accuracy, and trajectory success. We further propose AaDWorlds, an altitude-aware dual world model framework for drone-embodied tracking. AaDWorlds consists of an altitude-aware perception module and dual world models that imagine future states under both high- and low-altitude regimes. By combining pseudo altitude-aware observations and imagined future states, AaDWorlds alleviates the intrinsic altitude-mediated contradiction between target visibility and flight safety. Experiments on the DeTrack benchmark demonstrate that AaDWorlds improves closed-loop tracking performance across all evaluation metrics.

2605.17449 2026-05-19 cs.CV cs.AI

Spatial Blindness in Whole-Slide Multiple Instance Learning

全切片多实例学习中的空间盲区

Xiangyu Li, Ran Su

AI总结 本文研究了全切片多实例学习中的空间盲区问题,即模型预测准确却对图像块坐标打乱几乎不敏感;提出ResTopoMIL模型,先拟合排列不变的原型直方图并将其冻结,再让轻量级图分支在坐标打乱约束下学习残差,从而恢复对空间关系的敏感性,并在多个公开数据集上提升了分类和生存预测性能。

Comments 28 pages, 8 figures, 16 tables

详情
AI中文摘要

只要在图像块嵌入之上叠加图网络、Transformer或状态空间模块,全切片MIL模型就常被称为上下文感知模型。我们表明这一标签可能具有误导性。在组织结构属于诊断信号一部分的病理任务中,若干强大的MIL基线在图像块坐标被随机置换后,切片级AUC几乎不变。它们的预测是准确的,但在很大程度上只依赖成分构成。我们将这种失败模式称为空间盲区。我们的解释基于优化过程:在切片级监督下,密集的外观统计信息被较早学到,只给稀疏的空间关系留下微弱的梯度。ResTopoMIL的解决办法是先拟合一个排列不变的原型直方图,然后将其冻结,同时让一个轻量级图分支在坐标打乱约束下学习残差。该架构刻意保持简单;干预之处在于空间分支的训练方式。在9个公开WSI基准上,ResTopoMIL以1.15M参数提升了分类和生存预测性能,恢复了对坐标扰动的敏感性,并在CAMELYON-16上给出了更强的定位证据。

英文摘要

Whole-slide MIL models are often called context-aware once graphs, Transformers, or state-space modules are placed above patch embeddings. We show that this label can be deceptive. On pathology tasks where tissue architecture is part of the diagnostic signal, several strong MIL baselines retain nearly unchanged slide-level AUC after patch coordinates are permuted. Their predictions are accurate, but largely compositional. We refer to this failure mode as spatial blindness. Our explanation is optimization-based: dense appearance statistics are learned early under slide-level supervision, leaving weak gradients for sparse spatial relations. ResTopoMIL addresses the issue by first fitting a permutation-invariant prototype histogram and then freezing it while a lightweight graph branch learns the residual under a coordinate-shuffling constraint. The architecture is simple by design; the intervention is in how the spatial branch is trained. Across 9 public WSI benchmarks, ResTopoMIL improves classification and survival prediction with 1.15M parameters, restores sensitivity to coordinate perturbation, and gives stronger localization evidence on CAMELYON-16.

2605.17447 2026-05-19 cs.CV cs.CL

FastOCR: Dynamic Visual Fixation via KV Cache Pruning for Efficient Document Parsing

FastOCR: 通过KV缓存剪枝实现高效的动态视觉聚焦文档解析

Zihan Tang, Leqi Shen, Hui Chen, Ao Wang, Ben Wan, Yan Feng, Ke Zhang, Sicheng Zhao, Tongxuan Liu, Guiguang Ding

AI总结 本文提出FastOCR,一种无需训练的框架,通过动态视觉聚焦技术解决文档解析中的高效KV缓存剪枝问题,显著提升处理速度和准确性。

详情
AI中文摘要

视觉-语言模型(VLMs)在光学字符识别(OCR)中展现出强大潜力,但编码密集文档所需的大量视觉令牌导致推理成本过高。现有剪枝方法依赖物理驱逐,例如在prefill阶段永久丢弃视觉令牌。这一策略在自然图像上有效,但在OCR中会从根本上失效,因为几乎每个视觉令牌都可能对应一个字符或结构元素,任何不可逆的丢失都会导致准确率急剧下降。我们观察到,尽管文档图像在全局上看似密集且难以剪枝,模型对它们的注意力实际上在时间上是稀疏的:在每个解码步骤中,注意力集中在一小块区域,并随步骤逐渐移动,就像人类读者依次注视各个词语,而不是一次性感知整页内容。受这一动态视觉聚焦现象的启发,我们将难以处理的全局剪枝问题转化为可处理的局部动态问题,并提出FastOCR,一种无需训练的框架,包含两个互补模块。具体而言,Focal-Guided Pruning识别少量焦点层,并在每一步从中选择与任务最相关的视觉令牌;Cross-Step Fixation Reuse则利用注视点的逐渐移动,以上一步的结果对当前步骤进行热启动。FastOCR动态调整被关注的令牌,而不从缓存中驱逐任何令牌,从而避免了永久性的信息丢失。大量实验表明,FastOCR可作为即插即用的加速模块,在五个不同规模和架构的VLMs上表现出一致的泛化能力。在Qwen2.5-VL上,FastOCR在每个解码步骤仅关注5%的视觉令牌,即可保留未剪枝模型98%的准确率,同时将注意力延迟降低3.0倍。

英文摘要

Vision-Language Models (VLMs) have shown strong promise on Optical Character Recognition (OCR), yet the sheer number of visual tokens required to encode dense documents incurs prohibitive inference cost. Existing pruning methods rely on physical eviction, e.g., permanently discarding visual tokens during the prefill stage. While effective for natural images, this strategy fundamentally breaks down on OCR, where virtually every visual token may correspond to a character or structural element, and any irreversible loss leads to catastrophic accuracy degradation. We observe that, although document images appear globally dense and seemingly unprunable, the model's attention to them is in fact temporally sparse: at each decoding step it concentrates on a small region that shifts gradually across steps, much as a human reader fixates on successive words rather than perceiving an entire page at once. Motivated by this Dynamic Visual Fixation phenomenon, we recast the intractable global pruning problem as a tractable local, dynamic one and propose FastOCR, a training-free framework with two complementary modules. Specifically, Focal-Guided Pruning identifies a small set of focal layers and selects the most task-relevant visual tokens from them at each step, while Cross-Step Fixation Reuse exploits the gradual shift of fixation to warm-start each step from the previous one. By dynamically adjusting which tokens are attended rather than evicting any from the cache, FastOCR avoids permanent information loss. Extensive experiments show that FastOCR serves as a plug-and-play acceleration module, generalizing consistently across five VLMs of varying sizes and architectures. On Qwen2.5-VL, FastOCR retains 98% of the unpruned model's accuracy while attending to only 5% of the visual tokens per decoding step, reducing attention latency by 3.0$\times$.
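The contrast between evicting tokens and merely choosing which ones to attend to at each step can be sketched in a few lines. This is a toy illustration, not FastOCR's procedure: the focal-layer identification and cross-step reuse are not specified in the abstract, and the function `select_attended_tokens`, the score lists, and the 5% default are assumptions used only to show that the selection can change per step while the cache stays intact.

```python
import heapq


def select_attended_tokens(scores, keep_ratio=0.05):
    """Return the sorted indices of the highest-scoring visual tokens.

    Nothing is deleted: the caller keeps the full cache and only
    restricts attention to the returned indices for this decoding step.
    """
    if not scores:
        return []
    k = max(1, round(keep_ratio * len(scores)))
    top = heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)
    return sorted(top)


# Hypothetical relevance scores over 100 cached visual tokens at two steps;
# the high-scoring region shifts slightly between steps.
step_1 = [1.0 if 10 <= i < 15 else 0.0 for i in range(100)]
step_2 = [1.0 if 12 <= i < 17 else 0.0 for i in range(100)]

attended_1 = select_attended_tokens(step_1)
attended_2 = select_attended_tokens(step_2)
```

Since the full score list is still available at the next step, tokens skipped at one step can be selected later, which is the property an irreversible prefill-stage eviction gives up.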

2605.17443 2026-05-19 cs.CL cs.SD eess.AS

Analyzing Error Propagation in Korean Spoken QA with ASR-LLM Cascades

分析韩语语音问答中ASR-LLM级联中的误差传播

Donghyuk Jung, Youngwon Choi

AI总结 本文研究了韩语语音问答中ASR-LLM级联中误差传播的问题,通过分析下游语义失败,揭示了传统ASR指标无法完全捕捉的误差影响,发现不同性能的LLM在级联降级上的一致性,识别出单字符ASR错误作为语义失败通道,并通过辅助比较表明大音频语言模型在噪声韩语SQA中优于匹配语言模型的ASR-LLM流水线。

Comments Preprint. Submitted to APSIPA ASC 2026

详情
AI中文摘要

我们分析了自动语音识别(ASR)误差如何通过ASR-LLM级联在韩语语音问答(SQA)中传播,重点关注传统ASR指标无法完全捕捉的下游语义失败。我们的分析显示,由ASR误差引起的相对下游降级在不同绝对性能的LLM中保持一致,表明级联降级主要跟踪ASR阶段的信息损失。我们进一步识别出单字符韩语ASR错误作为一种独特的语义失败通道,其中正确答案在下游预测中完全消失,尽管仅存在微小的转录差异。最后,辅助比较显示,大型音频语言模型在噪声韩语SQA中优于具有匹配语言骨干的ASR-LLM流水线,表明直接音频输入有潜力缓解转录诱导的信息损失。

英文摘要

We analyze how automatic speech recognition (ASR) errors propagate through ASR-LLM cascades in Korean spoken question answering (SQA), focusing on downstream semantic failures that conventional ASR metrics cannot fully capture. Our analysis shows that the relative downstream degradation caused by ASR errors is consistent across LLMs with different absolute performance, suggesting that cascade degradation largely tracks ASR-stage information loss. We further identify single-character Korean ASR errors as a distinct semantic-failure channel, where the gold answer becomes entirely absent from the downstream prediction despite only a minimal transcription difference. Finally, an auxiliary comparison shows that a large audio language model outperforms an ASR-LLM pipeline with a matched language backbone in noisy Korean SQA, indicating the potential of direct audio input to mitigate transcript-induced information loss.

2605.17442 2026-05-19 cs.CL cs.AI cs.IR

Beyond Catalogue Counts: the Dataset Visibility Asymmetry in Low-Resource Multilingual NLP

超越目录计数:低资源多语言NLP中的数据集可见性不对称

Zhiyin Tan, Changxu Duan

AI总结 本研究探讨了多语言NLP中数据集可见性不对称问题,通过结合目录基准和文献证据,提出了资源密度指数(RDI)来衡量语言的数据集可见性,揭示了大量语言在目录记录中数据贫乏但文献中存在明显数据集活动的现象。

Comments Accepted at the 15th edition of the Language Resources and Evaluation Conference (LREC 2026)

详情
AI中文摘要

多语言NLP常常依赖集中式目录中的数据集计数来判断哪些语言资源丰富、哪些资源匮乏。然而,这些目录只记录了数据集可见性的一个层面:哪些数据集已被登记或由机构分发。它们不一定反映研究文献中实际创建、引用或复用了哪些数据集。为了考察这一差距,我们将基于目录的基线与来自文献的数据集流通证据相结合。我们引入资源密度指数(RDI),定义为每一百万使用者对应的已编目数据集数量,并为Ethnologue中使用最广泛的200种语言计算了该指数。其中,118种语言(59%)在LRE Map和语言数据联盟(Linguistic Data Consortium,LDC)上的平均RDI为零,另有23种语言低于0.1,相当于每一千万使用者至多一个已编目数据集。随后,我们在Semantic Scholar语料库上对这141种低可见性语言应用了LLM辅助的引用挖掘流程。经过人工验证和合并,我们在53种语言中识别出609个不重复的数据集,其中356个仍可通过有效的公开链接开放获取。这些结果揭示了显著的可见性差距:许多使用者众多的语言在目录记录中显得数据匮乏,却在研究文献中显示出明确的数据集活动迹象。我们的发现表明,多语言数据稀缺不仅应被理解为生产问题,还应被理解为文档记录、可发现性和长期可访问性的问题。代码和数据可在(https://github.com/zhiyintan/dataset-visibility-asymmetry)公开获取。

英文摘要

Multilingual NLP often relies on dataset counts from centralized catalogues to characterize which languages are resource-rich or resource-poor. However, these catalogues record only one layer of dataset visibility: what has been registered or institutionally distributed. They do not necessarily reflect which datasets are created, cited, or reused in the research literature. To examine this gap, we combine a catalogue-based baseline with literature-backed evidence of dataset circulation. We introduce the Resource Density Index (RDI), defined as the number of catalogued datasets per one million speakers, and compute it for the 200 most widely spoken languages in Ethnologue. Among them, 118 languages (59%) have an average RDI of zero across the LRE Map and the Linguistic Data Consortium (LDC), and another 23 fall below 0.1, corresponding to at most one catalogued dataset per ten million speakers. We then apply an LLM-assisted citation-mining pipeline over the Semantic Scholar corpus to these 141 low-visibility languages. After manual validation and consolidation, we identify 609 unique datasets across 53 languages, of which 356 remain openly accessible through working public links. These results reveal a substantial visibility gap: many large-speaker languages appear data-poor in catalogue records yet show clear evidence of dataset activity in the research literature. Our findings suggest that multilingual data scarcity should be understood not only as a production problem, but also as a question of documentation, discoverability, and long-term accessibility. Code and data are publicly available at (https://github.com/zhiyintan/dataset-visibility-asymmetry).
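The RDI definition and the "0.1 means one dataset per ten million speakers" reading can be checked with a short calculation. The sketch follows the definition given in the abstract (catalogued datasets per one million speakers, averaged over catalogues); the function names and the speaker and dataset counts are made-up illustrations, not figures from the paper.

```python
def rdi(num_datasets, num_speakers):
    """Resource Density Index: catalogued datasets per one million speakers."""
    if num_speakers <= 0:
        raise ValueError("num_speakers must be positive")
    return num_datasets / (num_speakers / 1_000_000)


def average_rdi(dataset_counts, num_speakers):
    """Mean RDI over several catalogues (e.g. one count per catalogue)."""
    if not dataset_counts:
        raise ValueError("need at least one catalogue count")
    return sum(rdi(c, num_speakers) for c in dataset_counts) / len(dataset_counts)


# Hypothetical language with 10 million speakers.
one_dataset = rdi(1, 10_000_000)                 # exactly one dataset per ten million speakers
two_catalogues = average_rdi([0, 2], 10_000_000)  # hypothetical counts in two catalogues
```

A language with one catalogued dataset and ten million speakers sits exactly at the 0.1 threshold, so any value below 0.1 corresponds to fewer than one dataset per ten million speakers.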

2605.17436 2026-05-19 cs.CV cs.CL

Medical Context Distorts Decisions in Clinical Vision Language Models

医学语境扭曲了临床视觉语言模型的决策

David Restrepo, Ira Ktena, Maria Vakalopoulou, Stergios Christodoulidis, Enzo Ferrante

AI总结 本文研究了医学语境对临床视觉语言模型决策的影响,发现模型在整合医学记录的视觉和文本信息时存在模态依赖、无关历史依赖和提示敏感性等问题,强调了在临床应用前需要建立明确的保障措施。

详情
AI中文摘要

视觉-语言模型(VLMs)越来越多地被提出用于临床决策支持,但在需要同时整合医学记录中视觉和文本上下文的真实场景下,其可靠性仍缺乏充分刻画。本文识别了三种失败模式:(1)模态上过度依赖文本而非图像,(2)对无关临床病史的虚假依赖,以及(3)对语义等价输入的提示敏感性。我们使用MIMIC-CXR,在胸部X光任务上评估了一组多样的通用领域和医学调优的开源与闭源VLMs。通过系统地操纵图像-文本对齐、临床病史和提示表述,我们发现即使存在可用的视觉证据,VLM的决策仍由文本模态主导。此外,我们观察到VLMs受无关报告的影响很大,而微小的提示变化就可能推翻原本正确的基于图像的预测。我们的发现强调,在考虑将这些模型用于临床实践之前,需要明确的保障措施和压力测试。

英文摘要

Vision-language models (VLMs) are increasingly proposed for clinical decision support, yet their reliability in real-world scenarios that require integrating both visual and textual context from medical records remains poorly characterized. This paper identifies three failure modes: (1) modality over-reliance on text over images, (2) spurious reliance on irrelevant clinical history, and (3) prompt sensitivity across semantically equivalent inputs. We evaluate a diverse set of general-domain and medically-tuned open and closed VLMs on chest x-ray tasks using MIMIC-CXR. By systematically manipulating image-text alignment, clinical history, and prompt formulations, we found that VLM decisions are dominated by the text modality, even when visual evidence is available. Moreover, we observed that VLMs are heavily influenced by irrelevant reports, while minor prompt changes can reverse correct image-based predictions. Our findings underscore the need for explicit safeguards and stress-testing before considering the use of these models in clinical practice.

2605.17435 2026-05-19 cs.CL

BELIEF: Structured Evidence Modeling and Uncertainty-Aware Fusion for Biomedical Question Answering

BELIEF: 结构化证据建模与不确定性感知融合用于生物医学问答

Chang Zong, Hao Ning, Siliang Tang, Jie Huang, Jian Wan

AI总结 本文提出BELIEF框架,通过结构化证据建模和不确定性感知融合,提升生物医学问答任务中检索文献的利用效率,实现对证据可靠性、不确定性以及候选假设的支持强度的显式建模。

Comments 14 pages, 6 figures

详情
AI中文摘要

生物医学问答通常需要依据检索到的文献做出判断,而这些文献的相关性、质量以及对候选答案的支持程度参差不齐。大多数检索增强的大语言模型(LLM)方法将这些文献作为扁平文本输入模型,使证据可靠性和剩余不确定性基本处于隐含状态。我们提出BELIEF,一种面向封闭集生物医学问答的结构化证据建模与不确定性感知融合框架。BELIEF不把检索到的文档视为无差别的上下文,而是将其转换为证据对象,记录临床属性、来源质量、问题相关性、支持强度以及相关的候选假设。这些证据对象为两条互补的推理路径提供了共同基础。符号路径基于Dempster-Shafer(D-S)理论,在有限答案空间上构建可靠性加权的基本概率分配,并进行不确定性感知的符号证据融合,以估计信念和残余不确定性。神经路径使用相同的结构化证据进行基于LLM的语义推理,而一个可靠性感知的仲裁模块根据信念强度、不确定性、证据可靠性和语义一致性来协调符号输出与神经输出。在PubMedQA、MedQA和MedMCQA上使用五个通用LLM骨干模型进行的实验表明,BELIEF在30种骨干模型-数据集-指标设置中的25种上取得了最佳结果。与生物医学领域模型的比较表明,BELIEF在MedQA和MedMCQA上具有竞争力,而专门的生物医学预训练在PubMedQA上仍具优势。消融、互补性、不确定性分层和成本分析进一步表明,BELIEF通过显式呈现证据结构、路径分歧和决策不确定性,提高了对检索证据的利用。

英文摘要

Biomedical question answering often requires decisions from retrieved literature whose relevance, quality, and support for candidate answers are uneven. Most retrieval-augmented large language model (LLM) methods feed this literature to the model as flat text, leaving evidence reliability and remaining uncertainty largely implicit. We propose BELIEF, a structured evidence modeling and uncertainty-aware fusion framework for closed-set biomedical question answering. Rather than treating retrieved documents as undifferentiated context, BELIEF converts them into evidence objects that record clinical attributes, source quality, question relevance, support strength, and the associated candidate hypothesis. These evidence objects provide a shared basis for two complementary reasoning paths. The symbolic path constructs reliability-weighted basic probability assignments based on Dempster--Shafer (D-S) theory over a finite answer space and performs uncertainty-aware symbolic evidence fusion to estimate belief and residual uncertainty. The neural path uses the same structured evidence for LLM-based semantic inference, while a reliability-aware arbitration module reconciles the symbolic and neural outputs according to belief strength, uncertainty, evidence reliability, and semantic consistency. Experiments on PubMedQA, MedQA, and MedMCQA with five general-purpose LLM backbones show that BELIEF obtains the best result in 25 of 30 backbone--dataset--metric settings. Comparisons with biomedical-domain models indicate that BELIEF is competitive on MedQA and MedMCQA, while specialized biomedical pretraining remains advantageous on PubMedQA. Ablation, complementarity, uncertainty-stratified, and cost analyses further show that BELIEF improves retrieved-evidence utilization by making evidence structure, path disagreement, and decision uncertainty explicit.
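The symbolic path builds on Dempster-Shafer theory, whose core operations are easy to show on a small answer space. The sketch below implements the classical Dempster rule of combination plus Shafer-style reliability discounting; whether BELIEF uses exactly this discounting scheme is an assumption, and the frame (`yes`/`no`/`maybe`), the reliability values, and all function names are illustrative.

```python
from itertools import product

FRAME = frozenset({"yes", "no", "maybe"})  # hypothetical closed answer set


def discount(masses, reliability, frame=FRAME):
    """Shafer discounting: move (1 - reliability) of the mass onto the whole frame."""
    out = {s: reliability * w for s, w in masses.items() if s != frame}
    out[frame] = 1.0 - reliability + reliability * masses.get(frame, 0.0)
    return out


def combine(m1, m2):
    """Dempster's rule: multiply masses, intersect focal sets, renormalise conflict."""
    combined, conflict = {}, 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        overlap = a & b
        if overlap:
            combined[overlap] = combined.get(overlap, 0.0) + wa * wb
        else:
            conflict += wa * wb  # mass assigned to contradictory evidence
    if conflict >= 1.0 - 1e-12:
        raise ValueError("evidence is totally conflicting")
    return {s: w / (1.0 - conflict) for s, w in combined.items()}


def belief(masses, hypothesis):
    """Total mass committed to subsets of the hypothesis."""
    return sum(w for s, w in masses.items() if s <= hypothesis)


# Two hypothetical evidence objects: one fully supports "yes" but is only
# 60% reliable; the other splits its support and is taken at face value.
e1 = discount({frozenset({"yes"}): 1.0}, reliability=0.6)
e2 = {frozenset({"yes"}): 0.5, frozenset({"no"}): 0.3, FRAME: 0.2}
fused = combine(e1, e2)
belief_yes = belief(fused, frozenset({"yes"}))
residual_uncertainty = fused.get(FRAME, 0.0)  # mass left on "any answer"
```

The mass remaining on the full frame after fusion is one natural reading of the "residual uncertainty" the abstract refers to: it is the share of evidence that does not commit to any particular answer.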

2605.17433 2026-05-19 cs.CV

VISTA: Variance-Gated Inter-Sequence Test-Time Adaptation for Multi-Sequence MRI Segmentation

VISTA: 用于多序列MRI分割的方差门控跨序列测试时间适应

Zhipeng Deng, Jiale Zhou, Wenhan Jiang, Haolin Wang, Xun Lin, Yafei Ou, Yefeng Zheng

AI总结 本文提出VISTA框架,解决多序列MRI分割中模态交互偏移问题,通过设计跨序列干预生成器和跨视图分歧感知伪标签方法,提升模型在临床环境下的适应能力,实验表明在不同群体上性能优于现有方法。

Comments MICCAI2026 early accept

详情
AI中文摘要

在新的临床环境中部署多序列磁共振成像(MRI)分割模型具有挑战性,因为存在扫描仪和采集协议的差异。尽管现有的TTA方法能够处理基本的单模态偏移,但它们在根本性的双偏移问题下常常失效,因为其适应信号无法捕捉模态交互偏移,这会破坏跨序列一致性。为了解决这个问题,我们提出了方差门控跨序列测试时间适应(VISTA),一种无源框架,用于解决模态交互偏移问题。首先,我们设计了一个跨序列干预生成器(ISIG),通过交换低频谱和熵局部化的补丁跨序列生成一组一致性探针,保持解剖语义的同时挑战跨序列依赖性。其次,我们引入了跨视图分歧感知伪标签(CDPL),通过跨视图分歧方差建立体素级可靠性度量,动态门控自我训练并强制干预一致性,促使网络依赖于稳健的解剖语义。大量实验将模型从标准成人MRI(BraTS-GLI-Pre)适应到非洲低场(BraTS-SSA)和儿童(BraTS-PED)群体,在临床偏移下优于竞争方法,实现了绝对Dice改进+1.89%(SSA)和+2.82%(PED)超过源模型。代码可在https://github.com/dzp2095/VISTA获取。

英文摘要

Deploying multi-sequence magnetic resonance imaging (MRI) segmentation models to new clinical environments is challenging due to variations in scanners and acquisition protocols. Although existing TTA methods handle basic per-modality shifts, they often fail under a fundamental dual-shift problem, as their adaptation signals fail to capture modality-interaction shifts that disrupt inter-sequence consistency. To address this, we propose Variance-gated Inter-Sequence Test-time Adaptation (VISTA), a source-free framework that tackles modality-interaction shifts. First, we design an Inter-Sequence Intervention Generator (ISIG) that generates a set of consistency probes by swapping low-frequency spectra and entropy-localized patches across sequences, preserving anatomical semantics while challenging inter-sequence dependencies. Second, we introduce Cross-View Disagreement-Aware Pseudo Labeling (CDPL), which establishes a voxel-wise reliability metric using cross-view disagreement variance to dynamically gate self-training and enforce interventional consistency, encouraging the network to rely on robust anatomical semantics. Extensive experiments adapting from standard adult MRI (BraTS-GLI-Pre) to African low-field (BraTS-SSA) and pediatric (BraTS-PED) cohorts show improved performance over competing methods under clinical shifts, achieving absolute Dice improvements of +1.89% (SSA) and +2.82% (PED) over the source model. The code is available at https://github.com/dzp2095/VISTA.

2605.17432 2026-05-19 cs.LG cs.CR

DP-SelFT: Differentially Private Selective Fine-Tuning for Large Language Models

DP-SelFT: 大语言模型的差分隐私选择性微调

Haichao Sha, Zihao Wang, Yuncheng Wu, Hong Chen, Wei Dong

AI总结 本文提出DP-SelFT框架,通过选择性微调方法在保持差分隐私的同时提升大语言模型的隐私-效用权衡。

详情
AI中文摘要

大型语言模型(LLMs)通常通过微调适配下游任务,但微调数据往往包含敏感信息,可能被微调后的模型泄露。差分隐私(DP)为此类泄露提供了形式化保护,但由于梯度裁剪和噪声注入,LLM的DP微调仍存在显著的效用下降。现有工作通过将DP与LoRA等参数高效微调方法结合来改善这一权衡,这类方法约束的是更新的形式。在本文中,我们研究一个互补的方向:选择性微调,它约束的是更新施加的位置。我们提出DP-SelFT,一个用于LLM差分隐私选择性微调的框架。DP-SelFT解决了参数选择中三个DP特有的挑战:避免重复的隐私开销、提高噪声估计下的稳定性,以及选择在裁剪和加噪更新下仍然有用的参数。它首先构建一个轻量级的DP合成数据集,并仅在该合成数据上进行选择,因此选择阶段不产生额外的隐私开销。随后,它在合成训练划分上临时训练候选层子集,并在合成验证划分上对其进行评估,从而完成层级选择。关键在于,这种临时训练是在与下游DP微调相匹配的扰动机制下进行的,所施加的最坏情况扰动与DP噪声规模相同。这样选出的层子集不仅可学习,而且对带噪声的隐私更新具有鲁棒性。在基准任务上的实验表明,在相同隐私保证下,DP-SelFT相对于现有DP微调基线一致地改善了隐私-效用权衡。

英文摘要

Large language models (LLMs) are commonly adapted to downstream tasks through fine-tuning, but fine-tuning data often contains sensitive information that may be leaked by the resulting model. Differential privacy (DP) offers formal protection against such leakage, yet DP fine-tuning of LLMs still suffers from substantial utility degradation due to gradient clipping and noise injection. Existing work improves this trade-off by combining DP with parameter-efficient fine-tuning methods such as LoRA, which constrain the form of updates. In this work, we study a complementary direction: selective fine-tuning, which constrains where updates are applied. We propose DP-SelFT, a framework for differentially private selective fine-tuning of LLMs. DP-SelFT addresses three DP-specific challenges in parameter selection: avoiding repeated privacy cost, improving stability under noisy estimates, and selecting parameters that remain useful under clipped and noisy updates. It first constructs a lightweight DP synthetic dataset and performs selection only on this synthetic data, so the selection stage incurs no additional privacy cost. It then conducts layer-level selection by temporarily training candidate layer subsets on a synthetic training split and evaluating them on a synthetic validation split. Crucially, this temporary training is performed under a perturbation regime matched to downstream DP fine-tuning, with worst-case perturbations of the same scale as DP noise. This favors layer subsets that are not only learnable but also robust to noisy private updates. Experiments on benchmark tasks show that DP-SelFT consistently improves the privacy--utility trade-off over existing DP fine-tuning baselines under the same privacy guarantees.

2605.17431 2026-05-19 cs.LG cs.AI

MATE: Solving Contextual Markov Decision Processes with Memory of Accumulated Transition Embeddings

MATE:利用累积转移嵌入记忆解决上下文马尔可夫决策过程

Himchan Hwang, Hyeokju Jeong, Gene Chung, Seungyeon Kim, Sangwoong Yoon, Frank Chongwoo Park

AI总结 MATE通过使用累积转移嵌入的记忆架构,解决了由未观察上下文参数化的上下文马尔可夫决策过程(CMDPs),在保持后验信念的同时,避免了传统方法的计算和梯度问题,实现了高效且性能优异的解决方案。

详情
AI中文摘要

我们提出了MATE,一种简单而有效的记忆架构,用于解决由未观察上下文参数化的上下文马尔可夫决策过程(CMDPs)。在CMDPs中,最优智能体可以通过维持上下文的后验信念来在线适应。MATE用求和聚合的记忆替代了不可行的后验,利用后验的排列不变性来保留可证明的充分表达性。与先前的记忆架构相比,MATE避免了Transformer的逐步展开成本增长和与循环神经网络(RNNs)通常相关的梯度问题。在多样化的基准测试中,MATE展示了清晰的计算优势,同时实现了与标准序列模型基线相当的性能。

英文摘要

We propose MATE, a simple yet effective memory architecture for solving Contextual Markov Decision Processes (CMDPs), a family of MDPs parameterized by an unobserved context. In CMDPs, an optimal agent can adapt online by maintaining the posterior belief over contexts. MATE replaces this intractable posterior with a sum-aggregated memory, leveraging the posterior's permutation invariance to retain provably sufficient expressiveness. Compared to prior memory architectures, MATE avoids the growing per-step rollout cost of Transformers and the gradient issues commonly associated with Recurrent Neural Networks (RNNs). Extensive evaluations across diverse benchmarks demonstrate that MATE provides clear computational advantages while achieving performance comparable to standard sequence-model baselines.
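The two properties the abstract attributes to a sum-aggregated memory, order independence and constant per-step cost, can be seen in a toy version. This is a sketch under stated assumptions: MATE's embedding is presumably learned, whereas the `embed` function below is a fixed hand-written stand-in, and the `SumMemory` class and the sample transitions are hypothetical.

```python
def embed(transition):
    """Toy stand-in for a learned transition embedding (assumption)."""
    state, action, reward, next_state = transition
    return [state, action, reward, next_state, state * action]


class SumMemory:
    """Memory that accumulates transition embeddings by summation."""

    def __init__(self, dim=5):
        self.total = [0.0] * dim
        self.count = 0

    def update(self, transition):
        # One embedding and one vector addition per step, regardless of
        # how many transitions have been seen so far.
        self.total = [t + e for t, e in zip(self.total, embed(transition))]
        self.count += 1
        return self.total


# Hypothetical (state, action, reward, next_state) transitions.
transitions = [(0.5, 1.0, 0.0, 1.0), (1.0, 0.0, 1.0, 1.5), (1.5, 1.0, -1.0, 0.5)]

forward, backward = SumMemory(), SumMemory()
for t in transitions:
    forward.update(t)
for t in reversed(transitions):
    backward.update(t)
# forward.total and backward.total agree: the memory ignores arrival order.
```

A Transformer attending over the full history would redo work proportional to the history length at each step, while the running sum above touches only a fixed-size vector.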

2605.17429 2026-05-19 cs.LG cs.CV

Radial-Angular Geometry for Reliable Update Diagnosis in Noisy-Label Learning

径向-角向几何用于噪声标签学习中的可靠更新诊断

Ningkang Peng, Jingyang Mao, Xiaoqian Peng, Weiguang Qu, Yanhui Gu

AI总结 本文提出了一种基于径向-角向几何的方法,用于在噪声标签学习中可靠地诊断更新,通过比较观测标签梯度与EMA教师诱导的参考梯度,区分方向一致的困难干净样本更新与由损坏标签引起的冲突更新。

详情
AI中文摘要

噪声标签方法通常依据损失、置信度或熵等前向空间信号来估计样本可靠性。这些信号能说明样本是否难以预测,但并不直接检验其观测标签是否会引出可靠的参数更新。这一差距很重要,因为困难的干净样本和误标样本可能具有相近的损失,却会引出不同的更新。我们将可靠性估计重新表述为对观测标签更新的诊断。样本级经验Fisher迹给出了更新能量在反向空间中的度量:对于分类器层,它可分解为预测残差项和特征敏感性项,因此包含了标量损失之外的信息。然而,迹仍然只是一个径向的幅度信号,无法判断一次大的更新是有益还是有害。因此,我们提出了相对几何冲突(RGC),它将观测标签梯度与由EMA教师诱导的参考梯度进行比较。冲突项有助于区分幅度大但方向一致的困难干净样本更新与由损坏标签引起的大幅冲突更新。在合成和真实世界的噪声标签基准上,RGC在我们的评估协议下提升了困难干净样本的保留率和准确率。

英文摘要

Noisy-label methods often estimate sample reliability from forward-space signals such as loss, confidence, or entropy. These signals indicate whether a sample is difficult to predict, but they do not directly test whether its observed label induces a reliable parameter update. This gap matters because hard clean samples and mislabeled samples can have similar loss while inducing different updates. We recast reliability estimation as diagnosis of the observed-label update. The sample-wise empirical Fisher trace gives a backward-space measure of update energy: for the classifier layer, it factorizes into a prediction-residual term and a feature-sensitivity term, so it captures information beyond scalar loss. Trace, however, is still a radial magnitude signal and cannot decide whether a large update is useful or harmful. We therefore propose Relative Geometric Conflict (RGC), which compares the observed-label gradient with a reference gradient induced by an EMA teacher. The conflict term helps distinguish large but aligned hard-clean updates from large conflicting updates caused by corrupted labels. Across synthetic and real-world noisy-label benchmarks, RGC improves hard-clean preservation and accuracy under our evaluation protocol.
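The factorization claimed for the classifier layer can be verified numerically. For a bias-free linear softmax layer trained with cross-entropy, the per-sample weight gradient is the outer product (p - y) hᵀ, so its squared norm, which is the per-sample empirical Fisher trace for that layer, equals ‖p - y‖² · ‖h‖². The sketch below checks this and then computes a cosine between the observed-label and teacher-reference gradients. The cosine is only one plausible angular signal; the exact definition of RGC is not given in the abstract, and all names and numbers here are assumptions.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def one_hot(index, num_classes):
    return [1.0 if i == index else 0.0 for i in range(num_classes)]


def factorized_trace(probs, target, feature):
    """||p - y||^2 * ||h||^2: residual term times feature term."""
    residual_sq = sum((p - t) ** 2 for p, t in zip(probs, target))
    feature_sq = sum(h * h for h in feature)
    return residual_sq * feature_sq


def explicit_trace(probs, target, feature):
    """Squared Frobenius norm of the outer-product gradient (p - y) h^T."""
    return sum(((p - t) * h) ** 2 for p, t in zip(probs, target) for h in feature)


def gradient_cosine(probs, observed, reference):
    """Cosine between the last-layer gradients for two targets.

    Both gradients share the factor h, so the cosine reduces to the
    cosine between the residuals p - observed and p - reference.
    """
    a = [p - y for p, y in zip(probs, observed)]
    b = [p - q for p, q in zip(probs, reference)]
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


# Hypothetical sample: student logits, penultimate feature, teacher distribution.
probs = softmax([2.0, 0.0, 0.0])
feature = [0.5, -1.0, 2.0]
teacher = [0.9, 0.05, 0.05]

aligned = gradient_cosine(probs, one_hot(0, 3), teacher)      # label agrees with teacher
conflicting = gradient_cosine(probs, one_hot(1, 3), teacher)  # label contradicts teacher
```

In this toy case the mislabeled target produces the larger trace and a negative cosine, while the teacher-consistent label keeps the cosine near one, which matches the abstract's point that magnitude alone cannot separate hard clean samples from corrupted ones.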

2605.17428 2026-05-19 cs.LG cs.AI

Progressive Generalization Augmentation with Deeply Coupled RND-PPO and Domain-Prioritized Noise Injection for Robust Crop Management Reinforcement Learning

渐进泛化增强:结合深度耦合RND-PPO和领域优先噪声注入的稳健作物管理强化学习

Wu Yang

AI总结 本文提出了一种渐进泛化增强方法,通过深度耦合RND-PPO和领域优先噪声注入,解决农业强化学习中早期学习效率与后期泛化能力的平衡、内在和外在奖励的简单加法结合以及统一噪声注入策略的问题,从而提高作物管理的鲁棒性。

详情
AI中文摘要

我们在gym-DSSAT玉米灌溉任务上的初步实验表明,±2摄氏度的温度噪声会使在无噪声条件下训练的PPO策略的经济收益减少11.9%,这是现有研究尚未充分解决的系统性鲁棒性缺陷。本文针对阻碍农业RL系统实际部署的三个相互关联的局限:早期学习效率与后期泛化能力之间的权衡;探索增强型PPO中内在奖励与外在奖励的简单相加;以及统一的测量噪声注入策略,这类策略忽视了已被经验证实的农业状态变量之间的敏感性差异。我们提出三项系统性创新:渐进泛化增强(PGA),实现三阶段课程(第0-800回合为无噪声训练,第800-1200回合为渐进阶段,第1200-2000回合为完整增强);深度耦合的RND-PPO架构,具有双通道GAE归一化、随训练进度衰减的内在系数和语义离散化;以及具有分层激活的领域优先噪声注入。实验评估显示:在佛罗里达,相比最先进的BERT-DQN,产量提高8.43%,氮肥利用效率提高16.42%;在萨拉戈萨,产量提高5.61%(但由于地中海气候条件严苛,经济评分低3.67%);在组合扰动下,性能保留率为94.4%,对比为80.0%。所有实验均使用5个随机种子,在NVIDIA A100 GPU上进行,每次运行约4.2±0.3小时(2000回合,2048步缓冲区,mini-batch大小为64)。

英文摘要

Our preliminary experiments on gym-DSSAT maize irrigation tasks revealed that ±2 °C temperature noise causes an 11.9% reduction in economic returns for PPO policies trained under clean conditions, a systematic robustness deficit that existing research has not adequately addressed. This paper tackles three interconnected limitations impeding practical deployment of agricultural RL systems: the trade-off between early-stage learning efficiency and late-stage generalization capability; the naive additive combination of intrinsic and extrinsic rewards in exploration-augmented PPO; and uniform measurement noise injection strategies that disregard empirically validated differential sensitivity across agricultural state variables. We introduce three systematic innovations: Progressive Generalization Augmentation (PGA) implementing a three-phase curriculum (clean training 0-800 episodes, progressive 800-1200, full augmentation 1200-2000); a deeply coupled RND-PPO architecture with dual-channel GAE normalization, progress-decayed intrinsic coefficients, and semantic discretization; and domain-prioritized noise injection with hierarchical activation. Our experimental evaluation demonstrates: 8.43% yield improvement and 16.42% nitrogen use efficiency improvement over SOTA BERT-DQN in Florida; 5.61% yield improvement in Zaragoza (though 3.67% lower economic score due to challenging Mediterranean climate); and 94.4% vs 80.0% performance retention under combined perturbations. All experiments used 5 random seeds on NVIDIA A100 GPUs with 4.2 ± 0.3 hours per run (2000 episodes, 2048-step buffer, 64 mini-batch size).
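The three-phase curriculum can be written as a schedule that maps the episode index to a noise strength. The episode boundaries (800, 1200) come from the abstract; the linear ramp in the middle phase, the uniform noise distribution, and the names `noise_scale` and `perturb` are assumptions, since the abstract does not say how the progressive phase is shaped or how the noise is drawn.

```python
import random


def noise_scale(episode, clean_end=800, ramp_end=1200):
    """Fraction of the full augmentation noise to apply at this episode.

    0.0 during clean training, 1.0 during full augmentation, and an
    assumed linear ramp in between.
    """
    if episode < clean_end:
        return 0.0
    if episode >= ramp_end:
        return 1.0
    return (episode - clean_end) / (ramp_end - clean_end)


def perturb(value, episode, max_noise=2.0, rng=random):
    """Add uniform noise in [-max_noise, max_noise], scaled by the schedule."""
    return value + noise_scale(episode) * rng.uniform(-max_noise, max_noise)


# Hypothetical temperature observation of 25 °C at three points in training.
rng = random.Random(0)
early = perturb(25.0, episode=100, rng=rng)   # clean phase: unchanged
middle = perturb(25.0, episode=1000, rng=rng)  # ramp: at most ±1 °C
late = perturb(25.0, episode=1500, rng=rng)    # full augmentation: at most ±2 °C
```

Keeping the first phase noise-free lets the policy learn the task before robustness is demanded of it, which is the efficiency-versus-generalization trade-off the curriculum is meant to manage.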