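arXivDaily: a daily digest of academic papers, synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and related fields.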

2508.13309 2026-05-26 cs.CV cs.LG

DASH: A Meta-Attack Framework for Synthesizing Effective and Stealthy Adversarial Examples

Abdullah Al Nomaan Nafi, Habibur Rahaman, Zafaryab Haider, Tanzim Mahfuz, Fnu Suya, Swarup Bhunia, Prabuddha Chakraborty

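AI summary: Proposes DASH, a meta-attack framework that adaptively composes Lp-constrained attacks over multiple stages to generate effective, perceptually aligned adversarial examples, outperforming existing methods on several datasets.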

Comments Accepted to CVPR 2026

Abstract

Numerous techniques have been proposed for generating adversarial examples in white-box settings under strict Lp-norm constraints. However, such norm-bounded examples often fail to align well with human perception, and only a few methods specifically explore perceptually aligned adversarial examples. Moreover, it remains unclear whether insights from Lp-constrained attacks can be effectively leveraged to improve perceptual efficacy. In this paper, we introduce DASH, a fully differentiable meta-attack framework that generates effective and perceptually aligned adversarial examples by strategically composing existing Lp-based attack methods. DASH operates in a multi-stage fashion: at each stage, it aggregates candidate adversarial examples from multiple base attacks using learned, adaptive weights and propagates the result to the next stage. A novel meta-loss function guides this process by jointly minimizing misclassification loss and perceptual distortion, enabling the framework to dynamically modulate the contribution of each base attack throughout the stages. We evaluate DASH on adversarially trained models across CIFAR-10, CIFAR-100, and ImageNet. Despite relying solely on Lp-constrained methods, DASH significantly outperforms state-of-the-art perceptual attacks such as AdvAD, achieving higher attack success rates (e.g., a 20.63% improvement) and superior visual quality, as measured by SSIM, LPIPS, and FID (improvements of $\approx$11, 0.015, and 5.7, respectively). Furthermore, DASH generalizes well to unseen defenses, making it a practical and strong baseline for evaluating robustness without requiring handcrafted adaptive attacks for each new defense.
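
The multi-stage aggregation described in this abstract can be pictured with a small NumPy sketch. It is only an illustration under stated assumptions, not the authors' implementation: the function names, the softmax parameterization of the weights, and the clipping to [0, 1] are all hypothetical, and the weights are passed in as fixed logits rather than learned from a meta-loss.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array of logits."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

def aggregate_stage(candidates, logits):
    """Convex combination of candidate adversarial examples.

    candidates: array of shape (num_attacks, *image_shape)
    logits:     array of shape (num_attacks,), assumed to be learned elsewhere
    """
    weights = softmax(logits)
    # Contract the attack axis: sum_k weights[k] * candidates[k]
    return np.tensordot(weights, candidates, axes=1)

def run_stages(x, base_attacks, stage_logits):
    """Feed the aggregated result of each stage into the next one."""
    for logits in stage_logits:
        candidates = np.stack([attack(x) for attack in base_attacks])
        x = np.clip(aggregate_stage(candidates, logits), 0.0, 1.0)
    return x
```

In a differentiable framework the logits would be trainable parameters updated by the gradient of the meta-loss; here they are plain inputs so that the data flow stays easy to follow.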

2508.12628 2026-05-26 cs.CV

Creative4U: MLLMs-based Advertising Creative Image Selector with Comparative Reasoning

Yukang Lin, Xiang Zhang, Shichang Jia, Bowen Wan, Chenghan Fu, Xudong Ren, Yueran Liu, Wanxian Guan, Pengji Wang, Jian Xu, Bo Zheng, Baolin Liu

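AI summary: Proposes an MLLM-based paradigm for assessing and selecting creative images, building the comparative-reasoning dataset CreativePair and the reinforcement-learning-trained selector Creative4U to make creative selection explainable.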

Abstract

Creative images in advertising are the heart and soul of e-commerce platforms. An eye-catching creative image can enhance the shopping experience for users, boosting income for advertisers and advertising revenue for platforms. With the advent of AIGC technology, advertisers can produce large quantities of creative images at minimal cost. However, they struggle to assess creative quality when choosing among them. Existing methods primarily focus on creative ranking, which fails to address the need for explainable creative selection. In this work, we propose the first paradigm for explainable creative assessment and selection. Powered by multimodal large language models (MLLMs), our approach integrates the assessment and selection of creative images into a natural language generation task. To facilitate this research, we construct CreativePair, the first comparative reasoning-induced creative dataset featuring 8k annotated image pairs, with each sample including a label indicating which image is superior. Additionally, we introduce Creative4U (pronounced Creative for You), an MLLM-based creative selector that takes into account users' interests. Through Reason-to-Select RFT, which includes supervised fine-tuning with Chain-of-Thought (CoT-SFT) and Group Relative Policy Optimization (GRPO) based reinforcement learning, Creative4U is able to evaluate and select creative images accurately. Both offline and online experiments demonstrate the effectiveness of our approach. Our code and dataset will be made public to advance research and industrial applications.

2508.03104 2026-05-26 cs.LG cs.AI

HiTeC: Hierarchical Contrastive Learning on Text-Attributed Hypergraph with Semantic-Aware Augmentation

Mengting Pan, Fan Li, Chen Chen, Xiaoyang Wang, Wenjie Zhang

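AI summary: Proposes HiTeC, a two-stage hierarchical contrastive learning framework that combines structure-aware text-encoder pre-training with semantic-aware augmentation to address weak text-topology coupling, noisy random augmentation, and poor long-range dependency capture in text-attributed hypergraphs.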

Comments 16 pages, 8 figures

Abstract

Contrastive learning (CL) has become a dominant paradigm for self-supervised hypergraph learning, enabling effective training without costly labels. However, node entities in real-world hypergraphs are often associated with rich textual information, which has been largely ignored in prior works. Directly applying existing CL-based methods to such text-attributed hypergraphs (TAHGs) leads to three key limitations: (1) The common use of graph-agnostic text encoders fails to capture the correlations between textual semantics and hypergraph topology, resulting in less expressive representations. (2) Their reliance on random data augmentations introduces noise and weakens the contrastive signals. (3) The primary focus on node- and hyperedge-level contrastive signals limits the ability to capture long-range dependencies, which is essential for effective representation learning. To address these challenges, we introduce HiTeC, a two-stage hierarchical contrastive learning framework for effective self-supervised learning on TAHGs. In the first stage, we pre-train the text encoder with a structure-aware contrastive objective to overcome the graph-agnostic nature of conventional methods. In the second stage, we begin by introducing semantic-aware augmentations, including structure-contextualized text augmentation and semantic-aware hyperedge dropping, to facilitate informative view generation. Subsequently, we propose a multi-scale contrastive loss with an $s$-walk-based subgraph-level objective to capture long-range dependencies. Extensive experiments on six real-world datasets validate the effectiveness of our proposed method.

2507.07644 2026-05-26 cs.AI

FloorplanQA: A Benchmark for Spatial Reasoning in LLMs using Structured Representations

Fedor Rodionov, Abdelrahman Eldesokey, Michael Birsak, John Femiani, Bernard Ghanem, Peter Wonka

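AI summary: Proposes the FloorplanQA benchmark, which uses structured indoor-scene representations to evaluate LLMs on spatial reasoning tasks such as distance measurement, visibility, path finding, and object placement, exposing blind spots in respecting physical constraints and keeping spatial consistency.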

Comments ICML 2026, Project page: https://OldDeLorean.github.io/FloorplanQA/

Abstract

We introduce FloorplanQA, a diagnostic benchmark for evaluating spatial reasoning in large language models (LLMs). FloorplanQA is grounded in structured representations of indoor scenes (e.g., kitchens, living rooms, bedrooms, bathrooms, and others), encoded symbolically in JSON or XML layouts. The benchmark covers core spatial tasks, including distance measurement, visibility, path finding, and object placement within constrained spaces. Our results across a variety of frontier open-source and commercial LLMs reveal that while models may succeed in shallow queries, they often fail to respect physical constraints and preserve spatial coherence, though they remain mostly robust to small spatial perturbations. FloorplanQA uncovers a blind spot in today's LLMs: inconsistent reasoning about indoor layouts. We hope this benchmark inspires new work on language models that can accurately infer and manipulate spatial and geometric properties in practical settings.

2507.05890 2026-05-26 cs.CL cs.AI

Psychometric Item Validation Using Virtual Respondents with Trait-Response Mediators

Sungjib Lim, Woojung Song, Eun-Ju Lee, Yohan Jo

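AI summary: Proposes a framework that uses LLMs to simulate virtual respondents (via mediators) to validate psychometric items efficiently; experiments show that it identifies high-validity items.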

Comments This paper has been accepted for publication at TACL 2026

Abstract

As psychometric surveys are increasingly used to assess the traits of large language models (LLMs), the need for scalable survey item generation suited for LLMs has also grown. A critical challenge here is ensuring the construct validity of generated items, i.e., whether they truly measure the intended trait. Traditionally, this requires costly, large-scale human data collection. To make it efficient, we present a framework for virtual respondent simulation using LLMs. Our central idea is to account for mediators: factors through which the same trait can give rise to varying responses to a survey item. By simulating respondents with diverse mediators, we identify survey items that yield responses robustly correlated with intended traits across these mediators. Experiments on three psychological trait theories (Big5, Schwartz, VIA) show that our mediator generation methods and simulation framework effectively identify high-validity items. LLMs demonstrate the ability to generate plausible mediators from trait definitions and to simulate respondent behavior for item validation. Our problem formulation, metrics, methodology, and dataset open a new direction for cost-efficient survey development and a deeper understanding of how LLMs simulate human survey responses. We release our dataset and code to support future work.

2507.03159 2026-05-26 cs.LG math.OC

MathOptAI.jl: Embed trained machine learning predictors into JuMP models

Oscar Dowson, Robert B Parker, Russel Bent

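AI summary: Presents MathOptAI.jl, an open-source Julia library that embeds trained machine learning models (neural networks, decision trees, Gaussian processes) into JuMP optimization models and supports GPU acceleration for PyTorch models.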

2506.21137 2026-05-26 cs.LG

Norm$\times$Direction: Restoring the Missing Query Norm in Vision Linear Attention

Weikang Meng, Yadan Luo, Liangyu Huo, Yingjian Li, Yaowei Wang, Xin Li, Zheng Zhang

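AI summary: To address the information loss that linear attention suffers from the cancelled query norm and from non-negativity constraints, proposes NaLaFormer, built on a norm-direction decomposition; it injects the query norm to restore the spikiness of the attention distribution and uses cosine similarity to guarantee non-negativity, setting a new state of the art for linear attention on several tasks.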

Abstract

Linear attention mitigates the quadratic complexity of softmax attention but suffers from a critical loss of expressiveness. We identify two primary causes: (1) The normalization operation cancels the query norm, which breaks the correlation between a query's norm and the spikiness (entropy) of the attention distribution as in softmax attention. (2) Standard techniques for enforcing non-negativity cause destructive information loss by nullifying valid inner-product interactions. To address these challenges, we introduce NaLaFormer, a novel linear attention mechanism built upon a norm$\times$direction (ND) decomposition of the query and key vectors. We leverage each component to solve a distinct problem: The query norm is injected into our kernel to create a query-norm-aware map that restores the attention distribution's spikiness. The direction vectors are processed by a geometric, cosine-based similarity metric that guarantees non-negativity while preserving the rich, fine-grained information of the inner product. We validate NaLaFormer through a comprehensive multi-modal evaluation, where it sets new state-of-the-art benchmarks for linear attention. Our model achieves up to a 7.5% accuracy gain on ImageNet-1K and a 4.7% mIoU improvement on ADE20K over comparable baselines. It demonstrates profound efficiency, reducing peak memory by a transformative 92.3% in token-intensive super-resolution tasks (70K+ tokens). NaLaFormer's versatility is further confirmed as it surpasses strong baselines like Mamba on common-sense reasoning and sets a new state-of-the-art on the Long Range Arena (LRA) benchmark. Code is available at https://github.com/ZacharyMeng/NaLaFormer .
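
The first cause named in this abstract, that normalization cancels the query norm, can be checked numerically. The sketch below is an illustration, not NaLaFormer itself: it assumes a ReLU feature map (any positively homogeneous map behaves the same way) and considers the attention weights of a single query. Scaling the query leaves the normalized linear attention weights unchanged, while the softmax weights become spikier (lower entropy).

```python
import numpy as np

def softmax_attention_weights(q, K):
    """Softmax attention weights of one query q over the keys K (rows)."""
    scores = K @ q
    w = np.exp(scores - scores.max())
    return w / w.sum()

def linear_attention_weights(q, K):
    """Normalized linear attention weights with an assumed ReLU feature map."""
    phi = lambda x: np.maximum(x, 0.0)
    scores = phi(K) @ phi(q)
    return scores / scores.sum()

def entropy(p):
    """Shannon entropy (in nats) of a probability vector."""
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

q = np.array([1.0, 0.5, -0.2])
K = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.5, 0.5, 1.0]])

# Linear attention: the factor 3 cancels between numerator and denominator.
same = np.allclose(linear_attention_weights(q, K),
                   linear_attention_weights(3.0 * q, K))

# Softmax attention: a larger query norm sharpens the distribution.
sharper = entropy(softmax_attention_weights(3.0 * q, K)) < \
          entropy(softmax_attention_weights(q, K))
```

Both `same` and `sharper` evaluate to `True` for this example, which is the mismatch the abstract describes: softmax attention ties the query norm to spikiness, and the normalized linear form discards that signal.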

2506.19037 2026-05-26 cs.CL cs.AI cs.IT cs.LG cs.NE math.IT

Plan for Speed: Dilated Scheduling for Masked Diffusion Language Models

Omer Luxembourg, Haim Permuter, Eliya Nachmani

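AI summary: Proposes the Dilated Unmasking Scheduler (DUS), which partitions sequence positions into non-adjacent dilated groups and unmasks them in parallel to minimize an upper bound on joint entropy gain, achieving up to 5.8x speedup without modifying the denoiser.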

Comments Accepted at ICML 2026

Abstract

Masked diffusion language models (MDLMs) promise fast, non-autoregressive text generation, yet existing samplers, which pick tokens to unmask based on model confidence, ignore interactions when unmasking multiple positions in parallel and effectively reduce to slow, autoregressive behavior. We propose the Dilated Unmasking Scheduler (DUS), an inference-only, planner-model-free method that partitions sequence positions into non-adjacent dilated groups and unmasks them in parallel so as to minimize an upper bound on joint entropy gain at each denoising step. By explicitly trading off the number of network calls against generation quality, DUS recovers most of the performance lost under traditional parallel unmasking strategies. Across math (GSM8K, MATH500), code (HumanEval, MBPP), general-knowledge (BBH, MMLU-Pro), and instruction following (IFEval) benchmarks, DUS outperforms confidence-based planners and turns the diffusion-specific quality-speed trade-off into a deterministic, predictable speedup set by the block size $B$, yielding up to $5.8\times$ wall-clock speedup over token-by-token MDLM decoding without modifying the underlying denoiser. Applied as a drop-in post-filter, dilated spacing also improves adaptive samplers. Code is available at https://github.com/omerlux/DUS.
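
The idea of splitting positions into non-adjacent dilated groups can be illustrated with a strided partition. This is a minimal sketch of that one idea only; the function name and the fixed `dilation` parameter are assumptions, and the actual DUS schedule (how dilation relates to the block size $B$ and to the denoising steps) may differ.

```python
def dilated_groups(block_size, dilation):
    """Partition positions 0..block_size-1 into strided (dilated) groups.

    Positions in the same group are `dilation` apart, so with dilation >= 2
    no two positions unmasked together are adjacent.
    """
    if block_size < 0 or dilation < 1:
        raise ValueError("block_size must be >= 0 and dilation >= 1")
    return [list(range(offset, block_size, dilation))
            for offset in range(min(dilation, block_size))]

# Each inner list would be unmasked in one parallel denoising step.
print(dilated_groups(8, 4))  # [[0, 4], [1, 5], [2, 6], [3, 7]]
```

With this partition a block of 8 positions takes 4 network calls instead of 8, which is the kind of predictable call-count versus quality trade-off the abstract refers to.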

2506.17629 2026-05-26 cs.CV cs.AI cs.CL

CLiViS: Unleashing Cognitive Map through Linguistic-Visual Synergy for Embodied Visual Reasoning

Kailing Li, Qi'ao Xu, Tianwen Qian, Yuqian Fu, Yang Jiao, Xiaoling Wang

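AI summary: Proposes CLiViS, a training-free framework for embodied visual reasoning in which an LLM performs high-level task planning and orchestrates VLM-driven open-world visual perception, building a dynamic cognitive map that iteratively updates the scene context.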

Abstract

Embodied Visual Reasoning (EVR) seeks to follow complex, free-form instructions based on egocentric video, enabling semantic understanding and spatiotemporal reasoning in dynamic environments. Despite its promising potential, EVR encounters significant challenges stemming from the diversity of complex instructions and the intricate spatiotemporal dynamics in long-term egocentric videos. Prior solutions either employ Large Language Models (LLMs) over static video captions, which often omit critical visual details, or rely on end-to-end Vision-Language Models (VLMs) that struggle with stepwise compositional reasoning. Considering the complementary strengths of LLMs in reasoning and VLMs in perception, we propose CLiViS, a novel training-free framework that leverages LLMs for high-level task planning and orchestrates VLM-driven open-world visual perception to iteratively update the scene context. Building on this synergy, the core of CLiViS is a dynamic Cognitive Map that evolves throughout the reasoning process. This map constructs a structured representation of the embodied scene, bridging low-level perception and high-level reasoning. Extensive experiments across multiple benchmarks demonstrate the effectiveness and generality of CLiViS, especially in handling long-term visual dependencies. Code is available at https://github.com/Teacher-Tom/CLiViS.

2506.17326 2026-05-26 cs.LG stat.AP stat.ML

CopulaSMOTE: A Copula-Based Oversampling Approach for Imbalanced Classification in Diabetes Prediction

Agnideep Aich, Md Monzur Murshed, Bruce Wade, Sameera Hewage

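AI summary: Proposes CopulaSMOTE, which uses truncated vine copulas to model the joint dependence structure of the minority class when generating synthetic samples; evaluated with multiple classifiers on three diabetes datasets, it improves minority-class recovery on larger tabular datasets.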

Abstract

Class imbalance remains a practical obstacle in the development of clinical prediction models for conditions such as diabetes mellitus, where the number of confirmed cases is often much smaller than the number of controls. The Synthetic Minority Over-sampling Technique (SMOTE) and its variants are widely used to address this imbalance, but they generate synthetic observations through local interpolation in feature space and do not explicitly model the joint dependence structure of the minority class. To address this challenge, our study introduces a copula-based data augmentation approach that estimates the minority-class dependence structure when generating synthetic samples and integrates with standard machine learning techniques. Specifically, we employ truncated vine copulas to represent multivariate dependence through a sequence of bivariate building blocks. We evaluate the proposed approach on three public diabetes datasets, namely the Pima Indians Diabetes dataset, the Iraqi Diabetes dataset, and the CDC BRFSS 2015 Diabetes Health Indicators dataset, which together cover a range of sample sizes, dimensionalities, and imbalance regimes. For each dataset, five resampling strategies are compared across five classifiers using a 5 by 2 cross validation protocol with Dietterich's paired t test. Our findings suggest that CopulaSMOTE can improve minority-class recovery in larger tabular diabetes datasets, particularly the CDC BRFSS dataset, but its advantages depend on the classifier and evaluation metric.
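
For contrast with the copula approach, the "local interpolation in feature space" used by classic SMOTE can be sketched as below. This shows the baseline the abstract argues against, not CopulaSMOTE; the function name, brute-force neighbor search, and parameter defaults are assumptions chosen for brevity.

```python
import numpy as np

def smote_sample(X_min, k=5, n_new=10, seed=0):
    """Generate synthetic minority samples by SMOTE-style interpolation.

    Each new point lies on the segment between a minority sample and one of
    its k nearest minority neighbors: x_new = x_i + lam * (x_nn - x_i).
    Requires at least two minority samples.
    """
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)

    # Brute-force pairwise distances; a sample is never its own neighbor.
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)
    neighbors = np.argsort(dist, axis=1)[:, :k]

    base = rng.integers(0, n, size=n_new)                # anchor samples
    picked = neighbors[base, rng.integers(0, k, size=n_new)]
    lam = rng.random((n_new, 1))                         # interpolation factor in [0, 1)
    return X_min[base] + lam * (X_min[picked] - X_min[base])
```

Because every synthetic point is a convex combination of two nearby samples, the procedure never looks at how features co-vary across the whole minority class, which is the gap a dependence model such as a vine copula is meant to fill.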

2506.11027 2026-05-26 cs.LG cs.AI cs.PL

From Reasoning to Code: GRPO Optimization for Underrepresented Languages

Federico Pennino, Bianca Raimondi, Massimo Rondelli, Andrea Gurioli, Maurizio Gabbrielli

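AI summary: Proposes a reinforcement learning approach that pairs small Qwen2.5-Coder models with GRPO, using execution feedback and a reward mechanism to improve code-generation accuracy and reasoning quality for low-resource languages such as Prolog and Lisp.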

Comments Accepted ICLP 2026

Abstract

Generating accurate and executable code using Large Language Models (LLMs) remains a significant challenge for underrepresented programming languages, such as Prolog and Lisp, due to the scarcity of public training data compared to high-resource languages like Python. This paper introduces a generalizable Reinforcement Learning (RL) approach that combines small-scale versions of the Qwen2.5-Coder model with Group Relative Policy Optimization (GRPO) to enable effective code generation through reasoning. To address the limitations of sparse datasets, we integrate execution-driven feedback directly into the RL loop, utilizing a reward system that exploits both logical correctness and structural formatting. Experimental results on the GSM8K dataset demonstrate significant improvements in reasoning quality and code accuracy across underrepresented languages. These findings underscore the potential of our approach to benefit a wide range of programming languages lacking extensive training resources by leveraging symbolic reasoning and interpreter-based feedback.
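
The group-relative part of GRPO and a reward that mixes correctness with formatting can be sketched in a few lines. The reward weights and function names below are hypothetical; the abstract does not give the paper's actual reward values. The advantage is the usual group normalization: each sampled completion is scored against the mean and standard deviation of its own group.

```python
import numpy as np

def reward(is_correct, is_well_formatted, w_correct=1.0, w_format=0.2):
    """Toy reward: execution correctness plus a smaller formatting bonus.

    The two weights are illustrative assumptions, not values from the paper.
    """
    return w_correct * float(is_correct) + w_format * float(is_well_formatted)

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of completions for the same prompt."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Four completions for one prompt: (passed interpreter check, followed format)
group = [(True, True), (False, True), (False, False), (True, True)]
advantages = group_relative_advantages([reward(c, f) for c, f in group])
```

Completions that both run correctly and follow the expected format receive positive advantages, while a group in which all completions earn the same reward yields advantages of zero and therefore no policy update signal.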

2506.10689 2026-05-26 cs.CV

Underage Detection through a Multi-Task and MultiAge Approach for Screening Minors in Unconstrained Imagery

Christopher Gaul, Eduardo Fidalgo, Enrique Alegre, Rocío Alaiz Rodríguez, Eri Pérez Corral

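AI summary: Proposes a multi-task architecture built on a frozen FaRL vision-language backbone and a compact two-layer MLP, combined with an α-reweighted focal loss and age-balanced sampling, to detect minors in unconstrained images, with clear gains on a new benchmark.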

DOI: 10.1016/j.patcog.2026.113440

Abstract

Accurate automatic screening of minors in unconstrained images requires models robust to distribution shift and resilient to the under-representation of children in public datasets. To address these issues, we propose a multi-task architecture with dedicated under/over-age discrimination tasks based on a frozen FaRL vision-language backbone joined with a compact two-layer MLP that shares features across one age-regression head and four binary underage heads (12, 15, 18, and 21 years). This design focuses on the legally critical age range while keeping the backbone frozen. Class imbalance is mitigated through an $\alpha$-reweighted focal loss and age-balanced mini-batch sampling, while an age gap removes ambiguous samples near thresholds. Evaluation is conducted on our new Overall Underage Benchmark (303k cleaned training images, 110k test images), defining both the "ASORES-39k" restricted overall test, which removes the noisiest domains, and the age estimation wild-shifts test "ASWIFT-20k" of 20k images, stressing extreme poses (>45°), expressions, and low image quality to emulate real-world shifts. Trained on the cleaned overall set with resampling and age gap, our multiage model "F" reduces the mean absolute error on ASORES-39k from 4.175 y (age-only baseline) to 4.068 y and improves under-18 detection from an F2 score of 0.801 to 0.857 at 1% false-adult rate. On ASWIFT-20k, the same configuration nearly sustains 0.99 recall while F2 rises from 0.742 to 0.833, demonstrating robustness to domain shift.
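
The $\alpha$-reweighted focal loss mentioned in this abstract follows a standard form, sketched below for one binary head. The default values of `alpha` and `gamma` are the commonly cited ones for focal loss and are assumptions here; the abstract does not state which values the authors use.

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-12):
    """Per-sample binary focal loss: -alpha_t * (1 - p_t)**gamma * log(p_t).

    p: predicted probability of the positive class (e.g. "under 18")
    y: ground-truth labels in {0, 1}
    alpha weights the positive class, (1 - alpha) the negative class.
    """
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0 - eps)
    y = np.asarray(y)
    p_t = np.where(y == 1, p, 1.0 - p)            # probability of the true class
    alpha_t = np.where(y == 1, alpha, 1.0 - alpha)
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)
```

The factor `(1 - p_t) ** gamma` shrinks the loss of confidently correct samples, so training concentrates on hard cases, and `alpha` rebalances the two classes. With `gamma=0` and `alpha=0.5` the expression reduces to half the ordinary binary cross-entropy.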

2506.06454 2026-05-26 cs.LG cs.AI stat.ML

LETS Forecast: Learning Embedology for Time Series Forecasting

Abrar Majeedi, Viswanatha Reddy Gajjala, Satya Sai Srinath Namburi GNVV, Nada Magdi Elkordi, Yin Li

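AI summary: Proposes DeepEDM, a framework that combines nonlinear dynamical systems modeling with deep learning, learning latent dynamics through delay embeddings and kernel regression for accurate time series forecasting.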

Comments Accepted at International Conference on Machine Learning (ICML) 2025

Abstract

Real-world time series are often governed by complex nonlinear dynamics. Understanding these underlying dynamics is crucial for precise future prediction. While deep learning has achieved major success in time series forecasting, many existing approaches do not explicitly model the dynamics. To bridge this gap, we introduce DeepEDM, a framework that integrates nonlinear dynamical systems modeling with deep neural networks. Inspired by empirical dynamic modeling (EDM) and rooted in Takens' theorem, DeepEDM presents a novel deep model that learns a latent space from time-delayed embeddings, and employs kernel regression to approximate the underlying dynamics, while leveraging an efficient implementation of softmax attention and allowing for accurate prediction of future time steps. To evaluate our method, we conduct comprehensive experiments on synthetic data of nonlinear dynamical systems as well as real-world time series across domains. Our results show that DeepEDM is robust to input noise, and outperforms state-of-the-art methods in forecasting accuracy. Our code is available at: https://abrarmajeedi.github.io/deep_edm.
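
Two ingredients named in this abstract, time-delay embedding and kernel regression with softmax-style weights, can be shown in their classical, non-deep form. This is a sketch of the underlying EDM idea under stated assumptions, not DeepEDM: there is no learned latent space, and the function names, Gaussian kernel, and `bandwidth` parameter are illustrative choices.

```python
import numpy as np

def delay_embed(x, dim, tau=1):
    """Time-delay embedding: row t is [x[t], x[t+tau], ..., x[t+(dim-1)*tau]]."""
    x = np.asarray(x, dtype=float)
    n = len(x) - (dim - 1) * tau
    if n <= 0:
        raise ValueError("series too short for this dim and tau")
    return np.stack([x[i * tau: i * tau + n] for i in range(dim)], axis=1)

def kernel_forecast(library, targets, query, bandwidth=1.0):
    """Nadaraya-Watson regression with a Gaussian kernel.

    The weights are a softmax over negative squared distances, which is why
    the computation has the same shape as an attention lookup.
    """
    d2 = ((library - query) ** 2).sum(axis=1)
    logits = -d2 / (2.0 * bandwidth ** 2)
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return float(w @ targets)
```

A typical use builds a library from past windows, pairs each window with the value that followed it, and forecasts the next value of a new window as a similarity-weighted average of those stored successors.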

2506.04805 2026-05-26 cs.LG

Adaptive Preconditioners Trigger Loss Spikes in Adam

Zhiwei Bai, Zhangchen Zhou, Jiajie Zhao, Xiaolong Li, Zhiyu Li, Feiyu Xiong, Hongkang Yang, Yaoyu Zhang, Zhi-Qin John Xu

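AI summary: By analyzing the internal dynamics of Adam's second-moment estimator, finds that a decoupling between the adaptive preconditioner and the instantaneous squared gradients causes loss spikes, and proposes a spike predictor based on a quadratic approximation analysis.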

Comments Accepted to ICML 2026

Abstract

Loss spikes commonly emerge during neural network training with the Adam optimizer across diverse architectures and scales, yet their underlying mechanism remains elusive. While previous explanations attribute these phenomena to sharper loss landscapes at lower loss, we show that landscape geometry alone is insufficient to explain the phenomenon. In this work, we pinpoint the root cause in the internal dynamics of Adam's second moment estimator. We identify a critical "decoupling" mechanism where the adaptive preconditioner $v_t$ fails to track the instantaneous squared gradients $g_t^2$, causing the adaptive mechanism to effectively fail. This decoupling allows the preconditioner to decay autonomously despite rising gradients, which pushes the maximum eigenvalue of the preconditioned Hessian beyond the stability threshold $2/\eta$ for sustained periods, manifesting as dramatic loss spikes. Through a quadratic approximation analysis, we theoretically and experimentally characterize five distinct stages of spike evolution and propose a predictor for anticipating spikes based on gradient-directional curvature. We empirically find that the proposed loss spike mechanism, although derived from simplified models, generalizes well to practical scenarios ranging from small neural networks to large-scale Transformers.
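
The decay of the preconditioner and the $2/\eta$ threshold can be made concrete with a one-dimensional toy. The sketch assumes the standard Adam second-moment update $v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2$ and deliberately ignores momentum and bias correction; the function names and the scalar stability check are simplifications for illustration, not the paper's analysis.

```python
import math

def second_moment_trace(grads_sq, beta2=0.999, v0=0.0):
    """Exponential moving average v_t = beta2 * v_{t-1} + (1 - beta2) * g_t^2."""
    v, trace = v0, []
    for g2 in grads_sq:
        v = beta2 * v + (1.0 - beta2) * g2
        trace.append(v)
    return trace

def exceeds_stability_threshold(curvature, v, lr, eps=1e-8):
    """1-D check: is the preconditioned curvature h / (sqrt(v) + eps) above 2 / lr?"""
    return curvature / (math.sqrt(v) + eps) > 2.0 / lr

# When squared gradients are tiny compared with v, v shrinks by beta2 per step.
decayed = second_moment_trace([0.0] * 10, beta2=0.9, v0=1.0)[-1]
```

Here `decayed` equals `0.9 ** 10`: the estimator keeps shrinking on its own schedule regardless of the curvature. As `v` falls, the effective step `lr / sqrt(v)` grows, and once the preconditioned curvature passes `2 / lr` the quadratic model becomes unstable.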

2505.22322 2026-05-26 cs.LG

A Closer Look on Memorization in Tabular Diffusion Model: A Data-Centric Perspective

Zhengyu Fang, Zhimeng Jiang, Huiyuan Chen, Xiaoge Zhang, Kaiyu Tang, Xiao Li, Jing Li

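AI summary: Presents the first data-centric study of memorization dynamics in tabular diffusion models; by quantifying memorization per real sample, it finds that a small subset of samples accounts for most of the leakage, and proposes the two-stage mitigation method DynamicCut.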

Comments Published in Transactions on Machine Learning Research (TMLR), 2026

Abstract

Diffusion models have shown strong performance in generating high-quality tabular data, but they carry privacy risks by reproducing exact training samples. While prior work focuses on dataset-level augmentation to reduce memorization, little is known about which individual samples contribute most. We present the first data-centric study of memorization dynamics in tabular diffusion models. We quantify memorization for each real sample based on how many generated samples are flagged as replicas, using a relative distance ratio. Our empirical analysis reveals a heavy-tailed distribution of memorization counts: a small subset of samples contributes disproportionately to leakage, confirmed via sample-removal experiments. To understand this, we divide real samples into top- and non-top-memorized groups and analyze their training-time behaviors. We track when each sample is first memorized and monitor per-epoch memorization intensity (AUC). Memorized samples are memorized slightly earlier and show stronger signals in early training. Based on these insights, we propose DynamicCut, a two-stage, model-agnostic mitigation method: (a) rank samples by epoch-wise intensity, (b) prune a tunable top fraction, and (c) retrain on the filtered dataset. Across multiple tabular datasets and models, DynamicCut reduces memorization with minimal impact on data diversity and downstream performance. It also complements augmentation-based defenses. Furthermore, DynamicCut enables cross-model transferability: high-ranked samples identified from one model (e.g., a diffusion model) are also effective for reducing memorization when removed from others, such as GANs and VAEs.
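
The rank-and-prune steps (a) and (b) described in this abstract amount to dropping the highest-scoring fraction of training rows. The sketch below assumes the per-sample memorization intensities have already been computed elsewhere; the function name, the flooring of the drop count, and the stable tie-breaking are illustrative assumptions.

```python
import numpy as np

def prune_top_fraction(scores, fraction):
    """Return indices of samples to keep after dropping the top `fraction`.

    scores:   per-sample memorization intensity (higher = more memorized)
    fraction: share of samples to remove, in [0, 1)
    """
    scores = np.asarray(scores, dtype=float)
    if not 0.0 <= fraction < 1.0:
        raise ValueError("fraction must be in [0, 1)")
    n_drop = int(np.floor(fraction * len(scores)))
    order = np.argsort(-scores, kind="stable")   # most memorized first
    return np.sort(order[n_drop:])               # keep original row order

keep = prune_top_fraction([0.1, 0.9, 0.3, 0.8], fraction=0.5)
print(keep.tolist())  # [0, 2]
```

Step (c) would then retrain the generative model on the rows selected by `keep`; since the pruning only needs a score per sample, the same index set can be reused with a different model family.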

2505.11758 2026-05-26 cs.CV cs.AI cs.GR cs.RO

Generalizable Vision-Language Few-Shot Adaptation with Predictive Prompts and Negative Learning

具有预测性提示和负学习的可泛化视觉语言少样本适应

Sriram Mandalika

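AI summary: Proposes SCAN, a framework that addresses negative-class signal handling in few-shot adaptation of vision-language models through query-adaptive negative routing, LLM-guided contrastive prompts, and adaptive fusion weights, with an average gain of 4.61% across 11 benchmarks.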

AI中文摘要

视觉语言模型的少样本适应在推理时如何处理负类信号方面仍然存在根本性限制。现有方法对所有查询应用统一的负抑制，忽略了最具破坏性的混淆是查询特定的，并且随支持集几何形状而变化。我们提出SCAN（选择性混淆感知负样本），一个通过三个针对性贡献解决这一问题的框架。在推理中，查询自适应负路由将抑制限制在每个查询最易混淆的前K个类别，无需额外参数。通用负文本模板被替换为LLM引导的对比提示，描述易混淆类别对之间的区分属性，在关键处锐化文本决策边界。基于支持集Fisher可判别性估计的无参数自适应融合权重消除了手动调整视觉语言权衡的需要。在11个标准基准上评估，SCAN在16-shot设置下平均优于先前的基于提示和基于适配器的方法4.61%，在类间混淆最严重的细粒度数据集上提升高达7.70%。SCAN在分布偏移下也表现出强泛化性，在四个ImageNet OOD变体上平均提升2.95%，并在显著标签噪声下保持稳健性能，在50%标签损坏下的准确率仍超过最强竞争方法的干净基线。

英文摘要

Few-shot adaptation of vision-language models remains fundamentally limited by how negative class signals are handled at inference. Existing methods apply uniform negative suppression across all queries, ignoring that the most damaging confusions are query-specific and shift with support-set geometry. We introduce SCAN (Selective Confusion-Aware Negatives), a framework that addresses this gap through three targeted contributions. In inference, query-adaptive negative routing restricts suppression to the top-K most confusable classes per query, requiring zero additional parameters. Generic negative text templates are replaced with LLM-bootstrapped contrastive prompts that describe discriminative attributes between confusable class pairs, sharpening the textual decision boundary where it matters most. A parameter-free adaptive fusion weight estimated from support-set Fisher discriminability removes the need for manual tuning of the vision-language trade-off. Evaluated across 11 standard benchmarks, SCAN consistently outperforms prior prompt-based and adapter-based methods by an average of 4.61% at 16-shot, with gains of up to 7.70% on fine-grained datasets where inter-class confusion is most severe. SCAN also generalizes strongly under distribution shift, improving by 2.95% on average across four ImageNet OOD variants, and maintains robust performance under significant label noise, with accuracy under 50% label corruption still exceeding the clean baseline of the strongest competing method.
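The idea of restricting negative suppression to the top-K classes per query can be sketched as below. This is only an illustration under stated assumptions: the function name `route_negatives`, the weight `alpha`, and the use of the highest positive scores as a proxy for "most confusable" are all hypothetical and are not taken from the paper.

```python
def route_negatives(pos_scores, neg_scores, k, alpha=1.0):
    """Subtract negative evidence only for the k highest-scoring classes.

    pos_scores[c]: similarity of the query to the positive prompt of class c.
    neg_scores[c]: similarity of the query to the negative prompt of class c.
    Classes outside the top-k keep their positive score unchanged, so no
    extra parameters are involved.
    """
    top_k = sorted(range(len(pos_scores)), key=lambda c: pos_scores[c], reverse=True)[:k]
    routed = list(pos_scores)
    for c in top_k:
        routed[c] -= alpha * neg_scores[c]
    return routed

# Hypothetical scores for one query over four classes.
routed = route_negatives([0.9, 0.85, 0.1, 0.05], [0.1, 0.4, 0.9, 0.9], k=2)
```

With uniform suppression the two low-scoring classes would also be penalized; here only the two candidates that compete for the prediction are affected.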

2505.08155 2026-05-26 cs.AI

Efficient and Scalable Neural Symbolic Search for Knowledge Graph Complex Query Answering

高效且可扩展的神经符号搜索用于知识图谱复杂查询回答

Weizhi Fei, Zihao Wang, Hang Yin, Shukai Zhao, Wei Zhang, Yangqiu Song

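AI summary: Proposes a neural-symbolic method that combines constraint strategies with local search to lower data complexity and approximately solve NP-hard cyclic queries, enabling efficient and scalable complex query answering.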

AI中文摘要

复杂查询回答（CQA）是知识图谱（KG）上的一项关键推理任务，旨在从不完整的KG中回答一阶逻辑查询。现有的神经符号方法虽然取得了强劲的性能，但面临显著的复杂度瓶颈：数据复杂度随实体数量呈二次增长，且循环查询的查询复杂度为NP难。因此，这些方法难以有效扩展到大型知识图谱和复杂查询。为解决这些限制，我们提出了一种高效且可扩展的符号搜索方法，包含两个关键组件：（1）约束策略，大幅减少变量搜索域，降低数据复杂度；（2）局部搜索算法，近似解决NP难的循环查询。在各种CQA基准上的实验表明，对于树形查询，我们的方法仅使用10%的搜索空间即可达到97%的相对MRR，并实现10倍的加速。此外，该方法在复杂循环查询和大规模KG上展现出稳健的性能，有效缓解了效率和可扩展性挑战。我们的代码见https://github.com/HKUST-KnowComp/NLISA_KDD2026。

英文摘要

Complex Query Answering (CQA) is a crucial reasoning task over Knowledge Graphs (KGs), which aims to answer first-order logical queries from incomplete KGs. While existing neural-symbolic methods achieve strong performance, they face significant complexity bottlenecks: quadratic data complexity scaling with the number of entities, and NP-hard query complexity for cyclic queries. Consequently, these approaches struggle to scale effectively to large knowledge graphs and complex queries. To address these limitations, we propose an efficient and scalable symbolic search method comprising two key components: (1) constraint strategies that drastically reduce the variable search domain, lowering data complexity; and (2) a local search algorithm that approximately solves NP-hard cyclic queries. Experiments on various CQA benchmarks demonstrate that, for tree-form queries, our method achieves 97% relative MRR with a 10$\times$ speedup using only 10% of the search space. Furthermore, it demonstrates robust performance on complex cyclic queries and large-scale KGs, effectively alleviating efficiency and scalability challenges. Our code is provided in https://github.com/HKUST-KnowComp/NLISA_KDD2026.

2505.05880 2026-05-26 cs.AI cs.LG

Combining Abstract Argumentation and Machine Learning for Efficiently Analyzing Low-Level Process Event Streams

结合抽象论证与机器学习高效分析低层过程事件流

Bettina Fazzinga, Sergio Flesca, Filippo Furfaro, Luigi Pontieri, Francesco Scala

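AI summary: Proposes a data-efficient neuro-symbolic approach in which an Abstract Argumentation Framework (AAF) refines the candidate event interpretations produced by a sequence-tagging model, addressing uncertainty in event-to-activity mapping for low-level process event streams.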

DOI: 10.1007/s40747-026-02340-1

AI中文摘要

监控和分析过程轨迹是现代公司和组织的一项关键任务。在轨迹事件与参考业务活动之间存在差距的场景中，这涉及一个解释问题，即将任何正在进行的轨迹的每个事件转换为活动实例的相应步骤。基于最近将解释问题框架化为抽象论证框架（AAF）内的接受问题的方法，可以优雅地分析可能的（可能以聚合形式）事件解释，并为那些与先验过程知识冲突的解释提供解释。由于在事件到活动映射高度不确定（或简单地说未充分指定）的环境中，这种基于推理的方法可能产生低信息量的结果和繁重的计算，因此可以考虑发现一个序列标注模型，该模型经过训练以上下文感知的方式建议高概率的候选事件解释。然而，最优地训练这样的模型可能需要使用大量手动注释的示例轨迹。因此，我们提出了一种数据高效的神经符号方法，其中由示例驱动的序列标注器返回的候选解释由基于AAF的推理器进行细化。这使我们能够利用先验知识来补偿示例数据的稀缺性，实验结果证实了这一点。

英文摘要

Monitoring and analyzing process traces is a critical task for modern companies and organizations. In scenarios where there is a gap between trace events and reference business activities, this entails an interpretation problem, amounting to translating each event of any ongoing trace into the corresponding step of the activity instance. Building on a recent approach that frames the interpretation problem as an acceptance problem within an Abstract Argumentation Framework (AAF), one can elegantly analyze plausible event interpretations (possibly in an aggregated form), as well as offer explanations for those that conflict with prior process knowledge. In settings where the event-to-activity mapping is highly uncertain (or simply under-specified), this reasoning-based approach may yield poorly informative results and heavy computation, so one can think of discovering a sequence-tagging model trained to suggest highly probable candidate event interpretations in a context-aware way. However, training such a model optimally may require a large amount of manually annotated example traces. We therefore propose a data-efficient neuro-symbolic approach to the problem, where the candidate interpretations returned by the example-driven sequence tagger are refined by the AAF-based reasoner. This allows us to also leverage prior knowledge to compensate for the scarcity of example data, as confirmed by experimental results.
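To make the notion of an acceptance problem in an abstract argumentation framework concrete, here is a minimal sketch that computes the grounded extension of a small framework. The choice of grounded semantics, the function name `grounded_extension`, and the toy arguments are assumptions made for illustration; the paper's AAF-based reasoner may rely on different semantics.

```python
def grounded_extension(arguments, attacks):
    """Return the grounded extension of the framework (arguments, attacks).

    `attacks` is a set of (attacker, target) pairs. Starting from the empty
    set, repeatedly accept every argument whose attackers are all attacked
    by the currently accepted set, until nothing changes.
    """
    attackers = {a: {x for (x, y) in attacks if y == a} for a in arguments}
    accepted = set()
    while True:
        defeated = {y for (x, y) in attacks if x in accepted}
        defended = {a for a in arguments if attackers[a] <= defeated}
        if defended == accepted:
            return accepted
        accepted = defended

# Hypothetical example: two rival interpretations of event e1 attack each
# other, and a process-knowledge constraint rules out one of them.
arguments = {"constraint", "e1=Ship", "e1=Pack"}
attacks = {
    ("constraint", "e1=Ship"),
    ("e1=Ship", "e1=Pack"),
    ("e1=Pack", "e1=Ship"),
}
accepted = grounded_extension(arguments, attacks)
```

Without the constraint, the two interpretations would only attack each other and neither would be accepted under this semantics, which mirrors the low-information outcome the abstract mentions for under-specified mappings.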

2503.01122 2026-05-26 cs.CV

ACCORD: Alleviating Concept Coupling through Dependence Regularization for Text-to-Image Diffusion Personalization

ACCORD: 通过依赖正则化缓解文本到图像扩散个性化中的概念耦合

Shizhan Liu, Hao Zheng, Hang Yu, Jianguo Li

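AI summary: Proposes two plug-and-play loss functions (Denoising Decouple Loss and Prior Decouple Loss) that directly minimize two kinds of dependence discrepancy, alleviating concept coupling and achieving a better balance between text control and personalization fidelity.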

AI中文摘要

图像个性化因其能够仅使用少量参考图像定制文本到图像生成而受到关注。然而，图像个性化的一个关键挑战是概念耦合问题，即有限的参考图像导致模型在个性化目标与其他概念之间形成不希望的关联。当前方法试图间接解决这个问题，导致文本控制与个性化保真度之间的次优平衡。本文通过统计分析直接处理概念耦合问题，揭示其源于两种不同的依赖差异来源。因此，我们提出了两种互补的即插即用损失函数：去噪解耦损失和先验解耦损失，每种损失旨在最小化一种依赖差异。大量实验表明，我们的方法在文本控制与个性化保真度之间实现了更优的权衡。

英文摘要

Image personalization has garnered attention for its ability to customize Text-to-Image generation using only a few reference images. However, a key challenge in image personalization is the issue of concept coupling, where the limited number of reference images leads the model to form unwanted associations between the personalization target and other concepts. Current methods attempt to tackle this issue indirectly, leading to a suboptimal balance between text control and personalization fidelity. In this paper, we take a direct approach to the concept coupling problem through statistical analysis, revealing that it stems from two distinct sources of dependence discrepancies. We therefore propose two complementary plug-and-play loss functions: Denoising Decouple Loss and Prior Decouple Loss, each designed to minimize one type of dependence discrepancy. Extensive experiments demonstrate that our approach achieves a superior trade-off between text control and personalization fidelity.

2502.16205 2026-05-26 cs.RO

A neural signed configuration distance function for path planning of picking manipulators

一种用于拾取机械臂路径规划的神经有符号配置距离函数

Bernhard Wullt, Mikael Norrlöf, Per Mattsson, Thomas B. Schön

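AI summary: For path planning of picking manipulators, proposes a neural signed configuration distance function (nSCDF) as an implicit obstacle representation. It forms collision-free balls in configuration space and replaces the points in a multi-query path planner with balls, quickly producing a collision-free corridor whose path is then optimized by convex programming. Experiments show paths close to those of an asymptotically optimal planner in significantly less time.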

AI中文摘要

拾取机械臂是特定任务机器人，与通用机械臂相比自由度较少，在工业中广泛使用。拾取机器人的效率高度依赖于路径规划解决方案，该方案通常基于采样的多查询方法。规划器能够稳健地解决问题，但其对碰撞检测的大量使用限制了在线使用的规划能力。我们通过提出一种新颖的隐式障碍物表示用于路径规划，即神经有符号配置距离函数（nSCDF），从而能够在配置空间中形成无碰撞球体。我们使用球体表示重新表述了一种先进的多查询路径规划器，即在图中使用球体而不是点。我们的规划器返回一个无碰撞走廊，这使我们能够使用凸规划生成优化路径。从数值实验中，我们观察到我们的规划器在显著更短的时间内生成接近渐近最优路径规划器的路径。

英文摘要

Picking manipulators are task-specific robots, with fewer degrees of freedom compared to general-purpose manipulators, and are heavily used in industry. The efficiency of picking robots is highly dependent on the path planning solution, which is commonly based on sampling-based multi-query methods. Such planners solve the problem robustly, but their heavy use of collision detection limits their planning capabilities for online use. We approach this problem by presenting a novel implicit obstacle representation for path planning, a neural signed configuration distance function (nSCDF), which allows us to form collision-free balls in the configuration space. We use the ball representation to reformulate a state-of-the-art multi-query path planner, i.e., instead of points, we use balls in the graph. Our planner returns a collision-free corridor, which allows us to use convex programming to produce optimized paths. From our numerical experiments, we observe that our planner produces paths that are close to those from an asymptotically optimal path planner, in significantly less time.
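How a signed distance in configuration space yields collision-free balls and a corridor can be sketched as follows. This is an illustration under stated assumptions: an analytic distance to circular obstacles in a 2-D configuration space stands in for the learned nSCDF, and the names `signed_distance`, `free_ball`, `balls_overlap`, and `corridor` are hypothetical.

```python
import math

def signed_distance(q, obstacles):
    """Stand-in for the learned nSCDF: distance from configuration q to the
    nearest circular obstacle (centre, radius); negative inside an obstacle."""
    return min(math.dist(q, centre) - radius for centre, radius in obstacles)

def free_ball(q, obstacles):
    """A ball centred at q whose radius is the signed distance is collision-free."""
    radius = signed_distance(q, obstacles)
    return (q, radius) if radius > 0 else None

def balls_overlap(ball_a, ball_b):
    (qa, ra), (qb, rb) = ball_a, ball_b
    return math.dist(qa, qb) < ra + rb

def corridor(balls):
    """True when consecutive balls overlap, so a path can stay inside their union."""
    return all(balls_overlap(a, b) for a, b in zip(balls, balls[1:]))

obstacles = [((0.0, 3.0), 1.0)]  # one hypothetical obstacle
balls = [free_ball(q, obstacles) for q in [(0.0, 0.0), (1.5, 0.0), (3.0, 0.0)]]
is_corridor = corridor(balls)
```

Each ball needs a single distance query instead of many point-wise collision checks, which is the kind of saving the abstract attributes to the ball representation.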

2502.11167 2026-05-26 cs.LG cs.CL

SURGE: On the Potential of Large Language Models as General-Purpose Surrogate Code Executors

SURGE: 大型语言模型作为通用代理代码执行器的潜力

Bohan Lyu, Siqiao Huang, Zichen Liang

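AI summary: Introduces the SURGE benchmark of 1160 problems covering 8 key aspects, and evaluates 21 open-source and proprietary LLMs to study their feasibility as surrogate models for code execution prediction, along with scaling laws, data efficiency, and predictive accuracy.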

Journal ref: Proceedings of The 2025 Conference on Empirical Methods in Natural Language Processing

AI中文摘要

神经代理模型是数据挖掘中强大且高效的工具。同时，大型语言模型（LLM）在代码相关任务（如生成和理解）中展示了卓越的能力。然而，一个同样重要但尚未充分探索的问题是，LLM是否可以作为代码执行预测的代理模型。为了系统研究这一问题，我们引入了SURGE，一个包含1160个问题的综合基准，覆盖8个关键方面：多语言编程任务、竞赛级编程问题、仓库级代码分析、高成本科学计算、时间复杂度密集型算法、有缺陷代码分析、依赖特定编译器或执行环境的程序，以及形式化数学证明验证。通过对21个开源和专有LLM的广泛分析，我们研究了扩展律、数据效率和预测准确性。我们的发现揭示了LLM作为计算过程高效代理的可行性的重要见解。基准和评估框架可在https://github.com/Imbernoulli/SURGE获取。

英文摘要

Neural surrogate models are powerful and efficient tools in data mining. Meanwhile, large language models (LLMs) have demonstrated remarkable capabilities in code-related tasks, such as generation and understanding. However, an equally important yet underexplored question is whether LLMs can serve as surrogate models for code execution prediction. To systematically investigate it, we introduce SURGE, a comprehensive benchmark with $1160$ problems covering $8$ key aspects: multi-language programming tasks, competition-level programming problems, repository-level code analysis, high-cost scientific computing, time-complexity-intensive algorithms, buggy code analysis, programs dependent on specific compilers or execution environments, and formal mathematical proof verification. Through extensive analysis of $21$ open-source and proprietary LLMs, we examine scaling laws, data efficiency, and predictive accuracy. Our findings reveal important insights about the feasibility of LLMs as efficient surrogates for computational processes. The benchmark and evaluation framework are available at https://github.com/Imbernoulli/SURGE.
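Scoring a model as a surrogate code executor amounts to comparing its predicted output with the output of actually running the program. The sketch below shows that comparison for Python snippets. It is not the SURGE evaluation framework: the helper names, the exact-match metric, and the sample predictions are assumptions, and `exec` should only be used on trusted code.

```python
import contextlib
import io

def run_program(source):
    """Execute a Python snippet and return what it prints (the ground truth)."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(source, {})
    return buffer.getvalue().strip()

def exact_match_accuracy(programs, predictions):
    """Fraction of programs whose predicted output equals the real output."""
    hits = sum(run_program(p) == pred.strip() for p, pred in zip(programs, predictions))
    return hits / len(programs)

programs = ["print(sum(range(5)))", "print('ab' * 2)"]
predictions = ["10", "ab"]  # hypothetical model outputs; the second is wrong
accuracy = exact_match_accuracy(programs, predictions)
```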

2502.10906 2026-05-26 cs.AI

PCGRLLM: Large Language Model-Driven Reward Design for Procedural Content Generation Reinforcement Learning

PCGRLLM：面向程序化内容生成强化学习的大语言模型驱动奖励设计

In-Chang Baek, Sung-Hyun Kim, Sam Earle, Zehua Jiang, Jin-Ha Noh, Julian Togelius, Kyung-Joong Kim

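AI summary: Proposes the PCGRLLM architecture, which uses large language models and a feedback mechanism to generate reward functions, performs story-to-reward generation in a two-dimensional environment, and reaches performance close to human level.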

Comments 14 pages, 8 figures, Accepted to Transactions on Games

DOI: 10.1109/TG.2026.3695197

AI中文摘要

奖励设计在游戏AI训练中起着关键作用，需要大量领域知识和人力。近年来，一些研究探索了使用大语言模型（LLM）生成奖励函数来训练游戏代理和控制机器人。在内容生成文献中，已有早期工作为强化学习代理生成器生成奖励函数。本文介绍了PCGRLLM，一种基于早期工作的扩展架构，采用了反馈机制和几种基于推理的提示工程技术。我们在二维环境中的故事到奖励生成任务上，使用两种最先进的LLM和各种基于推理的提示方法评估了所提出的方法。我们的实验提供了富有洞察力的评估，展示了LLM在内容生成任务中不可或缺的能力。结果表明，与之前的结构相比，性能有了显著提升，达到了与人类相当的性能。我们的工作展示了在游戏AI开发中减少人类依赖的潜力，同时支持和增强创造性过程。

英文摘要

Reward design plays a pivotal role in the training of game AIs, requiring substantial domain-specific knowledge and human effort. In recent years, several studies have explored reward generation for training game agents and controlling robots using large language models (LLMs). In the content generation literature, there has been early work on generating reward functions for reinforcement learning agent generators. This work introduces PCGRLLM, an extended architecture based on earlier work, which employs a feedback mechanism and several reasoning-based prompt engineering techniques. We evaluate the proposed method on a story-to-reward generation task in a two-dimensional environment using two state-of-the-art LLMs across various reasoning-based prompting methods. Our experiments provide insightful evaluations that demonstrate the capabilities of LLMs essential for content generation tasks. The results demonstrate a substantial performance improvement over the previous structure, achieving performance comparable to that of humans. Our work demonstrates the potential to reduce human dependency in game AI development, while supporting and enhancing creative processes.

2502.10311 2026-05-26 cs.LG cs.AI cs.HC

ExplainReduce: Generating global explanations from many local explanations

ExplainReduce: 从许多局部解释生成全局解释

Lauri Seppäläinen, Mudong Guo, Kai Puolamäki

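AI summary: Proposes ExplainReduce, which uses greedy heuristics to reduce a large set of local explanations to a small number of simple models that serve as a generative global explanation, and demonstrates its effectiveness and competitiveness.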

Comments 21 pages with a 36 page appendix, 8 + 39 figures, 1+1 tables. The datasets and source code used in the paper are available at https://github.com/edahelsinki/explainreduce. Accepted for publication in the 4th World Conference on eXplainable Artificial Intelligence (2026)

2502.01397 2026-05-26 cs.LG cs.AI cs.NA math.NA

Message-Passing GNNs Fail to Approximate Sparse Triangular Factorizations

消息传递GNN无法近似稀疏三角分解

Vladislav Trifonov, Ekaterina Muravleva, Ivan Oseledets

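AI summary: Shows through theory and experiments that message-passing graph neural networks are fundamentally limited in approximating sparse triangular factorizations, and that architectural innovations beyond message passing are needed.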

Comments Camera-ready version published in Transactions on Machine Learning Research

Journal ref: Transactions on Machine Learning Research, 2026

AI中文摘要

图神经网络（GNN）已被提议作为学习稀疏矩阵预条件子的工具，预条件子是加速线性求解器的关键组件。我们提出理论和实验证据表明，对于存在高质量预条件子但需要非局部依赖的矩阵类别，消息传递GNN从根本上无法近似稀疏三角分解。为了说明这一点，我们使用合成矩阵和SuiteSparse集合中的真实示例构建了一组基线。在包括图注意力网络和图变换器在内的多种GNN架构中，我们观察到预测因子与参考因子之间的余弦相似度较低（关键情况下≤0.7）。我们的理论和实验结果表明，需要超越消息传递的架构创新才能将GNN应用于矩阵分解等科学计算任务。此外，实验表明仅克服非局部性是不够的。需要定制的架构来捕获所需的依赖关系，因为即使是完全非局部的全局图变换器也无法匹配所提出的基线。

英文摘要

Graph Neural Networks (GNNs) have been proposed as a tool for learning sparse matrix preconditioners, which are key components in accelerating linear solvers. We present theoretical and empirical evidence that message-passing GNNs are fundamentally incapable of approximating sparse triangular factorizations for classes of matrices for which high-quality preconditioners exist but require non-local dependencies. To illustrate this, we construct a set of baselines using both synthetic matrices and real-world examples from the SuiteSparse collection. Across a range of GNN architectures, including Graph Attention Networks and Graph Transformers, we observe low cosine similarity ($\leq0.7$ in key cases) between predicted and reference factors. Our theoretical and empirical results suggest that architectural innovations beyond message-passing are necessary for applying GNNs to scientific computing tasks such as matrix factorization. Moreover, experiments demonstrate that overcoming non-locality alone is insufficient. Tailored architectures are necessary to capture the required dependencies since even a completely non-local Global Graph Transformer fails to match the proposed baselines.
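The non-local dependency mentioned above can be seen on a very small example. The sketch below is an illustration, not the paper's construction: it computes the Cholesky factor of a tridiagonal matrix (whose graph is a path) and shows that changing the first diagonal entry changes the last diagonal entry of the factor, even though the two nodes are n - 1 hops apart. The function name and the chosen matrix are assumptions.

```python
import math

def tridiagonal_cholesky_diagonal(diag, off):
    """Diagonal of the Cholesky factor L (A = L L^T) of a symmetric positive
    definite tridiagonal matrix with main diagonal `diag` and sub-diagonal `off`."""
    l_diag = [math.sqrt(diag[0])]
    for i in range(1, len(diag)):
        l_sub = off[i - 1] / l_diag[i - 1]              # L[i, i-1]
        l_diag.append(math.sqrt(diag[i] - l_sub ** 2))  # L[i, i]
    return l_diag

n = 20
base = tridiagonal_cholesky_diagonal([2.0] * n, [-1.0] * (n - 1))
# Perturb only A[0, 0]; every later diagonal entry of L shifts as well.
perturbed = tridiagonal_cholesky_diagonal([3.0] + [2.0] * (n - 1), [-1.0] * (n - 1))
shift_at_far_end = abs(perturbed[-1] - base[-1])
```

A message-passing network with fewer than n - 1 layers never receives information from node 0 when it predicts the entry at node n - 1, so it cannot reproduce this shift regardless of its weights.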

2502.01184 2026-05-26 cs.LG cs.AI physics.chem-ph q-bio.QM

FragmentNet: Adaptive Graph Fragmentation for Graph-to-Sequence Molecular Representation Learning

FragmentNet: 自适应图分片用于图到序列分子表示学习

Ankur Samanta, Rohan Gupta, Aditi Misra, Christian McIntosh Clarke, Jayakumar Rajadas

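AI summary: Proposes FragmentNet, which decomposes molecular graphs into chemically valid fragments with an adaptive learned tokenizer, preserves molecular topology with chemically aware spatial positional encodings, and applies masked pre-training at the fragment level, improving performance on multiple property prediction tasks.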

Comments 22 pages, 13 figures, 5 tables

AI中文摘要

分子表示学习方法通常将分子标记为单个原子或使用刚性、基于规则的分片分解，限制了它们捕捉有意义化学子结构上下文的能力。我们引入了FragmentNet，一种围绕新颖的自适应学习分词器构建的图到序列模型，该分词器将分子图分解为可调整粒度的化学有效片段，并辅以化学感知的空间位置编码，在生成的序列中保留分子拓扑。将自然语言处理中的掩码预训练策略扩展到分子领域，我们在化学有意义的片段级别而非单个原子级别对分子进行掩码和重建。在多个属性预测基准上的评估发现，在片段粒度上进行预训练在大多数任务上提高了下游性能，表明标记化粒度是分子表示学习的重要设计选择。

英文摘要

Molecular representation learning methods typically tokenize molecules as individual atoms or use rigid, rule-based fragment decompositions, limiting their ability to capture meaningful chemical substructure context. We introduce FragmentNet, a graph-to-sequence model built around a novel adaptive, learned tokenizer that decomposes molecular graphs into chemically valid fragments of adjustable granularity, complemented by chemically aware spatial positional encodings that preserve molecular topology in the resulting sequence. Extending masked pre-training strategies from natural language processing to the molecular domain, we mask and reconstruct molecules at the level of chemically meaningful fragments rather than individual atoms. Evaluating across multiple property prediction benchmarks, we find that pre-training at fragment granularity leads to improved downstream performance on the majority of tasks, demonstrating that tokenization granularity is an important design choice for molecular representation learning.
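Masking at fragment granularity, as opposed to masking individual atoms, can be sketched as below. This is a simplified illustration: the fragment strings, the `[MASK]` token, the mask ratio, and the function name `mask_fragments` are assumptions, and the learned tokenizer that would produce the fragments is not shown.

```python
import random

def mask_fragments(fragments, mask_ratio, rng, mask_token="[MASK]"):
    """Replace whole fragments (not single atoms) with a mask token.

    Returns the masked input sequence and a dict mapping each masked
    position to the original fragment the model should reconstruct.
    """
    n_mask = max(1, round(len(fragments) * mask_ratio))
    masked_idx = set(rng.sample(range(len(fragments)), n_mask))
    inputs = [mask_token if i in masked_idx else f for i, f in enumerate(fragments)]
    targets = {i: fragments[i] for i in sorted(masked_idx)}
    return inputs, targets

# Hypothetical fragment tokens written as SMILES-like strings.
fragments = ["c1ccccc1", "C(=O)", "O", "CC"]
inputs, targets = mask_fragments(fragments, mask_ratio=0.25, rng=random.Random(0))
```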

2501.14889 2026-05-26 cs.LG

Iterative Feature Space Optimization through Incremental Adaptive Evaluation

通过增量自适应评估的迭代特征空间优化

Yanping Wu, Yanyong Huang, Zhengzhang Chen, Zijun Yao, Yanjie Fu, Kunpeng Liu, Xiao Luo, Dongjie Wang

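AI summary: Proposes the EASE framework, which uses a feature-sample subspace generator and a contextual attention evaluator for efficient and generalizable feature space optimization, addressing evaluation bias, overfitting, and inefficiency.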

Comments 18 pages

AI中文摘要

迭代特征空间优化涉及系统评估和调整特征空间以提升下游任务性能。然而，现有工作存在三个关键局限：1）忽视数据样本间的差异导致评估偏差；2）针对特定机器学习模型定制特征空间导致过拟合和泛化能力差；3）每次优化迭代需要从头重新训练评估器，显著降低整体优化效率。为弥补这些不足，我们提出一种广义自适应特征空间评估器（EASE），以高效产生最优且泛化的特征空间。该框架包含两个关键组件：特征-样本子空间生成器和上下文注意力评估器。第一个组件旨在解耦特征空间内的信息分布以减轻评估偏差。为此，我们首先根据后续评估器的反馈，识别与预测任务最相关的特征和评估中最具挑战性的样本。这种解耦策略使评估器持续聚焦于特征空间中最具挑战性的方面。第二个组件旨在增量捕获特征空间的演化模式以实现高效评估。我们提出一种加权共享多头注意力机制，将特征空间的关键特征编码为嵌入向量用于评估。此外，评估器进行增量更新，保留先前的评估知识同时融入新见解，因为优化过程中连续的特征空间共享部分信息。在十四个真实世界数据集上的大量实验证明了所提框架的有效性。我们的代码和数据已公开。

英文摘要

Iterative feature space optimization involves systematically evaluating and adjusting the feature space to improve downstream task performance. However, existing works suffer from three key limitations: 1) overlooking differences among data samples leads to evaluation bias; 2) tailoring feature spaces to specific machine learning models results in overfitting and poor generalization; 3) requiring the evaluator to be retrained from scratch during each optimization iteration significantly reduces the overall efficiency of the optimization process. To bridge these gaps, we propose a gEneralized Adaptive feature Space Evaluator (EASE) to efficiently produce optimal and generalized feature spaces. This framework consists of two key components: Feature-Sample Subspace Generator and Contextual Attention Evaluator. The first component aims to decouple the information distribution within the feature space to mitigate evaluation bias. To achieve this, we first identify features most relevant to prediction tasks and samples most challenging for evaluation based on feedback from the subsequent evaluator. This decoupling strategy makes the evaluator consistently target the most challenging aspects of the feature space. The second component intends to incrementally capture evolving patterns of the feature space for efficient evaluation. We propose a weighted-sharing multi-head attention mechanism to encode key characteristics of the feature space into an embedding vector for evaluation. Moreover, the evaluator is updated incrementally, retaining prior evaluation knowledge while incorporating new insights, as consecutive feature spaces during the optimization process share partial information. Extensive experiments on fourteen real-world datasets demonstrate the effectiveness of the proposed framework. Our code and data are publicly available.

2412.15668 2026-05-26 cs.CV

Adaptive Hierarchical Graph Cut for Multi-granularity Out-of-distribution Detection

自适应层次图割用于多粒度分布外检测

Xiang Fang, Arvind Easwaran, Blaise Genest, Ponnuthurai Nagaratnam Suganthan

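AI summary: Proposes the Adaptive Hierarchical Graph Cut network (AHGC), which builds a hierarchical KNN graph and partitions it into subgraphs using linkage and density information to handle out-of-distribution detection under differing label granularity, reducing FPR95 by 40.47% on CIFAR-10 and 81.24% on CIFAR-100.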

Comments Published in IEEE Transactions on Artificial Intelligence

AI中文摘要

本文聚焦于一项重要且具有挑战性的任务：分布外检测（OOD检测），旨在区分并拒绝具有语义偏移的测试样本，以防止在分布内（ID）数据上训练的模型产生不可靠的预测。尽管先前的工作已取得一定成功，但它们对于现实世界中具有挑战性的应用效果不佳，因为这些方法简单地将所有未标记数据视为OOD数据，忽略了不同数据集具有不同标签粒度的情况。例如，CIFAR-10中的“猫”和Tiny-ImageNet中的“虎斑猫”具有相同语义，但由于标签粒度不同而具有不同标签。为此，本文提出了一种新颖的自适应层次图割网络（AHGC），以深入探索不同图像之间的语义关系。具体地，我们构建一个层次KNN图，基于余弦相似度评估不同图像之间的相似性。基于图的连接和密度信息，我们将图切割成多个子图以整合这些语义相似的样本。如果子图中标记样本的百分比大于阈值，我们将百分比最高的标签分配给未标记图像。为进一步提高模型泛化能力，我们将每张图像增强为两个增强版本，并最大化这两个版本之间的相似性。最后，我们利用相似度分数进行OOD检测。在两个具有挑战性的基准（CIFAR-10和CIFAR-100）上进行的大量实验表明，在典型情况下，AHGC在“FPR95”指标上分别比最先进的OOD检测方法在CIFAR-100上降低81.24%，在CIFAR-10上降低40.47%，这显示了我们的AHGC的有效性。

英文摘要

This paper focuses on a significant yet challenging task: out-of-distribution detection (OOD detection), which aims to distinguish and reject test samples with semantic shifts, so as to prevent models trained on in-distribution (ID) data from producing unreliable predictions. Although previous works have achieved decent success, they are ineffective for challenging real-world applications since these methods simply regard all unlabeled data as OOD data and ignore the case that different datasets have different label granularity. For example, "cat" on CIFAR-10 and "tabby cat" on Tiny-ImageNet share the same semantics but have different labels due to differing label granularity. To this end, in this paper, we propose a novel Adaptive Hierarchical Graph Cut network (AHGC) to deeply explore the semantic relationship between different images. Specifically, we construct a hierarchical KNN graph to evaluate the similarities between different images based on the cosine similarity. Based on the linkage and density information of the graph, we cut the graph into multiple subgraphs to integrate these semantically similar samples. If the labeled percentage in a subgraph is larger than a threshold, we assign the label with the highest percentage to the unlabeled images. To further improve the model generalization, we augment each image into two augmented versions and maximize the similarity between the two versions. Finally, we leverage the similarity score for OOD detection. Extensive experiments on two challenging benchmarks (CIFAR-10 and CIFAR-100) illustrate that in representative cases, AHGC outperforms state-of-the-art OOD detection methods by 81.24% on CIFAR-100 and by 40.47% on CIFAR-10 in terms of "FPR95", which shows the effectiveness of our AHGC.
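The graph construction and label assignment described above can be sketched in a simplified form. This is not the AHGC method itself: it builds a single-level KNN graph on cosine similarity and treats connected components as the subgraphs, whereas the paper uses a hierarchical graph cut by linkage and density. The function names, `k`, `min_sim`, and `threshold` are assumptions.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def knn_components(features, k, min_sim):
    """Link each sample to its k most similar neighbours (cosine >= min_sim)
    and return the connected components of the resulting graph."""
    n = len(features)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(n):
        sims = sorted(((cosine(features[i], features[j]), j) for j in range(n) if j != i),
                      reverse=True)
        for sim, j in sims[:k]:
            if sim >= min_sim:
                parent[find(i)] = find(j)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

def propagate_labels(groups, labels, threshold):
    """Give unlabeled samples (None) the majority label of their subgraph when
    the labeled fraction of that subgraph exceeds `threshold`."""
    out = list(labels)
    for group in groups:
        known = [labels[i] for i in group if labels[i] is not None]
        if known and len(known) / len(group) > threshold:
            majority = max(set(known), key=known.count)
            for i in group:
                if out[i] is None:
                    out[i] = majority
    return out

# Hypothetical 2-D features: three samples near one direction, two near another.
features = [(1.0, 0.0), (0.9, 0.1), (0.95, 0.05), (0.0, 1.0), (0.1, 0.9)]
labels = ["cat", "cat", None, None, None]
groups = knn_components(features, k=1, min_sim=0.9)
assigned = propagate_labels(groups, labels, threshold=0.5)
```

Samples that remain `None` after propagation belong to subgraphs with too few labeled members and are the natural OOD candidates in this simplified view.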

2409.20473 2026-05-26 cs.RO

Data-Driven Optimization of Tactile Sensor Configurations for Efficient Dexterous Manipulation

数据驱动的触觉传感器配置优化以实现高效灵巧操作

Haoran Guo, Haoyang Wang, Zhengxiong Li, He Bai, Lingfeng Tao

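AI summary: Proposes a two-stage framework that quantifies the contribution of tactile sensors to deep reinforcement learning policies, reducing the Shadow Hand sensor count from 92 to 14 while retaining over 90% of performance, and finds that middle-finger sensors contribute negatively.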

Comments This work has been submitted to the ICRA for possible publication

AI中文摘要

触觉感知对于基于学习的灵巧操作至关重要，但传感器放置的原则性指导仍然缺乏。虽然密集传感器阵列提供丰富的接触反馈，但它们带来显著的硬件成本，甚至可能通过引入冗余或冲突输入而降低策略性能。本文提出了第一个系统框架，用于量化单个触觉传感器对深度强化学习（DRL）策略性能的贡献。我们提出了一种两阶段方法：粗粒度经验剪枝阶段将Shadow Hand上的传感器数量从92个减少到21个，同时保留93%的任务性能；随后是细粒度主动学习阶段，结合高斯过程回归（GPR）与Lasso回归对每个剩余传感器的功能重要性进行排序。我们的分析揭示，拇指、无名指和小指上的传感器主导操作性能，而中指传感器表现出负贡献——主动降低策略学习。跨三个操作任务（方块、鸡蛋和笔）的消融研究证实，14个传感器的配置保留了全阵列90%以上的性能。在两个新物体上的零样本迁移实验以及在Allegro和Leap Hand上的跨平台验证进一步表明，识别出的重要性排序在任务和机器人形态之间具有泛化性。这些发现建立了量化部署指南，使从业者能够选择具有可预测性能权衡的成本效益传感器配置。

英文摘要

Tactile sensing is critical for learning-based dexterous manipulation, yet principled guidelines for sensor placement remain largely absent. While dense sensor arrays provide rich contact feedback, they impose significant hardware costs and can even degrade policy performance by introducing redundant or conflicting inputs. This paper presents the first systematic framework for quantifying the contribution of individual tactile sensors to deep reinforcement learning (DRL) policy performance. We propose a two-stage approach: a coarse empirical pruning phase that reduces the sensor count on the Shadow Hand from 92 to 21 while retaining 93% task performance, followed by a fine-grained active learning phase that combines Gaussian Process Regression (GPR) with Lasso regression to rank the functional importance of each remaining sensor. Our analysis reveals that sensors on the thumb, ring finger, and little finger dominate manipulation performance, while middle-finger sensors exhibit negative contributions, actively degrading policy learning. Ablation studies across three manipulation tasks (block, egg, and pen) confirm that a 14-sensor configuration preserves over 90% of the full-array performance. Zero-shot transfer experiments on two novel objects and cross-platform validation on the Allegro and Leap Hand further demonstrate that the identified importance rankings generalize across tasks and robot morphologies. These findings establish quantitative deployment guidelines that enable practitioners to select cost-effective sensor configurations with predictable performance trade-offs.

2409.17608 2026-05-26 cs.CV

Appearance Blur-driven AutoEncoder and Motion-guided Memory Module for Video Anomaly Detection

外观模糊驱动的自编码器和运动引导的记忆模块用于视频异常检测

Jiahao Lyu, Minghua Zhao, Jing Hu, Xuewen Huang, Shuangli Du, Cheng Shi, Zhiyong Lv

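AI summary: Proposes a zero-shot cross-dataset video anomaly detection method based on appearance blur and a motion-guided memory module, which constructs global pseudo-anomalies and uses motion memory items to widen the gap between normal and abnormal motion.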

Comments 13 pages, 11 figures

DOI: 10.1016/j.knosys.2025.115218
Journal ref: Knowledge-Based Systems 2026

AI中文摘要

视频异常检测（VAD）通常学习正常样本的分布并通过测量显著偏差来检测异常，但不期望的泛化可能会重构一些异常从而抑制偏差。同时，大多数VAD无法应对新目标域的跨数据集验证，而少样本方法必须费力地依赖目标域的模型调优来完成域适应。为解决这些问题，我们提出一种新颖的VAD方法，带有运动引导记忆模块，实现零样本跨数据集验证。首先，我们对原始外观图像添加高斯模糊，从而构建全局伪异常，作为网络输入。然后，我们提出多尺度残差通道注意力来去模糊正常样本中的伪异常。接下来，通过记录训练阶段的运动特征获得记忆项，用于在测试阶段从原始信息中检索运动特征。最后，我们的方法可以通过注意力忽略模糊的真实异常，并依赖运动记忆项来增加正常与异常运动之间的正常性差距。在三个基准数据集上的大量实验证明了所提方法的有效性。与跨域方法相比，我们的方法在测试时无需适应即可实现有竞争力的性能。

英文摘要

Video anomaly detection (VAD) often learns the distribution of normal samples and detects anomalies by measuring significant deviations, but undesired generalization may reconstruct some anomalies and thus suppress the deviations. Meanwhile, most VAD methods cannot cope with cross-dataset validation for new target domains, and few-shot methods must laboriously rely on model tuning in the target domain to complete domain adaptation. To address these problems, we propose a novel VAD method with a motion-guided memory module to achieve zero-shot cross-dataset validation. First, we add Gaussian blur to the raw appearance images, thereby constructing the global pseudo-anomaly, which serves as the input to the network. Then, we propose multi-scale residual channel attention to deblur the pseudo-anomaly in normal samples. Next, memory items are obtained by recording the motion features in the training phase, which are used to retrieve the motion features from the raw information in the testing phase. Lastly, our method can ignore the blurred real anomaly through attention and rely on motion memory items to increase the normality gap between normal and abnormal motion. Extensive experiments on three benchmark datasets demonstrate the effectiveness of the proposed method. Compared with cross-domain methods, our method achieves competitive performance without adaptation during testing.

2409.09953 2026-05-26 cs.CV

Uncertainty-Guided Appearance-Motion Association Network for Out-of-Distribution Action Detection

不确定性引导的外观-运动关联网络用于分布外动作检测

Xiang Fang, Arvind Easwaran, Blaise Genest

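AI summary: For out-of-distribution action detection, proposes the Uncertainty-Guided Appearance-Motion Association Network (UAAN), which fuses appearance and motion features and reasons about spatial-temporal inter-object interaction, significantly outperforming existing methods.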

Comments Accepted by MIPR 2024

AI中文摘要

分布外（OOD）检测旨在检测并拒绝具有语义偏移的测试样本，以防止在分布内（ID）数据集上训练的模型产生不可靠的预测。现有工作仅在图像数据集上提取外观特征，无法处理包含大量运动信息的动态多媒体场景。因此，我们针对一个更现实且更具挑战性的OOD检测任务：OOD动作检测（ODAD）。给定一个未裁剪的视频，ODAD首先对ID动作进行分类并识别OOD动作，然后定位ID和OOD动作。为此，本文提出了一种新颖的不确定性引导的外观-运动关联网络（UAAN），该网络同时探索外观特征和运动上下文，以推理用于ODAD的时空物体间交互。首先，我们设计独立的外观和运动分支，以提取相应的面向外观和面向运动的物体表示。在每个分支中，我们构建一个时空图来推理外观引导和运动驱动的物体间交互。然后，我们设计一个外观-运动注意力模块，融合外观和运动特征以进行最终的动作检测。在两个具有挑战性的数据集上的实验结果表明，UAAN显著优于最先进的方法，证明了其有效性。

英文摘要

Out-of-distribution (OOD) detection aims to detect and reject test samples with semantic shifts, to prevent models trained on in-distribution (ID) datasets from producing unreliable predictions. Existing works only extract appearance features on image datasets and cannot handle dynamic multimedia scenarios with rich motion information. Therefore, we target a more realistic and challenging OOD detection task: OOD action detection (ODAD). Given an untrimmed video, ODAD first classifies the ID actions and recognizes the OOD actions, and then localizes ID and OOD actions. To this end, in this paper, we propose a novel Uncertainty-Guided Appearance-Motion Association Network (UAAN), which explores both appearance features and motion contexts to reason about spatial-temporal inter-object interaction for ODAD. First, we design separate appearance and motion branches to extract corresponding appearance-oriented and motion-oriented object representations. In each branch, we construct a spatial-temporal graph to reason about appearance-guided and motion-driven inter-object interaction. Then, we design an appearance-motion attention module to fuse the appearance and motion features for final action detection. Experimental results on two challenging datasets show that UAAN beats state-of-the-art methods by a significant margin, illustrating its effectiveness.
