arXivDaily: arXiv daily academic digest, updated Monday through Friday
2508.01253 2026-05-27 cs.CV

ODOV: Benchmark the Open-Domain Open-Vocabulary Object Detection

Yupeng Zhang, Ruize Han, Fangnan Zhou, Wei Feng, Liang Wan

AI summary: To address domain shift and category shift occurring simultaneously in real-world scenes, the paper proposes the open-domain open-vocabulary object detection task, builds the OD-LVIS benchmark dataset, and designs a VLM-based baseline that improves detection through a domain-agnostic category prompt and a domain projection and grafting module.

Abstract

Existing studies typically investigate domain shift and category shift as independent problems. In real-world scenarios, however, the two types of shift often occur simultaneously and interact, leading to significant degradation in detection performance. To address this, we propose and systematically study a novel problem, Open-Domain Open-Vocabulary (ODOV) object detection, which aims to evaluate a model's ability to adapt to compound domain and category shifts in real-world environments. We construct a new benchmark, OD-LVIS, which contains 46,949 images spanning 15 diverse real-world scenarios and 1,203 categories, for assessing object detection performance. Furthermore, we propose a novel ODOV detection baseline that fully leverages the powerful multi-modal alignment capabilities of VLMs and introduces two key mechanisms to enhance both category and domain generalization. One is the Domain-Agnostic Category Prompt (DAPmt), which strengthens category semantics while attenuating domain representations, enabling pure category representation. The other is the Domain Projection and Grafting (DP&G) module, which incorporates domain-specific features from input images, allowing the model to dynamically generalize across diverse open domains. These two components enable the model to maintain effective detection performance under simultaneous category and domain variations in real-world scenarios. We provide extensive benchmark evaluations for the proposed ODOV detection task and report experimental results. These results validate the soundness of the ODOV task, the practicality of the OD-LVIS dataset, and the superiority of the method.

2507.13762 2026-05-27 cs.LG q-bio.BM

MolPIF: A Parameter Interpolation Flow Model for Molecule Generation

Yaowei Jin, Junjie Wang, Yufan Tang, Wenkai Xiang, Duanhua Cao, Dan Teng, Zhehuan Fan, Jiacheng Xiong, Xia Sheng, Chuanlong Zeng, Duo An, Mingyue Zheng, Shuangjia Zheng, Qian Shi

AI summary: Proposes MolPIF, a parameter interpolation flow model that unifies the generation of continuous coordinates and discrete atom types by interpolating distributions in parameter space, and outperforms baseline methods on the CrossDocked2020 dataset.

Comments: Accepted to Bioinformatics

Abstract

Motivation: Structure-based drug design (SBDD) has advanced with deep generative models, but bridging the gap between continuous atomic coordinates and discrete atom types remains a challenge. Current approaches, such as diffusion and flow matching models, often fail to unify these heterogeneous modalities, relying on separate strategies or ill-fitting Euclidean metrics for discrete variables. This lack of a consistent framework limits generative models' ability to capture the geometric and chemical structure of protein-ligand complexes. Results: We present MolPIF, a parameter interpolation flow mechanism designed to unify the generation of continuous and discrete molecular variables. Unlike traditional flow models that operate in sample space, MolPIF interpolates between distributions in the parameter space, theoretically recovering Wasserstein-2 optimal transport for continuous coordinates and establishing Fisher-Rao geodesics for discrete atom types. We further incorporate a geometry-enhanced learning strategy to improve the capture of atomic contexts. Extensive evaluations on the CrossDocked2020 dataset demonstrate that MolPIF outperforms baselines in binding affinity, chemical validity, geometric fidelity and chemical space coverage. Additionally, MolPIF exhibits versatility in lead optimization and offers flexible prior distribution selection (such as Laplace), establishing a robust paradigm for SBDD. Availability: Source code is freely available at https://github.com/BLEACH366/MolPIF. Supplementary information: Supplementary data are available at Bioinformatics.
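The two geodesics named in the abstract can be illustrated on toy distributions. The sketch below is not the MolPIF implementation; it only shows, under the assumption of 1-D Gaussians and small categorical vectors, what interpolating in parameter space looks like. The function names are hypothetical. Linear interpolation of a Gaussian's mean and standard deviation traces the Wasserstein-2 geodesic, and the Fisher-Rao geodesic between categorical distributions is a great-circle arc after the square-root map.

```python
import math

def interpolate_gaussian(mu0, sigma0, mu1, sigma1, t):
    """Linear interpolation of (mean, std): the W2 geodesic between 1-D Gaussians."""
    return (1 - t) * mu0 + t * mu1, (1 - t) * sigma0 + t * sigma1

def interpolate_categorical(p, q, t):
    """Fisher-Rao geodesic between categorical distributions p and q.

    The square-root map sends the simplex onto the unit sphere, where the
    geodesic is a great-circle arc (spherical linear interpolation).
    """
    sp = [math.sqrt(x) for x in p]
    sq = [math.sqrt(x) for x in q]
    cos_omega = min(1.0, sum(a * b for a, b in zip(sp, sq)))
    omega = math.acos(cos_omega)
    if omega < 1e-12:  # p and q are (numerically) identical
        return list(p)
    w0 = math.sin((1 - t) * omega) / math.sin(omega)
    w1 = math.sin(t * omega) / math.sin(omega)
    return [(w0 * a + w1 * b) ** 2 for a, b in zip(sp, sq)]

# Toy example: halfway between two one-hot "atom type" distributions.
midpoint = interpolate_categorical([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], 0.5)
mean_mid, std_mid = interpolate_gaussian(0.0, 1.0, 2.0, 3.0, 0.5)
```

Every intermediate point stays a valid probability vector, which is one reason to interpolate parameters rather than samples for discrete variables.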

2507.20758 2026-05-27 cs.AI

How Chain-of-Thought Works? Tracing Information Flow from Decoding, Projection, and Activation

Hao Yang, Qinghua Zhao, Lei Li, Lingyi Meng, Mengda Yu

AI summary: By tracing information flow backward through the decoding, projection, and activation stages, the paper shows that chain-of-thought acts as a decoding-space pruner and finds that it modulates neuron activation in a task-dependent way.

Comments: Accepted by ACL 2026

Abstract

Chain-of-Thought (CoT) prompting significantly enhances model reasoning, yet its internal mechanisms remain poorly understood. We analyze CoT's operational principles by reversely tracing information flow across decoding, projection, and activation phases. Our quantitative analysis suggests that CoT may serve as a decoding space pruner, leveraging answer templates to guide output generation, with higher template adherence strongly correlating with improved performance. Furthermore, we surprisingly find that CoT modulates neuron engagement in a task-dependent manner: reducing neuron activation in open-domain tasks, yet increasing it in closed-domain scenarios. These findings offer a novel mechanistic interpretability framework and critical insights for enabling targeted CoT interventions to design more efficient and robust prompts. We released our code and data at https://anonymous.4open.science/r/cot-D247.

2507.16116 2026-05-27 cs.CV

Pusa V1.0: Unlocking Temporal Control in Pretrained Video Diffusion Models via Vectorized Timestep Adaptation

Yaofang Liu, Yumeng Ren, Aitor Artola, Yuxuan Hu, Xiaodong Cun, Xiaotong Zhao, Alan Zhao, Raymond H. Chan, Suiyun Zhang, Rui Liu, Dandan Tu, Jean-Michel Morel

AI summary: Proposes vectorized timestep adaptation (VTA) for fine-grained temporal control within a unified video diffusion framework, handling image-to-video generation, start-end frame control, and other tasks zero-shot without damaging the base model's capabilities.

Comments: Code is open-sourced at https://github.com/Yaofang-Liu/Pusa-VidGen

Abstract

The rapid advancement of video diffusion models has been hindered by fundamental limitations in temporal modeling, particularly the rigid synchronization of frame evolution imposed by conventional scalar timestep variables. While task-specific adaptations and autoregressive models have sought to address these challenges, they remain constrained by computational inefficiency, catastrophic forgetting, or narrow applicability. In this work, we present Pusa V1.0, a versatile model that leverages vectorized timestep adaptation (VTA) to enable fine-grained temporal control within a unified video diffusion framework. Note that VTA is a non-destructive adaptation, which means that it fully preserves the capabilities of the base model. Unlike conventional methods such as Wan-I2V, which finetune a base text-to-video (T2V) model with abundant resources to do image-to-video (I2V), we achieve comparable results in a zero-shot manner after an ultra-efficient finetuning process based on VTA. Moreover, this method also unlocks many other zero-shot capabilities simultaneously, such as start-end frames and video extension, all without task-specific training. Meanwhile, it keeps the T2V capability from the base model. Mechanistic analyses also reveal that our approach preserves the foundation model's generative priors while surgically injecting temporal dynamics, avoiding the combinatorial explosion inherent to the vectorized timestep. This work establishes a scalable, efficient, and versatile paradigm for next-generation video synthesis, democratizing high-fidelity video generation for research and industry alike.
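The difference between a scalar timestep and a vectorized one can be shown on toy data. This is a rough sketch and not the Pusa code: it assumes a linear, flow-matching style noising path and a hypothetical helper `noise_frames`, and it assumes that image-to-video conditioning corresponds to holding the first frame at timestep 0. The paper's actual schedule and conditioning may differ.

```python
import random

def noise_frames(frames, timesteps, rng):
    """Noise each frame with its own timestep t in [0, 1].

    Assumes the linear path x_t = (1 - t) * x0 + t * eps with eps ~ N(0, 1);
    this path is an illustrative assumption, not taken from the paper.
    """
    noisy = []
    for frame, t in zip(frames, timesteps):
        noisy.append([(1 - t) * x + t * rng.gauss(0.0, 1.0) for x in frame])
    return noisy

rng = random.Random(0)
frames = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # 3 toy "frames" of 2 pixels each

# Conventional scalar timestep: every frame shares the same noise level.
scalar = noise_frames(frames, [0.8, 0.8, 0.8], rng)

# Vectorized timestep: the first frame stays clean and acts as the condition,
# while the remaining frames are noised and would be denoised by the model.
i2v = noise_frames(frames, [0.0, 0.8, 0.8], rng)
```

With one timestep per frame, tasks such as start-end frame control reduce to different choices of the timestep vector rather than different models.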

2507.06513 2026-05-27 cs.CV

What Demands Attention in Urban Street Scenes? From Scene Understanding towards Road Safety: A Survey of Vision-driven Datasets and Studies

Yaoqi Huang, Julie Stephany Berrio, Mao Shan, Stewart Worrall

AI summary: By systematically categorizing the critical elements that demand attention in traffic scenes, the survey analyzes 35 vision-driven tasks and 73 datasets and proposes a unified analytical framework intended to advance road safety research.

Comments: 40 tasks, 78 datasets

Abstract

Advances in vision-based sensors and computer vision algorithms have significantly improved the analysis and understanding of traffic scenarios. To facilitate the use of these improvements for road safety, this survey systematically categorizes the critical elements that demand attention in traffic scenarios and comprehensively analyzes available vision-driven tasks and datasets. Compared to existing surveys that focus on isolated domains, our taxonomy categorizes attention-worthy traffic entities into two main groups, anomalies and normal but critical entities, which together comprise ten categories and twenty subclasses. It establishes connections between inherently related fields and provides a unified analytical framework. Our survey analyzes 35 vision-driven tasks and comprehensively examines and visualizes 73 available datasets based on the proposed taxonomy. The cross-domain investigation covers the pros and cons of each benchmark with the aim of providing information on standards unification and resource optimization. Our article concludes with a systematic discussion of the existing weaknesses, underlining the potential effects and promising solutions from various perspectives. The integrated taxonomy, comprehensive analysis, and recapitulatory tables serve as valuable contributions to this rapidly evolving field by providing researchers with a holistic overview, guiding strategic resource selection, and highlighting critical research gaps.

2507.05757 2026-05-27 cs.CV

Normal Patch Retinex Robust Algorithm for White Balancing in Digital Microscopy

Radoslaw Roszczyk, Artur Krupa, Izabella Antoniuk

AI summary: Proposes a fully automatic white balance algorithm based on Normal Patch Retinex for correcting colour images from digital microscopy; experiments show that it outperforms classical algorithms.

Journal ref: Vol. 29 No. 1/4 (2020)

Abstract

The acquisition of accurately coloured, balanced images in an optical microscope can be a challenge even for experienced microscope operators. This article presents an entirely automatic mechanism for balancing the white level that allows the correction of the microscopic colour images adequately. The results of the algorithm have been confirmed experimentally on a set of two hundred microscopic images. The images contained scans of three microscopic specimens commonly used in pathomorphology. Also, the results achieved were compared with other commonly used white balance algorithms in digital photography. The algorithm applied in this work is more effective than the classical algorithms used in colour photography for microscopic images stained with hematoxylin-phloxine-saffron and for immunohistochemical staining images.

2506.23149 2026-05-27 cs.CL

AlignEvoSkill: Towards Knowledge-Aware and Task-Aligned Agent Skill Evolution

Dingzirui Wang, Xuanliang Zhang, Keyan Xu, Qingfu Zhu, Wanxiang Che, Yang Deng

AI summary: Proposes the AlignEvoSkill framework, which jointly models knowledge coverage and task alignment: it identifies knowledge tags from failed trajectories, retrieves and adapts candidate skills, and then selects high-quality skills using knowledge-coverage and task-alignment scores. It reports a 34.7% relative gain across 3 benchmarks and 4 LLM backbones, setting a new SOTA in skill evolution at lower cost.

Abstract

Reusable skills play a key role in improving LLM-based agents, but existing skill-evolution methods often fail to ensure that evolved skills both cover the knowledge required by the task and remain aligned with the target task. As a result, evolved skills can be incomplete or irrelevant. To address this limitation, we propose AlignEvoSkill, a skill-evolution framework that jointly models knowledge coverage and task alignment. Given failed task trajectories, AlignEvoSkill first identifies task-relevant knowledge tags, retrieves complementary prior skills, and adapts them into candidate skills that address missing knowledge. It then selects high-quality candidates using a joint filtering criterion based on knowledge-coverage and task-alignment scores. Experiments on 3 benchmarks with 4 LLM backbones show that AlignEvoSkill achieves a 34.7% relative gain over the non-evolution baseline and sets a new SOTA in skill evolution at lower cost.

2506.21443 2026-05-27 cs.CL cs.AI

Domain Knowledge-Enhanced LLMs for Fraud and Concept Drift Detection

Ali Şenol, Garima Agrawal, Huan Liu

AI summary: Proposes a domain knowledge-enhanced LLM framework that integrates structured domain knowledge with a drift detection unit to achieve high-accuracy detection of fraudulent conversations and classification of concept drift.

Abstract

Detecting deceptive conversations on dynamic platforms is increasingly difficult due to evolving language patterns and Concept Drift (CD), i.e., semantic or topical shifts that alter the context or intent of interactions over time. These shifts can obscure malicious intent or mimic normal dialogue, making accurate classification challenging. While Large Language Models (LLMs) show strong performance in natural language tasks, they often struggle with contextual ambiguity and hallucinations in risk-sensitive scenarios. To address these challenges, we present a Domain Knowledge (DK)-Enhanced LLM framework that integrates pretrained LLMs with structured, task-specific insights to perform fraud and concept drift detection. The proposed architecture consists of three main components: (1) a DK-LLM module to detect fake or deceptive conversations; (2) a drift detection unit (OCDD) to determine whether a semantic shift has occurred; and (3) a second DK-LLM module to classify the drift as either benign or fraudulent. We first validate the value of domain knowledge using a fake review dataset and then apply our full framework to SEConvo, a multi-turn dialogue dataset that includes various types of fraud and spam attacks. Results show that our system detects fake conversations with high accuracy and effectively classifies the nature of drift. Guided by structured prompts, the LLaMA-based implementation achieves 98% classification accuracy. Comparative studies against zero-shot baselines demonstrate that incorporating domain knowledge and drift awareness significantly improves performance, interpretability, and robustness in high-stakes NLP applications.

2506.17633 2026-05-27 cs.CV cs.AI

Adaptive Multi-prompt Contrastive Network for Few-shot Out-of-distribution Detection

Xiang Fang, Arvind Easwaran, Blaise Genest

AI summary: For few-shot out-of-distribution detection, proposes the Adaptive Multi-prompt Contrastive Network (AMCN), which uses CLIP to learn textual prompts together with inter-class and intra-class distributions so that the ID-OOD separation boundary adapts to each class.

Comments: Published in ICML 2025

Abstract

Out-of-distribution (OOD) detection attempts to distinguish outlier samples to prevent models trained on the in-distribution (ID) dataset from producing unavailable outputs. Most OOD detection methods require many IID samples for training, which seriously limits their real-world applications. To this end, we target a challenging setting: few-shot OOD detection, where only a few labeled ID samples are available. Therefore, few-shot OOD detection is much more challenging than the traditional OOD detection setting. Previous few-shot OOD detection works ignore the distinct diversity between different classes. In this paper, we propose a novel network: Adaptive Multi-prompt Contrastive Network (AMCN), which adapts the ID-OOD separation boundary by learning inter- and intra-class distribution. To compensate for the absence of OOD and scarcity of ID image samples, we leverage CLIP, connecting text with images, engineering learnable ID and OOD textual prompts. Specifically, we first generate adaptive prompts (learnable ID prompts, label-fixed OOD prompts and label-adaptive OOD prompts). Then, we generate an adaptive class boundary for each class by introducing a class-wise threshold. Finally, we propose a prompt-guided ID-OOD separation module to control the margin between ID and OOD prompts. Experimental results show that AMCN outperforms other state-of-the-art works.

2506.11253 2026-05-27 cs.CV cs.LG

Lifting Data-Tracing Machine Unlearning to Knowledge-Tracing for Foundation Models

Yuwen Tan, Boqing Gong

AI summary: Proposes lifting data-tracing machine unlearning to knowledge-tracing for foundation models, in order to handle diverse unlearning requests and align more closely with how humans forget, and illustrates the paradigm with a vision-language model case study.

Comments: Accepted to TMLR

Abstract

Machine unlearning removes certain training data points and their influence from AI models (e.g., when a data owner revokes their consent to allow models to learn from the data). In this position paper, we propose to lift data-tracing machine unlearning to knowledge-tracing for foundation models (FMs). We support this position based on practical needs and insights from cognitive studies. Practically, tracing data cannot meet the diverse unlearning requests for FMs, which may be from regulators, enterprise users, product teams, etc., who have no access to FMs' massive training data. Instead, it is convenient for these parties to issue an unlearning request about the knowledge or capability FMs (should not) possess. Cognitively, knowledge-tracing unlearning aligns with how the human brain forgets more closely than tracing individual training data points does. We further discuss the nontrivial challenges in the knowledge-tracing machine unlearning paradigm. Finally, we provide a concrete case study about a vision-language FM to illustrate how an unlearner might instantiate the knowledge-tracing machine unlearning paradigm. Code is available at: https://1yuwen.github.io/Knowledge-Tracing-MU-Page.

2506.10225 2026-05-27 cs.SD cs.AI eess.AS

Genre Controlled Music Generation via Activation Steering

Swathi Narashiman, Pranay Mathur, Dipanshu Panda, Jayden Koshy Joe, Harshith M R, Anish Veerakumar, Aniruddh Krishna, Keerthiharan A

AI summary: Proposes an inference-time intervention on the autoregressive generative model MusicGen that steers the residual stream with linear probe weights to achieve fine-grained genre control.

Abstract

Computational music generation is evolving towards non-conventional styles, demanding methods that enable precise and controllable blending of diverse music elements. In this work, we present a method for fine-grained control using inference-time interventions on an autoregressive generative transformer, MusicGen. Through our approach, we achieve genre control by steering the residual stream using the weights of a linear probe trained on it. By framing activation steering as a human-controllable interaction, our work highlights how interpretable model behaviors can empower co-creative music generation. Audio samples demonstrating our method are available on our demo page.
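The core operation of activation steering with a linear probe can be sketched in a few lines. This is a generic illustration, not the authors' code: the vectors, the scaling factor `alpha`, and the function names are hypothetical, and a real intervention would be applied to MusicGen's residual stream at a chosen layer during generation.

```python
import math

def steer(hidden, probe_weight, alpha):
    """Shift a residual-stream vector along the unit-normalized probe direction."""
    norm = math.sqrt(sum(w * w for w in probe_weight))
    return [h + alpha * w / norm for h, w in zip(hidden, probe_weight)]

def probe_logit(hidden, probe_weight, bias=0.0):
    """Score of the linear probe for one genre."""
    return sum(h * w for h, w in zip(hidden, probe_weight)) + bias

hidden = [0.5, -1.0, 2.0]   # toy residual-stream activation
w_genre = [3.0, 0.0, 4.0]   # hypothetical probe weights for one genre
steered = steer(hidden, w_genre, alpha=2.0)

# The probe score rises by alpha * ||w||, so alpha acts as a control knob.
gain = probe_logit(steered, w_genre) - probe_logit(hidden, w_genre)
```

Because the shift is linear in `alpha`, a user can dial the strength of the genre up or down, or blend several probe directions by adding them.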

2506.07813 2026-05-27 cs.CV cs.AI

Self-Cascaded Diffusion Models for Arbitrary-Scale Image Super-Resolution

Junseo Bang, Joonhee Lee, Kyeonghyun Lee, Haechang Lee, Dong Un Kang, Se Young Chun

AI summary: Proposes CasArbi, a self-cascaded diffusion framework that decomposes an arbitrary scaling factor into consecutive small steps, raising resolution progressively while maintaining scale consistency, and outperforms existing methods on perceptual and distortion metrics.

Abstract

Arbitrary-scale image super-resolution aims to upsample images to any desired resolution, offering greater flexibility than traditional fixed-scale super-resolution. Recent approaches based on regression-based or generative models have shown promising results but often suffer from scale inconsistency due to their single-stage formulation, which must handle a wide range of scaling factors simultaneously. To address this, we propose CasArbi, a self-cascaded diffusion framework for arbitrary-scale image super-resolution. CasArbi decomposes varying scaling factors into smaller sequential steps, progressively enhancing the image resolution at each step with seamless transitions for arbitrary scales. CasArbi leverages a coordinate-conditioned diffusion model for learning continuous image representations and adopts self-consistency guidance to generate scale-consistent details at inference time. Extensive experiments show that CasArbi outperforms existing methods in both perceptual and distortion metrics and demonstrates superior scale consistency across diverse arbitrary-scale super-resolution benchmarks. Our code is available at https://github.com/junseo88/CasArbi.
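The idea of breaking one large scaling factor into smaller sequential steps can be made concrete with a small helper. The abstract does not state how CasArbi chooses its steps, so the rule below (equal geometric steps, each capped at a hypothetical `max_step`) is only an assumption for illustration.

```python
import math

def decompose_scale(scale, max_step=2.0):
    """Split a total upscaling factor into equal per-stage factors, each <= max_step."""
    if scale <= 1.0:
        return [scale]
    # Smallest number of stages such that max_step ** n_steps >= scale.
    n_steps = max(1, math.ceil(math.log(scale) / math.log(max_step) - 1e-12))
    step = scale ** (1.0 / n_steps)
    return [step] * n_steps

steps = decompose_scale(6.0)      # three stages of about 1.82x each
total = math.prod(steps)          # multiplies back to the requested factor
```

Each stage then only has to handle a narrow range of scaling factors, which is the motivation the abstract gives for cascading.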

2502.06963 2026-05-27 cs.LG cs.AI cs.DC cs.MA

Intelligent Offloading in Vehicular Edge Computing: A Comprehensive Review of Deep Reinforcement Learning Approaches and Architectures

Ashab Uddin, Ahmed Hamdi Sakr, Ning Zhang

AI summary: Surveys deep reinforcement learning-based offloading methods for vehicular edge computing, classifying and comparing learning paradigms, system architectures, and optimization objectives, and analyzing how Markov decision processes are applied along with future research directions.

Comments: 33 Pages, 6 Figures, 7 Tables. Machine Learning, Reinforcement Learning, Multi Agent Reinforcement Learning, Computational Offloading and Edge Computing

Abstract

The increasing complexity of Intelligent Transportation Systems (ITS) has led to significant interest in computational offloading to external infrastructures such as edge servers, vehicular nodes, and UAVs. These dynamic and heterogeneous environments pose challenges for traditional offloading strategies, prompting the exploration of Reinforcement Learning (RL) and Deep Reinforcement Learning (DRL) as adaptive decision-making frameworks. This survey presents a comprehensive review of recent advances in DRL-based offloading for vehicular edge computing (VEC). We classify and compare existing works based on learning paradigms (e.g., single-agent, multi-agent), system architectures (e.g., centralized, distributed, hierarchical), and optimization objectives (e.g., latency, energy, fairness). Furthermore, we analyze how Markov Decision Process (MDP) formulations are applied and highlight emerging trends in reward design, coordination mechanisms, and scalability. Finally, we identify open challenges and outline future research directions to guide the development of robust and intelligent offloading strategies for next-generation ITS.

2506.03627 2026-05-27 cs.CL cs.AI

Robustness of Prompting: Enhancing Robustness of Large Language Models Against Prompting Attacks

Lin Mu, Guowei Chu, Li Ni, Lei Sang, Yiwen Zhang

AI summary: Proposes RoP (Robustness of Prompting), a two-stage strategy of error correction and guidance that strengthens LLM robustness to input perturbations and significantly improves performance on arithmetic, commonsense, and logical reasoning tasks.

Comments: Accepted by IEEE Transactions on Artificial Intelligence

Abstract

Large Language Models (LLMs) have demonstrated remarkable performance across various tasks by effectively utilizing a prompting strategy. However, they are highly sensitive to input perturbations, such as typographical errors or slight character order errors, which can significantly impair their performance. Despite advances in prompting techniques such as Chain-of-Thought and automatic prompt generation, developing a prompting strategy that explicitly mitigates the negative impact of such perturbations remains an open challenge. To bridge this gap, we propose Robustness of Prompting (RoP), a novel prompting strategy aimed at enhancing the robustness of LLMs. RoP consists of two stages: Error Correction and Guidance. In the Error Correction stage, RoP applies diverse perturbation methods to generate adversarial examples, which are used to generate prompts that correct input errors automatically. In the Guidance stage, RoP generates an optimal guidance prompt based on the corrected input, guiding the model to generate more robust and accurate inferences. Through comprehensive experiments spanning arithmetic, commonsense, and logical reasoning tasks, we demonstrate that RoP significantly improves LLMs' robustness against adversarial perturbations. Crucially, it preserves model accuracy with only minimal degradation compared to clean input scenarios, thereby establishing RoP as a practical and effective approach for enhancing LLM robustness in real-world applications.
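The kind of input perturbation the abstract mentions, slight character order errors, is easy to simulate. The sketch below is not RoP's perturbation code; `swap_adjacent_chars` is a hypothetical helper that swaps random pairs of adjacent letters, which is one simple way to produce adversarial variants of a question for testing.

```python
import random

def swap_adjacent_chars(text, n_swaps, seed=0):
    """Introduce typos by swapping up to n_swaps random pairs of adjacent letters."""
    rng = random.Random(seed)
    chars = list(text)
    # Only swap letter pairs, so digits, spaces, and punctuation stay in place.
    candidates = [i for i in range(len(chars) - 1)
                  if chars[i].isalpha() and chars[i + 1].isalpha()]
    for i in rng.sample(candidates, min(n_swaps, len(candidates))):
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

question = "If Tom has 3 apples and buys 4 more, how many apples does he have?"
perturbed = swap_adjacent_chars(question, n_swaps=3)
```

Keeping the numbers untouched means the perturbed question still has the same correct answer, so any drop in accuracy can be attributed to the typos.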

2411.02355 2026-05-27 cs.LG cs.AI

"Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization

Eldar Kurtic, Alexandre Marques, Shubhra Pandit, Mark Kurtz, Dan Alistarh

AI summary: Through more than 500,000 evaluations, the paper studies the accuracy-performance trade-offs of FP8, INT8, and INT4 quantization on the Llama-3.1 model family. It finds that FP8 is lossless, INT8 loses little accuracy, and INT4 weight-only quantization is competitive, and it recommends the best quantization format for different deployment scenarios based on the vLLM framework.

Comments: Accepted to ACL 2025

Abstract

Quantization is a powerful tool for accelerating large language model (LLM) inference, but the accuracy-performance trade-offs across different formats remain unclear. In this paper, we conduct the most comprehensive empirical study to date, evaluating FP8, INT8, and INT4 quantization across academic benchmarks and real-world tasks on the entire Llama-3.1 model family. Through over 500,000 evaluations, our investigation yields several key findings: (1) FP8 (W8A8-FP) is effectively lossless across all model scales, (2) well-tuned INT8 (W8A8-INT) achieves surprisingly low (1-3%) accuracy degradation, and (3) INT4 weight-only (W4A16-INT) is more competitive than expected, rivaling 8-bit quantization. Further, we investigate the optimal quantization format for different deployments by analyzing inference performance through the popular vLLM framework. Our analysis provides clear deployment recommendations: W4A16 is the most cost-efficient for synchronous setups, while W8A8 dominates in asynchronous continuous batching. For mixed workloads, the optimal choice depends on the specific use case. Our findings offer practical, data-driven guidelines for deploying quantized LLMs at scale, ensuring the best balance between speed, efficiency, and accuracy.
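In names such as W8A8 and W4A16, W gives the weight bit width and A the activation bit width. To show what integer weight quantization does to individual values, the sketch below applies plain symmetric round-to-nearest quantization at 8 and 4 bits. It is a textbook illustration with made-up weights, not the tuned quantization recipes evaluated in the paper, and the function names are hypothetical.

```python
def quantize_symmetric(weights, bits):
    """Round-to-nearest symmetric quantization of a list of floats.

    Returns (integer codes, scale), with codes in
    [-(2**(bits - 1) - 1), 2**(bits - 1) - 1].
    """
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]

weights = [0.42, -1.27, 0.03, 0.88]  # toy weight values
for bits in (8, 4):
    codes, scale = quantize_symmetric(weights, bits)
    restored = dequantize(codes, scale)
    max_err = max(abs(w - r) for w, r in zip(weights, restored))
    print(f"INT{bits}: codes={codes} max_abs_error={max_err:.4f}")
```

The rounding error per weight is bounded by half the scale, and the scale grows as the bit width shrinks, which is why 4-bit formats usually need more careful calibration than 8-bit ones.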

2505.18728 2026-05-27 cs.LG cs.AI

Message-Passing State-Space Models: Improving Graph Learning with Modern Sequence Modeling

Andrea Ceni, Alessio Gravina, Claudio Gallicchio, Davide Bacciu, Carola-Bibiane Schonlieb, Moshe Eliasof

AI summary: Proposes MP-SSM, which embeds the core computation of modern state-space models into message-passing neural networks, enabling efficient, permutation-equivariant, long-range information propagation on static and temporal graphs, and uses an exact sensitivity analysis to characterize information-flow problems in deep networks.

Abstract

The recent success of State-Space Models (SSMs) in sequence modeling has motivated their adaptation to graph learning, giving rise to Graph State-Space Models (GSSMs). However, existing GSSMs operate by applying SSM modules to sequences extracted from graphs, often compromising core properties such as permutation equivariance, message-passing compatibility, and computational efficiency. In this paper, we introduce a new perspective by embedding the key principles of modern SSM computation directly into the Message-Passing Neural Network framework, resulting in a unified methodology for both static and temporal graphs. Our approach, MP-SSM, enables efficient, permutation-equivariant, and long-range information propagation while preserving the architectural simplicity of message passing. Crucially, MP-SSM enables an exact sensitivity analysis, which we use to theoretically characterize information flow and evaluate issues like vanishing gradients and over-squashing in the deep regime. Furthermore, our design choices allow for a highly optimized parallel implementation akin to modern SSMs. We validate MP-SSM across a wide range of tasks, including node classification, graph property prediction, long-range benchmarks, and spatiotemporal forecasting, demonstrating both its versatility and strong empirical performance.
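The abstract does not give the MP-SSM update rule, so the sketch below uses a generic linear recurrence over a graph, H <- A H W + X B, purely to illustrate two properties the abstract highlights: a state-space style linear state update driven by message passing, and permutation equivariance. All matrices and function names here are hypothetical.

```python
def matmul(a, b):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def linear_mp_recurrence(adj, feats, w_state, w_in, n_steps):
    """Iterate H <- adj @ H @ w_state + feats @ w_in, starting from H = 0."""
    inp = matmul(feats, w_in)
    h = [[0.0] * len(w_state) for _ in feats]
    for _ in range(n_steps):
        h = add(matmul(matmul(adj, h), w_state), inp)
    return h

def permute(matrix, perm, cols=False):
    """Reorder rows by perm; with cols=True also reorder columns (for adjacency)."""
    rows = [matrix[p] for p in perm]
    return [[r[p] for p in perm] for r in rows] if cols else rows

adj = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]  # 3-node path graph
feats = [[1.0, 0.0], [0.0, 1.0], [2.0, -1.0]]
w_state = [[0.5, 0.1], [0.0, 0.5]]
w_in = [[1.0, 0.0], [0.0, 1.0]]
perm = [2, 0, 1]

out = linear_mp_recurrence(adj, feats, w_state, w_in, n_steps=4)
out_perm = linear_mp_recurrence(permute(adj, perm, cols=True), permute(feats, perm),
                                w_state, w_in, n_steps=4)
```

Relabeling the nodes before the recurrence gives the same node states as relabeling them afterwards, which is the equivariance that sequence-extraction approaches can lose.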

2505.18603 2026-05-27 cs.AI cs.CV

Doc-CoB: Enhancing Document Understanding with Visual Chain-of-Boxes Reasoning

Doc-CoB:通过视觉链式框推理增强文档理解

Ye Mo, Kai Ye, Xianwei Mao, Zirui Shao, Gang Huang, Bo Zhang, Hangdi Xing, Kehan Chen, Huan Zhou, Zixu Yan, Jiajun Bu, Sheng Zhou

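AI summary Proposes the Doc-CoB framework, which adds coarse-to-fine, layout-aware visual reasoning to multimodal large language models and progressively focuses on query-relevant layout regions to improve document understanding.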

详情
AI中文摘要

文档理解旨在对文档图像进行问答和信息提取,其中视觉内容信息密集,大多数查询仅依赖于少数相关布局区域。然而,现有方法要么采用一次通过策略,隐式假设所有布局同等重要,要么过度关注小区域而丢失关键布局信息。为解决这些局限性,我们引入了Doc-CoB(链式框),一个简单而有效的框架,将粗到细的布局感知视觉推理集成到多模态大语言模型中。Doc-CoB不是直接放大到小区域,而是逐步聚焦于查询相关布局,同时保留全局文档信息。具体来说,它首先选择关键布局框,然后通过视觉提示聚焦于这些框进行进一步理解。为支持这一范式,我们引入了两个推理任务:框识别和框推理,并构建了一个自动流水线,生成24.9万个带有中间视觉监督的训练样本。在七个基准测试和四种流行模型上的广泛实验表明,Doc-CoB显著提升了性能,证明了其有效性和广泛适用性。

英文摘要

Document understanding aims to perform question answering and information extraction over document images, where the visual content is highly information-dense and most queries rely on only a few relevant layout regions. However, existing methods either adopt a one-pass strategy that implicitly assumes all layouts are equally important, or focus excessively on small regions at the cost of losing critical layout information. To address these limitations, we introduce Doc-CoB (Chain-of-Boxes), a simple-yet-effective framework that integrates coarse-to-fine layout-aware visual reasoning into multimodal large language models. Instead of directly zooming into small regions, Doc-CoB progressively focuses on query-relevant layouts while preserving global document information. Specifically, it first selects key layout boxes and then focuses on them for further understanding with visual prompting. To support this paradigm, we introduce two reasoning tasks for box recognition and box reasoning, with an automatic pipeline that constructs 249k training samples with intermediate visual supervision. Extensive experiments on seven benchmarks with four popular models show that Doc-CoB significantly improves performance, demonstrating its effectiveness and wide applicability.

2505.17163 2026-05-27 cs.LG cs.AI cs.CL cs.CV

OCR-Reasoning Benchmark: Unveiling the True Capabilities of MLLMs in Complex Text-Rich Image Reasoning

OCR-Reasoning基准:揭示MLLMs在复杂文本丰富图像推理中的真实能力

Mingxin Huang, Yongxin Shi, Dezhi Peng, Songxuan Lai, Zecheng Xie, Lianwen Jin

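AI summary Proposes the OCR-Reasoning benchmark of 1,069 human-annotated examples covering 6 core reasoning abilities and 18 practical reasoning tasks. Its dual annotation (final answer and step-by-step reasoning process) is used to evaluate multimodal large language models on text-rich image reasoning, and no state-of-the-art model exceeds 50% accuracy.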

Comments ICLR 2026

详情
AI中文摘要

近期多模态慢思考系统在各种视觉推理任务中表现出色。然而,由于缺乏专门且系统的基准,它们在文本丰富图像推理任务中的能力仍未得到充分研究。为填补这一空白,我们提出了OCR-Reasoning,一个新颖的基准,旨在系统评估多模态大语言模型在文本丰富图像推理任务上的表现。具体而言,OCR-Reasoning包含1069个人工标注的示例,涵盖文本丰富视觉场景中的6种核心推理能力和18个实际推理任务。与仅提供最终答案的现有文本丰富图像理解基准不同,本基准额外提供了详细的逐步推理过程。这种双标注使得能够同时评估模型的最终答案和推理过程,从而全面评估文本丰富推理能力。利用该基准,我们对最新的多模态大语言模型进行了全面评估。结果表明,即使是最先进的多模态大语言模型在文本丰富图像推理任务中也面临巨大困难,在我们的基准上没有一个模型的准确率超过50%,这表明文本丰富图像推理的挑战是一个亟待解决的问题。基准和评估脚本可在https://github.com/SCUT-DLVCLab/OCR-Reasoning获取。

英文摘要

Recent advancements in multimodal slow-thinking systems have demonstrated remarkable performance across various visual reasoning tasks. However, their capabilities in text-rich image reasoning tasks remain understudied due to the absence of a dedicated and systematic benchmark. To address this gap, we propose OCR-Reasoning, a novel benchmark designed to systematically assess Multimodal Large Language Models on text-rich image reasoning tasks. Specifically, OCR-Reasoning comprises 1,069 human-annotated examples spanning 6 core reasoning abilities and 18 practical reasoning tasks in text-rich visual scenarios. Unlike existing text-rich image understanding benchmarks that only provide a final answer, this benchmark additionally provides a detailed step-by-step reasoning process. This dual annotation enables the evaluation of both the models' final answers and their reasoning processes, thereby offering a holistic assessment of text-rich reasoning capabilities. By leveraging this benchmark, we conducted a comprehensive evaluation of the latest MLLMs. Our results demonstrate that even the most advanced MLLMs exhibit substantial difficulties in text-rich image reasoning tasks, with none achieving an accuracy above 50% on our benchmark, indicating that the challenges of text-rich image reasoning are an urgent issue to be addressed. The benchmark and evaluation scripts are available at https://github.com/SCUT-DLVCLab/OCR-Reasoning.

2505.16942 2026-05-27 cs.CV cs.LG

Efficient All-Pairs Correlation Volume Sampling for Optical Flow Estimation

高效的全对相关性体素采样用于光流估计

Karlis Martins Briedis, Markus Gross, Christopher Schroers

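AI summary Proposes a memory- and compute-efficient algorithm that implements the exact mathematical operator of all-pairs correlation volume sampling, substantially improving speed while keeping memory usage low, and applies it to high-resolution optical flow estimation with state-of-the-art results.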

Comments CVPR 2026

详情
AI中文摘要

最近的光流估计方法通常从密集的全对相关性体素中进行局部代价采样。这导致计算和内存复杂度与像素数成二次关系。尽管存在一种按需代价计算的替代内存高效实现,但在实践中速度明显较慢,因此许多先前方法在降采样分辨率下处理图像,丢失了细粒度细节。为了解决这个问题,我们提出了一种算法,用于全对相关性体素采样的内存和计算高效实现,同时仍然匹配RAFT定义的精确数学算子。我们的方法在保持同样低内存使用的情况下,性能优于按需采样高达92%,并且与默认实现相比,内存使用降低高达99%的同时性能至少相当。由于代价采样占整体运行时间的很大一部分,这可以转化为高分辨率输入下端到端模型推理总时间高达63%的节省。我们对现有方法的评估包括一个8K超高清数据集和SEA-RAFT方法的推理时间扩展。通过这一点,我们在高分辨率下在准确性和运行时间上都达到了最先进的结果。

英文摘要

Recent optical flow estimation methods often employ local cost sampling from a dense all-pairs correlation volume. This results in quadratic computational and memory complexity in the number of pixels. Although an alternative memory-efficient implementation with on-demand cost computation exists, it is significantly slower in practice, and therefore many prior methods process images at downsampled resolutions, missing fine-grained details. To address this, we propose an algorithm for a memory- and compute-efficient implementation of all-pairs correlation volume sampling that still matches the exact mathematical operator defined by RAFT. Our approach outperforms on-demand sampling by up to 92% while maintaining equally low memory usage, and performs at least on par with the default implementation with up to 99% lower memory usage. As cost sampling makes up a significant portion of the overall runtime, this can translate to up to 63% savings for the total end-to-end model inference on high-resolution inputs. Our evaluation of existing methods includes an 8K ultra-high-resolution dataset and an inference-time extension of the SEA-RAFT method. With this, we achieve state-of-the-art results at high resolutions both in accuracy and runtime.
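
To make the two baselines contrasted in this abstract concrete, the sketch below compares a lookup in a precomputed all-pairs correlation volume with on-demand computation of the same window of dot products. It is a deliberately simplified illustration, not the paper's algorithm: it samples at integer offsets only (RAFT itself samples bilinearly from a multi-scale pyramid), and the feature maps and function names are made up for the example.

```python
def dot(u, v):
    """Inner product of two feature vectors."""
    return sum(a * b for a, b in zip(u, v))


def build_dense_volume(f1, f2):
    """Precompute every pairwise correlation: (H*W)^2 entries in memory."""
    h, w = len(f1), len(f1[0])
    return [[[[dot(f1[y1][x1], f2[y2][x2]) for x2 in range(w)]
              for y2 in range(h)]
             for x1 in range(w)]
            for y1 in range(h)]


def window(center_y, center_x, radius):
    """Integer target coordinates in a (2r+1) x (2r+1) neighbourhood."""
    return [(center_y + dy, center_x + dx)
            for dy in range(-radius, radius + 1)
            for dx in range(-radius, radius + 1)]


def sample_dense(volume, y, x, center, radius):
    """Look up the window in the precomputed volume (zero outside the image)."""
    h, w = len(volume), len(volume[0])
    return [volume[y][x][y2][x2] if 0 <= y2 < h and 0 <= x2 < w else 0
            for y2, x2 in window(center[0], center[1], radius)]


def sample_on_demand(f1, f2, y, x, center, radius):
    """Compute only the dot products inside the window; nothing is stored."""
    h, w = len(f2), len(f2[0])
    return [dot(f1[y][x], f2[y2][x2]) if 0 <= y2 < h and 0 <= x2 < w else 0
            for y2, x2 in window(center[0], center[1], radius)]


# Two tiny 2 x 3 feature maps with 2 channels per pixel (integer-valued).
f1 = [[[1, 0], [2, 1], [0, 3]], [[1, 1], [0, 2], [3, 1]]]
f2 = [[[0, 1], [1, 2], [2, 0]], [[1, 3], [2, 2], [0, 1]]]
volume = build_dense_volume(f1, f2)

dense = sample_dense(volume, 0, 1, center=(1, 2), radius=1)
lazy = sample_on_demand(f1, f2, 0, 1, center=(1, 2), radius=1)
assert dense == lazy and len(dense) == 9
```

Both routes return identical values. The dense route stores one entry per pixel pair, which is the quadratic memory cost the abstract refers to, while the on-demand route stores nothing but recomputes dot products every time a window is sampled.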

2505.11063 2026-05-27 cs.AI cs.CR

Think Twice Before You Act: Enhancing Agent Behavioral Safety with Thought Correction

三思而后行:通过思想修正增强智能体行为安全

Changyue Jiang, Wenqi Zhang, Xudong Pan, Geng Hong, Min Yang

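AI summary Proposes Thought-Aligner, a lightweight plug-in safety model that causally corrects unsafe thoughts before action execution without modifying the underlying agent. Trained with two-stage contrastive learning, it raises behavioral safety from about 50% to about 90% across multiple benchmarks and six LLMs, exceeds existing guardrails by about 23%, and improves helpfulness by about 5%.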

Comments Accepted to ICML 2026

详情
AI中文摘要

基于LLM的智能体通过迭代推理、工具使用和环境交互来解决复杂任务,其中每个中间思想直接塑造后续动作。因此,这些思想中的微小偏差可能传播为不安全行为,但现有的防护措施通常仅作用于最终输出或需要侵入式模型修改。我们引入了Thought-Aligner,一种轻量级插件式安全模型,它在动作执行前对不安全思想进行因果修正,而不改变底层智能体。修正后的思想被反馈给智能体,将其决策过程和工具使用引导至更安全的轨迹。由于仅在思想层面操作,Thought-Aligner是模型无关的,可以集成到各种智能体框架中。我们通过在十个风险场景中生成的成对安全和不安全思想上进行两阶段对比学习来训练Thought-Aligner。在多种智能体安全基准和六种LLM上的实验表明,Thought-Aligner将行为安全从无保护时的约50%提升至平均约90%,超过最先进的防护约23%,同时还将有用性提高了约5%。该方法具有低每步延迟和最小开销,实现了可扩展且实用的部署。我们在https://huggingface.co/WhitzardAgent/Thought-Aligner-7B公开发布了Thought-Aligner-7B。

英文摘要

LLM-based agents solve complex tasks through iterative reasoning, tool use, and environment interaction, where each intermediate thought directly shapes subsequent actions. Small deviations in these thoughts can therefore propagate into unsafe behaviors, yet existing guardrails typically operate only on final outputs or require intrusive model modifications. We introduce Thought-Aligner, a lightweight plug-in safety model that performs causal correction on unsafe thoughts before action execution, without altering the underlying agent. The corrected thoughts are fed back into the agent, steering its decision process and tool use toward safer trajectories. Because it operates solely at the thought level, Thought-Aligner is model-agnostic and can be integrated into diverse agent frameworks. We train Thought-Aligner via two-stage contrastive learning on paired safe and unsafe thoughts generated across ten risk scenarios. Experiments on diverse agent-safety benchmarks and six LLMs show that Thought-Aligner increases behavioral safety from about 50% without protection to around 90% on average, exceeding state-of-the-art guardrails by roughly 23%, while also improving helpfulness by about 5%. The method incurs low per-step latency and minimal overhead, enabling scalable and practical deployment. We publicly release Thought-Aligner-7B at https://huggingface.co/WhitzardAgent/Thought-Aligner-7B.
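
The control flow described in this abstract, in which each thought is corrected before the action it leads to is executed, might be sketched as follows. This only shows where a plug-in corrector would sit in an agent loop. The three callables, their signatures, and the toy rules are hypothetical stand-ins and do not reflect the actual Thought-Aligner model or any agent framework's API.

```python
def run_guarded_agent(task, propose_thought, correct_thought, act, max_steps=5):
    """Agent loop in which every thought passes through a corrector before acting.

    All three callables are hypothetical stand-ins:
      propose_thought(task, history) -> str or None (None ends the episode)
      correct_thought(task, history, thought) -> str (possibly rewritten)
      act(thought) -> observation
    """
    history = []
    for _ in range(max_steps):
        thought = propose_thought(task, history)
        if thought is None:
            break
        safe_thought = correct_thought(task, history, thought)
        observation = act(safe_thought)
        # The corrected thought, not the original, is what the agent sees next.
        history.append((safe_thought, observation))
    return history


# Toy stand-ins so the loop can be exercised without any model.
def toy_propose(task, history):
    plans = ["delete every file in the home directory", "list the files"]
    return plans[len(history)] if len(history) < len(plans) else None


def toy_correct(task, history, thought):
    if "delete every file" in thought:
        return "ask the user which files are safe to delete"
    return thought


def toy_act(thought):
    return "executed: " + thought


trace = run_guarded_agent("clean up disk space", toy_propose, toy_correct, toy_act)
```

Since the corrector only reads and rewrites text, the agent itself is left unchanged, which is the sense in which such a component can be model-agnostic.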

2502.17666 2026-05-27 cs.LG cs.AI

Yes, Q-learning Helps Offline In-Context RL

是的,Q学习有助于离线上下文强化学习

Denis Tarasov, Alexander Nikulin, Ilya Zisman, Albina Klepach, Andrei Polubarov, Nikita Lyubaykin, Alexander Derevyagin, Igor Kiselev, Vladislav Kurenkov

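AI summary Integrates RL objectives into the offline in-context RL framework. Experiments on more than 150 datasets show that directly optimizing RL objectives improves performance by about 30% on average over Algorithm Distillation, and that conservatism in value learning brings further gains.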

详情
AI中文摘要

现有的离线上下文强化学习(ICRL)方法主要依赖监督训练目标,这在离线RL设置中已知存在局限性。在本研究中,我们探索了在离线ICRL框架中整合RL目标。通过在150多个GridWorld和MuJoCo环境派生数据集上的实验,我们证明,与广泛采用的算法蒸馏(AD)相比,直接优化RL目标在各种数据集覆盖范围、结构、专业水平和环境复杂性下平均提升约30%的性能。此外,在具有挑战性的XLand-MiniGrid环境中,RL目标使AD的性能翻倍。我们的结果还揭示,在几乎所有测试的设置中,价值学习期间加入保守性带来了额外的改进。我们的发现强调了将ICRL学习目标与RL奖励最大化目标对齐的重要性,并表明离线RL是推进ICRL的一个有前景的方向。

英文摘要

Existing offline in-context reinforcement learning (ICRL) methods have predominantly relied on supervised training objectives, which are known to have limitations in offline RL settings. In this study, we explore the integration of RL objectives within an offline ICRL framework. Through experiments on more than 150 GridWorld and MuJoCo environment-derived datasets, we demonstrate that optimizing RL objectives directly improves performance by approximately 30% on average compared to widely adopted Algorithm Distillation (AD), across various dataset coverages, structures, expertise levels, and environmental complexities. Furthermore, in the challenging XLand-MiniGrid environment, RL objectives doubled the performance of AD. Our results also reveal that the addition of conservatism during value learning brings additional improvements in almost all settings tested. Our findings emphasize the importance of aligning ICRL learning objectives with the RL reward-maximization goal, and demonstrate that offline RL is a promising direction for advancing ICRL.

2505.02974 2026-05-27 cs.LG

PLAID: A Unified Data Model for Machine Learning on Heterogeneous Physics Simulations

PLAID:面向异构物理模拟的机器学习统一数据模型

Fabien Casenave, Xavier Roynard, Brian Staber, Alexandre Devaux-Rivière, William Piat, Michele Alessandro Bucci, Nissrine Akkari, Abbas Kabalan, Xuan Minh Vuong Nguyen, Luca Saverio, Raphaël Carpintero Perez, Anthony Kalaydjian, Samy Fouché, Thierry Gonon, Ghassan Najjar, Thomas Daniel, Emmanuel Menier, Matthieu Nastorg, Giovanni Catalani, Christian Rey

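AI summary Proposes PLAID, a unified data layer that standardizes heterogeneous physics simulation data, and releases six benchmark datasets to address the lack of large-scale, diverse datasets for machine-learning surrogate models.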

Comments Presented at EuRIPS 2025 and accepted at the AI4Physics Workshop @ ICML 2026

详情
AI中文摘要

基于机器学习的代理模型已成为加速模拟驱动科学工作流的强大工具,但其应用受到缺乏大规模、多样化且标准化的物理模拟数据集的限制。现有基准测试通常聚焦于狭窄领域或依赖简化数据模型,未能捕捉由可变几何、网格和拓扑产生的异质性,而这对于评估现实场景中的泛化能力至关重要。我们提出PLAID(物理学习AI数据模型),一个用于异构物理模拟的统一且可扩展的数据层。它在保留模拟数据完整复杂性的同时,支持高效可扩展的机器学习工作流,并附带一个用于数据集构建和操作的库(https://github.com/PLAID-lib/plaid)。我们发布了六个覆盖结构力学和计算流体动力学的数据集,旨在反映真实工业场景并提供标准化基准。该框架包含可复现的评估协议,并与Hugging Face集成,支持开放、社区驱动的基准测试和用户积极参与(https://huggingface.co/PLAIDcompetitions)。

英文摘要

Machine learning-based surrogate models have emerged as a powerful tool to accelerate simulation-driven scientific workflows, but their adoption is limited by the lack of large-scale, diverse, and standardized datasets for physics-based simulations. Existing benchmarks often focus on narrow domains or rely on simplified data models, and fail to capture the heterogeneity arising from variable geometries, meshes, and topologies, which is critical for assessing generalization in realistic settings. We introduce PLAID (Physics-Learning AI Data model), a unified and extensible data layer for heterogeneous physics simulations. It preserves the full complexity of simulation data while enabling efficient and scalable machine learning workflows, together with a library for dataset construction and manipulation (https://github.com/PLAID-lib/plaid). We release six datasets covering structural mechanics and computational fluid dynamics, designed to reflect realistic industrial scenarios and provide standardized benchmarks. The framework includes reproducible evaluation protocols and is integrated with Hugging Face to enable open, community-driven benchmarking with active user participation (https://huggingface.co/PLAIDcompetitions).

2503.21510 2026-05-27 cs.LG cs.CV stat.ML

An uncertainty-aware Bayesian framework for machine learning classification models: A case study in land cover classification

一种不确定性感知的贝叶斯机器学习分类模型框架:以土地覆盖分类为例

Samuel Bilson, Miles McCrory, Anna Pustogvar

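AI summary Proposes a Bayesian framework for generative classification models that accounts for input measurement uncertainty, validated on land cover datasets with a Bayesian quadratic discriminant analysis model that outperforms random forests and neural networks in interpretability, uncertainty modeling, and computational efficiency.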

Comments 38 pages, 16 figures

详情
AI中文摘要

确保机器学习分类模型的预测伴随不确定性估计是可信任人工智能的主要支柱之一。当前不确定性量化研究主要关注ML模型的认知不确定性,但很少考虑输入测量不确定性,而这对于计量学的可追溯性至关重要。在这项工作中,我们提出了一种考虑输入测量不确定性的生成式ML分类模型的贝叶斯框架。我们以贝叶斯二次判别分析(BQDA)模型为例,并将其应用于来自Copernicus Sentinel-2的2020年和2021年计量土地覆盖数据集。我们将该模型的性能与土地覆盖图中更流行的分类模型(如随机森林和神经网络)进行基准测试。为了验证和评估此类模型的泛化能力,我们还在合成分类数据上进行了模拟,改变了输入测量噪声的分布类型和强度。我们发现,对于真实和合成数据,所提出的BQDA模型更可信,因为它更具可解释性,显式建模了输入测量不确定性,并在不同领域和大小的数据集上保持了类别概率输出的预测性能,同时计算效率更高。

英文摘要

Ensuring that predictions of machine learning (ML) classification models are accompanied by uncertainty estimates is one of the main pillars of trustworthy AI. Current research in uncertainty quantification focuses mainly on epistemic uncertainty of the ML model, but rarely takes account of input measurement uncertainty, which is vital for traceability in metrology. In this work we propose a Bayesian framework for generative ML classification models that takes account of input measurement uncertainty. We take the specific case of a Bayesian quadratic discriminant analysis (BQDA) model, and apply it to metrological land cover datasets from Copernicus Sentinel-2 from 2020 and 2021. We benchmark the performance of the model against more popular classification models used in land cover maps such as random forests and neural networks. To validate and assess the generalisability of such a model, we also run simulations over synthetic classification data, varying distribution type and strength of the input measurement noise. We find for both real and synthetic data, the BQDA model presented is more trustworthy, in the sense that it is more interpretable, explicitly models the input measurement uncertainty, and maintains predictive performance of class probability outputs across datasets over different domains and sizes, whilst also being more computationally efficient.
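
One way to see how input measurement uncertainty can enter a generative classifier is the 1-D sketch below. It assumes Gaussian class densities and zero-mean Gaussian measurement noise, in which case the noisy measurement is again Gaussian with the two variances added. This is a plug-in simplification for illustration only: it omits the Bayesian treatment of the class parameters, and the class values and function names are hypothetical rather than taken from the paper.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D normal distribution with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)


def class_posteriors(x_obs, classes, noise_var):
    """Posterior class probabilities for a noisy 1-D measurement.

    classes is a list of (prior, mean, var) tuples. Adding zero-mean Gaussian
    measurement noise to a Gaussian class density gives another Gaussian
    whose variance is var + noise_var.
    """
    weights = [prior * gaussian_pdf(x_obs, mean, var + noise_var)
               for prior, mean, var in classes]
    total = sum(weights)
    return [wgt / total for wgt in weights]


# Two hypothetical classes with equal priors and unit variance.
classes = [(0.5, 0.0, 1.0), (0.5, 4.0, 1.0)]
clean = class_posteriors(1.0, classes, noise_var=0.0)
noisy = class_posteriors(1.0, classes, noise_var=9.0)
```

For the same reading of 1.0, the probability of the first class drops from about 0.98 with no measurement noise to about 0.60 with a noise variance of 9, so the classifier becomes less confident when the input is less trustworthy.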

2504.08540 2026-05-27 cs.CV

Datasets for Lane Detection in Autonomous Driving: A Comprehensive Review

自动驾驶中车道检测数据集:全面综述

Jörg Gamerdinger, Sven Teufel, Oliver Bringmann

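AI summary Comprehensively reviews 20 public lane detection datasets, analyzes their characteristics, strengths, and limitations with a multidimensional quality metric, and identifies directions for future improvement to advance robust lane detection.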

详情
AI中文摘要

准确的车道检测对于自动驾驶至关重要,能够在各种道路场景下实现安全可靠的车辆导航。为了支持车道检测算法的开发和评估,已经引入了许多数据集,这些数据集在数据量、传感器类型、注释粒度、环境条件和场景多样性方面各不相同。本文全面综述了20个公开可用的车道检测数据集,系统地分析了它们的特性、优势和局限性。我们基于传感器分辨率、注释类型以及道路和天气条件的多样性等关键性能指标,使用一种新颖的多维数据集质量指标对这些数据集进行分类。通过识别现有挑战和研究空白,我们强调了未来数据集改进的机会,这些改进可以进一步推动鲁棒车道检测的创新。本综述为寻求适用于鲁棒车道检测的数据集的研究人员提供了资源,并为推进自动驾驶的更广泛目标做出了贡献。

英文摘要

Accurate lane detection is essential for automated driving, enabling safe and reliable vehicle navigation across a variety of road scenarios. Numerous datasets have been introduced to support the development and evaluation of lane detection algorithms, each differing in terms of the amount of data, sensor types, annotation granularity, environmental conditions, and scenario diversity. This paper provides a comprehensive review of 20 publicly available lane detection datasets, systematically analyzing their characteristics, advantages, and limitations. We classify these datasets based on key performance indicators such as sensor resolution, annotation types and diversity of road and weather conditions using a novel multidimensional metric for dataset quality. By identifying existing challenges and research gaps, we highlight opportunities for future dataset improvements that can further drive innovation in robust lane detection. This review serves as a resource for researchers seeking appropriate datasets for robust lane detection and contributes to the broader goal of advancing autonomous driving.

2504.07853 2026-05-27 cs.CV

V2V3D: View-to-View Denoised 3D Reconstruction for Light-Field Microscopy

V2V3D:用于光场显微镜的视图到视图去噪三维重建

Jiayin Zhao, Zhenqi Fu, Tao Yu, Hui Qiao

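AI summary Proposes V2V3D, an unsupervised view-to-view framework that jointly optimizes image denoising and 3D reconstruction, exploits noise independence across views for noise2noise denoising, and introduces a wave-optics-based feature alignment technique to recover high-frequency details, surpassing existing methods in efficiency and performance.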

Comments CVPR 2025; New version: Fix NSFC ID

详情
AI中文摘要

光场显微镜(LFM)因其能够捕捉基于快照的大规模三维荧光图像而受到广泛关注。然而,现有的LFM重建算法对传感器噪声高度敏感,或者需要难以获取的真实标注数据进行训练。为了解决这些挑战,本文引入了V2V3D,一个无监督的基于视图到视图的框架,在统一架构中建立了图像去噪和三维重建联合优化的新范式。我们假设LF图像源自一致的三维信号,每个视图中的噪声是独立的。这使得V2V3D能够融入噪声到噪声原理以实现有效去噪。为了增强高频细节的恢复,我们提出了一种新颖的基于波动光学的特征对齐技术,该技术将用于波动光学中前向传播的点扩散函数转换为专门用于特征对齐的卷积核。此外,我们引入了一个包含LF图像及其对应三维强度体积的LFM数据集。大量实验表明,我们的方法实现了高计算效率,并优于其他最先进的方法。这些进展使V2V3D成为在挑战性条件下进行三维成像的有前景的解决方案。

英文摘要

Light field microscopy (LFM) has gained significant attention due to its ability to capture snapshot-based, large-scale 3D fluorescence images. However, existing LFM reconstruction algorithms are highly sensitive to sensor noise or require hard-to-get ground-truth annotated data for training. To address these challenges, this paper introduces V2V3D, an unsupervised view2view-based framework that establishes a new paradigm for joint optimization of image denoising and 3D reconstruction in a unified architecture. We assume that the LF images are derived from a consistent 3D signal, with the noise in each view being independent. This enables V2V3D to incorporate the principle of noise2noise for effective denoising. To enhance the recovery of high-frequency details, we propose a novel wave-optics-based feature alignment technique, which transforms the point spread function, used for forward propagation in wave optics, into convolution kernels specifically designed for feature alignment. Moreover, we introduce an LFM dataset containing LF images and their corresponding 3D intensity volumes. Extensive experiments demonstrate that our approach achieves high computational efficiency and outperforms the other state-of-the-art methods. These advancements position V2V3D as a promising solution for 3D imaging under challenging conditions.

2504.05046 2026-05-27 cs.CV

MotionPRO: Exploring the Role of Pressure in Human MoCap and Beyond

MotionPRO:探索压力在人体动作捕捉及其它领域中的作用

Shenghao Ren, Yi Lu, Jiayi Huang, Jiayi Zhao, He Zhang, Tao Yu, Qiu Shen, Xun Cao

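AI summary Builds MotionPRO, a large-scale human motion capture dataset with pressure, RGB, and optical sensors, and designs pose and trajectory estimation networks that use pressure alone or fuse pressure with RGB, demonstrating the necessity and effectiveness of pressure signals for improving physical plausibility and global trajectory accuracy and for driving virtual humans and humanoid robots.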

Comments fix NSFC ID

详情
AI中文摘要

现有的人体动作捕捉(MoCap)方法大多关注视觉相似性而忽略物理合理性。因此,下游任务如驱动3D场景中的虚拟人或现实世界中的类人机器人会出现时间漂移和抖动、空间问题如滑动和穿透以及全局轨迹精度差等问题。在本文中,我们通过探索压力的作用,从人体与物理世界交互的角度重新审视人体动作捕捉。首先,我们构建了一个大规模的人体动作捕捉数据集,包含压力、RGB和光学传感器(命名为MotionPRO),该数据集由70名志愿者执行400种动作,共计1240万帧姿态。其次,我们通过两个具有挑战性的任务检验压力信号的必要性和有效性:(1)仅基于压力的姿态和轨迹估计:我们提出了一个包含小核解码器和长短期注意力模块的网络,并证明压力可以提供准确的全局轨迹和合理的下半身姿态。(2)融合压力和RGB的姿态和轨迹估计:我们沿相机轴施加正交相似性约束,沿垂直轴施加全身接触约束,以增强交叉注意力策略,融合压力和RGB特征图。实验表明,将压力与RGB特征融合不仅在客观指标上显著提升了性能,而且能够合理地驱动3D场景中的虚拟人(SMPL)。此外,我们证明融入物理感知使类人机器人能够执行更精确和稳定的动作,这对具身人工智能的发展非常有益。项目页面:https://nju-cite-mocaphumanoid.github.io/MotionPRO/

英文摘要

Existing human Motion Capture (MoCap) methods mostly focus on visual similarity while neglecting physical plausibility. As a result, downstream tasks such as driving virtual humans in 3D scenes or humanoid robots in the real world suffer from issues such as timing drift and jitter, spatial problems like sliding and penetration, and poor global trajectory accuracy. In this paper, we revisit human MoCap from the perspective of the interaction between the human body and the physical world by exploring the role of pressure. Firstly, we construct a large-scale human Motion capture dataset with Pressure, RGB and Optical sensors (named MotionPRO), which comprises 70 volunteers performing 400 types of motion, encompassing a total of 12.4M pose frames. Secondly, we examine both the necessity and effectiveness of the pressure signal through two challenging tasks: (1) pose and trajectory estimation based solely on pressure: We propose a network that incorporates a small kernel decoder and a long-short-term attention module, and prove that pressure can provide accurate global trajectory and plausible lower body pose. (2) pose and trajectory estimation by fusing pressure and RGB: We impose constraints on orthographic similarity along the camera axis and whole-body contact along the vertical axis to enhance the cross-attention strategy to fuse pressure and RGB feature maps. Experiments demonstrate that fusing pressure with RGB features not only significantly improves performance in terms of objective metrics, but also plausibly drives virtual humans (SMPL) in 3D scenes. Furthermore, we demonstrate that incorporating physical perception enables humanoid robots to perform more precise and stable actions, which is highly beneficial for the development of embodied artificial intelligence. Project page is available at: https://nju-cite-mocaphumanoid.github.io/MotionPRO/

2504.02775 2026-05-27 cs.CV cs.LG

TailedCore: Few-Shot Sampling for Unsupervised Long-Tail Noisy Anomaly Detection

TailedCore: 面向无监督长尾噪声异常检测的少样本采样

Yoon Gyo Jung, Jaewoo Park, Jaeho Yoon, Kuan-Chuan Peng, Wonchul Kim, Andrew Beng Jin Teoh, Octavia Camps

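AI summary For normal datasets that are contaminated with defects and have an unknown long-tailed class distribution, proposes TailSampler, which estimates class sizes so that tail classes and noise can be handled independently, and builds the memory-based anomaly detection model TailedCore, which achieves state-of-the-art performance in unsupervised long-tail noisy anomaly detection.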

Comments Accepted to CVPR2025

详情
AI中文摘要

我们旨在解决一个实际且具有挑战性的无监督异常检测问题,其中正常数据集既包含缺陷区域污染,其产品类别分布又是长尾但未知的。我们观察到现有模型存在尾类与噪声之间的权衡:如果模型对像素噪声鲁棒,则其在尾类样本上的性能会下降,反之亦然。为缓解该问题,我们独立处理尾类和噪声样本。为此,我们提出TailSampler,一种新颖的类别大小预测器,基于嵌入相似度的类别分布对称假设来估计样本的类别基数。TailSampler可用于专门采样尾类样本,从而单独处理它们。基于这些方面,我们构建了基于记忆的异常检测模型TailedCore,其记忆既能很好地捕捉尾类信息,又对噪声鲁棒。我们在无监督长尾噪声异常检测设置上广泛验证了TailedCore的有效性,并表明TailedCore在大多数设置下优于现有最先进方法。

英文摘要

We aim to solve unsupervised anomaly detection in a practical, challenging environment where the normal dataset is contaminated with defective regions and its product class distribution is tailed but unknown. We observe that existing models suffer from a tail-versus-noise trade-off: if a model is robust against pixel noise, its performance deteriorates on tail class samples, and vice versa. To mitigate the issue, we handle the tail class and noise samples independently. To this end, we propose TailSampler, a novel class size predictor that estimates the class cardinality of samples based on a symmetric assumption on the class-wise distribution of embedding similarities. TailSampler can be utilized to sample the tail class samples exclusively, allowing them to be handled separately. Based on these facets, we build a memory-based anomaly detection model TailedCore, whose memory both captures tail class information well and is noise-robust. We extensively validate the effectiveness of TailedCore on the unsupervised long-tail noisy anomaly detection setting, and show that TailedCore outperforms the state-of-the-art in most settings.

2504.00307 2026-05-27 cs.LG physics.ao-ph

Generating realistic global precipitation fields from modelled atmospheric circulation

从模拟大气环流生成逼真的全球降水场

Michael Aich, Sebastian Bathiany, Philipp Hess, Yu Huang, Niklas Boers

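AI summary Proposes a generative machine learning method that combines a conditional diffusion model with a UNet architecture to generate high-resolution global precipitation fields from a small set of prognostic atmospheric variables, as an alternative to traditional parameterization schemes that reduces biases and enables efficient ensemble prediction.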

Comments Accepted for publication at Climate Dynamics

详情
AI中文摘要

改进地球系统模型(ESMs)中降水的表示对于评估气候变化的影响,特别是洪水和干旱等极端事件至关重要。在现有的ESMs中,降水并非显式解析,而是通过参数化表示。这些参数化通常依赖于解析近似但计算昂贵的基于柱的物理过程,不考虑位置间的相互作用。它们难以捕捉精细尺度的降水过程,并引入显著偏差。我们提出了一种基于生成式机器学习的新方法,将条件扩散模型与UNet架构相结合,从一小部分预报大气变量生成准确、高分辨率(0.25°)的全球每日降水场。与传统参数化不同,我们的框架高效地生成集合预测,捕捉降水的不确定性,且无需手动微调。我们在ERA5再分析数据上训练模型,并提出一种方法使其能应用于未见过的ESM数据,从而实现概率预测和气候情景的快速生成。通过利用全球预报变量之间的相互作用,我们的方法提供了一种替代参数化方案,减轻了ESM降水中存在的偏差,同时保持与其大尺度(年)趋势的一致性。这项工作表明,复杂的降水模式可以直接从大尺度大气变量中学习,提供了一种计算高效的方法来获得高分辨率降水,而无需以如此高分辨率运行动力模型的成本。

英文摘要

Improving the representation of precipitation in Earth system models (ESMs) is critical for assessing the impacts of climate change and especially of extreme events like floods and droughts. In existing ESMs, precipitation is not resolved explicitly, but represented by parameterizations. These typically rely on resolving approximated but computationally expensive column-based physics, not accounting for interactions between locations. They struggle to capture fine-scale precipitation processes and introduce significant biases. We present a novel approach, based on generative machine learning, which integrates a conditional diffusion model with a UNet architecture to generate accurate, high-resolution (0.25°) global daily precipitation fields from a small set of prognostic atmospheric variables. Unlike traditional parameterizations, our framework efficiently produces ensemble predictions, capturing uncertainties in precipitation, and does not require fine-tuning by hand. We train our model on the ERA5 reanalysis and present a method that allows us to apply it to unseen ESM data, enabling fast generation of probabilistic forecasts and climate scenarios. By leveraging interactions between global prognostic variables, our approach provides an alternative parameterization scheme that mitigates biases present in the ESM precipitation while maintaining consistency with its large-scale (annual) trends. This work demonstrates that complex precipitation patterns can be learned directly from large-scale atmospheric variables, offering a computationally efficient method to obtain high-resolution precipitation without the cost of running the dynamical model at such high resolution.

2504.00167 2026-05-27 cs.RO

Enhancing Physical Human-Robot Interaction: Recognizing Digits via Intrinsic Robot Tactile Sensing

增强物理人机交互:通过机器人本体触觉感知识别数字

Teresa Sinico, Giovanni Boschetti, Pedro Neto

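AI summary Uses the built-in torque sensors of a collaborative robot to record joint torques and end-effector forces while a person writes digits on a touchpad, achieves online digit recognition with 94% accuracy using a bidirectional LSTM network, and demonstrates its application potential in a fruit delivery task.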

详情
AI中文摘要

物理人机交互(pHRI)仍然是实现与机器人直观安全交互的关键挑战。当前的进展通常依赖外部触觉传感器作为接口,这增加了机器人系统的复杂性。在本研究中,我们利用协作机器人的本体触觉感知能力,识别用户在安装在机器人法兰上的无仪器触控板上绘制的数字。我们提出了一个数据集,包含机器人关节扭矩信号以及相应的末端执行器(EEF)力和力矩,这些数据来自机器人每个关节的集成扭矩传感器,用户在手写数字(0-9)时采集。pHRI-DIGI-TACT数据集从不同用户收集,以捕捉手写的自然变化。为增强分类鲁棒性,我们开发了一种数据增强技术来处理反转和旋转的数字输入。双向长短期记忆(Bi-LSTM)网络利用数据的时空特性,实现在线数字分类,在各种测试场景中总体准确率达到94%,包括涉及未参与系统训练的用户。该方法在真实机器人上的水果递送任务中实现,展示了其辅助日常生活的潜力。数据集和视频演示可在 https://TS-Robotics.github.io/pHRI-DIGI/ 获取。

英文摘要

Physical human-robot interaction (pHRI) remains a key challenge for achieving intuitive and safe interaction with robots. Current advancements often rely on external tactile sensors as an interface, which increases the complexity of robotic systems. In this study, we leverage the intrinsic tactile sensing capabilities of collaborative robots to recognize digits drawn by humans on an uninstrumented touchpad mounted to the robot's flange. We propose a dataset of robot joint torque signals along with corresponding end-effector (EEF) forces and moments, captured from the robot's integrated torque sensors in each joint, as users draw handwritten digits (0-9) on the touchpad. The pHRI-DIGI-TACT dataset was collected from different users to capture natural variations in handwriting. To enhance classification robustness, we developed a data augmentation technique to account for reversed and rotated digit inputs. A Bidirectional Long Short-Term Memory (Bi-LSTM) network, leveraging the spatiotemporal nature of the data, performs online digit classification with an overall accuracy of 94% across various test scenarios, including those involving users who did not participate in training the system. This methodology is implemented on a real robot in a fruit delivery task, demonstrating its potential to assist individuals in everyday life. Dataset and video demonstrations are available at: https://TS-Robotics.github.io/pHRI-DIGI/.

2503.14359 2026-05-27 cs.CV

ImViD: Immersive Volumetric Videos for Enhanced VR Engagement

ImViD:用于增强VR沉浸感的沉浸式体积视频

Zhengxian Yang, Shi Pan, Shengqi Wang, Haoxiang Wang, Li Lin, Guanjun Li, Zhengqi Wen, Borong Lin, Jianhua Tao, Tao Yu

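AI summary Proposes ImViD, a multi-view, multi-modal dataset that supports capturing complete scenes while on the move, and provides a benchmark and reconstruction pipeline for 6-DoF multi-modal immersive VR experiences.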

Comments CVPR 2025 Highlight; Fix NSFC ID

详情
AI中文摘要

用户参与度通过结合视觉和听觉刺激的完全沉浸式多模态体验得到极大增强。因此,VR/AR技术的下一个前沿在于具有完整场景捕获、大6自由度交互空间、多模态反馈以及高分辨率和高帧率内容的沉浸式体积视频。为了促进沉浸式体积视频的重建,我们引入了ImViD,这是一个多视角、多模态数据集,具有完整的面向空间的数据捕获和各种室内/室外场景。我们的捕获设备支持在移动中进行多视角视频-音频捕获,这是现有数据集所不具备的能力,显著提高了数据捕获的完整性、灵活性和效率。捕获的多视角视频(带有同步音频)为5K分辨率、60FPS,持续1-5分钟,包含丰富的前景-背景元素和复杂的动态。我们使用我们的数据集对现有方法进行基准测试,并建立了一个基础管线,用于从多视角视听输入构建用于6自由度多模态沉浸式VR体验的沉浸式体积视频。基准测试以及重建和交互结果证明了我们数据集和基线方法的有效性,我们相信这将激发未来对沉浸式体积视频制作的研究。

英文摘要

User engagement is greatly enhanced by fully immersive multi-modal experiences that combine visual and auditory stimuli. Consequently, the next frontier in VR/AR technologies lies in immersive volumetric videos with complete scene capture, a large 6-DoF interaction space, multi-modal feedback, and high-resolution, high-frame-rate content. To stimulate the reconstruction of immersive volumetric videos, we introduce ImViD, a multi-view, multi-modal dataset featuring complete space-oriented data capture and various indoor/outdoor scenarios. Our capture rig supports multi-view video-audio capture while on the move, a capability absent in existing datasets, significantly enhancing the completeness, flexibility, and efficiency of data capture. The captured multi-view videos (with synchronized audio) are in 5K resolution at 60FPS, last 1-5 minutes, and include rich foreground-background elements and complex dynamics. We benchmark existing methods using our dataset and establish a base pipeline for constructing immersive volumetric videos from multi-view audiovisual inputs for 6-DoF multi-modal immersive VR experiences. The benchmark and the reconstruction and interaction results demonstrate the effectiveness of our dataset and baseline method, which we believe will stimulate future research on immersive volumetric video production.