arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.16918 2026-05-19 cs.CV

HighSync: High-Quality Lip Synchronization via Latent Diffusion Models

Saeed Firouzi Daghigh, Majid Iranpour Mobarekeh, Mostafa Alavi, Mehdi Bagheri

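AI summary: This paper proposes HighSync, an end-to-end diffusion framework that generates photorealistic talking-face videos aligned with arbitrary input audio. The method addresses the trade-off between image quality and synchronization accuracy and, according to the authors, is the first lip-sync model to run natively at 512*512 resolution, making it suitable for professional production settings such as the film and broadcast industries.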

Comments 12 pages, 7 figures, 5 tables

Abstract

We present HighSync, an end-to-end diffusion-based framework for high-fidelity lip synchronization that generates photorealistic talking-face videos aligned with arbitrary input audio. Existing approaches consistently struggle to reconcile image quality with synchronization accuracy, producing either visually degraded outputs or temporally inconsistent lip movements. HighSync addresses both challenges simultaneously and, to our knowledge, is the first lip sync model to operate natively at 512*512 resolution, positioning it as a viable solution for professional production environments such as the film and broadcast industries. Central to our approach is the identification and systematic elimination of a data leakage phenomenon that has silently undermined temporal modeling in prior work, preventing models from developing a genuine dependence on the audio signal. Comprehensive evaluations across both perceptual quality and synchronization accuracy metrics confirm that HighSync achieves state-of-the-art performance on both fronts. Source code, pre-trained models, and supplementary video results are publicly available at: https://github.com/saeed5959/high_sync

2605.16911 2026-05-19 cs.CV

VGGT-Occ: Geometry-Grounded and Density-Aware Gated Fusion for 3D Occupancy Prediction

Xun Chen, Tianchen Deng, Rui Wang, Fangjinhua Wang, Junyi Ma, Hongming Shen, Hesheng Wang, Danwei Wang

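AI summary: This paper proposes VGGT-Occ, which embeds geometric tokens throughout the pipeline, introduces Projection-Aware Deformable Attention (PA-DA) to inject geometric information, uses a view-quality semantic gate for cross-view consistency, and adopts a sequential coarse-to-fine decoder with gated fusion to balance efficiency and performance. Experiments demonstrate its effectiveness for 3D semantic occupancy prediction.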

Abstract

3D semantic occupancy prediction requires accurate 2D-to-3D feature lifting, yet current methods restrict camera geometry to initial projections. Subsequent operations like offset learning, attention weighting, and cross-camera aggregation remain geometry-agnostic, ignoring essential physical constraints. We propose VGGT-Occ, a framework that embeds geometric tokens throughout the entire pipeline. We introduce Projection-Aware Deformable Attention (PA-DA) to inject geometry into all attention stages. PA-DA projects 3D offsets back to image planes and leverages the projection Jacobian as an additive bias to suppress unreliable observations. Features are then integrated through a view-quality semantic gate for cross-view consistency. To optimize both efficiency and performance, we employ a sequential coarse-to-fine decoder with gated fusion, where low-resolution features are refined into higher resolutions, allocating computation by information density while substantially reducing decoder cost. Extensive evaluations demonstrate the effectiveness and accuracy of our approach. On SurroundOcc-nuScenes, VGGT-Occ achieves 33.00% IoU and 21.08% mIoU (T=1), and 33.64% IoU and 21.43% mIoU with T=2 inference, outperforming existing methods, with only ~41M trainable parameters in the occupancy head. Code will be released publicly.
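
Illustrative sketch (not from the paper): the abstract mentions projecting 3D offsets back to the image plane and using the projection Jacobian. Assuming a plain pinhole camera with intrinsics fx, fy, cx, cy, the Jacobian and the first-order pixel displacement of a small 3D offset could be computed as below. How PA-DA turns the Jacobian into an attention bias is not stated in the abstract and is not modeled here; all function names are hypothetical.

```python
def project(point, fx, fy, cx, cy):
    """Pinhole projection of a camera-frame 3D point (X, Y, Z) to pixels (u, v)."""
    X, Y, Z = point
    return (fx * X / Z + cx, fy * Y / Z + cy)


def projection_jacobian(point, fx, fy):
    """2x3 Jacobian d(u, v) / d(X, Y, Z) of the pinhole projection."""
    X, Y, Z = point
    return [
        [fx / Z, 0.0, -fx * X / (Z * Z)],
        [0.0, fy / Z, -fy * Y / (Z * Z)],
    ]


def project_offset(point, offset, fx, fy):
    """First-order image-plane displacement caused by a small 3D offset."""
    J = projection_jacobian(point, fx, fy)
    return tuple(sum(J[r][c] * offset[c] for c in range(3)) for r in range(2))


# Toy values (assumed): a point 10 m ahead of the camera, shifted 1 cm along X.
u, v = project((1.0, 2.0, 10.0), 500.0, 500.0, 320.0, 240.0)
du, dv = project_offset((1.0, 2.0, 10.0), (0.01, 0.0, 0.0), 500.0, 500.0)
```

The same 3D offset produces a larger pixel displacement for nearby points than for distant ones, which is the kind of geometric signal a projection-aware attention stage could exploit.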

2605.16909 2026-05-19 cs.AI

TOBench: A Task-Oriented Omni-Modal Benchmark for Real-World Tool-Using Agents

Zhiqiang Liu, Wenhui Dong, Yilang Tan, Yuwen Qu, Haochen Yin, Chenyang Si

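AI summary: This paper proposes TOBench, an omni-modal benchmark for real-world tool-using agents, built around closed-loop multimodal verification to evaluate and advance next-generation omni-modal tool-using agents.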

Comments Github: https://github.com/Pi3AI/TOBench

Abstract

Tool-using agents are increasingly expected to operate across realistic professional workflows, where they must interpret multimodal inputs, coordinate external tools, inspect intermediate artifacts, and revise their actions before producing a final result. Existing benchmarks, however, often evaluate tool use, computer use, and multimodal reasoning in isolation, leaving a gap between benchmark settings and end-to-end omni-modal tool use in the real world. To address this gap, we introduce MM-ToolBench, a benchmark and evaluation harness for task-oriented omni-modal tool use. MM-ToolBench contains 100 executable tasks from two macro task families, Customer Service and Intelligent Creation, covering 20 subcategory slices and supported by 27 MCP servers with 324 tools. The central design of MM-ToolBench is closed-loop multimodal verification: agents must execute tools, inspect rendered or transformed artifacts, and self-correct when outputs fail task-specific requirements. To make such evaluation scalable and verifiable, MM-ToolBench couples MCP-based execution with task-specific grounded evaluators and a semi-automated construction pipeline for scenario discovery, task instantiation, evaluator synthesis, and human audit. Experiments on 15 contemporary agentic models show that MM-ToolBench remains highly challenging: Claude Opus 4.6, commonly regarded as one of the strongest coding-agent models, achieves only 32.0% task success, far below the 94.0% human benchmark. We envision MM-ToolBench as a practical foundation for evaluating and advancing next-generation omni-modal tool-using agents through closed-loop multimodal verification.
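
Illustrative sketch (not from the paper): the closed-loop verification described in the abstract (execute tools, inspect the artifact, self-correct on failure) can be pictured as a bounded retry loop. The `act` and `check` callables below are hypothetical stand-ins for an agent step and a task-specific evaluator; nothing here reflects the benchmark's actual harness or MCP interfaces.

```python
from typing import Callable, Optional, Tuple


def closed_loop_run(
    act: Callable[[Optional[str]], str],
    check: Callable[[str], Optional[str]],
    max_rounds: int = 3,
) -> Tuple[bool, int]:
    """Run act -> inspect -> revise until the check passes or rounds run out.

    act(feedback) returns an artifact, optionally using the failure message
    from the previous round. check(artifact) returns None on success or a
    failure message otherwise. Returns (success, rounds_used).
    """
    feedback = None
    for round_idx in range(1, max_rounds + 1):
        artifact = act(feedback)
        feedback = check(artifact)
        if feedback is None:
            return True, round_idx
    return False, max_rounds


# Toy agent (assumed): produces a flawed artifact first, a fixed one after feedback.
def toy_act(feedback):
    return "v2" if feedback else "v1"


def toy_check(artifact):
    return None if artifact == "v2" else "requirement not met"


result = closed_loop_run(toy_act, toy_check)
```

In this toy run the first artifact fails the check and the second passes, so the loop reports success after two rounds.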

2605.16905 2026-05-19 cs.LG cs.CV

AIM: Adversarial Information Masking for Faithfulness Evaluation of Saliency Maps

Chia-Ying Hsieh, Hsin-Yuan Fang, Chun-Shu Wei

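AI summary: This paper proposes AIM, an adversarial information masking framework for evaluating both the faithfulness of saliency maps and the reliability of masking operators. By comparing degradation under different masking schemes, it reduces masking-induced bias and reveals modality-dependent differences between signed and unsigned attributions.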

Abstract

Post-hoc saliency methods are widely used to interpret deep neural networks, but their faithfulness is difficult to evaluate reliably. Existing evaluations mask features according to saliency-induced feature ordering and measure performance degradation, but this degradation can be confounded by the masking operator: zero masking may create out-of-distribution artifacts, while interpolation-based masking may preserve residual predictive information. We propose Adversarial Information Masking (AIM), a saliency-guided adversarial feature replacement framework for evaluating both saliency-map faithfulness and masking-operator reliability. AIM replaces selected features with values from an adversarial counterpart of the input and compares degradation under complementary masking orders. We assess reliability using random-attribution bias and stability of explanation-method faithfulness rankings. Experiments on image, audio, and EEG tasks suggest that AIM reduces masking-induced bias compared with zero and interpolation-based masking, while revealing modality-dependent differences between signed and unsigned attributions.
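
Illustrative sketch (not from the paper): the masking-based evaluation described above replaces features in saliency order and tracks how much the model output degrades, under two complementary orders (most salient first and least salient first). The sketch below takes the replacement values as an argument; AIM would supply an adversarial counterpart of the input, whose construction is not shown. The toy linear model and all names are assumptions.

```python
def replace_features(x, x_repl, indices):
    """Copy x, overwriting the chosen positions with values from x_repl."""
    out = list(x)
    for i in indices:
        out[i] = x_repl[i]
    return out


def degradation_curves(model, x, x_repl, saliency):
    """Output drop after replacing the k most / k least salient features."""
    order = sorted(range(len(x)), key=lambda i: saliency[i], reverse=True)
    reverse_order = order[::-1]
    base = model(x)
    most_first, least_first = [], []
    for k in range(1, len(x) + 1):
        most_first.append(base - model(replace_features(x, x_repl, order[:k])))
        least_first.append(base - model(replace_features(x, x_repl, reverse_order[:k])))
    return most_first, least_first


# Toy linear model whose weights double as a perfectly faithful saliency map.
weights = [3.0, 2.0, 1.0, 0.0]
model = lambda v: sum(w * vi for w, vi in zip(weights, v))
most_first, least_first = degradation_curves(model, [1.0] * 4, [0.0] * 4, weights)
gap = sum(most_first) - sum(least_first)
```

A faithful saliency map should yield faster degradation in the most-salient-first order, so a larger positive gap between the two curves indicates higher faithfulness under the chosen masking operator.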

2605.16903 2026-05-19 cs.CV

WOW-Seg: A Word-free Open World Segmentation Model

Danyang Li, Tianhao Wu, Bin Li, Zhenyuan Chen, Yang Zhang, Yuxuan Li, Ming-Ming Cheng, Xiang Li

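AI summary: This paper proposes WOW-Seg, a model for precise segmentation and semantic understanding of objects in open-world image segmentation. It introduces the Mask2Token module and the Cascade Attention Mask to improve performance, constructs the Region Recognition Dataset (RR-7K), and achieves strong results on the LVIS dataset.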

Comments Accepted by ICLR 2026. Code and benchmark dataset are available at https://github.com/AAwCAA/WOW-Seg-Meta

Abstract

Open world image segmentation aims to achieve precise segmentation and semantic understanding of targets within images by addressing the infinitely open set of object categories encountered in the real world. However, traditional closed-set segmentation approaches struggle to adapt to complex open world scenarios, while foundation segmentation models such as SAM exhibit notable discrepancies between their strong segmentation capabilities and relatively weaker semantic understanding. To bridge these discrepancies, we propose WOW-Seg, a Word-free Open World Segmentation model for segmenting and recognizing objects from open-set categories. Specifically, WOW-Seg introduces a novel visual prompt module, Mask2Token, which transforms image masks into visual tokens and ensures their alignment with the VLLM feature space. Moreover, we introduce the Cascade Attention Mask to decouple information across different instances. This approach mitigates inter-instance interference, leading to a significant improvement in model performance. We further construct an open world region recognition test benchmark: the Region Recognition Dataset (RR-7K). With 7,662 classes, it represents the most extensive category-rich region recognition dataset to date. WOW-Seg attains strong results on the LVIS dataset, achieving a semantic similarity of 89.7 and a semantic IoU of 82.4. This performance surpasses the previous SOTA while using only one-eighth the parameter count. These results underscore the strong open world generalization capabilities of WOW-Seg. The code and related resources are available at https://github.com/AAwcAA/WOW-Seg-Meta.

2605.16902 2026-05-19 cs.LG

ArtifactLinker: Linking Scientific Artifacts for Automatic State-of-the-Art Discovery

Haofei Yu, Jiaxuan You, Peter Clark, Bodhisattwa Prasad Majumder, Kyle Richardson

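AI summary: This paper proposes ArtifactLinker, a framework that predicts model-dataset links with graph neural networks and large language models and verifies them through coding experiments, enabling automatic discovery of state-of-the-art results.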

Comments 12 pages

Abstract

Scientific artifacts such as models and datasets are foundations for research. With the rapid growth of platforms like HuggingFace, researchers now have access to a large number of artifacts. Yet, a key challenge remains: how can we automatically discover the state-of-the-art (SOTA) model for a given dataset by fully leveraging existing artifacts? We formalize this task as automatic SOTA discovery by modeling HuggingFace as an artifact graph, where nodes are models/datasets and edges represent evaluations. We propose ArtifactLinker, a two-stage framework: (1) ranking promising unobserved model-dataset links using Graph Neural Networks (GNNs) or graph-augmented Large Language Models (LLMs), and (2) verifying top-ranked links via coding experiments with LLM-based agents. We further introduce a benchmark named ArtifactBench with 14,053 artifacts and 51,337 relations to evaluate the performance of both stages. Results show that (1) graph structures between existing artifacts are effective for missing link prediction; (2) end-to-end ranking and verification with ArtifactLinker help discover potential SOTA results and research insights.
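
Illustrative sketch (not from the paper): the ranking stage scores unobserved model-dataset links in an artifact graph. The paper uses GNNs or graph-augmented LLMs; the sketch below swaps in a much simpler path-count heuristic purely to show the shape of the task. A candidate link (m, d) scores higher when models already evaluated on d share many evaluated datasets with m. The toy evaluations and names are assumptions.

```python
from collections import defaultdict


def rank_missing_links(evaluations):
    """Rank unobserved (model, dataset) pairs by a 3-hop path count."""
    model_to_data = defaultdict(set)
    data_to_model = defaultdict(set)
    for model, dataset in evaluations:
        model_to_data[model].add(dataset)
        data_to_model[dataset].add(model)

    scores = {}
    for model in model_to_data:
        for dataset in data_to_model:
            if dataset in model_to_data[model]:
                continue  # link already observed
            # Count paths model -> shared dataset -> other model -> dataset.
            scores[(model, dataset)] = sum(
                len(model_to_data[model] & model_to_data[other])
                for other in data_to_model[dataset]
            )
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


toy_evaluations = [
    ("A", "x"), ("A", "y"),
    ("B", "x"), ("B", "y"), ("B", "z"),
    ("C", "z"),
]
ranking = rank_missing_links(toy_evaluations)
```

In the toy graph, model A has never been evaluated on dataset z but overlaps heavily with model B, which has, so (A, z) ranks first and would be the first candidate handed to a verification stage.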

2605.16901 2026-05-19 cs.CV

CAR-SAM: Cross-Attention Reconstruction for Post-Training Quantization of the Segment Anything Model

Houji Wen, Jiangyong Yu, Jun Li, Dawei Yang

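AI summary: This paper proposes CAR-SAM, a unified quantization framework for the Segment Anything Model. It introduces a MatMul-Aware Compensation mechanism and a Joint Cross-Attention Reconstruction strategy to address attention dissipation and reconstruction oscillation in post-training quantization, enabling efficient quantization at 4-bit precision.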

Abstract

Segment Anything Models (SAMs) are extensively used in computer vision for universal image segmentation, but deploying them on resource-constrained devices is challenging due to their high computational and memory demands. Post-Training Quantization (PTQ) is a widely used technique for model compression and acceleration. However, existing PTQ methods fail to consider the cross-attention architecture in the SAM decoder. This degradation primarily stems from the unique challenges posed by SAMs: (1) Attention dissipation, where the attention information in the decoder, which is crucial for representing segmentation masks, collapses into a diffuse and non-semantic form under low-bit quantization; and (2) Reconstruction oscillation, where bidirectional coupling within the two-way transformer introduces cross-branch error interference and destabilizes convergence. To tackle these issues, we propose CAR-SAM, a unified quantization framework tailored for SAMs. First, to mitigate attention dissipation, we introduce a MatMul-Aware Compensation (MAC) mechanism that transfers activation-induced quantization errors from MatMul to preceding linear weights. Second, to mitigate oscillation in decoder optimization, we develop a Joint Cross-Attention Reconstruction (JCAR) strategy that jointly reconstructs coupled attention branches, suppressing oscillatory behavior and promoting stable convergence. Extensive experiments show that CAR-SAM robustly quantizes SAM models down to 4-bit precision, surpassing existing methods by 14.6% and 6.6% mAP on SAM-B and SAM-L, respectively.
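
Illustrative sketch (not from the paper): for readers unfamiliar with low-bit PTQ, the snippet below shows a generic symmetric uniform quantizer at 4 bits. It is background only and does not implement MAC or JCAR; the function name and toy values are assumptions.

```python
def quantize_symmetric(values, bits=4):
    """Symmetric uniform quantization: map floats to signed integer codes."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit codes in [-7, 7]
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest code and clamp to the representable range.
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    dequantized = [c * scale for c in codes]
    return codes, scale, dequantized


weights = [0.7, -0.3, 0.12, 0.0]
codes, scale, dequantized = quantize_symmetric(weights, bits=4)
max_error = max(abs(w - d) for w, d in zip(weights, dequantized))
```

With only 15 usable levels, the rounding error per value can reach half a quantization step, which is why low-bit settings are sensitive to how activation and weight errors interact.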

2605.16899 2026-05-19 cs.CV

LASAR: Towards Spatio-temporal Reasoning with Latent Cognitive Map

Jinzhou Tang, Sidi Liu, Waikit Xiu, Weixing Chen, Keze Wang

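AI summary: This paper proposes LASAR, an architecture whose dual-memory system maintains episodic experiences and a semantic cognitive map, and introduces the contrastive objective ST-CRL to train it, improving the encoding of fine-grained spatial relationships over long-range, fragmented experiences. It achieves 2%-3.5% gains in zero-shot generalization on the standard VLN-CE and VSI-Bench benchmarks.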

Abstract

A fundamental challenge in embodied AI is verifying if agents build internal models of spatial structure or merely learn to mimic task-specific expert trajectories. This is critical as foundational approaches rooted in action-centric tasks (e.g., VLN) and reasoning-centric tasks (e.g., EQA) often share a common limitation: they lack a learning signal that forces them to encode fine-grained spatial relationships (like topology or distance) over long-range, fragmented experiences. To address this, we first propose LASAR, an architecture featuring a dual-memory system designed to maintain both episodic experiences and a semantic cognitive map. We then introduce Spatio-temporal Contextual Representation Learning (ST-CRL), a contrastive objective designed to train this architecture. ST-CRL leverages spatio-temporal cues from cognitive queries generated through annotated spatio-temporal context in simulation to build sample pairs, thereby forming the internal cognitive map from the agent's experiences. Experiments demonstrate that our method achieves 2%-3.5% gains in zero-shot generalization on both the standard VLN-CE and VSI-Bench benchmarks. We also demonstrate that our proposed cognitive map has high self-consistency.
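
Illustrative sketch (not from the paper): the abstract describes ST-CRL only as a contrastive objective over sample pairs. A common form of such an objective is InfoNCE, shown below as an assumption; the actual ST-CRL loss, its pair construction, and its temperature are not given in the abstract.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss: negative log-softmax of the positive pair's similarity."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    peak = max(logits)  # subtract the max for numerical stability
    log_sum = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_sum - logits[0]


# Toy embeddings (assumed): the loss is low when the positive matches the anchor.
aligned = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
misaligned = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0], [-1.0, 0.0]])
```

Minimizing such a loss pulls embeddings of related spatio-temporal contexts together and pushes unrelated ones apart, which is the general mechanism by which a contrastive signal could shape an internal map.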

2605.16896 2026-05-19 cs.CL

JSPG: Dynamic Dictionary Filtering via Joint Semantic-Pinyin-Glyph Retrieval for Chinese Contextual ASR

Shilin Zhou, Zhenghua Li

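AI summary: To address the noise introduced by large-scale keyword dictionaries in Chinese contextual ASR, this paper proposes JSPG, a framework that performs joint retrieval over semantic, pinyin, and glyph features and effectively improves keyword recognition accuracy.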

Abstract

Contextual Automatic Speech Recognition (ASR) faces challenges with large-scale keyword dictionaries, as excessive irrelevant candidates introduce noise that degrades accuracy. To address this, dynamic filtering typically uses a base ASR model to generate preliminary hypotheses, followed by semantic text retrievers to fetch a concise subset of relevant keywords. However, this approach frequently fails in Chinese ASR. Base models often produce homophonic or near-homophonic errors that preserve the phonetic cues of the target keywords but severely distort their semantic meaning, rendering standard semantic retrievers ineffective. To resolve this, we propose a filtering framework that jointly integrates Semantic, Pinyin, and Glyph features (JSPG). Pinyin effectively retrieves targets based on phonetic similarity, while glyph provides complementary structural cues to filter out numerous irrelevant homophones inherent in Chinese. To bridge the gap between character-level pinyin/glyph metrics and sequence-level filtering, we introduce an extended Smith-Waterman algorithm that computes similarity scores between the N-best hypothesis sequences and keywords. Experiments on the Aishell-1 and RWCS-NER datasets demonstrate that JSPG significantly outperforms single-feature baselines. Furthermore, downstream contextual ASR models guided by JSPG achieve substantial improvements in keyword recognition accuracy.
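
Illustrative sketch (not from the paper): the abstract builds on the Smith-Waterman local alignment algorithm to score keywords against hypothesis sequences. The snippet below is the standard algorithm with a pluggable token similarity, applied to hand-written pinyin syllables. The paper's extension, its pinyin and glyph metrics, and the scoring constants used here (match 2, mismatch -1, gap -1) are not specified in the abstract and are assumptions.

```python
def smith_waterman(seq_a, seq_b, similarity, gap=-1.0):
    """Best local alignment score between two token sequences."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    H = [[0.0] * cols for _ in range(rows)]
    best = 0.0
    for i in range(1, rows):
        for j in range(1, cols):
            H[i][j] = max(
                0.0,  # restart the alignment
                H[i - 1][j - 1] + similarity(seq_a[i - 1], seq_b[j - 1]),
                H[i - 1][j] + gap,  # skip a token of seq_a
                H[i][j - 1] + gap,  # skip a token of seq_b
            )
            best = max(best, H[i][j])
    return best


def pinyin_similarity(a, b):
    """Toy syllable similarity: exact match or mismatch."""
    return 2.0 if a == b else -1.0


# Hand-written pinyin for a hypothesis and two candidate keywords.
hypothesis = ["wo", "qu", "bei", "jing", "da", "xue"]
keyword_scores = {
    "bei jing da xue": smith_waterman(hypothesis, ["bei", "jing", "da", "xue"], pinyin_similarity),
    "nan jing": smith_waterman(hypothesis, ["nan", "jing"], pinyin_similarity),
}
```

Because the alignment is local, a keyword embedded anywhere in the hypothesis scores highly regardless of the surrounding words, which suits filtering a large dictionary against full-sentence hypotheses.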

2605.16894 2026-05-19 cs.RO cs.SY eess.SY

Beyond Safety Filtering: Control Barrier Function-Informed Reinforcement Learning for Connected and Automated Vehicles

Jianye Xu, Bassam Alrifaee

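AI summary: This paper proposes a Control Barrier Function-informed reward design for multi-agent reinforcement learning that converts CBF constraint values under joint MARL actions into a reward signal to explicitly guide safe learning. Experiments in a four-way multi-lane intersection show that it outperforms conventional heuristic reward designs in task performance and in robustness to reward hyperparameters.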

Comments This paper has been accepted for publication in the Proceedings of the 2026 IEEE International Conference on Intelligent Transportation Systems (ITSC 2026)

Abstract

Reinforcement Learning (RL) uses rewards to guide learning, yet reward design is typically hand-crafted using heuristics that can be difficult to tune. We propose a Control Barrier Function (CBF)-informed reward design for Multi-Agent RL (MARL) that converts CBF constraint values under joint MARL actions into a reward signal that explicitly guides safe learning. We compare against two heuristic reward baselines in a four-way multi-lane intersection with connected and automated vehicles. Results show that our method achieves the highest task performance and is less sensitive to reward hyperparameters, yielding consistently strong performance across the tested hyperparameter range. Code for reproducing the experimental results and a video demonstration are available at https://github.com/bassamlab/SigmaRL.
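
Illustrative sketch (not from the paper): a CBF condition is usually written as h_dot + alpha * h >= 0 for a barrier function h that is non-negative on the safe set. The snippet below evaluates that constraint value for a one-dimensional car-following toy problem and maps it to a reward. The single-agent setup, the barrier h = gap - d_min, and the penalty-only mapping are all assumptions; the paper's multi-agent formulation and its exact reward mapping are not given in the abstract.

```python
def cbf_constraint_value(gap, d_min, v_lead, v_ego, alpha=0.5):
    """Value of h_dot + alpha * h for the barrier h = gap - d_min.

    The gap changes at rate v_lead - v_ego, so h_dot = v_lead - v_ego.
    A non-negative value means the CBF condition holds for this action.
    """
    h = gap - d_min
    h_dot = v_lead - v_ego
    return h_dot + alpha * h


def cbf_reward(constraint_value, weight=1.0):
    """Hypothetical mapping: zero when the constraint holds, otherwise a
    penalty proportional to the size of the violation."""
    return weight * min(0.0, constraint_value)


# Toy scenarios (assumed): a comfortable gap versus closing fast on a short gap.
safe_reward = cbf_reward(cbf_constraint_value(gap=10.0, d_min=2.0, v_lead=5.0, v_ego=6.0))
unsafe_reward = cbf_reward(cbf_constraint_value(gap=3.0, d_min=2.0, v_lead=5.0, v_ego=8.0))
```

Unlike a sparse collision penalty, a reward derived from the constraint value responds before a collision occurs, since the value already turns negative when the agent approaches the boundary of the safe set too quickly.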

2605.16893 2026-05-19 cs.AI

NGM: A Plug-and-Play Training-Free Memory Module for LLMs

Yuwen Qu, Wenhui Dong, Chenyang Si, Caifeng Shan

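AI summary: This paper proposes NGM, a training-free, plug-and-play memory module that uses a causal N-gram encoder and a cosine-gated memory injector to enable efficient knowledge retrieval, improving large language model performance on code generation and knowledge-intensive tasks.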

Comments Code is available at https://github.com/PioneerQyw/NGM

Abstract

Recent studies introduce conditional memory modules that decouple knowledge storage from neural computation, enabling more direct knowledge access. Compared to MoE, which relies on dynamic computation paths, explicit lookup provides a more efficient knowledge retrieval mechanism. However, these approaches still depend on learned memory embeddings, requiring additional training and limiting flexibility. To address this, we propose N-gram Memory (NGM), a training-free, plug-and-play module composed of a Causal N-Gram Encoder and a Cosine-Gated Memory Injector. The Causal N-Gram Encoder directly averages the pretrained token embeddings of the backbone model to construct N-gram representations, thereby eliminating the need to train separate N-gram embeddings from scratch. This design requires neither an additional memory table nor a retrieval pipeline. The Cosine-Gated Memory Injector then uses a non-parametric cosine gate with ReLU to modulate the retrieved embeddings into the contextual representations. We evaluate NGM on the Qwen3 series from 0.6B to 14B across eight benchmarks. NGM improves average performance by 0.5 to 1.2 points, with particularly clear gains on code generation and knowledge-intensive tasks (e.g., +3.0 on LiveCodeBench and +3.03 on GPQA for Qwen3-14B). Moreover, NGM also improves performance in multimodal benchmarks (e.g., MMStar +1.53 on Qwen3-VL-2B).
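
Illustrative sketch (not from the paper): the two components described above can be pictured with plain lists. The encoder averages the embeddings of the current token and the preceding tokens in its window, and the injector scales the resulting vector by a ReLU-clipped cosine similarity before adding it to the hidden state. The residual-add form, the window handling at the sequence start, and all names are assumptions; the released code may differ.

```python
import math


def causal_ngram_embeddings(token_embeddings, n=2):
    """Average each token's embedding with up to n-1 preceding tokens."""
    out = []
    for t in range(len(token_embeddings)):
        window = token_embeddings[max(0, t - n + 1): t + 1]
        dim = len(window[0])
        out.append([sum(vec[d] for vec in window) / len(window) for d in range(dim)])
    return out


def cosine_gate_inject(hidden, memory):
    """Add memory to hidden, scaled by ReLU(cosine(hidden, memory))."""
    dot = sum(h * m for h, m in zip(hidden, memory))
    norm = math.sqrt(sum(h * h for h in hidden)) * math.sqrt(sum(m * m for m in memory))
    gate = max(0.0, dot / norm) if norm > 0 else 0.0
    return [h + gate * m for h, m in zip(hidden, memory)]


# Toy 2-dimensional embeddings (assumed).
ngrams = causal_ngram_embeddings([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], n=2)
blocked = cosine_gate_inject([1.0, 0.0], [-1.0, 0.0])  # opposing memory is gated out
passed = cosine_gate_inject([1.0, 0.0], [2.0, 0.0])    # aligned memory passes fully
```

Both steps use only existing embeddings and a parameter-free gate, which is consistent with the training-free claim in the abstract.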

2605.16892 2026-05-19 cs.CV cs.AI cs.CL

DriveSafe: A Framework for Risk Detection and Safety Suggestions in Driving Scenarios

Sainithin Artham, Shankar Gangisetty, Avijit Dasgupta, C. V. Jawahar

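AI summary: This paper proposes DriveSafe, a framework that achieves risk-aware scene understanding through structured natural language descriptions. It generates spatially grounded captions enriched with multimodal context for downstream risk assessment and safety suggestions, and experiments show state-of-the-art performance on the DRAMA benchmark.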

Comments 8 pages

Abstract

Comprehensive situational awareness is essential for autonomous vehicles operating in safety-critical environments, as it enables the identification and mitigation of potential risks. Although recent Multimodal Large Language Models (MLLMs) have shown promise on general vision-language tasks, our findings indicate that zero-shot MLLMs still underperform compared to domain-specific methods in fine-grained, spatially grounded risk assessment. To address this gap, we propose DriveSafe, a framework for risk-aware scene understanding that leverages structured natural language descriptions. Specifically, our method first generates spatially grounded captions enriched with multimodal context, including motion, spatial, and depth cues. These captions are then used for downstream risk assessment, explicitly identifying hazardous objects, their locations, and the unsafe behaviors they imply, followed by actionable safety suggestions. To further improve performance, we employ caption-risk pairings to fine-tune a lightweight adapter module, efficiently injecting domain-specific knowledge into the base LLM. By conditioning risk assessment on explicit language-based scene representations, DriveSafe achieves significant gains over both zero-shot MLLMs and prior domain-specific baselines. Exhaustive experiments on the DRAMA benchmark demonstrate state-of-the-art performance, while ablation studies validate the effectiveness of our key design choices. Project page: https://cvit.iiit.ac.in/research/projects/cvit-projects/drivesafe

2605.16891 2026-05-19 cs.LG

Tensor Channel Equivariant Graph Neural Networks for Molecular Polarizability Prediction

Jean Philip Filling, Daniel Franzen, Michael Wand

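AI summary: This paper proposes a tensor-channel equivariant graph neural network for direct prediction of molecular polarizability tensors. By extending the PaiNN architecture to propagate tensor structure during message passing, it achieves better performance on molecular polarizability prediction.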

Abstract

We introduce a tensor-channel equivariant graph neural network for direct prediction of molecular polarizability tensors. Building on the efficient PaiNN architecture, we augment the hidden representation with explicit symmetric rank-2 tensor channels aligned with the decomposition of polarizability into isotropic and anisotropic components. In contrast to approaches that construct tensor outputs only at readout, our model propagates tensor structure throughout message passing using geometrically motivated tensor bases. This yields a target-aligned architecture for tensor-valued molecular prediction. On optimized QM7-X geometries, the proposed model achieves lower full-tensor and anisotropic error than both a PaiNN-style readout baseline and a dielectric MACE baseline under matched training conditions and at nearly identical parameter count. In this controlled setting, it also outperforms MACE while remaining substantially faster at inference. Ablation studies show that the gain does not arise from increased capacity alone, but from the combination of explicit tensor propagation and a traceless target parameterization matched to the anisotropic part of the polarizability tensor. Among the tensor bases considered, the strongest results are obtained from interactions between learned directional features, indicating that these are particularly effective for modeling molecular polarizability. Rotational equivariance tests further confirm that all compared models are numerically equivariant, so the observed improvements are attributable to better learning of the target tensor itself. Overall, our results show that for structured tensor-valued targets, propagating target-aligned tensor features can outperform both readout-only tensor construction and a more general higher-order equivariant model in the present training setting.
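
Illustrative sketch (not from the paper): the decomposition of a polarizability tensor into isotropic and anisotropic components mentioned above is the standard split of a symmetric 3x3 tensor into a multiple of the identity (one third of the trace) and a traceless remainder. The function name and toy tensor are assumptions; the paper's parameterization of the traceless part is not detailed in the abstract.

```python
def decompose_polarizability(alpha):
    """Split a symmetric 3x3 tensor into its isotropic scalar and traceless part."""
    iso = (alpha[0][0] + alpha[1][1] + alpha[2][2]) / 3.0
    aniso = [
        [alpha[i][j] - (iso if i == j else 0.0) for j in range(3)]
        for i in range(3)
    ]
    return iso, aniso


# Toy symmetric tensor (assumed values, arbitrary units).
alpha = [
    [4.0, 1.0, 0.0],
    [1.0, 2.0, 0.0],
    [0.0, 0.0, 3.0],
]
iso, aniso = decompose_polarizability(alpha)
aniso_trace = aniso[0][0] + aniso[1][1] + aniso[2][2]
```

The isotropic scalar is unchanged by rotations while the traceless part rotates as a rank-2 tensor, so predicting the two components through separate channels keeps each target aligned with its symmetry.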

2605.16889 2026-05-19 cs.CV

Controlling Decision Drift in Multimodal Sentiment Analysis with Missing Modalities

Chenglizhao Chen, Yuchen Cao, Xinyu Liu, Mengke Song, Guisheng Zhang, Xiaomin Yu

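AI summary: This paper proposes a two-level reference alignment framework to address decision drift in multimodal sentiment analysis caused by missing modalities and quality imbalance. It improves robustness through stable references; experiments show the method is effective across different missing-modality settings and achieves state-of-the-art performance under full-modality input.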

Comments Accepted by IJCAI 2026

Abstract

Multimodal sentiment analysis relies on textual, acoustic, and visual signals, yet real-world data often suffer from missing modalities and quality imbalance. Existing methods generate features for missing modalities from the available ones, but differences in expression mechanisms and sentiment dynamics across modalities may cause the generated features to deviate from the true distributions and mislead prediction. In addition, unreliable modalities may dominate fusion, resulting in representation shift across modality combinations and unstable sentiment representations. To address these challenges, we propose a two-level reference alignment framework. The framework introduces stable references at the feature representation and sentiment decision levels to improve robustness when modalities are missing. First-level reference alignment leverages complete-modality samples to constrain representations and align different modality combinations into a shared sentiment space. Second-level reference alignment enforces cross-modal consistency at the decision level by suppressing unreliable modalities through prototype retrieval and voting. As a result, the framework maintains stable and reliable sentiment predictions under diverse missing-modality patterns. Experiments on CMU-MOSI and CMU-MOSEI show consistent improvements across various missing-modality settings. Under full-modality input, the proposed method achieves state-of-the-art performance, with ACC of 86.28% and 85.88%, and F1 of 86.24% and 85.86%.

2605.16887 2026-05-19 cs.CV cs.LG

Mind the Gap: Learning Modality-Agnostic Representations with a Cross-Modality UNet

Xin Niu, Enyi Li, Jinchao Liu, Yan Wang, Margarita Osadchy, Yongchun Fang

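AI summary: This paper proposes a compact encoder-decoder neural module (cmUNet) that learns modality-agnostic representations through cross-modality transformation and in-modality reconstruction while retaining identity-related information. The authors also propose MarrNet, which connects cmUNet to a standard feature extraction network for cross-modality matching, and validate its superior performance on several challenging tasks.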

Comments Published in IEEE Transactions on Image Processing. See full abstract in the PDF file

Journal ref IEEE Transactions on Image Processing, vol. 33, pp. 655-670, 2024

英文摘要

Cross-modality recognition has many important applications in science, law enforcement and entertainment. Popular methods to bridge the modality gap include reducing the distributional differences between representations of different modalities, learning indistinguishable representations, and explicit modality transfer. The first two approaches lose discriminant information while removing modality-specific variations. The third relies heavily on successful modality transfer and can suffer a catastrophic performance drop when explicit modality transfer is impossible or difficult. To tackle this problem, we propose a compact encoder-decoder neural module (cmUNet) that learns modality-agnostic representations while retaining identity-related information. This is achieved through cross-modality transformation and in-modality reconstruction, enhanced by an adversarial/perceptual loss that encourages indistinguishability of representations in the original sample space. For cross-modality matching, we propose MarrNet, in which cmUNet is connected to a standard feature extraction network that takes the modality-agnostic representations as input and outputs similarity scores for matching. We validate our method on five challenging tasks, namely Raman-infrared spectrum matching, cross-modality person re-identification and heterogeneous (photo-sketch, visible-near infrared and visible-thermal) face recognition, where MarrNet shows superior performance compared to state-of-the-art methods. Furthermore, we observe that a cross-modality matching method can be biased to extract discriminant information from partial or even wrong regions because it cannot deal with the modality gap, which subsequently leads to poor generalization. We show that robustness to occlusions can indicate how well a method bridges the modality gap.

2605.16883 2026-05-19 cs.LG

SE-GA: Memory-Augmented Self-Evolution for GUI Agents

SE-GA:基于记忆的自进化GUI代理

Shilong Jin, Lanjun Wang, Zhuosheng Zhang

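AI summary This paper proposes SE-GA, a framework that integrates hierarchical memory structures with an iterative self-improvement mechanism to address the difficulty GUI agents have with multi-step tasks, which stems from constrained context windows and static policies that cannot adapt to dynamic environments. Experiments show state-of-the-art performance on multiple benchmarks.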

Comments Accepted by ICML 2026

详情
AI中文摘要

自主图形用户界面(GUI)代理在多步骤任务中常因上下文窗口受限和静态策略无法适应动态环境而遇到困难。为解决这些限制,本文提出了自进化GUI代理(SE-GA),一种新颖的框架,整合了分层记忆结构和迭代自我改进机制。我们的方法核心是测试时间记忆扩展(TTME),通过动态检索事件性、语义性和经验性记忆,在推理过程中提供显著的上下文。为确保持续学习,我们引入了记忆增强自进化(MASE),这是一种训练流程,采用TTME收集的数据来稳定和增强代理的基础策略。在离线和在线基准测试中的广泛评估表明,SE-GA在ScreenSpot上达到89.0%的成功率,在具有挑战性的AndroidControl-High数据集上达到75.8%的成功率。此外,对AndroidWorld基准测试的显著改进突显了其在动态环境中的优越泛化能力。开源代码:https://github.com/jinshilong-dev/SE-GA

英文摘要

Autonomous Graphical User Interface (GUI) agents often struggle with multi-step tasks due to constrained context windows and static policies that fail to adapt to dynamic environments. To address these limitations, this work proposes the Self-Evolving GUI Agent (SE-GA), a novel framework that integrates hierarchical memory structures with an iterative self-improvement mechanism. At the core of our approach is Test-Time Memory Extension (TTME), which facilitates long-term planning by dynamically retrieving episodic, semantic, and experiential memories to provide salient contexts during inference. To ensure continuous learning, we introduce Memory-Augmented Self-Evolution (MASE), a training pipeline that uses the data collected by TTME to stabilize and enhance the agent's foundational policy. Extensive evaluations across both offline and online benchmarks demonstrate that SE-GA achieves state-of-the-art performance, reaching success rates of 89.0% on ScreenSpot and 75.8% on the challenging AndroidControl-High dataset. Furthermore, significant improvements on the AndroidWorld benchmark highlight its superior generalization to dynamic environments. Open source code: https://github.com/jinshilong-dev/SE-GA

2605.16882 2026-05-19 cs.CL

E-PMQ: Expert-Guided Post-Merge Quantization with Merged-Weight Anchoring

E-PMQ: 专家引导的后合并量化与合并权重锚定

Wenjun Wang, Yanggan Gu, Shuo Cai, Yuanyi Wang, Pengkai Wang, Jianmin Wu, Hongxia Yang

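AI summary This paper proposes E-PMQ, which uses expert-guided post-merge quantization with merged-weight anchoring to address the coupling of quantization deviation and merging deviation in merged models, improving low-bit deployment performance.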

详情
AI中文摘要

低资源部署约束使得模型量化成为部署神经网络以保持性能的关键。同时,模型合并已成为一种日益实用的低资源策略,用于将多个任务或领域专门化的专家整合到一个模型中,而无需联合训练或多模型服务。共同,量化和模型合并能够通过将多个专家整合到一个低比特模型中,实现高效的低资源部署流程。我们把这个设置称为后合并量化(PMQ)。我们证明直接对合并模型应用后训练量化(PTQ)是不可靠的,因为两种不同的偏差耦合:低比特重建引入的量化偏差和来自模型合并的专家相对合并偏差。为了减轻这些偏差,我们提出E-PMQ,一个专家引导的PMQ框架,利用源专家权重在层间校准期间提供专家引导的输出目标,以及合并权重锚定来稳定校准并保持合并模型的行为。在CLIP-ViT-B/32八任务合并中,E-PMQ将4位GPTQ在任务算术下的性能从65.0%提升到73.6%,在TIES-合并下从69.1%提升到74.8%。在更困难的设置中,E-PMQ在20任务CLIP-ViT-L/14上将GPTQ从34.8%提升到76.7%,在FLAN-T5-base GLUE上从78.26%提升到83.34%。这些结果表明,E-PMQ能够实现有效的后合并量化和低比特部署。

英文摘要

Low-resource deployment constraints have made model quantization essential for deploying neural networks while preserving performance. Meanwhile, model merging has become an increasingly practical low-resource strategy for integrating multiple task- or domain-specialized experts into a single model without joint training or multi-model serving. Together, quantization and model merging enable an efficient low-resource deployment pipeline by integrating multiple experts into one low-bit model. We formulate this setting as Post-Merge Quantization (PMQ). We show that directly applying post-training quantization (PTQ) to a merged model is unreliable because two distinct deviations are coupled: the quantization deviation introduced by low-bit reconstruction and the expert-relative merging deviation inherited from model merging. To mitigate these deviations, we propose E-PMQ, an expert-guided PMQ framework that uses source expert weights to provide expert-guided output targets during layer-wise calibration, together with merged-weight anchoring to stabilize the calibration and preserve the integrated behavior of the merged model. On CLIP-ViT-B/32 eight-task merging, E-PMQ improves 4-bit GPTQ from 65.0% to 73.6% under Task Arithmetic and from 69.1% to 74.8% under TIES-Merging. On harder settings, E-PMQ improves GPTQ from 34.8% to 76.7% on 20-task CLIP-ViT-L/14 and from 78.26% to 83.34% on FLAN-T5-base GLUE. These results demonstrate that E-PMQ enables effective post-merge quantization and low-bit deployment.

2605.16881 2026-05-19 cs.CL

PaliBench: A Multi-Reference Blueprint for Classical Language Translation Benchmarks

PaliBench: 一种多参考框架用于经典语言翻译基准测试

Máté Metzger, Nadnapang Phophichit

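AI summary This paper presents PaliBench, a benchmark for Pali-to-English translation together with a method for constructing multi-reference translation benchmarks. By combining LLM-assisted alignment, automated verification, quality filtering, deduplication, and multi-metric evaluation, it produces a benchmark of 1,700 passages, 8,389 segments, and approximately 345,000 tokens, which is used to evaluate ten contemporary large language models.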

Comments Preprint. This manuscript has not yet been peer reviewed

详情
AI中文摘要

数字人文项目越来越多地依赖机器翻译和大语言模型来扩大对古典、宗教及其他未翻译文本传统的访问。然而,标准翻译基准测试并不适合此类材料:它们通常将系统输出与单一参考翻译进行比较,尽管古典文本往往支持多种忠实的翻译,这些翻译在术语、语体和解释上有所不同。本文介绍了PaliBench,既是一种巴利语到英语翻译的基准测试,也是一种可重用的方法,用于构建多参考翻译基准测试。Pali案例研究基于与Bhikkhu Sujato、Bhikkhu Thanissaro和Bhikkhu Bodhi独立英文翻译对齐的Sutta Pitaka段落。工作流程结合了LLM辅助对齐独立分段的翻译、自动化验证源文件、段落级质量过滤、去重公式性重复以及多指标评估多个人类参考。生成的基准测试包含1700段文本,覆盖8389个段落和约345,000个token。我们使用它评估了十个现代大语言模型,发现系统排名在不同指标之间有强一致性,同时可靠性及语义异常率有显著差异。更广泛贡献是方法论:PaliBench展示了如何将现有学术翻译转化为解释性文本传统的评估基础设施,而不将任何单一翻译视为最终结论。尽管是为巴利佛教文本开发的,该方法也可用于其他古典语料库,其中存在足够的独立参考翻译。

英文摘要

Digital humanities projects increasingly rely on machine translation and large language models to widen access to classical, religious, and otherwise under-translated textual traditions. Yet standard translation benchmarks are poorly suited to such materials: they typically compare a system output against a single reference translation, even though classical texts often support multiple faithful renderings that differ in terminology, register, and interpretation. This article introduces PaliBench, both a benchmark for Pali-to-English translation and a reusable method for constructing multi-reference translation benchmarks for classical languages. The Pali case study draws on passages from the Sutta Pitaka aligned with independent English translations by Bhikkhu Sujato, Bhikkhu Thanissaro, and Bhikkhu Bodhi. The workflow combines LLM-assisted alignment of independently segmented translations, automated verification against source files, passage-level quality filtering, deduplication of formulaic repetitions, and multi-metric evaluation against multiple human references. The resulting benchmark contains 1,700 passages spanning 8,389 segments and approximately 345,000 tokens. We use it to evaluate ten contemporary large language models with complementary metrics, finding strong cross-metric concordance in system rankings alongside substantial variation in reliability and semantic outlier rates. The broader contribution is methodological: PaliBench shows how existing scholarly translations can be transformed into evaluation infrastructure for interpretive textual traditions without treating any single translation as definitive. Although developed for Pali Buddhist texts, the approach could be portable to other classical corpora where sufficient independent reference translations exist.

2605.16880 2026-05-19 cs.AI

Virtual Nodes Guided Dynamic Graph Neural Network for Brain Tumor Segmentation with Missing Modalities

虚拟节点引导的动态图神经网络用于缺失模态的脑肿瘤分割

Sha Tao, Jiao Pan, Yu Guo, Chao Yao

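AI summary This paper proposes a graph-based one-stage framework that introduces modality-specific virtual nodes to compensate for missing modalities and exploits the flexibility of graph networks to design a dynamic connection strategy, improving robustness to arbitrary modality combinations. Experiments on the BRATS-2018 and BRATS-2020 datasets confirm its superior performance under incomplete modalities.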

Comments The IEEE/CVF Conference on Computer Vision and Pattern Recognition 2026

详情
AI中文摘要

多模态磁共振成像(MRI)对于脑肿瘤分割至关重要,许多方法利用其四种关键模态来捕捉互补信息以实现有效的子区域分析。然而,实践中缺少几种模态的情况非常常见,导致现有全模态分割方法性能严重下降。受限于结构化数据模型,近期工作常采用多阶段训练策略处理全模态和缺失模态场景,这增加了训练成本且无法充分解决缺失模态带来的干扰。在本文中,我们提出了一种基于图的单阶段框架,用于鲁棒的脑肿瘤分割。具体而言,我们引入了模态特定的虚拟节点,作为补充信息源以补偿缺失模态。为了增强模型对任意模态组合的鲁棒性,我们利用图网络的内在灵活性设计了动态连接策略。该机制根据模态可用性动态调整邻接矩阵,在保留有益信息流的同时减轻缺失模态引起的干扰效应。此外,我们通过异质权重矩阵增强了图网络,使其更适应多模态场景。在BRATS-2018和BRATS-2020数据集上的大量实验表明,我们的方法在几乎所有不完整模态的子集上均优于现有最先进方法。

英文摘要

Multimodal magnetic resonance imaging (MRI) is crucial for brain tumor segmentation, with many methods leveraging its four key modalities to capture complementary information for effective sub-region analysis. However, the absence of several modalities is very common in practice, leading to severe performance degradation in existing full-modality segmentation methods. Limited by the structured data model, recent works often adopt a multi-stage training strategy for full-modality and missing-modality scenarios, which increases training costs and inadequately addresses the interference caused by missing modalities. In this work, we propose a graph-based one-stage framework for robust brain tumor segmentation with missing modalities. Specifically, we introduce modality-specific virtual nodes that serve as supplementary information sources to compensate for missing modalities. To enhance model robustness against arbitrary modality combinations, we leverage the inherent flexibility of graph networks to devise a dynamic connection strategy. This mechanism dynamically adjusts the adjacency matrix based on modality availability, preserving beneficial information flow while mitigating interference effects caused by missing modalities. Furthermore, we equip the graph network with heterogeneous weight matrices, improving its adaptability to multimodal scenarios. Extensive experiments on the BRATS-2018 and BRATS-2020 datasets demonstrate that our method outperforms the state-of-the-art methods on almost all subsets of incomplete modalities.
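The dynamic connection strategy mentioned in this abstract can be pictured as a mask applied to the adjacency matrix. The sketch below is illustrative only and is not the authors' implementation: the node layout (one real node and one virtual node per modality), the function name `build_adjacency`, and the rule of cutting every edge of an absent modality while keeping all virtual nodes connected are assumptions.

```python
from typing import Dict, List

# The four MRI modalities commonly used in the BraTS datasets.
MODALITIES = ["T1", "T1ce", "T2", "FLAIR"]


def build_adjacency(available: Dict[str, bool]) -> List[List[int]]:
    """Return a 0/1 adjacency matrix over [real nodes..., virtual nodes...].

    Hypothetical rule: a real node exchanges messages only if its modality
    is present; a virtual node is always active and stands in for its modality.
    """
    n = len(MODALITIES)
    # Real nodes follow availability; virtual nodes are always switched on.
    active = [available[m] for m in MODALITIES] + [True] * n
    size = 2 * n
    adj = [[0] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            # Connect two distinct nodes only when both are active.
            if i != j and active[i] and active[j]:
                adj[i][j] = 1
    return adj


# Example: the T1ce scan is missing, so its real node (index 1) is isolated.
adj = build_adjacency({"T1": True, "T1ce": False, "T2": True, "FLAIR": True})
```

In this toy layout, the row and column of the missing modality are all zeros, while its virtual node (index 5) stays connected to the remaining nodes.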

2605.16879 2026-05-19 cs.CV

Towards Generalized Image Manipulation Localization via Score-based Model

通过基于分数的模型实现通用图像操纵定位

Yunfei Wang, Bo Du, Zhe Yang, Xin Liu, Zhiyu Lin, Tianxin Xu, Ji-Zhe Zhou

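AI summary This paper proposes DiffIML, a framework that brings score-based generative modeling to image manipulation localization to address the generalization problem. It uses structural priors to iteratively recover coherent masks, improving robustness, and shows superior generalization on multiple benchmarks.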

Comments Accepted to ICMR 2026. 9 pages, 4 figures

详情
AI中文摘要

随着合成媒体的快速发展,图像操纵定位(IML)已成为多媒体取证中的关键组成部分,用于确保数字内容的完整性。然而,泛化仍然是核心挑战,因为现有的判别方法通常学习固定的决策边界,容易过拟合特定训练伪影,且无法适应未见过的操纵类型。为了解决这一问题,我们提出了DiffIML,一种新颖的框架,引入基于分数的生成模型到IML中。不同于直接估计硬边界,DiffIML近似分数函数,即对数似然的梯度,以捕捉掩码分布的内在几何拓扑。这一范式利用结构先验迭代地从噪声中恢复连贯的掩码,从而避免判别模型的脆弱性。在此框架下,扩散模型成为学习分数函数的有效数值求解器。为确保实用性,我们分别解决了标准扩散模型的效率和稳定性瓶颈:(1)利用轻量级的特定掩码VAE实现快速的潜在空间处理,并采用解耦架构和轻量级去噪UNet;(2)边缘监督和误差先验以减轻采样过程中的误差累积。在两个不同的协议上对八个非生成式和三个生成式基准进行的广泛实验表明,DiffIML在多个基准测试中均优于最先进的方法,实现了在多样化未见过的数据集上的显著泛化改进。代码将公开提供。

英文摘要

With the rapid evolution of synthetic media, Image Manipulation Localization (IML) has emerged as a critical component in multimedia forensics for ensuring the integrity of digital content. However, generalization remains a core challenge, as existing discriminative methods typically learn a fixed decision boundary that tends to overfit to specific training artifacts and fails to adapt to unseen manipulation types. To address this, we propose DiffIML, a novel framework that introduces score-based generative modeling to IML. Diverging from the direct estimation of hard boundaries, DiffIML approximates the score function, the gradient of the log-likelihood, to capture the intrinsic geometric topology of mask distributions. This paradigm leverages structural priors to iteratively recover coherent masks from noise, thereby circumventing the brittleness associated with discriminative models. Under this formulation, diffusion models serve as an effective numerical solver for the learned score function. To ensure practicality, we resolve the efficiency and stability bottlenecks of standard diffusion, respectively, by (1) using a Lightweight Mask-Specific VAE for fast latent-space processing together with a decoupled architecture and a lightweight denoising UNet, and (2) applying edge supervision and an error prior to mitigate error accumulation during sampling. Extensive experiments under two distinct protocols on eight non-generative and three generative benchmarks demonstrate that DiffIML consistently outperforms state-of-the-art methods, yielding remarkable generalization improvements on diverse unseen datasets. The code will be publicly available.

2605.16878 2026-05-19 cs.SD

Speaker-Disentangled Remote Speech Detection of Asthma and COPD Exacerbations

基于语音的哮喘和COPD加重症远程检测中的说话者解耦

Yuyang Yan, Sami O. Simons, Visara Urovi

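AI summary This paper proposes an adversarial learning architecture that disentangles pathology-related speech features from speaker-identity attributes, improving the detection of asthma and COPD exacerbations and the protection of patient privacy. It also uses SHAP analysis to quantify the contribution of speech features to pathology-related predictions.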

详情
AI中文摘要

哮喘和慢性阻塞性肺病(COPD)加重症的早期检测对于及时干预至关重要。语音已成为一种有前途的工具,用于连续、非侵入性地监测呼吸系统疾病。然而,语音信号本质上包含可识别说话者属性,这可能会主导模型预测,从而影响诊断性能和患者隐私。此外,在呼吸系统疾病监测中,与呼吸疾病和说话者身份相关的声学特征仍不明确。我们提出了一种对抗学习架构,以解耦与病理相关的语音模式与说话者可识别属性。该框架优化了两个临床分层任务:(i)呼吸状态分类(稳定 vs. 加重)和(ii)加重类型分类(哮喘加重 vs. COPD加重)。通过基于梯度反转的对抗训练抑制说话者身份。为了提高临床可解释性,我们采用SHapley Additive exPlanations(SHAP)来量化语音特征对病理相关预测与说话者身份的贡献。在TACTICAS数据集上,我们的方法在两个任务上均优于单任务基线。对于呼吸状态任务(稳定 vs. 加重),AUC从0.897提高到0.910。对于加重类型任务(哮喘加重 vs. COPD加重),AUC从0.674提高到0.793。同时,J-ratio降低,证实了有效抑制说话信息。SHAP分析揭示了语音特征对两个任务的贡献。在Bridge2AI-Voice数据集上的外部验证进一步证明了性能的持续改进和说话者依赖性的降低,确认了跨数据集的泛化能力。

英文摘要

Early detection of exacerbations in asthma and chronic obstructive pulmonary disease (COPD) is important for timely intervention. Speech has emerged as a promising tool for continuous, non-invasive respiratory disease monitoring. However, speech signals inherently carry speaker-identifiable attributes that may dominate model predictions, which may compromise both diagnosis performance and patient privacy. Furthermore, the acoustic features associated with respiratory disease and speaker identity remain unclear in respiratory disease monitoring. We propose an adversarial learning architecture that disentangles pathology-related acoustic patterns from speaker-identifiable attributes. The framework optimizes two clinically hierarchical tasks: (i) respiratory status classification (stable vs. exacerbated) and (ii) exacerbation type classification (asthma exacerbation vs. COPD exacerbation). Speaker identity is suppressed through gradient reversal-based adversarial training. To enhance clinical interpretability, we employ SHapley Additive exPlanations (SHAP) to quantify the contributions of acoustic features to pathology-related predictions versus speaker identity. On the TACTICAS dataset, our method outperforms the single-task baseline across both tasks. For the respiratory status task (stable vs. exacerbated), the AUC improves from 0.897 to 0.910. For the exacerbation type task (asthma exacerbation vs. COPD exacerbation), the AUC increases from 0.674 to 0.793. Concurrently, the J-ratio decreases, confirming effective suppression of speaker information. SHAP analysis reveals the contributions of the acoustic features to both tasks. External validation on the Bridge2AI-Voice dataset further demonstrates consistent performance improvement and reduced speaker dependency, confirming cross-dataset generalizability.

2605.16877 2026-05-19 cs.CV

Zero-Shot Faithful Textual Explanations via Directional-Derivative Influence on Predictions

通过预测影响的定向导数生成零样本文本解释

Toshinori Yamauchi, Hiroshi Kera, Kazuhiko Kawamoto

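AI summary This paper proposes FaithTrace, which measures the directional derivative of the class logit along the direction that a textual explanation induces in the classifier's feature space, improving the transparency of image classifiers and the faithfulness of their explanations.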

Comments 11+8 pages, 8 figures, 6 tables

详情
AI中文摘要

零样本文本解释旨在通过探测内部表示使图像分类器更透明,而无需依赖任务特定监督或LVLMs。然而,现有方法常遗漏真正驱动预测的特征,导致解释对模型决策证据的忠实性有限。为此,我们提出FaithTrace。受忠实解释应描述强影响预测的概念的启发,FaithTrace直接测量解释诱导的表示如何改变类logit。我们引入影响评分,计算为分类器特征空间中文本诱导方向上类logit的定向导数,并用其作为忠实性的代理。此外,我们将此影响评分扩展为定量评估指标,帮助填补文本解释忠实性评估的空白。实验表明,FaithTrace产生的解释比基线更忠实,有助于更准确地理解模型。代码将公开发布。

英文摘要

Zero-shot textual explanations aim to make image classifiers more transparent by probing their internal representations, without relying on task-specific supervision or LVLMs. However, existing methods often miss the features that truly drive the prediction, resulting in limited faithfulness to the evidence underlying the model's decision. To address this, we propose FaithTrace. Motivated by the idea that faithful explanations should describe concepts that strongly influence the prediction, FaithTrace directly measures how much the representation induced by the explanation changes the class logit. We introduce an influence score, computed as the directional derivative of the class logit along the text-induced direction in the classifier's feature space, and use it as a proxy for faithfulness. Moreover, we extend this influence score into quantitative evaluation metrics, helping fill the gap in faithfulness evaluation for textual explanations. Experiments show that FaithTrace yields more faithful explanations than baselines, facilitating a more accurate understanding of the model. The code will be publicly released.
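The influence score in this abstract is a directional derivative, which is easy to illustrate numerically. The sketch below is a generic finite-difference estimate, not FaithTrace itself: the linear classifier head, its weights, the feature vector, and the "text-induced" direction are all made-up placeholders, and how the paper actually obtains the direction or the gradient is not stated in the abstract.

```python
import math
from typing import Callable, List


def directional_derivative(
    logit: Callable[[List[float]], float],
    feature: List[float],
    direction: List[float],
    eps: float = 1e-5,
) -> float:
    """Central finite-difference estimate of d/dt logit(feature + t * u) at t = 0,
    where u is `direction` normalised to unit length."""
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    unit = [d / norm for d in direction]
    plus = [f + eps * u for f, u in zip(feature, unit)]
    minus = [f - eps * u for f, u in zip(feature, unit)]
    return (logit(plus) - logit(minus)) / (2.0 * eps)


# Hypothetical linear head for a single class: logit(f) = w . f + b
W = [2.0, -1.0, 0.5]
B = 0.1


def class_logit(f: List[float]) -> float:
    return sum(wi * fi for wi, fi in zip(W, f)) + B


# Placeholder feature vector and a placeholder direction along the first axis.
score = directional_derivative(class_logit, [0.3, 0.7, -0.2], [1.0, 0.0, 0.0])
```

For a linear head the directional derivative equals the dot product of the weight vector with the unit direction, so `score` here is approximately 2.0. A large positive value would mean that moving the feature along that direction strongly raises the class logit.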

2605.16874 2026-05-19 cs.AI

Reasoning Can Be Restored by Correcting a Few Decision Tokens

通过纠正少量决策标记来恢复推理能力

Changshuo Shen, Leheng Sheng, Yuxin Chen, An Zhang, Xiang Wang

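AI summary This paper studies how sparse the reasoning advantage is during base-model generation and proposes disagreement-guided token intervention, which with only a few interventions substantially recovers, and can even surpass, the performance of a same-size reasoning model.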

Comments Accepted at ICML 2026

详情
AI中文摘要

大型推理模型(LRMs)在具有挑战性的推理基准测试中显著优于其基础LLM counterpart,但基础模型在逐token生成过程中哪里出错以及如何高效缩小这一差距仍不清楚。我们通过量化基础模型与更强推理模型之间在token层面的分布分歧来研究基础推理差距,使用基于似然的分歧度量。在多个基准测试中,我们发现推理优势高度稀疏,集中在少量早期规划相关的决策标记上。例如,在Qwen3-0.6B中,只有约8%的生成标记导致显著分歧,这些标记集中在响应早期,规划相关决策强烈富集(17倍),并且与基础模型的高不确定性重合——表明基础模型主要在早期规划点上失败,这些点引导后续的推理轨迹。基于这些发现,我们提出了分歧指导的标记干预,一种简单的推理时间委托方案,仅在高分歧位置将单个标记的生成委托给推理模型,并立即切换回基础模型。在少量干预预算下,这种稀疏委托显著恢复,甚至在具有挑战性的推理任务上可以超越同规模推理模型的性能。代码可在https://github.com/AlphaLab-USTC/RRTokenIntervention获得。

英文摘要

Large reasoning models (LRMs) substantially outperform their base LLM counterparts on challenging reasoning benchmarks, yet it remains poorly understood where base models go wrong during token-by-token generation and how to narrow this gap efficiently. We study the base-reasoning gap through quantifying token-level distributional disagreement between a base model and a stronger reasoning model using likelihood-based divergences. Across benchmarks, we find that the reasoning advantage is highly sparse and concentrates on a small set of early, planning-related decision tokens. For instance, on Qwen3-0.6B, only ~8% of generated tokens account for the salient disagreement, and these tokens concentrate early in the response, are strongly enriched in planning-related decisions (17x), and coincide with high base-model uncertainty -- suggesting that base models fail mainly at early planning points that steer the subsequent reasoning trajectory. Building on these findings, we propose disagreement-guided token intervention, a simple inference-time delegation scheme that performs a one-token takeover by the reasoning model only at high-disagreement positions and immediately switches back to the base model. With a small intervention budget, this sparse delegation substantially recovers and can even surpass the performance of a same-size reasoning model on challenging reasoning tasks. Code is available at https://github.com/AlphaLab-USTC/RRTokenIntervention.
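The idea of locating high-disagreement positions can be sketched with a toy example. This is a simplified, offline illustration under stated assumptions: it uses KL divergence as the disagreement measure (the abstract only says "likelihood-based divergences"), it scores a finished sequence rather than intervening during generation, and the function names and probability tables are hypothetical.

```python
import math
from typing import List, Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )


def choose_delegated_positions(
    reasoning_probs: List[Sequence[float]],
    base_probs: List[Sequence[float]],
    budget: int,
) -> List[int]:
    """Return the `budget` positions with the largest disagreement, in sequence order."""
    scores = [kl_divergence(r, b) for r, b in zip(reasoning_probs, base_probs)]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


# Toy next-token distributions over a two-word vocabulary at three positions.
reasoning = [[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]]
base = [[0.1, 0.9], [0.5, 0.5], [0.2, 0.8]]

positions = choose_delegated_positions(reasoning, base, budget=1)
```

With these toy tables the two models disagree most at the first position, so a budget of one delegates only that token to the reasoning model and leaves the rest to the base model.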

2605.16873 2026-05-19 cs.CV

HAD: Hallucination-Aware Diffusion Priors for 3D Reconstruction

HAD:面向3D重建的幻觉感知扩散先验

Xi Liu, Weiwei Sun, Zhou Ren, Chris Broaddus, Siyu Huang, Laurent Guigues

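AI summary This paper proposes HAD, a hallucination-aware diffusion prior for 3D reconstruction. It uses the multi-view reasoning capability of a feedforward novel view synthesis (NVS) network pre-trained on large-scale 3D data to estimate pixel-wise hallucination score maps for augmented images, so that unreliable pixels can be selectively masked during progressive 3D reconstruction, reducing hallucination artifacts and improving reconstruction quality.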

Comments Accepted by CVPR 2026

详情
AI中文摘要

扩散先验最近在通过在新视角上增强训练视角来提高稀疏视角3D重建质量方面表现出强大的能力,但不可避免地会引入幻觉内容——与输入视角不一致的伪影——进入最终的3D模型。为了解决这一挑战,我们提出了Hallucination-Aware Diffusion prior(HAD),它通过利用预训练在大规模3D数据上的馈送式新视角合成(NVS)网络的多视角推理能力,估计增强图像的像素级幻觉分数图。这些幻觉分数使在逐步3D重建过程中能够选择性地屏蔽不可靠像素,防止将不存在的伪影引入3D模型。为了进一步提高性能,我们在每个新视角上创建多个增强图像版本,通过将扩散先验条件化于不同的输入视角,然后将这些图像融合成最终图像,该图像利用了所有输入视角的更广泛上下文。我们证明了我们的方法在扩散辅助的3D重建中显著减少了幻觉伪影,从而在多个新视角合成基准上实现了最先进的性能。我们的项目在https://xiliu8006.github.io/HAD-Project-website/上公开可用。

英文摘要

Diffusion priors have recently demonstrated strong capability in enhancing the quality of sparse-view 3D reconstruction by augmenting training views at novel viewpoints, but they inevitably introduce hallucinated content -- artifacts inconsistent with the input views -- into the final 3D model. To address this challenge, we propose Hallucination-Aware Diffusion prior (HAD), which estimates pixel-wise hallucination score maps for augmented images by leveraging multi-view reasoning capabilities from a feedforward novel view synthesis (NVS) network pre-trained on large-scale 3D data. These hallucination scores enable selective masking of unreliable pixels during the progressive 3D reconstruction procedure, preventing the introduction of non-existent artifacts into the 3D model. To further enhance performance, we create multiple versions of augmented images at each novel view by conditioning the diffusion prior on different input views, which are then fused into a final image that leverages the broader context across all input views. We show that our method substantially reduces hallucination artifacts in diffusion-assisted 3D reconstruction, thereby achieving state-of-the-art performance across multiple benchmarks on novel view synthesis. Our project is publicly available at https://xiliu8006.github.io/HAD-Project-website/.

2605.16871 2026-05-19 cs.RO

SADP: Subgoal-Aware Diffusion Policy for Explainable Robots Learned from Foundation Model Generated Demonstrations

SADP:基于基础模型生成示范的子目标感知扩散策略用于可解释机器人

Site Hu, Takato Horii

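AI summary This paper proposes SADP, a subgoal-aware diffusion policy for explainable robots learned from demonstrations generated by foundation models. It autonomously generates subgoal-annotated demonstration data and trains a diffusion policy on it, so that the robot can explain its decision process to users through subgoal structure and execution progress, achieving higher task success rates and supporting failure diagnosis in long-horizon manipulation.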

详情
AI中文摘要

可解释机器人不仅需要成功执行任务,还需要以用户友好的方式暴露内部决策过程。然而,大多数模仿学习方法仅在任务层面的示范上训练,没有显式建模子目标结构或执行进度。这种限制在标准机器人学习数据集中子目标级监督稀缺的情况下进一步加剧,限制了能够传达其执行子任务的机器人发展。为了解决这个问题,本文提出了Subgoal-Aware Diffusion Policy (SADP),一种利用基础模型自主生成子目标标注的示范数据,并在这些数据集上训练扩散策略的框架。SADP通过将动作生成条件化在任务层面和子目标层面的描述上,围绕人类可解释的子目标结构构建策略执行。一个轻量级的辅助头进一步预测子目标完成状态,使机器人能够暴露其当前执行阶段并监控子目标进展。在RLBench模拟和实际UR5e机器人上的实验表明,SADP在任务成功率方面优于强大的任务条件扩散基线,同时提供子目标级执行信号用于监控进度和故障诊断。这些结果表明,内置而非事后解释性可以与高任务性能共存。

英文摘要

Explainable robots require not only successful task execution but also the ability to expose their internal decision-making process in a user-friendly manner. However, most imitation learning methods are trained solely on task-level demonstrations, without explicitly modeling subgoal structure or execution progress. This limitation is further exacerbated by the scarcity of subgoal-level supervision in standard robot learning datasets, which restricts the development of robots that can convey the subtasks they are executing during long-horizon manipulation. To address this issue, this paper proposes Subgoal-Aware Diffusion Policy (SADP), a framework that leverages foundation models to autonomously generate subgoal-annotated demonstrations and trains diffusion policies on these datasets. SADP structures policy execution around human-interpretable subgoals by conditioning action generation on both task-level and subgoal-level descriptions. A lightweight auxiliary head further predicts subgoal completion states, allowing the robot to expose its current execution stage and monitor subgoal progression. Experiments in RLBench simulations and real-world evaluations on a UR5e robot demonstrate that SADP achieves higher task success rates than strong task-conditioned diffusion baselines, while providing subgoal-level execution signals for monitoring progress and diagnosing failures. These results highlight that built-in, rather than post-hoc, interpretability can coexist with high task performance.

2605.16870 2026-05-19 cs.RO

SSTL: Self-Sensing Tendon Loop for Hysteresis Modeling and Compensation in Tendon-Sheath Mechanisms

SSTL:自感知腱环用于腱鞘机制的滞后模型与补偿

Myeongbo Park, Junhyun Park, Ihsan Ullah, Chunggil An, Minho Hwang

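AI summary This paper proposes the Self-Sensing Tendon Loop (SSTL) to address hysteresis in tendon-sheath mechanisms caused by tendon-sheath friction and tendon elasticity. By measuring input and output tensions, it builds a hysteresis model and compensates for it, improving the control accuracy of flexible endoscopic robots.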

Comments 8 pages, 7 figures, 4 tables

详情
AI中文摘要

柔性内窥镜机器人通过自然孔道实现微创接入,但其控制精度受限于腱鞘机制(TSMs)中配置依赖的滞后现象。腱鞘摩擦和腱弹性导致输入和输出之间存在系统性差异,且该差异随插入管配置变化。为解决这一挑战,本文提出自感知腱环(SSTL),一种通过插入管双程路由并围绕远端滑轮缠绕的腱环结构,使输入和输出张力均可在近端测量,从而无需远端力或光纤传感器即可获得输入-输出张力剖面。由于SSTL与驱动TSM共享相同路由路径,两个TSM表现出高度相关的滞后行为。从SSTL张力剖面中,基于学习的映射估计驱动TSM的配置依赖滞后参数,这些参数随后被前馈控制器用于补偿驱动滞后。我们通过在三种不同插入管配置下跟踪驱动腱张力验证了所提方法。在正弦和随机轨迹上,所提方法将平均RMSE降低88.1%,达到直接识别方法的97.8%,后者需要直接测量驱动TSM的输入和输出张力剖面。

英文摘要

Flexible endoscopic robots enable minimally invasive access through natural orifices, but their control accuracy is limited by configuration-dependent hysteresis in the tendon-sheath mechanisms (TSMs). Tendon-sheath friction and tendon elasticity induce a systematic discrepancy between the proximal actuation input and distal output, and this discrepancy varies with the insertion tube configuration. To address this challenge, this paper proposes the Self-Sensing Tendon Loop (SSTL), a double-pass tendon loop that is routed through the insertion tube, wrapped around a distal pulley, and returned to the proximal end. The loop structure allows both the input and output tensions of the SSTL to be measured proximally, thereby providing an input-output tension profile without requiring distal force or fiber-optic sensors. Because the SSTL shares the same routing path as the actuation TSM, the two TSMs exhibit strongly correlated hysteresis behaviors. From the SSTL tension profile, a learning-based mapping estimates the configuration-dependent hysteresis parameters of the actuation TSM, which are then used by a feedforward controller to compensate for actuation hysteresis. We validate the proposed method by tracking actuation tendon tension under three different insertion tube configurations. Across sinusoidal and random trajectories, the proposed method reduces average RMSE by 88.1% compared with the uncompensated baseline, achieving 97.8% of the performance of direct identification, which requires direct measurement of the input and output tension profile of the actuation TSM.
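For readers unfamiliar with how a figure such as "reduces average RMSE by 88.1%" is computed, the sketch below shows the arithmetic on toy numbers. The tension values are invented for illustration and are not data from the paper; the helper names `rmse` and `rmse_reduction_percent` are likewise hypothetical.

```python
import math
from typing import Sequence


def rmse(reference: Sequence[float], measured: Sequence[float]) -> float:
    """Root-mean-square error between a reference and a measured trajectory."""
    if len(reference) != len(measured) or not reference:
        raise ValueError("trajectories must be non-empty and of equal length")
    squared = sum((r - m) ** 2 for r, m in zip(reference, measured))
    return math.sqrt(squared / len(reference))


def rmse_reduction_percent(baseline_rmse: float, compensated_rmse: float) -> float:
    """Relative RMSE reduction of the compensated run versus the baseline, in percent."""
    return 100.0 * (baseline_rmse - compensated_rmse) / baseline_rmse


# Toy tension trajectories (arbitrary units), not measurements from the paper.
target = [1.0, 2.0, 3.0, 4.0]
uncompensated = [2.0, 3.0, 4.0, 5.0]
compensated = [1.1, 2.1, 3.1, 4.1]

baseline = rmse(target, uncompensated)
improved = rmse(target, compensated)
reduction = rmse_reduction_percent(baseline, improved)
```

In this toy case the baseline RMSE is 1.0 and the compensated RMSE is about 0.1, so the reduction is about 90%.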

2605.16864 2026-05-19 cs.CV cs.AI

Metric-Guided Feature Fusion of Visual Foundation Models for Segmentation Tasks

基于度量的视觉基础模型特征融合用于分割任务

Yachan Guo, JoseLuis Gomez Zurita, Danna Xue, Yi Xiao, AntonioManuel Lopez Pena

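AI summary This paper proposes a metric-guided feature fusion method that assesses the feature spaces of different visual foundation models and then selects and aggregates complementary features to improve performance on dense prediction tasks.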

Comments Accepted to the CVPR 2026 Findings Track

详情
AI中文摘要

尽管大规模视觉基础模型(VFMs)在语义理解方面表现优异,但在实例感知的密集预测任务中仍显不足。它们在表示上存在不同的偏倚:例如,可提示的分割模型(如SAM2)专注于细粒度区域边界,而自监督模型(如DINOv3)强调物体层面的结构。这一观察表明,结合不同VFMs的互补特征可以增强下游密集预测任务。然而,简单的多VFMs融合 seldom 导致可靠的增益,且如何利用其互补特征的可解释原则仍待探索。在本文中,我们提出了一种基于度量的方法,通过显式的评估分数选择并聚合不同VFMs的互补特征。具体而言,我们设计了一套无标签的度量标准,在特征空间的两个方面,结构一致性与边缘保真度,来评估VFM编码器的特征。在这些分数的指导下,我们识别出互补性强的边缘强和结构强的编码器对,并通过主辅融合方案进行整合。这种特征融合不需要复杂的架构更改,并且仅在单个阶段进行训练。我们的模型在多个密集预测任务中相比基线模型表现出一致的性能提升,具有更好的物体层面语义和更准确的边界定位。代码可在{https://github.com/gyc-code/metric-guided-fusion}获取。

英文摘要

Although large-scale visual foundation models (VFMs) achieve remarkable performance in semantic understanding, they still underperform in instance-aware dense prediction tasks. They exhibit different biases in representation: for instance, promptable segmentation models (e.g., SAM2) focus on fine-grained region boundaries, while self-supervised models (e.g., DINOv3) emphasize object-level structure. This observation highlights the potential of combining complementary features from different VFMs to enhance downstream dense prediction tasks. However, naive multi-VFM fusion seldom leads to reliable gains, and interpretable principles for leveraging their complementary features are still underexplored. In this work, we propose a metric-guided approach that effectively selects and aggregates complementary features from different VFMs based on explicit assessment scores. Specifically, we design a suite of label-free metrics in feature space across two aspects, Structural Coherence and Edge Fidelity, to assess features of VFM encoders. Guided by these scores, we identify complementary edge-strong and structure-strong encoder pairs, and integrate them via a master-auxiliary fusion scheme. This feature fusion requires no complex architectural changes and is trained only in a single stage. Our model shows consistent performance gains across multiple dense prediction tasks compared with the baselines, with better object-level semantics and more accurately localized boundaries. The code is available at https://github.com/gyc-code/metric-guided-fusion.

2605.16863 2026-05-19 cs.RO cs.AI cs.LG

Plan First, Diffuse Later: Extrinsic Graph Guidance for Long-Horizon Diffusion Planning

先规划,后扩散:用于长视距扩散规划的外在图引导

Yaniv Hassidof, Adir Morgan, Yilun Du, Kiril Solovey

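AI summary This paper proposes XDiffuser, an extrinsic search-guided diffusion model that first plans over a state-space graph and then uses the plan to guide the diffusion process, improving the efficiency and effectiveness of long-horizon planning, with particularly strong results on low-quality data and unseen tasks.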

详情
AI中文摘要

组合扩散模型通过去噪多个重叠的子轨迹并确保它们构成全局解,为长视距规划提供了一条有前途的路线。然而,强制在长链上执行局部行为往往不足以产生一致的全局结构。最近的工作通过内在搜索在去噪过程中探索多条路径来解决这一限制。尽管内在搜索提高了全局一致性,但代价是重复评估已经计算密集的模型。在本文中,我们主张在去噪过程之外进行外在搜索,为长视距规划提供更有效的探索模式,同时自然地使经典算法能够解决测试时的未见组合任务。我们的eXtrinsic搜索引导的Diffuser(XDiffuser)首先在状态空间图上计算一个计划——作为扩散模型的轻量级局部连接Oracle。该计划随后用于引导单条轨迹的去噪,有效地将探索负担转移出去。XDiffuser在长视距任务上优于基于扩散的基线,特别是在低质量数据领域和超出目标到达的未见任务中,包括多智能体协调和TSP风格推理。项目网站:https://yanivhass.github.io/XDiffuser-site/

英文摘要

Compositional diffusion models offer a promising route to long-horizon planning by denoising multiple overlapping sub-trajectories while ensuring that together they constitute a global solution. However, enforcing local behavior over long chains is often insufficient for a coherent global structure to emerge. Recent works tackle this limitation through intrinsic search, which explores multiple paths during the denoising process. While intrinsic search improves global coherence, it comes at the cost of repeated evaluations of an already compute-heavy model. In this work, we argue that extrinsic search, performed outside the denoising process, offers a more effective mode of exploration for long-horizon planning while naturally enabling the use of classical algorithms to solve unseen combinatorial tasks at test time. Our eXtrinsic search-guided Diffuser (XDiffuser) first computes a plan over a state-space graph -- serving as a lightweight local connectivity oracle for the diffusion model. The plan is then used to guide denoising for a single trajectory, effectively offloading the burden of exploration. XDiffuser outperforms diffusion-based baselines on long-horizon tasks, with particularly large gains in the low-quality data regime and on unseen tasks beyond goal-reaching, including multi-agent coordination and TSP-style reasoning. Project website: https://yanivhass.github.io/XDiffuser-site/

2605.16861 2026-05-19 cs.CV cs.AI

Prefix-Adaptive Block Diffusion for Efficient Document Recognition

前缀自适应块扩散用于高效的文档识别

Mingxu Chai, Ziyu Shen, Chenyu Liu, Kaidi Zhang, Jiazheng Zhang, Dingwei Zhu, Zhiheng Xi, Ruoyu Chen, Jun Long, Jihua Kang, Tao Gui, Qi Zhang

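AI summary This paper proposes the Prefix-Adaptive Block Diffusion Model (PA-BDM), which improves intra-block denoising and the caching mechanism to make document recognition more efficient and accurate.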

Comments 17 pages, 6 figures

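Details
AI-translated abstract

Block Diffusion Models (BDMs) support parallel generation, flexible-length output, and KV caching, which makes them promising for efficient document parsing. However, existing BDMs tie denoising and cache commitment to fixed block boundaries: parallelism shrinks during intra-block denoising, and generated tokens cannot be cached until the whole block is complete. In addition, intra-block bidirectional denoising conflicts with inter-block autoregression, producing an inconsistent information flow that can hamper structure-sensitive recognition. The Prefix-Adaptive Block Diffusion Model (PA-BDM) replaces intra-block bidirectional denoising with causal denoising from prefix to suffix, and treats the block size as a maximum candidate range rather than a fixed commitment unit. PA-BDM uses a Confidence-gated Structural Loss (CSL) to build low-entropy prefixes before training is extended to longer continuations. At inference, Progressive Prefix Commitment (PPC) dynamically commits the longest reliable prefix to the KV cache and resets the next candidate range from the updated prefix, restoring a large parallel decoding space at every step. Experiments show that the 3B PA-BDM achieves higher recognition scores on several benchmarks and improves inference throughput by 71.6% over the 2.5B MinerU-Diffusion.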

English abstract

Block Diffusion Models (BDMs) support parallel generation, flexible-length output, and KV caching, making them promising for efficient document parsing. However, existing BDMs bind denoising and cache commitment to fixed block boundaries: parallelism shrinks during intra-block denoising, while generated tokens cannot be cached until the whole block is completed. Moreover, intra-block bidirectional denoising conflicts with inter-block autoregression, creating inconsistent information flow that can challenge structure-sensitive recognition. We propose the Prefix-Adaptive Block Diffusion Model (PA-BDM), which replaces intra-block bidirectional denoising with causal denoising from prefix to suffix and treats the block size as a maximum candidate range rather than a fixed commitment unit. PA-BDM uses Confidence-gated Structural Loss (CSL) to build low-entropy prefixes before extending training to longer continuations. During inference, Progressive Prefix Commitment (PPC) then dynamically commits the longest reliable prefix into the KV cache and resets the next candidate range from the updated prefix, restoring a large parallel decoding space at each step. Experiments show that the 3B PA-BDM achieves higher recognition scores on several benchmarks and improves inference throughput by 71.6% over the 2.5B MinerU-Diffusion.
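
A minimal sketch of the prefix-commitment loop described above may help. Several things here are assumptions rather than details from the abstract: the confidence threshold of 0.9, the rule that at least one token is committed per step to guarantee progress, and the stand-in model `toy_step`. None of these names come from the paper's code.

```python
def commit_prefix(confidences, threshold=0.9):
    """Length of the longest prefix whose tokens all meet the threshold."""
    length = 0
    for c in confidences:
        if c < threshold:
            break
        length += 1
    return length

def decode(step_fn, total_len, block_size, threshold=0.9):
    """Commit the reliable prefix each step, then restart the candidate range."""
    committed = []
    steps = 0
    while len(committed) < total_len:
        # The block size acts as a maximum candidate range, not a fixed unit.
        span = min(block_size, total_len - len(committed))
        tokens, confs = step_fn(committed, span)
        # Assumption: always commit at least one token so the loop terminates.
        n = max(1, commit_prefix(confs, threshold))
        committed.extend(tokens[:n])
        steps += 1
    return committed, steps

def toy_step(prefix, span):
    """Stand-in for the model: confidence decays with distance from the prefix."""
    start = len(prefix)
    tokens = list(range(start, start + span))
    confs = [0.99, 0.95, 0.80, 0.50][:span]
    return tokens, confs

output, num_steps = decode(toy_step, total_len=6, block_size=4)
print(output, num_steps)
```

With this toy model two tokens clear the threshold per step, so six tokens take three steps, and every step starts again from a full candidate range.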

2605.16860 2026-05-19 cs.LG cs.AI q-bio.QM

PhysioSeq2Seq: A Hybrid Physiological Digital Twin and Sequence-to-Sequence LSTM for Long-Horizon Glucose Forecasting in Type 1 Diabetes

Phat Tran, Neville Mehta, Clara Mosquera-Lopez, Robert H. Dodier, Lizhong Chen, Peter G. Jacobs

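AI summary This paper proposes PhysioSeq2Seq, a hybrid architecture that combines patient-specific physiological modeling with a sequence-to-sequence LSTM for long-horizon glucose forecasting in type 1 diabetes; by eliminating recursive error compounding and injecting patient-matched physiological states, it improves forecast accuracy and clinical relevance.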

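Details
AI-translated abstract

Accurate long-horizon glucose forecasting is essential for automated insulin delivery systems, which help people with type 1 diabetes manage their glucose and avoid dangerous hypoglycemia. However, standard recursive long short-term memory (LSTM) networks show a systematic negative bias at longer horizons because errors compound, and purely mechanistic ordinary differential equation (ODE) models fail to generalize across individuals when parameterized at the population level. PhysioSeq2Seq is a hybrid architecture that combines patient-specific physiological modeling with a sequence-to-sequence (Seq2Seq) LSTM. For each glucose segment, twin matching searches a population of 300 parameterized digital twins for the best-fitting physiological match to the 3-hour continuous glucose monitoring (CGM) history. The 10 internal ODE state variables of the matched twin are injected into both the encoder and the decoder of the Seq2Seq LSTM. Predicting all 48 steps simultaneously eliminates recursive error compounding, while the ODE features provide a physics-grounded constraint that keeps long-horizon drift within physiologically plausible ranges. PhysioSeq2Seq was trained on CGM and insulin data from 348 participants in the Type 1 Diabetes Exercise Initiative (T1DEXI) dataset and evaluated on 74 held-out participants. At the 240-minute horizon, it achieves a mean absolute error of 39.28 mg/dL and a mean error of -10.62 mg/dL, reducing bias by 13.89 mg/dL relative to the recursive LSTM and mean absolute error by 28.62 mg/dL relative to the ODE-based digital twin. These results indicate that eliminating architectural feedback and injecting patient-matched physiological states is an effective and clinically meaningful strategy for long-horizon glucose forecasting in type 1 diabetes.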

English abstract

Accurate long-horizon glucose forecasting is critical for automated insulin delivery systems, which help people with type 1 diabetes (T1D) manage their glucose and avoid dangerous hypoglycemia. However, standard recursive long short-term memory (LSTM) networks suffer from systematic negative bias at longer horizons due to error compounding, while purely mechanistic ordinary differential equation (ODE) models fail to generalize across individuals when parameterized at the population level. We propose PhysioSeq2Seq, a hybrid architecture that combines patient-specific physiological modeling with a sequence-to-sequence (Seq2Seq) LSTM. For each glucose segment, twin matching searches a population of 300 parameterized digital twins to identify the best-fitting physiological match from a 3-hour continuous glucose monitoring (CGM) history. The 10 internal ODE state variables of the matched twin are injected as exogenous covariates into both the encoder and decoder of the Seq2Seq LSTM. This simultaneous 48-step prediction strategy eliminates recursive error compounding, while the ODE features provide a physics-grounded constraint that bounds long-horizon drift within physiologically plausible ranges. PhysioSeq2Seq was trained on CGM and insulin data from 348 participants in the Type 1 Diabetes Exercise Initiative (T1DEXI) dataset and evaluated on 74 held-out participants. At the 240-minute horizon, PhysioSeq2Seq achieves a mean absolute error of 39.28 mg/dL and a mean error of -10.62 mg/dL, reducing bias by 13.89 mg/dL over the recursive LSTM and reducing mean absolute error by 28.62 mg/dL over the ODE-based digital twin. These results show that eliminating architectural feedback and injecting patient-matched physiological states is an effective and clinically meaningful strategy for long-horizon glucose forecasting in T1D.
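
The twin-matching step can be sketched as a nearest-trace search. The abstract does not state the fit criterion, so root-mean-square error is an assumption here, and `rmse`, `match_twin`, and the three toy traces are hypothetical. The sketch also assumes the 48 prediction steps span the full 240-minute horizon, which gives one step every 5 minutes.

```python
import math

def rmse(a, b):
    """Root-mean-square error between two equal-length glucose traces."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

def match_twin(cgm_history, twin_traces):
    """Index of the simulated twin trace closest to the CGM history.

    Assumption: closeness is measured by RMSE; the paper may use another metric.
    """
    return min(range(len(twin_traces)),
               key=lambda i: rmse(cgm_history, twin_traces[i]))

# Toy data in mg/dL (illustrative values, not from the T1DEXI dataset).
history = [100, 110, 120]
twins = [[90, 95, 100], [101, 111, 119], [150, 160, 170]]
best = match_twin(history, twins)

# Assumed step length: 240-minute horizon predicted in 48 simultaneous steps.
step_minutes = 240 / 48
print(best, step_minutes)
```

In the described architecture the matched twin's ODE state variables, not its glucose trace alone, would then be passed to the encoder and decoder.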