arXivDaily: daily arXiv digest, updated Monday through Friday
2606.28273 2026-06-29 cs.CL New submission

Vision-Default, Prior-Override: Causal Mechanisms of Perception-Knowledge Conflict in Vision-Language Models

视觉默认,先验覆盖:视觉-语言模型中感知-知识冲突的因果机制

Niclas Lietzow, Danielle Bitterman, Carsten Eickhoff, William Rudman, Michal Golovanevsky

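Affiliations: University of Tübingen; Harvard University; The University of Texas at Austin

AI summary: Using activation patching and ablation experiments, the authors find that visual grounding is the default in VLMs, whereas prior grounding depends on a small set of causal attention heads (2.5-4.8%), yielding an asymmetric causal structure.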

Comments 14 pages, 11 figures, 8 tables

AI-generated Chinese abstract

视觉-语言模型必须在视觉证据与记忆的世界知识冲突时进行协调。它们如何解决这种冲突影响着多模态系统的可靠性,然而先前的工作仅从行为上描述这一冲突,缺乏组件级的因果解释。我们结合了三种粒度(残差流、注意力头和MLP子层)的激活修补、模型组件消融研究和机制分析。在三个VLM系列中,我们发现视觉基础默认出现,而先验基础依赖于集中在网络后半部分的一小组因果必要的注意力头(2.5-4.8%)。这些头使得模型能够根据存储的世界知识(例如草莓的“红色”)给出答案,尽管视觉输入与之冲突。消融这些头会在先验知识提示下将68-96%的预测从知识基础翻转为视觉基础答案,但仅改变0.8-7.5%的视觉基础预测,建立了一种不对称的因果结构。识别出的头分解为路由头(调节信息流)和写入头(直接将答案令牌投影到残差流中)。这种结构在不同模型系列和规模上一致,揭示了VLM中感知-知识冲突背后的稀疏因果电路。

English abstract

Vision-language models must reconcile visual evidence with memorized world knowledge when the two conflict. How they resolve this conflict shapes the reliability of multimodal systems, yet prior work characterizes it behaviorally without a component-level causal account. We combine activation patching across three granularities (residual stream, attention heads, and MLP sublayers) with model-component ablation studies and mechanistic analysis. Across three VLM families, we find that visual grounding emerges by default, whereas prior grounding depends on a small set of causally necessary attention heads (2.5-4.8%) concentrated in the second half of the network. These heads enable answers from stored world knowledge (e.g., "red" for a strawberry) despite conflicting visual input. Ablating them flips predictions from knowledge-grounded to visually grounded answers in 68-96% of cases under prior-knowledge prompts, but changes only 0.8-7.5% of visually grounded predictions, establishing an asymmetric causal structure. The identified heads decompose into routing heads, which modulate information flow, and writing heads, which directly project answer tokens into the residual stream. This structure is consistent across model families and scales, revealing a sparse causal circuit underlying perception-knowledge conflict in VLMs.
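The asymmetry reported above is a comparison of two flip rates measured before and after ablating the identified heads. A minimal sketch of that bookkeeping follows; the label strings and the toy prediction lists are hypothetical and are not data from the paper.

```python
def flip_rate(before, after, from_label, to_label):
    """Among examples labelled `from_label` before ablation,
    return the fraction labelled `to_label` after ablation."""
    eligible = [(b, a) for b, a in zip(before, after) if b == from_label]
    if not eligible:
        return 0.0
    flipped = sum(1 for _, a in eligible if a == to_label)
    return flipped / len(eligible)


# Hypothetical predictions: "prior" = knowledge-grounded, "visual" = visually grounded.
prior_prompt_before = ["prior", "prior", "prior", "prior", "visual"]
prior_prompt_after = ["visual", "visual", "visual", "prior", "visual"]
visual_prompt_before = ["visual", "visual", "visual", "visual"]
visual_prompt_after = ["visual", "visual", "visual", "prior"]

prior_to_visual = flip_rate(prior_prompt_before, prior_prompt_after, "prior", "visual")
visual_to_prior = flip_rate(visual_prompt_before, visual_prompt_after, "visual", "prior")
```

A large `prior_to_visual` paired with a small `visual_to_prior` is the pattern the abstract calls an asymmetric causal structure.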

2606.28270 2026-06-29 cs.AI cs.MA New submission

Agent-Native Immune System: Architecture, Taxonomy, and Engineering

智能体原生免疫系统:架构、分类与工程

Bo Shen, Lifeng Chang, Tianyuan Wei, Yunpeng Li, Feng Shi, Yichen Han, Peijie Gao, Shiyi Kuang, Xin Chang, Dehui Li

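Affiliations: Novo Ordo for AI

AI summary: Proposes the Agent-Native Immune System (ANIS), a biologically inspired, endogenous defense architecture embedded in the agent's cognitive loop, which provides dynamic runtime protection through a six-layer Immune Tower, a unified taxonomy of Agent Viruses and Agent Vaccines, and a meta-cognitive automation backbone.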

AI-generated Chinese abstract

从静态聊天机器人到自主智能体(配备持久记忆、工具使用协议和多智能体协作)的转变从根本上扩展了AI威胁格局。当前的防御机制,如边界安全和训练时对齐,仍然外在于智能体的主动推理循环。因此,它们存在不足:一个完全对齐的智能体仍然极易受到通过记忆中毒、工具链操纵或多智能体协议攻击进行的运行时劫持。为弥补这一关键差距,我们引入了智能体原生免疫系统(ANIS),这是首个直接嵌入智能体认知循环的生物启发式内生防御架构。我们的框架提出了四项主要贡献。首先,我们设计了一个六层免疫塔(L0-L5),其中独特地包含了屏障免疫(L1)作为非认知的物理与逻辑隔离层。其次,我们建立了智能体病毒和智能体疫苗的统一分类,形式化了浅层非参数防御与鲁棒参数疫苗之间的关键区别。第三,我们概念化了驾驭三元组——元、自我和自动——一个自我监控的元认知自动化骨干,驱动持续免疫学习(CIL),使疫苗能够动态适应新型威胁。最后,我们在模型对齐与智能体免疫之间建立了严格的理论分界:对齐在训练期间提供静态的“宪法”价值基础,而ANIS在运行时充当动态的“执法”机制。我们最后提出了该领域的开放挑战,包括免疫协议标准化、新的评估指标(如自身免疫率,即假阳性干预率),以及集体智能生态系统中病原体与疫苗之间的共同进化动态。

English abstract

The transition from static chat bots to autonomous agents--equipped with persistent memory, tool-use protocols, and multi-agent collaboration--has fundamentally expanded the AI threat landscape. Current defense mechanisms, such as perimeter security and training-time alignment, remain external to the agent's active reasoning loop. Consequently, they fall short: a fully aligned agent remains highly vulnerable to runtime hijacking via memory poisoning, tool-chain manipulation, or multi-agent protocol attacks. To address this critical gap, we introduce the Agent-Native Immune System (ANIS), the first biologically inspired, endogenous defense architecture embedded directly within the agent's cognitive loop. Our framework presents four primary contributions. First, we design a six-layer Immune Tower (L0-L5), distinctly incorporating Barrier Immunity (L1) as a non-cognitive, physical-and-logical isolation layer. Second, we establish a unified taxonomy of Agent Viruses and Agent Vaccines, formalizing the critical distinction between superficial non-parametric defenses and robust parametric vaccines. Third, we conceptualize the Harness Triad--Meta, Self, and Auto--a self-monitoring, meta-cognitive automation backbone that drives Continual Immune Learning (CIL), enabling vaccines to dynamically adapt to novel threats. Finally, we establish a rigorous theoretical demarcation between model alignment and agent immunity: while alignment provides a static "constitutional" value foundation during training, ANIS serves as the dynamic "law enforcement" mechanism during runtime. We conclude by framing open challenges for the field, including immune protocol standardization, novel evaluation metrics such as the Autoimmunity Rate (false-positive intervention rate), and the co-evolutionary dynamics between pathogens and vaccines within collective intelligence ecosystems.

2606.28268 2026-06-29 cs.CV cs.AI New submission

Learning Topology-Aware Representations via Test-Time Adaptation for Anomaly Segmentation

通过测试时自适应学习拓扑感知表示用于异常分割

Ali Zia, Usman Ali, Abdul Rehman, Umer Ramzan, Kang Han, Muhammad Faheem, Shahnawaz Qureshi, Wei Xiang

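Affiliations: La Trobe University; GIFT University

AI summary: Proposes TopoTTA, a framework that integrates persistent homology into test-time adaptation and uses topological pseudo-labels to improve the structural consistency of anomaly segmentation, with an average 15% F1 improvement across six benchmarks.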

AI-generated Chinese abstract

测试时自适应(TTA)已成为缓解深度模型分布偏移的有前景范式。然而,现有的用于异常分割的TTA方法仍受限于对像素级启发式方法(如置信度阈值化或熵最小化)的依赖,这些方法在噪声和纹理变化下无法保持结构一致性。此外,它们通常将异常图视为平坦的强度场,忽略了表征复杂缺陷几何的高阶空间关系。我们引入了TopoTTA(拓扑测试时自适应),一种新颖框架,将持续同调(拓扑数据分析的工具)集成到TTA流程中,以在自适应过程中强制几何和结构一致性。通过对异常分数图应用多层次立方复形过滤,TopoTTA推导出鲁棒的拓扑伪标签,指导轻量级测试时分类器,从而在无需重新训练骨干模型的情况下提升分割质量。该方法避免了对特定方法的原始分数阈值化进行掩码二值化的依赖,保持了连通性,并泛化到2D和3D模态。在六个标准基准(MVTec AD、VisA、Real-IAD、MVTec 3D-AD、AnomalyShapeNet和MVTec LOCO)上的大量实验表明,与最先进的无监督异常检测和分割方法相比,平均F1提升了15%,在具有复杂几何或结构变化的异常上提升最大。这些发现表明,将拓扑推理集成到测试时自适应中为结构感知泛化提供了一条原则性途径,弥合了几何学习与鲁棒自适应之间的差距。

English abstract

Test-time adaptation (TTA) has emerged as a promising paradigm for mitigating distribution shifts in deep models. However, existing TTA approaches for anomaly segmentation remain limited by their reliance on pixel-level heuristics, such as confidence thresholding or entropy minimisation, which fail to preserve structural consistency under noise and texture variation. Moreover, they typically treat anomaly maps as flat intensity fields, ignoring the higher-order spatial relationships that characterise complex defect geometries. We introduce TopoTTA (Topological Test-Time Adaptation), a novel framework that integrates persistent homology, a tool from topological data analysis, into the TTA pipeline to enforce geometric and structural coherence during adaptation. By applying multi-level cubical complex filtration to anomaly score maps, TopoTTA derives robust topological pseudo-labels that guide a lightweight test-time classifier, enhancing segmentation quality without retraining the backbone model. The approach avoids reliance on method-specific raw-score thresholding for mask binarisation, preserves connectivity, and generalises across both 2D and 3D modalities. Extensive experiments across six standard benchmarks (MVTec AD, VisA, Real-IAD, MVTec 3D-AD, AnomalyShapeNet, and MVTec LOCO) demonstrate an average 15% F1 improvement over state-of-the-art unsupervised anomaly detection and segmentation methods, with the largest gains on anomalies exhibiting complex geometric or structural variations. These findings suggest that integrating topological reasoning into test-time adaptation provides a principled route to structure-aware generalisation, bridging the gap between geometric learning and robust adaptation.
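To give a feel for what a multi-level filtration of an anomaly score map tracks, the sketch below counts connected components of the superlevel set at several thresholds (a Betti-0 curve). It is a deliberately simplified stand-in for persistent homology, not the TopoTTA pipeline; the 4-connectivity choice, the function names, and the toy map are assumptions.

```python
def count_components(score_map, threshold):
    """Count 4-connected components of pixels whose score is >= threshold."""
    rows, cols = len(score_map), len(score_map[0])
    seen = [[False] * cols for _ in range(rows)]
    components = 0
    for r in range(rows):
        for c in range(cols):
            if seen[r][c] or score_map[r][c] < threshold:
                continue
            components += 1
            seen[r][c] = True
            stack = [(r, c)]
            while stack:  # depth-first flood fill over the current component
                y, x = stack.pop()
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx]
                            and score_map[ny][nx] >= threshold):
                        seen[ny][nx] = True
                        stack.append((ny, nx))
    return components


def betti0_curve(score_map, thresholds):
    """Number of components at each threshold, ordered from strict to permissive."""
    return [count_components(score_map, t) for t in thresholds]


toy_map = [
    [0.9, 0.8, 0.1, 0.0],
    [0.7, 0.2, 0.1, 0.6],
    [0.1, 0.1, 0.1, 0.7],
]
curve = betti0_curve(toy_map, [0.5, 0.15, 0.05])
```

Components that survive across many thresholds are the structurally stable regions; a pseudo-label built from them depends less on any single raw-score cutoff.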

2606.28266 2026-06-29 cs.CV New submission

RSICCLLM: A Multimodal Large Language Model for Remote Sensing Image Change Captioning

RSICCLLM:用于遥感图像变化描述的多模态大语言模型

Yelin Wang, Zijia Song, Shuo Ye, Chuanguang Yang, Miaoyu Wang, Yong Xu, Zhulin An, Yongjun Xu, Zitong Yu

Affiliations: State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences; Great Bay University; Harbin Institute of Technology, Shenzhen; Dongguan Key Laboratory for Intelligence and Information Technology

AI summary: Proposes RSICCLLM, the first post-training framework for large vision-language models on remote sensing image change captioning; with a data generation paradigm, difference-aware fine-tuning, and dual-negative preference optimization, it surpasses substantially larger models with only 7B parameters.

Comments Accepted by ECCV 2026

AI-generated Chinese abstract

遥感图像变化描述(RSICC)旨在描述双时相遥感图像之间的变化,具有重要的研究和应用价值。然而,现有方法大多依赖传统深度学习架构,有限的模型容量制约了性能。尽管大模型后训练技术在通用领域取得了巨大成功,但由于数据稀缺和细粒度变化理解的需求,直接迁移到RSICC仍面临挑战。为此,我们提出了RSICCLLM,这是首个针对RSICC的大视觉语言模型后训练框架。具体而言,我们设计了一种数据生成范式,发布了指令数据集RSICI,并建立了任务特定的RSICC基准。我们进一步引入了差异感知监督微调,以显式提取变化表示并引导模型感知和理解时间差异。此外,我们提出了双负偏好优化(DNPO),采用两种互补的负样本构建策略来构建偏好数据集RSICP,并进一步优化模型性能。大量实验验证了RSICCLLM的卓越能力,仅用7B参数就取得了显著成果,超越了更大规模的模型。代码和数据集将在https://this URL公开。

English abstract

Remote Sensing Image Change Captioning (RSICC) aims to describe changes between bi-temporal remote sensing images and holds significant research and application value. However, most existing methods rely on conventional deep learning architectures, and the limited model capacity constrains performance. Although large-model post-training techniques have achieved great success in general domains, their direct transfer to RSICC remains challenging due to data scarcity and the need for fine-grained change understanding. To address this, we propose RSICCLLM, the first post-training framework for large vision-language models in RSICC. Specifically, we design a data generation paradigm, release the instruction dataset RSICI, and establish a task-specific RSICC benchmark. We further introduce Difference-aware Supervised Fine-tuning to explicitly extract change representations and guide the model in perceiving and understanding temporal differences. In addition, we propose Dual-Negative Preference Optimization (DNPO), which employs two complementary negative-sample construction strategies to construct the preference dataset RSICP and further refine model performance. Extensive experiments validate the superior capability of RSICCLLM, which achieves outstanding results with only 7B parameters, surpassing models of substantially larger scales. The code and dataset will be made publicly available at this https URL.

2606.28237 2026-06-29 cs.RO New submission

Unleashing Infinite Motion: Scaling Expressive Quadrupedal Motion via Generative Video Priors

释放无限运动:通过生成式视频先验扩展富有表现力的四足运动

Youzhi Liu, Li Gao, Yifei Qian, Liu Liu, Yang Cai, Ziqiao Li

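Affiliations: Amap, Alibaba Group

AI summary: Proposes Uni-Mo, a fully automated pipeline that uses an LLM and a video diffusion model to generate quadruped robot motion data, extracts 3D trajectories with the help of an Identity Consistency Loss, trains tracking policies for a real robot, and releases a dataset of 7,488 language-annotated motions.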

AI-generated Chinese abstract

四足机器人已实现显著的运动能力,但其行为库仍局限于少数步态——远未达到人们长期设想的富有表现力、如伴侣般的存在。尝试引入大规模运动数据的人形机器人方案继承了一个默认假设:机器人运动必须首先通过动物身体,这使得数据收集依赖于合作动物,重建在不同物种间脆弱,且重定向在不兼容形态下不适定。我们提出Uni-Mo,一个全自动流水线,通过将数据稀缺重新定义为生成问题来移除动物环节:LLM提出运动提示,视频扩散模型合成相应的机器人行为,生成的视频被提升为3D参考轨迹,用于训练部署在真实Unitree Go2上的跟踪策略。为了使天真漂移的生成可靠地可提取,我们引入身份一致性损失,强制帧间外观一致性。我们在https://this URL发布Quad-Imaginarium,这是一个包含7488个语言标注四足运动(18.5小时)的开源数据集,涵盖杂技和表演行为。我们在真实Unitree Go2上验证了392个随机采样的运动,部署成功率为96.7%,并在仿真中对整个数据集进行了补充验证,成功率为97.6%。

English abstract

Quadruped robots have achieved remarkable locomotion, yet their behavioral repertoire remains confined to a few gaits--far from the expressive, companion-like presence long envisioned for them. Attempts to import the humanoid recipe of large-scale motion data have inherited one tacit assumption: that robot motion must first pass through an animal body, making data collection dependent on cooperative animals, reconstruction fragile across species, and retargeting ill-posed across incompatible morphologies. We propose Uni-Mo, a fully automated pipeline that removes the animal from the loop by reframing data scarcity as a generation problem: an LLM proposes motion prompts, a video diffusion model synthesizes the corresponding robot behaviors, and the generated videos are lifted into 3D reference trajectories used to train tracking policies deployed on a real Unitree Go2. To make naively-drifting generations reliably extractable, we introduce an Identity Consistency Loss that enforces appearance coherence across frames. We release Quad-Imaginarium at this https URL, the resulting open-source dataset of 7,488 language-annotated quadruped motions (18.5 hours) spanning acrobatic and performative behaviors. We validate 392 randomly sampled motions on a real Unitree Go2 with a 96.7% deployment success rate, complemented by a 97.6% success rate across the full dataset in simulation.

2606.28228 2026-06-29 cs.LG stat.ML New submission

Disentangling Continuous-Time Latent Dynamics: Identifiability of Latent SDEs via Diffusion Shifts

解缠连续时间潜在动力学:通过扩散偏移实现潜在SDE的可辨识性

Yuanyuan Wang, Wenjie Wang, Haoxuan Li, Mingming Gong, Kun Zhang

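Affiliations: Mohamed bin Zayed University of Artificial Intelligence; The University of Melbourne; Peking University; Carnegie Mellon University

AI summary: Addresses identifiability of continuous-time latent stochastic differential equation models: using environment-induced shifts in diffusion covariance, it shows that diagonal diffusion regimes identify the latent coordinates up to permutation and scaling without any sparsity assumption on the drift, and it proposes an estimator for latent disentanglement and causal graph recovery.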

AI-generated Chinese abstract

时间序列的因果表示学习在离散时间潜在因果模型中已建立了强大的可辨识性结果,但连续时间潜在随机微分方程(SDE)模型的可辨识性仍基本未解决。我们利用环境引起的扩散协方差偏移来解决这一空白。我们研究通过未知非线性微分同胚观测的加性噪声潜在SDE,其漂移共享但扩散协方差因环境而异。我们证明,两个具有逐对坐标方差比不同的对角扩散机制可以在没有漂移稀疏性假设的情况下,将潜在坐标识别至置换和缩放。我们首先针对线性Ornstein-Uhlenbeck系统证明这一结果,然后将其推广到一般的加性噪声潜在SDE。在温和光滑性条件下,瞬时漂移-雅可比因果图可识别至相同置换。我们提出了一种用于潜在解缠和可选图恢复的两阶段估计器;在合成系统上的实验证实了预测的可辨识性边界,而Hardanger大桥监测数据的应用展示了该方法在真实传感器轨迹上的效果。

English abstract

Causal representation learning for time series has developed strong identifiability results in discrete-time latent causal models, but identifiability in continuous-time latent stochastic differential equation (SDE) models remains largely open. We address this gap using environment-induced shifts in diffusion covariance. We study additive-noise latent SDEs observed through an unknown nonlinear diffeomorphism, with shared drift but environment-specific diffusion covariance. We show that two diagonal diffusion regimes with pairwise distinct coordinate-wise variance ratios identify the latent coordinates up to permutation and scaling, without any sparsity assumption on the drift. We first prove this result for linear Ornstein--Uhlenbeck systems and then extend it to general additive-noise latent SDEs. Under mild smoothness, the instantaneous drift-Jacobian causal graph is identifiable up to the same permutation. We propose a two-stage estimator for latent disentanglement and optional graph recovery; experiments on synthetic systems confirm the predicted identifiability boundary, and an application to Hardanger Bridge monitoring data illustrates the approach on real sensor trajectories.
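The setting described above can be written compactly. The notation below is an assumption made for illustration (the abstract does not fix symbols): z_t is the latent state, f the shared drift, g the unknown diffeomorphism, e the environment index, and Sigma_e the environment-specific diagonal diffusion covariance.

```latex
\begin{aligned}
dz_t &= f(z_t)\,dt + \Sigma_e^{1/2}\,dW_t, \qquad x_t = g(z_t), \qquad e \in \{1, 2\},\\
\Sigma_e &= \operatorname{diag}\!\left(\sigma_{e,1}^2, \dots, \sigma_{e,d}^2\right), \qquad
\frac{\sigma_{1,i}^2}{\sigma_{2,i}^2} \neq \frac{\sigma_{1,j}^2}{\sigma_{2,j}^2} \quad \text{for } i \neq j,\\
\hat{z}_t &= P D\, z_t .
\end{aligned}
```

Here P is a permutation matrix and D an invertible diagonal matrix, which is one way to express "up to permutation and scaling". The linear Ornstein-Uhlenbeck case corresponds to f(z) = Az for some matrix A.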

2606.28226 2026-06-29 cs.CV cs.AI New submission

Exposure Bias Can Alleviate Itself via Directional and Frequency Rectification in Flow Matching

暴露偏差可以通过流匹配中的方向和频率校正自我缓解

Guanbo Huang, Jingjia Mao, Fanding Huang, Fengkai Liu, Xiangyang Luo, Yaoyuan Liang, Jiasheng Lu, Xiaoe Wang, Pei Liu, Ruiliu Fu, Ruqi Huang, Shao-Lun Huang

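Affiliations: Tsinghua Shenzhen International Graduate School, Tsinghua University; Central Media Technology Institute, Huawei

AI summary: Proposes the DEFAR framework, which rectifies exposure bias using directional and frequency-adaptive signals from the bias itself; Anti-Drift Rectification and Frequency Compensation improve model robustness, and the method outperforms existing approaches on multiple datasets.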

Comments arXiv admin note: text overlap with arXiv:2512.04904 (https://arxiv.org/abs/2512.04904)

AI-generated Chinese abstract

流匹配(FM)在生成任务中取得了显著性能,但由于训练和推理之间的差异,它遭受暴露偏差。现有的缓解策略通常依赖于静态约束或外部启发式方法。在这项工作中,我们提出暴露偏差本身包含动态信号,可以指导其自身的校正。为了利用这一点,我们引入了DEFAR(方向-频率自适应校正)框架。该框架在训练期间模拟单步推理过程以识别暴露偏差。它利用来自偏差本身的方向和频率自适应反馈信号来增强模型的偏差容忍度。它由两个关键组件组成:(1)抗漂移校正(ADR)。ADR将推理时的漂移视为信号,学习将偏离状态引导回目标的方向。ADR赋予模型内在的主动自我校正能力;(2)频率补偿(FC)。经验上,我们观察到累积偏差通常源于高噪声阶段缺乏低频分量,而暴露偏差携带缺失的频率。FC利用偏差本身作为自我反馈加权因子来增强缺失的频率分量。在CIFAR-10、CelebA-64和ImageNet-256/512上的实验表明,DEFAR优于先前的基线,并进一步展示了良好的可扩展性、兼容性和推理鲁棒性。

English abstract

Flow Matching (FM) has achieved remarkable generative performance, yet it suffers from exposure bias due to discrepancies between training and inference. Existing mitigation strategies typically rely on static constraints or external heuristics. In this work, we propose that exposure bias itself inherently contains dynamic signals that can guide its own rectification. To leverage this, we introduce DEFAR (DirEctional-Frequency Adaptive Rectification). This framework simulates the single-step inference process during training to identify exposure bias. It utilizes directional and frequency-adaptive feedback signals from the bias itself to enhance the model's bias tolerance. It consists of two key components: (1) Anti-Drift Rectification (ADR). ADR treats inference-time drift as a signal to learn the direction to steer deviated states back toward the target. ADR endows the model with intrinsic active self-rectification capabilities; (2) Frequency Compensation (FC). Empirically, we observe that accumulated bias often stems from a lack of low-frequency components in high-noise stages, and exposure bias carries the missing frequency. FC leverages the bias itself as a self-feedback weighting factor to reinforce the missing frequency components. Experiments on CIFAR-10, CelebA-64, and ImageNet-256/512 show that DEFAR outperforms prior baselines and further demonstrates favorable scalability, compatibility, and inference robustness.
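The idea of simulating a single inference step during training can be illustrated on the standard linear flow-matching path x_t = (1 - t) x0 + t x1, whose target velocity is x1 - x0. The sketch below takes one Euler step with an imperfect velocity, measures the resulting drift from the true path, and computes a velocity that would steer the drifted state back to the target. The steering formula and all names are assumptions for illustration; this is not the DEFAR loss.

```python
def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1 - t) * a + t * b for a, b in zip(x0, x1)]


def simulate_single_step(x0, x1, t, dt, velocity_fn):
    """Take one Euler step with the model velocity and compare with the true path."""
    x_t = interpolate(x0, x1, t)
    v = velocity_fn(x_t, t)
    x_pred = [x + dt * vi for x, vi in zip(x_t, v)]
    x_true = interpolate(x0, x1, t + dt)
    drift = [p - q for p, q in zip(x_pred, x_true)]  # exposure-bias signal
    remaining = 1.0 - (t + dt)
    # Assumed correction target: velocity that reaches x1 from the drifted state.
    steer = [(b - p) / remaining for p, b in zip(x_pred, x1)]
    return x_pred, drift, steer


def biased_velocity(x, t):
    """Hypothetical imperfect model; the true velocity here would be [1.0, 2.0]."""
    return [0.8, 2.0]


x_pred, drift, steer = simulate_single_step(
    x0=[0.0, 0.0], x1=[1.0, 2.0], t=0.5, dt=0.25, velocity_fn=biased_velocity
)
```

In this toy case the first coordinate lags the true path, so the steering velocity for that coordinate is larger than the nominal target velocity.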

2606.28220 2026-06-29 cs.LG New submission

Physics-Informed Neural Network with Transfer Learning for State Estimation in Lithium-Ion Batteries using the Single Particle Model with Electrolyte

基于物理信息神经网络与迁移学习的锂离子电池状态估计:使用含电解质的单粒子模型

Gift Modekwe, Qiugang Lu

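Affiliations: Texas Tech University

AI summary: Proposes a framework combining physics-informed neural networks with transfer learning for state estimation in the single particle model with electrolyte (SPMe) of lithium-ion batteries; pretraining followed by fine-tuning yields fast convergence and accurate voltage prediction.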

AI-generated Chinese abstract

物理信息神经网络(PINNs)已成为求解非线性偏微分方程(PDEs)的强大工具,包括电池电化学模型。它们通常在损失函数中强制执行守恒定律,以确保物理一致性的解。传统的数值方法如有限差分、有限体积和有限元技术依赖于离散化,对于非线性系统计算成本高昂。为应对这一挑战,PINNs提供了更好的可扩展性,特别是对于降阶模型如含电解质的单粒子模型(SPMe)。SPMe通过耦合扩散、传输、反应动力学和电压方程描述锂离子电池动力学。尽管有这些优势,针对不同电池化学体系或操作条件从头训练基于SPMe的PINNs要求高且常导致收敛缓慢。为克服这一限制,本文引入了一种用于SPMe-PINNs的迁移学习框架。模型首先预训练以学习一般电化学动力学,然后通过迁移权重、冻结选定层并微调剩余参数(包括估计关键电化学变量)来适应目标电池。使用PyBaMM进行的验证表明,该方法能准确预测电压,表明所提方法在保持电化学一致性的同时减少了训练时间,并实现了跨电池的高效泛化。

English abstract

Physics-informed neural networks (PINNs) have emerged as a powerful tool for solving nonlinear partial differential equations (PDEs), including battery electrochemical models. They typically enforce conservation laws within the loss function to ensure physically consistent solutions. Traditional numerical methods, such as finite difference, finite volume, and finite element techniques, rely on discretization and can be computationally expensive for nonlinear systems. To address this challenge, PINNs offer improved scalability, particularly for reduced-order models like the single particle model with electrolyte (SPMe). The SPMe describes lithium-ion battery dynamics through coupled diffusion, transport, reaction kinetics, and voltage equations. Despite these advantages, training SPMe-based PINNs from scratch for different battery chemistries or operating conditions is demanding and often leads to slow convergence. To overcome this limitation, this work introduces a transfer learning framework for SPMe-PINNs. The model is first pretrained to learn general electrochemical dynamics and then adapted to a target battery by transferring weights, freezing selected layers, and fine-tuning the remaining parameters, including estimating key electrochemical variables. Validation using PyBaMM demonstrates accurate voltage prediction, indicating that the proposed approach preserves electrochemical consistency while reducing training time and enabling efficient generalization across batteries.
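The transfer step (copy pretrained weights, freeze some layers, fine-tune the rest together with an estimated physical parameter) can be sketched without any deep learning library. Parameters are scalars here for brevity, and every name, including the hypothetical diffusivity-like parameter `D_s`, is an assumption rather than the paper's setup.

```python
def fine_tune_step(params, grads, frozen, lr):
    """One gradient-descent step that leaves frozen parameters untouched."""
    return {
        name: value if name in frozen else value - lr * grads[name]
        for name, value in params.items()
    }


# Weights transferred from a (hypothetical) pretrained SPMe-PINN, plus a
# trainable physical parameter to be estimated for the target battery.
pretrained = {"layer1.w": 1.0, "layer2.w": 2.0, "D_s": 0.5}
grads = {"layer1.w": 0.5, "layer2.w": 1.0, "D_s": -1.0}
frozen = {"layer1.w"}  # early layer kept fixed during adaptation

adapted = fine_tune_step(pretrained, grads, frozen, lr=0.5)
```

Only `layer2.w` and `D_s` move, which is why adaptation needs fewer updates than training every parameter from scratch.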

2606.28217 2026-06-29 cs.LG cs.AI cs.DC cs.MA New submission

Towards Value-Constrained Credit Assignment in Fully Delegated AI Cooperatives

完全委托的AI合作社中面向价值约束的信用分配

Young Yoon, Jimin Kim, Soyeon Park

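Affiliations: Hongik University

AI summary: Proposes screening updates against heterogeneous value constraints in fully delegated AI cooperatives and using traversal learning for fine-grained credit assignment, addressing contribution attribution problems in data valuation and federated learning.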

AI-generated Chinese abstract

我们提出了一个在完全委托的AI合作社中的奖励分配框架,其中人类由贡献数据并在异质价值约束下参与模型更新的代理代表。关键思想是仅对那些在根据每个委托人的价值档案筛选后仍然可接受的更新进行信用分配。我们在遍历学习(TL)基板上制定了价值条件梯度过滤、在线边际贡献信号和累积收益结算。TL在这里特别有吸引力,因为它执行去中心化反向传播,而没有与聚合中心化分布式学习相关的质量损失,并且我们认为,通过保留显式遍历和梯度路径,它提供了比FedAvg式联邦学习更精细的归属基板。该框架与数据估值、联邦贡献估计、个性化联邦学习和多元对齐进行了对比。

English abstract

We propose a framework for reward allocation in fully delegated AI cooperatives where humans are represented by agents that contribute data and participate in model updates under heterogeneous value constraints. The key idea is to credit only those updates that remain admissible after screening them against each principal's value profile. We formulate value-conditioned gradient filtering, online marginal contribution signals, and cumulative revenue settlement within a traversal learning (TL) substrate. TL is especially attractive here because it performs decentralized backpropagation without the quality loss associated with aggregation-centric distributed learning and, we argue, offers a finer attribution substrate than FedAvg-style federated learning by preserving explicit traversal and gradient paths. The framework is positioned against data valuation, federated contribution estimation, personalized federated learning, and pluralistic alignment.
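A toy version of "credit only admissible updates" might look as follows. It assumes that an update must pass every principal's value profile, that profiles are sets of forbidden tags, and that revenue is split in proportion to marginal gain; all three are illustrative assumptions, as are the names.

```python
def is_admissible(update_tags, profile):
    """An update passes a profile if it carries none of the forbidden tags."""
    return not (set(update_tags) & set(profile["forbidden"]))


def settle_credit(updates, profiles, revenue):
    """Split revenue among agents in proportion to admissible marginal gains."""
    admissible = [
        u for u in updates
        if all(is_admissible(u["tags"], p) for p in profiles.values())
    ]
    payout = {name: 0.0 for name in profiles}
    total_gain = sum(u["marginal_gain"] for u in admissible)
    if total_gain <= 0:
        return payout
    for u in admissible:
        payout[u["agent"]] += revenue * u["marginal_gain"] / total_gain
    return payout


profiles = {
    "alice": {"forbidden": {"medical"}},
    "bob": {"forbidden": set()},
}
updates = [
    {"agent": "alice", "tags": ["retail"], "marginal_gain": 3.0},
    {"agent": "bob", "tags": ["retail"], "marginal_gain": 1.0},
    {"agent": "bob", "tags": ["medical"], "marginal_gain": 5.0},  # screened out
]
payout = settle_credit(updates, profiles, revenue=100.0)
```

Bob's large but inadmissible update earns nothing, so the settlement reflects only contributions that survive the value screen.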

2606.28215 2026-06-29 cs.CV cs.AI cs.GR New submission

HAT-4D: Lifting Monocular Video for 4D Multi-Object Interactions via Human-Agent Collaboration

HAT-4D: 通过人机协作从单目视频提升4D多物体交互

Jiaxin Li, Yuxiang Wu, Zhenkai Zhang, Xinrui Shi, Haoyuan Wang, Yichen Zhao, Su Linxiang, Chenyang Yu, Mingyu Zhang, Yifan Ding, Boran Wen, Li Zhang, Ruiyang Liu, Yong-Lu Li

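Affiliations: Shanghai Jiao Tong University; Shanghai Innovation Institute; University of Science and Technology of China; Math Magic

AI summary: Proposes the HAT-4D framework, which combines VLMs with multi-level human feedback to reconstruct the 3D geometry, temporal dynamics, and physical interactions of multiple objects from monocular video, resolving occlusions and depth ambiguities without multi-camera systems.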

Comments Accepted to ECCV 2026. 15 pages of main text and 39 pages of appendices. Project page: this https URL (https://lijiaxin0111.github.io/HAT4D/)

AI-generated Chinese abstract

从大规模、野外单目视频中提取动态4D物体交互,为扩展具身AI和训练VLA提供了一条高效的数据收集途径。然而,现有的单目4D重建方法主要关注孤立物体,在多物体交互中常因严重遮挡和复杂动态而失败。为弥补这一差距,我们提出HAT-4D,这是首个旨在从单个视频中重建多个物体的3D几何、时序动态和物理交互的智能体框架。通过将VLM与多级人在环反馈机制相结合,HAT-4D在3D生成和4D传播过程中有效解决了深度歧义和交互引起的遮挡,无需依赖昂贵的多相机设备即可生成物理上合理的资产。作为一个可扩展的数据引擎,HAT-4D促进了MVOIK-4D的创建,这是一个用于单目4D交互重建的开放世界基准,并附带了一个新颖的多维评估协议,重点关注物理合理性和时序一致性。大量实验表明,HAT-4D在大多数评估指标上达到了SOTA性能,同时保持了具有竞争力的语义对齐。消融研究表明,引入少量人工反馈可改善交互重建。此外,HAT-4D生成的数据在用于微调时有效提高了基线性能。我们的数据和代码可在该网址获取。

English abstract

Extracting dynamic 4D object interactions from massive, in-the-wild monocular videos offers a highly efficient data collection pathway for scaling Embodied AI and training VLAs. However, existing monocular 4D reconstruction methods primarily focus on isolated objects, often failing under the severe occlusions and complex dynamics inherent in multi-object interactions. To bridge this gap, we propose HAT-4D, the first agentic framework designed to reconstruct the 3D geometry, temporal dynamics, and physical interactions of multiple objects from a single video. By integrating VLMs with a multi-level human-in-the-loop feedback mechanism, HAT-4D efficiently resolves depth ambiguities and interaction-induced occlusions during 3D generation and 4D propagation, yielding physically plausible assets without relying on expensive multicamera rigs. As a scalable data engine, HAT-4D facilitates the creation of MVOIK-4D, an open-world benchmark for monocular 4D interaction reconstruction, accompanied by a novel multi-dimensional evaluation protocol focused on physical plausibility and temporal consistency. Extensive experiments demonstrate that HAT-4D achieves SOTA performance on most evaluation metrics, while maintaining competitive semantic alignment. Ablation studies show that introducing a small amount of human feedback improves interaction reconstruction. Moreover, the data produced by HAT-4D effectively improves baseline performance when used for fine-tuning. Our data and code are available at this https URL

2606.28194 2026-06-29 cs.LG New submission

COCOLogic-V2: Identifying Logical Inconsistencies via Truly Hard-Negatives

COCOLogic-V2: 通过真正的困难负样本识别逻辑不一致性

David Steinmann, Antonia Wüst, Kristian Kersting, Wolfgang Stammer

Affiliations: Computer Science Department, TU Darmstadt; Hessian Center for AI (hessian.AI); German Research Center for AI (DFKI); Max Planck Institute for Informatics, SIC; RTG Neuroexplicit Models

AI summary: Proposes the COCOLogic-V2 dataset, which categorizes samples into positives, near-boundary negatives, and far-from-boundary negatives to evaluate visual inductive reasoning on real-world images; models perform poorly on near-boundary samples, and visual inductive reasoning remains an open challenge.

AI-generated Chinese abstract

尽管可解释模型(如概念瓶颈模型(CBM)和程序合成方法)能够验证模型决策,但其评估通常局限于简单任务,使得真实世界图像上的复杂推理在很大程度上未被探索。我们引入了COCOLogic-V2,这是一个面向对象的数据集,用于真实世界图像上的视觉归纳推理,涵盖一阶逻辑的广泛子集。通过将样本分类为正变体、近边界(NB)和远边界(FB)负样本,COCOLogic-V2能够对模型的可解释性进行细粒度诊断。我们的评估表明,模型能够很好地区分正样本和FB样本,但在NB样本上失败,而感知噪声和规则诱导的大搜索空间在少样本设置中带来了额外挑战。这些结果共同表明,视觉归纳推理仍然是一个开放的挑战,COCOLogic-V2为推进这一方向的方法提供了具体基础。

English abstract

While interpretable models such as concept bottleneck models (CBMs) and program synthesis methods enable verification of model decisions, their evaluation is typically limited to simple tasks, leaving complex reasoning on real-world images largely unexplored. We introduce COCOLogic-V2, an object-centric dataset for visual inductive reasoning on real-world images covering a broad subset of first-order logic. By categorizing samples into positive variants, near-boundary (NB), and far-from-boundary (FB) negatives, COCOLogic-V2 enables fine-grained diagnosis of model accountability. Our evaluations show that models tend to separate positive and FB samples well but fail on NB samples, while perceptual noise and large rule-induced search spaces pose additional challenges in few-shot settings. Together, these results highlight that visual inductive reasoning remains an open challenge and COCOLogic-V2 provides a concrete foundation for advancing methods in this direction.

2606.28192 2026-06-29 cs.RO New submission

PA-BiCoop: A Primary-Auxiliary Cooperative Framework for General Bimanual Manipulation

PA-BiCoop: 一种通用双臂操作的主辅协作框架

Bai Qicheng, Wang Ziru, Ma Teli, Dai Guang, Wang Jingdong, Wang Mengmeng

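Affiliations: SGIT AI Lab, State Grid Corporation of China; The Hong Kong University of Science and Technology, Guangzhou; Baidu Research, Beijing, China; Zhejiang University of Technology, Hangzhou

AI summary: Proposes the PA-BiCoop framework, which achieves coordinated bimanual manipulation through dynamic primary-auxiliary arm role assignment and a shared-encoder, dual-decoder structure, outperforming baselines by 48% on average in simulation and by over 50% on average in real-world tasks.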

Comments ICRA2026

AI-generated Chinese abstract

双臂操作对于先进机器人系统至关重要,因为与单臂配置相比,它提供了更高的效率和灵活性。然而,现有方法要么缺乏臂间交互,要么忽略了动态分工的需求,将双臂视为功能等效。为了解决这些限制,本文从人类双臂操作中汲取灵感,其中一只手臂处理核心操作,另一只提供辅助支持,并提出了PA-BiCoop,一种新的具有动态主辅臂区分的单模型双臂协作框架。PA-BiCoop将机械臂分为主臂和辅臂,其角色在任务阶段中自适应调整,采用两个共享全局特征编码器的专用解码器:主解码器生成主臂的基坐标姿态和核心任务可操作度热图,辅解码器输出辅臂在主臂坐标系中的相对姿态。此外,我们设计了一个动态角色分配模块,自动将角色映射到左/右臂,无需手动预定义。这种设计促进了臂间知识共享和协调操作。大量实验表明,我们的PA-BiCoop实现了优越的性能:在RLBench2仿真任务中平均比最先进的基线高出48%,在真实世界任务中平均高出50%以上,从而验证了其在双臂操作中的有效性和先进性。

English abstract

Bimanual manipulation is essential for advanced robotic systems because it offers higher efficiency and flexibility compared to single-arm configurations. However, existing approaches either lack inter-arm interaction or ignore the need for a dynamic division of labor, treating the arms as functionally equivalent. To address these limitations, this paper draws inspiration from human bimanual manipulation where one arm handles core operations and the other provides auxiliary support, and proposes PA-BiCoop, a new single-model bimanual cooperation framework with dynamic primary-auxiliary arm differentiation. PA-BiCoop categorizes robotic arms into primary and auxiliary arms with adaptively adjustable roles across task stages, employs two specialized decoders that share a global feature encoder: the primary decoder generates the primary arm's base-coordinate pose and core-task affordance heatmaps, and the auxiliary decoder outputs the auxiliary arm's relative pose in the primary arm's coordinate system. Moreover, we design a dynamic role assignment module to automatically map roles to left/right arms without manual pre-definition. This design facilitates inter-arm knowledge sharing and coordinated manipulation. Extensive experiments demonstrate that our PA-BiCoop achieves superior performance: it outperforms state-of-the-art baselines by 48% on average in RLBench2 simulation tasks and by over 50% on average in real world tasks, thereby verifying its effectiveness and advancement in bimanual manipulation.
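Because the auxiliary decoder is described as predicting a pose relative to the primary arm, a downstream controller has to compose the two poses to obtain the auxiliary arm's target in the base frame. The sketch below shows that composition for planar poses (x, y, heading); reducing the problem to 2D and the function name are simplifying assumptions, since a real system would use full 6-DoF transforms.

```python
import math


def compose(primary_in_base, aux_in_primary):
    """Express a pose given in the primary arm's frame in the base frame."""
    px, py, p_theta = primary_in_base
    ax, ay, a_theta = aux_in_primary
    c, s = math.cos(p_theta), math.sin(p_theta)
    # Rotate the relative offset into the base frame, then translate.
    return (px + c * ax - s * ay, py + s * ax + c * ay, p_theta + a_theta)


# Primary end effector at (1, 2), rotated 90 degrees; the auxiliary target is
# one unit ahead of it along the primary's own x-axis.
primary_pose = (1.0, 2.0, math.pi / 2)
relative_pose = (1.0, 0.0, 0.0)
aux_in_base = compose(primary_pose, relative_pose)
```

Predicting the relative pose keeps the auxiliary target consistent with the primary arm even when the primary pose changes, which is one plausible reading of why the two decoders are coupled this way.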

2606.28179 2026-06-29 cs.LG cs.AI New submission

CPAgents: Agentic Composite Phenotype Generation for Cardiac Disease Association

CPAgents: 用于心脏疾病关联的智能复合表型生成

Zuoou Li, Wenlong Zhao, Kelly Yu, Weitong Zhang, Paul M. Matthews, Wenjia Bai, Bernhard Kainz, Mengyun Qiao

Affiliations: Department of Mechanical Engineering, University College London; CSIG Group, Tencent; Department of Computing, Imperial College London; Department of Brain Sciences, Imperial College London; Data Science Institute, Imperial College London; FAU Erlangen–Nürnberg; UK Dementia Research Institute, Imperial College London; Rosalind Franklin Institute

AI summary: Proposes the CPAgents framework, in which multiple collaborating agents automatically construct interpretable composite phenotypes (e.g., polynomial, ratio, and interaction forms), markedly improving disease discrimination on a population-scale cardiac imaging cohort.

Comments Accepted to MICCAI 2026

AI-generated Chinese abstract

识别心脏影像表型与临床疾病之间的稳健关联是群体规模心血管研究和可靠风险分层的基础。然而,当前的表型组关联研究依赖于预定义的单一变量表型或专家设计的特征,这限制了其捕捉临床有意义的非线性效应和跨表型交互的能力。为解决这一问题,我们提出了CPAgents,一种用于心血管表型组关联研究(PheWAS)的迭代表型组合框架,该框架从基础影像特征自动构建并验证可解释的复合表型(例如多项式、比值和交互形式)。具体来说,我们的系统协调三个智能体:(i)分析师,识别统计异常并提名候选变换;(ii)提议者,在数值安全规则下生成受约束的、具有医学和统计动机的表达式;(iii)验证者,使用多阶段标准评估候选者,并为接受的表型生成透明的证据链。在群体规模心脏影像队列上的评估表明,发现的复合表型显著改善了疾病判别:在72个分类器-疾病-指标组合中,我们的变体在56个案例中达到最高排名,而基线仅为18个,所有九个临床疾病类别均观察到增益。我们的框架生成紧凑、临床可解释的表型公式并附带透明证据链,使得能够超越专家驱动的特征选择,可扩展地发现更强的表型-疾病关联。

English abstract

Identifying robust associations between cardiac imaging phenotypes and clinical diseases is fundamental to population-scale cardiovascular research and reliable risk stratification. However, current phenome-wide association studies rely on pre-defined, single-variable phenotypes or expert-crafted features, which limits their ability to capture clinically meaningful non-linear effects and cross-phenotype interactions. To address this, we propose CPAgents, an iterative phenotype-Composition framework for cardiovascular Phenome-wide association study (PheWAS) that automatically constructs and validates interpretable composite phenotypes (e.g., polynomial, ratio, and interaction forms) from base imaging features. Specifically, our system coordinates three agents: (i) an Analyst that identifies statistical pathologies and nominates candidate transformations; (ii) a Proposer that generates constrained, medically and statistically motivated expressions under numerical safety rules; and (iii) a Verifier that evaluates candidates using multi-stage criteria and produces transparent evidence trails for accepted phenotypes. Evaluated on a population-scale cardiac imaging cohort, the discovered composite phenotypes markedly improve disease discrimination: across 72 classifier-disease-metric combinations, our variants achieve the top rank in 56 cases versus 18 for baselines, with gains observed across all nine clinical disease categories. Our framework yields compact, clinically interpretable phenotype formulas with transparent evidence trails, enabling scalable discovery of stronger phenotype-disease associations beyond expert-driven feature selection.
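As an illustration of what a composite phenotype built under a numerical safety rule could look like, the sketch below forms a ratio and an interaction term from two base features and refuses to divide by near-zero denominators. The feature names, the epsilon guard, and the use of `None` for undefined values are assumptions, not the rules used by CPAgents.

```python
def safe_ratio(numerator, denominator, eps=1e-6):
    """Element-wise ratio; entries with a near-zero denominator become None."""
    return [
        n / d if abs(d) > eps else None
        for n, d in zip(numerator, denominator)
    ]


def interaction(feature_a, feature_b):
    """Element-wise product of two base features."""
    return [a * b for a, b in zip(feature_a, feature_b)]


# Hypothetical base imaging features for three subjects.
lv_mass = [100.0, 120.0, 90.0]
lv_volume = [200.0, 0.0, 180.0]

mass_to_volume = safe_ratio(lv_mass, lv_volume)
mass_times_volume = interaction(lv_mass, lv_volume)
```

Marking undefined entries explicitly, rather than letting them become infinities, keeps downstream association statistics from being dominated by a few degenerate subjects.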

2606.28166 2026-06-29 cs.AI New submission

Tandem Reinforcement Learning with Verifiable Rewards

具有可验证奖励的串联强化学习

Difan Jiao, Raghav Singhal, Robert West, Ashton Anderson

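Affiliations: University of Toronto; EPFL

AI summary: Proposes Tandem Reinforcement Learning (TRL), in which a stronger and a weaker model alternate in generating the reasoning chain and are rewarded jointly, improving cross-model compatibility and legibility while preserving solo reasoning capability.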

Comments 21 pages, 7 figures, 8 tables

AI-generated Chinese abstract

具有可验证奖励的强化学习(RLVR)显著提升了大语言模型的推理能力,在竞赛数学等领域达到专家甚至超人水平。然而,较弱智能体和人类能否实际利用这种能力远不确定,RLVR被记录为导致推理向不良模式漂移,如可读性差和语言混合。串联训练是最近引入的针对此兼容性问题的范式:训练更强的资深模型与冻结的较弱初级模型共同生成每次轨迹,两者作为团队获得奖励,从而推动资深模型以初级模型可跟随的方式推理。但该范式迄今仅在概念验证环境中展示,尚不清楚它是否能扩展到现代RLVR管道的长思维链。在这项工作中,我们提出串联强化学习(TRL),将串联训练范式引入RLVR。在TRL中,资深模型和冻结的初级模型随机交替共同生成推理,生成的推理获得奖励,并对资深模型应用标准GRPO损失。在竞赛数学上训练Qwen3-4B-Instruct,我们发现TRL在独立推理能力上与普通GRPO相当,同时从相同的轨迹结构中涌现出三个特性:与初级模型更强的交接鲁棒性、减少与初级模型的分布漂移,以及更易于初级模型理解的思维链。我们的结果展示了RLVR在多模型通信和人类兼容性方面具有实际收益的有前景的路径。

English abstract

Reinforcement learning with verifiable rewards (RLVR) has significantly improved the reasoning capability of large language models, reaching expert or even superhuman performance in domains such as competition math. However, whether weaker agents and humans can actually harness this capability is far less certain, with RLVR documented to drift reasoning toward idiosyncratic patterns such as poor readability and language mixing. Tandem training is a recently introduced paradigm that targets this compatibility problem: a trained, stronger senior co-generates each rollout with a frozen, weaker junior, and the two are rewarded as a team, so the senior is pushed to reason in ways the junior can follow. Yet this paradigm has so far been demonstrated only in proof-of-concept settings, leaving open whether it scales to the long chains of thought of the modern RLVR pipeline. In this work, we propose Tandem Reinforcement Learning (TRL), which carries the tandem training paradigm into RLVR. In TRL, the senior and a frozen junior alternate stochastically to co-generate the reasoning, the resulting generation is rewarded, and the standard GRPO loss is applied to the senior. Training Qwen3-4B-Instruct on competition math, we find that TRL matches vanilla GRPO on solo reasoning capability while three properties emerge together from the same rollout structure: stronger handoff robustness with the junior, reduced distributional drift from the junior, and a chain-of-thought more legible to the junior. Our results demonstrate a promising route for RLVR with practical payoffs in multi-model communication and human compatibility.
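Two ingredients named above can be sketched in a few lines: a stochastic schedule deciding which model writes each segment of a rollout, and the group-relative advantage that GRPO computes from the rewards of a group of rollouts for the same prompt. Segment-level alternation, the probability `p_senior`, and the loss mask restricted to senior-written segments are assumptions about details the abstract does not specify.

```python
import random
import statistics


def tandem_schedule(num_segments, p_senior, rng):
    """Randomly assign each rollout segment to the senior or the junior."""
    return ["senior" if rng.random() < p_senior else "junior"
            for _ in range(num_segments)]


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: reward standardized within its rollout group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


rng = random.Random(0)
schedule = tandem_schedule(num_segments=6, p_senior=0.5, rng=rng)
# Assumed: only segments written by the trainable senior contribute to the loss.
senior_mask = [writer == "senior" for writer in schedule]

# Verifiable rewards (1 = correct final answer) for four co-generated rollouts.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Since the reward is attached to the jointly produced rollout, the senior is credited only when its segments leave the junior able to continue toward a correct answer.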

2606.28164 2026-06-29 cs.CV cs.LG 新提交

EchoSonar-R: A Multi-View Reasoning-Enabled Model for Disease Classification and Report Generation in Echocardiography

EchoSonar-R: 一种用于超声心动图疾病分类和报告生成的多视图推理增强模型

Darya Taratynova, Ahmed Aly, Numan Saeed, Mohammad Yaqub

发表机构 * Division of Computing and Mathematical Sciences, Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI), Abu Dhabi, UAE(穆罕默德·本·扎耶德人工智能大学计算与数学科学系,阿布扎比,阿联酋)

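AI summary: Proposes EchoSonar-R, a multi-view reasoning vision-language model that combines a spatiotemporal video encoder with a structure-aware cardiac detector. Two-stage training (supervised fine-tuning followed by Group Relative Policy Optimization) jointly yields multi-label disease classification and report generation, improving macro balanced accuracy by 17.1% on the private dataset and 6.1% on MIMICEchoQA.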

详情
AI中文摘要

超声心动图是最广泛使用的非侵入性心脏成像模态,为心血管诊断提供基本信息。解读超声心动图需要综合多个心脏视图的互补证据,以识别异常并生成结构化临床报告。尽管近期工作侧重于提高分类性能,但大多数模型缺乏明确的诊断推理和空间定位的解剖证据,限制了临床医生的信任。我们提出EchoSonar-R,一种多视图推理增强的视觉语言模型,可联合执行多标签疾病分类和超声心动图研究报告生成。EchoSonar-R将时空视频编码器与结构感知心脏检测器相结合,后者提供空间定位的解剖线索,以提高跨视图推理过程中的可解释性和临床医生信任。EchoSonar-R分两个阶段训练:首先在推理标注目标上进行监督微调(SFT),然后通过组相对策略优化(GRPO)结合任务特定奖励,在统一强化学习框架内联合对齐分类和报告生成。在私有多视图数据集和两个公共基准上,EchoSonar-R在私有集上比最强基线提升宏平衡准确率17.1%,在MIMICEchoQA上提升6.1%,达到0.800的GREEN临床忠实度分数,并生成基于多视图视觉证据的可解释推理轨迹。

英文摘要

Echocardiography is the most widely used non-invasive cardiac imaging modality, providing essential information for cardiovascular diagnosis. Interpreting an echocardiogram requires synthesizing complementary evidence across multiple heart views to identify abnormalities and produce structured clinical reports. While recent efforts focus on improving classification performance, most models lack explicit diagnostic reasoning and spatially grounded anatomical evidence, limiting clinician trust. We present EchoSonar-R, a multi-view reasoning-enabled vision-language model that jointly performs multi-label disease classification and report generation from echocardiography studies. EchoSonar-R combines a spatiotemporal video encoder with a structure-aware cardiac detector that provides spatially grounded anatomical cues to improve interpretability and clinician trust during cross-view reasoning. EchoSonar-R is trained in two stages: supervised fine-tuning (SFT) on reasoning-annotated targets, followed by Group Relative Policy Optimization (GRPO) with task-specific rewards that jointly align classification and report generation within a unified reinforcement-learning framework. Across a private multi-view dataset and two public benchmarks, EchoSonar-R improves macro balanced accuracy by 17.1% on the private set and 6.1% on MIMICEchoQA over the strongest baseline, achieves a GREEN clinical faithfulness score of 0.800, and produces interpretable reasoning traces grounded in multi-view visual evidence.
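For readers unfamiliar with the headline metric, the sketch below shows one common way to compute macro balanced accuracy for a multi-label task: balanced accuracy (the mean of sensitivity and specificity) per label, averaged over labels. That this matches the paper's exact definition is an assumption, and the code assumes every label has at least one positive and one negative example.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for one binary label."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    pos = sum(y_true)
    neg = len(y_true) - pos
    return 0.5 * (tp / pos + tn / neg)

def macro_balanced_accuracy(Y_true, Y_pred):
    """Average the per-label balanced accuracy over all label columns."""
    n_labels = len(Y_true[0])
    column = lambda Y, j: [row[j] for row in Y]
    scores = [balanced_accuracy(column(Y_true, j), column(Y_pred, j))
              for j in range(n_labels)]
    return sum(scores) / n_labels
```

A classifier that predicts "absent" for every label scores 0.5 under this metric regardless of class imbalance, which is why it is preferred over plain accuracy for rare findings.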

2606.28152 2026-06-29 cs.LG cs.RO 新提交

Regularized Reward-Punishment Reinforcement Learning

正则化奖惩强化学习

Jiexin Wang, Eiji Uchibe

发表机构 * Dept. of Brain Robot Interface, ATR Computational Neuroscience Laboratories(ATR计算神经科学实验室脑机接口系)

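AI summary: Proposes KL-Coupled Policy Regularization (KCPR), a framework that coordinates policies in reward-punishment reinforcement learning by letting each policy act as a dynamic prior for the other. It yields KL-Coupled Soft Optimality (KCSO) and a deep realization, klDMP, which improves safety and learning stability in grid-world and Gazebo robot navigation tasks.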

详情
AI中文摘要

我们提出了KL耦合策略正则化(KCPR),一种用于奖惩强化学习(RPRL)的策略协调框架。基于KCPR,我们推导出KL耦合软最优性(KCSO)并开发了其深度实现klDMP。与现有RPRL方法将奖励寻求和惩罚相关策略大致独立优化不同,KCPR通过将每个同伴策略视为另一个的动态学习先验,使得同伴策略之间能够直接交互。KCSO产生耦合的软最优策略和KL正则化的贝尔曼算子,允许奖励和惩罚信息共同影响价值传播。为了提高学习稳定性,我们引入了一种同伴先验软化机制,并评估了用于平衡奖励和惩罚相关经验的独立回放缓冲区设计。在网格世界和Gazebo机器人导航任务中的实验表明,与DQN、SQL和softDMP相比,klDMP在保持竞争性任务性能的同时提高了安全性和学习稳定性。这些结果表明,策略级协调为整合多个行为目标提供了一种有效机制,并可能作为具有交互动机过程的强化学习系统的有用设计原则。

英文摘要

We propose KL-Coupled Policy Regularization (KCPR), a policy coordination framework for Reward-Punishment Reinforcement Learning (RPRL). Based on KCPR, we derive KL-Coupled Soft Optimality (KCSO) and develop its deep realization, klDMP. Unlike existing RPRL approaches that optimize reward-seeking and punishment-related policies largely independently, KCPR enables direct interactions between companion policies by treating each as a dynamically learned prior for the other. KCSO yields coupled soft-optimal policies and KL-regularized Bellman operators, allowing reward and punishment information to jointly influence value propagation. To improve learning stability, we introduce a companion-prior softening mechanism and evaluate separate replay-buffer designs for balancing reward- and punishment-related experience. Experiments in grid-world and Gazebo robotic navigation tasks demonstrate that klDMP improves safety and learning stability while maintaining competitive task performance compared with DQN, SQL and softDMP. These results suggest that policy-level coordination provides an effective mechanism for integrating multiple behavioral objectives and may serve as a useful design principle for reinforcement learning systems with interacting motivational processes.
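The idea of treating each policy as a prior for the other can be illustrated with the standard closed form for a KL-regularized objective: the maximizer of E_pi[Q] - beta * KL(pi || prior) is proportional to prior(a) * exp(Q(a) / beta). The sketch below applies that textbook result in a single state; the function names, the synchronous update, and the shared temperature `beta` are assumptions, not the paper's KCSO derivation.

```python
import math

def kl_regularized_policy(q_values, prior, beta=1.0):
    """Maximizer of E_pi[Q] - beta * KL(pi || prior) over a discrete action set."""
    m = max(q_values)  # subtract the max for numerical stability
    weights = [p * math.exp((q - m) / beta) for q, p in zip(q_values, prior)]
    z = sum(weights)
    return [w / z for w in weights]

def coupled_update(q_reward, q_punish, pi_reward, pi_punish, beta=1.0):
    """One synchronous coupling step: each policy uses the other as its prior."""
    new_reward = kl_regularized_policy(q_reward, pi_punish, beta)
    new_punish = kl_regularized_policy(q_punish, pi_reward, beta)
    return new_reward, new_punish
```

With flat Q-values the policy simply reproduces its prior, so in this sketch the companion policy dominates wherever a policy's own value estimates are uninformative.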

2606.28149 2026-06-29 cs.CV cs.AI 新提交

Toward Robust In-Context Segmentation via Concept Guidance

通过概念引导实现鲁棒的上下文分割

Zhigang Chen, Xiawu Zheng, Rongrong Ji

发表机构 * Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University(厦门大学多媒体可信感知与高效计算教育部重点实验室)

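AI summary: Proposes Concept-Guided In-Context Segmentation (CG-ICS), which extracts high-level semantic concepts from reference images instead of relying solely on low-level visual matching and combines textual concepts with visual exemplars, substantially improving segmentation accuracy and robustness.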

Comments ECCV 2026

详情
AI中文摘要

上下文分割(ICS)要求模型仅使用少量参考图像及其对应掩码,无需更新任何参数,即可分割查询图像中的目标区域。尽管近期取得进展,先前的ICS研究很大程度上忽略了一个关键方面:系统鲁棒性,即模型在相同查询下使用不同参考时能否产生稳定的分割结果。本文从鲁棒性角度重新审视ICS,并引入一种新范式——概念引导的上下文分割(CG-ICS),该方法通过从参考中提取高层语义概念而非仅依赖低层视觉匹配来执行分割。具体而言,CG-ICS引入一个概念推理模块,该模块使用多模态大语言模型(MLLM)提出候选概念,并通过SAM3驱动的评分函数结合树搜索精炼选择可靠的文本概念,同时配备一条并行的视觉示例路径,通过简单的上下文构建提供查询侧的空间定位。随后,文本概念和视觉示例共同用于激活冻结的SAM3骨干网络的分割能力。在标准ICS基准上的大量实验表明,CG-ICS不仅达到了最先进的准确率,而且显著提升了鲁棒性,产生了更可靠的ICS系统,在不同参考选择下的方差显著降低。

英文摘要

In-context segmentation (ICS) requires a model to segment target regions in a query image using only a few reference images and their corresponding masks, without updating any parameters. Despite recent progress, prior ICS studies have largely overlooked a critical aspect: system robustness, i.e., whether the model can produce stable segmentation results for the same query under different references. In this work, we revisit ICS from the robustness perspective and introduce a novel paradigm, Concept-Guided In-Context Segmentation (CG-ICS), which performs segmentation by extracting high-level semantic concepts from references rather than relying solely on low-level visual matching. Specifically, CG-ICS introduces a concept reasoning module that uses an MLLM to propose candidates and a SAM3-driven scoring function with tree-search refinement to select reliable textual concepts, together with a parallel visual exemplar route that provides query-side spatial grounding via a simple context construction. Both the textual concept and the visual exemplar are then used to activate the segmentation capability of a frozen SAM3 backbone. Extensive experiments on standard ICS benchmarks demonstrate that CG-ICS not only achieves state-of-the-art accuracy but also substantially improves robustness, yielding a more reliable ICS system with significantly reduced variance across diverse reference choices.

2606.28142 2026-06-29 cs.LG 新提交

MixTTA: Low-Rank Cross-Channel Mixing for Reliable Test-Time Adaptation

MixTTA: 用于可靠测试时自适应的低秩跨通道混合

Mansoo Jung, Youngwook Kim, Jungwoo Lee

发表机构 * Seoul National University(首尔大学) Kookmin University(国民大学) HodooAI Labs(HodooAI实验室)

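AI summary: Proposes MixTTA, a module that augments normalization layers with a low-rank cross-channel transformation to handle cross-channel structural changes under distribution shift, making test-time adaptation methods more robust.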

Comments To be published in the 19th European Conference on Computer Vision -- ECCV 2026

详情
AI中文摘要

测试时自适应(TTA)方法通常更新归一化层的仿射参数,以使部署模型适应分布偏移。然而,每通道的仿射参数执行轴对齐的缩放和平移,在几何上无法纠正分布偏移引起的跨通道结构变化。为解决这一限制,我们提出MixTTA,一个轻量级插件模块,为归一化层配备低秩跨通道变换,实现每层的通道间混合。为确保低秩分支仅捕获跨通道交互,我们还提出解耦投影,强制其与对角仿射路径严格分离,以及谱投影,防止在非平稳测试流下秩-1坍缩。MixTTA可无缝集成到任何现有的基于归一化的TTA方法中。在标准和野外的TTA设置下的实验表明,该方法在强基线上持续改进,同时减轻了挑战条件下的自适应失败。源代码在此https URL公开。

英文摘要

Test-Time Adaptation (TTA) methods commonly update the affine parameters of normalization layers to adapt deployed models under distribution shifts. However, per-channel affine parameters perform axis-aligned scaling and shifting, making them geometrically incapable of correcting cross-channel structural changes induced by distribution shift. To address this limitation, we propose MixTTA, a lightweight plug-in module that equips normalization layers with a low-rank cross-channel transformation, enabling inter-channel mixing at each layer. To ensure that the low-rank branch captures only cross-channel interactions, we also propose Decoupling Projection that enforces strict separation from the diagonal affine path, along with Spectral Projection that prevents rank-1 collapse under non-stationary test streams. MixTTA can be seamlessly integrated into any existing normalization-based TTA method. Experiments in both standard and wild TTA settings show consistent improvements over strong baselines while mitigating adaptation failure under challenging conditions. The source code is publicly available at this https URL.
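To make the geometric argument concrete, the sketch below applies a per-channel affine map plus a low-rank cross-channel term to one normalized feature vector. The formula y = gamma * x_hat + beta + offdiag(U V^T) x_hat and the choice to zero the diagonal of U V^T as the "decoupling" step are assumptions made for illustration; the abstract does not give the exact parameterization, and Spectral Projection is not modeled here.

```python
def mix_norm_output(x_hat, gamma, beta, U, V):
    """Affine path plus off-diagonal low-rank mixing for one feature vector.

    x_hat: normalized activations, length C.
    gamma, beta: per-channel affine parameters, length C.
    U, V: C x r factor matrices (hypothetical names) of the low-rank branch.
    """
    c, r = len(x_hat), len(U[0])
    y = []
    for i in range(c):
        mixed = 0.0
        for j in range(c):
            if i == j:
                continue  # assumed decoupling: the diagonal stays with the affine path
            m_ij = sum(U[i][k] * V[j][k] for k in range(r))
            mixed += m_ij * x_hat[j]
        y.append(gamma[i] * x_hat[i] + beta[i] + mixed)
    return y
```

With `U` set to zero the layer reduces to the usual axis-aligned affine transform, so under these assumptions the branch can be attached to an existing normalization layer without changing its initial behavior.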

2606.28134 2026-06-29 cs.LG cs.AI 新提交

Beyond Sparse Supervision: Diffusion-Guided Learning for Few-Shot Graph Fraud Detection

超越稀疏监督:扩散引导学习用于少样本图欺诈检测

Liming Liu, Chao Hu, Mingfei Lu, Yiwei Ge, Xingle Li, Heyuan Shi

发表机构 * Central South University(中南大学) University of Technology Sydney, Australian Artificial Intelligence Institute(悉尼科技大学,澳大利亚人工智能研究所)

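AI summary: Targets sparse, imbalanced labels and representation dilution in graph fraud detection. Proposes ADC-GNN, a framework combining diffusion-guided feature augmentation, contrastive learning, and multi-hop spectral attention, which improves detection performance when trained on 1% of the data.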

详情
AI中文摘要

基于图的欺诈检测对于保护大规模交易系统至关重要,未检测到的异常可能导致重大财务损失和安全风险。现实世界的欺诈图面临两个耦合挑战:稀疏且不平衡的监督,其中经过验证的欺诈标签稀缺且严重偏向良性账户;以及表示稀释,其中空间消息传递可能过度平滑伪装异常,而谱滤波器可能抑制与欺诈相关的中高频不规则性。为了解决这些挑战,我们提出了ADC-GNN,即注意力引导扩散对比图神经网络,这是一个统一框架,结合了扩散引导特征增强、对比表示学习和多跳谱注意力,用于少样本图欺诈检测。扩散组件被公式化为特征空间去噪增强机制,而不是完整的拓扑生成图扩散模型:它在余弦调度下构建噪声扰动的节点特征视图,并使用对比学习来稳定跨扰动的节点表示。谱注意力模块进一步自适应地强调与欺诈相关的跳级和关系级线索。我们主要在三个公共基准上评估ADC-GNN,并额外报告一个包含约60,000条记录的专有真实世界电信交易数据集作为私有案例研究。在1%训练设置下,ADC-GNN在公共基准上相对于原始图欺诈基线和四个协议一致的最新图异常/欺诈基线取得了持续改进。关于分裂稳定性、训练比例、过采样替代方案、模块级消融、扩散调度以及运行时和内存消耗比较的额外分析进一步表征了ADC-GNN的有效运行区域。

英文摘要

Graph-based fraud detection is essential for safeguarding large-scale transaction systems, where undetected anomalies may lead to substantial financial losses and security risks. Real-world fraud graphs pose two coupled challenges: sparse and imbalanced supervision, where verified fraudulent labels are scarce and heavily skewed toward benign accounts, and representation dilution, where spatial message passing may oversmooth camouflaged anomalies while spectral filters may suppress fraud-relevant mid- and high-frequency irregularities. To address these challenges, we propose ADC-GNN, short for Attention-guided Diffusion-Contrastive Graph Neural Network, a unified framework that combines diffusion-guided feature augmentation, contrastive representation learning, and multi-hop spectral attention for few-shot graph fraud detection. The diffusion component is formulated as a feature-space denoising augmentation mechanism rather than a full topology-generative graph diffusion model: it constructs noise-perturbed node-feature views under a cosine schedule and uses contrastive learning to stabilize node representations across perturbations. The spectral attention module further adaptively emphasizes fraud-relevant hop-level and relation-level cues. We evaluate ADC-GNN primarily on three public benchmarks and additionally report a proprietary real-world telecom transaction dataset with approximately 60,000 records as a private case study. Under the 1% training setting, ADC-GNN achieves consistent improvements over original graph fraud baselines and four protocol-consistent recent graph anomaly/fraud baselines on the public benchmarks. Additional analyses on split stability, training ratios, oversampling alternatives, module-level ablations, diffusion schedules, and runtime and memory-consumption comparisons further characterize the effective operating regime of ADC-GNN.
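The feature-space augmentation described above can be sketched with the widely used cosine noise schedule from the diffusion literature. The offset `s = 0.008`, the function names, and the forward-noising form sqrt(alpha_bar) * x + sqrt(1 - alpha_bar) * noise are standard choices assumed here; the abstract does not specify the paper's exact schedule parameters.

```python
import math
import random

def cosine_alpha_bar(t, T, s=0.008):
    """Cumulative signal level at step t of T under a cosine noise schedule."""
    f = lambda u: math.cos((u / T + s) / (1 + s) * math.pi / 2) ** 2
    return f(t) / f(0)

def perturb_features(x, t, T, rng):
    """Noisy view of a node-feature vector: signal is scaled down, Gaussian noise is added."""
    a = cosine_alpha_bar(t, T)
    return [math.sqrt(a) * xi + math.sqrt(1 - a) * rng.gauss(0.0, 1.0) for xi in x]

# Two perturbed views of the same node, as a contrastive pair would need.
rng = random.Random(0)
view_a = perturb_features([0.2, -1.0, 0.5], t=10, T=100, rng=rng)
view_b = perturb_features([0.2, -1.0, 0.5], t=30, T=100, rng=rng)
```

A contrastive loss would then pull the encodings of `view_a` and `view_b` together while pushing apart views of different nodes.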

2606.28133 2026-06-29 cs.RO cs.CV 新提交

Translation as a Bridging Action: Transferring Manipulation Skills from Humans to Robots

翻译作为桥接动作:将操作技能从人类迁移到机器人

Sijin Chen, Kaixuan Jiang, Haixin Shi, Yanhui Wang, Weiheng Zhong, Haosheng Li, Bo Jiang, Yuxiao Liu, Xihui Liu

发表机构 * HKU-MMLab(香港大学多媒体实验室) ByteDance Seed(字节跳动Seed)

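AI summary: Proposes relative wrist translation as a bridging action representation and pairs it with a π₀-like vision-language-action model to transfer manipulation skills from human data to a bi-manual robot with parallel grippers; transfer improves and scales with the amount of human data.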

Comments Project Page: this https URL (https://translation-as-a-bridging-action.github.io/)

详情
AI中文摘要

我们研究是否可以从人类动作中学习新颖的操作技能,并将其迁移到具有平行夹爪的双臂机器人。人类动作数据廉价、丰富且多样,使其成为扩展机器人学习最有前景的资源之一。然而,将技能从人类迁移到机器人仍然困难:大多数先前工作将人类视为另一种双臂6自由度具身,其中手部姿态估计存在噪声,且人类手指的接触模式与平行夹爪根本不同。我们认为,从人类数据中学习包含旋转的动作信号因此是次优的,并转而提出一种桥接动作表示:初始头部相机帧内的相对手腕平移,这是人类和机器人共享的动作空间。为了处理不同具身中某些动作组件可能缺失的问题,我们构建了一个类似π₀的视觉-语言-动作模型,具有交错的动作令牌和注意力掩码。在一系列新颖的双臂操作任务上,我们的桥接动作将人类操作知识迁移到机器人的效果远优于带噪声的6自由度人类动作,并随人类数据量增加而扩展。

英文摘要

We study whether novel manipulation skills can be learned from human actions and transferred to a bi-manual robot with parallel grippers. Human action data is cheap, abundant, and diverse, making it one of the most promising resources for scaling up robot learning. Yet transferring skills from humans to robots remains hard: most prior work treats humans as just another bi-manual 6DoF embodiment, where hand-pose estimates are noisy and the contact patterns of human fingers differ fundamentally from those of a parallel gripper. We argue that learning rotation-inclusive action signals from human data is therefore sub-optimal, and instead propose a bridging action representation: the relative wrist translation within the initial head-camera frame, an action space shared by humans and robots. To handle the potential absence of certain action components in different embodiments, we build a π₀-like vision-language-action model with interleaved action tokens and attention masking. On a suite of novel bi-manual manipulation tasks, our bridging action transfers human manipulation knowledge to robots far more effectively than noisy 6DoF human actions and scales with the amount of human data.
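One way to read the bridging representation is as per-step wrist displacement expressed in the coordinate frame of the head camera at the first time step. The sketch below implements that reading; the world-frame inputs, the camera-to-world convention for `R_cam0`, and the use of consecutive-step deltas (rather than offsets from the first wrist position) are all assumptions.

```python
def bridging_actions(wrist_world, R_cam0, t_cam0):
    """Per-step wrist displacement expressed in the initial head-camera frame.

    wrist_world: list of 3-D wrist positions in a world frame.
    R_cam0: 3x3 camera-to-world rotation of the head camera at step 0.
    t_cam0: position of that camera in the world frame.
    """
    def to_cam0(p):
        d = [p[i] - t_cam0[i] for i in range(3)]
        # R^T d maps a world-frame vector into the camera frame
        return [sum(R_cam0[i][j] * d[i] for i in range(3)) for j in range(3)]

    pts = [to_cam0(p) for p in wrist_world]
    return [[b[k] - a[k] for k in range(3)] for a, b in zip(pts, pts[1:])]
```

Because only translation is kept, the same three numbers per arm can be computed from a human wrist track or from a robot end-effector pose, which is what makes the action space shared across embodiments.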

2606.28128 2026-06-29 cs.CV cs.AI cs.RO 新提交

PhysisForcing: Physics Reinforced World Simulator for Robotic Manipulation

PhysisForcing:面向机器人操作的物理增强世界模拟器

Peiwen Zhang, Yufan Deng, Shangkun Sun, Juncheng Ma, Duomin Wang, Jonas Du, Zilin Pan, Ye Huang, Hao Liang, Songyan Huang, Ruihua Zhang, Enze Xie, Ming-Yu Liu, Daquan Zhou

发表机构 * Peking University(北京大学) NVIDIA(英伟达)

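AI summary: Targets the physical instability of video generation models used to simulate robotic manipulation. Proposes PhysisForcing, a framework that jointly optimizes a pixel-level trajectory alignment loss and a semantic-level relational alignment loss to strengthen physical consistency, improving base models on several benchmarks and raising downstream policy success.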

Comments Github: this https URL (https://github.com/DAGroup-PKU/PhysisForcing) Project website: this https URL (https://dagroup-pku.github.io/PhysisForcing.github.io/#)

详情
AI中文摘要

视频生成模型已成为具身世界模拟的一种有前景的范式。然而,通用域视频生成器和机器人特定数据微调模型仍可能产生物理上不合理的操作,包括不连续的运动轨迹和不一致的机器人-物体交互,这限制了它们作为世界模拟器的可靠性。通过大量实验,我们发现这种物理不稳定性主要源于两个因素:运动物体的变形以及交互实体之间(尤其是在接触时)不合理的时空相关性。基于这一观察,我们提出了PhysisForcing,一个可扩展的训练框架,通过联合优化像素级和语义级特征,将监督集中在物理信息区域来增强物理一致性。该框架包括一个像素级轨迹对齐损失,使用参考点轨迹监督DiT特征,以及一个语义级关系对齐损失,将DiT特征与从冻结视频理解编码器中提取的区域间关系对齐。在R-Bench、PAI-Bench和EZS-Bench上的大量实验表明,PhysisForcing在强基线上持续改进了具身视频生成,将Wan2.2-I2V-A14B和Cosmos3-Nano基础模型在R-Bench上的性能分别提升了22.3%和9.2%(相对于普通微调提升了7.1%和3.7%),其中Cosmos3-Nano变体获得了最佳总体得分。除了生成,作为WorldArena动作规划器协议下的世界模型,它将闭环成功率从16.0%提高到24.0%,并进一步提升了下游策略成功率,表明物理对齐的视频模型为机器人操作产生了更强的表示。

英文摘要

Video generation models have emerged as a promising paradigm for embodied world simulation. However, both general-domain video generators and models fine-tuned on robot-specific data can still produce physically implausible manipulations, including discontinuous motion trajectories and inconsistent robot-object interactions, which limits their reliability as world simulators. Through extensive experiments, we find that such physical instability mainly arises from two factors: deformation of moving objects and implausible spatio-temporal correlations among interacting entities, particularly during contact. Building on this observation, we propose PhysisForcing, a scalable training framework that strengthens physical consistency by focusing supervision on physics-informative regions through joint optimization of pixel-level and semantic-level features. The framework consists of a pixel-level trajectory alignment loss, which supervises DiT features using reference point trajectories, and a semantic-level relational alignment loss, which aligns DiT features with inter-region relations extracted from a frozen video understanding encoder. Extensive experiments on R-Bench, PAI-Bench, and EZS-Bench show that PhysisForcing consistently improves embodied video generation over strong baselines, improving the Wan2.2-I2V-A14B and Cosmos3-Nano base models on R-Bench by 22.3% and 9.2% (7.1% and 3.7% over vanilla finetuning), with the Cosmos3-Nano variant attaining the best overall score. Beyond generation, as a world model under the WorldArena action-planner protocol, it raises the closed-loop success rate from 16.0% to 24.0% and further improves downstream policy success, indicating that physically aligned video models yield stronger representations for robotic manipulation.

2606.28126 2026-06-29 cs.AI cs.AR cs.CE cs.ET cs.RO 新提交

AI-Driven Synthesis for High-Tech System Design: Automating Innovation

AI驱动的高科技系统设计综合:自动化创新

Luuk Oerlemans, Steven Westerhof, Theo Hofman

发表机构 * Eindhoven University of Technology(埃因霍温理工大学)

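AI summary: Presents automation-in-design (AiD) as a paradigm and computational design synthesis (CDS), a framework built on deep learning and generative AI, and uses two case studies to illustrate the shift from simulation-based optimisation to autonomous design.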

详情
AI中文摘要

本文通过提出自动化设计(AiD)作为一种变革性范式,解决了现代高科技系统设计中固有的组合复杂性。我们提出了计算设计综合(CDS),这是一个利用深度学习和生成式AI来自动化创建新颖系统的框架。两个案例研究(电驱动系统设计和空间尺寸问题)作为该方法的证明点。案例研究中使用的AI驱动方法代表了工程领域的根本性转变,从基于仿真的优化推进到最小人工监督的自主设计。

英文摘要

This article addresses the combinatorial complexity inherent in modern high-tech system design by presenting automation-in-design (AiD) as a transformative paradigm. We propose computational design synthesis (CDS), a framework utilising deep learning and generative AI to automate the creation of novel systems. Two case studies (e-drive system design and spatial dimensioning problem) serve as proof-points for this approach. The AI-driven methods used in the case studies represent a fundamental shift in engineering, advancing from simulation-based optimisation towards autonomous design with minimal human supervision.

2606.28116 2026-06-29 cs.CL 新提交

Mechanism-Driven Monitors for Preemptive Detection of LLM Training Instability

机制驱动的LLM训练不稳定性抢先检测监控器

Ruixuan Huang, Yipei Wang, Wenyi Fang, Hantao Huang, Yifan Huang, Ansheng You, Zhenxing Zhang, Shuai Wang, Fan Wu, Yang Zheng

发表机构 * HKUST(香港科技大学) Huawei(华为) Independent Researcher(独立研究者)

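AI summary: Targets numerical and hyperparameter faults in large language model training. Proposes internal monitors derived from the functional role of each module, using the spectral entropy of a QK bilinear decomposition and MoE router indicators to detect instability thousands of steps before the loss diverges.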

详情
AI中文摘要

前沿大语言模型训练消耗大量加速器集群和长时钟时间,使得稳定性故障发生时代价高昂。在数值或超参数故障已经破坏训练动态后,它可能持续数千步,而损失和梯度范数仍看似正常。我们通过从每个关键模块的功能角色以及预期故障产生可测量信号的最早计算位置推导内部监控器,研究机制驱动的训练不稳定性检测。对于低精度闪存注意力,我们监控QK双线性分解的谱熵,其第一阶项在损失完全崩溃前变得异常。对于MoE路由器,我们从其在专家选择中的作用推导指标。我们在低精度注意力、大学习率和组合故障上的故障注入实验表明,这些信号为不同故障提供了独特的特征,在损失发散前数千步触发警报。

英文摘要

Frontier large language model training consumes massive accelerator fleets and long wall-clock computation, making stability failures costly when they occur. After a numerical or hyperparameter fault has already destabilized the training dynamics, training may continue for thousands of steps while loss and gradient norms still appear normal. We study mechanism-driven detection of training instability by deriving internal monitors from the functional role of each critical module and from the earliest computational sites where failures are expected to produce measurable signatures. For low-precision flash attention, we monitor the spectral entropy of a QK bilinear decomposition, whose first-order term becomes abnormal before the loss fully collapses. For MoE routers, we derive indicators from their role in expert selection. Our fault-injection experiments on low-precision attention, large learning-rate, and combined faults show that these signals provide distinct signatures for different failures, triggering alarms thousands of steps before loss divergence.
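As a rough illustration of a spectral-entropy monitor, the sketch below computes the Shannon entropy of a normalized singular-value spectrum and raises a flag when it departs from a trailing baseline. The singular values are taken as given (for example from an SVD of the matrix being monitored); which matrix the paper decomposes, the window length, and the threshold rule are all assumptions.

```python
import math

def spectral_entropy(singular_values, eps=1e-12):
    """Shannon entropy (in nats) of the normalized singular-value distribution."""
    total = sum(singular_values)
    probs = [s / total for s in singular_values if s > eps]
    return -sum(p * math.log(p) for p in probs)

def entropy_alarm(history, window=3, threshold=0.5):
    """Flag steps whose entropy deviates from the mean of the preceding window."""
    alarms = []
    for i in range(window, len(history)):
        baseline = sum(history[i - window:i]) / window
        alarms.append(abs(history[i] - baseline) > threshold)
    return alarms
```

A flat spectrum gives the maximum entropy log(n), while a spectrum dominated by one direction gives an entropy near zero, so a sudden change in either direction indicates that the matrix's effective rank has shifted.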

2606.28112 2026-06-29 cs.CV cs.AI 新提交

BiDeMem: Bidirectional Degradation Memory for Explainable Image Restoration

BiDeMem: 用于可解释图像恢复的双向退化记忆

Xinrui Wu, Lichen Huang

发表机构 * University of Electronic Science and Technology of China(电子科技大学)

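AI summary: Proposes BiDeMem, a bidirectional degradation memory for explainable image restoration in which a query retrieves memory slots that serve both a restoration path and an explanation path; in a NAFNet setting it outperforms several control variants.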

详情
AI中文摘要

退化感知的提示、条件和潜在先验在图像恢复中越来越常用,但它们通常仅通过一个单一终点来判断:恢复后的图像是否获得更高的PSNR。这是对语义的弱测试。条件可以通过增加容量、作为全局校正偏差或利用数据集捷径来提供帮助,而不会成为可解释的退化先验。我们提出BiDeMem,一种用于可解释图像恢复的双向退化记忆。由恢复特征和输入统计构建的查询检索记忆槽的紧凑top-k子集。相同的选定槽标识在推理时支持恢复路径,以及一个仅训练的前向退化解释路径。该研究集中在受控的多退化NAFNet设置中的可验证性。新的控制将仅校正头、密集查询先验和静态全局先验的增益分开:这些变体分别比BiRank低0.2588、0.2586和0.2839 dB。强残差监督和更宽的退化头也仍低于完整双向记忆模型。干预探针显示,BiRank在保持恢复质量的同时增加了错误先验和原生先验的敏感性,将退化记忆定位为既是恢复模块又是可证伪的解释机制。

英文摘要

Degradation-aware prompts, conditions, and latent priors are increasingly used in image restoration, yet they are usually judged by a single endpoint: whether the restored image obtains higher PSNR. This is a weak test of semantics. A condition can help by adding capacity, acting as a global correction bias, or exploiting dataset shortcuts, without becoming an interpretable degradation prior. We propose BiDeMem, a bidirectional degradation memory for explainable image restoration. A query built from restoration features and input statistics retrieves a compact top-k subset of memory slots. The same selected slot identity supports the restoration path at inference time and a training-only forward-degradation explanation path. The study centers on verifiability in a controlled multi-degradation NAFNet setting. New controls separate the gain from a correction head alone, a dense query prior, and a static global prior: these variants are 0.2588, 0.2586, and 0.2839 dB below BiRank, respectively. Strong residual supervision and a wider degradation head also remain below the full bidirectional memory model. Intervention probes show that BiRank preserves restoration quality while increasing wrong-prior and native-prior sensitivity, framing degradation memory as both a restoration module and a falsifiable explanation mechanism.
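The retrieval step, in which a query selects a compact top-k subset of memory slots, can be sketched as follows. Dot-product scoring and softmax weighting over the selected slots are assumptions chosen for illustration; the abstract only states that the same selected slot identities are shared by the restoration and explanation paths.

```python
import math

def retrieve_top_k(query, memory, k=2):
    """Return (slot indices, softmax weights) for the k slots most similar to the query."""
    scores = [sum(q * m for q, m in zip(query, slot)) for slot in memory]
    top = sorted(range(len(memory)), key=lambda i: scores[i], reverse=True)[:k]
    m = max(scores[i] for i in top)  # subtract the max for numerical stability
    exps = [math.exp(scores[i] - m) for i in top]
    z = sum(exps)
    return top, [e / z for e in exps]
```

Returning the indices alongside the weights matters in this design: the indices are what both paths would have to agree on, so they can be inspected or swapped in an intervention probe.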

2606.28104 2026-06-29 cs.CV cs.LG 新提交

Cross-view Multimodal Vision-Based Assessment Framework for Traditional Chinese Medicine Rehabilitation Training

跨视角多模态视觉评估框架用于中医康复训练

Francis Xiatian Zhang, Hao Yao, Shengxuan Chen, Hong Zhu, Hongxiao Jia, Sisi Zheng, Hubert P. H. Shum

发表机构 * Department of Computer Science, Durham University(杜伦大学计算机科学系) Institute for Regeneration and Repair, The University of Edinburgh(爱丁堡大学再生与修复研究所) Ningbo Hospital of Traditional Chinese Medicine(宁波市中医院) Department of Rehabilitation Medicine, The Gulou Hospital of Traditional Chinese Medicine(南京市鼓楼区中医院康复医学科)

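AI summary: Proposes CME-AQA, a cross-view multimodal vision-based assessment framework that fuses visual and pose information and trains on both first-person and third-person video to make inference more robust. For action quality assessment in Traditional Chinese Medicine rehabilitation training, it gains more than 10% relative weighted F1 over the best competing method on key rating tasks.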

Comments Published in IEEE Transactions on Neural Systems and Rehabilitation Engineering, 2026

详情
AI中文摘要

基于视觉的评估可以为中医康复训练提供便捷且经济的评价,其中计算机视觉的动作质量评估(AQA)提供了一种有前景的解决方案。现有的物理治疗自动AQA框架通常依赖单视角捕获的骨骼数据,这对于涉及密集手部自遮挡和复杂手物交互的中医技术(如针灸或推拿)效率低下。为应对这些挑战,我们提出CME-AQA,一种跨视角多模态视觉评估框架,融合视觉-姿态融合以增强对环境上下文的理解,并在训练中利用第一人称和第三人称视频以提高推理鲁棒性。我们收集了两个双视角数据集:TCM-AQA61-A(针灸)和TCM-AQA61-T(推拿),每个数据集包含61名受试者的同步第一人称和第三人称记录及专家标注。实验结果表明,我们的方法在平均性能上优于或媲美竞争基线,在关键评分任务(如针刺深度和快速进针)上相比最佳竞争方法在加权F1上取得超过10%的相对提升,同时在定量指标(如进针时间和操作频率)上降低了平均绝对误差。在CPR数据集上的测试进一步表明,在多个基于姿势的标准上性能相当,表明该方法可适用于以参与者运动为核心的类似结构化模拟临床技能评估。总体而言,CME-AQA提高了结构化中医康复训练的评估准确性,并促进了更便捷有效的训练导向技能评价。

英文摘要

Vision-based assessment can provide convenient and cost-effective evaluation in Traditional Chinese Medicine (TCM) rehabilitation training, where action quality assessment (AQA) from computer vision offers a promising solution. Existing automatic AQA frameworks for physical therapy typically rely on skeletal data captured from a single viewpoint, which is inefficient for TCM techniques such as acupuncture or Tuina that involve dense hand self-occlusion and complex hand-object interactions. To address these challenges, we propose CME-AQA, a cross-view, multimodal vision-based assessment framework that integrates visual-pose fusion to enhance understanding of environmental context and leverages both first-person and third-person videos during training to improve inference robustness. We collected two dual-view datasets, TCM-AQA61-A (Acupuncture) and TCM-AQA61-T (Tuina), each containing synchronized first-person and third-person recordings of 61 subjects with expert annotations. Experimental results show that our approach achieves superior or comparable mean performance against competitive baselines, with over 10% relative improvement in weighted F1 over the best competing method on key rating tasks such as Needle Depth and Quick Needle Insertion, while also reducing mean absolute error in quantitative measures such as insertion time and manipulation frequency. Testing on a CPR dataset further demonstrates comparable performance on several posture-based criteria, suggesting applicability to related structured simulated clinical skill assessments where participant motion is central to evaluation. Overall, CME-AQA enhances assessment accuracy for structured TCM rehabilitation training and facilitates more convenient and effective training-oriented skill evaluation.

2606.28094 2026-06-29 cs.CV cs.AI 新提交

OSOR: One-Step Diffusion Inpainting for Effect-Aware Object Removal

OSOR: 一步扩散修复用于效果感知的对象移除

Qinming Zhou, Chenxi Sun, Deyang Kong, Junhao He, Xiangheng Tang, Peike Yu, Haotian Wu, Leilei Cao, Linfeng Zhang

发表机构 * Shanghai Jiao Tong University(上海交通大学) Tsinghua University(清华大学) Xidian University(西安电子科技大学) Tongji University(同济大学) University of Electronic Science and Technology of China(电子科技大学) Transsion(传音控股)

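AI summary: Proposes OSOR, a one-step diffusion model that uses an occupancy-guided discriminator, an alpha head, and a semantic-anchored verification pipeline to deliver efficient, effect-aware object removal that is robust to imperfect masks, with 4× to 30× faster inference.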

Comments Code and resources are available at this https URL (https://github.com/Zhouqm-Git/osor)

详情
AI中文摘要

现实世界中的对象移除具有挑战性,原因在于两个关键难点:目标对象的非局部效果(如阴影和反射)难以建模,以及用户提供的掩码通常不准确或不完整。扩散模型拥有数十亿参数和数十个去噪步骤,虽然实现了强大的移除性能,但计算成本高昂,限制了其在交互式应用和边缘设备上的使用。为了解决这些挑战,我们提出了OSOR(一步对象移除),它同时实现了高效、效果感知和掩码鲁棒的对象移除。具体来说,OSOR引入了:(1) 一个占用引导的判别器,用于精确的边界监督,实现稳定的单步扩散训练;(2) 一个alpha头,利用预训练扩散模型的知识,以最小的开销预测适当的移除区域,从而处理不完美的掩码;(3) 一个语义锚定验证管道(SAVP),用于过滤基于噪声指令的三元组,从而大规模生成效果感知的监督。利用SAVP,我们整理了CORNE数据集,包含28万个经过验证的移除对,并进一步标注了AnimeEraseBench和TextEraseBench,以评估在更复杂移除任务上的性能。实验表明,OSOR在感知质量上超越了强大的多步扩散基线,同时实现了4倍到30倍的推理加速。

英文摘要

Real-world object removal is challenging due to two key difficulties: the target object's non-local effects, such as shadows and reflections, which are difficult to model, and the fact that user-provided masks are often inaccurate or incomplete. With billions of parameters and tens of denoising steps, diffusion-based models achieve strong removal performance at the expense of substantial computational cost, limiting their use in interactive applications and on edge devices. To address these challenges, we present OSOR (One-Step Object Removal), which simultaneously achieves efficient, effect-aware, and mask-robust object removal. Concretely, OSOR introduces: (1) an occupancy-guided discriminator for precise boundary supervision, enabling stable single-step diffusion training; (2) an alpha head that leverages knowledge from pretrained diffusion models to predict appropriate removal regions with minimal overhead, thereby handling imperfect masks; and (3) a semantic-anchored verification pipeline (SAVP) that filters noisy instruction-based triplets to produce effect-aware supervision at scale. Using SAVP, we curate CORNE, which contains 280K verified removal pairs, and further annotate AnimeEraseBench and TextEraseBench to evaluate performance on more complex removal tasks. Experiments show that OSOR surpasses strong multi-step diffusion baselines in perceptual quality while achieving 4× to 30× faster inference.

2606.28089 2026-06-29 cs.CV 新提交

RPM-Distill: Physiology-guided Adaptive Cross-modal Distillation for Robust Remote Physiological Measurement

RPM-Distill:生理引导的自适应跨模态蒸馏用于鲁棒的远程生理测量

Jiyao Wang, Qingyong Hu, Duoxun Tang, Xiao Yang, Kaishun Wu, Jiangbo Yu

发表机构 * McGill University(麦吉尔大学) Hong Kong University of Science and Technology(香港科技大学) Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) Tsinghua University(清华大学)

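AI summary: Proposes RPM-Distill, a framework in which radar signals available only at training time guide a video model through frequency-domain distillation, making remote physiological measurement more robust to changes in illumination, skin tone, and motion; under challenging conditions it improves MAE by 81% and correlation by 21%.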

Comments Accepted by ECCV2026

详情
AI中文摘要

基于视频的远程生理测量(RPM)易于使用,但在光照、肤色和运动变化下仍不稳定。射频(RF)雷达对光照和外观基本不变,提供互补的心肺微动线索;然而,由于雷达的普及性有限和部署开销,在推理时要求雷达通常不切实际。我们提出RPM-Distill,一种生理引导的跨模态蒸馏框架,仅在训练期间利用同步雷达,同时保持仅视频推理。我们的关键观察是,尽管RGB和RF波形在传感物理和时域形态上不同,但它们在频域中共享相似的潜在周期节律。因此,我们通过损失函数蒸馏生理结构的频谱证据以提高鲁棒性,这些损失函数(i)锚定基频峰值,(ii)匹配非峰值背景分布,以及(iii)保留频谱形态和锐度。为了避免在样本级教师质量和对齐不确定性下的负迁移,频谱策略网络根据学生-教师频谱关系图预测样本级蒸馏门和组件权重,并通过在小型标记验证集上的元双层目标进行学习。通过在挑战性条件和跨数据集设置中的广泛实验,RPM-Distill相比单模态基线实现了81%的MAE改善和21%的相关性提升。代码见此链接。

英文摘要

Video-based remote physiological measurement (RPM) is highly accessible but remains fragile under varying illumination, skin tones, and motion. Radio frequency (RF) radar is largely invariant to illumination and appearance, providing complementary cardio-respiratory micro-motion cues; however, requiring radar at inference is often impractical due to its limited ubiquity and deployment overhead. We propose RPM-Distill, a physiology-guided cross-modal distillation framework that leverages synchronized radar only during training while retaining video-only inference. Our key observation is that although RGB and RF waveforms differ in sensing physics and time-domain morphology, they share a similar latent periodic rhythm in the frequency domain. We thus distill physiology-structured spectral evidence to improve robustness, via losses that (i) anchor the fundamental peak, (ii) match the off-peak background distribution, and (iii) preserve spectral morphology and sharpness. To avoid negative transfer under sample-level teacher quality and alignment uncertainty, a spectral policy network predicts sample-level distillation gates and component weights from the student-teacher spectral relation map, learned with a meta bilevel objective on a small labeled validation split. Through extensive experiments in challenging conditions and cross-dataset settings, RPM-Distill improves MAE by 81% and correlation by 21% over unimodal baselines. Code is at this https URL.
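The shared "periodic rhythm in the frequency domain" can be made concrete by locating the dominant spectral peak of a pulse-like signal. The sketch below uses a naive DFT to stay dependency-free; the 0.7 to 3.0 Hz band (42 to 180 beats per minute) and the function names are assumptions, and the paper's actual peak-anchoring loss is not reproduced here.

```python
import cmath
import math

def power_spectrum(signal, fps):
    """Naive DFT power spectrum; returns (frequencies in Hz, power) up to Nyquist."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [v - mean for v in signal]  # remove the DC component
    freqs, power = [], []
    for k in range(n // 2 + 1):
        coeff = sum(v * cmath.exp(-2j * math.pi * k * i / n)
                    for i, v in enumerate(centered))
        freqs.append(k * fps / n)
        power.append(abs(coeff) ** 2)
    return freqs, power

def peak_bpm(signal, fps, lo=0.7, hi=3.0):
    """Rate in beats per minute from the strongest peak inside a plausible band."""
    freqs, power = power_spectrum(signal, fps)
    band = [(p, f) for f, p in zip(freqs, power) if lo <= f <= hi]
    return 60.0 * max(band)[1]
```

A peak-anchoring term could then compare the peak location of the video student's spectrum with that of the radar teacher, even though the two waveforms look different in the time domain.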

2606.28083 2026-06-29 cs.CV cs.AI cs.GR cs.HC cs.MM 新提交

STAG: Spatio-temporal Evolving Structural Representation of Action Units for Micro-expression Recognition

STAG:面向微表情识别的动作单元时空演化结构表示

Nandani Sharma, Varun Sharma, Dinesh Singh

发表机构 * Vision Intelligence and Machine Learning (VIML) Group, School of Computing and Electrical Engineering, Indian Institute of Technology Mandi(印度理工学院曼迪分校计算与电气工程学院视觉智能与机器学习(VIML)实验室) Indian Institute of Information Technology Bhagalpur(印度信息技术学院巴加尔普尔分校)

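AI summary: Proposes STAG, a network that jointly models spatial and temporal information through optical-flow frame selection, a dual-branch architecture (graph attention plus Transformer), and AU-guided dynamic connectivity, improving the robustness and cross-dataset generalization of micro-expression recognition.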

详情
AI中文摘要

微表情识别因面部肌肉运动细微且短暂而具有挑战性。现有方法严重依赖顶点-起始帧,忽略细粒度的帧间动态,并单独建模空间和时间信息,限制了跨数据集的泛化能力。为应对这些挑战,我们提出STAG,一种动态ROI-AU耦合时空网络,联合建模运动流和自适应面部连接。该框架使用基于幅度的选择和时序注意力从判别性帧中提取光流。双分支架构结合了用于结构化空间推理的增强图注意力网络和用于时序建模的Transformer编码器。双向交叉注意力模块实现空间和时间特征的相互细化,而AU引导的动态连接根据肌肉激活模式自适应调整面部区域交互。Transformer捕获超越基于顶点方法的细微时序动态,提高了可解释微表情识别的语义一致性和可解释性。融合表示使用焦点损失优化,并在CASME II、4DME、DFME、NaME、SAMM和SMIC-HS上评估。大量实验证明了改进的鲁棒性、泛化性、可解释性和计算效率,证实了自适应关系推理、AU引导动态连接和深度时空特征融合对于准确跨数据集微表情识别的有效性。

英文摘要

Micro-expression recognition is challenging due to subtle and short-lived facial muscle movements. Existing methods rely heavily on apex-onset frames, overlook fine-grained inter-frame dynamics, and separately model spatial and temporal information, limiting generalization across datasets. To address these challenges, we propose STAG, a dynamic ROI-AU-coupled spatial-temporal network that jointly models motion flow and adaptive facial connectivity. The framework extracts optical flow from discriminative frames using magnitude-based selection and temporal attention. A dual-branch architecture combines an enhanced graph attention network for structured spatial reasoning with a transformer encoder for temporal modeling. A bidirectional cross-attention module enables mutual refinement of spatial and temporal features, while AU-guided dynamic connectivity adapts facial region interactions according to muscle activation patterns. The transformer captures subtle temporal dynamics beyond apex-based approaches, improving semantic consistency and interpretability for explainable micro-expression recognition. The fused representation is optimized using focal loss and evaluated on CASME II, 4DME, DFME, NaME, SAMM, and SMIC-HS. Extensive experiments demonstrate improved robustness, generalization, interpretability, and computational efficiency, confirming the effectiveness of adaptive relational reasoning, AU-guided dynamic connectivity, and deep spatial-temporal feature fusion for accurate cross-dataset micro-expression recognition.
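The focal loss used to optimize the fused representation has a standard form, -alpha * (1 - p_t)^gamma * log(p_t), which down-weights examples the model already classifies confidently. The sketch below shows it for a single sample; the default values of `gamma` and `alpha` are common choices, not values reported in the abstract.

```python
import math

def focal_loss(probs, target, gamma=2.0, alpha=1.0):
    """Focal loss for one sample given class probabilities and the true class index."""
    p_t = probs[target]
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)
```

Setting `gamma=0` recovers ordinary cross-entropy, which makes the effect of the modulating factor easy to check on imbalanced micro-expression classes.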

2606.28077 2026-06-29 cs.CV New submission

TextDS: Parameter-Efficient Representation Alignment for Scene Text Detection under Distribution Shifts

TextDS: 分布偏移下场景文本检测的参数高效表示对齐

Boyuan Chen, Zichen Dang, Chuang Yang, Lap-pui Chau, Yi Wang

Affiliations * School of Electrical Engineering, Xi’an Jiaotong University (西安交通大学电气工程学院); Department of Electrical and Electronic Engineering, The Hong Kong Polytechnic University (香港理工大学电机及电子工程学系)

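AI summary Proposes the TextDS framework, which combines visual foundation models, Step-wise LoRA adaptation (SWLoRA), and Common Subspace Fusion (CSF) to achieve robust scene text detection across domains and under degraded imaging conditions with only 4.9M trainable parameters.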

Comments Accepted by ECCV 2026. Project page: https://github.com/ZChenDang/TextDS

AI Chinese abstract

在实际部署中,场景文本检测器不可避免地面临超出训练分布的分布偏移。先前的工作通常依赖于大规模场景文本预训练,但在跨域变化和真实成像退化下的评估仍然有限。我们提出TextDS,一个用于分布偏移下场景文本检测的高效框架。首先,我们提出了一种数据高效的双编码器设计,采用视觉基础模型,消除了对大规模场景文本预训练的依赖。其次,我们引入了逐步LoRA适应(SWLoRA),通过动态早退机制进行渐进式低秩精炼,以实现有效的特征适应。第三,我们提出了公共子空间融合(CSF),在共享子空间中对齐和融合两个分支,同时保留互补的、对偏移鲁棒的信息。最后,我们构建了退化条件下的场景文本检测数据集,以弥补在成像退化评估方面的空白。实验表明,TextDS在场景文本检测中取得了竞争性性能,仅用4.9M可训练参数就展示了跨域和不利成像条件下的鲁棒性。

English abstract

In real-world deployments, scene text detectors inevitably face distribution shifts beyond the training distribution. Prior work often depends on large-scale scene-text pretraining, yet evaluation under cross-domain changes and real-world imaging degradations remains limited. We propose TextDS, an efficient framework for scene text detection under distribution shifts. First, we propose a data-efficient dual-encoder design with visual foundation models, eliminating the reliance on large-scale scene-text pretraining. Second, we introduce Step-wise LoRA adaptation (SWLoRA), which performs progressive low-rank refinement with a dynamic early-exit mechanism for effective feature adaptation. Third, we propose Common Subspace Fusion (CSF) to align and fuse the two branches in a shared subspace while retaining complementary, shift-robust information. Finally, we construct adverse-condition scene text detection datasets to address the gap in evaluating under imaging degradation. Experiments show that TextDS achieves competitive performance in scene text detection, demonstrating robustness across domains and adverse imaging conditions with only 4.9M trainable parameters.
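
As its name suggests, SWLoRA builds on LoRA, whose plain form adds a scaled low-rank product to a frozen weight matrix. The sketch below shows only that generic update, W + (alpha / r) * B A, together with the resulting trainable-parameter count. It is not the authors' step-wise refinement or early-exit mechanism, which the abstract does not specify. The helper names, the pure-Python matrices, and the example dimensions are all assumptions made for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def lora_weight(w, lora_b, lora_a, alpha, rank):
    """Return W + (alpha / rank) * B @ A (generic LoRA, hypothetical helper).

    w      -- frozen weight, shape d_out x d_in
    lora_b -- trainable factor, shape d_out x rank
    lora_a -- trainable factor, shape rank x d_in
    """
    delta = matmul(lora_b, lora_a)
    scale = alpha / rank
    return [
        [w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
        for i in range(len(w))
    ]


def lora_trainable_params(d_out, d_in, rank):
    """Parameters in B and A, versus d_out * d_in for full fine-tuning."""
    return rank * (d_out + d_in)


# Rank-1 update of a 2 x 2 identity weight.
merged = lora_weight([[1, 0], [0, 1]], [[1], [2]], [[3, 4]], alpha=2, rank=1)
```

The parameter count grows linearly in the rank, which is why low-rank adapters can keep a trainable budget small even on large backbones.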

2606.28076 2026-06-29 cs.AI New submission

Ontology-Guided Evidence Path Inference for Multi-hop Knowledge Graph Question Answering

面向多跳知识图谱问答的本体引导证据路径推理

Yongxue Shan, Meihan Wu, Cundi Fang, Jie Peng, Xiaodong Wang

Affiliations * National University of Defense Technology (国防科技大学); Pengcheng Laboratory (鹏城实验室)

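AI summary Proposes the OPI framework, which uses a relation-centric ontology graph for bidirectional retrieval and iterative refinement to address search-space explosion and unmet semantic constraints in multi-hop KGQA. It improves Hit@1/F1 by 4.6/5.0 points on WebQSP and 8.9/3.3 points on CWQ over the strongest prior results, and reaches near-saturated Hit@1 on MetaQA.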

Comments 14 pages, 4 figures

AI Chinese abstract

知识图谱问答(KGQA)旨在通过对结构化事实进行推理来回答自然语言问题。现有的多跳KGQA方法主要依赖于以主题为中心的扩展,这面临两个关键挑战:搜索空间随着噪声混合类型路径的增多而迅速增长,并且检索到的路径可能无法满足复杂问题的语义约束。为了解决这些挑战,我们提出了OPI,一种面向多跳KGQA的本体引导证据路径推理框架。OPI引入了一个以关系为中心的本体图来捕获关系的头尾类型约束,为答案侧约束提供了紧凑的接口。基于该本体图,OPI首先通过将预测的答案类型映射到兼容的最终跳关系,并将主题侧前缀扩展与答案侧最终跳匹配相结合,引入了一种双向检索机制,从而抑制了噪声混合类型扩展。OPI进一步采用迭代精炼策略,在问题上下文中重新评估检索到的路径和候选答案,过滤类型兼容但与问题无关的证据,以实现更可靠的答案预测。在WebQSP、CWQ和MetaQA上的实验表明,OPI大幅减少了搜索空间,在WebQSP上比最强先前结果提高了4.6/5.0个点的Hit@1/F1,在CWQ上提高了8.9/3.3个点,并且仅凭检索模块就在MetaQA上达到了接近饱和的Hit@1。

English abstract

Knowledge graph question answering (KGQA) aims to answer natural-language questions by reasoning over structured facts. Existing multi-hop KGQA methods mainly rely on topic-centered expansion, which faces two key challenges: the search space rapidly grows with noisy mixed-type paths, and retrieved paths may fail to satisfy the semantic constraints of complex questions. To address these challenges, we propose OPI, an ontology-guided evidence path inference framework for multi-hop KGQA. OPI introduces a relation-centric ontology graph to capture the head-tail type constraints of relations, providing a compact interface for answer-side constraints. Based on this ontology graph, OPI first introduces a bidirectional retrieval mechanism by mapping the predicted answer type to compatible final-hop relations and combining topic-side prefix expansion with answer-side final-hop matching, thereby suppressing noisy mixed-type expansion. OPI further adopts an iterative refinement strategy to reassess retrieved paths and candidate answers under the question context, filtering type-compatible but question-irrelevant evidence for more reliable answer prediction. Experiments on WebQSP, CWQ, and MetaQA show that OPI substantially reduces the search space, improves Hit@1/F1 by 4.6/5.0 points on WebQSP and 8.9/3.3 points on CWQ over the strongest prior results, and achieves near-saturated Hit@1 on MetaQA with the retrieval module alone.
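
The abstract reports Hit@1 and F1 without defining them. A minimal sketch of the per-question definitions commonly used in KGQA evaluation follows; the paper's own evaluation script may differ in details such as answer normalization. The function names and the example answers are hypothetical.

```python
def hit_at_1(ranked_predictions, gold_answers):
    """1.0 if the top-ranked prediction is a gold answer, else 0.0."""
    if not ranked_predictions:
        return 0.0
    return 1.0 if ranked_predictions[0] in set(gold_answers) else 0.0


def answer_f1(predicted, gold):
    """Set-level F1 between predicted and gold answers for one question."""
    predicted, gold = set(predicted), set(gold)
    overlap = len(predicted & gold)
    if overlap == 0:
        # Covers empty inputs as well as disjoint answer sets.
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


# One question with two gold answers; the system returns one right, one wrong.
hit = hit_at_1(["Paris", "Lyon"], ["Paris", "Marseille"])
f1 = answer_f1(["Paris", "Lyon"], ["Paris", "Marseille"])
```

Dataset-level scores are then usually the mean of these per-question values, so a gain of "4.6 points" corresponds to 0.046 on this 0 to 1 scale.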