arXivDaily: daily arXiv digest, updated Monday through Friday
2606.20291 2026-06-19 cs.LG cs.CV New submission

Integrating national forest inventory, airborne lidar, and satellite imagery for wall-to-wall mapping of forest structure with computer vision

Luke J. Zachmann, David D. Diaz, Vincent A. Landau, Chelsey Walden-Schreiner, Tony Chang, Nathan E. Rutenbeck, Katharyn A. Duffy, Kiarie Ndegwa, Andreas Gros, Scott Conway, Guy Bayes

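Affiliations: Vibrant Planet Public Benefit Corporation

AI summary: Proposes the VibrantForests framework, which combines satellite imagery, lidar samples, and computer vision to map forest attributes such as canopy cover, canopy height, and biomass across the contiguous United States at 10-meter resolution, reducing saturation and regression-to-the-mean effects.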

Abstract

Remote sensing is increasingly relied upon to deliver actionable science for forest and wildfire risk management across large landscapes. Wall-to-wall, annually updated maps are a persistent need for effective forest management. Many planning systems and data collections combine disparate data sources with different purposes, vintages, and prediction quality, which leads to confounding behavior in operational planning systems. We introduce the VibrantForests framework, developed and applied to map forest attributes and provide a coherent foundation for effective forest and wildfire planning. VibrantForests includes a satellite-based forest structure model trained on lidar-derived samples and applied across the contiguous United States to concurrently generate estimates of canopy cover, canopy height, aboveground live tree biomass, basal area, and quadratic mean diameter at 10-meter resolution. We demonstrate predictive capability spanning the full spectrum of forest conditions ranging from sparse-canopy/low-biomass to dense-canopy/high-biomass. Results show that our model extends the range at which saturation is commonly encountered in comparable passive-sensor models, and reduces regression-to-mean behavior that commonly produces overestimation of forest attributes in small/sparse conditions and underestimation in large/dense conditions. The VibrantForests framework addresses a key limitation in large-area forest and wildfire planning by delivering coherent wall-to-wall estimates of management-relevant attributes at annual cadence and 10m resolution.

2606.20287 2026-06-19 cs.CL New submission

PsyScore: A Psychometrically-Aware Framework for Trait-Adaptive Essay Scoring and ZPD-Scaffolded Feedback

Wei Xia, Jin Wu, Haoran Shi, Xiangyu Wang, Chanjin Zheng

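Affiliations: Department of Educational Psychology, East China Normal University; Shanghai Institute of Artificial Intelligence for Education, East China Normal University; School of Computer Science and Technology, East China Normal University

AI summary: Proposes the PsyScore framework, which integrates diagnostic assessment with instructional scaffolding through a shared latent ability representation. It comprises a trait-adaptive neural IRT scorer, a ZPD-scaffolded feedback generator, and a multi-perspective feedback evaluation strategy, and it achieves competitive scoring performance on the ASAP++ dataset while providing more pedagogically aligned feedback.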

Abstract

Effective Automated Essay Scoring (AES) systems are expected to support both reliable assessment and actionable instructional feedback. However, existing approaches often treat scoring and feedback as separate components: neural scoring models provide limited interpretability, while Large Language Model (LLM)-based feedback is typically insensitive to learners' proficiency levels. To address this fragmentation, this work proposes PsyScore, a psychometrically-aware framework that integrates diagnostic assessment with instructional scaffolding through a shared latent ability representation. PsyScore comprises three key modules: a Trait-Adaptive Neural IRT Scorer, which incorporates the Graded Partial Credit Model (GPCM) into a neural architecture, enabling precise estimation of student ability while maintaining psychometric interpretability; a ZPD-Scaffolded Feedback Generator, which conditions multi-agent feedback strategies on the diagnosed ability parameter to adapt instructional focus across different proficiency levels; and a Multi-Perspective Feedback Evaluation Strategy, which assesses feedback quality via pairwise preference judgements and student revision simulations. Experiments on the ASAP++ dataset demonstrate that PsyScore achieves competitive scoring performance while providing more pedagogically aligned feedback.
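
As a rough illustration of the psychometric core this abstract refers to, the sketch below computes score-category probabilities under a partial credit model with a discrimination parameter, written in the standard generalized partial credit form. It is not PsyScore's implementation: the function name, the parameter names, and the example thresholds are assumptions made for illustration, and the abstract does not say how the neural architecture produces these parameters.

```python
import numpy as np


def gpcm_probabilities(theta: float, discrimination: float, thresholds) -> np.ndarray:
    """Probability of each score category 0..m for one trait.

    theta          -- latent ability of the student (hypothetical input)
    discrimination -- slope parameter of the trait (hypothetical input)
    thresholds     -- m step difficulties separating the m + 1 categories
    """
    steps = discrimination * (theta - np.asarray(thresholds, dtype=float))
    # Category k accumulates the first k step terms; category 0 has logit 0.
    logits = np.concatenate(([0.0], np.cumsum(steps)))
    logits -= logits.max()  # subtract the max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()


if __name__ == "__main__":
    # Illustrative values only: a four-category trait scored for a weak and a strong writer.
    for ability in (-2.0, 2.0):
        print(ability, gpcm_probabilities(ability, 1.2, [-1.0, 0.0, 1.0]).round(3))
```

Raising `theta` shifts probability mass toward the higher score categories, which is how a single latent ability value can inform both the predicted score and, as the abstract describes, the choice of feedback strategy.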

2606.20285 2026-06-19 cs.RO New submission

Co-VLA: Coordination-Aware Structured Action Modeling for Dual-Arm Vision-Language-Action Systems

Yandong Wang, Jiaqian Yu, Xiongfeng Peng, Lu Xu, Yamin Mao, Weiming Li, Jaewook Yoo, Dongwook Lee, Daehyun Ji, Mingbo Zhao, Chao Zhang

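Affiliations: Donghua University; Samsung R&D Institute China-Beijing (SRCB); Samsung AI Center, DS Division

AI summary: To address insufficient implicit coordination in tightly coupled dual-arm tasks, proposes the Co-VLA framework, which explicitly introduces coordination priors through a Structured Action Expert and a Latent-Aware Controller, significantly improving success rate and efficiency in simulation and real-world settings.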

Abstract

Vision-language-action (VLA) models show strong capabilities in single- and dual-arm robotic manipulation. Prior work shows that coordinated bimanual behaviors can emerge from end-to-end learning, leveraging large vision-language backbones with continuous action prediction. However, as bimanual tasks become tightly coupled and execution constraints become critical, implicit coordination alone is insufficient to ensure reliable, interpretable, and stable behavior. In this work, we propose Co-VLA, a coordination-aware bimanual manipulation framework that introduces explicit structural priors into VLA models. We instantiate our method on a state-of-the-art vision-language backbone by replacing its monolithic action head with a Structured Action Expert (SAE) designed for bimanual coordination. Specifically, we introduce explicit structure at the action generation level with a modular coordination-aware loss that shapes shared and residual latents according to task-specific structures. The shared latent encodes task-level coordination intent, while residual latents capture execution adjustments for each arm. At deployment, a Latent-Aware Controller (LAC) interprets the learned representations to modulate synchronization strength, execution asymmetry, smoothness, and safety constraints in real time. LAC operates at the joint-command level and remains compatible with standard control pipelines without requiring force or impedance control. Experiments across simulation and real-world benchmarks show that Co-VLA significantly outperforms monolithic baselines, achieving a 27% success rate gain in tight-coordination tasks, more than doubling performance in OOD real-world scenarios (from 13% to 27%), and reducing task completion time by up to 25%.

2606.20283 2026-06-19 cs.LG cs.AI New submission

Boundary Embedding Shaping with Adaptive Contrastive Learning for Graph Structural Disentanglement

Jiaqing Chen, Zidu Yin, Yichao Cai, Yuhang Liu, Zhen Zhang, Dong Gong, Javen Qinfeng Shi

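Affiliations: Yunnan Normal University; Adelaide University; The University of New South Wales

AI summary: To address the classification performance degradation caused by graph structural entanglement, proposes a Boundary Embedding Shaping module that uses adaptive contrastive learning to selectively suppress spurious structural noise at decision boundaries, improving node classification and link prediction accuracy.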

Comments Accepted at ICML 2026

Abstract

Graph neural networks (GNNs) excel at aggregating neighbor information for classification, yet their performance is hindered by graph structural entanglement, where spurious correlations from semantically irrelevant neighbors contaminate node embeddings. This challenge is most acute for nodes near class boundaries in the embedding space, where amplified structural noise blurs decision boundaries and destabilizes predictions. Existing robust GNN methods largely treat all nodes uniformly, ignoring boundary vulnerabilities. In this paper, to improve classification performance, we tackle graph structural disentanglement by identifying boundary-region entanglement as the primary bottleneck and propose Boundary Embedding Shaping (BES), an adaptive contrastive learning GNN plug-in module that selectively suppresses spurious structural noise at decision boundaries with minimal model parameter perturbation. Extensive experiments demonstrate that BES consistently improves boundary discrimination and outperforms existing leading methods. Notably, BES boosts GCN performance by an average of 3.3% in node classification (up to 5.0% on WikiCS) and achieves superior accuracy in link prediction.

2606.20282 2026-06-19 cs.CV New submission

U$^2$Mamba: A Two-level Nested U-structure Mamba for Salient Object Detection

Junhui Li, Jialu Li, Youshan Zhang

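Affiliations: University of Science and Technology Liaoning; Chuzhou University; Yeshiva University

AI summary: Proposes U$^2$Mamba, a two-level nested U-structure network that enhances depth and contextual information through multiscale Mamba U-blocks and adopts hierarchical training supervision, achieving performance competitive with state-of-the-art methods on salient object detection.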

Comments 6 pages, 2 figures

Abstract

Mamba-based models have emerged as a promising alternative for salient object detection (SOD), offering significant advantages in modeling long sequences. However, existing models often fail to explore contextual information and the depth of the entire architecture. This paper introduces U$^2$Mamba, a powerful and innovative U-structured network for salient object detection. We propose multiscale Mamba U-blocks (MMUBs) that enhance the model depth to improve local feature extraction capabilities. Our newly developed nested U-structure, incorporating MMUBs, enables the network to integrate various receptive fields from shallow and deep layers, thereby collecting richer contextual information and longer-range data without being constrained by resolution. Instead of using the traditional deep supervision scheme and top-level supervised training, we propose a hierarchical training supervision method where the loss is computed at each level during the training process. Extensive experiments demonstrate that U$^2$Mamba achieves highly competitive performance against state-of-the-art methods. The source code is available at https://github.com/JL021/U2Mamba.

2606.20272 2026-06-19 cs.RO cs.CV New submission

Efficiently Linking Real Scenes with Synthetic Data Generation for AI-based Cognitive Robotics and Computer Vision Applications

Paul Koch, Vivek Chavan, André Sers, Adem Karakurt, Paul Hofmann, Mohamad Zaher Ziadeh, Jörg Krüger

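Affiliations: Fraunhofer IPK; TU Berlin

AI summary: Discusses the limitations of current AI vision models in cognitive robotics applications and proposes bridging the domain gap by linking simulation and the real world in training data generation.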

Comments Accepted and best paper award at MHI-Kolloquium 2024

Abstract

AI vision models are a driving factor for the potential use cases of cognitive robotics in industrial and household applications. A large array of methods, ranging from semantic environment analysis to 6D and grasp pose estimation, have been proposed based on the latest AI achievements. However, such advancements require further strong and efficient methods with respect to training data and AI architectures that can jointly tackle current challenges, precision limits, and scalability beyond domain gaps. In this paper, we discuss these current limits and the trends in the related state of the art that challenge them. Further, we discuss our current work in progress on bridging the domain gap between simulations and real-world applications by linking the two in training data generation.

2606.20264 2026-06-19 cs.AI New submission

Confidence-Aware Automated Assessment of Student-Drawn Scientific Models

Luyang Fang, Yingchuan Zhang, Jongchan Park, Zhaoji Wang, Ping Ma, Xiaoming Zhai

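Affiliations: AI4STEM Education Center, Athens, GA, USA; Department of Statistics, University of Georgia, Athens, GA, USA

AI summary: Proposes a confidence-aware scoring framework based on a Vision Transformer that automatically scores high-confidence responses and defers uncertain cases to human review, improving scoring reliability on six NGSS-aligned assessment items while balancing automated coverage against scoring risk.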

Abstract

Student-generated drawings are widely used in science education to assess learners' conceptual understanding in modeling-based tasks aligned with the Next Generation Science Standards (NGSS). However, scoring such drawings requires expert human judgment to interpret complex visual representations, making large-scale assessment costly to implement and sustain in classroom settings. In this work, we study automated scoring of student-generated scientific drawings using a vision-based model. We evaluate a Vision Transformer (ViT) with parameter-efficient adaptation and propose a confidence-aware scoring framework that derives response-level confidence from test-time predictive distributions. This confidence signal enables selective automation by scoring high-confidence responses automatically while deferring uncertain cases for human review. Experiments on six NGSS-aligned middle school assessment items show that the proposed approach improves scoring reliability while supporting a practical trade-off between automated coverage and scoring risk, highlighting the value of confidence-aware methods for trustworthy educational assessment.
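
The selective-automation idea in this abstract can be sketched as a simple threshold rule. The example below assumes that response-level confidence is the maximum class probability of an already-averaged predictive distribution; the abstract only says that confidence is derived from test-time predictive distributions, so this choice, the function name, and the toy numbers are all assumptions.

```python
import numpy as np


def selective_scoring(probs, labels, threshold: float):
    """Split responses into auto-scored and deferred, then report coverage and risk.

    probs     -- array of shape (n_responses, n_classes), one distribution per response
    labels    -- human scores, used here only to measure the error on auto-scored responses
    threshold -- hypothetical confidence cut-off; lower values automate more responses
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    confidence = probs.max(axis=1)      # assumed confidence signal
    predictions = probs.argmax(axis=1)
    accepted = confidence >= threshold  # False means "defer to human review"
    coverage = float(accepted.mean())
    if accepted.any():
        risk = float((predictions[accepted] != labels[accepted]).mean())
    else:
        risk = 0.0                      # nothing was auto-scored, so no automated errors
    return accepted, coverage, risk


if __name__ == "__main__":
    toy_probs = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8], [0.55, 0.45]]
    toy_labels = [0, 1, 1, 1]
    for cut in (0.5, 0.75):
        _, cov, err = selective_scoring(toy_probs, toy_labels, cut)
        print(f"threshold={cut}: coverage={cov:.2f}, risk={err:.2f}")
```

Sweeping the threshold traces out the trade-off the abstract mentions: a stricter cut-off sends more drawings to human raters but lowers the error rate among the automatically scored ones.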

2606.20250 2026-06-19 cs.CV New submission

Single-Stage Hierarchical Rectification for Weakly Supervised Histopathology Segmentation

Duc T. Nguyen, Hoang-Long Nguyen, Thanh-Ha DO, Huy-Hieu Pham

Affiliations: VinUni-Illinois Smart Health Center, VinUniversity, Hanoi, Vietnam; The Computer Vision and Medical AI Lab, VinUniversity, Hanoi, Vietnam; Posts and Telecommunications Institute of Technology, Hanoi, Vietnam

AI summary: Proposes a single-stage hierarchical rectification framework whose Hierarchical Feature Rectification Module generates high-fidelity activation maps directly within a single training loop, addressing the error propagation and computational overhead of multi-stage weakly supervised segmentation.

Comments Accepted to MICCAI 2026. This is the pre-review submitted version, not the camera-ready version. The final authenticated version will be available in the MICCAI 2026 proceedings

Abstract

Existing weakly supervised semantic segmentation (WSSS) methods in computational pathology rely on a multi-stage paradigm: class activation map (CAM) generation, offline pseudo-mask refinement, and fully supervised retraining. While established, this decoupled approach presents fundamental limitations. The multi-stage process not only incurs high computational training costs but also suffers from error propagation: local texture biases in shallow CNN layers generate false-positive artifacts that subsequent refinement steps often fail to correct. To address these persistent challenges through a simple yet highly effective approach, we propose the Single-Stage Hierarchical Rectification (SSHR) framework. Rather than passively refining CAMs post-hoc, our method proactively purifies intermediate feature representations during the forward pass. We introduce a Hierarchical Feature Rectification Module (HFRM) that utilizes deep global semantic context to filter out local anomalies in shallow layers. This mechanism generates high-fidelity activation maps directly within a single training loop. Experiments on the LUAD-HistoSeg and BCSS datasets demonstrate that SSHR outperforms state-of-the-art multi-stage methods. Furthermore, SSHR reduces training duration by 2 to 5 times. This efficiency minimizes computational overhead and accelerates clinical translation for large-scale histopathology workflows. The code is available at: https://github.com/trongduc-nguyen/SSHR

2606.20246 2026-06-19 cs.RO cs.AI New submission

Finetuning Vision-Language-Action Models Requires Fewer Layers Than You Think

Gia-Binh Nguyen, Trong-Bao Ho, Thien-Loc Ha, Khoa Vo, Philip Lund Møller, Quang T. Nguyen, Long Dinh, Tuan Dam, Vu Duong, Tung M. Luu, Trung Le, Tran Nguyen Le, Minh Vu, An Thai Le, Ngan Le, Daniel Sonntag, James Zou, Jan Peters, Duy M. H. Nguyen, Ngo Anh Vien

Affiliations: Center for AI Research, VinUniversity; VinRobotics; University of Arkansas; Technical University of Denmark; Hanoi University of Science and Technology; KAIST; Monash University; Oldenburg University; DFKI; University of Stuttgart; IMPRS-IS; Stanford University; Technische Universität Darmstadt

AI summary: Finds layer-wise representational redundancy in VLA models and proposes a training-free compression method that removes redundant layers to reduce model depth by up to 50%, yielding a 40-50% reduction in training time and up to 30% faster inference while matching or exceeding full-model performance.

Abstract

Vision-Language-Action (VLA) models pre-trained on massive video-robot datasets have revolutionized robotic manipulation, yet their multi-billion parameter architectures impose prohibitive computational burdens during downstream fine-tuning and real-time inference. In this work, we reveal a highly non-trivial architectural characteristic of these continuous control foundation policies (e.g., pi_0, GR00T-N1.5): despite being trained on diverse physical trajectories, they exhibit severe layer-wise representational redundancy. To exploit this, we introduce a structural compression pipeline that is entirely training-free, bypassing the need of existing methods to load full-scale models to learn optimized token reductions or dynamic layer selectors. Instead, using only a single forward pass via Centered Kernel Alignment to identify redundant layer features, we remove twin layers to permanently compress the model depth by up to 50% across both the VLM backbone and the continuous control policy head. Downstream fine-tuning of this streamlined architecture yields a dual acceleration benefit: a 40-50% reduction in training time and up to 30% faster real-time inference, while matching or exceeding full-scale base model performance. We comprehensively validate our method across three simulation benchmarks (LIBERO, RoboCasa, SimplerEnv) and 10 diverse real-world manipulation tasks across 4 unique robotic embodiments. These results prove that advanced VLAs require significantly fewer layers than previously assumed, offering a highly compute-efficient paradigm for scalable robot learning.
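
To make the redundancy test in this abstract concrete, here is a minimal sketch of linear Centered Kernel Alignment between the activations of two layers, followed by a scan that flags adjacent layers with near-identical representations. Only the linear CKA formula is standard; comparing adjacent layers, the 0.95 threshold, and the function names are assumptions, since the abstract does not give the rule used to pick "twin" layers.

```python
import numpy as np


def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between activation matrices of shape (n_samples, n_features)."""
    x = x - x.mean(axis=0, keepdims=True)  # center each feature over the samples
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


def redundant_layer_pairs(activations, threshold: float = 0.95):
    """Return (i, i + 1, cka) for adjacent layers whose similarity reaches the threshold.

    activations -- list of per-layer matrices collected from one forward pass
    threshold   -- hypothetical cut-off above which a layer counts as redundant
    """
    pairs = []
    for i in range(len(activations) - 1):
        score = linear_cka(activations[i], activations[i + 1])
        if score >= threshold:
            pairs.append((i, i + 1, score))
    return pairs


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    layer0 = rng.normal(size=(50, 8))
    layer1 = 2.0 * layer0                # a rescaled copy of layer0: fully redundant
    layer2 = rng.normal(size=(50, 8))    # unrelated features
    print(redundant_layer_pairs([layer0, layer1, layer2]))
```

Because CKA is invariant to isotropic scaling and to shifts of the features, a layer that merely rescales its input scores close to 1 against it, which is the kind of signal a depth-pruning pass could use without any training.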

2606.20245 2026-06-19 cs.AI New submission

Navigating Unreliable Parametric and Contextual Knowledge: Explicit Knowledge Conflict Resolution for LLM Inference

Huang Peng, Jiuyang Tang, Weixin Zeng, Hao Xu, Xiang Zhao

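Affiliations: National Key Laboratory of Big Data and Decision, National University of Defense Technology

AI summary: Proposes the MACR framework, which uses adaptive knowledge assessment and multi-agent reasoning to explicitly resolve conflicts between an LLM's internal parametric knowledge and external context, moving beyond the conventional binary choice paradigm.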

Comments 12 pages, 3 figures

Abstract

Large language models (LLMs) have achieved strong performance across a wide range of language-based tasks by leveraging both extensive parametric knowledge and in-context learning ability, enabling them to incorporate external information provided in the input prompt. However, the integration of external knowledge can introduce conflicts, not only between the model's internal parametric knowledge and the external information, but also among multiple pieces of external context. Existing approaches typically assume that either the model or the provided context is reliable, overlooking the possibility that both sources may contain errors, and avoid conflicts by privileging one source over the other, rather than actively resolving inconsistencies. To address these limitations, we propose a novel framework, MACR, for LLM knowledge conflict resolution that moves beyond the conventional binary choice paradigm and incorporates an explicit conflict-resolution mechanism based on a multi-agent reasoning approach. Specifically, we first propose an adaptive knowledge assessment and retrieval approach that employs a modified semantic entropy measure to quantify an LLM's confidence in its answer to a given query. Based on this confidence estimation, MACR either externalizes the model's internal knowledge as textual representations or retrieves relevant external knowledge when internal knowledge is insufficient, generating basic contexts for subsequent reasoning. Then we introduce an inductive multi-agent reasoning framework with three specialized agents that, respectively, induce explicit rules, analyze potential conflicts, and resolve inconsistencies across all available contexts. Empirical results demonstrate that MACR significantly outperforms state-of-the-art baselines across benchmarks, while also providing interpretable resolutions of explicit conflicts.
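
The confidence-based routing described in this abstract can be pictured with plain semantic entropy: sample several answers, group them by meaning, and measure how spread out the groups are. The sketch below assumes the grouping has already been done and uses the unmodified entropy; the paper's modified measure is not specified in the abstract, and the threshold and function names are hypothetical.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def semantic_entropy(cluster_ids: Sequence[Hashable]) -> float:
    """Entropy (in nats) over meaning clusters of sampled answers.

    cluster_ids -- one cluster label per sampled answer; how answers are clustered
                   (for example by an entailment check) is outside this sketch
    """
    if not cluster_ids:
        raise ValueError("need at least one sampled answer")
    counts = Counter(cluster_ids)
    total = len(cluster_ids)
    # Adding 0.0 turns a possible -0.0 into 0.0 for the single-cluster case.
    return -sum((c / total) * math.log(c / total) for c in counts.values()) + 0.0


def choose_context_source(cluster_ids: Sequence[Hashable], threshold: float = 0.5) -> str:
    """Route a query based on a hypothetical entropy threshold.

    Low entropy  -> the model is consistent, so externalize its own knowledge.
    High entropy -> the model is unsure, so retrieve external knowledge instead.
    """
    return "externalize" if semantic_entropy(cluster_ids) <= threshold else "retrieve"


if __name__ == "__main__":
    print(choose_context_source(["paris"] * 5))
    print(choose_context_source(["paris", "lyon", "nice", "paris", "rome"]))
```

Either branch yields a textual context, which matches the abstract's description of generating basic contexts that the later reasoning agents can compare and reconcile.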

2606.20244 2026-06-19 cs.CV cs.AI New submission

SPOT-E: Test-Time Entropy Shaping with Visual Spotlights for Frozen VLMs

Bo Yin, Xiaobin Hu, Chengming Xu, Ruolin Shen, Mo Yang, Jiangning Zhang, Peng-Tao Jiang, Cheng Tan, Shuicheng YAN

Affiliations: National University of Singapore; Fudan University; Technical University of Munich; Sagenic Tech; Zhejiang University; vivo; Shanghai Artificial Intelligence Laboratory

AI summary: Proposes SPOT-E, which uses test-time entropy shaping and visual spotlights to address VLMs' poor performance on evidence-intensive tasks caused by overlooking small, localized evidence, improving grounding and robustness without retraining.

Abstract

Vision-language models (VLMs) often underperform on evidence-intensive tasks because decisive visual evidence is small, localized, and easy to overlook, leading to failures in evidence readout even when high-level reasoning is intact. Prior inference-time visual interventions can improve grounding without retraining, but they are largely open-loop and lack a mechanism to verify whether highlighted evidence is actually used. We study answer-span prediction entropy as a model-internal feedback signal and show that naive entropy minimization is ambiguous, since low entropy may arise from evidence-grounded confidence or shortcut collapse. To resolve this ambiguity, we introduce low-entropy anchors and an entropy-shaping objective that reduces answer uncertainty while preserving baseline high-confidence tokens. We instantiate this principle in SPOT-E, a plug-and-play test-time method that produces question-conditioned spotlights, optimized per instance via lightweight tuning based on Group Relative Policy Optimization (GRPO). Across all benchmarks and different VLM families, SPOT-E yields consistent gains and improved robustness under visual corruptions. Code is publicly available at: https://github.com/YinBo0927/SPOT-E

2606.20241 2026-06-19 cs.CV New submission

BAFIS: Dataset + Framework to assess occupational Bias and Human Preference in modern Text-to-image Models

Thomas Klassert, Adrian Ulges, Biying Fu

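Affiliations: RheinMain University of Applied Sciences

AI summary: Presents the BAFIS platform and a dataset of 21,140 images generated from multilingual prompts to evaluate gender and ethnicity bias in occupation-related image generation across five text-to-image models, complemented by human preference feedback; finds systematic biases and emphasizes the need to incorporate human preferences.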

Comments Accepted at the IEEE Winter Conference on Applications of Computer Vision, WACV 2026

Abstract

Generative artificial intelligence has the potential to improve productivity and transform the production of creative content. However, existing research indicates that image generation models are significantly influenced by biases. This work investigates the inherent biases and language-induced biases present in text-to-image models within the context of occupation-related image generation, complementing established metrics with human preference feedback. We present a comprehensive evaluation of five current text-to-image models: Midjourney v6.1, Stable Diffusion 3 Medium, DALL-E 3, Playground v2.5, and FLUX.1-dev, focusing on gender and ethnicity bias, image quality, and prompt alignment. To facilitate this evaluation, we developed the "Battle-Arena for Fair Image Synthesis" (BAFIS), a platform designed to collect human feedback on bias in generated images. Furthermore, we created a dataset comprising 21,140 synthetic images generated using multilingual prompts, which serves as a basis for our analysis. We further place our results within a broader social context by comparing them to official statistics from the German Federal Employment Agency. Our findings reveal systematic biases in text-to-image models, with established evaluation metrics in partial correlation with subjective user ratings. Thus, our research emphasizes the need for including human preferences to develop fairer and more inclusive text-to-image models.

2606.20236 2026-06-19 cs.AI cs.LG cs.MA New submission

A Multi-Agent system for Multi-Objective constrained optimization

Federica Filippini

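Affiliations: University of Milano-Bicocca

AI summary: Proposes MAMO, which uses multi-agent reinforcement learning to decouple task execution from objective design and automatically learns reward weights that balance primary-objective optimization against constraint violations, improving the autonomy and robustness of RL in dynamic environments.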

Comments Presented at the 17th Workshop on Optimization and Learning in Multiagent Systems (OptLearnMAS, https://optlearnmas.github.io), co-located with the 25th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2026)

Abstract

Many decision-making problems in computing and networking systems can be naturally formulated as cost-minimization problems under performance constraints. In dynamic environments, reinforcement learning (RL) is often used to solve such problems at runtime by embedding both costs and constraint violations into a single scalar reward through weighted penalty terms, following a Lagrangian-inspired formulation. However, in this context the behavior of the learned policy critically depends on the choice of these weights, which are typically selected manually. This makes it difficult to identify an appropriate trade-off between optimizing the primary objective and effectively avoiding constraint violations, particularly in non-stationary environments where their relative importance may change. This paper presents MAMO (Multi-Agent system for Multi-Objective constrained optimization), an approach to tackle this balancing problem through multi-agent RL. MAMO decouples task execution from objective design by formulating the selection of reward weights as a learning problem, providing a first step towards more autonomous and robust RL-based solutions for constrained optimization problems in dynamic environments.
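
The weighted-penalty reward that this abstract takes as its starting point can be written in a few lines, which also shows why the weights matter. The sketch below is a generic Lagrangian-style scalarization, not MAMO itself: the function name, the two toy actions, and the weight values are assumptions for illustration.

```python
from typing import Sequence


def scalar_reward(cost: float, violations: Sequence[float], weights: Sequence[float]) -> float:
    """Combine a cost and constraint violations into one scalar reward.

    violations -- one value per constraint; positive means the constraint is violated
    weights    -- one penalty weight per constraint (the quantities usually tuned by hand)
    """
    if len(violations) != len(weights):
        raise ValueError("need exactly one weight per constraint")
    # Only the violated part of each constraint is penalized.
    penalty = sum(w * max(0.0, v) for w, v in zip(weights, violations))
    return -(cost + penalty)


# Two hypothetical actions: one is cheap but violates a constraint, the other is feasible.
CHEAP = {"cost": 1.0, "violations": [0.5]}
SAFE = {"cost": 2.0, "violations": [0.0]}


def preferred_action(weight: float) -> str:
    """Return which toy action a reward-maximizing agent would pick for a given weight."""
    r_cheap = scalar_reward(CHEAP["cost"], CHEAP["violations"], [weight])
    r_safe = scalar_reward(SAFE["cost"], SAFE["violations"], [weight])
    return "cheap" if r_cheap > r_safe else "safe"


if __name__ == "__main__":
    for w in (1.0, 4.0):
        print(f"weight={w}: prefers {preferred_action(w)}")
```

With a small weight the violating action looks better, and with a larger one the feasible action wins, so the learned policy can flip purely because of a hand-picked number. Treating that number as something to be learned is the design choice the abstract attributes to MAMO.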

2606.20233 2026-06-19 cs.CV New submission

Cinematic Compositing Using Character-Environment-Harmonized Video Generation Models

Tianyi Xiang, Mingming He, Li Ma, Jing Liao

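Affiliations: City University of Hong Kong; Independent Researcher

AI summary: Proposes an end-to-end video diffusion framework that models the bidirectional physical and lighting interactions between characters and environments through tri-mask guidance and RGB-D joint denoising, enabling high-quality dynamic video compositing.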

Abstract

Cinematic compositing aims to integrate green-screen characters into novel environments while maintaining physical and photometric realism. Previous methods often fail to capture the complex bidirectional interactions between characters and their surroundings, which we characterize as Character-to-Environment (C2E) physical interaction and Environment-to-Character (E2C) lighting harmonization. To address this, we propose an end-to-end video diffusion framework that jointly models C2E and E2C interactions, specifically handling the challenges of interactive props. Our approach introduces a tri-mask-guided architecture with RGB-D joint denoising to ensure physically consistent interactions among the character, props, and environment. We further develop an efficient prior-driven data curation pipeline to construct high-quality relighting pairs without expensive rendering. Finally, a reference-conditioned mechanism enables controllable environment synthesis and precise prop replacement. Extensive experiments demonstrate that our framework significantly outperforms existing methods in cinematic-quality dynamic video compositing.

2606.20232 2026-06-19 cs.RO cs.GT 新提交

Mobile Target Search with Imperfect Perception: A Partially Observable Stochastic Game Theoretical Approach

不完美感知下的移动目标搜索:一种部分可观测随机博弈论方法

Hanzheng Zhang, Shu Liang, Shuyu Liu

发表机构 * Shanghai Research Institute for Intelligent Autonomous Systems, Tongji University(同济大学上海自主智能无人系统科学中心) Department of Control Science and Engineering, Tongji University(同济大学控制科学与工程系)

AI总结 针对传感器限制、恶意干扰或通信噪声导致的不完美感知,采用部分可观测随机博弈(POSG)框架建模搜索者与目标间的对抗互动,提出可检测性概念和基于随机递归分析的充分判据,并开发服务器辅助分布式算法。

AI中文摘要

本文研究了在传感器限制、恶意干扰或通信噪声导致的不完美感知下的移动目标搜索问题。搜索者和目标在具有有限移动性的网格状区域中运行,导致搜索与逃避之间的动态相互作用。为了捕捉不完美感知下的这种对抗互动,我们采用部分可观测随机博弈(POSG)方法,该方法通过引入目标智能来推广部分可观测马尔可夫决策过程(POMDP)。为了处理感知不确定性引起的虚警和漏检,我们提出了一种新颖的可检测性概念,以确定搜索策略是否能保证最终检测,并基于随机递归分析提供了充分的可检测性准则。我们进一步开发了一种服务器辅助的分布式算法,该算法利用搜索者的聚合势博弈结构和基于KL散度的目标预测约简。数值模拟验证了所提算法的有效性,并支持了可检测性分析。

英文摘要

This paper investigates mobile target search under imperfect perceptions caused by sensor limitations, malicious jamming, or communication noise. Searchers and targets operate in a grid-shaped area with bounded mobility, leading to a dynamic interplay between search and evasion. To capture this adversarial interaction under imperfect perceptions, we adopt the partially observable stochastic game (POSG) approach, which generalizes partially observable Markov decision processes (POMDPs) by incorporating target intelligence. To handle false alarms and missed detections caused by perceptual uncertainties, we propose a novel detectability concept to determine whether a search strategy guarantees eventual detection, and provide sufficient detectability criteria based on stochastic recurrence analysis. We further develop a server-assisted distributed algorithm that utilizes the aggregative potential game structure for searchers and a KL-divergence-based reduction for target prediction. Numerical simulations validate the effectiveness of the proposed algorithm and support the detectability analysis.

2606.20227 2026-06-19 cs.AI cs.SE 新提交

QMFOL: Benchmarking Large Language Model Reasoning via Quantifiable Monadic First-Order Logic Test Case Generation

QMFOL:通过可量化的一元一阶逻辑测试用例生成来基准测试大语言模型推理

Xinyi Zheng, Ling Shi, Tianlong Yu, Yongxin Zhao, Lorenz Goette, Kailong Wang

发表机构 * Huazhong University of Science and Technology(华中科技大学) Nanyang Technological University(南洋理工大学) Hubei University(湖北大学) East China Normal University(华东师范大学) National University of Singapore(新加坡国立大学)

AI总结 提出QMFOL框架,通过可控制复杂度的合取/析取模式生成一元一阶逻辑推理任务,并构建包含2880个实例的基准QMFOLBench,评估显示逻辑复杂度增加导致性能下降和计算开销上升。

AI中文摘要

大型语言模型(LLMs)在推理方面取得了显著进展,特别是在演绎推理中,这对于高风险决策至关重要。随着模型的改进,评估基准也应随之发展。然而,现有基准缺乏对逻辑复杂性的细粒度控制,并且在语义多样性与逻辑一致性之间难以平衡。为了解决这些问题,我们提出了QMFOL,一个自动生成具有可量化和可控复杂度的一元一阶逻辑推理任务的框架。它使用合取和析取模式构建形式逻辑结构,从而能够精确控制推理深度、宽度、标签类型和干扰项。然后通过LLM将这些结构转化为自然语言,并通过外部证明器的往返验证确保逻辑一致性。基于我们的框架,我们构建了QMFOLBench,一个包含2880个实例、960种配置的基准,覆盖不同的逻辑和语义维度。对六个大型推理模型(LRMs)和两个LLM的评估表明,随着逻辑复杂度的增加,性能下降且计算开销上升。模型在True标签任务上的表现优于False或Unknown任务,并且对语义变化敏感。总体而言,QMFOL提供了一种可扩展且可靠的方法来构建具有可控复杂度的演绎推理基准,从而能够更精确地评估现代语言模型的推理能力。

英文摘要

Large Language Models (LLMs) have made significant progress in reasoning, particularly in deductive reasoning, which is crucial for high-stakes decision-making. As models improve, evaluation benchmarks should evolve to keep pace. However, existing benchmarks lack fine-grained control over logical complexity and struggle to balance semantic diversity with logical consistency. To address these issues, we propose QMFOL, an automated framework for generating monadic first-order logic reasoning tasks with quantifiable and controllable complexity. It constructs formal logical structures using conjunction and disjunction patterns, enabling precise control over reasoning depth, width, label types, and distractors. These structures are then translated into natural language via LLMs, with logical consistency ensured through round-trip verification using an external prover. Based on our framework, we build QMFOLBench, a benchmark comprising 2880 instances with 960 configurations across diverse logical and semantic dimensions. Evaluations on six large reasoning models (LRMs) and two LLMs show that performance degrades and computational overhead increases with rising logical complexity. Models perform better on True-labeled tasks than on False or Unknown ones, and exhibit sensitivity to semantic variation. Overall, QMFOL offers a scalable and reliable approach for constructing deductive reasoning benchmarks with controllable complexity, enabling more precise evaluation of reasoning capabilities in modern language models.

2606.20225 2026-06-19 cs.CL 新提交

Actionable Activation Directions for Detecting and Mitigating Emergent Misalignment Across Language Model Families

可操作的激活方向:检测和缓解跨语言模型家族的突发性对齐失调

Abdul Rafay Syed

发表机构 * Universität des Saarlandes(萨尔大学)

AI总结 通过差分均值方向在最终层实现99.6%的对齐/失调分离,因果干预将代码泄露降低21-51点;跨架构迁移虽有效但缺乏特异性,揭示了两层特异性结构。

Comments 12 pages, 2 figures

AI中文摘要

在不安全代码上微调语言模型会引发突发性对齐失调,其内部结构尚不明确。我们研究了这种失调是否对应于跨架构共享的因果可操作的激活空间方向。在四个指令微调模型家族(Qwen2.5-1.5B、Gemma-2-2B、Llama-3.2-1B、Ministral-3-3B)上进行相同微调后,差分均值方向在每个模型的最终层实现了99.6%的对齐与失调激活分离。通过减去该方向进行因果干预,代码泄露减少了21-51个百分点,而安全代码控制验证了内容特异性。通过岭回归映射进行跨架构迁移产生了较大的行为抑制(高达46个百分点),但未能通过特异性控制,因为随机和正交方向表现相当。我们识别出一个两层特异性结构:模型内方向具有因果特异性和可操作性;跨模型方向具有因果真实性但缺乏特异性。出现了不对称的迁移拓扑,Gemma和Qwen作为几何捐赠者,Llama作为接收者。这些发现定义了线性跨架构校正的局限性,并推荐使用模型内探测进行审计。

英文摘要

Fine-tuning language models on insecure code induces emergent misalignment with poorly understood internal structure. We investigate whether this misalignment corresponds to a causally actionable activation-space direction shared across architectures. Across four instruction-tuned model families (Qwen2.5-1.5B, Gemma-2-2B, Llama-3.2-1B, Ministral-3-3B) finetuned identically, a difference-in-means direction achieves 99.6% separation of aligned and misaligned activations at each model's final layer. Causal steering by subtracting this direction reduces code spillover by 21-51 points, while a secure-code control confirms content specificity. Cross-architecture transfer via ridge regression maps yields large behavioral suppression (up to 46 points) but fails specificity controls as random and orthogonal directions perform comparably. We identify a two-tier specificity structure: within-model directions are causally specific and actionable; cross-model directions are causally real but non-specific. An asymmetric transfer topology emerges, with Gemma and Qwen acting as geometric donors and Llama as a receiver. These findings define the limits of linear cross-architecture correction and recommend within-model probing for auditing.
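A difference-in-means direction and steering by subtraction can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the paper's pipeline: the two-dimensional "activations" are made up, the midpoint threshold used to measure separation is an assumption (the abstract does not define its separation metric), and the `scale` parameter is hypothetical.

```python
def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def difference_in_means(misaligned, aligned):
    """Direction pointing from the aligned mean to the misaligned mean."""
    mu_m = mean_vector(misaligned)
    mu_a = mean_vector(aligned)
    return [m - a for m, a in zip(mu_m, mu_a)]


def separation_accuracy(misaligned, aligned, direction):
    """Fraction of activations on the correct side of a midpoint threshold.

    Assumption: the threshold is halfway between the projected class means.
    """
    threshold = 0.5 * (dot(mean_vector(misaligned), direction)
                       + dot(mean_vector(aligned), direction))
    correct = sum(dot(x, direction) > threshold for x in misaligned)
    correct += sum(dot(x, direction) <= threshold for x in aligned)
    return correct / (len(misaligned) + len(aligned))


def steer(activation, direction, scale=1.0):
    """Subtract scale * direction from a single activation vector."""
    return [a - scale * d for a, d in zip(activation, direction)]


# Made-up 2-D activations: the classes differ mainly along the first axis.
aligned = [[0.0, 1.0], [0.2, 0.8], [-0.2, 1.2]]
misaligned = [[2.0, 1.0], [2.2, 0.9], [1.8, 1.1]]

direction = difference_in_means(misaligned, aligned)
print(direction)
print(separation_accuracy(misaligned, aligned, direction))
print(steer(misaligned[0], direction))
```

In a real model the vectors would be hidden states from a chosen layer, and the effect of `steer` would be judged by downstream behaviour rather than by the geometry alone.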

2606.20223 2026-06-19 cs.CV q-bio.QM 新提交

DeepForestVisionV2: Ecology-Driven Taxonomy Expansion for Camera-Trap Monitoring in African Tropical Forests

DeepForestVisionV2:面向非洲热带森林相机监测的生态驱动分类扩展

Hugo Magaldi, Theau d'Audiffret, Etienne Francois Akomo-Okoue, Bala Amarasekaran, Naomi Anderson, Claire Auger, Noemie Cappelle, Daniel Cornelis, Raphael Cornette, Tobias Deschner, Gabriel Dubus, Davy Fonteyn, Rosa M. Garriga, Jennifer Hatlauf, Innocent Kasekendi, Raymond Katumba, Aram Kazandjian, Alfred Ngomanda, Stephan Ntie, Simone Pika, Xavier Rufray, Harold Rugonge, John Justice Tibesigwa, Peter van Lunteren, Hadrien Vanthomme, Joeri A. Zwerts, Sabrina Krief

发表机构 * UMR7206 Eco-Anthropologie, MNHN(UMR7206 生态人类学,法国国家自然历史博物馆) One Forest Vision initiative(One Forest Vision 倡议) Sebitoli Chimpanzee Project(塞比托利黑猩猩项目) Centre National de la Recherche Scientifique et Technologique(国家科学技术研究中心) Institut de Recherche en Ecologie Tropicale(热带生态研究所) Tacugama Chimpanzee Sanctuary(塔库加马黑猩猩保护区) Biotope(Biotope 公司) CIRAD(法国农业发展国际合作研究中心) Max Planck Institute for Evolutionary Anthropology(马克斯·普朗克进化人类学研究所) BOKU University(维也纳自然资源与生命科学大学) Agence Nationale des Parcs Nationaux du Gabon(加蓬国家公园管理局) Uganda Wildlife Authority(乌干达野生动物管理局) Addax Data Science(Addax 数据科学公司) Utrecht University(乌得勒支大学)

AI总结 针对非洲热带森林相机监测中生态梯度(垂直分层、场景开放度、人为界面)导致原35类分类过粗的问题,提出扩展至64类的DeepForestVisionV2,在保持离线工作流的同时提升野外实用性。

Comments Accepted at ICPR 2026 - Computer Vision for Biodiversity Monitoring and Conservation Workshop

AI中文摘要

非洲热带森林中的相机监测正从封闭冠层内部扩展到河岸、空地和公园边缘。在现有的非洲森林相机分类开放工具中,DeepForestVision是唯一提供照片和视频匹配离线工作流的工具,先前研究表明其在可比基准上优于其他基线。然而,它专为封闭冠层、地面森林内部设计,使用35类预测空间,当部署遇到树栖灵长类、鸟类、半水生类群或家畜等人为混杂因素时,该空间变得过于粗糙。我们提出DeepForestVisionV2,这是一个从35类扩展到64类预测空间(61个动物类加上人类、车辆和空白)的生态驱动扩展,旨在解决三个反复出现的部署梯度:垂直分层、场景开放度和人为界面。DeepForestVisionV2保留相同的离线工作流,并在来自多国非洲热带森林项目的1,535,010张照片和243,354个视频上训练。评估结合了一个跨国家裁剪照片验证集(用于评估跨站点和相机设置的鲁棒性)和三个涵盖目标梯度的留出乌干达视频基准。在验证集上,DeepForestVisionV2达到0.86准确率、0.82宏F1和0.81平衡准确率。在部署基准上,尽管分类任务更困难,它仍保持或提高了基线准确率,同时将识别的类群数量从森林内部视频的22个增加到29个,河岸视频从4个增加到9个。在公园边缘用例中,它将准确率从0.62提高到0.86,并将误报从11次减少到0次。这些结果表明,DeepForestVisionV2在保持跨站点、栖息地和相机设置鲁棒性的同时,显著提高了野外实用性。

英文摘要

Camera-trap monitoring in African tropical forests increasingly extends beyond closed-canopy interiors to riverbanks, clearings, and park edges. Among available open tools for African forest camera-trap classification, DeepForestVision is the only one providing a matched offline workflow for both photographs and videos, and previous work showed that it outperformed other available baselines on a comparable benchmark. However, it was designed for closed-canopy, ground-level forest interiors and uses a 35-class prediction space that becomes too coarse when deployments encounter arboreal primates, birds, semi-aquatic taxa, or human-associated confounders such as livestock. We present DeepForestVisionV2, an ecology-driven expansion from 35 to 64 prediction classes (61 animal classes plus human, vehicle, and blank) designed to address three recurrent deployment gradients: vertical stratification, scene openness, and anthropogenic interfaces. DeepForestVisionV2 retains the same offline workflow and is trained on 1,535,010 photographs and 243,354 videos from multi-country African tropical-forest projects. Evaluation combines a cross-country cropped-photo validation set, used to assess robustness across sites and camera-trap settings, with three held-out Uganda video benchmarks spanning the targeted gradients. On the validation set, DeepForestVisionV2 reaches 0.86 accuracy, 0.82 macro-F1, and 0.81 balanced accuracy. On the deployment benchmarks, it preserves or improves baseline accuracy despite its harder classification task, while increasing the number of identified taxa from 22 to 29 in forest-interior videos and from 4 to 9 at riverbanks. In the park-edge use case, it raises accuracy from 0.62 to 0.86 and reduces false alarms from 11 to 0. These results show that DeepForestVisionV2 materially improves field utility while preserving robustness across sites, habitats, and camera-trap settings.
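The three validation metrics reported above (accuracy, macro-F1, balanced accuracy) diverge when classes are imbalanced, which is typical of camera-trap data. The sketch below computes them from scratch on a made-up label list; the class names are placeholders and the numbers have nothing to do with the paper's results.

```python
from collections import Counter


def per_class_counts(y_true, y_pred):
    """Return per-class true positives, false positives, false negatives."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1
    return tp, fp, fn


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes that occur."""
    tp, fp, fn = per_class_counts(y_true, y_pred)
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        denom = 2 * tp[c] + fp[c] + fn[c]
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores)


def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of per-class recall over classes present in y_true."""
    tp, _, fn = per_class_counts(y_true, y_pred)
    classes = sorted(set(y_true))
    recalls = [tp[c] / (tp[c] + fn[c]) for c in classes]
    return sum(recalls) / len(recalls)


# Imbalanced toy labels: a common class and a rare class (placeholder names).
y_true = ["common"] * 8 + ["rare"] * 2
y_pred = ["common"] * 8 + ["common", "rare"]

print(accuracy(y_true, y_pred))           # 0.9
print(balanced_accuracy(y_true, y_pred))  # 0.75
print(macro_f1(y_true, y_pred))
```

Accuracy is dominated by the common class, while the two class-averaged metrics expose the missed rare-class example, which is why reporting all three is informative for a 64-class taxonomy.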

2606.20218 2026-06-19 cs.SD 新提交

Zero-VC: Zero-Lookahead Streaming Voice Conversion via Speaker Anonymization

Zero-VC: 通过说话人匿名化实现零前瞻流式语音转换

Yudong Li, Zihao Fang, Junwen Qiu, Ruihai Jing, Ruixiang Hang, Yingda Shen, Zhizheng Wu

发表机构 * The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)) Shenzhen Loop Area Institute(深圳环域研究所) Shenzhen Transsion Holdings Co., Ltd.(深圳传音控股股份有限公司)

AI总结 针对流式零样本语音转换中音色与语言内容解耦的挑战,提出将说话人匿名化作为扰动机制,在保留韵律效用的同时显式减轻音色泄露,实现严格因果的零前瞻网络。

Comments Accepted to Interspeech 2026

AI中文摘要

流式零样本语音转换在不解耦音色与语言内容的情况下,难以避免降低效用或增加延迟。当前方法依赖于信息瓶颈(IB)或说话人扰动。虽然IB过滤了音色,但它丢弃了韵律,迫使模型显式注入基频等特征。这通常需要缓冲未来帧,产生算法前瞻延迟。另一方面,现有的扰动方法在很大程度上忽略了音色泄露与效用保留之间的关键权衡。认识到这一被忽视的权衡,我们发现说话人匿名化(SA)的内在目标与平衡这些因素高度一致。因此,我们引入SA作为一种新颖的扰动机制,在保留韵律效用的同时显式减轻音色泄露。关键在于,SA的鲁棒表示显著减轻了生成器对未来上下文的依赖,使我们能够实现严格因果的零前瞻网络。音频样本可在此https URL获取。

英文摘要

Streaming zero-shot voice conversion struggles to disentangle timbre from linguistic content without degrading utility or inflating latency. Current methods rely on information bottleneck (IB) or speaker perturbation. While IB filters out timbre, it discards prosody, forcing models to explicitly inject features like fundamental frequency. This often requires buffering future frames, creating algorithmic lookahead latency. On the other hand, existing perturbation methods largely overlook the crucial trade-off between timbre leakage and utility preservation. Recognizing this neglected trade-off, we find that the inherent objective of Speaker Anonymization (SA) aligns well with balancing these factors. Thus, we introduce SA as a novel perturbation mechanism to explicitly mitigate timbre leakage while retaining prosodic utility. Crucially, SA's robust representations significantly alleviate the generator's reliance on future context, enabling our strictly causal, zero-lookahead network. Audio samples are available at https://amphionteam.github.io/Zero-VC-demo/.

2606.20216 2026-06-19 cs.LG cs.AI 新提交

Learner-based Concept Drift Detection: Analysis and Evaluation

基于学习器的概念漂移检测:分析与评估

Md Moman Ul Haque Khan, Samira Sadaoui

发表机构 * Department of Computer Science, University of Regina(里贾纳大学计算机科学系)

AI总结 本文从理论上分析概念漂移特征,并评估多种漂移检测算法在合成和真实数据集上的性能,旨在增强对漂移检测器行为及其适用性的理解。

Comments 2 authors, 29 pages

AI中文摘要

部署于演化流环境中的机器学习算法必须处理非平稳数据分布,即所谓的概念漂移。概念漂移的存在对许多实际应用构成重大挑战,因为它会严重降低预测性能,阻碍其支持稳健决策的能力。因此,及时高效地检测漂移事件对于长期保持高准确性至关重要。本研究从理论上考察了概念漂移特征以及多个类别的多种漂移检测算法。此外,我们评估了它们在合成和真实数据集上的性能,这些数据集展示了多样的流场景和漂移特征,如突变和渐变。本研究旨在增强对概念漂移特征和漂移检测器行为这一复杂概念的理解,以及它们在不同情境下的适用性。

英文摘要

Machine learning algorithms deployed in evolving streaming environments must handle non-stationary data distributions, commonly referred to as concept drift. The presence of concept drift poses a major challenge for many real-world applications because it can severely degrade their predictive performance, hindering their ability to support robust decision-making. Consequently, the timely and efficient detection of drift events is critical for sustaining high accuracy over time. This study theoretically examines concept drift characteristics and numerous drift detection algorithms across several categories. Furthermore, we evaluate their performance on both synthetic and real-world datasets exhibiting diverse streaming scenarios and drift characteristics, such as abrupt and gradual changes. This study aims to enhance understanding of concept drift characteristics and the behavior of drift detectors, along with their applicability to diverse contexts.
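To make "learner-based" detection concrete, here is a minimal sketch of an error-rate monitor in the style of the Drift Detection Method (DDM). The abstract does not say which detectors the study covers, so the choice of DDM is an assumption, and this version is simplified: it omits the warning level and uses a made-up deterministic error stream.

```python
import math


class SimpleDDM:
    """Error-rate drift detector in the style of DDM (simplified sketch)."""

    def __init__(self, min_instances=30, drift_level=3.0):
        self.min_instances = min_instances
        self.drift_level = drift_level
        self.reset()

    def reset(self):
        self.n = 0
        self.errors = 0
        self.p_min = math.inf
        self.s_min = math.inf

    def update(self, error):
        """Feed one outcome (True = the learner misclassified).

        Returns True when drift is signalled; the statistics are then reset.
        """
        self.n += 1
        self.errors += int(error)
        if self.n < self.min_instances:
            return False
        p = self.errors / self.n                # running error rate
        s = math.sqrt(p * (1.0 - p) / self.n)   # its standard deviation
        if p + s < self.p_min + self.s_min:     # remember the best point seen
            self.p_min, self.s_min = p, s
        if p + s > self.p_min + self.drift_level * self.s_min:
            self.reset()
            return True
        return False


def first_drift(error_stream):
    """Index of the first drift signal in a stream of outcomes, or None."""
    detector = SimpleDDM()
    for i, err in enumerate(error_stream):
        if detector.update(err):
            return i
    return None


# Toy streams: a 10% error rate, then (in the second stream) an abrupt jump.
stable = [i % 10 == 0 for i in range(150)]
abrupt = [i % 10 == 0 for i in range(100)] + [True] * 50

print(first_drift(stable))
print(first_drift(abrupt))
```

The detector only sees the learner's mistakes, never the input features, which is what distinguishes learner-based detectors from methods that monitor the data distribution directly.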

2606.20212 2026-06-19 cs.CL 新提交

CzechDocs: A Multiway Parallel Dataset of Formatted Documents for Minority Languages in Czechia

CzechDocs:捷克少数民族语言格式化文档的多路平行数据集

Josef Jon, Ondřej Bojar

发表机构 * Charles University, Faculty of Mathematics Physics Institute of Formal

AI总结 提出CzechDocs多路平行格式化文档数据集,覆盖捷克及少数民族语言,支持评估保留格式的机器翻译系统,并公开验证子集与评估工具。

AI中文摘要

我们提出了CzechDocs,一个多路平行的格式化文档(HTML、DOCX和PDF)数据集,涵盖捷克语及捷克境内使用的少数民族语言——主要是乌克兰语和英语,以及少量越南语、俄语和其他语言。该数据集旨在支持评估旨在翻译过程中保留文档格式的机器翻译系统。我们在数据集的验证子集上比较了最常见的格式保留机器翻译方法。该验证子集连同评估工具包已公开发布,以供进一步研究。一个保留的测试子集将用于未来专注于文档级翻译并保留格式的共享任务。

英文摘要

We present CzechDocs, a multiway parallel dataset of formatted documents (HTML, DOCX, and PDF) covering Czech and minority languages used in Czechia, primarily Ukrainian and English, with smaller portions of Vietnamese, Russian, and other languages. The dataset is designed to support the evaluation of machine translation systems that aim to preserve document formatting during translation. We provide a comparison of the most common approaches to format-preserving machine translation on a validation subset of the dataset. This validation split, together with the evaluation toolkit, is publicly released for further research. A held-out test split will be reserved for a future shared task focused on document-level translation with formatting preservation.

2606.20210 2026-06-19 cs.AI 新提交

Augmenting Game AI with Deep Reinforcement Learning

用深度强化学习增强游戏AI

Alessandro Sestini, Joakim Bergdahl, Amir Baghi, Jean-Philippe Barrette-LaPierre, Florian Fuchs, Linus Gisslén

发表机构 * Electronic Arts (EA), Stockholm, Sweden(美国艺电公司(EA),斯德哥尔摩,瑞典)

AI总结 本文提出一种框架,通过深度强化学习训练游戏AI,以增强角色行为的真实感,并探讨了部署中的挑战与未来研究方向。

Comments Vision paper, published in Conference on Games 2026

AI中文摘要

视频游戏的沉浸感不仅取决于图形、音频和游戏机制,还取决于游戏内角色的质量。产生可信的角色(即游戏AI)仍然是一个重大挑战,因为行为复杂性难以通过手工编码系统捕捉。游戏AI是沉浸感和参与度的来源;然而,由于创建游戏AI的挑战所带来的限制,常常导致玩家沮丧并打破游戏内的真实感幻觉。机器学习模型的引入为在游戏中创建更可信、更真实、更易共鸣的角色打开了大门。其前景是,它们要么通过与游戏互动学习,要么从玩家数据中学习,以发展出真正类似人类的行为。在本文中,我们展望未来强化学习在游戏AI中的更多应用。为实现这一目标,当前的研究限制阻碍了其在各种游戏类型中的广泛部署。因此,我们提出一个框架,用于训练强化学习模型,并考虑了一套适合游戏AI和游戏开发的需求。我们展示了带有强化学习增强游戏AI的游戏示例,并描述了在现代游戏中部署面向玩家的机器学习代理的实践。此外,我们识别了这些领域的瓶颈和难题,我们认为这些为加速机器学习在游戏AI中的应用提供了有前景的研究方向,以推动视频游戏行业的发展。

英文摘要

Immersion in video games depends not only on graphics, audio, and game mechanics, but also on the quality of in-game characters. Producing believable characters, or game AI, remains a significant challenge as behavioral complexity is hard to capture with hand-coded systems. Game AI is a source of immersion and engagement; however, the limitations stemming from the challenges of creating game AI often lead to frustration and the breaking of the illusion of realism within the game. The introduction of machine learning models opens the door to creating more believable, authentic, and relatable characters in games. The promise is that they either learn from interacting with the game, or from player data, to develop true human-like behavior. In this paper, we envision broader application of reinforcement learning to game AI. At present, however, research limitations prohibit broad deployment across game genres. Therefore, we propose a framework for training reinforcement learning models with a set of requirements in mind that are suited to game AI and game development. We present examples of games with reinforcement learning-augmented game AI and describe the practicalities of deploying player-facing machine learning agents in modern games. Furthermore, we identify bottlenecks and hard problems in these areas, which we believe offer promising research directions to accelerate the adoption of machine learning in game AI for the video game industry.

2606.20209 2026-06-19 cs.RO cs.AI 新提交

FlowMaps: Modeling Long-Term Multimodal Object Dynamics with Flow Matching

FlowMaps: 使用流匹配建模长期多模态物体动态

Francesco Argenziano, Miguel Saavedra-Ruiz, Sacha Morin, Charlie Gauthier, Daniele Nardi, Liam Paull

发表机构 * Sapienza University of Rome(罗马大学) Université de Montréal(蒙特利尔大学) Mila - Quebec AI Institute(米拉-魁北克人工智能研究所)

AI总结 提出FlowMaps模型,通过潜在流匹配学习物体位置的多模态时空分布,预测动态物体未来位置,提升机器人在变化家庭环境中的导航性能。

AI中文摘要

对3D场景的联合空间和时间理解是部署在日常家庭环境中的机器人的关键要求。这些智能体不仅必须理解和导航空间布局,还必须推理这些空间如何随时间演变。特别是,人类每天与物体互动,导致物体在整个环境中改变位置,使机器人难以可靠地将当前观察与先前看到的物体关联起来。然而,这些互动并非随机:人类的习惯和日常行为在物体位置上产生了时空一致的模式,机器人智能体可以学习这些模式,然后将其用于下游任务,如导航。为此,我们引入了FlowMaps,一种潜在流匹配模型,用于估计连续3D空间中动态物体未来位置的多模态分布。通过学习物体之间的隐式依赖关系及其时间演变,FlowMaps预测物体位置在人类过去互动条件下的可能变化,同时支持在具有相似物体习惯的未见环境中的泛化。为了展示该方法的实用性,我们在模拟和真实环境中将FlowMaps部署到下游的动态物体导航任务中。在超过600个回合中,FlowMaps优于最先进的方法,表明通过连续、多模态的时空分布建模物体动态可以改善机器人在变化家庭环境中的搜索和导航。代码和附加材料可在此https URL获取。

英文摘要

Joint spatial and temporal understanding of 3D scenes is a crucial requirement for robots deployed in everyday household environments. Such agents must not only comprehend and navigate spatial layouts, but also reason about how these spaces evolve over time. In particular, humans interact with objects daily, causing them to change position throughout the environment and making it difficult for robots to reliably associate current observations with previously seen objects. However, these interactions are not random: human habits and routines induce spatio-temporally consistent patterns in object locations, which robotic agents can potentially learn and then exploit for downstream tasks such as navigation. To this end, we introduce FlowMaps, a latent flow matching model for estimating multimodal distributions over the future locations of dynamic objects in a continuous 3D space. By learning the implicit dependencies among objects and their temporal evolution, FlowMaps predicts likely changes in object locations conditioned on past human interactions, while supporting generalization across previously unseen environments that share similar object routines. To demonstrate the utility of this method, we deploy FlowMaps in a downstream dynamic Object Navigation task in both simulated and real-world environments. Across more than 600 episodes, FlowMaps outperforms state-of-the-art approaches, showing that modeling object dynamics through continuous, multimodal spatio-temporal distributions improves robotic search and navigation in changing household environments. Code and additional material are available at https://fra-tsuna.github.io/flowmaps/.

2606.20205 2026-06-19 cs.AI cs.CL cs.HC 新提交

Apparent Psychological Profiles of Large Language Models are Largely a Measurement Artifact

大语言模型的心理特征很大程度上是测量假象

Jelena Meyer, David Garcia, Dirk U. Wulff

发表机构 * Max Planck Institute for Human Development(马克斯·普朗克人类发展研究所) University of Konstanz(康斯坦茨大学) Barcelona Supercomputing Center(巴塞罗那超级计算中心) University of Basel(巴塞尔大学)

AI总结 通过心理测量框架分析56个指令微调LLM,发现模型间差异主要源于方向性响应偏差而非特质,该偏差解释了81-90%的变异,且可通过题目选择操控,表明LLM心理特征是测量假象。

AI中文摘要

专为人类设计的心理测量工具越来越多地被用于赋予大型语言模型(LLM)稳定的心理特征,这些特征影响其可用性、安全评估以及作为人类参与者的研究代理。使用正式的心理测量框架,我们表明这些特征很大程度上是测量假象。我们对56个指令微调LLM以及大型人类参考样本施测了一系列涵盖自我报告和行为任务的人格与风险偏好工具,报告了四个发现。第一,模型间差异并非由工具所针对的特质驱动,而是由方向性响应偏差驱动,即倾向于向量表一端或某个标签选项做出反应,而不考虑项目内容;方差分解将81-90%的模型间变异归因于这种偏差,而在人类中这一比例为9-16%。第二,偏差随模型能力提升而下降,但并未被消除。第三,由于响应由偏差而非特质驱动,工具的表面信度几乎完全由其响应正交性预测,这是我们提出的术语,指特质和偏差指向相反方向的项目比例。第四,模型呈现的特征随所用项目而变化,并可通过项目选择来制造。这些结果表明,LLM的表面心理特征是用于测量它们的工具的假象,而非模型本身的属性。由于从人类心理学借用的工具很少完全正交,且可能对LLM天生缺乏效度,我们呼吁以响应正交性为中心进行专门的评估。

英文摘要

Psychological instruments designed for humans are increasingly used to assign large language models (LLMs) stable psychological profiles that affect their usability, safety assessment, and use as proxies for human participants in research. Using a formal psychometric framework, we show that these profiles are largely a measurement artifact. Administering a battery of personality and risk-preference instruments spanning self-reports and behavioral tasks to 56 instruction-tuned LLMs alongside large human reference samples, we report four findings. First, differences between models are driven not by the traits an instrument targets but by a directional response bias, a tendency to respond toward one end of the scale, or one labeled option, regardless of item content; a variance decomposition attributes 81-90% of between-model variation to this bias, against 9-16% in humans. Second, the bias declines with model capability but is not eliminated by it. Third, because bias rather than trait drives responding, an instrument's apparent reliability is almost entirely predicted by its response orthogonality, a term we coin for the proportion of items for which trait and bias point in opposite directions. Fourth, the profile a model appears to have shifts with the items used and can be manufactured through item selection. These results demonstrate that the apparent psychological profiles of LLMs are artifacts of the instrument used to measure them, not properties of the models themselves. As instruments borrowed from human psychology are rarely fully orthogonal and may inherently lack validity for LLMs, we call for dedicated assessments centered on response orthogonality.
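The notion of response orthogonality, and the claim that a profile can be manufactured through item selection, can be illustrated with a small sketch. Everything here is a toy under stated assumptions: items are reduced to a keying sign (+1 if agreeing indicates more of the trait, -1 if reverse-keyed), the bias direction per item is given as input rather than estimated (the abstract does not say how the paper estimates it), and the 1-to-5 scale is hypothetical.

```python
def response_orthogonality(trait_directions, bias_directions):
    """Proportion of items whose trait and bias directions are opposite.

    Each direction is +1 or -1.
    """
    if len(trait_directions) != len(bias_directions):
        raise ValueError("one bias direction per item is required")
    opposite = sum(1 for t, b in zip(trait_directions, bias_directions)
                   if t * b < 0)
    return opposite / len(trait_directions)


def trait_score(responses, keys, scale_max=5):
    """Mean item score on a 1..scale_max scale with reverse keying."""
    scored = [r if k > 0 else (scale_max + 1 - r)
              for r, k in zip(responses, keys)]
    return sum(scored) / len(scored)


# A purely biased responder: always picks the top of the scale,
# whatever the item says.
always_top = [5, 5, 5, 5]
bias = [+1, +1, +1, +1]

balanced_keys = [+1, +1, -1, -1]    # half the items are reverse-keyed
one_sided_keys = [+1, +1, +1, +1]   # no reverse-keyed items

print(response_orthogonality(balanced_keys, bias))   # 0.5
print(trait_score(always_top, balanced_keys))        # 3.0
print(response_orthogonality(one_sided_keys, bias))  # 0.0
print(trait_score(always_top, one_sided_keys))       # 5.0
```

The same content-blind responder looks average on the balanced instrument and extreme on the one-sided instrument, so the apparent trait level depends on which items were chosen rather than on the responder.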

2606.20199 2026-06-19 cs.CV 新提交

Evaluation of Image Matching for Art Skills Assessment

艺术技能评估中的图像匹配评价

Asaad Alghamdi, Michael Poor, Trung-Nghia Le, Tam V. Nguyen

发表机构 * University of Dayton(代顿大学) University of Science, VNU-HCM(胡志明市国家大学理科大学) Vietnam National University, Ho Chi Minh City(胡志明市国家大学)

AI总结 提出通过手绘图像与模板匹配来评估绘画技能的方法,比较SIFT特征与孪生网络,发现SIFT关键点匹配更有效。

Comments MAPR 2024

AI中文摘要

虽然有些人天生具有绘画天赋,但掌握这项技能需要专门的训练和练习。确定一个人的绘画技能需要适当的全面评估。在本文中,我们提出了一种通过将手绘图像与原始模板匹配来衡量绘画技能的方法。现有技术通常涉及复杂的过程。然而,计算机视觉的进步使我们能够训练计算机以类似人类的水平进行这些比较,从而解决了繁琐且耗时的传统过程。使用计算机视觉应用,确定图像相似性涉及识别图像与参考图像的相似程度。我们实现并分析了SIFT特征和孪生网络来衡量图像相似性。我们的结果表明,评估艺术技能水平是可行的。通过特征分析,我们发现基于SIFT的关键点匹配为检测绘画技能提供了更有效的手段。

英文摘要

While some individuals possess a natural talent for drawing, mastering this skill requires dedicated training and practice. Determining a person's drawing skill requires a proper, comprehensive assessment. In this paper, we propose a method to measure drawing skill by matching the hand-drawn image with the original template. Existing techniques often involve complex processes. However, advancements in computer vision allow us to train computers to perform these comparisons at a human-like level, thereby replacing the tedious and time-consuming traditional process. In computer vision terms, determining image similarity means quantifying how similar an image is to a reference image. We have implemented and analyzed SIFT features and a Siamese network to measure image similarity. Our results indicate that it is feasible to assess art skill levels. Through feature analysis, we found that SIFT-based keypoint matching provides a more effective means of assessing drawing skill.

2606.20198 2026-06-19 cs.CL 新提交

Pitch Spelling Jazz Lead Sheets, Solo Transcriptions, Classical Piano and Monophonic Scores

爵士乐领谱、独奏转录、古典钢琴与单声部乐谱的音高拼写

Augustin Bouquillard, Florent Jacquemard

发表机构 * École polytechnique(巴黎综合理工学院) INRIA(法国国家信息与自动化研究所)

AI总结 提出一种音高拼写与调性估计算法,通过两阶段优化(模态与调性)联合估计音符名称、全局调号和每小节局部音阶,在多种数字乐谱数据集上验证有效性。

AI中文摘要

我们提出了一种用于音高拼写和调性估计的算法。给定MIDI格式的输入,包含音符音高(以半音表示,相对于最低参考音)和小节边界信息,该算法估计适当的音符名称、全局调号以及每小节的局部音阶。这些相关信息元素在两个优化阶段中联合评估。在初始的“模态”阶段,通过最短路径搜索为每个小节提出一个可能的音阶,以最小化印刷乐谱中的临时记号数量。然后,在称为“调性”的第二阶段,这些局部音阶被用于估计调号和音符名称,从而为整首作品生成最佳音乐记谱。我们在包含多种数字乐谱的数据集上进行了评估:来自《Real Book》的爵士领谱、爵士独奏和贝斯线的录音转录、传统曲调,以及钢琴和单声部乐器的古典乐谱。我们的程序最初设计用于音乐转录,特别是构建从音频录音转录的爵士独奏数字集合,用于音乐分析、教学和文化遗产保护。该方法也应有助于其他与音乐记谱处理相关的任务。此外,为此我们定义了各种常见爵士音阶之间的新距离,这可能对音乐学研究有一定意义。

英文摘要

We present an algorithm for pitch spelling and key estimation. Given an input in MIDI-like format, containing information on note pitches (expressed in semitones relative to the lowest reference note) and bar boundaries, it estimates the appropriate note names, a global Key Signature, and a local scale for each bar. These related pieces of information are evaluated jointly during two stages of optimisation. During an initial 'modal' stage, a shortest-path search proposes a probable scale for each bar, minimising the number of accidentals to be printed in the score. Then, during a second stage called 'tonal', these local scales are used to estimate the Key Signature and note names that would result in the best musical notation for the entire piece. We present evaluations conducted on datasets comprising a variety of digital musical scores: jazz lead sheets taken from the Real Book, transcriptions of recordings of jazz soli and bass lines, traditional tunes, as well as classical scores for piano and monophonic instruments. Our procedure was originally designed for use in music transcription, specifically for building digital collections of jazz solos transcribed from audio recordings, for the purposes of music analysis, teaching and the preservation of cultural heritage. This method should also prove useful for other tasks related to the processing of musical notation. Furthermore, to this end, we have defined new distances between various common jazz scales, which may be of some interest to musicological studies.

2606.20196 2026-06-19 cs.CV 新提交

Distill Once, Adapt Life-Long: Exploring Dataset Distillation for Continual Test-Time Adaptation

一次蒸馏,终身适应:探索数据集蒸馏用于持续测试时适应

Hyun-Kurl Jang, Jihun Kim, Hyeokjun Kweon, Kuk-Jin Yoon

发表机构 * KAIST, Visual Intelligence Lab(韩国科学技术院,视觉智能实验室) Chung-Ang University, FOV Lab(中央大学,FOV实验室)

AI总结 提出DO-ALL框架,通过数据集蒸馏生成紧凑的合成锚点,在持续测试时适应中提供稳定参考,无需保留原始源数据,提升长期鲁棒性。

Comments ECCV 2026

AI中文摘要

持续测试时适应(CTTA)旨在通过在线适应无标签数据,在目标域不断变化的情况下保持模型性能。然而,实际部署中由于隐私或许可限制,通常无法保留源数据集,而纯无源CTTA方法在长期分布偏移下容易变得不稳定,遭受累积的自训练错误和灾难性遗忘。我们提出DO-ALL(一次蒸馏,终身适应),一个即插即用的框架,通过数据集蒸馏(DD)以紧凑且保护隐私的形式重新利用源信息。在部署前,DO-ALL执行DD生成一小组合成蒸馏锚点,总结源分布。在适应过程中,每个目标样本与其语义最匹配的锚点对齐,该锚点通过源重放、表示对齐和流形平滑正则化为各种CTTA提供稳定参考。DO-ALL可以无缝集成到现有CTTA算法中,在CIFAR100-C、ImageNet-C和CCC基准测试中持续提升长期鲁棒性。这展示了利用DD在不保留原始源数据的情况下实现稳定连续适应的潜力。代码可在该https URL获取。

英文摘要

Continual Test-Time Adaptation (CTTA) aims to maintain model performance under evolving target domains by adapting online without labeled data. However, practical deployments often cannot retain the source dataset due to privacy or licensing constraints, and purely source-free CTTA methods tend to become unstable under long-term distribution shift, suffering from compounding self-training errors and catastrophic forgetting. We introduce DO-ALL (Distill Once, Adapt Life-Long), a plug-and-play framework that revisits source information in a compact and privacy-conscious form via Dataset Distillation (DD). Before deployment, DO-ALL performs DD to produce a small set of synthetic distilled anchors that summarize the source distribution. During adaptation, each target sample is matched with its most semantically aligned anchor, which provides a stable reference for various CTTA via source replay, representation alignment, and manifold-smoothing regularization. DO-ALL can be seamlessly integrated into existing CTTA algorithms, consistently improving long-term robustness across CIFAR100-C, ImageNet-C, and the CCC benchmark. This demonstrates the potential of leveraging DD to enable stable and continuous adaptation without retaining raw source data. The code is available at https://github.com/blue-531/DOALL.
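The step in which each target sample is matched with its most semantically aligned anchor can be sketched as a nearest-neighbour lookup in feature space. This is an assumption-laden illustration, not the DO-ALL implementation: the abstract does not say how alignment is measured, so cosine similarity is a guess, and the anchor names and feature vectors are made up.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)


def nearest_anchor(feature, anchors):
    """Return (anchor_name, similarity) for the best-matching anchor.

    anchors: dict mapping a hypothetical anchor name to its feature vector.
    """
    return max(((name, cosine(feature, vec)) for name, vec in anchors.items()),
               key=lambda pair: pair[1])


# Made-up anchor features standing in for distilled synthetic samples.
anchors = {
    "anchor_a": [1.0, 0.0],
    "anchor_b": [0.0, 1.0],
}

target_feature = [0.9, 0.1]
print(nearest_anchor(target_feature, anchors))
```

In a full pipeline the matched anchor would then feed the regularisers the abstract lists (source replay, representation alignment, manifold smoothing); those are outside the scope of this sketch.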

2606.20193 2026-06-19 cs.RO 新提交

Belt-Finger: An Affordable Soft Belt-Driven Gripper for Dexterous In-Hand Manipulation

Belt-Finger: 一种经济实惠的软带驱动夹爪,用于灵巧的手内操作

Boya Zhang, Andreas Zell, Georg Martius

发表机构 * University of Tübingen(图宾根大学) Max Planck Institute for Intelligent Systems(马克斯·普朗克智能系统研究所)

AI总结 提出一种双软带手指模块,为平行夹爪增加三个手内自由度(平移、俯仰、滚动),在保持低成本、易集成的同时提升灵巧操作能力,并通过MPC和遥操作验证其有效性。

AI中文摘要

平行夹爪是机器人中默认的操纵器选择,因为它们简单、坚固且廉价。然而,其有限的手内移动性常常迫使大幅度的臂部运动,并限制了在狭窄工作空间中的灵巧操作。我们提出了一种平行夹爪的升级方案:一种基于双软带的指模块,在保留标准开合功能的同时增加了三个手内自由度(DoF):平移、俯仰和滚动。该机制故意保持简单,并设计为经济制造和直接集成,保留了传统平行夹爪的可靠性和精确控制,同时大大拓宽了操作能力的范围。为了展示新增自由度的实用性,我们将该夹爪集成到两个控制流程中。首先,我们调整了一个模型预测控制器,用于已知物体的手内操作。其次,我们引入了一个轻量级遥操作接口,能够以最少的硬件同时控制机器人臂和夹爪(总共10个自由度)。通过遥操作、MPC和训练策略执行的一系列具有挑战性的操作任务,与传统的平行夹爪相比,所提出的夹爪在灵巧性和任务可行性上持续改进。

英文摘要

Parallel-jaw grippers are the default manipulator choice in robotics because they are simple, robust, and inexpensive. Their limited in-hand mobility, however, often forces large arm motions and restricts dexterous manipulation in confined workspaces. We present a parallel-gripper upgrade: a double-soft-belt-based finger module that preserves standard opening/closing while adding three in-hand degrees of freedom (DoF): translation, pitch, and roll. The mechanism is deliberately kept simple and engineered for inexpensive manufacturing and straightforward integration, preserving the reliability and precise control of traditional parallel grippers while greatly broadening the range of manipulation capabilities. To demonstrate the utility of the added DoFs, we integrate the gripper in two control pipelines. First, we adapt a model predictive controller for in-hand manipulation of known objects. Second, we introduce a lightweight teleoperation interface that enables simultaneous control of the robot arm and gripper (10 DoFs total) with minimal hardware. Across a suite of challenging manipulation tasks executed via teleoperation, MPC, and trained policies, the proposed gripper consistently improves dexterity and task feasibility compared to a conventional parallel gripper.

2606.20189 2026-06-19 cs.CV cs.AI cs.RO 新提交

HilDA: Hierarchical Distillation with Diffusion for Advancing Self-Supervised LiDAR Pre-training

HilDA:利用扩散的分层蒸馏推进自监督LiDAR预训练

Maciej Wozniak, Jesper Ericsson, Hariprasath Govindarajan, Truls Nyberg, Thomas Gustafsson, Patric Jensfelt, Olov Andersson

发表机构 * KTH Royal Institute of Technology(瑞典皇家理工学院) Linköping University(林雪平大学) TRATON AB(TRATON公司) Qualcomm Auto Ltd Sweden Filial(高通汽车有限公司瑞典分公司)

AI总结 提出HilDA框架,通过分层蒸馏(多层蒸馏和全局上下文蒸馏)结合时间占用扩散目标,自监督预训练LiDAR骨干网络,在3D检测、场景流和语义占用预测任务上达到最先进水平。

Comments Accepted to ECCV 2026. Maciej and Jesper contributed equally

AI中文摘要

利用视觉基础模型(VFM)进行相机到LiDAR的知识蒸馏为解决真实世界自动驾驶中巨大的几何和运动多样性所需的标注数据稀缺问题提供了一种有前景的方案。然而,当前方法通常将VFM视为黑盒教师,仅依赖逐帧特征相似性。因此,它们未能充分利用教师的逐层语义结构和全局上下文,以及LiDAR序列中固有的丰富时空信息。我们提出HilDA,一个用于LiDAR骨干网络的自监督预训练框架,能更好地捕捉驾驶任务所需的语义“是什么”和几何“在哪里”。HilDA结合了分层蒸馏(包括用于渐进语义对齐的多层蒸馏和用于场景级语义的全局上下文蒸馏)与一个促进时空一致性的时间占用扩散目标。使用HilDA预训练的模型在跨模态蒸馏基准上取得了最先进的结果,并在3D目标检测、场景流和语义占用预测任务上优于通过先前蒸馏方法训练的模型。代码见:此 https URL。

Abstract

Leveraging Vision Foundation Models (VFMs) for camera-to-LiDAR knowledge distillation offers a promising solution to the scarcity of annotated data needed to represent the immense geometric and kinematic diversity of real-world autonomous driving (AD). However, current approaches typically treat VFMs as black-box teachers, relying exclusively on frame-wise feature similarity. Consequently, they do not fully exploit the teacher's layer-wise semantic structure and global context, or the rich spatiotemporal information inherent in LiDAR sequences. We propose HilDA, a self-supervised pretraining framework for LiDAR backbones that better captures the semantic "what" and geometric "where" needed for driving tasks. HilDA combines hierarchical distillation, comprising multi-layer distillation for progressive semantic alignment and global context distillation for scene-level semantics, with a temporal occupancy diffusion objective that promotes spatiotemporal consistency. Models pre-trained with HilDA achieve state-of-the-art results on cross-modal distillation benchmarks and outperform models trained via prior distillation approaches on 3D object detection, scene flow, and semantic occupancy prediction. Code available at: https://maxiuw.github.io/hilda.
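
As a rough illustration of what multi-layer distillation can look like, the sketch below averages a cosine distance between student and teacher features over several layers. It is not the HilDA implementation: the abstract does not specify the loss, so the cosine form, the per-layer weights, and the function names `cosine_distance` and `multi_layer_distillation_loss` are all assumptions made for this example. Features are plain Python lists to keep the block dependency-free.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def multi_layer_distillation_loss(student_layers, teacher_layers, weights=None):
    """Weighted mean of per-layer cosine distances.

    Hypothetical loss form: each entry of `student_layers` is matched with
    the teacher feature at the same index, and `weights` (default: uniform)
    sets how much each layer contributes.
    """
    if len(student_layers) != len(teacher_layers):
        raise ValueError("student and teacher must expose the same number of layers")
    if weights is None:
        weights = [1.0] * len(student_layers)
    total = sum(
        w * cosine_distance(s, t)
        for w, s, t in zip(weights, student_layers, teacher_layers)
    )
    return total / sum(weights)


# Two layers: the first matches the teacher, the second is orthogonal to it.
student = [[1.0, 0.0], [1.0, 0.0]]
teacher = [[1.0, 0.0], [0.0, 1.0]]
loss = multi_layer_distillation_loss(student, teacher)
```

In a real pipeline the features would be per-point or per-pixel tensors and the teacher side would be frozen; this sketch only shows how several layers can contribute to one scalar objective instead of a single final-layer similarity.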

2606.20177 2026-06-19 cs.CV cs.AI New submission

Evaluating and Enhancing Negation Comprehension in Remote Sensing MLLMs


Haochen Han, Jue Wang, Alex Jinpeng Wang, Fangming Liu

Affiliations * Peng Cheng Laboratory; Tsinghua University; Central South University

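AI summary Proposes the RS-Neg benchmark for evaluating negation comprehension in remote sensing MLLMs, and designs NeFo, a test-time learning method that significantly improves model performance using about 5% unlabeled samples.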

Comments Accepted to ECCV 2026


Abstract

Multimodal Large Language Models (MLLMs) have demonstrated remarkable success in various Remote Sensing (RS) tasks. However, their ability to comprehend negation remains underexplored, limiting deployment in real-world applications where models must explicitly identify what is false or absent, e.g., emergency responders need to locate non-flooded routes for evacuation. To comprehensively study this limitation, we introduce RS-Neg, the first benchmark to evaluate negation understanding across region-level to scene-level tasks. Specifically, we design an automated data generation pipeline for RS imagery, using LLMs to synthesize diverse negation queries, and introduce a dynamic visual focus module for verification. Our evaluation reveals that advanced RS MLLMs struggle with negation, exhibiting hallucinations and substantial performance degradation. To close this gap, we propose NeFo, a novel test-time learning method that explicitly incorporates the logical role of negation into model optimization. Remarkably, using about 5% unlabeled test samples, NeFo significantly improves the negation understanding of models and shows strong generalization to unseen tasks. Code and data will be released upon acceptance.
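
One simple way to quantify the "performance degradation" mentioned above is to compare accuracy on affirmative queries with accuracy on their negated counterparts. The sketch below does that on toy data. It is not the RS-Neg protocol: the abstract does not define its metrics, so the exact-match accuracy, the `negation_gap` name, and the example answers are assumptions for illustration only.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that exactly match their labels."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and equally long")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def negation_gap(affirmative, negated):
    """Accuracy drop from affirmative to negated queries.

    Each argument is a (predictions, labels) pair. A positive gap means the
    model does worse once the query is negated (hypothetical metric).
    """
    return accuracy(*affirmative) - accuracy(*negated)


# Toy answers: the model is right on every affirmative query but tends to
# answer "yes" on negated ones, as if it ignored the negation.
affirmative = (["yes", "no", "yes", "yes"], ["yes", "no", "yes", "yes"])
negated = (["yes", "yes", "no", "yes"], ["no", "yes", "no", "no"])
gap = negation_gap(affirmative, negated)
```

Pairing each negated query with an affirmative version of the same image keeps the comparison about negation itself and not about image difficulty.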