Autonomous Driving - arXivDaily Topic

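2606.19258 2026-06-18 cs.CV cs.RO New submission 90%

CABLE: Cloud-Assisted Bandwidth-efficient LMM-based Encoding for V2X Systems

Haohua Que, Zhipeng Bao, Qianyi Wu, Handong Yao

Affiliation: College of Engineering, University of Georgia

Topic match (perception): cloud-assisted, bandwidth-efficient perception framework for V2X systems

AI summary: Proposes the CABLE framework, in which the edge propagates cloud segmentation masks using ego-motion compensation and residual-motion cues to build a region of interest (ROI) and uploads only ROI-masked images, forming a mask-to-ROI-to-LMM feedback loop. Across five datasets it reduces ROI pixel coverage by 73-87% and gives an estimated 5-8x LMM prefill speedup.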

Abstract

Cloud-hosted large multimodal models (LMMs) can provide strong open-vocabulary perception for Vehicle-to-Everything systems, but naively transmitting full-resolution frames from edge to cloud causes severe communication overhead and high cloud-side prefill latency. We present CABLE, a cloud-assisted bandwidth-efficient LMM-based encoding framework for edge-cloud perception. CABLE propagates the previous cloud segmentation mask on the edge using ego-motion compensation, refines it with residual-motion cues, and consolidates disconnected regions via a corridor envelope to form a robust region of interest (ROI). Only ROI-masked images are uploaded, while the cloud segmentation output is fed back as the prior for the next frame, forming a mask-to-ROI-to-LMM feedback loop. Experiments on five datasets (nuScenes, WOD-ZB, Waymo, KITTI, and CADC) show consistent communication savings while largely preserving perception, achieving 73-87% ROI pixel-coverage reduction with 5-8x estimated LMM prefill speedup at a modest detection-quality trade-off relative to full-frame inference.
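To make the ROI-masked upload idea concrete, here is a minimal sketch that zeroes the pixels outside a binary ROI mask and reports the resulting pixel-coverage reduction. The function name `mask_frame`, the toy frame and the rectangular ROI are assumptions for illustration only; the abstract does not describe CABLE's mask propagation, corridor envelope or prefill-speedup estimate in enough detail to reproduce them here.

```python
import numpy as np

def mask_frame(frame: np.ndarray, roi: np.ndarray) -> tuple[np.ndarray, float]:
    """Zero out pixels outside the ROI.

    frame: H x W x C image, roi: H x W boolean mask (True = keep).
    Returns the masked frame and the pixel-coverage reduction in [0, 1].
    """
    masked = frame * roi[..., None].astype(frame.dtype)
    coverage = float(roi.mean())  # fraction of pixels that are kept
    return masked, 1.0 - coverage

# Toy example (hypothetical sizes): a 100 x 200 frame with a 40 x 100 ROI.
frame = np.full((100, 200, 3), 255, dtype=np.uint8)
roi = np.zeros((100, 200), dtype=bool)
roi[40:80, 50:150] = True
masked, reduction = mask_frame(frame, roi)
print(f"pixel-coverage reduction: {reduction:.0%}")
```

In this toy case the ROI keeps 4,000 of 20,000 pixels, so coverage drops by 80%, which happens to fall in the range the abstract reports.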


2606.18824 2026-06-18 cs.CV cs.LG New submission 90%

Where Will They Go? Modelling Multimodal Pedestrian Manoeuvres from Ego-centric Videos

Yuxuan Xie, Nicolas Pugeault, Chongfeng Wei, Hubert P. H. Shum, Edmond S. L. Ho

Affiliations: School of Computing Science, University of Glasgow; James Watt School of Engineering, University of Glasgow; Department of Computer Science, Durham University

Topic match (perception): pedestrian trajectory prediction from ego-centric video, for autonomous driving

AI summary: Proposes the MMPM framework, which uses a behavior-aware interaction module and a CVAE-based mode-aware trajectory predictor to model the crossing and non-crossing modes separately, improving the accuracy of multimodal trajectory prediction from an ego-centric view.

Comments: Accepted at The IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2026

Abstract

Pedestrian trajectory prediction from an ego-centric camera is challenging since it depends on complex interactions with vehicles and scene context, as well as the intention of the pedestrian. By modelling correlation and intent from the historical and future trajectories of the pedestrian, it will usually result in a multimodal (i.e. multiple modes) distribution. Existing stochastic predictors often sample multiple futures from a single unimodal distribution, which can yield sub-optimal 'mixed-mode' trajectories that lie between distinct motion patterns and become implausible in real scenes. In this paper, we propose MMPM, a mode-aware framework that separately models future trajectory distributions into semantically meaningful modes based on the pedestrian's crossing behavior. MMPM consists of two modules: behavior-aware Pedestrian Interaction Module (PIM) that jointly captures pedestrian-vehicle and pedestrian-environment interactions by introducing gaze, head and hand gesture, and a CVAE-based Mode-aware Trajectory Predictor (MTP) module to model the future trajectory distributions on two modes, crossing and non-crossing the road, separately. A query-based decoder further enforces mode consistency during decoding. Experiments on PIE and JAAD datasets show that our method surpasses state-of-the-art baselines. Our proposed MTP is model-agnostic, which can be integrated into existing frameworks such as BiTrap-NP and SGNet-ED to further improve future trajectory prediction performance. We additionally introduce a data-driven validation protocol that matches predictions to spatio-temporally consistent ground-truth trajectories, demonstrating improved frame-wise displacement errors over previous work.


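2603.11417 2026-06-18 cs.CV cs.LG Updated version 90%

Zero-Shot Cross-City Generalization in End-to-End Autonomous Driving: Self-Supervised versus Supervised Representations

Fatemeh Naeinian, Ali Hamza, Haoran Zhu, Anna Choromanska

Affiliation: Department of Electrical and Computer Engineering, NYU Tandon School of Engineering

Topic match (perception): study of cross-city generalization in end-to-end autonomous driving

AI summary: Studies how well end-to-end autonomous driving models generalize under zero-shot cross-city transfer, finding that self-supervised pretraining (e.g., I-JEPA, DINOv2, MAE) can substantially reduce displacement and collision degradation compared with supervised pretraining and improves out-of-distribution PDMS in closed-loop evaluation.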

Abstract

End-to-end autonomous driving models are typically trained on multi-city datasets using supervised ImageNet-pretrained backbones, yet their ability to generalize to unseen cities remains largely unexamined. When training and evaluation data are geographically mixed, models may implicitly rely on city-specific cues, masking failure modes that would occur under real-world domain shifts when generalizing to new locations. In this work, we formulate zero-shot cross-city transfer as a controlled representation-level stress test for end-to-end autonomous driving and ask how visual pretraining affects transfer behavior under geographic domain shift. We conduct a comprehensive study by integrating self-supervised backbones I-JEPA, DINOv2, and MAE into planning frameworks. We evaluate performance under strict geographic splits on nuScenes in the open-loop setting and on NAVSIM in the closed-loop evaluation protocol. Our experiments reveal a substantial generalization gap when transferring models across cities with different road topologies, traffic conventions, and visual environments. In open-loop evaluation, a supervised backbone exhibits severe degradation when transferring between cities, yet some domain-specific self-supervised methods can substantially reduce both displacement and collision degradation. In closed-loop evaluation, self-supervised pretraining improves average out-of-distribution PDMS in several single-city training settings. Our results provide empirical evidence that representation learning influences the robustness of cross-city planning and motivate zero-shot geographic transfer as an important stress test for evaluating end-to-end autonomous driving systems.


2606.18948 2026-06-18 cs.RO New submission 85%

C-ARC: Continuous-Adaptive Range Clustering for Non-Repetitive LiDAR Sensors

Nick B. Schroeder, Jonathan Lichtenfeld, Oskar von Stryk

Affiliations: Technical University of Darmstadt; Simulation, Systems Optimization and Robotics Group

Topic match (perception): real-time clustering framework for LiDAR point clouds, for autonomous driving perception

AI summary: Proposes the C-ARC framework, which decouples high-frequency point insertion from on-demand cluster retrieval via a persistent dual-graph over a sliding window and calibrates grid resolution adaptively with an exponential control loop, enabling real-time clustering of point clouds from non-repetitive LiDAR sensors.

Comments: Submitted to IEEE Robotics and Automation Letters. This work has been submitted to the IEEE for possible publication. 8 pages, 7 figures

Abstract

Real-time LiDAR clustering identifies structures in point clouds, which is an essential prerequisite for many mobile robotics algorithms. Current methods are mostly developed for repetitive mechanical LiDAR sensors. Recently, the use of non-repetitive LiDAR sensors is strongly increasing due to their small cost and form factor. Such non-repetitive Risley prism-based sensors violate two key assumptions of repetitive mechanical sensors: structured scan lines and well-defined frame boundaries. Their Rhodonea-curve trajectories produce non-uniform point distributions, and the absence of a rotation cycle renders conventional scan line indexing inapplicable. To meet such new requirements, we developed C-ARC, a Continuous-Adaptive Range Clustering framework that maintains a persistent dual-graph over a sliding window, decoupling high-frequency point insertion from on-demand cluster retrieval. This is crucial for key functionalities like SLAM or tracking. An adaptive range grid resolution mechanism calibrates grid dimensions at initialization using an exponential control loop, balancing the sparsity-collision trade-off without prior knowledge of the scanning pattern. Implemented as an open-sourced single-threaded C++17 library, C-ARC produces real-time cluster output at 20 Hz on commodity hardware for the Livox Mid-360. Evaluation on the Livox Avia identifies unbounded cell occupancy as the primary limitation for sensors with strongly concentrated scan patterns. The adaptive resolution mechanism additionally improves clustering quality for existing grid-based methods on non-repetitive data.


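2606.18864 2026-06-18 cs.LG cs.AI New submission 85%

Scaling Learning-based AEB with Massive Unlabeled Data

Xiangyu Wang, Yang Zhan, Mengxiang Hao, Chuanchuan Zhong, Yansong Jia, Junjie Zhang, Yu Han, Xin Jiang, Zhen Cao, Ying Wang, Yulun Song, Zhitao Xu

Affiliation: Li Auto

Topic match (perception): automatic emergency braking, with semi-supervised learning to improve performance

AI summary: Proposes a stabilized meta-feedback semi-supervised learning framework that uses noise-aware decoupling and kinematics-gated pseudo-labeling to exploit massive unlabeled data for automatic emergency braking, achieving a positive-to-false activation ratio above 100:1 and a 35% improvement in accident-free driving mileage.

Comments: Accepted for presentation at the 2026 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)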

Abstract

This paper studies how to scale learning-based automatic emergency braking (AEB) with massive unlabeled fleet data under production constraints. Our approach is based on meta-feedback semi-supervised learning (MF-SSL), where a teacher generates pseudo labels for unlabeled driving data and is updated using a small labeled anchor set as safety-critical feedback. In production, anchor ambiguity and labeled-unlabeled mismatch can amplify systematic pseudo-label errors, leading to spurious triggers. We propose a stabilized MF-SSL framework with (i) Noise-Aware Decoupling, which removes ambiguity-prone anchors from the teacher's supervised update path, and (ii) kinematics-gated pseudo-labeling with a teacher conflict penalty to suppress mismatch-induced risk hallucinations on unlabeled data while maintaining broad coverage. Extensive experiments show consistent gains as unlabeled data scale from 1M to 1B windows, improving safety while keeping comfort stable. The 1B-trained student model is deployed to hundreds of thousands of vehicles and validated over $10^9$ km of driving, achieving a positive-to-false activation ratio exceeding 100:1 and a 35% improvement in accident-free driving mileage over a production rule-only baseline.


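2606.18687 2026-06-18 cs.CV cs.RO New submission 85%

Spatially Stratified Distillation for Heterogeneous Radar Place Recognition

Sagun Singh Shrestha, Samuel Harding, Abdelwahed Khamis, Saimunur Rahman, Peyman Moghadam

Affiliations: CSIRO Robotics; University of Queensland

Topic match (perception): radar place recognition, for autonomous driving

AI summary: For heterogeneous place recognition between 4D automotive radars and dense spinning radars, proposes spatially stratified distillation (SSD), an asymmetric spatial alignment based on physical radar returns that enforces feature alignment in overlapping regions and down-weights distillation in sparse regions, reaching state-of-the-art performance on the HeRCULES dataset.

Comments: IEEE ICRA Workshop on Open Challenges for Rigorous Robot Perception 2026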

Abstract

Scalable, all-weather place recognition increasingly relies on heterogeneous radar place recognition to bridge diverse hardware platforms. A notable application is matching queries from cost-effective 4D automotive radars against high-fidelity reference maps built by dense spinning radars. This process is fundamentally limited by the extreme sparsity (and narrow field-of-view) of the 4D sensor, which captures only a fraction of the structural density present in the spinning radar database. Prior efforts address this issue by unifying different radar signals. That is, projecting both signals into a common representational space. Yet, they suffer performance degradation in multi-session environments. In this paper, we propose spatially-stratified distillation (SSD); a strategy that replaces standard uniform distillation with an asymmetric spatial alignment derived directly from physical radar returns. In regions where both radars exhibit overlapping returns, SSD enforces strong feature alignment. Crucially, in sparse regions where the 4D student lacks returns but the teacher contains valid structure within the shared field of view, SSD applies heavily discounted distillation weights. Extensive evaluations of the recent HeRCULES dataset demonstrate that SSD significantly outperforms prior place recognition methods, achieving state-of-the-art results on its challenging dynamic sequences.
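The weighting rule described in the abstract can be sketched as a per-cell weight map: full weight where student and teacher both have returns, a heavily discounted weight where only the teacher has returns inside the shared field of view, and zero elsewhere. This is a minimal illustration under assumptions: the function names, the grid representation, the squared-error loss and the discount value `sparse_weight=0.1` are all hypothetical and not taken from the paper.

```python
import numpy as np

def stratified_weights(student_hits, teacher_hits, shared_fov, sparse_weight=0.1):
    """Per-cell distillation weights on a 2D grid (all inputs are H x W masks).

    1.0           where both radars have returns,
    sparse_weight where only the teacher has returns inside the shared FoV
                  (the value 0.1 is an assumed discount),
    0.0           everywhere else.
    """
    s = np.asarray(student_hits, dtype=bool)
    t = np.asarray(teacher_hits, dtype=bool)
    f = np.asarray(shared_fov, dtype=bool)
    overlap = s & t
    teacher_only = ~s & t & f
    return overlap.astype(float) + sparse_weight * teacher_only.astype(float)

def weighted_distillation_loss(student_feat, teacher_feat, weights):
    """Weighted mean of the per-cell squared feature error (H x W x D features)."""
    per_cell = ((student_feat - teacher_feat) ** 2).mean(axis=-1)
    total = weights.sum()
    return float((weights * per_cell).sum() / total) if total > 0 else 0.0

# Toy 2 x 2 grid: one overlapping cell, one teacher-only cell inside the FoV,
# one teacher-only cell outside the FoV, one empty cell.
w = stratified_weights([[1, 0], [0, 0]], [[1, 1], [1, 0]], [[1, 1], [0, 1]])
loss = weighted_distillation_loss(np.zeros((2, 2, 4)), np.ones((2, 2, 4)), w)
```

With these inputs the weight map is `[[1.0, 0.1], [0.0, 0.0]]`, so the teacher-only cell outside the shared field of view contributes nothing to the loss.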


2606.19190 2026-06-18 cs.RO New submission 80%

FAST-LIVGO: A Degeneracy-Robust LiDAR-Inertial-Visual-GNSS Fusion Odometry

Zhiyu Chen, Chunran Zheng, Jiayu Wen, XiaoLei Zhang, Jiaming Xu, Feng Pan, Yukang Cui

Affiliations: College of Mechatronics and Control Engineering, Shenzhen University; Department of Mechanical Engineering, The University of Hong Kong; College of Automation, Harbin Engineering University

Topic match (perception): LiDAR-inertial-visual-GNSS fusion odometry, for autonomous driving

AI summary: Proposes a tightly coupled LiDAR-inertial-visual-GNSS fusion framework based on an error-state iterated Kalman filter. A spatiotemporal alignment module using dynamic time warping, observation models based on Doppler shifts and time-differenced carrier phase, and a degeneracy-aware dual-mode outlier rejection strategy yield accurate, robust state estimation in long-term, large-scale dynamic environments.

Comments: Accepted for presentation at the 2026 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2026)

Abstract

Robust state estimation and mapping in long-term, large-scale, and highly dynamic environments remains a key challenge in robotics. Existing LiDAR-Inertial-Visual Odometry (LIVO) systems achieve strong local accuracy but suffer from accumulated drift over long distances and may fail in geometrically degraded or textureless scenes. Meanwhile, GNSS-aided fusion frameworks often rely on LiDAR or visual odometry for state prediction and outlier rejection, making them vulnerable when odometry degenerates. To address these limitations, we propose a tightly coupled LiDAR-Inertial-Visual-GNSS fusion framework based on an Error-State Iterated Kalman Filter. An online spatiotemporal alignment module using Dynamic Time Warping is introduced for highly dynamic conditions. To better exploit GNSS precision, we develop observation models based on Doppler shifts and fixed-anchor Time-Differenced Carrier Phase, providing millimeter-level relative constraints without augmenting historical anchor states. We further design a degeneracy-aware dual-mode outlier rejection strategy that switches between LIVO-prior-guided rejection and GNSS-aided recovery according to the LIVO degeneracy level. Experiments on the public M3DGR dataset and a custom 20 m/s fixed-wing UAV dataset demonstrate that our system reduces accumulated drift and map ghosting, outperforming state-of-the-art methods in accuracy and robustness.
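The abstract names Dynamic Time Warping as the basis of the alignment module without saying how it is applied. For readers unfamiliar with the technique, the following is a sketch of the textbook DTW distance between two 1-D sequences (for example, speed profiles from two sensors sampled with a time offset). The function name and the toy signals are assumptions; this is not the paper's alignment module.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic dynamic-time-warping cost between two 1-D sequences.

    cost[i, j] is the minimal accumulated |a - b| cost of aligning
    the first i samples of a with the first j samples of b.
    """
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: insertion, deletion, or match.
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[n, m])

# A delayed copy of the same signal aligns at zero cost.
delayed = dtw_distance([0, 1, 2, 3], [0, 0, 1, 2, 3])
# A signal shifted in value cannot be aligned for free.
shifted = dtw_distance([0, 1, 2], [1, 2, 3])
```

The first pair differs only by a repeated sample, so warping absorbs the delay entirely; the second pair differs in value, so a residual cost remains.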


2606.18599 2026-06-18 cs.CR cs.AI New submission 80%

MIDS: Detecting Stealthy Masquerade and Tampering Attacks on CAN Bus via Bidirectional Mamba

Qiqi Liu, Runhan Song, Lei Cui, Heng Zhang, Yuyan Sun, Limin Sun

Affiliations: Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences; Zhongguancun Laboratory

Topic match (perception): proposes MIDS to detect CAN bus attacks and protect vehicle security

AI summary: Addressing the CAN bus's exposure to attack due to its lack of encryption and authentication, proposes MIDS, a dual-stream framework that uses bidirectional state-space models to process identifiers and payloads in parallel. On a Tesla Model 3 dataset it reaches an F1 of 96.94%, more than 8 percentage points above the strongest baseline.

Abstract

The Controller Area Network (CAN) protocol is the primary communication standard for Electronic Control Units (ECUs) in modern vehicles, but its lack of encryption and authentication exposes it to a range of security threats. Existing intrusion detection systems are largely tuned to fabrication-style attacks (DoS, fuzzing, ID spoofing realised by frame injection), in which detection signals such as per-ID inter-arrival statistics are readily available. We instead address the harder masquerade setting, in which an internal adversary substitutes a legitimate frame in-situ at its original transmission slot, preserving traffic periodicity and rendering traffic-statistic defences ineffective. We propose the Mamba Intrusion Detection System (MIDS), an innovative dual-stream framework that processes CAN identifiers and payloads in parallel and reconstructs their joint temporal semantics through bidirectional selective state-space modelling. To evaluate MIDS, we collected over 100 million CAN frames from a physical Tesla Model 3 across three driving regimes and synthesised 54 masquerade attack variants spanning ID-only, data-only, and combined modifications. MIDS attains an F1 of 96.94% on this dataset, exceeding the strongest reproducible baseline by more than 8 percentage points, while sustaining a 1.147 ms single-window inference latency, ample headroom for real-time onboard deployment. To verify generalisation, we further evaluate MIDS on four public benchmarks (ROAD, CrySyS, OTIDS, CT&T) covering both masquerade and injection scenarios; MIDS attains F1 from 93.70% to 99.61%, outperforming the strongest of eight reproduced baselines by up to 13.94 percentage points under a unified 5-fold protocol.


2602.04401 2026-06-18 cs.RO cs.CV Updated version 80%

Quantile Transfer for Reliable Operating Point Selection in Visual Place Recognition

Dhyey Manish Rajani, Michael Milford, Tobias Fischer

Affiliations: QUT Centre for Robotics; School of Electrical Engineering and Robotics; Queensland University of Technology

Topic match (perception): operating point selection for visual place recognition

AI summary: Proposes transferring thresholds via quantile normalisation to select the operating point of a visual place recognition system automatically, maximising recall at 100% precision without manual tuning.

Comments: Accepted to the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2026

Abstract

Visual Place Recognition (VPR) is a key component for localisation in Global Navigation Satellite System (GNSS)-denied environments, but its performance critically depends on selecting an image matching threshold (operating point) that balances precision and recall. Thresholds are typically hand-tuned offline for a specific environment and fixed during deployment, leading to degraded performance under environmental change. We propose a method that automatically selects the operating point of a VPR system to maximise recall at 100% precision. The method uses a small calibration traversal with known correspondences and transfers thresholds to deployment via quantile normalisation of similarity score distributions. This quantile transfer ensures that thresholds remain stable across calibration sizes and query subsets. Experiments with seven state-of-the-art VPR techniques across five benchmark datasets demonstrate that our proposed approach consistently outperforms existing baselines, enabling the underlying VPR technique to operate at 100% precision in approximately twice as many deployment scenarios (median improvement), while retrieving up to 29% more correct matches at that precision. The method eliminates manual tuning by adapting to new environments and generalising across operating conditions. Our code is available at https://github.com/DhyeyR-007/Quantile-Transfer-for-Reliable-VPR.
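One plausible reading of the quantile-transfer idea is sketched below: on the calibration traversal, find the strictest threshold that rejects every incorrect match, record which quantile of the calibration score distribution it sits at, then read off the same quantile of the deployment score distribution. The function names and toy scores are assumptions, and the authors' released code may differ in the details.

```python
import numpy as np

def calibrate_quantile(scores, is_correct):
    """Quantile of the 100%-precision threshold within the calibration scores.

    The threshold is the highest similarity score of any incorrect match;
    only matches scoring strictly above it would be accepted.
    """
    scores = np.asarray(scores, dtype=float)
    is_correct = np.asarray(is_correct, dtype=bool)
    wrong = scores[~is_correct]
    threshold = wrong.max() if wrong.size else -np.inf
    # Fraction of calibration scores at or below the threshold.
    return float((scores <= threshold).mean())

def transfer_threshold(deploy_scores, q):
    """Threshold at the same quantile of the deployment score distribution."""
    return float(np.quantile(np.asarray(deploy_scores, dtype=float), q))

# Toy calibration traversal with known correspondences.
q = calibrate_quantile([0.9, 0.8, 0.6, 0.5, 0.4], [True, True, False, True, False])
# Deployment scores live on a different scale; the quantile carries over.
deploy_threshold = transfer_threshold([0.1, 0.2, 0.3, 0.4, 0.5], q)
```

Here the calibration threshold 0.6 sits at the 0.6 quantile, which maps to 0.34 on the deployment scores. Working in quantiles rather than raw scores is what lets the threshold follow a shift in the score distribution.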


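2606.19307 2026-06-18 cs.RO New submission 70%

Observability and Consistency Analysis for Visual-Inertial Navigation with Anchored Feature Parameterizations

Mitchell Cohen, Vassili Korotkine, James Richard Forbes

Affiliation: Department of Mechanical Engineering, McGill University

Topic match (perception): observability and consistency of visual-inertial navigation

AI summary: Analyzes the observability and consistency of filtering-based visual-inertial navigation systems (VINS) that use anchored feature representations, showing that the unobservable subspace is independent of the estimated landmark state, which improves consistency, but still depends on the navigation state, so additional consistency-enforcing techniques are needed.

Comments: Accepted to IEEE/RSJ IROS. 8 pages, 3 figures, 4 tables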

Abstract

This paper presents an analysis of the observability and consistency properties of filtering-based visual-inertial navigation systems (VINS) that utilize anchored feature representations. The unobservable subspace of VINS with anchored landmark parameterizations is shown to be independent of the estimated landmark state, which leads to improved estimator consistency properties without any additional modifications. However, the unobservable subspace is still found to depend on the estimated navigation state, necessitating additional consistency-enforcing techniques. Two methods to improve the consistency of VINS with anchored feature representations are presented. Simulation results showcase that all estimators employing anchored feature parameterizations exhibit improved consistency properties compared to algorithms that estimate features resolved in a global reference frame, especially in scenarios where feature initialization may be poor. Real-world experiments on the TUM-VI dataset showcase that the use of anchored feature representations alone can yield comparable performance to consistency-improved estimators employing a global feature representation, demonstrating the benefit of using anchored feature parameterizations for VINS.


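2606.19176 2026-06-18 cs.RO cs.AI cs.SY eess.SY New submission 70%

Hardware- and Vision-in-the-Loop Validation of Deep Monocular Pose Estimation for Autonomous Maritime UAV Flight

Maneesha Wickramasuriya, Beomyeol Yu, Jaden Shin, Mason Huslig, Taeyoung Lee, Murray Snyder

Affiliation: Mechanical and Aerospace Engineering, George Washington University

Topic match (perception): monocular pose estimation for UAVs, related to autonomous driving perception

AI summary: Proposes a hardware-validated vision-in-the-loop framework that combines a deep transformer-based monocular pose estimator with a delayed Kalman filter to achieve autonomous indoor flight while emulating photorealistic maritime environments, capturing embedded effects such as perception latency.

Comments: 6 pages, 9 figures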

Abstract

Autonomous UAV operations on ships require reliable vision-based relative pose estimation, yet at-sea validation is costly, weather-dependent, and risky. This paper presents a hardware-validated vision-in-the-loop framework that enables fully autonomous indoor flight while emulating photorealistic maritime environments. Rendered maritime views are processed onboard by a deep transformer-based monocular pose estimator. Delayed vision measurements are fused with high-rate IMU data using a delayed Kalman filter to provide consistent state estimates for geometric control. The system captures critical embedded effects, including perception latency, asynchronous updates, and computational constraints, that are absent in pure simulation. Autonomous takeoff, trajectory tracking, and landing experiments demonstrate stable closed-loop flight. The results establish a safe and hardware-realistic intermediate stage for developing maritime UAV autonomy prior to shipboard deployment.


2606.19154 2026-06-18 cs.RO New submission 70%

Viking Hill Dataset: A Lidar-Radar-Camera Dataset for Detection and Segmentation in Forest Scenes

Vladimír Kubelka, Oleksandr Kotlyar, Unal Artan, Martin Magnusson

Affiliations: Örebro University; AASS research centre; Robot Navigation and Perception Lab

Topic match (perception): multi-sensor dataset for forest scene perception, analogous to autonomous driving

AI summary: Presents the first multi-sensor forest dataset to include 4D imaging radar, provides MinkowskiUNet baselines for semantic segmentation of radar and lidar point clouds, and evaluates how trunk segmentation quality varies with tree size.

Comments: 33 pages, 11 figures

Abstract

Autonomous robots operating under forest canopies need robust perception of trees and surrounding vegetation across varying seasonal conditions. Existing forestry datasets provide lidar or camera data with per-tree annotations, but none include co-registered 4D imaging radar -- a modality of growing interest for its resilience to visual degradation, surface contamination, and vegetation occlusion. We introduce a multi-sensor forest dataset collected by a mobile robot equipped with a high-resolution FMCW imaging radar, lidar, RGB camera, IMU, and RTK-GNSS. The site was recorded in two sessions under contrasting vegetation states, and 3D cuboid annotations -- including per-tree diameter estimates -- provide shared semantic labels across all three perception modalities. Furthermore, we provide baseline results for semantic segmentation of the radar and lidar point clouds using MinkowskiUNet. Radar achieves IoU scores competitive with lidar for dominant classes (ground 91%, canopy 86%) while lagging on geometrically fine structures such as tree trunks (56% vs. 74%). A cross-modality analysis further compares lidar and radar trunk segmentation against an RGB detection model, and a diameter-stratified evaluation reveals how trunk segmentation quality varies with tree size. Beyond segmentation, the co-registered multi-modal data and RTK-GNSS-aided reference positioning support research in mapping, localization, and sensor fusion under canopy. The dataset and annotation tools are publicly available.
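The per-class IoU scores quoted above follow the standard intersection-over-union definition for semantic segmentation. As a reminder of how such numbers are computed from per-point labels, here is a minimal sketch; the function name, the class indices and the toy labels are assumptions and have no connection to the dataset's actual label set or evaluation code.

```python
import numpy as np

def per_class_iou(pred, target, num_classes):
    """Intersection-over-union for each class from per-point label arrays.

    Returns a dict class_id -> IoU; classes absent from both arrays map to NaN.
    """
    pred = np.asarray(pred)
    target = np.asarray(target)
    ious = {}
    for c in range(num_classes):
        intersection = np.sum((pred == c) & (target == c))
        union = np.sum((pred == c) | (target == c))
        ious[c] = float(intersection / union) if union else float("nan")
    return ious

# Toy example with three hypothetical classes (0, 1, 2) over five points.
ious = per_class_iou([0, 0, 1, 1, 2], [0, 1, 1, 1, 2], num_classes=3)
```

Thin structures such as trunks cover few points, so a handful of mislabeled points lowers their IoU much more than it would for ground or canopy.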


2606.17030 2026-06-18 cs.CV New submission 70%

Qwen-RobotWorld Technical Report: Unifying Embodied World Modeling through Language-Conditioned Video Generation

Jie Zhang, Xiaoyue Chen, Anzhe Chen, Dayiheng Liu, Deqing Li, Gengze Zhou, Hale Yin, Haoqi Yuan, Haoyang Li, Jiahao Li, Jiazhao Zhang, Jingren Zhou, Kaiyuan Gao, Kun Yan, Lihan Jiang, Ningyuan Tang, Pei Lin, Qihang Peng, Shengming Yin, Tianhe Wu, Tianyi Yan, Xiao Xu, Yan Shu, Yanran Zhang, Ye Wang, Yi Wang, Yilei Chen, Yixian Xu, Yiyang Huang, Yuxiang Chen, Zekai Zhang, Zhendong Wang, Zixing Lei, Zhixuan Liang, Zihao Liu, Zikai Zhou, Chenxu Lv, Xiong-Hui Chen, Chenfei Wu

Affiliation: Qwen Team

Topic match (perception): predicting future visual trajectories for autonomous driving scenes

AI summary: Proposes Qwen-RobotWorld, a language-conditioned video world model that uses natural language as a unified action interface. With a double-stream MMDiT, a large-scale Embodied World Knowledge corpus, and a progressive training curriculum, it predicts physically consistent future visual trajectories for tasks such as robotic manipulation and autonomous driving and achieves top results on several benchmarks.

Abstract

We introduce Qwen-RobotWorld, a language-conditioned video world model for embodied intelligence. With natural language as a unified action interface, it predicts physically grounded future visual trajectories from current observations across robotic manipulation, autonomous driving, indoor navigation, and human-to-robot transfer. This unified formulation provides three promising application directions: synthetic data generation for policy training augmentation, scalable virtual environments for policy evaluation, and language-guided planning signals for downstream robot control. This is achieved through a three-part design: a) Double-Stream MMDiT with MLLM Action Encoding, where a 60-layer double-stream diffusion transformer couples frozen Qwen2.5-VL semantics with video-VAE latents through layer-wise joint attention; b) Embodied World Knowledge (EWK), an 8.6M video-text corpus (200M+ frames) with action-language mapping over 20+ embodiments and 500+ action categories; and c) General+Expert Progressive Curriculum, a two-stage training strategy that first learns general visual priors and then injects embodied specialization under a shared language interface. Extensive results show strong competitiveness: ranks 1st overall on EWMBench and DreamGen Bench, outperforms all open-source models on WorldModelBench and PBench. Additional zero-shot analyses on RoboTwin-IF benchmark further support robust generalization and multi-view consistency.


2606.18841 2026-06-18 cs.CV New submission 60%

Rethinking Air-Ground Collaboration: A Progressive Cross-Task Benchmark and Socialized Learning Framework

Zhoupeng Guo, Yunqi Zhu, Zhihe Fan, Xinjie Yao, Ruipu Zhao, Boan Tao, Yiming Sun, Zhen Wang, Pengfei Zhu

Affiliations: School of Automation, Southeast University; School of Computer Science and Engineering, University of New South Wales; School of Sports Training, Tianjin University of Sport; Faculty of Information Engineering and Automation, Kunming University of Science and Technology; School of Artificial Intelligence, Tianjin University; School of Artificial Intelligence, Hebei University of Technology

Topic match (perception): air-ground collaborative perception, applicable to autonomous driving

AI summary: Proposes the Air-Ground Progressive Collaboration (AGPC) benchmark and the Socialized Co-Perception (SCP) framework, whose Dual-Layer Router enables selective cross-view and cross-task interaction, improving average downstream performance by 7.86% in heterogeneous air-ground perception.

Abstract

Air-ground collaborative perception is crucial for robust visual understanding in real-world dynamic environments. However, existing studies typically formulate collaboration as single-task cross-view fusion, overlooking the functional dependencies among localization, target association, and fine-grained parsing. In addition, the heterogeneous nature of aerial and ground views introduces substantial geometric, scale, and occlusion discrepancies, making uniform feature sharing vulnerable to negative transfer. To tackle these issues, we model air-ground perception as a progressive cross-task collaboration task and construct the Air-Ground Progressive Collaboration (AGPC) benchmark, a spatio-temporally aligned benchmark comprising more than 745K raw video frames. Built upon this benchmark, we propose Socialized Co-Perception (SCP), a coarse-to-fine framework that organizes collaboration progressively from aerial global localization to ground target association and identity-aware parsing. Its core module, the Dual-Layer Router (DLR), decouples input-side multi-scale expert selection from output-side task-conditioned modulation, enabling selective cross-view and cross-task interaction while suppressing harmful interference. Extensive experiments demonstrate the effectiveness of SCP. It achieves a 3.73% coevolutionary gain and a 7.86% improvement in average downstream performance. These results show that task-conditioned collaboration is more effective than uniform fusion for heterogeneous air-ground perception. The code is available at https://github.com/g1136639260-spec/AGSCP.
