An Open-Source Two-Stage Computer Vision Pipeline for Fine-Grained Vehicle Classification using Vision Transformers
Gandhimathi Padmanaban, Fred Feng
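AI summary: Proposes a two-stage pipeline that combines an RT-DETR detector with a fine-tuned ViT-Base/16 for six-category vehicle body-type classification and adds a confidence-based abstention mechanism, reaching 0.94 accuracy on the in-distribution dataset and 0.89 on the out-of-distribution dataset.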
Details
- Comments: 24 pages, 10 figures, venue TBD
Vehicle body type is a significant determinant of cyclist injury severity in overtaking crashes, yet automated tools for classifying vehicles into injury-risk-relevant categories from naturalistic roadway video do not exist in the open literature. Standard object detection benchmarks provide only coarse vehicle labels (car, truck, bus, motorcycle), while existing fine-grained recognition systems are trained on controlled imagery and lack evaluation for deployment robustness across recording sites. This paper presents an open-source two-stage computer vision pipeline combining a pre-trained RT-DETR detector for coarse vehicle localization with a fine-tuned Vision Transformer (ViT-Base/16) for six-category body-type classification: passenger car, SUV, pickup truck, minivan, large van, and commercial truck. A confidence-based abstention mechanism withholds Stage 2 predictions when softmax output falls below 0.60, producing unknown labels rather than silent misclassifications. Evaluated on 3,805 annotated overtaking events from a bicycle-lane corridor in Ann Arbor, Michigan (in-distribution), the pipeline achieved 0.94 accuracy with per-class F1 scores from 0.91 (minivan) to 0.97 (SUV). On an independent out-of-distribution evaluation of 311 events from an open cycling dataset without retraining, accuracy was 0.89. Three of four well-represented categories maintained F1 at or above 0.90 under domain shift. The largest degradation was observed for minivan (F1 = 0.72), driven by abstention rate rising from 2.4% to 25.0% rather than active misclassification, consistent with the mechanism propagating genuine model uncertainty. The full pipeline, including inference scripts, training code, evaluation utilities, and model weights, is released as open-source software to support reproducibility and reuse across roadside video archives and cycling safety research.
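The abstention rule described in the abstract can be illustrated with a minimal sketch. It is not the authors' released code. It assumes that "softmax output" means the probability of the top-scoring class, that Stage 2 emits one logit per body-type category, and that the class order shown is arbitrary; the names `CLASSES`, `softmax`, and `classify_with_abstention` are hypothetical.

```python
import math

# Hypothetical class order; the abstract lists the six categories but not their indices.
CLASSES = [
    "passenger car", "SUV", "pickup truck",
    "minivan", "large van", "commercial truck",
]
THRESHOLD = 0.60  # abstention threshold stated in the abstract


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_with_abstention(logits, threshold=THRESHOLD):
    """Return (label, confidence); label is "unknown" below the threshold.

    Assumption: confidence is the top-class softmax probability.
    """
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] < threshold:
        return "unknown", probs[best]
    return CLASSES[best], probs[best]


# A peaked logit vector yields a class label; a flat one is withheld.
confident = classify_with_abstention([5.0, 0.0, 0.0, 0.0, 0.0, 0.0])
uncertain = classify_with_abstention([1.0, 0.9, 0.0, 0.0, 0.0, 0.0])
print(confident[0], uncertain[0])
```

Under this reading, an abstained event counts against a class's recall without adding a false positive to any other class. That is consistent with the abstract's account of the minivan F1 drop, which it attributes to more abstentions rather than active misclassification.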