A Systematic Approach for Selecting Trajectories for Data Augmentation
Adam Nordling
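AI summary: Proposes a systematic framework for evaluating five trajectory selection strategies (Outlierness, Diversity, Representativeness, Uncertainty, and Random), tested on four datasets. It finds that the Outlierness and Uncertainty strategies improve stability on sparse data but may introduce noise on dense data.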
Comments: 39 pages, 4 figures, Master's project
Trajectory data augmentation is a promising approach to mitigating data scarcity in machine learning applications, but its utility has been limited by the difficulty of preserving spatio-temporal coherence. Although prior work demonstrated the viability of geometric perturbation, it relied on naive random selection, leaving a critical gap in understanding which trajectories should be augmented for maximal benefit. This thesis addresses that gap by developing a systematic and scalable framework to evaluate five selection strategies: Outlierness, Diversity, Representativeness, Uncertainty, and Random selection (the baseline). These strategies were tested across four datasets covering animal behavior (Foxes and Starkey), maritime traffic (AIS), and urban traffic (Car), using a suite of linear and non-linear machine learning models. As part of this evaluation, an Optuna-based hyperparameter optimization loop was integrated to empirically identify the best-performing augmentation parameters for each dataset within the explored search space. The results indicate that, while systematic selection is not a universal solution, it offers distinct advantages over the random baseline. Systematic strategies, particularly Outlierness and Uncertainty, demonstrated higher stability and were less prone to the performance degradation observed with random sampling in dense datasets. However, the findings also reveal that the value of augmentation is strictly conditional. Visual analysis via UMAP shows that while systematic augmentation successfully repairs topological fragmentation in sparse datasets, it can act as a corrupting noise signal in high-quality, dense datasets. Furthermore, the study identified physical limitations in high-velocity domains, where standard perturbation techniques lead to divergence in feature space...
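To make the contrast between systematic and random selection concrete, the following is a minimal sketch and not the thesis's implementation. It assumes each trajectory has already been summarized as a feature vector, uses distance from the feature centroid as a stand-in for an Outlierness score, and uses Gaussian point jitter as a stand-in for geometric perturbation. All function names, the scoring rule, and the value of `sigma` are hypothetical choices made for illustration.

```python
import numpy as np


def outlierness_scores(features):
    """Assumed proxy: Euclidean distance of each feature vector from the centroid."""
    centroid = features.mean(axis=0)
    return np.linalg.norm(features - centroid, axis=1)


def select_top_k(scores, k):
    """Indices of the k highest-scoring trajectories, highest first."""
    return np.argsort(scores)[::-1][:k]


def select_random(n, k, rng):
    """Random baseline: k distinct indices drawn uniformly from range(n)."""
    return rng.choice(n, size=k, replace=False)


def perturb(trajectory, sigma, rng):
    """Add zero-mean Gaussian noise to every (x, y) point of one trajectory."""
    return trajectory + rng.normal(0.0, sigma, size=trajectory.shape)


rng = np.random.default_rng(0)

# Toy data: four trajectories of ten (x, y) points, one feature vector each.
features = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]])
trajectories = [rng.random((10, 2)) for _ in range(len(features))]

# Systematic selection picks the trajectory farthest from the centroid.
chosen = select_top_k(outlierness_scores(features), k=1)
baseline = select_random(len(features), k=1, rng=rng)

# Augment only the selected trajectories.
augmented = [perturb(trajectories[i], sigma=0.01, rng=rng) for i in chosen]
```

In this toy setup the fourth feature vector lies far from the other three, so the systematic selector always picks it, while the random baseline picks any of the four with equal probability. A tuning loop such as the Optuna-based one described above would treat `k` and `sigma` as search parameters rather than fixed values.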