Journal of South China University of Technology (Natural Science Edition), 2025, Vol. 53, Issue (10): 14-28. doi: 10.12141/j.issn.1000-565X.240378

• Traffic Safety •

基于CTGAN的自动驾驶车辆交通事故关键诱因识别

张志清, 于晓正, 朱雷鹏, 孙玉凤, 李祎昕   

  1. 北京工业大学 交通工程北京市重点实验室,北京 100124
  • 收稿日期:2024-07-20 出版日期:2025-10-25 发布日期:2025-05-06
  • 作者简介:张志清(1965—),男,博士,教授,主要从事道路安全研究。E-mail: zhangzhiqing@bjut.edu.cn
  • 基金资助:
    国家自然科学基金项目(52178403)

Identifying Key Causes of Accidents for Autonomous Vehicles Based on CTGAN

ZHANG Zhiqing, YU Xiaozheng, ZHU Leipeng, SUN Yufeng, LI Yixin   

  1. Beijing Key Laboratory of Traffic Engineering, Beijing University of Technology, Beijing 100124, China
  • Received: 2024-07-20 Online: 2025-10-25 Published: 2025-05-06
  • About author: ZHANG Zhiqing (born 1965), male, Ph.D., professor, mainly engaged in road safety research. E-mail: zhangzhiqing@bjut.edu.cn
  • Supported by:
    the National Natural Science Foundation of China (52178403)

摘要:

明晰自动驾驶车辆交通事故机理是有效防控安全风险的重要前提。自动驾驶车辆交通事故诱因分析通常基于小样本和不平衡数据进行建模,但这类模型对于少数类预测精度低。基于数据增强的分析框架可以提高模型对于少数类的预测精度。通过条件表格生成对抗网络(CTGAN)、联合生成对抗网络(CopulaGAN)以及合成少数过采样(SMOTE)、自适应过采样(ADASYN)技术增加样本量,平衡数据集,对比不同方法的合成数据质量;基于合成数据,对逻辑回归(LR)、决策树(DT)、随机森林(RF)、极端梯度提升(XGB)、支持向量机(SVM)5种分类算法进行评估,采用召回率、特异性、加权F1分数及曲线下面积(AUC)等指标确定最优组合;最后结合沙普利可加解释(SHAP)框架量化事故关键诱因重要度。结果表明:CTGAN生成数据的边际分布得分(0.96)和相关性得分(0.92)最高,合成数据的平均质量为0.94,显著优于其他方法;CTGAN与随机森林算法结合时,模型在召回率(0.82)、特异性(0.84)、AUC(0.86)等指标上均表现优异,在包含10%标签噪声的测试集中仍保持鲁棒性(召回率提升至0.88),进一步验证了其在复杂场景中的适用性。关键诱因分析表明,路面状况(潮湿状态显著增加受伤风险)、夜间行车(低光照导致传感器性能下降)、交叉口及街道化程度(复杂场景增加检测延迟)是导致事故的核心因素。该研究为自动驾驶测试场景搭建及道路基础设施改造提供了关键依据。

关键词: 自动驾驶车辆, 小样本量, 数据不平衡, 条件表格生成对抗网络, 事故预测

Abstract:

Clarifying the mechanisms of traffic accidents involving autonomous vehicles is an important prerequisite for effectively preventing and controlling safety risks. Accident causation analysis for autonomous vehicles is typically modeled on small-sample, imbalanced data, which leads to low predictive accuracy for minority classes. An analytical framework based on data augmentation can improve prediction accuracy for these classes. The sample size was increased and the dataset balanced using the conditional tabular generative adversarial network (CTGAN), the Copula generative adversarial network (CopulaGAN), the synthetic minority oversampling technique (SMOTE), and adaptive synthetic sampling (ADASYN), and the quality of the synthetic data produced by each method was compared. Based on the synthetic data, five classification algorithms were evaluated: logistic regression (LR), decision tree (DT), random forest (RF), extreme gradient boosting (XGB), and support vector machine (SVM). Recall, specificity, weighted F1 score, and area under the ROC curve (AUC) were used to determine the optimal combination. Finally, the Shapley additive explanations (SHAP) framework was used to quantify the importance of the key contributing factors to accidents. The results show that the data generated by CTGAN achieve the highest marginal distribution score (0.96) and correlation score (0.92), with an average synthetic data quality of 0.94, significantly better than the other methods. When CTGAN is combined with the random forest algorithm, the model performs well in recall (0.82), specificity (0.84), and AUC (0.86), and it remains robust on a test set containing 10% label noise (with recall rising to 0.88), further verifying its applicability in complex scenarios. The analysis of key contributing factors indicates that road surface condition (wet surfaces significantly increase the risk of injury), nighttime driving (low light degrades sensor performance), and intersections and roadway complexity (complex scenarios increase detection delays) are the core factors leading to accidents. This study provides a key basis for constructing autonomous driving test scenarios and upgrading road infrastructure.
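
For readers unfamiliar with the evaluation metrics named in the abstract, the following is a minimal, dependency-free sketch of how recall, specificity, weighted F1 score, and AUC can be computed for a binary classifier. It is an illustration only, not the authors' implementation: the function names (`binary_metrics`, `auc`), the toy labels and scores, the 0.5 decision threshold, and the reading of the positive class as the minority class are all assumptions made here.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return recall, specificity, and weighted F1 for binary labels.

    Hypothetical helper for illustration; `positive` marks the minority
    class (assumed here, not stated in the source).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)

    def safe_div(a, b):
        return a / b if b else 0.0

    recall = safe_div(tp, tp + fn)        # share of positives that are found
    specificity = safe_div(tn, tn + fp)   # share of negatives that are found

    # Per-class F1, then an average weighted by each class's support.
    f1_pos = safe_div(2 * tp, 2 * tp + fp + fn)
    f1_neg = safe_div(2 * tn, 2 * tn + fn + fp)
    n_pos, n_neg = tp + fn, tn + fp
    weighted_f1 = safe_div(f1_pos * n_pos + f1_neg * n_neg, n_pos + n_neg)

    return {"recall": recall, "specificity": specificity,
            "weighted_f1": weighted_f1}


def auc(y_true, scores, positive=1):
    """AUC as the probability that a random positive outranks a random
    negative (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        return 0.0
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (made up): 4 minority-class cases, 6 majority-class cases.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.1, 0.35, 0.6, 0.05]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

metrics = binary_metrics(y_true, y_pred)
auc_value = auc(y_true, scores)
```

On imbalanced data, recall and specificity are reported separately because plain accuracy can look high while the minority class is largely missed; weighting F1 by class support serves a similar purpose.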

Key words: autonomous vehicles, small sample size, data imbalance, conditional tabular generative adversarial network, accident prediction

CLC Number: