Journal of South China University of Technology (Natural Science Edition), 2025, Vol. 53, Issue (10): 14-28. doi: 10.12141/j.issn.1000-565X.240378

• Traffic Safety •

Identifying Key Causes of Accidents for Autonomous Vehicles Based on CTGAN

ZHANG Zhiqing, YU Xiaozheng, ZHU Leipeng, SUN Yufeng, LI Yixin   

  1. Beijing Key Laboratory of Traffic Engineering, Beijing University of Technology, Beijing 100124, China
  • Received: 2024-07-20 Online: 2025-10-25 Published: 2025-05-06
  • About author: ZHANG Zhiqing (born 1965), male, Ph.D., professor, mainly engaged in road safety research. E-mail: zhangzhiqing@bjut.edu.cn
  • Supported by:
    the National Natural Science Foundation of China(52178403)

Abstract:

Clarifying the mechanism of traffic accidents involving autonomous vehicles is an important prerequisite for effectively preventing and controlling safety risks. Accident causation analysis for autonomous vehicles is typically modeled on few-shot, unbalanced data, which results in low predictive accuracy for under-represented classes. An analytical framework based on data augmentation can improve model prediction accuracy for minority classes. The sample size was increased and the dataset was balanced using techniques such as conditional tabular generative adversarial network (CTGAN), Copula generative adversarial network (CopulaGAN), synthetic minority oversampling technique (SMOTE), and adaptive synthetic sampling (ADASYN), and the quality of the synthetic data produced by each method was compared. Based on the synthetic data, five classification algorithms were evaluated: logistic regression (LR), decision tree (DT), random forest (RF), extreme gradient boosting (XGB), and support vector machine (SVM). Metrics such as recall, specificity, weighted F1 score, and area under the ROC curve (AUC) were used to determine the optimal combination. Finally, the Shapley additive explanations (SHAP) framework was used to quantify the importance of key contributing factors to accidents. The results show that the data generated by CTGAN have the highest marginal distribution score (0.96) and correlation score (0.92), giving an average synthetic data quality of 0.94, which is significantly better than the other methods. When CTGAN is combined with the random forest algorithm, the model achieves strong results on metrics such as recall (0.82), specificity (0.84), and AUC (0.86), and it remains robust on test sets containing 10% label noise (with recall increasing to 0.88), further verifying its applicability in complex scenarios.
The analysis of key contributing factors indicates that road surface conditions (wet conditions significantly increase the risk of injury), nighttime driving (low light causes reduced sensor performance), and intersection and roadway complexity levels (complex scenarios increase detection delays) are the core factors leading to accidents. This study provides a key basis for the construction of autonomous driving test scenarios and the renovation of road infrastructure.
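
For readers unfamiliar with the evaluation metrics named in the abstract, the following is a minimal sketch of how recall, specificity, and the weighted F1 score can be computed from true and predicted labels. It is not the authors' implementation: the function names, the toy labels, and the choice of 1 as the minority (for example, injury) class are assumptions made only for illustration, and AUC is omitted because it requires predicted scores rather than hard labels.

```python
from collections import Counter


def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, tn, fn), treating `positive` as the positive class."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive:
            if t == positive:
                tp += 1
            else:
                fp += 1
        else:
            if t == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def safe_div(num, den):
    """Divide, returning 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def recall(y_true, y_pred, positive=1):
    """Share of actual positives that the model identifies."""
    tp, _, _, fn = confusion_counts(y_true, y_pred, positive)
    return safe_div(tp, tp + fn)


def specificity(y_true, y_pred, positive=1):
    """Share of actual negatives that the model identifies."""
    _, fp, tn, _ = confusion_counts(y_true, y_pred, positive)
    return safe_div(tn, tn + fp)


def f1(y_true, y_pred, positive):
    """F1 score for one class: 2*TP / (2*TP + FP + FN)."""
    tp, fp, _, fn = confusion_counts(y_true, y_pred, positive)
    return safe_div(2 * tp, 2 * tp + fp + fn)


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = sum(f1(y_true, y_pred, c) * n for c, n in support.items())
    return safe_div(total, len(y_true))


# Hypothetical, unbalanced toy labels (1 = minority class); not data from the study.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 1, 0]

print(f"recall={recall(y_true, y_pred):.3f}")
print(f"specificity={specificity(y_true, y_pred):.3f}")
print(f"weighted_f1={weighted_f1(y_true, y_pred):.3f}")
```

Recall and specificity are reported separately because, on unbalanced data, overall accuracy can remain high even when most minority-class cases are missed. The support-weighted F1 summarizes per-class performance in proportion to class frequency.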

Key words: autonomous vehicles, few-shot, unbalanced data, conditional tabular generative adversarial network, accident prediction
