Identifying Key Causes of Accidents for Autonomous Vehicles Based on CTGAN

doi:10.12141/j.issn.1000-565X.240378

Abstract

Clarifying the mechanisms of traffic accidents involving autonomous vehicles is a prerequisite for effectively preventing and controlling safety risks. Causation analyses of such accidents are usually modeled on small, imbalanced datasets, and the resulting models predict the minority class poorly. An analytical framework based on data augmentation can improve minority-class prediction. In this study, the sample size was increased and the dataset balanced using the conditional tabular generative adversarial network (CTGAN), the copula generative adversarial network (CopulaGAN), the synthetic minority oversampling technique (SMOTE), and adaptive synthetic sampling (ADASYN), and the quality of the synthetic data produced by each method was compared. Based on the synthetic data, five classification algorithms were evaluated: logistic regression (LR), decision tree (DT), random forest (RF), extreme gradient boosting (XGB), and support vector machine (SVM). Recall, specificity, weighted F₁ score, and area under the ROC curve (AUC) were used to determine the optimal combination. Finally, the Shapley additive explanations (SHAP) framework was used to quantify the importance of the key contributing factors. The results show that data generated by CTGAN achieve the highest marginal distribution score (0.96) and correlation score (0.92), with an average synthetic data quality of 0.94, significantly better than the other methods. When CTGAN is combined with the random forest algorithm, the model performs well on recall (0.82), specificity (0.84), and AUC (0.86), and it remains robust on a test set containing 10% label noise (recall rises to 0.88), which further supports its applicability in complex scenarios. The analysis of key contributing factors indicates that road surface condition (wet surfaces significantly increase injury risk), nighttime driving (low light degrades sensor performance), and intersections and the degree of street urbanization (complex scenes increase detection delay) are the core factors leading to accidents. This study provides a basis for building autonomous driving test scenarios and for upgrading road infrastructure.

Keywords: autonomous vehicles, small sample size, imbalanced data, conditional tabular generative adversarial network, accident prediction
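The abstract relies on SHAP to rank contributing factors. As a minimal illustration of the underlying idea, the sketch below computes exact Shapley values for a hypothetical three-factor risk function by averaging each factor's marginal contribution over all orderings. The function `toy_risk`, its factor names, and its numbers are invented for illustration and are not taken from the study; practical SHAP implementations use faster model-specific algorithms rather than this brute-force enumeration.

```python
from itertools import permutations


def shapley_values(features, value):
    """Exact Shapley values: mean marginal contribution over all orderings."""
    totals = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for order in orderings:
        coalition = frozenset()
        for f in order:
            with_f = coalition | {f}
            totals[f] += value(with_f) - value(coalition)
            coalition = with_f
    return {f: totals[f] / len(orderings) for f in features}


def toy_risk(coalition):
    """Hypothetical injury risk given the set of factors that are present."""
    risk = 0.05  # baseline risk (assumed)
    if "wet_road" in coalition:
        risk += 0.10
    if "night" in coalition:
        risk += 0.06
    if "wet_road" in coalition and "night" in coalition:
        risk += 0.04  # interaction term, split equally between the two factors
    if "intersection" in coalition:
        risk += 0.03
    return risk


factors = ["wet_road", "night", "intersection"]
phi = shapley_values(factors, toy_risk)
for name in factors:
    print(f"{name}: {phi[name]:.3f}")
```

The attributions add up to the difference between the risk with all factors present and the baseline, which is the additivity property that makes Shapley-based importance rankings interpretable.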

CLC number: U491.31

ZHANG Zhiqing, YU Xiaozheng, ZHU Leipeng, SUN Yufeng, LI Yixin. Identifying Key Causes of Accidents for Autonomous Vehicles Based on CTGAN[J]. Journal of South China University of Technology (Natural Science Edition), 2025, 53(10): 14-28.

Figures and tables (20): Figures 1-12 and Tables 1-8


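Subsystem	Variable	Category	Frequency	Percentage/%
Dependent variable	Highest injury severity (H_I_S)	0: No injury	680	92.14
Dependent variable	Highest injury severity (H_I_S)	1: Injury	58	7.86
Time	Crash month (C_M)	0: Spring	115	15.58
Time	Crash month (C_M)	1: Summer	203	27.51
Time	Crash month (C_M)	2: Autumn	246	33.33
Time	Crash month (C_M)	3: Winter	174	23.58
Time	Crash day of week (C_W)	0: Weekend	174	23.58
Time	Crash day of week (C_W)	1: Weekday	564	76.42
Time	Crash time (C_H)	0: Morning peak	77	10.43
Time	Crash time (C_H)	1: Evening peak	102	13.82
Time	Crash time (C_H)	2: Other	559	75.75
Human	Driver type (D_T)	0: Consumer	708	95.90
Human	Driver type (D_T)	1: Commercial/test	30	4.00
Vehicle	Mileage (M_L)	0: ≤ 50,000	539	73.00
Vehicle	Mileage (M_L)	1: > 50,000	168	22.80
Vehicle	Collision partner (C_T)	0: Passenger car	191	25.88
Vehicle	Collision partner (C_T)	1: Truck/van	64	8.67
Vehicle	Collision partner (C_T)	2: Motorcycle	4	0.54
Vehicle	Collision partner (C_T)	3: Pedestrian	5	0.68
Vehicle	Collision partner (C_T)	4: Fixed object	185	25.07
Vehicle	Collision partner (C_T)	5: Other	289	39.16
Vehicle	Pre-crash movement (P_M)	0: Moving forward	463	62.74
Vehicle	Pre-crash movement (P_M)	1: U-turn	22	2.98
Vehicle	Pre-crash movement (P_M)	2: Lane departure	49	6.64
Vehicle	Pre-crash movement (P_M)	3: Lane change	12	1.63
Vehicle	Pre-crash movement (P_M)	4: Stopped	19	2.57
Vehicle	Pre-crash movement (P_M)	5: Other	23	3.12
Vehicle	Airbag deployed (A_B)	0: Yes	116	15.72
Vehicle	Airbag deployed (A_B)	1: No	622	84.28
Vehicle	Pre-crash speed (P_S)	0: ≤ 32.2 km/h	104	14.09
Vehicle	Pre-crash speed (P_S)	1: > 32.2 km/h and ≤ 64.4 km/h	187	25.34
Vehicle	Pre-crash speed (P_S)	2: > 64.4 km/h and < 96.6 km/h	160	21.68
Vehicle	Pre-crash speed (P_S)	3: ≥ 96.6 km/h	156	21.14
Road	Road type (R_T)	0: Expressway	432	58.54
Road	Road type (R_T)	1: Street	92	12.47
Road	Road type (R_T)	2: Intersection	69	9.35
Road	Road type (R_T)	3: Parking lot	2	0.27
Road	Road type (R_T)	4: Rural road	30	4.07
Road	Road type (R_T)	5: Other	113	15.31
Road	Road surface condition (R_S)	0: Dry	415	56.23
Road	Road surface condition (R_S)	1: Snow/slush/ice	7	0.95
Road	Road surface condition (R_S)	2: Wet	123	16.67
Road	Road surface condition (R_S)	3: Other	193	26.15
Road	Crash description (R_D)	0: No unusual conditions	463	62.74
Road	Crash description (R_D)	1: Traffic incident	26	3.52
Road	Crash description (R_D)	2: Work zone	12	1.63
Road	Crash description (R_D)	3: Missing/unclear signs and markings	3	0.41
Road	Crash description (R_D)	4: Other	30	4.07
Environment	Lighting condition (L_T)	0: Daylight	307	41.60
Environment	Lighting condition (L_T)	1: Dawn/dusk	33	4.47
Environment	Lighting condition (L_T)	2: Dark	204	27.64
Environment	Weather condition (W_T)	0: Clear	355	48.10
Environment	Weather condition (W_T)	1: Snow	7	0.95
Environment	Weather condition (W_T)	2: Cloudy	62	8.40
Environment	Weather condition (W_T)	3: Fog	1	0.14
Environment	Weather condition (W_T)	4: Rain	106	14.36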
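The dependent-variable rows show how skewed the classes are. The short sketch below recomputes the class shares from the reported frequencies and derives the imbalance ratio. The count of synthetic injury records needed for a 1:1 balance assumes balancing on the full dataset; the source does not state the exact split or target ratio, so that figure is only indicative.

```python
# Frequencies taken from the dependent-variable rows of the table above.
counts = {"no_injury": 680, "injury": 58}

total = sum(counts.values())
shares = {label: round(100 * n / total, 2) for label, n in counts.items()}
imbalance_ratio = counts["no_injury"] / counts["injury"]

# Assumption: balance to 1:1 on the full dataset (not stated in the source).
synthetic_needed = counts["no_injury"] - counts["injury"]

print(f"total records: {total}")
print(f"class shares (%): {shares}")
print(f"imbalance ratio: {imbalance_ratio:.2f} : 1")
print(f"synthetic injury records for 1:1 balance: {synthetic_needed}")
```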

Hyperparameter	CTGAN	CopulaGAN
Epochs	1300	1800
batch_size	500	500
generator_lr (float)	10^-4	10^-4
discriminator_lr (float)	10^-4	10^-4
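One way to apply the CTGAN column of this table is sketched below, assuming the open-source `ctgan` Python package, whose `CTGAN` class accepts `epochs`, `batch_size`, `generator_lr`, and `discriminator_lr`. The source does not say which implementation the authors used, so the package choice and the helper names `build_ctgan` and `synthesize` are assumptions. The import is deferred so that the configuration can be inspected without the package installed.

```python
# Hyperparameters copied from the CTGAN column of the table above.
CTGAN_PARAMS = {
    "epochs": 1300,
    "batch_size": 500,
    "generator_lr": 1e-4,
    "discriminator_lr": 1e-4,
}


def build_ctgan(params=CTGAN_PARAMS):
    """Create a CTGAN model (hypothetical helper; assumes the `ctgan` package)."""
    from ctgan import CTGAN  # deferred import: only needed when training

    return CTGAN(**params)


def synthesize(train_df, discrete_columns, n_rows):
    """Fit on a pandas DataFrame of coded variables and draw synthetic rows."""
    model = build_ctgan()
    model.fit(train_df, discrete_columns)
    return model.sample(n_rows)
```

Because every variable in the dataset is a coded category, all column names would be passed in `discrete_columns` under this setup.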

Classifier	Parameter	Optimal value
LR	penalty	L1
LR	C	1.087
LR	solver	liblinear
DT	max_depth	44
DT	min_samples_split	2
DT	min_samples_leaf	1
RF	n_estimators	165
RF	max_depth	18
RF	max_features	4
RF	min_samples_leaf	1
RF	min_samples_split	19
RF	criterion	gini
XGB	learning_rate	0.276
XGB	n_estimators	275
XGB	subsample	0.857
XGB	max_depth	32
SVM	C	3.029
SVM	kernel	rbf
SVM	gamma	0.3126
SVM	probability	True
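The parameter names in this table match scikit-learn and XGBoost estimators, so the tuned settings could be instantiated roughly as follows. This is a sketch under the assumption that those libraries were used, which the source does not confirm; `build_classifiers` and its `random_state` argument are illustrative additions.

```python
# Tuned values copied from the table above; "L1" is written "l1" for scikit-learn.
PARAMS = {
    "LR": {"penalty": "l1", "C": 1.087, "solver": "liblinear"},
    "DT": {"max_depth": 44, "min_samples_split": 2, "min_samples_leaf": 1},
    "RF": {
        "n_estimators": 165,
        "max_depth": 18,
        "max_features": 4,
        "min_samples_leaf": 1,
        "min_samples_split": 19,
        "criterion": "gini",
    },
    "XGB": {"learning_rate": 0.276, "n_estimators": 275, "subsample": 0.857, "max_depth": 32},
    "SVM": {"C": 3.029, "kernel": "rbf", "gamma": 0.3126, "probability": True},
}


def build_classifiers(random_state=0):
    """Instantiate the five classifiers (assumes scikit-learn and xgboost)."""
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier
    from xgboost import XGBClassifier

    return {
        "LR": LogisticRegression(**PARAMS["LR"]),
        "DT": DecisionTreeClassifier(random_state=random_state, **PARAMS["DT"]),
        "RF": RandomForestClassifier(random_state=random_state, **PARAMS["RF"]),
        "XGB": XGBClassifier(**PARAMS["XGB"]),
        "SVM": SVC(**PARAMS["SVM"]),
    }
```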

Classifier	Data augmentation method	Recall	Specificity	Weighted F₁ score	AUC
LR	None	0.00	1.00	0.89	0.72
LR	CTGAN	0.67	0.62	0.71	0.65
LR	CopulaGAN	0.41	0.75	0.78	0.66
LR	SMOTE	0.47	0.81	0.82	0.68
LR	ADASYN	0.53	0.81	0.83	0.68
DT	None	0.06	0.95	0.87	0.45
DT	CTGAN	0.71	0.75	0.80	0.73
DT	CopulaGAN	0.47	0.74	0.78	0.60
DT	SMOTE	0.35	0.88	0.86	0.62
DT	ADASYN	0.41	0.89	0.87	0.70
RF	None	0.24	0.97	0.90	0.81
RF	CTGAN	0.82	0.84	0.87	0.86
RF	CopulaGAN	0.35	0.77	0.79	0.70
RF	SMOTE	0.24	0.96	0.90	0.82
RF	ADASYN	0.29	0.95	0.90	0.79
XGB	None	0.18	0.99	0.91	0.80
XGB	CTGAN	0.65	0.81	0.84	0.80
XGB	CopulaGAN	0.29	0.81	0.81	0.62
XGB	SMOTE	0.35	0.96	0.91	0.77
XGB	ADASYN	0.47	0.94	0.91	0.79
SVM	None	0.35	0.92	0.88	0.50
SVM	CTGAN	0.88	0.74	0.80	0.74
SVM	CopulaGAN	0.65	0.72	0.78	0.70
SVM	SMOTE	0.18	0.96	0.89	0.78
SVM	ADASYN	0.11	0.97	0.89	0.79
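To make the threshold-based columns concrete, the sketch below computes recall, specificity, and a support-weighted F₁ score from a binary confusion matrix. The confusion counts are hypothetical: they assume a test set of 17 injury and 204 non-injury records, a split that is not stated in the source and was chosen only because it reproduces the rounded RF + CTGAN and LR + None rows. The weighted F₁ here is the support-weighted mean of the per-class F₁ scores, which is assumed to match the paper's definition.

```python
def binary_metrics(tp, fn, tn, fp):
    """Recall, specificity, and support-weighted F1 from confusion counts."""
    recall = tp / (tp + fn)        # share of injury crashes detected
    specificity = tn / (tn + fp)   # share of non-injury crashes recognized
    f1_pos = 2 * tp / (2 * tp + fp + fn)
    f1_neg = 2 * tn / (2 * tn + fn + fp)
    n_pos, n_neg = tp + fn, tn + fp
    weighted_f1 = (n_pos * f1_pos + n_neg * f1_neg) / (n_pos + n_neg)
    return recall, specificity, weighted_f1


# Hypothetical counts consistent with the RF + CTGAN row.
rf_ctgan = binary_metrics(tp=14, fn=3, tn=171, fp=33)
# Hypothetical counts for a model that never predicts injury (LR + None row).
lr_none = binary_metrics(tp=0, fn=17, tn=204, fp=0)

print([round(v, 2) for v in rf_ctgan])
print([round(v, 2) for v in lr_none])
```

The second case shows why weighted F₁ alone can mislead on imbalanced data: a classifier that misses every injury crash still scores close to 0.9, which is why recall and specificity are reported alongside it.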

Data augmentation method	Training complexity	Generation complexity	Running time	Computing resources
CTGAN	O(Eσh²τ) = O(7.3×10¹⁰)	O(σh²) = O(131 072)	4 min 7 s	High (GPU)
CopulaGAN	O(τ² + d³ + τd) = O(1.95×10⁵)	O(τd + d²) = O(6 675)	3 min 51 s	Medium (CPU)
SMOTE	O(τkd) = O(32 250)	O(τkd) = O(32 250)	< 1 s	Low (CPU)
ADASYN	O(τkd + τ) = O(32 680)	O(τkd) = O(32 250)	< 1 s	Low (CPU)
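The numeric values in this table can be reproduced with a single set of symbol values, as the sketch below checks. Only E = 1300 comes from the hyperparameter table. The values τ = 430, d = 15, and k = 5 are inferred from the reported totals, and σ = 2 with h = 256 is just one factorization of σh² = 131 072. The reading of the symbols (τ as sample count, d as feature dimension, k as neighbor count, σ as layer count, h as hidden width) is likewise an assumption, since this section does not define them.

```python
# E is reported in the hyperparameter table; all other values are inferred.
E = 1300              # CTGAN training epochs
sigma, h = 2, 256     # assumed: generator layers and hidden width
tau = 430             # assumed: number of training samples
k, d = 5, 15          # assumed: nearest neighbors and feature dimension

ctgan_train = E * sigma * h**2 * tau
ctgan_gen = sigma * h**2
copulagan_train = tau**2 + d**3 + tau * d
copulagan_gen = tau * d + d**2
smote = tau * k * d
adasyn_train = tau * k * d + tau

print(f"CTGAN training:       {ctgan_train:.2e}")
print(f"CTGAN generation:     {ctgan_gen}")
print(f"CopulaGAN training:   {copulagan_train:.3g}")
print(f"CopulaGAN generation: {copulagan_gen}")
print(f"SMOTE:                {smote}")
print(f"ADASYN training:      {adasyn_train}")
```

Under these values the CTGAN training cost exceeds that of SMOTE by roughly six orders of magnitude, in line with the running times of minutes versus under a second.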