A Method for Software Vulnerability Detection via Path Representations and Pretrained Model

doi:10.12141/j.issn.1000-565X.240324

Abstract

Software vulnerabilities are critical weaknesses that compromise the security of computer systems, leaving them susceptible to attacks that may lead to data breaches, system crashes, or even more severe security incidents. Accurate and efficient detection of software vulnerabilities has therefore become a central research focus in computer security. Although deep learning-based vulnerability detection approaches have made progress, they often rely on a single code representation and fail to fully capture the complementary nature of code semantics and structural information. This paper proposes a software vulnerability detection method, VDPPM (Vulnerability Detection via Path Representations and Pretrained Model), that improves code semantic analysis and detection accuracy. VDPPM integrates path representations extracted from the abstract syntax tree, control flow graph, and program dependency graph, and uses the SimCodeBERT model, optimized with the contrastive learning framework SimCSE, to strengthen the model's ability to capture vulnerability features. First, the three types of code representations are extracted from the source code, and the path representations derived from them form a corpus for training a Doc2vec model, yielding a general-purpose embedding model that converts path sequences into vector representations. Next, a pretrained CodeBERT model is further trained under the contrastive learning framework so that it captures deep semantic features of the code more precisely. Finally, the vector embeddings from Doc2vec and SimCodeBERT are combined to construct high-quality code representations for vulnerability detection.
Experimental results on multiple publicly available benchmark datasets show that VDPPM outperforms existing mainstream methods, with marked improvements on several performance metrics. These results support the effectiveness of the proposed method.
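
The contrastive training step described in the abstract can be illustrated with a minimal sketch of an InfoNCE-style loss of the kind SimCSE uses: two embeddings of the same code snippet form a positive pair, and the other snippets in the batch serve as negatives. This is a sketch under stated assumptions, not the authors' implementation; the function names, the toy vectors, and the temperature of 0.05 (the default reported for SimCSE) are all illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(anchors, positives, temperature=0.05):
    """Mean InfoNCE loss with in-batch negatives.

    anchors[i] and positives[i] are two embeddings of the same snippet;
    positives[j] for j != i act as negatives for anchors[i].
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Toy batch (hypothetical values): two snippets, two views each.
view_a = [[1.0, 0.0], [0.0, 1.0]]
view_b = [[0.9, 0.1], [0.1, 0.9]]

aligned_loss = info_nce_loss(view_a, view_b)
# Pairing each anchor with the wrong view should give a much larger loss.
mismatched_loss = info_nce_loss(view_a, list(reversed(view_b)))
print(aligned_loss, mismatched_loss)
```

The loss is small when each snippet's two views are close to each other and far from the other snippets in the batch, which is the property the contrastive fine-tuning is meant to encourage.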

Key words: software vulnerability, vulnerability detection, path representation, pre-training, contrastive learning

CLC Number: TP311.5

LU Lu, WAN Tong. A Method for Software Vulnerability Detection via Path Representations and Pretrained Model[J]. Journal of South China University of Technology (Natural Science Edition), 2025, 53(5): 56-65.



Dataset	Embedding dimension	Filter size	Fully connected layer dimension
CodeXGLUE	512	384	32
Reveal	384	384	128
LibPNG	384	384	64
LibTIFF	512	384	64
Pidgin	512	512	64
VLC	640	512	32

Dataset	Training set	Test set	Validation set	Vulnerable samples	Non-vulnerable samples	Total samples
CodeXGLUE	20 315	2 540	2 572	11 603	13 824	25 427
Reveal	2 652	334	330	1 658	1 658	3 316
LibPNG	336	113	112	34	527	561
LibTIFF	446	149	148	71	672	743
Pidgin	5 232	1 743	1 744	28	8 691	8 719
VLC	3 654	1 218	1 218	41	6 049	6 090
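
The sample counts above imply very different class balances across datasets. The following sketch computes the share of vulnerable samples per dataset from the counts in the table; the variable names are illustrative, and the code only restates arithmetic on the published numbers.

```python
# (vulnerable, non-vulnerable) sample counts copied from the table above.
counts = {
    "CodeXGLUE": (11603, 13824),
    "Reveal": (1658, 1658),
    "LibPNG": (34, 527),
    "LibTIFF": (71, 672),
    "Pidgin": (28, 8691),
    "VLC": (41, 6049),
}

# Fraction of samples labelled vulnerable in each dataset.
vulnerable_share = {
    name: vulnerable / (vulnerable + clean)
    for name, (vulnerable, clean) in counts.items()
}

# Print datasets from most to least imbalanced.
for name, share in sorted(vulnerable_share.items(), key=lambda kv: kv[1]):
    print(f"{name}: {share:.2%} vulnerable")
```

Reveal is exactly balanced, whereas fewer than 1% of the Pidgin and VLC samples are vulnerable, which is worth keeping in mind when reading recall and F₁ on those two datasets.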

Method	R (CodeXGLUE)	R (Reveal)	F₁ (CodeXGLUE)	F₁ (Reveal)	AUC (CodeXGLUE)	AUC (Reveal)
Devign	0.700 3	0.700 6	0.585 1	0.713 4	0.555 0	0.718 6
CodeBERT	0.527 4	0.652 7	0.571 4	0.701 0	0.628 1	0.721 6
GraphCodeBERT	0.553 9	0.772 5	0.579 5	0.752 2	0.624 6	0.745 5
Reveal	0.568 5	0.759 0	0.540 7	0.745 6	0.556 7	0.741 8
ReGVD	0.533 4	0.718 6	0.561 0	0.703 8	0.610 0	0.697 6
VDDP	0.866 5	0.807 2	0.843 4	0.770 1	0.853 2	0.759 9
VDPPM	0.875 9	0.808 4	0.833 4	0.756 3	0.841 7	0.739 5

Method	R (LibTIFF)	R (LibPNG)	R (Pidgin)	R (VLC)	F₁ (LibTIFF)	F₁ (LibPNG)	F₁ (Pidgin)	F₁ (VLC)	AUC (LibTIFF)	AUC (LibPNG)	AUC (Pidgin)	AUC (VLC)
Devign	0.528 6	0.514 3	0.400 0	0.125 0	0.576 3	0.625 4	0.403 5	0.200 0	0.748 0	0.753 4	0.699 1	0.562 1
CodeBERT	0.385 7	0.600 0	0.600 0	0.125 0	0.532 5	0.651 4	0.750 0	0.222 2	0.691 4	0.791 5	0.800 0	0.562 5
GraphCodeBERT	0.342 9	0.800 0	0.600 0	0.125 0	0.497 8	0.709 7	0.750 0	0.181 8	0.669 9	0.884 9	0.800 0	0.561 7
Reveal	0.614 3	0.685 7	0.400 0	0.125 0	0.597 4	0.668 5	0.454 8	0.178 7	0.783 3	0.830 5	0.699 5	0.561 5
ReGVD	0.528 6	0.771 4	0.400 0	0.125 0	0.659 1	0.801 5	0.571 4	0.166 7	0.760 6	0.881 0	0.700 0	0.561 3
VDDP (itself)	0.657 1	0.600 0	0.760 0	0.175 0	0.719 4	0.712 6	0.826 7	0.243 1	0.819 6	0.797 1	0.879 9	0.586 8
VDDP (outer)	0.615 4	0.828 6	0.480 0	0.125 0	0.686 4	0.824 4	0.575 2	0.205 3	0.798 8	0.908 5	0.739 7	0.562 2
VDDP (extend)	0.785 7	0.833 3	0.800 0	0.300 0	0.798 2	0.893 9	0.871 1	0.392 3	0.883 2	0.915 7	0.899 9	0.649 3
VDPPM	0.785 7	0.857 1	0.800 0	0.625 0	0.880 0	0.923 1	0.888 9	0.588 2	0.892 9	0.928 6	0.900 0	0.810 8
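
Assuming R denotes recall and F₁ the harmonic mean of precision and recall (the usual definitions, which this page does not spell out), the relationship between the two columns can be sketched as follows. The confusion counts in the example are hypothetical: they were chosen only because they reproduce the VDPPM values reported for VLC (R = 0.625 0, F₁ = 0.588 2), and the actual test-set counts are not given here.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from true-positive, false-positive
    and false-negative counts; returns 0.0 where a ratio is undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 5 vulnerable samples found, 3 missed, 4 false alarms.
precision, recall, f1 = precision_recall_f1(tp=5, fp=4, fn=3)
print(round(precision, 4), round(recall, 4), round(f1, 4))
```

With so few vulnerable samples in a test set, a single additional detection shifts recall by a large step, so differences between methods on the smaller datasets should be read with that granularity in mind.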

Fusion method	A	P	R	F₁	AUC
Concatenation	0.836 6	0.810 4	0.841 6	0.825 7	0.837 0
Product	0.829 1	0.782 7	0.869 9	0.824 0	0.832 2
Averaging	0.838 6	0.823 9	0.825 3	0.824 6	0.837 6
Maximum	0.837 0	0.821 1	0.825 3	0.823 2	0.836 1
Addition	0.839 0	0.794 9	0.875 9	0.833 4	0.841 7
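
The five fusion strategies compared above can be sketched as simple operations on two embedding vectors, one from Doc2vec and one from SimCodeBERT. This is an illustrative sketch rather than the authors' code: the function and parameter names are hypothetical, and the element-wise variants assume both embeddings have already been brought to the same dimension, which this page does not state.

```python
def fuse(doc2vec_vec, simcodebert_vec, method="add"):
    """Combine two embedding vectors with one of five fusion strategies."""
    if method == "concat":
        # Concatenation keeps both vectors; output length is the sum.
        return list(doc2vec_vec) + list(simcodebert_vec)

    if len(doc2vec_vec) != len(simcodebert_vec):
        raise ValueError("element-wise fusion needs equal-length vectors")

    elementwise = {
        "add": lambda a, b: a + b,
        "product": lambda a, b: a * b,
        "mean": lambda a, b: (a + b) / 2,
        "max": max,
    }
    if method not in elementwise:
        raise ValueError(f"unknown fusion method: {method}")

    op = elementwise[method]
    return [op(a, b) for a, b in zip(doc2vec_vec, simcodebert_vec)]


# Toy vectors (hypothetical values) to show each strategy.
path_embedding = [1.0, 2.0]
semantic_embedding = [3.0, 5.0]
for name in ("concat", "product", "mean", "max", "add"):
    print(name, fuse(path_embedding, semantic_embedding, name))
```

In the table, addition gives the highest A, R, F₁ and AUC of the five strategies, while averaging gives the highest P.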