A Method for Software Vulnerability Detection via Path Representations and Pretrained Model

LU Lu; WAN Tong

doi:10.12141/j.issn.1000-565X.240324

Journal of South China University of Technology (Natural Science), 2025, Vol. 53, Issue 5: 56-65


Computer Science & Technology


1. School of Computer Science and Engineering, South China University of Technology, Guangzhou 510006, Guangdong, China
2. Pengcheng Laboratory, Shenzhen 518000, Guangdong, China

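LU Lu (born 1971), male, Ph.D., professor. His main research interests are software defect prediction, high-performance computing, and acceleration of deep learning training and inference. E-mail: lul@scut.edu.cn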

Received date: 2024-03-19

Online published: 2024-06-19

Supported by the Key Field Research and Development Plan of Guangdong Province (2022B0101070001); the Natural Science Foundation of Guangdong Province (2024A1515010204)


Abstract

Software vulnerabilities are critical weaknesses that compromise the security of computer systems, leaving them susceptible to attacks that may lead to data breaches, system crashes, or even more severe security incidents. Accurately and efficiently detecting software vulnerabilities has therefore become a central research focus in computer security. Although contemporary deep learning-based vulnerability detection approaches have made progress, they often rely on a single code representation and fail to fully capture the complementary nature of code semantics and structural information. This research introduces a method for software vulnerability detection, termed VDPPM (Vulnerability Detection via Path Representations and Pretrained Model), which enhances code semantic analysis and vulnerability detection accuracy. VDPPM integrates path representations extracted from the abstract syntax tree, control flow graph, and program dependency graph, and leverages SimCodeBERT, a model optimized through the contrastive learning framework SimCSE, to improve its ability to capture vulnerability features. In the experiments, three types of code representations are first extracted from the source code, and the path representations derived from them form a corpus for training a Doc2vec model, yielding general-purpose embedding models that convert path sequences into vector representations. Subsequently, a pretrained CodeBERT model is integrated; after training under the contrastive learning framework, it captures deep semantic features of the code more precisely. Finally, the vector embeddings from Doc2vec and SimCodeBERT are combined to construct high-quality code representations for vulnerability detection.
Experimental results on multiple publicly available benchmark datasets for vulnerability detection show that VDPPM outperforms existing mainstream methods, with significant improvements in several performance metrics. These results support the effectiveness and superiority of the proposed method.
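
The pipeline summarized in the abstract can be illustrated with a small, standard-library-only sketch. It is not the authors' implementation and rests on several assumptions: Python's `ast` module stands in for the parser used on the benchmark code, only abstract syntax tree paths are shown (control flow and program dependency paths are omitted), the loss follows the unsupervised SimCSE objective with its commonly used temperature of 0.05, and the two embeddings are combined by plain concatenation, which the abstract does not specify. The names `ast_paths`, `simcse_loss`, and `fuse` are hypothetical.

```python
import ast
import math


def ast_paths(source):
    """Return root-to-leaf paths of node type names for a Python snippet.

    Each path is a list of tokens and could serve as one "sentence" in a
    Doc2vec training corpus (assumption: the paper's exact path format
    is not given in the abstract).
    """
    paths = []

    def walk(node, prefix):
        prefix = prefix + [type(node).__name__]
        children = list(ast.iter_child_nodes(node))
        if not children:  # leaf node: the path is complete
            paths.append(prefix)
        for child in children:
            walk(child, prefix)

    walk(ast.parse(source), [])
    return paths


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def simcse_loss(views_a, views_b, temperature=0.05):
    """Mean InfoNCE loss: views_a[i] and views_b[i] form the positive pair,
    and every other views_b[j] in the batch acts as a negative."""
    total = 0.0
    for i, anchor in enumerate(views_a):
        logits = [cosine(anchor, other) / temperature for other in views_b]
        peak = max(logits)  # subtract the maximum for a stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total / len(views_a)


def fuse(path_vector, semantic_vector):
    """Combine two embeddings by concatenation (an assumed fusion choice)."""
    return list(path_vector) + list(semantic_vector)


if __name__ == "__main__":
    for path in ast_paths("x = 1"):
        print(" ".join(path))
    aligned = simcse_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
    swapped = simcse_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
    print(aligned < swapped)
    print(fuse([0.1, 0.2], [0.3]))
```

In a fuller version, the path sentences would train a Doc2vec model, the two views passed to `simcse_loss` would be two dropout-perturbed encodings of the same function from the CodeBERT encoder, and the fused vector would feed a classifier.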

Key words: software vulnerability; vulnerability detection; path representation; pre-training; contrastive learning

Cite this article

LU Lu, WAN Tong. A Method for Software Vulnerability Detection via Path Representations and Pretrained Model[J]. Journal of South China University of Technology (Natural Science), 2025, 53(5): 56-65. DOI: 10.12141/j.issn.1000-565X.240324

