A Method for Software Vulnerability Detection via Path Representations and Pretrained Model

doi:10.12141/j.issn.1000-565X.240324

Abstract

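Abstract (Chinese version, translated):

Software vulnerabilities are key weaknesses that compromise the security of computer systems. Attackers can readily exploit them to take illegitimate control, leading to data leaks, system crashes, or even more serious security incidents. Detecting software vulnerabilities accurately and efficiently has therefore become a core research topic in computer security. Existing deep learning-based detection methods have made some progress, but most are limited to a single code representation and cannot fully reflect the complementarity between the semantic and structural information of code. To address this, the paper proposes a vulnerability detection method based on path representations and a pretrained code model (VDPPM for short) to improve code semantic analysis and detection accuracy. The method integrates path representations extracted from the abstract syntax tree, control flow graph, and program dependency graph, and uses the SimCodeBERT model, obtained through optimization with the contrastive learning framework SimCSE, to strengthen the capture of vulnerability features. In the experiments, three code representations are first extracted from the source code, and path representations are extracted from them to build a corpus for training a Doc2vec model, yielding a general-purpose embedding model that converts path sequences into vector representations. On this basis, a pretrained CodeBERT model is incorporated and trained under the contrastive learning framework to capture deep semantic features of code more precisely. Finally, the vectors produced by Doc2vec and SimCodeBERT are fused into a high-quality code representation that is used for vulnerability detection. Experimental results show that, on multiple public vulnerability detection benchmark datasets, VDPPM outperforms current mainstream methods, with significant improvements on several metrics, demonstrating the effectiveness and superiority of the method.

Keywords: software vulnerability, vulnerability detection, path representation, pre-training, contrastive learning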

Abstract:

Software vulnerabilities are critical weaknesses that compromise the security of computer systems, leaving them open to attacks that may lead to data breaches, system crashes, or even more severe security incidents. Therefore, accurately and efficiently detecting software vulnerabilities has become a central research focus in the field of computer security. Although contemporary deep learning-based vulnerability detection approaches have made progress, they are often limited to a single code representation and fail to fully capture the complementary nature of code semantics and structural information. This research introduces an innovative method for software vulnerability detection, termed VDPPM (Vulnerability Detection via Path Representations and Pretrained Model), which aims to enhance code semantic analysis and vulnerability detection accuracy. VDPPM integrates path representations extracted from the abstract syntax tree, control flow graph, and program dependency graph, and leverages the SimCodeBERT model, optimized through the contrastive learning framework SimCSE, to enhance the model's ability to capture vulnerability features. In the experiments, three types of code representations are first extracted from the source code, and path representations derived from them are used to construct a corpus for training a Doc2vec model, yielding a general-purpose embedding model that converts path sequences into vector representations. Subsequently, a pretrained CodeBERT model is integrated and trained under the contrastive learning framework so that it captures deep semantic features of the code more precisely. Finally, the vector embeddings from Doc2vec and SimCodeBERT are combined to construct high-quality code representations for vulnerability detection. Experimental results demonstrate that, across multiple publicly available benchmark datasets for vulnerability detection, VDPPM outperforms existing mainstream methods, with significant improvements in several performance metrics, which validates the effectiveness and superiority of the proposed method.

Key words: software vulnerability, vulnerability detection, path representation, pre-training, contrastive learning
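
The abstract describes training CodeBERT under the SimCSE contrastive framework. As a rough illustration of what such an objective looks like, the sketch below computes an in-batch InfoNCE loss of the kind unsupervised SimCSE uses: two embeddings of the same input are pulled together while the other items in the batch act as negatives. It is a toy in plain Python, not the authors' training code; the random vectors, the `noisy_copy` jitter (a stand-in for dropout), and the temperature of 0.05 are all assumptions made for the example.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def simcse_loss(view_a, view_b, temperature=0.05):
    """Mean in-batch InfoNCE loss.

    view_a[i] and view_b[i] are two embeddings of the same code snippet.
    Every view_b[j] with j != i serves as a negative for view_a[i].
    """
    total = 0.0
    for i, anchor in enumerate(view_a):
        logits = [cosine(anchor, other) / temperature for other in view_b]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total / len(view_a)


def noisy_copy(vector, rng, scale=0.01):
    """Hypothetical stand-in for dropout noise: add small Gaussian jitter."""
    return [x + rng.gauss(0.0, scale) for x in vector]


rng = random.Random(0)
batch = [[rng.gauss(0.0, 1.0) for _ in range(16)] for _ in range(8)]
view_a = [noisy_copy(v, rng) for v in batch]
view_b = [noisy_copy(v, rng) for v in batch]

aligned = simcse_loss(view_a, view_b)
# Shift the second view by one position so every "positive" is a mismatch.
misaligned = simcse_loss(view_a, view_b[1:] + view_b[:1])
print(f"aligned={aligned:.4f}  misaligned={misaligned:.4f}")
```

With matched pairs the loss comes out far smaller than when the positives are shifted by one position, which is the behaviour the objective rewards.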

CLC number: TP311.5

陆璐, 万童. 一种基于路径表示和预训练模型的软件漏洞检测方法[J]. 华南理工大学学报(自然科学版), 2025, 53(5): 56-65.

LU Lu, WAN Tong. A Method for Software Vulnerability Detection via Path Representations and Pretrained Model[J]. Journal of South China University of Technology (Natural Science Edition), 2025, 53(5): 56-65.

Figures/Tables (9): Figure 1, Figure 2, Table 1, Table 2, Table 3, Table 4, Figure 3, Figure 4, Table 5


Dataset	Embedding dimension	Filter size	Fully connected layer dimension
CodeXGLUE	512	384	32
Reveal	384	384	128
LibPNG	384	384	64
LibTIFF	512	384	64
Pidgin	512	512	64
VLC	640	512	32

Number of samples per dataset
Dataset	Training set	Test set	Validation set	Vulnerable	Non-vulnerable	Total
CodeXGLUE	20315	2540	2572	11603	13824	25427
Reveal	2652	334	330	1658	1658	3316
LibPNG	336	113	112	34	527	561
LibTIFF	446	149	148	71	672	743
Pidgin	5232	1743	1744	28	8691	8719
VLC	3654	1218	1218	41	6049	6090
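
The sample counts above can be checked and summarized with a few lines of arithmetic. The sketch below copies the six rows from the table, confirms that the splits and the two classes each add up to the reported total, and derives the share of vulnerable samples and the training share. The dictionary layout and function name are choices made for this example only.

```python
# (train, test, validation, vulnerable, non_vulnerable, total), copied from the table.
DATASETS = {
    "CodeXGLUE": (20315, 2540, 2572, 11603, 13824, 25427),
    "Reveal": (2652, 334, 330, 1658, 1658, 3316),
    "LibPNG": (336, 113, 112, 34, 527, 561),
    "LibTIFF": (446, 149, 148, 71, 672, 743),
    "Pidgin": (5232, 1743, 1744, 28, 8691, 8719),
    "VLC": (3654, 1218, 1218, 41, 6049, 6090),
}


def summarize(name, row):
    """Check one row for internal consistency and return derived ratios."""
    train, test, valid, vulnerable, non_vulnerable, total = row
    # The three splits and the two classes should each add up to the total.
    assert train + test + valid == total, name
    assert vulnerable + non_vulnerable == total, name
    return {
        "vulnerable_ratio": vulnerable / total,
        "train_share": train / total,
    }


summary = {name: summarize(name, row) for name, row in DATASETS.items()}
for name, stats in summary.items():
    print(f"{name:10s} vulnerable={stats['vulnerable_ratio']:.2%} "
          f"train={stats['train_share']:.0%}")
```

On these numbers, CodeXGLUE and Reveal are split roughly 80/10/10 and the four project datasets roughly 60/20/20. The vulnerable share ranges from 50% (Reveal) down to about 0.3% (Pidgin), so the project datasets are heavily imbalanced.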

Method	R (CodeXGLUE)	R (Reveal)	F₁ (CodeXGLUE)	F₁ (Reveal)	AUC (CodeXGLUE)	AUC (Reveal)
Devign	0.7003	0.7006	0.5851	0.7134	0.5550	0.7186
CodeBERT	0.5274	0.6527	0.5714	0.7010	0.6281	0.7216
GraphCodeBERT	0.5539	0.7725	0.5795	0.7522	0.6246	0.7455
Reveal	0.5685	0.7590	0.5407	0.7456	0.5567	0.7418
ReGVD	0.5334	0.7186	0.5610	0.7038	0.6100	0.6976
VDDP	0.8665	0.8072	0.8434	0.7701	0.8532	0.7599
VDPPM	0.8759	0.8084	0.8334	0.7563	0.8417	0.7395

Method	R (LibTIFF)	R (LibPNG)	R (Pidgin)	R (VLC)	F₁ (LibTIFF)	F₁ (LibPNG)	F₁ (Pidgin)	F₁ (VLC)	AUC (LibTIFF)	AUC (LibPNG)	AUC (Pidgin)	AUC (VLC)
Devign	0.5286	0.5143	0.4000	0.1250	0.5763	0.6254	0.4035	0.2000	0.7480	0.7534	0.6991	0.5621
CodeBERT	0.3857	0.6000	0.6000	0.1250	0.5325	0.6514	0.7500	0.2222	0.6914	0.7915	0.8000	0.5625
GraphCodeBERT	0.3429	0.8000	0.6000	0.1250	0.4978	0.7097	0.7500	0.1818	0.6699	0.8849	0.8000	0.5617
Reveal	0.6143	0.6857	0.4000	0.1250	0.5974	0.6685	0.4548	0.1787	0.7833	0.8305	0.6995	0.5615
ReGVD	0.5286	0.7714	0.4000	0.1250	0.6591	0.8015	0.5714	0.1667	0.7606	0.8810	0.7000	0.5613
VDDP (itself)	0.6571	0.6000	0.7600	0.1750	0.7194	0.7126	0.8267	0.2431	0.8196	0.7971	0.8799	0.5868
VDDP (outer)	0.6154	0.8286	0.4800	0.1250	0.6864	0.8244	0.5752	0.2053	0.7988	0.9085	0.7397	0.5622
VDDP (extend)	0.7857	0.8333	0.8000	0.3000	0.7982	0.8939	0.8711	0.3923	0.8832	0.9157	0.8999	0.6493
VDPPM	0.7857	0.8571	0.8000	0.6250	0.8800	0.9231	0.8889	0.5882	0.8929	0.9286	0.9000	0.8108

Fusion method	A	P	R	F₁	AUC
Concatenation	0.8366	0.8104	0.8416	0.8257	0.8370
Product	0.8291	0.7827	0.8699	0.8240	0.8322
Averaging	0.8386	0.8239	0.8253	0.8246	0.8376
Maximum	0.8370	0.8211	0.8253	0.8232	0.8361
Addition	0.8390	0.7949	0.8759	0.8334	0.8417
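
The five fusion strategies compared above can be sketched as simple vector operations. The example below is a minimal illustration in plain Python and is not taken from the paper: it assumes that the element-wise strategies (product, averaging, maximum, addition) operate on two vectors of equal dimension, and the vectors `doc2vec_vec` and `simcodebert_vec` are hypothetical toy values standing in for the Doc2vec and SimCodeBERT outputs.

```python
def fuse(u, v, method):
    """Combine two embedding vectors with one of five fusion strategies."""
    elementwise = {
        "product": lambda a, b: a * b,
        "mean": lambda a, b: (a + b) / 2,
        "max": max,
        "sum": lambda a, b: a + b,
    }
    if method == "concat":
        # Concatenation keeps both vectors intact; the dimensions add up.
        return list(u) + list(v)
    if method not in elementwise:
        raise ValueError(f"unknown fusion method: {method}")
    if len(u) != len(v):
        # Assumption: element-wise fusion needs vectors of equal dimension.
        raise ValueError("element-wise fusion needs vectors of equal length")
    combine = elementwise[method]
    return [combine(a, b) for a, b in zip(u, v)]


# Hypothetical toy embeddings, not real model outputs.
doc2vec_vec = [0.2, -0.5, 1.0]
simcodebert_vec = [0.4, 0.5, -1.0]

fused = {
    method: fuse(doc2vec_vec, simcodebert_vec, method)
    for method in ("concat", "product", "mean", "max", "sum")
}
for method, vector in fused.items():
    print(method, vector)
```

Only concatenation changes the output dimension; the other four keep it fixed, so the downstream classifier input size stays the same. In the table, addition gives the highest accuracy, recall, F₁, and AUC of the five, while averaging gives the highest precision.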