Journal of South China University of Technology (Natural Science Edition) ›› 2025, Vol. 53 ›› Issue (5): 56-65. doi: 10.12141/j.issn.1000-565X.240131

• Computer Science & Technology •

A Method for Software Vulnerability Detection via Path Representations and Pretrained Model

LU Lu1,2   WAN Tong1   

  1. School of Computer Science and Engineering, South China University of Technology, Guangzhou 510006, Guangdong, China;

    2. Pengcheng Laboratory, Shenzhen 518000, Guangdong, China

  • Online: 2025-05-25 Published: 2024-06-21

Abstract:

Software vulnerabilities are critical weak points that compromise system security and can be exploited by attackers to gain unauthorized control. Most existing deep learning-based vulnerability detection methods rely on a single code representation and therefore cannot fully capture the complementarity between code semantics and structural information. This study proposes VDPPM (Software Vulnerability Detection via Path Representations and Pretrained Model), a vulnerability detection method based on path representations and a pretrained code model. The framework integrates path representations extracted from the abstract syntax tree (AST), control flow graph (CFG), and program dependence graph (PDG), and strengthens code semantic understanding with SimCodeBERT, a model obtained by optimizing CodeBERT with the contrastive learning framework SimCSE. First, a corpus built from the extracted path representations is used to train a Doc2vec model, yielding a general-purpose embedding model that converts path sequences into vector representations. Next, the pretrained CodeBERT model is incorporated; after training under the contrastive learning framework, it captures deep semantic features of code more precisely. Finally, the vectors generated by Doc2vec and SimCodeBERT are fused to perform vulnerability detection. Experiments on multiple public vulnerability detection benchmark datasets show that VDPPM outperforms mainstream methods, with significant improvements on several metrics, which supports the effectiveness of the proposed method.

Key words:

software vulnerability, path representation, pre-training, contrastive learning
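
Illustrative sketch (not from the paper): the abstract describes turning paths extracted from the AST into sequences that an embedding model such as Doc2vec can consume. The snippet below shows one simple way such paths could be produced. It makes several assumptions: it parses Python source with the standard-library `ast` module rather than the C/C++ tooling a vulnerability detector would normally use, it defines a "path" as a root-to-leaf chain of node type names (the abstract does not specify the exact path definition), and the function names `leaf_paths` and `paths_to_tokens` are hypothetical.

```python
import ast
from typing import List


def leaf_paths(source: str) -> List[List[str]]:
    """Return every root-to-leaf path in the AST of `source`.

    Each path is a list of node type names, e.g. ["Module", "Assign", "Constant"].
    This is a simplified stand-in for the path representation in the abstract.
    """
    tree = ast.parse(source)
    paths: List[List[str]] = []

    def walk(node: ast.AST, prefix: List[str]) -> None:
        current = prefix + [type(node).__name__]
        children = list(ast.iter_child_nodes(node))
        if not children:
            # A node without child nodes is a leaf: record the finished path.
            paths.append(current)
            return
        for child in children:
            walk(child, current)

    walk(tree, [])
    return paths


def paths_to_tokens(paths: List[List[str]]) -> List[str]:
    """Flatten each path into one token so a function becomes a token sequence."""
    return [">".join(path) for path in paths]


if __name__ == "__main__":
    snippet = "def copy(buf, n):\n    return buf[:n]\n"
    for token in paths_to_tokens(leaf_paths(snippet)):
        print(token)
```

Under these assumptions, the token list for each function could serve as one "document" in a Doc2vec training corpus; CFG and PDG paths would need a separate program-analysis front end and are not covered here.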