Journal of South China University of Technology (Natural Science Edition) ›› 2025, Vol. 53 ›› Issue (5): 56-65. doi: 10.12141/j.issn.1000-565X.240324

• Computer Science and Technology •

A Method for Software Vulnerability Detection via Path Representations and Pretrained Model

LU Lu1,2, WAN Tong1   

  1. School of Computer Science and Engineering, South China University of Technology, Guangzhou 510006, Guangdong, China
  2. Pengcheng Laboratory, Shenzhen 518000, Guangdong, China
  • Received: 2024-03-19 Published: 2025-05-25 Online: 2024-06-21
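  • About author: LU Lu (born 1971), male, Ph.D., professor, mainly engaged in research on software defect prediction, high-performance computing, and acceleration of deep learning training and inference. E-mail: lul@scut.edu.cn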
  • Supported by:
    the Key Field Research and Development Plan of Guangdong Province (2022B0101070001); the Natural Science Foundation of Guangdong Province (2024A1515010204)

Abstract:

Software vulnerabilities are critical weaknesses that compromise the security of computer systems. Attackers can readily exploit them for illegal manipulation, which may lead to data breaches, system crashes, or even more severe security incidents. Therefore, accurately and efficiently detecting software vulnerabilities has become a central research topic in computer security. Although existing deep learning-based vulnerability detection approaches have made progress, most are limited to a single code representation and fail to fully capture the complementarity between code semantics and structural information. This paper proposes a vulnerability detection method based on path representations and a pretrained code model, termed VDPPM (Vulnerability Detection via Path Representations and Pretrained Model), to improve code semantic analysis and vulnerability detection accuracy. VDPPM integrates path representations extracted from abstract syntax trees, control flow graphs, and program dependence graphs, and uses SimCodeBERT, a model obtained by optimizing CodeBERT under the contrastive learning framework SimCSE, to strengthen the capture of vulnerability features. In the experiments, three code representations are first extracted from the source code, and the path representations derived from them form a corpus for training a Doc2vec model; the result is a general-purpose embedding model that converts path sequences into vector representations. Next, a pretrained CodeBERT model is trained under the contrastive learning framework so that it captures the deep semantic features of code more precisely. Finally, the vectors generated by Doc2vec and SimCodeBERT are fused into a high-quality code representation that is used for vulnerability detection. Experimental results show that, on multiple public vulnerability detection benchmark datasets, VDPPM outperforms current mainstream methods, with significant improvements in several metrics, demonstrating the effectiveness and superiority of the proposed method.
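The contrastive training and fusion steps described above can be illustrated with a small sketch. It shows the unsupervised SimCSE objective (an InfoNCE loss over two embeddings of the same input, with the other inputs in the batch serving as negatives) and a fusion step. Everything here is an assumption made for illustration: the function names `simcse_loss` and `fuse` are hypothetical, the temperature of 0.05 is the common SimCSE default rather than a value stated in the abstract, and concatenation is only one possible way to fuse vectors, since the abstract does not say how the Doc2vec and SimCodeBERT vectors are combined.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def simcse_loss(views_a, views_b, temperature=0.05):
    """Unsupervised SimCSE-style (InfoNCE) loss, averaged over the batch.

    views_a[i] and views_b[i] are two embeddings of the same code sample
    (in SimCSE, two forward passes with different dropout masks).
    views_b[j] for j != i act as in-batch negatives for sample i.
    The temperature value is an assumed default, not taken from the paper.
    """
    n = len(views_a)
    total = 0.0
    for i in range(n):
        logits = [cosine(views_a[i], views_b[j]) / temperature for j in range(n)]
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # Negative log-probability of the matching (positive) pair.
        total += log_denominator - logits[i]
    return total / n


def fuse(path_vector, semantic_vector):
    """Fuse two representations by concatenation (an assumed fusion rule)."""
    return list(path_vector) + list(semantic_vector)


if __name__ == "__main__":
    first_pass = [[1.0, 0.0], [0.0, 1.0]]
    second_pass = [[0.9, 0.1], [0.2, 0.8]]
    print("contrastive loss:", simcse_loss(first_pass, second_pass))
    print("fused vector:", fuse([0.1, 0.2], [0.3, 0.4, 0.5]))
```

The loss is small when each sample's two views are closer to each other than to the views of other samples, which is the property the contrastive stage relies on. In a real pipeline the embeddings would come from a trained encoder and a Doc2vec model rather than hand-written lists.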

Key words: software vulnerability, vulnerability detection, path representation, pre-training, contrastive learning

CLC Number: