Journal of South China University of Technology (Natural Science Edition) ›› 2025, Vol. 53 ›› Issue (5): 56-65. doi: 10.12141/j.issn.1000-565X.240324

• Computer Science & Technology •

A Method for Software Vulnerability Detection via Path Representations and Pretrained Model

LU Lu1,2, WAN Tong1   

  1. School of Computer Science and Engineering, South China University of Technology, Guangzhou 510006, Guangdong, China
  2. Pengcheng Laboratory, Shenzhen 518000, Guangdong, China
  • Received:2024-03-19 Online:2025-05-25 Published:2024-06-21
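  • About author: LU Lu (born 1971), male, Ph.D., professor. His research covers software defect prediction, high-performance computing, and acceleration of deep learning training and inference. E-mail: lul@scut.edu.cn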
  • Supported by:
    the Key Field Research and Development Plan of Guangdong Province (2022B0101070001); the Natural Science Foundation of Guangdong Province (2024A1515010204)

Abstract:

Software vulnerabilities are critical weaknesses that compromise the security of computer systems, exposing them to attacks that may lead to data breaches, system crashes, or even more severe security incidents. Accurately and efficiently detecting software vulnerabilities has therefore become a central research focus in computer security. Although contemporary deep learning-based vulnerability detection approaches have made progress, they often rely on a single code representation and fail to fully capture the complementary nature of code semantics and structural information. This research introduces a method for software vulnerability detection, termed VDPPM (Vulnerability Detection via Path Representations and Pretrained Model), which enhances code semantic analysis and improves vulnerability detection accuracy. VDPPM integrates path representations extracted from abstract syntax trees, control flow graphs, and program dependency graphs, and leverages the SimCodeBERT model, optimized through the contrastive learning framework SimCSE, to enhance the model’s ability to capture vulnerability features. In the experiments, three types of code representations are first extracted from the source code, and the path representations derived from them form a corpus for training a Doc2vec model; this yields general-purpose embedding models that convert path sequences into vector representations. Subsequently, a pretrained CodeBERT model is integrated and trained under the contrastive learning framework, which increases its precision in capturing deep semantic features of the code. Finally, the vector embeddings from Doc2vec and SimCodeBERT are combined to construct high-quality code representations for vulnerability detection.
Experimental results show that, on multiple publicly available benchmark datasets for vulnerability detection, VDPPM outperforms existing mainstream methods, with significant improvements on several performance metrics. These results support the effectiveness and superiority of the proposed method.
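
The three ingredients named in the abstract (path representations, a contrastive objective, and the fusion of two embeddings) can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions: Python's built-in `ast` module stands in for whatever parser the paper uses, only abstract syntax tree paths are shown (control flow and program dependency graphs are omitted), the loss is a generic SimCSE-style InfoNCE over cosine similarities, and the fusion is a plain concatenation. The function names `root_to_leaf_paths`, `simcse_loss`, and `fuse` are hypothetical.

```python
import ast
import math


def root_to_leaf_paths(source):
    """Return every root-to-leaf AST path as a '|'-joined string of node type names.

    Assumption: Python's ast module is a stand-in for the paper's parser.
    """
    paths = []

    def walk(node, prefix):
        prefix = prefix + [type(node).__name__]
        children = list(ast.iter_child_nodes(node))
        if not children:
            paths.append("|".join(prefix))
        for child in children:
            walk(child, prefix)

    walk(ast.parse(source), [])
    return paths


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def simcse_loss(views_a, views_b, temperature=0.05):
    """SimCSE-style InfoNCE loss over a batch of embedding pairs.

    views_a[i] and views_b[i] are two views of the same sample (a positive
    pair); every other row of views_b acts as an in-batch negative.
    """
    total = 0.0
    for i, anchor in enumerate(views_a):
        logits = [cosine(anchor, other) / temperature for other in views_b]
        peak = max(logits)  # subtract the maximum for numerical stability
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total / len(views_a)


def fuse(path_vector, semantic_vector):
    """Combine two embeddings by concatenation (an assumed fusion strategy)."""
    return list(path_vector) + list(semantic_vector)


if __name__ == "__main__":
    for path in root_to_leaf_paths("x = y + 1"):
        print(path)
    print(simcse_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]))
    print(fuse([0.1, 0.2], [0.3, 0.4, 0.5]))
```

In a fuller pipeline, the path strings would serve as the training corpus for a document embedding model such as Doc2vec, and the contrastive loss would be minimized over encoder outputs rather than over fixed vectors.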

Key words: software vulnerability, vulnerability detection, path representation, pre-training, contrastive learning

CLC Number: