Journal of South China University of Technology (Natural Science Edition) ›› 2025, Vol. 53 ›› Issue (5): 56-65. doi: 10.12141/j.issn.1000-565X.240131

• Computer Science & Technology •

A Method for Software Vulnerability Detection via Path Representations and Pretrained Model

LU Lu1,2   WAN Tong1   

  1. School of Computer Science and Engineering, South China University of Technology, Guangzhou 510006, Guangdong, China;

    2. Pengcheng Laboratory, Shenzhen 518000, Guangdong, China

  • Online: 2025-05-25   Published: 2024-06-21

Abstract:

Software vulnerabilities are critical weaknesses that compromise system security and can be exploited by attackers to gain unauthorized control. Most current deep learning-based vulnerability detection approaches rely on a single code representation and therefore fail to capture the complementary nature of code semantics and structural information. To address this limitation, this paper proposes VDPPM (Software Vulnerability Detection via Path Representations and Pretrained Model), a method for software vulnerability detection. VDPPM integrates path representations extracted from the Abstract Syntax Tree (AST), Control Flow Graph (CFG), and Program Dependency Graph (PDG), which gives a more comprehensive view of code characteristics. It also employs SimCodeBERT, a CodeBERT model refined with the contrastive learning framework SimCSE to improve its ability to interpret code semantics. The method first builds a corpus from the path representations and trains a Doc2vec model on it as a general-purpose embedding model that converts sequences of paths into vector representations. A pretrained CodeBERT model is then trained under the contrastive learning framework, yielding SimCodeBERT, which captures deep semantic features of the code more precisely. Finally, the vector representations produced by Doc2vec and SimCodeBERT are fused to perform vulnerability detection. Experiments on multiple publicly available benchmark datasets for vulnerability detection show that VDPPM outperforms mainstream methods, with significant improvements on several performance metrics, which supports the effectiveness of the proposed method.
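
The contrastive training step mentioned in the abstract can be illustrated with a minimal sketch of the unsupervised SimCSE objective, in which two embeddings of the same snippet form a positive pair and the other snippets in the batch serve as negatives. This is not the authors' implementation: the function names `cosine` and `simcse_loss` are hypothetical, plain Python lists stand in for CodeBERT embeddings, and the temperature of 0.05 is the default reported for SimCSE, not a value stated for VDPPM.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def simcse_loss(view_a, view_b, temperature=0.05):
    """Mean InfoNCE loss over a batch (hypothetical helper, not from the paper).

    view_a[i] and view_b[i] are two embeddings of the same code snippet
    (in unsupervised SimCSE, two forward passes with different dropout masks).
    Every view_b[j] with j != i acts as an in-batch negative for view_a[i].
    """
    total = 0.0
    for i, anchor in enumerate(view_a):
        # Temperature-scaled similarity of the anchor to every candidate.
        logits = [cosine(anchor, other) / temperature for other in view_b]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # Negative log-probability assigned to the true positive pair.
        total += log_denominator - logits[i]
    return total / len(view_a)
```

The loss is small when each snippet's two views are closer to each other than to the views of other snippets, and large otherwise. Minimizing it pulls the embeddings of the same snippet together and pushes the embeddings of different snippets apart.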

Key words:

software vulnerability, path representation, pre-training, contrastive learning