Novel learning framework for building vulnerability detection models
Introduction
Using graph neural networks and open-source repositories to detect code vulnerabilities. This is an implementation of the model described in: "Combining Graph-based Learning with Automated Data Collection for Code Vulnerability Detection"
FUNDED is a novel learning framework for building vulnerability detection models. It leverages advances in graph neural networks (GNNs) to develop a graph-based learning method that captures and reasons about the program's control, data, and call dependencies.
November 2020 - The paper was accepted to IEEE TIFS!
The datasets are available here and cover C, Java and PHP! As shown in Lili's work, our dataset had the highest complexity, the largest sample size, and the most subroutine calls compared to other public vulnerability datasets.
Contents
Getting Started
These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.
Prerequisites
Install the necessary dependencies before running the project. The Software list covers the tools used for data preprocessing, and the Python Libraries list is the environment we have tested.
For more details, please refer to requirements.txt:
Software:
Python Libraries:
Setup
This section gives the steps, explanations and examples for getting the project running.
1) Clone this repo
$ git clone [email protected]:HuantWang/FUNDED_NISL.git
2) Install Prerequisites
$ pip install -r requirements.txt
3) Run the testcase
$ cd NISL_TIFS2021/FUNDED/cli
$ CUDA_VISIBLE_DEVICES=2 python train.py GGNN GraphBinaryClassification ../data/data/CWE-77
4) Load the trained model and predict
$ cd NISL_TIFS2021/FUNDED/cli
$ CUDA_VISIBLE_DEVICES=2 python test.py GGNN GraphBinaryClassification ../data/data/data/cve/badall --storedModel_path "./trained_model/GGNN_GraphBinaryClassification__2023-02-01_05-36-00_f1 = 0.800_best.pkl"
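The stored model name in the command above embeds the F1 score (`..._f1 = 0.800_best.pkl`). As a minimal sketch, assuming every checkpoint of interest follows that naming pattern and sits in `./trained_model`, the helper below picks the file with the highest embedded score. The functions `f1_from_name` and `best_checkpoint` are hypothetical helpers and are not part of the repository.

```python
import re
from pathlib import Path
from typing import Optional

# Assumed naming pattern, copied from the example file name in step 4:
#   ..._f1 = 0.800_best.pkl
F1_PATTERN = re.compile(r"_f1 = (\d+\.\d+)_best\.pkl$")


def f1_from_name(name: str) -> Optional[float]:
    """Return the F1 score embedded in a checkpoint file name, if any."""
    match = F1_PATTERN.search(name)
    return float(match.group(1)) if match else None


def best_checkpoint(model_dir: str) -> Optional[Path]:
    """Return the checkpoint with the highest embedded F1 score, or None."""
    scored = []
    for path in Path(model_dir).glob("*.pkl"):
        score = f1_from_name(path.name)
        if score is not None:  # skip files that do not follow the pattern
            scored.append((score, path))
    if not scored:
        return None
    return max(scored, key=lambda item: item[0])[1]


if __name__ == "__main__":
    # Prints the path to pass to --storedModel_path, or None if nothing matches.
    print(best_checkpoint("./trained_model"))
```

Remember to quote the resulting path on the command line, since the file name contains spaces.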
GNN Detection module
This part describes the source code structure of the GNN detection model and includes a partial sample dataset.
Detection Structure
├── LICENSE
├── README.md <- The top-level README for developers using this project.
├── requirements.txt <- The Python environment for developers using this project.
├── FUNDED
│   ├── cli
│   │   ├── train.py <- the entrance of training models.
│   │   ├── test.py <- testing the specified model using data.
│   │   └── __init__.py
│   ├── cli_utils
│   │   ├── default_hypers
│   │   │   └── GraphBinaryClassification_GGNN.json
│   │   ├── dataset_utils.py
│   │   ├── model_utils.py
│   │   ├── param_helpers.py
│   │   ├── task_utils.py
│   │   ├── training_utils.py
│   │   └── __init__.py
│   ├── data
│   │   ├── data
│   │   │   ├── data_preprocess.py
│   │   │   ├── our_map_all.txt
│   │   │   └── __init__.py
│   │   ├── graph_dataset.py
│   │   ├── jsonl_graph_dataset.py
│   │   ├── jsonl_graph_property_dataset.py
│   │   └── __init__.py
│   ├── layers
│   │   ├── message_passing
│   │   │   ├── ggnn.py
│   │   │   ├── gnn_edge_mlp.py
│   │   │   ├── gnn_film.py
│   │   │   ├── message_passing.py
│   │   │   └── __init__.py
│   │   ├── gnn.py
│   │   ├── graph_global_exchange.py
│   │   ├── nodes_to_graph_representation.py
│   │   └── __init__.py
│   ├── models
│   │   ├── graph_binary_classification_task.py
│   │   ├── graph_regression_task.py
│   │   ├── graph_task_model.py
│   │   ├── node_multiclass_task.py
│   │   └── __init__.py
│   ├── utils
│   │   ├── activation.py
│   │   ├── constants.py
│   │   ├── gather_dense_gradient.py
│   │   ├── param_helpers.py
│   │   └── __init__.py
│   └── __init__.py
Data Preprocessing
To construct the AST, we use Soot for Java, ANTLR for Swift and PHP, and Joern for C/C++.
C/C++
For C/C++, we download datasets of different CWE types from SARD, CVE and GitHub.
The specific steps of data preprocessing are as follows:
Warning: Replace the paths in the code with the paths to your own data.
- Slicing data
$ cd FUNDED_NISL/Edge_processing/slicec_7edges_funcblock/src/main/java/slice
- Run ClassifyFileOfProject.java to extract all the C files.
- Run Main.java to slice the data at function level.
- Extracting different edge relationships
Then we traverse all the AST nodes of the source code, which has been parsed by CDT. While traversing, all nodes are numbered in sequence, and the edge relationships are obtained according to specific rules.
$ cd FUNDED_NISL/Edge_processing/slicec_7edges_funcblock/src/main/java/sevenEdges
- Use Joern to get all the control flows and data flows in the source code; for details, see joern.
- Run Main.java to extract the other edges.
- Run concateJoern.java to concatenate all edges.
We provide a demo dataset for data preprocessing.
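The numbering step described above (traverse the AST, number nodes in sequence, derive edges from the traversal) can be illustrated with a small sketch. This is not the project's Java/CDT implementation: it uses Python's built-in `ast` module on Python source purely to show the idea, and it only emits parent-to-child edges. The function name `number_nodes_and_child_edges` is hypothetical.

```python
import ast
from typing import Dict, List, Tuple


def number_nodes_and_child_edges(
    source: str,
) -> Tuple[Dict[int, str], List[Tuple[int, int]]]:
    """Number AST nodes in pre-order and collect (parent_id, child_id) edges."""
    tree = ast.parse(source)
    labels: Dict[int, str] = {}          # node id -> node type name
    edges: List[Tuple[int, int]] = []    # parent id -> child id
    counter = 0

    def visit(node: ast.AST) -> int:
        nonlocal counter
        counter += 1                     # ids are assigned in visiting order
        node_id = counter
        labels[node_id] = type(node).__name__
        for child in ast.iter_child_nodes(node):
            child_id = visit(child)
            edges.append((node_id, child_id))
        return node_id

    visit(tree)
    return labels, edges


if __name__ == "__main__":
    node_labels, child_edges = number_nodes_and_child_edges("x = 1")
    for node_id, label in node_labels.items():
        print(node_id, label)
    print(sorted(child_edges))
```

Other edge types (control flow, data flow, and so on) would be produced by further passes over the same numbered nodes, so that all edge lists share one id space.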
Java
For Java, we download data from SARD, CVE and GitHub.
Following the same idea as for C/C++ above, we construct the relationships for the different edge types using Soot and JDT.
Warning: Replace the paths with the paths to your own data.
$ cd NISL_TIFS2021/EdgesGenerationAndDataPreprocess/Java_jdt_AST_CDFG/src/main/java/yoshikihigo/tinypdg/
$ java Main.java sourceFilePath savafilePath
PHP and Swift
For PHP and Swift, we collect datasets from SARD, CVE and GitHub.
We then extract the edges and nodes from the AST constructed with ANTLR.
$ cd NISL_TIFS2021/EdgesGenerationAndDataPreprocess/php_swift/src/php/main
$ java TestPhp.java sourceFilePath savafilePath
$ cd NISL_TIFS2021/EdgesGenerationAndDataPreprocess/php_swift/src/swift3/main
$ java TestSwift3.java sourceFilePath savafilePath
Dataset
The datasets can be collected here.
The edges dataset contains C-language data for 44 different CWE types. After script processing, we get the final inputs.
For example, test datasets whose graphs consist of AST, CFG and PDG are available under data/data/CWE-399 and data/data/CWE-400.
Fields
cwe | file_id | target | contents |
---|---|---|---|
399 | 0a2a9a6f-779e-47b4-823e-43eccd125b4f.c$$$0 | 0 | 1,2 1,3 2,7,9 (1,9,0)(2,8,1)(3,7,2) ... |
399 | 1b733c0b-30d5-4cc2-9431-8695795abfed.c$$$1 | 1 | 6,7 4,5 1,4,9 (2,7,0)(3,5,1)(4,2,2) ... |
399 | 3e9bebda-cef3-4988-9543-a5e5473849c2.c$$$0 | 0 | 1,2 3,5 3,5,8 (1,2,0)(2,6,1)(4,8,2) ... |
399 | 8bcbb6c4-3f3f-471c-b2dc-ab9151bb22f8.c$$$2 | 1 | 2,7 2,9 2,3,7 (6,7,0)(1,5,1)(6,9,2) ... |
399 | 53ee12a1-ba49-41f2-a163-c2b662a4db27.c$$$0 | 0 | 4,5 7,8 3,6,8 (5,8,0)(3,6,1)(7,8,2) ... |
... | ... | ... | ... |
400 | 8388fdcf-40cf-4e59-9f11-17d9e320efd8.c$$$4 | 0 | 1,7 2,5 3,4,8 (4,7,0)(5,8,1)(2,9,2) ... |
400 | 91978dee-4ee4-428b-8576-ffb49e8dc12a.c$$$6 | 1 | 2,3 3,8 3,7,9 (3,6,0)(4,6,1)(2,8,2) ... |
400 | 113353a8-f804-4aff-a81a-15f20e638d4b.c$$$1 | 1 | 4,6 4,7 5,6,7 (3,7,0)(4,5,1)(8,9,2) ... |
400 | b7b5ae35-d478-4c51-96c2-8f107fc08fde.c$$$3 | 1 | 2,5 7,8 1,7,8 (5,8,0)(3,6,1)(2,8,2) ... |
400 | e831aff3-bd88-4ef7-a5b0-2d87e1b20fbe.c$$$0 | 0 | 6,8 2,8 4,6,9 (6,9,0)(1,5,1)(1,4,2) ... |
... | ... | ... | ... |
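As a rough sketch of how a row of this table could be read programmatically, the helpers below split a `file_id` at the `$$$` separator and pull the parenthesized integer triples out of `contents`. The meaning of the individual numbers is not stated here, so the code treats them as opaque integers. Reading the suffix after `$$$` as an index within the source file is an assumption, and `split_file_id` and `parse_triples` are hypothetical names.

```python
import re
from typing import List, NamedTuple, Tuple

# Matches one "(a,b,c)" group of three non-negative integers.
TRIPLE = re.compile(r"\((\d+),(\d+),(\d+)\)")


class SampleId(NamedTuple):
    source_file: str  # part before "$$$"
    index: int        # part after "$$$" (assumed to be a per-file index)


def split_file_id(file_id: str) -> SampleId:
    """Split 'name.c$$$N' into the file name and the trailing integer."""
    name, _, index = file_id.rpartition("$$$")
    return SampleId(name, int(index))


def parse_triples(contents: str) -> List[Tuple[int, ...]]:
    """Extract every parenthesized integer triple from a contents string."""
    return [tuple(int(x) for x in m.groups()) for m in TRIPLE.finditer(contents)]


if __name__ == "__main__":
    print(split_file_id("0a2a9a6f-779e-47b4-823e-43eccd125b4f.c$$$0"))
    print(parse_triples("1,2 1,3 2,7,9 (1,9,0)(2,8,1)(3,7,2) ..."))
```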
Results
Example results of training on the sample dataset CWE-400. The model checkpoint was saved at 60 epochs.
Dataset parameters: {
"max_nodes_per_batch": 128,
"num_fwd_edge_types": 7,
"add_self_loop_edges": true,
"tie_fwd_bkwd_edges": true,
"threshold_for_classification": 0.5
}
Model parameters: {
"gnn_aggregation_function": "sum",
"gnn_message_activation_function": "ReLU",
"gnn_hidden_dim": 256,
"gnn_use_target_state_as_input": false,
"gnn_normalize_by_num_incoming": true,
"gnn_num_edge_MLP_hidden_layers": 1,
"gnn_num_aggr_MLP_hidden_layers": null,
"gnn_message_calculation_class": "RGIN",
"gnn_initial_node_representation_activation": "tanh",
"gnn_dense_intermediate_layer_activation": "tanh",
"gnn_num_layers": 5, "gnn_dense_every_num_layers": 10000,
"gnn_residual_every_num_layers": 2,
"gnn_use_inter_layer_layernorm": true,
"gnn_layer_input_dropout_rate": 0.2,
"gnn_global_exchange_mode": "gru",
"gnn_global_exchange_every_num_layers": 10000,
"gnn_global_exchange_weighting_fun": "softmax",
"gnn_global_exchange_num_heads": 4,
"gnn_global_exchange_dropout_rate": 0.2,
"optimizer": "Adam", "learning_rate": 0.001,
"learning_rate_decay": 0.98, "momentum": 0.85,
"gradient_clip_value": 1.0,
"use_intermediate_gnn_results": false,
"graph_aggregation_num_heads": 16,
"graph_aggregation_hidden_layers": [128],
"graph_aggregation_dropout_rate": 0.2
}
== Running on test dataset
Loading data from ../data/data/tem_CWE-77/ast.
Loading data from ../data/data/tem_CWE-77/cdfg.
Restoring best model state from trained_model/GGNN_GraphBinaryClassification__2020-11-30_10-41-23_best.pkl.
NoneCP_test Accuracy = 0.915|precision = 0.846 | recall = 1.000 | f1 = 0.917
== Running on test dataset
Loading data from ../data/data/tem_CWE-77/new/ast.
Loading data from ../data/data/tem_CWE-77/new/cdfg.
Restoring best model state from trained_model/GGNN_GraphBinaryClassification__2020-11-30_10-44-23_best.pkl.
CP_test Accuracy = 0.942|precision = 0.893 | recall = 1.000 | f1 = 0.943
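For readers checking the numbers, F1 is the harmonic mean of precision and recall. The short sketch below recomputes it from the precision and recall values printed in the two log lines above. It uses the standard definition and is not taken from the project's evaluation code.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Values copied from the NoneCP_test and CP_test log lines.
    print(round(f1_score(0.846, 1.000), 3))  # 0.917
    print(round(f1_score(0.893, 1.000), 3))  # 0.943
```

Both results agree with the `f1` values reported in the logs.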
Tuning
We use NNI (Neural Network Intelligence) for tuning in this project.
$ pip install nni
Add a search_space.json file under the working directory and list the parameters to be tuned. The project already includes a configured file.
search_space.json
{
"max_nodes_per_batch":{ "_type": "choice", "_value": [32,64,128]},
"gnn_hidden_dim":{ "_type": "choice", "_value": [4,8,16,...]},
"gnn_num_layers": { "_type": "choice", "_value": [2,4,8,...] },
"graph_aggregation_num_heads":{ "_type": "choice", "_value": [4,8,16,32,...] },
"graph_aggregation_hidden_layers":{ "_type": "choice", "_value": [32,64,128,256,...] },
"graph_aggregation_dropout_rate":{ "_type": "choice", "_value": [0.1,0.2,0.5,...] },
"learning_rate": { "_type": "choice", "_value": [0.01,0.001,0.0001,...] }
}
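A search space in this shape can also be generated from Python, which avoids hand-editing braces. The sketch below is only an illustration: it uses the `choice` type shown above and only candidate values that are explicitly listed there. Where the original lists are truncated with `...`, the values here are just that visible prefix and should be replaced with your own. The helper names `choice` and `render` are hypothetical.

```python
import json
from typing import Any, Dict, Sequence


def choice(values: Sequence[Any]) -> Dict[str, Any]:
    """Build one NNI-style 'choice' entry from a list of candidate values."""
    if not values:
        raise ValueError("a choice entry needs at least one candidate value")
    return {"_type": "choice", "_value": list(values)}


def render(space: Dict[str, Dict[str, Any]]) -> str:
    """Serialize the search space as indented JSON text."""
    return json.dumps(space, indent=2)


# Only values visible in the example above; extend the lists as needed.
search_space = {
    "max_nodes_per_batch": choice([32, 64, 128]),
    "gnn_num_layers": choice([2, 4, 8]),
    "learning_rate": choice([0.01, 0.001, 0.0001]),
}

if __name__ == "__main__":
    with open("search_space.json", "w", encoding="utf-8") as handle:
        handle.write(render(search_space))
```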
Define the configuration file in YAML format, which declares the search space and the path of the trial file. It also provides other information, such as the parameters of the whole algorithm, the maximum number of trials and the maximum duration.
config.yml
authorName: NNI Example
experimentName: CWE-77
trialConcurrency: 1
maxExecDuration: 110h # max executable time
maxTrialNum: 500 # max trial num
trainingServicePlatform: local
searchSpacePath: search_space.json # path of search space
useAnnotation: false
tuner:
  builtinTunerName: TPE
  classArgs:
    optimize_mode: maximize # choices: maximize, minimize
  gpuIndices: "1" # specify GPU of optimizer
trial:
  command: python3 train.py GGNN GraphBinaryClassification ../data/data/CWE-77 --patience 100 # execute commands
  codeDir: .
  gpuNum: 0
logDir: ~/nni # log directory
localConfig:
  gpuIndices: "0" # specify GPU number
  useActiveGpu: true
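During an experiment, each trial asks NNI for one sampled configuration and reports a final metric back to the tuner. The sketch below shows one way a training script could overlay the sampled values on its default hyperparameters. It is an assumption about how the pieces fit together, not a copy of the project's train.py. `merge_hypers` and `next_trial_parameters` are hypothetical helpers, while `nni.get_next_parameter()` and `nni.report_final_result()` are NNI's trial API calls.

```python
from typing import Any, Dict


def merge_hypers(defaults: Dict[str, Any], tuned: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the defaults with tuned values overriding them."""
    merged = dict(defaults)  # copy so the caller's defaults stay untouched
    merged.update(tuned)
    return merged


def next_trial_parameters() -> Dict[str, Any]:
    """Fetch the parameters sampled for this trial, or {} without NNI."""
    try:
        import nni  # only needed when the script runs as an NNI trial
    except ImportError:
        return {}
    return nni.get_next_parameter() or {}


if __name__ == "__main__":
    # Default values copied from the parameter dump in the Results section.
    defaults = {"learning_rate": 0.001, "gnn_num_layers": 5, "gnn_hidden_dim": 256}
    hypers = merge_hypers(defaults, next_trial_parameters())
    print(hypers)
    # After training, a trial would report its score to the tuner, e.g.:
    #   nni.report_final_result(best_f1)
```

With `optimize_mode: maximize`, the reported value should be a metric where higher is better, such as F1.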
Run NNI
nnictl create --config config.yml --port 8080
Wait for the output INFO: Successfully started experiment! in the command line. This message indicates that the experiment has been successfully started.
For more details, see https://github.com/Microsoft/nni
Data collection module
Collection Structure
├── EnsembleLearning.py
├── InputData_New.py
├── stopwords.txt
└── sample.zip
Ready for training
- Download our pretrained w2v model here
- We also provide a sample dataset, sample.zip; unzip it before use
Prepare data
- You can extract features from commits, or just use our sample.zip
Train your own ensemble classifier
- Use EnsembleLearning.py to train your own ensemble model
Warning: Replace the path with your own data path.
python EnsembleLearning.py
License
Distributed under the NISL License. See LICENSE for more information.
Contact
Huanting Wang - [email protected]
Citation
@ARTICLE{Wang2020FUNDED,
author = {H. {Wang} and G. {Ye} and Z. {Tang} and S. H. {Tan} and S. {Huang} and D. {Fang} and Y. {Feng} and L. {Bian} and Z. {Wang}},
journal = {IEEE Transactions on Information Forensics and Security},
title = {Combining Graph-Based Learning With Automated Data Collection for Code Vulnerability Detection},
year = {2021},
volume = {16},
pages = {1943-1958},
doi = {10.1109/TIFS.2020.3044773},
ieeeid = {9293321},
publisher = {IEEE},
keywords = {Software Vulnerability, Code Vulnerability Detection, Deep Learning, Deep Graph Neural Networks},}