Devign
Implementation of Devign Model in Python with code for processing the dataset and generation of Code Property Graphs.
This project is under development. For now, just the Abstract Syntax Tree is considered for the graph embedding of code and model training.
Getting Started
These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.
Prerequisites
Install the necessary dependencies before running the project:
Software:
Python Libraries:
- Pandas (>=1.0.1)
- scikit-learn (>=0.22.2)
- PyTorch (>=1.4.0)
- PyTorch Geometric (>=1.4.2)
- Gensim (>=3.8.1)
- cpgclientlib (>=0.11.111)
Notes
These notes might save you some time:
- Changes to the configs.json structure need to be reflected in the configs.py script.
- PyTorch Geometric has several dependencies that need to match, including PyTorch. Follow the installation steps on their website.
- Joern processing might be slow and can even freeze your OS, depending on your system's specifications. Choose a smaller size for the chunks that are processed when splitting the dataset during the Create task. That can be done by changing the "slice_size" value under "create" in the configurations file configs.json.
- In the "slice_size" file, the nodes are filtered and discarded if the size is greater than the limit configured.
- When changing the number of nodes considered for processing, "nodes_dim" under "embed" needs to match "in_channels", under "devign" -> "model" -> "conv_args" -> "conv1d_1".
- The embedding size is equal to Word2Vec vector size plus 1.
- When executing the Create task, a directory named joern is created and deleted automatically under 'project'\data\.
- The dataset split for modeling during the Process task is done in src/data/datamanger.py. The sets are balanced and the train/val/test ratios are 0.8/0.1/0.1 respectively.
- The script graph-for-funcs.sc queries the CPG graphs from Joern. That script has a minor change that makes it possible to trace the generated CPGs back to their files. The last time it was run it was failing because dependencies in Joern had changed and it needed the updated version, which I assume you can find in Joern's latest version. I suggest you look at issue #3. The CPGs are saved in a JSON file: see the function "joern_create", line 48, which prints the CPG created in Joern to a JSON file with ... .toString() |> "{json_out}". That file is then processed by the function "json_process". Both functions are in the file devign/src/prepare/cpg_generator.py. If you have trouble creating the CPG JSON file with Joern, I advise you to try the same steps manually in Joern: create a new project pointing to the dataset folder containing all the files, query the CPG with the built-in graph-for-funcs.sc script, then export it to a file with .toString() |>. Joern commands are quite easy to understand and there is good support on Gitter. Also, follow the commit to understand the changes I've previously made.
- Tested on Ubuntu 18.04/19.04
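The dimension constraints in the notes above can be checked before training. The following is a minimal sketch, assuming the configuration has already been loaded into a dictionary with the key layout described in the notes; the function names and the value 205 are hypothetical and only serve the illustration.

```python
def check_dimensions(configs: dict) -> list:
    """Return a list of problems found; an empty list means the sizes agree."""
    problems = []
    nodes_dim = configs["embed"]["nodes_dim"]
    conv1d_1 = configs["devign"]["model"]["conv_args"]["conv1d_1"]
    if nodes_dim != conv1d_1["in_channels"]:
        problems.append(
            f'"nodes_dim" ({nodes_dim}) does not match '
            f'"in_channels" ({conv1d_1["in_channels"]})'
        )
    return problems


def expected_embedding_size(w2v_vector_size: int) -> int:
    """Embedding size is the Word2Vec vector size plus 1, as stated in the notes."""
    return w2v_vector_size + 1


# Illustrative values only, not taken from the project's configs.json.
example = {
    "embed": {"nodes_dim": 205},
    "devign": {"model": {"conv_args": {"conv1d_1": {"in_channels": 205}}}},
}
print(check_dimensions(example))
print(expected_embedding_size(100))
```

Running a check like this after editing configs.json may catch a mismatch earlier than a shape error raised in the middle of training.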
Setup
For now this project is not pip installable. That will be implemented once there are proper use cases for it.
This section gives the steps, explanations and examples for getting the project running.
1) Clone this repo
$ git clone https://github.com/epicosy/devign/devign.git
2) Install Prerequisites
3) Configure the project
Verify you have the correct directory structure by matching it with the "paths" in the configurations file configs.json.
The dataset related files that are generated are saved under those paths.
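The directory check in this step could be automated along these lines. This is only a sketch: it assumes that "paths" in configs.json maps a name to a directory path relative to the project root, which may not match the actual layout of the file, and missing_paths is a hypothetical helper that is not part of the project.

```python
import json
from pathlib import Path


def missing_paths(config_file, project_root="."):
    """Return the names under "paths" whose directories do not exist."""
    with open(config_file, encoding="utf-8") as handle:
        configs = json.load(handle)
    root = Path(project_root)
    return sorted(
        name
        for name, relative in configs.get("paths", {}).items()
        if not (root / relative).is_dir()
    )
```

Calling missing_paths("configs.json") from the project root would then list the entries that still need a directory created.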
4) Joern
This step is only necessary for the Create task.
Follow the instructions on Joern's documentation page and install Joern's command line tools under 'project'\joern\joern-cli\.
Structure
├── LICENSE
├── README.md                        <- The top-level README for developers using this project.
├── data
│   ├── cpg                          <- Dataset with CPGs.
│   ├── input                        <- Canonical dataset for modeling.
│   ├── model                        <- Trained models.
│   ├── raw                          <- The original, immutable data dump.
│   ├── tokens                       <- Tokens dataset files generated from the raw data functions.
│   └── w2v                          <- Word2Vec model files for initial embeddings.
│
├── joern
│   ├── joern-cli                    <- Joern command line tools for creating and analyzing code property graphs.
│   └── graphs-for-funcs.sc          <- Script that returns, in JSON format, the AST, CFG, and PDG for each method
│                                       contained in the loaded CPG.
│
├── src                              <- Source code for use in this project.
│   ├── __init__.py                  <- Makes src a Python package.
│   │
│   ├── data                         <- Data handling scripts.
│   │   ├── __init__.py              <- Makes data a Python package.
│   │   └── datamanger.py            <- Module for the most essential operations on the dataset.
│   │
│   ├── prepare                      <- Package for CPG generation and representation.
│   │   ├── __init__.py              <- Makes prepare a Python package.
│   │   ├── cpg_client_wrapper.py    <- Simple class wrapper for the CpgClient that interacts with the Joern REST server.
│   │   ├── cpg_generator.py         <- Ad-hoc script for creating CPGs with Joern and processing the results.
│   │   └── embeddings.py            <- Module that embeds the graph nodes into node features.
│   │
│   ├── process                      <- Scripts for modeling and predictions.
│   │   ├── __init__.py              <- Makes process a Python package.
│   │   ├── devign.py                <- Module that implements the devign model.
│   │   ├── loader_step.py           <- Module for one epoch iteration over the dataset.
│   │   ├── model.py                 <- Module that implements the devign neural network.
│   │   ├── modeling.py              <- Module for training the model and making predictions.
│   │   ├── step.py                  <- Module that performs a forward step on a batch for the train/val/test loop.
│   │   └── stopping.py              <- Module that performs early stopping.
│   │
│   │
│   └── utils                        <- Package with helper components like functions and classes, used across
│       │                               the project.
│       ├── __init__.py              <- Makes utils a Python package.
│       ├── log.py                   <- Module for logging module messages.
│       ├── functions                <- Auxiliary functions for processing.
│       │   ├── __init__.py          <- Makes functions a Python package.
│       │   ├── cpg.py               <- Module with auxiliary functions for CPGs.
│       │   ├── digraph.py           <- Module for creating digraphs from nodes.
│       │   └── parase.py            <- Module for parsing source code into tokens.
│       │
│       └── objects                  <- Auxiliary data classes with basic methods.
│           ├── __init__.py          <- Makes objects a Python package.
│           ├── cpg                  <- Auxiliary data classes for representing and handling the JSON graphs.
│           │   ├── __init__.py
│           │   ├── ast.py
│           │   ├── edge.py
│           │   ├── function.py
│           │   ├── node.py
│           │   └── properties.py
│           │
│           ├── input_dataset.py     <- Custom wrapper for Torch Dataset.
│           ├── metrics.py           <- Module for evaluating the results.
│           └── stats.py             <- Module for handling raw results.
│
│
├── configs.py                       <- Configuration management script.
├── configs.json                     <- Project configurations used by main.py.
└── main.py                          <- Main script file that joins the modules into executable tasks.
Usage
Dataset
The dataset used is the partial dataset released by the authors.
The dataset is handled with Pandas and the file src/data/datamanger.py contains wrapper functions for the most essential operations.
A small sample of 994 entries from the original dataset is available for testing purposes.
The sample dataset contains functions from the FFmpeg project with a maximum of 287 nodes per function.
For each task, the necessary dataset files are available under the respective folders.
For example, the datasets with the graphs constituting the CPG for the functions are available under data/cpg.
Fields
project | commit_id | target | func |
---|---|---|---|
FFmpeg | 973b1a6b9070e2bf17d17568cbaf4043ce931f51 | 0 | static av_cold int vdadec_init(AVCodecContext ... |
FFmpeg | 321b2a9ded0468670b7678b7c098886930ae16b2 | 0 | static int transcode(AVFormatContext **output_... |
FFmpeg | 5d5de3eba4c7890c2e8077f5b4ae569671d11cf8 | 0 | static void v4l2_free_buffer(void *opaque, uin... |
FFmpeg | 32bf6550cb9cc9f487a6722fe2bfc272a93c1065 | 0 | int ff_get_wav_header(AVFormatContext *s, AVIO... |
FFmpeg | 57d77b3963ce1023eaf5ada8cba58b9379405cc8 | 0 | int av_opencl_buffer_write(cl_mem dst_cl_buf, ... |
... | ... | ... | ... |
qemu | 1ea879e5580f63414693655fcf0328559cdce138 | 0 | static int no_init_in (HWVoiceIn *hw, audsetti... |
qemu | f74990a5d019751c545e9800a3376b6336e77d38 | 0 | uint32_t HELPER(stfle)(CPUS390XState *env, uin... |
qemu | a89f364ae8740dfc31b321eed9ee454e996dc3c1 | 0 | static void pxa2xx_fir_write(void *opaque, hwa... |
qemu | 39fb730aed8c5f7b0058845cb9feac0d4b177985 | 0 | static void disas_thumb_insn(CPUARMState *env,... |
FFmpeg | 7104c23bd1a1dcb8a7d9e2c8838c7ce55c30a331 | 0 | static void rv34_pred_mv(RV34DecContext *r, in... |
Baseline "main.py"
The script main.py contains functions that put the modules together into executable tasks for the baseline approach.
It can be used as an example for building custom functionality.
The basic baseline transforms the dataset into the input for the model, then proceeds with its training and evaluation.
The tasks that compose it are Create, Embed and Process.
$ python main.py -c -e -p
For each task, verify that the correct files are in the respective folders. For example, executing the Process task requires the input datasets that contain the embedded graphs with the associated labels.
Create Task
This is the first task where the dataset is filtered (optionally) and augmented with a column that
contains the respective Code Property Graph (CPG).
Functions in the dataset are written to files in a target directory, which Joern is then pointed at to create the CPG.
After the CPG creation, Joern is queried with the script "graph-for-funcs.sc" which creates the graphs from the CPG.
Those are returned in JSON format, containing all the functions with the respective AST, CFG and PDG graphs.
Execute with:
$ python main.py -c
Filtering the dataset can be done with data.apply_filter(raw: pandas.DataFrame, select: callable) in the create_task function.
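The file-writing step of the Create task, combined with the "slice_size" chunking mentioned in the notes, can be sketched roughly as follows. This is not the project's implementation: slices and write_functions are hypothetical helpers, and the one-file-per-function naming scheme is an assumption.

```python
from pathlib import Path


def slices(items, slice_size):
    """Yield consecutive chunks of at most slice_size items."""
    for start in range(0, len(items), slice_size):
        yield items[start:start + slice_size]


def write_functions(functions, out_dir):
    """Write (index, source_code) pairs to <out_dir>/<index>.c and return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, code in functions:
        path = out_dir / f"{index}.c"
        path.write_text(code, encoding="utf-8")
        paths.append(path)
    return paths
```

Each chunk would be written to its own directory and handed to Joern separately, so a smaller "slice_size" bounds how much code Joern has to load at once.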
Embed Task
This task transforms the source code functions into tokens, which are used to generate and train the word2vec model for the initial embeddings. The node embeddings are computed as explained in the paper, for now just for the AST.
Execute with:
$ python main.py -e
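To make the "Word2Vec vector size plus 1" rule concrete, here is a minimal sketch of how a node feature could be assembled. It assumes the extra dimension holds a numeric node-type id and that the token vectors of a node are averaged; both are assumptions for illustration, as are the helper names node_feature and pad_nodes.

```python
def node_feature(node_type_id, token_vectors, vector_size):
    """Concatenate a node-type id with the mean of the node's token vectors."""
    if token_vectors:
        mean = [sum(column) / len(token_vectors) for column in zip(*token_vectors)]
    else:
        mean = [0.0] * vector_size  # node without code tokens
    return [float(node_type_id)] + mean


def pad_nodes(features, nodes_dim, feature_size):
    """Pad a function's node features with zero rows up to nodes_dim.

    Returns None when the function has more nodes than the limit,
    mirroring the note that such functions are discarded.
    """
    if len(features) > nodes_dim:
        return None
    padding = [[0.0] * feature_size for _ in range(nodes_dim - len(features))]
    return features + padding
```

With a Word2Vec vector size of 100, each row produced this way has 101 entries, and every kept function ends up with exactly "nodes_dim" rows.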
Tokenization example
Source code:
'static void v4l2_free_buffer(void *opaque, uint8_t *unused)
{
    V4L2Buffer* avbuf = opaque;
    V4L2m2mContext *s = buf_to_m2mctx(avbuf);
    if (atomic_fetch_sub(&avbuf->context_refcount, 1) == 1) {
        atomic_fetch_sub_explicit(&s->refcount, 1, memory_order_acq_rel);
        if (s->reinit) {
            if (!atomic_load(&s->refcount))
                sem_post(&s->refsync);
        } else if (avbuf->context->streamon)
            ff_v4l2_buffer_enqueue(avbuf);
        av_buffer_unref(&avbuf->context_ref);
    }
}
'
Tokens: ['static', 'void', 'FUN1', '(', 'void', '*', 'VAR1', ',', 'uint8_t', '*', 'VAR2)', '{', 'VAR3', '*', 'VAR4', '=', 'VAR1', ';', 'V4L2m2mContext', '*', 'VAR5', '=', 'FUN2', '(', 'VAR4)', ';', 'if', '(', 'FUN3', '(', '&', 'VAR4', '-', '>', 'VAR6', ',', '1)', '==', '1)', '{', 'FUN4', '(', '&', 'VAR5', '-', '>', 'VAR7', ',', '1', ',', 'VAR8)', ';', 'if', '(', 'VAR5', '-', '>', 'VAR9)', '{', 'if', '(', '!', 'FUN5', '(', '&', 'VAR5', '-', '>', 'VAR7))', 'FUN6', '(', '&', 'VAR5', '-', '>', 'VAR10)', ';', '}', 'else', 'if', '(', 'VAR4', '-', '>', 'VAR11', '-', '>', 'VAR12)', 'FUN7', '(', 'VAR4)', ';', 'FUN8', '(', '&', 'VAR4', '-', '>', 'VAR13)', ';', '}', '}']
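The idea behind the tokenization above, replacing user-defined names with numbered FUN and VAR placeholders, can be sketched with a small regex tokenizer. This is a simplified illustration and not the project's tokenizer: the keyword set is deliberately tiny, type names are not treated specially, and unlike the output above it keeps '->' as one token and splits ')' from the preceding name.

```python
import re

# Deliberately small keyword set; a real tokenizer would need the full C list.
C_KEYWORDS = {"static", "void", "int", "char", "const", "struct", "unsigned",
              "if", "else", "for", "while", "return"}

# Identifiers, integers, a few two-character operators, then any other symbol.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|->|==|!=|<=|>=|&&|\|\||\S")


def tokenize(source):
    """Split C code into tokens, renaming identifiers to FUNn / VARn."""
    raw = TOKEN_RE.findall(source)
    fun_names, var_names = {}, {}
    tokens = []
    for position, token in enumerate(raw):
        is_identifier = re.match(r"[A-Za-z_]", token) is not None
        if not is_identifier or token in C_KEYWORDS:
            tokens.append(token)
            continue
        # An identifier directly followed by "(" is treated as a function name.
        is_call = position + 1 < len(raw) and raw[position + 1] == "("
        table, prefix = (fun_names, "FUN") if is_call else (var_names, "VAR")
        if token not in table:
            table[token] = f"{prefix}{len(table) + 1}"
        tokens.append(table[token])
    return tokens


print(tokenize("static int add(int a, int b) { return a + b; }"))
```

Normalizing names this way keeps the Word2Vec vocabulary small, since project-specific identifiers collapse into a limited set of placeholders.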
Process Task
In this task the previously transformed dataset is split into train, validation and test sets, which are used to train and evaluate the model. The accuracy reported in the training output is softmax accuracy.
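A split of this kind can be sketched with the standard library alone. The version below reads "balanced" as splitting each class separately with the same 0.8/0.1/0.1 ratios; that reading, the function name stratified_split and the label_of callback are assumptions, and the project's own split lives in src/data/datamanger.py.

```python
import random


def stratified_split(rows, label_of, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split rows into train/val/test, applying the ratios to each class separately."""
    rng = random.Random(seed)
    by_label = {}
    for row in rows:
        by_label.setdefault(label_of(row), []).append(row)

    train, val, test = [], [], []
    for group in by_label.values():
        group = list(group)
        rng.shuffle(group)
        n_train = int(len(group) * ratios[0])
        n_val = int(len(group) * ratios[1])
        train += group[:n_train]
        val += group[n_train:n_train + n_val]
        test += group[n_train + n_val:]  # remainder goes to the test set
    return train, val, test
```

Splitting per class keeps the proportion of vulnerable and non-vulnerable functions roughly the same in all three sets.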
Execute with:
$ python main.py -p
Enable EarlyStopping for training with:
$ python main.py -pS
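Early stopping with a "patience" setting typically works as in the sketch below. This is a generic illustration, not the code in src/process/stopping.py; the class layout and the use of validation loss as the monitored value are assumptions.

```python
class EarlyStopping:
    """Signal a stop after `patience` consecutive epochs without improvement."""

    def __init__(self, patience=10, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss  # improvement: a checkpoint would be saved here
            self.bad_epochs = 0
            return False
        self.bad_epochs += 1
        return self.bad_epochs >= self.patience


stopper = EarlyStopping(patience=2)
for epoch, loss in enumerate([1.0, 0.9, 0.95, 0.97, 0.5], start=1):
    if stopper.step(loss):
        print(f"stopping at epoch {epoch}, best loss {stopper.best}")
        break
```

The last saved checkpoint then corresponds to the best epoch rather than the final one, which is why a run configured for many epochs can end with a checkpoint from an early epoch.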
Results
Train/Val/Test ratios: 0.8/0.1/0.1
Example results of training with early stopping on the sample dataset. Last model checkpoint at 5 epochs.
Parameters used:
- "learning_rate" : 1e-4
- "weight_decay" : 1.3e-6
- "loss_lambda" : 1.3e-6
- "epochs" : 100
- "patience" : 10
- "batch_size" : 8
- "dataset_ratio" : 1 (Total entries)
- "shuffle" : false
True Pos.: 37, False Pos.: 27, True Neg.: 22, False Neg.: 15
Accuracy: 0.5841584158415841
Precision: 0.578125
Recall: 0.7115384615384616
F-measure: 0.6379310344827586
Precision-Recall AUC: 0.5388430220841324
AUC: 0.5569073783359497
MCC: 0.166507096257419
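The threshold-based figures above follow directly from the confusion matrix, using the standard definitions. The sketch below recomputes them; it is not the project's metrics.py, and it assumes none of the denominators is zero. The two AUC values cannot be derived from the confusion matrix alone, so they are not covered.

```python
import math


def confusion_metrics(tp, fp, tn, fn):
    """Compute standard binary classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f_measure": 2 * precision * recall / (precision + recall),
        "mcc": (tp * tn - fp * fn)
        / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    }


# Counts from the early-stopping run reported above.
for name, value in confusion_metrics(tp=37, fp=27, tn=22, fn=15).items():
    print(f"{name}: {value}")
```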
Example results of training without early stopping on the sample dataset.
Parameters used:
- "learning_rate" : 1e-4
- "weight_decay" : 1.3e-6
- "loss_lambda" : 1.3e-6
- "epochs" : 30
- "patience" : 10
- "batch_size" : 8
- "dataset_ratio" : 1 (Total entries)
- "shuffle" : false
True Pos.: 38, False Pos.: 34, True Neg.: 15, False Neg.: 14
Accuracy: 0.5247524752475248
Precision: 0.5277777777777778
Recall: 0.7307692307692307
F-measure: 0.6129032258064515
Precision-Recall AUC: 0.5592493611149129
AUC: 0.5429748822605965
MCC: 0.04075331061223071
Error: 53.56002758457897
Roadmap
See the open issues for a list of proposed features (and known issues).
Authors
- Yaqin Zhou, Shangqing Liu, Jingkai Siow, Xiaoning Du, Yang Liu
- Initial work - Devign Paper, Node Representation and Datasets
License
Distributed under the MIT License. See LICENSE for more information.
Contact
Eduard Pinconschi - [email protected]
Acknowledgments
Guidance and ideas for some parts from: