Code and resources for the paper: "Neural Reverse Engineering of Stripped Binaries using Augmented Control Flow Graphs"

Neural Reverse Engineering of Stripped Binaries using Augmented Control Flow Graphs

This is the official implementation of Nero-GNN, the prototype described in: Yaniv David, Uri Alon, and Eran Yahav, "Neural Reverse Engineering of Stripped Binaries using Augmented Control Flow Graphs", to appear in OOPSLA 2020 (PDF).

Our evaluation dataset and other resources are available here (Zenodo). These will be used and further explained next.

An overview of the data-gen process

Requirements

Data Generation Specific Requirements

  • python3.8
  • LLVM version 10 and the llvmlite & llvmcpy python packages (other versions might work. 3.x will not).
  • IDA-PRO (tested with version 6.95).
  • angr, and the simuvex package.
  • A few more python packages: scandir, tqdm, jsonpickle, parmap, python-magic, pyelftools, setproctitle.

Using a licensed IDA-PRO installation for Linux, all of these requirements were verified as compatible for running on an Ubuntu 20 machine (and with some more effort even on Ubuntu 16).

For Ubuntu 20, you can use the requirements.txt file in this repository to install all python packages against the native python3.8 version:

pip3 install -r requirements.txt

LLVM version 10 can be installed with:

sudo apt-get install llvm-10

The IDA-python scripts (in datagen/ida/py2) were tested against the python 2.7 version bundled with IDA-PRO 6.95, and should work with newer versions at least up to 7.4 (more info here). Please file a bug if they don't.

The jsonpickle python package also needs to be installed for use by this bundled python version:

  1. Download the package:
wget https://files.pythonhosted.org/packages/32/d5/2f47f03d3f64c31b0d7070b488274631d7567c36e81a9f744e6638bb0f0d/jsonpickle-0.9.6.tar.gz
  2. Extract only the package sources:
tar -xvf jsonpickle-0.9.6.tar.gz jsonpickle-0.9.6/jsonpickle/
  3. Move it to the IDA-PRO python directory:
mv jsonpickle-0.9.6/jsonpickle /opt/ida-6.95/idal64/python/

Note that, when installed as root, IDA-PRO defaults to installing in /opt/ida-6.95/idal64. Other paths will require adjusting here and in other scripts.

Neural Model Specific Requirements

  • python3.6. (For using the same Ubuntu 20 machine for training and data generation we recommend using virtualenv)
  • These two python packages: jsonpickle, scipy
  • TensorFlow 1.13.1 (install) or using:
pip install tensorflow-gpu==1.13.1 # for the GPU version

Note that CUDA >= 10.1 is required for tensorflow-gpu version 1.13 and above. (See this link for more information)

or:

pip install tensorflow==1.13.1 # for the CPU version

To check existing TensorFlow version, run:

python3 -c 'import tensorflow as tf; print(tf.__version__)'

Generating Representations for Binary Procedures

Our binaries dataset was created by compiling several GNU source-code packages into binary executables and performing a thorough cleanup and deduplication process (detailed in our paper).

The packages are split into three sets: training, validation and test (each in its own directory in the extracted archive: TRAIN, VALIDATE & TEST resp.).

To obtain preprocessed representations for these binaries you can either download our preprocessed dataset, or create a new dataset from our or any other binaries dataset.

Creating Representations

Indexing

Indexing, i.e., analyzing the binaries and creating representations for them based on augmented control flow graphs, is performed using:

python3 -u index_binaries.py --input-dir TRAIN --output-dir TRAIN_INDEXED

where TRAIN is the directory holding the binaries to index, and results are placed in TRAIN_INDEXED.

To index successfully, binaries must contain debug information and adhere to this file name structure:

<compiler>-<compiler version>__O<Optimization level(u for default)>__<Package name>[-<optional package version>]__<Executable name> 

For example: "gcc-5__Ou__cssc__sccs".
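The naming convention above can be checked before starting a long indexing run. The following is a minimal sketch and not part of the repository: the function name parse_binary_name and the returned field names are hypothetical, and it assumes that the four parts are separated by double underscores exactly as in the pattern. The optional package version is left attached to the package name, on the assumption that package names may themselves contain hyphens.

```python
from typing import Dict


def parse_binary_name(file_name: str) -> Dict[str, str]:
    """Split a binary file name into the parts of the naming pattern.

    Hypothetical helper for illustration; it is not part of the repository.
    """
    # At most 3 splits, so any further "__" stays in the executable name.
    parts = file_name.split("__", 3)
    if len(parts) != 4:
        raise ValueError(f"expected 4 '__'-separated parts in {file_name!r}")
    toolchain, optimization, package, executable = parts

    # "<compiler>-<compiler version>", e.g. "gcc-5".
    compiler, sep, compiler_version = toolchain.rpartition("-")
    if not sep or not compiler or not compiler_version:
        raise ValueError(f"expected '<compiler>-<version>', got {toolchain!r}")

    # "O<level>", e.g. "Ou", where "u" stands for the default level.
    if not optimization.startswith("O") or len(optimization) < 2:
        raise ValueError(f"expected 'O<level>', got {optimization!r}")

    return {
        "compiler": compiler,
        "compiler_version": compiler_version,
        "optimization": optimization[1:],
        "package": package,  # may still carry "-<package version>"
        "executable": executable,
    }
```

For "gcc-5__Ou__cssc__sccs" this yields the compiler gcc, version 5, optimization level u, package cssc and executable sccs.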

Some notes on the indexing process and its results:

  1. The indexing process might take several hours. We recommend running it on a machine with multiple CPU-cores and adequate RAM.
  2. The number of procedures created might depend on the timeout value selected for procedure indexing (controlled by --index-timeout with the default of 30 minutes).
  3. Procedures containing features not supported by the indexing engine (e.g., vector operations) or CFGs with more than 1000 unique CFG paths will not be indexed.
  4. The created representations might have some minor discrepancies when compared with those published in Zenodo. These include JSON field ordering and formatting. These discrepancies are the result of porting this prototype to Python3 towards its publication.
  5. To change the path to the IDA-PRO installation use --idal64-path.

Filter and collect

Next, to filter and collect all the indexed procedures into one JSON file:

python3 -u collect_and_filter.py --input-dir TRAIN_INDEXED --output-file=train.json

This will filter and collect indexed procedures from TRAIN_INDEXED (which should hold the indexed binaries for training from the last step) and store them in train.json.

Preprocess for use by the model

Finally, to preprocess raw representations, preparing them for use by the neural model, use:

python3 preprocess.py -trd train.json -ted test.json -vd validation.json -o data

This will preprocess the training (train.json), validation (validation.json) and test (test.json) files. Note that this step requires TensorFlow and the other components mentioned here.
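The three steps (indexing, filter and collect, preprocess) have to be repeated for each set before the final preprocessing call. The sketch below only assembles the command lines shown above and does not run them. The function name build_datagen_commands is hypothetical, and the VALIDATE_INDEXED and TEST_INDEXED directory names are assumptions extrapolated from TRAIN_INDEXED.

```python
import shlex
from typing import List, Mapping

# Set directory -> collected JSON file, following the names used above.
SETS: Mapping[str, str] = {
    "TRAIN": "train.json",
    "VALIDATE": "validation.json",
    "TEST": "test.json",
}


def build_datagen_commands(
    sets: Mapping[str, str] = SETS, out_prefix: str = "data"
) -> List[List[str]]:
    """Return the data-generation commands in the order they should run.

    Hypothetical helper; only flags that appear in this README are used.
    """
    commands: List[List[str]] = []
    for set_dir, json_file in sets.items():
        indexed_dir = f"{set_dir}_INDEXED"  # assumed naming for all sets
        commands.append(
            ["python3", "-u", "index_binaries.py",
             "--input-dir", set_dir, "--output-dir", indexed_dir]
        )
        commands.append(
            ["python3", "-u", "collect_and_filter.py",
             "--input-dir", indexed_dir, f"--output-file={json_file}"]
        )
    # Preprocessing needs all three collected files at once.
    commands.append(
        ["python3", "preprocess.py",
         "-trd", sets["TRAIN"], "-ted", sets["TEST"],
         "-vd", sets["VALIDATE"], "-o", out_prefix]
    )
    return commands


if __name__ == "__main__":
    for command in build_datagen_commands():
        print(shlex.join(command))  # shlex.join requires Python 3.8+
```

Printing the commands first makes it easy to review them, or to pass each list to subprocess.run once the paths have been checked.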

Using Prepared Representations

The procedure representations for the binaries in our dataset can be found in this archive.

Extracting the procedure representations archive will create the folder procedure_representations and inside it two more folders:

  1. raw: The raw representations for all the binary procedures in the above dataset. Each procedure is represented by one line in the relevant file for each set (training.json, validation.json and test.json)
  2. preprocessed: The raw procedure representations preprocessed for training.
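Because each procedure occupies one line of a raw file, the files can be streamed without loading them whole. This is a minimal sketch under that assumption; the function name iter_procedures is hypothetical, and the structure of each record is not interpreted here.

```python
import json
from typing import Any, Iterator, Tuple


def iter_procedures(path: str) -> Iterator[Tuple[int, Any]]:
    """Yield (1-based line number, parsed JSON value) per non-empty line.

    Hypothetical helper; assumes one JSON document per line.
    """
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if line.strip():
                yield line_number, json.loads(line)
```

Counting the procedures in a set is then `sum(1 for _ in iter_procedures(path))`, and the line number makes it possible to pick out a single record.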

The preprocessed directory contains:

  1. Files for training the model: data.dict and data.train (the dictionary and preprocessed training set samples accordingly)
  2. data.val - The (preprocessed) validation set samples.
  3. data.test - The (preprocessed) test set samples.

Predicting Procedure Names Using Neural Models

As we show in our paper, Nero-GNN is the best variation of our approach, and so we focus on and showcase it here.

Training From Scratch

Training a Nero-GNN model is performed by running the following command line:

python3 -u gnn.py --data procedure_representations/preprocessed/data \
--test procedure_representations/preprocessed/data.val --save new_model/model \
--gnn_layers NUM_GNN_LAYERS

Where NUM_GNN_LAYERS is the number of GNN layers. In the paper, we found NUM_GNN_LAYERS=4 to perform best. The paths to the (training) --data and (validation) --test arguments can be changed to point to a new dataset. Here, we provide the dataset that we used in the paper.

We trained our models using a Tesla V100 GPU. Other GPUs might require changing the number of GNN layers or other dimensions to fit into the available RAM.

Using Pre-Trained Models

Trained models are available in this archive. Extracting it will create the gnn directory composed of:

  1. Trained model (the dictionaries.bin & model_iter495.* files, storing the 495th training iteration)
  2. Training log.
  3. Prediction results log.

Evaluation

Evaluation of a trained model is performed using the following command line:

python3 -u gnn.py --test procedure_representations/preprocessed/data.test \
--load gnn/model_iter495 \
--gnn_layers NUM_GNN_LAYERS

This assumes that model_iter495 is the checkpoint that performed best on the validation set during training (this is the case for the provided trained model). The value of NUM_GNN_LAYERS should be the same as in training.

Additional Flags

  • Use the --no_arg flag during training and testing, to train a "no-values" model (as in Table 4 in our paper)
  • Use the --no_api flag during training and testing, to train an "obfuscated" model (as in Table 2 in our paper) - a model that does not use the API names (assuming they are obfuscated).
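Since these flags must match between training and testing, it can help to build both command lines from one place. The sketch below is only an illustration: build_gnn_command is a hypothetical helper, and it uses nothing beyond the gnn.py flags shown in this README.

```python
from typing import List, Optional


def build_gnn_command(
    gnn_layers: int,
    test: str,
    data: Optional[str] = None,
    save: Optional[str] = None,
    load: Optional[str] = None,
    no_arg: bool = False,
    no_api: bool = False,
) -> List[str]:
    """Assemble a gnn.py command line (hypothetical helper).

    Pass data and save for training, or load for evaluation.
    """
    command = ["python3", "-u", "gnn.py"]
    if data is not None:
        command += ["--data", data]
    command += ["--test", test]
    if save is not None:
        command += ["--save", save]
    if load is not None:
        command += ["--load", load]
    command += ["--gnn_layers", str(gnn_layers)]
    # The same flags should be set for training and for testing.
    if no_arg:
        command.append("--no_arg")
    if no_api:
        command.append("--no_api")
    return command
```

Keeping the flag values in one variable and passing them to both calls avoids evaluating an "obfuscated" model without --no_api by mistake.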

Understanding the Prediction Process and Its Results

This section provides a name prediction walk-through for an example from our test set (further explained here). For readability, we start straight from the graph representation (similar to the one depicted in Fig.2(c) in our paper) and skip the rest of the steps.

The get_tz procedure from the find executable is part of the findutils package. This procedure is represented as a json found at line 1715 in procedure_representations/raw/test.json.

This json can be pretty-printed by running:

awk 'NR==1715' procedure_representations/raw/test.json | python3 -m json.tool

This json represents the procedure's graph:

  • The graph nodes are basic blocks named ob<x> (where x is a number with an optional postfix, e.g., initialize).
  • The json contains data regarding edges between the nodes, and the abstracted call sites in each node.
  • In this json we see that the first node, ob-1.initialize, contains a call to the External API getenv, marked by Egetenv. This API call is made with an argument that was resolved to the concrete string TZ.
  • Other calls to memcpy and strlen are made with the abstract value CONST as their argument. This CONST abstract value is called STK in the paper (see page 14).
  • Note that Nxmemdup is a Normal (internal) call. This name was taken from the debug information and kept here for our debugging purposes and is stripped again before being used in the training/prediction steps.

The name prediction for this procedure by our Nero-GNN can be found in line 876 of the model's prediction log file:

head -n 1 gnn_model/predictions_iter495_F1_45.5.txt  && awk 'NR==876' gnn_model/predictions_iter495_F1_45.5.txt

Which results in:

PredCode,package,Original,Predicted,Thrown
[+],get*tz@find@findutils,get*tz,get*tz,['BLANK'; 'BLANK'; 'BLANK'; 'BLANK']

This line starts with the prediction code: +, ± or - for a full, partial, or no match, respectively. Then the procedure information, ground truth, prediction and truncated sub-tokens follow:

  • Procedure name: get_tz
  • Executable: find
  • Package name: findutils
  • Ground truth procedure name: ['get', 'tz']
  • Model's prediction: ['get', 'tz']
  • Truncated suffix subtokens: ['BLANK', 'BLANK', 'BLANK', 'BLANK']
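The field breakdown above can be reproduced programmatically. This is a minimal sketch inferred from the single sample line shown here, not from the code that writes the log: parse_prediction_line and the dictionary keys are hypothetical, and it assumes that no field other than the last contains a comma.

```python
from typing import Dict, List, Union


def parse_prediction_line(line: str) -> Dict[str, Union[str, List[str]]]:
    """Split one prediction-log line into its fields (hypothetical helper)."""
    # Five columns: PredCode, package, Original, Predicted, Thrown.
    code, procedure, original, predicted, thrown = line.rstrip("\n").split(",", 4)

    # "<procedure>@<executable>@<package>"
    name, executable, package = procedure.rsplit("@", 2)

    # "['BLANK'; 'BLANK']" -> ["BLANK", "BLANK"]
    thrown_tokens = [
        token.strip().strip("'")
        for token in thrown.strip().strip("[]").split(";")
        if token.strip()
    ]

    return {
        "code": code.strip("[]"),
        "procedure": name,
        "executable": executable,
        "package": package,
        "original": original.split("*"),    # sub-tokens are joined by "*"
        "predicted": predicted.split("*"),
        "thrown": thrown_tokens,
    }
```

Applied to the sample line, the ground truth and the prediction both come out as ['get', 'tz'], and the thrown list holds four BLANK sub-tokens.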

Note that we truncate the prediction after the first BLANK or UNKNOWN sub-token prediction.
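The truncation rule can be sketched as follows. This is an illustration and not the repository's code: truncate_prediction is a hypothetical name, and the literal spellings "BLANK" and "UNKNOWN" are assumed to match the sub-tokens the model emits.

```python
from typing import List, Sequence, Tuple

# Assumed spellings of the two special sub-tokens.
STOP_TOKENS = frozenset({"BLANK", "UNKNOWN"})


def truncate_prediction(subtokens: Sequence[str]) -> Tuple[List[str], List[str]]:
    """Split a predicted sequence into (kept prefix, thrown suffix).

    The suffix starts at the first BLANK or UNKNOWN sub-token.
    """
    for index, token in enumerate(subtokens):
        if token in STOP_TOKENS:
            return list(subtokens[:index]), list(subtokens[index:])
    return list(subtokens), []
```

For a raw prediction of get, tz followed by four BLANK sub-tokens, the kept prefix is ['get', 'tz'] and the thrown suffix holds the four BLANK entries, which is consistent with the log line above.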

Citation

Neural Reverse Engineering of Stripped Binaries using Augmented Control Flow Graphs

@article{
    David2020,
    title = {Neural Reverse Engineering of Stripped Binaries Using Augmented Control Flow Graphs},
    author = {David, Yaniv and Alon, Uri and Yahav, Eran},
    doi = {10.1145/3428293},
    journal = {Proceedings of the ACM on Programming Languages},
    number = {OOPSLA},
    volume = {4},
    year = {2020}
}
