Binary Code Similarity Analysis (BCSA) Benchmark

BinKit 2.0

BinKit is a binary code similarity analysis (BCSA) benchmark. BinKit provides scripts for building a cross-compiling environment, as well as the compiled dataset. The current dataset includes 1,904 distinct combinations of compiler options of 8 architectures, 6 optimization levels, and 23 compilers. It includes 371,928 binaries.

Compared with the paper version (BinKit 1.0), the latest version of BinKit adds support for newer compiler versions across the major compilation options, as well as support for the Ofast optimization option.

In particular, BinKit now includes GCC versions up to 11 and Clang versions up to 13. A total of 6 optimization options (O0, O1, O2, O3, Os, Ofast) are currently supported. See the Currently supported compile options section below for the full list.

In the BinKit 2.0 dataset, the gsl package is missing 8 binaries for the Ofast option because of compiler bugs. See the Missing binaries part of the Issues section for more information.

BinKit 1.0 (paper version)

The original dataset includes 1,352 distinct combinations of compiler options of 8 architectures, 5 optimization levels, and 13 compilers. It includes 243,128 binaries. We tested this code on Ubuntu 16.04.

For more details, please check our paper.

BCSA tool and Ground Truth Building

For a BCSA tool and ground truth building, please check TikNib.

Pre-compiled dataset and toolchain

You can download our dataset and toolchain from the links below. The links will be moved to git-lfs soon.

Dataset (latest version)

Dataset (old)

The datasets below are for reproducing the paper.

The data below was used only for our evaluation.

.pickle Files

These files include the extracted features and useful information for each function.

The data below was used only for our evaluation.

Toolchain

Currently supported compile options

Architecture

  • x86_32
  • x86_64
  • arm_32 (little endian)
  • arm_64 (little endian)
  • mips_32 (little endian)
  • mips_64 (little endian)
  • mipseb_32 (big endian)
  • mipseb_64 (big endian)

Optimization

  • O0
  • O1
  • O2
  • O3
  • Os
  • Ofast

Compilers

  • gcc
    • gcc-4.9.4
    • gcc-5.5.0
    • gcc-6.4.0
    • gcc-6.5.0
    • gcc-7.3.0
    • gcc-8.2.0
    • gcc-8.5.0
    • gcc-9.4.0
    • gcc-10.3.0
    • gcc-11.2.0
  • clang
    • clang-4.0.0
    • clang-5.0.2
    • clang-6.0.1
    • clang-7.0.1
    • clang-8.0.0
    • clang-9.0.1
    • clang-10.0.1
    • clang-11.0.1
    • clang-12.0.1
    • clang-13.0.0
  • clang-obfus
    • clang-obfus-fla (Obfuscator-LLVM - FLA)
    • clang-obfus-sub (Obfuscator-LLVM - SUB)
    • clang-obfus-bcf (Obfuscator-LLVM - BCF)
    • clang-obfus-all (Obfuscator-LLVM - FLA + SUB + BCF)
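
To iterate over the options listed above, a minimal sketch such as the following may help. It simply takes the Cartesian product of the architecture, optimization, and compiler names as written in this section; the function name `enumerate_options` is hypothetical and not part of BinKit. The plain product of the listed values comes to 1,152 triples, which is not the 1,904 figure quoted for the dataset, so the dataset evidently counts its combinations differently and this sketch should not be read as reproducing it.

```python
from itertools import product

# Values copied from the lists above.
ARCHS = ["x86_32", "x86_64", "arm_32", "arm_64",
         "mips_32", "mips_64", "mipseb_32", "mipseb_64"]
OPTS = ["O0", "O1", "O2", "O3", "Os", "Ofast"]
COMPILERS = [
    "gcc-4.9.4", "gcc-5.5.0", "gcc-6.4.0", "gcc-6.5.0", "gcc-7.3.0",
    "gcc-8.2.0", "gcc-8.5.0", "gcc-9.4.0", "gcc-10.3.0", "gcc-11.2.0",
    "clang-4.0.0", "clang-5.0.2", "clang-6.0.1", "clang-7.0.1", "clang-8.0.0",
    "clang-9.0.1", "clang-10.0.1", "clang-11.0.1", "clang-12.0.1", "clang-13.0.0",
    "clang-obfus-fla", "clang-obfus-sub", "clang-obfus-bcf", "clang-obfus-all",
]


def enumerate_options(archs=ARCHS, opts=OPTS, compilers=COMPILERS):
    """Return every (compiler, arch, opt) triple from the given lists.

    Hypothetical helper: a plain Cartesian product, not BinKit's own tally.
    """
    return list(product(compilers, archs, opts))


if __name__ == "__main__":
    combos = enumerate_options()
    print(len(combos), "triples; first:", combos[0])
```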

How to use

1. Configure the environment in scripts/env.sh

  • NUM_JOBS: number of parallel jobs for make, parallel, and Python multiprocessing
  • MAX_JOBS: upper limit on the number of jobs for make
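
If you drive the build from your own Python wrapper, the two settings above could be picked up roughly as follows. This is only a sketch under assumptions: it presumes that `scripts/env.sh` exports `NUM_JOBS` and `MAX_JOBS` into the environment, and the fallback to the CPU count is an illustrative default, not BinKit's documented behavior. The helper name `resolve_jobs` is hypothetical.

```python
import os


def resolve_jobs(environ=None, cpu_count=None):
    """Return (num_jobs, max_jobs) read from the environment.

    Assumptions (not taken from BinKit): NUM_JOBS falls back to the CPU
    count, and MAX_JOBS falls back to NUM_JOBS.
    """
    environ = os.environ if environ is None else environ
    cpus = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    num_jobs = int(environ.get("NUM_JOBS", cpus))
    max_jobs = int(environ.get("MAX_JOBS", num_jobs))
    if num_jobs < 1 or max_jobs < 1:
        raise ValueError("NUM_JOBS and MAX_JOBS must be positive integers")
    return num_jobs, max_jobs


if __name__ == "__main__":
    print(resolve_jobs())
```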

2. Build cross-compiling environment (takes lots of time)

This step builds the crosstool-ng and clang environments. If you downloaded the pre-compiled toolchain, please skip this step.

$ source scripts/env.sh
# We may have missed some packages here ... please check
$ scripts/install_default_deps.sh # install default packages for dataset compilation
$ scripts/setup_ctng.sh       # setup crosstool-ng binaries
$ scripts/setup_gcc.sh        # build ct-ng environment. Takes a lot of time
$ scripts/cleanup_ctng.sh     # cleaning up ctng leftovers
$ scripts/setup_clang.sh      # setup clang and llvm-obfuscator
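
Because the setup takes a long time, it can be convenient to run the steps unattended and stop at the first failure. The sketch below is one possible wrapper, not part of BinKit: it runs the same scripts in the same order as the commands above, each in a bash shell that has sourced `scripts/env.sh`. The function name `run_setup` and its `dry_run` flag are hypothetical, and the sketch assumes it is started from the repository root.

```python
import subprocess

# Same scripts, same order as the shell commands above.
SETUP_STEPS = [
    "scripts/install_default_deps.sh",
    "scripts/setup_ctng.sh",
    "scripts/setup_gcc.sh",
    "scripts/cleanup_ctng.sh",
    "scripts/setup_clang.sh",
]


def run_setup(steps=SETUP_STEPS, dry_run=True):
    """Run each setup script in order and return the command strings.

    With dry_run=True nothing is executed. Otherwise check=True makes
    subprocess.run raise CalledProcessError on the first failing step,
    so later steps are not attempted.
    """
    commands = []
    for step in steps:
        command = "source scripts/env.sh && " + step
        commands.append(command)
        if not dry_run:
            subprocess.run(["bash", "-c", command], check=True)
    return commands


if __name__ == "__main__":
    for line in run_setup(dry_run=True):
        print(line)
```

Calling `run_setup(dry_run=False)` would actually execute the scripts; the default only prints what would run.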

3. Link toolchains

$ scripts/link_toolchains.sh  # link base toolchain

To undo the linking, please check scripts/unlink_toolchains.sh

4. Build dataset

Please configure the variables in compile_packages.sh and run the commands below. The script automatically downloads the source code of the GNU packages and compiles them to build the whole dataset. However, building everything may take a very long time.

  • NOTE that it takes a SIGNIFICANT amount of time.
  • NOTE that some packages will not compile under some compiler options.

$ scripts/install_gnu_deps.sh # install default packages for dataset compilation
$ ./compile_packages.sh

4-1. Build dataset (manual)

You can download the source code of the GNU packages you are interested in as shown below.

  • Please check step 1 before running the command.
  • You must give an ABSOLUTE PATH for --base_dir.

$ source scripts/env.sh
$ python gnu_compile_script.py \
    --base_dir "/home/dongkwan/binkit/dataset/gnu" \
    --num_jobs 8 \
    --whitelist "config/whitelist.txt" \
    --download

You can compile only the packages or compiler options you are interested in as shown below.

$ source scripts/env.sh
$ python gnu_compile_script.py \
    --base_dir "/home/dongkwan/binkit/dataset/gnu" \
    --num_jobs 8 \
    --config "config/normal.yml" \
    --whitelist "config/whitelist.txt"
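
Since --base_dir must be an absolute path, a small wrapper can catch a relative path before a long run starts. The following is a minimal sketch, not BinKit code: the helper `build_compile_command` is hypothetical, and it uses only the flags that appear in the commands above (`--base_dir`, `--num_jobs`, `--whitelist`, `--config`, `--download`).

```python
import os
import shlex


def build_compile_command(base_dir, num_jobs, whitelist,
                          config=None, download=False):
    """Build the argv list for gnu_compile_script.py.

    Hypothetical helper; raises ValueError when base_dir is not absolute.
    """
    if not os.path.isabs(base_dir):
        raise ValueError("--base_dir must be an absolute path: %r" % base_dir)
    argv = [
        "python", "gnu_compile_script.py",
        "--base_dir", base_dir,
        "--num_jobs", str(num_jobs),
        "--whitelist", whitelist,
    ]
    if config is not None:
        argv += ["--config", config]
    if download:
        argv.append("--download")
    return argv


if __name__ == "__main__":
    argv = build_compile_command(
        "/home/dongkwan/binkit/dataset/gnu", 8,
        "config/whitelist.txt", config="config/normal.yml")
    # Print a shell-safe command line; run it after `source scripts/env.sh`.
    print(" ".join(shlex.quote(arg) for arg in argv))
```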

You can check the compiled binaries as below.

$ source scripts/env.sh
$ python compile_checker.py \
    --base_dir "/home/dongkwan/binkit/dataset/gnu" \
    --num_jobs 8 \
    --config "config/normal.yml"
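
As a rough independent sanity check, you could also count the ELF files under the output directory. The sketch below is not a replacement for compile_checker.py, whose checks are not described here. It assumes that the compiled binaries are ELF files and merely looks for the 4-byte ELF magic number, so intermediate object files left in the tree would be counted as well. The function name `count_elf_files` is hypothetical.

```python
import os

ELF_MAGIC = b"\x7fELF"  # first four bytes of every ELF file


def count_elf_files(base_dir):
    """Count regular files under base_dir that start with the ELF magic."""
    count = 0
    for root, _dirs, files in os.walk(base_dir):
        for name in files:
            path = os.path.join(root, name)
            try:
                with open(path, "rb") as handle:
                    if handle.read(4) == ELF_MAGIC:
                        count += 1
            except OSError:
                # Unreadable entries (e.g. broken symlinks) are skipped.
                continue
    return count


if __name__ == "__main__":
    import sys
    print(count_elf_files(sys.argv[1] if len(sys.argv) > 1 else "."))
```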

For more details, please check compile_packages.sh

4-2. Build dataset with customized options

To build a dataset with customized options, you can write your own configuration file (.yml) and select the target compiler options. The existing sample files in the config/ directory show the expected format. Please make sure that the name of your config file is not included in the blacklist in the compilation script.

Issues

Tested environment

We ran all our experiments on a server equipped with four Intel Xeon E7-8867v4 2.40 GHz CPUs (144 cores in total), 896 GB of DDR4 RAM, and a 4 TB SSD. We set up Ubuntu 16.04 on the server.

Tested python version

  • Python 3.8.0

Running example

Running the script below took 7 hours on our machine.

$ python gnu_compile_script.py \
    --base_dir "/home/dongkwan/binkit/dataset/gnu" \
    --num_jobs 72 \
    --config "config/normal.yml" \
    --whitelist "config/whitelist.txt"

Compilation failure

If compilation fails, you may need to adjust the number of parallel jobs configured in step 1 (NUM_JOBS and MAX_JOBS in scripts/env.sh). The right value is machine-dependent.
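
One simple way to search for a workable job count is to retry with progressively fewer jobs. The sketch below only generates such a sequence by halving. Halving is an assumption made for illustration and is not a procedure recommended by BinKit, and `job_schedule` is a hypothetical name.

```python
def job_schedule(start, minimum=1):
    """Return decreasing job counts: start, start // 2, ..., ending at minimum."""
    if minimum < 1 or start < minimum:
        raise ValueError("need start >= minimum >= 1")
    schedule = []
    jobs = start
    while jobs > minimum:
        schedule.append(jobs)
        jobs //= 2
    schedule.append(minimum)
    return schedule


if __name__ == "__main__":
    # Each value could be passed as --num_jobs on a successive attempt.
    print(job_schedule(72))
```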

Missing binaries

In the BinKit 2.0 dataset, the gsl package is missing 8 binaries for the Ofast option because of compiler bugs. Clang-8 and clang-9 hang when compiling the gsl package for 32-bit ARM with the Ofast option. We reported this issue to both bug-gsl and llvm-project. However, bug-gsl did not reply, and llvm-project replied that these versions are not currently supported. The bug reports are linked here: bug-gsl, llvm-project

Authors

This project was carried out by the authors below at KAIST.

Citation

We would appreciate it if you would consider citing our paper when using BinKit.

@ARTICLE{kim:tse:2022,
  author={Kim, Dongkwan and Kim, Eunsoo and Cha, Sang Kil and Son, Sooel and Kim, Yongdae},
  journal={IEEE Transactions on Software Engineering}, 
  title={Revisiting Binary Code Similarity Analysis using Interpretable Feature Engineering and Lessons Learned}, 
  year={2022},
  volume={},
  number={},
  pages={1-23},
  doi={10.1109/TSE.2022.3187689}
}
