Binary Function Similarity

This repository contains the code, the dataset and additional technical information for our USENIX Security '22 paper:

Andrea Marcelli, Mariano Graziano, Xabier Ugarte-Pedrero, Yanick Fratantonio, Mohamad Mansouri, Davide Balzarotti. How Machine Learning Is Solving the Binary Function Similarity Problem. USENIX Security '22.

The paper is available at this link.

Additional technical information

The technical report, with additional information on the dataset and the selected approaches, is available at this link.

Artifacts

The repository is structured in the following way:

  • Binaries: the compiled binaries and the scripts to compile them. Binaries are downloaded from Google Drive via a Python script.
  • IDBs: where the IDA Pro databases (IDBs) are stored after analysis. IDBs are generated via a Python script and IDA Pro.
  • DBs: the datasets of selected functions, the corresponding features, and the scripts to generate them.
  • IDA_scripts: the IDA Pro scripts used for feature extraction.
  • Models: the code for the approaches we tested.
  • Results: the results of our experiments on all the test cases and the code to extract the different metrics.
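
As a quick sanity check after cloning, the layout above can be verified with a few lines of Python. This is only a sketch: the helper name `missing_dirs` is hypothetical and not part of the repository, and it assumes the six folders sit directly under the repository root, as the list suggests.

```python
from pathlib import Path

# Top-level folders named in the list above.
EXPECTED_DIRS = ["Binaries", "IDBs", "DBs", "IDA_scripts", "Models", "Results"]


def missing_dirs(repo_root):
    """Return the expected top-level folders that are absent under repo_root.

    Hypothetical helper for illustration; it only checks that each name
    exists as a directory and does not inspect the folder contents.
    """
    root = Path(repo_root)
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]
```

Calling `missing_dirs(".")` from the root of a clone should return an empty list if every folder is present.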

What to do next?

The following is a list of the main steps to follow based on the most common use cases:

  • Reproduce the experiments presented in the paper

    • Note: the binaries (Binaries) and the corresponding IDA Pro databases (IDBs) are only needed to create a new dataset or to extract additional features. To reproduce the experiments or run new tests with the current set of features, the DBs and Models folders already contain the required data.
    1. The DBs folder contains the input data needed to reproduce the results for each tested approach, including the extracted features.
    2. Refer to the README of each approach in the Models folder for detailed instructions on how to run it.
    3. Follow the README and use the scripts in the Results folder to collect the different metrics.
  • Test a new approach on our datasets

    1. Check the README in the DBs folder to decide which data to use for each test case.
    2. Reuse the existing IDA Pro scripts for feature extraction, along with the pre/post-processing code, to minimize evaluation differences.
    3. Follow the README and use the scripts in the Results folder to collect the different metrics.
  • Use one of the existing approaches to run inference on new functions

    • Note: the current workflow and code have been written to optimize the evaluation of the similarity engines on a "fixed" dataset of functions and their features. This makes inference on a new dataset somewhat complex, because each approach requires a different sequence of steps. A simplification may be addressed in a future release.
    1. Refer to the README of each approach in Models for detailed instructions on how to run it in inference mode.
    2. Use the corresponding IDA Pro script to extract the features needed by that specific approach.
    3. Some approaches require running a specific post-processing script to convert the extracted features into the expected format.
    4. Be aware of the limitations of the ML models: new architectures, compilers and compiler options may require retraining them.
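
The use cases above that end with collecting metrics rely on the scripts in the Results folder. This README does not say which metrics those scripts compute, so the following is only an illustrative sketch of two ranking metrics commonly used to evaluate similarity search, mean reciprocal rank (MRR) and recall@K. The function names and the toy scores are hypothetical and are not taken from the repository.

```python
def rank_of_positive(scores, positive_index):
    """1-based rank of the true match when candidates are sorted by
    descending similarity score. Ties are resolved in favor of the match."""
    positive_score = scores[positive_index]
    return 1 + sum(1 for s in scores if s > positive_score)


def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all queries."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def recall_at_k(ranks, k):
    """Fraction of queries whose true match appears in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Toy data (made up): each query has similarity scores for three
# candidate functions and the index of the true match.
queries = [
    ([0.9, 0.2, 0.4], 0),
    ([0.1, 0.8, 0.5], 2),
    ([0.3, 0.6, 0.7], 0),
]
ranks = [rank_of_positive(scores, pos) for scores, pos in queries]
mrr = mean_reciprocal_rank(ranks)
recall_1 = recall_at_k(ranks, 1)
```

For the actual evaluation, follow the README in the Results folder rather than this sketch.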

How to cite our work

Please use the following BibTeX:

@inproceedings {280046,
author = {Andrea Marcelli and Mariano Graziano and Xabier Ugarte-Pedrero and Yanick Fratantonio and Mohamad Mansouri and Davide Balzarotti},
title = {How Machine Learning Is Solving the Binary Function Similarity Problem},
booktitle = {31st USENIX Security Symposium (USENIX Security 22)},
year = {2022},
isbn = {978-1-939133-31-1},
address = {Boston, MA},
pages = {2099--2116},
url = {https://www.usenix.org/conference/usenixsecurity22/presentation/marcelli},
publisher = {USENIX Association},
month = aug,
}

Errata

Our corrections to the published paper:

  • From Section 3.2 Selected Approaches: "First, the binary diffing tools grouped in the middle box [13,16,83] have all been designed for a direct comparison of two binaries (e.g., they use the call graph) and they are all mono-architecture." This sentence is inaccurate because BinDiff and Diaphora also support cross-architecture comparisons.

License

The code in this repository is licensed under the MIT License; however, some models and scripts depend on or pull in code that is released under different licenses.

Bugs and feedback

For help or issues, please submit a GitHub issue.
