• Stars: 157
• Rank: 238,399 (Top 5%)
• Language: Python
• Created about 5 years ago
• Updated almost 4 years ago

Repository Details

An unofficial implementation of asm2vec as a standalone Python package

asm2vec

This is an unofficial implementation of the asm2vec model as a standalone Python package. The model is described in the original paper: Asm2Vec: Boosting Static Representation Robustness for Binary Clone Search against Code Obfuscation and Compiler Optimization (IEEE S&P 2019).

Requirements

This implementation is written in Python 3.7, and Python 3.7 or later is recommended. The only dependency is numpy, which can be installed as follows:

python3 -m pip install numpy

How to use

Import

To get the package, clone the repository:

git clone https://github.com/lancern/asm2vec.git

Add the following line to your .bashrc file so that the Python interpreter's search path for external packages includes asm2vec:

export PYTHONPATH="path/to/asm2vec:$PYTHONPATH"

Replace path/to/asm2vec with the directory you cloned asm2vec into. Then execute the following command to update PYTHONPATH:

source ~/.bashrc

Alternatively, add the following snippet to any Python source file that uses asm2vec, before the import statement, so that the interpreter can find the package:

import sys
sys.path.append('path/to/asm2vec')

In your Python code, use the following import statement to import this package:

import asm2vec.<module-name>
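As a minimal sketch of the sys.path approach, the helper below adds a clone directory to the search path only once and reports whether the package can then be located. The helper name add_to_search_path is hypothetical and not part of asm2vec, and path/to/asm2vec remains a placeholder for your own clone directory.

```python
import importlib.util
import sys


def add_to_search_path(repo_dir, package="asm2vec"):
    """Put repo_dir on sys.path (once) and report whether `package` can be found.

    Hypothetical helper for illustration; it is not part of asm2vec.
    """
    if repo_dir not in sys.path:
        sys.path.append(repo_dir)
    # find_spec returns None when a top-level package cannot be located.
    return importlib.util.find_spec(package) is not None


if __name__ == "__main__":
    # Placeholder path: replace it with the directory you cloned asm2vec into.
    found = add_to_search_path("path/to/asm2vec")
    print("asm2vec importable:", found)
```

Checking with find_spec before importing gives a clear yes/no answer instead of an ImportError raised deeper in the program.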

Define CFGs And Training

There are two ways to define the binary program that is fed to the asm2vec model. The first is to build the control flow graph (CFG) manually, as shown below:

from asm2vec.asm import BasicBlock
from asm2vec.asm import Function
from asm2vec.asm import parse_instruction

block1 = BasicBlock()
block1.add_instruction(parse_instruction('mov eax, ebx'))
block1.add_instruction(parse_instruction('jmp _loc'))

block2 = BasicBlock()
block2.add_instruction(parse_instruction('xor eax, eax'))
block2.add_instruction(parse_instruction('ret'))

block1.add_successor(block2)

block3 = BasicBlock()
block3.add_instruction(parse_instruction('sub eax, [ebp]'))

f1 = Function(block1, 'some_func')
f2 = Function(block3, 'another_func')

# block4 is omitted here for clarity
f3 = Function(block4, 'estimate_func')

And then you can train a model with the following code:

from asm2vec.model import Asm2Vec

model = Asm2Vec(d=200)
train_repo = model.make_function_repo([f1, f2, f3])
model.train(train_repo)

The second approach is to use the parse module provided by asm2vec to build CFGs automatically from an assembly source file:

from asm2vec.parse import parse_fp

with open('source.asm', 'r') as fp:
    funcs = parse_fp(fp)

And then you can train a model with the following code:

from asm2vec.model import Asm2Vec

model = Asm2Vec(d=200)
train_repo = model.make_function_repo(funcs)
model.train(train_repo)

Estimation

You can use the asm2vec.model.Asm2Vec.to_vec method to convert a function into its vector representation.
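Once two functions have been converted to vectors, a common way to compare them for clone search is cosine similarity. The sketch below is one possible way to do that in plain Python. The function cosine_similarity and the sample vectors are assumptions for illustration, not part of asm2vec, and the made-up vectors stand in for values that to_vec would return.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (lists or numpy arrays)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen in this sketch for zero vectors
    return dot / (norm_u * norm_v)


# Made-up 4-dimensional vectors standing in for to_vec results.
vec_a = [0.2, -0.1, 0.4, 0.0]
vec_b = [0.4, -0.2, 0.8, 0.0]   # same direction as vec_a
vec_c = [-0.1, -0.2, 0.0, 0.3]  # orthogonal to vec_a

print("a vs b:", round(cosine_similarity(vec_a, vec_b), 6))
print("a vs c:", round(cosine_similarity(vec_a, vec_c), 6))
```

Values close to 1 indicate vectors pointing in nearly the same direction, and values close to 0 indicate unrelated directions.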

Serialization

The implementation supports serialization of many of its internal data structures, so you can save the internal state of a trained model to disk for future use.

You can serialize two data structures to primitive data: the function repository and the model memento.

To be finished.

Hyper Parameters

The constructor of the asm2vec.model.Asm2Vec class accepts keyword arguments that set the hyper parameters of the model. The following table lists all available hyper parameters:

| Parameter Name | Type | Meaning | Default Value |
| --- | --- | --- | --- |
| d | int | The dimension of the token vectors. | 200 |
| initial_alpha | float | The initial learning rate. | 0.05 |
| alpha_update_interval | int | The number of tokens processed between learning rate updates. | 10000 |
| rnd_walks | int | The number of random walks performed to sequentialize a function. | 3 |
| neg_samples | int | The number of samples taken during negative sampling. | 25 |
| iteration | int | The number of training iterations. (Reserved for future use; not implemented yet.) | 1 |
| jobs | int | The number of tasks executed concurrently during training. | 4 |
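To illustrate how these keyword arguments might be assembled, the sketch below keeps the defaults from the table in a dictionary, applies a few overrides, and passes the result to the constructor when the package is available. The names DEFAULT_HYPER_PARAMS and make_hyper_params are hypothetical helpers, not part of asm2vec, and the override values are arbitrary.

```python
# Defaults copied from the hyper parameter table.
DEFAULT_HYPER_PARAMS = {
    "d": 200,
    "initial_alpha": 0.05,
    "alpha_update_interval": 10000,
    "rnd_walks": 3,
    "neg_samples": 25,
    "iteration": 1,
    "jobs": 4,
}


def make_hyper_params(**overrides):
    """Return the defaults with `overrides` applied, rejecting unknown names.

    Hypothetical helper for illustration; it is not part of asm2vec.
    """
    unknown = set(overrides) - set(DEFAULT_HYPER_PARAMS)
    if unknown:
        raise TypeError("unknown hyper parameter(s): " + ", ".join(sorted(unknown)))
    return {**DEFAULT_HYPER_PARAMS, **overrides}


# Arbitrary example overrides: smaller vectors, more random walks.
params = make_hyper_params(d=100, rnd_walks=5)

try:
    from asm2vec.model import Asm2Vec
except ImportError:
    Asm2Vec = None  # asm2vec is not on the search path; only the dict is built

if Asm2Vec is not None:
    model = Asm2Vec(**params)
```

Rejecting unknown names up front catches typos such as rnd_walk that might otherwise go unnoticed.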

Notes

For simplicity, Selective Callee Expansion is not implemented in this early version. You have to perform it manually before passing a CFG to asm2vec.
