


Recognize cpu instructions in an arbitrary binary file

Description

cpu_rec is a tool that recognizes cpu instructions in an arbitrary binary file. It can be used as a standalone tool, or as a plugin for binwalk (https://github.com/devttys0/binwalk).

Installation instructions

Standalone tool

  1. Copy cpu_rec.py and cpu_rec_corpus into the same directory.
  2. If the lzma module is not installed for your Python (the tool works with either python3 or python2 >= 2.4), unxz the corpus files in cpu_rec_corpus.
  3. To enhance the corpus, add new data to the corpus directory. To create your own corpus, look at the method build_default_corpus in the source code.
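
Step 2 depends on whether the running interpreter can import lzma. A minimal sketch of that check, using only the standard library; the helper name has_lzma is hypothetical and is not part of cpu_rec:

```python
import importlib


def has_lzma():
    """Return True if this interpreter can import the lzma module."""
    try:
        importlib.import_module("lzma")
    except ImportError:
        return False
    return True


if __name__ == "__main__":
    if has_lzma():
        print("lzma available: the .xz corpus files can stay compressed")
    else:
        print("lzma missing: unxz the files in cpu_rec_corpus first")
```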

For use as a binwalk module

Same as above, but the installation directory must be the binwalk module directory: $HOME/.config/binwalk/modules.

You'll need a recent version of binwalk that includes the patch provided by ReFirmLabs/binwalk#241.

How to use the tool

As a binwalk module

Add the flag -% when using binwalk.

Be patient: waiting a few minutes for the result is to be expected. On my laptop the tool takes 25 seconds and 1 GB of RAM to create the signatures for 70 architectures, and the analysis of a binary then takes one minute per MB. To make the tool faster, you can remove architectures that you know your binary does not contain (typically, Cray or MMIX are not found in firmware).
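
The figures above give a rough way to budget a run. The following is a back-of-the-envelope sketch, assuming the laptop timings quoted above scale linearly with file size and that a megabyte means 2**20 bytes; the function name and the linear model are assumptions, and real timings depend on the machine:

```python
def estimate_runtime_seconds(file_size_bytes, setup_s=25.0, seconds_per_mb=60.0):
    """Rough estimate: fixed signature-building time plus a per-megabyte cost."""
    megabytes = file_size_bytes / (1024.0 * 1024.0)
    return setup_s + megabytes * seconds_per_mb


if __name__ == "__main__":
    # A 4 MB firmware image under the assumptions above.
    print(estimate_runtime_seconds(4 * 1024 * 1024))
```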

As a standalone tool

Just run the tool, with the binary file(s) to analyze as argument(s). The tool will try to match an architecture for the whole file, and then to detect the largest binary chunk that corresponds to a CPU architecture. Usually this is the right answer, but the tool is heuristic and some binary files contain instructions for multiple architectures, so a more detailed analysis may be needed.

If the result is not satisfying, prepending -v twice to the arguments makes the tool very verbose; this is helpful when adding a new architecture to the corpus or when there are doubts about the raw result of the tool.

If https://github.com/LRGH/elfesteem is installed, then the tool also extracts the text section from ELF, PE, Mach-O or COFF files, and outputs the architecture corresponding to this section; the possibility of extracting the text section is also used when building a corpus from full binary files.

Option -d followed by a directory dumps the corpus in that directory; using this option one can reconstruct the default corpus.

As a python module

The function which_arch takes a bytestring as input and outputs the name of the architecture, or None. Loading the training data is done during the first call of which_arch, and calling which_arch with no argument does this precomputation only.

For example:

>>> from cpu_rec import which_arch
>>> which_arch()
>>> which_arch(b'toto')
>>> which_arch(open('/bin/sh', 'rb').read())
'X86-64'
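
Because which_arch expects a bytestring, a file should be read in binary mode before being passed to it. A small sketch of a wrapper follows; arch_of_file is a hypothetical helper, and the classifier is passed in as a parameter so that the snippet does not depend on cpu_rec being importable:

```python
def arch_of_file(path, classifier):
    """Read `path` as raw bytes and return whatever classifier(bytes) returns."""
    with open(path, "rb") as handle:
        return classifier(handle.read())


# Typical use, assuming cpu_rec.py and its corpus are importable:
#   from cpu_rec import which_arch
#   which_arch()                               # optional: precompute the signatures
#   print(arch_of_file("/bin/sh", which_arch))
```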

Create a corpus or extend the existing corpus

Each architecture is defined by a file in cpu_rec_corpus. Only files whose names end with .corpus are used; they can be compressed with xz.

The corpus file shall contain instructions for the target architecture. As you can see in build_default_corpus, most of the default corpus has been created by extracting the TEXT section of an executable.

If you want to add a new architecture (e.g. 78k as described below) then you have to find a binary, and extract the executable section (the command line to extract the 78k code from the Metz firmware is dd if=MB50AF1_NikonV12.bin of=Nec78k.corpus bs=1 skip=0x2ba count=0x7d5a).
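
The same extraction can be done without dd. The sketch below copies a byte range from one file to another; extract_chunk is a hypothetical helper, not part of cpu_rec, and the offsets in the usage comment are the ones from the dd command above:

```python
def extract_chunk(src_path, dst_path, skip, count):
    """Copy `count` bytes starting at byte offset `skip`, like dd with bs=1.

    Returns the number of bytes actually written, which is smaller than
    `count` if the source file ends early.
    """
    with open(src_path, "rb") as src:
        src.seek(skip)
        data = src.read(count)
    with open(dst_path, "wb") as dst:
        dst.write(data)
    return len(data)


# Equivalent of the dd command above (file names taken from that command):
#   extract_chunk("MB50AF1_NikonV12.bin", "Nec78k.corpus", 0x2ba, 0x7d5a)
```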

Examples

Running the tool as a binwalk module typically results in:

shell_prompt> binwalk -% corpus/PE/PPC/NTDLL.DLL corpus/MSP430/goodfet32.hex

Target File:   .../corpus/PE/PPC/NTDLL.DLL
MD5 Checksum:  d006a2a87a3596c744c5573aece81d77

DECIMAL       HEXADECIMAL     DESCRIPTION
--------------------------------------------------------------------------------
0             0x0             None (size=0x5800, entropy=0.620536)
22528         0x5800          PPCel (size=0x4c800, entropy=0.737337)
335872        0x52000         None (size=0x23800, entropy=0.731620)

Target File:   .../corpus/MSP430/goodfet32.hex
MD5 Checksum:  4b295284024e2b6a6257b720a7168b92

DECIMAL       HEXADECIMAL     DESCRIPTION
--------------------------------------------------------------------------------
0             0x0             None (size=0x8000, entropy=0.473132)
32768         0x8000          MSP430 (size=0x5000, entropy=0.473457)
53248         0xD000          None (size=0x3000, entropy=0.489337)

Target File:   .../corpus/PE/ALPHA/NTDLL.DLL
MD5 Checksum:  9c76d1855b8fe4452fc67782aa0233f9

DECIMAL       HEXADECIMAL     DESCRIPTION
--------------------------------------------------------------------------------
0             0x0             None (size=0xa000, entropy=0.785498)
40960         0xA000          Alpha (size=0x5b800, entropy=0.810394)
415744        0x65800         None (size=0x800, entropy=0.695699)
417792        0x66000         VAX (size=0x1000, entropy=0.683740)
421888        0x67000         None (size=0x28800, entropy=0.717975)

Target File:   .../corpus/Mach-O/OSXII
MD5 Checksum:  a4097b036f7ee45c147ab7c7d871d0c1

DECIMAL       HEXADECIMAL     DESCRIPTION
--------------------------------------------------------------------------------
0             0x0             None (size=0x1800, entropy=0.156350)
6144          0x1800          PPCeb (size=0x1b800, entropy=0.772708)
118784        0x1D000         None (size=0xd000, entropy=0.588620)
172032        0x2A000         X86 (size=0x2000, entropy=0.594146)
180224        0x2C000         None (size=0x800, entropy=0.758712)
182272        0x2C800         X86-64 (size=0x800, entropy=0.767427)
184320        0x2D000         X86 (size=0x18800, entropy=0.786143)
284672        0x45800         None (size=0xc000, entropy=0.612610)

Important: it is usually a good idea to start the analysis of an unknown binary with some entropy analysis. cpu_rec assumes that this has been done but, to protect the user against overlooking this aspect, it displays the entropy. If the entropy value is above 0.9, the data is probably encrypted or compressed, and the result of cpu_rec is then likely to be meaningless.
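
For a quick entropy check before running the tool, a minimal sketch is given below. It assumes that the displayed value is the Shannon entropy of the byte distribution divided by 8, so that it falls between 0 and 1; the exact formula used by cpu_rec may differ, and both function names are hypothetical:

```python
import math
from collections import Counter


def normalized_entropy(data):
    """Shannon entropy of the byte distribution, scaled to the range [0, 1]."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(bytearray(data))
    # Sum of -p * log2(p) over the byte values that occur, in bits per byte.
    bits = sum(-(n / total) * math.log2(n / total) for n in counts.values())
    return bits / 8.0


def looks_compressed_or_encrypted(data, threshold=0.9):
    """Apply the 0.9 rule of thumb mentioned above."""
    return normalized_entropy(data) > threshold
```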

We can notice that during the analysis of ALPHA/NTDLL.DLL small chunks are wrongly detected as non-Alpha architectures. They should be ignored. But some files can contain multiple architectures, e.g. Mach-O/OSXII which is a Mach-O FAT file with ppc and i386 executables.
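
One way to separate spurious small chunks from genuine multi-architecture files is to compare the total size attributed to each architecture. The sketch below is only an illustration: dominant_architectures is a hypothetical helper, the 10 % threshold is an arbitrary assumption, and the region sizes are copied from the ALPHA/NTDLL.DLL and Mach-O/OSXII outputs above:

```python
from collections import defaultdict


def dominant_architectures(regions, min_ratio=0.1):
    """Keep architectures whose total size is at least min_ratio of the largest.

    `regions` is an iterable of (architecture_or_None, size) pairs.
    The result is sorted from the largest total size to the smallest.
    """
    totals = defaultdict(int)
    for arch, size in regions:
        if arch is not None:
            totals[arch] += size
    if not totals:
        return []
    biggest = max(totals.values())
    kept = [arch for arch, size in totals.items() if size >= min_ratio * biggest]
    return sorted(kept, key=lambda arch: -totals[arch])


alpha_ntdll = [(None, 0xa000), ("Alpha", 0x5b800), (None, 0x800),
               ("VAX", 0x1000), (None, 0x28800)]
osxii = [(None, 0x1800), ("PPCeb", 0x1b800), (None, 0xd000), ("X86", 0x2000),
         (None, 0x800), ("X86-64", 0x800), ("X86", 0x18800), (None, 0xc000)]

if __name__ == "__main__":
    print(dominant_architectures(alpha_ntdll))  # the small VAX chunk is dropped
    print(dominant_architectures(osxii))        # both PPCeb and X86 are kept
```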

More documentation

The tool has been presented at SSTIC 2017, with a full paper describing why this technique has been used for the recognition of architectures. A video of the presentation and the slides are available.

This presentation was made in French. An English translation of the slides is available, and an English translation of the paper is in progress.

Known architectures in the default corpus

6502 68HC08 68HC11 8051 Alpha ARC32eb ARC32el ARcompact ARM64 ARMeb ARMel ARMhf AVR AxisCris Blackfin Cell-SPU CLIPPER CompactRISC Cray CUDA Epiphany FR-V FR30 FT32 H8-300 H8S HP-Focus HP-PA i860 IA-64 IQ2000 M32C M32R M68k M88k MCore Mico32 MicroBlaze MIPS16 MIPSeb MIPSel MMIX MN10300 Moxie MSP430 NDS32 NIOS-II OCaml PDP-11 PIC10 PIC16 PIC18 PIC24 PPCeb PPCel RISC-V RL78 ROMP RX S-390 SPARC STM8 Stormy16 SuperH TILEPro TLCS-90 TMS320C2x TMS320C6x TriMedia V850 VAX Visium WASM WE32000 X86-64 X86 Xtensa Z80 #6502#cc65

Because of licensing issues, the following architectures are not in the default corpus, but they can be manually added: 78k, TriCore

Licence

The tool

The cpu_rec.py file is licensed under the Apache License, Version 2.0.

The default corpus

The files in the default corpus have been built from various sources. The corpus is a collection of compressed files. Each compressed file is dedicated to the recognition of one architecture and is made by compressing the concatenation of one or more binary chunks, which come from various origins and have various licences. Therefore, the default corpus is a composite document, each sub-document (the chunk) being redistributed under the appropriate licence.

The origin of each chunk is described in cpu_rec.py, in the function build_default_corpus. The licences are:

Other architectures that cannot be distributed in the default corpus

Development status

Code Quality
