reopt
Reopt is a general purpose decompilation and recompilation tool for repurposing application logic. It does this by analyzing machine code to recover a more flexible program representation -- specifically the LLVM assembly language. Once in this format, one can then apply optimization tools to optimize the LLVM, recompile the application into optimized or security hardened object code, and use Reopt to merge the recompiled code back into the original executable.
Reopt supports Linux x86_64 programs. We are working towards a full 1.0 release, but the current pre-release version supports the end-to-end recompilation toolchain.
Getting Reopt
Although Reopt can build on other POSIX systems such as OSX, we recommend building Reopt to run on Linux. Reopt currently only supports Elf binaries which are the default binary format for Linux. It does not support OSX Macho binaries, and so it is easier to find applications to try Reopt on when running Linux.
Gitpod
For most people, the easiest way to try out Reopt is to try it out on Gitpod. This requires an account on Gitpod, but gives you access to a VSCode IDE connected to a Linux container with Reopt pre-installed.
Github Releases
If you have Linux installed, you can download one of our recent releases from the Releases page. We build releases as static binaries on Centos 7, so they should work on a variety of distributions.
Docker
If you have Docker installed, you can install and run the Reopt pre-release Docker image by running:
docker pull galoisbinaryanalysis/reopt
docker run --rm -it galoisbinaryanalysis/reopt
Building from source
Building Reopt requires that one has installed the GHC Haskell
compiler and supporting tooling. We currently build on GHC 8.10.4.
An easy way to get GHC is to install ghcup,
and run ghcup install ghc-8.10.4
. We also maintain a
Docker image
that has GHC and other dependencies preinstalled for building Reopt.
Once GHC is installed, the following steps may be useful for building Reopt:
git clone https://github.com/GaloisInc/reopt.git
cd reopt
# Fix submodule URLs (can skip if you have a Github account)
sed -i 's/[email protected]:/https:\/\/github.com\//' .gitmodules
git submodule update --init
# Build Reopt
cabal install exe:reopt
# Build Reopt Explore
cabal install exe:reopt-explore
Reopt and Reopt Explore will be installed at $HOME/.cabal/bin/reopt
$HOME/.cabal/bin/reopt-explore
.
Reopt's verification condition generator (reopt-vcg
) is included in the
aforementioned Github release and Docker image, however the source is currently
maintained in a separate repository
with it's own build instructions and requirements.
Using Reopt
Once reopt
is installed on a Linux system and included in your path,
you can try running it on system utilities such as ls
. To do an
end-to-end recompilation, you can run reopt with the command.
$ reopt -o ls.exe $(which ls)
This execution will use the version of ls
in your system path and produce
an executable ls.exe
in the current directory. When running reopt
will print out messages as it discovers functions within the application
and attempts to convert each discovered function into LLVM.
Inspecting intermediate state
During recompilation, Reopt has to do a complex series of analysis steps to lift the machine code into LLVM. Each of these analysis steps is incomplete and may fail either due to Reopt not recognizing features in the binary or an error in our prerelease version of Reopt. As such, do not be alarmed when Reopt fails to translate functions.
If you'd like to inspect Reopt's intermediate state, there are several
command line flags to export intermediate results. We describe the
main flags for exporting intermediate state below.
Additional options can be viewed by running reopt --help
.
-
Disassembly.
reopt --disassemble <binary>
provides a raw disassembler output view of the code in the binary. This is similiar toobjdump
's disassembly output. -
Control flow graph construction.
reopt --cfg <binary>
displays the low level control flow graphs that Reopt has constructed for each discovered function within the binary. This is a low-level IR that maintains machine code's explicit stack and register references, but lifts the machine code instructions into a more architectural neutral register transfer language. -
Function Recovery
reopt --export-fns <path> <binary>
writes the functions that Reopt has generated after performing stack and function argument analysis. This is a higher-level IR in which explicit references to the stack have been replaced with allocations, and functions take arguments. -
LLVM Generation
reopt --export-llvm <path> <binary>
generates LLVM from the binary. This is essentially a version of function recovery rendered in LLVM's format. Providing the--annotations <ann_file>
flag during LLVM generation will causereopt
to additionally emit JSON in<ann_file>
describing verification conditions which (if valid) demonstrate functional equivalence between the generated LLVM and machine code. Runningreopt-vcg <ann_file>
will simulate the executation of the LLVM and machine code, block-by-block, leveraging an SMT solver (cvc4) to verify as many of the conditions as possible. -
Object Files
reopt --export-object <path> <binary>
generates an object file from the LLVM generated in the previous state. This is essentially the same as generating the LLVM, and then running the LLVM compiler toolchain with the selected options.
Function arguments
One common reason Reopt fails is because it cannot figure out the arguments
that a function can take. We have four mechanisms for obtaining function
arguments: (1) User provided hints; (2) a small builtin database; (3) debug
information; and (4) a demand analysis that looks at what registers are used
to infer arguments. These mechanisms are listed in priority order, although
we note that the builtin database is currently the only mechanism for supporting
functions that take a variable number of arguments like printf
.
If you'd like to provide hints to Reopt, the recommended way is write a C header file with the arguments, such as:
// decls.h
typedef long ssize_t;
typedef unsigned long size_t;
ssize_t read(int fd, void* buf, size_t count);
ssize_t write(int fd, const void* buf, size_t count);
You can then use this file to tell Reopt about the expected types for
read
and write
via the --header
flag, e.g.,
reopt -o ls.exe --header decls.h $(which ls)
OCCAM
for additional optimizations
Using reopt
can leverage the OCCAM whole-program
partial evaluator for LLVM bitcode to further optimize binaries (assuming a user
has already installed and made available both OCCAM
and its accompanying
interface slash
).
This feature can be enabled by passing the --occam-config=FILE
option to
reopt
, where FILE
is the reopt
/OCCAM
manifest. The manifest should
essentially a valid OCCAM manifest
file (i.e., a file with JSON
entries) with the following (optional) additional field:
slash_options
: a list of command line option flags for OCCAM'sslash
tool,
and excluding the following fields (reopt
will populate these appropriately):
binary
name
The main
field should specify the desired name of the bitcode file that will
be generated for OCCAM
to process, and the OCCAM optimized result will share
the name with an added .occam
suffix.
N.B., when passing flags to customize OCCAM
/slash
behavior, be aware that
reopt
passes the -c
and -emit-llvm
flags via the
ldflags
manifest entry so OCCAM
skips recompiling and acts only as an LLVM
to LLVM translator.
Using Reopt Explore
With reopt-explore
installed we can gather statistics regarding reopt
's ability
to recover functions in an individual or collection of binaries.
To examine a single binary, simply call reopt-explore
with the a path to the binary:
$ reopt-explore llvm $(which ls)
...
/usr/bin/ls
Initialization:
Code segment: 112,004 bytes
Initial entry points: 234
Warnings: 0
Discovery:
Bytes discovered: 59,502 (53%)
Succeeded: 216 (92%)
Failed: 18 (8%)
Unhandled instruction: 1 (0%)
Unidentified control flow: 17 (7%)
Argument Analysis:
Succeeded: 123 (57%)
Failed: 93 (43%)
Header Warnings: 0
DWARF Warnings: 0
Code Warnings: 112
Invariant Inference:
Succeeded: 92 (75%)
Failed: 31 (25%)
Indirect call target: 1 (1%)
Unresolved call target arguments: 30 (24%)
Recovery:
Succeeded: 81 (88%)
Failed: 11 (12%)
Unsupported function value: 8 (9%)
Unimplemented LLVM backend feature: 3 (3%)
LLVM generation status: Succeeded.
To recursively search a directory for binaries and examine each,
call reopt-explore
with the path to the directory to search:
$ reopt-explore llvm /usr/bin
...
reopt analyzed 394 binaries:
Generated LLVM bitcode for 394 out of 394 binaries.
Initialization:
Code segment: 42,933,178 bytes
Initial entry points: 79776
Warnings: 0
Discovery:
Bytes discovered: 23,025,164 (54%)
Succeeded: 64,494 (81%)
Failed: 15,500 (19%)
Unhandled instruction: 425 (1%)
Unidentified control flow: 15,075 (19%)
Argument Analysis:
Succeeded: 40,429 (63%)
Failed: 24,065 (37%)
Header Warnings: 0
DWARF Warnings: 0
Code Warnings: 38,681
Invariant Inference:
Succeeded: 30,221 (75%)
Failed: 10,208 (25%)
Symbolic call stack height: 1 (0%)
Unresolved stack read: 13 (0%)
Indirect call target: 526 (1%)
Call target not function entry point: 41 (0%)
Unresolved call target arguments: 9,614 (24%)
Could not resolve varargs args: 13 (0%)
Recovery:
Succeeded: 21,952 (73%)
Failed: 8,269 (27%)
Unsupported function value: 2,425 (8%)
Unimplemented feature: 6 (0%)
Unimplemented LLVM backend feature: 4,762 (16%)
Stack offset escape: 83 (0%)
Stack read overlapping offset: 1 (0%)
Unresolved return value: 8 (0%)
Missing variable value: 984 (3%)
Improving recovery with debug information
reopt
and reopt-explore
will try to determine if any debug information
is available for dynamic dependencies by querying gdb
(if it is installed).
Users can also manually specify dependency and debug directories to search in
manually for both reopt
and reopt-explore
via the following flags:
--lib-dir=PATH Additional location to search for dynamic
dependencies.
--debug-dir=PATH Additional location to search for dynamic
dependencies' debug info.
Contributing
The project has been contributed to by many authors over many years without much coordination on code style and library usage. For our currently-maintained sub-projects, we favor the following, without enforcing them aggressively (i.e. you may still find instances where those are not used):
-
We use the
fourmolu
code formatter, informed by thefourmolu.yaml
present in the root directory. -
We are trying to move to
optparse-applicative
for CLI argument parsing (though there remain some instances ofcmdargs
for executables we are not actively maintaining). -
We use
prettyprinter
for pretty-printing. -
We tend to use labeled generic lenses for accessing complex data types in read/write. We still have a mixed use of named- and symbol-based lens operators.
-
We do not yet have a very nice logging story. Currently, the code outputs to stdout/stderr as it deems necessary, but we ought to use a more principled logging discipline.