• Stars
    star
    322
  • Rank 129,619 (Top 3 %)
  • Language
    Python
  • License
    MIT License
  • Created over 1 year ago
  • Updated over 1 year ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

Whole-Program Reverse Engineering with GPT-3

GPT-WPRE: Whole-program Reverse Engineering with GPT-3

This is a little toy prototype of a tool that attempts to summarize a whole binary using GPT-3 (specifically the text-davinci-003 model), based on decompiled code provided by Ghidra. However, today's language models can only fit a small amount of text into their context window at once (4096 tokens for text-davinci-003, a couple hundred lines of code at most) -- most programs (and even some functions) are too big to fit all at once.

GPT-WPRE attempts to work around this by recursively creating natural language summaries of a function's dependencies and then providing those as context for the function itself. It's pretty neat when it works! I have tested it on exactly one program, so YMMV.

Dependencies

You need:

Usage

Call Graph and Decompilation

Make sure the program you want to analyze is open and that the Ghidra bridge server is running. Then extract the control flow graph and decompiled functions with:

$ python extract_ghidra_decomp.py
Building call graph: 100%|██████████████████████████| 651/651 [02:21<00:00,  4.60it/s]
Decompiling functions: 100%|████████████████████████| 651/651 [01:02<00:00, 10.34it/s]
Missing 0 functions:
[]

This will create a directory named after the program you're analyzing (e.g., in our example, libpng16.so.16.38.0_stripped) with JSON files named call_graph.json (for the call graph) and decompilations.json for the decompiled functions.

Summarizing

The script used for this is the creatively named recursive_summarize.py. It takes a few arguments:

$ python recursive_summarize.py --help
usage: recursive_summarize.py [-h] [-f FUNCTION] [-d DECOMPILATIONS] [-g CALL_GRAPH]
                              [-o OUTPUT] [-v] [-n] [-l MAX_LINES]
                              progdir

positional arguments:
  progdir

options:
  -h, --help            show this help message and exit
  -f FUNCTION, --function FUNCTION
                        Summarize only this function (and dependencies)
  -d DECOMPILATIONS, --decompilations DECOMPILATIONS
  -g CALL_GRAPH, --call-graph CALL_GRAPH
  -o OUTPUT, --output OUTPUT
                        Output file (default: progdir/summaries.jsonl)
  -v, --verbose         Enable verbose logging
  -n, --dry-run         Don't actually call OpenAI, just estimate usage
  -l MAX_LINES, --max-lines MAX_LINES
                        Maximum number of lines to summarize at a time

Important: The GPT-3 API is not free! The model we're using, text-davinci-003, costs $0.02 per 1000 tokens, which can add up for a large program. You can use the --dry-run (-n) flag to estimate the cost of running GPT-WPRE on a program without actually running it:

$ python recursive_summarize.py -n libpng16.so.16.38.0_stripped
===== API usage estimates =====
Number of functions: 466
Estimated API calls: 711
Estimated prompt tokens: 784477
Estimated generated tokens: 45601
Estimated cost: $16.60

$16.60 is way too much for my academic budget, so let's try to summarize only a single function:

$ python recursive_summarize.py -n -f png_read_info libpng16.so.16.38.0_stripped
===== API usage estimates =====
Number of functions: 72
Estimated API calls: 76
Estimated prompt tokens: 74311
Estimated generated tokens: 2799
Estimated cost: $1.54

Much better. Now to run it:

$ python recursive_summarize.py -f png_read_info libpng16.so.16.38.0_stripped
Summarizing functions: 100%|██████████████████████████| 72/72 [02:30<00:00,  2.09s/it]
Wrote 72 summaries to summaries_png_read_info.jsonl.
Final summary for png_read_info:
This function checks the validity of a pointer, calls the related function, computes a CRC32 checksum, and calls the png_error/png_chunk_warning functions with an error message or warning, as well as setting various parameters for a PNG file.

Output

The output is a JSON Lines file (.jsonl) with one JSON object per function. Each object has one key (the function name) whose value is the summary. For example:

{"FUN_0011b110": "The code checks a given byte value, allocates memory, reads data from a given pointer, checks two arrays for consistency, modifies certain bits from a given parameter, calculates a CRC32 checksum, and calls the png_chunk_benign_error() function with a message based on the zlib return code."}
{"png_read_info": "This function checks parameters for validity, reads data, calculates a CRC32 checksum, allocates memory, and sets values for various parameters."}

Samples

Sample output for libpng is available in the samples/libpng16.so.16.38.0_stripped directory.

Exploring/Debugging the Summaries

I went a little overboard making this script so I'm including it in the repo even though it would need a bunch more work to be useful for anything other than libpng: extras/debug_summaries.py. For each function in a summaries.jsonl it shows the summary, the original function name (using the debug symbols), an a side-by-side view of the original source code and the decompiled code.

If you want to run it yourself, you need to first get the libpng source, then run the script:

$ cd samples/srcs
$ bash clone.sh
[...]
$ cd ../..
$ python extras/debug_summaries.py \
    samples/bins/libpng16.so.16.38.0 \
    samples/libpng16.so.16.38.0_stripped/summaries_png_read_info.jsonl \
    samples/libpng16.so.16.38.0_stripped/decompilations.json

Alternatively you can just look at the sample output here: https://moyix.net/~moyix/libpng_png_set_info_summaries.html

How It Works

GPT-WPRE starts by performing a topological sort on the call graph to get a list of functions in the order they should be summarized. It then iteratively summarizes each function, using the summaries of its callees as context. For example, if we have the following call graph and we want to summarize function A:

A -> B -------> C
      `--> D --^

Then we would start by summarizing C, then D, then B, and finally A. The sequence of prompts looks like:

Describe what this function does in a single sentence:
```
int C(double param1) {
    // ...
}
```

Giving perhaps a summary like "This function frobnicates the parameter and returns the result." Then we would prompt:

Given the following summaries:
C: This function frobnicates the parameter and returns the result.
Describe what this function does in a single sentence:
```
void D(int param1) {
    // Some code that calls C
}
```

And so on until we have a summary for A.

Summarizing Big Functions

Unfortunately, sometimes even a single function can be too big to summarize in a single prompt when context is included. In this case we need to split up the individual function, summarize its parts, and recombine those summaries.

The right way to do this might be do the same topological sort strategy as before but on the function's control flow graph rather than the program's call graph. But a) topological sort isn't defined for graphs with cycles, and whereas mutual recursion is rare at the call graph level, cycles in a control flow graph (aka "loops") are very common; and more prosaically, b) I don't know if Ghidra's decompiler exposes a source-level CFG.

Instead, when we encounter a function that's too big, we split it into sequential chunks of up to 100 lines, summarize each chunk as a paragraph, and then recombine the summaries. If 100 lines is too many, we try 80, 60, ..., down to 20 lines. If this is still too big, we instead summarize the chunks as single sentences.

This all happens in recursive_summarize.py's summarize_long_code function.

Limitations and Future Work

  • We don't try to deal with mutual recursion or other non-simple cycles in the call graph, and the topological sort will just throw an exception if it finds a cycle.
  • These prompts are the first ones that occurred to me, and probably some prompt engineering would improve the summaries!
  • Lots of ways this could be faster, e.g. by batching together requests to summarize things that don't depend on each other. Also pretty sure Ghidra has faster/better ways to get the call graph and decompilation.

More Repositories

1

fauxpilot

FauxPilot - an open-source GitHub Copilot server
Python
4,588
star
2

pdbparse

Python code to parse Microsoft PDB files
Python
306
star
3

creddump

Automatically exported from code.google.com/p/creddump
Python
232
star
4

panda

Deprecated repo for PANDA 1.0 – see PANDA 2.0 repository
C
102
star
5

fpsmt_gpu

Solving floating point SMT constraints on a GPU
C++
47
star
6

panda-malrec

A system to record malware using PANDA
Python
42
star
7

wdepy

Decryption utility for PGP Whole Disk Encryption
Python
17
star
8

elmfuzz

Evolving fuzzers with large language models
Python
11
star
9

csaw23_nervcenter

Pwn+Crypto challenge for CSAW 2023 Finals
C
7
star
10

func_asm_pairgen

Horrifying scripts / infrastructure to extract info from a large amount of C/C++ code
Python
7
star
11

2_ffast_2_furious

A more realistic demo of a buffer overflow cause by -ffast-math
C
7
star
12

irq_fuzzer

C
6
star
13

AsleepKeyboardDataset

Python
6
star
14

mmgrep

Fast search for binary strings
C
6
star
15

hilbert_kcov

Objective-C
5
star
16

fbtools

Some python tools to hack on Fitbits
Python
5
star
17

virtuoso

Automatically exported from code.google.com/p/virtuoso
C
5
star
18

polycoder_wrap

Wrapper to do text generation with VHellendoorn's PolyCoder model
Python
5
star
19

codex_cli

Script to hook OpenAI's Codex up to a Linux VM and try to execute commands
Python
4
star
20

panda_plugins_moyix

Repository for plugins that are useful to me but not generally applicable
C++
4
star
21

pandalog_taint_parser

A fast, parallel parser for PANDA taint logs
C++
4
star
22

ffdemo

Flush+Flush attack demo
C++
3
star
23

vidcolortree

Python
3
star
24

synthehol

A clone of Nick Fitzgerald's minisynth-rs
Rust
2
star
25

scripts

C++
2
star
26

debbuild

Tools and scripts for rebuilding all of Debian with bear (I should have used rebuilderd :p)
Python
2
star
27

AppSecAssignment1

C
1
star
28

ptrml

Using DNNs for dumb tasks: recognizing pointers
Python
1
star
29

appsec_hw1

C
1
star
30

codeql_weird_minimal

Minimal example of weird CodeQL behavior
C
1
star
31

feckless-woof

C
1
star
32

bwlightning

1
star
33

cardinal

Cardinal Pill Testing on Linux
Assembly
1
star
34

appsec_hw2

HTML
1
star
35

ipptests

Small tests to benchmark inter vs intra process communication.
C++
1
star
36

heapmap

C
1
star